[PATCH v3 0/3] KVM: x86: Include host suspended time in steal time.

kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v3 0/3] KVM: x86: Include host suspended time in steal time.
@ 2025-01-07  4:21 Suleiman Souhlal
  2025-01-07  4:21 ` [PATCH v3 1/3] kvm: Introduce kvm_total_suspend_ns() Suleiman Souhlal
                   ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Suleiman Souhlal @ 2025-01-07  4:21 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Chao Gao, David Woodhouse, kvm, linux-kernel,
	ssouhlal, Suleiman Souhlal

This series makes it so that the time that the host is suspended is
included in guests' steal time.

When the host resumes from a suspend, the guest thinks any task
that was running during the suspend ran for a long time, even though
the effective run time was much shorter, which can end up having
negative effects with scheduling. This can be particularly noticeable
if the guest task was RT, as it can end up getting throttled for a
long time.

To mitigate this issue, we include the time that the host was
suspended in steal time, which lets the guest can subtract the
duration from the tasks' runtime.

v3:
- Use PM notifier instead of syscore ops (kvm_suspend()/kvm_resume()),
  because the latter doesn't get called on shallow suspend.
- Don't call function under UACCESS.
- Whitespace.

v2: https://lore.kernel.org/lkml/20241118043745.1857272-1-suleiman@google.com/
- Accumulate suspend time at machine-independent kvm layer and track per-VCPU
  instead of per-VM.
- Document changes.

v1: https://lore.kernel.org/kvm/20240710074410.770409-1-suleiman@google.com/

Suleiman Souhlal (3):
  kvm: Introduce kvm_total_suspend_ns().
  KVM: x86: Include host suspended time in steal time.
  KVM: x86: Document host suspend being included in steal time.

 Documentation/virt/kvm/x86/msr.rst |  6 ++++--
 arch/x86/include/asm/kvm_host.h    |  1 +
 arch/x86/kvm/x86.c                 | 11 ++++++++++-
 include/linux/kvm_host.h           |  2 ++
 virt/kvm/kvm_main.c                | 26 ++++++++++++++++++++++++++
 5 files changed, 43 insertions(+), 3 deletions(-)

-- 
2.47.1.613.gc27f4b7a9f-goog

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v3 1/3] kvm: Introduce kvm_total_suspend_ns().
  2025-01-07  4:21 [PATCH v3 0/3] KVM: x86: Include host suspended time in steal time Suleiman Souhlal
@ 2025-01-07  4:21 ` Suleiman Souhlal
  2025-01-07 15:27   ` Sean Christopherson
                     ` (3 more replies)
  2025-01-07  4:22 ` [PATCH v3 2/3] KVM: x86: Include host suspended time in steal time Suleiman Souhlal
  2025-01-07  4:22 ` [PATCH v3 3/3] KVM: x86: Document host suspend being included " Suleiman Souhlal
  2 siblings, 4 replies; 21+ messages in thread
From: Suleiman Souhlal @ 2025-01-07  4:21 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Chao Gao, David Woodhouse, kvm, linux-kernel,
	ssouhlal, Suleiman Souhlal

It returns the cumulative nanoseconds that the host has been suspended.
It is intended to be used for reporting host suspend time to the guest.

Change-Id: I8f644c9fbdb2c48d2c99dd9efaa5c85a83a14c2a
Signed-off-by: Suleiman Souhlal <suleiman@google.com>
---
 include/linux/kvm_host.h |  2 ++
 virt/kvm/kvm_main.c      | 26 ++++++++++++++++++++++++++
 2 files changed, 28 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 401439bb21e3e6..cf926168b30820 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2553,4 +2553,6 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
 				    struct kvm_pre_fault_memory *range);
 #endif
 
+u64 kvm_total_suspend_ns(void);
+
 #endif
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index de2c11dae23163..d5ae237df76d0d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -889,13 +889,39 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
 
 #endif /* CONFIG_KVM_GENERIC_MMU_NOTIFIER */
 
+static u64 last_suspend;
+static u64 total_suspend_ns;
+
+u64 kvm_total_suspend_ns(void)
+{
+	return total_suspend_ns;
+}
+
+
 #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
+static int kvm_pm_notifier(struct kvm *kvm, unsigned long state)
+{
+	switch (state) {
+	case PM_HIBERNATION_PREPARE:
+	case PM_SUSPEND_PREPARE:
+		last_suspend = ktime_get_boottime_ns();
+	case PM_POST_HIBERNATION:
+	case PM_POST_SUSPEND:
+		total_suspend_ns += ktime_get_boottime_ns() - last_suspend;
+	}
+
+	return NOTIFY_DONE;
+}
+
 static int kvm_pm_notifier_call(struct notifier_block *bl,
 				unsigned long state,
 				void *unused)
 {
 	struct kvm *kvm = container_of(bl, struct kvm, pm_notifier);
 
+	if (kvm_pm_notifier(kvm, state) != NOTIFY_DONE)
+		return NOTIFY_BAD;
+
 	return kvm_arch_pm_notifier(kvm, state);
 }
 
-- 
2.47.1.613.gc27f4b7a9f-goog


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 2/3] KVM: x86: Include host suspended time in steal time.
  2025-01-07  4:21 [PATCH v3 0/3] KVM: x86: Include host suspended time in steal time Suleiman Souhlal
  2025-01-07  4:21 ` [PATCH v3 1/3] kvm: Introduce kvm_total_suspend_ns() Suleiman Souhlal
@ 2025-01-07  4:22 ` Suleiman Souhlal
  2025-01-07 15:37   ` Sean Christopherson
  2025-01-07  4:22 ` [PATCH v3 3/3] KVM: x86: Document host suspend being included " Suleiman Souhlal
  2 siblings, 1 reply; 21+ messages in thread
From: Suleiman Souhlal @ 2025-01-07  4:22 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Chao Gao, David Woodhouse, kvm, linux-kernel,
	ssouhlal, Suleiman Souhlal

When the host resumes from a suspend, the guest thinks any task
that was running during the suspend ran for a long time, even though
the effective run time was much shorter, which can end up having
negative effects with scheduling. This can be particularly noticeable
if the guest task was RT, as it can end up getting throttled for a
long time.

To mitigate this issue, we include the time that the host was
suspended in steal time, which lets the guest can subtract the
duration from the tasks' runtime.

Note that the case of a suspend happening during a VM migration
might not be accounted.

Change-Id: I18d1d17d4d0d6f4c89b312e427036e052c47e1fa
Signed-off-by: Suleiman Souhlal <suleiman@google.com>
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/x86.c              | 11 ++++++++++-
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index e159e44a6a1b61..01d44d527a7f88 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -897,6 +897,7 @@ struct kvm_vcpu_arch {
 		u8 preempted;
 		u64 msr_val;
 		u64 last_steal;
+		u64 last_suspend_ns;
 		struct gfn_to_hva_cache cache;
 	} st;
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c8160baf383851..12439edc36f83a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3650,7 +3650,7 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
 	struct kvm_steal_time __user *st;
 	struct kvm_memslots *slots;
 	gpa_t gpa = vcpu->arch.st.msr_val & KVM_STEAL_VALID_BITS;
-	u64 steal;
+	u64 steal, suspend_ns;
 	u32 version;
 
 	if (kvm_xen_msr_enabled(vcpu->kvm)) {
@@ -3677,6 +3677,7 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
 			return;
 	}
 
+	suspend_ns = kvm_total_suspend_ns();
 	st = (struct kvm_steal_time __user *)ghc->hva;
 	/*
 	 * Doing a TLB flush here, on the guest's behalf, can avoid
@@ -3731,6 +3732,13 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
 	steal += current->sched_info.run_delay -
 		vcpu->arch.st.last_steal;
 	vcpu->arch.st.last_steal = current->sched_info.run_delay;
+	/*
+	 * Include the time that the host was suspended in steal time.
+	 * Note that the case of a suspend happening during a VM migration
+	 * might not be accounted.
+	 */
+	steal += suspend_ns - vcpu->arch.st.last_suspend_ns;
+	vcpu->arch.st.last_suspend_ns = suspend_ns;
 	unsafe_put_user(steal, &st->steal, out);
 
 	version += 1;
@@ -12299,6 +12307,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
 	if (r)
 		goto free_guest_fpu;
 
+	vcpu->arch.st.last_suspend_ns = kvm_total_suspend_ns();
 	kvm_xen_init_vcpu(vcpu);
 	vcpu_load(vcpu);
 	kvm_set_tsc_khz(vcpu, vcpu->kvm->arch.default_tsc_khz);
-- 
2.47.1.613.gc27f4b7a9f-goog


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 3/3] KVM: x86: Document host suspend being included in steal time.
  2025-01-07  4:21 [PATCH v3 0/3] KVM: x86: Include host suspended time in steal time Suleiman Souhlal
  2025-01-07  4:21 ` [PATCH v3 1/3] kvm: Introduce kvm_total_suspend_ns() Suleiman Souhlal
  2025-01-07  4:22 ` [PATCH v3 2/3] KVM: x86: Include host suspended time in steal time Suleiman Souhlal
@ 2025-01-07  4:22 ` Suleiman Souhlal
  2025-01-07 15:37   ` Sean Christopherson
  2 siblings, 1 reply; 21+ messages in thread
From: Suleiman Souhlal @ 2025-01-07  4:22 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Chao Gao, David Woodhouse, kvm, linux-kernel,
	ssouhlal, Suleiman Souhlal

Steal time now includes the time that the host was suspended.

Change-Id: Ie1236bc787e0d3bc9aff0d35e6b82b7e5cc8fd91
Signed-off-by: Suleiman Souhlal <suleiman@google.com>
---
 Documentation/virt/kvm/x86/msr.rst | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/Documentation/virt/kvm/x86/msr.rst b/Documentation/virt/kvm/x86/msr.rst
index 3aecf2a70e7b43..81c17c2200ca2f 100644
--- a/Documentation/virt/kvm/x86/msr.rst
+++ b/Documentation/virt/kvm/x86/msr.rst
@@ -294,8 +294,10 @@ data:
 
 	steal:
 		the amount of time in which this vCPU did not run, in
-		nanoseconds. Time during which the vcpu is idle, will not be
-		reported as steal time.
+		nanoseconds. This includes the time during which the host is
+		suspended. However, the case where the host suspends during a
+		VM migration might not be correctly accounted. Time during
+		which the vcpu is idle, will not be reported as steal time.
 
 	preempted:
 		indicate the vCPU who owns this struct is running or
-- 
2.47.1.613.gc27f4b7a9f-goog


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 1/3] kvm: Introduce kvm_total_suspend_ns().
  2025-01-07  4:21 ` [PATCH v3 1/3] kvm: Introduce kvm_total_suspend_ns() Suleiman Souhlal
@ 2025-01-07 15:27   ` Sean Christopherson
  2025-01-07 16:43     ` Suleiman Souhlal
  2025-01-08  7:15   ` kernel test robot
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 21+ messages in thread
From: Sean Christopherson @ 2025-01-07 15:27 UTC (permalink / raw)
  To: Suleiman Souhlal
  Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Chao Gao, David Woodhouse, kvm,
	linux-kernel, ssouhlal

KVM: for the scope.

On Tue, Jan 07, 2025, Suleiman Souhlal wrote:
> It returns the cumulative nanoseconds that the host has been suspended.

Please write changelogs that are standalone.  "It returns ..." is completely
nonsensical without the context of the shortlog.

> It is intended to be used for reporting host suspend time to the guest.
> 
> Change-Id: I8f644c9fbdb2c48d2c99dd9efaa5c85a83a14c2a

Drop gerrit's metadata before posting.

> Signed-off-by: Suleiman Souhlal <suleiman@google.com>
> ---
>  include/linux/kvm_host.h |  2 ++
>  virt/kvm/kvm_main.c      | 26 ++++++++++++++++++++++++++
>  2 files changed, 28 insertions(+)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 401439bb21e3e6..cf926168b30820 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2553,4 +2553,6 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
>  				    struct kvm_pre_fault_memory *range);
>  #endif
>  
> +u64 kvm_total_suspend_ns(void);
> +
>  #endif
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index de2c11dae23163..d5ae237df76d0d 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -889,13 +889,39 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
>  
>  #endif /* CONFIG_KVM_GENERIC_MMU_NOTIFIER */
>  
> +static u64 last_suspend;
> +static u64 total_suspend_ns;
> +
> +u64 kvm_total_suspend_ns(void)
> +{
> +	return total_suspend_ns;
> +}
> +
> +

Extra whitespace.

>  #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
> +static int kvm_pm_notifier(struct kvm *kvm, unsigned long state)
> +{
> +	switch (state) {
> +	case PM_HIBERNATION_PREPARE:
> +	case PM_SUSPEND_PREPARE:
> +		last_suspend = ktime_get_boottime_ns();
> +	case PM_POST_HIBERNATION:
> +	case PM_POST_SUSPEND:
> +		total_suspend_ns += ktime_get_boottime_ns() - last_suspend;

This is broken.  As should be quite clear from the function parameters, this is
a per-VM notifier.  While clobbering "last_suspend" is relatively benign,
accumulating into "total_suspend_ns" for every VM will cause the "total" suspend
time to be wildly inaccurate if there are multiple VMs.

> +	}
> +
> +	return NOTIFY_DONE;
> +}
> +
>  static int kvm_pm_notifier_call(struct notifier_block *bl,
>  				unsigned long state,
>  				void *unused)
>  {
>  	struct kvm *kvm = container_of(bl, struct kvm, pm_notifier);
>  
> +	if (kvm_pm_notifier(kvm, state) != NOTIFY_DONE)
> +		return NOTIFY_BAD;

This is ridiculous on multiple fronts.  There's zero reason for kvm_pm_notifier()
to return a value, it always succeeds.  I also see zero reason to put this code
in common KVM.  It's 100% an x86-only concept, and x86 is the only architecture
that implements kvm_arch_pm_notifier().  So just shove the logic into x86's
implementation.

> +
>  	return kvm_arch_pm_notifier(kvm, state);
>  }
>  
> -- 
> 2.47.1.613.gc27f4b7a9f-goog
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 2/3] KVM: x86: Include host suspended time in steal time.
  2025-01-07  4:22 ` [PATCH v3 2/3] KVM: x86: Include host suspended time in steal time Suleiman Souhlal
@ 2025-01-07 15:37   ` Sean Christopherson
  2025-01-08  4:05     ` Suleiman Souhlal
  0 siblings, 1 reply; 21+ messages in thread
From: Sean Christopherson @ 2025-01-07 15:37 UTC (permalink / raw)
  To: Suleiman Souhlal
  Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Chao Gao, David Woodhouse, kvm,
	linux-kernel, ssouhlal

On Tue, Jan 07, 2025, Suleiman Souhlal wrote:
> When the host resumes from a suspend, the guest thinks any task
> that was running during the suspend ran for a long time, even though
> the effective run time was much shorter, which can end up having
> negative effects with scheduling. This can be particularly noticeable
> if the guest task was RT, as it can end up getting throttled for a
> long time.
> 
> To mitigate this issue, we include the time that the host was

No "we".

> suspended in steal time, which lets the guest can subtract the
> duration from the tasks' runtime.
> 
> Note that the case of a suspend happening during a VM migration
> might not be accounted.

And this isn't considered a bug because?  I asked for documentation, not a
statement of fact.

> Change-Id: I18d1d17d4d0d6f4c89b312e427036e052c47e1fa

gerrit.

> Signed-off-by: Suleiman Souhlal <suleiman@google.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  1 +
>  arch/x86/kvm/x86.c              | 11 ++++++++++-
>  2 files changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index e159e44a6a1b61..01d44d527a7f88 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -897,6 +897,7 @@ struct kvm_vcpu_arch {
>  		u8 preempted;
>  		u64 msr_val;
>  		u64 last_steal;
> +		u64 last_suspend_ns;
>  		struct gfn_to_hva_cache cache;
>  	} st;
>  
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index c8160baf383851..12439edc36f83a 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3650,7 +3650,7 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
>  	struct kvm_steal_time __user *st;
>  	struct kvm_memslots *slots;
>  	gpa_t gpa = vcpu->arch.st.msr_val & KVM_STEAL_VALID_BITS;
> -	u64 steal;
> +	u64 steal, suspend_ns;
>  	u32 version;
>  
>  	if (kvm_xen_msr_enabled(vcpu->kvm)) {
> @@ -3677,6 +3677,7 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
>  			return;
>  	}
>  
> +	suspend_ns = kvm_total_suspend_ns();
>  	st = (struct kvm_steal_time __user *)ghc->hva;
>  	/*
>  	 * Doing a TLB flush here, on the guest's behalf, can avoid
> @@ -3731,6 +3732,13 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
>  	steal += current->sched_info.run_delay -
>  		vcpu->arch.st.last_steal;
>  	vcpu->arch.st.last_steal = current->sched_info.run_delay;
> +	/*
> +	 * Include the time that the host was suspended in steal time.
> +	 * Note that the case of a suspend happening during a VM migration
> +	 * might not be accounted.
> +	 */

This is not a useful comment.  It's quite clear what that suspend time is being
accumulated into steal_time, and restating the migration caveat does more harm
than good, as that flaw is an issue with the overall design, i.e. has nothing to
do with this specific snippet of code.

> +	steal += suspend_ns - vcpu->arch.st.last_suspend_ns;
> +	vcpu->arch.st.last_suspend_ns = suspend_ns;
>  	unsafe_put_user(steal, &st->steal, out);
>  
>  	version += 1;
> @@ -12299,6 +12307,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
>  	if (r)
>  		goto free_guest_fpu;
>  
> +	vcpu->arch.st.last_suspend_ns = kvm_total_suspend_ns();
>  	kvm_xen_init_vcpu(vcpu);
>  	vcpu_load(vcpu);
>  	kvm_set_tsc_khz(vcpu, vcpu->kvm->arch.default_tsc_khz);
> -- 
> 2.47.1.613.gc27f4b7a9f-goog
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 3/3] KVM: x86: Document host suspend being included in steal time.
  2025-01-07  4:22 ` [PATCH v3 3/3] KVM: x86: Document host suspend being included " Suleiman Souhlal
@ 2025-01-07 15:37   ` Sean Christopherson
  0 siblings, 0 replies; 21+ messages in thread
From: Sean Christopherson @ 2025-01-07 15:37 UTC (permalink / raw)
  To: Suleiman Souhlal
  Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Chao Gao, David Woodhouse, kvm,
	linux-kernel, ssouhlal

On Tue, Jan 07, 2025, Suleiman Souhlal wrote:
> Steal time now includes the time that the host was suspended.
> 
> Change-Id: Ie1236bc787e0d3bc9aff0d35e6b82b7e5cc8fd91

gerrit.

> Signed-off-by: Suleiman Souhlal <suleiman@google.com>
> ---
>  Documentation/virt/kvm/x86/msr.rst | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/virt/kvm/x86/msr.rst b/Documentation/virt/kvm/x86/msr.rst
> index 3aecf2a70e7b43..81c17c2200ca2f 100644
> --- a/Documentation/virt/kvm/x86/msr.rst
> +++ b/Documentation/virt/kvm/x86/msr.rst
> @@ -294,8 +294,10 @@ data:
>  
>  	steal:
>  		the amount of time in which this vCPU did not run, in
> -		nanoseconds. Time during which the vcpu is idle, will not be
> -		reported as steal time.
> +		nanoseconds. This includes the time during which the host is
> +		suspended. However, the case where the host suspends during a
> +		VM migration might not be correctly accounted. Time during
> +		which the vcpu is idle, will not be reported as steal time.

This belongs in the previous patch.  All in all, IMO this can all be one single
patch, as I don't see any reason to put crud into common KVM.

>  
>  	preempted:
>  		indicate the vCPU who owns this struct is running or
> -- 
> 2.47.1.613.gc27f4b7a9f-goog
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 1/3] kvm: Introduce kvm_total_suspend_ns().
  2025-01-07 15:27   ` Sean Christopherson
@ 2025-01-07 16:43     ` Suleiman Souhlal
  0 siblings, 0 replies; 21+ messages in thread
From: Suleiman Souhlal @ 2025-01-07 16:43 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Chao Gao, David Woodhouse, kvm,
	linux-kernel, ssouhlal

On Wed, Jan 8, 2025 at 12:27 AM Sean Christopherson <seanjc@google.com> wrote:
>
> KVM: for the scope.
>
> On Tue, Jan 07, 2025, Suleiman Souhlal wrote:
> > It returns the cumulative nanoseconds that the host has been suspended.
>
> Please write changelogs that are standalone.  "It returns ..." is completely
> nonsensical without the context of the shortlog.
>
> > It is intended to be used for reporting host suspend time to the guest.
> >
> > Change-Id: I8f644c9fbdb2c48d2c99dd9efaa5c85a83a14c2a
>
> Drop gerrit's metadata before posting.
>
> > Signed-off-by: Suleiman Souhlal <suleiman@google.com>
> > ---
> >  include/linux/kvm_host.h |  2 ++
> >  virt/kvm/kvm_main.c      | 26 ++++++++++++++++++++++++++
> >  2 files changed, 28 insertions(+)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 401439bb21e3e6..cf926168b30820 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -2553,4 +2553,6 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> >                                   struct kvm_pre_fault_memory *range);
> >  #endif
> >
> > +u64 kvm_total_suspend_ns(void);
> > +
> >  #endif
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index de2c11dae23163..d5ae237df76d0d 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -889,13 +889,39 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
> >
> >  #endif /* CONFIG_KVM_GENERIC_MMU_NOTIFIER */
> >
> > +static u64 last_suspend;
> > +static u64 total_suspend_ns;
> > +
> > +u64 kvm_total_suspend_ns(void)
> > +{
> > +     return total_suspend_ns;
> > +}
> > +
> > +
>
> Extra whitespace.
>
> >  #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
> > +static int kvm_pm_notifier(struct kvm *kvm, unsigned long state)
> > +{
> > +     switch (state) {
> > +     case PM_HIBERNATION_PREPARE:
> > +     case PM_SUSPEND_PREPARE:
> > +             last_suspend = ktime_get_boottime_ns();
> > +     case PM_POST_HIBERNATION:
> > +     case PM_POST_SUSPEND:
> > +             total_suspend_ns += ktime_get_boottime_ns() - last_suspend;
>
> This is broken.  As should be quite clear from the function parameters, this is
> a per-VM notifier.  While clobbering "last_suspend" is relatively benign,
> accumulating into "total_suspend_ns" for every VM will cause the "total" suspend
> time to be wildly inaccurate if there are multiple VMs.

Good catch. Thanks for spotting that.

>
> > +     }
> > +
> > +     return NOTIFY_DONE;
> > +}
> > +
> >  static int kvm_pm_notifier_call(struct notifier_block *bl,
> >                               unsigned long state,
> >                               void *unused)
> >  {
> >       struct kvm *kvm = container_of(bl, struct kvm, pm_notifier);
> >
> > +     if (kvm_pm_notifier(kvm, state) != NOTIFY_DONE)
> > +             return NOTIFY_BAD;
>
> This is ridiculous on multiple fronts.  There's zero reason for kvm_pm_notifier()
> to return a value, it always succeeds.  I also see zero reason to put this code
> in common KVM.  It's 100% an x86-only concept, and x86 is the only architecture
> that implements kvm_arch_pm_notifier().  So just shove the logic into x86's
> implementation.

The feedback I thought I got in v1 was that this could be useful for
other architectures too.
But given that the current series only implements it for x86, I guess
it's fair for it to be out of common KVM for now.
I will move it back into x86's implementation.

Thanks for the very quick reviews.
-- Suleiman

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 2/3] KVM: x86: Include host suspended time in steal time.
  2025-01-07 15:37   ` Sean Christopherson
@ 2025-01-08  4:05     ` Suleiman Souhlal
  2025-01-08 15:17       ` Sean Christopherson
  0 siblings, 1 reply; 21+ messages in thread
From: Suleiman Souhlal @ 2025-01-08  4:05 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Chao Gao, David Woodhouse, kvm,
	linux-kernel, ssouhlal

On Wed, Jan 8, 2025 at 12:37 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Jan 07, 2025, Suleiman Souhlal wrote:
> > When the host resumes from a suspend, the guest thinks any task
> > that was running during the suspend ran for a long time, even though
> > the effective run time was much shorter, which can end up having
> > negative effects with scheduling. This can be particularly noticeable
> > if the guest task was RT, as it can end up getting throttled for a
> > long time.
> >
> > To mitigate this issue, we include the time that the host was
>
> No "we".
>
> > suspended in steal time, which lets the guest can subtract the
> > duration from the tasks' runtime.
> >
> > Note that the case of a suspend happening during a VM migration
> > might not be accounted.
>
> And this isn't considered a bug because?  I asked for documentation, not a
> statement of fact.

I guess I don't really understand what the difference between
documentation and statements of fact is.

It's not completely clear to me what the desired behavior would be
when suspending during a VM migration.
If we wanted to inject the suspend duration that happened after the
migration started, but before it ended, I suppose we would need to add
a way for the new VM instance to add to steal time, possibly through a
new uAPI.

It is also not clear to me why we would want that.

>
> > Change-Id: I18d1d17d4d0d6f4c89b312e427036e052c47e1fa
>
> gerrit.
>
> > Signed-off-by: Suleiman Souhlal <suleiman@google.com>
> > ---
> >  arch/x86/include/asm/kvm_host.h |  1 +
> >  arch/x86/kvm/x86.c              | 11 ++++++++++-
> >  2 files changed, 11 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index e159e44a6a1b61..01d44d527a7f88 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -897,6 +897,7 @@ struct kvm_vcpu_arch {
> >               u8 preempted;
> >               u64 msr_val;
> >               u64 last_steal;
> > +             u64 last_suspend_ns;
> >               struct gfn_to_hva_cache cache;
> >       } st;
> >
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index c8160baf383851..12439edc36f83a 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -3650,7 +3650,7 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
> >       struct kvm_steal_time __user *st;
> >       struct kvm_memslots *slots;
> >       gpa_t gpa = vcpu->arch.st.msr_val & KVM_STEAL_VALID_BITS;
> > -     u64 steal;
> > +     u64 steal, suspend_ns;
> >       u32 version;
> >
> >       if (kvm_xen_msr_enabled(vcpu->kvm)) {
> > @@ -3677,6 +3677,7 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
> >                       return;
> >       }
> >
> > +     suspend_ns = kvm_total_suspend_ns();
> >       st = (struct kvm_steal_time __user *)ghc->hva;
> >       /*
> >        * Doing a TLB flush here, on the guest's behalf, can avoid
> > @@ -3731,6 +3732,13 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
> >       steal += current->sched_info.run_delay -
> >               vcpu->arch.st.last_steal;
> >       vcpu->arch.st.last_steal = current->sched_info.run_delay;
> > +     /*
> > +      * Include the time that the host was suspended in steal time.
> > +      * Note that the case of a suspend happening during a VM migration
> > +      * might not be accounted.
> > +      */
>
> This is not a useful comment.  It's quite clear what that suspend time is being
> accumulated into steal_time, and restating the migration caveat does more harm
> than good, as that flaw is an issue with the overall design, i.e. has nothing to
> do with this specific snippet of code.

I will remove it.

Thanks,
-- Suleiman

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 1/3] kvm: Introduce kvm_total_suspend_ns().
  2025-01-07  4:21 ` [PATCH v3 1/3] kvm: Introduce kvm_total_suspend_ns() Suleiman Souhlal
  2025-01-07 15:27   ` Sean Christopherson
@ 2025-01-08  7:15   ` kernel test robot
  2025-01-08  8:36   ` kernel test robot
  2025-01-15 21:49   ` Sean Christopherson
  3 siblings, 0 replies; 21+ messages in thread
From: kernel test robot @ 2025-01-08  7:15 UTC (permalink / raw)
  To: Suleiman Souhlal, Paolo Bonzini, Sean Christopherson
  Cc: oe-kbuild-all, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Chao Gao, David Woodhouse, kvm,
	linux-kernel, ssouhlal, Suleiman Souhlal

Hi Suleiman,

kernel test robot noticed the following build warnings:

[auto build test WARNING on kvm/queue]
[also build test WARNING on kvm/next linus/master v6.13-rc6 next-20250107]
[cannot apply to kvm/linux-next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Suleiman-Souhlal/kvm-Introduce-kvm_total_suspend_ns/20250107-122819
base:   https://git.kernel.org/pub/scm/virt/kvm/kvm.git queue
patch link:    https://lore.kernel.org/r/20250107042202.2554063-2-suleiman%40google.com
patch subject: [PATCH v3 1/3] kvm: Introduce kvm_total_suspend_ns().
config: s390-randconfig-001-20250108 (https://download.01.org/0day-ci/archive/20250108/202501081504.AE4wDCL9-lkp@intel.com/config)
compiler: s390-linux-gcc (GCC) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250108/202501081504.AE4wDCL9-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202501081504.AE4wDCL9-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> arch/s390/kvm/../../../virt/kvm/kvm_main.c:897:12: warning: 'last_suspend' defined but not used [-Wunused-variable]
     897 | static u64 last_suspend;
         |            ^~~~~~~~~~~~

vim +/last_suspend +897 arch/s390/kvm/../../../virt/kvm/kvm_main.c

   896	
 > 897	static u64 last_suspend;
   898	static u64 total_suspend_ns;
   899	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 1/3] kvm: Introduce kvm_total_suspend_ns().
  2025-01-07  4:21 ` [PATCH v3 1/3] kvm: Introduce kvm_total_suspend_ns() Suleiman Souhlal
  2025-01-07 15:27   ` Sean Christopherson
  2025-01-08  7:15   ` kernel test robot
@ 2025-01-08  8:36   ` kernel test robot
  2025-01-15 21:49   ` Sean Christopherson
  3 siblings, 0 replies; 21+ messages in thread
From: kernel test robot @ 2025-01-08  8:36 UTC (permalink / raw)
  To: Suleiman Souhlal, Paolo Bonzini, Sean Christopherson
  Cc: oe-kbuild-all, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Chao Gao, David Woodhouse, kvm,
	linux-kernel, ssouhlal, Suleiman Souhlal

Hi Suleiman,

kernel test robot noticed the following build warnings:

[auto build test WARNING on kvm/queue]
[also build test WARNING on kvm/next linus/master v6.13-rc6 next-20250107]
[cannot apply to kvm/linux-next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Suleiman-Souhlal/kvm-Introduce-kvm_total_suspend_ns/20250107-122819
base:   https://git.kernel.org/pub/scm/virt/kvm/kvm.git queue
patch link:    https://lore.kernel.org/r/20250107042202.2554063-2-suleiman%40google.com
patch subject: [PATCH v3 1/3] kvm: Introduce kvm_total_suspend_ns().
config: loongarch-randconfig-002-20250108 (https://download.01.org/0day-ci/archive/20250108/202501081616.nFwj9jxx-lkp@intel.com/config)
compiler: loongarch64-linux-gcc (GCC) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250108/202501081616.nFwj9jxx-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202501081616.nFwj9jxx-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> arch/loongarch/kvm/../../../virt/kvm/kvm_main.c:897:12: warning: 'last_suspend' defined but not used [-Wunused-variable]
     897 | static u64 last_suspend;
         |            ^~~~~~~~~~~~

vim +/last_suspend +897 arch/loongarch/kvm/../../../virt/kvm/kvm_main.c

   896	
 > 897	static u64 last_suspend;
   898	static u64 total_suspend_ns;
   899	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 2/3] KVM: x86: Include host suspended time in steal time.
  2025-01-08  4:05     ` Suleiman Souhlal
@ 2025-01-08 15:17       ` Sean Christopherson
  0 siblings, 0 replies; 21+ messages in thread
From: Sean Christopherson @ 2025-01-08 15:17 UTC (permalink / raw)
  To: Suleiman Souhlal
  Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Chao Gao, David Woodhouse, kvm,
	linux-kernel, ssouhlal

On Wed, Jan 08, 2025, Suleiman Souhlal wrote:
> On Wed, Jan 8, 2025 at 12:37 AM Sean Christopherson <seanjc@google.com> wrote:
> > On Tue, Jan 07, 2025, Suleiman Souhlal wrote:
> > > Note that the case of a suspend happening during a VM migration
> > > might not be accounted.
> >
> > And this isn't considered a bug because?  I asked for documentation, not a
> > statement of fact.
> 
> I guess I don't really understand what the difference between documentation
> and statements of fact is.

It's the difference between "X may be buggy" and "X has this exact caveat because
KVM doesn't support saving/restoring associated metadata".  For this case, I'm
pretty sure it's possible to document exactly when suspended time will be lost,
what may happen if that accounted time is lost, what userspace could do to avoid
dropping suspend time, and finally for the changelog only, specifically why KVM
support for accounting suspended time is *intentionally* not providing APIs to
save/restore pending suspended time.

> It's not completely clear to me what the desired behavior would be when
> suspending during a VM migration.

That's fine, it's not your (or KVM's) responsibility to know the desired
behavior for every possible/theoretical use case.  But as above, it *is* KVM's
responsibility to properly document any caveats with functionality.

> If we wanted to inject the suspend duration that happened after the migration
> started, but before it ended, I suppose we would need to add a way for the
> new VM instance to add to steal time, possibly through a new uAPI.

It's not just migration, and it's not just the case where the host is suspended
*after* vCPU state is saved.  Pending suspended time will also be lost if vCPU
state is saved after suspend+resume without entering the guest (because the
accounting is done late in KVM_RUN).

> It is also not clear to me why we would want that.

Which is fine as well.  If some crazy use case cares about accounting suspended
time across save+restore, then the onus is on that use case to implement and
justify appropriate support.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 1/3] kvm: Introduce kvm_total_suspend_ns().
  2025-01-07  4:21 ` [PATCH v3 1/3] kvm: Introduce kvm_total_suspend_ns() Suleiman Souhlal
                     ` (2 preceding siblings ...)
  2025-01-08  8:36   ` kernel test robot
@ 2025-01-15 21:49   ` Sean Christopherson
  2025-01-17  6:35     ` Suleiman Souhlal
  3 siblings, 1 reply; 21+ messages in thread
From: Sean Christopherson @ 2025-01-15 21:49 UTC (permalink / raw)
  To: Suleiman Souhlal
  Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Chao Gao, David Woodhouse, kvm,
	linux-kernel, ssouhlal

On Tue, Jan 07, 2025, Suleiman Souhlal wrote:
> It returns the cumulative nanoseconds that the host has been suspended.
> It is intended to be used for reporting host suspend time to the guest.

...

>  #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
> +static int kvm_pm_notifier(struct kvm *kvm, unsigned long state)
> +{
> +	switch (state) {
> +	case PM_HIBERNATION_PREPARE:
> +	case PM_SUSPEND_PREPARE:
> +		last_suspend = ktime_get_boottime_ns();
> +	case PM_POST_HIBERNATION:
> +	case PM_POST_SUSPEND:
> +		total_suspend_ns += ktime_get_boottime_ns() - last_suspend;

After spending too much time poking around kvmlock and sched_clock code, I'm pretty
sure that accounting *all* suspend time to steal_time is wildly inaccurate for
most clocksources that will be used by KVM x86 guests.

KVM already adjusts TSC, and by extension kvmclock, to account for the TSC going
backwards due to suspend+resume.  I haven't dug super deep, buy I assume/hope the
majority of suspend time is handled by massaging guest TSC.

There's still a notable gap, as KVM's TSC adjustments likely won't account for
the lag between CPUs coming online and vCPU's being restarted, but I don't know
that having KVM account the suspend duration is the right way to solve that issue.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 1/3] kvm: Introduce kvm_total_suspend_ns().
  2025-01-15 21:49   ` Sean Christopherson
@ 2025-01-17  6:35     ` Suleiman Souhlal
  2025-01-17 16:52       ` Sean Christopherson
  0 siblings, 1 reply; 21+ messages in thread
From: Suleiman Souhlal @ 2025-01-17  6:35 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Chao Gao, David Woodhouse, kvm,
	linux-kernel, ssouhlal

On Thu, Jan 16, 2025 at 6:49 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Jan 07, 2025, Suleiman Souhlal wrote:
> > It returns the cumulative nanoseconds that the host has been suspended.
> > It is intended to be used for reporting host suspend time to the guest.
>
> ...
>
> >  #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
> > +static int kvm_pm_notifier(struct kvm *kvm, unsigned long state)
> > +{
> > +     switch (state) {
> > +     case PM_HIBERNATION_PREPARE:
> > +     case PM_SUSPEND_PREPARE:
> > +             last_suspend = ktime_get_boottime_ns();
> > +     case PM_POST_HIBERNATION:
> > +     case PM_POST_SUSPEND:
> > +             total_suspend_ns += ktime_get_boottime_ns() - last_suspend;
>
> After spending too much time poking around kvmlock and sched_clock code, I'm pretty
> sure that accounting *all* suspend time to steal_time is wildly inaccurate for
> most clocksources that will be used by KVM x86 guests.
>
> KVM already adjusts TSC, and by extension kvmclock, to account for the TSC going
> backwards due to suspend+resume.  I haven't dug super deep, buy I assume/hope the
> majority of suspend time is handled by massaging guest TSC.
>
> There's still a notable gap, as KVM's TSC adjustments likely won't account for
> the lag between CPUs coming online and vCPU's being restarted, but I don't know
> that having KVM account the suspend duration is the right way to solve that issue.

(It is my understanding that steal time has no impact on clock sources.)
On our machines, the problem isn't that the TSC is going backwards. As
you say, kvmclock takes care of that.

The problem these patches are trying to solve is that the time keeps
advancing for the VM while the host is suspended.
On the host, this problem does not happen because timekeeping is
stopped by timekeeping_suspend().
The problem with time advancing is that the guest scheduler thinks the
task that was running when the host suspended was running for the
whole duration, which was especially bad when we still had RT
throttling (prior to v6.12), as in the case of a RT task being
current, the scheduler would then throttle all RT tasks for a very
long time. With dlserver, the problem is much less severe, but still
exists.
There is a similar problem when the host CPU is overcommitted, where
the guest scheduler thinks the current task ran for the full duration,
even though the effective running time was much lower. This is exactly
what steal time solves, which is why I thought addressing the suspend
issue with steal time was an acceptable approach.

TSC adjustments would also be a way to address the issue, but we would
then need another mechanism to still advance the guest wall time
during the host suspension.
We have tried that approach in the past, and it was working pretty
well for us, but it did not seem popular with the rest of the
community: https://lore.kernel.org/kvm/20210806100710.2425336-1-hikalium@chromium.org/T/

There is an additional gap with both approaches, which is that time
advances when the VM is blocked because the VMM hasn't run KVM_RUN.
Ideally steal time would also include time where the VM isn't being
scheduled because of the VMM (but maybe only if the blocking is due to
something outside of the guest's control), so that the guest scheduler
doesn't punish tasks for it. But that's a completely different
discussion, and I should probably not even be mentioning that here.

I hope that helps give some background on these patches.
-- Suleiman

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 1/3] kvm: Introduce kvm_total_suspend_ns().
  2025-01-17  6:35     ` Suleiman Souhlal
@ 2025-01-17 16:52       ` Sean Christopherson
  2025-01-21  5:37         ` Suleiman Souhlal
  0 siblings, 1 reply; 21+ messages in thread
From: Sean Christopherson @ 2025-01-17 16:52 UTC (permalink / raw)
  To: Suleiman Souhlal
  Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Chao Gao, David Woodhouse, kvm,
	linux-kernel, ssouhlal

On Fri, Jan 17, 2025, Suleiman Souhlal wrote:
> On Thu, Jan 16, 2025 at 6:49 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Tue, Jan 07, 2025, Suleiman Souhlal wrote:
> > > It returns the cumulative nanoseconds that the host has been suspended.
> > > It is intended to be used for reporting host suspend time to the guest.
> >
> > ...
> >
> > >  #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
> > > +static int kvm_pm_notifier(struct kvm *kvm, unsigned long state)
> > > +{
> > > +     switch (state) {
> > > +     case PM_HIBERNATION_PREPARE:
> > > +     case PM_SUSPEND_PREPARE:
> > > +             last_suspend = ktime_get_boottime_ns();
> > > +     case PM_POST_HIBERNATION:
> > > +     case PM_POST_SUSPEND:
> > > +             total_suspend_ns += ktime_get_boottime_ns() - last_suspend;
> >
> > After spending too much time poking around kvmlock and sched_clock code, I'm pretty
> > sure that accounting *all* suspend time to steal_time is wildly inaccurate for
> > most clocksources that will be used by KVM x86 guests.
> >
> > KVM already adjusts TSC, and by extension kvmclock, to account for the TSC going
> > backwards due to suspend+resume.  I haven't dug super deep, buy I assume/hope the
> > majority of suspend time is handled by massaging guest TSC.
> >
> > There's still a notable gap, as KVM's TSC adjustments likely won't account for
> > the lag between CPUs coming online and vCPU's being restarted, but I don't know
> > that having KVM account the suspend duration is the right way to solve that issue.
> 
> (It is my understanding that steal time has no impact on clock sources.)
> On our machines, the problem isn't that the TSC is going backwards. As
> you say, kvmclock takes care of that.
> 
> The problem these patches are trying to solve is that the time keeps
> advancing for the VM while the host is suspended.

Right, the issue is that because KVM adjusts guest TSC if the host TSC does go
backwards, then the accounting will be all kinds of messed up.

  1. Initiate suspend at host TSC X, guest TSC X+Y.

  2. Save X into last_host_tsc via kvm_arch_vcpu_put():

	vcpu->arch.last_host_tsc = rdtsc();

  3. Resume after N hours, host TSC reset to 0 and starts counting.

  4. kvm_arch_enable_virtualization_cpu() runs at new host time Z.

  5. KVM detects backwards TSC (Z < X) and adjusts guest TSC offset so that guest
     TSC stays at/near X+Y, i.e. guest TSC becomes "Z + Y + (X - Z)".

		u64 delta_cyc = max_tsc - local_tsc;
		list_for_each_entry(kvm, &vm_list, vm_list) {
			kvm->arch.backwards_tsc_observed = true;
			kvm_for_each_vcpu(i, vcpu, kvm) {
				vcpu->arch.tsc_offset_adjustment += delta_cyc;
				vcpu->arch.last_host_tsc = local_tsc;
				kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
			}

Thus, if the guest is using the TSC for sched_clock, guest time does NOT keep
advancing.

kvmclock on the other hand counts from *host* boot, and so guest time keeps
advancing if the guest is using kvmclock.

  #ifdef CONFIG_X86_64
  static s64 get_kvmclock_base_ns(void)
  {
	/* Count up from boot time, but with the frequency of the raw clock.  */
	return ktime_to_ns(ktime_add(ktime_get_raw(), pvclock_gtod_data.offs_boot));
  }
  #else
  static s64 get_kvmclock_base_ns(void)
  {
	/* Master clock not used, so we can just use CLOCK_BOOTTIME.  */
	return ktime_get_boottime_ns();
  }
  #endif

In short, AFAICT the issues you are observing are mostly a problem with kvmclock.
Or maybe it's the other way around and effectively freezing guest TSC is super
problematic and fundamentally flawed.

Regardless of which one is "broken", unconditionally accounting suspend time to
steal_time will do the wrong thing when sched_clock=tsc.  To further muddy the
waters, current Linux-as-a-guest on modern hardware will likely use clocksource=tsc,
but sched_clock=kvmclock.  In that scenario, guest time doesn't advanced, but
guest scheduler time does.  Ugh.

That particular wart can be avoided by having the guest use TSC for sched_clock[*],
e.g. so that at least the behavior of time is consistent.

Hmm, if freezing guest time across suspend is indeed problematic, one thought
would be to put the onus on the VMM/user to not advertise a "nonstop TSC" if the
host may be suspending.  The Linux-as-a-guest would prefer kvmclock over TSC for
both clocksource and sched_clock.

[*] https://lore.kernel.org/all/Z4gqlbumOFPF_rxd@google.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 1/3] kvm: Introduce kvm_total_suspend_ns().
  2025-01-17 16:52       ` Sean Christopherson
@ 2025-01-21  5:37         ` Suleiman Souhlal
  2025-01-21 20:22           ` Sean Christopherson
  0 siblings, 1 reply; 21+ messages in thread
From: Suleiman Souhlal @ 2025-01-21  5:37 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Chao Gao, David Woodhouse, kvm,
	linux-kernel, ssouhlal

On Sat, Jan 18, 2025 at 1:52 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Jan 17, 2025, Suleiman Souhlal wrote:
> > On Thu, Jan 16, 2025 at 6:49 AM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > On Tue, Jan 07, 2025, Suleiman Souhlal wrote:
> > > > It returns the cumulative nanoseconds that the host has been suspended.
> > > > It is intended to be used for reporting host suspend time to the guest.
> > >
> > > ...
> > >
> > > >  #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
> > > > +static int kvm_pm_notifier(struct kvm *kvm, unsigned long state)
> > > > +{
> > > > +     switch (state) {
> > > > +     case PM_HIBERNATION_PREPARE:
> > > > +     case PM_SUSPEND_PREPARE:
> > > > +             last_suspend = ktime_get_boottime_ns();
> > > > +     case PM_POST_HIBERNATION:
> > > > +     case PM_POST_SUSPEND:
> > > > +             total_suspend_ns += ktime_get_boottime_ns() - last_suspend;
> > >
> > > After spending too much time poking around kvmlock and sched_clock code, I'm pretty
> > > sure that accounting *all* suspend time to steal_time is wildly inaccurate for
> > > most clocksources that will be used by KVM x86 guests.
> > >
> > > KVM already adjusts TSC, and by extension kvmclock, to account for the TSC going
> > > backwards due to suspend+resume.  I haven't dug super deep, buy I assume/hope the
> > > majority of suspend time is handled by massaging guest TSC.
> > >
> > > There's still a notable gap, as KVM's TSC adjustments likely won't account for
> > > the lag between CPUs coming online and vCPU's being restarted, but I don't know
> > > that having KVM account the suspend duration is the right way to solve that issue.
> >
> > (It is my understanding that steal time has no impact on clock sources.)
> > On our machines, the problem isn't that the TSC is going backwards. As
> > you say, kvmclock takes care of that.
> >
> > The problem these patches are trying to solve is that the time keeps
> > advancing for the VM while the host is suspended.
>
> Right, the issue is that because KVM adjusts guest TSC if the host TSC does go
> backwards, then the accounting will be all kinds of messed up.
>
>   1. Initiate suspend at host TSC X, guest TSC X+Y.
>
>   2. Save X into last_host_tsc via kvm_arch_vcpu_put():
>
>         vcpu->arch.last_host_tsc = rdtsc();
>
>   3. Resume after N hours, host TSC reset to 0 and starts counting.
>
>   4. kvm_arch_enable_virtualization_cpu() runs at new host time Z.
>
>   5. KVM detects backwards TSC (Z < X) and adjusts guest TSC offset so that guest
>      TSC stays at/near X+Y, i.e. guest TSC becomes "Z + Y + (X - Z)".
>
>                 u64 delta_cyc = max_tsc - local_tsc;
>                 list_for_each_entry(kvm, &vm_list, vm_list) {
>                         kvm->arch.backwards_tsc_observed = true;
>                         kvm_for_each_vcpu(i, vcpu, kvm) {
>                                 vcpu->arch.tsc_offset_adjustment += delta_cyc;
>                                 vcpu->arch.last_host_tsc = local_tsc;
>                                 kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
>                         }
>
> Thus, if the guest is using the TSC for sched_clock, guest time does NOT keep
> advancing.
>
> kvmclock on the other hand counts from *host* boot, and so guest time keeps
> advancing if the guest is using kvmclock.
>
>   #ifdef CONFIG_X86_64
>   static s64 get_kvmclock_base_ns(void)
>   {
>         /* Count up from boot time, but with the frequency of the raw clock.  */
>         return ktime_to_ns(ktime_add(ktime_get_raw(), pvclock_gtod_data.offs_boot));
>   }
>   #else
>   static s64 get_kvmclock_base_ns(void)
>   {
>         /* Master clock not used, so we can just use CLOCK_BOOTTIME.  */
>         return ktime_get_boottime_ns();
>   }
>   #endif
>
> In short, AFAICT the issues you are observing are mostly a problem with kvmclock.
> Or maybe it's the other way around and effectively freezing guest TSC is super
> problematic and fundamentally flawed.
>
> Regardless of which one is "broken", unconditionally accounting suspend time to
> steal_time will do the wrong thing when sched_clock=tsc.  To further muddy the
> waters, current Linux-as-a-guest on modern hardware will likely use clocksource=tsc,
> but sched_clock=kvmclock.  In that scenario, guest time doesn't advanced, but
> guest scheduler time does.  Ugh.
>
> That particular wart can be avoided by having the guest use TSC for sched_clock[*],
> e.g. so that at least the behavior of time is consistent.
>
> Hmm, if freezing guest time across suspend is indeed problematic, one thought
> would be to put the onus on the VMM/user to not advertise a "nonstop TSC" if the
> host may be suspending.  The Linux-as-a-guest would prefer kvmclock over TSC for
> both clocksource and sched_clock.
>
> [*] https://lore.kernel.org/all/Z4gqlbumOFPF_rxd@google.com

I see what you're saying. Thanks for explaining.

To complicate things further there are also different kinds of
suspends. From what I've seen "shallow" (and/or "suspend-to-idle")
suspends don't stop the CPU, at least on our machines, so the host TSC
keeps ticking. On "deep" suspends, on the other hand, the TSC might go
backwards.

But I suppose if the guest uses kvmclock the behavior should be the
same in either case.

At least for our use case we would definitely want guest *wall* time
to keep advancing, so we would still want to use kvmclock.

Would accounting the suspend duration in steal time be acceptable if
it was conditional on the guest using kvmclock?
We would need a way for the host to be notified that the guest is
indeed using it, possibly by adding a new MSR to be written to in
kvm_cs_enable().

-- Suleiman

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 1/3] kvm: Introduce kvm_total_suspend_ns().
  2025-01-21  5:37         ` Suleiman Souhlal
@ 2025-01-21 20:22           ` Sean Christopherson
  2025-02-04  7:58             ` Suleiman Souhlal
  0 siblings, 1 reply; 21+ messages in thread
From: Sean Christopherson @ 2025-01-21 20:22 UTC (permalink / raw)
  To: Suleiman Souhlal
  Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Chao Gao, David Woodhouse, kvm,
	linux-kernel, ssouhlal

On Tue, Jan 21, 2025, Suleiman Souhlal wrote:
> On Sat, Jan 18, 2025 at 1:52 AM Sean Christopherson <seanjc@google.com> wrote:
> > In short, AFAICT the issues you are observing are mostly a problem with kvmclock.
> > Or maybe it's the other way around and effectively freezing guest TSC is super
> > problematic and fundamentally flawed.
> >
> > Regardless of which one is "broken", unconditionally accounting suspend time to
> > steal_time will do the wrong thing when sched_clock=tsc.  To further muddy the
> > waters, current Linux-as-a-guest on modern hardware will likely use clocksource=tsc,
> > but sched_clock=kvmclock.  In that scenario, guest time doesn't advanced, but
> > guest scheduler time does.  Ugh.
> >
> > That particular wart can be avoided by having the guest use TSC for sched_clock[*],
> > e.g. so that at least the behavior of time is consistent.
> >
> > Hmm, if freezing guest time across suspend is indeed problematic, one thought
> > would be to put the onus on the VMM/user to not advertise a "nonstop TSC" if the
> > host may be suspending.  The Linux-as-a-guest would prefer kvmclock over TSC for
> > both clocksource and sched_clock.
> >
> > [*] https://lore.kernel.org/all/Z4gqlbumOFPF_rxd@google.com
> 
> I see what you're saying. Thanks for explaining.
> 
> To complicate things further there are also different kinds of
> suspends. From what I've seen "shallow" (and/or "suspend-to-idle")
> suspends don't stop the CPU, at least on our machines, so the host TSC
> keeps ticking. On "deep" suspends, on the other hand, the TSC might go
> backwards.

Yeah, only S3 and lower will power down the CPU.  All bets are off if the CPU
doesn't have a nonstop TSC, but that's not at all unique to suspend, e.g. it's a
problem if the CPU goes idle, and so I think it's safe to only worry about CPUs
with nonstop TSC.

> But I suppose if the guest uses kvmclock the behavior should be the
> same in either case.
> 
> At least for our use case we would definitely want guest *wall* time
> to keep advancing, so we would still want to use kvmclock.
> 
> Would accounting the suspend duration in steal time be acceptable if
> it was conditional on the guest using kvmclock?
> We would need a way for the host to be notified that the guest is
> indeed using it,

And not just using kvmclock, but specifically using for sched_clock.  E.g. the
current behavior for most Linux guests on modern hardware is that they'll use TSC
for clocksource, but kvmclock for sched_clock and wall clock.

> possibly by adding a new MSR to be written to in
> kvm_cs_enable().

I don't think that's a good way forward.  I expect kvmclock to be largely
deprecated (guest side) in favor of raw TSC (with hardware-provided scale+offset),
at which point tying this to kvmclock puts us back at square one.

Given that s2idle and standby don't reset host TSC, I think the right way to
handle this conundrum is to address the flaw that's noted in the "backwards TSC"
logic, and adjust guest TSC to be fully up-to-date in the S3 (or lower) case.

	 * ......................................  Unfortunately, we can't
	 * bring the TSCs fully up to date with real time, as we aren't yet far
	 * enough into CPU bringup that we know how much real time has actually
	 * elapsed; our helper function, ktime_get_boottime_ns() will be using boot
	 * variables that haven't been updated yet.

I have no idea why commit 0dd6a6edb012 ("KVM: Dont mark TSC unstable due to S4
suspend") hooked kvm_arch_enable_virtualization_cpu() instead of implementing a
PM notifier, but I don't see anything that suggests it was deliberate, i.e. that
KVm *needs* to effectively snapshot guest TSC when onlining CPUs.

If that wart is fixed, then both kvmclock and TSC will account host suspend time,
and KVM can safely account the suspend time into steal time regardless of which
clock(s) the guest is using.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 1/3] kvm: Introduce kvm_total_suspend_ns().
  2025-01-21 20:22           ` Sean Christopherson
@ 2025-02-04  7:58             ` Suleiman Souhlal
  2025-02-05  5:55               ` Suleiman Souhlal
  0 siblings, 1 reply; 21+ messages in thread
From: Suleiman Souhlal @ 2025-02-04  7:58 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Chao Gao, David Woodhouse, kvm,
	linux-kernel, ssouhlal

On Wed, Jan 22, 2025 at 5:22 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Jan 21, 2025, Suleiman Souhlal wrote:
> > On Sat, Jan 18, 2025 at 1:52 AM Sean Christopherson <seanjc@google.com> wrote:
> > > In short, AFAICT the issues you are observing are mostly a problem with kvmclock.
> > > Or maybe it's the other way around and effectively freezing guest TSC is super
> > > problematic and fundamentally flawed.
> > >
> > > Regardless of which one is "broken", unconditionally accounting suspend time to
> > > steal_time will do the wrong thing when sched_clock=tsc.  To further muddy the
> > > waters, current Linux-as-a-guest on modern hardware will likely use clocksource=tsc,
> > > but sched_clock=kvmclock.  In that scenario, guest time doesn't advanced, but
> > > guest scheduler time does.  Ugh.
> > >
> > > That particular wart can be avoided by having the guest use TSC for sched_clock[*],
> > > e.g. so that at least the behavior of time is consistent.
> > >
> > > Hmm, if freezing guest time across suspend is indeed problematic, one thought
> > > would be to put the onus on the VMM/user to not advertise a "nonstop TSC" if the
> > > host may be suspending.  The Linux-as-a-guest would prefer kvmclock over TSC for
> > > both clocksource and sched_clock.
> > >
> > > [*] https://lore.kernel.org/all/Z4gqlbumOFPF_rxd@google.com
> >
> > I see what you're saying. Thanks for explaining.
> >
> > To complicate things further there are also different kinds of
> > suspends. From what I've seen "shallow" (and/or "suspend-to-idle")
> > suspends don't stop the CPU, at least on our machines, so the host TSC
> > keeps ticking. On "deep" suspends, on the other hand, the TSC might go
> > backwards.
>
> Yeah, only S3 and lower will power down the CPU.  All bets are off if the CPU
> doesn't have a nonstop TSC, but that's not at all unique to suspend, e.g. it's a
> problem if the CPU goes idle, and so I think it's safe to only worry about CPUs
> with nonstop TSC.
>
> > But I suppose if the guest uses kvmclock the behavior should be the
> > same in either case.
> >
> > At least for our use case we would definitely want guest *wall* time
> > to keep advancing, so we would still want to use kvmclock.
> >
> > Would accounting the suspend duration in steal time be acceptable if
> > it was conditional on the guest using kvmclock?
> > We would need a way for the host to be notified that the guest is
> > indeed using it,
>
> And not just using kvmclock, but specifically using for sched_clock.  E.g. the
> current behavior for most Linux guests on modern hardware is that they'll use TSC
> for clocksource, but kvmclock for sched_clock and wall clock.
>
> > possibly by adding a new MSR to be written to in
> > kvm_cs_enable().
>
> I don't think that's a good way forward.  I expect kvmclock to be largely
> deprecated (guest side) in favor of raw TSC (with hardware-provided scale+offset),
> at which point tying this to kvmclock puts us back at square one.
>
> Given that s2idle and standby don't reset host TSC, I think the right way to
> handle this conundrum is to address the flaw that's noted in the "backwards TSC"
> logic, and adjust guest TSC to be fully up-to-date in the S3 (or lower) case.
>
>          * ......................................  Unfortunately, we can't
>          * bring the TSCs fully up to date with real time, as we aren't yet far
>          * enough into CPU bringup that we know how much real time has actually
>          * elapsed; our helper function, ktime_get_boottime_ns() will be using boot
>          * variables that haven't been updated yet.
>
> I have no idea why commit 0dd6a6edb012 ("KVM: Dont mark TSC unstable due to S4
> suspend") hooked kvm_arch_enable_virtualization_cpu() instead of implementing a
> PM notifier, but I don't see anything that suggests it was deliberate, i.e. that
> KVm *needs* to effectively snapshot guest TSC when onlining CPUs.
>
> If that wart is fixed, then both kvmclock and TSC will account host suspend time,
> and KVM can safely account the suspend time into steal time regardless of which
> clock(s) the guest is using.

I tried your suggestion of moving this to a PM notifier and I found
that it's possible for VCPUs to run after resume but before the PM
notifier has been called, because the resume notifiers get called
after tasks are unfrozen. Unfortunately that means that if we were to
do that, guest TSCs could go backwards.

However, I think it should be possible to keep the existing backwards
guest TSC prevention code but also use a notifier that further adjusts
the guest TSCs to advance time on suspends where the TSC did go
backwards. This would make both s2idle and deep suspends behave the
same way.

-- Suleiman

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 1/3] kvm: Introduce kvm_total_suspend_ns().
  2025-02-04  7:58             ` Suleiman Souhlal
@ 2025-02-05  5:55               ` Suleiman Souhlal
  2025-02-06  1:29                 ` Sean Christopherson
  0 siblings, 1 reply; 21+ messages in thread
From: Suleiman Souhlal @ 2025-02-05  5:55 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Chao Gao, David Woodhouse, kvm,
	linux-kernel, ssouhlal

On Tue, Feb 4, 2025 at 4:58 PM Suleiman Souhlal <suleiman@google.com> wrote:
>
> On Wed, Jan 22, 2025 at 5:22 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Tue, Jan 21, 2025, Suleiman Souhlal wrote:
> > > On Sat, Jan 18, 2025 at 1:52 AM Sean Christopherson <seanjc@google.com> wrote:
> > > > In short, AFAICT the issues you are observing are mostly a problem with kvmclock.
> > > > Or maybe it's the other way around and effectively freezing guest TSC is super
> > > > problematic and fundamentally flawed.
> > > >
> > > > Regardless of which one is "broken", unconditionally accounting suspend time to
> > > > steal_time will do the wrong thing when sched_clock=tsc.  To further muddy the
> > > > waters, current Linux-as-a-guest on modern hardware will likely use clocksource=tsc,
> > > > but sched_clock=kvmclock.  In that scenario, guest time doesn't advanced, but
> > > > guest scheduler time does.  Ugh.
> > > >
> > > > That particular wart can be avoided by having the guest use TSC for sched_clock[*],
> > > > e.g. so that at least the behavior of time is consistent.
> > > >
> > > > Hmm, if freezing guest time across suspend is indeed problematic, one thought
> > > > would be to put the onus on the VMM/user to not advertise a "nonstop TSC" if the
> > > > host may be suspending.  The Linux-as-a-guest would prefer kvmclock over TSC for
> > > > both clocksource and sched_clock.
> > > >
> > > > [*] https://lore.kernel.org/all/Z4gqlbumOFPF_rxd@google.com
> > >
> > > I see what you're saying. Thanks for explaining.
> > >
> > > To complicate things further there are also different kinds of
> > > suspends. From what I've seen "shallow" (and/or "suspend-to-idle")
> > > suspends don't stop the CPU, at least on our machines, so the host TSC
> > > keeps ticking. On "deep" suspends, on the other hand, the TSC might go
> > > backwards.
> >
> > Yeah, only S3 and lower will power down the CPU.  All bets are off if the CPU
> > doesn't have a nonstop TSC, but that's not at all unique to suspend, e.g. it's a
> > problem if the CPU goes idle, and so I think it's safe to only worry about CPUs
> > with nonstop TSC.
> >
> > > But I suppose if the guest uses kvmclock the behavior should be the
> > > same in either case.
> > >
> > > At least for our use case we would definitely want guest *wall* time
> > > to keep advancing, so we would still want to use kvmclock.
> > >
> > > Would accounting the suspend duration in steal time be acceptable if
> > > it was conditional on the guest using kvmclock?
> > > We would need a way for the host to be notified that the guest is
> > > indeed using it,
> >
> > And not just using kvmclock, but specifically using for sched_clock.  E.g. the
> > current behavior for most Linux guests on modern hardware is that they'll use TSC
> > for clocksource, but kvmclock for sched_clock and wall clock.
> >
> > > possibly by adding a new MSR to be written to in
> > > kvm_cs_enable().
> >
> > I don't think that's a good way forward.  I expect kvmclock to be largely
> > deprecated (guest side) in favor of raw TSC (with hardware-provided scale+offset),
> > at which point tying this to kvmclock puts us back at square one.
> >
> > Given that s2idle and standby don't reset host TSC, I think the right way to
> > handle this conundrum is to address the flaw that's noted in the "backwards TSC"
> > logic, and adjust guest TSC to be fully up-to-date in the S3 (or lower) case.
> >
> >          * ......................................  Unfortunately, we can't
> >          * bring the TSCs fully up to date with real time, as we aren't yet far
> >          * enough into CPU bringup that we know how much real time has actually
> >          * elapsed; our helper function, ktime_get_boottime_ns() will be using boot
> >          * variables that haven't been updated yet.
> >
> > I have no idea why commit 0dd6a6edb012 ("KVM: Dont mark TSC unstable due to S4
> > suspend") hooked kvm_arch_enable_virtualization_cpu() instead of implementing a
> > PM notifier, but I don't see anything that suggests it was deliberate, i.e. that
> > KVm *needs* to effectively snapshot guest TSC when onlining CPUs.
> >
> > If that wart is fixed, then both kvmclock and TSC will account host suspend time,
> > and KVM can safely account the suspend time into steal time regardless of which
> > clock(s) the guest is using.
>
> I tried your suggestion of moving this to a PM notifier and I found
> that it's possible for VCPUs to run after resume but before the PM
> notifier has been called, because the resume notifiers get called
> after tasks are unfrozen. Unfortunately that means that if we were to
> do that, guest TSCs could go backwards.
>
> However, I think it should be possible to keep the existing backwards
> guest TSC prevention code but also use a notifier that further adjusts
> the guest TSCs to advance time on suspends where the TSC did go
> backwards. This would make both s2idle and deep suspends behave the
> same way.

An alternative might be to block VCPUs from newly entering the guest
between the pre and post suspend notifiers.
Otherwise, some of the steal time accounting would have to be done in
kvm_arch_enable_virtualization_cpu(), to make sure it gets applied on
the first VCPU run, in case that happens before the resume notifier
would have fired. But the comment there says we can't call
ktime_get_boottime_ns() there, so maybe that's not possible.

-- Suleiman

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 1/3] kvm: Introduce kvm_total_suspend_ns().
  2025-02-05  5:55               ` Suleiman Souhlal
@ 2025-02-06  1:29                 ` Sean Christopherson
  2025-02-13  3:56                   ` Suleiman Souhlal
  0 siblings, 1 reply; 21+ messages in thread
From: Sean Christopherson @ 2025-02-06  1:29 UTC (permalink / raw)
  To: Suleiman Souhlal
  Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Chao Gao, David Woodhouse, kvm,
	linux-kernel, ssouhlal

On Wed, Feb 05, 2025, Suleiman Souhlal wrote:
> On Tue, Feb 4, 2025 at 4:58 PM Suleiman Souhlal <suleiman@google.com> wrote:
> > > Given that s2idle and standby don't reset host TSC, I think the right way to
> > > handle this conundrum is to address the flaw that's noted in the "backwards TSC"
> > > logic, and adjust guest TSC to be fully up-to-date in the S3 (or lower) case.
> > >
> > >          * ......................................  Unfortunately, we can't
> > >          * bring the TSCs fully up to date with real time, as we aren't yet far
> > >          * enough into CPU bringup that we know how much real time has actually
> > >          * elapsed; our helper function, ktime_get_boottime_ns() will be using boot
> > >          * variables that haven't been updated yet.
> > >
> > > I have no idea why commit 0dd6a6edb012 ("KVM: Dont mark TSC unstable due to S4
> > > suspend") hooked kvm_arch_enable_virtualization_cpu() instead of implementing a
> > > PM notifier, but I don't see anything that suggests it was deliberate, i.e. that
> > > KVm *needs* to effectively snapshot guest TSC when onlining CPUs.
> > >
> > > If that wart is fixed, then both kvmclock and TSC will account host suspend time,
> > > and KVM can safely account the suspend time into steal time regardless of which
> > > clock(s) the guest is using.
> >
> > I tried your suggestion of moving this to a PM notifier and I found
> > that it's possible for VCPUs to run after resume but before the PM
> > notifier has been called, because the resume notifiers get called
> > after tasks are unfrozen. Unfortunately that means that if we were to
> > do that, guest TSCs could go backwards.

Ugh.  That explains why KVM hooks the CPU online path.

> > However, I think it should be possible to keep the existing backwards
> > guest TSC prevention code but also use a notifier that further adjusts
> > the guest TSCs to advance time on suspends where the TSC did go
> > backwards. This would make both s2idle and deep suspends behave the
> > same way.
> 
> An alternative might be to block VCPUs from newly entering the guest
> between the pre and post suspend notifiers.
> Otherwise, some of the steal time accounting would have to be done in
> kvm_arch_enable_virtualization_cpu(), to make sure it gets applied on
> the first VCPU run, in case that happens before the resume notifier
> would have fired. But the comment there says we can't call
> ktime_get_boottime_ns() there, so maybe that's not possible.

I don't think the PM notifier approach is viable.  It's simply too late.  Without
a hook in CPU online, KVM can't even tell which VMs/vCPUs were running before the
suspend, i.e. which VMs need to be updated.

One idea would be to simply fast forward guest TSC to current time when the vCPU
is loaded after suspend+resume.  E.g. this hack appears to work.

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index dcd0c12c308e..27b25f8ca413 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4824,7 +4824,16 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 
        /* Apply any externally detected TSC adjustments (due to suspend) */
        if (unlikely(vcpu->arch.tsc_offset_adjustment)) {
-               adjust_tsc_offset_host(vcpu, vcpu->arch.tsc_offset_adjustment);
+               u64 kernel_ns, tsc_now;
+
+               if (kvm_get_time_and_clockread(&kernel_ns, &tsc_now)) {
+                       u64 l1_tsc = nsec_to_cycles(vcpu, vcpu->kvm->arch.kvmclock_offset + kernel_ns);
+                       u64 offset = kvm_compute_l1_tsc_offset(vcpu, l1_tsc);
+
+                       kvm_vcpu_write_tsc_offset(vcpu, offset);
+               } else {
+                       adjust_tsc_offset_host(vcpu, vcpu->arch.tsc_offset_adjustment);
+               }
                vcpu->arch.tsc_offset_adjustment = 0;
                kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
        }

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 1/3] kvm: Introduce kvm_total_suspend_ns().
  2025-02-06  1:29                 ` Sean Christopherson
@ 2025-02-13  3:56                   ` Suleiman Souhlal
  0 siblings, 0 replies; 21+ messages in thread
From: Suleiman Souhlal @ 2025-02-13  3:56 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Chao Gao, David Woodhouse, kvm,
	linux-kernel, ssouhlal

On Thu, Feb 6, 2025 at 10:29 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Feb 05, 2025, Suleiman Souhlal wrote:
> > >
> > > I tried your suggestion of moving this to a PM notifier and I found
> > > that it's possible for VCPUs to run after resume but before the PM
> > > notifier has been called, because the resume notifiers get called
> > > after tasks are unfrozen. Unfortunately that means that if we were to
> > > do that, guest TSCs could go backwards.
>
> Ugh.  That explains why KVM hooks the CPU online path.
>
> > > However, I think it should be possible to keep the existing backwards
> > > guest TSC prevention code but also use a notifier that further adjusts
> > > the guest TSCs to advance time on suspends where the TSC did go
> > > backwards. This would make both s2idle and deep suspends behave the
> > > same way.
> >
> > An alternative might be to block VCPUs from newly entering the guest
> > between the pre and post suspend notifiers.
> > Otherwise, some of the steal time accounting would have to be done in
> > kvm_arch_enable_virtualization_cpu(), to make sure it gets applied on
> > the first VCPU run, in case that happens before the resume notifier
> > would have fired. But the comment there says we can't call
> > ktime_get_boottime_ns() there, so maybe that's not possible.
>
> I don't think the PM notifier approach is viable.  It's simply too late.  Without
> a hook in CPU online, KVM can't even tell which VMs/vCPUs were running before the
> suspend, i.e. which VMs need to be updated.
>
> One idea would be to simply fast forward guest TSC to current time when the vCPU
> is loaded after suspend+resume.  E.g. this hack appears to work.

That's really interesting!
One possible concern with this approach is that now each VCPU gets a
slightly different offset.
It's possible to avoid that by making sure that only one VCPU computes
the offset and having the other ones reuse it.
I'll send a patch. Should it be "Suggested-by:" you or is there
something else I should do?

I also wasn't sure if we should do this when the user sets the offset
with KVM_VCPU_TSC_OFFSET. I guess we probably should.

-- Suleiman

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2025-02-13  3:56 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-01-07  4:21 [PATCH v3 0/3] KVM: x86: Include host suspended time in steal time Suleiman Souhlal
2025-01-07  4:21 ` [PATCH v3 1/3] kvm: Introduce kvm_total_suspend_ns() Suleiman Souhlal
2025-01-07 15:27   ` Sean Christopherson
2025-01-07 16:43     ` Suleiman Souhlal
2025-01-08  7:15   ` kernel test robot
2025-01-08  8:36   ` kernel test robot
2025-01-15 21:49   ` Sean Christopherson
2025-01-17  6:35     ` Suleiman Souhlal
2025-01-17 16:52       ` Sean Christopherson
2025-01-21  5:37         ` Suleiman Souhlal
2025-01-21 20:22           ` Sean Christopherson
2025-02-04  7:58             ` Suleiman Souhlal
2025-02-05  5:55               ` Suleiman Souhlal
2025-02-06  1:29                 ` Sean Christopherson
2025-02-13  3:56                   ` Suleiman Souhlal
2025-01-07  4:22 ` [PATCH v3 2/3] KVM: x86: Include host suspended time in steal time Suleiman Souhlal
2025-01-07 15:37   ` Sean Christopherson
2025-01-08  4:05     ` Suleiman Souhlal
2025-01-08 15:17       ` Sean Christopherson
2025-01-07  4:22 ` [PATCH v3 3/3] KVM: x86: Document host suspend being included " Suleiman Souhlal
2025-01-07 15:37   ` Sean Christopherson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).