All of lore.kernel.org
 help / color / mirror / Atom feed
From: Sean Christopherson <seanjc@google.com>
To: Suleiman Souhlal <ssouhlal@freebsd.org>
Cc: Chao Gao <chao.gao@intel.com>,
	Suleiman Souhlal <suleiman@google.com>,
	 Paolo Bonzini <pbonzini@redhat.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>,  Borislav Petkov <bp@alien8.de>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	x86@kernel.org,  "H. Peter Anvin" <hpa@zytor.com>,
	kvm@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] KVM: x86: Include host suspended time in steal time.
Date: Tue, 13 Aug 2024 10:06:44 -0700	[thread overview]
Message-ID: <ZruSpDcysc2B-HQ-@google.com> (raw)
In-Reply-To: <Zqi2RJKp8JxSedOI@freefall.freebsd.org>

On Tue, Jul 30, 2024, Suleiman Souhlal wrote:
> On Tue, Jul 30, 2024 at 10:26:30AM +0800, Chao Gao wrote:
> > Hi,
> > 
> > On Wed, Jul 10, 2024 at 04:44:10PM +0900, Suleiman Souhlal wrote:
> > >When the host resumes from a suspend, the guest thinks any task
> > >that was running during the suspend ran for a long time, even though
> > >the effective run time was much shorter, which can end up having
> > >negative effects with scheduling. This can be particularly noticeable
> > >if the guest task was RT, as it can end up getting throttled for a
> > >long time.
> > >
> > >To mitigate this issue, we include the time that the host was
> > >suspended in steal time, which lets the guest can subtract the
> > >duration from the tasks' runtime.
> > >
> > >Signed-off-by: Suleiman Souhlal <suleiman@google.com>
> > >---
> > > arch/x86/kvm/x86.c       | 23 ++++++++++++++++++++++-
> > > include/linux/kvm_host.h |  4 ++++
> > > 2 files changed, 26 insertions(+), 1 deletion(-)
> > >
> > >diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > >index 0763a0f72a067f..94bbdeef843863 100644
> > >--- a/arch/x86/kvm/x86.c
> > >+++ b/arch/x86/kvm/x86.c
> > >@@ -3669,7 +3669,7 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
> > > 	struct kvm_steal_time __user *st;
> > > 	struct kvm_memslots *slots;
> > > 	gpa_t gpa = vcpu->arch.st.msr_val & KVM_STEAL_VALID_BITS;
> > >-	u64 steal;
> > >+	u64 steal, suspend_duration;
> > > 	u32 version;
> > > 
> > > 	if (kvm_xen_msr_enabled(vcpu->kvm)) {
> > >@@ -3696,6 +3696,12 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
> > > 			return;
> > > 	}
> > > 
> > >+	suspend_duration = 0;
> > >+	if (READ_ONCE(vcpu->suspended)) {
> > >+		suspend_duration = vcpu->kvm->last_suspend_duration;
> > >+		vcpu->suspended = 0;
> > 
> > Can you explain why READ_ONCE() is necessary here, but WRITE_ONCE() isn't used
> > for clearing vcpu->suspended?
> 
> Because this part of the code is essentially single-threaded, it didn't seem
> like WRITE_ONCE() was needed. I can add it in an eventual future version of
> the patch if it makes things less confusing (if this code still exists).

{READ,WRITE}_ONCE() don't provide ordering, they only ensure that the compiler
emits a load/store.  But (a) forcing the compiler to emit a load/store is only
necessary when it's possible for the compiler to know/prove that the load/store
won't affect functionalty, and (b) this code doesn't have have the correct
ordering even if there were the appropriate smp_wmb() and smp_rmb() annotations.

The writer needs to ensure the write to last_suspend_duration is visible before
vcpu->suspended is set, and the reader needs to ensure it reads last_suspend_duration
after vcpu->suspended.

That said, it's likely a moot point, because I assume the PM notifier runs while
tasks are frozen, i.e. its writes are guaranteed to be ordered before
record_steal_time().

And _that_ said, I don't see any reason for two fields, nor do I see any reason
to track the suspended duration in the VM instead of the vCPU.  Similar to what
Chao pointed out, per-VM tracking falls apart if only some vCPUs consume the
suspend duration.  I.e. just accumulate the suspend duration directly in the
vCPU.

> > >+	}
> > >+
> > > 	st = (struct kvm_steal_time __user *)ghc->hva;
> > > 	/*
> > > 	 * Doing a TLB flush here, on the guest's behalf, can avoid
> > >@@ -3749,6 +3755,7 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
> > > 	unsafe_get_user(steal, &st->steal, out);
> > > 	steal += current->sched_info.run_delay -
> > > 		vcpu->arch.st.last_steal;
> > >+	steal += suspend_duration;
> > > 	vcpu->arch.st.last_steal = current->sched_info.run_delay;
> > > 	unsafe_put_user(steal, &st->steal, out);
> > > 
> > >@@ -6920,6 +6927,7 @@ static int kvm_arch_suspend_notifier(struct kvm *kvm)
> > > 
> > > 	mutex_lock(&kvm->lock);
> > > 	kvm_for_each_vcpu(i, vcpu, kvm) {
> > >+		WRITE_ONCE(vcpu->suspended, 1);
> > > 		if (!vcpu->arch.pv_time.active)
> > > 			continue;
> > > 
> > >@@ -6932,15 +6940,28 @@ static int kvm_arch_suspend_notifier(struct kvm *kvm)
> > > 	}
> > > 	mutex_unlock(&kvm->lock);
> > > 
> > >+	kvm->suspended_time = ktime_get_boottime_ns();
> > >+
> > > 	return ret ? NOTIFY_BAD : NOTIFY_DONE;
> > > }
> > > 
> > >+static int
> > >+kvm_arch_resume_notifier(struct kvm *kvm)

Do not wrap before the function name.  Linus has a nice explanation/rant on this[*].

[*] https://lore.kernel.org/all/CAHk-=wjoLAYG446ZNHfg=GhjSY6nFmuB_wA8fYd5iLBNXjo9Bw@mail.gmail.com

> > >+{
> > >+	kvm->last_suspend_duration = ktime_get_boottime_ns() -
> > >+	    kvm->suspended_time;
> > 
> > Is it possible that a vCPU doesn't get any chance to run (i.e., update steal
> > time) between two suspends? In this case, only the second suspend would be
> > recorded.
> 
> Good point. I'll address this.
> 
> > 
> > Maybe we need an infrastructure in the PM subsystem to record accumulated
> > suspended time. When updating steal time, KVM can add the additional suspended
> > time since the last update into steal_time (as how KVM deals with
> > current->sched_info.run_deley). This way, the scenario I mentioned above won't
> > be a problem and KVM needn't calculate the suspend duration for each guest. And
> > this approach can potentially benefit RISC-V and ARM as well, since they have
> > the same logic as x86 regarding steal_time.
> 
> Thanks for the suggestion.
> I'm a bit wary of making a whole PM subsystem addition for such a counter, but
> maybe I can make a architecture-independent KVM change for it, with a PM
> notifier in kvm_main.c.

Yes.  Either that, or the fields need to be in kvm_vcpu_arch, not kvm_vcpu.

> > Additionally, it seems that if a guest migrates to another system after a
> > suspend and before updating steal time, the suspended time is lost during
> > migration. I'm not sure if this is a practical issue.
> 
> The systems where the host suspends don't usually do VM migrations. Or at
> least the ones where we're encountering the problem this patch is trying to
> address don't (laptops).
>
> But even if they did, it doesn't seem that likely that the migration would
> happen over a host suspend.

I think we want to account for this straightaway, or at least have defined and
documented behavior, else we risk rehashing the issues with marking a vCPU as
preempted when it's loaded, but not running.  Which causes problems for live
migration as it results in KVM marking the steal-time page as dirty after vCPUs
have been paused.

[*] https://lkml.kernel.org/r/20240503181734.1467938-4-dmatlack%40google.com

  parent reply	other threads:[~2024-08-13 17:06 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-07-10  7:44 [PATCH] KVM: x86: Include host suspended time in steal time Suleiman Souhlal
2024-07-29  7:26 ` Suleiman Souhlal
2024-07-30  2:26 ` Chao Gao
2024-07-30  9:45   ` Suleiman Souhlal
2024-07-31  1:03     ` Chao Gao
2024-08-13 17:06     ` Sean Christopherson [this message]
2024-08-14  4:50       ` Suleiman Souhlal
2024-08-14 15:35         ` Sean Christopherson
2024-08-15  4:33           ` Suleiman Souhlal

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZruSpDcysc2B-HQ-@google.com \
    --to=seanjc@google.com \
    --cc=bp@alien8.de \
    --cc=chao.gao@intel.com \
    --cc=dave.hansen@linux.intel.com \
    --cc=hpa@zytor.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=ssouhlal@freebsd.org \
    --cc=suleiman@google.com \
    --cc=tglx@linutronix.de \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.