public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Sean Christopherson <seanjc@google.com>
To: Suleiman Souhlal <suleiman@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	 Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	 Dave Hansen <dave.hansen@linux.intel.com>,
	x86@kernel.org,  "H. Peter Anvin" <hpa@zytor.com>,
	Chao Gao <chao.gao@intel.com>,
	 David Woodhouse <dwmw2@infradead.org>,
	kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	 ssouhlal@freebsd.org
Subject: Re: [PATCH v3 1/3] kvm: Introduce kvm_total_suspend_ns().
Date: Tue, 21 Jan 2025 12:22:19 -0800	[thread overview]
Message-ID: <Z5AB-6bLRNLle27G@google.com> (raw)
In-Reply-To: <CABCjUKDDDhXx8mSRKHCa34JjSX1nfM5WMG-UrPu9fjei6gkUJA@mail.gmail.com>

On Tue, Jan 21, 2025, Suleiman Souhlal wrote:
> On Sat, Jan 18, 2025 at 1:52 AM Sean Christopherson <seanjc@google.com> wrote:
> > In short, AFAICT the issues you are observing are mostly a problem with kvmclock.
> > Or maybe it's the other way around and effectively freezing guest TSC is super
> > problematic and fundamentally flawed.
> >
> > Regardless of which one is "broken", unconditionally accounting suspend time to
> > steal_time will do the wrong thing when sched_clock=tsc.  To further muddy the
> > waters, current Linux-as-a-guest on modern hardware will likely use clocksource=tsc,
> > but sched_clock=kvmclock.  In that scenario, guest time doesn't advanced, but
> > guest scheduler time does.  Ugh.
> >
> > That particular wart can be avoided by having the guest use TSC for sched_clock[*],
> > e.g. so that at least the behavior of time is consistent.
> >
> > Hmm, if freezing guest time across suspend is indeed problematic, one thought
> > would be to put the onus on the VMM/user to not advertise a "nonstop TSC" if the
> > host may be suspending.  The Linux-as-a-guest would prefer kvmclock over TSC for
> > both clocksource and sched_clock.
> >
> > [*] https://lore.kernel.org/all/Z4gqlbumOFPF_rxd@google.com
> 
> I see what you're saying. Thanks for explaining.
> 
> To complicate things further there are also different kinds of
> suspends. From what I've seen "shallow" (and/or "suspend-to-idle")
> suspends don't stop the CPU, at least on our machines, so the host TSC
> keeps ticking. On "deep" suspends, on the other hand, the TSC might go
> backwards.

Yeah, only S3 and lower will power down the CPU.  All bets are off if the CPU
doesn't have a nonstop TSC, but that's not at all unique to suspend, e.g. it's a
problem if the CPU goes idle, and so I think it's safe to only worry about CPUs
with nonstop TSC.

> But I suppose if the guest uses kvmclock the behavior should be the
> same in either case.
> 
> At least for our use case we would definitely want guest *wall* time
> to keep advancing, so we would still want to use kvmclock.
> 
> Would accounting the suspend duration in steal time be acceptable if
> it was conditional on the guest using kvmclock?
> We would need a way for the host to be notified that the guest is
> indeed using it,

And not just using kvmclock, but specifically using for sched_clock.  E.g. the
current behavior for most Linux guests on modern hardware is that they'll use TSC
for clocksource, but kvmclock for sched_clock and wall clock.

> possibly by adding a new MSR to be written to in
> kvm_cs_enable().

I don't think that's a good way forward.  I expect kvmclock to be largely
deprecated (guest side) in favor of raw TSC (with hardware-provided scale+offset),
at which point tying this to kvmclock puts us back at square one.

Given that s2idle and standby don't reset host TSC, I think the right way to
handle this conundrum is to address the flaw that's noted in the "backwards TSC"
logic, and adjust guest TSC to be fully up-to-date in the S3 (or lower) case.

	 * ......................................  Unfortunately, we can't
	 * bring the TSCs fully up to date with real time, as we aren't yet far
	 * enough into CPU bringup that we know how much real time has actually
	 * elapsed; our helper function, ktime_get_boottime_ns() will be using boot
	 * variables that haven't been updated yet.

I have no idea why commit 0dd6a6edb012 ("KVM: Dont mark TSC unstable due to S4
suspend") hooked kvm_arch_enable_virtualization_cpu() instead of implementing a
PM notifier, but I don't see anything that suggests it was deliberate, i.e. that
KVm *needs* to effectively snapshot guest TSC when onlining CPUs.

If that wart is fixed, then both kvmclock and TSC will account host suspend time,
and KVM can safely account the suspend time into steal time regardless of which
clock(s) the guest is using.

  reply	other threads:[~2025-01-21 20:22 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-01-07  4:21 [PATCH v3 0/3] KVM: x86: Include host suspended time in steal time Suleiman Souhlal
2025-01-07  4:21 ` [PATCH v3 1/3] kvm: Introduce kvm_total_suspend_ns() Suleiman Souhlal
2025-01-07 15:27   ` Sean Christopherson
2025-01-07 16:43     ` Suleiman Souhlal
2025-01-08  7:15   ` kernel test robot
2025-01-08  8:36   ` kernel test robot
2025-01-15 21:49   ` Sean Christopherson
2025-01-17  6:35     ` Suleiman Souhlal
2025-01-17 16:52       ` Sean Christopherson
2025-01-21  5:37         ` Suleiman Souhlal
2025-01-21 20:22           ` Sean Christopherson [this message]
2025-02-04  7:58             ` Suleiman Souhlal
2025-02-05  5:55               ` Suleiman Souhlal
2025-02-06  1:29                 ` Sean Christopherson
2025-02-13  3:56                   ` Suleiman Souhlal
2025-01-07  4:22 ` [PATCH v3 2/3] KVM: x86: Include host suspended time in steal time Suleiman Souhlal
2025-01-07 15:37   ` Sean Christopherson
2025-01-08  4:05     ` Suleiman Souhlal
2025-01-08 15:17       ` Sean Christopherson
2025-01-07  4:22 ` [PATCH v3 3/3] KVM: x86: Document host suspend being included " Suleiman Souhlal
2025-01-07 15:37   ` Sean Christopherson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Z5AB-6bLRNLle27G@google.com \
    --to=seanjc@google.com \
    --cc=bp@alien8.de \
    --cc=chao.gao@intel.com \
    --cc=dave.hansen@linux.intel.com \
    --cc=dwmw2@infradead.org \
    --cc=hpa@zytor.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=ssouhlal@freebsd.org \
    --cc=suleiman@google.com \
    --cc=tglx@linutronix.de \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox