From: Thomas Gleixner <tglx@linutronix.de>
To: Oliver Upton <oupton@google.com>
Cc: "Franke, Daniel" <dff@amazon.com>,
David Woodhouse <dwmw2@infradead.org>,
Paolo Bonzini <pbonzini@redhat.com>,
"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
Sean Christopherson <seanjc@google.com>,
Vitaly Kuznetsov <vkuznets@redhat.com>,
Wanpeng Li <wanpengli@tencent.com>,
Jim Mattson <jmattson@google.com>, Joerg Roedel <joro@8bytes.org>
Subject: Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
Date: Tue, 29 Mar 2022 21:34:13 +0200 [thread overview]
Message-ID: <87sfr0eeoa.ffs@tglx> (raw)
In-Reply-To: <CAOQ_QsgDY0oeS0kU62MWMWj9JR3mKfU_p=MC7kto=LX5tQ2PPA@mail.gmail.com>
Oliver,
On Tue, Mar 29 2022 at 09:02, Oliver Upton wrote:
> On Tue, Mar 29, 2022 at 7:19 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>> > Doing this the other way around (advance the TSC, tell the guest to fix
>> > MONOTONIC) is fundamentally wrong, as it violates two invariants of the
>> > monotonic clock. Monotonic counts during a migration, which really is a
>> > forced suspend. Additionally, you cannot step the monotonic clock.
>>
>> A migration _should_ have suspend semantics, but the forced suspend
>> which is done by migration today does not have proper defined semantics
>> at all.
>>
>> Also clock monotonic can be stepped forward under certain circumstances
>> and the kernel is very well able to handle it within well defined
>> limits. Think about scheduled out vCPUs. From their perspective clock
>> monotonic is stepping forwards.
>>
>
> Right. For better or worse, the kernel has been conditioned to
> tolerate small steps due to scheduling. There is just zero definition
> around how much slop is allowed.
From the experience with the MONO=BOOT experiment, I'd say anything < 1
second is unlikely to cause major trouble, but there might be user
space applications/configurations which disagree :)
>> The problem starts when the 'force suspended' time becomes excessive, as
>> that causes the mass expiry of clock monotonic timers with all the nasty
>> side effects described in a3ed0e4393d6. In the worst case it's going to
>> exceed the catchup limit of non-suspended timekeeping (~440s for a TSC
>> @2GHz) which in fact breaks the world and some more.
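For reference, the ~440s number falls out of the 64-bit mult overflow in
the cyc2ns conversion. A back-of-envelope model (the shift value is an
assumption for illustration, not the kernel's exact pick; the kernel's
safety margins shave the raw figure down further):

```python
# Rough model of the clocksource 64-bit overflow limit for a 2 GHz TSC.
freq_hz = 2_000_000_000
shift = 25                                   # assumed fixed-point shift
mult = (1_000_000_000 << shift) // freq_hz   # ns-per-cycle in fixed point

max_cycles = (1 << 64) // mult               # largest delta before cycles * mult overflows u64
max_ns = (max_cycles * mult) >> shift        # ~2^(64 - shift) nanoseconds

print(max_ns / 1e9)                          # ~550s raw; kernel margins bring it near ~440s
```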
>>
>> So let me go back to the use cases:
>>
>> 1) Regular freeze of a VM to disk and resume at some arbitrary point in
>> the future.
>>
>> 2) Live migration
>>
>> In both cases the proper solution is to make the guest go into a well
>> defined state (suspend) and resume it on restore. Everything just works
>> because it is well defined.
>>
>> Though you say that there is a special case (not that I believe it):
>
> I believe the easier special case to articulate is when the hypervisor
> has already done its due diligence to warn the guest about a
> migration. Guest doesn't heed the warning and doesn't quiesce. The
> most predictable state at this point is probably just to kill the VM
> on the spot, but that is likely to be a _very_ tough sell :)
I'm all for it. It's very well defined.
> So assuming that it's still possible for a non-cooperative suspend
> (live migration, VM freeze, etc.) there's still a need to stop the
> bleeding. I think you touch on what that may look like:
>
>> 1) Trestart - Tstop < TOLERABLE_THRESHOLD
>>
>> That's the easy case as you just can adjust TSC on restore by that
>> amount on all vCPUs and be done with it. Just works like scheduling
>> out all vCPUs for some time.
>>
>> 2) Trestart - Tstop >= TOLERABLE_THRESHOLD
>>
>> Avoid adjusting TSC for the reasons above.
>
> Which naturally will prompt the question: what is the value of
> TOLERABLE_THRESHOLD? Speaking from experience (Google already does
> something similar, but without a good fallback for exceeding the
> threshold), there's ~zero science in deriving that value. But, IMO if
> it's at least documented we can make the shenanigans at least a bit
> more predictable. It also makes it very easy to define who
> (guest/host) is responsible for cleaning up the mess.
See above, but the hyperscalers with experience on heavy host overload
might have better information when keeping a vCPU scheduled out starts
to create problems in the guest.
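Whatever value ends up being picked, the host-side decision on restore
is a simple two-way split. A sketch in Python (all names hypothetical,
just modeling the two cases above):

```python
def plan_restore(t_stop_ns, t_restart_ns, threshold_ns):
    """Model of the host-side restore decision (illustrative, not an API)."""
    t_out = t_restart_ns - t_stop_ns
    if t_out < threshold_ns:
        # Case 1: small gap. Advance every vCPU's TSC by the same amount;
        # to the guest this looks like all vCPUs were scheduled out briefly.
        return ("advance_tsc", t_out)
    # Case 2: large gap. Leave the TSC alone and expose the TAI delta via
    # the PV metadata so the guest can inject it itself.
    return ("expose_tai_delta", t_out)
```

With TOLERABLE_THRESHOLD=0 this degenerates to always taking the second
branch, which is the simple starting point suggested below.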
> In absence of documentation there's an unlimited license for VM
> operators to do as they please and I fear we will forever perpetuate
> the pain of time in virt.
You can prevent that, by making 'Cooperate within time or die hard' the
policy. :)
>> if (seq != get_migration_sequence())
>> do_something_smart();
>> else
>> proceed_as_usual();
>
> Agreed pretty much the whole way through. There's no point in keeping
> NTP naive at this point.
>
> There's a need to sound the alarm for NTP regardless of whether
> TOLERABLE_THRESHOLD is exceeded. David pointed out that the host
> advancing the guest clocks (delta injection or TSC advancement) could
> inject some error. Also, hardware has likely changed and the new parts
> will have their own errors as well.
There is no reason why you can't use the #2 scheme for the #1 case
too:
>> On destination host:
>> Restore memory image
>> Expose metadata in PV:
>> - migration sequence number + 1
- Flag whether Tout was compensated already via
TSC or just set Tout = 0
>> - Tout (dest/source host delta of clock TAI)
>> Run guest
>>
>> Guest kernel:
>>
>> - Keep track of the PV migration sequence number.
>>
>> If it changed act accordingly by injecting the TAI delta,
>> which updates NTP state, wakes TFD_TIMER_CANCEL_ON_SET,
>> etc...
if it was compensated via TSC already, it might be
sufficient to just reset NTP state.
>> NTP:
>> - utilize the sequence counter information
....
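Modeled in Python (field names are illustrative only, not a proposed PV
ABI), the guest-side check would be on the order of:

```python
from dataclasses import dataclass

@dataclass
class PvMigrationInfo:
    """Illustrative model of the PV metadata exposed by the host."""
    seq: int                 # migration sequence number
    tai_delta_ns: int        # dest/source host delta of clock TAI (Tout)
    tsc_compensated: bool    # host already advanced TSC for Tout?

last_seq = 0

def guest_clock_check(pv):
    """What the guest does when it notices a sequence change."""
    global last_seq
    if pv.seq == last_seq:
        return "proceed_as_usual"
    last_seq = pv.seq
    if pv.tsc_compensated:
        # Tout already folded into the TSC; just resetting NTP state
        # may be sufficient.
        return "reset_ntp_state"
    # Inject the TAI delta: updates NTP state, wakes
    # TFD_TIMER_CANCEL_ON_SET waiters, etc.
    return ("inject_tai_delta", pv.tai_delta_ns)
```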
OTOH, the question is whether it's worth it.
If we assume that the sane case is a cooperative guest and the forced
migration is the last resort, then we can just avoid the extra magic and
the discussion around the correct value for TOLERABLE_THRESHOLD
altogether.
I suggest starting from a TOLERABLE_THRESHOLD=0 assumption to keep it
simple in the first step. Once this has been established, you can still
experiment with the threshold and figure out whether it matters.
In fact, doing the host side TSC compensation is just an excuse for VM
operators not to make the guest cooperative, because it might solve
their main problems for the vast majority of migrations.
Forcing them to do it right is definitely the better option, which
means the 'Cooperate or die hard' policy is the best one you can
choose. :)
>> That still will leave everything else exposed to
>> CLOCK_REALTIME/TAI jumping forward, but there is nothing you can
>> do about that and any application which cares about this has to
>> be able to deal with it anyway.
>
> Right. There's no cure-all between hypervisor/guest kernel that could
> fix the problem for userspace entirely.
In the same way as there is no cure for time jumps caused by
settimeofday(), daylight saving changes, leap seconds etc., unless the
application is carefully designed to deal with that.
> Appreciate you chiming in on this topic yet again.
I still hope that this gets fixed _before_ I retire :)
Thanks,
tglx