From: Marcelo Tosatti <mtosatti@redhat.com>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: Oliver Upton <oupton@google.com>,
	kvm@vger.kernel.org, kvmarm@lists.cs.columbia.edu,
	Paolo Bonzini <pbonzini@redhat.com>,
	Sean Christopherson <seanjc@google.com>,
	Marc Zyngier <maz@kernel.org>, Peter Shier <pshier@google.com>,
	Jim Mattson <jmattson@google.com>,
	David Matlack <dmatlack@google.com>,
	Ricardo Koller <ricarkol@google.com>,
	Jing Zhang <jingzhangos@google.com>,
	Raghavendra Rao Anata <rananta@google.com>,
	James Morse <james.morse@arm.com>,
	Alexandru Elisei <alexandru.elisei@arm.com>,
	Suzuki K Poulose <suzuki.poulose@arm.com>,
	linux-arm-kernel@lists.infradead.org,
	Andrew Jones <drjones@redhat.com>, Will Deacon <will@kernel.org>,
	Catalin Marinas <catalin.marinas@arm.com>
Subject: Re: [PATCH v8 4/7] KVM: x86: Report host tsc and realtime values in KVM_GET_CLOCK
Date: Fri, 1 Oct 2021 09:10:41 -0300	[thread overview]
Message-ID: <20211001121041.GA45897@fuller.cnet> (raw)
In-Reply-To: <20211001120527.GA43086@fuller.cnet>

On Fri, Oct 01, 2021 at 09:05:27AM -0300, Marcelo Tosatti wrote:
> On Fri, Oct 01, 2021 at 01:02:23AM +0200, Thomas Gleixner wrote:
> > Marcelo,
> > 
> > On Thu, Sep 30 2021 at 16:21, Marcelo Tosatti wrote:
> > > On Wed, Sep 29, 2021 at 03:56:29PM -0300, Marcelo Tosatti wrote:
> > >> On Thu, Sep 16, 2021 at 06:15:35PM +0000, Oliver Upton wrote:
> > >> 
> > >> Thomas, CC'ed, has a deeper understanding of the problems with 
> > >> forward time jumps than I do. Thomas, any comments?
> > >
> > > Based on the earlier discussion about the problems of synchronizing
> > > the guests clock via a notification to the NTP/Chrony daemon 
> > > (where there is a window where applications can read the stale
> > > value of the clock), a possible solution would be triggering
> > > an NMI on the destination (so that it runs ASAP, with higher
> > > priority than application/kernel).
> > >
> > > What would this NMI do, exactly?
> > 
> > Nothing. You cannot do anything time related in an NMI.
> > 
> > You might queue irq work which handles that, but that would still not
> > prevent user space or kernel space from observing the stale time stamp
> > depending on the execution state from where it resumes.
> 
> Yes.
> 
> > >> As a note: this makes it not OK to use the KVM_CLOCK_REALTIME flag 
> > >> for either vm pause / vm resume (well, if paused for long periods of time) 
> > >> or savevm / restorevm.
> > >
> > > Maybe with the NMI above, it would be possible to use
> > > the realtime clock as a way to know time elapsed between
> > > events and advance guest clock without the current 
> > > problematic window.
> > 
> > As much duct tape as you throw at the problem, it cannot be solved ever
> > because it's fundamentally broken. All you can do is to make the
> > observation windows smaller, but that's just curing the symptom.
> 
> Yes.
> 
> > The problem is that the guest is paused/resumed without getting any
> > information about that and the execution of the guest is stopped at an
> > arbitrary instruction boundary from which it resumes after migration or
> > restore. So there is no way to guarantee that after resume all vCPUs are
> > executing in a state which can handle that.
> > 
> > But even if that were the case, then what prevents the stale time
> > stamps from being visible? Nothing:
> > 
> > T0:    t = now();
> >          -> pause
> >          -> resume
> >          -> magic "fixup"
> > T1:    dostuff(t);
> 
> Yes.
> 
> BTW, you could have a userspace notification (then applications 
> could handle this if desired).
> 
> > But that's not a fundamental problem because every preemptible or
> > interruptible code has the same issue:
> > 
> > T0:    t = now();
> >          -> preemption or interrupt
> > T1:    dostuff(t);
> > 
> > Which is usually not a problem, but it becomes a problem when T1 - T0 is
> > greater than the usual expectations which can obviously be trivially
> > achieved by guest migration or a savevm, restorevm cycle.
> > 
> > But let's go a step back and look at the clocks and their expectations:
> > 
> > CLOCK_MONOTONIC:
> > 
> >   Monotonically increasing clock which counts unless the system
> >   is in suspend. On resume it continues counting without jumping
> >   forward.
> > 
> >   That's the reference clock for everything else and therefore it
> >   is important that it does _not_ jump around.
> > 
> >   The reason why CLOCK_MONOTONIC stops during suspend is
> >   historical and any attempt to change that breaks the world and
> >   some more because making it jump forward will trigger all sorts
> >   of timeouts, watchdogs and whatever. The last attempt to make
> >   CLOCK_MONOTONIC behave like CLOCK_BOOTTIME was reverted within 3
> >   weeks. It's not going to be attempted again. See a3ed0e4393d6
> >   ("Revert: Unify CLOCK_MONOTONIC and CLOCK_BOOTTIME") for
> >   details.
> > 
> >   Now the proposed change is creating exactly the same problem:
> > 
> >   >> > +	if (data.flags & KVM_CLOCK_REALTIME) {
> >   >> > +		u64 now_real_ns = ktime_get_real_ns();
> >   >> > +
> >   >> > +		/*
> >   >> > +		 * Avoid stepping the kvmclock backwards.
> >   >> > +		 */
> >   >> > +		if (now_real_ns > data.realtime)
> >   >> > +			data.clock += now_real_ns - data.realtime;
> >   >> > +	}
> > 
> >   IOW, it takes the time between pause and resume into account and
> >   forwards the underlying base clock which makes CLOCK_MONOTONIC
> >   jump forward by exactly that amount of time.
> 
> Well, it is assuming that, in
> 
>  T0:    t = now();
>  T1:    pause vm()
>  T2:    finish vm migration()
>  T3:    dostuff(t);
> 
> the interval between T1 and T2 is small (and that the guest
> clocks are synchronized up to a given boundary).
> 
> But I suppose adding a limit to the forward clock advance 
> (in the migration case) is useful:
> 
> 	1) If migration (well actually, only the final steps
> 	   to finish migration, the time between when guest is paused
> 	   on source and is resumed on destination) takes too long,
> 	   then too bad: fix it to be shorter if you want the clocks
> 	   to have close to zero change to realtime on migration.
> 
> 	2) Avoid the other bugs in case of large forward advance.
> 
> Maybe having it configurable, with, say, a 1 minute maximum by default,
> is a good choice?
> 
> An alternative would be to advance only the guest's REALTIME clock, from 
> data about how long steps T1-T2 took.
> 
> >   So depending on the size of the delta you are running into exactly the
> >   same problem as the final failed attempt to unify CLOCK_MONOTONIC and
> >   CLOCK_BOOTTIME which btw. would have been a magic cure for virt.
> > 
> >   Too bad, not going to happen ever unless you fix all affected user
> >   space and kernel side issues.
> > 
> > 
> > CLOCK_BOOTTIME:
> > 
> >   CLOCK_MONOTONIC + time spent in suspend
> > 
> > 
> > CLOCK_REALTIME/TAI:
> > 
> >   CLOCK_MONOTONIC + offset
> > 
> >   The offset is established by reading RTC at boot time and can be
> >   changed by clock_settime(2) and adjtimex(2). The latter is used
> >   by NTP/PTP.
> > 
> >   Any user of CLOCK_REALTIME has to be prepared for it to jump in
> >   both directions, but of course NTP/PTP daemons have expectations
> >   vs. such time jumps.
> > 
> >   They rightfully assume on a properly configured and administered
> >   system that there are only two things which can make CLOCK_REALTIME
> >   jump:
> > 
> >   1) NTP/PTP daemon controlled
> >   2) Suspend/resume related updates by the kernel
> > 
> > 
> > Just for the record, these assumptions predate virtualization.
> > 
> > So now virt came along and created a hard to solve circular dependency
> > problem:
> > 
> >    - If CLOCK_MONOTONIC stops for too long then NTP/PTP gets out of
> >      sync, but everything else is happy.
> >      
> >    - If CLOCK_MONOTONIC jumps too far forward, then all hell breaks
> >      loose, but NTP/PTP is happy.
> 
> One must handle the
> 
>  T0:    t = now();
>           -> pause
>           -> resume
>           -> magic "fixup"
>  T1:    dostuff(t);
> 
> fact if one is going to use savevm/restorevm anyway, so...
> (it is kind of unfixable, unless you modify your application
> to accept notifications to redo any computation based on t, isn't it?).

https://lore.kernel.org/lkml/1289503802-22444-2-git-send-email-virtuoso@slind.org/


Thread overview: 40+ messages
2021-09-16 18:15 [PATCH v8 0/7] KVM: x86: Add idempotent controls for migrating system counter state Oliver Upton
2021-09-16 18:15 ` [PATCH v8 1/7] kvm: x86: abstract locking around pvclock_update_vm_gtod_copy Oliver Upton
2021-09-16 18:15 ` [PATCH v8 2/7] KVM: x86: extract KVM_GET_CLOCK/KVM_SET_CLOCK to separate functions Oliver Upton
2021-09-16 18:15 ` [PATCH v8 3/7] KVM: x86: Fix potential race in KVM_GET_CLOCK Oliver Upton
2021-09-29 13:33   ` Marcelo Tosatti
2021-09-16 18:15 ` [PATCH v8 4/7] KVM: x86: Report host tsc and realtime values " Oliver Upton
2021-09-28 18:53   ` Marcelo Tosatti
2021-09-29 11:20     ` Paolo Bonzini
2021-09-29 18:56   ` Marcelo Tosatti
2021-09-30 19:21     ` Marcelo Tosatti
2021-09-30 23:02       ` Thomas Gleixner
2021-10-01 12:05         ` Marcelo Tosatti
2021-10-01 12:10           ` Marcelo Tosatti [this message]
2021-10-01 19:59           ` Thomas Gleixner
2021-10-01 21:03             ` Oliver Upton
2021-10-01 14:17         ` Paolo Bonzini
2021-10-01 14:39   ` Paolo Bonzini
2021-10-01 14:41     ` Paolo Bonzini
2021-10-01 15:39       ` Oliver Upton
2021-10-01 16:42         ` Paolo Bonzini
2024-01-17 14:28   ` David Woodhouse
2024-07-24 22:24     ` [PATCH v8 4/7] KVM: x86: Report host tsc and realtime values in KVM_GET_CLOCK' Sean Christopherson
2024-07-25  8:24       ` David Woodhouse
2021-09-16 18:15 ` [PATCH v8 5/7] kvm: x86: protect masterclock with a seqcount Oliver Upton
2021-09-24 16:42   ` Paolo Bonzini
2021-09-30 17:51   ` Marcelo Tosatti
2021-10-01 16:48   ` Paolo Bonzini
2021-09-16 18:15 ` [PATCH v8 6/7] KVM: x86: Refactor tsc synchronization code Oliver Upton
2021-09-16 18:15 ` [PATCH v8 7/7] KVM: x86: Expose TSC offset controls to userspace Oliver Upton
2021-09-30 19:14   ` Marcelo Tosatti
2021-10-01  9:17     ` Paolo Bonzini
2021-10-01 10:32       ` Marcelo Tosatti
2021-10-01 15:12         ` Paolo Bonzini
2021-10-01 19:11           ` Marcelo Tosatti
2021-10-01 19:33             ` Oliver Upton
2021-10-04 14:30               ` Marcelo Tosatti
2021-10-04 11:44             ` Paolo Bonzini
2021-10-05 15:22   ` Sean Christopherson
2022-02-23 10:02   ` David Woodhouse
2021-09-24 16:43 ` [PATCH v8 0/7] KVM: x86: Add idempotent controls for migrating system counter state Paolo Bonzini
