From: Marcelo Tosatti <mtosatti@redhat.com>
To: Andy Lutomirski <luto@amacapital.net>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
kvm list <kvm@vger.kernel.org>, Radim Krcmar <rkrcmar@redhat.com>,
X86 ML <x86@kernel.org>
Subject: Re: kvmclock doesn't work, help?
Date: Fri, 18 Dec 2015 09:47:35 -0200 [thread overview]
Message-ID: <20151218114734.GA28306@amt.cnet> (raw)
In-Reply-To: <CALCETrViR0iTA4SgqQRQRokKAstQnMdW+cf1FHF7ZJ2vqW58sA@mail.gmail.com>
On Thu, Dec 17, 2015 at 05:12:59PM -0800, Andy Lutomirski wrote:
> On Thu, Dec 17, 2015 at 11:08 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > On Thu, Dec 17, 2015 at 08:33:17AM -0800, Andy Lutomirski wrote:
> >> On Wed, Dec 16, 2015 at 1:57 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> >> > On Wed, Dec 16, 2015 at 10:17:16AM -0800, Andy Lutomirski wrote:
> >> >> On Wed, Dec 16, 2015 at 9:48 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> >> >> > On Tue, Dec 15, 2015 at 12:42 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> >> >> >>
> >> >> >>
> >> >> >> On 14/12/2015 23:31, Andy Lutomirski wrote:
> >> >> >>> > RAW TSC NTP corrected TSC
> >> >> >>> > t0 10 10
> >> >> >>> > t1 20 19.99
> >> >> >>> > t2 30 29.98
> >> >> >>> > t3 40 39.97
> >> >> >>> > t4 50 49.96
> >
> > (1)
> >
> >> >> >>> >
> >> >> >>> > ...
> >> >> >>> >
> >> >> >>> > if you suddenly switch from RAW TSC to NTP corrected TSC,
> >> >> >>> > you can see what will happen.
> >> >> >>>
> >> >> >>> Sure, but why would you ever switch from one to the other?
> >> >> >>
> >> >> >> The guest uses the raw TSC and systemtime = 0 until suspend. After
> >> >> >> resume, the TSC certainly increases at the same rate as before, but the
> >> >> >> raw TSC restarted counting from 0 and systemtime has increased slower
> >> >> >> than the guest kvmclock.
> >> >> >
> >> >> > Wait, are we talking about the host's NTP or the guest's NTP?
> >> >> >
> >> >> > If it's the host's, then wouldn't systemtime be reset after resume to
> >> >> > the NTP corrected value? If so, the guest wouldn't see time go
> >> >> > backwards.
> >> >> >
> >> >> > If it's the guest's, then the guest's NTP correction is applied on top
> >> >> > of kvmclock, and this shouldn't matter.
> >> >> >
> >> >> > I still feel like I'm missing something very basic here.
> >> >> >
> >> >>
> >> >> OK, I think I get it.
> >> >>
> >> >> Marcelo, I thought that kvmclock was supposed to propagate the host's
> >> >> correction to the guest. If it did, indeed, propagate the correction
> >> >> then, after resume, the host's new system_time would match the guest's
> >> >> idea of it (after accounting for the guest's long nap), and I don't
> >> >> think there would be a problem.
> >> >> That being said, I can't find the code in the masterclock stuff that
> >> >> would actually do this.
> >> >
> >> > Guest clock is maintained by guest timekeeping code, which does:
> >> >
> >> > timer_interrupt()
> >> > offset = read clocksource since last timer interrupt
> >> > accumulate_to_systemclock(offset)
> >> >
> >> > The frequency correction of NTP in the host can be applied to
> >> > kvmclock, which will be visible to the guest
> >> > at "read clocksource since last timer interrupt"
> >> > (kvmclock_clocksource_read function).
> >>
> >> pvclock_clocksource_read? That seems to do the same thing as all the
> >> other clocksource access functions.
> >>
> >> >
> >> > This does not mean that the NTP correction in the host is propagated
> >> > to the guests system clock directly.
> >> >
> >> > (For example, the guest can run NTP which is free to do further
> >> > adjustments at "accumulate_to_systemclock(offset)" time).
> >>
> >> Of course. But I expected that, in the absence of NTP on the guest,
> >> that the guest would track the host's *corrected* time.
> >>
> >> >
> >> >> If, on the other hand, the host's NTP correction is not supposed to
> >> >> propagate to the guest,
> >> >
> >> > This is optional. There is a module option to control this, in fact.
> >> >
> >> > Its nice to have, because then you can execute a guest without NTP
> >> > (say without network connection), and have a kvmclock (kvmclock is a
> >> > clocksource, not a guest system clock) which is NTP corrected.
> >>
> >> Can you point to how this works? I found kvm_guest_time_update, whch
> >> is called under circumstances that I haven't untangled. I can't
> >> really tell what it's trying to do.
> >
> > Documentation/virtual/kvm/timekeeping.txt.
> >
>
> That document is really long. I skimmed it and found nothing.
kvm_guest_time_update is called when KVM_REQ_UPDATE_CLOCK is set.
This happens when:
- kvmclock is enabled or disabled by the guest.
- periodically to propagate NTP correction to kvmclock clock.
- guest vcpu switching between host pcpus when TSCs are out of sync.
- after migration.
- after savevm/loadvm.
> >> In any case, this still seems much more convoluted than it has to be.
> >> In the case in which the host has a stable TSC (tsc is selected in the
> >> core timekeeping code, VCLOCK_TSC is set, etc), which is basically all
> >> the time on the last few generations of CPUs, then the core
> >> timekeeping code is already exposing a linear function that's supposed
> >> to be used for monotonic, cpu-local access to a corrected nanosecond
> >> counter. It's even in pretty much exactly the right form to pass
> >> through to the guest via pvclock in the gtod data. Why doesn't KVM
> >> pass it through verbatim, updated in real time? Is there some legacy
> >> reason that KVM must apply its own corrections and has to jump through
> >> hoops to pause vcpus when updating those vcpu's copies of the pvclock
> >> data?
> >
> > Read the comment on x86.c which starts with
> > " *
> > * Assuming a stable TSC across physical CPUS, and a stable TSC
> > * across virtual CPUs, the following condition is possible.
> > * Each numbered line represents an event visible to both
> > * CPUs at the next numbered event.
> > "
>
> A couple things:
>
> 1. That says: timespec0 + (rdtsc - tsc0) < timespec0 + N + (rdtsc - (tsc0 + M))
>
> but that's wrong, I think. rdtsc is a function, not a number.
View it as a number, then its correct.
> Shouldn't it be:
>
> timespec0 + (rdtsc0 - tsc0) < timespec0 + N + (rdtsc1 - (tsc0 + M))
Think "rdtsc" is one number (rdtsc0 = rdtsc1).
> which is true iff rdtsc0 < rdtsc1 + N - M, which is equivalent to M <
> N + (rdtsc1 - rdtsc0)?
>
> That doesn't change the conclusion.
>
> In any case, I'm not arguing that the concept of a master copy is
> unnecessary; I'm arguing that the implementation, the calculations,
> and the machinations in the code are all very, very complicated. All
> that should be needed is to keep all of the vcpu pvti copies the same
> and to make sure that you can't ever have one vcpu see a new copy and
> then another vcpu see an old copy.
Yes, you can't allow two vcpus to see a different copy of the pvti
structure.
> You can do that by brute-force
> freezing all vcpus on an update (what happens now), or you could do it
> by just writing all of the copies at the same time from the same host
> cpu *while other vcpus are still running*.
Ok, can you do that and guarantee the first copy won't be seen by
other vcpus? I don't know how.
> For the best outcome, you could offer a pvclock protocol v3 in which
> there is literally just one pvti copy shared by all vcpus.
Sure, lets write a more formal proposal?
> >> >> then shouldn't KVM just update system_time on
> >> >> resume to whatever the guest would think it had (which I think would
> >> >> be equivalent to the host's CLOCK_MONOTONIC_RAW value, possibly
> >> >> shifted by some per-guest constant offset).
> >> >>
> >> >> --Andy
> >> >
> >> > Sure, you could add a correction to compensate and make sure
> >> > the guest clock does not see time backwards.
> >> >
> >>
> >> Could you help do that? You understand the code far better than I do.
> >
> > Sure, you have to save the guests view of time (system_time + scaled tsc
> > read) when suspending, and add an offset to get_kernel_ns() to
> > compensate the effect of (1) when resuming.
> >
> > Does that make sense?
>
> I think so.
>
> --Andy
>
> >
> >> As it stands, it simply doesn't work on any system that suspends and
> >> resumes (unless maybe the system has the upcoming Intel ART feature,
> >> and I have no clue when that'll show up).
> >>
> >> --Andy
>
>
>
> --
> Andy Lutomirski
> AMA Capital Management, LLC
next prev parent reply other threads:[~2015-12-18 19:21 UTC|newest]
Thread overview: 34+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-12-09 21:10 kvmclock doesn't work, help? Andy Lutomirski
2015-12-09 21:16 ` Paolo Bonzini
2015-12-09 21:49 ` Andy Lutomirski
2015-12-09 22:12 ` Paolo Bonzini
2015-12-09 22:27 ` Andy Lutomirski
2015-12-09 22:42 ` Paolo Bonzini
2015-12-09 22:43 ` Andy Lutomirski
2015-12-10 21:33 ` Marcelo Tosatti
2015-12-10 21:32 ` Marcelo Tosatti
2015-12-11 21:57 ` Andy Lutomirski
2015-12-11 23:48 ` Marcelo Tosatti
2015-12-14 18:07 ` Andy Lutomirski
2015-12-14 21:47 ` Marcelo Tosatti
2015-12-14 13:44 ` Paolo Bonzini
2015-12-14 22:00 ` Marcelo Tosatti
2015-12-14 22:31 ` Andy Lutomirski
2015-12-14 22:38 ` Marcelo Tosatti
2015-12-15 8:42 ` Paolo Bonzini
2015-12-16 17:48 ` Andy Lutomirski
2015-12-16 18:17 ` Andy Lutomirski
2015-12-16 21:57 ` Marcelo Tosatti
2015-12-17 16:33 ` Andy Lutomirski
2015-12-17 19:08 ` Marcelo Tosatti
2015-12-18 1:12 ` Andy Lutomirski
2015-12-18 11:47 ` Marcelo Tosatti [this message]
2015-12-18 19:27 ` Andy Lutomirski
2015-12-18 19:45 ` Marcelo Tosatti
2015-12-18 20:25 ` Andy Lutomirski
2015-12-18 21:49 ` Marcelo Tosatti
2015-12-21 22:49 ` Andy Lutomirski
2015-12-23 19:27 ` Marcelo Tosatti
2015-12-23 23:09 ` Andy Lutomirski
2015-12-19 1:16 ` John Stultz
2015-12-10 21:36 ` Marcelo Tosatti
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20151218114734.GA28306@amt.cnet \
--to=mtosatti@redhat.com \
--cc=kvm@vger.kernel.org \
--cc=luto@amacapital.net \
--cc=pbonzini@redhat.com \
--cc=rkrcmar@redhat.com \
--cc=x86@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).