public inbox for kvm@vger.kernel.org
 help / color / mirror / Atom feed
From: Marcelo Tosatti <mtosatti@redhat.com>
To: Andy Lutomirski <luto@amacapital.net>
Cc: Radim Krcmar <rkrcmar@redhat.com>, kvm list <kvm@vger.kernel.org>,
	stable <stable@vger.kernel.org>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Rik van Riel <riel@redhat.com>
Subject: Re: x86: kvm: Revert "remove sched notifier for cross-cpu migrations"
Date: Thu, 26 Mar 2015 19:56:32 -0300	[thread overview]
Message-ID: <20150326225632.GA29770@amt.cnet> (raw)
In-Reply-To: <CALCETrUgqiianBfV=D4vJ-dht0m1bYvcBXyyPi00OiSwDTcEWg@mail.gmail.com>

On Thu, Mar 26, 2015 at 01:58:25PM -0700, Andy Lutomirski wrote:
> On Thu, Mar 26, 2015 at 1:31 PM, Radim Krcmar <rkrcmar@redhat.com> wrote:
> > 2015-03-26 11:51-0700, Andy Lutomirski:
> >> On Thu, Mar 26, 2015 at 4:29 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> >> > On Wed, Mar 25, 2015 at 04:22:03PM -0700, Andy Lutomirski wrote:
> >> >> Suppose we start out with all vcpus agreeing on their pvti and perfect
> >> >> invariant TSCs.  Now the host updates its frequency (due to NTP or
> >> >> whatever).  KVM updates vcpu 0's pvti.  Before KVM updates vcpu 1's
> >> >> pvti, guest code on vcpus 0 and 1 see synced TSCs but different pvti.
> >> >> They'll disagree on the time, and one of them will be ahead until vcpu
> >> >> 1's pvti gets updated.
> >> >
> >> > The masterclock scheme enforces the same system_timestamp/tsc_timestamp pairs
> >> > to be visible at one time, for all vcpus.
> >> >
> >> >
> >> >  * That is, when timespec0 != timespec1, M < N. Unfortunately that is
> >> >  * not
> >> >  * always the case (the difference between two distinct xtime instances
> >> >  * might be smaller then the difference between corresponding TSC reads,
> >> >  * when updating guest vcpus pvclock areas).
> >> >  *
> >> >  * To avoid that problem, do not allow visibility of distinct
> >> >  * system_timestamp/tsc_timestamp values simultaneously: use a master
> >> >  * copy of host monotonic time values. Update that master copy
> >> >  * in lockstep.
> >>
> >> Yuck.  So we have per cpu timing data, but the protocol is only usable
> >> for monotonic timing because we forcibly freeze all vcpus when we
> >> update the nominally per cpu data.
> >>
> >> The obvious guest implementations are still unnecessarily slow,
> >> though.  It would be nice if the guest could get away without using
> >> any getcpu operation at all.
> >>
> >> Even if we fixed the host to increment version as advertised, I think
> >> we can't avoid two getcpu ops.  We need one before rdtsc to figure out
> >> which pvti to look at,
> >
> > Yes.
> >
> >>                        and we need another to make sure that we were
> >> actually on that cpu at the time we did rdtsc.  (Rdtscp doesn't help
> >> -- we need to check version before rdtsc, and we don't know what
> >> version to check until we do a getcpu.).
> >
> > Exactly, reading cpuid after rdtsc doesn't do that though, we could have
> > migrated back between those reads.
> > rtdscp would allow us to check that we read tsc of pvti's cpu.
> > (It doesn't get rid of that first read.)
> >
> >>                                          The migration hook has the
> >> same issue -- we need to check the migration count, then confirm we're
> >> on that cpu, then check the migration count again, and we can't do
> >> that until we know what cpu we're on.
> >
> > True;  the revert has a bug -- we need to check cpuid for the second
> > time before rdtsc.  (Migration hook is there just because we don't know
> > which cpu executed rdtsc.)
> 
> One way or another, I'm planning on completely rewriting the vdso
> code.  An early draft is here:
> 
> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso&id=57ace6e6e032afc4faf7b9ec52f78a8e6642c980
> 
> but I can't finish it until the KVM side shakes out.
> 
> I think there are at least two ways that would work:
> 
> a) If KVM incremented version as advertised:

All for it.

> cpu = getcpu();
> pvti = pvti for cpu;
> 
> ver1 = pvti->version;
> check stable bit;
> rdtsc_barrier, rdtsc, read scale, shift, etc.
> if (getcpu() != cpu) retry;
> if (pvti->version != ver1) retry;
> 
> I think this is safe because, we're guaranteed that there was an
> interval (between the two version reads) in which the vcpu we think
> we're on was running and the kvmclock data was valid and marked
> stable, and we know that the tsc we read came from that interval.
> 
> Note: rdtscp isn't needed. If we're stable, is makes no difference
> which cpu's tsc we actually read.

Yes, can't see a problem with that.

> b) If version remains buggy but we use this migrations_from hack:

There is no reason for version to remain buggy.

> cpu = getcpu();
> pvti = pvti for cpu;
> m1 = pvti->migrations_from;
> barrier();
> 
> ver1 = pvti->version;
> check stable bit;
> rdtsc_barrier, rdtsc, read scale, shift, etc.
> if (getcpu() != cpu) retry;
> if (pvti->version != ver1) retry;  /* probably not really needed */
> 
> barrier();
> if (pvti->migrations_from != m1) retry;
> 
> This is just like (a), except that we're using a guest kernel hack to
> ensure that no one migrated off the vcpu during the version-protected
> critical section and that we were, in fact, on that vcpu at some point
> during that critical section.  Once we've ensured that we were on
> pvti's associated vcpu for the entire time we were reading it, then we
> are protected by the existing versioning in the host.
> 
> >
> >> If, on the other hand, we could rely on having all of these things in
> >> sync, then this complication goes away, and we go down from two getcpu
> >> ops to zero.
> >
> > (Yeah, we should look what are the drawbacks of doing it differently.)
> 
> If the versioning were fixed, I think we could almost get away with:
> 
> pvti = pvti for vcpu 0;
> 
> ver1 = pvti->version;
> check stable bit;
> rdtsc_barrier, rdtsc, read scale, shift, etc.
> if (pvti->version != ver1) retry;
> 
> This guarantees that the tsc came from an interval in which vcpu0's
> kvmclock was *marked* stable.  If vcpu0's kvmclock were genuinely
> stable in that interval, then we'd be fine, but there's a race window
> in which the kvmclock is *not* stable and vcpu 0 wasn't running.

What is that window again ? Have no objections against using vcpu0's
pvti (cacheline should be read-only 99.9% of time).

> Why doesn't KVM just update all of the kvmclock data at once?  

Because it has not been necessary -- updating kvmclock data on vcpu 
entry was the previous method, so that was reused.

> (For
> that matter, why is the pvti in guest memory at all?  Wouldn't this
> all be simpler if the kvmclock data were host-allocated so the host
> could write it directly and maybe even share it between guests?)

And use a 4K TLB entry for that kvmclock area rather than 
sharing one of kernel's 2MB (or 1GB) TLB entry?

> 
> --Andy

  parent reply	other threads:[~2015-03-26 22:56 UTC|newest]

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-03-23 23:21 x86: kvm: Revert "remove sched notifier for cross-cpu migrations" Marcelo Tosatti
2015-03-23 23:30 ` Andy Lutomirski
2015-03-24 15:34 ` Radim Krčmář
2015-03-24 22:33   ` Andy Lutomirski
2015-03-25 11:08     ` Radim Krčmář
2015-03-25 12:52       ` Radim Krčmář
2015-03-25 21:28         ` Marcelo Tosatti
2015-03-25 22:33           ` Andy Lutomirski
2015-03-25 22:41             ` Marcelo Tosatti
2015-03-25 22:48               ` Andy Lutomirski
2015-03-25 23:13                 ` Marcelo Tosatti
2015-03-25 23:22                   ` Andy Lutomirski
2015-03-26 11:29                     ` Marcelo Tosatti
2015-03-26 18:51                       ` Andy Lutomirski
2015-03-26 20:31                         ` Radim Krcmar
2015-03-26 20:58                           ` Andy Lutomirski
2015-03-26 22:22                             ` Andy Lutomirski
2015-03-26 22:56                             ` Marcelo Tosatti [this message]
2015-03-26 23:09                               ` Andy Lutomirski
2015-03-26 23:22                                 ` Marcelo Tosatti
2015-03-26 23:28                                   ` Andy Lutomirski
2015-03-26 23:38                                     ` Marcelo Tosatti
2015-03-26 18:47       ` Andy Lutomirski
2015-03-26 20:10         ` Radim Krčmář
2015-03-26 20:52           ` Paolo Bonzini
2015-03-24 22:59   ` Marcelo Tosatti
2015-03-25 11:09     ` Radim Krčmář
2015-03-25 13:06 ` Radim Krčmář
2015-03-26 20:59 ` Radim Krčmář
2015-03-26 22:22   ` Marcelo Tosatti
2015-03-26 22:24     ` Andy Lutomirski
2015-03-26 22:40       ` Marcelo Tosatti

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20150326225632.GA29770@amt.cnet \
    --to=mtosatti@redhat.com \
    --cc=kvm@vger.kernel.org \
    --cc=luto@amacapital.net \
    --cc=pbonzini@redhat.com \
    --cc=riel@redhat.com \
    --cc=rkrcmar@redhat.com \
    --cc=stable@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox