Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader

From: Marcelo Tosatti <mtosatti@redhat.com>
To: Andy Lutomirski <luto@amacapital.net>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
	"xen-devel@lists.xenproject.org" <xen-devel@lists.xenproject.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	kvm list <kvm@vger.kernel.org>, Gleb Natapov <gleb@kernel.org>
Subject: Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
Date: Tue, 6 Jan 2015 16:45:12 -0200
Message-ID: <20150106184512.GA31263@amt.cnet>
In-Reply-To: <CALCETrXzJkxbUsVgPbKNpBdp32yf=0M=RfseX=u7Mg2Mmsz2VQ@mail.gmail.com>

On Tue, Jan 06, 2015 at 10:26:22AM -0800, Andy Lutomirski wrote:
> On Tue, Jan 6, 2015 at 10:13 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > On Tue, Jan 06, 2015 at 08:56:40AM -0800, Andy Lutomirski wrote:
> >> On Jan 6, 2015 4:01 AM, "Paolo Bonzini" <pbonzini@redhat.com> wrote:
> >> >
> >> >
> >> >
> >> > On 06/01/2015 09:42, Paolo Bonzini wrote:
> >> > > > > Still confused.  So we can freeze all vCPUs in the host, then update
> >> > > > > pvti 1, then resume vCPU 1, then update pvti 0?  In that case, we have
> >> > > > > a problem, because vCPU 1 can observe pvti 0 mid-update, and KVM
> >> > > > > doesn't increment the version pre-update, and we can return completely
> >> > > > > bogus results.
> >> > > > Yes.
> >> > > But then the getcpu test would fail (1->0).  Even if you have an ABA
> >> > > situation (1->0->1), it's okay because the pvti that is fetched is the
> >> > > one returned by the first getcpu.
> >> >
> >> > ... this case of partial update of pvti, which is caught by the version
> >> > field, is of course different from the other (extremely unlikely) that
> >> > Andy pointed out.  That is when the getcpus are done on the same vCPU,
> >> > but the rdtsc is another.
> >> >
> >> > That one can be fixed by rdtscp, like
> >> >
> >> > do {
> >> >     // get a consistent (pvti, v, tsc) tuple
> >> >     do {
> >> >         cpu = get_cpu();
> >> >         pvti = get_pvti(cpu);
> >> >         v = pvti->version & ~1;
> >> >         // also acts as rmb();
> >> >         rdtsc_barrier();
> >> >         tsc = rdtscp(&cpu1);
> >>
> >> Off-topic note: rdtscp doesn't need a barrier at all.  AIUI AMD
> >> specified it that way and both AMD and Intel implement it correctly.
> >> (rdtsc, on the other hand, definitely needs the barrier beforehand.)
> >>
> >> >         // control dependency, no need for rdtsc_barrier?
> >> >     } while(cpu != cpu1);
> >> >
> >> >     // ... compute nanoseconds from pvti and tsc ...
> >> >     rmb();
> >> > }   while(v != pvti->version);
> >>
> >> Still no good.  We can migrate a bunch of times so we see the same CPU
> >> all three times and *still* don't get a consistent read, unless we
> >> play nasty games with lots of version checks (I have a patch for that,
> >> but I don't like it very much).  The patch is here:
> >>
> >> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso_paranoia&id=a69754dc5ff33f5187162b5338854ad23dd7be8d
> >>
> >> but I don't like it.
> >>
> >> Thus far, I've been told unambiguously that a guest can't observe pvti
> >> while it's being written, and I think you're now telling me that this
> >> isn't true and that a guest *can* observe pvti while it's being
> >> written while the low bit of the version field is not set.  If so,
> >> this is rather strongly incompatible with the spec in the KVM docs.
> >>
> >> I don't suppose that you and Marcelo could agree on what the actual
> >> semantics that KVM provides are and could write it down in a way that
> >> people who haven't spent a long time staring at the request code
> >> understand?  And maybe you could even fix the implementation while
> >> you're at it if the implementation is, indeed, broken.  I have ugly
> >> patches to fix it here:
> >>
> >> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso_paranoia&id=3b718a050cba52563d831febc2e1ca184c02bac0
> >>
> >> but I'm not thrilled with them.
> >>
> >> --Andy
> >
> > I suppose that separating the version write from the rest of the pvclock
> > structure is sufficient, as that would guarantee the writes are not
> > reordered even with fast string REP MOVS.
> >
> > Thanks for catching this Andy!
> >
> 
> Don't you still need:
> 
> version++;
> write the rest;
> version++;
> 
> with possible smp_wmb() in there to keep the compiler from messing around?

Correct. Could just as well follow the protocol and use odd/even, which 
is what your patch does.
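
For reference, the odd/even protocol discussed above could look roughly
like the sketch below. It is a minimal illustration in portable C11 and
not the KVM implementation: struct demo_pvti, demo_pvti_update() and
demo_pvti_read() are hypothetical names, the structure is not the real
pvclock layout, and C11 fences stand in for smp_wmb()/rmb().

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for a per-vCPU time info record (not the real layout). */
struct demo_pvti {
    atomic_uint version;            /* odd while an update is in progress */
    _Atomic uint64_t tsc_timestamp;
    _Atomic uint64_t system_time;
};

/* Writer: version++ (odd), write the rest, version++ (even again). */
static void demo_pvti_update(struct demo_pvti *p, uint64_t tsc, uint64_t ns)
{
    unsigned v = atomic_load_explicit(&p->version, memory_order_relaxed);

    /* Mark the record as being updated. */
    atomic_store_explicit(&p->version, v + 1, memory_order_relaxed);
    /* Keep the payload stores from moving above the odd version store. */
    atomic_thread_fence(memory_order_release);

    atomic_store_explicit(&p->tsc_timestamp, tsc, memory_order_relaxed);
    atomic_store_explicit(&p->system_time, ns, memory_order_relaxed);

    /* Release store: the payload is visible before the even version is. */
    atomic_store_explicit(&p->version, v + 2, memory_order_release);
}

/* Reader: retry while the version is odd or changed during the read. */
static void demo_pvti_read(struct demo_pvti *p, uint64_t *tsc, uint64_t *ns)
{
    unsigned v;

    do {
        v = atomic_load_explicit(&p->version, memory_order_acquire);
        *tsc = atomic_load_explicit(&p->tsc_timestamp, memory_order_relaxed);
        *ns = atomic_load_explicit(&p->system_time, memory_order_relaxed);
        /* Keep the payload loads from moving below the version re-check. */
        atomic_thread_fence(memory_order_acquire);
    } while ((v & 1) ||
             v != atomic_load_explicit(&p->version, memory_order_relaxed));
}
```

With this shape a reader that overlaps an update either sees an odd
version or sees the version change between its two loads, and retries in
both cases. The sketch assumes a single writer per record.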

What is the point of the new flags bit, though?

> Also, if you do this, can you also make setting and clearing
> STABLE_BIT properly atomic across all vCPUs?  Or at least do something
> like setting it last and clearing it first on vCPU 0?

If the version "seqlock" works properly across vCPUs, why do you need
STABLE_BIT to be "properly atomic"?

Please define what you mean by "properly atomic".
