From: Andy Lutomirski <luto@amacapital.net>
To: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
"xen-devel@lists.xenproject.org" <xen-devel@lists.xenproject.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
kvm list <kvm@vger.kernel.org>, Gleb Natapov <gleb@kernel.org>
Subject: Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
Date: Thu, 26 Feb 2015 14:46:22 -0800
Message-ID: <CALCETrVcdmthWwkOKCFHqRwFSBR4EFXEGkJmkNp_88wwFRgBpg@mail.gmail.com>
In-Reply-To: <CALCETrWJtjTiSZu_pigcQUaATrFF-8g6v8hXkgCx==FYO=bFew@mail.gmail.com>
On Thu, Jan 8, 2015 at 2:43 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Thu, Jan 8, 2015 at 2:31 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
>> On Tue, Jan 06, 2015 at 11:49:09AM -0800, Andy Lutomirski wrote:
>>> On Tue, Jan 6, 2015 at 10:45 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
>>> > On Tue, Jan 06, 2015 at 10:26:22AM -0800, Andy Lutomirski wrote:
>>> >> On Tue, Jan 6, 2015 at 10:13 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
>>> >> > On Tue, Jan 06, 2015 at 08:56:40AM -0800, Andy Lutomirski wrote:
>>> >> >> On Jan 6, 2015 4:01 AM, "Paolo Bonzini" <pbonzini@redhat.com> wrote:
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > On 06/01/2015 09:42, Paolo Bonzini wrote:
>>> >> >> > > > > Still confused. So we can freeze all vCPUs in the host, then update
>>> >> >> > > > > pvti 1, then resume vCPU 1, then update pvti 0? In that case, we have
>>> >> >> > > > > a problem, because vCPU 1 can observe pvti 0 mid-update, and KVM
>>> >> >> > > > > doesn't increment the version pre-update, and we can return completely
>>> >> >> > > > > bogus results.
>>> >> >> > > > Yes.
>>> >> >> > > But then the getcpu test would fail (1->0). Even if you have an ABA
>>> >> >> > > situation (1->0->1), it's okay because the pvti that is fetched is the
>>> >> >> > > one returned by the first getcpu.
>>> >> >> >
>>> >> >> > ... this case of partial update of pvti, which is caught by the version
>>> >> >> > field, is of course different from the other (extremely unlikely) case that
>>> >> >> > Andy pointed out. That is when the getcpus are done on the same vCPU,
>>> >> >> > but the rdtsc is done on another.
>>> >> >> >
>>> >> >> > That one can be fixed by rdtscp, like
>>> >> >> >
>>> >> >> > do {
>>> >> >> > // get a consistent (pvti, v, tsc) tuple
>>> >> >> > do {
>>> >> >> > cpu = get_cpu();
>>> >> >> > pvti = get_pvti(cpu);
>>> >> >> > v = pvti->version & ~1;
>>> >> >> > // also acts as rmb();
>>> >> >> > rdtsc_barrier();
>>> >> >> > tsc = rdtscp(&cpu1);
>>> >> >>
>>> >> >> Off-topic note: rdtscp doesn't need a barrier at all. AIUI AMD
>>> >> >> specified it that way and both AMD and Intel implement it correctly.
>>> >> >> (rdtsc, on the other hand, definitely needs the barrier beforehand.)
>>> >> >>
>>> >> >> > // control dependency, no need for rdtsc_barrier?
>>> >> >> > } while(cpu != cpu1);
>>> >> >> >
>>> >> >> > // ... compute nanoseconds from pvti and tsc ...
>>> >> >> > rmb();
>>> >> >> > } while(v != pvti->version);
>>> >> >>
>>> >> >> Still no good. We can migrate a bunch of times so we see the same CPU
>>> >> >> all three times and *still* don't get a consistent read, unless we
>>> >> >> play nasty games with lots of version checks (I have a patch for that,
>>> >> >> but I don't like it very much). The patch is here:
>>> >> >>
>>> >> >> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso_paranoia&id=a69754dc5ff33f5187162b5338854ad23dd7be8d
>>> >> >>
>>> >> >> but I don't like it.
>>> >> >>
>>> >> >> Thus far, I've been told unambiguously that a guest can't observe pvti
>>> >> >> while it's being written, and I think you're now telling me that this
>>> >> >> isn't true and that a guest *can* observe pvti while it's being
>>> >> >> written while the low bit of the version field is not set. If so,
>>> >> >> this is rather strongly incompatible with the spec in the KVM docs.
>>> >> >>
>>> >> >> I don't suppose that you and Marcelo could agree on what the actual
>>> >> >> semantics that KVM provides are and could write it down in a way that
>>> >> >> people who haven't spent a long time staring at the request code
>>> >> >> understand? And maybe you could even fix the implementation while
>>> >> >> you're at it if the implementation is, indeed, broken. I have ugly
>>> >> >> patches to fix it here:
>>> >> >>
>>> >> >> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso_paranoia&id=3b718a050cba52563d831febc2e1ca184c02bac0
>>> >> >>
>>> >> >> but I'm not thrilled with them.
>>> >> >>
>>> >> >> --Andy
>>> >> >
>>> >> > I suppose that separating the version write from the rest of the pvclock
>>> >> > structure is sufficient, as that would guarantee the writes are not
>>> >> > reordered even with fast string REP MOVS.
>>> >> >
>>> >> > Thanks for catching this Andy!
>>> >> >
>>> >>
>>> >> Don't you still need:
>>> >>
>>> >> version++;
>>> >> write the rest;
>>> >> version++;
>>> >>
>>> >> with possible smp_wmb() in there to keep the compiler from messing around?
>>> >
>>> > Correct. Could just as well follow the protocol and use odd/even, which
>>> > is what your patch does.
>>> >
>>> > What is the point with the new flags bit though?
>>>
>>> To try to work around the problem on old hosts. I'm not at all
>>> convinced that this is worthwhile or that it helps, though.
>>
>> Andy,
>>
>> Are you going to submit the fix or should I?
>>
>
> I'd prefer if you did it. I'm not familiar enough with the KVM memory
> management stuff to do it confidently. Feel free to mooch from my
> patch if it's helpful.
Any update here? I can try it myself if no one else wants to do it.
--Andy
>
> --Andy
>
> --
> Andy Lutomirski
> AMA Capital Management, LLC
--
Andy Lutomirski
AMA Capital Management, LLC
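
For readers following along, the "version++; write the rest; version++" protocol discussed above can be sketched in portable C11. This is only an illustration under stated assumptions, not the KVM or vdso code: struct demo_pvti, demo_pvti_update() and demo_pvti_read() are hypothetical names, the record has just two payload fields, and C11 fences stand in for the kernel's smp_wmb()/rmb().

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a pvclock time-info record. */
struct demo_pvti {
    _Atomic uint32_t version;        /* odd while an update is in progress */
    _Atomic uint64_t tsc_timestamp;
    _Atomic uint64_t system_time;
};

/* Writer side: version++, write the rest, version++. */
void demo_pvti_update(struct demo_pvti *p, uint64_t tsc, uint64_t ns)
{
    uint32_t v = atomic_load_explicit(&p->version, memory_order_relaxed);

    /* Mark the record as being updated (version becomes odd). */
    atomic_store_explicit(&p->version, v + 1, memory_order_relaxed);
    /* Plays the role of smp_wmb(): the odd version is visible first. */
    atomic_thread_fence(memory_order_release);

    atomic_store_explicit(&p->tsc_timestamp, tsc, memory_order_relaxed);
    atomic_store_explicit(&p->system_time, ns, memory_order_relaxed);

    /* Publish: the payload is visible before the even version. */
    atomic_store_explicit(&p->version, v + 2, memory_order_release);
}

/* Reader side: retry until an even, unchanged version brackets the reads. */
void demo_pvti_read(struct demo_pvti *p, uint64_t *tsc, uint64_t *ns)
{
    uint32_t v0, v1;

    do {
        v0 = atomic_load_explicit(&p->version, memory_order_acquire);
        *tsc = atomic_load_explicit(&p->tsc_timestamp, memory_order_relaxed);
        *ns = atomic_load_explicit(&p->system_time, memory_order_relaxed);
        /* Plays the role of rmb(): payload reads finish before the re-check. */
        atomic_thread_fence(memory_order_acquire);
        v1 = atomic_load_explicit(&p->version, memory_order_relaxed);
    } while ((v0 & 1) || v0 != v1);
}
```

The point of the odd intermediate value is that a reader that lands in the middle of an update either sees an odd version or sees the version change between its two loads, and retries in both cases. The sketch says nothing about the vCPU migration problem from earlier in the thread, which needs handling on top of this.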
Thread overview: 39+ messages
2014-12-23 0:39 [RFC 0/2] x86, vdso, pvclock: Cleanups and speedups Andy Lutomirski
2014-12-23 0:39 ` [RFC 1/2] x86, vdso: Use asm volatile in __getcpu Andy Lutomirski
2014-12-23 0:39 ` [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader Andy Lutomirski
2014-12-23 10:28 ` [Xen-devel] " David Vrabel
2014-12-23 15:14 ` Boris Ostrovsky
2014-12-23 15:14 ` Paolo Bonzini
2014-12-23 15:25 ` Boris Ostrovsky
2014-12-24 21:30 ` David Matlack
2014-12-24 21:43 ` Andy Lutomirski
2015-01-05 15:25 ` Marcelo Tosatti
2015-01-05 18:56 ` Andy Lutomirski
2015-01-05 19:17 ` Marcelo Tosatti
2015-01-05 22:38 ` Andy Lutomirski
2015-01-05 22:48 ` Marcelo Tosatti
2015-01-05 22:53 ` Andy Lutomirski
2015-01-06 8:42 ` Paolo Bonzini
2015-01-06 12:01 ` Paolo Bonzini
2015-01-06 16:56 ` Andy Lutomirski
2015-01-06 18:13 ` Marcelo Tosatti
2015-01-06 18:26 ` Andy Lutomirski
2015-01-06 18:45 ` Marcelo Tosatti
2015-01-06 19:49 ` Andy Lutomirski
2015-01-06 20:20 ` Marcelo Tosatti
2015-01-06 21:54 ` Andy Lutomirski
2015-01-08 22:31 ` Marcelo Tosatti
2015-01-08 22:43 ` Andy Lutomirski
2015-02-26 22:46 ` Andy Lutomirski [this message]
2015-01-07 5:41 ` Paolo Bonzini
2015-01-07 5:38 ` Paolo Bonzini
2015-01-07 7:18 ` Andy Lutomirski
2015-01-07 9:00 ` Paolo Bonzini
2015-01-07 14:45 ` Marcelo Tosatti
2015-01-06 8:39 ` Paolo Bonzini
2015-01-05 22:23 ` Paolo Bonzini
2015-01-06 14:35 ` [Xen-devel] " Konrad Rzeszutek Wilk
2015-01-08 12:51 ` David Vrabel
2014-12-23 7:21 ` [RFC 0/2] x86, vdso, pvclock: Cleanups and speedups Paolo Bonzini
2014-12-23 8:16 ` Andy Lutomirski
2014-12-23 8:30 ` Paolo Bonzini