From: Marcelo Tosatti <mtosatti@redhat.com>
To: Andy Lutomirski <luto@amacapital.net>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
"xen-devel@lists.xenproject.org" <xen-devel@lists.xenproject.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
kvm list <kvm@vger.kernel.org>, Gleb Natapov <gleb@kernel.org>
Subject: Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
Date: Wed, 7 Jan 2015 12:45:03 -0200 [thread overview]
Message-ID: <20150107144503.GA16114@amt.cnet> (raw)
In-Reply-To: <CALCETrXGGV9R_4Knz_0LTaz3X2PS1KJDdBfq_RHAVquKdUjrYQ@mail.gmail.com>
On Tue, Jan 06, 2015 at 11:18:21PM -0800, Andy Lutomirski wrote:
> On Tue, Jan 6, 2015 at 9:38 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> >
> >
> > On 06/01/2015 17:56, Andy Lutomirski wrote:
> >> Still no good. We can migrate a bunch of times so we see the same CPU
> >> all three times
> >
> > There are no three times. The CPU you see here:
> >
> >>>
> >>>
> >>> // ... compute nanoseconds from pvti and tsc ...
> >>> rmb();
> >>> } while(v != pvti->version);
> >
> > is the same you read here:
> >
> >>> cpu = get_cpu();
> >
> > The algorithm is:
>
> I still don't see why this is safe, and I think that the issue is that
> you left out part of the loop.
>
> >
> > 1) get a consistent (cpu, version, tsc)
> >
> > 1.a) get cpu
>
> Suppose we observe cpu 0.
>
> > 1.b) get pvti[cpu]->version, ignoring low bit
>
> Missing step, presumably here: read pvti[cpu]->tsc_timestamp, scale,
> etc. This could all execute on vCPU 1. We could read values that are
> inconsistent with each other.
>
> > 1.c) get (tsc, cpu)
>
> Now we could end up back on vCPU 0.
>
> > 1.d) if cpu from 1.a and 1.c do not match, loop
> > 1.e) if pvti[cpu] was being updated, we'll loop later
> >
> > 2) compute nanoseconds from pvti[cpu] and tsc
> >
> > 3) if pvti[cpu] changed under our feet during (2), i.e. version doesn't
> > match, retry.
> >
> > As long as the CPU is consistent between get_cpu() and rdtscp(), there
> > is no problem with migration, because pvti is always accessed for that
> > one CPU. If there were any problem, it would be caught by the version
> > check. Writing it down with two nested do...whiles makes it clearer IMHO.
>
> Why exactly would it be caught by the version check?
>
> My ugly patch tries to make the argument that, at any point at which
> we observe ourselves to be on a given vCPU, that vCPU won't be
> updating pvti. That means that, if version doesn't change between two
> consecutive observations that we're on that vCPU, then we're okay.
> This IMO sucks. It's fragile, it's hard to make a coherent argument
> about correctness, and it requires at least two getcpu-like operations
> to read the time. Those operations are *slow*. One is much better
> than two, and zero is much better than one.
>
> >
> >> and *still* don't get a consistent read, unless we
> >> play nasty games with lots of version checks (I have a patch for that,
> >> but I don't like it very much). The patch is here:
> >>
> >> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso_paranoia&id=a69754dc5ff33f5187162b5338854ad23dd7be8d
> >>
> >> but I don't like it.
> >>
> >> Thus far, I've been told unambiguously that a guest can't observe pvti
> >> while it's being written, and I think you're now telling me that this
> >> isn't true and that a guest *can* observe pvti while it's being
> >> written while the low bit of the version field is not set. If so,
> >> this is rather strongly incompatible with the spec in the KVM docs.
> >
> > Where am I saying that?
>
> I thought the conclusion from what you and Marcelo pointed out about
> the code was that, once the first vCPU updated its pvti, it could
> start running guest code while the other vCPUs are still updating
> pvti, so its guest code can observe the other vCPUs mid-update.
>
> >> Also, if you do this, can you also make setting and clearing
> >> STABLE_BIT properly atomic across all vCPUs? Or at least do something
> >> like setting it last and clearing it first on vPCU 0?
> >
> > That would be nice if you want to make the pvclock area fit in a single
> > page. However, it would have to be a separate flag bit, or a separate
> > CPUID feature.
>
> It would be nice to have. Although I think that fixing the host to
> increment version pre-update and post-update may actually be good
> enough. Is there any case in which it would fail in practice if we
> made that fix and always looked at pvti 0?
TSC_STABLE_BIT -> ~TSC_STABLE_BIT transition steps would finish but
still allow VCPU-1 to use stale values from VCPU-0.
To fix, do one of the following:
1) Check validity of local TSC_STABLE_BIT in addition (slow).
2) Perform update of VCPU-0 pvclock area before allowing
any other VCPU to VM-entry.
>
> --Andy
>
> >
> > Paolo
>
>
>
> --
> Andy Lutomirski
> AMA Capital Management, LLC
next prev parent reply other threads:[~2015-01-07 14:45 UTC|newest]
Thread overview: 39+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-12-23 0:39 [RFC 0/2] x86, vdso, pvclock: Cleanups and speedups Andy Lutomirski
2014-12-23 0:39 ` [RFC 1/2] x86, vdso: Use asm volatile in __getcpu Andy Lutomirski
2014-12-23 0:39 ` [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader Andy Lutomirski
2014-12-23 10:28 ` [Xen-devel] " David Vrabel
2014-12-23 15:14 ` Boris Ostrovsky
2014-12-23 15:14 ` Paolo Bonzini
2014-12-23 15:25 ` Boris Ostrovsky
2014-12-24 21:30 ` David Matlack
2014-12-24 21:43 ` Andy Lutomirski
2015-01-05 15:25 ` Marcelo Tosatti
2015-01-05 18:56 ` Andy Lutomirski
2015-01-05 19:17 ` Marcelo Tosatti
2015-01-05 22:38 ` Andy Lutomirski
2015-01-05 22:48 ` Marcelo Tosatti
2015-01-05 22:53 ` Andy Lutomirski
2015-01-06 8:42 ` Paolo Bonzini
2015-01-06 12:01 ` Paolo Bonzini
2015-01-06 16:56 ` Andy Lutomirski
2015-01-06 18:13 ` Marcelo Tosatti
2015-01-06 18:26 ` Andy Lutomirski
2015-01-06 18:45 ` Marcelo Tosatti
2015-01-06 19:49 ` Andy Lutomirski
2015-01-06 20:20 ` Marcelo Tosatti
2015-01-06 21:54 ` Andy Lutomirski
2015-01-08 22:31 ` Marcelo Tosatti
2015-01-08 22:43 ` Andy Lutomirski
2015-02-26 22:46 ` Andy Lutomirski
2015-01-07 5:41 ` Paolo Bonzini
2015-01-07 5:38 ` Paolo Bonzini
2015-01-07 7:18 ` Andy Lutomirski
2015-01-07 9:00 ` Paolo Bonzini
2015-01-07 14:45 ` Marcelo Tosatti [this message]
2015-01-06 8:39 ` Paolo Bonzini
2015-01-05 22:23 ` Paolo Bonzini
2015-01-06 14:35 ` [Xen-devel] " Konrad Rzeszutek Wilk
2015-01-08 12:51 ` David Vrabel
2014-12-23 7:21 ` [RFC 0/2] x86, vdso, pvclock: Cleanups and speedups Paolo Bonzini
2014-12-23 8:16 ` Andy Lutomirski
2014-12-23 8:30 ` Paolo Bonzini
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20150107144503.GA16114@amt.cnet \
--to=mtosatti@redhat.com \
--cc=gleb@kernel.org \
--cc=kvm@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=luto@amacapital.net \
--cc=pbonzini@redhat.com \
--cc=xen-devel@lists.xenproject.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).