From mboxrd@z Thu Jan  1 00:00:00 1970
From: Marcelo Tosatti <mtosatti@redhat.com>
Subject: Re: KVM: x86: fix kvmclock write race
Date: Fri, 17 Apr 2015 21:12:56 -0300
Message-ID: <20150418001256.GA21386@amt.cnet>
References: <20150417233800.GA18714@amt.cnet>
 <CALCETrWRZENhH2O21CRqQ8fnFkSj7ff6RH4fQZTHf84AM5kd9g@mail.gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: QUOTED-PRINTABLE
Cc: kvm list <kvm@vger.kernel.org>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Radim =?utf-8?B?S3LEjW3DocWZ?= <rkrcmar@redhat.com>
To: Andy Lutomirski <luto@amacapital.net>
Return-path: <kvm-owner@vger.kernel.org>
Received: from mx1.redhat.com ([209.132.183.28]:40463 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1751679AbbDRANJ (ORCPT <rfc822;kvm@vger.kernel.org>);
	Fri, 17 Apr 2015 20:13:09 -0400
Content-Disposition: inline
In-Reply-To: <CALCETrWRZENhH2O21CRqQ8fnFkSj7ff6RH4fQZTHf84AM5kd9g@mail.gmail.com>
Sender: kvm-owner@vger.kernel.org
List-ID: <kvm.vger.kernel.org>

On Fri, Apr 17, 2015 at 05:04:29PM -0700, Andy Lutomirski wrote:
> On Fri, Apr 17, 2015 at 4:38 PM, Marcelo Tosatti <mtosatti@redhat.com=
> wrote:
> >
> > From: Radim Kr=C4=8Dm=C3=A1=C5=99 <rkrcmar@redhat.com>
> >
> > As noted by Andy Lutomirski, kvm does not follow the documented ver=
sion
> > protocol. Fix it.
> >
> > Note: this bug results in a race which can occur if the following t=
hree
> > conditions are met:
> >
> > 1) There is KVM guest time update (there is one every 5 minutes).
> >
> > 2) Which races with a thread in the guest in the following way:
> > The execution of these 29 instructions has to take at _least_
> > 2 seconds (rebalance interval is 1 second).
> >
> > lsl    %r9w,%esi
> > mov    %esi,%r8d
> > and    $0x3f,%esi
> > and    $0xfff,%r8d
> > test   $0xfc0,%r8d
> > jne    0xa12 <vread_pvclock+210>
> > shl    $0x6,%rsi
> > mov    -0xa01000(%rsi),%r10d
> > data32 xchg %ax,%ax
> > data32 xchg %ax,%ax
> > rdtsc
> > shl    $0x20,%rdx
> > mov    %eax,%eax
> > movsbl -0xa00fe4(%rsi),%ecx
> > or     %rax,%rdx
> > sub    -0xa00ff8(%rsi),%rdx
> > mov    -0xa00fe8(%rsi),%r11d
> > mov    %rdx,%rax
> > shl    %cl,%rax
> > test   %ecx,%ecx
> > js     0xa08 <vread_pvclock+200>
> > mov    %r11d,%edx
> > movzbl -0xa00fe3(%rsi),%ecx
> > mov    -0xa00ff0(%rsi),%r11
> > mul    %rdx
> > shrd   $0x20,%rdx,%rax
> > data32 xchg %ax,%ax
> > data32 xchg %ax,%ax
> > lsl    %r9w,%edx
> >
> > 3) Scheduler moves the task, while executing these 29 instructions,=
 to a
> > destination processor, then back to the source processor.
> >
> > 4) Source processor, after has been moved back from destination,
> > perceives data out of order as written by processor performing gues=
t
> > time update (item 1), with string mov.
> >
> > Given the rarity of this condition, and the fact it was never obser=
ved
> > or reported, reverting pvclock vsyscall on systems whose host is
> > susceptible to the race, seems unnecessary.
> >
> > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> >
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index cc2c759f69a3..8658599e0024 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -1658,12 +1658,24 @@ static int kvm_guest_time_update(struct kvm=
_vcpu *v)
> >                 &guest_hv_clock, sizeof(guest_hv_clock))))
> >                 return 0;
> >
> > -       /*
> > -        * The interface expects us to write an even number signali=
ng that the
> > -        * update is finished. Since the guest won't see the interm=
ediate
> > -        * state, we just increase by 2 at the end.
> > +       /* A guest can read other VCPU's kvmclock; specification sa=
ys that
> > +        * version is odd if data is being modified and even after =
it is
> > +        * consistent.
> > +        * We write three times to be sure.
> > +        *  1) update version to odd number
> > +        *  2) write modified data (version is still odd)
> > +        *  3) update version to even number
> > +        *
> > +        * TODO: optimize
> > +        *  - only two writes should be enough -- version is first
> > +        *  - the second write could update just version
>=20
> You're relying on lots of barely-defined behavior here, since I think
> that both copies could use fast string operations.  Those are
> explicitly unordered internally, so I think you really do need three
> writes.

The memory-ordering model respects the follow principles:
1. Stores within a single string operation may be executed out of order=
=2E
2. Stores from separate string operations (for example, stores from
consecutive string operations) do not execute
out of order. All the stores from an earlier string operation will
complete before any store from a later string
operation.

> Personally, if I wanted to optimize this (I'm not convinced it
> matters),=20

It does not matter.

> I'd add a write-a-single-word primitive and use that for the
> version.
>=20
> Anyway, I think this code looks okay as is.

Looking forward to see the patchset to have all vcpus reading
from vcpu0 pvclock area.