From mboxrd@z Thu Jan 1 00:00:00 1970
From: Marcelo Tosatti
Subject: KVM: x86: fix kvmclock write race (v2)
Date: Fri, 17 Apr 2015 21:23:10 -0300
Message-ID: <20150418002310.GA23898@amt.cnet>
References: <20150417233800.GA18714@amt.cnet> <20150418002010.GA23227@amt.cnet>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Cc: kvm list, Paolo Bonzini, Radim Krčmář
To: Andy Lutomirski
Return-path: Received: from mx1.redhat.com ([209.132.183.28]:54090 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751767AbbDRAX0 (ORCPT); Fri, 17 Apr 2015 20:23:26 -0400
Content-Disposition: inline
In-Reply-To: <20150418002010.GA23227@amt.cnet>
Sender: kvm-owner@vger.kernel.org
List-ID:

From: Radim Krčmář

As noted by Andy Lutomirski, KVM does not follow the documented version
protocol. Fix it.

Note: this bug results in a race which can occur if the following three
conditions are met:

1) There is a KVM guest time update (there is one every 5 minutes).

2) It races with a thread in the guest in the following way: the
execution of these 29 instructions has to take at _least_ 2 seconds
(the rebalance interval is 1 second).

	lsl    %r9w,%esi
	mov    %esi,%r8d
	and    $0x3f,%esi
	and    $0xfff,%r8d
	test   $0xfc0,%r8d
	jne    0xa12
	shl    $0x6,%rsi
	mov    -0xa01000(%rsi),%r10d
	data32 xchg %ax,%ax
	data32 xchg %ax,%ax
	rdtsc
	shl    $0x20,%rdx
	mov    %eax,%eax
	movsbl -0xa00fe4(%rsi),%ecx
	or     %rax,%rdx
	sub    -0xa00ff8(%rsi),%rdx
	mov    -0xa00fe8(%rsi),%r11d
	mov    %rdx,%rax
	shl    %cl,%rax
	test   %ecx,%ecx
	js     0xa08
	mov    %r11d,%edx
	movzbl -0xa00fe3(%rsi),%ecx
	mov    -0xa00ff0(%rsi),%r11
	mul    %rdx
	shrd   $0x20,%rdx,%rax
	data32 xchg %ax,%ax
	data32 xchg %ax,%ax
	lsl    %r9w,%edx

3) The scheduler moves the task, while it is executing these 29
instructions, to a destination processor, then back to the source
processor.
4) The source processor, after the task has been moved back from the
destination, perceives out of order the data written (with a string
move) by the processor performing the guest time update (item 1).

Given the rarity of this condition, and the fact that it was never
observed or reported, reverting pvclock vsyscall on systems whose host
is susceptible to the race seems an excessive measure.

Signed-off-by: Marcelo Tosatti
Cc: stable@kernel.org

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cc2c759..e75fafd 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1658,12 +1658,20 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 				&guest_hv_clock, sizeof(guest_hv_clock))))
 		return 0;
 
-	/*
-	 * The interface expects us to write an even number signaling that the
-	 * update is finished. Since the guest won't see the intermediate
-	 * state, we just increase by 2 at the end.
+	/* A guest can read other VCPU's kvmclock; specification says that
+	 * version is odd if data is being modified and even after it is
+	 * consistent.
+	 * We write three times to be sure.
+	 *  1) update version to odd number
+	 *  2) write modified data (version is still odd)
+	 *  3) update version to even number
 	 */
-	vcpu->hv_clock.version = guest_hv_clock.version + 2;
+	guest_hv_clock.version += 1;
+	kvm_write_guest_cached(v->kvm, &vcpu->pv_time,
+				&guest_hv_clock,
+				sizeof(guest_hv_clock));
+
+	vcpu->hv_clock.version = guest_hv_clock.version;
 
 	/* retain PVCLOCK_GUEST_STOPPED if set in guest copy */
 	pvclock_flags = (guest_hv_clock.flags & PVCLOCK_GUEST_STOPPED);
@@ -1684,6 +1692,11 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 	kvm_write_guest_cached(v->kvm, &vcpu->pv_time,
 				&vcpu->hv_clock,
 				sizeof(vcpu->hv_clock));
+
+	vcpu->hv_clock.version += 1;
+	kvm_write_guest_cached(v->kvm, &vcpu->pv_time,
+				&vcpu->hv_clock,
+				sizeof(vcpu->hv_clock));
 	return 0;
 }
 