From: Marcelo Tosatti
Subject: Re: [RFC PATCH 0/2] kvmclock: fix ABI breakage from PVCLOCK_COUNTS_FROM_ZERO.
Date: Mon, 21 Sep 2015 12:52:24 -0300
Message-ID: <20150921155224.GA12938@amt.cnet>
References: <1442591670-5216-1-git-send-email-rkrcmar@redhat.com> <20150920225742.GA27666@amt.cnet> <20150921151209.GA2734@potion.brq.redhat.com>
In-Reply-To: <20150921151209.GA2734@potion.brq.redhat.com>
To: Radim Krčmář
Cc: linux-kernel@vger.kernel.org, kvm@vger.kernel.org, Paolo Bonzini, Luiz Capitulino

On Mon, Sep 21, 2015 at 05:12:10PM +0200, Radim Krčmář wrote:
> 2015-09-20 19:57-0300, Marcelo Tosatti:
> > On Fri, Sep 18, 2015 at 05:54:28PM +0200, Radim Krčmář wrote:
> >> This patch series will be disabling PVCLOCK_COUNTS_FROM_ZERO flag and is
> >> RFC because I haven't explored many potential problems or tested it.
> >
> > The justification to disable PVCLOCK_COUNTS_FROM_ZERO is because you
> > haven't explored potential problems or tested it? Sorry can't parse it.
> >
> >>
> >> [1/2] uses a different algorithm in the guest to start counting from 0.
> >> [2/2] stops exposing PVCLOCK_COUNTS_FROM_ZERO in the hypervisor.
> >>
> >> A viable alternative would be to implement opt-in features in kvm clock.
> >>
> >> And because we probably only broke one old user (the infamous SLES 10), a
> >> workaround like this is also possible: (but I'd rather not do that)
> >
> > Please describe why SLES 10 breaks in detail: the state of the guest and
> > the host before the patch, the state of the guest and host after the
> > patch.
>
> 1) The guest periodically receives an interrupt that is handled by
>    main_timer_handler():
>    a) get time using the kvm clock:
>       1) write the address to MSR_KVM_SYSTEM_TIME
>       2) read tsc and pvclock (tsc_offset, system_time)
>       3) time = tsc - tsc_offset + system_time
>    b) compute time since the last main_timer_handler()
>    c) bump jiffies if enough time has elapsed
> 2) the guest wants to calibrate loops per jiffy [1]:
>    a) read tsc
>    b) loop till jiffies increase
>    c) compute lpj
>
> Because (1a1) always resets the system_time to 0, we read the same value
> over and over so the condition for (1c) is never true and jiffies remain
> constant.  This is the problem.  A hang happens in (2b) as it is the
> first place that depends on jiffies.
>
> > What does SLES10 expect?
>
> That a write to MSR_KVM_SYSTEM_TIME does not reset the system time.
>
> > Is it counting from zero that breaks SLES10?
>
> Not by itself, treating MSR_KVM_SYSTEM_TIME as one-shot initializer did.
> The guest wants to write to MSR_KVM_SYSTEM_TIME as much as it likes to,
> while still keeping system time;  we used to allow that, which means an
> ABI breakage.  (And we can't even say that guest's behaviour is against
> the spec ...)

Because this behaviour was not defined.

Can't you just condition PVCLOCK_COUNTS_FROM_ZERO behaviour on
boot_vcpu_runs_old_kvmclock == false?
The patch would be much simpler.

The problem is, "selecting one read as the initial point" is inherently
racy: that delta is relative to one moment (kvmclock read) at one vcpu,
but must be applied to all vcpus.
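
To make the concern about a single boot-time read concrete, here is a toy
model of one way it could go wrong. It is only a sketch under stated
assumptions: the per-vcpu skew, the helper names kvmclock_read() and
from_zero(), and all numbers are invented for illustration and are not
taken from the patch or from the kernel.

```python
# Toy model: per-vcpu kvmclock readings that are not perfectly synchronized.
# Every name and number below is hypothetical.
U64_MASK = (1 << 64) - 1


def kvmclock_read(host_ns, vcpu_skew_ns):
    """Raw clock as seen on one vcpu: host time plus a per-vcpu skew."""
    return host_ns + vcpu_skew_ns


def from_zero(raw_ns, delta_ns):
    """Guest-side 'count from zero': subtract the boot delta as a u64."""
    return (raw_ns - delta_ns) & U64_MASK


# Assumption: vcpu1 lags vcpu0 by 500 ns.
skew = {0: 0, 1: -500}

boot_host_ns = 1_000_000
# The zero point is taken from a single read on vcpu0.
delta = kvmclock_read(boot_host_ns, skew[0])

# 100 ns later both vcpus read the clock and subtract the same delta.
later_host_ns = boot_host_ns + 100
t0 = from_zero(kvmclock_read(later_host_ns, skew[0]), delta)
t1 = from_zero(kvmclock_read(later_host_ns, skew[1]), delta)

print("vcpu0:", t0)
print("vcpu1:", t1)
```

In this model vcpu0 sees a small positive value, while vcpu1 sees a raw
reading below the delta, so the unsigned subtraction wraps around. The
zero point is only meaningful on the vcpu where it was read.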

Besides:

1) Stable sched clock in guest does not depend on
   KVM_FEATURE_CLOCKSOURCE_STABLE_BIT.
2) You rely on monotonicity across vcpus to perform
   the 'minus delta that was read on vcpu0' calculation, but
   monotonicity across vcpus can fail during runtime
   (say host clocksource goes tsc->hpet due to tsc instability).

>
>
> ---
> 1: I also did disassembly, but the reproducer is easier to paste
>    (couldn't find debuginfo)
>    # qemu-kvm -nographic -kernel vmlinuz-2.6.16.60-0.85.1-default \
>       -serial stdio -monitor /dev/null -append 'console=ttyS0'
>
>    and you can get a bit further when setting loops per jiffy manually,
>       -serial stdio -monitor /dev/null -append 'console=ttyS0 lpj=12345678'
>
>    The dmesg for failing run is
>      Initializing CPU#0
>      PID hash table entries: 512 (order: 9, 16384 bytes)
>      kvm-clock: cpu 0, msr 0:3f6041, boot clock
>      kvm_get_tsc_khz: cpu 0, msr 0:e001
>      time.c: Using tsc for timekeeping HZ 250
>      time.c: Using 100.000000 MHz WALL KVM GTOD KVM timer.
>      time.c: Detected 2591.580 MHz processor.
>      Console: colour VGA+ 80x25
>      Dentry cache hash table entries: 16384 (order: 5, 131072 bytes)
>      Inode-cache hash table entries: 8192 (order: 4, 65536 bytes)
>      Checking aperture...
>      Nosave address range: 000000000009f000 - 00000000000a0000
>      Nosave address range: 00000000000a0000 - 00000000000f0000
>      Nosave address range: 00000000000f0000 - 0000000000100000
>      Memory: 124884k/130944k available (1856k kernel code, 5544k reserved, 812k data, 188k init)
>      [Infinitely querying kvm clock here ...]
>
>    With '-cpu kvm64,-kvmclock', the next line is
>      Calibrating delay using timer specific routine.. 5199.75 BogoMIPS (lpj=10399519)
>
>    With 'lpj=10399519',
>      Calibrating delay loop (skipped)... 5199.75 BogoMIPS preset
>      [Manages to get stuck later, in default_idle.]
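
The hang described in the quoted steps (1a) to (1c) can be reproduced in a
small simulation. This is a minimal sketch, not the SLES 10 or KVM code:
the Host class, run(), and the simplification that one tsc tick equals
one nanosecond are assumptions made for illustration. The tick length is
derived from the "HZ 250" line in the dmesg.

```python
TICK_NS = 4_000_000  # one jiffy at HZ=250, assuming 1 tsc tick == 1 ns


class Host:
    """Hypothetical hypervisor side: owns the tsc and the pvclock fields."""

    def __init__(self, reset_on_msr_write):
        self.reset_on_msr_write = reset_on_msr_write
        self.tsc = 0
        self.base_tsc = 0      # tsc value that corresponds to system time 0
        self.tsc_offset = 0    # pvclock: tsc when system_time was sampled
        self.system_time = 0   # pvclock: system time at that sample

    def write_system_time_msr(self):
        """Guest wrote MSR_KVM_SYSTEM_TIME; refresh the pvclock fields."""
        if self.reset_on_msr_write:
            # One-shot initializer behaviour: system time restarts at 0.
            self.base_tsc = self.tsc
        self.tsc_offset = self.tsc
        self.system_time = self.tsc - self.base_tsc


def run(reset_on_msr_write, interrupts=10):
    """Simulate the guest timer handler; return the final jiffies count."""
    host = Host(reset_on_msr_write)
    jiffies = 0
    last = 0
    for _ in range(interrupts):
        host.tsc += TICK_NS                      # time passes between irqs
        host.write_system_time_msr()             # step (1a1)
        now = host.tsc - host.tsc_offset + host.system_time  # (1a2), (1a3)
        if now - last >= TICK_NS:                # steps (1b), (1c)
            jiffies += 1
            last = now
    return jiffies


print("write resets system time:", run(True))
print("write keeps system time: ", run(False))
```

When the MSR write resets system time, every read returns the same value
and jiffies never move, which is why a loop waiting for jiffies to
increase, as in (2b), would spin forever. When the write keeps system
time, jiffies advance once per simulated interrupt.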