From mboxrd@z Thu Jan 1 00:00:00 1970 From: Marcelo Tosatti Subject: Re: [RFC PATCH 0/2] kvmclock: fix ABI breakage from PVCLOCK_COUNTS_FROM_ZERO. Date: Mon, 21 Sep 2015 19:37:14 -0300 Message-ID: <20150921223713.GA21139@amt.cnet> References: <1442591670-5216-1-git-send-email-rkrcmar@redhat.com> <20150920225742.GA27666@amt.cnet> <20150921151209.GA2734@potion.brq.redhat.com> <20150921155224.GA12938@amt.cnet> <20150921200026.GB2734@potion.brq.redhat.com> <20150921205312.GA17800@amt.cnet> <20150921220039.GC2734@potion.brq.redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: QUOTED-PRINTABLE Cc: linux-kernel@vger.kernel.org, kvm@vger.kernel.org, Paolo Bonzini , Luiz Capitulino To: Radim =?utf-8?B?S3LEjW3DocWZ?= Return-path: Content-Disposition: inline In-Reply-To: <20150921220039.GC2734@potion.brq.redhat.com> Sender: linux-kernel-owner@vger.kernel.org List-Id: kvm.vger.kernel.org On Tue, Sep 22, 2015 at 12:00:39AM +0200, Radim Kr=C4=8Dm=C3=A1=C5=99 w= rote: > 2015-09-21 17:53-0300, Marcelo Tosatti: > > On Mon, Sep 21, 2015 at 10:00:27PM +0200, Radim Kr=C4=8Dm=C3=A1=C5=99= wrote: > >> 2015-09-21 12:52-0300, Marcelo Tosatti: > >> > On Mon, Sep 21, 2015 at 05:12:10PM +0200, Radim Kr=C4=8Dm=C3=A1=C5= =99 wrote: > >> >> 2015-09-20 19:57-0300, Marcelo Tosatti: > >> >>> Is it counting from zero that breaks SLES10? > >> >>=20 > >> >> Not by itself, treating MSR_KVM_SYSTEM_TIME as one-shot initial= izer did. > >> >> The guest wants to write to MSR_KVM_SYSTEM_TIME as much as it l= ikes to, > >> >> while still keeping system time; we used to allow that, which = means an > >> >> ABI breakage. (And we can't even say that guest's behaviour is= against > >> >> the spec ...) > >> >=20 > >> > Because this behaviour was not defined. > >>=20 > >> It is defined by implementation. > >>=20 > >> > Can't you just condition PVCLOCK_COUNTS_FROM_ZERO behaviour on > >> > boot_vcpu_runs_old_kvmclock =3D=3D false?=20 > >> > The patch would be much simpler. > >>=20 > >> If you mean the hunk in cover letter, I don't like it because we p= resume > >> that no other guests were broken. > >=20 > > Yes patch 1. >=20 > (I'd call them: a hunk in cover letter [0/2], patch 1 =3D [1/2], and > patch 2 =3D [2/2].) >=20 > > Do you have an example of another guest that is broken?=20 >=20 > Yes, the hot-plug in any relatively recent Linux guest. >=20 > > Really, assuming things about the value of the msr read when you > > write to the MSR is very awkward. That SuSE code is incorrect, it f= ails > > on other occasions as well (see > > 54750f2cf042c42b4223d67b1bd20138464bde0e). >=20 > Yeah, I even read the SUSE implementation, sad times. >=20 > >> I really don't like it=20 > >=20 > > Because it changes behaviour that was previously unspecified? >=20 > No, because it adds another exemption to already ugly code. > (Talking about the hunk in [0/2].) >=20 > >> so I thought about other problems with > >> PVCLOCK_COUNTS_FROM_ZERO ... have you tried to hot-replug VCPU 0? > >=20 > > Can't unplug VCPU 0 i think. >=20 > You can. >=20 > >> It doesn't work well ;) > >=20 > > Can you hot-unplug vcpu 0?=20 >=20 > I can, I did, hot-plug bugged. Due to PVCLOCK_COUNTS_FROM_ZERO patch? =20 > >> > The problem is, "selecting one read as the initial point" is inh= erently > >> > racy: that delta is relative to one moment (kvmclock read) at on= e vcpu, > >> > but must be applied to all vcpus. > >>=20 > >> I don't think that is a problem. > >>=20 > >> Kvmclock has a notion of a global system_time in nanoseconds (one = value > >> that defines the time, assigned with VM ioctl KVM_SET_CLOCK) and t= ries > >> to propagate system_time into guest VCPUs as precisely as possible= with > >> the help of TSC. > > > >Different pairs of values (system_time, tsc reads) are fundamentally > >broken. This is why=20 > > > >commit 489fb490dbf8dab0249ad82b56688ae3842a79e8 > > x86, paravirt: Add a global synchronization point for pvclock > > > >Exists. > > > >Then to guarantee monotonicity you need to stop those updates > >(or perform the pair update in lock step): > > > >commit d828199e84447795c6669ff0e6c6d55eb9beeff6 > > KVM: x86: implement PVCLOCK_TSC_STABLE_BIT pvclock flag >=20 > (Thanks for the commits, split values are the core design that avoids > modifying rdtsc,=20 Which is broken (thats the point). > so I'll be grad to read its details.) >=20 > >> > 2) You rely on monotonicity across vcpus to perform=20 > >> > the 'minus delta that was read on vcpu0' calculation, but=20 > >> > monotonicity across vcpus can fail during runtime > >> > (say host clocksource goes tsc->hpet due to tsc insta= bility). > >>=20 > >> That could be a problem, but I'm adding a VCPU independent constan= t to > >> all reads -- does the new code rely on monoticity in places where = the > >> old one didn't? > >=20 > > The problem is overflow: > > kvmclock non-monotonic across vcpus AND delta subtraction: > >=20 > > ptime | vcpu0kvmclock time | vcpu0sched_clock | vcpu1kvmclock time = | vcpu1sched_clock > > 1 1 1 =20 > > 2 2 2 1 = =20 >=20 > KVM sched clock not used before this point (I even moved the code to > make sure), so there is no problem with monotonicity. >=20 > > 3 3 0 2 = -1 > = ^^^ > -1 is the overflow. Very unlikely to ever happen in Linux, as we now > boot other VCPUs much later than the KVM clock configuration and it c= an > only happen if VCPU1 is running as the reconfiguration takes place, b= ut > a simple >=20 > if (vcpu[x].sched_clock <=3D global_sched_clock_offset) > return 0; >=20 > would take care of it. The time would stand still for a while, which= is > not a huge problem for boot-only scenario. =20 Look at=20 commit 489fb490dbf8dab0249ad82b56688ae3842a79e8 x86, paravirt: Add a global synchronization point for pvclock And think of what an overflow does. Note: i tried your solution before (to add offset) and saw this issue in practice. > Other VCPUs couldn't be > reading KVM sched before it was configured, so only operations that > happen before vcpu1sched_clock goes positive are affected. > (We have other problems if the window can be long.) The point where vcpu1sched_clock is negative is after kvmclock is initialized. >=20 > > 4 4 1 3 = 0 > > 5 5 2 4 = 1 > > 6 6 3 5 = 2 > > 7 7 4 6 = 3 > >=20 > > ptime is "physical time". > >=20 > > I prefer that the host counts from zero (there are less problems to > > handle). >=20 > The main advantage of a hypervisor solution for me is one saved > subtraction and comparison on every sched_clock(). To me are such complications as handling overflows. > > Again, that SuSE patch is incorrect and a huge hack. >=20 > I'm not disputing that. (Which doesn't justify the breakage.) >=20 > > An alternative is to enable PVCLOCK_COUNTS_FROM_ZERO _if_ the guest > > requests the feature. >=20 > Yes, and the alternative needs new paravirt interface. So either: Proceed with guest solution: -> Make sure the overflow can't happen (and write down why not in the code). Don't assume a small delta between kvmclock values of vcpus. -> Handle stable -> non-stable kvmclock transition. -> kvmclock counts from zero should not depend on stable kvmclock (because nohz_full should work on hpet host systems). Enable counts-from-zero on MSR_KVM_SYSTEM_TIME_NEW: -> Figure out whats wrong with different kvmclock values on hotplug,=20 and fix it.