From mboxrd@z Thu Jan 1 00:00:00 1970
From: Marcelo Tosatti
Subject: Re: x86: kvm: Revert "remove sched notifier for cross-cpu migrations"
Date: Thu, 26 Mar 2015 08:29:07 -0300
Message-ID: <20150326112907.GA15098@amt.cnet>
References: <20150324153412.GB21710@potion.brq.redhat.com>
 <20150325110814.GE21522@potion.brq.redhat.com>
 <20150325125212.GC21710@potion.brq.redhat.com>
 <20150325212851.GB3649@amt.cnet>
 <20150325224145.GA5928@amt.cnet>
 <20150325231317.GA7144@amt.cnet>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: QUOTED-PRINTABLE
Cc: kvm list , Radim Krcmar , stable , Paolo Bonzini
To: Andy Lutomirski
Return-path:
Received: from mx1.redhat.com ([209.132.183.28]:34031 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750790AbbCZLaD (ORCPT ); Thu, 26 Mar 2015 07:30:03 -0400
Content-Disposition: inline
In-Reply-To:
Sender: kvm-owner@vger.kernel.org
List-ID:

On Wed, Mar 25, 2015 at 04:22:03PM -0700, Andy Lutomirski wrote:
> On Wed, Mar 25, 2015 at 4:13 PM, Marcelo Tosatti wrote:
> > On Wed, Mar 25, 2015 at 03:48:02PM -0700, Andy Lutomirski wrote:
> >> On Wed, Mar 25, 2015 at 3:41 PM, Marcelo Tosatti wrote:
> >> > On Wed, Mar 25, 2015 at 03:33:10PM -0700, Andy Lutomirski wrote:
> >> >> On Mar 25, 2015 2:29 PM, "Marcelo Tosatti" wrote:
> >> >> >
> >> >> > On Wed, Mar 25, 2015 at 01:52:15PM +0100, Radim Krčmář wrote:
> >> >> > > 2015-03-25 12:08+0100, Radim Krčmář:
> >> >> > > > Reverting the patch protects us from any migration, but I don't think we
> >> >> > > > need to care about changing VCPUs as long as we read a consistent data
> >> >> > > > from kvmclock. (VCPU can change outside of this loop too, so it doesn't
> >> >> > > > matter if we return a value not fit for this VCPU.)
> >> >> > > >
> >> >> > > > I think we could drop the second __getcpu if our kvmclock was being
> >> >> > > > handled better; maybe with a patch like the one below:
> >> >> > >
> >> >> > > The second __getcpu is not necessary, but I forgot about rdtsc.
> >> >> > > We need to either use rdtscp, know the host has synchronized tsc, or
> >> >> > > monitor VCPU migrations. Only the last one works everywhere.
> >> >> >
> >> >> > The vdso code is only used if host has synchronized tsc.
> >> >> >
> >> >> > But you have to handle the case where host goes from synchronized tsc to
> >> >> > unsynchronized tsc (see the clocksource notifier in the host side).
> >> >> >
> >> >>
> >> >> Can't we change the host to freeze all vcpus and clear the stable bit
> >> >> on all of them if this happens? This would simplify and speed up
> >> >> vclock_gettime.
> >> >>
> >> >> --Andy
> >> >
> >> > Seems interesting to do on 512-vcpus, but sure, could be done.
> >> >
> >>
> >> If you have a 512-vcpu system that switches between stable and
> >> unstable more than once per migration, then I expect that you have
> >> serious problems and this is the least of your worries.
> >>
> >> Personally, I'd *much* rather we just made vcpu 0's pvti authoritative
> >> if we're stable. If nothing else, I'm not even remotely convinced
> >> that the current scheme gives monotonic timing due to skew between
> >> when the updates happen on different vcpus.
> >
> > Can you write down the problem ?
> >
>
> I can try.
>
> Suppose we start out with all vcpus agreeing on their pvti and perfect
> invariant TSCs. Now the host updates its frequency (due to NTP or
> whatever). KVM updates vcpu 0's pvti. Before KVM updates vcpu 1's
> pvti, guest code on vcpus 0 and 1 see synced TSCs but different pvti.
> They'll disagree on the time, and one of them will be ahead until vcpu
> 1's pvti gets updated.

The masterclock scheme ensures that, at any given time, all vcpus see
the same system_timestamp/tsc_timestamp pair.

 * That is, when timespec0 != timespec1, M < N. Unfortunately that is not
 * always the case (the difference between two distinct xtime instances
 * might be smaller than the difference between corresponding TSC reads,
 * when updating guest vcpus pvclock areas).
 *
 * To avoid that problem, do not allow visibility of distinct
 * system_timestamp/tsc_timestamp values simultaneously: use a master
 * copy of host monotonic time values. Update that master copy
 * in lockstep.
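
To make the scenario concrete, here is a rough sketch of the arithmetic. It is
not kernel code: struct pvti_sketch, pvti_read_ns() and reads_are_monotonic()
are hypothetical names, the scaling (ns per 1024 TSC cycles) is a simplification
of the real mul/shift pair, and the numbers are made up. It only shows that two
vcpus holding different pairs can observe time going backwards across synced
TSCs, while a single pair shared by both cannot.

```c
#include <stdint.h>

/* Hypothetical, simplified stand-in for the guest-visible pvclock record.
 * The real record also carries a version, flags and a shift. */
struct pvti_sketch {
	uint64_t tsc_timestamp;   /* host TSC when the pair was sampled */
	uint64_t system_time_ns;  /* host monotonic time (ns) at that TSC */
	uint32_t mul;             /* assumed scaling: ns per 1024 TSC cycles */
};

/* Guest-side read: extrapolate from the pair using the current TSC. */
static uint64_t pvti_read_ns(const struct pvti_sketch *p, uint64_t tsc)
{
	uint64_t delta = tsc - p->tsc_timestamp;

	return p->system_time_ns + ((delta * p->mul) >> 10);
}

/* Pair every vcpu agreed on before the host update (1 ns per cycle). */
static const struct pvti_sketch old_pair = {
	.tsc_timestamp = 0, .system_time_ns = 0, .mul = 1024,
};

/* Pair published after the update; the host clock is now 300 ns ahead
 * of what old_pair would extrapolate at TSC 1000000 (made-up value). */
static const struct pvti_sketch new_pair = {
	.tsc_timestamp = 1000000, .system_time_ns = 1000300, .mul = 1024,
};

/* A read on vcpu 0 at TSC 1000100 followed by a read on vcpu 1 at
 * TSC 1000200. Returns 1 if the later read is not earlier in time. */
static int reads_are_monotonic(const struct pvti_sketch *vcpu0,
			       const struct pvti_sketch *vcpu1)
{
	uint64_t t0 = pvti_read_ns(vcpu0, 1000100);
	uint64_t t1 = pvti_read_ns(vcpu1, 1000200);

	return t1 >= t0;
}
```

With vcpu 0 on new_pair and vcpu 1 still on old_pair, the first read yields
1000400 ns and the second, later read yields 1000200 ns, so
reads_are_monotonic(&new_pair, &old_pair) returns 0. With both vcpus reading
the same master copy, reads_are_monotonic(&new_pair, &new_pair) returns 1.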