From mboxrd@z Thu Jan  1 00:00:00 1970
From: Marcelo Tosatti <mtosatti@redhat.com>
Subject: Re: [RFC PATCH 0/2] kvmclock: fix ABI breakage from
 PVCLOCK_COUNTS_FROM_ZERO.
Date: Mon, 21 Sep 2015 19:37:14 -0300
Message-ID: <20150921223713.GA21139@amt.cnet>
References: <1442591670-5216-1-git-send-email-rkrcmar@redhat.com>
 <20150920225742.GA27666@amt.cnet>
 <20150921151209.GA2734@potion.brq.redhat.com>
 <20150921155224.GA12938@amt.cnet>
 <20150921200026.GB2734@potion.brq.redhat.com>
 <20150921205312.GA17800@amt.cnet>
 <20150921220039.GC2734@potion.brq.redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: QUOTED-PRINTABLE
Cc: linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
	Paolo Bonzini <pbonzini@redhat.com>,
	Luiz Capitulino <lcapitulino@redhat.com>
To: Radim =?utf-8?B?S3LEjW3DocWZ?= <rkrcmar@redhat.com>
Return-path: <linux-kernel-owner@vger.kernel.org>
Content-Disposition: inline
In-Reply-To: <20150921220039.GC2734@potion.brq.redhat.com>
Sender: linux-kernel-owner@vger.kernel.org
List-Id: kvm.vger.kernel.org

On Tue, Sep 22, 2015 at 12:00:39AM +0200, Radim Kr=C4=8Dm=C3=A1=C5=99 w=
rote:
> 2015-09-21 17:53-0300, Marcelo Tosatti:
> > On Mon, Sep 21, 2015 at 10:00:27PM +0200, Radim Kr=C4=8Dm=C3=A1=C5=99=
 wrote:
> >> 2015-09-21 12:52-0300, Marcelo Tosatti:
> >> > On Mon, Sep 21, 2015 at 05:12:10PM +0200, Radim Kr=C4=8Dm=C3=A1=C5=
=99 wrote:
> >> >> 2015-09-20 19:57-0300, Marcelo Tosatti:
> >> >>> Is it counting from zero that breaks SLES10?
> >> >>=20
> >> >> Not by itself, treating MSR_KVM_SYSTEM_TIME as one-shot initial=
izer did.
> >> >> The guest wants to write to MSR_KVM_SYSTEM_TIME as much as it l=
ikes to,
> >> >> while still keeping system time;  we used to allow that, which =
means an
> >> >> ABI breakage.  (And we can't even say that guest's behaviour is=
 against
> >> >> the spec ...)
> >> >=20
> >> > Because this behaviour was not defined.
> >>=20
> >> It is defined by implementation.
> >>=20
> >> > Can't you just condition PVCLOCK_COUNTS_FROM_ZERO behaviour on
> >> > boot_vcpu_runs_old_kvmclock =3D=3D false?=20
> >> > The patch would be much simpler.
> >>=20
> >> If you mean the hunk in cover letter, I don't like it because we p=
resume
> >> that no other guests were broken.
> >=20
> > Yes patch 1.
>=20
> (I'd call them: a hunk in cover letter [0/2], patch 1 =3D [1/2], and
>  patch 2 =3D [2/2].)
>=20
> > Do you have an example of another guest that is broken?=20
>=20
> Yes, the hot-plug in any relatively recent Linux guest.
>=20
> > Really, assuming things about the value of the msr read when you
> > write to the MSR is very awkward. That SuSE code is incorrect, it f=
ails
> > on other occasions as well (see
> > 54750f2cf042c42b4223d67b1bd20138464bde0e).
>=20
> Yeah, I even read the SUSE implementation, sad times.
>=20
> >> I really don't like it=20
> >=20
> > Because it changes behaviour that was previously unspecified?
>=20
> No, because it adds another exemption to already ugly code.
> (Talking about the hunk in [0/2].)
>=20
> >> so I thought about other problems with
> >> PVCLOCK_COUNTS_FROM_ZERO ... have you tried to hot-replug VCPU 0?
> >=20
> > Can't unplug VCPU 0 i think.
>=20
> You can.
>=20
> >> It doesn't work well ;)
> >=20
> > Can you hot-unplug vcpu 0?=20
>=20
> I can, I did, hot-plug bugged.

Due to PVCLOCK_COUNTS_FROM_ZERO patch? =20

> >> > The problem is, "selecting one read as the initial point" is inh=
erently
> >> > racy: that delta is relative to one moment (kvmclock read) at on=
e vcpu,
> >> > but must be applied to all vcpus.
> >>=20
> >> I don't think that is a problem.
> >>=20
> >> Kvmclock has a notion of a global system_time in nanoseconds (one =
value
> >> that defines the time, assigned with VM ioctl KVM_SET_CLOCK) and t=
ries
> >> to propagate system_time into guest VCPUs as precisely as possible=
 with
> >> the help of TSC.
> >
> >Different pairs of values (system_time, tsc reads) are fundamentally
> >broken. This is why=20
> >
> >commit 489fb490dbf8dab0249ad82b56688ae3842a79e8
> >    x86, paravirt: Add a global synchronization point for pvclock
> >
> >Exists.
> >
> >Then to guarantee monotonicity you need to stop those updates
> >(or perform the pair update in lock step):
> >
> >commit d828199e84447795c6669ff0e6c6d55eb9beeff6
> >    KVM: x86: implement PVCLOCK_TSC_STABLE_BIT pvclock flag
>=20
> (Thanks for the commits, split values are the core design that avoids
>  modifying rdtsc,=20

Which is broken (thats the point).

> so I'll be grad to read its details.)
>=20
> >> > 	2) You rely on monotonicity across vcpus to perform=20
> >> > 	   the 'minus delta that was read on vcpu0' calculation, but=20
> >> > 	   monotonicity across vcpus can fail during runtime
> >> >            (say host clocksource goes tsc->hpet due to tsc insta=
bility).
> >>=20
> >> That could be a problem, but I'm adding a VCPU independent constan=
t to
> >> all reads -- does the new code rely on monoticity in places where =
the
> >> old one didn't?
> >=20
> > The problem is overflow:
> > kvmclock non-monotonic across vcpus AND delta subtraction:
> >=20
> > ptime | vcpu0kvmclock time | vcpu0sched_clock | vcpu1kvmclock time =
| vcpu1sched_clock
> > 1          1                      1                        =20
> > 2          2                      2                        1       =
        =20
>=20
> KVM sched clock not used before this point (I even moved the code to
> make sure), so there is no problem with monotonicity.
>=20
> > 3          3                      0                        2       =
     -1
>                                                                      =
     ^^^
> -1 is the overflow.  Very unlikely to ever happen in Linux, as we now
> boot other VCPUs much later than the KVM clock configuration and it c=
an
> only happen if VCPU1 is running as the reconfiguration takes place, b=
ut
> a simple
>=20
>   if (vcpu[x].sched_clock <=3D global_sched_clock_offset)
>   	return 0;
>=20
> would take care of it.  The time would stand still for a while, which=
 is
> not a huge problem for boot-only scenario. =20

Look at=20

commit 489fb490dbf8dab0249ad82b56688ae3842a79e8
    x86, paravirt: Add a global synchronization point for pvclock

And think of what an overflow does.

Note: i tried your solution before (to add offset) and saw this issue
in practice.

> Other VCPUs couldn't be
> reading KVM sched before it was configured, so only operations that
> happen before vcpu1sched_clock goes positive are affected.
> (We have other problems if the window can be long.)

The point where vcpu1sched_clock is negative is after kvmclock is
initialized.

>=20
> > 4          4                      1                        3       =
      0
> > 5          5                      2                        4       =
      1
> > 6          6                      3                        5       =
      2
> > 7          7                      4                        6       =
      3
> >=20
> > ptime is "physical time".
> >=20
> > I prefer that the host counts from zero (there are less problems to
> > handle).
>=20
> The main advantage of a hypervisor solution for me is one saved
> subtraction and comparison on every sched_clock().

To me are such complications as handling overflows.

> > Again, that SuSE patch is incorrect and a huge hack.
>=20
> I'm not disputing that.  (Which doesn't justify the breakage.)
>=20
> > An alternative is to enable PVCLOCK_COUNTS_FROM_ZERO _if_ the guest
> > requests the feature.
>=20
> Yes, and the alternative needs new paravirt interface.

So either:

Proceed with guest solution:
-> Make sure the overflow can't happen (and write down why not in the
code). Don't assume a small delta between kvmclock values of vcpus.
-> Handle stable -> non-stable kvmclock transition.
-> kvmclock counts from zero should not depend on stable kvmclock
(because nohz_full should work on hpet host systems).

Enable counts-from-zero on MSR_KVM_SYSTEM_TIME_NEW:
-> Figure out whats wrong with different kvmclock values on hotplug,=20
and fix it.