From: Marcelo Tosatti <mtosatti@redhat.com>
To: "Radim Krčmář" <rkrcmar@redhat.com>
Cc: linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
Paolo Bonzini <pbonzini@redhat.com>,
Luiz Capitulino <lcapitulino@redhat.com>
Subject: Re: [RFC PATCH 0/2] kvmclock: fix ABI breakage from PVCLOCK_COUNTS_FROM_ZERO.
Date: Mon, 21 Sep 2015 19:37:14 -0300 [thread overview]
Message-ID: <20150921223713.GA21139@amt.cnet> (raw)
In-Reply-To: <20150921220039.GC2734@potion.brq.redhat.com>
On Tue, Sep 22, 2015 at 12:00:39AM +0200, Radim Krčmář wrote:
> 2015-09-21 17:53-0300, Marcelo Tosatti:
> > On Mon, Sep 21, 2015 at 10:00:27PM +0200, Radim Krčmář wrote:
> >> 2015-09-21 12:52-0300, Marcelo Tosatti:
> >> > On Mon, Sep 21, 2015 at 05:12:10PM +0200, Radim Krčmář wrote:
> >> >> 2015-09-20 19:57-0300, Marcelo Tosatti:
> >> >>> Is it counting from zero that breaks SLES10?
> >> >>
> >> >> Not by itself, treating MSR_KVM_SYSTEM_TIME as one-shot initializer did.
> >> >> The guest wants to write to MSR_KVM_SYSTEM_TIME as much as it likes to,
> >> >> while still keeping system time; we used to allow that, which means an
> >> >> ABI breakage. (And we can't even say that guest's behaviour is against
> >> >> the spec ...)
> >> >
> >> > Because this behaviour was not defined.
> >>
> >> It is defined by implementation.
> >>
> >> > Can't you just condition PVCLOCK_COUNTS_FROM_ZERO behaviour on
> >> > boot_vcpu_runs_old_kvmclock == false?
> >> > The patch would be much simpler.
> >>
> >> If you mean the hunk in cover letter, I don't like it because we presume
> >> that no other guests were broken.
> >
> > Yes patch 1.
>
> (I'd call them: a hunk in cover letter [0/2], patch 1 = [1/2], and
> patch 2 = [2/2].)
>
> > Do you have an example of another guest that is broken?
>
> Yes, the hot-plug in any relatively recent Linux guest.
>
> > Really, assuming things about the value of the msr read when you
> > write to the MSR is very awkward. That SuSE code is incorrect, it fails
> > on other occasions as well (see
> > 54750f2cf042c42b4223d67b1bd20138464bde0e).
>
> Yeah, I even read the SUSE implementation, sad times.
>
> >> I really don't like it
> >
> > Because it changes behaviour that was previously unspecified?
>
> No, because it adds another exemption to already ugly code.
> (Talking about the hunk in [0/2].)
>
> >> so I thought about other problems with
> >> PVCLOCK_COUNTS_FROM_ZERO ... have you tried to hot-replug VCPU 0?
> >
> > Can't unplug VCPU 0 i think.
>
> You can.
>
> >> It doesn't work well ;)
> >
> > Can you hot-unplug vcpu 0?
>
> I can, I did, hot-plug bugged.
Due to PVCLOCK_COUNTS_FROM_ZERO patch?
> >> > The problem is, "selecting one read as the initial point" is inherently
> >> > racy: that delta is relative to one moment (kvmclock read) at one vcpu,
> >> > but must be applied to all vcpus.
> >>
> >> I don't think that is a problem.
> >>
> >> Kvmclock has a notion of a global system_time in nanoseconds (one value
> >> that defines the time, assigned with VM ioctl KVM_SET_CLOCK) and tries
> >> to propagate system_time into guest VCPUs as precisely as possible with
> >> the help of TSC.
> >
> >Different pairs of values (system_time, tsc reads) are fundamentally
> >broken. This is why
> >
> >commit 489fb490dbf8dab0249ad82b56688ae3842a79e8
> > x86, paravirt: Add a global synchronization point for pvclock
> >
> >Exists.
> >
> >Then to guarantee monotonicity you need to stop those updates
> >(or perform the pair update in lock step):
> >
> >commit d828199e84447795c6669ff0e6c6d55eb9beeff6
> > KVM: x86: implement PVCLOCK_TSC_STABLE_BIT pvclock flag
>
> (Thanks for the commits, split values are the core design that avoids
> modifying rdtsc,
Which is broken (thats the point).
> so I'll be grad to read its details.)
>
> >> > 2) You rely on monotonicity across vcpus to perform
> >> > the 'minus delta that was read on vcpu0' calculation, but
> >> > monotonicity across vcpus can fail during runtime
> >> > (say host clocksource goes tsc->hpet due to tsc instability).
> >>
> >> That could be a problem, but I'm adding a VCPU independent constant to
> >> all reads -- does the new code rely on monoticity in places where the
> >> old one didn't?
> >
> > The problem is overflow:
> > kvmclock non-monotonic across vcpus AND delta subtraction:
> >
> > ptime | vcpu0kvmclock time | vcpu0sched_clock | vcpu1kvmclock time | vcpu1sched_clock
> > 1 1 1
> > 2 2 2 1
>
> KVM sched clock not used before this point (I even moved the code to
> make sure), so there is no problem with monotonicity.
>
> > 3 3 0 2 -1
> ^^^
> -1 is the overflow. Very unlikely to ever happen in Linux, as we now
> boot other VCPUs much later than the KVM clock configuration and it can
> only happen if VCPU1 is running as the reconfiguration takes place, but
> a simple
>
> if (vcpu[x].sched_clock <= global_sched_clock_offset)
> return 0;
>
> would take care of it. The time would stand still for a while, which is
> not a huge problem for boot-only scenario.
Look at
commit 489fb490dbf8dab0249ad82b56688ae3842a79e8
x86, paravirt: Add a global synchronization point for pvclock
And think of what an overflow does.
Note: i tried your solution before (to add offset) and saw this issue
in practice.
> Other VCPUs couldn't be
> reading KVM sched before it was configured, so only operations that
> happen before vcpu1sched_clock goes positive are affected.
> (We have other problems if the window can be long.)
The point where vcpu1sched_clock is negative is after kvmclock is
initialized.
>
> > 4 4 1 3 0
> > 5 5 2 4 1
> > 6 6 3 5 2
> > 7 7 4 6 3
> >
> > ptime is "physical time".
> >
> > I prefer that the host counts from zero (there are less problems to
> > handle).
>
> The main advantage of a hypervisor solution for me is one saved
> subtraction and comparison on every sched_clock().
To me are such complications as handling overflows.
> > Again, that SuSE patch is incorrect and a huge hack.
>
> I'm not disputing that. (Which doesn't justify the breakage.)
>
> > An alternative is to enable PVCLOCK_COUNTS_FROM_ZERO _if_ the guest
> > requests the feature.
>
> Yes, and the alternative needs new paravirt interface.
So either:
Proceed with guest solution:
-> Make sure the overflow can't happen (and write down why not in the
code). Don't assume a small delta between kvmclock values of vcpus.
-> Handle stable -> non-stable kvmclock transition.
-> kvmclock counts from zero should not depend on stable kvmclock
(because nohz_full should work on hpet host systems).
Enable counts-from-zero on MSR_KVM_SYSTEM_TIME_NEW:
-> Figure out whats wrong with different kvmclock values on hotplug,
and fix it.
next prev parent reply other threads:[~2015-09-21 22:37 UTC|newest]
Thread overview: 20+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-09-18 15:54 [RFC PATCH 0/2] kvmclock: fix ABI breakage from PVCLOCK_COUNTS_FROM_ZERO Radim Krčmář
2015-09-18 15:54 ` [PATCH 1/2] x86: kvmclock: abolish PVCLOCK_COUNTS_FROM_ZERO Radim Krčmář
2015-09-22 19:01 ` Marcelo Tosatti
2015-09-28 14:10 ` Paolo Bonzini
2015-09-18 15:54 ` [PATCH 2/2] Revert "KVM: x86: zero kvmclock_offset when vcpu0 initializes kvmclock system MSR" Radim Krčmář
2015-09-22 19:01 ` Marcelo Tosatti
2015-09-22 19:52 ` Paolo Bonzini
2015-09-22 20:23 ` Marcelo Tosatti
2015-09-20 22:57 ` [RFC PATCH 0/2] kvmclock: fix ABI breakage from PVCLOCK_COUNTS_FROM_ZERO Marcelo Tosatti
2015-09-21 15:12 ` Radim Krčmář
2015-09-21 15:43 ` Radim Krčmář
2015-09-21 15:52 ` Marcelo Tosatti
2015-09-21 20:00 ` Radim Krčmář
2015-09-21 20:53 ` Marcelo Tosatti
2015-09-21 22:00 ` Radim Krčmář
2015-09-21 22:37 ` Marcelo Tosatti [this message]
2015-09-22 0:40 ` Marcelo Tosatti
2015-09-22 14:33 ` Radim Krčmář
2015-09-22 14:46 ` Radim Krčmář
2015-09-28 11:05 ` Paolo Bonzini
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20150921223713.GA21139@amt.cnet \
--to=mtosatti@redhat.com \
--cc=kvm@vger.kernel.org \
--cc=lcapitulino@redhat.com \
--cc=linux-kernel@vger.kernel.org \
--cc=pbonzini@redhat.com \
--cc=rkrcmar@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).