From: Marcelo Tosatti <mtosatti@redhat.com>
To: "Radim Krčmář" <rkrcmar@redhat.com>
Cc: linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
Paolo Bonzini <pbonzini@redhat.com>,
Luiz Capitulino <lcapitulino@redhat.com>
Subject: Re: [RFC PATCH 0/2] kvmclock: fix ABI breakage from PVCLOCK_COUNTS_FROM_ZERO.
Date: Mon, 21 Sep 2015 12:52:24 -0300
Message-ID: <20150921155224.GA12938@amt.cnet>
In-Reply-To: <20150921151209.GA2734@potion.brq.redhat.com>

On Mon, Sep 21, 2015 at 05:12:10PM +0200, Radim Krčmář wrote:
> 2015-09-20 19:57-0300, Marcelo Tosatti:
> > On Fri, Sep 18, 2015 at 05:54:28PM +0200, Radim Krčmář wrote:
> >> This patch series will be disabling PVCLOCK_COUNTS_FROM_ZERO flag and is
> >> RFC because I haven't explored many potential problems or tested it.
> >
> > The justification to disable PVCLOCK_COUNTS_FROM_ZERO is because you
> > haven't explored potential problems or tested it? Sorry can't parse it.
> >
> >>
> >> [1/2] uses a different algorithm in the guest to start counting from 0.
> >> [2/2] stops exposing PVCLOCK_COUNTS_FROM_ZERO in the hypervisor.
> >>
> >> A viable alternative would be to implement opt-in features in kvm clock.
> >>
> >> And because we probably only broke one old user (the infamous SLES 10), a
> >> workaround like this is also possible: (but I'd rather not do that)
> >
> > Please describe why SLES 10 breaks in detail: the state of the guest and
> > the host before the patch, the state of the guest and host after the
> > patch.
>
> 1) The guest periodically receives an interrupt that is handled by
>    main_timer_handler():
>    a) get time using the kvm clock:
>       1) write the address to MSR_KVM_SYSTEM_TIME
>       2) read tsc and pvclock (tsc_offset, system_time)
>       3) time = tsc - tsc_offset + system_time
>    b) compute time since the last main_timer_handler()
>    c) bump jiffies if enough time has elapsed
> 2) the guest wants to calibrate loops per jiffy [1]:
>    a) read tsc
>    b) loop till jiffies increase
>    c) compute lpj
>
> Because (1a1) always resets the system_time to 0, we read the same value
> over and over so the condition for (1c) is never true and jiffies remain
> constant. This is the problem. A hang happens in (2b) as it is the
> first place that depends on jiffies.
>
> > What does SLES10 expect?
>
> That a write to MSR_KVM_SYSTEM_TIME does not reset the system time.
>
> > Is it counting from zero that breaks SLES10?
>
> Not by itself, treating MSR_KVM_SYSTEM_TIME as one-shot initializer did.
> The guest wants to write to MSR_KVM_SYSTEM_TIME as much as it likes to,
> while still keeping system time; we used to allow that, which means an
> ABI breakage. (And we can't even say that guest's behaviour is against
> the spec ...)

Because this behaviour was not defined.

Can't you just condition PVCLOCK_COUNTS_FROM_ZERO behaviour on
boot_vcpu_runs_old_kvmclock == false?
The patch would be much simpler.

The problem is, "selecting one read as the initial point" is inherently
racy: that delta is relative to one moment (kvmclock read) at one vcpu,
but must be applied to all vcpus.

Besides:

1) Stable sched clock in guest does not depend on
   KVM_FEATURE_CLOCKSOURCE_STABLE_BIT.
2) You rely on monotonicity across vcpus to perform
   the 'minus delta that was read on vcpu0' calculation, but
   monotonicity across vcpus can fail during runtime
   (say host clocksource goes tsc->hpet due to tsc instability).

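To make point 2 concrete, a minimal sketch with made-up numbers: the
function name and the sample values are hypothetical, and the code only
models the arithmetic of subtracting a delta read once on vcpu0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical guest-side scheme: subtract one delta, sampled once on
 * vcpu0, from the raw kvmclock value read on every vcpu.
 */
static void rebase_to_zero(const uint64_t *raw_ns, size_t nr_vcpus,
			   uint64_t delta_ns, int64_t *out_ns)
{
	assert(raw_ns != NULL && out_ns != NULL);
	for (size_t i = 0; i < nr_vcpus; i++) {
		/* Unsigned wrap-around, reinterpreted as signed to expose it. */
		out_ns[i] = (int64_t)(raw_ns[i] - delta_ns);
	}
}
```

If vcpu1's raw reading lags vcpu0's by 300 ns when the delta is taken, the
rebased value on vcpu1 is -300 ns, that is, a time before "zero".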
>
>
> ---
> 1: I also did disassembly, but the reproducer is easier to paste
>    (couldn't find debuginfo)
>    # qemu-kvm -nographic -kernel vmlinuz-2.6.16.60-0.85.1-default \
>        -serial stdio -monitor /dev/null -append 'console=ttyS0'
>
>    and you can get a bit further when setting loops per jiffy manually,
>        -serial stdio -monitor /dev/null -append 'console=ttyS0 lpj=12345678'
>
> The dmesg for failing run is
> Initializing CPU#0
> PID hash table entries: 512 (order: 9, 16384 bytes)
> kvm-clock: cpu 0, msr 0:3f6041, boot clock
> kvm_get_tsc_khz: cpu 0, msr 0:e001
> time.c: Using tsc for timekeeping HZ 250
> time.c: Using 100.000000 MHz WALL KVM GTOD KVM timer.
> time.c: Detected 2591.580 MHz processor.
> Console: colour VGA+ 80x25
> Dentry cache hash table entries: 16384 (order: 5, 131072 bytes)
> Inode-cache hash table entries: 8192 (order: 4, 65536 bytes)
> Checking aperture...
> Nosave address range: 000000000009f000 - 00000000000a0000
> Nosave address range: 00000000000a0000 - 00000000000f0000
> Nosave address range: 00000000000f0000 - 0000000000100000
> Memory: 124884k/130944k available (1856k kernel code, 5544k reserved, 812k data, 188k init)
> [Infinitely querying kvm clock here ...]
>
> With '-cpu kvm64,-kvmclock', the next line is
> Calibrating delay using timer specific routine.. 5199.75 BogoMIPS (lpj=10399519)
>
> With 'lpj=10399519',
> Calibrating delay loop (skipped)... 5199.75 BogoMIPS preset
> [Manages to get stuck later, in default_idle.]