From: "Radim Krčmář" <rkrcmar@redhat.com>
To: Marcelo Tosatti <mtosatti@redhat.com>
Cc: linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
Paolo Bonzini <pbonzini@redhat.com>,
Luiz Capitulino <lcapitulino@redhat.com>
Subject: Re: [RFC PATCH 0/2] kvmclock: fix ABI breakage from PVCLOCK_COUNTS_FROM_ZERO.
Date: Mon, 21 Sep 2015 17:12:10 +0200
Message-ID: <20150921151209.GA2734@potion.brq.redhat.com>
In-Reply-To: <20150920225742.GA27666@amt.cnet>

2015-09-20 19:57-0300, Marcelo Tosatti:
> On Fri, Sep 18, 2015 at 05:54:28PM +0200, Radim Krčmář wrote:
>> This patch series will be disabling PVCLOCK_COUNTS_FROM_ZERO flag and is
>> RFC because I haven't explored many potential problems or tested it.
>
> The justification to disable PVCLOCK_COUNTS_FROM_ZERO is because you
> haven't explored potential problems or tested it? Sorry can't parse it.
>
>>
>> [1/2] uses a different algorithm in the guest to start counting from 0.
>> [2/2] stops exposing PVCLOCK_COUNTS_FROM_ZERO in the hypervisor.
>>
>> A viable alternative would be to implement opt-in features in kvm clock.
>>
>> And because we probably only broke one old user (the infamous SLES 10), a
>> workaround like this is also possible: (but I'd rather not do that)
>
> Please describe why SLES 10 breaks in detail: the state of the guest and
> the host before the patch, the state of the guest and host after the
> patch.

 1) The guest periodically receives an interrupt that is handled by
    main_timer_handler():
    a) get time using the kvm clock:
       1) write the address to MSR_KVM_SYSTEM_TIME
       2) read tsc and pvclock (tsc_offset, system_time)
       3) time = tsc - tsc_offset + system_time
    b) compute time since the last main_timer_handler()
    c) bump jiffies if enough time has elapsed
 2) the guest wants to calibrate loops per jiffy [1]:
    a) read tsc
    b) loop till jiffies increase
    c) compute lpj

Because (1a1) always resets the system_time to 0, we read the same value
over and over so the condition for (1c) is never true and jiffies remain
constant.  This is the problem.  A hang happens in (2b) as it is the
first place that depends on jiffies.

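A minimal sketch of the failure mode described above, as a toy Python
simulation rather than guest or host code.  The names Host,
write_system_time_msr(), read_clock() and run_guest() are hypothetical,
and the tsc/tsc_offset arithmetic of step (1a3) is collapsed into a single
"host time minus offset" read.  The 4 ms tick is taken from the "HZ 250"
line in the dmesg below.

```python
from dataclasses import dataclass

TICK_NS = 4_000_000  # one jiffy at HZ=250 (assumption taken from the dmesg)


@dataclass
class Host:
    """Toy model of the hypervisor side of the kvm clock (hypothetical)."""
    reset_on_write: bool  # True models the behaviour that breaks the old guest
    now_ns: int = 0       # host monotonic time
    offset_ns: int = 0    # subtracted from host time to form guest system time

    def advance(self, ns: int) -> None:
        self.now_ns += ns

    def write_system_time_msr(self) -> None:
        # Step (1a1): the guest writes the pvclock address to the MSR.
        if self.reset_on_write:
            # The guest-visible clock restarts from zero on every write.
            self.offset_ns = self.now_ns

    def read_clock(self) -> int:
        # Steps (1a2)-(1a3), simplified to a single subtraction.
        return self.now_ns - self.offset_ns


def run_guest(host: Host, interrupts: int) -> int:
    """Model main_timer_handler() being called once per timer interrupt."""
    jiffies = 0
    last_ns = 0
    for _ in range(interrupts):
        host.advance(TICK_NS)          # time passes until the next interrupt
        host.write_system_time_msr()   # (1a1)
        now_ns = host.read_clock()     # (1a2)-(1a3)
        elapsed = now_ns - last_ns     # (1b)
        if elapsed >= TICK_NS:         # (1c)
            ticks = elapsed // TICK_NS
            jiffies += ticks
            last_ns += ticks * TICK_NS
    return jiffies
```

In this model, run_guest(Host(reset_on_write=True), 100) returns 0, because
every read lands right after a reset and sees time 0, while
run_guest(Host(reset_on_write=False), 100) returns 100.  A loop that waits
for jiffies to change, like (2b), would never exit in the first case.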
> What does SLES10 expect?

That a write to MSR_KVM_SYSTEM_TIME does not reset the system time.

> Is it counting from zero that breaks SLES10?

Not by itself; treating MSR_KVM_SYSTEM_TIME as a one-shot initializer did.
The guest wants to write to MSR_KVM_SYSTEM_TIME as often as it likes
while still keeping system time; we used to allow that, which makes this
an ABI breakage.  (And we can't even say that the guest's behaviour is
against the spec ...)

---
1: I also did disassembly, but the reproducer is easier to paste
   (couldn't find debuginfo)

   # qemu-kvm -nographic -kernel vmlinuz-2.6.16.60-0.85.1-default \
      -serial stdio -monitor /dev/null -append 'console=ttyS0'

   and you can get a bit further when setting loops per jiffy manually,

      -serial stdio -monitor /dev/null -append 'console=ttyS0 lpj=12345678'

   The dmesg for the failing run is

     Initializing CPU#0
     PID hash table entries: 512 (order: 9, 16384 bytes)
     kvm-clock: cpu 0, msr 0:3f6041, boot clock
     kvm_get_tsc_khz: cpu 0, msr 0:e001
     time.c: Using tsc for timekeeping HZ 250
     time.c: Using 100.000000 MHz WALL KVM GTOD KVM timer.
     time.c: Detected 2591.580 MHz processor.
     Console: colour VGA+ 80x25
     Dentry cache hash table entries: 16384 (order: 5, 131072 bytes)
     Inode-cache hash table entries: 8192 (order: 4, 65536 bytes)
     Checking aperture...
     Nosave address range: 000000000009f000 - 00000000000a0000
     Nosave address range: 00000000000a0000 - 00000000000f0000
     Nosave address range: 00000000000f0000 - 0000000000100000
     Memory: 124884k/130944k available (1856k kernel code, 5544k reserved, 812k data, 188k init)
     [Infinitely querying kvm clock here ...]

   With '-cpu kvm64,-kvmclock', the next line is

     Calibrating delay using timer specific routine.. 5199.75 BogoMIPS (lpj=10399519)

   With 'lpj=10399519',

     Calibrating delay loop (skipped)... 5199.75 BogoMIPS preset
     [Manages to get stuck later, in default_idle.]