* Cleaning up the KVM clock
@ 2014-12-21 3:31 Andy Lutomirski
2014-12-22 10:15 ` Paolo Bonzini
` (2 more replies)
0 siblings, 3 replies; 12+ messages in thread
From: Andy Lutomirski @ 2014-12-21 3:31 UTC (permalink / raw)
To: Gleb Natapov, Paolo Bonzini, Marcelo Tosatti, kvm list,
linux-kernel@vger.kernel.org
I'm looking at the vdso timing code, and I'm puzzled by the pvclock
code. My motivation is comprehensibility, performance, and
correctness.
# for i in `seq 10`; do ./timing_test_64 10 vclock_gettime 0; done
10000000 loops in 0.69138s = 69.14 nsec / loop
10000000 loops in 0.63614s = 63.61 nsec / loop
10000000 loops in 0.63213s = 63.21 nsec / loop
10000000 loops in 0.63087s = 63.09 nsec / loop
10000000 loops in 0.63079s = 63.08 nsec / loop
10000000 loops in 0.63096s = 63.10 nsec / loop
10000000 loops in 0.63096s = 63.10 nsec / loop
10000000 loops in 0.63062s = 63.06 nsec / loop
10000000 loops in 0.63100s = 63.10 nsec / loop
10000000 loops in 0.63112s = 63.11 nsec / loop
bash-4.3# echo tsc
>/sys/devices/system/clocksource/clocksource0/current_clocksource
[ 45.957524] Switched to clocksource tsc
bash-4.3# for i in `seq 10`; do ./timing_test_64 10 vclock_gettime 0;
done10000000 loops in 0.33583s = 33.58 nsec / loop
10000000 loops in 0.28530s = 28.53 nsec / loop
10000000 loops in 0.28904s = 28.90 nsec / loop
10000000 loops in 0.29001s = 29.00 nsec / loop
10000000 loops in 0.28775s = 28.78 nsec / loop
10000000 loops in 0.30102s = 30.10 nsec / loop
10000000 loops in 0.28006s = 28.01 nsec / loop
10000000 loops in 0.28584s = 28.58 nsec / loop
10000000 loops in 0.28175s = 28.17 nsec / loop
10000000 loops in 0.28724s = 28.72 nsec / loop
The current code is rather slow, especially compared to the tsc variant.
The algorithm used by the pvclock vgetsns implementation is, approximately:
cpu = getcpu;
pvti = pointer to the relevant paravirt data
version = pvti->version;
rdtsc_barrier();
tsc = rdtsc()
delta = (tsc - x) * y >> z;
cycles = delta + w;
flags = pvti->flags;
rdtsc_barrier(); <-- totally unnecessary
cpu1 = getcpu;
if (cpu != cpu1 || we missed the seqlock)
retry;
if (!stable)
bail;
After that, the main vclock_gettime code applies the kernel's regular
time adjustments.
First, is there any guarantee that, if pvti is marked as stable, the
pvti data is consistent across cpus?  If so (which would be really
nice), then we could always use vcpu 0's pvti, which would be a really
nice cleanup.
If not, then the current algorithm is buggy. There is no guarantee
that the tsc stamp we get matches the cpu whose pvti we looked at. We
could fix that using rdtscp.
I think it's also rather strange that the return value is "cycles"
instead of nanoseconds. If the guest is using pvclock *and* ntp,
isn't something very wrong?
Can the algorithm just be:
tsc, cpu = rdtscp;
pvti = pvti for cpu
read the scale, offset, etc;
if (!stable)
bail;
barrier();
read pvti->tsc_timestamp;
if (tsc < pvti->tsc_timestamp)
retry;
if (the versions are unhappy)
retry;
return the computed nanosecond count;
I think this is likely to be at least as correct as the current
algorithm, if not more so, and it correctly handles the case where we
migrate to a different vcpu in the middle. (I also think that, with
this algorithm, the version check should also be unnecessary, since if
we race with a host update, we'll fail the tsc < pvti->tsc_timestamp
check.)
It would be even nicer, though, if we could do much the same thing but
without worrying about which vcpu we're on.
Thoughts? Am I missing some considerations here?
--Andy
--
Andy Lutomirski
AMA Capital Management, LLC
^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Cleaning up the KVM clock
  2014-12-21  3:31 Cleaning up the KVM clock Andy Lutomirski
@ 2014-12-22 10:15 ` Paolo Bonzini
  2014-12-22 13:34 ` Marcelo Tosatti
  2014-12-22 13:38 ` Marcelo Tosatti
  2 siblings, 0 replies; 12+ messages in thread
From: Paolo Bonzini @ 2014-12-22 10:15 UTC (permalink / raw)
To: Andy Lutomirski, Gleb Natapov, Marcelo Tosatti, kvm list,
	linux-kernel@vger.kernel.org

On 21/12/2014 04:31, Andy Lutomirski wrote:
> I'm looking at the vdso timing code, and I'm puzzled by the pvclock
> code.  My motivation is comprehensibility, performance, and
> correctness.
>
> [benchmark output snipped]
>
> The current code is rather slow, especially compared to the tsc variant.
>
> The algorithm used by the pvclock vgetsns implementation is, approximately:
>
> cpu = getcpu;
> pvti = pointer to the relevant paravirt data
> version = pvti->version;
> rdtsc_barrier();
> tsc = rdtsc()
> delta = (tsc - x) * y >> z;
> cycles = delta + w;
> flags = pvti->flags;
> rdtsc_barrier();  <-- totally unnecessary

It's not unnecessary.  The first one is a "lock", the second is an
"unlock".  You can move the second rdtsc_barrier below the cpu/seqlock
check though.

> cpu1 = getcpu;
> if (cpu != cpu1 || we missed the seqlock)
>     retry;
>
> if (!stable)
>     bail;
>
> After that, the main vclock_gettime code applies the kernel's regular
> time adjustments.
>
> First, is there any guarantee that, if pvti is marked as stable, the
> pvti data is consistent across cpus?  If so (which would be really
> nice), then we could always use vcpu 0's pvti, which would be a really
> nice cleanup.

I think you cannot because the TSCs might not be perfectly synced up
even if the rates are, but...

> If not, then the current algorithm is buggy.  There is no guarantee
> that the tsc stamp we get matches the cpu whose pvti we looked at.  We
> could fix that using rdtscp.

... Marcelo will have to answer this.

> I think it's also rather strange that the return value is "cycles"
> instead of nanoseconds.  If the guest is using pvclock *and* ntp,
> isn't something very wrong?

It's not cycles.  pvclock_get_nsec_offset returns nanoseconds, and
__pvclock_read_cycles does the same.

Patches are welcome. :)

Paolo

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: Cleaning up the KVM clock
  2014-12-21  3:31 Cleaning up the KVM clock Andy Lutomirski
  2014-12-22 10:15 ` Paolo Bonzini
@ 2014-12-22 13:34 ` Marcelo Tosatti
  2014-12-22 14:09   ` Marcelo Tosatti
  2014-12-22 16:03   ` Andy Lutomirski
  2014-12-22 13:38 ` Marcelo Tosatti
  2 siblings, 2 replies; 12+ messages in thread
From: Marcelo Tosatti @ 2014-12-22 13:34 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Gleb Natapov, Paolo Bonzini, kvm list, linux-kernel@vger.kernel.org

On Sat, Dec 20, 2014 at 07:31:19PM -0800, Andy Lutomirski wrote:
> I'm looking at the vdso timing code, and I'm puzzled by the pvclock
> code.  My motivation is comprehensibility, performance, and
> correctness.
>
> [benchmark output snipped]
>
> The current code is rather slow, especially compared to the tsc variant.
>
> [...]
>
> First, is there any guarantee that, if pvti is marked as stable, the
> pvti data is consistent across cpus?  If so (which would be really
> nice), then we could always use vcpu 0's pvti, which would be a really
> nice cleanup.
>
> If not, then the current algorithm is buggy.  There is no guarantee
> that the tsc stamp we get matches the cpu whose pvti we looked at.  We
> could fix that using rdtscp.

Please read the comment at arch/x86/kvm/x86.c which starts with

"Assuming a stable TSC across physical CPUS, and a stable TSC".

> I think it's also rather strange that the return value is "cycles"
> instead of nanoseconds.  If the guest is using pvclock *and* ntp,
> isn't something very wrong?
>
> Can the algorithm just be:
>
> tsc, cpu = rdtscp;
> pvti = pvti for cpu
>
> read the scale, offset, etc;
> if (!stable)
>     bail;

"The RDTSCP instruction waits until all previous instructions have been
executed before reading the counter.  However, subsequent instructions
may begin execution before the read operation is performed."

So you would need a barrier there after RDTSCP.

> barrier();
> read pvti->tsc_timestamp;
> if (tsc < pvti->tsc_timestamp)
>     retry;

A kvmclock update does not necessarily update tsc_timestamp.  See

"	/*
	 * If the host uses TSC clock, then passthrough TSC as stable
	 * to the guest.
	 */
	spin_lock(&ka->pvclock_gtod_sync_lock);
	use_master_clock = ka->use_master_clock;
	if (use_master_clock) {
		host_tsc = ka->master_cycle_now;
		kernel_ns = ka->master_kernel_ns;
	}
"

at arch/x86/kvm/x86.c.

> if (the versions are unhappy)
>     retry;
> return the computed nanosecond count;
>
> I think this is likely to be at least as correct as the current
> algorithm, if not more so, and it correctly handles the case where we
> migrate to a different vcpu in the middle.  (I also think that, with
> this algorithm, the version check should also be unnecessary, since if
> we race with a host update, we'll fail the tsc < pvti->tsc_timestamp
> check.)
>
> It would be even nicer, though, if we could do much the same thing but
> without worrying about which vcpu we're on.
>
> Thoughts?  Am I missing some considerations here?

Maybe we can find other optimization opportunities?

Thanks!

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: Cleaning up the KVM clock
  2014-12-22 13:34 ` Marcelo Tosatti
@ 2014-12-22 14:09   ` Marcelo Tosatti
  2014-12-22 16:03   ` Andy Lutomirski
  1 sibling, 0 replies; 12+ messages in thread
From: Marcelo Tosatti @ 2014-12-22 14:09 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Gleb Natapov, Paolo Bonzini, kvm list, linux-kernel@vger.kernel.org

On Mon, Dec 22, 2014 at 11:34:30AM -0200, Marcelo Tosatti wrote:
> > It would be even nicer, though, if we could do much the same thing but
> > without worrying about which vcpu we're on.
> >
> > Thoughts?  Am I missing some considerations here?
>
> Maybe we can find other optimization opportunities?

Perhaps RDTSCP rather than getcpu is a win?

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: Cleaning up the KVM clock
  2014-12-22 13:34 ` Marcelo Tosatti
  2014-12-22 14:09   ` Marcelo Tosatti
@ 2014-12-22 16:03   ` Andy Lutomirski
  2014-12-22 22:47     ` Paolo Bonzini
  2014-12-22 22:49     ` Paolo Bonzini
  1 sibling, 2 replies; 12+ messages in thread
From: Andy Lutomirski @ 2014-12-22 16:03 UTC (permalink / raw)
To: Marcelo Tosatti
Cc: Gleb Natapov, Paolo Bonzini, kvm list, linux-kernel@vger.kernel.org

On Mon, Dec 22, 2014 at 5:34 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> On Sat, Dec 20, 2014 at 07:31:19PM -0800, Andy Lutomirski wrote:
>> [...]
>>
>> If not, then the current algorithm is buggy.  There is no guarantee
>> that the tsc stamp we get matches the cpu whose pvti we looked at.  We
>> could fix that using rdtscp.
>
> Please read the comment at arch/x86/kvm/x86.c which starts with
>
> "Assuming a stable TSC across physical CPUS, and a stable TSC".
>
>> Can the algorithm just be:
>>
>> tsc, cpu = rdtscp;
>> pvti = pvti for cpu
>>
>> read the scale, offset, etc;
>> if (!stable)
>>     bail;
>
> "The RDTSCP instruction waits until all previous instructions have been
> executed before reading the counter.  However, subsequent instructions
> may begin execution before the read operation is performed."
>
> So you would need a barrier there after RDTSCP.

After considerable manual reading and experimentation a couple years
ago, the conclusion was that:

 - RDTSCP is ordered like a load on AMD and Intel.  That means that you
   can't observe RDTSCP by itself failing to be monotonic across CPUs.

 - RDTSC by itself is not ordered.  It's easy to observe it behaving
   non-monotonically.

 - rdtsc_barrier(); RDTSC is ordered like RDTSCP on AMD and Intel.

>> barrier();
>> read pvti->tsc_timestamp;
>> if (tsc < pvti->tsc_timestamp)
>>     retry;
>
> A kvmclock update does not necessarily update tsc_timestamp.

Hmm.

> See
>
> "	/*
>	 * If the host uses TSC clock, then passthrough TSC as stable
>	 * to the guest.
>	 */
>	spin_lock(&ka->pvclock_gtod_sync_lock);
>	use_master_clock = ka->use_master_clock;
>	if (use_master_clock) {
>		host_tsc = ka->master_cycle_now;
>		kernel_ns = ka->master_kernel_ns;
>	}
> "
>
> at arch/x86/kvm/x86.c.

So there's a much bigger problem here.  Despite the read implementation
and the docs in Documentation/, the KVM host doesn't actually use the
version field the way it's supposed to.  It just updates the whole pvti
with one __copy_to_user.  It has a comment:

 * The interface expects us to write an even number signaling that the
 * update is finished.  Since the guest won't see the intermediate
 * state, we just increase by 2 at the end.

This is wrong.  The guest *kernel* might not see the intermediate state,
because the kernel presumably disables migration while reading pvti, but
the guest vdso can't do that and could very easily observe pvti while
it's being written.

Also, __getcpu is completely unordered on current kernels, so it doesn't
generate the code that anyone would expect.  I'll fix that.

I'll send patches for the whole mess, complete with lots of comments,
after I test them a bit today.

--Andy

>> [...]
>>
>> Thoughts?  Am I missing some considerations here?
>
> Maybe we can find other optimization opportunities?
>
> Thanks!

--
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: Cleaning up the KVM clock
  2014-12-22 16:03   ` Andy Lutomirski
@ 2014-12-22 22:47     ` Paolo Bonzini
  2014-12-22 22:49     ` Paolo Bonzini
  1 sibling, 0 replies; 12+ messages in thread
From: Paolo Bonzini @ 2014-12-22 22:47 UTC (permalink / raw)
To: linux-kernel; +Cc: kvm

On 22/12/2014 17:03, Andy Lutomirski wrote:
>
> This is wrong.  The guest *kernel* might not see the intermediate
> state because the kernel (presumably it disabled migration while
> reading pvti), but the guest vdso can't do that and could very easily
> observe pvti while it's being written.

No.  kvm_guest_time_update is called by vcpu_enter_guest, while the
vCPU is not running, so it's entirely atomic from the point of view of
the guest.

> I'll send patches for the whole mess, complete with lots of comments,
> after I test them a bit today.

Ok, some comments can certainly help the discussion.

Paolo

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: Cleaning up the KVM clock
  2014-12-22 16:03   ` Andy Lutomirski
  2014-12-22 22:47     ` Paolo Bonzini
@ 2014-12-22 22:49     ` Paolo Bonzini
  2014-12-22 23:00       ` Andy Lutomirski
  [not found]           ` <CAAyOgsbNsDmPGzJcEuYExaswWaQoLHzkMVd+r6r-8jnGc00yTA@mail.gmail.com>
  1 sibling, 2 replies; 12+ messages in thread
From: Paolo Bonzini @ 2014-12-22 22:49 UTC (permalink / raw)
To: Andy Lutomirski, Marcelo Tosatti
Cc: Gleb Natapov, kvm list, linux-kernel@vger.kernel.org

On 22/12/2014 17:03, Andy Lutomirski wrote:
> This is wrong.  The guest *kernel* might not see the intermediate
> state because the kernel (presumably it disabled migration while
> reading pvti), but the guest vdso can't do that and could very easily
> observe pvti while it's being written.

No.  kvm_guest_time_update is called by vcpu_enter_guest, while the
vCPU is not running, so it's entirely atomic from the point of view of
the guest.

> I'll send patches for the whole mess, complete with lots of comments,
> after I test them a bit today.

Ok, some comments can certainly help the discussion.

Paolo

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: Cleaning up the KVM clock
  2014-12-22 22:49     ` Paolo Bonzini
@ 2014-12-22 23:00       ` Andy Lutomirski
  2014-12-22 23:14         ` Paolo Bonzini
  1 sibling, 1 reply; 12+ messages in thread
From: Andy Lutomirski @ 2014-12-22 23:00 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Marcelo Tosatti, Gleb Natapov, kvm list, linux-kernel@vger.kernel.org

On Mon, Dec 22, 2014 at 2:49 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 22/12/2014 17:03, Andy Lutomirski wrote:
>> This is wrong.  The guest *kernel* might not see the intermediate
>> state because the kernel (presumably it disabled migration while
>> reading pvti), but the guest vdso can't do that and could very easily
>> observe pvti while it's being written.
>
> No.  kvm_guest_time_update is called by vcpu_enter_guest, while the vCPU
> is not running, so it's entirely atomic from the point of view of the guest.

Which vCPU?  Unless kvm_guest_time_update freezes all of the vcpus,
then there's a race:

  vCPU 0 guest: __getcpu
  vdso thread migrates to vCPU 1
  vCPU 0 exits
  host starts writing pvti for vCPU 0
  vdso thread starts reading pvti
  host finishes writing pvti for vCPU 0
  vCPU 0 resumes
  vdso thread migrates back to vCPU 0
  __getcpu returns 0

and we fail.

>> I'll send patches for the whole mess, complete with lots of comments,
>> after I test them a bit today.
>
> Ok, some comments can certainly help the discussion.

I'm having a hard time testing, since KVM on 3.19-rc1 appears to be
entirely unusable.  No matter what I do, I get this very early in guest
boot:

KVM internal error.  Suberror: 1
emulation failure
EAX=000dee58 EBX=00000000 ECX=00000000 EDX=00000cfd
ESI=00000059 EDI=00000000 EBP=00000000 ESP=00006fc4
EIP=000f17f4 EFL=00010012 [----A--] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
CS =0008 00000000 ffffffff 00c09b00 DPL=0 CS32 [-RA]
SS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
DS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
FS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
GS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
TR =0000 00000000 0000ffff 00008b00 DPL=0 TSS32-busy
GDT=     000f6c58 00000037
IDT=     000f6c96 00000000
CR0=60000011 CR2=00000000 CR3=00000000 CR4=00000000
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000000
Code=e8 75 fc ff ff 89 f2 a8 10 89 d8 75 0a b9 74 17 ff ff ff d1 <5b>
5e c3 5b 5e e9 76 ff ff ff 57 56 53 8b 35 38 65 0f 00 85 f6 0f 88 be
00 00 00 0f b7 f6

and it sometimes comes with a lockdep splat, too.

--Andy

> Paolo

--
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: Cleaning up the KVM clock
  2014-12-22 23:00       ` Andy Lutomirski
@ 2014-12-22 23:14         ` Paolo Bonzini
  2014-12-22 23:31           ` Andy Lutomirski
  0 siblings, 1 reply; 12+ messages in thread
From: Paolo Bonzini @ 2014-12-22 23:14 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Marcelo Tosatti, Gleb Natapov, kvm list, linux-kernel@vger.kernel.org

On 23/12/2014 00:00, Andy Lutomirski wrote:
> Which vCPU?  Unless kvm_guest_time_update freezes all of the vcpus,
> then there's a race:
>
>   vCPU 0 guest: __getcpu
>   vdso thread migrates to vCPU 1
>   vCPU 0 exits
>   host starts writing pvti for vCPU 0
>   vdso thread starts reading pvti
>   host finishes writing pvti for vCPU 0
>   vCPU 0 resumes
>   vdso thread migrates back to vCPU 0
>   __getcpu returns 0
>
> and we fail.

Yes, it does.  See kvm_gen_update_masterclock.

See also http://www.spinics.net/lists/kvm/msg95533.html for some
discussion about KVM_REQ_MCLOCK_INPROGRESS.

> I'm having a hard time testing, since KVM on 3.19-rc1 appears to be
> entirely unusable.  No matter what I do, I get this very early in
> guest boot:
>
> [KVM internal error dump snipped]
>
> and it sometimes comes with a lockdep splat, too.

I can look at it tomorrow.  Does commit
2c4aa55a6af070262cca425745e8e54310e96b8d work for you?

Paolo

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: Cleaning up the KVM clock
  2014-12-22 23:14         ` Paolo Bonzini
@ 2014-12-22 23:31           ` Andy Lutomirski
  0 siblings, 0 replies; 12+ messages in thread
From: Andy Lutomirski @ 2014-12-22 23:31 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Marcelo Tosatti, Gleb Natapov, kvm list, linux-kernel@vger.kernel.org

On Mon, Dec 22, 2014 at 3:14 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 23/12/2014 00:00, Andy Lutomirski wrote:
>> Which vCPU?  Unless kvm_guest_time_update freezes all of the vcpus,
>> then there's a race:
>>
>> [...]
>>
>> and we fail.
>
> Yes, it does.  See kvm_gen_update_masterclock.
>
> See also http://www.spinics.net/lists/kvm/msg95533.html for some
> discussion about KVM_REQ_MCLOCK_INPROGRESS.

Ah.  Assuming that works, then most of my patches are unnecessary.

But then I have a different question: why do we bother doing the
__getcpu at all?  Can we rely on cpu 0's pvti to be appropriate for all
of the vcpus to use if the stable bit is set?

>> I'm having a hard time testing, since KVM on 3.19-rc1 appears to be
>> entirely unusable.  No matter what I do, I get this very early in
>> guest boot:
>>
>> [KVM internal error dump snipped]
>>
>> and it sometimes comes with a lockdep splat, too.
>
> I can look at it tomorrow.  Does commit
> 2c4aa55a6af070262cca425745e8e54310e96b8d work for you?

Nope.  Running:

qemu-system-x86_64 -machine accel=kvm:tcg -cpu host -parallel none \
  -net none -echr 1 -serial none \
  -chardev stdio,id=console,signal=off,mux=on \
  -serial chardev:console -mon chardev=console -vga none -display none

from L1, where L1 is 3.19-rc1 or 2c4aa55a6af070262cca425745e8e54310e96b8d
and L0 is a good Fedora kernel, results in the same failure after a
couple of seconds.  This is on Sandy Bridge Extreme.

I tried 3.19-rc1 on bare metal earlier today, and it didn't work any
better.

--Andy

> Paolo

--
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 12+ messages in thread
[parent not found: <CAAyOgsbNsDmPGzJcEuYExaswWaQoLHzkMVd+r6r-8jnGc00yTA@mail.gmail.com>]

* Re: Cleaning up the KVM clock
  [not found] ` <CAAyOgsbNsDmPGzJcEuYExaswWaQoLHzkMVd+r6r-8jnGc00yTA@mail.gmail.com>
@ 2014-12-23 10:25   ` Paolo Bonzini
  0 siblings, 0 replies; 12+ messages in thread
From: Paolo Bonzini @ 2014-12-23 10:25 UTC (permalink / raw)
To: Santosh Shukla
Cc: Andy Lutomirski, Marcelo Tosatti, Gleb Natapov, kvm list,
	linux-kernel@vger.kernel.org

On 23/12/2014 11:23, Santosh Shukla wrote:
>> No.  kvm_guest_time_update is called by vcpu_enter_guest, while the
>> vCPU is not running, so it's entirely atomic from the point of view
>> of the guest.
>
> Then checking for an odd value in the version field (at the guest side:
> pvclock_clocksource_read / pvclock_read_flags) is redundant, considering
> that kvm_guest_time_update increments it by 2.

The code is common to Xen and KVM.  Xen uses seqlock semantics.  The
cost of one AND is not detectable.

Paolo

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: Cleaning up the KVM clock
  2014-12-21  3:31 Cleaning up the KVM clock Andy Lutomirski
  2014-12-22 10:15 ` Paolo Bonzini
  2014-12-22 13:34 ` Marcelo Tosatti
@ 2014-12-22 13:38 ` Marcelo Tosatti
  2 siblings, 0 replies; 12+ messages in thread
From: Marcelo Tosatti @ 2014-12-22 13:38 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Gleb Natapov, Paolo Bonzini, kvm list, linux-kernel@vger.kernel.org

On Sat, Dec 20, 2014 at 07:31:19PM -0800, Andy Lutomirski wrote:
> I'm looking at the vdso timing code, and I'm puzzled by the pvclock
> code.  My motivation is comprehensibility, performance, and
> correctness.
>
> [benchmark output snipped]
>
> The current code is rather slow, especially compared to the tsc variant.
>
> The algorithm used by the pvclock vgetsns implementation is, approximately:
>
> cpu = getcpu;
> pvti = pointer to the relevant paravirt data
> version = pvti->version;
> rdtsc_barrier();
> tsc = rdtsc()
> delta = (tsc - x) * y >> z;
> cycles = delta + w;
> flags = pvti->flags;
> rdtsc_barrier();  <-- totally unnecessary

Why?

"The RDTSC instruction is not a serializing instruction.  It does not
necessarily wait until all previous instructions have been executed
before reading the counter.  Similarly, subsequent instructions may
begin execution before the read operation is performed."

^ permalink raw reply	[flat|nested] 12+ messages in thread