* Cleaning up the KVM clock
@ 2014-12-21 3:31 Andy Lutomirski
2014-12-22 10:15 ` Paolo Bonzini
` (2 more replies)
0 siblings, 3 replies; 12+ messages in thread
From: Andy Lutomirski @ 2014-12-21 3:31 UTC (permalink / raw)
To: Gleb Natapov, Paolo Bonzini, Marcelo Tosatti, kvm list,
linux-kernel@vger.kernel.org
I'm looking at the vdso timing code, and I'm puzzled by the pvclock
code. My motivation is comprehensibility, performance, and
correctness.
# for i in `seq 10`; do ./timing_test_64 10 vclock_gettime 0; done
10000000 loops in 0.69138s = 69.14 nsec / loop
10000000 loops in 0.63614s = 63.61 nsec / loop
10000000 loops in 0.63213s = 63.21 nsec / loop
10000000 loops in 0.63087s = 63.09 nsec / loop
10000000 loops in 0.63079s = 63.08 nsec / loop
10000000 loops in 0.63096s = 63.10 nsec / loop
10000000 loops in 0.63096s = 63.10 nsec / loop
10000000 loops in 0.63062s = 63.06 nsec / loop
10000000 loops in 0.63100s = 63.10 nsec / loop
10000000 loops in 0.63112s = 63.11 nsec / loop
bash-4.3# echo tsc
>/sys/devices/system/clocksource/clocksource0/current_clocksource
[ 45.957524] Switched to clocksource tsc
bash-4.3# for i in `seq 10`; do ./timing_test_64 10 vclock_gettime 0;
done10000000 loops in 0.33583s = 33.58 nsec / loop
10000000 loops in 0.28530s = 28.53 nsec / loop
10000000 loops in 0.28904s = 28.90 nsec / loop
10000000 loops in 0.29001s = 29.00 nsec / loop
10000000 loops in 0.28775s = 28.78 nsec / loop
10000000 loops in 0.30102s = 30.10 nsec / loop
10000000 loops in 0.28006s = 28.01 nsec / loop
10000000 loops in 0.28584s = 28.58 nsec / loop
10000000 loops in 0.28175s = 28.17 nsec / loop
10000000 loops in 0.28724s = 28.72 nsec / loop
The current code is rather slow, especially compared to the tsc variant.
The algorithm used by the pvclock vgetsns implementation is, approximately:
cpu = getcpu;
pvti = pointer to the relevant paravirt data
version = pvti->version;
rdtsc_barrier();
tsc = rdtsc()
delta = (tsc - x) * y >> z;
cycles = delta + w;
flags = pvti->flags;
rdtsc_barrier(); <-- totally unnecessary
cpu1 = getcpu;
if (cpu != cpu1 || we missed the seqlock)
retry;
if (!stable)
bail;
After that, the main vclock_gettime code applies the kernel's regular
time adjustments.
First, is there any guarantee that, if pvti is marked as stable, the
pvti data is consistent across cpus?  If so (which would be really
nice), then we could always use vcpu 0's pvti, which would be a really
nice cleanup.
If not, then the current algorithm is buggy. There is no guarantee
that the tsc stamp we get matches the cpu whose pvti we looked at. We
could fix that using rdtscp.
I think it's also rather strange that the return value is "cycles"
instead of nanoseconds. If the guest is using pvclock *and* ntp,
isn't something very wrong?
Can the algorithm just be:
tsc, cpu = rdtscp;
pvti = pvti for cpu
read the scale, offset, etc;
if (!stable)
bail;
barrier();
read pvti->tsc_timestamp;
if (tsc < pvti->tsc_timestamp)
retry;
if (the versions are unhappy)
retry;
return the computed nanosecond count;
I think this is likely to be at least as correct as the current
algorithm, if not more so, and it correctly handles the case where we
migrate to a different vcpu in the middle. (I also think that, with
this algorithm, the version check should also be unnecessary, since if
we race with a host update, we'll fail the tsc < pvti->tsc_timestamp
check.)
It would be even nicer, though, if we could do much the same thing but
without worrying about which vcpu we're on.
Thoughts? Am I missing some considerations here?
--Andy
--
Andy Lutomirski
AMA Capital Management, LLC
^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Cleaning up the KVM clock
  2014-12-21  3:31 Cleaning up the KVM clock Andy Lutomirski
@ 2014-12-22 10:15 ` Paolo Bonzini
  2014-12-22 13:34 ` Marcelo Tosatti
  2014-12-22 13:38 ` Marcelo Tosatti
  2 siblings, 0 replies; 12+ messages in thread
From: Paolo Bonzini @ 2014-12-22 10:15 UTC (permalink / raw)
To: Andy Lutomirski, Gleb Natapov, Marcelo Tosatti, kvm list,
	linux-kernel@vger.kernel.org

On 21/12/2014 04:31, Andy Lutomirski wrote:
> I'm looking at the vdso timing code, and I'm puzzled by the pvclock
> code.  My motivation is comprehensibility, performance, and
> correctness.
>
> [benchmark output snipped]
>
> The current code is rather slow, especially compared to the tsc variant.
>
> The algorithm used by the pvclock vgetsns implementation is, approximately:
>
> cpu = getcpu;
> pvti = pointer to the relevant paravirt data
> version = pvti->version;
> rdtsc_barrier();
> tsc = rdtsc()
> delta = (tsc - x) * y >> z;
> cycles = delta + w;
> flags = pvti->flags;
> rdtsc_barrier();  <-- totally unnecessary

It's not unnecessary.  The first one is a "lock", the second is an
"unlock".  You can move the second rdtsc_barrier below the cpu/seqlock
check though.

> cpu1 = getcpu;
> if (cpu != cpu1 || we missed the seqlock)
>     retry;
>
> if (!stable)
>     bail;
>
> After that, the main vclock_gettime code applies the kernel's regular
> time adjustments.
>
> First, is there any guarantee that, if pvti is marked as stable, the
> pvti data is consistent across cpus?  If so (which would be really
> nice), then we could always use vcpu 0's pvti, which would be a really
> nice cleanup.

I think you cannot because the TSCs might not be perfectly synced up
even if the rates are, but...

> If not, then the current algorithm is buggy.  There is no guarantee
> that the tsc stamp we get matches the cpu whose pvti we looked at.  We
> could fix that using rdtscp.

... Marcelo will have to answer this.

> I think it's also rather strange that the return value is "cycles"
> instead of nanoseconds.  If the guest is using pvclock *and* ntp,
> isn't something very wrong?

It's not cycles.  pvclock_get_nsec_offset returns nanoseconds, and
__pvclock_read_cycles does the same.

Patches are welcome. :)

Paolo

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: Cleaning up the KVM clock
  2014-12-21  3:31 Cleaning up the KVM clock Andy Lutomirski
  2014-12-22 10:15 ` Paolo Bonzini
@ 2014-12-22 13:34 ` Marcelo Tosatti
  2014-12-22 14:09   ` Marcelo Tosatti
  2014-12-22 16:03   ` Andy Lutomirski
  2014-12-22 13:38 ` Marcelo Tosatti
  2 siblings, 2 replies; 12+ messages in thread
From: Marcelo Tosatti @ 2014-12-22 13:34 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Gleb Natapov, Paolo Bonzini, kvm list, linux-kernel@vger.kernel.org

On Sat, Dec 20, 2014 at 07:31:19PM -0800, Andy Lutomirski wrote:
> I'm looking at the vdso timing code, and I'm puzzled by the pvclock
> code.  My motivation is comprehensibility, performance, and
> correctness.
>
> [benchmark output snipped]
>
> The current code is rather slow, especially compared to the tsc variant.
>
> [...]
>
> First, is there any guarantee that, if pvti is marked as stable, the
> pvti data is consistent across cpus?  If so (which would be really
> nice), then we could always use vcpu 0's pvti, which would be a really
> nice cleanup.
>
> If not, then the current algorithm is buggy.  There is no guarantee
> that the tsc stamp we get matches the cpu whose pvti we looked at.  We
> could fix that using rdtscp.

Please read the comment at arch/x86/kvm/x86.c which starts with

"Assuming a stable TSC across physical CPUS, and a stable TSC".

> I think it's also rather strange that the return value is "cycles"
> instead of nanoseconds.  If the guest is using pvclock *and* ntp,
> isn't something very wrong?
>
> Can the algorithm just be:
>
> tsc, cpu = rdtscp;
> pvti = pvti for cpu
>
> read the scale, offset, etc;
> if (!stable)
>     bail;

"The RDTSCP instruction waits until all previous instructions have been
executed before reading the counter.  However, subsequent instructions
may begin execution before the read operation is performed."

So you would need a barrier there after RDTSCP.

> barrier();
> read pvti->tsc_timestamp;
> if (tsc < pvti->tsc_timestamp)
>     retry;

A kvmclock update does not necessarily update tsc_timestamp.  See

"	/*
	 * If the host uses TSC clock, then passthrough TSC as stable
	 * to the guest.
	 */
	spin_lock(&ka->pvclock_gtod_sync_lock);
	use_master_clock = ka->use_master_clock;
	if (use_master_clock) {
		host_tsc = ka->master_cycle_now;
		kernel_ns = ka->master_kernel_ns;
	}
"

at arch/x86/kvm/x86.c.

> if (the versions are unhappy)
>     retry;
> return the computed nanosecond count;
>
> I think this is likely to be at least as correct as the current
> algorithm, if not more so, and it correctly handles the case where we
> migrate to a different vcpu in the middle.  (I also think that, with
> this algorithm, the version check should also be unnecessary, since if
> we race with a host update, we'll fail the tsc < pvti->tsc_timestamp
> check.)
>
> It would be even nicer, though, if we could do much the same thing but
> without worrying about which vcpu we're on.
>
> Thoughts?  Am I missing some considerations here?

Maybe we can find other optimization opportunities?

Thanks!

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: Cleaning up the KVM clock
  2014-12-22 13:34 ` Marcelo Tosatti
@ 2014-12-22 14:09   ` Marcelo Tosatti
  2014-12-22 16:03   ` Andy Lutomirski
  1 sibling, 0 replies; 12+ messages in thread
From: Marcelo Tosatti @ 2014-12-22 14:09 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Gleb Natapov, Paolo Bonzini, kvm list, linux-kernel@vger.kernel.org

On Mon, Dec 22, 2014 at 11:34:30AM -0200, Marcelo Tosatti wrote:
> > It would be even nicer, though, if we could do much the same thing but
> > without worrying about which vcpu we're on.
> >
> > Thoughts?  Am I missing some considerations here?
>
> Maybe we can find other optimization opportunities?

Perhaps RDTSCP rather than getcpu is a win?

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: Cleaning up the KVM clock
  2014-12-22 13:34 ` Marcelo Tosatti
  2014-12-22 14:09   ` Marcelo Tosatti
@ 2014-12-22 16:03   ` Andy Lutomirski
  2014-12-22 22:47     ` Paolo Bonzini
  2014-12-22 22:49     ` Paolo Bonzini
  1 sibling, 2 replies; 12+ messages in thread
From: Andy Lutomirski @ 2014-12-22 16:03 UTC (permalink / raw)
To: Marcelo Tosatti
Cc: Gleb Natapov, Paolo Bonzini, kvm list, linux-kernel@vger.kernel.org

On Mon, Dec 22, 2014 at 5:34 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> On Sat, Dec 20, 2014 at 07:31:19PM -0800, Andy Lutomirski wrote:
>> [...]
>>
>> If not, then the current algorithm is buggy.  There is no guarantee
>> that the tsc stamp we get matches the cpu whose pvti we looked at.  We
>> could fix that using rdtscp.
>
> Please read the comment at arch/x86/kvm/x86.c which starts with
>
> "Assuming a stable TSC across physical CPUS, and a stable TSC".
>
>> Can the algorithm just be:
>>
>> tsc, cpu = rdtscp;
>> pvti = pvti for cpu
>>
>> read the scale, offset, etc;
>> if (!stable)
>>     bail;
>
> "The RDTSCP instruction waits until all previous instructions have been
> executed before reading the counter.  However, subsequent instructions
> may begin execution before the read operation is performed."
>
> So you would need a barrier there after RDTSCP.

After considerable manual reading and experimentation a couple years
ago, the conclusion was that:

 - RDTSCP is ordered like a load on AMD and Intel.  That means that you
   can't observe RDTSCP by itself failing to be monotonic across CPUs.

 - RDTSC by itself is not ordered.  It's easy to observe it behaving
   non-monotonically.

 - rdtsc_barrier(); RDTSC is ordered like RDTSCP on AMD and Intel.

>> barrier();
>> read pvti->tsc_timestamp;
>> if (tsc < pvti->tsc_timestamp)
>>     retry;
>
> A kvmclock update does not necessarily update tsc_timestamp.

Hmm.

> See
>
> "	/*
>	 * If the host uses TSC clock, then passthrough TSC as stable
>	 * to the guest.
>	 */
>	spin_lock(&ka->pvclock_gtod_sync_lock);
>	use_master_clock = ka->use_master_clock;
>	if (use_master_clock) {
>		host_tsc = ka->master_cycle_now;
>		kernel_ns = ka->master_kernel_ns;
>	}
> "
>
> at arch/x86/kvm/x86.c.

So there's a much bigger problem here.  Despite the read implementation
and the docs in Documentation/, the KVM host doesn't actually use the
version field the way it's supposed to.  It just updates the whole pvti
with one __copy_to_user.  It has a comment:

 * The interface expects us to write an even number signaling that the
 * update is finished.  Since the guest won't see the intermediate
 * state, we just increase by 2 at the end.

This is wrong.  The guest *kernel* might not see the intermediate state,
because the kernel presumably disables migration while reading pvti, but
the guest vdso can't do that and could very easily observe pvti while
it's being written.

Also, __getcpu is completely unordered on current kernels, so it doesn't
generate the code that anyone would expect.  I'll fix that.

I'll send patches for the whole mess, complete with lots of comments,
after I test them a bit today.

--Andy

>> [...]
>>
>> Thoughts?  Am I missing some considerations here?
>
> Maybe we can find other optimization opportunities?
>
> Thanks!

--
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: Cleaning up the KVM clock
  2014-12-22 16:03   ` Andy Lutomirski
@ 2014-12-22 22:47     ` Paolo Bonzini
  2014-12-22 22:49     ` Paolo Bonzini
  1 sibling, 0 replies; 12+ messages in thread
From: Paolo Bonzini @ 2014-12-22 22:47 UTC (permalink / raw)
To: linux-kernel; +Cc: kvm

On 22/12/2014 17:03, Andy Lutomirski wrote:
>
> This is wrong.  The guest *kernel* might not see the intermediate
> state because the kernel (presumably it disabled migration while
> reading pvti), but the guest vdso can't do that and could very easily
> observe pvti while it's being written.

No.  kvm_guest_time_update is called by vcpu_enter_guest, while the
vCPU is not running, so it's entirely atomic from the point of view of
the guest.

> I'll send patches for the whole mess, complete with lots of comments,
> after I test them a bit today.

Ok, some comments can certainly help the discussion.

Paolo

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: Cleaning up the KVM clock
  2014-12-22 16:03   ` Andy Lutomirski
  2014-12-22 22:47     ` Paolo Bonzini
@ 2014-12-22 22:49     ` Paolo Bonzini
  2014-12-22 23:00       ` Andy Lutomirski
  [not found]           ` <CAAyOgsbNsDmPGzJcEuYExaswWaQoLHzkMVd+r6r-8jnGc00yTA@mail.gmail.com>
  1 sibling, 2 replies; 12+ messages in thread
From: Paolo Bonzini @ 2014-12-22 22:49 UTC (permalink / raw)
To: Andy Lutomirski, Marcelo Tosatti
Cc: Gleb Natapov, kvm list, linux-kernel@vger.kernel.org

On 22/12/2014 17:03, Andy Lutomirski wrote:
> This is wrong.  The guest *kernel* might not see the intermediate
> state because the kernel (presumably it disabled migration while
> reading pvti), but the guest vdso can't do that and could very easily
> observe pvti while it's being written.

No.  kvm_guest_time_update is called by vcpu_enter_guest, while the
vCPU is not running, so it's entirely atomic from the point of view of
the guest.

> I'll send patches for the whole mess, complete with lots of comments,
> after I test them a bit today.

Ok, some comments can certainly help the discussion.

Paolo

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: Cleaning up the KVM clock
  2014-12-22 22:49     ` Paolo Bonzini
@ 2014-12-22 23:00       ` Andy Lutomirski
  2014-12-22 23:14         ` Paolo Bonzini
  1 sibling, 1 reply; 12+ messages in thread
From: Andy Lutomirski @ 2014-12-22 23:00 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Marcelo Tosatti, Gleb Natapov, kvm list, linux-kernel@vger.kernel.org

On Mon, Dec 22, 2014 at 2:49 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 22/12/2014 17:03, Andy Lutomirski wrote:
>> This is wrong.  The guest *kernel* might not see the intermediate
>> state because the kernel (presumably it disabled migration while
>> reading pvti), but the guest vdso can't do that and could very easily
>> observe pvti while it's being written.
>
> No.  kvm_guest_time_update is called by vcpu_enter_guest, while the vCPU
> is not running, so it's entirely atomic from the point of view of the guest.

Which vCPU?  Unless kvm_guest_time_update freezes all of the vcpus,
then there's a race:

  vCPU 0 guest: __getcpu
  vdso thread migrates to vCPU 1
  vCPU 0 exits
  host starts writing pvti for vCPU 0
  vdso thread starts reading pvti
  host finishes writing pvti for vCPU 0
  vCPU 0 resumes
  vdso thread migrates back to vCPU 0
  __getcpu returns 0

and we fail.

>> I'll send patches for the whole mess, complete with lots of comments,
>> after I test them a bit today.
>
> Ok, some comments can certainly help the discussion.

I'm having a hard time testing, since KVM on 3.19-rc1 appears to be
entirely unusable.  No matter what I do, I get this very early in guest
boot:

KVM internal error.  Suberror: 1
emulation failure
EAX=000dee58 EBX=00000000 ECX=00000000 EDX=00000cfd
ESI=00000059 EDI=00000000 EBP=00000000 ESP=00006fc4
EIP=000f17f4 EFL=00010012 [----A--] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
CS =0008 00000000 ffffffff 00c09b00 DPL=0 CS32 [-RA]
SS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
DS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
FS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
GS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
TR =0000 00000000 0000ffff 00008b00 DPL=0 TSS32-busy
GDT=     000f6c58 00000037
IDT=     000f6c96 00000000
CR0=60000011 CR2=00000000 CR3=00000000 CR4=00000000
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000000
Code=e8 75 fc ff ff 89 f2 a8 10 89 d8 75 0a b9 74 17 ff ff ff d1 <5b>
5e c3 5b 5e e9 76 ff ff ff 57 56 53 8b 35 38 65 0f 00 85 f6 0f 88 be
00 00 00 0f b7 f6

and it sometimes comes with a lockdep splat, too.

--Andy

> Paolo

--
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: Cleaning up the KVM clock
  2014-12-22 23:00       ` Andy Lutomirski
@ 2014-12-22 23:14         ` Paolo Bonzini
  2014-12-22 23:31           ` Andy Lutomirski
  0 siblings, 1 reply; 12+ messages in thread
From: Paolo Bonzini @ 2014-12-22 23:14 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Marcelo Tosatti, Gleb Natapov, kvm list, linux-kernel@vger.kernel.org

On 23/12/2014 00:00, Andy Lutomirski wrote:
> Which vCPU?  Unless kvm_guest_time_update freezes all of the vcpus,
> then there's a race:
>
>   vCPU 0 guest: __getcpu
>   vdso thread migrates to vCPU 1
>   vCPU 0 exits
>   host starts writing pvti for vCPU 0
>   vdso thread starts reading pvti
>   host finishes writing pvti for vCPU 0
>   vCPU 0 resumes
>   vdso thread migrates back to vCPU 0
>   __getcpu returns 0
>
> and we fail.

Yes, it does.  See kvm_gen_update_masterclock.

See also http://www.spinics.net/lists/kvm/msg95533.html for some
discussion about KVM_REQ_MCLOCK_INPROGRESS.

> I'm having a hard time testing, since KVM on 3.19-rc1 appears to be
> entirely unusable.  No matter what I do, I get this very early in
> guest boot:
>
> [KVM internal error dump snipped]
>
> and it sometimes comes with a lockdep splat, too.

I can look at it tomorrow.  Does commit
2c4aa55a6af070262cca425745e8e54310e96b8d work for you?

Paolo

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: Cleaning up the KVM clock
  2014-12-22 23:14         ` Paolo Bonzini
@ 2014-12-22 23:31           ` Andy Lutomirski
  0 siblings, 0 replies; 12+ messages in thread
From: Andy Lutomirski @ 2014-12-22 23:31 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Marcelo Tosatti, Gleb Natapov, kvm list, linux-kernel@vger.kernel.org

On Mon, Dec 22, 2014 at 3:14 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 23/12/2014 00:00, Andy Lutomirski wrote:
>> Which vCPU?  Unless kvm_guest_time_update freezes all of the vcpus,
>> then there's a race:
>>
>> [...]
>>
>> and we fail.
>
> Yes, it does.  See kvm_gen_update_masterclock.
>
> See also http://www.spinics.net/lists/kvm/msg95533.html for some
> discussion about KVM_REQ_MCLOCK_INPROGRESS.

Ah.  Assuming that works, then most of my patches are unnecessary.

But then I have a different question: why do we bother doing the
__getcpu at all?  Can we rely on cpu 0's pvti to be appropriate for all
of the vcpus to use if the stable bit is set?

>> I'm having a hard time testing, since KVM on 3.19-rc1 appears to be
>> entirely unusable.  No matter what I do, I get this very early in
>> guest boot:
>>
>> [KVM internal error dump snipped]
>>
>> and it sometimes comes with a lockdep splat, too.
>
> I can look at it tomorrow.  Does commit
> 2c4aa55a6af070262cca425745e8e54310e96b8d work for you?

Nope.  Running:

qemu-system-x86_64 -machine accel=kvm:tcg -cpu host -parallel none \
  -net none -echr 1 -serial none \
  -chardev stdio,id=console,signal=off,mux=on \
  -serial chardev:console -mon chardev=console -vga none -display none

from L1, where L1 is 3.19-rc1 or 2c4aa55a6af070262cca425745e8e54310e96b8d
and L0 is a good Fedora kernel, results in the same failure after a
couple of seconds.  This is on Sandy Bridge Extreme.

I tried 3.19-rc1 on bare metal earlier today, and it didn't work any
better.

--Andy

> Paolo

--
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 12+ messages in thread
[parent not found: <CAAyOgsbNsDmPGzJcEuYExaswWaQoLHzkMVd+r6r-8jnGc00yTA@mail.gmail.com>]

* Re: Cleaning up the KVM clock
  [not found] ` <CAAyOgsbNsDmPGzJcEuYExaswWaQoLHzkMVd+r6r-8jnGc00yTA@mail.gmail.com>
@ 2014-12-23 10:25   ` Paolo Bonzini
  0 siblings, 0 replies; 12+ messages in thread
From: Paolo Bonzini @ 2014-12-23 10:25 UTC (permalink / raw)
To: Santosh Shukla
Cc: Andy Lutomirski, Marcelo Tosatti, Gleb Natapov, kvm list,
	linux-kernel@vger.kernel.org

On 23/12/2014 11:23, Santosh Shukla wrote:
>> No.  kvm_guest_time_update is called by vcpu_enter_guest, while the
>> vCPU is not running, so it's entirely atomic from the point of view
>> of the guest.
>
> Then checking for an odd value in the version field (at the guest side:
> pvclock_clocksource_read / pvclock_read_flags) is redundant, considering
> that kvm_guest_time_update increments it by 2.

The code is common to Xen and KVM.  Xen uses seqlock semantics.  The
cost of one AND is not detectable.

Paolo

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: Cleaning up the KVM clock
  2014-12-21  3:31 Cleaning up the KVM clock Andy Lutomirski
  2014-12-22 10:15 ` Paolo Bonzini
  2014-12-22 13:34 ` Marcelo Tosatti
@ 2014-12-22 13:38 ` Marcelo Tosatti
  2 siblings, 0 replies; 12+ messages in thread
From: Marcelo Tosatti @ 2014-12-22 13:38 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Gleb Natapov, Paolo Bonzini, kvm list, linux-kernel@vger.kernel.org

On Sat, Dec 20, 2014 at 07:31:19PM -0800, Andy Lutomirski wrote:
> I'm looking at the vdso timing code, and I'm puzzled by the pvclock
> code.  My motivation is comprehensibility, performance, and
> correctness.
>
> [benchmark output snipped]
>
> The current code is rather slow, especially compared to the tsc variant.
>
> The algorithm used by the pvclock vgetsns implementation is, approximately:
>
> cpu = getcpu;
> pvti = pointer to the relevant paravirt data
> version = pvti->version;
> rdtsc_barrier();
> tsc = rdtsc()
> delta = (tsc - x) * y >> z;
> cycles = delta + w;
> flags = pvti->flags;
> rdtsc_barrier();  <-- totally unnecessary

Why?

"The RDTSC instruction is not a serializing instruction.  It does not
necessarily wait until all previous instructions have been executed
before reading the counter.  Similarly, subsequent instructions may
begin execution before the read operation is performed."

^ permalink raw reply	[flat|nested] 12+ messages in thread