* [Qemu-devel] memory: memory_region_transaction_commit() slow
@ 2014-06-25 17:53 Etienne Martineau
2014-06-25 18:58 ` Paolo Bonzini
2014-06-26 8:18 ` Avi Kivity
0 siblings, 2 replies; 8+ messages in thread
From: Etienne Martineau @ 2014-06-25 17:53 UTC (permalink / raw)
To: gonglei, Fam, Paolo Bonzini, Peter Crosthwaite, qemu-devel
Hi,
It seems to me that there is a scaling issue, O(n), in memory_region_transaction_commit().
Basically, the time it takes to rebuild the memory view during device assignment
(pci_bridge_update_mappings()) increases linearly with the number of devices
already assigned to the guest.
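For reference, the commit path in question looks roughly like this -- a simplified
sketch of memory_region_transaction_commit() from memory.c of this period (details
vary between versions), showing that every commit rebuilds the flat view of every
address space, and PCI gives you one address space per device:

/* Simplified sketch, not the verbatim source. */
void memory_region_transaction_commit(void)
{
    AddressSpace *as;

    assert(memory_region_transaction_depth);
    --memory_region_transaction_depth;
    if (!memory_region_transaction_depth && memory_region_update_pending) {
        MEMORY_LISTENER_CALL_GLOBAL(begin, Forward);

        QTAILQ_FOREACH(as, &address_spaces, address_spaces_link) {
            address_space_update_topology(as);   /* O(#regions) per address space */
        }

        memory_region_update_pending = false;
        MEMORY_LISTENER_CALL_GLOBAL(commit, Forward);
    }
}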
I'm running on a recent qemu.git, and I merged from the 'memory' branch of
git://github.com/bonzini/qemu.git the following patches, which seem to be related
to the scaling issue I'm facing:
Fam Zheng (1):
memory: Don't call memory_region_update_coalesced_range if nothing changed
Gonglei (1):
memory: Don't update all memory region when ioeventfd changed
Those patches help but don't fix the issue. The problem becomes more noticeable
when many devices are assigned to the guest.
I'm running my test on a QEMU q35 machine with the following topology:
ioh3420 ( root port )
x3130-upstream
xio3130-downstream
xio3130-downstream
xio3130-downstream
...
I have added instrumentation in kvm_cpu_exec() to track the amount of time spent
in emulation (patch at the end, but not essential to this discussion).
Here is what I see when I assign devices one after the other. NOTE: the times are
in msec. The linear increase in time comes from memory_region_transaction_commit().
(qemu) device_add pci-assign,host=28:10.1,bus=pciehp.3.7
QEMU long exit vCPU 0 25 2
QEMU long exit vCPU 0 22 2
QEMU long exit vCPU 0 21 2
QEMU long exit vCPU 0 22 2
QEMU long exit vCPU 0 21 2
QEMU long exit vCPU 0 21 2
QEMU long exit vCPU 0 21 2
QEMU long exit vCPU 0 21 2
QEMU long exit vCPU 0 21 2
QEMU long exit vCPU 0 21 2
QEMU long exit vCPU 0 21 2
QEMU long exit vCPU 0 45 2 <<<
QEMU long exit vCPU 0 23 2
(qemu) device_add pci-assign,host=28:10.2,bus=pciehp.3.8
QEMU long exit vCPU 0 26 2
QEMU long exit vCPU 0 24 2
QEMU long exit vCPU 0 23 2
QEMU long exit vCPU 0 23 2
QEMU long exit vCPU 0 23 2
QEMU long exit vCPU 0 23 2
QEMU long exit vCPU 0 23 2
QEMU long exit vCPU 0 23 2
QEMU long exit vCPU 0 23 2
QEMU long exit vCPU 0 23 2
QEMU long exit vCPU 0 23 2
QEMU long exit vCPU 0 49 2 <<<
QEMU long exit vCPU 0 25 2
(qemu) device_add pci-assign,host=28:10.3,bus=pciehp.3.9
QEMU long exit vCPU 0 28 2
QEMU long exit vCPU 0 26 2
QEMU long exit vCPU 0 25 2
QEMU long exit vCPU 0 25 2
QEMU long exit vCPU 0 25 2
QEMU long exit vCPU 0 25 2
QEMU long exit vCPU 0 24 2
QEMU long exit vCPU 0 24 2
QEMU long exit vCPU 0 24 2
QEMU long exit vCPU 0 24 2
QEMU long exit vCPU 0 24 2
QEMU long exit vCPU 0 52 2 <<<
QEMU long exit vCPU 0 26 2
(qemu) device_add pci-assign,host=28:10.4,bus=pciehp.3.10
QEMU long exit vCPU 0 35 2
QEMU long exit vCPU 0 28 2
QEMU long exit vCPU 0 26 2
QEMU long exit vCPU 0 27 2
QEMU long exit vCPU 0 26 2
QEMU long exit vCPU 0 26 2
QEMU long exit vCPU 0 26 2
QEMU long exit vCPU 0 26 2
QEMU long exit vCPU 0 26 2
QEMU long exit vCPU 0 26 2
QEMU long exit vCPU 0 26 2
QEMU long exit vCPU 0 56 2 <<<
QEMU long exit vCPU 0 28 2
(qemu) device_add pci-assign,host=28:10.5,bus=pciehp.3.11
QEMU long exit vCPU 0 33 2
QEMU long exit vCPU 0 30 2
QEMU long exit vCPU 0 28 2
QEMU long exit vCPU 0 29 2
QEMU long exit vCPU 0 28 2
QEMU long exit vCPU 0 28 2
QEMU long exit vCPU 0 28 2
QEMU long exit vCPU 0 28 2
QEMU long exit vCPU 0 28 2
QEMU long exit vCPU 0 28 2
QEMU long exit vCPU 0 28 2
QEMU long exit vCPU 0 59 2 <<<
QEMU long exit vCPU 0 30 2
diff --git a/kvm-all.c b/kvm-all.c
index 3ae30ee..e3a1964 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -1685,6 +1685,8 @@ int kvm_cpu_exec(CPUState *cpu)
{
struct kvm_run *run = cpu->kvm_run;
int ret, run_ret;
+ int64_t clock_ns, delta_ms;
+ __u32 last_exit_reason, last_vcpu;
DPRINTF("kvm_cpu_exec()\n");
@@ -1711,6 +1713,12 @@ int kvm_cpu_exec(CPUState *cpu)
}
qemu_mutex_unlock_iothread();
+ delta_ms = (get_clock() - clock_ns)/(1000*1000);
+ if( delta_ms >= 10){
+ fprintf(stderr, "QEMU long exit vCPU %d %ld %d\n",last_vcpu,
+ delta_ms, last_exit_reason);
+ }
+
run_ret = kvm_vcpu_ioctl(cpu, KVM_RUN, 0);
qemu_mutex_lock_iothread();
@@ -1727,7 +1735,15 @@ int kvm_cpu_exec(CPUState *cpu)
abort();
}
+ /*
+ * Capture exit reasons
+ */
+ clock_ns = get_clock();
+ last_exit_reason = run->exit_reason;
+ last_vcpu = cpu->cpu_index;
+
thanks,
Etienne
* Re: [Qemu-devel] memory: memory_region_transaction_commit() slow
2014-06-25 17:53 [Qemu-devel] memory: memory_region_transaction_commit() slow Etienne Martineau
@ 2014-06-25 18:58 ` Paolo Bonzini
2014-06-25 20:41 ` Etienne Martineau
2014-06-26 3:52 ` Peter Crosthwaite
2014-06-26 8:18 ` Avi Kivity
1 sibling, 2 replies; 8+ messages in thread
From: Paolo Bonzini @ 2014-06-25 18:58 UTC (permalink / raw)
To: Etienne Martineau, gonglei, Fam, Peter Crosthwaite, qemu-devel
On 25/06/2014 19:53, Etienne Martineau wrote:
>
> It seems to me that there is a scaling issue, O(n), in memory_region_transaction_commit().
>
> Basically, the time it takes to rebuild the memory view during device assignment
> (pci_bridge_update_mappings()) increases linearly with the number of devices
> already assigned to the guest.
That's correct, unfortunately. It can be fixed; it's not hard, but also
not trivial.
Basically, you can detect address spaces whose root memory region is an alias
of another address space's root memory region. You can then reuse that other
address space's FlatView instead of building a new one.
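Something along these lines, perhaps -- a rough, untested sketch against the
memory.c internals of this period; flatview_ref() and generate_memory_topology()
are the existing helpers, while the alias-chasing loop and the wrapper function
itself are hypothetical:

/* Rough sketch only: resolve a root that is merely a full-size, zero-offset
 * alias, and if some other address space already has a FlatView for the same
 * effective root, take a reference on that view instead of rebuilding it.
 * (Resolving the other side's aliases and invalidation are omitted.) */
static FlatView *address_space_get_or_build_flatview(AddressSpace *as)
{
    MemoryRegion *root = as->root;
    AddressSpace *other;

    while (root->alias && root->alias_offset == 0 &&
           int128_eq(root->size, root->alias->size)) {
        root = root->alias;
    }

    QTAILQ_FOREACH(other, &address_spaces, address_spaces_link) {
        if (other != as && other->root == root && other->current_map) {
            flatview_ref(other->current_map);
            return other->current_map;
        }
    }

    return generate_memory_topology(root);   /* fall back to a fresh build */
}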
Paolo
* Re: [Qemu-devel] memory: memory_region_transaction_commit() slow
2014-06-25 18:58 ` Paolo Bonzini
@ 2014-06-25 20:41 ` Etienne Martineau
2014-06-26 3:52 ` Peter Crosthwaite
1 sibling, 0 replies; 8+ messages in thread
From: Etienne Martineau @ 2014-06-25 20:41 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: Peter Crosthwaite, gonglei, Fam, qemu-devel
On 14-06-25 02:58 PM, Paolo Bonzini wrote:
> On 25/06/2014 19:53, Etienne Martineau wrote:
>>
>> It seems to me that there is a scaling issue, O(n), in memory_region_transaction_commit().
>>
>> Basically, the time it takes to rebuild the memory view during device assignment
>> (pci_bridge_update_mappings()) increases linearly with the number of devices
>> already assigned to the guest.
>
> That's correct, unfortunately. It can be fixed; it's not hard, but also not trivial.
>
> Basically, you can detect address spaces whose root memory region is an alias of another address space's root memory region. You can then reuse that other address space's FlatView instead of building a new one.
Thanks for your reply.
I'm not sure I understand what you mean by 'reuse that address space's FlatView'.
Are you suggesting pushing updates directly into the FlatView? If so, wouldn't any
further update to that address space wipe out the FlatView changes previously made?
thanks,
Etienne
* Re: [Qemu-devel] memory: memory_region_transaction_commit() slow
2014-06-25 18:58 ` Paolo Bonzini
2014-06-25 20:41 ` Etienne Martineau
@ 2014-06-26 3:52 ` Peter Crosthwaite
2014-06-26 15:02 ` Etienne Martineau
1 sibling, 1 reply; 8+ messages in thread
From: Peter Crosthwaite @ 2014-06-26 3:52 UTC (permalink / raw)
To: Paolo Bonzini
Cc: gonglei, Fam, qemu-devel@nongnu.org Developers, Etienne Martineau
On Thu, Jun 26, 2014 at 4:58 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 25/06/2014 19:53, Etienne Martineau wrote:
>
>> It seems to me that there is a scaling issue, O(n), in
>> memory_region_transaction_commit().
>>
>> Basically, the time it takes to rebuild the memory view during device
>> assignment (pci_bridge_update_mappings()) increases linearly with the
>> number of devices already assigned to the guest.
>
> That's correct, unfortunately. It can be fixed; it's not hard, but also not
> trivial.
>
> Basically, you can detect address spaces whose root memory region is an
> alias of another address space's root memory region. You can then reuse
> that other address space's FlatView instead of building a new one.
>
Sounds like my shareable address spaces scheme:
http://lists.gnu.org/archive/html/qemu-devel/2014-06/msg00366.html
It's not MR-alias aware, but it shouldn't be too hard to extend. Would this help?
Regards,
Peter
> Paolo
>
* Re: [Qemu-devel] memory: memory_region_transaction_commit() slow
2014-06-26 3:52 ` Peter Crosthwaite
@ 2014-06-26 15:02 ` Etienne Martineau
0 siblings, 0 replies; 8+ messages in thread
From: Etienne Martineau @ 2014-06-26 15:02 UTC (permalink / raw)
To: Peter Crosthwaite
Cc: Paolo Bonzini, gonglei, Fam, qemu-devel@nongnu.org Developers
On 14-06-25 11:52 PM, Peter Crosthwaite wrote:
>
> Sounds like my shareable address spaces scheme:
>
> http://lists.gnu.org/archive/html/qemu-devel/2014-06/msg00366.html
>
> It's not MR-alias aware, but it shouldn't be too hard to extend. Would this help?
>
> Regards,
> Peter
>
>> Paolo
>>
Hi Peter,
I 'think' I understand what you are proposing. My problem is that a) I don't see how to convert pci_bridge to that PMA model,
and b) I'm not sure whether this would address the issue I'm facing, and if so, how.
thanks,
Etienne
* Re: [Qemu-devel] memory: memory_region_transaction_commit() slow
2014-06-25 17:53 [Qemu-devel] memory: memory_region_transaction_commit() slow Etienne Martineau
2014-06-25 18:58 ` Paolo Bonzini
@ 2014-06-26 8:18 ` Avi Kivity
2014-06-26 14:31 ` Etienne Martineau
1 sibling, 1 reply; 8+ messages in thread
From: Avi Kivity @ 2014-06-26 8:18 UTC (permalink / raw)
To: Etienne Martineau, gonglei, Fam, Paolo Bonzini, Peter Crosthwaite,
qemu-devel
On 06/25/2014 08:53 PM, Etienne Martineau wrote:
> Hi,
>
> It seems to me that there is a scaling issue, O(n), in memory_region_transaction_commit().
It's actually O(n^3). The FlatView is kept sorted but is just a vector, so
if you insert n regions, you have n^2 operations. In addition, every PCI
device has an address space, so we get n^3 (technically the third n is
different from the first two, but they are related).
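For reference, the insertion path in question is roughly the following, paraphrased
from flatview_insert() in memory.c: each insert memmove()s the tail of the sorted
array, hence O(n) per insert and O(n^2) to build a view of n regions.

/* Roughly what flatview_insert() does: the view is a plain sorted array,
 * so keeping it sorted means shifting everything after 'pos'. */
static void flatview_insert(FlatView *view, unsigned pos, FlatRange *range)
{
    if (view->nr == view->nr_allocated) {
        view->nr_allocated = MAX(2 * view->nr, 10);
        view->ranges = g_realloc(view->ranges,
                                 view->nr_allocated * sizeof(*view->ranges));
    }
    memmove(view->ranges + pos + 1, view->ranges + pos,
            (view->nr - pos) * sizeof(FlatRange));   /* O(n) shift */
    view->ranges[pos] = *range;
    ++view->nr;
}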
The first problem can be solved by implementing the FlatView with a
std::set<> or equivalent, the second by memoization: most PCI address
spaces are equal (they differ only based on whether bus mastering is
enabled or not), so a clever cache can reduce the effort to generate them.
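A hypothetical sketch of that memoization, keyed on the (alias-resolved) root
region -- none of this exists in memory.c, it only illustrates the idea, and
invalidation and refcounting are omitted:

/* Hypothetical cache: address spaces whose effective roots are identical
 * share one FlatView instead of each generating its own. */
typedef struct FlatViewCacheEntry {
    MemoryRegion *root;                       /* alias-resolved root */
    FlatView *view;
    QTAILQ_ENTRY(FlatViewCacheEntry) link;
} FlatViewCacheEntry;

static QTAILQ_HEAD(, FlatViewCacheEntry) flatview_cache =
    QTAILQ_HEAD_INITIALIZER(flatview_cache);

static FlatView *flatview_cache_get(MemoryRegion *root)
{
    FlatViewCacheEntry *e;

    QTAILQ_FOREACH(e, &flatview_cache, link) {
        if (e->root == root) {
            return e->view;                   /* cache hit: no rebuild */
        }
    }

    e = g_new0(FlatViewCacheEntry, 1);
    e->root = root;
    e->view = generate_memory_topology(root);
    QTAILQ_INSERT_HEAD(&flatview_cache, e, link);
    return e->view;
}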
However, I'm not at all sure that the problem is CPU time in QEMU. It
could be due to rcu_synchronize delays when the new memory maps are fed
to KVM and VFIO. I recommend trying to isolate exactly where the time
is spent.
* Re: [Qemu-devel] memory: memory_region_transaction_commit() slow
2014-06-26 8:18 ` Avi Kivity
@ 2014-06-26 14:31 ` Etienne Martineau
2014-06-29 6:56 ` Avi Kivity
0 siblings, 1 reply; 8+ messages in thread
From: Etienne Martineau @ 2014-06-26 14:31 UTC (permalink / raw)
To: Avi Kivity; +Cc: Paolo Bonzini, Peter Crosthwaite, gonglei, Fam, qemu-devel
On 14-06-26 04:18 AM, Avi Kivity wrote:
>
> On 06/25/2014 08:53 PM, Etienne Martineau wrote:
>> Hi,
>>
>> It seems to me that there is a scaling issue, O(n), in memory_region_transaction_commit().
>
> It's actually O(n^3). The FlatView is kept sorted but is just a vector, so if you insert n regions, you have n^2 operations. In addition, every PCI device has an address space, so we get n^3 (technically the third n is different from the first two, but they are related).
>
> The first problem can be solved by implementing the FlatView with a std::set<> or equivalent, the second by memoization: most PCI address spaces are equal (they differ only based on whether bus mastering is enabled or not), so a clever cache can reduce the effort to generate them.
>
> However, I'm not at all sure that the problem is CPU time in QEMU. It could be due to rcu_synchronize delays when the new memory maps are fed to KVM and VFIO. I recommend trying to isolate exactly where the time is spent.
>
It seems that the linear increase in CPU time comes from QEMU (at least from my measurements below).
In QEMU's kvm_cpu_exec() I've added a hook that measures the time spent outside 'kvm_vcpu_ioctl(cpu, KVM_RUN, 0)'.
From the logs below this is 'QEMU long exit vCPU n x(msec) exit_reason'.
Similarly, in KVM's vcpu_enter_guest() I've added a new ftrace event that measures the time spent outside 'kvm_x86_ops->run(vcpu)'.
From the logs below this is 'kvm_long_exit: x(msec)'. Please note that this is a trimmed-down view of the real ftrace output.
Also please note that these hacks are useful (at least to me, since I haven't figured out a better way to do the same with the existing ftrace) to measure the RTT (round-trip time) at both the QEMU and KVM levels.
The time spent outside KVM's 'kvm_x86_ops->run(vcpu)' will always be greater than the time spent outside QEMU's 'kvm_vcpu_ioctl(cpu, KVM_RUN, 0)' for a given vCPU. The difference between the time spent outside KVM and the time spent outside QEMU (for a given vCPU) tells us who is burning cycles (QEMU or KVM) and by how much (in msec).
In the experiment below I've put the QEMU and KVM RTTs side by side. We can see that the time to assign a device (same BAR size for all devices) increases
linearly (as previously reported). Also, the RTT measurements show QEMU and KVM mostly within the same range, suggesting that the increase comes from QEMU and not KVM.
The one exception is that for every device assignment there is a KVM operation that seems to take ~100 msec each time. Since this is O(1) I'm not too concerned.
device assign #1:
device_add pci-assign,host=28:10.2,bus=pciehp.3.8
kvm_long_exit: 100
QEMU long exit vCPU 0 25 2 kvm_long_exit: 26
QEMU long exit vCPU 0 20 2 kvm_long_exit: 20
QEMU long exit vCPU 0 20 2 kvm_long_exit: 20
QEMU long exit vCPU 0 20 2 kvm_long_exit: 20
QEMU long exit vCPU 0 19 2 kvm_long_exit: 19
QEMU long exit vCPU 0 19 2 kvm_long_exit: 19
QEMU long exit vCPU 0 19 2 kvm_long_exit: 20
QEMU long exit vCPU 0 19 2 kvm_long_exit: 19
QEMU long exit vCPU 0 19 2 kvm_long_exit: 19
QEMU long exit vCPU 0 19 2 kvm_long_exit: 19
QEMU long exit vCPU 0 19 2 kvm_long_exit: 20
QEMU long exit vCPU 0 42 2 kvm_long_exit: 42
QEMU long exit vCPU 0 21 2 kvm_long_exit: 21
device assign #2:
device_add pci-assign,host=28:10.3,bus=pciehp.3.9
kvm_long_exit: 101
QEMU long exit vCPU 0 25 2 kvm_long_exit: 25
QEMU long exit vCPU 0 21 2 kvm_long_exit: 21
QEMU long exit vCPU 0 21 2 kvm_long_exit: 21
QEMU long exit vCPU 0 21 2 kvm_long_exit: 21
QEMU long exit vCPU 0 21 2 kvm_long_exit: 21
QEMU long exit vCPU 0 21 2 kvm_long_exit: 21
QEMU long exit vCPU 0 21 2 kvm_long_exit: 21
QEMU long exit vCPU 0 21 2 kvm_long_exit: 21
QEMU long exit vCPU 0 21 2 kvm_long_exit: 21
QEMU long exit vCPU 0 21 2 kvm_long_exit: 21
QEMU long exit vCPU 0 21 2 kvm_long_exit: 21
QEMU long exit vCPU 0 45 2 kvm_long_exit: 45
QEMU long exit vCPU 0 23 2 kvm_long_exit: 23
device assign #3:
device_add pci-assign,host=28:10.4,bus=pciehp.3.10
kvm_long_exit: 100
QEMU long exit vCPU 0 25 2 kvm_long_exit: 25
QEMU long exit vCPU 0 23 2 kvm_long_exit: 23
QEMU long exit vCPU 0 23 2 kvm_long_exit: 23
QEMU long exit vCPU 0 23 2 kvm_long_exit: 23
QEMU long exit vCPU 0 23 2 kvm_long_exit: 23
QEMU long exit vCPU 0 23 2 kvm_long_exit: 23
QEMU long exit vCPU 0 23 2 kvm_long_exit: 23
QEMU long exit vCPU 0 23 2 kvm_long_exit: 23
QEMU long exit vCPU 0 23 2 kvm_long_exit: 23
QEMU long exit vCPU 0 23 2 kvm_long_exit: 23
QEMU long exit vCPU 0 23 2 kvm_long_exit: 23
QEMU long exit vCPU 0 48 2 kvm_long_exit: 48
QEMU long exit vCPU 0 25 2 kvm_long_exit: 25
device assign #4:
device_add pci-assign,host=28:10.5,bus=pciehp.3.11
kvm_long_exit: 100
QEMU long exit vCPU 0 27 2 kvm_long_exit: 27
QEMU long exit vCPU 0 25 2 kvm_long_exit: 25
QEMU long exit vCPU 0 25 2 kvm_long_exit: 25
QEMU long exit vCPU 0 25 2 kvm_long_exit: 25
QEMU long exit vCPU 0 25 2 kvm_long_exit: 25
QEMU long exit vCPU 0 25 2 kvm_long_exit: 25
QEMU long exit vCPU 0 25 2 kvm_long_exit: 25
QEMU long exit vCPU 0 24 2 kvm_long_exit: 24
QEMU long exit vCPU 0 25 2 kvm_long_exit: 25
QEMU long exit vCPU 0 24 2 kvm_long_exit: 24
QEMU long exit vCPU 0 24 2 kvm_long_exit: 25
QEMU long exit vCPU 0 52 2 kvm_long_exit: 52
QEMU long exit vCPU 0 26 2 kvm_long_exit: 26
device assign #5:
device_add pci-assign,host=28:10.6,bus=pciehp.3.12
kvm_long_exit: 100
QEMU long exit vCPU 0 28 2 kvm_long_exit: 28
QEMU long exit vCPU 0 27 2 kvm_long_exit: 27
QEMU long exit vCPU 0 26 2 kvm_long_exit: 26
QEMU long exit vCPU 0 27 2 kvm_long_exit: 27
QEMU long exit vCPU 0 26 2 kvm_long_exit: 26
QEMU long exit vCPU 0 26 2 kvm_long_exit: 26
QEMU long exit vCPU 0 26 2 kvm_long_exit: 26
QEMU long exit vCPU 0 26 2 kvm_long_exit: 26
QEMU long exit vCPU 0 26 2 kvm_long_exit: 26
QEMU long exit vCPU 0 26 2 kvm_long_exit: 26
QEMU long exit vCPU 0 26 2 kvm_long_exit: 26
QEMU long exit vCPU 0 55 2 kvm_long_exit: 56
QEMU long exit vCPU 0 28 2 kvm_long_exit: 28
thanks,
Etienne
* Re: [Qemu-devel] memory: memory_region_transaction_commit() slow
2014-06-26 14:31 ` Etienne Martineau
@ 2014-06-29 6:56 ` Avi Kivity
0 siblings, 0 replies; 8+ messages in thread
From: Avi Kivity @ 2014-06-29 6:56 UTC (permalink / raw)
To: Etienne Martineau
Cc: Paolo Bonzini, Peter Crosthwaite, gonglei, Fam, qemu-devel
On 06/26/2014 05:31 PM, Etienne Martineau wrote:
> On 14-06-26 04:18 AM, Avi Kivity wrote:
>> On 06/25/2014 08:53 PM, Etienne Martineau wrote:
>>> Hi,
>>>
>>> It seems to me that there is a scaling issue, O(n), in memory_region_transaction_commit().
>> It's actually O(n^3). The FlatView is kept sorted but is just a vector, so if you insert n regions, you have n^2 operations. In addition, every PCI device has an address space, so we get n^3 (technically the third n is different from the first two, but they are related).
>>
>> The first problem can be solved by implementing the FlatView with a std::set<> or equivalent, the second by memoization: most PCI address spaces are equal (they differ only based on whether bus mastering is enabled or not), so a clever cache can reduce the effort to generate them.
>>
>> However, I'm not at all sure that the problem is CPU time in QEMU. It could be due to rcu_synchronize delays when the new memory maps are fed to KVM and VFIO. I recommend trying to isolate exactly where the time is spent.
>>
> It seems that the linear increase in CPU time comes from QEMU (at least from my measurements below).
In those code paths QEMU calls back into KVM (KVM_SET_MEMORY_REGION) and
VFIO, so it would be good to understand exactly where the time is
spent. I doubt it's the computation (which is O(n^3), but very fast);
it's more likely waiting for something.
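One way to isolate this, sketched below, is to time the KVM memory listener's
region_add path so that the cost of the KVM_SET_USER_MEMORY_REGION calls shows up
separately from FlatView construction. This is an untested sketch based on
kvm-all.c's kvm_region_add()/kvm_set_phys_mem(); the threshold and message are
illustrative only:

/* Sketch: a drop-in replacement for kvm_region_add() that reports slow
 * kvm_set_phys_mem() calls (which issue KVM_SET_USER_MEMORY_REGION). */
static void kvm_region_add_timed(MemoryListener *listener,
                                 MemoryRegionSection *section)
{
    int64_t start = get_clock();
    int64_t delta_ms;

    kvm_set_phys_mem(section, true);          /* what kvm_region_add() does */

    delta_ms = (get_clock() - start) / (1000 * 1000);
    if (delta_ms >= 10) {
        fprintf(stderr, "kvm region_add took %" PRId64 " ms\n", delta_ms);
    }
}

The same kind of timing wrapper could be put around the vfio listener callbacks to
see whether the time is spent waiting on the kernel rather than in QEMU itself.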