qemu-devel.nongnu.org archive mirror
* [Qemu-devel] [PATCH RFC] virtio: proposal to optimize accesses to VQs
@ 2015-12-14 14:51 Vincenzo Maffione
  2015-12-14 14:51 ` [Qemu-devel] [PATCH RFC] virtio: optimize access to guest physical memory Vincenzo Maffione
  2015-12-14 16:06 ` [Qemu-devel] [PATCH RFC] virtio: proposal to optimize accesses to VQs Paolo Bonzini
  0 siblings, 2 replies; 6+ messages in thread
From: Vincenzo Maffione @ 2015-12-14 14:51 UTC (permalink / raw)
  To: qemu-devel; +Cc: mst, jasowang, armbru, Vincenzo Maffione, g.lettieri, rizzo

Hi,
  I am running performance experiments to test how QEMU behaves when the
guest is transmitting (short) network packets at very high packet rates, say
over 1 Mpps.
I run a netmap application in the guest to generate such high packet rates,
but this is not relevant to this discussion. The only important fact is that
the generator running in the guest is not the bottleneck; in fact, its CPU
utilization is low (20%).

Moreover, I am not using vhost-net to accelerate virtio-net processing on the
hypervisor side, because I want to run performance unit tests on the QEMU
userspace virtio implementation (hw/virtio/virtio.c).

In the most common benchmarks - e.g. netperf TCP_STREAM, TCP_RR,
UDP_STREAM, ..., with one end of the communication in the guest, and the
other in the host, for instance with the simplest TAP networking setup - the
virtio-net adapter definitely outperforms the emulated e1000 adapter (and
all the other emulated devices). This was expected because of the great
benefits of I/O paravirtualization.

However, I was surprised to find out that the situation changes drastically
at very high packet rates.

My measurements show that the emulated e1000 adapter is able to transmit over
3.5 Mpps when the network backend is disconnected. I disconnect the
backend to see at what packet rate the e1000 emulation itself becomes the
bottleneck.

The same experiment, however, shows that virtio-net hits a bottleneck at
only 1 Mpps. After verifying that TX VQ kicks and TX VQ interrupts are
properly amortized/suppressed, I found that the bottleneck is partly
due to the way the code accesses the VQ in guest physical memory, since
each access involves an expensive address space translation. For each VQ
element processed, I counted over 15 such accesses, while e1000 needs just 2
accesses to its rings.

This patch slightly rewrites the code to reduce the number of accesses, since
many of them seem unnecessary to me. After this reduction, the bottleneck
jumps from 1 Mpps to 2 Mpps.
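
To make the idea concrete, here is a simplified sketch (not the exact code in
the patch below; the explicit offsets and local variables are just for
illustration) of the difference between reading the descriptor fields one at
a time, each with its own guest-physical to host-virtual translation, and
fetching the whole 16-byte descriptor with a single access:

    /* Before: one guest memory access (and translation) per field. */
    uint64_t addr  = virtio_ldq_phys(vdev, desc_pa + i * sizeof(VRingDesc));
    uint32_t len   = virtio_ldl_phys(vdev, desc_pa + i * sizeof(VRingDesc) + 8);
    uint16_t flags = virtio_lduw_phys(vdev, desc_pa + i * sizeof(VRingDesc) + 12);

    /* After: a single access copies the whole descriptor into a local
     * structure, and the rest of the code reads the local copy for free. */
    VRingDesc desc;
    address_space_rw(&address_space_memory, desc_pa + i * sizeof(VRingDesc),
                     MEMTXATTRS_UNSPECIFIED, (uint8_t *)&desc, sizeof(desc),
                     false);

The offsets 8 and 12 assume the standard descriptor layout (64-bit addr,
32-bit len, 16-bit flags, 16-bit next); the existing code uses offsetof() and
the helpers from hw/virtio/virtio-access.h instead, which also take care of
endianness (something the bulk fetch above still ignores, as noted below).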

The patch is not complete (e.g. it still does not properly handle endianness,
it is not clean, etc.). I just wanted to ask whether you think the idea makes
sense, and whether a proper patch in this direction would be accepted.

Thanks,
  Vincenzo

Vincenzo Maffione (1):
  virtio: optimize access to guest physical memory

 hw/virtio/virtio.c | 118 +++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 88 insertions(+), 30 deletions(-)

-- 
2.6.3


* [Qemu-devel] [PATCH RFC] virtio: optimize access to guest physical memory
  2015-12-14 14:51 [Qemu-devel] [PATCH RFC] virtio: proposal to optimize accesses to VQs Vincenzo Maffione
@ 2015-12-14 14:51 ` Vincenzo Maffione
  2015-12-14 16:06 ` [Qemu-devel] [PATCH RFC] virtio: proposal to optimize accesses to VQs Paolo Bonzini
  1 sibling, 0 replies; 6+ messages in thread
From: Vincenzo Maffione @ 2015-12-14 14:51 UTC (permalink / raw)
  To: qemu-devel; +Cc: mst, jasowang, armbru, Vincenzo Maffione, g.lettieri, rizzo

Accessing a location in guest physical memory involves a translation from
guest physical address to host virtual address. This patch removes
unnecessary accesses by the virtio implementation, so that at high
processing rates the bottleneck moves from ~1 Mpps to ~2 Mpps.

Signed-off-by: Vincenzo Maffione <v.maffione@gmail.com>
---
 hw/virtio/virtio.c | 118 +++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 88 insertions(+), 30 deletions(-)

diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 1edef59..1f853d6 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -21,6 +21,7 @@
 #include "hw/virtio/virtio-bus.h"
 #include "migration/migration.h"
 #include "hw/virtio/virtio-access.h"
+#include "exec/address-spaces.h"
 
 /*
  * The alignment to use between consumer and producer parts of vring.
@@ -71,6 +72,8 @@ struct VirtQueue
 {
     VRing vring;
     uint16_t last_avail_idx;
+    uint16_t last_fetched_avail_idx;
+    uint16_t last_used_idx;
     /* Last used index value we have signalled on */
     uint16_t signalled_used;
 
@@ -138,6 +141,15 @@ static inline uint16_t vring_desc_next(VirtIODevice *vdev, hwaddr desc_pa,
     return virtio_lduw_phys(vdev, pa);
 }
 
+static inline void
+vring_desc_fetch(VirtIODevice *vdev, hwaddr desc_pa, int i, VRingDesc *desc)
+{
+    hwaddr pa;
+    pa = desc_pa + sizeof(VRingDesc) * i;
+    address_space_rw(&address_space_memory, pa, MEMTXATTRS_UNSPECIFIED,
+                     (uint8_t *)desc, sizeof(VRingDesc), false);
+}
+
 static inline uint16_t vring_avail_flags(VirtQueue *vq)
 {
     hwaddr pa;
@@ -178,6 +190,15 @@ static inline void vring_used_ring_len(VirtQueue *vq, int i, uint32_t val)
     virtio_stl_phys(vq->vdev, pa, val);
 }
 
+static inline void
+vring_used_writeback(VirtQueue *vq, int i, VRingUsedElem *ue)
+{
+    hwaddr pa;
+    pa = vq->vring.used + offsetof(VRingUsed, ring[i]);
+    address_space_rw(&address_space_memory, pa, MEMTXATTRS_UNSPECIFIED,
+                     (uint8_t *)ue, sizeof(VRingUsedElem), true);
+}
+
 static uint16_t vring_used_idx(VirtQueue *vq)
 {
     hwaddr pa;
@@ -190,6 +211,7 @@ static inline void vring_used_idx_set(VirtQueue *vq, uint16_t val)
     hwaddr pa;
     pa = vq->vring.used + offsetof(VRingUsed, idx);
     virtio_stw_phys(vq->vdev, pa, val);
+    vq->last_used_idx = val;
 }
 
 static inline void vring_used_flags_set_bit(VirtQueue *vq, int mask)
@@ -277,15 +299,17 @@ void virtqueue_discard(VirtQueue *vq, const VirtQueueElement *elem,
 void virtqueue_fill(VirtQueue *vq, const VirtQueueElement *elem,
                     unsigned int len, unsigned int idx)
 {
+    VRingUsedElem ue;
+
     trace_virtqueue_fill(vq, elem, len, idx);
 
     virtqueue_unmap_sg(vq, elem, len);
 
-    idx = (idx + vring_used_idx(vq)) % vq->vring.num;
+    idx = (idx + vq->last_used_idx) % vq->vring.num;
 
-    /* Get a pointer to the next entry in the used ring. */
-    vring_used_ring_id(vq, idx, elem->index);
-    vring_used_ring_len(vq, idx, len);
+    ue.id = elem->index;
+    ue.len = len;
+    vring_used_writeback(vq, idx, &ue);
 }
 
 void virtqueue_flush(VirtQueue *vq, unsigned int count)
@@ -294,7 +318,7 @@ void virtqueue_flush(VirtQueue *vq, unsigned int count)
     /* Make sure buffer is written before we update index. */
     smp_wmb();
     trace_virtqueue_flush(vq, count);
-    old = vring_used_idx(vq);
+    old = vq->last_used_idx;
     new = old + count;
     vring_used_idx_set(vq, new);
     vq->inuse -= count;
@@ -328,6 +352,18 @@ static int virtqueue_num_heads(VirtQueue *vq, unsigned int idx)
     return num_heads;
 }
 
+static bool virtqueue_empty(VirtQueue *vq)
+{
+    if (vq->last_avail_idx == vq->last_fetched_avail_idx) {
+        vq->last_fetched_avail_idx = vring_avail_idx(vq);
+        /* On success, callers read a descriptor at vq->last_avail_idx.
+         * Make sure descriptor read does not bypass avail index read. */
+        smp_rmb();
+    }
+
+    return vq->last_avail_idx == vq->last_fetched_avail_idx;
+}
+
 static unsigned int virtqueue_get_head(VirtQueue *vq, unsigned int idx)
 {
     unsigned int head;
@@ -345,18 +381,18 @@ static unsigned int virtqueue_get_head(VirtQueue *vq, unsigned int idx)
     return head;
 }
 
-static unsigned virtqueue_next_desc(VirtIODevice *vdev, hwaddr desc_pa,
-                                    unsigned int i, unsigned int max)
+static unsigned virtqueue_next_desc(VirtIODevice *vdev, VRingDesc *desc,
+                                    unsigned int max)
 {
     unsigned int next;
 
     /* If this descriptor says it doesn't chain, we're done. */
-    if (!(vring_desc_flags(vdev, desc_pa, i) & VRING_DESC_F_NEXT)) {
+    if (!(desc->flags & VRING_DESC_F_NEXT)) {
         return max;
     }
 
     /* Check they're not leading us off end of descriptors. */
-    next = vring_desc_next(vdev, desc_pa, i);
+    next = desc->next;
     /* Make sure compiler knows to grab that: we don't want it changing! */
     smp_wmb();
 
@@ -382,15 +418,17 @@ void virtqueue_get_avail_bytes(VirtQueue *vq, unsigned int *in_bytes,
         VirtIODevice *vdev = vq->vdev;
         unsigned int max, num_bufs, indirect = 0;
         hwaddr desc_pa;
+        VRingDesc desc;
         int i;
 
         max = vq->vring.num;
         num_bufs = total_bufs;
         i = virtqueue_get_head(vq, idx++);
         desc_pa = vq->vring.desc;
+        vring_desc_fetch(vdev, desc_pa, i, &desc);
 
-        if (vring_desc_flags(vdev, desc_pa, i) & VRING_DESC_F_INDIRECT) {
-            if (vring_desc_len(vdev, desc_pa, i) % sizeof(VRingDesc)) {
+        if (desc.flags & VRING_DESC_F_INDIRECT) {
+            if (desc.len % sizeof(VRingDesc)) {
                 error_report("Invalid size for indirect buffer table");
                 exit(1);
             }
@@ -403,27 +441,34 @@ void virtqueue_get_avail_bytes(VirtQueue *vq, unsigned int *in_bytes,
 
             /* loop over the indirect descriptor table */
             indirect = 1;
-            max = vring_desc_len(vdev, desc_pa, i) / sizeof(VRingDesc);
-            desc_pa = vring_desc_addr(vdev, desc_pa, i);
+            max = desc.len / sizeof(VRingDesc);
+            desc_pa = desc.addr;
             num_bufs = i = 0;
+            vring_desc_fetch(vdev, desc_pa, i, &desc);
         }
 
-        do {
+        for (;;) {
             /* If we've got too many, that implies a descriptor loop. */
             if (++num_bufs > max) {
                 error_report("Looped descriptor");
                 exit(1);
             }
 
-            if (vring_desc_flags(vdev, desc_pa, i) & VRING_DESC_F_WRITE) {
-                in_total += vring_desc_len(vdev, desc_pa, i);
+            if (desc.flags & VRING_DESC_F_WRITE) {
+                in_total += desc.len;
             } else {
-                out_total += vring_desc_len(vdev, desc_pa, i);
+                out_total += desc.len;
             }
             if (in_total >= max_in_bytes && out_total >= max_out_bytes) {
                 goto done;
             }
-        } while ((i = virtqueue_next_desc(vdev, desc_pa, i, max)) != max);
+
+            i = virtqueue_next_desc(vdev, &desc, max);
+            if (i == max) {
+                break;
+            }
+            vring_desc_fetch(vdev, desc_pa, i, &desc);
+        }
 
         if (!indirect)
             total_bufs = num_bufs;
@@ -506,9 +551,11 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
     unsigned int i, head, max;
     hwaddr desc_pa = vq->vring.desc;
     VirtIODevice *vdev = vq->vdev;
+    VRingDesc desc;
 
-    if (!virtqueue_num_heads(vq, vq->last_avail_idx))
+    if (virtqueue_empty(vq)) {
         return 0;
+    }
 
     /* When we start there are none of either input nor output. */
     elem->out_num = elem->in_num = 0;
@@ -520,46 +567,55 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
         vring_set_avail_event(vq, vq->last_avail_idx);
     }
 
-    if (vring_desc_flags(vdev, desc_pa, i) & VRING_DESC_F_INDIRECT) {
-        if (vring_desc_len(vdev, desc_pa, i) % sizeof(VRingDesc)) {
+    vring_desc_fetch(vdev, desc_pa, i, &desc);
+
+    if (desc.flags & VRING_DESC_F_INDIRECT) {
+        if (desc.len % sizeof(VRingDesc)) {
             error_report("Invalid size for indirect buffer table");
             exit(1);
         }
 
         /* loop over the indirect descriptor table */
-        max = vring_desc_len(vdev, desc_pa, i) / sizeof(VRingDesc);
-        desc_pa = vring_desc_addr(vdev, desc_pa, i);
+        max = desc.len / sizeof(VRingDesc);
+        desc_pa = desc.addr;
         i = 0;
+        vring_desc_fetch(vdev, desc_pa, i, &desc);
     }
 
     /* Collect all the descriptors */
-    do {
+    for (;;) {
         struct iovec *sg;
 
-        if (vring_desc_flags(vdev, desc_pa, i) & VRING_DESC_F_WRITE) {
+        if (desc.flags & VRING_DESC_F_WRITE) {
             if (elem->in_num >= ARRAY_SIZE(elem->in_sg)) {
                 error_report("Too many write descriptors in indirect table");
                 exit(1);
             }
-            elem->in_addr[elem->in_num] = vring_desc_addr(vdev, desc_pa, i);
+            elem->in_addr[elem->in_num] = desc.addr;
             sg = &elem->in_sg[elem->in_num++];
         } else {
             if (elem->out_num >= ARRAY_SIZE(elem->out_sg)) {
                 error_report("Too many read descriptors in indirect table");
                 exit(1);
             }
-            elem->out_addr[elem->out_num] = vring_desc_addr(vdev, desc_pa, i);
+            elem->out_addr[elem->out_num] = desc.addr;
             sg = &elem->out_sg[elem->out_num++];
         }
 
-        sg->iov_len = vring_desc_len(vdev, desc_pa, i);
+        sg->iov_len = desc.len;
 
         /* If we've got too many, that implies a descriptor loop. */
         if ((elem->in_num + elem->out_num) > max) {
             error_report("Looped descriptor");
             exit(1);
         }
-    } while ((i = virtqueue_next_desc(vdev, desc_pa, i, max)) != max);
+
+        i = virtqueue_next_desc(vdev, &desc, max);
+        if (i == max) {
+            break;
+        }
+        vring_desc_fetch(vdev, desc_pa, i, &desc);
+    }
 
     /* Now map what we have collected */
     virtqueue_map(elem);
@@ -673,6 +729,8 @@ void virtio_reset(void *opaque)
         vdev->vq[i].vring.avail = 0;
         vdev->vq[i].vring.used = 0;
         vdev->vq[i].last_avail_idx = 0;
+        vdev->vq[i].last_fetched_avail_idx = 0;
+        vdev->vq[i].last_used_idx = 0;
         virtio_queue_set_vector(vdev, i, VIRTIO_NO_VECTOR);
         vdev->vq[i].signalled_used = 0;
         vdev->vq[i].signalled_used_valid = false;
@@ -1052,7 +1110,7 @@ static bool vring_notify(VirtIODevice *vdev, VirtQueue *vq)
     v = vq->signalled_used_valid;
     vq->signalled_used_valid = true;
     old = vq->signalled_used;
-    new = vq->signalled_used = vring_used_idx(vq);
+    new = vq->signalled_used = vq->last_used_idx;
     return !v || vring_need_event(vring_get_used_event(vq), new, old);
 }
 
-- 
2.6.3


* Re: [Qemu-devel] [PATCH RFC] virtio: proposal to optimize accesses to VQs
  2015-12-14 14:51 [Qemu-devel] [PATCH RFC] virtio: proposal to optimize accesses to VQs Vincenzo Maffione
  2015-12-14 14:51 ` [Qemu-devel] [PATCH RFC] virtio: optimize access to guest physical memory Vincenzo Maffione
@ 2015-12-14 16:06 ` Paolo Bonzini
  2015-12-14 17:16   ` Vincenzo Maffione
  2015-12-15  9:22   ` Vincenzo Maffione
  1 sibling, 2 replies; 6+ messages in thread
From: Paolo Bonzini @ 2015-12-14 16:06 UTC (permalink / raw)
  To: Vincenzo Maffione, qemu-devel; +Cc: g.lettieri, jasowang, rizzo, armbru, mst



On 14/12/2015 15:51, Vincenzo Maffione wrote:
> 
> This patch slightly rewrites the code to reduce the number of accesses, since
> many of them seem unnecessary to me. After this reduction, the bottleneck
> jumps from 1 Mpps to 2 Mpps.
> 
> The patch is not complete (e.g. it still does not properly handle endianness,
> it is not clean, etc.). I just wanted to ask whether you think the idea makes
> sense, and whether a proper patch in this direction would be accepted.

Yes, definitely.   I have even done a very similar change here:

	http://article.gmane.org/gmane.comp.emulators.qemu.block/6620

Can you review that patch, and possibly work on top of it?

Thanks,

Paolo


* Re: [Qemu-devel] [PATCH RFC] virtio: proposal to optimize accesses to VQs
  2015-12-14 16:06 ` [Qemu-devel] [PATCH RFC] virtio: proposal to optimize accesses to VQs Paolo Bonzini
@ 2015-12-14 17:16   ` Vincenzo Maffione
  2015-12-15  9:22   ` Vincenzo Maffione
  1 sibling, 0 replies; 6+ messages in thread
From: Vincenzo Maffione @ 2015-12-14 17:16 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Michael S. Tsirkin, Jason Wang, Markus Armbruster, qemu-devel,
	Giuseppe Lettieri, Luigi Rizzo

2015-12-14 17:06 GMT+01:00 Paolo Bonzini <pbonzini@redhat.com>:
>
>
> On 14/12/2015 15:51, Vincenzo Maffione wrote:
>>
>> This patch slightly rewrites the code to reduce the number of accesses, since
>> many of them seem unnecessary to me. After this reduction, the bottleneck
>> jumps from 1 Mpps to 2 Mpps.
>>
>> The patch is not complete (e.g. it still does not properly handle endianness,
>> it is not clean, etc.). I just wanted to ask whether you think the idea makes
>> sense, and whether a proper patch in this direction would be accepted.
>
> Yes, definitely.   I have even done a very similar change here:
>
>         http://article.gmane.org/gmane.comp.emulators.qemu.block/6620
>
> Can you review that patch, and possibly work on top of it?
>
> Thanks,
>
> Paolo

Yes, I will take a look at it.

Cheers,
  Vincenzo


-- 
Vincenzo Maffione


* Re: [Qemu-devel] [PATCH RFC] virtio: proposal to optimize accesses to VQs
  2015-12-14 16:06 ` [Qemu-devel] [PATCH RFC] virtio: proposal to optimize accesses to VQs Paolo Bonzini
  2015-12-14 17:16   ` Vincenzo Maffione
@ 2015-12-15  9:22   ` Vincenzo Maffione
  2015-12-15 10:34     ` Paolo Bonzini
  1 sibling, 1 reply; 6+ messages in thread
From: Vincenzo Maffione @ 2015-12-15  9:22 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Michael S. Tsirkin, Jason Wang, Markus Armbruster, qemu-devel,
	Giuseppe Lettieri, Luigi Rizzo

> @@ -415,15 +399,15 @@ void virtqueue_get_avail_bytes(VirtQueue *vq, unsigned int *in_bytes,
>                  exit(1);
>              }
>
> -            if (vring_desc_flags(vdev, desc_pa, i) & VRING_DESC_F_WRITE) {
> -                in_total += vring_desc_len(vdev, desc_pa, i);
> +            if (desc.flags & VRING_DESC_F_WRITE) {
> +                in_total += desc.flags;
>              } else {
> -                out_total += vring_desc_len(vdev, desc_pa, i);
> +                out_total += desc.flags;
>              }
>              if (in_total >= max_in_bytes && out_total >= max_out_bytes) {
>                  goto done;
>              }
> -        } while ((i = virtqueue_next_desc(vdev, desc_pa, i, max)) != max);
> +        } while ((i = virtqueue_read_next_desc(vdev, &desc, desc_pa, max)) != max);
>
>          if (!indirect)
>              total_bufs = num_bufs;

Here (http://article.gmane.org/gmane.comp.emulators.qemu.block/6620) you meant

+                in_total += desc.len;

and

+                out_total += desc.len;


2015-12-14 17:06 GMT+01:00 Paolo Bonzini <pbonzini@redhat.com>:
>
>
> On 14/12/2015 15:51, Vincenzo Maffione wrote:
>>
>> This patch slightly rewrites the code to reduce the number of accesses, since
>> many of them seem unnecessary to me. After this reduction, the bottleneck
>> jumps from 1 Mpps to 2 Mpps.
>>
>> The patch is not complete (e.g. it still does not properly handle endianness,
>> it is not clean, etc.). I just wanted to ask whether you think the idea makes
>> sense, and whether a proper patch in this direction would be accepted.
>
> Yes, definitely.   I have even done a very similar change here:
>
>         http://article.gmane.org/gmane.comp.emulators.qemu.block/6620
>
> Can you review that patch, and possibly work on top of it?
>
> Thanks,
>
> Paolo



-- 
Vincenzo Maffione


* Re: [Qemu-devel] [PATCH RFC] virtio: proposal to optimize accesses to VQs
  2015-12-15  9:22   ` Vincenzo Maffione
@ 2015-12-15 10:34     ` Paolo Bonzini
  0 siblings, 0 replies; 6+ messages in thread
From: Paolo Bonzini @ 2015-12-15 10:34 UTC (permalink / raw)
  To: Vincenzo Maffione
  Cc: Michael S. Tsirkin, Jason Wang, Markus Armbruster, qemu-devel,
	Giuseppe Lettieri, Luigi Rizzo

> >          if (!indirect)
> >              total_bufs = num_bufs;
> 
> Here (http://article.gmane.org/gmane.comp.emulators.qemu.block/6620) you
> meant
> 
> +                in_total += desc.len;
> 
> and
> 
> +                out_total += desc.len;
> 

Yes, I did... I changed the patch and will push to the dataplane branch
soon.

Paolo
