* [Qemu-devel] [PATCH v2 1/3] virtio: cache used_idx in a VirtQueue field
From: Vincenzo Maffione @ 2015-12-15 22:33 UTC
To: qemu-devel
Cc: mst, jasowang, armbru, Vincenzo Maffione, pbonzini, g.lettieri,
rizzo
Accessing used_idx in the VQ requires an expensive access to
guest physical memory. Before this patch, 3 accesses are normally
done for each pop/push/notify call. However, since the used_idx is
only written by us, we can track it in our internal data structure.
Signed-off-by: Vincenzo Maffione <v.maffione@gmail.com>
---
hw/virtio/virtio.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 925bb36..a35bb38 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -71,6 +71,9 @@ struct VirtQueue
{
VRing vring;
uint16_t last_avail_idx;
+
+ uint16_t used_idx;
+
/* Last used index value we have signalled on */
uint16_t signalled_used;
@@ -170,6 +173,7 @@ static inline void vring_used_idx_set(VirtQueue *vq, uint16_t val)
hwaddr pa;
pa = vq->vring.used + offsetof(VRingUsed, idx);
virtio_stw_phys(vq->vdev, pa, val);
+ vq->used_idx = val;
}
static inline void vring_used_flags_set_bit(VirtQueue *vq, int mask)
@@ -261,7 +265,7 @@ void virtqueue_fill(VirtQueue *vq, const VirtQueueElement *elem,
virtqueue_unmap_sg(vq, elem, len);
- idx = (idx + vring_used_idx(vq)) % vq->vring.num;
+ idx = (idx + vq->used_idx) % vq->vring.num;
/* Get a pointer to the next entry in the used ring. */
vring_used_ring_id(vq, idx, elem->index);
@@ -274,7 +278,7 @@ void virtqueue_flush(VirtQueue *vq, unsigned int count)
/* Make sure buffer is written before we update index. */
smp_wmb();
trace_virtqueue_flush(vq, count);
- old = vring_used_idx(vq);
+ old = vq->used_idx;
new = old + count;
vring_used_idx_set(vq, new);
vq->inuse -= count;
@@ -806,6 +810,7 @@ void virtio_reset(void *opaque)
vdev->vq[i].vring.avail = 0;
vdev->vq[i].vring.used = 0;
vdev->vq[i].last_avail_idx = 0;
+ vdev->vq[i].used_idx = 0;
virtio_queue_set_vector(vdev, i, VIRTIO_NO_VECTOR);
vdev->vq[i].signalled_used = 0;
vdev->vq[i].signalled_used_valid = false;
@@ -1189,7 +1194,7 @@ bool virtio_should_notify(VirtIODevice *vdev, VirtQueue *vq)
v = vq->signalled_used_valid;
vq->signalled_used_valid = true;
old = vq->signalled_used;
- new = vq->signalled_used = vring_used_idx(vq);
+ new = vq->signalled_used = vq->used_idx;
return !v || vring_need_event(vring_get_used_event(vq), new, old);
}
@@ -1650,6 +1655,7 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f, int version_id)
vdev->vq[i].last_avail_idx, nheads);
return -1;
}
+ vdev->vq[i].used_idx = vring_used_idx(&vdev->vq[i]);
}
}
--
2.6.4
* [Qemu-devel] [PATCH v2 2/3] virtio: read avail_idx from VQ only when necessary
From: Vincenzo Maffione @ 2015-12-15 22:33 UTC
To: qemu-devel
Cc: mst, jasowang, armbru, Vincenzo Maffione, pbonzini, g.lettieri,
rizzo
The virtqueue_pop() implementation needs to check if the avail ring
contains some pending buffers. To perform this check, it is not
always necessary to fetch the avail_idx in the VQ memory, which is
expensive. This patch introduces a shadow variable tracking avail_idx
and modifies virtio_queue_empty() to access avail_idx in physical
memory only when necessary.
Signed-off-by: Vincenzo Maffione <v.maffione@gmail.com>
---
hw/virtio/virtio.c | 25 +++++++++++++++++++++----
1 file changed, 21 insertions(+), 4 deletions(-)
diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index a35bb38..5a2ea8a 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -70,8 +70,13 @@ typedef struct VRing
struct VirtQueue
{
VRing vring;
+
+ /* Next head to pop */
uint16_t last_avail_idx;
+ /* Last avail_idx read from VQ. */
+ uint16_t shadow_avail_idx;
+
uint16_t used_idx;
/* Last used index value we have signalled on */
@@ -132,7 +137,8 @@ static inline uint16_t vring_avail_idx(VirtQueue *vq)
{
hwaddr pa;
pa = vq->vring.avail + offsetof(VRingAvail, idx);
- return virtio_lduw_phys(vq->vdev, pa);
+ vq->shadow_avail_idx = virtio_lduw_phys(vq->vdev, pa);
+ return vq->shadow_avail_idx;
}
static inline uint16_t vring_avail_ring(VirtQueue *vq, int i)
@@ -223,8 +229,14 @@ int virtio_queue_ready(VirtQueue *vq)
return vq->vring.avail != 0;
}
+/* Fetch avail_idx from VQ memory only when we really need to know if
+ * guest has added some buffers. */
int virtio_queue_empty(VirtQueue *vq)
{
+ if (vq->shadow_avail_idx != vq->last_avail_idx) {
+ return 0;
+ }
+
return vring_avail_idx(vq) == vq->last_avail_idx;
}
@@ -300,7 +312,7 @@ static int virtqueue_num_heads(VirtQueue *vq, unsigned int idx)
/* Check it isn't doing very strange things with descriptor numbers. */
if (num_heads > vq->vring.num) {
error_report("Guest moved used index from %u to %u",
- idx, vring_avail_idx(vq));
+ idx, vq->shadow_avail_idx);
exit(1);
}
/* On success, callers read a descriptor at vq->last_avail_idx.
@@ -534,9 +546,12 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
struct iovec iov[VIRTQUEUE_MAX_SIZE];
VRingDesc desc;
- if (!virtqueue_num_heads(vq, vq->last_avail_idx)) {
+ if (virtio_queue_empty(vq)) {
return NULL;
}
+ /* Needed after virtio_queue_empty(), see comment in
+ * virtqueue_num_heads(). */
+ smp_rmb();
/* When we start there are none of either input nor output. */
out_num = in_num = 0;
@@ -810,6 +825,7 @@ void virtio_reset(void *opaque)
vdev->vq[i].vring.avail = 0;
vdev->vq[i].vring.used = 0;
vdev->vq[i].last_avail_idx = 0;
+ vdev->vq[i].shadow_avail_idx = 0;
vdev->vq[i].used_idx = 0;
virtio_queue_set_vector(vdev, i, VIRTIO_NO_VECTOR);
vdev->vq[i].signalled_used = 0;
@@ -1183,7 +1199,7 @@ bool virtio_should_notify(VirtIODevice *vdev, VirtQueue *vq)
smp_mb();
/* Always notify when queue is empty (when feature acknowledge) */
if (virtio_vdev_has_feature(vdev, VIRTIO_F_NOTIFY_ON_EMPTY) &&
- !vq->inuse && vring_avail_idx(vq) == vq->last_avail_idx) {
+ !vq->inuse && virtio_queue_empty(vq)) {
return true;
}
@@ -1656,6 +1672,7 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f, int version_id)
return -1;
}
vdev->vq[i].used_idx = vring_used_idx(&vdev->vq[i]);
+ vdev->vq[i].shadow_avail_idx = vring_avail_idx(&vdev->vq[i]);
}
}
--
2.6.4
* [Qemu-devel] [PATCH v2 3/3] virtio: combine write of an entry into used ring
From: Vincenzo Maffione @ 2015-12-15 22:33 UTC
To: qemu-devel
Cc: mst, jasowang, armbru, Vincenzo Maffione, pbonzini, g.lettieri,
rizzo
Fill in an element of the used ring with a single combined access to
guest physical memory, rather than using two separate accesses.
This reduces the overhead due to the expensive address translation.
Signed-off-by: Vincenzo Maffione <v.maffione@gmail.com>
---
hw/virtio/virtio.c | 25 ++++++++++++-------------
1 file changed, 12 insertions(+), 13 deletions(-)
diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 5a2ea8a..ecdb140 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -153,18 +153,15 @@ static inline uint16_t vring_get_used_event(VirtQueue *vq)
return vring_avail_ring(vq, vq->vring.num);
}
-static inline void vring_used_ring_id(VirtQueue *vq, int i, uint32_t val)
+static inline void vring_used_write(VirtQueue *vq, VRingUsedElem *uelem,
+ int i)
{
hwaddr pa;
- pa = vq->vring.used + offsetof(VRingUsed, ring[i].id);
- virtio_stl_phys(vq->vdev, pa, val);
-}
-
-static inline void vring_used_ring_len(VirtQueue *vq, int i, uint32_t val)
-{
- hwaddr pa;
- pa = vq->vring.used + offsetof(VRingUsed, ring[i].len);
- virtio_stl_phys(vq->vdev, pa, val);
+ virtio_tswap32s(vq->vdev, &uelem->id);
+ virtio_tswap32s(vq->vdev, &uelem->len);
+ pa = vq->vring.used + offsetof(VRingUsed, ring[i]);
+ address_space_write(&address_space_memory, pa, MEMTXATTRS_UNSPECIFIED,
+ (void *)uelem, sizeof(VRingUsedElem));
}
static uint16_t vring_used_idx(VirtQueue *vq)
@@ -273,15 +270,17 @@ void virtqueue_discard(VirtQueue *vq, const VirtQueueElement *elem,
void virtqueue_fill(VirtQueue *vq, const VirtQueueElement *elem,
unsigned int len, unsigned int idx)
{
+ VRingUsedElem uelem;
+
trace_virtqueue_fill(vq, elem, len, idx);
virtqueue_unmap_sg(vq, elem, len);
idx = (idx + vq->used_idx) % vq->vring.num;
- /* Get a pointer to the next entry in the used ring. */
- vring_used_ring_id(vq, idx, elem->index);
- vring_used_ring_len(vq, idx, len);
+ uelem.id = elem->index;
+ uelem.len = len;
+ vring_used_write(vq, &uelem, idx);
}
void virtqueue_flush(VirtQueue *vq, unsigned int count)
--
2.6.4
* Re: [Qemu-devel] [PATCH v2 0/3] virtio: proposal to optimize accesses to VQs
From: Paolo Bonzini @ 2015-12-16 8:38 UTC
To: Vincenzo Maffione, qemu-devel; +Cc: g.lettieri, jasowang, rizzo, armbru, mst
On 15/12/2015 23:33, Vincenzo Maffione wrote:
> This patch slightly rewrites the code to reduce the number of accesses, since
> many of them seem unnecessary to me. After this reduction, the bottleneck
> jumps from 1 Mpps to 2 Mpps.
Very nice. Did you get new numbers with the rebase? That would help
measure the effect of removing the variable-sized memcpy (I'll post the
patches for this shortly; they're entirely in memory.h/exec.c, so they're
not virtio-specific). A rough measurement from "perf" says they're
worth about 5%.
Related to this, patch 3 introduces a variable-sized memcpy, because it
switches from two virtio_stl_phys calls to one address_space_write. I'm
curious whether the effect of this individual patch is positive, negative
or neutral.
On the other hand, patches 1 and 2 are clear wins.
Paolo
> Patch is not complete (e.g. it still does not properly manage endianness, it is
> not clean, etc.). I just wanted to ask if you think the idea makes sense, and
> a proper patch in this direction would be accepted.
* Re: [Qemu-devel] [PATCH v2 0/3] virtio: proposal to optimize accesses to VQs
From: Vincenzo Maffione @ 2015-12-16 9:28 UTC
To: Paolo Bonzini
Cc: Michael S. Tsirkin, Jason Wang, Markus Armbruster, qemu-devel,
Giuseppe Lettieri, Luigi Rizzo
Assuming my TX experiments with disconnected backend (and I disable
CPU dynamic scaling of performance, etc.):
1) after patch 1 and 2, virtio bottleneck jumps from ~1Mpps to 1.910 Mpps.
2) after patch 1,2 and 3, virtio bottleneck jumps to 2.039 Mpps.
So I see an improvement for patch 3, and I guess it's because we avoid
an additional memory translation and related overhead. I believe that
avoiding the memory translation is more beneficial than avoiding the
variable-sized memcpy.
I'm not surprised of that, because taking a brief look at what happens
under the hood when you call an access_memory() function - it looks
like a lot of operations.
Cheers,
Vincenzo
2015-12-16 9:38 GMT+01:00 Paolo Bonzini <pbonzini@redhat.com>:
>
>
> On 15/12/2015 23:33, Vincenzo Maffione wrote:
>> This patch slightly rewrites the code to reduce the number of accesses, since
>> many of them seem unnecessary to me. After this reduction, the bottleneck
>> jumps from 1 Mpps to 2 Mpps.
>
> Very nice. Did you get new numbers with the rebase? That would help
> measuring the effect of removing variable-sized memcpy (I'll post the
> patches for this shortly; they're entirely in memory.h/exec.c so they're
> not virtio-specific). A rough measurement from "perf" says they're
> worth about 5%.
>
> Related to this, patch 3 introduces a variable-sized memcpy, because it
> switches from 2 virtio_stl_phys to 1 address_space_write. I'm curious
> if the effect of this individual patch is positive, negative or neutral.
> On the other hand, patches 1 and 2 are clear wins.
>
> Paolo
>
>> Patch is not complete (e.g. it still does not properly manage endianness, it is
>> not clean, etc.). I just wanted to ask if you think the idea makes sense, and
>> a proper patch in this direction would be accepted.
--
Vincenzo Maffione
* Re: [Qemu-devel] [PATCH v2 0/3] virtio: proposal to optimize accesses to VQs
From: Paolo Bonzini @ 2015-12-16 9:34 UTC
To: Vincenzo Maffione
Cc: Michael S. Tsirkin, Jason Wang, qemu-devel, Markus Armbruster,
Giuseppe Lettieri, Luigi Rizzo
On 16/12/2015 10:28, Vincenzo Maffione wrote:
> Assuming my TX experiments with disconnected backend (and I disable
> CPU dynamic scaling of performance, etc.):
> 1) after patch 1 and 2, virtio bottleneck jumps from ~1Mpps to 1.910 Mpps.
> 2) after patch 1,2 and 3, virtio bottleneck jumps to 2.039 Mpps.
>
> So I see an improvement for patch 3, and I guess it's because we avoid
> an additional memory translation and related overhead. I believe that
> avoiding the memory translation is more beneficial than avoiding the
> variable-sized memcpy.
> I'm not surprised of that, because taking a brief look at what happens
> under the hood when you call an access_memory() function - it looks
> like a lot of operations.
Great, thanks for confirming!
Paolo
* Re: [Qemu-devel] [PATCH v2 0/3] virtio: proposal to optimize accesses to VQs
From: Vincenzo Maffione @ 2015-12-16 10:39 UTC
To: Paolo Bonzini
Cc: Michael S. Tsirkin, Jason Wang, qemu-devel, Markus Armbruster,
Giuseppe Lettieri, Luigi Rizzo
2015-12-16 10:34 GMT+01:00 Paolo Bonzini <pbonzini@redhat.com>:
>
>
> On 16/12/2015 10:28, Vincenzo Maffione wrote:
>> Assuming my TX experiments with disconnected backend (and I disable
>> CPU dynamic scaling of performance, etc.):
>> 1) after patch 1 and 2, virtio bottleneck jumps from ~1Mpps to 1.910 Mpps.
>> 2) after patch 1,2 and 3, virtio bottleneck jumps to 2.039 Mpps.
>>
>> So I see an improvement for patch 3, and I guess it's because we avoid
>> an additional memory translation and related overhead. I believe that
>> avoiding the memory translation is more beneficial than avoiding the
>> variable-sized memcpy.
>> I'm not surprised of that, because taking a brief look at what happens
>> under the hood when you call an access_memory() function - it looks
>> like a lot of operations.
>
> Great, thanks for confirming!
>
> Paolo
No problems.
I have some additional (orthogonal) curiosities:
1) Assuming "hw/virtio/dataplane/vring.c" is what I think it is (VQ
data structures directly accessible in the host virtual memory, with
guest-physical-to-host-virtual mapping done statically at setup time)
why isn't QEMU using this approach also for virtio-net? I see it is
used by virtio-blk only.
2) In any case (vring or not) QEMU dynamically maps data buffers
from guest physical memory, for each descriptor to be processed: e1000
uses pci_dma_read/pci_dma_write, virtio uses
cpu_physical_memory_map()/cpu_physical_memory_unmap(), vring uses the
more specialized vring_map()/vring_unmap(). All of these go through
expensive lookups and related operations to do the address
translation.
Have you considered the possibility of caching the translation result to
remove this bottleneck (maybe just for virtio devices)? Or is there any
consistency or migration-related problem that would create issues?
Just to give an example of what I'm talking about:
https://github.com/vmaffione/qemu/blob/master/hw/net/e1000.c#L349-L423.
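To make the idea more concrete, here is a minimal sketch of the kind of cache
I have in mind (TranslationCache and tc_translate are names invented for the
example, this is not the actual code behind the link above, and it deliberately
ignores dirty marking and invalidation on memory topology changes):

#include "exec/memory.h"
#include "exec/address-spaces.h"

/* Illustrative guest-physical -> host-virtual translation cache.
 * All names are made up for this example. */
typedef struct TranslationCache {
    hwaddr gpa;    /* start of the cached guest-physical range */
    hwaddr len;    /* length actually mapped */
    void *hva;     /* host pointer returned by address_space_map() */
} TranslationCache;

static void *tc_translate(TranslationCache *tc, hwaddr gpa, hwaddr len,
                          bool is_write)
{
    if (tc->hva && gpa >= tc->gpa && gpa + len <= tc->gpa + tc->len) {
        /* Fast path: reuse the cached mapping, no lookup at all. */
        return (uint8_t *)tc->hva + (gpa - tc->gpa);
    }
    /* Slow path: drop the old mapping and translate again. */
    if (tc->hva) {
        address_space_unmap(&address_space_memory, tc->hva, tc->len,
                            is_write, tc->len);
        tc->hva = NULL;
    }
    tc->len = len;
    tc->hva = address_space_map(&address_space_memory, gpa, &tc->len, is_write);
    if (tc->hva && tc->len < len) {
        /* Range crosses a region boundary: don't cache a short mapping. */
        address_space_unmap(&address_space_memory, tc->hva, tc->len,
                            is_write, 0);
        tc->hva = NULL;
    }
    tc->gpa = gpa;
    return tc->hva;
}

The hard part, of course, is deciding when the cached mapping must be thrown
away, which is exactly the consistency question above.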
At very high packet rates, once notifications (kicks and interrupts)
have been amortized in some way, memory translation becomes the major
bottleneck. And this (1 and 2) is why QEMU virtio implementation
cannot achieve the same throughput as bhyve does (5-6 Mpps or more
IIRC).
Cheers,
Vincenzo
--
Vincenzo Maffione
* Re: [Qemu-devel] [PATCH v2 0/3] virtio: proposal to optimize accesses to VQs
From: Michael S. Tsirkin @ 2015-12-16 11:02 UTC
To: Vincenzo Maffione
Cc: Jason Wang, Markus Armbruster, qemu-devel, Paolo Bonzini,
Giuseppe Lettieri, Luigi Rizzo
On Wed, Dec 16, 2015 at 11:39:46AM +0100, Vincenzo Maffione wrote:
> 2015-12-16 10:34 GMT+01:00 Paolo Bonzini <pbonzini@redhat.com>:
> >
> >
> > On 16/12/2015 10:28, Vincenzo Maffione wrote:
> >> Assuming my TX experiments with disconnected backend (and I disable
> >> CPU dynamic scaling of performance, etc.):
> >> 1) after patch 1 and 2, virtio bottleneck jumps from ~1Mpps to 1.910 Mpps.
> >> 2) after patch 1,2 and 3, virtio bottleneck jumps to 2.039 Mpps.
> >>
> >> So I see an improvement for patch 3, and I guess it's because we avoid
> >> an additional memory translation and related overhead. I believe that
> >> avoiding the memory translation is more beneficial than avoiding the
> >> variable-sized memcpy.
> >> I'm not surprised of that, because taking a brief look at what happens
> >> under the hood when you call an access_memory() function - it looks
> >> like a lot of operations.
> >
> > Great, thanks for confirming!
> >
> > Paolo
>
> No problems.
>
> I have some additional (orthogonal) curiosities:
>
> 1) Assuming "hw/virtio/dataplane/vring.c" is what I think it is (VQ
> data structures directly accessible in the host virtual memory, with
> guest-physical-to-host-virtual mapping done statically at setup time)
> why isn't QEMU using this approach also for virtio-net? I see it is
> used by virtio-blk only.
Because on Linux, nothing would be gained as compared to using vhost-net
in kernel or vhost-user with dpdk. virtio-net is there for non-Linux
hosts; keeping it simple is important to avoid e.g. security problems.
Same as serial, etc.
> 2) In any case (vring or not) QEMU dynamically maps data buffers
> from guest physical memory, for each descriptor to be processed: e1000
> uses pci_dma_read/pci_dma_write, virtio uses
> cpu_physical_memory_map()/cpu_physical_memory_unmap(), vring uses the
> more specialized vring_map()/vring_unmap(). All of these go through
> expensive lookups and related operations to do the address
> translation.
> Have you considered the possibility of caching the translation result to
> remove this bottleneck (maybe just for virtio devices)? Or is there any
> consistency or migration-related problem that would create issues?
> Just to give an example of what I'm talking about:
> https://github.com/vmaffione/qemu/blob/master/hw/net/e1000.c#L349-L423.
>
> At very high packet rates, once notifications (kicks and interrupts)
> have been amortized in some way, memory translation becomes the major
> bottleneck. And this (1 and 2) is why QEMU virtio implementation
> cannot achieve the same throughput as bhyve does (5-6 Mpps or more
> IIRC).
>
> Cheers,
> Vincenzo
>
>
>
> --
> Vincenzo Maffione
* Re: [Qemu-devel] [PATCH v2 0/3] virtio: proposal to optimize accesses to VQs
From: Vincenzo Maffione @ 2015-12-16 11:23 UTC
To: Michael S. Tsirkin
Cc: Jason Wang, Markus Armbruster, qemu-devel, Paolo Bonzini,
Giuseppe Lettieri, Luigi Rizzo
2015-12-16 12:02 GMT+01:00 Michael S. Tsirkin <mst@redhat.com>:
> On Wed, Dec 16, 2015 at 11:39:46AM +0100, Vincenzo Maffione wrote:
>> 2015-12-16 10:34 GMT+01:00 Paolo Bonzini <pbonzini@redhat.com>:
>> >
>> >
>> > On 16/12/2015 10:28, Vincenzo Maffione wrote:
>> >> Assuming my TX experiments with disconnected backend (and I disable
>> >> CPU dynamic scaling of performance, etc.):
>> >> 1) after patch 1 and 2, virtio bottleneck jumps from ~1Mpps to 1.910 Mpps.
>> >> 2) after patch 1,2 and 3, virtio bottleneck jumps to 2.039 Mpps.
>> >>
>> >> So I see an improvement for patch 3, and I guess it's because we avoid
>> >> an additional memory translation and related overhead. I believe that
>> >> avoiding the memory translation is more beneficial than avoiding the
>> >> variable-sized memcpy.
>> >> I'm not surprised of that, because taking a brief look at what happens
>> >> under the hood when you call an access_memory() function - it looks
>> >> like a lot of operations.
>> >
>> > Great, thanks for confirming!
>> >
>> > Paolo
>>
>> No problems.
>>
>> I have some additional (orthogonal) curiosities:
>>
>> 1) Assuming "hw/virtio/dataplane/vring.c" is what I think it is (VQ
>> data structures directly accessible in the host virtual memory, with
>> guest-physical-to-host-virtual mapping done statically at setup time)
>> why isn't QEMU using this approach also for virtio-net? I see it is
>> used by virtio-blk only.
>
> Because on Linux, nothing would be gained as compared to using vhost-net
> in kernel or vhost-user with dpdk. virtio-net is there for non-Linux
> hosts; keeping it simple is important to avoid e.g. security problems.
> Same as serial, etc.
Ok, thanks for the clarification.
>
>> 2) In any case (vring or not) QEMU dynamically maps data buffers
>> from guest physical memory, for each descriptor to be processed: e1000
>> uses pci_dma_read/pci_dma_write, virtio uses
>> cpu_physical_memory_map()/cpu_physical_memory_unmap(), vring uses the
>> more specialized vring_map()/vring_unmap(). All of these go through
>> expensive lookups and related operations to do the address
>> translation.
>> Have you considered the possibility of caching the translation result to
>> remove this bottleneck (maybe just for virtio devices)? Or is there any
>> consistency or migration-related problem that would create issues?
>> Just to give an example of what I'm talking about:
>> https://github.com/vmaffione/qemu/blob/master/hw/net/e1000.c#L349-L423.
>>
>> At very high packet rates, once notifications (kicks and interrupts)
>> have been amortized in some way, memory translation becomes the major
>> bottleneck. And this (1 and 2) is why QEMU virtio implementation
>> cannot achieve the same throughput as bhyve does (5-6 Mpps or more
>> IIRC).
>>
>> Cheers,
>> Vincenzo
>>
>>
>>
>> --
>> Vincenzo Maffione
--
Vincenzo Maffione
* Re: [Qemu-devel] [PATCH v2 0/3] virtio: proposal to optimize accesses to VQs
From: Paolo Bonzini @ 2015-12-16 11:37 UTC
To: Vincenzo Maffione
Cc: Michael S. Tsirkin, Jason Wang, Markus Armbruster, qemu-devel,
Giuseppe Lettieri, Luigi Rizzo
On 16/12/2015 11:39, Vincenzo Maffione wrote:
> No problems.
>
> I have some additional (orthogonal) curiosities:
>
> 1) Assuming "hw/virtio/dataplane/vring.c" is what I think it is (VQ
> data structures directly accessible in the host virtual memory, with
> guest-physical-to-host-virtual mapping done statically at setup time)
> why isn't QEMU using this approach also for virtio-net? I see it is
> used by virtio-blk only.
Yes, it's a good idea. vring.c's roots are in the old "virtio-blk
dataplane" code, which bypassed all of QEMU. It had a separate event
loop, a separate I/O queue, a separate implementation of the memory map,
and a separate implementation of virtio. virtio-blk dataplane is plenty
fast, which is why there isn't a vhost-blk.
Right now the only part that survives is the last one, which is vring.c.
The event loop has been reconciled with AioContext, the I/O queue uses
BlockDriverState, and the memory map uses memory_region_find.
virtio-net dataplane also existed, as a possible competitor of
vhost-net. However, vhost-net actually had better performance, so
virtio-net dataplane was never committed. As Michael mentioned, in
practice on Linux you use vhost, and non-Linux hypervisors you do not
use QEMU. :)
Indeed the implementation in virtio.c does kind of suck. On the other
hand, vring.c doesn't support migration because it doesn't know how to
mark guest memory as dirty. And virtio.c is used by plenty of
devices---including virtio-blk and virtio-scsi unless you enable usage
of a separate I/O thread---and converting them to vring.c is bug-prone.
This is why I would like to drop vring.c and improve virtio.c, rather
than use vring.c even more.
The main optimization that vring.c has is to cache the translation of
the rings. Using address_space_map/unmap for rings in virtio.c would be
a noticeable improvement, as your numbers for patch 3 show. However, by
caching translations you also conveniently "forget" to promptly mark the
pages as dirty. As you pointed out this is obviously an issue for
migration. You can then add a notifier for runstate changes. When
entering RUN_STATE_FINISH_MIGRATE or RUN_STATE_SAVE_VM the rings would
be unmapped, and then remapped the next time the VM starts running again.
You also guessed right that there are consistency issues; for these you
can add a MemoryListener that invalidates all mappings.
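Roughly something along these lines (all names below, used_map, used_map_len
and virtio_queue_unmap_rings, are invented for the sketch; it assumes the code
lives in virtio.c where the VirtQueue internals are visible, and the lazy
remap-on-next-access path is not shown):

#include "sysemu/sysemu.h"
#include "exec/address-spaces.h"

/* Sketch only: drop the cached ring mappings whenever the VM stops running,
 * so that savevm/migration sees correctly dirtied pages. */
static void virtio_queue_unmap_rings(VirtIODevice *vdev)
{
    int i;

    for (i = 0; i < VIRTIO_QUEUE_MAX; i++) {
        VirtQueue *vq = &vdev->vq[i];

        if (vq->used_map) {                /* invented cache fields */
            address_space_unmap(&address_space_memory, vq->used_map,
                                vq->used_map_len, true, vq->used_map_len);
            vq->used_map = NULL;
        }
        /* ... same for the cached desc and avail ring mappings ... */
    }
}

static void virtio_ring_cache_vmstate_change(void *opaque, int running,
                                             RunState state)
{
    VirtIODevice *vdev = opaque;

    if (!running) {
        /* Covers RUN_STATE_FINISH_MIGRATE, RUN_STATE_SAVE_VM, etc.; the
         * rings are remapped lazily on the next virtqueue access. */
        virtio_queue_unmap_rings(vdev);
    }
}

static void virtio_ring_cache_init(VirtIODevice *vdev)
{
    qemu_add_vm_change_state_handler(virtio_ring_cache_vmstate_change, vdev);
}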
That said, I'm wondering where the cost of address translation lies---is
it cache-unfriendly data structures, locked operations, or simply too
much code to execute? It was quite surprising to me that on virtio-blk
benchmarks we were spending 5% of the time doing memcpy! (I have just
extracted from my branch the patches to remove that, and sent them to
qemu-devel).
Examples of missing optimizations in exec.c include:
* caching enough information in RAM MemoryRegions to avoid the calls to
qemu_get_ram_block (e.g. replace mr->ram_addr with a RAMBlock pointer);
* adding a MRU cache to address_space_lookup_region.
In particular, the former should be easy if you want to give it a
try---easier than caching ring translations in virtio.c.
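The MRU one is also mostly mechanical; it would look more or less like this
(sketch, not tested: mru_section would be a new AddressSpaceDispatch field,
starting out NULL and implicitly reset whenever the dispatch is rebuilt, and a
complete version must special-case the catch-all background section whose size
does not fit in 64 bits):

/* Sketch of an MRU cache in front of the phys_page_find()-based lookup. */
static MemoryRegionSection *lookup_region_mru(AddressSpaceDispatch *d,
                                              hwaddr addr)
{
    MemoryRegionSection *section = d->mru_section;   /* invented field */

    if (!section ||
        addr < section->offset_within_address_space ||
        addr - section->offset_within_address_space >=
            int128_get64(section->size)) {
        /* Miss: fall back to the existing slow lookup and remember it. */
        section = address_space_lookup_region(d, addr, false);
        d->mru_section = section;
    }
    return section;
}

For consecutive accesses that hit the same RAM section (the common case for
vring accesses), phys_page_find then disappears from the profile.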
Paolo
* Re: [Qemu-devel] [PATCH v2 0/3] virtio: proposal to optimize accesses to VQs
From: Vincenzo Maffione @ 2015-12-16 14:25 UTC
To: Paolo Bonzini
Cc: Michael S. Tsirkin, Jason Wang, Markus Armbruster, qemu-devel,
Giuseppe Lettieri, Luigi Rizzo
2015-12-16 12:37 GMT+01:00 Paolo Bonzini <pbonzini@redhat.com>:
>
>
> On 16/12/2015 11:39, Vincenzo Maffione wrote:
>> No problems.
>>
>> I have some additional (orthogonal) curiosities:
>>
>> 1) Assuming "hw/virtio/dataplane/vring.c" is what I think it is (VQ
>> data structures directly accessible in the host virtual memory, with
>> guest-physical-to-host-virtual mapping done statically at setup time)
>> why isn't QEMU using this approach also for virtio-net? I see it is
>> used by virtio-blk only.
>
> Yes, it's a good idea. vring.c's roots are in the old "virtio-blk
> dataplane" code, which bypassed all of QEMU. It had a separate event
> loop, a separate I/O queue, a separate implementation of the memory map,
> and a separate implementation of virtio. virtio-blk dataplane is plenty
> fast, which is why there isn't a vhost-blk.
>
> Right now the only part that survives is the last one, which is vring.c.
> The event loop has been reconciled with AioContext, the I/O queue uses
> BlockDriverState, and the memory map uses memory_region_find.
>
> virtio-net dataplane also existed, as a possible competitor of
> vhost-net. However, vhost-net actually had better performance, so
> virtio-net dataplane was never committed. As Michael mentioned, in
> practice on Linux you use vhost, and non-Linux hypervisors you do not
> use QEMU. :)
Yes, I understand. However, another possible use case would be using QEMU
+ virtio-net + netmap backend + Linux (e.g. for QEMU-sandboxed packet
generators or packet processors, where very high packet rates are
common), where it is not possible to use vhost.
>
> Indeed the implementation in virtio.c does kind of suck. On the other
> hand, vring.c doesn't support migration because it doesn't know how to
> mark guest memory as dirty. And virtio.c is used by plenty of
> devices---including virtio-blk and virtio-scsi unless you enable usage
> of a separate I/O thread---and converting them to vring.c is bug-prone.
> This is why I would like to drop vring.c and improve virtio.c, rather
> than use vring.c even more.
+1
>
> The main optimization that vring.c has is to cache the translation of
> the rings. Using address_space_map/unmap for rings in virtio.c would be
> a noticeable improvement, as your numbers for patch 3 show. However, by
> caching translations you also conveniently "forget" to promptly mark the
> pages as dirty. As you pointed out this is obviously an issue for
> migration. You can then add a notifier for runstate changes. When
> entering RUN_STATE_FINISH_MIGRATE or RUN_STATE_SAVE_VM the rings would
> be unmapped, and then remapped the next time the VM starts running again.
>
Ok so it seems feasible with a bit of care. The numbers we've been
seing in various experiments have always shown that this optimization
could easily double the 2 Mpps packet rate bottleneck.
> You also guessed right that there are consistency issues; for these you
> can add a MemoryListener that invalidates all mappings.
Yeah, but I don't know exactly what kind of inconsistencies there can
be. Maybe the memory we are mapping may be hot-unplugged?
>
> That said, I'm wondering where the cost of address translation lies---is
> it cache-unfriendly data structures, locked operations, or simply too
> much code to execute? It was quite surprising to me that on virtio-blk
> benchmarks we were spending 5% of the time doing memcpy! (I have just
> extracted from my branch the patches to remove that, and sent them to
> qemu-devel).
I feel it's just too much code (but I may be wrong).
I'm not sure whether you are thinking that 5% is too much or too little.
To me it's too little, showing that most of the overhead is
somewhere else (e.g. memory translation, or backend processing). In an
ideal transmission system, most of the overhead should be spent on
copying, because it means that you successfully managed to suppress
notifications and translation overhead.
I've tried out your patches, but unfortunately I don't see any effect
on my transmission tests (fast transmission over virtio-net with
disconnected backend).
>
> Examples of missing optimizations in exec.c include:
>
> * caching enough information in RAM MemoryRegions to avoid the calls to
> qemu_get_ram_block (e.g. replace mr->ram_addr with a RAMBlock pointer);
>
> * adding a MRU cache to address_space_lookup_region.
>
> In particular, the former should be easy if you want to give it a
> try---easier than caching ring translations in virtio.c.
>
> Paolo
Thank you so much for the insights :)
Cheers,
Vincenzo
* Re: [Qemu-devel] [PATCH v2 0/3] virtio: proposal to optimize accesses to VQs
From: Paolo Bonzini @ 2015-12-16 15:46 UTC
To: Vincenzo Maffione
Cc: Michael S. Tsirkin, Jason Wang, Markus Armbruster, qemu-devel,
Giuseppe Lettieri, Luigi Rizzo
On 16/12/2015 15:25, Vincenzo Maffione wrote:
>> vhost-net actually had better performance, so virtio-net dataplane
>> was never committed. As Michael mentioned, in practice on Linux you
>> use vhost, and non-Linux hypervisors you do not use QEMU. :)
>
> Yes, I understand. However, another possible use case would be using QEMU
> + virtio-net + netmap backend + Linux (e.g. for QEMU-sandboxed packet
> generators or packet processors, where very high packet rates are
> common), where it is not possible to use vhost.
Yes, of course. That was tongue in cheek. Another possibility for your
use case is to interface with netmap through vhost-user, but I'm happy
if you choose to improve virtio.c instead!
>> The main optimization that vring.c has is to cache the translation of
>> the rings. Using address_space_map/unmap for rings in virtio.c would be
>> a noticeable improvement, as your numbers for patch 3 show. However, by
>> caching translations you also conveniently "forget" to promptly mark the
>> pages as dirty. As you pointed out this is obviously an issue for
>> migration. You can then add a notifier for runstate changes. When
>> entering RUN_STATE_FINISH_MIGRATE or RUN_STATE_SAVE_VM the rings would
>> be unmapped, and then remapped the next time the VM starts running again.
>
> Ok so it seems feasible with a bit of care. The numbers we've been
> seing in various experiments have always shown that this optimization
> could easily double the 2 Mpps packet rate bottleneck.
Cool. Bonus points for nicely abstracting it so that virtio.c is just a
user.
>> You also guessed right that there are consistency issues; for these you
>> can add a MemoryListener that invalidates all mappings.
>
> Yeah, but I don't know exactly what kind of inconsistencies there can
> be. Maybe the memory we are mapping may be hot-unplugged?
Yes. Just blow away all mappings in the MemoryListener commit callback.
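In code it's as simple as this (a sketch; ring_listener would be a new
VirtIODevice field, and virtio_queue_unmap_rings is the same invented helper
from my previous mail):

#include "exec/memory.h"
#include "exec/address-spaces.h"

/* Sketch: invalidate every cached ring translation when the memory
 * topology changes (hotplug/unplug, remapping, ...). */
static void virtio_ring_cache_commit(MemoryListener *listener)
{
    VirtIODevice *vdev = container_of(listener, VirtIODevice, ring_listener);

    virtio_queue_unmap_rings(vdev);    /* remapped lazily on next access */
}

static void virtio_ring_cache_listener_init(VirtIODevice *vdev)
{
    vdev->ring_listener = (MemoryListener) {
        .commit = virtio_ring_cache_commit,
    };
    memory_listener_register(&vdev->ring_listener, &address_space_memory);
}

The commit callback only runs at the end of a memory transaction, i.e. when
something in the memory map is actually being changed, so the data fast path
is untouched.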
>> That said, I'm wondering where the cost of address translation lies---is
>> it cache-unfriendly data structures, locked operations, or simply too
>> much code to execute? It was quite surprising to me that on virtio-blk
>> benchmarks we were spending 5% of the time doing memcpy! (I have just
>> extracted from my branch the patches to remove that, and sent them to
>> qemu-devel).
>
> I feel it's just too much code (but I may be wrong).
That is likely to be a good guess, but notice that the fast path doesn't
actually have _that much_ code, because many of the "if"s are almost
always false.
Looking at a profile would be useful. Is it flat, or does something
(e.g. address_space_translate) actually stand out?
> I'm not sure whether you are thinking that 5% is too much or too little.
> To me it's too little, showing that most of the overhead is
> somewhere else (e.g. memory translation, or backend processing). In an
> ideal transmission system, most of the overhead should be spent on
> copying, because it means that you successfully managed to suppress
> notifications and translation overhead.
On copying data, though---not on copying virtio descriptors. 5% for
those is entirely wasted time.
Also, note that I'm looking at disk I/O rather than networking, where
there should be no copies at all.
Paolo
>> Examples of missing optimizations in exec.c include:
>>
>> * caching enough information in RAM MemoryRegions to avoid the calls to
>> qemu_get_ram_block (e.g. replace mr->ram_addr with a RAMBlock pointer);
>>
>> * adding a MRU cache to address_space_lookup_region.
>>
>> In particular, the former should be easy if you want to give it a
>> try---easier than caching ring translations in virtio.c.
>
> Thank you so much for the insights :)
* Re: [Qemu-devel] [PATCH v2 0/3] virtio: proposal to optimize accesses to VQs
From: Vincenzo Maffione @ 2015-12-30 16:45 UTC
To: Paolo Bonzini
Cc: Michael S. Tsirkin, Jason Wang, Markus Armbruster, qemu-devel,
Giuseppe Lettieri, Luigi Rizzo
2015-12-16 16:46 GMT+01:00 Paolo Bonzini <pbonzini@redhat.com>:
>
>
> On 16/12/2015 15:25, Vincenzo Maffione wrote:
>>> vhost-net actually had better performance, so virtio-net dataplane
>>> was never committed. As Michael mentioned, in practice on Linux you
>>> use vhost, and non-Linux hypervisors you do not use QEMU. :)
>>
>> Yes, I understand. However, another possible use case would be using QEMU
>> + virtio-net + netmap backend + Linux (e.g. for QEMU-sandboxed packet
>> generators or packet processors, where very high packet rates are
>> common), where it is not possible to use vhost.
>
> Yes, of course. That was tongue in cheek. Another possibility for your
> use case is to interface with netmap through vhost-user, but I'm happy
> if you choose to improve virtio.c instead!
>
>>> The main optimization that vring.c has is to cache the translation of
>>> the rings. Using address_space_map/unmap for rings in virtio.c would be
>>> a noticeable improvement, as your numbers for patch 3 show. However, by
>>> caching translations you also conveniently "forget" to promptly mark the
>>> pages as dirty. As you pointed out this is obviously an issue for
>>> migration. You can then add a notifier for runstate changes. When
>>> entering RUN_STATE_FINISH_MIGRATE or RUN_STATE_SAVE_VM the rings would
>>> be unmapped, and then remapped the next time the VM starts running again.
>>
>> Ok so it seems feasible with a bit of care. The numbers we've been
>> seing in various experiments have always shown that this optimization
>> could easily double the 2 Mpps packet rate bottleneck.
>
> Cool. Bonus points for nicely abstracting it so that virtio.c is just a
> user.
>
>>> You also guessed right that there are consistency issues; for these you
>>> can add a MemoryListener that invalidates all mappings.
>>
>> Yeah, but I don't know exactly what kind of inconsistencies there can
>> be. Maybe the memory we are mapping may be hot-unplugged?
>
> Yes. Just blow away all mappings in the MemoryListener commit callback.
>
>>> That said, I'm wondering where the cost of address translation lies---is
>>> it cache-unfriendly data structures, locked operations, or simply too
>>> much code to execute? It was quite surprising to me that on virtio-blk
>>> benchmarks we were spending 5% of the time doing memcpy! (I have just
>>> extracted from my branch the patches to remove that, and sent them to
>>> qemu-devel).
>>
>> I feel it's just too much code (but I may be wrong).
>
> That is likely to be a good guess, but notice that the fast path doesn't
> actually have _that much_ code, because many of the "if"s are almost
> always false.
>
> Looking at a profile would be useful. Is it flat, or does something
> (e.g. address_space_translate) actually stand out?
I'm so sorry, I forgot to answer this.
This is what perf top shows while running the experiment:
12.35% qemu-system-x86_64 [.] address_space_map
10.87% qemu-system-x86_64 [.] vring_desc_read.isra.0
7.50% qemu-system-x86_64 [.] address_space_lduw_le
6.32% qemu-system-x86_64 [.] address_space_translate
5.84% qemu-system-x86_64 [.] address_space_translate_internal
5.75% qemu-system-x86_64 [.] phys_page_find
5.74% qemu-system-x86_64 [.] qemu_ram_block_from_host
4.04% qemu-system-x86_64 [.] address_space_stw_le
4.02% qemu-system-x86_64 [.] address_space_write
3.33% qemu-system-x86_64 [.] virtio_should_notify
So it seems most of the time is spent doing address translations.
>
>> I'm not sure whether you are thinking that 5% is too much or too little.
>> To me it's too little, showing that most of the overhead is
>> somewhere else (e.g. memory translation, or backend processing). In an
>> ideal transmission system, most of the overhead should be spent on
>> copying, because it means that you successfully managed to suppress
>> notifications and translation overhead.
>
> On copying data, though---not on copying virtio descriptors. 5% for
> those is entirely wasted time.
>
> Also, note that I'm looking at disk I/O rather than networking, where
> there should be no copies at all.
In the experiment I'm doing there is a per-packet copy from the guest
memory to the netmap backend.
Cheers,
Vincenzo
>
> Paolo
>
>>> Examples of missing optimizations in exec.c include:
>>>
>>> * caching enough information in RAM MemoryRegions to avoid the calls to
>>> qemu_get_ram_block (e.g. replace mr->ram_addr with a RAMBlock pointer);
>>>
>>> * adding a MRU cache to address_space_lookup_region.
>>>
>>> In particular, the former should be easy if you want to give it a
>>> try---easier than caching ring translations in virtio.c.
>>
>> Thank you so much for the insights :)
>
--
Vincenzo Maffione