* [PATCH 0/9][RFC] KVM virtio_net performance
@ 2008-07-24 11:46 Mark McLoughlin
2008-07-24 11:46 ` [PATCH 1/9] kvm: qemu: Set MIN_TIMER_REARM_US to 150us Mark McLoughlin
` (4 more replies)
0 siblings, 5 replies; 36+ messages in thread
From: Mark McLoughlin @ 2008-07-24 11:46 UTC (permalink / raw)
To: kvm; +Cc: Herbert Xu, Rusty Russell
Hey,
Here's a bunch of patches attempting to improve the performance
of virtio_net. This is more of an RFC than a patch submission
since, as can be seen below, not all of the patches measurably
improve performance.
I've tried hard to test each of these patches with as stable and
informative a benchmark as I could find. The first benchmark is a
netperf[1] based throughput benchmark and the second uses a flood
ping[2] to measure latency differences.
Each set of figures is min/average/max/standard deviation. The
first set is Gb/s and the second is milliseconds.
The network configuration used was very simple - the guest with
a virtio_net interface and the host with a tap interface and static
IP addresses assigned to both - i.e. there was no bridge involved
on the host, and iptables was disabled in both the host and guest.
I used:
1) kvm-71-26-g6152996 with the patches that follow
2) Linus's v2.6.26-5752-g93ded9b with Rusty's virtio patches from
219:bbd2611289c5 applied; these are the patches that have just
been submitted to Linus
The conclusions I draw are:
1) The length of the tx mitigation timer makes quite a difference to
throughput achieved; we probably need a good heuristic for
adjusting this on the fly.
2) Using the recently merged GSO support in the tun/tap driver gives
a huge boost, but much more so on the host->guest side.
3) Adjusting the virtio_net ring sizes makes a small difference, but
not as much as one might expect.
4) Dropping the global mutex while reading GSO packets from the tap
interface gives a nice speedup. This highlights the global mutex
as a general performance issue.
5) Eliminating an extra copy on the host->guest path only makes a
barely measurable difference.
Anyway, the figures:
netperf, 10x20s runs (Gb/s) | guest->host | host->guest
-----------------------------+----------------------------+---------------------------
baseline | 1.520/ 1.573/ 1.610/ 0.034 | 1.160/ 1.357/ 1.630/ 0.165
50us tx timer + rearm | 1.050/ 1.086/ 1.110/ 0.017 | 1.710/ 1.832/ 1.960/ 0.092
250us tx timer + rearm | 1.700/ 1.764/ 1.880/ 0.064 | 0.900/ 1.203/ 1.580/ 0.205
150us tx timer + rearm | 1.520/ 1.602/ 1.690/ 0.044 | 1.670/ 1.928/ 2.150/ 0.141
no ring-full heuristic | 1.480/ 1.569/ 1.710/ 0.066 | 1.610/ 1.857/ 2.140/ 0.153
VIRTIO_F_NOTIFY_ON_EMPTY | 1.470/ 1.554/ 1.650/ 0.054 | 1.770/ 1.960/ 2.170/ 0.119
recv NO_NOTIFY | 1.530/ 1.604/ 1.680/ 0.047 | 1.780/ 1.944/ 2.190/ 0.129
GSO | 4.120/ 4.323/ 4.420/ 0.099 | 6.540/ 7.033/ 7.340/ 0.244
ring size == 256 | 4.050/ 4.406/ 4.560/ 0.143 | 6.280/ 7.236/ 8.280/ 0.613
ring size == 512 | 4.420/ 4.600/ 4.960/ 0.140 | 6.470/ 7.205/ 7.510/ 0.314
drop mutex during tapfd read | 4.320/ 4.578/ 4.790/ 0.161 | 8.370/ 8.589/ 8.730/ 0.120
aliguori zero-copy | 4.510/ 4.694/ 4.960/ 0.148 | 8.430/ 8.614/ 8.840/ 0.142
ping -f -c 100000 (ms) | guest->host | host->guest
-----------------------------+----------------------------+---------------------------
baseline | 0.060/ 0.459/ 7.602/ 0.846 | 0.067/ 0.331/ 2.517/ 0.057
50us tx timer + rearm | 0.081/ 0.143/ 7.436/ 0.374 | 0.093/ 0.133/ 1.883/ 0.026
250us tx timer + rearm | 0.302/ 0.463/ 7.580/ 0.849 | 0.297/ 0.344/ 2.128/ 0.028
150us tx timer + rearm | 0.197/ 0.323/ 7.671/ 0.740 | 0.199/ 0.245/ 7.836/ 0.037
no ring-full heuristic | 0.182/ 0.324/ 7.688/ 0.753 | 0.199/ 0.243/ 2.197/ 0.030
VIRTIO_F_NOTIFY_ON_EMPTY | 0.197/ 0.321/ 7.447/ 0.730 | 0.196/ 0.242/ 2.218/ 0.032
recv NO_NOTIFY | 0.186/ 0.321/ 7.520/ 0.732 | 0.200/ 0.233/ 2.216/ 0.028
GSO | 0.178/ 0.324/ 7.667/ 0.736 | 0.147/ 0.246/ 1.361/ 0.024
ring size == 256 | 0.184/ 0.323/ 7.674/ 0.728 | 0.199/ 0.243/ 2.181/ 0.028
ring size == 512 | (not measured) | (not measured)
drop mutex during tapfd read | 0.183/ 0.323/ 7.820/ 0.733 | 0.202/ 0.242/ 2.219/ 0.027
aliguori zero-copy | 0.185/ 0.325/ 7.863/ 0.736 | 0.202/ 0.245/ 7.844/ 0.036
Cheers,
Mark.
[1] - I used netperf trunk from:
http://www.netperf.org/svn/netperf2/trunk
and simply ran:
$> i=0; while [ $i -lt 10 ]; do ./netperf -H <host> -f g -l 20 -P 0 | netperf-collect.py; i=$((i+1)); done
where netperf-collect.py is just a script to calculate the
average across the runs:
http://markmc.fedorapeople.org/netperf-collect.py
[2] - ping -c 100000 -f <host>
^ permalink raw reply [flat|nested] 36+ messages in thread
* [PATCH 1/9] kvm: qemu: Set MIN_TIMER_REARM_US to 150us
2008-07-24 11:46 [PATCH 0/9][RFC] KVM virtio_net performance Mark McLoughlin
@ 2008-07-24 11:46 ` Mark McLoughlin
2008-07-24 11:46 ` [PATCH 2/9] kvm: qemu: Fix virtio_net tx timer Mark McLoughlin
2008-07-24 11:55 ` [PATCH 0/9][RFC] KVM virtio_net performance Herbert Xu
` (3 subsequent siblings)
4 siblings, 1 reply; 36+ messages in thread
From: Mark McLoughlin @ 2008-07-24 11:46 UTC (permalink / raw)
To: kvm; +Cc: Herbert Xu, Rusty Russell, Mark McLoughlin
Equivalent to ~300 syscalls on my machine
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
---
qemu/vl.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/qemu/vl.c b/qemu/vl.c
index 5d285cc..b7d3397 100644
--- a/qemu/vl.c
+++ b/qemu/vl.c
@@ -891,7 +891,7 @@ static void qemu_rearm_alarm_timer(struct qemu_alarm_timer *t)
}
/* TODO: MIN_TIMER_REARM_US should be optimized */
-#define MIN_TIMER_REARM_US 250
+#define MIN_TIMER_REARM_US 150
static struct qemu_alarm_timer *alarm_timer;
--
1.5.4.1
* [PATCH 2/9] kvm: qemu: Fix virtio_net tx timer
2008-07-24 11:46 ` [PATCH 1/9] kvm: qemu: Set MIN_TIMER_REARM_US to 150us Mark McLoughlin
@ 2008-07-24 11:46 ` Mark McLoughlin
2008-07-24 11:46 ` [PATCH 3/9] kvm: qemu: Remove virtio_net tx ring-full heuristic Mark McLoughlin
2008-07-26 9:48 ` [PATCH 2/9] kvm: qemu: Fix virtio_net tx timer Avi Kivity
0 siblings, 2 replies; 36+ messages in thread
From: Mark McLoughlin @ 2008-07-24 11:46 UTC (permalink / raw)
To: kvm; +Cc: Herbert Xu, Rusty Russell, Mark McLoughlin
The current virtio_net tx timer is 2ns, which doesn't
make any sense. Set it to a more reasonable 150us
instead.
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
---
qemu/hw/virtio-net.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c
index 2e57e5a..31867f1 100644
--- a/qemu/hw/virtio-net.c
+++ b/qemu/hw/virtio-net.c
@@ -26,7 +26,7 @@
#define VIRTIO_NET_F_MAC 5
#define VIRTIO_NET_F_GS0 6
-#define TX_TIMER_INTERVAL (1000 / 500)
+#define TX_TIMER_INTERVAL (150000) /* 150 us */
/* The config defining mac address (6 bytes) */
struct virtio_net_config
--
1.5.4.1
* [PATCH 3/9] kvm: qemu: Remove virtio_net tx ring-full heuristic
2008-07-24 11:46 ` [PATCH 2/9] kvm: qemu: Fix virtio_net tx timer Mark McLoughlin
@ 2008-07-24 11:46 ` Mark McLoughlin
2008-07-24 11:46 ` [PATCH 4/9] kvm: qemu: Add VIRTIO_F_NOTIFY_ON_EMPTY Mark McLoughlin
` (2 more replies)
2008-07-26 9:48 ` [PATCH 2/9] kvm: qemu: Fix virtio_net tx timer Avi Kivity
1 sibling, 3 replies; 36+ messages in thread
From: Mark McLoughlin @ 2008-07-24 11:46 UTC (permalink / raw)
To: kvm; +Cc: Herbert Xu, Rusty Russell, Mark McLoughlin
When virtio_net receives a tx notification from the guest, it
tries to guess whether the notification indicates that the guest
has no more room in the tx ring and that it should immediately
flush the queued buffers.
The heuristic is based on the fact that there are 128
buffer entries in the ring and each packet uses 2 buffers
(i.e. the virtio_net_hdr and the packet's linear data).
Using GSO or increasing the size of the rings will break
that heuristic, so let's remove it and assume that any
notification from the guest after we've disabled
notifications indicates that we should flush our buffers.
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
---
qemu/hw/virtio-net.c | 3 +--
1 files changed, 1 insertions(+), 2 deletions(-)
diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c
index 31867f1..4adfa42 100644
--- a/qemu/hw/virtio-net.c
+++ b/qemu/hw/virtio-net.c
@@ -175,8 +175,7 @@ static void virtio_net_handle_tx(VirtIODevice *vdev, VirtQueue *vq)
{
VirtIONet *n = to_virtio_net(vdev);
- if (n->tx_timer_active &&
- (vq->vring.avail->idx - vq->last_avail_idx) == 64) {
+ if (n->tx_timer_active) {
vq->vring.used->flags &= ~VRING_USED_F_NO_NOTIFY;
qemu_del_timer(n->tx_timer);
n->tx_timer_active = 0;
--
1.5.4.1
* [PATCH 4/9] kvm: qemu: Add VIRTIO_F_NOTIFY_ON_EMPTY
2008-07-24 11:46 ` [PATCH 3/9] kvm: qemu: Remove virtio_net tx ring-full heuristic Mark McLoughlin
@ 2008-07-24 11:46 ` Mark McLoughlin
2008-07-24 11:46 ` [PATCH 5/9] kvm: qemu: Disable recv notifications until avail buffers exhausted Mark McLoughlin
2008-07-24 23:22 ` [PATCH 3/9] kvm: qemu: Remove virtio_net tx ring-full heuristic Dor Laor
2008-07-24 23:56 ` Dor Laor
2 siblings, 1 reply; 36+ messages in thread
From: Mark McLoughlin @ 2008-07-24 11:46 UTC (permalink / raw)
To: kvm; +Cc: Herbert Xu, Rusty Russell, Mark McLoughlin
Set the VIRTIO_F_NOTIFY_ON_EMPTY feature bit so the
guest can rely on us notifying it when the queue
is empty.
Also, only notify when the available queue is empty
*and* when we've finished with all the buffers we
had detached. Right now, when the queue is empty,
we notify the guest for every used buffer.
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
---
qemu/hw/virtio.c | 6 +++++-
qemu/hw/virtio.h | 5 +++++
2 files changed, 10 insertions(+), 1 deletions(-)
diff --git a/qemu/hw/virtio.c b/qemu/hw/virtio.c
index 3429ac8..e035e4e 100644
--- a/qemu/hw/virtio.c
+++ b/qemu/hw/virtio.c
@@ -138,6 +138,7 @@ void virtqueue_push(VirtQueue *vq, const VirtQueueElement *elem,
/* Make sure buffer is written before we update index. */
wmb();
vq->vring.used->idx++;
+ vq->inuse--;
}
int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
@@ -187,6 +188,8 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
elem->index = head;
+ vq->inuse++;
+
return elem->in_num + elem->out_num;
}
@@ -275,6 +278,7 @@ static uint32_t virtio_ioport_read(void *opaque, uint32_t addr)
switch (addr) {
case VIRTIO_PCI_HOST_FEATURES:
ret = vdev->get_features(vdev);
+ ret |= (1 << VIRTIO_F_NOTIFY_ON_EMPTY);
break;
case VIRTIO_PCI_GUEST_FEATURES:
ret = vdev->features;
@@ -431,7 +435,7 @@ VirtQueue *virtio_add_queue(VirtIODevice *vdev, int queue_size,
void virtio_notify(VirtIODevice *vdev, VirtQueue *vq)
{
/* Always notify when queue is empty */
- if (vq->vring.avail->idx != vq->last_avail_idx &&
+ if ((vq->inuse || vq->vring.avail->idx != vq->last_avail_idx) &&
(vq->vring.avail->flags & VRING_AVAIL_F_NO_INTERRUPT))
return;
diff --git a/qemu/hw/virtio.h b/qemu/hw/virtio.h
index 61f5038..1adaed3 100644
--- a/qemu/hw/virtio.h
+++ b/qemu/hw/virtio.h
@@ -30,6 +30,10 @@
/* We've given up on this device. */
#define VIRTIO_CONFIG_S_FAILED 0x80
+/* We notify when the ring is completely used, even if the guest is suppressing
+ * callbacks */
+#define VIRTIO_F_NOTIFY_ON_EMPTY 24
+
/* from Linux's linux/virtio_ring.h */
/* This marks a buffer as continuing via the next field. */
@@ -86,6 +90,7 @@ struct VirtQueue
VRing vring;
uint32_t pfn;
uint16_t last_avail_idx;
+ int inuse;
void (*handle_output)(VirtIODevice *vdev, VirtQueue *vq);
};
--
1.5.4.1
* [PATCH 5/9] kvm: qemu: Disable recv notifications until avail buffers exhausted
2008-07-24 11:46 ` [PATCH 4/9] kvm: qemu: Add VIRTIO_F_NOTIFY_ON_EMPTY Mark McLoughlin
@ 2008-07-24 11:46 ` Mark McLoughlin
2008-07-24 11:46 ` [PATCH 6/9] kvm: qemu: Add support for partial csums and GSO Mark McLoughlin
0 siblings, 1 reply; 36+ messages in thread
From: Mark McLoughlin @ 2008-07-24 11:46 UTC (permalink / raw)
To: kvm; +Cc: Herbert Xu, Rusty Russell, Mark McLoughlin
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
---
qemu/hw/virtio-net.c | 5 ++++-
1 files changed, 4 insertions(+), 1 deletions(-)
diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c
index 4adfa42..419a2d7 100644
--- a/qemu/hw/virtio-net.c
+++ b/qemu/hw/virtio-net.c
@@ -106,9 +106,12 @@ static int virtio_net_can_receive(void *opaque)
!(n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK))
return 0;
- if (n->rx_vq->vring.avail->idx == n->rx_vq->last_avail_idx)
+ if (n->rx_vq->vring.avail->idx == n->rx_vq->last_avail_idx) {
+ n->rx_vq->vring.used->flags &= ~VRING_USED_F_NO_NOTIFY;
return 0;
+ }
+ n->rx_vq->vring.used->flags |= VRING_USED_F_NO_NOTIFY;
return 1;
}
--
1.5.4.1
* [PATCH 6/9] kvm: qemu: Add support for partial csums and GSO
2008-07-24 11:46 ` [PATCH 5/9] kvm: qemu: Disable recv notifications until avail buffers exhausted Mark McLoughlin
@ 2008-07-24 11:46 ` Mark McLoughlin
2008-07-24 11:46 ` [PATCH 7/9] kvm: qemu: Increase size of virtio_net rings Mark McLoughlin
0 siblings, 1 reply; 36+ messages in thread
From: Mark McLoughlin @ 2008-07-24 11:46 UTC (permalink / raw)
To: kvm; +Cc: Herbert Xu, Rusty Russell, Mark McLoughlin
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
---
qemu/hw/virtio-net.c | 86 +++++++++++++++++++++++++++++++++++++++++---------
qemu/net.h | 5 +++
qemu/vl.c | 73 +++++++++++++++++++++++++++++++++++++++---
3 files changed, 144 insertions(+), 20 deletions(-)
diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c
index 419a2d7..81282c4 100644
--- a/qemu/hw/virtio-net.c
+++ b/qemu/hw/virtio-net.c
@@ -22,9 +22,18 @@
#define VIRTIO_ID_NET 1
/* The feature bitmap for virtio net */
-#define VIRTIO_NET_F_NO_CSUM 0
-#define VIRTIO_NET_F_MAC 5
-#define VIRTIO_NET_F_GS0 6
+#define VIRTIO_NET_F_CSUM 0 /* Host handles pkts w/ partial csum */
+#define VIRTIO_NET_F_GUEST_CSUM 1 /* Guest handles pkts w/ partial csum */
+#define VIRTIO_NET_F_MAC 5 /* Host has given MAC address. */
+#define VIRTIO_NET_F_GSO 6 /* Host handles pkts w/ any GSO type */
+#define VIRTIO_NET_F_GUEST_TSO4 7 /* Guest can handle TSOv4 in. */
+#define VIRTIO_NET_F_GUEST_TSO6 8 /* Guest can handle TSOv6 in. */
+#define VIRTIO_NET_F_GUEST_ECN 9 /* Guest can handle TSO[6] w/ ECN in. */
+#define VIRTIO_NET_F_GUEST_UFO 10 /* Guest can handle UFO in. */
+#define VIRTIO_NET_F_HOST_TSO4 11 /* Host can handle TSOv4 in. */
+#define VIRTIO_NET_F_HOST_TSO6 12 /* Host can handle TSOv6 in. */
+#define VIRTIO_NET_F_HOST_ECN 13 /* Host can handle TSO[6] w/ ECN in. */
+#define VIRTIO_NET_F_HOST_UFO 14 /* Host can handle UFO in. */
#define TX_TIMER_INTERVAL (150000) /* 150 us */
@@ -42,8 +51,6 @@ struct virtio_net_hdr
uint8_t flags;
#define VIRTIO_NET_HDR_GSO_NONE 0 // Not a GSO frame
#define VIRTIO_NET_HDR_GSO_TCPV4 1 // GSO frame, IPv4 TCP (TSO)
-/* FIXME: Do we need this? If they said they can handle ECN, do they care? */
-#define VIRTIO_NET_HDR_GSO_TCPV4_ECN 2 // GSO frame, IPv4 TCP w/ ECN
#define VIRTIO_NET_HDR_GSO_UDP 3 // GSO frame, IPv4 UDP (UFO)
#define VIRTIO_NET_HDR_GSO_TCPV6 4 // GSO frame, IPv6 TCP
#define VIRTIO_NET_HDR_GSO_ECN 0x80 // TCP has ECN set
@@ -85,7 +92,38 @@ static void virtio_net_update_config(VirtIODevice *vdev, uint8_t *config)
static uint32_t virtio_net_get_features(VirtIODevice *vdev)
{
- return (1 << VIRTIO_NET_F_MAC);
+ VirtIONet *n = to_virtio_net(vdev);
+ VLANClientState *host = n->vc->vlan->first_client;
+ uint32_t features = (1 << VIRTIO_NET_F_MAC);
+
+ if (tap_has_offload(host)) {
+ features |= (1 << VIRTIO_NET_F_CSUM);
+ features |= (1 << VIRTIO_NET_F_GUEST_CSUM);
+ features |= (1 << VIRTIO_NET_F_GUEST_TSO4);
+ features |= (1 << VIRTIO_NET_F_GUEST_TSO6);
+ features |= (1 << VIRTIO_NET_F_GUEST_ECN);
+ features |= (1 << VIRTIO_NET_F_HOST_TSO4);
+ features |= (1 << VIRTIO_NET_F_HOST_TSO6);
+ features |= (1 << VIRTIO_NET_F_HOST_ECN);
+ /* Kernel can't actually handle UFO in software currently. */
+ }
+
+ return features;
+}
+
+static void virtio_net_set_features(VirtIODevice *vdev, uint32_t features)
+{
+ VirtIONet *n = to_virtio_net(vdev);
+ VLANClientState *host = n->vc->vlan->first_client;
+
+ if (!tap_has_offload(host) || !host->set_offload)
+ return;
+
+ host->set_offload(host,
+ (features >> VIRTIO_NET_F_GUEST_CSUM) & 1,
+ (features >> VIRTIO_NET_F_GUEST_TSO4) & 1,
+ (features >> VIRTIO_NET_F_GUEST_TSO6) & 1,
+ (features >> VIRTIO_NET_F_GUEST_ECN) & 1);
}
/* RX */
@@ -121,6 +159,7 @@ static void virtio_net_receive(void *opaque, const uint8_t *buf, int size)
VirtQueueElement elem;
struct virtio_net_hdr *hdr;
int offset, i;
+ int total;
if (virtqueue_pop(n->rx_vq, &elem) == 0)
return;
@@ -134,18 +173,26 @@ static void virtio_net_receive(void *opaque, const uint8_t *buf, int size)
hdr->flags = 0;
hdr->gso_type = VIRTIO_NET_HDR_GSO_NONE;
- /* copy in packet. ugh */
offset = 0;
+ total = sizeof(*hdr);
+
+ if (tap_has_offload(n->vc->vlan->first_client)) {
+ memcpy(hdr, buf, sizeof(*hdr));
+ offset += total;
+ }
+
+ /* copy in packet. ugh */
i = 1;
while (offset < size && i < elem.in_num) {
int len = MIN(elem.in_sg[i].iov_len, size - offset);
memcpy(elem.in_sg[i].iov_base, buf + offset, len);
offset += len;
+ total += len;
i++;
}
/* signal other side */
- virtqueue_push(n->rx_vq, &elem, sizeof(*hdr) + offset);
+ virtqueue_push(n->rx_vq, &elem, total);
virtio_notify(&n->vdev, n->rx_vq);
}
@@ -153,23 +200,31 @@ static void virtio_net_receive(void *opaque, const uint8_t *buf, int size)
static void virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq)
{
VirtQueueElement elem;
+ int has_offload = tap_has_offload(n->vc->vlan->first_client);
if (!(n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK))
return;
while (virtqueue_pop(vq, &elem)) {
ssize_t len = 0;
+ unsigned int out_num = elem.out_num;
+ struct iovec *out_sg = &elem.out_sg[0];
+
+ if (out_num < 1 || out_sg->iov_len != sizeof(struct virtio_net_hdr)) {
+ fprintf(stderr, "virtio-net header not in first element\n");
+ exit(1);
+ }
- if (elem.out_num < 1 ||
- elem.out_sg[0].iov_len != sizeof(struct virtio_net_hdr)) {
- fprintf(stderr, "virtio-net header not in first element\n");
- exit(1);
+ /* ignore the header if GSO is not supported */
+ if (!has_offload) {
+ out_num--;
+ out_sg++;
+ len += sizeof(struct virtio_net_hdr);
}
- /* ignore the header for now */
- len = qemu_sendv_packet(n->vc, &elem.out_sg[1], elem.out_num - 1);
+ len += qemu_sendv_packet(n->vc, out_sg, out_num);
- virtqueue_push(vq, &elem, sizeof(struct virtio_net_hdr) + len);
+ virtqueue_push(vq, &elem, len);
virtio_notify(&n->vdev, vq);
}
}
@@ -249,6 +304,7 @@ PCIDevice *virtio_net_init(PCIBus *bus, NICInfo *nd, int devfn)
n->vdev.update_config = virtio_net_update_config;
n->vdev.get_features = virtio_net_get_features;
+ n->vdev.set_features = virtio_net_set_features;
n->rx_vq = virtio_add_queue(&n->vdev, 128, virtio_net_handle_rx);
n->tx_vq = virtio_add_queue(&n->vdev, 128, virtio_net_handle_tx);
memcpy(n->mac, nd->macaddr, 6);
diff --git a/qemu/net.h b/qemu/net.h
index e8ee325..6cfd8ce 100644
--- a/qemu/net.h
+++ b/qemu/net.h
@@ -9,12 +9,15 @@ typedef ssize_t (IOReadvHandler)(void *, const struct iovec *, int);
typedef struct VLANClientState VLANClientState;
+typedef void (SetOffload)(VLANClientState *, int, int, int, int);
+
struct VLANClientState {
IOReadHandler *fd_read;
IOReadvHandler *fd_readv;
/* Packets may still be sent if this returns zero. It's used to
rate-limit the slirp code. */
IOCanRWHandler *fd_can_read;
+ SetOffload *set_offload;
void *opaque;
struct VLANClientState *next;
struct VLANState *vlan;
@@ -42,6 +45,8 @@ void qemu_handler_true(void *opaque);
void do_info_network(void);
+int tap_has_offload(void *opaque);
+
int net_client_init(const char *str);
void net_client_uninit(NICInfo *nd);
diff --git a/qemu/vl.c b/qemu/vl.c
index b7d3397..efdaafd 100644
--- a/qemu/vl.c
+++ b/qemu/vl.c
@@ -4186,12 +4186,24 @@ void do_info_slirp(void)
#if !defined(_WIN32)
+#ifndef IFF_VNET_HDR
+#define TAP_BUFSIZE 4096
+#else
+#include <linux/virtio_net.h>
+#define ETH_HLEN 14
+#define ETH_DATA_LEN 1500
+#define MAX_PACKET_LEN (ETH_HLEN + ETH_DATA_LEN)
+#define MAX_SKB_FRAGS ((65536/TARGET_PAGE_SIZE) + 2)
+#define TAP_BUFSIZE (sizeof(struct virtio_net_hdr) + MAX_PACKET_LEN + (MAX_SKB_FRAGS*TARGET_PAGE_SIZE))
+#endif
+
typedef struct TAPState {
VLANClientState *vc;
int fd;
char down_script[1024];
- char buf[4096];
+ char buf[TAP_BUFSIZE];
int size;
+ int offload;
} TAPState;
static void tap_receive(void *opaque, const uint8_t *buf, int size)
@@ -4286,6 +4298,37 @@ static void tap_send(void *opaque)
} while (s->size > 0);
}
+int tap_has_offload(void *opaque)
+{
+ VLANClientState *vc = opaque;
+ TAPState *ts = vc->opaque;
+
+ return ts ? ts->offload : 0;
+}
+
+#ifdef TUNSETOFFLOAD
+static void tap_set_offload(VLANClientState *vc, int csum, int tso4, int tso6,
+ int ecn)
+{
+ TAPState *s = vc->opaque;
+ unsigned int offload = 0;
+
+ if (csum) {
+ offload |= TUN_F_CSUM;
+ if (tso4)
+ offload |= TUN_F_TSO4;
+ if (tso6)
+ offload |= TUN_F_TSO6;
+ if ((tso4 || tso6) && ecn)
+ offload |= TUN_F_TSO_ECN;
+ }
+
+ if (ioctl(s->fd, TUNSETOFFLOAD, offload) != 0)
+ fprintf(stderr, "TUNSETOFFLOAD ioctl() failed: %s\n",
+ strerror(errno));
+}
+#endif /* TUNSETOFFLOAD */
+
/* fd support */
static TAPState *net_tap_fd_init(VLANState *vlan, int fd)
@@ -4298,13 +4341,16 @@ static TAPState *net_tap_fd_init(VLANState *vlan, int fd)
s->fd = fd;
s->vc = qemu_new_vlan_client(vlan, tap_receive, NULL, s);
s->vc->fd_readv = tap_readv;
+#ifdef TUNSETOFFLOAD
+ s->vc->set_offload = tap_set_offload;
+#endif
qemu_set_fd_handler2(s->fd, tap_can_send, tap_send, NULL, s);
snprintf(s->vc->info_str, sizeof(s->vc->info_str), "tap: fd=%d", fd);
return s;
}
#if defined (_BSD) || defined (__FreeBSD_kernel__)
-static int tap_open(char *ifname, int ifname_size)
+static int tap_open(char *ifname, int ifname_size, int *offload)
{
int fd;
char *dev;
@@ -4446,7 +4492,7 @@ int tap_alloc(char *dev)
return tap_fd;
}
-static int tap_open(char *ifname, int ifname_size)
+static int tap_open(char *ifname, int ifname_size, int *offload)
{
char dev[10]="";
int fd;
@@ -4459,18 +4505,31 @@ static int tap_open(char *ifname, int ifname_size)
return fd;
}
#else
-static int tap_open(char *ifname, int ifname_size)
+static int tap_open(char *ifname, int ifname_size, int *offload)
{
struct ifreq ifr;
int fd, ret;
+ unsigned int features;
TFR(fd = open("/dev/net/tun", O_RDWR));
if (fd < 0) {
fprintf(stderr, "warning: could not open /dev/net/tun: no virtual network emulation\n");
return -1;
}
+
+ if (ioctl(fd, TUNGETFEATURES, &features))
+ features = IFF_TUN | IFF_TAP | IFF_NO_PI | IFF_ONE_QUEUE;
+
memset(&ifr, 0, sizeof(ifr));
ifr.ifr_flags = IFF_TAP | IFF_NO_PI;
+
+#ifdef IFF_VNET_HDR
+ if (features & IFF_VNET_HDR) {
+ *offload = 1;
+ ifr.ifr_flags |= IFF_VNET_HDR;
+ }
+#endif
+
if (ifname[0] != '\0')
pstrcpy(ifr.ifr_name, IFNAMSIZ, ifname);
else
@@ -4528,13 +4587,15 @@ static int net_tap_init(VLANState *vlan, const char *ifname1,
{
TAPState *s;
int fd;
+ int offload;
char ifname[128];
if (ifname1 != NULL)
pstrcpy(ifname, sizeof(ifname), ifname1);
else
ifname[0] = '\0';
- TFR(fd = tap_open(ifname, sizeof(ifname)));
+ offload = 0;
+ TFR(fd = tap_open(ifname, sizeof(ifname), &offload));
if (fd < 0)
return -1;
@@ -4547,6 +4608,8 @@ static int net_tap_init(VLANState *vlan, const char *ifname1,
s = net_tap_fd_init(vlan, fd);
if (!s)
return -1;
+
+ s->offload = offload;
snprintf(s->vc->info_str, sizeof(s->vc->info_str),
"tap: ifname=%s setup_script=%s", ifname, setup_script);
if (down_script && strcmp(down_script, "no"))
--
1.5.4.1
* [PATCH 7/9] kvm: qemu: Increase size of virtio_net rings
2008-07-24 11:46 ` [PATCH 6/9] kvm: qemu: Add support for partial csums and GSO Mark McLoughlin
@ 2008-07-24 11:46 ` Mark McLoughlin
2008-07-24 11:46 ` [PATCH 8/9] kvm: qemu: Drop the mutex while reading from tapfd Mark McLoughlin
0 siblings, 1 reply; 36+ messages in thread
From: Mark McLoughlin @ 2008-07-24 11:46 UTC (permalink / raw)
To: kvm; +Cc: Herbert Xu, Rusty Russell, Mark McLoughlin
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
---
qemu/hw/virtio-net.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c
index 81282c4..a681a7e 100644
--- a/qemu/hw/virtio-net.c
+++ b/qemu/hw/virtio-net.c
@@ -305,8 +305,8 @@ PCIDevice *virtio_net_init(PCIBus *bus, NICInfo *nd, int devfn)
n->vdev.update_config = virtio_net_update_config;
n->vdev.get_features = virtio_net_get_features;
n->vdev.set_features = virtio_net_set_features;
- n->rx_vq = virtio_add_queue(&n->vdev, 128, virtio_net_handle_rx);
- n->tx_vq = virtio_add_queue(&n->vdev, 128, virtio_net_handle_tx);
+ n->rx_vq = virtio_add_queue(&n->vdev, 512, virtio_net_handle_rx);
+ n->tx_vq = virtio_add_queue(&n->vdev, 512, virtio_net_handle_tx);
memcpy(n->mac, nd->macaddr, 6);
n->vc = qemu_new_vlan_client(nd->vlan, virtio_net_receive,
virtio_net_can_receive, n);
--
1.5.4.1
* [PATCH 8/9] kvm: qemu: Drop the mutex while reading from tapfd
2008-07-24 11:46 ` [PATCH 7/9] kvm: qemu: Increase size of virtio_net rings Mark McLoughlin
@ 2008-07-24 11:46 ` Mark McLoughlin
2008-07-24 11:46 ` [PATCH 9/9] kvm: qemu: Eliminate extra virtio_net copy Mark McLoughlin
2008-07-24 23:33 ` [PATCH 8/9] kvm: qemu: Drop the mutex while reading from tapfd Dor Laor
0 siblings, 2 replies; 36+ messages in thread
From: Mark McLoughlin @ 2008-07-24 11:46 UTC (permalink / raw)
To: kvm; +Cc: Herbert Xu, Rusty Russell, Mark McLoughlin
The idea here is that with GSO, packets are much larger
and we can allow the vcpu threads to e.g. process irq
acks during the window where we're reading these
packets from the tapfd.
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
---
qemu/vl.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)
diff --git a/qemu/vl.c b/qemu/vl.c
index efdaafd..de92848 100644
--- a/qemu/vl.c
+++ b/qemu/vl.c
@@ -4281,7 +4281,9 @@ static void tap_send(void *opaque)
sbuf.buf = s->buf;
s->size = getmsg(s->fd, NULL, &sbuf, &f) >=0 ? sbuf.len : -1;
#else
+ kvm_sleep_begin();
s->size = read(s->fd, s->buf, sizeof(s->buf));
+ kvm_sleep_end();
#endif
if (s->size == -1 && errno == EINTR)
--
1.5.4.1
* [PATCH 9/9] kvm: qemu: Eliminate extra virtio_net copy
2008-07-24 11:46 ` [PATCH 8/9] kvm: qemu: Drop the mutex while reading from tapfd Mark McLoughlin
@ 2008-07-24 11:46 ` Mark McLoughlin
2008-07-24 23:33 ` [PATCH 8/9] kvm: qemu: Drop the mutex while reading from tapfd Dor Laor
1 sibling, 0 replies; 36+ messages in thread
From: Mark McLoughlin @ 2008-07-24 11:46 UTC (permalink / raw)
To: kvm; +Cc: Herbert Xu, Rusty Russell, Mark McLoughlin
This is Anthony's net-tap-zero-copy.patch, which eliminates
a copy on the host->guest data path with virtio_net.
---
qemu/hw/virtio-net.c | 76 ++++++++++++++++++++++++++++++++++++-------------
qemu/net.h | 3 ++
qemu/vl.c | 50 +++++++++++++++++++++++++++++++++
3 files changed, 109 insertions(+), 20 deletions(-)
diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c
index a681a7e..5e71afe 100644
--- a/qemu/hw/virtio-net.c
+++ b/qemu/hw/virtio-net.c
@@ -70,6 +70,8 @@ typedef struct VirtIONet
VLANClientState *vc;
QEMUTimer *tx_timer;
int tx_timer_active;
+ int last_elem_valid;
+ VirtQueueElement last_elem;
} VirtIONet;
/* TODO
@@ -153,47 +155,80 @@ static int virtio_net_can_receive(void *opaque)
return 1;
}
-static void virtio_net_receive(void *opaque, const uint8_t *buf, int size)
+static void virtio_net_receive_zc(void *opaque, IOZeroCopyHandler *zc, void *data)
{
VirtIONet *n = opaque;
- VirtQueueElement elem;
+ VirtQueueElement *elem = &n->last_elem;
struct virtio_net_hdr *hdr;
- int offset, i;
- int total;
+ ssize_t err;
+ int idx;
- if (virtqueue_pop(n->rx_vq, &elem) == 0)
+ if (!n->last_elem_valid && virtqueue_pop(n->rx_vq, elem) == 0)
return;
- if (elem.in_num < 1 || elem.in_sg[0].iov_len != sizeof(*hdr)) {
+ if (elem->in_num < 1 || elem->in_sg[0].iov_len != sizeof(*hdr)) {
fprintf(stderr, "virtio-net header not in first element\n");
exit(1);
}
- hdr = (void *)elem.in_sg[0].iov_base;
+ n->last_elem_valid = 1;
+
+ hdr = (void *)elem->in_sg[0].iov_base;
hdr->flags = 0;
hdr->gso_type = VIRTIO_NET_HDR_GSO_NONE;
- offset = 0;
- total = sizeof(*hdr);
+ idx = tap_has_offload(n->vc->vlan->first_client) ? 0 : 1;
+
+ do {
+ err = zc(data, &elem->in_sg[idx], elem->in_num - idx);
+ } while (err == -1 && errno == EINTR);
+
+ if (err == -1 && errno == EAGAIN)
+ return;
- if (tap_has_offload(n->vc->vlan->first_client)) {
- memcpy(hdr, buf, sizeof(*hdr));
- offset += total;
+ if (err < 0) {
+ fprintf(stderr, "virtio_net: error during IO\n");
+ return;
}
+ /* signal other side */
+ n->last_elem_valid = 0;
+ virtqueue_push(n->rx_vq, elem, sizeof(*hdr) + err);
+ virtio_notify(&n->vdev, n->rx_vq);
+}
+
+struct compat_data
+{
+ const uint8_t *buf;
+ int size;
+};
+
+static ssize_t compat_copy(void *opaque, struct iovec *iov, int iovcnt)
+{
+ struct compat_data *compat = opaque;
+ int offset, i;
+
/* copy in packet. ugh */
- i = 1;
- while (offset < size && i < elem.in_num) {
- int len = MIN(elem.in_sg[i].iov_len, size - offset);
- memcpy(elem.in_sg[i].iov_base, buf + offset, len);
+ offset = 0;
+ i = 0;
+ while (offset < compat->size && i < iovcnt) {
+ int len = MIN(iov[i].iov_len, compat->size - offset);
+ memcpy(iov[i].iov_base, compat->buf + offset, len);
offset += len;
- total += len;
i++;
}
- /* signal other side */
- virtqueue_push(n->rx_vq, &elem, total);
- virtio_notify(&n->vdev, n->rx_vq);
+ return offset;
+}
+
+static void virtio_net_receive(void *opaque, const uint8_t *buf, int size)
+{
+ struct compat_data compat;
+
+ compat.buf = buf;
+ compat.size = size;
+
+ virtio_net_receive_zc(opaque, compat_copy, &compat);
}
/* TX */
@@ -310,6 +345,7 @@ PCIDevice *virtio_net_init(PCIBus *bus, NICInfo *nd, int devfn)
memcpy(n->mac, nd->macaddr, 6);
n->vc = qemu_new_vlan_client(nd->vlan, virtio_net_receive,
virtio_net_can_receive, n);
+ n->vc->fd_read_zc = virtio_net_receive_zc;
n->tx_timer = qemu_new_timer(vm_clock, virtio_net_tx_timer, n);
n->tx_timer_active = 0;
diff --git a/qemu/net.h b/qemu/net.h
index 6cfd8ce..aca50e9 100644
--- a/qemu/net.h
+++ b/qemu/net.h
@@ -6,6 +6,8 @@
/* VLANs support */
typedef ssize_t (IOReadvHandler)(void *, const struct iovec *, int);
+typedef ssize_t (IOZeroCopyHandler)(void *, struct iovec *, int);
+typedef void (IOReadZCHandler)(void *, IOZeroCopyHandler *, void *);
typedef struct VLANClientState VLANClientState;
@@ -14,6 +16,7 @@ typedef void (SetOffload)(VLANClientState *, int, int, int, int);
struct VLANClientState {
IOReadHandler *fd_read;
IOReadvHandler *fd_readv;
+ IOReadZCHandler *fd_read_zc;
/* Packets may still be sent if this returns zero. It's used to
rate-limit the slirp code. */
IOCanRWHandler *fd_can_read;
diff --git a/qemu/vl.c b/qemu/vl.c
index de92848..bc5b151 100644
--- a/qemu/vl.c
+++ b/qemu/vl.c
@@ -4204,6 +4204,7 @@ typedef struct TAPState {
char buf[TAP_BUFSIZE];
int size;
int offload;
+ int received_eagain;
} TAPState;
static void tap_receive(void *opaque, const uint8_t *buf, int size)
@@ -4232,6 +4233,48 @@ static ssize_t tap_readv(void *opaque, const struct iovec *iov,
return len;
}
+static VLANClientState *tap_can_zero_copy(TAPState *s)
+{
+ VLANClientState *vc, *vc1 = NULL;
+ int vc_count = 0;
+
+ for (vc = s->vc->vlan->first_client; vc; vc = vc->next) {
+ if (vc == s->vc)
+ continue;
+
+ if (!vc->fd_read_zc || vc_count)
+ return NULL;
+
+ vc_count++;
+ vc1 = vc;
+ }
+
+ return vc1;
+}
+
+static ssize_t tap_sendv(void *opaque, struct iovec *iov, int iovcnt)
+{
+ TAPState *s = opaque;
+ ssize_t ret;
+
+ kvm_sleep_begin();
+ ret = readv(s->fd, iov, iovcnt);
+ kvm_sleep_end();
+ if (ret == -1 && errno == EAGAIN)
+ s->received_eagain = 1;
+
+ return ret;
+}
+
+static void tap_send_zero_copy(TAPState *s, VLANClientState *vc)
+{
+ s->received_eagain = 0;
+ while (s->received_eagain == 0 &&
+ (!vc->fd_can_read || vc->fd_can_read(vc->opaque))) {
+ vc->fd_read_zc(vc->opaque, tap_sendv, s);
+ }
+}
+
static int tap_can_send(void *opaque)
{
TAPState *s = opaque;
@@ -4261,6 +4304,13 @@ static int tap_can_send(void *opaque)
static void tap_send(void *opaque)
{
TAPState *s = opaque;
+ VLANClientState *zc;
+
+ zc = tap_can_zero_copy(s);
+ if (zc) {
+ tap_send_zero_copy(s, zc);
+ return;
+ }
/* First try to send any buffered packet */
if (s->size > 0) {
--
1.5.4.1
^ permalink raw reply related [flat|nested] 36+ messages in thread
* Re: [PATCH 0/9][RFC] KVM virtio_net performance
2008-07-24 11:46 [PATCH 0/9][RFC] KVM virtio_net performance Mark McLoughlin
2008-07-24 11:46 ` [PATCH 1/9] kvm: qemu: Set MIN_TIMER_REARM_US to 150us Mark McLoughlin
@ 2008-07-24 11:55 ` Herbert Xu
2008-07-24 16:53 ` Mark McLoughlin
` (2 subsequent siblings)
4 siblings, 0 replies; 36+ messages in thread
From: Herbert Xu @ 2008-07-24 11:55 UTC (permalink / raw)
To: Mark McLoughlin; +Cc: kvm, Rusty Russell
On Thu, Jul 24, 2008 at 12:46:10PM +0100, Mark McLoughlin wrote:
> Here's a bunch of patches attempting to improve the performance
> of virtio_net. This is more an RFC rather than a patch submission
> since, as can be seen below, not all patches actually improve the
> performance measurably.
Nice patches, Mark.
> 2) Using the recently merged GSO support in the tun/tap driver gives
> a huge boost, but much more so on the host->guest side.
IIRC the reason guest=>host isn't so good is because the tuntap
GSO interface isn't the best when it comes to getting data from
user-space.
Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: [PATCH 0/9][RFC] KVM virtio_net performance
2008-07-24 11:46 [PATCH 0/9][RFC] KVM virtio_net performance Mark McLoughlin
2008-07-24 11:46 ` [PATCH 1/9] kvm: qemu: Set MIN_TIMER_REARM_US to 150us Mark McLoughlin
2008-07-24 11:55 ` [PATCH 0/9][RFC] KVM virtio_net performance Herbert Xu
@ 2008-07-24 16:53 ` Mark McLoughlin
2008-07-24 18:29 ` Anthony Liguori
2008-07-24 20:56 ` Anthony Liguori
2008-07-26 9:45 ` Avi Kivity
4 siblings, 1 reply; 36+ messages in thread
From: Mark McLoughlin @ 2008-07-24 16:53 UTC (permalink / raw)
To: kvm; +Cc: Herbert Xu, Rusty Russell
Hey,
One all-important thing I forgot to include was a comparison with
lguest :-)
netperf, 10x20s runs (Gb/s) | guest->host | host->guest
-----------------------------+----------------------------+---------------------------
KVM | 4.230/ 4.619/ 4.780/ 0.155 | 8.140/ 8.578/ 8.770/ 0.162
lguest | 5.700/ 5.926/ 6.150/ 0.132 | 8.680/ 9.073/ 9.320/ 0.205
ping -f -c 100000 (ms) | guest->host | host->guest
-----------------------------+----------------------------+---------------------------
KVM | 0.199/ 0.326/ 7.698/ 0.744 | 0.199/ 0.245/ 0.402/ 0.022
lguest | 0.022/ 0.055/ 0.467/ 0.019 | 0.019/ 0.046/89.249/ 0.448
So, puppies gets you an extra 1.3Gb/s guest->host, .5Gb/s host->guest
and much better latency.
Actually, I guess the main reason for the latency difference is that
when lguest gets notified on the tx ring, it immediately sends whatever
is available and then starts a timer. KVM doesn't send anything until
its tx timer fires or the ring is full.
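That lguest-style policy ("flush on the first kick, then batch behind a
timer") can be sketched as a small state machine; the struct and function
names below are illustrative, not qemu's actual virtio-net code:

```c
#include <stdbool.h>

/* Hypothetical per-device tx mitigation state (not qemu's structures). */
struct tx_state {
    bool timer_active;
    int  queued;
};

/* Guest kicked the tx ring: lguest flushes immediately on the first
 * kick and arms the mitigation timer; later kicks just batch. */
bool handle_tx_kick(struct tx_state *s)
{
    if (!s->timer_active) {
        s->timer_active = true;  /* arm mitigation timer */
        return true;             /* flush the ring right now */
    }
    s->queued++;                 /* batch until the timer fires */
    return false;
}

/* Mitigation timer expired: flush whatever accumulated. */
int timer_expired(struct tx_state *s)
{
    int n = s->queued;
    s->queued = 0;
    s->timer_active = false;
    return n;
}
```

KVM's current behaviour corresponds to never taking the `return true`
branch: nothing is sent until the timer fires or the ring fills.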
Cheers,
Mark.
* Re: [PATCH 0/9][RFC] KVM virtio_net performance
2008-07-24 16:53 ` Mark McLoughlin
@ 2008-07-24 18:29 ` Anthony Liguori
2008-07-25 16:36 ` Mark McLoughlin
0 siblings, 1 reply; 36+ messages in thread
From: Anthony Liguori @ 2008-07-24 18:29 UTC (permalink / raw)
To: Mark McLoughlin; +Cc: kvm, Herbert Xu, Rusty Russell
Mark McLoughlin wrote:
> Hey,
> One all-important thing I forgot to include was a comparison with
> lguest :-)
>
Hey Mark,
This patch set is really great! I guess the hard part now is deciding
what all we want to apply. Do you have a suggestion of which patches
you think are worth applying?
BTW, do you have native and guest loopback numbers to compare where we
stand with native?
> netperf, 10x20s runs (Gb/s) | guest->host | host->guest
> -----------------------------+----------------------------+---------------------------
> KVM | 4.230/ 4.619/ 4.780/ 0.155 | 8.140/ 8.578/ 8.770/ 0.162
> lguest | 5.700/ 5.926/ 6.150/ 0.132 | 8.680/ 9.073/ 9.320/ 0.205
>
> ping -f -c 100000 (ms) | guest->host | host->guest
> -----------------------------+----------------------------+---------------------------
> KVM | 0.199/ 0.326/ 7.698/ 0.744 | 0.199/ 0.245/ 0.402/ 0.022
> lguest | 0.022/ 0.055/ 0.467/ 0.019 | 0.019/ 0.046/89.249/ 0.448
>
>
> So, puppies gets you an extra 1.3Gb/s guest->host, .5Gb/s host->guest
> and much better latency.
>
I'm surprised lguest gets an extra 1.3Gb/s guest->host. Any idea of where
we're losing it?
> Actually, I guess the main reason for the latency difference is that
> when lguest gets notified on the tx ring, it immediately sends whatever
> is available and then starts a timer. KVM doesn't send anything until
> its tx timer fires or the ring is full.
>
Yes, we should definitely do that. It will make ping appear to be a lot
faster than it really is :-)
Regards,
Anthony Liguori
> Cheers,
> Mark.
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
* Re: [PATCH 0/9][RFC] KVM virtio_net performance
2008-07-24 11:46 [PATCH 0/9][RFC] KVM virtio_net performance Mark McLoughlin
` (2 preceding siblings ...)
2008-07-24 16:53 ` Mark McLoughlin
@ 2008-07-24 20:56 ` Anthony Liguori
2008-07-25 17:17 ` Mark McLoughlin
2008-07-26 19:09 ` Bill Davidsen
2008-07-26 9:45 ` Avi Kivity
4 siblings, 2 replies; 36+ messages in thread
From: Anthony Liguori @ 2008-07-24 20:56 UTC (permalink / raw)
To: Mark McLoughlin; +Cc: kvm, Herbert Xu, Rusty Russell
Hi Mark,
Mark McLoughlin wrote:
> Hey,
> Here's a bunch of patches attempting to improve the performance
> of virtio_net. This is more an RFC rather than a patch submission
> since, as can be seen below, not all patches actually improve the
> performance measurably.
>
I'm still seeing the same problem I saw with my patch series. Namely,
dhclient fails to get a DHCP address. Rusty noticed that RX has a lot
more packets received than it should, so we're suspicious that we're
getting packet corruption.
Configuring the tap device with a static address, here's what I get with
iperf:
w/o patches:
guest->host: 625 Mbits/sec
host->guest: 825 Mbits/sec
w/patches
guest->host: 2.02 Gbits/sec
host->guest: 1.89 Gbits/sec
guest lo: 4.35 Gbits/sec
host lo: 4.36 Gbits/sec
This is with KVM GUEST configured FWIW.
Regards,
Anthony Liguori
> I've tried hard to test each of these patches with as stable and
> informative a benchmark as I could find. The first benchmark is a
> netperf[1] based throughput benchmark and the second uses a flood
> ping[2] to measure latency differences.
>
> Each set of figures is min/average/max/standard deviation. The
> first set is Gb/s and the second is milliseconds.
>
> The network configuration used was very simple - the guest with
> a virtio_net interface and the host with a tap interface and static
> IP addresses assigned to both - e.g. there was no bridge in the host
> involved and iptables was disabled in both the host and guest.
>
> I used:
>
> 1) kvm-71-26-g6152996 with the patches that follow
>
> 2) Linus's v2.6.26-5752-g93ded9b with Rusty's virtio patches from
> 219:bbd2611289c5 applied; these are the patches that have just been
> submitted to Linus
>
> The conclusions I draw are:
>
> 1) The length of the tx mitigation timer makes quite a difference to
> throughput achieved; we probably need a good heuristic for
> adjusting this on the fly.
>
> 2) Using the recently merged GSO support in the tun/tap driver gives
> a huge boost, but much more so on the host->guest side.
>
> 3) Adjusting the virtio_net ring sizes makes a small difference, but
> not as much as one might expect
>
> 4) Dropping the global mutex while reading GSO packets from the tap
> interface gives a nice speedup. This highlights the global mutex
> as a general performance issue.
>
> 5) Eliminating an extra copy on the host->guest path only makes a
> barely measurable difference.
>
> Anyway, the figures:
>
> netperf, 10x20s runs (Gb/s) | guest->host | host->guest
> -----------------------------+----------------------------+---------------------------
> baseline | 1.520/ 1.573/ 1.610/ 0.034 | 1.160/ 1.357/ 1.630/ 0.165
> 50us tx timer + rearm | 1.050/ 1.086/ 1.110/ 0.017 | 1.710/ 1.832/ 1.960/ 0.092
> 250us tx timer + rearm | 1.700/ 1.764/ 1.880/ 0.064 | 0.900/ 1.203/ 1.580/ 0.205
> 150us tx timer + rearm | 1.520/ 1.602/ 1.690/ 0.044 | 1.670/ 1.928/ 2.150/ 0.141
> no ring-full heuristic | 1.480/ 1.569/ 1.710/ 0.066 | 1.610/ 1.857/ 2.140/ 0.153
> VIRTIO_F_NOTIFY_ON_EMPTY | 1.470/ 1.554/ 1.650/ 0.054 | 1.770/ 1.960/ 2.170/ 0.119
> recv NO_NOTIFY | 1.530/ 1.604/ 1.680/ 0.047 | 1.780/ 1.944/ 2.190/ 0.129
> GSO | 4.120/ 4.323/ 4.420/ 0.099 | 6.540/ 7.033/ 7.340/ 0.244
> ring size == 256 | 4.050/ 4.406/ 4.560/ 0.143 | 6.280/ 7.236/ 8.280/ 0.613
> ring size == 512 | 4.420/ 4.600/ 4.960/ 0.140 | 6.470/ 7.205/ 7.510/ 0.314
> drop mutex during tapfd read | 4.320/ 4.578/ 4.790/ 0.161 | 8.370/ 8.589/ 8.730/ 0.120
> aligouri zero-copy | 4.510/ 4.694/ 4.960/ 0.148 | 8.430/ 8.614/ 8.840/ 0.142
>
> ping -f -c 100000 (ms) | guest->host | host->guest
> -----------------------------+----------------------------+---------------------------
> baseline | 0.060/ 0.459/ 7.602/ 0.846 | 0.067/ 0.331/ 2.517/ 0.057
> 50us tx timer + rearm | 0.081/ 0.143/ 7.436/ 0.374 | 0.093/ 0.133/ 1.883/ 0.026
> 250us tx timer + rearm | 0.302/ 0.463/ 7.580/ 0.849 | 0.297/ 0.344/ 2.128/ 0.028
> 150us tx timer + rearm | 0.197/ 0.323/ 7.671/ 0.740 | 0.199/ 0.245/ 7.836/ 0.037
> no ring-full heuristic | 0.182/ 0.324/ 7.688/ 0.753 | 0.199/ 0.243/ 2.197/ 0.030
> VIRTIO_F_NOTIFY_ON_EMPTY | 0.197/ 0.321/ 7.447/ 0.730 | 0.196/ 0.242/ 2.218/ 0.032
> recv NO_NOTIFY | 0.186/ 0.321/ 7.520/ 0.732 | 0.200/ 0.233/ 2.216/ 0.028
> GSO | 0.178/ 0.324/ 7.667/ 0.736 | 0.147/ 0.246/ 1.361/ 0.024
> ring size == 256 | 0.184/ 0.323/ 7.674/ 0.728 | 0.199/ 0.243/ 2.181/ 0.028
> ring size == 512 | (not measured) | (not measured)
> drop mutex during tapfd read | 0.183/ 0.323/ 7.820/ 0.733 | 0.202/ 0.242/ 2.219/ 0.027
> aligouri zero-copy | 0.185/ 0.325/ 7.863/ 0.736 | 0.202/ 0.245/ 7.844/ 0.036
>
> Cheers,
> Mark.
>
> [1] - I used netperf trunk from:
>
> http://www.netperf.org/svn/netperf2/trunk
>
> and simply ran:
>
> $> i=0; while [ $i -lt 10 ]; do ./netperf -H <host> -f g -l 20 -P 0 | netperf-collect.py; i=$((i+1)); done
>
> where netperf-collect.py is just a script to calculate the
> average across the runs:
>
> http://markmc.fedorapeople.org/netperf-collect.py
>
> [2] - ping -c 100000 -f <host>
* Re: [PATCH 3/9] kvm: qemu: Remove virtio_net tx ring-full heuristic
2008-07-24 11:46 ` [PATCH 3/9] kvm: qemu: Remove virtio_net tx ring-full heuristic Mark McLoughlin
2008-07-24 11:46 ` [PATCH 4/9] kvm: qemu: Add VIRTIO_F_NOTIFY_ON_EMPTY Mark McLoughlin
@ 2008-07-24 23:22 ` Dor Laor
2008-07-25 0:30 ` Rusty Russell
2008-07-25 17:23 ` Mark McLoughlin
2008-07-24 23:56 ` Dor Laor
2 siblings, 2 replies; 36+ messages in thread
From: Dor Laor @ 2008-07-24 23:22 UTC (permalink / raw)
To: Mark McLoughlin; +Cc: kvm, Herbert Xu, Rusty Russell
Mark McLoughlin wrote:
> virtio_net tries to guess when it has received a tx
> notification from the guest whether it indicates that the
> guest has no more room in the tx ring and it should
> immediately flush the queued buffers.
>
> The heuristic is based on the fact that there are 128
> buffer entries in the ring and each packet uses 2 buffers
> (i.e. the virtio_net_hdr and the packet's linear data).
>
> Using GSO or increasing the size of the rings will break
> that heuristic, so let's remove it and assume that any
> notification from the guest after we've disabled
> notifications indicates that we should flush our buffers.
>
> Signed-off-by: Mark McLoughlin <markmc@redhat.com>
> ---
> qemu/hw/virtio-net.c | 3 +--
> 1 files changed, 1 insertions(+), 2 deletions(-)
>
> diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c
> index 31867f1..4adfa42 100644
> --- a/qemu/hw/virtio-net.c
> +++ b/qemu/hw/virtio-net.c
> @@ -175,8 +175,7 @@ static void virtio_net_handle_tx(VirtIODevice *vdev, VirtQueue *vq)
> {
> VirtIONet *n = to_virtio_net(vdev);
>
> - if (n->tx_timer_active &&
> - (vq->vring.avail->idx - vq->last_avail_idx) == 64) {
> + if (n->tx_timer_active) {
>
Actually it was worthless anyway: once we set the timer, the flag
below would have been cleared, causing the guest not to notify the
host, so the == 64 test was never hit.
So I'm in favour too.
> vq->vring.used->flags &= ~VRING_USED_F_NO_NOTIFY;
> qemu_del_timer(n->tx_timer);
> n->tx_timer_active = 0;
>
As stated by newer messages, we should handle the first tx notification
if the timer wasn't active to shorten latency.
Cheers, Dor
* Re: [PATCH 8/9] kvm: qemu: Drop the mutex while reading from tapfd
2008-07-24 11:46 ` [PATCH 8/9] kvm: qemu: Drop the mutex while reading from tapfd Mark McLoughlin
2008-07-24 11:46 ` [PATCH 9/9] kvm: qemu: Eliminate extra virtio_net copy Mark McLoughlin
@ 2008-07-24 23:33 ` Dor Laor
2008-07-25 17:25 ` Mark McLoughlin
1 sibling, 1 reply; 36+ messages in thread
From: Dor Laor @ 2008-07-24 23:33 UTC (permalink / raw)
To: Mark McLoughlin; +Cc: kvm, Herbert Xu, Rusty Russell
Mark McLoughlin wrote:
> The idea here is that with GSO, packets are much larger
> and we can allow the vcpu threads to e.g. process irq
> acks during the window where we're reading these
> packets from the tapfd.
>
> Signed-off-by: Mark McLoughlin <markmc@redhat.com>
> ---
> qemu/vl.c | 2 ++
> 1 files changed, 2 insertions(+), 0 deletions(-)
>
> diff --git a/qemu/vl.c b/qemu/vl.c
> index efdaafd..de92848 100644
> --- a/qemu/vl.c
> +++ b/qemu/vl.c
> @@ -4281,7 +4281,9 @@ static void tap_send(void *opaque)
> sbuf.buf = s->buf;
> s->size = getmsg(s->fd, NULL, &sbuf, &f) >=0 ? sbuf.len : -1;
> #else
>
Maybe do it only when GSO is actually used by the guest/tap.
Otherwise it can cause some context-switch thrashing, right?
> + kvm_sleep_begin();
> s->size = read(s->fd, s->buf, sizeof(s->buf));
> + kvm_sleep_end();
> #endif
>
> if (s->size == -1 && errno == EINTR)
>
* Re: [PATCH 3/9] kvm: qemu: Remove virtio_net tx ring-full heuristic
2008-07-24 11:46 ` [PATCH 3/9] kvm: qemu: Remove virtio_net tx ring-full heuristic Mark McLoughlin
2008-07-24 11:46 ` [PATCH 4/9] kvm: qemu: Add VIRTIO_F_NOTIFY_ON_EMPTY Mark McLoughlin
2008-07-24 23:22 ` [PATCH 3/9] kvm: qemu: Remove virtio_net tx ring-full heuristic Dor Laor
@ 2008-07-24 23:56 ` Dor Laor
2 siblings, 0 replies; 36+ messages in thread
From: Dor Laor @ 2008-07-24 23:56 UTC (permalink / raw)
To: Mark McLoughlin; +Cc: kvm, Herbert Xu, Rusty Russell
Mark McLoughlin wrote:
> virtio_net tries to guess when it has received a tx
> notification from the guest whether it indicates that the
> guest has no more room in the tx ring and it should
> immediately flush the queued buffers.
>
> The heuristic is based on the fact that there are 128
> buffer entries in the ring and each packet uses 2 buffers
> (i.e. the virtio_net_hdr and the packet's linear data).
>
> Using GSO or increasing the size of the rings will break
> that heuristic, so let's remove it and assume that any
> notification from the guest after we've disabled
> notifications indicates that we should flush our buffers.
>
> Signed-off-by: Mark McLoughlin <markmc@redhat.com>
> ---
> qemu/hw/virtio-net.c | 3 +--
> 1 files changed, 1 insertions(+), 2 deletions(-)
>
> diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c
> index 31867f1..4adfa42 100644
> --- a/qemu/hw/virtio-net.c
> +++ b/qemu/hw/virtio-net.c
> @@ -175,8 +175,7 @@ static void virtio_net_handle_tx(VirtIODevice *vdev, VirtQueue *vq)
> {
> VirtIONet *n = to_virtio_net(vdev);
>
> - if (n->tx_timer_active &&
> - (vq->vring.avail->idx - vq->last_avail_idx) == 64) {
> + if (n->tx_timer_active) {
> vq->vring.used->flags &= ~VRING_USED_F_NO_NOTIFY;
> qemu_del_timer(n->tx_timer);
> n->tx_timer_active = 0;
>
Actually we can improve latency a bit more by using this timer only in
the high-throughput scenario. For example, if no (or few) packets
accumulated during the previous timer period, we can clear the flag and
not arm a new timer. That way we get notified immediately, without the
timer latency. When lots of packets are being transmitted, we go back to
batch mode again.
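A minimal sketch of that two-mode policy; `BATCH_THRESHOLD` is an
invented tuning knob, not a value from this thread:

```c
#include <stdbool.h>

#define BATCH_THRESHOLD 4  /* hypothetical: min packets/period to keep batching */

/* Called when the mitigation timer fires: decide whether to rearm the
 * timer and keep NO_NOTIFY set (batch mode), or to drop back to
 * immediate notification because the last period was quiet. */
bool stay_in_batch_mode(int pkts_last_period)
{
    return pkts_last_period >= BATCH_THRESHOLD;
}
```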
Cheers, Dor
* Re: [PATCH 3/9] kvm: qemu: Remove virtio_net tx ring-full heuristic
2008-07-24 23:22 ` [PATCH 3/9] kvm: qemu: Remove virtio_net tx ring-full heuristic Dor Laor
@ 2008-07-25 0:30 ` Rusty Russell
2008-07-25 17:30 ` Mark McLoughlin
2008-07-25 17:23 ` Mark McLoughlin
1 sibling, 1 reply; 36+ messages in thread
From: Rusty Russell @ 2008-07-25 0:30 UTC (permalink / raw)
To: Dor Laor; +Cc: Mark McLoughlin, kvm, Herbert Xu
On Friday 25 July 2008 09:22:53 Dor Laor wrote:
> Mark McLoughlin wrote:
> > vq->vring.used->flags &= ~VRING_USED_F_NO_NOTIFY;
> > qemu_del_timer(n->tx_timer);
> > n->tx_timer_active = 0;
>
> As stated by newer messages, we should handle the first tx notification
> if the timer wasn't active to shorten latency.
> Cheers, Dor
Here's what lguest does at the moment. Basically, we cut the timeout a tiny
bit each time, until we get *fewer* packets than last time. Then we bump it
up again.
Rough, but seems to work (it should be a per-device var of course, not a
static).
@@ -921,6 +922,7 @@ static void handle_net_output(int fd, st
unsigned int head, out, in, num = 0;
int len;
struct iovec iov[vq->vring.num];
+ static int last_timeout_num;
if (!timeout)
net_xmit_notify++;
@@ -941,6 +943,14 @@ static void handle_net_output(int fd, st
/* Block further kicks and set up a timer if we saw anything. */
if (!timeout && num)
block_vq(vq);
+
+ if (timeout) {
+ if (num < last_timeout_num)
+ timeout_usec += 10;
+ else if (timeout_usec > 1)
+ timeout_usec--;
+ last_timeout_num = num;
+ }
}
* Re: [PATCH 0/9][RFC] KVM virtio_net performance
2008-07-24 18:29 ` Anthony Liguori
@ 2008-07-25 16:36 ` Mark McLoughlin
0 siblings, 0 replies; 36+ messages in thread
From: Mark McLoughlin @ 2008-07-25 16:36 UTC (permalink / raw)
To: Anthony Liguori; +Cc: kvm, Herbert Xu, Rusty Russell
On Thu, 2008-07-24 at 13:29 -0500, Anthony Liguori wrote:
> Mark McLoughlin wrote:
> > Hey,
> > One all-important thing I forgot to include was a comparison with
> > lguest :-)
> >
>
> Hey Mark,
>
> This patch set is really great! I guess the hard part now is deciding
> what all we want to apply. Do you have a suggestion of which patches
> you think are worth applying?
Good question. I think they're all fairly sane as a starting point, I
guess.
Making the tx timer adaptive, dropping the mutex when writing to the tap
interface and flushing the tx queue immediately on notify are all things
worth adding too, but they could be an additional patch set.
Your zero-copy patch is quite a chunk for not much gain - how do you
feel about that? Could we get that in upstream qemu?
> BTW, do you have native and guest loopback numbers to compare where we
> stand with native?
Yep, sorry:
netperf, 10x20s runs (Gb/s) | guest->host | host->guest
-----------------------------+----------------------------+---------------------------
KVM | 4.230/ 4.619/ 4.780/ 0.155 | 8.140/ 8.578/ 8.770/ 0.162
lguest | 5.700/ 5.926/ 6.150/ 0.132 | 8.680/ 9.073/ 9.320/ 0.205
loopback | 9.520/12.473/15.390/ 1.947
ping -f -c 100000 (ms) | guest->host | host->guest
-----------------------------+----------------------------+---------------------------
KVM | 0.199/ 0.326/ 7.698/ 0.744 | 0.199/ 0.245/ 0.402/ 0.022
lguest | 0.022/ 0.055/ 0.467/ 0.019 | 0.019/ 0.046/89.249/ 0.448
loopback | 0.001/ 0.002/ 0.025/ 0.001
Very surprising how much jitter there is in the loopback throughput figures.
> I'm surprised lguest gets an extra 1.3Gb/s guest->host. Any idea of where
> we're losing it?
The main difference on the tx path would appear to be the adaptive
timeout and the immediate flushing of the queue on notify.
Cheers,
Mark.
* Re: [PATCH 0/9][RFC] KVM virtio_net performance
2008-07-24 20:56 ` Anthony Liguori
@ 2008-07-25 17:17 ` Mark McLoughlin
2008-07-25 21:29 ` Dor Laor
2008-07-26 19:09 ` Bill Davidsen
1 sibling, 1 reply; 36+ messages in thread
From: Mark McLoughlin @ 2008-07-25 17:17 UTC (permalink / raw)
To: Anthony Liguori; +Cc: kvm, Herbert Xu, Rusty Russell
On Thu, 2008-07-24 at 15:56 -0500, Anthony Liguori wrote:
> Hi Mark,
>
> Mark McLoughlin wrote:
> > Hey,
> > Here's a bunch of patches attempting to improve the performance
> > of virtio_net. This is more an RFC rather than a patch submission
> > since, as can be seen below, not all patches actually improve the
> > performance measurably.
> >
>
> I'm still seeing the same problem I saw with my patch series. Namely,
> dhclient fails to get a DHCP address. Rusty noticed that RX has a lot
> more packets received than it should, so we're suspicious that we're
> getting packet corruption.
I've just tried bridging to my physical LAN and DHCP seems to be working
fine.
Which reminds me, though - doing this makes host->guest throughput drop
to well below pre-GSO figures. GSO appears to be disabled while there's
a physical interface on the bridge. If I remove eth0, the figures jump
right back up again.
I also just noticed that the GSO patch breaks e1000 because it
unconditionally sets IFF_VNET_HDR. Will fix that up.
> Configuring the tap device with a static address, here's what I get with
> iperf:
>
> w/o patches:
>
> guest->host: 625 Mbits/sec
> host->guest: 825 Mbits/sec
>
> w/patches
>
> guest->host: 2.02 Gbits/sec
> host->guest: 1.89 Gbits/sec
>
> guest lo: 4.35 Gbits/sec
> host lo: 4.36 Gbits/sec
I tried iperf at one point and was getting really low figures; not sure
why.
Apart from your iperf figures being lower than my netperf figures, it
also contradicts what I was seeing - namely guest->host beating
host->guest before the patches and host->guest beating guest->host after
the patches.
It could all just be down to the length of the tx timer. If you try
adjusting that does it help?
> This is with KVM GUEST configured FWIW.
Yep, same here.
Cheers,
Mark.
* Re: [PATCH 3/9] kvm: qemu: Remove virtio_net tx ring-full heuristic
2008-07-24 23:22 ` [PATCH 3/9] kvm: qemu: Remove virtio_net tx ring-full heuristic Dor Laor
2008-07-25 0:30 ` Rusty Russell
@ 2008-07-25 17:23 ` Mark McLoughlin
1 sibling, 0 replies; 36+ messages in thread
From: Mark McLoughlin @ 2008-07-25 17:23 UTC (permalink / raw)
To: Dor Laor; +Cc: kvm, Herbert Xu, Rusty Russell
On Fri, 2008-07-25 at 02:22 +0300, Dor Laor wrote:
> Mark McLoughlin wrote:
> > virtio_net tries to guess when it has received a tx
> > notification from the guest whether it indicates that the
> > guest has no more room in the tx ring and it should
> > immediately flush the queued buffers.
...
> > - if (n->tx_timer_active &&
> > - (vq->vring.avail->idx - vq->last_avail_idx) == 64) {
> > + if (n->tx_timer_active) {
> >
> Actually it was worthless anyway: once we set the timer, the flag
> below would have been cleared, causing the guest not to notify the
> host, so the == 64 test was never hit.
No, that's the point I'm trying to make in the commit message - the
guest ignores the NO_NOTIFY flag when the tx queue fills up and sends a
notification anyway.
The "== 64" test seems to be an attempt to catch this ring-full
condition.
Well, actually the test was worthless for a different reason, I guess.
If tx_timer_active is set and we get a notification, then given the
current guest code the number of buffers available should *always* be
64.
Cheers,
Mark.
* Re: [PATCH 8/9] kvm: qemu: Drop the mutex while reading from tapfd
2008-07-24 23:33 ` [PATCH 8/9] kvm: qemu: Drop the mutex while reading from tapfd Dor Laor
@ 2008-07-25 17:25 ` Mark McLoughlin
0 siblings, 0 replies; 36+ messages in thread
From: Mark McLoughlin @ 2008-07-25 17:25 UTC (permalink / raw)
To: Dor Laor; +Cc: kvm, Herbert Xu, Rusty Russell
On Fri, 2008-07-25 at 02:33 +0300, Dor Laor wrote:
> Mark McLoughlin wrote:
> > The idea here is that with GSO, packets are much larger
> > and we can allow the vcpu threads to e.g. process irq
> > acks during the window where we're reading these
> > packets from the tapfd.
> >
...
> > diff --git a/qemu/vl.c b/qemu/vl.c
> > index efdaafd..de92848 100644
> > --- a/qemu/vl.c
> > +++ b/qemu/vl.c
> > @@ -4281,7 +4281,9 @@ static void tap_send(void *opaque)
> > sbuf.buf = s->buf;
> > s->size = getmsg(s->fd, NULL, &sbuf, &f) >=0 ? sbuf.len : -1;
> > #else
> >
> Maybe do it only when GSO is actually used by the guest/tap.
> Otherwise it can cause some ctx trashing right?
(Strange habit you have of "top commenting" on patches :-)
I've been meaning to do some profiling of just how long the vcpu thread
spends blocking on the mutex in this case, both before and after the GSO
patches.
i.e. this change may make sense even with smaller buffers ...
> > + kvm_sleep_begin();
> > s->size = read(s->fd, s->buf, sizeof(s->buf));
> > + kvm_sleep_end();
Cheers,
Mark.
* Re: [PATCH 3/9] kvm: qemu: Remove virtio_net tx ring-full heuristic
2008-07-25 0:30 ` Rusty Russell
@ 2008-07-25 17:30 ` Mark McLoughlin
0 siblings, 0 replies; 36+ messages in thread
From: Mark McLoughlin @ 2008-07-25 17:30 UTC (permalink / raw)
To: Rusty Russell; +Cc: Dor Laor, kvm, Herbert Xu
On Fri, 2008-07-25 at 10:30 +1000, Rusty Russell wrote:
> On Friday 25 July 2008 09:22:53 Dor Laor wrote:
> > Mark McLoughlin wrote:
> > > vq->vring.used->flags &= ~VRING_USED_F_NO_NOTIFY;
> > > qemu_del_timer(n->tx_timer);
> > > n->tx_timer_active = 0;
> >
> > As stated by newer messages, we should handle the first tx notification
> > if the timer wasn't active to shorten latency.
> > Cheers, Dor
>
> Here's what lguest does at the moment. Basically, we cut the timeout a tiny
> bit each time, until we get *fewer* packets than last time. Then we bump it
> up again.
>
> Rough, but seems to work (it should be a per-device var of course, not a
> static).
>
> @@ -921,6 +922,7 @@ static void handle_net_output(int fd, st
> unsigned int head, out, in, num = 0;
> int len;
> struct iovec iov[vq->vring.num];
> + static int last_timeout_num;
>
> if (!timeout)
> net_xmit_notify++;
> @@ -941,6 +943,14 @@ static void handle_net_output(int fd, st
> /* Block further kicks and set up a timer if we saw anything. */
> if (!timeout && num)
> block_vq(vq);
> +
> + if (timeout) {
> + if (num < last_timeout_num)
> + timeout_usec += 10;
> + else if (timeout_usec > 1)
> + timeout_usec--;
> + last_timeout_num = num;
> + }
Yeah, I gave this a try in kvm and in the host->guest case the timeout
just grew and grew. In the guest->host case, it did stabilise at around
50us with high throughput.
Basically, I think in the host->guest case the number of buffers was
very variable so the timeout would get bumped by 10, reduced by a small
amount, bumped by 10, reduced by a small amount, ...
But, I agree the general principle seems about right; it just needs some
tweaking.
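One possible tweak (my illustration, not something proposed in the
thread): compare against a smoothed packet count instead of the raw
last-period number, so a single noisy period doesn't bump the timeout by
10us:

```c
/* Exponentially weighted moving average of packets-per-period, scaled
 * by 8 to stay in integer arithmetic. Like Rusty's last_timeout_num,
 * this really wants to be per-device state rather than a global. */
int ewma_x8;

int adjust_timeout(int timeout_usec, int num)
{
    int avg = ewma_x8 / 8;
    ewma_x8 += num - avg;        /* avg moves 1/8 of the way toward num */
    if (num < avg)
        timeout_usec += 10;      /* throughput dropped: back off */
    else if (timeout_usec > 1)
        timeout_usec--;          /* keep probing for a shorter timeout */
    return timeout_usec;
}
```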
Cheers,
Mark.
* Re: [PATCH 0/9][RFC] KVM virtio_net performance
2008-07-25 17:17 ` Mark McLoughlin
@ 2008-07-25 21:29 ` Dor Laor
0 siblings, 0 replies; 36+ messages in thread
From: Dor Laor @ 2008-07-25 21:29 UTC (permalink / raw)
To: Mark McLoughlin; +Cc: Anthony Liguori, kvm, Herbert Xu, Rusty Russell
Mark McLoughlin wrote:
> On Thu, 2008-07-24 at 15:56 -0500, Anthony Liguori wrote:
>
>> Hi Mark,
>>
>> Mark McLoughlin wrote:
>>
>>> Hey,
>>> Here's a bunch of patches attempting to improve the performance
>>> of virtio_net. This is more an RFC rather than a patch submission
>>> since, as can be seen below, not all patches actually improve the
>>> performance measurably.
>>>
>>>
>> I'm still seeing the same problem I saw with my patch series. Namely,
>> dhclient fails to get a DHCP address. Rusty noticed that RX has a lot
>> more packets received then it should so we're suspicious that we're
>> getting packet corruption.
>>
>
> I've just tried bridging to my physical LAN and DHCP seems to be working
> fine.
>
> Which reminds me, though - doing this makes host->guest throughput drop
> to well below pre-GSO figures. GSO appears to be disabled while there's
> a physical interface on the bridge. If I remove eth0, the figures jump
> right back up again.
>
> I also just noticed that the GSO patch breaks e1000 because it
> unconditionally sets IFF_VNET_HDR. Will fix that up.
>
>
>> Configuring the tap device with a static address, here's what I get with
>> iperf:
>>
>> w/o patches:
>>
>> guest->host: 625 Mbits/sec
>> host->guest: 825 Mbits/sec
>>
>> w/patches
>>
>> guest->host: 2.02 Gbits/sec
>> host->guest: 1.89 Gbits/sec
>>
>> guest lo: 4.35 Gbits/sec
>> host lo: 4.36 Gbits/sec
>>
>
> I tried iperf at one point and was getting really low figures; not sure
> why.
>
>
Older iperf (~2.0.2) has a locking-vs-yield bug and consumes lots of cpu.
It is fixed in 2.0.4. Could that be the difference?
> Apart from your iperf figures being lower than my netperf figures, it
> also contradicts what I was seeing - namely guest->host beating
> host->guest before the patches and host->guest beating guest->host after
> the patches.
>
> It could all just be down to the length of the tx timer. If you try
> adjusting that does it help?
>
>
>> This is with KVM GUEST configured FWIW.
>>
>
> Yep, same here.
>
> Cheers,
> Mark.
>
* Re: [PATCH 0/9][RFC] KVM virtio_net performance
2008-07-24 11:46 [PATCH 0/9][RFC] KVM virtio_net performance Mark McLoughlin
` (3 preceding siblings ...)
2008-07-24 20:56 ` Anthony Liguori
@ 2008-07-26 9:45 ` Avi Kivity
2008-07-27 6:48 ` Rusty Russell
` (2 more replies)
4 siblings, 3 replies; 36+ messages in thread
From: Avi Kivity @ 2008-07-26 9:45 UTC (permalink / raw)
To: Mark McLoughlin; +Cc: kvm, Herbert Xu, Rusty Russell
Mark McLoughlin wrote:
> Hey,
> Here's a bunch of patches attempting to improve the performance
> of virtio_net. This is more an RFC rather than a patch submission
> since, as can be seen below, not all patches actually improve the
> performance measurably.
>
> I've tried hard to test each of these patches with as stable and
> informative a benchmark as I could find. The first benchmark is a
> netperf[1] based throughput benchmark and the second uses a flood
> ping[2] to measure latency differences.
>
> Each set of figures is min/average/max/standard deviation. The
> first set is Gb/s and the second is milliseconds.
>
> The network configuration used was very simple - the guest with
> a virtio_net interface and the host with a tap interface and static
> IP addresses assigned to both - i.e. there was no bridge in the host
> involved and iptables was disabled in both the host and guest.
>
> I used:
>
> 1) kvm-71-26-g6152996 with the patches that follow
>
> 2) Linus's v2.6.26-5752-g93ded9b with Rusty's virtio patches from
> 219:bbd2611289c5 applied; these are the patches that have just been
> submitted to Linus
>
> The conclusions I draw are:
>
> 1) The length of the tx mitigation timer makes quite a difference to
> throughput achieved; we probably need a good heuristic for
> adjusting this on the fly.
>
The tx mitigation timer is just one part of the equation; the other is
the virtio ring window size, which is now fixed.
Using a maximum sized window is good when the guest and host are running
flat out, doing nothing but networking. When throughput drops (because
the guest is spending cpu on processing, or simply because the other
side is not keeping up), we need to drop the window size so as to
retain acceptable latencies.
The tx timer can then be set to "a bit after the end of the window",
acting as a safety belt in case the throughput changes.
> 4) Dropping the global mutex while reading GSO packets from the tap
> interface gives a nice speedup. This highlights the global mutex
> as a general performance issue.
>
>
Not sure whether this is safe. What's stopping the guest from accessing
virtio and changing some state?
> 5) Eliminating an extra copy on the host->guest path only makes a
> barely measurable difference.
>
>
That's expected on a host->guest test. Zero copy is mostly important
for guest->external, and with zerocopy already enabled in the guest
(sendfile or nfs server workloads).
> Anyway, the figures:
>
> netperf, 10x20s runs (Gb/s) | guest->host | host->guest
> -----------------------------+----------------------------+---------------------------
> baseline | 1.520/ 1.573/ 1.610/ 0.034 | 1.160/ 1.357/ 1.630/ 0.165
> 50us tx timer + rearm | 1.050/ 1.086/ 1.110/ 0.017 | 1.710/ 1.832/ 1.960/ 0.092
> 250us tx timer + rearm | 1.700/ 1.764/ 1.880/ 0.064 | 0.900/ 1.203/ 1.580/ 0.205
> 150us tx timer + rearm | 1.520/ 1.602/ 1.690/ 0.044 | 1.670/ 1.928/ 2.150/ 0.141
> no ring-full heuristic | 1.480/ 1.569/ 1.710/ 0.066 | 1.610/ 1.857/ 2.140/ 0.153
> VIRTIO_F_NOTIFY_ON_EMPTY | 1.470/ 1.554/ 1.650/ 0.054 | 1.770/ 1.960/ 2.170/ 0.119
> recv NO_NOTIFY | 1.530/ 1.604/ 1.680/ 0.047 | 1.780/ 1.944/ 2.190/ 0.129
> GSO | 4.120/ 4.323/ 4.420/ 0.099 | 6.540/ 7.033/ 7.340/ 0.244
> ring size == 256 | 4.050/ 4.406/ 4.560/ 0.143 | 6.280/ 7.236/ 8.280/ 0.613
> ring size == 512 | 4.420/ 4.600/ 4.960/ 0.140 | 6.470/ 7.205/ 7.510/ 0.314
> drop mutex during tapfd read | 4.320/ 4.578/ 4.790/ 0.161 | 8.370/ 8.589/ 8.730/ 0.120
> aligouri zero-copy | 4.510/ 4.694/ 4.960/ 0.148 | 8.430/ 8.614/ 8.840/ 0.142
>
Very impressive numbers; much better than I expected. The host->guest
numbers are around 100x better than the original emulated card throughput
we got from kvm.
> ping -f -c 100000 (ms) | guest->host | host->guest
> -----------------------------+----------------------------+---------------------------
> baseline | 0.060/ 0.459/ 7.602/ 0.846 | 0.067/ 0.331/ 2.517/ 0.057
> 50us tx timer + rearm | 0.081/ 0.143/ 7.436/ 0.374 | 0.093/ 0.133/ 1.883/ 0.026
> 250us tx timer + rearm | 0.302/ 0.463/ 7.580/ 0.849 | 0.297/ 0.344/ 2.128/ 0.028
> 150us tx timer + rearm | 0.197/ 0.323/ 7.671/ 0.740 | 0.199/ 0.245/ 7.836/ 0.037
> no ring-full heuristic | 0.182/ 0.324/ 7.688/ 0.753 | 0.199/ 0.243/ 2.197/ 0.030
> VIRTIO_F_NOTIFY_ON_EMPTY | 0.197/ 0.321/ 7.447/ 0.730 | 0.196/ 0.242/ 2.218/ 0.032
> recv NO_NOTIFY | 0.186/ 0.321/ 7.520/ 0.732 | 0.200/ 0.233/ 2.216/ 0.028
> GSO | 0.178/ 0.324/ 7.667/ 0.736 | 0.147/ 0.246/ 1.361/ 0.024
> ring size == 256 | 0.184/ 0.323/ 7.674/ 0.728 | 0.199/ 0.243/ 2.181/ 0.028
> ring size == 512 | (not measured) | (not measured)
> drop mutex during tapfd read | 0.183/ 0.323/ 7.820/ 0.733 | 0.202/ 0.242/ 2.219/ 0.027
> aligouri zero-copy | 0.185/ 0.325/ 7.863/ 0.736 | 0.202/ 0.245/ 7.844/ 0.036
>
>
This isn't too good. Low latency is important for nfs clients (or other
request/response workloads). I think we can keep these low by adjusting
the virtio window (for example, on an idle system it should be 1), so
that the tx mitigation timer only fires when the workload transitions
from throughput to request/response.
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 2/9] kvm: qemu: Fix virtio_net tx timer
2008-07-24 11:46 ` [PATCH 2/9] kvm: qemu: Fix virtio_net tx timer Mark McLoughlin
2008-07-24 11:46 ` [PATCH 3/9] kvm: qemu: Remove virtio_net tx ring-full heuristic Mark McLoughlin
@ 2008-07-26 9:48 ` Avi Kivity
2008-07-26 12:08 ` Mark McLoughlin
1 sibling, 1 reply; 36+ messages in thread
From: Avi Kivity @ 2008-07-26 9:48 UTC (permalink / raw)
To: Mark McLoughlin; +Cc: kvm, Herbert Xu, Rusty Russell
Mark McLoughlin wrote:
> The current virtio_net tx timer is 2ns, which doesn't
> make any sense. Set it to a more reasonable 150us
> instead.
>
> Signed-off-by: Mark McLoughlin <markmc@redhat.com>
> ---
> qemu/hw/virtio-net.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c
> index 2e57e5a..31867f1 100644
> --- a/qemu/hw/virtio-net.c
> +++ b/qemu/hw/virtio-net.c
> @@ -26,7 +26,7 @@
> #define VIRTIO_NET_F_MAC 5
> #define VIRTIO_NET_F_GS0 6
>
> -#define TX_TIMER_INTERVAL (1000 / 500)
> +#define TX_TIMER_INTERVAL (150000) /* 150 us */
>
>
Ouch.
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 2/9] kvm: qemu: Fix virtio_net tx timer
2008-07-26 9:48 ` [PATCH 2/9] kvm: qemu: Fix virtio_net tx timer Avi Kivity
@ 2008-07-26 12:08 ` Mark McLoughlin
0 siblings, 0 replies; 36+ messages in thread
From: Mark McLoughlin @ 2008-07-26 12:08 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm, Herbert Xu, Rusty Russell
On Sat, 2008-07-26 at 12:48 +0300, Avi Kivity wrote:
> Mark McLoughlin wrote:
> > The current virtio_net tx timer is 2ns, which doesn't
> > make any sense. Set it to a more reasonable 150us
> > instead.
> >
> > Signed-off-by: Mark McLoughlin <markmc@redhat.com>
> > ---
> > qemu/hw/virtio-net.c | 2 +-
> > 1 files changed, 1 insertions(+), 1 deletions(-)
> >
> > diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c
> > index 2e57e5a..31867f1 100644
> > --- a/qemu/hw/virtio-net.c
> > +++ b/qemu/hw/virtio-net.c
> > @@ -26,7 +26,7 @@
> > #define VIRTIO_NET_F_MAC 5
> > #define VIRTIO_NET_F_GS0 6
> >
> > -#define TX_TIMER_INTERVAL (1000 / 500)
> > +#define TX_TIMER_INTERVAL (150000) /* 150 us */
> >
>
> Ouch.
Well, not so much - and I should have explained why.
Even though virtio-net is requesting a 2ns tx timer, it actually gets
limited to MIN_TIMER_REARM_US, which is currently 250us.
However, even though the timer itself will only fire after 250us,
expire_time is only set to +2ns, so we'll get the timeout callback the next
time qemu_run_timers() is called from the mainloop.
I think this might account for a lot of the jitter in the throughput
numbers - the effective tx window size is anywhere between 2ns and 250us
depending on e.g. whether there is rx data available on the tap fd.
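The interaction described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the actual qemu timer code: the rearm delay is clamped to a minimum, but expiry is still judged against the raw requested deadline, so a tiny requested interval fires on whatever mainloop pass happens to run next.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative only: names and units are assumptions, not qemu's code. */
#define MIN_TIMER_REARM_US 250

/* How soon the host timer is actually re-armed: clamped to a minimum. */
static int64_t rearm_delay_us(int64_t requested_us)
{
    return requested_us < MIN_TIMER_REARM_US ? MIN_TIMER_REARM_US
                                             : requested_us;
}

/* Whether a qemu_run_timers()-style pass would run the callback now:
 * compared against the raw requested deadline, with no clamp. */
static int timer_expired(int64_t expire_time_us, int64_t now_us)
{
    return now_us >= expire_time_us;
}
```

With a near-zero request, timer_expired() is true almost immediately, so the callback runs on the next mainloop iteration if anything else (e.g. tap rx data) wakes the loop early, and only after MIN_TIMER_REARM_US otherwise - the jitter window described above.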
Cheers,
Mark.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 0/9][RFC] KVM virtio_net performance
2008-07-24 20:56 ` Anthony Liguori
2008-07-25 17:17 ` Mark McLoughlin
@ 2008-07-26 19:09 ` Bill Davidsen
2008-07-27 7:52 ` Avi Kivity
1 sibling, 1 reply; 36+ messages in thread
From: Bill Davidsen @ 2008-07-26 19:09 UTC (permalink / raw)
To: Anthony Liguori; +Cc: Mark McLoughlin, kvm, Herbert Xu, Rusty Russell
Anthony Liguori wrote:
> Hi Mark,
>
> Mark McLoughlin wrote:
>> Hey,
>> Here's a bunch of patches attempting to improve the performance
>> of virtio_net. This is more an RFC rather than a patch submission
>> since, as can be seen below, not all patches actually improve the
>> performance measurably.
>>
>
> I'm still seeing the same problem I saw with my patch series. Namely,
> dhclient fails to get a DHCP address. Rusty noticed that RX has a lot
> more packets received than it should so we're suspicious that we're
> getting packet corruption.
I have been discussing this (on this list) in another thread. Putting
tcpdump on the eth0 device in the VM, the br0 device in the host, and
the eth0 (physical NIC) in the host, you can see that when the VM
generates a DHCP request it shows up on the br0 in the host, but never
gets sent on the wire by eth0.
That's the point of failure, at least using RHEL5/FC6/kvm-66 as the
environment.
>
> Configuring the tap device with a static address, here's what I get with
> iperf:
>
> w/o patches:
>
> guest->host: 625 Mbits/sec
> host->guest: 825 Mbits/sec
>
> w/patches
>
> guest->host: 2.02 Gbits/sec
> host->guest: 1.89 Gbits/sec
>
> guest lo: 4.35 Gbits/sec
> host lo: 4.36 Gbits/sec
>
> This is with KVM GUEST configured FWIW.
>
> Regards,
>
> Anthony Liguori
>
--
Bill Davidsen <davidsen@tmr.com>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 0/9][RFC] KVM virtio_net performance
2008-07-26 9:45 ` Avi Kivity
@ 2008-07-27 6:48 ` Rusty Russell
2008-07-27 6:48 ` Rusty Russell
2008-08-11 19:56 ` Mark McLoughlin
2 siblings, 0 replies; 36+ messages in thread
From: Rusty Russell @ 2008-07-27 6:48 UTC (permalink / raw)
To: Avi Kivity; +Cc: Mark McLoughlin, kvm, Herbert Xu
On Saturday 26 July 2008 19:45:36 Avi Kivity wrote:
> Mark McLoughlin wrote:
> > Hey,
> > Here's a bunch of patches attempting to improve the performance
> > of virtio_net. This is more an RFC rather than a patch submission
> > since, as can be seen below, not all patches actually improve the
> > performance measurably.
> >
> > I've tried hard to test each of these patches with as stable and
> > informative a benchmark as I could find. The first benchmark is a
> > netperf[1] based throughput benchmark and the second uses a flood
> > ping[2] to measure latency differences.
> >
> > Each set of figures is min/average/max/standard deviation. The
> > first set is Gb/s and the second is milliseconds.
> >
> > The network configuration used was very simple - the guest with
> > a virtio_net interface and the host with a tap interface and static
> > IP addresses assigned to both - i.e. there was no bridge in the host
> > involved and iptables was disabled in both the host and guest.
> >
> > I used:
> >
> > 1) kvm-71-26-g6152996 with the patches that follow
> >
> > 2) Linus's v2.6.26-5752-g93ded9b with Rusty's virtio patches from
> > 219:bbd2611289c5 applied; these are the patches that have just been
> > submitted to Linus
> >
> > The conclusions I draw are:
> >
> > 1) The length of the tx mitigation timer makes quite a difference to
> > throughput achieved; we probably need a good heuristic for
> > adjusting this on the fly.
>
> The tx mitigation timer is just one part of the equation; the other is
> the virtio ring window size, which is now fixed.
>
> Using a maximum sized window is good when the guest and host are running
> flat out, doing nothing but networking. When throughput drops (because
> the guest is spending cpu on processing, or simply because the other
> side is not keeping up), we need to drop the window size so as to
> retain acceptable latencies.
>
> The tx timer can then be set to "a bit after the end of the window",
> acting as a safety belt in case the throughput changes.
Interestingly, I played a little with a threshold patch. Unfortunately it
seemed to just add yet another variable to the mix; it'd take real research
to figure out how to adjust the threshold and timeout values appropriately
for different situations.
> This isn't too good. Low latency is important for nfs clients (or other
> request/response workloads). I think we can keep these low by adjusting
> the virtio window (for example, on an idle system it should be 1), so
> that the tx mitigation timer only fires when the workload transitions
> from throughput to request/response.
From reading this thread, this seems like an implementation bug. Don't
suppress notifications on an empty ring (i.e. the threshold will be 1 when
nothing is happening).
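Rusty's point can be expressed as a tiny decision function. This is an illustrative sketch, not virtio code, and the names (`pending_before_add`, `threshold`) are assumptions:

```c
#include <assert.h>

/* Illustrative sketch of "don't suppress notifications on an empty ring":
 * whatever the batching threshold, adding the first buffer to an idle
 * ring always notifies, so an idle system behaves as if threshold == 1. */
static int should_notify(unsigned pending_before_add, unsigned threshold)
{
    if (pending_before_add == 0)
        return 1;                   /* ring was empty: never suppress */
    return pending_before_add + 1 >= threshold;
}
```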
Rusty.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 0/9][RFC] KVM virtio_net performance
2008-07-26 9:45 ` Avi Kivity
2008-07-27 6:48 ` Rusty Russell
@ 2008-07-27 6:48 ` Rusty Russell
2008-08-11 19:56 ` Mark McLoughlin
2 siblings, 0 replies; 36+ messages in thread
From: Rusty Russell @ 2008-07-27 6:48 UTC (permalink / raw)
To: Avi Kivity; +Cc: Mark McLoughlin, kvm, Herbert Xu
On Saturday 26 July 2008 19:45:36 Avi Kivity wrote:
> > 5) Eliminating an extra copy on the host->guest path only makes a
> > barely measurable difference.
>
> That's expected on a host->guest test. Zero copy is mostly important
> for guest->external, and with zerocopy already enabled in the guest
> (sendfile or nfs server workloads).
Note that this will also need the copyless tun patches. If we're doing one
copy from userspace->kernel anyway, this extra copy is probably lost in the
noise.
Cheers,
Rusty.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 0/9][RFC] KVM virtio_net performance
2008-07-26 19:09 ` Bill Davidsen
@ 2008-07-27 7:52 ` Avi Kivity
2008-07-27 12:52 ` Bill Davidsen
2008-07-27 13:17 ` Bill Davidsen
0 siblings, 2 replies; 36+ messages in thread
From: Avi Kivity @ 2008-07-27 7:52 UTC (permalink / raw)
To: Bill Davidsen
Cc: Anthony Liguori, Mark McLoughlin, kvm, Herbert Xu, Rusty Russell
Bill Davidsen wrote:
> Anthony Liguori wrote:
>> Hi Mark,
>>
>> Mark McLoughlin wrote:
>>> Hey,
>>> Here's a bunch of patches attempting to improve the performance
>>> of virtio_net. This is more an RFC rather than a patch submission
>>> since, as can be seen below, not all patches actually improve the
>>> performance measurably.
>>>
>>
>> I'm still seeing the same problem I saw with my patch series.
>> Namely, dhclient fails to get a DHCP address. Rusty noticed that RX
>> has a lot more packets received than it should so we're suspicious
>> that we're getting packet corruption.
>
> I have been discussing this (on this list) in another thread. Putting
> tcpdump on the eth0 device in the VM, the br0 device in the host, and
> the eth0 (physical NIC) in the host, you can see that when the VM
> generates a DHCP request it shows up on the br0 in the host, but never
> gets sent on the wire by eth0.
>
> That's the point of failure, at least using RHEL5/FC6/kvm-66 as the
> environment.
Does playing with the bridge forward delay ('brctl setfd') help?
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 0/9][RFC] KVM virtio_net performance
2008-07-27 7:52 ` Avi Kivity
@ 2008-07-27 12:52 ` Bill Davidsen
2008-07-27 13:17 ` Bill Davidsen
1 sibling, 0 replies; 36+ messages in thread
From: Bill Davidsen @ 2008-07-27 12:52 UTC (permalink / raw)
To: Avi Kivity
Cc: Anthony Liguori, Mark McLoughlin, kvm, Herbert Xu, Rusty Russell
Avi Kivity wrote:
> Bill Davidsen wrote:
>> Anthony Liguori wrote:
>>> Hi Mark,
>>>
>>> [...snip...]
>>> I'm still seeing the same problem I saw with my patch series.
>>> Namely, dhclient fails to get a DHCP address. Rusty noticed that RX
>>> has a lot more packets received than it should so we're suspicious
>>> that we're getting packet corruption.
>>
>> I have been discussing this (on this list) in another thread. Putting
>> tcpdump on the eth0 device in the VM, the br0 device in the host, and
>> the eth0 (physical NIC) in the host, you can see that when the VM
>> generates a DHCP request it shows up on the br0 in the host, but
>> never gets sent on the wire by eth0.
>>
>> That's the point of failure, at least using RHEL5/FC6/kvm-66 as the
>> environment.
>
> Does playing with the bridge forward delay ('brctl setfd') help?
>
Not obviously; the default value is 15, and stepping it from 10 to 30 made no
difference. It appears the bridge just doesn't forward DHCP packets. I'm
not an expert on bridge setup, and I don't see an obvious control, so I
will look at the code this afternoon and see if something jumps out at me.
--
Bill Davidsen <davidsen@tmr.com>
"Woe unto the statesman who makes war without a reason that will still
be valid when the war is over..." Otto von Bismark
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 0/9][RFC] KVM virtio_net performance
2008-07-27 7:52 ` Avi Kivity
2008-07-27 12:52 ` Bill Davidsen
@ 2008-07-27 13:17 ` Bill Davidsen
2008-07-28 6:42 ` Mark McLoughlin
1 sibling, 1 reply; 36+ messages in thread
From: Bill Davidsen @ 2008-07-27 13:17 UTC (permalink / raw)
To: Avi Kivity
Cc: Anthony Liguori, Mark McLoughlin, kvm, Herbert Xu, Rusty Russell
Avi Kivity wrote:
> Bill Davidsen wrote:
>>
>> I have been discussing this (on this list) in another thread. Putting
>> tcpdump on the eth0 device in the VM, the br0 device in the host, and
>> the eth0 (physical NIC) in the host, you can see that when the VM
>> generates a DHCP request it shows up on the br0 in the host, but
>> never gets sent on the wire by eth0.
>>
>> That's the point of failure, at least using RHEL5/FC6/kvm-66 as the
>> environment.
>
> Does playing with the bridge forward delay ('brctl setfd') help?
>
Update: Red Hat has a user chain in iptables shared between INPUT and
FORWARD (a bad idea) which doesn't pass bootp packets by default. Adding
the following rules to that chain solved the DHCP problem for me.
ACCEPT  udp  --  anywhere  anywhere  udp spt:bootps dpt:bootpc
ACCEPT  udp  --  anywhere  anywhere  udp spt:bootpc dpt:bootps
This seems to solve my problem; I just have to make it part of my "start
kvm" procedure.
--
Bill Davidsen <davidsen@tmr.com>
"Woe unto the statesman who makes war without a reason that will still
be valid when the war is over..." Otto von Bismark
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 0/9][RFC] KVM virtio_net performance
2008-07-27 13:17 ` Bill Davidsen
@ 2008-07-28 6:42 ` Mark McLoughlin
0 siblings, 0 replies; 36+ messages in thread
From: Mark McLoughlin @ 2008-07-28 6:42 UTC (permalink / raw)
To: Bill Davidsen; +Cc: Avi Kivity, Anthony Liguori, kvm, Herbert Xu, Rusty Russell
On Sun, 2008-07-27 at 09:17 -0400, Bill Davidsen wrote:
> Avi Kivity wrote:
> > Bill Davidsen wrote:
> >>
> >> I have been discussing this (on this list) in another thread. Putting
> >> tcpdump on the eth0 device in the VM, the br0 device in the host, and
> >> the eth0 (physical NIC) in the host, you can see that when the VM
> >> generates a DHCP request it shows up on the br0 in the host, but
> >> never gets sent on the wire by eth0.
> >>
> >> That's the point of failure, at least using RHEL5/FC6/kvm-66 as the
> >> environment.
> >
> > Does playing with the bridge forward delay ('brctl setfd') help?
> >
> Update: Red Hat has a user chain in iptables shared between INPUT and
> FORWARD (bad idea) which doesn't pass bootp packets by default.
Yeah, I've been trying to get that rule changed to allow all bridged
packets to be forwarded by default. See:
https://bugzilla.redhat.com/221828
> Adding
> the following rules to that table solved the DHCP for me.
>
> ACCEPT udp -- anywhere anywhere udp
> spt:bootps dpt:bootpc
> ACCEPT udp -- anywhere anywhere udp
> spt:bootpc dpt:bootps
>
> This seems to solve my problem, I just have to make it part of my "start
> kvm" procedure.
See here:
http://wiki.libvirt.org/page/Networking
in the 'Bridged networking (aka "shared physical device")' section:
# echo "-I FORWARD -m physdev --physdev-is-bridged -j ACCEPT" > /etc/sysconfig/iptables-forward-bridged
# lokkit --custom-rules=ipv4:filter:/etc/sysconfig/iptables-forward-bridged
# service libvirtd reload
Cheers,
Mark.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 0/9][RFC] KVM virtio_net performance
2008-07-26 9:45 ` Avi Kivity
2008-07-27 6:48 ` Rusty Russell
2008-07-27 6:48 ` Rusty Russell
@ 2008-08-11 19:56 ` Mark McLoughlin
2008-08-12 13:35 ` Avi Kivity
2 siblings, 1 reply; 36+ messages in thread
From: Mark McLoughlin @ 2008-08-11 19:56 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm, Herbert Xu, Rusty Russell
Hi Avi,
Sorry, I got distracted from this ...
On Sat, 2008-07-26 at 12:45 +0300, Avi Kivity wrote:
> Mark McLoughlin wrote:
> > 1) The length of the tx mitigation timer makes quite a difference to
> > throughput achieved; we probably need a good heuristic for
> > adjusting this on the fly.
> >
>
> The tx mitigation timer is just one part of the equation; the other is
> the virtio ring window size, which is now fixed.
>
> Using a maximum sized window is good when the guest and host are running
> flat out, doing nothing but networking. When throughput drops (because
> the guest is spending cpu on processing, or simply because the other
> > side is not keeping up), we need to drop the window size so as to
> retain acceptable latencies.
>
> The tx timer can then be set to "a bit after the end of the window",
> acting as a safety belt in case the throughput changes.
i.e. the tx timer should give just enough time for a flat out guest to
fill the ring, and no more?
Yep, that's basically what lguest's tx timer heuristic is aiming for
AFAICT.
> > 4) Dropping the global mutex while reading GSO packets from the tap
> > interface gives a nice speedup. This highlights the global mutex
> > as a general performance issue.
> >
> >
>
> Not sure whether this is safe. What's stopping the guest from accessing
> virtio and changing some state?
With the current code, the virtio state should be consistent before we
drop the mutex. The I/O thread would only drop the lock while it reads
into the tap buffer and then grab the lock again before popping a buffer
from the ring and copying to it.
With Anthony's zero-copy patch, the situation is less clear - we pop a
buffer from the avail, drop the lock, read() into the buffer, grab the
lock and then push the buffer back onto the used ring. While the mutex
is released, the guest could e.g. reset the ring and release the buffer
which we're in the process of read()ing into.
So, yes - dropping the mutex during read() in the zero-copy patch isn't
safe.
Another potential concern is that if we drop the mutex, the guest thread
could delete an I/O handler while the I/O thread is in the I/O handler
loop in main_loop_wait(). However, this seems to have been coded to
handle this situation - the I/O handler would only be marked as deleted,
and ignored by the loop.
Oh, and there's the posix-timers/signalfd race condition that I can only
seem to trigger when dropping the mutex.
> > 5) Eliminating an extra copy on the host->guest path only makes a
> > barely measurable difference.
> >
> >
>
> That's expected on a host->guest test. Zero copy is mostly important
> for guest->external, and with zerocopy already enabled in the guest
> (sendfile or nfs server workloads).
Hmm, could you elaborate on that?
The copy we're eliminating here is an intermediate copy from tapfd into
a buffer before copying to a guest buffer. It doesn't give you zero-copy
as we still copy from kernel space to user space and vice-versa.
> > Anyway, the figures:
> >
> > netperf, 10x20s runs (Gb/s) | guest->host | host->guest
> > -----------------------------+----------------------------+---------------------------
> > baseline | 1.520/ 1.573/ 1.610/ 0.034 | 1.160/ 1.357/ 1.630/ 0.165
> > 50us tx timer + rearm | 1.050/ 1.086/ 1.110/ 0.017 | 1.710/ 1.832/ 1.960/ 0.092
> > 250us tx timer + rearm | 1.700/ 1.764/ 1.880/ 0.064 | 0.900/ 1.203/ 1.580/ 0.205
> > 150us tx timer + rearm | 1.520/ 1.602/ 1.690/ 0.044 | 1.670/ 1.928/ 2.150/ 0.141
> > no ring-full heuristic | 1.480/ 1.569/ 1.710/ 0.066 | 1.610/ 1.857/ 2.140/ 0.153
> > VIRTIO_F_NOTIFY_ON_EMPTY | 1.470/ 1.554/ 1.650/ 0.054 | 1.770/ 1.960/ 2.170/ 0.119
> > recv NO_NOTIFY | 1.530/ 1.604/ 1.680/ 0.047 | 1.780/ 1.944/ 2.190/ 0.129
> > GSO | 4.120/ 4.323/ 4.420/ 0.099 | 6.540/ 7.033/ 7.340/ 0.244
> > ring size == 256 | 4.050/ 4.406/ 4.560/ 0.143 | 6.280/ 7.236/ 8.280/ 0.613
> > ring size == 512 | 4.420/ 4.600/ 4.960/ 0.140 | 6.470/ 7.205/ 7.510/ 0.314
> > drop mutex during tapfd read | 4.320/ 4.578/ 4.790/ 0.161 | 8.370/ 8.589/ 8.730/ 0.120
> > aligouri zero-copy | 4.510/ 4.694/ 4.960/ 0.148 | 8.430/ 8.614/ 8.840/ 0.142
> >
>
> Very impressive numbers; much better than I expected. The host->guest
> numbers are around 100x better than the original emulated card througput
> we got from kvm.
Hmm, I had intended to post comparison numbers for e1000; if I re-run
these again, I'll do that too. I think it's about 0.5 Gb/s guest->host
and 1.5 Gb/s host->guest.
Note also that with current Fedora rawhide kernels the numbers aren't
nearly as good as the numbers using my own kernel builds. This could be
down to the fact that we enable pretty much all kernel debugging options
during this phase of Fedora development. If I get a chance, I'll see if
I can confirm that suspicion.
> > ping -f -c 100000 (ms) | guest->host | host->guest
> > -----------------------------+----------------------------+---------------------------
> > baseline | 0.060/ 0.459/ 7.602/ 0.846 | 0.067/ 0.331/ 2.517/ 0.057
> > 50us tx timer + rearm | 0.081/ 0.143/ 7.436/ 0.374 | 0.093/ 0.133/ 1.883/ 0.026
> > 250us tx timer + rearm | 0.302/ 0.463/ 7.580/ 0.849 | 0.297/ 0.344/ 2.128/ 0.028
> > 150us tx timer + rearm | 0.197/ 0.323/ 7.671/ 0.740 | 0.199/ 0.245/ 7.836/ 0.037
> > no ring-full heuristic | 0.182/ 0.324/ 7.688/ 0.753 | 0.199/ 0.243/ 2.197/ 0.030
> > VIRTIO_F_NOTIFY_ON_EMPTY | 0.197/ 0.321/ 7.447/ 0.730 | 0.196/ 0.242/ 2.218/ 0.032
> > recv NO_NOTIFY | 0.186/ 0.321/ 7.520/ 0.732 | 0.200/ 0.233/ 2.216/ 0.028
> > GSO | 0.178/ 0.324/ 7.667/ 0.736 | 0.147/ 0.246/ 1.361/ 0.024
> > ring size == 256 | 0.184/ 0.323/ 7.674/ 0.728 | 0.199/ 0.243/ 2.181/ 0.028
> > ring size == 512 | (not measured) | (not measured)
> > drop mutex during tapfd read | 0.183/ 0.323/ 7.820/ 0.733 | 0.202/ 0.242/ 2.219/ 0.027
> > aligouri zero-copy | 0.185/ 0.325/ 7.863/ 0.736 | 0.202/ 0.245/ 7.844/ 0.036
> >
> >
>
> This isn't too good. Low latency is important for nfs clients (or other
> request/response workloads). I think we can keep these low by adjusting
> the virtio window (for example, on an idle system it should be 1), so
> that the tx mitigation timer only fires when the workload transitions
> from throughput to request/response.
The other suggestion to help latency is to always flush the queue
immediately on guest notification, before setting the timer.
Cheers,
Mark.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 0/9][RFC] KVM virtio_net performance
2008-08-11 19:56 ` Mark McLoughlin
@ 2008-08-12 13:35 ` Avi Kivity
0 siblings, 0 replies; 36+ messages in thread
From: Avi Kivity @ 2008-08-12 13:35 UTC (permalink / raw)
To: Mark McLoughlin; +Cc: kvm, Herbert Xu, Rusty Russell
Mark McLoughlin wrote:
> Hi Avi,
>
> Sorry, I got distracted from this ...
>
>
So did I :)
>>> 1) The length of the tx mitigation timer makes quite a difference to
>>> throughput achieved; we probably need a good heuristic for
>>> adjusting this on the fly.
>>>
>>>
>> The tx mitigation timer is just one part of the equation; the other is
>> the virtio ring window size, which is now fixed.
>>
>> Using a maximum sized window is good when the guest and host are running
>> flat out, doing nothing but networking. When throughput drops (because
>> the guest is spending cpu on processing, or simply because the other
>> side is not keeping up), we need to drop the window size so as to
>> retain acceptable latencies.
>>
>> The tx timer can then be set to "a bit after the end of the window",
>> acting as a safety belt in case the throughput changes.
>>
>
> i.e. the tx timer should give just enough time for a flat out guest to
> fill the ring, and no more?
>
> Yep, that's basically what lguest's tx timer heuristic is aiming for
> AFAICT.
>
>
Yes, but that's not enough. If networking is slow (for whatever reason)
we need to drop the window size, to make sure the timer never fires
under steady state circumstances.
Thinking about it, we could have an explicit "worst case latency"
parameter (instead of the implicit "flat out guest fills ring") and set
the timer to that, and adjust the window size to be as large as we can
without seeing the timer expire.
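In outline, the adaptive scheme might look something like this. The function and parameter names are illustrative assumptions, not proposed qemu code:

```c
#include <assert.h>

/* Illustrative sketch of the adaptive tx window: the mitigation timer is
 * armed for the worst-case latency budget; the window grows while the
 * timer stays quiet and shrinks whenever it actually expires, so under
 * steady state the ring-fill notification, not the timer, drives tx. */
static unsigned adjust_tx_window(unsigned window, int timer_fired,
                                 unsigned min_win, unsigned max_win)
{
    if (timer_fired) {
        window /= 2;               /* other side isn't keeping up: back off */
        if (window < min_win)
            window = min_win;
    } else if (window < max_win) {
        window += 1;               /* steady state: probe a larger window */
    }
    return window;
}
```

The multiplicative decrease / additive increase shape here is one plausible choice (borrowed from congestion-control practice), not something the thread specifies.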
>>> 4) Dropping the global mutex while reading GSO packets from the tap
>>> interface gives a nice speedup. This highlights the global mutex
>>> as a general performance issue.
>>>
>>>
>>>
>> Not sure whether this is safe. What's stopping the guest from accessing
>> virtio and changing some state?
>>
>
> With the current code, the virtio state should be consistent before we
> drop the mutex. The I/O thread would only drop the lock while it reads
> into the tap buffer and then grab the lock again before popping a buffer
> from the ring and copying to it.
>
>
Right, tap_send() is called outside virtio-net context.
> With Anthony's zero-copy patch, the situation is less clear - we pop a
> buffer from the avail, drop the lock, read() into the buffer, grab the
> lock and then push the buffer back onto the used ring. While the mutex
> is released, the guest could e.g. reset the ring and release the buffer
> which we're in the process of read()ing into.
>
> So, yes - dropping the mutex during read() in the zero-copy patch isn't
> safe.
>
> Another potential concern is that if we drop the mutex, the guest thread
> could delete an I/O handler while the I/O thread is in the I/O handler
> loop in main_loop_wait(). However, this seems to have been coded to
> handle this situation - the I/O handler would only be marked as deleted,
> and ignored by the loop.
>
I think it's safe. Still I don't feel good about it.
>
>>> 5) Eliminating an extra copy on the host->guest path only makes a
>>> barely measurable difference.
>>>
>>>
>>>
>> That's expected on a host->guest test. Zero copy is mostly important
>> for guest->external, and with zerocopy already enabled in the guest
>> (sendfile or nfs server workloads).
>>
>
> Hmm, could you elaborate on that?
>
> The copy we're eliminating here is an intermediate copy from tapfd into
> a buffer before copying to a guest buffer. It doesn't give you zero-copy
> as we still copy from kernel space to user space and vice-versa.
>
So long as we're eliminating intermediate copies, each elimination doesn't
bring us much. It's the elimination of the last copy that brings
the benefit (actually, the last copy for each separate L2 cache; need to
test on loaded multisocket hosts).
There are broadly three workload categories wrt copying:
- server-side static http/nfs/smb protocols, serving to clients outside
the host: guest is already copyless; serving from cache-cold buffers
- normal network servers that actually process their data: you'll have a
guest kernel/user copy, and in any case, guest protocol processing means
that the potential for gains from eliminating copies is limited
- guest/host benchmarks, which do a copy from a single buffer which is
always in L1: zero copy is not going to show a gain (perhaps the opposite)
To get the first workload type optimized, we have to get the entire path
copyless.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 36+ messages in thread
end of thread, other threads:[~2008-08-12 13:35 UTC | newest]
Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-07-24 11:46 [PATCH 0/9][RFC] KVM virtio_net performance Mark McLoughlin
2008-07-24 11:46 ` [PATCH 1/9] kvm: qemu: Set MIN_TIMER_REARM_US to 150us Mark McLoughlin
2008-07-24 11:46 ` [PATCH 2/9] kvm: qemu: Fix virtio_net tx timer Mark McLoughlin
2008-07-24 11:46 ` [PATCH 3/9] kvm: qemu: Remove virtio_net tx ring-full heuristic Mark McLoughlin
2008-07-24 11:46 ` [PATCH 4/9] kvm: qemu: Add VIRTIO_F_NOTIFY_ON_EMPTY Mark McLoughlin
2008-07-24 11:46 ` [PATCH 5/9] kvm: qemu: Disable recv notifications until avail buffers exhausted Mark McLoughlin
2008-07-24 11:46 ` [PATCH 6/9] kvm: qemu: Add support for partial csums and GSO Mark McLoughlin
2008-07-24 11:46 ` [PATCH 7/9] kvm: qemu: Increase size of virtio_net rings Mark McLoughlin
2008-07-24 11:46 ` [PATCH 8/9] kvm: qemu: Drop the mutex while reading from tapfd Mark McLoughlin
2008-07-24 11:46 ` [PATCH 9/9] kvm: qemu: Eliminate extra virtio_net copy Mark McLoughlin
2008-07-24 23:33 ` [PATCH 8/9] kvm: qemu: Drop the mutex while reading from tapfd Dor Laor
2008-07-25 17:25 ` Mark McLoughlin
2008-07-24 23:22 ` [PATCH 3/9] kvm: qemu: Remove virtio_net tx ring-full heuristic Dor Laor
2008-07-25 0:30 ` Rusty Russell
2008-07-25 17:30 ` Mark McLoughlin
2008-07-25 17:23 ` Mark McLoughlin
2008-07-24 23:56 ` Dor Laor
2008-07-26 9:48 ` [PATCH 2/9] kvm: qemu: Fix virtio_net tx timer Avi Kivity
2008-07-26 12:08 ` Mark McLoughlin
2008-07-24 11:55 ` [PATCH 0/9][RFC] KVM virtio_net performance Herbert Xu
2008-07-24 16:53 ` Mark McLoughlin
2008-07-24 18:29 ` Anthony Liguori
2008-07-25 16:36 ` Mark McLoughlin
2008-07-24 20:56 ` Anthony Liguori
2008-07-25 17:17 ` Mark McLoughlin
2008-07-25 21:29 ` Dor Laor
2008-07-26 19:09 ` Bill Davidsen
2008-07-27 7:52 ` Avi Kivity
2008-07-27 12:52 ` Bill Davidsen
2008-07-27 13:17 ` Bill Davidsen
2008-07-28 6:42 ` Mark McLoughlin
2008-07-26 9:45 ` Avi Kivity
2008-07-27 6:48 ` Rusty Russell
2008-07-27 6:48 ` Rusty Russell
2008-08-11 19:56 ` Mark McLoughlin
2008-08-12 13:35 ` Avi Kivity