* [PATCH 0/9][RFC] KVM virtio_net performance
@ 2008-07-24 11:46 Mark McLoughlin
2008-07-24 11:46 ` [PATCH 1/9] kvm: qemu: Set MIN_TIMER_REARM_US to 150us Mark McLoughlin
` (4 more replies)
0 siblings, 5 replies; 36+ messages in thread
From: Mark McLoughlin @ 2008-07-24 11:46 UTC (permalink / raw)
To: kvm; +Cc: Herbert Xu, Rusty Russell
Hey,
Here's a bunch of patches attempting to improve the performance
of virtio_net. This is more an RFC than a patch submission
since, as can be seen below, not all patches actually improve the
performance measurably.
I've tried hard to test each of these patches with as stable and
informative a benchmark as I could find. The first benchmark is a
netperf[1] based throughput benchmark and the second uses a flood
ping[2] to measure latency differences.
Each set of figures is min/average/max/standard deviation. The
first set is Gb/s and the second is milliseconds.
The network configuration used was very simple - the guest with
a virtio_net interface and the host with a tap interface and static
IP addresses assigned to both - i.e. there was no bridge involved
in the host and iptables was disabled in both the host and guest.
I used:
1) kvm-71-26-g6152996 with the patches that follow
2) Linus's v2.6.26-5752-g93ded9b with Rusty's virtio patches from
219:bbd2611289c5 applied; these are the patches that have just
been submitted to Linus
The conclusions I draw are:
1) The length of the tx mitigation timer makes quite a difference to
throughput achieved; we probably need a good heuristic for
adjusting this on the fly.
2) Using the recently merged GSO support in the tun/tap driver gives
a huge boost, but much more so on the host->guest side.
3) Adjusting the virtio_net ring sizes makes a small difference, but
not as much as one might expect.
4) Dropping the global mutex while reading GSO packets from the tap
interface gives a nice speedup. This highlights the global mutex
as a general performance issue.
5) Eliminating an extra copy on the host->guest path makes only a
barely measurable difference.
Anyway, the figures:
netperf, 10x20s runs (Gb/s) | guest->host | host->guest
-----------------------------+----------------------------+---------------------------
baseline | 1.520/ 1.573/ 1.610/ 0.034 | 1.160/ 1.357/ 1.630/ 0.165
50us tx timer + rearm | 1.050/ 1.086/ 1.110/ 0.017 | 1.710/ 1.832/ 1.960/ 0.092
250us tx timer + rearm | 1.700/ 1.764/ 1.880/ 0.064 | 0.900/ 1.203/ 1.580/ 0.205
150us tx timer + rearm | 1.520/ 1.602/ 1.690/ 0.044 | 1.670/ 1.928/ 2.150/ 0.141
no ring-full heuristic | 1.480/ 1.569/ 1.710/ 0.066 | 1.610/ 1.857/ 2.140/ 0.153
VIRTIO_F_NOTIFY_ON_EMPTY | 1.470/ 1.554/ 1.650/ 0.054 | 1.770/ 1.960/ 2.170/ 0.119
recv NO_NOTIFY | 1.530/ 1.604/ 1.680/ 0.047 | 1.780/ 1.944/ 2.190/ 0.129
GSO | 4.120/ 4.323/ 4.420/ 0.099 | 6.540/ 7.033/ 7.340/ 0.244
ring size == 256 | 4.050/ 4.406/ 4.560/ 0.143 | 6.280/ 7.236/ 8.280/ 0.613
ring size == 512 | 4.420/ 4.600/ 4.960/ 0.140 | 6.470/ 7.205/ 7.510/ 0.314
drop mutex during tapfd read | 4.320/ 4.578/ 4.790/ 0.161 | 8.370/ 8.589/ 8.730/ 0.120
aligouri zero-copy | 4.510/ 4.694/ 4.960/ 0.148 | 8.430/ 8.614/ 8.840/ 0.142
ping -f -c 100000 (ms) | guest->host | host->guest
-----------------------------+----------------------------+---------------------------
baseline | 0.060/ 0.459/ 7.602/ 0.846 | 0.067/ 0.331/ 2.517/ 0.057
50us tx timer + rearm | 0.081/ 0.143/ 7.436/ 0.374 | 0.093/ 0.133/ 1.883/ 0.026
250us tx timer + rearm | 0.302/ 0.463/ 7.580/ 0.849 | 0.297/ 0.344/ 2.128/ 0.028
150us tx timer + rearm | 0.197/ 0.323/ 7.671/ 0.740 | 0.199/ 0.245/ 7.836/ 0.037
no ring-full heuristic | 0.182/ 0.324/ 7.688/ 0.753 | 0.199/ 0.243/ 2.197/ 0.030
VIRTIO_F_NOTIFY_ON_EMPTY | 0.197/ 0.321/ 7.447/ 0.730 | 0.196/ 0.242/ 2.218/ 0.032
recv NO_NOTIFY | 0.186/ 0.321/ 7.520/ 0.732 | 0.200/ 0.233/ 2.216/ 0.028
GSO | 0.178/ 0.324/ 7.667/ 0.736 | 0.147/ 0.246/ 1.361/ 0.024
ring size == 256 | 0.184/ 0.323/ 7.674/ 0.728 | 0.199/ 0.243/ 2.181/ 0.028
ring size == 512 | (not measured) | (not measured)
drop mutex during tapfd read | 0.183/ 0.323/ 7.820/ 0.733 | 0.202/ 0.242/ 2.219/ 0.027
aligouri zero-copy | 0.185/ 0.325/ 7.863/ 0.736 | 0.202/ 0.245/ 7.844/ 0.036
Cheers,
Mark.
[1] - I used netperf trunk from:
http://www.netperf.org/svn/netperf2/trunk
and simply ran:
$> i=0; while [ $i -lt 10 ]; do ./netperf -H <host> -f g -l 20 -P 0 | netperf-collect.py; i=$((i+1)); done
where netperf-collect.py is just a script to calculate the
average across the runs:
http://markmc.fedorapeople.org/netperf-collect.py
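For the curious, a rough sketch of what such a script might do is below. This is a hypothetical reimplementation, not the actual netperf-collect.py, and whether the original reports the sample or population form of the standard deviation is an assumption here:

```python
import math

def summarize(samples):
    """Collapse a list of per-run throughput figures into the
    min/average/max/standard-deviation quadruple used in the
    tables above."""
    n = len(samples)
    mean = sum(samples) / n
    # population standard deviation; the real script may well use
    # the sample (n - 1) form instead
    stddev = math.sqrt(sum((x - mean) ** 2 for x in samples) / n)
    return min(samples), mean, max(samples), stddev

def format_row(samples):
    # matches the "1.520/ 1.573/ 1.610/ 0.034" layout of the tables
    return "%6.3f/ %6.3f/ %6.3f/ %6.3f" % summarize(samples)
```

In practice the figures would be fed in one per run, e.g. by piping netperf's Gb/s column through the script as in the loop above.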
[2] - ping -c 100000 -f <host>
^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 1/9] kvm: qemu: Set MIN_TIMER_REARM_US to 150us
  2008-07-24 11:46 [PATCH 0/9][RFC] KVM virtio_net performance Mark McLoughlin
@ 2008-07-24 11:46 ` Mark McLoughlin
  2008-07-24 11:46   ` [PATCH 2/9] kvm: qemu: Fix virtio_net tx timer Mark McLoughlin
  2008-07-24 11:55   ` [PATCH 0/9][RFC] KVM virtio_net performance Herbert Xu
  ` (3 subsequent siblings)
  4 siblings, 1 reply; 36+ messages in thread
From: Mark McLoughlin @ 2008-07-24 11:46 UTC (permalink / raw)
  To: kvm; +Cc: Herbert Xu, Rusty Russell, Mark McLoughlin

Equivalent to ~300 syscalls on my machine

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
---
 qemu/vl.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/qemu/vl.c b/qemu/vl.c
index 5d285cc..b7d3397 100644
--- a/qemu/vl.c
+++ b/qemu/vl.c
@@ -891,7 +891,7 @@ static void qemu_rearm_alarm_timer(struct qemu_alarm_timer *t)
 }
 
 /* TODO: MIN_TIMER_REARM_US should be optimized */
-#define MIN_TIMER_REARM_US 250
+#define MIN_TIMER_REARM_US 150
 
 static struct qemu_alarm_timer *alarm_timer;
-- 
1.5.4.1

^ permalink raw reply related	[flat|nested] 36+ messages in thread
* [PATCH 2/9] kvm: qemu: Fix virtio_net tx timer
  2008-07-24 11:46 ` [PATCH 1/9] kvm: qemu: Set MIN_TIMER_REARM_US to 150us Mark McLoughlin
@ 2008-07-24 11:46   ` Mark McLoughlin
  2008-07-24 11:46     ` [PATCH 3/9] kvm: qemu: Remove virtio_net tx ring-full heuristic Mark McLoughlin
  2008-07-26  9:48     ` [PATCH 2/9] kvm: qemu: Fix virtio_net tx timer Avi Kivity
  0 siblings, 2 replies; 36+ messages in thread
From: Mark McLoughlin @ 2008-07-24 11:46 UTC (permalink / raw)
  To: kvm; +Cc: Herbert Xu, Rusty Russell, Mark McLoughlin

The current virtio_net tx timer is 2ns, which doesn't make any
sense. Set it to a more reasonable 150us instead.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
---
 qemu/hw/virtio-net.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c
index 2e57e5a..31867f1 100644
--- a/qemu/hw/virtio-net.c
+++ b/qemu/hw/virtio-net.c
@@ -26,7 +26,7 @@
 #define VIRTIO_NET_F_MAC 5
 #define VIRTIO_NET_F_GS0 6
 
-#define TX_TIMER_INTERVAL (1000 / 500)
+#define TX_TIMER_INTERVAL (150000) /* 150 us */
 
 /* The config defining mac address (6 bytes) */
 struct virtio_net_config
-- 
1.5.4.1

^ permalink raw reply related	[flat|nested] 36+ messages in thread
* [PATCH 3/9] kvm: qemu: Remove virtio_net tx ring-full heuristic
  2008-07-24 11:46 ` [PATCH 2/9] kvm: qemu: Fix virtio_net tx timer Mark McLoughlin
@ 2008-07-24 11:46   ` Mark McLoughlin
  2008-07-24 11:46     ` [PATCH 4/9] kvm: qemu: Add VIRTIO_F_NOTIFY_ON_EMPTY Mark McLoughlin
  ` (2 more replies)
  2008-07-26  9:48     ` [PATCH 2/9] kvm: qemu: Fix virtio_net tx timer Avi Kivity
  1 sibling, 3 replies; 36+ messages in thread
From: Mark McLoughlin @ 2008-07-24 11:46 UTC (permalink / raw)
  To: kvm; +Cc: Herbert Xu, Rusty Russell, Mark McLoughlin

virtio_net tries to guess when it has received a tx notification
from the guest whether it indicates that the guest has no more
room in the tx ring and it should immediately flush the queued
buffers.

The heuristic is based on the fact that there are 128 buffer
entries in the ring and each packet uses 2 buffers (i.e. the
virtio_net_hdr and the packet's linear data).

Using GSO or increasing the size of the rings will break that
heuristic, so let's remove it and assume that any notification
from the guest after we've disabled notifications indicates that
we should flush our buffers.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
---
 qemu/hw/virtio-net.c |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c
index 31867f1..4adfa42 100644
--- a/qemu/hw/virtio-net.c
+++ b/qemu/hw/virtio-net.c
@@ -175,8 +175,7 @@ static void virtio_net_handle_tx(VirtIODevice *vdev, VirtQueue *vq)
 {
     VirtIONet *n = to_virtio_net(vdev);
 
-    if (n->tx_timer_active &&
-        (vq->vring.avail->idx - vq->last_avail_idx) == 64) {
+    if (n->tx_timer_active) {
         vq->vring.used->flags &= ~VRING_USED_F_NO_NOTIFY;
         qemu_del_timer(n->tx_timer);
         n->tx_timer_active = 0;
-- 
1.5.4.1

^ permalink raw reply related	[flat|nested] 36+ messages in thread
* [PATCH 4/9] kvm: qemu: Add VIRTIO_F_NOTIFY_ON_EMPTY
  2008-07-24 11:46 ` [PATCH 3/9] kvm: qemu: Remove virtio_net tx ring-full heuristic Mark McLoughlin
@ 2008-07-24 11:46   ` Mark McLoughlin
  2008-07-24 11:46     ` [PATCH 5/9] kvm: qemu: Disable recv notifications until avail buffers exhausted Mark McLoughlin
  2008-07-24 23:22     ` [PATCH 3/9] kvm: qemu: Remove virtio_net tx ring-full heuristic Dor Laor
  2008-07-24 23:56     ` Dor Laor
  2 siblings, 1 reply; 36+ messages in thread
From: Mark McLoughlin @ 2008-07-24 11:46 UTC (permalink / raw)
  To: kvm; +Cc: Herbert Xu, Rusty Russell, Mark McLoughlin

Set the VIRTIO_F_NOTIFY_ON_EMPTY feature bit so the guest can
rely on us notifying them when the queue is empty.

Also, only notify when the available queue is empty *and* when
we've finished with all the buffers we had detached. Right now,
when the queue is empty, we notify the guest for every used
buffer.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
---
 qemu/hw/virtio.c |    6 +++++-
 qemu/hw/virtio.h |    5 +++++
 2 files changed, 10 insertions(+), 1 deletions(-)

diff --git a/qemu/hw/virtio.c b/qemu/hw/virtio.c
index 3429ac8..e035e4e 100644
--- a/qemu/hw/virtio.c
+++ b/qemu/hw/virtio.c
@@ -138,6 +138,7 @@ void virtqueue_push(VirtQueue *vq, const VirtQueueElement *elem,
     /* Make sure buffer is written before we update index. */
     wmb();
     vq->vring.used->idx++;
+    vq->inuse--;
 }
 
 int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
@@ -187,6 +188,8 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
 
     elem->index = head;
 
+    vq->inuse++;
+
     return elem->in_num + elem->out_num;
 }
 
@@ -275,6 +278,7 @@ static uint32_t virtio_ioport_read(void *opaque, uint32_t addr)
     switch (addr) {
     case VIRTIO_PCI_HOST_FEATURES:
         ret = vdev->get_features(vdev);
+        ret |= (1 << VIRTIO_F_NOTIFY_ON_EMPTY);
         break;
     case VIRTIO_PCI_GUEST_FEATURES:
         ret = vdev->features;
@@ -431,7 +435,7 @@ VirtQueue *virtio_add_queue(VirtIODevice *vdev, int queue_size,
 void virtio_notify(VirtIODevice *vdev, VirtQueue *vq)
 {
     /* Always notify when queue is empty */
-    if (vq->vring.avail->idx != vq->last_avail_idx &&
+    if ((vq->inuse || vq->vring.avail->idx != vq->last_avail_idx) &&
         (vq->vring.avail->flags & VRING_AVAIL_F_NO_INTERRUPT))
         return;
 
diff --git a/qemu/hw/virtio.h b/qemu/hw/virtio.h
index 61f5038..1adaed3 100644
--- a/qemu/hw/virtio.h
+++ b/qemu/hw/virtio.h
@@ -30,6 +30,10 @@
 /* We've given up on this device. */
 #define VIRTIO_CONFIG_S_FAILED 0x80
 
+/* We notify when the ring is completely used, even if the guest is supressing
+ * callbacks */
+#define VIRTIO_F_NOTIFY_ON_EMPTY 24
+
 /* from Linux's linux/virtio_ring.h */
 
 /* This marks a buffer as continuing via the next field. */
@@ -86,6 +90,7 @@ struct VirtQueue
     VRing vring;
     uint32_t pfn;
     uint16_t last_avail_idx;
+    int inuse;
     void (*handle_output)(VirtIODevice *vdev, VirtQueue *vq);
 };
-- 
1.5.4.1

^ permalink raw reply related	[flat|nested] 36+ messages in thread
* [PATCH 5/9] kvm: qemu: Disable recv notifications until avail buffers exhausted
  2008-07-24 11:46 ` [PATCH 4/9] kvm: qemu: Add VIRTIO_F_NOTIFY_ON_EMPTY Mark McLoughlin
@ 2008-07-24 11:46   ` Mark McLoughlin
  2008-07-24 11:46     ` [PATCH 6/9] kvm: qemu: Add support for partial csums and GSO Mark McLoughlin
  0 siblings, 1 reply; 36+ messages in thread
From: Mark McLoughlin @ 2008-07-24 11:46 UTC (permalink / raw)
  To: kvm; +Cc: Herbert Xu, Rusty Russell, Mark McLoughlin

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
---
 qemu/hw/virtio-net.c |    5 ++++-
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c
index 4adfa42..419a2d7 100644
--- a/qemu/hw/virtio-net.c
+++ b/qemu/hw/virtio-net.c
@@ -106,9 +106,12 @@ static int virtio_net_can_receive(void *opaque)
         !(n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK))
         return 0;
 
-    if (n->rx_vq->vring.avail->idx == n->rx_vq->last_avail_idx)
+    if (n->rx_vq->vring.avail->idx == n->rx_vq->last_avail_idx) {
+        n->rx_vq->vring.used->flags &= ~VRING_USED_F_NO_NOTIFY;
         return 0;
+    }
 
+    n->rx_vq->vring.used->flags |= VRING_USED_F_NO_NOTIFY;
     return 1;
 }
-- 
1.5.4.1

^ permalink raw reply related	[flat|nested] 36+ messages in thread
* [PATCH 6/9] kvm: qemu: Add support for partial csums and GSO 2008-07-24 11:46 ` [PATCH 5/9] kvm: qemu: Disable recv notifications until avail buffers exhausted Mark McLoughlin @ 2008-07-24 11:46 ` Mark McLoughlin 2008-07-24 11:46 ` [PATCH 7/9] kvm: qemu: Increase size of virtio_net rings Mark McLoughlin 0 siblings, 1 reply; 36+ messages in thread From: Mark McLoughlin @ 2008-07-24 11:46 UTC (permalink / raw) To: kvm; +Cc: Herbert Xu, Rusty Russell, Mark McLoughlin Signed-off-by: Mark McLoughlin <markmc@redhat.com> --- qemu/hw/virtio-net.c | 86 +++++++++++++++++++++++++++++++++++++++++--------- qemu/net.h | 5 +++ qemu/vl.c | 73 +++++++++++++++++++++++++++++++++++++++--- 3 files changed, 144 insertions(+), 20 deletions(-) diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c index 419a2d7..81282c4 100644 --- a/qemu/hw/virtio-net.c +++ b/qemu/hw/virtio-net.c @@ -22,9 +22,18 @@ #define VIRTIO_ID_NET 1 /* The feature bitmap for virtio net */ -#define VIRTIO_NET_F_NO_CSUM 0 -#define VIRTIO_NET_F_MAC 5 -#define VIRTIO_NET_F_GS0 6 +#define VIRTIO_NET_F_CSUM 0 /* Host handles pkts w/ partial csum */ +#define VIRTIO_NET_F_GUEST_CSUM 1 /* Guest handles pkts w/ partial csum */ +#define VIRTIO_NET_F_MAC 5 /* Host has given MAC address. */ +#define VIRTIO_NET_F_GSO 6 /* Host handles pkts w/ any GSO type */ +#define VIRTIO_NET_F_GUEST_TSO4 7 /* Guest can handle TSOv4 in. */ +#define VIRTIO_NET_F_GUEST_TSO6 8 /* Guest can handle TSOv6 in. */ +#define VIRTIO_NET_F_GUEST_ECN 9 /* Guest can handle TSO[6] w/ ECN in. */ +#define VIRTIO_NET_F_GUEST_UFO 10 /* Guest can handle UFO in. */ +#define VIRTIO_NET_F_HOST_TSO4 11 /* Host can handle TSOv4 in. */ +#define VIRTIO_NET_F_HOST_TSO6 12 /* Host can handle TSOv6 in. */ +#define VIRTIO_NET_F_HOST_ECN 13 /* Host can handle TSO[6] w/ ECN in. */ +#define VIRTIO_NET_F_HOST_UFO 14 /* Host can handle UFO in. 
*/ #define TX_TIMER_INTERVAL (150000) /* 150 us */ @@ -42,8 +51,6 @@ struct virtio_net_hdr uint8_t flags; #define VIRTIO_NET_HDR_GSO_NONE 0 // Not a GSO frame #define VIRTIO_NET_HDR_GSO_TCPV4 1 // GSO frame, IPv4 TCP (TSO) -/* FIXME: Do we need this? If they said they can handle ECN, do they care? */ -#define VIRTIO_NET_HDR_GSO_TCPV4_ECN 2 // GSO frame, IPv4 TCP w/ ECN #define VIRTIO_NET_HDR_GSO_UDP 3 // GSO frame, IPv4 UDP (UFO) #define VIRTIO_NET_HDR_GSO_TCPV6 4 // GSO frame, IPv6 TCP #define VIRTIO_NET_HDR_GSO_ECN 0x80 // TCP has ECN set @@ -85,7 +92,38 @@ static void virtio_net_update_config(VirtIODevice *vdev, uint8_t *config) static uint32_t virtio_net_get_features(VirtIODevice *vdev) { - return (1 << VIRTIO_NET_F_MAC); + VirtIONet *n = to_virtio_net(vdev); + VLANClientState *host = n->vc->vlan->first_client; + uint32_t features = (1 << VIRTIO_NET_F_MAC); + + if (tap_has_offload(host)) { + features |= (1 << VIRTIO_NET_F_CSUM); + features |= (1 << VIRTIO_NET_F_GUEST_CSUM); + features |= (1 << VIRTIO_NET_F_GUEST_TSO4); + features |= (1 << VIRTIO_NET_F_GUEST_TSO6); + features |= (1 << VIRTIO_NET_F_GUEST_ECN); + features |= (1 << VIRTIO_NET_F_HOST_TSO4); + features |= (1 << VIRTIO_NET_F_HOST_TSO6); + features |= (1 << VIRTIO_NET_F_HOST_ECN); + /* Kernel can't actually handle UFO in software currently. 
*/ + } + + return features; +} + +static void virtio_net_set_features(VirtIODevice *vdev, uint32_t features) +{ + VirtIONet *n = to_virtio_net(vdev); + VLANClientState *host = n->vc->vlan->first_client; + + if (!tap_has_offload(host) || !host->set_offload) + return; + + host->set_offload(host, + (features >> VIRTIO_NET_F_GUEST_CSUM) & 1, + (features >> VIRTIO_NET_F_GUEST_TSO4) & 1, + (features >> VIRTIO_NET_F_GUEST_TSO6) & 1, + (features >> VIRTIO_NET_F_GUEST_ECN) & 1); } /* RX */ @@ -121,6 +159,7 @@ static void virtio_net_receive(void *opaque, const uint8_t *buf, int size) VirtQueueElement elem; struct virtio_net_hdr *hdr; int offset, i; + int total; if (virtqueue_pop(n->rx_vq, &elem) == 0) return; @@ -134,18 +173,26 @@ static void virtio_net_receive(void *opaque, const uint8_t *buf, int size) hdr->flags = 0; hdr->gso_type = VIRTIO_NET_HDR_GSO_NONE; - /* copy in packet. ugh */ offset = 0; + total = sizeof(*hdr); + + if (tap_has_offload(n->vc->vlan->first_client)) { + memcpy(hdr, buf, sizeof(*hdr)); + offset += total; + } + + /* copy in packet. 
ugh */ i = 1; while (offset < size && i < elem.in_num) { int len = MIN(elem.in_sg[i].iov_len, size - offset); memcpy(elem.in_sg[i].iov_base, buf + offset, len); offset += len; + total += len; i++; } /* signal other side */ - virtqueue_push(n->rx_vq, &elem, sizeof(*hdr) + offset); + virtqueue_push(n->rx_vq, &elem, total); virtio_notify(&n->vdev, n->rx_vq); } @@ -153,23 +200,31 @@ static void virtio_net_receive(void *opaque, const uint8_t *buf, int size) static void virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq) { VirtQueueElement elem; + int has_offload = tap_has_offload(n->vc->vlan->first_client); if (!(n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK)) return; while (virtqueue_pop(vq, &elem)) { ssize_t len = 0; + unsigned int out_num = elem.out_num; + struct iovec *out_sg = &elem.out_sg[0]; + + if (out_num < 1 || out_sg->iov_len != sizeof(struct virtio_net_hdr)) { + fprintf(stderr, "virtio-net header not in first element\n"); + exit(1); + } - if (elem.out_num < 1 || - elem.out_sg[0].iov_len != sizeof(struct virtio_net_hdr)) { - fprintf(stderr, "virtio-net header not in first element\n"); - exit(1); + /* ignore the header if GSO is not supported */ + if (!has_offload) { + out_num--; + out_sg++; + len += sizeof(struct virtio_net_hdr); } - /* ignore the header for now */ - len = qemu_sendv_packet(n->vc, &elem.out_sg[1], elem.out_num - 1); + len += qemu_sendv_packet(n->vc, out_sg, out_num); - virtqueue_push(vq, &elem, sizeof(struct virtio_net_hdr) + len); + virtqueue_push(vq, &elem, len); virtio_notify(&n->vdev, vq); } } @@ -249,6 +304,7 @@ PCIDevice *virtio_net_init(PCIBus *bus, NICInfo *nd, int devfn) n->vdev.update_config = virtio_net_update_config; n->vdev.get_features = virtio_net_get_features; + n->vdev.set_features = virtio_net_set_features; n->rx_vq = virtio_add_queue(&n->vdev, 128, virtio_net_handle_rx); n->tx_vq = virtio_add_queue(&n->vdev, 128, virtio_net_handle_tx); memcpy(n->mac, nd->macaddr, 6); diff --git a/qemu/net.h b/qemu/net.h index e8ee325..6cfd8ce 
100644 --- a/qemu/net.h +++ b/qemu/net.h @@ -9,12 +9,15 @@ typedef ssize_t (IOReadvHandler)(void *, const struct iovec *, int); typedef struct VLANClientState VLANClientState; +typedef void (SetOffload)(VLANClientState *, int, int, int, int); + struct VLANClientState { IOReadHandler *fd_read; IOReadvHandler *fd_readv; /* Packets may still be sent if this returns zero. It's used to rate-limit the slirp code. */ IOCanRWHandler *fd_can_read; + SetOffload *set_offload; void *opaque; struct VLANClientState *next; struct VLANState *vlan; @@ -42,6 +45,8 @@ void qemu_handler_true(void *opaque); void do_info_network(void); +int tap_has_offload(void *opaque); + int net_client_init(const char *str); void net_client_uninit(NICInfo *nd); diff --git a/qemu/vl.c b/qemu/vl.c index b7d3397..efdaafd 100644 --- a/qemu/vl.c +++ b/qemu/vl.c @@ -4186,12 +4186,24 @@ void do_info_slirp(void) #if !defined(_WIN32) +#ifndef IFF_VNET_HDR +#define TAP_BUFSIZE 4096 +#else +#include <linux/virtio_net.h> +#define ETH_HLEN 14 +#define ETH_DATA_LEN 1500 +#define MAX_PACKET_LEN (ETH_HLEN + ETH_DATA_LEN) +#define MAX_SKB_FRAGS ((65536/TARGET_PAGE_SIZE) + 2) +#define TAP_BUFSIZE (sizeof(struct virtio_net_hdr) + MAX_PACKET_LEN + (MAX_SKB_FRAGS*TARGET_PAGE_SIZE)) +#endif + typedef struct TAPState { VLANClientState *vc; int fd; char down_script[1024]; - char buf[4096]; + char buf[TAP_BUFSIZE]; int size; + int offload; } TAPState; static void tap_receive(void *opaque, const uint8_t *buf, int size) @@ -4286,6 +4298,37 @@ static void tap_send(void *opaque) } while (s->size > 0); } +int tap_has_offload(void *opaque) +{ + VLANClientState *vc = opaque; + TAPState *ts = vc->opaque; + + return ts ? 
ts->offload : 0; +} + +#ifdef TUNSETOFFLOAD +static void tap_set_offload(VLANClientState *vc, int csum, int tso4, int tso6, + int ecn) +{ + TAPState *s = vc->opaque; + unsigned int offload = 0; + + if (csum) { + offload |= TUN_F_CSUM; + if (tso4) + offload |= TUN_F_TSO4; + if (tso6) + offload |= TUN_F_TSO6; + if ((tso4 || tso6) && ecn) + offload |= TUN_F_TSO_ECN; + } + + if (ioctl(s->fd, TUNSETOFFLOAD, offload) != 0) + fprintf(stderr, "TUNSETOFFLOAD ioctl() failed: %s\n", + strerror(errno)); +} +#endif /* TUNSETOFFLOAD */ + /* fd support */ static TAPState *net_tap_fd_init(VLANState *vlan, int fd) @@ -4298,13 +4341,16 @@ static TAPState *net_tap_fd_init(VLANState *vlan, int fd) s->fd = fd; s->vc = qemu_new_vlan_client(vlan, tap_receive, NULL, s); s->vc->fd_readv = tap_readv; +#ifdef TUNSETOFFLOAD + s->vc->set_offload = tap_set_offload; +#endif qemu_set_fd_handler2(s->fd, tap_can_send, tap_send, NULL, s); snprintf(s->vc->info_str, sizeof(s->vc->info_str), "tap: fd=%d", fd); return s; } #if defined (_BSD) || defined (__FreeBSD_kernel__) -static int tap_open(char *ifname, int ifname_size) +static int tap_open(char *ifname, int ifname_size, int *offload) { int fd; char *dev; @@ -4446,7 +4492,7 @@ int tap_alloc(char *dev) return tap_fd; } -static int tap_open(char *ifname, int ifname_size) +static int tap_open(char *ifname, int ifname_size, int *offload) { char dev[10]=""; int fd; @@ -4459,18 +4505,31 @@ static int tap_open(char *ifname, int ifname_size) return fd; } #else -static int tap_open(char *ifname, int ifname_size) +static int tap_open(char *ifname, int ifname_size, int *offload) { struct ifreq ifr; int fd, ret; + unsigned int features; TFR(fd = open("/dev/net/tun", O_RDWR)); if (fd < 0) { fprintf(stderr, "warning: could not open /dev/net/tun: no virtual network emulation\n"); return -1; } + + if (ioctl(fd, TUNGETFEATURES, &features)) + features = IFF_TUN | IFF_TAP | IFF_NO_PI | IFF_ONE_QUEUE; + memset(&ifr, 0, sizeof(ifr)); ifr.ifr_flags = IFF_TAP | IFF_NO_PI; 
+ +#ifdef IFF_VNET_HDR + if (features & IFF_VNET_HDR) { + *offload = 1; + ifr.ifr_flags |= IFF_VNET_HDR; + } +#endif + if (ifname[0] != '\0') pstrcpy(ifr.ifr_name, IFNAMSIZ, ifname); else @@ -4528,13 +4587,15 @@ static int net_tap_init(VLANState *vlan, const char *ifname1, { TAPState *s; int fd; + int offload; char ifname[128]; if (ifname1 != NULL) pstrcpy(ifname, sizeof(ifname), ifname1); else ifname[0] = '\0'; - TFR(fd = tap_open(ifname, sizeof(ifname))); + offload = 0; + TFR(fd = tap_open(ifname, sizeof(ifname), &offload)); if (fd < 0) return -1; @@ -4547,6 +4608,8 @@ static int net_tap_init(VLANState *vlan, const char *ifname1, s = net_tap_fd_init(vlan, fd); if (!s) return -1; + + s->offload = offload; snprintf(s->vc->info_str, sizeof(s->vc->info_str), "tap: ifname=%s setup_script=%s", ifname, setup_script); if (down_script && strcmp(down_script, "no")) -- 1.5.4.1 ^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 7/9] kvm: qemu: Increase size of virtio_net rings
  2008-07-24 11:46 ` [PATCH 6/9] kvm: qemu: Add support for partial csums and GSO Mark McLoughlin
@ 2008-07-24 11:46   ` Mark McLoughlin
  2008-07-24 11:46     ` [PATCH 8/9] kvm: qemu: Drop the mutex while reading from tapfd Mark McLoughlin
  0 siblings, 1 reply; 36+ messages in thread
From: Mark McLoughlin @ 2008-07-24 11:46 UTC (permalink / raw)
  To: kvm; +Cc: Herbert Xu, Rusty Russell, Mark McLoughlin

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
---
 qemu/hw/virtio-net.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c
index 81282c4..a681a7e 100644
--- a/qemu/hw/virtio-net.c
+++ b/qemu/hw/virtio-net.c
@@ -305,8 +305,8 @@ PCIDevice *virtio_net_init(PCIBus *bus, NICInfo *nd, int devfn)
     n->vdev.update_config = virtio_net_update_config;
     n->vdev.get_features = virtio_net_get_features;
     n->vdev.set_features = virtio_net_set_features;
-    n->rx_vq = virtio_add_queue(&n->vdev, 128, virtio_net_handle_rx);
-    n->tx_vq = virtio_add_queue(&n->vdev, 128, virtio_net_handle_tx);
+    n->rx_vq = virtio_add_queue(&n->vdev, 512, virtio_net_handle_rx);
+    n->tx_vq = virtio_add_queue(&n->vdev, 512, virtio_net_handle_tx);
     memcpy(n->mac, nd->macaddr, 6);
     n->vc = qemu_new_vlan_client(nd->vlan, virtio_net_receive,
                                  virtio_net_can_receive, n);
-- 
1.5.4.1

^ permalink raw reply related	[flat|nested] 36+ messages in thread
* [PATCH 8/9] kvm: qemu: Drop the mutex while reading from tapfd
  2008-07-24 11:46 ` [PATCH 7/9] kvm: qemu: Increase size of virtio_net rings Mark McLoughlin
@ 2008-07-24 11:46   ` Mark McLoughlin
  2008-07-24 11:46     ` [PATCH 9/9] kvm: qemu: Eliminate extra virtio_net copy Mark McLoughlin
  2008-07-24 23:33     ` [PATCH 8/9] kvm: qemu: Drop the mutex while reading from tapfd Dor Laor
  0 siblings, 2 replies; 36+ messages in thread
From: Mark McLoughlin @ 2008-07-24 11:46 UTC (permalink / raw)
  To: kvm; +Cc: Herbert Xu, Rusty Russell, Mark McLoughlin

The idea here is that with GSO, packets are much larger
and we can allow the vcpu threads to e.g. process irq
acks during the window where we're reading these
packets from the tapfd.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
---
 qemu/vl.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/qemu/vl.c b/qemu/vl.c
index efdaafd..de92848 100644
--- a/qemu/vl.c
+++ b/qemu/vl.c
@@ -4281,7 +4281,9 @@ static void tap_send(void *opaque)
         sbuf.buf = s->buf;
         s->size = getmsg(s->fd, NULL, &sbuf, &f) >=0 ? sbuf.len : -1;
 #else
+        kvm_sleep_begin();
         s->size = read(s->fd, s->buf, sizeof(s->buf));
+        kvm_sleep_end();
 #endif
 
         if (s->size == -1 && errno == EINTR)
-- 
1.5.4.1

^ permalink raw reply related	[flat|nested] 36+ messages in thread
* [PATCH 9/9] kvm: qemu: Eliminate extra virtio_net copy 2008-07-24 11:46 ` [PATCH 8/9] kvm: qemu: Drop the mutex while reading from tapfd Mark McLoughlin @ 2008-07-24 11:46 ` Mark McLoughlin 2008-07-24 23:33 ` [PATCH 8/9] kvm: qemu: Drop the mutex while reading from tapfd Dor Laor 1 sibling, 0 replies; 36+ messages in thread From: Mark McLoughlin @ 2008-07-24 11:46 UTC (permalink / raw) To: kvm; +Cc: Herbert Xu, Rusty Russell, Mark McLoughlin This is Anthony's net-tap-zero-copy.patch which eliminates a copy on the host->guest data path with virtio_net. --- qemu/hw/virtio-net.c | 76 ++++++++++++++++++++++++++++++++++++------------- qemu/net.h | 3 ++ qemu/vl.c | 50 +++++++++++++++++++++++++++++++++ 3 files changed, 109 insertions(+), 20 deletions(-) diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c index a681a7e..5e71afe 100644 --- a/qemu/hw/virtio-net.c +++ b/qemu/hw/virtio-net.c @@ -70,6 +70,8 @@ typedef struct VirtIONet VLANClientState *vc; QEMUTimer *tx_timer; int tx_timer_active; + int last_elem_valid; + VirtQueueElement last_elem; } VirtIONet; /* TODO @@ -153,47 +155,80 @@ static int virtio_net_can_receive(void *opaque) return 1; } -static void virtio_net_receive(void *opaque, const uint8_t *buf, int size) +static void virtio_net_receive_zc(void *opaque, IOZeroCopyHandler *zc, void *data) { VirtIONet *n = opaque; - VirtQueueElement elem; + VirtQueueElement *elem = &n->last_elem; struct virtio_net_hdr *hdr; - int offset, i; - int total; + ssize_t err; + int idx; - if (virtqueue_pop(n->rx_vq, &elem) == 0) + if (!n->last_elem_valid && virtqueue_pop(n->rx_vq, elem) == 0) return; - if (elem.in_num < 1 || elem.in_sg[0].iov_len != sizeof(*hdr)) { + if (elem->in_num < 1 || elem->in_sg[0].iov_len != sizeof(*hdr)) { fprintf(stderr, "virtio-net header not in first element\n"); exit(1); } - hdr = (void *)elem.in_sg[0].iov_base; + n->last_elem_valid = 1; + + hdr = (void *)elem->in_sg[0].iov_base; hdr->flags = 0; hdr->gso_type = VIRTIO_NET_HDR_GSO_NONE; - offset = 0; 
- total = sizeof(*hdr); + idx = tap_has_offload(n->vc->vlan->first_client) ? 0 : 1; + + do { + err = zc(data, &elem->in_sg[idx], elem->in_num - idx); + } while (err == -1 && errno == EINTR); + + if (err == -1 && errno == EAGAIN) + return; - if (tap_has_offload(n->vc->vlan->first_client)) { - memcpy(hdr, buf, sizeof(*hdr)); - offset += total; + if (err < 0) { + fprintf(stderr, "virtio_net: error during IO\n"); + return; } + /* signal other side */ + n->last_elem_valid = 0; + virtqueue_push(n->rx_vq, elem, sizeof(*hdr) + err); + virtio_notify(&n->vdev, n->rx_vq); +} + +struct compat_data +{ + const uint8_t *buf; + int size; +}; + +static ssize_t compat_copy(void *opaque, struct iovec *iov, int iovcnt) +{ + struct compat_data *compat = opaque; + int offset, i; + /* copy in packet. ugh */ - i = 1; - while (offset < size && i < elem.in_num) { - int len = MIN(elem.in_sg[i].iov_len, size - offset); - memcpy(elem.in_sg[i].iov_base, buf + offset, len); + offset = 0; + i = 0; + while (offset < compat->size && i < iovcnt) { + int len = MIN(iov[i].iov_len, compat->size - offset); + memcpy(iov[i].iov_base, compat->buf + offset, len); offset += len; - total += len; i++; } - /* signal other side */ - virtqueue_push(n->rx_vq, &elem, total); - virtio_notify(&n->vdev, n->rx_vq); + return offset; +} + +static void virtio_net_receive(void *opaque, const uint8_t *buf, int size) +{ + struct compat_data compat; + + compat.buf = buf; + compat.size = size; + + virtio_net_receive_zc(opaque, compat_copy, &compat); } /* TX */ @@ -310,6 +345,7 @@ PCIDevice *virtio_net_init(PCIBus *bus, NICInfo *nd, int devfn) memcpy(n->mac, nd->macaddr, 6); n->vc = qemu_new_vlan_client(nd->vlan, virtio_net_receive, virtio_net_can_receive, n); + n->vc->fd_read_zc = virtio_net_receive_zc; n->tx_timer = qemu_new_timer(vm_clock, virtio_net_tx_timer, n); n->tx_timer_active = 0; diff --git a/qemu/net.h b/qemu/net.h index 6cfd8ce..aca50e9 100644 --- a/qemu/net.h +++ b/qemu/net.h @@ -6,6 +6,8 @@ /* VLANs support */ 
 typedef ssize_t (IOReadvHandler)(void *, const struct iovec *, int);
+typedef ssize_t (IOZeroCopyHandler)(void *, struct iovec *, int);
+typedef void (IOReadZCHandler)(void *, IOZeroCopyHandler *, void *);

 typedef struct VLANClientState VLANClientState;
@@ -14,6 +16,7 @@ typedef void (SetOffload)(VLANClientState *, int, int, int, int);
 struct VLANClientState {
     IOReadHandler *fd_read;
     IOReadvHandler *fd_readv;
+    IOReadZCHandler *fd_read_zc;
     /* Packets may still be sent if this returns zero.  It's used to
        rate-limit the slirp code.  */
     IOCanRWHandler *fd_can_read;
diff --git a/qemu/vl.c b/qemu/vl.c
index de92848..bc5b151 100644
--- a/qemu/vl.c
+++ b/qemu/vl.c
@@ -4204,6 +4204,7 @@ typedef struct TAPState {
     char buf[TAP_BUFSIZE];
     int size;
     int offload;
+    int received_eagain;
 } TAPState;

 static void tap_receive(void *opaque, const uint8_t *buf, int size)
@@ -4232,6 +4233,48 @@ static ssize_t tap_readv(void *opaque, const struct iovec *iov,
     return len;
 }

+static VLANClientState *tap_can_zero_copy(TAPState *s)
+{
+    VLANClientState *vc, *vc1 = NULL;
+    int vc_count = 0;
+
+    for (vc = s->vc->vlan->first_client; vc; vc = vc->next) {
+        if (vc == s->vc)
+            continue;
+
+        if (!vc->fd_read_zc || vc_count)
+            return NULL;
+
+        vc_count++;
+        vc1 = vc;
+    }
+
+    return vc1;
+}
+
+static ssize_t tap_sendv(void *opaque, struct iovec *iov, int iovcnt)
+{
+    TAPState *s = opaque;
+    ssize_t ret;
+
+    kvm_sleep_begin();
+    ret = readv(s->fd, iov, iovcnt);
+    kvm_sleep_end();
+    if (ret == -1 && errno == EAGAIN)
+        s->received_eagain = 1;
+
+    return ret;
+}
+
+static void tap_send_zero_copy(TAPState *s, VLANClientState *vc)
+{
+    s->received_eagain = 0;
+    while (s->received_eagain == 0 &&
+           (!vc->fd_can_read || vc->fd_can_read(vc->opaque))) {
+        vc->fd_read_zc(vc->opaque, tap_sendv, s);
+    }
+}
+
 static int tap_can_send(void *opaque)
 {
     TAPState *s = opaque;
@@ -4261,6 +4304,13 @@ static int tap_can_send(void *opaque)
 static void tap_send(void *opaque)
 {
     TAPState *s = opaque;
+    VLANClientState *zc;
+
+    zc = tap_can_zero_copy(s);
+    if (zc) {
+        tap_send_zero_copy(s, zc);
+        return;
+    }

     /* First try to send any buffered packet */
     if (s->size > 0) {
--
1.5.4.1

^ permalink raw reply related	[flat|nested] 36+ messages in thread
* Re: [PATCH 8/9] kvm: qemu: Drop the mutex while reading from tapfd
  2008-07-24 11:46 ` [PATCH 8/9] kvm: qemu: Drop the mutex while reading from tapfd Mark McLoughlin
  2008-07-24 11:46   ` [PATCH 9/9] kvm: qemu: Eliminate extra virtio_net copy Mark McLoughlin
@ 2008-07-24 23:33   ` Dor Laor
  2008-07-25 17:25     ` Mark McLoughlin
  1 sibling, 1 reply; 36+ messages in thread
From: Dor Laor @ 2008-07-24 23:33 UTC (permalink / raw)
To: Mark McLoughlin; +Cc: kvm, Herbert Xu, Rusty Russell

Mark McLoughlin wrote:
> The idea here is that with GSO, packets are much larger
> and we can allow the vcpu threads to e.g. process irq
> acks during the window where we're reading these
> packets from the tapfd.
>
> Signed-off-by: Mark McLoughlin <markmc@redhat.com>
> ---
>  qemu/vl.c |    2 ++
>  1 files changed, 2 insertions(+), 0 deletions(-)
>
> diff --git a/qemu/vl.c b/qemu/vl.c
> index efdaafd..de92848 100644
> --- a/qemu/vl.c
> +++ b/qemu/vl.c
> @@ -4281,7 +4281,9 @@ static void tap_send(void *opaque)
>          sbuf.buf = s->buf;
>          s->size = getmsg(s->fd, NULL, &sbuf, &f) >= 0 ? sbuf.len : -1;
>  #else

Maybe do it only when GSO is actually used by the guest/tap. Otherwise
it can cause some context-switch thrashing, right?

> +    kvm_sleep_begin();
>      s->size = read(s->fd, s->buf, sizeof(s->buf));
> +    kvm_sleep_end();
>  #endif
>
>      if (s->size == -1 && errno == EINTR)

^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [PATCH 8/9] kvm: qemu: Drop the mutex while reading from tapfd 2008-07-24 23:33 ` [PATCH 8/9] kvm: qemu: Drop the mutex while reading from tapfd Dor Laor @ 2008-07-25 17:25 ` Mark McLoughlin 0 siblings, 0 replies; 36+ messages in thread From: Mark McLoughlin @ 2008-07-25 17:25 UTC (permalink / raw) To: Dor Laor; +Cc: kvm, Herbert Xu, Rusty Russell On Fri, 2008-07-25 at 02:33 +0300, Dor Laor wrote: > Mark McLoughlin wrote: > > The idea here is that with GSO, packets are much larger > > and we can allow the vcpu threads to e.g. process irq > > acks during the window where we're reading these > > packets from the tapfd. > > ... > > diff --git a/qemu/vl.c b/qemu/vl.c > > index efdaafd..de92848 100644 > > --- a/qemu/vl.c > > +++ b/qemu/vl.c > > @@ -4281,7 +4281,9 @@ static void tap_send(void *opaque) > > sbuf.buf = s->buf; > > s->size = getmsg(s->fd, NULL, &sbuf, &f) >=0 ? sbuf.len : -1; > > #else > > > Maybe do it only when GSO is actually used by the guest/tap. > Otherwise it can cause some ctx trashing right? (Strange habit you have of "top commenting" on patches :-) I've been meaning to do some profiling of just how long the vcpu thread spends blocking on the mutex in this case, both before and after the GSO patches. i.e. this change may make sense even with smaller buffers ... > > + kvm_sleep_begin(); > > s->size = read(s->fd, s->buf, sizeof(s->buf)); > > + kvm_sleep_end(); Cheers, Mark. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 3/9] kvm: qemu: Remove virtio_net tx ring-full heuristic
  2008-07-24 11:46 ` [PATCH 3/9] kvm: qemu: Remove virtio_net tx ring-full heuristic Mark McLoughlin
  2008-07-24 11:46   ` [PATCH 4/9] kvm: qemu: Add VIRTIO_F_NOTIFY_ON_EMPTY Mark McLoughlin
@ 2008-07-24 23:22   ` Dor Laor
  2008-07-25  0:30     ` Rusty Russell
  2008-07-25 17:23     ` Mark McLoughlin
  2008-07-24 23:56   ` Dor Laor
  2 siblings, 2 replies; 36+ messages in thread
From: Dor Laor @ 2008-07-24 23:22 UTC (permalink / raw)
To: Mark McLoughlin; +Cc: kvm, Herbert Xu, Rusty Russell

Mark McLoughlin wrote:
> virtio_net tries to guess when it has received a tx
> notification from the guest whether it indicates that the
> guest has no more room in the tx ring and it should
> immediately flush the queued buffers.
>
> The heuristic is based on the fact that there are 128
> buffer entries in the ring and each packet uses 2 buffers
> (i.e. the virtio_net_hdr and the packet's linear data).
>
> Using GSO or increasing the size of the rings will break
> that heuristic, so let's remove it and assume that any
> notification from the guest after we've disabled
> notifications indicates that we should flush our buffers.
>
> Signed-off-by: Mark McLoughlin <markmc@redhat.com>
> ---
>  qemu/hw/virtio-net.c |    3 +--
>  1 files changed, 1 insertions(+), 2 deletions(-)
>
> diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c
> index 31867f1..4adfa42 100644
> --- a/qemu/hw/virtio-net.c
> +++ b/qemu/hw/virtio-net.c
> @@ -175,8 +175,7 @@ static void virtio_net_handle_tx(VirtIODevice *vdev, VirtQueue *vq)
>  {
>      VirtIONet *n = to_virtio_net(vdev);
>
> -    if (n->tx_timer_active &&
> -        (vq->vring.avail->idx - vq->last_avail_idx) == 64) {
> +    if (n->tx_timer_active) {

Actually it was worthless anyway: once we set the timer, the flag below
would have been cleared, so the guest would not notify the host and the
== 64 check was never reached. So I'm in favour too.

>          vq->vring.used->flags &= ~VRING_USED_F_NO_NOTIFY;
>          qemu_del_timer(n->tx_timer);
>          n->tx_timer_active = 0;

As stated in newer messages, we should handle the first tx notification
immediately if the timer wasn't active, to shorten latency.

Cheers,
Dor

^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [PATCH 3/9] kvm: qemu: Remove virtio_net tx ring-full heuristic
  2008-07-24 23:22 ` [PATCH 3/9] kvm: qemu: Remove virtio_net tx ring-full heuristic Dor Laor
@ 2008-07-25  0:30   ` Rusty Russell
  2008-07-25 17:30     ` Mark McLoughlin
  0 siblings, 1 reply; 36+ messages in thread
From: Rusty Russell @ 2008-07-25  0:30 UTC (permalink / raw)
To: Dor Laor; +Cc: Mark McLoughlin, kvm, Herbert Xu

On Friday 25 July 2008 09:22:53 Dor Laor wrote:
> Mark McLoughlin wrote:
> >          vq->vring.used->flags &= ~VRING_USED_F_NO_NOTIFY;
> >          qemu_del_timer(n->tx_timer);
> >          n->tx_timer_active = 0;
>
> As stated by newer messages, we should handle the first tx notification
> if the timer wasn't active to shorten latency.
> Cheers, Dor

Here's what lguest does at the moment.  Basically, we cut the timeout a tiny
bit each time, until we get *fewer* packets than last time.  Then we bump it
up again.

Rough, but seems to work (it should be a per-device var of course, not a
static).

@@ -921,6 +922,7 @@ static void handle_net_output(int fd, st
 	unsigned int head, out, in, num = 0;
 	int len;
 	struct iovec iov[vq->vring.num];
+	static int last_timeout_num;

 	if (!timeout)
 		net_xmit_notify++;
@@ -941,6 +943,14 @@ static void handle_net_output(int fd, st
 	/* Block further kicks and set up a timer if we saw anything. */
 	if (!timeout && num)
 		block_vq(vq);
+
+	if (timeout) {
+		if (num < last_timeout_num)
+			timeout_usec += 10;
+		else if (timeout_usec > 1)
+			timeout_usec--;
+		last_timeout_num = num;
+	}
 }

^ permalink raw reply	[flat|nested] 36+ messages in thread
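Rusty's lguest snippet boils down to a small feedback rule. The same idea can be sketched as a standalone helper with per-device state, as he suggests, rather than a static; the struct and function names here are illustrative, not from the lguest or kvm sources:

```c
#include <assert.h>

/* Hypothetical per-device mitigation state. */
struct tx_mitigation {
    int timeout_usec;      /* current tx timer length */
    int last_timeout_num;  /* packets seen when the timer last fired */
};

/* Apply the rule after a timer-driven flush that moved 'num' packets:
 * if we flushed fewer packets than last time, the timeout was cut too
 * far, so bump it back up by a large step; otherwise trim it slowly,
 * with a floor of 1us. */
static void tx_mitigation_adjust(struct tx_mitigation *m, int num)
{
    if (num < m->last_timeout_num)
        m->timeout_usec += 10;
    else if (m->timeout_usec > 1)
        m->timeout_usec--;
    m->last_timeout_num = num;
}
```

Keeping the state per device avoids the cross-interference a file-scope static would cause once a VM has more than one virtio NIC.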
* Re: [PATCH 3/9] kvm: qemu: Remove virtio_net tx ring-full heuristic
  2008-07-25  0:30 ` Rusty Russell
@ 2008-07-25 17:30   ` Mark McLoughlin
  0 siblings, 0 replies; 36+ messages in thread
From: Mark McLoughlin @ 2008-07-25 17:30 UTC (permalink / raw)
To: Rusty Russell; +Cc: Dor Laor, kvm, Herbert Xu

On Fri, 2008-07-25 at 10:30 +1000, Rusty Russell wrote:
> On Friday 25 July 2008 09:22:53 Dor Laor wrote:
> > Mark McLoughlin wrote:
> > >          vq->vring.used->flags &= ~VRING_USED_F_NO_NOTIFY;
> > >          qemu_del_timer(n->tx_timer);
> > >          n->tx_timer_active = 0;
> >
> > As stated by newer messages, we should handle the first tx notification
> > if the timer wasn't active to shorten latency.
> > Cheers, Dor
>
> Here's what lguest does at the moment.  Basically, we cut the timeout a tiny
> bit each time, until we get *fewer* packets than last time.  Then we bump it
> up again.
>
> Rough, but seems to work (it should be a per-device var of course, not a
> static).
>
> @@ -921,6 +922,7 @@ static void handle_net_output(int fd, st
> 	unsigned int head, out, in, num = 0;
> 	int len;
> 	struct iovec iov[vq->vring.num];
> +	static int last_timeout_num;
>
> 	if (!timeout)
> 		net_xmit_notify++;
> @@ -941,6 +943,14 @@ static void handle_net_output(int fd, st
> 	/* Block further kicks and set up a timer if we saw anything. */
> 	if (!timeout && num)
> 		block_vq(vq);
> +
> +	if (timeout) {
> +		if (num < last_timeout_num)
> +			timeout_usec += 10;
> +		else if (timeout_usec > 1)
> +			timeout_usec--;
> +		last_timeout_num = num;
> +	}

Yeah, I gave this a try in kvm and in the host->guest case the timeout
just grew and grew. In the guest->host case, it did stabilise at around
50us with high throughput.

Basically, I think in the host->guest case the number of buffers was
very variable, so the timeout would get bumped by 10, reduced by a small
amount, bumped by 10, reduced by a small amount, ...

But I agree the general principle seems about right; it just needs some
tweaking.

Cheers,
Mark.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 3/9] kvm: qemu: Remove virtio_net tx ring-full heuristic 2008-07-24 23:22 ` [PATCH 3/9] kvm: qemu: Remove virtio_net tx ring-full heuristic Dor Laor 2008-07-25 0:30 ` Rusty Russell @ 2008-07-25 17:23 ` Mark McLoughlin 1 sibling, 0 replies; 36+ messages in thread From: Mark McLoughlin @ 2008-07-25 17:23 UTC (permalink / raw) To: Dor Laor; +Cc: kvm, Herbert Xu, Rusty Russell On Fri, 2008-07-25 at 02:22 +0300, Dor Laor wrote: > Mark McLoughlin wrote: > > virtio_net tries to guess when it has received a tx > > notification from the guest whether it indicates that the > > guest has no more room in the tx ring and it should > > immediately flush the queued buffers. ... > > - if (n->tx_timer_active && > > - (vq->vring.avail->idx - vq->last_avail_idx) == 64) { > > + if (n->tx_timer_active) { > > > Actually it was worthless anyway since if we set the timer, the flag > below would have been cleared, > causing the guest not to notify the host, thus the ==64 never called. No, that's the point I'm trying to make in the commit message - the guest ignores the NO_NOTIFY flag when the tx queue fills up and sends a notification anyway. The "== 64" test seems to be an attempt to catch this ring-full condition. Well, actually the test was worthless for a different reason, I guess. If tx_timer_active is set and we get a notification, then given the current guest code the number of buffers available should *always* be 64. Cheers, Mark. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 3/9] kvm: qemu: Remove virtio_net tx ring-full heuristic 2008-07-24 11:46 ` [PATCH 3/9] kvm: qemu: Remove virtio_net tx ring-full heuristic Mark McLoughlin 2008-07-24 11:46 ` [PATCH 4/9] kvm: qemu: Add VIRTIO_F_NOTIFY_ON_EMPTY Mark McLoughlin 2008-07-24 23:22 ` [PATCH 3/9] kvm: qemu: Remove virtio_net tx ring-full heuristic Dor Laor @ 2008-07-24 23:56 ` Dor Laor 2 siblings, 0 replies; 36+ messages in thread From: Dor Laor @ 2008-07-24 23:56 UTC (permalink / raw) To: Mark McLoughlin; +Cc: kvm, Herbert Xu, Rusty Russell Mark McLoughlin wrote: > virtio_net tries to guess when it has received a tx > notification from the guest whether it indicates that the > guest has no more room in the tx ring and it should > immediately flush the queued buffers. > > The heuristic is based on the fact that there are 128 > buffer entries in the ring and each packet uses 2 buffers > (i.e. the virtio_net_hdr and the packet's linear data). > > Using GSO or increasing the size of the rings will break > that heuristic, so let's remove it and assume that any > notification from the guest after we've disabled > notifications indicates that we should flush our buffers. > > Signed-off-by: Mark McLoughlin <markmc@redhat.com> > --- > qemu/hw/virtio-net.c | 3 +-- > 1 files changed, 1 insertions(+), 2 deletions(-) > > diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c > index 31867f1..4adfa42 100644 > --- a/qemu/hw/virtio-net.c > +++ b/qemu/hw/virtio-net.c > @@ -175,8 +175,7 @@ static void virtio_net_handle_tx(VirtIODevice *vdev, VirtQueue *vq) > { > VirtIONet *n = to_virtio_net(vdev); > > - if (n->tx_timer_active && > - (vq->vring.avail->idx - vq->last_avail_idx) == 64) { > + if (n->tx_timer_active) { > vq->vring.used->flags &= ~VRING_USED_F_NO_NOTIFY; > qemu_del_timer(n->tx_timer); > n->tx_timer_active = 0; > Actually we can improve latency a bit more by using this timer only for high throughput scenario. 
For example, if during the previous timer period no/few packets were accumulated, we can set the flag off and not issue new timer. This way we'll get notified immediately without timer latency. When lots of packets will be transmitted, we'll go back to this batch mode again. Cheers, Dor ^ permalink raw reply [flat|nested] 36+ messages in thread
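Dor's idea of falling back to notify-per-packet whenever a timer window moves too little traffic could be sketched like this; the threshold and all names are illustrative, not from the qemu sources:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative threshold: below this many packets per timer window,
 * batching buys little throughput and only adds latency. */
#define TX_BATCH_THRESHOLD 4

struct tx_state {
    int last_window_pkts;  /* packets flushed in the previous window */
};

/* Decide whether to re-arm the mitigation timer after a flush.
 * Returning false means: leave guest notifications enabled and handle
 * every tx kick immediately (low-latency mode).  Once a window moves
 * enough packets again, switch back to batch mode. */
static bool tx_should_rearm_timer(struct tx_state *s, int flushed)
{
    s->last_window_pkts = flushed;
    return flushed >= TX_BATCH_THRESHOLD;
}
```

The hysteresis lives entirely in `last_window_pkts`: one quiet window is enough to drop back to immediate notification, and one busy window re-enables batching.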
* Re: [PATCH 2/9] kvm: qemu: Fix virtio_net tx timer 2008-07-24 11:46 ` [PATCH 2/9] kvm: qemu: Fix virtio_net tx timer Mark McLoughlin 2008-07-24 11:46 ` [PATCH 3/9] kvm: qemu: Remove virtio_net tx ring-full heuristic Mark McLoughlin @ 2008-07-26 9:48 ` Avi Kivity 2008-07-26 12:08 ` Mark McLoughlin 1 sibling, 1 reply; 36+ messages in thread From: Avi Kivity @ 2008-07-26 9:48 UTC (permalink / raw) To: Mark McLoughlin; +Cc: kvm, Herbert Xu, Rusty Russell Mark McLoughlin wrote: > The current virtio_net tx timer is 2ns, which doesn't > make any sense. Set it to a more reasonable 150us > instead. > > Signed-off-by: Mark McLoughlin <markmc@redhat.com> > --- > qemu/hw/virtio-net.c | 2 +- > 1 files changed, 1 insertions(+), 1 deletions(-) > > diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c > index 2e57e5a..31867f1 100644 > --- a/qemu/hw/virtio-net.c > +++ b/qemu/hw/virtio-net.c > @@ -26,7 +26,7 @@ > #define VIRTIO_NET_F_MAC 5 > #define VIRTIO_NET_F_GS0 6 > > -#define TX_TIMER_INTERVAL (1000 / 500) > +#define TX_TIMER_INTERVAL (150000) /* 150 us */ > > Ouch. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 2/9] kvm: qemu: Fix virtio_net tx timer 2008-07-26 9:48 ` [PATCH 2/9] kvm: qemu: Fix virtio_net tx timer Avi Kivity @ 2008-07-26 12:08 ` Mark McLoughlin 0 siblings, 0 replies; 36+ messages in thread From: Mark McLoughlin @ 2008-07-26 12:08 UTC (permalink / raw) To: Avi Kivity; +Cc: kvm, Herbert Xu, Rusty Russell On Sat, 2008-07-26 at 12:48 +0300, Avi Kivity wrote: > Mark McLoughlin wrote: > > The current virtio_net tx timer is 2ns, which doesn't > > make any sense. Set it to a more reasonable 150us > > instead. > > > > Signed-off-by: Mark McLoughlin <markmc@redhat.com> > > --- > > qemu/hw/virtio-net.c | 2 +- > > 1 files changed, 1 insertions(+), 1 deletions(-) > > > > diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c > > index 2e57e5a..31867f1 100644 > > --- a/qemu/hw/virtio-net.c > > +++ b/qemu/hw/virtio-net.c > > @@ -26,7 +26,7 @@ > > #define VIRTIO_NET_F_MAC 5 > > #define VIRTIO_NET_F_GS0 6 > > > > -#define TX_TIMER_INTERVAL (1000 / 500) > > +#define TX_TIMER_INTERVAL (150000) /* 150 us */ > > > > Ouch. Well, not so much - and I should have explained why. Even though virtio-net is requesting a 2ns tx timer, it actually gets limited to MIN_TIMER_REARM_US which is currently 250us. However, even though the timer itself will only fire after 250us, expire_time is only set to +2ns, so we'll get the timeout callback next time qemu_run_timers() is called from the mainloop. I think this might account for a lot of the jitter in the throughput numbers - the effective tx window size is anywhere between 2ns and 250us depending on e.g. whether there is rx data available on the tap fd. Cheers, Mark. ^ permalink raw reply [flat|nested] 36+ messages in thread
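A rough model of the interaction Mark describes makes the jitter concrete. `effective_tx_window_us` is a hypothetical helper, not a qemu function; it only captures the two paths by which the timeout callback can run, using the pre-patch 250us clamp:

```c
#include <assert.h>

/* Pre-patch clamp: the host timer cannot be re-armed for less than this. */
#define MIN_TIMER_REARM_US 250

/* The tx flush happens at whichever comes first: the clamped host timer,
 * or the first mainloop pass (qemu_run_timers) after expire_time has
 * passed.  All values in microseconds. */
static long effective_tx_window_us(long interval_us, long next_mainloop_us)
{
    long timer_fire = interval_us < MIN_TIMER_REARM_US ? MIN_TIMER_REARM_US
                                                       : interval_us;
    /* The mainloop can only notice expiry once interval_us has passed. */
    long mainloop_fire = next_mainloop_us > interval_us ? next_mainloop_us
                                                        : interval_us;
    return mainloop_fire < timer_fire ? mainloop_fire : timer_fire;
}
```

With the 2ns interval (effectively 0 here), the window is simply whenever the next mainloop iteration happens to land, anywhere between ~0 and 250us, which matches the jitter Mark describes; with a 150us interval the window has a real lower bound.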
* Re: [PATCH 0/9][RFC] KVM virtio_net performance 2008-07-24 11:46 [PATCH 0/9][RFC] KVM virtio_net performance Mark McLoughlin 2008-07-24 11:46 ` [PATCH 1/9] kvm: qemu: Set MIN_TIMER_REARM_US to 150us Mark McLoughlin @ 2008-07-24 11:55 ` Herbert Xu 2008-07-24 16:53 ` Mark McLoughlin ` (2 subsequent siblings) 4 siblings, 0 replies; 36+ messages in thread From: Herbert Xu @ 2008-07-24 11:55 UTC (permalink / raw) To: Mark McLoughlin; +Cc: kvm, Rusty Russell On Thu, Jul 24, 2008 at 12:46:10PM +0100, Mark McLoughlin wrote: > Here's a bunch of patches attempting to improve the performance > of virtio_net. This is more an RFC rather than a patch submission > since, as can be seen below, not all patches actually improve the > perfomance measurably. Nice patches Mark, your work is always of the highest calibre! > 2) Using the recently merged GSO support in the tun/tap driver gives > a huge boost, but much more so on the host->guest side. IIRC the reason guest=>host isn't so good is because the tuntap GSO interface isn't the best when it comes to getting data from user-space. Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 0/9][RFC] KVM virtio_net performance
  2008-07-24 11:46 [PATCH 0/9][RFC] KVM virtio_net performance Mark McLoughlin
  2008-07-24 11:46 ` [PATCH 1/9] kvm: qemu: Set MIN_TIMER_REARM_US to 150us Mark McLoughlin
  2008-07-24 11:55 ` [PATCH 0/9][RFC] KVM virtio_net performance Herbert Xu
@ 2008-07-24 16:53 ` Mark McLoughlin
  2008-07-24 18:29   ` Anthony Liguori
  2008-07-24 20:56 ` Anthony Liguori
  2008-07-26  9:45 ` Avi Kivity
  4 siblings, 1 reply; 36+ messages in thread
From: Mark McLoughlin @ 2008-07-24 16:53 UTC (permalink / raw)
To: kvm; +Cc: Herbert Xu, Rusty Russell

Hey,
	One all-important thing I forgot to include was a comparison with
lguest :-)

netperf, 10x20s runs (Gb/s)  | guest->host                | host->guest
-----------------------------+----------------------------+---------------------------
KVM                          | 4.230/ 4.619/ 4.780/ 0.155 | 8.140/ 8.578/ 8.770/ 0.162
lguest                       | 5.700/ 5.926/ 6.150/ 0.132 | 8.680/ 9.073/ 9.320/ 0.205

ping -f -c 100000 (ms)       | guest->host                | host->guest
-----------------------------+----------------------------+---------------------------
KVM                          | 0.199/ 0.326/ 7.698/ 0.744 | 0.199/ 0.245/ 0.402/ 0.022
lguest                       | 0.022/ 0.055/ 0.467/ 0.019 | 0.019/ 0.046/89.249/ 0.448

So, puppies gets you an extra 1.3Gb/s guest->host, .5Gb/s host->guest
and much better latency.

Actually, I guess the main reason for the latency difference is that
when lguest gets notified on the tx ring, it immediately sends whatever
is available and then starts a timer. KVM doesn't send anything until
its tx timer fires or the ring is full.

Cheers,
Mark.

^ permalink raw reply	[flat|nested] 36+ messages in thread
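The lguest behaviour described in this message — flush immediately on the first kick, then arm the timer only to batch what follows — can be sketched as below. The structures and the `tx_kick` handler are hypothetical stand-ins, not the actual virtio-net code:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical tx queue state for illustration. */
struct txq {
    bool timer_active;
    int queued;    /* packets waiting in the ring */
    int sent;      /* packets pushed out so far */
};

/* Tx-kick handler.  On the first notification we flush immediately --
 * this is what keeps single-packet latency low -- and only then arm the
 * timer so that packets arriving during the window get batched. */
static void tx_kick(struct txq *q)
{
    if (!q->timer_active) {
        q->sent += q->queued;       /* flush right away */
        q->queued = 0;
        q->timer_active = true;     /* batch subsequent packets */
        return;
    }
    /* Timer already pending: a kick here means the ring filled up,
     * so flush now rather than waiting for the timer. */
    q->sent += q->queued;
    q->queued = 0;
}
```

A lone ping then pays no timer latency at all, while a bulk stream still gets mitigation after the first packet.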
* Re: [PATCH 0/9][RFC] KVM virtio_net performance
  2008-07-24 16:53 ` Mark McLoughlin
@ 2008-07-24 18:29   ` Anthony Liguori
  2008-07-25 16:36     ` Mark McLoughlin
  0 siblings, 1 reply; 36+ messages in thread
From: Anthony Liguori @ 2008-07-24 18:29 UTC (permalink / raw)
To: Mark McLoughlin; +Cc: kvm, Herbert Xu, Rusty Russell

Mark McLoughlin wrote:
> Hey,
> 	One all-important thing I forgot to include was a comparison with
> lguest :-)

Hey Mark,

This patch set is really great!  I guess the hard part now is deciding
what all we want to apply.  Do you have a suggestion of which patches
you think are worth applying?

BTW, do you have native and guest loopback numbers to compare where we
stand with native?

> netperf, 10x20s runs (Gb/s)  | guest->host                | host->guest
> -----------------------------+----------------------------+---------------------------
> KVM                          | 4.230/ 4.619/ 4.780/ 0.155 | 8.140/ 8.578/ 8.770/ 0.162
> lguest                       | 5.700/ 5.926/ 6.150/ 0.132 | 8.680/ 9.073/ 9.320/ 0.205
>
> ping -f -c 100000 (ms)       | guest->host                | host->guest
> -----------------------------+----------------------------+---------------------------
> KVM                          | 0.199/ 0.326/ 7.698/ 0.744 | 0.199/ 0.245/ 0.402/ 0.022
> lguest                       | 0.022/ 0.055/ 0.467/ 0.019 | 0.019/ 0.046/89.249/ 0.448
>
> So, puppies gets you an extra 1.3Gb/s guest->host, .5Gb/s host->guest
> and much better latency.

I'm surprised lguest gets an extra 1.3Gb/s guest->host.  Any idea of
where we're losing it?

> Actually, I guess the main reason for the latency difference is that
> when lguest gets notified on the tx ring, it immediately sends whatever
> is available and then starts a timer. KVM doesn't send anything until
> its tx timer fires or the ring is full.

Yes, we should definitely do that.  It will make ping appear to be a lot
faster than it really is :-)

Regards,

Anthony Liguori

> Cheers,
> Mark.

^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [PATCH 0/9][RFC] KVM virtio_net performance 2008-07-24 18:29 ` Anthony Liguori @ 2008-07-25 16:36 ` Mark McLoughlin 0 siblings, 0 replies; 36+ messages in thread From: Mark McLoughlin @ 2008-07-25 16:36 UTC (permalink / raw) To: Anthony Liguori; +Cc: kvm, Herbert Xu, Rusty Russell On Thu, 2008-07-24 at 13:29 -0500, Anthony Liguori wrote: > Mark McLoughlin wrote: > > Hey, > > One all all-important thing I forgot to include was a comparison with > > lguest :-) > > > > Hey Mark, > > This patch set is really great! I guess the hard part now is deciding > what all we want to apply. Do you have a suggestion of which patches > you think are worth applying? Good question. I think they're all fairly sane as a starting point, I guess. Making the tx timer adaptive, dropping the mutex when writing to the tap interface and flushing the tx queue immediately on notify are all things worth adding too, but they could be an additional patch set. Your zero-copy patch is quite a chunk for not much gain - how do you feel about that? Could we get that in upstream qemu? > BTW, do you have native and guest loopback numbers to compare where we > stand with native? Yep, sorry: netperf, 10x20s runs (Gb/s) | guest->host | host->guest -----------------------------+----------------------------+--------------------------- KVM | 4.230/ 4.619/ 4.780/ 0.155 | 8.140/ 8.578/ 8.770/ 0.162 lguest | 5.700/ 5.926/ 6.150/ 0.132 | 8.680/ 9.073/ 9.320/ 0.205 loopback | 9.520/12.473/15.390/ 1.947 ping -f -c 100000 (ms) | guest->host | host->guest -----------------------------+----------------------------+--------------------------- KVM | 0.199/ 0.326/ 7.698/ 0.744 | 0.199/ 0.245/ 0.402/ 0.022 lguest | 0.022/ 0.055/ 0.467/ 0.019 | 0.019/ 0.046/89.249/ 0.448 loopback | 0.001/ 0.002/ 0.025/ 0.001 Very surprising how much jitter there is in the loopback throughput figures. > I'm surprised lguest gets an extra 1.3gb guest->host. Any idea of where > we're loosing it? 
The main difference on the tx path would appear to be the adaptive timeout and the immediate flushing of the queue on notify. Cheers, Mark. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 0/9][RFC] KVM virtio_net performance 2008-07-24 11:46 [PATCH 0/9][RFC] KVM virtio_net performance Mark McLoughlin ` (2 preceding siblings ...) 2008-07-24 16:53 ` Mark McLoughlin @ 2008-07-24 20:56 ` Anthony Liguori 2008-07-25 17:17 ` Mark McLoughlin 2008-07-26 19:09 ` Bill Davidsen 2008-07-26 9:45 ` Avi Kivity 4 siblings, 2 replies; 36+ messages in thread From: Anthony Liguori @ 2008-07-24 20:56 UTC (permalink / raw) To: Mark McLoughlin; +Cc: kvm, Herbert Xu, Rusty Russell Hi Mark, Mark McLoughlin wrote: > Hey, > Here's a bunch of patches attempting to improve the performance > of virtio_net. This is more an RFC rather than a patch submission > since, as can be seen below, not all patches actually improve the > perfomance measurably. > I'm still seeing the same problem I saw with my patch series. Namely, dhclient fails to get a DHCP address. Rusty noticed that RX has a lot more packets received then it should so we're suspicious that we're getting packet corruption. Configuring the tap device with a static address, here's what I get with iperf: w/o patches: guest->host: 625 Mbits/sec host->guest: 825 Mbits/sec w/patches guest->host: 2.02 Gbits/sec host->guest: 1.89 Gbits/sec guest lo: 4.35 Gbits/sec host lo: 4.36 Gbits/sec This is with KVM GUEST configured FWIW. Regards, Anthony Liguori > I've tried hard to test each of these patches with as stable and > informative a benchmark as I could find. The first benchmark is a > netperf[1] based throughput benchmark and the second uses a flood > ping[2] to measure latency differences. > > Each set of figures is min/average/max/standard deviation. The > first set is Gb/s and the second is milliseconds. > > The network configuration used was very simple - the guest with > a virtio_net interface and the host with a tap interface and static > IP addresses assigned to both - e.g. there was no bridge in the host > involved and iptables was disable in both the host and guest. 
> > I used: > > 1) kvm-71-26-g6152996 with the patches that follow > > 2) Linus's v2.6.26-5752-g93ded9b with Rusty's virtio patches from > 219:bbd2611289c5 applied; these are the patches have just been > submitted to Linus > > The conclusions I draw are: > > 1) The length of the tx mitigation timer makes quite a difference to > throughput achieved; we probably need a good heuristic for > adjusting this on the fly. > > 2) Using the recently merged GSO support in the tun/tap driver gives > a huge boost, but much more so on the host->guest side. > > 3) Adjusting the virtio_net ring sizes makes a small difference, but > not as much as one might expect > > 4) Dropping the global mutex while reading GSO packets from the tap > interface gives a nice speedup. This highlights the global mutex > as a general perfomance issue. > > 5) Eliminating an extra copy on the host->guest path only makes a > barely measurable difference. > > Anyway, the figures: > > netperf, 10x20s runs (Gb/s) | guest->host | host->guest > -----------------------------+----------------------------+--------------------------- > baseline | 1.520/ 1.573/ 1.610/ 0.034 | 1.160/ 1.357/ 1.630/ 0.165 > 50us tx timer + rearm | 1.050/ 1.086/ 1.110/ 0.017 | 1.710/ 1.832/ 1.960/ 0.092 > 250us tx timer + rearm | 1.700/ 1.764/ 1.880/ 0.064 | 0.900/ 1.203/ 1.580/ 0.205 > 150us tx timer + rearm | 1.520/ 1.602/ 1.690/ 0.044 | 1.670/ 1.928/ 2.150/ 0.141 > no ring-full heuristic | 1.480/ 1.569/ 1.710/ 0.066 | 1.610/ 1.857/ 2.140/ 0.153 > VIRTIO_F_NOTIFY_ON_EMPTY | 1.470/ 1.554/ 1.650/ 0.054 | 1.770/ 1.960/ 2.170/ 0.119 > recv NO_NOTIFY | 1.530/ 1.604/ 1.680/ 0.047 | 1.780/ 1.944/ 2.190/ 0.129 > GSO | 4.120/ 4.323/ 4.420/ 0.099 | 6.540/ 7.033/ 7.340/ 0.244 > ring size == 256 | 4.050/ 4.406/ 4.560/ 0.143 | 6.280/ 7.236/ 8.280/ 0.613 > ring size == 512 | 4.420/ 4.600/ 4.960/ 0.140 | 6.470/ 7.205/ 7.510/ 0.314 > drop mutex during tapfd read | 4.320/ 4.578/ 4.790/ 0.161 | 8.370/ 8.589/ 8.730/ 0.120 > aligouri zero-copy | 
4.510/ 4.694/ 4.960/ 0.148 | 8.430/ 8.614/ 8.840/ 0.142 > > ping -f -c 100000 (ms) | guest->host | host->guest > -----------------------------+----------------------------+--------------------------- > baseline | 0.060/ 0.459/ 7.602/ 0.846 | 0.067/ 0.331/ 2.517/ 0.057 > 50us tx timer + rearm | 0.081/ 0.143/ 7.436/ 0.374 | 0.093/ 0.133/ 1.883/ 0.026 > 250us tx timer + rearm | 0.302/ 0.463/ 7.580/ 0.849 | 0.297/ 0.344/ 2.128/ 0.028 > 150us tx timer + rearm | 0.197/ 0.323/ 7.671/ 0.740 | 0.199/ 0.245/ 7.836/ 0.037 > no ring-full heuristic | 0.182/ 0.324/ 7.688/ 0.753 | 0.199/ 0.243/ 2.197/ 0.030 > VIRTIO_F_NOTIFY_ON_EMPTY | 0.197/ 0.321/ 7.447/ 0.730 | 0.196/ 0.242/ 2.218/ 0.032 > recv NO_NOTIFY | 0.186/ 0.321/ 7.520/ 0.732 | 0.200/ 0.233/ 2.216/ 0.028 > GSO | 0.178/ 0.324/ 7.667/ 0.736 | 0.147/ 0.246/ 1.361/ 0.024 > ring size == 256 | 0.184/ 0.323/ 7.674/ 0.728 | 0.199/ 0.243/ 2.181/ 0.028 > ring size == 512 | (not measured) | (not measured) > drop mutex during tapfd read | 0.183/ 0.323/ 7.820/ 0.733 | 0.202/ 0.242/ 2.219/ 0.027 > aligouri zero-copy | 0.185/ 0.325/ 7.863/ 0.736 | 0.202/ 0.245/ 7.844/ 0.036 > > Cheers, > Mark. > > [1] - I used netperf trunk from: > > http://www.netperf.org/svn/netperf2/trunk > > and simply ran: > > $> i=0; while [ $i -lt 10 ]; do ./netperf -H <host> -f g -l 20 -P 0 | netperf-collect.py; i=$((i+1)); done > > where netperf-collect.py is just a script to calculate the > average across the runs: > > http://markmc.fedorapeople.org/netperf-collect.py > > [2] - ping -c 100000 -f <host> > -- > To unsubscribe from this list: send the line "unsubscribe kvm" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 0/9][RFC] KVM virtio_net performance 2008-07-24 20:56 ` Anthony Liguori @ 2008-07-25 17:17 ` Mark McLoughlin 2008-07-25 21:29 ` Dor Laor 2008-07-26 19:09 ` Bill Davidsen 1 sibling, 1 reply; 36+ messages in thread From: Mark McLoughlin @ 2008-07-25 17:17 UTC (permalink / raw) To: Anthony Liguori; +Cc: kvm, Herbert Xu, Rusty Russell On Thu, 2008-07-24 at 15:56 -0500, Anthony Liguori wrote: > Hi Mark, > > Mark McLoughlin wrote: > > Hey, > > Here's a bunch of patches attempting to improve the performance > > of virtio_net. This is more an RFC rather than a patch submission > > since, as can be seen below, not all patches actually improve the > > perfomance measurably. > > > > I'm still seeing the same problem I saw with my patch series. Namely, > dhclient fails to get a DHCP address. Rusty noticed that RX has a lot > more packets received then it should so we're suspicious that we're > getting packet corruption. I've just tried bridging to my physical LAN and DHCP seems to be working fine. Which reminds me, though - doing this makes host->guest throughput drop to well below pre-GSO figures. GSO appears to be disabled while there's a physical interface on the bridge. If I remove eth0, the figures jump right back up again. I also just noticed that the GSO patch breaks e1000 because it unconditionally sets IFF_VNET_HDR. Will fix that up. > Configuring the tap device with a static address, here's what I get with > iperf: > > w/o patches: > > guest->host: 625 Mbits/sec > host->guest: 825 Mbits/sec > > w/patches > > guest->host: 2.02 Gbits/sec > host->guest: 1.89 Gbits/sec > > guest lo: 4.35 Gbits/sec > host lo: 4.36 Gbits/sec I tried iperf at one point and was getting really low figures; not sure why. Apart from your iperf figures being lower than my netperf figures, it also contradicts what I was seeing - namely guest->host beating host->guest before the patches and host->guest beating guest->host after the patches. 
It could all just be down to the length of the tx timer. If you try adjusting that does it help? > This is with KVM GUEST configured FWIW. Yep, same here. Cheers, Mark. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 0/9][RFC] KVM virtio_net performance
  2008-07-25 17:17             ` Mark McLoughlin
@ 2008-07-25 21:29               ` Dor Laor
  0 siblings, 0 replies; 36+ messages in thread
From: Dor Laor @ 2008-07-25 21:29 UTC (permalink / raw)
  To: Mark McLoughlin; +Cc: Anthony Liguori, kvm, Herbert Xu, Rusty Russell

Mark McLoughlin wrote:
> On Thu, 2008-07-24 at 15:56 -0500, Anthony Liguori wrote:
>> Hi Mark,
>>
>> Mark McLoughlin wrote:
>>> Hey,
>>>         Here's a bunch of patches attempting to improve the performance
>>> of virtio_net. This is more an RFC rather than a patch submission
>>> since, as can be seen below, not all patches actually improve the
>>> performance measurably.
>>
>> I'm still seeing the same problem I saw with my patch series. Namely,
>> dhclient fails to get a DHCP address. Rusty noticed that RX has a lot
>> more packets received than it should so we're suspicious that we're
>> getting packet corruption.
>
> I've just tried bridging to my physical LAN and DHCP seems to be
> working fine.
>
> Which reminds me, though - doing this makes host->guest throughput
> drop to well below pre-GSO figures. GSO appears to be disabled while
> there's a physical interface on the bridge. If I remove eth0, the
> figures jump right back up again.
>
> I also just noticed that the GSO patch breaks e1000 because it
> unconditionally sets IFF_VNET_HDR. Will fix that up.
>
>> Configuring the tap device with a static address, here's what I get
>> with iperf:
>>
>> w/o patches:
>>
>> guest->host: 625 Mbits/sec
>> host->guest: 825 Mbits/sec
>>
>> w/patches
>>
>> guest->host: 2.02 Gbits/sec
>> host->guest: 1.89 Gbits/sec
>>
>> guest lo: 4.35 Gbits/sec
>> host lo: 4.36 Gbits/sec
>
> I tried iperf at one point and was getting really low figures; not
> sure why.

Older iperf (~2.0.2) has a locking-vs-yield bug and consumes lots of
cpu. It is fixed in 2.0.4. Can that be the difference?

> Apart from your iperf figures being lower than my netperf figures, it
> also contradicts what I was seeing - namely guest->host beating
> host->guest before the patches and host->guest beating guest->host
> after the patches.
>
> It could all just be down to the length of the tx timer. If you try
> adjusting that, does it help?
>
>> This is with KVM GUEST configured FWIW.
>
> Yep, same here.
>
> Cheers,
> Mark.
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: [PATCH 0/9][RFC] KVM virtio_net performance
  2008-07-24 20:56         ` Anthony Liguori
  2008-07-25 17:17           ` Mark McLoughlin
@ 2008-07-26 19:09           ` Bill Davidsen
  2008-07-27  7:52             ` Avi Kivity
  1 sibling, 1 reply; 36+ messages in thread
From: Bill Davidsen @ 2008-07-26 19:09 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Mark McLoughlin, kvm, Herbert Xu, Rusty Russell

Anthony Liguori wrote:
> Hi Mark,
>
> Mark McLoughlin wrote:
>> Hey,
>>         Here's a bunch of patches attempting to improve the performance
>> of virtio_net. This is more an RFC rather than a patch submission
>> since, as can be seen below, not all patches actually improve the
>> performance measurably.
>
> I'm still seeing the same problem I saw with my patch series. Namely,
> dhclient fails to get a DHCP address. Rusty noticed that RX has a lot
> more packets received than it should so we're suspicious that we're
> getting packet corruption.

I have been discussing this (on this list) in another thread. Putting
tcpdump on the eth0 device in the VM, the br0 device in the host, and
the eth0 (physical NIC) in the host, you can see that when the VM
generates a DHCP request it shows up on the br0 in the host, but never
gets sent on the wire by eth0.

That's the point of failure, at least using RHEL5/FC6/kvm-66 as the
environment.

> Configuring the tap device with a static address, here's what I get
> with iperf:
>
> w/o patches:
>
> guest->host: 625 Mbits/sec
> host->guest: 825 Mbits/sec
>
> w/patches
>
> guest->host: 2.02 Gbits/sec
> host->guest: 1.89 Gbits/sec
>
> guest lo: 4.35 Gbits/sec
> host lo: 4.36 Gbits/sec
>
> This is with KVM GUEST configured FWIW.
>
> Regards,
>
> Anthony Liguori

-- 
Bill Davidsen <davidsen@tmr.com>
  "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot
* Re: [PATCH 0/9][RFC] KVM virtio_net performance
  2008-07-26 19:09           ` Bill Davidsen
@ 2008-07-27  7:52             ` Avi Kivity
  2008-07-27 12:52               ` Bill Davidsen
  2008-07-27 13:17               ` Bill Davidsen
  0 siblings, 2 replies; 36+ messages in thread
From: Avi Kivity @ 2008-07-27 7:52 UTC (permalink / raw)
  To: Bill Davidsen
  Cc: Anthony Liguori, Mark McLoughlin, kvm, Herbert Xu, Rusty Russell

Bill Davidsen wrote:
> Anthony Liguori wrote:
>> Hi Mark,
>>
>> Mark McLoughlin wrote:
>>> Hey,
>>>         Here's a bunch of patches attempting to improve the performance
>>> of virtio_net. This is more an RFC rather than a patch submission
>>> since, as can be seen below, not all patches actually improve the
>>> performance measurably.
>>
>> I'm still seeing the same problem I saw with my patch series.
>> Namely, dhclient fails to get a DHCP address. Rusty noticed that RX
>> has a lot more packets received than it should so we're suspicious
>> that we're getting packet corruption.
>
> I have been discussing this (on this list) in another thread. Putting
> tcpdump on the eth0 device in the VM, the br0 device in the host, and
> the eth0 (physical NIC) in the host, you can see that when the VM
> generates a DHCP request it shows up on the br0 in the host, but never
> gets sent on the wire by eth0.
>
> That's the point of failure, at least using RHEL5/FC6/kvm-66 as the
> environment.

Does playing with the bridge forward delay ('brctl setfd') help?

-- 
error compiling committee.c: too many arguments to function
* Re: [PATCH 0/9][RFC] KVM virtio_net performance
  2008-07-27  7:52             ` Avi Kivity
@ 2008-07-27 12:52               ` Bill Davidsen
  0 siblings, 0 replies; 36+ messages in thread
From: Bill Davidsen @ 2008-07-27 12:52 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Anthony Liguori, Mark McLoughlin, kvm, Herbert Xu, Rusty Russell

Avi Kivity wrote:
> Bill Davidsen wrote:
>> Anthony Liguori wrote:
>>> Hi Mark,
>>>
>>> [...snip...]
>>> I'm still seeing the same problem I saw with my patch series.
>>> Namely, dhclient fails to get a DHCP address. Rusty noticed that RX
>>> has a lot more packets received than it should so we're suspicious
>>> that we're getting packet corruption.
>>
>> I have been discussing this (on this list) in another thread. Putting
>> tcpdump on the eth0 device in the VM, the br0 device in the host, and
>> the eth0 (physical NIC) in the host, you can see that when the VM
>> generates a DHCP request it shows up on the br0 in the host, but
>> never gets sent on the wire by eth0.
>>
>> That's the point of failure, at least using RHEL5/FC6/kvm-66 as the
>> environment.
>
> Does playing with the bridge forward delay ('brctl setfd') help?

Not obviously; the default value is 15, and stepping from 10..30 made no
difference. It appears the bridge just doesn't forward DHCP packets. I'm
not an expert on bridge setup, and I don't see an obvious control, so I
will look at the code this afternoon and see if something jumps out at
me.

-- 
Bill Davidsen <davidsen@tmr.com>
  "Woe unto the statesman who makes war without a reason that will
still be valid when the war is over..." Otto von Bismark
* Re: [PATCH 0/9][RFC] KVM virtio_net performance
  2008-07-27  7:52             ` Avi Kivity
  2008-07-27 12:52               ` Bill Davidsen
@ 2008-07-27 13:17               ` Bill Davidsen
  2008-07-28  6:42                 ` Mark McLoughlin
  1 sibling, 1 reply; 36+ messages in thread
From: Bill Davidsen @ 2008-07-27 13:17 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Anthony Liguori, Mark McLoughlin, kvm, Herbert Xu, Rusty Russell

Avi Kivity wrote:
> Bill Davidsen wrote:
>>
>> I have been discussing this (on this list) in another thread. Putting
>> tcpdump on the eth0 device in the VM, the br0 device in the host, and
>> the eth0 (physical NIC) in the host, you can see that when the VM
>> generates a DHCP request it shows up on the br0 in the host, but
>> never gets sent on the wire by eth0.
>>
>> That's the point of failure, at least using RHEL5/FC6/kvm-66 as the
>> environment.
>
> Does playing with the bridge forward delay ('brctl setfd') help?

Update: Red Hat has a user chain in iptables shared between INPUT and
FORWARD (bad idea) which doesn't pass bootp packets by default. Adding
the following rules to that chain solved DHCP for me:

  ACCEPT  udp  --  anywhere  anywhere  udp spt:bootps dpt:bootpc
  ACCEPT  udp  --  anywhere  anywhere  udp spt:bootpc dpt:bootps

This seems to solve my problem; I just have to make it part of my
"start kvm" procedure.

-- 
Bill Davidsen <davidsen@tmr.com>
  "Woe unto the statesman who makes war without a reason that will
still be valid when the war is over..." Otto von Bismark
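[For reference, rules like the two above could be added with commands
along these lines. This is a sketch only -- the chain name
RH-Firewall-1-INPUT is an assumption about the shared Red Hat chain;
check `iptables -L -n` for the actual user chain name on your system.]

```shell
# Allow DHCP (bootp) traffic through the shared firewall chain.
# NOTE: RH-Firewall-1-INPUT is an assumed chain name; substitute the
# user chain your distribution actually shares between INPUT and FORWARD.
iptables -I RH-Firewall-1-INPUT -p udp --sport bootps --dport bootpc -j ACCEPT
iptables -I RH-Firewall-1-INPUT -p udp --sport bootpc --dport bootps -j ACCEPT
```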
* Re: [PATCH 0/9][RFC] KVM virtio_net performance
  2008-07-27 13:17               ` Bill Davidsen
@ 2008-07-28  6:42                 ` Mark McLoughlin
  0 siblings, 0 replies; 36+ messages in thread
From: Mark McLoughlin @ 2008-07-28 6:42 UTC (permalink / raw)
  To: Bill Davidsen
  Cc: Avi Kivity, Anthony Liguori, kvm, Herbert Xu, Rusty Russell

On Sun, 2008-07-27 at 09:17 -0400, Bill Davidsen wrote:
> Avi Kivity wrote:
> > Bill Davidsen wrote:
> >>
> >> I have been discussing this (on this list) in another thread.
> >> Putting tcpdump on the eth0 device in the VM, the br0 device in the
> >> host, and the eth0 (physical NIC) in the host, you can see that
> >> when the VM generates a DHCP request it shows up on the br0 in the
> >> host, but never gets sent on the wire by eth0.
> >>
> >> That's the point of failure, at least using RHEL5/FC6/kvm-66 as the
> >> environment.
> >
> > Does playing with the bridge forward delay ('brctl setfd') help?
>
> Update: Redhat has a user chain in iptables shared between INPUT and
> FORWARD (bad idea) which doesn't pass bootp packets by default.

Yeah, I've been trying to get that rule changed to allow all bridged
packets to be forwarded by default. See:

  https://bugzilla.redhat.com/221828

> Adding the following rules to that table solved the DHCP for me.
>
>   ACCEPT  udp  --  anywhere  anywhere  udp spt:bootps dpt:bootpc
>   ACCEPT  udp  --  anywhere  anywhere  udp spt:bootpc dpt:bootps
>
> This seems to solve my problem, I just have to make it part of my
> "start kvm" procedure.

See here:

  http://wiki.libvirt.org/page/Networking

in the 'Bridged networking (aka "shared physical device")' section:

  # echo "-I FORWARD -m physdev --physdev-is-bridged -j ACCEPT" > /etc/sysconfig/iptables-forward-bridged
  # lokkit --custom-rules=ipv4:filter:/etc/sysconfig/iptables-forward-bridged
  # service libvirtd reload

Cheers,
Mark.
* Re: [PATCH 0/9][RFC] KVM virtio_net performance
  2008-07-24 11:46 [PATCH 0/9][RFC] KVM virtio_net performance Mark McLoughlin
  ` (3 preceding siblings ...)
  2008-07-24 20:56 ` Anthony Liguori
@ 2008-07-26  9:45 ` Avi Kivity
  2008-07-27  6:48   ` Rusty Russell
  ` (2 more replies)
  4 siblings, 3 replies; 36+ messages in thread
From: Avi Kivity @ 2008-07-26 9:45 UTC (permalink / raw)
  To: Mark McLoughlin; +Cc: kvm, Herbert Xu, Rusty Russell

Mark McLoughlin wrote:
> Hey,
>         Here's a bunch of patches attempting to improve the performance
> of virtio_net. This is more an RFC rather than a patch submission
> since, as can be seen below, not all patches actually improve the
> performance measurably.
>
> I've tried hard to test each of these patches with as stable and
> informative a benchmark as I could find. The first benchmark is a
> netperf[1] based throughput benchmark and the second uses a flood
> ping[2] to measure latency differences.
>
> Each set of figures is min/average/max/standard deviation. The
> first set is Gb/s and the second is milliseconds.
>
> The network configuration used was very simple - the guest with
> a virtio_net interface and the host with a tap interface and static
> IP addresses assigned to both - i.e. there was no bridge in the host
> involved and iptables was disabled in both the host and guest.
>
> I used:
>
> 1) kvm-71-26-g6152996 with the patches that follow
>
> 2) Linus's v2.6.26-5752-g93ded9b with Rusty's virtio patches from
>    219:bbd2611289c5 applied; these are the patches that have just
>    been submitted to Linus
>
> The conclusions I draw are:
>
> 1) The length of the tx mitigation timer makes quite a difference to
>    throughput achieved; we probably need a good heuristic for
>    adjusting this on the fly.

The tx mitigation timer is just one part of the equation; the other is
the virtio ring window size, which is now fixed.

Using a maximum sized window is good when the guest and host are running
flat out, doing nothing but networking.
When throughput drops (because the guest is spending cpu on processing,
or simply because the other side is not keeping up), we need to drop the
window size so as to retain acceptable latencies. The tx timer can then
be set to "a bit after the end of the window", acting as a safety belt
in case the throughput changes.

> 4) Dropping the global mutex while reading GSO packets from the tap
>    interface gives a nice speedup. This highlights the global mutex
>    as a general performance issue.

Not sure whether this is safe. What's stopping the guest from accessing
virtio and changing some state?

> 5) Eliminating an extra copy on the host->guest path only makes a
>    barely measurable difference.

That's expected on a host->guest test. Zero copy is mostly important for
guest->external, and with zerocopy already enabled in the guest
(sendfile or nfs server workloads).

> Anyway, the figures:
>
> netperf, 10x20s runs (Gb/s)  | guest->host                | host->guest
> -----------------------------+----------------------------+---------------------------
> baseline                     | 1.520/ 1.573/ 1.610/ 0.034 | 1.160/ 1.357/ 1.630/ 0.165
> 50us tx timer + rearm        | 1.050/ 1.086/ 1.110/ 0.017 | 1.710/ 1.832/ 1.960/ 0.092
> 250us tx timer + rearm       | 1.700/ 1.764/ 1.880/ 0.064 | 0.900/ 1.203/ 1.580/ 0.205
> 150us tx timer + rearm       | 1.520/ 1.602/ 1.690/ 0.044 | 1.670/ 1.928/ 2.150/ 0.141
> no ring-full heuristic       | 1.480/ 1.569/ 1.710/ 0.066 | 1.610/ 1.857/ 2.140/ 0.153
> VIRTIO_F_NOTIFY_ON_EMPTY     | 1.470/ 1.554/ 1.650/ 0.054 | 1.770/ 1.960/ 2.170/ 0.119
> recv NO_NOTIFY               | 1.530/ 1.604/ 1.680/ 0.047 | 1.780/ 1.944/ 2.190/ 0.129
> GSO                          | 4.120/ 4.323/ 4.420/ 0.099 | 6.540/ 7.033/ 7.340/ 0.244
> ring size == 256             | 4.050/ 4.406/ 4.560/ 0.143 | 6.280/ 7.236/ 8.280/ 0.613
> ring size == 512             | 4.420/ 4.600/ 4.960/ 0.140 | 6.470/ 7.205/ 7.510/ 0.314
> drop mutex during tapfd read | 4.320/ 4.578/ 4.790/ 0.161 | 8.370/ 8.589/ 8.730/ 0.120
> aligouri zero-copy           | 4.510/ 4.694/ 4.960/ 0.148 | 8.430/ 8.614/ 8.840/ 0.142
Very impressive numbers; much better than I expected. The host->guest
numbers are around 100x better than the original emulated card
throughput we got from kvm.

> ping -f -c 100000 (ms)       | guest->host                | host->guest
> -----------------------------+----------------------------+---------------------------
> baseline                     | 0.060/ 0.459/ 7.602/ 0.846 | 0.067/ 0.331/ 2.517/ 0.057
> 50us tx timer + rearm        | 0.081/ 0.143/ 7.436/ 0.374 | 0.093/ 0.133/ 1.883/ 0.026
> 250us tx timer + rearm       | 0.302/ 0.463/ 7.580/ 0.849 | 0.297/ 0.344/ 2.128/ 0.028
> 150us tx timer + rearm       | 0.197/ 0.323/ 7.671/ 0.740 | 0.199/ 0.245/ 7.836/ 0.037
> no ring-full heuristic       | 0.182/ 0.324/ 7.688/ 0.753 | 0.199/ 0.243/ 2.197/ 0.030
> VIRTIO_F_NOTIFY_ON_EMPTY     | 0.197/ 0.321/ 7.447/ 0.730 | 0.196/ 0.242/ 2.218/ 0.032
> recv NO_NOTIFY               | 0.186/ 0.321/ 7.520/ 0.732 | 0.200/ 0.233/ 2.216/ 0.028
> GSO                          | 0.178/ 0.324/ 7.667/ 0.736 | 0.147/ 0.246/ 1.361/ 0.024
> ring size == 256             | 0.184/ 0.323/ 7.674/ 0.728 | 0.199/ 0.243/ 2.181/ 0.028
> ring size == 512             | (not measured)             | (not measured)
> drop mutex during tapfd read | 0.183/ 0.323/ 7.820/ 0.733 | 0.202/ 0.242/ 2.219/ 0.027
> aligouri zero-copy           | 0.185/ 0.325/ 7.863/ 0.736 | 0.202/ 0.245/ 7.844/ 0.036

This isn't too good. Low latency is important for nfs clients (or other
request/response workloads). I think we can keep these low by adjusting
the virtio window (for example, on an idle system it should be 1), so
that the tx mitigation timer only fires when the workload transitions
from throughput to request/response.

-- 
Do not meddle in the internals of kernels, for they are subtle and
quick to panic.
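[The window-plus-timer idea discussed above (grow the window while the
consumer keeps up, shrink it whenever the safety-belt timer actually
fires) can be sketched as a toy heuristic. Everything below is
illustrative: the class, the constants, and the double/halve policy are
assumptions, not code from the patches.]

```python
# Toy model of an adaptive tx-window heuristic: widen the window while
# the guest fills it before the mitigation timer fires (bulk traffic),
# and shrink it toward 1 whenever the timer expires first (idle or
# request/response traffic). Illustrative only.

MAX_RING = 256  # assumed ring size, matching the "ring size == 256" runs


class TxWindow:
    def __init__(self, worst_case_latency_us=150):
        self.window = 1                       # small window: low latency when idle
        self.timeout_us = worst_case_latency_us  # explicit worst-case latency

    def on_window_filled(self):
        # The window filled before the timer fired: throughput workload,
        # so allow more batching next time.
        self.window = min(self.window * 2, MAX_RING)

    def on_timer_expired(self):
        # The timer fired with the window only partly full: shrink the
        # window so the timer stops firing in steady state.
        self.window = max(self.window // 2, 1)


w = TxWindow()
for _ in range(10):          # sustained bulk traffic
    w.on_window_filled()
bulk = w.window              # converges to the full ring

for _ in range(10):          # workload goes idle / latency-sensitive
    w.on_timer_expired()
idle = w.window              # converges back to 1
```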
* Re: [PATCH 0/9][RFC] KVM virtio_net performance
  2008-07-26  9:45 ` Avi Kivity
@ 2008-07-27  6:48   ` Rusty Russell
  2008-07-27  6:48   ` Rusty Russell
  2008-08-11 19:56   ` Mark McLoughlin
  2 siblings, 0 replies; 36+ messages in thread
From: Rusty Russell @ 2008-07-27 6:48 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Mark McLoughlin, kvm, Herbert Xu

On Saturday 26 July 2008 19:45:36 Avi Kivity wrote:
> Mark McLoughlin wrote:
> > Hey,
> >         Here's a bunch of patches attempting to improve the performance
> > of virtio_net. This is more an RFC rather than a patch submission
> > since, as can be seen below, not all patches actually improve the
> > performance measurably.
> >
> > I've tried hard to test each of these patches with as stable and
> > informative a benchmark as I could find. The first benchmark is a
> > netperf[1] based throughput benchmark and the second uses a flood
> > ping[2] to measure latency differences.
> >
> > Each set of figures is min/average/max/standard deviation. The
> > first set is Gb/s and the second is milliseconds.
> >
> > The network configuration used was very simple - the guest with
> > a virtio_net interface and the host with a tap interface and static
> > IP addresses assigned to both - i.e. there was no bridge in the host
> > involved and iptables was disabled in both the host and guest.
> >
> > I used:
> >
> > 1) kvm-71-26-g6152996 with the patches that follow
> >
> > 2) Linus's v2.6.26-5752-g93ded9b with Rusty's virtio patches from
> >    219:bbd2611289c5 applied; these are the patches that have just
> >    been submitted to Linus
> >
> > The conclusions I draw are:
> >
> > 1) The length of the tx mitigation timer makes quite a difference to
> >    throughput achieved; we probably need a good heuristic for
> >    adjusting this on the fly.
>
> The tx mitigation timer is just one part of the equation; the other is
> the virtio ring window size, which is now fixed.
> Using a maximum sized window is good when the guest and host are
> running flat out, doing nothing but networking. When throughput drops
> (because the guest is spending cpu on processing, or simply because
> the other side is not keeping up), we need to drop the window size so
> as to retain acceptable latencies.
>
> The tx timer can then be set to "a bit after the end of the window",
> acting as a safety belt in case the throughput changes.

Interestingly, I played a little with a threshold patch. Unfortunately
it seemed to just add YA variable to the mix; it'd take real research to
figure out how to adjust the threshold and timeout values appropriately
for different situations.

> This isn't too good. Low latency is important for nfs clients (or
> other request/response workloads). I think we can keep these low by
> adjusting the virtio window (for example, on an idle system it should
> be 1), so that the tx mitigation timer only fires when the workload
> transitions from throughput to request/response.

From reading this thread, this seems like an implementation bug. Don't
suppress notifications on an empty ring (i.e. threshold will be 1 when
nothing is happening).

Rusty.
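[The "don't suppress notifications on an empty ring" rule can be
modelled in a few lines. This is a toy model, not the virtio_net code;
all names and the threshold value are invented.]

```python
# Sketch of notification batching with a notify-on-empty exception:
# a guest may suppress tx notifications until `threshold` buffers are
# queued, but must always notify when it adds a buffer to an *empty*
# ring, otherwise an idle->busy transition stalls until the mitigation
# timer fires. Illustrative only.

class TxQueue:
    def __init__(self, threshold=8):
        self.pending = 0           # buffers queued, not yet consumed by host
        self.threshold = threshold # batching threshold for notifications
        self.notifications = 0     # exits/kicks sent to the host

    def add_buf(self):
        was_empty = (self.pending == 0)
        self.pending += 1
        # Notify on empty regardless of the batching threshold.
        if was_empty or self.pending >= self.threshold:
            self.notifications += 1

    def host_consumes_all(self):
        self.pending = 0


q = TxQueue(threshold=8)
q.add_buf()                 # idle -> one packet: must notify immediately
first = q.notifications     # 1 notification despite threshold == 8

q.host_consumes_all()
for _ in range(7):          # burst that stays below the threshold
    q.add_buf()
burst = q.notifications     # only the notify-on-empty kick was added
```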
* Re: [PATCH 0/9][RFC] KVM virtio_net performance
  2008-07-26  9:45 ` Avi Kivity
  2008-07-27  6:48   ` Rusty Russell
@ 2008-07-27  6:48   ` Rusty Russell
  2008-08-11 19:56   ` Mark McLoughlin
  2 siblings, 0 replies; 36+ messages in thread
From: Rusty Russell @ 2008-07-27 6:48 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Mark McLoughlin, kvm, Herbert Xu

On Saturday 26 July 2008 19:45:36 Avi Kivity wrote:
> > 5) Eliminating an extra copy on the host->guest path only makes a
> >    barely measurable difference.
>
> That's expected on a host->guest test. Zero copy is mostly important
> for guest->external, and with zerocopy already enabled in the guest
> (sendfile or nfs server workloads).

Note that this will also need the copyless tun patches. If we're doing
one copy from userspace->kernel anyway, this extra copy is probably lost
in the noise.

Cheers,
Rusty.
* Re: [PATCH 0/9][RFC] KVM virtio_net performance
  2008-07-26  9:45 ` Avi Kivity
  2008-07-27  6:48   ` Rusty Russell
  2008-07-27  6:48   ` Rusty Russell
@ 2008-08-11 19:56   ` Mark McLoughlin
  2008-08-12 13:35     ` Avi Kivity
  2 siblings, 1 reply; 36+ messages in thread
From: Mark McLoughlin @ 2008-08-11 19:56 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, Herbert Xu, Rusty Russell

Hi Avi,

Sorry, I got distracted from this ...

On Sat, 2008-07-26 at 12:45 +0300, Avi Kivity wrote:
> Mark McLoughlin wrote:
> > 1) The length of the tx mitigation timer makes quite a difference to
> >    throughput achieved; we probably need a good heuristic for
> >    adjusting this on the fly.
>
> The tx mitigation timer is just one part of the equation; the other is
> the virtio ring window size, which is now fixed.
>
> Using a maximum sized window is good when the guest and host are
> running flat out, doing nothing but networking. When throughput drops
> (because the guest is spending cpu on processing, or simply because
> the other side is not keeping up), we need to drop the window size so
> as to retain acceptable latencies.
>
> The tx timer can then be set to "a bit after the end of the window",
> acting as a safety belt in case the throughput changes.

i.e. the tx timer should give just enough time for a flat out guest to
fill the ring, and no more?

Yep, that's basically what lguest's tx timer heuristic is aiming for
AFAICT.

> > 4) Dropping the global mutex while reading GSO packets from the tap
> >    interface gives a nice speedup. This highlights the global mutex
> >    as a general performance issue.
>
> Not sure whether this is safe. What's stopping the guest from
> accessing virtio and changing some state?

With the current code, the virtio state should be consistent before we
drop the mutex. The I/O thread would only drop the lock while it reads
into the tap buffer and then grab the lock again before popping a buffer
from the ring and copying to it.
With Anthony's zero-copy patch, the situation is less clear - we pop a
buffer from the avail ring, drop the lock, read() into the buffer, grab
the lock and then push the buffer back onto the used ring. While the
mutex is released, the guest could e.g. reset the ring and release the
buffer which we're in the process of read()ing into.

So, yes - dropping the mutex during read() in the zero-copy patch isn't
safe.

Another potential concern is that if we drop the mutex, the guest thread
could delete an I/O handler while the I/O thread is in the I/O handler
loop in main_loop_wait(). However, this seems to have been coded to
handle this situation - the I/O handler would only be marked as deleted,
and ignored by the loop.

Oh, and there's the posix-timers/signalfd race condition that I can only
seem to trigger when dropping the mutex.

> > 5) Eliminating an extra copy on the host->guest path only makes a
> >    barely measurable difference.
>
> That's expected on a host->guest test. Zero copy is mostly important
> for guest->external, and with zerocopy already enabled in the guest
> (sendfile or nfs server workloads).

Hmm, could you elaborate on that?

The copy we're eliminating here is an intermediate copy from tapfd into
a buffer before copying to a guest buffer. It doesn't give you zero-copy
as we still copy from kernel space to user space and vice-versa.
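[The distinction drawn above can be modelled as follows: the
bounce-buffer path touches no shared virtio state while the lock is
dropped, which is what makes releasing the mutex around the read() safe.
This is a toy model with invented names, not the qemu code.]

```python
# Toy model of the safe (bounce-buffer) receive ordering: read tap data
# into private storage with the global mutex dropped, and only touch the
# virtio ring while holding the lock. The unsafe zero-copy variant would
# instead pop a guest buffer, drop the lock, and read() directly into
# memory the guest is free to reclaim. Illustrative only.
import threading

lock = threading.Lock()          # stands in for qemu's global mutex
avail_ring = ["guest-buf-0"]     # avail ring holding one guest buffer


def copy_path_receive(tap_data):
    # Lock dropped here: read into a private bounce buffer; no shared
    # virtio state is touched, so a concurrent guest ring reset is fine.
    bounce = bytes(tap_data)
    with lock:
        # Ring state is only inspected and modified under the lock.
        if avail_ring:
            buf = avail_ring.pop()
            return (buf, bounce)  # "copy" bounce into the guest buffer
    return None                   # guest reset the ring meanwhile; drop packet


result = copy_path_receive(b"packet")
```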
> > Anyway, the figures:
> >
> > netperf, 10x20s runs (Gb/s)  | guest->host                | host->guest
> > -----------------------------+----------------------------+---------------------------
> > baseline                     | 1.520/ 1.573/ 1.610/ 0.034 | 1.160/ 1.357/ 1.630/ 0.165
> > 50us tx timer + rearm        | 1.050/ 1.086/ 1.110/ 0.017 | 1.710/ 1.832/ 1.960/ 0.092
> > 250us tx timer + rearm       | 1.700/ 1.764/ 1.880/ 0.064 | 0.900/ 1.203/ 1.580/ 0.205
> > 150us tx timer + rearm       | 1.520/ 1.602/ 1.690/ 0.044 | 1.670/ 1.928/ 2.150/ 0.141
> > no ring-full heuristic       | 1.480/ 1.569/ 1.710/ 0.066 | 1.610/ 1.857/ 2.140/ 0.153
> > VIRTIO_F_NOTIFY_ON_EMPTY     | 1.470/ 1.554/ 1.650/ 0.054 | 1.770/ 1.960/ 2.170/ 0.119
> > recv NO_NOTIFY               | 1.530/ 1.604/ 1.680/ 0.047 | 1.780/ 1.944/ 2.190/ 0.129
> > GSO                          | 4.120/ 4.323/ 4.420/ 0.099 | 6.540/ 7.033/ 7.340/ 0.244
> > ring size == 256             | 4.050/ 4.406/ 4.560/ 0.143 | 6.280/ 7.236/ 8.280/ 0.613
> > ring size == 512             | 4.420/ 4.600/ 4.960/ 0.140 | 6.470/ 7.205/ 7.510/ 0.314
> > drop mutex during tapfd read | 4.320/ 4.578/ 4.790/ 0.161 | 8.370/ 8.589/ 8.730/ 0.120
> > aligouri zero-copy           | 4.510/ 4.694/ 4.960/ 0.148 | 8.430/ 8.614/ 8.840/ 0.142
>
> Very impressive numbers; much better than I expected. The host->guest
> numbers are around 100x better than the original emulated card
> throughput we got from kvm.

Hmm, I had intended to post comparison numbers for e1000; if I re-run
these again, I'll do that too. I think it's about 0.5 Gb/s guest->host
and 1.5 Gb/s host->guest.

Note also that with current Fedora rawhide kernels the numbers aren't
nearly as good as the numbers using my own kernel builds. This could be
down to the fact that we enable pretty much all kernel debugging options
during this phase of Fedora development. If I get a chance, I'll see if
I can confirm that suspicion.
> > ping -f -c 100000 (ms)       | guest->host                | host->guest
> > -----------------------------+----------------------------+---------------------------
> > baseline                     | 0.060/ 0.459/ 7.602/ 0.846 | 0.067/ 0.331/ 2.517/ 0.057
> > 50us tx timer + rearm        | 0.081/ 0.143/ 7.436/ 0.374 | 0.093/ 0.133/ 1.883/ 0.026
> > 250us tx timer + rearm       | 0.302/ 0.463/ 7.580/ 0.849 | 0.297/ 0.344/ 2.128/ 0.028
> > 150us tx timer + rearm       | 0.197/ 0.323/ 7.671/ 0.740 | 0.199/ 0.245/ 7.836/ 0.037
> > no ring-full heuristic       | 0.182/ 0.324/ 7.688/ 0.753 | 0.199/ 0.243/ 2.197/ 0.030
> > VIRTIO_F_NOTIFY_ON_EMPTY     | 0.197/ 0.321/ 7.447/ 0.730 | 0.196/ 0.242/ 2.218/ 0.032
> > recv NO_NOTIFY               | 0.186/ 0.321/ 7.520/ 0.732 | 0.200/ 0.233/ 2.216/ 0.028
> > GSO                          | 0.178/ 0.324/ 7.667/ 0.736 | 0.147/ 0.246/ 1.361/ 0.024
> > ring size == 256             | 0.184/ 0.323/ 7.674/ 0.728 | 0.199/ 0.243/ 2.181/ 0.028
> > ring size == 512             | (not measured)             | (not measured)
> > drop mutex during tapfd read | 0.183/ 0.323/ 7.820/ 0.733 | 0.202/ 0.242/ 2.219/ 0.027
> > aligouri zero-copy           | 0.185/ 0.325/ 7.863/ 0.736 | 0.202/ 0.245/ 7.844/ 0.036
>
> This isn't too good. Low latency is important for nfs clients (or
> other request/response workloads). I think we can keep these low by
> adjusting the virtio window (for example, on an idle system it should
> be 1), so that the tx mitigation timer only fires when the workload
> transitions from throughput to request/response.

The other suggestion to help latency is to always flush the queue
immediately on guest notification, before setting the timer.

Cheers,
Mark.
* Re: [PATCH 0/9][RFC] KVM virtio_net performance
  2008-08-11 19:56   ` Mark McLoughlin
@ 2008-08-12 13:35     ` Avi Kivity
  0 siblings, 0 replies; 36+ messages in thread
From: Avi Kivity @ 2008-08-12 13:35 UTC (permalink / raw)
  To: Mark McLoughlin; +Cc: kvm, Herbert Xu, Rusty Russell

Mark McLoughlin wrote:
> Hi Avi,
>
> Sorry, I got distracted from this ...

So did I :)

>>> 1) The length of the tx mitigation timer makes quite a difference to
>>>    throughput achieved; we probably need a good heuristic for
>>>    adjusting this on the fly.
>>
>> The tx mitigation timer is just one part of the equation; the other
>> is the virtio ring window size, which is now fixed.
>>
>> Using a maximum sized window is good when the guest and host are
>> running flat out, doing nothing but networking. When throughput drops
>> (because the guest is spending cpu on processing, or simply because
>> the other side is not keeping up), we need to drop the window size so
>> as to retain acceptable latencies.
>>
>> The tx timer can then be set to "a bit after the end of the window",
>> acting as a safety belt in case the throughput changes.
>
> i.e. the tx timer should give just enough time for a flat out guest to
> fill the ring, and no more?
>
> Yep, that's basically what lguest's tx timer heuristic is aiming for
> AFAICT.

Yes, but that's not enough. If networking is slow (for whatever reason)
we need to drop the window size, to make sure the timer never fires
under steady state circumstances.

Thinking about it, we could have an explicit "worst case latency"
parameter (instead of the implicit "flat out guest fills ring") and set
the timer to that. Adjust the window size to as large as we can without
seeing the timer expire.

>>> 4) Dropping the global mutex while reading GSO packets from the tap
>>>    interface gives a nice speedup. This highlights the global mutex
>>>    as a general performance issue.
>>
>> Not sure whether this is safe.
>> What's stopping the guest from accessing virtio and changing some
>> state?
>
> With the current code, the virtio state should be consistent before we
> drop the mutex. The I/O thread would only drop the lock while it reads
> into the tap buffer and then grab the lock again before popping a
> buffer from the ring and copying to it.

Right, tap_send() is called outside virtio-net context.

> With Anthony's zero-copy patch, the situation is less clear - we pop a
> buffer from the avail ring, drop the lock, read() into the buffer,
> grab the lock and then push the buffer back onto the used ring. While
> the mutex is released, the guest could e.g. reset the ring and release
> the buffer which we're in the process of read()ing into.
>
> So, yes - dropping the mutex during read() in the zero-copy patch
> isn't safe.
>
> Another potential concern is that if we drop the mutex, the guest
> thread could delete an I/O handler while the I/O thread is in the I/O
> handler loop in main_loop_wait(). However, this seems to have been
> coded to handle this situation - the I/O handler would only be marked
> as deleted, and ignored by the loop.

I think it's safe. Still, I don't feel good about it.

>>> 5) Eliminating an extra copy on the host->guest path only makes a
>>>    barely measurable difference.
>>
>> That's expected on a host->guest test. Zero copy is mostly important
>> for guest->external, and with zerocopy already enabled in the guest
>> (sendfile or nfs server workloads).
>
> Hmm, could you elaborate on that?
>
> The copy we're eliminating here is an intermediate copy from tapfd
> into a buffer before copying to a guest buffer. It doesn't give you
> zero-copy as we still copy from kernel space to user space and
> vice-versa.

So long as we're eliminating intermediate copies, each elimination does
not bring us much.
It's the elimination of the last copy that brings the benefit (actually,
the last copy for each separate L2 cache; need to test on loaded
multisocket hosts).

There are broadly three workload categories wrt copying:

- server-side static http/nfs/smb protocols, serving to clients outside
  the host: guest is already copyless; serving from cache-cold buffers
- normal network servers that actually process their data: you'll have
  a guest kernel/user copy, and in any case, guest protocol processing
  means that the potential for gains from eliminating copies is limited
- guest/host benchmarks, which do a copy from a single buffer which is
  always in L1: zero copy is not going to show a gain (perhaps the
  opposite)

To get the first workload type optimized, we have to get the entire path
copyless.

-- 
error compiling committee.c: too many arguments to function
end of thread, other threads:[~2008-08-12 13:35 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-07-24 11:46 [PATCH 0/9][RFC] KVM virtio_net performance Mark McLoughlin
2008-07-24 11:46 ` [PATCH 1/9] kvm: qemu: Set MIN_TIMER_REARM_US to 150us Mark McLoughlin
2008-07-24 11:46 ` [PATCH 2/9] kvm: qemu: Fix virtio_net tx timer Mark McLoughlin
2008-07-24 11:46 ` [PATCH 3/9] kvm: qemu: Remove virtio_net tx ring-full heuristic Mark McLoughlin
2008-07-24 11:46 ` [PATCH 4/9] kvm: qemu: Add VIRTIO_F_NOTIFY_ON_EMPTY Mark McLoughlin
2008-07-24 11:46 ` [PATCH 5/9] kvm: qemu: Disable recv notifications until avail buffers exhausted Mark McLoughlin
2008-07-24 11:46 ` [PATCH 6/9] kvm: qemu: Add support for partial csums and GSO Mark McLoughlin
2008-07-24 11:46 ` [PATCH 7/9] kvm: qemu: Increase size of virtio_net rings Mark McLoughlin
2008-07-24 11:46 ` [PATCH 8/9] kvm: qemu: Drop the mutex while reading from tapfd Mark McLoughlin
2008-07-24 11:46 ` [PATCH 9/9] kvm: qemu: Eliminate extra virtio_net copy Mark McLoughlin
2008-07-24 23:33 ` [PATCH 8/9] kvm: qemu: Drop the mutex while reading from tapfd Dor Laor
2008-07-25 17:25 ` Mark McLoughlin
2008-07-24 23:22 ` [PATCH 3/9] kvm: qemu: Remove virtio_net tx ring-full heuristic Dor Laor
2008-07-25  0:30 ` Rusty Russell
2008-07-25 17:30 ` Mark McLoughlin
2008-07-25 17:23 ` Mark McLoughlin
2008-07-24 23:56 ` Dor Laor
2008-07-26  9:48 ` [PATCH 2/9] kvm: qemu: Fix virtio_net tx timer Avi Kivity
2008-07-26 12:08 ` Mark McLoughlin
2008-07-24 11:55 ` [PATCH 0/9][RFC] KVM virtio_net performance Herbert Xu
2008-07-24 16:53 ` Mark McLoughlin
2008-07-24 18:29 ` Anthony Liguori
2008-07-25 16:36 ` Mark McLoughlin
2008-07-24 20:56 ` Anthony Liguori
2008-07-25 17:17 ` Mark McLoughlin
2008-07-25 21:29 ` Dor Laor
2008-07-26 19:09 ` Bill Davidsen
2008-07-27  7:52 ` Avi Kivity
2008-07-27 12:52 ` Bill Davidsen
2008-07-27 13:17 ` Bill Davidsen
2008-07-28  6:42 ` Mark McLoughlin
2008-07-26  9:45 ` Avi Kivity
2008-07-27  6:48 ` Rusty Russell
2008-07-27  6:48 ` Rusty Russell
2008-08-11 19:56 ` Mark McLoughlin
2008-08-12 13:35 ` Avi Kivity