public inbox for kvm@vger.kernel.org
* [PATCH 0/6] Kill off the virtio_net tx mitigation timer
@ 2008-10-30 17:51 Mark McLoughlin
  2008-10-30 17:51 ` [PATCH 1/6] kvm: qemu: virtio: remove unused variable Mark McLoughlin
                   ` (3 more replies)
  0 siblings, 4 replies; 27+ messages in thread
From: Mark McLoughlin @ 2008-10-30 17:51 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm


Hey,

The main patch in this series is 5/6 - it just kills off the
virtio_net tx mitigation timer and does all the tx I/O in the
I/O thread.

Below are the results I got from benchmarking guest->host and
host->guest on my machine.

There are enough numbers there to make anyone blind, but basically
there are results for current kvm-userspace.git, with the
no-tx-timer patch applied and with the drop-the-mutex patch
applied.

Also, I've included results that show what difference some tuning
makes with all the patches applied. The tuning basically just
involves pinning the I/O thread and the netperf/netserver processes
in both the host and guest to two physical CPUs which share an L2
cache.

(Yes, the 1k buffer size results are weird - we think there's a bug
in recent kernels that causes us not to coalesce these small buffers
into a large GSO packet before sending)

Anyway, the results in all their glory:

                                |       guest->host tput     |      host->guest tput
  netperf, 10x20s runs (Gb/s)   |    min/ mean/   max/stddev |   min/  mean/   max/stddev
  ------------------------------+----------------------------+---------------------------
  kvm-userspace.git, 1k         | 0.600/ 0.645/ 0.670/ 0.025 | 5.170/ 5.285/ 5.470/ 0.087
  kvm-userspace.git, 16k        | 3.070/ 3.350/ 3.710/ 0.248 | 5.950/ 6.374/ 6.760/ 0.261
  kvm-userspace.git, 65k        | 4.950/ 6.041/ 7.170/ 0.639 | 5.480/ 5.642/ 5.810/ 0.092

  no tx timer, 1k               | 0.720/ 0.790/ 0.850/ 0.040 | 4.950/ 5.172/ 5.370/ 0.128
  no tx timer, 16k              | 4.120/ 4.512/ 4.740/ 0.190 | 4.900/ 5.480/ 6.230/ 0.416
  no tx timer, 65k              | 5.510/ 7.702/ 9.600/ 1.153 | 4.490/ 5.208/ 5.690/ 0.408

  drop mutex, 1k                | 0.810/ 0.847/ 0.910/ 0.030 | 5.140/ 5.416/ 5.660/ 0.145
  drop mutex, 16k               | 5.110/ 5.713/ 6.480/ 0.440 | 6.050/ 6.658/ 7.490/ 0.443
  drop mutex, 65k               | 7.070/ 8.054/ 9.210/ 0.671 | 4.470/ 5.922/ 7.200/ 0.930

  tuned, 1k                     | 4.260/ 6.739/ 8.020/ 1.075 | 3.640/ 3.777/ 4.020/ 0.110
  tuned, 16k                    | 8.530/ 8.867/ 9.110/ 0.146 | 7.220/ 7.276/ 7.360/ 0.040
  tuned, 65k                    | 9.200/ 9.282/ 9.870/ 0.198 | 7.850/ 7.924/ 8.000/ 0.045

                                |       guest->host %idle       |      host->guest %idle
  netperf, 10x20s runs (%idle)  |   avg/cpu#0/cpu#1/cpu#2/cpu#3 | avg/cpu#0/cpu#1/cpu#2/cpu#3
  ------------------------------+-------------------------------+-----------------------------
  kvm-userspace.git, 1k         | 64.93/65.94/61.06/72.89/60.31 | 39.36/52.52/19.04/63.94/21.35
  kvm-userspace.git, 16k        | 61.47/66.22/52.76/66.99/60.02 | 43.94/58.12/27.45/55.85/33.64
  kvm-userspace.git, 65k        | 61.37/82.58/51.55/51.15/60.16 | 49.30/69.80/37.14/51.75/38.24

  no tx timer, 1k               | 57.99/59.93/59.68/71.17/42.03 | 42.13/60.85/31.13/55.06/22.94
  no tx timer, 16k              | 54.87/55.57/55.88/61.35/46.67 | 49.18/54.35/40.56/59.42/43.46
  no tx timer, 65k              | 47.53/73.63/33.00/46.36/37.14 | 53.67/66.69/40.92/47.73/59.89

  drop mutex, 1k                | 52.94/61.90/48.44/64.35/38.25 | 39.31/53.29/26.35/60.90/17.53
  drop mutex, 16k               | 49.28/59.47/37.16/50.07/50.16 | 42.18/52.26/26.76/55.95/34.26
  drop mutex, 65k               | 49.65/65.94/41.38/47.55/42.55 | 46.36/42.51/60.55/45.15/37.91

  tuned, 1k                     | 57.68/99.81/99.96/18.76/21.91 | 65.49/99.84/99.97/15.20/52.46
  tuned, 16k                    | 61.66/99.85/99.98/13.01/40.83 | 61.43/99.79/99.93/0.97/47.61
  tuned, 65k                    | 62.57/99.99/99.95/13.13/39.28 | 60.15/99.81/99.97/1.27/43.10

Cheers,
Mark.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 1/6] kvm: qemu: virtio: remove unused variable
  2008-10-30 17:51 [PATCH 0/6] Kill off the virtio_net tx mitigation timer Mark McLoughlin
@ 2008-10-30 17:51 ` Mark McLoughlin
  2008-10-30 17:51   ` [PATCH 2/6] kvm: qemu: dup the qemu_eventfd() return Mark McLoughlin
  2008-10-30 19:20 ` [PATCH 0/6] Kill off the virtio_net tx mitigation timer Anthony Liguori
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 27+ messages in thread
From: Mark McLoughlin @ 2008-10-30 17:51 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, Mark McLoughlin

Remove a variable from virtqueue_pop() which has been unused since the
very start.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
---
 qemu/hw/virtio.c |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/qemu/hw/virtio.c b/qemu/hw/virtio.c
index e675f43..8fac354 100644
--- a/qemu/hw/virtio.c
+++ b/qemu/hw/virtio.c
@@ -144,7 +144,6 @@ void virtqueue_push(VirtQueue *vq, const VirtQueueElement *elem,
 int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
 {
     unsigned int i, head;
-    unsigned int position;
 
     /* Check it isn't doing very strange things with descriptor numbers. */
     if ((uint16_t)(vq->vring.avail->idx - vq->last_avail_idx) > vq->vring.num)
@@ -164,7 +163,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
 	errx(1, "Guest says index %u is available", head);
 
     /* When we start there are none of either input nor output. */
-    position = elem->out_num = elem->in_num = 0;
+    elem->out_num = elem->in_num = 0;
 
     i = head;
     do {
-- 
1.5.4.3



* [PATCH 2/6] kvm: qemu: dup the qemu_eventfd() return
  2008-10-30 17:51 ` [PATCH 1/6] kvm: qemu: virtio: remove unused variable Mark McLoughlin
@ 2008-10-30 17:51   ` Mark McLoughlin
  2008-10-30 17:51     ` [PATCH 3/6] kvm: qemu: add qemu_eventfd_write() and qemu_eventfd_read() Mark McLoughlin
  0 siblings, 1 reply; 27+ messages in thread
From: Mark McLoughlin @ 2008-10-30 17:51 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, Mark McLoughlin

qemu_eventfd() hands back two file descriptors; when the pipe()
fallback is used in the absence of eventfd(), callers must close
both of them.

Duplicate the eventfd() file descriptor so that the same close-both
teardown also works in the eventfd() case.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
---
 qemu/compatfd.c |    6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/qemu/compatfd.c b/qemu/compatfd.c
index cc5ced3..36e37e5 100644
--- a/qemu/compatfd.c
+++ b/qemu/compatfd.c
@@ -118,7 +118,11 @@ int qemu_eventfd(int *fds)
 
     ret = syscall(SYS_eventfd, 0);
     if (ret >= 0) {
-        fds[0] = fds[1] = ret;
+        fds[0] = ret;
+        if ((fds[1] = dup(ret)) == -1) {
+            close(ret);
+            return -1;
+        }
         return 0;
     }
 #endif
-- 
1.5.4.3



* [PATCH 3/6] kvm: qemu: add qemu_eventfd_write() and qemu_eventfd_read()
  2008-10-30 17:51   ` [PATCH 2/6] kvm: qemu: dup the qemu_eventfd() return Mark McLoughlin
@ 2008-10-30 17:51     ` Mark McLoughlin
  2008-10-30 17:51       ` [PATCH 4/6] kvm: qemu: aggregate reads from eventfd Mark McLoughlin
  0 siblings, 1 reply; 27+ messages in thread
From: Mark McLoughlin @ 2008-10-30 17:51 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, Mark McLoughlin

Split the common eventfd read/write operations out into helpers to
avoid re-implementing them at each call site.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
---
 qemu/compatfd.c |   48 ++++++++++++++++++++++++++++++++++++++++++++++++
 qemu/compatfd.h |    2 ++
 qemu/qemu-kvm.c |   42 ++++--------------------------------------
 3 files changed, 54 insertions(+), 38 deletions(-)

diff --git a/qemu/compatfd.c b/qemu/compatfd.c
index 36e37e5..7e3e7f7 100644
--- a/qemu/compatfd.c
+++ b/qemu/compatfd.c
@@ -129,3 +129,51 @@ int qemu_eventfd(int *fds)
 
     return pipe(fds);
 }
+
+int qemu_eventfd_write(int eventfd, uint64_t value)
+{
+    char buffer[8];
+    size_t offset = 0;
+
+    memcpy(buffer, &value, sizeof(value));
+
+    while (offset < 8) {
+	ssize_t len;
+
+	len = write(eventfd, buffer + offset, 8 - offset);
+	if (len == -1 && errno == EINTR)
+	    continue;
+
+	if (len <= 0)
+	    break;
+
+	offset += len;
+    }
+
+    return offset == 8 ? 0 : -1;
+}
+
+uint64_t qemu_eventfd_read(int eventfd)
+{
+    char buffer[8];
+    size_t offset = 0;
+    uint64_t value = 0;
+
+    while (offset < 8) {
+	ssize_t len;
+
+	len = read(eventfd, buffer + offset, 8 - offset);
+	if (len == -1 && errno == EINTR)
+	    continue;
+
+	if (len <= 0)
+	    break;
+
+	offset += len;
+    }
+
+    if (offset == 8)
+        memcpy(&value, buffer, sizeof(value));
+
+    return value;
+}
diff --git a/qemu/compatfd.h b/qemu/compatfd.h
index 55a111a..b2792e6 100644
--- a/qemu/compatfd.h
+++ b/qemu/compatfd.h
@@ -24,5 +24,7 @@ struct qemu_signalfd_siginfo {
 int qemu_signalfd(const sigset_t *mask);
 
 int qemu_eventfd(int *fds);
+int qemu_eventfd_write(int eventfd, uint64_t value);
+uint64_t qemu_eventfd_read(int eventfd);
 
 #endif
diff --git a/qemu/qemu-kvm.c b/qemu/qemu-kvm.c
index c5f3f29..70c6802 100644
--- a/qemu/qemu-kvm.c
+++ b/qemu/qemu-kvm.c
@@ -485,30 +485,11 @@ int kvm_init_ap(void)
 
 void qemu_kvm_notify_work(void)
 {
-    uint64_t value = 1;
-    char buffer[8];
-    size_t offset = 0;
-
     if (io_thread_fd == -1)
-	return;
-
-    memcpy(buffer, &value, sizeof(value));
-
-    while (offset < 8) {
-	ssize_t len;
-
-	len = write(io_thread_fd, buffer + offset, 8 - offset);
-	if (len == -1 && errno == EINTR)
-	    continue;
-
-	if (len <= 0)
-	    break;
-
-	offset += len;
-    }
+        return;
 
-    if (offset != 8)
-	fprintf(stderr, "failed to notify io thread\n");
+    if (qemu_eventfd_write(io_thread_fd, 1) != 0)
+        fprintf(stderr, "failed to notify io thread\n");
 }
 
 /* If we have signalfd, we mask out the signals we want to handle and then
@@ -546,22 +527,7 @@ static void sigfd_handler(void *opaque)
 /* Used to break IO thread out of select */
 static void io_thread_wakeup(void *opaque)
 {
-    int fd = (unsigned long)opaque;
-    char buffer[8];
-    size_t offset = 0;
-
-    while (offset < 8) {
-	ssize_t len;
-
-	len = read(fd, buffer + offset, 8 - offset);
-	if (len == -1 && errno == EINTR)
-	    continue;
-
-	if (len <= 0)
-	    break;
-
-	offset += len;
-    }
+    qemu_eventfd_read((unsigned long)opaque);
 }
 
 int kvm_main_loop(void)
-- 
1.5.4.3



* [PATCH 4/6] kvm: qemu: aggregate reads from eventfd
  2008-10-30 17:51     ` [PATCH 3/6] kvm: qemu: add qemu_eventfd_write() and qemu_eventfd_read() Mark McLoughlin
@ 2008-10-30 17:51       ` Mark McLoughlin
  2008-10-30 17:51         ` [PATCH 5/6] kvm: qemu: virtio-net: handle all tx in I/O thread without timer Mark McLoughlin
  0 siblings, 1 reply; 27+ messages in thread
From: Mark McLoughlin @ 2008-10-30 17:51 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, Mark McLoughlin

Keep reading from eventfd until it's empty and aggregate the results.

This only really matters when we're emulating eventfd using a pipe:
in that case we may have multiple "1" values queued up, and leaving
them unread could eventually fill the pipe buffer and block the
write side.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
---
 qemu/compatfd.c |   21 ++++++++++++++++++++-
 1 files changed, 20 insertions(+), 1 deletions(-)

diff --git a/qemu/compatfd.c b/qemu/compatfd.c
index 7e3e7f7..2da6dd6 100644
--- a/qemu/compatfd.c
+++ b/qemu/compatfd.c
@@ -153,7 +153,7 @@ int qemu_eventfd_write(int eventfd, uint64_t value)
     return offset == 8 ? 0 : -1;
 }
 
-uint64_t qemu_eventfd_read(int eventfd)
+static uint64_t qemu_eventfd_read_one(int eventfd)
 {
     char buffer[8];
     size_t offset = 0;
@@ -177,3 +177,22 @@ uint64_t qemu_eventfd_read(int eventfd)
 
     return value;
 }
+
+uint64_t qemu_eventfd_read(int eventfd)
+{
+    uint64_t ret;
+    fd_set fds;
+    struct timeval tv;
+
+    FD_ZERO(&fds);
+    FD_SET(eventfd, &fds);
+
+    tv.tv_sec = tv.tv_usec = 0;
+
+    ret = 0;
+    do {
+        ret += qemu_eventfd_read_one(eventfd);
+    } while (select(eventfd + 1, &fds, NULL, NULL, &tv) > 0);
+
+    return ret;
+}
-- 
1.5.4.3



* [PATCH 5/6] kvm: qemu: virtio-net: handle all tx in I/O thread without timer
  2008-10-30 17:51       ` [PATCH 4/6] kvm: qemu: aggregate reads from eventfd Mark McLoughlin
@ 2008-10-30 17:51         ` Mark McLoughlin
  2008-10-30 17:51           ` [PATCH 6/6] kvm: qemu: virtio-net: drop mutex during tx tapfd write Mark McLoughlin
                             ` (3 more replies)
  0 siblings, 4 replies; 27+ messages in thread
From: Mark McLoughlin @ 2008-10-30 17:51 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, Mark McLoughlin

By removing the tx timer altogether and doing all the copies in the
I/O thread, we can keep the I/O churning away in parallel with the
guest generating more I/O.

In my tests, this significantly increases guest->host throughput,
causes a minor increase in host->guest throughput, reduces CPU
utilization somewhat and greatly reduces roundtrip times.

Even aside from the benchmark results, removing the arbitrary 150us
timer is a nicer option than coming up with a heuristic to make it
vary according to load. Finally, on kernels which don't have a
suitably low posix timer latency, we won't be scuppered by effectively
having e.g. a 1ms timer.

Note, this highlights that the I/O thread may become a scalability
concern and we might want to consider e.g. an I/O thread per device.

Note also that when tuning for a specific workload, which CPU
the I/O thread is pinned to is important.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
---
 qemu/hw/virtio-net.c |   79 ++++++++++++++++++++++++++++---------------------
 1 files changed, 45 insertions(+), 34 deletions(-)

diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c
index bc2ede6..0612f5f 100644
--- a/qemu/hw/virtio-net.c
+++ b/qemu/hw/virtio-net.c
@@ -15,6 +15,8 @@
 #include "net.h"
 #include "qemu-timer.h"
 #include "qemu-kvm.h"
+#include "qemu-char.h"
+#include "compatfd.h"
 
 /* from Linux's virtio_net.h */
 
@@ -35,8 +37,6 @@
 #define VIRTIO_NET_F_HOST_ECN	13	/* Host can handle TSO[6] w/ ECN in. */
 #define VIRTIO_NET_F_HOST_UFO	14	/* Host can handle UFO in. */
 
-#define TX_TIMER_INTERVAL 150000 /* 150 us */
-
 /* The config defining mac address (6 bytes) */
 struct virtio_net_config
 {
@@ -68,8 +68,7 @@ typedef struct VirtIONet
     VirtQueue *rx_vq;
     VirtQueue *tx_vq;
     VLANClientState *vc;
-    QEMUTimer *tx_timer;
-    int tx_timer_active;
+    int tx_eventfds[2];
 } VirtIONet;
 
 /* TODO
@@ -227,13 +226,14 @@ static void virtio_net_receive(void *opaque, const uint8_t *buf, int size)
 }
 
 /* TX */
-static void virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq)
+static int virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq)
 {
     VirtQueueElement elem;
     int has_vnet_hdr = tap_has_vnet_hdr(n->vc->vlan->first_client);
+    int num_packets = 0;
 
     if (!(n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK))
-        return;
+        return num_packets;
 
     while (virtqueue_pop(vq, &elem)) {
 	ssize_t len = 0;
@@ -256,38 +256,31 @@ static void virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq)
 
 	virtqueue_push(vq, &elem, len);
 	virtio_notify(&n->vdev, vq);
+
+	num_packets++;
     }
+
+    return num_packets;
 }
 
 static void virtio_net_handle_tx(VirtIODevice *vdev, VirtQueue *vq)
 {
     VirtIONet *n = to_virtio_net(vdev);
 
-    if (n->tx_timer_active) {
-	vq->vring.used->flags &= ~VRING_USED_F_NO_NOTIFY;
-	qemu_del_timer(n->tx_timer);
-	n->tx_timer_active = 0;
-	virtio_net_flush_tx(n, vq);
-    } else {
-	qemu_mod_timer(n->tx_timer,
-		       qemu_get_clock(vm_clock) + TX_TIMER_INTERVAL);
-	n->tx_timer_active = 1;
-	vq->vring.used->flags |= VRING_USED_F_NO_NOTIFY;
-    }
+    vq->vring.used->flags |= VRING_USED_F_NO_NOTIFY;
+    qemu_eventfd_write(n->tx_eventfds[1], 1);
 }
 
-static void virtio_net_tx_timer(void *opaque)
+static void virtio_net_tx_event(void *opaque)
 {
     VirtIONet *n = opaque;
 
-    n->tx_timer_active = 0;
+    qemu_eventfd_read(n->tx_eventfds[0]);
 
-    /* Just in case the driver is not ready on more */
-    if (!(n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK))
-        return;
-
-    n->tx_vq->vring.used->flags &= ~VRING_USED_F_NO_NOTIFY;
-    virtio_net_flush_tx(n, n->tx_vq);
+    if (!virtio_net_flush_tx(n, n->tx_vq))
+        n->tx_vq->vring.used->flags &= ~VRING_USED_F_NO_NOTIFY;
+    else
+        qemu_eventfd_write(n->tx_eventfds[1], 1);
 }
 
 static void virtio_net_save(QEMUFile *f, void *opaque)
@@ -297,7 +290,6 @@ static void virtio_net_save(QEMUFile *f, void *opaque)
     virtio_save(&n->vdev, f);
 
     qemu_put_buffer(f, n->mac, 6);
-    qemu_put_be32(f, n->tx_timer_active);
 }
 
 static int virtio_net_load(QEMUFile *f, void *opaque, int version_id)
@@ -310,12 +302,16 @@ static int virtio_net_load(QEMUFile *f, void *opaque, int version_id)
     virtio_load(&n->vdev, f);
 
     qemu_get_buffer(f, n->mac, 6);
-    n->tx_timer_active = qemu_get_be32(f);
 
-    if (n->tx_timer_active) {
-	qemu_mod_timer(n->tx_timer,
-		       qemu_get_clock(vm_clock) + TX_TIMER_INTERVAL);
-    }
+    return 0;
+}
+
+static int virtio_net_uninit(PCIDevice *dev)
+{
+    VirtIONet *n = (VirtIONet *)dev;
+
+    close(n->tx_eventfds[0]);
+    close(n->tx_eventfds[1]);
 
     return 0;
 }
@@ -324,13 +320,23 @@ PCIDevice *virtio_net_init(PCIBus *bus, NICInfo *nd, int devfn)
 {
     VirtIONet *n;
     static int virtio_net_id;
+    int eventfds[2];
+
+    if (qemu_eventfd(eventfds) == -1) {
+        fprintf(stderr, "Failed to create eventfds : %s\n",
+                strerror(errno));
+        return NULL;
+    }
 
     n = (VirtIONet *)virtio_init_pci(bus, "virtio-net", 6900, 0x1000,
 				     0, VIRTIO_ID_NET,
 				     0x02, 0x00, 0x00,
 				     6, sizeof(VirtIONet));
-    if (!n)
+    if (!n) {
+	close(eventfds[0]);
+	close(eventfds[1]);
 	return NULL;
+    }
 
     n->vdev.get_config = virtio_net_update_config;
     n->vdev.get_features = virtio_net_get_features;
@@ -341,8 +347,13 @@ PCIDevice *virtio_net_init(PCIBus *bus, NICInfo *nd, int devfn)
     n->vc = qemu_new_vlan_client(nd->vlan, virtio_net_receive,
                                  virtio_net_can_receive, n);
 
-    n->tx_timer = qemu_new_timer(vm_clock, virtio_net_tx_timer, n);
-    n->tx_timer_active = 0;
+    fcntl(eventfds[0], F_SETFL, O_NONBLOCK);
+    n->tx_eventfds[0] = eventfds[0];
+    fcntl(eventfds[1], F_SETFL, O_NONBLOCK);
+    n->tx_eventfds[1] = eventfds[1];
+
+    n->vdev.pci_dev.unregister = virtio_net_uninit;
+    qemu_set_fd_handler2(n->tx_eventfds[0], NULL, virtio_net_tx_event, NULL, n);
 
     register_savevm("virtio-net", virtio_net_id++, 1,
 		    virtio_net_save, virtio_net_load, n);
-- 
1.5.4.3



* [PATCH 6/6] kvm: qemu: virtio-net: drop mutex during tx tapfd write
  2008-10-30 17:51         ` [PATCH 5/6] kvm: qemu: virtio-net: handle all tx in I/O thread without timer Mark McLoughlin
@ 2008-10-30 17:51           ` Mark McLoughlin
  2008-11-04 11:43             ` Avi Kivity
  2008-10-30 19:24           ` [PATCH 5/6] kvm: qemu: virtio-net: handle all tx in I/O thread without timer Anthony Liguori
                             ` (2 subsequent siblings)
  3 siblings, 1 reply; 27+ messages in thread
From: Mark McLoughlin @ 2008-10-30 17:51 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, Mark McLoughlin

This allows the guest vcpu thread to exit while the I/O thread is
churning away.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
---
 qemu/hw/virtio-net.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c
index 0612f5f..aa1c107 100644
--- a/qemu/hw/virtio-net.c
+++ b/qemu/hw/virtio-net.c
@@ -252,7 +252,9 @@ static int virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq)
 	    len += sizeof(struct virtio_net_hdr);
 	}
 
+	kvm_sleep_begin();
 	len += qemu_sendv_packet(n->vc, out_sg, out_num);
+	kvm_sleep_end();
 
 	virtqueue_push(vq, &elem, len);
 	virtio_notify(&n->vdev, vq);
-- 
1.5.4.3



* Re: [PATCH 0/6] Kill off the virtio_net tx mitigation timer
  2008-10-30 17:51 [PATCH 0/6] Kill off the virtio_net tx mitigation timer Mark McLoughlin
  2008-10-30 17:51 ` [PATCH 1/6] kvm: qemu: virtio: remove unused variable Mark McLoughlin
@ 2008-10-30 19:20 ` Anthony Liguori
  2008-11-02  9:48 ` Avi Kivity
  2008-11-02  9:57 ` Avi Kivity
  3 siblings, 0 replies; 27+ messages in thread
From: Anthony Liguori @ 2008-10-30 19:20 UTC (permalink / raw)
  To: Mark McLoughlin; +Cc: Avi Kivity, kvm

Mark McLoughlin wrote:
> Hey,
>
> The main patch in this series is 5/6 - it just kills off the
> virtio_net tx mitigation timer and does all the tx I/O in the
> I/O thread.
>
> Below are the results I got from benchmarking guest->host and
> host->guest on my machine.
>
> There's enough numbers there to make anyone blind, but basically
> there are results for current kvm-userspace.git, with the
> no-tx-timer patch applied and with the drop-the-mutex patch
> applied.
>
> Also, I've included results that show what difference some tuning
> makes with all the patches applied. The tuning basically just
> involves pinning the I/O thread and the netperf/netserver processes
> in both the host and guest to two physical CPUs which share a L2
> cache.
>
> (Yes, the 1k buffer size results are weird - we think there's a bug
> in recent kernels that causes us not to coalesce these small buffers
> into a large GSO packet before sending)
>
> Anyway, the results in all their glory:
>
>                                 |       guest->host tput     |      host->guest tput
>   netperf, 10x20s runs (Gb/s)   |    min/ mean/   max/stddev |   min/  mean/   max/stddev
>   ------------------------------+----------------------------+---------------------------
>   kvm-userspace.git, 1k         | 0.600/ 0.645/ 0.670/ 0.025 | 5.170/ 5.285/ 5.470/ 0.087
>   kvm-userspace.git, 16k        | 3.070/ 3.350/ 3.710/ 0.248 | 5.950/ 6.374/ 6.760/ 0.261
>   kvm-userspace.git, 65k        | 4.950/ 6.041/ 7.170/ 0.639 | 5.480/ 5.642/ 5.810/ 0.092
>
>   no tx timer, 1k               | 0.720/ 0.790/ 0.850/ 0.040 | 4.950/ 5.172/ 5.370/ 0.128
>   no tx timer, 16k              | 4.120/ 4.512/ 4.740/ 0.190 | 4.900/ 5.480/ 6.230/ 0.416
>   no tx timer, 65k              | 5.510/ 7.702/ 9.600/ 1.153 | 4.490/ 5.208/ 5.690/ 0.408
>   

Okay, I don't see 3/6 yet, but does no tx timer mean just no tx timer or 
no tx timer + handling IO in the IO thread?

Regards,

Anthony Liguori



* Re: [PATCH 5/6] kvm: qemu: virtio-net: handle all tx in I/O thread without timer
  2008-10-30 17:51         ` [PATCH 5/6] kvm: qemu: virtio-net: handle all tx in I/O thread without timer Mark McLoughlin
  2008-10-30 17:51           ` [PATCH 6/6] kvm: qemu: virtio-net: drop mutex during tx tapfd write Mark McLoughlin
@ 2008-10-30 19:24           ` Anthony Liguori
  2008-10-31  9:16             ` Mark McLoughlin
  2008-11-02  9:56           ` Avi Kivity
  2008-11-04 15:23           ` David S. Ahern
  3 siblings, 1 reply; 27+ messages in thread
From: Anthony Liguori @ 2008-10-30 19:24 UTC (permalink / raw)
  To: Mark McLoughlin; +Cc: Avi Kivity, kvm

Mark McLoughlin wrote:
> By removing the tx timer altogether and doing all the copies in the
> I/O thread, we can keep the I/O churning away in parallel with the
> guest generating more I/O.
>
> In my tests, this significantly increases guest->host throughput,
> causes a minor increase in host->guest throughput, reduces CPU
> utilization somewhat and greatly reduces roundtrip times.
>
> Even aside from the benchmark results, removing the arbitrary 150us
> timer is a nicer option than coming up with a heuristic to make it
> vary according to load. Finally, on kernels which don't have a
> suitably low posix timer latency, we won't be scuppered by effectively
> having e.g. a 1ms timer.
>
> Note, this highlights that the I/O thread may become a scalability
> concern and we might want to consider e.g. an I/O thread per device.
>
> Note also that when tuning for a specific workload, which CPU
> the I/O thread is pinned to is important.
>   

Instead of using an event fd, perhaps you could just schedule a bottom 
half?  I think that would be a whole lot cleaner.

Regards,

Anthony Liguori

> Signed-off-by: Mark McLoughlin <markmc@redhat.com>
> ---
>  qemu/hw/virtio-net.c |   79 ++++++++++++++++++++++++++++---------------------
>  1 files changed, 45 insertions(+), 34 deletions(-)
>
> diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c
> index bc2ede6..0612f5f 100644
> --- a/qemu/hw/virtio-net.c
> +++ b/qemu/hw/virtio-net.c
> @@ -15,6 +15,8 @@
>  #include "net.h"
>  #include "qemu-timer.h"
>  #include "qemu-kvm.h"
> +#include "qemu-char.h"
> +#include "compatfd.h"
>  
>  /* from Linux's virtio_net.h */
>  
> @@ -35,8 +37,6 @@
>  #define VIRTIO_NET_F_HOST_ECN	13	/* Host can handle TSO[6] w/ ECN in. */
>  #define VIRTIO_NET_F_HOST_UFO	14	/* Host can handle UFO in. */
>  
> -#define TX_TIMER_INTERVAL 150000 /* 150 us */
> -
>  /* The config defining mac address (6 bytes) */
>  struct virtio_net_config
>  {
> @@ -68,8 +68,7 @@ typedef struct VirtIONet
>      VirtQueue *rx_vq;
>      VirtQueue *tx_vq;
>      VLANClientState *vc;
> -    QEMUTimer *tx_timer;
> -    int tx_timer_active;
> +    int tx_eventfds[2];
>  } VirtIONet;
>  
>  /* TODO
> @@ -227,13 +226,14 @@ static void virtio_net_receive(void *opaque, const uint8_t *buf, int size)
>  }
>  
>  /* TX */
> -static void virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq)
> +static int virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq)
>  {
>      VirtQueueElement elem;
>      int has_vnet_hdr = tap_has_vnet_hdr(n->vc->vlan->first_client);
> +    int num_packets = 0;
>  
>      if (!(n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK))
> -        return;
> +        return num_packets;
>  
>      while (virtqueue_pop(vq, &elem)) {
>  	ssize_t len = 0;
> @@ -256,38 +256,31 @@ static void virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq)
>  
>  	virtqueue_push(vq, &elem, len);
>  	virtio_notify(&n->vdev, vq);
> +
> +	num_packets++;
>      }
> +
> +    return num_packets;
>  }
>  
>  static void virtio_net_handle_tx(VirtIODevice *vdev, VirtQueue *vq)
>  {
>      VirtIONet *n = to_virtio_net(vdev);
>  
> -    if (n->tx_timer_active) {
> -	vq->vring.used->flags &= ~VRING_USED_F_NO_NOTIFY;
> -	qemu_del_timer(n->tx_timer);
> -	n->tx_timer_active = 0;
> -	virtio_net_flush_tx(n, vq);
> -    } else {
> -	qemu_mod_timer(n->tx_timer,
> -		       qemu_get_clock(vm_clock) + TX_TIMER_INTERVAL);
> -	n->tx_timer_active = 1;
> -	vq->vring.used->flags |= VRING_USED_F_NO_NOTIFY;
> -    }
> +    vq->vring.used->flags |= VRING_USED_F_NO_NOTIFY;
> +    qemu_eventfd_write(n->tx_eventfds[1], 1);
>  }
>  
> -static void virtio_net_tx_timer(void *opaque)
> +static void virtio_net_tx_event(void *opaque)
>  {
>      VirtIONet *n = opaque;
>  
> -    n->tx_timer_active = 0;
> +    qemu_eventfd_read(n->tx_eventfds[0]);
>  
> -    /* Just in case the driver is not ready on more */
> -    if (!(n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK))
> -        return;
> -
> -    n->tx_vq->vring.used->flags &= ~VRING_USED_F_NO_NOTIFY;
> -    virtio_net_flush_tx(n, n->tx_vq);
> +    if (!virtio_net_flush_tx(n, n->tx_vq))
> +        n->tx_vq->vring.used->flags &= ~VRING_USED_F_NO_NOTIFY;
> +    else
> +        qemu_eventfd_write(n->tx_eventfds[1], 1);
>  }
>  
>  static void virtio_net_save(QEMUFile *f, void *opaque)
> @@ -297,7 +290,6 @@ static void virtio_net_save(QEMUFile *f, void *opaque)
>      virtio_save(&n->vdev, f);
>  
>      qemu_put_buffer(f, n->mac, 6);
> -    qemu_put_be32(f, n->tx_timer_active);
>  }
>  
>  static int virtio_net_load(QEMUFile *f, void *opaque, int version_id)
> @@ -310,12 +302,16 @@ static int virtio_net_load(QEMUFile *f, void *opaque, int version_id)
>      virtio_load(&n->vdev, f);
>  
>      qemu_get_buffer(f, n->mac, 6);
> -    n->tx_timer_active = qemu_get_be32(f);
>  
> -    if (n->tx_timer_active) {
> -	qemu_mod_timer(n->tx_timer,
> -		       qemu_get_clock(vm_clock) + TX_TIMER_INTERVAL);
> -    }
> +    return 0;
> +}
> +
> +static int virtio_net_uninit(PCIDevice *dev)
> +{
> +    VirtIONet *n = (VirtIONet *)dev;
> +
> +    close(n->tx_eventfds[0]);
> +    close(n->tx_eventfds[1]);
>  
>      return 0;
>  }
> @@ -324,13 +320,23 @@ PCIDevice *virtio_net_init(PCIBus *bus, NICInfo *nd, int devfn)
>  {
>      VirtIONet *n;
>      static int virtio_net_id;
> +    int eventfds[2];
> +
> +    if (qemu_eventfd(eventfds) == -1) {
> +        fprintf(stderr, "Failed to create eventfds : %s\n",
> +                strerror(errno));
> +        return NULL;
> +    }
>  
>      n = (VirtIONet *)virtio_init_pci(bus, "virtio-net", 6900, 0x1000,
>  				     0, VIRTIO_ID_NET,
>  				     0x02, 0x00, 0x00,
>  				     6, sizeof(VirtIONet));
> -    if (!n)
> +    if (!n) {
> +	close(eventfds[0]);
> +	close(eventfds[1]);
>  	return NULL;
> +    }
>  
>      n->vdev.get_config = virtio_net_update_config;
>      n->vdev.get_features = virtio_net_get_features;
> @@ -341,8 +347,13 @@ PCIDevice *virtio_net_init(PCIBus *bus, NICInfo *nd, int devfn)
>      n->vc = qemu_new_vlan_client(nd->vlan, virtio_net_receive,
>                                   virtio_net_can_receive, n);
>  
> -    n->tx_timer = qemu_new_timer(vm_clock, virtio_net_tx_timer, n);
> -    n->tx_timer_active = 0;
> +    fcntl(eventfds[0], F_SETFL, O_NONBLOCK);
> +    n->tx_eventfds[0] = eventfds[0];
> +    fcntl(eventfds[1], F_SETFL, O_NONBLOCK);
> +    n->tx_eventfds[1] = eventfds[1];
> +
> +    n->vdev.pci_dev.unregister = virtio_net_uninit;
> +    qemu_set_fd_handler2(n->tx_eventfds[0], NULL, virtio_net_tx_event, NULL, n);
>  
>      register_savevm("virtio-net", virtio_net_id++, 1,
>  		    virtio_net_save, virtio_net_load, n);
>   



* Re: [PATCH 5/6] kvm: qemu: virtio-net: handle all tx in I/O thread without timer
  2008-10-30 19:24           ` [PATCH 5/6] kvm: qemu: virtio-net: handle all tx in I/O thread without timer Anthony Liguori
@ 2008-10-31  9:16             ` Mark McLoughlin
  2008-11-03 15:07               ` Mark McLoughlin
  0 siblings, 1 reply; 27+ messages in thread
From: Mark McLoughlin @ 2008-10-31  9:16 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Avi Kivity, kvm

On Thu, 2008-10-30 at 14:24 -0500, Anthony Liguori wrote:

> Instead of using an event fd, perhaps you could just schedule a bottom 
> half?  I think that would be a whole lot cleaner.

Nice, I hadn't noticed the bottom halves. Much cleaner, indeed.

Results are a little better too:

                                |       guest->host tput     |      host->guest tput
  netperf, 10x20s runs (Gb/s)   |    min/ mean/   max/stddev |   min/  mean/   max/stddev
  ------------------------------+----------------------------+---------------------------
  eventfd, 1k                   | 4.260/ 6.739/ 8.020/ 1.075 | 3.640/ 3.777/ 4.020/ 0.110
  eventfd, 16k                  | 8.530/ 8.867/ 9.110/ 0.146 | 7.220/ 7.276/ 7.360/ 0.040
  eventfd, 65k                  | 9.200/ 9.282/ 9.870/ 0.198 | 7.850/ 7.924/ 8.000/ 0.045

  bottom half, 1k               | 4.890/ 6.712/ 7.800/ 0.800 | 4.900/ 5.002/ 5.130/ 0.067
  bottom half, 16k              | 8.640/ 8.827/ 9.290/ 0.186 | 7.920/ 7.963/ 8.020/ 0.034
  bottom half, 65k              | 9.310/ 9.323/ 9.350/ 0.010 | 8.740/ 8.769/ 8.800/ 0.016

                                |       guest->host %idle       |      host->guest %idle
  netperf, 10x20s runs (% idle) |   avg/cpu#0/cpu#1/cpu#2/cpu#3 | avg/cpu#0/cpu#1/cpu#2/cpu#3
  ------------------------------+-------------------------------+-----------------------------
  eventfd, 1k                   | 57.68/99.81/99.96/18.76/21.91 | 65.49/99.84/99.97/15.20/52.46
  eventfd, 16k                  | 61.66/99.85/99.98/13.01/40.83 | 61.43/99.79/99.93/0.97/47.61
  eventfd, 65k                  | 62.57/99.99/99.95/13.13/39.28 | 60.15/99.81/99.97/1.27/43.10

  bottom half, 1k               | 57.29/99.83/99.81/22.30/15.93 | 63.36/99.98/99.95/2.92/51.26
  bottom half, 16k              | 62.21/99.81/99.96/15.74/42.40 | 60.21/99.81/99.97/0.92/43.78
  bottom half, 65k              | 62.25/99.99/99.96/12.67/37.45 | 58.63/99.83/99.96/0.92/36.71

(With this approach, patches 2-4 aren't needed anymore, but they're
still worthwhile cleanups IMHO)

Cheers,
Mark.

From: Mark McLoughlin <markmc@redhat.com>
Subject: kvm: qemu: virtio-net: handle all tx in I/O thread without timer

By removing the tx timer altogether and doing all the copies in the
I/O thread, we can keep the I/O churning away in parallel with the
guest generating more I/O.

In my tests, this significantly increases guest->host throughput,
causes a minor increase in host->guest throughput, reduces CPU
utilization somewhat and greatly reduces roundtrip times.

Even aside from the benchmark results, removing the arbitrary 150us
timer is a nicer option than coming up with a heuristic to make it
vary according to load. Finally, on kernels which don't have a
suitably low posix timer latency, we won't be scuppered by effectively
having e.g. a 1ms timer.

Note, this highlights that the I/O thread may become a scalability
concern and we might want to consider e.g. an I/O thread per device.

Note also that when tuning for a specific workload, which CPU
the I/O thread is pinned to is important.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
---
 qemu/hw/virtio-net.c |   60 +++++++++++++++++++++----------------------------
 1 files changed, 26 insertions(+), 34 deletions(-)

diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c
index bc2ede6..bc198ef 100644
--- a/qemu/hw/virtio-net.c
+++ b/qemu/hw/virtio-net.c
@@ -35,8 +35,6 @@
 #define VIRTIO_NET_F_HOST_ECN	13	/* Host can handle TSO[6] w/ ECN in. */
 #define VIRTIO_NET_F_HOST_UFO	14	/* Host can handle UFO in. */
 
-#define TX_TIMER_INTERVAL 150000 /* 150 us */
-
 /* The config defining mac address (6 bytes) */
 struct virtio_net_config
 {
@@ -68,8 +66,7 @@ typedef struct VirtIONet
     VirtQueue *rx_vq;
     VirtQueue *tx_vq;
     VLANClientState *vc;
-    QEMUTimer *tx_timer;
-    int tx_timer_active;
+    QEMUBH *tx_bh;
 } VirtIONet;
 
 /* TODO
@@ -227,13 +224,14 @@ static void virtio_net_receive(void *opaque, const uint8_t *buf, int size)
 }
 
 /* TX */
-static void virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq)
+static int virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq)
 {
     VirtQueueElement elem;
     int has_vnet_hdr = tap_has_vnet_hdr(n->vc->vlan->first_client);
+    int num_packets = 0;
 
     if (!(n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK))
-        return;
+        return num_packets;
 
     while (virtqueue_pop(vq, &elem)) {
 	ssize_t len = 0;
@@ -256,38 +254,29 @@ static void virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq)
 
 	virtqueue_push(vq, &elem, len);
 	virtio_notify(&n->vdev, vq);
+
+	num_packets++;
     }
+
+    return num_packets;
 }
 
 static void virtio_net_handle_tx(VirtIODevice *vdev, VirtQueue *vq)
 {
     VirtIONet *n = to_virtio_net(vdev);
 
-    if (n->tx_timer_active) {
-	vq->vring.used->flags &= ~VRING_USED_F_NO_NOTIFY;
-	qemu_del_timer(n->tx_timer);
-	n->tx_timer_active = 0;
-	virtio_net_flush_tx(n, vq);
-    } else {
-	qemu_mod_timer(n->tx_timer,
-		       qemu_get_clock(vm_clock) + TX_TIMER_INTERVAL);
-	n->tx_timer_active = 1;
-	vq->vring.used->flags |= VRING_USED_F_NO_NOTIFY;
-    }
+    vq->vring.used->flags |= VRING_USED_F_NO_NOTIFY;
+    qemu_bh_schedule(n->tx_bh);
 }
 
-static void virtio_net_tx_timer(void *opaque)
+static void virtio_net_tx_bh(void *opaque)
 {
     VirtIONet *n = opaque;
 
-    n->tx_timer_active = 0;
-
-    /* Just in case the driver is not ready on more */
-    if (!(n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK))
-        return;
-
-    n->tx_vq->vring.used->flags &= ~VRING_USED_F_NO_NOTIFY;
-    virtio_net_flush_tx(n, n->tx_vq);
+    if (!virtio_net_flush_tx(n, n->tx_vq))
+        n->tx_vq->vring.used->flags &= ~VRING_USED_F_NO_NOTIFY;
+    else
+        qemu_bh_schedule(n->tx_bh);
 }
 
 static void virtio_net_save(QEMUFile *f, void *opaque)
@@ -297,7 +286,6 @@ static void virtio_net_save(QEMUFile *f, void *opaque)
     virtio_save(&n->vdev, f);
 
     qemu_put_buffer(f, n->mac, 6);
-    qemu_put_be32(f, n->tx_timer_active);
 }
 
 static int virtio_net_load(QEMUFile *f, void *opaque, int version_id)
@@ -310,12 +298,15 @@ static int virtio_net_load(QEMUFile *f, void *opaque, int version_id)
     virtio_load(&n->vdev, f);
 
     qemu_get_buffer(f, n->mac, 6);
-    n->tx_timer_active = qemu_get_be32(f);
 
-    if (n->tx_timer_active) {
-	qemu_mod_timer(n->tx_timer,
-		       qemu_get_clock(vm_clock) + TX_TIMER_INTERVAL);
-    }
+    return 0;
+}
+
+static int virtio_net_uninit(PCIDevice *dev)
+{
+    VirtIONet *n = (VirtIONet *)dev;
+
+    qemu_bh_delete(n->tx_bh);
 
     return 0;
 }
@@ -341,8 +332,9 @@ PCIDevice *virtio_net_init(PCIBus *bus, NICInfo *nd, int devfn)
     n->vc = qemu_new_vlan_client(nd->vlan, virtio_net_receive,
                                  virtio_net_can_receive, n);
 
-    n->tx_timer = qemu_new_timer(vm_clock, virtio_net_tx_timer, n);
-    n->tx_timer_active = 0;
+    n->tx_bh = qemu_bh_new(virtio_net_tx_bh, n);
+
+    n->vdev.pci_dev.unregister = virtio_net_uninit;
 
     register_savevm("virtio-net", virtio_net_id++, 1,
 		    virtio_net_save, virtio_net_load, n);
-- 
1.5.4.3



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/6] Kill off the virtio_net tx mitigation timer
  2008-10-30 17:51 [PATCH 0/6] Kill off the virtio_net tx mitigation timer Mark McLoughlin
  2008-10-30 17:51 ` [PATCH 1/6] kvm: qemu: virtio: remove unused variable Mark McLoughlin
  2008-10-30 19:20 ` [PATCH 0/6] Kill off the virtio_net tx mitigation timer Anthony Liguori
@ 2008-11-02  9:48 ` Avi Kivity
  2008-11-03 12:23   ` Mark McLoughlin
  2008-11-02  9:57 ` Avi Kivity
  3 siblings, 1 reply; 27+ messages in thread
From: Avi Kivity @ 2008-11-02  9:48 UTC (permalink / raw)
  To: Mark McLoughlin; +Cc: Avi Kivity, kvm

Mark McLoughlin wrote:
> Hey,
>
> The main patch in this series is 5/6 - it just kills off the
> virtio_net tx mitigation timer and does all the tx I/O in the
> I/O thread.
>
>   

What will it do to small packet, multi-flow loads (simulated by ping -f 
-l 30 $external)?

Where does the benefit come from?  Is the overhead of managing the timer 
too high, or does it fire too late and so we sleep?  If the latter, can 
we tune it dynamically?

For example, if the guest sees it is making a lot of progress without 
the host catching up (waiting on the tx timer), it can 
kick_I_really_mean_this_now(), to get the host to notice.


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 5/6] kvm: qemu: virtio-net: handle all tx in I/O thread without timer
  2008-10-30 17:51         ` [PATCH 5/6] kvm: qemu: virtio-net: handle all tx in I/O thread without timer Mark McLoughlin
  2008-10-30 17:51           ` [PATCH 6/6] kvm: qemu: virtio-net: drop mutex during tx tapfd write Mark McLoughlin
  2008-10-30 19:24           ` [PATCH 5/6] kvm: qemu: virtio-net: handle all tx in I/O thread without timer Anthony Liguori
@ 2008-11-02  9:56           ` Avi Kivity
  2008-11-04 15:23           ` David S. Ahern
  3 siblings, 0 replies; 27+ messages in thread
From: Avi Kivity @ 2008-11-02  9:56 UTC (permalink / raw)
  To: Mark McLoughlin; +Cc: Avi Kivity, kvm

Mark McLoughlin wrote:
> By removing the tx timer altogether and doing all the copies in the
> I/O thread, we can keep the I/O churning away in parallel with the
> guest generating more I/O.
>
>   

On a multi-socket machine, you may also be doing the copy on the wrong 
cache.  We're also now increasing latency, and serializing all NICs on 
one thread.

Of course, the only true answer is to avoid the copy completely, but we 
aren't there yet.  I'm not sure how to tradeoff the benefits below with 
the problems above.

> In my tests, this significantly increases guest->host throughput,
> causes a minor increase in host->guest throughput, reduces CPU
> utilization somewhat and greatly reduces roundtrip times.
>
> Even aside from the benchmark results, removing the arbitrary 150us
> timer is a nicer option than coming up with a heuristic to make it
> vary according to load. Finally, on kernels which don't have a
> suitably low posix timer latency, we won't be scuppered by effectively
> having e.g. a 1ms timer.
>
> Note, this highlights that the I/O thread may become a scalability
> concern and we might want to consider e.g. an I/O thread per device.
>
> Note also that when tuning for a specific workload, which CPU
> the I/O thread is pinned to is important.
>   

This is a significant drawback.  Maybe we need a thread per virtio nic 
to do the copying, and affine it to the current cpu before handing off?  
But no, this will either serialize the vcpu with the copy thread, or will 
force the vcpu to migrate.


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/6] Kill off the virtio_net tx mitigation timer
  2008-10-30 17:51 [PATCH 0/6] Kill off the virtio_net tx mitigation timer Mark McLoughlin
                   ` (2 preceding siblings ...)
  2008-11-02  9:48 ` Avi Kivity
@ 2008-11-02  9:57 ` Avi Kivity
  3 siblings, 0 replies; 27+ messages in thread
From: Avi Kivity @ 2008-11-02  9:57 UTC (permalink / raw)
  To: Mark McLoughlin; +Cc: Avi Kivity, kvm

Mark McLoughlin wrote:
> Hey,
>
> The main patch in this series is 5/6 - it just kills off the
> virtio_net tx mitigation timer and does all the tx I/O in the
> I/O thread.
>
> Below are the results I got from benchmarking guest->host and
> host->guest on my machine.
>
> There's enough numbers there to make anyone blind, but basically
> there are results for current kvm-userspace.git, with the
> no-tx-timer patch applied and with the drop-the-mutex patch
> applied.
>
> Also, I've included results that show what difference some tuning
> makes with all the patches applied. The tuning basically just
> involves pinning the I/O thread and the netperf/netserver processes
> in both the host and guest to two physical CPUs which share a L2
> cache.
>
> (Yes, the 1k buffer size results are weird - we think there's a bug
> in recent kernels that causes us not to coalesce these small buffers
> into a large GSO packet before sending)
>   

Applied 1-2 while we debate the rest.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/6] Kill off the virtio_net tx mitigation timer
  2008-11-02  9:48 ` Avi Kivity
@ 2008-11-03 12:23   ` Mark McLoughlin
  2008-11-03 12:40     ` Avi Kivity
  0 siblings, 1 reply; 27+ messages in thread
From: Mark McLoughlin @ 2008-11-03 12:23 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Avi Kivity, kvm

On Sun, 2008-11-02 at 11:48 +0200, Avi Kivity wrote:
> Mark McLoughlin wrote:
> > Hey,
> >
> > The main patch in this series is 5/6 - it just kills off the
> > virtio_net tx mitigation timer and does all the tx I/O in the
> > I/O thread.
> >
> >   
> 
> What will it do to small packet, multi-flow loads (simulated by ping -f 
> -l 30 $external)?

It should improve the latency - the packets will be flushed more quickly
than the 150us timeout without blocking the guest.

I've a crappy external network setup locally atm, so the improvement for
guest->external gets lost in the noise there, but it does show up with
that workload and guest->host.

> Where does the benefit come from?

There are two things going on here, I think.

First is that the timer affects latency, removing the timeout helps
that.

Second is that currently when we fill up the ring we block the guest
vcpu and flush. Thus, while we're copying an entire ring full of packets
the guest isn't making progress. Doing the copying in the I/O thread
helps there.

Note - the only net I/O we currently do in the vcpu thread is when the
guest is saturating the link. Any other time, all the I/O is done in the
I/O thread by virtue of the timer.

> Is the overhead of managing the timer too high, or does it fire too
> late and so we sleep?  If the latter, can we tune it dynamically?
> 
> For example, if the guest sees it is making a lot of progress without 
> the host catching up (waiting on the tx timer), it can 
> kick_I_really_mean_this_now(), to get the host to notice.

It does that already - if the ring fills up the guest forces a kick
which causes the host to flush the ring in the vcpu thread.

Cheers,
Mark.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/6] Kill off the virtio_net tx mitigation timer
  2008-11-03 12:23   ` Mark McLoughlin
@ 2008-11-03 12:40     ` Avi Kivity
  2008-11-03 15:04       ` Mark McLoughlin
  2008-11-06 17:45       ` Mark McLoughlin
  0 siblings, 2 replies; 27+ messages in thread
From: Avi Kivity @ 2008-11-03 12:40 UTC (permalink / raw)
  To: Mark McLoughlin; +Cc: Avi Kivity, kvm

Mark McLoughlin wrote:
> On Sun, 2008-11-02 at 11:48 +0200, Avi Kivity wrote:
>   
>> Mark McLoughlin wrote:
>>     
>>> Hey,
>>>
>>> The main patch in this series is 5/6 - it just kills off the
>>> virtio_net tx mitigation timer and does all the tx I/O in the
>>> I/O thread.
>>>
>>>   
>>>       
>> What will it do to small packet, multi-flow loads (simulated by ping -f 
>> -l 30 $external)?
>>     
>
> It should improve the latency - the packets will be flushed more quickly
> than the 150us timeout without blocking the guest.
>
>   

But it will increase overhead, since suddenly we aren't queueing 
anymore.  One vmexit per small packet.


>> Where does the benefit come from?
>>     
>
> There are two things going on here, I think.
>
> First is that the timer affects latency, removing the timeout helps
> that.
>   

If the timer affects latency, then something is very wrong.  We're 
lacking an adjustable window.

The way I see it, the notification window should be adjusted according 
to the current workload.  If the link is idle, the window should be one 
packet -- notify as soon as something is queued.  As the workload 
increases, the window increases to (safety_factor * allowable_latency * 
packet_rate).  The timer is set to allowable_latency to catch changes in 
workload.

For example:

- allowable_latency 1ms (implies 1K vmexits/sec desired)
- current packet_rate 20K packets/sec
- safety_factor 0.8

So we request notifications every 0.8 * 20K * 1ms = 16 packets, and set 
the timer to 1ms.  Usually we get a notification every 16 packets, just 
before timer expiration.  If the workload increases, we get 
notifications sooner, so we increase the window.  If the workload drops, 
the timer fires and we decrease the window.

The timer should never fire on an all-out benchmark, or in a ping test.

> Second is that currently when we fill up the ring we block the guest
> vcpu and flush. Thus, while we're copying an entire ring full of packets
> the guest isn't making progress. Doing the copying in the I/O thread
> helps there.
>   

We're hurting our cache, and this won't work well with many nics.  At 
the very least this should be done in a dedicated thread.  It's also 
going to damage latency.

The only real fix is to avoid the copy altogether.

> Note - the only net I/O we currently do in the vcpu thread is when the
> guest is saturating the link. Any other time, all the I/O is done in
> the I/O thread by virtue of the timer.
>   

This is fundamental brokenness, as mentioned above, in my 
non-networking-expert opinion.

>> Is the overhead of managing the timer too high, or does it fire too
>> late and so we sleep?  If the latter, can we tune it dynamically?
>>
>> For example, if the guest sees it is making a lot of progress without 
>> the host catching up (waiting on the tx timer), it can 
>> kick_I_really_mean_this_now(), to get the host to notice.
>>     
>
> It does that already - if the ring fills up the guest forces a kick
> which causes the host to flush the ring in the vcpu thread.
>   

Should happen some time before the ring fills up.  Especially if we make 
the flushing async by offloading to some other thread.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/6] Kill off the virtio_net tx mitigation timer
  2008-11-03 12:40     ` Avi Kivity
@ 2008-11-03 15:04       ` Mark McLoughlin
  2008-11-03 15:19         ` Avi Kivity
  2008-11-06 17:45       ` Mark McLoughlin
  1 sibling, 1 reply; 27+ messages in thread
From: Mark McLoughlin @ 2008-11-03 15:04 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm

On Mon, 2008-11-03 at 14:40 +0200, Avi Kivity wrote:
> Mark McLoughlin wrote:
> > On Sun, 2008-11-02 at 11:48 +0200, Avi Kivity wrote:
> >   
> >> Mark McLoughlin wrote:
> >>> The main patch in this series is 5/6 - it just kills off the
> >>> virtio_net tx mitigation timer and does all the tx I/O in the
> >>> I/O thread.
> >>>
> >>>   
> >>>       
> >> What will it do to small packet, multi-flow loads (simulated by ping -f 
> >> -l 30 $external)?
> >>     
> >
> > It should improve the latency - the packets will be flushed more quickly
> > than the 150us timeout without blocking the guest.
> >
> >   
> 
> But it will increase overhead, since suddenly we aren't queueing 
> anymore.  One vmexit per small packet.

Yes in theory, but the packet copies are acting to mitigate exits since
we don't re-enable notifications again until we're sure the ring is
empty.

With copyless, though, we'd have an unacceptable vmexit rate.

> >> Where does the benefit come from?
> >>     
> >
> > There are two things going on here, I think.
> >
> > First is that the timer affects latency, removing the timeout helps
> > that.
> >   
> 
> If the timer affects latency, then something is very wrong.  We're 
> lacking an adjustable window.
> 
> The way I see it, the notification window should be adjusted according 
> to the current workload.  If the link is idle, the window should be one 
> packet -- notify as soon as something is queued.  As the workload 
> increases, the window increases to (safety_factor * allowable_latency / 
> packet_rate).  The timer is set to allowable_latency to catch changes in 
> workload.
> 
> For example:
> 
> - allowable_latency 1ms (implies 1K vmexits/sec desired)
> - current packet_rate 20K packets/sec
> - safety_factor 0.8
> 
> So we request notifications every 0.8 * 20K * 1m = 16 packets, and set 
> the timer to 1ms.  Usually we get a notification every 16 packets, just 
> before timer expiration.  If the workload increases, we get 
> notifications sooner, so we increase the window.  If the workload drops, 
> the timer fires and we decrease the window.
> 
> The timer should never fire on an all-out benchmark, or in a ping test.

Yeah, I do like the sound of this.

However, since it requires a new guest feature and I don't expect it'll
improve the situation over the proposed patch until we have copyless
transmit, I think we should do this as part of the copyless effort.

One thing I'd worry about with this scheme is all-out receive - e.g. any
delay in returning a TCP ACK to the sending side might cause us to hit
the TCP window size.

> > Second is that currently when we fill up the ring we block the guest
> > vcpu and flush. Thus, while we're copying an entire ring full of packets
> > the guest isn't making progress. Doing the copying in the I/O thread
> > helps there.
> >   
> 
> We're hurting our cache, and this won't work well with many nics.  At 
> the very least this should be done in a dedicated thread.

A thread per nic is doable, but it'd be especially tricky on the receive
side without more "short-cut the one producer, one consumer case" work.

Cheers,
Mark.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 5/6] kvm: qemu: virtio-net: handle all tx in I/O thread without timer
  2008-10-31  9:16             ` Mark McLoughlin
@ 2008-11-03 15:07               ` Mark McLoughlin
  0 siblings, 0 replies; 27+ messages in thread
From: Mark McLoughlin @ 2008-11-03 15:07 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Avi Kivity, kvm

On Fri, 2008-10-31 at 09:16 +0000, Mark McLoughlin wrote:
 
> +static void virtio_net_tx_bh(void *opaque)
>  {
>      VirtIONet *n = opaque;
>  
> -    n->tx_timer_active = 0;
> -
> -    /* Just in case the driver is not ready on more */
> -    if (!(n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK))
> -        return;
> -
> -    n->tx_vq->vring.used->flags &= ~VRING_USED_F_NO_NOTIFY;
> -    virtio_net_flush_tx(n, n->tx_vq);
> +    if (!virtio_net_flush_tx(n, n->tx_vq))
> +        n->tx_vq->vring.used->flags &= ~VRING_USED_F_NO_NOTIFY;
> +    else
> +        qemu_bh_schedule(n->tx_bh);
>  }

Ouch, race condition here between our finding the ring empty and
re-enabling the notification. We need to check the ring again.

Cheers,
Mark.

From: Mark McLoughlin <markmc@redhat.com>
Subject: [PATCH] kvm: qemu: virtio-net: handle all tx in I/O thread without timer

By removing the tx timer altogether and doing all the copies in the
I/O thread, we can keep the I/O churning away in parallel with the
guest generating more I/O.

In my tests, this significantly increases guest->host throughput,
causes a minor increase in host->guest throughput, reduces CPU
utilization somewhat and greatly reduces roundtrip times.

Even aside from the benchmark results, removing the arbitrary 150us
timer is a nicer option than coming up with a heuristic to make it
vary according to load. Finally, on kernels which don't have a
suitably low posix timer latency, we won't be scuppered by effectively
having e.g. a 1ms timer.

Note, this highlights that the I/O thread may become a scalability
concern and we might want to consider e.g. an I/O thread per device.

Note also that when tuning for a specific workload, which CPU
the I/O thread is pinned to is important.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
---
 qemu/hw/virtio-net.c |   65 +++++++++++++++++++++++++------------------------
 1 files changed, 33 insertions(+), 32 deletions(-)

diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c
index bc2ede6..43a06da 100644
--- a/qemu/hw/virtio-net.c
+++ b/qemu/hw/virtio-net.c
@@ -35,8 +35,6 @@
 #define VIRTIO_NET_F_HOST_ECN	13	/* Host can handle TSO[6] w/ ECN in. */
 #define VIRTIO_NET_F_HOST_UFO	14	/* Host can handle UFO in. */
 
-#define TX_TIMER_INTERVAL 150000 /* 150 us */
-
 /* The config defining mac address (6 bytes) */
 struct virtio_net_config
 {
@@ -68,8 +66,7 @@ typedef struct VirtIONet
     VirtQueue *rx_vq;
     VirtQueue *tx_vq;
     VLANClientState *vc;
-    QEMUTimer *tx_timer;
-    int tx_timer_active;
+    QEMUBH *tx_bh;
 } VirtIONet;
 
 /* TODO
@@ -227,13 +224,14 @@ static void virtio_net_receive(void *opaque, const uint8_t *buf, int size)
 }
 
 /* TX */
-static void virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq)
+static int virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq)
 {
     VirtQueueElement elem;
     int has_vnet_hdr = tap_has_vnet_hdr(n->vc->vlan->first_client);
+    int num_packets = 0;
 
     if (!(n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK))
-        return;
+        return num_packets;
 
     while (virtqueue_pop(vq, &elem)) {
 	ssize_t len = 0;
@@ -256,38 +254,38 @@ static void virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq)
 
 	virtqueue_push(vq, &elem, len);
 	virtio_notify(&n->vdev, vq);
+
+	num_packets++;
     }
+
+    return num_packets;
 }
 
 static void virtio_net_handle_tx(VirtIODevice *vdev, VirtQueue *vq)
 {
     VirtIONet *n = to_virtio_net(vdev);
 
-    if (n->tx_timer_active) {
-	vq->vring.used->flags &= ~VRING_USED_F_NO_NOTIFY;
-	qemu_del_timer(n->tx_timer);
-	n->tx_timer_active = 0;
-	virtio_net_flush_tx(n, vq);
-    } else {
-	qemu_mod_timer(n->tx_timer,
-		       qemu_get_clock(vm_clock) + TX_TIMER_INTERVAL);
-	n->tx_timer_active = 1;
-	vq->vring.used->flags |= VRING_USED_F_NO_NOTIFY;
-    }
+    vq->vring.used->flags |= VRING_USED_F_NO_NOTIFY;
+    qemu_bh_schedule(n->tx_bh);
 }
 
-static void virtio_net_tx_timer(void *opaque)
+static void virtio_net_tx_bh(void *opaque)
 {
     VirtIONet *n = opaque;
+    VirtQueue *vq = n->tx_vq;
 
-    n->tx_timer_active = 0;
+    /* Re-enable notifications if empty */
+    if (!virtio_net_flush_tx(n, vq)) {
+        vq->vring.used->flags &= ~VRING_USED_F_NO_NOTIFY;
 
-    /* Just in case the driver is not ready on more */
-    if (!(n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK))
-        return;
+        /* Re-check to be sure we didn't miss something */
+        if (!virtio_net_flush_tx(n, vq))
+            return;
+    }
 
-    n->tx_vq->vring.used->flags &= ~VRING_USED_F_NO_NOTIFY;
-    virtio_net_flush_tx(n, n->tx_vq);
+    /* We flushed some packets; let's try again soon */
+    vq->vring.used->flags |= VRING_USED_F_NO_NOTIFY;
+    qemu_bh_schedule(n->tx_bh);
 }
 
 static void virtio_net_save(QEMUFile *f, void *opaque)
@@ -297,7 +295,6 @@ static void virtio_net_save(QEMUFile *f, void *opaque)
     virtio_save(&n->vdev, f);
 
     qemu_put_buffer(f, n->mac, 6);
-    qemu_put_be32(f, n->tx_timer_active);
 }
 
 static int virtio_net_load(QEMUFile *f, void *opaque, int version_id)
@@ -310,12 +307,15 @@ static int virtio_net_load(QEMUFile *f, void *opaque, int version_id)
     virtio_load(&n->vdev, f);
 
     qemu_get_buffer(f, n->mac, 6);
-    n->tx_timer_active = qemu_get_be32(f);
 
-    if (n->tx_timer_active) {
-	qemu_mod_timer(n->tx_timer,
-		       qemu_get_clock(vm_clock) + TX_TIMER_INTERVAL);
-    }
+    return 0;
+}
+
+static int virtio_net_uninit(PCIDevice *dev)
+{
+    VirtIONet *n = (VirtIONet *)dev;
+
+    qemu_bh_delete(n->tx_bh);
 
     return 0;
 }
@@ -341,8 +341,9 @@ PCIDevice *virtio_net_init(PCIBus *bus, NICInfo *nd, int devfn)
     n->vc = qemu_new_vlan_client(nd->vlan, virtio_net_receive,
                                  virtio_net_can_receive, n);
 
-    n->tx_timer = qemu_new_timer(vm_clock, virtio_net_tx_timer, n);
-    n->tx_timer_active = 0;
+    n->tx_bh = qemu_bh_new(virtio_net_tx_bh, n);
+
+    n->vdev.pci_dev.unregister = virtio_net_uninit;
 
     register_savevm("virtio-net", virtio_net_id++, 1,
 		    virtio_net_save, virtio_net_load, n);
-- 
1.6.0.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/6] Kill off the virtio_net tx mitigation timer
  2008-11-03 15:04       ` Mark McLoughlin
@ 2008-11-03 15:19         ` Avi Kivity
  2008-11-06 16:46           ` Mark McLoughlin
  0 siblings, 1 reply; 27+ messages in thread
From: Avi Kivity @ 2008-11-03 15:19 UTC (permalink / raw)
  To: Mark McLoughlin; +Cc: kvm

Mark McLoughlin wrote:
>> But it will increase overhead, since suddenly we aren't queueing 
>> anymore.  One vmexit per small packet.
>>     
>
> Yes in theory, but the packet copies are acting to mitigate exits since
> we don't re-enable notifications again until we're sure the ring is
> empty.
>   

You mean, the guest and the copy proceed in parallel, and while they do, 
exits are disabled?

> With copyless, though, we'd have an unacceptable vmexit rate.
>   

Right.

  

>> If the timer affects latency, then something is very wrong.  We're 
>> lacking an adjustable window.
>>
>> The way I see it, the notification window should be adjusted according 
>> to the current workload.  If the link is idle, the window should be one 
>> packet -- notify as soon as something is queued.  As the workload 
>> increases, the window increases to (safety_factor * allowable_latency * 
>> packet_rate).  The timer is set to allowable_latency to catch changes in 
>> workload.
>>
>> For example:
>>
>> - allowable_latency 1ms (implies 1K vmexits/sec desired)
>> - current packet_rate 20K packets/sec
>> - safety_factor 0.8
>>
>> So we request notifications every 0.8 * 20K * 1ms = 16 packets, and set 
>> the timer to 1ms.  Usually we get a notification every 16 packets, just 
>> before timer expiration.  If the workload increases, we get 
>> notifications sooner, so we increase the window.  If the workload drops, 
>> the timer fires and we decrease the window.
>>
>> The timer should never fire on an all-out benchmark, or in a ping test.
>>     
>
> Yeah, I do like the sound of this.
>
> However, since it requires a new guest feature and I don't expect it'll
> improve the situation over the proposed patch until we have copyless
> transmit, I think we should do this as part of the copyless effort.
>   

Hopefully copyless and this can be done in parallel.  I think they have 
value independently.

> One thing I'd worry about with this scheme is all-out receive - e.g. any
> delay in returning a TCP ACK to the sending side, might cause us to hit
> the TCP window size.
>   

Consider a real NIC, that also has latency for ACKs that is determined 
by the queue length.  The proposal doesn't change that, except 
momentarily when transitioning from high throughput to low throughput.

In any case, latency is never more than allowable_latency (not including 
time spent in the guest network stack queues, but we aren't responsible 
for that).

(one day we can add a queue for acks and other high priority stuff, but 
we have enough on our hands now)

>> We're hurting our cache, and this won't work well with many nics.  At 
>> the very least this should be done in a dedicated thread.
>>     
>
> A thread per nic is doable, but it'd be especially tricky on the receive
> side without more "short-cut the one producer, one consumer case" work.
>   

We can start with transmit.  I'm somewhat worried about further 
divergence from qemu mainline (just completed a merge...).

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 6/6] kvm: qemu: virtio-net: drop mutex during tx tapfd write
  2008-10-30 17:51           ` [PATCH 6/6] kvm: qemu: virtio-net: drop mutex during tx tapfd write Mark McLoughlin
@ 2008-11-04 11:43             ` Avi Kivity
  0 siblings, 0 replies; 27+ messages in thread
From: Avi Kivity @ 2008-11-04 11:43 UTC (permalink / raw)
  To: Mark McLoughlin; +Cc: Avi Kivity, kvm

Mark McLoughlin wrote:
> This allows the guest vcpu thread to exit while the I/O thread is
> churning away.
>
>  
> +	kvm_sleep_begin();
>   

What if the nic is hot-unplugged here?

>  	len += qemu_sendv_packet(n->vc, out_sg, out_num);
>   

n is freed, no?

> +	kvm_sleep_end();
>  
>   


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 5/6] kvm: qemu: virtio-net: handle all tx in I/O thread without timer
  2008-10-30 17:51         ` [PATCH 5/6] kvm: qemu: virtio-net: handle all tx in I/O thread without timer Mark McLoughlin
                             ` (2 preceding siblings ...)
  2008-11-02  9:56           ` Avi Kivity
@ 2008-11-04 15:23           ` David S. Ahern
  2008-11-06 17:02             ` Mark McLoughlin
  3 siblings, 1 reply; 27+ messages in thread
From: David S. Ahern @ 2008-11-04 15:23 UTC (permalink / raw)
  To: Mark McLoughlin; +Cc: Avi Kivity, kvm



Mark McLoughlin wrote:

> Note also that when tuning for a specific workload, which CPU
> the I/O thread is pinned to is important.
> 

Hi Mark:

Can you give an example of when that has a noticeable effect?

For example, if the guest handles network interrupts on vcpu0 and it is
pinned to pcpu0 where should the IO thread be pinned for best performance?

thanks,

david

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/6] Kill off the virtio_net tx mitigation timer
  2008-11-03 15:19         ` Avi Kivity
@ 2008-11-06 16:46           ` Mark McLoughlin
  2008-11-06 17:38             ` Avi Kivity
  0 siblings, 1 reply; 27+ messages in thread
From: Mark McLoughlin @ 2008-11-06 16:46 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm

Hey,
	So, I went off and spent some time gathering more data on this stuff
and putting it together in a more consumable fashion.

	Here are some graphs showing the effect some of these changes have on
throughput, cpu utilization and vmexit rate:

  http://markmc.fedorapeople.org/virtio-netperf/2008-11-06/

	The results are a little surprising, and I'm not sure I've fully
digested them yet but some conclusions:

  1) Disabling notifications from the guest for longer helps; you see 
     an increase in cpu utilization and vmexit rate, but that can be 
     accounted for by the extra data we're transferring

  2) Flushing (when the ring is full) in the I/O thread doesn't seem to
     help anything; strangely, it has a detrimental effect on
     host->guest traffic, where I wouldn't expect us to hit this case
     at all.

     I suspect we may not actually be hitting the full ring condition
     in these tests at all.

  3) The catch-more-io thing helps a little, especially host->guest, 
     without any real detrimental impact.

  4) Removing the tx timer doesn't have a huge effect on guest->host, 
     except for 32 byte buffers where we see a huge increase in vmexits 
     and a drop in throughput. Bizarrely, we don't see this effect with 
     64 byte buffers.

     However, it does have a pretty significant impact on host->guest, 
     which makes sense since in that case we'll just have a steady
     stream of TCP ACK packets so if small guest->host packets are 
     affected badly, so will the ACK packets.

  5) The drop-mutex patch is a nice win overall, except for a huge 
     increase in vmexits for sub-4k guest->host packets. Strange.

Cheers,
Mark.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 5/6] kvm: qemu: virtio-net: handle all tx in I/O thread without timer
  2008-11-04 15:23           ` David S. Ahern
@ 2008-11-06 17:02             ` Mark McLoughlin
  2008-11-06 17:13               ` David S. Ahern
  2008-11-06 17:43               ` Avi Kivity
  0 siblings, 2 replies; 27+ messages in thread
From: Mark McLoughlin @ 2008-11-06 17:02 UTC (permalink / raw)
  To: David S. Ahern; +Cc: Avi Kivity, kvm

On Tue, 2008-11-04 at 08:23 -0700, David S. Ahern wrote:
> 
> Mark McLoughlin wrote:
> 
> > Note also that when tuning for a specific workload, which CPU
> > the I/O thread is pinned to is important.
> > 
> 
> Hi Mark:
> 
> Can you give an example of when that has a noticeable effect?
> 
> For example, if the guest handles network interrupts on vcpu0 and it is
> pinned to pcpu0 where should the IO thread be pinned for best performance?

Basically, the I/O thread is where packets are copied to and from host
kernel space at the moment.

If there are other copies of the packets anywhere, you want those to
copy from a cache.

With my netperf guest->host benchmark, you actually have four copies
going on:

  1) netperf process in guest copying to guest kernel space

  2) qemu process in the host copying between internal buffers

  3) qemu process in the host copying to host kernel space

  4) netserver process in the host copying into its buffers

My machine has four CPUs, with two 6Mb L2 caches - each cache is shared
between two of the CPUs, so I set things up as follows:

  pcpu#3 - netserver, I/O thread, vcpu#0
  pcpu#4 - vcpu#1, virtio_net irq, netperf

which (hopefully) ensures that we're only doing one copy using RAM and
the rest are using the L1/L2 caches.
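
As a rough sketch (Python, Linux-only; the CPU numbers are illustrative
and the thread names are placeholders, not the real TIDs, which you'd
read from /proc/<qemu-pid>/task), the pinning boils down to:

```python
import os

# Illustrative pinning plan: one pcpu gets everything on the host-side
# copy path (netserver, qemu I/O thread, vcpu#0); its L2-sharing sibling
# gets the guest side (vcpu#1 / virtio_net irq / netperf).
plan = {
    "netserver": 2,
    "io-thread": 2,
    "vcpu0":     2,
    "vcpu1":     3,
    "netperf":   3,
}

def pin(tid, cpu):
    # os.sched_setaffinity() wraps sched_setaffinity(2); tid 0 == caller.
    os.sched_setaffinity(tid, {cpu})

# Demonstrate on the current process: pin to CPU 0 and read it back.
pin(0, 0)
print(sorted(os.sched_getaffinity(0)))
```

(In practice you'd do the same with taskset(1) on the relevant thread IDs.)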

Cheers,
Mark.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 5/6] kvm: qemu: virtio-net: handle all tx in I/O thread without timer
  2008-11-06 17:02             ` Mark McLoughlin
@ 2008-11-06 17:13               ` David S. Ahern
  2008-11-06 17:43               ` Avi Kivity
  1 sibling, 0 replies; 27+ messages in thread
From: David S. Ahern @ 2008-11-06 17:13 UTC (permalink / raw)
  To: Mark McLoughlin; +Cc: Avi Kivity, kvm



Mark McLoughlin wrote:
> On Tue, 2008-11-04 at 08:23 -0700, David S. Ahern wrote:
>> Mark McLoughlin wrote:
>>
>>> Note also that when tuning for a specific workload, which CPU
>>> the I/O thread is pinned to is important.
>>>
>> Hi Mark:
>>
>> Can you give an example of when that has a noticeable effect?
>>
>> For example, if the guest handles network interrupts on vcpu0 and it is
>> pinned to pcpu0 where should the IO thread be pinned for best performance?
> 
> Basically, the I/O thread is where packets are copied to and from host
> kernel space at the moment.
> 
> If there are other copies of the packets anywhere, you want those to
> copy from a cache.
> 
> With my netperf guest->host benchmark, you actually have four copies
> going on:
> 
>   1) netperf process in guest copying to guest kernel space
> 
>   2) qemu process in the host copying between internal buffers
> 
>   3) qemu process in the host copying to host kernel space
> 
>   4) netserver process in the host copying into its buffers
> 
> My machine has four CPUs, with two 6Mb L2 caches - each cache is shared
> between two of the CPUs, so I set things up as follows:
> 
>   pcpu#3 - netserver, I/O thread, vcpu#0
>   pcpu#4 - vcpu#1, virtio_net irq, netperf
> 
> which (hopefully) ensures that we're only doing one copy using RAM and
> the rest are using the L1/L2 caches.

So, in other words you don't necessarily need the guest vcpu that
handles the net irq on the same pcpu as the IO thread, but it should
help to keep them within the same processor caches. Correct?

david

> 
> Cheers,
> Mark.
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/6] Kill off the virtio_net tx mitigation timer
  2008-11-06 16:46           ` Mark McLoughlin
@ 2008-11-06 17:38             ` Avi Kivity
  0 siblings, 0 replies; 27+ messages in thread
From: Avi Kivity @ 2008-11-06 17:38 UTC (permalink / raw)
  To: Mark McLoughlin; +Cc: kvm

Mark McLoughlin wrote:
> Hey,
> 	So, I went off and spent some time gathering more data on this stuff
> and putting it together in a more consumable fashion.
>
> 	Here are some graphs showing the effect some of these changes have on
> throughput, cpu utilization and vmexit rate:
>
>   http://markmc.fedorapeople.org/virtio-netperf/2008-11-06/
>
>   

This is very helpful.

> 	The results are a little surprising, and I'm not sure I've fully
> digested them yet but some conclusions:
>
>   1) Disabling notifications from the guest for longer helps; you see 
>      an increase in cpu utilization and vmexit rate, but that can be 
>      accounted for by the extra data we're transferring
>
>   

Graphing cpu/bandwidth (cycles/bit) will show that nicely.

>   2) Flushing (when the ring is full) in the I/O thread doesn't seem to
>      help anything; strangely, it has a detrimental effect on
>      host->guest traffic, where I wouldn't expect us to hit this case
>      at all.
>
>      I suspect we may not actually be hitting the full ring condition
>      in these tests at all.
>
>   

That's good; ring full == stall, especially with smp guests.


>   4) Removing the tx timer doesn't have a huge effect on guest->host, 
>      except for 32 byte buffers where we see a huge increase in vmexits 
>      and a drop in throughput. Bizarrely, we don't see this effect with 
>      64 byte buffers.
>   

Weird.  Cacheline size effects?  The host must copy twice the number of
cachelines for the same throughput when moving between 32 and 64.

> 	
>      However, it does have a pretty significant impact on host->guest, 
>      which makes sense since in that case we'll just have a steady
>      stream of TCP ACK packets so if small guest->host packets are 
>      affected badly, so will the ACK packets.
>   

no-tx-timer is good for two workloads: streaming gso packets, where the
packet is so large the vmexit count is low anyway, and small,
latency-sensitive packets, where you need the vmexits.  I'm worried
about the workloads in between, which is why I'm pushing for the
dynamic window.

>   5) The drop-mutex patch is a nice win overall, except for a huge 
>      increase in vmexits for sub-4k guest->host packets. Strange.
>   

What types of vmexits are these?  virtio pio or mmu?  and what's the 
test length (interested in vmexits/sec and vmexits/bit).

Maybe the allocator changes its behavior and we're faulting in pages.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 5/6] kvm: qemu: virtio-net: handle all tx in I/O thread without timer
  2008-11-06 17:02             ` Mark McLoughlin
  2008-11-06 17:13               ` David S. Ahern
@ 2008-11-06 17:43               ` Avi Kivity
  1 sibling, 0 replies; 27+ messages in thread
From: Avi Kivity @ 2008-11-06 17:43 UTC (permalink / raw)
  To: Mark McLoughlin; +Cc: David S. Ahern, Avi Kivity, kvm

Mark McLoughlin wrote:
> My machine has four CPUs, with two 6Mb L2 caches - each cache is shared
> between two of the CPUs, so I set things up as follows:
>
>   pcpu#3 - netserver, I/O thread, vcpu#0
>   pcpu#4 - vcpu#1, virtio_net irq, netperf
>
> which (hopefully) ensures that we're only doing one copy using RAM and
> the rest are using the L1/L2 caches.
>   

This pinning can drastically alter the results in other ways.  If the 
vcpu doing the transmit and the iothread can run in parallel, each 
guest->host kick will result in a vmexit, IPI, and iothread execution, 
with much less blocking of interrupts.

When the iothread and vcpu are collocated, the scheduler will ensure 
natural batching and you will only vmexit when the ring is full (or 
empty, in the case of the iothread), or if you spent quite a long time 
processing it.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/6] Kill off the virtio_net tx mitigation timer
  2008-11-03 12:40     ` Avi Kivity
  2008-11-03 15:04       ` Mark McLoughlin
@ 2008-11-06 17:45       ` Mark McLoughlin
  2008-11-09 11:29         ` Avi Kivity
  1 sibling, 1 reply; 27+ messages in thread
From: Mark McLoughlin @ 2008-11-06 17:45 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm

Hi Avi,

Just thinking about your variable window suggestion ...

On Mon, 2008-11-03 at 14:40 +0200, Avi Kivity wrote:
> Mark McLoughlin wrote:
> > On Sun, 2008-11-02 at 11:48 +0200, Avi Kivity wrote:

> >> Where does the benefit come from?
> >>     
> >
> > There are two things going on here, I think.
> >
> > First is that the timer affects latency, removing the timeout helps
> > that.
> >   
> 
> If the timer affects latency, then something is very wrong.  We're 
> lacking an adjustable window.
> 
> The way I see it, the notification window should be adjusted according 
> to the current workload.  If the link is idle, the window should be one 
> packet -- notify as soon as something is queued.  As the workload 
> increases, the window increases to (safety_factor * allowable_latency / 
> packet_rate).  The timer is set to allowable_latency to catch changes in 
> workload.
> 
> For example:
> 
> - allowable_latency 1ms (implies 1K vmexits/sec desired)
> - current packet_rate 20K packets/sec
> - safety_factor 0.8
> 
> So we request notifications every 0.8 * 20K * 1ms = 16 packets, and set 
> the timer to 1ms.  Usually we get a notification every 16 packets, just 
> before timer expiration.  If the workload increases, we get 
> notifications sooner, so we increase the window.  If the workload drops, 
> the timer fires and we decrease the window.
> 
> The timer should never fire on an all-out benchmark, or in a ping test.

The way I see this (continuing with your example figures) playing out
is:

  - If we have a packet rate of <2.5K packets/sec, we essentially have 
    zero added latency - each packet causes a vmexit and the packet is 
    dispatched immediately

  - As soon as we go above 2.5k we add, on average, an additional 
    ~400us delay to each packet

  - This is almost identical to our current scheme with an 800us timer, 
    except that flushes are typically triggered by a vmexit instead of
    the timer expiring

I don't think this is the effect you're looking for? Am I missing
something?
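
For what it's worth, the arithmetic can be reproduced with a quick
sketch (Python; the parameters are straight from your example, and the
(window - 1) / (2 * rate) mean-delay estimate is my own rough model,
not anything we've measured):

```python
def window(rate_pps, allowable_latency=1e-3, safety=0.8):
    # Avi's formula: notify every safety * allowable_latency * packet_rate
    # packets, with a floor of one packet for idle/slow links.
    return max(1, int(safety * allowable_latency * rate_pps))

def mean_added_delay(rate_pps, **kw):
    # A packet waits, on average, for half the remaining window to fill.
    w = window(rate_pps, **kw)
    return (w - 1) / (2.0 * rate_pps)

for rate in (1000, 2500, 20000, 100000):
    print(rate, window(rate), round(mean_added_delay(rate) * 1e6), "us")
# Below ~2.5K pps the window stays at 1 (no added latency); above it the
# mean delay climbs toward a ~400us asymptote (0.4 * allowable_latency),
# i.e. much like a fixed 800us timer.
```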

Cheers,
Mark.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/6] Kill off the virtio_net tx mitigation timer
  2008-11-06 17:45       ` Mark McLoughlin
@ 2008-11-09 11:29         ` Avi Kivity
  0 siblings, 0 replies; 27+ messages in thread
From: Avi Kivity @ 2008-11-09 11:29 UTC (permalink / raw)
  To: Mark McLoughlin; +Cc: kvm

Mark McLoughlin wrote:
> The way I see this (continuing with your example figures) playing out
> is:
>
>   - If we have a packet rate of <2.5K packets/sec, we essentially have 
>     zero added latency - each packet causes a vmexit and the packet is 
>     dispatched immediately
>
>   - As soon as we go above 2.5k we add, on average, an additional 
>     ~400us delay to each packet
>
>   - This is almost identical to our current scheme with an 800us timer, 
>     except that flushes are typically triggered by a vmexit instead of
>     the timer expiring
>
> I don't think this is the effect you're looking for? Am I missing
> something?
>   

No.  While it's what my description implies, it's not what I want.

Let's step back for a bit.  What do we want?

Let's assume the virtio queue is connected to a real queue.  The 
guest->host scenario is easier, and less important.

So:

1. we never want to have a situation where the host queue is empty, but 
the guest queue has unkicked entries.  That will drop us below line rate 
and add latencies.
2. we want to avoid situations where the host queue is non-empty, and we 
kick the guest queue.  That won't improve latency, and will increase cpu 
utilization
  - if the host queue is close to depletion, then we _do_ want the kick, 
to avoid violating the first requirement (which is more important)

Does this seem sane as high-level goals?  If so we can think about how 
to implement it.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2008-11-09 11:29 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-10-30 17:51 [PATCH 0/6] Kill off the virtio_net tx mitigation timer Mark McLoughlin
2008-10-30 17:51 ` [PATCH 1/6] kvm: qemu: virtio: remove unused variable Mark McLoughlin
2008-10-30 17:51   ` [PATCH 2/6] kvm: qemu: dup the qemu_eventfd() return Mark McLoughlin
2008-10-30 17:51     ` [PATCH 3/6] kvm: qemu: add qemu_eventfd_write() and qemu_eventfd_read() Mark McLoughlin
2008-10-30 17:51       ` [PATCH 4/6] kvm: qemu: aggregate reads from eventfd Mark McLoughlin
2008-10-30 17:51         ` [PATCH 5/6] kvm: qemu: virtio-net: handle all tx in I/O thread without timer Mark McLoughlin
2008-10-30 17:51           ` [PATCH 6/6] kvm: qemu: virtio-net: drop mutex during tx tapfd write Mark McLoughlin
2008-11-04 11:43             ` Avi Kivity
2008-10-30 19:24           ` [PATCH 5/6] kvm: qemu: virtio-net: handle all tx in I/O thread without timer Anthony Liguori
2008-10-31  9:16             ` Mark McLoughlin
2008-11-03 15:07               ` Mark McLoughlin
2008-11-02  9:56           ` Avi Kivity
2008-11-04 15:23           ` David S. Ahern
2008-11-06 17:02             ` Mark McLoughlin
2008-11-06 17:13               ` David S. Ahern
2008-11-06 17:43               ` Avi Kivity
2008-10-30 19:20 ` [PATCH 0/6] Kill off the virtio_net tx mitigation timer Anthony Liguori
2008-11-02  9:48 ` Avi Kivity
2008-11-03 12:23   ` Mark McLoughlin
2008-11-03 12:40     ` Avi Kivity
2008-11-03 15:04       ` Mark McLoughlin
2008-11-03 15:19         ` Avi Kivity
2008-11-06 16:46           ` Mark McLoughlin
2008-11-06 17:38             ` Avi Kivity
2008-11-06 17:45       ` Mark McLoughlin
2008-11-09 11:29         ` Avi Kivity
2008-11-02  9:57 ` Avi Kivity

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox