From mboxrd@z Thu Jan  1 00:00:00 1970
From: Mark McLoughlin
Subject: Re: [PATCH 5/5] kvm: qemu: Improve virtio_net recv buffer allocation scheme
Date: Tue, 14 Oct 2008 14:44:00 +0100
Message-ID: <1223991840.11098.29.camel@blaa>
References: <>
 <1223494513-18826-1-git-send-email-markmc@redhat.com>
 <1223494513-18826-2-git-send-email-markmc@redhat.com>
 <1223494513-18826-3-git-send-email-markmc@redhat.com>
 <1223494513-18826-4-git-send-email-markmc@redhat.com>
 <1223494513-18826-5-git-send-email-markmc@redhat.com>
 <48F1CABA.8010301@redhat.com>
Reply-To: Mark McLoughlin
Mime-Version: 1.0
Content-Type: text/plain
Content-Transfer-Encoding: 7bit
Cc: kvm@vger.kernel.org, Rusty Russell, Herbert Xu, Anthony Liguori
To: Avi Kivity
Return-path:
Received: from mx2.redhat.com ([66.187.237.31]:39743 "EHLO mx2.redhat.com"
 rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
 id S1750929AbYJNNu2 (ORCPT );
 Tue, 14 Oct 2008 09:50:28 -0400
In-Reply-To: <48F1CABA.8010301@redhat.com>
Sender: kvm-owner@vger.kernel.org
List-ID:

On Sun, 2008-10-12 at 12:00 +0200, Avi Kivity wrote:
> Mark McLoughlin wrote:
> > diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c
> > index 403247b..afa5fe5 100644
> > --- a/qemu/hw/virtio-net.c
> > +++ b/qemu/hw/virtio-net.c
> > @@ -34,9 +34,13 @@
> >  #define VIRTIO_NET_F_HOST_TSO6  12     /* Host can handle TSOv6 in. */
> >  #define VIRTIO_NET_F_HOST_ECN   13     /* Host can handle TSO[6] w/ ECN in. */
> >  #define VIRTIO_NET_F_HOST_UFO   14     /* Host can handle UFO in. */
> > +#define VIRTIO_NET_F_MRG_RXBUF  15     /* Host can merge receive buffers. */
> >
> What's the status of the guest side of this feature?

Waiting on Rusty, basically.

As requested, I've changed the ABI so we include the number of merged
buffers in the virtio_net_hdr:

  http://lkml.org/lkml/2008/10/10/204

> >  #define TX_TIMER_INTERVAL 150000 /* 150 us */
> >
> > +/* Should be the largest MAX_SKB_FRAGS supported. */
> > +#define VIRTIO_NET_MAX_FRAGS 18
> > +
> >
> This should be advertised by the host to the guest (or vice-versa?).
> We're embedding Linux-specific magic numbers in a guest-OS-agnostic ABI.
>
> Preferably, there shouldn't be a limit at all.

Yeah, it's far from pretty.

The current ABI basically says "you must supply >64k receive buffers",
whereas this new ABI says "you must supply at least 18 >4k receive
buffers".

We could think about having the host expose the maximum rx packet size
to the guest (and handle migrating to a host with a different max), but
TBH I don't think it would be worth much until we have the prospect of
running on a host with a larger maximum rx packet size.

Requiring the guest to fill the ring with ~64k of buffers isn't
onerous; the Linux guest impl currently re-fills the ring up to the max
(e.g. 256 x 4k).

> > @@ -209,7 +220,12 @@ static void virtio_net_receive(void *opaque, const uint8_t *buf, int size)
> >      if (virtqueue_pop(n->rx_vq, &elem) == 0)
> >          return;
> >
> > -    if (elem.in_num < 1 || elem.in_sg[0].iov_len != sizeof(*hdr)) {
> > +    if (n->mergeable_rx_bufs) {
> > +        if (elem.in_num < 1 || elem.in_sg[0].iov_len < TARGET_PAGE_SIZE) {
> > +            fprintf(stderr, "virtio-net IOV is irregular\n");
> > +            exit(1);
> > +        }
> >
> Again, this is burying details of the current Linux stack into the ABI.
> The Linux stack may change not to be page oriented, or maybe this won't
> fit well with how Windows views things. Can this be made not to depend
> on the size of the iov elements?

This actually relates to the check in can_receive() - we need to be
sure that we have enough buffers in the ring before we read() from the
tapfd.

So, it's not so much about the guest being page oriented, but rather
about having a simple way to determine whether there are enough buffers
in the ring - "do we have 18 >4k buffers?" is preferable to peeking into
the ring and adding up the sizes of the available buffers.
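To make that concrete, the check boils down to something like this (a
stand-alone sketch only - the struct and function names here are made
up for illustration; the real code is in virtio_net_can_receive() in
the re-factored patch below):

/* Sketch only: illustrates the "enough rx buffers?" test, not the
 * actual qemu code. */
#include <stdint.h>

#define VIRTIO_NET_MAX_FRAGS 18   /* largest MAX_SKB_FRAGS we cater for */

struct vring_state {
    uint16_t avail_idx;       /* ring index the guest has published */
    uint16_t last_avail_idx;  /* last index the host has consumed */
};

/* Return non-zero if enough rx buffers are queued to read() one
 * packet from the tap fd without risking truncation. */
static int have_enough_rx_buffers(const struct vring_state *vq,
                                  int mergeable_rx_bufs)
{
    /* Mergeable buffers: up to 18 page-sized (>4k) buffers for a
     * max-sized packet; otherwise a single >64k buffer is enough. */
    uint16_t min_bufs = mergeable_rx_bufs ? VIRTIO_NET_MAX_FRAGS : 1;

    return (uint16_t)(vq->avail_idx - vq->last_avail_idx) >= min_bufs;
}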
> > +    } else {
> > +        iov_fill(&elem.in_sg[1], elem.in_num - 1,
> > +                 buf + offset, size - offset);
> > +
> > +        /* signal other side */
> > +        virtqueue_push(n->rx_vq, &elem, total);
> > +    }
> > +
> >
> Can we merge the two sides of the if () so that the only difference is
> the number of times we go through the loop?

Yeah, good point. Re-factored version below, which also includes the
virtio_net_hdr2 changes.

Cheers,
Mark.

Subject: kvm: qemu: Improve virtio_net recv buffer allocation scheme

Currently, in order to receive large packets, the guest must allocate
max-sized packet buffers and pass them to the host. Each of these
max-sized packets occupies 20 ring entries, which means we can only
transfer a maximum of 12 packets in a single batch with a 256-entry
ring.

When receiving packets from external networks, we only receive
MTU-sized packets, so the observed throughput is throttled by the
number of packets the ring can hold.

Implement the VIRTIO_NET_F_MRG_RXBUF feature to let guests know that we
can merge smaller buffers together in order to handle large packets.
This scheme allows us to be efficient in our use of ring entries while
still supporting large packets.

Benchmarking with netperf from an external machine to a guest over a
10Gb/s network shows a 100% improvement, from ~1Gb/s to ~2Gb/s. In a
local host->guest benchmark with GSO disabled on the host side,
throughput increased from 700Mb/s to 1.7Gb/s.

Based on a patch from Herbert Xu.

Signed-off-by: Mark McLoughlin
---
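For reference, a quick back-of-the-envelope on the ring usage described
above - an illustrative sketch only, not part of the patch; it assumes
a 256-entry rx ring and single-page guest buffers, as with the Linux
guest implementation mentioned earlier:

/* Illustrative only: rough ring-occupancy numbers behind the
 * figures quoted in the commit message. */
#include <stdio.h>

int main(void)
{
    const int ring_entries = 256;          /* assumed rx ring size */
    const int entries_per_max_packet = 20; /* old scheme: ~64k buffer per packet */

    /* Old scheme: every packet, even an MTU-sized one, ties up a
     * full max-sized buffer's worth of ring entries. */
    printf("old scheme: %d packets per ring\n",
           ring_entries / entries_per_max_packet);   /* -> 12 */

    /* Mergeable buffers: an MTU-sized packet fits in one page-sized
     * buffer, a large packet consumes several of them as needed. */
    printf("mergeable:  up to %d MTU-sized packets per ring\n",
           ring_entries);                            /* -> 256 */

    return 0;
}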
 qemu/hw/virtio-net.c |  133 +++++++++++++++++++++++++++++++++++++++----------
 1 files changed, 106 insertions(+), 27 deletions(-)

diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c
index d6b4457..c082f02 100644
--- a/qemu/hw/virtio-net.c
+++ b/qemu/hw/virtio-net.c
@@ -34,9 +34,13 @@
 #define VIRTIO_NET_F_HOST_TSO6  12      /* Host can handle TSOv6 in. */
 #define VIRTIO_NET_F_HOST_ECN   13      /* Host can handle TSO[6] w/ ECN in. */
 #define VIRTIO_NET_F_HOST_UFO   14      /* Host can handle UFO in. */
+#define VIRTIO_NET_F_MRG_RXBUF  15      /* Host can merge receive buffers. */
 
 #define TX_TIMER_INTERVAL 150000 /* 150 us */
 
+/* Should be the largest MAX_SKB_FRAGS supported. */
+#define VIRTIO_NET_MAX_FRAGS 18
+
 /* The config defining mac address (6 bytes) */
 struct virtio_net_config
 {
@@ -61,6 +65,15 @@ struct virtio_net_hdr
     uint16_t csum_offset;
 };
 
+/* This is the version of the header to use when the MRG_RXBUF
+ * feature (or any later feature) has been negotiated. */
+struct virtio_net_hdr2
+{
+    struct virtio_net_hdr hdr;
+    uint8_t num_buffers;        /* Number of merged rx buffers */
+    uint8_t pad[21];            /* Pad to 32 bytes */
+};
+
 typedef struct VirtIONet
 {
     VirtIODevice vdev;
@@ -70,6 +83,8 @@ typedef struct VirtIONet
     VLANClientState *vc;
     QEMUTimer *tx_timer;
     int tx_timer_active;
+    int mergeable_rx_bufs;
+    int use_vnet_hdr2;
 } VirtIONet;
 
 /* TODO
@@ -106,6 +121,7 @@ static uint32_t virtio_net_get_features(VirtIODevice *vdev)
         features |= (1 << VIRTIO_NET_F_HOST_TSO4);
         features |= (1 << VIRTIO_NET_F_HOST_TSO6);
         features |= (1 << VIRTIO_NET_F_HOST_ECN);
+        features |= (1 << VIRTIO_NET_F_MRG_RXBUF);
 
         /* Kernel can't actually handle UFO in software currently. */
     }
@@ -117,6 +133,10 @@ static void virtio_net_set_features(VirtIODevice *vdev, uint32_t features)
     VirtIONet *n = to_virtio_net(vdev);
     VLANClientState *host = n->vc->vlan->first_client;
 
+    n->mergeable_rx_bufs = !!(features & (1 << VIRTIO_NET_F_MRG_RXBUF));
+
+    n->use_vnet_hdr2 = n->mergeable_rx_bufs;
+
     if (!tap_has_vnet_hdr(host) || !host->set_offload)
         return;
 
@@ -141,12 +161,15 @@ static int virtio_net_can_receive(void *opaque)
 {
     VirtIONet *n = opaque;
     VirtQueue *vq = n->rx_vq;
+    int min_bufs;
 
     if (vq->vring.avail == NULL ||
         !(n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK))
         return 0;
 
-    if (vq->vring.avail->idx == vq->last_avail_idx) {
+    min_bufs = n->mergeable_rx_bufs ? VIRTIO_NET_MAX_FRAGS : 1;
+
+    if ((uint16_t)(vq->vring.avail->idx - vq->last_avail_idx) < min_bufs) {
         vq->vring.used->flags &= ~VRING_USED_F_NO_NOTIFY;
         return 0;
     }
@@ -198,41 +221,86 @@ static int iov_fill(struct iovec *iov, int iovcnt, const void *buf, int count)
     return offset;
 }
 
-static void virtio_net_receive(void *opaque, const uint8_t *buf, int size)
+static int receive_header(VirtIONet *n, struct iovec *iov, int iovcnt,
+                          const void *buf, int size, int hdr_len)
 {
-    VirtIONet *n = opaque;
-    VirtQueueElement elem;
-    struct virtio_net_hdr *hdr;
+    struct virtio_net_hdr *hdr = iov[0].iov_base;
     int offset;
-    int total;
-
-    if (virtqueue_pop(n->rx_vq, &elem) == 0)
-        return;
-
-    if (elem.in_num < 1 || elem.in_sg[0].iov_len != sizeof(*hdr)) {
-        fprintf(stderr, "virtio-net header not in first element\n");
-        exit(1);
-    }
 
-    hdr = (void *)elem.in_sg[0].iov_base;
     hdr->flags = 0;
     hdr->gso_type = VIRTIO_NET_HDR_GSO_NONE;
 
-    offset = 0;
-    total = sizeof(*hdr);
-
     if (tap_has_vnet_hdr(n->vc->vlan->first_client)) {
         memcpy(hdr, buf, sizeof(*hdr));
-        offset += total;
-        work_around_broken_dhclient(hdr, buf + offset, size - offset);
+        offset = sizeof(*hdr);
+        work_around_broken_dhclient(hdr, buf + offset, size - offset);
+    }
+
+    iov[0].iov_base += hdr_len;
+    iov[0].iov_len -= hdr_len;
+
+    return offset;
+}
+
+static void virtio_net_receive(void *opaque, const uint8_t *buf, int size)
+{
+    VirtIONet *n = opaque;
+    struct virtio_net_hdr2 *hdr2 = NULL;
+    int hdr_len, offset, i;
+
+    hdr_len = n->use_vnet_hdr2 ?
+        sizeof(struct virtio_net_hdr2) : sizeof(struct virtio_net_hdr);
+
+    offset = i = 0;
+
+    while (offset < size) {
+        VirtQueueElement elem;
+        int len, total;
+
+        len = total = 0;
+
+        if ((i != 0 && !n->mergeable_rx_bufs) ||
+            virtqueue_pop(n->rx_vq, &elem) == 0) {
+            if (i == 0)
+                return;
+            fprintf(stderr, "virtio-net truncating packet\n");
+            exit(1);
+        }
+
+        if (n->mergeable_rx_bufs) {
+            if (elem.in_num < 1 || elem.in_sg[0].iov_len < TARGET_PAGE_SIZE) {
+                fprintf(stderr, "virtio-net IOV is irregular\n");
+                exit(1);
+            }
+        } else if (elem.in_num < 1 || elem.in_sg[0].iov_len != hdr_len) {
+            fprintf(stderr, "virtio-net header not in first element\n");
+            exit(1);
+        }
+
+        if (i == 0) {
+            if (n->use_vnet_hdr2)
+                hdr2 = (struct virtio_net_hdr2 *)elem.in_sg[0].iov_base;
+
+            offset += receive_header(n, &elem.in_sg[0], elem.in_num,
+                                     buf + offset, size - offset, hdr_len);
+            total += hdr_len;
+        }
+
+        /* copy in packet. ugh */
+        len = iov_fill(&elem.in_sg[0], elem.in_num,
+                       buf + offset, size - offset);
+        total += len;
+
+        /* signal other side */
+        virtqueue_fill(n->rx_vq, &elem, total, i++);
+
+        offset += len;
     }
 
-    /* copy in packet. ugh */
-    total += iov_fill(&elem.in_sg[1], elem.in_num - 1,
-                      buf + offset, size - offset);
+    if (hdr2)
+        hdr2->num_buffers = i;
 
-    /* signal other side */
-    virtqueue_push(n->rx_vq, &elem, total);
+    virtqueue_flush(n->rx_vq, i);
     virtio_notify(&n->vdev, n->rx_vq);
 }
 
@@ -249,8 +317,12 @@ static void virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq)
         ssize_t len = 0;
         unsigned int out_num = elem.out_num;
         struct iovec *out_sg = &elem.out_sg[0];
+        unsigned hdr_len;
+
+        hdr_len = n->use_vnet_hdr2 ?
+            sizeof(struct virtio_net_hdr2) : sizeof(struct virtio_net_hdr);
 
-        if (out_num < 1 || out_sg->iov_len != sizeof(struct virtio_net_hdr)) {
+        if (out_num < 1 || out_sg->iov_len != hdr_len) {
             fprintf(stderr, "virtio-net header not in first element\n");
             exit(1);
         }
@@ -259,7 +331,12 @@ static void virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq)
         if (!has_vnet_hdr) {
             out_num--;
             out_sg++;
-            len += sizeof(struct virtio_net_hdr);
+            len += hdr_len;
+        } else if (n->use_vnet_hdr2) {
+            /* tapfd expects a virtio_net_hdr */
+            hdr_len -= sizeof(struct virtio_net_hdr);
+            out_sg->iov_len -= hdr_len;
+            len += hdr_len;
         }
 
         len += qemu_sendv_packet(n->vc, out_sg, out_num);
@@ -353,6 +430,8 @@ PCIDevice *virtio_net_init(PCIBus *bus, NICInfo *nd, int devfn)
 
     n->tx_timer = qemu_new_timer(vm_clock, virtio_net_tx_timer, n);
     n->tx_timer_active = 0;
+    n->mergeable_rx_bufs = 0;
+    n->use_vnet_hdr2 = 0;
 
     register_savevm("virtio-net", virtio_net_id++, 1, virtio_net_save,
                     virtio_net_load, n);
-- 
1.6.0.1