From mboxrd@z Thu Jan  1 00:00:00 1970
From: Mark McLoughlin
Subject: Re: [PATCH 5/5] kvm: qemu: Improve virtio_net recv buffer allocation scheme
Date: Tue, 14 Oct 2008 14:44:00 +0100
Message-ID: <1223991840.11098.29.camel@blaa>
References: <>
 <1223494513-18826-1-git-send-email-markmc@redhat.com>
 <1223494513-18826-2-git-send-email-markmc@redhat.com>
 <1223494513-18826-3-git-send-email-markmc@redhat.com>
 <1223494513-18826-4-git-send-email-markmc@redhat.com>
 <1223494513-18826-5-git-send-email-markmc@redhat.com>
 <48F1CABA.8010301@redhat.com>
Reply-To: Mark McLoughlin
Mime-Version: 1.0
Content-Type: text/plain
Content-Transfer-Encoding: 7bit
Cc: kvm@vger.kernel.org, Rusty Russell, Herbert Xu, Anthony Liguori
To: Avi Kivity
Return-path:
Received: from mx2.redhat.com ([66.187.237.31]:39743 "EHLO mx2.redhat.com"
 rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
 id S1750929AbYJNNu2 (ORCPT );
 Tue, 14 Oct 2008 09:50:28 -0400
In-Reply-To: <48F1CABA.8010301@redhat.com>
Sender: kvm-owner@vger.kernel.org
List-ID:

On Sun, 2008-10-12 at 12:00 +0200, Avi Kivity wrote:
> Mark McLoughlin wrote:
> > diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c
> > index 403247b..afa5fe5 100644
> > --- a/qemu/hw/virtio-net.c
> > +++ b/qemu/hw/virtio-net.c
> > @@ -34,9 +34,13 @@
> >  #define VIRTIO_NET_F_HOST_TSO6  12     /* Host can handle TSOv6 in. */
> >  #define VIRTIO_NET_F_HOST_ECN   13     /* Host can handle TSO[6] w/ ECN in. */
> >  #define VIRTIO_NET_F_HOST_UFO   14     /* Host can handle UFO in. */
> > +#define VIRTIO_NET_F_MRG_RXBUF  15     /* Host can merge receive buffers. */
> >
> What's the status of the guest side of this feature?

Waiting on Rusty, basically.

As requested, I've changed the ABI so we include the number of merged
buffers in the virtio_net_hdr:

  http://lkml.org/lkml/2008/10/10/204

> >  #define TX_TIMER_INTERVAL 150000 /* 150 us */
> >
> > +/* Should be the largest MAX_SKB_FRAGS supported. */
> > +#define VIRTIO_NET_MAX_FRAGS 18
> > +
> >
> This should be advertised by the host to the guest (or vice-versa?).
> We're embedding Linux-specific magic numbers in a guest-OS-agnostic ABI.
>
> Preferably, there shouldn't be a limit at all.

Yeah, it's far from pretty.

The current ABI basically says "you must supply >64k receive buffers",
whereas this new ABI says "you must supply at least 18 >4k receive
buffers".

We could think about having the host expose the maximum rx packet size
to the guest (and handle migrating to a host with a different max), but
TBH I don't think it would be worth much until we have the prospect of
running on a host with a larger maximum rx packet size.

Requiring the guest to fill the ring with ~64k of buffers isn't
onerous; the Linux guest impl currently re-fills the ring up to the max
(e.g. 256 x 4k).

> > @@ -209,7 +220,12 @@ static void virtio_net_receive(void *opaque, const uint8_t *buf, int size)
> >      if (virtqueue_pop(n->rx_vq, &elem) == 0)
> >          return;
> >
> > -    if (elem.in_num < 1 || elem.in_sg[0].iov_len != sizeof(*hdr)) {
> > +    if (n->mergeable_rx_bufs) {
> > +        if (elem.in_num < 1 || elem.in_sg[0].iov_len < TARGET_PAGE_SIZE) {
> > +            fprintf(stderr, "virtio-net IOV is irregular\n");
> > +            exit(1);
> > +        }
> >
> Again, this is burying details of the current Linux stack into the ABI.
> The Linux stack may change not to be page oriented, or maybe this won't
> fit well with how Windows views things. Can this be made not to depend
> on the size of the iov elements?

This actually relates to the check in can_receive() - we need to be
sure that we have enough buffers in the ring before we read() from the
tapfd.

So, it's not so much about the guest being page oriented, but rather
about having a simple way to determine whether there are enough buffers
in the ring - "do we have 18 >4k buffers?" is preferable to peeking into
the ring and adding up the sizes of the available buffers.
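To make that concrete, the check boils down to something like this (a
stand-alone sketch only - the struct and function names here are made
up for illustration; the real code is in virtio_net_can_receive() in
the re-factored patch below):

/* Sketch only: illustrates the "enough rx buffers?" test, not the
 * actual qemu code. */
#include <stdint.h>

#define VIRTIO_NET_MAX_FRAGS 18   /* largest MAX_SKB_FRAGS we cater for */

struct vring_state {
    uint16_t avail_idx;       /* ring index the guest has published */
    uint16_t last_avail_idx;  /* last index the host has consumed */
};

/* Return non-zero if enough rx buffers are queued to read() one
 * packet from the tap fd without risking truncation. */
static int have_enough_rx_buffers(const struct vring_state *vq,
                                  int mergeable_rx_bufs)
{
    /* Mergeable buffers: up to 18 page-sized (>4k) buffers for a
     * max-sized packet; otherwise a single >64k buffer is enough. */
    uint16_t min_bufs = mergeable_rx_bufs ? VIRTIO_NET_MAX_FRAGS : 1;

    return (uint16_t)(vq->avail_idx - vq->last_avail_idx) >= min_bufs;
}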
> > +    } else {
> > +        iov_fill(&elem.in_sg[1], elem.in_num - 1,
> > +                 buf + offset, size - offset);
> > +
> > +        /* signal other side */
> > +        virtqueue_push(n->rx_vq, &elem, total);
> > +    }
> > +
> >
> Can we merge the two sides of the if () so that the only difference is
> the number of times we go through the loop?

Yeah, good point. Re-factored version below, which also includes the
virtio_net_hdr2 changes.

Cheers,
Mark.

Subject: kvm: qemu: Improve virtio_net recv buffer allocation scheme

Currently, in order to receive large packets, the guest must allocate
max-sized packet buffers and pass them to the host. Each of these
max-sized packets occupies 20 ring entries, which means we can only
transfer a maximum of 12 packets in a single batch with a 256-entry
ring.

When receiving packets from external networks, we only receive
MTU-sized packets, so the observed throughput is throttled by the
number of packets the ring can hold.

Implement the VIRTIO_NET_F_MRG_RXBUF feature to let guests know that we
can merge smaller buffers together in order to handle large packets.
This scheme allows us to be efficient in our use of ring entries while
still supporting large packets.

Benchmarking with netperf from an external machine to a guest over a
10Gb/s network shows a 100% improvement, from ~1Gb/s to ~2Gb/s. In a
local host->guest benchmark with GSO disabled on the host side,
throughput increased from 700Mb/s to 1.7Gb/s.

Based on a patch from Herbert Xu.

Signed-off-by: Mark McLoughlin
---
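For reference, a quick back-of-the-envelope on the ring usage described
above - an illustrative sketch only, not part of the patch; it assumes
a 256-entry rx ring and single-page guest buffers, as with the Linux
guest implementation mentioned earlier:

/* Illustrative only: rough ring-occupancy numbers behind the
 * figures quoted in the commit message. */
#include <stdio.h>

int main(void)
{
    const int ring_entries = 256;          /* assumed rx ring size */
    const int entries_per_max_packet = 20; /* old scheme: ~64k buffer per packet */

    /* Old scheme: every packet, even an MTU-sized one, ties up a
     * full max-sized buffer's worth of ring entries. */
    printf("old scheme: %d packets per ring\n",
           ring_entries / entries_per_max_packet);   /* -> 12 */

    /* Mergeable buffers: an MTU-sized packet fits in one page-sized
     * buffer, a large packet consumes several of them as needed. */
    printf("mergeable:  up to %d MTU-sized packets per ring\n",
           ring_entries);                            /* -> 256 */

    return 0;
}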
 qemu/hw/virtio-net.c |  133 +++++++++++++++++++++++++++++++++++++++----------
 1 files changed, 106 insertions(+), 27 deletions(-)

diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c
index d6b4457..c082f02 100644
--- a/qemu/hw/virtio-net.c
+++ b/qemu/hw/virtio-net.c
@@ -34,9 +34,13 @@
 #define VIRTIO_NET_F_HOST_TSO6  12      /* Host can handle TSOv6 in. */
 #define VIRTIO_NET_F_HOST_ECN   13      /* Host can handle TSO[6] w/ ECN in. */
 #define VIRTIO_NET_F_HOST_UFO   14      /* Host can handle UFO in. */
+#define VIRTIO_NET_F_MRG_RXBUF  15      /* Host can merge receive buffers. */
 
 #define TX_TIMER_INTERVAL 150000 /* 150 us */
 
+/* Should be the largest MAX_SKB_FRAGS supported. */
+#define VIRTIO_NET_MAX_FRAGS 18
+
 /* The config defining mac address (6 bytes) */
 struct virtio_net_config
 {
@@ -61,6 +65,15 @@ struct virtio_net_hdr
     uint16_t csum_offset;
 };
 
+/* This is the version of the header to use when the MRG_RXBUF
+ * feature (or any later feature) has been negotiated. */
+struct virtio_net_hdr2
+{
+    struct virtio_net_hdr hdr;
+    uint8_t num_buffers;        /* Number of merged rx buffers */
+    uint8_t pad[21];            /* Pad to 32 bytes */
+};
+
 typedef struct VirtIONet
 {
     VirtIODevice vdev;
@@ -70,6 +83,8 @@ typedef struct VirtIONet
     VLANClientState *vc;
     QEMUTimer *tx_timer;
     int tx_timer_active;
+    int mergeable_rx_bufs;
+    int use_vnet_hdr2;
 } VirtIONet;
 
 /* TODO
@@ -106,6 +121,7 @@ static uint32_t virtio_net_get_features(VirtIODevice *vdev)
         features |= (1 << VIRTIO_NET_F_HOST_TSO4);
         features |= (1 << VIRTIO_NET_F_HOST_TSO6);
         features |= (1 << VIRTIO_NET_F_HOST_ECN);
+        features |= (1 << VIRTIO_NET_F_MRG_RXBUF);
 
         /* Kernel can't actually handle UFO in software currently. */
     }
@@ -117,6 +133,10 @@ static void virtio_net_set_features(VirtIODevice *vdev, uint32_t features)
     VirtIONet *n = to_virtio_net(vdev);
     VLANClientState *host = n->vc->vlan->first_client;
 
+    n->mergeable_rx_bufs = !!(features & (1 << VIRTIO_NET_F_MRG_RXBUF));
+
+    n->use_vnet_hdr2 = n->mergeable_rx_bufs;
+
     if (!tap_has_vnet_hdr(host) || !host->set_offload)
         return;
 
@@ -141,12 +161,15 @@ static int virtio_net_can_receive(void *opaque)
 {
     VirtIONet *n = opaque;
     VirtQueue *vq = n->rx_vq;
+    int min_bufs;
 
     if (vq->vring.avail == NULL ||
         !(n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK))
         return 0;
 
-    if (vq->vring.avail->idx == vq->last_avail_idx) {
+    min_bufs = n->mergeable_rx_bufs ? VIRTIO_NET_MAX_FRAGS : 1;
+
+    if ((uint16_t)(vq->vring.avail->idx - vq->last_avail_idx) < min_bufs) {
         vq->vring.used->flags &= ~VRING_USED_F_NO_NOTIFY;
         return 0;
     }
@@ -198,41 +221,86 @@ static int iov_fill(struct iovec *iov, int iovcnt, const void *buf, int count)
     return offset;
 }
 
-static void virtio_net_receive(void *opaque, const uint8_t *buf, int size)
+static int receive_header(VirtIONet *n, struct iovec *iov, int iovcnt,
+                          const void *buf, int size, int hdr_len)
 {
-    VirtIONet *n = opaque;
-    VirtQueueElement elem;
-    struct virtio_net_hdr *hdr;
+    struct virtio_net_hdr *hdr = iov[0].iov_base;
     int offset;
-    int total;
-
-    if (virtqueue_pop(n->rx_vq, &elem) == 0)
-        return;
-
-    if (elem.in_num < 1 || elem.in_sg[0].iov_len != sizeof(*hdr)) {
-        fprintf(stderr, "virtio-net header not in first element\n");
-        exit(1);
-    }
 
-    hdr = (void *)elem.in_sg[0].iov_base;
     hdr->flags = 0;
     hdr->gso_type = VIRTIO_NET_HDR_GSO_NONE;
 
-    offset = 0;
-    total = sizeof(*hdr);
-
     if (tap_has_vnet_hdr(n->vc->vlan->first_client)) {
         memcpy(hdr, buf, sizeof(*hdr));
-        offset += total;
-        work_around_broken_dhclient(hdr, buf + offset, size - offset);
+        offset = sizeof(*hdr);
+        work_around_broken_dhclient(hdr, buf + offset, size - offset);
+    }
+
+    iov[0].iov_base += hdr_len;
+    iov[0].iov_len -= hdr_len;
+
+    return offset;
+}
+
+static void virtio_net_receive(void *opaque, const uint8_t *buf, int size)
+{
+    VirtIONet *n = opaque;
+    struct virtio_net_hdr2 *hdr2 = NULL;
+    int hdr_len, offset, i;
+
+    hdr_len = n->use_vnet_hdr2 ?
+        sizeof(struct virtio_net_hdr2) : sizeof(struct virtio_net_hdr);
+
+    offset = i = 0;
+
+    while (offset < size) {
+        VirtQueueElement elem;
+        int len, total;
+
+        len = total = 0;
+
+        if ((i != 0 && !n->mergeable_rx_bufs) ||
+            virtqueue_pop(n->rx_vq, &elem) == 0) {
+            if (i == 0)
+                return;
+            fprintf(stderr, "virtio-net truncating packet\n");
+            exit(1);
+        }
+
+        if (n->mergeable_rx_bufs) {
+            if (elem.in_num < 1 || elem.in_sg[0].iov_len < TARGET_PAGE_SIZE) {
+                fprintf(stderr, "virtio-net IOV is irregular\n");
+                exit(1);
+            }
+        } else if (elem.in_num < 1 || elem.in_sg[0].iov_len != hdr_len) {
+            fprintf(stderr, "virtio-net header not in first element\n");
+            exit(1);
+        }
+
+        if (i == 0) {
+            if (n->use_vnet_hdr2)
+                hdr2 = (struct virtio_net_hdr2 *)elem.in_sg[0].iov_base;
+
+            offset += receive_header(n, &elem.in_sg[0], elem.in_num,
+                                     buf + offset, size - offset, hdr_len);
+            total += hdr_len;
+        }
+
+        /* copy in packet. ugh */
+        len = iov_fill(&elem.in_sg[0], elem.in_num,
+                       buf + offset, size - offset);
+        total += len;
+
+        /* signal other side */
+        virtqueue_fill(n->rx_vq, &elem, total, i++);
+
+        offset += len;
     }
 
-    /* copy in packet. ugh */
-    total += iov_fill(&elem.in_sg[1], elem.in_num - 1,
-                      buf + offset, size - offset);
+    if (hdr2)
+        hdr2->num_buffers = i;
 
-    /* signal other side */
-    virtqueue_push(n->rx_vq, &elem, total);
+    virtqueue_flush(n->rx_vq, i);
     virtio_notify(&n->vdev, n->rx_vq);
 }
 
@@ -249,8 +317,12 @@ static void virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq)
         ssize_t len = 0;
         unsigned int out_num = elem.out_num;
         struct iovec *out_sg = &elem.out_sg[0];
+        unsigned hdr_len;
+
+        hdr_len = n->use_vnet_hdr2 ?
+            sizeof(struct virtio_net_hdr2) : sizeof(struct virtio_net_hdr);
 
-        if (out_num < 1 || out_sg->iov_len != sizeof(struct virtio_net_hdr)) {
+        if (out_num < 1 || out_sg->iov_len != hdr_len) {
             fprintf(stderr, "virtio-net header not in first element\n");
             exit(1);
         }
@@ -259,7 +331,12 @@ static void virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq)
         if (!has_vnet_hdr) {
             out_num--;
             out_sg++;
-            len += sizeof(struct virtio_net_hdr);
+            len += hdr_len;
+        } else if (n->use_vnet_hdr2) {
+            /* tapfd expects a virtio_net_hdr */
+            hdr_len -= sizeof(struct virtio_net_hdr);
+            out_sg->iov_len -= hdr_len;
+            len += hdr_len;
         }
 
         len += qemu_sendv_packet(n->vc, out_sg, out_num);
@@ -353,6 +430,8 @@ PCIDevice *virtio_net_init(PCIBus *bus, NICInfo *nd, int devfn)
 
     n->tx_timer = qemu_new_timer(vm_clock, virtio_net_tx_timer, n);
     n->tx_timer_active = 0;
+    n->mergeable_rx_bufs = 0;
+    n->use_vnet_hdr2 = 0;
 
     register_savevm("virtio-net", virtio_net_id++, 1, virtio_net_save,
                     virtio_net_load, n);
-- 
1.6.0.1