Linux virtualization list
 help / color / mirror / Atom feed
* Re: [net-next RFC PATCH 5/5] virtio-net: flow director support
From: Stefan Hajnoczi @ 2011-12-05 10:55 UTC (permalink / raw)
  To: Jason Wang
  Cc: krkumar2, kvm, mst, netdev, virtualization, levinsasha928,
	bhutchings
In-Reply-To: <20111205085925.6116.94352.stgit@dhcp-8-146.nay.redhat.com>

On Mon, Dec 5, 2011 at 8:59 AM, Jason Wang <jasowang@redhat.com> wrote:
> +static int virtnet_set_fd(struct net_device *dev, u32 pfn)
> +{
> +       struct virtnet_info *vi = netdev_priv(dev);
> +       struct virtio_device *vdev = vi->vdev;
> +
> +       if (virtio_has_feature(vdev, VIRTIO_NET_F_HOST_FD)) {
> +               vdev->config->set(vdev,
> +                                 offsetof(struct virtio_net_config_fd, addr),
> +                                 &pfn, sizeof(u32));

Please use the virtio model (i.e. virtqueues) instead of shared
memory.  Mapping a page breaks the virtio abstraction.

Stefan

^ permalink raw reply

* Re: [PATCH v3 2/3] hvc_init(): Enforce one-time initialization.
From: Amit Shah @ 2011-12-05 10:54 UTC (permalink / raw)
  To: Miche Baker-Harvey
  Cc: Stephen Rothwell, xen-devel, Konrad Rzeszutek Wilk,
	Benjamin Herrenschmidt, linux-kernel, virtualization,
	Anton Blanchard, Mike Waychison, ppc-dev, Greg Kroah-Hartman,
	Eric Northrup
In-Reply-To: <CAB8Rdapfuf5XH+X9gsUL+CJ9gTQcSNfq4Dj9DoYWQeFgD04ODQ@mail.gmail.com>

On (Tue) 29 Nov 2011 [09:50:41], Miche Baker-Harvey wrote:
> Good grief!  Sorry for the spacing mess-up!  Here's a resend with reformatting.
> 
> Amit,
> We aren't using either QEMU or kvmtool, but we are using KVM.  All

So it's a different userspace?  Any chance this different userspace is
causing these problems to appear?  Esp. since I couldn't reproduce
with qemu.

> the issues we are seeing happen when we try to establish multiple
> virtioconsoles at boot time.  The command line isn't relevant, but I
> can tell you the protocol that's passing between the host (kvm) and
> the guest (see the end of this message).
> 
> We do go through the control_work_handler(), but it's not
> providing synchronization.  Here's a trace of the
> control_work_handler() and handle_control_message() calls; note that
> there are two concurrent calls to control_work_handler().

Ah; how does that happen?  control_work_handler() should just be
invoked once, and if there are any more pending work items to be
consumed, they should be done within the loop inside
control_work_handler().

> I decorated control_work_handler() with a "lifetime" marker, and
> passed this value to handle_control_message(), so we can see which
> control messages are being handled from which instance of
> the control_work_handler() thread.
> 
> Notice that we enter control_work_handler() a second time before
> the handling of the second PORT_ADD message is complete. The
> first CONSOLE_PORT message is handled by the second
> control_work_handler() call, but the second is handled by the first
> control_work_handler() call.
> 
> root@myubuntu:~# dmesg | grep MBH
> [3371055.808738] control_work_handler #1
> [3371055.809372] + #1 handle_control_message PORT_ADD
> [3371055.810169] - handle_control_message PORT_ADD
> [3371055.810170] + #1 handle_control_message PORT_ADD
> [3371055.810244]  control_work_handler #2
> [3371055.810245] + #2 handle_control_message CONSOLE_PORT
> [3371055.810246]  got hvc_ports_mutex
> [3371055.810578] - handle_control_message PORT_ADD
> [3371055.810579] + #1 handle_control_message CONSOLE_PORT
> [3371055.810580]  trylock of hvc_ports_mutex failed
> [3371055.811352]  got hvc_ports_mutex
> [3371055.811370] - handle_control_message CONSOLE_PORT
> [3371055.816609] - handle_control_message CONSOLE_PORT
> 
> So, I'm guessing the bug is that there shouldn't be two instances of
> control_work_handler() running simultaneously?

Yep, I assumed we did that but apparently not.  Do you plan to chase
this one down?

		Amit

^ permalink raw reply

* Re: [net-next RFC PATCH 2/5] tuntap: simple flow director support
From: Stefan Hajnoczi @ 2011-12-05 10:38 UTC (permalink / raw)
  To: Jason Wang
  Cc: krkumar2, kvm, mst, netdev, virtualization, levinsasha928,
	bhutchings
In-Reply-To: <20111205085857.6116.99252.stgit@dhcp-8-146.nay.redhat.com>

On Mon, Dec 5, 2011 at 8:58 AM, Jason Wang <jasowang@redhat.com> wrote:
> This patch adds a simple flow director to tun/tap device. It is just a
> page that contains the hash to queue mapping which could be changed by
> user-space. The backend (tap/macvtap) would query this table to get
> the desired queue of a packets when it send packets to userspace.
>
> The page address were set through a new kind of ioctl - TUNSETFD and
> were pinned until device exit or another new page were specified.

Please use "flow" or "fdir" instead of "fd" in the ioctl and code.
"fd" reminds of file descriptor.  The ixgbe driver uses "fdir".

Stefan

^ permalink raw reply

* Re: [PATCH] virtio-ring: Use threshold for switching to indirect descriptors
From: Avi Kivity @ 2011-12-05  9:52 UTC (permalink / raw)
  To: Rusty Russell
  Cc: markmc, kvm, Michael S. Tsirkin, linux-kernel, virtualization,
	Sasha Levin
In-Reply-To: <87bornri92.fsf@rustcorp.com.au>

On 12/05/2011 02:10 AM, Rusty Russell wrote:
> On Sun, 04 Dec 2011 17:16:59 +0200, Avi Kivity <avi@redhat.com> wrote:
> > On 12/04/2011 05:11 PM, Michael S. Tsirkin wrote:
> > > > There's also the used ring, but that's a
> > > > mistake if you have out of order completion.  We should have used copying.
> > >
> > > Seems unrelated... unless you want used to be written into
> > > descriptor ring itself?
> > 
> > The avail/used rings are in addition to the regular ring, no?  If you
> > copy descriptors, then it goes away.
>
> There were two ideas which drove the current design:
>
> 1) The Van-Jacobson style "no two writers to same cacheline makes rings
>    fast" idea.  Empirically, this doesn't show any winnage.

Write/write is the same as write/read or read/write.  Both cases have to
send a probe and wait for the result.  What we really need is to
minimize cache line ping ponging, and the descriptor pool fails that
with ooo completion.  I doubt it's measurable though except with the
very fastest storage providers.

> 2) Allowing a generic inter-guest copy mechanism, so we could have
>    genuinely untrusted driver domains.  Yet noone ever did this so it's
>    hardly a killer feature :(

It's still a goal, though not an important one.  But we have to
translate rings anyway, don't, since buffers are in guest physical
addresses, and we're moving into an address space that doesn't map those.

I thought of having a vhost-copy driver that could do ring translation,
using a dma engine for the copy.

> So if we're going to revisit and drop those requirements, I'd say:
>
> 1) Shared device/driver rings like Xen.  Xen uses device-specific ring
>    contents, I'd be tempted to stick to our pre-headers, and a 'u64
>    addr; u64 len_and_flags; u64 cookie;' generic style.  Then use
>    the same ring for responses.  That's a slight space-win, since
>    we're 24 bytes vs 26 bytes now.

Let's cheat and have inline contents.  Take three bits from
len_and_flags to specify additional descriptors as inline data.  Also,
stuff the cookie into len_and_flags as well.

> 2) Stick with physically-contiguous rings, but use them of size (2^n)-1.
>    Makes the indexing harder, but that -1 lets us stash the indices in
>    the first entry and makes the ring a nice 2^n size.

Allocate at lease a cache line for those.  The 2^n size is not really
material, a division is never necessary.

> > > > 16kB worth of descriptors is 1024 entries.  With 4kB buffers, that's 4MB
> > > > worth of data, or 4 ms at 10GbE line speed.  With 1500 byte buffers it's
> > > > just 1.5 ms.  In any case I think it's sufficient.
> > >
> > > Right. So I think that without indirect, we waste about 3 entries
> > > per packet for virtio header and transport etc headers.
> > 
> > That does suck.  Are there issues in increasing the ring size?  Or
> > making it discontiguous?
>
> Because the qemu implementation is broken.  

I was talking about something else, but this is more important.  Every
time we make a simplifying assumption, it turns around and bites us, and
the code becomes twice as complicated as it would have been in the first
place, and the test matrix explodes.

> We can often put the virtio
> header at the head of the packet.  In practice, the qemu implementation
> insists the header be a single descriptor.
>
> (At least, it used to, perhaps it has now been fixed.  We need a
> VIRTIO_NET_F_I_NOW_CONFORM_TO_THE_DAMN_SPEC_SORRY_I_SUCK bit).

We'll run out of bits in no time.

> We currently use small rings: the guest can't negotiate so qemu has to
> offer a lowest-common-denominator value.  The new virtio-pci layout
> fixes this, and lets the guest set the ring size.

Ok good.  Note the figuring out the best ring size needs some info from
the host, but that can be had from other channels.

> > Can you take a peek at how Xen manages its rings?  They have the same
> > problems we do.
>
> Yes, I made some mistakes, but I did steal from them in the first
> place...

There was a bit of second system syndrome there.  And I don't understand
how the ring/pool issue didn't surface during review, it seems so
obvious now but completely eluded me then.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply

* [net-next RFC PATCH 5/5] virtio-net: flow director support
From: Jason Wang @ 2011-12-05  8:59 UTC (permalink / raw)
  To: krkumar2, kvm, mst, netdev, rusty, virtualization, levinsasha928,
	bhutchings
In-Reply-To: <20111205085603.6116.65101.stgit@dhcp-8-146.nay.redhat.com>

In order to let the packets of a flow to be passed to the desired
guest cpu, we can co-operate with devices through programming the flow
director which was just a hash to queue table.

This kinds of co-operation is done through the accelerate RFS support,
a device specific flow sterring method virtnet_fd() is used to modify
the flow director based on rfs mapping. The desired queue were
calculated through reverse mapping of the irq affinity table. In order
to parallelize the ingress path, irq affinity of rx queue were also
provides by the driver.

In addition to accelerate RFS, we can also use the guest scheduler to
balance the load of TX and reduce the lock contention on egress path,
so the processor_id() were used to tx queue selection.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/virtio_net.c   |  165 +++++++++++++++++++++++++++++++++++++++++++-
 include/linux/virtio_net.h |    6 ++
 2 files changed, 169 insertions(+), 2 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 0d871f8..89bb5e7 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -26,6 +26,10 @@
 #include <linux/scatterlist.h>
 #include <linux/if_vlan.h>
 #include <linux/slab.h>
+#include <linux/highmem.h>
+#include <linux/cpu_rmap.h>
+#include <linux/interrupt.h>
+#include <linux/cpumask.h>
 
 static int napi_weight = 128;
 module_param(napi_weight, int, 0444);
@@ -40,6 +44,7 @@ module_param(gso, bool, 0444);
 
 #define VIRTNET_SEND_COMMAND_SG_MAX    2
 #define VIRTNET_DRIVER_VERSION "1.0.0"
+#define TAP_HASH_MASK 0xFF
 
 struct virtnet_send_stats {
 	struct u64_stats_sync syncp;
@@ -89,6 +94,9 @@ struct receive_queue {
 
 	/* Active rx statistics */
 	struct virtnet_recv_stats __percpu *stats;
+
+	/* FIXME: per vector instead of per queue ?? */
+	cpumask_var_t affinity_mask;
 };
 
 struct virtnet_info {
@@ -110,6 +118,11 @@ struct virtnet_info {
 
 	/* Host will pass rxhash to us. */
 	bool has_rxhash;
+
+	/* A page of flow director */
+	struct page *fd_page;
+
+	cpumask_var_t affinity_mask;
 };
 
 struct skb_vnet_hdr {
@@ -386,6 +399,7 @@ static void receive_buf(struct receive_queue *rq, void *buf, unsigned int len)
 	if (vi->has_rxhash)
 		skb->rxhash = hdr->rhdr.rxhash;
 
+	skb_record_rx_queue(skb, rq->vq->queue_index / 2);
 	netif_receive_skb(skb);
 	return;
 
@@ -722,6 +736,19 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 	return NETDEV_TX_OK;
 }
 
+static int virtnet_set_fd(struct net_device *dev, u32 pfn)
+{
+	struct virtnet_info *vi = netdev_priv(dev);
+	struct virtio_device *vdev = vi->vdev;
+
+	if (virtio_has_feature(vdev, VIRTIO_NET_F_HOST_FD)) {
+		vdev->config->set(vdev,
+				  offsetof(struct virtio_net_config_fd, addr),
+				  &pfn, sizeof(u32));
+	}
+	return 0;
+}
+
 static int virtnet_set_mac_address(struct net_device *dev, void *p)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
@@ -1017,6 +1044,39 @@ static int virtnet_change_mtu(struct net_device *dev, int new_mtu)
 	return 0;
 }
 
+#ifdef CONFIG_RFS_ACCEL
+
+int virtnet_fd(struct net_device *net_dev, const struct sk_buff *skb,
+	       u16 rxq_index, u32 flow_id)
+{
+	struct virtnet_info *vi = netdev_priv(net_dev);
+	u16 *table = NULL;
+
+	if (skb->protocol != htons(ETH_P_IP) || !skb->rxhash)
+		return -EPROTONOSUPPORT;
+
+	table = kmap_atomic(vi->fd_page);
+	table[skb->rxhash & TAP_HASH_MASK] = rxq_index;
+	kunmap_atomic(table);
+
+	return 0;
+}
+#endif
+
+static u16 virtnet_select_queue(struct net_device *dev, struct sk_buff *skb)
+{
+	int txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) :
+					       smp_processor_id();
+
+	/* As we make use of the accelerate rfs which let the scheduler to
+	 * balance the load, it make sense to choose the tx queue also based on
+	 * theprocessor id?
+	 */
+	while (unlikely(txq >= dev->real_num_tx_queues))
+		txq -= dev->real_num_tx_queues;
+	return txq;
+}
+
 static const struct net_device_ops virtnet_netdev = {
 	.ndo_open            = virtnet_open,
 	.ndo_stop   	     = virtnet_close,
@@ -1028,9 +1088,13 @@ static const struct net_device_ops virtnet_netdev = {
 	.ndo_get_stats64     = virtnet_stats,
 	.ndo_vlan_rx_add_vid = virtnet_vlan_rx_add_vid,
 	.ndo_vlan_rx_kill_vid = virtnet_vlan_rx_kill_vid,
+	.ndo_select_queue    = virtnet_select_queue,
 #ifdef CONFIG_NET_POLL_CONTROLLER
 	.ndo_poll_controller = virtnet_netpoll,
 #endif
+#ifdef CONFIG_RFS_ACCEL
+	.ndo_rx_flow_steer   = virtnet_fd,
+#endif
 };
 
 static void virtnet_update_status(struct virtnet_info *vi)
@@ -1272,12 +1336,76 @@ static int virtnet_setup_vqs(struct virtnet_info *vi)
 	return ret;
 }
 
+static int virtnet_init_rx_cpu_rmap(struct virtnet_info *vi)
+{
+#ifdef CONFIG_RFS_ACCEL
+	struct virtio_device *vdev = vi->vdev;
+	int i, rc;
+
+	vi->dev->rx_cpu_rmap = alloc_irq_cpu_rmap(vi->num_queue_pairs);
+	if (!vi->dev->rx_cpu_rmap)
+		return -ENOMEM;
+	for (i = 0; i < vi->num_queue_pairs; i++) {
+		rc = irq_cpu_rmap_add(vi->dev->rx_cpu_rmap,
+				vdev->config->get_vq_irq(vdev, vi->rq[i]->vq));
+		if (rc) {
+			free_irq_cpu_rmap(vi->dev->rx_cpu_rmap);
+			vi->dev->rx_cpu_rmap = NULL;
+			return rc;
+		}
+	}
+#endif
+	return 0;
+}
+
+static int virtnet_init_rq_affinity(struct virtnet_info *vi)
+{
+	struct virtio_device *vdev = vi->vdev;
+	int i;
+
+	/* FIXME: TX/RX share a vector */
+	for (i = 0; i < vi->num_queue_pairs; i++) {
+		if (!alloc_cpumask_var(&vi->rq[i]->affinity_mask, GFP_KERNEL))
+			goto err_out;
+		cpumask_set_cpu(i, vi->rq[i]->affinity_mask);
+		irq_set_affinity_hint(vdev->config->get_vq_irq(vdev,
+							       vi->rq[i]->vq),
+				      vi->rq[i]->affinity_mask);
+	}
+
+	return 0;
+err_out:
+	while (i) {
+		i--;
+		irq_set_affinity_hint(vdev->config->get_vq_irq(vdev,
+							       vi->rq[i]->vq),
+				      NULL);
+		free_cpumask_var(vi->rq[i]->affinity_mask);
+	}
+	return -ENOMEM;
+}
+
+static void virtnet_free_rq_affinity(struct virtnet_info *vi)
+{
+	struct virtio_device *vdev = vi->vdev;
+	int i;
+
+	for (i = 0; i < vi->num_queue_pairs; i++) {
+		irq_set_affinity_hint(vdev->config->get_vq_irq(vdev,
+							       vi->rq[i]->vq),
+				      NULL);
+		free_cpumask_var(vi->rq[i]->affinity_mask);
+	}
+}
+
 static int virtnet_probe(struct virtio_device *vdev)
 {
 	int i, err;
 	struct net_device *dev;
 	struct virtnet_info *vi;
 	u16 num_queues, num_queue_pairs;
+	struct page *page = NULL;
+	u16 *table = NULL;
 
 	/* Find if host supports multiqueue virtio_net device */
 	err = virtio_config_val(vdev, VIRTIO_NET_F_MULTIQUEUE,
@@ -1298,7 +1426,7 @@ static int virtnet_probe(struct virtio_device *vdev)
 	/* Set up network device as normal. */
 	dev->priv_flags |= IFF_UNICAST_FLT;
 	dev->netdev_ops = &virtnet_netdev;
-	dev->features = NETIF_F_HIGHDMA;
+	dev->features = NETIF_F_HIGHDMA | NETIF_F_NTUPLE;
 
 	SET_ETHTOOL_OPS(dev, &virtnet_ethtool_ops);
 	SET_NETDEV_DEV(dev, &vdev->dev);
@@ -1342,6 +1470,7 @@ static int virtnet_probe(struct virtio_device *vdev)
 	vdev->priv = vi;
 	vi->num_queue_pairs = num_queue_pairs;
 
+
 	/* If we can receive ANY GSO packets, we must allocate large ones. */
 	if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO4) ||
 	    virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO6) ||
@@ -1382,6 +1511,31 @@ static int virtnet_probe(struct virtio_device *vdev)
 		}
 	}
 
+	/* Config flow director */
+	if (virtio_has_feature(vdev, VIRTIO_NET_F_HOST_FD)) {
+		page = alloc_page(GFP_KERNEL);
+		if (!page)
+			return -ENOMEM;
+		table = (u16 *)kmap_atomic(page);
+		for (i = 0; i < (PAGE_SIZE / 16); i++) {
+			/* invalid all entries */
+			table[i] = num_queue_pairs;
+		}
+
+		vi->fd_page = page;
+		kunmap_atomic(table);
+		virtnet_set_fd(dev, page_to_pfn(page));
+
+		err = virtnet_init_rx_cpu_rmap(vi);
+		if (err)
+			goto free_recv_bufs;
+
+		err = virtnet_init_rq_affinity(vi);
+		if (err)
+			goto free_recv_bufs;
+
+	}
+
 	/* Assume link up if device can't report link status,
 	   otherwise get link status from config. */
 	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_STATUS)) {
@@ -1437,6 +1591,13 @@ static void __devexit virtnet_remove(struct virtio_device *vdev)
 	/* Free memory for send and receive queues */
 	free_rq_sq(vi);
 
+	/* Free the page of flow director */
+	if (virtio_has_feature(vdev, VIRTIO_NET_F_HOST_FD)) {
+		if (vi->fd_page)
+			put_page(vi->fd_page);
+
+		virtnet_free_rq_affinity(vi);
+	}
 	free_netdev(vi->dev);
 }
 
@@ -1453,7 +1614,7 @@ static unsigned int features[] = {
 	VIRTIO_NET_F_GUEST_ECN, VIRTIO_NET_F_GUEST_UFO,
 	VIRTIO_NET_F_MRG_RXBUF, VIRTIO_NET_F_STATUS, VIRTIO_NET_F_CTRL_VQ,
 	VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN, VIRTIO_NET_F_MULTIQUEUE,
-	VIRTIO_NET_F_GUEST_RXHASH,
+	VIRTIO_NET_F_GUEST_RXHASH, VIRTIO_NET_F_HOST_FD,
 };
 
 static struct virtio_driver virtio_net_driver = {
diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
index 2291317..abcea52 100644
--- a/include/linux/virtio_net.h
+++ b/include/linux/virtio_net.h
@@ -51,6 +51,7 @@
 #define VIRTIO_NET_F_CTRL_RX_EXTRA 20	/* Extra RX mode control support */
 #define VIRTIO_NET_F_MULTIQUEUE	21	/* Device supports multiple TXQ/RXQ */
 #define VIRTIO_NET_F_GUEST_RXHASH 22    /* Guest can receive rxhash */
+#define VIRTIO_NET_F_HOST_FD    23      /* Host has a flow director */
 
 #define VIRTIO_NET_S_LINK_UP	1	/* Link is up */
 
@@ -63,6 +64,11 @@ struct virtio_net_config {
 	__u16 num_queues;
 } __attribute__((packed));
 
+struct virtio_net_config_fd {
+	struct virtio_net_config cfg;
+	u32 addr;
+} __packed;
+
 /* This is the first element of the scatter-gather list.  If you don't
  * specify GSO, CSUM or HASH features, you can simply ignore the header. */
 struct virtio_net_hdr {

^ permalink raw reply related

* [net-next RFC PATCH 4/5] virtio: introduce a method to get the irq of a specific virtqueue
From: Jason Wang @ 2011-12-05  8:59 UTC (permalink / raw)
  To: krkumar2, kvm, mst, netdev, rusty, virtualization, levinsasha928,
	bhutchings
In-Reply-To: <20111205085603.6116.65101.stgit@dhcp-8-146.nay.redhat.com>

Device specific irq configuration may be need in order to do some
optimization. So a new configuration is needed to get the irq of a
virtqueue.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/lguest/lguest_device.c |    8 ++++++++
 drivers/s390/kvm/kvm_virtio.c  |    6 ++++++
 drivers/virtio/virtio_mmio.c   |    8 ++++++++
 drivers/virtio/virtio_pci.c    |   12 ++++++++++++
 include/linux/virtio_config.h  |    4 ++++
 5 files changed, 38 insertions(+), 0 deletions(-)

diff --git a/drivers/lguest/lguest_device.c b/drivers/lguest/lguest_device.c
index 595d731..6483bff 100644
--- a/drivers/lguest/lguest_device.c
+++ b/drivers/lguest/lguest_device.c
@@ -386,6 +386,13 @@ static const char *lg_bus_name(struct virtio_device *vdev)
 	return "";
 }
 
+static int lg_get_vq_irq(struct virtio_device *vdev, struct virtqueue *vq)
+{
+	struct lguest_vq_info *lvq = vq->priv;
+
+	return lvq->config.irq;
+}
+
 /* The ops structure which hooks everything together. */
 static struct virtio_config_ops lguest_config_ops = {
 	.get_features = lg_get_features,
@@ -398,6 +405,7 @@ static struct virtio_config_ops lguest_config_ops = {
 	.find_vqs = lg_find_vqs,
 	.del_vqs = lg_del_vqs,
 	.bus_name = lg_bus_name,
+	.get_vq_irq = lg_get_vq_irq,
 };
 
 /*
diff --git a/drivers/s390/kvm/kvm_virtio.c b/drivers/s390/kvm/kvm_virtio.c
index 8af868b..a8d5ca1 100644
--- a/drivers/s390/kvm/kvm_virtio.c
+++ b/drivers/s390/kvm/kvm_virtio.c
@@ -268,6 +268,11 @@ static const char *kvm_bus_name(struct virtio_device *vdev)
 	return "";
 }
 
+static int kvm_get_vq_irq(struct virtio_device *vdev, struct virtqueue *vq)
+{
+	return 0x2603;
+}
+
 /*
  * The config ops structure as defined by virtio config
  */
@@ -282,6 +287,7 @@ static struct virtio_config_ops kvm_vq_configspace_ops = {
 	.find_vqs = kvm_find_vqs,
 	.del_vqs = kvm_del_vqs,
 	.bus_name = kvm_bus_name,
+	.get_vq_irq = kvm_get_vq_irq,
 };
 
 /*
diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
index 2f57380..309d471 100644
--- a/drivers/virtio/virtio_mmio.c
+++ b/drivers/virtio/virtio_mmio.c
@@ -368,6 +368,13 @@ static const char *vm_bus_name(struct virtio_device *vdev)
 	return vm_dev->pdev->name;
 }
 
+static int vm_get_vq_irq(struct virtio_device *vdev, struct virtqueue *vq)
+{
+	struct virtio_mmio_device *vm_dev = to_virtio_mmio_device(vdev);
+
+	return platform_get_irq(vm_dev->pdev, 0);
+}
+
 static struct virtio_config_ops virtio_mmio_config_ops = {
 	.get		= vm_get,
 	.set		= vm_set,
@@ -379,6 +386,7 @@ static struct virtio_config_ops virtio_mmio_config_ops = {
 	.get_features	= vm_get_features,
 	.finalize_features = vm_finalize_features,
 	.bus_name	= vm_bus_name,
+	.get_vq_irq     = vm_get_vq_irq,
 };
 
 
diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
index 229ea56..4f99164 100644
--- a/drivers/virtio/virtio_pci.c
+++ b/drivers/virtio/virtio_pci.c
@@ -583,6 +583,17 @@ static const char *vp_bus_name(struct virtio_device *vdev)
 	return pci_name(vp_dev->pci_dev);
 }
 
+static int vp_get_vq_irq(struct virtio_device *vdev, struct virtqueue *vq)
+{
+	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
+	struct virtio_pci_vq_info *info = vq->priv;
+
+	if (vp_dev->intx_enabled)
+		return vp_dev->pci_dev->irq;
+	else
+		return vp_dev->msix_entries[info->msix_vector].vector;
+}
+
 static struct virtio_config_ops virtio_pci_config_ops = {
 	.get		= vp_get,
 	.set		= vp_set,
@@ -594,6 +605,7 @@ static struct virtio_config_ops virtio_pci_config_ops = {
 	.get_features	= vp_get_features,
 	.finalize_features = vp_finalize_features,
 	.bus_name	= vp_bus_name,
+	.get_vq_irq     = vp_get_vq_irq,
 };
 
 static void virtio_pci_release_dev(struct device *_d)
diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h
index 63f98d0..7b783a6 100644
--- a/include/linux/virtio_config.h
+++ b/include/linux/virtio_config.h
@@ -104,6 +104,9 @@
  *	vdev: the virtio_device
  *      This returns a pointer to the bus name a la pci_name from which
  *      the caller can then copy.
+ * @get_vq_irq: get the irq numer of the specific virt queue.
+ *      vdev: the virtio_device
+ *      vq: the virtqueue
  */
 typedef void vq_callback_t(struct virtqueue *);
 struct virtio_config_ops {
@@ -122,6 +125,7 @@ struct virtio_config_ops {
 	u32 (*get_features)(struct virtio_device *vdev);
 	void (*finalize_features)(struct virtio_device *vdev);
 	const char *(*bus_name)(struct virtio_device *vdev);
+	int (*get_vq_irq)(struct virtio_device *vdev, struct virtqueue *vq);
 };
 
 /* If driver didn't advertise the feature, it will never appear. */

^ permalink raw reply related

* [net-next RFC PATCH 3/5] macvtap: flow director support
From: Jason Wang @ 2011-12-05  8:59 UTC (permalink / raw)
  To: krkumar2, kvm, mst, netdev, rusty, virtualization, levinsasha928,
	bhutchings
In-Reply-To: <20111205085603.6116.65101.stgit@dhcp-8-146.nay.redhat.com>

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/macvlan.c      |    4 ++++
 drivers/net/macvtap.c      |   36 ++++++++++++++++++++++++++++++++++--
 include/linux/if_macvlan.h |    1 +
 3 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index 7413497..b0cb7ce 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -706,6 +706,7 @@ int macvlan_common_newlink(struct net *src_net, struct net_device *dev,
 	vlan->port     = port;
 	vlan->receive  = receive;
 	vlan->forward  = forward;
+	vlan->fd_page[0] = NULL;
 
 	vlan->mode     = MACVLAN_MODE_VEPA;
 	if (data && data[IFLA_MACVLAN_MODE])
@@ -749,6 +750,9 @@ void macvlan_dellink(struct net_device *dev, struct list_head *head)
 {
 	struct macvlan_dev *vlan = netdev_priv(dev);
 
+	if (vlan->fd_page[0])
+		put_page(vlan->fd_page[0]);
+
 	list_del(&vlan->list);
 	unregister_netdevice_queue(dev, head);
 }
diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 504c745..a34eb84 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -14,6 +14,7 @@
 #include <linux/wait.h>
 #include <linux/cdev.h>
 #include <linux/fs.h>
+#include <linux/highmem.h>
 
 #include <net/net_namespace.h>
 #include <net/rtnetlink.h>
@@ -62,6 +63,8 @@ static DEFINE_IDR(minor_idr);
 static struct class *macvtap_class;
 static struct cdev macvtap_cdev;
 
+#define TAP_HASH_MASK 0xFF
+
 static const struct proto_ops macvtap_socket_ops;
 
 /*
@@ -189,6 +192,11 @@ static struct macvtap_queue *macvtap_get_queue(struct net_device *dev,
 	/* Check if we can use flow to select a queue */
 	rxq = skb_get_rxhash(skb);
 	if (rxq) {
+		if (vlan->fd_page[0]) {
+			u16 *table = kmap_atomic(vlan->fd_page[0]);
+			rxq = table[rxq & TAP_HASH_MASK];
+			kunmap_atomic(table);
+		}
 		tap = rcu_dereference(vlan->taps[rxq % numvtaps]);
 		if (tap)
 			goto out;
@@ -851,6 +859,7 @@ static long macvtap_ioctl(struct file *file, unsigned int cmd,
 {
 	struct macvtap_queue *q = file->private_data;
 	struct macvlan_dev *vlan;
+	struct tun_fd tfd;
 	void __user *argp = (void __user *)arg;
 	struct ifreq __user *ifr = argp;
 	unsigned int __user *up = argp;
@@ -891,8 +900,8 @@ static long macvtap_ioctl(struct file *file, unsigned int cmd,
 		return ret;
 
 	case TUNGETFEATURES:
-		if (put_user(IFF_TAP | IFF_NO_PI | IFF_VNET_HDR | IFF_RXHASH,
-			     up))
+		if (put_user(IFF_TAP | IFF_NO_PI | IFF_VNET_HDR | IFF_RXHASH |
+			     IFF_FD, up))
 			return -EFAULT;
 		return 0;
 
@@ -918,6 +927,29 @@ static long macvtap_ioctl(struct file *file, unsigned int cmd,
 		q->vnet_hdr_sz = s;
 		return 0;
 
+	case TUNSETFD:
+		rcu_read_lock_bh();
+		vlan = rcu_dereference(q->vlan);
+		if (!vlan)
+			ret = -ENOLINK;
+		else {
+			if (copy_from_user(&tfd, argp, sizeof(tfd)))
+				ret = -EFAULT;
+			if (vlan->fd_page[0]) {
+				put_page(vlan->fd_page[0]);
+				vlan->fd_page[0] = NULL;
+			}
+
+			/* put_page() in macvlan_dellink() */
+			if (get_user_pages_fast(tfd.addr, 1, 0,
+						&vlan->fd_page[0]) != 1)
+				ret = -EFAULT;
+			else
+				ret = 0;
+		}
+		rcu_read_unlock_bh();
+		return ret;
+
 	case TUNSETOFFLOAD:
 		/* let the user check for future flags */
 		if (arg & ~(TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6 |
diff --git a/include/linux/if_macvlan.h b/include/linux/if_macvlan.h
index d103dca..69a87a1 100644
--- a/include/linux/if_macvlan.h
+++ b/include/linux/if_macvlan.h
@@ -65,6 +65,7 @@ struct macvlan_dev {
 	struct macvtap_queue	*taps[MAX_MACVTAP_QUEUES];
 	int			numvtaps;
 	int			minor;
+	struct page             *fd_page[1];
 };
 
 static inline void macvlan_count_rx(const struct macvlan_dev *vlan,

^ permalink raw reply related

* [net-next RFC PATCH 2/5] tuntap: simple flow director support
From: Jason Wang @ 2011-12-05  8:58 UTC (permalink / raw)
  To: krkumar2, kvm, mst, netdev, rusty, virtualization, levinsasha928,
	bhutchings
In-Reply-To: <20111205085603.6116.65101.stgit@dhcp-8-146.nay.redhat.com>

This patch adds a simple flow director to tun/tap device. It is just a
page that contains the hash to queue mapping which could be changed by
user-space. The backend (tap/macvtap) would query this table to get
the desired queue of a packets when it send packets to userspace.

The page address were set through a new kind of ioctl - TUNSETFD and
were pinned until device exit or another new page were specified.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/tun.c      |   63 ++++++++++++++++++++++++++++++++++++++++--------
 include/linux/if_tun.h |   10 ++++++++
 2 files changed, 62 insertions(+), 11 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 7d22b4b..2efaf81 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -64,6 +64,7 @@
 #include <linux/nsproxy.h>
 #include <linux/virtio_net.h>
 #include <linux/rcupdate.h>
+#include <linux/highmem.h>
 #include <net/net_namespace.h>
 #include <net/netns/generic.h>
 #include <net/rtnetlink.h>
@@ -109,6 +110,7 @@ struct tap_filter {
 };
 
 #define MAX_TAP_QUEUES (NR_CPUS < 16 ? NR_CPUS : 16)
+#define TAP_HASH_MASK  0xFF
 
 struct tun_file {
 	struct sock sk;
@@ -128,6 +130,7 @@ struct tun_sock;
 
 struct tun_struct {
 	struct tun_file		*tfiles[MAX_TAP_QUEUES];
+	struct page             *fd_page[1];
 	unsigned int            numqueues;
 	unsigned int 		flags;
 	uid_t			owner;
@@ -156,7 +159,7 @@ static struct tun_file *tun_get_queue(struct net_device *dev,
 	struct tun_struct *tun = netdev_priv(dev);
 	struct tun_file *tfile = NULL;
 	int numqueues = tun->numqueues;
-	__u32 rxq;
+	__u32 rxq, rxhash;
 
 	BUG_ON(!rcu_read_lock_held());
 
@@ -168,6 +171,22 @@ static struct tun_file *tun_get_queue(struct net_device *dev,
 		goto out;
 	}
 
+	rxhash = skb_get_rxhash(skb);
+	if (rxhash) {
+		if (tun->fd_page[0]) {
+			u16 *table = kmap_atomic(tun->fd_page[0]);
+			rxq = table[rxhash & TAP_HASH_MASK];
+			kunmap_atomic(table);
+			if (rxq < numqueues) {
+				tfile = rcu_dereference(tun->tfiles[rxq]);
+				goto out;
+			}
+		}
+		rxq = ((u64)rxhash * numqueues) >> 32;
+		tfile = rcu_dereference(tun->tfiles[rxq]);
+		goto out;
+	}
+
 	if (likely(skb_rx_queue_recorded(skb))) {
 		rxq = skb_get_rx_queue(skb);
 
@@ -178,14 +197,6 @@ static struct tun_file *tun_get_queue(struct net_device *dev,
 		goto out;
 	}
 
-	/* Check if we can use flow to select a queue */
-	rxq = skb_get_rxhash(skb);
-	if (rxq) {
-		u32 idx = ((u64)rxq * numqueues) >> 32;
-		tfile = rcu_dereference(tun->tfiles[idx]);
-		goto out;
-	}
-
 	tfile = rcu_dereference(tun->tfiles[0]);
 out:
 	return tfile;
@@ -1020,6 +1031,14 @@ out:
 	return ret;
 }
 
+static void tun_destructor(struct net_device *dev)
+{
+	struct tun_struct *tun = netdev_priv(dev);
+	if (tun->fd_page[0])
+		put_page(tun->fd_page[0]);
+	free_netdev(dev);
+}
+
 static void tun_setup(struct net_device *dev)
 {
 	struct tun_struct *tun = netdev_priv(dev);
@@ -1028,7 +1047,7 @@ static void tun_setup(struct net_device *dev)
 	tun->group = -1;
 
 	dev->ethtool_ops = &tun_ethtool_ops;
-	dev->destructor = free_netdev;
+	dev->destructor = tun_destructor;
 }
 
 /* Trivial set of netlink ops to allow deleting tun or tap
@@ -1230,6 +1249,7 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
 		tun = netdev_priv(dev);
 		tun->dev = dev;
 		tun->flags = flags;
+		tun->fd_page[0] = NULL;
 
 		security_tun_dev_post_create(&tfile->sk);
 
@@ -1353,6 +1373,7 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
 	struct net_device *dev = NULL;
 	void __user* argp = (void __user*)arg;
 	struct ifreq ifr;
+	struct tun_fd tfd;
 	int ret;
 
 	if (cmd == TUNSETIFF || cmd == TUNATTACHQUEUE || _IOC_TYPE(cmd) == 0x89)
@@ -1364,7 +1385,8 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
 		 * This is needed because we never checked for invalid flags on
 		 * TUNSETIFF. */
 		return put_user(IFF_TUN | IFF_TAP | IFF_NO_PI | IFF_ONE_QUEUE |
-				IFF_VNET_HDR | IFF_MULTI_QUEUE | IFF_RXHASH,
+				IFF_VNET_HDR | IFF_MULTI_QUEUE | IFF_RXHASH |
+				IFF_FD,
 				(unsigned int __user*)argp);
 	}
 
@@ -1476,6 +1498,25 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
 		ret = set_offload(tun, arg);
 		break;
 
+	case TUNSETFD:
+		if (copy_from_user(&tfd, argp, sizeof(tfd)))
+			ret = -EFAULT;
+		else {
+			if (tun->fd_page[0]) {
+				put_page(tun->fd_page[0]);
+				tun->fd_page[0] = NULL;
+			}
+
+			/* put_page() in tun_destructor() */
+			if (get_user_pages_fast(tfd.addr, 1, 0,
+						&tun->fd_page[0]) != 1)
+				ret = -EFAULT;
+			else
+				ret = 0;
+		}
+
+		break;
+
 	case SIOCGIFHWADDR:
 		/* Get hw address */
 		memcpy(ifr.ifr_hwaddr.sa_data, tun->dev->dev_addr, ETH_ALEN);
diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h
index a1f6f3f..726731d 100644
--- a/include/linux/if_tun.h
+++ b/include/linux/if_tun.h
@@ -36,6 +36,8 @@
 #define TUN_VNET_HDR 	0x0200
 #define TUN_TAP_MQ      0x0400
 
+struct tun_fd;
+
 /* Ioctl defines */
 #define TUNSETNOCSUM  _IOW('T', 200, int) 
 #define TUNSETDEBUG   _IOW('T', 201, int) 
@@ -56,6 +58,7 @@
 #define TUNSETVNETHDRSZ _IOW('T', 216, int)
 #define TUNATTACHQUEUE  _IOW('T', 217, int)
 #define TUNDETACHQUEUE  _IOW('T', 218, int)
+#define TUNSETFD        _IOW('T', 219, struct tun_fd)
 
 
 /* TUNSETIFF ifr flags */
@@ -67,6 +70,7 @@
 #define IFF_TUN_EXCL	0x8000
 #define IFF_MULTI_QUEUE 0x0100
 #define IFF_RXHASH      0x0200
+#define IFF_FD          0x0400
 
 /* Features for GSO (TUNSETOFFLOAD). */
 #define TUN_F_CSUM	0x01	/* You can hand me unchecksummed packets. */
@@ -97,6 +101,12 @@ struct tun_filter {
 	__u8   addr[0][ETH_ALEN];
 };
 
+/* Programmable flow director */
+struct tun_fd {
+	unsigned long addr;
+	size_t size;
+};
+
 #ifdef __KERNEL__
 #if defined(CONFIG_TUN) || defined(CONFIG_TUN_MODULE)
 struct socket *tun_get_socket(struct file *);

^ permalink raw reply related

* [net-next RFC PATCH 1/5] virtio_net: passing rxhash through vnet_hdr
From: Jason Wang @ 2011-12-05  8:58 UTC (permalink / raw)
  To: krkumar2, kvm, mst, netdev, rusty, virtualization, levinsasha928,
	bhutchings
In-Reply-To: <20111205085603.6116.65101.stgit@dhcp-8-146.nay.redhat.com>

This patch enables the ability to pass the rxhash value to guest
through vnet_hdr. This is useful for guest when it wants to cooperate
with virtual device to steer a flow to dedicated guest cpu.

This feature is negotiated through VIRTIO_NET_F_GUEST_RXHASH.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/macvtap.c      |   10 ++++++----
 drivers/net/tun.c          |   44 +++++++++++++++++++++++++-------------------
 drivers/net/virtio_net.c   |   26 ++++++++++++++++++++++----
 drivers/vhost/net.c        |   10 +++++++---
 drivers/vhost/vhost.h      |    5 +++--
 include/linux/if_tun.h     |    1 +
 include/linux/virtio_net.h |   10 +++++++++-
 7 files changed, 73 insertions(+), 33 deletions(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 7c88d13..504c745 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -760,16 +760,17 @@ static ssize_t macvtap_put_user(struct macvtap_queue *q,
 	int vnet_hdr_len = 0;
 
 	if (q->flags & IFF_VNET_HDR) {
-		struct virtio_net_hdr vnet_hdr;
+		struct virtio_net_hdr_rxhash vnet_hdr;
 		vnet_hdr_len = q->vnet_hdr_sz;
 		if ((len -= vnet_hdr_len) < 0)
 			return -EINVAL;
 
-		ret = macvtap_skb_to_vnet_hdr(skb, &vnet_hdr);
+		ret = macvtap_skb_to_vnet_hdr(skb, &vnet_hdr.hdr.hdr);
 		if (ret)
 			return ret;
 
-		if (memcpy_toiovecend(iv, (void *)&vnet_hdr, 0, sizeof(vnet_hdr)))
+		vnet_hdr.rxhash = skb->rxhash;
+		if (memcpy_toiovecend(iv, (void *)&vnet_hdr, 0, q->vnet_hdr_sz))
 			return -EFAULT;
 	}
 
@@ -890,7 +891,8 @@ static long macvtap_ioctl(struct file *file, unsigned int cmd,
 		return ret;
 
 	case TUNGETFEATURES:
-		if (put_user(IFF_TAP | IFF_NO_PI | IFF_VNET_HDR, up))
+		if (put_user(IFF_TAP | IFF_NO_PI | IFF_VNET_HDR | IFF_RXHASH,
+			     up))
 			return -EFAULT;
 		return 0;
 
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index afb11d1..7d22b4b 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -869,49 +869,55 @@ static ssize_t tun_put_user(struct tun_file *tfile,
 	}
 
 	if (tfile->flags & TUN_VNET_HDR) {
-		struct virtio_net_hdr gso = { 0 }; /* no info leak */
-		if ((len -= tfile->vnet_hdr_sz) < 0)
+		struct virtio_net_hdr_rxhash hdr;
+		struct virtio_net_hdr *gso = (struct virtio_net_hdr *)&hdr;
+
+		if ((len -= tfile->vnet_hdr_sz) < 0 ||
+		    tfile->vnet_hdr_sz > sizeof(struct virtio_net_hdr_rxhash))
 			return -EINVAL;
 
+		memset(&hdr, 0, sizeof(hdr));
 		if (skb_is_gso(skb)) {
 			struct skb_shared_info *sinfo = skb_shinfo(skb);
 
 			/* This is a hint as to how much should be linear. */
-			gso.hdr_len = skb_headlen(skb);
-			gso.gso_size = sinfo->gso_size;
+			gso->hdr_len = skb_headlen(skb);
+			gso->gso_size = sinfo->gso_size;
 			if (sinfo->gso_type & SKB_GSO_TCPV4)
-				gso.gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
+				gso->gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
 			else if (sinfo->gso_type & SKB_GSO_TCPV6)
-				gso.gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
+				gso->gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
 			else if (sinfo->gso_type & SKB_GSO_UDP)
-				gso.gso_type = VIRTIO_NET_HDR_GSO_UDP;
+				gso->gso_type = VIRTIO_NET_HDR_GSO_UDP;
 			else {
 				pr_err("unexpected GSO type: "
 				       "0x%x, gso_size %d, hdr_len %d\n",
-				       sinfo->gso_type, gso.gso_size,
-				       gso.hdr_len);
+				       sinfo->gso_type, gso->gso_size,
+				       gso->hdr_len);
 				print_hex_dump(KERN_ERR, "tun: ",
 					       DUMP_PREFIX_NONE,
 					       16, 1, skb->head,
-					       min((int)gso.hdr_len, 64), true);
+					       min((int)gso->hdr_len, 64),
+					       true);
 				WARN_ON_ONCE(1);
 				return -EINVAL;
 			}
 			if (sinfo->gso_type & SKB_GSO_TCP_ECN)
-				gso.gso_type |= VIRTIO_NET_HDR_GSO_ECN;
+				gso->gso_type |= VIRTIO_NET_HDR_GSO_ECN;
 		} else
-			gso.gso_type = VIRTIO_NET_HDR_GSO_NONE;
+			gso->gso_type = VIRTIO_NET_HDR_GSO_NONE;
 
 		if (skb->ip_summed == CHECKSUM_PARTIAL) {
-			gso.flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
-			gso.csum_start = skb_checksum_start_offset(skb);
-			gso.csum_offset = skb->csum_offset;
+			gso->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
+			gso->csum_start = skb_checksum_start_offset(skb);
+			gso->csum_offset = skb->csum_offset;
 		} else if (skb->ip_summed == CHECKSUM_UNNECESSARY) {
-			gso.flags = VIRTIO_NET_HDR_F_DATA_VALID;
+			gso->flags = VIRTIO_NET_HDR_F_DATA_VALID;
 		} /* else everything is zero */
 
-		if (unlikely(memcpy_toiovecend(iv, (void *)&gso, total,
-					       sizeof(gso))))
+		hdr.rxhash = skb_get_rxhash(skb);
+		if (unlikely(memcpy_toiovecend(iv, (void *)&hdr, total,
+					       tfile->vnet_hdr_sz)))
 			return -EFAULT;
 		total += tfile->vnet_hdr_sz;
 	}
@@ -1358,7 +1364,7 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
 		 * This is needed because we never checked for invalid flags on
 		 * TUNSETIFF. */
 		return put_user(IFF_TUN | IFF_TAP | IFF_NO_PI | IFF_ONE_QUEUE |
-				IFF_VNET_HDR | IFF_MULTI_QUEUE,
+				IFF_VNET_HDR | IFF_MULTI_QUEUE | IFF_RXHASH,
 				(unsigned int __user*)argp);
 	}
 
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 157ee63..0d871f8 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -107,12 +107,16 @@ struct virtnet_info {
 
 	/* Host will merge rx buffers for big packets (shake it! shake it!) */
 	bool mergeable_rx_bufs;
+
+	/* Host will pass rxhash to us. */
+	bool has_rxhash;
 };
 
 struct skb_vnet_hdr {
 	union {
 		struct virtio_net_hdr hdr;
 		struct virtio_net_hdr_mrg_rxbuf mhdr;
+		struct virtio_net_hdr_rxhash rhdr;
 	};
 	unsigned int num_sg;
 };
@@ -205,7 +209,10 @@ static struct sk_buff *page_to_skb(struct receive_queue *rq,
 	hdr = skb_vnet_hdr(skb);
 
 	if (vi->mergeable_rx_bufs) {
-		hdr_len = sizeof hdr->mhdr;
+		if (vi->has_rxhash)
+			hdr_len = sizeof hdr->rhdr;
+		else
+			hdr_len = sizeof hdr->mhdr;
 		offset = hdr_len;
 	} else {
 		hdr_len = sizeof hdr->hdr;
@@ -376,6 +383,9 @@ static void receive_buf(struct receive_queue *rq, void *buf, unsigned int len)
 		skb_shinfo(skb)->gso_segs = 0;
 	}
 
+	if (vi->has_rxhash)
+		skb->rxhash = hdr->rhdr.rxhash;
+
 	netif_receive_skb(skb);
 	return;
 
@@ -645,9 +655,12 @@ static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb,
 	hdr->mhdr.num_buffers = 0;
 
 	/* Encode metadata header at front. */
-	if (vi->mergeable_rx_bufs)
-		sg_set_buf(sg, &hdr->mhdr, sizeof hdr->mhdr);
-	else
+	if (vi->mergeable_rx_bufs) {
+		if (vi->has_rxhash)
+			sg_set_buf(sg, &hdr->rhdr, sizeof hdr->rhdr);
+		else
+			sg_set_buf(sg, &hdr->mhdr, sizeof hdr->mhdr);
+	} else
 		sg_set_buf(sg, &hdr->hdr, sizeof hdr->hdr);
 
 	hdr->num_sg = skb_to_sgvec(skb, sg + 1, 0, skb->len) + 1;
@@ -1338,8 +1351,12 @@ static int virtnet_probe(struct virtio_device *vdev)
 	if (virtio_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
 		vi->mergeable_rx_bufs = true;
 
+	if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_RXHASH))
+		vi->has_rxhash = true;
+
 	/* Allocate/initialize the rx/tx queues, and invoke find_vqs */
 	err = virtnet_setup_vqs(vi);
+
 	if (err)
 		goto free_netdev;
 
@@ -1436,6 +1453,7 @@ static unsigned int features[] = {
 	VIRTIO_NET_F_GUEST_ECN, VIRTIO_NET_F_GUEST_UFO,
 	VIRTIO_NET_F_MRG_RXBUF, VIRTIO_NET_F_STATUS, VIRTIO_NET_F_CTRL_VQ,
 	VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN, VIRTIO_NET_F_MULTIQUEUE,
+	VIRTIO_NET_F_GUEST_RXHASH,
 };
 
 static struct virtio_driver virtio_net_driver = {
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 882a51f..b2d6548 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -768,9 +768,13 @@ static int vhost_net_set_features(struct vhost_net *n, u64 features)
 	size_t vhost_hlen, sock_hlen, hdr_len;
 	int i;
 
-	hdr_len = (features & (1 << VIRTIO_NET_F_MRG_RXBUF)) ?
-			sizeof(struct virtio_net_hdr_mrg_rxbuf) :
-			sizeof(struct virtio_net_hdr);
+	if (features & (1 << VIRTIO_NET_F_MRG_RXBUF))
+		hdr_len = (features & (1 << VIRTIO_NET_F_GUEST_RXHASH)) ?
+			sizeof(struct virtio_net_hdr_rxhash) :
+			sizeof(struct virtio_net_hdr_mrg_rxbuf);
+	else
+		hdr_len = sizeof(struct virtio_net_hdr);
+
 	if (features & (1 << VHOST_NET_F_VIRTIO_NET_HDR)) {
 		/* vhost provides vnet_hdr */
 		vhost_hlen = hdr_len;
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index a801e28..4ad2d5f 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -115,7 +115,7 @@ struct vhost_virtqueue {
 	/* hdr is used to store the virtio header.
 	 * Since each iovec has >= 1 byte length, we never need more than
 	 * header length entries to store the header. */
-	struct iovec hdr[sizeof(struct virtio_net_hdr_mrg_rxbuf)];
+	struct iovec hdr[sizeof(struct virtio_net_hdr_rxhash)];
 	struct iovec *indirect;
 	size_t vhost_hlen;
 	size_t sock_hlen;
@@ -203,7 +203,8 @@ enum {
 			 (1ULL << VIRTIO_RING_F_EVENT_IDX) |
 			 (1ULL << VHOST_F_LOG_ALL) |
 			 (1ULL << VHOST_NET_F_VIRTIO_NET_HDR) |
-			 (1ULL << VIRTIO_NET_F_MRG_RXBUF),
+			 (1ULL << VIRTIO_NET_F_MRG_RXBUF) |
+			 (1ULL << VIRTIO_NET_F_GUEST_RXHASH) ,
 };
 
 static inline int vhost_has_feature(struct vhost_dev *dev, int bit)
diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h
index d3f24d8..a1f6f3f 100644
--- a/include/linux/if_tun.h
+++ b/include/linux/if_tun.h
@@ -66,6 +66,7 @@
 #define IFF_VNET_HDR	0x4000
 #define IFF_TUN_EXCL	0x8000
 #define IFF_MULTI_QUEUE 0x0100
+#define IFF_RXHASH      0x0200
 
 /* Features for GSO (TUNSETOFFLOAD). */
 #define TUN_F_CSUM	0x01	/* You can hand me unchecksummed packets. */
diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
index c92b83f..2291317 100644
--- a/include/linux/virtio_net.h
+++ b/include/linux/virtio_net.h
@@ -50,6 +50,7 @@
 #define VIRTIO_NET_F_CTRL_VLAN	19	/* Control channel VLAN filtering */
 #define VIRTIO_NET_F_CTRL_RX_EXTRA 20	/* Extra RX mode control support */
 #define VIRTIO_NET_F_MULTIQUEUE	21	/* Device supports multiple TXQ/RXQ */
+#define VIRTIO_NET_F_GUEST_RXHASH 22    /* Guest can receive rxhash */
 
 #define VIRTIO_NET_S_LINK_UP	1	/* Link is up */
 
@@ -63,7 +64,7 @@ struct virtio_net_config {
 } __attribute__((packed));
 
 /* This is the first element of the scatter-gather list.  If you don't
- * specify GSO or CSUM features, you can simply ignore the header. */
+ * specify GSO, CSUM or HASH features, you can simply ignore the header. */
 struct virtio_net_hdr {
 #define VIRTIO_NET_HDR_F_NEEDS_CSUM	1	// Use csum_start, csum_offset
 #define VIRTIO_NET_HDR_F_DATA_VALID	2	// Csum is valid
@@ -87,6 +88,13 @@ struct virtio_net_hdr_mrg_rxbuf {
 	__u16 num_buffers;	/* Number of merged rx buffers */
 };
 
+/* This is the version of the header to use when GUEST_RXHASH
+ * feature has been negotiated. */
+struct virtio_net_hdr_rxhash {
+	struct virtio_net_hdr_mrg_rxbuf hdr;
+	__u32 rxhash;
+};
+
 /*
  * Control virtqueue data structures
  *

^ permalink raw reply related

* [net-next RFC PATCH 0/5] Series short description
From: Jason Wang @ 2011-12-05  8:58 UTC (permalink / raw)
  To: krkumar2, kvm, mst, netdev, rusty, virtualization, levinsasha928,
	bhutchings

multiple queue virtio-net: flow steering through host/guest cooperation

Hello all:

This is a rough series adds the guest/host cooperation of flow
steering support based on Krish Kumar's multiple queue virtio-net
driver patch 3/3 (http://lwn.net/Articles/467283/).

This idea is simple, the backend pass the rxhash to the guest and
guest would tell the backend the hash to queue mapping when necessary
then backend can choose the queue based on the hash value of the
packet.  The table is just a page shared bettwen userspace and the
backend.

Patch 1 enable the ability to pass the rxhash through vnet_hdr to
guest.
Patch 2,3 implement a very simple flow director for tap and
mavtap. tap part is based on the multiqueue tap patches posted by me
(http://lwn.net/Articles/459270/).
Patch 4 implement a method for virtio device to find the irq of a
specific virtqueue, in order to do device specific interrupt
optimization
Patch 5 is the part of the guest driver that using accelerate rfs to
program the flow director and with some optimizations on irq affinity
and tx queue selection.

This is just a prototype that demonstrates the idea, there are still
things need to be discussed:

- An alternative idea instead of shared page is ctrl vq, the reason
  that a shared table is preferable is the delay of ctrl vq itself.
- Optimization on irq affinity and tx queue selection

Comments are welcomed, thanks!

---

Jason Wang (5):
      virtio_net: passing rxhash through vnet_hdr
      tuntap: simple flow director support
      macvtap: flow director support
      virtio: introduce a method to get the irq of a specific virtqueue
      virtio-net: flow director support


 drivers/lguest/lguest_device.c |    8 ++
 drivers/net/macvlan.c          |    4 +
 drivers/net/macvtap.c          |   42 ++++++++-
 drivers/net/tun.c              |  105 ++++++++++++++++------
 drivers/net/virtio_net.c       |  189 +++++++++++++++++++++++++++++++++++++++-
 drivers/s390/kvm/kvm_virtio.c  |    6 +
 drivers/vhost/net.c            |   10 +-
 drivers/vhost/vhost.h          |    5 +
 drivers/virtio/virtio_mmio.c   |    8 ++
 drivers/virtio/virtio_pci.c    |   12 +++
 include/linux/if_macvlan.h     |    1 
 include/linux/if_tun.h         |   11 ++
 include/linux/virtio_config.h  |    4 +
 include/linux/virtio_net.h     |   16 +++
 14 files changed, 377 insertions(+), 44 deletions(-)

-- 
Signature

^ permalink raw reply

* Re: [PATCH] virtio-ring: Use threshold for switching to indirect descriptors
From: Rusty Russell @ 2011-12-05  0:10 UTC (permalink / raw)
  To: Avi Kivity, Michael S. Tsirkin
  Cc: markmc, virtualization, Sasha Levin, kvm, linux-kernel
In-Reply-To: <4EDB8EEB.4070309@redhat.com>

On Sun, 04 Dec 2011 17:16:59 +0200, Avi Kivity <avi@redhat.com> wrote:
> On 12/04/2011 05:11 PM, Michael S. Tsirkin wrote:
> > > There's also the used ring, but that's a
> > > mistake if you have out of order completion.  We should have used copying.
> >
> > Seems unrelated... unless you want used to be written into
> > descriptor ring itself?
> 
> The avail/used rings are in addition to the regular ring, no?  If you
> copy descriptors, then it goes away.

There were two ideas which drove the current design:

1) The Van-Jacobson style "no two writers to same cacheline makes rings
   fast" idea.  Empirically, this doesn't show any winnage.
2) Allowing a generic inter-guest copy mechanism, so we could have
   genuinely untrusted driver domains.  Yet noone ever did this so it's
   hardly a killer feature :(

So if we're going to revisit and drop those requirements, I'd say:

1) Shared device/driver rings like Xen.  Xen uses device-specific ring
   contents, I'd be tempted to stick to our pre-headers, and a 'u64
   addr; u64 len_and_flags; u64 cookie;' generic style.  Then use
   the same ring for responses.  That's a slight space-win, since
   we're 24 bytes vs 26 bytes now.

2) Stick with physically-contiguous rings, but use them of size (2^n)-1.
   Makes the indexing harder, but that -1 lets us stash the indices in
   the first entry and makes the ring a nice 2^n size.

> > > 16kB worth of descriptors is 1024 entries.  With 4kB buffers, that's 4MB
> > > worth of data, or 4 ms at 10GbE line speed.  With 1500 byte buffers it's
> > > just 1.5 ms.  In any case I think it's sufficient.
> >
> > Right. So I think that without indirect, we waste about 3 entries
> > per packet for virtio header and transport etc headers.
> 
> That does suck.  Are there issues in increasing the ring size?  Or
> making it discontiguous?

Because the qemu implementation is broken.  We can often put the virtio
header at the head of the packet.  In practice, the qemu implementation
insists the header be a single descriptor.

(At least, it used to, perhaps it has now been fixed.  We need a
VIRTIO_NET_F_I_NOW_CONFORM_TO_THE_DAMN_SPEC_SORRY_I_SUCK bit).

We currently use small rings: the guest can't negotiate so qemu has to
offer a lowest-common-denominator value.  The new virtio-pci layout
fixes this, and lets the guest set the ring size.

> Can you take a peek at how Xen manages its rings?  They have the same
> problems we do.

Yes, I made some mistakes, but I did steal from them in the first
place...

Cheers,
Rusty.

^ permalink raw reply

* Re: [PATCH] virtio-ring: Use threshold for switching to indirect descriptors
From: Sasha Levin @ 2011-12-04 18:23 UTC (permalink / raw)
  To: Avi Kivity; +Cc: markmc, kvm, Michael S. Tsirkin, linux-kernel, virtualization
In-Reply-To: <1323020374.3256.5.camel@lappy>

On Sun, 2011-12-04 at 19:39 +0200, Sasha Levin wrote:
> On Sun, 2011-12-04 at 19:37 +0200, Avi Kivity wrote:
> > On 12/04/2011 07:34 PM, Sasha Levin wrote:
> > > > 
> > > > I'm confused. didn't you see a bigger benefit for guest->host by
> > > > switching indirect off?
> > >
> > > The 5% improvement is over the 'regular' indirect on, not over indirect
> > > off. Sorry for the confusion there.
> > >
> > > I suggested this change regardless of the outcome of indirect descriptor
> > > threshold discussion, since it would help anyways.
> > 
> > For net, this makes sense.  For block, it reduces the effective queue
> > depth, so it's not a trivial change.  It probably makes sense there too,
> > though.
> 
> It doesn't have to be limited at that number, anything above that can go
> through the regular kmalloc() path.

Something like the following patch:

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index c7a2c20..3166ca0 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -82,6 +82,7 @@ struct vring_virtqueue
 
 	/* Host supports indirect buffers */
 	bool indirect;
+	struct kmem_cache *indirect_cache;
 
 	/* Host publishes avail event idx */
 	bool event;
@@ -110,6 +111,9 @@ struct vring_virtqueue
 
 #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq)
 
+static unsigned int ind_alloc_thresh = 0;
+module_param(ind_alloc_thresh, uint, S_IRUGO);
+
 /* Set up an indirect table of descriptors and add it to the queue. */
 static int vring_add_indirect(struct vring_virtqueue *vq,
 			      struct scatterlist sg[],
@@ -121,7 +125,10 @@ static int vring_add_indirect(struct vring_virtqueue *vq,
 	unsigned head;
 	int i;
 
-	desc = kmalloc((out + in) * sizeof(struct vring_desc), gfp);
+	if ((out + in) <= ind_alloc_thresh)
+		desc = kmem_cache_alloc(vq->indirect_cache, gfp);
+	else
+		desc = kmalloc((out + in) * sizeof(struct vring_desc), gfp);
 	if (!desc)
 		return -ENOMEM;
 
@@ -479,6 +486,9 @@ struct virtqueue *vring_new_virtqueue(unsigned int num,
 	vq->broken = false;
 	vq->last_used_idx = 0;
 	vq->num_added = 0;
+	if (ind_alloc_thresh)
+		vq->indirect_cache = KMEM_CACHE(vring_desc[ind_alloc_thresh], 0);
 	list_add_tail(&vq->vq.list, &vdev->vqs);
 #ifdef DEBUG
 	vq->in_use = false;

-- 

Sasha.

^ permalink raw reply related

* Re: [PATCH RFC V3 2/4] kvm hypervisor : Add a hypercall to KVM hypervisor to support pv-ticketlocks
From: Raghavendra K T @ 2011-12-04 18:06 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Jeremy Fitzhardinge, x86, Peter Zijlstra, Virtualization,
	H. Peter Anvin, Stefano Stabellini, Xen, Dave Jiang, KVM,
	Raghavendra K T, Ingo Molnar, Rik van Riel, Konrad Rzeszutek Wilk,
	Srivatsa Vaddagiri, Jeremy Fitzhardinge, Sasha Levin, Sedat Dilek,
	Thomas Gleixner, Yinghai Lu, Greg Kroah-Hartman, LKML,
	Dave Hansen, Suzuki
In-Reply-To: <4ED7DA85.5080504@linux.vnet.ibm.com>

On 12/02/2011 01:20 AM, Raghavendra K T wrote:
>> Have you tested it on AMD machines? There are some differences in the
>> hypercall infrastructure there.
>
> Yes. 'll test on AMD machine and update on that.
>

I tested the code on 64 bit Dual-Core AMD Opteron machine, and it is 
working.

- Raghu

^ permalink raw reply

* Re: [PATCH] virtio-ring: Use threshold for switching to indirect descriptors
From: Sasha Levin @ 2011-12-04 17:39 UTC (permalink / raw)
  To: Avi Kivity; +Cc: markmc, kvm, Michael S. Tsirkin, linux-kernel, virtualization
In-Reply-To: <4EDBAFC5.2010405@redhat.com>

On Sun, 2011-12-04 at 19:37 +0200, Avi Kivity wrote:
> On 12/04/2011 07:34 PM, Sasha Levin wrote:
> > > 
> > > I'm confused. didn't you see a bigger benefit for guest->host by
> > > switching indirect off?
> >
> > The 5% improvement is over the 'regular' indirect on, not over indirect
> > off. Sorry for the confusion there.
> >
> > I suggested this change regardless of the outcome of indirect descriptor
> > threshold discussion, since it would help anyways.
> 
> For net, this makes sense.  For block, it reduces the effective queue
> depth, so it's not a trivial change.  It probably makes sense there too,
> though.

It doesn't have to be limited at that number, anything above that can go
through the regular kmalloc() path.

-- 

Sasha.

^ permalink raw reply

* Re: [PATCH] virtio-ring: Use threshold for switching to indirect descriptors
From: Avi Kivity @ 2011-12-04 17:37 UTC (permalink / raw)
  To: Sasha Levin; +Cc: markmc, kvm, Michael S. Tsirkin, linux-kernel, virtualization
In-Reply-To: <1323020088.3256.3.camel@lappy>

On 12/04/2011 07:34 PM, Sasha Levin wrote:
> > 
> > I'm confused. didn't you see a bigger benefit for guest->host by
> > switching indirect off?
>
> The 5% improvement is over the 'regular' indirect on, not over indirect
> off. Sorry for the confusion there.
>
> I suggested this change regardless of the outcome of indirect descriptor
> threshold discussion, since it would help anyways.

For net, this makes sense.  For block, it reduces the effective queue
depth, so it's not a trivial change.  It probably makes sense there too,
though.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply

* Re: [PATCH] virtio-ring: Use threshold for switching to indirect descriptors
From: Sasha Levin @ 2011-12-04 17:34 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: markmc, kvm, linux-kernel, virtualization, Avi Kivity
In-Reply-To: <20111204162221.GB22501@redhat.com>

On Sun, 2011-12-04 at 18:22 +0200, Michael S. Tsirkin wrote:
> On Sun, Dec 04, 2011 at 02:13:51PM +0200, Sasha Levin wrote:
> > On Sun, 2011-12-04 at 13:52 +0200, Avi Kivity wrote:
> > > On 12/03/2011 01:50 PM, Sasha Levin wrote:
> > > > On Fri, 2011-12-02 at 11:16 +1030, Rusty Russell wrote:
> > > > > On Thu, 1 Dec 2011 12:26:42 +0200, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > > > On Thu, Dec 01, 2011 at 10:09:37AM +0200, Sasha Levin wrote:
> > > > > > > On Thu, 2011-12-01 at 09:58 +0200, Michael S. Tsirkin wrote:
> > > > > > > > We'll presumably need some logic to increment is back,
> > > > > > > > to account for random workload changes.
> > > > > > > > Something like slow start?
> > > > > > > 
> > > > > > > We can increment it each time the queue was less than 10% full, it
> > > > > > > should act like slow start, no?
> > > > > > 
> > > > > > No, we really shouldn't get an empty ring as long as things behave
> > > > > > well. What I meant is something like:
> > > > > 
> > > > > I was thinking of the network output case, but you're right.  We need to
> > > > > distinguish between usually full (eg. virtio-net input) and usually
> > > > > empty (eg. virtio-net output).
> > > > > 
> > > > > The signal for "we to pack more into the ring" is different.  We could
> > > > > use some hacky heuristic like "out == 0" but I'd rather make it explicit
> > > > > when we set up the virtqueue.
> > > > > 
> > > > > Our other alternative, moving the logic to the driver, is worse.
> > > > > 
> > > > > As to fading the effect over time, that's harder.  We have to deplete
> > > > > the ring quite a few times before it turns into always-indirect.  We
> > > > > could back off every time the ring is totally idle, but that may hurt
> > > > > bursty traffic.  Let's try simple first?
> > > >
> > > > I tried to take a different approach, and tried putting the indirect
> > > > descriptors in a kmem_cache as Michael suggested. The benchmarks showed
> > > > that this way virtio-net actually worked faster with indirect on even in
> > > > a single stream.
> > > 
> > > How much better?
> > 
> > host->guest was same with both indirect on and off, and guest->host went
> > up by 5% with indirect on.
> > 
> > This was just a simple 1 TCP stream test.
> 
> I'm confused. didn't you see a bigger benefit for guest->host by
> switching indirect off?

The 5% improvement is over the 'regular' indirect on, not over indirect
off. Sorry for the confusion there.

I suggested this change regardless of the outcome of indirect descriptor
threshold discussion, since it would help anyways.

-- 

Sasha.

^ permalink raw reply

* Re: [PATCH] virtio-ring: Use threshold for switching to indirect descriptors
From: Avi Kivity @ 2011-12-04 16:33 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: markmc, kvm, linux-kernel, virtualization, Sasha Levin
In-Reply-To: <20111204160053.GA22501@redhat.com>

On 12/04/2011 06:00 PM, Michael S. Tsirkin wrote:
> >  If you
> > copy descriptors, then it goes away.
>
> The avail ring could go away. used could if we make descriptors
> writeable. IIUC it was made RO in the hope that will make it
> easier for xen to adopt. Still relevant?

You mean RO from the consumer side?  Why can't Xen do that?

> > That does suck.  Are there issues in increasing the ring size?  Or
> > making it discontiguous?
>
> discontiguous ring is what indirect is, basically.

No, discontiguous is more cache and prefetch friendly.  With vmap(), the
code doesn't even change.


-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply

* Re: [PATCH] virtio-ring: Use threshold for switching to indirect descriptors
From: Michael S. Tsirkin @ 2011-12-04 16:22 UTC (permalink / raw)
  To: Sasha Levin; +Cc: markmc, kvm, linux-kernel, virtualization, Avi Kivity
In-Reply-To: <1323000831.4205.4.camel@lappy>

On Sun, Dec 04, 2011 at 02:13:51PM +0200, Sasha Levin wrote:
> On Sun, 2011-12-04 at 13:52 +0200, Avi Kivity wrote:
> > On 12/03/2011 01:50 PM, Sasha Levin wrote:
> > > On Fri, 2011-12-02 at 11:16 +1030, Rusty Russell wrote:
> > > > On Thu, 1 Dec 2011 12:26:42 +0200, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > > On Thu, Dec 01, 2011 at 10:09:37AM +0200, Sasha Levin wrote:
> > > > > > On Thu, 2011-12-01 at 09:58 +0200, Michael S. Tsirkin wrote:
> > > > > > > We'll presumably need some logic to increment is back,
> > > > > > > to account for random workload changes.
> > > > > > > Something like slow start?
> > > > > > 
> > > > > > We can increment it each time the queue was less than 10% full, it
> > > > > > should act like slow start, no?
> > > > > 
> > > > > No, we really shouldn't get an empty ring as long as things behave
> > > > > well. What I meant is something like:
> > > > 
> > > > I was thinking of the network output case, but you're right.  We need to
> > > > distinguish between usually full (eg. virtio-net input) and usually
> > > > empty (eg. virtio-net output).
> > > > 
> > > > The signal for "we to pack more into the ring" is different.  We could
> > > > use some hacky heuristic like "out == 0" but I'd rather make it explicit
> > > > when we set up the virtqueue.
> > > > 
> > > > Our other alternative, moving the logic to the driver, is worse.
> > > > 
> > > > As to fading the effect over time, that's harder.  We have to deplete
> > > > the ring quite a few times before it turns into always-indirect.  We
> > > > could back off every time the ring is totally idle, but that may hurt
> > > > bursty traffic.  Let's try simple first?
> > >
> > > I tried to take a different approach, and tried putting the indirect
> > > descriptors in a kmem_cache as Michael suggested. The benchmarks showed
> > > that this way virtio-net actually worked faster with indirect on even in
> > > a single stream.
> > 
> > How much better?
> 
> host->guest was same with both indirect on and off, and guest->host went
> up by 5% with indirect on.
> 
> This was just a simple 1 TCP stream test.

I'm confused. didn't you see a bigger benefit for guest->host by
switching indirect off?

> > 
> > I think that if indirects benefit networking, then we're doing something
> > wrong.  What's going on?  Does the ring get filled too early?  If so we
> > should expand it.
> > 
> 
> -- 
> 
> Sasha.

^ permalink raw reply

* Re: [PATCH] virtio-ring: Use threshold for switching to indirect descriptors
From: Michael S. Tsirkin @ 2011-12-04 16:00 UTC (permalink / raw)
  To: Avi Kivity; +Cc: markmc, kvm, linux-kernel, virtualization, Sasha Levin
In-Reply-To: <4EDB8EEB.4070309@redhat.com>

On Sun, Dec 04, 2011 at 05:16:59PM +0200, Avi Kivity wrote:
> On 12/04/2011 05:11 PM, Michael S. Tsirkin wrote:
> > > There's also the used ring, but that's a
> > > mistake if you have out of order completion.  We should have used copying.
> >
> > Seems unrelated... unless you want used to be written into
> > descriptor ring itself?
> 
> The avail/used rings are in addition to the regular ring, no?

Yes. A couple of extra pages, if we reduce alignment we could pack this
in a single extra page.

>  If you
> copy descriptors, then it goes away.

The avail ring could go away. used could if we make descriptors
writeable. IIUC it was made RO in the hope that will make it
easier for xen to adopt. Still relevant?

> > But, I don't really know why does virtio ring insist on
> > making the 3 buffers (avail/used/descriptor)
> > physically contigious. Rusty?
> 
> Let's drop them instead.
> 
> >
> > > 16kB worth of descriptors is 1024 entries.  With 4kB buffers, that's 4MB
> > > worth of data, or 4 ms at 10GbE line speed.  With 1500 byte buffers it's
> > > just 1.5 ms.  In any case I think it's sufficient.
> >
> > Right. So I think that without indirect, we waste about 3 entries
> > per packet for virtio header and transport etc headers.
> 
> That does suck.  Are there issues in increasing the ring size?  Or
> making it discontiguous?

discontiguous ring is what indirect is, basically.

> Can you take a peek at how Xen manages its rings?  They have the same
> problems we do.
> 
> -- 
> error compiling committee.c: too many arguments to function

^ permalink raw reply

* Re: [PATCH] virtio-ring: Use threshold for switching to indirect descriptors
From: Avi Kivity @ 2011-12-04 15:16 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: markmc, kvm, linux-kernel, virtualization, Sasha Levin
In-Reply-To: <20111204151148.GA21851@redhat.com>

On 12/04/2011 05:11 PM, Michael S. Tsirkin wrote:
> > There's also the used ring, but that's a
> > mistake if you have out of order completion.  We should have used copying.
>
> Seems unrelated... unless you want used to be written into
> descriptor ring itself?

The avail/used rings are in addition to the regular ring, no?  If you
copy descriptors, then it goes away.

> But, I don't really know why does virtio ring insist on
> making the 3 buffers (avail/used/descriptor)
> physically contigious. Rusty?

Let's drop them instead.

>
> > 16kB worth of descriptors is 1024 entries.  With 4kB buffers, that's 4MB
> > worth of data, or 4 ms at 10GbE line speed.  With 1500 byte buffers it's
> > just 1.5 ms.  In any case I think it's sufficient.
>
> Right. So I think that without indirect, we waste about 3 entries
> per packet for virtio header and transport etc headers.

That does suck.  Are there issues in increasing the ring size?  Or
making it discontiguous?

Can you take a peek at how Xen manages its rings?  They have the same
problems we do.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply

* Re: [PATCH] virtio-ring: Use threshold for switching to indirect descriptors
From: Michael S. Tsirkin @ 2011-12-04 15:15 UTC (permalink / raw)
  To: Sasha Levin; +Cc: markmc, kvm, linux-kernel, virtualization, Avi Kivity
In-Reply-To: <20111204110637.GN15464@redhat.com>

On Sun, Dec 04, 2011 at 01:06:37PM +0200, Michael S. Tsirkin wrote:
> On Sat, Dec 03, 2011 at 01:50:28PM +0200, Sasha Levin wrote:
> > On Fri, 2011-12-02 at 11:16 +1030, Rusty Russell wrote:
> > > On Thu, 1 Dec 2011 12:26:42 +0200, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > On Thu, Dec 01, 2011 at 10:09:37AM +0200, Sasha Levin wrote:
> > > > > On Thu, 2011-12-01 at 09:58 +0200, Michael S. Tsirkin wrote:
> > > > > > We'll presumably need some logic to increment is back,
> > > > > > to account for random workload changes.
> > > > > > Something like slow start?
> > > > > 
> > > > > We can increment it each time the queue was less than 10% full, it
> > > > > should act like slow start, no?
> > > > 
> > > > No, we really shouldn't get an empty ring as long as things behave
> > > > well. What I meant is something like:
> > > 
> > > I was thinking of the network output case, but you're right.  We need to
> > > distinguish between usually full (eg. virtio-net input) and usually
> > > empty (eg. virtio-net output).
> > > 
> > > The signal for "we to pack more into the ring" is different.  We could
> > > use some hacky heuristic like "out == 0" but I'd rather make it explicit
> > > when we set up the virtqueue.
> > > 
> > > Our other alternative, moving the logic to the driver, is worse.
> > > 
> > > As to fading the effect over time, that's harder.  We have to deplete
> > > the ring quite a few times before it turns into always-indirect.  We
> > > could back off every time the ring is totally idle, but that may hurt
> > > bursty traffic.  Let's try simple first?
> > 
> > I tried to take a different approach, and tried putting the indirect
> > descriptors in a kmem_cache as Michael suggested. The benchmarks showed
> > that this way virtio-net actually worked faster with indirect on even in
> > a single stream.
> > 
> > Maybe we can do that instead of playing with threshold for now.
> > 
> > The question here, how much wasted space we can afford? since indirect
> > descriptors would have to be the same size we'd have a bunch of them
> > wasted in the cache. Ofcourse we can make that configurable, but how
> > much is ok by default?
> 
> I think it's a good idea to make that per-device.
> For network at least, each skb already has overhead of
> around 1/2 K, so using up to 1/2K more seems acceptable.
> But even if we went up to MAX_SKB_FRAGS+2, it would be
> only 1K per ring entry,

I got this wrong - descriptor is 16 bytes, so MAX_SKB_FRAGS+2
descriptors would be less than 300 bytes overhead per packet.
That's not a lot.

> so for a ring of 256 entries, we end up with
> 256K max waste. That's not that terrible.
> 
> But I'd say let's do some benchmarking to figure out
> the point where the gains are becoming very small.


> > -- 
> > 
> > Sasha.

^ permalink raw reply

* Re: [PATCH] virtio-ring: Use threshold for switching to indirect descriptors
From: Michael S. Tsirkin @ 2011-12-04 15:11 UTC (permalink / raw)
  To: Avi Kivity; +Cc: markmc, kvm, linux-kernel, virtualization, Sasha Levin
In-Reply-To: <4EDB624A.3030403@redhat.com>

On Sun, Dec 04, 2011 at 02:06:34PM +0200, Avi Kivity wrote:
> On 12/04/2011 02:01 PM, Michael S. Tsirkin wrote:
> > > 
> > > How much better?
> > > 
> > > I think that if indirects benefit networking, then we're doing something
> > > wrong.  What's going on?  Does the ring get filled too early?  If so we
> > > should expand it.
> >
> > The ring is physically contigious.
> > With 256 entries and 64 bytes each, that's already 16K.
> 
> A descriptor is just 16 bytes.

Right. Not sure where did I get 64.

> There's also the used ring, but that's a
> mistake if you have out of order completion.  We should have used copying.

Seems unrelated... unless you want used to be written into
descriptor ring itself?
But, I don't really know why does virtio ring insist on
making the 3 buffers (avail/used/descriptor)
physically contigious. Rusty?

> 16kB worth of descriptors is 1024 entries.  With 4kB buffers, that's 4MB
> worth of data, or 4 ms at 10GbE line speed.  With 1500 byte buffers it's
> just 1.5 ms.  In any case I think it's sufficient.

Right. So I think that without indirect, we waste about 3 entries
per packet for virtio header and transport etc headers.

> -- 
> error compiling committee.c: too many arguments to function

^ permalink raw reply

* Re: [PATCH] virtio-ring: Use threshold for switching to indirect descriptors
From: Sasha Levin @ 2011-12-04 12:13 UTC (permalink / raw)
  To: Avi Kivity; +Cc: markmc, kvm, Michael S. Tsirkin, linux-kernel, virtualization
In-Reply-To: <4EDB5EF0.2010909@redhat.com>

On Sun, 2011-12-04 at 13:52 +0200, Avi Kivity wrote:
> On 12/03/2011 01:50 PM, Sasha Levin wrote:
> > On Fri, 2011-12-02 at 11:16 +1030, Rusty Russell wrote:
> > > On Thu, 1 Dec 2011 12:26:42 +0200, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > On Thu, Dec 01, 2011 at 10:09:37AM +0200, Sasha Levin wrote:
> > > > > On Thu, 2011-12-01 at 09:58 +0200, Michael S. Tsirkin wrote:
> > > > > > We'll presumably need some logic to increment is back,
> > > > > > to account for random workload changes.
> > > > > > Something like slow start?
> > > > > 
> > > > > We can increment it each time the queue was less than 10% full, it
> > > > > should act like slow start, no?
> > > > 
> > > > No, we really shouldn't get an empty ring as long as things behave
> > > > well. What I meant is something like:
> > > 
> > > I was thinking of the network output case, but you're right.  We need to
> > > distinguish between usually full (eg. virtio-net input) and usually
> > > empty (eg. virtio-net output).
> > > 
> > > The signal for "we to pack more into the ring" is different.  We could
> > > use some hacky heuristic like "out == 0" but I'd rather make it explicit
> > > when we set up the virtqueue.
> > > 
> > > Our other alternative, moving the logic to the driver, is worse.
> > > 
> > > As to fading the effect over time, that's harder.  We have to deplete
> > > the ring quite a few times before it turns into always-indirect.  We
> > > could back off every time the ring is totally idle, but that may hurt
> > > bursty traffic.  Let's try simple first?
> >
> > I tried to take a different approach, and tried putting the indirect
> > descriptors in a kmem_cache as Michael suggested. The benchmarks showed
> > that this way virtio-net actually worked faster with indirect on even in
> > a single stream.
> 
> How much better?

host->guest was same with both indirect on and off, and guest->host went
up by 5% with indirect on.

This was just a simple 1 TCP stream test.

> 
> I think that if indirects benefit networking, then we're doing something
> wrong.  What's going on?  Does the ring get filled too early?  If so we
> should expand it.
> 

-- 

Sasha.

^ permalink raw reply

* Re: [PATCH] virtio-ring: Use threshold for switching to indirect descriptors
From: Avi Kivity @ 2011-12-04 12:06 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: markmc, kvm, linux-kernel, virtualization, Sasha Levin
In-Reply-To: <20111204120132.GB18758@redhat.com>

On 12/04/2011 02:01 PM, Michael S. Tsirkin wrote:
> > 
> > How much better?
> > 
> > I think that if indirects benefit networking, then we're doing something
> > wrong.  What's going on?  Does the ring get filled too early?  If so we
> > should expand it.
>
> The ring is physically contigious.
> With 256 entries and 64 bytes each, that's already 16K.

A descriptor is just 16 bytes.  There's also the used ring, but that's a
mistake if you have out of order completion.  We should have used copying.

16kB worth of descriptors is 1024 entries.  With 4kB buffers, that's 4MB
worth of data, or 4 ms at 10GbE line speed.  With 1500 byte buffers it's
just 1.5 ms.  In any case I think it's sufficient.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply

* Re: [PATCH] virtio-ring: Use threshold for switching to indirect descriptors
From: Michael S. Tsirkin @ 2011-12-04 12:01 UTC (permalink / raw)
  To: Avi Kivity; +Cc: markmc, kvm, linux-kernel, virtualization, Sasha Levin
In-Reply-To: <4EDB5EF0.2010909@redhat.com>

On Sun, Dec 04, 2011 at 01:52:16PM +0200, Avi Kivity wrote:
> On 12/03/2011 01:50 PM, Sasha Levin wrote:
> > On Fri, 2011-12-02 at 11:16 +1030, Rusty Russell wrote:
> > > On Thu, 1 Dec 2011 12:26:42 +0200, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > On Thu, Dec 01, 2011 at 10:09:37AM +0200, Sasha Levin wrote:
> > > > > On Thu, 2011-12-01 at 09:58 +0200, Michael S. Tsirkin wrote:
> > > > > > We'll presumably need some logic to increment is back,
> > > > > > to account for random workload changes.
> > > > > > Something like slow start?
> > > > > 
> > > > > We can increment it each time the queue was less than 10% full, it
> > > > > should act like slow start, no?
> > > > 
> > > > No, we really shouldn't get an empty ring as long as things behave
> > > > well. What I meant is something like:
> > > 
> > > I was thinking of the network output case, but you're right.  We need to
> > > distinguish between usually full (eg. virtio-net input) and usually
> > > empty (eg. virtio-net output).
> > > 
> > > The signal for "we to pack more into the ring" is different.  We could
> > > use some hacky heuristic like "out == 0" but I'd rather make it explicit
> > > when we set up the virtqueue.
> > > 
> > > Our other alternative, moving the logic to the driver, is worse.
> > > 
> > > As to fading the effect over time, that's harder.  We have to deplete
> > > the ring quite a few times before it turns into always-indirect.  We
> > > could back off every time the ring is totally idle, but that may hurt
> > > bursty traffic.  Let's try simple first?
> >
> > I tried to take a different approach, and tried putting the indirect
> > descriptors in a kmem_cache as Michael suggested. The benchmarks showed
> > that this way virtio-net actually worked faster with indirect on even in
> > a single stream.
> 
> How much better?
> 
> I think that if indirects benefit networking, then we're doing something
> wrong.  What's going on?  Does the ring get filled too early?  If so we
> should expand it.

The ring is physically contigious.
With 256 entries and 64 bytes each, that's already 16K.

> -- 
> error compiling committee.c: too many arguments to function

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox