virtualization.lists.linux-foundation.org archive mirror
* [net-next RFC PATCH 0/5] Series short description
@ 2011-12-05  8:58 Jason Wang
  2011-12-05  8:58 ` [net-next RFC PATCH 1/5] virtio_net: passing rxhash through vnet_hdr Jason Wang
                   ` (7 more replies)
  0 siblings, 8 replies; 36+ messages in thread
From: Jason Wang @ 2011-12-05  8:58 UTC (permalink / raw)
  To: krkumar2, kvm, mst, netdev, rusty, virtualization, levinsasha928,
	bhutchings

multiple queue virtio-net: flow steering through host/guest cooperation

Hello all:

This is a rough series that adds guest/host cooperation for flow
steering, based on patch 3/3 of Krishna Kumar's multiqueue virtio-net
driver series (http://lwn.net/Articles/467283/).

The idea is simple: the backend passes the rxhash to the guest, and the
guest tells the backend the hash-to-queue mapping when necessary; the
backend can then choose the queue based on the hash value of the
packet.  The table is just a page shared between userspace and the
backend.
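
To make the mechanism concrete, here is a condensed sketch of the
host-side lookup (patch 2 implements essentially this inside
tun_get_queue(); fd_pick_queue() is only an illustrative name, the
other names come from patch 2):

static u16 fd_pick_queue(struct page *fd_page, struct sk_buff *skb,
                         unsigned int numqueues)
{
        /* The shared page is an array of __u16 queue indices indexed by
         * the low bits of the packet's rxhash.  An out-of-range entry
         * means "not programmed" and we fall back to plain hashing. */
        u32 rxhash = skb_get_rxhash(skb);
        u16 *table = kmap_atomic(fd_page);
        u16 queue = table[rxhash & TAP_HASH_MASK];

        kunmap_atomic(table);
        if (queue >= numqueues)
                queue = ((u64)rxhash * numqueues) >> 32;
        return queue;
}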

Patch 1 enables passing the rxhash through vnet_hdr to the guest.
Patches 2 and 3 implement a very simple flow director for tap and
macvtap. The tap part is based on the multiqueue tap patches posted by
me (http://lwn.net/Articles/459270/).
Patch 4 implements a method for a virtio device to find the irq of a
specific virtqueue, in order to do device-specific interrupt
optimization.
Patch 5 is the guest driver part, which uses accelerated RFS to program
the flow director, with some optimizations on irq affinity and tx queue
selection.

This is just a prototype that demonstrates the idea; there are still
things that need to be discussed:

- An alternative to the shared page is the ctrl vq; the reason a shared
  table is preferable is the latency of the ctrl vq itself.
- Optimization of irq affinity and tx queue selection

Comments are welcome, thanks!

---

Jason Wang (5):
      virtio_net: passing rxhash through vnet_hdr
      tuntap: simple flow director support
      macvtap: flow director support
      virtio: introduce a method to get the irq of a specific virtqueue
      virtio-net: flow director support


 drivers/lguest/lguest_device.c |    8 ++
 drivers/net/macvlan.c          |    4 +
 drivers/net/macvtap.c          |   42 ++++++++-
 drivers/net/tun.c              |  105 ++++++++++++++++------
 drivers/net/virtio_net.c       |  189 +++++++++++++++++++++++++++++++++++++++-
 drivers/s390/kvm/kvm_virtio.c  |    6 +
 drivers/vhost/net.c            |   10 +-
 drivers/vhost/vhost.h          |    5 +
 drivers/virtio/virtio_mmio.c   |    8 ++
 drivers/virtio/virtio_pci.c    |   12 +++
 include/linux/if_macvlan.h     |    1 
 include/linux/if_tun.h         |   11 ++
 include/linux/virtio_config.h  |    4 +
 include/linux/virtio_net.h     |   16 +++
 14 files changed, 377 insertions(+), 44 deletions(-)

-- 
Signature


* [net-next RFC PATCH 1/5] virtio_net: passing rxhash through vnet_hdr
  2011-12-05  8:58 [net-next RFC PATCH 0/5] Series short description Jason Wang
@ 2011-12-05  8:58 ` Jason Wang
  2011-12-05  8:58 ` [net-next RFC PATCH 2/5] tuntap: simple flow director support Jason Wang
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 36+ messages in thread
From: Jason Wang @ 2011-12-05  8:58 UTC (permalink / raw)
  To: krkumar2, kvm, mst, netdev, rusty, virtualization, levinsasha928,
	bhutchings

This patch enables passing the rxhash value to the guest through
vnet_hdr. This is useful when the guest wants to cooperate with the
virtual device to steer a flow to a dedicated guest cpu.

The feature is negotiated through VIRTIO_NET_F_GUEST_RXHASH.
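
For illustration, a rough sketch of how a userspace backend could
consume the new header from a tap fd (read_rxhash() is a made-up name,
error handling is omitted, and the patched linux/virtio_net.h is
assumed):

#include <linux/virtio_net.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

static uint32_t read_rxhash(int tap_fd)
{
        /* Assumes IFF_VNET_HDR is set and TUNSETVNETHDRSZ was set to
         * sizeof(struct virtio_net_hdr_rxhash) at setup time. */
        char buf[sizeof(struct virtio_net_hdr_rxhash) + 65536];
        struct virtio_net_hdr_rxhash *h = (struct virtio_net_hdr_rxhash *)buf;

        if (read(tap_fd, buf, sizeof(buf)) < (ssize_t)sizeof(*h))
                return 0;
        /* h->hdr is the usual virtio_net_hdr_mrg_rxbuf; the trailing
         * field carries the rxhash the host computed for this packet. */
        return h->rxhash;
}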

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/macvtap.c      |   10 ++++++----
 drivers/net/tun.c          |   44 +++++++++++++++++++++++++-------------------
 drivers/net/virtio_net.c   |   26 ++++++++++++++++++++++----
 drivers/vhost/net.c        |   10 +++++++---
 drivers/vhost/vhost.h      |    5 +++--
 include/linux/if_tun.h     |    1 +
 include/linux/virtio_net.h |   10 +++++++++-
 7 files changed, 73 insertions(+), 33 deletions(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 7c88d13..504c745 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -760,16 +760,17 @@ static ssize_t macvtap_put_user(struct macvtap_queue *q,
 	int vnet_hdr_len = 0;
 
 	if (q->flags & IFF_VNET_HDR) {
-		struct virtio_net_hdr vnet_hdr;
+		struct virtio_net_hdr_rxhash vnet_hdr;
 		vnet_hdr_len = q->vnet_hdr_sz;
 		if ((len -= vnet_hdr_len) < 0)
 			return -EINVAL;
 
-		ret = macvtap_skb_to_vnet_hdr(skb, &vnet_hdr);
+		ret = macvtap_skb_to_vnet_hdr(skb, &vnet_hdr.hdr.hdr);
 		if (ret)
 			return ret;
 
-		if (memcpy_toiovecend(iv, (void *)&vnet_hdr, 0, sizeof(vnet_hdr)))
+		vnet_hdr.rxhash = skb->rxhash;
+		if (memcpy_toiovecend(iv, (void *)&vnet_hdr, 0, q->vnet_hdr_sz))
 			return -EFAULT;
 	}
 
@@ -890,7 +891,8 @@ static long macvtap_ioctl(struct file *file, unsigned int cmd,
 		return ret;
 
 	case TUNGETFEATURES:
-		if (put_user(IFF_TAP | IFF_NO_PI | IFF_VNET_HDR, up))
+		if (put_user(IFF_TAP | IFF_NO_PI | IFF_VNET_HDR | IFF_RXHASH,
+			     up))
 			return -EFAULT;
 		return 0;
 
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index afb11d1..7d22b4b 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -869,49 +869,55 @@ static ssize_t tun_put_user(struct tun_file *tfile,
 	}
 
 	if (tfile->flags & TUN_VNET_HDR) {
-		struct virtio_net_hdr gso = { 0 }; /* no info leak */
-		if ((len -= tfile->vnet_hdr_sz) < 0)
+		struct virtio_net_hdr_rxhash hdr;
+		struct virtio_net_hdr *gso = (struct virtio_net_hdr *)&hdr;
+
+		if ((len -= tfile->vnet_hdr_sz) < 0 ||
+		    tfile->vnet_hdr_sz > sizeof(struct virtio_net_hdr_rxhash))
 			return -EINVAL;
 
+		memset(&hdr, 0, sizeof(hdr));
 		if (skb_is_gso(skb)) {
 			struct skb_shared_info *sinfo = skb_shinfo(skb);
 
 			/* This is a hint as to how much should be linear. */
-			gso.hdr_len = skb_headlen(skb);
-			gso.gso_size = sinfo->gso_size;
+			gso->hdr_len = skb_headlen(skb);
+			gso->gso_size = sinfo->gso_size;
 			if (sinfo->gso_type & SKB_GSO_TCPV4)
-				gso.gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
+				gso->gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
 			else if (sinfo->gso_type & SKB_GSO_TCPV6)
-				gso.gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
+				gso->gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
 			else if (sinfo->gso_type & SKB_GSO_UDP)
-				gso.gso_type = VIRTIO_NET_HDR_GSO_UDP;
+				gso->gso_type = VIRTIO_NET_HDR_GSO_UDP;
 			else {
 				pr_err("unexpected GSO type: "
 				       "0x%x, gso_size %d, hdr_len %d\n",
-				       sinfo->gso_type, gso.gso_size,
-				       gso.hdr_len);
+				       sinfo->gso_type, gso->gso_size,
+				       gso->hdr_len);
 				print_hex_dump(KERN_ERR, "tun: ",
 					       DUMP_PREFIX_NONE,
 					       16, 1, skb->head,
-					       min((int)gso.hdr_len, 64), true);
+					       min((int)gso->hdr_len, 64),
+					       true);
 				WARN_ON_ONCE(1);
 				return -EINVAL;
 			}
 			if (sinfo->gso_type & SKB_GSO_TCP_ECN)
-				gso.gso_type |= VIRTIO_NET_HDR_GSO_ECN;
+				gso->gso_type |= VIRTIO_NET_HDR_GSO_ECN;
 		} else
-			gso.gso_type = VIRTIO_NET_HDR_GSO_NONE;
+			gso->gso_type = VIRTIO_NET_HDR_GSO_NONE;
 
 		if (skb->ip_summed == CHECKSUM_PARTIAL) {
-			gso.flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
-			gso.csum_start = skb_checksum_start_offset(skb);
-			gso.csum_offset = skb->csum_offset;
+			gso->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
+			gso->csum_start = skb_checksum_start_offset(skb);
+			gso->csum_offset = skb->csum_offset;
 		} else if (skb->ip_summed == CHECKSUM_UNNECESSARY) {
-			gso.flags = VIRTIO_NET_HDR_F_DATA_VALID;
+			gso->flags = VIRTIO_NET_HDR_F_DATA_VALID;
 		} /* else everything is zero */
 
-		if (unlikely(memcpy_toiovecend(iv, (void *)&gso, total,
-					       sizeof(gso))))
+		hdr.rxhash = skb_get_rxhash(skb);
+		if (unlikely(memcpy_toiovecend(iv, (void *)&hdr, total,
+					       tfile->vnet_hdr_sz)))
 			return -EFAULT;
 		total += tfile->vnet_hdr_sz;
 	}
@@ -1358,7 +1364,7 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
 		 * This is needed because we never checked for invalid flags on
 		 * TUNSETIFF. */
 		return put_user(IFF_TUN | IFF_TAP | IFF_NO_PI | IFF_ONE_QUEUE |
-				IFF_VNET_HDR | IFF_MULTI_QUEUE,
+				IFF_VNET_HDR | IFF_MULTI_QUEUE | IFF_RXHASH,
 				(unsigned int __user*)argp);
 	}
 
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 157ee63..0d871f8 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -107,12 +107,16 @@ struct virtnet_info {
 
 	/* Host will merge rx buffers for big packets (shake it! shake it!) */
 	bool mergeable_rx_bufs;
+
+	/* Host will pass rxhash to us. */
+	bool has_rxhash;
 };
 
 struct skb_vnet_hdr {
 	union {
 		struct virtio_net_hdr hdr;
 		struct virtio_net_hdr_mrg_rxbuf mhdr;
+		struct virtio_net_hdr_rxhash rhdr;
 	};
 	unsigned int num_sg;
 };
@@ -205,7 +209,10 @@ static struct sk_buff *page_to_skb(struct receive_queue *rq,
 	hdr = skb_vnet_hdr(skb);
 
 	if (vi->mergeable_rx_bufs) {
-		hdr_len = sizeof hdr->mhdr;
+		if (vi->has_rxhash)
+			hdr_len = sizeof hdr->rhdr;
+		else
+			hdr_len = sizeof hdr->mhdr;
 		offset = hdr_len;
 	} else {
 		hdr_len = sizeof hdr->hdr;
@@ -376,6 +383,9 @@ static void receive_buf(struct receive_queue *rq, void *buf, unsigned int len)
 		skb_shinfo(skb)->gso_segs = 0;
 	}
 
+	if (vi->has_rxhash)
+		skb->rxhash = hdr->rhdr.rxhash;
+
 	netif_receive_skb(skb);
 	return;
 
@@ -645,9 +655,12 @@ static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb,
 	hdr->mhdr.num_buffers = 0;
 
 	/* Encode metadata header at front. */
-	if (vi->mergeable_rx_bufs)
-		sg_set_buf(sg, &hdr->mhdr, sizeof hdr->mhdr);
-	else
+	if (vi->mergeable_rx_bufs) {
+		if (vi->has_rxhash)
+			sg_set_buf(sg, &hdr->rhdr, sizeof hdr->rhdr);
+		else
+			sg_set_buf(sg, &hdr->mhdr, sizeof hdr->mhdr);
+	} else
 		sg_set_buf(sg, &hdr->hdr, sizeof hdr->hdr);
 
 	hdr->num_sg = skb_to_sgvec(skb, sg + 1, 0, skb->len) + 1;
@@ -1338,8 +1351,12 @@ static int virtnet_probe(struct virtio_device *vdev)
 	if (virtio_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
 		vi->mergeable_rx_bufs = true;
 
+	if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_RXHASH))
+		vi->has_rxhash = true;
+
 	/* Allocate/initialize the rx/tx queues, and invoke find_vqs */
 	err = virtnet_setup_vqs(vi);
+
 	if (err)
 		goto free_netdev;
 
@@ -1436,6 +1453,7 @@ static unsigned int features[] = {
 	VIRTIO_NET_F_GUEST_ECN, VIRTIO_NET_F_GUEST_UFO,
 	VIRTIO_NET_F_MRG_RXBUF, VIRTIO_NET_F_STATUS, VIRTIO_NET_F_CTRL_VQ,
 	VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN, VIRTIO_NET_F_MULTIQUEUE,
+	VIRTIO_NET_F_GUEST_RXHASH,
 };
 
 static struct virtio_driver virtio_net_driver = {
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 882a51f..b2d6548 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -768,9 +768,13 @@ static int vhost_net_set_features(struct vhost_net *n, u64 features)
 	size_t vhost_hlen, sock_hlen, hdr_len;
 	int i;
 
-	hdr_len = (features & (1 << VIRTIO_NET_F_MRG_RXBUF)) ?
-			sizeof(struct virtio_net_hdr_mrg_rxbuf) :
-			sizeof(struct virtio_net_hdr);
+	if (features & (1 << VIRTIO_NET_F_MRG_RXBUF))
+		hdr_len = (features & (1 << VIRTIO_NET_F_GUEST_RXHASH)) ?
+			sizeof(struct virtio_net_hdr_rxhash) :
+			sizeof(struct virtio_net_hdr_mrg_rxbuf);
+	else
+		hdr_len = sizeof(struct virtio_net_hdr);
+
 	if (features & (1 << VHOST_NET_F_VIRTIO_NET_HDR)) {
 		/* vhost provides vnet_hdr */
 		vhost_hlen = hdr_len;
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index a801e28..4ad2d5f 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -115,7 +115,7 @@ struct vhost_virtqueue {
 	/* hdr is used to store the virtio header.
 	 * Since each iovec has >= 1 byte length, we never need more than
 	 * header length entries to store the header. */
-	struct iovec hdr[sizeof(struct virtio_net_hdr_mrg_rxbuf)];
+	struct iovec hdr[sizeof(struct virtio_net_hdr_rxhash)];
 	struct iovec *indirect;
 	size_t vhost_hlen;
 	size_t sock_hlen;
@@ -203,7 +203,8 @@ enum {
 			 (1ULL << VIRTIO_RING_F_EVENT_IDX) |
 			 (1ULL << VHOST_F_LOG_ALL) |
 			 (1ULL << VHOST_NET_F_VIRTIO_NET_HDR) |
-			 (1ULL << VIRTIO_NET_F_MRG_RXBUF),
+			 (1ULL << VIRTIO_NET_F_MRG_RXBUF) |
+			 (1ULL << VIRTIO_NET_F_GUEST_RXHASH) ,
 };
 
 static inline int vhost_has_feature(struct vhost_dev *dev, int bit)
diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h
index d3f24d8..a1f6f3f 100644
--- a/include/linux/if_tun.h
+++ b/include/linux/if_tun.h
@@ -66,6 +66,7 @@
 #define IFF_VNET_HDR	0x4000
 #define IFF_TUN_EXCL	0x8000
 #define IFF_MULTI_QUEUE 0x0100
+#define IFF_RXHASH      0x0200
 
 /* Features for GSO (TUNSETOFFLOAD). */
 #define TUN_F_CSUM	0x01	/* You can hand me unchecksummed packets. */
diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
index c92b83f..2291317 100644
--- a/include/linux/virtio_net.h
+++ b/include/linux/virtio_net.h
@@ -50,6 +50,7 @@
 #define VIRTIO_NET_F_CTRL_VLAN	19	/* Control channel VLAN filtering */
 #define VIRTIO_NET_F_CTRL_RX_EXTRA 20	/* Extra RX mode control support */
 #define VIRTIO_NET_F_MULTIQUEUE	21	/* Device supports multiple TXQ/RXQ */
+#define VIRTIO_NET_F_GUEST_RXHASH 22    /* Guest can receive rxhash */
 
 #define VIRTIO_NET_S_LINK_UP	1	/* Link is up */
 
@@ -63,7 +64,7 @@ struct virtio_net_config {
 } __attribute__((packed));
 
 /* This is the first element of the scatter-gather list.  If you don't
- * specify GSO or CSUM features, you can simply ignore the header. */
+ * specify GSO, CSUM or HASH features, you can simply ignore the header. */
 struct virtio_net_hdr {
 #define VIRTIO_NET_HDR_F_NEEDS_CSUM	1	// Use csum_start, csum_offset
 #define VIRTIO_NET_HDR_F_DATA_VALID	2	// Csum is valid
@@ -87,6 +88,13 @@ struct virtio_net_hdr_mrg_rxbuf {
 	__u16 num_buffers;	/* Number of merged rx buffers */
 };
 
+/* This is the version of the header to use when GUEST_RXHASH
+ * feature has been negotiated. */
+struct virtio_net_hdr_rxhash {
+	struct virtio_net_hdr_mrg_rxbuf hdr;
+	__u32 rxhash;
+};
+
 /*
  * Control virtqueue data structures
  *


* [net-next RFC PATCH 2/5] tuntap: simple flow director support
  2011-12-05  8:58 [net-next RFC PATCH 0/5] Series short description Jason Wang
  2011-12-05  8:58 ` [net-next RFC PATCH 1/5] virtio_net: passing rxhash through vnet_hdr Jason Wang
@ 2011-12-05  8:58 ` Jason Wang
  2011-12-05 10:38   ` Stefan Hajnoczi
                     ` (2 more replies)
  2011-12-05  8:59 ` [net-next RFC PATCH 3/5] macvtap: " Jason Wang
                   ` (5 subsequent siblings)
  7 siblings, 3 replies; 36+ messages in thread
From: Jason Wang @ 2011-12-05  8:58 UTC (permalink / raw)
  To: krkumar2, kvm, mst, netdev, rusty, virtualization, levinsasha928,
	bhutchings

This patch adds a simple flow director to the tun/tap device. It is
just a page that contains the hash-to-queue mapping, which can be
changed by userspace. The backend (tap/macvtap) queries this table to
get the desired queue of a packet when it sends packets to userspace.

The page address is set through a new kind of ioctl - TUNSETFD - and
the page is pinned until device exit or until another page is
specified.
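
For illustration, roughly how userspace could use it (error handling is
minimal; only TUNSETFD, struct tun_fd and the 256-entry table size come
from this patch, the rest is a sketch):

#include <linux/if_tun.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

static uint16_t *setup_flow_director(int tap_fd)
{
        long psz = sysconf(_SC_PAGESIZE);
        struct tun_fd tfd;
        uint16_t *table;
        int i;

        if (posix_memalign((void **)&table, psz, psz))
                return NULL;
        for (i = 0; i <= 0xFF; i++)     /* TAP_HASH_MASK buckets */
                table[i] = 0xFFFF;      /* >= numqueues: fall back to hashing */

        tfd.addr = (unsigned long)table;
        tfd.size = psz;
        if (ioctl(tap_fd, TUNSETFD, &tfd) < 0) {
                free(table);
                return NULL;
        }
        /* Steering hash bucket h to queue q is then just table[h] = q. */
        return table;
}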

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/tun.c      |   63 ++++++++++++++++++++++++++++++++++++++++--------
 include/linux/if_tun.h |   10 ++++++++
 2 files changed, 62 insertions(+), 11 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 7d22b4b..2efaf81 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -64,6 +64,7 @@
 #include <linux/nsproxy.h>
 #include <linux/virtio_net.h>
 #include <linux/rcupdate.h>
+#include <linux/highmem.h>
 #include <net/net_namespace.h>
 #include <net/netns/generic.h>
 #include <net/rtnetlink.h>
@@ -109,6 +110,7 @@ struct tap_filter {
 };
 
 #define MAX_TAP_QUEUES (NR_CPUS < 16 ? NR_CPUS : 16)
+#define TAP_HASH_MASK  0xFF
 
 struct tun_file {
 	struct sock sk;
@@ -128,6 +130,7 @@ struct tun_sock;
 
 struct tun_struct {
 	struct tun_file		*tfiles[MAX_TAP_QUEUES];
+	struct page             *fd_page[1];
 	unsigned int            numqueues;
 	unsigned int 		flags;
 	uid_t			owner;
@@ -156,7 +159,7 @@ static struct tun_file *tun_get_queue(struct net_device *dev,
 	struct tun_struct *tun = netdev_priv(dev);
 	struct tun_file *tfile = NULL;
 	int numqueues = tun->numqueues;
-	__u32 rxq;
+	__u32 rxq, rxhash;
 
 	BUG_ON(!rcu_read_lock_held());
 
@@ -168,6 +171,22 @@ static struct tun_file *tun_get_queue(struct net_device *dev,
 		goto out;
 	}
 
+	rxhash = skb_get_rxhash(skb);
+	if (rxhash) {
+		if (tun->fd_page[0]) {
+			u16 *table = kmap_atomic(tun->fd_page[0]);
+			rxq = table[rxhash & TAP_HASH_MASK];
+			kunmap_atomic(table);
+			if (rxq < numqueues) {
+				tfile = rcu_dereference(tun->tfiles[rxq]);
+				goto out;
+			}
+		}
+		rxq = ((u64)rxhash * numqueues) >> 32;
+		tfile = rcu_dereference(tun->tfiles[rxq]);
+		goto out;
+	}
+
 	if (likely(skb_rx_queue_recorded(skb))) {
 		rxq = skb_get_rx_queue(skb);
 
@@ -178,14 +197,6 @@ static struct tun_file *tun_get_queue(struct net_device *dev,
 		goto out;
 	}
 
-	/* Check if we can use flow to select a queue */
-	rxq = skb_get_rxhash(skb);
-	if (rxq) {
-		u32 idx = ((u64)rxq * numqueues) >> 32;
-		tfile = rcu_dereference(tun->tfiles[idx]);
-		goto out;
-	}
-
 	tfile = rcu_dereference(tun->tfiles[0]);
 out:
 	return tfile;
@@ -1020,6 +1031,14 @@ out:
 	return ret;
 }
 
+static void tun_destructor(struct net_device *dev)
+{
+	struct tun_struct *tun = netdev_priv(dev);
+	if (tun->fd_page[0])
+		put_page(tun->fd_page[0]);
+	free_netdev(dev);
+}
+
 static void tun_setup(struct net_device *dev)
 {
 	struct tun_struct *tun = netdev_priv(dev);
@@ -1028,7 +1047,7 @@ static void tun_setup(struct net_device *dev)
 	tun->group = -1;
 
 	dev->ethtool_ops = &tun_ethtool_ops;
-	dev->destructor = free_netdev;
+	dev->destructor = tun_destructor;
 }
 
 /* Trivial set of netlink ops to allow deleting tun or tap
@@ -1230,6 +1249,7 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
 		tun = netdev_priv(dev);
 		tun->dev = dev;
 		tun->flags = flags;
+		tun->fd_page[0] = NULL;
 
 		security_tun_dev_post_create(&tfile->sk);
 
@@ -1353,6 +1373,7 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
 	struct net_device *dev = NULL;
 	void __user* argp = (void __user*)arg;
 	struct ifreq ifr;
+	struct tun_fd tfd;
 	int ret;
 
 	if (cmd == TUNSETIFF || cmd == TUNATTACHQUEUE || _IOC_TYPE(cmd) == 0x89)
@@ -1364,7 +1385,8 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
 		 * This is needed because we never checked for invalid flags on
 		 * TUNSETIFF. */
 		return put_user(IFF_TUN | IFF_TAP | IFF_NO_PI | IFF_ONE_QUEUE |
-				IFF_VNET_HDR | IFF_MULTI_QUEUE | IFF_RXHASH,
+				IFF_VNET_HDR | IFF_MULTI_QUEUE | IFF_RXHASH |
+				IFF_FD,
 				(unsigned int __user*)argp);
 	}
 
@@ -1476,6 +1498,25 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
 		ret = set_offload(tun, arg);
 		break;
 
+	case TUNSETFD:
+		if (copy_from_user(&tfd, argp, sizeof(tfd)))
+			ret = -EFAULT;
+		else {
+			if (tun->fd_page[0]) {
+				put_page(tun->fd_page[0]);
+				tun->fd_page[0] = NULL;
+			}
+
+			/* put_page() in tun_destructor() */
+			if (get_user_pages_fast(tfd.addr, 1, 0,
+						&tun->fd_page[0]) != 1)
+				ret = -EFAULT;
+			else
+				ret = 0;
+		}
+
+		break;
+
 	case SIOCGIFHWADDR:
 		/* Get hw address */
 		memcpy(ifr.ifr_hwaddr.sa_data, tun->dev->dev_addr, ETH_ALEN);
diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h
index a1f6f3f..726731d 100644
--- a/include/linux/if_tun.h
+++ b/include/linux/if_tun.h
@@ -36,6 +36,8 @@
 #define TUN_VNET_HDR 	0x0200
 #define TUN_TAP_MQ      0x0400
 
+struct tun_fd;
+
 /* Ioctl defines */
 #define TUNSETNOCSUM  _IOW('T', 200, int) 
 #define TUNSETDEBUG   _IOW('T', 201, int) 
@@ -56,6 +58,7 @@
 #define TUNSETVNETHDRSZ _IOW('T', 216, int)
 #define TUNATTACHQUEUE  _IOW('T', 217, int)
 #define TUNDETACHQUEUE  _IOW('T', 218, int)
+#define TUNSETFD        _IOW('T', 219, struct tun_fd)
 
 
 /* TUNSETIFF ifr flags */
@@ -67,6 +70,7 @@
 #define IFF_TUN_EXCL	0x8000
 #define IFF_MULTI_QUEUE 0x0100
 #define IFF_RXHASH      0x0200
+#define IFF_FD          0x0400
 
 /* Features for GSO (TUNSETOFFLOAD). */
 #define TUN_F_CSUM	0x01	/* You can hand me unchecksummed packets. */
@@ -97,6 +101,12 @@ struct tun_filter {
 	__u8   addr[0][ETH_ALEN];
 };
 
+/* Programmable flow director */
+struct tun_fd {
+	unsigned long addr;
+	size_t size;
+};
+
 #ifdef __KERNEL__
 #if defined(CONFIG_TUN) || defined(CONFIG_TUN_MODULE)
 struct socket *tun_get_socket(struct file *);


* [net-next RFC PATCH 3/5] macvtap: flow director support
  2011-12-05  8:58 [net-next RFC PATCH 0/5] Series short description Jason Wang
  2011-12-05  8:58 ` [net-next RFC PATCH 1/5] virtio_net: passing rxhash through vnet_hdr Jason Wang
  2011-12-05  8:58 ` [net-next RFC PATCH 2/5] tuntap: simple flow director support Jason Wang
@ 2011-12-05  8:59 ` Jason Wang
  2011-12-05 20:11   ` Ben Hutchings
  2011-12-05  8:59 ` [net-next RFC PATCH 4/5] virtio: introduce a method to get the irq of a specific virtqueue Jason Wang
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 36+ messages in thread
From: Jason Wang @ 2011-12-05  8:59 UTC (permalink / raw)
  To: krkumar2, kvm, mst, netdev, rusty, virtualization, levinsasha928,
	bhutchings

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/macvlan.c      |    4 ++++
 drivers/net/macvtap.c      |   36 ++++++++++++++++++++++++++++++++++--
 include/linux/if_macvlan.h |    1 +
 3 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index 7413497..b0cb7ce 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -706,6 +706,7 @@ int macvlan_common_newlink(struct net *src_net, struct net_device *dev,
 	vlan->port     = port;
 	vlan->receive  = receive;
 	vlan->forward  = forward;
+	vlan->fd_page[0] = NULL;
 
 	vlan->mode     = MACVLAN_MODE_VEPA;
 	if (data && data[IFLA_MACVLAN_MODE])
@@ -749,6 +750,9 @@ void macvlan_dellink(struct net_device *dev, struct list_head *head)
 {
 	struct macvlan_dev *vlan = netdev_priv(dev);
 
+	if (vlan->fd_page[0])
+		put_page(vlan->fd_page[0]);
+
 	list_del(&vlan->list);
 	unregister_netdevice_queue(dev, head);
 }
diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 504c745..a34eb84 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -14,6 +14,7 @@
 #include <linux/wait.h>
 #include <linux/cdev.h>
 #include <linux/fs.h>
+#include <linux/highmem.h>
 
 #include <net/net_namespace.h>
 #include <net/rtnetlink.h>
@@ -62,6 +63,8 @@ static DEFINE_IDR(minor_idr);
 static struct class *macvtap_class;
 static struct cdev macvtap_cdev;
 
+#define TAP_HASH_MASK 0xFF
+
 static const struct proto_ops macvtap_socket_ops;
 
 /*
@@ -189,6 +192,11 @@ static struct macvtap_queue *macvtap_get_queue(struct net_device *dev,
 	/* Check if we can use flow to select a queue */
 	rxq = skb_get_rxhash(skb);
 	if (rxq) {
+		if (vlan->fd_page[0]) {
+			u16 *table = kmap_atomic(vlan->fd_page[0]);
+			rxq = table[rxq & TAP_HASH_MASK];
+			kunmap_atomic(table);
+		}
 		tap = rcu_dereference(vlan->taps[rxq % numvtaps]);
 		if (tap)
 			goto out;
@@ -851,6 +859,7 @@ static long macvtap_ioctl(struct file *file, unsigned int cmd,
 {
 	struct macvtap_queue *q = file->private_data;
 	struct macvlan_dev *vlan;
+	struct tun_fd tfd;
 	void __user *argp = (void __user *)arg;
 	struct ifreq __user *ifr = argp;
 	unsigned int __user *up = argp;
@@ -891,8 +900,8 @@ static long macvtap_ioctl(struct file *file, unsigned int cmd,
 		return ret;
 
 	case TUNGETFEATURES:
-		if (put_user(IFF_TAP | IFF_NO_PI | IFF_VNET_HDR | IFF_RXHASH,
-			     up))
+		if (put_user(IFF_TAP | IFF_NO_PI | IFF_VNET_HDR | IFF_RXHASH |
+			     IFF_FD, up))
 			return -EFAULT;
 		return 0;
 
@@ -918,6 +927,29 @@ static long macvtap_ioctl(struct file *file, unsigned int cmd,
 		q->vnet_hdr_sz = s;
 		return 0;
 
+	case TUNSETFD:
+		rcu_read_lock_bh();
+		vlan = rcu_dereference(q->vlan);
+		if (!vlan)
+			ret = -ENOLINK;
+		else {
+			if (copy_from_user(&tfd, argp, sizeof(tfd)))
+				ret = -EFAULT;
+			if (vlan->fd_page[0]) {
+				put_page(vlan->fd_page[0]);
+				vlan->fd_page[0] = NULL;
+			}
+
+			/* put_page() in macvlan_dellink() */
+			if (get_user_pages_fast(tfd.addr, 1, 0,
+						&vlan->fd_page[0]) != 1)
+				ret = -EFAULT;
+			else
+				ret = 0;
+		}
+		rcu_read_unlock_bh();
+		return ret;
+
 	case TUNSETOFFLOAD:
 		/* let the user check for future flags */
 		if (arg & ~(TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6 |
diff --git a/include/linux/if_macvlan.h b/include/linux/if_macvlan.h
index d103dca..69a87a1 100644
--- a/include/linux/if_macvlan.h
+++ b/include/linux/if_macvlan.h
@@ -65,6 +65,7 @@ struct macvlan_dev {
 	struct macvtap_queue	*taps[MAX_MACVTAP_QUEUES];
 	int			numvtaps;
 	int			minor;
+	struct page             *fd_page[1];
 };
 
 static inline void macvlan_count_rx(const struct macvlan_dev *vlan,


* [net-next RFC PATCH 4/5] virtio: introduce a method to get the irq of a specific virtqueue
  2011-12-05  8:58 [net-next RFC PATCH 0/5] Series short description Jason Wang
                   ` (2 preceding siblings ...)
  2011-12-05  8:59 ` [net-next RFC PATCH 3/5] macvtap: " Jason Wang
@ 2011-12-05  8:59 ` Jason Wang
  2011-12-05  8:59 ` [net-next RFC PATCH 5/5] virtio-net: flow director support Jason Wang
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 36+ messages in thread
From: Jason Wang @ 2011-12-05  8:59 UTC (permalink / raw)
  To: krkumar2, kvm, mst, netdev, rusty, virtualization, levinsasha928,
	bhutchings

Device-specific irq configuration may be needed in order to do some
optimizations, so a new config operation is added to get the irq of a
specific virtqueue.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/lguest/lguest_device.c |    8 ++++++++
 drivers/s390/kvm/kvm_virtio.c  |    6 ++++++
 drivers/virtio/virtio_mmio.c   |    8 ++++++++
 drivers/virtio/virtio_pci.c    |   12 ++++++++++++
 include/linux/virtio_config.h  |    4 ++++
 5 files changed, 38 insertions(+), 0 deletions(-)

diff --git a/drivers/lguest/lguest_device.c b/drivers/lguest/lguest_device.c
index 595d731..6483bff 100644
--- a/drivers/lguest/lguest_device.c
+++ b/drivers/lguest/lguest_device.c
@@ -386,6 +386,13 @@ static const char *lg_bus_name(struct virtio_device *vdev)
 	return "";
 }
 
+static int lg_get_vq_irq(struct virtio_device *vdev, struct virtqueue *vq)
+{
+	struct lguest_vq_info *lvq = vq->priv;
+
+	return lvq->config.irq;
+}
+
 /* The ops structure which hooks everything together. */
 static struct virtio_config_ops lguest_config_ops = {
 	.get_features = lg_get_features,
@@ -398,6 +405,7 @@ static struct virtio_config_ops lguest_config_ops = {
 	.find_vqs = lg_find_vqs,
 	.del_vqs = lg_del_vqs,
 	.bus_name = lg_bus_name,
+	.get_vq_irq = lg_get_vq_irq,
 };
 
 /*
diff --git a/drivers/s390/kvm/kvm_virtio.c b/drivers/s390/kvm/kvm_virtio.c
index 8af868b..a8d5ca1 100644
--- a/drivers/s390/kvm/kvm_virtio.c
+++ b/drivers/s390/kvm/kvm_virtio.c
@@ -268,6 +268,11 @@ static const char *kvm_bus_name(struct virtio_device *vdev)
 	return "";
 }
 
+static int kvm_get_vq_irq(struct virtio_device *vdev, struct virtqueue *vq)
+{
+	return 0x2603;
+}
+
 /*
  * The config ops structure as defined by virtio config
  */
@@ -282,6 +287,7 @@ static struct virtio_config_ops kvm_vq_configspace_ops = {
 	.find_vqs = kvm_find_vqs,
 	.del_vqs = kvm_del_vqs,
 	.bus_name = kvm_bus_name,
+	.get_vq_irq = kvm_get_vq_irq,
 };
 
 /*
diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
index 2f57380..309d471 100644
--- a/drivers/virtio/virtio_mmio.c
+++ b/drivers/virtio/virtio_mmio.c
@@ -368,6 +368,13 @@ static const char *vm_bus_name(struct virtio_device *vdev)
 	return vm_dev->pdev->name;
 }
 
+static int vm_get_vq_irq(struct virtio_device *vdev, struct virtqueue *vq)
+{
+	struct virtio_mmio_device *vm_dev = to_virtio_mmio_device(vdev);
+
+	return platform_get_irq(vm_dev->pdev, 0);
+}
+
 static struct virtio_config_ops virtio_mmio_config_ops = {
 	.get		= vm_get,
 	.set		= vm_set,
@@ -379,6 +386,7 @@ static struct virtio_config_ops virtio_mmio_config_ops = {
 	.get_features	= vm_get_features,
 	.finalize_features = vm_finalize_features,
 	.bus_name	= vm_bus_name,
+	.get_vq_irq     = vm_get_vq_irq,
 };
 
 
diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
index 229ea56..4f99164 100644
--- a/drivers/virtio/virtio_pci.c
+++ b/drivers/virtio/virtio_pci.c
@@ -583,6 +583,17 @@ static const char *vp_bus_name(struct virtio_device *vdev)
 	return pci_name(vp_dev->pci_dev);
 }
 
+static int vp_get_vq_irq(struct virtio_device *vdev, struct virtqueue *vq)
+{
+	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
+	struct virtio_pci_vq_info *info = vq->priv;
+
+	if (vp_dev->intx_enabled)
+		return vp_dev->pci_dev->irq;
+	else
+		return vp_dev->msix_entries[info->msix_vector].vector;
+}
+
 static struct virtio_config_ops virtio_pci_config_ops = {
 	.get		= vp_get,
 	.set		= vp_set,
@@ -594,6 +605,7 @@ static struct virtio_config_ops virtio_pci_config_ops = {
 	.get_features	= vp_get_features,
 	.finalize_features = vp_finalize_features,
 	.bus_name	= vp_bus_name,
+	.get_vq_irq     = vp_get_vq_irq,
 };
 
 static void virtio_pci_release_dev(struct device *_d)
diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h
index 63f98d0..7b783a6 100644
--- a/include/linux/virtio_config.h
+++ b/include/linux/virtio_config.h
@@ -104,6 +104,9 @@
  *	vdev: the virtio_device
  *      This returns a pointer to the bus name a la pci_name from which
  *      the caller can then copy.
+ * @get_vq_irq: get the irq numer of the specific virt queue.
+ *      vdev: the virtio_device
+ *      vq: the virtqueue
  */
 typedef void vq_callback_t(struct virtqueue *);
 struct virtio_config_ops {
@@ -122,6 +125,7 @@ struct virtio_config_ops {
 	u32 (*get_features)(struct virtio_device *vdev);
 	void (*finalize_features)(struct virtio_device *vdev);
 	const char *(*bus_name)(struct virtio_device *vdev);
+	int (*get_vq_irq)(struct virtio_device *vdev, struct virtqueue *vq);
 };
 
 /* If driver didn't advertise the feature, it will never appear. */


* [net-next RFC PATCH 5/5] virtio-net: flow director support
  2011-12-05  8:58 [net-next RFC PATCH 0/5] Series short description Jason Wang
                   ` (3 preceding siblings ...)
  2011-12-05  8:59 ` [net-next RFC PATCH 4/5] virtio: introduce a method to get the irq of a specific virtqueue Jason Wang
@ 2011-12-05  8:59 ` Jason Wang
       [not found] ` <20111205085925.6116.94352.stgit@dhcp-8-146.nay.redhat.com>
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 36+ messages in thread
From: Jason Wang @ 2011-12-05  8:59 UTC (permalink / raw)
  To: krkumar2, kvm, mst, netdev, rusty, virtualization, levinsasha928,
	bhutchings

In order to let the packets of a flow be delivered to the desired
guest cpu, we can cooperate with the device by programming the flow
director, which is just a hash-to-queue table.

This cooperation is done through the accelerated RFS support: a
device-specific flow steering method, virtnet_fd(), is used to modify
the flow director based on the RFS mapping. The desired queue is
calculated through a reverse mapping of the irq affinity table. In
order to parallelize the ingress path, the irq affinity of the rx
queues is also provided by the driver.

In addition to accelerated RFS, we can also use the guest scheduler to
balance the TX load and reduce lock contention on the egress path, so
smp_processor_id() is used for tx queue selection.
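
In short (a condensed view of the code below): the probe path fills
dev->rx_cpu_rmap with the irq of every rx virtqueue and sets the irq
affinity hints; the RFS core then uses that reverse map to find the rx
queue whose irq is affined to the cpu consuming a flow and calls back
into the driver, which records the queue in the shared page:

static int virtnet_fd(struct net_device *dev, const struct sk_buff *skb,
                      u16 rxq_index, u32 flow_id)
{
        struct virtnet_info *vi = netdev_priv(dev);
        u16 *table = kmap_atomic(vi->fd_page);

        /* The host side (patches 2 and 3) consults this hash-to-queue
         * page when picking an rx queue for the flow. */
        table[skb->rxhash & TAP_HASH_MASK] = rxq_index;
        kunmap_atomic(table);
        return 0;
}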

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/virtio_net.c   |  165 +++++++++++++++++++++++++++++++++++++++++++-
 include/linux/virtio_net.h |    6 ++
 2 files changed, 169 insertions(+), 2 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 0d871f8..89bb5e7 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -26,6 +26,10 @@
 #include <linux/scatterlist.h>
 #include <linux/if_vlan.h>
 #include <linux/slab.h>
+#include <linux/highmem.h>
+#include <linux/cpu_rmap.h>
+#include <linux/interrupt.h>
+#include <linux/cpumask.h>
 
 static int napi_weight = 128;
 module_param(napi_weight, int, 0444);
@@ -40,6 +44,7 @@ module_param(gso, bool, 0444);
 
 #define VIRTNET_SEND_COMMAND_SG_MAX    2
 #define VIRTNET_DRIVER_VERSION "1.0.0"
+#define TAP_HASH_MASK 0xFF
 
 struct virtnet_send_stats {
 	struct u64_stats_sync syncp;
@@ -89,6 +94,9 @@ struct receive_queue {
 
 	/* Active rx statistics */
 	struct virtnet_recv_stats __percpu *stats;
+
+	/* FIXME: per vector instead of per queue ?? */
+	cpumask_var_t affinity_mask;
 };
 
 struct virtnet_info {
@@ -110,6 +118,11 @@ struct virtnet_info {
 
 	/* Host will pass rxhash to us. */
 	bool has_rxhash;
+
+	/* A page of flow director */
+	struct page *fd_page;
+
+	cpumask_var_t affinity_mask;
 };
 
 struct skb_vnet_hdr {
@@ -386,6 +399,7 @@ static void receive_buf(struct receive_queue *rq, void *buf, unsigned int len)
 	if (vi->has_rxhash)
 		skb->rxhash = hdr->rhdr.rxhash;
 
+	skb_record_rx_queue(skb, rq->vq->queue_index / 2);
 	netif_receive_skb(skb);
 	return;
 
@@ -722,6 +736,19 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 	return NETDEV_TX_OK;
 }
 
+static int virtnet_set_fd(struct net_device *dev, u32 pfn)
+{
+	struct virtnet_info *vi = netdev_priv(dev);
+	struct virtio_device *vdev = vi->vdev;
+
+	if (virtio_has_feature(vdev, VIRTIO_NET_F_HOST_FD)) {
+		vdev->config->set(vdev,
+				  offsetof(struct virtio_net_config_fd, addr),
+				  &pfn, sizeof(u32));
+	}
+	return 0;
+}
+
 static int virtnet_set_mac_address(struct net_device *dev, void *p)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
@@ -1017,6 +1044,39 @@ static int virtnet_change_mtu(struct net_device *dev, int new_mtu)
 	return 0;
 }
 
+#ifdef CONFIG_RFS_ACCEL
+
+int virtnet_fd(struct net_device *net_dev, const struct sk_buff *skb,
+	       u16 rxq_index, u32 flow_id)
+{
+	struct virtnet_info *vi = netdev_priv(net_dev);
+	u16 *table = NULL;
+
+	if (skb->protocol != htons(ETH_P_IP) || !skb->rxhash)
+		return -EPROTONOSUPPORT;
+
+	table = kmap_atomic(vi->fd_page);
+	table[skb->rxhash & TAP_HASH_MASK] = rxq_index;
+	kunmap_atomic(table);
+
+	return 0;
+}
+#endif
+
+static u16 virtnet_select_queue(struct net_device *dev, struct sk_buff *skb)
+{
+	int txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) :
+					       smp_processor_id();
+
+	/* As we make use of the accelerate rfs which let the scheduler to
+	 * balance the load, it make sense to choose the tx queue also based on
+	 * theprocessor id?
+	 */
+	while (unlikely(txq >= dev->real_num_tx_queues))
+		txq -= dev->real_num_tx_queues;
+	return txq;
+}
+
 static const struct net_device_ops virtnet_netdev = {
 	.ndo_open            = virtnet_open,
 	.ndo_stop   	     = virtnet_close,
@@ -1028,9 +1088,13 @@ static const struct net_device_ops virtnet_netdev = {
 	.ndo_get_stats64     = virtnet_stats,
 	.ndo_vlan_rx_add_vid = virtnet_vlan_rx_add_vid,
 	.ndo_vlan_rx_kill_vid = virtnet_vlan_rx_kill_vid,
+	.ndo_select_queue    = virtnet_select_queue,
 #ifdef CONFIG_NET_POLL_CONTROLLER
 	.ndo_poll_controller = virtnet_netpoll,
 #endif
+#ifdef CONFIG_RFS_ACCEL
+	.ndo_rx_flow_steer   = virtnet_fd,
+#endif
 };
 
 static void virtnet_update_status(struct virtnet_info *vi)
@@ -1272,12 +1336,76 @@ static int virtnet_setup_vqs(struct virtnet_info *vi)
 	return ret;
 }
 
+static int virtnet_init_rx_cpu_rmap(struct virtnet_info *vi)
+{
+#ifdef CONFIG_RFS_ACCEL
+	struct virtio_device *vdev = vi->vdev;
+	int i, rc;
+
+	vi->dev->rx_cpu_rmap = alloc_irq_cpu_rmap(vi->num_queue_pairs);
+	if (!vi->dev->rx_cpu_rmap)
+		return -ENOMEM;
+	for (i = 0; i < vi->num_queue_pairs; i++) {
+		rc = irq_cpu_rmap_add(vi->dev->rx_cpu_rmap,
+				vdev->config->get_vq_irq(vdev, vi->rq[i]->vq));
+		if (rc) {
+			free_irq_cpu_rmap(vi->dev->rx_cpu_rmap);
+			vi->dev->rx_cpu_rmap = NULL;
+			return rc;
+		}
+	}
+#endif
+	return 0;
+}
+
+static int virtnet_init_rq_affinity(struct virtnet_info *vi)
+{
+	struct virtio_device *vdev = vi->vdev;
+	int i;
+
+	/* FIXME: TX/RX share a vector */
+	for (i = 0; i < vi->num_queue_pairs; i++) {
+		if (!alloc_cpumask_var(&vi->rq[i]->affinity_mask, GFP_KERNEL))
+			goto err_out;
+		cpumask_set_cpu(i, vi->rq[i]->affinity_mask);
+		irq_set_affinity_hint(vdev->config->get_vq_irq(vdev,
+							       vi->rq[i]->vq),
+				      vi->rq[i]->affinity_mask);
+	}
+
+	return 0;
+err_out:
+	while (i) {
+		i--;
+		irq_set_affinity_hint(vdev->config->get_vq_irq(vdev,
+							       vi->rq[i]->vq),
+				      NULL);
+		free_cpumask_var(vi->rq[i]->affinity_mask);
+	}
+	return -ENOMEM;
+}
+
+static void virtnet_free_rq_affinity(struct virtnet_info *vi)
+{
+	struct virtio_device *vdev = vi->vdev;
+	int i;
+
+	for (i = 0; i < vi->num_queue_pairs; i++) {
+		irq_set_affinity_hint(vdev->config->get_vq_irq(vdev,
+							       vi->rq[i]->vq),
+				      NULL);
+		free_cpumask_var(vi->rq[i]->affinity_mask);
+	}
+}
+
 static int virtnet_probe(struct virtio_device *vdev)
 {
 	int i, err;
 	struct net_device *dev;
 	struct virtnet_info *vi;
 	u16 num_queues, num_queue_pairs;
+	struct page *page = NULL;
+	u16 *table = NULL;
 
 	/* Find if host supports multiqueue virtio_net device */
 	err = virtio_config_val(vdev, VIRTIO_NET_F_MULTIQUEUE,
@@ -1298,7 +1426,7 @@ static int virtnet_probe(struct virtio_device *vdev)
 	/* Set up network device as normal. */
 	dev->priv_flags |= IFF_UNICAST_FLT;
 	dev->netdev_ops = &virtnet_netdev;
-	dev->features = NETIF_F_HIGHDMA;
+	dev->features = NETIF_F_HIGHDMA | NETIF_F_NTUPLE;
 
 	SET_ETHTOOL_OPS(dev, &virtnet_ethtool_ops);
 	SET_NETDEV_DEV(dev, &vdev->dev);
@@ -1342,6 +1470,7 @@ static int virtnet_probe(struct virtio_device *vdev)
 	vdev->priv = vi;
 	vi->num_queue_pairs = num_queue_pairs;
 
+
 	/* If we can receive ANY GSO packets, we must allocate large ones. */
 	if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO4) ||
 	    virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO6) ||
@@ -1382,6 +1511,31 @@ static int virtnet_probe(struct virtio_device *vdev)
 		}
 	}
 
+	/* Config flow director */
+	if (virtio_has_feature(vdev, VIRTIO_NET_F_HOST_FD)) {
+		page = alloc_page(GFP_KERNEL);
+		if (!page)
+			return -ENOMEM;
+		table = (u16 *)kmap_atomic(page);
+		for (i = 0; i < (PAGE_SIZE / 16); i++) {
+			/* invalid all entries */
+			table[i] = num_queue_pairs;
+		}
+
+		vi->fd_page = page;
+		kunmap_atomic(table);
+		virtnet_set_fd(dev, page_to_pfn(page));
+
+		err = virtnet_init_rx_cpu_rmap(vi);
+		if (err)
+			goto free_recv_bufs;
+
+		err = virtnet_init_rq_affinity(vi);
+		if (err)
+			goto free_recv_bufs;
+
+	}
+
 	/* Assume link up if device can't report link status,
 	   otherwise get link status from config. */
 	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_STATUS)) {
@@ -1437,6 +1591,13 @@ static void __devexit virtnet_remove(struct virtio_device *vdev)
 	/* Free memory for send and receive queues */
 	free_rq_sq(vi);
 
+	/* Free the page of flow director */
+	if (virtio_has_feature(vdev, VIRTIO_NET_F_HOST_FD)) {
+		if (vi->fd_page)
+			put_page(vi->fd_page);
+
+		virtnet_free_rq_affinity(vi);
+	}
 	free_netdev(vi->dev);
 }
 
@@ -1453,7 +1614,7 @@ static unsigned int features[] = {
 	VIRTIO_NET_F_GUEST_ECN, VIRTIO_NET_F_GUEST_UFO,
 	VIRTIO_NET_F_MRG_RXBUF, VIRTIO_NET_F_STATUS, VIRTIO_NET_F_CTRL_VQ,
 	VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN, VIRTIO_NET_F_MULTIQUEUE,
-	VIRTIO_NET_F_GUEST_RXHASH,
+	VIRTIO_NET_F_GUEST_RXHASH, VIRTIO_NET_F_HOST_FD,
 };
 
 static struct virtio_driver virtio_net_driver = {
diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
index 2291317..abcea52 100644
--- a/include/linux/virtio_net.h
+++ b/include/linux/virtio_net.h
@@ -51,6 +51,7 @@
 #define VIRTIO_NET_F_CTRL_RX_EXTRA 20	/* Extra RX mode control support */
 #define VIRTIO_NET_F_MULTIQUEUE	21	/* Device supports multiple TXQ/RXQ */
 #define VIRTIO_NET_F_GUEST_RXHASH 22    /* Guest can receive rxhash */
+#define VIRTIO_NET_F_HOST_FD    23      /* Host has a flow director */
 
 #define VIRTIO_NET_S_LINK_UP	1	/* Link is up */
 
@@ -63,6 +64,11 @@ struct virtio_net_config {
 	__u16 num_queues;
 } __attribute__((packed));
 
+struct virtio_net_config_fd {
+	struct virtio_net_config cfg;
+	u32 addr;
+} __packed;
+
 /* This is the first element of the scatter-gather list.  If you don't
  * specify GSO, CSUM or HASH features, you can simply ignore the header. */
 struct virtio_net_hdr {


* Re: [net-next RFC PATCH 2/5] tuntap: simple flow director support
  2011-12-05  8:58 ` [net-next RFC PATCH 2/5] tuntap: simple flow director support Jason Wang
@ 2011-12-05 10:38   ` Stefan Hajnoczi
  2011-12-05 20:09   ` Ben Hutchings
       [not found]   ` <1323115763.2887.12.camel@bwh-desktop>
  2 siblings, 0 replies; 36+ messages in thread
From: Stefan Hajnoczi @ 2011-12-05 10:38 UTC (permalink / raw)
  To: Jason Wang
  Cc: krkumar2, kvm, mst, netdev, virtualization, levinsasha928,
	bhutchings

On Mon, Dec 5, 2011 at 8:58 AM, Jason Wang <jasowang@redhat.com> wrote:
> This patch adds a simple flow director to tun/tap device. It is just a
> page that contains the hash to queue mapping which could be changed by
> user-space. The backend (tap/macvtap) would query this table to get
> the desired queue of a packets when it send packets to userspace.
>
> The page address were set through a new kind of ioctl - TUNSETFD and
> were pinned until device exit or another new page were specified.

Please use "flow" or "fdir" instead of "fd" in the ioctl and code.
"fd" suggests a file descriptor.  The ixgbe driver uses "fdir".

Stefan


* Re: [net-next RFC PATCH 5/5] virtio-net: flow director support
       [not found] ` <20111205085925.6116.94352.stgit@dhcp-8-146.nay.redhat.com>
@ 2011-12-05 10:55   ` Stefan Hajnoczi
  2011-12-06  6:33     ` Jason Wang
  2011-12-05 20:42   ` Ben Hutchings
  1 sibling, 1 reply; 36+ messages in thread
From: Stefan Hajnoczi @ 2011-12-05 10:55 UTC (permalink / raw)
  To: Jason Wang
  Cc: krkumar2, kvm, mst, netdev, virtualization, levinsasha928,
	bhutchings

On Mon, Dec 5, 2011 at 8:59 AM, Jason Wang <jasowang@redhat.com> wrote:
> +static int virtnet_set_fd(struct net_device *dev, u32 pfn)
> +{
> +       struct virtnet_info *vi = netdev_priv(dev);
> +       struct virtio_device *vdev = vi->vdev;
> +
> +       if (virtio_has_feature(vdev, VIRTIO_NET_F_HOST_FD)) {
> +               vdev->config->set(vdev,
> +                                 offsetof(struct virtio_net_config_fd, addr),
> +                                 &pfn, sizeof(u32));

Please use the virtio model (i.e. virtqueues) instead of shared
memory.  Mapping a page breaks the virtio abstraction.

Stefan


* Re: [net-next RFC PATCH 2/5] tuntap: simple flow director support
  2011-12-05  8:58 ` [net-next RFC PATCH 2/5] tuntap: simple flow director support Jason Wang
  2011-12-05 10:38   ` Stefan Hajnoczi
@ 2011-12-05 20:09   ` Ben Hutchings
       [not found]   ` <1323115763.2887.12.camel@bwh-desktop>
  2 siblings, 0 replies; 36+ messages in thread
From: Ben Hutchings @ 2011-12-05 20:09 UTC (permalink / raw)
  To: Jason Wang; +Cc: krkumar2, kvm, mst, netdev, virtualization, levinsasha928

On Mon, 2011-12-05 at 16:58 +0800, Jason Wang wrote:
> This patch adds a simple flow director to tun/tap device. It is just a
> page that contains the hash to queue mapping which could be changed by
> user-space. The backend (tap/macvtap) would query this table to get
> the desired queue of a packets when it send packets to userspace.

This is just flow hashing (RSS), not flow steering.

> The page address were set through a new kind of ioctl - TUNSETFD and
> were pinned until device exit or another new page were specified.
[...]

You should implement ethtool ETHTOOL_{G,S}RXFHINDIR instead.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


* Re: [net-next RFC PATCH 3/5] macvtap: flow director support
  2011-12-05  8:59 ` [net-next RFC PATCH 3/5] macvtap: " Jason Wang
@ 2011-12-05 20:11   ` Ben Hutchings
  0 siblings, 0 replies; 36+ messages in thread
From: Ben Hutchings @ 2011-12-05 20:11 UTC (permalink / raw)
  To: Jason Wang; +Cc: krkumar2, kvm, mst, netdev, virtualization, levinsasha928

Similarly, macvtap could implement the ethtool {get,set}_rxfh_indir
operations.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


* Re: [net-next RFC PATCH 5/5] virtio-net: flow director support
       [not found] ` <20111205085925.6116.94352.stgit@dhcp-8-146.nay.redhat.com>
  2011-12-05 10:55   ` Stefan Hajnoczi
@ 2011-12-05 20:42   ` Ben Hutchings
  2011-12-06  7:25     ` Jason Wang
  1 sibling, 1 reply; 36+ messages in thread
From: Ben Hutchings @ 2011-12-05 20:42 UTC (permalink / raw)
  To: Jason Wang; +Cc: krkumar2, kvm, mst, netdev, virtualization, levinsasha928

On Mon, 2011-12-05 at 16:59 +0800, Jason Wang wrote:
> In order to let the packets of a flow to be passed to the desired
> guest cpu, we can co-operate with devices through programming the flow
> director which was just a hash to queue table.
> 
> This kinds of co-operation is done through the accelerate RFS support,
> a device specific flow sterring method virtnet_fd() is used to modify
> the flow director based on rfs mapping. The desired queue were
> calculated through reverse mapping of the irq affinity table. In order
> to parallelize the ingress path, irq affinity of rx queue were also
> provides by the driver.
> 
> In addition to accelerate RFS, we can also use the guest scheduler to
> balance the load of TX and reduce the lock contention on egress path,
> so the processor_id() were used to tx queue selection.
[...]
> +#ifdef CONFIG_RFS_ACCEL
> +
> +int virtnet_fd(struct net_device *net_dev, const struct sk_buff *skb,
> +	       u16 rxq_index, u32 flow_id)
> +{
> +	struct virtnet_info *vi = netdev_priv(net_dev);
> +	u16 *table = NULL;
> +
> +	if (skb->protocol != htons(ETH_P_IP) || !skb->rxhash)
> +		return -EPROTONOSUPPORT;

Why only IPv4?

> +	table = kmap_atomic(vi->fd_page);
> +	table[skb->rxhash & TAP_HASH_MASK] = rxq_index;
> +	kunmap_atomic(table);
> +
> +	return 0;
> +}
> +#endif

This is not a proper implementation of ndo_rx_flow_steer.  If you steer
a flow by changing the RSS table this can easily cause packet reordering
in other flows.  The filtering should be more precise, ideally matching
exactly a single flow by e.g. VID and IP 5-tuple.

I think you need to add a second hash table which records exactly which
flow is supposed to be steered.  Also, you must call
rps_may_expire_flow() to check whether an entry in this table may be
replaced; otherwise you can cause packet reordering in the flow that was
previously being steered.

Finally, this function must return the table index it assigned, so that
rps_may_expire_flow() works.
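
For example, a minimal sketch of that bookkeeping (virtnet_steer_entry
and vi->steer_tbl are made-up names and locking is omitted; only
rps_may_expire_flow(), TAP_HASH_MASK and vi->fd_page come from the
existing code) might look like:

struct virtnet_steer_entry {
        u32 flow_id;    /* flow id handed in by the RFS core */
        u16 rxq;        /* queue the flow is currently steered to */
        bool used;
};

static int virtnet_rx_flow_steer(struct net_device *dev,
                                 const struct sk_buff *skb,
                                 u16 rxq_index, u32 flow_id)
{
        struct virtnet_info *vi = netdev_priv(dev);
        u16 idx = skb->rxhash & TAP_HASH_MASK;  /* one filter slot per bucket */
        struct virtnet_steer_entry *e = &vi->steer_tbl[idx];
        u16 *table;

        /* Don't displace a flow the stack still considers active,
         * otherwise its packets may be reordered. */
        if (e->used && e->flow_id != flow_id &&
            !rps_may_expire_flow(dev, e->rxq, e->flow_id, idx))
                return -EBUSY;

        e->flow_id = flow_id;
        e->rxq = rxq_index;
        e->used = true;

        table = kmap_atomic(vi->fd_page);
        table[idx] = rxq_index;
        kunmap_atomic(table);

        /* Return the filter id so the core can pass it back to
         * rps_may_expire_flow() later. */
        return idx;
}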

> +static u16 virtnet_select_queue(struct net_device *dev, struct sk_buff *skb)
> +{
> +	int txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) :
> +					       smp_processor_id();
> +
> +	/* As we make use of the accelerate rfs which let the scheduler to
> +	 * balance the load, it make sense to choose the tx queue also based on
> +	 * theprocessor id?
> +	 */
> +	while (unlikely(txq >= dev->real_num_tx_queues))
> +		txq -= dev->real_num_tx_queues;
> +	return txq;
> +}
[...]

Don't do this, let XPS handle it.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


* Re: [net-next RFC PATCH 5/5] virtio-net: flow director support
  2011-12-05 10:55   ` Stefan Hajnoczi
@ 2011-12-06  6:33     ` Jason Wang
  2011-12-06  9:18       ` Stefan Hajnoczi
       [not found]       ` <CAJSP0QX5dDkpX+cRcQut2mb6K91zeqGLRrZBGAWT_r2p685gaQ@mail.gmail.com>
  0 siblings, 2 replies; 36+ messages in thread
From: Jason Wang @ 2011-12-06  6:33 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: krkumar2, kvm, mst, netdev, virtualization, levinsasha928,
	bhutchings

On 12/05/2011 06:55 PM, Stefan Hajnoczi wrote:
> On Mon, Dec 5, 2011 at 8:59 AM, Jason Wang<jasowang@redhat.com>  wrote:
>> +static int virtnet_set_fd(struct net_device *dev, u32 pfn)
>> +{
>> +       struct virtnet_info *vi = netdev_priv(dev);
>> +       struct virtio_device *vdev = vi->vdev;
>> +
>> +       if (virtio_has_feature(vdev, VIRTIO_NET_F_HOST_FD)) {
>> +               vdev->config->set(vdev,
>> +                                 offsetof(struct virtio_net_config_fd, addr),
>> +&pfn, sizeof(u32));
> Please use the virtio model (i.e. virtqueues) instead of shared
> memory.  Mapping a page breaks the virtio abstraction.

Using the control virtqueue would be more suitable, but there are also
some problems:

One problem is the interface: if we use the control virtqueue, we need
an interface between the backend and tap/macvtap to change the flow
mapping. But qemu and vhost_net only know about the file descriptor, so
more information or interfaces need to be exposed in order to let
ethtool or ioctl work.

Another problem is the delay introduced by the ctrl vq: as the ctrl vq
would be used in the critical path in the guest and it busy-waits for
the response, the delay is not negligible.

> Stefan


* Re: [net-next RFC PATCH 2/5] tuntap: simple flow director support
       [not found]   ` <1323115763.2887.12.camel@bwh-desktop>
@ 2011-12-06  7:21     ` Jason Wang
  2011-12-06 17:31       ` Ben Hutchings
  0 siblings, 1 reply; 36+ messages in thread
From: Jason Wang @ 2011-12-06  7:21 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: krkumar2, kvm, mst, netdev, virtualization, levinsasha928

On 12/06/2011 04:09 AM, Ben Hutchings wrote:
> On Mon, 2011-12-05 at 16:58 +0800, Jason Wang wrote:
>> This patch adds a simple flow director to tun/tap device. It is just a
>> page that contains the hash to queue mapping which could be changed by
>> user-space. The backend (tap/macvtap) would query this table to get
>> the desired queue of a packets when it send packets to userspace.
> This is just flow hashing (RSS), not flow steering.
>
>> The page address were set through a new kind of ioctl - TUNSETFD and
>> were pinned until device exit or another new page were specified.
> [...]
>
> You should implement ethtool ETHTOOL_{G,S}RXFHINDIR instead.
>
> Ben.
>

I don't fully understand this. The page belongs to the guest, and the
idea is to let the guest driver easily change any single entry. It
looks like if ethtool_set_rxfh_indir() is used, this kind of change is
not as easy, since it needs a copy and only accepts the whole table as
its parameter.


* Re: [net-next RFC PATCH 5/5] virtio-net: flow director support
  2011-12-05 20:42   ` Ben Hutchings
@ 2011-12-06  7:25     ` Jason Wang
  2011-12-06 17:36       ` Ben Hutchings
  0 siblings, 1 reply; 36+ messages in thread
From: Jason Wang @ 2011-12-06  7:25 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: krkumar2, kvm, mst, netdev, virtualization, levinsasha928

On 12/06/2011 04:42 AM, Ben Hutchings wrote:
> On Mon, 2011-12-05 at 16:59 +0800, Jason Wang wrote:
>> In order to let the packets of a flow to be passed to the desired
>> guest cpu, we can co-operate with devices through programming the flow
>> director which was just a hash to queue table.
>>
>> This kinds of co-operation is done through the accelerate RFS support,
>> a device specific flow sterring method virtnet_fd() is used to modify
>> the flow director based on rfs mapping. The desired queue were
>> calculated through reverse mapping of the irq affinity table. In order
>> to parallelize the ingress path, irq affinity of rx queue were also
>> provides by the driver.
>>
>> In addition to accelerate RFS, we can also use the guest scheduler to
>> balance the load of TX and reduce the lock contention on egress path,
>> so the processor_id() were used to tx queue selection.
> [...]
>> +#ifdef CONFIG_RFS_ACCEL
>> +
>> +int virtnet_fd(struct net_device *net_dev, const struct sk_buff *skb,
>> +	       u16 rxq_index, u32 flow_id)
>> +{
>> +	struct virtnet_info *vi = netdev_priv(net_dev);
>> +	u16 *table = NULL;
>> +
>> +	if (skb->protocol != htons(ETH_P_IP) || !skb->rxhash)
>> +		return -EPROTONOSUPPORT;
> Why only IPv4?

Oops, IPv6 should work also.
>> +	table = kmap_atomic(vi->fd_page);
>> +	table[skb->rxhash&  TAP_HASH_MASK] = rxq_index;
>> +	kunmap_atomic(table);
>> +
>> +	return 0;
>> +}
>> +#endif
> This is not a proper implementation of ndo_rx_flow_steer.  If you steer
> a flow by changing the RSS table this can easily cause packet reordering
> in other flows.  The filtering should be more precise, ideally matching
> exactly a single flow by e.g. VID and IP 5-tuple.
>
> I think you need to add a second hash table which records exactly which
> flow is supposed to be steered.  Also, you must call
> rps_may_expire_flow() to check whether an entry in this table may be
> replaced; otherwise you can cause packet reordering in the flow that was
> previously being steered.
>
> Finally, this function must return the table index it assigned, so that
> rps_may_expire_flow() works.

Thanks for the explanation. How about documenting this briefly in scaling.txt?
>> +static u16 virtnet_select_queue(struct net_device *dev, struct sk_buff *skb)
>> +{
>> +	int txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) :
>> +					       smp_processor_id();
>> +
>> +	/* As we make use of the accelerate rfs which let the scheduler to
>> +	 * balance the load, it make sense to choose the tx queue also based on
>> +	 * theprocessor id?
>> +	 */
>> +	while (unlikely(txq>= dev->real_num_tx_queues))
>> +		txq -= dev->real_num_tx_queues;
>> +	return txq;
>> +}
> [...]
>
> Don't do this, let XPS handle it.
>
> Ben.
>


* Re: [net-next RFC PATCH 5/5] virtio-net: flow director support
  2011-12-06  6:33     ` Jason Wang
@ 2011-12-06  9:18       ` Stefan Hajnoczi
       [not found]       ` <CAJSP0QX5dDkpX+cRcQut2mb6K91zeqGLRrZBGAWT_r2p685gaQ@mail.gmail.com>
  1 sibling, 0 replies; 36+ messages in thread
From: Stefan Hajnoczi @ 2011-12-06  9:18 UTC (permalink / raw)
  To: Jason Wang
  Cc: krkumar2, kvm, mst, netdev, virtualization, levinsasha928,
	bhutchings

On Tue, Dec 6, 2011 at 6:33 AM, Jason Wang <jasowang@redhat.com> wrote:
> On 12/05/2011 06:55 PM, Stefan Hajnoczi wrote:
>>
>> On Mon, Dec 5, 2011 at 8:59 AM, Jason Wang<jasowang@redhat.com>  wrote:
>>>
>>> +static int virtnet_set_fd(struct net_device *dev, u32 pfn)
>>> +{
>>> +       struct virtnet_info *vi = netdev_priv(dev);
>>> +       struct virtio_device *vdev = vi->vdev;
>>> +
>>> +       if (virtio_has_feature(vdev, VIRTIO_NET_F_HOST_FD)) {
>>> +               vdev->config->set(vdev,
>>> +                                 offsetof(struct virtio_net_config_fd,
>>> addr),
>>> +&pfn, sizeof(u32));
>>
>> Please use the virtio model (i.e. virtqueues) instead of shared
>> memory.  Mapping a page breaks the virtio abstraction.
>
>
> Using control virtqueue is more suitable but there's are also some problems:
>
> One problem is the interface,  if we use control virtqueue, we need a
> interface between the backend and tap/macvtap to change the flow mapping.
> But qemu and vhost_net only know about the file descriptor, more
> informations or interfaces need to be exposed in order to let ethtool or
> ioctl work.

QEMU could set up and map a shared page with tap/macvtap.  The
difference would be that the guest<->host interface is still virtio and
QEMU pokes values into the shared page on behalf of the guest.

> Another problem is the delay introduced by the ctrl vq: as the ctrl vq
> would be used in the critical path in the guest and it uses a busy wait to
> get the response, the delay is not negligible.

Then you need to find a better way of doing this.  Can the host
automatically associate the flow with the tx virtqueue its packets are
transmitted on?  Does it make sense to add a virtio_net_hdr field that
updates the queue mapping?
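
Purely as a sketch of that idea (the struct name and field layout here
are made up for illustration, not something from the posted series):

struct virtio_net_hdr_steer {
	struct virtio_net_hdr hdr;
	__u32 rxhash;		/* flow hash, along the lines of patch 1/5 */
	__u16 desired_rxq;	/* hypothetical: rx queue the guest wants
				 * packets of this flow delivered to */
	__u16 pad;
};

The device would update its flow table from desired_rxq as it processes
tx buffers, instead of the guest writing into a shared page.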

The vcpus are just threads and may not be bound to physical CPUs, so
what is the big picture here?  Is the guest even in the position to
set the best queue mappings today?

Stefan

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next RFC PATCH 5/5] virtio-net: flow director support
       [not found]       ` <CAJSP0QX5dDkpX+cRcQut2mb6K91zeqGLRrZBGAWT_r2p685gaQ@mail.gmail.com>
@ 2011-12-06 10:21         ` Jason Wang
  2011-12-06 13:15           ` Stefan Hajnoczi
       [not found]           ` <CAJSP0QXsLwvH5xYj6h0E_V4VLg6DuUc-GKXu9esEYzL2MFcFGw@mail.gmail.com>
  0 siblings, 2 replies; 36+ messages in thread
From: Jason Wang @ 2011-12-06 10:21 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: krkumar2, kvm, mst, netdev, virtualization, levinsasha928,
	bhutchings

On 12/06/2011 05:18 PM, Stefan Hajnoczi wrote:
> On Tue, Dec 6, 2011 at 6:33 AM, Jason Wang<jasowang@redhat.com>  wrote:
>> On 12/05/2011 06:55 PM, Stefan Hajnoczi wrote:
>>> On Mon, Dec 5, 2011 at 8:59 AM, Jason Wang<jasowang@redhat.com>    wrote:
>>>> +static int virtnet_set_fd(struct net_device *dev, u32 pfn)
>>>> +{
>>>> +       struct virtnet_info *vi = netdev_priv(dev);
>>>> +       struct virtio_device *vdev = vi->vdev;
>>>> +
>>>> +       if (virtio_has_feature(vdev, VIRTIO_NET_F_HOST_FD)) {
>>>> +               vdev->config->set(vdev,
>>>> +                                 offsetof(struct virtio_net_config_fd, addr),
>>>> +                                 &pfn, sizeof(u32));
>>> Please use the virtio model (i.e. virtqueues) instead of shared
>>> memory.  Mapping a page breaks the virtio abstraction.
>>
>> Using the control virtqueue is more suitable, but there are also some
>> problems:
>>
>> One problem is the interface: if we use the control virtqueue, we need an
>> interface between the backend and tap/macvtap to change the flow mapping.
>> But qemu and vhost_net only know about the file descriptor, so more
>> information or interfaces need to be exposed in order to let ethtool or
>> ioctl work.
> QEMU could provide map a shared page with tap/macvtap.  The difference
> would be that the guest<->host interface is still virtio and QEMU
> pokes values into the shared page on behalf of the guest.

This makes sense.

>> Another problem is the delay introduced by the ctrl vq: as the ctrl vq
>> would be used in the critical path in the guest and it uses a busy wait to
>> get the response, the delay is not negligible.
> Then you need to find a better way of doing this.  Can the host
> automatically associate the flow from the tx virtqueue packets are
> transmitted on?  Does it make sense to add a virtio_net_hdr field that
> updates the queue mapping?

It can, but it cannot properly handle the packet re-ordering caused by
guest applications moving among guest cpus. One more problem with
virtio_net_hdr is that we would need to build an empty packet when there
is no other packet to send.

One solution is to introduce an unblock command for the ctrl vq.
> The vcpus are just threads and may not be bound to physical CPUs, so
> what is the big picture here?  Is the guest even in the position to
> set the best queue mappings today?

I'm not sure it could publish the best mapping, but the idea is to make
sure the packets of a flow are handled by the same guest vcpu, and maybe
the same vhost thread, in order to eliminate packet reordering and lock
contention. But this assumption does not take into account the bouncing
of vhost or vcpu threads, which would also affect the result.

Anyway, the mapping from the guest is an important reference.
> Stefan
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next RFC PATCH 5/5] virtio-net: flow director support
  2011-12-06 10:21         ` Jason Wang
@ 2011-12-06 13:15           ` Stefan Hajnoczi
       [not found]           ` <CAJSP0QXsLwvH5xYj6h0E_V4VLg6DuUc-GKXu9esEYzL2MFcFGw@mail.gmail.com>
  1 sibling, 0 replies; 36+ messages in thread
From: Stefan Hajnoczi @ 2011-12-06 13:15 UTC (permalink / raw)
  To: Jason Wang
  Cc: krkumar2, kvm, mst, netdev, virtualization, levinsasha928,
	bhutchings

On Tue, Dec 6, 2011 at 10:21 AM, Jason Wang <jasowang@redhat.com> wrote:
> On 12/06/2011 05:18 PM, Stefan Hajnoczi wrote:
>>
>> On Tue, Dec 6, 2011 at 6:33 AM, Jason Wang<jasowang@redhat.com>  wrote:
>>>
>>> On 12/05/2011 06:55 PM, Stefan Hajnoczi wrote:
>>>>
>>>> On Mon, Dec 5, 2011 at 8:59 AM, Jason Wang<jasowang@redhat.com>
>>>>  wrote:
>> The vcpus are just threads and may not be bound to physical CPUs, so
>> what is the big picture here?  Is the guest even in the position to
>> set the best queue mappings today?
>
>
> Not sure it could publish the best mapping but the idea is to make sure the
> packets of a flow were handled by the same guest vcpu and may be the same
> vhost thread in order to eliminate the packet reordering and lock
> contention. But this assumption does not take the bouncing of vhost or vcpu
> threads which would also affect the result.

Okay, this is why I'd like to know what the big picture here is.  What
solution are you proposing?  How are we going to have everything from
guest application, guest kernel, host threads, and host NIC driver
play along so that we get the right steering up the entire stack?  I think
there needs to be an answer to that before changing virtio-net to add
any steering mechanism.

Stefan

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next RFC PATCH 5/5] virtio-net: flow director support
       [not found]           ` <CAJSP0QXsLwvH5xYj6h0E_V4VLg6DuUc-GKXu9esEYzL2MFcFGw@mail.gmail.com>
@ 2011-12-06 15:42             ` Sridhar Samudrala
  2011-12-07  3:03             ` Jason Wang
       [not found]             ` <4EDE37FE.5090409@us.ibm.com>
  2 siblings, 0 replies; 36+ messages in thread
From: Sridhar Samudrala @ 2011-12-06 15:42 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: krkumar2, kvm, mst, netdev, virtualization, levinsasha928,
	bhutchings

On 12/6/2011 5:15 AM, Stefan Hajnoczi wrote:
> On Tue, Dec 6, 2011 at 10:21 AM, Jason Wang<jasowang@redhat.com>  wrote:
>> On 12/06/2011 05:18 PM, Stefan Hajnoczi wrote:
>>> On Tue, Dec 6, 2011 at 6:33 AM, Jason Wang<jasowang@redhat.com>    wrote:
>>>> On 12/05/2011 06:55 PM, Stefan Hajnoczi wrote:
>>>>> On Mon, Dec 5, 2011 at 8:59 AM, Jason Wang<jasowang@redhat.com>
>>>>>   wrote:
>>> The vcpus are just threads and may not be bound to physical CPUs, so
>>> what is the big picture here?  Is the guest even in the position to
>>> set the best queue mappings today?
>>
>> Not sure it could publish the best mapping but the idea is to make sure the
>> packets of a flow were handled by the same guest vcpu and may be the same
>> vhost thread in order to eliminate the packet reordering and lock
>> contention. But this assumption does not take the bouncing of vhost or vcpu
>> threads which would also affect the result.
> Okay, this is why I'd like to know what the big picture here is.  What
> solution are you proposing?  How are we going to have everything from
> guest application, guest kernel, host threads, and host NIC driver
> play along so we get the right steering up the entire stack.  I think
> there needs to be an answer to that before changing virtio-net to add
> any steering mechanism.
>
>
Yes. Also the current model of a vhost thread per VM's interface doesn't
help with packet steering all the way from the guest to the host physical
NIC.

I think we need to have vhost thread(s) per-CPU that can handle packets
to/from the physical NIC's TX/RX queues. Currently we have a single vhost
thread for a VM's i/f that handles all the packets from various flows
coming from a multi-queue physical NIC.

Thanks
Sridhar

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next RFC PATCH 5/5] virtio-net: flow director support
       [not found]             ` <4EDE37FE.5090409@us.ibm.com>
@ 2011-12-06 16:14               ` Michael S. Tsirkin
  2011-12-06 23:10                 ` Sridhar Samudrala
  2011-12-07 11:02               ` Jason Wang
  1 sibling, 1 reply; 36+ messages in thread
From: Michael S. Tsirkin @ 2011-12-06 16:14 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: krkumar2, kvm, virtualization, levinsasha928, netdev, bhutchings

On Tue, Dec 06, 2011 at 07:42:54AM -0800, Sridhar Samudrala wrote:
> On 12/6/2011 5:15 AM, Stefan Hajnoczi wrote:
> >On Tue, Dec 6, 2011 at 10:21 AM, Jason Wang<jasowang@redhat.com>  wrote:
> >>On 12/06/2011 05:18 PM, Stefan Hajnoczi wrote:
> >>>On Tue, Dec 6, 2011 at 6:33 AM, Jason Wang<jasowang@redhat.com>    wrote:
> >>>>On 12/05/2011 06:55 PM, Stefan Hajnoczi wrote:
> >>>>>On Mon, Dec 5, 2011 at 8:59 AM, Jason Wang<jasowang@redhat.com>
> >>>>>  wrote:
> >>>The vcpus are just threads and may not be bound to physical CPUs, so
> >>>what is the big picture here?  Is the guest even in the position to
> >>>set the best queue mappings today?
> >>
> >>Not sure it could publish the best mapping but the idea is to make sure the
> >>packets of a flow were handled by the same guest vcpu and may be the same
> >>vhost thread in order to eliminate the packet reordering and lock
> >>contention. But this assumption does not take the bouncing of vhost or vcpu
> >>threads which would also affect the result.
> >Okay, this is why I'd like to know what the big picture here is.  What
> >solution are you proposing?  How are we going to have everything from
> >guest application, guest kernel, host threads, and host NIC driver
> >play along so we get the right steering up the entire stack.  I think
> >there needs to be an answer to that before changing virtio-net to add
> >any steering mechanism.
> >
> >
> Yes. Also the current model of  a vhost thread per VM's interface
> doesn't help with packet steering
> all the way from the guest to the host physical NIC.
> 
> I think we need to have vhost thread(s) per-CPU that can handle
> packets to/from physical NIC's
> TX/RX queues.
> Currently we have a single vhost thread for a VM's i/f
> that handles all the packets from
> various flows coming from a multi-queue physical NIC.
> 
> Thanks
> Sridhar

It's not hard to try that:
1. revert c23f3445e68e1db0e74099f264bc5ff5d55ebdeb
   this will convert our thread to a workqueue
2. convert the workqueue to a per-cpu one

It didn't work that well in the past, but YMMV

On the surface I'd say a single thread makes some sense
as long as guest uses a single queue.

-- 
MST

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next RFC PATCH 2/5] tuntap: simple flow director support
  2011-12-06  7:21     ` Jason Wang
@ 2011-12-06 17:31       ` Ben Hutchings
  0 siblings, 0 replies; 36+ messages in thread
From: Ben Hutchings @ 2011-12-06 17:31 UTC (permalink / raw)
  To: Jason Wang; +Cc: krkumar2, kvm, mst, netdev, virtualization, levinsasha928

On Tue, 2011-12-06 at 15:21 +0800, Jason Wang wrote:
> On 12/06/2011 04:09 AM, Ben Hutchings wrote:
> > On Mon, 2011-12-05 at 16:58 +0800, Jason Wang wrote:
> >> This patch adds a simple flow director to the tun/tap device. It is just
> >> a page that contains the hash-to-queue mapping, which can be changed by
> >> user-space. The backend (tap/macvtap) queries this table to get the
> >> desired queue of a packet when it sends packets to userspace.
> > This is just flow hashing (RSS), not flow steering.
> >
> >> The page address is set through a new kind of ioctl - TUNSETFD - and the
> >> page is pinned until device exit or until another new page is specified.
> > [...]
> >
> > You should implement ethtool ETHTOOL_{G,S}RXFHINDIR instead.
> >
> > Ben.
> >
> 
> I'm not fully understanding this. The page belongs to the guest, and the
> idea is to let the guest driver easily change any entry. It looks like if
> ethtool_set_rxfh_indir() is used, this kind of change is not easy, as it
> needs a copy and can only accept the whole table as its parameter.

Sorry, yes, I was misreading this.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next RFC PATCH 5/5] virtio-net: flow director support
  2011-12-06  7:25     ` Jason Wang
@ 2011-12-06 17:36       ` Ben Hutchings
  0 siblings, 0 replies; 36+ messages in thread
From: Ben Hutchings @ 2011-12-06 17:36 UTC (permalink / raw)
  To: Jason Wang; +Cc: krkumar2, kvm, mst, netdev, virtualization, levinsasha928

On Tue, 2011-12-06 at 15:25 +0800, Jason Wang wrote:
> On 12/06/2011 04:42 AM, Ben Hutchings wrote:
[...]
> > This is not a proper implementation of ndo_rx_flow_steer.  If you steer
> > a flow by changing the RSS table this can easily cause packet reordering
> > in other flows.  The filtering should be more precise, ideally matching
> > exactly a single flow by e.g. VID and IP 5-tuple.
> >
> > I think you need to add a second hash table which records exactly which
> > flow is supposed to be steered.  Also, you must call
> > rps_may_expire_flow() to check whether an entry in this table may be
> > replaced; otherwise you can cause packet reordering in the flow that was
> > previously being steered.
> >
> > Finally, this function must return the table index it assigned, so that
> > rps_may_expire_flow() works.
> 
> Thanks for the explanation, how about document this briefly in scaling.txt?
[...]

I believe scaling.txt is intended for users/administrators, not
developers.

The documentation for implementers of accelerated RFS is in the comment
for struct net_device_ops and the commit message adding it.  But I
really should improve that comment.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next RFC PATCH 5/5] virtio-net: flow director support
  2011-12-06 16:14               ` Michael S. Tsirkin
@ 2011-12-06 23:10                 ` Sridhar Samudrala
  2011-12-07 11:05                   ` Jason Wang
  0 siblings, 1 reply; 36+ messages in thread
From: Sridhar Samudrala @ 2011-12-06 23:10 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: krkumar2, xma, kvm, virtualization, levinsasha928, netdev,
	bhutchings

On 12/6/2011 8:14 AM, Michael S. Tsirkin wrote:
> On Tue, Dec 06, 2011 at 07:42:54AM -0800, Sridhar Samudrala wrote:
>> On 12/6/2011 5:15 AM, Stefan Hajnoczi wrote:
>>> On Tue, Dec 6, 2011 at 10:21 AM, Jason Wang<jasowang@redhat.com>   wrote:
>>>> On 12/06/2011 05:18 PM, Stefan Hajnoczi wrote:
>>>>> On Tue, Dec 6, 2011 at 6:33 AM, Jason Wang<jasowang@redhat.com>     wrote:
>>>>>> On 12/05/2011 06:55 PM, Stefan Hajnoczi wrote:
>>>>>>> On Mon, Dec 5, 2011 at 8:59 AM, Jason Wang<jasowang@redhat.com>
>>>>>>>   wrote:
>>>>> The vcpus are just threads and may not be bound to physical CPUs, so
>>>>> what is the big picture here?  Is the guest even in the position to
>>>>> set the best queue mappings today?
>>>> Not sure it could publish the best mapping but the idea is to make sure the
>>>> packets of a flow were handled by the same guest vcpu and may be the same
>>>> vhost thread in order to eliminate the packet reordering and lock
>>>> contention. But this assumption does not take the bouncing of vhost or vcpu
>>>> threads which would also affect the result.
>>> Okay, this is why I'd like to know what the big picture here is.  What
>>> solution are you proposing?  How are we going to have everything from
>>> guest application, guest kernel, host threads, and host NIC driver
>>> play along so we get the right steering up the entire stack.  I think
>>> there needs to be an answer to that before changing virtio-net to add
>>> any steering mechanism.
>>>
>>>
>> Yes. Also the current model of  a vhost thread per VM's interface
>> doesn't help with packet steering
>> all the way from the guest to the host physical NIC.
>>
>> I think we need to have vhost thread(s) per-CPU that can handle
>> packets to/from physical NIC's
>> TX/RX queues.
>> Currently we have a single vhost thread for a VM's i/f
>> that handles all the packets from
>> various flows coming from a multi-queue physical NIC.
>>
>> Thanks
>> Sridhar
> It's not hard to try that:
> 1. revert c23f3445e68e1db0e74099f264bc5ff5d55ebdeb
>     this will convert our thread to a workqueue
> 2. convert the workqueue to a per-cpu one
>
> It didn't work that well in the past, but YMMV
Yes. I tried this before we went ahead with the per-interface vhost
threading model. At that time, per-cpu vhost showed a regression with a
single VM, and per-vq vhost showed good performance improvements up to
8 VMs.

So just making it per-cpu would not be enough. I think we may need a way
to schedule vcpu threads on the same cpu-socket as vhost.

Another aspect we need to look into is splitting the vhost thread into
separate threads for TX and RX. Shirley is doing some work in this area
and she is seeing perf improvements as long as the TX and RX threads are
on the same cpu-socket.
>
> On the surface I'd say a single thread makes some sense
> as long as guest uses a single queue.
>
But this may not be scalable long term when we want to support a large 
number of VMs each
having multiple virtio-net interfaces with multiple queues.

Thanks
Sridhar

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next RFC PATCH 5/5] virtio-net: flow director support
       [not found]           ` <CAJSP0QXsLwvH5xYj6h0E_V4VLg6DuUc-GKXu9esEYzL2MFcFGw@mail.gmail.com>
  2011-12-06 15:42             ` Sridhar Samudrala
@ 2011-12-07  3:03             ` Jason Wang
  2011-12-07  9:08               ` Stefan Hajnoczi
       [not found]             ` <4EDE37FE.5090409@us.ibm.com>
  2 siblings, 1 reply; 36+ messages in thread
From: Jason Wang @ 2011-12-07  3:03 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: krkumar2, kvm, mst, netdev, virtualization, levinsasha928,
	bhutchings

On 12/06/2011 09:15 PM, Stefan Hajnoczi wrote:
> On Tue, Dec 6, 2011 at 10:21 AM, Jason Wang<jasowang@redhat.com>  wrote:
>> On 12/06/2011 05:18 PM, Stefan Hajnoczi wrote:
>>> On Tue, Dec 6, 2011 at 6:33 AM, Jason Wang<jasowang@redhat.com>    wrote:
>>>> On 12/05/2011 06:55 PM, Stefan Hajnoczi wrote:
>>>>> On Mon, Dec 5, 2011 at 8:59 AM, Jason Wang<jasowang@redhat.com>
>>>>>   wrote:
>>> The vcpus are just threads and may not be bound to physical CPUs, so
>>> what is the big picture here?  Is the guest even in the position to
>>> set the best queue mappings today?
>>
>> Not sure it could publish the best mapping but the idea is to make sure the
>> packets of a flow were handled by the same guest vcpu and may be the same
>> vhost thread in order to eliminate the packet reordering and lock
>> contention. But this assumption does not take the bouncing of vhost or vcpu
>> threads which would also affect the result.
> Okay, this is why I'd like to know what the big picture here is.  What
> solution are you proposing?  How are we going to have everything from
> guest application, guest kernel, host threads, and host NIC driver
> play along so we get the right steering up the entire stack.  I think
> there needs to be an answer to that before changing virtio-net to add
> any steering mechanism.

Considering the complexity of host nics, each with their own steering
features, this series makes a first step with minimal effort to try to
let the guest driver and host tap/macvtap co-operate like a physical nic
does. There may be other methods, but performance numbers are also needed
to give the answer.
>
> Stefan
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next RFC PATCH 0/5] Series short description
  2011-12-05  8:58 [net-next RFC PATCH 0/5] Series short description Jason Wang
                   ` (5 preceding siblings ...)
       [not found] ` <20111205085925.6116.94352.stgit@dhcp-8-146.nay.redhat.com>
@ 2011-12-07  7:30 ` Rusty Russell
       [not found] ` <87ty5cj0sw.fsf@rustcorp.com.au>
  7 siblings, 0 replies; 36+ messages in thread
From: Rusty Russell @ 2011-12-07  7:30 UTC (permalink / raw)
  To: Jason Wang, krkumar2, kvm, mst, netdev, virtualization,
	levinsasha928, bhutchings

On Mon, 05 Dec 2011 16:58:37 +0800, Jason Wang <jasowang@redhat.com> wrote:
> multiple queue virtio-net: flow steering through host/guest cooperation
> 
> Hello all:
> 
> This is a rough series adds the guest/host cooperation of flow
> steering support based on Krish Kumar's multiple queue virtio-net
> driver patch 3/3 (http://lwn.net/Articles/467283/).

Is there a real (physical) device which does this kind of thing?  How do
they do it?  Can we copy them?

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next RFC PATCH 5/5] virtio-net: flow director support
  2011-12-07  3:03             ` Jason Wang
@ 2011-12-07  9:08               ` Stefan Hajnoczi
  2011-12-07 12:10                 ` Jason Wang
  0 siblings, 1 reply; 36+ messages in thread
From: Stefan Hajnoczi @ 2011-12-07  9:08 UTC (permalink / raw)
  To: Jason Wang
  Cc: krkumar2, kvm, mst, netdev, virtualization, levinsasha928,
	bhutchings

On Wed, Dec 7, 2011 at 3:03 AM, Jason Wang <jasowang@redhat.com> wrote:
> On 12/06/2011 09:15 PM, Stefan Hajnoczi wrote:
>>
>> On Tue, Dec 6, 2011 at 10:21 AM, Jason Wang<jasowang@redhat.com>  wrote:
>>>
>>> On 12/06/2011 05:18 PM, Stefan Hajnoczi wrote:
>>>>
>>>> On Tue, Dec 6, 2011 at 6:33 AM, Jason Wang<jasowang@redhat.com>
>>>>  wrote:
>>>>>
>>>>> On 12/05/2011 06:55 PM, Stefan Hajnoczi wrote:
>>>>>>
>>>>>> On Mon, Dec 5, 2011 at 8:59 AM, Jason Wang<jasowang@redhat.com>
>>>>>>  wrote:
>>>>
>>>> The vcpus are just threads and may not be bound to physical CPUs, so
>>>> what is the big picture here?  Is the guest even in the position to
>>>> set the best queue mappings today?
>>>
>>>
>>> Not sure it could publish the best mapping but the idea is to make sure
>>> the
>>> packets of a flow were handled by the same guest vcpu and may be the same
>>> vhost thread in order to eliminate the packet reordering and lock
>>> contention. But this assumption does not take the bouncing of vhost or
>>> vcpu
>>> threads which would also affect the result.
>>
>> Okay, this is why I'd like to know what the big picture here is.  What
>> solution are you proposing?  How are we going to have everything from
>> guest application, guest kernel, host threads, and host NIC driver
>> play along so we get the right steering up the entire stack.  I think
>> there needs to be an answer to that before changing virtio-net to add
>> any steering mechanism.
>
>
> Consider the complexity of the host nic each with their own steering
> features,  this series make the first step with minimal effort to try to let
> guest driver and host tap/macvtap co-operate like what physical nic does.
> There may be other method, but performance numbers is also needed to give
> the answer.

I agree that performance results for this need to be shown.

My original point is really that it's not a good idea to take
individual steps without a good big picture because this will change
the virtio-net device specification.  If this turns out to be a dead
end then hosts will need to continue to support the interface forever
(legacy guests could still try to use it).  So please first explain
what the full stack picture is going to look like and how you think it
will lead to better performance.  You don't need to have all the code
or evidence, but just enough explanation so we see where this is all
going.

Stefan

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next RFC PATCH 5/5] virtio-net: flow director support
       [not found]             ` <4EDE37FE.5090409@us.ibm.com>
  2011-12-06 16:14               ` Michael S. Tsirkin
@ 2011-12-07 11:02               ` Jason Wang
  2011-12-09  2:00                 ` Sridhar Samudrala
  1 sibling, 1 reply; 36+ messages in thread
From: Jason Wang @ 2011-12-07 11:02 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: krkumar2, kvm, mst, virtualization, levinsasha928, netdev,
	bhutchings

On 12/06/2011 11:42 PM, Sridhar Samudrala wrote:
> On 12/6/2011 5:15 AM, Stefan Hajnoczi wrote:
>> On Tue, Dec 6, 2011 at 10:21 AM, Jason Wang<jasowang@redhat.com>  wrote:
>>> On 12/06/2011 05:18 PM, Stefan Hajnoczi wrote:
>>>> On Tue, Dec 6, 2011 at 6:33 AM, Jason Wang<jasowang@redhat.com>    
>>>> wrote:
>>>>> On 12/05/2011 06:55 PM, Stefan Hajnoczi wrote:
>>>>>> On Mon, Dec 5, 2011 at 8:59 AM, Jason Wang<jasowang@redhat.com>
>>>>>>   wrote:
>>>> The vcpus are just threads and may not be bound to physical CPUs, so
>>>> what is the big picture here?  Is the guest even in the position to
>>>> set the best queue mappings today?
>>>
>>> Not sure it could publish the best mapping but the idea is to make 
>>> sure the
>>> packets of a flow were handled by the same guest vcpu and may be the 
>>> same
>>> vhost thread in order to eliminate the packet reordering and lock
>>> contention. But this assumption does not take the bouncing of vhost 
>>> or vcpu
>>> threads which would also affect the result.
>> Okay, this is why I'd like to know what the big picture here is.  What
>> solution are you proposing?  How are we going to have everything from
>> guest application, guest kernel, host threads, and host NIC driver
>> play along so we get the right steering up the entire stack.  I think
>> there needs to be an answer to that before changing virtio-net to add
>> any steering mechanism.
>>
>>
> Yes. Also the current model of  a vhost thread per VM's interface 
> doesn't help with packet steering
> all the way from the guest to the host physical NIC.
>
> I think we need to have vhost thread(s) per-CPU that can handle 
> packets to/from physical NIC's
> TX/RX queues. Currently we have a single vhost thread for a VM's i/f 
> that handles all the packets from
> various flows coming from a multi-queue physical NIC.

Even if we have per-cpu work threads, only one socket is used to queue
the packets then, so a multi-queue (multi-socket) tap/macvtap is still
needed.
>
> Thanks
> Sridhar
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next RFC PATCH 5/5] virtio-net: flow director support
  2011-12-06 23:10                 ` Sridhar Samudrala
@ 2011-12-07 11:05                   ` Jason Wang
  0 siblings, 0 replies; 36+ messages in thread
From: Jason Wang @ 2011-12-07 11:05 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: krkumar2, xma, kvm, Michael S. Tsirkin, virtualization,
	levinsasha928, netdev, bhutchings

On 12/07/2011 07:10 AM, Sridhar Samudrala wrote:
> On 12/6/2011 8:14 AM, Michael S. Tsirkin wrote:
>> On Tue, Dec 06, 2011 at 07:42:54AM -0800, Sridhar Samudrala wrote:
>>> On 12/6/2011 5:15 AM, Stefan Hajnoczi wrote:
>>>> On Tue, Dec 6, 2011 at 10:21 AM, Jason Wang<jasowang@redhat.com>   
>>>> wrote:
>>>>> On 12/06/2011 05:18 PM, Stefan Hajnoczi wrote:
>>>>>> On Tue, Dec 6, 2011 at 6:33 AM, Jason 
>>>>>> Wang<jasowang@redhat.com>     wrote:
>>>>>>> On 12/05/2011 06:55 PM, Stefan Hajnoczi wrote:
>>>>>>>> On Mon, Dec 5, 2011 at 8:59 AM, Jason Wang<jasowang@redhat.com>
>>>>>>>>   wrote:
>>>>>> The vcpus are just threads and may not be bound to physical CPUs, so
>>>>>> what is the big picture here?  Is the guest even in the position to
>>>>>> set the best queue mappings today?
>>>>> Not sure it could publish the best mapping but the idea is to make 
>>>>> sure the
>>>>> packets of a flow were handled by the same guest vcpu and may be 
>>>>> the same
>>>>> vhost thread in order to eliminate the packet reordering and lock
>>>>> contention. But this assumption does not take the bouncing of 
>>>>> vhost or vcpu
>>>>> threads which would also affect the result.
>>>> Okay, this is why I'd like to know what the big picture here is.  What
>>>> solution are you proposing?  How are we going to have everything from
>>>> guest application, guest kernel, host threads, and host NIC driver
>>>> play along so we get the right steering up the entire stack.  I think
>>>> there needs to be an answer to that before changing virtio-net to add
>>>> any steering mechanism.
>>>>
>>>>
>>> Yes. Also the current model of  a vhost thread per VM's interface
>>> doesn't help with packet steering
>>> all the way from the guest to the host physical NIC.
>>>
>>> I think we need to have vhost thread(s) per-CPU that can handle
>>> packets to/from physical NIC's
>>> TX/RX queues.
>>> Currently we have a single vhost thread for a VM's i/f
>>> that handles all the packets from
>>> various flows coming from a multi-queue physical NIC.
>>>
>>> Thanks
>>> Sridhar
>> It's not hard to try that:
>> 1. revert c23f3445e68e1db0e74099f264bc5ff5d55ebdeb
>>     this will convert our thread to a workqueue
>> 2. convert the workqueue to a per-cpu one
>>
>> It didn't work that well in the past, but YMMV
> Yes. I tried this before we went ahead with per-interface vhost 
> threading model.
> At that time, per-cpu vhost  showed a regression with a single-VM and
> per-vq vhost showed good performance improvements upto 8 VMs.
>
> So  just making it per-cpu would not be enough. I think we may need a way
> to schedule vcpu threads on the same cpu-socket as vhost.
>
> Another aspect we need to look into is the splitting of vhost thread 
> into separate
> threads for TX and RX. Shirley is doing some work in this area and she 
> is seeing
> perf. improvements as long as TX and RX threads are on the same 
> cpu-socket.

I emulated this through my multi-queue series in the past; it looks like
it damages the performance of a single stream, especially guest tx.
>>
>> On the surface I'd say a single thread makes some sense
>> as long as guest uses a single queue.
>>
> But this may not be scalable long term when we want to support a large 
> number of VMs each
> having multiple virtio-net interfaces with multiple queues.
>
> Thanks
> Sridhar
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next RFC PATCH 0/5] Series short description
       [not found] ` <87ty5cj0sw.fsf@rustcorp.com.au>
@ 2011-12-07 11:31   ` Jason Wang
  2011-12-07 17:02     ` Ben Hutchings
  0 siblings, 1 reply; 36+ messages in thread
From: Jason Wang @ 2011-12-07 11:31 UTC (permalink / raw)
  To: Rusty Russell
  Cc: krkumar2, kvm, mst, netdev, virtualization, levinsasha928,
	bhutchings

On 12/07/2011 03:30 PM, Rusty Russell wrote:
> On Mon, 05 Dec 2011 16:58:37 +0800, Jason Wang<jasowang@redhat.com>  wrote:
>> multiple queue virtio-net: flow steering through host/guest cooperation
>>
>> Hello all:
>>
>> This is a rough series adds the guest/host cooperation of flow
>> steering support based on Krish Kumar's multiple queue virtio-net
>> driver patch 3/3 (http://lwn.net/Articles/467283/).
> Is there a real (physical) device which does this kind of thing?  How do
> they do it?  Can we copy them?
>
> Cheers,
> Rusty.
As far as I can see, ixgbe and sfc have similar but much more
sophisticated mechanisms.

The idea was originally suggested by Ben and was just borrowed from those
real physical nics which can dispatch packets based on their hash. All of
these cards can filter flows based on the hash of the L2/L3/L4 header,
and the stack tells the card which queue the flow should go to.

So in the host, a simple hash-to-queue table was introduced in
tap/macvtap, and in the guest, the guest driver tells the backend the
desired queue of a flow by changing this table.
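
Roughly, the host-side lookup would look like the sketch below
(illustrative only; the field names such as tun->fd_page and
tun->numqueues may not match the actual multiqueue tap patch):

static u16 tun_select_queue_by_hash(struct tun_struct *tun,
				    struct sk_buff *skb)
{
	u16 *table, txq = 0;

	/* No table programmed by the guest, or no hash: default queue. */
	if (!tun->fd_page || !skb->rxhash)
		return 0;

	/* The guest fills this page via accelerated RFS; the host only
	 * reads it when picking the queue (socket) for a packet. */
	table = kmap_atomic(tun->fd_page);
	txq = table[skb->rxhash & TAP_HASH_MASK];
	kunmap_atomic(table);

	if (txq >= tun->numqueues)
		txq = 0;

	return txq;
}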

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next RFC PATCH 5/5] virtio-net: flow director support
  2011-12-07  9:08               ` Stefan Hajnoczi
@ 2011-12-07 12:10                 ` Jason Wang
  2011-12-07 15:04                   ` Stefan Hajnoczi
  0 siblings, 1 reply; 36+ messages in thread
From: Jason Wang @ 2011-12-07 12:10 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: krkumar2, kvm, mst, netdev, virtualization, levinsasha928,
	bhutchings

On 12/07/2011 05:08 PM, Stefan Hajnoczi wrote:
[...]
>> >  Consider the complexity of the host nic each with their own steering
>> >  features,  this series make the first step with minimal effort to try to let
>> >  guest driver and host tap/macvtap co-operate like what physical nic does.
>> >  There may be other method, but performance numbers is also needed to give
>> >  the answer.
> I agree that performance results for this need to be shown.
>
> My original point is really that it's not a good idea to take
> individual steps without a good big picture because this will change
> the virtio-net device specification.  If this turns out to be a dead
> end then hosts will need to continue to support the interface forever
> (legacy guests could still try to use it).  So please first explain
> what the full stack picture is going to look like and how you think it
> will lead to better performance.  You don't need to have all the code
> or evidence, but just enough explanation so we see where this is all
> going.
I think I mentioned too little in the cover message.

There are not many changes from Krishna's series except the method of
choosing an rx virtqueue. Since the original series uses different hash
methods in the host (rxhash) and the guest (txhash), a different
virtqueue may be chosen for a flow, which could lead the packets of a
flow to be handled by different vhost threads and vcpus. This may damage
the performance.

This series tries to let one vhost thread process the packets of a flow,
and also lets the packets be sent directly to a vcpu local to the thread
that processes the data. This is done by letting the guest tell the host
the desired queue from which it wants to receive the packets of a
dedicated flow.

So passing the hash from host to guest is needed to get the same hash on
the two sides. Then a guest-programmable hash-to-queue table is
introduced, and the guest co-operates with the host through accelerated
RFS in the guest.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next RFC PATCH 5/5] virtio-net: flow director support
  2011-12-07 12:10                 ` Jason Wang
@ 2011-12-07 15:04                   ` Stefan Hajnoczi
  0 siblings, 0 replies; 36+ messages in thread
From: Stefan Hajnoczi @ 2011-12-07 15:04 UTC (permalink / raw)
  To: Jason Wang
  Cc: krkumar2, kvm, mst, netdev, virtualization, levinsasha928,
	bhutchings

On Wed, Dec 7, 2011 at 12:10 PM, Jason Wang <jasowang@redhat.com> wrote:
> On 12/07/2011 05:08 PM, Stefan Hajnoczi wrote:
> [...]
>
>>> >  Consider the complexity of the host nic each with their own steering
>>> >  features,  this series make the first step with minimal effort to try
>>> > to let
>>> >  guest driver and host tap/macvtap co-operate like what physical nic
>>> > does.
>>> >  There may be other method, but performance numbers is also needed to
>>> > give
>>> >  the answer.
>>
>> I agree that performance results for this need to be shown.
>>
>> My original point is really that it's not a good idea to take
>> individual steps without a good big picture because this will change
>> the virtio-net device specification.  If this turns out to be a dead
>> end then hosts will need to continue to support the interface forever
>> (legacy guests could still try to use it).  So please first explain
>> what the full stack picture is going to look like and how you think it
>> will lead to better performance.  You don't need to have all the code
>> or evidence, but just enough explanation so we see where this is all
>> going.
>
> I think I mention too little in the cover message.
>
> There's no much changes with Krishna's series except the method that
> choosing a rx virtqueue. Since original series use different hash methods in
> host (rxhash) and guest (txhash), a different virtqueue were chose for a
> flow which could lead packets of a flow to be handled by different vhost
> thread and vcpu. This may damage the performance.
>
> This series tries to let one vhost thread to process the packets of a flow
> and also let the packets to be sent directly to a vcpu local to the thread
> process the data. This is done by letting guest tell the desired queue form
> which it want to receive the pakcet of a dedicated flow.
>
> So passing the hash from host to guest is needed to get the same hash in the
> two sides. Then a guest programmable hash to queue table were introduced and
> guest co-operate with the host through accelerate RFS in guest.

Thanks for the context, that helps me understand it a little better.

Stefan

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next RFC PATCH 0/5] Series short description
  2011-12-07 11:31   ` Jason Wang
@ 2011-12-07 17:02     ` Ben Hutchings
  2011-12-08 10:06       ` Jason Wang
  2011-12-09  5:31       ` Rusty Russell
  0 siblings, 2 replies; 36+ messages in thread
From: Ben Hutchings @ 2011-12-07 17:02 UTC (permalink / raw)
  To: Jason Wang, Rusty Russell
  Cc: krkumar2, kvm, mst, netdev, virtualization, levinsasha928

On Wed, 2011-12-07 at 19:31 +0800, Jason Wang wrote:
> On 12/07/2011 03:30 PM, Rusty Russell wrote:
> > On Mon, 05 Dec 2011 16:58:37 +0800, Jason Wang<jasowang@redhat.com>  wrote:
> >> multiple queue virtio-net: flow steering through host/guest cooperation
> >>
> >> Hello all:
> >>
> >> This is a rough series adds the guest/host cooperation of flow
> >> steering support based on Krish Kumar's multiple queue virtio-net
> >> driver patch 3/3 (http://lwn.net/Articles/467283/).
> > Is there a real (physical) device which does this kind of thing?  How do
> > they do it?  Can we copy them?
> >
> > Cheers,
> > Rusty.
> As far as I see, ixgbe and sfc have similar but much more sophisticated 
> mechanism.
> 
> The idea was originally suggested by Ben and it was just borrowed form 
> those real physical nic cards who can dispatch packets based on their 
> hash. All of theses cards can filter the flow based on the hash of 
> L2/L3/L4 header and the stack would tell the card which queue should 
> this flow goes.

Solarflare controllers (sfc driver) have 8192 perfect filters for
TCP/IPv4 and UDP/IPv4 which can be used for flow steering.  (The filters
are organised as a hash table, but matched based on 5-tuples.)  I
implemented the 'accelerated RFS' interface in this driver.

I believe the Intel 82599 controllers (ixgbe driver) have both
hash-based and perfect filter modes and the driver can be configured to
use one or the other.  The driver has its own independent mechanism for
steering RX and TX flows which predates RFS; I don't know whether it
uses hash-based or perfect filters.

Most multi-queue controllers could support a kind of hash-based
filtering for TCP/IP by adjusting the RSS indirection table.  However,
this table is usually quite small (64-256 entries).  This means that
hash collisions will be quite common and this can result in reordering.
The same applies to the small table Jason has proposed for virtio-net.
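
As a rough illustration (a birthday-style estimate assuming a uniform
hash over n concurrent flows and m table slots, so these numbers are
indicative rather than measured):

	P(\text{at least one collision}) \approx 1 - e^{-n(n-1)/(2m)}

For m = 64 and n = 30 flows this is about 1 - e^{-6.8} \approx 0.999;
even for m = 256 it is about 1 - e^{-1.7} \approx 0.82.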

> So in host, a simple hash to queue table were introduced in tap/macvtap 
> and in guest, the guest driver would tell the desired queue of a flow 
> through changing this table.

I don't think accelerated RFS can work well without the use of perfect
filtering or hash-based filtering with a very low rate of collisions.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next RFC PATCH 0/5] Series short description
  2011-12-07 17:02     ` Ben Hutchings
@ 2011-12-08 10:06       ` Jason Wang
  2011-12-09  5:31       ` Rusty Russell
  1 sibling, 0 replies; 36+ messages in thread
From: Jason Wang @ 2011-12-08 10:06 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: krkumar2, kvm, mst, netdev, virtualization, levinsasha928

On 12/08/2011 01:02 AM, Ben Hutchings wrote:
> On Wed, 2011-12-07 at 19:31 +0800, Jason Wang wrote:
>> On 12/07/2011 03:30 PM, Rusty Russell wrote:
>>> On Mon, 05 Dec 2011 16:58:37 +0800, Jason Wang<jasowang@redhat.com>   wrote:
>>>> multiple queue virtio-net: flow steering through host/guest cooperation
>>>>
>>>> Hello all:
>>>>
>>>> This is a rough series adds the guest/host cooperation of flow
>>>> steering support based on Krish Kumar's multiple queue virtio-net
>>>> driver patch 3/3 (http://lwn.net/Articles/467283/).
>>> Is there a real (physical) device which does this kind of thing?  How do
>>> they do it?  Can we copy them?
>>>
>>> Cheers,
>>> Rusty.
>> As far as I see, ixgbe and sfc have similar but much more sophisticated
>> mechanism.
>>
>> The idea was originally suggested by Ben and it was just borrowed form
>> those real physical nic cards who can dispatch packets based on their
>> hash. All of theses cards can filter the flow based on the hash of
>> L2/L3/L4 header and the stack would tell the card which queue should
>> this flow goes.
> Solarflare controllers (sfc driver) have 8192 perfect filters for
> TCP/IPv4 and UDP/IPv4 which can be used for flow steering.  (The filters
> are organised as a hash table, but matched based on 5-tuples.)  I
> implemented the 'accelerated RFS' interface in this driver.
>
> I believe the Intel 82599 controllers (ixgbe driver) have both
> hash-based and perfect filter modes and the driver can be configured to
> use one or the other.  The driver has its own independent mechanism for
> steering RX and TX flows which predates RFS; I don't know whether it
> uses hash-based or perfect filters.

As far as I can see, their driver predates RFS by binding the TX queue
and RX queue to the same CPU and adding a hash-based filter during packet
transmission.

> Most multi-queue controllers could support a kind of hash-based
> filtering for TCP/IP by adjusting the RSS indirection table.  However,
> this table is usually quite small (64-256 entries).  This means that
> hash collisions will be quite common and this can result in reordering.
> The same applies to the small table Jason has proposed for virtio-net.
>

Thanks for the clarification. Considering the hash is provided by the
host nic or the host kernel, the collision rate is not fixed. A perfect
filter is more suitable then.
>> So in host, a simple hash to queue table were introduced in tap/macvtap
>> and in guest, the guest driver would tell the desired queue of a flow
>> through changing this table.
> I don't think accelerated RFS can work well without the use of perfect
> filtering or hash-based filtering with a very low rate of collisions.
>
> Ben.
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next RFC PATCH 5/5] virtio-net: flow director support
  2011-12-07 11:02               ` Jason Wang
@ 2011-12-09  2:00                 ` Sridhar Samudrala
  0 siblings, 0 replies; 36+ messages in thread
From: Sridhar Samudrala @ 2011-12-09  2:00 UTC (permalink / raw)
  To: Jason Wang
  Cc: krkumar2, kvm, mst, virtualization, levinsasha928, netdev,
	bhutchings

On 12/7/2011 3:02 AM, Jason Wang wrote:
> On 12/06/2011 11:42 PM, Sridhar Samudrala wrote:
>> On 12/6/2011 5:15 AM, Stefan Hajnoczi wrote:
>>> On Tue, Dec 6, 2011 at 10:21 AM, Jason Wang<jasowang@redhat.com>  
>>> wrote:
>>>> On 12/06/2011 05:18 PM, Stefan Hajnoczi wrote:
>>>>> On Tue, Dec 6, 2011 at 6:33 AM, Jason Wang<jasowang@redhat.com>    
>>>>> wrote:
>>>>>> On 12/05/2011 06:55 PM, Stefan Hajnoczi wrote:
>>>>>>> On Mon, Dec 5, 2011 at 8:59 AM, Jason Wang<jasowang@redhat.com>
>>>>>>>   wrote:
>>>>> The vcpus are just threads and may not be bound to physical CPUs, so
>>>>> what is the big picture here?  Is the guest even in the position to
>>>>> set the best queue mappings today?
>>>>
>>>> Not sure it could publish the best mapping but the idea is to make 
>>>> sure the
>>>> packets of a flow were handled by the same guest vcpu and may be 
>>>> the same
>>>> vhost thread in order to eliminate the packet reordering and lock
>>>> contention. But this assumption does not take the bouncing of vhost 
>>>> or vcpu
>>>> threads which would also affect the result.
>>> Okay, this is why I'd like to know what the big picture here is.  What
>>> solution are you proposing?  How are we going to have everything from
>>> guest application, guest kernel, host threads, and host NIC driver
>>> play along so we get the right steering up the entire stack.  I think
>>> there needs to be an answer to that before changing virtio-net to add
>>> any steering mechanism.
>>>
>>>
>> Yes. Also the current model of  a vhost thread per VM's interface 
>> doesn't help with packet steering
>> all the way from the guest to the host physical NIC.
>>
>> I think we need to have vhost thread(s) per-CPU that can handle 
>> packets to/from physical NIC's
>> TX/RX queues. Currently we have a single vhost thread for a VM's i/f 
>> that handles all the packets from
>> various flows coming from a multi-queue physical NIC.
>
> Even if we have per-cpu workthread, only one socket is used to queue 
> the packet then, so a multiple queue(sockets) tap/macvtap is still 
> needed.
I think so.  We need per-cpu tap/macvtap sockets along with per-cpu 
vhost threads.
This will parallelize all the way from physical NIC to vhost.

Thanks
Sridhar

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next RFC PATCH 0/5] Series short description
  2011-12-07 17:02     ` Ben Hutchings
  2011-12-08 10:06       ` Jason Wang
@ 2011-12-09  5:31       ` Rusty Russell
  2011-12-15  1:36         ` Ben Hutchings
  1 sibling, 1 reply; 36+ messages in thread
From: Rusty Russell @ 2011-12-09  5:31 UTC (permalink / raw)
  To: Ben Hutchings, Jason Wang
  Cc: krkumar2, kvm, mst, netdev, virtualization, levinsasha928

On Wed, 7 Dec 2011 17:02:04 +0000, Ben Hutchings <bhutchings@solarflare.com> wrote:
> Solarflare controllers (sfc driver) have 8192 perfect filters for
> TCP/IPv4 and UDP/IPv4 which can be used for flow steering.  (The filters
> are organised as a hash table, but matched based on 5-tuples.)  I
> implemented the 'accelerated RFS' interface in this driver.
> 
> I believe the Intel 82599 controllers (ixgbe driver) have both
> hash-based and perfect filter modes and the driver can be configured to
> use one or the other.  The driver has its own independent mechanism for
> steering RX and TX flows which predates RFS; I don't know whether it
> uses hash-based or perfect filters.

Thanks for this summary (and Jason, too).  I've fallen a long way behind
NIC state-of-the-art.
 
> Most multi-queue controllers could support a kind of hash-based
> filtering for TCP/IP by adjusting the RSS indirection table.  However,
> this table is usually quite small (64-256 entries).  This means that
> hash collisions will be quite common and this can result in reordering.
> The same applies to the small table Jason has proposed for virtio-net.

But this happens on real hardware today.  Being better than real
hardware is nice, but is it overkill?

And can't you reorder even with perfect matching, since prior packets
will be on the old queue and more recent ones on the new queue?  Does it
discard or requeue old ones?  Or am I missing a trick?

Thanks,
Rusty.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next RFC PATCH 0/5] Series short description
  2011-12-09  5:31       ` Rusty Russell
@ 2011-12-15  1:36         ` Ben Hutchings
  2011-12-15 23:12           ` Rusty Russell
  0 siblings, 1 reply; 36+ messages in thread
From: Ben Hutchings @ 2011-12-15  1:36 UTC (permalink / raw)
  To: Rusty Russell; +Cc: krkumar2, kvm, mst, netdev, virtualization, levinsasha928

On Fri, 2011-12-09 at 16:01 +1030, Rusty Russell wrote:
> On Wed, 7 Dec 2011 17:02:04 +0000, Ben Hutchings <bhutchings@solarflare.com> wrote:
> > Solarflare controllers (sfc driver) have 8192 perfect filters for
> > TCP/IPv4 and UDP/IPv4 which can be used for flow steering.  (The filters
> > are organised as a hash table, but matched based on 5-tuples.)  I
> > implemented the 'accelerated RFS' interface in this driver.
> > 
> > I believe the Intel 82599 controllers (ixgbe driver) have both
> > hash-based and perfect filter modes and the driver can be configured to
> > use one or the other.  The driver has its own independent mechanism for
> > steering RX and TX flows which predates RFS; I don't know whether it
> > uses hash-based or perfect filters.
> 
> Thanks for this summary (and Jason, too).  I've fallen a long way behind
> NIC state-of-the-art.
>  
> > Most multi-queue controllers could support a kind of hash-based
> > filtering for TCP/IP by adjusting the RSS indirection table.  However,
> > this table is usually quite small (64-256 entries).  This means that
> > hash collisions will be quite common and this can result in reordering.
> > The same applies to the small table Jason has proposed for virtio-net.
> 
> But this happens on real hardware today.  Better that real hardware is
> nice, but is it overkill?

What do you mean, it happens on real hardware today?  So far as I know,
the only cases where we have dynamic adjustment of flow steering are in
ixgbe (big table of hash filters, I think) and sfc (perfect filters).
I don't think that anyone's currently doing flow steering with the RSS
indirection table.  (At least, not on Linux.  I think that Microsoft was
intending to do so on Windows, but I don't know whether they ever did.)

> And can't you reorder even with perfect matching, since prior packets
> will be on the old queue and more recent ones on the new queue?  Does it
> discard or requeue old ones?  Or am I missing a trick?

Yes, that is possible.  RFS is careful to avoid such reordering by only
changing the steering of a flow when none of its packets can be in a
software receive queue.  It is not generally possible to do the same for
hardware receive queues.  However, when the first condition is met it is
likely that there won't be a whole lot of packets for that flow in the
hardware receive queue either.  (But if there are, then I think as a
side-effect of commit 09994d1 RFS will repeatedly ask the driver to
steer the flow.  Which isn't ideal.)

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next RFC PATCH 0/5] Series short description
  2011-12-15  1:36         ` Ben Hutchings
@ 2011-12-15 23:12           ` Rusty Russell
  0 siblings, 0 replies; 36+ messages in thread
From: Rusty Russell @ 2011-12-15 23:12 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: krkumar2, kvm, mst, netdev, virtualization, levinsasha928, davem

On Thu, 15 Dec 2011 01:36:44 +0000, Ben Hutchings <bhutchings@solarflare.com> wrote:
> On Fri, 2011-12-09 at 16:01 +1030, Rusty Russell wrote:
> > On Wed, 7 Dec 2011 17:02:04 +0000, Ben Hutchings <bhutchings@solarflare.com> wrote:
> > > Most multi-queue controllers could support a kind of hash-based
> > > filtering for TCP/IP by adjusting the RSS indirection table.  However,
> > > this table is usually quite small (64-256 entries).  This means that
> > > hash collisions will be quite common and this can result in reordering.
> > > The same applies to the small table Jason has proposed for virtio-net.
> > 
> > But this happens on real hardware today.  Better that real hardware is
> > nice, but is it overkill?
> 
> What do you mean, it happens on real hardware today?  So far as I know,
> the only cases where we have dynamic adjustment of flow steering are in
> ixgbe (big table of hash filters, I think) and sfc (perfect filters).
> I don't think that anyone's currently doing flow steering with the RSS
> indirection table.  (At least, not on Linux.  I think that Microsoft was
> intending to do so on Windows, but I don't know whether they ever did.)

Thanks, I missed the word "could".

> > And can't you reorder even with perfect matching, since prior packets
> > will be on the old queue and more recent ones on the new queue?  Does it
> > discard or requeue old ones?  Or am I missing a trick?
> 
> Yes, that is possible.  RFS is careful to avoid such reordering by only
> changing the steering of a flow when none of its packets can be in a
> software receive queue.  It is not generally possible to do the same for
> hardware receive queues.  However, when the first condition is met it is
> likely that there won't be a whole lot of packets for that flow in the
> hardware receive queue either.  (But if there are, then I think as a
> side-effect of commit 09994d1 RFS will repeatedly ask the driver to
> steer the flow.  Which isn't ideal.)

Should be easy to test, but the question is, how hard should we fight to
maintain ordering?  Dave?

It comes down to this.  We can say in the spec that a virtio nic which
offers VIRTIO_F_NET_RFS:

1) Must do a perfect matching, with perfect ordering.  This means you need
   perfect filters, and handle inter-queue ordering if you change a
   filter (requeue packets?)
2) Must do a perfect matching, but don't worry about ordering across changes.
3) Best effort matching, with perfect ordering.
4) Best effort matching, best effort ordering.

For a perfect filtering setup, the virtio nic needs to either say how
many filter slots it has, or have a way to fail an RFS request.  For
best effort, you can simply ignore RFS requests or accept hash
collisions, without bothering the guest driver at all.
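
As a sketch of the first option (the config layout and semantics here are
invented for illustration, not proposed spec text):

/* Hypothetical config fields advertised with VIRTIO_F_NET_RFS. */
struct virtio_net_config_rfs {
	__u16 rfs_filter_slots;	/* number of perfect-filter slots the
				 * device has; 0 means best-effort
				 * hash-based steering only */
};

plus either a status field or an ack over the control virtqueue so the
device can refuse an individual steering request.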

Thanks,
Rusty.

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread

Thread overview: 36+ messages
2011-12-05  8:58 [net-next RFC PATCH 0/5] Series short description Jason Wang
2011-12-05  8:58 ` [net-next RFC PATCH 1/5] virtio_net: passing rxhash through vnet_hdr Jason Wang
2011-12-05  8:58 ` [net-next RFC PATCH 2/5] tuntap: simple flow director support Jason Wang
2011-12-05 10:38   ` Stefan Hajnoczi
2011-12-05 20:09   ` Ben Hutchings
     [not found]   ` <1323115763.2887.12.camel@bwh-desktop>
2011-12-06  7:21     ` Jason Wang
2011-12-06 17:31       ` Ben Hutchings
2011-12-05  8:59 ` [net-next RFC PATCH 3/5] macvtap: " Jason Wang
2011-12-05 20:11   ` Ben Hutchings
2011-12-05  8:59 ` [net-next RFC PATCH 4/5] virtio: introduce a method to get the irq of a specific virtqueue Jason Wang
2011-12-05  8:59 ` [net-next RFC PATCH 5/5] virtio-net: flow director support Jason Wang
     [not found] ` <20111205085925.6116.94352.stgit@dhcp-8-146.nay.redhat.com>
2011-12-05 10:55   ` Stefan Hajnoczi
2011-12-06  6:33     ` Jason Wang
2011-12-06  9:18       ` Stefan Hajnoczi
     [not found]       ` <CAJSP0QX5dDkpX+cRcQut2mb6K91zeqGLRrZBGAWT_r2p685gaQ@mail.gmail.com>
2011-12-06 10:21         ` Jason Wang
2011-12-06 13:15           ` Stefan Hajnoczi
     [not found]           ` <CAJSP0QXsLwvH5xYj6h0E_V4VLg6DuUc-GKXu9esEYzL2MFcFGw@mail.gmail.com>
2011-12-06 15:42             ` Sridhar Samudrala
2011-12-07  3:03             ` Jason Wang
2011-12-07  9:08               ` Stefan Hajnoczi
2011-12-07 12:10                 ` Jason Wang
2011-12-07 15:04                   ` Stefan Hajnoczi
     [not found]             ` <4EDE37FE.5090409@us.ibm.com>
2011-12-06 16:14               ` Michael S. Tsirkin
2011-12-06 23:10                 ` Sridhar Samudrala
2011-12-07 11:05                   ` Jason Wang
2011-12-07 11:02               ` Jason Wang
2011-12-09  2:00                 ` Sridhar Samudrala
2011-12-05 20:42   ` Ben Hutchings
2011-12-06  7:25     ` Jason Wang
2011-12-06 17:36       ` Ben Hutchings
2011-12-07  7:30 ` [net-next RFC PATCH 0/5] Series short description Rusty Russell
     [not found] ` <87ty5cj0sw.fsf@rustcorp.com.au>
2011-12-07 11:31   ` Jason Wang
2011-12-07 17:02     ` Ben Hutchings
2011-12-08 10:06       ` Jason Wang
2011-12-09  5:31       ` Rusty Russell
2011-12-15  1:36         ` Ben Hutchings
2011-12-15 23:12           ` Rusty Russell
