Netdev List
 help / color / mirror / Atom feed
* [RFC net-next (v2) 10/14] bnx2x: set maximal number of default RSS queues
From: Yuval Mintz @ 2012-06-25 11:45 UTC (permalink / raw)
  To: davem, netdev; +Cc: eilong, Yuval Mintz
In-Reply-To: <1340624745-8650-1-git-send-email-yuvalmin@broadcom.com>

Signed-off-by: Yuval Mintz <yuvalmin@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
index daa894b..53659f3 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
@@ -822,7 +822,8 @@ static inline int bnx2x_calc_num_queues(struct bnx2x *bp)
 {
 	return  num_queues ?
 		 min_t(int, num_queues, BNX2X_MAX_QUEUES(bp)) :
-		 min_t(int, num_online_cpus(), BNX2X_MAX_QUEUES(bp));
+		 min_t(int, netif_get_num_default_rss_queues(),
+		       BNX2X_MAX_QUEUES(bp));
 }
 
 static inline void bnx2x_clear_sge_mask_next_elems(struct bnx2x_fastpath *fp)
-- 
1.7.9.rc2

^ permalink raw reply related

* [RFC net-next (v2) 11/14] igb: set maximal number of default RSS queues
From: Yuval Mintz @ 2012-06-25 11:45 UTC (permalink / raw)
  To: davem, netdev; +Cc: eilong, Yuval Mintz, Jeff Kirsher
In-Reply-To: <1340624745-8650-1-git-send-email-yuvalmin@broadcom.com>

Signed-off-by: Yuval Mintz <yuvalmin@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>

Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/igb/igb_main.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index 01ced68..b11ee60 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -2465,7 +2465,8 @@ static int __devinit igb_sw_init(struct igb_adapter *adapter)
 		break;
 	}
 
-	adapter->rss_queues = min_t(u32, max_rss_queues, num_online_cpus());
+	adapter->rss_queues = min_t(u32, max_rss_queues,
+				    netif_get_num_default_rss_queues());
 
 	/* Determine if we need to pair queues. */
 	switch (hw->mac.type) {
-- 
1.7.9.rc2

^ permalink raw reply related

* [RFC net-next (v2) 08/14] tg3: set maximal number of default RSS queues
From: Yuval Mintz @ 2012-06-25 11:45 UTC (permalink / raw)
  To: davem, netdev; +Cc: eilong, Yuval Mintz, Matt Carlson
In-Reply-To: <1340624745-8650-1-git-send-email-yuvalmin@broadcom.com>

Signed-off-by: Yuval Mintz <yuvalmin@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>

Cc: Matt Carlson <mcarlson@broadcom.com>
---
 drivers/net/ethernet/broadcom/tg3.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
index e47ff8b..6cbab03 100644
--- a/drivers/net/ethernet/broadcom/tg3.c
+++ b/drivers/net/ethernet/broadcom/tg3.c
@@ -9908,7 +9908,7 @@ static bool tg3_enable_msix(struct tg3 *tp)
 	int i, rc;
 	struct msix_entry msix_ent[tp->irq_max];
 
-	tp->irq_cnt = num_online_cpus();
+	tp->irq_cnt = netif_get_num_default_rss_queues();
 	if (tp->irq_cnt > 1) {
 		/* We want as many rx rings enabled as there are cpus.
 		 * In multiqueue MSI-X mode, the first MSI-X vector
-- 
1.7.9.rc2

^ permalink raw reply related

* [RFC net-next (v2) 01/14] net-next: Add netif_get_num_default_rss_queues
From: Yuval Mintz @ 2012-06-25 11:45 UTC (permalink / raw)
  To: davem, netdev; +Cc: eilong, Yuval Mintz
In-Reply-To: <1340624745-8650-1-git-send-email-yuvalmin@broadcom.com>

Most multi-queue networking driver consider the number of online cpus when
configuring RSS queues. 
This patch adds a wrapper to the number of cpus, setting an upper limit on the
number of cpus a driver should consider (by default) when allocating resources
for his queues.

Signed-off-by: Yuval Mintz <yuvalmin@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
---
 include/linux/netdevice.h |    3 +++
 net/core/dev.c            |   11 +++++++++++
 2 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 2c2ecea..ab0251d 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2119,6 +2119,9 @@ static inline int netif_copy_real_num_queues(struct net_device *to_dev,
 #endif
 }
 
+#define DEFAULT_MAX_NUM_RSS_QUEUES	(8)
+extern int netif_get_num_default_rss_queues(void);
+
 /* Use this variant when it is known for sure that it
  * is executing from hardware interrupt context or with hardware interrupts
  * disabled.
diff --git a/net/core/dev.c b/net/core/dev.c
index 57c4f9b..caae49c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1793,6 +1793,17 @@ int netif_set_real_num_rx_queues(struct net_device *dev, unsigned int rxq)
 EXPORT_SYMBOL(netif_set_real_num_rx_queues);
 #endif
 
+/* netif_get_num_default_rss_queues - default number of RSS queues
+ *
+ * This routine should set an upper limit on the number of RSS queues
+ * used by default by multiqueue devices.
+ */
+int netif_get_num_default_rss_queues()
+{
+	return min_t(int, DEFAULT_MAX_NUM_RSS_QUEUES, num_online_cpus());
+}
+EXPORT_SYMBOL(netif_get_num_default_rss_queues);
+
 static inline void __netif_reschedule(struct Qdisc *q)
 {
 	struct softnet_data *sd;
-- 
1.7.9.rc2

^ permalink raw reply related

* [RFC net-next (v2) 00/14] default maximal number of RSS queues in mq drivers
From: Yuval Mintz @ 2012-06-25 11:45 UTC (permalink / raw)
  To: davem, netdev
  Cc: eilong, Yuval Mintz, Divy Le Ray, Or Gerlitz, Jon Mason,
	Anirban Chakraborty, Jitendra Kalsaria, Ron Mercer, Jeff Kirsher,
	Jon Mason, Andrew Gallatin, Sathya Perla, Subbu Seetharaman,
	Ajit Khaparde, Matt Carlson, Michael Chan, Eric Dumazet,
	Ben Hutchings

Different vendors support different number of RSS queues by default. Today,
there exists an ethtool API through which users can change the number of
channels their driver supports; This enables us to pursue the goal of using
a default number of RSS queues in various multi-queue drivers.

This RFC intendeds to achieve the above default, by upper-limiting the number
of interrupts multi-queue drivers request (by default, not via the new API) 
with correlation to the number of cpus on the machine.

After examining multi-queue drivers that call alloc_etherdev_mq[s],
it became evident that most drivers allocate their devices using hard-coded
values. Changing those defaults directly will most likely cause a regression. 

However, (most) multi-queue driver look at the number of online cpus when 
requesting for interrupts. We assume that the number of interrupts the
driver manages to request is propagated across the driver, and the number
of RSS queues it configures is based upon it. 

This RFC modifies said logic - if the number of cpus is large enough, use
a smaller default value instead. This serves 2 main purposes: 
 1. A step forward unity in the number of RSS queues of various drivers.
 2. It prevents wasteful requests for interrupts on machines with many cpus.

Notice no testing was made on this RFC (other than on the bnx2x driver)
except for compilation test.

Drivers identified as multi-queue, handled in this RFC:

* mellanox mlx4
* neterion vxge
* qlogic   qlge
* intel    igb, igbxe igbxevf
* chelsio  cxgb3, cxgb4
* myricom  myri10ge
* emulex   benet
* broadcom tg3, bnx2, bnx2x

Driver identified as multi-queue, no reference to number of online cpus found,
and thus unhandled in this RFC:

* neterion  s2io
* marvell   mv643xx
* freescale gianfar
* ibm       ehea
* ti        cpmac
* sun       niu
* sfc       efx
* chelsio   cxgb4vf

Cheers,
Yuval Mintz

Cc: Divy Le Ray <divy@chelsio.com>
Cc: Or Gerlitz <ogerlitz@mellanox.com>
Cc: Jon Mason <jdmason@kudzu.us>
Cc: Anirban Chakraborty <anirban.chakraborty@qlogic.com>
Cc: Jitendra Kalsaria <jitendra.kalsaria@qlogic.com>
Cc: Ron Mercer <ron.mercer@qlogic.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Jon Mason <mason@myri.com>
Cc: Andrew Gallatin <gallatin@myri.com>
Cc: Sathya Perla <sathya.perla@emulex.com>
Cc: Subbu Seetharaman <subbu.seetharaman@emulex.com>
Cc: Ajit Khaparde <ajit.khaparde@emulex.com>
Cc: Matt Carlson <mcarlson@broadcom.com>
Cc: Michael Chan <mchan@broadcom.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>

^ permalink raw reply

* [RFC net-next (v2) 03/14] vxge: set maximal number of default RSS queues
From: Yuval Mintz @ 2012-06-25 11:45 UTC (permalink / raw)
  To: davem, netdev; +Cc: eilong, Yuval Mintz, Jon Mason
In-Reply-To: <1340624745-8650-1-git-send-email-yuvalmin@broadcom.com>

Signed-off-by: Yuval Mintz <yuvalmin@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>

Cc: Jon Mason <jdmason@kudzu.us>
---
 drivers/net/ethernet/neterion/vxge/vxge-main.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/neterion/vxge/vxge-main.c b/drivers/net/ethernet/neterion/vxge/vxge-main.c
index 2578eb1..2fd1edb 100644
--- a/drivers/net/ethernet/neterion/vxge/vxge-main.c
+++ b/drivers/net/ethernet/neterion/vxge/vxge-main.c
@@ -3687,7 +3687,8 @@ static int __devinit vxge_config_vpaths(
 			return 0;
 
 		if (!driver_config->g_no_cpus)
-			driver_config->g_no_cpus = num_online_cpus();
+			driver_config->g_no_cpus =
+				netif_get_num_default_rss_queues();
 
 		driver_config->vpath_per_dev = driver_config->g_no_cpus >> 1;
 		if (!driver_config->vpath_per_dev)
-- 
1.7.9.rc2

^ permalink raw reply related

* [RFC net-next (v2) 04/14] qlge: set maximal number of default RSS queues
From: Yuval Mintz @ 2012-06-25 11:45 UTC (permalink / raw)
  To: davem, netdev
  Cc: eilong, Yuval Mintz, Anirban Chakraborty, Jitendra Kalsaria,
	Ron Mercer
In-Reply-To: <1340624745-8650-1-git-send-email-yuvalmin@broadcom.com>

Signed-off-by: Yuval Mintz <yuvalmin@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>

Cc: Anirban Chakraborty <anirban.chakraborty@qlogic.com>
Cc: Jitendra Kalsaria <jitendra.kalsaria@qlogic.com>
Cc: Ron Mercer <ron.mercer@qlogic.com>
---
 drivers/net/ethernet/qlogic/qlge/qlge_main.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qlge/qlge_main.c b/drivers/net/ethernet/qlogic/qlge/qlge_main.c
index 09d8d33..3c3499d 100644
--- a/drivers/net/ethernet/qlogic/qlge/qlge_main.c
+++ b/drivers/net/ethernet/qlogic/qlge/qlge_main.c
@@ -4649,7 +4649,7 @@ static int __devinit qlge_probe(struct pci_dev *pdev,
 	int err = 0;
 
 	ndev = alloc_etherdev_mq(sizeof(struct ql_adapter),
-			min(MAX_CPUS, (int)num_online_cpus()));
+			min(MAX_CPUS, netif_get_num_default_rss_queues()));
 	if (!ndev)
 		return -ENOMEM;
 
-- 
1.7.9.rc2

^ permalink raw reply related

* [RFC net-next (v2) 05/14] cxgb3: set maximal number of default RSS queues
From: Yuval Mintz @ 2012-06-25 11:45 UTC (permalink / raw)
  To: davem, netdev; +Cc: eilong, Yuval Mintz, Divy Le Ray
In-Reply-To: <1340624745-8650-1-git-send-email-yuvalmin@broadcom.com>

Signed-off-by: Yuval Mintz <yuvalmin@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>

Cc: Divy Le Ray <divy@chelsio.com>
---
 drivers/net/ethernet/chelsio/cxgb3/cxgb3_main.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb3/cxgb3_main.c b/drivers/net/ethernet/chelsio/cxgb3/cxgb3_main.c
index abb6ce7..9b08749 100644
--- a/drivers/net/ethernet/chelsio/cxgb3/cxgb3_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb3/cxgb3_main.c
@@ -3050,7 +3050,7 @@ static struct pci_error_handlers t3_err_handler = {
 static void set_nqsets(struct adapter *adap)
 {
 	int i, j = 0;
-	int num_cpus = num_online_cpus();
+	int num_cpus = netif_get_num_default_rss_queues();
 	int hwports = adap->params.nports;
 	int nqsets = adap->msix_nvectors - 1;
 
-- 
1.7.9.rc2

^ permalink raw reply related

* [RFC net-next (v2) 06/14] cxgb4: set maximal number of default RSS queues
From: Yuval Mintz @ 2012-06-25 11:45 UTC (permalink / raw)
  To: davem, netdev; +Cc: eilong, Yuval Mintz, Divy Le Ray
In-Reply-To: <1340624745-8650-1-git-send-email-yuvalmin@broadcom.com>

Signed-off-by: Yuval Mintz <yuvalmin@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>

Cc: Divy Le Ray <divy@chelsio.com>
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index e1f96fb..5ed49af 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -3493,8 +3493,8 @@ static void __devinit cfg_queues(struct adapter *adap)
 	 */
 	if (n10g)
 		q10g = (MAX_ETH_QSETS - (adap->params.nports - n10g)) / n10g;
-	if (q10g > num_online_cpus())
-		q10g = num_online_cpus();
+	if (q10g > netif_get_num_default_rss_queues())
+		q10g = netif_get_num_default_rss_queues();
 
 	for_each_port(adap, i) {
 		struct port_info *pi = adap2pinfo(adap, i);
-- 
1.7.9.rc2

^ permalink raw reply related

* [RFC net-next (v2) 02/14] mlx4: set maximal number of default RSS queues
From: Yuval Mintz @ 2012-06-25 11:45 UTC (permalink / raw)
  To: davem, netdev; +Cc: eilong, Yuval Mintz, Or Gerlitz
In-Reply-To: <1340624745-8650-1-git-send-email-yuvalmin@broadcom.com>

Signed-off-by: Yuval Mintz <yuvalmin@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>

Cc: Or Gerlitz <ogerlitz@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/main.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c
index ee6f4fe..8f990a3 100644
--- a/drivers/net/ethernet/mellanox/mlx4/main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/main.c
@@ -41,6 +41,7 @@
 #include <linux/slab.h>
 #include <linux/io-mapping.h>
 #include <linux/delay.h>
+#include <linux/netdevice.h>
 
 #include <linux/mlx4/device.h>
 #include <linux/mlx4/doorbell.h>
@@ -1539,8 +1540,8 @@ static void mlx4_enable_msi_x(struct mlx4_dev *dev)
 	struct mlx4_priv *priv = mlx4_priv(dev);
 	struct msix_entry *entries;
 	int nreq = min_t(int, dev->caps.num_ports *
-			 min_t(int, num_online_cpus() + 1, MAX_MSIX_P_PORT)
-				+ MSIX_LEGACY_SZ, MAX_MSIX);
+			 min_t(int, netif_get_num_default_rss_queues() + 1,
+			       MAX_MSIX_P_PORT) + MSIX_LEGACY_SZ, MAX_MSIX);
 	int err;
 	int i;
 
-- 
1.7.9.rc2

^ permalink raw reply related

* Re: [RFC net-next 05/14] Fix intel/ixgbevf
From: Yuval Mintz @ 2012-06-25 11:23 UTC (permalink / raw)
  To: Greg Rose; +Cc: eilong, Alexander Duyck, netdev, davem, Jeff Kirsher
In-Reply-To: <20120619110704.000045b2@unknown>

>>>>  	 * It's easy to be greedy for MSI-X vectors, but it

>>>> really @@ -2022,8 +2022,9 @@ static int
>>>> ixgbevf_set_interrupt_capability(struct ixgbevf_adapter *adapter)
>>>>  	 * than CPU's.  So let's be conservative and only ask for
>>>>  	 * (roughly) twice the number of vectors as there are
>>>> CPU's. */
>>>> +	ncpu = min_t(int, num_online_cpus(),
>>>> DEFAULT_MAX_NUM_RSS_QUEUES); v_budget =
>>>> min(adapter->num_rx_queues + adapter->num_tx_queues,
>>>> -		       (int)(num_online_cpus() * 2)) +
>>>> NON_Q_VECTORS;
>>>> +		       ncpu * 2) + NON_Q_VECTORS;
>>>>  
>>>>  	/* A failure in MSI-X entry allocation isn't fatal, but
>>>> it does
>>>>  	 * mean we disable MSI-X capabilities of the adapter. */
>>> This change is pointless on the ixgbevf driver.  The VF hardware can
>>> support at most 4 RSS queues.  As such num_rx_queues + num_tx_queues
>>> will never exceed 8 so you are essentially adding a necessary
>>> min(x,8).
>>
>> It is pointless with the current value, but if someone will edit the
>> kernel source code and replace the 8 with a 2, it will become
>> meaningful. The compiler will optimize this part, and I think that for
>> completion, it is best to keep this reference so a future default
>> number change will not be missed.
>>
>> Eilon
> 
> I don't feel there is any real point to making this change to the
> ixgbevf driver.  82599 virtual functions have 3 MSI-X vectors, one of
> which is for the mailbox and the other two can be shared with tx/rx
> queue pairs or assigned separately to tx or rx queues.  So this code is
> pointless no matter what value is set for DEFAULT_MAX_NUM_RSS_QUEUES.
> Perhaps the patches to the other drivers in your RFC will have some
> effect but this one looks like a no-op for the ixgbevf driver so there
> is no reason for it.
> 
> - Greg
> 

Hi Greg,

Since we're changing the RFC to use a new wrapper function which should
replace num_online_cpus (for these purpose), the next RFC version will still
change this driver (for uniformity, if nothing else).

Of course, if you would still have reservations for this change - send them.

Thanks,
Yuval

^ permalink raw reply

* Re: [PATCH 04/13] netfilter: regard users as refcount for l4proto's per-net data
From: Pablo Neira Ayuso @ 2012-06-25 11:20 UTC (permalink / raw)
  To: Gao feng; +Cc: netdev, netfilter-devel
In-Reply-To: <1340289410-17642-4-git-send-email-gaofeng@cn.fujitsu.com>

On Thu, Jun 21, 2012 at 10:36:41PM +0800, Gao feng wrote:
> Now, nf_proto_net's users is confusing.
> we should regard it as the refcount for l4proto's per-net data,
> because maybe there are two l4protos use the same per-net data.
> 
> so increment pn->users when nf_conntrack_l4proto_register
> success, and decrement it for nf_conntrack_l4_unregister case.
> 
> because nf_conntrack_l3proto_ipv[4|6] don't use the same per-net
> data,so we don't need to add a refcnt for their per-net data.
> 
> Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
> ---
>  net/netfilter/nf_conntrack_proto.c |   76 ++++++++++++++++++++++--------------
>  1 files changed, 46 insertions(+), 30 deletions(-)
> 
> diff --git a/net/netfilter/nf_conntrack_proto.c b/net/netfilter/nf_conntrack_proto.c
> index 9d6b6ab..63612e6 100644
> --- a/net/netfilter/nf_conntrack_proto.c
> +++ b/net/netfilter/nf_conntrack_proto.c
[...]
> @@ -458,23 +446,32 @@ int nf_conntrack_l4proto_register(struct net *net,
>  				  struct nf_conntrack_l4proto *l4proto)
>  {
>  	int ret = 0;
> +	struct nf_proto_net *pn = NULL;
>  
>  	if (l4proto->init_net) {
>  		ret = l4proto->init_net(net, l4proto->l3proto);
>  		if (ret < 0)
> -			return ret;
> +			goto out;
>  	}
>  
> -	ret = nf_ct_l4proto_register_sysctl(net, l4proto);
> +	pn = nf_ct_l4proto_net(net, l4proto);
> +	if (pn == NULL)
> +		goto out;

Same thing here, we're leaking memory allocated by l4proto->init_net.

> +	ret = nf_ct_l4proto_register_sysctl(net, pn, l4proto);
>  	if (ret < 0)
> -		return ret;
> +		goto out;
>  
>  	if (net == &init_net) {
>  		ret = nf_conntrack_l4proto_register_net(l4proto);
> -		if (ret < 0)
> -			nf_ct_l4proto_unregister_sysctl(net, l4proto);
> +		if (ret < 0) {
> +			nf_ct_l4proto_unregister_sysctl(net, pn, l4proto);
> +			goto out;

Better replace the two lines above by:

goto out_register_net;

and then...

> +		}
>  	}
>  
> +	pn->users++;

out_register_net:
        nf_ct_l4proto_unregister_sysctl(net, pn, l4proto);

> +out:
>  	return ret;

I think that this change is similar to patch 1/1, I think you should
send it as a separated patch.

>  }
>  EXPORT_SYMBOL_GPL(nf_conntrack_l4proto_register);
> @@ -499,10 +496,18 @@ nf_conntrack_l4proto_unregister_net(struct nf_conntrack_l4proto *l4proto)
>  void nf_conntrack_l4proto_unregister(struct net *net,
>  				     struct nf_conntrack_l4proto *l4proto)
>  {
> +	struct nf_proto_net *pn = NULL;
> +
>  	if (net == &init_net)
>  		nf_conntrack_l4proto_unregister_net(l4proto);
>  
> -	nf_ct_l4proto_unregister_sysctl(net, l4proto);
> +	pn = nf_ct_l4proto_net(net, l4proto);
> +	if (pn == NULL)
> +		return;
> +
> +	pn->users--;
> +	nf_ct_l4proto_unregister_sysctl(net, pn, l4proto);
> +
>  	/* Remove all contrack entries for this protocol */
>  	rtnl_lock();
>  	nf_ct_iterate_cleanup(net, kill_l4proto, l4proto);
> @@ -514,11 +519,15 @@ int nf_conntrack_proto_init(struct net *net)
>  {
>  	unsigned int i;
>  	int err;
> +	struct nf_proto_net *pn = nf_ct_l4proto_net(net,
> +					&nf_conntrack_l4proto_generic);
> +
>  	err = nf_conntrack_l4proto_generic.init_net(net,
>  					nf_conntrack_l4proto_generic.l3proto);
>  	if (err < 0)
>  		return err;
>  	err = nf_ct_l4proto_register_sysctl(net,
> +					    pn,
>  					    &nf_conntrack_l4proto_generic);
>  	if (err < 0)
>  		return err;
> @@ -528,13 +537,20 @@ int nf_conntrack_proto_init(struct net *net)
>  			rcu_assign_pointer(nf_ct_l3protos[i],
>  					   &nf_conntrack_l3proto_generic);
>  	}
> +
> +	pn->users++;
>  	return 0;
>  }
>  
>  void nf_conntrack_proto_fini(struct net *net)
>  {
>  	unsigned int i;
> +	struct nf_proto_net *pn = nf_ct_l4proto_net(net,
> +					&nf_conntrack_l4proto_generic);
> +
> +	pn->users--;
>  	nf_ct_l4proto_unregister_sysctl(net,
> +					pn,
>  					&nf_conntrack_l4proto_generic);
>  	if (net == &init_net) {
>  		/* free l3proto protocol tables */
> -- 
> 1.7.7.6
> 

^ permalink raw reply

* Re: [PATCH 01/13] netfilter: fix problem with proto register
From: Pablo Neira Ayuso @ 2012-06-25 11:12 UTC (permalink / raw)
  To: Gao feng; +Cc: netdev, netfilter-devel
In-Reply-To: <1340289410-17642-1-git-send-email-gaofeng@cn.fujitsu.com>

On Thu, Jun 21, 2012 at 10:36:38PM +0800, Gao feng wrote:
> before commit 2c352f444ccfa966a1aa4fd8e9ee29381c467448
> (netfilter: nf_conntrack: prepare namespace support for
> l4 protocol trackers), we register sysctl before register
> protos, so if sysctl is registered faild, the protos will
> not be registered.
> 
> but now, we register protos first, and when register
> sysctl failed, we can use protos too, it's different
> from before.

No, this has to be an all-or-nothing game. If one fails, everything
else that you've registered has to be unregistered.

> so change to register sysctl before register protos.
> 
> Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
> ---
>  net/netfilter/nf_conntrack_proto.c |   36 +++++++++++++++++++++++-------------
>  1 files changed, 23 insertions(+), 13 deletions(-)
> 
> diff --git a/net/netfilter/nf_conntrack_proto.c b/net/netfilter/nf_conntrack_proto.c
> index 1ea9194..9bd88aa 100644
> --- a/net/netfilter/nf_conntrack_proto.c
> +++ b/net/netfilter/nf_conntrack_proto.c
> @@ -253,18 +253,23 @@ int nf_conntrack_l3proto_register(struct net *net,
>  {
>  	int ret = 0;
>  
> -	if (net == &init_net)
> -		ret = nf_conntrack_l3proto_register_net(proto);
> +	if (proto->init_net) {
> +		ret = proto->init_net(net);
> +		if (ret < 0)
> +			return ret;
> +	}
>  
> +	ret = nf_ct_l3proto_register_sysctl(net, proto);
>  	if (ret < 0)
>  		return ret;

This is still wrong.

If nf_ct_l3proto_register_sysctl fails, we'll leak the memory that has
been reserved by proto->init_net.

> -	if (proto->init_net) {
> -		ret = proto->init_net(net);
> +	if (net == &init_net) {
> +		ret = nf_conntrack_l3proto_register_net(proto);
>  		if (ret < 0)
> -			return ret;
> +			nf_ct_l3proto_unregister_sysctl(net, proto);
>  	}
> -	return nf_ct_l3proto_register_sysctl(net, proto);
> +
> +	return ret;
>  }
>  EXPORT_SYMBOL_GPL(nf_conntrack_l3proto_register);
>  
> @@ -454,19 +459,24 @@ int nf_conntrack_l4proto_register(struct net *net,
>  				  struct nf_conntrack_l4proto *l4proto)
>  {
>  	int ret = 0;
> -	if (net == &init_net)
> -		ret = nf_conntrack_l4proto_register_net(l4proto);
>  
> -	if (ret < 0)
> -		return ret;
> -
> -	if (l4proto->init_net)
> +	if (l4proto->init_net) {
>  		ret = l4proto->init_net(net);
> +		if (ret < 0)
> +			return ret;
> +	}
>  
> +	ret = nf_ct_l4proto_register_sysctl(net, l4proto);
>  	if (ret < 0)
>  		return ret;
>  
> -	return nf_ct_l4proto_register_sysctl(net, l4proto);
> +	if (net == &init_net) {
> +		ret = nf_conntrack_l4proto_register_net(l4proto);
> +		if (ret < 0)
> +			nf_ct_l4proto_unregister_sysctl(net, l4proto);
> +	}
> +
> +	return ret;
>  }
>  EXPORT_SYMBOL_GPL(nf_conntrack_l4proto_register);
>  
> -- 
> 1.7.7.6
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* RE: [PATCH] net-next: ipv6: ndisc: allocate a ndisc socket per inet6_dev
From: Menny_Hamburger @ 2012-06-25 11:08 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <1340620878.10893.26.camel@edumazet-glaptop>

I'm sorry for not responding on your post.
I really want to understand how this fixes our problem.
This fix will make the skb allocations succeed, but what mechanism releases the stuck socket associated with the bad device?

Thanks,
Menny

-----Original Message-----
From: Eric Dumazet [mailto:eric.dumazet@gmail.com] 
Sent: 25 June, 2012 13:41
To: Hamburger, Menny
Cc: netdev@vger.kernel.org
Subject: Re: [PATCH] net-next: ipv6: ndisc: allocate a ndisc socket per inet6_dev

On Mon, 2012-06-25 at 11:26 +0100, Menny_Hamburger@Dell.com wrote:
> From: mennyh <Menny_Hamburger@Dell.com>
> 
>  When an IPV6 network discovery packet does not get sent by the NIC, 
>  either because there is some S/W issue or a H/W problem with the NIC, NDP will stop
>  working and will not be able to send ndisc packets via other NICs on the machine.
>  The reason for this that there is only one global socket assigned per network for network discovery
>  (net->ipv6.ndisc_sk), and  when this socket is busy, NDP cannot be serviced by 
>  other NICS. 
>  
>  This patch adds a kernel configuration option IPV6_NDISC_SOCKET_PER_INTERFACE, 
>  which when enabled the kernel will allocate a network discovery socket per inet6_dev on creation,
>  instead of a single socket per network.
> 
> Signed-off-by: mennyh <Menny_Hamburger@Dell.com>
> ---

You obviously didn't see my patch to address this problem ?

I was waiting your feedback and you post this wrong patch instead ?

This sucks.

Test this instead. Please ?

^ permalink raw reply

* [PATCH] net: l2tp_eth: fix l2tp_eth_dev_xmit race
From: Eric Dumazet @ 2012-06-25 10:45 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, James Chapman

From: Eric Dumazet <edumazet@google.com>

Its illegal to dereference skb after giving it to l2tp_xmit_skb()
as it might be already freed/reused.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: James Chapman <jchapman@katalix.com>
---
diff --git a/net/l2tp/l2tp_eth.c b/net/l2tp/l2tp_eth.c
index 185f12f..c3738f4 100644
--- a/net/l2tp/l2tp_eth.c
+++ b/net/l2tp/l2tp_eth.c
@@ -88,12 +88,12 @@ static int l2tp_eth_dev_xmit(struct sk_buff *skb, struct net_device *dev)
 	struct l2tp_eth *priv = netdev_priv(dev);
 	struct l2tp_session *session = priv->session;
 
-	l2tp_xmit_skb(session, skb, session->hdr_len);
-
 	dev->stats.tx_bytes += skb->len;
 	dev->stats.tx_packets++;
 
-	return 0;
+	l2tp_xmit_skb(session, skb, session->hdr_len);
+
+	return NETDEV_TX_OK;
 }
 
 static struct net_device_ops l2tp_eth_netdev_ops = {

^ permalink raw reply related

* Re: [PATCH] net-next: ipv6: ndisc: allocate a ndisc socket per inet6_dev
From: Eric Dumazet @ 2012-06-25 10:41 UTC (permalink / raw)
  To: Menny_Hamburger; +Cc: netdev
In-Reply-To: <D8C50530D6022F40A817A35C40CC06A70B34E34D44@DUBX7MCDUB01.EMEA.DELL.COM>

On Mon, 2012-06-25 at 11:26 +0100, Menny_Hamburger@Dell.com wrote:
> From: mennyh <Menny_Hamburger@Dell.com>
> 
>  When an IPV6 network discovery packet does not get sent by the NIC, 
>  either because there is some S/W issue or a H/W problem with the NIC, NDP will stop
>  working and will not be able to send ndisc packets via other NICs on the machine.
>  The reason for this that there is only one global socket assigned per network for network discovery
>  (net->ipv6.ndisc_sk), and  when this socket is busy, NDP cannot be serviced by 
>  other NICS. 
>  
>  This patch adds a kernel configuration option IPV6_NDISC_SOCKET_PER_INTERFACE, 
>  which when enabled the kernel will allocate a network discovery socket per inet6_dev on creation,
>  instead of a single socket per network.
> 
> Signed-off-by: mennyh <Menny_Hamburger@Dell.com>
> ---

You obviously didn't see my patch to address this problem ?

I was waiting your feedback and you post this wrong patch instead ?

This sucks.

Test this instead. Please ?


diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c
index 69a6330..f149d85 100644
--- a/net/ipv6/ndisc.c
+++ b/net/ipv6/ndisc.c
@@ -429,7 +429,6 @@ struct sk_buff *ndisc_build_skb(struct net_device *dev,
 	int hlen = LL_RESERVED_SPACE(dev);
 	int tlen = dev->needed_tailroom;
 	int len;
-	int err;
 	u8 *opt;
 
 	if (!dev->addr_len)
@@ -439,15 +438,10 @@ struct sk_buff *ndisc_build_skb(struct net_device *dev,
 	if (llinfo)
 		len += ndisc_opt_addr_space(dev);
 
-	skb = sock_alloc_send_skb(sk,
-				  (MAX_HEADER + sizeof(struct ipv6hdr) +
-				   len + hlen + tlen),
-				  1, &err);
-	if (!skb) {
-		ND_PRINTK(0, err, "ND: %s failed to allocate an skb, err=%d\n",
-			  __func__, err);
+	skb = alloc_skb(MAX_HEADER + sizeof(struct ipv6hdr) + len + hlen + tlen,
+			GFP_ATOMIC);
+	if (!skb)
 		return NULL;
-	}
 
 	skb_reserve(skb, hlen);
 	ip6_nd_hdr(sk, skb, dev, saddr, daddr, IPPROTO_ICMPV6, len);
@@ -1550,16 +1544,10 @@ void ndisc_send_redirect(struct sk_buff *skb, const struct in6_addr *target)
 
 	hlen = LL_RESERVED_SPACE(dev);
 	tlen = dev->needed_tailroom;
-	buff = sock_alloc_send_skb(sk,
-				   (MAX_HEADER + sizeof(struct ipv6hdr) +
-				    len + hlen + tlen),
-				   1, &err);
-	if (buff == NULL) {
-		ND_PRINTK(0, err,
-			  "Redirect: %s failed to allocate an skb, err=%d\n",
-			  __func__, err);
+	buff = alloc_skb(MAX_HEADER + sizeof(struct ipv6hdr) + len + hlen + tlen,
+			 GFP_ATOMIC);
+	if (!buff)
 		goto release;
-	}
 
 	skb_reserve(buff, hlen);
 	ip6_nd_hdr(sk, buff, dev, &saddr_buf, &ipv6_hdr(skb)->saddr,

^ permalink raw reply related

* [PATCH] net-next: ipv6: ndisc: allocate a ndisc socket per inet6_dev
From: Menny_Hamburger @ 2012-06-25 10:26 UTC (permalink / raw)
  To: netdev

From: mennyh <Menny_Hamburger@Dell.com>

 When an IPV6 network discovery packet does not get sent by the NIC, 
 either because there is some S/W issue or a H/W problem with the NIC, NDP will stop
 working and will not be able to send ndisc packets via other NICs on the machine.
 The reason for this that there is only one global socket assigned per network for network discovery
 (net->ipv6.ndisc_sk), and  when this socket is busy, NDP cannot be serviced by 
 other NICS. 
 
 This patch adds a kernel configuration option IPV6_NDISC_SOCKET_PER_INTERFACE, 
 which when enabled the kernel will allocate a network discovery socket per inet6_dev on creation,
 instead of a single socket per network.

Signed-off-by: mennyh <Menny_Hamburger@Dell.com>
---
 include/net/if_inet6.h   |    3 +++
 include/net/ndisc.h      |    3 +++
 include/net/netns/ipv6.h |    2 ++
 net/ipv6/Kconfig         |    8 ++++++++
 net/ipv6/addrconf.c      |   22 +++++++++++++++++++++
 net/ipv6/ndisc.c         |   48 +++++++++++++++++++++++++++++++++++++++++++++-
 6 files changed, 85 insertions(+), 1 deletion(-)

diff --git a/include/net/if_inet6.h b/include/net/if_inet6.h
index 9356322..7134632 100644
--- a/include/net/if_inet6.h
+++ b/include/net/if_inet6.h
@@ -191,6 +191,9 @@ struct inet6_dev {
 	struct inet6_dev	*next;
 	struct ipv6_devconf	cnf;
 	struct ipv6_devstat	stats;
+#ifdef CONFIG_IPV6_NDISC_SOCKET_PER_INTERFACE
+	struct sock             *ndisc_sk;
+#endif
 	unsigned long		tstamp; /* ipv6InterfaceTable update timestamp */
 	struct rcu_head		rcu;
 };
diff --git a/include/net/ndisc.h b/include/net/ndisc.h
index c02b6ad..9039d6c 100644
--- a/include/net/ndisc.h
+++ b/include/net/ndisc.h
@@ -153,6 +153,9 @@ extern void			ndisc_send_skb(struct sk_buff *skb,
 					       const struct in6_addr *saddr,
 					       struct icmp6hdr *icmp6h);
 
+#ifdef CONFIG_IPV6_NDISC_SOCKET_PER_INTERFACE
+extern int ndisc_socket_init(struct sock **ndisc_sk, struct net *net);
+#endif
 
 
 /*
diff --git a/include/net/netns/ipv6.h b/include/net/netns/ipv6.h
index df0a545..5d65d60 100644
--- a/include/net/netns/ipv6.h
+++ b/include/net/netns/ipv6.h
@@ -58,7 +58,9 @@ struct netns_ipv6 {
 	struct fib_rules_ops    *fib6_rules_ops;
 #endif
 	struct sock		**icmp_sk;
+#ifndef CONFIG_IPV6_NDISC_SOCKET_PER_INTERFACE
 	struct sock             *ndisc_sk;
+#endif
 	struct sock             *tcp_sk;
 	struct sock             *igmp_sk;
 #ifdef CONFIG_IPV6_MROUTE
diff --git a/net/ipv6/Kconfig b/net/ipv6/Kconfig
index 5728695..77a09ff 100644
--- a/net/ipv6/Kconfig
+++ b/net/ipv6/Kconfig
@@ -250,4 +250,12 @@ config IPV6_PIMSM_V2
 	  Support for IPv6 PIM multicast routing protocol PIM-SMv2.
 	  If unsure, say N.
 
+config IPV6_NDISC_SOCKET_PER_INTERFACE
+	bool "IPv6: define socket for network discovery per interface (EXPERIMENTAL)"
+	depends on IPV6 && EXPERIMENTAL
+	---help---
+	  Normally only one socket per network is allocated to service the IPV6 network discovery protocol;
+	  This may cause NDP to stop working when ndisc packet is starved by a NIC due to S/W or H/W problems.
+	  If you say Y here, a separate socket will be allocated for each IPV6 enabled interface.
+
+	  If unsure, say N.
+
 endif # IPV6
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 8f6411c..8cbcb66 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -78,6 +78,9 @@
 #include <net/ip.h>
 #include <net/netlink.h>
 #include <net/pkt_sched.h>
+#ifdef CONFIG_IPV6_NDISC_SOCKET_PER_INTERFACE
+#include <net/inet_common.h>
+#endif
 #include <linux/if_tunnel.h>
 #include <linux/rtnetlink.h>
 
@@ -336,6 +339,13 @@ void in6_dev_finish_destroy(struct inet6_dev *idev)
 		pr_warn("Freeing alive inet6 device %p\n", idev);
 		return;
 	}
+
+#ifdef CONFIG_IPV6_NDISC_SOCKET_PER_INTERFACE
+	if (idev->ndisc_sk) {
+		inet_ctl_sock_destroy(idev->ndisc_sk);
+		idev->ndisc_sk = NULL;
+	}
+#endif
 	snmp6_free_dev(idev);
 	kfree_rcu(idev, rcu);
 }
@@ -392,6 +402,18 @@ static struct inet6_dev *ipv6_add_dev(struct net_device *dev)
 		return NULL;
 	}
 
+#ifdef CONFIG_IPV6_NDISC_SOCKET_PER_INTERFACE
+	if (ndisc_socket_init(&ndev->ndisc_sk, dev_net(dev)) < 0) {
+		ADBG((KERN_WARNING
+			"%s(): cannot allocate network discovery socket\n"));
+		ndev->ndisc_sk = NULL;
+		neigh_parms_release(&nd_tbl, ndev->nd_parms);
+		ndev->dead = 1;
+		in6_dev_finish_destroy(ndev);
+		return NULL;
+	}
+#endif
+
 	/* One reference from device.  We must do this before
 	 * we invoke __ipv6_regen_rndid().
 	 */
diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c
index 69a6330..08f991b 100644
--- a/net/ipv6/ndisc.c
+++ b/net/ipv6/ndisc.c
@@ -422,8 +422,12 @@ struct sk_buff *ndisc_build_skb(struct net_device *dev,
 				const struct in6_addr *target,
 				int llinfo)
 {
+#ifndef CONFIG_IPV6_NDISC_SOCKET_PER_INTERFACE
 	struct net *net = dev_net(dev);
 	struct sock *sk = net->ipv6.ndisc_sk;
+#else
+	struct sock *sk;
+#endif
 	struct sk_buff *skb;
 	struct icmp6hdr *hdr;
 	int hlen = LL_RESERVED_SPACE(dev);
@@ -432,6 +436,10 @@ struct sk_buff *ndisc_build_skb(struct net_device *dev,
 	int err;
 	u8 *opt;
 
+#ifdef CONFIG_IPV6_NDISC_SOCKET_PER_INTERFACE
+        sk = in6_dev_get(dev)->ndisc_sk;
+#endif
+
 	if (!dev->addr_len)
 		llinfo = 0;
 
@@ -488,11 +496,19 @@ void ndisc_send_skb(struct sk_buff *skb,
 	struct flowi6 fl6;
 	struct dst_entry *dst;
 	struct net *net = dev_net(dev);
+#ifndef CONFIG_IPV6_NDISC_SOCKET_PER_INTERFACE
 	struct sock *sk = net->ipv6.ndisc_sk;
+#else
+	struct sock *sk;
+#endif
 	struct inet6_dev *idev;
 	int err;
 	u8 type;
 
+#ifdef CONFIG_IPV6_NDISC_SOCKET_PER_INTERFACE
+        sk = in6_dev_get(dev)->ndisc_sk;
+#endif
+
 	type = icmp6h->icmp6_type;
 
 	icmpv6_flow_init(sk, &fl6, type, saddr, daddr, dev->ifindex);
@@ -550,6 +566,11 @@ static void ndisc_send_na(struct net_device *dev, struct neighbour *neigh,
 	struct icmp6hdr icmp6h = {
 		.icmp6_type = NDISC_NEIGHBOUR_ADVERTISEMENT,
 	};
+#ifndef CONFIG_IPV6_NDISC_SOCKET_PER_INTERFACE
+	struct sock *sk = dev_net(dev)->ipv6.ndisc_sk;
+#else
+	struct sock *sk = in6_dev_get(dev)->ndisc_sk;
+#endif
 
 	/* for anycast or proxy, solicited_addr != src_addr */
 	ifp = ipv6_get_ifaddr(dev_net(dev), solicited_addr, dev, 1);
@@ -561,7 +582,7 @@ static void ndisc_send_na(struct net_device *dev, struct neighbour *neigh,
 		in6_ifa_put(ifp);
 	} else {
 		if (ipv6_dev_get_saddr(dev_net(dev), dev, daddr,
-				       inet6_sk(dev_net(dev)->ipv6.ndisc_sk)->srcprefs,
+				       inet6_sk(sk)->srcprefs,
 				       &tmpaddr))
 			return;
 		src_addr = &tmpaddr;
@@ -1470,7 +1491,11 @@ void ndisc_send_redirect(struct sk_buff *skb, const struct in6_addr *target)
 {
 	struct net_device *dev = skb->dev;
 	struct net *net = dev_net(dev);
+#ifndef CONFIG_IPV6_NDISC_SOCKET_PER_INTERFACE
 	struct sock *sk = net->ipv6.ndisc_sk;
+#else
+	struct sock *sk;
+#endif
 	int len = sizeof(struct icmp6hdr) + 2 * sizeof(struct in6_addr);
 	struct inet_peer *peer;
 	struct sk_buff *buff;
@@ -1487,6 +1512,10 @@ void ndisc_send_redirect(struct sk_buff *skb, const struct in6_addr *target)
 	int err;
 	u8 ha_buf[MAX_ADDR_LEN], *ha = NULL;
 
+#ifdef CONFIG_IPV6_NDISC_SOCKET_PER_INTERFACE
+	sk = in6_dev_get(dev)->ndisc_sk;
+#endif
+
 	if (ipv6_get_lladdr(dev, &saddr_buf, IFA_F_TENTATIVE)) {
 		ND_PRINTK(2, warn, "Redirect: no link-local address on %s\n",
 			  dev->name);
@@ -1761,7 +1790,11 @@ int ndisc_ifinfo_sysctl_change(struct ctl_table *ctl, int write, void __user *bu
 
 #endif
 
+#ifdef CONFIG_IPV6_NDISC_SOCKET_PER_INTERFACE
+int ndisc_socket_init(struct sock **ndisc_sk, struct net *net)
+#else
 static int __net_init ndisc_net_init(struct net *net)
+#endif
 {
 	struct ipv6_pinfo *np;
 	struct sock *sk;
@@ -1776,7 +1809,11 @@ static int __net_init ndisc_net_init(struct net *net)
 		return err;
 	}
 
+#ifdef CONFIG_IPV6_NDISC_SOCKET_PER_INTERFACE
+	*ndisc_sk = sk;
+#else
 	net->ipv6.ndisc_sk = sk;
+#endif
 
 	np = inet6_sk(sk);
 	np->hop_limit = 255;
@@ -1786,6 +1823,7 @@ static int __net_init ndisc_net_init(struct net *net)
 	return 0;
 }
 :q!
+#ifndef CONFIG_IPV6_NDISC_SOCKET_PER_INTERFACE
 static void __net_exit ndisc_net_exit(struct net *net)
 {
 	inet_ctl_sock_destroy(net->ipv6.ndisc_sk);
@@ -1795,14 +1833,18 @@ static struct pernet_operations ndisc_net_ops = {
 	.init = ndisc_net_init,
 	.exit = ndisc_net_exit,
 };
+#endif
 
 int __init ndisc_init(void)
 {
 	int err;
 
+#ifndef CONFIG_IPV6_NDISC_SOCKET_PER_INTERFACE
 	err = register_pernet_subsys(&ndisc_net_ops);
 	if (err)
 		return err;
+#endif
+
 	/*
 	 * Initialize the neighbour table
 	 */
@@ -1825,7 +1867,9 @@ out_unregister_sysctl:
 	neigh_sysctl_unregister(&nd_tbl.parms);
 out_unregister_pernet:
 #endif
+#ifndef CONFIG_IPV6_NDISC_SOCKET_PER_INTERFACE
 	unregister_pernet_subsys(&ndisc_net_ops);
+#endif
 	goto out;
 }
 
@@ -1836,5 +1880,7 @@ void ndisc_cleanup(void)
 	neigh_sysctl_unregister(&nd_tbl.parms);
 #endif
 	neigh_table_clear(&nd_tbl);
+#ifndef CONFIG_IPV6_NDISC_SOCKET_PER_INTERFACE
 	unregister_pernet_subsys(&ndisc_net_ops);
+#endif
 }
-- 
1.7.10.1

^ permalink raw reply related

* Re: [net-next RFC V4 PATCH 3/4] virtio: introduce a method to get the irq of a specific virtqueue
From: Michael S. Tsirkin @ 2012-06-25 10:14 UTC (permalink / raw)
  To: Jason Wang
  Cc: krkumar2, habanero, kvm, qemu-devel, netdev, mashirle,
	linux-kernel, virtualization, edumazet, tahm, jwhan, davem
In-Reply-To: <1340617278-8022-1-git-send-email-jasowang@redhat.com>

On Mon, Jun 25, 2012 at 05:41:17PM +0800, Jason Wang wrote:
> Device specific irq optimizations such as irq affinity may be used by virtio
> drivers. So this patch introduce a new method to get the irq of a specific
> virtqueue.
> 
> After this patch, virtio device drivers could query the irq and do device
> specific optimizations. First user would be virtio-net.
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
>  drivers/lguest/lguest_device.c |    8 ++++++++
>  drivers/s390/kvm/kvm_virtio.c  |    6 ++++++
>  drivers/virtio/virtio_mmio.c   |    8 ++++++++
>  drivers/virtio/virtio_pci.c    |   12 ++++++++++++
>  include/linux/virtio_config.h  |    4 ++++
>  5 files changed, 38 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/lguest/lguest_device.c b/drivers/lguest/lguest_device.c
> index 9e8388e..bcd080f 100644
> --- a/drivers/lguest/lguest_device.c
> +++ b/drivers/lguest/lguest_device.c
> @@ -392,6 +392,13 @@ static const char *lg_bus_name(struct virtio_device *vdev)
>  	return "";
>  }
>  
> +static int lg_get_vq_irq(struct virtio_device *vdev, struct virtqueue *vq)
> +{
> +	struct lguest_vq_info *lvq = vq->priv;
> +
> +	return lvq->config.irq;
> +}
> +
>  /* The ops structure which hooks everything together. */
>  static struct virtio_config_ops lguest_config_ops = {
>  	.get_features = lg_get_features,
> @@ -404,6 +411,7 @@ static struct virtio_config_ops lguest_config_ops = {
>  	.find_vqs = lg_find_vqs,
>  	.del_vqs = lg_del_vqs,
>  	.bus_name = lg_bus_name,
> +	.get_vq_irq = lg_get_vq_irq,
>  };
>  
>  /*
> diff --git a/drivers/s390/kvm/kvm_virtio.c b/drivers/s390/kvm/kvm_virtio.c
> index d74e9ae..a897de2 100644
> --- a/drivers/s390/kvm/kvm_virtio.c
> +++ b/drivers/s390/kvm/kvm_virtio.c
> @@ -268,6 +268,11 @@ static const char *kvm_bus_name(struct virtio_device *vdev)
>  	return "";
>  }
>  
> +static int kvm_get_vq_irq(struct virtio_device *vdev, struct virtqueue *vq)
> +{
> +	return 0x2603;
> +}
> +
>  /*
>   * The config ops structure as defined by virtio config
>   */
> @@ -282,6 +287,7 @@ static struct virtio_config_ops kvm_vq_configspace_ops = {
>  	.find_vqs = kvm_find_vqs,
>  	.del_vqs = kvm_del_vqs,
>  	.bus_name = kvm_bus_name,
> +	.get_vq_irq = kvm_get_vq_irq,
>  };
>  
>  /*
> diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
> index f5432b6..2ba37ed 100644
> --- a/drivers/virtio/virtio_mmio.c
> +++ b/drivers/virtio/virtio_mmio.c
> @@ -411,6 +411,13 @@ static const char *vm_bus_name(struct virtio_device *vdev)
>  	return vm_dev->pdev->name;
>  }
>  
> +static int vm_get_vq_irq(struct virtio_device *vdev, struct virtqueue *vq)
> +{
> +	struct virtio_mmio_device *vm_dev = to_virtio_mmio_device(vdev);
> +
> +	return platform_get_irq(vm_dev->pdev, 0);
> +}
> +
>  static struct virtio_config_ops virtio_mmio_config_ops = {
>  	.get		= vm_get,
>  	.set		= vm_set,
> @@ -422,6 +429,7 @@ static struct virtio_config_ops virtio_mmio_config_ops = {
>  	.get_features	= vm_get_features,
>  	.finalize_features = vm_finalize_features,
>  	.bus_name	= vm_bus_name,
> +	.get_vq_irq     = vm_get_vq_irq,
>  };
>  
>  
> diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
> index adb24f2..c062227 100644
> --- a/drivers/virtio/virtio_pci.c
> +++ b/drivers/virtio/virtio_pci.c
> @@ -607,6 +607,17 @@ static const char *vp_bus_name(struct virtio_device *vdev)
>  	return pci_name(vp_dev->pci_dev);
>  }
>  
> +static int vp_get_vq_irq(struct virtio_device *vdev, struct virtqueue *vq)
> +{
> +	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
> +	struct virtio_pci_vq_info *info = vq->priv;
> +
> +	if (vp_dev->intx_enabled)
> +		return vp_dev->pci_dev->irq;
> +	else
> +		return vp_dev->msix_entries[info->msix_vector].vector;
> +}
> +
>  static struct virtio_config_ops virtio_pci_config_ops = {
>  	.get		= vp_get,
>  	.set		= vp_set,
> @@ -618,6 +629,7 @@ static struct virtio_config_ops virtio_pci_config_ops = {
>  	.get_features	= vp_get_features,
>  	.finalize_features = vp_finalize_features,
>  	.bus_name	= vp_bus_name,
> +	.get_vq_irq     = vp_get_vq_irq,
>  };
>  
>  static void virtio_pci_release_dev(struct device *_d)
> diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h
> index fc457f4..acd6930 100644
> --- a/include/linux/virtio_config.h
> +++ b/include/linux/virtio_config.h
> @@ -98,6 +98,9 @@
>   *	vdev: the virtio_device
>   *      This returns a pointer to the bus name a la pci_name from which
>   *      the caller can then copy.
> + * @get_vq_irq: get the irq numer of the specific virt queue.
> + *      vdev: the virtio_device
> + *      vq: the virtqueue

What if the vq does not have an IRQ? E.g. control vqs don't.
What if the IRQ is shared between VQs? Between devices?
The need to cleanup affinity on destroy is also nasty.
How about we expose a set_affinity API instead?
Then:
	- non PCI can ignore for now
	- with a per vq vector we can force it
	- with a shared MSI we make it an OR over all affinities
	- with a level interrupt we can ignore it
	- on cleanup we can do it in core


>   */
>  typedef void vq_callback_t(struct virtqueue *);
>  struct virtio_config_ops {
> @@ -116,6 +119,7 @@ struct virtio_config_ops {
>  	u32 (*get_features)(struct virtio_device *vdev);
>  	void (*finalize_features)(struct virtio_device *vdev);
>  	const char *(*bus_name)(struct virtio_device *vdev);
> +	int (*get_vq_irq)(struct virtio_device *vdev, struct virtqueue *vq);
>  };
>  
>  /* If driver didn't advertise the feature, it will never appear. */
> -- 
> 1.7.1

^ permalink raw reply

* RE: [PATCH 4/4] net/mlx4_en: Use atomic counter to decide when queue is full
From: Eric Dumazet @ 2012-06-25 10:12 UTC (permalink / raw)
  To: David Laight; +Cc: David Miller, yevgenyp, netdev
In-Reply-To: <AE90C24D6B3A694183C094C60CF0A2F6026B6F61@saturn3.aculab.com>

On Mon, 2012-06-25 at 10:00 +0100, David Laight wrote:
> > > The Transmit and transmit completion flows execute from different
> contexts,
> > > which are not synchronized. Hence naive reading the of consumer
> index might
> > > give wrong value by the time it is being used, That could lead to a
> state of transmit timeout.
> > > Fix that by using atomic variable to maintain that index.
> > > 
> > > Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
> > 
> > I'm not convinced.  There is only one place that actually changes
> > the counter.
> > 
> > So it seems more like you have a missing memory barrier somewhere.
> 
> Or just keep the two ring indexes - instead of keeping the
> number of 'active' entries as well.
> Then you don't have a variable which the tx setup and
> tx completion routines both update.

This is what was implied by David.

Using a producer/consumer index and appropriate memory barriers.

start_xmit() and tx completion can be truly lockless and atomicless in
their fast path.

There are many drivers doing that correctly.

tg3 driver is a good example.

^ permalink raw reply

* Re: [net-next RFC V4 PATCH 0/4] Multiqueue virtio-net
From: Michael S. Tsirkin @ 2012-06-25 10:07 UTC (permalink / raw)
  To: Jason Wang
  Cc: krkumar2, habanero, kvm, netdev, mashirle, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem
In-Reply-To: <20120625090829.7263.65026.stgit@amd-6168-8-1.englab.nay.redhat.com>

On Mon, Jun 25, 2012 at 05:16:48PM +0800, Jason Wang wrote:
> Hello All:
> 
> This series is an update version of multiqueue virtio-net driver based on
> Krishna Kumar's work to let virtio-net use multiple rx/tx queues to do the
> packets reception and transmission. Please review and comments.
> 
> Test Environment:
> - Intel(R) Xeon(R) CPU E5620 @ 2.40GHz, 8 cores 2 numa nodes
> - Two directed connected 82599
> 
> Test Summary:
> 
> - Highlights: huge improvements on TCP_RR test
> - Lowlights: regression on small packet transmission, higher cpu utilization
>              than single queue, need further optimization

Didn't review yet, reacting this this paragraph:

To avoid regressions, it seems reasonable to make
the device use a single queue by default for now.
Add a way to switch multiqueue on/off using ethtool.

This way guest admin can tune the device for the
workload manually until we manage to imlement some
self-tuning heuristics.

-- 
MST

^ permalink raw reply

* [net-next RFC V4 PATCH 4/4] virtio_net: multiqueue support
From: Jason Wang @ 2012-06-25  9:41 UTC (permalink / raw)
  To: mst, akong, habanero, tahm, jwhan, mashirle, krkumar2, edumazet,
	davem, rusty, linux-kernel, virtualization, netdev
  Cc: qemu-devel, kvm
In-Reply-To: <20120625090829.7263.65026.stgit@amd-6168-8-1.englab.nay.redhat.com>

This addes multiqueue support to virtio_net driver. This feature is negotiated
through VIRTIO_NET_F_MULTIQUEUE.

The driver expects the number of rx/tx queue paris is equal to the number of
vcpus. To maximize the performance under this per-cpu rx/tx queue pairs, some
optimization were introduced:

- Txq selection is based on the processor id in order to avoid contending a lock
 whose owner may exits to host.
- Since the txq/txq were per-cpu, affinity hint were set to the cpu that owns
 the queue pairs.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/virtio_net.c   |  695 ++++++++++++++++++++++++++++++++------------
 include/linux/virtio_net.h |    2 +
 2 files changed, 506 insertions(+), 191 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index f18149a..5c719de 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -26,6 +26,7 @@
 #include <linux/scatterlist.h>
 #include <linux/if_vlan.h>
 #include <linux/slab.h>
+#include <linux/interrupt.h>
 
 static int napi_weight = 128;
 module_param(napi_weight, int, 0444);
@@ -51,43 +52,73 @@ struct virtnet_stats {
 	u64 rx_packets;
 };
 
-struct virtnet_info {
-	struct virtio_device *vdev;
-	struct virtqueue *rvq, *svq, *cvq;
-	struct net_device *dev;
+/* Internal representation of a send virtqueue */
+struct send_queue {
+	/* Virtqueue associated with this send _queue */
+	struct virtqueue *vq;
+
+	/* TX: fragments + linear part + virtio header */
+	struct scatterlist sg[MAX_SKB_FRAGS + 2];
+
+	cpumask_var_t affinity_mask;
+};
+
+/* Internal representation of a receive virtqueue */
+struct receive_queue {
+	/* Virtqueue associated with this receive_queue */
+	struct virtqueue *vq;
+
+	/* Back pointer to the virtnet_info */
+	struct virtnet_info *vi;
+
 	struct napi_struct napi;
-	unsigned int status;
 
 	/* Number of input buffers, and max we've ever had. */
 	unsigned int num, max;
 
+	/* Work struct for refilling if we run low on memory. */
+	struct delayed_work refill;
+
+	/* Chain pages by the private ptr. */
+	struct page *pages;
+
+	/* RX: fragments + linear part + virtio header */
+	struct scatterlist sg[MAX_SKB_FRAGS + 2];
+
+	cpumask_var_t affinity_mask;
+};
+
+struct virtnet_info {
+	int num_queue_pairs;		/* # of RX/TX vq pairs */
+
+	struct send_queue **sq;
+	struct receive_queue **rq;
+	struct virtqueue *cvq;
+
+	struct virtio_device *vdev;
+	struct net_device *dev;
+	unsigned int status;
+
 	/* I like... big packets and I cannot lie! */
 	bool big_packets;
 
 	/* Host will merge rx buffers for big packets (shake it! shake it!) */
 	bool mergeable_rx_bufs;
 
+	/* Has control virtqueue */
+	bool has_cvq;
+
 	/* enable config space updates */
 	bool config_enable;
 
 	/* Active statistics */
 	struct virtnet_stats __percpu *stats;
 
-	/* Work struct for refilling if we run low on memory. */
-	struct delayed_work refill;
-
 	/* Work struct for config space updates */
 	struct work_struct config_work;
 
 	/* Lock for config space updates */
 	struct mutex config_lock;
-
-	/* Chain pages by the private ptr. */
-	struct page *pages;
-
-	/* fragments + linear part + virtio header */
-	struct scatterlist rx_sg[MAX_SKB_FRAGS + 2];
-	struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
 };
 
 struct skb_vnet_hdr {
@@ -108,6 +139,22 @@ struct padded_vnet_hdr {
 	char padding[6];
 };
 
+static inline int txq_get_qnum(struct virtnet_info *vi, struct virtqueue *vq)
+{
+	int ret = virtqueue_get_queue_index(vq);
+
+	/* skip ctrl vq */
+	if (vi->has_cvq)
+		return (ret - 1) / 2;
+	else
+		return ret / 2;
+}
+
+static inline int rxq_get_qnum(struct virtnet_info *vi, struct virtqueue *vq)
+{
+	return virtqueue_get_queue_index(vq) / 2;
+}
+
 static inline struct skb_vnet_hdr *skb_vnet_hdr(struct sk_buff *skb)
 {
 	return (struct skb_vnet_hdr *)skb->cb;
@@ -117,22 +164,22 @@ static inline struct skb_vnet_hdr *skb_vnet_hdr(struct sk_buff *skb)
  * private is used to chain pages for big packets, put the whole
  * most recent used list in the beginning for reuse
  */
-static void give_pages(struct virtnet_info *vi, struct page *page)
+static void give_pages(struct receive_queue *rq, struct page *page)
 {
 	struct page *end;
 
 	/* Find end of list, sew whole thing into vi->pages. */
 	for (end = page; end->private; end = (struct page *)end->private);
-	end->private = (unsigned long)vi->pages;
-	vi->pages = page;
+	end->private = (unsigned long)rq->pages;
+	rq->pages = page;
 }
 
-static struct page *get_a_page(struct virtnet_info *vi, gfp_t gfp_mask)
+static struct page *get_a_page(struct receive_queue *rq, gfp_t gfp_mask)
 {
-	struct page *p = vi->pages;
+	struct page *p = rq->pages;
 
 	if (p) {
-		vi->pages = (struct page *)p->private;
+		rq->pages = (struct page *)p->private;
 		/* clear private here, it is used to chain pages */
 		p->private = 0;
 	} else
@@ -140,15 +187,15 @@ static struct page *get_a_page(struct virtnet_info *vi, gfp_t gfp_mask)
 	return p;
 }
 
-static void skb_xmit_done(struct virtqueue *svq)
+static void skb_xmit_done(struct virtqueue *vq)
 {
-	struct virtnet_info *vi = svq->vdev->priv;
+	struct virtnet_info *vi = vq->vdev->priv;
 
 	/* Suppress further interrupts. */
-	virtqueue_disable_cb(svq);
+	virtqueue_disable_cb(vq);
 
 	/* We were probably waiting for more output buffers. */
-	netif_wake_queue(vi->dev);
+	netif_wake_subqueue(vi->dev, txq_get_qnum(vi, vq));
 }
 
 static void set_skb_frag(struct sk_buff *skb, struct page *page,
@@ -167,9 +214,10 @@ static void set_skb_frag(struct sk_buff *skb, struct page *page,
 }
 
 /* Called from bottom half context */
-static struct sk_buff *page_to_skb(struct virtnet_info *vi,
+static struct sk_buff *page_to_skb(struct receive_queue *rq,
 				   struct page *page, unsigned int len)
 {
+	struct virtnet_info *vi = rq->vi;
 	struct sk_buff *skb;
 	struct skb_vnet_hdr *hdr;
 	unsigned int copy, hdr_len, offset;
@@ -225,12 +273,12 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
 	}
 
 	if (page)
-		give_pages(vi, page);
+		give_pages(rq, page);
 
 	return skb;
 }
 
-static int receive_mergeable(struct virtnet_info *vi, struct sk_buff *skb)
+static int receive_mergeable(struct receive_queue *rq, struct sk_buff *skb)
 {
 	struct skb_vnet_hdr *hdr = skb_vnet_hdr(skb);
 	struct page *page;
@@ -244,7 +292,7 @@ static int receive_mergeable(struct virtnet_info *vi, struct sk_buff *skb)
 			skb->dev->stats.rx_length_errors++;
 			return -EINVAL;
 		}
-		page = virtqueue_get_buf(vi->rvq, &len);
+		page = virtqueue_get_buf(rq->vq, &len);
 		if (!page) {
 			pr_debug("%s: rx error: %d buffers missing\n",
 				 skb->dev->name, hdr->mhdr.num_buffers);
@@ -257,13 +305,14 @@ static int receive_mergeable(struct virtnet_info *vi, struct sk_buff *skb)
 
 		set_skb_frag(skb, page, 0, &len);
 
-		--vi->num;
+		--rq->num;
 	}
 	return 0;
 }
 
-static void receive_buf(struct net_device *dev, void *buf, unsigned int len)
+static void receive_buf(struct receive_queue *rq, void *buf, unsigned int len)
 {
+	struct net_device *dev = rq->vi->dev;
 	struct virtnet_info *vi = netdev_priv(dev);
 	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
 	struct sk_buff *skb;
@@ -274,7 +323,7 @@ static void receive_buf(struct net_device *dev, void *buf, unsigned int len)
 		pr_debug("%s: short packet %i\n", dev->name, len);
 		dev->stats.rx_length_errors++;
 		if (vi->mergeable_rx_bufs || vi->big_packets)
-			give_pages(vi, buf);
+			give_pages(rq, buf);
 		else
 			dev_kfree_skb(buf);
 		return;
@@ -286,14 +335,14 @@ static void receive_buf(struct net_device *dev, void *buf, unsigned int len)
 		skb_trim(skb, len);
 	} else {
 		page = buf;
-		skb = page_to_skb(vi, page, len);
+		skb = page_to_skb(rq, page, len);
 		if (unlikely(!skb)) {
 			dev->stats.rx_dropped++;
-			give_pages(vi, page);
+			give_pages(rq, page);
 			return;
 		}
 		if (vi->mergeable_rx_bufs)
-			if (receive_mergeable(vi, skb)) {
+			if (receive_mergeable(rq, skb)) {
 				dev_kfree_skb(skb);
 				return;
 			}
@@ -363,90 +412,91 @@ frame_err:
 	dev_kfree_skb(skb);
 }
 
-static int add_recvbuf_small(struct virtnet_info *vi, gfp_t gfp)
+static int add_recvbuf_small(struct receive_queue *rq, gfp_t gfp)
 {
 	struct sk_buff *skb;
 	struct skb_vnet_hdr *hdr;
 	int err;
 
-	skb = __netdev_alloc_skb_ip_align(vi->dev, MAX_PACKET_LEN, gfp);
+	skb = __netdev_alloc_skb_ip_align(rq->vi->dev, MAX_PACKET_LEN, gfp);
 	if (unlikely(!skb))
 		return -ENOMEM;
 
 	skb_put(skb, MAX_PACKET_LEN);
 
 	hdr = skb_vnet_hdr(skb);
-	sg_set_buf(vi->rx_sg, &hdr->hdr, sizeof hdr->hdr);
+	sg_set_buf(rq->sg, &hdr->hdr, sizeof hdr->hdr);
 
-	skb_to_sgvec(skb, vi->rx_sg + 1, 0, skb->len);
+	skb_to_sgvec(skb, rq->sg + 1, 0, skb->len);
+
+	err = virtqueue_add_buf(rq->vq, rq->sg, 0, 2, skb, gfp);
 
-	err = virtqueue_add_buf(vi->rvq, vi->rx_sg, 0, 2, skb, gfp);
 	if (err < 0)
 		dev_kfree_skb(skb);
 
 	return err;
 }
 
-static int add_recvbuf_big(struct virtnet_info *vi, gfp_t gfp)
+static int add_recvbuf_big(struct receive_queue *rq, gfp_t gfp)
 {
 	struct page *first, *list = NULL;
 	char *p;
 	int i, err, offset;
 
-	/* page in vi->rx_sg[MAX_SKB_FRAGS + 1] is list tail */
+	/* page in rq->sg[MAX_SKB_FRAGS + 1] is list tail */
 	for (i = MAX_SKB_FRAGS + 1; i > 1; --i) {
-		first = get_a_page(vi, gfp);
+		first = get_a_page(rq, gfp);
 		if (!first) {
 			if (list)
-				give_pages(vi, list);
+				give_pages(rq, list);
 			return -ENOMEM;
 		}
-		sg_set_buf(&vi->rx_sg[i], page_address(first), PAGE_SIZE);
+		sg_set_buf(&rq->sg[i], page_address(first), PAGE_SIZE);
 
 		/* chain new page in list head to match sg */
 		first->private = (unsigned long)list;
 		list = first;
 	}
 
-	first = get_a_page(vi, gfp);
+	first = get_a_page(rq, gfp);
 	if (!first) {
-		give_pages(vi, list);
+		give_pages(rq, list);
 		return -ENOMEM;
 	}
 	p = page_address(first);
 
-	/* vi->rx_sg[0], vi->rx_sg[1] share the same page */
-	/* a separated vi->rx_sg[0] for virtio_net_hdr only due to QEMU bug */
-	sg_set_buf(&vi->rx_sg[0], p, sizeof(struct virtio_net_hdr));
+	/* rq->sg[0], rq->sg[1] share the same page */
+	/* a separated rq->sg[0] for virtio_net_hdr only due to QEMU bug */
+	sg_set_buf(&rq->sg[0], p, sizeof(struct virtio_net_hdr));
 
-	/* vi->rx_sg[1] for data packet, from offset */
+	/* rq->sg[1] for data packet, from offset */
 	offset = sizeof(struct padded_vnet_hdr);
-	sg_set_buf(&vi->rx_sg[1], p + offset, PAGE_SIZE - offset);
+	sg_set_buf(&rq->sg[1], p + offset, PAGE_SIZE - offset);
 
 	/* chain first in list head */
 	first->private = (unsigned long)list;
-	err = virtqueue_add_buf(vi->rvq, vi->rx_sg, 0, MAX_SKB_FRAGS + 2,
+	err = virtqueue_add_buf(rq->vq, rq->sg, 0, MAX_SKB_FRAGS + 2,
 				first, gfp);
 	if (err < 0)
-		give_pages(vi, first);
+		give_pages(rq, first);
 
 	return err;
 }
 
-static int add_recvbuf_mergeable(struct virtnet_info *vi, gfp_t gfp)
+static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
 {
 	struct page *page;
 	int err;
 
-	page = get_a_page(vi, gfp);
+	page = get_a_page(rq, gfp);
 	if (!page)
 		return -ENOMEM;
 
-	sg_init_one(vi->rx_sg, page_address(page), PAGE_SIZE);
+	sg_init_one(rq->sg, page_address(page), PAGE_SIZE);
 
-	err = virtqueue_add_buf(vi->rvq, vi->rx_sg, 0, 1, page, gfp);
+	err = virtqueue_add_buf(rq->vq, rq->sg, 0, 1, page, gfp);
 	if (err < 0)
-		give_pages(vi, page);
+		give_pages(rq, page);
 
 	return err;
 }
@@ -458,97 +508,104 @@ static int add_recvbuf_mergeable(struct virtnet_info *vi, gfp_t gfp)
  * before we're receiving packets, or from refill_work which is
  * careful to disable receiving (using napi_disable).
  */
-static bool try_fill_recv(struct virtnet_info *vi, gfp_t gfp)
+static bool try_fill_recv(struct receive_queue *rq, gfp_t gfp)
 {
+	struct virtnet_info *vi = rq->vi;
 	int err;
 	bool oom;
 
 	do {
 		if (vi->mergeable_rx_bufs)
-			err = add_recvbuf_mergeable(vi, gfp);
+			err = add_recvbuf_mergeable(rq, gfp);
 		else if (vi->big_packets)
-			err = add_recvbuf_big(vi, gfp);
+			err = add_recvbuf_big(rq, gfp);
 		else
-			err = add_recvbuf_small(vi, gfp);
+			err = add_recvbuf_small(rq, gfp);
 
 		oom = err == -ENOMEM;
 		if (err < 0)
 			break;
-		++vi->num;
+		++rq->num;
 	} while (err > 0);
-	if (unlikely(vi->num > vi->max))
-		vi->max = vi->num;
-	virtqueue_kick(vi->rvq);
+	if (unlikely(rq->num > rq->max))
+		rq->max = rq->num;
+	virtqueue_kick(rq->vq);
 	return !oom;
 }
 
-static void skb_recv_done(struct virtqueue *rvq)
+static void skb_recv_done(struct virtqueue *vq)
 {
-	struct virtnet_info *vi = rvq->vdev->priv;
+	struct virtnet_info *vi = vq->vdev->priv;
+	struct napi_struct *napi = &vi->rq[rxq_get_qnum(vi, vq)]->napi;
+
 	/* Schedule NAPI, Suppress further interrupts if successful. */
-	if (napi_schedule_prep(&vi->napi)) {
-		virtqueue_disable_cb(rvq);
-		__napi_schedule(&vi->napi);
+	if (napi_schedule_prep(napi)) {
+		virtqueue_disable_cb(vq);
+		__napi_schedule(napi);
 	}
 }
 
-static void virtnet_napi_enable(struct virtnet_info *vi)
+static void virtnet_napi_enable(struct receive_queue *rq)
 {
-	napi_enable(&vi->napi);
+	napi_enable(&rq->napi);
 
 	/* If all buffers were filled by other side before we napi_enabled, we
 	 * won't get another interrupt, so process any outstanding packets
 	 * now.  virtnet_poll wants re-enable the queue, so we disable here.
 	 * We synchronize against interrupts via NAPI_STATE_SCHED */
-	if (napi_schedule_prep(&vi->napi)) {
-		virtqueue_disable_cb(vi->rvq);
+	if (napi_schedule_prep(&rq->napi)) {
+		virtqueue_disable_cb(rq->vq);
 		local_bh_disable();
-		__napi_schedule(&vi->napi);
+		__napi_schedule(&rq->napi);
 		local_bh_enable();
 	}
 }
 
 static void refill_work(struct work_struct *work)
 {
-	struct virtnet_info *vi;
+	struct napi_struct *napi;
+	struct receive_queue *rq;
 	bool still_empty;
 
-	vi = container_of(work, struct virtnet_info, refill.work);
-	napi_disable(&vi->napi);
-	still_empty = !try_fill_recv(vi, GFP_KERNEL);
-	virtnet_napi_enable(vi);
+	rq = container_of(work, struct receive_queue, refill.work);
+	napi = &rq->napi;
+
+	napi_disable(napi);
+	still_empty = !try_fill_recv(rq, GFP_KERNEL);
+	virtnet_napi_enable(rq);
 
 	/* In theory, this can happen: if we don't get any buffers in
 	 * we will *never* try to fill again. */
 	if (still_empty)
-		queue_delayed_work(system_nrt_wq, &vi->refill, HZ/2);
+		queue_delayed_work(system_nrt_wq, &rq->refill, HZ/2);
 }
 
 static int virtnet_poll(struct napi_struct *napi, int budget)
 {
-	struct virtnet_info *vi = container_of(napi, struct virtnet_info, napi);
+	struct receive_queue *rq = container_of(napi, struct receive_queue,
+						napi);
 	void *buf;
 	unsigned int len, received = 0;
 
 again:
 	while (received < budget &&
-	       (buf = virtqueue_get_buf(vi->rvq, &len)) != NULL) {
-		receive_buf(vi->dev, buf, len);
-		--vi->num;
+	       (buf = virtqueue_get_buf(rq->vq, &len)) != NULL) {
+		receive_buf(rq, buf, len);
+		--rq->num;
 		received++;
 	}
 
-	if (vi->num < vi->max / 2) {
-		if (!try_fill_recv(vi, GFP_ATOMIC))
-			queue_delayed_work(system_nrt_wq, &vi->refill, 0);
+	if (rq->num < rq->max / 2) {
+		if (!try_fill_recv(rq, GFP_ATOMIC))
+			queue_delayed_work(system_nrt_wq, &rq->refill, 0);
 	}
 
 	/* Out of packets? */
 	if (received < budget) {
 		napi_complete(napi);
-		if (unlikely(!virtqueue_enable_cb(vi->rvq)) &&
+		if (unlikely(!virtqueue_enable_cb(rq->vq)) &&
 		    napi_schedule_prep(napi)) {
-			virtqueue_disable_cb(vi->rvq);
+			virtqueue_disable_cb(rq->vq);
 			__napi_schedule(napi);
 			goto again;
 		}
@@ -557,13 +614,14 @@ again:
 	return received;
 }
 
-static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
+static unsigned int free_old_xmit_skbs(struct virtnet_info *vi,
+				       struct virtqueue *vq)
 {
 	struct sk_buff *skb;
 	unsigned int len, tot_sgs = 0;
 	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
 
-	while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
+	while ((skb = virtqueue_get_buf(vq, &len)) != NULL) {
 		pr_debug("Sent skb %p\n", skb);
 
 		u64_stats_update_begin(&stats->tx_syncp);
@@ -577,7 +635,8 @@ static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
 	return tot_sgs;
 }
 
-static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
+static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb,
+		    struct virtqueue *vq, struct scatterlist *sg)
 {
 	struct skb_vnet_hdr *hdr = skb_vnet_hdr(skb);
 	const unsigned char *dest = ((struct ethhdr *)skb->data)->h_dest;
@@ -615,44 +674,47 @@ static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
 
 	/* Encode metadata header at front. */
 	if (vi->mergeable_rx_bufs)
-		sg_set_buf(vi->tx_sg, &hdr->mhdr, sizeof hdr->mhdr);
+		sg_set_buf(sg, &hdr->mhdr, sizeof hdr->mhdr);
 	else
-		sg_set_buf(vi->tx_sg, &hdr->hdr, sizeof hdr->hdr);
+		sg_set_buf(sg, &hdr->hdr, sizeof hdr->hdr);
 
-	hdr->num_sg = skb_to_sgvec(skb, vi->tx_sg + 1, 0, skb->len) + 1;
-	return virtqueue_add_buf(vi->svq, vi->tx_sg, hdr->num_sg,
+	hdr->num_sg = skb_to_sgvec(skb, sg + 1, 0, skb->len) + 1;
+	return virtqueue_add_buf(vq, sg, hdr->num_sg,
 				 0, skb, GFP_ATOMIC);
 }
 
 static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
+	int qnum = skb_get_queue_mapping(skb);
+	struct virtqueue *vq = vi->sq[qnum]->vq;
 	int capacity;
 
 	/* Free up any pending old buffers before queueing new ones. */
-	free_old_xmit_skbs(vi);
+	free_old_xmit_skbs(vi, vq);
 
 	/* Try to transmit */
-	capacity = xmit_skb(vi, skb);
+	capacity = xmit_skb(vi, skb, vq, vi->sq[qnum]->sg);
 
 	/* This can happen with OOM and indirect buffers. */
 	if (unlikely(capacity < 0)) {
 		if (likely(capacity == -ENOMEM)) {
 			if (net_ratelimit())
 				dev_warn(&dev->dev,
-					 "TX queue failure: out of memory\n");
+					"TXQ (%d) failure: out of memory\n",
+					qnum);
 		} else {
 			dev->stats.tx_fifo_errors++;
 			if (net_ratelimit())
 				dev_warn(&dev->dev,
-					 "Unexpected TX queue failure: %d\n",
-					 capacity);
+					"Unexpected TXQ (%d) failure: %d\n",
+					qnum, capacity);
 		}
 		dev->stats.tx_dropped++;
 		kfree_skb(skb);
 		return NETDEV_TX_OK;
 	}
-	virtqueue_kick(vi->svq);
+	virtqueue_kick(vq);
 
 	/* Don't wait up for transmitted skbs to be freed. */
 	skb_orphan(skb);
@@ -661,13 +723,13 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 	/* Apparently nice girls don't return TX_BUSY; stop the queue
 	 * before it gets out of hand.  Naturally, this wastes entries. */
 	if (capacity < 2+MAX_SKB_FRAGS) {
-		netif_stop_queue(dev);
-		if (unlikely(!virtqueue_enable_cb_delayed(vi->svq))) {
+		netif_stop_subqueue(dev, qnum);
+		if (unlikely(!virtqueue_enable_cb_delayed(vq))) {
 			/* More just got used, free them then recheck. */
-			capacity += free_old_xmit_skbs(vi);
+			capacity += free_old_xmit_skbs(vi, vq);
 			if (capacity >= 2+MAX_SKB_FRAGS) {
-				netif_start_queue(dev);
-				virtqueue_disable_cb(vi->svq);
+				netif_start_subqueue(dev, qnum);
+				virtqueue_disable_cb(vq);
 			}
 		}
 	}
@@ -700,7 +762,8 @@ static struct rtnl_link_stats64 *virtnet_stats(struct net_device *dev,
 	unsigned int start;
 
 	for_each_possible_cpu(cpu) {
-		struct virtnet_stats *stats = per_cpu_ptr(vi->stats, cpu);
+		struct virtnet_stats __percpu *stats
+			= per_cpu_ptr(vi->stats, cpu);
 		u64 tpackets, tbytes, rpackets, rbytes;
 
 		do {
@@ -734,20 +797,31 @@ static struct rtnl_link_stats64 *virtnet_stats(struct net_device *dev,
 static void virtnet_netpoll(struct net_device *dev)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
+	int i;
 
-	napi_schedule(&vi->napi);
+	for (i = 0; i < vi->num_queue_pairs; i++)
+		napi_schedule(&vi->rq[i]->napi);
 }
 #endif
 
+static void free_stats(struct virtnet_info *vi)
+{
+	free_percpu(vi->stats);
+}
+
 static int virtnet_open(struct net_device *dev)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
+	int i;
 
-	/* Make sure we have some buffers: if oom use wq. */
-	if (!try_fill_recv(vi, GFP_KERNEL))
-		queue_delayed_work(system_nrt_wq, &vi->refill, 0);
+	for (i = 0; i < vi->num_queue_pairs; i++) {
+		/* Make sure we have some buffers: if oom use wq. */
+		if (!try_fill_recv(vi->rq[i], GFP_KERNEL))
+			queue_delayed_work(system_nrt_wq,
+					   &vi->rq[i]->refill, 0);
+		virtnet_napi_enable(vi->rq[i]);
+	}
 
-	virtnet_napi_enable(vi);
 	return 0;
 }
 
@@ -809,10 +883,13 @@ static void virtnet_ack_link_announce(struct virtnet_info *vi)
 static int virtnet_close(struct net_device *dev)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
+	int i;
 
 	/* Make sure refill_work doesn't re-enable napi! */
-	cancel_delayed_work_sync(&vi->refill);
-	napi_disable(&vi->napi);
+	for (i = 0; i < vi->num_queue_pairs; i++) {
+		cancel_delayed_work_sync(&vi->rq[i]->refill);
+		napi_disable(&vi->rq[i]->napi);
+	}
 
 	return 0;
 }
@@ -924,11 +1001,10 @@ static void virtnet_get_ringparam(struct net_device *dev,
 {
 	struct virtnet_info *vi = netdev_priv(dev);
 
-	ring->rx_max_pending = virtqueue_get_vring_size(vi->rvq);
-	ring->tx_max_pending = virtqueue_get_vring_size(vi->svq);
+	ring->rx_max_pending = virtqueue_get_vring_size(vi->rq[0]->vq);
+	ring->tx_max_pending = virtqueue_get_vring_size(vi->sq[0]->vq);
 	ring->rx_pending = ring->rx_max_pending;
 	ring->tx_pending = ring->tx_max_pending;
-
 }
 
 
@@ -961,6 +1037,19 @@ static int virtnet_change_mtu(struct net_device *dev, int new_mtu)
 	return 0;
 }
 
+/* To avoid contending a lock hold by a vcpu who would exit to host, select the
+ * txq based on the processor id.
+ */
+static u16 virtnet_select_queue(struct net_device *dev, struct sk_buff *skb)
+{
+	int txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) :
+		  smp_processor_id();
+
+	while (unlikely(txq >= dev->real_num_tx_queues))
+		txq -= dev->real_num_tx_queues;
+	return txq;
+}
+
 static const struct net_device_ops virtnet_netdev = {
 	.ndo_open            = virtnet_open,
 	.ndo_stop   	     = virtnet_close,
@@ -972,6 +1061,7 @@ static const struct net_device_ops virtnet_netdev = {
 	.ndo_get_stats64     = virtnet_stats,
 	.ndo_vlan_rx_add_vid = virtnet_vlan_rx_add_vid,
 	.ndo_vlan_rx_kill_vid = virtnet_vlan_rx_kill_vid,
+	.ndo_select_queue     = virtnet_select_queue,
 #ifdef CONFIG_NET_POLL_CONTROLLER
 	.ndo_poll_controller = virtnet_netpoll,
 #endif
@@ -1007,10 +1097,10 @@ static void virtnet_config_changed_work(struct work_struct *work)
 
 	if (vi->status & VIRTIO_NET_S_LINK_UP) {
 		netif_carrier_on(vi->dev);
-		netif_wake_queue(vi->dev);
+		netif_tx_wake_all_queues(vi->dev);
 	} else {
 		netif_carrier_off(vi->dev);
-		netif_stop_queue(vi->dev);
+		netif_tx_stop_all_queues(vi->dev);
 	}
 done:
 	mutex_unlock(&vi->config_lock);
@@ -1023,41 +1113,252 @@ static void virtnet_config_changed(struct virtio_device *vdev)
 	queue_work(system_nrt_wq, &vi->config_work);
 }
 
-static int init_vqs(struct virtnet_info *vi)
+static void free_receive_bufs(struct virtnet_info *vi)
+{
+	int i;
+
+	for (i = 0; i < vi->num_queue_pairs; i++) {
+		while (vi->rq[i]->pages)
+			__free_pages(get_a_page(vi->rq[i], GFP_KERNEL), 0);
+	}
+}
+
+/* Free memory allocated for send and receive queues */
+static void free_rq_sq(struct virtnet_info *vi)
 {
-	struct virtqueue *vqs[3];
-	vq_callback_t *callbacks[] = { skb_recv_done, skb_xmit_done, NULL};
-	const char *names[] = { "input", "output", "control" };
-	int nvqs, err;
+	int i;
 
-	/* We expect two virtqueues, receive then send,
-	 * and optionally control. */
-	nvqs = virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ) ? 3 : 2;
+	free_stats(vi);
 
-	err = vi->vdev->config->find_vqs(vi->vdev, nvqs, vqs, callbacks, names);
-	if (err)
-		return err;
+	if (vi->rq) {
+		for (i = 0; i < vi->num_queue_pairs; i++)
+			kfree(vi->rq[i]);
+		kfree(vi->rq);
+	}
+
+	if (vi->sq) {
+		for (i = 0; i < vi->num_queue_pairs; i++)
+			kfree(vi->sq[i]);
+		kfree(vi->sq);
+	}
+}
+
+static void free_unused_bufs(struct virtnet_info *vi)
+{
+	void *buf;
+	int i;
+
+	for (i = 0; i < vi->num_queue_pairs; i++) {
+		struct virtqueue *vq = vi->sq[i]->vq;
+
+		while ((buf = virtqueue_detach_unused_buf(vq)) != NULL)
+			dev_kfree_skb(buf);
+	}
+
+	for (i = 0; i < vi->num_queue_pairs; i++) {
+		struct virtqueue *vq = vi->rq[i]->vq;
 
-	vi->rvq = vqs[0];
-	vi->svq = vqs[1];
+		while ((buf = virtqueue_detach_unused_buf(vq)) != NULL) {
+			if (vi->mergeable_rx_bufs || vi->big_packets)
+				give_pages(vi->rq[i], buf);
+			else
+				dev_kfree_skb(buf);
+			--vi->rq[i]->num;
+		}
+		BUG_ON(vi->rq[i]->num != 0);
+	}
+}
+
+static int virtnet_find_vqs(struct virtnet_info *vi)
+{
+	vq_callback_t **callbacks;
+	struct virtqueue **vqs;
+	int ret = -ENOMEM;
+	int i, total_vqs;
+	char **names;
+
+	/*
+	 * We expect 1 RX virtqueue followed by 1 TX virtqueue, followd by
+	 * possible control virtqueue and followed by the same
+	 * 'vi->num_queue_pairs-1' more times
+	 */
+	total_vqs = vi->num_queue_pairs * 2 +
+		    virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ);
+
+	/* Allocate space for find_vqs parameters */
+	vqs = kmalloc(total_vqs * sizeof(*vqs), GFP_KERNEL);
+	callbacks = kmalloc(total_vqs * sizeof(*callbacks), GFP_KERNEL);
+	names = kmalloc(total_vqs * sizeof(*names), GFP_KERNEL);
+	if (!vqs || !callbacks || !names)
+		goto err;
+
+	/* Parameters for control virtqueue, if any */
+	if (vi->has_cvq) {
+		callbacks[2] = NULL;
+		names[2] = "control";
+	}
+
+	/* Allocate/initialize parameters for send/receive virtqueues */
+	for (i = 0; i < vi->num_queue_pairs * 2; i += 2) {
+		int j = (i == 0 ? i : i + vi->has_cvq);
+		callbacks[j] = skb_recv_done;
+		callbacks[j + 1] = skb_xmit_done;
+		names[j] = kasprintf(GFP_KERNEL, "input.%d", i / 2);
+		names[j + 1] = kasprintf(GFP_KERNEL, "output.%d", i / 2);
+	}
 
-	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ)) {
+	ret = vi->vdev->config->find_vqs(vi->vdev, total_vqs, vqs, callbacks,
+					 (const char **)names);
+	if (ret)
+		goto err;
+
+	if (vi->has_cvq)
 		vi->cvq = vqs[2];
 
-		if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VLAN))
-			vi->dev->features |= NETIF_F_HW_VLAN_FILTER;
+	for (i = 0; i < vi->num_queue_pairs * 2; i += 2) {
+		int j = i == 0 ? i : i + vi->has_cvq;
+		vi->rq[i / 2]->vq = vqs[j];
+		vi->sq[i / 2]->vq = vqs[j + 1];
 	}
+
+err:
+	if (ret && names)
+		for (i = 0; i < vi->num_queue_pairs * 2; i++)
+			kfree(names[i]);
+
+	kfree(names);
+	kfree(callbacks);
+	kfree(vqs);
+
+	return ret;
+}
+
+static int allocate_queues(struct virtnet_info *vi)
+{
+	int ret = -ENOMEM;
+	int i;
+
+	vi->rq = kcalloc(vi->num_queue_pairs, sizeof(*vi->rq), GFP_KERNEL);
+	vi->sq = kcalloc(vi->num_queue_pairs, sizeof(*vi->sq), GFP_KERNEL);
+	if (!vi->sq || !vi->rq)
+		goto err;
+
+	for (i = 0; i < vi->num_queue_pairs; i++) {
+		vi->rq[i] = kzalloc(sizeof(*vi->rq[i]), GFP_KERNEL);
+		vi->sq[i] = kzalloc(sizeof(*vi->sq[i]), GFP_KERNEL);
+		if (!vi->rq[i] || !vi->sq[i])
+			goto err;
+	}
+
+	ret = 0;
+
+	/* setup initial receive and send queue parameters */
+	for (i = 0; i < vi->num_queue_pairs; i++) {
+		vi->rq[i]->vi = vi;
+		vi->rq[i]->pages = NULL;
+		INIT_DELAYED_WORK(&vi->rq[i]->refill, refill_work);
+		netif_napi_add(vi->dev, &vi->rq[i]->napi, virtnet_poll,
+			       napi_weight);
+
+		sg_init_table(vi->rq[i]->sg, ARRAY_SIZE(vi->rq[i]->sg));
+		sg_init_table(vi->sq[i]->sg, ARRAY_SIZE(vi->sq[i]->sg));
+	}
+
+err:
+	if (ret)
+		free_rq_sq(vi);
+
+	return ret;
+}
+
+static int virtnet_setup_vqs(struct virtnet_info *vi)
+{
+	int ret;
+
+	/* Allocate send & receive queues */
+	ret = allocate_queues(vi);
+	if (!ret) {
+		ret = virtnet_find_vqs(vi);
+		if (ret)
+			free_rq_sq(vi);
+	}
+
+	return ret;
+}
+
+static void _virtnet_free_affinity(struct virtnet_info *vi, int i)
+{
+	struct virtio_device *vdev = vi->vdev;
+	int irq;
+
+	while (i) {
+		i--;
+		irq = vdev->config->get_vq_irq(vdev, vi->rq[i]->vq);
+		irq_set_affinity_hint(irq, NULL);
+		free_cpumask_var(vi->rq[i]->affinity_mask);
+		irq = vdev->config->get_vq_irq(vdev, vi->sq[i]->vq);
+		irq_set_affinity_hint(irq, NULL);
+		free_cpumask_var(vi->sq[i]->affinity_mask);
+	}
+}
+
+static int virtnet_init_affinity(struct virtnet_info *vi)
+{
+	struct virtio_device *vdev = vi->vdev;
+	int i, irq;
+
+	for (i = 0; i < vi->num_queue_pairs; i++) {
+		if (!alloc_cpumask_var(&vi->rq[i]->affinity_mask, GFP_KERNEL))
+			goto err_out;
+		if (!alloc_cpumask_var(&vi->sq[i]->affinity_mask, GFP_KERNEL)) {
+			free_cpumask_var(vi->rq[i]->affinity_mask);
+			goto err_out;
+		}
+		cpumask_set_cpu(i, vi->rq[i]->affinity_mask);
+		cpumask_set_cpu(i, vi->sq[i]->affinity_mask);
+
+		irq = vdev->config->get_vq_irq(vdev, vi->rq[i]->vq);
+		if (irq_set_affinity_hint(irq, vi->rq[i]->affinity_mask))
+			goto err_out;
+		irq = vdev->config->get_vq_irq(vdev, vi->sq[i]->vq);
+		if (irq_set_affinity_hint(irq, vi->sq[i]->affinity_mask)) {
+			irq = vdev->config->get_vq_irq(vdev, vi->rq[i]->vq);
+			irq_set_affinity_hint(irq, NULL);
+			goto err_out;
+		}
+	}
+
 	return 0;
+err_out:
+	_virtnet_free_affinity(vi, i);
+	return -ENOMEM;
+}
+
+static void virtnet_free_affinity(struct virtnet_info *vi)
+{
+	_virtnet_free_affinity(vi, vi->num_queue_pairs);
 }
 
 static int virtnet_probe(struct virtio_device *vdev)
 {
-	int err;
+	int i, err;
 	struct net_device *dev;
 	struct virtnet_info *vi;
+	u16 num_queues, num_queue_pairs;
+
+	/* Find if host supports multiqueue virtio_net device */
+	err = virtio_config_val(vdev, VIRTIO_NET_F_MULTIQUEUE,
+				offsetof(struct virtio_net_config,
+				num_queues), &num_queues);
+
+	/* We need atleast 2 queue's */
+	if (err || num_queues < 2)
+		num_queues = 2;
+
+	num_queue_pairs = num_queues / 2;
 
 	/* Allocate ourselves a network device with room for our info */
-	dev = alloc_etherdev(sizeof(struct virtnet_info));
+	dev = alloc_etherdev_mq(sizeof(struct virtnet_info), num_queue_pairs);
 	if (!dev)
 		return -ENOMEM;
 
@@ -1103,22 +1404,18 @@ static int virtnet_probe(struct virtio_device *vdev)
 
 	/* Set up our device-specific information */
 	vi = netdev_priv(dev);
-	netif_napi_add(dev, &vi->napi, virtnet_poll, napi_weight);
 	vi->dev = dev;
 	vi->vdev = vdev;
 	vdev->priv = vi;
-	vi->pages = NULL;
 	vi->stats = alloc_percpu(struct virtnet_stats);
 	err = -ENOMEM;
 	if (vi->stats == NULL)
-		goto free;
+		goto free_netdev;
 
-	INIT_DELAYED_WORK(&vi->refill, refill_work);
 	mutex_init(&vi->config_lock);
 	vi->config_enable = true;
 	INIT_WORK(&vi->config_work, virtnet_config_changed_work);
-	sg_init_table(vi->rx_sg, ARRAY_SIZE(vi->rx_sg));
-	sg_init_table(vi->tx_sg, ARRAY_SIZE(vi->tx_sg));
+	vi->num_queue_pairs = num_queue_pairs;
 
 	/* If we can receive ANY GSO packets, we must allocate large ones. */
 	if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO4) ||
@@ -1129,9 +1426,17 @@ static int virtnet_probe(struct virtio_device *vdev)
 	if (virtio_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
 		vi->mergeable_rx_bufs = true;
 
-	err = init_vqs(vi);
+	if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ))
+		vi->has_cvq = true;
+
+	/* Allocate/initialize the rx/tx queues, and invoke find_vqs */
+	err = virtnet_setup_vqs(vi);
 	if (err)
-		goto free_stats;
+		goto free_netdev;
+
+	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ) &&
+	    virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VLAN))
+		dev->features |= NETIF_F_HW_VLAN_FILTER;
 
 	err = register_netdev(dev);
 	if (err) {
@@ -1140,12 +1445,21 @@ static int virtnet_probe(struct virtio_device *vdev)
 	}
 
 	/* Last of all, set up some receive buffers. */
-	try_fill_recv(vi, GFP_KERNEL);
+	for (i = 0; i < num_queue_pairs; i++) {
+		try_fill_recv(vi->rq[i], GFP_KERNEL);
+
+		/* If we didn't even get one input buffer, we're useless. */
+		if (vi->rq[i]->num == 0) {
+			free_unused_bufs(vi);
+			err = -ENOMEM;
+			goto free_recv_bufs;
+		}
+	}
 
-	/* If we didn't even get one input buffer, we're useless. */
-	if (vi->num == 0) {
-		err = -ENOMEM;
-		goto unregister;
+	err = virtnet_init_affinity(vi);
+	if (err) {
+		pr_debug("virtio_net: fail to initialize irq affinity\n");
+		goto free_recv_bufs;
 	}
 
 	/* Assume link up if device can't report link status,
@@ -1158,42 +1472,26 @@ static int virtnet_probe(struct virtio_device *vdev)
 		netif_carrier_on(dev);
 	}
 
-	pr_debug("virtnet: registered device %s\n", dev->name);
+	pr_debug("virtnet: registered device %s with %d RX and TX vq's\n",
+		 dev->name, num_queue_pairs);
+
 	return 0;
 
-unregister:
+free_recv_bufs:
+	free_receive_bufs(vi);
 	unregister_netdev(dev);
 free_vqs:
+	for (i = 0; i < num_queue_pairs; i++)
+		cancel_delayed_work_sync(&vi->rq[i]->refill);
 	vdev->config->del_vqs(vdev);
-free_stats:
-	free_percpu(vi->stats);
-free:
+
+free_netdev:
+	free_rq_sq(vi);
+
 	free_netdev(dev);
 	return err;
 }
 
-static void free_unused_bufs(struct virtnet_info *vi)
-{
-	void *buf;
-	while (1) {
-		buf = virtqueue_detach_unused_buf(vi->svq);
-		if (!buf)
-			break;
-		dev_kfree_skb(buf);
-	}
-	while (1) {
-		buf = virtqueue_detach_unused_buf(vi->rvq);
-		if (!buf)
-			break;
-		if (vi->mergeable_rx_bufs || vi->big_packets)
-			give_pages(vi, buf);
-		else
-			dev_kfree_skb(buf);
-		--vi->num;
-	}
-	BUG_ON(vi->num != 0);
-}
-
 static void remove_vq_common(struct virtnet_info *vi)
 {
 	vi->vdev->config->reset(vi->vdev);
@@ -1203,14 +1501,17 @@ static void remove_vq_common(struct virtnet_info *vi)
 
 	vi->vdev->config->del_vqs(vi->vdev);
 
-	while (vi->pages)
-		__free_pages(get_a_page(vi, GFP_KERNEL), 0);
+	free_receive_bufs(vi);
 }
 
 static void __devexit virtnet_remove(struct virtio_device *vdev)
 {
 	struct virtnet_info *vi = vdev->priv;
 
+	/* Stop all the virtqueues. */
+	vdev->config->reset(vdev);
+
+
 	/* Prevent config work handler from accessing the device. */
 	mutex_lock(&vi->config_lock);
 	vi->config_enable = false;
@@ -1220,6 +1521,11 @@ static void __devexit virtnet_remove(struct virtio_device *vdev)
 
 	remove_vq_common(vi);
 
+	virtnet_free_affinity(vi);
+
+	/* Free memory for send and receive queues */
+	free_rq_sq(vi);
+
 	flush_work(&vi->config_work);
 
 	free_percpu(vi->stats);
@@ -1230,6 +1536,7 @@ static void __devexit virtnet_remove(struct virtio_device *vdev)
 static int virtnet_freeze(struct virtio_device *vdev)
 {
 	struct virtnet_info *vi = vdev->priv;
+	int i;
 
 	/* Prevent config work handler from accessing the device */
 	mutex_lock(&vi->config_lock);
@@ -1237,10 +1544,13 @@ static int virtnet_freeze(struct virtio_device *vdev)
 	mutex_unlock(&vi->config_lock);
 
 	netif_device_detach(vi->dev);
-	cancel_delayed_work_sync(&vi->refill);
+	for (i = 0; i < vi->num_queue_pairs; i++)
+		cancel_delayed_work_sync(&vi->rq[i]->refill);
 
 	if (netif_running(vi->dev))
-		napi_disable(&vi->napi);
+		for (i = 0; i < vi->num_queue_pairs; i++)
+			napi_disable(&vi->rq[i]->napi);
+
 
 	remove_vq_common(vi);
 
@@ -1252,19 +1562,22 @@ static int virtnet_freeze(struct virtio_device *vdev)
 static int virtnet_restore(struct virtio_device *vdev)
 {
 	struct virtnet_info *vi = vdev->priv;
-	int err;
+	int err, i;
 
-	err = init_vqs(vi);
+	err = virtnet_setup_vqs(vi);
 	if (err)
 		return err;
 
 	if (netif_running(vi->dev))
-		virtnet_napi_enable(vi);
+		for (i = 0; i < vi->num_queue_pairs; i++)
+			virtnet_napi_enable(vi->rq[i]);
 
 	netif_device_attach(vi->dev);
 
-	if (!try_fill_recv(vi, GFP_KERNEL))
-		queue_delayed_work(system_nrt_wq, &vi->refill, 0);
+	for (i = 0; i < vi->num_queue_pairs; i++)
+		if (!try_fill_recv(vi->rq[i], GFP_KERNEL))
+			queue_delayed_work(system_nrt_wq,
+					   &vi->rq[i]->refill, 0);
 
 	mutex_lock(&vi->config_lock);
 	vi->config_enable = true;
@@ -1287,7 +1600,7 @@ static unsigned int features[] = {
 	VIRTIO_NET_F_GUEST_ECN, VIRTIO_NET_F_GUEST_UFO,
 	VIRTIO_NET_F_MRG_RXBUF, VIRTIO_NET_F_STATUS, VIRTIO_NET_F_CTRL_VQ,
 	VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN,
-	VIRTIO_NET_F_GUEST_ANNOUNCE,
+	VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MULTIQUEUE,
 };
 
 static struct virtio_driver virtio_net_driver = {
diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
index 1bc7e30..60f09ff 100644
--- a/include/linux/virtio_net.h
+++ b/include/linux/virtio_net.h
@@ -61,6 +61,8 @@ struct virtio_net_config {
 	__u8 mac[6];
 	/* See VIRTIO_NET_F_STATUS and VIRTIO_NET_S_* above */
 	__u16 status;
+	/* Total number of RX/TX queues */
+	__u16 num_queues;
 } __attribute__((packed));
 
 /* This is the first element of the scatter-gather list.  If you don't
-- 
1.7.1

^ permalink raw reply related

* [net-next RFC V4 PATCH 3/4] virtio: introduce a method to get the irq of a specific virtqueue
From: Jason Wang @ 2012-06-25  9:41 UTC (permalink / raw)
  To: mst, akong, habanero, tahm, jwhan, mashirle, krkumar2, edumazet,
	davem, rusty, linux-kernel, virtualization, netdev
  Cc: qemu-devel, kvm
In-Reply-To: <20120625090829.7263.65026.stgit@amd-6168-8-1.englab.nay.redhat.com>

Device specific irq optimizations such as irq affinity may be used by virtio
drivers. So this patch introduce a new method to get the irq of a specific
virtqueue.

After this patch, virtio device drivers could query the irq and do device
specific optimizations. First user would be virtio-net.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/lguest/lguest_device.c |    8 ++++++++
 drivers/s390/kvm/kvm_virtio.c  |    6 ++++++
 drivers/virtio/virtio_mmio.c   |    8 ++++++++
 drivers/virtio/virtio_pci.c    |   12 ++++++++++++
 include/linux/virtio_config.h  |    4 ++++
 5 files changed, 38 insertions(+), 0 deletions(-)

diff --git a/drivers/lguest/lguest_device.c b/drivers/lguest/lguest_device.c
index 9e8388e..bcd080f 100644
--- a/drivers/lguest/lguest_device.c
+++ b/drivers/lguest/lguest_device.c
@@ -392,6 +392,13 @@ static const char *lg_bus_name(struct virtio_device *vdev)
 	return "";
 }
 
+static int lg_get_vq_irq(struct virtio_device *vdev, struct virtqueue *vq)
+{
+	struct lguest_vq_info *lvq = vq->priv;
+
+	return lvq->config.irq;
+}
+
 /* The ops structure which hooks everything together. */
 static struct virtio_config_ops lguest_config_ops = {
 	.get_features = lg_get_features,
@@ -404,6 +411,7 @@ static struct virtio_config_ops lguest_config_ops = {
 	.find_vqs = lg_find_vqs,
 	.del_vqs = lg_del_vqs,
 	.bus_name = lg_bus_name,
+	.get_vq_irq = lg_get_vq_irq,
 };
 
 /*
diff --git a/drivers/s390/kvm/kvm_virtio.c b/drivers/s390/kvm/kvm_virtio.c
index d74e9ae..a897de2 100644
--- a/drivers/s390/kvm/kvm_virtio.c
+++ b/drivers/s390/kvm/kvm_virtio.c
@@ -268,6 +268,11 @@ static const char *kvm_bus_name(struct virtio_device *vdev)
 	return "";
 }
 
+static int kvm_get_vq_irq(struct virtio_device *vdev, struct virtqueue *vq)
+{
+	return 0x2603;
+}
+
 /*
  * The config ops structure as defined by virtio config
  */
@@ -282,6 +287,7 @@ static struct virtio_config_ops kvm_vq_configspace_ops = {
 	.find_vqs = kvm_find_vqs,
 	.del_vqs = kvm_del_vqs,
 	.bus_name = kvm_bus_name,
+	.get_vq_irq = kvm_get_vq_irq,
 };
 
 /*
diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
index f5432b6..2ba37ed 100644
--- a/drivers/virtio/virtio_mmio.c
+++ b/drivers/virtio/virtio_mmio.c
@@ -411,6 +411,13 @@ static const char *vm_bus_name(struct virtio_device *vdev)
 	return vm_dev->pdev->name;
 }
 
+static int vm_get_vq_irq(struct virtio_device *vdev, struct virtqueue *vq)
+{
+	struct virtio_mmio_device *vm_dev = to_virtio_mmio_device(vdev);
+
+	return platform_get_irq(vm_dev->pdev, 0);
+}
+
 static struct virtio_config_ops virtio_mmio_config_ops = {
 	.get		= vm_get,
 	.set		= vm_set,
@@ -422,6 +429,7 @@ static struct virtio_config_ops virtio_mmio_config_ops = {
 	.get_features	= vm_get_features,
 	.finalize_features = vm_finalize_features,
 	.bus_name	= vm_bus_name,
+	.get_vq_irq     = vm_get_vq_irq,
 };
 
 
diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
index adb24f2..c062227 100644
--- a/drivers/virtio/virtio_pci.c
+++ b/drivers/virtio/virtio_pci.c
@@ -607,6 +607,17 @@ static const char *vp_bus_name(struct virtio_device *vdev)
 	return pci_name(vp_dev->pci_dev);
 }
 
+static int vp_get_vq_irq(struct virtio_device *vdev, struct virtqueue *vq)
+{
+	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
+	struct virtio_pci_vq_info *info = vq->priv;
+
+	if (vp_dev->intx_enabled)
+		return vp_dev->pci_dev->irq;
+	else
+		return vp_dev->msix_entries[info->msix_vector].vector;
+}
+
 static struct virtio_config_ops virtio_pci_config_ops = {
 	.get		= vp_get,
 	.set		= vp_set,
@@ -618,6 +629,7 @@ static struct virtio_config_ops virtio_pci_config_ops = {
 	.get_features	= vp_get_features,
 	.finalize_features = vp_finalize_features,
 	.bus_name	= vp_bus_name,
+	.get_vq_irq     = vp_get_vq_irq,
 };
 
 static void virtio_pci_release_dev(struct device *_d)
diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h
index fc457f4..acd6930 100644
--- a/include/linux/virtio_config.h
+++ b/include/linux/virtio_config.h
@@ -98,6 +98,9 @@
  *	vdev: the virtio_device
  *      This returns a pointer to the bus name a la pci_name from which
  *      the caller can then copy.
+ * @get_vq_irq: get the irq numer of the specific virt queue.
+ *      vdev: the virtio_device
+ *      vq: the virtqueue
  */
 typedef void vq_callback_t(struct virtqueue *);
 struct virtio_config_ops {
@@ -116,6 +119,7 @@ struct virtio_config_ops {
 	u32 (*get_features)(struct virtio_device *vdev);
 	void (*finalize_features)(struct virtio_device *vdev);
 	const char *(*bus_name)(struct virtio_device *vdev);
+	int (*get_vq_irq)(struct virtio_device *vdev, struct virtqueue *vq);
 };
 
 /* If driver didn't advertise the feature, it will never appear. */
-- 
1.7.1

^ permalink raw reply related

* [net-next RFC V4 PATCH 2/4] virtio_ring: move queue_index to vring_virtqueue
From: Jason Wang @ 2012-06-25  9:17 UTC (permalink / raw)
  To: krkumar2, habanero, rusty, mst, netdev, mashirle, linux-kernel,
	virtualization, edumazet, tahm, jwhan, akong, davem
  Cc: kvm
In-Reply-To: <20120625090829.7263.65026.stgit@amd-6168-8-1.englab.nay.redhat.com>

Instead of storing the queue index in virtio infos, this patch moves them to
vring_virtqueue and introduces helpers to set and get the value. This would
simplify the management and tracing.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/virtio/virtio_mmio.c |    5 +----
 drivers/virtio/virtio_pci.c  |   12 +++++-------
 drivers/virtio/virtio_ring.c |   17 +++++++++++++++++
 include/linux/virtio.h       |    4 ++++
 4 files changed, 27 insertions(+), 11 deletions(-)

diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
index 453db0c..f5432b6 100644
--- a/drivers/virtio/virtio_mmio.c
+++ b/drivers/virtio/virtio_mmio.c
@@ -131,9 +131,6 @@ struct virtio_mmio_vq_info {
 	/* the number of entries in the queue */
 	unsigned int num;
 
-	/* the index of the queue */
-	int queue_index;
-
 	/* the virtual address of the ring queue */
 	void *queue;
 
@@ -324,7 +321,6 @@ static struct virtqueue *vm_setup_vq(struct virtio_device *vdev, unsigned index,
 		err = -ENOMEM;
 		goto error_kmalloc;
 	}
-	info->queue_index = index;
 
 	/* Allocate pages for the queue - start with a queue as big as
 	 * possible (limited by maximum size allowed by device), drop down
@@ -363,6 +359,7 @@ static struct virtqueue *vm_setup_vq(struct virtio_device *vdev, unsigned index,
 		goto error_new_virtqueue;
 	}
 
+	virtqueue_set_queue_index(vq, index);
 	vq->priv = info;
 	info->vq = vq;
 
diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
index 2e03d41..adb24f2 100644
--- a/drivers/virtio/virtio_pci.c
+++ b/drivers/virtio/virtio_pci.c
@@ -79,9 +79,6 @@ struct virtio_pci_vq_info
 	/* the number of entries in the queue */
 	int num;
 
-	/* the index of the queue */
-	int queue_index;
-
 	/* the virtual address of the ring queue */
 	void *queue;
 
@@ -202,11 +199,11 @@ static void vp_reset(struct virtio_device *vdev)
 static void vp_notify(struct virtqueue *vq)
 {
 	struct virtio_pci_device *vp_dev = to_vp_device(vq->vdev);
-	struct virtio_pci_vq_info *info = vq->priv;
 
 	/* we write the queue's selector into the notification register to
 	 * signal the other end */
-	iowrite16(info->queue_index, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_NOTIFY);
+	iowrite16(virtqueue_get_queue_index(vq),
+		  vp_dev->ioaddr + VIRTIO_PCI_QUEUE_NOTIFY);
 }
 
 /* Handle a configuration change: Tell driver if it wants to know. */
@@ -402,7 +399,6 @@ static struct virtqueue *setup_vq(struct virtio_device *vdev, unsigned index,
 	if (!info)
 		return ERR_PTR(-ENOMEM);
 
-	info->queue_index = index;
 	info->num = num;
 	info->msix_vector = msix_vec;
 
@@ -425,6 +421,7 @@ static struct virtqueue *setup_vq(struct virtio_device *vdev, unsigned index,
 		goto out_activate_queue;
 	}
 
+	virtqueue_set_queue_index(vq, index);
 	vq->priv = info;
 	info->vq = vq;
 
@@ -467,7 +464,8 @@ static void vp_del_vq(struct virtqueue *vq)
 	list_del(&info->node);
 	spin_unlock_irqrestore(&vp_dev->lock, flags);
 
-	iowrite16(info->queue_index, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_SEL);
+	iowrite16(virtqueue_get_queue_index(vq),
+		vp_dev->ioaddr + VIRTIO_PCI_QUEUE_SEL);
 
 	if (vp_dev->msix_enabled) {
 		iowrite16(VIRTIO_MSI_NO_VECTOR,
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 5aa43c3..9c5aeea 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -106,6 +106,9 @@ struct vring_virtqueue
 	/* How to notify other side. FIXME: commonalize hcalls! */
 	void (*notify)(struct virtqueue *vq);
 
+	/* Index of the queue */
+	int queue_index;
+
 #ifdef DEBUG
 	/* They're supposed to lock for us. */
 	unsigned int in_use;
@@ -171,6 +174,20 @@ static int vring_add_indirect(struct vring_virtqueue *vq,
 	return head;
 }
 
+void virtqueue_set_queue_index(struct virtqueue *_vq, int queue_index)
+{
+	struct vring_virtqueue *vq = to_vvq(_vq);
+	vq->queue_index = queue_index;
+}
+EXPORT_SYMBOL_GPL(virtqueue_set_queue_index);
+
+int virtqueue_get_queue_index(struct virtqueue *_vq)
+{
+	struct vring_virtqueue *vq = to_vvq(_vq);
+	return vq->queue_index;
+}
+EXPORT_SYMBOL_GPL(virtqueue_get_queue_index);
+
 /**
  * virtqueue_add_buf - expose buffer to other end
  * @vq: the struct virtqueue we're talking about.
diff --git a/include/linux/virtio.h b/include/linux/virtio.h
index 8efd28a..0d8ed46 100644
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -50,6 +50,10 @@ void *virtqueue_detach_unused_buf(struct virtqueue *vq);
 
 unsigned int virtqueue_get_vring_size(struct virtqueue *vq);
 
+void virtqueue_set_queue_index(struct virtqueue *vq, int queue_index);
+
+int virtqueue_get_queue_index(struct virtqueue *vq);
+
 /**
  * virtio_device - representation of a device using virtio
  * @index: unique position on the virtio bus

^ permalink raw reply related

* [net-next RFC V4 PATCH 1/4] virtio_net: Introduce VIRTIO_NET_F_MULTIQUEUE
From: Jason Wang @ 2012-06-25  9:17 UTC (permalink / raw)
  To: krkumar2, habanero, rusty, mst, netdev, mashirle, linux-kernel,
	virtualization, edumazet, tahm, jwhan, akong, davem
  Cc: kvm
In-Reply-To: <20120625090829.7263.65026.stgit@amd-6168-8-1.englab.nay.redhat.com>

From: Krishna Kumar <krkumar2@in.ibm.com>

Introduce VIRTIO_NET_F_MULTIQUEUE.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 include/linux/virtio_net.h |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
index 2470f54..1bc7e30 100644
--- a/include/linux/virtio_net.h
+++ b/include/linux/virtio_net.h
@@ -51,6 +51,7 @@
 #define VIRTIO_NET_F_CTRL_RX_EXTRA 20	/* Extra RX mode control support */
 #define VIRTIO_NET_F_GUEST_ANNOUNCE 21	/* Guest can announce device on the
 					 * network */
+#define VIRTIO_NET_F_MULTIQUEUE	22	/* Device supports multiple TXQ/RXQ */
 
 #define VIRTIO_NET_S_LINK_UP	1	/* Link is up */
 #define VIRTIO_NET_S_ANNOUNCE	2	/* Announcement is needed */

^ permalink raw reply related

* [net-next RFC V4 PATCH 0/4] Multiqueue virtio-net
From: Jason Wang @ 2012-06-25  9:16 UTC (permalink / raw)
  To: krkumar2, habanero, rusty, mst, netdev, mashirle, linux-kernel,
	virtualization, edumazet, tahm, jwhan, akong, davem
  Cc: kvm

Hello All:

This series is an update version of multiqueue virtio-net driver based on
Krishna Kumar's work to let virtio-net use multiple rx/tx queues to do the
packets reception and transmission. Please review and comments.

Test Environment:
- Intel(R) Xeon(R) CPU E5620 @ 2.40GHz, 8 cores 2 numa nodes
- Two directed connected 82599

Test Summary:

- Highlights: huge improvements on TCP_RR test
- Lowlights: regression on small packet transmission, higher cpu utilization
             than single queue, need further optimization

Analysis of the performance result:

- I count the number of packets sending/receiving during the test, and
  multiqueue show much more ability in terms of packets per second.

- For the tx regression, multiqueue send about 1-2 times of more packets
  compared to single queue, and the packets size were much smaller than single
  queue does. I suspect tcp does less batching in multiqueue, so I hack the
  tcp_write_xmit() to forece more batching, multiqueue works as well as
  singlequeue for both small transmission and throughput

- I didn't pack the accelerate RFS with virtio-net in this sereis as it still
  need further shaping, for the one that interested in this please see:
  http://www.mail-archive.com/kvm@vger.kernel.org/msg64111.html

Detail result:

Test results: smp = 2 pin vhosts and vcpus in the same node - 1 sq 2 mq(q=2)
- TCP_MAERTS (Guest to external Host):

sessions size throughput1 throughput2   norm1 norm2:
1 64 424.91 401.49 94% 13.20 12.35 93%
2 64 1211.06 878.31 72% 24.35 15.80 64%
4 64 1292.46 1081.78 83% 26.46 20.14 76%
8 64 1355.57 826.06 60% 27.88 15.32 54%
1 256 1489.37 1406.51 94% 46.93 43.72 93%
2 256 4936.19 2688.46 54% 100.24 46.39 46%
4 256 5251.10 2900.08 55% 107.98 50.47 46%
8 256 5270.11 3562.10 67% 108.26 66.19 61%
1 512 1877.57 2697.64 143% 57.29 83.02 144%
2 512 9183.49 5056.33 55% 205.26 86.43 42%
4 512 9349.59 8918.77 95% 212.78 176.99 83%
8 512 9318.29 8947.69 96% 216.55 176.34 81%
1 1024 4746.47 4669.10 98% 143.87 140.63 97%
2 1024 7406.49 4738.58 63% 245.16 131.04 53%
4 1024 8955.70 9337.82 104% 254.49 208.66 81%
8 1024 9384.48 9378.41 99% 260.24 197.77 75%
1 2048 8947.29 8955.98 100% 338.39 338.60 100%
2 2048 9259.04 9383.77 101% 326.02 224.92 68%
4 2048 9043.04 9325.38 103% 305.09 214.77 70%
8 2048 9358.04 8413.21 89% 291.07 189.74 65%
1 4096 8882.55 8885.40 100% 329.83 326.42 98%
2 4096 9247.07 9387.49 101% 345.55 240.51 69%
4 4096 9365.30 9341.82 99% 333.40 211.97 63%
8 4096 9397.64 8472.35 90% 312.21 182.51 58%
1 16384 8843.82 8725.50 98% 315.06 312.29 99%
2 16384 7978.70 9176.16 115% 280.34 317.62 113%
4 16384 8906.91 9251.15 103% 312.85 226.13 72%
8 16384 9385.36 9401.39 100% 341.53 207.53 60%

- TCP_RR

sessions size trans1 trans2   norm1 norm2
50 1 63021.83 93003.57 147% 2360.36 1972.08 83%
100 1 54599.87 93724.43 171% 2042.64 2119.50 103%
250 1 86546.75 119738.71 138% 2414.80 2508.14 103%
50 64 59785.19 97761.24 163% 2264.59 2136.85 94%
100 64 56028.20 103233.18 184% 2063.65 2260.91 109%
250 64 99116.82 113948.86 114% 2452.77 2393.88 97%
50 128 58025.96 84971.26 146% 2284.48 1907.75 83%
100 128 66302.76 83110.60 125% 2579.87 1881.60 72%
250 128 109262.23 137571.13 125% 2603.96 2911.55 111%
50 256 59766.57 77370.93 129% 2393.53 2018.02 84%
100 256 60862.02 80737.50 132% 2426.71 2024.51 83%
250 256 90076.66 141780.44 157% 2453.06 3056.27 124%

- TCP_STREAM (External Host to Guest)

sessions size throughput1 throughput2   norm1 norm2
1 64 373.53 575.24 154% 15.37 33.44 217%
2 64 1314.17 1087.49 82% 54.21 32.97 60%
4 64 3352.93 2404.70 71% 136.35 53.72 39%
8 64 6866.17 6587.17 95% 252.06 147.03 58%
1 256 677.37 767.65 113% 26.59 29.80 112%
2 256 4386.83 4550.40 103% 164.17 174.81 106%
4 256 8446.27 8814.95 104% 307.02 186.95 60%
8 256 6529.46 9370.44 143% 237.52 207.77 87%
1 512 2542.48 1371.31 53% 102.14 51.68 50%
2 512 9183.10 9354.81 101% 319.85 333.62 104%
4 512 8488.12 9230.58 108% 303.14 258.55 85%
8 512 8115.66 9393.33 115% 293.83 219.82 74%
1 1024 2653.16 2507.78 94% 100.88 96.56 95%
2 1024 6490.12 5033.35 77% 316.74 102.74 32%
4 1024 7582.51 9183.60 121% 298.87 210.48 70%
8 1024 8304.13 9341.48 112% 298.06 242.32 81%
1 2048 2495.06 2508.12 100% 99.60 102.96 103%
2 2048 8968.12 7554.57 84% 324.57 175.56 54%
4 2048 8627.85 9192.90 106% 310.91 205.98 66%
8 2048 7781.72 9422.10 121% 284.31 208.77 73%
1 4096 6067.46 4489.04 73% 233.63 171.01 73%
2 4096 9029.43 8684.09 96% 323.17 191.19 59%
4 4096 8284.30 9253.33 111% 306.93 268.67 87%
8 4096 7789.39 9388.32 120% 283.97 238.28 83%
1 16384 8660.90 9313.87 107% 318.88 315.61 98%
2 16384 8646.37 8871.30 102% 318.81 318.88 100%
4 16384 8386.02 9397.13 112% 306.28 205.44 67%
8 16384 8006.27 9404.98 117% 288.41 224.83 77%

Test results: smp = 4 no pinning - 1 sq 2 mq(q=4)

- TCP_MAERTS - Guest to External Host

sessions size throughput1 throughput2   norm1 norm2
1 64 513.13 331.27 64% 15.86 10.39 65%
2 64 1325.72 543.38 40% 26.54 12.59 47%
4 64 2735.18 1168.01 42% 37.61 12.63 33%
8 64 3084.53 2278.38 73% 41.20 27.01 65%
1 256 1313.25 996.64 75% 41.23 28.59 69%
2 256 3425.64 3337.68 97% 71.57 61.69 86%
4 256 8174.33 4252.54 52% 126.63 49.68 39%
8 256 9365.96 8411.91 89% 145.00 101.64 70%
1 512 2363.07 2370.35 100% 73.00 73.52 100%
2 512 6047.23 3636.29 60% 135.43 61.02 45%
4 512 9330.76 7165.99 76% 184.18 95.50 51%
8 512 8178.20 9221.16 112% 162.52 122.44 75%
1 1024 3196.63 4016.30 125% 98.41 120.75 122%
2 1024 4940.41 9296.51 188% 140.03 191.95 137%
4 1024 5696.43 9018.76 158% 147.30 150.01 101%
8 1024 9355.07 9342.43 99% 216.90 140.48 64%
1 2048 4248.99 5189.24 122% 131.95 157.01 118%
2 2048 9021.22 9262.81 102% 242.37 198.21 81%
4 2048 8357.81 9241.94 110% 225.88 180.61 79%
8 2048 8024.56 9327.27 116% 205.75 145.76 70%
1 4096 9270.51 8199.32 88% 326.88 269.53 82%
2 4096 9151.10 9348.33 102% 257.92 214.46 83%
4 4096 9243.34 9294.30 100% 281.20 164.35 58%
8 4096 9020.35 9339.32 103% 249.87 143.54 57%
1 16384 9357.69 9355.69 99% 319.15 322.94 101%
2 16384 9319.39 9076.63 97% 319.26 215.64 67%
4 16384 9352.99 9183.41 98% 308.98 145.26 47%
8 16384 9384.67 9353.54 99% 322.49 155.78 48%

- TCP_RR
sessions size throughput1 throughput2   norm1 norm2
50 1 59677.80 71761.59 120% 2020.92 1585.19 78%
100 1 55324.42 81302.10 146% 1889.49 1439.22 76%
250 1 73973.48 155031.82 209% 2495.73 2195.91 87%
50 64 53093.57 67893.59 127% 1879.41 1266.19 67%
100 64 59186.33 60128.69 101% 2044.43 1084.57 53%
250 64 64159.90 137389.38 214% 2186.02 1994.61 91%
50 128 54323.45 51615.38 95% 1924.99 908.08 47%
100 128 57543.49 69999.21 121% 2049.99 1266.95 61%
250 128 71541.81 123915.35 173% 2448.38 1845.62 75%
50 256 57184.12 68314.09 119% 2204.47 1393.02 63%
100 256 51983.40 61931.00 119% 1897.89 1216.24 64%
250 256 66542.32 119435.53 179% 2267.97 1887.71 83%

- TCP_STREAM - External Host to Guest
sessions size throughput1 throughput2   norm1 norm2
1 64 589.10 634.81 107% 27.01 34.51 127%
2 64 1805.57 1442.49 79% 70.44 39.59 56%
4 64 3066.91 3869.26 126% 121.89 84.96 69%
8 64 6602.37 7626.61 115% 211.07 116.27 55%
1 256 775.25 2092.37 269% 26.38 115.47 437%
2 256 6213.63 3527.24 56% 191.89 84.18 43%
4 256 7333.13 9190.87 125% 230.74 218.05 94%
8 256 6776.23 9396.71 138% 216.14 153.14 70%
1 512 4308.25 3536.38 82% 160.93 140.38 87%
2 512 7159.09 9212.97 128% 240.80 201.42 83%
4 512 7095.49 9241.70 130% 229.77 193.26 84%
8 512 6935.59 9398.09 135% 221.37 152.36 68%
1 1024 5566.68 6155.74 110% 250.52 245.05 97%
2 1024 7303.83 9212.72 126% 299.33 229.91 76%
4 1024 7179.31 9334.93 130% 233.62 192.51 82%
8 1024 7080.91 9396.46 132% 226.37 173.81 76%
1 2048 7477.66 2671.34 35% 250.25 96.85 38%
2 2048 6743.70 8986.82 133% 295.51 230.07 77%
4 2048 7186.19 9239.56 128% 228.13 219.78 96%
8 2048 6883.19 9375.34 136% 221.11 164.59 74%
1 4096 7582.83 5040.55 66% 291.98 178.36 61%
2 4096 7617.98 9345.37 122% 245.74 197.78 80%
4 4096 7157.87 9383.81 131% 226.80 188.65 83%
8 4096 5916.83 9401.45 158% 189.64 147.61 77%
1 16384 8417.28 5351.11 63% 309.57 301.47 97%
2 16384 8426.42 9303.45 110% 302.13 197.14 65%
4 16384 5695.26 8678.73 152% 180.68 192.30 106%
8 16384 6761.33 9374.67 138% 214.50 160.25 74%

Changes from V3:

- Rebase to the net-next
- Let queue 2 to be the control virtqueue to obey the spec
- Prodives irq affinity
- Choose txq based on processor id

References:

- V3: http://lwn.net/Articles/467283/

---

Jason Wang (3):
      virtio_ring: move queue_index to vring_virtqueue
      virtio: introduce a method to get the irq of a specific virtqueue
      virtio_net: multiqueue support

Krishna Kumar (1):
      virtio_net: Introduce VIRTIO_NET_F_MULTIQUEUE


 drivers/lguest/lguest_device.c |    8 
 drivers/net/virtio_net.c       |  695 +++++++++++++++++++++++++++++-----------
 drivers/s390/kvm/kvm_virtio.c  |    6 
 drivers/virtio/virtio_mmio.c   |   13 +
 drivers/virtio/virtio_pci.c    |   24 +
 drivers/virtio/virtio_ring.c   |   17 +
 include/linux/virtio.h         |    4 
 include/linux/virtio_config.h  |    4 
 include/linux/virtio_net.h     |    3 
 9 files changed, 572 insertions(+), 202 deletions(-)

-- 
Jason Wang

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox