Netdev List


* [PATCH] net: l2tp_eth: fix l2tp_eth_dev_xmit race
From: Eric Dumazet @ 2012-06-25 10:45 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, James Chapman

From: Eric Dumazet <edumazet@google.com>

It's illegal to dereference skb after giving it to l2tp_xmit_skb(),
as it might already be freed/reused.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: James Chapman <jchapman@katalix.com>
---
diff --git a/net/l2tp/l2tp_eth.c b/net/l2tp/l2tp_eth.c
index 185f12f..c3738f4 100644
--- a/net/l2tp/l2tp_eth.c
+++ b/net/l2tp/l2tp_eth.c
@@ -88,12 +88,12 @@ static int l2tp_eth_dev_xmit(struct sk_buff *skb, struct net_device *dev)
 	struct l2tp_eth *priv = netdev_priv(dev);
 	struct l2tp_session *session = priv->session;
 
-	l2tp_xmit_skb(session, skb, session->hdr_len);
-
 	dev->stats.tx_bytes += skb->len;
 	dev->stats.tx_packets++;
 
-	return 0;
+	l2tp_xmit_skb(session, skb, session->hdr_len);
+
+	return NETDEV_TX_OK;
 }
 
 static struct net_device_ops l2tp_eth_netdev_ops = {
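
The ordering rule behind this fix can be shown outside the kernel. The sketch below is a userspace illustration only: struct buf, struct tx_stats, buf_alloc(), consume() and xmit() are hypothetical stand-ins, not kernel APIs. consume() plays the role of l2tp_xmit_skb() in that it takes ownership of the buffer and may free it, so every field the caller needs is read before the hand-off.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for sk_buff and the device counters. */
struct buf {
	size_t len;
	unsigned char *data;
};

struct tx_stats {
	unsigned long tx_bytes;
	unsigned long tx_packets;
};

/* Allocate a zeroed buffer of len bytes; returns NULL on failure. */
static struct buf *buf_alloc(size_t len)
{
	struct buf *b = malloc(sizeof(*b));

	if (!b)
		return NULL;
	b->data = calloc(1, len ? len : 1);
	if (!b->data) {
		free(b);
		return NULL;
	}
	b->len = len;
	return b;
}

/* Takes ownership of b: the buffer is gone once this returns. */
static void consume(struct buf *b)
{
	free(b->data);
	free(b);
}

/* Account first, hand off last; b must not be touched after consume(). */
static void xmit(struct buf *b, struct tx_stats *stats)
{
	size_t len;

	assert(b && stats);
	len = b->len;		/* read while we still own b */
	stats->tx_bytes += len;
	stats->tx_packets++;
	consume(b);
}
```

Reading b->len after consume() would be the userspace analogue of the bug: the access may appear to work, or may read freed or reused memory.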


* RE: [PATCH] net-next: ipv6: ndisc: allocate a ndisc socket per inet6_dev
From: Menny_Hamburger @ 2012-06-25 11:08 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <1340620878.10893.26.camel@edumazet-glaptop>

I'm sorry for not responding to your post.
I really want to understand how this fixes our problem.
This fix will make the skb allocations succeed, but what mechanism releases the stuck socket associated with the bad device?

Thanks,
Menny

-----Original Message-----
From: Eric Dumazet [mailto:eric.dumazet@gmail.com] 
Sent: 25 June, 2012 13:41
To: Hamburger, Menny
Cc: netdev@vger.kernel.org
Subject: Re: [PATCH] net-next: ipv6: ndisc: allocate a ndisc socket per inet6_dev

On Mon, 2012-06-25 at 11:26 +0100, Menny_Hamburger@Dell.com wrote:
> From: mennyh <Menny_Hamburger@Dell.com>
> 
>  When an IPV6 network discovery packet does not get sent by the NIC, 
>  either because there is some S/W issue or a H/W problem with the NIC, NDP will stop
>  working and will not be able to send ndisc packets via other NICs on the machine.
>  The reason for this is that there is only one global socket assigned per network namespace
>  for network discovery (net->ipv6.ndisc_sk), and when this socket is busy, NDP cannot be
>  serviced by other NICs.
>  
>  This patch adds a kernel configuration option IPV6_NDISC_SOCKET_PER_INTERFACE.
>  When it is enabled, the kernel allocates a network discovery socket per inet6_dev on creation,
>  instead of a single socket per network namespace.
> 
> Signed-off-by: mennyh <Menny_Hamburger@Dell.com>
> ---

You obviously didn't see my patch to address this problem ?

I was waiting for your feedback and you post this wrong patch instead ?

This sucks.

Test this instead. Please ?


* Re: [PATCH 01/13] netfilter: fix problem with proto register
From: Pablo Neira Ayuso @ 2012-06-25 11:12 UTC (permalink / raw)
  To: Gao feng; +Cc: netdev, netfilter-devel
In-Reply-To: <1340289410-17642-1-git-send-email-gaofeng@cn.fujitsu.com>

On Thu, Jun 21, 2012 at 10:36:38PM +0800, Gao feng wrote:
> Before commit 2c352f444ccfa966a1aa4fd8e9ee29381c467448
> (netfilter: nf_conntrack: prepare namespace support for
> l4 protocol trackers), we registered sysctl before registering
> the protos, so if sysctl registration failed, the protos were
> not registered.
> 
> Now we register the protos first, so when sysctl registration
> fails the protos remain usable, which differs from the
> previous behavior.

No, this has to be an all-or-nothing game. If one fails, everything
else that you've registered has to be unregistered.

> So go back to registering sysctl before registering the protos.
> 
> Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
> ---
>  net/netfilter/nf_conntrack_proto.c |   36 +++++++++++++++++++++++-------------
>  1 files changed, 23 insertions(+), 13 deletions(-)
> 
> diff --git a/net/netfilter/nf_conntrack_proto.c b/net/netfilter/nf_conntrack_proto.c
> index 1ea9194..9bd88aa 100644
> --- a/net/netfilter/nf_conntrack_proto.c
> +++ b/net/netfilter/nf_conntrack_proto.c
> @@ -253,18 +253,23 @@ int nf_conntrack_l3proto_register(struct net *net,
>  {
>  	int ret = 0;
>  
> -	if (net == &init_net)
> -		ret = nf_conntrack_l3proto_register_net(proto);
> +	if (proto->init_net) {
> +		ret = proto->init_net(net);
> +		if (ret < 0)
> +			return ret;
> +	}
>  
> +	ret = nf_ct_l3proto_register_sysctl(net, proto);
>  	if (ret < 0)
>  		return ret;

This is still wrong.

If nf_ct_l3proto_register_sysctl fails, we'll leak the memory that has
been reserved by proto->init_net.
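
The all-or-nothing rule is usually written as a goto ladder, where each label undoes exactly the steps that had succeeded before the failure. The following is a userspace sketch under assumptions, not the netfilter code: struct reg_state, step() and proto_register() are hypothetical names, and the fail_at parameter exists only to force a failure at a chosen step.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical three-step registration; each flag records whether
 * the corresponding step is currently active.
 */
struct reg_state {
	bool net_ready;		/* stands in for proto->init_net() */
	bool sysctl_ready;	/* stands in for sysctl registration */
	bool proto_ready;	/* stands in for protocol registration */
};

static int step(bool *flag, bool fail)
{
	if (fail)
		return -1;
	*flag = true;
	return 0;
}

/* fail_at: 0 = no failure, 1..3 = fail at that step. */
static int proto_register(struct reg_state *st, int fail_at)
{
	int ret;

	assert(st);

	ret = step(&st->net_ready, fail_at == 1);
	if (ret < 0)
		return ret;	/* nothing to undo yet */

	ret = step(&st->sysctl_ready, fail_at == 2);
	if (ret < 0)
		goto err_net;

	ret = step(&st->proto_ready, fail_at == 3);
	if (ret < 0)
		goto err_sysctl;

	return 0;		/* success must not fall into the labels */

err_sysctl:
	st->sysctl_ready = false;
err_net:
	st->net_ready = false;
	return ret;
}
```

The explicit return 0 before the labels matters: without it the success path would run the cleanup code as well.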

> -	if (proto->init_net) {
> -		ret = proto->init_net(net);
> +	if (net == &init_net) {
> +		ret = nf_conntrack_l3proto_register_net(proto);
>  		if (ret < 0)
> -			return ret;
> +			nf_ct_l3proto_unregister_sysctl(net, proto);
>  	}
> -	return nf_ct_l3proto_register_sysctl(net, proto);
> +
> +	return ret;
>  }
>  EXPORT_SYMBOL_GPL(nf_conntrack_l3proto_register);
>  
> @@ -454,19 +459,24 @@ int nf_conntrack_l4proto_register(struct net *net,
>  				  struct nf_conntrack_l4proto *l4proto)
>  {
>  	int ret = 0;
> -	if (net == &init_net)
> -		ret = nf_conntrack_l4proto_register_net(l4proto);
>  
> -	if (ret < 0)
> -		return ret;
> -
> -	if (l4proto->init_net)
> +	if (l4proto->init_net) {
>  		ret = l4proto->init_net(net);
> +		if (ret < 0)
> +			return ret;
> +	}
>  
> +	ret = nf_ct_l4proto_register_sysctl(net, l4proto);
>  	if (ret < 0)
>  		return ret;
>  
> -	return nf_ct_l4proto_register_sysctl(net, l4proto);
> +	if (net == &init_net) {
> +		ret = nf_conntrack_l4proto_register_net(l4proto);
> +		if (ret < 0)
> +			nf_ct_l4proto_unregister_sysctl(net, l4proto);
> +	}
> +
> +	return ret;
>  }
>  EXPORT_SYMBOL_GPL(nf_conntrack_l4proto_register);
>  
> -- 
> 1.7.7.6
> 


* Re: [PATCH 04/13] netfilter: regard users as refcount for l4proto's per-net data
From: Pablo Neira Ayuso @ 2012-06-25 11:20 UTC (permalink / raw)
  To: Gao feng; +Cc: netdev, netfilter-devel
In-Reply-To: <1340289410-17642-4-git-send-email-gaofeng@cn.fujitsu.com>

On Thu, Jun 21, 2012 at 10:36:41PM +0800, Gao feng wrote:
> The meaning of nf_proto_net's users field is currently confusing.
> We should treat it as the refcount for l4proto's per-net data,
> because two l4protos may share the same per-net data.
> 
> So increment pn->users when nf_conntrack_l4proto_register
> succeeds, and decrement it in nf_conntrack_l4proto_unregister.
> 
> Because nf_conntrack_l3proto_ipv[4|6] don't share per-net
> data, we don't need to add a refcnt for their per-net data.
> 
> Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
> ---
>  net/netfilter/nf_conntrack_proto.c |   76 ++++++++++++++++++++++--------------
>  1 files changed, 46 insertions(+), 30 deletions(-)
> 
> diff --git a/net/netfilter/nf_conntrack_proto.c b/net/netfilter/nf_conntrack_proto.c
> index 9d6b6ab..63612e6 100644
> --- a/net/netfilter/nf_conntrack_proto.c
> +++ b/net/netfilter/nf_conntrack_proto.c
[...]
> @@ -458,23 +446,32 @@ int nf_conntrack_l4proto_register(struct net *net,
>  				  struct nf_conntrack_l4proto *l4proto)
>  {
>  	int ret = 0;
> +	struct nf_proto_net *pn = NULL;
>  
>  	if (l4proto->init_net) {
>  		ret = l4proto->init_net(net, l4proto->l3proto);
>  		if (ret < 0)
> -			return ret;
> +			goto out;
>  	}
>  
> -	ret = nf_ct_l4proto_register_sysctl(net, l4proto);
> +	pn = nf_ct_l4proto_net(net, l4proto);
> +	if (pn == NULL)
> +		goto out;

Same thing here, we're leaking memory allocated by l4proto->init_net.

> +	ret = nf_ct_l4proto_register_sysctl(net, pn, l4proto);
>  	if (ret < 0)
> -		return ret;
> +		goto out;
>  
>  	if (net == &init_net) {
>  		ret = nf_conntrack_l4proto_register_net(l4proto);
> -		if (ret < 0)
> -			nf_ct_l4proto_unregister_sysctl(net, l4proto);
> +		if (ret < 0) {
> +			nf_ct_l4proto_unregister_sysctl(net, pn, l4proto);
> +			goto out;

Better replace the two lines above by:

goto out_register_net;

and then...

> +		}
>  	}
>  
> +	pn->users++;

out_register_net:
        nf_ct_l4proto_unregister_sysctl(net, pn, l4proto);

> +out:
>  	return ret;

I think this change is similar to patch 01/13; you should
send it as a separate patch.

>  }
>  EXPORT_SYMBOL_GPL(nf_conntrack_l4proto_register);
> @@ -499,10 +496,18 @@ nf_conntrack_l4proto_unregister_net(struct nf_conntrack_l4proto *l4proto)
>  void nf_conntrack_l4proto_unregister(struct net *net,
>  				     struct nf_conntrack_l4proto *l4proto)
>  {
> +	struct nf_proto_net *pn = NULL;
> +
>  	if (net == &init_net)
>  		nf_conntrack_l4proto_unregister_net(l4proto);
>  
> -	nf_ct_l4proto_unregister_sysctl(net, l4proto);
> +	pn = nf_ct_l4proto_net(net, l4proto);
> +	if (pn == NULL)
> +		return;
> +
> +	pn->users--;
> +	nf_ct_l4proto_unregister_sysctl(net, pn, l4proto);
> +
>  	/* Remove all contrack entries for this protocol */
>  	rtnl_lock();
>  	nf_ct_iterate_cleanup(net, kill_l4proto, l4proto);
> @@ -514,11 +519,15 @@ int nf_conntrack_proto_init(struct net *net)
>  {
>  	unsigned int i;
>  	int err;
> +	struct nf_proto_net *pn = nf_ct_l4proto_net(net,
> +					&nf_conntrack_l4proto_generic);
> +
>  	err = nf_conntrack_l4proto_generic.init_net(net,
>  					nf_conntrack_l4proto_generic.l3proto);
>  	if (err < 0)
>  		return err;
>  	err = nf_ct_l4proto_register_sysctl(net,
> +					    pn,
>  					    &nf_conntrack_l4proto_generic);
>  	if (err < 0)
>  		return err;
> @@ -528,13 +537,20 @@ int nf_conntrack_proto_init(struct net *net)
>  			rcu_assign_pointer(nf_ct_l3protos[i],
>  					   &nf_conntrack_l3proto_generic);
>  	}
> +
> +	pn->users++;
>  	return 0;
>  }
>  
>  void nf_conntrack_proto_fini(struct net *net)
>  {
>  	unsigned int i;
> +	struct nf_proto_net *pn = nf_ct_l4proto_net(net,
> +					&nf_conntrack_l4proto_generic);
> +
> +	pn->users--;
>  	nf_ct_l4proto_unregister_sysctl(net,
> +					pn,
>  					&nf_conntrack_l4proto_generic);
>  	if (net == &init_net) {
>  		/* free l3proto protocol tables */
> -- 
> 1.7.7.6
> 


* Re: [RFC net-next 05/14] Fix intel/ixgbevf
From: Yuval Mintz @ 2012-06-25 11:23 UTC (permalink / raw)
  To: Greg Rose; +Cc: eilong, Alexander Duyck, netdev, davem, Jeff Kirsher
In-Reply-To: <20120619110704.000045b2@unknown>

>>>>  	 * It's easy to be greedy for MSI-X vectors, but it

>>>> really @@ -2022,8 +2022,9 @@ static int
>>>> ixgbevf_set_interrupt_capability(struct ixgbevf_adapter *adapter)
>>>>  	 * than CPU's.  So let's be conservative and only ask for
>>>>  	 * (roughly) twice the number of vectors as there are
>>>> CPU's. */
>>>> +	ncpu = min_t(int, num_online_cpus(),
>>>> DEFAULT_MAX_NUM_RSS_QUEUES); v_budget =
>>>> min(adapter->num_rx_queues + adapter->num_tx_queues,
>>>> -		       (int)(num_online_cpus() * 2)) +
>>>> NON_Q_VECTORS;
>>>> +		       ncpu * 2) + NON_Q_VECTORS;
>>>>  
>>>>  	/* A failure in MSI-X entry allocation isn't fatal, but
>>>> it does
>>>>  	 * mean we disable MSI-X capabilities of the adapter. */
>>> This change is pointless on the ixgbevf driver.  The VF hardware can
>>> support at most 4 RSS queues.  As such num_rx_queues + num_tx_queues
>>> will never exceed 8 so you are essentially adding an unnecessary
>>> min(x,8).
>>
>> It is pointless with the current value, but if someone edits the
>> kernel source code and replaces the 8 with a 2, it will become
>> meaningful. The compiler will optimize this part, and I think that for
>> completeness, it is best to keep this reference so a future default
>> number change will not be missed.
>>
>> Eilon
> 
> I don't feel there is any real point to making this change to the
> ixgbevf driver.  82599 virtual functions have 3 MSI-X vectors, one of
> which is for the mailbox and the other two can be shared with tx/rx
> queue pairs or assigned separately to tx or rx queues.  So this code is
> pointless no matter what value is set for DEFAULT_MAX_NUM_RSS_QUEUES.
> Perhaps the patches to the other drivers in your RFC will have some
> effect but this one looks like a no-op for the ixgbevf driver so there
> is no reason for it.
> 
> - Greg
> 

Hi Greg,

Since we're changing the RFC to use a new wrapper function which should
replace num_online_cpus (for this purpose), the next RFC version will still
change this driver (for uniformity, if nothing else).

Of course, if you still have reservations about this change, please send them.

Thanks,
Yuval


* [RFC net-next (v2) 02/14] mlx4: set maximal number of default RSS queues
From: Yuval Mintz @ 2012-06-25 11:45 UTC (permalink / raw)
  To: davem, netdev; +Cc: eilong, Yuval Mintz, Or Gerlitz
In-Reply-To: <1340624745-8650-1-git-send-email-yuvalmin@broadcom.com>

Signed-off-by: Yuval Mintz <yuvalmin@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>

Cc: Or Gerlitz <ogerlitz@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/main.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c
index ee6f4fe..8f990a3 100644
--- a/drivers/net/ethernet/mellanox/mlx4/main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/main.c
@@ -41,6 +41,7 @@
 #include <linux/slab.h>
 #include <linux/io-mapping.h>
 #include <linux/delay.h>
+#include <linux/netdevice.h>
 
 #include <linux/mlx4/device.h>
 #include <linux/mlx4/doorbell.h>
@@ -1539,8 +1540,8 @@ static void mlx4_enable_msi_x(struct mlx4_dev *dev)
 	struct mlx4_priv *priv = mlx4_priv(dev);
 	struct msix_entry *entries;
 	int nreq = min_t(int, dev->caps.num_ports *
-			 min_t(int, num_online_cpus() + 1, MAX_MSIX_P_PORT)
-				+ MSIX_LEGACY_SZ, MAX_MSIX);
+			 min_t(int, netif_get_num_default_rss_queues() + 1,
+			       MAX_MSIX_P_PORT) + MSIX_LEGACY_SZ, MAX_MSIX);
 	int err;
 	int i;
 
-- 
1.7.9.rc2


* [RFC net-next (v2) 06/14] cxgb4: set maximal number of default RSS queues
From: Yuval Mintz @ 2012-06-25 11:45 UTC (permalink / raw)
  To: davem, netdev; +Cc: eilong, Yuval Mintz, Divy Le Ray
In-Reply-To: <1340624745-8650-1-git-send-email-yuvalmin@broadcom.com>

Signed-off-by: Yuval Mintz <yuvalmin@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>

Cc: Divy Le Ray <divy@chelsio.com>
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index e1f96fb..5ed49af 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -3493,8 +3493,8 @@ static void __devinit cfg_queues(struct adapter *adap)
 	 */
 	if (n10g)
 		q10g = (MAX_ETH_QSETS - (adap->params.nports - n10g)) / n10g;
-	if (q10g > num_online_cpus())
-		q10g = num_online_cpus();
+	if (q10g > netif_get_num_default_rss_queues())
+		q10g = netif_get_num_default_rss_queues();
 
 	for_each_port(adap, i) {
 		struct port_info *pi = adap2pinfo(adap, i);
-- 
1.7.9.rc2


* [RFC net-next (v2) 05/14] cxgb3: set maximal number of default RSS queues
From: Yuval Mintz @ 2012-06-25 11:45 UTC (permalink / raw)
  To: davem, netdev; +Cc: eilong, Yuval Mintz, Divy Le Ray
In-Reply-To: <1340624745-8650-1-git-send-email-yuvalmin@broadcom.com>

Signed-off-by: Yuval Mintz <yuvalmin@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>

Cc: Divy Le Ray <divy@chelsio.com>
---
 drivers/net/ethernet/chelsio/cxgb3/cxgb3_main.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb3/cxgb3_main.c b/drivers/net/ethernet/chelsio/cxgb3/cxgb3_main.c
index abb6ce7..9b08749 100644
--- a/drivers/net/ethernet/chelsio/cxgb3/cxgb3_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb3/cxgb3_main.c
@@ -3050,7 +3050,7 @@ static struct pci_error_handlers t3_err_handler = {
 static void set_nqsets(struct adapter *adap)
 {
 	int i, j = 0;
-	int num_cpus = num_online_cpus();
+	int num_cpus = netif_get_num_default_rss_queues();
 	int hwports = adap->params.nports;
 	int nqsets = adap->msix_nvectors - 1;
 
-- 
1.7.9.rc2


* [RFC net-next (v2) 04/14] qlge: set maximal number of default RSS queues
From: Yuval Mintz @ 2012-06-25 11:45 UTC (permalink / raw)
  To: davem, netdev
  Cc: eilong, Yuval Mintz, Anirban Chakraborty, Jitendra Kalsaria,
	Ron Mercer
In-Reply-To: <1340624745-8650-1-git-send-email-yuvalmin@broadcom.com>

Signed-off-by: Yuval Mintz <yuvalmin@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>

Cc: Anirban Chakraborty <anirban.chakraborty@qlogic.com>
Cc: Jitendra Kalsaria <jitendra.kalsaria@qlogic.com>
Cc: Ron Mercer <ron.mercer@qlogic.com>
---
 drivers/net/ethernet/qlogic/qlge/qlge_main.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qlge/qlge_main.c b/drivers/net/ethernet/qlogic/qlge/qlge_main.c
index 09d8d33..3c3499d 100644
--- a/drivers/net/ethernet/qlogic/qlge/qlge_main.c
+++ b/drivers/net/ethernet/qlogic/qlge/qlge_main.c
@@ -4649,7 +4649,7 @@ static int __devinit qlge_probe(struct pci_dev *pdev,
 	int err = 0;
 
 	ndev = alloc_etherdev_mq(sizeof(struct ql_adapter),
-			min(MAX_CPUS, (int)num_online_cpus()));
+			min(MAX_CPUS, netif_get_num_default_rss_queues()));
 	if (!ndev)
 		return -ENOMEM;
 
-- 
1.7.9.rc2


* [RFC net-next (v2) 03/14] vxge: set maximal number of default RSS queues
From: Yuval Mintz @ 2012-06-25 11:45 UTC (permalink / raw)
  To: davem, netdev; +Cc: eilong, Yuval Mintz, Jon Mason
In-Reply-To: <1340624745-8650-1-git-send-email-yuvalmin@broadcom.com>

Signed-off-by: Yuval Mintz <yuvalmin@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>

Cc: Jon Mason <jdmason@kudzu.us>
---
 drivers/net/ethernet/neterion/vxge/vxge-main.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/neterion/vxge/vxge-main.c b/drivers/net/ethernet/neterion/vxge/vxge-main.c
index 2578eb1..2fd1edb 100644
--- a/drivers/net/ethernet/neterion/vxge/vxge-main.c
+++ b/drivers/net/ethernet/neterion/vxge/vxge-main.c
@@ -3687,7 +3687,8 @@ static int __devinit vxge_config_vpaths(
 			return 0;
 
 		if (!driver_config->g_no_cpus)
-			driver_config->g_no_cpus = num_online_cpus();
+			driver_config->g_no_cpus =
+				netif_get_num_default_rss_queues();
 
 		driver_config->vpath_per_dev = driver_config->g_no_cpus >> 1;
 		if (!driver_config->vpath_per_dev)
-- 
1.7.9.rc2


* [RFC net-next (v2) 00/14] default maximal number of RSS queues in mq drivers
From: Yuval Mintz @ 2012-06-25 11:45 UTC (permalink / raw)
  To: davem, netdev
  Cc: eilong, Yuval Mintz, Divy Le Ray, Or Gerlitz, Jon Mason,
	Anirban Chakraborty, Jitendra Kalsaria, Ron Mercer, Jeff Kirsher,
	Jon Mason, Andrew Gallatin, Sathya Perla, Subbu Seetharaman,
	Ajit Khaparde, Matt Carlson, Michael Chan, Eric Dumazet,
	Ben Hutchings

Different vendors support different numbers of RSS queues by default. Today,
there exists an ethtool API through which users can change the number of
channels their driver supports; this enables us to pursue the goal of using
a default number of RSS queues in various multi-queue drivers.

This RFC intends to achieve the above default by upper-limiting the number
of interrupts multi-queue drivers request (by default, not via the new API)
in correlation with the number of cpus on the machine.

After examining multi-queue drivers that call alloc_etherdev_mq[s],
it became evident that most drivers allocate their devices using hard-coded
values. Changing those defaults directly will most likely cause a regression. 

However, (most) multi-queue drivers look at the number of online cpus when
requesting interrupts. We assume that the number of interrupts the
driver manages to request is propagated across the driver, and the number
of RSS queues it configures is based upon it.

This RFC modifies said logic - if the number of cpus is large enough, use
a smaller default value instead. This serves 2 main purposes: 
 1. A step toward uniformity in the number of RSS queues across drivers.
 2. It prevents wasteful requests for interrupts on machines with many cpus.
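
The clamping described above can be sketched in userspace as follows. The cap of 8 is the DEFAULT_MAX_NUM_RSS_QUEUES value from patch 01/14 of this series; default_rss_queues() and driver_rss_queues() are hypothetical names used for illustration, and the cpu count is passed as a parameter because num_online_cpus() is not available outside the kernel.

```c
#include <assert.h>

/* Cap taken from patch 01/14 of this series. */
#define DEFAULT_MAX_NUM_RSS_QUEUES 8

/* Userspace stand-in for the proposed helper: the smaller of the
 * online cpu count and the default cap.
 */
static int default_rss_queues(int online_cpus)
{
	assert(online_cpus > 0);
	return online_cpus < DEFAULT_MAX_NUM_RSS_QUEUES ?
	       online_cpus : DEFAULT_MAX_NUM_RSS_QUEUES;
}

/* A driver then takes the smaller of its own hardware limit and the
 * default, in the way the igb patch (11/14) does.
 */
static int driver_rss_queues(int hw_max_queues, int online_cpus)
{
	int def = default_rss_queues(online_cpus);

	return hw_max_queues < def ? hw_max_queues : def;
}
```

With these definitions a 4-cpu machine still gets 4 queues, while a 64-cpu machine is limited to 8 unless the hardware limit is lower.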

Note that no testing was done on this RFC (other than on the bnx2x driver)
beyond a compilation test.

Drivers identified as multi-queue, handled in this RFC:

* mellanox mlx4
* neterion vxge
* qlogic   qlge
* intel    igb, ixgbe, ixgbevf
* chelsio  cxgb3, cxgb4
* myricom  myri10ge
* emulex   benet
* broadcom tg3, bnx2, bnx2x

Drivers identified as multi-queue in which no reference to the number of online
cpus was found, and which are thus unhandled in this RFC:

* neterion  s2io
* marvell   mv643xx
* freescale gianfar
* ibm       ehea
* ti        cpmac
* sun       niu
* sfc       efx
* chelsio   cxgb4vf

Cheers,
Yuval Mintz

Cc: Divy Le Ray <divy@chelsio.com>
Cc: Or Gerlitz <ogerlitz@mellanox.com>
Cc: Jon Mason <jdmason@kudzu.us>
Cc: Anirban Chakraborty <anirban.chakraborty@qlogic.com>
Cc: Jitendra Kalsaria <jitendra.kalsaria@qlogic.com>
Cc: Ron Mercer <ron.mercer@qlogic.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Jon Mason <mason@myri.com>
Cc: Andrew Gallatin <gallatin@myri.com>
Cc: Sathya Perla <sathya.perla@emulex.com>
Cc: Subbu Seetharaman <subbu.seetharaman@emulex.com>
Cc: Ajit Khaparde <ajit.khaparde@emulex.com>
Cc: Matt Carlson <mcarlson@broadcom.com>
Cc: Michael Chan <mchan@broadcom.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>


* [RFC net-next (v2) 01/14] net-next: Add netif_get_num_default_rss_queues
From: Yuval Mintz @ 2012-06-25 11:45 UTC (permalink / raw)
  To: davem, netdev; +Cc: eilong, Yuval Mintz
In-Reply-To: <1340624745-8650-1-git-send-email-yuvalmin@broadcom.com>

Most multi-queue networking drivers consider the number of online cpus when
configuring RSS queues.
This patch adds a wrapper around the number of online cpus, setting an upper
limit on the number of cpus a driver should consider (by default) when
allocating resources for its queues.

Signed-off-by: Yuval Mintz <yuvalmin@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
---
 include/linux/netdevice.h |    3 +++
 net/core/dev.c            |   11 +++++++++++
 2 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 2c2ecea..ab0251d 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2119,6 +2119,9 @@ static inline int netif_copy_real_num_queues(struct net_device *to_dev,
 #endif
 }
 
+#define DEFAULT_MAX_NUM_RSS_QUEUES	(8)
+extern int netif_get_num_default_rss_queues(void);
+
 /* Use this variant when it is known for sure that it
  * is executing from hardware interrupt context or with hardware interrupts
  * disabled.
diff --git a/net/core/dev.c b/net/core/dev.c
index 57c4f9b..caae49c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1793,6 +1793,17 @@ int netif_set_real_num_rx_queues(struct net_device *dev, unsigned int rxq)
 EXPORT_SYMBOL(netif_set_real_num_rx_queues);
 #endif
 
+/* netif_get_num_default_rss_queues - default number of RSS queues
+ *
+ * This routine should set an upper limit on the number of RSS queues
+ * used by default by multiqueue devices.
+ */
+int netif_get_num_default_rss_queues(void)
+{
+	return min_t(int, DEFAULT_MAX_NUM_RSS_QUEUES, num_online_cpus());
+}
+EXPORT_SYMBOL(netif_get_num_default_rss_queues);
+
 static inline void __netif_reschedule(struct Qdisc *q)
 {
 	struct softnet_data *sd;
-- 
1.7.9.rc2


* [RFC net-next (v2) 08/14] tg3: set maximal number of default RSS queues
From: Yuval Mintz @ 2012-06-25 11:45 UTC (permalink / raw)
  To: davem, netdev; +Cc: eilong, Yuval Mintz, Matt Carlson
In-Reply-To: <1340624745-8650-1-git-send-email-yuvalmin@broadcom.com>

Signed-off-by: Yuval Mintz <yuvalmin@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>

Cc: Matt Carlson <mcarlson@broadcom.com>
---
 drivers/net/ethernet/broadcom/tg3.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
index e47ff8b..6cbab03 100644
--- a/drivers/net/ethernet/broadcom/tg3.c
+++ b/drivers/net/ethernet/broadcom/tg3.c
@@ -9908,7 +9908,7 @@ static bool tg3_enable_msix(struct tg3 *tp)
 	int i, rc;
 	struct msix_entry msix_ent[tp->irq_max];
 
-	tp->irq_cnt = num_online_cpus();
+	tp->irq_cnt = netif_get_num_default_rss_queues();
 	if (tp->irq_cnt > 1) {
 		/* We want as many rx rings enabled as there are cpus.
 		 * In multiqueue MSI-X mode, the first MSI-X vector
-- 
1.7.9.rc2


* [RFC net-next (v2) 11/14] igb: set maximal number of default RSS queues
From: Yuval Mintz @ 2012-06-25 11:45 UTC (permalink / raw)
  To: davem, netdev; +Cc: eilong, Yuval Mintz, Jeff Kirsher
In-Reply-To: <1340624745-8650-1-git-send-email-yuvalmin@broadcom.com>

Signed-off-by: Yuval Mintz <yuvalmin@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>

Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/igb/igb_main.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index 01ced68..b11ee60 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -2465,7 +2465,8 @@ static int __devinit igb_sw_init(struct igb_adapter *adapter)
 		break;
 	}
 
-	adapter->rss_queues = min_t(u32, max_rss_queues, num_online_cpus());
+	adapter->rss_queues = min_t(u32, max_rss_queues,
+				    netif_get_num_default_rss_queues());
 
 	/* Determine if we need to pair queues. */
 	switch (hw->mac.type) {
-- 
1.7.9.rc2


* [RFC net-next (v2) 10/14] bnx2x: set maximal number of default RSS queues
From: Yuval Mintz @ 2012-06-25 11:45 UTC (permalink / raw)
  To: davem, netdev; +Cc: eilong, Yuval Mintz
In-Reply-To: <1340624745-8650-1-git-send-email-yuvalmin@broadcom.com>

Signed-off-by: Yuval Mintz <yuvalmin@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
index daa894b..53659f3 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
@@ -822,7 +822,8 @@ static inline int bnx2x_calc_num_queues(struct bnx2x *bp)
 {
 	return  num_queues ?
 		 min_t(int, num_queues, BNX2X_MAX_QUEUES(bp)) :
-		 min_t(int, num_online_cpus(), BNX2X_MAX_QUEUES(bp));
+		 min_t(int, netif_get_num_default_rss_queues(),
+		       BNX2X_MAX_QUEUES(bp));
 }
 
 static inline void bnx2x_clear_sge_mask_next_elems(struct bnx2x_fastpath *fp)
-- 
1.7.9.rc2


* [RFC net-next (v2) 09/14] bnx2: set maximal number of default RSS queues
From: Yuval Mintz @ 2012-06-25 11:45 UTC (permalink / raw)
  To: davem, netdev; +Cc: eilong, Yuval Mintz, Michael Chan
In-Reply-To: <1340624745-8650-1-git-send-email-yuvalmin@broadcom.com>

Signed-off-by: Yuval Mintz <yuvalmin@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>

Cc: Michael Chan <mchan@broadcom.com>
---
 drivers/net/ethernet/broadcom/bnx2.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2.c b/drivers/net/ethernet/broadcom/bnx2.c
index 9b69a62..3f49285 100644
--- a/drivers/net/ethernet/broadcom/bnx2.c
+++ b/drivers/net/ethernet/broadcom/bnx2.c
@@ -6246,7 +6246,7 @@ bnx2_enable_msix(struct bnx2 *bp, int msix_vecs)
 static int
 bnx2_setup_int_mode(struct bnx2 *bp, int dis_msi)
 {
-	int cpus = num_online_cpus();
+	int cpus = netif_get_num_default_rss_queues();
 	int msix_vecs;
 
 	if (!bp->num_req_rx_rings)
-- 
1.7.9.rc2


* [RFC net-next (v2) 07/14] myri10ge: set maximal number of default RSS queues
From: Yuval Mintz @ 2012-06-25 11:45 UTC (permalink / raw)
  To: davem, netdev; +Cc: eilong, Yuval Mintz, Jon Mason
In-Reply-To: <1340624745-8650-1-git-send-email-yuvalmin@broadcom.com>

Signed-off-by: Yuval Mintz <yuvalmin@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>

Cc: Jon Mason <mason@myri.com>
---
 drivers/net/ethernet/myricom/myri10ge/myri10ge.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/myricom/myri10ge/myri10ge.c b/drivers/net/ethernet/myricom/myri10ge/myri10ge.c
index 90153fc..fa85cf1 100644
--- a/drivers/net/ethernet/myricom/myri10ge/myri10ge.c
+++ b/drivers/net/ethernet/myricom/myri10ge/myri10ge.c
@@ -3775,7 +3775,7 @@ static void myri10ge_probe_slices(struct myri10ge_priv *mgp)
 
 	mgp->num_slices = 1;
 	msix_cap = pci_find_capability(pdev, PCI_CAP_ID_MSIX);
-	ncpus = num_online_cpus();
+	ncpus = netif_get_num_default_rss_queues();
 
 	if (myri10ge_max_slices == 1 || msix_cap == 0 ||
 	    (myri10ge_max_slices == -1 && ncpus < 2))
-- 
1.7.9.rc2


* [RFC net-next (v2) 14/14] ixgbevf: set maximal number of default RSS queues
From: Yuval Mintz @ 2012-06-25 11:45 UTC (permalink / raw)
  To: davem, netdev
  Cc: eilong, Yuval Mintz, Jeff Kirsher, Alexander Duyck, Greg Rose
In-Reply-To: <1340624745-8650-1-git-send-email-yuvalmin@broadcom.com>

Signed-off-by: Yuval Mintz <yuvalmin@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>

Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Cc: Greg Rose <gregory.v.rose@intel.com>
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index f69ec42..7889644 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -2023,7 +2023,7 @@ static int ixgbevf_set_interrupt_capability(struct ixgbevf_adapter *adapter)
 	 * (roughly) twice the number of vectors as there are CPU's.
 	 */
 	v_budget = min(adapter->num_rx_queues + adapter->num_tx_queues,
-		       (int)(num_online_cpus() * 2)) + NON_Q_VECTORS;
+		       netif_get_num_default_rss_queues() * 2) + NON_Q_VECTORS;
 
 	/* A failure in MSI-X entry allocation isn't fatal, but it does
 	 * mean we disable MSI-X capabilities of the adapter. */
-- 
1.7.9.rc2


* [RFC net-next (v2) 12/14] ixgbe: set maximal number of default RSS queues
From: Yuval Mintz @ 2012-06-25 11:45 UTC (permalink / raw)
  To: davem, netdev
  Cc: eilong, Yuval Mintz, Jeff Kirsher, John Fastabend,
	Alexander Duyck
In-Reply-To: <1340624745-8650-1-git-send-email-yuvalmin@broadcom.com>

Signed-off-by: Yuval Mintz <yuvalmin@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>

Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
index af1a531..b352ea8 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
@@ -802,7 +802,7 @@ static int ixgbe_set_interrupt_capability(struct ixgbe_adapter *adapter)
 	 * The default is to use pairs of vectors.
 	 */
 	v_budget = max(adapter->num_rx_queues, adapter->num_tx_queues);
-	v_budget = min_t(int, v_budget, num_online_cpus());
+	v_budget = min_t(int, v_budget, netif_get_num_default_rss_queues());
 	v_budget += NON_Q_VECTORS;
 
 	/*
-- 
1.7.9.rc2


* [RFC net-next (v2) 13/14] be2net: set maximal number of default RSS queues
From: Yuval Mintz @ 2012-06-25 11:45 UTC (permalink / raw)
  To: davem, netdev
  Cc: eilong, Yuval Mintz, Sathya Perla, Subbu Seetharaman,
	Ajit Khaparde
In-Reply-To: <1340624745-8650-1-git-send-email-yuvalmin@broadcom.com>

Signed-off-by: Yuval Mintz <yuvalmin@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>

Cc: Sathya Perla <sathya.perla@emulex.com>
Cc: Subbu Seetharaman <subbu.seetharaman@emulex.com>
Cc: Ajit Khaparde <ajit.khaparde@emulex.com>
---
 drivers/net/ethernet/emulex/benet/be_main.c |   10 ++++++----
 1 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/emulex/benet/be_main.c b/drivers/net/ethernet/emulex/benet/be_main.c
index 5a34503..a8564d0 100644
--- a/drivers/net/ethernet/emulex/benet/be_main.c
+++ b/drivers/net/ethernet/emulex/benet/be_main.c
@@ -2142,12 +2142,14 @@ static void be_msix_disable(struct be_adapter *adapter)
 
 static uint be_num_rss_want(struct be_adapter *adapter)
 {
+	u32 num = 0;
 	if ((adapter->function_caps & BE_FUNCTION_CAPS_RSS) &&
 	     !sriov_want(adapter) && be_physfn(adapter) &&
-	     !be_is_mc(adapter))
-		return (adapter->be3_native) ? BE3_MAX_RSS_QS : BE2_MAX_RSS_QS;
-	else
-		return 0;
+	     !be_is_mc(adapter)) {
+		num = (adapter->be3_native) ? BE3_MAX_RSS_QS : BE2_MAX_RSS_QS;
+		num = min_t(u32, num, (u32)netif_get_num_default_rss_queues());
+	}
+	return num;
 }
 
 static void be_msix_enable(struct be_adapter *adapter)
-- 
1.7.9.rc2


* [net-next RFC V3 0/6] Multiqueue support in tun/tap
From: Jason Wang @ 2012-06-25 11:59 UTC (permalink / raw)
  To: mst, akong, habanero, tahm, haixiao, jwhan, ernesto.martin,
	mashirle, davem, netdev, linux-kernel, krkumar2
  Cc: shemminger, edumazet, Jason Wang
In-Reply-To: <20120625060830.6765.27584.stgit@amd-6168-8-1.englab.nay.redhat.com>

Hello All:

This is an update of multiqueue support in tun/tap from V2. Please consider
merging it.

The main idea of this series is to let tun/tap devices benefit from
multiqueue network cards and multi-core hosts by making them able to
transmit and receive packets through multiple sockets/queues. This series allows
multiple sockets to be attached to and detached from a tun/tap device. Userspace
can use this parallelism to achieve higher throughput.

A quick overview of the design:

- Move the socket from tun_device to tun_file.
- Allow multiple sockets to be attached to a tun/tap device.
- Use RCU to synchronize the data path and system calls.
- Use a simple hash-based queue selection algorithm to choose the tx queue.
- Add two new ioctls for userspace to attach a socket to, and detach it from,
  the device.
- ABI compatibility is maintained; userspace that only uses one queue won't
  need any changes.
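
To illustrate the hash-based tx queue selection mentioned above, here is a
minimal userspace sketch. It is not the code from this series: the names
sketch_flow_hash() and sketch_select_queue() are hypothetical, FNV-1a stands in
for whatever flow hash the driver really uses, and the multiply-and-shift
scaling is only one assumed way of mapping a hash onto the attached queues.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical flow hash (FNV-1a over the 4-tuple). The real driver
 * would more likely reuse a hash already computed for the skb.
 */
static uint32_t sketch_flow_hash(uint32_t saddr, uint32_t daddr,
				 uint16_t sport, uint16_t dport)
{
	const uint32_t words[3] = {
		saddr, daddr, ((uint32_t)sport << 16) | dport
	};
	uint32_t h = 2166136261u;	/* FNV-1a offset basis */
	size_t i, b;

	for (i = 0; i < 3; i++) {
		for (b = 0; b < 4; b++) {
			h ^= (words[i] >> (8 * b)) & 0xffu;
			h *= 16777619u;	/* FNV prime */
		}
	}
	return h;
}

/*
 * Scale a 32-bit hash onto [0, numqueues) without a modulo.
 * Returns 0 when no queue is attached, so the caller must still
 * check numqueues before using the result as an index.
 */
static unsigned int sketch_select_queue(uint32_t hash, unsigned int numqueues)
{
	if (numqueues == 0)
		return 0;
	return (unsigned int)(((uint64_t)hash * numqueues) >> 32);
}
```

Because the queue index depends only on the flow hash and the number of
attached queues, packets of one flow stay on one queue as long as the number of
queues does not change.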

Performance test:

This series was originally designed to serve as the backend of multiqueue
virtio-net in KVM guests, but the design is generic enough to be reused
by any other type of userspace.

Since I will post a multiqueue virtio-net series as an RFC, I will post the
performance results in that thread. To summarize, multiqueue improves the
transaction rate in the TCP_RR test but shows some regression in small-packet
transmission in the TCP_STREAM test.

Martin tested the multiqueue tap with their userspace and saw an improvement in
terms of packets per second.

References:
- V2 of multiqueue tun/tap, http://lwn.net/Articles/459270/
- V1 of multiqueue tun/tap, http://www.mail-archive.com/kvm@vger.kernel.org/msg59479.html

Changes from V2:

- Rebase to the latest net-next
- Fix netdev leak when tun_attach fails
- Fix return value of TUNSETOWNER
- Purge the receive queue in socket destructor
- Enable multiqueue tun (V1 and V2 only allowed mq to be enabled for tap)
- Add per-queue u64 statistics
- Fix wrong BUG_ON() check in tun_detach()
- Check numqueues instead of tfile[0] in tun_set_iff() to let tunctl -d work
  correctly
- Set numqueues to MAX_TAP_QUEUES during tun_detach_all() to prevent further
  attaching.

Changes from V1:

- Simplify the sockets array management by not leaving NULL in the slot.
- Optimization on the tx queue selecting.
- Fix the bug in tun_detach_all()

Jason Wang (6):
  tuntap: move socket to tun_file
  tuntap: categorize ioctl
  tuntap: introduce multiqueue flags
  tuntap: multiqueue support
  tuntap: per queue 64 bit stats
  tuntap: add ioctls to attach or detach a file form tuntap device

 drivers/net/tun.c      |  797 ++++++++++++++++++++++++++++++------------------
 include/linux/if_tun.h |    5 +
 2 files changed, 503 insertions(+), 299 deletions(-)


* [PATCH 1/6] tuntap: move socket to tun_file
From: Jason Wang @ 2012-06-25 11:59 UTC (permalink / raw)
  To: mst, akong, habanero, tahm, haixiao, jwhan, ernesto.martin,
	mashirle, davem, netdev, linux-kernel, krkumar2
  Cc: shemminger, edumazet, Jason Wang
In-Reply-To: <20120625060830.6765.27584.stgit@amd-6168-8-1.englab.nay.redhat.com>

This patch moves the socket structure from tun_device to tun_file in order to
make it possible for multiple sockets to be attached to a tun/tap device. The
reference between the tap device and the socket is set up during TUNSETIFF as
usual.

After this patch, we can go further towards multiqueue tun/tap support by
storing an array of tun_file pointers in tun_device.
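
The embedding can be sketched in plain userspace C as below. This is only an
illustration: the sketch_* types are hypothetical stand-ins for struct sock,
struct socket and struct tun_file, and sketch_container_of() mimics the
kernel's container_of(). The point it shows is that, once the sock and socket
live inside the per-file object, the file state can be recovered from either
pointer without going through the device.

```c
#include <stddef.h>

/* Userspace stand-in for the kernel's container_of(). */
#define sketch_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical, heavily reduced versions of the kernel structures. */
struct sketch_sock {
	int sndbuf;
};

struct sketch_socket {
	struct sketch_sock *sk;
};

struct sketch_tun_file {
	struct sketch_sock sk;		/* first member: the object is
					 * allocated as a sock and cast */
	struct sketch_socket socket;
	unsigned int flags;		/* cached copy of the device flags */
};

static void sketch_tfile_init(struct sketch_tun_file *tfile,
			      unsigned int flags)
{
	tfile->sk.sndbuf = 0;
	tfile->socket.sk = &tfile->sk;	/* socket points at embedded sock */
	tfile->flags = flags;
}

/* Recover the per-file state from the embedded socket. */
static struct sketch_tun_file *
sketch_tfile_from_socket(struct sketch_socket *sock)
{
	return sketch_container_of(sock, struct sketch_tun_file, socket);
}

/* Recover the per-file state from the embedded sock. */
static struct sketch_tun_file *sketch_tfile_from_sk(struct sketch_sock *sk)
{
	return sketch_container_of(sk, struct sketch_tun_file, sk);
}
```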

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/tun.c |  352 +++++++++++++++++++++++++++--------------------------
 1 files changed, 181 insertions(+), 171 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 987aeef..1f27789 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -108,9 +108,16 @@ struct tap_filter {
 };
 
 struct tun_file {
+	struct sock sk;
+	struct socket socket;
+	struct socket_wq wq;
+	int vnet_hdr_sz;
+	struct tap_filter txflt;
 	atomic_t count;
 	struct tun_struct *tun;
 	struct net *net;
+	struct fasync_struct *fasync;
+	unsigned int flags;
 };
 
 struct tun_sock;
@@ -125,29 +132,12 @@ struct tun_struct {
 	netdev_features_t	set_features;
 #define TUN_USER_FEATURES (NETIF_F_HW_CSUM|NETIF_F_TSO_ECN|NETIF_F_TSO| \
 			  NETIF_F_TSO6|NETIF_F_UFO)
-	struct fasync_struct	*fasync;
-
-	struct tap_filter       txflt;
-	struct socket		socket;
-	struct socket_wq	wq;
-
-	int			vnet_hdr_sz;
 
 #ifdef TUN_DEBUG
 	int debug;
 #endif
 };
 
-struct tun_sock {
-	struct sock		sk;
-	struct tun_struct	*tun;
-};
-
-static inline struct tun_sock *tun_sk(struct sock *sk)
-{
-	return container_of(sk, struct tun_sock, sk);
-}
-
 static int tun_attach(struct tun_struct *tun, struct file *file)
 {
 	struct tun_file *tfile = file->private_data;
@@ -168,10 +158,9 @@ static int tun_attach(struct tun_struct *tun, struct file *file)
 	err = 0;
 	tfile->tun = tun;
 	tun->tfile = tfile;
-	tun->socket.file = file;
 	netif_carrier_on(tun->dev);
 	dev_hold(tun->dev);
-	sock_hold(tun->socket.sk);
+	sock_hold(&tfile->sk);
 	atomic_inc(&tfile->count);
 
 out:
@@ -181,15 +170,15 @@ out:
 
 static void __tun_detach(struct tun_struct *tun)
 {
+	struct tun_file *tfile = tun->tfile;
 	/* Detach from net device */
 	netif_tx_lock_bh(tun->dev);
 	netif_carrier_off(tun->dev);
 	tun->tfile = NULL;
-	tun->socket.file = NULL;
 	netif_tx_unlock_bh(tun->dev);
 
 	/* Drop read queue */
-	skb_queue_purge(&tun->socket.sk->sk_receive_queue);
+	skb_queue_purge(&tfile->socket.sk->sk_receive_queue);
 
 	/* Drop the extra count on the net device */
 	dev_put(tun->dev);
@@ -348,19 +337,12 @@ static void tun_net_uninit(struct net_device *dev)
 	/* Inform the methods they need to stop using the dev.
 	 */
 	if (tfile) {
-		wake_up_all(&tun->wq.wait);
+		wake_up_all(&tfile->wq.wait);
 		if (atomic_dec_and_test(&tfile->count))
 			__tun_detach(tun);
 	}
 }
 
-static void tun_free_netdev(struct net_device *dev)
-{
-	struct tun_struct *tun = netdev_priv(dev);
-
-	sk_release_kernel(tun->socket.sk);
-}
-
 /* Net device open. */
 static int tun_net_open(struct net_device *dev)
 {
@@ -379,24 +361,26 @@ static int tun_net_close(struct net_device *dev)
 static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct tun_struct *tun = netdev_priv(dev);
+	struct tun_file *tfile = tun->tfile;
 
 	tun_debug(KERN_INFO, tun, "tun_net_xmit %d\n", skb->len);
 
 	/* Drop packet if interface is not attached */
-	if (!tun->tfile)
+	if (!tfile)
 		goto drop;
 
 	/* Drop if the filter does not like it.
 	 * This is a noop if the filter is disabled.
 	 * Filter can be enabled only for the TAP devices. */
-	if (!check_filter(&tun->txflt, skb))
+	if (!check_filter(&tfile->txflt, skb))
 		goto drop;
 
-	if (tun->socket.sk->sk_filter &&
-	    sk_filter(tun->socket.sk, skb))
+	if (tfile->socket.sk->sk_filter &&
+	    sk_filter(tfile->socket.sk, skb))
 		goto drop;
 
-	if (skb_queue_len(&tun->socket.sk->sk_receive_queue) >= dev->tx_queue_len) {
+	if (skb_queue_len(&tfile->socket.sk->sk_receive_queue)
+	    >= dev->tx_queue_len) {
 		if (!(tun->flags & TUN_ONE_QUEUE)) {
 			/* Normal queueing mode. */
 			/* Packet scheduler handles dropping of further packets. */
@@ -417,12 +401,12 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
 	skb_orphan(skb);
 
 	/* Enqueue packet */
-	skb_queue_tail(&tun->socket.sk->sk_receive_queue, skb);
+	skb_queue_tail(&tfile->socket.sk->sk_receive_queue, skb);
 
 	/* Notify and wake up reader process */
-	if (tun->flags & TUN_FASYNC)
-		kill_fasync(&tun->fasync, SIGIO, POLL_IN);
-	wake_up_interruptible_poll(&tun->wq.wait, POLLIN |
+	if (tfile->flags & TUN_FASYNC)
+		kill_fasync(&tfile->fasync, SIGIO, POLL_IN);
+	wake_up_interruptible_poll(&tfile->wq.wait, POLLIN |
 				   POLLRDNORM | POLLRDBAND);
 	return NETDEV_TX_OK;
 
@@ -550,11 +534,11 @@ static unsigned int tun_chr_poll(struct file *file, poll_table * wait)
 	if (!tun)
 		return POLLERR;
 
-	sk = tun->socket.sk;
+	sk = tfile->socket.sk;
 
 	tun_debug(KERN_INFO, tun, "tun_chr_poll\n");
 
-	poll_wait(file, &tun->wq.wait, wait);
+	poll_wait(file, &tfile->wq.wait, wait);
 
 	if (!skb_queue_empty(&sk->sk_receive_queue))
 		mask |= POLLIN | POLLRDNORM;
@@ -573,11 +557,11 @@ static unsigned int tun_chr_poll(struct file *file, poll_table * wait)
 
 /* prepad is the amount to reserve at front.  len is length after that.
  * linear is a hint as to how much to copy (usually headers). */
-static struct sk_buff *tun_alloc_skb(struct tun_struct *tun,
+static struct sk_buff *tun_alloc_skb(struct tun_file *tfile,
 				     size_t prepad, size_t len,
 				     size_t linear, int noblock)
 {
-	struct sock *sk = tun->socket.sk;
+	struct sock *sk = tfile->socket.sk;
 	struct sk_buff *skb;
 	int err;
 
@@ -601,7 +585,7 @@ static struct sk_buff *tun_alloc_skb(struct tun_struct *tun,
 }
 
 /* Get packet from user space buffer */
-static ssize_t tun_get_user(struct tun_struct *tun,
+static ssize_t tun_get_user(struct tun_file *tfile,
 			    const struct iovec *iv, size_t count,
 			    int noblock)
 {
@@ -610,8 +594,10 @@ static ssize_t tun_get_user(struct tun_struct *tun,
 	size_t len = count, align = NET_SKB_PAD;
 	struct virtio_net_hdr gso = { 0 };
 	int offset = 0;
+	struct tun_struct *tun = NULL;
+	bool drop = false, error = false;
 
-	if (!(tun->flags & TUN_NO_PI)) {
+	if (!(tfile->flags & TUN_NO_PI)) {
 		if ((len -= sizeof(pi)) > count)
 			return -EINVAL;
 
@@ -620,8 +606,9 @@ static ssize_t tun_get_user(struct tun_struct *tun,
 		offset += sizeof(pi);
 	}
 
-	if (tun->flags & TUN_VNET_HDR) {
-		if ((len -= tun->vnet_hdr_sz) > count)
+	if (tfile->flags & TUN_VNET_HDR) {
+		len -= tfile->vnet_hdr_sz;
+		if (len > count)
 			return -EINVAL;
 
 		if (memcpy_fromiovecend((void *)&gso, iv, offset, sizeof(gso)))
@@ -633,41 +620,43 @@ static ssize_t tun_get_user(struct tun_struct *tun,
 
 		if (gso.hdr_len > len)
 			return -EINVAL;
-		offset += tun->vnet_hdr_sz;
+		offset += tfile->vnet_hdr_sz;
 	}
 
-	if ((tun->flags & TUN_TYPE_MASK) == TUN_TAP_DEV) {
+	if ((tfile->flags & TUN_TYPE_MASK) == TUN_TAP_DEV) {
 		align += NET_IP_ALIGN;
 		if (unlikely(len < ETH_HLEN ||
 			     (gso.hdr_len && gso.hdr_len < ETH_HLEN)))
 			return -EINVAL;
 	}
 
-	skb = tun_alloc_skb(tun, align, len, gso.hdr_len, noblock);
+	skb = tun_alloc_skb(tfile, align, len, gso.hdr_len, noblock);
+
 	if (IS_ERR(skb)) {
 		if (PTR_ERR(skb) != -EAGAIN)
-			tun->dev->stats.rx_dropped++;
-		return PTR_ERR(skb);
+			drop = true;
+		count = PTR_ERR(skb);
+		goto err;
 	}
 
 	if (skb_copy_datagram_from_iovec(skb, 0, iv, offset, len)) {
-		tun->dev->stats.rx_dropped++;
+		drop = true;
 		kfree_skb(skb);
-		return -EFAULT;
+		count = -EFAULT;
+		goto err;
 	}
 
 	if (gso.flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
 		if (!skb_partial_csum_set(skb, gso.csum_start,
 					  gso.csum_offset)) {
-			tun->dev->stats.rx_frame_errors++;
-			kfree_skb(skb);
-			return -EINVAL;
+			error = true;
+			goto err_free;
 		}
 	}
 
-	switch (tun->flags & TUN_TYPE_MASK) {
+	switch (tfile->flags & TUN_TYPE_MASK) {
 	case TUN_TUN_DEV:
-		if (tun->flags & TUN_NO_PI) {
+		if (tfile->flags & TUN_NO_PI) {
 			switch (skb->data[0] & 0xf0) {
 			case 0x40:
 				pi.proto = htons(ETH_P_IP);
@@ -676,18 +665,15 @@ static ssize_t tun_get_user(struct tun_struct *tun,
 				pi.proto = htons(ETH_P_IPV6);
 				break;
 			default:
-				tun->dev->stats.rx_dropped++;
-				kfree_skb(skb);
-				return -EINVAL;
+				drop = true;
+				goto err_free;
 			}
 		}
 
 		skb_reset_mac_header(skb);
 		skb->protocol = pi.proto;
-		skb->dev = tun->dev;
 		break;
 	case TUN_TAP_DEV:
-		skb->protocol = eth_type_trans(skb, tun->dev);
 		break;
 	}
 
@@ -704,9 +690,8 @@ static ssize_t tun_get_user(struct tun_struct *tun,
 			skb_shinfo(skb)->gso_type = SKB_GSO_UDP;
 			break;
 		default:
-			tun->dev->stats.rx_frame_errors++;
-			kfree_skb(skb);
-			return -EINVAL;
+			error = true;
+			goto err_free;
 		}
 
 		if (gso.gso_type & VIRTIO_NET_HDR_GSO_ECN)
@@ -714,9 +699,8 @@ static ssize_t tun_get_user(struct tun_struct *tun,
 
 		skb_shinfo(skb)->gso_size = gso.gso_size;
 		if (skb_shinfo(skb)->gso_size == 0) {
-			tun->dev->stats.rx_frame_errors++;
-			kfree_skb(skb);
-			return -EINVAL;
+			error = true;
+			goto err_free;
 		}
 
 		/* Header must be checked, and gso_segs computed. */
@@ -724,11 +708,38 @@ static ssize_t tun_get_user(struct tun_struct *tun,
 		skb_shinfo(skb)->gso_segs = 0;
 	}
 
-	netif_rx_ni(skb);
+	tun = __tun_get(tfile);
+	if (!tun)
+		return -EBADFD;
 
+	switch (tfile->flags & TUN_TYPE_MASK) {
+	case TUN_TUN_DEV:
+		skb->dev = tun->dev;
+		break;
+	case TUN_TAP_DEV:
+		skb->protocol = eth_type_trans(skb, tun->dev);
+		break;
+	}
+
+	netif_rx_ni(skb);
 	tun->dev->stats.rx_packets++;
 	tun->dev->stats.rx_bytes += len;
+	tun_put(tun);
+	return count;
+
+err_free:
+	count = -EINVAL;
+	kfree_skb(skb);
+err:
+	tun = __tun_get(tfile);
+	if (!tun)
+		return -EBADFD;
 
+	if (drop)
+		tun->dev->stats.rx_dropped++;
+	if (error)
+		tun->dev->stats.rx_frame_errors++;
+	tun_put(tun);
 	return count;
 }
 
@@ -736,30 +747,25 @@ static ssize_t tun_chr_aio_write(struct kiocb *iocb, const struct iovec *iv,
 			      unsigned long count, loff_t pos)
 {
 	struct file *file = iocb->ki_filp;
-	struct tun_struct *tun = tun_get(file);
+	struct tun_file *tfile = file->private_data;
 	ssize_t result;
 
-	if (!tun)
-		return -EBADFD;
-
-	tun_debug(KERN_INFO, tun, "tun_chr_write %ld\n", count);
-
-	result = tun_get_user(tun, iv, iov_length(iv, count),
+	result = tun_get_user(tfile, iv, iov_length(iv, count),
 			      file->f_flags & O_NONBLOCK);
 
-	tun_put(tun);
 	return result;
 }
 
 /* Put packet to the user space buffer */
-static ssize_t tun_put_user(struct tun_struct *tun,
+static ssize_t tun_put_user(struct tun_file *tfile,
 			    struct sk_buff *skb,
 			    const struct iovec *iv, int len)
 {
+	struct tun_struct *tun = NULL;
 	struct tun_pi pi = { 0, skb->protocol };
 	ssize_t total = 0;
 
-	if (!(tun->flags & TUN_NO_PI)) {
+	if (!(tfile->flags & TUN_NO_PI)) {
 		if ((len -= sizeof(pi)) < 0)
 			return -EINVAL;
 
@@ -773,9 +779,10 @@ static ssize_t tun_put_user(struct tun_struct *tun,
 		total += sizeof(pi);
 	}
 
-	if (tun->flags & TUN_VNET_HDR) {
+	if (tfile->flags & TUN_VNET_HDR) {
 		struct virtio_net_hdr gso = { 0 }; /* no info leak */
-		if ((len -= tun->vnet_hdr_sz) < 0)
+		len -= tfile->vnet_hdr_sz;
+		if (len < 0)
 			return -EINVAL;
 
 		if (skb_is_gso(skb)) {
@@ -818,7 +825,7 @@ static ssize_t tun_put_user(struct tun_struct *tun,
 		if (unlikely(memcpy_toiovecend(iv, (void *)&gso, total,
 					       sizeof(gso))))
 			return -EFAULT;
-		total += tun->vnet_hdr_sz;
+		total += tfile->vnet_hdr_sz;
 	}
 
 	len = min_t(int, skb->len, len);
@@ -826,29 +833,33 @@ static ssize_t tun_put_user(struct tun_struct *tun,
 	skb_copy_datagram_const_iovec(skb, 0, iv, total, len);
 	total += skb->len;
 
-	tun->dev->stats.tx_packets++;
-	tun->dev->stats.tx_bytes += len;
+	tun = __tun_get(tfile);
+	if (tun) {
+		tun->dev->stats.tx_packets++;
+		tun->dev->stats.tx_bytes += len;
+		tun_put(tun);
+	}
 
 	return total;
 }
 
-static ssize_t tun_do_read(struct tun_struct *tun,
+static ssize_t tun_do_read(struct tun_file *tfile,
 			   struct kiocb *iocb, const struct iovec *iv,
 			   ssize_t len, int noblock)
 {
 	DECLARE_WAITQUEUE(wait, current);
 	struct sk_buff *skb;
 	ssize_t ret = 0;
-
-	tun_debug(KERN_INFO, tun, "tun_chr_read\n");
+	struct tun_struct *tun = NULL;
 
 	if (unlikely(!noblock))
-		add_wait_queue(&tun->wq.wait, &wait);
+		add_wait_queue(&tfile->wq.wait, &wait);
 	while (len) {
 		current->state = TASK_INTERRUPTIBLE;
 
+		skb = skb_dequeue(&tfile->socket.sk->sk_receive_queue);
 		/* Read frames from the queue */
-		if (!(skb=skb_dequeue(&tun->socket.sk->sk_receive_queue))) {
+		if (!skb) {
 			if (noblock) {
 				ret = -EAGAIN;
 				break;
@@ -857,25 +868,38 @@ static ssize_t tun_do_read(struct tun_struct *tun,
 				ret = -ERESTARTSYS;
 				break;
 			}
+
+			tun = __tun_get(tfile);
+			if (!tun) {
+				ret = -EIO;
+				break;
+			}
 			if (tun->dev->reg_state != NETREG_REGISTERED) {
 				ret = -EIO;
+				tun_put(tun);
 				break;
 			}
+			tun_put(tun);
 
 			/* Nothing to read, let's sleep */
 			schedule();
 			continue;
 		}
-		netif_wake_queue(tun->dev);
 
-		ret = tun_put_user(tun, skb, iv, len);
+		tun = __tun_get(tfile);
+		if (tun) {
+			netif_wake_queue(tun->dev);
+			tun_put(tun);
+		}
+
+		ret = tun_put_user(tfile, skb, iv, len);
 		kfree_skb(skb);
 		break;
 	}
 
 	current->state = TASK_RUNNING;
 	if (unlikely(!noblock))
-		remove_wait_queue(&tun->wq.wait, &wait);
+		remove_wait_queue(&tfile->wq.wait, &wait);
 
 	return ret;
 }
@@ -885,21 +909,17 @@ static ssize_t tun_chr_aio_read(struct kiocb *iocb, const struct iovec *iv,
 {
 	struct file *file = iocb->ki_filp;
 	struct tun_file *tfile = file->private_data;
-	struct tun_struct *tun = __tun_get(tfile);
 	ssize_t len, ret;
 
-	if (!tun)
-		return -EBADFD;
 	len = iov_length(iv, count);
 	if (len < 0) {
 		ret = -EINVAL;
 		goto out;
 	}
 
-	ret = tun_do_read(tun, iocb, iv, len, file->f_flags & O_NONBLOCK);
+	ret = tun_do_read(tfile, iocb, iv, len, file->f_flags & O_NONBLOCK);
 	ret = min_t(ssize_t, ret, len);
 out:
-	tun_put(tun);
 	return ret;
 }
 
@@ -911,7 +931,7 @@ static void tun_setup(struct net_device *dev)
 	tun->group = -1;
 
 	dev->ethtool_ops = &tun_ethtool_ops;
-	dev->destructor = tun_free_netdev;
+	dev->destructor = free_netdev;
 }
 
 /* Trivial set of netlink ops to allow deleting tun or tap
@@ -931,7 +951,7 @@ static struct rtnl_link_ops tun_link_ops __read_mostly = {
 
 static void tun_sock_write_space(struct sock *sk)
 {
-	struct tun_struct *tun;
+	struct tun_file *tfile = NULL;
 	wait_queue_head_t *wqueue;
 
 	if (!sock_writeable(sk))
@@ -945,37 +965,38 @@ static void tun_sock_write_space(struct sock *sk)
 		wake_up_interruptible_sync_poll(wqueue, POLLOUT |
 						POLLWRNORM | POLLWRBAND);
 
-	tun = tun_sk(sk)->tun;
-	kill_fasync(&tun->fasync, SIGIO, POLL_OUT);
-}
-
-static void tun_sock_destruct(struct sock *sk)
-{
-	free_netdev(tun_sk(sk)->tun->dev);
+	tfile = container_of(sk, struct tun_file, sk);
+	kill_fasync(&tfile->fasync, SIGIO, POLL_OUT);
 }
 
 static int tun_sendmsg(struct kiocb *iocb, struct socket *sock,
 		       struct msghdr *m, size_t total_len)
 {
-	struct tun_struct *tun = container_of(sock, struct tun_struct, socket);
-	return tun_get_user(tun, m->msg_iov, total_len,
-			    m->msg_flags & MSG_DONTWAIT);
+	struct tun_file *tfile = container_of(sock, struct tun_file, socket);
+	ssize_t result;
+
+	result = tun_get_user(tfile, m->msg_iov, total_len,
+			      m->msg_flags & MSG_DONTWAIT);
+	return result;
 }
 
 static int tun_recvmsg(struct kiocb *iocb, struct socket *sock,
 		       struct msghdr *m, size_t total_len,
 		       int flags)
 {
-	struct tun_struct *tun = container_of(sock, struct tun_struct, socket);
+	struct tun_file *tfile = container_of(sock, struct tun_file, socket);
 	int ret;
+
 	if (flags & ~(MSG_DONTWAIT|MSG_TRUNC))
 		return -EINVAL;
-	ret = tun_do_read(tun, iocb, m->msg_iov, total_len,
+
+	ret = tun_do_read(tfile, iocb, m->msg_iov, total_len,
 			  flags & MSG_DONTWAIT);
 	if (ret > total_len) {
 		m->msg_flags |= MSG_TRUNC;
 		ret = flags & MSG_TRUNC ? ret : total_len;
 	}
+
 	return ret;
 }
 
@@ -996,7 +1017,7 @@ static const struct proto_ops tun_socket_ops = {
 static struct proto tun_proto = {
 	.name		= "tun",
 	.owner		= THIS_MODULE,
-	.obj_size	= sizeof(struct tun_sock),
+	.obj_size	= sizeof(struct tun_file),
 };
 
 static int tun_flags(struct tun_struct *tun)
@@ -1047,8 +1068,8 @@ static DEVICE_ATTR(group, 0444, tun_show_group, NULL);
 
 static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
 {
-	struct sock *sk;
 	struct tun_struct *tun;
+	struct tun_file *tfile = file->private_data;
 	struct net_device *dev;
 	int err;
 
@@ -1069,7 +1090,7 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
 		     (tun->group != -1 && !in_egroup_p(tun->group))) &&
 		    !capable(CAP_NET_ADMIN))
 			return -EPERM;
-		err = security_tun_dev_attach(tun->socket.sk);
+		err = security_tun_dev_attach(tfile->socket.sk);
 		if (err < 0)
 			return err;
 
@@ -1113,25 +1134,8 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
 		tun = netdev_priv(dev);
 		tun->dev = dev;
 		tun->flags = flags;
-		tun->txflt.count = 0;
-		tun->vnet_hdr_sz = sizeof(struct virtio_net_hdr);
 
-		err = -ENOMEM;
-		sk = sk_alloc(&init_net, AF_UNSPEC, GFP_KERNEL, &tun_proto);
-		if (!sk)
-			goto err_free_dev;
-
-		sk_change_net(sk, net);
-		tun->socket.wq = &tun->wq;
-		init_waitqueue_head(&tun->wq.wait);
-		tun->socket.ops = &tun_socket_ops;
-		sock_init_data(&tun->socket, sk);
-		sk->sk_write_space = tun_sock_write_space;
-		sk->sk_sndbuf = INT_MAX;
-
-		tun_sk(sk)->tun = tun;
-
-		security_tun_dev_post_create(sk);
+		security_tun_dev_post_create(&tfile->sk);
 
 		tun_net_init(dev);
 
@@ -1141,15 +1145,13 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
 
 		err = register_netdevice(tun->dev);
 		if (err < 0)
-			goto err_free_sk;
+			goto err_free_dev;
 
 		if (device_create_file(&tun->dev->dev, &dev_attr_tun_flags) ||
 		    device_create_file(&tun->dev->dev, &dev_attr_owner) ||
 		    device_create_file(&tun->dev->dev, &dev_attr_group))
 			pr_err("Failed to create tun sysfs files\n");
 
-		sk->sk_destruct = tun_sock_destruct;
-
 		err = tun_attach(tun, file);
 		if (err < 0)
 			goto failed;
@@ -1172,6 +1174,8 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
 	else
 		tun->flags &= ~TUN_VNET_HDR;
 
+	/* Cache flags from tun device */
+	tfile->flags = tun->flags;
 	/* Make sure persistent devices do not get stuck in
 	 * xoff state.
 	 */
@@ -1181,11 +1185,9 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
 	strcpy(ifr->ifr_name, tun->dev->name);
 	return 0;
 
- err_free_sk:
-	tun_free_netdev(dev);
- err_free_dev:
+err_free_dev:
 	free_netdev(dev);
- failed:
+failed:
 	return err;
 }
 
@@ -1357,9 +1359,9 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
 	case TUNSETTXFILTER:
 		/* Can be set only for TAPs */
 		ret = -EINVAL;
-		if ((tun->flags & TUN_TYPE_MASK) != TUN_TAP_DEV)
+		if ((tfile->flags & TUN_TYPE_MASK) != TUN_TAP_DEV)
 			break;
-		ret = update_filter(&tun->txflt, (void __user *)arg);
+		ret = update_filter(&tfile->txflt, (void __user *)arg);
 		break;
 
 	case SIOCGIFHWADDR:
@@ -1379,7 +1381,7 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
 		break;
 
 	case TUNGETSNDBUF:
-		sndbuf = tun->socket.sk->sk_sndbuf;
+		sndbuf = tfile->socket.sk->sk_sndbuf;
 		if (copy_to_user(argp, &sndbuf, sizeof(sndbuf)))
 			ret = -EFAULT;
 		break;
@@ -1390,11 +1392,11 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
 			break;
 		}
 
-		tun->socket.sk->sk_sndbuf = sndbuf;
+		tfile->socket.sk->sk_sndbuf = sndbuf;
 		break;
 
 	case TUNGETVNETHDRSZ:
-		vnet_hdr_sz = tun->vnet_hdr_sz;
+		vnet_hdr_sz = tfile->vnet_hdr_sz;
 		if (copy_to_user(argp, &vnet_hdr_sz, sizeof(vnet_hdr_sz)))
 			ret = -EFAULT;
 		break;
@@ -1409,27 +1411,27 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
 			break;
 		}
 
-		tun->vnet_hdr_sz = vnet_hdr_sz;
+		tfile->vnet_hdr_sz = vnet_hdr_sz;
 		break;
 
 	case TUNATTACHFILTER:
 		/* Can be set only for TAPs */
 		ret = -EINVAL;
-		if ((tun->flags & TUN_TYPE_MASK) != TUN_TAP_DEV)
+		if ((tfile->flags & TUN_TYPE_MASK) != TUN_TAP_DEV)
 			break;
 		ret = -EFAULT;
 		if (copy_from_user(&fprog, argp, sizeof(fprog)))
 			break;
 
-		ret = sk_attach_filter(&fprog, tun->socket.sk);
+		ret = sk_attach_filter(&fprog, tfile->socket.sk);
 		break;
 
 	case TUNDETACHFILTER:
 		/* Can be set only for TAPs */
 		ret = -EINVAL;
-		if ((tun->flags & TUN_TYPE_MASK) != TUN_TAP_DEV)
+		if ((tfile->flags & TUN_TYPE_MASK) != TUN_TAP_DEV)
 			break;
-		ret = sk_detach_filter(tun->socket.sk);
+		ret = sk_detach_filter(tfile->socket.sk);
 		break;
 
 	default:
@@ -1481,43 +1483,50 @@ static long tun_chr_compat_ioctl(struct file *file,
 
 static int tun_chr_fasync(int fd, struct file *file, int on)
 {
-	struct tun_struct *tun = tun_get(file);
-	int ret;
-
-	if (!tun)
-		return -EBADFD;
-
-	tun_debug(KERN_INFO, tun, "tun_chr_fasync %d\n", on);
+	struct tun_file *tfile = file->private_data;
+	int ret = fasync_helper(fd, file, on, &tfile->fasync);
 
-	if ((ret = fasync_helper(fd, file, on, &tun->fasync)) < 0)
+	if (ret < 0)
 		goto out;
 
 	if (on) {
 		ret = __f_setown(file, task_pid(current), PIDTYPE_PID, 0);
 		if (ret)
 			goto out;
-		tun->flags |= TUN_FASYNC;
+		tfile->flags |= TUN_FASYNC;
 	} else
-		tun->flags &= ~TUN_FASYNC;
+		tfile->flags &= ~TUN_FASYNC;
 	ret = 0;
 out:
-	tun_put(tun);
 	return ret;
 }
 
 static int tun_chr_open(struct inode *inode, struct file * file)
 {
+	struct net *net = current->nsproxy->net_ns;
 	struct tun_file *tfile;
 
 	DBG1(KERN_INFO, "tunX: tun_chr_open\n");
 
-	tfile = kmalloc(sizeof(*tfile), GFP_KERNEL);
+	tfile = (struct tun_file *)sk_alloc(net, AF_UNSPEC, GFP_KERNEL,
+					&tun_proto);
 	if (!tfile)
 		return -ENOMEM;
-	atomic_set(&tfile->count, 0);
+
 	tfile->tun = NULL;
-	tfile->net = get_net(current->nsproxy->net_ns);
+	tfile->net = net;
+	tfile->txflt.count = 0;
+	tfile->vnet_hdr_sz = sizeof(struct virtio_net_hdr);
+	tfile->socket.wq = &tfile->wq;
+	init_waitqueue_head(&tfile->wq.wait);
+	tfile->socket.file = file;
+	tfile->socket.ops = &tun_socket_ops;
+	sock_init_data(&tfile->socket, &tfile->sk);
+
+	tfile->sk.sk_write_space = tun_sock_write_space;
+	tfile->sk.sk_sndbuf = INT_MAX;
 	file->private_data = tfile;
+
 	return 0;
 }
 
@@ -1541,14 +1550,14 @@ static int tun_chr_close(struct inode *inode, struct file *file)
 				unregister_netdevice(dev);
 			rtnl_unlock();
 		}
-	}
 
-	tun = tfile->tun;
-	if (tun)
-		sock_put(tun->socket.sk);
+		/* drop the reference that netdevice holds */
+		sock_put(&tfile->sk);
 
-	put_net(tfile->net);
-	kfree(tfile);
+	}
+
+	/* drop the reference that file holds */
+	sock_put(&tfile->sk);
 
 	return 0;
 }
@@ -1676,13 +1685,14 @@ static void tun_cleanup(void)
 struct socket *tun_get_socket(struct file *file)
 {
 	struct tun_struct *tun;
+	struct tun_file *tfile = file->private_data;
 	if (file->f_op != &tun_fops)
 		return ERR_PTR(-EINVAL);
 	tun = tun_get(file);
 	if (!tun)
 		return ERR_PTR(-EBADFD);
 	tun_put(tun);
-	return &tun->socket;
+	return &tfile->socket;
 }
 EXPORT_SYMBOL_GPL(tun_get_socket);
 
-- 
1.7.1


* [PATCH 2/6] tuntap: categorize ioctl
From: Jason Wang @ 2012-06-25 11:59 UTC (permalink / raw)
  To: mst, akong, habanero, tahm, haixiao, jwhan, ernesto.martin,
	mashirle, davem, netdev, linux-kernel, krkumar2
  Cc: shemminger, edumazet, Jason Wang
In-Reply-To: <20120625060830.6765.27584.stgit@amd-6168-8-1.englab.nay.redhat.com>

As we've moved the socket-related structures to file->private_data, we can move
the ioctls that only touch the socket out of tun_chr_ioctl(), as they don't
need to hold the rtnl lock.
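
The resulting dispatch pattern can be sketched as follows. This is a userspace
illustration only: the SKETCH_* commands, struct sketch_file and both functions
are hypothetical, and the lock_taken counter merely stands in for
rtnl_lock()/rtnl_unlock(). Socket-only commands are served directly from the
per-file state, and everything else falls through to the locked device path.

```c
#include <errno.h>

/* Hypothetical command numbers, not the real TUN* ioctl values. */
enum {
	SKETCH_GET_SNDBUF = 1,
	SKETCH_SET_SNDBUF,
	SKETCH_SET_IFF,
};

struct sketch_file {
	int sndbuf;	/* socket-level state, owned by the file */
	int iff_flags;	/* device-level state */
	int lock_taken;	/* counts how often the "rtnl lock" was taken */
};

/* Device-level ioctls: these still run under the (simulated) lock. */
static long sketch_dev_ioctl(struct sketch_file *f, int cmd, long arg)
{
	long ret;

	f->lock_taken++;		/* stands in for rtnl_lock() */
	switch (cmd) {
	case SKETCH_SET_IFF:
		f->iff_flags = (int)arg;
		ret = 0;
		break;
	default:
		ret = -EINVAL;
		break;
	}
	/* rtnl_unlock() would go here */
	return ret;
}

/* Entry point: socket-only commands never touch the lock. */
static long sketch_socket_ioctl(struct sketch_file *f, int cmd, long arg)
{
	switch (cmd) {
	case SKETCH_GET_SNDBUF:
		return f->sndbuf;
	case SKETCH_SET_SNDBUF:
		f->sndbuf = (int)arg;
		return 0;
	default:
		return sketch_dev_ioctl(f, cmd, arg);
	}
}
```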

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/tun.c |   52 ++++++++++++++++++++++++++++++++++------------------
 1 files changed, 34 insertions(+), 18 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 1f27789..8233b0a 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1248,10 +1248,7 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
 	struct tun_file *tfile = file->private_data;
 	struct tun_struct *tun;
 	void __user* argp = (void __user*)arg;
-	struct sock_fprog fprog;
 	struct ifreq ifr;
-	int sndbuf;
-	int vnet_hdr_sz;
 	int ret;
 
 	if (cmd == TUNSETIFF || _IOC_TYPE(cmd) == 0x89)
@@ -1356,14 +1353,6 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
 		ret = set_offload(tun, arg);
 		break;
 
-	case TUNSETTXFILTER:
-		/* Can be set only for TAPs */
-		ret = -EINVAL;
-		if ((tfile->flags & TUN_TYPE_MASK) != TUN_TAP_DEV)
-			break;
-		ret = update_filter(&tfile->txflt, (void __user *)arg);
-		break;
-
 	case SIOCGIFHWADDR:
 		/* Get hw address */
 		memcpy(ifr.ifr_hwaddr.sa_data, tun->dev->dev_addr, ETH_ALEN);
@@ -1380,6 +1369,37 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
 		ret = dev_set_mac_address(tun->dev, &ifr.ifr_hwaddr);
 		break;
 
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+unlock:
+	rtnl_unlock();
+	if (tun)
+		tun_put(tun);
+	return ret;
+}
+
+static long __tun_socket_ioctl(struct file *file, unsigned int cmd,
+			       unsigned long arg, int ifreq_len)
+{
+	struct tun_file *tfile = file->private_data;
+	void __user *argp = (void __user *)arg;
+	struct sock_fprog fprog;
+	int sndbuf;
+	int vnet_hdr_sz;
+	int ret = 0;
+
+	switch (cmd) {
+	case TUNSETTXFILTER:
+		/* Can be set only for TAPs */
+		ret = -EINVAL;
+		if ((tfile->flags & TUN_TYPE_MASK) != TUN_TAP_DEV)
+			break;
+		ret = update_filter(&tfile->txflt, (void __user *)arg);
+		break;
+
 	case TUNGETSNDBUF:
 		sndbuf = tfile->socket.sk->sk_sndbuf;
 		if (copy_to_user(argp, &sndbuf, sizeof(sndbuf)))
@@ -1435,21 +1455,17 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
 		break;
 
 	default:
-		ret = -EINVAL;
+		ret = __tun_chr_ioctl(file, cmd, arg, ifreq_len);
 		break;
 	}
 
-unlock:
-	rtnl_unlock();
-	if (tun)
-		tun_put(tun);
 	return ret;
 }
 
 static long tun_chr_ioctl(struct file *file,
 			  unsigned int cmd, unsigned long arg)
 {
-	return __tun_chr_ioctl(file, cmd, arg, sizeof (struct ifreq));
+	return __tun_socket_ioctl(file, cmd, arg, sizeof(struct ifreq));
 }
 
 #ifdef CONFIG_COMPAT
@@ -1477,7 +1493,7 @@ static long tun_chr_compat_ioctl(struct file *file,
 	 * driver are compatible though, we don't need to convert the
 	 * contents.
 	 */
-	return __tun_chr_ioctl(file, cmd, arg, sizeof(struct compat_ifreq));
+	return __tun_socket_ioctl(file, cmd, arg, sizeof(struct compat_ifreq));
 }
 #endif /* CONFIG_COMPAT */
 
-- 
1.7.1


* [PATCH 3/6] tuntap: introduce multiqueue flags
From: Jason Wang @ 2012-06-25 11:59 UTC (permalink / raw)
  To: mst, akong, habanero, tahm, haixiao, jwhan, ernesto.martin,
	mashirle, davem, netdev, linux-kernel, krkumar2
  Cc: shemminger, edumazet, Jason Wang
In-Reply-To: <20120625060830.6765.27584.stgit@amd-6168-8-1.englab.nay.redhat.com>

Add the flags used when creating a multiqueue tuntap device.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 include/linux/if_tun.h |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h
index 06b1829..c92a291 100644
--- a/include/linux/if_tun.h
+++ b/include/linux/if_tun.h
@@ -34,6 +34,7 @@
 #define TUN_ONE_QUEUE	0x0080
 #define TUN_PERSIST 	0x0100	
 #define TUN_VNET_HDR 	0x0200
+#define TUN_TAP_MQ      0x0400
 
 /* Ioctl defines */
 #define TUNSETNOCSUM  _IOW('T', 200, int) 
@@ -61,6 +62,7 @@
 #define IFF_ONE_QUEUE	0x2000
 #define IFF_VNET_HDR	0x4000
 #define IFF_TUN_EXCL	0x8000
+#define IFF_MULTI_QUEUE 0x0100
 
 /* Features for GSO (TUNSETOFFLOAD). */
 #define TUN_F_CSUM	0x01	/* You can hand me unchecksummed packets. */
-- 
1.7.1

^ permalink raw reply related
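
For readers following the series, here is a minimal user-space sketch of how the new flag might be requested. It assumes the flag is passed through the existing TUNSETIFF ioctl and that each additional queue is obtained by opening /dev/net/tun again with the same interface name. That is how the multiqueue interface was later documented, and this posting of the series may differ. The helper names tap_mq_flags() and tap_open_queue() are hypothetical, and the fallback #define only mirrors the value added by the patch above.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/if.h>
#include <linux/if_tun.h>

/* Fallback for headers that predate this patch; value taken from the diff. */
#ifndef IFF_MULTI_QUEUE
#define IFF_MULTI_QUEUE 0x0100
#endif

/* Hypothetical helper: flags for a TAP queue without the packet-info header. */
static short tap_mq_flags(void)
{
	return IFF_TAP | IFF_NO_PI | IFF_MULTI_QUEUE;
}

/*
 * Hypothetical helper: open one queue of the TAP device called "name".
 * Assumption: calling this repeatedly with the same name attaches
 * further queues to the same device.
 * Returns a file descriptor, or -1 on failure.
 */
static int tap_open_queue(const char *name)
{
	struct ifreq ifr;
	int fd = open("/dev/net/tun", O_RDWR);

	if (fd < 0)
		return -1;

	memset(&ifr, 0, sizeof(ifr));
	ifr.ifr_flags = tap_mq_flags();
	strncpy(ifr.ifr_name, name, IFNAMSIZ - 1);

	if (ioctl(fd, TUNSETIFF, &ifr) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

Note that TUN_TAP_MQ (0x0400) is the driver-internal flag, while IFF_MULTI_QUEUE (0x0100) is the value user space passes in ifr_flags.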

* [PATCH 5/6] tuntap: per queue 64 bit stats
From: Jason Wang @ 2012-06-25 11:59 UTC (permalink / raw)
  To: mst, akong, habanero, tahm, haixiao, jwhan, ernesto.martin,
	mashirle, davem, netdev, linux-kernel, krkumar2
  Cc: shemminger, edumazet, Jason Wang
In-Reply-To: <20120625060830.6765.27584.stgit@amd-6168-8-1.englab.nay.redhat.com>

Now that multiqueue support has been added for tun/tap, this patch converts the
statistics to per-queue 64-bit statistics.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/tun.c |  105 ++++++++++++++++++++++++++++++++++++++++++-----------
 1 files changed, 83 insertions(+), 22 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 5c26757..37e62d3 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -64,6 +64,7 @@
 #include <linux/nsproxy.h>
 #include <linux/virtio_net.h>
 #include <linux/rcupdate.h>
+#include <linux/u64_stats_sync.h>
 #include <net/net_namespace.h>
 #include <net/netns/generic.h>
 #include <net/rtnetlink.h>
@@ -109,6 +110,19 @@ struct tap_filter {
 
 #define MAX_TAP_QUEUES (NR_CPUS < 16 ? NR_CPUS : 16)
 
+struct tun_queue_stats {
+	u64			rx_packets;
+	u64			rx_bytes;
+	u64			tx_packets;
+	u64			tx_bytes;
+	struct u64_stats_sync	rx_syncp;
+	struct u64_stats_sync	tx_syncp;
+	u32			rx_dropped;
+	u32			tx_dropped;
+	u32			rx_frame_errors;
+	u32			tx_fifo_errors;
+};
+
 struct tun_file {
 	struct sock sk;
 	struct socket socket;
@@ -119,6 +133,7 @@ struct tun_file {
 	struct tun_struct __rcu *tun;
 	struct net *net;
 	struct fasync_struct *fasync;
+	struct tun_queue_stats stats;
 	unsigned int flags;
 	u16 queue_index;
 };
@@ -134,6 +149,7 @@ struct tun_struct {
 
 	struct net_device	*dev;
 	netdev_features_t	set_features;
+	struct tun_queue_stats	stats;
 #define TUN_USER_FEATURES (NETIF_F_HW_CSUM|NETIF_F_TSO_ECN|NETIF_F_TSO| \
 			  NETIF_F_TSO6|NETIF_F_UFO)
 
@@ -463,7 +479,7 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
 
 			/* We won't see all dropped packets individually, so overrun
 			 * error is more appropriate. */
-			dev->stats.tx_fifo_errors++;
+			tfile->stats.tx_fifo_errors++;
 		} else {
 			/* Single queue mode or multi queue mode.
 			 * Driver handles dropping of all packets itself. */
@@ -488,7 +504,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
 
 drop:
 	rcu_read_unlock();
-	dev->stats.tx_dropped++;
+	if (tfile)
+		tfile->stats.tx_dropped++;
 	kfree_skb(skb);
 	return NETDEV_TX_OK;
 }
@@ -538,6 +555,56 @@ static void tun_poll_controller(struct net_device *dev)
 	return;
 }
 #endif
+
+static struct rtnl_link_stats64 *tun_net_stats(struct net_device *dev,
+					       struct rtnl_link_stats64 *stats)
+{
+	struct tun_struct *tun = netdev_priv(dev);
+	struct tun_file *tfile;
+	struct tun_queue_stats *qstats;
+	u64 rx_packets, rx_bytes, tx_packets, tx_bytes;
+	u32 rx_dropped = 0, tx_dropped = 0,
+	    rx_frame_errors = 0, tx_fifo_errors = 0;
+	unsigned int start;
+	int i;
+
+	rcu_read_lock();
+	for (i = 0; i < tun->numqueues; i++) {
+		tfile = rcu_dereference(tun->tfiles[i]);
+		qstats = &tfile->stats;
+
+		do {
+			start = u64_stats_fetch_begin_bh(&qstats->rx_syncp);
+			rx_packets = qstats->rx_packets;
+			rx_bytes = qstats->rx_bytes;
+		} while (u64_stats_fetch_retry_bh(&qstats->rx_syncp, start));
+
+		do {
+			start = u64_stats_fetch_begin_bh(&qstats->tx_syncp);
+			tx_packets = qstats->tx_packets;
+			tx_bytes = qstats->tx_bytes;
+		} while (u64_stats_fetch_retry_bh(&qstats->tx_syncp, start));
+
+		stats->rx_packets += rx_packets;
+		stats->rx_bytes	+= rx_bytes;
+		stats->tx_packets += tx_packets;
+		stats->tx_bytes	+= tx_bytes;
+		/* the following fields are u32, so no syncp is needed */
+		rx_dropped += qstats->rx_dropped;
+		tx_dropped += qstats->tx_dropped;
+		rx_frame_errors += qstats->rx_frame_errors;
+		tx_fifo_errors += qstats->tx_fifo_errors;
+	}
+	rcu_read_unlock();
+
+	stats->rx_dropped = rx_dropped;
+	stats->tx_dropped = tx_dropped;
+	stats->rx_frame_errors = rx_frame_errors;
+	stats->tx_fifo_errors = tx_fifo_errors;
+
+	return stats;
+}
+
 static const struct net_device_ops tun_netdev_ops = {
 	.ndo_uninit		= tun_net_uninit,
 	.ndo_open		= tun_net_open,
@@ -545,6 +612,7 @@ static const struct net_device_ops tun_netdev_ops = {
 	.ndo_start_xmit		= tun_net_xmit,
 	.ndo_change_mtu		= tun_net_change_mtu,
 	.ndo_fix_features	= tun_net_fix_features,
+	.ndo_get_stats64	= tun_net_stats,
 #ifdef CONFIG_NET_POLL_CONTROLLER
 	.ndo_poll_controller	= tun_poll_controller,
 #endif
@@ -560,6 +628,7 @@ static const struct net_device_ops tap_netdev_ops = {
 	.ndo_set_rx_mode	= tun_net_mclist,
 	.ndo_set_mac_address	= eth_mac_addr,
 	.ndo_validate_addr	= eth_validate_addr,
+	.ndo_get_stats64	= tun_net_stats,
 #ifdef CONFIG_NET_POLL_CONTROLLER
 	.ndo_poll_controller	= tun_poll_controller,
 #endif
@@ -808,30 +877,25 @@ static ssize_t tun_get_user(struct tun_file *tfile,
 		skb->protocol = eth_type_trans(skb, tun->dev);
 		break;
 	}
-	tun->dev->stats.rx_packets++;
-	tun->dev->stats.rx_bytes += len;
 	rcu_read_unlock();
 
 	netif_rx_ni(skb);
 
+	u64_stats_update_begin(&tfile->stats.rx_syncp);
+	tfile->stats.rx_packets++;
+	tfile->stats.rx_bytes += len;
+	u64_stats_update_end(&tfile->stats.rx_syncp);
+
 	return count;
 
 err_free:
 	count = -EINVAL;
 	kfree_skb(skb);
 err:
-	rcu_read_lock();
-	tun = rcu_dereference(tfile->tun);
-	if (!tun) {
-		rcu_read_unlock();
-		return -EBADFD;
-	}
-
 	if (drop)
-		tun->dev->stats.rx_dropped++;
+		tfile->stats.rx_dropped++;
 	if (error)
-		tun->dev->stats.rx_frame_errors++;
-	rcu_read_unlock();
+		tfile->stats.rx_frame_errors++;
 	return count;
 }
 
@@ -853,7 +917,6 @@ static ssize_t tun_put_user(struct tun_file *tfile,
 			    struct sk_buff *skb,
 			    const struct iovec *iv, int len)
 {
-	struct tun_struct *tun = NULL;
 	struct tun_pi pi = { 0, skb->protocol };
 	ssize_t total = 0;
 
@@ -925,13 +988,10 @@ static ssize_t tun_put_user(struct tun_file *tfile,
 	skb_copy_datagram_const_iovec(skb, 0, iv, total, len);
 	total += skb->len;
 
-	rcu_read_lock();
-	tun = rcu_dereference(tfile->tun);
-	if (tun) {
-		tun->dev->stats.tx_packets++;
-		tun->dev->stats.tx_bytes += len;
-	}
-	rcu_read_unlock();
+	u64_stats_update_begin(&tfile->stats.tx_syncp);
+	tfile->stats.tx_packets++;
+	tfile->stats.tx_bytes += total;
+	u64_stats_update_end(&tfile->stats.tx_syncp);
 
 	return total;
 }
@@ -1650,6 +1710,7 @@ static int tun_chr_open(struct inode *inode, struct file * file)
 	tfile->socket.file = file;
 	tfile->socket.ops = &tun_socket_ops;
 	sock_init_data(&tfile->socket, &tfile->sk);
+	memset(&tfile->stats, 0, sizeof(struct tun_queue_stats));
 
 	tfile->sk.sk_write_space = tun_sock_write_space;
 	tfile->sk.sk_destruct = tun_sock_destruct;
-- 
1.7.1

^ permalink raw reply related
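
The read side of tun_net_stats() above follows a sequence-counter pattern: the writer bumps a counter before and after updating, and the reader retries if the counter was odd or changed while it read. Below is a rough user-space model of that idea, not the kernel's u64_stats_sync implementation. It uses C11 atomics, assumes a single writer per queue, and all names (struct queue_stats, queue_add_rx(), queue_read_rx(), sum_rx()) are hypothetical.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* User-space model of one queue's RX counters guarded by a sequence counter. */
struct queue_stats {
	atomic_uint seq;		/* odd while the writer is mid-update */
	_Atomic uint64_t rx_packets;
	_Atomic uint64_t rx_bytes;
};

static void queue_stats_init(struct queue_stats *q)
{
	atomic_init(&q->seq, 0);
	atomic_init(&q->rx_packets, 0);
	atomic_init(&q->rx_bytes, 0);
}

/* Writer side: the two seq increments stand in for update_begin/update_end. */
static void queue_add_rx(struct queue_stats *q, uint64_t len)
{
	atomic_fetch_add(&q->seq, 1);		/* now odd: update in progress */
	atomic_fetch_add(&q->rx_packets, 1);
	atomic_fetch_add(&q->rx_bytes, len);
	atomic_fetch_add(&q->seq, 1);		/* even again: update complete */
}

/* Reader side: retry until a consistent (packets, bytes) pair is observed. */
static void queue_read_rx(struct queue_stats *q,
			  uint64_t *packets, uint64_t *bytes)
{
	unsigned int start;

	do {
		/* Wait until no update is in progress. */
		while ((start = atomic_load(&q->seq)) & 1u)
			;
		*packets = atomic_load(&q->rx_packets);
		*bytes = atomic_load(&q->rx_bytes);
	} while (atomic_load(&q->seq) != start);
}

/* Aggregate all queues into device-wide totals, as tun_net_stats() does. */
static void sum_rx(struct queue_stats *queues, size_t n,
		   uint64_t *packets, uint64_t *bytes)
{
	size_t i;

	*packets = 0;
	*bytes = 0;
	for (i = 0; i < n; i++) {
		uint64_t p, b;

		queue_read_rx(&queues[i], &p, &b);
		*packets += p;
		*bytes += b;
	}
}
```

In this model the counters are themselves atomic, so the retry loop only guarantees that packets and bytes come from the same update. In the kernel, the sequence counter matters mostly on 32-bit hosts, where a plain 64-bit load can tear.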
