Netdev List
* Re: [PATCH net-next] ath10k: wmi: Convert use of 6 to ETH_ALEN
From: Julia Lawall @ 2013-10-03  5:43 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Joe Perches, Kalle Valo, Luis R. Rodriguez, netdev, ath10k,
	John W. Linville
In-Reply-To: <1380773924.19002.131.camel@edumazet-glaptop.roam.corp.google.com>



On Wed, 2 Oct 2013, Eric Dumazet wrote:

> On Wed, 2013-10-02 at 20:39 -0700, Joe Perches wrote:
> > Use the appropriate define instead of 6.
> > 
> > Signed-off-by: Joe Perches <joe@perches.com>
> > Noticed-by: Julia Lawall <julia.lawall@lip6.fr> via spatch script
> > 
> > ---
> > 
> >  drivers/net/wireless/ath/ath10k/wmi.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/net/wireless/ath/ath10k/wmi.c b/drivers/net/wireless/ath/ath10k/wmi.c
> > index bee88e8..48d44e7 100644
> > --- a/drivers/net/wireless/ath/ath10k/wmi.c
> > +++ b/drivers/net/wireless/ath/ath10k/wmi.c
> > @@ -1758,7 +1758,7 @@ int ath10k_wmi_vdev_up(struct ath10k *ar, u32 vdev_id, u32 aid, const u8 *bssid)
> >  	cmd = (struct wmi_vdev_up_cmd *)skb->data;
> >  	cmd->vdev_id       = __cpu_to_le32(vdev_id);
> >  	cmd->vdev_assoc_id = __cpu_to_le32(aid);
> > -	memcpy(&cmd->vdev_bssid.addr, bssid, 6);
> > +	memcpy(&cmd->vdev_bssid.addr, bssid, ETH_ALEN);
> >  
> >  	ath10k_dbg(ATH10K_DBG_WMI,
> >  		   "wmi mgmt vdev up id 0x%x assoc id %d bssid %pM\n",
> 
> I don't get it.
> 
> Why leave this then?
> 
> struct wmi_mac_addr {
>         union {
>                 u8 addr[6];
>                 struct {
>                         u32 word0;
>                         u32 word1;
>                 } __packed;
>         } __packed;
> } __packed;

Yes, this was step 2...

julia


* Re: [PATCH net-next] ath10k: wmi: Convert use of 6 to ETH_ALEN
From: Julia Lawall @ 2013-10-03  5:47 UTC (permalink / raw)
  To: Joe Perches
  Cc: Eric Dumazet, Kalle Valo, Luis R. Rodriguez, netdev, ath10k,
	John W. Linville
In-Reply-To: <1380778451.2081.103.camel@joe-AO722>



On Wed, 2 Oct 2013, Joe Perches wrote:

> On Wed, 2013-10-02 at 22:24 -0700, Eric Dumazet wrote:
> > On Wed, 2013-10-02 at 22:09 -0700, Joe Perches wrote:
> > > On Wed, 2013-10-02 at 21:44 -0700, Eric Dumazet wrote:
> > > > I mean the 6, of course, since Joe seems to actively track them, as if
> > > > ETH_ALEN could change eventually, you never know.
> > > 
> > > You're funny Eric.
> > > You know it's just to ease grep pattern matching.
> > 
> > You are not funny if you plan to send 500+ patches for every instance of
> > 6 changed to ETH_ALEN
> 
> You're still funny.
> 
> https://lkml.org/lkml/2013/10/1/517
> 
> Lighten up though.  It was just a straggler.

Actually, the few cases that I looked up by hand seemed to use 6 
consistently for the declaration and the memcpy/memset.  So it would be 
nicer to fix them all at once.

At least in drivers/net there are around 40, not 500.

julia


* Re: [PATCH net-next] ath10k: wmi: Convert use of 6 to ETH_ALEN
From: Eric Dumazet @ 2013-10-03  6:00 UTC (permalink / raw)
  To: Joe Perches
  Cc: Kalle Valo, Julia Lawall, Luis R. Rodriguez, netdev, ath10k,
	John W. Linville
In-Reply-To: <1380778451.2081.103.camel@joe-AO722>

On Wed, 2013-10-02 at 22:34 -0700, Joe Perches wrote:

> https://lkml.org/lkml/2013/10/1/517
> 

I had no problem with this one.

Did I say something (funny or not) about it ?

Apparently you are mixing things up, and you do not like me commenting or
giving feedback on a _particular_ patch.

Everything is fine, really, let the fun continue !


* Re: [PATCH net-next] ath10k: wmi: Convert use of 6 to ETH_ALEN
From: Joe Perches @ 2013-10-03  6:07 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Kalle Valo, Julia Lawall, Luis R. Rodriguez, netdev, ath10k,
	John W. Linville
In-Reply-To: <1380780021.19002.164.camel@edumazet-glaptop.roam.corp.google.com>

On Wed, 2013-10-02 at 23:00 -0700, Eric Dumazet wrote:
> On Wed, 2013-10-02 at 22:34 -0700, Joe Perches wrote:
> > https://lkml.org/lkml/2013/10/1/517
> I had no problem with this one.
> Did I say something (funny or not) about it ?

Nope, you didn't say anything publicly or
even privately to me.

> Apparently you are mixing things up, and you do not like me commenting or
> giving feedback on a _particular_ patch.

Feedback like expecting 500 individual patches
when you apparently saw one that modified 57 files
with multiple maintainers _is_ pretty funny.

cheers, Joe


* Re: [PATCH net-next] ath10k: wmi: Convert use of 6 to ETH_ALEN
From: Julia Lawall @ 2013-10-03  6:32 UTC (permalink / raw)
  To: Joe Perches
  Cc: Eric Dumazet, Kalle Valo, Luis R. Rodriguez, netdev, ath10k,
	John W. Linville
In-Reply-To: <1380778451.2081.103.camel@joe-AO722>

The following semantic patch fixes the type declarations.  It should be 
run with the argument --recursive-includes for best (but slowest) results.

julia

@@
identifier x;
expression e1;
type T;
@@

T x[
- 6
+ ETH_ALEN
];
... when any
(
memcpy(x,e1,ETH_ALEN)
|
memcpy(e1,x,ETH_ALEN)
|
memset(x,e1,ETH_ALEN)
)

@r@
type T,T1;
identifier x;
@@

T {
...
T1 x[6];
...
};

@s@
r.T *e;
identifier r.x;
expression e1;
@@

(
memcpy(e->x,e1,ETH_ALEN)
|
memcpy(e1,e->x,ETH_ALEN)
|
memset(e->x,e1,ETH_ALEN)
)

@depends on s@
type r.T,r.T1;
identifier r.x;
@@

T {
...
T1 x[
-6
+ ETH_ALEN
 ];
...
};
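
For readers who do not use Coccinelle, the effect of the last two rules on a
structure field can be sketched in plain C. This is a userspace illustration
only: struct demo_mac_addr and demo_set_bssid() are hypothetical names, and
ETH_ALEN is defined locally instead of coming from <linux/if_ether.h>.

```c
#include <assert.h>
#include <string.h>

/* Defined locally for the sketch; the kernel gets it from <linux/if_ether.h>. */
#define ETH_ALEN 6

/* Hypothetical structure: before the semantic patch the field read addr[6]. */
struct demo_mac_addr {
	unsigned char addr[ETH_ALEN];
};

/* The declaration and the copy now share one constant. */
static void demo_set_bssid(struct demo_mac_addr *dst, const unsigned char *bssid)
{
	assert(dst && bssid);
	memcpy(dst->addr, bssid, ETH_ALEN);
}
```

The size of the array and the length passed to memcpy() can no longer drift
apart, and a grep for ETH_ALEN finds both.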


* [PATCH net] be2net: Warn users of possible broken functionality on BE2 cards with very old F/W versions with latest driver.
From: Somnath Kotur @ 2013-10-03  7:03 UTC (permalink / raw)
  To: netdev; +Cc: davem, Somnath Kotur

On very old F/W versions (< 4.0), the mailbox command to set interrupts on the
card succeeds even though it is not supported and should have failed; as a
result, interrupts do not work.
Hence warn users to upgrade to a suitable F/W version to avoid seeing broken
functionality.

Signed-off-by: Somnath Kotur <somnath.kotur@emulex.com>
---
 drivers/net/ethernet/emulex/benet/be_main.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/emulex/benet/be_main.c b/drivers/net/ethernet/emulex/benet/be_main.c
index 2c38cc4..f4bbc92 100644
--- a/drivers/net/ethernet/emulex/benet/be_main.c
+++ b/drivers/net/ethernet/emulex/benet/be_main.c
@@ -3247,6 +3247,11 @@ static int be_setup(struct be_adapter *adapter)
 
 	be_cmd_get_fw_ver(adapter, adapter->fw_ver, adapter->fw_on_flash);
 
+	if (BE2_chip(adapter) && memcmp(adapter->fw_ver, "4.", 2) < 0) {
+		dev_err(dev, "F/W version is very old. IRQs may not work\n");
+		dev_err(dev, "Pls upgrade to F/W version >= 4.0\n");
+	}
+
 	if (adapter->vlans_added)
 		be_vid_config(adapter);
 
-- 
1.6.0.2
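
The version test in the hunk above compares the first two bytes of the
version string against "4.". A minimal userspace sketch of how that
comparison behaves, next to a numeric alternative, is given below. Both
helper names are hypothetical, the numeric variant is not part of the patch,
and whether two-digit major versions ever occur on BE2 cards is not stated in
this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/*
 * The test used in the patch: byte-wise comparison of the first two
 * characters. Assumes fw_ver holds at least two bytes.
 */
static bool demo_fw_old_memcmp(const char *fw_ver)
{
	return memcmp(fw_ver, "4.", 2) < 0;
}

/* Hypothetical alternative: compare the parsed major number instead. */
static bool demo_fw_old_numeric(const char *fw_ver)
{
	char *end;
	unsigned long major;

	assert(fw_ver);
	major = strtoul(fw_ver, &end, 10);
	if (end == fw_ver)	/* no leading digits: unknown, do not warn */
		return false;
	return major < 4;
}
```

With a string such as "3.102.148.0" both helpers report an old version. With
a string starting with "10." the byte-wise test also reports it as old,
because '1' sorts before '4', while the numeric test does not.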


* RE: [PATCH net] be2net: Warn users of possible broken functionality on BE2 cards with very old F/W versions with latest driver.
From: Sathya Perla @ 2013-10-03  7:11 UTC (permalink / raw)
  To: Somnath Kotur, netdev@vger.kernel.org; +Cc: davem@davemloft.net, Somnath Kotur
In-Reply-To: <e84f68dc-6ab3-4e56-b2f5-70e51c46d1c3@CMEXHTCAS1.ad.emulex.com>


> -----Original Message-----
> From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org] On Behalf
> Of Somnath Kotur
> 
> On very old F/W versions (< 4.0), the mailbox command to set interrupts on the
> card succeeds even though it is not supported and should have failed; as a
> result, interrupts do not work.
> Hence warn users to upgrade to a suitable F/W version to avoid seeing broken
> functionality.
> 

Som, we already use the term "FW" (instead of "F/W") at various places in the
code. Can you pls stick to this?
This is important especially for log messages. "FW" (similar to "HW") is much more
commonly used and more readable.

> Signed-off-by: Somnath Kotur <somnath.kotur@emulex.com>
> ---
>  drivers/net/ethernet/emulex/benet/be_main.c |    5 +++++
>  1 files changed, 5 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/net/ethernet/emulex/benet/be_main.c
> b/drivers/net/ethernet/emulex/benet/be_main.c
> index 2c38cc4..f4bbc92 100644
> --- a/drivers/net/ethernet/emulex/benet/be_main.c
> +++ b/drivers/net/ethernet/emulex/benet/be_main.c
> @@ -3247,6 +3247,11 @@ static int be_setup(struct be_adapter *adapter)
> 
>  	be_cmd_get_fw_ver(adapter, adapter->fw_ver, adapter->fw_on_flash);
> 
> +	if (BE2_chip(adapter) && memcmp(adapter->fw_ver, "4.", 2) < 0) {
> +		dev_err(dev, "F/W version is very old. IRQs may not work\n");
> +		dev_err(dev, "Pls upgrade to F/W version >= 4.0\n");
> +	}

Same as above.

> +
>  	if (adapter->vlans_added)
>  		be_vid_config(adapter);
> 
> --
> 1.6.0.2
> 


* Re: [PATCH RFC 50/77] mlx5: Update MSI/MSI-X interrupts enablement code
From: Eli Cohen @ 2013-10-03  7:14 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: linux-kernel, Bjorn Helgaas, Ralf Baechle, Michael Ellerman,
	Benjamin Herrenschmidt, Martin Schwidefsky, Ingo Molnar,
	Tejun Heo, Dan Williams, Andy King, Jon Mason, Matt Porter,
	linux-pci, linux-mips, linuxppc-dev, linux390, linux-s390, x86,
	linux-ide, iss_storagedev, linux-nvme, linux-rdma, netdev,
	e1000-devel, linux-driver, Solarflare linux maintainers,
	"VMware, Inc." <pv-dr
In-Reply-To: <9650a7dfbcfd5f1da21f7b093665abf4b1041071.1380703263.git.agordeev@redhat.com>

On Wed, Oct 02, 2013 at 12:49:06PM +0200, Alexander Gordeev wrote:
>  
> +	err = pci_msix_table_size(dev->pdev);
> +	if (err < 0)
> +		return err;
> +
>  	nvec = dev->caps.num_ports * num_online_cpus() + MLX5_EQ_VEC_COMP_BASE;
>  	nvec = min_t(int, nvec, num_eqs);
> +	nvec = min_t(int, nvec, err);
>  	if (nvec <= MLX5_EQ_VEC_COMP_BASE)
>  		return -ENOSPC;

Making sure we don't request more vectors than the device is capable
of -- looks good.
>  
> @@ -131,20 +136,15 @@ static int mlx5_enable_msix(struct mlx5_core_dev *dev)
>  	for (i = 0; i < nvec; i++)
>  		table->msix_arr[i].entry = i;
>  
> -retry:
> -	table->num_comp_vectors = nvec - MLX5_EQ_VEC_COMP_BASE;
>  	err = pci_enable_msix(dev->pdev, table->msix_arr, nvec);
> -	if (err <= 0) {
> +	if (err) {
> +		kfree(table->msix_arr);
>  		return err;
> -	} else if (err > MLX5_EQ_VEC_COMP_BASE) {
> -		nvec = err;
> -		goto retry;
>  	}
>  

According to the latest sources, pci_enable_msix() may still fail, so why
do you want to remove this code?
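
For reference, the retry logic being removed can be sketched in userspace as
follows. It assumes the legacy pci_enable_msix() return convention (0 on
success, a negative errno on failure, a positive count when fewer vectors are
available than requested). demo_enable_msix() is a hypothetical stand-in for
the PCI core, and unlike the driver code the wrapper returns the number of
vectors finally enabled.

```c
#include <assert.h>

#define DEMO_ENOSPC 28	/* stand-in for the kernel's ENOSPC */

/* Hypothetical stand-in for a device: at most 'avail' vectors can be granted. */
struct demo_dev {
	int avail;
};

/*
 * Mimics the legacy return convention: 0 on success, a negative errno on
 * failure, or a positive count of vectors that could have been allocated.
 */
static int demo_enable_msix(const struct demo_dev *dev, int nvec)
{
	assert(dev && nvec > 0);
	if (dev->avail <= 0)
		return -DEMO_ENOSPC;
	return nvec <= dev->avail ? 0 : dev->avail;
}

/*
 * Retry with the reduced count while it stays above the required minimum.
 * Returns the number of vectors enabled, or a negative errno.
 */
static int demo_enable_with_retry(const struct demo_dev *dev, int nvec, int min_vec)
{
	int err;

	for (;;) {
		err = demo_enable_msix(dev, nvec);
		if (err == 0)
			return nvec;
		if (err < 0)
			return err;
		if (err <= min_vec)
			return -DEMO_ENOSPC;
		nvec = err;	/* ask again for what the device can give */
	}
}
```

Clamping nvec to the MSI-X table size up front removes one reason for a
positive return, but the sketch shows why a caller may still want to handle
it.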

> -	mlx5_core_dbg(dev, "received %d MSI vectors out of %d requested\n", err, nvec);
> -	kfree(table->msix_arr);
> +	table->num_comp_vectors = nvec - MLX5_EQ_VEC_COMP_BASE;
>  
> -	return -ENOSPC;
> +	return 0;
>  }
>  
>  static void mlx5_disable_msix(struct mlx5_core_dev *dev)
> -- 
> 1.7.7.6
> 


* [PATCH net-next] tcp/dccp: remove twchain
From: Eric Dumazet @ 2013-10-03  7:22 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

From: Eric Dumazet <edumazet@google.com>

TCP listener refactoring, part 3 :

Our goal is to hash SYN_RECV sockets into main ehash for fast lookup,
and parallel SYN processing.

The current inet_ehash_bucket contains two chains: one for ESTABLISHED (and
related states) sockets, another for TIME_WAIT sockets only.

As the hash table is sized to get at most one socket per bucket, it
makes little sense to have a separate twchain, as it makes the lookup
slightly more complicated, and doubles hash table memory usage.

If we make sure all socket types have the lookup keys at the same
offsets, we can use a generic and faster lookup. It turns out TIME_WAIT
and ESTABLISHED sockets already have common lookup fields for IPv4.

[ INET_TW_MATCH() is no longer needed ]
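
The idea of a single matcher over one chain can be sketched in userspace as
below. All demo_* names and the field layout are hypothetical, the chain is a
plain singly linked list without RCU or nulls markers, and the state values
are local to the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum demo_state { DEMO_ESTABLISHED = 1, DEMO_TIME_WAIT = 6 };

/* Lookup keys live in a common header shared by every socket flavour. */
struct demo_sock_common {
	uint32_t daddr, rcv_saddr;
	uint32_t portpair;
	uint32_t hash;
	enum demo_state state;
	struct demo_sock_common *next;	/* plain chain, no RCU */
};

struct demo_full_sock { struct demo_sock_common common; char big_state[256]; };
struct demo_tw_sock   { struct demo_sock_common common; int timeout; };

/* The property the generic lookup relies on: keys at identical offsets. */
static_assert(offsetof(struct demo_full_sock, common) ==
	      offsetof(struct demo_tw_sock, common),
	      "lookup keys must share offsets");

/* One matcher serves both flavours; no separate TIME_WAIT pass is needed. */
static struct demo_sock_common *
demo_lookup(struct demo_sock_common *chain, uint32_t hash,
	    uint32_t pkt_saddr, uint32_t pkt_daddr, uint32_t ports)
{
	struct demo_sock_common *sk;

	for (sk = chain; sk; sk = sk->next) {
		if (sk->hash != hash)
			continue;
		/* The packet's source is the socket's remote address. */
		if (sk->daddr == pkt_saddr && sk->rcv_saddr == pkt_daddr &&
		    sk->portpair == ports)
			return sk;
	}
	return NULL;
}
```

A caller that needs flavour-specific handling checks the state field of the
returned entry afterwards.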

I'll provide a follow-up to factorize IPv6 lookup as well, to remove
INET6_TW_MATCH()

This way, SYN_RECV pseudo sockets will be supported the same way.

A new sock_gen_put() helper is added, doing either a sock_put() or
inet_twsk_put() [ and will support SYN_RECV later ].

Note this helper should only be called in the real slow path, when an RCU
lookup found a socket that was moved to another identity (freed/reused
immediately), but could eventually be used in other contexts, like
sock_edemux()
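
A minimal userspace sketch of the "common refcount, different destructors"
idea behind sock_gen_put() follows. It uses C11 atomics instead of the kernel
atomic_t API, and every demo_* name, including the freed_by hook, is
hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

enum demo_sk_state { DEMO_SK_ESTABLISHED = 1, DEMO_SK_TIME_WAIT = 6 };
enum { DEMO_FREED_FULL = 1, DEMO_FREED_TW = 2 };

struct demo_sk {
	atomic_int refcnt;
	enum demo_sk_state state;
	int *freed_by;	/* hook for the sketch: records which destructor ran */
};

static void demo_sk_free(struct demo_sk *sk)
{
	*sk->freed_by = DEMO_FREED_FULL;
}

static void demo_twsk_free(struct demo_sk *sk)
{
	*sk->freed_by = DEMO_FREED_TW;
}

/* Drop one reference; the last holder picks the destructor from the state. */
static void demo_gen_put(struct demo_sk *sk)
{
	assert(sk);
	/* atomic_fetch_sub() returns the value held before the decrement. */
	if (atomic_fetch_sub(&sk->refcnt, 1) != 1)
		return;

	if (sk->state == DEMO_SK_TIME_WAIT)
		demo_twsk_free(sk);
	else
		demo_sk_free(sk);
}
```

The caller does not need to know which flavour it holds; only the final put
inspects the state.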

Before patch :

dmesg | grep "TCP established"

TCP established hash table entries: 524288 (order: 11, 8388608 bytes)

After patch :

TCP established hash table entries: 524288 (order: 10, 4194304 bytes)
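
The two figures follow from the bucket size. A sketch of the arithmetic,
assuming 8-byte pointers and a list head that holds a single pointer (the
demo_* types are stand-ins, not the kernel definitions):

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for struct hlist_nulls_head, which holds a single pointer. */
struct demo_nulls_head { void *first; };

/* Bucket layout before and after the patch. */
struct demo_bucket_before { struct demo_nulls_head chain, twchain; };
struct demo_bucket_after  { struct demo_nulls_head chain; };

/* Total bytes used by a table of 'entries' buckets of 'bucket_size' bytes. */
static size_t demo_table_bytes(size_t entries, size_t bucket_size)
{
	assert(bucket_size > 0);
	return entries * bucket_size;
}
```

With 524288 entries, 16-byte buckets give 8388608 bytes and 8-byte buckets
give 4194304 bytes, matching the dmesg lines above. With 4096-byte pages
those sizes correspond to order 11 and order 10 allocations.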

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
Note: This depends on 

"net: do not call sock_put() on TIMEWAIT sockets" (from net tree), 

and previous net-next patches in review :

("tcp: shrink tcp6_timewait_sock by one cache line")
("inet: consolidate INET_TW_MATCH")

 include/net/inet_hashtables.h    |    9 ---
 include/net/inet_timewait_sock.h |   13 ----
 include/net/sock.h               |    8 ++
 include/net/tcp.h                |    1 
 net/dccp/proto.c                 |    4 -
 net/ipv4/inet_diag.c             |   48 ++++------------
 net/ipv4/inet_hashtables.c       |   83 ++++++++++-------------------
 net/ipv4/inet_timewait_sock.c    |   55 +++++++++----------
 net/ipv4/tcp.c                   |    5 -
 net/ipv4/tcp_ipv4.c              |   83 ++++-------------------------
 net/ipv6/inet6_hashtables.c      |   75 ++++++++++----------------
 net/ipv6/tcp_ipv6.c              |    9 +--
 12 files changed, 132 insertions(+), 261 deletions(-)

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 10d6838..1bdb477 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -37,12 +37,11 @@
 #include <asm/byteorder.h>
 
 /* This is for all connections with a full identity, no wildcards.
- * One chain is dedicated to TIME_WAIT sockets.
- * I'll experiment with dynamic table growth later.
+ * The 'e' prefix stands for Establish, but we really put all sockets
+ * but LISTEN ones.
  */
 struct inet_ehash_bucket {
 	struct hlist_nulls_head chain;
-	struct hlist_nulls_head twchain;
 };
 
 /* There are a few simple rules, which allow for local port reuse by
@@ -123,7 +122,6 @@ struct inet_hashinfo {
 	 *
 	 *          TCP_ESTABLISHED <= sk->sk_state < TCP_CLOSE
 	 *
-	 * TIME_WAIT sockets use a separate chain (twchain).
 	 */
 	struct inet_ehash_bucket	*ehash;
 	spinlock_t			*ehash_locks;
@@ -318,9 +316,6 @@ static inline struct sock *inet_lookup_listener(struct net *net,
 	 net_eq(sock_net(__sk), (__net)))
 #endif /* 64-bit arch */
 
-#define INET_TW_MATCH(__sk, __net, __cookie, __saddr, __daddr, __ports, __dif)\
-	INET_MATCH(__sk, __net, __cookie, __saddr, __daddr, __ports, __dif)
-
 /*
  * Sockets in TCP_CLOSE state are _always_ taken out of the hash, so we need
  * not check it for lookups anymore, thanks Alexey. -DaveM
diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h
index d6d8fd2..0fd04eb 100644
--- a/include/net/inet_timewait_sock.h
+++ b/include/net/inet_timewait_sock.h
@@ -136,18 +136,6 @@ struct inet_timewait_sock {
 };
 #define tw_tclass tw_tos
 
-static inline void inet_twsk_add_node_rcu(struct inet_timewait_sock *tw,
-				      struct hlist_nulls_head *list)
-{
-	hlist_nulls_add_head_rcu(&tw->tw_node, list);
-}
-
-static inline void inet_twsk_add_bind_node(struct inet_timewait_sock *tw,
-					   struct hlist_head *list)
-{
-	hlist_add_head(&tw->tw_bind_node, list);
-}
-
 static inline int inet_twsk_dead_hashed(const struct inet_timewait_sock *tw)
 {
 	return !hlist_unhashed(&tw->tw_death_node);
@@ -187,6 +175,7 @@ static inline struct inet_timewait_sock *inet_twsk(const struct sock *sk)
 	return (struct inet_timewait_sock *)sk;
 }
 
+void inet_twsk_free(struct inet_timewait_sock *tw);
 void inet_twsk_put(struct inet_timewait_sock *tw);
 
 int inet_twsk_unhash(struct inet_timewait_sock *tw);
diff --git a/include/net/sock.h b/include/net/sock.h
index e3bf213..ef80ea5 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -156,7 +156,7 @@ typedef __u64 __bitwise __addrpair;
  */
 struct sock_common {
 	/* skc_daddr and skc_rcv_saddr must be grouped on a 8 bytes aligned
-	 * address on 64bit arches : cf INET_MATCH() and INET_TW_MATCH()
+	 * address on 64bit arches : cf INET_MATCH()
 	 */
 	union {
 		__addrpair	skc_addrpair;
@@ -301,6 +301,8 @@ struct sock {
 #define sk_dontcopy_end		__sk_common.skc_dontcopy_end
 #define sk_hash			__sk_common.skc_hash
 #define sk_portpair		__sk_common.skc_portpair
+#define sk_num			__sk_common.skc_num
+#define sk_dport		__sk_common.skc_dport
 #define sk_addrpair		__sk_common.skc_addrpair
 #define sk_daddr		__sk_common.skc_daddr
 #define sk_rcv_saddr		__sk_common.skc_rcv_saddr
@@ -1655,6 +1657,10 @@ static inline void sock_put(struct sock *sk)
 	if (atomic_dec_and_test(&sk->sk_refcnt))
 		sk_free(sk);
 }
+/* Generic version of sock_put(), dealing with all sockets
+ * (TCP_TIMEWAIT, ESTABLISHED...)
+ */
+void sock_gen_put(struct sock *sk);
 
 int sk_receive_skb(struct sock *sk, struct sk_buff *skb, const int nested);
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index de870ee..39bbfa1 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1519,7 +1519,6 @@ enum tcp_seq_states {
 	TCP_SEQ_STATE_LISTENING,
 	TCP_SEQ_STATE_OPENREQ,
 	TCP_SEQ_STATE_ESTABLISHED,
-	TCP_SEQ_STATE_TIME_WAIT,
 };
 
 int tcp_seq_open(struct inode *inode, struct file *file);
diff --git a/net/dccp/proto.c b/net/dccp/proto.c
index ba64750..eb892b4 100644
--- a/net/dccp/proto.c
+++ b/net/dccp/proto.c
@@ -1158,10 +1158,8 @@ static int __init dccp_init(void)
 		goto out_free_bind_bucket_cachep;
 	}
 
-	for (i = 0; i <= dccp_hashinfo.ehash_mask; i++) {
+	for (i = 0; i <= dccp_hashinfo.ehash_mask; i++)
 		INIT_HLIST_NULLS_HEAD(&dccp_hashinfo.ehash[i].chain, i);
-		INIT_HLIST_NULLS_HEAD(&dccp_hashinfo.ehash[i].twchain, i);
-	}
 
 	if (inet_ehash_locks_alloc(&dccp_hashinfo))
 			goto out_free_dccp_ehash;
diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c
index d17353f..8c8171d 100644
--- a/net/ipv4/inet_diag.c
+++ b/net/ipv4/inet_diag.c
@@ -635,12 +635,14 @@ static int inet_csk_diag_dump(struct sock *sk,
 				  cb->nlh->nlmsg_seq, NLM_F_MULTI, cb->nlh);
 }
 
-static int inet_twsk_diag_dump(struct inet_timewait_sock *tw,
+static int inet_twsk_diag_dump(struct sock *sk,
 			       struct sk_buff *skb,
 			       struct netlink_callback *cb,
 			       struct inet_diag_req_v2 *r,
 			       const struct nlattr *bc)
 {
+	struct inet_timewait_sock *tw = inet_twsk(sk);
+
 	if (bc != NULL) {
 		struct inet_diag_entry entry;
 
@@ -911,8 +913,7 @@ skip_listen_ht:
 
 		num = 0;
 
-		if (hlist_nulls_empty(&head->chain) &&
-			hlist_nulls_empty(&head->twchain))
+		if (hlist_nulls_empty(&head->chain))
 			continue;
 
 		if (i > s_i)
@@ -920,7 +921,7 @@ skip_listen_ht:
 
 		spin_lock_bh(lock);
 		sk_nulls_for_each(sk, node, &head->chain) {
-			struct inet_sock *inet = inet_sk(sk);
+			int res;
 
 			if (!net_eq(sock_net(sk), net))
 				continue;
@@ -929,15 +930,19 @@ skip_listen_ht:
 			if (!(r->idiag_states & (1 << sk->sk_state)))
 				goto next_normal;
 			if (r->sdiag_family != AF_UNSPEC &&
-					sk->sk_family != r->sdiag_family)
+			    sk->sk_family != r->sdiag_family)
 				goto next_normal;
-			if (r->id.idiag_sport != inet->inet_sport &&
+			if (r->id.idiag_sport != htons(sk->sk_num) &&
 			    r->id.idiag_sport)
 				goto next_normal;
-			if (r->id.idiag_dport != inet->inet_dport &&
+			if (r->id.idiag_dport != sk->sk_dport &&
 			    r->id.idiag_dport)
 				goto next_normal;
-			if (inet_csk_diag_dump(sk, skb, cb, r, bc) < 0) {
+			if (sk->sk_state == TCP_TIME_WAIT)
+				res = inet_twsk_diag_dump(sk, skb, cb, r, bc);
+			else
+				res = inet_csk_diag_dump(sk, skb, cb, r, bc);
+			if (res < 0) {
 				spin_unlock_bh(lock);
 				goto done;
 			}
@@ -945,33 +950,6 @@ next_normal:
 			++num;
 		}
 
-		if (r->idiag_states & TCPF_TIME_WAIT) {
-			struct inet_timewait_sock *tw;
-
-			inet_twsk_for_each(tw, node,
-				    &head->twchain) {
-				if (!net_eq(twsk_net(tw), net))
-					continue;
-
-				if (num < s_num)
-					goto next_dying;
-				if (r->sdiag_family != AF_UNSPEC &&
-						tw->tw_family != r->sdiag_family)
-					goto next_dying;
-				if (r->id.idiag_sport != tw->tw_sport &&
-				    r->id.idiag_sport)
-					goto next_dying;
-				if (r->id.idiag_dport != tw->tw_dport &&
-				    r->id.idiag_dport)
-					goto next_dying;
-				if (inet_twsk_diag_dump(tw, skb, cb, r, bc) < 0) {
-					spin_unlock_bh(lock);
-					goto done;
-				}
-next_dying:
-				++num;
-			}
-		}
 		spin_unlock_bh(lock);
 	}
 
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index ae19959..a4b66bb 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -230,6 +230,19 @@ begin:
 }
 EXPORT_SYMBOL_GPL(__inet_lookup_listener);
 
+/* All sockets share common refcount, but have different destructors */
+void sock_gen_put(struct sock *sk)
+{
+	if (!atomic_dec_and_test(&sk->sk_refcnt))
+		return;
+
+	if (sk->sk_state == TCP_TIME_WAIT)
+		inet_twsk_free(inet_twsk(sk));
+	else
+		sk_free(sk);
+}
+EXPORT_SYMBOL_GPL(sock_gen_put);
+
 struct sock *__inet_lookup_established(struct net *net,
 				  struct inet_hashinfo *hashinfo,
 				  const __be32 saddr, const __be16 sport,
@@ -255,13 +268,13 @@ begin:
 		if (likely(INET_MATCH(sk, net, acookie,
 				      saddr, daddr, ports, dif))) {
 			if (unlikely(!atomic_inc_not_zero(&sk->sk_refcnt)))
-				goto begintw;
+				goto out;
 			if (unlikely(!INET_MATCH(sk, net, acookie,
 						 saddr, daddr, ports, dif))) {
-				sock_put(sk);
+				sock_gen_put(sk);
 				goto begin;
 			}
-			goto out;
+			goto found;
 		}
 	}
 	/*
@@ -271,37 +284,9 @@ begin:
 	 */
 	if (get_nulls_value(node) != slot)
 		goto begin;
-
-begintw:
-	/* Must check for a TIME_WAIT'er before going to listener hash. */
-	sk_nulls_for_each_rcu(sk, node, &head->twchain) {
-		if (sk->sk_hash != hash)
-			continue;
-		if (likely(INET_TW_MATCH(sk, net, acookie,
-					 saddr, daddr, ports,
-					 dif))) {
-			if (unlikely(!atomic_inc_not_zero(&sk->sk_refcnt))) {
-				sk = NULL;
-				goto out;
-			}
-			if (unlikely(!INET_TW_MATCH(sk, net, acookie,
-						    saddr, daddr, ports,
-						    dif))) {
-				inet_twsk_put(inet_twsk(sk));
-				goto begintw;
-			}
-			goto out;
-		}
-	}
-	/*
-	 * if the nulls value we got at the end of this lookup is
-	 * not the expected one, we must restart lookup.
-	 * We probably met an item that was moved to another chain.
-	 */
-	if (get_nulls_value(node) != slot)
-		goto begintw;
-	sk = NULL;
 out:
+	sk = NULL;
+found:
 	rcu_read_unlock();
 	return sk;
 }
@@ -326,39 +311,29 @@ static int __inet_check_established(struct inet_timewait_death_row *death_row,
 	spinlock_t *lock = inet_ehash_lockp(hinfo, hash);
 	struct sock *sk2;
 	const struct hlist_nulls_node *node;
-	struct inet_timewait_sock *tw;
+	struct inet_timewait_sock *tw = NULL;
 	int twrefcnt = 0;
 
 	spin_lock(lock);
 
-	/* Check TIME-WAIT sockets first. */
-	sk_nulls_for_each(sk2, node, &head->twchain) {
-		if (sk2->sk_hash != hash)
-			continue;
-
-		if (likely(INET_TW_MATCH(sk2, net, acookie,
-					 saddr, daddr, ports, dif))) {
-			tw = inet_twsk(sk2);
-			if (twsk_unique(sk, sk2, twp))
-				goto unique;
-			else
-				goto not_unique;
-		}
-	}
-	tw = NULL;
-
-	/* And established part... */
 	sk_nulls_for_each(sk2, node, &head->chain) {
 		if (sk2->sk_hash != hash)
 			continue;
+
 		if (likely(INET_MATCH(sk2, net, acookie,
-				      saddr, daddr, ports, dif)))
+					 saddr, daddr, ports, dif))) {
+			if (sk2->sk_state == TCP_TIME_WAIT) {
+				tw = inet_twsk(sk2);
+				if (twsk_unique(sk, sk2, twp))
+					break;
+			}
 			goto not_unique;
+		}
 	}
 
-unique:
 	/* Must record num and sport now. Otherwise we will see
-	 * in hash table socket with a funny identity. */
+	 * in hash table socket with a funny identity.
+	 */
 	inet->inet_num = lport;
 	inet->inet_sport = htons(lport);
 	sk->sk_hash = hash;
diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index 2c766b9..9909019 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -87,19 +87,11 @@ static void __inet_twsk_kill(struct inet_timewait_sock *tw,
 	refcnt += inet_twsk_bind_unhash(tw, hashinfo);
 	spin_unlock(&bhead->lock);
 
-#ifdef SOCK_REFCNT_DEBUG
-	if (atomic_read(&tw->tw_refcnt) != 1) {
-		pr_debug("%s timewait_sock %p refcnt=%d\n",
-			 tw->tw_prot->name, tw, atomic_read(&tw->tw_refcnt));
-	}
-#endif
-	while (refcnt) {
-		inet_twsk_put(tw);
-		refcnt--;
-	}
+	BUG_ON(refcnt >= atomic_read(&tw->tw_refcnt));
+	atomic_sub(refcnt, &tw->tw_refcnt);
 }
 
-static noinline void inet_twsk_free(struct inet_timewait_sock *tw)
+void inet_twsk_free(struct inet_timewait_sock *tw)
 {
 	struct module *owner = tw->tw_prot->owner;
 	twsk_destructor((struct sock *)tw);
@@ -118,6 +110,18 @@ void inet_twsk_put(struct inet_timewait_sock *tw)
 }
 EXPORT_SYMBOL_GPL(inet_twsk_put);
 
+static void inet_twsk_add_node_rcu(struct inet_timewait_sock *tw,
+				   struct hlist_nulls_head *list)
+{
+	hlist_nulls_add_head_rcu(&tw->tw_node, list);
+}
+
+static void inet_twsk_add_bind_node(struct inet_timewait_sock *tw,
+				    struct hlist_head *list)
+{
+	hlist_add_head(&tw->tw_bind_node, list);
+}
+
 /*
  * Enter the time wait state. This is called with locally disabled BH.
  * Essentially we whip up a timewait bucket, copy the relevant info into it
@@ -146,26 +150,21 @@ void __inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
 	spin_lock(lock);
 
 	/*
-	 * Step 2: Hash TW into TIMEWAIT chain.
-	 * Should be done before removing sk from established chain
-	 * because readers are lockless and search established first.
+	 * Step 2: Hash TW into tcp ehash chain.
+	 * Notes :
+	 * - tw_refcnt is set to 3 because :
+	 * - We have one reference from bhash chain.
+	 * - We have one reference from ehash chain.
+	 * We can use atomic_set() because prior spin_lock()/spin_unlock()
+	 * committed into memory all tw fields.
 	 */
-	inet_twsk_add_node_rcu(tw, &ehead->twchain);
+	atomic_set(&tw->tw_refcnt, 1 + 1 + 1);
+	inet_twsk_add_node_rcu(tw, &ehead->chain);
 
-	/* Step 3: Remove SK from established hash. */
+	/* Step 3: Remove SK from hash chain */
 	if (__sk_nulls_del_node_init_rcu(sk))
 		sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
 
-	/*
-	 * Notes :
-	 * - We initially set tw_refcnt to 0 in inet_twsk_alloc()
-	 * - We add one reference for the bhash link
-	 * - We add one reference for the ehash link
-	 * - We want this refcnt update done before allowing other
-	 *   threads to find this tw in ehash chain.
-	 */
-	atomic_add(1 + 1 + 1, &tw->tw_refcnt);
-
 	spin_unlock(lock);
 }
 EXPORT_SYMBOL_GPL(__inet_twsk_hashdance);
@@ -490,7 +489,9 @@ void inet_twsk_purge(struct inet_hashinfo *hashinfo,
 restart_rcu:
 		rcu_read_lock();
 restart:
-		sk_nulls_for_each_rcu(sk, node, &head->twchain) {
+		sk_nulls_for_each_rcu(sk, node, &head->chain) {
+			if (sk->sk_state != TCP_TIME_WAIT)
+				continue;
 			tw = inet_twsk(sk);
 			if ((tw->tw_family != family) ||
 				atomic_read(&twsk_net(tw)->count))
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 6e5617b..be4b161 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3137,10 +3137,9 @@ void __init tcp_init(void)
 					&tcp_hashinfo.ehash_mask,
 					0,
 					thash_entries ? 0 : 512 * 1024);
-	for (i = 0; i <= tcp_hashinfo.ehash_mask; i++) {
+	for (i = 0; i <= tcp_hashinfo.ehash_mask; i++)
 		INIT_HLIST_NULLS_HEAD(&tcp_hashinfo.ehash[i].chain, i);
-		INIT_HLIST_NULLS_HEAD(&tcp_hashinfo.ehash[i].twchain, i);
-	}
+
 	if (inet_ehash_locks_alloc(&tcp_hashinfo))
 		panic("TCP: failed to alloc ehash_locks");
 	tcp_hashinfo.bhash =
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 75b63aa..cfb6d1d 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2194,18 +2194,6 @@ EXPORT_SYMBOL(tcp_v4_destroy_sock);
 #ifdef CONFIG_PROC_FS
 /* Proc filesystem TCP sock list dumping. */
 
-static inline struct inet_timewait_sock *tw_head(struct hlist_nulls_head *head)
-{
-	return hlist_nulls_empty(head) ? NULL :
-		list_entry(head->first, struct inet_timewait_sock, tw_node);
-}
-
-static inline struct inet_timewait_sock *tw_next(struct inet_timewait_sock *tw)
-{
-	return !is_a_nulls(tw->tw_node.next) ?
-		hlist_nulls_entry(tw->tw_node.next, typeof(*tw), tw_node) : NULL;
-}
-
 /*
  * Get next listener socket follow cur.  If cur is NULL, get first socket
  * starting from bucket given in st->bucket; when st->bucket is zero the
@@ -2309,10 +2297,9 @@ static void *listening_get_idx(struct seq_file *seq, loff_t *pos)
 	return rc;
 }
 
-static inline bool empty_bucket(struct tcp_iter_state *st)
+static inline bool empty_bucket(const struct tcp_iter_state *st)
 {
-	return hlist_nulls_empty(&tcp_hashinfo.ehash[st->bucket].chain) &&
-		hlist_nulls_empty(&tcp_hashinfo.ehash[st->bucket].twchain);
+	return hlist_nulls_empty(&tcp_hashinfo.ehash[st->bucket].chain);
 }
 
 /*
@@ -2329,7 +2316,6 @@ static void *established_get_first(struct seq_file *seq)
 	for (; st->bucket <= tcp_hashinfo.ehash_mask; ++st->bucket) {
 		struct sock *sk;
 		struct hlist_nulls_node *node;
-		struct inet_timewait_sock *tw;
 		spinlock_t *lock = inet_ehash_lockp(&tcp_hashinfo, st->bucket);
 
 		/* Lockless fast path for the common case of empty buckets */
@@ -2345,18 +2331,7 @@ static void *established_get_first(struct seq_file *seq)
 			rc = sk;
 			goto out;
 		}
-		st->state = TCP_SEQ_STATE_TIME_WAIT;
-		inet_twsk_for_each(tw, node,
-				   &tcp_hashinfo.ehash[st->bucket].twchain) {
-			if (tw->tw_family != st->family ||
-			    !net_eq(twsk_net(tw), net)) {
-				continue;
-			}
-			rc = tw;
-			goto out;
-		}
 		spin_unlock_bh(lock);
-		st->state = TCP_SEQ_STATE_ESTABLISHED;
 	}
 out:
 	return rc;
@@ -2365,7 +2340,6 @@ out:
 static void *established_get_next(struct seq_file *seq, void *cur)
 {
 	struct sock *sk = cur;
-	struct inet_timewait_sock *tw;
 	struct hlist_nulls_node *node;
 	struct tcp_iter_state *st = seq->private;
 	struct net *net = seq_file_net(seq);
@@ -2373,45 +2347,16 @@ static void *established_get_next(struct seq_file *seq, void *cur)
 	++st->num;
 	++st->offset;
 
-	if (st->state == TCP_SEQ_STATE_TIME_WAIT) {
-		tw = cur;
-		tw = tw_next(tw);
-get_tw:
-		while (tw && (tw->tw_family != st->family || !net_eq(twsk_net(tw), net))) {
-			tw = tw_next(tw);
-		}
-		if (tw) {
-			cur = tw;
-			goto out;
-		}
-		spin_unlock_bh(inet_ehash_lockp(&tcp_hashinfo, st->bucket));
-		st->state = TCP_SEQ_STATE_ESTABLISHED;
-
-		/* Look for next non empty bucket */
-		st->offset = 0;
-		while (++st->bucket <= tcp_hashinfo.ehash_mask &&
-				empty_bucket(st))
-			;
-		if (st->bucket > tcp_hashinfo.ehash_mask)
-			return NULL;
-
-		spin_lock_bh(inet_ehash_lockp(&tcp_hashinfo, st->bucket));
-		sk = sk_nulls_head(&tcp_hashinfo.ehash[st->bucket].chain);
-	} else
-		sk = sk_nulls_next(sk);
+	sk = sk_nulls_next(sk);
 
 	sk_nulls_for_each_from(sk, node) {
 		if (sk->sk_family == st->family && net_eq(sock_net(sk), net))
-			goto found;
+			return sk;
 	}
 
-	st->state = TCP_SEQ_STATE_TIME_WAIT;
-	tw = tw_head(&tcp_hashinfo.ehash[st->bucket].twchain);
-	goto get_tw;
-found:
-	cur = sk;
-out:
-	return cur;
+	spin_unlock_bh(inet_ehash_lockp(&tcp_hashinfo, st->bucket));
+	++st->bucket;
+	return established_get_first(seq);
 }
 
 static void *established_get_idx(struct seq_file *seq, loff_t pos)
@@ -2464,10 +2409,9 @@ static void *tcp_seek_last_pos(struct seq_file *seq)
 		if (rc)
 			break;
 		st->bucket = 0;
+		st->state = TCP_SEQ_STATE_ESTABLISHED;
 		/* Fallthrough */
 	case TCP_SEQ_STATE_ESTABLISHED:
-	case TCP_SEQ_STATE_TIME_WAIT:
-		st->state = TCP_SEQ_STATE_ESTABLISHED;
 		if (st->bucket > tcp_hashinfo.ehash_mask)
 			break;
 		rc = established_get_first(seq);
@@ -2524,7 +2468,6 @@ static void *tcp_seq_next(struct seq_file *seq, void *v, loff_t *pos)
 		}
 		break;
 	case TCP_SEQ_STATE_ESTABLISHED:
-	case TCP_SEQ_STATE_TIME_WAIT:
 		rc = established_get_next(seq, v);
 		break;
 	}
@@ -2548,7 +2491,6 @@ static void tcp_seq_stop(struct seq_file *seq, void *v)
 		if (v != SEQ_START_TOKEN)
 			spin_unlock_bh(&tcp_hashinfo.listening_hash[st->bucket].lock);
 		break;
-	case TCP_SEQ_STATE_TIME_WAIT:
 	case TCP_SEQ_STATE_ESTABLISHED:
 		if (v)
 			spin_unlock_bh(inet_ehash_lockp(&tcp_hashinfo, st->bucket));
@@ -2707,6 +2649,7 @@ static void get_timewait4_sock(const struct inet_timewait_sock *tw,
 static int tcp4_seq_show(struct seq_file *seq, void *v)
 {
 	struct tcp_iter_state *st;
+	struct sock *sk = v;
 	int len;
 
 	if (v == SEQ_START_TOKEN) {
@@ -2721,14 +2664,14 @@ static int tcp4_seq_show(struct seq_file *seq, void *v)
 	switch (st->state) {
 	case TCP_SEQ_STATE_LISTENING:
 	case TCP_SEQ_STATE_ESTABLISHED:
-		get_tcp4_sock(v, seq, st->num, &len);
+		if (sk->sk_state == TCP_TIME_WAIT)
+			get_timewait4_sock(v, seq, st->num, &len);
+		else
+			get_tcp4_sock(v, seq, st->num, &len);
 		break;
 	case TCP_SEQ_STATE_OPENREQ:
 		get_openreq4(st->syn_wait_sk, v, seq, st->num, st->uid, &len);
 		break;
-	case TCP_SEQ_STATE_TIME_WAIT:
-		get_timewait4_sock(v, seq, st->num, &len);
-		break;
 	}
 	seq_printf(seq, "%*s\n", TMPSZ - 1 - len, "");
 out:
diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
index 066640e..4644077 100644
--- a/net/ipv6/inet6_hashtables.c
+++ b/net/ipv6/inet6_hashtables.c
@@ -89,43 +89,36 @@ begin:
 	sk_nulls_for_each_rcu(sk, node, &head->chain) {
 		if (sk->sk_hash != hash)
 			continue;
-		if (likely(INET6_MATCH(sk, net, saddr, daddr, ports, dif))) {
-			if (unlikely(!atomic_inc_not_zero(&sk->sk_refcnt)))
-				goto begintw;
+		if (sk->sk_state == TCP_TIME_WAIT) {
+			if (!INET6_TW_MATCH(sk, net, saddr, daddr, ports, dif))
+				continue;
+		} else {
+			if (!INET6_MATCH(sk, net, saddr, daddr, ports, dif))
+				continue;
+		}
+		if (unlikely(!atomic_inc_not_zero(&sk->sk_refcnt)))
+			goto out;
+
+		if (sk->sk_state == TCP_TIME_WAIT) {
+			if (unlikely(!INET6_TW_MATCH(sk, net, saddr, daddr,
+						     ports, dif))) {
+				sock_gen_put(sk);
+				goto begin;
+			}
+		} else {
 			if (unlikely(!INET6_MATCH(sk, net, saddr, daddr,
 						  ports, dif))) {
 				sock_put(sk);
 				goto begin;
 			}
-		goto out;
+		goto found;
 		}
 	}
 	if (get_nulls_value(node) != slot)
 		goto begin;
-
-begintw:
-	/* Must check for a TIME_WAIT'er before going to listener hash. */
-	sk_nulls_for_each_rcu(sk, node, &head->twchain) {
-		if (sk->sk_hash != hash)
-			continue;
-		if (likely(INET6_TW_MATCH(sk, net, saddr, daddr,
-					  ports, dif))) {
-			if (unlikely(!atomic_inc_not_zero(&sk->sk_refcnt))) {
-				sk = NULL;
-				goto out;
-			}
-			if (unlikely(!INET6_TW_MATCH(sk, net, saddr, daddr,
-						     ports, dif))) {
-				inet_twsk_put(inet_twsk(sk));
-				goto begintw;
-			}
-			goto out;
-		}
-	}
-	if (get_nulls_value(node) != slot)
-		goto begintw;
-	sk = NULL;
 out:
+	sk = NULL;
+found:
 	rcu_read_unlock();
 	return sk;
 }
@@ -248,31 +241,25 @@ static int __inet6_check_established(struct inet_timewait_death_row *death_row,
 	spinlock_t *lock = inet_ehash_lockp(hinfo, hash);
 	struct sock *sk2;
 	const struct hlist_nulls_node *node;
-	struct inet_timewait_sock *tw;
+	struct inet_timewait_sock *tw = NULL;
 	int twrefcnt = 0;
 
 	spin_lock(lock);
 
-	/* Check TIME-WAIT sockets first. */
-	sk_nulls_for_each(sk2, node, &head->twchain) {
+	sk_nulls_for_each(sk2, node, &head->chain) {
 		if (sk2->sk_hash != hash)
 			continue;
 
-		if (likely(INET6_TW_MATCH(sk2, net, saddr, daddr,
-					  ports, dif))) {
-			tw = inet_twsk(sk2);
-			if (twsk_unique(sk, sk2, twp))
-				goto unique;
-			else
-				goto not_unique;
+		if (sk2->sk_state == TCP_TIME_WAIT) {
+			if (likely(INET6_TW_MATCH(sk2, net, saddr, daddr,
+						  ports, dif))) {
+				tw = inet_twsk(sk2);
+				if (twsk_unique(sk, sk2, twp))
+					goto unique;
+				else
+					goto not_unique;
+			}
 		}
-	}
-	tw = NULL;
-
-	/* And established part... */
-	sk_nulls_for_each(sk2, node, &head->chain) {
-		if (sk2->sk_hash != hash)
-			continue;
 		if (likely(INET6_MATCH(sk2, net, saddr, daddr, ports, dif)))
 			goto not_unique;
 	}
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 845a69e..e8e507b 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1834,6 +1834,7 @@ static void get_timewait6_sock(struct seq_file *seq,
 static int tcp6_seq_show(struct seq_file *seq, void *v)
 {
 	struct tcp_iter_state *st;
+	struct sock *sk = v;
 
 	if (v == SEQ_START_TOKEN) {
 		seq_puts(seq,
@@ -1849,14 +1850,14 @@ static int tcp6_seq_show(struct seq_file *seq, void *v)
 	switch (st->state) {
 	case TCP_SEQ_STATE_LISTENING:
 	case TCP_SEQ_STATE_ESTABLISHED:
-		get_tcp6_sock(seq, v, st->num);
+		if (sk->sk_state == TCP_TIME_WAIT)
+			get_timewait6_sock(seq, v, st->num);
+		else
+			get_tcp6_sock(seq, v, st->num);
 		break;
 	case TCP_SEQ_STATE_OPENREQ:
 		get_openreq6(seq, st->syn_wait_sk, v, st->num, st->uid);
 		break;
-	case TCP_SEQ_STATE_TIME_WAIT:
-		get_timewait6_sock(seq, v, st->num);
-		break;
 	}
 out:
 	return 0;

^ permalink raw reply related

* [PATCH net-next] tcp: rcvbuf autotuning improvements
From: Daniel Borkmann @ 2013-10-03  7:56 UTC (permalink / raw)
  To: davem; +Cc: netdev, eric.dumazet, Francesco Fusco

This is a complementary patch for commit 6ae705323 ("tcp: sndbuf
autotuning improvements") that fixes a performance regression on
receiver side in setups with low to mid latency, high throughput,
and senders with TSO/GSO off (receivers w/ default settings).

The following measurements in Mbit/s were done for 60sec w/ netperf
on virtio w/ TSO/GSO off:

(ms)    1)              2)              3)
  0     2762.11         1150.32         2906.17
 10     1083.61          538.89         1091.03
 25      471.81          313.18          474.60
 50      242.33          187.84          242.36
 75      162.14          134.45          161.95
100      121.55          101.96          121.49
150       80.64           57.75           80.48
200       58.97           54.11           59.90
250       47.10           46.92           47.31

Same setup w/ TSO/GSO on:

(ms)    1)              2)              3)
  0     12225.91        12366.89        16514.37
 10      1526.64         1525.79         2176.63
 25       655.13          647.79          871.52
 50       338.51          377.88          439.46
 75       246.49          278.46          295.62
100       210.93          207.56          217.34
150       127.88          129.56          141.33
200        94.95           94.50          107.29
250        67.39           73.88           88.35

Similarly as in 6ae705323, we fixed up power-of-two rounding and
took cached mss into account, thus bringing per_mss calculations
closer to each other, the rest stays as is.

We also renamed tcp_fixup_rcvbuf() to tcp_rcvbuf_expand() to be
consistent with tcp_sndbuf_expand().

While we do think that 6ae705323b71 is the right way to go, also
this follow-up seems necessary to restore performance for
receivers.

For the evaluation, same kernels on each host were used:

1) net-next (4fbef95af), which is before 6ae705323
2) net-next (6ae705323), which is sndbuf improvements
3) net-next (6ae705323), plus this patch on top

This was done in joint work with Francesco Fusco.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Francesco Fusco <ffusco@redhat.com>
---
 net/ipv4/tcp_input.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index cd65674..ed37b1d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -367,13 +367,19 @@ static void tcp_grow_window(struct sock *sk, const struct sk_buff *skb)
 }
 
 /* 3. Tuning rcvbuf, when connection enters established state. */
-static void tcp_fixup_rcvbuf(struct sock *sk)
+static void tcp_rcvbuf_expand(struct sock *sk)
 {
-	u32 mss = tcp_sk(sk)->advmss;
-	int rcvmem;
+	const struct tcp_sock *tp = tcp_sk(sk);
+	int rcvmem, per_mss;
+
+	per_mss = max_t(u32, tp->advmss, tp->mss_cache) +
+		  MAX_TCP_HEADER +
+		  SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+
+	per_mss = roundup_pow_of_two(per_mss) +
+		  SKB_DATA_ALIGN(sizeof(struct sk_buff));
 
-	rcvmem = 2 * SKB_TRUESIZE(mss + MAX_TCP_HEADER) *
-		 tcp_default_init_rwnd(mss);
+	rcvmem = 2 * tcp_default_init_rwnd(per_mss) * per_mss;
 
 	/* Dynamic Right Sizing (DRS) has 2 to 3 RTT latency
 	 * Allow enough cushion so that sender is not limited by our window
@@ -394,7 +400,7 @@ void tcp_init_buffer_space(struct sock *sk)
 	int maxwin;
 
 	if (!(sk->sk_userlocks & SOCK_RCVBUF_LOCK))
-		tcp_fixup_rcvbuf(sk);
+		tcp_rcvbuf_expand(sk);
 	if (!(sk->sk_userlocks & SOCK_SNDBUF_LOCK))
 		tcp_sndbuf_expand(sk);
 
-- 
1.7.11.7
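
For readers who want to play with the numbers, the per_mss arithmetic in the hunk above can be sketched in user space roughly as follows. This is only an illustration, not kernel code: the values chosen for MAX_TCP_HEADER, the cache line size, and the two struct sizes are assumed placeholders (the real ones depend on config and architecture), and sketch_default_init_rwnd() mirrors tcp_default_init_rwnd() as I understand it at the time of this patch.

```c
#include <stdint.h>

/* Placeholder values, assumed for illustration only. */
#define SKETCH_CACHE_BYTES     64u   /* stands in for SMP_CACHE_BYTES */
#define SKETCH_MAX_TCP_HEADER  320u  /* stands in for MAX_TCP_HEADER */
#define SKETCH_SHINFO_SIZE     320u  /* sizeof(struct skb_shared_info) */
#define SKETCH_SKBUFF_SIZE     232u  /* sizeof(struct sk_buff) */
#define SKETCH_INIT_CWND       10u   /* stands in for TCP_INIT_CWND */

/* Round x up to a multiple of the assumed cache line size. */
static uint32_t sketch_align(uint32_t x)
{
	return (x + SKETCH_CACHE_BYTES - 1) & ~(SKETCH_CACHE_BYTES - 1);
}

/* Smallest power of two >= x (only meant for small x, as used here). */
static uint32_t sketch_roundup_pow2(uint32_t x)
{
	uint32_t p = 1;

	while (p < x)
		p <<= 1;
	return p;
}

/* Assumed shape of the initial receive window helper, in segments. */
static uint32_t sketch_default_init_rwnd(uint32_t mss)
{
	uint32_t init_rwnd = SKETCH_INIT_CWND * 2;

	if (mss > 1460) {
		init_rwnd = (1460 * init_rwnd) / mss;
		if (init_rwnd < 2)
			init_rwnd = 2;
	}
	return init_rwnd;
}

/* The rcvmem estimate as computed by the patched function. */
static uint32_t sketch_rcvmem(uint32_t advmss, uint32_t mss_cache)
{
	uint32_t per_mss = advmss > mss_cache ? advmss : mss_cache;

	per_mss += SKETCH_MAX_TCP_HEADER + sketch_align(SKETCH_SHINFO_SIZE);
	per_mss = sketch_roundup_pow2(per_mss) +
		  sketch_align(SKETCH_SKBUFF_SIZE);

	return 2 * sketch_default_init_rwnd(per_mss) * per_mss;
}
```

With the placeholder sizes above, an mss of 1448 gives a per_mss of 4352 bytes, so the power-of-two rounding dominates the per-segment cost. The DRS cushion applied afterwards in the real function is not modelled here.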

^ permalink raw reply related

* Re: [PATCH RFC 46/77] mlx4: Update MSI/MSI-X interrupts enablement code
From: Jack Morgenstein @ 2013-10-03  8:02 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: linux-kernel, Bjorn Helgaas, Ralf Baechle, Michael Ellerman,
	Benjamin Herrenschmidt, Martin Schwidefsky, Ingo Molnar,
	Tejun Heo, Dan Williams, Andy King, Jon Mason, Matt Porter,
	linux-pci, linux-mips, linuxppc-dev, linux390, linux-s390, x86,
	linux-ide, iss_storagedev, linux-nvme, linux-rdma, netdev,
	e1000-devel, linux-driver, Solarflare linux maintainers,
	"VMware, Inc." <pv-dr
In-Reply-To: <b0a9f6f455aa03b7769e6d9cc2e7fdbc06732b2f.1380703263.git.agordeev@redhat.com>

On Wed,  2 Oct 2013 12:49:02 +0200
Alexander Gordeev <agordeev@redhat.com> wrote:

NACK.  This change does not do anything logically as far as I can tell.
pci_enable_msix in the current upstream kernel itself calls
pci_msix_table_size.  The current code yields the same results
as the code suggested below. (i.e., the suggested code has no effect on
optimality).

BTW, pci_msix_table_size never returns a value < 0 (if msix is not
enabled, it returns 0 for the table size), so the (err < 0) check here
is not correct. (I also do not like using "err" here anyway for the
value returned by pci_msix_table_size().  There is no error here, and
it is simply confusing.)

-Jack

> As result of recent re-design of the MSI/MSI-X interrupts enabling
> pattern this driver has to be updated to use the new technique to
> obtain a optimal number of MSI/MSI-X interrupts required.
> 
> Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
> ---
>  drivers/net/ethernet/mellanox/mlx4/main.c |   17 ++++++++---------
>  1 files changed, 8 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c
> index 60c9f4f..377a5ea 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/main.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/main.c
> @@ -1852,8 +1852,16 @@ static void mlx4_enable_msi_x(struct mlx4_dev *dev)
>  	int i;
>  
>  	if (msi_x) {
> +		err = pci_msix_table_size(dev->pdev);
> +		if (err < 0)
> +			goto no_msi;
> +
> +		/* Try if at least 2 vectors are available */
>  		nreq = min_t(int, dev->caps.num_eqs - dev->caps.reserved_eqs,
>  			     nreq);
> +		nreq = min_t(int, nreq, err);
> +		if (nreq < 2)
> +			goto no_msi;
>  
>  		entries = kcalloc(nreq, sizeof *entries, GFP_KERNEL);
>  		if (!entries)
> @@ -1862,17 +1870,8 @@ static void mlx4_enable_msi_x(struct mlx4_dev *dev)
>  		for (i = 0; i < nreq; ++i)
>  			entries[i].entry = i;
>  
> -	retry:
>  		err = pci_enable_msix(dev->pdev, entries, nreq);
>  		if (err) {
> -			/* Try again if at least 2 vectors are available */
> -			if (err > 1) {
> -				mlx4_info(dev, "Requested %d vectors, "
> -					  "but only %d MSI-X vectors available, "
> -					  "trying again\n", nreq, err);
> -				nreq = err;
> -				goto retry;
> -			}
>  			kfree(entries);
>  			goto no_msi;
>  		}


^ permalink raw reply

* Re: [PATCH net 1/2] sit: allow to use rtnl ops on fb tunnel
From: Nicolas Dichtel @ 2013-10-03  8:06 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, steffen.klassert, pshelar
In-Reply-To: <20131002.170822.1627691739724018976.davem@davemloft.net>

Le 02/10/2013 23:08, David Miller a écrit :
> From: Nicolas Dichtel <nicolas.dichtel@6wind.com>
> Date: Wed, 02 Oct 2013 09:36:02 +0200
>
>> Le 02/10/2013 09:15, Nicolas Dichtel a écrit :
>>> Le 01/10/2013 18:59, David Miller a écrit :
>>>> From: Nicolas Dichtel <nicolas.dichtel@6wind.com>
>>>> Date: Tue,  1 Oct 2013 18:04:59 +0200
>>>>
>>>>> rtnl ops where introduced by ba3e3f50a0e5 ("sit: advertise tunnel
>>>>> param via
>>>>> rtnl"), but I forget to assign rtnl ops to fb tunnels.
>>>>>
>>>>> Now that it is done, we must remove the explicit call to
>>>>> unregister_netdevice_queue(), because the fallback tunnel is added to
>>>>> the queue
>>>>> in sit_destroy_tunnels() when checking rtnl_link_ops of all netdevices
>>>>> (this
>>>>> is valid since commit 5e6700b3bf98 ("sit: add support of x-netns")).
>>>>>
>>>>> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
>>>>
>>>> Applied and queued up for -stable.
>> Another things about ipip: between 0974658da47c ("ipip: advertise
>> tunnel param
>> via rtnl", v3.8) and fd58156e456d ("IPIP: Use ip-tunneling code.",
>> v3.10) the
>> fb device of ipip module has the same problem.
>> Should I send a patch?
>
> Yes please do, thanks for noticing this.
>
In fact, I just noticed that the 3.9 branch is EoL (the bug is only in 3.8 and 3.9).
Should I still send a patch? If yes, based on which tree/branch?

^ permalink raw reply

* [ethtool] ethtool: ixgbe DCB registers dump for 82599 and x540
From: Jeff Kirsher @ 2013-10-03  8:11 UTC (permalink / raw)
  To: bhutchings
  Cc: Leonardo Potenza, netdev, gospo, sassmann, Maryam Tahhan,
	Jeff Kirsher

From: Leonardo Potenza <leonardo.potenza@intel.com>

Add support for dumping DCB registers with the ethtool -d option on
both 82599 and X540 Ethernet controllers.

Signed-off-by: Leonardo Potenza <leonardo.potenza@intel.com>
Signed-off-by: Maryam Tahhan <maryam.tahhan@intel.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Tested-by: Jack Morgan <jack.morgan@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 ixgbe.c | 154 ++++++++++++++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 125 insertions(+), 29 deletions(-)

diff --git a/ixgbe.c b/ixgbe.c
index 854e463..6a663e4 100644
--- a/ixgbe.c
+++ b/ixgbe.c
@@ -26,6 +26,8 @@
 #define IXGBE_HLREG0_LPBK          0x00008000
 #define IXGBE_RMCS_TFCE_802_3X     0x00000008
 #define IXGBE_RMCS_TFCE_PRIORITY   0x00000010
+#define IXGBE_FCCFG_TFCE_802_3X    0x00000008
+#define IXGBE_FCCFG_TFCE_PRIORITY  0x00000010
 #define IXGBE_MFLCN_PMCF           0x00000001 /* Pass MAC Control Frames */
 #define IXGBE_MFLCN_DPF            0x00000002 /* Discard Pause Frame */
 #define IXGBE_MFLCN_RPFCE          0x00000004 /* Receive Priority FC Enable */
@@ -127,6 +129,7 @@ int
 ixgbe_dump_regs(struct ethtool_drvinfo *info, struct ethtool_regs *regs)
 {
 	u32 *regs_buff = (u32 *)regs->data;
+	u32 regs_buff_len = regs->len / sizeof(*regs_buff);
 	u32 reg;
 	u16 hw_device_id = (u16) regs->version;
 	u8 version = (u8)(regs->version >> 24);
@@ -208,13 +211,23 @@ ixgbe_dump_regs(struct ethtool_drvinfo *info, struct ethtool_regs *regs)
 	(reg & IXGBE_SRRCTL_BSIZEPKT_MASK) <= 0x10 ? (reg & IXGBE_SRRCTL_BSIZEPKT_MASK) : 0x10);
 
 	reg = regs_buff[829];
-	fprintf(stdout,
-	"0x03D00: RMCS (Receive Music Control register)        0x%08X\n"
-	"       Transmit Flow Control:                         %s\n"
-	"       Priority Flow Control:                         %s\n",
-	reg,
-	reg & IXGBE_RMCS_TFCE_802_3X     ? "enabled"  : "disabled",
-	reg & IXGBE_RMCS_TFCE_PRIORITY   ? "enabled"  : "disabled");
+	if (mac_type == ixgbe_mac_82598EB) {
+		fprintf(stdout,
+		"0x03D00: RMCS (Receive Music Control register)        0x%08X\n"
+		"       Transmit Flow Control:                         %s\n"
+		"       Priority Flow Control:                         %s\n",
+		reg,
+		reg & IXGBE_RMCS_TFCE_802_3X     ? "enabled"  : "disabled",
+		reg & IXGBE_RMCS_TFCE_PRIORITY   ? "enabled"  : "disabled");
+	} else if (mac_type >= ixgbe_mac_82599EB) {
+		fprintf(stdout,
+		"0x03D00: FCCFG (Flow Control Configuration)           0x%08X\n"
+		"       Transmit Flow Control:                         %s\n"
+		"       Priority Flow Control:                         %s\n",
+		reg,
+		reg & IXGBE_FCCFG_TFCE_802_3X     ? "enabled"  : "disabled",
+		reg & IXGBE_FCCFG_TFCE_PRIORITY   ? "enabled"  : "disabled");
+	}
 
 	reg = regs_buff[1047];
 	fprintf(stdout,
@@ -428,7 +441,7 @@ ixgbe_dump_regs(struct ethtool_drvinfo *info, struct ethtool_regs *regs)
 		"0x02F00: RDRXCTL     (Receive DMA Control)            0x%08X\n",
 		regs_buff[469]);
 
-	for (i = 0; i < 8; i++ )
+	for (i = 0; i < 8; i++)
 		fprintf(stdout,
 		"0x%05X: RXPBSIZE%d   (Receive Packet Buffer Size %d)   0x%08X\n",
 		0x3C00 + (4 * i), i, i, regs_buff[470 + i]);
@@ -592,50 +605,133 @@ ixgbe_dump_regs(struct ethtool_drvinfo *info, struct ethtool_regs *regs)
 		"0x09000: FHFT        (Flexible Host Filter Table)     0x%08X\n",
 		regs_buff[828]);
 
-	/* DCE */
-	fprintf(stdout,
+	/* DCB */
+	if (mac_type == ixgbe_mac_82598EB) {
+		fprintf(stdout,
 		"0x07F40: DPMCS       (Desc. Plan Music Ctrl Status)   0x%08X\n",
 		regs_buff[830]);
 
-	fprintf(stdout,
+		fprintf(stdout,
 		"0x0CD00: PDPMCS      (Pkt Data Plan Music ctrl Stat)  0x%08X\n",
 		regs_buff[831]);
 
-	if (mac_type == ixgbe_mac_82598EB) {
 		fprintf(stdout,
-			"0x050A0: RUPPBMR     (Rx User Prior to Pkt Buff Map)  0x%08X\n",
-			regs_buff[832]);
+		"0x050A0: RUPPBMR     (Rx User Prior to Pkt Buff Map)  0x%08X\n",
+		regs_buff[832]);
 
 		for (i = 0; i < 8; i++)
 			fprintf(stdout,
-				"0x%05X: RT2CR%d      (Receive T2 Configure %d)         0x%08X\n",
-				0x03C20 + (4 * i), i, i, regs_buff[833 + i]);
+			"0x%05X: RT2CR%d      (Receive T2 Configure %d)         0x%08X\n",
+			0x03C20 + (4 * i), i, i, regs_buff[833 + i]);
 
 		for (i = 0; i < 8; i++)
 			fprintf(stdout,
-				"0x%05X: RT2SR%d      (Recieve T2 Status %d)            0x%08X\n",
-				0x03C40 + (4 * i), i, i, regs_buff[841 + i]);
+			"0x%05X: RT2SR%d      (Receive T2 Status %d)            0x%08X\n",
+			0x03C40 + (4 * i), i, i, regs_buff[841 + i]);
 
 		for (i = 0; i < 8; i++)
 			fprintf(stdout,
-				"0x%05X: TDTQ2TCCR%d  (Tx Desc TQ2 TC Config %d)        0x%08X\n",
-				0x0602C + (0x40 * i), i, i, regs_buff[849 + i]);
+			"0x%05X: TDTQ2TCCR%d  (Tx Desc TQ2 TC Config %d)        0x%08X\n",
+			0x0602C + (0x40 * i), i, i, regs_buff[849 + i]);
 
 		for (i = 0; i < 8; i++)
 			fprintf(stdout,
-				"0x%05X: TDTQ2TCSR%d  (Tx Desc TQ2 TC Status %d)        0x%08X\n",
-				0x0622C + (0x40 * i), i, i, regs_buff[857 + i]);
-	}
+			"0x%05X: TDTQ2TCSR%d  (Tx Desc TQ2 TC Status %d)        0x%08X\n",
+			0x0622C + (0x40 * i), i, i, regs_buff[857 + i]);
 
-	for (i = 0; i < 8; i++)
+		for (i = 0; i < 8; i++)
+			fprintf(stdout,
+			"0x%05X: TDPT2TCCR%d  (Tx Data Plane T2 TC Config %d)   0x%08X\n",
+			0x0CD20 + (4 * i), i, i, regs_buff[865 + i]);
+
+		for (i = 0; i < 8; i++)
+			fprintf(stdout,
+			"0x%05X: TDPT2TCSR%d  (Tx Data Plane T2 TC Status %d)   0x%08X\n",
+			0x0CD40 + (4 * i), i, i, regs_buff[873 + i]);
+	} else if (mac_type >= ixgbe_mac_82599EB) {
 		fprintf(stdout,
-		"0x%05X: TDPT2TCCR%d  (Tx Data Plane T2 TC Config %d)   0x%08X\n",
-		0x0CD20 + (4 * i), i, i, regs_buff[865 + i]);
+			"0x04900: RTTDCS      (Tx Descr Plane Ctrl&Status)     0x%08X\n",
+			regs_buff[830]);
+
+		fprintf(stdout,
+			"0x0CD00: RTTPCS      (Tx Pkt Plane Ctrl&Status)       0x%08X\n",
+			regs_buff[831]);
 
-	for (i = 0; i < 8; i++)
 		fprintf(stdout,
-		"0x%05X: TDPT2TCSR%d  (Tx Data Plane T2 TC Status %d)   0x%08X\n",
-		0x0CD40 + (4 * i), i, i, regs_buff[873 + i]);
+			"0x02430: RTRPCS      (Rx Packet Plane Ctrl&Status)    0x%08X\n",
+			regs_buff[832]);
+
+		for (i = 0; i < 8; i++)
+			fprintf(stdout,
+			"0x%05X: RTRPT4C%d    (Rx Packet Plane T4 Config %d)    0x%08X\n",
+			0x02140 + (4 * i), i, i, regs_buff[833 + i]);
+
+		for (i = 0; i < 8; i++)
+			fprintf(stdout,
+			"0x%05X: RTRPT4S%d    (Rx Packet Plane T4 Status %d)    0x%08X\n",
+			0x02160 + (4 * i), i, i, regs_buff[841 + i]);
+
+		for (i = 0; i < 8; i++)
+			fprintf(stdout,
+			"0x%05X: RTTDT2C%d    (Tx Descr Plane T2 Config %d)     0x%08X\n",
+			0x04910 + (4 * i), i, i, regs_buff[849 + i]);
+
+		for (i = 0; i < 8; i++)
+			fprintf(stdout,
+			"0x%05X: RTTDT2S%d    (Tx Descr Plane T2 Status %d)     0x%08X\n",
+			0x04930 + (4 * i), i, i, regs_buff[857 + i]);
+
+		for (i = 0; i < 8; i++)
+			fprintf(stdout,
+			"0x%05X: RTTPT2C%d    (Tx Packet Plane T2 Config %d)    0x%08X\n",
+			0x0CD20 + (4 * i), i, i, regs_buff[865 + i]);
+
+		for (i = 0; i < 8; i++)
+			fprintf(stdout,
+			"0x%05X: RTTPT2S%d    (Tx Packet Plane T2 Status %d)    0x%08X\n",
+			0x0CD40 + (4 * i), i, i, regs_buff[873 + i]);
+
+		if (regs_buff_len > 1129) {
+			fprintf(stdout,
+			"0x03020: RTRUP2TC    (Rx User Prio to Traffic Classes)0x%08X\n",
+			regs_buff[1129]);
+
+			fprintf(stdout,
+			"0x0C800: RTTUP2TC    (Tx User Prio to Traffic Classes)0x%08X\n",
+			regs_buff[1130]);
+
+			for (i = 0; i < 4; i++)
+				fprintf(stdout,
+				"0x%05X: TXLLQ%d      (Strict Low Lat Tx Queues %d)     0x%08X\n",
+				0x082E0 + (4 * i), i, i, regs_buff[1131 + i]);
+
+			if (mac_type == ixgbe_mac_82599EB) {
+				fprintf(stdout,
+				"0x04980: RTTBCNRM    (DCB TX Rate Sched MMW)          0x%08X\n",
+				regs_buff[1135]);
+
+				fprintf(stdout,
+				"0x0498C: RTTBCNRD    (DCB TX Rate-Scheduler Drift)    0x%08X\n",
+				regs_buff[1136]);
+			} else if (mac_type == ixgbe_mac_X540) {
+				fprintf(stdout,
+				"0x04980: RTTQCNRM    (DCB TX QCN Rate Sched MMW)      0x%08X\n",
+				regs_buff[1135]);
+
+				fprintf(stdout,
+				"0x0498C: RTTQCNRR    (DCB TX QCN Rate Reset)          0x%08X\n",
+				regs_buff[1136]);
+
+				fprintf(stdout,
+				"0x08B00: RTTQCNCR    (DCB TX QCN Control)             0x%08X\n",
+				regs_buff[1137]);
+
+				fprintf(stdout,
+				"0x04A90: RTTQCNTG    (DCB TX QCN Tagging)             0x%08X\n",
+				regs_buff[1138]);
+			}
+		}
+	}
 
 	/* Statistics */
 	fprintf(stdout,
-- 
1.8.3.1
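
A side note on the regs_buff_len guard in the last hunk, sketched in user space under my own assumptions: the dump length is in bytes and is converted to 32-bit words, and a guard of the form "len > 1129" only proves that index 1129 is readable, while the code inside the guarded block goes on to read indices up to 1138. Whether any driver version actually reports a dump length between 1130 and 1138 words is not something I have checked, and the helper names below are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of 32-bit registers in a dump of len_bytes bytes. */
static size_t sketch_reg_count(size_t len_bytes)
{
	return len_bytes / sizeof(uint32_t);
}

/* True if every index up to and including last_index is readable. */
static int sketch_can_read(size_t reg_count, size_t last_index)
{
	return last_index < reg_count;
}
```

Checking against the highest index a block touches (1138 for the X540 branch) would make the guard cover the whole block rather than only its first register.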

^ permalink raw reply related

* Re: [PATCH RFC 46/77] mlx4: Update MSI/MSI-X interrupts enablement code
From: Jack Morgenstein @ 2013-10-03  8:27 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: linux-kernel, Bjorn Helgaas, Ralf Baechle, Michael Ellerman,
	Benjamin Herrenschmidt, Martin Schwidefsky, Ingo Molnar,
	Tejun Heo, Dan Williams, Andy King, Jon Mason, Matt Porter,
	linux-pci, linux-mips, linuxppc-dev, linux390, linux-s390, x86,
	linux-ide, iss_storagedev, linux-nvme, linux-rdma, netdev,
	e1000-devel, linux-driver, Solarflare linux maintainers,
	"VMware, Inc." <pv-dr
In-Reply-To: <b0a9f6f455aa03b7769e6d9cc2e7fdbc06732b2f.1380703263.git.agordeev@redhat.com>

On Wed,  2 Oct 2013 12:49:02 +0200
Alexander Gordeev <agordeev@redhat.com> wrote:

UPDATING THIS REPLY.
Your change log confused me. The change below is not from a "recent
re-design", it is required due to an earlier patch in this patch set.
From the log, I assumed that the change you are talking about is already
upstream.

I will re-review.

-Jack

NACK.  This change does not do anything logically as far as I can tell.
pci_enable_msix in the current upstream kernel itself calls
pci_msix_table_size.  The current code yields the same results
as the code suggested below. (i.e., the suggested code has no effect on
optimality).

BTW, pci_msix_table_size never returns a value < 0 (if msix is not
enabled, it returns 0 for the table size), so the (err < 0) check here
is not correct. (I also do not like using "err" here anyway for the
value returned by pci_msix_table_size().  There is no error here, and
it is simply confusing.)

-Jack

> As result of recent re-design of the MSI/MSI-X interrupts enabling
> pattern this driver has to be updated to use the new technique to
> obtain a optimal number of MSI/MSI-X interrupts required.
> 
> Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
> ---
>  drivers/net/ethernet/mellanox/mlx4/main.c |   17 ++++++++---------
>  1 files changed, 8 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c
> index 60c9f4f..377a5ea 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/main.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/main.c
> @@ -1852,8 +1852,16 @@ static void mlx4_enable_msi_x(struct mlx4_dev *dev)
>  	int i;
>  
>  	if (msi_x) {
> +		err = pci_msix_table_size(dev->pdev);
> +		if (err < 0)
> +			goto no_msi;
> +
> +		/* Try if at least 2 vectors are available */
>  		nreq = min_t(int, dev->caps.num_eqs - dev->caps.reserved_eqs,
>  			     nreq);
> +		nreq = min_t(int, nreq, err);
> +		if (nreq < 2)
> +			goto no_msi;
>  
>  		entries = kcalloc(nreq, sizeof *entries, GFP_KERNEL);
>  		if (!entries)
> @@ -1862,17 +1870,8 @@ static void mlx4_enable_msi_x(struct mlx4_dev *dev)
>  		for (i = 0; i < nreq; ++i)
>  			entries[i].entry = i;
>  
> -	retry:
>  		err = pci_enable_msix(dev->pdev, entries, nreq);
>  		if (err) {
> -			/* Try again if at least 2 vectors are available */
> -			if (err > 1) {
> -				mlx4_info(dev, "Requested %d vectors, "
> -					  "but only %d MSI-X vectors available, "
> -					  "trying again\n", nreq, err);
> -				nreq = err;
> -				goto retry;
> -			}
>  			kfree(entries);
>  			goto no_msi;
>  		}


^ permalink raw reply

* Re: [PATCH RFC 46/77] mlx4: Update MSI/MSI-X interrupts enablement code
From: Jack Morgenstein @ 2013-10-03  8:39 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: linux-kernel, Bjorn Helgaas, Ralf Baechle, Michael Ellerman,
	Benjamin Herrenschmidt, Martin Schwidefsky, Ingo Molnar,
	Tejun Heo, Dan Williams, Andy King, Jon Mason, Matt Porter,
	linux-pci, linux-mips, linuxppc-dev, linux390, linux-s390, x86,
	linux-ide, iss_storagedev, linux-nvme, linux-rdma, netdev,
	e1000-devel, linux-driver, Solarflare linux maintainers,
	"VMware, Inc." <pv-dr
In-Reply-To: <b0a9f6f455aa03b7769e6d9cc2e7fdbc06732b2f.1380703263.git.agordeev@redhat.com>

On Wed,  2 Oct 2013 12:49:02 +0200
Alexander Gordeev <agordeev@redhat.com> wrote:

> As result of recent re-design of the MSI/MSI-X interrupts enabling
> pattern this driver has to be updated to use the new technique to
> obtain a optimal number of MSI/MSI-X interrupts required.
> 
> Signed-off-by: Alexander Gordeev <agordeev@redhat.com>

New review -- ACK (i.e., patch is OK), subject to acceptance of patches
05 and 07 of this patch set.

I sent my previous review (NACK) when I was not yet aware that
changes proposed were due to the two earlier patches (mentioned above)
in the current patch set.

The change log here should actually read something like the following:

As a result of changes to the MSI/MSI_X enabling procedures, this driver
must be modified in order to preserve its current msi/msi_x enablement
logic.

-Jack
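
For anyone following the thread without the driver open, the retry logic that the driver currently relies on can be sketched in user space as below. This is a simplified model, not the mlx4 code: sketch_enable_msix() is a hypothetical stand-in that imitates the return convention of pci_enable_msix() at the time (0 on success, a positive count when fewer vectors are available than requested, negative on error), and the "available" parameter is invented so the sketch can run without hardware.

```c
/*
 * Hypothetical stand-in for the old pci_enable_msix() convention:
 *   0   -> all nvec vectors were enabled
 *   > 0 -> request too large; value is how many could be allocated
 *   < 0 -> hard error
 */
static int sketch_enable_msix(int nvec, int available)
{
	if (available <= 0)
		return -1;
	if (nvec > available)
		return available;
	return 0;
}

/*
 * Retry with the reported count as long as at least 2 vectors are
 * on offer. Returns the number of vectors enabled, or 0 to tell the
 * caller to fall back to legacy interrupts.
 */
static int sketch_enable_with_retry(int nreq, int available)
{
	int err;

	for (;;) {
		err = sketch_enable_msix(nreq, available);
		if (err == 0)
			return nreq;
		if (err > 1) {
			nreq = err;	/* try again with what is available */
			continue;
		}
		return 0;		/* error or only 1 vector: give up */
	}
}
```

The loop ends after at most one retry in this model, because the second request is exactly the count the first call reported.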

^ permalink raw reply

* [PATCH net v2] be2net: Warn users of possible broken functionality on BE2 cards with very old FW versions with latest driver
From: Somnath Kotur @ 2013-10-03 10:04 UTC (permalink / raw)
  To: netdev; +Cc: davem, Somnath Kotur

On very old FW versions < 4.0, the mailbox command to set interrupts
on the card succeeds even though it is not supported and should have
failed, leading to a scenario where interrupts do not work.
Hence warn users to upgrade to a suitable FW version to avoid seeing
broken functionality.

Signed-off-by: Somnath Kotur <somnath.kotur@emulex.com>
---
v2: Replaced all occurrences of 'F/W' with 'FW' or 'Firmware' as suggested by
Sathya

 drivers/net/ethernet/emulex/benet/be_main.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/emulex/benet/be_main.c b/drivers/net/ethernet/emulex/benet/be_main.c
index 2c38cc4..9563ced 100644
--- a/drivers/net/ethernet/emulex/benet/be_main.c
+++ b/drivers/net/ethernet/emulex/benet/be_main.c
@@ -3247,6 +3247,11 @@ static int be_setup(struct be_adapter *adapter)
 
 	be_cmd_get_fw_ver(adapter, adapter->fw_ver, adapter->fw_on_flash);
 
+	if (BE2_chip(adapter) && memcmp(adapter->fw_ver, "4.", 2) < 0) {
+		dev_err(dev, "Firmware version is too old. IRQs may not work\n");
+		dev_err(dev, "Please upgrade firmware to version >= 4.0\n");
+	}
+
 	if (adapter->vlans_added)
 		be_vid_config(adapter);
 
-- 
1.6.0.2
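
One thing worth keeping in mind about the memcmp() test in the hunk above: it compares the first two bytes of the version string lexicographically, so a two-digit major version sorts before "4." as well. The user-space sketch below contrasts that with a numeric parse of the major number. It assumes the firmware version is a dotted decimal string; the helper names and the version strings in the usage note are made up for illustration and are not taken from the driver.

```c
#include <stdlib.h>
#include <string.h>

/* Same test as in the patch: byte-wise compare of the first 2 chars. */
static int sketch_too_old_memcmp(const char *fw_ver)
{
	return memcmp(fw_ver, "4.", 2) < 0;
}

/* Parse the leading major number; returns -1 if it is not "<digits>.". */
static int sketch_fw_major(const char *fw_ver)
{
	char *end;
	unsigned long major = strtoul(fw_ver, &end, 10);

	if (end == fw_ver || *end != '.')
		return -1;
	return (int)major;
}

/* Numeric variant: too old only if the major number is below 4. */
static int sketch_too_old_numeric(const char *fw_ver)
{
	int major = sketch_fw_major(fw_ver);

	return major >= 0 && major < 4;
}
```

Both helpers agree on strings such as "3.102.148.0" and "4.6.247.5". They differ on something like "10.2.470.14", which the byte-wise compare reports as too old because '1' sorts before '4'.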

^ permalink raw reply related

* Re: [PATCH net-next] fix unsafe set_memory_rw from softirq
From: Alexei Starovoitov @ 2013-10-03 11:57 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, netdev, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Daniel Borkmann, Paul E. McKenney, Xi Wang, x86,
	Eric Dumazet, linux-kernel, Heiko Carstens
In-Reply-To: <1380776250.19002.147.camel@edumazet-glaptop.roam.corp.google.com>

On Wed, Oct 2, 2013 at 9:57 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2013-10-02 at 21:53 -0700, Eric Dumazet wrote:
>> On Wed, 2013-10-02 at 21:44 -0700, Alexei Starovoitov wrote:
>>
>> > I think ifdef config_x86 is a bit ugly inside struct sk_filter, but
>> > don't mind whichever way.
>>
>> Its not fair to make sk_filter bigger, because it means that simple (non
>> JIT) filter might need an extra cache line.
>>
>> You could presumably use the following layout instead :
>>
>> struct sk_filter
>> {
>>         atomic_t                refcnt;
>>         struct rcu_head         rcu;
>>       struct work_struct      work;
>>
>>         unsigned int            len ____cacheline_aligned;    /* Number of filter blocks */
>>         unsigned int            (*bpf_func)(const struct sk_buff *skb,
>>                                             const struct sock_filter *filter);
>>         struct sock_filter      insns[0];
>> };
>
> And since @len is not used by sk_run_filter() use :
>
> struct sk_filter {
>         atomic_t                refcnt;
>         int                     len; /* number of filter blocks */
>         struct rcu_head         rcu;
>         struct work_struct      work;
>
>         unsigned int            (*bpf_func)(const struct sk_buff *skb,
>                                             const struct sock_filter *filter) ____cacheline_aligned;
>         struct sock_filter      insns[0];
> };

Yes, it makes sense to avoid a cache miss on the first insn inside
sk_run_filter(), at the expense of an 8-byte gap between work and
bpf_func (on x86_64 w/o lockdep).

It is probably even better to overlap the work and insns fields.
Pro: sk_filter size stays the same, no impact on the non-JIT case
Con: the code would be harder to understand

Another problem is that the kfree(sk_filter) inside
sk_filter_release_rcu() needs to move into bpf_jit_free().
So self-NACK. Let me fix these issues and respin.

Thanks
Alexei
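
The layout being discussed can be checked with offsetof() in a small user-space sketch. The stand-in types below are my own assumptions sized like the x86_64 non-lockdep case (two pointers for the rcu head, four words for the work item), a plain int replaces atomic_t, and the GCC/clang aligned attribute with a hard-coded 64 stands in for ____cacheline_aligned.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the kernel types, sized for illustration only. */
struct sketch_rcu_head { void *next; void (*func)(void *); };
struct sketch_work { long data; void *entry[2]; void (*func)(void *); };
struct sketch_insn { uint16_t code; uint8_t jt, jf; uint32_t k; };

/* Second layout from the thread: len moved up, bpf_func aligned. */
struct sketch_filter {
	int			refcnt;	/* atomic_t in the kernel */
	int			len;	/* number of filter blocks */
	struct sketch_rcu_head	rcu;
	struct sketch_work	work;

	unsigned int		(*bpf_func)(const void *skb,
					    const struct sketch_insn *filter)
				__attribute__((aligned(64)));
	struct sketch_insn	insns[];
};

/* Offset of bpf_func; a multiple of 64 by construction. */
static size_t sketch_bpf_func_offset(void)
{
	return offsetof(struct sketch_filter, bpf_func);
}

/* Padding between the end of work and the start of bpf_func. */
static size_t sketch_gap_before_bpf_func(void)
{
	return offsetof(struct sketch_filter, bpf_func) -
	       (offsetof(struct sketch_filter, work) +
		sizeof(struct sketch_work));
}

/* Offset of the first instruction, right behind bpf_func. */
static size_t sketch_insns_offset(void)
{
	return offsetof(struct sketch_filter, insns);
}
```

With these stand-in sizes on an LP64 target, bpf_func lands at offset 64 with 8 bytes of padding in front of it, and insns[0] starts in the same cache line as bpf_func.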

^ permalink raw reply

* Re: [PATCH] IPv6: Allow the MTU of ipip6 tunnel to be set below 1280
From: Oussama Ghorbel @ 2013-10-03 12:37 UTC (permalink / raw)
  To: Oussama Ghorbel, David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, netdev, linux-kernel,
	Oussama Ghorbel
In-Reply-To: <CABfLueHOR1HNsRC_-+Phc=9LMPTiOVuWoEjguG64L=9hiZLeVg@mail.gmail.com>

I will send a new patch, the diff will be:

diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 46ba243..4b51b03 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -1429,9 +1429,17 @@ ip6_tnl_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
 static int
 ip6_tnl_change_mtu(struct net_device *dev, int new_mtu)
 {
-       if (new_mtu < IPV6_MIN_MTU) {
-               return -EINVAL;
+       struct ip6_tnl *tnl = netdev_priv(dev);
+
+       if (tnl->parms.proto == IPPROTO_IPIP) {
+               if (new_mtu < 68)
+                       return -EINVAL;
+       } else {
+               if (new_mtu < IPV6_MIN_MTU)
+                       return -EINVAL;
        }
+       if (new_mtu > 0xFFF8 - dev->hard_header_len)
+               return -EINVAL;
        dev->mtu = new_mtu;
        return 0;
 }
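
The bounds in the diff above can be exercised in user space with a sketch like the following. It is a plain function over integers rather than a net_device, the SKETCH_ names are mine, and the numeric values (68 as the IPv4 minimum, 1280 for IPV6_MIN_MTU, protocol numbers 4 and 41) are written out as assumptions so the block compiles on its own.

```c
#include <errno.h>

/* Values assumed for illustration; the kernel uses its own macros. */
enum { SKETCH_PROTO_IPIP = 4, SKETCH_PROTO_IPV6 = 41 };
#define SKETCH_IPV4_MIN_MTU	68
#define SKETCH_IPV6_MIN_MTU	1280
#define SKETCH_MAX_LEN		0xFFF8

/*
 * Returns 0 if new_mtu is acceptable for a tunnel carrying the given
 * inner protocol, -EINVAL otherwise. hard_header_len is the link
 * header length that has to fit below the upper limit.
 */
static int sketch_check_mtu(int proto, int new_mtu, int hard_header_len)
{
	int min_mtu = (proto == SKETCH_PROTO_IPIP) ? SKETCH_IPV4_MIN_MTU
						   : SKETCH_IPV6_MIN_MTU;

	if (new_mtu < min_mtu)
		return -EINVAL;
	if (new_mtu > SKETCH_MAX_LEN - hard_header_len)
		return -EINVAL;
	return 0;
}
```

In this model an IPv4-in-IPv6 tunnel accepts an MTU of 68 while an IPv6-in-IPv6 tunnel still rejects anything below 1280, and both reject values that would not fit under the upper limit once the header length is subtracted.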


On Sun, Sep 29, 2013 at 5:33 PM, Oussama Ghorbel <ou.ghorbel@gmail.com> wrote:
> On Sun, Sep 29, 2013 at 4:45 PM, Hannes Frederic Sowa
> <hannes@stressinduktion.org> wrote:
>> On Sun, Sep 29, 2013 at 10:40:11AM +0100, Oussama Ghorbel wrote:
>>> On Fri, Sep 27, 2013 at 6:03 PM, Hannes Frederic Sowa
>>> <hannes@stressinduktion.org> wrote:
>>> > Ok, let's go with one function per protocol type. Seems easier.
>>> >
>>> > It seems to get more hairy, because it depends on the tunnel driver if the
>>> > prepended ip header is accounted in hard_header_len. :/
>>> >
>>> > I don't know if it works out cleanly. Otherwise I would be ok if the checks
>>> > just get repeated in ip6_tunnel and leave the rest as-is.
>>> >
>>> Yes, It will be the clean way to do it.
>>
>> Fine. :)
>>
>>> >
>>> > Linux currently cannot create "jumbograms" (only the receiving side
>>> > is supported).
>>> >
>>> I understand, but what is the benefit of this limit, or the harm
>>> in not specifying it?
>>> Please check this comment from eth.c
>>>
>>> /**
>>>  * eth_change_mtu - set new MTU size
>>>  * @dev: network device
>>>  * @new_mtu: new Maximum Transfer Unit
>>>  *
>>>  * Allow changing MTU size. Needs to be overridden for devices
>>>  * supporting jumbo frames.
>>>  */
>>> int eth_change_mtu(struct net_device *dev, int new_mtu)
>>
>> Hmm, I cannot judge without the full patch. Will it be applicable
>> to all net_devices or just ethernet ones? The name could be a bit
>> misleading. Reminds me a lot of dev_set_mtu based on the signature, btw.
>
> Normally to all net_devices, otherwise it could get complicated to
> check for every dev separately ...
> But, never mind, the comment below solves the issue
>
>>
>>> So wouldn't be a good idea to let our function open for jumbo frames...?
>>
>> Hm, we can document the fact that the function would need to be updated in
>> that case. But we should not allow to set a mtu which would require jumbograms
>> currently.
>
> OK, sounds good. I will check the mtu against the limit
> IPV6_MAXPLEN, and document the jumbo restriction ...
>
>>
>> Greetings,
>>
>>   Hannes
>>
>
> Regards,
> Oussama

^ permalink raw reply related

* Re: [PATCH net-next] tcp: rcvbuf autotuning improvements
From: Eric Dumazet @ 2013-10-03 13:03 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: davem, netdev, Francesco Fusco, Michael Dalton, ycheng, ncardwell
In-Reply-To: <1380787003-20488-1-git-send-email-dborkman@redhat.com>

On Thu, 2013-10-03 at 09:56 +0200, Daniel Borkmann wrote:
> This is a complementary patch for commit 6ae705323 ("tcp: sndbuf
> autotuning improvements") that fixes a performance regression on
> receiver side in setups with low to mid latency, high throughput,
> and senders with TSO/GSO off (receivers w/ default settings).
> 
> The following measurements in Mbit/s were done for 60sec w/ netperf
> on virtio w/ TSO/GSO off:
> 
> (ms)    1)              2)              3)
>   0     2762.11         1150.32         2906.17
>  10     1083.61          538.89         1091.03
>  25      471.81          313.18          474.60
>  50      242.33          187.84          242.36
>  75      162.14          134.45          161.95
> 100      121.55          101.96          121.49
> 150       80.64           57.75           80.48
> 200       58.97           54.11           59.90
> 250       47.10           46.92           47.31
> 
> Same setup w/ TSO/GSO on:
> 
> (ms)    1)              2)              3)
>   0     12225.91        12366.89        16514.37
>  10      1526.64         1525.79         2176.63
>  25       655.13          647.79          871.52
>  50       338.51          377.88          439.46
>  75       246.49          278.46          295.62
> 100       210.93          207.56          217.34
> 150       127.88          129.56          141.33
> 200        94.95           94.50          107.29
> 250        67.39           73.88           88.35
> 
> Similarly as in 6ae705323, we fixed up power-of-two rounding and
> took cached mss into account, thus bringing per_mss calculations
> closer to each other, the rest stays as is.
> 
> We also renamed tcp_fixup_rcvbuf() to tcp_rcvbuf_expand() to be
> consistent with tcp_sndbuf_expand().
> 
> While we do think that 6ae705323b71 is the right way to go, also
> this follow-up seems necessary to restore performance for
> receivers.

Hmm, I think you based this patch on some virtio requirements.

I would rather fix virtio, because virtio has poor truesize/payload
ratio.

Michael Dalton is working on this right now.

Really, I don't understand how 'fixing' the initial rcvbuf could explain
such a difference in a 60 second transfer.

Normally, if autotuning was working, the first sk_rcvbuf value would
only matter in the very beginning of a flow (maybe one, two or even
three RTT)

It looks like you only need to set sk_rcvbuf to tcp_rmem[2],
so you probably have to fix the autotuning, or virtio to give normal
skbs, not fat ones ;)
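
To make the truesize/payload point concrete, here is a rough sketch. It
assumes only that each received skb is charged its truesize against the
receive buffer; the helper name and the example numbers are hypothetical,
and the real accounting in the stack is more involved than this:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Payload bytes that fit into rcvbuf when every skb carries `payload`
 * bytes of data but is charged `truesize` bytes against the buffer.
 */
uint64_t sketch_usable_payload(uint64_t rcvbuf, uint32_t truesize,
			       uint32_t payload)
{
	assert(truesize > 0 && truesize >= payload);

	/* Only whole skbs fit; the remainder of the buffer is unusable. */
	uint64_t skbs = rcvbuf / truesize;

	return skbs * payload;
}
```

With an 87380-byte buffer and 1448-byte segments, a truesize of 2304 leaves
room for 37 skbs (53576 payload bytes), while a "fat" truesize of 4352 leaves
room for only 20 skbs (28960 payload bytes). Both truesize values are made
up for illustration.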


Thanks

^ permalink raw reply

* Re: [PATCH net-next] tcp: rcvbuf autotuning improvements
From: Eric Dumazet @ 2013-10-03 13:13 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: davem, netdev, Francesco Fusco
In-Reply-To: <1380787003-20488-1-git-send-email-dborkman@redhat.com>

On Thu, 2013-10-03 at 09:56 +0200, Daniel Borkmann wrote:

> We also renamed tcp_fixup_rcvbuf() to tcp_rcvbuf_expand() to be
> consistent with tcp_sndbuf_expand().

BTW we renamed the function only because it was used both for initial
sizing, and from tcp_new_space()

As is, tcp_fixup_rcvbuf() is only called at connection setup.

^ permalink raw reply

* [PATCH net-next] dev: add support of flag IFF_NOPROC
From: Nicolas Dichtel @ 2013-10-03 13:28 UTC (permalink / raw)
  To: netdev; +Cc: davem, Nicolas Dichtel

This flag allows creating netdevices without creating directories in
/proc, i.e. no /proc/sys/net/ipv[4|6]/[conf|neigh]/<dev> and no
/proc/net/dev_snmp6/<dev>.

When a system creates a lot of virtual netdevices, this speeds up their
creation. Systems which continuously create and destroy virtual netdevices
may never use the proc entries for these netdevices, which is where this
flag is useful.

Note that the flag must be specified at creation time (before calling
register_netdevice()) and cannot be removed during the life of the netdevice.

Here are some numbers:

dummy20000.batch contains 20 000 times 'link add type dummy' and
dummy20000-noproc.batch 20 000 times 'link add noproc type dummy'.

time ip -b dummy20000.batch
real    0m56.367s
user    0m0.200s
sys     0m53.070s

time ip -b dummy20000-noproc.batch
real    0m42.417s
user    0m0.310s
sys     0m38.470s

Suggested-by: Thierry Herbelot <thierry.herbelot@6wind.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
 include/uapi/linux/if.h | 2 ++
 net/core/dev.c          | 2 +-
 net/core/rtnetlink.c    | 1 +
 net/ipv4/devinet.c      | 3 +++
 net/ipv6/addrconf.c     | 3 +++
 net/ipv6/proc.c         | 5 +++++
 6 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/if.h b/include/uapi/linux/if.h
index 1ec407b01e46..bb9fe5eb38bf 100644
--- a/include/uapi/linux/if.h
+++ b/include/uapi/linux/if.h
@@ -53,6 +53,8 @@
 
 #define IFF_ECHO	0x40000		/* echo sent packets		*/
 
+#define IFF_NOPROC	0x80000		/* no proc/sysctl directories	*/
+
 #define IFF_VOLATILE	(IFF_LOOPBACK|IFF_POINTOPOINT|IFF_BROADCAST|IFF_ECHO|\
 		IFF_MASTER|IFF_SLAVE|IFF_RUNNING|IFF_LOWER_UP|IFF_DORMANT)
 
diff --git a/net/core/dev.c b/net/core/dev.c
index c25db20a4246..13f6dd360c74 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5199,7 +5199,7 @@ int __dev_change_flags(struct net_device *dev, unsigned int flags)
 			       IFF_DYNAMIC | IFF_MULTICAST | IFF_PORTSEL |
 			       IFF_AUTOMEDIA)) |
 		     (dev->flags & (IFF_UP | IFF_VOLATILE | IFF_PROMISC |
-				    IFF_ALLMULTI));
+				    IFF_ALLMULTI | IFF_NOPROC));
 
 	/*
 	 *	Load in the correct multicast list now the flags have changed.
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 4aedf03da052..5bad28e66fa2 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1860,6 +1860,7 @@ replay:
 		}
 
 		dev->ifindex = ifm->ifi_index;
+		dev->flags |= ifm->ifi_flags & IFF_NOPROC;
 
 		if (ops->newlink)
 			err = ops->newlink(net, dev, tb, data);
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index a1b5bcbd04ae..13b4089d8996 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -2160,6 +2160,9 @@ static void __devinet_sysctl_unregister(struct ipv4_devconf *cnf)
 
 static void devinet_sysctl_register(struct in_device *idev)
 {
+	if (idev->dev->flags & IFF_NOPROC)
+		return;
+
 	neigh_sysctl_register(idev->dev, idev->arp_parms, "ipv4", NULL);
 	__devinet_sysctl_register(dev_net(idev->dev), idev->dev->name,
 					&idev->cnf);
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index cd3fb301da38..e06d15ea2dba 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -5032,6 +5032,9 @@ static void __addrconf_sysctl_unregister(struct ipv6_devconf *p)
 
 static void addrconf_sysctl_register(struct inet6_dev *idev)
 {
+	if (idev->dev->flags & IFF_NOPROC)
+		return;
+
 	neigh_sysctl_register(idev->dev, idev->nd_parms, "ipv6",
 			      &ndisc_ifinfo_sysctl_change);
 	__addrconf_sysctl_register(dev_net(idev->dev), idev->dev->name,
diff --git a/net/ipv6/proc.c b/net/ipv6/proc.c
index 091d066a57b3..f89911116aa7 100644
--- a/net/ipv6/proc.c
+++ b/net/ipv6/proc.c
@@ -274,6 +274,9 @@ int snmp6_register_dev(struct inet6_dev *idev)
 	if (!idev || !idev->dev)
 		return -EINVAL;
 
+	if (idev->dev->flags & IFF_NOPROC)
+		return 0;
+
 	net = dev_net(idev->dev);
 	if (!net->mib.proc_net_devsnmp6)
 		return -ENOENT;
@@ -291,6 +294,8 @@ int snmp6_register_dev(struct inet6_dev *idev)
 int snmp6_unregister_dev(struct inet6_dev *idev)
 {
 	struct net *net = dev_net(idev->dev);
+	if (idev->dev->flags & IFF_NOPROC)
+		return 0;
 	if (!net->mib.proc_net_devsnmp6)
 		return -ENOENT;
 	if (!idev->stats.proc_dir_entry)
-- 
1.8.2.1
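
A small userspace sketch of why the flag cannot be set or cleared after
creation, based on the mask change in __dev_change_flags() above. The SK_*
names are local to this sketch and only a subset of the real flag bits is
modelled; the kernel function also handles IFF_UP, IFF_PROMISC and
IFF_ALLMULTI transitions afterwards, which is left out here:

```c
#include <assert.h>

/* Local copies of a few IFF_* bits; IFF_NOPROC is the value from the patch. */
#define SK_IFF_UP		0x1
#define SK_IFF_DEBUG		0x4
#define SK_IFF_NOARP		0x80
#define SK_IFF_PROMISC		0x100
#define SK_IFF_ALLMULTI		0x200
#define SK_IFF_MULTICAST	0x1000
#define SK_IFF_NOPROC		0x80000

/* Bits copied from the request (subset of the real list). */
#define SK_USER_SETTABLE (SK_IFF_DEBUG | SK_IFF_NOARP | SK_IFF_MULTICAST)
/* Bits kept from the device's current flags at this step. */
#define SK_PRESERVED	 (SK_IFF_UP | SK_IFF_PROMISC | SK_IFF_ALLMULTI | \
			  SK_IFF_NOPROC)

static_assert((SK_USER_SETTABLE & SK_PRESERVED) == 0,
	      "masks must not overlap");

/* Simplified form of the flag merge at the top of __dev_change_flags(). */
unsigned int sk_merge_flags(unsigned int old_flags, unsigned int requested)
{
	return (requested & SK_USER_SETTABLE) | (old_flags & SK_PRESERVED);
}
```

Because the bit sits in the preserved mask, a later flag change that omits
it does not clear it, and a later request that includes it does not set it.
The only place it is taken from userspace is the rtnetlink creation path.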

^ permalink raw reply related

* [PATCH iproute2 net-next-3.11] ip: add support of link flag IFF_NOPROC
From: Nicolas Dichtel @ 2013-10-03 13:30 UTC (permalink / raw)
  To: shemminger; +Cc: netdev, davem, Nicolas Dichtel
In-Reply-To: <1380806905-4461-1-git-send-email-nicolas.dichtel@6wind.com>

When this flag is specified, /proc/sys/net/ipv[4|6]/[conf|neigh]/<dev> and
/proc/net/dev_snmp6/<dev> directories are not created.

This flag cannot be removed.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
 include/linux/if.h    | 2 ++
 ip/ipaddress.c        | 1 +
 ip/iplink.c           | 3 +++
 man/man8/ip-link.8.in | 8 ++++++++
 4 files changed, 14 insertions(+)

diff --git a/include/linux/if.h b/include/linux/if.h
index 7f261c08e816..5b8a5ebff599 100644
--- a/include/linux/if.h
+++ b/include/linux/if.h
@@ -53,6 +53,8 @@
 
 #define IFF_ECHO	0x40000		/* echo sent packets		*/
 
+#define IFF_NOPROC	0x80000		/* no proc/sysctl directories	*/
+
 #define IFF_VOLATILE	(IFF_LOOPBACK|IFF_POINTOPOINT|IFF_BROADCAST|IFF_ECHO|\
 		IFF_MASTER|IFF_SLAVE|IFF_RUNNING|IFF_LOWER_UP|IFF_DORMANT)
 
diff --git a/ip/ipaddress.c b/ip/ipaddress.c
index 1c3e4da0d0da..b2e35028c844 100644
--- a/ip/ipaddress.c
+++ b/ip/ipaddress.c
@@ -116,6 +116,7 @@ static void print_link_flags(FILE *fp, unsigned flags, unsigned mdown)
 	_PF(LOWER_UP);
 	_PF(DORMANT);
 	_PF(ECHO);
+	_PF(NOPROC);
 #undef _PF
 	if (flags)
 		fprintf(fp, "%x", flags);
diff --git a/ip/iplink.c b/ip/iplink.c
index ada9d4255ba2..253ed1cc3f6f 100644
--- a/ip/iplink.c
+++ b/ip/iplink.c
@@ -50,6 +50,7 @@ void iplink_usage(void)
 		fprintf(stderr, "                   [ mtu MTU ]\n");
 		fprintf(stderr, "                   [ numtxqueues QUEUE_COUNT ]\n");
 		fprintf(stderr, "                   [ numrxqueues QUEUE_COUNT ]\n");
+		fprintf(stderr, "                   [ noproc ]\n");
 		fprintf(stderr, "                   type TYPE [ ARGS ]\n");
 		fprintf(stderr, "       ip link delete DEV type TYPE [ ARGS ]\n");
 		fprintf(stderr, "\n");
@@ -480,6 +481,8 @@ int iplink_parse(int argc, char **argv, struct iplink_req *req,
 				invarg("Invalid \"numrxqueues\" value\n", *argv);
 			addattr_l(&req->n, sizeof(*req), IFLA_NUM_RX_QUEUES,
 				  &numrxqueues, 4);
+		} else if (matches(*argv, "noproc") == 0) {
+			req->i.ifi_flags |= IFF_NOPROC;
 		} else {
 			if (strcmp(*argv, "dev") == 0) {
 				NEXT_ARG();
diff --git a/man/man8/ip-link.8.in b/man/man8/ip-link.8.in
index 76f92ddbd82c..b16d1a1f8a41 100644
--- a/man/man8/ip-link.8.in
+++ b/man/man8/ip-link.8.in
@@ -45,6 +45,8 @@ ip-link \- network device configuration
 .RB "[ " numrxqueues
 .IR QUEUE_COUNT " ]"
 .br
+.RB "[ " noproc " ]"
+.br
 .BR type " TYPE"
 .RI "[ " ARGS " ]"
 
@@ -197,6 +199,12 @@ specifies the number of transmit queues for new device.
 specifies the number of receive queues for new device.
 
 .TP
+.BI noproc
+specifies not to create interface-related directories under /proc
+(/proc/sys/net/ipv[4|6]/[conf|neigh]/<dev> and
+/proc/net/dev_snmp6/<dev>)
+
+.TP
 VXLAN Type Support
 For a link of type 
 .I VXLAN
-- 
1.8.2.1

^ permalink raw reply related

* [PATCHv2] IPv6: Allow the MTU of ipip6 tunnel to be set below 1280
From: Oussama Ghorbel @ 2013-10-03 13:49 UTC (permalink / raw)
  To: David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy
  Cc: netdev, linux-kernel, Oussama Ghorbel

The (inner) MTU of an ipip6 (IPv4-in-IPv6) tunnel cannot be set below 1280,
which is the minimum MTU in IPv6. However, there should be no IPv6 on the
tunnel interface at all, so the IPv6 rules should not apply.
More info at https://bugzilla.kernel.org/show_bug.cgi?id=15530

This patch checks the minimum MTU of an ipv6 tunnel according to these rules:
- If the tunnel is configured in ipip6 mode, the minimum MTU is 68.
- If the tunnel is configured in ip6ip6 or any other mode, the minimum MTU
  is 1280.

Signed-off-by: Oussama Ghorbel <ou.ghorbel@gmail.com>
---
 net/ipv6/ip6_tunnel.c |   12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 46ba243..4b51b03 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -1429,9 +1429,17 @@ ip6_tnl_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
 static int
 ip6_tnl_change_mtu(struct net_device *dev, int new_mtu)
 {
-	if (new_mtu < IPV6_MIN_MTU) {
-		return -EINVAL;
+	struct ip6_tnl *tnl = netdev_priv(dev);
+
+	if (tnl->parms.proto == IPPROTO_IPIP) {
+		if (new_mtu < 68)
+			return -EINVAL;
+	} else {
+		if (new_mtu < IPV6_MIN_MTU)
+			return -EINVAL;
 	}
+	if (new_mtu > 0xFFF8 - dev->hard_header_len)
+		return -EINVAL;
 	dev->mtu = new_mtu;
 	return 0;
 }
-- 
1.7.9.5

^ permalink raw reply related

* tx checksum offload in rtl8168evl disabled in driver
From: jason.morgan @ 2013-10-03 14:27 UTC (permalink / raw)
  To: netdev

Hi,

I'm trying to get close to saturating a 1G Ethernet link.

I'm at 517Mbps and I've found that there seems to be a cpu bottleneck.

I'm using 2k to 4k frames with a rtl8168evl.

I notice from ethtool that tx-checksum is turned off and refuses to turn
on.

I've found this message
http://www.spinics.net/lists/netdev/msg216530.html

which indicates that the cause is the driver.

I've looked at the driver code rtl8169.c in kernel 3.8 and the line 

        [RTL_GIGA_MAC_VER_34] =
                _R("RTL8168evl/8111evl",RTL_TD_1, FIRMWARE_8168E_3,
                                                        JUMBO_9K, false),

indicates the reason for this.

However, the message thread above indicates that this is not a problem and
can be changed to make tx-checksum offload possible.

However, we are using a newer chip than the one in the message thread. I've
tried to find other, more recent citations without success.

So, why is it still turned off?

What will be the effect of turning it on (changing false to true, in the 
driver line) for our chip?

Thanks in advance,
Jason

^ permalink raw reply

* Re: [PATCH RFC 51/77] mthca: Update MSI/MSI-X interrupts enablement code
From: Jack Morgenstein @ 2013-10-03 16:11 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: linux-kernel, Bjorn Helgaas, Ralf Baechle, Michael Ellerman,
	Benjamin Herrenschmidt, Martin Schwidefsky, Ingo Molnar,
	Tejun Heo, Dan Williams, Andy King, Jon Mason, Matt Porter,
	linux-pci, linux-mips, linuxppc-dev, linux390, linux-s390, x86,
	linux-ide, iss_storagedev, linux-nvme, linux-rdma, netdev,
	e1000-devel, linux-driver, Solarflare linux maintainers,
	"VMware, Inc." <pv-dr
In-Reply-To: <9d424912ef78993dc75e2af5006cd12913e9e7e7.1380703263.git.agordeev@redhat.com>

On Wed,  2 Oct 2013 12:49:07 +0200
Alexander Gordeev <agordeev@redhat.com> wrote:

> Subject: [PATCH RFC 51/77] mthca: Update MSI/MSI-X interrupts enablement code
> Date: Wed,  2 Oct 2013 12:49:07 +0200
> Sender: linux-rdma-owner@vger.kernel.org
> X-Mailer: git-send-email 1.7.7.6
> 
> As a result of the recent re-design of the MSI/MSI-X interrupts enabling
> pattern, this driver has to be updated to use the new technique to
> obtain an optimal number of MSI/MSI-X interrupts required.
> 
> Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
> ---

ACK.

-Jack

^ permalink raw reply
