Netdev List
 help / color / mirror / Atom feed
* RE: [PATCH net-next] net: introduce netdevice gso_min_segs attribute
From: David Laight @ 2014-10-06 10:20 UTC (permalink / raw)
  To: 'Eric Dumazet', Amir Vadai, David S. Miller
  Cc: Eric Dumazet, netdev@vger.kernel.org, Yevgeny Petrilin,
	Or Gerlitz, Ido Shamay
In-Reply-To: <1412529087.11091.14.camel@edumazet-glaptop2.roam.corp.google.com>

From: Eric Dumazet <edumazet@google.com>
> Some TSO engines might have a too heavy setup cost, that impacts
> performance on hosts sending small bursts (2 MSS per packet).
> 
> This patch adds a device gso_min_segs, allowing drivers to set
> a minimum segment size for TSO packets, according to the NIC
> performance.
> 
> Tested on a mlx4 NIC, this allows to get a ~110% increase of
> throughput when sending 2 MSS per packet.

Doesn't this all depend on what you need to optimise for.
I can think of three^Wseveral main cases:
1) minimising cpu use while saturating the local network.
2) minimising latency for single packets.
3) maximising throughput for a single connection.
4) minimising cpu use when handling a large number of connections.
plus all the variations in packet size.

I'm not sure what you are trading for what here.

Maybe gso = tx_bursting is almost always faster on some hardware?
(Especially if an idle mac engine is 'kicked' for the first packet
of a burst.)

	David


^ permalink raw reply

* Re: CPSW bug with AM437x SK
From: Mugunthan V N @ 2014-10-06  9:54 UTC (permalink / raw)
  To: balbi
  Cc: davem, netdev, Linux OMAP Mailing List,
	Linux ARM Kernel Mailing List, Linux Kernel Mailing List
In-Reply-To: <20141003010418.GA9612@saruman>

On Friday 03 October 2014 06:34 AM, Felipe Balbi wrote:
> [  261.177168] [<c0648d48>] (skb_panic) from [<c0565edc>] (skb_put+0x5c/0x60)
> [  261.184415] [<c0565edc>] (skb_put) from [<c0605aac>] (unix_stream_sendmsg+0x164/0x390)
> [  261.192712] [<c0605aac>] (unix_stream_sendmsg) from [<c055b364>] (sock_aio_write+0xdc/0xfc)
> [  261.201475] [<c055b364>] (sock_aio_write) from [<c014c42c>] (do_sync_write+0x8c/0xb4)
> [  261.209697] [<c014c42c>] (do_sync_write) from [<c014cf70>] (vfs_write+0x118/0x1c0)
> [  261.217652] [<c014cf70>] (vfs_write) from [<c014d564>] (SyS_write+0x4c/0xa0)
> [  261.225054] [<c014d564>] (SyS_write) from [<c000ed40>] (ret_fast_syscall+0x0/0x48)
> [  261.232988] Code: e58d4008 e58de00c e59f0008 ebfff48e (e7f001f2) 
> [  261.239378] ---[ end trace d64258d586f40104 ]---

The BT shows that the warn came from a unix socket interface, so this
cannot be a CPSW bug, its a bug in unix socket.

Are you not seeing this issue with file system in any other media?

I will try to reproduce this locally with my AM437x EVMsk.

Regards
Mugunthan V N

^ permalink raw reply

* Re: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
From: Daniel Borkmann @ 2014-10-06  9:49 UTC (permalink / raw)
  To: John Fastabend
  Cc: Florian Westphal, gerlitz.or, hannes, netdev, john.ronciak, amirv,
	eric.dumazet, danny.zhou
In-Reply-To: <5431EC82.7010305@gmail.com>

Hi John,

On 10/06/2014 03:12 AM, John Fastabend wrote:
> On 10/05/2014 05:29 PM, Florian Westphal wrote:
>> John Fastabend <john.fastabend@gmail.com> wrote:
>>> There is one critical difference when running with these interfaces
>>> vs running without them. In the normal case the af_packet module
>>> uses a standard descriptor format exported by the af_packet user
>>> space headers. In this model because we are working directly with
>>> driver queues the descriptor format maps to the descriptor format
>>> used by the device. User space applications can learn device
>>> information from the socket option PACKET_DEV_DESC_INFO which
>>> should provide enough details to extrapulate the descriptor formats.
>>> Although this adds some complexity to user space it removes the
>>> requirement to copy descriptor fields around.
>>
>> I find it very disappointing that we seem to have to expose such
>> hardware specific details to userspace via hw-independent interface.
>
> Well it was only for convenience if it doesn't fit as a socket
> option we can remove it. We can look up the device using the netdev
> name from the bind call. I see your point though so if there is
> consensus that this is not needed that is fine.
>
>> How big of a cost are we talking about when you say that it 'removes
>> the requirement to copy descriptor fields'?
>
> This was likely a poor description. If you want to let user space
> poll on the ring (without using system calls or interrupts) then
> I don't see how you can _not_ expose the ring directly complete with
> the vendor descriptor formats.

But how big is the concrete performance degradation you're seeing if you
use an e.g. `netmap-alike` Linux-own variant as a hw-neutral interface
that does *not* directly expose hw descriptor formats to user space?

With 1 core netmap does 10G line-rate on 64b; I don't know their numbers
on 40G when run on decent hardware though.

It would really be great if we have something vendor neutral exposed as
a stable ABI and could leverage emerging infrastructure we already have
in the kernel such as eBPF and recent qdisc batching for raw sockets
instead of reinventing the wheels. (Don't get me wrong, I would love to
see AF_PACKET improved ...)

Thanks,
Daniel

^ permalink raw reply

* RE: [PATCH v2 net-next] r8169:add support for RTL8168EP
From: Hau @ 2014-10-06  8:56 UTC (permalink / raw)
  To: Francois Romieu
  Cc: netdev@vger.kernel.org, nic_swsd, linux-kernel@vger.kernel.org
In-Reply-To: <20141003203244.GA19965@electric-eye.fr.zoreil.com>


> -----Original Message-----
> From: Francois Romieu [mailto:romieu@fr.zoreil.com]
> Sent: Saturday, October 04, 2014 4:33 AM
> To: Hau
> Cc: netdev@vger.kernel.org; nic_swsd; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH v2 net-next] r8169:add support for RTL8168EP
> 
> Chun-Hao Lin <hau@realtek.com> :
> [...]
> > diff --git a/drivers/net/ethernet/realtek/r8169.c
> > b/drivers/net/ethernet/realtek/r8169.c
> > index 54476ba..3efdf4d 100644
> > --- a/drivers/net/ethernet/realtek/r8169.c
> > +++ b/drivers/net/ethernet/realtek/r8169.c
> [...]
> > @@ -1276,6 +1273,52 @@ static void rtl_w0w1_eri(struct rtl8169_private
> *tp, int addr, u32 mask, u32 p,
> >  	rtl_eri_write(tp, addr, mask, (val & ~m) | p, type);  }
> >
> > +static u32 ocp_read(struct rtl8169_private *tp, u8 mask, u16 reg) {
> > +	void __iomem *ioaddr = tp->mmio_addr;
> > +
> > +	switch (tp->mac_version) {
> > +	case RTL_GIGA_MAC_VER_27:
> > +	case RTL_GIGA_MAC_VER_28:
> > +	case RTL_GIGA_MAC_VER_31:
> > +		RTL_W32(OCPAR, ((u32)mask & 0x0f) << 12 | (reg & 0x0fff));
> > +		return rtl_udelay_loop_wait_high(tp, &rtl_ocpar_cond, 100,
> 20)
> > +			? RTL_R32(OCPDR) : ~0;
> 
> '?' should be on the previous line. There isn't more room than needed but it's
> still ok.
> 
> [...]
> > +static void rtl8168ep_2_hw_phy_config(struct rtl8169_private *tp) {
> [...]
> > +	rtl_w0w1_phy(tp, 0x14, 0x7C00, ~0x7Cff);
> 
> Nit:	rtl_w0w1_phy(tp, 0x14, 0x7c00, ~0x7cff);
> 
> [...]
> +	rtl_writephy(tp, 0x1f, 0x0bC8);
> 
> Nit:	rtl_writephy(tp, 0x1f, 0x0bc8);
> 
> Tangent: please find feature wise identical code below without moves.
> I have avoided introducing more helpers that may have kept the code more
> balanced for 8168dp vs 8168ep as it wasn't the point and it is still possible to
> change the code from there.
> 
> It provides a different review experience. Feel free to give it a thought.

Do you mean I should collect similar hardware parameters setting into one function?
or
I should set hardware parameters according to hardware feature support version?
For example,
If mac_version == RTL8168DP
	dash_ver = 1
else If mac_version == RTL8168DP
	dash_ver = 2

if dashver == 1
	...
else if dashver == 2
	...

^ permalink raw reply

* [PATCH net-next 5/5] ipv6: don't walk node's leaf during serial number update
From: Hannes Frederic Sowa @ 2014-10-06  8:52 UTC (permalink / raw)
  To: netdev; +Cc: hideaki, kafai
In-Reply-To: <cover.1412585162.git.hannes@stressinduktion.org>

Cc: YOSHIFUJI Hideaki <hideaki@yoshifuji.org>
Cc: Martin Lau <kafai@fb.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 net/ipv6/ip6_fib.c | 47 ++++++++++++++++++++++++++++++-----------------
 1 file changed, 30 insertions(+), 17 deletions(-)

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 6c10c26..f36207f 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -50,6 +50,7 @@ struct fib6_cleaner_t {
 	struct fib6_walker_t w;
 	struct net *net;
 	int (*func)(struct rt6_info *, void *arg);
+	int sernum;
 	void *arg;
 };
 
@@ -105,6 +106,10 @@ static int fib6_new_sernum(struct net *net)
 	return new;
 }
 
+enum {
+	FIB6_NO_SERNUM_CHANGE = 0,
+};
+
 /*
  *	Auxiliary address test functions for the radix tree.
  *
@@ -1514,6 +1519,16 @@ static int fib6_clean_node(struct fib6_walker_t *w)
 		.nl_net = c->net,
 	};
 
+	if (c->sernum != FIB6_NO_SERNUM_CHANGE &&
+	    w->node->fn_sernum != c->sernum)
+		w->node->fn_sernum = c->sernum;
+
+	if (!c->func) {
+		WARN_ON_ONCE(c->sernum == FIB6_NO_SERNUM_CHANGE);
+		w->leaf = NULL;
+		return 0;
+	}
+
 	for (rt = w->leaf; rt; rt = rt->dst.rt6_next) {
 		res = c->func(rt, c->arg);
 		if (res < 0) {
@@ -1547,7 +1562,7 @@ static int fib6_clean_node(struct fib6_walker_t *w)
 
 static void fib6_clean_tree(struct net *net, struct fib6_node *root,
 			    int (*func)(struct rt6_info *, void *arg),
-			    bool prune, void *arg)
+			    bool prune, int sernum, void *arg)
 {
 	struct fib6_cleaner_t c;
 
@@ -1557,14 +1572,16 @@ static void fib6_clean_tree(struct net *net, struct fib6_node *root,
 	c.w.count = 0;
 	c.w.skip = 0;
 	c.func = func;
+	c.sernum = sernum;
 	c.arg = arg;
 	c.net = net;
 
 	fib6_walk(&c.w);
 }
 
-void fib6_clean_all(struct net *net, int (*func)(struct rt6_info *, void *arg),
-		    void *arg)
+static void __fib6_clean_all(struct net *net,
+			     int (*func)(struct rt6_info *, void *),
+			     int sernum, void *arg)
 {
 	struct fib6_table *table;
 	struct hlist_head *head;
@@ -1576,13 +1593,19 @@ void fib6_clean_all(struct net *net, int (*func)(struct rt6_info *, void *arg),
 		hlist_for_each_entry_rcu(table, head, tb6_hlist) {
 			write_lock_bh(&table->tb6_lock);
 			fib6_clean_tree(net, &table->tb6_root,
-					func, false, arg);
+					func, false, sernum, arg);
 			write_unlock_bh(&table->tb6_lock);
 		}
 	}
 	rcu_read_unlock();
 }
 
+void fib6_clean_all(struct net *net, int (*func)(struct rt6_info *, void *),
+		    void *arg)
+{
+	__fib6_clean_all(net, func, FIB6_NO_SERNUM_CHANGE, arg);
+}
+
 static int fib6_prune_clone(struct rt6_info *rt, void *arg)
 {
 	if (rt->rt6i_flags & RTF_CACHE) {
@@ -1595,25 +1618,15 @@ static int fib6_prune_clone(struct rt6_info *rt, void *arg)
 
 static void fib6_prune_clones(struct net *net, struct fib6_node *fn)
 {
-	fib6_clean_tree(net, fn, fib6_prune_clone, true, NULL);
-}
-
-static int fib6_update_sernum(struct rt6_info *rt, void *arg)
-{
-	int sernum = *(__u32 *)arg;
-
-	if (rt->rt6i_node &&
-	    rt->rt6i_node->fn_sernum != sernum)
-		rt->rt6i_node->fn_sernum = sernum;
-
-	return 0;
+	fib6_clean_tree(net, fn, fib6_prune_clone, true,
+			FIB6_NO_SERNUM_CHANGE, NULL);
 }
 
 static void fib6_flush_trees(struct net *net)
 {
 	int new_sernum = fib6_new_sernum(net);
 
-	fib6_clean_all(net, fib6_update_sernum, &new_sernum);
+	__fib6_clean_all(net, NULL, new_sernum, NULL);
 }
 
 /*
-- 
1.9.3

^ permalink raw reply related

* [PATCH net-next 4/5] ipv6: make fib6 serial number per namespace
From: Hannes Frederic Sowa @ 2014-10-06  8:52 UTC (permalink / raw)
  To: netdev; +Cc: hideaki, kafai
In-Reply-To: <cover.1412585162.git.hannes@stressinduktion.org>

Try to reduce number of possible fn_sernum mutation by constraining them
to their namespace.

Also remove rt_genid which I forgot to remove in 705f1c869d577c ("ipv6:
remove rt6i_genid").

Cc: YOSHIFUJI Hideaki <hideaki@yoshifuji.org>
Cc: Martin Lau <kafai@fb.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 include/net/netns/ipv6.h |  2 +-
 net/ipv6/af_inet6.c      |  2 +-
 net/ipv6/ip6_fib.c       | 13 ++++++-------
 3 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/include/net/netns/ipv6.h b/include/net/netns/ipv6.h
index eade27a..69ae41f 100644
--- a/include/net/netns/ipv6.h
+++ b/include/net/netns/ipv6.h
@@ -76,7 +76,7 @@ struct netns_ipv6 {
 #endif
 #endif
 	atomic_t		dev_addr_genid;
-	atomic_t		rt_genid;
+	atomic_t		fib6_sernum;
 };
 
 #if IS_ENABLED(CONFIG_NF_DEFRAG_IPV6)
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 34f726f..e8c4400 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -766,7 +766,7 @@ static int __net_init inet6_net_init(struct net *net)
 	net->ipv6.sysctl.icmpv6_time = 1*HZ;
 	net->ipv6.sysctl.flowlabel_consistency = 1;
 	net->ipv6.sysctl.auto_flowlabels = 0;
-	atomic_set(&net->ipv6.rt_genid, 0);
+	atomic_set(&net->ipv6.fib6_sernum, 1);
 
 	err = ipv6_init_mibs(net);
 	if (err)
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index a55a072..6c10c26 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -74,8 +74,6 @@ static int fib6_walk_continue(struct fib6_walker_t *w);
  *	result of redirects, path MTU changes, etc.
  */
 
-static atomic_t rt_sernum = ATOMIC_INIT(1);
-
 static void fib6_gc_timer_cb(unsigned long arg);
 
 static LIST_HEAD(fib6_walkers);
@@ -95,14 +93,15 @@ static void fib6_walker_unlink(struct fib6_walker_t *w)
 	write_unlock_bh(&fib6_walker_lock);
 }
 
-static int fib6_new_sernum(void)
+static int fib6_new_sernum(struct net *net)
 {
 	int new, old;
 
 	do {
-		old = atomic_read(&rt_sernum);
+		old = atomic_read(&net->ipv6.fib6_sernum);
 		new = old < INT_MAX ? old + 1 : 1;
-	} while (atomic_cmpxchg(&rt_sernum, old, new) != old);
+	} while (atomic_cmpxchg(&net->ipv6.fib6_sernum,
+				old, new) != old);
 	return new;
 }
 
@@ -841,7 +840,7 @@ int fib6_add(struct fib6_node *root, struct rt6_info *rt, struct nl_info *info,
 	int err = -ENOMEM;
 	int allow_create = 1;
 	int replace_required = 0;
-	int sernum = fib6_new_sernum();
+	int sernum = fib6_new_sernum(info->nl_net);
 
 	if (info->nlh) {
 		if (!(info->nlh->nlmsg_flags & NLM_F_CREATE))
@@ -1612,7 +1611,7 @@ static int fib6_update_sernum(struct rt6_info *rt, void *arg)
 
 static void fib6_flush_trees(struct net *net)
 {
-	int new_sernum = fib6_new_sernum();
+	int new_sernum = fib6_new_sernum(net);
 
 	fib6_clean_all(net, fib6_update_sernum, &new_sernum);
 }
-- 
1.9.3

^ permalink raw reply related

* [PATCH net-next 3/5] ipv6: only generate one new serial number per fib mutation
From: Hannes Frederic Sowa @ 2014-10-06  8:52 UTC (permalink / raw)
  To: netdev; +Cc: hideaki, kafai
In-Reply-To: <cover.1412585162.git.hannes@stressinduktion.org>

Cc: YOSHIFUJI Hideaki <hideaki@yoshifuji.org>
Cc: Martin Lau <kafai@fb.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 net/ipv6/ip6_fib.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index bfad469..a55a072 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -417,14 +417,13 @@ out:
 static struct fib6_node *fib6_add_1(struct fib6_node *root,
 				     struct in6_addr *addr, int plen,
 				     int offset, int allow_create,
-				     int replace_required)
+				     int replace_required, int sernum)
 {
 	struct fib6_node *fn, *in, *ln;
 	struct fib6_node *pn = NULL;
 	struct rt6key *key;
 	int	bit;
 	__be32	dir = 0;
-	int	sernum = fib6_new_sernum();
 
 	RT6_TRACE("fib6_add_1\n");
 
@@ -842,6 +841,7 @@ int fib6_add(struct fib6_node *root, struct rt6_info *rt, struct nl_info *info,
 	int err = -ENOMEM;
 	int allow_create = 1;
 	int replace_required = 0;
+	int sernum = fib6_new_sernum();
 
 	if (info->nlh) {
 		if (!(info->nlh->nlmsg_flags & NLM_F_CREATE))
@@ -854,7 +854,7 @@ int fib6_add(struct fib6_node *root, struct rt6_info *rt, struct nl_info *info,
 
 	fn = fib6_add_1(root, &rt->rt6i_dst.addr, rt->rt6i_dst.plen,
 			offsetof(struct rt6_info, rt6i_dst), allow_create,
-			replace_required);
+			replace_required, sernum);
 	if (IS_ERR(fn)) {
 		err = PTR_ERR(fn);
 		fn = NULL;
@@ -888,14 +888,14 @@ int fib6_add(struct fib6_node *root, struct rt6_info *rt, struct nl_info *info,
 			sfn->leaf = info->nl_net->ipv6.ip6_null_entry;
 			atomic_inc(&info->nl_net->ipv6.ip6_null_entry->rt6i_ref);
 			sfn->fn_flags = RTN_ROOT;
-			sfn->fn_sernum = fib6_new_sernum();
+			sfn->fn_sernum = sernum;
 
 			/* Now add the first leaf node to new subtree */
 
 			sn = fib6_add_1(sfn, &rt->rt6i_src.addr,
 					rt->rt6i_src.plen,
 					offsetof(struct rt6_info, rt6i_src),
-					allow_create, replace_required);
+					allow_create, replace_required, sernum);
 
 			if (IS_ERR(sn)) {
 				/* If it is failed, discard just allocated
@@ -914,7 +914,7 @@ int fib6_add(struct fib6_node *root, struct rt6_info *rt, struct nl_info *info,
 			sn = fib6_add_1(fn->subtree, &rt->rt6i_src.addr,
 					rt->rt6i_src.plen,
 					offsetof(struct rt6_info, rt6i_src),
-					allow_create, replace_required);
+					allow_create, replace_required, sernum);
 
 			if (IS_ERR(sn)) {
 				err = PTR_ERR(sn);
-- 
1.9.3

^ permalink raw reply related

* [PATCH net-next 2/5] ipv6: make rt_sernum atomic and serial number fields ordinary ints
From: Hannes Frederic Sowa @ 2014-10-06  8:52 UTC (permalink / raw)
  To: netdev; +Cc: hideaki, kafai
In-Reply-To: <cover.1412585162.git.hannes@stressinduktion.org>

Cc: YOSHIFUJI Hideaki <hideaki@yoshifuji.org>
Cc: Martin Lau <kafai@fb.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 include/net/ip6_fib.h |  2 +-
 net/ipv6/ip6_fib.c    | 23 +++++++++++++----------
 2 files changed, 14 insertions(+), 11 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index f4e6b3e..ee07f92 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -64,7 +64,7 @@ struct fib6_node {
 
 	__u16			fn_bit;		/* bit key */
 	__u16			fn_flags;
-	__u32			fn_sernum;
+	int			fn_sernum;
 	struct rt6_info		*rr_ptr;
 };
 
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 92f53db..bfad469 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -74,7 +74,7 @@ static int fib6_walk_continue(struct fib6_walker_t *w);
  *	result of redirects, path MTU changes, etc.
  */
 
-static __u32 rt_sernum;
+static atomic_t rt_sernum = ATOMIC_INIT(1);
 
 static void fib6_gc_timer_cb(unsigned long arg);
 
@@ -95,12 +95,15 @@ static void fib6_walker_unlink(struct fib6_walker_t *w)
 	write_unlock_bh(&fib6_walker_lock);
 }
 
-static u32 fib6_new_sernum(void)
+static int fib6_new_sernum(void)
 {
-	u32 n = ++rt_sernum;
-	if ((__s32)n <= 0)
-		rt_sernum = n = 1;
-	return n;
+	int new, old;
+
+	do {
+		old = atomic_read(&rt_sernum);
+		new = old < INT_MAX ? old + 1 : 1;
+	} while (atomic_cmpxchg(&rt_sernum, old, new) != old);
+	return new;
 }
 
 /*
@@ -421,7 +424,7 @@ static struct fib6_node *fib6_add_1(struct fib6_node *root,
 	struct rt6key *key;
 	int	bit;
 	__be32	dir = 0;
-	__u32	sernum = fib6_new_sernum();
+	int	sernum = fib6_new_sernum();
 
 	RT6_TRACE("fib6_add_1\n");
 
@@ -1598,7 +1601,7 @@ static void fib6_prune_clones(struct net *net, struct fib6_node *fn)
 
 static int fib6_update_sernum(struct rt6_info *rt, void *arg)
 {
-	__u32 sernum = *(__u32 *)arg;
+	int sernum = *(__u32 *)arg;
 
 	if (rt->rt6i_node &&
 	    rt->rt6i_node->fn_sernum != sernum)
@@ -1609,7 +1612,7 @@ static int fib6_update_sernum(struct rt6_info *rt, void *arg)
 
 static void fib6_flush_trees(struct net *net)
 {
-	__u32 new_sernum = fib6_new_sernum();
+	int new_sernum = fib6_new_sernum();
 
 	fib6_clean_all(net, fib6_update_sernum, &new_sernum);
 }
@@ -1822,7 +1825,7 @@ struct ipv6_route_iter {
 	struct fib6_walker_t w;
 	loff_t skip;
 	struct fib6_table *tbl;
-	__u32 sernum;
+	int sernum;
 };
 
 static int ipv6_route_seq_show(struct seq_file *seq, void *v)
-- 
1.9.3

^ permalink raw reply related

* [PATCH net-next 1/5] ipv6: minor fib6 cleanups like type safety, bool conversion, inline removal
From: Hannes Frederic Sowa @ 2014-10-06  8:52 UTC (permalink / raw)
  To: netdev; +Cc: hideaki, kafai
In-Reply-To: <cover.1412585162.git.hannes@stressinduktion.org>

Cc: YOSHIFUJI Hideaki <hideaki@yoshifuji.org>
Cc: Martin Lau <kafai@fb.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 include/net/ip6_fib.h | 14 ++++++++++++--
 net/ipv6/ip6_fib.c    | 35 +++++++++++++----------------------
 2 files changed, 25 insertions(+), 24 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index cf485f9..f4e6b3e 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -202,12 +202,22 @@ static inline void ip6_rt_put(struct rt6_info *rt)
 	dst_release(&rt->dst);
 }
 
+enum fib_walk_state_t {
+#ifdef CONFIG_IPV6_SUBTREES
+	FWS_S,
+#endif
+	FWS_L,
+	FWS_R,
+	FWS_C,
+	FWS_U
+};
+
 struct fib6_walker_t {
 	struct list_head lh;
 	struct fib6_node *root, *node;
 	struct rt6_info *leaf;
-	unsigned char state;
-	unsigned char prune;
+	enum fib_walk_state_t state;
+	bool prune;
 	unsigned int skip;
 	unsigned int count;
 	int (*func)(struct fib6_walker_t *);
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 97b9fa8..92f53db 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -46,16 +46,6 @@
 
 static struct kmem_cache *fib6_node_kmem __read_mostly;
 
-enum fib_walk_state_t {
-#ifdef CONFIG_IPV6_SUBTREES
-	FWS_S,
-#endif
-	FWS_L,
-	FWS_R,
-	FWS_C,
-	FWS_U
-};
-
 struct fib6_cleaner_t {
 	struct fib6_walker_t w;
 	struct net *net;
@@ -91,20 +81,21 @@ static void fib6_gc_timer_cb(unsigned long arg);
 static LIST_HEAD(fib6_walkers);
 #define FOR_WALKERS(w) list_for_each_entry(w, &fib6_walkers, lh)
 
-static inline void fib6_walker_link(struct fib6_walker_t *w)
+static void fib6_walker_link(struct fib6_walker_t *w)
 {
 	write_lock_bh(&fib6_walker_lock);
 	list_add(&w->lh, &fib6_walkers);
 	write_unlock_bh(&fib6_walker_lock);
 }
 
-static inline void fib6_walker_unlink(struct fib6_walker_t *w)
+static void fib6_walker_unlink(struct fib6_walker_t *w)
 {
 	write_lock_bh(&fib6_walker_lock);
 	list_del(&w->lh);
 	write_unlock_bh(&fib6_walker_lock);
 }
-static __inline__ u32 fib6_new_sernum(void)
+
+static u32 fib6_new_sernum(void)
 {
 	u32 n = ++rt_sernum;
 	if ((__s32)n <= 0)
@@ -128,7 +119,7 @@ static __inline__ u32 fib6_new_sernum(void)
 # define BITOP_BE32_SWIZZLE	0
 #endif
 
-static __inline__ __be32 addr_bit_set(const void *token, int fn_bit)
+static __be32 addr_bit_set(const void *token, int fn_bit)
 {
 	const __be32 *addr = token;
 	/*
@@ -142,7 +133,7 @@ static __inline__ __be32 addr_bit_set(const void *token, int fn_bit)
 	       addr[fn_bit >> 5];
 }
 
-static __inline__ struct fib6_node *node_alloc(void)
+static struct fib6_node *node_alloc(void)
 {
 	struct fib6_node *fn;
 
@@ -151,12 +142,12 @@ static __inline__ struct fib6_node *node_alloc(void)
 	return fn;
 }
 
-static __inline__ void node_free(struct fib6_node *fn)
+static void node_free(struct fib6_node *fn)
 {
 	kmem_cache_free(fib6_node_kmem, fn);
 }
 
-static __inline__ void rt6_release(struct rt6_info *rt)
+static void rt6_release(struct rt6_info *rt)
 {
 	if (atomic_dec_and_test(&rt->rt6i_ref))
 		dst_free(&rt->dst);
@@ -627,7 +618,7 @@ insert_above:
 	return ln;
 }
 
-static inline bool rt6_qualify_for_ecmp(struct rt6_info *rt)
+static bool rt6_qualify_for_ecmp(struct rt6_info *rt)
 {
 	return (rt->rt6i_flags & (RTF_GATEWAY|RTF_ADDRCONF|RTF_DYNAMIC)) ==
 	       RTF_GATEWAY;
@@ -820,7 +811,7 @@ add:
 	return 0;
 }
 
-static __inline__ void fib6_start_gc(struct net *net, struct rt6_info *rt)
+static void fib6_start_gc(struct net *net, struct rt6_info *rt)
 {
 	if (!timer_pending(&net->ipv6.ip6_fib_timer) &&
 	    (rt->rt6i_flags & (RTF_EXPIRES | RTF_CACHE)))
@@ -1554,7 +1545,7 @@ static int fib6_clean_node(struct fib6_walker_t *w)
 
 static void fib6_clean_tree(struct net *net, struct fib6_node *root,
 			    int (*func)(struct rt6_info *, void *arg),
-			    int prune, void *arg)
+			    bool prune, void *arg)
 {
 	struct fib6_cleaner_t c;
 
@@ -1583,7 +1574,7 @@ void fib6_clean_all(struct net *net, int (*func)(struct rt6_info *, void *arg),
 		hlist_for_each_entry_rcu(table, head, tb6_hlist) {
 			write_lock_bh(&table->tb6_lock);
 			fib6_clean_tree(net, &table->tb6_root,
-					func, 0, arg);
+					func, false, arg);
 			write_unlock_bh(&table->tb6_lock);
 		}
 	}
@@ -1602,7 +1593,7 @@ static int fib6_prune_clone(struct rt6_info *rt, void *arg)
 
 static void fib6_prune_clones(struct net *net, struct fib6_node *fn)
 {
-	fib6_clean_tree(net, fn, fib6_prune_clone, 1, NULL);
+	fib6_clean_tree(net, fn, fib6_prune_clone, true, NULL);
 }
 
 static int fib6_update_sernum(struct rt6_info *rt, void *arg)
-- 
1.9.3

^ permalink raw reply related

* [PATCH net-next 0/5] ipv6: cleanup after rt6_genid removal
From: Hannes Frederic Sowa @ 2014-10-06  8:52 UTC (permalink / raw)
  To: netdev; +Cc: hideaki, kafai

Leftover patches after rt6_genid removal after 705f1c869d577c ("ipv6:
remove rt6i_genid").

Major two changes are:
* keep fib6_sernum per namespace to reduce number of flushes in case
  system has high number of namespaces
* make fn_sernum updates cheaper

Hannes Frederic Sowa (5):
  ipv6: minor fib6 cleanups like type safety, bool conversion, inline
    removal
  ipv6: make rt_sernum atomic and serial number fields ordinary ints
  ipv6: only generate one new serial number per fib mutation
  ipv6: make fib6 serial number per namespace
  ipv6: don't walk node's leaf during serial number update

 include/net/ip6_fib.h    |  16 +++++--
 include/net/netns/ipv6.h |   2 +-
 net/ipv6/af_inet6.c      |   2 +-
 net/ipv6/ip6_fib.c       | 106 +++++++++++++++++++++++++----------------------
 4 files changed, 71 insertions(+), 55 deletions(-)

-- 
1.9.3

^ permalink raw reply

* Re: [PATCH v8 net-next 1/2] bonding: display xmit_hash_policy for non-dynamic-tlb mode
From: Nikolay Aleksandrov @ 2014-10-06  8:35 UTC (permalink / raw)
  To: David Miller, maheshb; +Cc: j.vosburgh, andy, vfalico, netdev, edumazet, maze
In-Reply-To: <20141006.005841.1016428020297723523.davem@davemloft.net>

On 06/10/14 06:58, David Miller wrote:
>
> After 8 revisions, I want to see some ACKs before applying these
> two changes :)
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

On my side all comments have been taken care of, I don't know if it's 
appropriate to also ack the patches as I've already given a reviewed-by to the 
first and my sob to the second :-)

Nik

^ permalink raw reply

* Re: randconfig build error with next-20141001, in drivers/i2c/algos/i2c-algo-bit.c
From: Oliver Hartkopp @ 2014-10-06  8:06 UTC (permalink / raw)
  To: Randy Dunlap, Jim Davis, Stephen Rothwell
  Cc: linux-next, Stephane Grosjean, linux-i2c, netdev@vger.kernel.org,
	linux-can
In-Reply-To: <542C8D93.8090008@infradead.org>

Hello all,

just to get it right:

So far it looks like this in linux/drivers/net/can/sja1000/Kconfig

config CAN_PEAK_PCIEC
        bool "PEAK PCAN-ExpressCard Cards"
        depends on CAN_PEAK_PCI
        select I2C
        select I2C_ALGOBIT

If one would change the

        select I2C

into

        depends on I2C

IMHO the CAN_PEAK_PCIEC hardware would *only* be visible and selectable when
I2C was selected before (from anyone else?).

So what it wrong on the current Kconfig entry?
Is 'select' deprecated?

Or did randconfig generate a configuration that would not be possible by
properly generating the config file with 'make menuconfig' ??

Please explain.

Thanks,
Oliver

On 10/02/2014 01:26 AM, Randy Dunlap wrote:
> On 10/01/14 14:37, Jim Davis wrote:
>> Building with the attached random configuration file,
> 
> Also:
> warning: (CAN_PEAK_PCIEC && SFC && IGB && VIDEO_TW68 && DRM && FB_DDC && FB_VIA) selects I2C_ALGOBIT which has unmet direct dependencies (I2C)
> 
>> drivers/i2c/algos/i2c-algo-bit.c: In function ‘i2c_bit_add_bus’:
>> drivers/i2c/algos/i2c-algo-bit.c:658:33: error: ‘i2c_add_adapter’
>> undeclared (first use in this function)
>>   return __i2c_bit_add_bus(adap, i2c_add_adapter);
>>                                  ^
>> drivers/i2c/algos/i2c-algo-bit.c:658:33: note: each undeclared
>> identifier is reported only once for each function it appears in
>> drivers/i2c/algos/i2c-algo-bit.c: In function ‘i2c_bit_add_numbered_bus’:
>> drivers/i2c/algos/i2c-algo-bit.c:664:33: error:
>> ‘i2c_add_numbered_adapter’ undeclared (first use in this function)
>>   return __i2c_bit_add_bus(adap, i2c_add_numbered_adapter);
>>                                  ^
>>   CC      net/openvswitch/actions.o
>> drivers/i2c/algos/i2c-algo-bit.c: In function ‘i2c_bit_add_bus’:
>> drivers/i2c/algos/i2c-algo-bit.c:659:1: warning: control reaches end of non-void
>>  function [-Wreturn-type]
>>  }
>>  ^
>> drivers/i2c/algos/i2c-algo-bit.c: In function ‘i2c_bit_add_numbered_bus’:
>> drivers/i2c/algos/i2c-algo-bit.c:665:1: warning: control reaches end of non-void
>>  function [-Wreturn-type]
>>  }
>>  ^
>> make[3]: *** [drivers/i2c/algos/i2c-algo-bit.o] Error 1
> 
> In drivers/media/pci/tw68/Kconfig, VIDEO_TW68 should depend on I2C in order
> to make it safe to select I2C_ALGOBIT.
> 
> In drivers/net/can/sja1000/Kconfig, CAN_PEAK_PCIEC should depend on I2C
> instead of selecting I2C (and change the help text).
> 
> 

^ permalink raw reply

* Re: [net 0/8] gianfar: ARM port driver updates (1/2)
From: Claudiu Manoil @ 2014-10-06  7:55 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Li.Xiubo, Shruti
In-Reply-To: <20141005.212722.1867743289839811370.davem@davemloft.net>

On 10/6/2014 4:27 AM, David Miller wrote:
> From: Claudiu Manoil <claudiu.manoil@freescale.com>
> Date: Fri, 3 Oct 2014 19:02:41 +0300
>
>> This is the first round of driver protability fixes and clean-up
>> with the main purpose to make gianfar portable on ARM, for the ARM
>> based SoC that integrates the eTSEC ethernet controller - "ls1021a".
>> The patches primarily address compile time errors, when compiling
>> gianfar on ARM.  They replace PPC specific functions and macros
>> with architecture independent ones, solve arch specific header
>> inclusions, guard code that relates to PPC only, and even address
>> some simple endianess issues (see MAC address setup patch).
>> The patches addressing the bulk of remaining endianess issues,
>> like handling DMA fields (BD and FCB), will follow with the sencond
>> round.
>> These patches were verified on the ls1021a SoC.
>
> If more endianness fixes are necessary and "will follow with the
> second round", I do not see how you could have verified specifically
> these changes on the ls1021a.
>

Hi David,

What I did is to split the initial patchset in 2, to ease up the review
process.
This first part is fairly straightforward, these patches make localized
code changes and can be more easily ported among different kernel 
versions.  The second part has fewer patches but touches more code,
because it handles endianess conversions for all the reads/writes to
the buffer descriptors.  Please let me now if you have objections to
this approach.
As for testing, we have our internal kernel tree for ARM supporting
ls1021a, and these gianfar patches have been there for a while and 
tested.  Now it's time to upstream (a cleaned-up version of) them.
(see git.freescale.com/git/cgit.cgi/layerscape/ls1021a/linux.git/)
Please note that the current (upstream) net tree does not include the
support for ls1021a (which is to be propagated via the arm tree).

Thanks and regards,
Claudiu

^ permalink raw reply

* Re: [PATCH net-next] ipvs: Avoid null-pointer deref in debug code
From: Julian Anastasov @ 2014-10-06  6:49 UTC (permalink / raw)
  To: Alex Gartrell; +Cc: horms, dan.carpenter, lvs-devel, netdev, kernel-team
In-Reply-To: <1412556895-26891-1-git-send-email-agartrell@fb.com>


	Hello,

On Sun, 5 Oct 2014, Alex Gartrell wrote:

> Ensure that the pointer is non-NULL before dereferencing it for debugging
> purposes.
> 
> Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
> Signed-off-by: Alex Gartrell <agartrell@fb.com>
> ---
>  net/netfilter/ipvs/ip_vs_xmit.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
> index 91f17c1..06bba9b 100644
> --- a/net/netfilter/ipvs/ip_vs_xmit.c
> +++ b/net/netfilter/ipvs/ip_vs_xmit.c
> @@ -316,7 +316,7 @@ __ip_vs_get_out_rt(int skb_af, struct sk_buff *skb, struct ip_vs_dest *dest,
>  	if (unlikely(crosses_local_route_boundary(skb_af, skb, rt_mode,
>  						  local))) {
>  		IP_VS_DBG_RL("We are crossing local and non-local addresses"
> -			     " daddr=%pI4\n", &dest->addr.ip);
> +			     " daddr=%pI4\n", dest ? &dest->addr.ip : NULL);
>  		goto err_put;
>  	}
>  
> @@ -458,7 +458,7 @@ __ip_vs_get_out_rt_v6(int skb_af, struct sk_buff *skb, struct ip_vs_dest *dest,
>  	if (unlikely(crosses_local_route_boundary(skb_af, skb, rt_mode,
>  						  local))) {
>  		IP_VS_DBG_RL("We are crossing local and non-local addresses"
> -			     " daddr=%pI6\n", &dest->addr.in6);
> +			     " daddr=%pI6\n", dest ? &dest->addr.in6 : NULL);
>  		goto err_put;
>  	}

	You have to print the "daddr" variable as
it was done before your patchset in the
"Stopping traffic to %s address, dest: %p..." message
because dest is not present in all cases, for example,
for *bypass_xmit. Other places provide cp->daddr but
for backup server some conns can live without cp->dest.

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply

* Re: [PATCH net-next] net: introduce netdevice gso_min_segs attribute
From: Amir Vadai @ 2014-10-06  6:41 UTC (permalink / raw)
  To: Eric Dumazet, David S. Miller
  Cc: Eric Dumazet, netdev, Yevgeny Petrilin, Or Gerlitz, Ido Shamay
In-Reply-To: <1412529087.11091.14.camel@edumazet-glaptop2.roam.corp.google.com>

On 10/5/2014 8:11 PM, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> Some TSO engines might have a too heavy setup cost, that impacts
> performance on hosts sending small bursts (2 MSS per packet).
> 
> This patch adds a device gso_min_segs, allowing drivers to set
> a minimum segment size for TSO packets, according to the NIC
> performance.
> 
> Tested on a mlx4 NIC, this allows to get a ~110% increase of
> throughput when sending 2 MSS per packet.
> 

Amazing!

Shouldn't there be a netif_set_gso_min_size() too?

> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
> mlx4 patch will be sent later, its a one liner.
> 
>  include/linux/netdevice.h |    4 +++-
>  net/core/dev.c            |    9 ++++++---
>  2 files changed, 9 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 22d54b9b700d..2df86f50261c 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1416,6 +1416,8 @@ enum netdev_priv_flags {
>   *	@gso_max_size:	Maximum size of generic segmentation offload
>   *	@gso_max_segs:	Maximum number of segments that can be passed to the
>   *			NIC for GSO
> + *	@gso_min_segs:	Minimum number of segments that can be passed to the
> + *			NIC for GSO
>   *
>   *	@dcbnl_ops:	Data Center Bridging netlink ops
>   *	@num_tc:	Number of traffic classes in the net device
> @@ -1666,7 +1668,7 @@ struct net_device {
>  	unsigned int		gso_max_size;
>  #define GSO_MAX_SEGS		65535
>  	u16			gso_max_segs;
> -
> +	u16			gso_min_segs;
>  #ifdef CONFIG_DCB
>  	const struct dcbnl_rtnl_ops *dcbnl_ops;
>  #endif
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 1a90530f83ff..16e8ebbd3316 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2567,10 +2567,12 @@ static netdev_features_t harmonize_features(struct sk_buff *skb,
>  
>  netdev_features_t netif_skb_features(struct sk_buff *skb)
>  {
> +	const struct net_device *dev = skb->dev;
> +	netdev_features_t features = dev->features;
> +	u16 gso_segs = skb_shinfo(skb)->gso_segs;
>  	__be16 protocol = skb->protocol;
> -	netdev_features_t features = skb->dev->features;
>  
> -	if (skb_shinfo(skb)->gso_segs > skb->dev->gso_max_segs)
> +	if (gso_segs > dev->gso_max_segs || gso_segs < dev->gso_min_segs)
>  		features &= ~NETIF_F_GSO_MASK;
>  
>  	if (protocol == htons(ETH_P_8021Q) || protocol == htons(ETH_P_8021AD)) {
> @@ -2581,7 +2583,7 @@ netdev_features_t netif_skb_features(struct sk_buff *skb)
>  	}
>  
>  	features = netdev_intersect_features(features,
> -					     skb->dev->vlan_features |
> +					     dev->vlan_features |
>  					     NETIF_F_HW_VLAN_CTAG_TX |
>  					     NETIF_F_HW_VLAN_STAG_TX);
>  
> @@ -6658,6 +6660,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
>  
>  	dev->gso_max_size = GSO_MAX_SIZE;
>  	dev->gso_max_segs = GSO_MAX_SEGS;
> +	dev->gso_min_segs = 0;
>  
>  	INIT_LIST_HEAD(&dev->napi_list);
>  	INIT_LIST_HEAD(&dev->unreg_list);
> 
> 

^ permalink raw reply

* RE: [net-next v3 20/29] fm10k: Add support for netdev offloads
From: Sathya Perla @ 2014-10-06  6:19 UTC (permalink / raw)
  To: Duyck, Alexander H, Or Gerlitz, Kirsher, Jeffrey T
  Cc: David Miller, Linux Netdev List, nhorman@redhat.com,
	sassmann@redhat.com, jogreene@redhat.com, Tom Herbert
In-Reply-To: <D29CC2C771335549BDCBB23A2B92BE379C7F84CE@ORSMSX110.amr.corp.intel.com>

> -----Original Message-----
> From: netdev-owner@vger.kernel.org [mailto:netdev-
> owner@vger.kernel.org] On Behalf Of Duyck, Alexander H
...
> adds
> > > >> support for basic offloads including TSO, Tx checksum, Rx checksum,
> > > >> Rx hash, and the same features applied to VXLAN/NVGRE
> > > tunnels.
> > > >
> > > >> --- a/drivers/net/ethernet/intel/fm10k/fm10k_main.c
> > > >> +++ b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
> > > > [...]
> > > >> +#define VXLAN_HLEN (sizeof(struct udphdr) + 8) static struct
> > > >> +ethhdr *fm10k_port_is_vxlan(struct sk_buff *skb) {
> > > >> +       struct fm10k_intfc *interface = netdev_priv(skb->dev);
> > > >> +       struct fm10k_vxlan_port *vxlan_port;
> > > >> +
> > > >> +       /* we can only offload a vxlan if we recognize it as such */
> > > >> +       vxlan_port = list_first_entry_or_null(&interface->vxlan_port,
> > > >> +                                             struct
> > > >> + fm10k_vxlan_port, list);
> > > >> +
> > > >> +       if (!vxlan_port)
> > > >> +               return NULL;
> > > >> +       if (vxlan_port->port != udp_hdr(skb)->dest)
> > > >> +               return NULL;
> > > >> +
> > > >> +       /* return offset of udp_hdr plus 8 bytes for VXLAN header */
> > > >> +       return (struct ethhdr *)(skb_transport_header(skb) +
> > > >> +VXLAN_HLEN); }
> > > >> +
> > ...
> > >
> > > >> +static int fm10k_tso(struct fm10k_ring *tx_ring,
> > > >> +                    struct fm10k_tx_buffer *first) {
> > > >> +       struct sk_buff *skb = first->skb;
> > > >> +       struct fm10k_tx_desc *tx_desc;
> > > >> +       unsigned char *th;
> > > >> +       u8 hdrlen;
> > > >> +
> > > >> +       if (skb->ip_summed != CHECKSUM_PARTIAL)
> > > >> +               return 0;
> > > >> +
> > > >> +       if (!skb_is_gso(skb))
> > > >> +               return 0;
> > > >> +
> > > >> +       /* compute header lengths */
> > > >> +       if (skb->encapsulation) {
> > > >> +               if (!fm10k_tx_encap_offload(skb))
> > > >> +                       goto err_vxlan;
> > > >> +               th = skb_inner_transport_header(skb);
> > > >> +       } else {
> > > >> +               th = skb_transport_header(skb);
> > > >> +       }
> > > >> +
> > > >> +       /* compute offset from SOF to transport header and add
> > > >> + header len
> > > */
> > > >> +       hdrlen = (th - skb->data) + (((struct tcphdr *)th)->doff <<
> > > >> + 2);
> > > >> +
> > > >> +       first->tx_flags |= FM10K_TX_FLAGS_CSUM;
> > > >> +
> > > >> +       /* update gso size and bytecount with header size */
> > > >> +       first->gso_segs = skb_shinfo(skb)->gso_segs;
> > > >> +       first->bytecount += (first->gso_segs - 1) * hdrlen;
> > > >> +
> > > >> +       /* populate Tx descriptor header size and mss */
> > > >> +       tx_desc = FM10K_TX_DESC(tx_ring, tx_ring->next_to_use);
> > > >> +       tx_desc->hdrlen = hdrlen;
> > > >> +       tx_desc->mss = cpu_to_le16(skb_shinfo(skb)->gso_size);
> > > >> +
> > > >> +       return 1;
> > > >> +err_vxlan:
> > > >> +       tx_ring->netdev->features &= ~NETIF_F_GSO_UDP_TUNNEL;
> > > >> +       if (!net_ratelimit())
> > > >> +               netdev_err(tx_ring->netdev,
> > > >> +                          "TSO requested for unsupported tunnel,
> > > >> +disabling
> > > offload\n");
> > > >> +       return -1;
> > > >> +}
> > > >
> > > > why? if TSO was requested for some packet the driver can't do, you
> > > > disable
> > > GSO
> > > > for udp tunnels for any future packets too? maybe just disable it
> > > > permanently  till you feel safer to run under the current stack?
> > >
> > > That is essentially what I am doing.  If the user wants they can turn
> > > the feature back on via ethtool, but there is no point in having it on
> > > if the device itself cannot support the packets that are coming through.
> > >
> >
> > To turn off the offload, shouldn't you be using netdev_update_features()
> > (->ndo_fix_features()) instead of directly flipping bits in netdev->features?
> 
> Why?  That would just be me calling my own ndo operation wouldn't it, since
> I am changing the features of the device covered by this driver.

I guess it would be to ensure that all dependencies are resolved and a 
FEAT_CHANGE notification is issued.

Documentation/networking/netdev-features.txt says the following:
" When current feature set (netdev->features) is to be changed, new set
is calculated and filtered by calling ndo_fix_features callback
and netdev_fix_features()."

" A driver that wants to trigger recalculation must do so by calling
netdev_update_features() while holding rtnl_lock. This should not be done
from ndo_*_features callbacks. netdev->features should not be modified by
driver except by means of ndo_fix_features callback."

^ permalink raw reply

* [PATCH net-next V1 11/14] net/mlx4_en: tx_info->ts_requested was not cleared
From: Amir Vadai @ 2014-10-06  6:16 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet
  Cc: netdev, Yevgeny Petrilin, Or Gerlitz, Ido Shamay, Amir Vadai
In-Reply-To: <1412576163-7224-1-git-send-email-amirv@mellanox.com>

From: Eric Dumazet <edumazet@google.com>

Properly clear tx_info->ts_requested

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_tx.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index e00841a..2c03b55 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -837,6 +837,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 	 * For timestamping add flag to skb_shinfo and
 	 * set flag for further reference
 	 */
+	tx_info->ts_requested = 0;
 	if (unlikely(ring->hwtstamp_tx_type == HWTSTAMP_TX_ON &&
 		     shinfo->tx_flags & SKBTX_HW_TSTAMP)) {
 		shinfo->tx_flags |= SKBTX_IN_PROGRESS;
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next V1 13/14] ethtool: Ethtool parameter to dynamically change tx_copybreak
From: Amir Vadai @ 2014-10-06  6:16 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet
  Cc: netdev, Yevgeny Petrilin, Or Gerlitz, Ido Shamay, Amir Vadai
In-Reply-To: <1412576163-7224-1-git-send-email-amirv@mellanox.com>

From: Eric Dumazet <edumazet@google.com>

Use new ethtool [sg]et_tunable() to set tx_copybread (inline threshold)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
---
 include/uapi/linux/ethtool.h | 1 +
 net/core/ethtool.c           | 1 +
 2 files changed, 2 insertions(+)

diff --git a/include/uapi/linux/ethtool.h b/include/uapi/linux/ethtool.h
index 7a364f2..99b4305 100644
--- a/include/uapi/linux/ethtool.h
+++ b/include/uapi/linux/ethtool.h
@@ -212,6 +212,7 @@ struct ethtool_value {
 enum tunable_id {
 	ETHTOOL_ID_UNSPEC,
 	ETHTOOL_RX_COPYBREAK,
+	ETHTOOL_TX_COPYBREAK,
 };
 
 enum tunable_type_id {
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 27e61b8..1600aa2 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -1625,6 +1625,7 @@ static int ethtool_tunable_valid(const struct ethtool_tunable *tuna)
 {
 	switch (tuna->id) {
 	case ETHTOOL_RX_COPYBREAK:
+	case ETHTOOL_TX_COPYBREAK:
 		if (tuna->len != sizeof(u32) ||
 		    tuna->type_id != ETHTOOL_TUNABLE_U32)
 			return -EINVAL;
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next V1 09/14] net/mlx4_en: Use local var in tx flow for skb_shinfo(skb)
From: Amir Vadai @ 2014-10-06  6:15 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet
  Cc: netdev, Yevgeny Petrilin, Or Gerlitz, Ido Shamay, Amir Vadai
In-Reply-To: <1412576163-7224-1-git-send-email-amirv@mellanox.com>

From: Eric Dumazet <edumazet@google.com>

Acces skb_shinfo(skb) once in tx flow.
Also, rename @i variable to @i_frag to avoid confusion, as the "goto
tx_drop_unmap;" relied on this @i variable.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_tx.c | 62 +++++++++++++++++-------------
 1 file changed, 36 insertions(+), 26 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index 4b018ce..aa05b09 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -532,13 +532,14 @@ static struct mlx4_en_tx_desc *mlx4_en_bounce_to_desc(struct mlx4_en_priv *priv,
 }
 
 static bool is_inline(int inline_thold, const struct sk_buff *skb,
+		      const struct skb_shared_info *shinfo,
 		      void **pfrag)
 {
 	void *ptr;
 
 	if (inline_thold && !skb_is_gso(skb) && skb->len <= inline_thold) {
-		if (skb_shinfo(skb)->nr_frags == 1) {
-			ptr = skb_frag_address_safe(&skb_shinfo(skb)->frags[0]);
+		if (shinfo->nr_frags == 1) {
+			ptr = skb_frag_address_safe(&shinfo->frags[0]);
 			if (unlikely(!ptr))
 				return 0;
 
@@ -546,7 +547,7 @@ static bool is_inline(int inline_thold, const struct sk_buff *skb,
 				*pfrag = ptr;
 
 			return 1;
-		} else if (unlikely(skb_shinfo(skb)->nr_frags))
+		} else if (unlikely(shinfo->nr_frags))
 			return 0;
 		else
 			return 1;
@@ -567,18 +568,19 @@ static int inline_size(const struct sk_buff *skb)
 }
 
 static int get_real_size(const struct sk_buff *skb,
+			 const struct skb_shared_info *shinfo,
 			 struct net_device *dev,
 			 int *lso_header_size)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
 	int real_size;
 
-	if (skb_is_gso(skb)) {
+	if (shinfo->gso_size) {
 		if (skb->encapsulation)
 			*lso_header_size = (skb_inner_transport_header(skb) - skb->data) + inner_tcp_hdrlen(skb);
 		else
 			*lso_header_size = skb_transport_offset(skb) + tcp_hdrlen(skb);
-		real_size = CTRL_SIZE + skb_shinfo(skb)->nr_frags * DS_SIZE +
+		real_size = CTRL_SIZE + shinfo->nr_frags * DS_SIZE +
 			ALIGN(*lso_header_size + 4, DS_SIZE);
 		if (unlikely(*lso_header_size != skb_headlen(skb))) {
 			/* We add a segment for the skb linear buffer only if
@@ -593,8 +595,8 @@ static int get_real_size(const struct sk_buff *skb,
 		}
 	} else {
 		*lso_header_size = 0;
-		if (!is_inline(priv->prof->inline_thold, skb, NULL))
-			real_size = CTRL_SIZE + (skb_shinfo(skb)->nr_frags + 1) * DS_SIZE;
+		if (!is_inline(priv->prof->inline_thold, skb, shinfo, NULL))
+			real_size = CTRL_SIZE + (shinfo->nr_frags + 1) * DS_SIZE;
 		else
 			real_size = inline_size(skb);
 	}
@@ -604,6 +606,7 @@ static int get_real_size(const struct sk_buff *skb,
 
 static void build_inline_wqe(struct mlx4_en_tx_desc *tx_desc,
 			     const struct sk_buff *skb,
+			     const struct skb_shared_info *shinfo,
 			     int real_size, u16 *vlan_tag,
 			     int tx_ind, void *fragptr)
 {
@@ -619,9 +622,9 @@ static void build_inline_wqe(struct mlx4_en_tx_desc *tx_desc,
 			       MIN_PKT_LEN - skb->len);
 		}
 		skb_copy_from_linear_data(skb, inl + 1, skb_headlen(skb));
-		if (skb_shinfo(skb)->nr_frags)
+		if (shinfo->nr_frags)
 			memcpy(((void *)(inl + 1)) + skb_headlen(skb), fragptr,
-			       skb_frag_size(&skb_shinfo(skb)->frags[0]));
+			       skb_frag_size(&shinfo->frags[0]));
 
 	} else {
 		inl->byte_count = cpu_to_be32(1 << 31 | spc);
@@ -639,9 +642,10 @@ static void build_inline_wqe(struct mlx4_en_tx_desc *tx_desc,
 			inl = (void *) (inl + 1) + spc;
 			skb_copy_from_linear_data_offset(skb, spc, inl + 1,
 					skb_headlen(skb) - spc);
-			if (skb_shinfo(skb)->nr_frags)
+			if (shinfo->nr_frags)
 				memcpy(((void *)(inl + 1)) + skb_headlen(skb) - spc,
-					fragptr, skb_frag_size(&skb_shinfo(skb)->frags[0]));
+				       fragptr,
+				       skb_frag_size(&shinfo->frags[0]));
 		}
 
 		wmb();
@@ -673,6 +677,7 @@ static void mlx4_bf_copy(void __iomem *dst, const void *src,
 
 netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 {
+	struct skb_shared_info *shinfo = skb_shinfo(skb);
 	struct mlx4_en_priv *priv = netdev_priv(dev);
 	struct device *ddev = priv->ddev;
 	struct mlx4_en_tx_ring *ring;
@@ -686,7 +691,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 	u32 index, bf_index;
 	__be32 op_own;
 	u16 vlan_tag = 0;
-	int i;
+	int i_frag;
 	int lso_header_size;
 	void *fragptr;
 	bool bounce = false;
@@ -702,7 +707,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 	/* fetch ring->cons far ahead before needing it to avoid stall */
 	ring_cons = ACCESS_ONCE(ring->cons);
 
-	real_size = get_real_size(skb, dev, &lso_header_size);
+	real_size = get_real_size(skb, shinfo, dev, &lso_header_size);
 	if (unlikely(!real_size))
 		goto tx_drop;
 
@@ -776,21 +781,22 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 	tx_info->data_offset = (void *)data - (void *)tx_desc;
 
 	tx_info->linear = (lso_header_size < skb_headlen(skb) &&
-			   !is_inline(ring->inline_thold, skb, NULL)) ? 1 : 0;
+			   !is_inline(ring->inline_thold, skb, shinfo, NULL)) ? 1 : 0;
 
-	tx_info->nr_maps = skb_shinfo(skb)->nr_frags + tx_info->linear;
+	tx_info->nr_maps = shinfo->nr_frags + tx_info->linear;
 	data += tx_info->nr_maps - 1;
 
-	if (is_inline(ring->inline_thold, skb, &fragptr)) {
+	if (is_inline(ring->inline_thold, skb, shinfo, &fragptr)) {
 		tx_info->inl = 1;
 	} else {
 		dma_addr_t dma = 0;
 		u32 byte_count = 0;
 
 		/* Map fragments if any */
-		for (i = skb_shinfo(skb)->nr_frags - 1; i >= 0; i--) {
+		for (i_frag = shinfo->nr_frags - 1; i_frag >= 0; i_frag--) {
 			const struct skb_frag_struct *frag;
-			frag = &skb_shinfo(skb)->frags[i];
+
+			frag = &shinfo->frags[i_frag];
 			byte_count = skb_frag_size(frag);
 			dma = skb_frag_dma_map(ddev, frag,
 					       0, byte_count,
@@ -831,8 +837,8 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 	 * set flag for further reference
 	 */
 	if (unlikely(ring->hwtstamp_tx_type == HWTSTAMP_TX_ON &&
-		     skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP)) {
-		skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
+		     shinfo->tx_flags & SKBTX_HW_TSTAMP)) {
+		shinfo->tx_flags |= SKBTX_IN_PROGRESS;
 		tx_info->ts_requested = 1;
 	}
 
@@ -858,6 +864,8 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 
 	/* Handle LSO (TSO) packets */
 	if (lso_header_size) {
+		int i;
+
 		/* Mark opcode as LSO */
 		op_own = cpu_to_be32(MLX4_OPCODE_LSO | (1 << 6)) |
 			((ring->prod & ring->size) ?
@@ -865,15 +873,16 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 
 		/* Fill in the LSO prefix */
 		tx_desc->lso.mss_hdr_size = cpu_to_be32(
-			skb_shinfo(skb)->gso_size << 16 | lso_header_size);
+			shinfo->gso_size << 16 | lso_header_size);
 
 		/* Copy headers;
 		 * note that we already verified that it is linear */
 		memcpy(tx_desc->lso.header, skb->data, lso_header_size);
 
 		ring->tso_packets++;
-		i = ((skb->len - lso_header_size) / skb_shinfo(skb)->gso_size) +
-			!!((skb->len - lso_header_size) % skb_shinfo(skb)->gso_size);
+
+		i = ((skb->len - lso_header_size) / shinfo->gso_size) +
+			!!((skb->len - lso_header_size) % shinfo->gso_size);
 		tx_info->nr_bytes = skb->len + (i - 1) * lso_header_size;
 		ring->packets += i;
 	} else {
@@ -889,7 +898,8 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 	AVG_PERF_COUNTER(priv->pstats.tx_pktsz_avg, skb->len);
 
 	if (tx_info->inl) {
-		build_inline_wqe(tx_desc, skb, real_size, &vlan_tag, tx_ind, fragptr);
+		build_inline_wqe(tx_desc, skb, shinfo, real_size, &vlan_tag,
+				 tx_ind, fragptr);
 		tx_info->inl = 1;
 	}
 
@@ -958,8 +968,8 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 tx_drop_unmap:
 	en_err(priv, "DMA mapping error\n");
 
-	for (i++; i < skb_shinfo(skb)->nr_frags; i++) {
-		data++;
+	while (++i_frag < shinfo->nr_frags) {
+		++data;
 		dma_unmap_page(ddev, (dma_addr_t) be64_to_cpu(data->addr),
 			       be32_to_cpu(data->byte_count),
 			       PCI_DMA_TODEVICE);
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next V1 07/14] net/mlx4_en: Avoid false sharing in mlx4_en_en_process_tx_cq()
From: Amir Vadai @ 2014-10-06  6:15 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet
  Cc: netdev, Yevgeny Petrilin, Or Gerlitz, Ido Shamay, Amir Vadai
In-Reply-To: <1412576163-7224-1-git-send-email-amirv@mellanox.com>

From: Eric Dumazet <edumazet@google.com>

mlx4_en_process_tx_cq() carefully fetches and writes ring->last_nr_txbb
and ring->cons only one time to avoid false sharing

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_tx.c | 22 +++++++++++++++-------
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index 9328e6e..63e1f24 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -387,6 +387,8 @@ static bool mlx4_en_process_tx_cq(struct net_device *dev,
 	u64 timestamp = 0;
 	int done = 0;
 	int budget = priv->tx_work_limit;
+	u32 last_nr_txbb;
+	u32 ring_cons;
 
 	if (!priv->port_up)
 		return true;
@@ -394,7 +396,9 @@ static bool mlx4_en_process_tx_cq(struct net_device *dev,
 	prefetchw(&ring->tx_queue->dql.limit);
 	index = cons_index & size_mask;
 	cqe = mlx4_en_get_cqe(buf, index, priv->cqe_size) + factor;
-	ring_index = ring->cons & size_mask;
+	last_nr_txbb = ACCESS_ONCE(ring->last_nr_txbb);
+	ring_cons = ACCESS_ONCE(ring->cons);
+	ring_index = ring_cons & size_mask;
 	stamp_index = ring_index;
 
 	/* Process all completed CQEs */
@@ -419,19 +423,19 @@ static bool mlx4_en_process_tx_cq(struct net_device *dev,
 		new_index = be16_to_cpu(cqe->wqe_index) & size_mask;
 
 		do {
-			txbbs_skipped += ring->last_nr_txbb;
-			ring_index = (ring_index + ring->last_nr_txbb) & size_mask;
+			txbbs_skipped += last_nr_txbb;
+			ring_index = (ring_index + last_nr_txbb) & size_mask;
 			if (ring->tx_info[ring_index].ts_requested)
 				timestamp = mlx4_en_get_cqe_ts(cqe);
 
 			/* free next descriptor */
-			ring->last_nr_txbb = mlx4_en_free_tx_desc(
+			last_nr_txbb = mlx4_en_free_tx_desc(
 					priv, ring, ring_index,
-					!!((ring->cons + txbbs_skipped) &
+					!!((ring_cons + txbbs_skipped) &
 					ring->size), timestamp);
 
 			mlx4_en_stamp_wqe(priv, ring, stamp_index,
-					  !!((ring->cons + txbbs_stamp) &
+					  !!((ring_cons + txbbs_stamp) &
 						ring->size));
 			stamp_index = ring_index;
 			txbbs_stamp = txbbs_skipped;
@@ -452,7 +456,11 @@ static bool mlx4_en_process_tx_cq(struct net_device *dev,
 	mcq->cons_index = cons_index;
 	mlx4_cq_set_ci(mcq);
 	wmb();
-	ring->cons += txbbs_skipped;
+
+	/* we want to dirty this cache line once */
+	ACCESS_ONCE(ring->last_nr_txbb) = last_nr_txbb;
+	ACCESS_ONCE(ring->cons) = ring_cons + txbbs_skipped;
+
 	netdev_tx_completed_queue(ring->tx_queue, packets, bytes);
 
 	/*
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next V1 04/14] net/mlx4_en: tx_info allocated with kmalloc() instead of vmalloc()
From: Amir Vadai @ 2014-10-06  6:15 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet
  Cc: netdev, Yevgeny Petrilin, Or Gerlitz, Ido Shamay, Amir Vadai
In-Reply-To: <1412576163-7224-1-git-send-email-amirv@mellanox.com>

From: Eric Dumazet <edumazet@google.com>

Try to allocate using kmalloc_node() first, only on failure use
vmalloc()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_tx.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index 3ea17f9..02ade59 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -68,7 +68,7 @@ int mlx4_en_create_tx_ring(struct mlx4_en_priv *priv,
 	ring->inline_thold = priv->prof->inline_thold;
 
 	tmp = size * sizeof(struct mlx4_en_tx_info);
-	ring->tx_info = vmalloc_node(tmp, node);
+	ring->tx_info = kmalloc_node(tmp, GFP_KERNEL | __GFP_NOWARN, node);
 	if (!ring->tx_info) {
 		ring->tx_info = vmalloc(tmp);
 		if (!ring->tx_info) {
@@ -151,7 +151,7 @@ err_bounce:
 	kfree(ring->bounce_buf);
 	ring->bounce_buf = NULL;
 err_info:
-	vfree(ring->tx_info);
+	kvfree(ring->tx_info);
 	ring->tx_info = NULL;
 err_ring:
 	kfree(ring);
@@ -174,7 +174,7 @@ void mlx4_en_destroy_tx_ring(struct mlx4_en_priv *priv,
 	mlx4_free_hwq_res(mdev->dev, &ring->wqres, ring->buf_size);
 	kfree(ring->bounce_buf);
 	ring->bounce_buf = NULL;
-	vfree(ring->tx_info);
+	kvfree(ring->tx_info);
 	ring->tx_info = NULL;
 	kfree(ring);
 	*pring = NULL;
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next V1 02/14] net/mlx4_en: Align tx path structures to cache lines
From: Amir Vadai @ 2014-10-06  6:15 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet
  Cc: netdev, Yevgeny Petrilin, Or Gerlitz, Ido Shamay, Amir Vadai
In-Reply-To: <1412576163-7224-1-git-send-email-amirv@mellanox.com>

From: Eric Dumazet <edumazet@google.com>

Reorganize struct mlx4_en_tx_ring to have:
- One cache line containing last_nr_txbb & cons & wake_queue, used by tx
  completion.
- One cache line containing fields dirtied by mlx4_en_xmit()
- Following part is read mostly and shared by cpus.

Align struct mlx4_en_tx_info to a cache line

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h | 86 +++++++++++++++-------------
 1 file changed, 46 insertions(+), 40 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index e54b653..b7bde95 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -216,13 +216,13 @@ enum cq_type {
 
 struct mlx4_en_tx_info {
 	struct sk_buff *skb;
-	u32 nr_txbb;
-	u32 nr_bytes;
-	u8 linear;
-	u8 data_offset;
-	u8 inl;
-	u8 ts_requested;
-};
+	u32		nr_txbb;
+	u32		nr_bytes;
+	u8		linear;
+	u8		data_offset;
+	u8		inl;
+	u8		ts_requested;
+} ____cacheline_aligned_in_smp;
 
 
 #define MLX4_EN_BIT_DESC_OWN	0x80000000
@@ -253,40 +253,46 @@ struct mlx4_en_rx_alloc {
 };
 
 struct mlx4_en_tx_ring {
+	/* cache line used and dirtied in tx completion
+	 * (mlx4_en_free_tx_buf())
+	 */
+	u32			last_nr_txbb;
+	u32			cons;
+	unsigned long		wake_queue;
+
+	/* cache line used and dirtied in mlx4_en_xmit() */
+	u32			prod ____cacheline_aligned_in_smp;
+	unsigned long		bytes;
+	unsigned long		packets;
+	unsigned long		tx_csum;
+	unsigned long		tso_packets;
+	unsigned long		xmit_more;
+	struct mlx4_bf		bf;
+	unsigned long		queue_stopped;
+
+	/* Following part should be mostly read */
+	cpumask_t		affinity_mask;
+	struct mlx4_qp		qp;
 	struct mlx4_hwq_resources wqres;
-	u32 size ; /* number of TXBBs */
-	u32 size_mask;
-	u16 stride;
-	u16 cqn;	/* index of port CQ associated with this ring */
-	u32 prod;
-	u32 cons;
-	u32 buf_size;
-	u32 doorbell_qpn;
-	void *buf;
-	struct mlx4_en_tx_info *tx_info;
-	u8 *bounce_buf;
-	u8 queue_index;
-	cpumask_t affinity_mask;
-	u32 last_nr_txbb;
-	struct mlx4_qp qp;
-	struct mlx4_qp_context context;
-	int qpn;
-	enum mlx4_qp_state qp_state;
-	struct mlx4_srq dummy;
-	unsigned long bytes;
-	unsigned long packets;
-	unsigned long tx_csum;
-	unsigned long queue_stopped;
-	unsigned long wake_queue;
-	unsigned long tso_packets;
-	unsigned long xmit_more;
-	struct mlx4_bf bf;
-	bool bf_enabled;
-	bool bf_alloced;
-	struct netdev_queue *tx_queue;
-	int hwtstamp_tx_type;
-	int inline_thold;
-};
+	u32			size; /* number of TXBBs */
+	u32			size_mask;
+	u16			stride;
+	u16			cqn;	/* index of port CQ associated with this ring */
+	u32			buf_size;
+	u32			doorbell_qpn;
+	void			*buf;
+	struct mlx4_en_tx_info	*tx_info;
+	u8			*bounce_buf;
+	struct mlx4_qp_context	context;
+	int			qpn;
+	enum mlx4_qp_state	qp_state;
+	u8			queue_index;
+	bool			bf_enabled;
+	bool			bf_alloced;
+	struct netdev_queue	*tx_queue;
+	int			hwtstamp_tx_type;
+	int			inline_thold;
+} ____cacheline_aligned_in_smp;
 
 struct mlx4_en_rx_desc {
 	/* actual number of entries depends on rx ring stride */
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next V1 12/14] net/mlx4_en: Enable the compiler to make is_inline() inlined
From: Amir Vadai @ 2014-10-06  6:16 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet
  Cc: netdev, Yevgeny Petrilin, Or Gerlitz, Ido Shamay, Amir Vadai
In-Reply-To: <1412576163-7224-1-git-send-email-amirv@mellanox.com>

From: Eric Dumazet <edumazet@google.com>

Reorganize code to call is_inline() once, so compiler can inline it

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_tx.c | 67 +++++++++++++++++-------------
 1 file changed, 38 insertions(+), 29 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index 2c03b55..f0080c5 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -531,29 +531,32 @@ static struct mlx4_en_tx_desc *mlx4_en_bounce_to_desc(struct mlx4_en_priv *priv,
 	return ring->buf + index * TXBB_SIZE;
 }
 
+/* Decide if skb can be inlined in tx descriptor to avoid dma mapping
+ *
+ * It seems strange we do not simply use skb_copy_bits().
+ * This would allow to inline all skbs iff skb->len <= inline_thold
+ *
+ * Note that caller already checked skb was not a gso packet
+ */
 static bool is_inline(int inline_thold, const struct sk_buff *skb,
 		      const struct skb_shared_info *shinfo,
 		      void **pfrag)
 {
 	void *ptr;
 
-	if (inline_thold && !skb_is_gso(skb) && skb->len <= inline_thold) {
-		if (shinfo->nr_frags == 1) {
-			ptr = skb_frag_address_safe(&shinfo->frags[0]);
-			if (unlikely(!ptr))
-				return 0;
-
-			if (pfrag)
-				*pfrag = ptr;
+	if (skb->len > inline_thold || !inline_thold)
+		return false;
 
-			return 1;
-		} else if (unlikely(shinfo->nr_frags))
-			return 0;
-		else
-			return 1;
+	if (shinfo->nr_frags == 1) {
+		ptr = skb_frag_address_safe(&shinfo->frags[0]);
+		if (unlikely(!ptr))
+			return false;
+		*pfrag = ptr;
+		return true;
 	}
-
-	return 0;
+	if (shinfo->nr_frags)
+		return false;
+	return true;
 }
 
 static int inline_size(const struct sk_buff *skb)
@@ -570,12 +573,15 @@ static int inline_size(const struct sk_buff *skb)
 static int get_real_size(const struct sk_buff *skb,
 			 const struct skb_shared_info *shinfo,
 			 struct net_device *dev,
-			 int *lso_header_size)
+			 int *lso_header_size,
+			 bool *inline_ok,
+			 void **pfrag)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
 	int real_size;
 
 	if (shinfo->gso_size) {
+		*inline_ok = false;
 		if (skb->encapsulation)
 			*lso_header_size = (skb_inner_transport_header(skb) - skb->data) + inner_tcp_hdrlen(skb);
 		else
@@ -595,10 +601,14 @@ static int get_real_size(const struct sk_buff *skb,
 		}
 	} else {
 		*lso_header_size = 0;
-		if (!is_inline(priv->prof->inline_thold, skb, shinfo, NULL))
-			real_size = CTRL_SIZE + (shinfo->nr_frags + 1) * DS_SIZE;
-		else
+		*inline_ok = is_inline(priv->prof->inline_thold, skb,
+				       shinfo, pfrag);
+
+		if (*inline_ok)
 			real_size = inline_size(skb);
+		else
+			real_size = CTRL_SIZE +
+				    (shinfo->nr_frags + 1) * DS_SIZE;
 	}
 
 	return real_size;
@@ -694,9 +704,10 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 	u16 vlan_tag = 0;
 	int i_frag;
 	int lso_header_size;
-	void *fragptr;
+	void *fragptr = NULL;
 	bool bounce = false;
 	bool send_doorbell;
+	bool inline_ok;
 	u32 ring_cons;
 
 	if (!priv->port_up)
@@ -708,7 +719,8 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 	/* fetch ring->cons far ahead before needing it to avoid stall */
 	ring_cons = ACCESS_ONCE(ring->cons);
 
-	real_size = get_real_size(skb, shinfo, dev, &lso_header_size);
+	real_size = get_real_size(skb, shinfo, dev, &lso_header_size,
+				  &inline_ok, &fragptr);
 	if (unlikely(!real_size))
 		goto tx_drop;
 
@@ -781,15 +793,15 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 	/* valid only for none inline segments */
 	tx_info->data_offset = (void *)data - (void *)tx_desc;
 
+	tx_info->inl = inline_ok;
+
 	tx_info->linear = (lso_header_size < skb_headlen(skb) &&
-			   !is_inline(ring->inline_thold, skb, shinfo, NULL)) ? 1 : 0;
+			   !inline_ok) ? 1 : 0;
 
 	tx_info->nr_maps = shinfo->nr_frags + tx_info->linear;
 	data += tx_info->nr_maps - 1;
 
-	if (is_inline(ring->inline_thold, skb, shinfo, &fragptr)) {
-		tx_info->inl = 1;
-	} else {
+	if (!tx_info->inl) {
 		dma_addr_t dma = 0;
 		u32 byte_count = 0;
 
@@ -827,7 +839,6 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 			wmb();
 			data->byte_count = cpu_to_be32(byte_count);
 		}
-		tx_info->inl = 0;
 		/* tx completion can avoid cache line miss for common cases */
 		tx_info->map0_dma = dma;
 		tx_info->map0_byte_count = byte_count;
@@ -899,11 +910,9 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 	netdev_tx_sent_queue(ring->tx_queue, tx_info->nr_bytes);
 	AVG_PERF_COUNTER(priv->pstats.tx_pktsz_avg, skb->len);
 
-	if (tx_info->inl) {
+	if (tx_info->inl)
 		build_inline_wqe(tx_desc, skb, shinfo, real_size, &vlan_tag,
 				 tx_ind, fragptr);
-		tx_info->inl = 1;
-	}
 
 	if (skb->encapsulation) {
 		struct iphdr *ipv4 = (struct iphdr *)skb_inner_network_header(skb);
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next V1 10/14] net/mlx4_en: Use local var for skb_headlen(skb)
From: Amir Vadai @ 2014-10-06  6:15 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet
  Cc: netdev, Yevgeny Petrilin, Or Gerlitz, Ido Shamay, Amir Vadai
In-Reply-To: <1412576163-7224-1-git-send-email-amirv@mellanox.com>

From: Eric Dumazet <edumazet@google.com>

Access skb_headlen() once in tx flow

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_tx.c | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index aa05b09..e00841a 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -612,6 +612,7 @@ static void build_inline_wqe(struct mlx4_en_tx_desc *tx_desc,
 {
 	struct mlx4_wqe_inline_seg *inl = &tx_desc->inl;
 	int spc = MLX4_INLINE_ALIGN - CTRL_SIZE - sizeof *inl;
+	unsigned int hlen = skb_headlen(skb);
 
 	if (skb->len <= spc) {
 		if (likely(skb->len >= MIN_PKT_LEN)) {
@@ -621,19 +622,19 @@ static void build_inline_wqe(struct mlx4_en_tx_desc *tx_desc,
 			memset(((void *)(inl + 1)) + skb->len, 0,
 			       MIN_PKT_LEN - skb->len);
 		}
-		skb_copy_from_linear_data(skb, inl + 1, skb_headlen(skb));
+		skb_copy_from_linear_data(skb, inl + 1, hlen);
 		if (shinfo->nr_frags)
-			memcpy(((void *)(inl + 1)) + skb_headlen(skb), fragptr,
+			memcpy(((void *)(inl + 1)) + hlen, fragptr,
 			       skb_frag_size(&shinfo->frags[0]));
 
 	} else {
 		inl->byte_count = cpu_to_be32(1 << 31 | spc);
-		if (skb_headlen(skb) <= spc) {
-			skb_copy_from_linear_data(skb, inl + 1, skb_headlen(skb));
-			if (skb_headlen(skb) < spc) {
-				memcpy(((void *)(inl + 1)) + skb_headlen(skb),
-					fragptr, spc - skb_headlen(skb));
-				fragptr +=  spc - skb_headlen(skb);
+		if (hlen <= spc) {
+			skb_copy_from_linear_data(skb, inl + 1, hlen);
+			if (hlen < spc) {
+				memcpy(((void *)(inl + 1)) + hlen,
+				       fragptr, spc - hlen);
+				fragptr +=  spc - hlen;
 			}
 			inl = (void *) (inl + 1) + spc;
 			memcpy(((void *)(inl + 1)), fragptr, skb->len - spc);
@@ -641,9 +642,9 @@ static void build_inline_wqe(struct mlx4_en_tx_desc *tx_desc,
 			skb_copy_from_linear_data(skb, inl + 1, spc);
 			inl = (void *) (inl + 1) + spc;
 			skb_copy_from_linear_data_offset(skb, spc, inl + 1,
-					skb_headlen(skb) - spc);
+							 hlen - spc);
 			if (shinfo->nr_frags)
-				memcpy(((void *)(inl + 1)) + skb_headlen(skb) - spc,
+				memcpy(((void *)(inl + 1)) + hlen - spc,
 				       fragptr,
 				       skb_frag_size(&shinfo->frags[0]));
 		}
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next V1 08/14] net/mlx4_en: mlx4_en_xmit() reads ring->cons once, and ahead of time to avoid stalls
From: Amir Vadai @ 2014-10-06  6:15 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet
  Cc: netdev, Yevgeny Petrilin, Or Gerlitz, Ido Shamay, Amir Vadai
In-Reply-To: <1412576163-7224-1-git-send-email-amirv@mellanox.com>

From: Eric Dumazet <edumazet@google.com>

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_tx.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index 63e1f24..4b018ce 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -691,10 +691,17 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 	void *fragptr;
 	bool bounce = false;
 	bool send_doorbell;
+	u32 ring_cons;
 
 	if (!priv->port_up)
 		goto tx_drop;
 
+	tx_ind = skb_get_queue_mapping(skb);
+	ring = priv->tx_ring[tx_ind];
+
+	/* fetch ring->cons far ahead before needing it to avoid stall */
+	ring_cons = ACCESS_ONCE(ring->cons);
+
 	real_size = get_real_size(skb, dev, &lso_header_size);
 	if (unlikely(!real_size))
 		goto tx_drop;
@@ -708,13 +715,11 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 		goto tx_drop;
 	}
 
-	tx_ind = skb->queue_mapping;
-	ring = priv->tx_ring[tx_ind];
 	if (vlan_tx_tag_present(skb))
 		vlan_tag = vlan_tx_tag_get(skb);
 
 	/* Check available TXBBs And 2K spare for prefetch */
-	if (unlikely(((int)(ring->prod - ring->cons)) >
+	if (unlikely(((int)(ring->prod - ring_cons)) >
 		     ring->size - HEADROOM - MAX_DESC_TXBBS)) {
 		/* every full Tx ring stops queue */
 		netif_tx_stop_queue(ring->tx_queue);
@@ -728,7 +733,8 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 		 */
 		wmb();
 
-		if (unlikely(((int)(ring->prod - ring->cons)) <=
+		ring_cons = ACCESS_ONCE(ring->cons);
+		if (unlikely(((int)(ring->prod - ring_cons)) <=
 			     ring->size - HEADROOM - MAX_DESC_TXBBS)) {
 			netif_tx_wake_queue(ring->tx_queue);
 			ring->wake_queue++;
@@ -741,7 +747,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 
 	/* Track current inflight packets for performance analysis */
 	AVG_PERF_COUNTER(priv->pstats.inflight_avg,
-			 (u32) (ring->prod - ring->cons - 1));
+			 (u32)(ring->prod - ring_cons - 1));
 
 	/* Packet is good - grab an index and transmit it */
 	index = ring->prod & ring->size_mask;
-- 
1.8.3.1

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox