Netdev List
* Re: [PATCH v2 net-next 1/1] net: fec: fix build error at m68k platform
From: David Miller @ 2014-10-06  4:22 UTC (permalink / raw)
  To: Frank.Li; +Cc: lznuaa, netdev, b38611
In-Reply-To: <1412371754-60384-1-git-send-email-Frank.Li@freescale.com>

From: Frank Li <Frank.Li@freescale.com>
Date: Fri, 3 Oct 2014 14:29:14 -0700

> reproduce:
>   wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
>   chmod +x ~/bin/make.cross
>   git checkout 1b7bde6d659d30f171259cc2dfba8e5dab34e735
> 
>   make.cross ARCH=m68k m5275evb_defconfig
>   make.cross ARCH=m68k
> 
> All error/warnings:
> 
>    drivers/net/ethernet/freescale/fec_main.c: In function 'fec_enet_rx_queue':
>>> drivers/net/ethernet/freescale/fec_main.c:1470:3: error: implicit declaration of function 'prefetch' [-Werror=implicit-function-declaration]
>       prefetch(skb->data - NET_IP_ALIGN);
>       ^
>    cc1: some warnings being treated as errors
> 
> The include of prefetch.h was missing.
> 
> Reported-by: kbuild test robot <fengguang.wu@intel.com>
> Signed-off-by: Frank Li <Frank.Li@freescale.com>

Applied, thanks.
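
The error quoted above means that no declaration of prefetch() was in scope at its first use. A minimal userspace sketch of the same shape follows, assuming GCC or Clang: `prefetch_hint` and `sum_bytes` are hypothetical names, and `__builtin_prefetch` only stands in for the kernel helper. It is not the driver code.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's prefetch(); assumes GCC or Clang.
 * It is declared before its first use, which is what the missing
 * header provides in the kernel build. */
static inline void prefetch_hint(const void *p)
{
	__builtin_prefetch(p);
}

/* Sum a buffer while hinting the cache one 64-byte line ahead. */
static unsigned long sum_bytes(const unsigned char *buf, size_t len)
{
	unsigned long sum = 0;
	size_t i;

	assert(buf != NULL || len == 0);
	for (i = 0; i < len; i++) {
		if (i % 64 == 0 && i + 64 < len)
			prefetch_hint(buf + i + 64);
		sum += buf[i];
	}
	return sum;
}
```

The prefetch is only a hint, so the result is the same with or without it.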


* Re: [PATCH] sctp: handle association restarts when the socket is closed.
From: David Miller @ 2014-10-06  4:22 UTC (permalink / raw)
  To: vyasevich; +Cc: netdev
In-Reply-To: <1412374580-22286-1-git-send-email-vyasevich@gmail.com>

From: Vladislav Yasevich <vyasevich@gmail.com>
Date: Fri,  3 Oct 2014 18:16:20 -0400

> From: Vlad Yasevich <vyasevich@gmail.com>
> 
> Currently association restarts do not take into consideration the
> state of the socket.  When a restart happens, the current association
> simply transitions into established state.  This creates a condition
> where a remote system, through the restart procedure, may create a
> local association that is in no way reachable by the user.  The conditions
> to trigger this are as follows:
>   1) Remote does not acknowledge some data causing data to remain
>      outstanding.
>   2) Local application calls close() on the socket.  Since data
>      is still outstanding, the association is placed in SHUTDOWN_PENDING
>      state.  However, the socket is closed.
>   3) The remote tries to create a new association, triggering a restart
>      on the local system.  The association moves from SHUTDOWN_PENDING
>      to ESTABLISHED.  At this point, it is no longer reachable by
>      any socket on the local system.
> 
> This patch addresses the above situation by moving the newly ESTABLISHED
> association into SHUTDOWN-SENT state and bundling a SHUTDOWN after
> the COOKIE-ACK chunk.  This way, the restarted association immediately
> enters the shutdown procedure and forces the termination of the
> unreachable association.
> 
> Reported-by: David Laight <David.Laight@aculab.com>
> Signed-off-by: Vlad Yasevich <vyasevich@gmail.com>

Applied, thanks.

Candidate for -stable?
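
The decision described in the quoted patch can be sketched in plain C. This is a simplified illustration, not the SCTP state machine: the enum, `struct restart_action` and `handle_restart()` are hypothetical names, and only the three states mentioned in the description are modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified subset of SCTP association states. */
enum assoc_state {
	ASSOC_ESTABLISHED,
	ASSOC_SHUTDOWN_PENDING,
	ASSOC_SHUTDOWN_SENT,
};

/* What to do once a restart has been accepted. */
struct restart_action {
	enum assoc_state next_state;
	bool bundle_shutdown;	/* send SHUTDOWN right after COOKIE-ACK */
};

/*
 * Decide how a restarted association continues. If the local socket was
 * already closed, nothing can reach the association any more, so it goes
 * straight into the shutdown procedure instead of staying ESTABLISHED.
 */
static struct restart_action handle_restart(enum assoc_state cur,
					    bool socket_closed)
{
	struct restart_action act;

	/* A closed socket with data outstanding sits in SHUTDOWN_PENDING. */
	assert(!socket_closed || cur == ASSOC_SHUTDOWN_PENDING);

	if (socket_closed) {
		act.next_state = ASSOC_SHUTDOWN_SENT;
		act.bundle_shutdown = true;
	} else {
		act.next_state = ASSOC_ESTABLISHED;
		act.bundle_shutdown = false;
	}
	return act;
}
```

With the socket still open the restart behaves as before; only the closed-socket case changes.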


* [net-next PATCH v1 0/3] net sched rcu updates
From: John Fastabend @ 2014-10-06  4:27 UTC (permalink / raw)
  To: xiyou.wangcong, davem; +Cc: netdev, jhs, eric.dumazet

This fixes the use of tcf_proto from RCU callbacks. It requires
moving the unbind calls out of the callbacks and removing the
tcf_proto argument from tcf_em_tree_destroy().

This is a rework of two previous series and addresses comments
from Cong. It should apply against the latest net-next.

Links to the previous series are below for reference:

(1/2) net: sched: do not use tcf_proto 'tp' argument from call_rcu
http://patchwork.ozlabs.org/patch/396149/ 

(2/2) net: sched: replace ematch calls to use struct net
http://patchwork.ozlabs.org/patch/396150/


net: sched: cls_cgroup tear down exts and ematch from rcu callback
http://patchwork.ozlabs.org/patch/396307/

---

John Fastabend (3):
      net: sched: remove tcf_proto from ematch calls
      net: sched: cls_cgroup tear down exts and ematch from rcu callback
      net: sched: do not use tcf_proto 'tp' argument from call_rcu


 include/net/pkt_cls.h  |   10 +++++-----
 net/sched/cls_basic.c  |    7 ++++---
 net/sched/cls_bpf.c    |    4 +++-
 net/sched/cls_cgroup.c |    6 ++----
 net/sched/cls_flow.c   |    4 ++--
 net/sched/cls_fw.c     |    5 +++--
 net/sched/cls_route.c  |    8 +++++---
 net/sched/em_canid.c   |    4 ++--
 net/sched/em_ipset.c   |    7 +++----
 net/sched/em_meta.c    |    4 ++--
 net/sched/em_nbyte.c   |    2 +-
 net/sched/em_text.c    |    4 ++--
 net/sched/ematch.c     |   10 ++++++----
 13 files changed, 40 insertions(+), 35 deletions(-)

-- 
Signature


* [net-next PATCH v1 1/3] net: sched: remove tcf_proto from ematch calls
From: John Fastabend @ 2014-10-06  4:27 UTC (permalink / raw)
  To: xiyou.wangcong, davem; +Cc: netdev, jhs, eric.dumazet
In-Reply-To: <20141006042335.6010.27000.stgit@nitbit.x32>

This removes the tcf_proto argument from the ematch code paths that
only need it to reference the net namespace. This allows simplifying
qdisc code paths especially when we need to tear down the ematch
from an RCU callback. In this case we cannot guarantee that the
tcf_proto structure is still valid.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 include/net/pkt_cls.h  |   10 +++++-----
 net/sched/cls_basic.c  |    2 +-
 net/sched/cls_cgroup.c |    4 ++--
 net/sched/cls_flow.c   |    4 ++--
 net/sched/em_canid.c   |    4 ++--
 net/sched/em_ipset.c   |    7 +++----
 net/sched/em_meta.c    |    4 ++--
 net/sched/em_nbyte.c   |    2 +-
 net/sched/em_text.c    |    4 ++--
 net/sched/ematch.c     |   10 ++++++----
 10 files changed, 26 insertions(+), 25 deletions(-)

diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index ef44ad9..bc49967 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -166,6 +166,7 @@ struct tcf_ematch {
 	unsigned int		datalen;
 	u16			matchid;
 	u16			flags;
+	struct net		*net;
 };
 
 static inline int tcf_em_is_container(struct tcf_ematch *em)
@@ -229,12 +230,11 @@ struct tcf_ematch_tree {
 struct tcf_ematch_ops {
 	int			kind;
 	int			datalen;
-	int			(*change)(struct tcf_proto *, void *,
+	int			(*change)(struct net *net, void *,
 					  int, struct tcf_ematch *);
 	int			(*match)(struct sk_buff *, struct tcf_ematch *,
 					 struct tcf_pkt_info *);
-	void			(*destroy)(struct tcf_proto *,
-					   struct tcf_ematch *);
+	void			(*destroy)(struct tcf_ematch *);
 	int			(*dump)(struct sk_buff *, struct tcf_ematch *);
 	struct module		*owner;
 	struct list_head	link;
@@ -244,7 +244,7 @@ int tcf_em_register(struct tcf_ematch_ops *);
 void tcf_em_unregister(struct tcf_ematch_ops *);
 int tcf_em_tree_validate(struct tcf_proto *, struct nlattr *,
 			 struct tcf_ematch_tree *);
-void tcf_em_tree_destroy(struct tcf_proto *, struct tcf_ematch_tree *);
+void tcf_em_tree_destroy(struct tcf_ematch_tree *);
 int tcf_em_tree_dump(struct sk_buff *, struct tcf_ematch_tree *, int);
 int __tcf_em_tree_match(struct sk_buff *, struct tcf_ematch_tree *,
 			struct tcf_pkt_info *);
@@ -301,7 +301,7 @@ struct tcf_ematch_tree {
 };
 
 #define tcf_em_tree_validate(tp, tb, t) ((void)(t), 0)
-#define tcf_em_tree_destroy(tp, t) do { (void)(t); } while(0)
+#define tcf_em_tree_destroy(t) do { (void)(t); } while(0)
 #define tcf_em_tree_dump(skb, t, tlv) (0)
 #define tcf_em_tree_change(tp, dst, src) do { } while(0)
 #define tcf_em_tree_match(skb, t, info) ((void)(info), 1)
diff --git a/net/sched/cls_basic.c b/net/sched/cls_basic.c
index fe20826..90647a8 100644
--- a/net/sched/cls_basic.c
+++ b/net/sched/cls_basic.c
@@ -95,7 +95,7 @@ static void basic_delete_filter(struct rcu_head *head)
 
 	tcf_unbind_filter(tp, &f->res);
 	tcf_exts_destroy(&f->exts);
-	tcf_em_tree_destroy(tp, &f->ematches);
+	tcf_em_tree_destroy(&f->ematches);
 	kfree(f);
 }
 
diff --git a/net/sched/cls_cgroup.c b/net/sched/cls_cgroup.c
index 3409f16..2f77a89 100644
--- a/net/sched/cls_cgroup.c
+++ b/net/sched/cls_cgroup.c
@@ -87,7 +87,7 @@ static void cls_cgroup_destroy_rcu(struct rcu_head *root)
 						    rcu);
 
 	tcf_exts_destroy(&head->exts);
-	tcf_em_tree_destroy(head->tp, &head->ematches);
+	tcf_em_tree_destroy(&head->ematches);
 	kfree(head);
 }
 
@@ -157,7 +157,7 @@ static void cls_cgroup_destroy(struct tcf_proto *tp)
 
 	if (head) {
 		tcf_exts_destroy(&head->exts);
-		tcf_em_tree_destroy(tp, &head->ematches);
+		tcf_em_tree_destroy(&head->ematches);
 		RCU_INIT_POINTER(tp->root, NULL);
 		kfree_rcu(head, rcu);
 	}
diff --git a/net/sched/cls_flow.c b/net/sched/cls_flow.c
index f18d27f7..a5d2b20 100644
--- a/net/sched/cls_flow.c
+++ b/net/sched/cls_flow.c
@@ -355,7 +355,7 @@ static void flow_destroy_filter(struct rcu_head *head)
 
 	del_timer_sync(&f->perturb_timer);
 	tcf_exts_destroy(&f->exts);
-	tcf_em_tree_destroy(f->tp, &f->ematches);
+	tcf_em_tree_destroy(&f->ematches);
 	kfree(f);
 }
 
@@ -530,7 +530,7 @@ static int flow_change(struct net *net, struct sk_buff *in_skb,
 	return 0;
 
 err2:
-	tcf_em_tree_destroy(tp, &t);
+	tcf_em_tree_destroy(&t);
 	kfree(fnew);
 err1:
 	tcf_exts_destroy(&e);
diff --git a/net/sched/em_canid.c b/net/sched/em_canid.c
index 7c292d4..ddd883c 100644
--- a/net/sched/em_canid.c
+++ b/net/sched/em_canid.c
@@ -120,7 +120,7 @@ static int em_canid_match(struct sk_buff *skb, struct tcf_ematch *m,
 	return match;
 }
 
-static int em_canid_change(struct tcf_proto *tp, void *data, int len,
+static int em_canid_change(struct net *net, void *data, int len,
 			  struct tcf_ematch *m)
 {
 	struct can_filter *conf = data; /* Array with rules */
@@ -183,7 +183,7 @@ static int em_canid_change(struct tcf_proto *tp, void *data, int len,
 	return 0;
 }
 
-static void em_canid_destroy(struct tcf_proto *tp, struct tcf_ematch *m)
+static void em_canid_destroy(struct tcf_ematch *m)
 {
 	struct canid_match *cm = em_canid_priv(m);
 
diff --git a/net/sched/em_ipset.c b/net/sched/em_ipset.c
index 527aeb7..5b4a4ef 100644
--- a/net/sched/em_ipset.c
+++ b/net/sched/em_ipset.c
@@ -19,12 +19,11 @@
 #include <net/ip.h>
 #include <net/pkt_cls.h>
 
-static int em_ipset_change(struct tcf_proto *tp, void *data, int data_len,
+static int em_ipset_change(struct net *net, void *data, int data_len,
 			   struct tcf_ematch *em)
 {
 	struct xt_set_info *set = data;
 	ip_set_id_t index;
-	struct net *net = dev_net(qdisc_dev(tp->q));
 
 	if (data_len != sizeof(*set))
 		return -EINVAL;
@@ -42,11 +41,11 @@ static int em_ipset_change(struct tcf_proto *tp, void *data, int data_len,
 	return -ENOMEM;
 }
 
-static void em_ipset_destroy(struct tcf_proto *p, struct tcf_ematch *em)
+static void em_ipset_destroy(struct tcf_ematch *em)
 {
 	const struct xt_set_info *set = (const void *) em->data;
 	if (set) {
-		ip_set_nfnl_put(dev_net(qdisc_dev(p->q)), set->index);
+		ip_set_nfnl_put(em->net, set->index);
 		kfree((void *) em->data);
 	}
 }
diff --git a/net/sched/em_meta.c b/net/sched/em_meta.c
index 9b8c0b0..c8f8c39 100644
--- a/net/sched/em_meta.c
+++ b/net/sched/em_meta.c
@@ -856,7 +856,7 @@ static const struct nla_policy meta_policy[TCA_EM_META_MAX + 1] = {
 	[TCA_EM_META_HDR]	= { .len = sizeof(struct tcf_meta_hdr) },
 };
 
-static int em_meta_change(struct tcf_proto *tp, void *data, int len,
+static int em_meta_change(struct net *net, void *data, int len,
 			  struct tcf_ematch *m)
 {
 	int err;
@@ -908,7 +908,7 @@ errout:
 	return err;
 }
 
-static void em_meta_destroy(struct tcf_proto *tp, struct tcf_ematch *m)
+static void em_meta_destroy(struct tcf_ematch *m)
 {
 	if (m)
 		meta_delete((struct meta_match *) m->data);
diff --git a/net/sched/em_nbyte.c b/net/sched/em_nbyte.c
index a3bed07..df3110d 100644
--- a/net/sched/em_nbyte.c
+++ b/net/sched/em_nbyte.c
@@ -23,7 +23,7 @@ struct nbyte_data {
 	char			pattern[0];
 };
 
-static int em_nbyte_change(struct tcf_proto *tp, void *data, int data_len,
+static int em_nbyte_change(struct net *net, void *data, int data_len,
 			   struct tcf_ematch *em)
 {
 	struct tcf_em_nbyte *nbyte = data;
diff --git a/net/sched/em_text.c b/net/sched/em_text.c
index 15d353d..f03c3de 100644
--- a/net/sched/em_text.c
+++ b/net/sched/em_text.c
@@ -45,7 +45,7 @@ static int em_text_match(struct sk_buff *skb, struct tcf_ematch *m,
 	return skb_find_text(skb, from, to, tm->config, &state) != UINT_MAX;
 }
 
-static int em_text_change(struct tcf_proto *tp, void *data, int len,
+static int em_text_change(struct net *net, void *data, int len,
 			  struct tcf_ematch *m)
 {
 	struct text_match *tm;
@@ -100,7 +100,7 @@ retry:
 	return 0;
 }
 
-static void em_text_destroy(struct tcf_proto *tp, struct tcf_ematch *m)
+static void em_text_destroy(struct tcf_ematch *m)
 {
 	if (EM_TEXT_PRIV(m) && EM_TEXT_PRIV(m)->config)
 		textsearch_destroy(EM_TEXT_PRIV(m)->config);
diff --git a/net/sched/ematch.c b/net/sched/ematch.c
index ad57f44..8250c36 100644
--- a/net/sched/ematch.c
+++ b/net/sched/ematch.c
@@ -178,6 +178,7 @@ static int tcf_em_validate(struct tcf_proto *tp,
 	struct tcf_ematch_hdr *em_hdr = nla_data(nla);
 	int data_len = nla_len(nla) - sizeof(*em_hdr);
 	void *data = (void *) em_hdr + sizeof(*em_hdr);
+	struct net *net = dev_net(qdisc_dev(tp->q));
 
 	if (!TCF_EM_REL_VALID(em_hdr->flags))
 		goto errout;
@@ -240,7 +241,7 @@ static int tcf_em_validate(struct tcf_proto *tp,
 			goto errout;
 
 		if (em->ops->change) {
-			err = em->ops->change(tp, data, data_len, em);
+			err = em->ops->change(net, data, data_len, em);
 			if (err < 0)
 				goto errout;
 		} else if (data_len > 0) {
@@ -271,6 +272,7 @@ static int tcf_em_validate(struct tcf_proto *tp,
 	em->matchid = em_hdr->matchid;
 	em->flags = em_hdr->flags;
 	em->datalen = data_len;
+	em->net = net;
 
 	err = 0;
 errout:
@@ -378,7 +380,7 @@ errout:
 	return err;
 
 errout_abort:
-	tcf_em_tree_destroy(tp, tree);
+	tcf_em_tree_destroy(tree);
 	return err;
 }
 EXPORT_SYMBOL(tcf_em_tree_validate);
@@ -393,7 +395,7 @@ EXPORT_SYMBOL(tcf_em_tree_validate);
  * tcf_em_tree_validate()/tcf_em_tree_change(). You must ensure that
  * the ematch tree is not in use before calling this function.
  */
-void tcf_em_tree_destroy(struct tcf_proto *tp, struct tcf_ematch_tree *tree)
+void tcf_em_tree_destroy(struct tcf_ematch_tree *tree)
 {
 	int i;
 
@@ -405,7 +407,7 @@ void tcf_em_tree_destroy(struct tcf_proto *tp, struct tcf_ematch_tree *tree)
 
 		if (em->ops) {
 			if (em->ops->destroy)
-				em->ops->destroy(tp, em);
+				em->ops->destroy(em);
 			else if (!tcf_em_is_simple(em))
 				kfree((void *) em->data);
 			module_put(em->ops->owner);
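
The idea behind the patch above, caching the namespace pointer in the object at setup so that teardown no longer needs the owner, can be sketched in userspace C. All names here are hypothetical stand-ins (`struct owner` plays the role of tcf_proto, `struct match` of tcf_ematch), and the reference counter is an assumption added only to make the effect observable.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct net, tcf_proto and tcf_ematch. */
struct ns { int refs; };
struct owner { struct ns *ns; };	/* plays the role of tcf_proto */
struct match {
	struct ns *ns;			/* cached at setup, like em->net */
	int index;
};

/* Setup still has the owner at hand, so resolve the namespace once here. */
static void match_setup(struct match *m, const struct owner *o, int index)
{
	assert(o != NULL && o->ns != NULL);
	m->ns = o->ns;
	m->index = index;
	m->ns->refs++;
}

/* Teardown needs only the match itself; the owner may be gone by now. */
static void match_destroy(struct match *m)
{
	m->ns->refs--;
	m->ns = NULL;
}
```

Because `match_destroy()` takes no owner argument, it can be called from a deferred context where the owner is no longer guaranteed to exist.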


* [net-next PATCH v1 2/3] net: sched: cls_cgroup tear down exts and ematch from rcu callback
From: John Fastabend @ 2014-10-06  4:28 UTC (permalink / raw)
  To: xiyou.wangcong, davem; +Cc: netdev, jhs, eric.dumazet
In-Reply-To: <20141006042335.6010.27000.stgit@nitbit.x32>

It is not RCU safe to destroy the action chain while there
is a possibility of readers accessing it. Move this code
into the RCU callback, using the same RCU callback that the
code path making a change to head already uses.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 net/sched/cls_cgroup.c |    4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/net/sched/cls_cgroup.c b/net/sched/cls_cgroup.c
index 2f77a89..d61a801 100644
--- a/net/sched/cls_cgroup.c
+++ b/net/sched/cls_cgroup.c
@@ -156,10 +156,8 @@ static void cls_cgroup_destroy(struct tcf_proto *tp)
 	struct cls_cgroup_head *head = rtnl_dereference(tp->root);
 
 	if (head) {
-		tcf_exts_destroy(&head->exts);
-		tcf_em_tree_destroy(&head->ematches);
 		RCU_INIT_POINTER(tp->root, NULL);
-		kfree_rcu(head, rcu);
+		call_rcu(&head->rcu, cls_cgroup_destroy_rcu);
 	}
 }
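
The pattern in this patch, unpublish the pointer first and defer the whole teardown until readers are done, can be sketched in userspace. This is not real RCU: `defer_call()` and `grace_period_elapsed()` are hypothetical stand-ins for call_rcu() and the end of a grace period, and `struct head` is a simplified placeholder.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for rcu_head / call_rcu(); not real RCU. */
struct deferred {
	struct deferred *next;
	void (*func)(struct deferred *);
};

static struct deferred *pending;	/* callbacks waiting for the "grace period" */

static void defer_call(struct deferred *d, void (*func)(struct deferred *))
{
	assert(func != NULL);
	d->func = func;
	d->next = pending;
	pending = d;
}

/* Run once no reader can still hold a pointer to the queued objects. */
static void grace_period_elapsed(void)
{
	while (pending) {
		struct deferred *d = pending;

		pending = d->next;
		d->func(d);
	}
}

struct head {
	int *exts;		/* resource that readers may still be using */
	struct deferred rcu;
};

static int heads_freed;

static void head_destroy_deferred(struct deferred *d)
{
	/* container_of() by hand: rcu is embedded in struct head */
	struct head *h = (struct head *)((char *)d - offsetof(struct head, rcu));

	free(h->exts);
	free(h);
	heads_freed++;
}

/* Unpublish first, then queue the whole teardown instead of doing it inline. */
static void head_unpublish(struct head **root)
{
	struct head *h = *root;

	if (h) {
		*root = NULL;
		defer_call(&h->rcu, head_destroy_deferred);
	}
}
```

Until `grace_period_elapsed()` runs, a reader that fetched the old pointer can still use `exts` safely, which is the property the inline destroy calls did not provide.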
 


* [net-next PATCH v1 3/3] net: sched: do not use tcf_proto 'tp' argument from call_rcu
From: John Fastabend @ 2014-10-06  4:28 UTC (permalink / raw)
  To: xiyou.wangcong, davem; +Cc: netdev, jhs, eric.dumazet
In-Reply-To: <20141006042335.6010.27000.stgit@nitbit.x32>

Using the tcf_proto pointer 'tp' from inside the classifier's callback
is not valid because it may have been cleaned up by another call_rcu()
occurring on another CPU.

'tp' is currently being used by tcf_unbind_filter(). In this patch we
move instances of tcf_unbind_filter() outside of the call_rcu() context.
This is safe to do because any running scheduler will either read the
valid class field or find it zeroed.

All schedulers today do a lookup when the class is 0, using the same
call used by tcf_exts_bind(). So even if a running classifier hits
the null class pointer it will do a lookup and get the same result.
This is particularly fragile at the moment because the only way to
verify this is to audit the schedulers' call sites.

Reported-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 net/sched/cls_basic.c |    5 +++--
 net/sched/cls_bpf.c   |    4 +++-
 net/sched/cls_fw.c    |    5 +++--
 net/sched/cls_route.c |    8 +++++---
 4 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/net/sched/cls_basic.c b/net/sched/cls_basic.c
index 90647a8..cd61280 100644
--- a/net/sched/cls_basic.c
+++ b/net/sched/cls_basic.c
@@ -91,9 +91,7 @@ static int basic_init(struct tcf_proto *tp)
 static void basic_delete_filter(struct rcu_head *head)
 {
 	struct basic_filter *f = container_of(head, struct basic_filter, rcu);
-	struct tcf_proto *tp = f->tp;
 
-	tcf_unbind_filter(tp, &f->res);
 	tcf_exts_destroy(&f->exts);
 	tcf_em_tree_destroy(&f->ematches);
 	kfree(f);
@@ -106,6 +104,7 @@ static void basic_destroy(struct tcf_proto *tp)
 
 	list_for_each_entry_safe(f, n, &head->flist, link) {
 		list_del_rcu(&f->link);
+		tcf_unbind_filter(tp, &f->res);
 		call_rcu(&f->rcu, basic_delete_filter);
 	}
 	RCU_INIT_POINTER(tp->root, NULL);
@@ -120,6 +119,7 @@ static int basic_delete(struct tcf_proto *tp, unsigned long arg)
 	list_for_each_entry(t, &head->flist, link)
 		if (t == f) {
 			list_del_rcu(&t->link);
+			tcf_unbind_filter(tp, &t->res);
 			call_rcu(&t->rcu, basic_delete_filter);
 			return 0;
 		}
@@ -222,6 +222,7 @@ static int basic_change(struct net *net, struct sk_buff *in_skb,
 
 	if (fold) {
 		list_replace_rcu(&fold->link, &fnew->link);
+		tcf_unbind_filter(tp, &fold->res);
 		call_rcu(&fold->rcu, basic_delete_filter);
 	} else {
 		list_add_rcu(&fnew->link, &head->flist);
diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
index 4318d06..eed49d1 100644
--- a/net/sched/cls_bpf.c
+++ b/net/sched/cls_bpf.c
@@ -92,7 +92,6 @@ static int cls_bpf_init(struct tcf_proto *tp)
 
 static void cls_bpf_delete_prog(struct tcf_proto *tp, struct cls_bpf_prog *prog)
 {
-	tcf_unbind_filter(tp, &prog->res);
 	tcf_exts_destroy(&prog->exts);
 
 	bpf_prog_destroy(prog->filter);
@@ -116,6 +115,7 @@ static int cls_bpf_delete(struct tcf_proto *tp, unsigned long arg)
 	list_for_each_entry(prog, &head->plist, link) {
 		if (prog == todel) {
 			list_del_rcu(&prog->link);
+			tcf_unbind_filter(tp, &prog->res);
 			call_rcu(&prog->rcu, __cls_bpf_delete_prog);
 			return 0;
 		}
@@ -131,6 +131,7 @@ static void cls_bpf_destroy(struct tcf_proto *tp)
 
 	list_for_each_entry_safe(prog, tmp, &head->plist, link) {
 		list_del_rcu(&prog->link);
+		tcf_unbind_filter(tp, &prog->res);
 		call_rcu(&prog->rcu, __cls_bpf_delete_prog);
 	}
 
@@ -282,6 +283,7 @@ static int cls_bpf_change(struct net *net, struct sk_buff *in_skb,
 
 	if (oldprog) {
 		list_replace_rcu(&prog->link, &oldprog->link);
+		tcf_unbind_filter(tp, &oldprog->res);
 		call_rcu(&oldprog->rcu, __cls_bpf_delete_prog);
 	} else {
 		list_add_rcu(&prog->link, &head->plist);
diff --git a/net/sched/cls_fw.c b/net/sched/cls_fw.c
index da805ae..dbfdfd1 100644
--- a/net/sched/cls_fw.c
+++ b/net/sched/cls_fw.c
@@ -123,9 +123,7 @@ static int fw_init(struct tcf_proto *tp)
 static void fw_delete_filter(struct rcu_head *head)
 {
 	struct fw_filter *f = container_of(head, struct fw_filter, rcu);
-	struct tcf_proto *tp = f->tp;
 
-	tcf_unbind_filter(tp, &f->res);
 	tcf_exts_destroy(&f->exts);
 	kfree(f);
 }
@@ -143,6 +141,7 @@ static void fw_destroy(struct tcf_proto *tp)
 		while ((f = rtnl_dereference(head->ht[h])) != NULL) {
 			RCU_INIT_POINTER(head->ht[h],
 					 rtnl_dereference(f->next));
+			tcf_unbind_filter(tp, &f->res);
 			call_rcu(&f->rcu, fw_delete_filter);
 		}
 	}
@@ -166,6 +165,7 @@ static int fw_delete(struct tcf_proto *tp, unsigned long arg)
 	     fp = &pfp->next, pfp = rtnl_dereference(*fp)) {
 		if (pfp == f) {
 			RCU_INIT_POINTER(*fp, rtnl_dereference(f->next));
+			tcf_unbind_filter(tp, &f->res);
 			call_rcu(&f->rcu, fw_delete_filter);
 			return 0;
 		}
@@ -280,6 +280,7 @@ static int fw_change(struct net *net, struct sk_buff *in_skb,
 
 		RCU_INIT_POINTER(fnew->next, rtnl_dereference(pfp->next));
 		rcu_assign_pointer(*fp, fnew);
+		tcf_unbind_filter(tp, &f->res);
 		call_rcu(&f->rcu, fw_delete_filter);
 
 		*arg = (unsigned long)fnew;
diff --git a/net/sched/cls_route.c b/net/sched/cls_route.c
index b665aee..6f22baa 100644
--- a/net/sched/cls_route.c
+++ b/net/sched/cls_route.c
@@ -269,9 +269,7 @@ static void
 route4_delete_filter(struct rcu_head *head)
 {
 	struct route4_filter *f = container_of(head, struct route4_filter, rcu);
-	struct tcf_proto *tp = f->tp;
 
-	tcf_unbind_filter(tp, &f->res);
 	tcf_exts_destroy(&f->exts);
 	kfree(f);
 }
@@ -297,6 +295,7 @@ static void route4_destroy(struct tcf_proto *tp)
 
 					next = rtnl_dereference(f->next);
 					RCU_INIT_POINTER(b->ht[h2], next);
+					tcf_unbind_filter(tp, &f->res);
 					call_rcu(&f->rcu, route4_delete_filter);
 				}
 			}
@@ -338,6 +337,7 @@ static int route4_delete(struct tcf_proto *tp, unsigned long arg)
 			route4_reset_fastmap(head);
 
 			/* Delete it */
+			tcf_unbind_filter(tp, &f->res);
 			call_rcu(&f->rcu, route4_delete_filter);
 
 			/* Strip RTNL protected tree */
@@ -545,8 +545,10 @@ static int route4_change(struct net *net, struct sk_buff *in_skb,
 
 	route4_reset_fastmap(head);
 	*arg = (unsigned long)f;
-	if (fold)
+	if (fold) {
+		tcf_unbind_filter(tp, &fold->res);
 		call_rcu(&fold->rcu, route4_delete_filter);
+	}
 	return 0;
 
 errout:
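
The split made by this patch can be sketched as follows: work that needs the owner runs in the removal path, and the deferred callback touches only the filter. The types and functions are hypothetical stand-ins (`struct owner` for tcf_proto, `unbind()` for tcf_unbind_filter()), and the `alive` flag exists only so the sketch can check that the owner is never used late.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; 'owner' plays the role of tcf_proto. */
struct owner { bool alive; int bound; };
struct filter { unsigned long class_id; bool destroyed; };

/* Needs a live owner, so it must run before the deferred callback. */
static void unbind(struct owner *o, struct filter *f)
{
	assert(o->alive);
	if (f->class_id) {
		o->bound--;
		f->class_id = 0;
	}
}

/* Deferred part: touches only the filter, never the owner. */
static void filter_destroy_cb(struct filter *f)
{
	f->destroyed = true;
}

/*
 * Removal path: owner-dependent work happens here; only the
 * filter-local teardown is handed back to be run later.
 */
static void filter_remove(struct owner *o, struct filter *f,
			  void (**deferred)(struct filter *))
{
	unbind(o, f);
	*deferred = filter_destroy_cb;
}
```

The callback signature makes the constraint visible: it has no way to reach the owner, so it cannot dereference one that has already been freed.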


* Re: [net-next v2 0/6] Add Geneve tunnel protocol support
From: David Miller @ 2014-10-06  4:32 UTC (permalink / raw)
  To: azhou; +Cc: netdev
In-Reply-To: <1412375733-30981-1-git-send-email-azhou@nicira.com>

From: Andy Zhou <azhou@nicira.com>
Date: Fri,  3 Oct 2014 15:35:27 -0700

> This patch series adds kernel support for Geneve (Generic Network
> Virtualization Encapsulation) based on Geneve IETF draft:
> http://www.ietf.org/id/draft-gross-geneve-01.txt
> 
> Patch 1 implements Geneve tunneling protocol driver
> 
> Patch 2-6 adds openvswitch support for creating and using
> Geneve tunnels by OVS user space.
> 
> ---
> v1->v2:   Style fixes: use tab instead space for Kconfig
> 	  Patch 2-6 are reviewed by Pravin Shetty, add him to acked-by
> 	  Patch 6 was reviewed by Thomas Graf when committing
> 	    to openvswitch.org, add him to acked-by.

Series applied, thanks Andy.


* Re: [PATCH net-next] net: skb_segment() provides list head and tail
From: David Miller @ 2014-10-06  4:38 UTC (permalink / raw)
  To: eric.dumazet
  Cc: brouer, netdev, therbert, hannes, fw, dborkman, jhs,
	alexander.duyck, john.r.fastabend
In-Reply-To: <1412395159.17245.33.camel@edumazet-glaptop2.roam.corp.google.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 03 Oct 2014 20:59:19 -0700

> From: Eric Dumazet <edumazet@google.com>
> 
> It's unfortunate we have to walk the skb list again to find the tail
> after segmentation, even if the data is probably hot in CPU caches.
> 
> skb_segment() can store the tail of the list into segs->prev,
> and validate_xmit_skb_list() can immediately get the tail.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied, thanks Eric.
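
The trick in the quoted patch, storing the list tail in the otherwise unused `prev` pointer of the head, can be sketched with a minimal node type. `struct seg`, `link_segments()` and `list_tail()` are hypothetical names; in the patch the node is an sk_buff and the producer is skb_segment().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical minimal node; in the patch this role is played by sk_buff. */
struct seg {
	struct seg *next;
	struct seg *prev;	/* on the head only: reused to cache the tail */
};

/* Link n segments into a list and record the tail in head->prev. */
static struct seg *link_segments(struct seg *segs, size_t n)
{
	size_t i;

	assert(n > 0);
	for (i = 0; i < n; i++) {
		segs[i].next = (i + 1 < n) ? &segs[i + 1] : NULL;
		segs[i].prev = NULL;
	}
	segs[0].prev = &segs[n - 1];
	return &segs[0];
}

/* O(1) tail lookup for the consumer, instead of walking ->next. */
static struct seg *list_tail(const struct seg *head)
{
	return head->prev;
}
```

The producer already knows the last node when it finishes building the list, so recording it costs one store and saves the consumer a full walk.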


* Re: [PATCH] drivers/net/phy/Kconfig: Let MDIO_BCM_UNIMAC depend on HAS_IOMEM
From: David Miller @ 2014-10-06  4:46 UTC (permalink / raw)
  To: gang.chen.5i5j; +Cc: f.fainelli, netdev, linux-kernel, richard
In-Reply-To: <542FC3D9.4080607@gmail.com>

From: Chen Gang <gang.chen.5i5j@gmail.com>
Date: Sat, 04 Oct 2014 17:54:33 +0800

> MDIO_BCM_UNIMAC needs HAS_IOMEM, so let it depend on it. The related
> error (with allmodconfig under um):
> 
>     MODPOST 1205 modules
>   ERROR: "devm_ioremap" [drivers/net/phy/mdio-bcm-unimac.ko] undefined!
> 
> Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>

Applied, thanks.


* Re: bridge: Do not compile options in br_parse_ip_options
From: David Miller @ 2014-10-06  4:53 UTC (permalink / raw)
  To: herbert; +Cc: fw, netfilter-devel, bsd, stephen, netdev, eric.dumazet, davidn
In-Reply-To: <20141004141802.GA10878@gondor.apana.org.au>

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Sat, 4 Oct 2014 22:18:02 +0800

> bridge: Do not compile options in br_parse_ip_options
> 
> Commit 462fb2af9788a82a534f8184abfde31574e1cfa0
> 
> 	bridge : Sanitize skb before it enters the IP stack
> 
> broke when IP options are actually used because it mangles the
> skb as if it entered the IP stack which is wrong because the
> bridge is supposed to operate below the IP stack.
> 
> Since nobody has actually requested parsing of IP options
> this patch fixes it by simply reverting to the previous approach
> of ignoring all IP options, i.e., zeroing the IPCB.
> 
> If and when somebody who uses IP options and actually needs them
> to be parsed by the bridge complains then we can revisit this.
> 
> Reported-by: David Newall <davidn@davidnewall.com>
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Agreed that mangling the packet is definitely wrong here, and since
you preserve the CB clearing this change should be fine.

Please submit this formally after it's been tested.

Thanks.


* Re: [PATCH net-next] fec: Fix fec_enet_alloc_buffers() error path
From: David Miller @ 2014-10-06  4:54 UTC (permalink / raw)
  To: festevam; +Cc: rmk+kernel, Frank.Li, netdev, fabio.estevam
In-Reply-To: <1412440801-15381-1-git-send-email-festevam@gmail.com>

From: Fabio Estevam <festevam@gmail.com>
Date: Sat,  4 Oct 2014 13:40:01 -0300

> From: Fabio Estevam <fabio.estevam@freescale.com>
> 
> When fec_enet_alloc_buffers() fails we should undo the previous actions,
> which consists of disabling the FEC clocks and putting the FEC pins into
> an inactive state.
> 
> The error path for fec_enet_mii_probe() is kept unchanged.
> 
> Signed-off-by: Fabio Estevam <fabio.estevam@freescale.com>

Applied, thanks Fabio.
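
The error-path shape described in the quoted patch is the usual goto-based unwind. A minimal sketch follows, with hypothetical names that do not come from the fec driver; the order of the setup steps is an assumption made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device state; the names do not come from the fec driver. */
struct dev_state {
	bool clocks_on;
	bool pins_active;
	bool buffers_ready;
};

static int alloc_buffers(struct dev_state *d, bool fail)
{
	if (fail)
		return -1;	/* stands in for -ENOMEM */
	d->buffers_ready = true;
	return 0;
}

/* Open the device; on failure, undo the earlier steps in reverse order. */
static int dev_open(struct dev_state *d, bool fail_alloc)
{
	int ret;

	assert(d != NULL);
	d->pins_active = true;
	d->clocks_on = true;

	ret = alloc_buffers(d, fail_alloc);
	if (ret)
		goto err_alloc;

	return 0;

err_alloc:
	d->clocks_on = false;
	d->pins_active = false;
	return ret;
}
```

After a failed `dev_open()` the device is back in the state it had before the call, so a later retry starts from a clean slate.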


* Re: [PATCH net-next] net: sched: avoid costly atomic operation in fq_dequeue()
From: David Miller @ 2014-10-06  4:55 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <1412442691.17245.40.camel@edumazet-glaptop2.roam.corp.google.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sat, 04 Oct 2014 10:11:31 -0700

> From: Eric Dumazet <edumazet@google.com>
> 
> Standard qdisc API to set up a timer implies an atomic operation on every
> packet dequeue : qdisc_unthrottled()
> 
> It turns out this is not really needed for FQ, as FQ has no concept of
> global qdisc throttling, being a qdisc handling many different flows,
> some of them can be throttled, while others are not.
> 
> Fix is straightforward : add a 'bool throttle' to
> qdisc_watchdog_schedule_ns(), and remove calls to qdisc_unthrottled()
> in sch_fq.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied, thanks Eric.
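
The API change described in the quoted patch, a flag that lets a caller arm the timer without touching the global throttled state, can be sketched with C11 atomics. `struct sched` and `watchdog_schedule_ns()` are hypothetical simplifications and do not reproduce the kernel's qdisc_watchdog_schedule_ns().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a qdisc with a watchdog timer. */
struct sched {
	atomic_bool throttled;	/* global "qdisc is throttled" flag */
	uint64_t wakeup_ns;	/* when the watchdog should fire */
};

/*
 * Arm the watchdog. Callers that track throttling per flow pass
 * throttle=false and skip the atomic flag entirely.
 */
static void watchdog_schedule_ns(struct sched *s, uint64_t expires,
				 bool throttle)
{
	assert(s != NULL);
	if (throttle)
		atomic_store(&s->throttled, true);
	s->wakeup_ns = expires;
}
```

A scheduler that never sets the flag also never has to clear it on dequeue, which is where the per-packet atomic operation goes away.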


* Re: [PATCH v8 net-next 1/2] bonding: display xmit_hash_policy for non-dynamic-tlb mode
From: David Miller @ 2014-10-06  4:58 UTC (permalink / raw)
  To: maheshb; +Cc: j.vosburgh, andy, vfalico, nikolay, netdev, edumazet, maze
In-Reply-To: <1412469884-27308-1-git-send-email-maheshb@google.com>


After 8 revisions, I want to see some ACKs before applying these
two changes :)


* Re: [PATCH net-next 00/14] net/mlx4_en: Optimizations to TX flow
From: David Miller @ 2014-10-06  5:04 UTC (permalink / raw)
  To: amirv; +Cc: edumazet, netdev, yevgenyp, ogerlitz, idos
In-Reply-To: <1412501722-25092-1-git-send-email-amirv@mellanox.com>

From: Amir Vadai <amirv@mellanox.com>
Date: Sun,  5 Oct 2014 12:35:08 +0300

> This patchset contains optimizations to the TX flow in the mlx4_en driver. It
> also introduces setting/getting tx copybreak, to enable controlling the inline
> threshold dynamically.
> 
> The TX flow optimizations were authored and posted to the mailing list by Eric
> Dumazet [1] as a single patch. I split this patch into smaller patches,
> then reviewed and tested them.
> Changed from original patch:
> - s/iowrite32be/iowrite32/, since ring->doorbell_qpn is stored as be32
> 
> The tx copybreak patch was also suggested by Eric Dumazet, and was edited and
> reviewed by me. User space patch will be sent after kernel code is ready.
> 
> I am sending this patchset now since the merge window is near and I don't want
> to miss it.
> 
> More work to do:
> - Disable BF when xmit_more is in use
> - Make TSO use xmit_more too. Maybe by splitting small TSO packets in the
>   driver itself, to avoid extra cpu/memory costs of GSO before the driver
> - Fix mlx4_en_xmit buggy handling of queue full in the middle of a burst
>   partially posted to send queue using xmit_more
> 
> Eric, I edited the patches to have you as the author and the first
> Signed-off-by. I hope that is OK with you (I wasn't sure whether it is OK to
> sign off for you). In any case, all the credit for those changes should go to you.
> 
> Patchset was tested and applied over commit 1e203c1 ("net: sched:
> suspicious RCU usage in qdisc_watchdog")
> 
> [1] - https://patchwork.ozlabs.org/patch/394256/

Looks great, nice work everyone.


* [PATCH net-next V1 01/14] net/mlx4_en: Code cleanups in tx path
From: Amir Vadai @ 2014-10-06  6:15 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet
  Cc: netdev, Yevgeny Petrilin, Or Gerlitz, Ido Shamay, Amir Vadai
In-Reply-To: <1412576163-7224-1-git-send-email-amirv@mellanox.com>

From: Eric Dumazet <edumazet@google.com>

- Remove unused variable ring->poll_cnt
- No need to set some fields if using blueflame
- Add missing const's
- Use unlikely
- Remove unneeded new line
- Make some comments more precise
- struct mlx4_bf @offset field reduced to unsigned int to save space

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_tx.c   | 47 +++++++++++++++-------------
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |  1 -
 include/linux/mlx4/device.h                  |  2 +-
 3 files changed, 26 insertions(+), 24 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index 0c50125..d2f06a7 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -191,7 +191,6 @@ int mlx4_en_activate_tx_ring(struct mlx4_en_priv *priv,
 	ring->prod = 0;
 	ring->cons = 0xffffffff;
 	ring->last_nr_txbb = 1;
-	ring->poll_cnt = 0;
 	memset(ring->tx_info, 0, ring->size * sizeof(struct mlx4_en_tx_info));
 	memset(ring->buf, 0, ring->buf_size);
 
@@ -512,7 +511,8 @@ static struct mlx4_en_tx_desc *mlx4_en_bounce_to_desc(struct mlx4_en_priv *priv,
 	return ring->buf + index * TXBB_SIZE;
 }
 
-static int is_inline(int inline_thold, struct sk_buff *skb, void **pfrag)
+static bool is_inline(int inline_thold, const struct sk_buff *skb,
+		      void **pfrag)
 {
 	void *ptr;
 
@@ -535,7 +535,7 @@ static int is_inline(int inline_thold, struct sk_buff *skb, void **pfrag)
 	return 0;
 }
 
-static int inline_size(struct sk_buff *skb)
+static int inline_size(const struct sk_buff *skb)
 {
 	if (skb->len + CTRL_SIZE + sizeof(struct mlx4_wqe_inline_seg)
 	    <= MLX4_INLINE_ALIGN)
@@ -546,7 +546,8 @@ static int inline_size(struct sk_buff *skb)
 			     sizeof(struct mlx4_wqe_inline_seg), 16);
 }
 
-static int get_real_size(struct sk_buff *skb, struct net_device *dev,
+static int get_real_size(const struct sk_buff *skb,
+			 struct net_device *dev,
 			 int *lso_header_size)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
@@ -581,8 +582,10 @@ static int get_real_size(struct sk_buff *skb, struct net_device *dev,
 	return real_size;
 }
 
-static void build_inline_wqe(struct mlx4_en_tx_desc *tx_desc, struct sk_buff *skb,
-			     int real_size, u16 *vlan_tag, int tx_ind, void *fragptr)
+static void build_inline_wqe(struct mlx4_en_tx_desc *tx_desc,
+			     const struct sk_buff *skb,
+			     int real_size, u16 *vlan_tag,
+			     int tx_ind, void *fragptr)
 {
 	struct mlx4_wqe_inline_seg *inl = &tx_desc->inl;
 	int spc = MLX4_INLINE_ALIGN - CTRL_SIZE - sizeof *inl;
@@ -642,7 +645,8 @@ u16 mlx4_en_select_queue(struct net_device *dev, struct sk_buff *skb,
 	return fallback(dev, skb) % rings_p_up + up * rings_p_up;
 }
 
-static void mlx4_bf_copy(void __iomem *dst, unsigned long *src, unsigned bytecnt)
+static void mlx4_bf_copy(void __iomem *dst, const void *src,
+			 unsigned int bytecnt)
 {
 	__iowrite64_copy(dst, src, bytecnt / 8);
 }
@@ -736,11 +740,10 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 	tx_info->skb = skb;
 	tx_info->nr_txbb = nr_txbb;
 
+	data = &tx_desc->data;
 	if (lso_header_size)
 		data = ((void *)&tx_desc->lso + ALIGN(lso_header_size + 4,
 						      DS_SIZE));
-	else
-		data = &tx_desc->data;
 
 	/* valid only for none inline segments */
 	tx_info->data_offset = (void *)data - (void *)tx_desc;
@@ -753,9 +756,9 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 	if (is_inline(ring->inline_thold, skb, &fragptr)) {
 		tx_info->inl = 1;
 	} else {
-		/* Map fragments */
+		/* Map fragments if any */
 		for (i = skb_shinfo(skb)->nr_frags - 1; i >= 0; i--) {
-			struct skb_frag_struct *frag;
+			const struct skb_frag_struct *frag;
 			dma_addr_t dma;
 
 			frag = &skb_shinfo(skb)->frags[i];
@@ -772,7 +775,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 			--data;
 		}
 
-		/* Map linear part */
+		/* Map linear part if needed */
 		if (tx_info->linear) {
 			u32 byte_count = skb_headlen(skb) - lso_header_size;
 			dma_addr_t dma;
@@ -795,18 +798,14 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 	 * For timestamping add flag to skb_shinfo and
 	 * set flag for further reference
 	 */
-	if (ring->hwtstamp_tx_type == HWTSTAMP_TX_ON &&
-	    skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP) {
+	if (unlikely(ring->hwtstamp_tx_type == HWTSTAMP_TX_ON &&
+		     skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP)) {
 		skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
 		tx_info->ts_requested = 1;
 	}
 
 	/* Prepare ctrl segement apart opcode+ownership, which depends on
 	 * whether LSO is used */
-	tx_desc->ctrl.vlan_tag = cpu_to_be16(vlan_tag);
-	tx_desc->ctrl.ins_vlan = MLX4_WQE_CTRL_INS_VLAN *
-		!!vlan_tx_tag_present(skb);
-	tx_desc->ctrl.fence_size = (real_size / 16) & 0x3f;
 	tx_desc->ctrl.srcrb_flags = priv->ctrl_flags;
 	if (likely(skb->ip_summed == CHECKSUM_PARTIAL)) {
 		tx_desc->ctrl.srcrb_flags |= cpu_to_be32(MLX4_WQE_CTRL_IP_CSUM |
@@ -852,7 +851,6 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 			 cpu_to_be32(MLX4_EN_BIT_DESC_OWN) : 0);
 		tx_info->nr_bytes = max_t(unsigned int, skb->len, ETH_ZLEN);
 		ring->packets++;
-
 	}
 	ring->bytes += tx_info->nr_bytes;
 	netdev_tx_sent_queue(ring->tx_queue, tx_info->nr_bytes);
@@ -874,7 +872,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 	ring->prod += nr_txbb;
 
 	/* If we used a bounce buffer then copy descriptor back into place */
-	if (bounce)
+	if (unlikely(bounce))
 		tx_desc = mlx4_en_bounce_to_desc(priv, ring, index, desc_size);
 
 	skb_tx_timestamp(skb);
@@ -894,13 +892,18 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 
 		wmb();
 
-		mlx4_bf_copy(ring->bf.reg + ring->bf.offset, (unsigned long *) &tx_desc->ctrl,
-		     desc_size);
+		mlx4_bf_copy(ring->bf.reg + ring->bf.offset, &tx_desc->ctrl,
+			     desc_size);
 
 		wmb();
 
 		ring->bf.offset ^= ring->bf.buf_size;
 	} else {
+		tx_desc->ctrl.vlan_tag = cpu_to_be16(vlan_tag);
+		tx_desc->ctrl.ins_vlan = MLX4_WQE_CTRL_INS_VLAN *
+			!!vlan_tx_tag_present(skb);
+		tx_desc->ctrl.fence_size = real_size;
+
 		/* Ensure new descriptor hits memory
 		 * before setting ownership of this descriptor to HW
 		 */
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index 84c9d5d..e54b653 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -263,7 +263,6 @@ struct mlx4_en_tx_ring {
 	u32 buf_size;
 	u32 doorbell_qpn;
 	void *buf;
-	u16 poll_cnt;
 	struct mlx4_en_tx_info *tx_info;
 	u8 *bounce_buf;
 	u8 queue_index;
diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
index b2f8ab9..37e4404 100644
--- a/include/linux/mlx4/device.h
+++ b/include/linux/mlx4/device.h
@@ -583,7 +583,7 @@ struct mlx4_uar {
 };
 
 struct mlx4_bf {
-	unsigned long		offset;
+	unsigned int		offset;
 	int			buf_size;
 	struct mlx4_uar	       *uar;
 	void __iomem	       *reg;
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next V1 00/14] net/mlx4_en: Optimizations to TX flow
From: Amir Vadai @ 2014-10-06  6:15 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet
  Cc: netdev, Yevgeny Petrilin, Or Gerlitz, Ido Shamay, Amir Vadai

Hi,

This patchset contains optimizations to the TX flow in the mlx4_en driver. It
also introduces setting/getting tx copybreak, to enable controlling the inline
threshold dynamically.

The TX flow optimizations were authored and posted to the mailing list by Eric
Dumazet [1] as a single patch. I split that patch into smaller patches, then
reviewed and tested them.
Changes from the original patch:
- s/iowrite32be/iowrite32/, since ring->doorbell_qpn is stored as be32

The tx copybreak patch was also suggested by Eric Dumazet, and was edited and
reviewed by me. User space patch will be sent after kernel code is ready.

I am sending this patchset now since the merge window is near and I don't want
to miss it.

More work to do:
- Disable BF when xmit_more is in use
- Make TSO use xmit_more too. Maybe by splitting small TSO packets in the
  driver itself, to avoid extra cpu/memory costs of GSO before the driver
- Fix mlx4_en_xmit buggy handling of queue full in the middle of a burst
  partially posted to send queue using xmit_more

Eric, I edited the patches to have you as the author and the first
Signed-off-by. I hope that is OK with you (I wasn't sure whether it is OK to
sign off on your behalf). In any case, all the credit for these changes should
go to you.

The patchset was applied and tested over commit 1e203c1 ("net: sched:
suspicious RCU usage in qdisc_watchdog").

[1] - https://patchwork.ozlabs.org/patch/394256/

Changes from V0:
- Patch 14/14 ("Use the new tx_copybreak to set inline threshold"):
  - Use the same coding convention that en_ethtool.c currently uses
- Patch 1/14 ("Code cleanups in tx path") and Patch 9/14 ("Use local var in
  tx flow for skb_shinfo(skb)"):
  - The local var shinfo was used by mistake in Patch 1/14 although it is only
    declared in Patch 9/14. Fixed for the sake of future bisections.

Thanks,
Amir

Eric Dumazet (14):
  net/mlx4_en: Code cleanups in tx path
  net/mlx4_en: Align tx path structures to cache lines
  net/mlx4_en: Avoid calling bswap in tx fast path
  net/mlx4_en: tx_info allocated with kmalloc() instead of vmalloc()
  net/mlx4_en: Avoid a cache line miss in TX completion for single frag
    skb's
  net/mlx4_en: Use prefetch in tx path
  net/mlx4_en: Avoid false sharing in mlx4_en_en_process_tx_cq()
  net/mlx4_en: mlx4_en_xmit() reads ring->cons once, and ahead of time
    to avoid stalls
  net/mlx4_en: Use local var in tx flow for skb_shinfo(skb)
  net/mlx4_en: Use local var for skb_headlen(skb)
  net/mlx4_en: tx_info->ts_requested was not cleared
  net/mlx4_en: Enable the compiler to make is_inline() inlined
  ethtool: Ethtool parameter to dynamically change tx_copybreak
  net/mlx4_en: Use the new tx_copybreak to set inline threshold

 drivers/net/ethernet/mellanox/mlx4/en_ethtool.c |  44 ++++
 drivers/net/ethernet/mellanox/mlx4/en_tx.c      | 330 ++++++++++++++----------
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h    |  90 ++++---
 include/linux/mlx4/device.h                     |   2 +-
 include/uapi/linux/ethtool.h                    |   1 +
 net/core/ethtool.c                              |   1 +
 6 files changed, 290 insertions(+), 178 deletions(-)

-- 
1.8.3.1

^ permalink raw reply

* [PATCH net-next V1 05/14] net/mlx4_en: Avoid a cache line miss in TX completion for single frag skb's
From: Amir Vadai @ 2014-10-06  6:15 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet
  Cc: netdev, Yevgeny Petrilin, Or Gerlitz, Ido Shamay, Amir Vadai
In-Reply-To: <1412576163-7224-1-git-send-email-amirv@mellanox.com>

From: Eric Dumazet <edumazet@google.com>

Add map0_dma/map0_byte_count to mlx4_en_tx_info to avoid a cache
line miss in TX completion for frames having one dma element. (We avoid
reading back the tx descriptor.)

Note this could be extended to 2/3 dma elements later, as we have free
room in mlx4_en_tx_info.

Also, mlx4_en_free_tx_desc() no longer accesses skb_shinfo(). We use a
new nr_maps field in mlx4_en_tx_info to avoid 2 or 3 cache misses.
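
As a rough illustration of the idea (a user-space sketch, not the driver
code), the completion path below takes the first mapping from the per-packet
info and only reads descriptor segments for mappings 1..nr_maps-1. The
demo_* names, the demo_unmap() hook and the host-order segment layout are
assumptions made for this example only.

```c
#include <stdint.h>

/* Hypothetical stand-in for a descriptor data segment (host byte order). */
struct demo_data_seg {
	uint64_t addr;
	uint32_t byte_count;
};

/* Hypothetical per-packet info, mirroring the fields named in the patch. */
struct demo_tx_info {
	uint64_t map0_dma;        /* address of the first mapping */
	uint32_t map0_byte_count; /* length of the first mapping */
	uint8_t  nr_maps;         /* total number of mappings */
};

/* Hypothetical unmap hook: only accumulates the released byte count. */
static uint64_t demo_unmapped_bytes;

static void demo_unmap(uint64_t dma, uint32_t len)
{
	(void)dma;
	demo_unmapped_bytes += len;
}

/* Release all mappings; returns how many descriptor segments were read. */
static int demo_complete(const struct demo_tx_info *info,
			 const struct demo_data_seg *segs)
{
	int desc_reads = 0;
	int i;

	if (info->nr_maps == 0)
		return 0;

	/* First mapping comes from tx_info: no descriptor access needed. */
	demo_unmap(info->map0_dma, info->map0_byte_count);

	/* Remaining mappings still have to be read back from the descriptor. */
	for (i = 1; i < info->nr_maps; i++) {
		demo_unmap(segs[i].addr, segs[i].byte_count);
		desc_reads++;
	}
	return desc_reads;
}
```

For a frame with a single mapping, demo_complete() never dereferences segs,
which corresponds to the cache line miss the patch aims to avoid.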

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_tx.c   | 83 +++++++++++++++-------------
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |  3 +
 2 files changed, 49 insertions(+), 37 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index 02ade59..772ae6f 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -259,38 +259,40 @@ static u32 mlx4_en_free_tx_desc(struct mlx4_en_priv *priv,
 				struct mlx4_en_tx_ring *ring,
 				int index, u8 owner, u64 timestamp)
 {
-	struct mlx4_en_dev *mdev = priv->mdev;
 	struct mlx4_en_tx_info *tx_info = &ring->tx_info[index];
 	struct mlx4_en_tx_desc *tx_desc = ring->buf + index * TXBB_SIZE;
 	struct mlx4_wqe_data_seg *data = (void *) tx_desc + tx_info->data_offset;
-	struct sk_buff *skb = tx_info->skb;
-	struct skb_frag_struct *frag;
 	void *end = ring->buf + ring->buf_size;
-	int frags = skb_shinfo(skb)->nr_frags;
+	struct sk_buff *skb = tx_info->skb;
+	int nr_maps = tx_info->nr_maps;
 	int i;
-	struct skb_shared_hwtstamps hwts;
 
-	if (timestamp) {
-		mlx4_en_fill_hwtstamps(mdev, &hwts, timestamp);
+	if (unlikely(timestamp)) {
+		struct skb_shared_hwtstamps hwts;
+
+		mlx4_en_fill_hwtstamps(priv->mdev, &hwts, timestamp);
 		skb_tstamp_tx(skb, &hwts);
 	}
 
 	/* Optimize the common case when there are no wraparounds */
 	if (likely((void *) tx_desc + tx_info->nr_txbb * TXBB_SIZE <= end)) {
 		if (!tx_info->inl) {
-			if (tx_info->linear) {
+			if (tx_info->linear)
 				dma_unmap_single(priv->ddev,
-					(dma_addr_t) be64_to_cpu(data->addr),
-					 be32_to_cpu(data->byte_count),
-					 PCI_DMA_TODEVICE);
-				++data;
-			}
-
-			for (i = 0; i < frags; i++) {
-				frag = &skb_shinfo(skb)->frags[i];
+						tx_info->map0_dma,
+						tx_info->map0_byte_count,
+						PCI_DMA_TODEVICE);
+			else
+				dma_unmap_page(priv->ddev,
+					       tx_info->map0_dma,
+					       tx_info->map0_byte_count,
+					       PCI_DMA_TODEVICE);
+			for (i = 1; i < nr_maps; i++) {
+				data++;
 				dma_unmap_page(priv->ddev,
-					(dma_addr_t) be64_to_cpu(data[i].addr),
-					skb_frag_size(frag), PCI_DMA_TODEVICE);
+					(dma_addr_t)be64_to_cpu(data->addr),
+					be32_to_cpu(data->byte_count),
+					PCI_DMA_TODEVICE);
 			}
 		}
 	} else {
@@ -299,23 +301,25 @@ static u32 mlx4_en_free_tx_desc(struct mlx4_en_priv *priv,
 				data = ring->buf + ((void *)data - end);
 			}
 
-			if (tx_info->linear) {
+			if (tx_info->linear)
 				dma_unmap_single(priv->ddev,
-					(dma_addr_t) be64_to_cpu(data->addr),
-					 be32_to_cpu(data->byte_count),
-					 PCI_DMA_TODEVICE);
-				++data;
-			}
-
-			for (i = 0; i < frags; i++) {
+						tx_info->map0_dma,
+						tx_info->map0_byte_count,
+						PCI_DMA_TODEVICE);
+			else
+				dma_unmap_page(priv->ddev,
+					       tx_info->map0_dma,
+					       tx_info->map0_byte_count,
+					       PCI_DMA_TODEVICE);
+			for (i = 1; i < nr_maps; i++) {
+				data++;
 				/* Check for wraparound before unmapping */
 				if ((void *) data >= end)
 					data = ring->buf;
-				frag = &skb_shinfo(skb)->frags[i];
 				dma_unmap_page(priv->ddev,
-					(dma_addr_t) be64_to_cpu(data->addr),
-					 skb_frag_size(frag), PCI_DMA_TODEVICE);
-				++data;
+					(dma_addr_t)be64_to_cpu(data->addr),
+					be32_to_cpu(data->byte_count),
+					PCI_DMA_TODEVICE);
 			}
 		}
 	}
@@ -751,19 +755,22 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 	tx_info->linear = (lso_header_size < skb_headlen(skb) &&
 			   !is_inline(ring->inline_thold, skb, NULL)) ? 1 : 0;
 
-	data += skb_shinfo(skb)->nr_frags + tx_info->linear - 1;
+	tx_info->nr_maps = skb_shinfo(skb)->nr_frags + tx_info->linear;
+	data += tx_info->nr_maps - 1;
 
 	if (is_inline(ring->inline_thold, skb, &fragptr)) {
 		tx_info->inl = 1;
 	} else {
+		dma_addr_t dma = 0;
+		u32 byte_count = 0;
+
 		/* Map fragments if any */
 		for (i = skb_shinfo(skb)->nr_frags - 1; i >= 0; i--) {
 			const struct skb_frag_struct *frag;
-			dma_addr_t dma;
-
 			frag = &skb_shinfo(skb)->frags[i];
+			byte_count = skb_frag_size(frag);
 			dma = skb_frag_dma_map(ddev, frag,
-					       0, skb_frag_size(frag),
+					       0, byte_count,
 					       DMA_TO_DEVICE);
 			if (dma_mapping_error(ddev, dma))
 				goto tx_drop_unmap;
@@ -771,14 +778,13 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 			data->addr = cpu_to_be64(dma);
 			data->lkey = ring->mr_key;
 			wmb();
-			data->byte_count = cpu_to_be32(skb_frag_size(frag));
+			data->byte_count = cpu_to_be32(byte_count);
 			--data;
 		}
 
 		/* Map linear part if needed */
 		if (tx_info->linear) {
-			u32 byte_count = skb_headlen(skb) - lso_header_size;
-			dma_addr_t dma;
+			byte_count = skb_headlen(skb) - lso_header_size;
 
 			dma = dma_map_single(ddev, skb->data +
 					     lso_header_size, byte_count,
@@ -792,6 +798,9 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 			data->byte_count = cpu_to_be32(byte_count);
 		}
 		tx_info->inl = 0;
+		/* tx completion can avoid cache line miss for common cases */
+		tx_info->map0_dma = dma;
+		tx_info->map0_byte_count = byte_count;
 	}
 
 	/*
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index ab34461..a904030 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -216,12 +216,15 @@ enum cq_type {
 
 struct mlx4_en_tx_info {
 	struct sk_buff *skb;
+	dma_addr_t	map0_dma;
+	u32		map0_byte_count;
 	u32		nr_txbb;
 	u32		nr_bytes;
 	u8		linear;
 	u8		data_offset;
 	u8		inl;
 	u8		ts_requested;
+	u8		nr_maps;
 } ____cacheline_aligned_in_smp;
 
 
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next V1 03/14] net/mlx4_en: Avoid calling bswap in tx fast path
From: Amir Vadai @ 2014-10-06  6:15 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet
  Cc: netdev, Yevgeny Petrilin, Or Gerlitz, Ido Shamay, Amir Vadai
In-Reply-To: <1412576163-7224-1-git-send-email-amirv@mellanox.com>

From: Eric Dumazet <edumazet@google.com>

- doorbell_qpn is stored in the cpu_to_be32() way to avoid bswap() in fast
  path.
- mdev->mr.key stored in ring->mr_key to also avoid bswap() and access to
  cold cache line.
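
The following user-space sketch (not the driver code) illustrates why the
pre-swapped value can be combined directly in the fast path: a byte swap
distributes over bitwise OR, so OR-ing two big-endian values gives the same
result as swapping the OR of the CPU-order values. htonl() stands in for
cpu_to_be32(), and the demo_* names are assumptions made for this example.

```c
#include <stdint.h>
#include <arpa/inet.h>	/* htonl(): host order to big-endian, 32 bit */

/* Hypothetical stand-in for the tx ring; only the fields used here. */
struct demo_ring {
	uint32_t qpn;             /* queue pair number, CPU byte order */
	uint32_t doorbell_qpn_be; /* (qpn << 8), already big-endian */
};

/* Setup path: pay for the byte swap once. */
static void demo_ring_init(struct demo_ring *ring, uint32_t qpn)
{
	ring->qpn = qpn;
	ring->doorbell_qpn_be = htonl(qpn << 8);
}

/* Build the combined value in CPU order, then swap it for every packet. */
static uint32_t demo_bf_qpn_slow(const struct demo_ring *ring, uint32_t size)
{
	return htonl((ring->qpn << 8) | (size & 0x3f));
}

/* OR the cached big-endian value with the converted size field. */
static uint32_t demo_bf_qpn_fast(const struct demo_ring *ring, uint32_t size)
{
	return ring->doorbell_qpn_be | htonl(size & 0x3f);
}
```

Both helpers return the same bits on any host byte order. The cached value
can also be written to the doorbell as-is, which matches the switch from
iowrite32be() to iowrite32() in the diff below.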

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_tx.c   | 17 ++++++++++-------
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |  3 ++-
 2 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index d2f06a7..3ea17f9 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -195,7 +195,8 @@ int mlx4_en_activate_tx_ring(struct mlx4_en_priv *priv,
 	memset(ring->buf, 0, ring->buf_size);
 
 	ring->qp_state = MLX4_QP_STATE_RST;
-	ring->doorbell_qpn = ring->qp.qpn << 8;
+	ring->doorbell_qpn = cpu_to_be32(ring->qp.qpn << 8);
+	ring->mr_key = cpu_to_be32(mdev->mr.key);
 
 	mlx4_en_fill_qp_context(priv, ring->size, ring->stride, 1, 0, ring->qpn,
 				ring->cqn, user_prio, &ring->context);
@@ -654,7 +655,6 @@ static void mlx4_bf_copy(void __iomem *dst, const void *src,
 netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
-	struct mlx4_en_dev *mdev = priv->mdev;
 	struct device *ddev = priv->ddev;
 	struct mlx4_en_tx_ring *ring;
 	struct mlx4_en_tx_desc *tx_desc;
@@ -769,7 +769,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 				goto tx_drop_unmap;
 
 			data->addr = cpu_to_be64(dma);
-			data->lkey = cpu_to_be32(mdev->mr.key);
+			data->lkey = ring->mr_key;
 			wmb();
 			data->byte_count = cpu_to_be32(skb_frag_size(frag));
 			--data;
@@ -787,7 +787,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 				goto tx_drop_unmap;
 
 			data->addr = cpu_to_be64(dma);
-			data->lkey = cpu_to_be32(mdev->mr.key);
+			data->lkey = ring->mr_key;
 			wmb();
 			data->byte_count = cpu_to_be32(byte_count);
 		}
@@ -879,9 +879,12 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 
 	send_doorbell = !skb->xmit_more || netif_xmit_stopped(ring->tx_queue);
 
+	real_size = (real_size / 16) & 0x3f;
+
 	if (ring->bf_enabled && desc_size <= MAX_BF && !bounce &&
 	    !vlan_tx_tag_present(skb) && send_doorbell) {
-		tx_desc->ctrl.bf_qpn |= cpu_to_be32(ring->doorbell_qpn);
+		tx_desc->ctrl.bf_qpn = ring->doorbell_qpn |
+				       cpu_to_be32(real_size);
 
 		op_own |= htonl((bf_index & 0xffff) << 8);
 		/* Ensure new descriptor hits memory
@@ -911,8 +914,8 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 		tx_desc->ctrl.owner_opcode = op_own;
 		if (send_doorbell) {
 			wmb();
-			iowrite32be(ring->doorbell_qpn,
-				    ring->bf.uar->map + MLX4_SEND_DOORBELL);
+			iowrite32(ring->doorbell_qpn,
+				  ring->bf.uar->map + MLX4_SEND_DOORBELL);
 		} else {
 			ring->xmit_more++;
 		}
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index b7bde95..ab34461 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -279,7 +279,8 @@ struct mlx4_en_tx_ring {
 	u16			stride;
 	u16			cqn;	/* index of port CQ associated with this ring */
 	u32			buf_size;
-	u32			doorbell_qpn;
+	__be32			doorbell_qpn;
+	__be32			mr_key;
 	void			*buf;
 	struct mlx4_en_tx_info	*tx_info;
 	u8			*bounce_buf;
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next V1 14/14] net/mlx4_en: Use the new tx_copybreak to set inline threshold
From: Amir Vadai @ 2014-10-06  6:16 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet
  Cc: netdev, Yevgeny Petrilin, Or Gerlitz, Ido Shamay, Amir Vadai
In-Reply-To: <1412576163-7224-1-git-send-email-amirv@mellanox.com>

From: Eric Dumazet <edumazet@google.com>

Instead of setting the inline threshold using a module parameter only on
driver load, use set_tunable() to set it dynamically.
There is no need to store the threshold per ring; use the netdev global
priv->prof->inline_thold instead.
The initial value is still set using the module parameter, therefore
backward compatibility is kept.
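
A minimal user-space sketch of the range check a set_tunable() handler like
this performs is shown below. The demo_* names and the two bounds are
placeholders chosen for this example; they are not taken from the driver
headers.

```c
#include <errno.h>
#include <stdint.h>

/* Placeholder bounds for this sketch only. */
#define DEMO_MIN_THOLD	17u
#define DEMO_MAX_THOLD	104u

/* Hypothetical stand-in for the per-netdev profile. */
struct demo_profile {
	uint32_t inline_thold;
};

/* Accept the new threshold only if it lies within [MIN, MAX]. */
static int demo_set_inline_thold(struct demo_profile *prof, uint32_t val)
{
	if (val < DEMO_MIN_THOLD || val > DEMO_MAX_THOLD)
		return -EINVAL;

	prof->inline_thold = val;
	return 0;
}
```

With a nonzero lower bound, a value of 0 is rejected as well, so a check of
this shape cannot be used to switch inlining off at runtime.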

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_ethtool.c | 44 +++++++++++++++++++++++++
 drivers/net/ethernet/mellanox/mlx4/en_tx.c      |  1 -
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h    |  1 -
 3 files changed, 44 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
index 42c9f8b..b2d9e1f 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
@@ -1267,6 +1267,48 @@ static u32 mlx4_en_get_priv_flags(struct net_device *dev)
 	return priv->pflags;
 }
 
+static int mlx4_en_get_tunable(struct net_device *dev,
+			       const struct ethtool_tunable *tuna,
+			       void *data)
+{
+	const struct mlx4_en_priv *priv = netdev_priv(dev);
+	int ret = 0;
+
+	switch (tuna->id) {
+	case ETHTOOL_TX_COPYBREAK:
+		*(u32 *)data = priv->prof->inline_thold;
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+	return ret;
+}
+
+static int mlx4_en_set_tunable(struct net_device *dev,
+			       const struct ethtool_tunable *tuna,
+			       const void *data)
+{
+	struct mlx4_en_priv *priv = netdev_priv(dev);
+	int val, ret = 0;
+
+	switch (tuna->id) {
+	case ETHTOOL_TX_COPYBREAK:
+		val = *(u32 *)data;
+		if (val < MIN_PKT_LEN || val > MAX_INLINE)
+			ret = -EINVAL;
+		else
+			priv->prof->inline_thold = val;
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+	return ret;
+}
+
 
 const struct ethtool_ops mlx4_en_ethtool_ops = {
 	.get_drvinfo = mlx4_en_get_drvinfo,
@@ -1297,6 +1339,8 @@ const struct ethtool_ops mlx4_en_ethtool_ops = {
 	.get_ts_info = mlx4_en_get_ts_info,
 	.set_priv_flags = mlx4_en_set_priv_flags,
 	.get_priv_flags = mlx4_en_get_priv_flags,
+	.get_tunable = mlx4_en_get_tunable,
+	.set_tunable = mlx4_en_set_tunable,
 };
 
 
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index f0080c5..92a7cf4 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -66,7 +66,6 @@ int mlx4_en_create_tx_ring(struct mlx4_en_priv *priv,
 	ring->size = size;
 	ring->size_mask = size - 1;
 	ring->stride = stride;
-	ring->inline_thold = priv->prof->inline_thold;
 
 	tmp = size * sizeof(struct mlx4_en_tx_info);
 	ring->tx_info = kmalloc_node(tmp, GFP_KERNEL | __GFP_NOWARN, node);
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index a904030..8fef658 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -295,7 +295,6 @@ struct mlx4_en_tx_ring {
 	bool			bf_alloced;
 	struct netdev_queue	*tx_queue;
 	int			hwtstamp_tx_type;
-	int			inline_thold;
 } ____cacheline_aligned_in_smp;
 
 struct mlx4_en_rx_desc {
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next V1 06/14] net/mlx4_en: Use prefetch in tx path
From: Amir Vadai @ 2014-10-06  6:15 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet
  Cc: netdev, Yevgeny Petrilin, Or Gerlitz, Ido Shamay, Amir Vadai
In-Reply-To: <1412576163-7224-1-git-send-email-amirv@mellanox.com>

From: Eric Dumazet <edumazet@google.com>

mlx4_en_free_tx_desc() uses prefetchw(&skb->users) to speed up
consume_skb().
mlx4_en_xmit() and mlx4_en_process_tx_cq() use prefetchw() on
ring->tx_queue->dql to speed up the BQL update.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_tx.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index 772ae6f..9328e6e 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -37,6 +37,7 @@
 #include <linux/mlx4/qp.h>
 #include <linux/skbuff.h>
 #include <linux/if_vlan.h>
+#include <linux/prefetch.h>
 #include <linux/vmalloc.h>
 #include <linux/tcp.h>
 #include <linux/ip.h>
@@ -267,6 +268,11 @@ static u32 mlx4_en_free_tx_desc(struct mlx4_en_priv *priv,
 	int nr_maps = tx_info->nr_maps;
 	int i;
 
+	/* We do not touch skb here, so prefetch skb->users location
+	 * to speedup consume_skb()
+	 */
+	prefetchw(&skb->users);
+
 	if (unlikely(timestamp)) {
 		struct skb_shared_hwtstamps hwts;
 
@@ -385,6 +391,7 @@ static bool mlx4_en_process_tx_cq(struct net_device *dev,
 	if (!priv->port_up)
 		return true;
 
+	prefetchw(&ring->tx_queue->dql.limit);
 	index = cons_index & size_mask;
 	cqe = mlx4_en_get_cqe(buf, index, priv->cqe_size) + factor;
 	ring_index = ring->cons & size_mask;
@@ -722,6 +729,8 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 		}
 	}
 
+	prefetchw(&ring->tx_queue->dql);
+
 	/* Track current inflight packets for performance analysis */
 	AVG_PERF_COUNTER(priv->pstats.inflight_avg,
 			 (u32) (ring->prod - ring->cons - 1));
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next V1 08/14] net/mlx4_en: mlx4_en_xmit() reads ring->cons once, and ahead of time to avoid stalls
From: Amir Vadai @ 2014-10-06  6:15 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet
  Cc: netdev, Yevgeny Petrilin, Or Gerlitz, Ido Shamay, Amir Vadai
In-Reply-To: <1412576163-7224-1-git-send-email-amirv@mellanox.com>

From: Eric Dumazet <edumazet@google.com>
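
As a rough user-space illustration of the approach (not the driver code), the
consumer index is read once into a local snapshot, and the ring-full test uses
that snapshot with wraparound-safe unsigned arithmetic. The demo_* names and
the single "reserve" field, which stands in for HEADROOM + MAX_DESC_TXBBS, are
assumptions made for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the tx ring; only the fields used here. */
struct demo_txq {
	uint32_t prod;    /* producer index, free running */
	uint32_t cons;    /* consumer index, advanced by the completion path */
	uint32_t size;    /* number of slots in the ring */
	uint32_t reserve; /* slots kept free (headroom plus one max descriptor) */
};

/* Read the consumer index exactly once through a volatile access. */
static uint32_t demo_read_cons(const struct demo_txq *q)
{
	return *(const volatile uint32_t *)&q->cons;
}

/* Ring is considered full when in-flight slots exceed size - reserve. */
static bool demo_ring_full(const struct demo_txq *q, uint32_t cons_snapshot)
{
	/* Unsigned subtraction stays correct when the indices wrap. */
	return (int32_t)(q->prod - cons_snapshot) >
	       (int32_t)(q->size - q->reserve);
}
```

A caller would take the snapshot early, do unrelated work, and pass the same
snapshot to every later check, refreshing it only where a newer value is
required (as the diff does before re-testing a stopped queue).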

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_tx.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index 63e1f24..4b018ce 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -691,10 +691,17 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 	void *fragptr;
 	bool bounce = false;
 	bool send_doorbell;
+	u32 ring_cons;
 
 	if (!priv->port_up)
 		goto tx_drop;
 
+	tx_ind = skb_get_queue_mapping(skb);
+	ring = priv->tx_ring[tx_ind];
+
+	/* fetch ring->cons far ahead before needing it to avoid stall */
+	ring_cons = ACCESS_ONCE(ring->cons);
+
 	real_size = get_real_size(skb, dev, &lso_header_size);
 	if (unlikely(!real_size))
 		goto tx_drop;
@@ -708,13 +715,11 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 		goto tx_drop;
 	}
 
-	tx_ind = skb->queue_mapping;
-	ring = priv->tx_ring[tx_ind];
 	if (vlan_tx_tag_present(skb))
 		vlan_tag = vlan_tx_tag_get(skb);
 
 	/* Check available TXBBs And 2K spare for prefetch */
-	if (unlikely(((int)(ring->prod - ring->cons)) >
+	if (unlikely(((int)(ring->prod - ring_cons)) >
 		     ring->size - HEADROOM - MAX_DESC_TXBBS)) {
 		/* every full Tx ring stops queue */
 		netif_tx_stop_queue(ring->tx_queue);
@@ -728,7 +733,8 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 		 */
 		wmb();
 
-		if (unlikely(((int)(ring->prod - ring->cons)) <=
+		ring_cons = ACCESS_ONCE(ring->cons);
+		if (unlikely(((int)(ring->prod - ring_cons)) <=
 			     ring->size - HEADROOM - MAX_DESC_TXBBS)) {
 			netif_tx_wake_queue(ring->tx_queue);
 			ring->wake_queue++;
@@ -741,7 +747,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 
 	/* Track current inflight packets for performance analysis */
 	AVG_PERF_COUNTER(priv->pstats.inflight_avg,
-			 (u32) (ring->prod - ring->cons - 1));
+			 (u32)(ring->prod - ring_cons - 1));
 
 	/* Packet is good - grab an index and transmit it */
 	index = ring->prod & ring->size_mask;
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next V1 10/14] net/mlx4_en: Use local var for skb_headlen(skb)
From: Amir Vadai @ 2014-10-06  6:15 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet
  Cc: netdev, Yevgeny Petrilin, Or Gerlitz, Ido Shamay, Amir Vadai
In-Reply-To: <1412576163-7224-1-git-send-email-amirv@mellanox.com>

From: Eric Dumazet <edumazet@google.com>

Read skb_headlen() once in build_inline_wqe() and reuse the local value.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_tx.c | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index aa05b09..e00841a 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -612,6 +612,7 @@ static void build_inline_wqe(struct mlx4_en_tx_desc *tx_desc,
 {
 	struct mlx4_wqe_inline_seg *inl = &tx_desc->inl;
 	int spc = MLX4_INLINE_ALIGN - CTRL_SIZE - sizeof *inl;
+	unsigned int hlen = skb_headlen(skb);
 
 	if (skb->len <= spc) {
 		if (likely(skb->len >= MIN_PKT_LEN)) {
@@ -621,19 +622,19 @@ static void build_inline_wqe(struct mlx4_en_tx_desc *tx_desc,
 			memset(((void *)(inl + 1)) + skb->len, 0,
 			       MIN_PKT_LEN - skb->len);
 		}
-		skb_copy_from_linear_data(skb, inl + 1, skb_headlen(skb));
+		skb_copy_from_linear_data(skb, inl + 1, hlen);
 		if (shinfo->nr_frags)
-			memcpy(((void *)(inl + 1)) + skb_headlen(skb), fragptr,
+			memcpy(((void *)(inl + 1)) + hlen, fragptr,
 			       skb_frag_size(&shinfo->frags[0]));
 
 	} else {
 		inl->byte_count = cpu_to_be32(1 << 31 | spc);
-		if (skb_headlen(skb) <= spc) {
-			skb_copy_from_linear_data(skb, inl + 1, skb_headlen(skb));
-			if (skb_headlen(skb) < spc) {
-				memcpy(((void *)(inl + 1)) + skb_headlen(skb),
-					fragptr, spc - skb_headlen(skb));
-				fragptr +=  spc - skb_headlen(skb);
+		if (hlen <= spc) {
+			skb_copy_from_linear_data(skb, inl + 1, hlen);
+			if (hlen < spc) {
+				memcpy(((void *)(inl + 1)) + hlen,
+				       fragptr, spc - hlen);
+				fragptr +=  spc - hlen;
 			}
 			inl = (void *) (inl + 1) + spc;
 			memcpy(((void *)(inl + 1)), fragptr, skb->len - spc);
@@ -641,9 +642,9 @@ static void build_inline_wqe(struct mlx4_en_tx_desc *tx_desc,
 			skb_copy_from_linear_data(skb, inl + 1, spc);
 			inl = (void *) (inl + 1) + spc;
 			skb_copy_from_linear_data_offset(skb, spc, inl + 1,
-					skb_headlen(skb) - spc);
+							 hlen - spc);
 			if (shinfo->nr_frags)
-				memcpy(((void *)(inl + 1)) + skb_headlen(skb) - spc,
+				memcpy(((void *)(inl + 1)) + hlen - spc,
 				       fragptr,
 				       skb_frag_size(&shinfo->frags[0]));
 		}
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next V1 12/14] net/mlx4_en: Enable the compiler to make is_inline() inlined
From: Amir Vadai @ 2014-10-06  6:16 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet
  Cc: netdev, Yevgeny Petrilin, Or Gerlitz, Ido Shamay, Amir Vadai
In-Reply-To: <1412576163-7224-1-git-send-email-amirv@mellanox.com>

From: Eric Dumazet <edumazet@google.com>

Reorganize code to call is_inline() once, so compiler can inline it
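
The sketch below (user-space, not the driver code) shows the shape of this
reorganization: the inline decision is made by a single call, and the result
plus the fragment pointer are kept in locals for the rest of the transmit
path. The demo_* names and the simplified packet structure are assumptions
made for this example.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a packet. */
struct demo_pkt {
	unsigned int len;      /* total packet length */
	unsigned int nr_frags; /* number of page fragments */
	void *frag0_addr;      /* address of fragment 0, NULL if unavailable */
};

/* Decide whether the packet can be copied inline into the descriptor. */
static bool demo_is_inline(unsigned int thold, const struct demo_pkt *pkt,
			   void **pfrag)
{
	if (thold == 0 || pkt->len > thold)
		return false;

	if (pkt->nr_frags == 1) {
		if (pkt->frag0_addr == NULL)
			return false;
		*pfrag = pkt->frag0_addr;
		return true;
	}

	/* More than one fragment is never inlined; zero fragments always is. */
	return pkt->nr_frags == 0;
}

/* Result of the single decision, reused by later transmit stages. */
struct demo_tx_plan {
	bool inline_ok;
	void *fragptr;
};

static struct demo_tx_plan demo_plan_tx(unsigned int thold,
					const struct demo_pkt *pkt)
{
	struct demo_tx_plan plan = { .inline_ok = false, .fragptr = NULL };

	plan.inline_ok = demo_is_inline(thold, pkt, &plan.fragptr);
	return plan;
}
```

With only one call site, a static helper of this kind is a natural candidate
for inlining by the compiler.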

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_tx.c | 67 +++++++++++++++++-------------
 1 file changed, 38 insertions(+), 29 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index 2c03b55..f0080c5 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -531,29 +531,32 @@ static struct mlx4_en_tx_desc *mlx4_en_bounce_to_desc(struct mlx4_en_priv *priv,
 	return ring->buf + index * TXBB_SIZE;
 }
 
+/* Decide if skb can be inlined in tx descriptor to avoid dma mapping
+ *
+ * It seems strange we do not simply use skb_copy_bits().
+ * This would allow to inline all skbs iff skb->len <= inline_thold
+ *
+ * Note that caller already checked skb was not a gso packet
+ */
 static bool is_inline(int inline_thold, const struct sk_buff *skb,
 		      const struct skb_shared_info *shinfo,
 		      void **pfrag)
 {
 	void *ptr;
 
-	if (inline_thold && !skb_is_gso(skb) && skb->len <= inline_thold) {
-		if (shinfo->nr_frags == 1) {
-			ptr = skb_frag_address_safe(&shinfo->frags[0]);
-			if (unlikely(!ptr))
-				return 0;
-
-			if (pfrag)
-				*pfrag = ptr;
+	if (skb->len > inline_thold || !inline_thold)
+		return false;
 
-			return 1;
-		} else if (unlikely(shinfo->nr_frags))
-			return 0;
-		else
-			return 1;
+	if (shinfo->nr_frags == 1) {
+		ptr = skb_frag_address_safe(&shinfo->frags[0]);
+		if (unlikely(!ptr))
+			return false;
+		*pfrag = ptr;
+		return true;
 	}
-
-	return 0;
+	if (shinfo->nr_frags)
+		return false;
+	return true;
 }
 
 static int inline_size(const struct sk_buff *skb)
@@ -570,12 +573,15 @@ static int inline_size(const struct sk_buff *skb)
 static int get_real_size(const struct sk_buff *skb,
 			 const struct skb_shared_info *shinfo,
 			 struct net_device *dev,
-			 int *lso_header_size)
+			 int *lso_header_size,
+			 bool *inline_ok,
+			 void **pfrag)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
 	int real_size;
 
 	if (shinfo->gso_size) {
+		*inline_ok = false;
 		if (skb->encapsulation)
 			*lso_header_size = (skb_inner_transport_header(skb) - skb->data) + inner_tcp_hdrlen(skb);
 		else
@@ -595,10 +601,14 @@ static int get_real_size(const struct sk_buff *skb,
 		}
 	} else {
 		*lso_header_size = 0;
-		if (!is_inline(priv->prof->inline_thold, skb, shinfo, NULL))
-			real_size = CTRL_SIZE + (shinfo->nr_frags + 1) * DS_SIZE;
-		else
+		*inline_ok = is_inline(priv->prof->inline_thold, skb,
+				       shinfo, pfrag);
+
+		if (*inline_ok)
 			real_size = inline_size(skb);
+		else
+			real_size = CTRL_SIZE +
+				    (shinfo->nr_frags + 1) * DS_SIZE;
 	}
 
 	return real_size;
@@ -694,9 +704,10 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 	u16 vlan_tag = 0;
 	int i_frag;
 	int lso_header_size;
-	void *fragptr;
+	void *fragptr = NULL;
 	bool bounce = false;
 	bool send_doorbell;
+	bool inline_ok;
 	u32 ring_cons;
 
 	if (!priv->port_up)
@@ -708,7 +719,8 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 	/* fetch ring->cons far ahead before needing it to avoid stall */
 	ring_cons = ACCESS_ONCE(ring->cons);
 
-	real_size = get_real_size(skb, shinfo, dev, &lso_header_size);
+	real_size = get_real_size(skb, shinfo, dev, &lso_header_size,
+				  &inline_ok, &fragptr);
 	if (unlikely(!real_size))
 		goto tx_drop;
 
@@ -781,15 +793,15 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 	/* valid only for none inline segments */
 	tx_info->data_offset = (void *)data - (void *)tx_desc;
 
+	tx_info->inl = inline_ok;
+
 	tx_info->linear = (lso_header_size < skb_headlen(skb) &&
-			   !is_inline(ring->inline_thold, skb, shinfo, NULL)) ? 1 : 0;
+			   !inline_ok) ? 1 : 0;
 
 	tx_info->nr_maps = shinfo->nr_frags + tx_info->linear;
 	data += tx_info->nr_maps - 1;
 
-	if (is_inline(ring->inline_thold, skb, shinfo, &fragptr)) {
-		tx_info->inl = 1;
-	} else {
+	if (!tx_info->inl) {
 		dma_addr_t dma = 0;
 		u32 byte_count = 0;
 
@@ -827,7 +839,6 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 			wmb();
 			data->byte_count = cpu_to_be32(byte_count);
 		}
-		tx_info->inl = 0;
 		/* tx completion can avoid cache line miss for common cases */
 		tx_info->map0_dma = dma;
 		tx_info->map0_byte_count = byte_count;
@@ -899,11 +910,9 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 	netdev_tx_sent_queue(ring->tx_queue, tx_info->nr_bytes);
 	AVG_PERF_COUNTER(priv->pstats.tx_pktsz_avg, skb->len);
 
-	if (tx_info->inl) {
+	if (tx_info->inl)
 		build_inline_wqe(tx_desc, skb, shinfo, real_size, &vlan_tag,
 				 tx_ind, fragptr);
-		tx_info->inl = 1;
-	}
 
 	if (skb->encapsulation) {
 		struct iphdr *ipv4 = (struct iphdr *)skb_inner_network_header(skb);
-- 
1.8.3.1


* [PATCH net-next V1 02/14] net/mlx4_en: Align tx path structures to cache lines
From: Amir Vadai @ 2014-10-06  6:15 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet
  Cc: netdev, Yevgeny Petrilin, Or Gerlitz, Ido Shamay, Amir Vadai
In-Reply-To: <1412576163-7224-1-git-send-email-amirv@mellanox.com>

From: Eric Dumazet <edumazet@google.com>

Reorganize struct mlx4_en_tx_ring to have:
- One cache line containing last_nr_txbb, cons and wake_queue, used by tx
  completion.
- One cache line containing the fields dirtied by mlx4_en_xmit().
- A remaining part that is read-mostly and shared by CPUs.

Align struct mlx4_en_tx_info to a cache line.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h | 86 +++++++++++++++-------------
 1 file changed, 46 insertions(+), 40 deletions(-)
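
Reader's note, not part of the patch: a minimal userspace sketch of the
grouping idea, using C11 alignas in place of the kernel's
____cacheline_aligned_in_smp. struct tx_ring_sketch and the 64-byte
CACHE_LINE value are assumptions for illustration; the real line size is
architecture dependent.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64	/* assumed line size, for illustration only */

struct tx_ring_sketch {
	/* dirtied by the completion path; also aligns the whole struct */
	alignas(CACHE_LINE) uint32_t last_nr_txbb;
	uint32_t cons;
	unsigned long wake_queue;

	/* dirtied by the transmit path; forced onto a new cache line */
	alignas(CACHE_LINE) uint32_t prod;
	unsigned long bytes;
	unsigned long packets;

	/* read-mostly fields follow without extra alignment */
	uint32_t size;
	uint32_t size_mask;
};

/* Completion fields and transmit fields never share a line. */
static_assert(offsetof(struct tx_ring_sketch, prod) == CACHE_LINE,
	      "prod must start the second cache line");
static_assert(sizeof(struct tx_ring_sketch) % CACHE_LINE == 0,
	      "struct size is padded to whole cache lines");
```

Keeping the two writers on separate lines avoids false sharing when the
transmit path and the completion path run on different CPUs.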

diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index e54b653..b7bde95 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -216,13 +216,13 @@ enum cq_type {
 
 struct mlx4_en_tx_info {
 	struct sk_buff *skb;
-	u32 nr_txbb;
-	u32 nr_bytes;
-	u8 linear;
-	u8 data_offset;
-	u8 inl;
-	u8 ts_requested;
-};
+	u32		nr_txbb;
+	u32		nr_bytes;
+	u8		linear;
+	u8		data_offset;
+	u8		inl;
+	u8		ts_requested;
+} ____cacheline_aligned_in_smp;
 
 
 #define MLX4_EN_BIT_DESC_OWN	0x80000000
@@ -253,40 +253,46 @@ struct mlx4_en_rx_alloc {
 };
 
 struct mlx4_en_tx_ring {
+	/* cache line used and dirtied in tx completion
+	 * (mlx4_en_free_tx_buf())
+	 */
+	u32			last_nr_txbb;
+	u32			cons;
+	unsigned long		wake_queue;
+
+	/* cache line used and dirtied in mlx4_en_xmit() */
+	u32			prod ____cacheline_aligned_in_smp;
+	unsigned long		bytes;
+	unsigned long		packets;
+	unsigned long		tx_csum;
+	unsigned long		tso_packets;
+	unsigned long		xmit_more;
+	struct mlx4_bf		bf;
+	unsigned long		queue_stopped;
+
+	/* Following part should be mostly read */
+	cpumask_t		affinity_mask;
+	struct mlx4_qp		qp;
 	struct mlx4_hwq_resources wqres;
-	u32 size ; /* number of TXBBs */
-	u32 size_mask;
-	u16 stride;
-	u16 cqn;	/* index of port CQ associated with this ring */
-	u32 prod;
-	u32 cons;
-	u32 buf_size;
-	u32 doorbell_qpn;
-	void *buf;
-	struct mlx4_en_tx_info *tx_info;
-	u8 *bounce_buf;
-	u8 queue_index;
-	cpumask_t affinity_mask;
-	u32 last_nr_txbb;
-	struct mlx4_qp qp;
-	struct mlx4_qp_context context;
-	int qpn;
-	enum mlx4_qp_state qp_state;
-	struct mlx4_srq dummy;
-	unsigned long bytes;
-	unsigned long packets;
-	unsigned long tx_csum;
-	unsigned long queue_stopped;
-	unsigned long wake_queue;
-	unsigned long tso_packets;
-	unsigned long xmit_more;
-	struct mlx4_bf bf;
-	bool bf_enabled;
-	bool bf_alloced;
-	struct netdev_queue *tx_queue;
-	int hwtstamp_tx_type;
-	int inline_thold;
-};
+	u32			size; /* number of TXBBs */
+	u32			size_mask;
+	u16			stride;
+	u16			cqn;	/* index of port CQ associated with this ring */
+	u32			buf_size;
+	u32			doorbell_qpn;
+	void			*buf;
+	struct mlx4_en_tx_info	*tx_info;
+	u8			*bounce_buf;
+	struct mlx4_qp_context	context;
+	int			qpn;
+	enum mlx4_qp_state	qp_state;
+	u8			queue_index;
+	bool			bf_enabled;
+	bool			bf_alloced;
+	struct netdev_queue	*tx_queue;
+	int			hwtstamp_tx_type;
+	int			inline_thold;
+} ____cacheline_aligned_in_smp;
 
 struct mlx4_en_rx_desc {
 	/* actual number of entries depends on rx ring stride */
-- 
1.8.3.1


* [PATCH net-next V1 04/14] net/mlx4_en: tx_info allocated with kmalloc() instead of vmalloc()
From: Amir Vadai @ 2014-10-06  6:15 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet
  Cc: netdev, Yevgeny Petrilin, Or Gerlitz, Ido Shamay, Amir Vadai
In-Reply-To: <1412576163-7224-1-git-send-email-amirv@mellanox.com>

From: Eric Dumazet <edumazet@google.com>

Try to allocate with kmalloc_node() first, and fall back to vmalloc()
only on failure.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_tx.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
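
Reader's note, not part of the patch: a userspace analogy of the
"try the preferred allocator, fall back, free by address" pattern. The
one-slot pool stands in for kmalloc_node(), malloc() stands in for
vmalloc(), and free_either() plays the role of kvfree(). All names here are
hypothetical; none of this is kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical "fast" allocator: a single fixed slot that can fail. */
static _Alignas(max_align_t) unsigned char pool[256];
static bool pool_busy;

static void *fast_alloc(size_t size)
{
	if (pool_busy || size > sizeof(pool))
		return NULL;
	pool_busy = true;
	return pool;
}

/* True if p points into the fast pool. */
static bool from_pool(const void *p)
{
	uintptr_t a = (uintptr_t)p;

	return a >= (uintptr_t)pool && a < (uintptr_t)pool + sizeof(pool);
}

/* Try the fast path first; use the general allocator only on failure. */
static void *alloc_with_fallback(size_t size)
{
	void *p = fast_alloc(size);

	return p ? p : malloc(size);
}

/* Pick the matching release path from the address alone. */
static void free_either(void *p)
{
	if (!p)
		return;
	if (from_pool(p))
		pool_busy = false;
	else
		free(p);
}
```

Because the caller no longer knows which allocator succeeded, every release
site has to go through the address-dispatching free, which is why the patch
switches both vfree() calls to kvfree().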

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index 3ea17f9..02ade59 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -68,7 +68,7 @@ int mlx4_en_create_tx_ring(struct mlx4_en_priv *priv,
 	ring->inline_thold = priv->prof->inline_thold;
 
 	tmp = size * sizeof(struct mlx4_en_tx_info);
-	ring->tx_info = vmalloc_node(tmp, node);
+	ring->tx_info = kmalloc_node(tmp, GFP_KERNEL | __GFP_NOWARN, node);
 	if (!ring->tx_info) {
 		ring->tx_info = vmalloc(tmp);
 		if (!ring->tx_info) {
@@ -151,7 +151,7 @@ err_bounce:
 	kfree(ring->bounce_buf);
 	ring->bounce_buf = NULL;
 err_info:
-	vfree(ring->tx_info);
+	kvfree(ring->tx_info);
 	ring->tx_info = NULL;
 err_ring:
 	kfree(ring);
@@ -174,7 +174,7 @@ void mlx4_en_destroy_tx_ring(struct mlx4_en_priv *priv,
 	mlx4_free_hwq_res(mdev->dev, &ring->wqres, ring->buf_size);
 	kfree(ring->bounce_buf);
 	ring->bounce_buf = NULL;
-	vfree(ring->tx_info);
+	kvfree(ring->tx_info);
 	ring->tx_info = NULL;
 	kfree(ring);
 	*pring = NULL;
-- 
1.8.3.1


