Netdev List

Netdev List
 help / color / mirror / Atom feed

* RE: [PATCH v2] sh_eth: remove unchecked interrupts
From: Chris Brandt @ 2016-12-01 18:24 UTC (permalink / raw)
  To: Sergei Shtylyov, David Miller
  Cc: Simon Horman, Geert Uytterhoeven, netdev@vger.kernel.org,
	linux-renesas-soc@vger.kernel.org
In-Reply-To: <7d50b5e0-ae7c-a39d-625a-c9cfc11e398f@cogentembedded.com>

On 12/1/2016, Sergei Shtylyov wrote:
>     One thing you've missed so far is mentioning R7S72100 (RZ/A1) in the
> subject. This driver supports many SoCs, you're only fixing one of them...

For the last sh_eth.c patch I submitted, I had:

"net: ethernet: renesas: sh_eth: add POST registers for rz"


Should I resend the patch as:

"[PATCH v3] sh_eth: remove unchecked interrupts for RZ/A1"


??

Chris


^ permalink raw reply

* Re: [PATCH v2] sh_eth: remove unchecked interrupts
From: Sergei Shtylyov @ 2016-12-01 18:28 UTC (permalink / raw)
  To: Chris Brandt, David Miller
  Cc: Simon Horman, Geert Uytterhoeven, netdev@vger.kernel.org,
	linux-renesas-soc@vger.kernel.org
In-Reply-To: <SG2PR06MB1165DFB4E735FC10696F08CA8A8F0@SG2PR06MB1165.apcprd06.prod.outlook.com>

On 12/01/2016 09:24 PM, Chris Brandt wrote:

>>     One thing you've missed so far is mentioning R7S72100 (RZ/A1) in the
>> subject. This driver supports many SoCs, you're only fixing one of them...
>
> For the last sh_eth.c patch I submitted, I had:
>
> "net: ethernet: renesas: sh_eth: add POST registers for rz"
>
>
> Should I resend the patch as:
>
> "[PATCH v3] sh_eth: remove unchecked interrupts for RZ/A1"
>
>
> ??

    Yes, that will do.

> Chris

MBR, Sergei

^ permalink raw reply

* [net PATCH 0/2] IPv4 FIB suffix length fixes
From: Alexander Duyck @ 2016-12-01 12:27 UTC (permalink / raw)
  To: netdev, rshearma; +Cc: davem

In reviewing the patch from Robert Shearman and looking over the code I
realized there were a few different bugs we were still carrying in the IPv4
FIB lookup code.

These two patches are based off of Robert's original patch, but take things
one step further by splitting them up to address two additional issues I
found.

So first have Robert's original patch which was addressing the fact that
us calling update_suffix in resize is expensive when it is called per add.
To address that I incorporated the core bit of the patch which was us
dropping the update_suffix call from resize.

The first patch in the series does a rename and fix on the push_suffix and
pull_suffix code.  Specifically we drop the need to pass a leaf and
secondly we fix things so we pull the suffix as long as the value of the
suffix in the node is dropping.

The second patch addresses the original issue reported as well as
optimizing the code for the fact that update_suffix is only really meant to
go through and clean things up when we are decreasing a suffix.  I had
originally added code for it to somehow cause an increase, but if we push
the suffix when a new leaf is added we only ever have to handle pulling
down the suffix with update_suffix so I updated the code to reflect that.

As far as side effects the only ones I think that will be obvious should be
the fact that some routes may be able to be found earlier since before we
relied on resize to update the suffix lengths, and now we are updating them
before we add or remove the leaf.

---

Alexander Duyck (2):
      ipv4: Drop leaf from suffix pull/push functions
      ipv4: Drop suffix update from resize code

 net/ipv4/fib_trie.c |   68 ++++++++++++++++++++++++++-------------------------
 1 file changed, 35 insertions(+), 33 deletions(-)

^ permalink raw reply

* [net PATCH 1/2] ipv4: Drop leaf from suffix pull/push functions
From: Alexander Duyck @ 2016-12-01 12:27 UTC (permalink / raw)
  To: netdev, rshearma; +Cc: davem
In-Reply-To: <20161201121605.15499.13176.stgit@ahduyck-blue-test.jf.intel.com>

It wasn't necessary to pass a leaf in when doing the suffix updates so just
drop it.  Instead just pass the suffix and work with that.

Since we dropped the leaf there is no need to include that in the name so
the names are updated to node_push_suffix and node_pull_suffix.

Finally I noticed that the logic for pulling the suffix length back
actually had some issues.  Specifically it would stop prematurely if there
was a longer suffix, but it was not as long as the original suffix.  I
updated the code to address that in node_pull_suffix.

Fixes: 5405afd1a306 ("fib_trie: Add tracking value for suffix length")
Suggested-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 net/ipv4/fib_trie.c |   26 ++++++++++++++------------
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index e2ffc2a..c5cd226 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -892,22 +892,24 @@ static struct key_vector *resize(struct trie *t, struct key_vector *tn)
 	return tp;
 }
 
-static void leaf_pull_suffix(struct key_vector *tp, struct key_vector *l)
+static void node_pull_suffix(struct key_vector *tn, unsigned char slen)
 {
-	while ((tp->slen > tp->pos) && (tp->slen > l->slen)) {
-		if (update_suffix(tp) > l->slen)
+	unsigned char node_slen = tn->slen;
+
+	while ((node_slen > tn->pos) && (node_slen > slen)) {
+		slen = update_suffix(tn);
+		if (node_slen == slen)
 			break;
-		tp = node_parent(tp);
+
+		tn = node_parent(tn);
+		node_slen = tn->slen;
 	}
 }
 
-static void leaf_push_suffix(struct key_vector *tn, struct key_vector *l)
+static void node_push_suffix(struct key_vector *tn, unsigned char slen)
 {
-	/* if this is a new leaf then tn will be NULL and we can sort
-	 * out parent suffix lengths as a part of trie_rebalance
-	 */
-	while (tn->slen < l->slen) {
-		tn->slen = l->slen;
+	while (tn->slen < slen) {
+		tn->slen = slen;
 		tn = node_parent(tn);
 	}
 }
@@ -1069,7 +1071,7 @@ static int fib_insert_alias(struct trie *t, struct key_vector *tp,
 	/* if we added to the tail node then we need to update slen */
 	if (l->slen < new->fa_slen) {
 		l->slen = new->fa_slen;
-		leaf_push_suffix(tp, l);
+		node_push_suffix(tp, new->fa_slen);
 	}
 
 	return 0;
@@ -1482,7 +1484,7 @@ static void fib_remove_alias(struct trie *t, struct key_vector *tp,
 
 	/* update the trie with the latest suffix length */
 	l->slen = fa->fa_slen;
-	leaf_pull_suffix(tp, l);
+	node_pull_suffix(tp, fa->fa_slen);
 }
 
 /* Caller must hold RTNL. */

^ permalink raw reply related

* Re: [PATCH net] fib_trie: Avoid expensive update of suffix length when not required
From: Alexander Duyck @ 2016-12-01 18:29 UTC (permalink / raw)
  To: Robert Shearman; +Cc: David Miller, Netdev, Alexander Duyck
In-Reply-To: <50c4f236-c486-1224-e3b0-c3e322d68d58@brocade.com>

On Wed, Nov 30, 2016 at 4:27 PM, Robert Shearman <rshearma@brocade.com> wrote:
> On 29/11/16 23:14, Alexander Duyck wrote:
>>
>> On Tue, Nov 29, 2016 at 9:46 AM, Robert Shearman <rshearma@brocade.com>
>> wrote:
>>>
>>> With certain distributions of routes it can take a long time to add
>>> and delete further routes at scale. For example, with a route for
>>> 10.37.96.0/20 present it takes 47s to add ~200k contiguous /24 routes
>>> from 8.0.0.0/24 through to 11.138.207.0/24. Perf shows the bottleneck
>>> is update_suffix:
>>>
>>>       40.39%  [kernel]                      [k] update_suffix
>>>        8.02%  libnl-3.so.200.19.0           [.] nl_hash_table_lookup
>>
>>
>> Well yeah, it will be expensive when you have over 512K entries in a
>> single node.  I'm assuming that is the case based on the fact that
>> 200K routes will double up in the trie node due to broadcast and the
>> route ending up in adjacent key vectors.
>
>
> The example scenario was in fact using a large scale of just routes rather
> than addresses so there are no broadcast leafs (nor local leafs):
>
>         +-- 8.0.0.0/6 18 2 52436 suffix/20
>            |-- 8.0.0.0
>               /24 universe UNICAST
>            |-- 8.0.1.0
>               /24 universe UNICAST
>            |-- 8.0.2.0
>               /24 universe UNICAST
>
> (the "suffix/20" is for test purposes as per below). In this case the
> 8.0.0.0/6 node has a child array of size 2^18 = 262144.
>
>>
>>> With these changes, the time is reduced to 4s for the same scale and
>>> distribution of routes.
>>>
>>> The issue is that update_suffix does an O(n) walk on the children of a
>>> node and the with a dense distribtion of routes the number of children
>>> in a node tends towards the number of nodes in the tree.
>>
>>
>> Other than the performance I was just wondering if you did any other
>> validation to confirm that nothing else differs.  Basically it would
>> just be a matter of verifying that /proc/net/fib_trie is the same
>> before and after your patch.
>
>
> Yes, to verify these changes I applied some local changes to dump the slen
> field of trie nodes from /proc/net/fib_trie. I presumed that the format of
> /proc/net/fib_trie is a public API that we can't break so I intend to submit
> this as a patch. I verified by inspection that the suffix length listed in
> the nodes was as expected when exercising the insert and remove functions
> for routes with both larger and smaller suffixes than in the current nodes,
> and then for the inflate, halve and flush cases.
>
> I've now verified that a diff of /proc/net/fib_trie before and after my
> patch with all the routes added of my example scenario shows no changes.
>
>>
> ...
>>>
>>> Fixes: 5405afd1a306 ("fib_trie: Add tracking value for suffix length")
>>> Signed-off-by: Robert Shearman <rshearma@brocade.com>
>>
>>
>> So I am not entirely convinced this is a regression, I was wondering
>> what your numbers looked like before the patch you are saying this
>> fixes?  I recall I fixed a number of issues leading up to this so I am
>> just curious.
>
>
> At commit 21d1f11db0e2f20a549ad8322879507850544670 ("fib_trie: Remove checks
> for index >= tnode_child_length from tnode_get_child") which is the parent
> of 5405afd1a306:
>
> $ time sudo ip route restore < ~/allroutes
> RTNETLINK answers: File exists
> RTNETLINK answers: File exists
> RTNETLINK answers: File exists
> RTNETLINK answers: File exists
>
> real    0m3.451s
> user    0m0.184s
> sys     0m2.004s
>
>
> At commit 5405afd1a30620de8601436bae739c67c0bc7324 ("fib_trie: Add tracking
> value for suffix length"):
>
> $ time sudo ip route restore < ~/allroutes
> RTNETLINK answers: File exists
> RTNETLINK answers: File exists
> RTNETLINK answers: File exists
> RTNETLINK answers: File exists
>
> real    0m48.624s
> user    0m0.360s
> sys     0m46.960s
>
> So does that warrant a fixes tag for this patch?

Yes, please include it in the patch description next time.

I've been giving it thought and the fact is the patch series has
merit.  I just think it was being a bit too aggressive in terms of
some of the changes.  So placing any changes in put_child seem to be
the one piece in this that make this a bit ugly.  I'll submit a
counter proposal, if you could try testing it I would appreciate it,
or if you could look at incorporating some of it into your patch it
would be useful.

>>
>> My thought is this seems more like a performance optimization than a
>> fix.  If that is the case this probably belongs in net-next.
>>
>> It seems to me we might be able to simplify update_suffix rather than
>> going through all this change.  That might be something that is more
>> acceptable for net.  Have you looked at simply adding code so that
>> there is a break inside update_suffix if (slen == tn->slen)?  We don't
>> have to call it for if a leaf is added so it might make sense to add
>> that check.
>
>
> That doesn't really help since the search always starts from the smallest
> suffix length and thus could potentially require visiting a large number of
> children before finding the node that makes slen == tn->slen.
>
> In the example above, 10.37.96.0/20 would be child number 140640 in node
> 8.0.0.0/6 18. Since the loop starts out with stride = 2 this means that it
> requires 70320 iterations round to find 10.37.96.0/20 which contributes the
> largest suffix length to the node.

Yes, however that is still 60K fewer iterations we would have to do.

> Now it may be possible to improve the algorithm by starting the search from
> the largest suffix length towards the smallest suffix length instead of the
> current smallest to largest, but there would still be distributions of
> routes that would lead to needing to visit a large number of nodes only to
> find that nothing has changed.

The largest possible suffix always starts at 0.  Any bits in the index
mean that the suffix has to stop before that bit since suffix
represents the number of bits that are 0.  By starting at 0 and
increasing the stride as we hit bigger values we should be about as
optimal there as we can be.

>>
>>> ---
>>>  net/ipv4/fib_trie.c | 86
>>> ++++++++++++++++++++++++++++++++++++++---------------
>>>  1 file changed, 62 insertions(+), 24 deletions(-)
>>>
>>> diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
>>> index 026f309c51e9..701cae8af44a 100644
>>> --- a/net/ipv4/fib_trie.c
>>> +++ b/net/ipv4/fib_trie.c
>>> @@ -421,8 +421,22 @@ static inline int tnode_full(struct key_vector *tn,
>>> struct key_vector *n)
>>>         return n && ((n->pos + n->bits) == tn->pos) && IS_TNODE(n);
>>>  }
>>>
>>> +static void node_push_suffix(struct key_vector *tn, struct key_vector
>>> *l)
>>> +{
>>> +       while (tn->slen < l->slen) {
>>> +               tn->slen = l->slen;
>>> +               tn = node_parent(tn);
>>> +               if (!tn)
>>> +                       break;
>>
>>
>> I don't think the NULL check is necessary, at least it wasn't with the old
>> code.
>
>
> It wasn't necessary before because the root node had the largest possible
> suffix length of KEYLENGTH. This isn't the case for the nodes this is now
> called on since they're either nodes without parents or sub-tries that end
> up in node without a parent, where they're initialised to have the smallest
> possible suffix length for the node (equal to their position).
>
>>
>>> +       }
>>> +}
>>> +
>>>  /* Add a child at position i overwriting the old value.
>>>   * Update the value of full_children and empty_children.
>>> + *
>>> + * The suffix length of the parent node and the rest of the tree is
>>> + * updated (if required) when adding/replacing a node, but is caller's
>>> + * responsibility on removal.
>>>   */
>>>  static void put_child(struct key_vector *tn, unsigned long i,
>>>                       struct key_vector *n)
>>> @@ -447,8 +461,8 @@ static void put_child(struct key_vector *tn, unsigned
>>> long i,
>>>         else if (!wasfull && isfull)
>>>                 tn_info(tn)->full_children++;
>>>
>>> -       if (n && (tn->slen < n->slen))
>>> -               tn->slen = n->slen;
>>> +       if (n)
>>> +               node_push_suffix(tn, n);
>>
>>
>> This is just creating redundant work if we have to perform a resize.
>
>
> The only real redundant work is the assignment of slen in tn, since the
> propagation up the trie has to happen regardless and if a subsequent resize
> results in changes to the trie then the propagation will stop at tn's parent
> since the suffix length of the tn's parent will not be less than tn's suffix
> length.


This isn't going to work.  We want to avoid trying to push the suffix
when we place a child.  There are scenarios where we are placing
children in nodes that don't have parents yet because we are doing
rebalances and such.  I suspect this could be pretty expensive on a
resize call.

We want to pull the suffix pushing out of the resize completely.  The
problem is this is going to cause us to start pushing suffixes for
every node moved as a part of resize.

>>
>>>         rcu_assign_pointer(tn->tnode[i], n);
>>>  }
>
> ...
>
>>> -static void leaf_pull_suffix(struct key_vector *tp, struct key_vector
>>> *l)
>>> +static void node_set_suffix(struct key_vector *tp, unsigned char slen)
>>>  {
>>> -       while ((tp->slen > tp->pos) && (tp->slen > l->slen)) {
>>> -               if (update_suffix(tp) > l->slen)
>>> +       if (slen > tp->slen)
>>> +               tp->slen = slen;
>>> +}
>>> +
>>> +static void node_pull_suffix(struct key_vector *tn)
>>> +{
>>> +       struct key_vector *tp;
>>> +       unsigned char slen;
>>> +
>>> +       slen = update_suffix(tn);
>>> +       tp = node_parent(tn);
>>> +       while ((tp->slen > tp->pos) && (tp->slen > slen)) {
>>> +               if (update_suffix(tp) > slen)
>>>                         break;
>>>                 tp = node_parent(tp);
>>>         }
>>>  }
>>>
>>

Actually I realized that there is a bug here.  The check for
update_suffix(tp) > slen can cause us to stop prematurely if I am not
mistaken.  What we should probably be doing is pulling out tp->slen
and if the result of update_suffix is the same value then we can break
the loop.  Otherwise we can stop early if a "second runner up" shows
up that is supposed to be pushed up the trie.

I actually started around with patches to do much the same as what you
are doing and I will probably submit them as an RFC so if you need you
can pick pieces out as needed, or I could submit them if they work for
you.

>> I really hate all the renaming.  Can't you just leave these as
>> leaf_pull_suffix and leaf_push _suffix.  I'm fine with us dropping the
>> leaf of the suffix but the renaming just makes adds a bunch of noise.
>> Really it might work better if you broke the replacement of the
>> key_vector as a leaf with the suffix length into a separate patch,
>> then you could do the rename as a part of that.
>
>
> Ok, I'll leave the naming of leaf_push_suffix alone. Note that
> leaf_pull_suffix hasn't been renamed, the below in the diff is just an
> artifact of how diff decided to present the changes along with the moving of
> leaf_push_suffix.

So I have been looking this over for the last couple days and actually
I have started to be okay with the renaming.

However I would ask to be consistent.  If you are going to have
node_pull_suffix and node_push_suffix then just pass a node and a
suffix length.  You can drop the leaf key vector from both functions
instead of just one.

>>
>>> -static void leaf_push_suffix(struct key_vector *tn, struct key_vector
>>> *l)
>>> +static void leaf_pull_suffix(struct key_vector *tp, struct key_vector
>>> *l)
>>>  {
>>> -       /* if this is a new leaf then tn will be NULL and we can sort
>>> -        * out parent suffix lengths as a part of trie_rebalance
>>> -        */
>>> -       while (tn->slen < l->slen) {
>>> -               tn->slen = l->slen;
>>> -               tn = node_parent(tn);
>>> +       while ((tp->slen > tp->pos) && (tp->slen > l->slen)) {
>>> +               if (update_suffix(tp) > l->slen)
>>> +                       break;
>>> +               tp = node_parent(tp);
>>>         }
>>>  }
>>
>>
>> If possible it would work better if you could avoid moving functions
>> around as a result of your changes.
>
>
> Ok, I can add a forward declaration instead.

It shouldn't be necessary.  I think you are doing way too much with
this patch.  We just need this to be a fix, you are doing a bit too
much like adding this new node_set_suffix function which isn't really
needed.

>>
>>> @@ -1107,7 +1122,7 @@ static int fib_insert_alias(struct trie *t, struct
>>> key_vector *tp,
>>>         /* if we added to the tail node then we need to update slen */
>>>         if (l->slen < new->fa_slen) {
>>>                 l->slen = new->fa_slen;
>>> -               leaf_push_suffix(tp, l);
>>> +               node_push_suffix(tp, l);
>>>         }
>>>
>>>         return 0;
>
> ...
>>>
>>> @@ -1783,6 +1801,16 @@ void fib_table_flush_external(struct fib_table
>>> *tb)
>>>                         if (IS_TRIE(pn))
>>>                                 break;
>>>
>>> +                       /* push the suffix length to the parent node,
>>> +                        * since a previous leaf removal may have
>>> +                        * caused it to change
>>> +                        */
>>> +                       if (pn->slen > pn->pos) {
>>> +                               unsigned char slen = update_suffix(pn);
>>> +
>>> +                               node_set_suffix(node_parent(pn), slen);
>>> +                       }
>>> +
>>
>>
>> Why bother with the local variable?  You could just pass
>> update_suffix(pn) as the parameter to your node_set_suffix function.
>
>
> To avoid a long line on the node_set_suffix call or splitting it across
> lines, but I'll remove the local variable as you suggest and the same below.

Actually I think there is a bug here.  You shouldn't be setting the
suffix for the parent based on the child.  You can just call
update_suffix(pn) and that should be enough.

^ permalink raw reply

* [net PATCH 2/2] ipv4: Drop suffix update from resize code
From: Alexander Duyck @ 2016-12-01 12:27 UTC (permalink / raw)
  To: netdev, rshearma; +Cc: davem
In-Reply-To: <20161201121605.15499.13176.stgit@ahduyck-blue-test.jf.intel.com>

It has been reported that update_suffix can be expensive when it is called
on a large node in which most of the suffix lengths are the same.  The time
required to add 200K entries had increased from around 3 seconds to almost
49 seconds.

In order to address this we need to move the code for updating the suffix
out of resize and instead just have it handled in the cases where we are
pushing a node that increases the suffix length, or will decrease the
suffix length.

Fixes: 5405afd1a306 ("fib_trie: Add tracking value for suffix length")
Reported-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 net/ipv4/fib_trie.c |   42 +++++++++++++++++++++---------------------
 1 file changed, 21 insertions(+), 21 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index c5cd226..2cfa9f4 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -681,6 +681,13 @@ static unsigned char update_suffix(struct key_vector *tn)
 {
 	unsigned char slen = tn->pos;
 	unsigned long stride, i;
+	unsigned char slen_max;
+
+	/* only vector 0 can have a suffix length greater than or equal to
+	 * tn->pos + tn->bits, the second highest node will have a suffix
+	 * length at most of tn->pos + tn->bits - 1
+	 */
+	slen_max = min_t(unsigned char, tn->pos + tn->bits - 1, tn->slen);
 
 	/* search though the list of children looking for nodes that might
 	 * have a suffix greater than the one we currently have.  This is
@@ -698,12 +705,8 @@ static unsigned char update_suffix(struct key_vector *tn)
 		slen = n->slen;
 		i &= ~(stride - 1);
 
-		/* if slen covers all but the last bit we can stop here
-		 * there will be nothing longer than that since only node
-		 * 0 and 1 << (bits - 1) could have that as their suffix
-		 * length.
-		 */
-		if ((slen + 1) >= (tn->pos + tn->bits))
+		/* stop searching if we have hit the maximum possible value */
+		if (slen >= slen_max)
 			break;
 	}
 
@@ -875,21 +878,7 @@ static struct key_vector *resize(struct trie *t, struct key_vector *tn)
 		return collapse(t, tn);
 
 	/* update parent in case halve failed */
-	tp = node_parent(tn);
-
-	/* Return if at least one deflate was run */
-	if (max_work != MAX_WORK)
-		return tp;
-
-	/* push the suffix length to the parent node */
-	if (tn->slen > tn->pos) {
-		unsigned char slen = update_suffix(tn);
-
-		if (slen > tp->slen)
-			tp->slen = slen;
-	}
-
-	return tp;
+	return node_parent(tn);
 }
 
 static void node_pull_suffix(struct key_vector *tn, unsigned char slen)
@@ -1030,6 +1019,7 @@ static int fib_insert_node(struct trie *t, struct key_vector *tp,
 	}
 
 	/* Case 3: n is NULL, and will just insert a new leaf */
+	node_push_suffix(tp, new->fa_slen);
 	NODE_INIT_PARENT(l, tp);
 	put_child_root(tp, key, l);
 	trie_rebalance(t, tp);
@@ -1472,6 +1462,8 @@ static void fib_remove_alias(struct trie *t, struct key_vector *tp,
 	 * out parent suffix lengths as a part of trie_rebalance
 	 */
 	if (hlist_empty(&l->leaf)) {
+		if (tp->slen == l->slen)
+			node_pull_suffix(tp, tp->pos);
 		put_child_root(tp, l->key, NULL);
 		node_free(l);
 		trie_rebalance(t, tp);
@@ -1753,6 +1745,10 @@ void fib_table_flush_external(struct fib_table *tb)
 			if (IS_TRIE(pn))
 				break;
 
+			/* update the suffix to address pulled leaves */
+			if (pn->slen > pn->pos)
+				update_suffix(pn);
+
 			/* resize completed node */
 			pn = resize(t, pn);
 			cindex = get_index(pkey, pn);
@@ -1828,6 +1824,10 @@ int fib_table_flush(struct fib_table *tb)
 			if (IS_TRIE(pn))
 				break;
 
+			/* update the suffix to address pulled leaves */
+			if (pn->slen > pn->pos)
+				update_suffix(pn);
+
 			/* resize completed node */
 			pn = resize(t, pn);
 			cindex = get_index(pkey, pn);

^ permalink raw reply related

* [PATCH v3] sh_eth: remove unchecked interrupts for RZ/A1
From: Chris Brandt @ 2016-12-01 18:32 UTC (permalink / raw)
  To: David Miller, Sergei Shtylyov
  Cc: Simon Horman, Geert Uytterhoeven, netdev, linux-renesas-soc,
	Chris Brandt

When streaming a lot of data and the RZ/A1 can't keep up, some status bits
will get set that are not being checked or cleared which cause the
following messages and the Ethernet driver to stop working. This
patch fixes that issue.

irq 21: nobody cared (try booting with the "irqpoll" option)
handlers:
[<c036b71c>] sh_eth_interrupt
Disabling IRQ #21

Fixes: db893473d313a4ad ("sh_eth: Add support for r7s72100")
Signed-off-by: Chris Brandt <chris.brandt@renesas.com>
Acked-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
---
v3:
* add RZ/A1 to subject line
v2:
* switched from modifying eesr_err_check to modifying eesipr_value
---
 drivers/net/ethernet/renesas/sh_eth.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/renesas/sh_eth.c b/drivers/net/ethernet/renesas/sh_eth.c
index 05b0dc5..1a92de7 100644
--- a/drivers/net/ethernet/renesas/sh_eth.c
+++ b/drivers/net/ethernet/renesas/sh_eth.c
@@ -518,7 +518,7 @@ static struct sh_eth_cpu_data r7s72100_data = {
 
 	.ecsr_value	= ECSR_ICD,
 	.ecsipr_value	= ECSIPR_ICDIP,
-	.eesipr_value	= 0xff7f009f,
+	.eesipr_value	= 0xe77f009f,
 
 	.tx_check	= EESR_TC1 | EESR_FTC,
 	.eesr_err_check	= EESR_TWB1 | EESR_TWB | EESR_TABT | EESR_RABT |
-- 
2.10.1

^ permalink raw reply related

* pull-request: wireless-drivers-next 2016-12-01
From: Kalle Valo @ 2016-12-01 18:33 UTC (permalink / raw)
  To: David Miller; +Cc: linux-wireless, netdev, linux-kernel

Hi Dave,

here's another pull request for net-next. Nothing special to mention
about, the details are in the signed tag below.

This time there's a trivial conflict in
drivers/net/wireless/ath/ath10k/mac.c:

<<<<<<< HEAD
	ieee80211_hw_set(ar->hw, SUPPORTS_TX_FRAG);
=======
	ieee80211_hw_set(ar->hw, REPORTS_LOW_ACK);
>>>>>>> d5fb3a138048798ce4cc4b4ced47d07d1794c577

We want to have both flags enabled in ath10k.

I'm planning to submit at least one more pull request, if Linus gives us
one more week I might send even two. For example there are patches to
convert wcn36xx to use the real SMD bus subsystem but they depend on few
arm-soc patches. I'll send a separate email about that, they are not
part of this pull request.

Please let me know if there are any problems.

Kalle

The following changes since commit 159a55a64d44acbbd6f0d8f3c082e628d6d75670:

  rt2800: disable CCK rates on HT (2016-11-23 17:38:53 +0200)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next.git tags/wireless-drivers-next-for-davem-2016-12-01

for you to fetch changes up to d5fb3a138048798ce4cc4b4ced47d07d1794c577:

  Merge ath-next from git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git (2016-12-01 15:09:14 +0200)

----------------------------------------------------------------

wireless-drivers-next patches for 4.10

Major changes:

rsi

* filter rx frames
* configure tx power
* make it possible to select antenna
* support 802.11d

brcmfmac

* cleanup of scheduled scan code
* support for bcm43341 chipset with different chip id
* support rev6 of PCIe device interface

ath10k

* add spectral scan support for QCA6174 and QCA9377 families
* show used tx bitrate with 10.4 firmware

wil6210

* add power save mode support
* add abort scan functionality
* add support settings retry limit for short frames

bcma

* add Dell Inspiron 3148

----------------------------------------------------------------
Anilkumar Kolli (2):
      ath10k: add per peer htt tx stats support for 10.4
      ath10k: add support for per sta tx bitrate

Anthony Romano (2):
      mt7601u: wait for clear rxq when stopping mac
      ath9k_htc: don't use HZ for usb msg timeouts

Arend Van Spriel (11):
      brcmfmac: add support for 43341 chip
      brcmfmac: move pno helper functions in separate source file
      brcmfmac: fix handling ssids in .sched_scan_start() callback
      brcmfmac: change prototype for brcmf_do_escan()
      brcmfmac: make internal escan more generic
      brcmfmac: split up brcmf_pno_config() function
      brcmfmac: move scheduled scan activation to pno source file
      brcmfmac: use provided channels for scheduled scan
      brcmfmac: remove restriction from .sched_scan_start() callback
      brcmfmac: use requested scan interval in scheduled scan
      brcmfmac: fix scheduled scan result handling for newer chips

Barry Day (1):
      rtl8xxxu: tx rate reported before set

Ben Greear (1):
      ath10k: wmi-alloc-chunk should use DMA_BIDIRECTIONAL

Bhumika Goyal (1):
      ath9k: constify ath_bus_ops structure

Brian Norris (3):
      mwifiex: cleanup wake-IRQ handling if suspend fails
      mwifiex: avoid double-disable_irq() race
      mwifiex: pcie: implement timeout loop for FW programming doorbell

Dedy Lansky (1):
      wil6210: fix net queue stop/wake

Erik Stromdahl (1):
      ath10k: fix TLV set regdomain command

Franky Lin (1):
      brcmfmac: add pcie host dongle interface rev6 support

Geliang Tang (1):
      ath5k: drop duplicate header vmalloc.h

Jes Sorensen (7):
      rtl8xxxu: Fix memory leak in handling rxdesc16 packets
      rtl8xxxu: Fix big-endian problem reporting mactime
      rtl8xxxu: Fix rtl8723bu driver reload issue
      rtl8xxxu: Fix rtl8192eu driver reload issue
      rtl8xxxu: Obtain RTS rates from mac80211
      rtl8xxxu: Pass tx_info to fill_txdesc in order to have access to retry count
      rtl8xxxu: Work around issue with 8192eu and 8723bu devices not reconnecting

Jiri Slaby (1):
      bcma: add Dell Inspiron 3148

Kalle Valo (1):
      Merge ath-next from git://git.kernel.org/.../kvalo/ath.git

Karthik D A (1):
      mwifiex: Disable adhoc feature based on firmware capability

Kirtika Ruchandani (7):
      mwifiex: Removed unused mwifiex_private* 'priv' variable
      mwifiex: Remove unused 'chan_num' variable
      mwifiex: Remove unused 'sta_ptr' variable
      mwifiex: Remove unused 'adapter'variable
      mwifiex: Remove unused 'pm_flag' variable
      mwifiex: Removed unused 'pkt_type' variable
      mwifiex: Remove unused 'bcd_usb' variable

Larry Finger (1):
      rtlwifi: Fix enter/exit power_save

Lior David (6):
      wil6210: fix deadlock when using fw_no_recovery option
      wil6210: align to latest auto generated wmi.h
      wil6210: support NL80211_ATTR_WIPHY_RETRY_SHORT
      wil6210: delay remain on channel when scan is active
      wil6210: add debugfs blobs for UCODE code and data
      wil6210: align to latest auto generated wmi.h

Manoharan, Rajkumar (1):
      ath10k: fix monitor vdev for receiving other bss frames

Matthias Schiffer (1):
      ath9k: fix ath9k_hw_gpio_get() to return 0 or 1 on success

Maya Erez (3):
      wil6210: add support for power save enable / disable
      wil6210: add support for abort scan
      wil6210: validate wil_pmc_alloc parameters

Miaoqing Pan (1):
      ath9k: fix NULL pointer dereference

Michal Kazior (2):
      ath10k: fix null deref on wmi-tlv when trying spectral scan
      ath10k: add spectral scan support to wmi-tlv

Mohammed Shafi Shajakhan (2):
      ath10k: fix soft lockup during firmware crash/hw-restart
      ath10k: fix Tx DMA alloc failure during continuous wifi down/up

Pedersen, Thomas (2):
      ath10k: implement offset_tsf ieee80211_op
      ath10k: remove set/get_tsf ieee80211_ops

Prameela Rani Garnepudi (4):
      rsi: Add support to filter rx frames
      rsi: Add support for configuring tx power
      rsi: Add support for antenna selection
      rsi: Add support for 802.11d

Rajkumar Manoharan (1):
      ath10k: advertize hardware packet loss mechanism

Tobias Regnery (1):
      brcmsmac: fix array out-of-bounds access in qm_log10

Wei Yongjun (1):
      rtl8xxxu: Fix non static symbol warning

Zefir Kurtisi (1):
      ath9k: feed only active spectral / dfs-detector

 drivers/bcma/host_pci.c                            |    1 +
 drivers/net/wireless/ath/ath10k/core.c             |    8 +-
 drivers/net/wireless/ath/ath10k/core.h             |   24 +
 drivers/net/wireless/ath/ath10k/debugfs_sta.c      |   13 +
 drivers/net/wireless/ath/ath10k/htt.c              |    2 +
 drivers/net/wireless/ath/ath10k/htt.h              |   31 +-
 drivers/net/wireless/ath/ath10k/htt_rx.c           |  125 +++++
 drivers/net/wireless/ath/ath10k/htt_tx.c           |   54 +-
 drivers/net/wireless/ath/ath10k/mac.c              |   51 +-
 drivers/net/wireless/ath/ath10k/wmi-ops.h          |    6 +
 drivers/net/wireless/ath/ath10k/wmi-tlv.c          |   77 ++-
 drivers/net/wireless/ath/ath10k/wmi.c              |    8 +-
 drivers/net/wireless/ath/ath10k/wmi.h              |   18 +-
 drivers/net/wireless/ath/ath5k/debug.c             |    1 -
 drivers/net/wireless/ath/ath9k/ahb.c               |    2 +-
 drivers/net/wireless/ath/ath9k/common-spectral.c   |    8 +-
 drivers/net/wireless/ath/ath9k/hif_usb.c           |    9 +-
 drivers/net/wireless/ath/ath9k/hif_usb.h           |    2 +
 drivers/net/wireless/ath/ath9k/hw.c                |    2 +-
 drivers/net/wireless/ath/ath9k/recv.c              |   17 +-
 drivers/net/wireless/ath/wil6210/cfg80211.c        |  129 ++++-
 drivers/net/wireless/ath/wil6210/main.c            |  100 ++--
 drivers/net/wireless/ath/wil6210/netdev.c          |    2 +-
 drivers/net/wireless/ath/wil6210/p2p.c             |  160 ++++--
 drivers/net/wireless/ath/wil6210/pmc.c             |   55 +-
 drivers/net/wireless/ath/wil6210/txrx.c            |  110 +++-
 drivers/net/wireless/ath/wil6210/wil6210.h         |   25 +-
 drivers/net/wireless/ath/wil6210/wil_crash_dump.c  |    6 +
 drivers/net/wireless/ath/wil6210/wmi.c             |  160 +++++-
 drivers/net/wireless/ath/wil6210/wmi.h             |  586 ++++++++++++++++----
 .../wireless/broadcom/brcm80211/brcmfmac/Makefile  |    3 +-
 .../net/wireless/broadcom/brcm80211/brcmfmac/bus.h |   10 +-
 .../broadcom/brcm80211/brcmfmac/cfg80211.c         |  381 +++++--------
 .../broadcom/brcm80211/brcmfmac/cfg80211.h         |    4 +-
 .../broadcom/brcm80211/brcmfmac/fwil_types.h       |   23 +
 .../wireless/broadcom/brcm80211/brcmfmac/msgbuf.c  |   38 +-
 .../wireless/broadcom/brcm80211/brcmfmac/msgbuf.h  |    4 +
 .../wireless/broadcom/brcm80211/brcmfmac/pcie.c    |  171 +++---
 .../net/wireless/broadcom/brcm80211/brcmfmac/pno.c |  242 ++++++++
 .../net/wireless/broadcom/brcm80211/brcmfmac/pno.h |   40 ++
 .../wireless/broadcom/brcm80211/brcmfmac/sdio.c    |    1 +
 .../broadcom/brcm80211/brcmsmac/phy/phy_qmath.c    |    5 +-
 .../broadcom/brcm80211/include/brcm_hw_ids.h       |    1 +
 drivers/net/wireless/marvell/mwifiex/cfg80211.c    |   12 +-
 drivers/net/wireless/marvell/mwifiex/fw.h          |    1 +
 drivers/net/wireless/marvell/mwifiex/main.c        |    3 -
 drivers/net/wireless/marvell/mwifiex/main.h        |    9 +-
 drivers/net/wireless/marvell/mwifiex/pcie.c        |   17 +-
 drivers/net/wireless/marvell/mwifiex/scan.c        |    8 +-
 drivers/net/wireless/marvell/mwifiex/sdio.c        |    6 +-
 drivers/net/wireless/marvell/mwifiex/sta_cmd.c     |   38 +-
 drivers/net/wireless/marvell/mwifiex/usb.c         |    3 +-
 drivers/net/wireless/mediatek/mt7601u/init.c       |   14 +-
 drivers/net/wireless/mediatek/mt7601u/regs.h       |    3 +
 drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu.h   |   31 +-
 .../net/wireless/realtek/rtl8xxxu/rtl8xxxu_8192e.c |   10 +-
 .../net/wireless/realtek/rtl8xxxu/rtl8xxxu_8723b.c |    4 +
 .../net/wireless/realtek/rtl8xxxu/rtl8xxxu_core.c  |  122 ++--
 drivers/net/wireless/realtek/rtlwifi/base.c        |    8 +-
 drivers/net/wireless/realtek/rtlwifi/core.c        |    9 +-
 drivers/net/wireless/realtek/rtlwifi/pci.c         |   14 +-
 drivers/net/wireless/realtek/rtlwifi/ps.c          |   36 +-
 drivers/net/wireless/rsi/rsi_91x_mac80211.c        |  156 +++++-
 drivers/net/wireless/rsi/rsi_91x_mgmt.c            |  129 ++++-
 drivers/net/wireless/rsi/rsi_main.h                |    4 +
 drivers/net/wireless/rsi/rsi_mgmt.h                |   23 +-
 66 files changed, 2581 insertions(+), 794 deletions(-)
 create mode 100644 drivers/net/wireless/broadcom/brcm80211/brcmfmac/pno.c
 create mode 100644 drivers/net/wireless/broadcom/brcm80211/brcmfmac/pno.h

^ permalink raw reply

* Re: linux-next: Tree for Dec 1 (ethernet/mellanox)
From: Randy Dunlap @ 2016-12-01 18:38 UTC (permalink / raw)
  To: Stephen Rothwell, linux-next
  Cc: linux-kernel, netdev@vger.kernel.org, Tariq Toukan,
	Saeed Mahameed
In-Reply-To: <20161201174235.21689436@canb.auug.org.au>

On 11/30/16 22:42, Stephen Rothwell wrote:
> Hi all,
> 
> Changes since 20161130:
> 

on i386:

when CONFIG_INET is not enabled:

drivers/built-in.o: In function `mlx5e_test_loopback':
en_selftest.c:(.text+0x690bde): undefined reference to `ip_send_check'
en_selftest.c:(.text+0x690c42): undefined reference to `udp4_hwcsum'


-- 
~Randy

^ permalink raw reply

* Re: [iproute PATCH 04/18] ss: Use sockstat->type in all socket types
From: Stephen Hemminger @ 2016-12-01 18:41 UTC (permalink / raw)
  To: Phil Sutter; +Cc: netdev
In-Reply-To: <20161111131014.21865-5-phil@nwl.cc>

On Fri, 11 Nov 2016 14:10:00 +0100
Phil Sutter <phil@nwl.cc> wrote:

> Unix sockets used that field already to hold info about the socket type.
> By replicating this approach in all other socket types, we can get rid
> of protocol parameter in inet_stats_print() and have sock_state_print()
> figure things out by itself.
> 
> Signed-off-by: Phil Sutter <phil@nwl.cc>

Cleaning out the patch backlog...
This patch does not apply to current master branch, rejects are:

--- misc/ss.c
+++ misc/ss.c
@@ -1719,27 +1768,11 @@ void *parse_markmask(const char *markmask)
 	return res;
 }
 
-static char *proto_name(int protocol)
-{
-	switch (protocol) {
-	case 0:
-		return "raw";
-	case IPPROTO_UDP:
-		return "udp";
-	case IPPROTO_TCP:
-		return "tcp";
-	case IPPROTO_DCCP:
-		return "dccp";
-	}
-
-	return "???";
-}
-
-static void inet_stats_print(struct sockstat *s, int protocol)
+static void inet_stats_print(struct sockstat *s)
 {
 	char *buf = NULL;
 
-	sock_state_print(s, proto_name(protocol));
+	sock_state_print(s);
 
 	inet_addr_print(&s->local, s->lport, s->iface);
 	inet_addr_print(&s->remote, s->rport, 0);


Please fix and resubmit whole series.

^ permalink raw reply

* Re: [PATCH iproute2 2/3] ifstat: Add extended statistics to ifstat
From: Stephen Hemminger @ 2016-12-01 18:46 UTC (permalink / raw)
  To: Nogah Frankel; +Cc: netdev, eladr, yotamg, jiri, idosch, ogerlitz
In-Reply-To: <1479996760-61271-3-git-send-email-nogahf@mellanox.com>

On Thu, 24 Nov 2016 16:12:39 +0200
Nogah Frankel <nogahf@mellanox.com> wrote:

> Add extended stats option for ifstat. It supports stats that are in the
> nesting level as the "normal" stats or one lower, as long as they are in
> the same struct type as the "normal" stats.
> Every extension is unaware of data from other extension and is being
> presented by itself.
> The extension can be called by its name or any shorten of it. If there is
> more then one matched, the first one will be picked.
> 
> To get the extended stats the flag -x <stats type> is used.
> 
> Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
> Reviewed-by: Jiri Pirko <jiri@mellanox.com>

Finally clearing up iproute2 patch backlog. This feature looks good,
but does not apply cleanly to current git master branch.

--- misc/ifstat.c
+++ misc/ifstat.c
@@ -733,7 +790,8 @@ static void usage(void)
 "   -s, --noupdate	don\'t update history\n"
 "   -t, --interval=SECS	report average over the last SECS\n"
 "   -V, --version	output version information\n"
-"   -z, --zeros		show entries with zero activity\n");
+"   -z, --zeros		show entries with zero activity\n"
+"   -x, --extended=TYPE	show extended stats of TYPE\n");
 
 	exit(-1);
 }

Please rebase your patches and resubmit.

^ permalink raw reply

* Re: [PATCH iproute2 net-next 2/4] tc: flower: document SCTP ip_proto
From: Stephen Hemminger @ 2016-12-01 18:50 UTC (permalink / raw)
  To: Simon Horman; +Cc: netdev
In-Reply-To: <1480434693-29397-3-git-send-email-simon.horman@netronome.com>

On Tue, 29 Nov 2016 16:51:31 +0100
Simon Horman <simon.horman@netronome.com> wrote:

> Add SCTP ip_proto to help text and man page.
> 
> Signed-off-by: Simon Horman <simon.horman@netronome.com>

Sorry doesn't apply to current net-next branch in iproute2 git.
Probably some of the other changes modified formatting.

^ permalink raw reply

* RE: [PATCH v2] sh_eth: remove unchecked interrupts
From: Chris Brandt @ 2016-12-01 18:53 UTC (permalink / raw)
  To: Sergei Shtylyov, Geert Uytterhoeven
  Cc: David Miller, Simon Horman, Geert Uytterhoeven,
	netdev@vger.kernel.org, Linux-Renesas
In-Reply-To: <25b92de0-806b-342b-7556-06b96b948b2c@cogentembedded.com>

Hi Geert,

On 12/1/2016, Sergei Shtylyov wrote:
> 
> On 12/01/2016 05:42 PM, Geert Uytterhoeven wrote:
> 
> >> --- a/drivers/net/ethernet/renesas/sh_eth.c
> >> +++ b/drivers/net/ethernet/renesas/sh_eth.c
> >> @@ -518,7 +518,7 @@ static struct sh_eth_cpu_data r7s72100_data = {
> >>
> >>         .ecsr_value     = ECSR_ICD,
> >>         .ecsipr_value   = ECSIPR_ICDIP,
> >> -       .eesipr_value   = 0xff7f009f,
> >> +       .eesipr_value   = 0xe77f009f,
> >
> > Comment not directly related to the merits of this patch: the EESIPR
> > bit definitions seem to be identical to those for EESR, so we can get
> > rid of the hardcoded values here?
> 
>     Do you mean using the @define's? We have EESIPR bits also declared,
> see enum DMAC_IM_BIT,


Is your idea to get rid of .eesipr_value for devices that have values
that are the same as .eesr_err_check?


For example in sh_eth_dev_init():

	sh_eth_modify(ndev, EESR, 0, 0);
	mdp->irq_enabled = true;
-	sh_eth_write(ndev, mdp->cd->eesipr_value, EESIPR);
+	if (mdp->cd->eesipr_value)
+		sh_eth_write(ndev, mdp->cd->eesipr_value, EESIPR);
+	else
+		sh_eth_write(ndev, mdp->cd->eesr_err_check, EESIPR);


Chris


^ permalink raw reply

* Re: [PATCH iproute2 V4 0/3] tc: Support for ip tunnel metadata set/unset/classify
From: Stephen Hemminger @ 2016-12-01 18:54 UTC (permalink / raw)
  To: Amir Vadai
  Cc: netdev, David S. Miller, Jiri Benc, Or Gerlitz, Hadar Har-Zion,
	Roi Dayan
In-Reply-To: <20161201114446.30333-1-amir@vadai.me>

On Thu,  1 Dec 2016 13:44:43 +0200
Amir Vadai <amir@vadai.me> wrote:

> Hi,
> 
> This short series adds support for matching and setting metadata for ip tunnel
> shared device using the TC system, introduced in kernel 4.9 [1].
> 
> Applied and tested on top of commit f3f339e9590a ("cleanup debris from revert")
> 
> Example usage:
> 
> $ tc filter add dev vxlan0 protocol ip parent ffff: \
>     flower \
>       enc_src_ip 11.11.0.2 \
>       enc_dst_ip 11.11.0.1 \
>       enc_key_id 11 \
>       dst_ip 11.11.11.1 \
>     action mirred egress redirect dev vnet0
> 
> $ tc filter add dev net0 protocol ip parent ffff: \
>     flower \
>       ip_proto 1 \
>       dst_ip 11.11.11.2 \
>     action tunnel_key set \
>       src_ip 11.11.0.1 \
>       dst_ip 11.11.0.2 \
>       id 11 \
>     action mirred egress redirect dev vxlan0
> 
> [1] - d1ba24feb466 ("Merge branch 'act_tunnel_key'")
> 
> Thanks,
> Amir
> 
> Changes from V3:
> - Fix bad wording in the man page about the use of the 'unset' operation
> 
> Changes from V2:
> - Use const where needed
> - Don't lose return value
> - Introduce rta_getattr_be16() and rta_getattr_be32()
> 
> Changes from V1:
> - Updated Patch 2/2 ("tc/act_tunnel: Introduce ip tunnel action") commit log
> 	and the man page tc-tunnel_key to reflect the fact that 'unset' operation is
> 	no mandatory.
> 	And describe when it might be needed.
> - Rename the 'release' operation to 'unset'
> 
> Amir Vadai (3):
>   libnetlink: Introduce rta_getattr_be*()
>   tc/cls_flower: Classify packet in ip tunnels
>   tc/act_tunnel: Introduce ip tunnel action
> 
>  bridge/fdb.c                         |   4 +-
>  include/libnetlink.h                 |   9 ++
>  include/linux/tc_act/tc_tunnel_key.h |  42 ++++++
>  ip/iplink_geneve.c                   |   2 +-
>  ip/iplink_vxlan.c                    |   2 +-
>  man/man8/tc-flower.8                 |  17 ++-
>  man/man8/tc-tunnel_key.8             | 112 +++++++++++++++
>  tc/Makefile                          |   1 +
>  tc/f_flower.c                        |  84 +++++++++++-
>  tc/m_tunnel_key.c                    | 258 +++++++++++++++++++++++++++++++++++
>  10 files changed, 522 insertions(+), 9 deletions(-)
>  create mode 100644 include/linux/tc_act/tc_tunnel_key.h
>  create mode 100644 man/man8/tc-tunnel_key.8
>  create mode 100644 tc/m_tunnel_key.c

I cleared up patch backlog and got net-next branch up to date with kernel
headers, and this patch series does not apply cleanly anymore.

Please rebase and resubmit.

^ permalink raw reply

* Re: [PATCH net-next iproute2 PATCH 2/2 v2] ss: Add inet raw sockets information gathering via netlink diag interface
From: Stephen Hemminger @ 2016-12-01 18:57 UTC (permalink / raw)
  To: Cyrill Gorcunov; +Cc: netdev, avagin
In-Reply-To: <1478092496-7540-3-git-send-email-gorcunov@gmail.com>

On Wed,  2 Nov 2016 16:14:56 +0300
Cyrill Gorcunov <gorcunov@gmail.com> wrote:

> unix, tcp, udp[lite], packet, netlink sockets already support diag
> interface for their collection and killing. Implement support
> for raw sockets.
> 
> Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>

Applied both patches, but needed to remove inet_diag.h since
already updated kernel headers.

^ permalink raw reply

* Re: [PATCH iproute2 1/2] ss: print new tcp_info fields: delivery_rate and app_limited
From: Stephen Hemminger @ 2016-12-01 19:01 UTC (permalink / raw)
  To: Neal Cardwell; +Cc: netdev, Yuchung Cheng, Eric Dumazet, Soheil Hassas Yeganeh
In-Reply-To: <1480616500-16919-1-git-send-email-ncardwell@google.com>

On Thu,  1 Dec 2016 13:21:39 -0500
Neal Cardwell <ncardwell@google.com> wrote:

> Dump the new delivery_rate and delivery_rate_app_limited fields that
> were added to tcp_info in Linux v4.9.
> 
> Example output:
>   pacing_rate 65.7Mbps delivery_rate 62.9Mbps
> 
> And for the application-limited case this looks like:
>   pacing_rate 1031.1Mbps delivery_rate 87.4Mbps app_limited
> 
> Signed-off-by: Neal Cardwell <ncardwell@google.com>
> Signed-off-by: Yuchung Cheng <ycheng@google.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>

Looks good, applied to net-next branch

^ permalink raw reply

* Re: Initial thoughts on TXDP
From: Tom Herbert @ 2016-12-01 19:05 UTC (permalink / raw)
  To: Sowmini Varadhan; +Cc: Linux Kernel Network Developers
In-Reply-To: <20161201135508.GB24547@oracle.com>

On Thu, Dec 1, 2016 at 5:55 AM, Sowmini Varadhan
<sowmini.varadhan@oracle.com> wrote:
> On (11/30/16 14:54), Tom Herbert wrote:
>>
>> Posting for discussion....
>    :
>> One simplifying assumption we might make is that TXDP is primarily for
>> optimizing latency, specifically request/response type operations
>> (think HPC, HFT, flash server, or other tightly coupled communications
>> within the datacenter). Notably, I don't think that saving CPU is as
>> relevant to TXDP, in fact we have already seen that CPU utilization
>> can be traded off for lower latency via spin polling. Similar to XDP
>> though, we might assume that single CPU performance is relevant (i.e.
>> on a cache server we'd like to spin as few CPUs as needed and no more
>> to handle the load an maintain throughput and latency requirements).
>> High throughput (ops/sec) and low variance should be side effects of
>> any design.
>
> I'm sending this with some hesitation (esp as the flamebait threads
> are starting up - I have no interest in getting into food-fights!!),
> because it sounds like the HPC/request-response use-case you have in mind
> (HTTP based?) is very likely different than the one the DB use-cases in
> my environment (RDBMS, Cluster req/responses). But to provide some
> perspective from the latter use-case..
>
> We also have request-response transactions, but CPU utilization
> is extremely critical- many DB operations are highly CPU bound,
> so it's not acceptable for the network to hog CPU util by polling.
> In that sense, the DB req/resp model has a lot of overlap with the
> Suricata use-case.
>
Hi Sowmini,

Polling does not necessarily imply that networking monopolizes the CPU
except when the CPU is otherwise idle. Presumably the application
drives the polling when it is ready to receive work.

> Also we need a select()able socket, because we have to deal with
> input from several sources- network I/O, but also disk, and
> file-system I/O. So need to make sure there is no starvation,
> and that we multiplex between  I/O sources efficiently
>
Yes, that is a requirement.

> and one other critical difference from the hot-potato-forwarding
> model (the sort of OVS model that DPDK etc might aruguably be a fit for)
> does not apply: in order to figure out the ethernet and IP headers
> in the response correctly at all times (in the face of things like VRRP,
> gw changes, gw's mac addr changes etc) the application should really
> be listening on NETLINK sockets for modifications to the networking
> state - again points to needing a select() socket set where you can
> have both the I/O fds and the netlink socket,
>
I would think that that is management would not be implemented in a
fast path processing thread for an application.

> For all of these reasons, we are investigating approaches similar ot
> Suricata- PF_PACKET with TPACKETV2 (since we need both Tx and Rx,
> and so far, tpacketv2 seems "good enough"). FWIW, we also took
> a look at netmap and so far have not seen any significant benefits
> to netmap over pf_packet.. investigation still ongoing.
>
>>   - Call into TCP/IP stack with page data directly from driver-- no
>> skbuff allocation or interface. This is essentially provided by the
>
> I'm curious- one thing that came out of the IPsec evaluation
> is that TSO is very valuable for performance, and this is most easily
> accessed via the sk_buff interfaces.  I have not had a chance
> to review your patches yet, but isnt that an issue if you bypass
> sk_buff usage? But I should probably go and review your patchset..
>
The *SOs are always an interesting question. They make for great
benchmarks, but in real life the amount of benefit is somewhat
unclear. Under the wrong conditions, like all cwnds have collapsed or
received packets for flows are small or so mixed that we can't get
much aggregation, SO provides no benefit and in fact becomes
overhead. Relying on any amount of segmentation offload in real
deployment is risky; for instance we've seen some video servers
deployed that were able to serve line rate at 90% CPU in testing (SO
was effective) but ended up needing 110% CPU in deployment when a
hiccup caused all cwnds to collapse. Moral of the story is provision
your servers assuming the worse case conditions that would render
opportunistic offloads unless.

For the GSO and GRO the rationale is that performing the extra SW
processing to do the offloads is significantly less expensive than
running each packet through the full stack. This is true in a
multi-layered generalized stack. In TXDP, however, we should be able
to optimize the stack data path such that that would no longer be
true. For instance, if we can process the packets received on a
connection quickly enough so that it's about the same or just a little
more costly than GRO processing then we might bypass GRO entirely.
TSO is probably still relevant in TXDP since it reduces overheads
processing TX in the device itself.

Tom

> --Sowmini

^ permalink raw reply

* Re: [PATCH] netfilter: avoid warn and OOM on vmalloc call
From: Marcelo Ricardo Leitner @ 2016-12-01 19:08 UTC (permalink / raw)
  To: Andrey Konovalov
  Cc: Florian Westphal, Neil Horman, netdev, netfilter-devel, LKML
In-Reply-To: <CAAeHK+wjP=h_4YxB6VUc+FjKcZi9igmyTs3nPAuUJeNomYSA0w@mail.gmail.com>

On Thu, Dec 01, 2016 at 10:42:22AM +0100, Andrey Konovalov wrote:
> On Wed, Nov 30, 2016 at 8:42 PM, Marcelo Ricardo Leitner
> <marcelo.leitner@gmail.com> wrote:
> > Hi Andrey,
> >
> > Please let me know how this works for you. It seems good here, though
> > your poc may still trigger OOM through other means.
> 
> Hi Marcelo,
> 
> Don't see any reports with this patch.
> 
> Thanks!

Thanks Andrey.
I'll post a v2 after a few more tests here and to s/OOM/OOM killer/ in
most of the changelog.

> 
> >
> > Thanks,
> > Marcelo
> >
> > ---8<---
> >
> > Andrey Konovalov reported that this vmalloc call is based on an
> > userspace request and that it's spewing traces, which may flood the logs
> > and cause DoS if abused.
> >
> > Florian Westphal also mentioned that this call should not trigger OOM,
> > as kmalloc one is already not triggering it.
> >
> > This patch brings the vmalloc call in sync to kmalloc and disables the
> > warn trace on allocation failure and also disable OOM invocation.
> >
> > Note, however, that under such stress situation, other places may
> > trigger OOM invocation.
> >
> > Reported-by: Andrey Konovalov <andreyknvl@google.com>
> > Cc: Florian Westphal <fw@strlen.de>
> > Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
> > ---
> >  net/netfilter/x_tables.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
> > index fc4977456c30e098197b4f987b758072c9cf60d9..dece525bf83a0098dad607fce665cd0bde228362 100644
> > --- a/net/netfilter/x_tables.c
> > +++ b/net/netfilter/x_tables.c
> > @@ -958,7 +958,9 @@ struct xt_table_info *xt_alloc_table_info(unsigned int size)
> >         if (sz <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
> >                 info = kmalloc(sz, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY);
> >         if (!info) {
> > -               info = vmalloc(sz);
> > +               info = __vmalloc(sz, GFP_KERNEL | __GFP_NOWARN |
> > +                                    __GFP_NORETRY | __GFP_HIGHMEM,
> > +                                PAGE_KERNEL);
> >                 if (!info)
> >                         return NULL;
> >         }
> > --
> > 2.9.3
> >
> 

^ permalink raw reply

* Re: [PATCH net-next iproute2 PATCH 2/2 v2] ss: Add inet raw sockets information gathering via netlink diag interface
From: Cyrill Gorcunov @ 2016-12-01 19:13 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev, avagin
In-Reply-To: <20161201105701.1c54b258@xeon-e3>

On Thu, Dec 01, 2016 at 10:57:01AM -0800, Stephen Hemminger wrote:
> 
> Applied both patches, but needed to remove inet_diag.h since
> already updated kernel headers.

Thank you! I think we might need to extend the matching interface
for killing raw sockets in near future, because for now it is
too wildcard. I put this into my todo list, once I finish with
more urgent tasks will back to this one.

^ permalink raw reply

* Re: [WIP] net+mlx4: auto doorbell
From: Jesper Dangaard Brouer @ 2016-12-01 19:17 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Saeed Mahameed, Rick Jones, Linux Netdev List, Saeed Mahameed,
	Tariq Toukan, brouer
In-Reply-To: <1480611857.18162.319.camel@edumazet-glaptop3.roam.corp.google.com>


On Thu, 01 Dec 2016 09:04:17 -0800 Eric Dumazet <eric.dumazet@gmail.com> wrote:

> BTW, if you are doing tests on mlx4 40Gbit,

I'm mostly testing with mlx5 50Gbit, but I do have 40G NIC in the
machines too.

>  would you check the
> following quick/dirty hack, using lots of low-rate flows ?

What tool should I use to send "low-rate flows"?

And what am I looking for?

> mlx4 has really hard time to transmit small TSO packets (2 or 3 MSS)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> index 12ea3405f442..96940666abd3 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> @@ -2631,6 +2631,11 @@ static void mlx4_en_del_vxlan_port(struct  net_device *dev,
>         queue_work(priv->mdev->workqueue, &priv->vxlan_del_task);
>  }
>  
> +static int mlx4_gso_segs_min = 4; /* TSO packets with less than 4 segments are segmented */
> +module_param_named(mlx4_gso_segs_min, mlx4_gso_segs_min, uint, 0644);
> +MODULE_PARM_DESC(mlx4_gso_segs_min, "threshold for software segmentation of small TSO packets");
> +
> +
>  static netdev_features_t mlx4_en_features_check(struct sk_buff *skb,
>                                                 struct net_device *dev,
>                                                 netdev_features_t features)
> @@ -2651,6 +2656,8 @@ static netdev_features_t mlx4_en_features_check(struct sk_buff *skb,
>                     (udp_hdr(skb)->dest != priv->vxlan_port))
>                         features &= ~(NETIF_F_CSUM_MASK | NETIF_F_GSO_MASK);
>         }
> +       if (skb_is_gso(skb) && skb_shinfo(skb)->gso_segs < mlx4_gso_segs_min)
> +               features &= NETIF_F_GSO_MASK;
>  
>         return features;
>  }
 

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [WIP] net+mlx4: auto doorbell
From: Willem de Bruijn @ 2016-12-01 19:24 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Eric Dumazet, Jesper Dangaard Brouer, Willem de Bruijn,
	Rick Jones, Linux Kernel Network Developers, Saeed Mahameed,
	Tariq Toukan, Achiad Shochat
In-Reply-To: <CALx6S34ENTbUmCGx_4izNHoXbdy5UHuvUesbFGw+8kQSidesEg@mail.gmail.com>

>>> > So we end up with one cpu doing the ndo_start_xmit() and TX completions,
>>> > and RX work.

This problem is somewhat tangential to the doorbell avoidance discussion.

>>> >
>>> > This problem is magnified when XPS is used, if one mono-threaded application deals with
>>> > thousands of TCP sockets.
>>> >
>> We have thousands of applications, and some of them 'kind of multicast'
>> events to a broad number of TCP sockets.
>>
>> Very often, applications writers use a single thread for doing this,
>> when all they need is to send small packets to 10,000 sockets, and they
>> do not really care of doing this very fast. They also do not want to
>> hurt other applications sharing the NIC.
>>
>> Very often, process scheduler will also run this single thread in a
>> single cpu, ie avoiding expensive migrations if they are not needed.
>>
>> Problem is this behavior locks one TX queue for the duration of the
>> multicast, since XPS will force all the TX packets to go to one TX
>> queue.
>>
> The fact that XPS is forcing TX packets to go over one CPU is a result
> of how you've configured XPS. There are other potentially
> configurations that present different tradeoffs.

Right, XPS supports multiple txqueues mappings, using skb_tx_hash
to decide among them. Unfortunately cross-cpu is more expensive
than tx + completion on the same core, so this is suboptimal for
the common case where there is no excessive load imbalance.

> For instance, TX
> queues are plentiful now days to the point that we could map a number
> of queues to each CPU while respecting NUMA locality between the
> sending CPU and where TX completions occur. If the set is sufficiently
> large this would also helps to address device lock contention. The
> other thing I'm wondering is if Willem's concepts distributing DOS
> attacks in RPS might be applicable in XPS. If we could detect that a
> TX queue is "under attack" maybe we can automatically backoff to
> distributing the load to more queues in XPS somehow.

If only targeting states of imbalance, that indeed could work. For the
10,000 socket case, instead of load balancing qdisc servicing, we
could perhaps modify tx queue selection in __netdev_pick_tx to
choose another queue if the the initial choice is paused or otherwise
backlogged.

^ permalink raw reply

* Re: [PATCH] net: asix: Fix AX88772_suspend() USB vendor commands failure issues
From: David Miller @ 2016-12-01 19:35 UTC (permalink / raw)
  To: allan
  Cc: jonathanh, freddy, Dean_Jenkins, Mark_Craske, robert.foss,
	ivecera, john.stultz, vpalatin, stephen, grundler, changchias,
	andrew, tremyfr, colin.king, linux-usb, netdev, linux-kernel,
	vpalatin
In-Reply-To: <00d701d24ae3$d4f4f2a0$7eded7e0$@asix.com.tw>

From: "ASIX_Allan [Office]" <allan@asix.com.tw>
Date: Wed, 30 Nov 2016 16:29:08 +0800

> The change fixes AX88772_suspend() USB vendor commands failure issues.
> 
> Signed-off-by: Allan Chou <allan@asix.com.tw>
> Tested-by: Allan Chou <allan@asix.com.tw>
> Tested-by: Jon Hunter <jonathanh@nvidia.com>

Patch applied, thanks.

^ permalink raw reply

* Re: [net-next] rtnetlink: return the correct error code
From: David Miller @ 2016-12-01 19:39 UTC (permalink / raw)
  To: zhangshengju; +Cc: netdev
In-Reply-To: <1480495054-5114-1-git-send-email-zhangshengju@cmss.chinamobile.com>

From: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Date: Wed, 30 Nov 2016 16:37:34 +0800

> Before this patch, function ndo_dflt_fdb_dump() will always return code
> from uc fdb dump. The reture code of mc fdb dump is lost.
> 
> Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next 2/3] net/act_pedit: Support using offset relative to the conventional network headers
From: David Miller @ 2016-12-01 19:41 UTC (permalink / raw)
  To: amir; +Cc: netdev, jhs, ogerlitz, hadarh
In-Reply-To: <20161130090928.14816-3-amir@vadai.me>

From: Amir Vadai <amir@vadai.me>
Date: Wed, 30 Nov 2016 11:09:27 +0200

> @@ -119,18 +119,45 @@ static bool offset_valid(struct sk_buff *skb, int offset)
>  	return true;
>  }
>  
> +static int pedit_skb_hdr_offset(struct sk_buff *skb,
> +				enum pedit_header_type htype, int *hoffset)
> +{
> +	int ret = -1;
> +
> +	switch (htype) {
> +	case PEDIT_HDR_TYPE_ETH:
> +		if (skb_mac_header_was_set(skb)) {
> +			*hoffset = skb_mac_offset(skb);
> +			ret = 0;
> +		}
> +		break;
> +	case PEDIT_HDR_TYPE_RAW:
> +	case PEDIT_HDR_TYPE_IP4:
> +	case PEDIT_HDR_TYPE_IP6:
> +		*hoffset = skb_network_offset(skb);
> +		ret = 0;
> +		break;
> +	case PEDIT_HDR_TYPE_TCP:
> +	case PEDIT_HDR_TYPE_UDP:
> +		if (skb_transport_header_was_set(skb)) {
> +			*hoffset = skb_transport_offset(skb);
> +			ret = 0;
> +		}
> +		break;
> +	};
> +
> +	return ret;
> +}
> +

The only distinction between the cases is "L2", "L3", and "L4".

Therefore I don't see any reason to break it down into IP4 vs. IP6 vs.
RAW, for example.  They all map to the same thing.

So why not just have PEDIT_HDR_TYPE_L2, PEDIT_HDR_TYPE_L3, and
PEDIT_HDR_TYPE_L4?  It definitely seems more straightforward
and cleaner that way.

Thanks.

^ permalink raw reply

* Re: [PATCH v2] tun: Use netif_receive_skb instead of netif_rx
From: David Miller @ 2016-12-01 19:43 UTC (permalink / raw)
  To: andreyknvl
  Cc: herbert, jasowang, edumazet, pmk, pabeni, mst, soheil, elfring,
	rppt, netdev, linux-kernel, dvyukov, kcc, syzkaller
In-Reply-To: <1480584880-48651-1-git-send-email-andreyknvl@google.com>

From: Andrey Konovalov <andreyknvl@google.com>
Date: Thu,  1 Dec 2016 10:34:40 +0100

> This patch changes tun.c to call netif_receive_skb instead of netif_rx
> when a packet is received (if CONFIG_4KSTACKS is not enabled to avoid
> stack exhaustion). The difference between the two is that netif_rx queues
> the packet into the backlog, and netif_receive_skb proccesses the packet
> in the current context.
> 
> This patch is required for syzkaller [1] to collect coverage from packet
> receive paths, when a packet being received through tun (syzkaller collects
> coverage per process in the process context).
> 
> As mentioned by Eric this change also speeds up tun/tap. As measured by
> Peter it speeds up his closed-loop single-stream tap/OVS benchmark by
> about 23%, from 700k packets/second to 867k packets/second.
> 
> A similar patch was introduced back in 2010 [2, 3], but the author found
> out that the patch doesn't help with the task he had in mind (for cgroups
> to shape network traffic based on the original process) and decided not to
> go further with it. The main concern back then was about possible stack
> exhaustion with 4K stacks.
> 
> [1] https://github.com/google/syzkaller
> 
> [2] https://www.spinics.net/lists/netdev/thrd440.html#130570
> 
> [3] https://www.spinics.net/lists/netdev/msg130570.html
> 
> Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
> ---
> 
> Changes since v1:
> - incorporate Eric's note about speed improvements in commit description
> - use netif_receive_skb CONFIG_4KSTACKS is not enabled

Applied to net-next, thanks!

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox