Netdev List

Netdev List
 help / color / mirror / Atom feed

* linux-next: manual merge of the net-next tree with the arm-soc tree
From: Stephen Rothwell @ 2016-12-01  1:31 UTC (permalink / raw)
  To: David Miller, Networking, Olof Johansson, Arnd Bergmann, ARM
  Cc: Rob Rice, Florian Fainelli, linux-next, linux-kernel, Jon Mason

Hi all,

Today's linux-next merge of the net-next tree got a conflict in:

  arch/arm64/boot/dts/broadcom/ns2.dtsi

between commit:

  e79249143f46 ("arm64: dts: Add Broadcom Northstar2 device tree entries for PDC driver.")

from the arm-soc tree and commit:

  dddc3c9d7d02 ("arm64: dts: NS2: add AMAC ethernet support")

from the net-next tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc arch/arm64/boot/dts/broadcom/ns2.dtsi
index 863503d78f57,773ed593da4d..000000000000
--- a/arch/arm64/boot/dts/broadcom/ns2.dtsi
+++ b/arch/arm64/boot/dts/broadcom/ns2.dtsi
@@@ -197,42 -191,18 +197,54 @@@
  
  		#include "ns2-clock.dtsi"
  
 +		pdc0: iproc-pdc0@612c0000 {
 +			compatible = "brcm,iproc-pdc-mbox";
 +			reg = <0x612c0000 0x445>;  /* PDC FS0 regs */
 +			interrupts = <GIC_SPI 187 IRQ_TYPE_LEVEL_HIGH>;
 +			#mbox-cells = <1>;
 +			brcm,rx-status-len = <32>;
 +			brcm,use-bcm-hdr;
 +		};
 +
 +		pdc1: iproc-pdc1@612e0000 {
 +			compatible = "brcm,iproc-pdc-mbox";
 +			reg = <0x612e0000 0x445>;  /* PDC FS1 regs */
 +			interrupts = <GIC_SPI 189 IRQ_TYPE_LEVEL_HIGH>;
 +			#mbox-cells = <1>;
 +			brcm,rx-status-len = <32>;
 +			brcm,use-bcm-hdr;
 +		};
 +
 +		pdc2: iproc-pdc2@61300000 {
 +			compatible = "brcm,iproc-pdc-mbox";
 +			reg = <0x61300000 0x445>;  /* PDC FS2 regs */
 +			interrupts = <GIC_SPI 191 IRQ_TYPE_LEVEL_HIGH>;
 +			#mbox-cells = <1>;
 +			brcm,rx-status-len = <32>;
 +			brcm,use-bcm-hdr;
 +		};
 +
 +		pdc3: iproc-pdc3@61320000 {
 +			compatible = "brcm,iproc-pdc-mbox";
 +			reg = <0x61320000 0x445>;  /* PDC FS3 regs */
 +			interrupts = <GIC_SPI 193 IRQ_TYPE_LEVEL_HIGH>;
 +			#mbox-cells = <1>;
 +			brcm,rx-status-len = <32>;
 +			brcm,use-bcm-hdr;
 +		};
 +
+ 		enet: ethernet@61000000 {
+ 			compatible = "brcm,ns2-amac";
+ 			reg = <0x61000000 0x1000>,
+ 			      <0x61090000 0x1000>,
+ 			      <0x61030000 0x100>;
+ 			reg-names = "amac_base", "idm_base", "nicpm_base";
+ 			interrupts = <GIC_SPI 341 IRQ_TYPE_LEVEL_HIGH>;
+ 			phy-handle = <&gphy0>;
+ 			phy-mode = "rgmii";
+ 			status = "disabled";
+ 		};
+ 
  		dma0: dma@61360000 {
  			compatible = "arm,pl330", "arm,primecell";
  			reg = <0x61360000 0x1000>;

^ permalink raw reply

* Re: [WIP] net+mlx4: auto doorbell
From: Tom Herbert @ 2016-12-01  1:16 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jesper Dangaard Brouer, Willem de Bruijn, Rick Jones,
	Linux Kernel Network Developers, Saeed Mahameed, Tariq Toukan,
	Achiad Shochat
In-Reply-To: <1480552059.18162.239.camel@edumazet-glaptop3.roam.corp.google.com>

On Wed, Nov 30, 2016 at 4:27 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> Another issue I found during my tests last days, is a problem with BQL,
> and more generally when driver stops/starts the queue.
>
> When under stress and BQL stops the queue, driver TX completion does a
> lot of work, and servicing CPU also takes over further qdisc_run().
>
Hmm, hard to say if this is problem or a feature I think ;-). One way
to "solve" this would be to use IRQ round robin, that would spread the
load of networking across CPUs, but that gives us no additional
parallelism and reduces locality-- it's generally considered a bad
idea. The question might be: is it better to continuously ding one CPU
with all the networking work or try to artificially spread it out?
Note this is orthogonal to MQ also, so we should already have multiple
CPUs doing netif_schedule_queue for queues they manage.

Do you have a test or application that shows this is causing pain?

> The work-flow is :
>
> 1) collect up to 64 (or 256 packets for mlx4) packets from TX ring, and
> unmap things, queue skbs for freeing.
>
> 2) Calls netdev_tx_completed_queue(ring->tx_queue, packets, bytes);
>
> if (test_and_clear_bit(__QUEUE_STATE_STACK_XOFF, &dev_queue->state))
>      netif_schedule_queue(dev_queue);
>
> This leaves a very tiny window where other cpus could grab __QDISC_STATE_SCHED
> (They absolutely have no chance to grab it)
>
> So we end up with one cpu doing the ndo_start_xmit() and TX completions,
> and RX work.
>
> This problem is magnified when XPS is used, if one mono-threaded application deals with
> thousands of TCP sockets.
>
Do you know of an application doing this? The typical way XPS and most
of the other steering mechanisms are configured assume that workloads
tend towards a normal distribution. Such mechanisms can become
problematic under asymmetric loads, but then I would expect these are
likely dedicated servers so that the mechanisms can be tuned
accordingly. For instance, XPS can be configured to map more than one
queue to a CPU. Alternatively, IIRC Windows has some functionality to
tune networking for the load (spin up queues, reconfigure things
similar to XPS/RPS, etc.)-- that's promising up the point that we need
a lot of heuristics and measurement; but then we lose determinism and
risk edge case where we get completely unsatisfied results (one of my
concerns with the recent attempt to put configuration in the kernel).

> We should use an additional bit (__QDISC_STATE_PLEASE_GRAB_ME) or some way
> to allow another cpu to service the qdisc and spare us.
>
Wouldn't this need to be an active operation? That is to queue the
qdisc on another CPU's output_queue?

Tom

>
>

^ permalink raw reply

* Re: [WIP] net+mlx4: auto doorbell
From: Eric Dumazet @ 2016-12-01  0:27 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Tom Herbert, Willem de Bruijn
  Cc: Rick Jones, netdev, Saeed Mahameed, Tariq Toukan, Achiad Shochat
In-Reply-To: <1480534200.18162.203.camel@edumazet-glaptop3.roam.corp.google.com>

Another issue I found during my tests last days, is a problem with BQL,
and more generally when driver stops/starts the queue.

When under stress and BQL stops the queue, driver TX completion does a
lot of work, and servicing CPU also takes over further qdisc_run().

The work-flow is :

1) collect up to 64 (or 256 packets for mlx4) packets from TX ring, and
unmap things, queue skbs for freeing.

2) Calls netdev_tx_completed_queue(ring->tx_queue, packets, bytes); 

if (test_and_clear_bit(__QUEUE_STATE_STACK_XOFF, &dev_queue->state))
     netif_schedule_queue(dev_queue);

This leaves a very tiny window where other cpus could grab __QDISC_STATE_SCHED
(They absolutely have no chance to grab it)

So we end up with one cpu doing the ndo_start_xmit() and TX completions,
and RX work.

This problem is magnified when XPS is used, if one mono-threaded application deals with
thousands of TCP sockets.

We should use an additional bit (__QDISC_STATE_PLEASE_GRAB_ME) or some way
to allow another cpu to service the qdisc and spare us.

^ permalink raw reply

* Re: [PATCH net] fib_trie: Avoid expensive update of suffix length when not required
From: Robert Shearman @ 2016-12-01  0:27 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: David Miller, Netdev, Alexander Duyck
In-Reply-To: <CAKgT0UcK-+t3DK=TtxY14y6XB8nqtgdv57E53QkOAjmdB36-vw@mail.gmail.com>

On 29/11/16 23:14, Alexander Duyck wrote:
> On Tue, Nov 29, 2016 at 9:46 AM, Robert Shearman <rshearma@brocade.com> wrote:
>> With certain distributions of routes it can take a long time to add
>> and delete further routes at scale. For example, with a route for
>> 10.37.96.0/20 present it takes 47s to add ~200k contiguous /24 routes
>> from 8.0.0.0/24 through to 11.138.207.0/24. Perf shows the bottleneck
>> is update_suffix:
>>
>>       40.39%  [kernel]                      [k] update_suffix
>>        8.02%  libnl-3.so.200.19.0           [.] nl_hash_table_lookup
>
> Well yeah, it will be expensive when you have over 512K entries in a
> single node.  I'm assuming that is the case based on the fact that
> 200K routes will double up in the trie node due to broadcast and the
> route ending up in adjacent key vectors.

The example scenario was in fact using a large scale of just routes 
rather than addresses so there are no broadcast leafs (nor local leafs):

         +-- 8.0.0.0/6 18 2 52436 suffix/20
            |-- 8.0.0.0
               /24 universe UNICAST
            |-- 8.0.1.0
               /24 universe UNICAST
            |-- 8.0.2.0
               /24 universe UNICAST

(the "suffix/20" is for test purposes as per below). In this case the 
8.0.0.0/6 node has a child array of size 2^18 = 262144.

>
>> With these changes, the time is reduced to 4s for the same scale and
>> distribution of routes.
>>
>> The issue is that update_suffix does an O(n) walk on the children of a
>> node and the with a dense distribtion of routes the number of children
>> in a node tends towards the number of nodes in the tree.
>
> Other than the performance I was just wondering if you did any other
> validation to confirm that nothing else differs.  Basically it would
> just be a matter of verifying that /proc/net/fib_trie is the same
> before and after your patch.

Yes, to verify these changes I applied some local changes to dump the 
slen field of trie nodes from /proc/net/fib_trie. I presumed that the 
format of /proc/net/fib_trie is a public API that we can't break so I 
intend to submit this as a patch. I verified by inspection that the 
suffix length listed in the nodes was as expected when exercising the 
insert and remove functions for routes with both larger and smaller 
suffixes than in the current nodes, and then for the inflate, halve and 
flush cases.

I've now verified that a diff of /proc/net/fib_trie before and after my 
patch with all the routes added of my example scenario shows no changes.

>
...
>> Fixes: 5405afd1a306 ("fib_trie: Add tracking value for suffix length")
>> Signed-off-by: Robert Shearman <rshearma@brocade.com>
>
> So I am not entirely convinced this is a regression, I was wondering
> what your numbers looked like before the patch you are saying this
> fixes?  I recall I fixed a number of issues leading up to this so I am
> just curious.

At commit 21d1f11db0e2f20a549ad8322879507850544670 ("fib_trie: Remove 
checks for index >= tnode_child_length from tnode_get_child") which is 
the parent of 5405afd1a306:

$ time sudo ip route restore < ~/allroutes
RTNETLINK answers: File exists
RTNETLINK answers: File exists
RTNETLINK answers: File exists
RTNETLINK answers: File exists

real	0m3.451s
user	0m0.184s
sys	0m2.004s


At commit 5405afd1a30620de8601436bae739c67c0bc7324 ("fib_trie: Add 
tracking value for suffix length"):

$ time sudo ip route restore < ~/allroutes
RTNETLINK answers: File exists
RTNETLINK answers: File exists
RTNETLINK answers: File exists
RTNETLINK answers: File exists

real	0m48.624s
user	0m0.360s
sys	0m46.960s

So does that warrant a fixes tag for this patch?

>
> My thought is this seems more like a performance optimization than a
> fix.  If that is the case this probably belongs in net-next.
>
> It seems to me we might be able to simplify update_suffix rather than
> going through all this change.  That might be something that is more
> acceptable for net.  Have you looked at simply adding code so that
> there is a break inside update_suffix if (slen == tn->slen)?  We don't
> have to call it for if a leaf is added so it might make sense to add
> that check.

That doesn't really help since the search always starts from the 
smallest suffix length and thus could potentially require visiting a 
large number of children before finding the node that makes slen == 
tn->slen.

In the example above, 10.37.96.0/20 would be child number 140640 in node 
8.0.0.0/6 18. Since the loop starts out with stride = 2 this means that 
it requires 70320 iterations round to find 10.37.96.0/20 which 
contributes the largest suffix length to the node.

Now it may be possible to improve the algorithm by starting the search 
from the largest suffix length towards the smallest suffix length 
instead of the current smallest to largest, but there would still be 
distributions of routes that would lead to needing to visit a large 
number of nodes only to find that nothing has changed.

>
>> ---
>>  net/ipv4/fib_trie.c | 86 ++++++++++++++++++++++++++++++++++++++---------------
>>  1 file changed, 62 insertions(+), 24 deletions(-)
>>
>> diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
>> index 026f309c51e9..701cae8af44a 100644
>> --- a/net/ipv4/fib_trie.c
>> +++ b/net/ipv4/fib_trie.c
>> @@ -421,8 +421,22 @@ static inline int tnode_full(struct key_vector *tn, struct key_vector *n)
>>         return n && ((n->pos + n->bits) == tn->pos) && IS_TNODE(n);
>>  }
>>
>> +static void node_push_suffix(struct key_vector *tn, struct key_vector *l)
>> +{
>> +       while (tn->slen < l->slen) {
>> +               tn->slen = l->slen;
>> +               tn = node_parent(tn);
>> +               if (!tn)
>> +                       break;
>
> I don't think the NULL check is necessary, at least it wasn't with the old code.

It wasn't necessary before because the root node had the largest 
possible suffix length of KEYLENGTH. This isn't the case for the nodes 
this is now called on since they're either nodes without parents or 
sub-tries that end up in node without a parent, where they're 
initialised to have the smallest possible suffix length for the node 
(equal to their position).

>
>> +       }
>> +}
>> +
>>  /* Add a child at position i overwriting the old value.
>>   * Update the value of full_children and empty_children.
>> + *
>> + * The suffix length of the parent node and the rest of the tree is
>> + * updated (if required) when adding/replacing a node, but is caller's
>> + * responsibility on removal.
>>   */
>>  static void put_child(struct key_vector *tn, unsigned long i,
>>                       struct key_vector *n)
>> @@ -447,8 +461,8 @@ static void put_child(struct key_vector *tn, unsigned long i,
>>         else if (!wasfull && isfull)
>>                 tn_info(tn)->full_children++;
>>
>> -       if (n && (tn->slen < n->slen))
>> -               tn->slen = n->slen;
>> +       if (n)
>> +               node_push_suffix(tn, n);
>
> This is just creating redundant work if we have to perform a resize.

The only real redundant work is the assignment of slen in tn, since the 
propagation up the trie has to happen regardless and if a subsequent 
resize results in changes to the trie then the propagation will stop at 
tn's parent since the suffix length of the tn's parent will not be less 
than tn's suffix length.

>
>>         rcu_assign_pointer(tn->tnode[i], n);
>>  }
...
>> -static void leaf_pull_suffix(struct key_vector *tp, struct key_vector *l)
>> +static void node_set_suffix(struct key_vector *tp, unsigned char slen)
>>  {
>> -       while ((tp->slen > tp->pos) && (tp->slen > l->slen)) {
>> -               if (update_suffix(tp) > l->slen)
>> +       if (slen > tp->slen)
>> +               tp->slen = slen;
>> +}
>> +
>> +static void node_pull_suffix(struct key_vector *tn)
>> +{
>> +       struct key_vector *tp;
>> +       unsigned char slen;
>> +
>> +       slen = update_suffix(tn);
>> +       tp = node_parent(tn);
>> +       while ((tp->slen > tp->pos) && (tp->slen > slen)) {
>> +               if (update_suffix(tp) > slen)
>>                         break;
>>                 tp = node_parent(tp);
>>         }
>>  }
>>
>
> I really hate all the renaming.  Can't you just leave these as
> leaf_pull_suffix and leaf_push _suffix.  I'm fine with us dropping the
> leaf of the suffix but the renaming just makes adds a bunch of noise.
> Really it might work better if you broke the replacement of the
> key_vector as a leaf with the suffix length into a separate patch,
> then you could do the rename as a part of that.

Ok, I'll leave the naming of leaf_push_suffix alone. Note that 
leaf_pull_suffix hasn't been renamed, the below in the diff is just an 
artifact of how diff decided to present the changes along with the 
moving of leaf_push_suffix.

>
>> -static void leaf_push_suffix(struct key_vector *tn, struct key_vector *l)
>> +static void leaf_pull_suffix(struct key_vector *tp, struct key_vector *l)
>>  {
>> -       /* if this is a new leaf then tn will be NULL and we can sort
>> -        * out parent suffix lengths as a part of trie_rebalance
>> -        */
>> -       while (tn->slen < l->slen) {
>> -               tn->slen = l->slen;
>> -               tn = node_parent(tn);
>> +       while ((tp->slen > tp->pos) && (tp->slen > l->slen)) {
>> +               if (update_suffix(tp) > l->slen)
>> +                       break;
>> +               tp = node_parent(tp);
>>         }
>>  }
>
> If possible it would work better if you could avoid moving functions
> around as a result of your changes.

Ok, I can add a forward declaration instead.

>
>> @@ -1107,7 +1122,7 @@ static int fib_insert_alias(struct trie *t, struct key_vector *tp,
>>         /* if we added to the tail node then we need to update slen */
>>         if (l->slen < new->fa_slen) {
>>                 l->slen = new->fa_slen;
>> -               leaf_push_suffix(tp, l);
>> +               node_push_suffix(tp, l);
>>         }
>>
>>         return 0;
...
>> @@ -1783,6 +1801,16 @@ void fib_table_flush_external(struct fib_table *tb)
>>                         if (IS_TRIE(pn))
>>                                 break;
>>
>> +                       /* push the suffix length to the parent node,
>> +                        * since a previous leaf removal may have
>> +                        * caused it to change
>> +                        */
>> +                       if (pn->slen > pn->pos) {
>> +                               unsigned char slen = update_suffix(pn);
>> +
>> +                               node_set_suffix(node_parent(pn), slen);
>> +                       }
>> +
>
> Why bother with the local variable?  You could just pass
> update_suffix(pn) as the parameter to your node_set_suffix function.

To avoid a long line on the node_set_suffix call or splitting it across 
lines, but I'll remove the local variable as you suggest and the same below.

>
>>                         /* resize completed node */
>>                         pn = resize(t, pn);
>>                         cindex = get_index(pkey, pn);
>> @@ -1849,6 +1877,16 @@ int fib_table_flush(struct net *net, struct fib_table *tb)
>>                         if (IS_TRIE(pn))
>>                                 break;
>>
>> +                       /* push the suffix length to the parent node,
>> +                        * since a previous leaf removal may have
>> +                        * caused it to change
>> +                        */
>> +                       if (pn->slen > pn->pos) {
>> +                               unsigned char slen = update_suffix(pn);
>> +
>> +                               node_set_suffix(node_parent(pn), slen);
>> +                       }
>> +
>
> Same here.
>
>>                         /* resize completed node */
>>                         pn = resize(t, pn);
>>                         cindex = get_index(pkey, pn);
>> --
>> 2.1.4
>>

Thanks,
Rob

^ permalink raw reply

* Re: [PATCH net] tipc: check minimum bearer MTU
From: Ben Hutchings @ 2016-12-01  0:05 UTC (permalink / raw)
  To: Michal Kubecek, Jon Maloy, Ying Xue
  Cc: David S. Miller, tipc-discussion, netdev, linux-kernel,
	Qian Zhang
In-Reply-To: <20161130102424.GC19119@unicorn.suse.cz>

[-- Attachment #1: Type: text/plain, Size: 1488 bytes --]

On Wed, 2016-11-30 at 11:24 +0100, Michal Kubecek wrote:
> On Wed, Nov 30, 2016 at 10:57:02AM +0100, Michal Kubecek wrote:
> > Qian Zhang (张谦) reported a potential socket buffer overflow in
> > tipc_msg_build() which is also known as CVE-2016-8632: due to
> > insufficient checks, a buffer overflow can occur if MTU is too short for
> > even tipc headers. As anyone can set device MTU in a user/net namespace,
> > this issue can be abused by a regular user.
> > 
> > As agreed in the discussion on Ben Hutchings' original patch, we should
> > check the MTU at the moment a bearer is attached rather than for each
> > processed packet. We also need to repeat the check when bearer MTU is
> > adjusted to new device MTU. UDP case also needs a check to avoid
> > overflow when calculating bearer MTU.
> > 
> > Fixes: b97bf3fd8f6a ("[TIPC] Initial merge")
> > Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
> > Reported-by: Qian Zhang (张谦) <zhangqian-c@360.cn>
> 
> Self-NACK.
> 
> Im sorry, while testing this, I overlooked that an attempt to change
> MTU of an underlying device to low value issues a warning but it
> succeeds anyway.
[...]

I'm not sure that TIPC should block the MTU change, anyway.  For IPv4
and IPv6 we disable the protocol on a device if its MTU is reduced
below the minimum.  I think TIPC should behave the same way.

Ben.

-- 
Ben Hutchings
Never attribute to conspiracy what can adequately be explained by
stupidity.


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Package has been sent.
From: Linda Guo @ 2016-11-30 23:52 UTC (permalink / raw)


[-- Attachment #1: Type: text/plain, Size: 1416 bytes --]

Your shipment(s) is scheduled for delivery

Scheduled Delivery Date: 12/02/2016

Shipper: Chambers Group.

Kindly view the attached document for shipment/delivery details and
tracking procedure. You can also request a delivery change (e.g.
reschedule or reroute) from the tracking detail.

Approximate Delivery Time: between 3:00 PM and 7:00 PM DHL Service:
DHL 2nd Day Air.

We are pleased to provide you with delivery that fits your life.
----------------------------------------------------------------------------------------------------------------------------------------------------------

© 2014 Parcel Service of the World. DHL, the DHL brandmark, and the
color brown are trademarks of DHL Service of America, Inc. All rights
reserved. All trademarks, trade names, or service marks that appear in
connection with DHL services are the property of their respective
owners. For more information on DHL's privacy practices, refer to the
DHL Privacy Notice. Please do not reply directly to this e-mail. DHL
will not receive any reply message. For questions or comments, contact
DHL.

Privacy Notice: This communication contains proprietary information
and may be confidential. If you are not the intended recipient, the
reading, copying, disclosure or other use of the contents of this
e-mail is strictly prohibited and you are instructed to please delete
this e-mail immediately.

[-- Attachment #2: DHL.pdf --]
[-- Type: application/pdf, Size: 54297 bytes --]

^ permalink raw reply

* [PATCH net-next v3] ipv6 addrconf: Implemented enhanced DAD (RFC7527)
From: Erik Nordmark @ 2016-11-30 23:39 UTC (permalink / raw)
  To: davem; +Cc: netdev, hannes, Erik Nordmark, Bob Gilligan

Implemented RFC7527 Enhanced DAD.
IPv6 duplicate address detection can fail if there is some temporary
loopback of Ethernet frames. RFC7527 solves this by including a random
nonce in the NS messages used for DAD, and if an NS is received with the
same nonce it is assumed to be a looped back DAD probe and is ignored.
RFC7527 is enabled by default. Can be disabled by setting both of
conf/{all,interface}/enhanced_dad to zero.

Signed-off-by: Erik Nordmark <nordmark@arista.com>
Signed-off-by: Bob Gilligan <gilligan@arista.com>
---

v2: renamed sysctl and made it default to true, plus minor code review fixes
v3: respun with later net-next; fixed whitespace issues

 Documentation/networking/ip-sysctl.txt |  9 +++++++++
 include/linux/ipv6.h                   |  1 +
 include/net/if_inet6.h                 |  1 +
 include/net/ndisc.h                    |  5 ++++-
 include/uapi/linux/ipv6.h              |  1 +
 net/ipv6/addrconf.c                    | 22 +++++++++++++++++++++-
 net/ipv6/ndisc.c                       | 31 ++++++++++++++++++++++++++++---
 7 files changed, 65 insertions(+), 5 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 5af48dd..d9ef566 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -1729,6 +1729,15 @@ drop_unsolicited_na - BOOLEAN
 
 	By default this is turned off.
 
+enhanced_dad - BOOLEAN
+	Include a nonce option in the IPv6 neighbor solicitation messages used for
+	duplicate address detection per RFC7527. A received DAD NS will only signal
+	a duplicate address if the nonce is different. This avoids any false
+	detection of duplicates due to loopback of the NS messages that we send.
+	The nonce option will be sent on an interface unless both of
+	conf/{all,interface}/enhanced_dad are set to FALSE.
+	Default: TRUE
+
 icmp/*:
 ratelimit - INTEGER
 	Limit the maximal rates for sending ICMPv6 packets.
diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
index 3f95233..671d014 100644
--- a/include/linux/ipv6.h
+++ b/include/linux/ipv6.h
@@ -68,6 +68,7 @@ struct ipv6_devconf {
 #ifdef CONFIG_IPV6_SEG6_HMAC
 	__s32		seg6_require_hmac;
 #endif
+	__u32		enhanced_dad;
 
 	struct ctl_table_header *sysctl_header;
 };
diff --git a/include/net/if_inet6.h b/include/net/if_inet6.h
index b0576cb..0fa4c32 100644
--- a/include/net/if_inet6.h
+++ b/include/net/if_inet6.h
@@ -55,6 +55,7 @@ struct inet6_ifaddr {
 	__u8			stable_privacy_retry;
 
 	__u16			scope;
+	__u64			dad_nonce;
 
 	unsigned long		cstamp;	/* created timestamp */
 	unsigned long		tstamp; /* updated timestamp */
diff --git a/include/net/ndisc.h b/include/net/ndisc.h
index be1fe228..d562a2f 100644
--- a/include/net/ndisc.h
+++ b/include/net/ndisc.h
@@ -31,6 +31,7 @@ enum {
 	ND_OPT_PREFIX_INFO = 3,		/* RFC2461 */
 	ND_OPT_REDIRECT_HDR = 4,	/* RFC2461 */
 	ND_OPT_MTU = 5,			/* RFC2461 */
+	ND_OPT_NONCE = 14,              /* RFC7527 */
 	__ND_OPT_ARRAY_MAX,
 	ND_OPT_ROUTE_INFO = 24,		/* RFC4191 */
 	ND_OPT_RDNSS = 25,		/* RFC5006 */
@@ -121,6 +122,7 @@ struct ndisc_options {
 #define nd_opts_pi_end			nd_opt_array[__ND_OPT_PREFIX_INFO_END]
 #define nd_opts_rh			nd_opt_array[ND_OPT_REDIRECT_HDR]
 #define nd_opts_mtu			nd_opt_array[ND_OPT_MTU]
+#define nd_opts_nonce			nd_opt_array[ND_OPT_NONCE]
 #define nd_802154_opts_src_lladdr	nd_802154_opt_array[ND_OPT_SOURCE_LL_ADDR]
 #define nd_802154_opts_tgt_lladdr	nd_802154_opt_array[ND_OPT_TARGET_LL_ADDR]
 
@@ -398,7 +400,8 @@ static inline struct neighbour *__ipv6_neigh_lookup(struct net_device *dev, cons
 int ndisc_rcv(struct sk_buff *skb);
 
 void ndisc_send_ns(struct net_device *dev, const struct in6_addr *solicit,
-		   const struct in6_addr *daddr, const struct in6_addr *saddr);
+		   const struct in6_addr *daddr, const struct in6_addr *saddr,
+		   u64 nonce);
 
 void ndisc_send_rs(struct net_device *dev,
 		   const struct in6_addr *saddr, const struct in6_addr *daddr);
diff --git a/include/uapi/linux/ipv6.h b/include/uapi/linux/ipv6.h
index 53561be..eaf65dc 100644
--- a/include/uapi/linux/ipv6.h
+++ b/include/uapi/linux/ipv6.h
@@ -181,6 +181,7 @@ enum {
 	DEVCONF_RTR_SOLICIT_MAX_INTERVAL,
 	DEVCONF_SEG6_ENABLED,
 	DEVCONF_SEG6_REQUIRE_HMAC,
+	DEVCONF_ENHANCED_DAD,
 	DEVCONF_MAX
 };
 
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 4c387dc..c1e124b 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -242,6 +242,7 @@ static bool ipv6_chk_same_addr(struct net *net, const struct in6_addr *addr,
 #ifdef CONFIG_IPV6_SEG6_HMAC
 	.seg6_require_hmac	= 0,
 #endif
+	.enhanced_dad           = 1,
 };
 
 static struct ipv6_devconf ipv6_devconf_dflt __read_mostly = {
@@ -292,6 +293,7 @@ static bool ipv6_chk_same_addr(struct net *net, const struct in6_addr *addr,
 #ifdef CONFIG_IPV6_SEG6_HMAC
 	.seg6_require_hmac	= 0,
 #endif
+	.enhanced_dad           = 1,
 };
 
 /* Check if a valid qdisc is available */
@@ -3735,12 +3737,21 @@ static void addrconf_dad_kick(struct inet6_ifaddr *ifp)
 {
 	unsigned long rand_num;
 	struct inet6_dev *idev = ifp->idev;
+	u64 nonce;
 
 	if (ifp->flags & IFA_F_OPTIMISTIC)
 		rand_num = 0;
 	else
 		rand_num = prandom_u32() % (idev->cnf.rtr_solicit_delay ? : 1);
 
+	nonce = 0;
+	if (idev->cnf.enhanced_dad ||
+	    dev_net(idev->dev)->ipv6.devconf_all->enhanced_dad) {
+		do
+			get_random_bytes(&nonce, 6);
+		while (nonce == 0);
+	}
+	ifp->dad_nonce = nonce;
 	ifp->dad_probes = idev->cnf.dad_transmits;
 	addrconf_mod_dad_work(ifp, rand_num);
 }
@@ -3918,7 +3929,8 @@ static void addrconf_dad_work(struct work_struct *w)
 
 	/* send a neighbour solicitation for our addr */
 	addrconf_addr_solict_mult(&ifp->addr, &mcaddr);
-	ndisc_send_ns(ifp->idev->dev, &ifp->addr, &mcaddr, &in6addr_any);
+	ndisc_send_ns(ifp->idev->dev, &ifp->addr, &mcaddr, &in6addr_any,
+		      ifp->dad_nonce);
 out:
 	in6_ifa_put(ifp);
 	rtnl_unlock();
@@ -4962,6 +4974,7 @@ static inline void ipv6_store_devconf(struct ipv6_devconf *cnf,
 #ifdef CONFIG_IPV6_SEG6_HMAC
 	array[DEVCONF_SEG6_REQUIRE_HMAC] = cnf->seg6_require_hmac;
 #endif
+	array[DEVCONF_ENHANCED_DAD] = cnf->enhanced_dad;
 }
 
 static inline size_t inet6_ifla6_size(void)
@@ -6070,6 +6083,13 @@ int addrconf_sysctl_ignore_routes_with_linkdown(struct ctl_table *ctl,
 	},
 #endif
 	{
+		.procname       = "enhanced_dad",
+		.data           = &ipv6_devconf.enhanced_dad,
+		.maxlen         = sizeof(int),
+		.mode           = 0644,
+		.proc_handler   = proc_dointvec,
+	},
+	{
 		/* sentinel */
 	}
 };
diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c
index d8e6714..eb35f73 100644
--- a/net/ipv6/ndisc.c
+++ b/net/ipv6/ndisc.c
@@ -233,6 +233,7 @@ struct ndisc_options *ndisc_parse_options(const struct net_device *dev,
 		case ND_OPT_SOURCE_LL_ADDR:
 		case ND_OPT_TARGET_LL_ADDR:
 		case ND_OPT_MTU:
+		case ND_OPT_NONCE:
 		case ND_OPT_REDIRECT_HDR:
 			if (ndopts->nd_opt_array[nd_opt->nd_opt_type]) {
 				ND_PRINTK(2, warn,
@@ -568,7 +569,8 @@ static void ndisc_send_unsol_na(struct net_device *dev)
 }
 
 void ndisc_send_ns(struct net_device *dev, const struct in6_addr *solicit,
-		   const struct in6_addr *daddr, const struct in6_addr *saddr)
+		   const struct in6_addr *daddr, const struct in6_addr *saddr,
+		   u64 nonce)
 {
 	struct sk_buff *skb;
 	struct in6_addr addr_buf;
@@ -588,6 +590,8 @@ void ndisc_send_ns(struct net_device *dev, const struct in6_addr *solicit,
 	if (inc_opt)
 		optlen += ndisc_opt_addr_space(dev,
 					       NDISC_NEIGHBOUR_SOLICITATION);
+	if (nonce != 0)
+		optlen += 8;
 
 	skb = ndisc_alloc_skb(dev, sizeof(*msg) + optlen);
 	if (!skb)
@@ -605,6 +609,13 @@ void ndisc_send_ns(struct net_device *dev, const struct in6_addr *solicit,
 		ndisc_fill_addr_option(skb, ND_OPT_SOURCE_LL_ADDR,
 				       dev->dev_addr,
 				       NDISC_NEIGHBOUR_SOLICITATION);
+	if (nonce != 0) {
+		u8 *opt = skb_put(skb, 8);
+
+		opt[0] = ND_OPT_NONCE;
+		opt[1] = 8 >> 3;
+		memcpy(opt + 2, &nonce, 6);
+	}
 
 	ndisc_send_skb(skb, daddr, saddr);
 }
@@ -693,12 +704,12 @@ static void ndisc_solicit(struct neighbour *neigh, struct sk_buff *skb)
 				  "%s: trying to ucast probe in NUD_INVALID: %pI6\n",
 				  __func__, target);
 		}
-		ndisc_send_ns(dev, target, target, saddr);
+		ndisc_send_ns(dev, target, target, saddr, 0);
 	} else if ((probes -= NEIGH_VAR(neigh->parms, APP_PROBES)) < 0) {
 		neigh_app_ns(neigh);
 	} else {
 		addrconf_addr_solict_mult(target, &mcaddr);
-		ndisc_send_ns(dev, target, &mcaddr, saddr);
+		ndisc_send_ns(dev, target, &mcaddr, saddr, 0);
 	}
 }
 
@@ -742,6 +753,7 @@ static void ndisc_recv_ns(struct sk_buff *skb)
 	int dad = ipv6_addr_any(saddr);
 	bool inc;
 	int is_router = -1;
+	u64 nonce = 0;
 
 	if (skb->len < sizeof(struct nd_msg)) {
 		ND_PRINTK(2, warn, "NS: packet too short\n");
@@ -786,6 +798,8 @@ static void ndisc_recv_ns(struct sk_buff *skb)
 			return;
 		}
 	}
+	if (ndopts.nd_opts_nonce)
+		memcpy(&nonce, (u8 *)(ndopts.nd_opts_nonce + 1), 6);
 
 	inc = ipv6_addr_is_multicast(daddr);
 
@@ -794,6 +808,17 @@ static void ndisc_recv_ns(struct sk_buff *skb)
 have_ifp:
 		if (ifp->flags & (IFA_F_TENTATIVE|IFA_F_OPTIMISTIC)) {
 			if (dad) {
+				if (nonce != 0 && ifp->dad_nonce == nonce) {
+					u8 *np = (u8 *)&nonce;
+					/* Matching nonce if looped back */
+					ND_PRINTK(2, notice,
+						  "%s: IPv6 DAD loopback for address %pI6c nonce %02x:%02x:%02x:%02x:%02x:%02x ignored\n",
+						  ifp->idev->dev->name,
+						  &ifp->addr,
+						  np[0], np[1], np[2], np[3],
+						  np[4], np[5]);
+					goto out;
+				}
 				/*
 				 * We are colliding with another node
 				 * who is doing DAD
-- 
1.8.1.4

^ permalink raw reply related

* Re: [PATCH net-next v4 4/4] bpf: Add tests and samples for LWT-BPF
From: Alexei Starovoitov @ 2016-11-30 23:46 UTC (permalink / raw)
  To: Thomas Graf; +Cc: davem, netdev, daniel, tom, roopa, hannes
In-Reply-To: <81603830e3684a9a48d5c18687e47a9a0df6b1bf.1480522144.git.tgraf@suug.ch>

On Wed, Nov 30, 2016 at 05:10:11PM +0100, Thomas Graf wrote:
> Adds a series of tests to verify the functionality of attaching
> BPF programs at LWT hooks.
> 
> Also adds a sample which collects a histogram of packet sizes which
> pass through an LWT hook.
> 
> $ ./lwt_len_hist.sh
> Starting netserver with host 'IN(6)ADDR_ANY' port '12865' and family AF_UNSPEC
> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.253.2 () port 0 AF_INET : demo
> Recv   Send    Send
> Socket Socket  Message  Elapsed
> Size   Size    Size     Time     Throughput
> bytes  bytes   bytes    secs.    10^6bits/sec
> 
>  87380  16384  16384    10.00    39857.69
>        1 -> 1        : 0        |                                      |
>        2 -> 3        : 0        |                                      |
>        4 -> 7        : 0        |                                      |
>        8 -> 15       : 0        |                                      |
>       16 -> 31       : 0        |                                      |
>       32 -> 63       : 22       |                                      |
>       64 -> 127      : 98       |                                      |
>      128 -> 255      : 213      |                                      |
>      256 -> 511      : 1444251  |********                              |
>      512 -> 1023     : 660610   |***                                   |
>     1024 -> 2047     : 535241   |**                                    |
>     2048 -> 4095     : 19       |                                      |
>     4096 -> 8191     : 180      |                                      |
>     8192 -> 16383    : 5578023  |************************************* |
>    16384 -> 32767    : 632099   |***                                   |
>    32768 -> 65535    : 6575     |                                      |
> 
> Signed-off-by: Thomas Graf <tgraf@suug.ch>
> ---
>  samples/bpf/Makefile            |   4 +
>  samples/bpf/bpf_helpers.h       |   4 +
>  samples/bpf/lwt_len_hist.sh     |  37 ++++
>  samples/bpf/lwt_len_hist_kern.c |  82 +++++++++
>  samples/bpf/lwt_len_hist_user.c |  76 ++++++++
>  samples/bpf/test_lwt_bpf.c      | 253 +++++++++++++++++++++++++
>  samples/bpf/test_lwt_bpf.sh     | 399 ++++++++++++++++++++++++++++++++++++++++
>  7 files changed, 855 insertions(+)
>  create mode 100755 samples/bpf/lwt_len_hist.sh
>  create mode 100644 samples/bpf/lwt_len_hist_kern.c
>  create mode 100644 samples/bpf/lwt_len_hist_user.c
>  create mode 100644 samples/bpf/test_lwt_bpf.c
>  create mode 100755 samples/bpf/test_lwt_bpf.sh
...
> +static inline void __fill_garbage(struct __sk_buff *skb)
> +{
> +	uint64_t f = 0xFFFFFFFFFFFFFFFF;
> +
> +	bpf_skb_store_bytes(skb, 0, &f, sizeof(f), 0);
> +	bpf_skb_store_bytes(skb, 8, &f, sizeof(f), 0);
> +	bpf_skb_store_bytes(skb, 16, &f, sizeof(f), 0);
> +	bpf_skb_store_bytes(skb, 24, &f, sizeof(f), 0);
> +	bpf_skb_store_bytes(skb, 32, &f, sizeof(f), 0);
> +	bpf_skb_store_bytes(skb, 40, &f, sizeof(f), 0);
> +	bpf_skb_store_bytes(skb, 48, &f, sizeof(f), 0);
> +	bpf_skb_store_bytes(skb, 56, &f, sizeof(f), 0);
> +	bpf_skb_store_bytes(skb, 64, &f, sizeof(f), 0);
> +	bpf_skb_store_bytes(skb, 72, &f, sizeof(f), 0);
> +	bpf_skb_store_bytes(skb, 80, &f, sizeof(f), 0);
> +	bpf_skb_store_bytes(skb, 88, &f, sizeof(f), 0);
> +}
> +
> +SEC("fill_garbage")
> +int do_fill_garbage(struct __sk_buff *skb)
> +{
> +	__fill_garbage(skb);
> +	printk("Set initial 96 bytes of header to FF\n");
> +	return BPF_OK;
> +}
> +
> +SEC("fill_garbage_and_redirect")
> +int do_fill_garbage_and_redirect(struct __sk_buff *skb)
> +{
> +	int ifindex = DST_IFINDEX;
> +	__fill_garbage(skb);
> +	printk("redirected to %d\n", ifindex);
> +	return bpf_redirect(ifindex, 0);
> +}

Two 'garbage' tests! Pure garbage ;)

Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply

* Re: [PATCH net-next v4 3/4] bpf: BPF for lightweight tunnel infrastructure
From: Alexei Starovoitov @ 2016-11-30 23:43 UTC (permalink / raw)
  To: Thomas Graf; +Cc: davem, netdev, daniel, tom, roopa, hannes
In-Reply-To: <950bbc4f82150683dd87e26dbd41412c26a38eba.1480522144.git.tgraf@suug.ch>

On Wed, Nov 30, 2016 at 05:10:10PM +0100, Thomas Graf wrote:
> Registers new BPF program types which correspond to the LWT hooks:
>   - BPF_PROG_TYPE_LWT_IN   => dst_input()
>   - BPF_PROG_TYPE_LWT_OUT  => dst_output()
>   - BPF_PROG_TYPE_LWT_XMIT => lwtunnel_xmit()
> 
> The separate program types are required to differentiate between the
> capabilities each LWT hook allows:
> 
>  * Programs attached to dst_input() or dst_output() are restricted and
>    may only read the data of an skb. This prevent modification and
>    possible invalidation of already validated packet headers on receive
>    and the construction of illegal headers while the IP headers are
>    still being assembled.
> 
>  * Programs attached to lwtunnel_xmit() are allowed to modify packet
>    content as well as prepending an L2 header via a newly introduced
>    helper bpf_skb_change_head(). This is safe as lwtunnel_xmit() is
>    invoked after the IP header has been assembled completely.
> 
> All BPF programs receive an skb with L3 headers attached and may return
> one of the following error codes:
> 
>  BPF_OK - Continue routing as per nexthop
>  BPF_DROP - Drop skb and return EPERM
>  BPF_REDIRECT - Redirect skb to device as per redirect() helper.
>                 (Only valid in lwtunnel_xmit() context)
> 
> The return codes are binary compatible with their TC_ACT_
> relatives to ease compatibility.
> 
> Signed-off-by: Thomas Graf <tgraf@suug.ch>

Looks great.

Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply

* Re: [PATCH net-next 5/6] net: dsa: mv88e6xxx: add helper for switch ready
From: Andrew Lunn @ 2016-11-30 23:38 UTC (permalink / raw)
  To: Vivien Didelot
  Cc: netdev, linux-kernel, kernel, David S. Miller, Florian Fainelli
In-Reply-To: <20161130225930.25510-6-vivien.didelot@savoirfairelinux.com>

> +static int mv88e6xxx_wait_switch_ready(struct mv88e6xxx_chip *chip)
> +{
> +	const unsigned long timeout = jiffies + 1 * HZ;
> +	bool ready;
> +	int err;
> +
> +	/* Wait up to 1 second for switch to be ready.
> +	 * The switch is ready when all units inside the device (ATU, VTU, etc.)
> +	 * have finished their initialization and are ready to accept frames.
> +	 */
> +	while (time_before(jiffies, timeout)) {
> +		err = mv88e6xxx_g1_init_ready(chip, &ready);
> +		if (err)
> +			return err;
> +
> +		if (ready)
> +			break;
> +
> +		usleep_range(1000, 2000);
> +	}
> +
> +	if (time_after(jiffies, timeout))
> +		return -ETIMEDOUT;

As we have seen in the past, this sort of loop is broken if we end up
sleeping for a long time. Please take the opportunity to replace it
with one of our _wait() helpers, e.g. mv88e6xxx_g1_wait()

> +int mv88e6xxx_g1_init_ready(struct mv88e6xxx_chip *chip, bool *ready)
> +{
> +	u16 val;
> +	int err;
> +
> +	/* Check the value of the InitReady bit 11 */
> +	err = mv88e6xxx_g1_read(chip, GLOBAL_STATUS, &val);
> +	if (err)
> +		return err;
> +
> +	*ready = !!(val & GLOBAL_STATUS_INIT_READY);

I would actually do the wait here.

> +
> +	return 0;
> +}
> +

Thanks

  Andrew

^ permalink raw reply

* Re: [PATCH net-next 3/6] net: dsa: mv88e6xxx: add a software reset op
From: Andrew Lunn @ 2016-11-30 23:26 UTC (permalink / raw)
  To: Vivien Didelot
  Cc: netdev, linux-kernel, kernel, David S. Miller, Florian Fainelli
In-Reply-To: <20161130225930.25510-4-vivien.didelot@savoirfairelinux.com>

> diff --git a/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h b/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
> index ab52c37..9e51405 100644
> --- a/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
> +++ b/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
> @@ -765,6 +765,9 @@ struct mv88e6xxx_ops {
>  	int (*phy_write)(struct mv88e6xxx_chip *chip, int addr, int reg,
>  			 u16 val);
>  
> +	/* Switch Software Reset */
> +	int (*reset)(struct mv88e6xxx_chip *chip);
> +

Hi Vivien

In my huge patch series of 6390, i've been using a g1_ prefix for
functionality which is in global 1, g2_ for global 2, etc.  This has
worked for everything so far with the exception of setting which
reserved MAC addresses should be sent to the CPU. Most devices have it
in g2, but 6390 has it in g1.

Please could you add the prefix.

       Andrew

^ permalink raw reply

* Re: [PATCH net-next 2/6] net: dsa: mv88e6xxx: add helper to hardware reset
From: Andrew Lunn @ 2016-11-30 23:21 UTC (permalink / raw)
  To: Vivien Didelot
  Cc: netdev, linux-kernel, kernel, David S. Miller, Florian Fainelli
In-Reply-To: <20161130225930.25510-3-vivien.didelot@savoirfairelinux.com>

On Wed, Nov 30, 2016 at 05:59:26PM -0500, Vivien Didelot wrote:
> Add an helper to toggle the eventual GPIO connected to the reset pin.
> 
> Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>

Reviewed-by: Andrew Lunn <andrew@lunn.ch>

    Andrew

^ permalink raw reply

* Re: [PATCH net-next 1/6] net: dsa: mv88e6xxx: add helper to disable ports
From: Andrew Lunn @ 2016-11-30 23:20 UTC (permalink / raw)
  To: Vivien Didelot
  Cc: netdev, linux-kernel, kernel, David S. Miller, Florian Fainelli
In-Reply-To: <20161130225930.25510-2-vivien.didelot@savoirfairelinux.com>

On Wed, Nov 30, 2016 at 05:59:25PM -0500, Vivien Didelot wrote:
> Before resetting a switch, the ports should be set to the Disabled state
> and the transmit queues should be drained.
> 
> Add an helper to explicit that.
> 
> Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>

Reviewed-by: Andrew Lunn <andrew@lunn.ch>

    Andrew

^ permalink raw reply

* Re: [PATCH] tcp_bbr: fix Kconfig to be able to make BBR the default congestion algorithm
From: Eric Dumazet @ 2016-11-30 23:04 UTC (permalink / raw)
  To: Bernhard Held; +Cc: David S. Miller, linux-kernel, netdev, Neal Cardwell
In-Reply-To: <b37e4220-b857-4853-f2e2-b77bb08115e5@gmx.de>

On Wed, 2016-11-30 at 23:31 +0100, Bernhard Held wrote:
> Add missing line to be able to make BBR the default
> congestion algorithm.
> 
> Signed-off-by: Bernhard Held <berny156@gmx.de>
> ---
>   net/ipv4/Kconfig | 1 +
>   1 file changed, 1 insertion(+)
> 
> diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
> index 300b068..b54b3ca 100644
> --- a/net/ipv4/Kconfig
> +++ b/net/ipv4/Kconfig
> @@ -715,6 +715,7 @@ config DEFAULT_TCP_CONG
>          default "reno" if DEFAULT_RENO
>          default "dctcp" if DEFAULT_DCTCP
>          default "cdg" if DEFAULT_CDG
> +       default "bbr" if DEFAULT_BBR
>          default "cubic"
> 
>   config TCP_MD5SIG

Okay, but it seems your patch was mangled (tabulations replaced by
spaces)

Please use git, or read Documentation/email-clients.txt for some hints.

^ permalink raw reply

* [PATCH net-next 6/6] net: dsa: mv88e6xxx: add PPU enable/disable ops
From: Vivien Didelot @ 2016-11-30 22:59 UTC (permalink / raw)
  To: netdev
  Cc: linux-kernel, kernel, David S. Miller, Florian Fainelli,
	Andrew Lunn, Vivien Didelot
In-Reply-To: <20161130225930.25510-1-vivien.didelot@savoirfairelinux.com>

Some Marvell chips can enable/disable the PPU on demand. This is needed
to access the PHY registers when there is no indirection mechanism.

Add two new ppu_enable and ppu_disable ops to describe this and finally
get rid of the MV88E6XXX_FLAG_PPU* flags.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
---
 drivers/net/dsa/mv88e6xxx/chip.c      | 49 ++++++++++++++++-------------------
 drivers/net/dsa/mv88e6xxx/global1.c   | 28 ++++++++++++++++++++
 drivers/net/dsa/mv88e6xxx/global1.h   |  2 ++
 drivers/net/dsa/mv88e6xxx/mv88e6xxx.h | 18 +++----------
 4 files changed, 56 insertions(+), 41 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index f5d5370..78e3aeb 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -549,15 +549,12 @@ static int mv88e6xxx_ppu_wait(struct mv88e6xxx_chip *chip, bool state)
 
 static int mv88e6xxx_ppu_disable(struct mv88e6xxx_chip *chip)
 {
-	u16 val;
 	int err;
 
-	err = mv88e6xxx_g1_read(chip, GLOBAL_CONTROL, &val);
-	if (err)
-		return err;
+	if (!chip->info->ops->ppu_disable)
+		return 0;
 
-	err = mv88e6xxx_g1_write(chip, GLOBAL_CONTROL,
-				 val & ~GLOBAL_CONTROL_PPU_ENABLE);
+	err = chip->info->ops->ppu_disable(chip);
 	if (err)
 		return err;
 
@@ -566,15 +563,12 @@ static int mv88e6xxx_ppu_disable(struct mv88e6xxx_chip *chip)
 
 static int mv88e6xxx_ppu_enable(struct mv88e6xxx_chip *chip)
 {
-	u16 val;
 	int err;
 
-	err = mv88e6xxx_g1_read(chip, GLOBAL_CONTROL, &val);
-	if (err)
-		return err;
+	if (!chip->info->ops->ppu_enable)
+		return 0;
 
-	err = mv88e6xxx_g1_write(chip, GLOBAL_CONTROL,
-				 val | GLOBAL_CONTROL_PPU_ENABLE);
+	err = chip->info->ops->ppu_enable(chip);
 	if (err)
 		return err;
 
@@ -2775,18 +2769,11 @@ static int mv88e6xxx_g1_setup(struct mv88e6xxx_chip *chip)
 	/* Enable the PHY Polling Unit if present, don't discard any packets,
 	 * and mask all interrupt sources.
 	 */
-	err = mv88e6xxx_g1_read(chip, GLOBAL_CONTROL, &reg);
-	if (err < 0)
-		return err;
-
-	reg &= ~GLOBAL_CONTROL_PPU_ENABLE;
-	if (mv88e6xxx_has(chip, MV88E6XXX_FLAG_PPU) ||
-	    mv88e6xxx_has(chip, MV88E6XXX_FLAG_PPU_ACTIVE))
-		reg |= GLOBAL_CONTROL_PPU_ENABLE;
-
-	err = mv88e6xxx_g1_write(chip, GLOBAL_CONTROL, reg);
-	if (err)
-		return err;
+	if (chip->info->ops->ppu_enable) {
+		err = chip->info->ops->ppu_enable(chip);
+		if (err)
+			return err;
+	}
 
 	/* Configure the upstream port, and configure it as the port to which
 	 * ingress and egress and ARP monitor frames are to be sent.
@@ -3225,6 +3212,8 @@ static const struct mv88e6xxx_ops mv88e6085_ops = {
 	.stats_get_strings = mv88e6095_stats_get_strings,
 	.stats_get_stats = mv88e6095_stats_get_stats,
 	.ppu_polling = mv88e6185_g1_ppu_polling,
+	.ppu_enable = mv88e6xxx_g1_ppu_enable,
+	.ppu_disable = mv88e6xxx_g1_ppu_disable,
 	.reset = mv88e6185_g1_reset,
 };
 
@@ -3241,6 +3230,8 @@ static const struct mv88e6xxx_ops mv88e6095_ops = {
 	.stats_get_strings = mv88e6095_stats_get_strings,
 	.stats_get_stats = mv88e6095_stats_get_stats,
 	.ppu_polling = mv88e6185_g1_ppu_polling,
+	.ppu_enable = mv88e6xxx_g1_ppu_enable,
+	.ppu_disable = mv88e6xxx_g1_ppu_disable,
 	.reset = mv88e6185_g1_reset,
 };
 
@@ -3257,6 +3248,8 @@ static const struct mv88e6xxx_ops mv88e6097_ops = {
 	.stats_get_strings = mv88e6095_stats_get_strings,
 	.stats_get_stats = mv88e6095_stats_get_stats,
 	.ppu_polling = mv88e6185_g1_ppu_polling,
+	.ppu_enable = mv88e6xxx_g1_ppu_enable,
+	.ppu_disable = mv88e6xxx_g1_ppu_disable,
 	.reset = mv88e6185_g1_reset,
 };
 
@@ -3288,6 +3281,8 @@ static const struct mv88e6xxx_ops mv88e6131_ops = {
 	.stats_get_strings = mv88e6095_stats_get_strings,
 	.stats_get_stats = mv88e6095_stats_get_stats,
 	.ppu_polling = mv88e6185_g1_ppu_polling,
+	.ppu_enable = mv88e6xxx_g1_ppu_enable,
+	.ppu_disable = mv88e6xxx_g1_ppu_disable,
 	.reset = mv88e6185_g1_reset,
 };
 
@@ -3406,6 +3401,8 @@ static const struct mv88e6xxx_ops mv88e6185_ops = {
 	.stats_get_strings = mv88e6095_stats_get_strings,
 	.stats_get_stats = mv88e6095_stats_get_stats,
 	.ppu_polling = mv88e6185_g1_ppu_polling,
+	.ppu_enable = mv88e6xxx_g1_ppu_enable,
+	.ppu_disable = mv88e6xxx_g1_ppu_disable,
 	.reset = mv88e6185_g1_reset,
 };
 
@@ -4037,13 +4034,13 @@ static struct mv88e6xxx_chip *mv88e6xxx_alloc_chip(struct device *dev)
 
 static void mv88e6xxx_phy_init(struct mv88e6xxx_chip *chip)
 {
-	if (mv88e6xxx_has(chip, MV88E6XXX_FLAG_PPU))
+	if (chip->info->ops->ppu_enable && chip->info->ops->ppu_disable)
 		mv88e6xxx_ppu_state_init(chip);
 }
 
 static void mv88e6xxx_phy_destroy(struct mv88e6xxx_chip *chip)
 {
-	if (mv88e6xxx_has(chip, MV88E6XXX_FLAG_PPU))
+	if (chip->info->ops->ppu_enable && chip->info->ops->ppu_disable)
 		mv88e6xxx_ppu_state_destroy(chip);
 }
 
diff --git a/drivers/net/dsa/mv88e6xxx/global1.c b/drivers/net/dsa/mv88e6xxx/global1.c
index 4d69cec..355231a 100644
--- a/drivers/net/dsa/mv88e6xxx/global1.c
+++ b/drivers/net/dsa/mv88e6xxx/global1.c
@@ -116,6 +116,34 @@ int mv88e6352_g1_reset(struct mv88e6xxx_chip *chip)
 	return mv88e6xxx_g1_write(chip, GLOBAL_CONTROL, val);
 }
 
+int mv88e6xxx_g1_ppu_enable(struct mv88e6xxx_chip *chip)
+{
+	u16 val;
+	int err;
+
+	err = mv88e6xxx_g1_read(chip, GLOBAL_CONTROL, &val);
+	if (err)
+		return err;
+
+	val |= GLOBAL_CONTROL_PPU_ENABLE;
+
+	return mv88e6xxx_g1_write(chip, GLOBAL_CONTROL, val);
+}
+
+int mv88e6xxx_g1_ppu_disable(struct mv88e6xxx_chip *chip)
+{
+	u16 val;
+	int err;
+
+	err = mv88e6xxx_g1_read(chip, GLOBAL_CONTROL, &val);
+	if (err)
+		return err;
+
+	val &= ~GLOBAL_CONTROL_PPU_ENABLE;
+
+	return mv88e6xxx_g1_write(chip, GLOBAL_CONTROL, val);
+}
+
 /* Offset 0x1c: Global Control 2 */
 
 int mv88e6390_g1_stats_set_histogram(struct mv88e6xxx_chip *chip)
diff --git a/drivers/net/dsa/mv88e6xxx/global1.h b/drivers/net/dsa/mv88e6xxx/global1.h
index 6f0fb7f..736e4da 100644
--- a/drivers/net/dsa/mv88e6xxx/global1.h
+++ b/drivers/net/dsa/mv88e6xxx/global1.h
@@ -26,6 +26,8 @@ int mv88e6xxx_g1_init_ready(struct mv88e6xxx_chip *chip, bool *ready);
 
 int mv88e6185_g1_reset(struct mv88e6xxx_chip *chip);
 int mv88e6352_g1_reset(struct mv88e6xxx_chip *chip);
+int mv88e6xxx_g1_ppu_enable(struct mv88e6xxx_chip *chip);
+int mv88e6xxx_g1_ppu_disable(struct mv88e6xxx_chip *chip);
 
 int mv88e6xxx_g1_stats_wait(struct mv88e6xxx_chip *chip);
 int mv88e6xxx_g1_stats_snapshot(struct mv88e6xxx_chip *chip, int port);
diff --git a/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h b/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
index cd0f414..116766c 100644
--- a/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
+++ b/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
@@ -460,12 +460,6 @@ enum mv88e6xxx_cap {
 	MV88E6XXX_CAP_G2_PVT_DATA,	/* (0x0c) Cross Chip Port VLAN Data */
 	MV88E6XXX_CAP_G2_POT,		/* (0x0f) Priority Override Table */
 
-	/* PHY Polling Unit.
-	 * See GLOBAL_CONTROL_PPU_ENABLE and GLOBAL_STATUS_PPU_POLLING.
-	 */
-	MV88E6XXX_CAP_PPU,
-	MV88E6XXX_CAP_PPU_ACTIVE,
-
 	/* Per VLAN Spanning Tree Unit (STU).
 	 * The Port State database, if present, is accessed through VTU
 	 * operations and dedicated SID registers. See GLOBAL_VTU_SID.
@@ -508,8 +502,6 @@ enum mv88e6xxx_cap {
 #define MV88E6XXX_FLAG_G2_PVT_DATA	BIT_ULL(MV88E6XXX_CAP_G2_PVT_DATA)
 #define MV88E6XXX_FLAG_G2_POT		BIT_ULL(MV88E6XXX_CAP_G2_POT)
 
-#define MV88E6XXX_FLAG_PPU		BIT_ULL(MV88E6XXX_CAP_PPU)
-#define MV88E6XXX_FLAG_PPU_ACTIVE	BIT_ULL(MV88E6XXX_CAP_PPU_ACTIVE)
 #define MV88E6XXX_FLAG_STU		BIT_ULL(MV88E6XXX_CAP_STU)
 #define MV88E6XXX_FLAG_TEMP		BIT_ULL(MV88E6XXX_CAP_TEMP)
 #define MV88E6XXX_FLAG_TEMP_LIMIT	BIT_ULL(MV88E6XXX_CAP_TEMP_LIMIT)
@@ -538,7 +530,6 @@ enum mv88e6xxx_cap {
 #define MV88E6XXX_FLAGS_FAMILY_6095	\
 	(MV88E6XXX_FLAG_GLOBAL2 |	\
 	 MV88E6XXX_FLAG_G2_MGMT_EN_0X |	\
-	 MV88E6XXX_FLAG_PPU |		\
 	 MV88E6XXX_FLAG_VTU |		\
 	 MV88E6XXX_FLAGS_MULTI_CHIP)
 
@@ -549,7 +540,6 @@ enum mv88e6xxx_cap {
 	 MV88E6XXX_FLAG_G2_MGMT_EN_2X |	\
 	 MV88E6XXX_FLAG_G2_MGMT_EN_0X |	\
 	 MV88E6XXX_FLAG_G2_POT |	\
-	 MV88E6XXX_FLAG_PPU |		\
 	 MV88E6XXX_FLAG_STU |		\
 	 MV88E6XXX_FLAG_VTU |		\
 	 MV88E6XXX_FLAGS_IRL |		\
@@ -576,7 +566,6 @@ enum mv88e6xxx_cap {
 	 MV88E6XXX_FLAG_G2_INT |	\
 	 MV88E6XXX_FLAG_G2_MGMT_EN_0X |	\
 	 MV88E6XXX_FLAGS_MULTI_CHIP |	\
-	 MV88E6XXX_FLAG_PPU |		\
 	 MV88E6XXX_FLAG_VTU)
 
 #define MV88E6XXX_FLAGS_FAMILY_6320	\
@@ -586,7 +575,6 @@ enum mv88e6xxx_cap {
 	 MV88E6XXX_FLAG_G2_MGMT_EN_2X |	\
 	 MV88E6XXX_FLAG_G2_MGMT_EN_0X |	\
 	 MV88E6XXX_FLAG_G2_POT |	\
-	 MV88E6XXX_FLAG_PPU_ACTIVE |	\
 	 MV88E6XXX_FLAG_TEMP |		\
 	 MV88E6XXX_FLAG_TEMP_LIMIT |	\
 	 MV88E6XXX_FLAG_VTU |		\
@@ -603,7 +591,6 @@ enum mv88e6xxx_cap {
 	 MV88E6XXX_FLAG_G2_MGMT_EN_2X |	\
 	 MV88E6XXX_FLAG_G2_MGMT_EN_0X |	\
 	 MV88E6XXX_FLAG_G2_POT |	\
-	 MV88E6XXX_FLAG_PPU_ACTIVE |	\
 	 MV88E6XXX_FLAG_STU |		\
 	 MV88E6XXX_FLAG_TEMP |		\
 	 MV88E6XXX_FLAG_VTU |		\
@@ -621,7 +608,6 @@ enum mv88e6xxx_cap {
 	 MV88E6XXX_FLAG_G2_MGMT_EN_2X |	\
 	 MV88E6XXX_FLAG_G2_MGMT_EN_0X |	\
 	 MV88E6XXX_FLAG_G2_POT |	\
-	 MV88E6XXX_FLAG_PPU_ACTIVE |	\
 	 MV88E6XXX_FLAG_STU |		\
 	 MV88E6XXX_FLAG_TEMP |		\
 	 MV88E6XXX_FLAG_TEMP_LIMIT |	\
@@ -636,7 +622,6 @@ struct mv88e6xxx_ops;
 #define MV88E6XXX_FLAGS_FAMILY_6390	\
 	(MV88E6XXX_FLAG_EEE |		\
 	 MV88E6XXX_FLAG_GLOBAL2 |	\
-	 MV88E6XXX_FLAG_PPU_ACTIVE |	\
 	 MV88E6XXX_FLAG_STU |		\
 	 MV88E6XXX_FLAG_TEMP |		\
 	 MV88E6XXX_FLAG_TEMP_LIMIT |	\
@@ -767,6 +752,9 @@ struct mv88e6xxx_ops {
 
 	/* PHY Polling Unit (PPU) operations */
 	int (*ppu_polling)(struct mv88e6xxx_chip *chip, bool *polling);
+	int (*ppu_enable)(struct mv88e6xxx_chip *chip);
+	int (*ppu_disable)(struct mv88e6xxx_chip *chip);
+
 	/* Switch Software Reset */
 	int (*reset)(struct mv88e6xxx_chip *chip);
 
-- 
2.10.2

^ permalink raw reply related

* [PATCH net-next 5/6] net: dsa: mv88e6xxx: add helper for switch ready
From: Vivien Didelot @ 2016-11-30 22:59 UTC (permalink / raw)
  To: netdev
  Cc: linux-kernel, kernel, David S. Miller, Florian Fainelli,
	Andrew Lunn, Vivien Didelot
In-Reply-To: <20161130225930.25510-1-vivien.didelot@savoirfairelinux.com>

All Marvell switches (even 88E6060) have a bit to inform that the switch
is ready to accept frames. Implement that in  mv88e6xxx_g1_init_ready().

If the switch has a PPU, we must wait until its state is active polling,
otherwise we cannot access PortStatus registers. Nicely wrap all that in
a mv88e6xxx_wait_switch_ready() helper for the switch reset code.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
---
 drivers/net/dsa/mv88e6xxx/chip.c      | 52 +++++++++++++++++++++--------------
 drivers/net/dsa/mv88e6xxx/global1.c   | 15 ++++++++++
 drivers/net/dsa/mv88e6xxx/global1.h   |  1 +
 drivers/net/dsa/mv88e6xxx/mv88e6xxx.h |  1 +
 4 files changed, 48 insertions(+), 21 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index db4014c..f5d5370 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -2363,6 +2363,36 @@ static void mv88e6xxx_port_bridge_leave(struct dsa_switch *ds, int port)
 	mutex_unlock(&chip->reg_lock);
 }
 
+static int mv88e6xxx_wait_switch_ready(struct mv88e6xxx_chip *chip)
+{
+	const unsigned long timeout = jiffies + 1 * HZ;
+	bool ready;
+	int err;
+
+	/* Wait up to 1 second for switch to be ready.
+	 * The switch is ready when all units inside the device (ATU, VTU, etc.)
+	 * have finished their initialization and are ready to accept frames.
+	 */
+	while (time_before(jiffies, timeout)) {
+		err = mv88e6xxx_g1_init_ready(chip, &ready);
+		if (err)
+			return err;
+
+		if (ready)
+			break;
+
+		usleep_range(1000, 2000);
+	}
+
+	if (time_after(jiffies, timeout))
+		return -ETIMEDOUT;
+
+	/* If there is a PPU, the PortStatus registers must not be written to by
+	 * software until the PPU is active polling, so wait until then.
+	 */
+	return mv88e6xxx_ppu_wait(chip, true);
+}
+
 static int mv88e6xxx_software_reset(struct mv88e6xxx_chip *chip)
 {
 	if (chip->info->ops->reset)
@@ -2406,10 +2436,6 @@ static int mv88e6xxx_disable_ports(struct mv88e6xxx_chip *chip)
 
 static int mv88e6xxx_switch_reset(struct mv88e6xxx_chip *chip)
 {
-	bool ppu_active = mv88e6xxx_has(chip, MV88E6XXX_FLAG_PPU_ACTIVE);
-	u16 is_reset = (ppu_active ? 0x8800 : 0xc800);
-	unsigned long timeout;
-	u16 reg;
 	int err;
 
 	err = mv88e6xxx_disable_ports(chip);
@@ -2422,23 +2448,7 @@ static int mv88e6xxx_switch_reset(struct mv88e6xxx_chip *chip)
 	if (err)
 		return err;
 
-	/* Wait up to one second for reset to complete. */
-	timeout = jiffies + 1 * HZ;
-	while (time_before(jiffies, timeout)) {
-		err = mv88e6xxx_g1_read(chip, 0x00, &reg);
-		if (err)
-			return err;
-
-		if ((reg & is_reset) == is_reset)
-			break;
-		usleep_range(1000, 2000);
-	}
-	if (time_after(jiffies, timeout))
-		err = -ETIMEDOUT;
-	else
-		err = 0;
-
-	return err;
+	return mv88e6xxx_wait_switch_ready(chip);
 }
 
 static int mv88e6xxx_serdes_power_on(struct mv88e6xxx_chip *chip)
diff --git a/drivers/net/dsa/mv88e6xxx/global1.c b/drivers/net/dsa/mv88e6xxx/global1.c
index 371d96a..4d69cec 100644
--- a/drivers/net/dsa/mv88e6xxx/global1.c
+++ b/drivers/net/dsa/mv88e6xxx/global1.c
@@ -66,6 +66,21 @@ int mv88e6352_g1_ppu_polling(struct mv88e6xxx_chip *chip, bool *polling)
 	return 0;
 }
 
+int mv88e6xxx_g1_init_ready(struct mv88e6xxx_chip *chip, bool *ready)
+{
+	u16 val;
+	int err;
+
+	/* Check the value of the InitReady bit 11 */
+	err = mv88e6xxx_g1_read(chip, GLOBAL_STATUS, &val);
+	if (err)
+		return err;
+
+	*ready = !!(val & GLOBAL_STATUS_INIT_READY);
+
+	return 0;
+}
+
 /* Offset 0x04: Switch Global Control Register */
 
 int mv88e6185_g1_reset(struct mv88e6xxx_chip *chip)
diff --git a/drivers/net/dsa/mv88e6xxx/global1.h b/drivers/net/dsa/mv88e6xxx/global1.h
index cdccf473..6f0fb7f 100644
--- a/drivers/net/dsa/mv88e6xxx/global1.h
+++ b/drivers/net/dsa/mv88e6xxx/global1.h
@@ -22,6 +22,7 @@ int mv88e6xxx_g1_wait(struct mv88e6xxx_chip *chip, int reg, u16 mask);
 
 int mv88e6185_g1_ppu_polling(struct mv88e6xxx_chip *chip, bool *polling);
 int mv88e6352_g1_ppu_polling(struct mv88e6xxx_chip *chip, bool *polling);
+int mv88e6xxx_g1_init_ready(struct mv88e6xxx_chip *chip, bool *ready);
 
 int mv88e6185_g1_reset(struct mv88e6xxx_chip *chip);
 int mv88e6352_g1_reset(struct mv88e6xxx_chip *chip);
diff --git a/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h b/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
index 9ae228c..cd0f414 100644
--- a/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
+++ b/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
@@ -180,6 +180,7 @@
 #define GLOBAL_STATUS_PPU_STATE_INITIALIZING	(0x1 << 14)
 #define GLOBAL_STATUS_PPU_STATE_DISABLED	(0x2 << 14)
 #define GLOBAL_STATUS_PPU_STATE_POLLING		(0x3 << 14)
+#define GLOBAL_STATUS_INIT_READY	BIT(11)
 #define GLOBAL_STATUS_IRQ_AVB		8
 #define GLOBAL_STATUS_IRQ_DEVICE	7
 #define GLOBAL_STATUS_IRQ_STATS		6
-- 
2.10.2

^ permalink raw reply related

* [PATCH net-next 4/6] net: dsa: mv88e6xxx: add a PPU polling op
From: Vivien Didelot @ 2016-11-30 22:59 UTC (permalink / raw)
  To: netdev
  Cc: linux-kernel, kernel, David S. Miller, Florian Fainelli,
	Andrew Lunn, Vivien Didelot
In-Reply-To: <20161130225930.25510-1-vivien.didelot@savoirfairelinux.com>

Marvell chips with controllable PPU have 2 bits to identify the PPU
state. Chips without PPU control have their PPU enable by default and
have only 1 bit to read the PPU state.

The only state we care about is the PPU active and polling state
(meaning PortStatus registers are available for reading).

Add a .ppu_polling op check this state and use it in the PPU
enable/disable routines. It will also be used later in the switch reset
code.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
---
 drivers/net/dsa/mv88e6xxx/chip.c      | 70 +++++++++++++++++++++++------------
 drivers/net/dsa/mv88e6xxx/global1.c   | 33 +++++++++++++++++
 drivers/net/dsa/mv88e6xxx/global1.h   |  3 ++
 drivers/net/dsa/mv88e6xxx/mv88e6xxx.h | 13 ++++---
 4 files changed, 89 insertions(+), 30 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 6bb7571..db4014c 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -525,10 +525,32 @@ int mv88e6xxx_update(struct mv88e6xxx_chip *chip, int addr, int reg, u16 update)
 	return mv88e6xxx_write(chip, addr, reg, val);
 }
 
+static int mv88e6xxx_ppu_wait(struct mv88e6xxx_chip *chip, bool state)
+{
+	bool polling;
+	int i, err;
+
+	if (!chip->info->ops->ppu_polling)
+		return 0;
+
+	for (i = 0; i < 16; i++) {
+		err = chip->info->ops->ppu_polling(chip, &polling);
+		if (err)
+			return err;
+
+		usleep_range(1000, 2000);
+
+		if (polling == state)
+			return 0;
+	}
+
+	return -ETIMEDOUT;
+}
+
 static int mv88e6xxx_ppu_disable(struct mv88e6xxx_chip *chip)
 {
 	u16 val;
-	int i, err;
+	int err;
 
 	err = mv88e6xxx_g1_read(chip, GLOBAL_CONTROL, &val);
 	if (err)
@@ -539,23 +561,13 @@ static int mv88e6xxx_ppu_disable(struct mv88e6xxx_chip *chip)
 	if (err)
 		return err;
 
-	for (i = 0; i < 16; i++) {
-		err = mv88e6xxx_g1_read(chip, GLOBAL_STATUS, &val);
-		if (err)
-			return err;
-
-		usleep_range(1000, 2000);
-		if ((val & GLOBAL_STATUS_PPU_MASK) != GLOBAL_STATUS_PPU_POLLING)
-			return 0;
-	}
-
-	return -ETIMEDOUT;
+	return mv88e6xxx_ppu_wait(chip, false);
 }
 
 static int mv88e6xxx_ppu_enable(struct mv88e6xxx_chip *chip)
 {
 	u16 val;
-	int i, err;
+	int err;
 
 	err = mv88e6xxx_g1_read(chip, GLOBAL_CONTROL, &val);
 	if (err)
@@ -566,17 +578,7 @@ static int mv88e6xxx_ppu_enable(struct mv88e6xxx_chip *chip)
 	if (err)
 		return err;
 
-	for (i = 0; i < 16; i++) {
-		err = mv88e6xxx_g1_read(chip, GLOBAL_STATUS, &val);
-		if (err)
-			return err;
-
-		usleep_range(1000, 2000);
-		if ((val & GLOBAL_STATUS_PPU_MASK) == GLOBAL_STATUS_PPU_POLLING)
-			return 0;
-	}
-
-	return -ETIMEDOUT;
+	return mv88e6xxx_ppu_wait(chip, true);
 }
 
 static void mv88e6xxx_ppu_reenable_work(struct work_struct *ugly)
@@ -3212,6 +3214,7 @@ static const struct mv88e6xxx_ops mv88e6085_ops = {
 	.stats_get_sset_count = mv88e6095_stats_get_sset_count,
 	.stats_get_strings = mv88e6095_stats_get_strings,
 	.stats_get_stats = mv88e6095_stats_get_stats,
+	.ppu_polling = mv88e6185_g1_ppu_polling,
 	.reset = mv88e6185_g1_reset,
 };
 
@@ -3227,6 +3230,7 @@ static const struct mv88e6xxx_ops mv88e6095_ops = {
 	.stats_get_sset_count = mv88e6095_stats_get_sset_count,
 	.stats_get_strings = mv88e6095_stats_get_strings,
 	.stats_get_stats = mv88e6095_stats_get_stats,
+	.ppu_polling = mv88e6185_g1_ppu_polling,
 	.reset = mv88e6185_g1_reset,
 };
 
@@ -3242,6 +3246,7 @@ static const struct mv88e6xxx_ops mv88e6097_ops = {
 	.stats_get_sset_count = mv88e6095_stats_get_sset_count,
 	.stats_get_strings = mv88e6095_stats_get_strings,
 	.stats_get_stats = mv88e6095_stats_get_stats,
+	.ppu_polling = mv88e6185_g1_ppu_polling,
 	.reset = mv88e6185_g1_reset,
 };
 
@@ -3272,6 +3277,7 @@ static const struct mv88e6xxx_ops mv88e6131_ops = {
 	.stats_get_sset_count = mv88e6095_stats_get_sset_count,
 	.stats_get_strings = mv88e6095_stats_get_strings,
 	.stats_get_stats = mv88e6095_stats_get_stats,
+	.ppu_polling = mv88e6185_g1_ppu_polling,
 	.reset = mv88e6185_g1_reset,
 };
 
@@ -3318,6 +3324,7 @@ static const struct mv88e6xxx_ops mv88e6171_ops = {
 	.stats_get_sset_count = mv88e6095_stats_get_sset_count,
 	.stats_get_strings = mv88e6095_stats_get_strings,
 	.stats_get_stats = mv88e6095_stats_get_stats,
+	.ppu_polling = mv88e6352_g1_ppu_polling,
 	.reset = mv88e6352_g1_reset,
 };
 
@@ -3336,6 +3343,7 @@ static const struct mv88e6xxx_ops mv88e6172_ops = {
 	.stats_get_sset_count = mv88e6095_stats_get_sset_count,
 	.stats_get_strings = mv88e6095_stats_get_strings,
 	.stats_get_stats = mv88e6095_stats_get_stats,
+	.ppu_polling = mv88e6352_g1_ppu_polling,
 	.reset = mv88e6352_g1_reset,
 };
 
@@ -3352,6 +3360,7 @@ static const struct mv88e6xxx_ops mv88e6175_ops = {
 	.stats_get_sset_count = mv88e6095_stats_get_sset_count,
 	.stats_get_strings = mv88e6095_stats_get_strings,
 	.stats_get_stats = mv88e6095_stats_get_stats,
+	.ppu_polling = mv88e6352_g1_ppu_polling,
 	.reset = mv88e6352_g1_reset,
 };
 
@@ -3370,6 +3379,7 @@ static const struct mv88e6xxx_ops mv88e6176_ops = {
 	.stats_get_sset_count = mv88e6095_stats_get_sset_count,
 	.stats_get_strings = mv88e6095_stats_get_strings,
 	.stats_get_stats = mv88e6095_stats_get_stats,
+	.ppu_polling = mv88e6352_g1_ppu_polling,
 	.reset = mv88e6352_g1_reset,
 };
 
@@ -3385,6 +3395,7 @@ static const struct mv88e6xxx_ops mv88e6185_ops = {
 	.stats_get_sset_count = mv88e6095_stats_get_sset_count,
 	.stats_get_strings = mv88e6095_stats_get_strings,
 	.stats_get_stats = mv88e6095_stats_get_stats,
+	.ppu_polling = mv88e6185_g1_ppu_polling,
 	.reset = mv88e6185_g1_reset,
 };
 
@@ -3402,6 +3413,7 @@ static const struct mv88e6xxx_ops mv88e6190_ops = {
 	.stats_get_sset_count = mv88e6320_stats_get_sset_count,
 	.stats_get_strings = mv88e6320_stats_get_strings,
 	.stats_get_stats = mv88e6390_stats_get_stats,
+	.ppu_polling = mv88e6352_g1_ppu_polling,
 	.reset = mv88e6352_g1_reset,
 };
 
@@ -3419,6 +3431,7 @@ static const struct mv88e6xxx_ops mv88e6190x_ops = {
 	.stats_get_sset_count = mv88e6320_stats_get_sset_count,
 	.stats_get_strings = mv88e6320_stats_get_strings,
 	.stats_get_stats = mv88e6390_stats_get_stats,
+	.ppu_polling = mv88e6352_g1_ppu_polling,
 	.reset = mv88e6352_g1_reset,
 };
 
@@ -3436,6 +3449,7 @@ static const struct mv88e6xxx_ops mv88e6191_ops = {
 	.stats_get_sset_count = mv88e6320_stats_get_sset_count,
 	.stats_get_strings = mv88e6320_stats_get_strings,
 	.stats_get_stats = mv88e6390_stats_get_stats,
+	.ppu_polling = mv88e6352_g1_ppu_polling,
 	.reset = mv88e6352_g1_reset,
 };
 
@@ -3454,6 +3468,7 @@ static const struct mv88e6xxx_ops mv88e6240_ops = {
 	.stats_get_sset_count = mv88e6095_stats_get_sset_count,
 	.stats_get_strings = mv88e6095_stats_get_strings,
 	.stats_get_stats = mv88e6095_stats_get_stats,
+	.ppu_polling = mv88e6352_g1_ppu_polling,
 	.reset = mv88e6352_g1_reset,
 };
 
@@ -3471,6 +3486,7 @@ static const struct mv88e6xxx_ops mv88e6290_ops = {
 	.stats_get_sset_count = mv88e6320_stats_get_sset_count,
 	.stats_get_strings = mv88e6320_stats_get_strings,
 	.stats_get_stats = mv88e6390_stats_get_stats,
+	.ppu_polling = mv88e6352_g1_ppu_polling,
 	.reset = mv88e6352_g1_reset,
 };
 
@@ -3521,6 +3537,7 @@ static const struct mv88e6xxx_ops mv88e6350_ops = {
 	.stats_get_sset_count = mv88e6095_stats_get_sset_count,
 	.stats_get_strings = mv88e6095_stats_get_strings,
 	.stats_get_stats = mv88e6095_stats_get_stats,
+	.ppu_polling = mv88e6352_g1_ppu_polling,
 	.reset = mv88e6352_g1_reset,
 };
 
@@ -3537,6 +3554,7 @@ static const struct mv88e6xxx_ops mv88e6351_ops = {
 	.stats_get_sset_count = mv88e6095_stats_get_sset_count,
 	.stats_get_strings = mv88e6095_stats_get_strings,
 	.stats_get_stats = mv88e6095_stats_get_stats,
+	.ppu_polling = mv88e6352_g1_ppu_polling,
 	.reset = mv88e6352_g1_reset,
 };
 
@@ -3555,6 +3573,7 @@ static const struct mv88e6xxx_ops mv88e6352_ops = {
 	.stats_get_sset_count = mv88e6095_stats_get_sset_count,
 	.stats_get_strings = mv88e6095_stats_get_strings,
 	.stats_get_stats = mv88e6095_stats_get_stats,
+	.ppu_polling = mv88e6352_g1_ppu_polling,
 	.reset = mv88e6352_g1_reset,
 };
 
@@ -3572,6 +3591,7 @@ static const struct mv88e6xxx_ops mv88e6390_ops = {
 	.stats_get_sset_count = mv88e6320_stats_get_sset_count,
 	.stats_get_strings = mv88e6320_stats_get_strings,
 	.stats_get_stats = mv88e6390_stats_get_stats,
+	.ppu_polling = mv88e6352_g1_ppu_polling,
 	.reset = mv88e6352_g1_reset,
 };
 
@@ -3589,6 +3609,7 @@ static const struct mv88e6xxx_ops mv88e6390x_ops = {
 	.stats_get_sset_count = mv88e6320_stats_get_sset_count,
 	.stats_get_strings = mv88e6320_stats_get_strings,
 	.stats_get_stats = mv88e6390_stats_get_stats,
+	.ppu_polling = mv88e6352_g1_ppu_polling,
 	.reset = mv88e6352_g1_reset,
 };
 
@@ -3606,6 +3627,7 @@ static const struct mv88e6xxx_ops mv88e6391_ops = {
 	.stats_get_sset_count = mv88e6320_stats_get_sset_count,
 	.stats_get_strings = mv88e6320_stats_get_strings,
 	.stats_get_stats = mv88e6390_stats_get_stats,
+	.ppu_polling = mv88e6352_g1_ppu_polling,
 	.reset = mv88e6352_g1_reset,
 };
 
diff --git a/drivers/net/dsa/mv88e6xxx/global1.c b/drivers/net/dsa/mv88e6xxx/global1.c
index 27f37a7..371d96a 100644
--- a/drivers/net/dsa/mv88e6xxx/global1.c
+++ b/drivers/net/dsa/mv88e6xxx/global1.c
@@ -33,6 +33,39 @@ int mv88e6xxx_g1_wait(struct mv88e6xxx_chip *chip, int reg, u16 mask)
 	return mv88e6xxx_wait(chip, chip->info->global1_addr, reg, mask);
 }
 
+/* Offset 0x00: Switch Global Status Register */
+
+int mv88e6185_g1_ppu_polling(struct mv88e6xxx_chip *chip, bool *polling)
+{
+	u16 state;
+	int err;
+
+	/* Check the value of the PPUState bits 15:14 */
+	err = mv88e6xxx_g1_read(chip, GLOBAL_STATUS, &state);
+	if (err)
+		return err;
+
+	state &= GLOBAL_STATUS_PPU_STATE_MASK;
+	*polling = state == GLOBAL_STATUS_PPU_STATE_POLLING;
+
+	return 0;
+}
+
+int mv88e6352_g1_ppu_polling(struct mv88e6xxx_chip *chip, bool *polling)
+{
+	u16 state;
+	int err;
+
+	/* Check the value of the PPUState (or InitState) bit 15 */
+	err = mv88e6xxx_g1_read(chip, GLOBAL_STATUS, &state);
+	if (err)
+		return err;
+
+	*polling = !!(state & GLOBAL_STATUS_PPU_STATE);
+
+	return 0;
+}
+
 /* Offset 0x04: Switch Global Control Register */
 
 int mv88e6185_g1_reset(struct mv88e6xxx_chip *chip)
diff --git a/drivers/net/dsa/mv88e6xxx/global1.h b/drivers/net/dsa/mv88e6xxx/global1.h
index 868a281..cdccf473 100644
--- a/drivers/net/dsa/mv88e6xxx/global1.h
+++ b/drivers/net/dsa/mv88e6xxx/global1.h
@@ -20,6 +20,9 @@ int mv88e6xxx_g1_read(struct mv88e6xxx_chip *chip, int reg, u16 *val);
 int mv88e6xxx_g1_write(struct mv88e6xxx_chip *chip, int reg, u16 val);
 int mv88e6xxx_g1_wait(struct mv88e6xxx_chip *chip, int reg, u16 mask);
 
+int mv88e6185_g1_ppu_polling(struct mv88e6xxx_chip *chip, bool *polling);
+int mv88e6352_g1_ppu_polling(struct mv88e6xxx_chip *chip, bool *polling);
+
 int mv88e6185_g1_reset(struct mv88e6xxx_chip *chip);
 int mv88e6352_g1_reset(struct mv88e6xxx_chip *chip);
 
diff --git a/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h b/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
index 9e51405..9ae228c 100644
--- a/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
+++ b/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
@@ -175,12 +175,11 @@
 
 #define GLOBAL_STATUS		0x00
 #define GLOBAL_STATUS_PPU_STATE BIT(15) /* 6351 and 6171 */
-/* Two bits for 6165, 6185 etc */
-#define GLOBAL_STATUS_PPU_MASK		(0x3 << 14)
-#define GLOBAL_STATUS_PPU_DISABLED_RST	(0x0 << 14)
-#define GLOBAL_STATUS_PPU_INITIALIZING	(0x1 << 14)
-#define GLOBAL_STATUS_PPU_DISABLED	(0x2 << 14)
-#define GLOBAL_STATUS_PPU_POLLING	(0x3 << 14)
+#define GLOBAL_STATUS_PPU_STATE_MASK		(0x3 << 14) /* 6165 6185 */
+#define GLOBAL_STATUS_PPU_STATE_DISABLED_RST	(0x0 << 14)
+#define GLOBAL_STATUS_PPU_STATE_INITIALIZING	(0x1 << 14)
+#define GLOBAL_STATUS_PPU_STATE_DISABLED	(0x2 << 14)
+#define GLOBAL_STATUS_PPU_STATE_POLLING		(0x3 << 14)
 #define GLOBAL_STATUS_IRQ_AVB		8
 #define GLOBAL_STATUS_IRQ_DEVICE	7
 #define GLOBAL_STATUS_IRQ_STATS		6
@@ -765,6 +764,8 @@ struct mv88e6xxx_ops {
 	int (*phy_write)(struct mv88e6xxx_chip *chip, int addr, int reg,
 			 u16 val);
 
+	/* PHY Polling Unit (PPU) operations */
+	int (*ppu_polling)(struct mv88e6xxx_chip *chip, bool *polling);
 	/* Switch Software Reset */
 	int (*reset)(struct mv88e6xxx_chip *chip);
 
-- 
2.10.2

^ permalink raw reply related

* [PATCH net-next 3/6] net: dsa: mv88e6xxx: add a software reset op
From: Vivien Didelot @ 2016-11-30 22:59 UTC (permalink / raw)
  To: netdev
  Cc: linux-kernel, kernel, David S. Miller, Florian Fainelli,
	Andrew Lunn, Vivien Didelot
In-Reply-To: <20161130225930.25510-1-vivien.didelot@savoirfairelinux.com>

Marvell chips have different way to issue a software reset.

Old chips (such as 88E6060) have a reset bit in an ATU control register.

Newer chips moved this bit in a Global control register. Chips with
controllable PPU should reset the PPU when resetting the switch.

Add a new reset operation to implement these differences and introduce a
mv88e6xxx_software_reset() helper to wrap it conveniently.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
---
 drivers/net/dsa/mv88e6xxx/chip.c      | 42 ++++++++++++++++++++++++++++-------
 drivers/net/dsa/mv88e6xxx/global1.c   | 35 +++++++++++++++++++++++++++++
 drivers/net/dsa/mv88e6xxx/global1.h   |  4 ++++
 drivers/net/dsa/mv88e6xxx/mv88e6xxx.h |  3 +++
 4 files changed, 76 insertions(+), 8 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 4a862c2..6bb7571 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -2361,6 +2361,14 @@ static void mv88e6xxx_port_bridge_leave(struct dsa_switch *ds, int port)
 	mutex_unlock(&chip->reg_lock);
 }
 
+static int mv88e6xxx_software_reset(struct mv88e6xxx_chip *chip)
+{
+	if (chip->info->ops->reset)
+		return chip->info->ops->reset(chip);
+
+	return 0;
+}
+
 static void mv88e6xxx_hardware_reset(struct mv88e6xxx_chip *chip)
 {
 	struct gpio_desc *gpiod = chip->reset;
@@ -2408,14 +2416,7 @@ static int mv88e6xxx_switch_reset(struct mv88e6xxx_chip *chip)
 
 	mv88e6xxx_hardware_reset(chip);
 
-	/* Reset the switch. Keep the PPU active if requested. The PPU
-	 * needs to be active to support indirect phy register access
-	 * through global registers 0x18 and 0x19.
-	 */
-	if (ppu_active)
-		err = mv88e6xxx_g1_write(chip, 0x04, 0xc000);
-	else
-		err = mv88e6xxx_g1_write(chip, 0x04, 0xc400);
+	err = mv88e6xxx_software_reset(chip);
 	if (err)
 		return err;
 
@@ -3211,6 +3212,7 @@ static const struct mv88e6xxx_ops mv88e6085_ops = {
 	.stats_get_sset_count = mv88e6095_stats_get_sset_count,
 	.stats_get_strings = mv88e6095_stats_get_strings,
 	.stats_get_stats = mv88e6095_stats_get_stats,
+	.reset = mv88e6185_g1_reset,
 };
 
 static const struct mv88e6xxx_ops mv88e6095_ops = {
@@ -3225,6 +3227,7 @@ static const struct mv88e6xxx_ops mv88e6095_ops = {
 	.stats_get_sset_count = mv88e6095_stats_get_sset_count,
 	.stats_get_strings = mv88e6095_stats_get_strings,
 	.stats_get_stats = mv88e6095_stats_get_stats,
+	.reset = mv88e6185_g1_reset,
 };
 
 static const struct mv88e6xxx_ops mv88e6097_ops = {
@@ -3239,6 +3242,7 @@ static const struct mv88e6xxx_ops mv88e6097_ops = {
 	.stats_get_sset_count = mv88e6095_stats_get_sset_count,
 	.stats_get_strings = mv88e6095_stats_get_strings,
 	.stats_get_stats = mv88e6095_stats_get_stats,
+	.reset = mv88e6185_g1_reset,
 };
 
 static const struct mv88e6xxx_ops mv88e6123_ops = {
@@ -3253,6 +3257,7 @@ static const struct mv88e6xxx_ops mv88e6123_ops = {
 	.stats_get_sset_count = mv88e6095_stats_get_sset_count,
 	.stats_get_strings = mv88e6095_stats_get_strings,
 	.stats_get_stats = mv88e6095_stats_get_stats,
+	.reset = mv88e6352_g1_reset,
 };
 
 static const struct mv88e6xxx_ops mv88e6131_ops = {
@@ -3267,6 +3272,7 @@ static const struct mv88e6xxx_ops mv88e6131_ops = {
 	.stats_get_sset_count = mv88e6095_stats_get_sset_count,
 	.stats_get_strings = mv88e6095_stats_get_strings,
 	.stats_get_stats = mv88e6095_stats_get_stats,
+	.reset = mv88e6185_g1_reset,
 };
 
 static const struct mv88e6xxx_ops mv88e6161_ops = {
@@ -3281,6 +3287,7 @@ static const struct mv88e6xxx_ops mv88e6161_ops = {
 	.stats_get_sset_count = mv88e6095_stats_get_sset_count,
 	.stats_get_strings = mv88e6095_stats_get_strings,
 	.stats_get_stats = mv88e6095_stats_get_stats,
+	.reset = mv88e6352_g1_reset,
 };
 
 static const struct mv88e6xxx_ops mv88e6165_ops = {
@@ -3295,6 +3302,7 @@ static const struct mv88e6xxx_ops mv88e6165_ops = {
 	.stats_get_sset_count = mv88e6095_stats_get_sset_count,
 	.stats_get_strings = mv88e6095_stats_get_strings,
 	.stats_get_stats = mv88e6095_stats_get_stats,
+	.reset = mv88e6352_g1_reset,
 };
 
 static const struct mv88e6xxx_ops mv88e6171_ops = {
@@ -3310,6 +3318,7 @@ static const struct mv88e6xxx_ops mv88e6171_ops = {
 	.stats_get_sset_count = mv88e6095_stats_get_sset_count,
 	.stats_get_strings = mv88e6095_stats_get_strings,
 	.stats_get_stats = mv88e6095_stats_get_stats,
+	.reset = mv88e6352_g1_reset,
 };
 
 static const struct mv88e6xxx_ops mv88e6172_ops = {
@@ -3327,6 +3336,7 @@ static const struct mv88e6xxx_ops mv88e6172_ops = {
 	.stats_get_sset_count = mv88e6095_stats_get_sset_count,
 	.stats_get_strings = mv88e6095_stats_get_strings,
 	.stats_get_stats = mv88e6095_stats_get_stats,
+	.reset = mv88e6352_g1_reset,
 };
 
 static const struct mv88e6xxx_ops mv88e6175_ops = {
@@ -3342,6 +3352,7 @@ static const struct mv88e6xxx_ops mv88e6175_ops = {
 	.stats_get_sset_count = mv88e6095_stats_get_sset_count,
 	.stats_get_strings = mv88e6095_stats_get_strings,
 	.stats_get_stats = mv88e6095_stats_get_stats,
+	.reset = mv88e6352_g1_reset,
 };
 
 static const struct mv88e6xxx_ops mv88e6176_ops = {
@@ -3359,6 +3370,7 @@ static const struct mv88e6xxx_ops mv88e6176_ops = {
 	.stats_get_sset_count = mv88e6095_stats_get_sset_count,
 	.stats_get_strings = mv88e6095_stats_get_strings,
 	.stats_get_stats = mv88e6095_stats_get_stats,
+	.reset = mv88e6352_g1_reset,
 };
 
 static const struct mv88e6xxx_ops mv88e6185_ops = {
@@ -3373,6 +3385,7 @@ static const struct mv88e6xxx_ops mv88e6185_ops = {
 	.stats_get_sset_count = mv88e6095_stats_get_sset_count,
 	.stats_get_strings = mv88e6095_stats_get_strings,
 	.stats_get_stats = mv88e6095_stats_get_stats,
+	.reset = mv88e6185_g1_reset,
 };
 
 static const struct mv88e6xxx_ops mv88e6190_ops = {
@@ -3389,6 +3402,7 @@ static const struct mv88e6xxx_ops mv88e6190_ops = {
 	.stats_get_sset_count = mv88e6320_stats_get_sset_count,
 	.stats_get_strings = mv88e6320_stats_get_strings,
 	.stats_get_stats = mv88e6390_stats_get_stats,
+	.reset = mv88e6352_g1_reset,
 };
 
 static const struct mv88e6xxx_ops mv88e6190x_ops = {
@@ -3405,6 +3419,7 @@ static const struct mv88e6xxx_ops mv88e6190x_ops = {
 	.stats_get_sset_count = mv88e6320_stats_get_sset_count,
 	.stats_get_strings = mv88e6320_stats_get_strings,
 	.stats_get_stats = mv88e6390_stats_get_stats,
+	.reset = mv88e6352_g1_reset,
 };
 
 static const struct mv88e6xxx_ops mv88e6191_ops = {
@@ -3421,6 +3436,7 @@ static const struct mv88e6xxx_ops mv88e6191_ops = {
 	.stats_get_sset_count = mv88e6320_stats_get_sset_count,
 	.stats_get_strings = mv88e6320_stats_get_strings,
 	.stats_get_stats = mv88e6390_stats_get_stats,
+	.reset = mv88e6352_g1_reset,
 };
 
 static const struct mv88e6xxx_ops mv88e6240_ops = {
@@ -3438,6 +3454,7 @@ static const struct mv88e6xxx_ops mv88e6240_ops = {
 	.stats_get_sset_count = mv88e6095_stats_get_sset_count,
 	.stats_get_strings = mv88e6095_stats_get_strings,
 	.stats_get_stats = mv88e6095_stats_get_stats,
+	.reset = mv88e6352_g1_reset,
 };
 
 static const struct mv88e6xxx_ops mv88e6290_ops = {
@@ -3454,6 +3471,7 @@ static const struct mv88e6xxx_ops mv88e6290_ops = {
 	.stats_get_sset_count = mv88e6320_stats_get_sset_count,
 	.stats_get_strings = mv88e6320_stats_get_strings,
 	.stats_get_stats = mv88e6390_stats_get_stats,
+	.reset = mv88e6352_g1_reset,
 };
 
 static const struct mv88e6xxx_ops mv88e6320_ops = {
@@ -3470,6 +3488,7 @@ static const struct mv88e6xxx_ops mv88e6320_ops = {
 	.stats_get_sset_count = mv88e6320_stats_get_sset_count,
 	.stats_get_strings = mv88e6320_stats_get_strings,
 	.stats_get_stats = mv88e6320_stats_get_stats,
+	.reset = mv88e6352_g1_reset,
 };
 
 static const struct mv88e6xxx_ops mv88e6321_ops = {
@@ -3486,6 +3505,7 @@ static const struct mv88e6xxx_ops mv88e6321_ops = {
 	.stats_get_sset_count = mv88e6320_stats_get_sset_count,
 	.stats_get_strings = mv88e6320_stats_get_strings,
 	.stats_get_stats = mv88e6320_stats_get_stats,
+	.reset = mv88e6352_g1_reset,
 };
 
 static const struct mv88e6xxx_ops mv88e6350_ops = {
@@ -3501,6 +3521,7 @@ static const struct mv88e6xxx_ops mv88e6350_ops = {
 	.stats_get_sset_count = mv88e6095_stats_get_sset_count,
 	.stats_get_strings = mv88e6095_stats_get_strings,
 	.stats_get_stats = mv88e6095_stats_get_stats,
+	.reset = mv88e6352_g1_reset,
 };
 
 static const struct mv88e6xxx_ops mv88e6351_ops = {
@@ -3516,6 +3537,7 @@ static const struct mv88e6xxx_ops mv88e6351_ops = {
 	.stats_get_sset_count = mv88e6095_stats_get_sset_count,
 	.stats_get_strings = mv88e6095_stats_get_strings,
 	.stats_get_stats = mv88e6095_stats_get_stats,
+	.reset = mv88e6352_g1_reset,
 };
 
 static const struct mv88e6xxx_ops mv88e6352_ops = {
@@ -3533,6 +3555,7 @@ static const struct mv88e6xxx_ops mv88e6352_ops = {
 	.stats_get_sset_count = mv88e6095_stats_get_sset_count,
 	.stats_get_strings = mv88e6095_stats_get_strings,
 	.stats_get_stats = mv88e6095_stats_get_stats,
+	.reset = mv88e6352_g1_reset,
 };
 
 static const struct mv88e6xxx_ops mv88e6390_ops = {
@@ -3549,6 +3572,7 @@ static const struct mv88e6xxx_ops mv88e6390_ops = {
 	.stats_get_sset_count = mv88e6320_stats_get_sset_count,
 	.stats_get_strings = mv88e6320_stats_get_strings,
 	.stats_get_stats = mv88e6390_stats_get_stats,
+	.reset = mv88e6352_g1_reset,
 };
 
 static const struct mv88e6xxx_ops mv88e6390x_ops = {
@@ -3565,6 +3589,7 @@ static const struct mv88e6xxx_ops mv88e6390x_ops = {
 	.stats_get_sset_count = mv88e6320_stats_get_sset_count,
 	.stats_get_strings = mv88e6320_stats_get_strings,
 	.stats_get_stats = mv88e6390_stats_get_stats,
+	.reset = mv88e6352_g1_reset,
 };
 
 static const struct mv88e6xxx_ops mv88e6391_ops = {
@@ -3581,6 +3606,7 @@ static const struct mv88e6xxx_ops mv88e6391_ops = {
 	.stats_get_sset_count = mv88e6320_stats_get_sset_count,
 	.stats_get_strings = mv88e6320_stats_get_strings,
 	.stats_get_stats = mv88e6390_stats_get_stats,
+	.reset = mv88e6352_g1_reset,
 };
 
 static const struct mv88e6xxx_info mv88e6xxx_table[] = {
diff --git a/drivers/net/dsa/mv88e6xxx/global1.c b/drivers/net/dsa/mv88e6xxx/global1.c
index 5fcf23d..27f37a7 100644
--- a/drivers/net/dsa/mv88e6xxx/global1.c
+++ b/drivers/net/dsa/mv88e6xxx/global1.c
@@ -33,6 +33,41 @@ int mv88e6xxx_g1_wait(struct mv88e6xxx_chip *chip, int reg, u16 mask)
 	return mv88e6xxx_wait(chip, chip->info->global1_addr, reg, mask);
 }
 
+/* Offset 0x04: Switch Global Control Register */
+
+int mv88e6185_g1_reset(struct mv88e6xxx_chip *chip)
+{
+	u16 val;
+	int err;
+
+	/* Set the SWReset bit 15 along with the PPUEn bit 14, to also restart
+	 * the PPU, including re-doing PHY detection and initialization
+	 */
+	err = mv88e6xxx_g1_read(chip, GLOBAL_CONTROL, &val);
+	if (err)
+		return err;
+
+	val |= GLOBAL_CONTROL_SW_RESET;
+	val |= GLOBAL_CONTROL_PPU_ENABLE;
+
+	return mv88e6xxx_g1_write(chip, GLOBAL_CONTROL, val);
+}
+
+int mv88e6352_g1_reset(struct mv88e6xxx_chip *chip)
+{
+	u16 val;
+	int err;
+
+	/* Set the SWReset bit 15 */
+	err = mv88e6xxx_g1_read(chip, GLOBAL_CONTROL, &val);
+	if (err)
+		return err;
+
+	val |= GLOBAL_CONTROL_SW_RESET;
+
+	return mv88e6xxx_g1_write(chip, GLOBAL_CONTROL, val);
+}
+
 /* Offset 0x1c: Global Control 2 */
 
 int mv88e6390_g1_stats_set_histogram(struct mv88e6xxx_chip *chip)
diff --git a/drivers/net/dsa/mv88e6xxx/global1.h b/drivers/net/dsa/mv88e6xxx/global1.h
index df3794c..868a281 100644
--- a/drivers/net/dsa/mv88e6xxx/global1.h
+++ b/drivers/net/dsa/mv88e6xxx/global1.h
@@ -19,6 +19,10 @@
 int mv88e6xxx_g1_read(struct mv88e6xxx_chip *chip, int reg, u16 *val);
 int mv88e6xxx_g1_write(struct mv88e6xxx_chip *chip, int reg, u16 val);
 int mv88e6xxx_g1_wait(struct mv88e6xxx_chip *chip, int reg, u16 mask);
+
+int mv88e6185_g1_reset(struct mv88e6xxx_chip *chip);
+int mv88e6352_g1_reset(struct mv88e6xxx_chip *chip);
+
 int mv88e6xxx_g1_stats_wait(struct mv88e6xxx_chip *chip);
 int mv88e6xxx_g1_stats_snapshot(struct mv88e6xxx_chip *chip, int port);
 int mv88e6320_g1_stats_snapshot(struct mv88e6xxx_chip *chip, int port);
diff --git a/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h b/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
index ab52c37..9e51405 100644
--- a/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
+++ b/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
@@ -765,6 +765,9 @@ struct mv88e6xxx_ops {
 	int (*phy_write)(struct mv88e6xxx_chip *chip, int addr, int reg,
 			 u16 val);
 
+	/* Switch Software Reset */
+	int (*reset)(struct mv88e6xxx_chip *chip);
+
 	/* RGMII Receive/Transmit Timing Control
 	 * Add delay on PHY_INTERFACE_MODE_RGMII_*ID, no delay otherwise.
 	 */
-- 
2.10.2

^ permalink raw reply related

* [PATCH net-next 2/6] net: dsa: mv88e6xxx: add helper to hardware reset
From: Vivien Didelot @ 2016-11-30 22:59 UTC (permalink / raw)
  To: netdev
  Cc: linux-kernel, kernel, David S. Miller, Florian Fainelli,
	Andrew Lunn, Vivien Didelot
In-Reply-To: <20161130225930.25510-1-vivien.didelot@savoirfairelinux.com>

Add an helper to toggle the eventual GPIO connected to the reset pin.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
---
 drivers/net/dsa/mv88e6xxx/chip.c | 22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index c89701e..4a862c2 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -2361,6 +2361,19 @@ static void mv88e6xxx_port_bridge_leave(struct dsa_switch *ds, int port)
 	mutex_unlock(&chip->reg_lock);
 }
 
+static void mv88e6xxx_hardware_reset(struct mv88e6xxx_chip *chip)
+{
+	struct gpio_desc *gpiod = chip->reset;
+
+	/* If there is a GPIO connected to the reset pin, toggle it */
+	if (gpiod) {
+		gpiod_set_value_cansleep(gpiod, 1);
+		usleep_range(10000, 20000);
+		gpiod_set_value_cansleep(gpiod, 0);
+		usleep_range(10000, 20000);
+	}
+}
+
 static int mv88e6xxx_disable_ports(struct mv88e6xxx_chip *chip)
 {
 	int i, err;
@@ -2385,7 +2398,6 @@ static int mv88e6xxx_switch_reset(struct mv88e6xxx_chip *chip)
 {
 	bool ppu_active = mv88e6xxx_has(chip, MV88E6XXX_FLAG_PPU_ACTIVE);
 	u16 is_reset = (ppu_active ? 0x8800 : 0xc800);
-	struct gpio_desc *gpiod = chip->reset;
 	unsigned long timeout;
 	u16 reg;
 	int err;
@@ -2394,13 +2406,7 @@ static int mv88e6xxx_switch_reset(struct mv88e6xxx_chip *chip)
 	if (err)
 		return err;
 
-	/* If there is a gpio connected to the reset pin, toggle it */
-	if (gpiod) {
-		gpiod_set_value_cansleep(gpiod, 1);
-		usleep_range(10000, 20000);
-		gpiod_set_value_cansleep(gpiod, 0);
-		usleep_range(10000, 20000);
-	}
+	mv88e6xxx_hardware_reset(chip);
 
 	/* Reset the switch. Keep the PPU active if requested. The PPU
 	 * needs to be active to support indirect phy register access
-- 
2.10.2

^ permalink raw reply related

* [PATCH net-next 1/6] net: dsa: mv88e6xxx: add helper to disable ports
From: Vivien Didelot @ 2016-11-30 22:59 UTC (permalink / raw)
  To: netdev
  Cc: linux-kernel, kernel, David S. Miller, Florian Fainelli,
	Andrew Lunn, Vivien Didelot
In-Reply-To: <20161130225930.25510-1-vivien.didelot@savoirfairelinux.com>

Before resetting a switch, the ports should be set to the Disabled state
and the transmit queues should be drained.

Add an helper to explicit that.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
---
 drivers/net/dsa/mv88e6xxx/chip.c | 34 +++++++++++++++++++++++-----------
 1 file changed, 23 insertions(+), 11 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index ce2f7ff..c89701e 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -2361,6 +2361,26 @@ static void mv88e6xxx_port_bridge_leave(struct dsa_switch *ds, int port)
 	mutex_unlock(&chip->reg_lock);
 }
 
+static int mv88e6xxx_disable_ports(struct mv88e6xxx_chip *chip)
+{
+	int i, err;
+
+	/* Set all ports to the Disabled state */
+	for (i = 0; i < mv88e6xxx_num_ports(chip); i++) {
+		err = mv88e6xxx_port_set_state(chip, i,
+					       PORT_CONTROL_STATE_DISABLED);
+		if (err)
+			return err;
+	}
+
+	/* Wait for transmit queues to drain,
+	 * i.e. 2ms for a maximum frame to be transmitted at 10 Mbps.
+	 */
+	usleep_range(2000, 4000);
+
+	return 0;
+}
+
 static int mv88e6xxx_switch_reset(struct mv88e6xxx_chip *chip)
 {
 	bool ppu_active = mv88e6xxx_has(chip, MV88E6XXX_FLAG_PPU_ACTIVE);
@@ -2369,18 +2389,10 @@ static int mv88e6xxx_switch_reset(struct mv88e6xxx_chip *chip)
 	unsigned long timeout;
 	u16 reg;
 	int err;
-	int i;
 
-	/* Set all ports to the disabled state. */
-	for (i = 0; i < mv88e6xxx_num_ports(chip); i++) {
-		err = mv88e6xxx_port_set_state(chip, i,
-					       PORT_CONTROL_STATE_DISABLED);
-		if (err)
-			return err;
-	}
-
-	/* Wait for transmit queues to drain. */
-	usleep_range(2000, 4000);
+	err = mv88e6xxx_disable_ports(chip);
+	if (err)
+		return err;
 
 	/* If there is a gpio connected to the reset pin, toggle it */
 	if (gpiod) {
-- 
2.10.2

^ permalink raw reply related

* [PATCH net-next 0/6] net: dsa: mv88e6xxx: rework reset and PPU code
From: Vivien Didelot @ 2016-11-30 22:59 UTC (permalink / raw)
  To: netdev
  Cc: linux-kernel, kernel, David S. Miller, Florian Fainelli,
	Andrew Lunn, Vivien Didelot

Old Marvell chips (like 88E6060) don't have a PHY Polling Unit (PPU).

Next chips (like 88E6185) have a PPU, which has exclusive access to the
PHY registers, thus must be disabled before access.

Newer chips (like 88E6352) have an indirect mechanism to access the PHY
registers whenever, thus loose control over the PPU (alway enabled).

Here's a summary:

Model | PPU? | Has PPU ctrl?  | PPU state readable? | PHY access
----- | ---- | -------------- | ------------------- | ----------
 6060 | no   | no             | no                  | direct
 6185 | yes  | yes, PPUEn bit | yes, PPUState 2-bit | direct w/ PPU dis.
 6352 | yes  | no             | yes, PPUState 1-bit | indirect
 6390 | yes  | no             | yes, InitState bit  | indirect

Depending on the PPU control, a switch may have to restart the PPU when
resetting the switch. Once the switch is reset, we must wait for the PPU
state to be active polling again before accessing the registers.

For that purpose, add new operations to the chips to enable/disable the
PPU and check its polling state, and execute software reset. With these
new ops in place, rework the switch reset code and finally get rid of
the MV88E6XXX_FLAG_PPU* flags.

Vivien Didelot (6):
  net: dsa: mv88e6xxx: add helper to disable ports
  net: dsa: mv88e6xxx: add helper to hardware reset
  net: dsa: mv88e6xxx: add a software reset op
  net: dsa: mv88e6xxx: add a PPU polling op
  net: dsa: mv88e6xxx: add helper for switch ready
  net: dsa: mv88e6xxx: add PPU enable/disable ops

 drivers/net/dsa/mv88e6xxx/chip.c      | 241 ++++++++++++++++++++++------------
 drivers/net/dsa/mv88e6xxx/global1.c   | 111 ++++++++++++++++
 drivers/net/dsa/mv88e6xxx/global1.h   |  10 ++
 drivers/net/dsa/mv88e6xxx/mv88e6xxx.h |  35 ++---
 4 files changed, 292 insertions(+), 105 deletions(-)

-- 
2.10.2

^ permalink raw reply

* Re: [WIP] net+mlx4: auto doorbell
From: Eric Dumazet @ 2016-11-30 22:40 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Rick Jones, netdev, Saeed Mahameed, Tariq Toukan, Achiad Shochat
In-Reply-To: <20161130233015.3de95356@redhat.com>

On Wed, 2016-11-30 at 23:30 +0100, Jesper Dangaard Brouer wrote:
> On Wed, 30 Nov 2016 11:30:00 -0800
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > On Wed, 2016-11-30 at 20:17 +0100, Jesper Dangaard Brouer wrote:
> > 
> > > Don't take is as critique Eric.  I was hoping your patch would have
> > > solved this issue of being sensitive to TX completion adjustments.  You
> > > usually have good solutions for difficult issues. I basically rejected
> > > Achiad's approach/patch because it was too sensitive to these kind of
> > > adjustments.  
> > 
> > Well, this patch can hurt latencies, because a doorbell can be delayed,
> > and softirqs can be delayed by many hundred of usec in some cases.
> > 
> > I would not enable this behavior by default.
> 
> What about another scheme, where dev_hard_start_xmit() can return an
> indication that driver choose not to flush (based on TX queue depth),
> and there by requesting stack to call flush at a later point.
> 
> Would that introduce less latency issues?


Again, how is it going to change anything when your netperf UDP sends
one packet at a time ?

qdisc gets one packet , dequeue it immediately.

No pressure. -> doorbell will be sent as before.

Some TX producers do not even use a qdisc to begin with.

Please think of a solution that does not involve qdisc layer at all.

^ permalink raw reply

* [PATCH net] packet: fix race condition in packet_set_ring
From: Eric Dumazet @ 2016-11-30 22:55 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Philip Pettersson

From: Philip Pettersson <philip.pettersson@gmail.com>

When packet_set_ring creates a ring buffer it will initialize a
struct timer_list if the packet version is TPACKET_V3. This value
can then be raced by a different thread calling setsockopt to
set the version to TPACKET_V1 before packet_set_ring has finished.

This leads to a use-after-free on a function pointer in the
struct timer_list when the socket is closed as the previously
initialized timer will not be deleted.

The bug is fixed by taking lock_sock(sk) in packet_setsockopt when
changing the packet version while also taking the lock at the start
of packet_set_ring.

Fixes: f6fb8f100b80 ("af-packet: TPACKET_V3 flexible buffer implementation.")
Signed-off-by: Philip Pettersson <philip.pettersson@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/packet/af_packet.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index d2238b204691b8e4f2e3acb9bc167b553ba32d50..dd2332390c45bbff7c3fc5d259453f2e1ca352bf 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -3648,19 +3648,25 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv
 
 		if (optlen != sizeof(val))
 			return -EINVAL;
-		if (po->rx_ring.pg_vec || po->tx_ring.pg_vec)
-			return -EBUSY;
 		if (copy_from_user(&val, optval, sizeof(val)))
 			return -EFAULT;
 		switch (val) {
 		case TPACKET_V1:
 		case TPACKET_V2:
 		case TPACKET_V3:
-			po->tp_version = val;
-			return 0;
+			break;
 		default:
 			return -EINVAL;
 		}
+		lock_sock(sk);
+		if (po->rx_ring.pg_vec || po->tx_ring.pg_vec) {
+			ret = -EBUSY;
+		} else {
+			po->tp_version = val;
+			ret = 0;
+		}
+		release_sock(sk);
+		return ret;
 	}
 	case PACKET_RESERVE:
 	{
@@ -4164,6 +4170,7 @@ static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u,
 	/* Added to avoid minimal code churn */
 	struct tpacket_req *req = &req_u->req;
 
+	lock_sock(sk);
 	/* Opening a Tx-ring is NOT supported in TPACKET_V3 */
 	if (!closing && tx_ring && (po->tp_version > TPACKET_V2)) {
 		net_warn_ratelimited("Tx-ring is not supported.\n");
@@ -4245,7 +4252,6 @@ static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u,
 			goto out;
 	}
 
-	lock_sock(sk);
 
 	/* Detach socket from network */
 	spin_lock(&po->bind_lock);
@@ -4294,11 +4300,11 @@ static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u,
 		if (!tx_ring)
 			prb_shutdown_retire_blk_timer(po, rb_queue);
 	}
-	release_sock(sk);
 
 	if (pg_vec)
 		free_pg_vec(pg_vec, order, req->tp_block_nr);
 out:
+	release_sock(sk);
 	return err;
 }
 

^ permalink raw reply related

* Initial thoughts on TXDP
From: Tom Herbert @ 2016-11-30 22:54 UTC (permalink / raw)
  To: Linux Kernel Network Developers

Posting for discussion....

Now that XDP seems to be nicely gaining traction we can start to
consider the next logical step which is to apply the principles of XDP
to accelerating transport protocols in the kernel. For lack of a
better name I'll refer to this as Transport eXpress Data Path, or just
TXDP :-). Pulling off TXDP might not be the most trivial of problems
to solve, but if we can this may address the performance gap between
kernel bypass and the stack for transport layer protocols (XDP
addresses the performance gap for stateless packet processing). The
problem statement is analogous to that which we had for XDP, namely
can we create a mode in the kernel that offer the same performance
that is seen with L4 protocols over kernel bypass (e.g. TCP/OpenOnload
or TCP/DPDK) or perhaps something reasonably close to a full HW
offload solution (such as RDMA)?

TXDP is different from XDP in that we are dealing with stateful
protocols and is part of a full protocol implementation, specifically
this would be an accelerated mode of transport connections (e.g. TCP)
in the kernel. Also, unlike XDP we now need to be concerned with
transmit path (both application generating packets as well as protocol
sourced packets like ACKs, retransmits, clocking out data, etc.).
Another distinction is that the user API needs to be considered, for
instance optimizing the nominal protocol stack but then using an
unmodified socket interface could easily undo the effects of
optimizing the lower layers. This last point actually implies a nice
constraint, if we can't keep the accelerated path simple its probably
not worth trying to accelerate.

One simplifying assumption we might make is that TXDP is primarily for
optimizing latency, specifically request/response type operations
(think HPC, HFT, flash server, or other tightly coupled communications
within the datacenter). Notably, I don't think that saving CPU is as
relevant to TXDP, in fact we have already seen that CPU utilization
can be traded off for lower latency via spin polling. Similar to XDP
though, we might assume that single CPU performance is relevant (i.e.
on a cache server we'd like to spin as few CPUs as needed and no more
to handle the load an maintain throughput and latency requirements).
High throughput (ops/sec) and low variance should be side effects of
any design.

As with XDP, TXDP is _not_ intended to be a completely generic and
transparent solution. The application may be specifically optimized
for use with TXDP (for instance to implement perfect lockless
silo'ing). So TXDP is not going to be for everyone and it should be as
minimally invasive to the rest of the stack as possible.

I imagine there are a few reasons why userspace TCP stacks can get
good performance:

- Spin polling (we already can do this in kernel)
- Lockless, I would assume that threads typically have exclusive
access to a queue pair for a connection
- Minimal TCP/IP stack code
- Zero copy TX/RX
- Light weight structures for queuing
- No context switches
- Fast data path for in order, uncongested flows
- Silo'ing between application and device queues

Not all of these have cognates in the Linux stack, for instance we
probably can't entirely eliminate context switches for a userspace
application.

So with that the components of TXDP might look something like:

RX

  - Call into TCP/IP stack with page data directly from driver-- no
skbuff allocation or interface. This is essentially provided by the
XDP API although we would need to generalize the interface to call
stack functions (I previously posted patches for that). We will also
need a new action, XDP_HELD?, that indicates the XDP function held the
packet (put on a socket for instance).
  - Perform connection lookup. If we assume lockless model as
described below then we should be able to perform lockless connection
lookup similar to work Eric did to optimize UDP lookups for tunnel
processing
  - Call function that implements expedited TCP/IP datapath (something
like Van Jacobson's famous 80 instructions? :-) ).
  - If there is anything funky about the packet, connection state, or
TCP connection is not being TXDP accelerated just return XDP_PASS so
that packet follows normal stack processing. Since we did connection
lookup we could return an early demux also.Since we're already in an
exception mode this is where we might want to move packet processing
to different CPU (can be done by RPS/RFS)..
  - If packet contains new data we can allocate a "mini skbfuf (talked
about that at netdev) for queuing on socket.
  - If packet is an ACK we can process it directly without ever creating skbuff
  - There is also the possibility of avoiding the skbuff allocation
for in-kernel applications. Stream parser might also be taught how to
deal with raw buffers.
  - If we're really ambitious we can also consider putting packets
into a packet ring for user space presuming that packets are typically
in order (might be a little orthogonal to TXDP.

TX

  - Normal TX socket options apply however they might be lockless
under locking constraints below
  - skbuff is required for keeping data on socket a TCP (not
necessarily for UDP though), but we might be able to use a mini skbuff
for this
  - We'd need an interface to transmit a packet in a page buffer
without an skbuff.
  - When we transmit, it would be nice to go straight from TCP
connection to an XDP device queue and in particular skip the qdisc
layer. This follows the principle of low latency being first criteria.
Effective use of qdiscs, especially non work conserving ones, imply
longer latencies in effect which likely means TXDP isn't appropriate
in such a cases. BQL is also out, however we would want the TX
batching of XDP.
  - If we really need priority queing we should have the option of
using device mulitple queue
  - Pure ACKs would not require an skb also
  - Zero copy TX (splice) can be used where needed. This might be
particular useful in a flash or remote memory server
  - TX completion would most like be happening on same CPU also

Miscellaneous

  - Under the simplicity principle we really only want the TXDP to
contain the minimal necessary path. What is "minimal" is a very
relevant question. If we constrain the use case to be communications
within a single rack such that we can engineer for basically no loss
and no congestion (pause might be reasonable here) then the TXDP data
path might just implement that. Mostly this would just be the
established data path that is accelerated, handling other states might
be done in existing path (which becomes slow path in TXDP).
  - To make transport sockets to have a lockless mode I am
contemplating that connections/sockets can be bound to particularly
CPUs and that any operations (socket operations, timers, receive
processing) must occur on that CPU. The CPU would be the one where RX
happens. Note this implies perfect silo'ing, everything for driver RX
to application processing happens inline on the CPU. The stack would
not cross CPUs for a connection while in this mode.
  - We might be able to take advantage of per connection queues or
ntuple filter to isolate flows to to certain queues and hence certain
CPUs. I don't think this can be a requirement though, TXDP should be
able to work well with generic MQ devices. Specialized devices queues
are an optimization, without having those those we'd want to push to
other CPUs as quickly as possible (RPS but maybe we can get away with
a mini-skbuff here).

Thanks,
Tom

^ permalink raw reply

* Re: [PATCH net-next] bpf, xdp: drop rcu_read_lock from bpf_prog_run_xdp and move to caller
From: Jakub Kicinski @ 2016-11-30 22:50 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: davem, alexei.starovoitov, saeedm, Yuval.Mintz, netdev
In-Reply-To: <b70c47e0284823e6a5db600ae75c01eac4cf7922.1480538565.git.daniel@iogearbox.net>

On Wed, 30 Nov 2016 22:16:06 +0100, Daniel Borkmann wrote:
> After 326fe02d1ed6 ("net/mlx4_en: protect ring->xdp_prog with rcu_read_lock"),
> the rcu_read_lock() in bpf_prog_run_xdp() is superfluous, since callers
> need to hold rcu_read_lock() already to make sure BPF program doesn't
> get released in the background.
> 
> Thus, drop it from bpf_prog_run_xdp(), as it can otherwise be misleading.
> Still keeping the bpf_prog_run_xdp() is useful as it allows for grepping
> in XDP supported drivers and to keep the typecheck on the context intact.
> For mlx4, this means we don't have a double rcu_read_lock() anymore. nfp can
> just make use of bpf_prog_run_xdp(), too. For qede, just move rcu_read_lock()
> out of the helper. When the driver gets atomic replace support, this will
> move to call-sites eventually.
> 
> mlx5 needs actual fixing as it has the same issue as described already in
> 326fe02d1ed6 ("net/mlx4_en: protect ring->xdp_prog with rcu_read_lock"),
> that is, we're under RCU bh at this time, BPF programs are released via
> call_rcu(), and call_rcu() != call_rcu_bh(), so we need to properly mark
> read side as programs can get xchg()'ed in mlx5e_xdp_set() without queue
> reset.
> 
> Fixes: 86994156c736 ("net/mlx5e: XDP fast RX drop bpf programs support")
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Acked-by: Alexei Starovoitov <ast@kernel.org>

FWIW:

Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>

Thanks!

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox