Netdev List


* [PATCH net-next] tcp: fix a stale ooo_last_skb after a replace
From: Eric Dumazet @ 2016-09-14  5:55 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Yuchung Cheng, Yaogong Wang

From: Eric Dumazet <edumazet@google.com>

When an skb replaces another one in the ooo queue, I forgot to also
update tp->ooo_last_skb if the replaced skb was the last one
in the queue.

To fix this, we can simply reuse the code that runs after an insertion,
trying to merge skbs to the right of the current skb.

This not only fixes the bug, but also removes all small skbs that might
be a subset of the new one.

Example:

We receive segments 2001:3001,  4001:5001

Then we receive 2001:8001: we should replace 2001:3001 with the big
skb, but also remove 4001:5001 from the queue to save space.
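
The example above can be sketched in plain userspace C. This is only an illustration under stated assumptions, not the kernel code: the queue is a small sorted array instead of an RB tree, sequence numbers are compared without wraparound handling, and the names (struct ooo_queue, ooo_insert) are hypothetical. Because an array has no separate "last" pointer, the sketch shows the merge-right step only, not the stale tp->ooo_last_skb itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define OOO_MAX 16

/* One out-of-order segment, half-open range [start, end). */
struct seg {
	uint32_t start, end;
};

/* Hypothetical stand-in for the ooo queue: segments sorted by start. */
struct ooo_queue {
	struct seg segs[OOO_MAX];
	size_t len;
};

/*
 * Insert [start, end). Returns 0 on success, -1 if the segment is
 * dropped (already covered by a queued segment, or the queue is full).
 */
int ooo_insert(struct ooo_queue *q, uint32_t start, uint32_t end)
{
	size_t i = 0, j;

	assert(start < end);

	/* Find the first queued segment whose start is >= the new start. */
	while (i < q->len && q->segs[i].start < start)
		i++;

	/* Left neighbour already covers the new segment: drop it. */
	if (i > 0 && q->segs[i - 1].end >= end)
		return -1;

	/* Same start and at least as long: the queued one covers it. */
	if (i < q->len && q->segs[i].start == start && q->segs[i].end >= end)
		return -1;

	/* Merge right: skip every following segment the new one covers. */
	j = i;
	while (j < q->len && q->segs[j].end <= end)
		j++;

	/* Nothing was covered, so one extra slot is needed. */
	if (j == i && q->len == OOO_MAX)
		return -1;

	/* Replace segs[i..j) with the single new segment. */
	memmove(&q->segs[i + 1], &q->segs[j],
		(q->len - j) * sizeof(q->segs[0]));
	q->segs[i].start = start;
	q->segs[i].end = end;
	q->len = q->len - (j - i) + 1;
	return 0;
}
```

Inserting 2001:3001 and 4001:5001 and then 2001:8001 leaves a single 2001:8001 entry in this model, which is the space saving described above.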

packetdrill test demonstrating the bug

0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

+0 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
+0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
+0.100 < . 1:1(0) ack 1 win 1024
+0 accept(3, ..., ...) = 4

+0.01 < . 1001:2001(1000) ack 1 win 1024
+0    > . 1:1(0) ack 1 <nop,nop, sack 1001:2001>

+0.01 < . 1001:3001(2000) ack 1 win 1024
+0    > . 1:1(0) ack 1 <nop,nop, sack 1001:2001 1001:3001>


Fixes: 9f5afeae5152 ("tcp: use an RB tree for ooo receive queue")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Yuchung Cheng <ycheng@google.com>
Cc: Yaogong Wang <wygivan@google.com>
---
 net/ipv4/tcp_input.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 70b892db99018fb42ab38ab7e5ce0dab498f9571..dad3e7eeed94b6f76f4bef4812c5d0fe9944e5f0 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4502,7 +4502,7 @@ coalesce_done:
 				NET_INC_STATS(sock_net(sk),
 					      LINUX_MIB_TCPOFOMERGE);
 				__kfree_skb(skb1);
-				goto add_sack;
+				goto merge_right;
 			}
 		} else if (tcp_try_coalesce(sk, skb1, skb, &fragstolen)) {
 			goto coalesce_done;
@@ -4514,6 +4514,7 @@ insert:
 	rb_link_node(&skb->rbnode, parent, p);
 	rb_insert_color(&skb->rbnode, &tp->out_of_order_queue);
 
+merge_right:
 	/* Remove other segments covered by skb. */
 	while ((q = rb_next(&skb->rbnode)) != NULL) {
 		skb1 = rb_entry(q, struct sk_buff, rbnode);


* Re: Modification to skb->queue_mapping affecting performance
From: Eric Dumazet @ 2016-09-14  5:22 UTC (permalink / raw)
  To: Michael Ma; +Cc: netdev
In-Reply-To: <CAAmHdhxGBBOCebkvq43pwsaZ8HcULZmXnikQAtc2uLBMWZgXjA@mail.gmail.com>

On Tue, 2016-09-13 at 22:13 -0700, Michael Ma wrote:

> I don't intend to install multiple qdisc - the only reason that I'm
> doing this now is to leverage MQ to workaround the lock contention,
> and based on the profile this all worked. However to simplify the way
> to setup HTB I wanted to use TXQ to partition HTB classes so that a
> HTB class only belongs to one TXQ, which also requires mapping skb to
> TXQ using some rules (here I'm using priority but I assume it's
> straightforward to use other information such as classid). And the
> problem I found here is that when using priority to infer the TXQ so
> that queue_mapping is changed, bandwidth is affected significantly -
> the only thing I can guess is that due to queue switch, there are more
> cache misses assuming processor cores have a static mapping to all the
> queues. Any suggestion on what to do next for the investigation?
> 
> I would also guess that this should be a common problem if anyone
> wants to use MQ+IFB to workaround the qdisc lock contention on the
> receiver side and classful qdisc is used on IFB, but haven't really
> found a similar thread here...

But why are you changing the queue ?

NIC already does the proper RSS thing, meaning all packets of one flow
should land on one RX queue. No need to classify yourself and risk
lock contention.
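
To illustrate why a per-flow hash keeps a flow on one queue, here is a minimal sketch. It is an assumption-laden stand-in: real NICs typically use a keyed Toeplitz hash and an indirection table, whereas this uses FNV-1a, and struct flow4 and flow_to_queue are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical IPv4 flow key: addresses and ports. */
struct flow4 {
	uint32_t saddr, daddr;
	uint16_t sport, dport;
};

/* 32-bit FNV-1a over a byte range, continuing from hash value h. */
static uint32_t fnv1a(uint32_t h, const void *data, size_t len)
{
	const uint8_t *p = data;

	while (len--) {
		h ^= *p++;
		h *= 16777619u;
	}
	return h;
}

/* Map a flow to a queue index in [0, nqueues). */
unsigned int flow_to_queue(const struct flow4 *f, unsigned int nqueues)
{
	uint32_t h = 2166136261u;	/* FNV offset basis */

	assert(nqueues > 0);

	/* Hash field by field so struct padding never enters the hash. */
	h = fnv1a(h, &f->saddr, sizeof(f->saddr));
	h = fnv1a(h, &f->daddr, sizeof(f->daddr));
	h = fnv1a(h, &f->sport, sizeof(f->sport));
	h = fnv1a(h, &f->dport, sizeof(f->dport));

	return h % nqueues;
}
```

Every packet carrying the same addresses and ports yields the same index, so no extra classification step is needed to keep a flow on a single queue.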

I use IFB + MQ + netem every day, and it scales to 10 Mpps with no
problem.

Do you really need to rate limit flows? It is not clear what your goals
are, or why, for example, you use HTB to begin with.


* RE: [PATCH net-next 2/3] net: ethernet: mediatek: add ethtool functions to configure RX flows of HW LRO
From: Nelson Chang @ 2016-09-14  5:22 UTC (permalink / raw)
  To: f.fainelli, john, davem; +Cc: nbd, netdev, linux-mediatek, nelsonch.tw

(resend)

Thanks Florian for the review!
I will add an ndo_fix_features hook in v2 to prevent the case where a user
wants to turn off NETIF_F_LRO while an RX flow is programmed.
If any programmed RX flow exists, NETIF_F_LRO cannot be turned off.
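
The rule stated above could look roughly like this. It is a userspace sketch, not the v2 patch: the feature type and the NETIF_F_LRO bit value are stand-ins for the kernel definitions, and struct sketch_lro_state and sketch_fix_features are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the kernel definitions; the bit value is illustrative. */
typedef uint64_t netdev_features_t;
#define NETIF_F_LRO	((netdev_features_t)1 << 15)

/* Hypothetical per-device state: number of programmed HW LRO RX flows. */
struct sketch_lro_state {
	int rx_flow_count;
};

/*
 * Return the feature set that may actually be applied: a request to
 * clear NETIF_F_LRO is overridden while any RX flow is programmed.
 */
netdev_features_t sketch_fix_features(const struct sketch_lro_state *st,
				      netdev_features_t requested)
{
	assert(st != NULL);

	if (!(requested & NETIF_F_LRO) && st->rx_flow_count > 0)
		requested |= NETIF_F_LRO;

	return requested;
}
```

With one flow programmed, a request without NETIF_F_LRO comes back with the bit set again; with no flows programmed, the request passes through unchanged.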

-----Original Message-----
From: Florian Fainelli [mailto:f.fainelli@gmail.com]
Sent: Wednesday, September 14, 2016 2:27 AM
To: Nelson Chang (張家祥); john@phrozen.org; davem@davemloft.net
Cc: nbd@openwrt.org; netdev@vger.kernel.org;
linux-mediatek@lists.infradead.org; nelsonch.tw@gmail.com
Subject: Re: [PATCH net-next 2/3] net: ethernet: mediatek: add ethtool
functions to configure RX flows of HW LRO

On 09/13/2016 06:54 AM, Nelson Chang wrote:
> Add ethtool functions to set RX flows for HW LRO. Because the HW LRO
> hardware can only recognize the destination IP of TCP/IP RX flows, the
> ethtool command to add a HW LRO flow is as below:
> ethtool -N [devname] flow-type tcp4 dst-ip [ip_addr] loc [0~1]
> 
> Also, because the hardware can hold four destination IPs in total, each
> GMAC (GMAC1/GMAC2) can set at most two IPs.
> 
> Signed-off-by: Nelson Chang <nelson.chang@mediatek.com>
> ---

> +
> +static int mtk_set_features(struct net_device *dev, netdev_features_t features)
> +{
> +	int err = 0;
> +
> +	if (!((dev->features ^ features) & NETIF_F_LRO))
> +		return 0;
> +
> +	if (!(features & NETIF_F_LRO))
> +		mtk_hwlro_netdev_disable(dev);

You may want to implement a fix_features ndo operation which makes sure
that NETIF_F_LRO stays turned on while an RX flow is programmed;
otherwise, it may be confusing to the user that a flow was programmed
but no offload is happening.


* RE: [PATCH net-next 3/3] net: ethernet: mediatek: add dts configuration to enable HW LRO
From: Nelson Chang @ 2016-09-14  5:20 UTC (permalink / raw)
  To: f.fainelli, john, davem; +Cc: nbd, netdev, linux-mediatek, nelsonch.tw

(resend)

The property description you suggested is more precise.
The property describes a capability, namely whether the hardware
supports LRO. I'll rephrase the property description in v2.

Thanks Florian!

-----Original Message-----
From: Florian Fainelli [mailto:f.fainelli@gmail.com]
Sent: Wednesday, September 14, 2016 2:25 AM
To: Nelson Chang (張家祥); john@phrozen.org; davem@davemloft.net
Cc: nbd@openwrt.org; netdev@vger.kernel.org;
linux-mediatek@lists.infradead.org; nelsonch.tw@gmail.com
Subject: Re: [PATCH net-next 3/3] net: ethernet: mediatek: add dts
configuration to enable HW LRO

On 09/13/2016 06:54 AM, Nelson Chang wrote:
> Add the configuration of HW LRO in the binding document.
> 
> Signed-off-by: Nelson Chang <nelson.chang@mediatek.com>
> ---
>  Documentation/devicetree/bindings/net/mediatek-net.txt | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/Documentation/devicetree/bindings/net/mediatek-net.txt b/Documentation/devicetree/bindings/net/mediatek-net.txt
> index 32eaaca..f43c0d1 100644
> --- a/Documentation/devicetree/bindings/net/mediatek-net.txt
> +++ b/Documentation/devicetree/bindings/net/mediatek-net.txt
> @@ -20,6 +20,7 @@ Required properties:
>  - mediatek,ethsys: phandle to the syscon node that handles the port setup
>  - mediatek,pctl: phandle to the syscon node that handles the ports slew rate
>  	and driver current
> +- mediatek,hwlro: set to enable HW LRO functions of PDMA rx rings

That sounds like implementing an enable/disable policy in the Device Tree
as opposed to providing an indication as to whether the HW supports LRO
or not. If all versions of the hardware support LRO, then you would
rather let the users change NETIF_F_LRO using ethtool features instead
of having this be defined in the Device Tree.

If, on the other hand, not all versions of the HW support LRO, then you
would just want to rephrase the property description to say this
describes a capability.


* Re: Modification to skb->queue_mapping affecting performance
From: Michael Ma @ 2016-09-14  5:19 UTC (permalink / raw)
  To: Eric Dumazet, Cong Wang; +Cc: netdev
In-Reply-To: <CAAmHdhxGBBOCebkvq43pwsaZ8HcULZmXnikQAtc2uLBMWZgXjA@mail.gmail.com>

2016-09-13 22:13 GMT-07:00 Michael Ma <make0818@gmail.com>:
> 2016-09-13 18:18 GMT-07:00 Eric Dumazet <eric.dumazet@gmail.com>:
>> On Tue, 2016-09-13 at 17:23 -0700, Michael Ma wrote:
>>
>>> If I understand correctly this is still to associate a qdisc with each
>>> ifb TXQ. How should I do this if I want to use HTB? I guess I'll need
>>> to divide the bandwidth of each class in HTB by the number of TX
>>> queues for each individual HTB qdisc associated?
>>>
>>> My original idea was to attach a HTB qdisc for each ifb queue
>>> representing a set of flows not sharing bandwidth with others so that
>>> root lock contention still happens but only affects flows in the same
>>> HTB. Did I understand the root lock contention issue incorrectly for
>>> ifb? I do see some comments in __dev_queue_xmit() about using a
>>> different code path for software devices which bypasses
>>> __dev_xmit_skb(). Does this mean ifb won't go through
>>> __dev_xmit_skb()?
>>
>> You can install HTB on all of your MQ children for sure.
>>
>> Again, there is no qdisc lock contention if you properly use MQ.
>>
>> Now if you _need_ to install a single qdisc for whatever reason, then
>> maybe you want to use a single rx queue on the NIC, to reduce lock
>> contention ;)

Yes - this might reduce lock contention but there would still be
contention and I'm really looking for more concurrency...

>>
>>
> I don't intend to install multiple qdisc - the only reason that I'm
> doing this now is to leverage MQ to workaround the lock contention,
> and based on the profile this all worked. However to simplify the way
> to setup HTB I wanted to use TXQ to partition HTB classes so that a
> HTB class only belongs to one TXQ, which also requires mapping skb to
> TXQ using some rules (here I'm using priority but I assume it's
> straightforward to use other information such as classid). And the
> problem I found here is that when using priority to infer the TXQ so
> that queue_mapping is changed, bandwidth is affected significantly -
> the only thing I can guess is that due to queue switch, there are more
> cache misses assuming processor cores have a static mapping to all the
> queues. Any suggestion on what to do next for the investigation?
>
> I would also guess that this should be a common problem if anyone
> wants to use MQ+IFB to workaround the qdisc lock contention on the
> receiver side and classful qdisc is used on IFB, but haven't really
> found a similar thread here...

Hi Cong - I saw quite a few threads from you regarding ingress qdisc
+ MQ and issues with queue_mapping. Do you by any chance have a similar
setup (classful qdiscs associated with the IFB queues, which requires
modifying queue_mapping so that the qdisc selection is done at
queue selection time based on information such as skb
priority/classid)? I would appreciate any suggestions.


* Re: Modification to skb->queue_mapping affecting performance
From: Michael Ma @ 2016-09-14  5:13 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1473815917.18970.266.camel@edumazet-glaptop3.roam.corp.google.com>

2016-09-13 18:18 GMT-07:00 Eric Dumazet <eric.dumazet@gmail.com>:
> On Tue, 2016-09-13 at 17:23 -0700, Michael Ma wrote:
>
>> If I understand correctly this is still to associate a qdisc with each
>> ifb TXQ. How should I do this if I want to use HTB? I guess I'll need
>> to divide the bandwidth of each class in HTB by the number of TX
>> queues for each individual HTB qdisc associated?
>>
>> My original idea was to attach a HTB qdisc for each ifb queue
>> representing a set of flows not sharing bandwidth with others so that
>> root lock contention still happens but only affects flows in the same
>> HTB. Did I understand the root lock contention issue incorrectly for
>> ifb? I do see some comments in __dev_queue_xmit() about using a
>> different code path for software devices which bypasses
>> __dev_xmit_skb(). Does this mean ifb won't go through
>> __dev_xmit_skb()?
>
> You can install HTB on all of your MQ children for sure.
>
> Again, there is no qdisc lock contention if you properly use MQ.
>
> Now if you _need_ to install a single qdisc for whatever reason, then
> maybe you want to use a single rx queue on the NIC, to reduce lock
> contention ;)
>
>
I don't intend to install multiple qdiscs - the only reason that I'm
doing this now is to leverage MQ to work around the lock contention,
and based on the profile this all worked. However, to simplify the HTB
setup I wanted to use TXQs to partition HTB classes so that an
HTB class belongs to only one TXQ, which also requires mapping skbs to
TXQs using some rules (here I'm using priority, but I assume it's
straightforward to use other information such as classid). The
problem I found is that when priority is used to infer the TXQ, so
that queue_mapping is changed, bandwidth is affected significantly.
The only thing I can guess is that, due to the queue switch, there are
more cache misses, assuming processor cores have a static mapping to all
the queues. Any suggestions on what to do next for the investigation?

I would also guess that this should be a common problem if anyone
wants to use MQ+IFB to work around the qdisc lock contention on the
receiver side and a classful qdisc is used on IFB, but I haven't really
found a similar thread here...


* Re: [PATCH v5 0/6] Add eBPF hooks for cgroups
From: Alexei Starovoitov @ 2016-09-14  4:42 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: Daniel Mack, htejun, daniel, ast, davem, kafai, fw, harald,
	netdev, sargun, cgroups
In-Reply-To: <20160913172408.GC6138@salvia>

On Tue, Sep 13, 2016 at 07:24:08PM +0200, Pablo Neira Ayuso wrote:
> On Tue, Sep 13, 2016 at 03:31:20PM +0200, Daniel Mack wrote:
> > Hi,
> > 
> > On 09/13/2016 01:56 PM, Pablo Neira Ayuso wrote:
> > > On Mon, Sep 12, 2016 at 06:12:09PM +0200, Daniel Mack wrote:
> > >> This is v5 of the patch set to allow eBPF programs for network
> > >> filtering and accounting to be attached to cgroups, so that they apply
> > >> to all sockets of all tasks placed in that cgroup. The logic can also
> > >> be extended for other cgroup based eBPF logic.
> > > 
> > > 1) This infrastructure can only be useful to systemd, or any similar
> > >    orchestration daemon. Look, you can only apply filtering policies
> > >    to processes that are launched by systemd, so this only works
> > >    for server processes.
> > 
> > Sorry, but both statements aren't true. The eBPF policies apply to every
> > process that is placed in a cgroup, and my example program in 6/6 shows
> > how that can be done from the command line.
> 
> Then you have to explain to me how anyone other than systemd can use
> this infrastructure.

Sounds like systemd and bpf phobia combined :)
Jokes aside, I'm puzzled why systemd is even being mentioned here.
Here we use Tupperware (our internal container management system), which
uses cgroups heavily and has nothing to do with systemd.
We're working as part of the Open Container Initiative, so hopefully soon
all container management systems will benefit from what we're building.
cgroups and bpf are a crucial part of this process.

> > Also, systemd is able to control userspace processes just fine, and
> > it is not limited to 'server processes'.
> 
> My main point is that those processes *need* to be launched by the
> orchestrator, which is what I was referring to as 'server processes'.

I have no experience with systemd, so I cannot comment on it,
but that statement is not true for our stuff.

> > > For client processes this infrastructure is
> > >    *racy*, you have to add new processes in runtime to the cgroup,
> > >    thus there will be some little time where no filtering policy
> > >    will be applied. For quality of service, this may be an acceptable
> > >    race, but this is aiming to deploy a filtering policy.
> > 
> > That's a limitation that applies to many more control mechanisms in the
> > kernel, and it's something that can easily be solved with fork+exec.
> 
> As long as you have control to launch the processes yes, but this
> will not work in other scenarios. Just like cgroup net_cls and friends
> are broken for filtering for things that you have no control to
> fork+exec.

not true

> To use this infrastructure from a non-launcher process, you'll have to
> rely on the proc connection to subscribe to new process events, then
> echo that pid to the cgroup, and that interface is asynchronous so
> *adding new processes to the cgroup is subject to races*.

In general, not true either. Have you worked with cgroups, or are you just speculating?

> *You're proposing a socket filtering facility that hooks layer 2
> output path*!

flashback. Not too long ago you were beating drums about netfilter
ingress hook operating at layer 2... sounds like nobody used it
and that was a bad call? Should we remove that netfilter hook then?

Our use case is different from Daniel's.
For us this cgroup+bpf is _not_ for filtering and _not_ for security.
We run a ton of tasks in cgroups that launch all sorts of
things on their own. We need to monitor what they do from a networking
point of view. Therefore bpf programs need to monitor the traffic in
a particular part of the cgroup hierarchy. Not globally, and with no
pass/drop decisions.
The monitoring itself is complicated. For example, we need to group and
aggregate within the bpf program based on certain bits of the ipv6
address, and so on. bpf is the only programmable engine that can do this job.
nft is simply not flexible enough to do that.
I'd really love to have an alternative to bpf for such tasks,
but you seem to spend all your energy arguing against bpf whereas
nft still leaves a lot to be desired.
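
As a rough illustration of grouping and aggregating by address bits, here is a sketch in plain userspace C. It is not a bpf program and not the monitoring code discussed above: the choice of the upper 64 bits as the key, the fixed-size table, and all names (agg_table, agg_account) are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define AGG_MAX_KEYS	8
#define AGG_KEY_LEN	8	/* assumed key: upper 64 bits of the address */

struct agg_entry {
	uint8_t key[AGG_KEY_LEN];
	uint64_t bytes;
	int used;
};

struct agg_table {
	struct agg_entry e[AGG_MAX_KEYS];
};

/*
 * Account len bytes to the group that addr belongs to.
 * Returns the slot index, or -1 if the table is full.
 */
int agg_account(struct agg_table *t, const uint8_t addr[16], uint64_t len)
{
	int i, free_slot = -1;

	assert(t && addr);

	for (i = 0; i < AGG_MAX_KEYS; i++) {
		if (!t->e[i].used) {
			/* Remember the first free slot for a new group. */
			if (free_slot < 0)
				free_slot = i;
			continue;
		}
		if (memcmp(t->e[i].key, addr, AGG_KEY_LEN) == 0) {
			t->e[i].bytes += len;
			return i;
		}
	}

	if (free_slot < 0)
		return -1;

	memcpy(t->e[free_slot].key, addr, AGG_KEY_LEN);
	t->e[free_slot].bytes = len;
	t->e[free_slot].used = 1;
	return free_slot;
}
```

Two addresses that share the same upper 64 bits end up in one counter, while an address with a different prefix gets its own slot.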


* Re: [PATCH net-next v3] net: inet: diag: expose the socket mark to privileged processes.
From: David Ahern @ 2016-09-14  4:19 UTC (permalink / raw)
  To: Lorenzo Colitti, David Miller
  Cc: netdev@vger.kernel.org, Eric Dumazet, Erik Kline
In-Reply-To: <CAKD1Yr3q71ap3TQDyeuKFsZ4LTt6yEBh2pAZ5=FMnCd9UmyuGA@mail.gmail.com>

On 9/13/16 10:00 PM, Lorenzo Colitti wrote:
> On Fri, Sep 9, 2016 at 2:23 PM, Lorenzo Colitti <lorenzo@google.com> wrote:
>> RFC patch sent out as http://patchwork.ozlabs.org/patch/667892/ . This
>> achieves a fair bit of simplification with no or negligible
>> performance impact, because there was a lot of redundancy in the
>> parameters that were passed in.
> 
> David, any thoughts on that patch? I submitted it as RFC because I
> wasn't sure what you wanted. Should I have sent it as non-RFC instead?
> 

I realize you meant DaveM, but this one has been accepted. It's your other 2 that are marked RFC by you and in patchwork.


* Re: [PATCH net-next v3] net: inet: diag: expose the socket mark to privileged processes.
From: Lorenzo Colitti @ 2016-09-14  4:00 UTC (permalink / raw)
  To: David Miller
  Cc: netdev@vger.kernel.org, Eric Dumazet, David Ahern, Erik Kline
In-Reply-To: <CAKD1Yr2JovZ=dmNXuDCLT6gzVtC9JZ30bW6ND=10733je60bDw@mail.gmail.com>

On Fri, Sep 9, 2016 at 2:23 PM, Lorenzo Colitti <lorenzo@google.com> wrote:
> RFC patch sent out as http://patchwork.ozlabs.org/patch/667892/ . This
> achieves a fair bit of simplification with no or negligible
> performance impact, because there was a lot of redundancy in the
> parameters that were passed in.

David, any thoughts on that patch? I submitted it as RFC because I
wasn't sure what you wanted. Should I have sent it as non-RFC instead?


* [PATCH net-next 2/7] net: ethernet: mediatek: add mtk_hw_deinit call as the opposite to mtk_hw_init call
From: sean.wang @ 2016-09-14  3:50 UTC (permalink / raw)
  To: john, davem; +Cc: nbd, netdev, linux-mediatek, keyhaede, objelf, Sean Wang
In-Reply-To: <1473825041-21072-1-git-send-email-sean.wang@mediatek.com>

From: Sean Wang <sean.wang@mediatek.com>

Group the deinitialization of everything that mtk_hw_init sets up
into mtk_hw_deinit, so that it can be reused by the reset process
and the error path handling.

Signed-off-by: Sean Wang <sean.wang@mediatek.com>
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index ca46e82..c71b0b3 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -1477,6 +1477,16 @@ static int __init mtk_hw_init(struct mtk_eth *eth)
 	return 0;
 }
 
+static int mtk_hw_deinit(struct mtk_eth *eth)
+{
+	clk_disable_unprepare(eth->clks[MTK_CLK_GP2]);
+	clk_disable_unprepare(eth->clks[MTK_CLK_GP1]);
+	clk_disable_unprepare(eth->clks[MTK_CLK_ESW]);
+	clk_disable_unprepare(eth->clks[MTK_CLK_ETHIF]);
+
+	return 0;
+}
+
 static int __init mtk_init(struct net_device *dev)
 {
 	struct mtk_mac *mac = netdev_priv(dev);
@@ -1923,10 +1933,7 @@ static int mtk_remove(struct platform_device *pdev)
 		mtk_stop(eth->netdev[i]);
 	}
 
-	clk_disable_unprepare(eth->clks[MTK_CLK_ETHIF]);
-	clk_disable_unprepare(eth->clks[MTK_CLK_ESW]);
-	clk_disable_unprepare(eth->clks[MTK_CLK_GP1]);
-	clk_disable_unprepare(eth->clks[MTK_CLK_GP2]);
+	mtk_hw_deinit(eth);
 
 	netif_napi_del(&eth->tx_napi);
 	netif_napi_del(&eth->rx_napi);
-- 
1.9.1


* [PATCH net-next 6/7] net: ethernet: mediatek: add more resets for internal ethernet circuit block
From: sean.wang @ 2016-09-14  3:50 UTC (permalink / raw)
  To: john, davem; +Cc: nbd, netdev, linux-mediatek, keyhaede, objelf, Sean Wang
In-Reply-To: <1473825041-21072-1-git-send-email-sean.wang@mediatek.com>

From: Sean Wang <sean.wang@mediatek.com>

struct mtk_eth already contains a struct regmap pointer, ethsys,
covering the address range of the internal circuit reset, so reuse it
to reset more internal blocks of the ethernet hardware, such as the
packet processing engine (PPE) and the frame engine (FE), instead of
rstc, which deals with the FE only.
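
The assert/deassert sequence on the reset register can be modelled in userspace as follows. This is a sketch only: the register is a plain integer, sketch_update_bits mimics the masked read-modify-write that regmap_update_bits() performs, the delays between the two steps are omitted, and the helper names are hypothetical. The two bit positions are the ones defined in this patch.

```c
#include <assert.h>
#include <stdint.h>

#define BIT(n)		(1u << (n))
#define RSTCTRL_FE	BIT(6)	/* frame engine, as in this patch */
#define RSTCTRL_PPE	BIT(31)	/* packet processing engine, as in this patch */

/* Masked update: only the bits set in mask take their value from val. */
static uint32_t sketch_update_bits(uint32_t reg, uint32_t mask, uint32_t val)
{
	return (reg & ~mask) | (val & mask);
}

/* Put the selected blocks into reset by setting their bits. */
uint32_t sketch_reset_assert(uint32_t reg, uint32_t reset_bits)
{
	assert(reset_bits != 0);
	return sketch_update_bits(reg, reset_bits, reset_bits);
}

/* Release the selected blocks by clearing their bits. */
uint32_t sketch_reset_deassert(uint32_t reg, uint32_t reset_bits)
{
	assert(reset_bits != 0);
	return sketch_update_bits(reg, reset_bits, ~reset_bits);
}
```

Bits outside reset_bits are left untouched in both steps, which is why the same register can be used for the FE and the PPE one after the other.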

Signed-off-by: Sean Wang <sean.wang@mediatek.com>
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c | 27 +++++++++++++++------------
 drivers/net/ethernet/mediatek/mtk_eth_soc.h |  6 +++++-
 2 files changed, 20 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index b9ddbcb..48cddf9 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -1414,6 +1414,19 @@ static int mtk_stop(struct net_device *dev)
 	return 0;
 }
 
+static void ethsys_reset(struct mtk_eth *eth, u32 reset_bits)
+{
+	regmap_update_bits(eth->ethsys, ETHSYS_RSTCTRL,
+			   reset_bits,
+			   reset_bits);
+
+	usleep_range(1000, 1100);
+	regmap_update_bits(eth->ethsys, ETHSYS_RSTCTRL,
+			   reset_bits,
+			   ~reset_bits);
+	mdelay(10);
+}
+
 static int mtk_hw_init(struct mtk_eth *eth)
 {
 	int i, val;
@@ -1428,12 +1441,8 @@ static int mtk_hw_init(struct mtk_eth *eth)
 	clk_prepare_enable(eth->clks[MTK_CLK_ESW]);
 	clk_prepare_enable(eth->clks[MTK_CLK_GP1]);
 	clk_prepare_enable(eth->clks[MTK_CLK_GP2]);
-
-	/* reset the frame engine */
-	reset_control_assert(eth->rstc);
-	usleep_range(10, 20);
-	reset_control_deassert(eth->rstc);
-	usleep_range(10, 20);
+	ethsys_reset(eth, RSTCTRL_FE);
+	ethsys_reset(eth, RSTCTRL_PPE);
 
 	regmap_read(eth->ethsys, ETHSYS_SYSCFG0, &val);
 	for (i = 0; i < MTK_MAC_COUNT; i++) {
@@ -1894,12 +1903,6 @@ static int mtk_probe(struct platform_device *pdev)
 		return PTR_ERR(eth->pctl);
 	}
 
-	eth->rstc = devm_reset_control_get(&pdev->dev, "eth");
-	if (IS_ERR(eth->rstc)) {
-		dev_err(&pdev->dev, "no eth reset found\n");
-		return PTR_ERR(eth->rstc);
-	}
-
 	for (i = 0; i < 3; i++) {
 		eth->irq[i] = platform_get_irq(pdev, i);
 		if (eth->irq[i] < 0) {
diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.h b/drivers/net/ethernet/mediatek/mtk_eth_soc.h
index 388cbe7..7efa00f 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.h
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.h
@@ -266,6 +266,11 @@
 #define SYSCFG0_GE_MASK		0x3
 #define SYSCFG0_GE_MODE(x, y)	(x << (12 + (y * 2)))
 
+/*ethernet reset control register*/
+#define ETHSYS_RSTCTRL		0x34
+#define RSTCTRL_FE		BIT(6)
+#define RSTCTRL_PPE		BIT(31)
+
 struct mtk_rx_dma {
 	unsigned int rxd1;
 	unsigned int rxd2;
@@ -423,7 +428,6 @@ struct mtk_rx_ring {
 struct mtk_eth {
 	struct device			*dev;
 	void __iomem			*base;
-	struct reset_control		*rstc;
 	spinlock_t			page_lock;
 	spinlock_t			irq_lock;
 	struct net_device		dummy_dev;
-- 
1.9.1


* [PATCH net-next 1/7] net: ethernet: mediatek: refactoring mtk_hw_init to be reused
From: sean.wang @ 2016-09-14  3:50 UTC (permalink / raw)
  To: john, davem; +Cc: nbd, netdev, linux-mediatek, keyhaede, objelf, Sean Wang
In-Reply-To: <1473825041-21072-1-git-send-email-sean.wang@mediatek.com>

From: Sean Wang <sean.wang@mediatek.com>

The existing mtk_hw_init mixes hardware and software
initialization, which makes it hard to reuse for reset recovery.
Split it so that mtk_hw_init keeps only the hardware-related
initialization, and move the rest, such as IRQ registration and MDIO
initialization, which concern the interface to the core driver, to
a more appropriate place (mtk_probe), because there is no need to
register IRQs or re-initialize structures again during the reset process.

Signed-off-by: Sean Wang <sean.wang@mediatek.com>
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c | 62 ++++++++++++++++-------------
 1 file changed, 34 insertions(+), 28 deletions(-)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index 66fd45a..ca46e82 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -1415,7 +1415,12 @@ static int mtk_stop(struct net_device *dev)
 
 static int __init mtk_hw_init(struct mtk_eth *eth)
 {
-	int err, i;
+	int i;
+
+	clk_prepare_enable(eth->clks[MTK_CLK_ETHIF]);
+	clk_prepare_enable(eth->clks[MTK_CLK_ESW]);
+	clk_prepare_enable(eth->clks[MTK_CLK_GP1]);
+	clk_prepare_enable(eth->clks[MTK_CLK_GP2]);
 
 	/* reset the frame engine */
 	reset_control_assert(eth->rstc);
@@ -1441,19 +1446,6 @@ static int __init mtk_hw_init(struct mtk_eth *eth)
 	/* Enable RX VLan Offloading */
 	mtk_w32(eth, 1, MTK_CDMP_EG_CTRL);
 
-	err = devm_request_irq(eth->dev, eth->irq[1], mtk_handle_irq_tx, 0,
-			       dev_name(eth->dev), eth);
-	if (err)
-		return err;
-	err = devm_request_irq(eth->dev, eth->irq[2], mtk_handle_irq_rx, 0,
-			       dev_name(eth->dev), eth);
-	if (err)
-		return err;
-
-	err = mtk_mdio_init(eth);
-	if (err)
-		return err;
-
 	/* disable delay and normal interrupt */
 	mtk_w32(eth, 0, MTK_QDMA_DELAY_INT);
 	mtk_w32(eth, 0, MTK_PDMA_DELAY_INT);
@@ -1783,16 +1775,7 @@ static int mtk_add_mac(struct mtk_eth *eth, struct device_node *np)
 	eth->netdev[id]->features |= MTK_HW_FEATURES;
 	eth->netdev[id]->ethtool_ops = &mtk_ethtool_ops;
 
-	err = register_netdev(eth->netdev[id]);
-	if (err) {
-		dev_err(eth->dev, "error bringing up device\n");
-		goto free_netdev;
-	}
 	eth->netdev[id]->irq = eth->irq[0];
-	netif_info(eth, probe, eth->netdev[id],
-		   "mediatek frame engine at 0x%08lx, irq %d\n",
-		   eth->netdev[id]->base_addr, eth->irq[0]);
-
 	return 0;
 
 free_netdev:
@@ -1862,11 +1845,6 @@ static int mtk_probe(struct platform_device *pdev)
 		}
 	}
 
-	clk_prepare_enable(eth->clks[MTK_CLK_ETHIF]);
-	clk_prepare_enable(eth->clks[MTK_CLK_ESW]);
-	clk_prepare_enable(eth->clks[MTK_CLK_GP1]);
-	clk_prepare_enable(eth->clks[MTK_CLK_GP2]);
-
 	eth->msg_enable = netif_msg_init(mtk_msg_level, MTK_DEFAULT_MSG_ENABLE);
 	INIT_WORK(&eth->pending_work, mtk_pending_work);
 
@@ -1887,6 +1865,34 @@ static int mtk_probe(struct platform_device *pdev)
 			goto err_free_dev;
 	}
 
+	err = devm_request_irq(eth->dev, eth->irq[1], mtk_handle_irq_tx, 0,
+			       dev_name(eth->dev), eth);
+	if (err)
+		goto err_free_dev;
+
+	err = devm_request_irq(eth->dev, eth->irq[2], mtk_handle_irq_rx, 0,
+			       dev_name(eth->dev), eth);
+	if (err)
+		goto err_free_dev;
+
+	err = mtk_mdio_init(eth);
+	if (err)
+		goto err_free_dev;
+
+	for (i = 0; i < MTK_MAX_DEVS; i++) {
+		if (!eth->netdev[i])
+			continue;
+
+		err = register_netdev(eth->netdev[i]);
+		if (err) {
+			dev_err(eth->dev, "error bringing up device\n");
+			goto err_free_dev;
+		} else
+			netif_info(eth, probe, eth->netdev[i],
+				   "mediatek frame engine at 0x%08lx, irq %d\n",
+				   eth->netdev[i]->base_addr, eth->irq[0]);
+	}
+
 	/* we run 2 devices on the same DMA ring so we need a dummy device
 	 * for NAPI to work
 	 */
-- 
1.9.1


* [PATCH net-next 5/7] net: ethernet: mediatek: add the whole ethernet reset into the reset process
From: sean.wang @ 2016-09-14  3:50 UTC (permalink / raw)
  To: john, davem; +Cc: nbd, netdev, linux-mediatek, keyhaede, objelf, Sean Wang
In-Reply-To: <1473825041-21072-1-git-send-email-sean.wang@mediatek.com>

From: Sean Wang <sean.wang@mediatek.com>

1) The original driver only resets the DMA used by the descriptor rings,
which cannot guarantee recovery from every kind of fatal
error, so this patch resets the underlying hardware
resources that ethernet on the Mediatek SoC requires,
including power, pin mux control, clocks and the internal
ethernet circuits, in order to restore the initial
state that a rebooted machine would have.

2) Add a state variable to struct mtk_eth to help distinguish
whether mtk_hw_init is called for initialization at boot time
or for re-initialization during the reset process.

3) Add a ge_mode variable to struct mtk_mac for restoring
the interface mode of the current setup for the target MAC.

4) Remove the __init attribute from the mtk_hw_init definition.
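
The state bit from item 2 guards init and deinit so that each runs at most once per cycle. A userspace sketch of that pattern, assuming C11 atomics in place of the kernel's test_and_set_bit()/test_and_clear_bit(); all sketch_* names and the bit number are hypothetical, and the run counters exist only to make the effect visible:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SKETCH_HW_INIT	0	/* hypothetical bit, stands in for MTK_HW_INIT */

struct sketch_eth {
	atomic_ulong state;
	int init_runs;		/* how often the init body really ran */
	int deinit_runs;	/* how often the deinit body really ran */
};

/* Atomically set the bit and report whether it was already set. */
static bool sketch_test_and_set_bit(unsigned int bit, atomic_ulong *word)
{
	unsigned long mask = 1UL << bit;

	assert(bit < 32);
	return (atomic_fetch_or(word, mask) & mask) != 0;
}

/* Atomically clear the bit and report whether it was set before. */
static bool sketch_test_and_clear_bit(unsigned int bit, atomic_ulong *word)
{
	unsigned long mask = 1UL << bit;

	assert(bit < 32);
	return (atomic_fetch_and(word, ~mask) & mask) != 0;
}

void sketch_eth_setup(struct sketch_eth *eth)
{
	atomic_init(&eth->state, 0);
	eth->init_runs = 0;
	eth->deinit_runs = 0;
}

int sketch_hw_init(struct sketch_eth *eth)
{
	/* Already initialized: nothing to do. */
	if (sketch_test_and_set_bit(SKETCH_HW_INIT, &eth->state))
		return 0;

	eth->init_runs++;	/* clocks, resets, registers would go here */
	return 0;
}

int sketch_hw_deinit(struct sketch_eth *eth)
{
	/* Not initialized: nothing to undo. */
	if (!sketch_test_and_clear_bit(SKETCH_HW_INIT, &eth->state))
		return 0;

	eth->deinit_runs++;	/* clock disable and friends would go here */
	return 0;
}
```

Calling init twice in a row runs the body once; a deinit followed by another init runs it again, which is the sequence the reset path relies on.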

Signed-off-by: Sean Wang <sean.wang@mediatek.com>
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c | 52 ++++++++++++++++++++++++-----
 drivers/net/ethernet/mediatek/mtk_eth_soc.h |  8 +++++
 2 files changed, 52 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index fd5d064..b9ddbcb 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -231,7 +231,7 @@ static int mtk_phy_connect(struct mtk_mac *mac)
 {
 	struct mtk_eth *eth = mac->hw;
 	struct device_node *np;
-	u32 val, ge_mode;
+	u32 val;
 
 	np = of_parse_phandle(mac->of_node, "phy-handle", 0);
 	if (!np && of_phy_is_fixed_link(mac->of_node))
@@ -245,18 +245,18 @@ static int mtk_phy_connect(struct mtk_mac *mac)
 	case PHY_INTERFACE_MODE_RGMII_RXID:
 	case PHY_INTERFACE_MODE_RGMII_ID:
 	case PHY_INTERFACE_MODE_RGMII:
-		ge_mode = 0;
+		mac->ge_mode = 0;
 		break;
 	case PHY_INTERFACE_MODE_MII:
-		ge_mode = 1;
+		mac->ge_mode = 1;
 		break;
 	case PHY_INTERFACE_MODE_REVMII:
-		ge_mode = 2;
+		mac->ge_mode = 2;
 		break;
 	case PHY_INTERFACE_MODE_RMII:
 		if (!mac->id)
 			goto err_phy;
-		ge_mode = 3;
+		mac->ge_mode = 3;
 		break;
 	default:
 		goto err_phy;
@@ -265,7 +265,7 @@ static int mtk_phy_connect(struct mtk_mac *mac)
 	/* put the gmac into the right mode */
 	regmap_read(eth->ethsys, ETHSYS_SYSCFG0, &val);
 	val &= ~SYSCFG0_GE_MODE(SYSCFG0_GE_MASK, mac->id);
-	val |= SYSCFG0_GE_MODE(ge_mode, mac->id);
+	val |= SYSCFG0_GE_MODE(mac->ge_mode, mac->id);
 	regmap_write(eth->ethsys, ETHSYS_SYSCFG0, val);
 
 	mtk_phy_connect_node(eth, mac, np);
@@ -1414,9 +1414,12 @@ static int mtk_stop(struct net_device *dev)
 	return 0;
 }
 
-static int __init mtk_hw_init(struct mtk_eth *eth)
+static int mtk_hw_init(struct mtk_eth *eth)
 {
-	int i;
+	int i, val;
+
+	if (test_and_set_bit(MTK_HW_INIT, &eth->state))
+		return 0;
 
 	pm_runtime_enable(eth->dev);
 	pm_runtime_get_sync(eth->dev);
@@ -1432,6 +1435,15 @@ static int __init mtk_hw_init(struct mtk_eth *eth)
 	reset_control_deassert(eth->rstc);
 	usleep_range(10, 20);
 
+	regmap_read(eth->ethsys, ETHSYS_SYSCFG0, &val);
+	for (i = 0; i < MTK_MAC_COUNT; i++) {
+		if (!eth->mac[i])
+			continue;
+		val &= ~SYSCFG0_GE_MODE(SYSCFG0_GE_MASK, eth->mac[i]->id);
+		val |= SYSCFG0_GE_MODE(eth->mac[i]->ge_mode, eth->mac[i]->id);
+	}
+	regmap_write(eth->ethsys, ETHSYS_SYSCFG0, val);
+
 	/* Set GE2 driving and slew rate */
 	regmap_write(eth->pctl, GPIO_DRV_SEL10, 0xa00);
 
@@ -1483,6 +1495,9 @@ static int __init mtk_hw_init(struct mtk_eth *eth)
 
 static int mtk_hw_deinit(struct mtk_eth *eth)
 {
+	if (!test_and_clear_bit(MTK_HW_INIT, &eth->state))
+		return 0;
+
 	clk_disable_unprepare(eth->clks[MTK_CLK_GP2]);
 	clk_disable_unprepare(eth->clks[MTK_CLK_GP1]);
 	clk_disable_unprepare(eth->clks[MTK_CLK_ESW]);
@@ -1557,6 +1572,27 @@ static void mtk_pending_work(struct work_struct *work)
 		__set_bit(i, &restart);
 	}
 
+	/* restart underlying hardware such as power, clock, pin mux
+	 * and the connected phy
+	 */
+	mtk_hw_deinit(eth);
+
+	if (eth->dev->pins)
+		devm_kfree(eth->dev, eth->dev->pins);
+	pinctrl_bind_pins(eth->dev);
+
+	mtk_hw_init(eth);
+
+	for (i = 0; i < MTK_MAC_COUNT; i++) {
+		if (!eth->mac[i] ||
+		    of_phy_is_fixed_link(eth->mac[i]->of_node))
+			continue;
+		err = phy_init_hw(eth->mac[i]->phy_dev);
+		if (err)
+			dev_err(eth->dev, "%s: PHY init failed.\n",
+				eth->netdev[i]->name);
+	}
+
 	/* restart DMA and enable IRQs */
 	for (i = 0; i < MTK_MAC_COUNT; i++) {
 		if (!test_bit(i, &restart))
diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.h b/drivers/net/ethernet/mediatek/mtk_eth_soc.h
index 0b984dc..388cbe7 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.h
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.h
@@ -330,6 +330,10 @@ enum mtk_clks_map {
 	MTK_CLK_MAX
 };
 
+enum mtk_dev_state {
+	MTK_HW_INIT
+};
+
 /* struct mtk_tx_buf -	This struct holds the pointers to the memory pointed at
  *			by the TX descriptor	s
  * @skb:		The SKB pointer of the packet being sent
@@ -413,6 +417,7 @@ struct mtk_rx_ring {
  * @clks:		clock array for all clocks required
  * @mii_bus:		If there is a bus we need to create an instance for it
  * @pending_work:	The workqueue used to reset the dma ring
+ * @state:		Initialization and runtime state of the device.
  */
 
 struct mtk_eth {
@@ -441,11 +446,13 @@ struct mtk_eth {
 
 	struct mii_bus			*mii_bus;
 	struct work_struct		pending_work;
+	unsigned long			state;
 };
 
 /* struct mtk_mac -	the structure that holds the info about the MACs of the
  *			SoC
  * @id:			The number of the MAC
+ * @ge_mode:            Interface mode kept for setup restoring
  * @of_node:		Our devicetree node
  * @hw:			Backpointer to our main datastruture
  * @hw_stats:		Packet statistics counter
@@ -453,6 +460,7 @@ struct mtk_eth {
  */
 struct mtk_mac {
 	int				id;
+	int				ge_mode;
 	struct device_node		*of_node;
 	struct mtk_eth			*hw;
 	struct mtk_hw_stats		*hw_stats;
-- 
1.9.1

^ permalink raw reply related

* [PATCH net-next 3/7] net: ethernet: mediatek: cleanup error path inside mtk_hw_init
From: sean.wang @ 2016-09-14  3:50 UTC (permalink / raw)
  To: john, davem; +Cc: nbd, netdev, linux-mediatek, keyhaede, objelf, Sean Wang
In-Reply-To: <1473825041-21072-1-git-send-email-sean.wang@mediatek.com>

From: Sean Wang <sean.wang@mediatek.com>

This cleans up the error path of the mtk_hw_init call, allowing it to
exit appropriately when something fails. It also refactors the
mtk_cleanup call so that part of its logic can be reused on the error path.
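
The label ordering used for this kind of error path can be sketched in plain
C. This is an illustration only: the demo_* names and the fail_step parameter
are hypothetical, and the steps merely mirror the order hardware init, device
allocation, MDIO setup, registration. Each label undoes one step and falls
through to the labels for the earlier steps.

```c
/* Hypothetical record of which resources are currently held. */
struct demo_res {
	int hw;		/* hardware initialized */
	int devs;	/* net devices allocated */
	int mdio;	/* MDIO bus set up */
};

/* fail_step selects which step fails (0 means no failure). */
static int demo_probe(struct demo_res *r, int fail_step)
{
	int err = -1;

	r->hw = 1;			/* step 1: hardware init */

	if (fail_step == 2)		/* step 2: adding a MAC fails */
		goto err_deinit_hw;
	r->devs = 1;

	if (fail_step == 3)		/* step 3: MDIO setup fails */
		goto err_free_dev;
	r->mdio = 1;

	if (fail_step == 4)		/* step 4: registration fails */
		goto err_deinit_mdio;

	return 0;

err_deinit_mdio:
	r->mdio = 0;
err_free_dev:
	r->devs = 0;
err_deinit_hw:
	r->hw = 0;

	return err;
}
```

A later failure enters the chain at a higher label, so it releases
everything acquired so far and nothing that was never acquired.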

Signed-off-by: Sean Wang <sean.wang@mediatek.com>
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c | 34 ++++++++++++++++++++++++-----
 1 file changed, 29 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index c71b0b3..917a49c6 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -1564,17 +1564,36 @@ static void mtk_pending_work(struct work_struct *work)
 	rtnl_unlock();
 }
 
-static int mtk_cleanup(struct mtk_eth *eth)
+static int mtk_free_dev(struct mtk_eth *eth)
 {
 	int i;
 
 	for (i = 0; i < MTK_MAC_COUNT; i++) {
 		if (!eth->netdev[i])
 			continue;
+		free_netdev(eth->netdev[i]);
+	}
+
+	return 0;
+}
 
+static int mtk_unreg_dev(struct mtk_eth *eth)
+{
+	int i;
+
+	for (i = 0; i < MTK_MAC_COUNT; i++) {
+		if (!eth->netdev[i])
+			continue;
 		unregister_netdev(eth->netdev[i]);
-		free_netdev(eth->netdev[i]);
 	}
+
+	return 0;
+}
+
+static int mtk_cleanup(struct mtk_eth *eth)
+{
+	mtk_unreg_dev(eth);
+	mtk_free_dev(eth);
 	cancel_work_sync(&eth->pending_work);
 
 	return 0;
@@ -1872,7 +1891,7 @@ static int mtk_probe(struct platform_device *pdev)
 
 		err = mtk_add_mac(eth, mac_np);
 		if (err)
-			goto err_free_dev;
+			goto err_deinit_hw;
 	}
 
 	err = devm_request_irq(eth->dev, eth->irq[1], mtk_handle_irq_tx, 0,
@@ -1896,7 +1915,7 @@ static int mtk_probe(struct platform_device *pdev)
 		err = register_netdev(eth->netdev[i]);
 		if (err) {
 			dev_err(eth->dev, "error bringing up device\n");
-			goto err_free_dev;
+			goto err_deinit_mdio;
 		} else
 			netif_info(eth, probe, eth->netdev[i],
 				   "mediatek frame engine at 0x%08lx, irq %d\n",
@@ -1916,8 +1935,13 @@ static int mtk_probe(struct platform_device *pdev)
 
 	return 0;
 
+err_deinit_mdio:
+	mtk_mdio_cleanup(eth);
 err_free_dev:
-	mtk_cleanup(eth);
+	mtk_free_dev(eth);
+err_deinit_hw:
+	mtk_hw_deinit(eth);
+
 	return err;
 }
 
-- 
1.9.1

^ permalink raw reply related

* [PATCH net-next 7/7] net: ethernet: mediatek: avoid race condition during the reset process
From: sean.wang @ 2016-09-14  3:50 UTC (permalink / raw)
  To: john, davem; +Cc: nbd, netdev, linux-mediatek, keyhaede, objelf, Sean Wang
In-Reply-To: <1473825041-21072-1-git-send-email-sean.wang@mediatek.com>

From: Sean Wang <sean.wang@mediatek.com>

Add protection against the race condition between
the reset process and the hardware accesses performed
in the related callbacks.
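
The pattern can be sketched in userspace as follows. This is not the driver
code: it assumes C11 atomics, with an acquire exchange standing in for
test_and_set_bit_lock() and a release store standing in for
clear_bit_unlock(), and all demo_* names are hypothetical. Note that the
check in the callback and the access that follows it are not one atomic
step, so the sketch only shows the flag handling and is not a complete
synchronization scheme.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical device with a "resetting" flag and an access counter. */
struct demo_dev {
	atomic_bool resetting;
	int hw_accesses;
};

/* Reset path: spin until the flag is owned (acquire ordering). */
static void demo_reset_begin(struct demo_dev *d)
{
	while (atomic_exchange_explicit(&d->resetting, true,
					memory_order_acquire))
		;	/* the kernel code would call cpu_relax() here */
}

/* Reset path: drop the flag (release ordering). */
static void demo_reset_end(struct demo_dev *d)
{
	atomic_store_explicit(&d->resetting, false, memory_order_release);
}

/* Callback: refuse to touch the hardware while a reset is running. */
static int demo_set_mac_address(struct demo_dev *d)
{
	if (atomic_load(&d->resetting))
		return -EBUSY;

	d->hw_accesses++;	/* a register write would go here */
	return 0;
}
```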

Signed-off-by: Sean Wang <sean.wang@mediatek.com>
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c | 36 +++++++++++++++++++++++++++++
 drivers/net/ethernet/mediatek/mtk_eth_soc.h |  3 ++-
 2 files changed, 38 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index 48cddf9..a6a9a2f 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -145,6 +145,9 @@ static void mtk_phy_link_adjust(struct net_device *dev)
 		  MAC_MCR_RX_EN | MAC_MCR_BACKOFF_EN |
 		  MAC_MCR_BACKPR_EN;
 
+	if (unlikely(test_bit(MTK_RESETTING, &mac->hw->state)))
+		return;
+
 	switch (mac->phy_dev->speed) {
 	case SPEED_1000:
 		mcr |= MAC_MCR_SPEED_1000;
@@ -370,6 +373,9 @@ static int mtk_set_mac_address(struct net_device *dev, void *p)
 	if (ret)
 		return ret;
 
+	if (unlikely(test_bit(MTK_RESETTING, &mac->hw->state)))
+		return -EBUSY;
+
 	spin_lock_bh(&mac->hw->page_lock);
 	mtk_w32(mac->hw, (macaddr[0] << 8) | macaddr[1],
 		MTK_GDMA_MAC_ADRH(mac->id));
@@ -770,6 +776,9 @@ static int mtk_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	 */
 	spin_lock(&eth->page_lock);
 
+	if (unlikely(test_bit(MTK_RESETTING, &eth->state)))
+		goto drop;
+
 	tx_num = mtk_cal_txd_req(skb);
 	if (unlikely(atomic_read(&ring->free_count) <= tx_num)) {
 		mtk_stop_queue(eth);
@@ -842,6 +851,9 @@ static int mtk_poll_rx(struct napi_struct *napi, int budget,
 
 		netdev = eth->netdev[mac];
 
+		if (unlikely(test_bit(MTK_RESETTING, &eth->state)))
+			goto release_desc;
+
 		/* alloc new buffer */
 		new_data = napi_alloc_frag(ring->frag_size);
 		if (unlikely(!new_data)) {
@@ -1573,6 +1585,12 @@ static void mtk_pending_work(struct work_struct *work)
 
 	rtnl_lock();
 
+	dev_dbg(eth->dev, "[%s][%d] reset\n", __func__, __LINE__);
+
+	while (test_and_set_bit_lock(MTK_RESETTING, &eth->state))
+		cpu_relax();
+
+	dev_dbg(eth->dev, "[%s][%d] mtk_stop starts\n", __func__, __LINE__);
 	/* stop all devices to make sure that dma is properly shut down */
 	for (i = 0; i < MTK_MAC_COUNT; i++) {
 		if (!eth->netdev[i])
@@ -1580,6 +1598,7 @@ static void mtk_pending_work(struct work_struct *work)
 		mtk_stop(eth->netdev[i]);
 		__set_bit(i, &restart);
 	}
+	dev_dbg(eth->dev, "[%s][%d] mtk_stop ends\n", __func__, __LINE__);
 
 	/* restart underlying hardware such as power, clock, pin mux
 	 * and the connected phy
@@ -1613,6 +1632,11 @@ static void mtk_pending_work(struct work_struct *work)
 			dev_close(eth->netdev[i]);
 		}
 	}
+
+	dev_dbg(eth->dev, "[%s][%d] reset done\n", __func__, __LINE__);
+
+	clear_bit_unlock(MTK_RESETTING, &eth->state);
+
 	rtnl_unlock();
 }
 
@@ -1657,6 +1681,9 @@ static int mtk_get_settings(struct net_device *dev,
 	struct mtk_mac *mac = netdev_priv(dev);
 	int err;
 
+	if (unlikely(test_bit(MTK_RESETTING, &mac->hw->state)))
+		return -EBUSY;
+
 	err = phy_read_status(mac->phy_dev);
 	if (err)
 		return -ENODEV;
@@ -1707,6 +1734,9 @@ static int mtk_nway_reset(struct net_device *dev)
 {
 	struct mtk_mac *mac = netdev_priv(dev);
 
+	if (unlikely(test_bit(MTK_RESETTING, &mac->hw->state)))
+		return -EBUSY;
+
 	return genphy_restart_aneg(mac->phy_dev);
 }
 
@@ -1715,6 +1745,9 @@ static u32 mtk_get_link(struct net_device *dev)
 	struct mtk_mac *mac = netdev_priv(dev);
 	int err;
 
+	if (unlikely(test_bit(MTK_RESETTING, &mac->hw->state)))
+		return -EBUSY;
+
 	err = genphy_update_link(mac->phy_dev);
 	if (err)
 		return ethtool_op_get_link(dev);
@@ -1755,6 +1788,9 @@ static void mtk_get_ethtool_stats(struct net_device *dev,
 	unsigned int start;
 	int i;
 
+	if (unlikely(test_bit(MTK_RESETTING, &mac->hw->state)))
+		return;
+
 	if (netif_running(dev) && netif_device_present(dev)) {
 		if (spin_trylock(&hwstats->stats_lock)) {
 			mtk_stats_update_mac(mac);
diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.h b/drivers/net/ethernet/mediatek/mtk_eth_soc.h
index 7efa00f..79954b4 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.h
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.h
@@ -336,7 +336,8 @@ enum mtk_clks_map {
 };
 
 enum mtk_dev_state {
-	MTK_HW_INIT
+	MTK_HW_INIT,
+	MTK_RESETTING
 };
 
 /* struct mtk_tx_buf -	This struct holds the pointers to the memory pointed at
-- 
1.9.1

^ permalink raw reply related

* [PATCH net-next 4/7] net: ethernet: mediatek: add controlling power domain the ethernet belongs to
From: sean.wang @ 2016-09-14  3:50 UTC (permalink / raw)
  To: john, davem; +Cc: nbd, netdev, linux-mediatek, keyhaede, objelf, Sean Wang
In-Reply-To: <1473825041-21072-1-git-send-email-sean.wang@mediatek.com>

From: Sean Wang <sean.wang@mediatek.com>

Introduce control of the power domain that the digital circuit of
the ethernet belongs to into the flow of hardware initialization
and deinitialization. This helps the entire ethernet hardware block
restart cleanly and completely, as if it were back in the initial
state it has when the whole machine reboots.
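
The intended ordering can be shown with a small userspace mock. It is only a
sketch under stated assumptions: the demo_* names are hypothetical, and each
recorded operation stands in for a real call (pm_runtime_enable(),
pm_runtime_get_sync(), clk_prepare_enable() and their counterparts). The
point is that the power domain is brought up before the clocks and released
after them, so teardown mirrors setup.

```c
#include <stddef.h>

/* Hypothetical operation codes; each stands in for one real call. */
enum demo_op {
	DEMO_PM_ENABLE,		/* pm_runtime_enable() */
	DEMO_PM_GET,		/* pm_runtime_get_sync() */
	DEMO_CLK_ON,		/* clk_prepare_enable() */
	DEMO_CLK_OFF,		/* clk_disable_unprepare() */
	DEMO_PM_PUT,		/* pm_runtime_put_sync() */
	DEMO_PM_DISABLE		/* pm_runtime_disable() */
};

/* Records the order in which operations were issued. */
struct demo_trace {
	enum demo_op ops[8];
	size_t n;
};

static void demo_record(struct demo_trace *t, enum demo_op op)
{
	if (t->n < sizeof(t->ops) / sizeof(t->ops[0]))
		t->ops[t->n++] = op;
}

/* Init: power domain first, then clocks. */
static void demo_power_up(struct demo_trace *t)
{
	demo_record(t, DEMO_PM_ENABLE);
	demo_record(t, DEMO_PM_GET);
	demo_record(t, DEMO_CLK_ON);
}

/* Deinit: clocks first, then power domain, mirroring the init order. */
static void demo_power_down(struct demo_trace *t)
{
	demo_record(t, DEMO_CLK_OFF);
	demo_record(t, DEMO_PM_PUT);
	demo_record(t, DEMO_PM_DISABLE);
}
```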

Signed-off-by: Sean Wang <sean.wang@mediatek.com>
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index 917a49c6..fd5d064 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -18,6 +18,7 @@
 #include <linux/mfd/syscon.h>
 #include <linux/regmap.h>
 #include <linux/clk.h>
+#include <linux/pm_runtime.h>
 #include <linux/if_vlan.h>
 #include <linux/reset.h>
 #include <linux/tcp.h>
@@ -1417,6 +1418,9 @@ static int __init mtk_hw_init(struct mtk_eth *eth)
 {
 	int i;
 
+	pm_runtime_enable(eth->dev);
+	pm_runtime_get_sync(eth->dev);
+
 	clk_prepare_enable(eth->clks[MTK_CLK_ETHIF]);
 	clk_prepare_enable(eth->clks[MTK_CLK_ESW]);
 	clk_prepare_enable(eth->clks[MTK_CLK_GP1]);
@@ -1484,6 +1488,9 @@ static int mtk_hw_deinit(struct mtk_eth *eth)
 	clk_disable_unprepare(eth->clks[MTK_CLK_ESW]);
 	clk_disable_unprepare(eth->clks[MTK_CLK_ETHIF]);
 
+	pm_runtime_put_sync(eth->dev);
+	pm_runtime_disable(eth->dev);
+
 	return 0;
 }
 
-- 
1.9.1

^ permalink raw reply related

* [PATCH net-next 0/7] add enhancement into the existing reset flow
From: sean.wang @ 2016-09-14  3:50 UTC (permalink / raw)
  To: john, davem; +Cc: nbd, netdev, linux-mediatek, keyhaede, objelf, Sean Wang

From: Sean Wang <sean.wang@mediatek.com>

The current driver only resets the DMA used by the descriptor rings,
which can't guarantee recovery from all kinds of fatal errors, so
this series
1) tries to reset from scratch the underlying hardware resources on
the Mediatek SoC that ethernet requires to run.
2) refactors code to improve the reusability of existing code.
3) handles the race condition between the reset flow and the callbacks
registered with the core driver that access the hardware.
4) introduces power domain usage to the hardware setup, which leads to
a clean and complete restore to the initial state.

Sean Wang (7):
  net: ethernet: mediatek: refactoring mtk_hw_init to be reused
  net: ethernet: mediatek: add mtk_hw_deinit call as the opposite to
    mtk_hw_init call
  net: ethernet: mediatek: cleanup error path inside mtk_hw_init
  net: ethernet: mediatek: add controlling power domain the ethernet
    belongs to
  net: ethernet: mediatek: add the whole ethernet reset into the reset
    process
  net: ethernet: mediatek: add more resets for internal ethernet circuit
    block
  net: ethernet: mediatek: avoid race condition during the reset process

 drivers/net/ethernet/mediatek/mtk_eth_soc.c | 227 +++++++++++++++++++++-------
 drivers/net/ethernet/mediatek/mtk_eth_soc.h |  15 +-
 2 files changed, 187 insertions(+), 55 deletions(-)

-- 
1.9.1

^ permalink raw reply

* Re: [PATCHv2 next 3/3] ipvlan: Introduce l3s mode
From: David Ahern @ 2016-09-14  3:24 UTC (permalink / raw)
  To: Mahesh Bandewar, netdev; +Cc: Eric Dumazet, David Miller, Mahesh Bandewar
In-Reply-To: <1473703312-26757-1-git-send-email-mahesh@bandewar.net>

On 9/12/16 12:01 PM, Mahesh Bandewar wrote:

> +struct sk_buff *ipvlan_l3_rcv(struct net_device *dev, struct sk_buff *skb,
> +			      u16 proto)
> +{
> +	struct ipvl_addr *addr;
> +	struct net_device *sdev;
> +
> +	addr = ipvlan_skb_to_addr(skb, dev);
> +	if (!addr)
> +		goto out;
> +
> +	sdev = addr->master->dev;
> +	switch (proto) {
> +	case AF_INET:
> +	{
> +		int err;
> +		struct iphdr *ip4h = ip_hdr(skb);
> +
> +		err = ip_route_input_noref(skb, ip4h->daddr, ip4h->saddr,
> +					   ip4h->tos, sdev);
> +		if (unlikely(err))
> +			goto out;
> +		break;
> +	}
> +	case AF_INET6:
> +	{
> +		struct dst_entry *dst;
> +		struct ipv6hdr *ip6h = ipv6_hdr(skb);
> +		int flags = RT6_LOOKUP_F_HAS_SADDR;
> +		struct flowi6 fl6 = {
> +			.flowi6_iif   = sdev->ifindex,
> +			.daddr        = ip6h->daddr,
> +			.saddr        = ip6h->saddr,
> +			.flowlabel    = ip6_flowinfo(ip6h),
> +			.flowi6_mark  = skb->mark,
> +			.flowi6_proto = ip6h->nexthdr,
> +		};
> +
> +		skb_dst_drop(skb);
> +		dst = ip6_route_input_lookup(dev_net(sdev), sdev, &fl6, flags);
> +		skb_dst_set(skb, dst);
> +		break;
> +	}
> +	default:
> +		break;
> +	}

Nit: why not put the above in separate per-version functions (ipvlan_ip_rcv and ipvlan_ip6_rcv) similar to what is done for ipvlan_process_outbound?


> +
> +out:
> +	return skb;
> +}
> +
> +unsigned int ipvlan_nf_input(void *priv, struct sk_buff *skb,
> +			     const struct nf_hook_state *state)
> +{
> +	struct ipvl_addr *addr;
> +	unsigned int len;
> +
> +	addr = ipvlan_skb_to_addr(skb, skb->dev);
> +	if (!addr)
> +		goto out;
> +
> +	skb->dev = addr->master->dev;
> +	len = skb->len + ETH_HLEN;
> +	ipvlan_count_rx(addr->master, len, true, false);
> +out:
> +	return NF_ACCEPT;
> +}
> diff --git a/drivers/net/ipvlan/ipvlan_main.c b/drivers/net/ipvlan/ipvlan_main.c
> index 18b4e8c7f68a..d02be277e1db 100644
> --- a/drivers/net/ipvlan/ipvlan_main.c
> +++ b/drivers/net/ipvlan/ipvlan_main.c
> @@ -9,24 +9,65 @@
>  
>  #include "ipvlan.h"
>  
> +static struct nf_hook_ops ipvl_nfops[] __read_mostly = {
> +	{
> +		.hook     = ipvlan_nf_input,
> +		.pf       = NFPROTO_IPV4,
> +		.hooknum  = NF_INET_LOCAL_IN,
> +		.priority = INT_MAX,
> +	},
> +	{
> +		.hook     = ipvlan_nf_input,
> +		.pf       = NFPROTO_IPV6,
> +		.hooknum  = NF_INET_LOCAL_IN,
> +		.priority = INT_MAX,
> +	},
> +};
> +
> +static struct l3mdev_ops ipvl_l3mdev_ops __read_mostly = {
> +	.l3mdev_l3_rcv = ipvlan_l3_rcv,
> +};
> +
>  static void ipvlan_adjust_mtu(struct ipvl_dev *ipvlan, struct net_device *dev)
>  {
>  	ipvlan->dev->mtu = dev->mtu - ipvlan->mtu_adj;
>  }
>  
> -static void ipvlan_set_port_mode(struct ipvl_port *port, u16 nval)
> +static int ipvlan_set_port_mode(struct ipvl_port *port, u16 nval)
>  {
>  	struct ipvl_dev *ipvlan;
> +	int err = 0;
>  
> +	ASSERT_RTNL();
>  	if (port->mode != nval) {
> +		if (nval == IPVLAN_MODE_L3S) {
> +			port->dev->l3mdev_ops = &ipvl_l3mdev_ops;
> +			port->dev->priv_flags |= IFF_L3MDEV_MASTER;
> +			if (!port->ipt_hook_added) {
> +				err = _nf_register_hooks(ipvl_nfops,
> +							ARRAY_SIZE(ipvl_nfops));

That's clever. The hooks are not device based so why do the register for each device? Alternatively, you could use a static dst like VRF does for Tx. In the ipvlan rcv function set the dst input handler to send the packet back to the ipvlan driver via dst->input. From there send the packet through the netfilter hooks and then do the real lookup, update the dst and call its input function. I have working code for VRF driver somewhere that shows how to do this.

 
> +				if (!err)
> +					port->ipt_hook_added = true;
> +				else
> +					return err;
> +			}
> +		} else {
> +			port->dev->priv_flags &= ~IFF_L3MDEV_MASTER;
> +			port->dev->l3mdev_ops = NULL;
> +			if (port->ipt_hook_added)
> +				_nf_unregister_hooks(ipvl_nfops,
> +						     ARRAY_SIZE(ipvl_nfops));
> +			port->ipt_hook_added = false;
> +		}

^ permalink raw reply

* Re: [RFC PATCH v3 3/7] proc: Reduce cache miss in snmp6_seq_show
From: hejianet @ 2016-09-14  3:11 UTC (permalink / raw)
  To: Marcelo
  Cc: netdev, linux-sctp, linux-kernel, davem, Alexey Kuznetsov,
	James Morris, Hideaki YOSHIFUJI, Patrick McHardy, Vlad Yasevich,
	Neil Horman, Steffen Klassert, Herbert Xu
In-Reply-To: <20160912190502.GC17689@localhost.localdomain>



On 9/13/16 3:05 AM, Marcelo wrote:
> On Fri, Sep 09, 2016 at 02:33:58PM +0800, Jia He wrote:
>> This is to use the generic interface snmp_get_cpu_field{,64}_batch to
>> aggregate the data by going through all the items of each cpu sequentially.
>>
>> Signed-off-by: Jia He <hejianet@gmail.com>
>> ---
>>   net/ipv6/proc.c | 32 +++++++++++++++++++++++---------
>>   1 file changed, 23 insertions(+), 9 deletions(-)
>>
>> diff --git a/net/ipv6/proc.c b/net/ipv6/proc.c
>> index 679253d0..50ba2c3 100644
>> --- a/net/ipv6/proc.c
>> +++ b/net/ipv6/proc.c
>> @@ -30,6 +30,11 @@
>>   #include <net/transp_v6.h>
>>   #include <net/ipv6.h>
>>   
>> +#define MAX4(a, b, c, d) \
>> +	max_t(u32, max_t(u32, a, b), max_t(u32, c, d))
>> +#define SNMP_MIB_MAX MAX4(UDP_MIB_MAX, TCP_MIB_MAX, \
>> +			IPSTATS_MIB_MAX, ICMP_MIB_MAX)
>> +
>>   static int sockstat6_seq_show(struct seq_file *seq, void *v)
>>   {
>>   	struct net *net = seq->private;
>> @@ -192,13 +197,19 @@ static void snmp6_seq_show_item(struct seq_file *seq, void __percpu *pcpumib,
>>   				const struct snmp_mib *itemlist)
>>   {
>>   	int i;
>> -	unsigned long val;
>> -
>> -	for (i = 0; itemlist[i].name; i++) {
>> -		val = pcpumib ?
>> -			snmp_fold_field(pcpumib, itemlist[i].entry) :
>> -			atomic_long_read(smib + itemlist[i].entry);
>> -		seq_printf(seq, "%-32s\t%lu\n", itemlist[i].name, val);
>> +	unsigned long buff[SNMP_MIB_MAX];
>> +
>> +	memset(buff, 0, sizeof(unsigned long) * SNMP_MIB_MAX);
> This memset() could be moved...
>
>> +
>> +	if (pcpumib) {
> ... here, so it's not executed if it hits the else block.
Thanks for the suggestion
B.R.
Jia
>> +		snmp_get_cpu_field_batch(buff, itemlist, pcpumib);
>> +		for (i = 0; itemlist[i].name; i++)
>> +			seq_printf(seq, "%-32s\t%lu\n",
>> +				   itemlist[i].name, buff[i]);
>> +	} else {
>> +		for (i = 0; itemlist[i].name; i++)
>> +			seq_printf(seq, "%-32s\t%lu\n", itemlist[i].name,
>> +				   atomic_long_read(smib + itemlist[i].entry));
>>   	}
>>   }
>>   
>> @@ -206,10 +217,13 @@ static void snmp6_seq_show_item64(struct seq_file *seq, void __percpu *mib,
>>   				  const struct snmp_mib *itemlist, size_t syncpoff)
>>   {
>>   	int i;
>> +	u64 buff64[SNMP_MIB_MAX];
>> +
>> +	memset(buff64, 0, sizeof(unsigned long) * SNMP_MIB_MAX);
>>   
>> +	snmp_get_cpu_field64_batch(buff64, itemlist, mib, syncpoff);
>>   	for (i = 0; itemlist[i].name; i++)
>> -		seq_printf(seq, "%-32s\t%llu\n", itemlist[i].name,
>> -			   snmp_fold_field64(mib, itemlist[i].entry, syncpoff));
>> +		seq_printf(seq, "%-32s\t%llu\n", itemlist[i].name, buff64[i]);
>>   }
>>   
>>   static int snmp6_seq_show(struct seq_file *seq, void *v)
>> -- 
>> 1.8.3.1
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>

^ permalink raw reply

* Re: [RFC PATCH v3 2/7] proc: Reduce cache miss in {snmp,netstat}_seq_show
From: hejianet @ 2016-09-14  3:10 UTC (permalink / raw)
  To: Marcelo
  Cc: netdev, linux-sctp, linux-kernel, davem, Alexey Kuznetsov,
	James Morris, Hideaki YOSHIFUJI, Patrick McHardy, Vlad Yasevich,
	Neil Horman, Steffen Klassert, Herbert Xu
In-Reply-To: <20160912185752.GB17689@localhost.localdomain>

Hi Marcelo


On 9/13/16 2:57 AM, Marcelo wrote:
> On Fri, Sep 09, 2016 at 02:33:57PM +0800, Jia He wrote:
>> This is to use the generic interface snmp_get_cpu_field{,64}_batch to
>> aggregate the data by going through all the items of each cpu sequentially.
>> Then snmp_seq_show and netstat_seq_show are split into 2 parts to avoid build
>> warning "the frame size" larger than 1024 on s390.
> Yeah about that, did you test it with stack overflow detection?
> These arrays can be quite large.
>
> One more below..
I found scripts/checkstack.pl could analyze the stack usage statically.
[root@tian-lp1 kernel]# objdump -d vmlinux | scripts/checkstack.pl ppc64|grep seq
0xc0000000007d4b18 netstat_seq_show_tcpext.isra.7 [vmlinux]:1120
0xc0000000007ccbe8 fib_triestat_seq_show [vmlinux]:     496
0xc00000000083e7a4 tcp6_seq_show [vmlinux]:             480
0xc0000000007d4908 snmp_seq_show_ipstats.isra.6 [vmlinux]:464
0xc0000000007d4d18 netstat_seq_show_ipext.isra.8 [vmlinux]:464
0xc0000000006f5bd8 proto_seq_show [vmlinux]:            416
0xc0000000007f5718 xfrm_statistics_seq_show [vmlinux]:  416
0xc0000000007405b4 dev_seq_printf_stats [vmlinux]:      400

It seems the stack usage in netstat_seq_show_tcpext is too big.
I will consider how to reduce it.

B.R.
Jia
>> Signed-off-by: Jia He <hejianet@gmail.com>
>> ---
>>   net/ipv4/proc.c | 106 +++++++++++++++++++++++++++++++++++++++-----------------
>>   1 file changed, 74 insertions(+), 32 deletions(-)
>>
>> diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
>> index 9f665b6..c6fc80e 100644
>> --- a/net/ipv4/proc.c
>> +++ b/net/ipv4/proc.c
>> @@ -46,6 +46,8 @@
>>   #include <net/sock.h>
>>   #include <net/raw.h>
>>   
>> +#define TCPUDP_MIB_MAX max_t(u32, UDP_MIB_MAX, TCP_MIB_MAX)
>> +
>>   /*
>>    *	Report socket allocation statistics [mea@utu.fi]
>>    */
>> @@ -378,13 +380,15 @@ static void icmp_put(struct seq_file *seq)
>>   /*
>>    *	Called from the PROCfs module. This outputs /proc/net/snmp.
>>    */
>> -static int snmp_seq_show(struct seq_file *seq, void *v)
>> +static int snmp_seq_show_ipstats(struct seq_file *seq, void *v)
>>   {
>>   	int i;
>> +	u64 buff64[IPSTATS_MIB_MAX];
>>   	struct net *net = seq->private;
>>   
>> -	seq_puts(seq, "Ip: Forwarding DefaultTTL");
>> +	memset(buff64, 0, IPSTATS_MIB_MAX * sizeof(u64));
>>   
>> +	seq_puts(seq, "Ip: Forwarding DefaultTTL");
>>   	for (i = 0; snmp4_ipstats_list[i].name != NULL; i++)
>>   		seq_printf(seq, " %s", snmp4_ipstats_list[i].name);
>>   
>> @@ -393,57 +397,77 @@ static int snmp_seq_show(struct seq_file *seq, void *v)
>>   		   net->ipv4.sysctl_ip_default_ttl);
>>   
>>   	BUILD_BUG_ON(offsetof(struct ipstats_mib, mibs) != 0);
>> +	snmp_get_cpu_field64_batch(buff64, snmp4_ipstats_list,
>> +				   net->mib.ip_statistics,
>> +				   offsetof(struct ipstats_mib, syncp));
>>   	for (i = 0; snmp4_ipstats_list[i].name != NULL; i++)
>> -		seq_printf(seq, " %llu",
>> -			   snmp_fold_field64(net->mib.ip_statistics,
>> -					     snmp4_ipstats_list[i].entry,
>> -					     offsetof(struct ipstats_mib, syncp)));
>> +		seq_printf(seq, " %llu", buff64[i]);
>>   
>> -	icmp_put(seq);	/* RFC 2011 compatibility */
>> -	icmpmsg_put(seq);
>> +	return 0;
>> +}
>> +
>> +static int snmp_seq_show_tcp_udp(struct seq_file *seq, void *v)
>> +{
>> +	int i;
>> +	unsigned long buff[TCPUDP_MIB_MAX];
>> +	struct net *net = seq->private;
>> +
>> +	memset(buff, 0, TCPUDP_MIB_MAX * sizeof(unsigned long));
>>   
>>   	seq_puts(seq, "\nTcp:");
>>   	for (i = 0; snmp4_tcp_list[i].name != NULL; i++)
>>   		seq_printf(seq, " %s", snmp4_tcp_list[i].name);
>>   
>>   	seq_puts(seq, "\nTcp:");
>> +	snmp_get_cpu_field_batch(buff, snmp4_tcp_list,
>> +				 net->mib.tcp_statistics);
>>   	for (i = 0; snmp4_tcp_list[i].name != NULL; i++) {
>>   		/* MaxConn field is signed, RFC 2012 */
>>   		if (snmp4_tcp_list[i].entry == TCP_MIB_MAXCONN)
>> -			seq_printf(seq, " %ld",
>> -				   snmp_fold_field(net->mib.tcp_statistics,
>> -						   snmp4_tcp_list[i].entry));
>> +			seq_printf(seq, " %ld", buff[i]);
>>   		else
>> -			seq_printf(seq, " %lu",
>> -				   snmp_fold_field(net->mib.tcp_statistics,
>> -						   snmp4_tcp_list[i].entry));
>> +			seq_printf(seq, " %lu", buff[i]);
>>   	}
>>   
>> +	memset(buff, 0, TCPUDP_MIB_MAX * sizeof(unsigned long));
>> +
>> +	snmp_get_cpu_field_batch(buff, snmp4_udp_list,
>> +				 net->mib.udp_statistics);
>>   	seq_puts(seq, "\nUdp:");
>>   	for (i = 0; snmp4_udp_list[i].name != NULL; i++)
>>   		seq_printf(seq, " %s", snmp4_udp_list[i].name);
>> -
>>   	seq_puts(seq, "\nUdp:");
>>   	for (i = 0; snmp4_udp_list[i].name != NULL; i++)
>> -		seq_printf(seq, " %lu",
>> -			   snmp_fold_field(net->mib.udp_statistics,
>> -					   snmp4_udp_list[i].entry));
>> +		seq_printf(seq, " %lu", buff[i]);
>> +
>> +	memset(buff, 0, TCPUDP_MIB_MAX * sizeof(unsigned long));
>>   
>>   	/* the UDP and UDP-Lite MIBs are the same */
>>   	seq_puts(seq, "\nUdpLite:");
>> +	snmp_get_cpu_field_batch(buff, snmp4_udp_list,
>> +				 net->mib.udplite_statistics);
>>   	for (i = 0; snmp4_udp_list[i].name != NULL; i++)
>>   		seq_printf(seq, " %s", snmp4_udp_list[i].name);
>> -
>>   	seq_puts(seq, "\nUdpLite:");
>>   	for (i = 0; snmp4_udp_list[i].name != NULL; i++)
>> -		seq_printf(seq, " %lu",
>> -			   snmp_fold_field(net->mib.udplite_statistics,
>> -					   snmp4_udp_list[i].entry));
>> +		seq_printf(seq, " %lu", buff[i]);
>>   
>>   	seq_putc(seq, '\n');
>>   	return 0;
>>   }
>>   
>> +static int snmp_seq_show(struct seq_file *seq, void *v)
>> +{
>> +	snmp_seq_show_ipstats(seq, v);
>> +
>> +	icmp_put(seq);	/* RFC 2011 compatibility */
>> +	icmpmsg_put(seq);
>> +
>> +	snmp_seq_show_tcp_udp(seq, v);
>> +
>> +	return 0;
>> +}
>> +
>>   static int snmp_seq_open(struct inode *inode, struct file *file)
>>   {
>>   	return single_open_net(inode, file, snmp_seq_show);
>> @@ -457,41 +481,59 @@ static const struct file_operations snmp_seq_fops = {
>>   	.release = single_release_net,
>>   };
>>   
>> -
>> -
>>   /*
>>    *	Output /proc/net/netstat
>>    */
>> -static int netstat_seq_show(struct seq_file *seq, void *v)
>> +static int netstat_seq_show_tcpext(struct seq_file *seq, void *v)
>>   {
>>   	int i;
>> +	unsigned long buff[LINUX_MIB_MAX];
>>   	struct net *net = seq->private;
>>   
>> +	memset(buff, 0, sizeof(unsigned long) * LINUX_MIB_MAX);
>> +
>>   	seq_puts(seq, "TcpExt:");
>>   	for (i = 0; snmp4_net_list[i].name != NULL; i++)
>>   		seq_printf(seq, " %s", snmp4_net_list[i].name);
>>   
>>   	seq_puts(seq, "\nTcpExt:");
>> +	snmp_get_cpu_field_batch(buff, snmp4_net_list,
>> +				 net->mib.net_statistics);
>>   	for (i = 0; snmp4_net_list[i].name != NULL; i++)
>> -		seq_printf(seq, " %lu",
>> -			   snmp_fold_field(net->mib.net_statistics,
>> -					   snmp4_net_list[i].entry));
>> +		seq_printf(seq, " %lu", buff[i]);
>> +
>> +	return 0;
>> +}
>> +
>> +static int netstat_seq_show_ipext(struct seq_file *seq, void *v)
>> +{
>> +	int i;
>> +	u64 buff64[IPSTATS_MIB_MAX];
>> +	struct net *net = seq->private;
>>   
>>   	seq_puts(seq, "\nIpExt:");
>>   	for (i = 0; snmp4_ipextstats_list[i].name != NULL; i++)
>>   		seq_printf(seq, " %s", snmp4_ipextstats_list[i].name);
>>   
>>   	seq_puts(seq, "\nIpExt:");
> You're missing a memset() call here.
>
>> +	snmp_get_cpu_field64_batch(buff64, snmp4_ipextstats_list,
>> +				   net->mib.ip_statistics,
>> +				   offsetof(struct ipstats_mib, syncp));
>>   	for (i = 0; snmp4_ipextstats_list[i].name != NULL; i++)
>> -		seq_printf(seq, " %llu",
>> -			   snmp_fold_field64(net->mib.ip_statistics,
>> -					     snmp4_ipextstats_list[i].entry,
>> -					     offsetof(struct ipstats_mib, syncp)));
>> +		seq_printf(seq, " %llu", buff64[i]);
>>   
>>   	seq_putc(seq, '\n');
>>   	return 0;
>>   }
>>   
>> +static int netstat_seq_show(struct seq_file *seq, void *v)
>> +{
>> +	netstat_seq_show_tcpext(seq, v);
>> +	netstat_seq_show_ipext(seq, v);
>> +
>> +	return 0;
>> +}
>> +
>>   static int netstat_seq_open(struct inode *inode, struct file *file)
>>   {
>>   	return single_open_net(inode, file, netstat_seq_show);
>> -- 
>> 1.8.3.1
>>

^ permalink raw reply

* RE: [PATCH net-next 3/5] liquidio CN23XX: Mailbox support
From: Vatsavayi, Raghu @ 2016-09-14  1:25 UTC (permalink / raw)
  To: David Miller
  Cc: netdev@vger.kernel.org, Chickles, Derek, Burla, Satananda,
	Manlunas, Felix
In-Reply-To: <20160910.214201.116857018916997521.davem@davemloft.net>

Sure Dave, I will submit new patches with these changes.
Thanks
Raghu.

> -----Original Message-----
> From: David Miller [mailto:davem@davemloft.net]
> Sent: Saturday, September 10, 2016 9:42 PM
> To: Vatsavayi, Raghu
> Cc: netdev@vger.kernel.org; Chickles, Derek; Burla, Satananda; Manlunas,
> Felix; Vatsavayi, Raghu
> Subject: Re: [PATCH net-next 3/5] liquidio CN23XX: Mailbox support
> 
> From: Raghu Vatsavayi <rvatsavayi@caviumnetworks.com>
> Date: Fri, 9 Sep 2016 13:08:25 -0700
> 
> > +int octeon_mbox_read(struct octeon_mbox *mbox) {
> > +	int ret = 0;
> > +	union octeon_mbox_message msg;
> > +
> 
> Please always order local variable declarations from longest to shortest line.
> 
> Please audit your entire submission for this problem.
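
For illustration, the requested ordering applied to the two declarations
quoted above could look like the sketch below. The types, the reg field and
the function body are hypothetical stand-ins so that the snippet compiles on
its own; they are not the liquidio definitions.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the driver's mailbox types. */
union demo_mbox_message {
	uint64_t u64;
	uint8_t bytes[8];
};

struct demo_mbox {
	uint64_t reg;	/* pretend mailbox register contents */
};

static int demo_mbox_read(const struct demo_mbox *mbox)
{
	union demo_mbox_message msg;	/* longest declaration line first */
	int ret = 0;			/* shorter line after it */

	msg.u64 = mbox->reg;
	if (msg.u64 == 0)
		ret = -1;		/* nothing to read */

	return ret;
}
```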

^ permalink raw reply

* Re: Modification to skb->queue_mapping affecting performance
From: Eric Dumazet @ 2016-09-14  1:18 UTC (permalink / raw)
  To: Michael Ma; +Cc: netdev
In-Reply-To: <CAAmHdhy_=GJKBE9=YD2n8MZOaBzD_JJrk4i8Rq43ob2HbWqbCA@mail.gmail.com>

On Tue, 2016-09-13 at 17:23 -0700, Michael Ma wrote:

> If I understand correctly this is still to associate a qdisc with each
> ifb TXQ. How should I do this if I want to use HTB? I guess I'll need
> to divide the bandwidth of each class in HTB by the number of TX
> queues for each individual HTB qdisc associated?
> 
> My original idea was to attach a HTB qdisc for each ifb queue
> representing a set of flows not sharing bandwidth with others so that
> root lock contention still happens but only affects flows in the same
> HTB. Did I understand the root lock contention issue incorrectly for
> ifb? I do see some comments in __dev_queue_xmit() about using a
> different code path for software devices which bypasses
> __dev_xmit_skb(). Does this mean ifb won't go through
> __dev_xmit_skb()?

You can install HTB on all of your MQ children for sure.

Again, there is no qdisc lock contention if you properly use MQ.

Now if you _need_ to install a single qdisc for whatever reason, then
maybe you want to use a single rx queue on the NIC, to reduce lock
contention ;)
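
A minimal dry-run sketch of the "HTB on all of your MQ children" suggestion above, assuming the even per-queue split that Michael guesses at (the thread does not confirm that this is the right division). It prints the tc commands rather than running them. `TOTAL_MBIT`, the handle numbering, and both function names are hypothetical, chosen only for illustration; `IFB` and `TXQ` follow the quoted setup script.

```shell
#!/bin/sh
# Dry run: prints tc commands instead of executing them.
IFB=${IFB:-ifb10}               # ifb device name, as in the quoted setup
TXQ=${TXQ:-24}                  # number of ifb TX queues
TOTAL_MBIT=${TOTAL_MBIT:-2400}  # hypothetical aggregate rate (assumption)

# Integer share of the aggregate rate for one queue (assumes an even split).
per_queue_rate() {
    echo $(( $1 / $2 ))
}

# Emit commands that attach one HTB qdisc, with one class, under each MQ child.
htb_per_queue_cmds() {
    dev=$1; txq=$2; total=$3
    rate=$(per_queue_rate "$total" "$txq")
    echo "tc qdisc add dev $dev root handle 1: mq"
    i=1
    while [ "$i" -le "$txq" ]; do
        slot=$(printf %x "$i")               # MQ child minor IDs are hexadecimal
        handle=$(printf %x $(( i + 15 )))    # hypothetical handles: 10:, 11:, ...
        echo "tc qdisc add dev $dev parent 1:$slot handle $handle: htb default 1"
        echo "tc class add dev $dev parent $handle: classid $handle:1 htb rate ${rate}mbit"
        i=$(( i + 1 ))
    done
}

htb_per_queue_cmds "$IFB" "$TXQ" "$TOTAL_MBIT"
```

Each HTB instance here is an independent qdisc, so a flow pinned to one queue would presumably not be able to borrow unused rate from the others; whether that matters depends on how flows are spread across the queues.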

^ permalink raw reply

* Re: [PATCH] MAINTAINERS: Remove myself from PA Semi entries
From: Wolfram Sang @ 2016-09-14  0:31 UTC (permalink / raw)
  To: Olof Johansson; +Cc: Michael Ellerman, netdev, linux-i2c, davem, jdelvare
In-Reply-To: <1473803318-25197-1-git-send-email-olof@lixom.net>

[-- Attachment #1: Type: text/plain, Size: 2114 bytes --]

On Tue, Sep 13, 2016 at 02:48:38PM -0700, Olof Johansson wrote:
> The platform is old, very few users and I lack bandwidth to keep after
> it these days.
> 
> Mark the base platform as well as the drivers as orphans, patches have
> been flowing through the fallback maintainers for a while already.
> 
> Signed-off-by: Olof Johansson <olof@lixom.net>
> ---
> 
> Jean, Dave,
> 
> I was hoping to have Michael merge this since the bulk of the platform is under him,
> cc:ing you mostly to be aware that I am orphaning a driver in your subsystems.

Let me answer for Jean since I took over I2C in November 2012 ;) I'd
think the entry can go completely. The last 'F:' tag for the platform
catches the I2C driver anyhow. But in general:

Acked-by: Wolfram Sang <wsa@the-dreams.de>

Thanks,

   Wolfram

> 
> 
> Thanks,
> 
> -Olof
> 
>  MAINTAINERS | 9 +++------
>  1 file changed, 3 insertions(+), 6 deletions(-)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index d8e81b1..411f4f7 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -7155,9 +7155,8 @@ F:	arch/powerpc/platforms/83xx/
>  F:	arch/powerpc/platforms/85xx/
>  
>  LINUX FOR POWERPC PA SEMI PWRFICIENT
> -M:	Olof Johansson <olof@lixom.net>
>  L:	linuxppc-dev@lists.ozlabs.org
> -S:	Maintained
> +S:	Orphan
>  F:	arch/powerpc/platforms/pasemi/
>  F:	drivers/*/*pasemi*
>  F:	drivers/*/*/*pasemi*
> @@ -8849,15 +8848,13 @@ S:	Maintained
>  F:	drivers/net/wireless/intersil/p54/
>  
>  PA SEMI ETHERNET DRIVER
> -M:	Olof Johansson <olof@lixom.net>
>  L:	netdev@vger.kernel.org
> -S:	Maintained
> +S:	Orphan
>  F:	drivers/net/ethernet/pasemi/*
>  
>  PA SEMI SMBUS DRIVER
> -M:	Olof Johansson <olof@lixom.net>
>  L:	linux-i2c@vger.kernel.org
> -S:	Maintained
> +S:	Orphan
>  F:	drivers/i2c/busses/i2c-pasemi.c
>  
>  PADATA PARALLEL EXECUTION MECHANISM
> -- 
> 2.8.0.rc3.29.gb552ff8
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-i2c" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply

* Re: Modification to skb->queue_mapping affecting performance
From: Michael Ma @ 2016-09-14  0:23 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1473811749.18970.259.camel@edumazet-glaptop3.roam.corp.google.com>

2016-09-13 17:09 GMT-07:00 Eric Dumazet <eric.dumazet@gmail.com>:
> On Tue, 2016-09-13 at 16:30 -0700, Michael Ma wrote:
>
>> The RX queue number I found from "ls /sys/class/net/eth0/queues" is
>> 64. (is this the correct way of identifying the queue number on NIC?)
>> I setup ifb with 24 queues which is equal to the TX queue number of
>> eth0 and also the number of CPU cores.
>
> Please do not drop netdev@ from this mail exchange.

Sorry that I accidentally dropped that.
>
> ethtool -l eth0
>
>>
>> > There is no qdisc lock contention anymore AFAIK, since each cpu will use
>> > a dedicated IFB queue and tasklet.
>> >
>> How is this achieved? I thought qdisc on ifb will still be protected
>> by the qdisc root lock in __dev_xmit_skb() so essentially all threads
>> processing qdisc are still serialized without using MQ?
>
> You have to properly setup ifb/mq like in :
>
> # netem based setup, installed at receiver side only
> ETH=eth0
> IFB=ifb10
> #DELAY="delay 100ms"
> EST="est 1sec 4sec"
> #REORDER=1000us
> #LOSS="loss 2.0"
> TXQ=24  # change this to number of TX queues on the physical NIC
>
> ip link add $IFB numtxqueues $TXQ type ifb
> ip link set dev $IFB up
>
> tc qdisc del dev $ETH ingress 2>/dev/null
> tc qdisc add dev $ETH ingress 2>/dev/null
>
> tc filter add dev $ETH parent ffff: \
>    protocol ip u32 match u32 0 0 flowid 1:1 \
>         action mirred egress redirect dev $IFB
>
> tc qdisc del dev $IFB root 2>/dev/null
>
> tc qdisc add dev $IFB root handle 1: mq
> for i in `seq 1 $TXQ`
> do
>  slot=$( printf %x $(( i )) )
>  tc qd add dev $IFB parent 1:$slot $EST netem \
>         limit 100000 $DELAY $REORDER $LOSS
> done
>
>
If I understand correctly this is still to associate a qdisc with each
ifb TXQ. How should I do this if I want to use HTB? I guess I'll need
to divide the bandwidth of each class in HTB by the number of TX
queues for each individual HTB qdisc associated?

My original idea was to attach a HTB qdisc for each ifb queue
representing a set of flows not sharing bandwidth with others so that
root lock contention still happens but only affects flows in the same
HTB. Did I understand the root lock contention issue incorrectly for
ifb? I do see some comments in __dev_queue_xmit() about using a
different code path for software devices which bypasses
__dev_xmit_skb(). Does this mean ifb won't go through
__dev_xmit_skb()?
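
On the queue-count question earlier in this message: a small sketch that counts the `rx-N` and `tx-N` entries under the sysfs `queues` directory, which is the listing Michael describes. Eric's quoted reply points to `ethtool -l eth0` instead, and this sketch does not replace that check. The function name `count_queues` is hypothetical.

```shell
#!/bin/sh
# Count entries named <prefix>-N inside a queues directory.
count_queues() {
    dir=$1; prefix=$2
    n=0
    for q in "$dir"/"$prefix"-*; do
        # The glob stays literal when nothing matches, so test for existence.
        [ -e "$q" ] && n=$(( n + 1 ))
    done
    echo "$n"
}

ETH=${ETH:-eth0}
QDIR=/sys/class/net/$ETH/queues
echo "rx queues: $(count_queues "$QDIR" rx)"
echo "tx queues: $(count_queues "$QDIR" tx)"
```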

^ permalink raw reply

* Re: [Intel-wired-lan] [net-next PATCH v3 2/3] e1000: add initial XDP support
From: Rustad, Mark D @ 2016-09-14  0:13 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Eric Dumazet, Tom Herbert, Brenden Blanco,
	Linux Kernel Network Developers, intel-wired-lan,
	Jesper Dangaard Brouer, Cong Wang, David S. Miller, William Tu
In-Reply-To: <20160913234039.GA42336@ast-mbp.thefacebook.com>

[-- Attachment #1: Type: text/plain, Size: 1064 bytes --]

Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Tue, Sep 13, 2016 at 10:41:12PM +0000, Rustad, Mark D wrote:
>> That said, I can see that you have tried to keep the original code path
>> pretty much intact. I would note that you introduced rcu calls into the
>> !bpf path that would never have been done before. While that should be ok,
>> I would really like to see it tested, at least for the !bpf case, on real
>> hardware to be sure.
>
> please go ahead and test. rcu_read_lock is zero extra instructions
> for everything but preempt or debug kernels.

Well, I don't have any hardware in hand to test with, though my former  
employer would. I guess my current employer would too! :-) FWIW, the kernel  
used in that system I referred to before was a preempt kernel.

The test matrix is large, the tail is long and you can't just gloss these  
things over. I understand that it isn't the focus of your work, just as  
regression testing the e1000 is not the focus of any of our work any more.  
That is precisely why it is a sensitive area.

^ permalink raw reply
