Netdev List
 help / color / mirror / Atom feed
* Re: [net,v2] neigh: fix the loop index error in neigh dump
From: David Ahern @ 2016-11-28  3:09 UTC (permalink / raw)
  To: 张胜举, netdev
In-Reply-To: <a01fcd54-c0ed-9d05-743a-27592d845c56@cumulusnetworks.com>

On 11/27/16 7:56 PM, David Ahern wrote:
> On 11/27/16 7:53 PM, 张胜举 wrote:
>>
>>
>>> -----Original Message-----
>>> From: David Ahern [mailto:dsa@cumulusnetworks.com]
>>> Sent: Monday, November 28, 2016 10:39 AM
>>> To: 张胜举 <zhangshengju@cmss.chinamobile.com>;
>>> netdev@vger.kernel.org
>>> Subject: Re: [net,v2] neigh: fix the loop index error in neigh dump
>>>
>>> On 11/27/16 7:34 PM, 张胜举 wrote:
>>>>> -----Original Message-----
>>>>> From: David Ahern [mailto:dsa@cumulusnetworks.com]
>>>>> Sent: Monday, November 28, 2016 10:10 AM
>>>>> To: Zhang Shengju <zhangshengju@cmss.chinamobile.com>;
>>>>> netdev@vger.kernel.org
>>>>> Subject: Re: [net,v2] neigh: fix the loop index error in neigh dump
>>>>>
>>>>> On 11/27/16 6:32 PM, Zhang Shengju wrote:
>>>>>> Loop index in neigh dump function is not updated correctly under
>>>>>> some circumstances, this patch will fix it.
>>>>>
>>>>> What's an example?
>>>>
>>>> If dev is filtered out, the original code goes to next loop without
>>>> updating loop index 'idx'.
>>>
>>> And you have a use case with missing or redundant data? Or is your
>>> comment based on a review of code only?
>> It's on my code review. No use case currently,  this is uncommon to happen.
>>
>>
>>>
>>>>> You are completely rewriting the dump loops.
>>>>
>>>> I put 'idx++' into for loop,  so I replace 'goto' with 'continue'.
>>>> The other change is style related.
>>>
>>> A "fixes" should not include 'style related' changes.
>> Okay, I will send another version without style changes.
>>
> 
> Personally, I think you need to produce a use case that fails before sending another patch. I have not seen a problem with this code.
> 

And looking back at 3f0ae05d6f I should not have acked it (reviewed it too quickly while on PTO). Your change is a no-op because of what idx represents - the position in the hash list for devices relevant for the dump request. Same goes for the neigh dump so this patch is not needed.

^ permalink raw reply

* Re: [Patch net-next] net_sched: move the empty tp check from ->destroy() to ->delete()
From: John Fastabend @ 2016-11-28  2:57 UTC (permalink / raw)
  To: Cong Wang, netdev; +Cc: roid, jiri, Daniel Borkmann
In-Reply-To: <1479952708-26763-1-git-send-email-xiyou.wangcong@gmail.com>

On 16-11-23 05:58 PM, Cong Wang wrote:
> Roi reported we could have a race condition where in ->classify() path
> we dereference tp->root and meanwhile a parallel ->destroy() makes it
> a NULL.
> 
> This is possible because ->destroy() could be called when deleting
> a filter to check if we are the last one in tp, this tp is still
> linked and visible at that time.
> 
> The root cause of this problem is the semantic of ->destroy(), it
> does two things (for non-force case):
> 
> 1) check if tp is empty
> 2) if tp is empty we could really destroy it
> 
> and its caller, if cares, needs to check its return value to see if
> it is really destroyed. Therefore we can't unlink tp unless we know
> it is empty.
> 
> As suggested by Daniel, we could actually move the test logic to ->delete()
> so that we can safely unlink tp after ->delete() tells us the last one is
> just deleted and before ->destroy().
> 
> What's more, even we unlink it before ->destroy(), it could still have
> readers since we don't wait for a grace period here, we should not modify
> tp->root in ->destroy() either.
> 
> Fixes: 1e052be69d04 ("net_sched: destroy proto tp when all filters are gone")
> Reported-by: Roi Dayan <roid@mellanox.com>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Cc: John Fastabend <john.fastabend@gmail.com>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> ---

Hi Cong,

Thanks a lot for doing this. Can you rebase it on top of Daniel's patch
though,

 [PATCH net] net, sched: respect rcu grace period on cls destruction

And then push the NULL pointer work for the cls_fw and cls_route
classifiers into another patch.

Then I believe the last thing to make this correct is to convert the
call_rcu() paths to call_rcu_bh().

.John

^ permalink raw reply

* Re: [net,v2] neigh: fix the loop index error in neigh dump
From: David Ahern @ 2016-11-28  2:56 UTC (permalink / raw)
  To: 张胜举, netdev
In-Reply-To: <001d01d24922$9a16b1e0$ce4415a0$@cmss.chinamobile.com>

On 11/27/16 7:53 PM, 张胜举 wrote:
> 
> 
>> -----Original Message-----
>> From: David Ahern [mailto:dsa@cumulusnetworks.com]
>> Sent: Monday, November 28, 2016 10:39 AM
>> To: 张胜举 <zhangshengju@cmss.chinamobile.com>;
>> netdev@vger.kernel.org
>> Subject: Re: [net,v2] neigh: fix the loop index error in neigh dump
>>
>> On 11/27/16 7:34 PM, 张胜举 wrote:
>>>> -----Original Message-----
>>>> From: David Ahern [mailto:dsa@cumulusnetworks.com]
>>>> Sent: Monday, November 28, 2016 10:10 AM
>>>> To: Zhang Shengju <zhangshengju@cmss.chinamobile.com>;
>>>> netdev@vger.kernel.org
>>>> Subject: Re: [net,v2] neigh: fix the loop index error in neigh dump
>>>>
>>>> On 11/27/16 6:32 PM, Zhang Shengju wrote:
>>>>> Loop index in neigh dump function is not updated correctly under
>>>>> some circumstances, this patch will fix it.
>>>>
>>>> What's an example?
>>>
>>> If dev is filtered out, the original code goes to next loop without
>>> updating loop index 'idx'.
>>
>> And you have a use case with missing or redundant data? Or is your
>> comment based on a review of code only?
> It's on my code review. No use case currently,  this is uncommon to happen.
> 
> 
>>
>>>> You are completely rewriting the dump loops.
>>>
>>> I put 'idx++' into for loop,  so I replace 'goto' with 'continue'.
>>> The other change is style related.
>>
>> A "fixes" should not include 'style related' changes.
> Okay, I will send another version without style changes.
> 

Personally, I think you need to produce a use case that fails before sending another patch. I have not seen a problem with this code.

^ permalink raw reply

* Re: [PATCH net] net, sched: respect rcu grace period on cls destruction
From: John Fastabend @ 2016-11-28  2:55 UTC (permalink / raw)
  To: Daniel Borkmann, davem; +Cc: xiyou.wangcong, roid, ast, hannes, jiri, netdev
In-Reply-To: <0d6d89f885033f1739e97f7f3372ae6e1db72892.1480204343.git.daniel@iogearbox.net>

On 16-11-26 04:18 PM, Daniel Borkmann wrote:
> Roi reported a crash in flower where tp->root was NULL in ->classify()
> callbacks. Reason is that in ->destroy() tp->root is set to NULL via
> RCU_INIT_POINTER(). It's problematic for some of the classifiers, because
> this doesn't respect RCU grace period for them, and as a result, still
> outstanding readers from tc_classify() will try to blindly dereference
> a NULL tp->root.
> 
> The tp->root object is strictly private to the classifier implementation
> and holds internal data the core such as tc_ctl_tfilter() doesn't know
> about. Within some classifiers, such as cls_bpf, cls_basic, etc, tp->root
> is only checked for NULL in ->get() callback, but nowhere else. This is
> misleading and seemed to be copied from old classifier code that was not
> cleaned up properly. For example, d3fa76ee6b4a ("[NET_SCHED]: cls_basic:
> fix NULL pointer dereference") moved tp->root initialization into ->init()
> routine, where before it was part of ->change(), so ->get() had to deal
> with tp->root being NULL back then, so that was indeed a valid case, after
> d3fa76ee6b4a, not really anymore. We used to set tp->root to NULL long
> ago in ->destroy(), see 47a1a1d4be29 ("pkt_sched: remove unnecessary xchg()
> in packet classifiers"); but the NULLifying was reintroduced with the
> RCUification, but it's not correct for every classifier implementation.
> 
> In the cases that are fixed here with one exception of cls_cgroup, tp->root
> object is allocated and initialized inside ->init() callback, which is always
> performed at a point in time after we allocate a new tp, which means tp and
> thus tp->root was not globally visible in the tp chain yet (see tc_ctl_tfilter()).
> Also, on destruction tp->root is strictly kfree_rcu()'ed in ->destroy()
> handler, same for the tp which is kfree_rcu()'ed right when we return
> from ->destroy() in tcf_destroy(). This means, the head object's lifetime
> for such classifiers is always tied to the tp lifetime. The RCU callback
> invocation for the two kfree_rcu() could be out of order, but that's fine
> since both are independent.
> 
> Dropping the RCU_INIT_POINTER(tp->root, NULL) for these classifiers here
> means that 1) we don't need a useless NULL check in fast-path and, 2) that
> outstanding readers of that tp in tc_classify() can still execute under
> respect with RCU grace period as it is actually expected.
> 
> Things that haven't been touched here: cls_fw and cls_route. They each
> handle tp->root being NULL in ->classify() path for historic reasons, so
> their ->destroy() implementation can stay as is. If someone actually
> cares, they could get cleaned up at some point to avoid the test in fast
> path. cls_u32 doesn't set tp->root to NULL. For cls_rsvp, I just added a
> !head should anyone actually be using/testing it, so it at least aligns with
> cls_fw and cls_route. For cls_flower we additionally need to defer rhashtable
> destruction (to a sleepable context) after RCU grace period as concurrent
> readers might still access it. (Note that in this case we need to hold module
> reference to keep work callback address intact, since we only wait on module
> unload for all call_rcu()s to finish.)
> 
> This fixes one race to bring RCU grace period guarantees back. Next step
> as worked on by Cong however is to fix 1e052be69d04 ("net_sched: destroy
> proto tp when all filters are gone") to get the order of unlinking the tp
> in tc_ctl_tfilter() for the RTM_DELTFILTER case right by moving
> RCU_INIT_POINTER() before tcf_destroy() and let the notification for
> removal be done through the prior ->delete() callback. Both are independant
> issues. Once we have that right, we can then clean tp->root up for a number
> of classifiers by not making them RCU pointers, which requires a new callback
> (->uninit) that is triggered from tp's RCU callback, where we just kfree()
> tp->root from there.

Thanks looks good to me and appreciate the detailed commit message.

Acked-by: John Fastabend <john.r.fastabend@intel.com>

> 
> Fixes: 1f947bf151e9 ("net: sched: rcu'ify cls_bpf")
> Fixes: 9888faefe132 ("net: sched: cls_basic use RCU")
> Fixes: 70da9f0bf999 ("net: sched: cls_flow use RCU")
> Fixes: 77b9900ef53a ("tc: introduce Flower classifier")
> Fixes: bf3994d2ed31 ("net/sched: introduce Match-all classifier")
> Fixes: 952313bd6258 ("net: sched: cls_cgroup use RCU")
> Reported-by: Roi Dayan <roid@mellanox.com>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Cong Wang <xiyou.wangcong@gmail.com>
> Cc: John Fastabend <john.fastabend@gmail.com>
> Cc: Roi Dayan <roid@mellanox.com>
> Cc: Jiri Pirko <jiri@mellanox.com>
> ---

^ permalink raw reply

* RE: [net,v2] neigh: fix the loop index error in neigh dump
From: 张胜举 @ 2016-11-28  2:53 UTC (permalink / raw)
  To: 'David Ahern', netdev
In-Reply-To: <6d1a324e-e29e-e2dd-a6fc-1e9b4455cb3d@cumulusnetworks.com>



> -----Original Message-----
> From: David Ahern [mailto:dsa@cumulusnetworks.com]
> Sent: Monday, November 28, 2016 10:39 AM
> To: 张胜举 <zhangshengju@cmss.chinamobile.com>;
> netdev@vger.kernel.org
> Subject: Re: [net,v2] neigh: fix the loop index error in neigh dump
> 
> On 11/27/16 7:34 PM, 张胜举 wrote:
> >> -----Original Message-----
> >> From: David Ahern [mailto:dsa@cumulusnetworks.com]
> >> Sent: Monday, November 28, 2016 10:10 AM
> >> To: Zhang Shengju <zhangshengju@cmss.chinamobile.com>;
> >> netdev@vger.kernel.org
> >> Subject: Re: [net,v2] neigh: fix the loop index error in neigh dump
> >>
> >> On 11/27/16 6:32 PM, Zhang Shengju wrote:
> >>> Loop index in neigh dump function is not updated correctly under
> >>> some circumstances, this patch will fix it.
> >>
> >> What's an example?
> >
> > If dev is filtered out, the original code goes to next loop without
> > updating loop index 'idx'.
> 
> And you have a use case with missing or redundant data? Or is your
> comment based on a review of code only?
It's on my code review. No use case currently,  this is uncommon to happen.


> 
> >> You are completely rewriting the dump loops.
> >
> > I put 'idx++' into for loop,  so I replace 'goto' with 'continue'.
> > The other change is style related.
> 
> A "fixes" should not include 'style related' changes.
Okay, I will send another version without style changes.

^ permalink raw reply

* [PATCH] net: handle no dst on skb in icmp6_send
From: David Ahern @ 2016-11-28  2:52 UTC (permalink / raw)
  To: netdev; +Cc: andreyknvl, David Ahern

Andrey reported the following while fuzzing the kernel with syzkaller:

kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP KASAN
Modules linked in:
CPU: 0 PID: 3859 Comm: a.out Not tainted 4.9.0-rc6+ #429
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: ffff8800666d4200 task.stack: ffff880067348000
RIP: 0010:[<ffffffff833617ec>]  [<ffffffff833617ec>]
icmp6_send+0x5fc/0x1e30 net/ipv6/icmp.c:451
RSP: 0018:ffff88006734f2c0  EFLAGS: 00010206
RAX: ffff8800666d4200 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: dffffc0000000000 RDI: 0000000000000018
RBP: ffff88006734f630 R08: ffff880064138418 R09: 0000000000000003
R10: dffffc0000000000 R11: 0000000000000005 R12: 0000000000000000
R13: ffffffff84e7e200 R14: ffff880064138484 R15: ffff8800641383c0
FS:  00007fb3887a07c0(0000) GS:ffff88006cc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000000 CR3: 000000006b040000 CR4: 00000000000006f0
Stack:
 ffff8800666d4200 ffff8800666d49f8 ffff8800666d4200 ffffffff84c02460
 ffff8800666d4a1a 1ffff1000ccdaa2f ffff88006734f498 0000000000000046
 ffff88006734f440 ffffffff832f4269 ffff880064ba7456 0000000000000000
Call Trace:
 [<ffffffff83364ddc>] icmpv6_param_prob+0x2c/0x40 net/ipv6/icmp.c:557
 [<     inline     >] ip6_tlvopt_unknown net/ipv6/exthdrs.c:88
 [<ffffffff83394405>] ip6_parse_tlv+0x555/0x670 net/ipv6/exthdrs.c:157
 [<ffffffff8339a759>] ipv6_parse_hopopts+0x199/0x460 net/ipv6/exthdrs.c:663
 [<ffffffff832ee773>] ipv6_rcv+0xfa3/0x1dc0 net/ipv6/ip6_input.c:191
 ...

icmp6_send / icmpv6_send is invoked for both rx and tx paths. In both
cases the dst->dev should be preferred for determining the L3 domain
if the dst has been set on the skb. Fallback to the skb->dev if it has
not. This covers the case reported here where icmp6_send is invoked on
Rx before the route lookup.

Fixes: 5d41ce29e ("net: icmp6_send should use dst dev to determine L3 domain")
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
 net/ipv6/icmp.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/icmp.c b/net/ipv6/icmp.c
index 7370ad2e693a..2772004ba5a1 100644
--- a/net/ipv6/icmp.c
+++ b/net/ipv6/icmp.c
@@ -447,8 +447,10 @@ static void icmp6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info,
 
 	if (__ipv6_addr_needs_scope_id(addr_type))
 		iif = skb->dev->ifindex;
-	else
-		iif = l3mdev_master_ifindex(skb_dst(skb)->dev);
+	else {
+		dst = skb_dst(skb);
+		iif = l3mdev_master_ifindex(dst ? dst->dev : skb->dev);
+	}
 
 	/*
 	 *	Must not send error if the source does not uniquely
-- 
2.1.4

^ permalink raw reply related

* Re: [Patch net-next] net_sched: move the empty tp check from ->destroy() to ->delete()
From: John Fastabend @ 2016-11-28  2:51 UTC (permalink / raw)
  To: Roi Dayan, Daniel Borkmann, Cong Wang
  Cc: Linux Kernel Network Developers, Jiri Pirko
In-Reply-To: <583B95CE.7080309@gmail.com>

On 16-11-27 06:26 PM, John Fastabend wrote:
> On 16-11-26 10:29 PM, Roi Dayan wrote:
>>
>>
>> On 27/11/2016 06:47, Roi Dayan wrote:
>>>
>>>
>>> On 27/11/2016 02:33, Daniel Borkmann wrote:
>>>> On 11/26/2016 12:09 PM, Daniel Borkmann wrote:
>>>>> On 11/26/2016 07:46 AM, Cong Wang wrote:
>>>>>> On Thu, Nov 24, 2016 at 7:20 AM, Daniel Borkmann
>>>>>> <daniel@iogearbox.net> wrote:
>>>> [...]
>>>>>>> Ok, strange, qdisc_destroy() calls into ops->destroy(), where ingress
>>>>>>> drops its entire chain via tcf_destroy_chain(), so that will be NULL
>>>>>>> eventually. The tps are freed by call_rcu() as well as qdisc itself
>>>>>>> later on via qdisc_rcu_free(), where it frees per-cpu bstats as well.
>>>>>>> Outstanding readers should either bail out due to if (!cl) or can
>>>>>>> still
>>>>>>> process the chain until read section ends, but during that time,
>>>>>>> cl->q
>>>>>>> resp. bstats should be good. Do you happen to know what's at address
>>>>>>> ffff880a68b04028? I was wondering wrt call_rcu() vs call_rcu_bh(),
>>>>>>> but
>>>>>>> at least on ingress (netif_receive_skb_internal()) we hold
>>>>>>> rcu_read_lock()
>>>>>>> here. The KASAN report is reliably happening at this location, right?
>>>>>>
>>>>>> I am confused as well, I don't see how it could be related to my
>>>>>> patch yet.
>>>>>> I will take a deep look in the weekend.
>>>
>>>
>>>
>>> Hi Cong,
>>>
>>> When reported the new trace I didn't mean it's related to your patch,
>>> I just wanted to point it out it exposed something. I should have been
>>> clear about it.
>>>
>>>
>>>>>
>>>>> Ok, I'm currently on the run. Got too late yesterday night, but I'll
>>>>> write what I found in the evening today, not related to ingress though.
>>>>
>>>> Just pushed out my analysis to netdev under "[PATCH net] net, sched:
>>>> respect
>>>> rcu grace period on cls destruction". My conclusion is that both
>>>> issues are
>>>> actually separate, and that one is small enough where we could route
>>>> it via
>>>> net actually. Perhaps this at the same time shrinks your "[PATCH
>>>> net-next]
>>>> net_sched: move the empty tp check from ->destroy() to ->delete()" to a
>>>> reasonable size that it's suitable to net as well. Your
>>>> ->delete()/->destroy()
>>>> one is definitely needed, too. The tp->root one is independant of
>>>> ->delete()/
>>>> ->destroy() as they are different races and tp->root could also
>>>> happen when
>>>> you just destroy the whole tp directly. I think that seems like a
>>>> good path
>>>> forward to me.
>>>>
>>>> Thanks,
>>>> Daniel
>>>
>>>
>>>
>>> Hi Daniel,
>>>
>>> As for the tainted kernel. I was in old (week or two) net-next tree
>>> and only cherry-picked from latest net-next related patches to
>>> Mellanox HCA, cls_api, cls_flower, devlink. so those are the tainted
>>> modules.
>>> I have the issue reproducing in that tree so wanted it to check it
>>> with Cong's patch instead of latest net-next.
>>> I'll try running reproducing the issue with your new patch and later
>>> try latest net-next as well.
>>>
>>> Thanks,
>>> Roi
>>>
>>
>> Hi,
>>
>> I tested "[PATCH net] net, sched: respect rcu grace period on cls
>> destruction" and could not reproduce my original issue.
> 
> Hi Roi,
> 
> Just so I'm 100% clear. No issue with just the above "respect rcu grace
> period on cls destruction" per above statement.
> 
>> I rebased "[Patch net-next] net_sched: move the empty tp check from
>> ->destroy() to ->delete()" over to test it in the same tree and got into
>> a new trace in fl_delete.
> 
> In this case did you test with "net_sched: move the empty tp check from
> ->destroy() to ->delete()" _only_ or did this include both patches when
> you see the error below.
> 
> From my inspection we really need both patches to get correct behavior.
> 
> Thanks!
> John

Ah dang nevermind I just read both patches in detail and applying them
both at the same time is nonsense. Let me reply with comments directly
to the patches.

Thanks. sorry for the noise.

> 
>>
>> [35659.012123] BUG: KASAN: wild-memory-access on address 1ffffffff803ca31
>> [35659.020042] Write of size 1 by task ovs-vswitchd/20135
>> [35659.025878] CPU: 19 PID: 20135 Comm: ovs-vswitchd Tainted:
>> G           O    4.9.0-rc3+ #18
>> [35659.035948] Hardware name: HP ProLiant DL380p Gen8, BIOS P70 07/01/2015
>> [35659.043730] Call Trace:
>> [35659.046619]  [<ffffffff95b6dc42>] dump_stack+0x63/0x81
>> [35659.052456]  [<ffffffff955fbbf8>] kasan_report_error+0x408/0x4e0
>> [35659.059402]  [<ffffffff955fc2e8>] kasan_report+0x58/0x60
>> [35659.065428]  [<ffffffff952d5e8d>] ? call_rcu_sched+0x1d/0x20
>> [35659.072119]  [<ffffffffc01e0701>] ? fl_destroy_filter+0x21/0x30
>> [cls_flower]
>> [35659.080217]  [<ffffffffc01e1ccf>] ? fl_delete+0x1df/0x2e0 [cls_flower]
>> [35659.087580]  [<ffffffff955fa4ca>] __asan_store1+0x4a/0x50
>> [35659.093697]  [<ffffffffc01e1ccf>] fl_delete+0x1df/0x2e0 [cls_flower]
>> [35659.100870]  [<ffffffff9653ecba>] tc_ctl_tfilter+0x10da/0x1b90
>>
>>
>> 0x1d02 is in fl_delete (net/sched/cls_flower.c:805).
>> 800             struct cls_fl_filter *f = (struct cls_fl_filter *) arg;
>> 801
>> 802             rhashtable_remove_fast(&head->ht, &f->ht_node,
>> 803                                    head->ht_params);
>> 804             __fl_delete(tp, f);
>> 805             *last = list_empty(&head->filters);
>> 806             return 0;
>> 807     }
>>
>>
>> Thanks,
>> Roi
> 

^ permalink raw reply

* [PATCH net-next v3 4/4] Documentation: net: phy: Add links to several standards documents
From: Florian Fainelli @ 2016-11-28  2:45 UTC (permalink / raw)
  To: netdev
  Cc: davem, andrew, sf84, martin.blumenstingl, mans, alexandre.torgue,
	peppe.cavallaro, timur, jbrunet, Florian Fainelli
In-Reply-To: <20161128024515.13070-1-f.fainelli@gmail.com>

Add links to the IEEE 802.3-2008 document, and the RGMII v1.3 and v2.0
revisions of the standard.

Reviewed-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
 Documentation/networking/phy.txt | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/Documentation/networking/phy.txt b/Documentation/networking/phy.txt
index c7ba84b5d912..e017d933d530 100644
--- a/Documentation/networking/phy.txt
+++ b/Documentation/networking/phy.txt
@@ -407,3 +407,13 @@ Board Fixups
  The stubs set one of the two matching criteria, and set the other one to
  match anything.
 
+Standards
+
+ IEEE Standard 802.3: CSMA/CD Access Method and Physical Layer Specifications, Section Two:
+ http://standards.ieee.org/getieee802/download/802.3-2008_section2.pdf
+
+ RGMII v1.3:
+ http://web.archive.org/web/20160303212629/http://www.hp.com/rnd/pdfs/RGMIIv1_3.pdf
+
+ RGMII v2.0:
+ http://web.archive.org/web/20160303171328/http://www.hp.com/rnd/pdfs/RGMIIv2_0_final_hp.pdf
-- 
2.9.3

^ permalink raw reply related

* [PATCH net-next v3 3/4] Documentation: net: phy: Add blurb about RGMII
From: Florian Fainelli @ 2016-11-28  2:45 UTC (permalink / raw)
  To: netdev
  Cc: davem, andrew, sf84, martin.blumenstingl, mans, alexandre.torgue,
	peppe.cavallaro, timur, jbrunet, Florian Fainelli
In-Reply-To: <20161128024515.13070-1-f.fainelli@gmail.com>

RGMII is a recurring source of pain for people with Gigabit Ethernet
hardware since it may require PHY driver and MAC driver level
configuration hints. Document what are the expectations from PHYLIB and
what options exist.

Reviewed-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
 Documentation/networking/phy.txt | 77 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 77 insertions(+)

diff --git a/Documentation/networking/phy.txt b/Documentation/networking/phy.txt
index 9a42a9414cea..c7ba84b5d912 100644
--- a/Documentation/networking/phy.txt
+++ b/Documentation/networking/phy.txt
@@ -65,6 +65,83 @@ The MDIO bus
  drivers/net/ethernet/freescale/fsl_pq_mdio.c and an associated DTS file
  for one of the users. (e.g. "git grep fsl,.*-mdio arch/powerpc/boot/dts/")
 
+(RG)MII/electrical interface considerations
+
+ The Reduced Gigabit Medium Independent Interface (RGMII) is a 12-pin
+ electrical signal interface using a synchronous 125Mhz clock signal and several
+ data lines. Due to this design decision, a 1.5ns to 2ns delay must be added
+ between the clock line (RXC or TXC) and the data lines to let the PHY (clock
+ sink) have enough setup and hold times to sample the data lines correctly. The
+ PHY library offers different types of PHY_INTERFACE_MODE_RGMII* values to let
+ the PHY driver and optionally the MAC driver, implement the required delay. The
+ values of phy_interface_t must be understood from the perspective of the PHY
+ device itself, leading to the following:
+
+ * PHY_INTERFACE_MODE_RGMII: the PHY is not responsible for inserting any
+   internal delay by itself, it assumes that either the Ethernet MAC (if capable
+   or the PCB traces) insert the correct 1.5-2ns delay
+
+ * PHY_INTERFACE_MODE_RGMII_TXID: the PHY should insert an internal delay
+   for the transmit data lines (TXD[3:0]) processed by the PHY device
+
+ * PHY_INTERFACE_MODE_RGMII_RXID: the PHY should insert an internal delay
+   for the receive data lines (RXD[3:0]) processed by the PHY device
+
+ * PHY_INTERFACE_MODE_RGMII_ID: the PHY should insert internal delays for
+   both transmit AND receive data lines from/to the PHY device
+
+ Whenever possible, use the PHY side RGMII delay for these reasons:
+
+ * PHY devices may offer sub-nanosecond granularity in how they allow a
+   receiver/transmitter side delay (e.g: 0.5, 1.0, 1.5ns) to be specified. Such
+   precision may be required to account for differences in PCB trace lengths
+
+ * PHY devices are typically qualified for a large range of applications
+   (industrial, medical, automotive...), and they provide a constant and
+   reliable delay across temperature/pressure/voltage ranges
+
+ * PHY device drivers in PHYLIB being reusable by nature, being able to
+   configure correctly a specified delay enables more designs with similar delay
+   requirements to be operate correctly
+
+ For cases where the PHY is not capable of providing this delay, but the
+ Ethernet MAC driver is capable of doing so, the correct phy_interface_t value
+ should be PHY_INTERFACE_MODE_RGMII, and the Ethernet MAC driver should be
+ configured correctly in order to provide the required transmit and/or receive
+ side delay from the perspective of the PHY device. Conversely, if the Ethernet
+ MAC driver looks at the phy_interface_t value, for any other mode but
+ PHY_INTERFACE_MODE_RGMII, it should make sure that the MAC-level delays are
+ disabled.
+
+ In case neither the Ethernet MAC, nor the PHY are capable of providing the
+ required delays, as defined per the RGMII standard, several options may be
+ available:
+
+ * Some SoCs may offer a pin pad/mux/controller capable of configuring a given
+   set of pins'strength, delays, and voltage; and it may be a suitable
+   option to insert the expected 2ns RGMII delay.
+
+ * Modifying the PCB design to include a fixed delay (e.g: using a specifically
+   designed serpentine), which may not require software configuration at all.
+
+Common problems with RGMII delay mismatch
+
+ When there is a RGMII delay mismatch between the Ethernet MAC and the PHY, this
+ will most likely result in the clock and data line signals to be unstable when
+ the PHY or MAC take a snapshot of these signals to translate them into logical
+ 1 or 0 states and reconstruct the data being transmitted/received. Typical
+ symptoms include:
+
+ * Transmission/reception partially works, and there is frequent or occasional
+   packet loss observed
+
+ * Ethernet MAC may report some or all packets ingressing with a FCS/CRC error,
+   or just discard them all
+
+ * Switching to lower speeds such as 10/100Mbits/sec makes the problem go away
+   (since there is enough setup/hold time in that case)
+
+
 Connecting to a PHY
 
  Sometime during startup, the network driver needs to establish a connection
-- 
2.9.3

^ permalink raw reply related

* [PATCH net-next v3 2/4] Documentation: net: phy: Add a paragraph about pause frames/flow control
From: Florian Fainelli @ 2016-11-28  2:45 UTC (permalink / raw)
  To: netdev
  Cc: davem, andrew, sf84, martin.blumenstingl, mans, alexandre.torgue,
	peppe.cavallaro, timur, jbrunet, Florian Fainelli
In-Reply-To: <20161128024515.13070-1-f.fainelli@gmail.com>

Describe that the Ethernet MAC controller is ultimately responsible for
dealing with proper pause frames/flow control advertisement and
enabling, and that it is therefore allowed to have it change
phydev->supported/advertising with SUPPORTED_Pause and
SUPPORTED_AsymPause.

Reviewed-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
 Documentation/networking/phy.txt | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/Documentation/networking/phy.txt b/Documentation/networking/phy.txt
index 4b25c0f24201..9a42a9414cea 100644
--- a/Documentation/networking/phy.txt
+++ b/Documentation/networking/phy.txt
@@ -127,8 +127,9 @@ Letting the PHY Abstraction Layer do Everything
  values pruned from them which don't make sense for your controller (a 10/100
  controller may be connected to a gigabit capable PHY, so you would need to
  mask off SUPPORTED_1000baseT*).  See include/linux/ethtool.h for definitions
- for these bitfields. Note that you should not SET any bits, or the PHY may
- get put into an unsupported state.
+ for these bitfields. Note that you should not SET any bits, except the
+ SUPPORTED_Pause and SUPPORTED_AsymPause bits (see below), or the PHY may get
+ put into an unsupported state.
 
  Lastly, once the controller is ready to handle network traffic, you call
  phy_start(phydev).  This tells the PAL that you are ready, and configures the
@@ -139,6 +140,19 @@ Letting the PHY Abstraction Layer do Everything
  When you want to disconnect from the network (even if just briefly), you call
  phy_stop(phydev).
 
+Pause frames / flow control
+
+ The PHY does not participate directly in flow control/pause frames except by
+ making sure that the SUPPORTED_Pause and SUPPORTED_AsymPause bits are set in
+ MII_ADVERTISE to indicate towards the link partner that the Ethernet MAC
+ controller supports such a thing. Since flow control/pause frames generation
+ involves the Ethernet MAC driver, it is recommended that this driver takes care
+ of properly indicating advertisement and support for such features by setting
+ the SUPPORTED_Pause and SUPPORTED_AsymPause bits accordingly. This can be done
+ either before or after phy_connect() and/or as a result of implementing the
+ ethtool::set_pauseparam feature.
+
+
 Keeping Close Tabs on the PAL
 
  It is possible that the PAL's built-in state machine needs a little help to
-- 
2.9.3

^ permalink raw reply related

* [PATCH net-next v3 1/4] Documentation: net: phy: remove description of function pointers
From: Florian Fainelli @ 2016-11-28  2:45 UTC (permalink / raw)
  To: netdev
  Cc: davem, andrew, sf84, martin.blumenstingl, mans, alexandre.torgue,
	peppe.cavallaro, timur, jbrunet, Florian Fainelli
In-Reply-To: <20161128024515.13070-1-f.fainelli@gmail.com>

Remove the function pointers documentation which duplicates information
found in include/linux/phy.h. Maintaining documentation about two
different locations just does not work, but the code is less likely to
be outdated.

Reviewed-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
 Documentation/networking/phy.txt | 35 ++---------------------------------
 1 file changed, 2 insertions(+), 33 deletions(-)

diff --git a/Documentation/networking/phy.txt b/Documentation/networking/phy.txt
index 7ab9404a8412..4b25c0f24201 100644
--- a/Documentation/networking/phy.txt
+++ b/Documentation/networking/phy.txt
@@ -251,39 +251,8 @@ Writing a PHY driver
  PHY_BASIC_FEATURES, but you can look in include/mii.h for other
  features.
 
- Each driver consists of a number of function pointers:
-
-   soft_reset: perform a PHY software reset
-   config_init: configures PHY into a sane state after a reset.
-     For instance, a Davicom PHY requires descrambling disabled.
-   probe: Allocate phy->priv, optionally refuse to bind.
-   PHY may not have been reset or had fixups run yet.
-   suspend/resume: power management
-   config_aneg: Changes the speed/duplex/negotiation settings
-   aneg_done: Determines the auto-negotiation result
-   read_status: Reads the current speed/duplex/negotiation settings
-   ack_interrupt: Clear a pending interrupt
-   did_interrupt: Checks if the PHY generated an interrupt
-   config_intr: Enable or disable interrupts
-   remove: Does any driver take-down
-   ts_info: Queries about the HW timestamping status
-   match_phy_device: used for Clause 45 capable PHYs to match devices
-   in package and ensure they are compatible
-   hwtstamp: Set the PHY HW timestamping configuration
-   rxtstamp: Requests a receive timestamp at the PHY level for a 'skb'
-   txtsamp: Requests a transmit timestamp at the PHY level for a 'skb'
-   set_wol: Enable Wake-on-LAN at the PHY level
-   get_wol: Get the Wake-on-LAN status at the PHY level
-   link_change_notify: called to inform the core is about to change the
-   link state, can be used to work around bogus PHY between state changes
-   read_mmd_indirect: Read PHY MMD indirect register
-   write_mmd_indirect: Write PHY MMD indirect register
-   module_info: Get the size and type of an EEPROM contained in an plug-in
-   module
-   module_eeprom: Get EEPROM information of a plug-in module
-   get_sset_count: Get number of strings sets that get_strings will count
-   get_strings: Get strings from requested objects (statistics)
-   get_stats: Get the extended statistics from the PHY device
+ Each driver consists of a number of function pointers, documented
+ in include/linux/phy.h under the phy_driver structure.
 
  Of these, only config_aneg and read_status are required to be
  assigned by the driver code.  The rest are optional.  Also, it is
-- 
2.9.3

^ permalink raw reply related

* [PATCH net-next v3 0/4] Documentation: net: phy: Improve documentation
From: Florian Fainelli @ 2016-11-28  2:45 UTC (permalink / raw)
  To: netdev
  Cc: davem, andrew, sf84, martin.blumenstingl, mans, alexandre.torgue,
	peppe.cavallaro, timur, jbrunet, Florian Fainelli

Hi all,

This patch series addresses discussions and feedback that was recently received
on the mailing-list in the area of: flow control/pause frames, interpretation of
phy_interface_t and finally add some links to useful standards documents.

Changes in v3:

- add Timur's feedback into patch 3

Changes in v2:

- clarify a few things in the RGMII section, add a paragraph about common issues
  with RGMII delay mismatches

Florian Fainelli (4):
  Documentation: net: phy: remove description of function pointers
  Documentation: net: phy: Add a paragraph about pause frames/flow
    control
  Documentation: net: phy: Add blurb about RGMII
  Documentation: net: phy: Add links to several standards documents

 Documentation/networking/phy.txt | 140 +++++++++++++++++++++++++++++----------
 1 file changed, 105 insertions(+), 35 deletions(-)

-- 
2.9.3

^ permalink raw reply

* Re: [net,v2] neigh: fix the loop index error in neigh dump
From: David Ahern @ 2016-11-28  2:39 UTC (permalink / raw)
  To: 张胜举, netdev
In-Reply-To: <001c01d2491f$fcd53250$f67f96f0$@cmss.chinamobile.com>

On 11/27/16 7:34 PM, 张胜举 wrote:
>> -----Original Message-----
>> From: David Ahern [mailto:dsa@cumulusnetworks.com]
>> Sent: Monday, November 28, 2016 10:10 AM
>> To: Zhang Shengju <zhangshengju@cmss.chinamobile.com>;
>> netdev@vger.kernel.org
>> Subject: Re: [net,v2] neigh: fix the loop index error in neigh dump
>>
>> On 11/27/16 6:32 PM, Zhang Shengju wrote:
>>> Loop index in neigh dump function is not updated correctly under some
>>> circumstances, this patch will fix it.
>>
>> What's an example?
> 
> If dev is filtered out, the original code goes to next loop without updating
> loop index 'idx'.

And you have a use case with missing or redundant data? Or is your comment based on a review of code only?


>> You are completely rewriting the dump loops.
> 
> I put 'idx++' into for loop,  so I replace 'goto' with 'continue'.  The
> other change is style related. 

A "fixes" should not include 'style related' changes.

^ permalink raw reply

* RE: [net,v2] neigh: fix the loop index error in neigh dump
From: 张胜举 @ 2016-11-28  2:34 UTC (permalink / raw)
  To: 'David Ahern', netdev
In-Reply-To: <caa11887-00b9-d2f1-7c0b-5b5096b42f56@cumulusnetworks.com>

> -----Original Message-----
> From: David Ahern [mailto:dsa@cumulusnetworks.com]
> Sent: Monday, November 28, 2016 10:10 AM
> To: Zhang Shengju <zhangshengju@cmss.chinamobile.com>;
> netdev@vger.kernel.org
> Subject: Re: [net,v2] neigh: fix the loop index error in neigh dump
> 
> On 11/27/16 6:32 PM, Zhang Shengju wrote:
> > Loop index in neigh dump function is not updated correctly under some
> > circumstances, this patch will fix it.
> 
> What's an example?

If dev is filtered out, the original code goes to next loop without updating
loop index 'idx'.

> 
> >
> > Fixes: 16660f0bd9 ("net: Add support for filtering neigh dump by
> > device index")
> > Fixes: 21fdd092ac ("net: Add support for filtering neigh dump by
> > master device")
> >
> > Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
> > ---
> >  net/core/neighbour.c | 39 ++++++++++++++++++---------------------
> >  1 file changed, 18 insertions(+), 21 deletions(-)
> >
> > diff --git a/net/core/neighbour.c b/net/core/neighbour.c index
> > 2ae929f..ce32e9c 100644
> > --- a/net/core/neighbour.c
> > +++ b/net/core/neighbour.c
> > @@ -2256,6 +2256,16 @@ static bool neigh_ifindex_filtered(struct
> net_device *dev, int filter_idx)
> >  	return false;
> >  }
> >
> > +static bool neigh_dump_filtered(struct net_device *dev, int filter_idx,
> > +		int filter_master_idx)
> > +{
> > +	if (neigh_ifindex_filtered(dev, filter_idx) ||
> > +	    neigh_master_filtered(dev, filter_master_idx))
> > +		return true;
> > +
> > +	return false;
> > +}
> > +
> >  static int neigh_dump_table(struct neigh_table *tbl, struct sk_buff
*skb,
> >  			    struct netlink_callback *cb)
> >  {
> > @@ -2285,20 +2295,15 @@ static int neigh_dump_table(struct neigh_table
> *tbl, struct sk_buff *skb,
> >  	rcu_read_lock_bh();
> >  	nht = rcu_dereference_bh(tbl->nht);
> >
> > -	for (h = s_h; h < (1 << nht->hash_shift); h++) {
> > -		if (h > s_h)
> > -			s_idx = 0;
> > +	for (h = s_h; h < (1 << nht->hash_shift); h++, s_idx = 0) {
> >  		for (n = rcu_dereference_bh(nht->hash_buckets[h]), idx = 0;
> >  		     n != NULL;
> > -		     n = rcu_dereference_bh(n->next)) {
> > -			if (!net_eq(dev_net(n->dev), net))
> > -				continue;
> > -			if (neigh_ifindex_filtered(n->dev, filter_idx))
> > +		     n = rcu_dereference_bh(n->next), idx++) {
> > +			if (idx < s_idx || !net_eq(dev_net(n->dev), net))
> >  				continue;
> > -			if (neigh_master_filtered(n->dev,
filter_master_idx))
> > +			if (neigh_dump_filtered(n->dev, filter_idx,
> > +						filter_master_idx))
> >  				continue;
> > -			if (idx < s_idx)
> > -				goto next;
> >  			if (neigh_fill_info(skb, n, NETLINK_CB(cb-
> >skb).portid,
> >  					    cb->nlh->nlmsg_seq,
> >  					    RTM_NEWNEIGH,
> > @@ -2306,8 +2311,6 @@ static int neigh_dump_table(struct neigh_table
> *tbl, struct sk_buff *skb,
> >  				rc = -1;
> >  				goto out;
> >  			}
> > -next:
> > -			idx++;
> >  		}
> >  	}
> >  	rc = skb->len;
> > @@ -2328,14 +2331,10 @@ static int pneigh_dump_table(struct
> > neigh_table *tbl, struct sk_buff *skb,
> >
> >  	read_lock_bh(&tbl->lock);
> >
> > -	for (h = s_h; h <= PNEIGH_HASHMASK; h++) {
> > -		if (h > s_h)
> > -			s_idx = 0;
> > -		for (n = tbl->phash_buckets[h], idx = 0; n; n = n->next) {
> > -			if (pneigh_net(n) != net)
> > +	for (h = s_h; h <= PNEIGH_HASHMASK; h++, s_idx = 0) {
> > +		for (n = tbl->phash_buckets[h], idx = 0; n; n = n->next,
idx++)
> {
> > +			if (idx < s_idx || pneigh_net(n) != net)
> >  				continue;
> > -			if (idx < s_idx)
> > -				goto next;
> >  			if (pneigh_fill_info(skb, n, NETLINK_CB(cb-
> >skb).portid,
> >  					    cb->nlh->nlmsg_seq,
> >  					    RTM_NEWNEIGH,
> > @@ -2344,8 +2343,6 @@ static int pneigh_dump_table(struct neigh_table
> *tbl, struct sk_buff *skb,
> >  				rc = -1;
> >  				goto out;
> >  			}
> > -		next:
> > -			idx++;
> >  		}
> >  	}
> >
> 
> This fix is way to be complicated to be fixing anything related to
16660f0bd9
> or 21fdd092ac. Both of those commits added a continue:
> 
>                         if (neigh_ifindex_filtered(n->dev, filter_idx))
>                                 continue;
>                         if (neigh_master_filtered(n->dev,
filter_master_idx))
>                                 continue;
> 
> At best the continue is replaced by 'goto next;' and I am not convinced
that is
> right.
> 
> You are completely rewriting the dump loops.

I put 'idx++' into for loop,  so I replace 'goto' with 'continue'.  The
other change is style related. 

Thanks,
Zhang Shengju

^ permalink raw reply

* Re: [Patch net-next] net_sched: move the empty tp check from ->destroy() to ->delete()
From: John Fastabend @ 2016-11-28  2:26 UTC (permalink / raw)
  To: Roi Dayan, Daniel Borkmann, Cong Wang
  Cc: Linux Kernel Network Developers, Jiri Pirko
In-Reply-To: <583A7D67.50003@mellanox.com>

On 16-11-26 10:29 PM, Roi Dayan wrote:
> 
> 
> On 27/11/2016 06:47, Roi Dayan wrote:
>>
>>
>> On 27/11/2016 02:33, Daniel Borkmann wrote:
>>> On 11/26/2016 12:09 PM, Daniel Borkmann wrote:
>>>> On 11/26/2016 07:46 AM, Cong Wang wrote:
>>>>> On Thu, Nov 24, 2016 at 7:20 AM, Daniel Borkmann
>>>>> <daniel@iogearbox.net> wrote:
>>> [...]
>>>>>> Ok, strange, qdisc_destroy() calls into ops->destroy(), where ingress
>>>>>> drops its entire chain via tcf_destroy_chain(), so that will be NULL
>>>>>> eventually. The tps are freed by call_rcu() as well as qdisc itself
>>>>>> later on via qdisc_rcu_free(), where it frees per-cpu bstats as well.
>>>>>> Outstanding readers should either bail out due to if (!cl) or can
>>>>>> still
>>>>>> process the chain until read section ends, but during that time,
>>>>>> cl->q
>>>>>> resp. bstats should be good. Do you happen to know what's at address
>>>>>> ffff880a68b04028? I was wondering wrt call_rcu() vs call_rcu_bh(),
>>>>>> but
>>>>>> at least on ingress (netif_receive_skb_internal()) we hold
>>>>>> rcu_read_lock()
>>>>>> here. The KASAN report is reliably happening at this location, right?
>>>>>
>>>>> I am confused as well, I don't see how it could be related to my
>>>>> patch yet.
>>>>> I will take a deep look in the weekend.
>>
>>
>>
>> Hi Cong,
>>
>> When reported the new trace I didn't mean it's related to your patch,
>> I just wanted to point it out it exposed something. I should have been
>> clear about it.
>>
>>
>>>>
>>>> Ok, I'm currently on the run. Got too late yesterday night, but I'll
>>>> write what I found in the evening today, not related to ingress though.
>>>
>>> Just pushed out my analysis to netdev under "[PATCH net] net, sched:
>>> respect
>>> rcu grace period on cls destruction". My conclusion is that both
>>> issues are
>>> actually separate, and that one is small enough where we could route
>>> it via
>>> net actually. Perhaps this at the same time shrinks your "[PATCH
>>> net-next]
>>> net_sched: move the empty tp check from ->destroy() to ->delete()" to a
>>> reasonable size that it's suitable to net as well. Your
>>> ->delete()/->destroy()
>>> one is definitely needed, too. The tp->root one is independant of
>>> ->delete()/
>>> ->destroy() as they are different races and tp->root could also
>>> happen when
>>> you just destroy the whole tp directly. I think that seems like a
>>> good path
>>> forward to me.
>>>
>>> Thanks,
>>> Daniel
>>
>>
>>
>> Hi Daniel,
>>
>> As for the tainted kernel. I was in old (week or two) net-next tree
>> and only cherry-picked from latest net-next related patches to
>> Mellanox HCA, cls_api, cls_flower, devlink. so those are the tainted
>> modules.
>> I have the issue reproducing in that tree so wanted it to check it
>> with Cong's patch instead of latest net-next.
>> I'll try running reproducing the issue with your new patch and later
>> try latest net-next as well.
>>
>> Thanks,
>> Roi
>>
> 
> Hi,
> 
> I tested "[PATCH net] net, sched: respect rcu grace period on cls
> destruction" and could not reproduce my original issue.

Hi Roi,

Just so I'm 100% clear. No issue with just the above "respect rcu grace
period on cls destruction" per above statement.

> I rebased "[Patch net-next] net_sched: move the empty tp check from
> ->destroy() to ->delete()" over to test it in the same tree and got into
> a new trace in fl_delete.

In this case did you test with "net_sched: move the empty tp check from
->destroy() to ->delete()" _only_ or did this include both patches when
you see the error below.

>From my inspection we really need both patches to get correct behavior.

Thanks!
John

> 
> [35659.012123] BUG: KASAN: wild-memory-access on address 1ffffffff803ca31
> [35659.020042] Write of size 1 by task ovs-vswitchd/20135
> [35659.025878] CPU: 19 PID: 20135 Comm: ovs-vswitchd Tainted:
> G           O    4.9.0-rc3+ #18
> [35659.035948] Hardware name: HP ProLiant DL380p Gen8, BIOS P70 07/01/2015
> [35659.043730] Call Trace:
> [35659.046619]  [<ffffffff95b6dc42>] dump_stack+0x63/0x81
> [35659.052456]  [<ffffffff955fbbf8>] kasan_report_error+0x408/0x4e0
> [35659.059402]  [<ffffffff955fc2e8>] kasan_report+0x58/0x60
> [35659.065428]  [<ffffffff952d5e8d>] ? call_rcu_sched+0x1d/0x20
> [35659.072119]  [<ffffffffc01e0701>] ? fl_destroy_filter+0x21/0x30
> [cls_flower]
> [35659.080217]  [<ffffffffc01e1ccf>] ? fl_delete+0x1df/0x2e0 [cls_flower]
> [35659.087580]  [<ffffffff955fa4ca>] __asan_store1+0x4a/0x50
> [35659.093697]  [<ffffffffc01e1ccf>] fl_delete+0x1df/0x2e0 [cls_flower]
> [35659.100870]  [<ffffffff9653ecba>] tc_ctl_tfilter+0x10da/0x1b90
> 
> 
> 0x1d02 is in fl_delete (net/sched/cls_flower.c:805).
> 800             struct cls_fl_filter *f = (struct cls_fl_filter *) arg;
> 801
> 802             rhashtable_remove_fast(&head->ht, &f->ht_node,
> 803                                    head->ht_params);
> 804             __fl_delete(tp, f);
> 805             *last = list_empty(&head->filters);
> 806             return 0;
> 807     }
> 
> 
> Thanks,
> Roi

^ permalink raw reply

* Re: [PATCH net v2 0/5] net: fix phydev reference leaks
From: Timur Tabi @ 2016-11-28  2:11 UTC (permalink / raw)
  To: David Miller, johan
  Cc: f.fainelli, madalin.bucur, andrew, vivien.didelot, netdev,
	linux-kernel
In-Reply-To: <20161127.200238.2048371945400616113.davem@davemloft.net>

David Miller wrote:
> Series applied, thanks.

I was really hoping you'd give me the chance to test the patches before 
applying them.

-- 
Sent by an employee of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the
Code Aurora Forum, hosted by The Linux Foundation.

^ permalink raw reply

* Re: [net,v2] neigh: fix the loop index error in neigh dump
From: David Ahern @ 2016-11-28  2:09 UTC (permalink / raw)
  To: Zhang Shengju, netdev
In-Reply-To: <1480296725-5563-1-git-send-email-zhangshengju@cmss.chinamobile.com>

On 11/27/16 6:32 PM, Zhang Shengju wrote:
> Loop index in neigh dump function is not updated correctly under some
> circumstances, this patch will fix it.

What's an example?

> 
> Fixes: 16660f0bd9 ("net: Add support for filtering neigh dump by device index")
> Fixes: 21fdd092ac ("net: Add support for filtering neigh dump by master device")
> 
> Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
> ---
>  net/core/neighbour.c | 39 ++++++++++++++++++---------------------
>  1 file changed, 18 insertions(+), 21 deletions(-)
> 
> diff --git a/net/core/neighbour.c b/net/core/neighbour.c
> index 2ae929f..ce32e9c 100644
> --- a/net/core/neighbour.c
> +++ b/net/core/neighbour.c
> @@ -2256,6 +2256,16 @@ static bool neigh_ifindex_filtered(struct net_device *dev, int filter_idx)
>  	return false;
>  }
>  
> +static bool neigh_dump_filtered(struct net_device *dev, int filter_idx,
> +		int filter_master_idx)
> +{
> +	if (neigh_ifindex_filtered(dev, filter_idx) ||
> +	    neigh_master_filtered(dev, filter_master_idx))
> +		return true;
> +
> +	return false;
> +}
> +
>  static int neigh_dump_table(struct neigh_table *tbl, struct sk_buff *skb,
>  			    struct netlink_callback *cb)
>  {
> @@ -2285,20 +2295,15 @@ static int neigh_dump_table(struct neigh_table *tbl, struct sk_buff *skb,
>  	rcu_read_lock_bh();
>  	nht = rcu_dereference_bh(tbl->nht);
>  
> -	for (h = s_h; h < (1 << nht->hash_shift); h++) {
> -		if (h > s_h)
> -			s_idx = 0;
> +	for (h = s_h; h < (1 << nht->hash_shift); h++, s_idx = 0) {
>  		for (n = rcu_dereference_bh(nht->hash_buckets[h]), idx = 0;
>  		     n != NULL;
> -		     n = rcu_dereference_bh(n->next)) {
> -			if (!net_eq(dev_net(n->dev), net))
> -				continue;
> -			if (neigh_ifindex_filtered(n->dev, filter_idx))
> +		     n = rcu_dereference_bh(n->next), idx++) {
> +			if (idx < s_idx || !net_eq(dev_net(n->dev), net))
>  				continue;
> -			if (neigh_master_filtered(n->dev, filter_master_idx))
> +			if (neigh_dump_filtered(n->dev, filter_idx,
> +						filter_master_idx))
>  				continue;
> -			if (idx < s_idx)
> -				goto next;
>  			if (neigh_fill_info(skb, n, NETLINK_CB(cb->skb).portid,
>  					    cb->nlh->nlmsg_seq,
>  					    RTM_NEWNEIGH,
> @@ -2306,8 +2311,6 @@ static int neigh_dump_table(struct neigh_table *tbl, struct sk_buff *skb,
>  				rc = -1;
>  				goto out;
>  			}
> -next:
> -			idx++;
>  		}
>  	}
>  	rc = skb->len;
> @@ -2328,14 +2331,10 @@ static int pneigh_dump_table(struct neigh_table *tbl, struct sk_buff *skb,
>  
>  	read_lock_bh(&tbl->lock);
>  
> -	for (h = s_h; h <= PNEIGH_HASHMASK; h++) {
> -		if (h > s_h)
> -			s_idx = 0;
> -		for (n = tbl->phash_buckets[h], idx = 0; n; n = n->next) {
> -			if (pneigh_net(n) != net)
> +	for (h = s_h; h <= PNEIGH_HASHMASK; h++, s_idx = 0) {
> +		for (n = tbl->phash_buckets[h], idx = 0; n; n = n->next, idx++) {
> +			if (idx < s_idx || pneigh_net(n) != net)
>  				continue;
> -			if (idx < s_idx)
> -				goto next;
>  			if (pneigh_fill_info(skb, n, NETLINK_CB(cb->skb).portid,
>  					    cb->nlh->nlmsg_seq,
>  					    RTM_NEWNEIGH,
> @@ -2344,8 +2343,6 @@ static int pneigh_dump_table(struct neigh_table *tbl, struct sk_buff *skb,
>  				rc = -1;
>  				goto out;
>  			}
> -		next:
> -			idx++;
>  		}
>  	}
>  

This fix is way to be complicated to be fixing anything related to 16660f0bd9 or 21fdd092ac. Both of those commits added a continue:

                        if (neigh_ifindex_filtered(n->dev, filter_idx))
                                continue;
                        if (neigh_master_filtered(n->dev, filter_master_idx))
                                continue;

At best the continue is replaced by 'goto next;' and I am not convinced that is right.

You are completely rewriting the dump loops.

^ permalink raw reply

* Re: [PATCH net-next 0/6] BPF cleanups and misc updates
From: David Miller @ 2016-11-28  1:39 UTC (permalink / raw)
  To: daniel; +Cc: alexei.starovoitov, netdev
In-Reply-To: <cover.1480119395.git.daniel@iogearbox.net>

From: Daniel Borkmann <daniel@iogearbox.net>
Date: Sat, 26 Nov 2016 01:28:03 +0100

> This patch set adds couple of cleanups in first few patches,
> exposes owner_prog_type for array maps as well as mlocked mem
> for maps in fdinfo, allows for mount permissions in fs and
> fixes various outstanding issues in selftests and samples.

Series applied, thanks Daniel.

^ permalink raw reply

* Re: [PATCH net 1/1] tipc: fix link statistics counter errors
From: David Miller @ 2016-11-28  1:37 UTC (permalink / raw)
  To: jon.maloy
  Cc: netdev, parthasarathy.bhuvaragan, ying.xue, maloy,
	tipc-discussion
In-Reply-To: <1480088102-31516-1-git-send-email-jon.maloy@ericsson.com>

From: Jon Maloy <jon.maloy@ericsson.com>
Date: Fri, 25 Nov 2016 10:35:02 -0500

> In commit e4bf4f76962b ("tipc: simplify packet sequence number
> handling") we changed the internal representation of the packet
> sequence number counters from u32 to u16, reflecting what is really
> sent over the wire.
> 
> Since then some link statistics counters have been displaying incorrect
> values, partially because the counters meant to be used as sequence
> number snapshots are now used as direct counters, stored as u32, and
> partially because some counter updates are just missing in the code.
> 
> In this commit we correct this in two ways. First, we base the
> displayed packet sent/received values on direct counters instead
> of as previously a calculated difference between current sequence
> number and a snapshot. Second, we add the missing updates of the
> counters.
> 
> This change is compatible with the current netlink API, and requires
> no changes to the user space tools.
> 
> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>

Applied.

^ permalink raw reply

* Re: [PATCH v2 0/7] stmmac: dwmac-meson8b: configurable RGMII TX delay
From: David Miller @ 2016-11-28  1:33 UTC (permalink / raw)
  To: martin.blumenstingl-gM/Ye1E23mwN+BqQ9rBEUg
  Cc: linux-amlogic-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	devicetree-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	khilman-rdvid1DuHRBWk0Htik3J/w, mark.rutland-5wv7dgnIgG8,
	robh+dt-DgEjT+Ai2ygdnm+yROfE0A,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	alexandre.torgue-qxv4g6HH51o, peppe.cavallaro-qxv4g6HH51o,
	will.deacon-5wv7dgnIgG8, catalin.marinas-5wv7dgnIgG8,
	carlo-KA+7E9HrN00dnm+yROfE0A, f.fainelli-Re5JQEeQqe8AvxtiuMwx3w
In-Reply-To: <20161125130156.17879-1-martin.blumenstingl-gM/Ye1E23mwN+BqQ9rBEUg@public.gmane.org>

From: Martin Blumenstingl <martin.blumenstingl-gM/Ye1E23mwN+BqQ9rBEUg@public.gmane.org>
Date: Fri, 25 Nov 2016 14:01:49 +0100

> Currently the dwmac-meson8b stmmac glue driver uses a hardcoded 1/4
> cycle TX clock delay. This seems to work fine for many boards (for
> example Odroid-C2 or Amlogic's reference boards) but there are some
> others where TX traffic is simply broken.
> There are probably multiple reasons why it's working on some boards
> while it's broken on others:
> - some of Amlogic's reference boards are using a Micrel PHY
> - hardware circuit design
> - maybe more...

The ARM arch file changes do not apply cleanly to net-next, you probably
want to merge them via the ARM tree instead of mine, and respin this series
to be without the .dts file changes.
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [net,v2] neigh: fix the loop index error in neigh dump
From: Zhang Shengju @ 2016-11-28  1:32 UTC (permalink / raw)
  To: netdev, dsa

Loop index in neigh dump function is not updated correctly under some
circumstances, this patch will fix it.

Fixes: 16660f0bd9 ("net: Add support for filtering neigh dump by device index")
Fixes: 21fdd092ac ("net: Add support for filtering neigh dump by master device")

Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
---
 net/core/neighbour.c | 39 ++++++++++++++++++---------------------
 1 file changed, 18 insertions(+), 21 deletions(-)

diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 2ae929f..ce32e9c 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -2256,6 +2256,16 @@ static bool neigh_ifindex_filtered(struct net_device *dev, int filter_idx)
 	return false;
 }
 
+static bool neigh_dump_filtered(struct net_device *dev, int filter_idx,
+		int filter_master_idx)
+{
+	if (neigh_ifindex_filtered(dev, filter_idx) ||
+	    neigh_master_filtered(dev, filter_master_idx))
+		return true;
+
+	return false;
+}
+
 static int neigh_dump_table(struct neigh_table *tbl, struct sk_buff *skb,
 			    struct netlink_callback *cb)
 {
@@ -2285,20 +2295,15 @@ static int neigh_dump_table(struct neigh_table *tbl, struct sk_buff *skb,
 	rcu_read_lock_bh();
 	nht = rcu_dereference_bh(tbl->nht);
 
-	for (h = s_h; h < (1 << nht->hash_shift); h++) {
-		if (h > s_h)
-			s_idx = 0;
+	for (h = s_h; h < (1 << nht->hash_shift); h++, s_idx = 0) {
 		for (n = rcu_dereference_bh(nht->hash_buckets[h]), idx = 0;
 		     n != NULL;
-		     n = rcu_dereference_bh(n->next)) {
-			if (!net_eq(dev_net(n->dev), net))
-				continue;
-			if (neigh_ifindex_filtered(n->dev, filter_idx))
+		     n = rcu_dereference_bh(n->next), idx++) {
+			if (idx < s_idx || !net_eq(dev_net(n->dev), net))
 				continue;
-			if (neigh_master_filtered(n->dev, filter_master_idx))
+			if (neigh_dump_filtered(n->dev, filter_idx,
+						filter_master_idx))
 				continue;
-			if (idx < s_idx)
-				goto next;
 			if (neigh_fill_info(skb, n, NETLINK_CB(cb->skb).portid,
 					    cb->nlh->nlmsg_seq,
 					    RTM_NEWNEIGH,
@@ -2306,8 +2311,6 @@ static int neigh_dump_table(struct neigh_table *tbl, struct sk_buff *skb,
 				rc = -1;
 				goto out;
 			}
-next:
-			idx++;
 		}
 	}
 	rc = skb->len;
@@ -2328,14 +2331,10 @@ static int pneigh_dump_table(struct neigh_table *tbl, struct sk_buff *skb,
 
 	read_lock_bh(&tbl->lock);
 
-	for (h = s_h; h <= PNEIGH_HASHMASK; h++) {
-		if (h > s_h)
-			s_idx = 0;
-		for (n = tbl->phash_buckets[h], idx = 0; n; n = n->next) {
-			if (pneigh_net(n) != net)
+	for (h = s_h; h <= PNEIGH_HASHMASK; h++, s_idx = 0) {
+		for (n = tbl->phash_buckets[h], idx = 0; n; n = n->next, idx++) {
+			if (idx < s_idx || pneigh_net(n) != net)
 				continue;
-			if (idx < s_idx)
-				goto next;
 			if (pneigh_fill_info(skb, n, NETLINK_CB(cb->skb).portid,
 					    cb->nlh->nlmsg_seq,
 					    RTM_NEWNEIGH,
@@ -2344,8 +2343,6 @@ static int pneigh_dump_table(struct neigh_table *tbl, struct sk_buff *skb,
 				rc = -1;
 				goto out;
 			}
-		next:
-			idx++;
 		}
 	}
 
-- 
1.8.3.1

^ permalink raw reply related

* Re: [patch v2 net-next] sfc: remove unneeded variable
From: David Miller @ 2016-11-28  1:30 UTC (permalink / raw)
  To: dan.carpenter; +Cc: linux-net-drivers, ecree, bkenward, netdev, kernel-janitors
In-Reply-To: <20161125104304.GA5938@mwanda>

From: Dan Carpenter <dan.carpenter@oracle.com>
Date: Fri, 25 Nov 2016 13:43:04 +0300

> We don't use ->heap_buf after commit 46d1efd852cc ("sfc: remove Software
> TSO") so let's remove the last traces.
> 
> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>

Applied, thanks Dan.

^ permalink raw reply

* Re: [PATCH] net: fec: turn on device when extracting statistics
From: David Miller @ 2016-11-28  1:29 UTC (permalink / raw)
  To: nikita.yoush
  Cc: fugang.duan, troy.kisky, andrew, eric, tremyfr, johannes, netdev,
	linux-kernel, cphealy
In-Reply-To: <1480068120-22137-1-git-send-email-nikita.yoush@cogentembedded.com>

From: Nikita Yushchenko <nikita.yoush@cogentembedded.com>
Date: Fri, 25 Nov 2016 13:02:00 +0300

> +	int i, ret;
> +
> +	ret = pm_runtime_get_sync(&fep->pdev->dev);
> +	if (IS_ERR_VALUE(ret)) {
> +		memset(data, 0, sizeof(*data) * ARRAY_SIZE(fec_stats));
> +		return;
> +	}

This really isn't the way to do this.

When the device is suspended and the clocks are going to be stopped,
you must fetch the statistic values into a software copy and provide
those if the device is suspended when statistics are requested.

^ permalink raw reply

* Re: pull-request: wireless-drivers-next 2016-11-25
From: David Miller @ 2016-11-28  1:27 UTC (permalink / raw)
  To: kvalo-sgV2jX0FEOL9JmXXK+q4OQ
  Cc: linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <8760nbaoey.fsf-HodKDYzPHsUD5k0oWYwrnHL1okKdlPRT@public.gmane.org>

From: Kalle Valo <kvalo-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org>
Date: Fri, 25 Nov 2016 11:39:49 +0200

> here's a pull request for 4.10. ath9k has now been converted to use
> mac80211 intermediate software queues to fix bufferbloat problems. rsi
> has become active again and latevy mwifiex has been getting a _lot_ of
> love.
> 
> I'm not expecting to see any problems with this pull request. When you
> pull git will do lots of automerging but at least I didn't see any
> conflicts. Please let me know if you have any problems.

Pulled, thanks Kalle.

^ permalink raw reply

* Re: [PATCH 1/1] net: macb: fix the RX queue reset in macb_rx()
From: David Miller @ 2016-11-28  1:25 UTC (permalink / raw)
  To: cyrille.pitchen
  Cc: nicolas.ferre, netdev, soren.brinkmann, Andrei.Pistirica,
	linux-arm-kernel, linux-kernel
In-Reply-To: <fd8131cbe6c0546b2b8ee35bcaac5e7eb1a1647f.1480063339.git.cyrille.pitchen@atmel.com>

From: Cyrille Pitchen <cyrille.pitchen@atmel.com>
Date: Fri, 25 Nov 2016 09:49:32 +0100

> On macb only (not gem), when a RX queue corruption was detected from
> macb_rx(), the RX queue was reset: during this process the RX ring
> buffer descriptor was initialized by macb_init_rx_ring() but we forgot
> to also set bp->rx_tail to 0.
> 
> Indeed, when processing the received frames, bp->rx_tail provides the
> macb driver with the index in the RX ring buffer of the next buffer to
> process. So when the whole ring buffer is reset we must also reset
> bp->rx_tail so the driver is synchronized again with the hardware.
> 
> Since macb_init_rx_ring() is called from many locations, currently from
> macb_rx() and macb_init_rings(), we'd rather add the "bp->rx_tail = 0;"
> line inside macb_init_rx_ring() than add the very same line after each
> call of this function.
> 
> Without this fix, the rx queue is not reset properly to recover from
> queue corruption and connection drop may occur.
> 
> Signed-off-by: Cyrille Pitchen <cyrille.pitchen@atmel.com>
> Fixes: 9ba723b081a2 ("net: macb: remove BUG_ON() and reset the queue to handle RX errors")

This doesn't apply cleanly to the 'net' tree, where
RX_RING_SIZE is used instead of bp->rx_ring_size. It seems
you generated this against net-next, however you didn't say
that either in your Subject line nor the commit message.

As a bug fix this should be targetted at 'net'.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox