Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net-next V3 2/2] rtnl: Add support for netdev event attribute to link messages
From: Roopa Prabhu @ 2017-04-24 15:14 UTC (permalink / raw)
  To: David Ahern
  Cc: Vladislav Yasevich, netdev@vger.kernel.org, Vladislav Yasevich,
	Jiri Pirko
In-Reply-To: <877efb54-2aef-4d1e-c0b4-2ce6aa6562df@cumulusnetworks.com>

On Sun, Apr 23, 2017 at 6:07 PM, David Ahern <dsa@cumulusnetworks.com> wrote:
>
> On 4/21/17 11:31 AM, Vladislav Yasevich wrote:
> > @@ -1276,9 +1277,40 @@ static int rtnl_xdp_fill(struct sk_buff *skb, struct net_device *dev)
> >       return err;
> >  }
> >
> > +static int rtnl_fill_link_event(struct sk_buff *skb, unsigned long event)
> > +{
> > +     u32 rtnl_event;
> > +
> > +     switch (event) {
> > +     case NETDEV_REBOOT:
> > +             rtnl_event = IFLA_EVENT_REBOOT;
> > +             break;
> > +     case NETDEV_FEAT_CHANGE:
> > +             rtnl_event = IFLA_EVENT_FEAT_CHANGE;
> > +             break;
> > +     case NETDEV_BONDING_FAILOVER:
> > +             rtnl_event = IFLA_EVENT_BONDING_FAILOVER;
> > +             break;
> > +     case NETDEV_NOTIFY_PEERS:
> > +             rtnl_event = IFLA_EVENT_NOTIFY_PEERS;
> > +             break;
> > +     case NETDEV_RESEND_IGMP:
> > +             rtnl_event = IFLA_EVENT_RESEND_IGMP;
> > +             break;
> > +     case NETDEV_CHANGEINFODATA:
> > +             rtnl_event = IFLA_EVENT_CHANGE_INFO_DATA;
> > +             break;
> > +     default:
> > +             return 0;
> > +     }
> > +
> > +     return nla_put_u32(skb, IFLA_EVENT, rtnl_event);
> > +}
> > +
>
> I still have doubts about encoding kernel events into a uapi.

agree. I don't see why user-space will need NETDEV_CHANGEINFODATA and
others david listed.

My other concerns are, once we have this exposed to user-space and
user-space starts relying on it, it will need accurate information and
will expect to have this event information all the time.
IIUC, we cannot cover multiple events in a single notification and not
all link notifications will contain an IFLA_EVENT attribute. In other
words, we will be telling user-space to not expect that the kernel
will send IFLA_EVENT every time.



>
> For example, NETDEV_CHANGEINFODATA is only for bonds though nothing
> about the name suggests it is a bonding notification. This one was added
> specifically to notify userspace (d4261e5650004), yet seems to happen
> only during a changelink and that already generates a RTM_NEWLINK
> message via do_setlink. Since the rtnetlink_event message does not
> contain anything "NETDEV_CHANGEINFODATA" related what purpose does it
> really serve besides duplicating netlink messages to userspace.
>
> The REBOOT, IGMP, FEAT_CHANGE and BONDING_FAILOVER seem to be unique
> messages (code analysis only) which I get for notifying userspace.
>
> NETDEV_NOTIFY_PEERS is not so clear in how often it duplicates other
> messages.

^ permalink raw reply

* Re: macvlan: Fix device ref leak when purging bc_queue
From: David Miller @ 2017-04-24 15:10 UTC (permalink / raw)
  To: Joe.Ghalam; +Cc: herbert, Clifford.Wichmann, netdev
In-Reply-To: <1493046084156.78287@Dell.com>

From: <Joe.Ghalam@dell.com>
Date: Mon, 24 Apr 2017 15:01:24 +0000

>> The only thing that can stop macvlan_process_broadcast from getting
>> called is macvlan_port_destroy.  Nothing else can stop the work
>> queue, unless of course the work queue mechanism itself is broken.
> 
>> So if you're sure macvlan_port_destroy is never even called in
>> your case, then you'll need to start debugging the kernel work
>> queue mechanism to see why macvlan_process_broadcast is not getting
>> called.
> 
> I will get your changes reloaded and re-tested without any other debug tools. Hopefully, we'll see success. I will let you know if I see any issues. 
> Btw, is your fix committed already? if not, do you know when and where it would be committed?

I'm waiting for this discussion to settle down before I apply the
patch.

^ permalink raw reply

* [PATCH v3 net] net: ipv6: regenerate host route if moved to gc list
From: David Ahern @ 2017-04-24 15:09 UTC (permalink / raw)
  To: netdev; +Cc: dvyukov, andreyknvl, mmanning, kafai, David Ahern

Taking down the loopback device wreaks havoc on IPv6 routing. By
extension, taking down a VRF device wreaks havoc on its table.

Dmitry and Andrey both reported heap out-of-bounds reports in the IPv6
FIB code while running syzkaller fuzzer. The root cause is a dead dst
that is on the garbage list gets reinserted into the IPv6 FIB. While on
the gc (or perhaps when it gets added to the gc list) the dst->next is
set to an IPv4 dst. A subsequent walk of the ipv6 tables causes the
out-of-bounds access.

Andrey's reproducer was the key to getting to the bottom of this.

With IPv6, host routes for an address have the dst->dev set to the
loopback device. When the 'lo' device is taken down, rt6_ifdown initiates
a walk of the fib evicting routes with the 'lo' device which means all
host routes are removed. That process moves the dst which is attached to
an inet6_ifaddr to the gc list and marks it as dead.

The recent change to keep global IPv6 addresses added a new function,
fixup_permanent_addr, that is called on admin up. That function restarts
dad for an inet6_ifaddr and when it completes the host route attached
to it is inserted into the fib. Since the route was marked dead and
moved to the gc list, re-inserting the route causes the reported
out-of-bounds accesses. If the device with the address is taken down
or the address is removed, the WARN_ON in fib6_del is triggered.

All of those faults are fixed by regenerating the host route if the
existing one has been moved to the gc list, something that can be
determined by checking if the rt6i_ref counter is 0.

Fixes: f1705ec197e7 ("net: ipv6: Make address flushing on ifdown optional")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
v3
- removed 'if (prev)' and just call ip6_rt_put; added comment about spinlock

v2
- change ifp->rt under spinlock vs cmpxchg
- add comment about rt6i_ref == 0

 net/ipv6/addrconf.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 80ce478c4851..93f81d9cd85f 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -3271,14 +3271,25 @@ static void addrconf_gre_config(struct net_device *dev)
 static int fixup_permanent_addr(struct inet6_dev *idev,
 				struct inet6_ifaddr *ifp)
 {
-	if (!ifp->rt) {
-		struct rt6_info *rt;
+	/* rt6i_ref == 0 means the host route was removed from the
+	 * FIB, for example, if 'lo' device is taken down. In that
+	 * case regenerate the host route.
+	 */
+	if (!ifp->rt || !atomic_read(&ifp->rt->rt6i_ref)) {
+		struct rt6_info *rt, *prev;

 		rt = addrconf_dst_alloc(idev, &ifp->addr, false);
 		if (unlikely(IS_ERR(rt)))
 			return PTR_ERR(rt);

+		prev = ifp->rt;
+
+		/* ifp->rt can be accessed outside of rtnl */
+		spin_lock(&ifp->lock);
 		ifp->rt = rt;
+		spin_unlock(&ifp->lock);
+
+		ip6_rt_put(prev);
 	}

 	if (!(ifp->flags & IFA_F_NOPREFIXROUTE)) {
-- 
2.1.4

^ permalink raw reply related

* Re: [PATCH v3 07/29] x86: bpf_jit, use ENTRY+ENDPROC
From: David Miller @ 2017-04-24 15:08 UTC (permalink / raw)
  To: jslaby
  Cc: alexei.starovoitov, mingo, tglx, hpa, x86, jpoimboe, linux-kernel,
	netdev, daniel, edumazet
In-Reply-To: <f7a94f4e-6d93-a447-a62f-3f290e66c647@suse.cz>

From: Jiri Slaby <jslaby@suse.cz>
Date: Mon, 24 Apr 2017 16:52:43 +0200

> On 04/24/2017, 04:41 PM, David Miller wrote:
>>> It cannot stay as-is simply because we want to know where the functions
>>> end to inject debuginfo properly. The code above does not warrant for
>>> any exception.
>> 
>> I totally and completely disagree.
> 
> You can disagree as you wish but there is really nothing special on the
> bpf code with respect to annotations.
> 
>>> Executing a nop takes a little and having externally-callable functions
>>> aligned can actually help performance (no, I haven't measured nor tested
>>> the code). But sure, the tool is generic, so I can introduce a local
>>> macros to avoid alignments in the functions:
>> 
>> Not for this case, it's a bunch of entry points all packed together
>> intentionally so that SKB accesses of different access sizes (which is
>> almost always the case) from BPF programs use the smallest amount of
>> I-cache as possible.
> 
> And for that reason I suggested the special macros for the code (see the
> macros in the e-mail you replied to again). So what problem do you
> actually have with the suggested solution?

If you align the entry points, then the code sequence as a whole is
are no longer densely packed.

Or do I misunderstand how your macros work?

^ permalink raw reply

* net/ipv6: slab-out-of-bounds in ip6_tnl_xmit
From: Andrey Konovalov @ 2017-04-24 15:03 UTC (permalink / raw)
  To: David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, netdev, LKML
  Cc: Eric Dumazet, Cong Wang, Dmitry Vyukov, Kostya Serebryany,
	syzkaller

Hi,

I've got the following error report while fuzzing the kernel with syzkaller.

On commit 5a7ad1146caa895ad718a534399e38bd2ba721b7 (4.11-rc8).

Unfortunately it's not reproducible.

The issue might be similar to this one:
https://groups.google.com/forum/#!topic/syzkaller/IDoQHFmrnRI

==================================================================
BUG: KASAN: slab-out-of-bounds in ip6_tnl_xmit+0x25dd/0x28f0
net/ipv6/ip6_tunnel.c:1078 at addr ffff88005dcc5f98
Read of size 16 by task syz-executor7/8076
CPU: 3 PID: 8076 Comm: syz-executor7 Not tainted 4.11.0-rc8+ #266
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:16 [inline]
 dump_stack+0x192/0x22d lib/dump_stack.c:52
 kasan_object_err+0x1c/0x70 mm/kasan/report.c:164
 print_address_description mm/kasan/report.c:202 [inline]
 kasan_report_error mm/kasan/report.c:291 [inline]
 kasan_report+0x252/0x510 mm/kasan/report.c:347
 __asan_report_load_n_noabort+0xf/0x20 mm/kasan/report.c:378
 ip6_tnl_xmit+0x25dd/0x28f0 net/ipv6/ip6_tunnel.c:1078
 ip4ip6_tnl_xmit net/ipv6/ip6_tunnel.c:1268 [inline]
 ip6_tnl_start_xmit+0xc1e/0x1890 net/ipv6/ip6_tunnel.c:1370
 __netdev_start_xmit include/linux/netdevice.h:3980 [inline]
 netdev_start_xmit include/linux/netdevice.h:3989 [inline]
 xmit_one net/core/dev.c:2908 [inline]
 dev_hard_start_xmit+0x213/0x800 net/core/dev.c:2924
 __dev_queue_xmit+0x1abc/0x2580 net/core/dev.c:3391
 dev_queue_xmit+0x17/0x20 net/core/dev.c:3424
 neigh_direct_output+0x15/0x20 net/core/neighbour.c:1349
 neigh_output include/net/neighbour.h:478 [inline]
 ip_finish_output2+0x7cd/0x1020 net/ipv4/ip_output.c:228
 ip_finish_output+0x83d/0xc30 net/ipv4/ip_output.c:316
 NF_HOOK_COND include/linux/netfilter.h:246 [inline]
 ip_output+0x1e7/0x5d0 net/ipv4/ip_output.c:404
 dst_output include/net/dst.h:486 [inline]
 ip_local_out+0x82/0xb0 net/ipv4/ip_output.c:124
 ip_send_skb+0x3c/0xc0 net/ipv4/ip_output.c:1492
 ip_push_pending_frames+0x64/0x80 net/ipv4/ip_output.c:1512
 ping_v4_push_pending_frames net/ipv4/ping.c:653 [inline]
 ping_v4_sendmsg+0x1b35/0x23e0 net/ipv4/ping.c:840
 inet_sendmsg+0x164/0x490 net/ipv4/af_inet.c:762
 sock_sendmsg_nosec net/socket.c:633 [inline]
 sock_sendmsg+0xca/0x110 net/socket.c:643
 SYSC_sendto+0x660/0x810 net/socket.c:1696
 SyS_sendto+0x40/0x50 net/socket.c:1664
 entry_SYSCALL_64_fastpath+0x1a/0xa9
RIP: 0033:0x4458d9
RSP: 002b:00007f853159db58 EFLAGS: 00000282 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 0000000000708000 RCX: 00000000004458d9
RDX: 0000000000000008 RSI: 00000000204f9fe1 RDI: 0000000000000017
RBP: 0000000000003410 R08: 0000000020235000 R09: 0000000000000010
R10: 0000000000000000 R11: 0000000000000282 R12: 00000000006e24d0
R13: 0000000020ef8000 R14: 0000000000001000 R15: 0000000000000003
Object at ffff88005dcc5e20, in cache kmalloc-512 size: 512
Allocated:
PID = 8076
 save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
 save_stack+0x43/0xd0 mm/kasan/kasan.c:513
 set_track mm/kasan/kasan.c:525 [inline]
 kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:616
 __kmalloc+0x7c/0x1c0 mm/slub.c:3745
 kmalloc include/linux/slab.h:495 [inline]
 kzalloc include/linux/slab.h:663 [inline]
 neigh_alloc net/core/neighbour.c:286 [inline]
 __neigh_create+0x386/0x1da0 net/core/neighbour.c:458
 neigh_create include/net/neighbour.h:313 [inline]
 ipv4_neigh_lookup+0x4bb/0x730 net/ipv4/route.c:463
 dst_neigh_lookup include/net/dst.h:447 [inline]
 ip6_tnl_xmit+0x1598/0x28f0 net/ipv6/ip6_tunnel.c:1067
 ip4ip6_tnl_xmit net/ipv6/ip6_tunnel.c:1268 [inline]
 ip6_tnl_start_xmit+0xc1e/0x1890 net/ipv6/ip6_tunnel.c:1370
 __netdev_start_xmit include/linux/netdevice.h:3980 [inline]
 netdev_start_xmit include/linux/netdevice.h:3989 [inline]
 xmit_one net/core/dev.c:2908 [inline]
 dev_hard_start_xmit+0x213/0x800 net/core/dev.c:2924
 __dev_queue_xmit+0x1abc/0x2580 net/core/dev.c:3391
 dev_queue_xmit+0x17/0x20 net/core/dev.c:3424
 neigh_direct_output+0x15/0x20 net/core/neighbour.c:1349
 neigh_output include/net/neighbour.h:478 [inline]
 ip_finish_output2+0x7cd/0x1020 net/ipv4/ip_output.c:228
 ip_finish_output+0x83d/0xc30 net/ipv4/ip_output.c:316
 NF_HOOK_COND include/linux/netfilter.h:246 [inline]
 ip_output+0x1e7/0x5d0 net/ipv4/ip_output.c:404
 dst_output include/net/dst.h:486 [inline]
 ip_local_out+0x82/0xb0 net/ipv4/ip_output.c:124
 ip_send_skb+0x3c/0xc0 net/ipv4/ip_output.c:1492
 ip_push_pending_frames+0x64/0x80 net/ipv4/ip_output.c:1512
 ping_v4_push_pending_frames net/ipv4/ping.c:653 [inline]
 ping_v4_sendmsg+0x1b35/0x23e0 net/ipv4/ping.c:840
 inet_sendmsg+0x164/0x490 net/ipv4/af_inet.c:762
 sock_sendmsg_nosec net/socket.c:633 [inline]
 sock_sendmsg+0xca/0x110 net/socket.c:643
 SYSC_sendto+0x660/0x810 net/socket.c:1696
 SyS_sendto+0x40/0x50 net/socket.c:1664
 entry_SYSCALL_64_fastpath+0x1a/0xa9
Freed:
PID = 7604
 save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
 save_stack+0x43/0xd0 mm/kasan/kasan.c:513
 set_track mm/kasan/kasan.c:525 [inline]
 kasan_slab_free+0x73/0xc0 mm/kasan/kasan.c:589
 slab_free_hook mm/slub.c:1357 [inline]
 slab_free_freelist_hook mm/slub.c:1379 [inline]
 slab_free mm/slub.c:2961 [inline]
 kfree+0x91/0x190 mm/slub.c:3882
 skb_free_head+0x74/0xb0 net/core/skbuff.c:579
 skb_release_data+0x37c/0x440 net/core/skbuff.c:610
 skb_release_all+0x4a/0x60 net/core/skbuff.c:669
 __kfree_skb net/core/skbuff.c:683 [inline]
 consume_skb+0x130/0x2f0 net/core/skbuff.c:756
 netlink_broadcast_filtered+0x5fa/0x1420 net/netlink/af_netlink.c:1473
 netlink_broadcast net/netlink/af_netlink.c:1495 [inline]
 nlmsg_multicast include/net/netlink.h:577 [inline]
 nlmsg_notify+0x9c/0x140 net/netlink/af_netlink.c:2382
 rtnl_notify+0xbb/0xe0 net/core/rtnetlink.c:674
 rtmsg_fib+0x3a7/0x4b0 net/ipv4/fib_semantics.c:422
 fib_table_delete+0x836/0x1140 net/ipv4/fib_trie.c:1659
 fib_magic.isra.14+0x4b3/0x890 net/ipv4/fib_frontend.c:840
 fib_del_ifaddr+0xb20/0xe10 net/ipv4/fib_frontend.c:1013
 fib_inetaddr_event+0xaf/0x200 net/ipv4/fib_frontend.c:1150
 notifier_call_chain+0x145/0x2f0 kernel/notifier.c:93
 __blocking_notifier_call_chain kernel/notifier.c:317 [inline]
 blocking_notifier_call_chain+0x109/0x1a0 kernel/notifier.c:328
 __inet_del_ifa+0x4b5/0xb00 net/ipv4/devinet.c:402
 inet_del_ifa net/ipv4/devinet.c:432 [inline]
 devinet_ioctl+0xa75/0x1a10 net/ipv4/devinet.c:1073
 inet_ioctl+0x117/0x1c0 net/ipv4/af_inet.c:900
 sock_do_ioctl+0x65/0xb0 net/socket.c:906
 sock_ioctl+0x27a/0x410 net/socket.c:1004
 vfs_ioctl fs/ioctl.c:45 [inline]
 do_vfs_ioctl+0x1cd/0x15a0 fs/ioctl.c:685
 SYSC_ioctl fs/ioctl.c:700 [inline]
 SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691
 entry_SYSCALL_64_fastpath+0x1a/0xa9
Memory state around the buggy address:
 ffff88005dcc5e80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 ffff88005dcc5f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff88005dcc5f80: 00 00 00 00 fc fc fc fc fc fc fc fc fc fc fc fc
                               ^
 ffff88005dcc6000: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 ffff88005dcc6080: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
==================================================================

^ permalink raw reply

* Re: macvlan: Fix device ref leak when purging bc_queue
From: Joe.Ghalam @ 2017-04-24 15:01 UTC (permalink / raw)
  To: herbert; +Cc: davem, Clifford.Wichmann, netdev
In-Reply-To: <20170424075606.GA19926@gondor.apana.org.au>

> The only thing that can stop macvlan_process_broadcast from getting
> called is macvlan_port_destroy.  Nothing else can stop the work
> queue, unless of course the work queue mechanism itself is broken.

> So if you're sure macvlan_port_destroy is never even called in
> your case, then you'll need to start debugging the kernel work
> queue mechanism to see why macvlan_process_broadcast is not getting
> called.

I will get your changes reloaded and re-tested without any other debug tools. Hopefully, we'll see success. I will let you know if I see any issues. 
Btw, is your fix committed already? if not, do you know when and where it would be committed?

Thanks,
Joe
 

^ permalink raw reply

* Re: [PATCH net] bridge: shutdown bridge device before removing it
From: Nikolay Aleksandrov @ 2017-04-24 14:53 UTC (permalink / raw)
  To: Xin Long; +Cc: network dev, bridge@lists.linux-foundation.org, David S. Miller
In-Reply-To: <CADvbK_ckVO19iPnhVLoSKXsRPLcb3kJ_H+fN-UEwaF4T=JY0dQ@mail.gmail.com>

On 24/04/17 17:41, Xin Long wrote:
> On Mon, Apr 24, 2017 at 8:07 PM, Nikolay Aleksandrov
> <nikolay@cumulusnetworks.com> wrote:
>> On 24/04/17 14:01, Nikolay Aleksandrov wrote:
>>> On 24/04/17 10:25, Xin Long wrote:
>>>> During removing a bridge device, if the bridge is still up, a new mdb entry
>>>> still can be added in br_multicast_add_group() after all mdb entries are
>>>> removed in br_multicast_dev_del(). Like the path:
>>>>
>>>>   mld_ifc_timer_expire ->
>>>>     mld_sendpack -> ...
>>>>       br_multicast_rcv ->
>>>>         br_multicast_add_group
>>>>
>>>> The new mp's timer will be set up. If the timer expires after the bridge
>>>> is freed, it may cause use-after-free panic in br_multicast_group_expired.
>>>> This can happen when ip link remove a bridge or destroy a netns with a
>>>> bridge device inside.
>>>>
>>>> As we can see in br_del_bridge, brctl is also supposed to remove a bridge
>>>> device after it's shutdown.
>>>>
>>>> This patch is to call dev_close at the beginning of br_dev_delete so that
>>>> netif_running check in br_multicast_add_group can avoid this issue. But
>>>> to keep consistent with before, it will not remove the IFF_UP check in
>>>> br_del_bridge for brctl.
>>>>
>>>> Reported-by: Jianwen Ji <jiji@redhat.com>
>>>> Signed-off-by: Xin Long <lucien.xin@gmail.com>
>>>> ---
>>>>  net/bridge/br_if.c | 2 ++
>>>>  1 file changed, 2 insertions(+)
>>>>
>>>
>>> +CC bridge maintainers
>>>
>>> I can see how this could happen, could you also provide the traceback ?
>>>
>>> The patch looks good to me, actually I think it fixes another issue with
>>> mcast stats where the percpu pointer can be accessed after it's freed if
>>> an mcast packet can get sent via br->dev after the br_multicast_dev_del() call.
>>> This is definitely stable material, if I'm not mistaken the issue is there since
>>> the introduction of br_dev_delete:
>>> commit e10177abf842
>>> Author: Satish Ashok <sashok@cumulusnetworks.com>
>>> Date:   Wed Jul 15 07:16:51 2015 -0700
>>>
>>>     bridge: multicast: fix handling of temp and perm entries
>>>
>>>
>>>
>>> Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
>>>
>>
>> Actually I have a better idea for a fix because dev_close() for a single device is rather heavy.
>> Why don't you move the mdb flush logic in the bridge's ndo_uninit() callback ?
>> That should have the same effect and be much faster.
> Yes. But it seems that all cleanups for bridge should be done after
> it's shutdown since beginning according to brctl. I'm not sure if there
> are still other problems caused by this. maybe safer to use dev_close.
> I need to check more to confirm this.
> 

ndo_uninit() is after the device has been stopped, so it is the same as
your fix as I said.

> I also have another question about mp->timer removing.
> As we can see, now it removes this timer with del_timer, instead of
> del_timer_sync. What if the timer is running when del_timer ?
> How can we be sure that br_multicast_group_expired will be done
> before the bridge dev is freed. synchronize_net ?
> 

Yeah, I've been thinking about that and the only race is that the timer
might have fired and waiting for the lock while the mdb is being flushed
thus the cancel_timer() won't affect it and then it will enter and see
that !netif_running(br->dev), but unfortunately there's a bug because we
cannot guarantee that br->dev still exists at that point.
This is a different bug though.

>>
>> By the way I just noticed that there's also a memory leak - the mdb hash is reallocated
>> and not freed due to the mdb rehash, here's also kmemleak's object:
>>
> yeps, ;-)
> 
>> unreferenced object 0xffff8800540ba800 (size 2048):
>>   comm "softirq", pid 0, jiffies 4520588901 (age 5787.284s)
>>   hex dump (first 32 bytes):
>>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>>   backtrace:
>>     [<ffffffff816e2287>] kmemleak_alloc+0x67/0xc0
>>     [<ffffffff81260bea>] __kmalloc+0x1ba/0x3e0
>>     [<ffffffffa05c60ee>] br_mdb_rehash+0x5e/0x340 [bridge]
>>     [<ffffffffa05c74af>] br_multicast_new_group+0x43f/0x6e0 [bridge]
>>     [<ffffffffa05c7aa3>] br_multicast_add_group+0x203/0x260 [bridge]
>>     [<ffffffffa05ca4b5>] br_multicast_rcv+0x945/0x11d0 [bridge]
>>     [<ffffffffa05b6b10>] br_dev_xmit+0x180/0x470 [bridge]
>>     [<ffffffff815c781b>] dev_hard_start_xmit+0xbb/0x3d0
>>     [<ffffffff815c8743>] __dev_queue_xmit+0xb13/0xc10
>>     [<ffffffff815c8850>] dev_queue_xmit+0x10/0x20
>>     [<ffffffffa02f8d7a>] ip6_finish_output2+0x5ca/0xac0 [ipv6]
>>     [<ffffffffa02fbfc6>] ip6_finish_output+0x126/0x2c0 [ipv6]
>>     [<ffffffffa02fc245>] ip6_output+0xe5/0x390 [ipv6]
>>     [<ffffffffa032b92c>] NF_HOOK.constprop.44+0x6c/0x240 [ipv6]
>>     [<ffffffffa032bd16>] mld_sendpack+0x216/0x3e0 [ipv6]
>>     [<ffffffffa032d5eb>] mld_ifc_timer_expire+0x18b/0x2b0 [ipv6]
>>
>>
>>

^ permalink raw reply

* Re: Re: [PATCH net-next 1/4] ixgbe: sparc: rename the ARCH_WANT_RELAX_ORDER to IXGBE_ALLOW_RELAXED_ORDER
From: Will Deacon @ 2017-04-24 14:53 UTC (permalink / raw)
  To: Gabriele Paoloni
  Cc: Amir Ancel, David Laight, davem@davemloft.net, Catalin Marinas,
	Mark Rutland, Robin Murphy, jeffrey.t.kirsher@intel.com,
	alexander.duyck@gmail.com, linux-arm-kernel@lists.infradead.org,
	netdev@vger.kernel.org, Dingtianhong, Linuxarm
In-Reply-To: <EE11001F9E5DDD47B7634E2F8A612F2E2053543C@FRAEML521-MBX.china.huawei.com>

On Wed, Apr 19, 2017 at 02:46:19PM +0000, Gabriele Paoloni wrote:
> > From: Amir Ancel [mailto:amira@mellanox.com]
> > Sent: 18 April 2017 21:18
> > To: David Laight; Gabriele Paoloni; davem@davemloft.net
> > Cc: Catalin Marinas; Will Deacon; Mark Rutland; Robin Murphy;
> > jeffrey.t.kirsher@intel.com; alexander.duyck@gmail.com; linux-arm-
> > kernel@lists.infradead.org; netdev@vger.kernel.org; Dingtianhong;
> > Linuxarm
> > Subject: Re: Re: [PATCH net-next 1/4] ixgbe: sparc: rename the
> > ARCH_WANT_RELAX_ORDER to IXGBE_ALLOW_RELAXED_ORDER
> > 
> > Hi,
> > mlx5 driver is planned to have RO support this year.
> > I believe drivers should be able to query whether the arch support it
> 
> I guess that here when you say query you mean having a config symbol
> that is set accordingly to the host architecture, right?
> 
> As already said I have looked around a bit and other drivers do not seem
> to enable/disable RO for their EP on the basis of the host architecture.
> So why should mlx5 do it according to the host?
> 
> Also my understating is that some architectures (like ARM64 for example)
> can have different PCI host controller implementations depending on the
> vendor...therefore maybe it is not appropriate there to have a Kconfig
> symbol selected by the architecture...  

Indeed. We're not able to determine whether or not RO is supported at
compile time, so we'd have to detect this dynamically if we want to support
it for arm64 with a single kernel Image. That means either passing something
through firmware, having the PCI host controller opt-in or something coarse
like a command-line option.

Will

^ permalink raw reply

* Re: [PATCH v3 07/29] x86: bpf_jit, use ENTRY+ENDPROC
From: Jiri Slaby @ 2017-04-24 14:52 UTC (permalink / raw)
  To: David Miller
  Cc: alexei.starovoitov, mingo, tglx, hpa, x86, jpoimboe, linux-kernel,
	netdev, daniel, edumazet
In-Reply-To: <20170424.104132.950580313142367896.davem@davemloft.net>

On 04/24/2017, 04:41 PM, David Miller wrote:
>> It cannot stay as-is simply because we want to know where the functions
>> end to inject debuginfo properly. The code above does not warrant for
>> any exception.
> 
> I totally and completely disagree.

You can disagree as you wish but there is really nothing special on the
bpf code with respect to annotations.

>> Executing a nop takes a little and having externally-callable functions
>> aligned can actually help performance (no, I haven't measured nor tested
>> the code). But sure, the tool is generic, so I can introduce a local
>> macros to avoid alignments in the functions:
> 
> Not for this case, it's a bunch of entry points all packed together
> intentionally so that SKB accesses of different access sizes (which is
> almost always the case) from BPF programs use the smallest amount of
> I-cache as possible.

And for that reason I suggested the special macros for the code (see the
macros in the e-mail you replied to again). So what problem do you
actually have with the suggested solution?

thanks,
-- 
js
suse labs

^ permalink raw reply

* Re: [PATCH net v2 1/3] net: hns: support deferred probe when can not obtain irq
From: Matthias Brugger @ 2017-04-24 14:47 UTC (permalink / raw)
  To: lipeng (Y), Yankejian, davem, salil.mehta, yisen.zhuang,
	huangdaode, zhouhuiru
  Cc: netdev, charles.chenxin, linuxarm
In-Reply-To: <a64b254a-d5b4-287e-1f76-ec6f8189609a@huawei.com>



On 24/04/17 13:43, lipeng (Y) wrote:
>
>
> On 2017/4/24 18:28, Matthias Brugger wrote:
>> On 21/04/17 09:44, Yankejian wrote:
>>> From: lipeng <lipeng321@huawei.com>
>>>
>>> In the hip06 and hip07 SoCs, the interrupt lines from the
>>> DSAF controllers are connected to mbigen hw module.
>>> The mbigen module is probed with module_init, and, as such,
>>> is not guaranteed to probe before the HNS driver. So we need
>>> to support deferred probe.
>>>
>>> We check for probe deferral in the hw layer probe, so we not
>>> probe into the main layer and memories, etc., to later learn
>>> that we need to defer the probe.
>>>
>>
>> Why? This looks like a hack.
>> From what I see, we can handle EPROBE_DEFER easily inside hns_ppe_init
>> checking the return value of hns_rcb_get_cfg. Like you do in 2/3 of
>> this series.
>>
>> Regards,
>> Matthias
> Hi Matthias,
>
> mdio && phy is not necessary condition, and port can work well  for port
> + SFP (without mdio &&phy).
>
> BUT irq is the necessary condition,  port can not work well without irq.
>
> So, I check IRQ first,and do not probe dsaf if can't obtain irq(1/3 of
> this series),   and check mdio only when there is phy(2/3 of this series).
>
> And thanks for your review.

I think I didn't explained myself good enough.
I was suggesting the following (not even compile tested):

diff --git a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.c 
b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.c
index eba406bea52f..be38d47bc399 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.c
@@ -510,7 +510,9 @@ int hns_ppe_init(struct dsaf_device *dsaf_dev)

                 hns_ppe_get_cfg(dsaf_dev->ppe_common[i]);

-               hns_rcb_get_cfg(dsaf_dev->rcb_common[i]);
+               ret = hns_rcb_get_cfg(dsaf_dev->rcb_common[i]);
+               if (reg < 0)
+                       goto get_cfg_fail;
         }

         for (i = 0; i < HNS_PPE_COM_NUM; i++)
diff --git a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_rcb.c 
b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_rcb.c
index c20a0f4f8f02..c7e801d0c3b7 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_rcb.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_rcb.c
@@ -492,7 +492,7 @@ static int hns_rcb_get_base_irq_idx(struct 
rcb_common_cb *rcb_common)
   *hns_rcb_get_cfg - get rcb config
   *@rcb_common: rcb common device
   */
-void hns_rcb_get_cfg(struct rcb_common_cb *rcb_common)
+int hns_rcb_get_cfg(struct rcb_common_cb *rcb_common)
  {
         struct ring_pair_cb *ring_pair_cb;
         u32 i;
@@ -517,10 +517,18 @@ void hns_rcb_get_cfg(struct rcb_common_cb *rcb_common)
                 ring_pair_cb->virq[HNS_RCB_IRQ_IDX_RX] =
                 is_ver1 ? platform_get_irq(pdev, base_irq_idx + i * 2 + 
1) :
                           platform_get_irq(pdev, base_irq_idx + i * 3);
+
+               if ((ring_pair_cb->virq[HNS_RCB_IRQ_IDX_TX] == 
-EPROBE_DEFER) ||
+                   (ring_pair_cb->virq[HNS_RCB_IRQ_IDX_RX] == 
-EPROBE_DEFER)) {
+                       return -EPROBE_DEFER;
+               }
+
                 ring_pair_cb->q.phy_base =
 
RCB_COMM_BASE_TO_RING_BASE(rcb_common->phy_base, i);
                 hns_rcb_ring_pair_get_cfg(ring_pair_cb);
         }
+
+       return 0;
  }

  /**


Regards,
Matthias

>
> lipeng
>
>>
>>> Signed-off-by: lipeng <lipeng321@huawei.com>
>>> Reviewed-by: Yisen Zhuang <yisen.zhuang@huawei.com>
>>> ---
>>>  drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c | 12 ++++++++++++
>>>  1 file changed, 12 insertions(+)
>>>
>>> diff --git a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c
>>> b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c
>>> index 403ea9d..2da5b42 100644
>>> --- a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c
>>> +++ b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c
>>> @@ -2971,6 +2971,18 @@ static int hns_dsaf_probe(struct
>>> platform_device *pdev)
>>>      struct dsaf_device *dsaf_dev;
>>>      int ret;
>>>
>>> +    /*
>>> +     * Check if we should defer the probe before we probe the
>>> +     * dsaf, as it's hard to defer later on.
>>> +     */
>>> +    ret = platform_get_irq(pdev, 0);
>>> +    if (ret < 0) {
>>> +        if (ret != -EPROBE_DEFER)
>>> +            dev_err(&pdev->dev, "Cannot obtain irq\n");
>>> +
>>> +        return ret;
>>> +    }
>>> +
>>>      dsaf_dev = hns_dsaf_alloc_dev(&pdev->dev, sizeof(struct
>>> dsaf_drv_priv));
>>>      if (IS_ERR(dsaf_dev)) {
>>>          ret = PTR_ERR(dsaf_dev);
>>>
>>
>> .
>>
>

^ permalink raw reply related

* Re: [PATCH net-next 0/3] Misc BPF cleanup
From: Alexei Starovoitov @ 2017-04-24 14:45 UTC (permalink / raw)
  To: Alexander Alemayhu, netdev; +Cc: daniel
In-Reply-To: <20170424133108.31595-1-alexander@alemayhu.com>

On 4/24/17 6:31 AM, Alexander Alemayhu wrote:
> Hei,
>
> while looking into making the Makefile in samples/bpf better handle O= I saw
> several warnings when running `make clean && make samples/bpf/`. This series
> reduces those warnings.

Cleanup looks good to me.
Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply

* Re: [PATCH net-next v5 1/2] net sched actions: dump more than TCA_ACT_MAX_PRIO actions per batch
From: Jamal Hadi Salim @ 2017-04-24 14:42 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: Simon Horman, Jiri Pirko, davem, xiyou.wangcong, eric.dumazet,
	netdev, Tom Herbert
In-Reply-To: <20170424142058.GA26625@salvia>

On 17-04-24 10:20 AM, Pablo Neira Ayuso wrote:
> On Mon, Apr 24, 2017 at 08:49:00AM -0400, Jamal Hadi Salim wrote:

>>
>> I am fine with the counter-Postel view of having the kernel
>> validate that appropriate bits are set as long as we dont make
>> user space to now start learning how to play acrobatics.
>
> jamal, what performance concern you have in building this error
> message? TLVs is the most flexible way. And this is error path, so we
> should build this message rarely, only if the user sends us something
> incorrect, why bother...

I have a feeling we are reffering to 2 different things.
Which error message? Are you talking about extended ACK?
I have no problem with that.

Let me sumarize for you the discussion.

My concern was was the double request needed now
which was unneeded before.

Before: You send a msg and say the kernel didnt understand.
Kernel ignores what it didnt understand and does things
you asked it to. i.e Part of Postel principle which says
"Be liberal in what you expect of others"
But the new concern is user space not abiding to the other
half of Postel principle "Be conservative in what you send".
It may set some random flags which the kernel doesnt understand.

One idea is to have the kernel totally reject anytime
it sees such flags. I am sure such a message could be
conveyed back to the user. Then the user sends the
correct one back.

The challenge i have is to enforce this trial by fire
approach to all user space apps. It is a large change.

My suggestion is for user to set flag to request the
old behavior of sending only one message.

cheers,
jamal

^ permalink raw reply

* Re: [PATCH v3 07/29] x86: bpf_jit, use ENTRY+ENDPROC
From: David Miller @ 2017-04-24 14:41 UTC (permalink / raw)
  To: jslaby
  Cc: alexei.starovoitov, mingo, tglx, hpa, x86, jpoimboe, linux-kernel,
	netdev, daniel, edumazet
In-Reply-To: <697947f4-0a2c-1480-0995-9919556dc020@suse.cz>

From: Jiri Slaby <jslaby@suse.cz>
Date: Mon, 24 Apr 2017 08:45:11 +0200

> On 04/21/2017, 09:32 PM, Alexei Starovoitov wrote:
>> On Fri, Apr 21, 2017 at 04:12:43PM +0200, Jiri Slaby wrote:
>>> Do not use a custom macro FUNC for starts of the global functions, use
>>> ENTRY instead.
>>>
>>> And while at it, annotate also ends of the functions by ENDPROC.
>>>
>>> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
>>> Cc: "David S. Miller" <davem@davemloft.net>
>>> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
>>> Cc: James Morris <jmorris@namei.org>
>>> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
>>> Cc: Patrick McHardy <kaber@trash.net>
>>> Cc: Thomas Gleixner <tglx@linutronix.de>
>>> Cc: Ingo Molnar <mingo@redhat.com>
>>> Cc: "H. Peter Anvin" <hpa@zytor.com>
>>> Cc: x86@kernel.org
>>> Cc: netdev@vger.kernel.org
>>> ---
>>>  arch/x86/net/bpf_jit.S | 32 ++++++++++++++++++--------------
>>>  1 file changed, 18 insertions(+), 14 deletions(-)
>>>
>>> diff --git a/arch/x86/net/bpf_jit.S b/arch/x86/net/bpf_jit.S
>>> index f2a7faf4706e..762c29fb8832 100644
>>> --- a/arch/x86/net/bpf_jit.S
>>> +++ b/arch/x86/net/bpf_jit.S
>>> @@ -23,16 +23,12 @@
>>>  	32 /* space for rbx,r13,r14,r15 */ + \
>>>  	8 /* space for skb_copy_bits */)
>>>  
>>> -#define FUNC(name) \
>>> -	.globl name; \
>>> -	.type name, @function; \
>>> -	name:
>>> -
>>> -FUNC(sk_load_word)
>>> +ENTRY(sk_load_word)
>>>  	test	%esi,%esi
>>>  	js	bpf_slow_path_word_neg
>>> +ENDPROC(sk_load_word)
>> 
>> this doens't look right.
>> It will add alignment nops in critical paths of these pseudo functions.
>> I'm also not sure whether it will still work afterwards.
>> Was it tested?
>> I'd prefer if this code kept as-is.
> 
> It cannot stay as-is simply because we want to know where the functions
> end to inject debuginfo properly. The code above does not warrant for
> any exception.

I totally and completely disagree.

> Executing a nop takes a little and having externally-callable functions
> aligned can actually help performance (no, I haven't measured nor tested
> the code). But sure, the tool is generic, so I can introduce a local
> macros to avoid alignments in the functions:

Not for this case, it's a bunch of entry points all packed together
intentionally so that SKB accesses of different access sizes (which is
almost always the case) from BPF programs use the smallest amount of
I-cache as possible.

^ permalink raw reply

* Re: [PATCH net-next 3/3] samples/bpf: check before defining offsetof
From: Daniel Borkmann @ 2017-04-24 14:41 UTC (permalink / raw)
  To: Alexander Alemayhu, netdev; +Cc: ast
In-Reply-To: <20170424133108.31595-4-alexander@alemayhu.com>

On 04/24/2017 03:31 PM, Alexander Alemayhu wrote:
> Fixes the following warning
>
> samples/bpf/test_lru_dist.c:28:0: warning: "offsetof" redefined
>   #define offsetof(TYPE, MEMBER) ((size_t)&((TYPE *)0)->MEMBER)
>
> In file included from ./tools/lib/bpf/bpf.h:25:0,
>                   from samples/bpf/libbpf.h:5,
>                   from samples/bpf/test_lru_dist.c:24:
> /usr/lib/gcc/x86_64-redhat-linux/6.3.1/include/stddef.h:417:0: note: this is the location of the previous definition
>   #define offsetof(TYPE, MEMBER) __builtin_offsetof (TYPE, MEMBER)
>
> Signed-off-by: Alexander Alemayhu <alexander@alemayhu.com>

Acked-by: Daniel Borkmann <daniel@iogearbox.net>

^ permalink raw reply

* Re: [PATCH net] bridge: shutdown bridge device before removing it
From: Xin Long @ 2017-04-24 14:41 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: network dev, David S. Miller, Stephen Hemminger,
	bridge@lists.linux-foundation.org
In-Reply-To: <6144408f-4cfd-dd87-7444-8c61269b105a@cumulusnetworks.com>

On Mon, Apr 24, 2017 at 8:07 PM, Nikolay Aleksandrov
<nikolay@cumulusnetworks.com> wrote:
> On 24/04/17 14:01, Nikolay Aleksandrov wrote:
>> On 24/04/17 10:25, Xin Long wrote:
>>> During removing a bridge device, if the bridge is still up, a new mdb entry
>>> still can be added in br_multicast_add_group() after all mdb entries are
>>> removed in br_multicast_dev_del(). Like the path:
>>>
>>>   mld_ifc_timer_expire ->
>>>     mld_sendpack -> ...
>>>       br_multicast_rcv ->
>>>         br_multicast_add_group
>>>
>>> The new mp's timer will be set up. If the timer expires after the bridge
>>> is freed, it may cause use-after-free panic in br_multicast_group_expired.
>>> This can happen when ip link remove a bridge or destroy a netns with a
>>> bridge device inside.
>>>
>>> As we can see in br_del_bridge, brctl is also supposed to remove a bridge
>>> device after it's shutdown.
>>>
>>> This patch is to call dev_close at the beginning of br_dev_delete so that
>>> netif_running check in br_multicast_add_group can avoid this issue. But
>>> to keep consistent with before, it will not remove the IFF_UP check in
>>> br_del_bridge for brctl.
>>>
>>> Reported-by: Jianwen Ji <jiji@redhat.com>
>>> Signed-off-by: Xin Long <lucien.xin@gmail.com>
>>> ---
>>>  net/bridge/br_if.c | 2 ++
>>>  1 file changed, 2 insertions(+)
>>>
>>
>> +CC bridge maintainers
>>
>> I can see how this could happen, could you also provide the traceback ?
>>
>> The patch looks good to me, actually I think it fixes another issue with
>> mcast stats where the percpu pointer can be accessed after it's freed if
>> an mcast packet can get sent via br->dev after the br_multicast_dev_del() call.
>> This is definitely stable material, if I'm not mistaken the issue is there since
>> the introduction of br_dev_delete:
>> commit e10177abf842
>> Author: Satish Ashok <sashok@cumulusnetworks.com>
>> Date:   Wed Jul 15 07:16:51 2015 -0700
>>
>>     bridge: multicast: fix handling of temp and perm entries
>>
>>
>>
>> Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
>>
>
> Actually I have a better idea for a fix because dev_close() for a single device is rather heavy.
> Why don't you move the mdb flush logic in the bridge's ndo_uninit() callback ?
> That should have the same effect and be much faster.
Yes. But it seems that all cleanups for bridge should be done after
it's shutdown since beginning according to brctl. I'm not sure if there
are still other problems caused by this. maybe safer to use dev_close.
I need to check more to confirm this.

I also have another question about mp->timer removing.
As we can see, now it removes this timer with del_timer, instead of
del_timer_sync. What if the timer is running when del_timer ?
How can we be sure that br_multicast_group_expired will be done
before the bridge dev is freed. synchronize_net ?

>
> By the way I just noticed that there's also a memory leak - the mdb hash is reallocated
> and not freed due to the mdb rehash, here's also kmemleak's object:
>
yeps, ;-)

> unreferenced object 0xffff8800540ba800 (size 2048):
>   comm "softirq", pid 0, jiffies 4520588901 (age 5787.284s)
>   hex dump (first 32 bytes):
>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>   backtrace:
>     [<ffffffff816e2287>] kmemleak_alloc+0x67/0xc0
>     [<ffffffff81260bea>] __kmalloc+0x1ba/0x3e0
>     [<ffffffffa05c60ee>] br_mdb_rehash+0x5e/0x340 [bridge]
>     [<ffffffffa05c74af>] br_multicast_new_group+0x43f/0x6e0 [bridge]
>     [<ffffffffa05c7aa3>] br_multicast_add_group+0x203/0x260 [bridge]
>     [<ffffffffa05ca4b5>] br_multicast_rcv+0x945/0x11d0 [bridge]
>     [<ffffffffa05b6b10>] br_dev_xmit+0x180/0x470 [bridge]
>     [<ffffffff815c781b>] dev_hard_start_xmit+0xbb/0x3d0
>     [<ffffffff815c8743>] __dev_queue_xmit+0xb13/0xc10
>     [<ffffffff815c8850>] dev_queue_xmit+0x10/0x20
>     [<ffffffffa02f8d7a>] ip6_finish_output2+0x5ca/0xac0 [ipv6]
>     [<ffffffffa02fbfc6>] ip6_finish_output+0x126/0x2c0 [ipv6]
>     [<ffffffffa02fc245>] ip6_output+0xe5/0x390 [ipv6]
>     [<ffffffffa032b92c>] NF_HOOK.constprop.44+0x6c/0x240 [ipv6]
>     [<ffffffffa032bd16>] mld_sendpack+0x216/0x3e0 [ipv6]
>     [<ffffffffa032d5eb>] mld_ifc_timer_expire+0x18b/0x2b0 [ipv6]
>
>
>

^ permalink raw reply

* Re: [PATCH net-next 2/3] samples/bpf: add static to function with no prototype
From: Daniel Borkmann @ 2017-04-24 14:40 UTC (permalink / raw)
  To: Alexander Alemayhu, netdev; +Cc: ast
In-Reply-To: <20170424133108.31595-3-alexander@alemayhu.com>

On 04/24/2017 03:31 PM, Alexander Alemayhu wrote:
> Fixes the following warning
>
> samples/bpf/cookie_uid_helper_example.c: At top level:
> samples/bpf/cookie_uid_helper_example.c:276:6: warning: no previous prototype for ‘finish’ [-Wmissing-prototypes]
>   void finish(int ret)
>        ^~~~~~
>    HOSTLD  samples/bpf/per_socket_stats_example
>
> Signed-off-by: Alexander Alemayhu <alexander@alemayhu.com>

Acked-by: Daniel Borkmann <daniel@iogearbox.net>

^ permalink raw reply

* Re: [PATCH net-next 1/3] samples/bpf: add -Wno-unknown-warning-option to clang
From: Daniel Borkmann @ 2017-04-24 14:37 UTC (permalink / raw)
  To: Alexander Alemayhu, netdev; +Cc: ast
In-Reply-To: <20170424133108.31595-2-alexander@alemayhu.com>

On 04/24/2017 03:31 PM, Alexander Alemayhu wrote:
> I was initially going to remove '-Wno-address-of-packed-member' because I
> thought it was not supposed to be there but Daniel suggested using
> '-Wno-unknown-warning-option'.
>
> This silences several warnings similiar to the one below
>
> warning: unknown warning option '-Wno-address-of-packed-member' [-Wunknown-warning-option]
> 1 warning generated.

Yeah, that feature seems fairly new (Feb 2017 accepted if I
see this correctly): https://reviews.llvm.org/D20561

Given the -Wno-address-of-packed-member was there to silence
warnings in the first place, we should also silence warnings
when -Wno-address-of-packed-member is unknown to clang/llvm.

[...]
>
> Signed-off-by: Alexander Alemayhu <alexander@alemayhu.com>

Acked-by: Daniel Borkmann <daniel@iogearbox.net>

^ permalink raw reply

* Re: [PATCH 5/7] IB/hfi1: use pcie_flr instead of duplicating it
From: Christoph Hellwig @ 2017-04-24 14:35 UTC (permalink / raw)
  To: Byczkowski, Jakub
  Cc: Christoph Hellwig, Bjorn Helgaas, Cabiddu, Giovanni,
	Benedetto, Salvatore, Marciniszyn, Mike, Dalessandro, Dennis,
	Derek Chickles, Satanand Burla, Felix Manlunas, Raghu Vatsavayi,
	Kirsher, Jeffrey T, linux-pci@vger.kernel.org, qat-linux,
	linux-crypto@vger.kernel.org, linux-rdma@vger.kernel.org,
	"netdev@vger.kernel.or
In-Reply-To: <A33E225544E47D4DB5713032774B132783960382@irsmsx105.ger.corp.intel.com>

On Mon, Apr 24, 2017 at 02:16:31PM +0000, Byczkowski, Jakub wrote:
> Tested-by: Jakub Byczkowski <jakub.byczkowski@intel.com>

Are you (and Doug) ok with queueing this up in the PCI tree?

^ permalink raw reply

* Blogpost evaluation this [PATCH v4 net-next RFC] net: Generic XDP
From: Jesper Dangaard Brouer @ 2017-04-24 14:24 UTC (permalink / raw)
  To: David Miller, xdp-newbies; +Cc: netdev, brouer, Andy Gospodarek
In-Reply-To: <20170413.120925.2082322246776478766.davem@davemloft.net>


I've done a very detailed evaluation of this patch, and I've created a
blogpost like report here:

 https://prototype-kernel.readthedocs.io/en/latest/blogposts/xdp25_eval_generic_xdp_tx.html

I didn't evaluate the adjust_head part, so I hope Andy is still
planning to validate that part?

--Jesper



On Thu, 13 Apr 2017 12:09:25 -0400 (EDT)
David Miller <davem@davemloft.net> wrote:

> This provides a generic SKB based non-optimized XDP path which is used
> if either the driver lacks a specific XDP implementation, or the user
> requests it via a new IFLA_XDP_FLAGS value named XDP_FLAGS_SKB_MODE.
> 
> It is arguable that perhaps I should have required something like
> this as part of the initial XDP feature merge.
> 
> I believe this is critical for two reasons:
> 
> 1) Accessibility.  More people can play with XDP with less
>    dependencies.  Yes I know we have XDP support in virtio_net, but
>    that just creates another depedency for learning how to use this
>    facility.
> 
>    I wrote this to make life easier for the XDP newbies.
> 
> 2) As a model for what the expected semantics are.  If there is a pure
>    generic core implementation, it serves as a semantic example for
>    driver folks adding XDP support.
> 
> This is just a rough draft and is untested.



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [PATCH net-next v5 1/2] net sched actions: dump more than TCA_ACT_MAX_PRIO actions per batch
From: Pablo Neira Ayuso @ 2017-04-24 14:20 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Simon Horman, Jiri Pirko, davem, xiyou.wangcong, eric.dumazet,
	netdev, Tom Herbert
In-Reply-To: <2ca4266f-a164-d2ad-37fa-45f7ae354eb8@mojatatu.com>

On Mon, Apr 24, 2017 at 08:49:00AM -0400, Jamal Hadi Salim wrote:
> On 17-04-24 05:14 AM, Simon Horman wrote:
> [..]
> 
> >Jamal, I am confused about why are you so concerned about the space
> >consumed by this attribute, it's per-message, right? Is it the bigger
> >picture you are worried about - a similar per-entry flag at some point in
> >the future?
> 
> 
> To me the two worries are one and the same.
> 
> Jiri strongly believes (from a big picture view) we must use
> TLVs for extensibility.
> While I agree with him in general i have strong reservations
> in this case because i can get both extensibility and
> build for performance with using a flag bitmask as the
> content of the TLV.
> 
> A TLV consumes 64 bits minimum. It doesnt matter if we decide
> to use a u8 or a u16, we are still sending 64 bits on that
> TLV with the rest being PADding. Not to be melodramatic, but
> the worst case scenario of putting everything in a TLV for 32
> flags is using about 30x more space than using a bitmask.
> 
> Yes, space is important and if i can express upto 32 flags
> with one TLV rather than 32 TLVs i choose one TLV.
> I am always looking for ways to filter out crap i dont need
> when i do stats collection. I have numerous wounds from fdb
> entries which decided to use a TLV per flag.
> 
> The design approach we have used in netlink is: flags start
> as a bitmap (whether they are on main headers or TLVs); they may be
> complemented with a bitmask/selector (refer to IFLINK messages).
> 
> Lets look at this specific patch I have sending. I have already
> changed it 3 times and involved a churn of 3 different flags.
> If you asked me in the beggining i wouldve scratched my head
> thinking for a near term use for bit #3, #4 etc,
> 
> I am fine with the counter-Postel view of having the kernel
> validate that appropriate bits are set as long as we dont make
> user space to now start learning how to play acrobatics.

jamal, what performance concern you have in building this error
message? TLVs is the most flexible way. And this is error path, so we
should build this message rarely, only if the user sends us something
incorrect, why bother...

^ permalink raw reply

* RE: [PATCH 5/7] IB/hfi1: use pcie_flr instead of duplicating it
From: Byczkowski, Jakub @ 2017-04-24 14:16 UTC (permalink / raw)
  To: Christoph Hellwig, Bjorn Helgaas, Cabiddu, Giovanni,
	Benedetto, Salvatore, Marciniszyn, Mike, Dalessandro, Dennis,
	Derek Chickles, Satanand Burla, Felix Manlunas, Raghu Vatsavayi,
	Kirsher, Jeffrey T
  Cc: linux-pci@vger.kernel.org, qat-linux,
	linux-crypto@vger.kernel.org, linux-rdma@vger.kernel.org,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <20170414191131.14286-6-hch@lst.de>

Tested-by: Jakub Byczkowski <jakub.byczkowski@intel.com>

-----Original Message-----
From: linux-rdma-owner@vger.kernel.org [mailto:linux-rdma-owner@vger.kernel.org] On Behalf Of Christoph Hellwig
Sent: Friday, April 14, 2017 9:11 PM
To: Bjorn Helgaas <bhelgaas@google.com>; Cabiddu, Giovanni <giovanni.cabiddu@intel.com>; Benedetto, Salvatore <salvatore.benedetto@intel.com>; Marciniszyn, Mike <mike.marciniszyn@intel.com>; Dalessandro, Dennis <dennis.dalessandro@intel.com>; Derek Chickles <derek.chickles@caviumnetworks.com>; Satanand Burla <satananda.burla@caviumnetworks.com>; Felix Manlunas <felix.manlunas@caviumnetworks.com>; Raghu Vatsavayi <raghu.vatsavayi@caviumnetworks.com>; Kirsher, Jeffrey T <jeffrey.t.kirsher@intel.com>
Cc: linux-pci@vger.kernel.org; qat-linux <qat-linux@intel.com>; linux-crypto@vger.kernel.org; linux-rdma@vger.kernel.org; netdev@vger.kernel.org; linux-kernel@vger.kernel.org
Subject: [PATCH 5/7] IB/hfi1: use pcie_flr instead of duplicating it

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/infiniband/hw/hfi1/chip.c |  4 ++--  drivers/infiniband/hw/hfi1/hfi.h  |  1 -  drivers/infiniband/hw/hfi1/pcie.c | 30 ------------------------------
 3 files changed, 2 insertions(+), 33 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/chip.c b/drivers/infiniband/hw/hfi1/chip.c
index 121a4c920f1b..d037f72e4d96 100644
--- a/drivers/infiniband/hw/hfi1/chip.c
+++ b/drivers/infiniband/hw/hfi1/chip.c
@@ -13610,14 +13610,14 @@ static void init_chip(struct hfi1_devdata *dd)
 		dd_dev_info(dd, "Resetting CSRs with FLR\n");
 
 		/* do the FLR, the DC reset will remain */
-		hfi1_pcie_flr(dd);
+		pcie_flr(dd->pcidev);
 
 		/* restore command and BARs */
 		restore_pci_variables(dd);
 
 		if (is_ax(dd)) {
 			dd_dev_info(dd, "Resetting CSRs with FLR\n");
-			hfi1_pcie_flr(dd);
+			pcie_flr(dd->pcidev);
 			restore_pci_variables(dd);
 		}
 	} else {
diff --git a/drivers/infiniband/hw/hfi1/hfi.h b/drivers/infiniband/hw/hfi1/hfi.h
index 0808e3c3ba39..40d7559fa723 100644
--- a/drivers/infiniband/hw/hfi1/hfi.h
+++ b/drivers/infiniband/hw/hfi1/hfi.h
@@ -1764,7 +1764,6 @@ int hfi1_pcie_init(struct pci_dev *, const struct pci_device_id *);  void hfi1_pcie_cleanup(struct pci_dev *);  int hfi1_pcie_ddinit(struct hfi1_devdata *, struct pci_dev *);  void hfi1_pcie_ddcleanup(struct hfi1_devdata *); -void hfi1_pcie_flr(struct hfi1_devdata *);  int pcie_speeds(struct hfi1_devdata *);  void request_msix(struct hfi1_devdata *, u32 *, struct hfi1_msix_entry *);  void hfi1_enable_intx(struct pci_dev *); diff --git a/drivers/infiniband/hw/hfi1/pcie.c b/drivers/infiniband/hw/hfi1/pcie.c
index 0829fce06172..c81556e84831 100644
--- a/drivers/infiniband/hw/hfi1/pcie.c
+++ b/drivers/infiniband/hw/hfi1/pcie.c
@@ -240,36 +240,6 @@ void hfi1_pcie_ddcleanup(struct hfi1_devdata *dd)
 		iounmap(dd->piobase);
 }
 
-/*
- * Do a Function Level Reset (FLR) on the device.
- * Based on static function drivers/pci/pci.c:pcie_flr().
- */
-void hfi1_pcie_flr(struct hfi1_devdata *dd) -{
-	int i;
-	u16 status;
-
-	/* no need to check for the capability - we know the device has it */
-
-	/* wait for Transaction Pending bit to clear, at most a few ms */
-	for (i = 0; i < 4; i++) {
-		if (i)
-			msleep((1 << (i - 1)) * 100);
-
-		pcie_capability_read_word(dd->pcidev, PCI_EXP_DEVSTA, &status);
-		if (!(status & PCI_EXP_DEVSTA_TRPND))
-			goto clear;
-	}
-
-	dd_dev_err(dd, "Transaction Pending bit is not clearing, proceeding with reset anyway\n");
-
-clear:
-	pcie_capability_set_word(dd->pcidev, PCI_EXP_DEVCTL,
-				 PCI_EXP_DEVCTL_BCR_FLR);
-	/* PCIe spec requires the function to be back within 100ms */
-	msleep(100);
-}
-
 static void msix_setup(struct hfi1_devdata *dd, int pos, u32 *msixcnt,
 		       struct hfi1_msix_entry *hfi1_msix_entry)  {
--
2.11.0

^ permalink raw reply related

* [PATCH] net: bridge: suppress broadcast when multicast flood is disabled
From: Mike Manning @ 2017-04-24 14:09 UTC (permalink / raw)
  To: netdev; +Cc: Nikolay Aleksandrov

Flood suppression for packets that are not unicast needs to be handled
consistently by also not flooding broadcast packets. As broadcast is a
special case of multicast, the same kernel parameter should be used to
suppress flooding for both of these packet types.

Fixes: b6cb5ac8331b ("net: bridge: add per-port multicast flood flag")
Cc: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Mike Manning <mmanning@brocade.com>
---
 net/bridge/br_forward.c | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c
index 902af6b..a61c7ad 100644
--- a/net/bridge/br_forward.c
+++ b/net/bridge/br_forward.c
@@ -183,13 +183,16 @@ void br_flood(struct net_bridge *br, struct sk_buff *skb,
 	struct net_bridge_port *p;
 
 	list_for_each_entry_rcu(p, &br->port_list, list) {
-		/* Do not flood unicast traffic to ports that turn it off */
-		if (pkt_type == BR_PKT_UNICAST && !(p->flags & BR_FLOOD))
-			continue;
-		/* Do not flood if mc off, except for traffic we originate */
-		if (pkt_type == BR_PKT_MULTICAST &&
-		    !(p->flags & BR_MCAST_FLOOD) && skb->dev != br->dev)
-			continue;
+		/* Do not flood unicast traffic to ports that turn it off, nor
+		 * other traffic if mc flood off except for traffic we originate
+		 */
+		if (pkt_type == BR_PKT_UNICAST) {
+			if (!(p->flags & BR_FLOOD))
+				continue;
+		} else {
+			if (!(p->flags & BR_MCAST_FLOOD) && skb->dev != br->dev)
+				continue;
+		}
 
 		/* Do not flood to ports that enable proxy ARP */
 		if (p->flags & BR_PROXYARP)
-- 
2.1.4

^ permalink raw reply related

* Re: [PATCH v2 net] net: ipv6: regenerate host route if moved to gc list
From: Andrey Konovalov @ 2017-04-24 14:03 UTC (permalink / raw)
  To: David Ahern; +Cc: netdev, Dmitry Vyukov, mmanning
In-Reply-To: <1492879237-31566-1-git-send-email-dsa@cumulusnetworks.com>

On Sat, Apr 22, 2017 at 6:40 PM, David Ahern <dsa@cumulusnetworks.com> wrote:
> Taking down the loopback device wreaks havoc on IPv6 routes. By
> extension, taking a VRF device wreaks havoc on its table.
>
> Dmitry and Andrey both reported heap out-of-bounds reports in the IPv6
> FIB code while running syzkaller fuzzer. The root cause is a dead dst
> that is on the garbage list gets reinserted into the IPv6 FIB. While on
> the gc (or perhaps when it gets added to the gc list) the dst->next is
> set to an IPv4 dst. A subsequent walk of the ipv6 tables causes the
> out-of-bounds access.
>
> Andrey's reproducer was the key to getting to the bottom of this.
>
> With IPv6, host routes for an address have the dst->dev set to the
> loopback device. When the 'lo' device is taken down, rt6_ifdown initiates
> a walk of the fib evicting routes with the 'lo' device which means all
> host routes are removed. That process moves the dst which is attached to
> an inet6_ifaddr to the gc list and marks it as dead.
>
> The recent change to keep global IPv6 addresses added a new function
> fixup_permanent_addr that is called on admin up. That function restarts
> dad for an inet6_ifaddr and when it completes the host route attached
> to it is inserted into the fib. Since the route was marked dead and
> moved to the gc list, we get the reported out-of-bounds accesses. If
> the device with the address is taken down or the address is removed, the
> WARN_ON in fib6_del is triggered.
>
> All of those faults are fixed by regenerating the host route of the
> existing one has been moved to the gc list, something that can be
> determined by checking if the rt6i_ref counter is 0.
>
> Fixes: f1705ec197e7 ("net: ipv6: Make address flushing on ifdown optional")
> Reported-by: Dmitry Vyukov <dvyukov@google.com>
> Reported-by: Andrey Konovalov <andreyknvl@google.com>
> Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
> ---
> v2
> - change ifp->rt under spinlock vs cmpxchg
> - add comment about rt6i_ref == 0
>
> Dmitry / Andrey: can you guys add this patch to your tree and run
> syzkaller tests? I'd like to confirm that all of the fib traces
> are fixed. Thanks.

Tried fuzzing with your patch, I don't see any fib6 related reports
any more, so it should be good. I'll let you know if I see some
possibly related crashes.

Thanks!

>
>  net/ipv6/addrconf.c | 14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
> index 08f9e8ea7a81..97e86158bbcb 100644
> --- a/net/ipv6/addrconf.c
> +++ b/net/ipv6/addrconf.c
> @@ -3303,14 +3303,24 @@ static void addrconf_gre_config(struct net_device *dev)
>  static int fixup_permanent_addr(struct inet6_dev *idev,
>                                 struct inet6_ifaddr *ifp)
>  {
> -       if (!ifp->rt) {
> -               struct rt6_info *rt;
> +       /* rt6i_ref == 0 means the host route was removed from the
> +        * FIB, for example, if 'lo' device is taken down. In that
> +        * case regenerate the host route.
> +        */
> +       if (!ifp->rt || !atomic_read(&ifp->rt->rt6i_ref)) {
> +               struct rt6_info *rt, *prev;
>
>                 rt = addrconf_dst_alloc(idev, &ifp->addr, false);
>                 if (unlikely(IS_ERR(rt)))
>                         return PTR_ERR(rt);
>
> +               spin_lock(&ifp->lock);
> +               prev = ifp->rt;
>                 ifp->rt = rt;
> +               spin_unlock(&ifp->lock);
> +
> +               if (prev)
> +                       ip6_rt_put(prev);
>         }
>
>         if (!(ifp->flags & IFA_F_NOPREFIXROUTE)) {
> --
> 2.1.4
>

^ permalink raw reply

* [PATCH net-next 3/3] samples/bpf: check before defining offsetof
From: Alexander Alemayhu @ 2017-04-24 13:31 UTC (permalink / raw)
  To: netdev; +Cc: Alexander Alemayhu, daniel, ast
In-Reply-To: <20170424133108.31595-1-alexander@alemayhu.com>

Fixes the following warning

samples/bpf/test_lru_dist.c:28:0: warning: "offsetof" redefined
 #define offsetof(TYPE, MEMBER) ((size_t)&((TYPE *)0)->MEMBER)

In file included from ./tools/lib/bpf/bpf.h:25:0,
                 from samples/bpf/libbpf.h:5,
                 from samples/bpf/test_lru_dist.c:24:
/usr/lib/gcc/x86_64-redhat-linux/6.3.1/include/stddef.h:417:0: note: this is the location of the previous definition
 #define offsetof(TYPE, MEMBER) __builtin_offsetof (TYPE, MEMBER)

Signed-off-by: Alexander Alemayhu <alexander@alemayhu.com>
---
 samples/bpf/test_lru_dist.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/samples/bpf/test_lru_dist.c b/samples/bpf/test_lru_dist.c
index d96dc88d3b04..73c357142268 100644
--- a/samples/bpf/test_lru_dist.c
+++ b/samples/bpf/test_lru_dist.c
@@ -25,7 +25,9 @@
 #include "bpf_util.h"
 
 #define min(a, b) ((a) < (b) ? (a) : (b))
-#define offsetof(TYPE, MEMBER)	((size_t)&((TYPE *)0)->MEMBER)
+#ifndef offsetof
+# define offsetof(TYPE, MEMBER)	((size_t)&((TYPE *)0)->MEMBER)
+#endif
 #define container_of(ptr, type, member) ({			\
 	const typeof( ((type *)0)->member ) *__mptr = (ptr);	\
 	(type *)( (char *)__mptr - offsetof(type,member) );})
-- 
2.9.3

^ permalink raw reply related

* [PATCH net-next 2/3] samples/bpf: add static to function with no prototype
From: Alexander Alemayhu @ 2017-04-24 13:31 UTC (permalink / raw)
  To: netdev; +Cc: Alexander Alemayhu, daniel, ast
In-Reply-To: <20170424133108.31595-1-alexander@alemayhu.com>

Fixes the following warning

samples/bpf/cookie_uid_helper_example.c: At top level:
samples/bpf/cookie_uid_helper_example.c:276:6: warning: no previous prototype for ‘finish’ [-Wmissing-prototypes]
 void finish(int ret)
      ^~~~~~
  HOSTLD  samples/bpf/per_socket_stats_example

Signed-off-by: Alexander Alemayhu <alexander@alemayhu.com>
---
 samples/bpf/cookie_uid_helper_example.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/samples/bpf/cookie_uid_helper_example.c b/samples/bpf/cookie_uid_helper_example.c
index ad5afedf2e70..9ce55840d61d 100644
--- a/samples/bpf/cookie_uid_helper_example.c
+++ b/samples/bpf/cookie_uid_helper_example.c
@@ -273,7 +273,7 @@ static int usage(void)
 	return 1;
 }
 
-void finish(int ret)
+static void finish(int ret)
 {
 	test_finish = true;
 }
-- 
2.9.3

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox