Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH 0/3] Introduce LSM-hook for socketpair(2)
From: David Herrmann @ 2018-04-23 13:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: James Morris, Paul Moore, teg, Stephen Smalley, selinux,
	linux-security-module, Eric Paris, serge, davem, netdev,
	David Herrmann

Hi

This series adds a new LSM hook for the socketpair(2) syscall. The idea
is to allow SO_PEERSEC to be called on AF_UNIX sockets created via
socketpair(2), and return the same information as if you emulated
socketpair(2) via a temporary listener socket. Right now SO_PEERSEC
will return the unlabeled credentials for a socketpair, rather than the
actual credentials of the creating process.

A simple call to:

    socketpair(AF_UNIX, SOCK_STREAM, 0, out);

can be emulated via a temporary listener socket bound to a unique,
random name in the abstract namespace. By connecting to this listener
socket, accept(2) will return the second part of the pair. If
SO_PEERSEC is queried on these, the correct credentials of the creating
process are returned. A simple comparison between the behavior of
SO_PEERSEC on socketpair(2) and an emulated socketpair is included in
the dbus-broker test-suite [1].

This patch series tries to close this gap and makes both behave the
same. A new LSM-hook is added which allows LSMs to cache the correct
peer information on newly created socket-pairs.

Apart from fixing this behavioral difference, the dbus-broker project
actually needs to query the credentials of socketpairs, and currently
must resort to racy procfs(2) queries to get the LSM credentials of its
controller socket. Several parts of the dbus-broker project allow you
to pass in a socket during execve(2), which will be used by the child
process to accept control-commands from its parent. One natural way to
create this communication channel is to use socketpair(2). However,
right now SO_PEERSEC does not return any useful information, hence, the
child-process would need other means to retrieve this information. By
avoiding socketpair(2) and using the hacky-emulated version, this is not
an issue.

There was a previous discussion on this matter [2] roughly a year ago.
Back then there was the suspicion that proper SO_PEERSEC would confuse
applications. However, we could not find any evidence backing this
suspicion. Furthermore, we now actually see the contrary. Lack of
SO_PEERSEC makes it a hassle to use socketpairs with LSM credentials.
Hence, we propose to implement full SO_PEERSEC for socketpairs.

This series only adds SELinux backends, since that is what we need for
RHEL. I will gladly extend the other LSMs if needed.

Thanks
David

[1] https://github.com/bus1/dbus-broker/blob/master/src/util/test-peersec.c
[2] https://www.spinics.net/lists/selinux/msg22674.html

David Herrmann (3):
  security: add hook for socketpair(AF_UNIX, ...)
  net/unix: hook unix_socketpair() into LSM
  selinux: provide unix_stream_socketpair callback

 include/linux/lsm_hooks.h |  8 ++++++++
 include/linux/security.h  |  7 +++++++
 net/unix/af_unix.c        |  5 +++++
 security/security.c       |  6 ++++++
 security/selinux/hooks.c  | 14 ++++++++++++++
 5 files changed, 40 insertions(+)

-- 
2.17.0

^ permalink raw reply

* Re: IP_ADD_MEMBERSHIP with imr_ifindex!=0 for multiple processes with different interfaces
From: Eric Dumazet @ 2018-04-23 13:29 UTC (permalink / raw)
  To: Klebsch, Mario, netdev@vger.kernel.org
In-Reply-To: <9208AF0D0E6F444F80E90796B4E51CA07C163C10@EXCHANGE01.muppets.local>



On 04/23/2018 06:22 AM, Klebsch, Mario wrote:
> Hi, 
> 
> I have a problem with multicast reception in the linux kernel and I hope, this is the right place to ask for help or to report a bug.
> 
> I need to receive multicasts on a single interface. I have written a small program, which executes IP_ADD_MEMBERSHIP with imr.imr_ifindex set to the interface index. The program works well, as long as only a single instance of this program is running. If I start a second instance on a different network interface, both programs receive multicast frames from both interfaces. 
> 
> When called without argument, the test program list the network interfaces. When called with an interface name as argument, if starts receiving multicasts on that interface.
> 
> I am running vanilla Linux kernel 4.12.0.
> 
> # uname -a
> Linux c627 4.12.0 #1 SMP Mon Apr 23 14:08:24 CEST 2018 i686 Intel(R) Core(TM) i5-7400 CPU @ 3.00GHz GenuineIntel GNU/Linux
> # 
> 
> P.S. The program runs fine on MacOSX.
>

It looks like your program needs to use SO_BINDTODEVICE if it really wants this device filtering ?

^ permalink raw reply

* KMSAN: uninit-value in ip_vs_lblcr_check_expire
From: syzbot @ 2018-04-23 13:28 UTC (permalink / raw)
  To: coreteam, davem, fw, horms, ja, kadlec, linux-kernel, lvs-devel,
	netdev, netfilter-devel, pablo, syzkaller-bugs, wensong

Hello,

syzbot hit the following crash on  
https://github.com/google/kmsan.git/master commit
d2d741e5d1898dfde1a75ea3d29a9a3e2edf0617 (Sun Apr 22 15:05:22 2018 +0000)
kmsan: add initialization for shmem pages
syzbot dashboard link:  
https://syzkaller.appspot.com/bug?extid=3dfdea57819073a04f21

So far this crash happened 2 times on  
https://github.com/google/kmsan.git/master.
Unfortunately, I don't have any reproducer for this crash yet.
Raw console output:  
https://syzkaller.appspot.com/x/log.txt?id=6285034612850688
Kernel config: https://syzkaller.appspot.com/x/.config?id=328654897048964367
compiler: clang version 7.0.0 (trunk 329391)

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+3dfdea57819073a04f21@syzkaller.appspotmail.com
It will help syzbot understand when the bug is fixed. See footer for  
details.
If you forward the report, please keep this part and the footer.

RDX: 0000000000000000 RSI: 0000000020000080 RDI: 0000000000000013
RBP: 000000000072bea0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000014
R13: 00000000000004f3 R14: 00000000006fa768 R15: 0000000000000000
==================================================================
BUG: KMSAN: uninit-value in ip_vs_lblcr_check_expire+0x1551/0x1600  
net/netfilter/ipvs/ip_vs_lblcr.c:479
CPU: 0 PID: 13883 Comm: syz-executor4 Not tainted 4.16.0+ #86
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Call Trace:
  <IRQ>
  __dump_stack lib/dump_stack.c:17 [inline]
  dump_stack+0x185/0x1d0 lib/dump_stack.c:53
  kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
  __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:683
  ip_vs_lblcr_check_expire+0x1551/0x1600 net/netfilter/ipvs/ip_vs_lblcr.c:479
  call_timer_fn+0x26a/0x5a0 kernel/time/timer.c:1326
  expire_timers kernel/time/timer.c:1363 [inline]
  __run_timers+0xda7/0x11c0 kernel/time/timer.c:1666
  run_timer_softirq+0x43/0x70 kernel/time/timer.c:1692
  __do_softirq+0x56d/0x93d kernel/softirq.c:285
  invoke_softirq kernel/softirq.c:365 [inline]
  irq_exit+0x202/0x240 kernel/softirq.c:405
  exiting_irq+0xe/0x10 arch/x86/include/asm/apic.h:541
  smp_apic_timer_interrupt+0x64/0x90 arch/x86/kernel/apic/apic.c:1055
  apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:857
  </IRQ>
RIP: 0010:native_restore_fl arch/x86/include/asm/irqflags.h:37 [inline]
RIP: 0010:arch_local_irq_restore arch/x86/include/asm/irqflags.h:78 [inline]
RIP: 0010:dump_stack+0x1af/0x1d0 lib/dump_stack.c:58
RSP: 0018:ffff880156a2ef00 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff12
RAX: ffff8801fddc2590 RBX: ffff88014f62c418 RCX: ffff880000000000
RDX: ffff8801fd9c2590 RSI: aaaaaaaaaaaab000 RDI: ffffea0000000000
RBP: ffff880156a2ef48 R08: 0000000001080000 R09: 0000000000000002
R10: 0000000000000000 R11: 0000000000000000 R12: 00000000cf000109
R13: 0000000000000286 R14: 0000000000000000 R15: 0000000000000000
  fail_dump lib/fault-inject.c:51 [inline]
  should_fail+0x87b/0xab0 lib/fault-inject.c:149
  should_failslab+0x279/0x2a0 mm/failslab.c:32
  slab_pre_alloc_hook mm/slab.h:422 [inline]
  slab_alloc_node mm/slub.c:2663 [inline]
  slab_alloc mm/slub.c:2745 [inline]
  kmem_cache_alloc+0x136/0xb90 mm/slub.c:2750
  dst_alloc+0x295/0x860 net/core/dst.c:104
  __ip6_dst_alloc net/ipv6/route.c:361 [inline]
  ip6_rt_cache_alloc+0x445/0xd00 net/ipv6/route.c:1061
  ip6_pol_route+0x3f19/0x5da0 net/ipv6/route.c:1751
  ip6_pol_route_output+0xe6/0x110 net/ipv6/route.c:1892
  fib6_rule_lookup+0x494/0x720 net/ipv6/fib6_rules.c:87
  ip6_route_output_flags+0x4fa/0x590 net/ipv6/route.c:1920
  ip6_dst_lookup_tail+0x2fe/0x1a60 net/ipv6/ip6_output.c:992
  ip6_dst_lookup_flow+0xfc/0x270 net/ipv6/ip6_output.c:1093
  rawv6_sendmsg+0x1b05/0x4fb0 net/ipv6/raw.c:908
  inet_sendmsg+0x48d/0x740 net/ipv4/af_inet.c:764
  sock_sendmsg_nosec net/socket.c:630 [inline]
  sock_sendmsg net/socket.c:640 [inline]
  ___sys_sendmsg+0xec0/0x1310 net/socket.c:2046
  __sys_sendmsg net/socket.c:2080 [inline]
  SYSC_sendmsg+0x2a3/0x3d0 net/socket.c:2091
  SyS_sendmsg+0x54/0x80 net/socket.c:2087
  do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x455389
RSP: 002b:00007fa5b1000c68 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00007fa5b10016d4 RCX: 0000000000455389
RDX: 0000000000000000 RSI: 0000000020000080 RDI: 0000000000000013
RBP: 000000000072bea0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000014
R13: 00000000000004f3 R14: 00000000006fa768 R15: 0000000000000000

Uninit was created at:
  kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
  kmsan_alloc_meta_for_pages+0x161/0x3a0 mm/kmsan/kmsan.c:814
  kmsan_alloc_page+0x82/0xe0 mm/kmsan/kmsan.c:868
  __alloc_pages_nodemask+0xf5b/0x5dc0 mm/page_alloc.c:4283
  alloc_pages_current+0x6b5/0x970 mm/mempolicy.c:2055
  alloc_pages include/linux/gfp.h:494 [inline]
  kmalloc_order mm/slab_common.c:1164 [inline]
  kmalloc_order_trace+0xb9/0x390 mm/slab_common.c:1175
  kmalloc_large include/linux/slab.h:446 [inline]
  __kmalloc+0x332/0x350 mm/slub.c:3778
  kmalloc include/linux/slab.h:517 [inline]
  ip_vs_lblcr_init_svc+0x57/0x310 net/netfilter/ipvs/ip_vs_lblcr.c:518
  ip_vs_bind_scheduler+0xa4/0x1e0 net/netfilter/ipvs/ip_vs_sched.c:51
  ip_vs_add_service+0xa91/0x1d70 net/netfilter/ipvs/ip_vs_ctl.c:1265
  do_ip_vs_set_ctl+0x25c8/0x2790 net/netfilter/ipvs/ip_vs_ctl.c:2457
  nf_sockopt net/netfilter/nf_sockopt.c:106 [inline]
  nf_setsockopt+0x476/0x4d0 net/netfilter/nf_sockopt.c:115
  ip_setsockopt+0x24b/0x2b0 net/ipv4/ip_sockglue.c:1261
  dccp_setsockopt+0x1c3/0x1f0 net/dccp/proto.c:576
  sock_common_setsockopt+0x136/0x170 net/core/sock.c:2975
  SYSC_setsockopt+0x4b8/0x570 net/socket.c:1849
  SyS_setsockopt+0x76/0xa0 net/socket.c:1828
  do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
==================================================================


---
This bug is generated by a dumb bot. It may contain errors.
See https://goo.gl/tpsmEJ for details.
Direct all questions to syzkaller@googlegroups.com.

syzbot will keep track of this bug report.
If you forgot to add the Reported-by tag, once the fix for this bug is  
merged
into any tree, please reply to this email with:
#syz fix: exact-commit-title
To mark this as a duplicate of another syzbot report, please reply with:
#syz dup: exact-subject-of-another-report
If it's a one-off invalid bug report, please reply with:
#syz invalid
Note: if the crash happens again, it will cause creation of a new bug  
report.
Note: all commands must start from beginning of the line in the email body.

^ permalink raw reply

* KMSAN: uninit-value in ip_vs_lblc_check_expire
From: syzbot @ 2018-04-23 13:28 UTC (permalink / raw)
  To: coreteam, davem, fw, horms, ja, kadlec, linux-kernel, lvs-devel,
	netdev, netfilter-devel, pablo, syzkaller-bugs, wensong

Hello,

syzbot hit the following crash on  
https://github.com/google/kmsan.git/master commit
d2d741e5d1898dfde1a75ea3d29a9a3e2edf0617 (Sun Apr 22 15:05:22 2018 +0000)
kmsan: add initialization for shmem pages
syzbot dashboard link:  
https://syzkaller.appspot.com/bug?extid=3e9695f147fb529aa9bc

So far this crash happened 3 times on  
https://github.com/google/kmsan.git/master.
Unfortunately, I don't have any reproducer for this crash yet.
Raw console output:  
https://syzkaller.appspot.com/x/log.txt?id=5822255644803072
Kernel config: https://syzkaller.appspot.com/x/.config?id=328654897048964367
compiler: clang version 7.0.0 (trunk 329391)

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+3e9695f147fb529aa9bc@syzkaller.appspotmail.com
It will help syzbot understand when the bug is fixed. See footer for  
details.
If you forward the report, please keep this part and the footer.

kernel msg: ebtables bug: please report to author: bad policy
==================================================================
BUG: KMSAN: uninit-value in ip_vs_lblc_check_expire+0xe62/0xf10  
net/netfilter/ipvs/ip_vs_lblc.c:315
CPU: 0 PID: 11383 Comm: syz-executor3 Not tainted 4.16.0+ #86
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Call Trace:
  <IRQ>
  __dump_stack lib/dump_stack.c:17 [inline]
  dump_stack+0x185/0x1d0 lib/dump_stack.c:53
  kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
  __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:683
  ip_vs_lblc_check_expire+0xe62/0xf10 net/netfilter/ipvs/ip_vs_lblc.c:315
  call_timer_fn+0x26a/0x5a0 kernel/time/timer.c:1326
  expire_timers kernel/time/timer.c:1363 [inline]
  __run_timers+0xda7/0x11c0 kernel/time/timer.c:1666
  run_timer_softirq+0x43/0x70 kernel/time/timer.c:1692
  __do_softirq+0x56d/0x93d kernel/softirq.c:285
  invoke_softirq kernel/softirq.c:365 [inline]
  irq_exit+0x202/0x240 kernel/softirq.c:405
  exiting_irq+0xe/0x10 arch/x86/include/asm/apic.h:541
  smp_apic_timer_interrupt+0x64/0x90 arch/x86/kernel/apic/apic.c:1055
  apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:857
  </IRQ>
RIP: 0010:native_restore_fl arch/x86/include/asm/irqflags.h:37 [inline]
RIP: 0010:arch_local_irq_restore arch/x86/include/asm/irqflags.h:78 [inline]
RIP: 0010:vprintk_emit+0xcb2/0xff0 kernel/printk/printk.c:1899
RSP: 0018:ffff8801c2a1f0d8 EFLAGS: 00000296 ORIG_RAX: ffffffffffffff12
RAX: 0000000000000296 RBX: ffff8801574c4418 RCX: 0000000000040000
RDX: ffffc900033a6000 RSI: 00000000000001bf RDI: 00000000000001c0
RBP: ffff8801c2a1f1f8 R08: 000000219bfd8445 R09: ffff8801fd6d615d
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: ffffffff8b300430 R14: 0000000000000000 R15: 0000000000000000
  vprintk_default+0x90/0xa0 kernel/printk/printk.c:1955
  vprintk_func+0x517/0x700 kernel/printk/printk_safe.c:379
  printk+0x1b6/0x1f0 kernel/printk/printk.c:1991
  translate_table+0x474/0x5e10 net/bridge/netfilter/ebtables.c:846
  do_replace_finish+0x1258/0x2ea0 net/bridge/netfilter/ebtables.c:1002
  do_replace+0x707/0x770 net/bridge/netfilter/ebtables.c:1141
  do_ebt_set_ctl+0x2ab/0x3c0 net/bridge/netfilter/ebtables.c:1518
  nf_sockopt net/netfilter/nf_sockopt.c:106 [inline]
  nf_setsockopt+0x476/0x4d0 net/netfilter/nf_sockopt.c:115
  ip_setsockopt+0x24b/0x2b0 net/ipv4/ip_sockglue.c:1261
  udp_setsockopt+0x108/0x1b0 net/ipv4/udp.c:2406
  ipv6_setsockopt+0x30c/0x340 net/ipv6/ipv6_sockglue.c:917
  udpv6_setsockopt+0x110/0x1c0 net/ipv6/udp.c:1422
  sock_common_setsockopt+0x136/0x170 net/core/sock.c:2975
  SYSC_setsockopt+0x4b8/0x570 net/socket.c:1849
  SyS_setsockopt+0x76/0xa0 net/socket.c:1828
  do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x455389
RSP: 002b:00007f470c9e3c68 EFLAGS: 00000246 ORIG_RAX: 0000000000000036
RAX: ffffffffffffffda RBX: 00007f470c9e46d4 RCX: 0000000000455389
RDX: 0000000000000080 RSI: 0000000000000000 RDI: 0000000000000013
RBP: 000000000072bea0 R08: 0000000000000dd0 R09: 0000000000000000
R10: 0000000020000dc0 R11: 0000000000000246 R12: 00000000ffffffff
R13: 000000000000051d R14: 00000000006fab58 R15: 0000000000000000

Uninit was created at:
  kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
  kmsan_alloc_meta_for_pages+0x161/0x3a0 mm/kmsan/kmsan.c:814
  kmsan_alloc_page+0x82/0xe0 mm/kmsan/kmsan.c:868
  __alloc_pages_nodemask+0xf5b/0x5dc0 mm/page_alloc.c:4283
  alloc_pages_current+0x6b5/0x970 mm/mempolicy.c:2055
  alloc_pages include/linux/gfp.h:494 [inline]
  kmalloc_order mm/slab_common.c:1164 [inline]
  kmalloc_order_trace+0xb9/0x390 mm/slab_common.c:1175
  kmalloc_large include/linux/slab.h:446 [inline]
  __kmalloc+0x332/0x350 mm/slub.c:3778
  kmalloc include/linux/slab.h:517 [inline]
  ip_vs_lblc_init_svc+0x57/0x310 net/netfilter/ipvs/ip_vs_lblc.c:355
  ip_vs_bind_scheduler+0xa4/0x1e0 net/netfilter/ipvs/ip_vs_sched.c:51
  ip_vs_add_service+0xa91/0x1d70 net/netfilter/ipvs/ip_vs_ctl.c:1265
  do_ip_vs_set_ctl+0x25c8/0x2790 net/netfilter/ipvs/ip_vs_ctl.c:2457
  nf_sockopt net/netfilter/nf_sockopt.c:106 [inline]
  nf_setsockopt+0x476/0x4d0 net/netfilter/nf_sockopt.c:115
  ip_setsockopt+0x24b/0x2b0 net/ipv4/ip_sockglue.c:1261
  raw_setsockopt+0x2e5/0x350 net/ipv4/raw.c:870
  sock_common_setsockopt+0x136/0x170 net/core/sock.c:2975
  SYSC_setsockopt+0x4b8/0x570 net/socket.c:1849
  SyS_setsockopt+0x76/0xa0 net/socket.c:1828
  do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
==================================================================


---
This bug is generated by a dumb bot. It may contain errors.
See https://goo.gl/tpsmEJ for details.
Direct all questions to syzkaller@googlegroups.com.

syzbot will keep track of this bug report.
If you forgot to add the Reported-by tag, once the fix for this bug is  
merged
into any tree, please reply to this email with:
#syz fix: exact-commit-title
To mark this as a duplicate of another syzbot report, please reply with:
#syz dup: exact-subject-of-another-report
If it's a one-off invalid bug report, please reply with:
#syz invalid
Note: if the crash happens again, it will cause creation of a new bug  
report.
Note: all commands must start from beginning of the line in the email body.

^ permalink raw reply

* IP_ADD_MEMBERSHIP with imr_ifindex!=0 for multiple processes with different interfaces
From: Klebsch, Mario @ 2018-04-23 13:22 UTC (permalink / raw)
  To: netdev@vger.kernel.org

Hi, 

I have a problem with multicast reception in the linux kernel and I hope, this is the right place to ask for help or to report a bug.

I need to receive multicasts on a single interface. I have written a small program, which executes IP_ADD_MEMBERSHIP with imr.imr_ifindex set to the interface index. The program works well, as long as only a single instance of this program is running. If I start a second instance on a different network interface, both programs receive multicast frames from both interfaces. 

When called without argument, the test program list the network interfaces. When called with an interface name as argument, if starts receiving multicasts on that interface.

I am running vanilla Linux kernel 4.12.0.

# uname -a
Linux c627 4.12.0 #1 SMP Mon Apr 23 14:08:24 CEST 2018 i686 Intel(R) Core(TM) i5-7400 CPU @ 3.00GHz GenuineIntel GNU/Linux
# 

P.S. The program runs fine on MacOSX.

73, Mario

----8<--------8<--------8<--------8<--------8<--------8<--------8<--------8<--------8<--------8<----
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <ifaddrs.h>
#include <net/if.h>

#define MCAST_PORT  6154
#define MCAST_ADDR  "239.255.1.1"

void ListInterfaces(struct ifaddrs *Interfaces)
{
                for (struct ifaddrs *a=Interfaces; a; a=a->ifa_next)
                {
                               if (!(a->ifa_flags & IFF_UP))
                                               continue;
                               if (!a->ifa_addr || a->ifa_addr->sa_family != AF_INET)
                                               continue;
                               struct sockaddr_in *Addr = (struct sockaddr_in *)a->ifa_addr;
                               printf("%s: %s\n", a->ifa_name, inet_ntoa(Addr->sin_addr));
                }
}

int main(int argc, char *argv[])
{
                struct ifaddrs *Interfaces;
                if (getifaddrs(&Interfaces) < 0)
                {
                               perror("getifaddrs");
                               return -1;
                }

                struct sockaddr_in *MyIfAddr=NULL;
                int                MyIfIndex=0;
                if (argc > 1 && (MyIfIndex = if_nametoindex(argv[1])) )
                               for (struct ifaddrs *a=Interfaces; a; a=a->ifa_next)
                               {
                                               if (!(a->ifa_flags & IFF_UP))
                                                               continue;
                                               if (!a->ifa_addr || a->ifa_addr->sa_family != AF_INET)
                                                               continue;
                                               if (strcmp(argv[1], a->ifa_name)!= 0)
                                                               continue;
                                               MyIfAddr = (struct sockaddr_in *)a->ifa_addr;
                                               break;
                               }

                if (!MyIfAddr || !MyIfIndex)
                {
                               ListInterfaces(Interfaces);
                               return 0;
                }

                int s=socket(PF_INET, SOCK_DGRAM, 0);
                if (s<0)
                {
                               perror("socket");
                               return -1;
                }

                int off=0;
                if (setsockopt(s, IPPROTO_IP, IP_MULTICAST_LOOP, &off, sizeof(off)) < 0)
                               perror("setsockopt(SO_REUSEADDR)");

                int on=1;
                if (setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on)) < 0)
                               perror("setsockopt(SO_REUSEADDR)");

                struct sockaddr_in Addr;
                Addr.sin_family = AF_INET;
                Addr.sin_port = htons(MCAST_PORT);
                inet_aton(MCAST_ADDR, &Addr.sin_addr);
                if (bind(s, (struct sockaddr*)&Addr, sizeof(Addr)) < 0)
                               perror("bind");

                struct ip_mreqn imr;
                inet_aton(MCAST_ADDR, &imr.imr_multiaddr);
                imr.imr_address = MyIfAddr->sin_addr;
                imr.imr_ifindex = MyIfIndex;
                if (setsockopt(s, IPPROTO_IP, IP_ADD_MEMBERSHIP, &imr, sizeof(imr)) < 0)
                               perror("setsockopt(IP_ADD_MEMBERSHIP)");

                for (;;)
                {              
                               struct sockaddr_in AddrBuffer;
                               int AddrLen = sizeof(AddrBuffer);
                               char Buffer[2048];
                               size_t BufferLen = recvfrom(s, &Buffer, sizeof(Buffer), 0, (struct sockaddr*)&AddrBuffer, & AddrLen);
                               if (BufferLen <= 0)
                               {
                                               if (BufferLen < 0)
                                                               perror("recvfrom");
                                               break;
                               }
                               printf("%s: Received %d bytes from %s\n", argv[1], BufferLen, inet_ntoa(AddrBuffer.sin_addr));
                }
}
----8<--------8<--------8<--------8<--------8<--------8<--------8<--------8<--------8<--------8<----

-- 
Mario Klebsch				Actia I+ME GmbH
Mario.klebsch@ime-actia.de		Dresdenstrasse 17/18
Fon: +49 531 38 701 716			38124 Braunschweig
Fax: +49 531 38 701 88			Germany



^ permalink raw reply

* Re: [PATCH net-next 0/5] virtio-net: Add SCTP checksum offload support
From: Vlad Yasevich @ 2018-04-23 13:17 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner, Michael S. Tsirkin
  Cc: Vladislav Yasevich, netdev, linux-sctp, virtualization, jasowang,
	nhorman
In-Reply-To: <20180420172219.GR4716@localhost.localdomain>

On 04/20/2018 01:22 PM, Marcelo Ricardo Leitner wrote:
> On Wed, Apr 18, 2018 at 05:06:46PM +0300, Michael S. Tsirkin wrote:
>> On Tue, Apr 17, 2018 at 04:35:18PM -0400, Vlad Yasevich wrote:
>>> On 04/02/2018 10:47 AM, Marcelo Ricardo Leitner wrote:
>>>> On Mon, Apr 02, 2018 at 09:40:01AM -0400, Vladislav Yasevich wrote:
>>>>> Now that we have SCTP offload capabilities in the kernel, we can add
>>>>> them to virtio as well.  First step is SCTP checksum.
>>>>
>>>> Thanks.
>>>>
>>>>> As for GSO, the way sctp GSO is currently implemented buys us nothing
>>>>> in added support to virtio.  To add true GSO, would require a lot of
>>>>> re-work inside of SCTP and would require extensions to the virtio
>>>>> net header to carry extra sctp data.
>>>>
>>>> Can you please elaborate more on this? Is this because SCTP GSO relies
>>>> on the gso skb format for knowing how to segment it instead of having
>>>> a list of sizes?
>>>>
>>>
>>> it's mainly because all the true segmentation, placing data into chunks,
>>> has already happened.  All that GSO does is allow for higher bundling
>>> rate between VMs. If that is all SCTP GSO ever going to do, that fine,
>>> but the goal is to do real GSO eventually and potentially reduce the
>>> amount of memory copying we are doing.
>>> If we do that, any current attempt at GSO in virtio would have to be
>>> depricated and we'd need GSO2 or something like that.
>>
>> Batching helps virtualization *a lot* though.
> 
> Yep. The results posted by Xin in the other email give good insights
> on it.
> 
>> Are there actual plans for GSO2? Is it just for SCTP?
> 
> No plans. In this context, at least, yes, just for SCTP.
> 
> It was a supposition in case we start doing a different GSO for SCTP,
> one more like what we have for TCP.
> 
> Currently, as the SCTP GSO code doesn't leave the system, we can
> update it if we want. But by the moment we add support for it in
> virtio, we will have to be backwards compatible if we end up doing
> SCTP GSO differently.

So, just because the linux code doesn't do it differently doesn't mean
that someone else doesn't.  Since the device has to work across different
possible implementations, it needs to be generic enough.  If we simply
document the current linux practice, that may not be ideal on the future.

I was hesitant to introduce this without studying the feasibility of
doing late segmentation.

-vlad

> 
> But again, I don't think such approach for SCTP GSO would be neither
> feasible or worth. The complexity for it, to work across stream
> schedules and late TSN allocation, would do more harm then good IMO.
> 
>>
>>>
>>> This is why, after doing the GSO support, I decided not to include it.
>>>
>>> -vlad
>>>>   Marcelo
>>>>

^ permalink raw reply

* Re: Page allocator bottleneck
From: Aaron Lu @ 2018-04-23 13:10 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: Linux Kernel Network Developers, linux-mm, Mel Gorman,
	David Miller, Jesper Dangaard Brouer, Eric Dumazet,
	Alexei Starovoitov, Saeed Mahameed, Eran Ben Elisha,
	Andrew Morton, Michal Hocko
In-Reply-To: <0dea4da6-8756-22d4-c586-267217a5fa63@mellanox.com>

On Mon, Apr 23, 2018 at 11:54:57AM +0300, Tariq Toukan wrote:
> Hi,
> 
> I ran my tests with your patches.
> Initial BW numbers are significantly higher than I documented back then in
> this mail-thread.
> For example, in driver #2 (see original mail thread), with 6 rings, I now
> get 92Gbps (slightly less than linerate) in comparison to 64Gbps back then.
> 
> However, there were many kernel changes since then, I need to isolate your
> changes. I am not sure I can finish this today, but I will surely get to it
> next week after I'm back from vacation.
> 
> Still, when I increase the scale (more rings, i.e. more cpus), I see that
> queued_spin_lock_slowpath gets to 60%+ cpu. Still high, but lower than it
> used to be.

I wonder if it is on allocation path or free path?

Also, increasing PCP size through vm.percpu_pagelist_fraction would
still help with my patches since it can avoid touching even more cache
lines on allocation path with a higher PCP->batch(which has an upper
limit of 96 though at the moment).

> 
> This should be root solved by the (orthogonal) changes planned in network
> subsystem, which will change the SKB allocation/free scheme so that SKBs are
> released on the originating cpu.

^ permalink raw reply

* [PATCH net] sfc: ARFS filter IDs
From: Edward Cree @ 2018-04-23 13:08 UTC (permalink / raw)
  To: linux-net-drivers, David Miller; +Cc: netdev

Associate an arbitrary ID with each ARFS filter, allowing to properly query
 for expiry.  The association is maintained in a hash table, which is
 protected by a spinlock.

Fixes: 3af0f34290f6 ("sfc: replace asynchronous filter operations")
Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 drivers/net/ethernet/sfc/ef10.c       |  80 +++++++++++--------
 drivers/net/ethernet/sfc/efx.c        | 143 ++++++++++++++++++++++++++++++++++
 drivers/net/ethernet/sfc/efx.h        |  19 +++++
 drivers/net/ethernet/sfc/farch.c      |  41 ++++++++--
 drivers/net/ethernet/sfc/net_driver.h |  36 +++++++++
 drivers/net/ethernet/sfc/rx.c         |  62 +++++++++++++--
 6 files changed, 335 insertions(+), 46 deletions(-)

diff --git a/drivers/net/ethernet/sfc/ef10.c b/drivers/net/ethernet/sfc/ef10.c
index 83ce229f4eb7..63036d9bf3e6 100644
--- a/drivers/net/ethernet/sfc/ef10.c
+++ b/drivers/net/ethernet/sfc/ef10.c
@@ -3999,29 +3999,6 @@ static void efx_ef10_prepare_flr(struct efx_nic *efx)
 	atomic_set(&efx->active_queues, 0);
 }
 
-static bool efx_ef10_filter_equal(const struct efx_filter_spec *left,
-				  const struct efx_filter_spec *right)
-{
-	if ((left->match_flags ^ right->match_flags) |
-	    ((left->flags ^ right->flags) &
-	     (EFX_FILTER_FLAG_RX | EFX_FILTER_FLAG_TX)))
-		return false;
-
-	return memcmp(&left->outer_vid, &right->outer_vid,
-		      sizeof(struct efx_filter_spec) -
-		      offsetof(struct efx_filter_spec, outer_vid)) == 0;
-}
-
-static unsigned int efx_ef10_filter_hash(const struct efx_filter_spec *spec)
-{
-	BUILD_BUG_ON(offsetof(struct efx_filter_spec, outer_vid) & 3);
-	return jhash2((const u32 *)&spec->outer_vid,
-		      (sizeof(struct efx_filter_spec) -
-		       offsetof(struct efx_filter_spec, outer_vid)) / 4,
-		      0);
-	/* XXX should we randomise the initval? */
-}
-
 /* Decide whether a filter should be exclusive or else should allow
  * delivery to additional recipients.  Currently we decide that
  * filters for specific local unicast MAC and IP addresses are
@@ -4346,7 +4323,7 @@ static s32 efx_ef10_filter_insert(struct efx_nic *efx,
 		goto out_unlock;
 	match_pri = rc;
 
-	hash = efx_ef10_filter_hash(spec);
+	hash = efx_filter_spec_hash(spec);
 	is_mc_recip = efx_filter_is_mc_recipient(spec);
 	if (is_mc_recip)
 		bitmap_zero(mc_rem_map, EFX_EF10_FILTER_SEARCH_LIMIT);
@@ -4378,7 +4355,7 @@ static s32 efx_ef10_filter_insert(struct efx_nic *efx,
 		if (!saved_spec) {
 			if (ins_index < 0)
 				ins_index = i;
-		} else if (efx_ef10_filter_equal(spec, saved_spec)) {
+		} else if (efx_filter_spec_equal(spec, saved_spec)) {
 			if (spec->priority < saved_spec->priority &&
 			    spec->priority != EFX_FILTER_PRI_AUTO) {
 				rc = -EPERM;
@@ -4762,27 +4739,62 @@ static s32 efx_ef10_filter_get_rx_ids(struct efx_nic *efx,
 static bool efx_ef10_filter_rfs_expire_one(struct efx_nic *efx, u32 flow_id,
 					   unsigned int filter_idx)
 {
+	struct efx_filter_spec *spec, saved_spec;
 	struct efx_ef10_filter_table *table;
-	struct efx_filter_spec *spec;
-	bool ret;
+	struct efx_arfs_rule *rule = NULL;
+	bool ret = true, force = false;
+	u16 arfs_id;
 
 	down_read(&efx->filter_sem);
 	table = efx->filter_state;
 	down_write(&table->lock);
 	spec = efx_ef10_filter_entry_spec(table, filter_idx);
 
-	if (!spec || spec->priority != EFX_FILTER_PRI_HINT) {
-		ret = true;
+	if (!spec || spec->priority != EFX_FILTER_PRI_HINT)
 		goto out_unlock;
-	}
 
-	if (!rps_may_expire_flow(efx->net_dev, spec->dmaq_id, flow_id, 0)) {
-		ret = false;
-		goto out_unlock;
+	spin_lock_bh(&efx->rps_hash_lock);
+	if (!efx->rps_hash_table) {
+		/* In the absence of the table, we always return 0 to ARFS. */
+		arfs_id = 0;
+	} else {
+		rule = efx_rps_hash_find(efx, spec);
+		if (!rule)
+			/* ARFS table doesn't know of this filter, so remove it */
+			goto expire;
+		arfs_id = rule->arfs_id;
+		ret = efx_rps_check_rule(rule, filter_idx, &force);
+		if (force)
+			goto expire;
+		if (!ret) {
+			spin_unlock_bh(&efx->rps_hash_lock);
+			goto out_unlock;
+		}
 	}
-
+	if (!rps_may_expire_flow(efx->net_dev, spec->dmaq_id, flow_id, arfs_id))
+		ret = false;
+	else if (rule)
+		rule->filter_id = EFX_ARFS_FILTER_ID_REMOVING;
+expire:
+	saved_spec = *spec; /* remove operation will kfree spec */
+	spin_unlock_bh(&efx->rps_hash_lock);
+	/* At this point (since we dropped the lock), another thread might queue
+	 * up a fresh insertion request (but the actual insertion will be held
+	 * up by our possession of the filter table lock).  In that case, it
+	 * will set rule->filter_id to EFX_ARFS_FILTER_ID_PENDING, meaning that
+	 * the rule is not removed by efx_rps_hash_del() below.
+	 */
 	ret = efx_ef10_filter_remove_internal(efx, 1U << spec->priority,
 					      filter_idx, true) == 0;
+	/* While we can't safely dereference rule (we dropped the lock), we can
+	 * still test it for NULL.
+	 */
+	if (ret && rule) {
+		/* Expiring, so remove entry from ARFS table */
+		spin_lock_bh(&efx->rps_hash_lock);
+		efx_rps_hash_del(efx, &saved_spec);
+		spin_unlock_bh(&efx->rps_hash_lock);
+	}
 out_unlock:
 	up_write(&table->lock);
 	up_read(&efx->filter_sem);
diff --git a/drivers/net/ethernet/sfc/efx.c b/drivers/net/ethernet/sfc/efx.c
index 692dd729ee2a..a4ebd8715494 100644
--- a/drivers/net/ethernet/sfc/efx.c
+++ b/drivers/net/ethernet/sfc/efx.c
@@ -3027,6 +3027,10 @@ static int efx_init_struct(struct efx_nic *efx,
 	mutex_init(&efx->mac_lock);
 #ifdef CONFIG_RFS_ACCEL
 	mutex_init(&efx->rps_mutex);
+	spin_lock_init(&efx->rps_hash_lock);
+	/* Failure to allocate is not fatal, but may degrade ARFS performance */
+	efx->rps_hash_table = kcalloc(EFX_ARFS_HASH_TABLE_SIZE,
+				      sizeof(*efx->rps_hash_table), GFP_KERNEL);
 #endif
 	efx->phy_op = &efx_dummy_phy_operations;
 	efx->mdio.dev = net_dev;
@@ -3070,6 +3074,10 @@ static void efx_fini_struct(struct efx_nic *efx)
 {
 	int i;
 
+#ifdef CONFIG_RFS_ACCEL
+	kfree(efx->rps_hash_table);
+#endif
+
 	for (i = 0; i < EFX_MAX_CHANNELS; i++)
 		kfree(efx->channel[i]);
 
@@ -3092,6 +3100,141 @@ void efx_update_sw_stats(struct efx_nic *efx, u64 *stats)
 	stats[GENERIC_STAT_rx_noskb_drops] = atomic_read(&efx->n_rx_noskb_drops);
 }
 
+bool efx_filter_spec_equal(const struct efx_filter_spec *left,
+			   const struct efx_filter_spec *right)
+{
+	if ((left->match_flags ^ right->match_flags) |
+	    ((left->flags ^ right->flags) &
+	     (EFX_FILTER_FLAG_RX | EFX_FILTER_FLAG_TX)))
+		return false;
+
+	return memcmp(&left->outer_vid, &right->outer_vid,
+		      sizeof(struct efx_filter_spec) -
+		      offsetof(struct efx_filter_spec, outer_vid)) == 0;
+}
+
+u32 efx_filter_spec_hash(const struct efx_filter_spec *spec)
+{
+	BUILD_BUG_ON(offsetof(struct efx_filter_spec, outer_vid) & 3);
+	return jhash2((const u32 *)&spec->outer_vid,
+		      (sizeof(struct efx_filter_spec) -
+		       offsetof(struct efx_filter_spec, outer_vid)) / 4,
+		      0);
+}
+
+#ifdef CONFIG_RFS_ACCEL
+bool efx_rps_check_rule(struct efx_arfs_rule *rule, unsigned int filter_idx,
+			bool *force)
+{
+	if (rule->filter_id == EFX_ARFS_FILTER_ID_PENDING) {
+		/* ARFS is currently updating this entry, leave it */
+		return false;
+	}
+	if (rule->filter_id == EFX_ARFS_FILTER_ID_ERROR) {
+		/* ARFS tried and failed to update this, so it's probably out
+		 * of date.  Remove the filter and the ARFS rule entry.
+		 */
+		rule->filter_id = EFX_ARFS_FILTER_ID_REMOVING;
+		*force = true;
+		return true;
+	} else if (WARN_ON(rule->filter_id != filter_idx)) { /* can't happen */
+		/* ARFS has moved on, so old filter is not needed.  Since we did
+		 * not mark the rule with EFX_ARFS_FILTER_ID_REMOVING, it will
+		 * not be removed by efx_rps_hash_del() subsequently.
+		 */
+		*force = true;
+		return true;
+	}
+	/* Remove it iff ARFS wants to. */
+	return true;
+}
+
+struct hlist_head *efx_rps_hash_bucket(struct efx_nic *efx,
+				       const struct efx_filter_spec *spec)
+{
+	u32 hash = efx_filter_spec_hash(spec);
+
+	WARN_ON(!spin_is_locked(&efx->rps_hash_lock));
+	if (!efx->rps_hash_table)
+		return NULL;
+	return &efx->rps_hash_table[hash % EFX_ARFS_HASH_TABLE_SIZE];
+}
+
+struct efx_arfs_rule *efx_rps_hash_find(struct efx_nic *efx,
+					const struct efx_filter_spec *spec)
+{
+	struct efx_arfs_rule *rule;
+	struct hlist_head *head;
+	struct hlist_node *node;
+
+	head = efx_rps_hash_bucket(efx, spec);
+	if (!head)
+		return NULL;
+	hlist_for_each(node, head) {
+		rule = container_of(node, struct efx_arfs_rule, node);
+		if (efx_filter_spec_equal(spec, &rule->spec))
+			return rule;
+	}
+	return NULL;
+}
+
+struct efx_arfs_rule *efx_rps_hash_add(struct efx_nic *efx,
+				       const struct efx_filter_spec *spec,
+				       bool *new)
+{
+	struct efx_arfs_rule *rule;
+	struct hlist_head *head;
+	struct hlist_node *node;
+
+	head = efx_rps_hash_bucket(efx, spec);
+	if (!head)
+		return NULL;
+	hlist_for_each(node, head) {
+		rule = container_of(node, struct efx_arfs_rule, node);
+		if (efx_filter_spec_equal(spec, &rule->spec)) {
+			*new = false;
+			return rule;
+		}
+	}
+	rule = kmalloc(sizeof(*rule), GFP_ATOMIC);
+	*new = true;
+	if (rule) {
+		memcpy(&rule->spec, spec, sizeof(rule->spec));
+		hlist_add_head(&rule->node, head);
+	}
+	return rule;
+}
+
+void efx_rps_hash_del(struct efx_nic *efx, const struct efx_filter_spec *spec)
+{
+	struct efx_arfs_rule *rule;
+	struct hlist_head *head;
+	struct hlist_node *node;
+
+	head = efx_rps_hash_bucket(efx, spec);
+	if (WARN_ON(!head))
+		return;
+	hlist_for_each(node, head) {
+		rule = container_of(node, struct efx_arfs_rule, node);
+		if (efx_filter_spec_equal(spec, &rule->spec)) {
+			/* Someone already reused the entry.  We know that if
+			 * this check doesn't fire (i.e. filter_id == REMOVING)
+			 * then the REMOVING mark was put there by our caller,
+			 * because caller is holding a lock on filter table and
+			 * only holders of that lock set REMOVING.
+			 */
+			if (rule->filter_id != EFX_ARFS_FILTER_ID_REMOVING)
+				return;
+			hlist_del(node);
+			kfree(rule);
+			return;
+		}
+	}
+	/* We didn't find it. */
+	WARN_ON(1);
+}
+#endif
+
 /* RSS contexts.  We're using linked lists and crappy O(n) algorithms, because
  * (a) this is an infrequent control-plane operation and (b) n is small (max 64)
  */
diff --git a/drivers/net/ethernet/sfc/efx.h b/drivers/net/ethernet/sfc/efx.h
index a3140e16fcef..6b4164b6d938 100644
--- a/drivers/net/ethernet/sfc/efx.h
+++ b/drivers/net/ethernet/sfc/efx.h
@@ -186,6 +186,25 @@ static inline void efx_filter_rfs_expire(struct work_struct *data) {}
 #endif
 bool efx_filter_is_mc_recipient(const struct efx_filter_spec *spec);
 
+bool efx_filter_spec_equal(const struct efx_filter_spec *left,
+			   const struct efx_filter_spec *right);
+u32 efx_filter_spec_hash(const struct efx_filter_spec *spec);
+
+bool efx_rps_check_rule(struct efx_arfs_rule *rule, unsigned int filter_idx,
+			bool *force);
+
+struct efx_arfs_rule *efx_rps_hash_find(struct efx_nic *efx,
+					const struct efx_filter_spec *spec);
+
+/* @new is written to indicate if entry was newly added (true) or if an old
+ * entry was found and returned (false).
+ */
+struct efx_arfs_rule *efx_rps_hash_add(struct efx_nic *efx,
+				       const struct efx_filter_spec *spec,
+				       bool *new);
+
+void efx_rps_hash_del(struct efx_nic *efx, const struct efx_filter_spec *spec);
+
 /* RSS contexts */
 struct efx_rss_context *efx_alloc_rss_context_entry(struct efx_nic *efx);
 struct efx_rss_context *efx_find_rss_context_entry(struct efx_nic *efx, u32 id);
diff --git a/drivers/net/ethernet/sfc/farch.c b/drivers/net/ethernet/sfc/farch.c
index 7174ef5e5c5e..ade694c1f9a6 100644
--- a/drivers/net/ethernet/sfc/farch.c
+++ b/drivers/net/ethernet/sfc/farch.c
@@ -2905,18 +2905,45 @@ bool efx_farch_filter_rfs_expire_one(struct efx_nic *efx, u32 flow_id,
 {
 	struct efx_farch_filter_state *state = efx->filter_state;
 	struct efx_farch_filter_table *table;
-	bool ret = false;
+	bool ret = false, force = false;
+	u16 arfs_id;
 
 	down_write(&state->lock);
+	spin_lock_bh(&efx->rps_hash_lock);
 	table = &state->table[EFX_FARCH_FILTER_TABLE_RX_IP];
 	if (test_bit(index, table->used_bitmap) &&
-	    table->spec[index].priority == EFX_FILTER_PRI_HINT &&
-	    rps_may_expire_flow(efx->net_dev, table->spec[index].dmaq_id,
-				flow_id, 0)) {
-		efx_farch_filter_table_clear_entry(efx, table, index);
-		ret = true;
+	    table->spec[index].priority == EFX_FILTER_PRI_HINT) {
+		struct efx_filter_spec spec;
+		struct efx_arfs_rule *rule;
+
+		efx_farch_filter_to_gen_spec(&spec, &table->spec[index]);
+		if (!efx->rps_hash_table) {
+			/* In the absence of the table, we always returned 0 to
+			 * ARFS, so use the same to query it.
+			 */
+			arfs_id = 0;
+		} else {
+			rule = efx_rps_hash_find(efx, &spec);
+			if (!rule) {
+				/* ARFS table doesn't know of this filter, remove it */
+				force = true;
+			} else {
+				arfs_id = rule->arfs_id;
+				if (!efx_rps_check_rule(rule, index, &force))
+					goto out_unlock;
+			}
+		}
+		if (force || rps_may_expire_flow(efx->net_dev, spec.dmaq_id,
+						 flow_id, arfs_id)) {
+			if (rule)
+				rule->filter_id = EFX_ARFS_FILTER_ID_REMOVING;
+			efx_rps_hash_del(efx, &spec);
+			efx_farch_filter_table_clear_entry(efx, table, index);
+			ret = true;
+		}
 	}
-
+out_unlock:
+	spin_unlock_bh(&efx->rps_hash_lock);
 	up_write(&state->lock);
 	return ret;
 }
diff --git a/drivers/net/ethernet/sfc/net_driver.h b/drivers/net/ethernet/sfc/net_driver.h
index eea3808b3f25..65568925c3ef 100644
--- a/drivers/net/ethernet/sfc/net_driver.h
+++ b/drivers/net/ethernet/sfc/net_driver.h
@@ -734,6 +734,35 @@ struct efx_rss_context {
 };
 
 #ifdef CONFIG_RFS_ACCEL
+/* Order of these is important, since filter_id >= %EFX_ARFS_FILTER_ID_PENDING
+ * is used to test if filter does or will exist.
+ */
+#define EFX_ARFS_FILTER_ID_PENDING	-1
+#define EFX_ARFS_FILTER_ID_ERROR	-2
+#define EFX_ARFS_FILTER_ID_REMOVING	-3
+/**
+ * struct efx_arfs_rule - record of an ARFS filter and its IDs
+ * @node: linkage into hash table
+ * @spec: details of the filter (used as key for hash table).  Use efx->type to
+ *	determine which member to use.
+ * @rxq_index: channel to which the filter will steer traffic.
+ * @arfs_id: filter ID which was returned to ARFS
+ * @filter_id: index in software filter table.  May be
+ *	%EFX_ARFS_FILTER_ID_PENDING if filter was not inserted yet,
+ *	%EFX_ARFS_FILTER_ID_ERROR if filter insertion failed, or
+ *	%EFX_ARFS_FILTER_ID_REMOVING if expiry is currently removing the filter.
+ */
+struct efx_arfs_rule {
+	struct hlist_node node;
+	struct efx_filter_spec spec;
+	u16 rxq_index;
+	u16 arfs_id;
+	s32 filter_id;
+};
+
+/* Size chosen so that the table is one page (4kB) */
+#define EFX_ARFS_HASH_TABLE_SIZE	512
+
 /**
  * struct efx_async_filter_insertion - Request to asynchronously insert a filter
  * @net_dev: Reference to the netdevice
@@ -873,6 +902,10 @@ struct efx_async_filter_insertion {
  *	@rps_expire_channel's @rps_flow_id
  * @rps_slot_map: bitmap of in-flight entries in @rps_slot
  * @rps_slot: array of ARFS insertion requests for efx_filter_rfs_work()
+ * @rps_hash_lock: Protects ARFS filter mapping state (@rps_hash_table and
+ *	@rps_next_id).
+ * @rps_hash_table: Mapping between ARFS filters and their various IDs
+ * @rps_next_id: next arfs_id for an ARFS filter
  * @active_queues: Count of RX and TX queues that haven't been flushed and drained.
  * @rxq_flush_pending: Count of number of receive queues that need to be flushed.
  *	Decremented when the efx_flush_rx_queue() is called.
@@ -1029,6 +1062,9 @@ struct efx_nic {
 	unsigned int rps_expire_index;
 	unsigned long rps_slot_map;
 	struct efx_async_filter_insertion rps_slot[EFX_RPS_MAX_IN_FLIGHT];
+	spinlock_t rps_hash_lock;
+	struct hlist_head *rps_hash_table;
+	u32 rps_next_id;
 #endif
 
 	atomic_t active_queues;
diff --git a/drivers/net/ethernet/sfc/rx.c b/drivers/net/ethernet/sfc/rx.c
index 9c593c661cbf..64a94f242027 100644
--- a/drivers/net/ethernet/sfc/rx.c
+++ b/drivers/net/ethernet/sfc/rx.c
@@ -834,9 +834,29 @@ static void efx_filter_rfs_work(struct work_struct *data)
 	struct efx_nic *efx = netdev_priv(req->net_dev);
 	struct efx_channel *channel = efx_get_channel(efx, req->rxq_index);
 	int slot_idx = req - efx->rps_slot;
+	struct efx_arfs_rule *rule;
+	u16 arfs_id = 0;
 	int rc;
 
 	rc = efx->type->filter_insert(efx, &req->spec, true);
+	if (efx->rps_hash_table) {
+		spin_lock_bh(&efx->rps_hash_lock);
+		rule = efx_rps_hash_find(efx, &req->spec);
+		/* The rule might have already gone, if someone else's request
+		 * for the same spec was already worked and then expired before
+		 * we got around to our work.  In that case we have nothing
+		 * tying us to an arfs_id, meaning that as soon as the filter
+		 * is considered for expiry it will be removed.
+		 */
+		if (rule) {
+			if (rc < 0)
+				rule->filter_id = EFX_ARFS_FILTER_ID_ERROR;
+			else
+				rule->filter_id = rc;
+			arfs_id = rule->arfs_id;
+		}
+		spin_unlock_bh(&efx->rps_hash_lock);
+	}
 	if (rc >= 0) {
 		/* Remember this so we can check whether to expire the filter
 		 * later.
@@ -848,18 +868,18 @@ static void efx_filter_rfs_work(struct work_struct *data)
 
 		if (req->spec.ether_type == htons(ETH_P_IP))
 			netif_info(efx, rx_status, efx->net_dev,
-				   "steering %s %pI4:%u:%pI4:%u to queue %u [flow %u filter %d]\n",
+				   "steering %s %pI4:%u:%pI4:%u to queue %u [flow %u filter %d id %u]\n",
 				   (req->spec.ip_proto == IPPROTO_TCP) ? "TCP" : "UDP",
 				   req->spec.rem_host, ntohs(req->spec.rem_port),
 				   req->spec.loc_host, ntohs(req->spec.loc_port),
-				   req->rxq_index, req->flow_id, rc);
+				   req->rxq_index, req->flow_id, rc, arfs_id);
 		else
 			netif_info(efx, rx_status, efx->net_dev,
-				   "steering %s [%pI6]:%u:[%pI6]:%u to queue %u [flow %u filter %d]\n",
+				   "steering %s [%pI6]:%u:[%pI6]:%u to queue %u [flow %u filter %d id %u]\n",
 				   (req->spec.ip_proto == IPPROTO_TCP) ? "TCP" : "UDP",
 				   req->spec.rem_host, ntohs(req->spec.rem_port),
 				   req->spec.loc_host, ntohs(req->spec.loc_port),
-				   req->rxq_index, req->flow_id, rc);
+				   req->rxq_index, req->flow_id, rc, arfs_id);
 	}
 
 	/* Release references */
@@ -872,8 +892,10 @@ int efx_filter_rfs(struct net_device *net_dev, const struct sk_buff *skb,
 {
 	struct efx_nic *efx = netdev_priv(net_dev);
 	struct efx_async_filter_insertion *req;
+	struct efx_arfs_rule *rule;
 	struct flow_keys fk;
 	int slot_idx;
+	bool new;
 	int rc;
 
 	/* find a free slot */
@@ -926,12 +948,42 @@ int efx_filter_rfs(struct net_device *net_dev, const struct sk_buff *skb,
 	req->spec.rem_port = fk.ports.src;
 	req->spec.loc_port = fk.ports.dst;
 
+	if (efx->rps_hash_table) {
+		/* Add it to ARFS hash table */
+		spin_lock(&efx->rps_hash_lock);
+		rule = efx_rps_hash_add(efx, &req->spec, &new);
+		if (!rule) {
+			rc = -ENOMEM;
+			goto out_unlock;
+		}
+		if (new)
+			rule->arfs_id = efx->rps_next_id++ % RPS_NO_FILTER;
+		rc = rule->arfs_id;
+		/* Skip if existing or pending filter already does the right thing */
+		if (!new && rule->rxq_index == rxq_index &&
+		    rule->filter_id >= EFX_ARFS_FILTER_ID_PENDING)
+			goto out_unlock;
+		rule->rxq_index = rxq_index;
+		rule->filter_id = EFX_ARFS_FILTER_ID_PENDING;
+		spin_unlock(&efx->rps_hash_lock);
+	} else {
+		/* Without an ARFS hash table, we just use arfs_id 0 for all
+		 * filters.  This means if multiple flows hash to the same
+		 * flow_id, all but the most recently touched will be eligible
+		 * for expiry.
+		 */
+		rc = 0;
+	}
+
+	/* Queue the request */
 	dev_hold(req->net_dev = net_dev);
 	INIT_WORK(&req->work, efx_filter_rfs_work);
 	req->rxq_index = rxq_index;
 	req->flow_id = flow_id;
 	schedule_work(&req->work);
-	return 0;
+	return rc;
+out_unlock:
+	spin_unlock(&efx->rps_hash_lock);
 out_clear:
 	clear_bit(slot_idx, &efx->rps_slot_map);
 	return rc;

^ permalink raw reply related

* Repeating "unregister_netdevice: waiting for lo to become free" caused by upstream 76da0704507bb ("ipv6: only call ip6_route_dev_notify() once for NETDEV_UNREGISTER")
From: Rafał Miłecki @ 2018-04-23 13:08 UTC (permalink / raw)
  To: WANG Cong, David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI,
	Network Development, jeffy, David Ahern, Khlebnikov
  Cc: Greg Kroah-Hartman, Stable

Hi,

I've just updated my kernel 4.4.x and noticed a regression. Bisecting
pointed me to the commit 2417da3f4d6bc ("ipv6: only call
ip6_route_dev_notify() once for NETDEV_UNREGISTER") [0] which is
backport of upstream 76da0704507bb. That backported commit has
appeared in a 4.4.103.

I use OpenWrt/LEDE [1] distribution and LXC [2] 1.1.5. After stopping
a container I start getting these messages:
[  229.419188] unregister_netdevice: waiting for lo to become free.
Usage count = 1
[  239.660408] unregister_netdevice: waiting for lo to become free.
Usage count = 1
[  249.839189] unregister_netdevice: waiting for lo to become free.
Usage count = 1
(...)

Trying to start LXC nevertheless results in lxc-start command hang
around network configuration. Trying to query LXC state afterwards
results in a lxc-info command hang too.

I tried Googling for this issue and found similar reports:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1729637
https://github.com/fnproject/fn/issues/686
https://lime-technology.com/forums/topic/66863-kernelunregister_netdevice-waiting-for-lo-to-become-free-usage-count-1/
all of them related to the Docker, which is probably a similar use
case to the LXC.

I couldn't find any reference to commit 76da0704507bb that could
suggest fixing the problem I'm seeing.

Does anyone have an idea what is the issue I'm seeing about? Or even
better, how to fix it? Can I provide any additional info that would
help?

[0] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?h=linux-4.4.y&id=2417da3f4d6bc4fc6c77f613f0e2264090892aa5
[1] https://openwrt.org/
[2] https://linuxcontainers.org/

-- 
Rafał

^ permalink raw reply

* [PATCH net 1/1] Modify the seq_puts and seq_printf of af_netlink.c file
From: Bo YU @ 2018-04-23 12:51 UTC (permalink / raw)
  To: davem, Wang, Berg, Tkhai, Long, Elena; +Cc: netdev

Modify format output symbol of seq_printf function and adjust blanks in
seq_puts function in order to make convenience with command:`cat
/proc/net/netlink`

Signed-off-by: Bo YU <tsu.yubo@gmail.com>
---
  net/netlink/af_netlink.c | 6 +++---
  1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 55342c4d5cec..2e2dd88fc79f 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2606,13 +2606,13 @@ static int netlink_seq_show(struct seq_file *seq, void *v)
  {
  	if (v == SEQ_START_TOKEN) {
  		seq_puts(seq,
-			 "sk       Eth Pid    Groups   "
-			 "Rmem     Wmem     Dump     Locks     Drops     Inode\n");
+			 "sk               Eth Pid        Groups   "
+			 "Rmem     Wmem     Dump  Locks    Drops    Inode\n");
  	} else {
  		struct sock *s = v;
  		struct netlink_sock *nlk = nlk_sk(s);

-		seq_printf(seq, "%pK %-3d %-6u %08x %-8d %-8d %d %-8d %-8d %-8lu\n",
+		seq_printf(seq, "%pK %-3d %-10u %08x %-8d %-8d %-5d %-8d %-8d %-8lu\n",
  			   s,
  			   s->sk_protocol,
  			   nlk->portid,

^ permalink raw reply related

* [PATCH] dca: make function dca_common_get_tag static
From: Colin King @ 2018-04-23 12:49 UTC (permalink / raw)
  To: netdev; +Cc: kernel-janitors, linux-kernel

From: Colin Ian King <colin.king@canonical.com>

Function dca_common_get_tag is local to the source and does not need to be
in global scope, so make it static.

Cleans up sparse warning:
drivers/dca/dca-core.c:273:4: warning: symbol 'dca_common_get_tag' was
not declared. Should it be static?

Signed-off-by: Colin Ian King <colin.king@canonical.com>
---
 drivers/dca/dca-core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/dca/dca-core.c b/drivers/dca/dca-core.c
index 7afbb28d6a0f..1bc5ffb338c8 100644
--- a/drivers/dca/dca-core.c
+++ b/drivers/dca/dca-core.c
@@ -270,7 +270,7 @@ EXPORT_SYMBOL_GPL(dca_remove_requester);
  * @dev - the device that wants dca service
  * @cpu - the cpuid as returned by get_cpu()
  */
-u8 dca_common_get_tag(struct device *dev, int cpu)
+static u8 dca_common_get_tag(struct device *dev, int cpu)
 {
 	struct dca_provider *dca;
 	u8 tag;
-- 
2.17.0

^ permalink raw reply related

* Re: [PATCH net-next] lan78xx: Lan7801 Support for Fixed PHY
From: Andrew Lunn @ 2018-04-23 12:42 UTC (permalink / raw)
  To: Raghuram Chary J; +Cc: davem, netdev, unglinuxdriver, woojung.huh
In-Reply-To: <20180423044630.2672-1-raghuramchary.jallipalli@microchip.com>

>  #define DRIVER_AUTHOR	"WOOJUNG HUH <woojung.huh@microchip.com>"
>  #define DRIVER_DESC	"LAN78XX USB 3.0 Gigabit Ethernet Devices"
>  #define DRIVER_NAME	"lan78xx"
> -#define DRIVER_VERSION	"1.0.6"
> +#define DRIVER_VERSION	"1.0.7"

Hi Raghuram

Driver version strings a pretty pointless. You might want to remove
it.

>  
>  #define TX_TIMEOUT_JIFFIES		(5 * HZ)
>  #define THROTTLE_JIFFIES		(HZ / 8)
> @@ -426,6 +426,7 @@ struct lan78xx_net {
>  	struct statstage	stats;
>  
>  	struct irq_domain_data	domain_data;
> +	struct phy_device	*fixedphy;
>  };
>  
>  /* define external phy id */
> @@ -2062,11 +2063,39 @@ static int lan78xx_phy_init(struct lan78xx_net *dev)
>  	int ret;
>  	u32 mii_adv;
>  	struct phy_device *phydev;
> +	struct fixed_phy_status fphy_status = {
> +		.link = 1,
> +		.speed = SPEED_1000,
> +		.duplex = DUPLEX_FULL,
> +	};
>  
>  	phydev = phy_find_first(dev->mdiobus);
>  	if (!phydev) {
> -		netdev_err(dev->net, "no PHY found\n");
> -		return -EIO;
> +		if (dev->chipid == ID_REV_CHIP_ID_7801_) {
> +			u32 buf;
> +
> +			netdev_info(dev->net, "PHY Not Found!! Registering Fixed PHY\n");
> +			phydev = fixed_phy_register(PHY_POLL, &fphy_status, -1,
> +						    NULL);
> +			if (IS_ERR(phydev)) {
> +				netdev_err(dev->net, "No PHY/fixed_PHY found\n");
> +				return -ENODEV;
> +			}
> +			netdev_info(dev->net, "Registered FIXED PHY\n");

There are too many detdev_info() messages here. Maybe make them both
netdev_dbg().

> +			dev->interface = PHY_INTERFACE_MODE_RGMII;
> +			dev->fixedphy = phydev;

You can use 

if (!phy_is_pseudo_fixed_link(phydev))

to determine is a PHY is a fixed phy. I think you can then do without
dev->fixedphy.

> +			ret = lan78xx_write_reg(dev, MAC_RGMII_ID,
> +						MAC_RGMII_ID_TXC_DELAY_EN_);
> +			ret = lan78xx_write_reg(dev, RGMII_TX_BYP_DLL, 0x3D00);
> +			ret = lan78xx_read_reg(dev, HW_CFG, &buf);
> +			buf |= HW_CFG_CLK125_EN_;
> +			buf |= HW_CFG_REFCLK25_EN_;
> +			ret = lan78xx_write_reg(dev, HW_CFG, buf);
> +			goto phyinit;

Please don't use a goto like this. Maybe turn this into a switch statement?

> +		} else {
> +			netdev_err(dev->net, "no PHY found\n");
> +			return -EIO;
> +		}
>  	}
>  
>  	if ((dev->chipid == ID_REV_CHIP_ID_7800_) ||
> @@ -2105,6 +2134,7 @@ static int lan78xx_phy_init(struct lan78xx_net *dev)
>  		goto error;

Please take a look at what happens at error: It does not look
correct. Probably now is a good time to refactor the whole of lan78xx_phy_init()

	 Andrew

^ permalink raw reply

* Re: [PATCH bpf-next v3 4/9] bpf/verifier: improve register value range tracking with ARSH
From: Edward Cree @ 2018-04-23 12:25 UTC (permalink / raw)
  To: Yonghong Song, ast, daniel, netdev; +Cc: kernel-team
In-Reply-To: <20180420221842.742330-5-yhs@fb.com>

On 20/04/18 23:18, Yonghong Song wrote:
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 3c8bb92..01c215d 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -2975,6 +2975,32 @@ static int adjust_scalar_min_max_vals(struct bpf_verifier_env *env,
>  		/* We may learn something more from the var_off */
>  		__update_reg_bounds(dst_reg);
>  		break;
> +	case BPF_ARSH:
> +		if (umax_val >= insn_bitness) {
> +			/* Shifts greater than 31 or 63 are undefined.
> +			 * This includes shifts by a negative number.
> +			 */
> +			mark_reg_unknown(env, regs, insn->dst_reg);
> +			break;
> +		}
> +		if (dst_reg->smin_value < 0)
> +			dst_reg->smin_value >>= umin_val;
> +		else
> +			dst_reg->smin_value >>= umax_val;
> +		if (dst_reg->smax_value < 0)
> +			dst_reg->smax_value >>= umax_val;
> +		else
> +			dst_reg->smax_value >>= umin_val;
> +		if (src_known)
> +			dst_reg->var_off = tnum_rshift(dst_reg->var_off,
> +						       umin_val);
tnum_rshift is an unsigned shift, it won't do what you want here.
I think you could write a tnum_arshift that looks something like this
 (UNTESTED!):

    struct tnum tnum_arshift(struct tnum a, u8 shift)
    {
        return TNUM(((s64)a.value) >> shift, ((s64)a.mask) >> shift);
    }
Theory: if value sign bit is 1 then number is known negative so populate
 upper bits with known 1s.  If mask sign bit is 1 then number might be
 negative so populate upper bits with unknown.  Otherwise, number is
 known positive so populate upper bits with known 0s.

> +		else
> +			dst_reg->var_off = tnum_rshift(tnum_unknown, umin_val);
Applying the above here, tnum_arshift(tnum_unknown, ...) would always just
 return tnum_unknown, so just do "dst_reg->var_off = tnum_unknown;".
The reason for the corresponding logic in the BPF_RSH case is that a right
 logical shift _always_ populates upper bits with zeroes.
In any case these 'else' branches are currently never taken because they
 fall foul of the check Alexei added just before the switch,
    if (!src_known &&
        opcode != BPF_ADD && opcode != BPF_SUB && opcode != BPF_AND) {
        __mark_reg_unknown(dst_reg);
        return 0;
    }
So I can guarantee you haven't tested this code :-)

> +		dst_reg->umin_value >>= umax_val;
> +		dst_reg->umax_value >>= umin_val;
FWIW I think the way to handle umin/umax here is to blow them away and
 just rely on inferring new ubounds from the sbounds (i.e. the inverse of
 what we do just above in case BPF_RSH) since BPF_ARSH is essentially an
 operation on the signed value.  I don't think there is a need to support
 cases where the unsigned bounds of a signed shift of a value that may
 cross the sign boundary at (1<<63) are needed to verify a program.
(Unlike in the unsigned shift case, it is at least _possible_ for there to
 be information from the ubounds that we can't get from the sbounds - but
 it's a contrived case that isn't likely to be useful in real programs.)

-Ed
> +		/* We may learn something more from the var_off */
> +		__update_reg_bounds(dst_reg);
> +		break;
>  	default:
>  		mark_reg_unknown(env, regs, insn->dst_reg);
>  		break;

^ permalink raw reply

* [PATCH net-next 2/2] qed: Add configuration information to register dump and debug data
From: Denis Bolotin @ 2018-04-23 11:56 UTC (permalink / raw)
  To: davem; +Cc: netdev, ariel.elior, Denis Bolotin
In-Reply-To: <20180423115605.8531-1-denis.bolotin@cavium.com>

Configuration information is added to the debug data collection, in
addition to register dump.
Added qed_dbg_nvm_image() that receives an image type, allocates a
buffer and reads the image. The images are saved in the buffers and the
dump size is updated.

Signed-off-by: Denis Bolotin <denis.bolotin@cavium.com>
Signed-off-by: Ariel Elior <ariel.elior@cavium.com>
---
 drivers/net/ethernet/qlogic/qed/qed_debug.c | 113 +++++++++++++++++++++++++++-
 drivers/net/ethernet/qlogic/qed/qed_mcp.c   |  14 +++-
 drivers/net/ethernet/qlogic/qed/qed_mcp.h   |  14 ++++
 include/linux/qed/qed_if.h                  |   3 +
 4 files changed, 139 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_debug.c b/drivers/net/ethernet/qlogic/qed/qed_debug.c
index 4926c55..b3211c7 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_debug.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_debug.c
@@ -7778,6 +7778,57 @@ int qed_dbg_igu_fifo_size(struct qed_dev *cdev)
 	return qed_dbg_feature_size(cdev, DBG_FEATURE_IGU_FIFO);
 }
 
+int qed_dbg_nvm_image_length(struct qed_hwfn *p_hwfn,
+			     enum qed_nvm_images image_id, u32 *length)
+{
+	struct qed_nvm_image_att image_att;
+	int rc;
+
+	*length = 0;
+	rc = qed_mcp_get_nvm_image_att(p_hwfn, image_id, &image_att);
+	if (rc)
+		return rc;
+
+	*length = image_att.length;
+
+	return rc;
+}
+
+int qed_dbg_nvm_image(struct qed_dev *cdev, void *buffer,
+		      u32 *num_dumped_bytes, enum qed_nvm_images image_id)
+{
+	struct qed_hwfn *p_hwfn =
+		&cdev->hwfns[cdev->dbg_params.engine_for_debug];
+	u32 len_rounded, i;
+	__be32 val;
+	int rc;
+
+	*num_dumped_bytes = 0;
+	rc = qed_dbg_nvm_image_length(p_hwfn, image_id, &len_rounded);
+	if (rc)
+		return rc;
+
+	DP_NOTICE(p_hwfn->cdev,
+		  "Collecting a debug feature [\"nvram image %d\"]\n",
+		  image_id);
+
+	len_rounded = roundup(len_rounded, sizeof(u32));
+	rc = qed_mcp_get_nvm_image(p_hwfn, image_id, buffer, len_rounded);
+	if (rc)
+		return rc;
+
+	/* QED_NVM_IMAGE_NVM_META image is not swapped like other images */
+	if (image_id != QED_NVM_IMAGE_NVM_META)
+		for (i = 0; i < len_rounded; i += 4) {
+			val = cpu_to_be32(*(u32 *)(buffer + i));
+			*(u32 *)(buffer + i) = val;
+		}
+
+	*num_dumped_bytes = len_rounded;
+
+	return rc;
+}
+
 int qed_dbg_protection_override(struct qed_dev *cdev, void *buffer,
 				u32 *num_dumped_bytes)
 {
@@ -7831,6 +7882,9 @@ enum debug_print_features {
 	IGU_FIFO = 6,
 	PHY = 7,
 	FW_ASSERTS = 8,
+	NVM_CFG1 = 9,
+	DEFAULT_CFG = 10,
+	NVM_META = 11,
 };
 
 static u32 qed_calc_regdump_header(enum debug_print_features feature,
@@ -7965,13 +8019,61 @@ int qed_dbg_all_data(struct qed_dev *cdev, void *buffer)
 		DP_ERR(cdev, "qed_dbg_mcp_trace failed. rc = %d\n", rc);
 	}
 
+	/* nvm cfg1 */
+	rc = qed_dbg_nvm_image(cdev,
+			       (u8 *)buffer + offset + REGDUMP_HEADER_SIZE,
+			       &feature_size, QED_NVM_IMAGE_NVM_CFG1);
+	if (!rc) {
+		*(u32 *)((u8 *)buffer + offset) =
+		    qed_calc_regdump_header(NVM_CFG1, cur_engine,
+					    feature_size, omit_engine);
+		offset += (feature_size + REGDUMP_HEADER_SIZE);
+	} else if (rc != -ENOENT) {
+		DP_ERR(cdev,
+		       "qed_dbg_nvm_image failed for image  %d (%s), rc = %d\n",
+		       QED_NVM_IMAGE_NVM_CFG1, "QED_NVM_IMAGE_NVM_CFG1", rc);
+	}
+
+	/* nvm default */
+	rc = qed_dbg_nvm_image(cdev,
+			       (u8 *)buffer + offset + REGDUMP_HEADER_SIZE,
+			       &feature_size, QED_NVM_IMAGE_DEFAULT_CFG);
+	if (!rc) {
+		*(u32 *)((u8 *)buffer + offset) =
+		    qed_calc_regdump_header(DEFAULT_CFG, cur_engine,
+					    feature_size, omit_engine);
+		offset += (feature_size + REGDUMP_HEADER_SIZE);
+	} else if (rc != -ENOENT) {
+		DP_ERR(cdev,
+		       "qed_dbg_nvm_image failed for image %d (%s), rc = %d\n",
+		       QED_NVM_IMAGE_DEFAULT_CFG, "QED_NVM_IMAGE_DEFAULT_CFG",
+		       rc);
+	}
+
+	/* nvm meta */
+	rc = qed_dbg_nvm_image(cdev,
+			       (u8 *)buffer + offset + REGDUMP_HEADER_SIZE,
+			       &feature_size, QED_NVM_IMAGE_NVM_META);
+	if (!rc) {
+		*(u32 *)((u8 *)buffer + offset) =
+		    qed_calc_regdump_header(NVM_META, cur_engine,
+					    feature_size, omit_engine);
+		offset += (feature_size + REGDUMP_HEADER_SIZE);
+	} else if (rc != -ENOENT) {
+		DP_ERR(cdev,
+		       "qed_dbg_nvm_image failed for image %d (%s), rc = %d\n",
+		       QED_NVM_IMAGE_NVM_META, "QED_NVM_IMAGE_NVM_META", rc);
+	}
+
 	return 0;
 }
 
 int qed_dbg_all_data_size(struct qed_dev *cdev)
 {
+	struct qed_hwfn *p_hwfn =
+		&cdev->hwfns[cdev->dbg_params.engine_for_debug];
+	u32 regs_len = 0, image_len = 0;
 	u8 cur_engine, org_engine;
-	u32 regs_len = 0;
 
 	org_engine = qed_get_debug_engine(cdev);
 	for (cur_engine = 0; cur_engine < cdev->num_hwfns; cur_engine++) {
@@ -7993,6 +8095,15 @@ int qed_dbg_all_data_size(struct qed_dev *cdev)
 
 	/* Engine common */
 	regs_len += REGDUMP_HEADER_SIZE + qed_dbg_mcp_trace_size(cdev);
+	qed_dbg_nvm_image_length(p_hwfn, QED_NVM_IMAGE_NVM_CFG1, &image_len);
+	if (image_len)
+		regs_len += REGDUMP_HEADER_SIZE + image_len;
+	qed_dbg_nvm_image_length(p_hwfn, QED_NVM_IMAGE_DEFAULT_CFG, &image_len);
+	if (image_len)
+		regs_len += REGDUMP_HEADER_SIZE + image_len;
+	qed_dbg_nvm_image_length(p_hwfn, QED_NVM_IMAGE_NVM_META, &image_len);
+	if (image_len)
+		regs_len += REGDUMP_HEADER_SIZE + image_len;
 
 	return regs_len;
 }
diff --git a/drivers/net/ethernet/qlogic/qed/qed_mcp.c b/drivers/net/ethernet/qlogic/qed/qed_mcp.c
index 1377ad1..0550f0e 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_mcp.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_mcp.c
@@ -2529,7 +2529,7 @@ int qed_mcp_nvm_info_populate(struct qed_hwfn *p_hwfn)
 	return rc;
 }
 
-static int
+int
 qed_mcp_get_nvm_image_att(struct qed_hwfn *p_hwfn,
 			  enum qed_nvm_images image_id,
 			  struct qed_nvm_image_att *p_image_att)
@@ -2545,6 +2545,15 @@ int qed_mcp_nvm_info_populate(struct qed_hwfn *p_hwfn)
 	case QED_NVM_IMAGE_FCOE_CFG:
 		type = NVM_TYPE_FCOE_CFG;
 		break;
+	case QED_NVM_IMAGE_NVM_CFG1:
+		type = NVM_TYPE_NVM_CFG1;
+		break;
+	case QED_NVM_IMAGE_DEFAULT_CFG:
+		type = NVM_TYPE_DEFAULT_CFG;
+		break;
+	case QED_NVM_IMAGE_NVM_META:
+		type = NVM_TYPE_META;
+		break;
 	default:
 		DP_NOTICE(p_hwfn, "Unknown request of image_id %08x\n",
 			  image_id);
@@ -2588,9 +2597,6 @@ int qed_mcp_get_nvm_image(struct qed_hwfn *p_hwfn,
 		return -EINVAL;
 	}
 
-	/* Each NVM image is suffixed by CRC; Upper-layer has no need for it */
-	image_att.length -= 4;
-
 	if (image_att.length > buffer_len) {
 		DP_VERBOSE(p_hwfn,
 			   QED_MSG_STORAGE,
diff --git a/drivers/net/ethernet/qlogic/qed/qed_mcp.h b/drivers/net/ethernet/qlogic/qed/qed_mcp.h
index dd62c38..3af3896 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_mcp.h
+++ b/drivers/net/ethernet/qlogic/qed/qed_mcp.h
@@ -486,6 +486,20 @@ struct qed_nvm_image_att {
  * @brief Allows reading a whole nvram image
  *
  * @param p_hwfn
+ * @param image_id - image to get attributes for
+ * @param p_image_att - image attributes structure into which to fill data
+ *
+ * @return int - 0 - operation was successful.
+ */
+int
+qed_mcp_get_nvm_image_att(struct qed_hwfn *p_hwfn,
+			  enum qed_nvm_images image_id,
+			  struct qed_nvm_image_att *p_image_att);
+
+/**
+ * @brief Allows reading a whole nvram image
+ *
+ * @param p_hwfn
  * @param image_id - image requested for reading
  * @param p_buffer - allocated buffer into which to fill data
  * @param buffer_len - length of the allocated buffer.
diff --git a/include/linux/qed/qed_if.h b/include/linux/qed/qed_if.h
index b5b2bc9..e53f9c7 100644
--- a/include/linux/qed/qed_if.h
+++ b/include/linux/qed/qed_if.h
@@ -159,6 +159,9 @@ struct qed_dcbx_get {
 enum qed_nvm_images {
 	QED_NVM_IMAGE_ISCSI_CFG,
 	QED_NVM_IMAGE_FCOE_CFG,
+	QED_NVM_IMAGE_NVM_CFG1,
+	QED_NVM_IMAGE_DEFAULT_CFG,
+	QED_NVM_IMAGE_NVM_META,
 };
 
 struct qed_link_eee_params {
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next 1/2] qed: Delete unused parameter p_ptt from mcp APIs
From: Denis Bolotin @ 2018-04-23 11:56 UTC (permalink / raw)
  To: davem; +Cc: netdev, ariel.elior, Denis Bolotin
In-Reply-To: <20180423115605.8531-1-denis.bolotin@cavium.com>

Since nvm images attributes are cached during driver load, acquiring ptt
is not needed when calling qed_mcp_get_nvm_image().

Signed-off-by: Denis Bolotin <denis.bolotin@cavium.com>
Signed-off-by: Ariel Elior <ariel.elior@cavium.com>
---
 drivers/net/ethernet/qlogic/qed/qed_main.c | 9 +--------
 drivers/net/ethernet/qlogic/qed/qed_mcp.c  | 4 +---
 drivers/net/ethernet/qlogic/qed/qed_mcp.h  | 2 --
 3 files changed, 2 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_main.c b/drivers/net/ethernet/qlogic/qed/qed_main.c
index 9854aa9..d1d3787 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_main.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_main.c
@@ -1894,15 +1894,8 @@ static int qed_nvm_get_image(struct qed_dev *cdev, enum qed_nvm_images type,
 			     u8 *buf, u16 len)
 {
 	struct qed_hwfn *hwfn = QED_LEADING_HWFN(cdev);
-	struct qed_ptt *ptt = qed_ptt_acquire(hwfn);
-	int rc;
-
-	if (!ptt)
-		return -EAGAIN;
 
-	rc = qed_mcp_get_nvm_image(hwfn, ptt, type, buf, len);
-	qed_ptt_release(hwfn, ptt);
-	return rc;
+	return qed_mcp_get_nvm_image(hwfn, type, buf, len);
 }
 
 static int qed_set_coalesce(struct qed_dev *cdev, u16 rx_coal, u16 tx_coal,
diff --git a/drivers/net/ethernet/qlogic/qed/qed_mcp.c b/drivers/net/ethernet/qlogic/qed/qed_mcp.c
index ec0d425..1377ad1 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_mcp.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_mcp.c
@@ -2531,7 +2531,6 @@ int qed_mcp_nvm_info_populate(struct qed_hwfn *p_hwfn)
 
 static int
 qed_mcp_get_nvm_image_att(struct qed_hwfn *p_hwfn,
-			  struct qed_ptt *p_ptt,
 			  enum qed_nvm_images image_id,
 			  struct qed_nvm_image_att *p_image_att)
 {
@@ -2569,7 +2568,6 @@ int qed_mcp_nvm_info_populate(struct qed_hwfn *p_hwfn)
 }
 
 int qed_mcp_get_nvm_image(struct qed_hwfn *p_hwfn,
-			  struct qed_ptt *p_ptt,
 			  enum qed_nvm_images image_id,
 			  u8 *p_buffer, u32 buffer_len)
 {
@@ -2578,7 +2576,7 @@ int qed_mcp_get_nvm_image(struct qed_hwfn *p_hwfn,
 
 	memset(p_buffer, 0, buffer_len);
 
-	rc = qed_mcp_get_nvm_image_att(p_hwfn, p_ptt, image_id, &image_att);
+	rc = qed_mcp_get_nvm_image_att(p_hwfn, image_id, &image_att);
 	if (rc)
 		return rc;
 
diff --git a/drivers/net/ethernet/qlogic/qed/qed_mcp.h b/drivers/net/ethernet/qlogic/qed/qed_mcp.h
index 8a5c988..dd62c38 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_mcp.h
+++ b/drivers/net/ethernet/qlogic/qed/qed_mcp.h
@@ -486,7 +486,6 @@ struct qed_nvm_image_att {
  * @brief Allows reading a whole nvram image
  *
  * @param p_hwfn
- * @param p_ptt
  * @param image_id - image requested for reading
  * @param p_buffer - allocated buffer into which to fill data
  * @param buffer_len - length of the allocated buffer.
@@ -494,7 +493,6 @@ struct qed_nvm_image_att {
  * @return 0 iff p_buffer now contains the nvram image.
  */
 int qed_mcp_get_nvm_image(struct qed_hwfn *p_hwfn,
-			  struct qed_ptt *p_ptt,
 			  enum qed_nvm_images image_id,
 			  u8 *p_buffer, u32 buffer_len);
 
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next 0/2] Add configuration information to register dump and debug data
From: Denis Bolotin @ 2018-04-23 11:56 UTC (permalink / raw)
  To: davem; +Cc: netdev, ariel.elior, Denis Bolotin

The purpose of this patchset is to add configuration information to the
debug data collection, which already contains register dump.
The first patch (removing the ptt) is essential because it prevents the
unnecessary ptt acquirement when calling mcp APIs.

Denis Bolotin (2):
  qed: Delete unused parameter p_ptt from mcp APIs
  qed: Add configuration information to register dump and debug data

 drivers/net/ethernet/qlogic/qed/qed_debug.c | 113 +++++++++++++++++++++++++++-
 drivers/net/ethernet/qlogic/qed/qed_main.c  |   9 +--
 drivers/net/ethernet/qlogic/qed/qed_mcp.c   |  18 +++--
 drivers/net/ethernet/qlogic/qed/qed_mcp.h   |  16 +++-
 include/linux/qed/qed_if.h                  |   3 +
 5 files changed, 141 insertions(+), 18 deletions(-)

-- 
1.8.3.1

^ permalink raw reply

* Re: kTLS in combination with mlx4 is very unstable
From: Andre Tomt @ 2018-04-23 11:44 UTC (permalink / raw)
  To: netdev; +Cc: davejwatson, Tariq Toukan
In-Reply-To: <bfb1e726-24ea-89e9-61e7-8e43e82cd23b@tomt.net>

On 22. april 2018 23:21, Andre Tomt wrote:
> Hello!
> 
> kTLS looks fun, so I decided to play with it. It is quite spiffy - 
> however with mlx4 I get kernel crashes I'm not seeing when testing on 
> ixgbe.
> 
> For testing I'm using a git build of the "stream reflector" cubemap[1] 
> configured with kTLS and 8 worker threads running on 4 physical cores, 
> loading it up with a ~13Mbps MPEG-TS stream pulled from satelite TV.
> 
> The kernel seems to get increasingly unstable as I load it up with 
> client connections. At about 9Gbps and 700 connections, it is okay at 
> least for a while - it might run fine for say 45 minutes. Once it gets 
> to 20 - 30Gbps, the kernel will usually start spewing OOPSes within 
> minutes and the traffic drops.
> 
> Some bad interaction between mlx4 and kTLS?
> 
> Hardware is a quad core Xeon-D 1520 using a dual port Mellanox 
> ConnectX-3 VPI with a single 40Gbps ethernet link configured. Mellanox 
> mlx4 driver settings are kernel.org upstream defaults. Interface is 
> configured with FQ qdisc and sockets are using BBR congestion control.
> 
> Tested on kernel 4.14.34, 4.15.17, and 4.16.2 - 4.16.3.
> 
> [1] https://git.sesse.net/?p=cubemap
> 
> First OOPS (from 4.16.3)

It also blows up with a similar trace on 4.17-rc2.

^ permalink raw reply

* [iptables 2/2] extensions: libip6t_srh: add test-cases for matching previous, next and last SID
From: Ahmed Abdelsalam @ 2018-04-23 10:48 UTC (permalink / raw)
  To: pablo, fw, davem, dav.lebrun, linux-kernel, netfilter-devel,
	coreteam, netdev
  Cc: Ahmed Abdelsalam
In-Reply-To: <1524480503-1883-1-git-send-email-amsalam20@gmail.com>

This patch adds some test-cases to "libip6t_srh.t" for matching previous SID,
next SID, and last SID.

Signed-off-by: Ahmed Abdelsalam <amsalam20@gmail.com>
---
 extensions/libip6t_srh.t | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/extensions/libip6t_srh.t b/extensions/libip6t_srh.t
index 08897d5..88a379e 100644
--- a/extensions/libip6t_srh.t
+++ b/extensions/libip6t_srh.t
@@ -23,4 +23,8 @@
 -m srh ! --srh-tag 0;=;OK
 -m srh --srh-next-hdr 17 --srh-segs-left-eq 1 --srh-last-entry-eq 4 --srh-tag 0;=;OK
 -m srh ! --srh-next-hdr 17 ! --srh-segs-left-eq 0 --srh-tag 0;=;OK
+-m srh --srh-psid A::2/64 --srh-nsid B2::/128 --srh-lsid C::/0;=;OK
+-m srh ! --srh-psid A::2/64 ! --srh-nsid B2::/128 ! --srh-lsid C::/0;=;OK
+-m srh --srh-psid A::2 --srh-nsid B2:: --srh-lsid C::;=;OK
+-m srh ! --srh-psid A::2 ! --srh-nsid B2:: ! --srh-lsid C::;=;OK
 -m srh;=;OK
-- 
2.1.4

^ permalink raw reply related

* [nf-next] netfilter: extend SRH match to support matching previous, next and last SID
From: Ahmed Abdelsalam @ 2018-04-23 10:48 UTC (permalink / raw)
  To: pablo, fw, davem, dav.lebrun, linux-kernel, netfilter-devel,
	coreteam, netdev
  Cc: Ahmed Abdelsalam
In-Reply-To: <1524480503-1883-1-git-send-email-amsalam20@gmail.com>

IPv6 Segment Routing Header (SRH) contains a list of SIDs to be crossed by
SR encapsulated packet. Each SID is encoded as an IPv6 prefix.

When a Firewall receives an SR encapsulated packet, it should be able to
identify which node previously processed the packet (previous SID), which
node is going to process the packet next (next SID), and which node is the
last to process the packet (last SID) which represent the final destination
of the packet in case of inline SR mode.

An example use-case of using these features could be SID list that includes
two firewalls. When the second firewall receives a packet, it can check
whether the packet has been processed by the first firewall or not. Based on
that check, it decides to apply all rules, apply just subset of the rules,
or totally skip all rules and forward the packet to the next SID.

This patch extends SRH match to support matching previous SID, next SID, and
last SID.

Signed-off-by: Ahmed Abdelsalam <amsalam20@gmail.com>
---
 include/uapi/linux/netfilter_ipv6/ip6t_srh.h | 22 +++++++++++++--
 net/ipv6/netfilter/ip6t_srh.c                | 41 +++++++++++++++++++++++++++-
 2 files changed, 60 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/netfilter_ipv6/ip6t_srh.h b/include/uapi/linux/netfilter_ipv6/ip6t_srh.h
index f3cc0ef..9808382 100644
--- a/include/uapi/linux/netfilter_ipv6/ip6t_srh.h
+++ b/include/uapi/linux/netfilter_ipv6/ip6t_srh.h
@@ -17,7 +17,10 @@
 #define IP6T_SRH_LAST_GT        0x0100
 #define IP6T_SRH_LAST_LT        0x0200
 #define IP6T_SRH_TAG            0x0400
-#define IP6T_SRH_MASK           0x07FF
+#define IP6T_SRH_PSID           0x0800
+#define IP6T_SRH_NSID           0x1000
+#define IP6T_SRH_LSID           0x2000
+#define IP6T_SRH_MASK           0x3FFF
 
 /* Values for "mt_invflags" field in struct ip6t_srh */
 #define IP6T_SRH_INV_NEXTHDR    0x0001
@@ -31,7 +34,10 @@
 #define IP6T_SRH_INV_LAST_GT    0x0100
 #define IP6T_SRH_INV_LAST_LT    0x0200
 #define IP6T_SRH_INV_TAG        0x0400
-#define IP6T_SRH_INV_MASK       0x07FF
+#define IP6T_SRH_INV_PSID       0x0800
+#define IP6T_SRH_INV_NSID       0x1000
+#define IP6T_SRH_INV_LSID       0x2000
+#define IP6T_SRH_INV_MASK       0x3FFF
 
 /**
  *      struct ip6t_srh - SRH match options
@@ -40,6 +46,12 @@
  *      @ segs_left: Segments left field of SRH
  *      @ last_entry: Last entry field of SRH
  *      @ tag: Tag field of SRH
+ *      @ psid_addr: Address of previous SID in SRH SID list
+ *      @ nsid_addr: Address of NEXT SID in SRH SID list
+ *      @ lsid_addr: Address of LAST SID in SRH SID list
+ *      @ psid_msk: Mask of previous SID in SRH SID list
+ *      @ nsid_msk: Mask of next SID in SRH SID list
+ *      @ lsid_msk: MAsk of last SID in SRH SID list
  *      @ mt_flags: match options
  *      @ mt_invflags: Invert the sense of match options
  */
@@ -50,6 +62,12 @@ struct ip6t_srh {
 	__u8                    segs_left;
 	__u8                    last_entry;
 	__u16                   tag;
+	struct in6_addr		psid_addr;
+	struct in6_addr		nsid_addr;
+	struct in6_addr		lsid_addr;
+	struct in6_addr		psid_msk;
+	struct in6_addr		nsid_msk;
+	struct in6_addr		lsid_msk;
 	__u16                   mt_flags;
 	__u16                   mt_invflags;
 };
diff --git a/net/ipv6/netfilter/ip6t_srh.c b/net/ipv6/netfilter/ip6t_srh.c
index 33719d5..2b5cc73 100644
--- a/net/ipv6/netfilter/ip6t_srh.c
+++ b/net/ipv6/netfilter/ip6t_srh.c
@@ -30,7 +30,9 @@ static bool srh_mt6(const struct sk_buff *skb, struct xt_action_param *par)
 	const struct ip6t_srh *srhinfo = par->matchinfo;
 	struct ipv6_sr_hdr *srh;
 	struct ipv6_sr_hdr _srh;
-	int hdrlen, srhoff = 0;
+	int hdrlen, psidoff, nsidoff, lsidoff, srhoff = 0;
+	struct in6_addr *psid, *nsid, *lsid;
+	struct in6_addr _psid, _nsid, _lsid;
 
 	if (ipv6_find_hdr(skb, &srhoff, IPPROTO_ROUTING, NULL, NULL) < 0)
 		return false;
@@ -114,6 +116,43 @@ static bool srh_mt6(const struct sk_buff *skb, struct xt_action_param *par)
 		if (NF_SRH_INVF(srhinfo, IP6T_SRH_INV_TAG,
 				!(srh->tag == srhinfo->tag)))
 			return false;
+
+	/* Previous SID matching */
+	if (srhinfo->mt_flags & IP6T_SRH_PSID) {
+		if (srh->segments_left == srh->first_segment)
+			return false;
+		psidoff = srhoff + sizeof(struct ipv6_sr_hdr) +
+			  ((srh->segments_left + 1) * sizeof(struct in6_addr));
+		psid = skb_header_pointer(skb, psidoff, sizeof(_psid), &_psid);
+		if (NF_SRH_INVF(srhinfo, IP6T_SRH_INV_PSID,
+				ipv6_masked_addr_cmp(psid, &srhinfo->psid_msk,
+						     &srhinfo->psid_addr)))
+			return false;
+	}
+
+	/* Next SID matching */
+	if (srhinfo->mt_flags & IP6T_SRH_NSID) {
+		if (srh->segments_left == 0)
+			return false;
+		nsidoff = srhoff + sizeof(struct ipv6_sr_hdr) +
+			  ((srh->segments_left - 1) * sizeof(struct in6_addr));
+		nsid = skb_header_pointer(skb, nsidoff, sizeof(_nsid), &_nsid);
+		if (NF_SRH_INVF(srhinfo, IP6T_SRH_INV_NSID,
+				ipv6_masked_addr_cmp(nsid, &srhinfo->nsid_msk,
+						     &srhinfo->nsid_addr)))
+			return false;
+	}
+
+	/* Last SID matching */
+	if (srhinfo->mt_flags & IP6T_SRH_LSID) {
+		lsidoff = srhoff + sizeof(struct ipv6_sr_hdr);
+		lsid = skb_header_pointer(skb, lsidoff, sizeof(_lsid), &_lsid);
+		if (NF_SRH_INVF(srhinfo, IP6T_SRH_INV_LSID,
+				ipv6_masked_addr_cmp(lsid, &srhinfo->lsid_msk,
+						     &srhinfo->lsid_addr)))
+			return false;
+	}
+
 	return true;
 }
 
-- 
2.1.4

^ permalink raw reply related

* [iptables 1/2] extensions: libip6t_srh: support matching previous, next and last SID
From: Ahmed Abdelsalam @ 2018-04-23 10:48 UTC (permalink / raw)
  To: pablo, fw, davem, dav.lebrun, linux-kernel, netfilter-devel,
	coreteam, netdev
  Cc: Ahmed Abdelsalam

This patch extends the libip6t_srh shared library to support matching
previous SID, next SID, and last SID.

Signed-off-by: Ahmed Abdelsalam <amsalam20@gmail.com>
---
 extensions/libip6t_srh.c                | 65 ++++++++++++++++++++++++++++++++-
 include/linux/netfilter_ipv6/ip6t_srh.h | 22 ++++++++++-
 2 files changed, 84 insertions(+), 3 deletions(-)

diff --git a/extensions/libip6t_srh.c b/extensions/libip6t_srh.c
index ac0ae08..5acc2ee 100644
--- a/extensions/libip6t_srh.c
+++ b/extensions/libip6t_srh.c
@@ -22,6 +22,9 @@ enum {
 	O_SRH_LAST_GT,
 	O_SRH_LAST_LT,
 	O_SRH_TAG,
+	O_SRH_PSID,
+	O_SRH_NSID,
+	O_SRH_LSID,
 };
 
 static void srh_help(void)
@@ -38,7 +41,10 @@ static void srh_help(void)
 "[!] --srh-last-entry-eq 	last_entry      Last Entry value of SRH\n"
 "[!] --srh-last-entry-gt 	last_entry      Last Entry value of SRH\n"
 "[!] --srh-last-entry-lt 	last_entry      Last Entry value of SRH\n"
-"[!] --srh-tag			tag             Tag value of SRH\n");
+"[!] --srh-tag			tag             Tag value of SRH\n"
+"[!] --srh-psid			addr[/mask]	SRH previous SID\n"
+"[!] --srh-nsid			addr[/mask]	SRH next SID\n"
+"[!] --srh-lsid			addr[/mask]	SRH Last SID\n");
 }
 
 #define s struct ip6t_srh
@@ -65,6 +71,12 @@ static const struct xt_option_entry srh_opts[] = {
 	.flags = XTOPT_INVERT | XTOPT_PUT, XTOPT_POINTER(s, last_entry)},
 	{ .name = "srh-tag", .id = O_SRH_TAG, .type = XTTYPE_UINT16,
 	.flags = XTOPT_INVERT | XTOPT_PUT, XTOPT_POINTER(s, tag)},
+	{ .name = "srh-psid", .id = O_SRH_PSID, .type = XTTYPE_HOSTMASK,
+	.flags = XTOPT_INVERT},
+	{ .name = "srh-nsid", .id = O_SRH_NSID, .type = XTTYPE_HOSTMASK,
+	.flags = XTOPT_INVERT},
+	{ .name = "srh-lsid", .id = O_SRH_LSID, .type = XTTYPE_HOSTMASK,
+	.flags = XTOPT_INVERT},
 	{ }
 };
 #undef s
@@ -75,6 +87,12 @@ static void srh_init(struct xt_entry_match *m)
 
 	srhinfo->mt_flags = 0;
 	srhinfo->mt_invflags = 0;
+	memset(srhinfo->psid_addr.s6_addr, 0, sizeof(srhinfo->psid_addr.s6_addr));
+	memset(srhinfo->nsid_addr.s6_addr, 0, sizeof(srhinfo->nsid_addr.s6_addr));
+	memset(srhinfo->lsid_addr.s6_addr, 0, sizeof(srhinfo->lsid_addr.s6_addr));
+	memset(srhinfo->psid_msk.s6_addr, 0, sizeof(srhinfo->psid_msk.s6_addr));
+	memset(srhinfo->nsid_msk.s6_addr, 0, sizeof(srhinfo->nsid_msk.s6_addr));
+	memset(srhinfo->lsid_msk.s6_addr, 0, sizeof(srhinfo->lsid_msk.s6_addr));
 }
 
 static void srh_parse(struct xt_option_call *cb)
@@ -138,6 +156,27 @@ static void srh_parse(struct xt_option_call *cb)
 		if (cb->invert)
 			srhinfo->mt_invflags |= IP6T_SRH_INV_TAG;
 		break;
+	case O_SRH_PSID:
+		srhinfo->mt_flags |= IP6T_SRH_PSID;
+		srhinfo->psid_addr = cb->val.haddr.in6;
+		srhinfo->psid_msk  = cb->val.hmask.in6;
+		if (cb->invert)
+			srhinfo->mt_invflags |= IP6T_SRH_INV_PSID;
+		break;
+	case O_SRH_NSID:
+		srhinfo->mt_flags |= IP6T_SRH_NSID;
+		srhinfo->nsid_addr = cb->val.haddr.in6;
+		srhinfo->nsid_msk  = cb->val.hmask.in6;
+		if (cb->invert)
+			srhinfo->mt_invflags |= IP6T_SRH_INV_NSID;
+		break;
+	case O_SRH_LSID:
+		srhinfo->mt_flags |= IP6T_SRH_LSID;
+		srhinfo->lsid_addr = cb->val.haddr.in6;
+		srhinfo->lsid_msk  = cb->val.hmask.in6;
+		if (cb->invert)
+			srhinfo->mt_invflags |= IP6T_SRH_INV_LSID;
+		break;
 	}
 }
 
@@ -180,6 +219,18 @@ static void srh_print(const void *ip, const struct xt_entry_match *match,
 	if (srhinfo->mt_flags & IP6T_SRH_TAG)
 		printf(" tag:%s%d", srhinfo->mt_invflags & IP6T_SRH_INV_TAG ? "!" : "",
 			srhinfo->tag);
+	if (srhinfo->mt_flags & IP6T_SRH_PSID)
+		printf(" psid %s %s/%u", srhinfo->mt_invflags & IP6T_SRH_INV_PSID ? "!" : "",
+			xtables_ip6addr_to_numeric(&srhinfo->psid_addr),
+			xtables_ip6mask_to_cidr(&srhinfo->psid_msk));
+	if (srhinfo->mt_flags & IP6T_SRH_NSID)
+		printf(" nsid %s %s/%u", srhinfo->mt_invflags & IP6T_SRH_INV_NSID ? "!" : "",
+			xtables_ip6addr_to_numeric(&srhinfo->nsid_addr),
+			xtables_ip6mask_to_cidr(&srhinfo->nsid_msk));
+	if (srhinfo->mt_flags & IP6T_SRH_LSID)
+		printf(" lsid %s %s/%u", srhinfo->mt_invflags & IP6T_SRH_INV_LSID ? "!" : "",
+			xtables_ip6addr_to_numeric(&srhinfo->lsid_addr),
+			xtables_ip6mask_to_cidr(&srhinfo->lsid_msk));
 }
 
 static void srh_save(const void *ip, const struct xt_entry_match *match)
@@ -219,6 +270,18 @@ static void srh_save(const void *ip, const struct xt_entry_match *match)
 	if (srhinfo->mt_flags & IP6T_SRH_TAG)
 		printf("%s --srh-tag %u", (srhinfo->mt_invflags & IP6T_SRH_INV_TAG) ? " !" : "",
 			srhinfo->tag);
+	if (srhinfo->mt_flags & IP6T_SRH_PSID)
+		printf("%s --srh-psid %s/%u", srhinfo->mt_invflags & IP6T_SRH_INV_PSID ? " !" : "",
+			xtables_ip6addr_to_numeric(&srhinfo->psid_addr),
+			xtables_ip6mask_to_cidr(&srhinfo->psid_msk));
+	if (srhinfo->mt_flags & IP6T_SRH_NSID)
+		printf("%s --srh-nsid %s/%u", srhinfo->mt_invflags & IP6T_SRH_INV_NSID ? " !" : "",
+			xtables_ip6addr_to_numeric(&srhinfo->nsid_addr),
+			xtables_ip6mask_to_cidr(&srhinfo->nsid_msk));
+	if (srhinfo->mt_flags & IP6T_SRH_LSID)
+		printf("%s --srh-lsid %s/%u", srhinfo->mt_invflags & IP6T_SRH_INV_LSID ? " !" : "",
+			xtables_ip6addr_to_numeric(&srhinfo->lsid_addr),
+			xtables_ip6mask_to_cidr(&srhinfo->lsid_msk));
 }
 
 static struct xtables_match srh_mt6_reg = {
diff --git a/include/linux/netfilter_ipv6/ip6t_srh.h b/include/linux/netfilter_ipv6/ip6t_srh.h
index 087efa1..3d77241 100644
--- a/include/linux/netfilter_ipv6/ip6t_srh.h
+++ b/include/linux/netfilter_ipv6/ip6t_srh.h
@@ -16,7 +16,10 @@
 #define IP6T_SRH_LAST_GT        0x0100
 #define IP6T_SRH_LAST_LT        0x0200
 #define IP6T_SRH_TAG            0x0400
-#define IP6T_SRH_MASK           0x07FF
+#define IP6T_SRH_PSID           0x0800
+#define IP6T_SRH_NSID           0x1000
+#define IP6T_SRH_LSID           0x2000
+#define IP6T_SRH_MASK           0x3FFF
 
 /* Values for "mt_invflags" field in struct ip6t_srh */
 #define IP6T_SRH_INV_NEXTHDR    0x0001
@@ -30,7 +33,10 @@
 #define IP6T_SRH_INV_LAST_GT    0x0100
 #define IP6T_SRH_INV_LAST_LT    0x0200
 #define IP6T_SRH_INV_TAG        0x0400
-#define IP6T_SRH_INV_MASK       0x07FF
+#define IP6T_SRH_INV_PSID       0x0800
+#define IP6T_SRH_INV_NSID       0x1000
+#define IP6T_SRH_INV_LSID       0x2000
+#define IP6T_SRH_INV_MASK       0x3FFF
 
 /**
  *      struct ip6t_srh - SRH match options
@@ -39,6 +45,12 @@
  *      @ segs_left: Segments left field of SRH
  *      @ last_entry: Last entry field of SRH
  *      @ tag: Tag field of SRH
+ *      @ psid_addr: Address of previous SID in SRH SID list
+ *      @ nsid_addr: Address of NEXT SID in SRH SID list
+ *      @ lsid_addr: Address of LAST SID in SRH SID list
+ *      @ psid_msk: Mask of previous SID in SRH SID list
+ *      @ nsid_msk: Mask of next SID in SRH SID list
+ *      @ lsid_msk: MAsk of last SID in SRH SID list
  *      @ mt_flags: match options
  *      @ mt_invflags: Invert the sense of match options
  */
@@ -49,6 +61,12 @@ struct ip6t_srh {
 	__u8                    segs_left;
 	__u8                    last_entry;
 	__u16                   tag;
+	struct in6_addr		psid_addr;
+	struct in6_addr		nsid_addr;
+	struct in6_addr		lsid_addr;
+	struct in6_addr		psid_msk;
+	struct in6_addr		nsid_msk;
+	struct in6_addr		lsid_msk;
 	__u16                   mt_flags;
 	__u16                   mt_invflags;
 };
-- 
2.1.4

^ permalink raw reply related

* RE: [RFC PATCH v2 03/14] xsk: add umem fill queue support and mmap
From: Karlsson, Magnus @ 2018-04-23 10:26 UTC (permalink / raw)
  To: 'Michael S. Tsirkin'
  Cc: 'Björn Töpel', Duyck, Alexander H,
	'alexander.duyck@gmail.com',
	'john.fastabend@gmail.com', 'ast@fb.com',
	'brouer@redhat.com',
	'willemdebruijn.kernel@gmail.com',
	'daniel@iogearbox.net', 'netdev@vger.kernel.org',
	'michael.lundkvist@ericsson.com', Brandeburg, Jesse,
	Singhai, Anjali, Zhang, Qi Z,
	'ravineet.singh@ericsson.com'
In-Reply-To: <AFED4FBCE79F3548A8F74434195ACE39588D2083@IRSMSX107.ger.corp.intel.com>

> -----Original Message-----
> From: Karlsson, Magnus
> Sent: Thursday, April 12, 2018 5:20 PM
> To: Michael S. Tsirkin <mst@redhat.com>
> Cc: Björn Töpel <bjorn.topel@gmail.com>; Duyck, Alexander H
> <alexander.h.duyck@intel.com>; alexander.duyck@gmail.com;
> john.fastabend@gmail.com; ast@fb.com; brouer@redhat.com;
> willemdebruijn.kernel@gmail.com; daniel@iogearbox.net;
> netdev@vger.kernel.org; michael.lundkvist@ericsson.com; Brandeburg,
> Jesse <jesse.brandeburg@intel.com>; Singhai, Anjali
> <anjali.singhai@intel.com>; Zhang, Qi Z <qi.z.zhang@intel.com>;
> ravineet.singh@ericsson.com
> Subject: RE: [RFC PATCH v2 03/14] xsk: add umem fill queue support and
> mmap
> 
> 
> 
> > -----Original Message-----
> > From: Michael S. Tsirkin [mailto:mst@redhat.com]
> > Sent: Thursday, April 12, 2018 4:05 PM
> > To: Karlsson, Magnus <magnus.karlsson@intel.com>
> > Cc: Björn Töpel <bjorn.topel@gmail.com>; Duyck, Alexander H
> > <alexander.h.duyck@intel.com>; alexander.duyck@gmail.com;
> > john.fastabend@gmail.com; ast@fb.com; brouer@redhat.com;
> > willemdebruijn.kernel@gmail.com; daniel@iogearbox.net;
> > netdev@vger.kernel.org; michael.lundkvist@ericsson.com; Brandeburg,
> > Jesse <jesse.brandeburg@intel.com>; Singhai, Anjali
> > <anjali.singhai@intel.com>; Zhang, Qi Z <qi.z.zhang@intel.com>;
> > ravineet.singh@ericsson.com
> > Subject: Re: [RFC PATCH v2 03/14] xsk: add umem fill queue support and
> > mmap
> >
> > On Thu, Apr 12, 2018 at 07:38:25AM +0000, Karlsson, Magnus wrote:
> > > I think you are definitely right in that there are ways in which we
> > > can improve performance here. That said, the current queue performs
> > > slightly better than the previous one we had that was more or less a
> > > copy of one of your first virtio 1.1 proposals from little over a
> > > year ago. It had bidirectional queues and a valid flag in the
> > > descriptor itself. The reason we abandoned this was not poor
> > > performance (it was good), but a need to go to unidirectional
> > > queues. Maybe I should have only changed that aspect and kept the valid
> flag.
> >
> > Is there a summary about unidirectional queues anywhere?  I'm curious
> > to know whether there are any lessons here to be learned for virtio or
> ptr_ring.
> 
> I did a quick hack in which I used your ptr_ring for the fill queue instead of
> our head/tail based one. In the corner cases (usually empty or usually full),
> there is basically no difference. But for the case when the queue is always
> half full, the ptr_ring implementation boosts the performance from 5.6 to 5.7
> Mpps (as there is no cache line bouncing in this case) on my system (slower
> than Björn's that was used for the numbers in the RFC).
> 
> So I think this should be implemented properly so we can get some real
> numbers.
> Especially since 0.1 Mpps with copies will likely become much more with
> zero-copy as we are really chasing cycles there. We will get back a better
> evaluation in a few days.
> 
> Thanks: Magnus
> 
> > --
> > MST

Hi Michael,

Sorry for the late reply. Been travelling. Björn and I have now
made a real implementation of the ptr_ring principles in the
af_xdp code. We just added a switch in bind (only for the purpose
of this test) to be able to pick what ring implementation to use
from the user space test program. The main difference between our
version of ptr_ring and our head/tail ring is that the ptr_ring
version uses the idx field to signal if the entry is available or
not (idx == 0 indicating empty descriptor) and that it does not
use the head and tail pointers at all. Even though it is not
a "ring of pointers" in our implementation, we will still call it
ptr_ring for the purpose of this mail.

In summary, our experiments show that the two rings perform the
same in our micro benchmarks when the queues are balanced and
rarely full or empty, but the head/tail version performs better
for RX when the queues are not perfectly balanced. Why is that?
We do not exactly know, but there are a number of differences
between a ptr_ring in the kernel and one between user and kernel
space for the use in af_xdp.

* The user space descriptors have to be validated as we are
  communicating between user space and kernel space. Done slightly
  differently for the two rings due to the batching below.

* The RX and TX ring have descriptors that are larger than one
  pointer, so need to have barriers here even with ptr_ring. We can
  not rely on address dependency because it is not a pointer.

* Batching performed slightly differently in both versions. We
  avoid touching head and tail for as long as possible. At the
  worst it is once per batch, but it might be much less than that
  on the consumer side. The drawback with the accesses to the
  head/tail pointers is that it usually ends up to be a cache
  line bounce. But with ptr_ring, the drawback is that it is
  always N writes (setting idx = 0) for a batch size of N. The
  good thing though, is that these will not incur any cache
  line bouncing if the rings are balanced (well, they will be
  read by the producer at some point, but only once per traversal
  of the ring).

Something to note is that we think that the head/tail version
provides an easier-to-use user space interface since the indexes start
from 0 instead of 1 as in the ptr_ring case. With ptr_ring you
have to teach the user space application writer not to use index
0. With the head/tail version no such restriction is needed.

Here are just some of the results for a workload where user space
is faster than kernel space. This is for the case in which the user
space program has no problem keeping up with the kernel.

head/tail 16-batch

  sock0@p3p2:16 rxdrop
                 pps        
rx              9,782,718   
tx              0           

  sock0@p3p2:16 l2fwd
                 pps        
rx              2,504,235   
tx              2,504,232   

ptr_ring 16-batch

  sock0@p3p2:16 rxdrop
                 pps        
rx              9,519,373   
tx              0           

  sock0@p3p2:16 l2fwd
                 pps        
rx              2,519,265   
tx              2,519,265   

ptr_ring with batch sizes calculated as in ptr_ring.h

  sock0@p3p2:16 rxdrop
                 pps        
rx              7,470,658   
tx              0           
^C

  sock0@p3p2:16 l2fwd
                 pps        
rx              2,431,701   
tx              2,431,701   

/Magnus

^ permalink raw reply

* [PATCH net-next 2/2 v1] netns: isolate seqnums to use per-netns locks
From: Christian Brauner @ 2018-04-23 10:24 UTC (permalink / raw)
  To: ebiederm, davem, netdev, linux-kernel
  Cc: avagin, ktkhai, serge, gregkh, Christian Brauner
In-Reply-To: <20180423102443.16627-1-christian.brauner@ubuntu.com>

Now that it's possible to have a different set of uevents in different
network namespaces, per-network namespace uevent sequence numbers are
introduced. This increases performance as locking is now restricted to the
network namespace affected by the uevent rather than locking everything.
Testing revealed significant performance improvements. For details see
"Testing" below.

Since commit 692ec06 ("netns: send uevent messages") network namespaces not
owned by the intial user namespace can be sent uevents from a sufficiently
privileged userspace process.
In order to send a uevent into a network namespace not owned by the initial
user namespace we currently still need to take the *global mutex* that
locks the uevent socket list even though the list *only contains network
namespaces owned by the initial user namespace*. This needs to be done
because the uevent counter is a global variable. Taking the global lock is
performance sensitive since a user on the host can spawn a pool of n
process that each create their own new user and network namespaces and then
go on to inject uevents in parallel into the network namespace of all of
these processes. This can have a significant performance impact for the
host's udevd since it means that there can be a lot of delay between a
device being added and the corresponding uevent being sent out and
available for processing by udevd. It also means that each network
namespace not owned by the initial user namespace which userspace has sent
a uevent to will need to wait until the lock becomes available.

Implementation:
This patch gives each network namespace its own uevent sequence number.
Each network namespace not owned by the initial user namespace receives its
own mutex. The struct uevent_sock is opaque to callers outside of kobject.c
so the mutex *can* and *is* only ever accessed in lib/kobject.c. In this
file it is clearly documented which lock has to be taken. All network
namespaces owned by the initial user namespace will still share the same
lock since they are all served sequentially via the uevent socket list.
This decouples the locking and ensures that the host retrieves uevents as
fast as possible even if there are a lot of uevents injected into network
namespaces not owned by the initial user namespace.  In addition, each
network namespace not owned by the initial user namespace does not have to
wait on any other network namespace not sharing the same user namespace.

Testing:
Two 4.17-rc1 test kernels were compiled. One with per netns uevent seqnums
with decoupled locking and one without. To ensure that testing made sense
both kernels carried the patch to remove network namespaces not owned by
the initial user namespace from the uevent socket list.
Three tests were constructed. All of them showed significant performance
improvements with per-netns uevent sequence numbers and decoupled locking.

 # Testcase 1:
   Only Injecting Uevents into network namespaces not owned by the initial
   user namespace.
   - created 1000 new user namespace + network namespace pairs
   - opened a uevent listener in each of those namespace pairs
   - injected uevents into each of those network namespaces 10,000 times
     meaning 10,000,000 (10 million) uevents were injected. (The high
     number of uevent injections should get rid of a lot of jitter.)
     The injection was done by fork()ing 1000 uevent injectors in a simple
     for-loop to ensure that uevents were injected in parallel.
   - mean transaction time was calculated:
     - *without* uevent sequence number namespacing: 67 μs
     - *with* uevent sequence number namespacing:    55 μs
     - makes a difference of:                        12 μs
   - a t-test was performed on the two data vectors which revealed
     shows significant performance improvements:
     Welch Two Sample t-test
     data:  x1 and y1
     t = 405.16, df = 18883000, p-value < 2.2e-16
     alternative hypothesis: true difference in means is not equal to 0
     95 percent confidence interval:
     12.14949 12.26761
     sample estimates:
     mean of x mean of y
     68.48594  56.27739

 # Testcase 2:
   Injecting Uevents into network namespaces not owned by the initial user
   namespace and network namespaces owned by the initial user namespace.
   - created 500 new user namespace + network namespace pairs
   - created 500 new network namespace pairs
   - opened a uevent listener in each of those namespace pairs
   - injected uevents into each of those network namespaces 10,000 times
     meaning 10,000,000 (10 million) uevents were injected. (The high
     number of uevent injections should get rid of a lot of jitter.)
     The injection was done by fork()ing 1000 uevent injectors in a simple
     for-loop to ensure that uevents were injected in parallel.
   - mean transaction time was calculated:
     - *without* uevent sequence number namespacing: 572 μs
     - *with* uevent sequence number namespacing:    514 μs
     - makes a difference of:                         58 μs
   - a t-test was performed on the two data vectors which revealed
     shows significant performance improvements:
     Welch Two Sample t-test
     data:  x2 and y2
     t = 38.685, df = 19682000, p-value < 2.2e-16
     alternative hypothesis: true difference in means is not equal to 0
     95 percent confidence interval:
     55.10630 60.98815
     sample estimates:
     mean of x mean of y
     572.9684  514.9211

 # Testcase 3:
   Created 500 new user namespace + network namespace pairs *without uevent
   listeners*
   - created 500 new network namespace pairs *without uevent listeners*
   - injected uevents into each of those network namespaces 10,000 times
     meaning 10,000,000 (10 million) uevents were injected. (The high number
     of uevent injections should get rid of a lot of jitter.)
     The injection was done by fork()ing 1000 uevent injectors in a simple
     for-loop to ensure that uevents were injected in parallel.
    - mean transaction time was calculated:
      - *without* uevent sequence number namespacing: 206 μs
      - *with* uevent sequence number namespacing:    163 μs
      - makes a difference of:                         43 μs
    - a t-test was performed on the two data vectors which revealed
      shows significant performance improvements:
      Welch Two Sample t-test
      data:  x3 and y3
      t = 58.37, df = 17711000, p-value < 2.2e-16
      alternative hypothesis: true difference in means is not equal to 0
      95 percent confidence interval:
      41.77860 44.68178
      sample estimates:
      mean of x mean of y
      207.2632  164.0330

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
Changelog v0->v1:
* add detailed test results to the commit message
* account for kernels compiled without CONFIG_NET
---
 include/linux/kobject.h     |   2 +
 include/net/net_namespace.h |   3 ++
 kernel/ksysfs.c             |  11 +++-
 lib/kobject_uevent.c        | 104 +++++++++++++++++++++++++++++-------
 net/core/net_namespace.c    |  14 +++++
 5 files changed, 114 insertions(+), 20 deletions(-)

diff --git a/include/linux/kobject.h b/include/linux/kobject.h
index 7f6f93c3df9c..4e608968907f 100644
--- a/include/linux/kobject.h
+++ b/include/linux/kobject.h
@@ -36,8 +36,10 @@
 extern char uevent_helper[];
 #endif
 
+#ifndef CONFIG_NET
 /* counter to tag the uevent, read only except for the kobject core */
 extern u64 uevent_seqnum;
+#endif
 
 /*
  * The actions here must match the index to the string array
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 47e35cce3b64..e4e171b1ba69 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -85,6 +85,8 @@ struct net {
 	struct sock		*genl_sock;
 
 	struct uevent_sock	*uevent_sock;		/* uevent socket */
+	/* counter to tag the uevent, read only except for the kobject core */
+	u64                     uevent_seqnum;
 
 	struct list_head 	dev_base_head;
 	struct hlist_head 	*dev_name_head;
@@ -189,6 +191,7 @@ extern struct list_head net_namespace_list;
 
 struct net *get_net_ns_by_pid(pid_t pid);
 struct net *get_net_ns_by_fd(int fd);
+u64 get_ns_uevent_seqnum_by_vpid(void);
 
 #ifdef CONFIG_SYSCTL
 void ipx_register_sysctl(void);
diff --git a/kernel/ksysfs.c b/kernel/ksysfs.c
index 46ba853656f6..075ec814381b 100644
--- a/kernel/ksysfs.c
+++ b/kernel/ksysfs.c
@@ -19,6 +19,7 @@
 #include <linux/sched.h>
 #include <linux/capability.h>
 #include <linux/compiler.h>
+#include <net/net_namespace.h>
 
 #include <linux/rcupdate.h>	/* rcu_expedited and rcu_normal */
 
@@ -33,7 +34,15 @@ static struct kobj_attribute _name##_attr = \
 static ssize_t uevent_seqnum_show(struct kobject *kobj,
 				  struct kobj_attribute *attr, char *buf)
 {
-	return sprintf(buf, "%llu\n", (unsigned long long)uevent_seqnum);
+	u64 seqnum;
+
+	#ifdef CONFIG_NET
+		seqnum = get_ns_uevent_seqnum_by_vpid();
+	#else
+		seqnum = uevent_seqnum;
+	#endif
+
+	return sprintf(buf, "%llu\n", (unsigned long long)seqnum);
 }
 KERNEL_ATTR_RO(uevent_seqnum);
 
diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
index f5f5038787ac..5da20def556d 100644
--- a/lib/kobject_uevent.c
+++ b/lib/kobject_uevent.c
@@ -29,21 +29,42 @@
 #include <net/net_namespace.h>
 
 
+#ifndef CONFIG_NET
 u64 uevent_seqnum;
+#endif
+
 #ifdef CONFIG_UEVENT_HELPER
 char uevent_helper[UEVENT_HELPER_PATH_LEN] = CONFIG_UEVENT_HELPER_PATH;
 #endif
 
+/*
+ * Size a buffer needs to be in order to hold the largest possible sequence
+ * number stored in a u64 including \0 byte: 2^64 - 1 = 21 chars.
+ */
+#define SEQNUM_BUFSIZE (sizeof("SEQNUM=") + 21)
 struct uevent_sock {
 	struct list_head list;
 	struct sock *sk;
+	/*
+	 * This mutex protects uevent sockets and the uevent counter of
+	 * network namespaces *not* owned by init_user_ns.
+	 * For network namespaces owned by init_user_ns this lock is *not*
+	 * valid instead the global uevent_sock_mutex must be used!
+	 */
+	struct mutex sk_mutex;
 };
 
 #ifdef CONFIG_NET
 static LIST_HEAD(uevent_sock_list);
 #endif
 
-/* This lock protects uevent_seqnum and uevent_sock_list */
+/*
+ * This mutex protects uevent sockets and the uevent counter of network
+ * namespaces owned by init_user_ns.
+ * For network namespaces not owned by init_user_ns this lock is *not*
+ * valid instead the network namespace specific sk_mutex in struct
+ * uevent_sock must be used!
+ */
 static DEFINE_MUTEX(uevent_sock_mutex);
 
 /* the strings here must match the enum in include/linux/kobject.h */
@@ -253,6 +274,22 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
 
 	return 0;
 }
+
+static bool can_hold_seqnum(const struct kobj_uevent_env *env, size_t len)
+{
+	if (env->envp_idx >= ARRAY_SIZE(env->envp)) {
+		WARN(1, KERN_ERR "Failed to append sequence number. "
+		     "Too many uevent variables\n");
+		return false;
+	}
+
+	if ((env->buflen + len) > UEVENT_BUFFER_SIZE) {
+		WARN(1, KERN_ERR "Insufficient space to append sequence number\n");
+		return false;
+	}
+
+	return true;
+}
 #endif
 
 #ifdef CONFIG_UEVENT_HELPER
@@ -308,18 +345,22 @@ static int kobject_uevent_net_broadcast(struct kobject *kobj,
 
 	/* send netlink message */
 	list_for_each_entry(ue_sk, &uevent_sock_list, list) {
+		/* bump sequence number */
+		u64 seqnum = ++sock_net(ue_sk->sk)->uevent_seqnum;
 		struct sock *uevent_sock = ue_sk->sk;
+		char buf[SEQNUM_BUFSIZE];
 
 		if (!netlink_has_listeners(uevent_sock, 1))
 			continue;
 
 		if (!skb) {
-			/* allocate message with the maximum possible size */
+			/* calculate header length */
 			size_t len = strlen(action_string) + strlen(devpath) + 2;
 			char *scratch;
 
+			/* allocate message with the maximum possible size */
 			retval = -ENOMEM;
-			skb = alloc_skb(len + env->buflen, GFP_KERNEL);
+			skb = alloc_skb(len + env->buflen + SEQNUM_BUFSIZE, GFP_KERNEL);
 			if (!skb)
 				continue;
 
@@ -327,11 +368,24 @@ static int kobject_uevent_net_broadcast(struct kobject *kobj,
 			scratch = skb_put(skb, len);
 			sprintf(scratch, "%s@%s", action_string, devpath);
 
+			/* add env */
 			skb_put_data(skb, env->buf, env->buflen);
 
 			NETLINK_CB(skb).dst_group = 1;
 		}
 
+		/* prepare netns seqnum */
+		retval = snprintf(buf, SEQNUM_BUFSIZE, "SEQNUM=%llu", seqnum);
+		if (retval < 0 || retval >= SEQNUM_BUFSIZE)
+			continue;
+		retval++;
+
+		if (!can_hold_seqnum(env, retval))
+			continue;
+
+		/* append netns seqnum */
+		skb_put_data(skb, buf, retval);
+
 		retval = netlink_broadcast_filtered(uevent_sock, skb_get(skb),
 						    0, 1, GFP_KERNEL,
 						    kobj_bcast_filter,
@@ -339,8 +393,13 @@ static int kobject_uevent_net_broadcast(struct kobject *kobj,
 		/* ENOBUFS should be handled in userspace */
 		if (retval == -ENOBUFS || retval == -ESRCH)
 			retval = 0;
+
+		/* remove netns seqnum */
+		skb_trim(skb, env->buflen);
 	}
 	consume_skb(skb);
+#else
+	uevent_seqnum++;
 #endif
 	return retval;
 }
@@ -510,14 +569,7 @@ int kobject_uevent_env(struct kobject *kobj, enum kobject_action action,
 	}
 
 	mutex_lock(&uevent_sock_mutex);
-	/* we will send an event, so request a new sequence number */
-	retval = add_uevent_var(env, "SEQNUM=%llu", (unsigned long long)++uevent_seqnum);
-	if (retval) {
-		mutex_unlock(&uevent_sock_mutex);
-		goto exit;
-	}
-	retval = kobject_uevent_net_broadcast(kobj, env, action_string,
-					      devpath);
+	retval = kobject_uevent_net_broadcast(kobj, env, action_string, devpath);
 	mutex_unlock(&uevent_sock_mutex);
 
 #ifdef CONFIG_UEVENT_HELPER
@@ -605,17 +657,18 @@ int add_uevent_var(struct kobj_uevent_env *env, const char *format, ...)
 EXPORT_SYMBOL_GPL(add_uevent_var);
 
 #if defined(CONFIG_NET)
-static int uevent_net_broadcast(struct sock *usk, struct sk_buff *skb,
+static int uevent_net_broadcast(struct uevent_sock *ue_sk, struct sk_buff *skb,
 				struct netlink_ext_ack *extack)
 {
-	/* u64 to chars: 2^64 - 1 = 21 chars */
-	char buf[sizeof("SEQNUM=") + 21];
+	struct sock *usk = ue_sk->sk;
+	char buf[SEQNUM_BUFSIZE];
 	struct sk_buff *skbc;
 	int ret;
 
 	/* bump and prepare sequence number */
-	ret = snprintf(buf, sizeof(buf), "SEQNUM=%llu", ++uevent_seqnum);
-	if (ret < 0 || (size_t)ret >= sizeof(buf))
+	ret = snprintf(buf, SEQNUM_BUFSIZE, "SEQNUM=%llu",
+		       ++sock_net(ue_sk->sk)->uevent_seqnum);
+	if (ret < 0 || ret >= SEQNUM_BUFSIZE)
 		return -ENOMEM;
 	ret++;
 
@@ -668,9 +721,15 @@ static int uevent_net_rcv_skb(struct sk_buff *skb, struct nlmsghdr *nlh,
 		return -EPERM;
 	}
 
-	mutex_lock(&uevent_sock_mutex);
-	ret = uevent_net_broadcast(net->uevent_sock->sk, skb, extack);
-	mutex_unlock(&uevent_sock_mutex);
+	if (net->user_ns == &init_user_ns)
+		mutex_lock(&uevent_sock_mutex);
+	else
+		mutex_lock(&net->uevent_sock->sk_mutex);
+	ret = uevent_net_broadcast(net->uevent_sock, skb, extack);
+	if (net->user_ns == &init_user_ns)
+		mutex_unlock(&uevent_sock_mutex);
+	else
+		mutex_unlock(&net->uevent_sock->sk_mutex);
 
 	return ret;
 }
@@ -708,6 +767,13 @@ static int uevent_net_init(struct net *net)
 		mutex_lock(&uevent_sock_mutex);
 		list_add_tail(&ue_sk->list, &uevent_sock_list);
 		mutex_unlock(&uevent_sock_mutex);
+	} else {
+		/*
+		 * Uevent sockets and counters for network namespaces
+		 * not owned by the initial user namespace have their
+		 * own mutex.
+		 */
+		mutex_init(&ue_sk->sk_mutex);
 	}
 
 	return 0;
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index a11e03f920d3..8894638f5150 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -618,6 +618,20 @@ struct net *get_net_ns_by_pid(pid_t pid)
 }
 EXPORT_SYMBOL_GPL(get_net_ns_by_pid);
 
+u64 get_ns_uevent_seqnum_by_vpid(void)
+{
+	pid_t cur_pid;
+	struct net *net;
+
+	cur_pid = task_pid_vnr(current);
+	net = get_net_ns_by_pid(cur_pid);
+	if (IS_ERR(net))
+		return 0;
+
+	return net->uevent_seqnum;
+}
+EXPORT_SYMBOL_GPL(get_ns_uevent_seqnum_by_vpid);
+
 static __net_init int net_ns_net_init(struct net *net)
 {
 #ifdef CONFIG_NET_NS
-- 
2.17.0

^ permalink raw reply related

* [PATCH net-next 1/2 v1] netns: restrict uevents
From: Christian Brauner @ 2018-04-23 10:24 UTC (permalink / raw)
  To: ebiederm, davem, netdev, linux-kernel
  Cc: avagin, ktkhai, serge, gregkh, Christian Brauner
In-Reply-To: <20180423102443.16627-1-christian.brauner@ubuntu.com>

commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")

enabled sending hotplug events into all network namespaces back in 2010.
Over time the set of uevents that get sent into all network namespaces has
shrunk a little. We have now reached the point where hotplug events for all
devices that carry a namespace tag are filtered according to that
namespace. Specifically, they are filtered whenever the namespace tag of
the kobject does not match the namespace tag of the netlink socket. One
example are network devices. Uevents for network devices only show up in
the network namespaces these devices are moved to or created in.

However, any uevent for a kobject that does not have a namespace tag
associated with it will not be filtered and we will broadcast it into all
network namespaces. This behavior stopped making sense when user namespaces
were introduced.

This patch restricts uevents to the initial user namespace for a couple of
reasons that have been extensively discusses on the mailing list [1].
- Thundering herd:
  Broadcasting uevents into all network namespaces introduces significant
  overhead.
  All processes that listen to uevents running in non-initial user
  namespaces will end up responding to uevents that will be meaningless to
  them. Mainly, because non-initial user namespaces cannot easily manage
  devices unless they have a privileged host-process helping them out. This
  means that there will be a thundering herd of activity when there
  shouldn't be any.
- Uevents from non-root users are already filtered in userspace:
  Uevents are filtered by userspace in a user namespace because the
  received uid != 0. Instead the uid associated with the event will be
  65534 == "nobody" because the global root uid is not mapped.
  This means we can safely and without introducing regressions modify the
  kernel to not send uevents into all network namespaces whose owning user
  namespace is not the initial user namespace because we know that
  userspace will ignore the message because of the uid anyway. I have
  a) verified that is is true for every udev implementation out there b)
  that this behavior has been present in all udev implementations from the
  very beginning.
- Removing needless overhead/Increasing performance:
  Currently, the uevent socket for each network namespace is added to the
  global variable uevent_sock_list. The list itself needs to be protected
  by a mutex. So everytime a uevent is generated the mutex is taken on the
  list. The mutex is held *from the creation of the uevent (memory
  allocation, string creation etc. until all uevent sockets have been
  handled*. This is aggravated by the fact that for each uevent socket that
  has listeners the mc_list must be walked as well which means we're
  talking O(n^2) here. Given that a standard Linux workload usually has
  quite a lot of network namespaces and - in the face of containers - a lot
  of user namespaces this quickly becomes a performance problem (see
  "Thundering herd" above). By just recording uevent sockets of network
  namespaces that are owned by the initial user namespace we significantly
  increase performance in this codepath.
- Injecting uevents:
  There's a valid argument that containers might be interested in receiving
  device events especially if they are delegated to them by a privileged
  userspace process. One prime example are SR-IOV enabled devices that are
  explicitly designed to be handed of to other users such as VMs or
  containers.
  This use-case can now be correctly handled since
  commit 692ec06d7c92 ("netns: send uevent messages"). This commit
  introduced the ability to send uevents from userspace. As such we can let
  a sufficiently privileged (CAP_SYS_ADMIN in the owning user namespace of
  the network namespace of the netlink socket) userspace process make a
  decision what uevents should be sent. This removes the need to blindly
  broadcast uevents into all user namespaces and provides a performant and
  safe solution to this problem.
- Filtering logic:
  This patch filters by *owning user namespace of the network namespace a
  given task resides in* and not by user namespace of the task per se. This
  means if the user namespace of a given task is unshared but the network
  namespace is kept and is owned by the initial user namespace a listener
  that is opening the uevent socket in that network namespace can still
  listen to uevents.

[1]: https://lkml.org/lkml/2018/4/4/739
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
Changelog v0->v1:
* patch unchanged
---
 lib/kobject_uevent.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
index 15ea216a67ce..f5f5038787ac 100644
--- a/lib/kobject_uevent.c
+++ b/lib/kobject_uevent.c
@@ -703,9 +703,13 @@ static int uevent_net_init(struct net *net)
 
 	net->uevent_sock = ue_sk;
 
-	mutex_lock(&uevent_sock_mutex);
-	list_add_tail(&ue_sk->list, &uevent_sock_list);
-	mutex_unlock(&uevent_sock_mutex);
+	/* Restrict uevents to initial user namespace. */
+	if (sock_net(ue_sk->sk)->user_ns == &init_user_ns) {
+		mutex_lock(&uevent_sock_mutex);
+		list_add_tail(&ue_sk->list, &uevent_sock_list);
+		mutex_unlock(&uevent_sock_mutex);
+	}
+
 	return 0;
 }
 
@@ -713,9 +717,11 @@ static void uevent_net_exit(struct net *net)
 {
 	struct uevent_sock *ue_sk = net->uevent_sock;
 
-	mutex_lock(&uevent_sock_mutex);
-	list_del(&ue_sk->list);
-	mutex_unlock(&uevent_sock_mutex);
+	if (sock_net(ue_sk->sk)->user_ns == &init_user_ns) {
+		mutex_lock(&uevent_sock_mutex);
+		list_del(&ue_sk->list);
+		mutex_unlock(&uevent_sock_mutex);
+	}
 
 	netlink_kernel_release(ue_sk->sk);
 	kfree(ue_sk);
-- 
2.17.0

^ permalink raw reply related

* [PATCH net-next 0/2 v1] netns: uevent performance tweaks
From: Christian Brauner @ 2018-04-23 10:24 UTC (permalink / raw)
  To: ebiederm, davem, netdev, linux-kernel
  Cc: avagin, ktkhai, serge, gregkh, Christian Brauner

Hey everyone,

This is v1 of "netns: uevent performance tweaks". Like Eric requested, I
did extensive testing that prove significant performance improvements
when using per-netns uevent sequence numbers with decoupled locks. The
results and test descriptions were added to the commit message of
[PATCH 2/2 v1] netns: isolate seqnums to use per-netns locks.

This series deals with a bunch of performance improvements when sending
out uevents that have been extensively discussed here:
https://lkml.org/lkml/2018/4/10/592

- Only record uevent sockets from network namespaces owned by the
  initial user namespace in the global uevent socket list.
  Eric, this is the exact patch we agreed upon in
  https://lkml.org/lkml/2018/4/10/592.
  A very detailed rationale is present in the commit message for
  [PATCH 1/2] netns: restrict uevents
- Decouple the locking for network namespaces in the global uevent
  socket list from the locking for network namespaces not in the global
  uevent socket list.
  A very detailed rationale including performance test results is
  present in the commit message for
  [PATCH 2/2] netns: isolate seqnums to use per-netns locks

Thanks!
Christian

Christian Brauner (2):
  netns: restrict uevents
  netns: isolate seqnums to use per-netns locks

 include/linux/kobject.h     |   2 +
 include/net/net_namespace.h |   3 +
 kernel/ksysfs.c             |  11 +++-
 lib/kobject_uevent.c        | 122 ++++++++++++++++++++++++++++--------
 net/core/net_namespace.c    |  14 +++++
 5 files changed, 126 insertions(+), 26 deletions(-)

-- 
2.17.0

^ permalink raw reply

* Re: [PATCH net-next 2/2] netns: isolate seqnums to use per-netns locks
From: Christian Brauner @ 2018-04-23 10:12 UTC (permalink / raw)
  To: kbuild-all
  Cc: ebiederm, davem, netdev, linux-kernel, avagin, ktkhai, serge,
	gregkh
In-Reply-To: <201804231003.izXFVWTI%fengguang.wu@intel.com>

On Mon, Apr 23, 2018 at 10:39:50AM +0800, kbuild test robot wrote:
> Hi Christian,
> 
> Thank you for the patch! Yet something to improve:
> 
> [auto build test ERROR on net-next/master]
> 
> url:    https://github.com/0day-ci/linux/commits/Christian-Brauner/netns-uevent-performance-tweaks/20180420-013717
> config: alpha-alldefconfig (attached as .config)
> compiler: alpha-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
> reproduce:
>         wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>         chmod +x ~/bin/make.cross
>         # save the attached .config to linux build tree
>         make.cross ARCH=alpha 
> 
> All errors (new ones prefixed by >>):
> 
>    kernel/ksysfs.o: In function `uevent_seqnum_show':
> >> (.text+0x18c): undefined reference to `get_ns_uevent_seqnum_by_vpid'
>    (.text+0x19c): undefined reference to `get_ns_uevent_seqnum_by_vpid'

Right, this happens when CONFIG_NET=n. I am about to send out the second
version of the patch that includes all of the test results in the commit
message and also accounts for kernels compiled without CONFIG_NET.

Thanks!
Christian

> 
> ---
> 0-DAY kernel test infrastructure                Open Source Technology Center
> https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox