Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net-next v2 0/2] openvswitch: Support conntrack zone limit
From: David Miller @ 2018-04-23 13:39 UTC (permalink / raw)
  To: yihung.wei; +Cc: netdev, pshelar
In-Reply-To: <1524011429-14500-1-git-send-email-yihung.wei@gmail.com>

From: Yi-Hung Wei <yihung.wei@gmail.com>
Date: Tue, 17 Apr 2018 17:30:27 -0700

> Currently, nf_conntrack_max is used to limit the maximum number of
> conntrack entries in the conntrack table for every network namespace.
> For the VMs and containers that reside in the same namespace,
> they share the same conntrack table, and the total # of conntrack entries
> for all the VMs and containers are limited by nf_conntrack_max.  In this
> case, if one of the VM/container abuses the usage the conntrack entries,
> it blocks the others from committing valid conntrack entries into the
> conntrack table.  Even if we can possibly put the VM in different network
> namespace, the current nf_conntrack_max configuration is kind of rigid
> that we cannot limit different VM/container to have different # conntrack
> entries.
> 
> To address the aforementioned issue, this patch proposes to have a
> fine-grained mechanism that could further limit the # of conntrack entries
> per-zone.  For example, we can designate different zone to different VM,
> and set conntrack limit to each zone.  By providing this isolation, a
> mis-behaved VM only consumes the conntrack entries in its own zone, and
> it will not influence other well-behaved VMs.  Moreover, the users can
> set various conntrack limit to different zone based on their preference.
> 
> The proposed implementation utilizes Netfilter's nf_conncount backend
> to count the number of connections in a particular zone.  If the number of
> connection is above a configured limitation, OVS will return ENOMEM to the
> userspace.  If userspace does not configure the zone limit, the limit
> defaults to zero that is no limitation, which is backward compatible to
> the behavior without this patch.
> 
> The first patch defines the conntrack limit netlink definition, and the
> second patch provides the implementation.

Pravin, I need this series reviewed.

Thank you.

^ permalink raw reply

* Re: WARNING: suspicious RCU usage in rt6_check_expired
From: Eric Dumazet @ 2018-04-23 13:31 UTC (permalink / raw)
  To: syzbot, davem, kuznet, linux-kernel, netdev, syzkaller-bugs,
	David Ahern
In-Reply-To: <0000000000009f23ab056a7fc243@google.com>



On 04/23/2018 01:24 AM, syzbot wrote:
> Hello,
> 
> syzbot hit the following crash on net-next commit
> 0638eb573cde5888c0886c7f35da604e5db209a6 (Sat Apr 21 20:06:14 2018 +0000)
> Merge branch 'ipv6-Another-followup-to-the-fib6_info-change'
> syzbot dashboard link: https://syzkaller.appspot.com/bug?extid=2422c9e35796659d2273
> 
> So far this crash happened 3 times on net-next.
> Unfortunately, I don't have any reproducer for this crash yet.
> Raw console output: https://syzkaller.appspot.com/x/log.txt?id=6081013801287680
> Kernel config: https://syzkaller.appspot.com/x/.config?id=-8412024688694752032
> compiler: gcc (GCC) 8.0.1 20180413 (experimental)
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+2422c9e35796659d2273@syzkaller.appspotmail.com
> It will help syzbot understand when the bug is fixed. See footer for details.
> If you forward the report, please keep this part and the footer.
> 
> netlink: 'syz-executor4': attribute type 6 has an invalid length.
> netlink: 'syz-executor4': attribute type 1 has an invalid length.
> netlink: 'syz-executor4': attribute type 6 has an invalid length.
> 
> =============================
> WARNING: suspicious RCU usage
> 4.16.0+ #11 Not tainted
> -----------------------------
> net/ipv6/route.c:410 suspicious rcu_dereference_check() usage!
> 
> other info that might help us debug this:
> 
> 
> rcu_scheduler_active = 2, debug_locks = 1
> 1 lock held by syz-executor7/25958:
>  #0: 00000000d1963139 (sk_lock-AF_INET6){+.+.}, at: lock_sock include/net/sock.h:1469 [inline]
>  #0: 00000000d1963139 (sk_lock-AF_INET6){+.+.}, at: sock_setsockopt+0x19c/0x1fe0 net/core/sock.c:717
> 
> stack backtrace:
> CPU: 1 PID: 25958 Comm: syz-executor7 Not tainted 4.16.0+ #11
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:77 [inline]
>  dump_stack+0x1b9/0x294 lib/dump_stack.c:113
>  lockdep_rcu_suspicious+0x14a/0x153 kernel/locking/lockdep.c:4592
>  rt6_check_expired+0x38b/0x3e0 net/ipv6/route.c:410
>  ip6_negative_advice+0x67/0xc0 net/ipv6/route.c:2204
>  dst_negative_advice include/net/sock.h:1786 [inline]
>  sock_setsockopt+0x138f/0x1fe0 net/core/sock.c:1051
>  __sys_setsockopt+0x2df/0x390 net/socket.c:1899
>  SYSC_setsockopt net/socket.c:1914 [inline]
>  SyS_setsockopt+0x34/0x50 net/socket.c:1911
>  do_syscall_64+0x29e/0x9d0 arch/x86/entry/common.c:287
>  entry_SYSCALL_64_after_hwframe+0x42/0xb7
> RIP: 0033:0x455389
> RSP: 002b:00007f7556e30c68 EFLAGS: 00000246 ORIG_RAX: 0000000000000036
> RAX: ffffffffffffffda RBX: 00007f7556e316d4 RCX: 0000000000455389
> RDX: 0000000000000035 RSI: 0000000000000001 RDI: 0000000000000013
> RBP: 000000000072bf58 R08: 0000000000000004 R09: 0000000000000000
> R10: 0000000020000000 R11: 0000000000000246 R12: 00000000ffffffff
> R13: 00000000000005ff R14: 00000000006fc088 R15: 0000000000000001
> netlink: 'syz-executor4': attribute type 4 has an invalid length.
> netlink: 'syz-executor4': attribute type 4 has an invalid length.
> IPVS: set_ctl: invalid protocol: 59 127.0.0.1:20000 lc
> IPVS: set_ctl: invalid protocol: 127 224.0.0.1:20000 rr
> IPVS: sync thread started: state = BACKUP, mcast_ifn = ip6tnl0, syncid = 4, id = 0
> IPVS: set_ctl: invalid protocol: 127 224.0.0.1:20000 rr
> netlink: 72 bytes leftover after parsing attributes in process `syz-executor2'.
> netlink: 72 bytes leftover after parsing attributes in process `syz-executor2'.
> dccp_xmit_packet: Payload too large (65423) for featneg.
> IPVS: set_ctl: invalid protocol: 29 1.0.0.0:2 wlc
> IPVS: set_ctl: invalid protocol: 29 1.0.0.0:2 wlc
> netlink: 32 bytes leftover after parsing attributes in process `syz-executor7'.
> netlink: 12 bytes leftover after parsing attributes in process `syz-executor1'.
> netlink: 12 bytes leftover after parsing attributes in process `syz-executor1'.
> netlink: 'syz-executor1': attribute type 29 has an invalid length.
> netlink: 8 bytes leftover after parsing attributes in process `syz-executor1'.
> IPVS: set_ctl: invalid protocol: 108 224.0.0.1:20004 lblc
> netlink: 'syz-executor1': attribute type 29 has an invalid length.
> netlink: 8 bytes leftover after parsing attributes in process `syz-executor1'.
> IPVS: set_ctl: invalid protocol: 108 224.0.0.1:20004 lblc
> 
> 
> ---
> This bug is generated by a dumb bot. It may contain errors.
> See https://goo.gl/tpsmEJ for details.
> Direct all questions to syzkaller@googlegroups.com.
> 
> syzbot will keep track of this bug report.
> If you forgot to add the Reported-by tag, once the fix for this bug is merged
> into any tree, please reply to this email with:
> #syz fix: exact-commit-title
> To mark this as a duplicate of another syzbot report, please reply with:
> #syz dup: exact-subject-of-another-report
> If it's a one-off invalid bug report, please reply with:
> #syz invalid
> Note: if the crash happens again, it will cause creation of a new bug report.
> Note: all commands must start from beginning of the line in the email body.

Added in commit a68886a691804d3f6d479ebf6825480fbafb6a00
("net/ipv6: Make from in rt6_info rcu protected")

^ permalink raw reply

* [PATCH 2/3] net/unix: hook unix_socketpair() into LSM
From: David Herrmann @ 2018-04-23 13:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: James Morris, Paul Moore, teg, Stephen Smalley, selinux,
	linux-security-module, Eric Paris, serge, davem, netdev,
	David Herrmann
In-Reply-To: <20180423133015.5455-1-dh.herrmann@gmail.com>

Use the newly created LSM-hook for unix_socketpair(). The default hook
return-value is 0, so behavior stays the same unless LSMs start using
this hook.

Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
---
 net/unix/af_unix.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 68bb70a62afe..bc9705ace9b1 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1371,6 +1371,11 @@ static int unix_stream_connect(struct socket *sock, struct sockaddr *uaddr,
 static int unix_socketpair(struct socket *socka, struct socket *sockb)
 {
 	struct sock *ska = socka->sk, *skb = sockb->sk;
+	int err;
+
+	err = security_unix_stream_socketpair(ska, skb);
+	if (err)
+		return err;
 
 	/* Join our sockets back to back */
 	sock_hold(ska);
-- 
2.17.0

^ permalink raw reply related

* [PATCH 3/3] selinux: provide unix_stream_socketpair callback
From: David Herrmann @ 2018-04-23 13:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: James Morris, Paul Moore, teg, Stephen Smalley, selinux,
	linux-security-module, Eric Paris, serge, davem, netdev,
	David Herrmann
In-Reply-To: <20180423133015.5455-1-dh.herrmann@gmail.com>

Make sure to implement the new unix_stream_socketpair callback so the
SO_PEERSEC call on socketpair(2)s will return correct information.

Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
---
 security/selinux/hooks.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 4cafe6a19167..828881d9a41d 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -4905,6 +4905,18 @@ static int selinux_socket_unix_stream_connect(struct sock *sock,
 	return 0;
 }
 
+static int selinux_socket_unix_stream_socketpair(struct sock *socka,
+						 struct sock *sockb)
+{
+	struct sk_security_struct *sksec_a = socka->sk_security;
+	struct sk_security_struct *sksec_b = sockb->sk_security;
+
+	sksec_a->peer_sid = sksec_b->sid;
+	sksec_b->peer_sid = sksec_a->sid;
+
+	return 0;
+}
+
 static int selinux_socket_unix_may_send(struct socket *sock,
 					struct socket *other)
 {
@@ -6995,6 +7007,8 @@ static struct security_hook_list selinux_hooks[] __lsm_ro_after_init = {
 	LSM_HOOK_INIT(inode_getsecctx, selinux_inode_getsecctx),
 
 	LSM_HOOK_INIT(unix_stream_connect, selinux_socket_unix_stream_connect),
+	LSM_HOOK_INIT(unix_stream_socketpair,
+			selinux_socket_unix_stream_socketpair),
 	LSM_HOOK_INIT(unix_may_send, selinux_socket_unix_may_send),
 
 	LSM_HOOK_INIT(socket_create, selinux_socket_create),
-- 
2.17.0

^ permalink raw reply related

* [PATCH 1/3] security: add hook for socketpair(AF_UNIX, ...)
From: David Herrmann @ 2018-04-23 13:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: James Morris, Paul Moore, teg, Stephen Smalley, selinux,
	linux-security-module, Eric Paris, serge, davem, netdev,
	David Herrmann
In-Reply-To: <20180423133015.5455-1-dh.herrmann@gmail.com>

Right now the LSM labels for socketpairs are always uninitialized,
since there is no security hook for the socketpair() syscall. This
patch adds the required hooks so LSMs can properly label socketpairs.
This allows SO_PEERSEC to return useful information on those sockets.

Note that the behavior of socketpair() can be emulated by creating a
listener socket, connecting to it, and then discarding the initial
listener socket. With this workaround, SO_PEERSEC would return the
caller's security context. However, with socketpair(), the uninitialized
context is returned unconditionally. This is unexpected and makes
socketpair() less useful in situations where the security context is
crucial to the application.

With the new socketpair-hook this disparity can be solved by making
socketpair() return the expected security context.

Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
---
 include/linux/lsm_hooks.h | 8 ++++++++
 include/linux/security.h  | 7 +++++++
 security/security.c       | 6 ++++++
 3 files changed, 21 insertions(+)

diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index 9d0b286f3dba..2a23c75c1541 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -717,6 +717,12 @@
  *	@other contains the peer sock structure.
  *	@newsk contains the new sock structure.
  *	Return 0 if permission is granted.
+ * @unix_stream_socketpair:
+ *	Check permissions before establishing a Unix domain stream connection
+ *	for a fresh pair of sockets.
+ *	@socka contains the first sock structure.
+ *	@sockb contains the second sock structure.
+ *	Return 0 if permission is granted and the connection was established.
  * @unix_may_send:
  *	Check permissions before connecting or sending datagrams from @sock to
  *	@other.
@@ -1651,6 +1657,7 @@ union security_list_options {
 #ifdef CONFIG_SECURITY_NETWORK
 	int (*unix_stream_connect)(struct sock *sock, struct sock *other,
 					struct sock *newsk);
+	int (*unix_stream_socketpair)(struct sock *socka, struct sock *sockb);
 	int (*unix_may_send)(struct socket *sock, struct socket *other);
 
 	int (*socket_create)(int family, int type, int protocol, int kern);
@@ -1919,6 +1926,7 @@ struct security_hook_heads {
 	struct hlist_head inode_getsecctx;
 #ifdef CONFIG_SECURITY_NETWORK
 	struct hlist_head unix_stream_connect;
+	struct hlist_head unix_stream_socketpair;
 	struct hlist_head unix_may_send;
 	struct hlist_head socket_create;
 	struct hlist_head socket_post_create;
diff --git a/include/linux/security.h b/include/linux/security.h
index 200920f521a1..be275deeda10 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -1187,6 +1187,7 @@ static inline int security_inode_getsecctx(struct inode *inode, void **ctx, u32
 #ifdef CONFIG_SECURITY_NETWORK
 
 int security_unix_stream_connect(struct sock *sock, struct sock *other, struct sock *newsk);
+int security_unix_stream_socketpair(struct sock *socka, struct sock *sockb);
 int security_unix_may_send(struct socket *sock,  struct socket *other);
 int security_socket_create(int family, int type, int protocol, int kern);
 int security_socket_post_create(struct socket *sock, int family,
@@ -1242,6 +1243,12 @@ static inline int security_unix_stream_connect(struct sock *sock,
 	return 0;
 }
 
+static inline int security_unix_stream_socketpair(struct sock *socka,
+						  struct sock *sockb)
+{
+	return 0;
+}
+
 static inline int security_unix_may_send(struct socket *sock,
 					 struct socket *other)
 {
diff --git a/security/security.c b/security/security.c
index 7bc2fde023a7..3dfd374e84e5 100644
--- a/security/security.c
+++ b/security/security.c
@@ -1340,6 +1340,12 @@ int security_unix_stream_connect(struct sock *sock, struct sock *other, struct s
 }
 EXPORT_SYMBOL(security_unix_stream_connect);
 
+int security_unix_stream_socketpair(struct sock *socka, struct sock *sockb)
+{
+	return call_int_hook(unix_stream_socketpair, 0, socka, sockb);
+}
+EXPORT_SYMBOL(security_unix_stream_socketpair);
+
 int security_unix_may_send(struct socket *sock,  struct socket *other)
 {
 	return call_int_hook(unix_may_send, 0, sock, other);
-- 
2.17.0

^ permalink raw reply related

* [PATCH 0/3] Introduce LSM-hook for socketpair(2)
From: David Herrmann @ 2018-04-23 13:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: James Morris, Paul Moore, teg, Stephen Smalley, selinux,
	linux-security-module, Eric Paris, serge, davem, netdev,
	David Herrmann

Hi

This series adds a new LSM hook for the socketpair(2) syscall. The idea
is to allow SO_PEERSEC to be called on AF_UNIX sockets created via
socketpair(2), and return the same information as if you emulated
socketpair(2) via a temporary listener socket. Right now SO_PEERSEC
will return the unlabeled credentials for a socketpair, rather than the
actual credentials of the creating process.

A simple call to:

    socketpair(AF_UNIX, SOCK_STREAM, 0, out);

can be emulated via a temporary listener socket bound to a unique,
random name in the abstract namespace. By connecting to this listener
socket, accept(2) will return the second part of the pair. If
SO_PEERSEC is queried on these, the correct credentials of the creating
process are returned. A simple comparison between the behavior of
SO_PEERSEC on socketpair(2) and an emulated socketpair is included in
the dbus-broker test-suite [1].

This patch series tries to close this gap and makes both behave the
same. A new LSM-hook is added which allows LSMs to cache the correct
peer information on newly created socket-pairs.

Apart from fixing this behavioral difference, the dbus-broker project
actually needs to query the credentials of socketpairs, and currently
must resort to racy procfs(2) queries to get the LSM credentials of its
controller socket. Several parts of the dbus-broker project allow you
to pass in a socket during execve(2), which will be used by the child
process to accept control-commands from its parent. One natural way to
create this communication channel is to use socketpair(2). However,
right now SO_PEERSEC does not return any useful information, hence, the
child-process would need other means to retrieve this information. By
avoiding socketpair(2) and using the hacky-emulated version, this is not
an issue.

There was a previous discussion on this matter [2] roughly a year ago.
Back then there was the suspicion that proper SO_PEERSEC would confuse
applications. However, we could not find any evidence backing this
suspicion. Furthermore, we now actually see the contrary. Lack of
SO_PEERSEC makes it a hassle to use socketpairs with LSM credentials.
Hence, we propose to implement full SO_PEERSEC for socketpairs.

This series only adds SELinux backends, since that is what we need for
RHEL. I will gladly extend the other LSMs if needed.

Thanks
David

[1] https://github.com/bus1/dbus-broker/blob/master/src/util/test-peersec.c
[2] https://www.spinics.net/lists/selinux/msg22674.html

David Herrmann (3):
  security: add hook for socketpair(AF_UNIX, ...)
  net/unix: hook unix_socketpair() into LSM
  selinux: provide unix_stream_socketpair callback

 include/linux/lsm_hooks.h |  8 ++++++++
 include/linux/security.h  |  7 +++++++
 net/unix/af_unix.c        |  5 +++++
 security/security.c       |  6 ++++++
 security/selinux/hooks.c  | 14 ++++++++++++++
 5 files changed, 40 insertions(+)

-- 
2.17.0

^ permalink raw reply

* Re: IP_ADD_MEMBERSHIP with imr_ifindex!=0 for multiple processes with different interfaces
From: Eric Dumazet @ 2018-04-23 13:29 UTC (permalink / raw)
  To: Klebsch, Mario, netdev@vger.kernel.org
In-Reply-To: <9208AF0D0E6F444F80E90796B4E51CA07C163C10@EXCHANGE01.muppets.local>



On 04/23/2018 06:22 AM, Klebsch, Mario wrote:
> Hi, 
> 
> I have a problem with multicast reception in the linux kernel and I hope, this is the right place to ask for help or to report a bug.
> 
> I need to receive multicasts on a single interface. I have written a small program, which executes IP_ADD_MEMBERSHIP with imr.imr_ifindex set to the interface index. The program works well, as long as only a single instance of this program is running. If I start a second instance on a different network interface, both programs receive multicast frames from both interfaces. 
> 
> When called without argument, the test program list the network interfaces. When called with an interface name as argument, if starts receiving multicasts on that interface.
> 
> I am running vanilla Linux kernel 4.12.0.
> 
> # uname -a
> Linux c627 4.12.0 #1 SMP Mon Apr 23 14:08:24 CEST 2018 i686 Intel(R) Core(TM) i5-7400 CPU @ 3.00GHz GenuineIntel GNU/Linux
> # 
> 
> P.S. The program runs fine on MacOSX.
>

It looks like your program needs to use SO_BINDTODEVICE if it really wants this device filtering ?

^ permalink raw reply

* KMSAN: uninit-value in ip_vs_lblcr_check_expire
From: syzbot @ 2018-04-23 13:28 UTC (permalink / raw)
  To: coreteam, davem, fw, horms, ja, kadlec, linux-kernel, lvs-devel,
	netdev, netfilter-devel, pablo, syzkaller-bugs, wensong

Hello,

syzbot hit the following crash on  
https://github.com/google/kmsan.git/master commit
d2d741e5d1898dfde1a75ea3d29a9a3e2edf0617 (Sun Apr 22 15:05:22 2018 +0000)
kmsan: add initialization for shmem pages
syzbot dashboard link:  
https://syzkaller.appspot.com/bug?extid=3dfdea57819073a04f21

So far this crash happened 2 times on  
https://github.com/google/kmsan.git/master.
Unfortunately, I don't have any reproducer for this crash yet.
Raw console output:  
https://syzkaller.appspot.com/x/log.txt?id=6285034612850688
Kernel config: https://syzkaller.appspot.com/x/.config?id=328654897048964367
compiler: clang version 7.0.0 (trunk 329391)

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+3dfdea57819073a04f21@syzkaller.appspotmail.com
It will help syzbot understand when the bug is fixed. See footer for  
details.
If you forward the report, please keep this part and the footer.

RDX: 0000000000000000 RSI: 0000000020000080 RDI: 0000000000000013
RBP: 000000000072bea0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000014
R13: 00000000000004f3 R14: 00000000006fa768 R15: 0000000000000000
==================================================================
BUG: KMSAN: uninit-value in ip_vs_lblcr_check_expire+0x1551/0x1600  
net/netfilter/ipvs/ip_vs_lblcr.c:479
CPU: 0 PID: 13883 Comm: syz-executor4 Not tainted 4.16.0+ #86
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Call Trace:
  <IRQ>
  __dump_stack lib/dump_stack.c:17 [inline]
  dump_stack+0x185/0x1d0 lib/dump_stack.c:53
  kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
  __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:683
  ip_vs_lblcr_check_expire+0x1551/0x1600 net/netfilter/ipvs/ip_vs_lblcr.c:479
  call_timer_fn+0x26a/0x5a0 kernel/time/timer.c:1326
  expire_timers kernel/time/timer.c:1363 [inline]
  __run_timers+0xda7/0x11c0 kernel/time/timer.c:1666
  run_timer_softirq+0x43/0x70 kernel/time/timer.c:1692
  __do_softirq+0x56d/0x93d kernel/softirq.c:285
  invoke_softirq kernel/softirq.c:365 [inline]
  irq_exit+0x202/0x240 kernel/softirq.c:405
  exiting_irq+0xe/0x10 arch/x86/include/asm/apic.h:541
  smp_apic_timer_interrupt+0x64/0x90 arch/x86/kernel/apic/apic.c:1055
  apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:857
  </IRQ>
RIP: 0010:native_restore_fl arch/x86/include/asm/irqflags.h:37 [inline]
RIP: 0010:arch_local_irq_restore arch/x86/include/asm/irqflags.h:78 [inline]
RIP: 0010:dump_stack+0x1af/0x1d0 lib/dump_stack.c:58
RSP: 0018:ffff880156a2ef00 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff12
RAX: ffff8801fddc2590 RBX: ffff88014f62c418 RCX: ffff880000000000
RDX: ffff8801fd9c2590 RSI: aaaaaaaaaaaab000 RDI: ffffea0000000000
RBP: ffff880156a2ef48 R08: 0000000001080000 R09: 0000000000000002
R10: 0000000000000000 R11: 0000000000000000 R12: 00000000cf000109
R13: 0000000000000286 R14: 0000000000000000 R15: 0000000000000000
  fail_dump lib/fault-inject.c:51 [inline]
  should_fail+0x87b/0xab0 lib/fault-inject.c:149
  should_failslab+0x279/0x2a0 mm/failslab.c:32
  slab_pre_alloc_hook mm/slab.h:422 [inline]
  slab_alloc_node mm/slub.c:2663 [inline]
  slab_alloc mm/slub.c:2745 [inline]
  kmem_cache_alloc+0x136/0xb90 mm/slub.c:2750
  dst_alloc+0x295/0x860 net/core/dst.c:104
  __ip6_dst_alloc net/ipv6/route.c:361 [inline]
  ip6_rt_cache_alloc+0x445/0xd00 net/ipv6/route.c:1061
  ip6_pol_route+0x3f19/0x5da0 net/ipv6/route.c:1751
  ip6_pol_route_output+0xe6/0x110 net/ipv6/route.c:1892
  fib6_rule_lookup+0x494/0x720 net/ipv6/fib6_rules.c:87
  ip6_route_output_flags+0x4fa/0x590 net/ipv6/route.c:1920
  ip6_dst_lookup_tail+0x2fe/0x1a60 net/ipv6/ip6_output.c:992
  ip6_dst_lookup_flow+0xfc/0x270 net/ipv6/ip6_output.c:1093
  rawv6_sendmsg+0x1b05/0x4fb0 net/ipv6/raw.c:908
  inet_sendmsg+0x48d/0x740 net/ipv4/af_inet.c:764
  sock_sendmsg_nosec net/socket.c:630 [inline]
  sock_sendmsg net/socket.c:640 [inline]
  ___sys_sendmsg+0xec0/0x1310 net/socket.c:2046
  __sys_sendmsg net/socket.c:2080 [inline]
  SYSC_sendmsg+0x2a3/0x3d0 net/socket.c:2091
  SyS_sendmsg+0x54/0x80 net/socket.c:2087
  do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x455389
RSP: 002b:00007fa5b1000c68 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00007fa5b10016d4 RCX: 0000000000455389
RDX: 0000000000000000 RSI: 0000000020000080 RDI: 0000000000000013
RBP: 000000000072bea0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000014
R13: 00000000000004f3 R14: 00000000006fa768 R15: 0000000000000000

Uninit was created at:
  kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
  kmsan_alloc_meta_for_pages+0x161/0x3a0 mm/kmsan/kmsan.c:814
  kmsan_alloc_page+0x82/0xe0 mm/kmsan/kmsan.c:868
  __alloc_pages_nodemask+0xf5b/0x5dc0 mm/page_alloc.c:4283
  alloc_pages_current+0x6b5/0x970 mm/mempolicy.c:2055
  alloc_pages include/linux/gfp.h:494 [inline]
  kmalloc_order mm/slab_common.c:1164 [inline]
  kmalloc_order_trace+0xb9/0x390 mm/slab_common.c:1175
  kmalloc_large include/linux/slab.h:446 [inline]
  __kmalloc+0x332/0x350 mm/slub.c:3778
  kmalloc include/linux/slab.h:517 [inline]
  ip_vs_lblcr_init_svc+0x57/0x310 net/netfilter/ipvs/ip_vs_lblcr.c:518
  ip_vs_bind_scheduler+0xa4/0x1e0 net/netfilter/ipvs/ip_vs_sched.c:51
  ip_vs_add_service+0xa91/0x1d70 net/netfilter/ipvs/ip_vs_ctl.c:1265
  do_ip_vs_set_ctl+0x25c8/0x2790 net/netfilter/ipvs/ip_vs_ctl.c:2457
  nf_sockopt net/netfilter/nf_sockopt.c:106 [inline]
  nf_setsockopt+0x476/0x4d0 net/netfilter/nf_sockopt.c:115
  ip_setsockopt+0x24b/0x2b0 net/ipv4/ip_sockglue.c:1261
  dccp_setsockopt+0x1c3/0x1f0 net/dccp/proto.c:576
  sock_common_setsockopt+0x136/0x170 net/core/sock.c:2975
  SYSC_setsockopt+0x4b8/0x570 net/socket.c:1849
  SyS_setsockopt+0x76/0xa0 net/socket.c:1828
  do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
==================================================================


---
This bug is generated by a dumb bot. It may contain errors.
See https://goo.gl/tpsmEJ for details.
Direct all questions to syzkaller@googlegroups.com.

syzbot will keep track of this bug report.
If you forgot to add the Reported-by tag, once the fix for this bug is  
merged
into any tree, please reply to this email with:
#syz fix: exact-commit-title
To mark this as a duplicate of another syzbot report, please reply with:
#syz dup: exact-subject-of-another-report
If it's a one-off invalid bug report, please reply with:
#syz invalid
Note: if the crash happens again, it will cause creation of a new bug  
report.
Note: all commands must start from beginning of the line in the email body.

^ permalink raw reply

* KMSAN: uninit-value in ip_vs_lblc_check_expire
From: syzbot @ 2018-04-23 13:28 UTC (permalink / raw)
  To: coreteam, davem, fw, horms, ja, kadlec, linux-kernel, lvs-devel,
	netdev, netfilter-devel, pablo, syzkaller-bugs, wensong

Hello,

syzbot hit the following crash on  
https://github.com/google/kmsan.git/master commit
d2d741e5d1898dfde1a75ea3d29a9a3e2edf0617 (Sun Apr 22 15:05:22 2018 +0000)
kmsan: add initialization for shmem pages
syzbot dashboard link:  
https://syzkaller.appspot.com/bug?extid=3e9695f147fb529aa9bc

So far this crash happened 3 times on  
https://github.com/google/kmsan.git/master.
Unfortunately, I don't have any reproducer for this crash yet.
Raw console output:  
https://syzkaller.appspot.com/x/log.txt?id=5822255644803072
Kernel config: https://syzkaller.appspot.com/x/.config?id=328654897048964367
compiler: clang version 7.0.0 (trunk 329391)

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+3e9695f147fb529aa9bc@syzkaller.appspotmail.com
It will help syzbot understand when the bug is fixed. See footer for  
details.
If you forward the report, please keep this part and the footer.

kernel msg: ebtables bug: please report to author: bad policy
==================================================================
BUG: KMSAN: uninit-value in ip_vs_lblc_check_expire+0xe62/0xf10  
net/netfilter/ipvs/ip_vs_lblc.c:315
CPU: 0 PID: 11383 Comm: syz-executor3 Not tainted 4.16.0+ #86
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Call Trace:
  <IRQ>
  __dump_stack lib/dump_stack.c:17 [inline]
  dump_stack+0x185/0x1d0 lib/dump_stack.c:53
  kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
  __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:683
  ip_vs_lblc_check_expire+0xe62/0xf10 net/netfilter/ipvs/ip_vs_lblc.c:315
  call_timer_fn+0x26a/0x5a0 kernel/time/timer.c:1326
  expire_timers kernel/time/timer.c:1363 [inline]
  __run_timers+0xda7/0x11c0 kernel/time/timer.c:1666
  run_timer_softirq+0x43/0x70 kernel/time/timer.c:1692
  __do_softirq+0x56d/0x93d kernel/softirq.c:285
  invoke_softirq kernel/softirq.c:365 [inline]
  irq_exit+0x202/0x240 kernel/softirq.c:405
  exiting_irq+0xe/0x10 arch/x86/include/asm/apic.h:541
  smp_apic_timer_interrupt+0x64/0x90 arch/x86/kernel/apic/apic.c:1055
  apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:857
  </IRQ>
RIP: 0010:native_restore_fl arch/x86/include/asm/irqflags.h:37 [inline]
RIP: 0010:arch_local_irq_restore arch/x86/include/asm/irqflags.h:78 [inline]
RIP: 0010:vprintk_emit+0xcb2/0xff0 kernel/printk/printk.c:1899
RSP: 0018:ffff8801c2a1f0d8 EFLAGS: 00000296 ORIG_RAX: ffffffffffffff12
RAX: 0000000000000296 RBX: ffff8801574c4418 RCX: 0000000000040000
RDX: ffffc900033a6000 RSI: 00000000000001bf RDI: 00000000000001c0
RBP: ffff8801c2a1f1f8 R08: 000000219bfd8445 R09: ffff8801fd6d615d
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: ffffffff8b300430 R14: 0000000000000000 R15: 0000000000000000
  vprintk_default+0x90/0xa0 kernel/printk/printk.c:1955
  vprintk_func+0x517/0x700 kernel/printk/printk_safe.c:379
  printk+0x1b6/0x1f0 kernel/printk/printk.c:1991
  translate_table+0x474/0x5e10 net/bridge/netfilter/ebtables.c:846
  do_replace_finish+0x1258/0x2ea0 net/bridge/netfilter/ebtables.c:1002
  do_replace+0x707/0x770 net/bridge/netfilter/ebtables.c:1141
  do_ebt_set_ctl+0x2ab/0x3c0 net/bridge/netfilter/ebtables.c:1518
  nf_sockopt net/netfilter/nf_sockopt.c:106 [inline]
  nf_setsockopt+0x476/0x4d0 net/netfilter/nf_sockopt.c:115
  ip_setsockopt+0x24b/0x2b0 net/ipv4/ip_sockglue.c:1261
  udp_setsockopt+0x108/0x1b0 net/ipv4/udp.c:2406
  ipv6_setsockopt+0x30c/0x340 net/ipv6/ipv6_sockglue.c:917
  udpv6_setsockopt+0x110/0x1c0 net/ipv6/udp.c:1422
  sock_common_setsockopt+0x136/0x170 net/core/sock.c:2975
  SYSC_setsockopt+0x4b8/0x570 net/socket.c:1849
  SyS_setsockopt+0x76/0xa0 net/socket.c:1828
  do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x455389
RSP: 002b:00007f470c9e3c68 EFLAGS: 00000246 ORIG_RAX: 0000000000000036
RAX: ffffffffffffffda RBX: 00007f470c9e46d4 RCX: 0000000000455389
RDX: 0000000000000080 RSI: 0000000000000000 RDI: 0000000000000013
RBP: 000000000072bea0 R08: 0000000000000dd0 R09: 0000000000000000
R10: 0000000020000dc0 R11: 0000000000000246 R12: 00000000ffffffff
R13: 000000000000051d R14: 00000000006fab58 R15: 0000000000000000

Uninit was created at:
  kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
  kmsan_alloc_meta_for_pages+0x161/0x3a0 mm/kmsan/kmsan.c:814
  kmsan_alloc_page+0x82/0xe0 mm/kmsan/kmsan.c:868
  __alloc_pages_nodemask+0xf5b/0x5dc0 mm/page_alloc.c:4283
  alloc_pages_current+0x6b5/0x970 mm/mempolicy.c:2055
  alloc_pages include/linux/gfp.h:494 [inline]
  kmalloc_order mm/slab_common.c:1164 [inline]
  kmalloc_order_trace+0xb9/0x390 mm/slab_common.c:1175
  kmalloc_large include/linux/slab.h:446 [inline]
  __kmalloc+0x332/0x350 mm/slub.c:3778
  kmalloc include/linux/slab.h:517 [inline]
  ip_vs_lblc_init_svc+0x57/0x310 net/netfilter/ipvs/ip_vs_lblc.c:355
  ip_vs_bind_scheduler+0xa4/0x1e0 net/netfilter/ipvs/ip_vs_sched.c:51
  ip_vs_add_service+0xa91/0x1d70 net/netfilter/ipvs/ip_vs_ctl.c:1265
  do_ip_vs_set_ctl+0x25c8/0x2790 net/netfilter/ipvs/ip_vs_ctl.c:2457
  nf_sockopt net/netfilter/nf_sockopt.c:106 [inline]
  nf_setsockopt+0x476/0x4d0 net/netfilter/nf_sockopt.c:115
  ip_setsockopt+0x24b/0x2b0 net/ipv4/ip_sockglue.c:1261
  raw_setsockopt+0x2e5/0x350 net/ipv4/raw.c:870
  sock_common_setsockopt+0x136/0x170 net/core/sock.c:2975
  SYSC_setsockopt+0x4b8/0x570 net/socket.c:1849
  SyS_setsockopt+0x76/0xa0 net/socket.c:1828
  do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
==================================================================


---
This bug is generated by a dumb bot. It may contain errors.
See https://goo.gl/tpsmEJ for details.
Direct all questions to syzkaller@googlegroups.com.

syzbot will keep track of this bug report.
If you forgot to add the Reported-by tag, once the fix for this bug is  
merged
into any tree, please reply to this email with:
#syz fix: exact-commit-title
To mark this as a duplicate of another syzbot report, please reply with:
#syz dup: exact-subject-of-another-report
If it's a one-off invalid bug report, please reply with:
#syz invalid
Note: if the crash happens again, it will cause creation of a new bug  
report.
Note: all commands must start from beginning of the line in the email body.

^ permalink raw reply

* IP_ADD_MEMBERSHIP with imr_ifindex!=0 for multiple processes with different interfaces
From: Klebsch, Mario @ 2018-04-23 13:22 UTC (permalink / raw)
  To: netdev@vger.kernel.org

Hi, 

I have a problem with multicast reception in the linux kernel and I hope, this is the right place to ask for help or to report a bug.

I need to receive multicasts on a single interface. I have written a small program, which executes IP_ADD_MEMBERSHIP with imr.imr_ifindex set to the interface index. The program works well, as long as only a single instance of this program is running. If I start a second instance on a different network interface, both programs receive multicast frames from both interfaces. 

When called without argument, the test program list the network interfaces. When called with an interface name as argument, if starts receiving multicasts on that interface.

I am running vanilla Linux kernel 4.12.0.

# uname -a
Linux c627 4.12.0 #1 SMP Mon Apr 23 14:08:24 CEST 2018 i686 Intel(R) Core(TM) i5-7400 CPU @ 3.00GHz GenuineIntel GNU/Linux
# 

P.S. The program runs fine on MacOSX.

73, Mario

----8<--------8<--------8<--------8<--------8<--------8<--------8<--------8<--------8<--------8<----
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <ifaddrs.h>
#include <net/if.h>

#define MCAST_PORT  6154
#define MCAST_ADDR  "239.255.1.1"

void ListInterfaces(struct ifaddrs *Interfaces)
{
                for (struct ifaddrs *a=Interfaces; a; a=a->ifa_next)
                {
                               if (!(a->ifa_flags & IFF_UP))
                                               continue;
                               if (!a->ifa_addr || a->ifa_addr->sa_family != AF_INET)
                                               continue;
                               struct sockaddr_in *Addr = (struct sockaddr_in *)a->ifa_addr;
                               printf("%s: %s\n", a->ifa_name, inet_ntoa(Addr->sin_addr));
                }
}

int main(int argc, char *argv[])
{
                struct ifaddrs *Interfaces;
                if (getifaddrs(&Interfaces) < 0)
                {
                               perror("getifaddrs");
                               return -1;
                }

                struct sockaddr_in *MyIfAddr=NULL;
                int                MyIfIndex=0;
                if (argc > 1 && (MyIfIndex = if_nametoindex(argv[1])) )
                               for (struct ifaddrs *a=Interfaces; a; a=a->ifa_next)
                               {
                                               if (!(a->ifa_flags & IFF_UP))
                                                               continue;
                                               if (!a->ifa_addr || a->ifa_addr->sa_family != AF_INET)
                                                               continue;
                                               if (strcmp(argv[1], a->ifa_name)!= 0)
                                                               continue;
                                               MyIfAddr = (struct sockaddr_in *)a->ifa_addr;
                                               break;
                               }

                if (!MyIfAddr || !MyIfIndex)
                {
                               ListInterfaces(Interfaces);
                               return 0;
                }

                int s=socket(PF_INET, SOCK_DGRAM, 0);
                if (s<0)
                {
                               perror("socket");
                               return -1;
                }

                int off=0;
                if (setsockopt(s, IPPROTO_IP, IP_MULTICAST_LOOP, &off, sizeof(off)) < 0)
                               perror("setsockopt(SO_REUSEADDR)");

                int on=1;
                if (setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on)) < 0)
                               perror("setsockopt(SO_REUSEADDR)");

                struct sockaddr_in Addr;
                Addr.sin_family = AF_INET;
                Addr.sin_port = htons(MCAST_PORT);
                inet_aton(MCAST_ADDR, &Addr.sin_addr);
                if (bind(s, (struct sockaddr*)&Addr, sizeof(Addr)) < 0)
                               perror("bind");

                struct ip_mreqn imr;
                inet_aton(MCAST_ADDR, &imr.imr_multiaddr);
                imr.imr_address = MyIfAddr->sin_addr;
                imr.imr_ifindex = MyIfIndex;
                if (setsockopt(s, IPPROTO_IP, IP_ADD_MEMBERSHIP, &imr, sizeof(imr)) < 0)
                               perror("setsockopt(IP_ADD_MEMBERSHIP)");

                for (;;)
                {              
                               struct sockaddr_in AddrBuffer;
                               int AddrLen = sizeof(AddrBuffer);
                               char Buffer[2048];
                               size_t BufferLen = recvfrom(s, &Buffer, sizeof(Buffer), 0, (struct sockaddr*)&AddrBuffer, & AddrLen);
                               if (BufferLen <= 0)
                               {
                                               if (BufferLen < 0)
                                                               perror("recvfrom");
                                               break;
                               }
                               printf("%s: Received %d bytes from %s\n", argv[1], BufferLen, inet_ntoa(AddrBuffer.sin_addr));
                }
}
----8<--------8<--------8<--------8<--------8<--------8<--------8<--------8<--------8<--------8<----

-- 
Mario Klebsch				Actia I+ME GmbH
Mario.klebsch@ime-actia.de		Dresdenstrasse 17/18
Fon: +49 531 38 701 716			38124 Braunschweig
Fax: +49 531 38 701 88			Germany



^ permalink raw reply

* Re: [PATCH net-next 0/5] virtio-net: Add SCTP checksum offload support
From: Vlad Yasevich @ 2018-04-23 13:17 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner, Michael S. Tsirkin
  Cc: Vladislav Yasevich, netdev, linux-sctp, virtualization, jasowang,
	nhorman
In-Reply-To: <20180420172219.GR4716@localhost.localdomain>

On 04/20/2018 01:22 PM, Marcelo Ricardo Leitner wrote:
> On Wed, Apr 18, 2018 at 05:06:46PM +0300, Michael S. Tsirkin wrote:
>> On Tue, Apr 17, 2018 at 04:35:18PM -0400, Vlad Yasevich wrote:
>>> On 04/02/2018 10:47 AM, Marcelo Ricardo Leitner wrote:
>>>> On Mon, Apr 02, 2018 at 09:40:01AM -0400, Vladislav Yasevich wrote:
>>>>> Now that we have SCTP offload capabilities in the kernel, we can add
>>>>> them to virtio as well.  First step is SCTP checksum.
>>>>
>>>> Thanks.
>>>>
>>>>> As for GSO, the way sctp GSO is currently implemented buys us nothing
>>>>> in added support to virtio.  To add true GSO, would require a lot of
>>>>> re-work inside of SCTP and would require extensions to the virtio
>>>>> net header to carry extra sctp data.
>>>>
>>>> Can you please elaborate more on this? Is this because SCTP GSO relies
>>>> on the gso skb format for knowing how to segment it instead of having
>>>> a list of sizes?
>>>>
>>>
>>> it's mainly because all the true segmentation, placing data into chunks,
>>> has already happened.  All that GSO does is allow for higher bundling
>>> rate between VMs. If that is all SCTP GSO ever going to do, that fine,
>>> but the goal is to do real GSO eventually and potentially reduce the
>>> amount of memory copying we are doing.
>>> If we do that, any current attempt at GSO in virtio would have to be
>>> depricated and we'd need GSO2 or something like that.
>>
>> Batching helps virtualization *a lot* though.
> 
> Yep. The results posted by Xin in the other email give good insights
> on it.
> 
>> Are there actual plans for GSO2? Is it just for SCTP?
> 
> No plans. In this context, at least, yes, just for SCTP.
> 
> It was a supposition in case we start doing a different GSO for SCTP,
> one more like what we have for TCP.
> 
> Currently, as the SCTP GSO code doesn't leave the system, we can
> update it if we want. But by the moment we add support for it in
> virtio, we will have to be backwards compatible if we end up doing
> SCTP GSO differently.

So, just because the linux code doesn't do it differently doesn't mean
that someone else doesn't.  Since the device has to work across different
possible implementations, it needs to be generic enough.  If we simply
document the current linux practice, that may not be ideal on the future.

I was hesitant to introduce this without studying the feasibility of
doing late segmentation.

-vlad

> 
> But again, I don't think such approach for SCTP GSO would be neither
> feasible or worth. The complexity for it, to work across stream
> schedules and late TSN allocation, would do more harm then good IMO.
> 
>>
>>>
>>> This is why, after doing the GSO support, I decided not to include it.
>>>
>>> -vlad
>>>>   Marcelo
>>>>

^ permalink raw reply

* Re: Page allocator bottleneck
From: Aaron Lu @ 2018-04-23 13:10 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: Linux Kernel Network Developers, linux-mm, Mel Gorman,
	David Miller, Jesper Dangaard Brouer, Eric Dumazet,
	Alexei Starovoitov, Saeed Mahameed, Eran Ben Elisha,
	Andrew Morton, Michal Hocko
In-Reply-To: <0dea4da6-8756-22d4-c586-267217a5fa63@mellanox.com>

On Mon, Apr 23, 2018 at 11:54:57AM +0300, Tariq Toukan wrote:
> Hi,
> 
> I ran my tests with your patches.
> Initial BW numbers are significantly higher than I documented back then in
> this mail-thread.
> For example, in driver #2 (see original mail thread), with 6 rings, I now
> get 92Gbps (slightly less than linerate) in comparison to 64Gbps back then.
> 
> However, there were many kernel changes since then, I need to isolate your
> changes. I am not sure I can finish this today, but I will surely get to it
> next week after I'm back from vacation.
> 
> Still, when I increase the scale (more rings, i.e. more cpus), I see that
> queued_spin_lock_slowpath gets to 60%+ cpu. Still high, but lower than it
> used to be.

I wonder if it is on allocation path or free path?

Also, increasing PCP size through vm.percpu_pagelist_fraction would
still help with my patches since it can avoid touching even more cache
lines on allocation path with a higher PCP->batch(which has an upper
limit of 96 though at the moment).

> 
> This should be root solved by the (orthogonal) changes planned in network
> subsystem, which will change the SKB allocation/free scheme so that SKBs are
> released on the originating cpu.

^ permalink raw reply

* [PATCH net] sfc: ARFS filter IDs
From: Edward Cree @ 2018-04-23 13:08 UTC (permalink / raw)
  To: linux-net-drivers, David Miller; +Cc: netdev

Associate an arbitrary ID with each ARFS filter, allowing to properly query
 for expiry.  The association is maintained in a hash table, which is
 protected by a spinlock.

Fixes: 3af0f34290f6 ("sfc: replace asynchronous filter operations")
Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 drivers/net/ethernet/sfc/ef10.c       |  80 +++++++++++--------
 drivers/net/ethernet/sfc/efx.c        | 143 ++++++++++++++++++++++++++++++++++
 drivers/net/ethernet/sfc/efx.h        |  19 +++++
 drivers/net/ethernet/sfc/farch.c      |  41 ++++++++--
 drivers/net/ethernet/sfc/net_driver.h |  36 +++++++++
 drivers/net/ethernet/sfc/rx.c         |  62 +++++++++++++--
 6 files changed, 335 insertions(+), 46 deletions(-)

diff --git a/drivers/net/ethernet/sfc/ef10.c b/drivers/net/ethernet/sfc/ef10.c
index 83ce229f4eb7..63036d9bf3e6 100644
--- a/drivers/net/ethernet/sfc/ef10.c
+++ b/drivers/net/ethernet/sfc/ef10.c
@@ -3999,29 +3999,6 @@ static void efx_ef10_prepare_flr(struct efx_nic *efx)
 	atomic_set(&efx->active_queues, 0);
 }
 
-static bool efx_ef10_filter_equal(const struct efx_filter_spec *left,
-				  const struct efx_filter_spec *right)
-{
-	if ((left->match_flags ^ right->match_flags) |
-	    ((left->flags ^ right->flags) &
-	     (EFX_FILTER_FLAG_RX | EFX_FILTER_FLAG_TX)))
-		return false;
-
-	return memcmp(&left->outer_vid, &right->outer_vid,
-		      sizeof(struct efx_filter_spec) -
-		      offsetof(struct efx_filter_spec, outer_vid)) == 0;
-}
-
-static unsigned int efx_ef10_filter_hash(const struct efx_filter_spec *spec)
-{
-	BUILD_BUG_ON(offsetof(struct efx_filter_spec, outer_vid) & 3);
-	return jhash2((const u32 *)&spec->outer_vid,
-		      (sizeof(struct efx_filter_spec) -
-		       offsetof(struct efx_filter_spec, outer_vid)) / 4,
-		      0);
-	/* XXX should we randomise the initval? */
-}
-
 /* Decide whether a filter should be exclusive or else should allow
  * delivery to additional recipients.  Currently we decide that
  * filters for specific local unicast MAC and IP addresses are
@@ -4346,7 +4323,7 @@ static s32 efx_ef10_filter_insert(struct efx_nic *efx,
 		goto out_unlock;
 	match_pri = rc;
 
-	hash = efx_ef10_filter_hash(spec);
+	hash = efx_filter_spec_hash(spec);
 	is_mc_recip = efx_filter_is_mc_recipient(spec);
 	if (is_mc_recip)
 		bitmap_zero(mc_rem_map, EFX_EF10_FILTER_SEARCH_LIMIT);
@@ -4378,7 +4355,7 @@ static s32 efx_ef10_filter_insert(struct efx_nic *efx,
 		if (!saved_spec) {
 			if (ins_index < 0)
 				ins_index = i;
-		} else if (efx_ef10_filter_equal(spec, saved_spec)) {
+		} else if (efx_filter_spec_equal(spec, saved_spec)) {
 			if (spec->priority < saved_spec->priority &&
 			    spec->priority != EFX_FILTER_PRI_AUTO) {
 				rc = -EPERM;
@@ -4762,27 +4739,62 @@ static s32 efx_ef10_filter_get_rx_ids(struct efx_nic *efx,
 static bool efx_ef10_filter_rfs_expire_one(struct efx_nic *efx, u32 flow_id,
 					   unsigned int filter_idx)
 {
+	struct efx_filter_spec *spec, saved_spec;
 	struct efx_ef10_filter_table *table;
-	struct efx_filter_spec *spec;
-	bool ret;
+	struct efx_arfs_rule *rule = NULL;
+	bool ret = true, force = false;
+	u16 arfs_id;
 
 	down_read(&efx->filter_sem);
 	table = efx->filter_state;
 	down_write(&table->lock);
 	spec = efx_ef10_filter_entry_spec(table, filter_idx);
 
-	if (!spec || spec->priority != EFX_FILTER_PRI_HINT) {
-		ret = true;
+	if (!spec || spec->priority != EFX_FILTER_PRI_HINT)
 		goto out_unlock;
-	}
 
-	if (!rps_may_expire_flow(efx->net_dev, spec->dmaq_id, flow_id, 0)) {
-		ret = false;
-		goto out_unlock;
+	spin_lock_bh(&efx->rps_hash_lock);
+	if (!efx->rps_hash_table) {
+		/* In the absence of the table, we always return 0 to ARFS. */
+		arfs_id = 0;
+	} else {
+		rule = efx_rps_hash_find(efx, spec);
+		if (!rule)
+			/* ARFS table doesn't know of this filter, so remove it */
+			goto expire;
+		arfs_id = rule->arfs_id;
+		ret = efx_rps_check_rule(rule, filter_idx, &force);
+		if (force)
+			goto expire;
+		if (!ret) {
+			spin_unlock_bh(&efx->rps_hash_lock);
+			goto out_unlock;
+		}
 	}
-
+	if (!rps_may_expire_flow(efx->net_dev, spec->dmaq_id, flow_id, arfs_id))
+		ret = false;
+	else if (rule)
+		rule->filter_id = EFX_ARFS_FILTER_ID_REMOVING;
+expire:
+	saved_spec = *spec; /* remove operation will kfree spec */
+	spin_unlock_bh(&efx->rps_hash_lock);
+	/* At this point (since we dropped the lock), another thread might queue
+	 * up a fresh insertion request (but the actual insertion will be held
+	 * up by our possession of the filter table lock).  In that case, it
+	 * will set rule->filter_id to EFX_ARFS_FILTER_ID_PENDING, meaning that
+	 * the rule is not removed by efx_rps_hash_del() below.
+	 */
 	ret = efx_ef10_filter_remove_internal(efx, 1U << spec->priority,
 					      filter_idx, true) == 0;
+	/* While we can't safely dereference rule (we dropped the lock), we can
+	 * still test it for NULL.
+	 */
+	if (ret && rule) {
+		/* Expiring, so remove entry from ARFS table */
+		spin_lock_bh(&efx->rps_hash_lock);
+		efx_rps_hash_del(efx, &saved_spec);
+		spin_unlock_bh(&efx->rps_hash_lock);
+	}
 out_unlock:
 	up_write(&table->lock);
 	up_read(&efx->filter_sem);
diff --git a/drivers/net/ethernet/sfc/efx.c b/drivers/net/ethernet/sfc/efx.c
index 692dd729ee2a..a4ebd8715494 100644
--- a/drivers/net/ethernet/sfc/efx.c
+++ b/drivers/net/ethernet/sfc/efx.c
@@ -3027,6 +3027,10 @@ static int efx_init_struct(struct efx_nic *efx,
 	mutex_init(&efx->mac_lock);
 #ifdef CONFIG_RFS_ACCEL
 	mutex_init(&efx->rps_mutex);
+	spin_lock_init(&efx->rps_hash_lock);
+	/* Failure to allocate is not fatal, but may degrade ARFS performance */
+	efx->rps_hash_table = kcalloc(EFX_ARFS_HASH_TABLE_SIZE,
+				      sizeof(*efx->rps_hash_table), GFP_KERNEL);
 #endif
 	efx->phy_op = &efx_dummy_phy_operations;
 	efx->mdio.dev = net_dev;
@@ -3070,6 +3074,10 @@ static void efx_fini_struct(struct efx_nic *efx)
 {
 	int i;
 
+#ifdef CONFIG_RFS_ACCEL
+	kfree(efx->rps_hash_table);
+#endif
+
 	for (i = 0; i < EFX_MAX_CHANNELS; i++)
 		kfree(efx->channel[i]);
 
@@ -3092,6 +3100,141 @@ void efx_update_sw_stats(struct efx_nic *efx, u64 *stats)
 	stats[GENERIC_STAT_rx_noskb_drops] = atomic_read(&efx->n_rx_noskb_drops);
 }
 
+bool efx_filter_spec_equal(const struct efx_filter_spec *left,
+			   const struct efx_filter_spec *right)
+{
+	if ((left->match_flags ^ right->match_flags) |
+	    ((left->flags ^ right->flags) &
+	     (EFX_FILTER_FLAG_RX | EFX_FILTER_FLAG_TX)))
+		return false;
+
+	return memcmp(&left->outer_vid, &right->outer_vid,
+		      sizeof(struct efx_filter_spec) -
+		      offsetof(struct efx_filter_spec, outer_vid)) == 0;
+}
+
+u32 efx_filter_spec_hash(const struct efx_filter_spec *spec)
+{
+	BUILD_BUG_ON(offsetof(struct efx_filter_spec, outer_vid) & 3);
+	return jhash2((const u32 *)&spec->outer_vid,
+		      (sizeof(struct efx_filter_spec) -
+		       offsetof(struct efx_filter_spec, outer_vid)) / 4,
+		      0);
+}
+
+#ifdef CONFIG_RFS_ACCEL
+bool efx_rps_check_rule(struct efx_arfs_rule *rule, unsigned int filter_idx,
+			bool *force)
+{
+	if (rule->filter_id == EFX_ARFS_FILTER_ID_PENDING) {
+		/* ARFS is currently updating this entry, leave it */
+		return false;
+	}
+	if (rule->filter_id == EFX_ARFS_FILTER_ID_ERROR) {
+		/* ARFS tried and failed to update this, so it's probably out
+		 * of date.  Remove the filter and the ARFS rule entry.
+		 */
+		rule->filter_id = EFX_ARFS_FILTER_ID_REMOVING;
+		*force = true;
+		return true;
+	} else if (WARN_ON(rule->filter_id != filter_idx)) { /* can't happen */
+		/* ARFS has moved on, so old filter is not needed.  Since we did
+		 * not mark the rule with EFX_ARFS_FILTER_ID_REMOVING, it will
+		 * not be removed by efx_rps_hash_del() subsequently.
+		 */
+		*force = true;
+		return true;
+	}
+	/* Remove it iff ARFS wants to. */
+	return true;
+}
+
+struct hlist_head *efx_rps_hash_bucket(struct efx_nic *efx,
+				       const struct efx_filter_spec *spec)
+{
+	u32 hash = efx_filter_spec_hash(spec);
+
+	WARN_ON(!spin_is_locked(&efx->rps_hash_lock));
+	if (!efx->rps_hash_table)
+		return NULL;
+	return &efx->rps_hash_table[hash % EFX_ARFS_HASH_TABLE_SIZE];
+}
+
+struct efx_arfs_rule *efx_rps_hash_find(struct efx_nic *efx,
+					const struct efx_filter_spec *spec)
+{
+	struct efx_arfs_rule *rule;
+	struct hlist_head *head;
+	struct hlist_node *node;
+
+	head = efx_rps_hash_bucket(efx, spec);
+	if (!head)
+		return NULL;
+	hlist_for_each(node, head) {
+		rule = container_of(node, struct efx_arfs_rule, node);
+		if (efx_filter_spec_equal(spec, &rule->spec))
+			return rule;
+	}
+	return NULL;
+}
+
+struct efx_arfs_rule *efx_rps_hash_add(struct efx_nic *efx,
+				       const struct efx_filter_spec *spec,
+				       bool *new)
+{
+	struct efx_arfs_rule *rule;
+	struct hlist_head *head;
+	struct hlist_node *node;
+
+	head = efx_rps_hash_bucket(efx, spec);
+	if (!head)
+		return NULL;
+	hlist_for_each(node, head) {
+		rule = container_of(node, struct efx_arfs_rule, node);
+		if (efx_filter_spec_equal(spec, &rule->spec)) {
+			*new = false;
+			return rule;
+		}
+	}
+	rule = kmalloc(sizeof(*rule), GFP_ATOMIC);
+	*new = true;
+	if (rule) {
+		memcpy(&rule->spec, spec, sizeof(rule->spec));
+		hlist_add_head(&rule->node, head);
+	}
+	return rule;
+}
+
+void efx_rps_hash_del(struct efx_nic *efx, const struct efx_filter_spec *spec)
+{
+	struct efx_arfs_rule *rule;
+	struct hlist_head *head;
+	struct hlist_node *node;
+
+	head = efx_rps_hash_bucket(efx, spec);
+	if (WARN_ON(!head))
+		return;
+	hlist_for_each(node, head) {
+		rule = container_of(node, struct efx_arfs_rule, node);
+		if (efx_filter_spec_equal(spec, &rule->spec)) {
+			/* Someone already reused the entry.  We know that if
+			 * this check doesn't fire (i.e. filter_id == REMOVING)
+			 * then the REMOVING mark was put there by our caller,
+			 * because caller is holding a lock on filter table and
+			 * only holders of that lock set REMOVING.
+			 */
+			if (rule->filter_id != EFX_ARFS_FILTER_ID_REMOVING)
+				return;
+			hlist_del(node);
+			kfree(rule);
+			return;
+		}
+	}
+	/* We didn't find it. */
+	WARN_ON(1);
+}
+#endif
+
 /* RSS contexts.  We're using linked lists and crappy O(n) algorithms, because
  * (a) this is an infrequent control-plane operation and (b) n is small (max 64)
  */
diff --git a/drivers/net/ethernet/sfc/efx.h b/drivers/net/ethernet/sfc/efx.h
index a3140e16fcef..6b4164b6d938 100644
--- a/drivers/net/ethernet/sfc/efx.h
+++ b/drivers/net/ethernet/sfc/efx.h
@@ -186,6 +186,25 @@ static inline void efx_filter_rfs_expire(struct work_struct *data) {}
 #endif
 bool efx_filter_is_mc_recipient(const struct efx_filter_spec *spec);
 
+bool efx_filter_spec_equal(const struct efx_filter_spec *left,
+			   const struct efx_filter_spec *right);
+u32 efx_filter_spec_hash(const struct efx_filter_spec *spec);
+
+bool efx_rps_check_rule(struct efx_arfs_rule *rule, unsigned int filter_idx,
+			bool *force);
+
+struct efx_arfs_rule *efx_rps_hash_find(struct efx_nic *efx,
+					const struct efx_filter_spec *spec);
+
+/* @new is written to indicate if entry was newly added (true) or if an old
+ * entry was found and returned (false).
+ */
+struct efx_arfs_rule *efx_rps_hash_add(struct efx_nic *efx,
+				       const struct efx_filter_spec *spec,
+				       bool *new);
+
+void efx_rps_hash_del(struct efx_nic *efx, const struct efx_filter_spec *spec);
+
 /* RSS contexts */
 struct efx_rss_context *efx_alloc_rss_context_entry(struct efx_nic *efx);
 struct efx_rss_context *efx_find_rss_context_entry(struct efx_nic *efx, u32 id);
diff --git a/drivers/net/ethernet/sfc/farch.c b/drivers/net/ethernet/sfc/farch.c
index 7174ef5e5c5e..ade694c1f9a6 100644
--- a/drivers/net/ethernet/sfc/farch.c
+++ b/drivers/net/ethernet/sfc/farch.c
@@ -2905,18 +2905,45 @@ bool efx_farch_filter_rfs_expire_one(struct efx_nic *efx, u32 flow_id,
 {
 	struct efx_farch_filter_state *state = efx->filter_state;
 	struct efx_farch_filter_table *table;
-	bool ret = false;
+	bool ret = false, force = false;
+	u16 arfs_id;
 
 	down_write(&state->lock);
+	spin_lock_bh(&efx->rps_hash_lock);
 	table = &state->table[EFX_FARCH_FILTER_TABLE_RX_IP];
 	if (test_bit(index, table->used_bitmap) &&
-	    table->spec[index].priority == EFX_FILTER_PRI_HINT &&
-	    rps_may_expire_flow(efx->net_dev, table->spec[index].dmaq_id,
-				flow_id, 0)) {
-		efx_farch_filter_table_clear_entry(efx, table, index);
-		ret = true;
+	    table->spec[index].priority == EFX_FILTER_PRI_HINT) {
+		struct efx_filter_spec spec;
+		struct efx_arfs_rule *rule;
+
+		efx_farch_filter_to_gen_spec(&spec, &table->spec[index]);
+		if (!efx->rps_hash_table) {
+			/* In the absence of the table, we always returned 0 to
+			 * ARFS, so use the same to query it.
+			 */
+			arfs_id = 0;
+		} else {
+			rule = efx_rps_hash_find(efx, &spec);
+			if (!rule) {
+				/* ARFS table doesn't know of this filter, remove it */
+				force = true;
+			} else {
+				arfs_id = rule->arfs_id;
+				if (!efx_rps_check_rule(rule, index, &force))
+					goto out_unlock;
+			}
+		}
+		if (force || rps_may_expire_flow(efx->net_dev, spec.dmaq_id,
+						 flow_id, arfs_id)) {
+			if (rule)
+				rule->filter_id = EFX_ARFS_FILTER_ID_REMOVING;
+			efx_rps_hash_del(efx, &spec);
+			efx_farch_filter_table_clear_entry(efx, table, index);
+			ret = true;
+		}
 	}
-
+out_unlock:
+	spin_unlock_bh(&efx->rps_hash_lock);
 	up_write(&state->lock);
 	return ret;
 }
diff --git a/drivers/net/ethernet/sfc/net_driver.h b/drivers/net/ethernet/sfc/net_driver.h
index eea3808b3f25..65568925c3ef 100644
--- a/drivers/net/ethernet/sfc/net_driver.h
+++ b/drivers/net/ethernet/sfc/net_driver.h
@@ -734,6 +734,35 @@ struct efx_rss_context {
 };
 
 #ifdef CONFIG_RFS_ACCEL
+/* Order of these is important, since filter_id >= %EFX_ARFS_FILTER_ID_PENDING
+ * is used to test if filter does or will exist.
+ */
+#define EFX_ARFS_FILTER_ID_PENDING	-1
+#define EFX_ARFS_FILTER_ID_ERROR	-2
+#define EFX_ARFS_FILTER_ID_REMOVING	-3
+/**
+ * struct efx_arfs_rule - record of an ARFS filter and its IDs
+ * @node: linkage into hash table
+ * @spec: details of the filter (used as key for hash table).  Use efx->type to
+ *	determine which member to use.
+ * @rxq_index: channel to which the filter will steer traffic.
+ * @arfs_id: filter ID which was returned to ARFS
+ * @filter_id: index in software filter table.  May be
+ *	%EFX_ARFS_FILTER_ID_PENDING if filter was not inserted yet,
+ *	%EFX_ARFS_FILTER_ID_ERROR if filter insertion failed, or
+ *	%EFX_ARFS_FILTER_ID_REMOVING if expiry is currently removing the filter.
+ */
+struct efx_arfs_rule {
+	struct hlist_node node;
+	struct efx_filter_spec spec;
+	u16 rxq_index;
+	u16 arfs_id;
+	s32 filter_id;
+};
+
+/* Size chosen so that the table is one page (4kB) */
+#define EFX_ARFS_HASH_TABLE_SIZE	512
+
 /**
  * struct efx_async_filter_insertion - Request to asynchronously insert a filter
  * @net_dev: Reference to the netdevice
@@ -873,6 +902,10 @@ struct efx_async_filter_insertion {
  *	@rps_expire_channel's @rps_flow_id
  * @rps_slot_map: bitmap of in-flight entries in @rps_slot
  * @rps_slot: array of ARFS insertion requests for efx_filter_rfs_work()
+ * @rps_hash_lock: Protects ARFS filter mapping state (@rps_hash_table and
+ *	@rps_next_id).
+ * @rps_hash_table: Mapping between ARFS filters and their various IDs
+ * @rps_next_id: next arfs_id for an ARFS filter
  * @active_queues: Count of RX and TX queues that haven't been flushed and drained.
  * @rxq_flush_pending: Count of number of receive queues that need to be flushed.
  *	Decremented when the efx_flush_rx_queue() is called.
@@ -1029,6 +1062,9 @@ struct efx_nic {
 	unsigned int rps_expire_index;
 	unsigned long rps_slot_map;
 	struct efx_async_filter_insertion rps_slot[EFX_RPS_MAX_IN_FLIGHT];
+	spinlock_t rps_hash_lock;
+	struct hlist_head *rps_hash_table;
+	u32 rps_next_id;
 #endif
 
 	atomic_t active_queues;
diff --git a/drivers/net/ethernet/sfc/rx.c b/drivers/net/ethernet/sfc/rx.c
index 9c593c661cbf..64a94f242027 100644
--- a/drivers/net/ethernet/sfc/rx.c
+++ b/drivers/net/ethernet/sfc/rx.c
@@ -834,9 +834,29 @@ static void efx_filter_rfs_work(struct work_struct *data)
 	struct efx_nic *efx = netdev_priv(req->net_dev);
 	struct efx_channel *channel = efx_get_channel(efx, req->rxq_index);
 	int slot_idx = req - efx->rps_slot;
+	struct efx_arfs_rule *rule;
+	u16 arfs_id = 0;
 	int rc;
 
 	rc = efx->type->filter_insert(efx, &req->spec, true);
+	if (efx->rps_hash_table) {
+		spin_lock_bh(&efx->rps_hash_lock);
+		rule = efx_rps_hash_find(efx, &req->spec);
+		/* The rule might have already gone, if someone else's request
+		 * for the same spec was already worked and then expired before
+		 * we got around to our work.  In that case we have nothing
+		 * tying us to an arfs_id, meaning that as soon as the filter
+		 * is considered for expiry it will be removed.
+		 */
+		if (rule) {
+			if (rc < 0)
+				rule->filter_id = EFX_ARFS_FILTER_ID_ERROR;
+			else
+				rule->filter_id = rc;
+			arfs_id = rule->arfs_id;
+		}
+		spin_unlock_bh(&efx->rps_hash_lock);
+	}
 	if (rc >= 0) {
 		/* Remember this so we can check whether to expire the filter
 		 * later.
@@ -848,18 +868,18 @@ static void efx_filter_rfs_work(struct work_struct *data)
 
 		if (req->spec.ether_type == htons(ETH_P_IP))
 			netif_info(efx, rx_status, efx->net_dev,
-				   "steering %s %pI4:%u:%pI4:%u to queue %u [flow %u filter %d]\n",
+				   "steering %s %pI4:%u:%pI4:%u to queue %u [flow %u filter %d id %u]\n",
 				   (req->spec.ip_proto == IPPROTO_TCP) ? "TCP" : "UDP",
 				   req->spec.rem_host, ntohs(req->spec.rem_port),
 				   req->spec.loc_host, ntohs(req->spec.loc_port),
-				   req->rxq_index, req->flow_id, rc);
+				   req->rxq_index, req->flow_id, rc, arfs_id);
 		else
 			netif_info(efx, rx_status, efx->net_dev,
-				   "steering %s [%pI6]:%u:[%pI6]:%u to queue %u [flow %u filter %d]\n",
+				   "steering %s [%pI6]:%u:[%pI6]:%u to queue %u [flow %u filter %d id %u]\n",
 				   (req->spec.ip_proto == IPPROTO_TCP) ? "TCP" : "UDP",
 				   req->spec.rem_host, ntohs(req->spec.rem_port),
 				   req->spec.loc_host, ntohs(req->spec.loc_port),
-				   req->rxq_index, req->flow_id, rc);
+				   req->rxq_index, req->flow_id, rc, arfs_id);
 	}
 
 	/* Release references */
@@ -872,8 +892,10 @@ int efx_filter_rfs(struct net_device *net_dev, const struct sk_buff *skb,
 {
 	struct efx_nic *efx = netdev_priv(net_dev);
 	struct efx_async_filter_insertion *req;
+	struct efx_arfs_rule *rule;
 	struct flow_keys fk;
 	int slot_idx;
+	bool new;
 	int rc;
 
 	/* find a free slot */
@@ -926,12 +948,42 @@ int efx_filter_rfs(struct net_device *net_dev, const struct sk_buff *skb,
 	req->spec.rem_port = fk.ports.src;
 	req->spec.loc_port = fk.ports.dst;
 
+	if (efx->rps_hash_table) {
+		/* Add it to ARFS hash table */
+		spin_lock(&efx->rps_hash_lock);
+		rule = efx_rps_hash_add(efx, &req->spec, &new);
+		if (!rule) {
+			rc = -ENOMEM;
+			goto out_unlock;
+		}
+		if (new)
+			rule->arfs_id = efx->rps_next_id++ % RPS_NO_FILTER;
+		rc = rule->arfs_id;
+		/* Skip if existing or pending filter already does the right thing */
+		if (!new && rule->rxq_index == rxq_index &&
+		    rule->filter_id >= EFX_ARFS_FILTER_ID_PENDING)
+			goto out_unlock;
+		rule->rxq_index = rxq_index;
+		rule->filter_id = EFX_ARFS_FILTER_ID_PENDING;
+		spin_unlock(&efx->rps_hash_lock);
+	} else {
+		/* Without an ARFS hash table, we just use arfs_id 0 for all
+		 * filters.  This means if multiple flows hash to the same
+		 * flow_id, all but the most recently touched will be eligible
+		 * for expiry.
+		 */
+		rc = 0;
+	}
+
+	/* Queue the request */
 	dev_hold(req->net_dev = net_dev);
 	INIT_WORK(&req->work, efx_filter_rfs_work);
 	req->rxq_index = rxq_index;
 	req->flow_id = flow_id;
 	schedule_work(&req->work);
-	return 0;
+	return rc;
+out_unlock:
+	spin_unlock(&efx->rps_hash_lock);
 out_clear:
 	clear_bit(slot_idx, &efx->rps_slot_map);
 	return rc;

^ permalink raw reply related

* Repeating "unregister_netdevice: waiting for lo to become free" caused by upstream 76da0704507bb ("ipv6: only call ip6_route_dev_notify() once for NETDEV_UNREGISTER")
From: Rafał Miłecki @ 2018-04-23 13:08 UTC (permalink / raw)
  To: WANG Cong, David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI,
	Network Development, jeffy, David Ahern, Khlebnikov
  Cc: Greg Kroah-Hartman, Stable

Hi,

I've just updated my kernel 4.4.x and noticed a regression. Bisecting
pointed me to the commit 2417da3f4d6bc ("ipv6: only call
ip6_route_dev_notify() once for NETDEV_UNREGISTER") [0] which is
backport of upstream 76da0704507bb. That backported commit has
appeared in a 4.4.103.

I use OpenWrt/LEDE [1] distribution and LXC [2] 1.1.5. After stopping
a container I start getting these messages:
[  229.419188] unregister_netdevice: waiting for lo to become free.
Usage count = 1
[  239.660408] unregister_netdevice: waiting for lo to become free.
Usage count = 1
[  249.839189] unregister_netdevice: waiting for lo to become free.
Usage count = 1
(...)

Trying to start LXC nevertheless results in lxc-start command hang
around network configuration. Trying to query LXC state afterwards
results in a lxc-info command hang too.

I tried Googling for this issue and found similar reports:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1729637
https://github.com/fnproject/fn/issues/686
https://lime-technology.com/forums/topic/66863-kernelunregister_netdevice-waiting-for-lo-to-become-free-usage-count-1/
all of them related to the Docker, which is probably a similar use
case to the LXC.

I couldn't find any reference to commit 76da0704507bb that could
suggest fixing the problem I'm seeing.

Does anyone have an idea what is the issue I'm seeing about? Or even
better, how to fix it? Can I provide any additional info that would
help?

[0] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?h=linux-4.4.y&id=2417da3f4d6bc4fc6c77f613f0e2264090892aa5
[1] https://openwrt.org/
[2] https://linuxcontainers.org/

-- 
Rafał

^ permalink raw reply

* [PATCH net 1/1] Modify the seq_puts and seq_printf of af_netlink.c file
From: Bo YU @ 2018-04-23 12:51 UTC (permalink / raw)
  To: davem, Wang, Berg, Tkhai, Long, Elena; +Cc: netdev

Modify format output symbol of seq_printf function and adjust blanks in
seq_puts function in order to make convenience with command:`cat
/proc/net/netlink`

Signed-off-by: Bo YU <tsu.yubo@gmail.com>
---
  net/netlink/af_netlink.c | 6 +++---
  1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 55342c4d5cec..2e2dd88fc79f 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2606,13 +2606,13 @@ static int netlink_seq_show(struct seq_file *seq, void *v)
  {
  	if (v == SEQ_START_TOKEN) {
  		seq_puts(seq,
-			 "sk       Eth Pid    Groups   "
-			 "Rmem     Wmem     Dump     Locks     Drops     Inode\n");
+			 "sk               Eth Pid        Groups   "
+			 "Rmem     Wmem     Dump  Locks    Drops    Inode\n");
  	} else {
  		struct sock *s = v;
  		struct netlink_sock *nlk = nlk_sk(s);

-		seq_printf(seq, "%pK %-3d %-6u %08x %-8d %-8d %d %-8d %-8d %-8lu\n",
+		seq_printf(seq, "%pK %-3d %-10u %08x %-8d %-8d %-5d %-8d %-8d %-8lu\n",
  			   s,
  			   s->sk_protocol,
  			   nlk->portid,

^ permalink raw reply related

* [PATCH] dca: make function dca_common_get_tag static
From: Colin King @ 2018-04-23 12:49 UTC (permalink / raw)
  To: netdev; +Cc: kernel-janitors, linux-kernel

From: Colin Ian King <colin.king@canonical.com>

Function dca_common_get_tag is local to the source and does not need to be
in global scope, so make it static.

Cleans up sparse warning:
drivers/dca/dca-core.c:273:4: warning: symbol 'dca_common_get_tag' was
not declared. Should it be static?

Signed-off-by: Colin Ian King <colin.king@canonical.com>
---
 drivers/dca/dca-core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/dca/dca-core.c b/drivers/dca/dca-core.c
index 7afbb28d6a0f..1bc5ffb338c8 100644
--- a/drivers/dca/dca-core.c
+++ b/drivers/dca/dca-core.c
@@ -270,7 +270,7 @@ EXPORT_SYMBOL_GPL(dca_remove_requester);
  * @dev - the device that wants dca service
  * @cpu - the cpuid as returned by get_cpu()
  */
-u8 dca_common_get_tag(struct device *dev, int cpu)
+static u8 dca_common_get_tag(struct device *dev, int cpu)
 {
 	struct dca_provider *dca;
 	u8 tag;
-- 
2.17.0

^ permalink raw reply related

* Re: [PATCH net-next] lan78xx: Lan7801 Support for Fixed PHY
From: Andrew Lunn @ 2018-04-23 12:42 UTC (permalink / raw)
  To: Raghuram Chary J; +Cc: davem, netdev, unglinuxdriver, woojung.huh
In-Reply-To: <20180423044630.2672-1-raghuramchary.jallipalli@microchip.com>

>  #define DRIVER_AUTHOR	"WOOJUNG HUH <woojung.huh@microchip.com>"
>  #define DRIVER_DESC	"LAN78XX USB 3.0 Gigabit Ethernet Devices"
>  #define DRIVER_NAME	"lan78xx"
> -#define DRIVER_VERSION	"1.0.6"
> +#define DRIVER_VERSION	"1.0.7"

Hi Raghuram

Driver version strings a pretty pointless. You might want to remove
it.

>  
>  #define TX_TIMEOUT_JIFFIES		(5 * HZ)
>  #define THROTTLE_JIFFIES		(HZ / 8)
> @@ -426,6 +426,7 @@ struct lan78xx_net {
>  	struct statstage	stats;
>  
>  	struct irq_domain_data	domain_data;
> +	struct phy_device	*fixedphy;
>  };
>  
>  /* define external phy id */
> @@ -2062,11 +2063,39 @@ static int lan78xx_phy_init(struct lan78xx_net *dev)
>  	int ret;
>  	u32 mii_adv;
>  	struct phy_device *phydev;
> +	struct fixed_phy_status fphy_status = {
> +		.link = 1,
> +		.speed = SPEED_1000,
> +		.duplex = DUPLEX_FULL,
> +	};
>  
>  	phydev = phy_find_first(dev->mdiobus);
>  	if (!phydev) {
> -		netdev_err(dev->net, "no PHY found\n");
> -		return -EIO;
> +		if (dev->chipid == ID_REV_CHIP_ID_7801_) {
> +			u32 buf;
> +
> +			netdev_info(dev->net, "PHY Not Found!! Registering Fixed PHY\n");
> +			phydev = fixed_phy_register(PHY_POLL, &fphy_status, -1,
> +						    NULL);
> +			if (IS_ERR(phydev)) {
> +				netdev_err(dev->net, "No PHY/fixed_PHY found\n");
> +				return -ENODEV;
> +			}
> +			netdev_info(dev->net, "Registered FIXED PHY\n");

There are too many detdev_info() messages here. Maybe make them both
netdev_dbg().

> +			dev->interface = PHY_INTERFACE_MODE_RGMII;
> +			dev->fixedphy = phydev;

You can use 

if (!phy_is_pseudo_fixed_link(phydev))

to determine is a PHY is a fixed phy. I think you can then do without
dev->fixedphy.

> +			ret = lan78xx_write_reg(dev, MAC_RGMII_ID,
> +						MAC_RGMII_ID_TXC_DELAY_EN_);
> +			ret = lan78xx_write_reg(dev, RGMII_TX_BYP_DLL, 0x3D00);
> +			ret = lan78xx_read_reg(dev, HW_CFG, &buf);
> +			buf |= HW_CFG_CLK125_EN_;
> +			buf |= HW_CFG_REFCLK25_EN_;
> +			ret = lan78xx_write_reg(dev, HW_CFG, buf);
> +			goto phyinit;

Please don't use a goto like this. Maybe turn this into a switch statement?

> +		} else {
> +			netdev_err(dev->net, "no PHY found\n");
> +			return -EIO;
> +		}
>  	}
>  
>  	if ((dev->chipid == ID_REV_CHIP_ID_7800_) ||
> @@ -2105,6 +2134,7 @@ static int lan78xx_phy_init(struct lan78xx_net *dev)
>  		goto error;

Please take a look at what happens at error: It does not look
correct. Probably now is a good time to refactor the whole of lan78xx_phy_init()

	 Andrew

^ permalink raw reply

* Re: [PATCH bpf-next v3 4/9] bpf/verifier: improve register value range tracking with ARSH
From: Edward Cree @ 2018-04-23 12:25 UTC (permalink / raw)
  To: Yonghong Song, ast, daniel, netdev; +Cc: kernel-team
In-Reply-To: <20180420221842.742330-5-yhs@fb.com>

On 20/04/18 23:18, Yonghong Song wrote:
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 3c8bb92..01c215d 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -2975,6 +2975,32 @@ static int adjust_scalar_min_max_vals(struct bpf_verifier_env *env,
>  		/* We may learn something more from the var_off */
>  		__update_reg_bounds(dst_reg);
>  		break;
> +	case BPF_ARSH:
> +		if (umax_val >= insn_bitness) {
> +			/* Shifts greater than 31 or 63 are undefined.
> +			 * This includes shifts by a negative number.
> +			 */
> +			mark_reg_unknown(env, regs, insn->dst_reg);
> +			break;
> +		}
> +		if (dst_reg->smin_value < 0)
> +			dst_reg->smin_value >>= umin_val;
> +		else
> +			dst_reg->smin_value >>= umax_val;
> +		if (dst_reg->smax_value < 0)
> +			dst_reg->smax_value >>= umax_val;
> +		else
> +			dst_reg->smax_value >>= umin_val;
> +		if (src_known)
> +			dst_reg->var_off = tnum_rshift(dst_reg->var_off,
> +						       umin_val);
tnum_rshift is an unsigned shift, it won't do what you want here.
I think you could write a tnum_arshift that looks something like this
 (UNTESTED!):

    struct tnum tnum_arshift(struct tnum a, u8 shift)
    {
        return TNUM(((s64)a.value) >> shift, ((s64)a.mask) >> shift);
    }
Theory: if value sign bit is 1 then number is known negative so populate
 upper bits with known 1s.  If mask sign bit is 1 then number might be
 negative so populate upper bits with unknown.  Otherwise, number is
 known positive so populate upper bits with known 0s.

> +		else
> +			dst_reg->var_off = tnum_rshift(tnum_unknown, umin_val);
Applying the above here, tnum_arshift(tnum_unknown, ...) would always just
 return tnum_unknown, so just do "dst_reg->var_off = tnum_unknown;".
The reason for the corresponding logic in the BPF_RSH case is that a right
 logical shift _always_ populates upper bits with zeroes.
In any case these 'else' branches are currently never taken because they
 fall foul of the check Alexei added just before the switch,
    if (!src_known &&
        opcode != BPF_ADD && opcode != BPF_SUB && opcode != BPF_AND) {
        __mark_reg_unknown(dst_reg);
        return 0;
    }
So I can guarantee you haven't tested this code :-)

> +		dst_reg->umin_value >>= umax_val;
> +		dst_reg->umax_value >>= umin_val;
FWIW I think the way to handle umin/umax here is to blow them away and
 just rely on inferring new ubounds from the sbounds (i.e. the inverse of
 what we do just above in case BPF_RSH) since BPF_ARSH is essentially an
 operation on the signed value.  I don't think there is a need to support
 cases where the unsigned bounds of a signed shift of a value that may
 cross the sign boundary at (1<<63) are needed to verify a program.
(Unlike in the unsigned shift case, it is at least _possible_ for there to
 be information from the ubounds that we can't get from the sbounds - but
 it's a contrived case that isn't likely to be useful in real programs.)

-Ed
> +		/* We may learn something more from the var_off */
> +		__update_reg_bounds(dst_reg);
> +		break;
>  	default:
>  		mark_reg_unknown(env, regs, insn->dst_reg);
>  		break;

^ permalink raw reply

* [PATCH net-next 2/2] qed: Add configuration information to register dump and debug data
From: Denis Bolotin @ 2018-04-23 11:56 UTC (permalink / raw)
  To: davem; +Cc: netdev, ariel.elior, Denis Bolotin
In-Reply-To: <20180423115605.8531-1-denis.bolotin@cavium.com>

Configuration information is added to the debug data collection, in
addition to register dump.
Added qed_dbg_nvm_image() that receives an image type, allocates a
buffer and reads the image. The images are saved in the buffers and the
dump size is updated.

Signed-off-by: Denis Bolotin <denis.bolotin@cavium.com>
Signed-off-by: Ariel Elior <ariel.elior@cavium.com>
---
 drivers/net/ethernet/qlogic/qed/qed_debug.c | 113 +++++++++++++++++++++++++++-
 drivers/net/ethernet/qlogic/qed/qed_mcp.c   |  14 +++-
 drivers/net/ethernet/qlogic/qed/qed_mcp.h   |  14 ++++
 include/linux/qed/qed_if.h                  |   3 +
 4 files changed, 139 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_debug.c b/drivers/net/ethernet/qlogic/qed/qed_debug.c
index 4926c55..b3211c7 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_debug.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_debug.c
@@ -7778,6 +7778,57 @@ int qed_dbg_igu_fifo_size(struct qed_dev *cdev)
 	return qed_dbg_feature_size(cdev, DBG_FEATURE_IGU_FIFO);
 }
 
+int qed_dbg_nvm_image_length(struct qed_hwfn *p_hwfn,
+			     enum qed_nvm_images image_id, u32 *length)
+{
+	struct qed_nvm_image_att image_att;
+	int rc;
+
+	*length = 0;
+	rc = qed_mcp_get_nvm_image_att(p_hwfn, image_id, &image_att);
+	if (rc)
+		return rc;
+
+	*length = image_att.length;
+
+	return rc;
+}
+
+int qed_dbg_nvm_image(struct qed_dev *cdev, void *buffer,
+		      u32 *num_dumped_bytes, enum qed_nvm_images image_id)
+{
+	struct qed_hwfn *p_hwfn =
+		&cdev->hwfns[cdev->dbg_params.engine_for_debug];
+	u32 len_rounded, i;
+	__be32 val;
+	int rc;
+
+	*num_dumped_bytes = 0;
+	rc = qed_dbg_nvm_image_length(p_hwfn, image_id, &len_rounded);
+	if (rc)
+		return rc;
+
+	DP_NOTICE(p_hwfn->cdev,
+		  "Collecting a debug feature [\"nvram image %d\"]\n",
+		  image_id);
+
+	len_rounded = roundup(len_rounded, sizeof(u32));
+	rc = qed_mcp_get_nvm_image(p_hwfn, image_id, buffer, len_rounded);
+	if (rc)
+		return rc;
+
+	/* QED_NVM_IMAGE_NVM_META image is not swapped like other images */
+	if (image_id != QED_NVM_IMAGE_NVM_META)
+		for (i = 0; i < len_rounded; i += 4) {
+			val = cpu_to_be32(*(u32 *)(buffer + i));
+			*(u32 *)(buffer + i) = val;
+		}
+
+	*num_dumped_bytes = len_rounded;
+
+	return rc;
+}
+
 int qed_dbg_protection_override(struct qed_dev *cdev, void *buffer,
 				u32 *num_dumped_bytes)
 {
@@ -7831,6 +7882,9 @@ enum debug_print_features {
 	IGU_FIFO = 6,
 	PHY = 7,
 	FW_ASSERTS = 8,
+	NVM_CFG1 = 9,
+	DEFAULT_CFG = 10,
+	NVM_META = 11,
 };
 
 static u32 qed_calc_regdump_header(enum debug_print_features feature,
@@ -7965,13 +8019,61 @@ int qed_dbg_all_data(struct qed_dev *cdev, void *buffer)
 		DP_ERR(cdev, "qed_dbg_mcp_trace failed. rc = %d\n", rc);
 	}
 
+	/* nvm cfg1 */
+	rc = qed_dbg_nvm_image(cdev,
+			       (u8 *)buffer + offset + REGDUMP_HEADER_SIZE,
+			       &feature_size, QED_NVM_IMAGE_NVM_CFG1);
+	if (!rc) {
+		*(u32 *)((u8 *)buffer + offset) =
+		    qed_calc_regdump_header(NVM_CFG1, cur_engine,
+					    feature_size, omit_engine);
+		offset += (feature_size + REGDUMP_HEADER_SIZE);
+	} else if (rc != -ENOENT) {
+		DP_ERR(cdev,
+		       "qed_dbg_nvm_image failed for image  %d (%s), rc = %d\n",
+		       QED_NVM_IMAGE_NVM_CFG1, "QED_NVM_IMAGE_NVM_CFG1", rc);
+	}
+
+	/* nvm default */
+	rc = qed_dbg_nvm_image(cdev,
+			       (u8 *)buffer + offset + REGDUMP_HEADER_SIZE,
+			       &feature_size, QED_NVM_IMAGE_DEFAULT_CFG);
+	if (!rc) {
+		*(u32 *)((u8 *)buffer + offset) =
+		    qed_calc_regdump_header(DEFAULT_CFG, cur_engine,
+					    feature_size, omit_engine);
+		offset += (feature_size + REGDUMP_HEADER_SIZE);
+	} else if (rc != -ENOENT) {
+		DP_ERR(cdev,
+		       "qed_dbg_nvm_image failed for image %d (%s), rc = %d\n",
+		       QED_NVM_IMAGE_DEFAULT_CFG, "QED_NVM_IMAGE_DEFAULT_CFG",
+		       rc);
+	}
+
+	/* nvm meta */
+	rc = qed_dbg_nvm_image(cdev,
+			       (u8 *)buffer + offset + REGDUMP_HEADER_SIZE,
+			       &feature_size, QED_NVM_IMAGE_NVM_META);
+	if (!rc) {
+		*(u32 *)((u8 *)buffer + offset) =
+		    qed_calc_regdump_header(NVM_META, cur_engine,
+					    feature_size, omit_engine);
+		offset += (feature_size + REGDUMP_HEADER_SIZE);
+	} else if (rc != -ENOENT) {
+		DP_ERR(cdev,
+		       "qed_dbg_nvm_image failed for image %d (%s), rc = %d\n",
+		       QED_NVM_IMAGE_NVM_META, "QED_NVM_IMAGE_NVM_META", rc);
+	}
+
 	return 0;
 }
 
 int qed_dbg_all_data_size(struct qed_dev *cdev)
 {
+	struct qed_hwfn *p_hwfn =
+		&cdev->hwfns[cdev->dbg_params.engine_for_debug];
+	u32 regs_len = 0, image_len = 0;
 	u8 cur_engine, org_engine;
-	u32 regs_len = 0;
 
 	org_engine = qed_get_debug_engine(cdev);
 	for (cur_engine = 0; cur_engine < cdev->num_hwfns; cur_engine++) {
@@ -7993,6 +8095,15 @@ int qed_dbg_all_data_size(struct qed_dev *cdev)
 
 	/* Engine common */
 	regs_len += REGDUMP_HEADER_SIZE + qed_dbg_mcp_trace_size(cdev);
+	qed_dbg_nvm_image_length(p_hwfn, QED_NVM_IMAGE_NVM_CFG1, &image_len);
+	if (image_len)
+		regs_len += REGDUMP_HEADER_SIZE + image_len;
+	qed_dbg_nvm_image_length(p_hwfn, QED_NVM_IMAGE_DEFAULT_CFG, &image_len);
+	if (image_len)
+		regs_len += REGDUMP_HEADER_SIZE + image_len;
+	qed_dbg_nvm_image_length(p_hwfn, QED_NVM_IMAGE_NVM_META, &image_len);
+	if (image_len)
+		regs_len += REGDUMP_HEADER_SIZE + image_len;
 
 	return regs_len;
 }
diff --git a/drivers/net/ethernet/qlogic/qed/qed_mcp.c b/drivers/net/ethernet/qlogic/qed/qed_mcp.c
index 1377ad1..0550f0e 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_mcp.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_mcp.c
@@ -2529,7 +2529,7 @@ int qed_mcp_nvm_info_populate(struct qed_hwfn *p_hwfn)
 	return rc;
 }
 
-static int
+int
 qed_mcp_get_nvm_image_att(struct qed_hwfn *p_hwfn,
 			  enum qed_nvm_images image_id,
 			  struct qed_nvm_image_att *p_image_att)
@@ -2545,6 +2545,15 @@ int qed_mcp_nvm_info_populate(struct qed_hwfn *p_hwfn)
 	case QED_NVM_IMAGE_FCOE_CFG:
 		type = NVM_TYPE_FCOE_CFG;
 		break;
+	case QED_NVM_IMAGE_NVM_CFG1:
+		type = NVM_TYPE_NVM_CFG1;
+		break;
+	case QED_NVM_IMAGE_DEFAULT_CFG:
+		type = NVM_TYPE_DEFAULT_CFG;
+		break;
+	case QED_NVM_IMAGE_NVM_META:
+		type = NVM_TYPE_META;
+		break;
 	default:
 		DP_NOTICE(p_hwfn, "Unknown request of image_id %08x\n",
 			  image_id);
@@ -2588,9 +2597,6 @@ int qed_mcp_get_nvm_image(struct qed_hwfn *p_hwfn,
 		return -EINVAL;
 	}
 
-	/* Each NVM image is suffixed by CRC; Upper-layer has no need for it */
-	image_att.length -= 4;
-
 	if (image_att.length > buffer_len) {
 		DP_VERBOSE(p_hwfn,
 			   QED_MSG_STORAGE,
diff --git a/drivers/net/ethernet/qlogic/qed/qed_mcp.h b/drivers/net/ethernet/qlogic/qed/qed_mcp.h
index dd62c38..3af3896 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_mcp.h
+++ b/drivers/net/ethernet/qlogic/qed/qed_mcp.h
@@ -486,6 +486,20 @@ struct qed_nvm_image_att {
  * @brief Allows reading a whole nvram image
  *
  * @param p_hwfn
+ * @param image_id - image to get attributes for
+ * @param p_image_att - image attributes structure into which to fill data
+ *
+ * @return int - 0 - operation was successful.
+ */
+int
+qed_mcp_get_nvm_image_att(struct qed_hwfn *p_hwfn,
+			  enum qed_nvm_images image_id,
+			  struct qed_nvm_image_att *p_image_att);
+
+/**
+ * @brief Allows reading a whole nvram image
+ *
+ * @param p_hwfn
  * @param image_id - image requested for reading
  * @param p_buffer - allocated buffer into which to fill data
  * @param buffer_len - length of the allocated buffer.
diff --git a/include/linux/qed/qed_if.h b/include/linux/qed/qed_if.h
index b5b2bc9..e53f9c7 100644
--- a/include/linux/qed/qed_if.h
+++ b/include/linux/qed/qed_if.h
@@ -159,6 +159,9 @@ struct qed_dcbx_get {
 enum qed_nvm_images {
 	QED_NVM_IMAGE_ISCSI_CFG,
 	QED_NVM_IMAGE_FCOE_CFG,
+	QED_NVM_IMAGE_NVM_CFG1,
+	QED_NVM_IMAGE_DEFAULT_CFG,
+	QED_NVM_IMAGE_NVM_META,
 };
 
 struct qed_link_eee_params {
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next 1/2] qed: Delete unused parameter p_ptt from mcp APIs
From: Denis Bolotin @ 2018-04-23 11:56 UTC (permalink / raw)
  To: davem; +Cc: netdev, ariel.elior, Denis Bolotin
In-Reply-To: <20180423115605.8531-1-denis.bolotin@cavium.com>

Since nvm images attributes are cached during driver load, acquiring ptt
is not needed when calling qed_mcp_get_nvm_image().

Signed-off-by: Denis Bolotin <denis.bolotin@cavium.com>
Signed-off-by: Ariel Elior <ariel.elior@cavium.com>
---
 drivers/net/ethernet/qlogic/qed/qed_main.c | 9 +--------
 drivers/net/ethernet/qlogic/qed/qed_mcp.c  | 4 +---
 drivers/net/ethernet/qlogic/qed/qed_mcp.h  | 2 --
 3 files changed, 2 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_main.c b/drivers/net/ethernet/qlogic/qed/qed_main.c
index 9854aa9..d1d3787 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_main.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_main.c
@@ -1894,15 +1894,8 @@ static int qed_nvm_get_image(struct qed_dev *cdev, enum qed_nvm_images type,
 			     u8 *buf, u16 len)
 {
 	struct qed_hwfn *hwfn = QED_LEADING_HWFN(cdev);
-	struct qed_ptt *ptt = qed_ptt_acquire(hwfn);
-	int rc;
-
-	if (!ptt)
-		return -EAGAIN;
 
-	rc = qed_mcp_get_nvm_image(hwfn, ptt, type, buf, len);
-	qed_ptt_release(hwfn, ptt);
-	return rc;
+	return qed_mcp_get_nvm_image(hwfn, type, buf, len);
 }
 
 static int qed_set_coalesce(struct qed_dev *cdev, u16 rx_coal, u16 tx_coal,
diff --git a/drivers/net/ethernet/qlogic/qed/qed_mcp.c b/drivers/net/ethernet/qlogic/qed/qed_mcp.c
index ec0d425..1377ad1 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_mcp.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_mcp.c
@@ -2531,7 +2531,6 @@ int qed_mcp_nvm_info_populate(struct qed_hwfn *p_hwfn)
 
 static int
 qed_mcp_get_nvm_image_att(struct qed_hwfn *p_hwfn,
-			  struct qed_ptt *p_ptt,
 			  enum qed_nvm_images image_id,
 			  struct qed_nvm_image_att *p_image_att)
 {
@@ -2569,7 +2568,6 @@ int qed_mcp_nvm_info_populate(struct qed_hwfn *p_hwfn)
 }
 
 int qed_mcp_get_nvm_image(struct qed_hwfn *p_hwfn,
-			  struct qed_ptt *p_ptt,
 			  enum qed_nvm_images image_id,
 			  u8 *p_buffer, u32 buffer_len)
 {
@@ -2578,7 +2576,7 @@ int qed_mcp_get_nvm_image(struct qed_hwfn *p_hwfn,
 
 	memset(p_buffer, 0, buffer_len);
 
-	rc = qed_mcp_get_nvm_image_att(p_hwfn, p_ptt, image_id, &image_att);
+	rc = qed_mcp_get_nvm_image_att(p_hwfn, image_id, &image_att);
 	if (rc)
 		return rc;
 
diff --git a/drivers/net/ethernet/qlogic/qed/qed_mcp.h b/drivers/net/ethernet/qlogic/qed/qed_mcp.h
index 8a5c988..dd62c38 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_mcp.h
+++ b/drivers/net/ethernet/qlogic/qed/qed_mcp.h
@@ -486,7 +486,6 @@ struct qed_nvm_image_att {
  * @brief Allows reading a whole nvram image
  *
  * @param p_hwfn
- * @param p_ptt
  * @param image_id - image requested for reading
  * @param p_buffer - allocated buffer into which to fill data
  * @param buffer_len - length of the allocated buffer.
@@ -494,7 +493,6 @@ struct qed_nvm_image_att {
  * @return 0 iff p_buffer now contains the nvram image.
  */
 int qed_mcp_get_nvm_image(struct qed_hwfn *p_hwfn,
-			  struct qed_ptt *p_ptt,
 			  enum qed_nvm_images image_id,
 			  u8 *p_buffer, u32 buffer_len);
 
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next 0/2] Add configuration information to register dump and debug data
From: Denis Bolotin @ 2018-04-23 11:56 UTC (permalink / raw)
  To: davem; +Cc: netdev, ariel.elior, Denis Bolotin

The purpose of this patchset is to add configuration information to the
debug data collection, which already contains register dump.
The first patch (removing the ptt) is essential because it prevents the
unnecessary ptt acquirement when calling mcp APIs.

Denis Bolotin (2):
  qed: Delete unused parameter p_ptt from mcp APIs
  qed: Add configuration information to register dump and debug data

 drivers/net/ethernet/qlogic/qed/qed_debug.c | 113 +++++++++++++++++++++++++++-
 drivers/net/ethernet/qlogic/qed/qed_main.c  |   9 +--
 drivers/net/ethernet/qlogic/qed/qed_mcp.c   |  18 +++--
 drivers/net/ethernet/qlogic/qed/qed_mcp.h   |  16 +++-
 include/linux/qed/qed_if.h                  |   3 +
 5 files changed, 141 insertions(+), 18 deletions(-)

-- 
1.8.3.1

^ permalink raw reply

* Re: kTLS in combination with mlx4 is very unstable
From: Andre Tomt @ 2018-04-23 11:44 UTC (permalink / raw)
  To: netdev; +Cc: davejwatson, Tariq Toukan
In-Reply-To: <bfb1e726-24ea-89e9-61e7-8e43e82cd23b@tomt.net>

On 22. april 2018 23:21, Andre Tomt wrote:
> Hello!
> 
> kTLS looks fun, so I decided to play with it. It is quite spiffy - 
> however with mlx4 I get kernel crashes I'm not seeing when testing on 
> ixgbe.
> 
> For testing I'm using a git build of the "stream reflector" cubemap[1] 
> configured with kTLS and 8 worker threads running on 4 physical cores, 
> loading it up with a ~13Mbps MPEG-TS stream pulled from satelite TV.
> 
> The kernel seems to get increasingly unstable as I load it up with 
> client connections. At about 9Gbps and 700 connections, it is okay at 
> least for a while - it might run fine for say 45 minutes. Once it gets 
> to 20 - 30Gbps, the kernel will usually start spewing OOPSes within 
> minutes and the traffic drops.
> 
> Some bad interaction between mlx4 and kTLS?
> 
> Hardware is a quad core Xeon-D 1520 using a dual port Mellanox 
> ConnectX-3 VPI with a single 40Gbps ethernet link configured. Mellanox 
> mlx4 driver settings are kernel.org upstream defaults. Interface is 
> configured with FQ qdisc and sockets are using BBR congestion control.
> 
> Tested on kernel 4.14.34, 4.15.17, and 4.16.2 - 4.16.3.
> 
> [1] https://git.sesse.net/?p=cubemap
> 
> First OOPS (from 4.16.3)

It also blows up with a similar trace on 4.17-rc2.

^ permalink raw reply

* [iptables 2/2] extensions: libip6t_srh: add test-cases for matching previous, next and last SID
From: Ahmed Abdelsalam @ 2018-04-23 10:48 UTC (permalink / raw)
  To: pablo, fw, davem, dav.lebrun, linux-kernel, netfilter-devel,
	coreteam, netdev
  Cc: Ahmed Abdelsalam
In-Reply-To: <1524480503-1883-1-git-send-email-amsalam20@gmail.com>

This patch adds some test-cases to "libip6t_srh.t" for matching previous SID,
next SID, and last SID.

Signed-off-by: Ahmed Abdelsalam <amsalam20@gmail.com>
---
 extensions/libip6t_srh.t | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/extensions/libip6t_srh.t b/extensions/libip6t_srh.t
index 08897d5..88a379e 100644
--- a/extensions/libip6t_srh.t
+++ b/extensions/libip6t_srh.t
@@ -23,4 +23,8 @@
 -m srh ! --srh-tag 0;=;OK
 -m srh --srh-next-hdr 17 --srh-segs-left-eq 1 --srh-last-entry-eq 4 --srh-tag 0;=;OK
 -m srh ! --srh-next-hdr 17 ! --srh-segs-left-eq 0 --srh-tag 0;=;OK
+-m srh --srh-psid A::2/64 --srh-nsid B2::/128 --srh-lsid C::/0;=;OK
+-m srh ! --srh-psid A::2/64 ! --srh-nsid B2::/128 ! --srh-lsid C::/0;=;OK
+-m srh --srh-psid A::2 --srh-nsid B2:: --srh-lsid C::;=;OK
+-m srh ! --srh-psid A::2 ! --srh-nsid B2:: ! --srh-lsid C::;=;OK
 -m srh;=;OK
-- 
2.1.4

^ permalink raw reply related

* [nf-next] netfilter: extend SRH match to support matching previous, next and last SID
From: Ahmed Abdelsalam @ 2018-04-23 10:48 UTC (permalink / raw)
  To: pablo, fw, davem, dav.lebrun, linux-kernel, netfilter-devel,
	coreteam, netdev
  Cc: Ahmed Abdelsalam
In-Reply-To: <1524480503-1883-1-git-send-email-amsalam20@gmail.com>

IPv6 Segment Routing Header (SRH) contains a list of SIDs to be crossed by
SR encapsulated packet. Each SID is encoded as an IPv6 prefix.

When a Firewall receives an SR encapsulated packet, it should be able to
identify which node previously processed the packet (previous SID), which
node is going to process the packet next (next SID), and which node is the
last to process the packet (last SID) which represent the final destination
of the packet in case of inline SR mode.

An example use-case of using these features could be SID list that includes
two firewalls. When the second firewall receives a packet, it can check
whether the packet has been processed by the first firewall or not. Based on
that check, it decides to apply all rules, apply just subset of the rules,
or totally skip all rules and forward the packet to the next SID.

This patch extends SRH match to support matching previous SID, next SID, and
last SID.

Signed-off-by: Ahmed Abdelsalam <amsalam20@gmail.com>
---
 include/uapi/linux/netfilter_ipv6/ip6t_srh.h | 22 +++++++++++++--
 net/ipv6/netfilter/ip6t_srh.c                | 41 +++++++++++++++++++++++++++-
 2 files changed, 60 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/netfilter_ipv6/ip6t_srh.h b/include/uapi/linux/netfilter_ipv6/ip6t_srh.h
index f3cc0ef..9808382 100644
--- a/include/uapi/linux/netfilter_ipv6/ip6t_srh.h
+++ b/include/uapi/linux/netfilter_ipv6/ip6t_srh.h
@@ -17,7 +17,10 @@
 #define IP6T_SRH_LAST_GT        0x0100
 #define IP6T_SRH_LAST_LT        0x0200
 #define IP6T_SRH_TAG            0x0400
-#define IP6T_SRH_MASK           0x07FF
+#define IP6T_SRH_PSID           0x0800
+#define IP6T_SRH_NSID           0x1000
+#define IP6T_SRH_LSID           0x2000
+#define IP6T_SRH_MASK           0x3FFF
 
 /* Values for "mt_invflags" field in struct ip6t_srh */
 #define IP6T_SRH_INV_NEXTHDR    0x0001
@@ -31,7 +34,10 @@
 #define IP6T_SRH_INV_LAST_GT    0x0100
 #define IP6T_SRH_INV_LAST_LT    0x0200
 #define IP6T_SRH_INV_TAG        0x0400
-#define IP6T_SRH_INV_MASK       0x07FF
+#define IP6T_SRH_INV_PSID       0x0800
+#define IP6T_SRH_INV_NSID       0x1000
+#define IP6T_SRH_INV_LSID       0x2000
+#define IP6T_SRH_INV_MASK       0x3FFF
 
 /**
  *      struct ip6t_srh - SRH match options
@@ -40,6 +46,12 @@
  *      @ segs_left: Segments left field of SRH
  *      @ last_entry: Last entry field of SRH
  *      @ tag: Tag field of SRH
+ *      @ psid_addr: Address of previous SID in SRH SID list
+ *      @ nsid_addr: Address of NEXT SID in SRH SID list
+ *      @ lsid_addr: Address of LAST SID in SRH SID list
+ *      @ psid_msk: Mask of previous SID in SRH SID list
+ *      @ nsid_msk: Mask of next SID in SRH SID list
+ *      @ lsid_msk: MAsk of last SID in SRH SID list
  *      @ mt_flags: match options
  *      @ mt_invflags: Invert the sense of match options
  */
@@ -50,6 +62,12 @@ struct ip6t_srh {
 	__u8                    segs_left;
 	__u8                    last_entry;
 	__u16                   tag;
+	struct in6_addr		psid_addr;
+	struct in6_addr		nsid_addr;
+	struct in6_addr		lsid_addr;
+	struct in6_addr		psid_msk;
+	struct in6_addr		nsid_msk;
+	struct in6_addr		lsid_msk;
 	__u16                   mt_flags;
 	__u16                   mt_invflags;
 };
diff --git a/net/ipv6/netfilter/ip6t_srh.c b/net/ipv6/netfilter/ip6t_srh.c
index 33719d5..2b5cc73 100644
--- a/net/ipv6/netfilter/ip6t_srh.c
+++ b/net/ipv6/netfilter/ip6t_srh.c
@@ -30,7 +30,9 @@ static bool srh_mt6(const struct sk_buff *skb, struct xt_action_param *par)
 	const struct ip6t_srh *srhinfo = par->matchinfo;
 	struct ipv6_sr_hdr *srh;
 	struct ipv6_sr_hdr _srh;
-	int hdrlen, srhoff = 0;
+	int hdrlen, psidoff, nsidoff, lsidoff, srhoff = 0;
+	struct in6_addr *psid, *nsid, *lsid;
+	struct in6_addr _psid, _nsid, _lsid;
 
 	if (ipv6_find_hdr(skb, &srhoff, IPPROTO_ROUTING, NULL, NULL) < 0)
 		return false;
@@ -114,6 +116,43 @@ static bool srh_mt6(const struct sk_buff *skb, struct xt_action_param *par)
 		if (NF_SRH_INVF(srhinfo, IP6T_SRH_INV_TAG,
 				!(srh->tag == srhinfo->tag)))
 			return false;
+
+	/* Previous SID matching */
+	if (srhinfo->mt_flags & IP6T_SRH_PSID) {
+		if (srh->segments_left == srh->first_segment)
+			return false;
+		psidoff = srhoff + sizeof(struct ipv6_sr_hdr) +
+			  ((srh->segments_left + 1) * sizeof(struct in6_addr));
+		psid = skb_header_pointer(skb, psidoff, sizeof(_psid), &_psid);
+		if (NF_SRH_INVF(srhinfo, IP6T_SRH_INV_PSID,
+				ipv6_masked_addr_cmp(psid, &srhinfo->psid_msk,
+						     &srhinfo->psid_addr)))
+			return false;
+	}
+
+	/* Next SID matching */
+	if (srhinfo->mt_flags & IP6T_SRH_NSID) {
+		if (srh->segments_left == 0)
+			return false;
+		nsidoff = srhoff + sizeof(struct ipv6_sr_hdr) +
+			  ((srh->segments_left - 1) * sizeof(struct in6_addr));
+		nsid = skb_header_pointer(skb, nsidoff, sizeof(_nsid), &_nsid);
+		if (NF_SRH_INVF(srhinfo, IP6T_SRH_INV_NSID,
+				ipv6_masked_addr_cmp(nsid, &srhinfo->nsid_msk,
+						     &srhinfo->nsid_addr)))
+			return false;
+	}
+
+	/* Last SID matching */
+	if (srhinfo->mt_flags & IP6T_SRH_LSID) {
+		lsidoff = srhoff + sizeof(struct ipv6_sr_hdr);
+		lsid = skb_header_pointer(skb, lsidoff, sizeof(_lsid), &_lsid);
+		if (NF_SRH_INVF(srhinfo, IP6T_SRH_INV_LSID,
+				ipv6_masked_addr_cmp(lsid, &srhinfo->lsid_msk,
+						     &srhinfo->lsid_addr)))
+			return false;
+	}
+
 	return true;
 }
 
-- 
2.1.4

^ permalink raw reply related

* [iptables 1/2] extensions: libip6t_srh: support matching previous, next and last SID
From: Ahmed Abdelsalam @ 2018-04-23 10:48 UTC (permalink / raw)
  To: pablo, fw, davem, dav.lebrun, linux-kernel, netfilter-devel,
	coreteam, netdev
  Cc: Ahmed Abdelsalam

This patch extends the libip6t_srh shared library to support matching
previous SID, next SID, and last SID.

Signed-off-by: Ahmed Abdelsalam <amsalam20@gmail.com>
---
 extensions/libip6t_srh.c                | 65 ++++++++++++++++++++++++++++++++-
 include/linux/netfilter_ipv6/ip6t_srh.h | 22 ++++++++++-
 2 files changed, 84 insertions(+), 3 deletions(-)

diff --git a/extensions/libip6t_srh.c b/extensions/libip6t_srh.c
index ac0ae08..5acc2ee 100644
--- a/extensions/libip6t_srh.c
+++ b/extensions/libip6t_srh.c
@@ -22,6 +22,9 @@ enum {
 	O_SRH_LAST_GT,
 	O_SRH_LAST_LT,
 	O_SRH_TAG,
+	O_SRH_PSID,
+	O_SRH_NSID,
+	O_SRH_LSID,
 };
 
 static void srh_help(void)
@@ -38,7 +41,10 @@ static void srh_help(void)
 "[!] --srh-last-entry-eq 	last_entry      Last Entry value of SRH\n"
 "[!] --srh-last-entry-gt 	last_entry      Last Entry value of SRH\n"
 "[!] --srh-last-entry-lt 	last_entry      Last Entry value of SRH\n"
-"[!] --srh-tag			tag             Tag value of SRH\n");
+"[!] --srh-tag			tag             Tag value of SRH\n"
+"[!] --srh-psid			addr[/mask]	SRH previous SID\n"
+"[!] --srh-nsid			addr[/mask]	SRH next SID\n"
+"[!] --srh-lsid			addr[/mask]	SRH Last SID\n");
 }
 
 #define s struct ip6t_srh
@@ -65,6 +71,12 @@ static const struct xt_option_entry srh_opts[] = {
 	.flags = XTOPT_INVERT | XTOPT_PUT, XTOPT_POINTER(s, last_entry)},
 	{ .name = "srh-tag", .id = O_SRH_TAG, .type = XTTYPE_UINT16,
 	.flags = XTOPT_INVERT | XTOPT_PUT, XTOPT_POINTER(s, tag)},
+	{ .name = "srh-psid", .id = O_SRH_PSID, .type = XTTYPE_HOSTMASK,
+	.flags = XTOPT_INVERT},
+	{ .name = "srh-nsid", .id = O_SRH_NSID, .type = XTTYPE_HOSTMASK,
+	.flags = XTOPT_INVERT},
+	{ .name = "srh-lsid", .id = O_SRH_LSID, .type = XTTYPE_HOSTMASK,
+	.flags = XTOPT_INVERT},
 	{ }
 };
 #undef s
@@ -75,6 +87,12 @@ static void srh_init(struct xt_entry_match *m)
 
 	srhinfo->mt_flags = 0;
 	srhinfo->mt_invflags = 0;
+	memset(srhinfo->psid_addr.s6_addr, 0, sizeof(srhinfo->psid_addr.s6_addr));
+	memset(srhinfo->nsid_addr.s6_addr, 0, sizeof(srhinfo->nsid_addr.s6_addr));
+	memset(srhinfo->lsid_addr.s6_addr, 0, sizeof(srhinfo->lsid_addr.s6_addr));
+	memset(srhinfo->psid_msk.s6_addr, 0, sizeof(srhinfo->psid_msk.s6_addr));
+	memset(srhinfo->nsid_msk.s6_addr, 0, sizeof(srhinfo->nsid_msk.s6_addr));
+	memset(srhinfo->lsid_msk.s6_addr, 0, sizeof(srhinfo->lsid_msk.s6_addr));
 }
 
 static void srh_parse(struct xt_option_call *cb)
@@ -138,6 +156,27 @@ static void srh_parse(struct xt_option_call *cb)
 		if (cb->invert)
 			srhinfo->mt_invflags |= IP6T_SRH_INV_TAG;
 		break;
+	case O_SRH_PSID:
+		srhinfo->mt_flags |= IP6T_SRH_PSID;
+		srhinfo->psid_addr = cb->val.haddr.in6;
+		srhinfo->psid_msk  = cb->val.hmask.in6;
+		if (cb->invert)
+			srhinfo->mt_invflags |= IP6T_SRH_INV_PSID;
+		break;
+	case O_SRH_NSID:
+		srhinfo->mt_flags |= IP6T_SRH_NSID;
+		srhinfo->nsid_addr = cb->val.haddr.in6;
+		srhinfo->nsid_msk  = cb->val.hmask.in6;
+		if (cb->invert)
+			srhinfo->mt_invflags |= IP6T_SRH_INV_NSID;
+		break;
+	case O_SRH_LSID:
+		srhinfo->mt_flags |= IP6T_SRH_LSID;
+		srhinfo->lsid_addr = cb->val.haddr.in6;
+		srhinfo->lsid_msk  = cb->val.hmask.in6;
+		if (cb->invert)
+			srhinfo->mt_invflags |= IP6T_SRH_INV_LSID;
+		break;
 	}
 }
 
@@ -180,6 +219,18 @@ static void srh_print(const void *ip, const struct xt_entry_match *match,
 	if (srhinfo->mt_flags & IP6T_SRH_TAG)
 		printf(" tag:%s%d", srhinfo->mt_invflags & IP6T_SRH_INV_TAG ? "!" : "",
 			srhinfo->tag);
+	if (srhinfo->mt_flags & IP6T_SRH_PSID)
+		printf(" psid %s %s/%u", srhinfo->mt_invflags & IP6T_SRH_INV_PSID ? "!" : "",
+			xtables_ip6addr_to_numeric(&srhinfo->psid_addr),
+			xtables_ip6mask_to_cidr(&srhinfo->psid_msk));
+	if (srhinfo->mt_flags & IP6T_SRH_NSID)
+		printf(" nsid %s %s/%u", srhinfo->mt_invflags & IP6T_SRH_INV_NSID ? "!" : "",
+			xtables_ip6addr_to_numeric(&srhinfo->nsid_addr),
+			xtables_ip6mask_to_cidr(&srhinfo->nsid_msk));
+	if (srhinfo->mt_flags & IP6T_SRH_LSID)
+		printf(" lsid %s %s/%u", srhinfo->mt_invflags & IP6T_SRH_INV_LSID ? "!" : "",
+			xtables_ip6addr_to_numeric(&srhinfo->lsid_addr),
+			xtables_ip6mask_to_cidr(&srhinfo->lsid_msk));
 }
 
 static void srh_save(const void *ip, const struct xt_entry_match *match)
@@ -219,6 +270,18 @@ static void srh_save(const void *ip, const struct xt_entry_match *match)
 	if (srhinfo->mt_flags & IP6T_SRH_TAG)
 		printf("%s --srh-tag %u", (srhinfo->mt_invflags & IP6T_SRH_INV_TAG) ? " !" : "",
 			srhinfo->tag);
+	if (srhinfo->mt_flags & IP6T_SRH_PSID)
+		printf("%s --srh-psid %s/%u", srhinfo->mt_invflags & IP6T_SRH_INV_PSID ? " !" : "",
+			xtables_ip6addr_to_numeric(&srhinfo->psid_addr),
+			xtables_ip6mask_to_cidr(&srhinfo->psid_msk));
+	if (srhinfo->mt_flags & IP6T_SRH_NSID)
+		printf("%s --srh-nsid %s/%u", srhinfo->mt_invflags & IP6T_SRH_INV_NSID ? " !" : "",
+			xtables_ip6addr_to_numeric(&srhinfo->nsid_addr),
+			xtables_ip6mask_to_cidr(&srhinfo->nsid_msk));
+	if (srhinfo->mt_flags & IP6T_SRH_LSID)
+		printf("%s --srh-lsid %s/%u", srhinfo->mt_invflags & IP6T_SRH_INV_LSID ? " !" : "",
+			xtables_ip6addr_to_numeric(&srhinfo->lsid_addr),
+			xtables_ip6mask_to_cidr(&srhinfo->lsid_msk));
 }
 
 static struct xtables_match srh_mt6_reg = {
diff --git a/include/linux/netfilter_ipv6/ip6t_srh.h b/include/linux/netfilter_ipv6/ip6t_srh.h
index 087efa1..3d77241 100644
--- a/include/linux/netfilter_ipv6/ip6t_srh.h
+++ b/include/linux/netfilter_ipv6/ip6t_srh.h
@@ -16,7 +16,10 @@
 #define IP6T_SRH_LAST_GT        0x0100
 #define IP6T_SRH_LAST_LT        0x0200
 #define IP6T_SRH_TAG            0x0400
-#define IP6T_SRH_MASK           0x07FF
+#define IP6T_SRH_PSID           0x0800
+#define IP6T_SRH_NSID           0x1000
+#define IP6T_SRH_LSID           0x2000
+#define IP6T_SRH_MASK           0x3FFF
 
 /* Values for "mt_invflags" field in struct ip6t_srh */
 #define IP6T_SRH_INV_NEXTHDR    0x0001
@@ -30,7 +33,10 @@
 #define IP6T_SRH_INV_LAST_GT    0x0100
 #define IP6T_SRH_INV_LAST_LT    0x0200
 #define IP6T_SRH_INV_TAG        0x0400
-#define IP6T_SRH_INV_MASK       0x07FF
+#define IP6T_SRH_INV_PSID       0x0800
+#define IP6T_SRH_INV_NSID       0x1000
+#define IP6T_SRH_INV_LSID       0x2000
+#define IP6T_SRH_INV_MASK       0x3FFF
 
 /**
  *      struct ip6t_srh - SRH match options
@@ -39,6 +45,12 @@
  *      @ segs_left: Segments left field of SRH
  *      @ last_entry: Last entry field of SRH
  *      @ tag: Tag field of SRH
+ *      @ psid_addr: Address of previous SID in SRH SID list
+ *      @ nsid_addr: Address of NEXT SID in SRH SID list
+ *      @ lsid_addr: Address of LAST SID in SRH SID list
+ *      @ psid_msk: Mask of previous SID in SRH SID list
+ *      @ nsid_msk: Mask of next SID in SRH SID list
+ *      @ lsid_msk: MAsk of last SID in SRH SID list
  *      @ mt_flags: match options
  *      @ mt_invflags: Invert the sense of match options
  */
@@ -49,6 +61,12 @@ struct ip6t_srh {
 	__u8                    segs_left;
 	__u8                    last_entry;
 	__u16                   tag;
+	struct in6_addr		psid_addr;
+	struct in6_addr		nsid_addr;
+	struct in6_addr		lsid_addr;
+	struct in6_addr		psid_msk;
+	struct in6_addr		nsid_msk;
+	struct in6_addr		lsid_msk;
 	__u16                   mt_flags;
 	__u16                   mt_invflags;
 };
-- 
2.1.4

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox