* Make CONFIG_NET and CONFIG_SECCOMP_FILTER independent of CONFIG_NET
From: Norbert Manthey @ 2018-06-06 13:52 UTC (permalink / raw)
To: linux-kernel, kvm, netdev, x86
Dear all,
currently, KVM and SECCOMP rely on functionality of CONFIG_NET, and hence the
latter has to be enabled when building the kernel for the first two
configurations. However, there exists scenarios where the system does not need
networking, but KVM and SECCOMP filters. To reduce the kernel image size for
these scenarios, and to be able to drop active code, this commit series allows
to enable CONFIG_KVM and CONFIG_SECCOMP_FILTER without using CONFIG_NET.
The functionality that is required for seccomp filters is kept in the same
files and - after reordering the source code - is guarded with a single ifdef
per file.
I hope these changes are useful for other scenarios than the one I currently
face.
Best,
Norbert
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B
^ permalink raw reply
* Re: [PATCH net] kcm: fix races on sk_receive_queue
From: Paolo Abeni @ 2018-06-06 13:46 UTC (permalink / raw)
To: Kirill Tkhai, netdev; +Cc: David S. Miller, Tom Herbert
In-Reply-To: <f79f8b6f-c137-1eb7-1161-2264c270c025@virtuozzo.com>
On Wed, 2018-06-06 at 16:28 +0300, Kirill Tkhai wrote:
> On 06.06.2018 16:16, Paolo Abeni wrote:
> > KCM removes the packets from sk_receive_queue in requeue_rx_msgs()
> >
> > without acquiring any lock. Moreover, in R() when the MSG_PEEK
> > flag is not present, the skb is peeked and dequeued with two
> > separate, non-atomic, calls.
> >
> > The above create room for races, which SYZBOT has been able to
> > exploit, causing list corruption and kernel oops:
> >
> > kasan: CONFIG_KASAN_INLINE enabled
> > kasan: GPF could be caused by NULL-ptr deref or user memory access
> > general protection fault: 0000 [#1] SMP KASAN
> > Dumping ftrace buffer:
> > (ftrace buffer empty)
> > Modules linked in:
> > CPU: 0 PID: 8484 Comm: syz-executor919 Not tainted 4.17.0-rc7+ #74
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> > Google 01/01/2011
> > RIP: 0010:__skb_unlink include/linux/skbuff.h:1844 [inline]
> > RIP: 0010:skb_unlink+0xc1/0x160 net/core/skbuff.c:2921
> > RSP: 0018:ffff8801d012f6f0 EFLAGS: 00010002
> > RAX: 0000000000000286 RBX: ffff8801d6e073c0 RCX: 0000000000000001
> > RDX: dffffc0000000000 RSI: 0000000000000004 RDI: 0000000000000008
> > RBP: ffff8801d012f718 R08: ffffed0038bb3b6d R09: ffffed0038bb3b6c
> > R10: ffffed0038bb3b6c R11: ffff8801c5d9db63 R12: 0000000000000000
> > R13: 0000000000000000 R14: ffff8801c5d9db60 R15: ffff8801d012fce0
> > FS: 0000000000ab7880(0000) GS:ffff8801dae00000(0000) knlGS:0000000000000000
> > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 0000000020e5b000 CR3: 00000001c31fb000 CR4: 00000000001406f0
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > Call Trace:
> > kcm_recvmsg+0x48d/0x590 net/kcm/kcmsock.c:1160
> > sock_recvmsg_nosec+0x8c/0xb0 net/socket.c:802
> > ___sys_recvmsg+0x2b6/0x680 net/socket.c:2279
> > __sys_recvmmsg+0x2f9/0xb80 net/socket.c:2391
> > do_sys_recvmmsg+0xe4/0x190 net/socket.c:2472
> > __do_sys_recvmmsg net/socket.c:2485 [inline]
> > __se_sys_recvmmsg net/socket.c:2481 [inline]
> > __x64_sys_recvmmsg+0xbe/0x150 net/socket.c:2481
> > do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
> > entry_SYSCALL_64_after_hwframe+0x49/0xbe
> > RIP: 0033:0x4417a9
> > RSP: 002b:00007ffe27282838 EFLAGS: 00000206 ORIG_RAX: 000000000000012b
> > RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00000000004417a9
> > RDX: 00000000040000f7 RSI: 00000000200002c0 RDI: 0000000000000006
> > RBP: 0000000000000000 R08: 0000000020000200 R09: 00007ffe272829f8
> > R10: 0000000000000060 R11: 0000000000000206 R12: 00000000000001f3
> > R13: 000000000001f871 R14: 0000000000000000 R15: 0000000000000000
> > Code: 00 00 00 49 8d 7d 08 4c 8b 63 08 48 ba 00 00 00 00 00 fc ff df 48 c7
> > 43 08 00 00 00 00 48 89 f9 48 c7 03 00 00 00 00 48 c1 e9 03 <80> 3c 11 00
> > 75 5b 4c 89 e1 4d 89 65 08 48 ba 00 00 00 00 00 fc
> > RIP: __skb_unlink include/linux/skbuff.h:1844 [inline] RSP: ffff8801d012f6f0
> > RIP: skb_unlink+0xc1/0x160 net/core/skbuff.c:2921 RSP: ffff8801d012f6f0
> >
> > To fix the above, we need to use the locked version of the socket dequeue
> > helper in requeue_rx_msgs() and kcm_wait_data is changed to dequeue
> > the available skb when not peeking.
> >
> > RFC -> v1:
> > - use skb_dequeue(), as suggested by Tom
> > - explicitly close the race between skb_peek and skb_unlink
> >
> > Fixes: ab7ac4eb9832 ("kcm: Kernel Connection Multiplexor module")
> > Reported-and-tested-by: syzbot+278279efdd2730dd14bf@syzkaller.appspotmail.com
> > Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> > ---
> > This is an RFC, since I'm really new to this area, anyway the syzport
> > reported success in testing the proposed fix.
> > This is very likely a scenario where the upcoming skb->prev,next -> list_head
> > conversion would have helped a lot, thanks to list poisoning and list debug
> > ---
> > net/kcm/kcmsock.c | 19 ++++++++++++-------
> > 1 file changed, 12 insertions(+), 7 deletions(-)
> >
> > diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c
> > index d3601d421571..dd2d02bb35ae 100644
> > --- a/net/kcm/kcmsock.c
> > +++ b/net/kcm/kcmsock.c
> > @@ -223,7 +223,7 @@ static void requeue_rx_msgs(struct kcm_mux *mux, struct sk_buff_head *head)
> > struct sk_buff *skb;
> > struct kcm_sock *kcm;
> >
> > - while ((skb = __skb_dequeue(head))) {
> > + while ((skb = skb_dequeue(head))) {
>
> I try to find how the patch protects against the following race:
>
> requeue_rx_msgs() kcm_recvmsg()
> skb = skb_dequeue() skb = kcm_wait_data(peek = true)
> ... ...
> free skb ...
> ... skb_copy_datagram_msg(skb) <--- Use after free?
>
> Isn't there possible a use-after-free?
You are right, this patch does not fix the above race: is addressing a
different one, when recvmsg() is not peeking.
The race itself is not introduced by this code, and I think a separate
patch for the the above would be better (we probably need to increment
the skb reference count while peeking and consume the skb after the
copy)
Cheers,
Paolo
^ permalink raw reply
* Re: [PATCH] netfilter: provide udp*_lib_lookup for nf_tproxy
From: Pablo Neira Ayuso @ 2018-06-06 13:33 UTC (permalink / raw)
To: David Miller
Cc: arnd, kuznet, yoshfuji, ecklm94, pabeni, willemb, edumazet,
dsahern, kafai, netdev, linux-kernel
In-Reply-To: <20180605.105453.339908802413146875.davem@davemloft.net>
On Tue, Jun 05, 2018 at 10:54:53AM -0400, David Miller wrote:
> From: Arnd Bergmann <arnd@arndb.de>
> Date: Tue, 5 Jun 2018 13:40:34 +0200
>
> > It is now possible to enable the libified nf_tproxy modules without
> > also enabling NETFILTER_XT_TARGET_TPROXY, which throws off the
> > ifdef logic in the udp core code:
> >
> > net/ipv6/netfilter/nf_tproxy_ipv6.o: In function `nf_tproxy_get_sock_v6':
> > nf_tproxy_ipv6.c:(.text+0x1a8): undefined reference to `udp6_lib_lookup'
> > net/ipv4/netfilter/nf_tproxy_ipv4.o: In function `nf_tproxy_get_sock_v4':
> > nf_tproxy_ipv4.c:(.text+0x3d0): undefined reference to `udp4_lib_lookup'
> >
> > We can actually simplify the conditions now to provide the two functions
> > exactly when they are needed.
> >
> > Fixes: 45ca4e0cf273 ("netfilter: Libify xt_TPROXY")
> > Signed-off-by: Arnd Bergmann <arnd@arndb.de>
>
> Pablo, I'm going to apply this directly to fix the link failure.
BTW, could you also pick this fix for net-next?
http://patchwork.ozlabs.org/patch/924706/
Thanks.
^ permalink raw reply
* WARNING: refcount bug in smc_tcp_listen_work
From: syzbot @ 2018-06-06 13:32 UTC (permalink / raw)
To: davem, linux-kernel, linux-s390, netdev, syzkaller-bugs, ubraun
Hello,
syzbot found the following crash on:
HEAD commit: 75d4e704fa8d netdev-FAQ: clarify DaveM's position for stab..
git tree: net-next
console output: https://syzkaller.appspot.com/x/log.txt?x=1632a6f7800000
kernel config: https://syzkaller.appspot.com/x/.config?x=7b76ec81cd4aad4b
dashboard link: https://syzkaller.appspot.com/bug?extid=9e60d2428a42049a592a
compiler: gcc (GCC) 8.0.1 20180413 (experimental)
Unfortunately, I don't have any reproducer for this crash yet.
IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+9e60d2428a42049a592a@syzkaller.appspotmail.com
A link change request failed with some changes committed already. Interface
bridge0 may have been left with an inconsistent configuration, please check.
------------[ cut here ]------------
refcount_t: underflow; use-after-free.
WARNING: CPU: 0 PID: 23870 at lib/refcount.c:187
refcount_sub_and_test+0x2d3/0x330 lib/refcount.c:187
Kernel panic - not syncing: panic_on_warn set ...
CPU: 0 PID: 23870 Comm: kworker/0:4 Not tainted 4.17.0-rc7+ #79
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Workqueue: events smc_tcp_listen_work
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x1b9/0x294 lib/dump_stack.c:113
panic+0x22f/0x4de kernel/panic.c:184
__warn.cold.8+0x163/0x1b3 kernel/panic.c:536
report_bug+0x252/0x2d0 lib/bug.c:186
fixup_bug arch/x86/kernel/traps.c:178 [inline]
do_error_trap+0x1de/0x490 arch/x86/kernel/traps.c:296
do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:992
RIP: 0010:refcount_sub_and_test+0x2d3/0x330 lib/refcount.c:187
RSP: 0018:ffff880194aaf4f8 EFLAGS: 00010286
RAX: 0000000000000026 RBX: 0000000000000000 RCX: ffffffff8160b8ad
RDX: 0000000000000000 RSI: ffffffff81610561 RDI: ffff880194aaf058
RBP: ffff880194aaf5e0 R08: ffff88014148e500 R09: 0000000000000006
R10: ffff88014148e500 R11: 0000000000000000 R12: 00000000ffffffff
R13: ffff880194aaf5b8 R14: 0000000000000001 R15: ffff880194aaf728
refcount_dec_and_test+0x1a/0x20 lib/refcount.c:212
sock_put include/net/sock.h:1668 [inline]
smc_tcp_listen_work+0xb94/0xec0 net/smc/af_smc.c:1073
process_one_work+0xc1e/0x1b50 kernel/workqueue.c:2145
worker_thread+0x1cc/0x1440 kernel/workqueue.c:2279
kthread+0x345/0x410 kernel/kthread.c:240
ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412
Dumping ftrace buffer:
(ftrace buffer empty)
Kernel Offset: disabled
Rebooting in 86400 seconds..
---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.
syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#bug-status-tracking for how to communicate with
syzbot.
^ permalink raw reply
* Re: [PATCH net] kcm: fix races on sk_receive_queue
From: Kirill Tkhai @ 2018-06-06 13:28 UTC (permalink / raw)
To: Paolo Abeni, netdev; +Cc: David S. Miller, Tom Herbert
In-Reply-To: <628e0398546aefabd68669450621909d269e1ba8.1528289745.git.pabeni@redhat.com>
On 06.06.2018 16:16, Paolo Abeni wrote:
> KCM removes the packets from sk_receive_queue in requeue_rx_msgs()
>
> without acquiring any lock. Moreover, in R() when the MSG_PEEK
> flag is not present, the skb is peeked and dequeued with two
> separate, non-atomic, calls.
>
> The above create room for races, which SYZBOT has been able to
> exploit, causing list corruption and kernel oops:
>
> kasan: CONFIG_KASAN_INLINE enabled
> kasan: GPF could be caused by NULL-ptr deref or user memory access
> general protection fault: 0000 [#1] SMP KASAN
> Dumping ftrace buffer:
> (ftrace buffer empty)
> Modules linked in:
> CPU: 0 PID: 8484 Comm: syz-executor919 Not tainted 4.17.0-rc7+ #74
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> RIP: 0010:__skb_unlink include/linux/skbuff.h:1844 [inline]
> RIP: 0010:skb_unlink+0xc1/0x160 net/core/skbuff.c:2921
> RSP: 0018:ffff8801d012f6f0 EFLAGS: 00010002
> RAX: 0000000000000286 RBX: ffff8801d6e073c0 RCX: 0000000000000001
> RDX: dffffc0000000000 RSI: 0000000000000004 RDI: 0000000000000008
> RBP: ffff8801d012f718 R08: ffffed0038bb3b6d R09: ffffed0038bb3b6c
> R10: ffffed0038bb3b6c R11: ffff8801c5d9db63 R12: 0000000000000000
> R13: 0000000000000000 R14: ffff8801c5d9db60 R15: ffff8801d012fce0
> FS: 0000000000ab7880(0000) GS:ffff8801dae00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000020e5b000 CR3: 00000001c31fb000 CR4: 00000000001406f0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
> kcm_recvmsg+0x48d/0x590 net/kcm/kcmsock.c:1160
> sock_recvmsg_nosec+0x8c/0xb0 net/socket.c:802
> ___sys_recvmsg+0x2b6/0x680 net/socket.c:2279
> __sys_recvmmsg+0x2f9/0xb80 net/socket.c:2391
> do_sys_recvmmsg+0xe4/0x190 net/socket.c:2472
> __do_sys_recvmmsg net/socket.c:2485 [inline]
> __se_sys_recvmmsg net/socket.c:2481 [inline]
> __x64_sys_recvmmsg+0xbe/0x150 net/socket.c:2481
> do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
> entry_SYSCALL_64_after_hwframe+0x49/0xbe
> RIP: 0033:0x4417a9
> RSP: 002b:00007ffe27282838 EFLAGS: 00000206 ORIG_RAX: 000000000000012b
> RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00000000004417a9
> RDX: 00000000040000f7 RSI: 00000000200002c0 RDI: 0000000000000006
> RBP: 0000000000000000 R08: 0000000020000200 R09: 00007ffe272829f8
> R10: 0000000000000060 R11: 0000000000000206 R12: 00000000000001f3
> R13: 000000000001f871 R14: 0000000000000000 R15: 0000000000000000
> Code: 00 00 00 49 8d 7d 08 4c 8b 63 08 48 ba 00 00 00 00 00 fc ff df 48 c7
> 43 08 00 00 00 00 48 89 f9 48 c7 03 00 00 00 00 48 c1 e9 03 <80> 3c 11 00
> 75 5b 4c 89 e1 4d 89 65 08 48 ba 00 00 00 00 00 fc
> RIP: __skb_unlink include/linux/skbuff.h:1844 [inline] RSP: ffff8801d012f6f0
> RIP: skb_unlink+0xc1/0x160 net/core/skbuff.c:2921 RSP: ffff8801d012f6f0
>
> To fix the above, we need to use the locked version of the socket dequeue
> helper in requeue_rx_msgs() and kcm_wait_data is changed to dequeue
> the available skb when not peeking.
>
> RFC -> v1:
> - use skb_dequeue(), as suggested by Tom
> - explicitly close the race between skb_peek and skb_unlink
>
> Fixes: ab7ac4eb9832 ("kcm: Kernel Connection Multiplexor module")
> Reported-and-tested-by: syzbot+278279efdd2730dd14bf@syzkaller.appspotmail.com
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
> This is an RFC, since I'm really new to this area, anyway the syzport
> reported success in testing the proposed fix.
> This is very likely a scenario where the upcoming skb->prev,next -> list_head
> conversion would have helped a lot, thanks to list poisoning and list debug
> ---
> net/kcm/kcmsock.c | 19 ++++++++++++-------
> 1 file changed, 12 insertions(+), 7 deletions(-)
>
> diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c
> index d3601d421571..dd2d02bb35ae 100644
> --- a/net/kcm/kcmsock.c
> +++ b/net/kcm/kcmsock.c
> @@ -223,7 +223,7 @@ static void requeue_rx_msgs(struct kcm_mux *mux, struct sk_buff_head *head)
> struct sk_buff *skb;
> struct kcm_sock *kcm;
>
> - while ((skb = __skb_dequeue(head))) {
> + while ((skb = skb_dequeue(head))) {
I try to find how the patch protects against the following race:
requeue_rx_msgs() kcm_recvmsg()
skb = skb_dequeue() skb = kcm_wait_data(peek = true)
... ...
free skb ...
... skb_copy_datagram_msg(skb) <--- Use after free?
Isn't there possible a use-after-free?
Thanks,
Kirill
> /* Reset destructor to avoid calling kcm_rcv_ready */
> skb->destructor = sock_rfree;
> skb_orphan(skb);
> @@ -1080,12 +1080,17 @@ static int kcm_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
> return err;
> }
>
> -static struct sk_buff *kcm_wait_data(struct sock *sk, int flags,
> +static struct sk_buff *kcm_wait_data(struct sock *sk, int flags, bool peek,
> long timeo, int *err)
> {
> struct sk_buff *skb;
>
> - while (!(skb = skb_peek(&sk->sk_receive_queue))) {
> + for (;; ) {
> + skb = peek ? skb_peek(&sk->sk_receive_queue) :
> + skb_dequeue(&sk->sk_receive_queue);
> + if (skb)
> + break;
> +
> if (sk->sk_err) {
> *err = sock_error(sk);
> return NULL;
> @@ -1116,6 +1121,7 @@ static int kcm_recvmsg(struct socket *sock, struct msghdr *msg,
> {
> struct sock *sk = sock->sk;
> struct kcm_sock *kcm = kcm_sk(sk);
> + bool peek = flags & MSG_PEEK;
> int err = 0;
> long timeo;
> struct strp_msg *stm;
> @@ -1126,7 +1132,7 @@ static int kcm_recvmsg(struct socket *sock, struct msghdr *msg,
>
> lock_sock(sk);
>
> - skb = kcm_wait_data(sk, flags, timeo, &err);
> + skb = kcm_wait_data(sk, flags, peek, timeo, &err);
> if (!skb)
> goto out;
>
> @@ -1142,7 +1148,7 @@ static int kcm_recvmsg(struct socket *sock, struct msghdr *msg,
> goto out;
>
> copied = len;
> - if (likely(!(flags & MSG_PEEK))) {
> + if (likely(!peek)) {
> KCM_STATS_ADD(kcm->stats.rx_bytes, copied);
> if (copied < stm->full_len) {
> if (sock->type == SOCK_DGRAM) {
> @@ -1157,7 +1163,6 @@ static int kcm_recvmsg(struct socket *sock, struct msghdr *msg,
> /* Finished with message */
> msg->msg_flags |= MSG_EOR;
> KCM_STATS_INCR(kcm->stats.rx_msgs);
> - skb_unlink(skb, &sk->sk_receive_queue);
> kfree_skb(skb);
> }
> }
> @@ -1186,7 +1191,7 @@ static ssize_t kcm_splice_read(struct socket *sock, loff_t *ppos,
>
> lock_sock(sk);
>
> - skb = kcm_wait_data(sk, flags, timeo, &err);
> + skb = kcm_wait_data(sk, flags, true, timeo, &err);
> if (!skb)
> goto err_out;
>
>
^ permalink raw reply
* [PATCH net] kcm: fix races on sk_receive_queue
From: Paolo Abeni @ 2018-06-06 13:16 UTC (permalink / raw)
To: netdev; +Cc: David S. Miller, Tom Herbert, Kirill Tkhai
KCM removes the packets from sk_receive_queue in requeue_rx_msgs()
without acquiring any lock. Moreover, in R() when the MSG_PEEK
flag is not present, the skb is peeked and dequeued with two
separate, non-atomic, calls.
The above create room for races, which SYZBOT has been able to
exploit, causing list corruption and kernel oops:
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
(ftrace buffer empty)
Modules linked in:
CPU: 0 PID: 8484 Comm: syz-executor919 Not tainted 4.17.0-rc7+ #74
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
RIP: 0010:__skb_unlink include/linux/skbuff.h:1844 [inline]
RIP: 0010:skb_unlink+0xc1/0x160 net/core/skbuff.c:2921
RSP: 0018:ffff8801d012f6f0 EFLAGS: 00010002
RAX: 0000000000000286 RBX: ffff8801d6e073c0 RCX: 0000000000000001
RDX: dffffc0000000000 RSI: 0000000000000004 RDI: 0000000000000008
RBP: ffff8801d012f718 R08: ffffed0038bb3b6d R09: ffffed0038bb3b6c
R10: ffffed0038bb3b6c R11: ffff8801c5d9db63 R12: 0000000000000000
R13: 0000000000000000 R14: ffff8801c5d9db60 R15: ffff8801d012fce0
FS: 0000000000ab7880(0000) GS:ffff8801dae00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020e5b000 CR3: 00000001c31fb000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
kcm_recvmsg+0x48d/0x590 net/kcm/kcmsock.c:1160
sock_recvmsg_nosec+0x8c/0xb0 net/socket.c:802
___sys_recvmsg+0x2b6/0x680 net/socket.c:2279
__sys_recvmmsg+0x2f9/0xb80 net/socket.c:2391
do_sys_recvmmsg+0xe4/0x190 net/socket.c:2472
__do_sys_recvmmsg net/socket.c:2485 [inline]
__se_sys_recvmmsg net/socket.c:2481 [inline]
__x64_sys_recvmmsg+0xbe/0x150 net/socket.c:2481
do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x4417a9
RSP: 002b:00007ffe27282838 EFLAGS: 00000206 ORIG_RAX: 000000000000012b
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00000000004417a9
RDX: 00000000040000f7 RSI: 00000000200002c0 RDI: 0000000000000006
RBP: 0000000000000000 R08: 0000000020000200 R09: 00007ffe272829f8
R10: 0000000000000060 R11: 0000000000000206 R12: 00000000000001f3
R13: 000000000001f871 R14: 0000000000000000 R15: 0000000000000000
Code: 00 00 00 49 8d 7d 08 4c 8b 63 08 48 ba 00 00 00 00 00 fc ff df 48 c7
43 08 00 00 00 00 48 89 f9 48 c7 03 00 00 00 00 48 c1 e9 03 <80> 3c 11 00
75 5b 4c 89 e1 4d 89 65 08 48 ba 00 00 00 00 00 fc
RIP: __skb_unlink include/linux/skbuff.h:1844 [inline] RSP: ffff8801d012f6f0
RIP: skb_unlink+0xc1/0x160 net/core/skbuff.c:2921 RSP: ffff8801d012f6f0
To fix the above, we need to use the locked version of the socket dequeue
helper in requeue_rx_msgs() and kcm_wait_data is changed to dequeue
the available skb when not peeking.
RFC -> v1:
- use skb_dequeue(), as suggested by Tom
- explicitly close the race between skb_peek and skb_unlink
Fixes: ab7ac4eb9832 ("kcm: Kernel Connection Multiplexor module")
Reported-and-tested-by: syzbot+278279efdd2730dd14bf@syzkaller.appspotmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
This is an RFC, since I'm really new to this area, anyway the syzport
reported success in testing the proposed fix.
This is very likely a scenario where the upcoming skb->prev,next -> list_head
conversion would have helped a lot, thanks to list poisoning and list debug
---
net/kcm/kcmsock.c | 19 ++++++++++++-------
1 file changed, 12 insertions(+), 7 deletions(-)
diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c
index d3601d421571..dd2d02bb35ae 100644
--- a/net/kcm/kcmsock.c
+++ b/net/kcm/kcmsock.c
@@ -223,7 +223,7 @@ static void requeue_rx_msgs(struct kcm_mux *mux, struct sk_buff_head *head)
struct sk_buff *skb;
struct kcm_sock *kcm;
- while ((skb = __skb_dequeue(head))) {
+ while ((skb = skb_dequeue(head))) {
/* Reset destructor to avoid calling kcm_rcv_ready */
skb->destructor = sock_rfree;
skb_orphan(skb);
@@ -1080,12 +1080,17 @@ static int kcm_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
return err;
}
-static struct sk_buff *kcm_wait_data(struct sock *sk, int flags,
+static struct sk_buff *kcm_wait_data(struct sock *sk, int flags, bool peek,
long timeo, int *err)
{
struct sk_buff *skb;
- while (!(skb = skb_peek(&sk->sk_receive_queue))) {
+ for (;; ) {
+ skb = peek ? skb_peek(&sk->sk_receive_queue) :
+ skb_dequeue(&sk->sk_receive_queue);
+ if (skb)
+ break;
+
if (sk->sk_err) {
*err = sock_error(sk);
return NULL;
@@ -1116,6 +1121,7 @@ static int kcm_recvmsg(struct socket *sock, struct msghdr *msg,
{
struct sock *sk = sock->sk;
struct kcm_sock *kcm = kcm_sk(sk);
+ bool peek = flags & MSG_PEEK;
int err = 0;
long timeo;
struct strp_msg *stm;
@@ -1126,7 +1132,7 @@ static int kcm_recvmsg(struct socket *sock, struct msghdr *msg,
lock_sock(sk);
- skb = kcm_wait_data(sk, flags, timeo, &err);
+ skb = kcm_wait_data(sk, flags, peek, timeo, &err);
if (!skb)
goto out;
@@ -1142,7 +1148,7 @@ static int kcm_recvmsg(struct socket *sock, struct msghdr *msg,
goto out;
copied = len;
- if (likely(!(flags & MSG_PEEK))) {
+ if (likely(!peek)) {
KCM_STATS_ADD(kcm->stats.rx_bytes, copied);
if (copied < stm->full_len) {
if (sock->type == SOCK_DGRAM) {
@@ -1157,7 +1163,6 @@ static int kcm_recvmsg(struct socket *sock, struct msghdr *msg,
/* Finished with message */
msg->msg_flags |= MSG_EOR;
KCM_STATS_INCR(kcm->stats.rx_msgs);
- skb_unlink(skb, &sk->sk_receive_queue);
kfree_skb(skb);
}
}
@@ -1186,7 +1191,7 @@ static ssize_t kcm_splice_read(struct socket *sock, loff_t *ppos,
lock_sock(sk);
- skb = kcm_wait_data(sk, flags, timeo, &err);
+ skb = kcm_wait_data(sk, flags, true, timeo, &err);
if (!skb)
goto err_out;
--
2.17.1
^ permalink raw reply related
* RE: [PATCH net-next 2/3] net: hns3: Fix for VF mailbox receiving unknown message
From: Salil Mehta @ 2018-06-06 13:10 UTC (permalink / raw)
To: David Miller
Cc: Zhuangyuzeng (Yisen), lipeng (Y), mehta.salil@opnsrc.net,
netdev@vger.kernel.org, linux-kernel@vger.kernel.org, Linuxarm,
wangxi (M)
In-Reply-To: <20180605.144046.766561700879427040.davem@davemloft.net>
Hi Dave,
Sorry about this. I have fixed this in the V2 patch floated just now.
Best regards
Salil.
> -----Original Message-----
> From: David Miller [mailto:davem@davemloft.net]
> Sent: Tuesday, June 05, 2018 7:41 PM
> To: Salil Mehta <salil.mehta@huawei.com>
> Cc: Zhuangyuzeng (Yisen) <yisen.zhuang@huawei.com>; lipeng (Y)
> <lipeng321@huawei.com>; mehta.salil@opnsrc.net; netdev@vger.kernel.org;
> linux-kernel@vger.kernel.org; Linuxarm <linuxarm@huawei.com>; wangxi
> (M) <wangxi11@huawei.com>
> Subject: Re: [PATCH net-next 2/3] net: hns3: Fix for VF mailbox
> receiving unknown message
>
> From: Salil Mehta <salil.mehta@huawei.com>
> Date: Tue, 5 Jun 2018 12:42:00 +0100
>
> > + if (unlikely(!hnae3_get_bit(flag,
> HCLGEVF_CMDQ_RX_OUTVLD_B))) {
>
> This breaks the build, there is no such symbol named hnae3_get_bit().
^ permalink raw reply
* [PATCH V2 net-next 3/3] net: hns3: Optimize PF CMDQ interrupt switching process
From: Salil Mehta @ 2018-06-06 13:07 UTC (permalink / raw)
To: davem
Cc: salil.mehta, yisen.zhuang, lipeng321, mehta.salil, netdev,
linux-kernel, linuxarm, Xi Wang
In-Reply-To: <20180606130753.54428-1-salil.mehta@huawei.com>
From: Xi Wang <wangxi11@huawei.com>
When the PF frequently switches the CMDQ interrupt, if the CMDQ_SRC is
not cleared before the hardware interrupt is generated, the new interrupt
will not be reported.
This patch optimizes this problem by clearing CMDQ_SRC and RESET_STS
before enabling interrupt and syncing pending IRQ handlers after disabling
interrupt.
Fixes: 466b0c00391b ("net: hns3: Add support for misc interrupt")
Signed-off-by: Xi Wang <wangxi11@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index 2a80134..d318d35 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -2557,6 +2557,15 @@ static void hclge_clear_event_cause(struct hclge_dev *hdev, u32 event_type,
}
}
+static void hclge_clear_all_event_cause(struct hclge_dev *hdev)
+{
+ hclge_clear_event_cause(hdev, HCLGE_VECTOR0_EVENT_RST,
+ BIT(HCLGE_VECTOR0_GLOBALRESET_INT_B) |
+ BIT(HCLGE_VECTOR0_CORERESET_INT_B) |
+ BIT(HCLGE_VECTOR0_IMPRESET_INT_B));
+ hclge_clear_event_cause(hdev, HCLGE_VECTOR0_EVENT_MBX, 0);
+}
+
static void hclge_enable_vector(struct hclge_misc_vector *vector, bool enable)
{
writel(enable ? 1 : 0, vector->addr);
@@ -5688,6 +5697,8 @@ static int hclge_init_ae_dev(struct hnae3_ae_dev *ae_dev)
INIT_WORK(&hdev->rst_service_task, hclge_reset_service_task);
INIT_WORK(&hdev->mbx_service_task, hclge_mailbox_service_task);
+ hclge_clear_all_event_cause(hdev);
+
/* Enable MISC vector(vector0) */
hclge_enable_vector(&hdev->misc_vector, true);
@@ -5817,6 +5828,8 @@ static void hclge_uninit_ae_dev(struct hnae3_ae_dev *ae_dev)
/* Disable MISC vector(vector0) */
hclge_enable_vector(&hdev->misc_vector, false);
+ synchronize_irq(hdev->misc_vector.vector_irq);
+
hclge_destroy_cmd_queue(&hdev->hw);
hclge_misc_irq_uninit(hdev);
hclge_pci_uninit(hdev);
--
2.7.4
^ permalink raw reply related
* [PATCH V2 net-next 2/3] net: hns3: Fix for VF mailbox receiving unknown message
From: Salil Mehta @ 2018-06-06 13:07 UTC (permalink / raw)
To: davem
Cc: salil.mehta, yisen.zhuang, lipeng321, mehta.salil, netdev,
linux-kernel, linuxarm, Xi Wang
In-Reply-To: <20180606130753.54428-1-salil.mehta@huawei.com>
From: Xi Wang <wangxi11@huawei.com>
Before the firmware updates the crq's tail pointer, if the VF driver
reads the data in the crq, the data may be incomplete at this time,
which will lead to the driver read an unknown message.
This patch fixes it by checking if crq is empty before reading the
message.
Fixes: b11a0bb231f3 ("net: hns3: Add mailbox support to VF driver")
Signed-off-by: Xi Wang <wangxi11@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
Patch V2: Fixes the compilation break reported David Miller & Kbuild
Link: https://lkml.org/lkml/2018/6/5/866
https://lkml.org/lkml/2018/6/6/147
Patch V1: Initial Submit
---
.../ethernet/hisilicon/hns3/hns3vf/hclgevf_mbx.c | 23 +++++++++++++++++++---
1 file changed, 20 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_mbx.c b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_mbx.c
index a286184..b598c06 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_mbx.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_mbx.c
@@ -126,6 +126,13 @@ int hclgevf_send_mbx_msg(struct hclgevf_dev *hdev, u16 code, u16 subcode,
return status;
}
+static bool hclgevf_cmd_crq_empty(struct hclgevf_hw *hw)
+{
+ u32 tail = hclgevf_read_dev(hw, HCLGEVF_NIC_CRQ_TAIL_REG);
+
+ return tail == hw->cmq.crq.next_to_use;
+}
+
void hclgevf_mbx_handler(struct hclgevf_dev *hdev)
{
struct hclgevf_mbx_resp_status *resp;
@@ -140,11 +147,22 @@ void hclgevf_mbx_handler(struct hclgevf_dev *hdev)
resp = &hdev->mbx_resp;
crq = &hdev->hw.cmq.crq;
- flag = le16_to_cpu(crq->desc[crq->next_to_use].flag);
- while (hnae_get_bit(flag, HCLGEVF_CMDQ_RX_OUTVLD_B)) {
+ while (!hclgevf_cmd_crq_empty(&hdev->hw)) {
desc = &crq->desc[crq->next_to_use];
req = (struct hclge_mbx_pf_to_vf_cmd *)desc->data;
+ flag = le16_to_cpu(crq->desc[crq->next_to_use].flag);
+ if (unlikely(!hnae_get_bit(flag, HCLGEVF_CMDQ_RX_OUTVLD_B))) {
+ dev_warn(&hdev->pdev->dev,
+ "dropped invalid mailbox message, code = %d\n",
+ req->msg[0]);
+
+ /* dropping/not processing this invalid message */
+ crq->desc[crq->next_to_use].flag = 0;
+ hclge_mbx_ring_ptr_move_crq(crq);
+ continue;
+ }
+
/* synchronous messages are time critical and need preferential
* treatment. Therefore, we need to acknowledge all the sync
* responses as quickly as possible so that waiting tasks do not
@@ -205,7 +223,6 @@ void hclgevf_mbx_handler(struct hclgevf_dev *hdev)
}
crq->desc[crq->next_to_use].flag = 0;
hclge_mbx_ring_ptr_move_crq(crq);
- flag = le16_to_cpu(crq->desc[crq->next_to_use].flag);
}
/* Write back CMDQ_RQ header pointer, M7 need this pointer */
--
2.7.4
^ permalink raw reply related
* [PATCH V2 net-next 1/3] net: hns3: Fix for VF mailbox cannot receiving PF response
From: Salil Mehta @ 2018-06-06 13:07 UTC (permalink / raw)
To: davem
Cc: salil.mehta, yisen.zhuang, lipeng321, mehta.salil, netdev,
linux-kernel, linuxarm, Xi Wang
In-Reply-To: <20180606130753.54428-1-salil.mehta@huawei.com>
From: Xi Wang <wangxi11@huawei.com>
When the VF frequently switches the CMDQ interrupt, if the CMDQ_SRC is not
cleared, the VF will not receive the new PF response after the interrupt
is re-enabled, the corresponding log is as follows:
[ 317.482222] hns3 0000:00:03.0: VF could not get mbx resp(=0) from PF
in 500 tries
[ 317.483137] hns3 0000:00:03.0: VF request to get tqp info from PF
failed -5
This patch fixes this problem by clearing CMDQ_SRC before enabling
interrupt and syncing pending IRQ handlers after disabling interrupt.
Fixes: e2cb1dec9779 ("net: hns3: Add HNS3 VF HCL(Hardware Compatibility Layer) Support")
Signed-off-by: Xi Wang <wangxi11@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
index dd8e8e6..d55ee9c 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
@@ -1576,6 +1576,8 @@ static int hclgevf_misc_irq_init(struct hclgevf_dev *hdev)
return ret;
}
+ hclgevf_clear_event_cause(hdev, 0);
+
/* enable misc. vector(vector 0) */
hclgevf_enable_vector(&hdev->misc_vector, true);
@@ -1586,6 +1588,7 @@ static void hclgevf_misc_irq_uninit(struct hclgevf_dev *hdev)
{
/* disable misc vector(vector 0) */
hclgevf_enable_vector(&hdev->misc_vector, false);
+ synchronize_irq(hdev->misc_vector.vector_irq);
free_irq(hdev->misc_vector.vector_irq, hdev);
hclgevf_free_vector(hdev, 0);
}
--
2.7.4
^ permalink raw reply related
* [PATCH V2 net-next 0/3] Bug fixes & optimization for HNS3 Driver
From: Salil Mehta @ 2018-06-06 13:07 UTC (permalink / raw)
To: davem
Cc: salil.mehta, yisen.zhuang, lipeng321, mehta.salil, netdev,
linux-kernel, linuxarm
This patch-set presents miscellaneous bug fixes and an optimization
for HNS3 driver
V1->V2:
* Fixes the compilation break reported by David Miller & Kbuild
Xi Wang (3):
net: hns3: Fix for VF mailbox cannot receiving PF response
net: hns3: Fix for VF mailbox receiving unknown message
net: hns3: Optimize PF CMDQ interrupt switching process
.../ethernet/hisilicon/hns3/hns3pf/hclge_main.c | 13 ++++++++++++
.../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c | 3 +++
.../ethernet/hisilicon/hns3/hns3vf/hclgevf_mbx.c | 23 +++++++++++++++++++---
3 files changed, 36 insertions(+), 3 deletions(-)
--
2.7.4
^ permalink raw reply
* [PATCH] bnx2x: use the right constant
From: Julia Lawall @ 2018-06-06 13:03 UTC (permalink / raw)
To: Ariel Elior
Cc: kernel-janitors, everest-linux-l2, David S. Miller, netdev,
linux-kernel
Nearby code that also tests port suggests that the P0 constant should be
used when port is zero.
The semantic match that finds this problem is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
expression e,e1;
@@
* e ? e1 : e1
// </smpl>
Fixes: 6c3218c6f7e5 ("bnx2x: Adjust ETS to 578xx")
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
---
drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c
index 7dd83d0..22243c4 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c
@@ -588,7 +588,7 @@ static void bnx2x_ets_e3b0_nig_disabled(const struct link_params *params,
* slots for the highest priority.
*/
REG_WR(bp, (port) ? NIG_REG_P1_TX_ARB_NUM_STRICT_ARB_SLOTS :
- NIG_REG_P1_TX_ARB_NUM_STRICT_ARB_SLOTS, 0x100);
+ NIG_REG_P0_TX_ARB_NUM_STRICT_ARB_SLOTS, 0x100);
/* Mapping between the CREDIT_WEIGHT registers and actual client
* numbers
*/
^ permalink raw reply related
* [PATCH bpf-next v5 1/2] trace_helpers.c: Add helpers to poll multiple perf FDs for events
From: Toke Høiland-Jørgensen @ 2018-06-06 12:43 UTC (permalink / raw)
To: netdev; +Cc: Jesper Dangaard Brouer
Add two new helper functions to trace_helpers that supports polling
multiple perf file descriptors for events. These are used to the XDP
perf_event_output example, which needs to work with one perf fd per CPU.
Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
---
tools/testing/selftests/bpf/trace_helpers.c | 49 ++++++++++++++++++++++++++-
tools/testing/selftests/bpf/trace_helpers.h | 4 ++
2 files changed, 51 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/bpf/trace_helpers.c b/tools/testing/selftests/bpf/trace_helpers.c
index 3868dcb63420..0673e8840cc8 100644
--- a/tools/testing/selftests/bpf/trace_helpers.c
+++ b/tools/testing/selftests/bpf/trace_helpers.c
@@ -88,7 +88,7 @@ static int page_size;
static int page_cnt = 8;
static struct perf_event_mmap_page *header;
-int perf_event_mmap(int fd)
+int perf_event_mmap_header(int fd, struct perf_event_mmap_page **header)
{
void *base;
int mmap_size;
@@ -102,10 +102,15 @@ int perf_event_mmap(int fd)
return -1;
}
- header = base;
+ *header = base;
return 0;
}
+int perf_event_mmap(int fd)
+{
+ return perf_event_mmap_header(fd, &header);
+}
+
static int perf_event_poll(int fd)
{
struct pollfd pfd = { .fd = fd, .events = POLLIN };
@@ -163,3 +168,43 @@ int perf_event_poller(int fd, perf_event_print_fn output_fn)
return ret;
}
+
+int perf_event_poller_multi(int *fds, struct perf_event_mmap_page **headers,
+ int num_fds, perf_event_print_fn output_fn)
+{
+ enum bpf_perf_event_ret ret;
+ struct pollfd *pfds;
+ void *buf = NULL;
+ size_t len = 0;
+ int i;
+
+ pfds = malloc(sizeof(*pfds) * num_fds);
+ if (!pfds)
+ return LIBBPF_PERF_EVENT_ERROR;
+
+ memset(pfds, 0, sizeof(*pfds) * num_fds);
+ for (i = 0; i < num_fds; i++) {
+ pfds[i].fd = fds[i];
+ pfds[i].events = POLLIN;
+ }
+
+ for (;;) {
+ poll(pfds, num_fds, 1000);
+ for (i = 0; i < num_fds; i++) {
+ if (!pfds[i].revents)
+ continue;
+
+ ret = bpf_perf_event_read_simple(headers[i],
+ page_cnt * page_size,
+ page_size, &buf, &len,
+ bpf_perf_event_print,
+ output_fn);
+ if (ret != LIBBPF_PERF_EVENT_CONT)
+ break;
+ }
+ }
+ free(buf);
+ free(pfds);
+
+ return ret;
+}
diff --git a/tools/testing/selftests/bpf/trace_helpers.h b/tools/testing/selftests/bpf/trace_helpers.h
index 3b4bcf7f5084..18924f23db1b 100644
--- a/tools/testing/selftests/bpf/trace_helpers.h
+++ b/tools/testing/selftests/bpf/trace_helpers.h
@@ -3,6 +3,7 @@
#define __TRACE_HELPER_H
#include <libbpf.h>
+#include <linux/perf_event.h>
struct ksym {
long addr;
@@ -16,6 +17,9 @@ long ksym_get_addr(const char *name);
typedef enum bpf_perf_event_ret (*perf_event_print_fn)(void *data, int size);
int perf_event_mmap(int fd);
+int perf_event_mmap_header(int fd, struct perf_event_mmap_page **header);
/* return LIBBPF_PERF_EVENT_DONE or LIBBPF_PERF_EVENT_ERROR */
int perf_event_poller(int fd, perf_event_print_fn output_fn);
+int perf_event_poller_multi(int *fds, struct perf_event_mmap_page **headers,
+ int num_fds, perf_event_print_fn output_fn);
#endif
^ permalink raw reply related
* [PATCH bpf-next v5 2/2] samples/bpf: Add xdp_sample_pkts example
From: Toke Høiland-Jørgensen @ 2018-06-06 12:43 UTC (permalink / raw)
To: netdev; +Cc: Jesper Dangaard Brouer
In-Reply-To: <152828901936.9599.10143519677183418086.stgit@alrua-kau>
Add an example program showing how to sample packets from XDP using the
perf event buffer. The example userspace program just prints the ethernet
header for every packet sampled.
Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
---
samples/bpf/Makefile | 4 +
samples/bpf/xdp_sample_pkts_kern.c | 66 ++++++++++++++
samples/bpf/xdp_sample_pkts_user.c | 176 ++++++++++++++++++++++++++++++++++++
3 files changed, 246 insertions(+)
create mode 100644 samples/bpf/xdp_sample_pkts_kern.c
create mode 100644 samples/bpf/xdp_sample_pkts_user.c
diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 1303af10e54d..9ea2f7b64869 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -52,6 +52,7 @@ hostprogs-y += xdp_adjust_tail
hostprogs-y += xdpsock
hostprogs-y += xdp_fwd
hostprogs-y += task_fd_query
+hostprogs-y += xdp_sample_pkts
# Libbpf dependencies
LIBBPF = $(TOOLS_PATH)/lib/bpf/libbpf.a
@@ -107,6 +108,7 @@ xdp_adjust_tail-objs := xdp_adjust_tail_user.o
xdpsock-objs := bpf_load.o xdpsock_user.o
xdp_fwd-objs := bpf_load.o xdp_fwd_user.o
task_fd_query-objs := bpf_load.o task_fd_query_user.o $(TRACE_HELPERS)
+xdp_sample_pkts-objs := xdp_sample_pkts_user.o $(TRACE_HELPERS)
# Tell kbuild to always build the programs
always := $(hostprogs-y)
@@ -163,6 +165,7 @@ always += xdp_adjust_tail_kern.o
always += xdpsock_kern.o
always += xdp_fwd_kern.o
always += task_fd_query_kern.o
+always += xdp_sample_pkts_kern.o
HOSTCFLAGS += -I$(objtree)/usr/include
HOSTCFLAGS += -I$(srctree)/tools/lib/
@@ -179,6 +182,7 @@ HOSTCFLAGS_spintest_user.o += -I$(srctree)/tools/lib/bpf/
HOSTCFLAGS_trace_event_user.o += -I$(srctree)/tools/lib/bpf/
HOSTCFLAGS_sampleip_user.o += -I$(srctree)/tools/lib/bpf/
HOSTCFLAGS_task_fd_query_user.o += -I$(srctree)/tools/lib/bpf/
+HOSTCFLAGS_xdp_sample_pkts_user.o += -I$(srctree)/tools/lib/bpf/
HOST_LOADLIBES += $(LIBBPF) -lelf
HOSTLOADLIBES_tracex4 += -lrt
diff --git a/samples/bpf/xdp_sample_pkts_kern.c b/samples/bpf/xdp_sample_pkts_kern.c
new file mode 100644
index 000000000000..f7ca8b850978
--- /dev/null
+++ b/samples/bpf/xdp_sample_pkts_kern.c
@@ -0,0 +1,66 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/ptrace.h>
+#include <linux/version.h>
+#include <uapi/linux/bpf.h>
+#include "bpf_helpers.h"
+
+#define SAMPLE_SIZE 64ul
+#define MAX_CPUS 128
+
+#define bpf_printk(fmt, ...) \
+({ \
+ char ____fmt[] = fmt; \
+ bpf_trace_printk(____fmt, sizeof(____fmt), \
+ ##__VA_ARGS__); \
+})
+
+struct bpf_map_def SEC("maps") my_map = {
+ .type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
+ .key_size = sizeof(int),
+ .value_size = sizeof(u32),
+ .max_entries = MAX_CPUS,
+};
+
+SEC("xdp_sample")
+int xdp_sample_prog(struct xdp_md *ctx)
+{
+ void *data_end = (void *)(long)ctx->data_end;
+ void *data = (void *)(long)ctx->data;
+
+ /* Metadata will be in the perf event before the packet data. */
+ struct S {
+ u16 cookie;
+ u16 pkt_len;
+ } __packed metadata;
+
+ if (data < data_end) {
+ /* The XDP perf_event_output handler will use the upper 32 bits
+ * of the flags argument as a number of bytes to include of the
+ * packet payload in the event data. If the size is too big, the
+ * call to bpf_perf_event_output will fail and return -EFAULT.
+ *
+ * See bpf_xdp_event_output in net/core/filter.c.
+ *
+ * The BPF_F_CURRENT_CPU flag means that the event output fd
+ * will be indexed by the CPU number in the event map.
+ */
+ u64 flags = BPF_F_CURRENT_CPU;
+ u16 sample_size;
+ int ret;
+
+ metadata.cookie = 0xdead;
+ metadata.pkt_len = (u16)(data_end - data);
+ sample_size = min(metadata.pkt_len, SAMPLE_SIZE);
+ flags |= (u64)sample_size << 32;
+
+ ret = bpf_perf_event_output(ctx, &my_map, flags,
+ &metadata, sizeof(metadata));
+ if (ret)
+ bpf_printk("perf_event_output failed: %d\n", ret);
+ }
+
+ return XDP_PASS;
+}
+
+char _license[] SEC("license") = "GPL";
+u32 _version SEC("version") = LINUX_VERSION_CODE;
diff --git a/samples/bpf/xdp_sample_pkts_user.c b/samples/bpf/xdp_sample_pkts_user.c
new file mode 100644
index 000000000000..944ff7512d2b
--- /dev/null
+++ b/samples/bpf/xdp_sample_pkts_user.c
@@ -0,0 +1,176 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <stdio.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <string.h>
+#include <fcntl.h>
+#include <poll.h>
+#include <linux/perf_event.h>
+#include <linux/bpf.h>
+#include <net/if.h>
+#include <errno.h>
+#include <assert.h>
+#include <sys/sysinfo.h>
+#include <sys/syscall.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <time.h>
+#include <signal.h>
+#include <libbpf.h>
+#include <bpf/bpf.h>
+
+#include "perf-sys.h"
+#include "trace_helpers.h"
+
+#define MAX_CPUS 128
+static int pmu_fds[MAX_CPUS], if_idx;
+static struct perf_event_mmap_page *headers[MAX_CPUS];
+static char *if_name;
+
+static int do_attach(int idx, int fd, const char *name)
+{
+ int err;
+
+ err = bpf_set_link_xdp_fd(idx, fd, 0);
+ if (err < 0)
+ printf("ERROR: failed to attach program to %s\n", name);
+
+ return err;
+}
+
+static int do_detach(int idx, const char *name)
+{
+ int err;
+
+ err = bpf_set_link_xdp_fd(idx, -1, 0);
+ if (err < 0)
+ printf("ERROR: failed to detach program from %s\n", name);
+
+ return err;
+}
+
+#define SAMPLE_SIZE 64
+
+static int print_bpf_output(void *data, int size)
+{
+ struct {
+ __u16 cookie;
+ __u16 pkt_len;
+ __u8 pkt_data[SAMPLE_SIZE];
+ } __packed *e = data;
+ int i;
+
+ if (e->cookie != 0xdead) {
+ printf("BUG cookie %x sized %d\n",
+ e->cookie, size);
+ return LIBBPF_PERF_EVENT_ERROR;
+ }
+
+ printf("Pkt len: %-5d bytes. Ethernet hdr: ", e->pkt_len);
+ for (i = 0; i < 14 && i < e->pkt_len; i++)
+ printf("%02x ", e->pkt_data[i]);
+ printf("\n");
+
+ return LIBBPF_PERF_EVENT_CONT;
+}
+
+static void test_bpf_perf_event(int map_fd, int num)
+{
+ struct perf_event_attr attr = {
+ .sample_type = PERF_SAMPLE_RAW,
+ .type = PERF_TYPE_SOFTWARE,
+ .config = PERF_COUNT_SW_BPF_OUTPUT,
+ .wakeup_events = 1, /* get an fd notification for every event */
+ };
+ int i;
+
+ for (i = 0; i < num; i++) {
+ int key = i;
+
+ pmu_fds[i] = sys_perf_event_open(&attr, -1/*pid*/, i/*cpu*/,
+ -1/*group_fd*/, 0);
+
+ assert(pmu_fds[i] >= 0);
+ assert(bpf_map_update_elem(map_fd, &key,
+ &pmu_fds[i], BPF_ANY) == 0);
+ ioctl(pmu_fds[i], PERF_EVENT_IOC_ENABLE, 0);
+ }
+}
+
+static void sig_handler(int signo)
+{
+ do_detach(if_idx, if_name);
+ exit(0);
+}
+
+int main(int argc, char **argv)
+{
+ struct bpf_prog_load_attr prog_load_attr = {
+ .prog_type = BPF_PROG_TYPE_XDP,
+ };
+ struct bpf_object *obj;
+ struct bpf_map *map;
+ int prog_fd, map_fd;
+ char filename[256];
+ int ret, err, i;
+ int numcpus;
+
+ if (argc < 2) {
+ printf("Usage: %s <ifname>\n", argv[0]);
+ return 1;
+ }
+
+ numcpus = get_nprocs();
+ if (numcpus > MAX_CPUS)
+ numcpus = MAX_CPUS;
+
+ snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+ prog_load_attr.file = filename;
+
+ if (bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd))
+ return 1;
+
+ if (!prog_fd) {
+ printf("load_bpf_file: %s\n", strerror(errno));
+ return 1;
+ }
+
+ map = bpf_map__next(NULL, obj);
+ if (!map) {
+ printf("finding a map in obj file failed\n");
+ return 1;
+ }
+ map_fd = bpf_map__fd(map);
+
+ if_idx = if_nametoindex(argv[1]);
+ if (!if_idx)
+ if_idx = strtoul(argv[1], NULL, 0);
+
+ if (!if_idx) {
+ fprintf(stderr, "Invalid ifname\n");
+ return 1;
+ }
+ if_name = argv[1];
+ err = do_attach(if_idx, prog_fd, argv[1]);
+ if (err)
+ return err;
+
+ if (signal(SIGINT, sig_handler) ||
+ signal(SIGHUP, sig_handler) ||
+ signal(SIGTERM, sig_handler)) {
+ perror("signal");
+ return 1;
+ }
+
+ test_bpf_perf_event(map_fd, numcpus);
+
+ for (i = 0; i < numcpus; i++)
+ if (perf_event_mmap_header(pmu_fds[i], &headers[i]) < 0)
+ return 1;
+
+ ret = perf_event_poller_multi(pmu_fds, headers, numcpus,
+ print_bpf_output);
+ kill(0, SIGINT);
+ return ret;
+}
^ permalink raw reply related
* Re: [PATCH] net/phy: Micrel KSZ8061 PHY link failure after cable connect
From: Andrew Lunn @ 2018-06-06 12:39 UTC (permalink / raw)
To: Alexander Onnasch; +Cc: Florian Fainelli, netdev, linux-kernel
In-Reply-To: <1528272198-10825-1-git-send-email-alexander.onnasch@landisgyr.com>
On Wed, Jun 06, 2018 at 10:03:18AM +0200, Alexander Onnasch wrote:
> With Micrel KSZ8061 PHY, the link may occasionally not come up after
> Ethernet cable connect. The vendor's (Microchip, former Micrel) errata
> sheet 80000688A.pdf describes the problem and possible workarounds in
> detail, see below.
> The patch implements workaround 1, which permanently fixes the issue.
>
> DESCRIPTION
> Link-up may not occur properly when the Ethernet cable is initially
> connected. This issue occurs more commonly when the cable is connected
> slowly, but it may occur any time a cable is connected. This issue occurs
> in the auto-negotiation circuit, and will not occur if auto-negotiation
> is disabled (which requires that the two link partners be set to the
> same speed and duplex).
>
> END USER IMPLICATIONS
> When this issue occurs, link is not established. Subsequent cable
> plug/unplug cycles will not correct the issue.
>
> WORK AROUND
> There are four approaches to work around this issue:
> 1. This issue can be prevented by setting bit 15 in MMD device address 1,
> register 2, prior to connecting the cable or prior to setting the
> Restart Auto-Negotiation bit in register 0h.The MMD registers are
> accessed via the indirect access registers Dh and Eh, or via the Micrel
> EthUtil utility as shown here:
> • If using the EthUtil utility (usually with a Micrel KSZ8061
> Evaluation Board), type the following commands:
> > address 1
> > mmd 1
> > iw 2 b61a
> • Alternatively, write the following registers to write to the
> indirect MMD register:
> Write register Dh, data 0001h
> Write register Eh, data 0002h
> Write register Dh, data 4001h
> Write register Eh, data B61Ah
> 2. The issue can be avoided by disabling auto-negotiation in the KSZ8061,
> either by the strapping option, or by clearing bit 12 in register 0h.
> Care must be taken to ensure that the KSZ8061 and the link partner
> will link with the same speed and duplex. Note that the KSZ8061
> defaults to full-duplex when auto-negotiation is off, but other
> devices may default to half-duplex in the event of failed
> auto-negotiation.
> 3. The issue can be avoided by connecting the cable prior to powering-up
> or resetting the KSZ8061, and leaving it plugged in thereafter.
> 4. If the above measures are not taken and the problem occurs, link can
> be recovered by setting the Restart Auto-Negotiation bit in
> register 0h, or by resetting or power cycling the device. Reset may
> be either hardware reset or software reset (register 0h, bit 15).
>
> PLAN
> This errata will not be corrected in a future revision.
>
> Signed-off-by: Alexander Onnasch <alexander.onnasch@landisgyr.com>
> ---
> drivers/net/phy/micrel.c | 24 +++++++++++++++++++++++-
> 1 file changed, 23 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/phy/micrel.c b/drivers/net/phy/micrel.c
> index 6c45ff6..7d80a00 100644
> --- a/drivers/net/phy/micrel.c
> +++ b/drivers/net/phy/micrel.c
> @@ -339,6 +339,28 @@ static int ksz8041_config_aneg(struct phy_device *phydev)
> return genphy_config_aneg(phydev);
> }
>
> +#define MII_KSZ8061RN_MMD_CTRL_REG 0x0d
> +#define MII_KSZ8061RN_MMD_REGDATA_REG 0x0e
> +
> +static int ksz8061_extended_write(struct phy_device *phydev,
> + u8 mode, u32 dev_addr, u32 regnum, u16 val)
> +{
> + phy_write(phydev, MII_KSZ8061RN_MMD_CTRL_REG, dev_addr);
> + phy_write(phydev, MII_KSZ8061RN_MMD_REGDATA_REG, regnum);
> + phy_write(phydev, MII_KSZ8061RN_MMD_CTRL_REG, (mode << 14) | dev_addr);
> + return phy_write(phydev, MII_KSZ8061RN_MMD_REGDATA_REG, val);
> +}
Hi Alexander
This looks a lot like phy_write_mmd().
Andrew
^ permalink raw reply
* Re: [PATCH] r8169: Reinstate ALDPS and ASPM support
From: Andrew Lunn @ 2018-06-06 12:34 UTC (permalink / raw)
To: Kai-Heng Feng
Cc: David Miller, Hayes Wang, hkallweit1, romieu, Linux Netdev List,
Linux Kernel Mailing List, Ryankao
In-Reply-To: <8BAED2AC-DA94-4F63-BA16-EB3D54649BC5@canonical.com>
> Hi Andrew,
>
> Sure.
>
> Do you think we should also strip out the enable/disable logic?
>
> Or just remove the parameter?
Hi Kai-Heng
How would you use the logic if you remove the parameter?
Andrew
^ permalink raw reply
* Re: [PATCH bpf-next v5 00/10] BTF: BPF Type Format
From: Arnaldo Carvalho de Melo @ 2018-06-06 12:33 UTC (permalink / raw)
To: Martin KaFai Lau; +Cc: netdev, Alexei Starovoitov, Daniel Borkmann, kernel-team
In-Reply-To: <20180605212548.lwtaw4svvydo2lhy@kafai-mbp.dhcp.thefacebook.com>
Em Tue, Jun 05, 2018 at 02:25:48PM -0700, Martin KaFai Lau escreveu:
> On Thu, Apr 19, 2018 at 04:40:34PM -0300, Arnaldo Carvalho de Melo wrote:
> > Em Wed, Apr 18, 2018 at 03:55:56PM -0700, Martin KaFai Lau escreveu:
> > > This patch introduces BPF Type Format (BTF).
> > >
> > > BTF (BPF Type Format) is the meta data format which describes
> > > the data types of BPF program/map. Hence, it basically focus
> > > on the C programming language which the modern BPF is primary
> > > using. The first use case is to provide a generic pretty print
> > > capability for a BPF map.
> > >
> > > A modified pahole that can convert dwarf to BTF is here:
> > > https://github.com/iamkafai/pahole/tree/btf
> > > (Arnaldo, there is some BTF_KIND numbering changes on
> > > Apr 18th, d61426c1571)
> >
> > Thanks for letting me know, I'm starting to look at this,
> Hi Arnaldo,
>
> Do you have a chance to take a look and pull it? The kernel
> changes will be in 4.18, so it will be handy if it is available in
> the pahole repository.
>
> [ btw, the latest commit (1 commit) should be 94a11b59e592 ].
Got sidetracked, will get back to it later today.
- Arnaldo
^ permalink raw reply
* Re: [PATCH net] failover: eliminate callback hell
From: Michael S. Tsirkin @ 2018-06-06 12:30 UTC (permalink / raw)
To: Jiri Pirko
Cc: Stephen Hemminger, kys, haiyangz, davem, sridhar.samudrala,
netdev, Stephen Hemminger
In-Reply-To: <20180606072512.GA2289@nanopsycho.orion>
On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:
> Tue, Jun 05, 2018 at 05:42:31AM CEST, stephen@networkplumber.org wrote:
> >The net failover should be a simple library, not a virtual
> >object with function callbacks (see callback hell).
>
> Why just a library? It should do a common things. I think it should be a
> virtual object. Looks like your patch again splits the common
> functionality into multiple drivers. That is kind of backwards attitude.
> I don't get it. We should rather focus on fixing the mess the
> introduction of netvsc-bonding caused and switch netvsc to 3-netdev
> model.
So it seems that at least one benefit for netvsc would be better
handling of renames.
Question is how can this change to 3-netdev happen? Stephen is
concerned about risk of breaking some userspace.
Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
address, and you said then "why not use existing network namespaces
rather than inventing a new abstraction". So how about it then? Do you
want to find a way to use namespaces to hide the PV device for netvsc
compatibility?
--
MST
^ permalink raw reply
* Re: [PATCH bpf-next v2 2/2] samples/bpf: Add xdp_sample_pkts example
From: Toke Høiland-Jørgensen @ 2018-06-06 12:28 UTC (permalink / raw)
To: David Beckett, netdev
In-Reply-To: <1448c449-b853-b2df-2654-44c9c52adfe4@netronome.com>
David Beckett <david.beckett@netronome.com> writes:
> On 04/06/18 17:33, Toke Høiland-Jørgensen wrote:
>> +
>> +#define SAMPLE_SIZE 64ul
>> +
> The program currently cannot sample minimum sized packets, as the 4 Byte
> crc checksum isnt present in ctx,
> may be better to use 60ul sample size to allow for these packets to be
> processed?
Right. However, this also reminds me that I wanted to make the sampling
size dynamic, so it is possible to dump packets that are smaller than
the configured SAMPLE_SIZE. Will fix :)
>> + if (data + SAMPLE_SIZE < data_end) {
>> + /* The XDP perf_event_output handler will use the upper 32 bits
>> + * of the flags argument as a number of bytes to include of the
> I may be wrong on this but should this also be <= to allow for packets
> at SAMPLE_SIZE to be sampled?
Yes, you are right, but that goes away with the change I mentioned above.
-Toke
^ permalink raw reply
* Re: [RFC PATCH] kcm: hold rx mux lock when updating the receive queue.
From: Paolo Abeni @ 2018-06-06 12:23 UTC (permalink / raw)
To: Kirill Tkhai, Tom Herbert, David Miller
Cc: Linux Kernel Network Developers, Tom Herbert
In-Reply-To: <9b036d06-f011-6e80-956a-1162af25a5b8@virtuozzo.com>
Hi,
On Wed, 2018-06-06 at 13:25 +0300, Kirill Tkhai wrote:
> Hi, Paolo,
>
> below is couple my thoughts about this.
>
> On 06.06.2018 12:44, Paolo Abeni wrote:
> > On Tue, 2018-06-05 at 18:06 +0200, Paolo Abeni wrote:
> > > On Tue, 2018-06-05 at 08:35 -0700, Tom Herbert wrote:
> > > > Paolo, thanks for looking into this! Can you try replacing
> > > > __skb_dequeue in requeue_rx_msgs with skb_dequeue to see if that is
> > > > the fix.
> > >
> > > Sure, I'll retrigger the test, and report the result here (or directly
> > > a new patch, should the test be succesful)
> >
> > Contrary to my expectations, the suggested change does not fix the
> > issue. I'm still investigating the overall locking schema.
>
> kcm_rcv_strparser()->unreserve_rx_kcm()->requeue_rx_msgs()->__skb_dequeue()
>
> seems needed to be synchronized with:
>
> kcm_recvmsg()->kcm_wait_data().
>
> Otherwise, requeue_rx_msgs() removes kcm_recvmsg() peeked skb.
>
> The solution could be to take lock_sock(&kcm->sk) in requeue_rx_msgs(), but
> we can't do that since there is already locked another socket (and potentially,
> this may be a reason of deadlock).
>
> The approach you made in initial patch seems good for me to solve this problem.
> The only thing I'm not sure is either lock_sock() is needed in kcm_recvmsg() after
> this.
Thank you for the feedback!
I tried a different approach (add en explicit 'peek' argument to
kcm_wait_data, and dequeue the packet there if not explicitly asked
otherwise). It solves the issue and looks reasonably clean.
I'll post the patch soon.
Cheers,
Paolo
^ permalink raw reply
* Re: [PATCH net] failover: eliminate callback hell
From: Michael S. Tsirkin @ 2018-06-06 12:19 UTC (permalink / raw)
To: Stephen Hemminger
Cc: Samudrala, Sridhar, kys, haiyangz, davem, netdev,
Stephen Hemminger
In-Reply-To: <20180605205118.439a7873@xeon-e3>
On Tue, Jun 05, 2018 at 08:51:18PM -0700, Stephen Hemminger wrote:
> > I think the push back was with the usage of the delay, not bringing up the primary/standby
> > device in the name change event handler.
> > Can't netvsc use this mechanism instead of depending on the delay?
> >
> >
>
> The patch that was rejected for netvsc was about using name change.
So failover is now doing exactly what you wanted netvsc to do. Rather
than reverting everyone to old behaviour how about using more pieces
from failover?
> Also, you can't depend on name change; you still need a timer. Not all distributions
> change name of devices.
So failover chose not to implement the delayed open so far.
If it does I suspect we'll have to keep it around forever -
kind of like netvsc seems to be stuck with it.
But let's see if there's enough actual demand from people running
ancient distros with latest kernels to even start looking for
a solution for failover.
And this kind of behaviour change really should be split out
so we can discuss it separately.
> Or user has blocked that by udev rules.
Don't do that then?
--
MST
^ permalink raw reply
* RE: [PATCH 09/10] dpaa_eth: add support for hardware timestamping
From: Y.b. Lu @ 2018-06-06 11:48 UTC (permalink / raw)
To: Richard Cochran
Cc: netdev@vger.kernel.org, Madalin-cristian Bucur, Rob Herring,
Shawn Guo, David S . Miller, devicetree@vger.kernel.org,
linuxppc-dev@lists.ozlabs.org,
linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org
In-Reply-To: <20180605135748.mlarwiyzf2oe27ax@localhost>
Hi Richard,
> -----Original Message-----
> From: Richard Cochran [mailto:richardcochran@gmail.com]
> Sent: Tuesday, June 5, 2018 9:58 PM
> To: Y.b. Lu <yangbo.lu@nxp.com>
> Cc: netdev@vger.kernel.org; Madalin-cristian Bucur
> <madalin.bucur@nxp.com>; Rob Herring <robh+dt@kernel.org>; Shawn Guo
> <shawnguo@kernel.org>; David S . Miller <davem@davemloft.net>;
> devicetree@vger.kernel.org; linuxppc-dev@lists.ozlabs.org;
> linux-arm-kernel@lists.infradead.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH 09/10] dpaa_eth: add support for hardware timestamping
>
> On Tue, Jun 05, 2018 at 03:35:28AM +0000, Y.b. Lu wrote:
> > [Y.b. Lu] Actually these timestamping codes affected DPAA networking
> performance in our previous performance test.
> > That's why we used ifdef for it.
>
> How much does time stamping hurt performance?
>
> If the time stamping is compiled in but not enabled at run time, does it still
> affect performace?
[Y.b. Lu] I can't remember and find the old data since it had been a long time.
I just did the iperf test today between two 10G ports. I didn’t see any performance changes with timestamping code 😊
So, let's me remove the ifdef in next version.
Thanks a lot.
>
> Thanks,
> Richard
^ permalink raw reply
* Re: [PATCH bpf-next v2 2/2] samples/bpf: Add xdp_sample_pkts example
From: David Beckett @ 2018-06-06 11:40 UTC (permalink / raw)
To: Toke Høiland-Jørgensen, netdev
In-Reply-To: <152813003614.3465.11049830599815911223.stgit@alrua-kau>
On 04/06/18 17:33, Toke Høiland-Jørgensen wrote:
> +
> +#define SAMPLE_SIZE 64ul
> +
The program currently cannot sample minimum sized packets, as the 4 Byte
crc checksum isnt present in ctx,
may be better to use 60ul sample size to allow for these packets to be
processed?
> + if (data + SAMPLE_SIZE < data_end) {
> + /* The XDP perf_event_output handler will use the upper 32 bits
> + * of the flags argument as a number of bytes to include of the
I may be wrong on this but should this also be <= to allow for packets
at SAMPLE_SIZE to be sampled?
^ permalink raw reply
* Re: [PATCH bpf-next v4 2/2] samples/bpf: Add xdp_sample_pkts example
From: Jesper Dangaard Brouer @ 2018-06-06 11:33 UTC (permalink / raw)
To: Toke Høiland-Jørgensen
Cc: brouer, Jakub Kicinski, netdev, Martin KaFai Lau
In-Reply-To: <87vaawyy8v.fsf@toke.dk>
On Wed, 06 Jun 2018 13:01:52 +0200
Toke Høiland-Jørgensen <toke@toke.dk> wrote:
> Jakub Kicinski <jakub.kicinski@netronome.com> writes:
>
> > On Tue, 05 Jun 2018 16:50:00 +0200, Toke Høiland-Jørgensen wrote:
> >> Add an example program showing how to sample packets from XDP using the
> >> perf event buffer. The example userspace program just prints the ethernet
> >> header for every packet sampled.
> >>
> >> Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
> >
> >> diff --git a/samples/bpf/xdp_sample_pkts_kern.c b/samples/bpf/xdp_sample_pkts_kern.c
> >> new file mode 100644
> >> index 000000000000..4560522ca015
> >> --- /dev/null
> >> +++ b/samples/bpf/xdp_sample_pkts_kern.c
> >> @@ -0,0 +1,62 @@
> >> +#include <linux/ptrace.h>
> >> +#include <linux/version.h>
> >> +#include <uapi/linux/bpf.h>
> >> +#include "bpf_helpers.h"
> >> +
> >> +#define SAMPLE_SIZE 64ul
> >> +#define MAX_CPUS 24
> >
> > That may be a lil' too few for modern HW with hyper-threading on ;)
> > My development machine says:
> >
> > $ ncpus
> > 28
> >
> > 128, maybe?
>
> What? 24 CPUs should be enough for everyone, surely? ;)
>
> (I just took twice the number of CPUs in my largest machine; but fair
> point, other people have access to more expensive toys I guess... will fix)
>
> >> +#define bpf_printk(fmt, ...) \
> >> +({ \
> >> + char ____fmt[] = fmt; \
> >> + bpf_trace_printk(____fmt, sizeof(____fmt), \
> >> + ##__VA_ARGS__); \
> >> +})
> >> +
> >> +struct bpf_map_def SEC("maps") my_map = {
> >> + .type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
> >> + .key_size = sizeof(int),
> >> + .value_size = sizeof(u32),
> >> + .max_entries = MAX_CPUS,
> >> +};
I agree, we should increase it to something larger than 24, just to
avoid people getting surprised, but this is only sample code.
At some point I want us to add another example, that show how to adjust
this runtime, which is actually possible with bpf_load.c via a map
callback that Martin added. I've not look at how libbpf can support
this. I do have sample code[1] that could be adjusted to do this, but
I don't want to create dependencies to bpf_load.c, as we all agree that
everything should be moved to use libbpf (tools/lib/bpf).
[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/xdp_ddos01_blacklist_user.c#L186-L218
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
^ permalink raw reply
* Re: [PATCH bpf-next v4 2/2] samples/bpf: Add xdp_sample_pkts example
From: Toke Høiland-Jørgensen @ 2018-06-06 11:01 UTC (permalink / raw)
To: Jakub Kicinski; +Cc: netdev
In-Reply-To: <20180605152434.0149b8a7@cakuba.netronome.com>
Jakub Kicinski <jakub.kicinski@netronome.com> writes:
> On Tue, 05 Jun 2018 16:50:00 +0200, Toke Høiland-Jørgensen wrote:
>> Add an example program showing how to sample packets from XDP using the
>> perf event buffer. The example userspace program just prints the ethernet
>> header for every packet sampled.
>>
>> Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
>
>> diff --git a/samples/bpf/xdp_sample_pkts_kern.c b/samples/bpf/xdp_sample_pkts_kern.c
>> new file mode 100644
>> index 000000000000..4560522ca015
>> --- /dev/null
>> +++ b/samples/bpf/xdp_sample_pkts_kern.c
>> @@ -0,0 +1,62 @@
>> +#include <linux/ptrace.h>
>> +#include <linux/version.h>
>> +#include <uapi/linux/bpf.h>
>> +#include "bpf_helpers.h"
>> +
>> +#define SAMPLE_SIZE 64ul
>> +#define MAX_CPUS 24
>
> That may be a lil' too few for modern HW with hyper-threading on ;)
> My development machine says:
>
> $ ncpus
> 28
>
> 128, maybe?
What? 24 CPUs should be enough for everyone, surely? ;)
(I just took twice the number of CPUs in my largest machine; but fair
point, other people have access to more expensive toys I guess... will fix)
>> +#define bpf_printk(fmt, ...) \
>> +({ \
>> + char ____fmt[] = fmt; \
>> + bpf_trace_printk(____fmt, sizeof(____fmt), \
>> + ##__VA_ARGS__); \
>> +})
>> +
>> +struct bpf_map_def SEC("maps") my_map = {
>> + .type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
>> + .key_size = sizeof(int),
>> + .value_size = sizeof(u32),
>> + .max_entries = MAX_CPUS,
>> +};
>> +
>> +SEC("xdp_sample")
>> +int xdp_sample_prog(struct xdp_md *ctx)
>> +{
>> + void *data_end = (void *)(long)ctx->data_end;
>> + void *data = (void *)(long)ctx->data;
>> +
>> + /* Metadata will be in the perf event before the packet data. */
>> + struct S {
>> + u16 cookie;
>> + u16 pkt_len;
>> + } __attribute__((packed)) metadata;
>> +
>> + if (data + SAMPLE_SIZE < data_end) {
>> + /* The XDP perf_event_output handler will use the upper 32 bits
>> + * of the flags argument as a number of bytes to include of the
>> + * packet payload in the event data. If the size is too big, the
>> + * call to bpf_perf_event_output will fail and return -EFAULT.
>> + *
>> + * See bpf_xdp_event_output in net/core/filter.c.
>> + *
>> + * The BPF_F_CURRENT_CPU flag means that the event output fd
>> + * will be indexed by the CPU number in the event map.
>> + */
>> + u64 flags = (SAMPLE_SIZE << 32) | BPF_F_CURRENT_CPU;
>> + int ret;
>> +
>> + metadata.cookie = 0xdead;
>> + metadata.pkt_len = (u16)(data_end - data);
>> +
>> + ret = bpf_perf_event_output(ctx, &my_map, flags,
>> + &metadata, sizeof(metadata));
>> + if(ret)
>
> Please run checkpatch --strict on the samples.
Will do!
>> + bpf_printk("perf_event_output failed: %d\n", ret);
>> + }
>> +
>> + return XDP_PASS;
>> +}
>> +
>> +char _license[] SEC("license") = "GPL";
>> +u32 _version SEC("version") = LINUX_VERSION_CODE;
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox