* Re: [patch net-next 0/9] net: sched: introduce chain templates support with offloading to mlxsw
From: Jakub Kicinski @ 2018-06-26 4:58 UTC (permalink / raw)
To: Jiri Pirko
Cc: netdev, davem, jhs, xiyou.wangcong, simon.horman, john.hurley,
dsahern, mlxsw
In-Reply-To: <20180625210148.9386-1-jiri@resnulli.us>
On Mon, 25 Jun 2018 23:01:39 +0200, Jiri Pirko wrote:
> From: Jiri Pirko <jiri@mellanox.com>
>
> For the TC clsact offload these days, some of HW drivers need
> to hold a magic ball. The reason is, with the first inserted rule inside
> HW they need to guess what fields will be used for the matching. If
> later on this guess proves to be wrong and user adds a filter with a
> different field to match, there's a problem. Mlxsw resolves it now with
> couple of patterns. Those try to cover as many match fields as possible.
> This aproach is far from optimal, both performance-wise and scale-wise.
> Also, there is a combination of filters that in certain order won't
> succeed.
>
> Most of the time, when user inserts filters in chain, he knows right away
> how the filters are going to look like - what type and option will they
> have. For example, he knows that he will only insert filters of type
> flower matching destination IP address. He can specify a template that
> would cover all the filters in the chain.
Perhaps it's lack of sleep, but this paragraph threw me a little off
the track. IIUC the goal of this set is to provide a way to inform the
HW about expected matches before any rule is programmed into the HW.
Not before any rule is added to a particular chain. One can just use
the first rule in the chain to make a guess about the chain, but thanks
to this set user can configure *all* chains before any rules are added.
And that's needed because once any rule is added the tcam config can no
longer be easily modified?
^ permalink raw reply
* Re: [patch net-next 6/9] net: sched: cls_flower: propagate chain teplate creation and destruction to drivers
From: Jakub Kicinski @ 2018-06-26 5:00 UTC (permalink / raw)
To: Jiri Pirko
Cc: netdev, davem, jhs, xiyou.wangcong, simon.horman, john.hurley,
dsahern, mlxsw
In-Reply-To: <20180625210148.9386-7-jiri@resnulli.us>
On Mon, 25 Jun 2018 23:01:45 +0200, Jiri Pirko wrote:
> From: Jiri Pirko <jiri@mellanox.com>
>
> Introduce a couple of flower offload commands in order to propagate
> template creation/destruction events down to device drivers.
> Drivers may use this information to prepare HW in an optimal way
> for future filter insertions.
>
> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
> diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
> index d64d43843a3a..276ba25a09c3 100644
> --- a/net/sched/cls_flower.c
> +++ b/net/sched/cls_flower.c
> @@ -1120,6 +1120,43 @@ static void fl_walk(struct tcf_proto *tp, struct tcf_walker *arg)
> }
> }
>
> +static void fl_hw_create_tmplt(struct tcf_chain *chain,
> + struct fl_flow_tmplt *tmplt,
> + struct netlink_ext_ack *extack)
> +{
> + struct tc_cls_flower_offload cls_flower = {};
> + struct tcf_block *block = chain->block;
> + struct tcf_exts dummy_exts = { 0, };
> +
> + cls_flower.common.chain_index = chain->index;
Did you skip extack on purpose?
> + cls_flower.command = TC_CLSFLOWER_TMPLT_CREATE;
> + cls_flower.cookie = (unsigned long) tmplt;
> + cls_flower.dissector = &tmplt->dissector;
> + cls_flower.mask = &tmplt->mask;
> + cls_flower.key = &tmplt->dummy_key;
> + cls_flower.exts = &dummy_exts;
> +
> + /* We don't care if driver (any of them) fails to handle this
> + * call. It serves just as a hint for it.
> + */
> + tc_setup_cb_call(block, NULL, TC_SETUP_CLSFLOWER,
> + &cls_flower, false);
> +}
^ permalink raw reply
* Re: INFO: rcu detected stall in vprintk_emit
From: Dmitry Vyukov @ 2018-06-26 5:03 UTC (permalink / raw)
To: Steven Rostedt
Cc: Sergey Senozhatsky, syzbot, LKML, Petr Mladek, Sergey Senozhatsky,
syzkaller-bugs, Samuel Ortiz, David S. Miller, linux-wireless,
netdev
In-Reply-To: <20180625225937.43aee76c@vmware.local.home>
On Tue, Jun 26, 2018 at 4:59 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Tue, 26 Jun 2018 10:49:24 +0900
> Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> wrote:
>
>> So we can try switching to ratelimited error reporting
>> [that would be option A]:
>>
>> ---
>>
>> diff --git a/net/nfc/llcp_commands.c b/net/nfc/llcp_commands.c
>> index 2ceefa183cee..2f3becb709b8 100644
>> --- a/net/nfc/llcp_commands.c
>> +++ b/net/nfc/llcp_commands.c
>> @@ -755,7 +755,7 @@ int nfc_llcp_send_ui_frame(struct nfc_llcp_sock *sock, u8 ssap, u8 dsap,
>> pdu = nfc_alloc_send_skb(sock->dev, &sock->sk, MSG_DONTWAIT,
>> frag_len + LLCP_HEADER_SIZE, &err);
>> if (pdu == NULL) {
>> - pr_err("Could not allocate PDU\n");
>> + pr_err_ratelimited("Could not allocate PDU\n");
>> continue;
>> }
>>
>> ---
>>
>>
>> Or ratelimited error reporting and cond_resched()
>> [that would be option B]:
>
> I don't think this is a printk() issue per se, so I think Option B is
> the only option. You should not get stuck in an infinite loop if we run
> short on memory. Perhaps we could have an Option C which would exit
> this loop gracefully with some kind of error. But I haven't looked at
> the surrounding code to be sure if that is possible.
I suspect this is not even OOM. This is probably some persistent
logical condition that makes the allocation function fail, either we
ask for too much, or socket in some bad state. Potentially this is
even triggerable remotely because there are some remove variables
involved in size calculation.
^ permalink raw reply
* WARNING in sctp_assoc_update_frag_point
From: syzbot @ 2018-06-26 5:06 UTC (permalink / raw)
To: davem, linux-kernel, linux-sctp, marcelo.leitner, netdev, nhorman,
syzkaller-bugs, vyasevich
Hello,
syzbot found the following crash on:
HEAD commit: 6f0d349d922b Merge git://git.kernel.org/pub/scm/linux/kern..
git tree: upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=12a423c0400000
kernel config: https://syzkaller.appspot.com/x/.config?x=a63be0c83e84d370
dashboard link: https://syzkaller.appspot.com/bug?extid=f0d9d7cba052f9344b03
compiler: gcc (GCC) 8.0.1 20180413 (experimental)
Unfortunately, I don't have any reproducer for this crash yet.
IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+f0d9d7cba052f9344b03@syzkaller.appspotmail.com
WARNING: CPU: 0 PID: 22543 at include/net/sctp/sctp.h:598 sctp_mtu_payload
include/net/sctp/sctp.h:598 [inline]
WARNING: CPU: 0 PID: 22543 at include/net/sctp/sctp.h:598
sctp_assoc_update_frag_point+0x252/0x2c0 net/sctp/associola.c:1401
Kernel panic - not syncing: panic_on_warn set ...
CPU: 0 PID: 22543 Comm: syz-executor2 Not tainted 4.18.0-rc2+ #117
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113
panic+0x238/0x4e7 kernel/panic.c:184
__warn.cold.8+0x163/0x1ba kernel/panic.c:536
report_bug+0x252/0x2d0 lib/bug.c:186
fixup_bug arch/x86/kernel/traps.c:178 [inline]
do_error_trap+0x1fc/0x4d0 arch/x86/kernel/traps.c:296
do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:316
invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:992
RIP: 0010:sctp_mtu_payload include/net/sctp/sctp.h:598 [inline]
RIP: 0010:sctp_assoc_update_frag_point+0x252/0x2c0 net/sctp/associola.c:1401
Code: 76 fa 45 39 e5 76 1e e8 0c 69 76 fa 45 29 e5 45 89 ec e9 34 ff ff ff
e8 fc 68 76 fa 45 8d 66 34 e9 09 ff ff ff e8 ee 68 76 fa <0f> 0b 45 31 e4
e9 17 ff ff ff e8 7f 3c b4 fa e9 31 fe ff ff 4c 89
RSP: 0018:ffff8801d7def378 EFLAGS: 00010216
RAX: 0000000000040000 RBX: ffff8801d8580ac0 RCX: ffffc900133ca000
RDX: 00000000000001b9 RSI: ffffffff8705a382 RDI: 0000000000000004
RBP: ffff8801d7def3a0 R08: ffff8801cfaa6000 R09: ffffed002e0421af
R10: ffffed002e0421af R11: ffff880170210d7f R12: 0000000000000044
R13: 0000000000000044 R14: 0000000000000010 R15: ffff8801d8580ac0
sctp_assoc_set_pmtu net/sctp/associola.c:1417 [inline]
sctp_assoc_sync_pmtu+0x251/0x2e0 net/sctp/associola.c:1445
sctp_sendmsg_to_asoc+0x1851/0x21a0 net/sctp/socket.c:1921
sctp_sendmsg+0x13c2/0x1d90 net/sctp/socket.c:2124
inet_sendmsg+0x1a1/0x690 net/ipv4/af_inet.c:798
sock_sendmsg_nosec net/socket.c:645 [inline]
sock_sendmsg+0xd5/0x120 net/socket.c:655
___sys_sendmsg+0x51d/0x930 net/socket.c:2161
__sys_sendmmsg+0x240/0x6f0 net/socket.c:2256
__do_sys_sendmmsg net/socket.c:2285 [inline]
__se_sys_sendmmsg net/socket.c:2282 [inline]
__x64_sys_sendmmsg+0x9d/0x100 net/socket.c:2282
do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x455a99
Code: 1d ba fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7
48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff
ff 0f 83 eb b9 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f75b488ec68 EFLAGS: 00000246 ORIG_RAX: 0000000000000133
RAX: ffffffffffffffda RBX: 00007f75b488f6d4 RCX: 0000000000455a99
RDX: 0000000000000001 RSI: 0000000020005900 RDI: 0000000000000014
RBP: 000000000072bff0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
R13: 00000000004c0b1b R14: 00000000004d0938 R15: 0000000000000002
Dumping ftrace buffer:
(ftrace buffer empty)
Kernel Offset: disabled
Rebooting in 86400 seconds..
---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.
syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#bug-status-tracking for how to communicate with
syzbot.
^ permalink raw reply
* Re: INFO: rcu detected stall in vprintk_emit
From: Sergey Senozhatsky @ 2018-06-26 5:07 UTC (permalink / raw)
To: Dmitry Vyukov
Cc: Steven Rostedt, Sergey Senozhatsky, syzbot, LKML, Petr Mladek,
Sergey Senozhatsky, syzkaller-bugs, Samuel Ortiz, David S. Miller,
linux-wireless, netdev
In-Reply-To: <CACT4Y+YWFhUk9cdpSR5+yR3+D6mNFUz2WwRcHULCNpj0-4qXyg@mail.gmail.com>
On (06/26/18 07:03), Dmitry Vyukov wrote:
> > I don't think this is a printk() issue per se, so I think Option B is
> > the only option. You should not get stuck in an infinite loop if we run
> > short on memory. Perhaps we could have an Option C which would exit
> > this loop gracefully with some kind of error. But I haven't looked at
> > the surrounding code to be sure if that is possible.
>
> I suspect this is not even OOM.
Could be. But at the same time it stalls RCU, so OOM at some point
becomes realistic.
-ss
^ permalink raw reply
* Re: [PATCH] NFC: llcp: fix nfc_llcp_send_ui_frame() lockup
From: Dmitry Vyukov @ 2018-06-26 5:07 UTC (permalink / raw)
To: Sergey Senozhatsky
Cc: Samuel Ortiz, David S. Miller, Steven Rostedt, Petr Mladek,
syzkaller-bugs, linux-wireless, netdev, LKML, syzbot,
Sergey Senozhatsky
In-Reply-To: <20180626044119.30118-1-sergey.senozhatsky@gmail.com>
On Tue, Jun 26, 2018 at 6:41 AM, Sergey Senozhatsky
<sergey.senozhatsky.work@gmail.com> wrote:
> syzbot reported the following nfc_llcp_send_ui_frame() lockup:
>
> The kernel is CONFIG_PREEMPT_VOLUNTARY=y, llcp_sock_sendmsg() stuck
> in an infinite error reporting loop, because the system is low memory
> and MSG_DONTWAIT nfc_alloc_send_skb() allocations fail:
>
> do {
> ...
> pdu = nfc_alloc_send_skb(sock->dev, &sock->sk, MSG_DONTWAIT,
> frag_len + LLCP_HEADER_SIZE, &err);
> if (pdu == NULL) {
> pr_err("Could not allocate PDU\n");
> continue;
> }
> ...
> } while (remaining_len > 0);
>
> nfc_llcp_send_ui_frame() spent enough time (94+ sec) trying to
> allocate PDU, which resulted in RCU stall due to PREEMPT_VOLUNTARY:
>
> llcp: nfc_llcp_send_ui_frame: Could not allocate PDU
> llcp: nfc_llcp_send_ui_frame: Could not allocate PDU
> ...
> llcp: nfc_llcp_send_ui_frame: Could not allocate PDU
> llcp: nfc_llcp_send_ui_frame: Could not allocate PDU
> INFO: rcu_sched self-detected stall on CPU
> 1-....: (20918 ticks this GP) idle=55a/1/4611686018427387906
> softirq=11347/11347 fqs=20240
> (t=125005 jiffies g=5572 c=5571 q=149)
> NMI backtrace for cpu 1
> CPU: 1 PID: 4811 Comm: syz-executor0 Not tainted 4.18.0-rc1+ #115
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google
> 01/01/2011
> Call Trace:
> <IRQ>
> __dump_stack lib/dump_stack.c:77 [inline]
> dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113
> nmi_cpu_backtrace.cold.4+0x19/0xce lib/nmi_backtrace.c:103
> nmi_trigger_cpumask_backtrace+0x151/0x192 lib/nmi_backtrace.c:62
> arch_trigger_cpumask_backtrace+0x14/0x20 arch/x86/kernel/apic/hw_nmi.c:38
> trigger_single_cpu_backtrace include/linux/nmi.h:156 [inline]
> rcu_dump_cpu_stacks+0x175/0x1c2 kernel/rcu/tree.c:1336
> print_cpu_stall kernel/rcu/tree.c:1485 [inline]
> check_cpu_stall.isra.60.cold.78+0x36c/0x5a6 kernel/rcu/tree.c:1553
> __rcu_pending kernel/rcu/tree.c:3244 [inline]
> rcu_pending kernel/rcu/tree.c:3291 [inline]
> rcu_check_callbacks+0x23f/0xcd0 kernel/rcu/tree.c:2646
> update_process_times+0x2d/0x70 kernel/time/timer.c:1636
> tick_sched_handle+0x9f/0x180 kernel/time/tick-sched.c:164
> tick_sched_timer+0x45/0x130 kernel/time/tick-sched.c:1274
> __run_hrtimer kernel/time/hrtimer.c:1398 [inline]
> __hrtimer_run_queues+0x3eb/0x10c0 kernel/time/hrtimer.c:1460
> hrtimer_interrupt+0x2f3/0x750 kernel/time/hrtimer.c:1518
> local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1025 [inline]
> smp_apic_timer_interrupt+0x165/0x730 arch/x86/kernel/apic/apic.c:1050
> apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:863
> </IRQ>
> RIP: 0010:arch_local_irq_restore arch/x86/include/asm/paravirt.h:783 [inline]
> RIP: 0010:console_unlock+0xc84/0x10b0 kernel/printk/printk.c:2397
> Code: c1 e8 03 42 80 3c 38 00 0f 85 bd 03 00 00 48 83 3d 38 f7 8e 07 00 0f 84
> 69 02 00 00 e8 45 56 19 00 48 8b bd b0 fe ff ff 57 9d <0f> 1f 44 00 00 e9 96
> f5 ff ff e8 2d 56 19 00 48 8b 7d 08 e8 94 cf
> RSP: 0018:ffff8801aab0f358 EFLAGS: 00000293 ORIG_RAX: ffffffffffffff13
> RAX: ffff8801aa2802c0 RBX: 0000000000000200 RCX: 1ffff10035450163
> RDX: 0000000000000000 RSI: ffffffff8162b8fb RDI: 0000000000000293
> RBP: ffff8801aab0f4c0 R08: ffff8801aa280af8 R09: 0000000000000006
> R10: ffff8801aa2802c0 R11: 0000000000000000 R12: 0000000000000000
> R13: ffffffff84ea9880 R14: 0000000000000001 R15: dffffc0000000000
> vprintk_emit+0x6c6/0xdf0 kernel/printk/printk.c:1907
> vprintk_default+0x28/0x30 kernel/printk/printk.c:1948
> vprintk_func+0x7a/0xe7 kernel/printk/printk_safe.c:382
> printk+0xa7/0xcf kernel/printk/printk.c:1981
> nfc_llcp_send_ui_frame.cold.9+0x18/0x1f net/nfc/llcp_commands.c:758
> llcp_sock_sendmsg+0x278/0x350 net/nfc/llcp_sock.c:786
> sock_sendmsg_nosec net/socket.c:645 [inline]
> sock_sendmsg+0xd5/0x120 net/socket.c:655
> ___sys_sendmsg+0x51d/0x930 net/socket.c:2161
> __sys_sendmmsg+0x240/0x6f0 net/socket.c:2256
> __do_sys_sendmmsg net/socket.c:2285 [inline]
> __se_sys_sendmmsg net/socket.c:2282 [inline]
> __x64_sys_sendmmsg+0x9d/0x100 net/socket.c:2282
> do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
> entry_SYSCALL_64_after_hwframe+0x49/0xbe
>
> Address the issues by rate limiting nfc_alloc_send_skb() allocation
> error, to avoid logbuf pollution, and do cond_resched() before llcp
> attempts to allocate PDU again.
>
> Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
> Reported-by: syzbot+d29d18215e477cfbfbdd@syzkaller.appspotmail.com
> ---
> net/nfc/llcp_commands.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/net/nfc/llcp_commands.c b/net/nfc/llcp_commands.c
> index 2ceefa183cee..e19fadaa9022 100644
> --- a/net/nfc/llcp_commands.c
> +++ b/net/nfc/llcp_commands.c
> @@ -20,6 +20,7 @@
> #include <linux/init.h>
> #include <linux/kernel.h>
> #include <linux/module.h>
> +#include <linux/sched.h>
> #include <linux/nfc.h>
>
> #include <net/nfc/nfc.h>
> @@ -755,7 +756,8 @@ int nfc_llcp_send_ui_frame(struct nfc_llcp_sock *sock, u8 ssap, u8 dsap,
> pdu = nfc_alloc_send_skb(sock->dev, &sock->sk, MSG_DONTWAIT,
> frag_len + LLCP_HEADER_SIZE, &err);
> if (pdu == NULL) {
> - pr_err("Could not allocate PDU\n");
> + pr_err_ratelimited("Could not allocate PDU\n");
> + cond_resched();
> continue;
> }
But this thread is still in an infinite (unkillable?) loop? If yes, we
are waiting for the next syzbot report ;)
I suspect this is not OOM. This is probably some persistent
logical condition that makes the allocation function fail, either we
ask for too much, or socket in some bad state. Potentially this is
even triggerable remotely because there are some remove variables
involved in size calculation.
^ permalink raw reply
* Re: [PATCH] NFC: llcp: fix nfc_llcp_send_ui_frame() lockup
From: Sergey Senozhatsky @ 2018-06-26 5:12 UTC (permalink / raw)
To: Dmitry Vyukov
Cc: Sergey Senozhatsky, Samuel Ortiz, David S. Miller, Steven Rostedt,
Petr Mladek, syzkaller-bugs, linux-wireless, netdev, LKML, syzbot,
Sergey Senozhatsky
In-Reply-To: <CACT4Y+ZvLqzEwXgdB25MrmAS_j2_HsS8NFaJaMojzMV3prXypg@mail.gmail.com>
On (06/26/18 07:07), Dmitry Vyukov wrote:
[..]
> > #include <net/nfc/nfc.h>
> > @@ -755,7 +756,8 @@ int nfc_llcp_send_ui_frame(struct nfc_llcp_sock *sock, u8 ssap, u8 dsap,
> > pdu = nfc_alloc_send_skb(sock->dev, &sock->sk, MSG_DONTWAIT,
> > frag_len + LLCP_HEADER_SIZE, &err);
> > if (pdu == NULL) {
> > - pr_err("Could not allocate PDU\n");
> > + pr_err_ratelimited("Could not allocate PDU\n");
> > + cond_resched();
> > continue;
> > }
>
>
> But this thread is still in an infinite (unkillable?) loop? If yes, we
> are waiting for the next syzbot report ;)
The loop is still infinite, correct, but we have a preemption point now.
Sure, net people can come with a much better solution, I'll be happy to
scratch my patch.
-ss
^ permalink raw reply
* [PATCH net-next v2] net: vhost: improve performance when enable busyloop
From: xiangxia.m.yue @ 2018-06-26 5:17 UTC (permalink / raw)
To: jasowang; +Cc: netdev, Tonghao Zhang, virtualization
From: Tonghao Zhang <xiangxia.m.yue@gmail.com>
This patch improves the guest receive performance from
host. On the handle_tx side, we poll the sock receive
queue at the same time. handle_rx do that in the same way.
For avoiding deadlock, change the code to lock the vq one
by one and use the VHOST_NET_VQ_XX as a subclass for
mutex_lock_nested. With the patch, qemu can set differently
the busyloop_timeout for rx or tx queue.
We set the poll-us=100us and use the iperf3 to test
its throughput. The iperf3 command is shown as below.
on the guest:
iperf3 -s -D
on the host:
iperf3 -c 192.168.1.100 -i 1 -P 10 -t 10 -M 1400
* With the patch: 23.1 Gbits/sec
* Without the patch: 12.7 Gbits/sec
Signed-off-by: Tonghao Zhang <zhangtonghao@didichuxing.com>
---
drivers/vhost/net.c | 106 +++++++++++++++++++++++++++-----------------------
drivers/vhost/vhost.c | 24 ++++--------
2 files changed, 66 insertions(+), 64 deletions(-)
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index e7cf7d2..38e9adb 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -429,22 +429,62 @@ static int vhost_net_enable_vq(struct vhost_net *n,
return vhost_poll_start(poll, sock->file);
}
+static int sk_has_rx_data(struct sock *sk)
+{
+ struct socket *sock = sk->sk_socket;
+
+ if (sock->ops->peek_len)
+ return sock->ops->peek_len(sock);
+
+ return skb_queue_empty(&sk->sk_receive_queue);
+}
+
+static void vhost_net_busy_poll(struct vhost_net *net,
+ struct vhost_virtqueue *rvq,
+ struct vhost_virtqueue *tvq,
+ bool rx)
+{
+ unsigned long uninitialized_var(endtime);
+ struct socket *sock = rvq->private_data;
+ struct vhost_virtqueue *vq = rx ? tvq : rvq;
+ unsigned long busyloop_timeout = rx ? rvq->busyloop_timeout :
+ tvq->busyloop_timeout;
+
+ mutex_lock_nested(&vq->mutex, rx ? VHOST_NET_VQ_TX: VHOST_NET_VQ_RX);
+ vhost_disable_notify(&net->dev, vq);
+
+ preempt_disable();
+ endtime = busy_clock() + busyloop_timeout;
+ while (vhost_can_busy_poll(tvq->dev, endtime) &&
+ !(sock && sk_has_rx_data(sock->sk)) &&
+ vhost_vq_avail_empty(tvq->dev, tvq))
+ cpu_relax();
+ preempt_enable();
+
+ if ((rx && !vhost_vq_avail_empty(&net->dev, vq)) ||
+ (!rx && (sock && sk_has_rx_data(sock->sk)))) {
+ vhost_poll_queue(&vq->poll);
+ } else if (unlikely(vhost_enable_notify(&net->dev, vq))) {
+ vhost_disable_notify(&net->dev, vq);
+ vhost_poll_queue(&vq->poll);
+ }
+
+ mutex_unlock(&vq->mutex);
+}
+
static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
struct vhost_virtqueue *vq,
struct iovec iov[], unsigned int iov_size,
unsigned int *out_num, unsigned int *in_num)
{
- unsigned long uninitialized_var(endtime);
+ struct vhost_net_virtqueue *nvq_rx = &net->vqs[VHOST_NET_VQ_RX];
+
int r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
out_num, in_num, NULL, NULL);
if (r == vq->num && vq->busyloop_timeout) {
- preempt_disable();
- endtime = busy_clock() + vq->busyloop_timeout;
- while (vhost_can_busy_poll(vq->dev, endtime) &&
- vhost_vq_avail_empty(vq->dev, vq))
- cpu_relax();
- preempt_enable();
+ vhost_net_busy_poll(net, &nvq_rx->vq, vq, false);
+
r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
out_num, in_num, NULL, NULL);
}
@@ -484,7 +524,7 @@ static void handle_tx(struct vhost_net *net)
bool zcopy, zcopy_used;
int sent_pkts = 0;
- mutex_lock(&vq->mutex);
+ mutex_lock_nested(&vq->mutex, VHOST_NET_VQ_TX);
sock = vq->private_data;
if (!sock)
goto out;
@@ -621,16 +661,6 @@ static int peek_head_len(struct vhost_net_virtqueue *rvq, struct sock *sk)
return len;
}
-static int sk_has_rx_data(struct sock *sk)
-{
- struct socket *sock = sk->sk_socket;
-
- if (sock->ops->peek_len)
- return sock->ops->peek_len(sock);
-
- return skb_queue_empty(&sk->sk_receive_queue);
-}
-
static void vhost_rx_signal_used(struct vhost_net_virtqueue *nvq)
{
struct vhost_virtqueue *vq = &nvq->vq;
@@ -645,39 +675,19 @@ static void vhost_rx_signal_used(struct vhost_net_virtqueue *nvq)
static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk)
{
- struct vhost_net_virtqueue *rvq = &net->vqs[VHOST_NET_VQ_RX];
- struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
- struct vhost_virtqueue *vq = &nvq->vq;
- unsigned long uninitialized_var(endtime);
- int len = peek_head_len(rvq, sk);
-
- if (!len && vq->busyloop_timeout) {
- /* Flush batched heads first */
- vhost_rx_signal_used(rvq);
- /* Both tx vq and rx socket were polled here */
- mutex_lock_nested(&vq->mutex, 1);
- vhost_disable_notify(&net->dev, vq);
-
- preempt_disable();
- endtime = busy_clock() + vq->busyloop_timeout;
-
- while (vhost_can_busy_poll(&net->dev, endtime) &&
- !sk_has_rx_data(sk) &&
- vhost_vq_avail_empty(&net->dev, vq))
- cpu_relax();
+ struct vhost_net_virtqueue *nvq_rx = &net->vqs[VHOST_NET_VQ_RX];
+ struct vhost_net_virtqueue *nvq_tx = &net->vqs[VHOST_NET_VQ_TX];
- preempt_enable();
+ int len = peek_head_len(nvq_rx, sk);
- if (!vhost_vq_avail_empty(&net->dev, vq))
- vhost_poll_queue(&vq->poll);
- else if (unlikely(vhost_enable_notify(&net->dev, vq))) {
- vhost_disable_notify(&net->dev, vq);
- vhost_poll_queue(&vq->poll);
- }
+ if (!len && nvq_rx->vq.busyloop_timeout) {
+ /* Flush batched heads first */
+ vhost_rx_signal_used(nvq_rx);
- mutex_unlock(&vq->mutex);
+ /* Both tx vq and rx socket were polled here */
+ vhost_net_busy_poll(net, &nvq_rx->vq, &nvq_tx->vq, true);
- len = peek_head_len(rvq, sk);
+ len = peek_head_len(nvq_rx, sk);
}
return len;
@@ -789,7 +799,7 @@ static void handle_rx(struct vhost_net *net)
__virtio16 num_buffers;
int recv_pkts = 0;
- mutex_lock_nested(&vq->mutex, 0);
+ mutex_lock_nested(&vq->mutex, VHOST_NET_VQ_RX);
sock = vq->private_data;
if (!sock)
goto out;
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 895eaa2..1716b10 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -294,8 +294,11 @@ static void vhost_vq_meta_reset(struct vhost_dev *d)
{
int i;
- for (i = 0; i < d->nvqs; ++i)
+ for (i = 0; i < d->nvqs; ++i) {
+ mutex_lock(&d->vqs[i]->mutex);
__vhost_vq_meta_reset(d->vqs[i]);
+ mutex_unlock(&d->vqs[i]->mutex);
+ }
}
static void vhost_vq_reset(struct vhost_dev *dev,
@@ -887,19 +890,6 @@ static inline void __user *__vhost_get_user(struct vhost_virtqueue *vq,
#define vhost_get_used(vq, x, ptr) \
vhost_get_user(vq, x, ptr, VHOST_ADDR_USED)
-static void vhost_dev_lock_vqs(struct vhost_dev *d)
-{
- int i = 0;
- for (i = 0; i < d->nvqs; ++i)
- mutex_lock_nested(&d->vqs[i]->mutex, i);
-}
-
-static void vhost_dev_unlock_vqs(struct vhost_dev *d)
-{
- int i = 0;
- for (i = 0; i < d->nvqs; ++i)
- mutex_unlock(&d->vqs[i]->mutex);
-}
static int vhost_new_umem_range(struct vhost_umem *umem,
u64 start, u64 size, u64 end,
@@ -950,7 +940,11 @@ static void vhost_iotlb_notify_vq(struct vhost_dev *d,
if (msg->iova <= vq_msg->iova &&
msg->iova + msg->size - 1 > vq_msg->iova &&
vq_msg->type == VHOST_IOTLB_MISS) {
+
+ mutex_lock(&node->vq->mutex);
vhost_poll_queue(&node->vq->poll);
+ mutex_unlock(&node->vq->mutex);
+
list_del(&node->node);
kfree(node);
}
@@ -982,7 +976,6 @@ static int vhost_process_iotlb_msg(struct vhost_dev *dev,
int ret = 0;
mutex_lock(&dev->mutex);
- vhost_dev_lock_vqs(dev);
switch (msg->type) {
case VHOST_IOTLB_UPDATE:
if (!dev->iotlb) {
@@ -1016,7 +1009,6 @@ static int vhost_process_iotlb_msg(struct vhost_dev *dev,
break;
}
- vhost_dev_unlock_vqs(dev);
mutex_unlock(&dev->mutex);
return ret;
--
1.8.3.1
^ permalink raw reply related
* WARNING: ODEBUG bug in sock_hash_free
From: syzbot @ 2018-06-26 5:30 UTC (permalink / raw)
To: ast, daniel, linux-kernel, netdev, syzkaller-bugs
Hello,
syzbot found the following crash on:
HEAD commit: f0dc7f9c6dd9 Merge git://git.kernel.org/pub/scm/linux/kern..
git tree: bpf-next
console output: https://syzkaller.appspot.com/x/log.txt?x=1725589f800000
kernel config: https://syzkaller.appspot.com/x/.config?x=fa9c20c48788d1c1
dashboard link: https://syzkaller.appspot.com/bug?extid=71aeaaf993d216185076
compiler: gcc (GCC) 8.0.1 20180413 (experimental)
Unfortunately, I don't have any reproducer for this crash yet.
IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+71aeaaf993d216185076@syzkaller.appspotmail.com
------------[ cut here ]------------
ODEBUG: free active (active state 1) object type: rcu_head hint:
(null)
WARNING: CPU: 1 PID: 4959 at lib/debugobjects.c:329
debug_print_object+0x16a/0x210 lib/debugobjects.c:326
Kernel panic - not syncing: panic_on_warn set ...
CPU: 1 PID: 4959 Comm: kworker/1:3 Not tainted 4.17.0+ #39
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Workqueue: events bpf_map_free_deferred
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x1b9/0x294 lib/dump_stack.c:113
panic+0x22f/0x4de kernel/panic.c:184
__warn.cold.8+0x163/0x1b3 kernel/panic.c:536
report_bug+0x252/0x2d0 lib/bug.c:186
fixup_bug arch/x86/kernel/traps.c:178 [inline]
do_error_trap+0x1fc/0x4d0 arch/x86/kernel/traps.c:296
do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:316
invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:992
RIP: 0010:debug_print_object+0x16a/0x210 lib/debugobjects.c:326
Code: 1a 88 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 92 00 00 00 48 8b 14 dd
60 75 1a 88 4c 89 f6 48 c7 c7 e0 6a 1a 88 e8 06 62 ec fd <0f> 0b 83 05 39
5b 44 06 01 48 83 c4 18 5b 41 5c 41 5d 41 5e 41 5f
RSP: 0018:ffff880198e47490 EFLAGS: 00010082
RAX: 0000000000000051 RBX: 0000000000000003 RCX: ffffffff81854ed8
RDX: 0000000000000000 RSI: ffffffff8161f371 RDI: 0000000000000001
RBP: ffff880198e474d0 R08: ffff8801d84b2240 R09: ffffed003b5e3ec2
R10: ffffed003b5e3ec2 R11: ffff8801daf1f617 R12: 0000000000000001
R13: ffffffff88f91d80 R14: ffffffff881a6f80 R15: 0000000000000000
__debug_check_no_obj_freed lib/debugobjects.c:783 [inline]
debug_check_no_obj_freed+0x3a6/0x584 lib/debugobjects.c:815
kfree+0xc7/0x260 mm/slab.c:3812
sock_hash_free+0x24e/0x6e0 kernel/bpf/sockmap.c:2093
bpf_map_free_deferred+0xba/0xf0 kernel/bpf/syscall.c:262
process_one_work+0xc64/0x1b70 kernel/workqueue.c:2153
worker_thread+0x181/0x13a0 kernel/workqueue.c:2296
kthread+0x345/0x410 kernel/kthread.c:240
ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412
======================================================
WARNING: possible circular locking dependency detected
4.17.0+ #39 Not tainted
------------------------------------------------------
kworker/1:3/4959 is trying to acquire lock:
00000000190110fa ((console_sem).lock){-...}, at: down_trylock+0x13/0x70
kernel/locking/semaphore.c:136
but task is already holding lock:
00000000af3150e8 (&obj_hash[i].lock){-.-.}, at: __debug_check_no_obj_freed
lib/debugobjects.c:774 [inline]
00000000af3150e8 (&obj_hash[i].lock){-.-.}, at:
debug_check_no_obj_freed+0x159/0x584 lib/debugobjects.c:815
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #3 (&obj_hash[i].lock){-.-.}:
__raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
_raw_spin_lock_irqsave+0x96/0xc0 kernel/locking/spinlock.c:152
__debug_object_init+0x11f/0x12c0 lib/debugobjects.c:381
debug_object_init+0x16/0x20 lib/debugobjects.c:429
debug_hrtimer_init kernel/time/hrtimer.c:410 [inline]
debug_init kernel/time/hrtimer.c:458 [inline]
hrtimer_init+0x8f/0x460 kernel/time/hrtimer.c:1308
init_dl_task_timer+0x1b/0x50 kernel/sched/deadline.c:1056
__sched_fork+0x2a8/0x570 kernel/sched/core.c:2184
init_idle+0x75/0x7a0 kernel/sched/core.c:5404
sched_init+0xbeb/0xd10 kernel/sched/core.c:6102
start_kernel+0x475/0x92d init/main.c:602
x86_64_start_reservations+0x29/0x2b arch/x86/kernel/head64.c:452
x86_64_start_kernel+0x76/0x79 arch/x86/kernel/head64.c:433
secondary_startup_64+0xa5/0xb0 arch/x86/kernel/head_64.S:242
-> #2 (&rq->lock){-.-.}:
__raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
_raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:144
rq_lock kernel/sched/sched.h:1805 [inline]
task_fork_fair+0x8a/0x660 kernel/sched/fair.c:9953
sched_fork+0x43e/0xb30 kernel/sched/core.c:2380
copy_process.part.38+0x1bf1/0x7180 kernel/fork.c:1765
copy_process kernel/fork.c:1608 [inline]
_do_fork+0x291/0x12a0 kernel/fork.c:2091
kernel_thread+0x34/0x40 kernel/fork.c:2150
rest_init+0x22/0xe4 init/main.c:408
start_kernel+0x906/0x92d init/main.c:738
x86_64_start_reservations+0x29/0x2b arch/x86/kernel/head64.c:452
x86_64_start_kernel+0x76/0x79 arch/x86/kernel/head64.c:433
secondary_startup_64+0xa5/0xb0 arch/x86/kernel/head_64.S:242
-> #1 (&p->pi_lock){-.-.}:
__raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
_raw_spin_lock_irqsave+0x96/0xc0 kernel/locking/spinlock.c:152
try_to_wake_up+0xca/0x1280 kernel/sched/core.c:1984
wake_up_process+0x10/0x20 kernel/sched/core.c:2147
__up.isra.1+0x1b8/0x290 kernel/locking/semaphore.c:262
up+0x12f/0x1b0 kernel/locking/semaphore.c:187
__up_console_sem+0xbe/0x1b0 kernel/printk/printk.c:242
console_unlock+0x79a/0x10a0 kernel/printk/printk.c:2411
vprintk_emit+0x6b2/0xde0 kernel/printk/printk.c:1907
vprintk_default+0x28/0x30 kernel/printk/printk.c:1948
vprintk_func+0x7a/0xe7 kernel/printk/printk_safe.c:382
printk+0x9e/0xba kernel/printk/printk.c:1981
load_umh+0x51/0xbd net/bpfilter/bpfilter_kern.c:99
do_one_initcall+0x127/0x913 init/main.c:884
do_initcall_level init/main.c:952 [inline]
do_initcalls init/main.c:960 [inline]
do_basic_setup init/main.c:978 [inline]
kernel_init_freeable+0x49b/0x58e init/main.c:1135
kernel_init+0x11/0x1b3 init/main.c:1061
ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412
-> #0 ((console_sem).lock){-...}:
lock_acquire+0x1dc/0x520 kernel/locking/lockdep.c:3924
__raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
_raw_spin_lock_irqsave+0x96/0xc0 kernel/locking/spinlock.c:152
down_trylock+0x13/0x70 kernel/locking/semaphore.c:136
__down_trylock_console_sem+0xae/0x200 kernel/printk/printk.c:225
console_trylock+0x15/0xa0 kernel/printk/printk.c:2230
console_trylock_spinning kernel/printk/printk.c:1643 [inline]
vprintk_emit+0x699/0xde0 kernel/printk/printk.c:1906
vprintk_default+0x28/0x30 kernel/printk/printk.c:1948
vprintk_func+0x7a/0xe7 kernel/printk/printk_safe.c:382
printk+0x9e/0xba kernel/printk/printk.c:1981
__warn_printk+0x83/0xd0 kernel/panic.c:590
debug_print_object+0x16a/0x210 lib/debugobjects.c:326
__debug_check_no_obj_freed lib/debugobjects.c:783 [inline]
debug_check_no_obj_freed+0x3a6/0x584 lib/debugobjects.c:815
kfree+0xc7/0x260 mm/slab.c:3812
sock_hash_free+0x24e/0x6e0 kernel/bpf/sockmap.c:2093
bpf_map_free_deferred+0xba/0xf0 kernel/bpf/syscall.c:262
process_one_work+0xc64/0x1b70 kernel/workqueue.c:2153
worker_thread+0x181/0x13a0 kernel/workqueue.c:2296
kthread+0x345/0x410 kernel/kthread.c:240
ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412
other info that might help us debug this:
Chain exists of:
(console_sem).lock --> &rq->lock --> &obj_hash[i].lock
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&obj_hash[i].lock);
lock(&rq->lock);
lock(&obj_hash[i].lock);
lock((console_sem).lock);
*** DEADLOCK ***
4 locks held by kworker/1:3/4959:
#0: 00000000f67deee4 ((wq_completion)"events"){+.+.}, at:
__write_once_size include/linux/compiler.h:215 [inline]
#0: 00000000f67deee4 ((wq_completion)"events"){+.+.}, at:
arch_atomic64_set arch/x86/include/asm/atomic64_64.h:34 [inline]
#0: 00000000f67deee4 ((wq_completion)"events"){+.+.}, at: atomic64_set
include/asm-generic/atomic-instrumented.h:40 [inline]
#0: 00000000f67deee4 ((wq_completion)"events"){+.+.}, at: atomic_long_set
include/asm-generic/atomic-long.h:59 [inline]
#0: 00000000f67deee4 ((wq_completion)"events"){+.+.}, at: set_work_data
kernel/workqueue.c:617 [inline]
#0: 00000000f67deee4 ((wq_completion)"events"){+.+.}, at:
set_work_pool_and_clear_pending kernel/workqueue.c:644 [inline]
#0: 00000000f67deee4 ((wq_completion)"events"){+.+.}, at:
process_one_work+0xb35/0x1b70 kernel/workqueue.c:2124
#1: 00000000776b40d0 ((work_completion)(&map->work)){+.+.}, at:
process_one_work+0xb8c/0x1b70 kernel/workqueue.c:2128
#2: 000000002a359661 (rcu_read_lock){....}, at: sock_hash_free+0x0/0x6e0
include/net/sock.h:2176
#3: 00000000af3150e8 (&obj_hash[i].lock){-.-.}, at:
__debug_check_no_obj_freed lib/debugobjects.c:774 [inline]
#3: 00000000af3150e8 (&obj_hash[i].lock){-.-.}, at:
debug_check_no_obj_freed+0x159/0x584 lib/debugobjects.c:815
stack backtrace:
CPU: 1 PID: 4959 Comm: kworker/1:3 Not tainted 4.17.0+ #39
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Workqueue: events bpf_map_free_deferred
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x1b9/0x294 lib/dump_stack.c:113
print_circular_bug.isra.36.cold.56+0x1bd/0x27d
kernel/locking/lockdep.c:1227
check_prev_add kernel/locking/lockdep.c:1867 [inline]
check_prevs_add kernel/locking/lockdep.c:1980 [inline]
validate_chain kernel/locking/lockdep.c:2421 [inline]
__lock_acquire+0x343e/0x5140 kernel/locking/lockdep.c:3435
lock_acquire+0x1dc/0x520 kernel/locking/lockdep.c:3924
__raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
_raw_spin_lock_irqsave+0x96/0xc0 kernel/locking/spinlock.c:152
down_trylock+0x13/0x70 kernel/locking/semaphore.c:136
__down_trylock_console_sem+0xae/0x200 kernel/printk/printk.c:225
console_trylock+0x15/0xa0 kernel/printk/printk.c:2230
console_trylock_spinning kernel/printk/printk.c:1643 [inline]
vprintk_emit+0x699/0xde0 kernel/printk/printk.c:1906
vprintk_default+0x28/0x30 kernel/printk/printk.c:1948
vprintk_func+0x7a/0xe7 kernel/printk/printk_safe.c:382
printk+0x9e/0xba kernel/printk/printk.c:1981
__warn_printk+0x83/0xd0 kernel/panic.c:590
debug_print_object+0x16a/0x210 lib/debugobjects.c:326
__debug_check_no_obj_freed lib/debugobjects.c:783 [inline]
debug_check_no_obj_freed+0x3a6/0x584 lib/debugobjects.c:815
kfree+0xc7/0x260 mm/slab.c:3812
sock_hash_free+0x24e/0x6e0 kernel/bpf/sockmap.c:2093
bpf_map_free_deferred+0xba/0xf0 kernel/bpf/syscall.c:262
process_one_work+0xc64/0x1b70 kernel/workqueue.c:2153
worker_thread+0x181/0x13a0 kernel/workqueue.c:2296
kthread+0x345/0x410 kernel/kthread.c:240
ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412
Shutting down cpus with NMI
Dumping ftrace buffer:
(ftrace buffer empty)
Kernel Offset: disabled
Rebooting in 86400 seconds..
---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.
syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#bug-status-tracking for how to communicate with
syzbot.
^ permalink raw reply
* Re: [PATCH net-next 2/3] rds: Enable RDS IPv6 support
From: Ka-Cheong Poon @ 2018-06-26 5:30 UTC (permalink / raw)
To: Sowmini Varadhan; +Cc: netdev, santosh.shilimkar, davem, rds-devel
In-Reply-To: <20180625175006.GI14823@oracle.com>
On 06/26/2018 01:50 AM, Sowmini Varadhan wrote:
>> If a socket is bound, I guess the scope_id should be used. So
>> if a socket is not bound to a link local address and the socket
>> is used to sent to a link local peer, it should fail.
>
> PF_RDS sockets *MUST* alwasy be bound. See
> Documentation/networking/rds.txt:
> " Sockets must be bound before you can send or receive data.
> This is needed because binding also selects a transport and
> attaches it to the socket. Once bound, the transport assignment
> does not change."
>
> Also, rds_sendmsg checks this (from net-next, your version
> has the equivalent ipv6_addr_any etc check):
>
> if (daddr == 0 || rs->rs_bound_addr == 0) {
> release_sock(sk);
> ret = -ENOTCONN; /* XXX not a great errno */
> goto out;
> }
I think you misunderstood what I wrote. The above is in response
to your original question:
--
> And this is even more confusing because the fastpath in rds_sendmsg
> does not take the bound_scope_id into consideration at all:
> 1213 if (rs->rs_conn &&
ipv6_addr_equal(&rs->rs_conn->c_faddr, &daddr))
> 1214 conn = rs->rs_conn;
> 1215 else {
> 1216 conn = rds_conn_create_outgoing( /* .. */, scope_id)
> so if I erroneously passed a msg_name on a connected rds socket, what
> would happen? (see also question about rds_connect() itself, below)
^ permalink raw reply
* Re: [PATCH v2] selftests: bpf: notification about privilege required to run test_lwt_seg6local.sh testing script
From: Song Liu @ 2018-06-26 5:33 UTC (permalink / raw)
To: Jeffrin Jose T
Cc: Alexei Starovoitov, Daniel Borkmann, shuah, Networking, open list,
linux-kselftest
In-Reply-To: <20180622214032.6527-1-ahiliation@gmail.com>
On Fri, Jun 22, 2018 at 2:40 PM, Jeffrin Jose T <ahiliation@gmail.com> wrote:
> This test needs root privilege for it's successful execution.
>
> This patch is atleast used to notify the user about the privilege
> the script demands for the smooth execution of the test.
>
> Signed-off-by: Jeffrin Jose T (Rajagiri SET) <ahiliation@gmail.com>
Acked-by: Song Liu <songliubraving@fb.com>
> ---
> tools/testing/selftests/bpf/test_lwt_seg6local.sh | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
> diff --git a/tools/testing/selftests/bpf/test_lwt_seg6local.sh b/tools/testing/selftests/bpf/test_lwt_seg6local.sh
> index 1c77994b5e71..30575577a8b2 100755
> --- a/tools/testing/selftests/bpf/test_lwt_seg6local.sh
> +++ b/tools/testing/selftests/bpf/test_lwt_seg6local.sh
> @@ -21,6 +21,15 @@
> # An UDP datagram is sent from fb00::1 to fb00::6. The test succeeds if this
> # datagram can be read on NS6 when binding to fb00::6.
>
> +# Kselftest framework requirement - SKIP code is 4.
> +ksft_skip=4
> +
> +msg="skip all tests:"
> +if [ $UID != 0 ]; then
> + echo $msg please run this as root >&2
> + exit $ksft_skip
> +fi
> +
> TMP_FILE="/tmp/selftest_lwt_seg6local.txt"
>
> cleanup()
> --
> 2.17.0
>
^ permalink raw reply
* Re: [PATCH bpf-next 1/7] nfp: bpf: allow source ptr type be map ptr in memcpy optimization
From: Song Liu @ 2018-06-26 5:50 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Alexei Starovoitov, Daniel Borkmann, oss-drivers, Networking,
Jiong Wang
In-Reply-To: <20180625035421.2991-2-jakub.kicinski@netronome.com>
On Sun, Jun 24, 2018 at 8:54 PM, Jakub Kicinski
<jakub.kicinski@netronome.com> wrote:
> From: Jiong Wang <jiong.wang@netronome.com>
>
> Map read has been supported on NFP, this patch enables optimization for
> memcpy from map to packet.
>
> This patch also fixed one latent bug which will cause copying from
> unexpected address once memcpy for map pointer enabled.
>
> Reported-by: Mary Pham <mary.pham@netronome.com>
> Reported-by: David Beckett <david.beckett@netronome.com>
> Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> ---
> drivers/net/ethernet/netronome/nfp/bpf/jit.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/netronome/nfp/bpf/jit.c b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
> index 8a92088df0d7..33111739b210 100644
> --- a/drivers/net/ethernet/netronome/nfp/bpf/jit.c
> +++ b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
> @@ -670,7 +670,7 @@ static int nfp_cpp_memcpy(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
> xfer_num = round_up(len, 4) / 4;
>
> if (src_40bit_addr)
> - addr40_offset(nfp_prog, meta->insn.src_reg, off, &src_base,
> + addr40_offset(nfp_prog, meta->insn.src_reg * 2, off, &src_base,
> &off);
Did this break other cases before this patch?
I am sorry if this is a dumb question. I don't think I fully
understand addr40_offset().
Song
>
> /* Setup PREV_ALU fields to override memory read length. */
> @@ -3299,7 +3299,8 @@ curr_pair_is_memcpy(struct nfp_insn_meta *ld_meta,
> if (!is_mbpf_load(ld_meta) || !is_mbpf_store(st_meta))
> return false;
>
> - if (ld_meta->ptr.type != PTR_TO_PACKET)
> + if (ld_meta->ptr.type != PTR_TO_PACKET &&
> + ld_meta->ptr.type != PTR_TO_MAP_VALUE)
> return false;
>
> if (st_meta->ptr.type != PTR_TO_PACKET)
> --
> 2.17.1
>
^ permalink raw reply
* Re: WARNING in sctp_assoc_update_frag_point
From: Xin Long @ 2018-06-26 6:04 UTC (permalink / raw)
To: syzbot
Cc: davem, LKML, linux-sctp, Marcelo Ricardo Leitner, network dev,
Neil Horman, syzkaller-bugs, Vlad Yasevich
In-Reply-To: <0000000000005d2af6056f8474ac@google.com>
On Tue, Jun 26, 2018 at 1:06 PM, syzbot
<syzbot+f0d9d7cba052f9344b03@syzkaller.appspotmail.com> wrote:
> Hello,
>
> syzbot found the following crash on:
>
> HEAD commit: 6f0d349d922b Merge git://git.kernel.org/pub/scm/linux/kern..
> git tree: upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=12a423c0400000
> kernel config: https://syzkaller.appspot.com/x/.config?x=a63be0c83e84d370
> dashboard link: https://syzkaller.appspot.com/bug?extid=f0d9d7cba052f9344b03
> compiler: gcc (GCC) 8.0.1 20180413 (experimental)
>
> Unfortunately, I don't have any reproducer for this crash yet.
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+f0d9d7cba052f9344b03@syzkaller.appspotmail.com
>
> WARNING: CPU: 0 PID: 22543 at include/net/sctp/sctp.h:598 sctp_mtu_payload
> include/net/sctp/sctp.h:598 [inline]
> WARNING: CPU: 0 PID: 22543 at include/net/sctp/sctp.h:598
> sctp_assoc_update_frag_point+0x252/0x2c0 net/sctp/associola.c:1401
> Kernel panic - not syncing: panic_on_warn set ...
>
> CPU: 0 PID: 22543 Comm: syz-executor2 Not tainted 4.18.0-rc2+ #117
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Call Trace:
> __dump_stack lib/dump_stack.c:77 [inline]
> dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113
> panic+0x238/0x4e7 kernel/panic.c:184
> __warn.cold.8+0x163/0x1ba kernel/panic.c:536
> report_bug+0x252/0x2d0 lib/bug.c:186
> fixup_bug arch/x86/kernel/traps.c:178 [inline]
> do_error_trap+0x1fc/0x4d0 arch/x86/kernel/traps.c:296
> do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:316
> invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:992
> RIP: 0010:sctp_mtu_payload include/net/sctp/sctp.h:598 [inline]
> RIP: 0010:sctp_assoc_update_frag_point+0x252/0x2c0 net/sctp/associola.c:1401
> Code: 76 fa 45 39 e5 76 1e e8 0c 69 76 fa 45 29 e5 45 89 ec e9 34 ff ff ff
> e8 fc 68 76 fa 45 8d 66 34 e9 09 ff ff ff e8 ee 68 76 fa <0f> 0b 45 31 e4 e9
> 17 ff ff ff e8 7f 3c b4 fa e9 31 fe ff ff 4c 89
> RSP: 0018:ffff8801d7def378 EFLAGS: 00010216
> RAX: 0000000000040000 RBX: ffff8801d8580ac0 RCX: ffffc900133ca000
> RDX: 00000000000001b9 RSI: ffffffff8705a382 RDI: 0000000000000004
> RBP: ffff8801d7def3a0 R08: ffff8801cfaa6000 R09: ffffed002e0421af
> R10: ffffed002e0421af R11: ffff880170210d7f R12: 0000000000000044
> R13: 0000000000000044 R14: 0000000000000010 R15: ffff8801d8580ac0
> sctp_assoc_set_pmtu net/sctp/associola.c:1417 [inline]
> sctp_assoc_sync_pmtu+0x251/0x2e0 net/sctp/associola.c:1445
We may need a fix:
@@ -282,7 +282,10 @@ bool sctp_transport_update_pmtu(struct
sctp_transport *t, u32 pmtu)
if (dst) {
/* Re-fetch, as under layers may have a higher minimum size */
- pmtu = SCTP_TRUNC4(dst_mtu(dst));
+ int dst_mtu = SCTP_TRUNC4(dst_mtu(dst));
+
+ if (pmtu < dst_mtu)
+ pmtu = dst_mtu;
change = t->pathmtu != pmtu;
to make sure the t->pathmtu never get a value smaller than MINSEGMENT,
also meets the comments:
"Re-fetch, as under layers may have a higher minimum size".
^ permalink raw reply
* Re: [PATCH bpf-next 2/7] lib: reciprocal_div: implement the improved algorithm on the paper mentioned
From: Song Liu @ 2018-06-26 6:21 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Alexei Starovoitov, Daniel Borkmann, oss-drivers, Networking,
Jiong Wang
In-Reply-To: <20180625035421.2991-3-jakub.kicinski@netronome.com>
On Sun, Jun 24, 2018 at 8:54 PM, Jakub Kicinski
<jakub.kicinski@netronome.com> wrote:
> From: Jiong Wang <jiong.wang@netronome.com>
>
> The new added "reciprocal_value_adv" implements the advanced version of the
> algorithm described in Figure 4.2 of the paper except when dividend has MSB
> set which would require u128 divide on host and actually could be easily
> handled before calling the new "reciprocal_value_adv".
>
> The advanced version requires more complex calculation to get the
> reciprocal multiplier and other control variables, but then could reduce
> the required emulation operations.
>
> It makes no sense to use this advanced version for host divide emulation,
> those extra complexities for calculating multiplier etc could completely
> waive our saving on emulation operations.
>
> However, it makes sense to use it for JIT divide code generation (for
> example eBPF JIT backends) for which we are willing to trade performance of
> JITed code with that of host. As shown by the following pseudo code, the
> required emulation operations could go down from 6 (the basic version) to 3
> or 4.
>
> To use the result of "reciprocal_value_adv", suppose we want to calculate
> n/d, the C-style pseudo code will be the following, it could be easily
> changed to real code generation for other JIT targets.
>
> struct reciprocal_value_adv rvalue;
> u8 pre_shift, exp;
>
> if (d >= (1u << 31)) {
> result = n >= d;
> return;
> }
> rvalue = reciprocal_value_adv(d, 32)
> exp = rvalue.exp;
> if (rvalue.is_wide_m && !(d & 1)) {
> pre_shift = fls(d & -d) - 1;
> rvalue = reciprocal_value_adv(d >> pre_shift, 32 - pre_shift);
> } else {
> pre_shift = 0;
> }
>
> // code generation starts.
> if (imm == 1 << exp) {
> result = n >> exp;
> } else if (rvalue.is_wide_m) {
> // pre_shift must be zero when reached here.
> t = (n * rvalue.m) >> 32;
> result = n - t;
> result >>= 1;
> result += t;
> result >>= rvalue.sh - 1;
> } else {
> if (pre_shift)
> result = n >> pre_shift;
> result = ((u64)result * rvalue.m) >> 32;
> result >>= rvalue.sh;
> }
>
> Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> ---
> include/linux/reciprocal_div.h | 65 ++++++++++++++++++++++++++++++++++
> lib/reciprocal_div.c | 37 +++++++++++++++++++
> 2 files changed, 102 insertions(+)
>
> diff --git a/include/linux/reciprocal_div.h b/include/linux/reciprocal_div.h
> index e031e9f2f9d8..5a695e4697d3 100644
> --- a/include/linux/reciprocal_div.h
> +++ b/include/linux/reciprocal_div.h
> @@ -25,6 +25,9 @@ struct reciprocal_value {
> u8 sh1, sh2;
> };
>
> +/* "reciprocal_value" and "reciprocal_divide" together implement the basic
> + * version of the algorithm described in Figure 4.1 of the paper.
> + */
> struct reciprocal_value reciprocal_value(u32 d);
>
> static inline u32 reciprocal_divide(u32 a, struct reciprocal_value R)
> @@ -33,4 +36,66 @@ static inline u32 reciprocal_divide(u32 a, struct reciprocal_value R)
> return (t + ((a - t) >> R.sh1)) >> R.sh2;
> }
>
> +struct reciprocal_value_adv {
> + u32 m;
> + u8 sh, exp;
> + bool is_wide_m;
> +};
> +
> +/* "reciprocal_value_adv" implements the advanced version of the algorithm
> + * described in Figure 4.2 of the paper except when dividend has MSB set which
> + * would require u128 divide on host and actually could be easily handled before
> + * calling "reciprocal_value_adv".
> + *
> + * The advanced version requires more complex calculation to get the reciprocal
> + * multiplier and other control variables, but then could reduce the required
> + * emulation operations.
> + *
> + * It makes no sense to use this advanced version for host divide emulation,
> + * those extra complexities for calculating multiplier etc could completely
> + * waive our saving on emulation operations.
> + *
> + * However, it makes sense to use it for JIT divide code generation for which
> + * we are willing to trade performance of JITed code with that of host. As shown
> + * by the following pseudo code, the required emulation operations could go down
> + * from 6 (the basic version) to 3 or 4.
> + *
> + * To use the result of "reciprocal_value_adv", suppose we want to calculate
> + * n/d:
> + *
> + * struct reciprocal_value_adv rvalue;
> + * u8 pre_shift, exp;
> + *
> + * if (d >= (1u << 31)) {
> + * result = n >= d;
> + * return;
> + * }
> + * rvalue = reciprocal_value_adv(d, 32)
> + * exp = rvalue.exp;
> + * if (rvalue.is_wide_m && !(d & 1)) {
> + * pre_shift = fls(d & -d) - 1;
> + * rvalue = reciprocal_value_adv(d >> pre_shift, 32 - pre_shift);
> + * } else {
> + * pre_shift = 0;
> + * }
> + *
> + * // code generation starts.
> + * if (imm == 1 << exp) {
> + * result = n >> exp;
> + * } else if (rvalue.is_wide_m) {
> + * // pre_shift must be zero when reached here.
> + * t = (n * rvalue.m) >> 32;
> + * result = n - t;
> + * result >>= 1;
> + * result += t;
> + * result >>= rvalue.sh - 1;
> + * } else {
> + * if (pre_shift)
> + * result = n >> pre_shift;
> + * result = ((u64)result * rvalue.m) >> 32;
> + * result >>= rvalue.sh;
> + * }
> + */
> +struct reciprocal_value_adv reciprocal_value_adv(u32 d, u8 prec);
> +
> #endif /* _LINUX_RECIPROCAL_DIV_H */
> diff --git a/lib/reciprocal_div.c b/lib/reciprocal_div.c
> index fcb4ce682c6f..a41501ebad7c 100644
> --- a/lib/reciprocal_div.c
> +++ b/lib/reciprocal_div.c
> @@ -26,3 +26,40 @@ struct reciprocal_value reciprocal_value(u32 d)
> return R;
> }
> EXPORT_SYMBOL(reciprocal_value);
> +
> +struct reciprocal_value_adv reciprocal_value_adv(u32 d, u8 prec)
> +{
> + struct reciprocal_value_adv R;
> + u32 l, post_shift;
> + u64 mhigh, mlow;
> +
> + l = fls(d - 1);
> + post_shift = l;
> + /* NOTE: mlow/mhigh could overflow u64 when l == 32 which means d has
> + * MSB set. This case needs to be handled before calling
> + * "reciprocal_value_adv", please see the comment at
> + * include/linux/reciprocal_div.h.
> + */
Shall we handle l == 32 case better? I guess the concern here is extra
handling may
slow down the fast path? If that's the case, we should at least add a
WARNING on the
slow path.
Thanks,
Song
> + mlow = 1ULL << (32 + l);
> + do_div(mlow, d);
> + mhigh = (1ULL << (32 + l)) + (1ULL << (32 + l - prec));
> + do_div(mhigh, d);
> +
> + for (; post_shift > 0; post_shift--) {
> + u64 lo = mlow >> 1, hi = mhigh >> 1;
> +
> + if (lo >= hi)
> + break;
> +
> + mlow = lo;
> + mhigh = hi;
> + }
> +
> + R.m = (u32)mhigh;
> + R.sh = post_shift;
> + R.exp = l;
> + R.is_wide_m = mhigh > U32_MAX;
> +
> + return R;
> +}
> +EXPORT_SYMBOL(reciprocal_value_adv);
> --
> 2.17.1
>
^ permalink raw reply
* Re: [PATCH bpf-next 3/7] nfp: bpf: rename umin/umax to umin_src/umax_src
From: Song Liu @ 2018-06-26 6:21 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Alexei Starovoitov, Daniel Borkmann, oss-drivers, Networking,
Jiong Wang
In-Reply-To: <20180625035421.2991-4-jakub.kicinski@netronome.com>
On Sun, Jun 24, 2018 at 8:54 PM, Jakub Kicinski
<jakub.kicinski@netronome.com> wrote:
> From: Jiong Wang <jiong.wang@netronome.com>
>
> The two fields are a copy of umin and umax info of bpf_insn->src_reg
> generated by verifier.
>
> Rename to make their meaning clear.
>
> Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Song Liu <songliubraving@fb.com>
> ---
> drivers/net/ethernet/netronome/nfp/bpf/jit.c | 12 ++++++------
> drivers/net/ethernet/netronome/nfp/bpf/main.h | 10 +++++-----
> drivers/net/ethernet/netronome/nfp/bpf/offload.c | 2 +-
> drivers/net/ethernet/netronome/nfp/bpf/verifier.c | 4 ++--
> 4 files changed, 14 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/net/ethernet/netronome/nfp/bpf/jit.c b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
> index 33111739b210..4a629e9b5c0f 100644
> --- a/drivers/net/ethernet/netronome/nfp/bpf/jit.c
> +++ b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
> @@ -1772,8 +1772,8 @@ static int shl_reg64(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
> u8 dst, src;
>
> dst = insn->dst_reg * 2;
> - umin = meta->umin;
> - umax = meta->umax;
> + umin = meta->umin_src;
> + umax = meta->umax_src;
> if (umin == umax)
> return __shl_imm64(nfp_prog, dst, umin);
>
> @@ -1881,8 +1881,8 @@ static int shr_reg64(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
> u8 dst, src;
>
> dst = insn->dst_reg * 2;
> - umin = meta->umin;
> - umax = meta->umax;
> + umin = meta->umin_src;
> + umax = meta->umax_src;
> if (umin == umax)
> return __shr_imm64(nfp_prog, dst, umin);
>
> @@ -1995,8 +1995,8 @@ static int ashr_reg64(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
> u8 dst, src;
>
> dst = insn->dst_reg * 2;
> - umin = meta->umin;
> - umax = meta->umax;
> + umin = meta->umin_src;
> + umax = meta->umax_src;
> if (umin == umax)
> return __ashr_imm64(nfp_prog, dst, umin);
>
> diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.h b/drivers/net/ethernet/netronome/nfp/bpf/main.h
> index 654fe7823e5e..5975a19c28cb 100644
> --- a/drivers/net/ethernet/netronome/nfp/bpf/main.h
> +++ b/drivers/net/ethernet/netronome/nfp/bpf/main.h
> @@ -263,8 +263,8 @@ struct nfp_bpf_reg_state {
> * @func_id: function id for call instructions
> * @arg1: arg1 for call instructions
> * @arg2: arg2 for call instructions
> - * @umin: copy of core verifier umin_value.
> - * @umax: copy of core verifier umax_value.
> + * @umin_src: copy of core verifier umin_value for src opearnd.
> + * @umax_src: copy of core verifier umax_value for src operand.
> * @off: index of first generated machine instruction (in nfp_prog.prog)
> * @n: eBPF instruction number
> * @flags: eBPF instruction extra optimization flags
> @@ -301,11 +301,11 @@ struct nfp_insn_meta {
> struct nfp_bpf_reg_state arg2;
> };
> /* We are interested in range info for some operands,
> - * for example, the shift amount.
> + * for example, the shift amount which is kept in src operand.
> */
> struct {
> - u64 umin;
> - u64 umax;
> + u64 umin_src;
> + u64 umax_src;
> };
> };
> unsigned int off;
> diff --git a/drivers/net/ethernet/netronome/nfp/bpf/offload.c b/drivers/net/ethernet/netronome/nfp/bpf/offload.c
> index 7eae4c0266f8..856a0003bb75 100644
> --- a/drivers/net/ethernet/netronome/nfp/bpf/offload.c
> +++ b/drivers/net/ethernet/netronome/nfp/bpf/offload.c
> @@ -191,7 +191,7 @@ nfp_prog_prepare(struct nfp_prog *nfp_prog, const struct bpf_insn *prog,
> meta->insn = prog[i];
> meta->n = i;
> if (is_mbpf_indir_shift(meta))
> - meta->umin = U64_MAX;
> + meta->umin_src = U64_MAX;
>
> list_add_tail(&meta->l, &nfp_prog->insns);
> }
> diff --git a/drivers/net/ethernet/netronome/nfp/bpf/verifier.c b/drivers/net/ethernet/netronome/nfp/bpf/verifier.c
> index 4bfeba7b21b2..e862b739441f 100644
> --- a/drivers/net/ethernet/netronome/nfp/bpf/verifier.c
> +++ b/drivers/net/ethernet/netronome/nfp/bpf/verifier.c
> @@ -555,8 +555,8 @@ nfp_verify_insn(struct bpf_verifier_env *env, int insn_idx, int prev_insn_idx)
> const struct bpf_reg_state *sreg =
> cur_regs(env) + meta->insn.src_reg;
>
> - meta->umin = min(meta->umin, sreg->umin_value);
> - meta->umax = max(meta->umax, sreg->umax_value);
> + meta->umin_src = min(meta->umin_src, sreg->umin_value);
> + meta->umax_src = max(meta->umax_src, sreg->umax_value);
> }
>
> return 0;
> --
> 2.17.1
>
^ permalink raw reply
* Re: [PATCH net-next] net: preserve sock reference when scrubbing the skb.
From: Eric Dumazet @ 2018-06-26 6:41 UTC (permalink / raw)
To: Cong Wang, Flavio Leitner
Cc: Linux Kernel Network Developers, Eric Dumazet, Paolo Abeni,
David Miller, Florian Westphal, NetFilter
In-Reply-To: <CAM_iQpX_X4hhSSrhMZavLibobp6tgMEa_26T6j4QvACKg-HPvw@mail.gmail.com>
On 06/25/2018 09:15 PM, Cong Wang wrote:
> On Mon, Jun 25, 2018 at 8:59 AM Flavio Leitner <fbl@redhat.com> wrote:
>>
>> The sock reference is lost when scrubbing the packet and that breaks
>> TSQ (TCP Small Queues) and XPS (Transmit Packet Steering) causing
>> performance impacts of about 50% in a single TCP stream when crossing
>> network namespaces.
>>
>> XPS breaks because the queue mapping stored in the socket is not
>> available, so another random queue might be selected when the stack
>> needs to transmit something like a TCP ACK, or TCP Retransmissions.
>> That causes packet re-ordering and/or performance issues.
>>
>> TSQ breaks because it orphans the packet while it is still in the
>> host, so packets are queued contributing to the buffer bloat problem.
>
> Why should TSQ in one stack care about buffer bloat in another stack?
>
> Actually, I think the current behavior is correct, once the packet leaves
> its current stack (or netns), it should relief the backpressure on TCP
> socket in this stack, whether it will be queued in another stack is beyond
> its concern. This breaks the isolation between networking stacks.
>
We discussed about this during netconf Cong, nobody was against this planned removal.
When a packet is attached to a socket, we should keep the association as much as possible.
Only when a new association needs to be done, skb_orphan() needs to be called.
Doing this skb_orphan() too soon breaks back pressure in general, this is bad, since a socket
can evades SO_SNDBUF limits.
I am not sure why the patch is so complex, I would have simply removed the skb_orphan().
^ permalink raw reply
* Re: [patch net-next 6/9] net: sched: cls_flower: propagate chain teplate creation and destruction to drivers
From: Jiri Pirko @ 2018-06-26 6:40 UTC (permalink / raw)
To: Jakub Kicinski
Cc: netdev, davem, jhs, xiyou.wangcong, simon.horman, john.hurley,
dsahern, mlxsw
In-Reply-To: <20180625220050.0ff6d44c@cakuba.netronome.com>
Tue, Jun 26, 2018 at 07:00:50AM CEST, jakub.kicinski@netronome.com wrote:
>On Mon, 25 Jun 2018 23:01:45 +0200, Jiri Pirko wrote:
>> From: Jiri Pirko <jiri@mellanox.com>
>>
>> Introduce a couple of flower offload commands in order to propagate
>> template creation/destruction events down to device drivers.
>> Drivers may use this information to prepare HW in an optimal way
>> for future filter insertions.
>>
>> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
>
>> diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
>> index d64d43843a3a..276ba25a09c3 100644
>> --- a/net/sched/cls_flower.c
>> +++ b/net/sched/cls_flower.c
>> @@ -1120,6 +1120,43 @@ static void fl_walk(struct tcf_proto *tp, struct tcf_walker *arg)
>> }
>> }
>>
>> +static void fl_hw_create_tmplt(struct tcf_chain *chain,
>> + struct fl_flow_tmplt *tmplt,
>> + struct netlink_ext_ack *extack)
>> +{
>> + struct tc_cls_flower_offload cls_flower = {};
>> + struct tcf_block *block = chain->block;
>> + struct tcf_exts dummy_exts = { 0, };
>> +
>> + cls_flower.common.chain_index = chain->index;
>
>Did you skip extack on purpose?
Oh, the extack is leftover. I will remove it in v2.
>
>> + cls_flower.command = TC_CLSFLOWER_TMPLT_CREATE;
>> + cls_flower.cookie = (unsigned long) tmplt;
>> + cls_flower.dissector = &tmplt->dissector;
>> + cls_flower.mask = &tmplt->mask;
>> + cls_flower.key = &tmplt->dummy_key;
>> + cls_flower.exts = &dummy_exts;
>> +
>> + /* We don't care if driver (any of them) fails to handle this
>> + * call. It serves just as a hint for it.
>> + */
>> + tc_setup_cb_call(block, NULL, TC_SETUP_CLSFLOWER,
>> + &cls_flower, false);
>> +}
^ permalink raw reply
* Re: [PATCH] NFC: llcp: fix nfc_llcp_send_ui_frame() lockup
From: Eric Dumazet @ 2018-06-26 6:44 UTC (permalink / raw)
To: Sergey Senozhatsky, Dmitry Vyukov
Cc: Samuel Ortiz, David S. Miller, Steven Rostedt, Petr Mladek,
syzkaller-bugs, linux-wireless, netdev, LKML, syzbot,
Sergey Senozhatsky
In-Reply-To: <20180626051221.GC31439@jagdpanzerIV>
On 06/25/2018 10:12 PM, Sergey Senozhatsky wrote:
> On (06/26/18 07:07), Dmitry Vyukov wrote:
> [..]
>>> #include <net/nfc/nfc.h>
>>> @@ -755,7 +756,8 @@ int nfc_llcp_send_ui_frame(struct nfc_llcp_sock *sock, u8 ssap, u8 dsap,
>>> pdu = nfc_alloc_send_skb(sock->dev, &sock->sk, MSG_DONTWAIT,
>>> frag_len + LLCP_HEADER_SIZE, &err);
>>> if (pdu == NULL) {
>>> - pr_err("Could not allocate PDU\n");
>>> + pr_err_ratelimited("Could not allocate PDU\n");
>>> + cond_resched();
>>> continue;
>>> }
>>
>>
>> But this thread is still in an infinite (unkillable?) loop? If yes, we
>> are waiting for the next syzbot report ;)
>
> The loop is still infinite, correct, but we have a preemption point now.
> Sure, net people can come with a much better solution, I'll be happy to
> scratch my patch.
>
This can not be the right solution, think about current thread being real time,
cond_resched() might be a nop.
We should probably not loop at all, or not use MSG_DONTWAIT.
(And remove this useless "Could not allocate PDU" message)
NFC maintainers should really take a look at this.
^ permalink raw reply
* Re: [patch net-next 0/9] net: sched: introduce chain templates support with offloading to mlxsw
From: Jiri Pirko @ 2018-06-26 6:43 UTC (permalink / raw)
To: Jakub Kicinski
Cc: netdev, davem, jhs, xiyou.wangcong, simon.horman, john.hurley,
dsahern, mlxsw
In-Reply-To: <20180625215850.001276b8@cakuba.netronome.com>
Tue, Jun 26, 2018 at 06:58:50AM CEST, jakub.kicinski@netronome.com wrote:
>On Mon, 25 Jun 2018 23:01:39 +0200, Jiri Pirko wrote:
>> From: Jiri Pirko <jiri@mellanox.com>
>>
>> For the TC clsact offload these days, some of HW drivers need
>> to hold a magic ball. The reason is, with the first inserted rule inside
>> HW they need to guess what fields will be used for the matching. If
>> later on this guess proves to be wrong and user adds a filter with a
>> different field to match, there's a problem. Mlxsw resolves it now with
>> couple of patterns. Those try to cover as many match fields as possible.
>> This aproach is far from optimal, both performance-wise and scale-wise.
>> Also, there is a combination of filters that in certain order won't
>> succeed.
>>
>> Most of the time, when user inserts filters in chain, he knows right away
>> how the filters are going to look like - what type and option will they
>> have. For example, he knows that he will only insert filters of type
>> flower matching destination IP address. He can specify a template that
>> would cover all the filters in the chain.
>
>Perhaps it's lack of sleep, but this paragraph threw me a little off
>the track. IIUC the goal of this set is to provide a way to inform the
>HW about expected matches before any rule is programmed into the HW.
>Not before any rule is added to a particular chain. One can just use
>the first rule in the chain to make a guess about the chain, but thanks
>to this set user can configure *all* chains before any rules are added.
The template is per-chain. User can use template for chain x and
not-use it for chain y. Up to him.
>
>And that's needed because once any rule is added the tcam config can no
>longer be easily modified?
Yes.
^ permalink raw reply
* Re: [PATCH bpf-next 4/7] nfp: bpf: copy range info for all operands of all ALU operations
From: Song Liu @ 2018-06-26 6:50 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Alexei Starovoitov, Daniel Borkmann, oss-drivers, Networking,
Jiong Wang
In-Reply-To: <20180625035421.2991-5-jakub.kicinski@netronome.com>
On Sun, Jun 24, 2018 at 8:54 PM, Jakub Kicinski
<jakub.kicinski@netronome.com> wrote:
> From: Jiong Wang <jiong.wang@netronome.com>
>
> NFP verifier hook is coping range information of the shift amount for
> indirect shift operation so optimized shift sequences could be generated.
>
> We want to use range info to do more things. For example, to decide whether
> multiplication and divide are supported on the given range.
>
> This patch simply let NFP verifier hook to copy range info for all operands
> of all ALU operands.
>
> Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Song Liu <songliubraving@fb.com>
> ---
> drivers/net/ethernet/netronome/nfp/bpf/main.h | 33 +++++++------------
> .../net/ethernet/netronome/nfp/bpf/offload.c | 4 ++-
> .../net/ethernet/netronome/nfp/bpf/verifier.c | 6 +++-
> 3 files changed, 20 insertions(+), 23 deletions(-)
>
> diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.h b/drivers/net/ethernet/netronome/nfp/bpf/main.h
> index 5975a19c28cb..c985d0ac61a3 100644
> --- a/drivers/net/ethernet/netronome/nfp/bpf/main.h
> +++ b/drivers/net/ethernet/netronome/nfp/bpf/main.h
> @@ -265,6 +265,8 @@ struct nfp_bpf_reg_state {
> * @arg2: arg2 for call instructions
> * @umin_src: copy of core verifier umin_value for src opearnd.
> * @umax_src: copy of core verifier umax_value for src operand.
> + * @umin_dst: copy of core verifier umin_value for dst opearnd.
> + * @umax_dst: copy of core verifier umax_value for dst operand.
> * @off: index of first generated machine instruction (in nfp_prog.prog)
> * @n: eBPF instruction number
> * @flags: eBPF instruction extra optimization flags
> @@ -300,12 +302,15 @@ struct nfp_insn_meta {
> struct bpf_reg_state arg1;
> struct nfp_bpf_reg_state arg2;
> };
> - /* We are interested in range info for some operands,
> - * for example, the shift amount which is kept in src operand.
> + /* We are interested in range info for operands of ALU
> + * operations. For example, shift amount, multiplicand and
> + * multiplier etc.
> */
> struct {
> u64 umin_src;
> u64 umax_src;
> + u64 umin_dst;
> + u64 umax_dst;
> };
> };
> unsigned int off;
> @@ -339,6 +344,11 @@ static inline u8 mbpf_mode(const struct nfp_insn_meta *meta)
> return BPF_MODE(meta->insn.code);
> }
>
> +static inline bool is_mbpf_alu(const struct nfp_insn_meta *meta)
> +{
> + return mbpf_class(meta) == BPF_ALU64 || mbpf_class(meta) == BPF_ALU;
> +}
> +
> static inline bool is_mbpf_load(const struct nfp_insn_meta *meta)
> {
> return (meta->insn.code & ~BPF_SIZE_MASK) == (BPF_LDX | BPF_MEM);
> @@ -384,25 +394,6 @@ static inline bool is_mbpf_xadd(const struct nfp_insn_meta *meta)
> return (meta->insn.code & ~BPF_SIZE_MASK) == (BPF_STX | BPF_XADD);
> }
>
> -static inline bool is_mbpf_indir_shift(const struct nfp_insn_meta *meta)
> -{
> - u8 code = meta->insn.code;
> - bool is_alu, is_shift;
> - u8 opclass, opcode;
> -
> - opclass = BPF_CLASS(code);
> - is_alu = opclass == BPF_ALU64 || opclass == BPF_ALU;
> - if (!is_alu)
> - return false;
> -
> - opcode = BPF_OP(code);
> - is_shift = opcode == BPF_LSH || opcode == BPF_RSH || opcode == BPF_ARSH;
> - if (!is_shift)
> - return false;
> -
> - return BPF_SRC(code) == BPF_X;
> -}
> -
> /**
> * struct nfp_prog - nfp BPF program
> * @bpf: backpointer to the bpf app priv structure
> diff --git a/drivers/net/ethernet/netronome/nfp/bpf/offload.c b/drivers/net/ethernet/netronome/nfp/bpf/offload.c
> index 856a0003bb75..78f44c4d95b4 100644
> --- a/drivers/net/ethernet/netronome/nfp/bpf/offload.c
> +++ b/drivers/net/ethernet/netronome/nfp/bpf/offload.c
> @@ -190,8 +190,10 @@ nfp_prog_prepare(struct nfp_prog *nfp_prog, const struct bpf_insn *prog,
>
> meta->insn = prog[i];
> meta->n = i;
> - if (is_mbpf_indir_shift(meta))
> + if (is_mbpf_alu(meta)) {
> meta->umin_src = U64_MAX;
> + meta->umin_dst = U64_MAX;
> + }
>
> list_add_tail(&meta->l, &nfp_prog->insns);
> }
> diff --git a/drivers/net/ethernet/netronome/nfp/bpf/verifier.c b/drivers/net/ethernet/netronome/nfp/bpf/verifier.c
> index e862b739441f..7bd9666bd8ff 100644
> --- a/drivers/net/ethernet/netronome/nfp/bpf/verifier.c
> +++ b/drivers/net/ethernet/netronome/nfp/bpf/verifier.c
> @@ -551,12 +551,16 @@ nfp_verify_insn(struct bpf_verifier_env *env, int insn_idx, int prev_insn_idx)
> if (is_mbpf_xadd(meta))
> return nfp_bpf_check_xadd(nfp_prog, meta, env);
>
> - if (is_mbpf_indir_shift(meta)) {
> + if (is_mbpf_alu(meta)) {
> const struct bpf_reg_state *sreg =
> cur_regs(env) + meta->insn.src_reg;
> + const struct bpf_reg_state *dreg =
> + cur_regs(env) + meta->insn.dst_reg;
>
> meta->umin_src = min(meta->umin_src, sreg->umin_value);
> meta->umax_src = max(meta->umax_src, sreg->umax_value);
> + meta->umin_dst = min(meta->umin_dst, dreg->umin_value);
> + meta->umax_dst = max(meta->umax_dst, dreg->umax_value);
> }
>
> return 0;
> --
> 2.17.1
>
^ permalink raw reply
* Re: [PATCH] NFC: llcp: fix nfc_llcp_send_ui_frame() lockup
From: Sergey Senozhatsky @ 2018-06-26 7:00 UTC (permalink / raw)
To: Eric Dumazet
Cc: Sergey Senozhatsky, Dmitry Vyukov, Samuel Ortiz, David S. Miller,
Steven Rostedt, Petr Mladek, syzkaller-bugs, linux-wireless,
netdev, LKML, syzbot, Sergey Senozhatsky
In-Reply-To: <8c410102-43ab-dfdb-0d71-2ee5951e1af8@gmail.com>
On (06/25/18 23:44), Eric Dumazet wrote:
> > The loop is still infinite, correct, but we have a preemption point now.
> > Sure, net people can come with a much better solution, I'll be happy to
> > scratch my patch.
> >
>
> This can not be the right solution, think about current thread being real time,
> cond_resched() might be a nop.
> NFC maintainers should really take a look at this.
I'm all for it.
-ss
^ permalink raw reply
* Re: [patch net-next 0/9] net: sched: introduce chain templates support with offloading to mlxsw
From: Jakub Kicinski @ 2018-06-26 7:00 UTC (permalink / raw)
To: Jiri Pirko
Cc: Linux Netdev List, David Miller, Jamal Hadi Salim, Cong Wang,
Simon Horman, John Hurley, David Ahern, mlxsw
In-Reply-To: <20180626064355.GQ2161@nanopsycho>
On Mon, Jun 25, 2018 at 11:43 PM, Jiri Pirko <jiri@resnulli.us> wrote:
> Tue, Jun 26, 2018 at 06:58:50AM CEST, jakub.kicinski@netronome.com wrote:
>>On Mon, 25 Jun 2018 23:01:39 +0200, Jiri Pirko wrote:
>>> From: Jiri Pirko <jiri@mellanox.com>
>>>
>>> For the TC clsact offload these days, some of HW drivers need
>>> to hold a magic ball. The reason is, with the first inserted rule inside
>>> HW they need to guess what fields will be used for the matching. If
>>> later on this guess proves to be wrong and user adds a filter with a
>>> different field to match, there's a problem. Mlxsw resolves it now with
>>> couple of patterns. Those try to cover as many match fields as possible.
>>> This aproach is far from optimal, both performance-wise and scale-wise.
>>> Also, there is a combination of filters that in certain order won't
>>> succeed.
>>>
>>> Most of the time, when user inserts filters in chain, he knows right away
>>> how the filters are going to look like - what type and option will they
>>> have. For example, he knows that he will only insert filters of type
>>> flower matching destination IP address. He can specify a template that
>>> would cover all the filters in the chain.
>>
>>Perhaps it's lack of sleep, but this paragraph threw me a little off
>>the track. IIUC the goal of this set is to provide a way to inform the
>>HW about expected matches before any rule is programmed into the HW.
>>Not before any rule is added to a particular chain. One can just use
>>the first rule in the chain to make a guess about the chain, but thanks
>>to this set user can configure *all* chains before any rules are added.
>
> The template is per-chain. User can use template for chain x and
> not-use it for chain y. Up to him.
Makes sense.
I can't help but wonder if it'd be better to associate the
constraints/rules with chains instead of creating a new "template"
object. It seems more natural to create a chain with specific
constraints in place than add and delete template of which there can
be at most one to a chain... Perhaps that's more about the user space
tc command line. Anyway, not a strong objection, just a thought.
>>And that's needed because once any rule is added the tcam config can no
>>longer be easily modified?
>
> Yes.
^ permalink raw reply
* RE: [PATCH] fman: don't set node on dpaa-ethernet platform device
From: Madalin-cristian Bucur @ 2018-06-26 7:00 UTC (permalink / raw)
To: David Miller, bas@daedalean.ai
Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <20180623.102724.1590504977410505871.davem@davemloft.net>
> -----Original Message-----
> From: David Miller [mailto:davem@davemloft.net]
> Sent: Saturday, June 23, 2018 4:27 AM
> To: bas@daedalean.ai
> Cc: netdev@vger.kernel.org; Madalin-cristian Bucur <madalin.bucur@nxp.com>;
> linux-kernel@vger.kernel.org
> Subject: Re: [PATCH] fman: don't set node on dpaa-ethernet platform device
>
> From: Bas Vermeulen <bas@daedalean.ai>
> Date: Thu, 21 Jun 2018 13:42:22 +0200
>
> > Setting dev->node to the mac_node in dpaa_eth_add_device during probe
> > causes the mac_probe to be called again for the dpaa-ethernet.* device
> > that was just added.
> >
> > Fix this by not setting dev->node, as it is not needed.
> >
> > Signed-off-by: Bas Vermeulen <bas@daedalean.ai>
>
> This patch doesn't apply to the current sources.
Hi,
The line the patch is trying to remove was added in this commit
commit a1a50c8e4c241a505b7270e1a3c6e50d94e794b1
Author: Florian Fainelli <f.fainelli@gmail.com>
Date: Tue Aug 22 15:24:47 2017 -0700
fsl/man: Inherit parent device and of_node
and was already removed in this commit
commit 48167c9ce0b91c068430345bf039c7be23fa2f3f
Author: Madalin Bucur <madalin.bucur@nxp.com>
Date: Mon Oct 16 21:36:05 2017 +0300
fsl/fman: remove of_node
Regards,
Madalin
^ permalink raw reply
* Re: Financial Aid
From: M. M Fridman @ 2018-06-26 6:58 UTC (permalink / raw)
--
I Mikhail Fridman. has selected you specially as one of my beneficiaries
for my Charitable Donation, Just as I have declared on May 23, 2016 to give
my fortune as charity.
Reply as soon as possible with further directives.
Best Regards,
Mikhail Fridman.
^ permalink raw reply
* Re: [PATCH bpf-next 1/7] nfp: bpf: allow source ptr type be map ptr in memcpy optimization
From: Jakub Kicinski @ 2018-06-26 7:08 UTC (permalink / raw)
To: Song Liu
Cc: Alexei Starovoitov, Daniel Borkmann, oss-drivers, Networking,
Jiong Wang
In-Reply-To: <CAPhsuW4yqanLp6WEYihwMwKny8VxZQTa3G-+nOLu_e4peokNGQ@mail.gmail.com>
On Mon, Jun 25, 2018 at 10:50 PM, Song Liu <liu.song.a23@gmail.com> wrote:
> On Sun, Jun 24, 2018 at 8:54 PM, Jakub Kicinski
> <jakub.kicinski@netronome.com> wrote:
>> From: Jiong Wang <jiong.wang@netronome.com>
>>
>> Map read has been supported on NFP, this patch enables optimization for
>> memcpy from map to packet.
>>
>> This patch also fixed one latent bug which will cause copying from
>> unexpected address once memcpy for map pointer enabled.
>>
>> Reported-by: Mary Pham <mary.pham@netronome.com>
>> Reported-by: David Beckett <david.beckett@netronome.com>
>> Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
>> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
>> ---
>> drivers/net/ethernet/netronome/nfp/bpf/jit.c | 5 +++--
>> 1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/netronome/nfp/bpf/jit.c b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
>> index 8a92088df0d7..33111739b210 100644
>> --- a/drivers/net/ethernet/netronome/nfp/bpf/jit.c
>> +++ b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
>> @@ -670,7 +670,7 @@ static int nfp_cpp_memcpy(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
>> xfer_num = round_up(len, 4) / 4;
>>
>> if (src_40bit_addr)
>> - addr40_offset(nfp_prog, meta->insn.src_reg, off, &src_base,
>> + addr40_offset(nfp_prog, meta->insn.src_reg * 2, off, &src_base,
>> &off);
>
> Did this break other cases before this patch?
>
> I am sorry if this is a dumb question. I don't think I fully
> understand addr40_offset().
Only map memory uses 40 bit addressing right now, so the if was pretty
much dead code before the patch.
The memcpy optimization was left out of the initial map support due to
insufficient test coverage, I should have probably left more of the 40
bit addressing code out back then.
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox