Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH bpf-next 3/6] bpf, x64: clean up retpoline emission slightly
From: Y Song @ 2018-05-14 17:42 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: Alexei Starovoitov, netdev
In-Reply-To: <20180514002612.12083-4-daniel@iogearbox.net>

On Sun, May 13, 2018 at 5:26 PM, Daniel Borkmann <daniel@iogearbox.net> wrote:
> Make the RETPOLINE_{RA,ED}X_BPF_JIT() a bit more readable by
> cleaning up the macro, aligning comments and spacing.
>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> ---
>  arch/x86/include/asm/nospec-branch.h | 29 ++++++++++++++---------------
>  1 file changed, 14 insertions(+), 15 deletions(-)
>
> diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
> index 2cd344d..2f700a1d 100644
> --- a/arch/x86/include/asm/nospec-branch.h
> +++ b/arch/x86/include/asm/nospec-branch.h
> @@ -301,9 +301,9 @@ do {                                                                        \
>   *    jmp *%edx for x86_32
>   */
>  #ifdef CONFIG_RETPOLINE
> -#ifdef CONFIG_X86_64
> -# define RETPOLINE_RAX_BPF_JIT_SIZE    17
> -# define RETPOLINE_RAX_BPF_JIT()                               \
> +# ifdef CONFIG_X86_64
> +#  define RETPOLINE_RAX_BPF_JIT_SIZE   17
> +#  define RETPOLINE_RAX_BPF_JIT()                              \
>  do {                                                           \
>         EMIT1_off32(0xE8, 7);    /* callq do_rop */             \
>         /* spec_trap: */                                        \
> @@ -314,8 +314,8 @@ do {                                                                \
>         EMIT4(0x48, 0x89, 0x04, 0x24); /* mov %rax,(%rsp) */    \
>         EMIT1(0xC3);             /* retq */                     \
>  } while (0)
> -#else
> -# define RETPOLINE_EDX_BPF_JIT()                               \
> +# else /* !CONFIG_X86_64 */
> +#  define RETPOLINE_EDX_BPF_JIT()                              \
>  do {                                                           \
>         EMIT1_off32(0xE8, 7);    /* call do_rop */              \
>         /* spec_trap: */                                        \
> @@ -326,17 +326,16 @@ do {                                                              \
>         EMIT3(0x89, 0x14, 0x24); /* mov %edx,(%esp) */          \
>         EMIT1(0xC3);             /* ret */                      \
>  } while (0)
> -#endif
> +# endif
>  #else /* !CONFIG_RETPOLINE */
> -
> -#ifdef CONFIG_X86_64
> -# define RETPOLINE_RAX_BPF_JIT_SIZE    2
> -# define RETPOLINE_RAX_BPF_JIT()                               \
> -       EMIT2(0xFF, 0xE0);       /* jmp *%rax */
> -#else
> -# define RETPOLINE_EDX_BPF_JIT()                               \
> -       EMIT2(0xFF, 0xE2) /* jmp *%edx */
> -#endif
> +# ifdef CONFIG_X86_64
> +#  define RETPOLINE_RAX_BPF_JIT_SIZE   2
> +#  define RETPOLINE_RAX_BPF_JIT()                              \
> +       EMIT2(0xFF, 0xE0);       /* jmp *%rax */
> +# else /* !CONFIG_X86_64 */
> +#  define RETPOLINE_EDX_BPF_JIT()                              \
> +       EMIT2(0xFF, 0xE2)        /* jmp *%edx */
> +# endif
>  #endif
>
>  #endif /* _ASM_X86_NOSPEC_BRANCH_H_ */
> --
> 2.9.5
>

Acked-by: Yonghong Song <yhs@fb.com>

^ permalink raw reply

* Re: [PATCH net-next v8 0/3] kernel: add support to collect hardware logs in crash recovery kernel
From: David Miller @ 2018-05-14 17:49 UTC (permalink / raw)
  To: ebiederm
  Cc: rahul.lakkireddy, netdev, kexec, linux-fsdevel, linux-kernel,
	viro, stephen, akpm, torvalds, ganeshgr, nirranjan, indranil
In-Reply-To: <87a7t2mlfn.fsf@xmission.com>

From: ebiederm@xmission.com (Eric W. Biederman)
Date: Mon, 14 May 2018 08:11:24 -0500

> David Miller <davem@davemloft.net> writes:
> 
>> I'm deferring this patch series.
>>
>> If we can't get a reasonable review from an interested party in 10+
>> days, that is not reasonable.
>>
>> Resubmit this once someone reviews it properly.
> 
> David I am out on vacation this week and last (the reason for the delay).
> 
> The last version of this that I looked at I gave my ack.  All of my ABI
> concerns had been addressed. The only outstanding change I believe was
> the Eric Dumazet's asking about something being reviewed.
> 
> I just glanced over it again and I don't see any new issues introduced
> by the last round of changes.
> 
> From 10,000 feet flyover design perspectie and from an ABI perspective
> this patchset seems fine.
> 
> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>

Ok, thanks for reviewing Eric.

Series applied.

^ permalink raw reply

* Re: possible deadlock in sk_diag_fill
From: Andrei Vagin @ 2018-05-14 18:00 UTC (permalink / raw)
  To: Dmitry Vyukov; +Cc: syzbot, avagin, David Miller, LKML, netdev, syzkaller-bugs
In-Reply-To: <CACT4Y+bt-HwBNA5REeZwoKAymCc-Da5rEU=yeeHodg7zByStQg@mail.gmail.com>

On Sat, May 12, 2018 at 09:46:25AM +0200, Dmitry Vyukov wrote:
> On Fri, May 11, 2018 at 8:33 PM, Andrei Vagin <avagin@virtuozzo.com> wrote:
> > On Sat, May 05, 2018 at 10:59:02AM -0700, syzbot wrote:
> >> Hello,
> >>
> >> syzbot found the following crash on:
> >>
> >> HEAD commit:    c1c07416cdd4 Merge tag 'kbuild-fixes-v4.17' of git://git.k..
> >> git tree:       upstream
> >> console output: https://syzkaller.appspot.com/x/log.txt?x=12164c97800000
> >> kernel config:  https://syzkaller.appspot.com/x/.config?x=5a1dc06635c10d27
> >> dashboard link: https://syzkaller.appspot.com/bug?extid=c1872be62e587eae9669
> >> compiler:       gcc (GCC) 8.0.1 20180413 (experimental)
> >> userspace arch: i386
> >>
> >> Unfortunately, I don't have any reproducer for this crash yet.
> >>
> >> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> >> Reported-by: syzbot+c1872be62e587eae9669@syzkaller.appspotmail.com
> >>
> >>
> >> ======================================================
> >> WARNING: possible circular locking dependency detected
> >> 4.17.0-rc3+ #59 Not tainted
> >> ------------------------------------------------------
> >> syz-executor1/25282 is trying to acquire lock:
> >> 000000004fddf743 (&(&u->lock)->rlock/1){+.+.}, at: sk_diag_dump_icons
> >> net/unix/diag.c:82 [inline]
> >> 000000004fddf743 (&(&u->lock)->rlock/1){+.+.}, at:
> >> sk_diag_fill.isra.5+0xa43/0x10d0 net/unix/diag.c:144
> >>
> >> but task is already holding lock:
> >> 00000000b6895645 (rlock-AF_UNIX){+.+.}, at: spin_lock
> >> include/linux/spinlock.h:310 [inline]
> >> 00000000b6895645 (rlock-AF_UNIX){+.+.}, at: sk_diag_dump_icons
> >> net/unix/diag.c:64 [inline]
> >> 00000000b6895645 (rlock-AF_UNIX){+.+.}, at: sk_diag_fill.isra.5+0x94e/0x10d0
> >> net/unix/diag.c:144
> >>
> >> which lock already depends on the new lock.
> >
> > In the code, we have a comment which explains why it is safe to take this lock
> >
> > /*
> >  * The state lock is outer for the same sk's
> >  * queue lock. With the other's queue locked it's
> >  * OK to lock the state.
> >  */
> > unix_state_lock_nested(req);
> >
> > It is a question how to explain this to lockdep.
> 
> Do I understand it correctly that (&u->lock)->rlock associated with
> AF_UNIX is locked under rlock-AF_UNIX, and then rlock-AF_UNIX is
> locked under (&u->lock)->rlock associated with AF_NETLINK? If so, I
> think we need to split (&u->lock)->rlock by family too, so that we
> have u->lock-AF_UNIX and u->lock-AF_NETLINK.

I think here is another problem. lockdep woried about
sk->sk_receive_queue vs unix_sk(s)->lock.

sk_diag_dump_icons() takes sk->sk_receive_queue and then
unix_sk(s)->lock.

unix_dgram_sendmsg takes unix_sk(sk)->lock and then sk->sk_receive_queue.

sk_diag_dump_icons() takes locks for two different sockets, but
unix_dgram_sendmsg() takes locks for one socket.

sk_diag_dump_icons
        if (sk->sk_state == TCP_LISTEN) {
                spin_lock(&sk->sk_receive_queue.lock);
                skb_queue_walk(&sk->sk_receive_queue, skb) {
			unix_state_lock_nested(req);
				spin_lock_nested(&unix_sk(s)->lock,


unix_dgram_sendmsg
	unix_state_lock(other)
		spin_lock(&unix_sk(s)->lock)
        skb_queue_tail(&other->sk_receive_queue, skb);
	        spin_lock_irqsave(&list->lock, flags);

> 
> 
> 
> >> the existing dependency chain (in reverse order) is:
> >>
> >> -> #1 (rlock-AF_UNIX){+.+.}:
> >>        __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
> >>        _raw_spin_lock_irqsave+0x96/0xc0 kernel/locking/spinlock.c:152
> >>        skb_queue_tail+0x26/0x150 net/core/skbuff.c:2900
> >>        unix_dgram_sendmsg+0xf77/0x1730 net/unix/af_unix.c:1797
> >>        sock_sendmsg_nosec net/socket.c:629 [inline]
> >>        sock_sendmsg+0xd5/0x120 net/socket.c:639
> >>        ___sys_sendmsg+0x525/0x940 net/socket.c:2117
> >>        __sys_sendmmsg+0x3bb/0x6f0 net/socket.c:2205
> >>        __compat_sys_sendmmsg net/compat.c:770 [inline]
> >>        __do_compat_sys_sendmmsg net/compat.c:777 [inline]
> >>        __se_compat_sys_sendmmsg net/compat.c:774 [inline]
> >>        __ia32_compat_sys_sendmmsg+0x9f/0x100 net/compat.c:774
> >>        do_syscall_32_irqs_on arch/x86/entry/common.c:323 [inline]
> >>        do_fast_syscall_32+0x345/0xf9b arch/x86/entry/common.c:394
> >>        entry_SYSENTER_compat+0x70/0x7f arch/x86/entry/entry_64_compat.S:139
> >>
> >> -> #0 (&(&u->lock)->rlock/1){+.+.}:
> >>        lock_acquire+0x1dc/0x520 kernel/locking/lockdep.c:3920
> >>        _raw_spin_lock_nested+0x28/0x40 kernel/locking/spinlock.c:354
> >>        sk_diag_dump_icons net/unix/diag.c:82 [inline]
> >>        sk_diag_fill.isra.5+0xa43/0x10d0 net/unix/diag.c:144
> >>        sk_diag_dump net/unix/diag.c:178 [inline]
> >>        unix_diag_dump+0x35f/0x550 net/unix/diag.c:206
> >>        netlink_dump+0x507/0xd20 net/netlink/af_netlink.c:2226
> >>        __netlink_dump_start+0x51a/0x780 net/netlink/af_netlink.c:2323
> >>        netlink_dump_start include/linux/netlink.h:214 [inline]
> >>        unix_diag_handler_dump+0x3f4/0x7b0 net/unix/diag.c:307
> >>        __sock_diag_cmd net/core/sock_diag.c:230 [inline]
> >>        sock_diag_rcv_msg+0x2e0/0x3d0 net/core/sock_diag.c:261
> >>        netlink_rcv_skb+0x172/0x440 net/netlink/af_netlink.c:2448
> >>        sock_diag_rcv+0x2a/0x40 net/core/sock_diag.c:272
> >>        netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline]
> >>        netlink_unicast+0x58b/0x740 net/netlink/af_netlink.c:1336
> >>        netlink_sendmsg+0x9f0/0xfa0 net/netlink/af_netlink.c:1901
> >>        sock_sendmsg_nosec net/socket.c:629 [inline]
> >>        sock_sendmsg+0xd5/0x120 net/socket.c:639
> >>        sock_write_iter+0x35a/0x5a0 net/socket.c:908
> >>        call_write_iter include/linux/fs.h:1784 [inline]
> >>        new_sync_write fs/read_write.c:474 [inline]
> >>        __vfs_write+0x64d/0x960 fs/read_write.c:487
> >>        vfs_write+0x1f8/0x560 fs/read_write.c:549
> >>        ksys_write+0xf9/0x250 fs/read_write.c:598
> >>        __do_sys_write fs/read_write.c:610 [inline]
> >>        __se_sys_write fs/read_write.c:607 [inline]
> >>        __ia32_sys_write+0x71/0xb0 fs/read_write.c:607
> >>        do_syscall_32_irqs_on arch/x86/entry/common.c:323 [inline]
> >>        do_fast_syscall_32+0x345/0xf9b arch/x86/entry/common.c:394
> >>        entry_SYSENTER_compat+0x70/0x7f arch/x86/entry/entry_64_compat.S:139
> >>
> >> other info that might help us debug this:
> >>
> >>  Possible unsafe locking scenario:
> >>
> >>        CPU0                    CPU1
> >>        ----                    ----
> >>   lock(rlock-AF_UNIX);
> >>                                lock(&(&u->lock)->rlock/1);
> >>                                lock(rlock-AF_UNIX);
> >>   lock(&(&u->lock)->rlock/1);
> >>
> >>  *** DEADLOCK ***
> >>
> >> 5 locks held by syz-executor1/25282:
> >>  #0: 000000003919e1bd (sock_diag_mutex){+.+.}, at: sock_diag_rcv+0x1b/0x40
> >> net/core/sock_diag.c:271
> >>  #1: 000000004f328d3e (sock_diag_table_mutex){+.+.}, at: __sock_diag_cmd
> >> net/core/sock_diag.c:225 [inline]
> >>  #1: 000000004f328d3e (sock_diag_table_mutex){+.+.}, at:
> >> sock_diag_rcv_msg+0x169/0x3d0 net/core/sock_diag.c:261
> >>  #2: 000000004cc04dbb (nlk_cb_mutex-SOCK_DIAG){+.+.}, at:
> >> netlink_dump+0x98/0xd20 net/netlink/af_netlink.c:2182
> >>  #3: 00000000accdef41 (unix_table_lock){+.+.}, at: spin_lock
> >> include/linux/spinlock.h:310 [inline]
> >>  #3: 00000000accdef41 (unix_table_lock){+.+.}, at:
> >> unix_diag_dump+0x10a/0x550 net/unix/diag.c:192
> >>  #4: 00000000b6895645 (rlock-AF_UNIX){+.+.}, at: spin_lock
> >> include/linux/spinlock.h:310 [inline]
> >>  #4: 00000000b6895645 (rlock-AF_UNIX){+.+.}, at: sk_diag_dump_icons
> >> net/unix/diag.c:64 [inline]
> >>  #4: 00000000b6895645 (rlock-AF_UNIX){+.+.}, at:
> >> sk_diag_fill.isra.5+0x94e/0x10d0 net/unix/diag.c:144
> >>
> >> stack backtrace:
> >> CPU: 1 PID: 25282 Comm: syz-executor1 Not tainted 4.17.0-rc3+ #59
> >> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> >> Google 01/01/2011
> >> Call Trace:
> >>  __dump_stack lib/dump_stack.c:77 [inline]
> >>  dump_stack+0x1b9/0x294 lib/dump_stack.c:113
> >>  print_circular_bug.isra.36.cold.54+0x1bd/0x27d
> >> kernel/locking/lockdep.c:1223
> >>  check_prev_add kernel/locking/lockdep.c:1863 [inline]
> >>  check_prevs_add kernel/locking/lockdep.c:1976 [inline]
> >>  validate_chain kernel/locking/lockdep.c:2417 [inline]
> >>  __lock_acquire+0x343e/0x5140 kernel/locking/lockdep.c:3431
> >>  lock_acquire+0x1dc/0x520 kernel/locking/lockdep.c:3920
> >>  _raw_spin_lock_nested+0x28/0x40 kernel/locking/spinlock.c:354
> >>  sk_diag_dump_icons net/unix/diag.c:82 [inline]
> >>  sk_diag_fill.isra.5+0xa43/0x10d0 net/unix/diag.c:144
> >>  sk_diag_dump net/unix/diag.c:178 [inline]
> >>  unix_diag_dump+0x35f/0x550 net/unix/diag.c:206
> >>  netlink_dump+0x507/0xd20 net/netlink/af_netlink.c:2226
> >>  __netlink_dump_start+0x51a/0x780 net/netlink/af_netlink.c:2323
> >>  netlink_dump_start include/linux/netlink.h:214 [inline]
> >>  unix_diag_handler_dump+0x3f4/0x7b0 net/unix/diag.c:307
> >>  __sock_diag_cmd net/core/sock_diag.c:230 [inline]
> >>  sock_diag_rcv_msg+0x2e0/0x3d0 net/core/sock_diag.c:261
> >>  netlink_rcv_skb+0x172/0x440 net/netlink/af_netlink.c:2448
> >>  sock_diag_rcv+0x2a/0x40 net/core/sock_diag.c:272
> >>  netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline]
> >>  netlink_unicast+0x58b/0x740 net/netlink/af_netlink.c:1336
> >>  netlink_sendmsg+0x9f0/0xfa0 net/netlink/af_netlink.c:1901
> >>  sock_sendmsg_nosec net/socket.c:629 [inline]
> >>  sock_sendmsg+0xd5/0x120 net/socket.c:639
> >>  sock_write_iter+0x35a/0x5a0 net/socket.c:908
> >>  call_write_iter include/linux/fs.h:1784 [inline]
> >>  new_sync_write fs/read_write.c:474 [inline]
> >>  __vfs_write+0x64d/0x960 fs/read_write.c:487
> >>  vfs_write+0x1f8/0x560 fs/read_write.c:549
> >>  ksys_write+0xf9/0x250 fs/read_write.c:598
> >>  __do_sys_write fs/read_write.c:610 [inline]
> >>  __se_sys_write fs/read_write.c:607 [inline]
> >>  __ia32_sys_write+0x71/0xb0 fs/read_write.c:607
> >>  do_syscall_32_irqs_on arch/x86/entry/common.c:323 [inline]
> >>  do_fast_syscall_32+0x345/0xf9b arch/x86/entry/common.c:394
> >>  entry_SYSENTER_compat+0x70/0x7f arch/x86/entry/entry_64_compat.S:139
> >> RIP: 0023:0xf7f8ccb9
> >> RSP: 002b:00000000f5f880ac EFLAGS: 00000282 ORIG_RAX: 0000000000000004
> >> RAX: ffffffffffffffda RBX: 0000000000000017 RCX: 000000002058bfe4
> >> RDX: 0000000000000029 RSI: 0000000000000000 RDI: 0000000000000000
> >> RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
> >> R10: 0000000000000000 R11: 0000000000000296 R12: 0000000000000000
> >> R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> >>
> >>
> >> ---
> >> This bug is generated by a bot. It may contain errors.
> >> See https://goo.gl/tpsmEJ for more information about syzbot.
> >> syzbot engineers can be reached at syzkaller@googlegroups.com.
> >>
> >> syzbot will keep track of this bug report.
> >> If you forgot to add the Reported-by tag, once the fix for this bug is
> >> merged
> >> into any tree, please reply to this email with:
> >> #syz fix: exact-commit-title
> >> To mark this as a duplicate of another syzbot report, please reply with:
> >> #syz dup: exact-subject-of-another-report
> >> If it's a one-off invalid bug report, please reply with:
> >> #syz invalid
> >> Note: if the crash happens again, it will cause creation of a new bug
> >> report.
> >> Note: all commands must start from beginning of the line in the email body.
> >
> > --
> > You received this message because you are subscribed to the Google Groups "syzkaller-bugs" group.
> > To unsubscribe from this group and stop receiving emails from it, send an email to syzkaller-bugs+unsubscribe@googlegroups.com.
> > To view this discussion on the web visit https://groups.google.com/d/msgid/syzkaller-bugs/20180511183358.GA1492%40outlook.office365.com.
> > For more options, visit https://groups.google.com/d/optout.

^ permalink raw reply

* Re: [PATCH 00/14] Modify action API for implementing lockless actions
From: Jamal Hadi Salim @ 2018-05-14 18:03 UTC (permalink / raw)
  To: Vlad Buslov, netdev
  Cc: davem, xiyou.wangcong, jiri, pablo, kadlec, fw, ast, daniel,
	edumazet, keescook, linux-kernel, netfilter-devel, coreteam,
	kliteyn
In-Reply-To: <1526308035-12484-1-git-send-email-vladbu@mellanox.com>

On 14/05/18 10:27 AM, Vlad Buslov wrote:
> Currently, all netlink protocol handlers for updating rules, actions and
> qdiscs are protected with single global rtnl lock which removes any
> possibility for parallelism. This patch set is a first step to remove
> rtnl lock dependency from TC rules update path. It updates act API to
> use atomic operations, rcu and spinlocks for fine-grained locking. It
> also extend API with functions that are needed to update existing
> actions for parallel execution.
> 
> Outline of changes:
> - Change tc action to use atomic reference and bind counters, rcu
>    mechanism for cookie update.
> - Extend action ops API with 'delete' function and 'unlocked' flag.
> - Change action API to work with actions in lockless manner based on
>    primitives implemented in previous patches.
> - Extend action API with new functions necessary to implement unlocked
>    actions.

Please run all the tdc tests with these changes. This area has almost
good test coverage at this point. If you need help just ping me.

cheers,
jamal

^ permalink raw reply

* Re: INFO: rcu detected stall in kfree_skbmem
From: Xin Long @ 2018-05-14 18:04 UTC (permalink / raw)
  To: Neil Horman, Marcelo Ricardo Leitner
  Cc: Dmitry Vyukov, syzbot, Vladislav Yasevich, linux-sctp,
	Andrei Vagin, David Miller, Kirill Tkhai, LKML, netdev,
	syzkaller-bugs
In-Reply-To: <20180514133439.GA6743@hmswarspite.think-freely.org>

On Mon, May 14, 2018 at 9:34 PM, Neil Horman <nhorman@tuxdriver.com> wrote:
> On Fri, May 11, 2018 at 12:00:38PM +0200, Dmitry Vyukov wrote:
>> On Mon, Apr 30, 2018 at 8:09 PM, syzbot
>> <syzbot+fc78715ba3b3257caf6a@syzkaller.appspotmail.com> wrote:
>> > Hello,
>> >
>> > syzbot found the following crash on:
>> >
>> > HEAD commit:    5d1365940a68 Merge
>> > git://git.kernel.org/pub/scm/linux/kerne...
>> > git tree:       net-next
>> > console output: https://syzkaller.appspot.com/x/log.txt?id=5667997129637888
>> > kernel config:
>> > https://syzkaller.appspot.com/x/.config?id=-5947642240294114534
>> > dashboard link: https://syzkaller.appspot.com/bug?extid=fc78715ba3b3257caf6a
>> > compiler:       gcc (GCC) 8.0.1 20180413 (experimental)
>> >
>> > Unfortunately, I don't have any reproducer for this crash yet.
>>
>> This looks sctp-related, +sctp maintainers.
>>
> Looking at the entire trace, it appears that we are getting caught in the
> kfree_skb that is getting triggered in enqueue_to_backlog which occurs when our
> rx backlog list grows over netdev_max_backlog packets.  That suggests to me that
It might be a long skb->frag_list that made kfree_skb slow when packing
lots of small chunks to go through lo device?

> whatever test(s) is/are causing this trace are queuing up a large number of
> frames to be sent over the loopback interface, and are never/rarely getting
> received.  Looking up higher in the stack, in the sctp_generate_heartbeat_event
> function, we (in addition to the rcu_read_lock in sctp_v6_xmit) we also hold the
> socket lock during the entirety of the xmit operaion.  Is it possible that we
> are just enqueuing so many frames for xmit that we are blocking progress of
> other threads using the same socket that we cross the RCU self detected stall
> boundary?  While its not a fix per se, it might be a worthwhile test to limit
> the number of frames we flush in a single pass.
>
> Neil
>
>> > IMPORTANT: if you fix the bug, please add the following tag to the commit:
>> > Reported-by: syzbot+fc78715ba3b3257caf6a@syzkaller.appspotmail.com
>> >
>> > INFO: rcu_sched self-detected stall on CPU
>> >         1-...!: (1 GPs behind) idle=a3e/1/4611686018427387908
>> > softirq=71980/71983 fqs=33
>> >          (t=125000 jiffies g=39438 c=39437 q=958)
>> > rcu_sched kthread starved for 124829 jiffies! g39438 c39437 f0x0
>> > RCU_GP_WAIT_FQS(3) ->state=0x0 ->cpu=0
>> > RCU grace-period kthread stack dump:
>> > rcu_sched       R  running task    23768     9      2 0x80000000
>> > Call Trace:
>> >  context_switch kernel/sched/core.c:2848 [inline]
>> >  __schedule+0x801/0x1e30 kernel/sched/core.c:3490
>> >  schedule+0xef/0x430 kernel/sched/core.c:3549
>> >  schedule_timeout+0x138/0x240 kernel/time/timer.c:1801
>> >  rcu_gp_kthread+0x6b5/0x1940 kernel/rcu/tree.c:2231
>> >  kthread+0x345/0x410 kernel/kthread.c:238
>> >  ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:411
>> > NMI backtrace for cpu 1
>> > CPU: 1 PID: 20560 Comm: syz-executor4 Not tainted 4.16.0+ #1
>> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
>> > Google 01/01/2011
>> > Call Trace:
>> >  <IRQ>
>> >  __dump_stack lib/dump_stack.c:77 [inline]
>> >  dump_stack+0x1b9/0x294 lib/dump_stack.c:113
>> >  nmi_cpu_backtrace.cold.4+0x19/0xce lib/nmi_backtrace.c:103
>> >  nmi_trigger_cpumask_backtrace+0x151/0x192 lib/nmi_backtrace.c:62
>> >  arch_trigger_cpumask_backtrace+0x14/0x20 arch/x86/kernel/apic/hw_nmi.c:38
>> >  trigger_single_cpu_backtrace include/linux/nmi.h:156 [inline]
>> >  rcu_dump_cpu_stacks+0x175/0x1c2 kernel/rcu/tree.c:1376
>> >  print_cpu_stall kernel/rcu/tree.c:1525 [inline]
>> >  check_cpu_stall.isra.61.cold.80+0x36c/0x59a kernel/rcu/tree.c:1593
>> >  __rcu_pending kernel/rcu/tree.c:3356 [inline]
>> >  rcu_pending kernel/rcu/tree.c:3401 [inline]
>> >  rcu_check_callbacks+0x21b/0xad0 kernel/rcu/tree.c:2763
>> >  update_process_times+0x2d/0x70 kernel/time/timer.c:1636
>> >  tick_sched_handle+0x9f/0x180 kernel/time/tick-sched.c:173
>> >  tick_sched_timer+0x45/0x130 kernel/time/tick-sched.c:1283
>> >  __run_hrtimer kernel/time/hrtimer.c:1386 [inline]
>> >  __hrtimer_run_queues+0x3e3/0x10a0 kernel/time/hrtimer.c:1448
>> >  hrtimer_interrupt+0x286/0x650 kernel/time/hrtimer.c:1506
>> >  local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1025 [inline]
>> >  smp_apic_timer_interrupt+0x15d/0x710 arch/x86/kernel/apic/apic.c:1050
>> >  apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:862
>> > RIP: 0010:arch_local_irq_restore arch/x86/include/asm/paravirt.h:783
>> > [inline]
>> > RIP: 0010:kmem_cache_free+0xb3/0x2d0 mm/slab.c:3757
>> > RSP: 0018:ffff8801db105228 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff13
>> > RAX: 0000000000000007 RBX: ffff8800b055c940 RCX: 1ffff1003b2345a5
>> > RDX: 0000000000000000 RSI: ffff8801d91a2d80 RDI: 0000000000000282
>> > RBP: ffff8801db105248 R08: ffff8801d91a2cb8 R09: 0000000000000002
>> > R10: ffff8801d91a2480 R11: 0000000000000000 R12: ffff8801d9848e40
>> > R13: 0000000000000282 R14: ffffffff85b7f27c R15: 0000000000000000
>> >  kfree_skbmem+0x13c/0x210 net/core/skbuff.c:582
>> >  __kfree_skb net/core/skbuff.c:642 [inline]
>> >  kfree_skb+0x19d/0x560 net/core/skbuff.c:659
>> >  enqueue_to_backlog+0x2fc/0xc90 net/core/dev.c:3968
>> >  netif_rx_internal+0x14d/0xae0 net/core/dev.c:4181
>> >  netif_rx+0xba/0x400 net/core/dev.c:4206
>> >  loopback_xmit+0x283/0x741 drivers/net/loopback.c:91
>> >  __netdev_start_xmit include/linux/netdevice.h:4087 [inline]
>> >  netdev_start_xmit include/linux/netdevice.h:4096 [inline]
>> >  xmit_one net/core/dev.c:3053 [inline]
>> >  dev_hard_start_xmit+0x264/0xc10 net/core/dev.c:3069
>> >  __dev_queue_xmit+0x2724/0x34c0 net/core/dev.c:3584
>> >  dev_queue_xmit+0x17/0x20 net/core/dev.c:3617
>> >  neigh_hh_output include/net/neighbour.h:472 [inline]
>> >  neigh_output include/net/neighbour.h:480 [inline]
>> >  ip6_finish_output2+0x134e/0x2810 net/ipv6/ip6_output.c:120
>> >  ip6_finish_output+0x5fe/0xbc0 net/ipv6/ip6_output.c:154
>> >  NF_HOOK_COND include/linux/netfilter.h:277 [inline]
>> >  ip6_output+0x227/0x9b0 net/ipv6/ip6_output.c:171
>> >  dst_output include/net/dst.h:444 [inline]
>> >  NF_HOOK include/linux/netfilter.h:288 [inline]
>> >  ip6_xmit+0xf51/0x23f0 net/ipv6/ip6_output.c:277
>> >  sctp_v6_xmit+0x4a5/0x6b0 net/sctp/ipv6.c:225
>> >  sctp_packet_transmit+0x26f6/0x3ba0 net/sctp/output.c:650
>> >  sctp_outq_flush+0x1373/0x4370 net/sctp/outqueue.c:1197
>> >  sctp_outq_uncork+0x6a/0x80 net/sctp/outqueue.c:776
>> >  sctp_cmd_interpreter net/sctp/sm_sideeffect.c:1820 [inline]
>> >  sctp_side_effects net/sctp/sm_sideeffect.c:1220 [inline]
>> >  sctp_do_sm+0x596/0x7160 net/sctp/sm_sideeffect.c:1191
>> >  sctp_generate_heartbeat_event+0x218/0x450 net/sctp/sm_sideeffect.c:406
>> >  call_timer_fn+0x230/0x940 kernel/time/timer.c:1326
>> >  expire_timers kernel/time/timer.c:1363 [inline]
>> >  __run_timers+0x79e/0xc50 kernel/time/timer.c:1666
>> >  run_timer_softirq+0x4c/0x70 kernel/time/timer.c:1692
>> >  __do_softirq+0x2e0/0xaf5 kernel/softirq.c:285
>> >  invoke_softirq kernel/softirq.c:365 [inline]
>> >  irq_exit+0x1d1/0x200 kernel/softirq.c:405
>> >  exiting_irq arch/x86/include/asm/apic.h:525 [inline]
>> >  smp_apic_timer_interrupt+0x17e/0x710 arch/x86/kernel/apic/apic.c:1052
>> >  apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:862
>> >  </IRQ>
>> > RIP: 0010:arch_local_irq_restore arch/x86/include/asm/paravirt.h:783
>> > [inline]
>> > RIP: 0010:lock_release+0x4d4/0xa10 kernel/locking/lockdep.c:3942
>> > RSP: 0018:ffff8801971ce7b0 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff13
>> > RAX: dffffc0000000000 RBX: 1ffff10032e39cfb RCX: 1ffff1003b234595
>> > RDX: 1ffffffff11630ed RSI: 0000000000000002 RDI: 0000000000000282
>> > RBP: ffff8801971ce8e0 R08: 1ffff10032e39cff R09: ffffed003b6246c2
>> > R10: 0000000000000003 R11: 0000000000000001 R12: ffff8801d91a2480
>> > R13: ffffffff88b8df60 R14: ffff8801d91a2480 R15: ffff8801971ce7f8
>> >  rcu_lock_release include/linux/rcupdate.h:251 [inline]
>> >  rcu_read_unlock include/linux/rcupdate.h:688 [inline]
>> >  __unlock_page_memcg+0x72/0x100 mm/memcontrol.c:1654
>> >  unlock_page_memcg+0x2c/0x40 mm/memcontrol.c:1663
>> >  page_remove_file_rmap mm/rmap.c:1248 [inline]
>> >  page_remove_rmap+0x6f2/0x1250 mm/rmap.c:1299
>> >  zap_pte_range mm/memory.c:1337 [inline]
>> >  zap_pmd_range mm/memory.c:1441 [inline]
>> >  zap_pud_range mm/memory.c:1470 [inline]
>> >  zap_p4d_range mm/memory.c:1491 [inline]
>> >  unmap_page_range+0xeb4/0x2200 mm/memory.c:1512
>> >  unmap_single_vma+0x1a0/0x310 mm/memory.c:1557
>> >  unmap_vmas+0x120/0x1f0 mm/memory.c:1587
>> >  exit_mmap+0x265/0x570 mm/mmap.c:3038
>> >  __mmput kernel/fork.c:962 [inline]
>> >  mmput+0x251/0x610 kernel/fork.c:983
>> >  exit_mm kernel/exit.c:544 [inline]
>> >  do_exit+0xe98/0x2730 kernel/exit.c:852
>> >  do_group_exit+0x16f/0x430 kernel/exit.c:968
>> >  get_signal+0x886/0x1960 kernel/signal.c:2469
>> >  do_signal+0x98/0x2040 arch/x86/kernel/signal.c:810
>> >  exit_to_usermode_loop+0x28a/0x310 arch/x86/entry/common.c:162
>> >  prepare_exit_to_usermode arch/x86/entry/common.c:196 [inline]
>> >  syscall_return_slowpath arch/x86/entry/common.c:265 [inline]
>> >  do_syscall_64+0x792/0x9d0 arch/x86/entry/common.c:292
>> >  entry_SYSCALL_64_after_hwframe+0x42/0xb7
>> > RIP: 0033:0x455319
>> > RSP: 002b:00007fa346e81ce8 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
>> > RAX: fffffffffffffe00 RBX: 000000000072bf80 RCX: 0000000000455319
>> > RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000072bf80
>> > RBP: 000000000072bf80 R08: 0000000000000000 R09: 000000000072bf58
>> > R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
>> > R13: 0000000000a3e81f R14: 00007fa346e829c0 R15: 0000000000000001
>> >
>> >
>> > ---
>> > This bug is generated by a bot. It may contain errors.
>> > See https://goo.gl/tpsmEJ for more information about syzbot.
>> > syzbot engineers can be reached at syzkaller@googlegroups.com.
>> >
>> > syzbot will keep track of this bug report.
>> > If you forgot to add the Reported-by tag, once the fix for this bug is
>> > merged
>> > into any tree, please reply to this email with:
>> > #syz fix: exact-commit-title
>> > To mark this as a duplicate of another syzbot report, please reply with:
>> > #syz dup: exact-subject-of-another-report
>> > If it's a one-off invalid bug report, please reply with:
>> > #syz invalid
>> > Note: if the crash happens again, it will cause creation of a new bug
>> > report.
>> > Note: all commands must start from beginning of the line in the email body.
>> >
>> > --
>> > You received this message because you are subscribed to the Google Groups
>> > "syzkaller-bugs" group.
>> > To unsubscribe from this group and stop receiving emails from it, send an
>> > email to syzkaller-bugs+unsubscribe@googlegroups.com.
>> > To view this discussion on the web visit
>> > https://groups.google.com/d/msgid/syzkaller-bugs/000000000000a9b0e3056b14bfb2%40google.com.
>> > For more options, visit https://groups.google.com/d/optout.
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: Kernel panic on kernel-3.10.0-693.21.1.el7 in ndisc.h
From: Roman Makhov @ 2018-05-14 18:29 UTC (permalink / raw)
  To: Alexander Aring; +Cc: linux-wpan, netdev
In-Reply-To: <20180514154002.ehab2ss25a45l5p6@x220t>

Hi Alexander,

Thank you for the answer.
Unfortunately CentOS goes with these dinosaurs.
So we will try to debug the problem in the current one and try to
reproduce on the latest kernel.

Thanks,
Roman.

2018-05-14 18:40 GMT+03:00 Alexander Aring <aring@mojatatu.com>:
> Hi,
>
> On Sun, May 13, 2018 at 02:35:07PM +0300, Roman Makhov wrote:
>> Hello,
>>
>> We have a problem with Kernel panic after upgrade from CentOS 7.3
>> (kernel-3.10.0-514.el7) to CentOS 7.4 (kernel-3.10.0-693.21.1.el7).
>> It occurs when we have the incoming traffic from other nodes and we
>> are performing the re-configuration of IPv6 interfaces.
>>
>> It is high-availability system without 802.15.4 support.
>>
>> The log of crash:
>> =========================================================
>> #10 [ffff88043fc03cf0] async_page_fault at ffffffff816b7798
>>     [exception RIP: ndisc_send_rs+238]
>>     RIP: ffffffff8166575e  RSP: ffff88043fc03da8  RFLAGS: 00010202
>>     RAX: 0000000000000002  RBX: ffff88042caa9000  RCX: 0000000000000001
>>     RDX: 0000000000000000  RSI: 0000000000000200  RDI: ffffffff816534f7
>>     RBP: ffff88043fc03dd0   R8: 0000000000000000   R9: ffffffff81e9f1c0
>>     R10: 0000000000000002  R11: ffff88043fc03da8  R12: 0000000000000008
>>     R13: 0000000000000006  R14: ffff88043fc03de0  R15: ffffffff81772410
>>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>> #11 [ffff88043fc03da0] ndisc_send_rs at ffffffff81665704
>> =========================================================
>>
>> I see that crash points on ndisc.h, it is ndisc_ops_opt_addr_space()
>> in function:
>> =========================================================
>> crash> kmem ffffffff8166575e
>> ffffffff8166575e (T) ndisc_send_rs+238
>> /usr/src/debug/kernel-3.10.0-693.21.1.el7/linux-3.10.0-693.21.1.el7.x86_64/include/net/ndisc.h:
>> 251
>>
>>       PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
>> ffffea0000059940   1665000                0        0  1 1fffff00000400 reserved
>> crash>
>> =========================================================
>>
>> I checked the difference between 514 and 693 kernels is in the patch
>> https://patchwork.kernel.org/patch/9179229/ .
>>
>> Any suggesions about what I am doing wrong are welcome.
>>
>
> Me as original author of this patch,
>
> I cannot help you with such a dinosaurs kernel. Please try it with the
> latest one and check if the problem still exists.
>
> - Alex

^ permalink raw reply

* Re: [net-next 00/11][pull request] 40GbE Intel Wired LAN Driver Updates 2018-05-14
From: David Miller @ 2018-05-14 18:29 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, nhorman, sassmann, jogreene
In-Reply-To: <20180514152747.23154-1-jeffrey.t.kirsher@intel.com>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Mon, 14 May 2018 08:27:36 -0700

> This series contains updates to virtchnl, i40e and i40evf.
> 
> Bruce cleans up whitespace and unnecessary parentheses in virtchnl.
> 
> Jake does a number of stat cleanups in the i40e driver, including
> cleanup of code indentation, whitespace issues, remove duplicate stats,
> fix grammar in code comment and general spring cleaning of the
> statistics code.
> 
> Patryk fixes an issue where we recalculate vectors left and vectors
> wanted but do not take into account the reduced number of queue pairs
> per VSI.
> 
> Harshitha adds tx_busy stat to ethtool stats to track the number of
> times we return NETDEV_TX_BUSY to the stack during transmit.
> 
> Paweł fixes a potential system crash when unloading the VF driver after
> a hardware reset.
> 
> The following are changes since commit 289e1f4e9e4a09c73a1c0152bb93855ea351ccda:
>   net: ipv4: ipconfig: fix unused variable
> and are available in the git repository at:
>   git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue 40GbE

Looks good, pulled, thanks Jeff.

^ permalink raw reply

* Re: [PATCH net-next v9 1/7] sched: Add Common Applications Kept Enhanced (cake) qdisc
From: Toke Høiland-Jørgensen @ 2018-05-14 18:31 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, cake
In-Reply-To: <20180514.110558.392362320880673775.davem@davemloft.net>

David Miller <davem@davemloft.net> writes:

> From: Toke Høiland-Jørgensen <toke@toke.dk>
> Date: Mon, 14 May 2018 11:08:28 +0200
>
>> David Miller <davem@davemloft.net> writes:
>> 
>>> From: Toke Høiland-Jørgensen <toke@toke.dk>
>>> Date: Tue, 08 May 2018 16:34:19 +0200
>>>
>>>> +struct cake_flow {
>>>> +	/* this stuff is all needed per-flow at dequeue time */
>>>> +	struct sk_buff	  *head;
>>>> +	struct sk_buff	  *tail;
>>>
>>> Please do not invent your own SKB list handling mechanism.
>> 
>> We didn't invent it, we inherited it from fq_codel. I was actually about
>> to fix that, but then I noticed it was still around in fq_codel, and so
>> let it be. I can certainly fix it anyway, but, erm, why is it acceptable
>> in fq_codel but not in cake? struct sk_buff_head is not that new, is it?
>
> I guess one argument has to do with the amount of memory consumed by this
> per-flow or per-queue information, right?  Because the skb queue head has
> a qlen and a spinlock regardless of whether they are used or not.
>
> Furthermore, if you use the __skb_insert() et al. helpers, even though it
> won't use the lock it will adjust the qlen counter.  And that's useless
> work since you have no use for the qlen value.

I think the useless work issue is larger than the memory usage. When
running this (or FQ-CoDel) on small memory-constrained routers, we've
mostly had issues with OOM because of the packet data, which dwarfs the
per-queue overhead.

> Taken together, it seems that what you and fq_codel are doing is not
> such a bad idea after all.  So please leave it alone.

OK. I'll just resend with prettier Christmas trees, then :)

> On-and-off again, I've looked into converting skbs to using list_head
> but it's a non-trivial set of work. All over the tree the different
> layers use the next/prev pointers in different ways. Some use it for a
> doubly linked list. Some use it for a singly linked list. Some encode
> state in the prev pointer. You name it, it's out there.
>
> I'll try to get back to that task because obviously it'll be useful to
> have code like cake and fq_codel use common helpers instead of custom
> stuff.

Yup, I agree. From a code readability point of view, I also prefer the
helpers.

-Toke

^ permalink raw reply

* Re: iproute2 - modifying routes in place
From: Ryan Whelan @ 2018-05-14 18:40 UTC (permalink / raw)
  To: dsahern; +Cc: netdev
In-Reply-To: <07d96b60-68e5-8c20-8505-3df9178f9f78@gmail.com>

Same behavior:


root@rwhelan-linux ~
# ip -6 route
::1 dev lo proto kernel metric 256 pref medium
fd9b:caee:ff93:ceef:3431:3831:3930:3031 dev internal0 proto kernel metric
256 pref medium
fd9b:caee:ff93:ceef:3431:3831:3930:3032 dev internal0 src
fd9b:caee:ff93:ceef:3431:3831:3930:3031 metric 1024 pref medium
fe80::/64 dev enp0s3 proto kernel metric 256 pref medium
fe80::/64 dev enp0s8 proto kernel metric 256 pref medium
fe80::/64 dev internal0 proto kernel metric 256 pref medium

root@rwhelan-linux ~
# ip -6 route change fd9b:caee:ff93:ceef:3431:3831:3930:3032 dev internal0
src fd9b:caee:ff93:ceef:3431:3831:3930:3031 metric 10
RTNETLINK answers: No such file or directory

root@rwhelan-linux ~
# ip -6 route replace fd9b:caee:ff93:ceef:3431:3831:3930:3032 dev internal0
src fd9b:caee:ff93:ceef:3431:3831:3930:3031 metric 10

root@rwhelan-linux ~
# ip -6 route
::1 dev lo proto kernel metric 256 pref medium
fd9b:caee:ff93:ceef:3431:3831:3930:3031 dev internal0 proto kernel metric
256 pref medium
fd9b:caee:ff93:ceef:3431:3831:3930:3032 dev internal0 src
fd9b:caee:ff93:ceef:3431:3831:3930:3031 metric 10 pref medium
fd9b:caee:ff93:ceef:3431:3831:3930:3032 dev internal0 src
fd9b:caee:ff93:ceef:3431:3831:3930:3031 metric 1024 pref medium
fe80::/64 dev enp0s3 proto kernel metric 256 pref medium
fe80::/64 dev enp0s8 proto kernel metric 256 pref medium
fe80::/64 dev internal0 proto kernel metric 256 pref medium

root@rwhelan-linux ~
# uname -a
Linux rwhelan-linux 4.17.0-rc3-ipv6-route-bugs+ #2 SMP Mon May 14 11:30:38
EDT 2018 x86_64 x86_64 x86_64 GNU/Linux
On Sat, May 12, 2018 at 1:01 PM David Ahern <dsahern@gmail.com> wrote:

> On 5/11/18 4:42 AM, Ryan Whelan wrote:
> > `ip route` has 2 subcommands that don't seem to work as expected and i'm
> > not sure if its a bug, or if i'm misunderstanding the semantics.

> Can you try with ipv6/route-bugs branch in
> https://github.com/dsahern/linux

^ permalink raw reply

* Re: [PATCH v3] selftests: add headers_install to lib.mk
From: Mike Rapoport @ 2018-05-14 18:44 UTC (permalink / raw)
  To: Anders Roxell
  Cc: yamada.masahiro, michal.lkml, shuah, bamv2005, brgl, pbonzini,
	akpm, aarcange, linux-kbuild, linux-kernel, linux-kselftest,
	netdev
In-Reply-To: <20180514115809.9803-1-anders.roxell@linaro.org>

On Mon, May 14, 2018 at 01:58:09PM +0200, Anders Roxell wrote:
> If the kernel headers aren't installed we can't build all the tests.
> Add a new make target rule 'khdr' in the file lib.mk to generate the
> kernel headers and that gets include for every test-dir Makefile that
> includes lib.mk If the testdir in turn have its own sub-dirs the
> top_srcdir needs to be set to the linux-rootdir to be able to generate
> the kernel headers.
> 
> Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
> Reviewed-by: Fathi Boudra <fathi.boudra@linaro.org>

Ack for the userfaultfd part.

> ---
>  Makefile                                          | 14 +-------------
>  scripts/subarch.include                           | 13 +++++++++++++
>  tools/testing/selftests/android/Makefile          |  2 +-
>  tools/testing/selftests/android/ion/Makefile      |  1 +
>  tools/testing/selftests/bpf/Makefile              |  5 ++---
>  tools/testing/selftests/futex/functional/Makefile |  1 +
>  tools/testing/selftests/gpio/Makefile             |  7 ++-----
>  tools/testing/selftests/kvm/Makefile              |  7 ++-----
>  tools/testing/selftests/lib.mk                    | 10 ++++++++++
>  tools/testing/selftests/vm/Makefile               |  4 ----
>  10 files changed, 33 insertions(+), 31 deletions(-)
>  create mode 100644 scripts/subarch.include
> 
> diff --git a/Makefile b/Makefile
> index 6395c79ffe40..bad0a8f6e874 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -286,19 +286,7 @@ KERNELRELEASE = $(shell cat include/config/kernel.release 2> /dev/null)
>  KERNELVERSION = $(VERSION)$(if $(PATCHLEVEL),.$(PATCHLEVEL)$(if $(SUBLEVEL),.$(SUBLEVEL)))$(EXTRAVERSION)
>  export VERSION PATCHLEVEL SUBLEVEL KERNELRELEASE KERNELVERSION
> 
> -# SUBARCH tells the usermode build what the underlying arch is.  That is set
> -# first, and if a usermode build is happening, the "ARCH=um" on the command
> -# line overrides the setting of ARCH below.  If a native build is happening,
> -# then ARCH is assigned, getting whatever value it gets normally, and
> -# SUBARCH is subsequently ignored.
> -
> -SUBARCH := $(shell uname -m | sed -e s/i.86/x86/ -e s/x86_64/x86/ \
> -				  -e s/sun4u/sparc64/ \
> -				  -e s/arm.*/arm/ -e s/sa110/arm/ \
> -				  -e s/s390x/s390/ -e s/parisc64/parisc/ \
> -				  -e s/ppc.*/powerpc/ -e s/mips.*/mips/ \
> -				  -e s/sh[234].*/sh/ -e s/aarch64.*/arm64/ \
> -				  -e s/riscv.*/riscv/)
> +include scripts/subarch.include
> 
>  # Cross compiling and selecting different set of gcc/bin-utils
>  # ---------------------------------------------------------------------------
> diff --git a/scripts/subarch.include b/scripts/subarch.include
> new file mode 100644
> index 000000000000..650682821126
> --- /dev/null
> +++ b/scripts/subarch.include
> @@ -0,0 +1,13 @@
> +# SUBARCH tells the usermode build what the underlying arch is.  That is set
> +# first, and if a usermode build is happening, the "ARCH=um" on the command
> +# line overrides the setting of ARCH below.  If a native build is happening,
> +# then ARCH is assigned, getting whatever value it gets normally, and
> +# SUBARCH is subsequently ignored.
> +
> +SUBARCH := $(shell uname -m | sed -e s/i.86/x86/ -e s/x86_64/x86/ \
> +				  -e s/sun4u/sparc64/ \
> +				  -e s/arm.*/arm/ -e s/sa110/arm/ \
> +				  -e s/s390x/s390/ -e s/parisc64/parisc/ \
> +				  -e s/ppc.*/powerpc/ -e s/mips.*/mips/ \
> +				  -e s/sh[234].*/sh/ -e s/aarch64.*/arm64/ \
> +				  -e s/riscv.*/riscv/)
> diff --git a/tools/testing/selftests/android/Makefile b/tools/testing/selftests/android/Makefile
> index 72c25a3cb658..d9a725478375 100644
> --- a/tools/testing/selftests/android/Makefile
> +++ b/tools/testing/selftests/android/Makefile
> @@ -6,7 +6,7 @@ TEST_PROGS := run.sh
> 
>  include ../lib.mk
> 
> -all:
> +all: khdr
>  	@for DIR in $(SUBDIRS); do		\
>  		BUILD_TARGET=$(OUTPUT)/$$DIR;	\
>  		mkdir $$BUILD_TARGET  -p;	\
> diff --git a/tools/testing/selftests/android/ion/Makefile b/tools/testing/selftests/android/ion/Makefile
> index e03695287f76..f3be7dcc060a 100644
> --- a/tools/testing/selftests/android/ion/Makefile
> +++ b/tools/testing/selftests/android/ion/Makefile
> @@ -10,6 +10,7 @@ $(TEST_GEN_FILES): ipcsocket.c ionutils.c
> 
>  TEST_PROGS := ion_test.sh
> 
> +top_srcdir = ../../../../..
>  include ../../lib.mk
> 
>  $(OUTPUT)/ionapp_export: ionapp_export.c ipcsocket.c ionutils.c
> diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
> index 438d4f93875b..9741609a0eb1 100644
> --- a/tools/testing/selftests/bpf/Makefile
> +++ b/tools/testing/selftests/bpf/Makefile
> @@ -16,9 +16,8 @@ LDLIBS += -lcap -lelf -lrt -lpthread
>  TEST_CUSTOM_PROGS = $(OUTPUT)/urandom_read
>  all: $(TEST_CUSTOM_PROGS)
> 
> -$(TEST_CUSTOM_PROGS): urandom_read
> -
> -urandom_read: urandom_read.c
> +$(TEST_CUSTOM_PROGS):| khdr
> +$(TEST_CUSTOM_PROGS): urandom_read.c
>  	$(CC) -o $(TEST_CUSTOM_PROGS) -static $<
> 
>  # Order correspond to 'make run_tests' order
> diff --git a/tools/testing/selftests/futex/functional/Makefile b/tools/testing/selftests/futex/functional/Makefile
> index ff8feca49746..ad1eeb14fda7 100644
> --- a/tools/testing/selftests/futex/functional/Makefile
> +++ b/tools/testing/selftests/futex/functional/Makefile
> @@ -18,6 +18,7 @@ TEST_GEN_FILES := \
> 
>  TEST_PROGS := run.sh
> 
> +top_srcdir = ../../../../..
>  include ../../lib.mk
> 
>  $(TEST_GEN_FILES): $(HEADERS)
> diff --git a/tools/testing/selftests/gpio/Makefile b/tools/testing/selftests/gpio/Makefile
> index 1bbb47565c55..4665cdbf1a8d 100644
> --- a/tools/testing/selftests/gpio/Makefile
> +++ b/tools/testing/selftests/gpio/Makefile
> @@ -21,11 +21,8 @@ endef
>  CFLAGS += -O2 -g -std=gnu99 -Wall -I../../../../usr/include/
>  LDLIBS += -lmount -I/usr/include/libmount
> 
> -$(BINARIES): ../../../gpio/gpio-utils.o ../../../../usr/include/linux/gpio.h
> +$(BINARIES):| khdr
> +$(BINARIES): ../../../gpio/gpio-utils.o
> 
>  ../../../gpio/gpio-utils.o:
>  	make ARCH=$(ARCH) CROSS_COMPILE=$(CROSS_COMPILE) -C ../../../gpio
> -
> -../../../../usr/include/linux/gpio.h:
> -	make -C ../../../.. headers_install INSTALL_HDR_PATH=$(shell pwd)/../../../../usr/
> -
> diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
> index d9d00319b07c..bcb69380bbab 100644
> --- a/tools/testing/selftests/kvm/Makefile
> +++ b/tools/testing/selftests/kvm/Makefile
> @@ -32,9 +32,6 @@ $(LIBKVM_OBJ): $(OUTPUT)/%.o: %.c
>  $(OUTPUT)/libkvm.a: $(LIBKVM_OBJ)
>  	$(AR) crs $@ $^
> 
> -$(LINUX_HDR_PATH):
> -	make -C $(top_srcdir) headers_install
> -
> -all: $(STATIC_LIBS) $(LINUX_HDR_PATH)
> +all: $(STATIC_LIBS)
>  $(TEST_GEN_PROGS): $(STATIC_LIBS)
> -$(TEST_GEN_PROGS) $(LIBKVM_OBJ): | $(LINUX_HDR_PATH)
> +$(STATIC_LIBS):| khdr
> diff --git a/tools/testing/selftests/lib.mk b/tools/testing/selftests/lib.mk
> index 6466294366dc..9515af872e16 100644
> --- a/tools/testing/selftests/lib.mk
> +++ b/tools/testing/selftests/lib.mk
> @@ -16,8 +16,18 @@ TEST_GEN_PROGS := $(patsubst %,$(OUTPUT)/%,$(TEST_GEN_PROGS))
>  TEST_GEN_PROGS_EXTENDED := $(patsubst %,$(OUTPUT)/%,$(TEST_GEN_PROGS_EXTENDED))
>  TEST_GEN_FILES := $(patsubst %,$(OUTPUT)/%,$(TEST_GEN_FILES))
> 
> +top_srcdir ?= ../../../..
> +include $(top_srcdir)/scripts/subarch.include
> +ARCH		?= $(SUBARCH)
> +
>  all: $(TEST_GEN_PROGS) $(TEST_GEN_PROGS_EXTENDED) $(TEST_GEN_FILES)
> 
> +.PHONY: khdr
> +khdr:
> +	make ARCH=$(ARCH) -C $(top_srcdir) headers_install
> +
> +$(TEST_GEN_PROGS) $(TEST_GEN_PROGS_EXTENDED) $(TEST_GEN_FILES):| khdr
> +
>  .ONESHELL:
>  define RUN_TEST_PRINT_RESULT
>  	TEST_HDR_MSG="selftests: "`basename $$PWD`:" $$BASENAME_TEST";	\
> diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
> index fdefa2295ddc..58759454b1d0 100644
> --- a/tools/testing/selftests/vm/Makefile
> +++ b/tools/testing/selftests/vm/Makefile
> @@ -25,10 +25,6 @@ TEST_PROGS := run_vmtests
> 
>  include ../lib.mk
> 
> -$(OUTPUT)/userfaultfd: ../../../../usr/include/linux/kernel.h
>  $(OUTPUT)/userfaultfd: LDLIBS += -lpthread
> 
>  $(OUTPUT)/mlock-random-test: LDLIBS += -lcap
> -
> -../../../../usr/include/linux/kernel.h:
> -	make -C ../../../.. headers_install
> -- 
> 2.17.0
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply

* Re: [bpf-next PATCH 5/5] bpf, doc: howto use/run the BPF selftests
From: Silvan Jegen @ 2018-05-14 18:46 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: borkmann, alexei.starovoitov, netdev, quentin.monnet, linux-man,
	linux-doc
In-Reply-To: <20180514180514.4fd0ba66@redhat.com>

On Mon, May 14, 2018 at 06:05:14PM +0200, Jesper Dangaard Brouer wrote:
> On Mon, 14 May 2018 17:15:54 +0200
> Silvan Jegen <s.jegen@gmail.com> wrote:
> 
> > Hi
> > 
> > Some typo fixes below.
> > 
> > On Mon, May 14, 2018 at 3:43 PM Jesper Dangaard Brouer <brouer@redhat.com>
> > wrote:
> > > I always forget howto run the BPF selftests. Thus, lets add that info
> > > to the QA document.  
> >
> > [...] 
> >
> > I also think the underscore at the end of this line is misplaced (or it
> > should be a dash instead).
> 
> This is also part of the RST formatting.  This is a link. 
> 
> 
> > > +document for further documentation.
> > > +
> > >   Q: Which BPF kernel selftests version should I run my kernel against?
> > >   ---------------------------------------------------------------------
> > >   A: If you run a kernel ``xyz``, then always run the BPF kernel selftests
> > > @@ -607,5 +634,7 @@ when:
> > >   .. _netdev FAQ: ../networking/netdev-FAQ.txt
> > >   .. _samples/bpf/: ../../samples/bpf/
> > >   .. _selftests: ../../tools/testing/selftests/bpf/
> > > +.. _Documentation/dev-tools/kselftest.rst:
> > > +   https://www.kernel.org/doc/html/latest/dev-tools/kselftest.html  
> 
> The link is defined above/here.

Ah, thanks I didn't know.


Cheers,

Silvan

 
> -- 
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [PATCH net V2] tun: fix use after free for ptr_ring
From: David Miller @ 2018-05-14 18:48 UTC (permalink / raw)
  To: jasowang; +Cc: netdev, linux-kernel, xiyou.wangcong, eric.dumazet, mst
In-Reply-To: <1526006965-9124-1-git-send-email-jasowang@redhat.com>

From: Jason Wang <jasowang@redhat.com>
Date: Fri, 11 May 2018 10:49:25 +0800

> We used to initialize ptr_ring during TUNSETIFF, this is because its
> size depends on the tx_queue_len of netdevice. And we try to clean it
> up when socket were detached from netdevice. A race were spotted when
> trying to do uninit during a read which will lead a use after free for
> pointer ring. Solving this by always initialize a zero size ptr_ring
> in open() and do resizing during TUNSETIFF, and then we can safely do
> cleanup during close(). With this, there's no need for the workaround
> that was introduced by commit 4df0bfc79904 ("tun: fix a memory leak
> for tfile->tx_array").
> 
> Reported-by: syzbot+e8b902c3c3fadf0a9dba@syzkaller.appspotmail.com
> Cc: Eric Dumazet <eric.dumazet@gmail.com>
> Cc: Cong Wang <xiyou.wangcong@gmail.com>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Fixes: 1576d9860599 ("tun: switch to use skb array for tx")
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
> Changes from v1:
> - free ptr_ring during close()
> - use tun_ptr_free() during resie for safety

Applied and queued up for -stable, thanks Jason.

^ permalink raw reply

* [PATCH net-next v2 0/4] bonding: performance and reliability
From: Debabrata Banerjee @ 2018-05-14 18:48 UTC (permalink / raw)
  To: David S . Miller, netdev, Jay Vosburgh
  Cc: Veaceslav Falico, Andy Gospodarek, dbanerje

Series of fixes to how rlb updates are handled, code cleanup, allowing
higher performance tx hashing in balance-alb mode, and reliability of
link up/down monitoring.

v2: refactor bond_is_nondyn_tlb with inline fn, update log comment to
point out that multicast addresses will not get rlb updates.

Debabrata Banerjee (4):
  bonding: don't queue up extraneous rlb updates
  bonding: use common mac addr checks
  bonding: allow use of tx hashing in balance-alb
  bonding: allow carrier and link status to determine link state

 Documentation/networking/bonding.txt |  4 +--
 drivers/net/bonding/bond_alb.c       | 50 +++++++++++++++++-----------
 drivers/net/bonding/bond_main.c      | 37 ++++++++++++--------
 drivers/net/bonding/bond_options.c   |  9 ++---
 include/net/bonding.h                | 11 ++++--
 5 files changed, 70 insertions(+), 41 deletions(-)

-- 
2.17.0

^ permalink raw reply

* [PATCH net-next v2 1/4] bonding: don't queue up extraneous rlb updates
From: Debabrata Banerjee @ 2018-05-14 18:48 UTC (permalink / raw)
  To: David S . Miller, netdev, Jay Vosburgh
  Cc: Veaceslav Falico, Andy Gospodarek, dbanerje
In-Reply-To: <20180514184810.15506-1-dbanerje@akamai.com>

arps for incomplete entries can't be sent anyway.

Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
---
 drivers/net/bonding/bond_alb.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
index 5eb0df2e5464..c2f6c58e4e6a 100644
--- a/drivers/net/bonding/bond_alb.c
+++ b/drivers/net/bonding/bond_alb.c
@@ -421,7 +421,8 @@ static void rlb_clear_slave(struct bonding *bond, struct slave *slave)
 			if (assigned_slave) {
 				rx_hash_table[index].slave = assigned_slave;
 				if (!ether_addr_equal_64bits(rx_hash_table[index].mac_dst,
-							     mac_bcast)) {
+							     mac_bcast) &&
+				    !is_zero_ether_addr(rx_hash_table[index].mac_dst)) {
 					bond_info->rx_hashtbl[index].ntt = 1;
 					bond_info->rx_ntt = 1;
 					/* A slave has been removed from the
@@ -524,7 +525,8 @@ static void rlb_req_update_slave_clients(struct bonding *bond, struct slave *sla
 		client_info = &(bond_info->rx_hashtbl[hash_index]);
 
 		if ((client_info->slave == slave) &&
-		    !ether_addr_equal_64bits(client_info->mac_dst, mac_bcast)) {
+		    !ether_addr_equal_64bits(client_info->mac_dst, mac_bcast) &&
+		    !is_zero_ether_addr(client_info->mac_dst)) {
 			client_info->ntt = 1;
 			ntt = 1;
 		}
@@ -565,7 +567,8 @@ static void rlb_req_update_subnet_clients(struct bonding *bond, __be32 src_ip)
 		if ((client_info->ip_src == src_ip) &&
 		    !ether_addr_equal_64bits(client_info->slave->dev->dev_addr,
 					     bond->dev->dev_addr) &&
-		    !ether_addr_equal_64bits(client_info->mac_dst, mac_bcast)) {
+		    !ether_addr_equal_64bits(client_info->mac_dst, mac_bcast) &&
+		    !is_zero_ether_addr(client_info->mac_dst)) {
 			client_info->ntt = 1;
 			bond_info->rx_ntt = 1;
 		}
@@ -641,7 +644,8 @@ static struct slave *rlb_choose_channel(struct sk_buff *skb, struct bonding *bon
 		ether_addr_copy(client_info->mac_src, arp->mac_src);
 		client_info->slave = assigned_slave;
 
-		if (!ether_addr_equal_64bits(client_info->mac_dst, mac_bcast)) {
+		if (!ether_addr_equal_64bits(client_info->mac_dst, mac_bcast) &&
+		    !is_zero_ether_addr(client_info->mac_dst)) {
 			client_info->ntt = 1;
 			bond->alb_info.rx_ntt = 1;
 		} else {
@@ -733,8 +737,10 @@ static void rlb_rebalance(struct bonding *bond)
 		assigned_slave = __rlb_next_rx_slave(bond);
 		if (assigned_slave && (client_info->slave != assigned_slave)) {
 			client_info->slave = assigned_slave;
-			client_info->ntt = 1;
-			ntt = 1;
+			if (!is_zero_ether_addr(client_info->mac_dst)) {
+				client_info->ntt = 1;
+				ntt = 1;
+			}
 		}
 	}
 
-- 
2.17.0

^ permalink raw reply related

* [PATCH net-next v2 3/4] bonding: allow use of tx hashing in balance-alb
From: Debabrata Banerjee @ 2018-05-14 18:48 UTC (permalink / raw)
  To: David S . Miller, netdev, Jay Vosburgh
  Cc: Veaceslav Falico, Andy Gospodarek, dbanerje
In-Reply-To: <20180514184810.15506-1-dbanerje@akamai.com>

The rx load balancing provided by balance-alb is not mutually
exclusive with using hashing for tx selection, and should provide a decent
speed increase because this eliminates spinlocks and cache contention.

Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
---
 drivers/net/bonding/bond_alb.c     | 20 ++++++++++++++++++--
 drivers/net/bonding/bond_main.c    | 25 +++++++++++++++----------
 drivers/net/bonding/bond_options.c |  2 +-
 include/net/bonding.h              | 11 +++++++++--
 4 files changed, 43 insertions(+), 15 deletions(-)

diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
index 180e50f7806f..6228635880d5 100644
--- a/drivers/net/bonding/bond_alb.c
+++ b/drivers/net/bonding/bond_alb.c
@@ -1478,8 +1478,24 @@ int bond_alb_xmit(struct sk_buff *skb, struct net_device *bond_dev)
 	}
 
 	if (do_tx_balance) {
-		hash_index = _simple_hash(hash_start, hash_size);
-		tx_slave = tlb_choose_channel(bond, hash_index, skb->len);
+		if (bond->params.tlb_dynamic_lb) {
+			hash_index = _simple_hash(hash_start, hash_size);
+			tx_slave = tlb_choose_channel(bond, hash_index, skb->len);
+		} else {
+			/*
+			 * do_tx_balance means we are free to select the tx_slave
+			 * So we do exactly what tlb would do for hash selection
+			 */
+
+			struct bond_up_slave *slaves;
+			unsigned int count;
+
+			slaves = rcu_dereference(bond->slave_arr);
+			count = slaves ? READ_ONCE(slaves->count) : 0;
+			if (likely(count))
+				tx_slave = slaves->arr[bond_xmit_hash(bond, skb) %
+						       count];
+		}
 	}
 
 	return bond_do_alb_xmit(skb, bond, tx_slave);
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 1f1e97b26f95..f7f8a49cb32b 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -159,7 +159,7 @@ module_param(min_links, int, 0);
 MODULE_PARM_DESC(min_links, "Minimum number of available links before turning on carrier");
 
 module_param(xmit_hash_policy, charp, 0);
-MODULE_PARM_DESC(xmit_hash_policy, "balance-xor and 802.3ad hashing method; "
+MODULE_PARM_DESC(xmit_hash_policy, "balance-alb, balance-tlb, balance-xor, 802.3ad hashing method; "
 				   "0 for layer 2 (default), 1 for layer 3+4, "
 				   "2 for layer 2+3, 3 for encap layer 2+3, "
 				   "4 for encap layer 3+4");
@@ -1735,7 +1735,7 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev,
 		unblock_netpoll_tx();
 	}
 
-	if (bond_mode_uses_xmit_hash(bond))
+	if (bond_mode_can_use_xmit_hash(bond))
 		bond_update_slave_arr(bond, NULL);
 
 	bond->nest_level = dev_get_nest_level(bond_dev);
@@ -1870,7 +1870,7 @@ static int __bond_release_one(struct net_device *bond_dev,
 	if (BOND_MODE(bond) == BOND_MODE_8023AD)
 		bond_3ad_unbind_slave(slave);
 
-	if (bond_mode_uses_xmit_hash(bond))
+	if (bond_mode_can_use_xmit_hash(bond))
 		bond_update_slave_arr(bond, slave);
 
 	netdev_info(bond_dev, "Releasing %s interface %s\n",
@@ -3102,7 +3102,7 @@ static int bond_slave_netdev_event(unsigned long event,
 		 * events. If these (miimon/arpmon) parameters are configured
 		 * then array gets refreshed twice and that should be fine!
 		 */
-		if (bond_mode_uses_xmit_hash(bond))
+		if (bond_mode_can_use_xmit_hash(bond))
 			bond_update_slave_arr(bond, NULL);
 		break;
 	case NETDEV_CHANGEMTU:
@@ -3322,7 +3322,7 @@ static int bond_open(struct net_device *bond_dev)
 		 */
 		if (bond_alb_initialize(bond, (BOND_MODE(bond) == BOND_MODE_ALB)))
 			return -ENOMEM;
-		if (bond->params.tlb_dynamic_lb)
+		if (bond->params.tlb_dynamic_lb || BOND_MODE(bond) == BOND_MODE_ALB)
 			queue_delayed_work(bond->wq, &bond->alb_work, 0);
 	}
 
@@ -3341,7 +3341,7 @@ static int bond_open(struct net_device *bond_dev)
 		bond_3ad_initiate_agg_selection(bond, 1);
 	}
 
-	if (bond_mode_uses_xmit_hash(bond))
+	if (bond_mode_can_use_xmit_hash(bond))
 		bond_update_slave_arr(bond, NULL);
 
 	return 0;
@@ -3892,7 +3892,7 @@ static void bond_slave_arr_handler(struct work_struct *work)
  * to determine the slave interface -
  * (a) BOND_MODE_8023AD
  * (b) BOND_MODE_XOR
- * (c) BOND_MODE_TLB && tlb_dynamic_lb == 0
+ * (c) (BOND_MODE_TLB || BOND_MODE_ALB) && tlb_dynamic_lb == 0
  *
  * The caller is expected to hold RTNL only and NO other lock!
  */
@@ -3945,6 +3945,11 @@ int bond_update_slave_arr(struct bonding *bond, struct slave *skipslave)
 			continue;
 		if (skipslave == slave)
 			continue;
+
+		netdev_dbg(bond->dev,
+			   "Adding slave dev %s to tx hash array[%d]\n",
+			   slave->dev->name, new_arr->count);
+
 		new_arr->arr[new_arr->count++] = slave;
 	}
 
@@ -4320,9 +4325,9 @@ static int bond_check_params(struct bond_params *params)
 	}
 
 	if (xmit_hash_policy) {
-		if ((bond_mode != BOND_MODE_XOR) &&
-		    (bond_mode != BOND_MODE_8023AD) &&
-		    (bond_mode != BOND_MODE_TLB)) {
+		if (bond_mode == BOND_MODE_ROUNDROBIN ||
+		    bond_mode == BOND_MODE_ACTIVEBACKUP ||
+		    bond_mode == BOND_MODE_BROADCAST) {
 			pr_info("xmit_hash_policy param is irrelevant in mode %s\n",
 				bond_mode_name(bond_mode));
 		} else {
diff --git a/drivers/net/bonding/bond_options.c b/drivers/net/bonding/bond_options.c
index 58c705f24f96..8a945c9341d6 100644
--- a/drivers/net/bonding/bond_options.c
+++ b/drivers/net/bonding/bond_options.c
@@ -395,7 +395,7 @@ static const struct bond_option bond_opts[BOND_OPT_LAST] = {
 		.id = BOND_OPT_TLB_DYNAMIC_LB,
 		.name = "tlb_dynamic_lb",
 		.desc = "Enable dynamic flow shuffling",
-		.unsuppmodes = BOND_MODE_ALL_EX(BIT(BOND_MODE_TLB)),
+		.unsuppmodes = BOND_MODE_ALL_EX(BIT(BOND_MODE_TLB) | BIT(BOND_MODE_ALB)),
 		.values = bond_tlb_dynamic_lb_tbl,
 		.flags = BOND_OPTFLAG_IFDOWN,
 		.set = bond_option_tlb_dynamic_lb_set,
diff --git a/include/net/bonding.h b/include/net/bonding.h
index b52235158836..808f1d167349 100644
--- a/include/net/bonding.h
+++ b/include/net/bonding.h
@@ -285,8 +285,15 @@ static inline bool bond_needs_speed_duplex(const struct bonding *bond)
 
 static inline bool bond_is_nondyn_tlb(const struct bonding *bond)
 {
-	return (BOND_MODE(bond) == BOND_MODE_TLB)  &&
-	       (bond->params.tlb_dynamic_lb == 0);
+	return (bond_is_lb(bond) && bond->params.tlb_dynamic_lb == 0);
+}
+
+static inline bool bond_mode_can_use_xmit_hash(const struct bonding *bond)
+{
+	return (BOND_MODE(bond) == BOND_MODE_8023AD ||
+		BOND_MODE(bond) == BOND_MODE_XOR ||
+		BOND_MODE(bond) == BOND_MODE_TLB ||
+		BOND_MODE(bond) == BOND_MODE_ALB);
 }
 
 static inline bool bond_mode_uses_xmit_hash(const struct bonding *bond)
-- 
2.17.0

^ permalink raw reply related

* [PATCH net-next v2 2/4] bonding: use common mac addr checks
From: Debabrata Banerjee @ 2018-05-14 18:48 UTC (permalink / raw)
  To: David S . Miller, netdev, Jay Vosburgh
  Cc: Veaceslav Falico, Andy Gospodarek, dbanerje
In-Reply-To: <20180514184810.15506-1-dbanerje@akamai.com>

Replace homegrown mac addr checks with faster defs from etherdevice.h

Note that this will also prevent any rlb arp updates for multicast
addresses, however this should have been forbidden anyway.

Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
---
 drivers/net/bonding/bond_alb.c | 28 +++++++++-------------------
 1 file changed, 9 insertions(+), 19 deletions(-)

diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
index c2f6c58e4e6a..180e50f7806f 100644
--- a/drivers/net/bonding/bond_alb.c
+++ b/drivers/net/bonding/bond_alb.c
@@ -40,11 +40,6 @@
 #include <net/bonding.h>
 #include <net/bond_alb.h>
 
-
-
-static const u8 mac_bcast[ETH_ALEN + 2] __long_aligned = {
-	0xff, 0xff, 0xff, 0xff, 0xff, 0xff
-};
 static const u8 mac_v6_allmcast[ETH_ALEN + 2] __long_aligned = {
 	0x33, 0x33, 0x00, 0x00, 0x00, 0x01
 };
@@ -420,9 +415,7 @@ static void rlb_clear_slave(struct bonding *bond, struct slave *slave)
 
 			if (assigned_slave) {
 				rx_hash_table[index].slave = assigned_slave;
-				if (!ether_addr_equal_64bits(rx_hash_table[index].mac_dst,
-							     mac_bcast) &&
-				    !is_zero_ether_addr(rx_hash_table[index].mac_dst)) {
+				if (is_valid_ether_addr(rx_hash_table[index].mac_dst)) {
 					bond_info->rx_hashtbl[index].ntt = 1;
 					bond_info->rx_ntt = 1;
 					/* A slave has been removed from the
@@ -525,8 +518,7 @@ static void rlb_req_update_slave_clients(struct bonding *bond, struct slave *sla
 		client_info = &(bond_info->rx_hashtbl[hash_index]);
 
 		if ((client_info->slave == slave) &&
-		    !ether_addr_equal_64bits(client_info->mac_dst, mac_bcast) &&
-		    !is_zero_ether_addr(client_info->mac_dst)) {
+		    is_valid_ether_addr(client_info->mac_dst)) {
 			client_info->ntt = 1;
 			ntt = 1;
 		}
@@ -567,8 +559,7 @@ static void rlb_req_update_subnet_clients(struct bonding *bond, __be32 src_ip)
 		if ((client_info->ip_src == src_ip) &&
 		    !ether_addr_equal_64bits(client_info->slave->dev->dev_addr,
 					     bond->dev->dev_addr) &&
-		    !ether_addr_equal_64bits(client_info->mac_dst, mac_bcast) &&
-		    !is_zero_ether_addr(client_info->mac_dst)) {
+		    is_valid_ether_addr(client_info->mac_dst)) {
 			client_info->ntt = 1;
 			bond_info->rx_ntt = 1;
 		}
@@ -596,7 +587,7 @@ static struct slave *rlb_choose_channel(struct sk_buff *skb, struct bonding *bon
 		if ((client_info->ip_src == arp->ip_src) &&
 		    (client_info->ip_dst == arp->ip_dst)) {
 			/* the entry is already assigned to this client */
-			if (!ether_addr_equal_64bits(arp->mac_dst, mac_bcast)) {
+			if (!is_broadcast_ether_addr(arp->mac_dst)) {
 				/* update mac address from arp */
 				ether_addr_copy(client_info->mac_dst, arp->mac_dst);
 			}
@@ -644,8 +635,7 @@ static struct slave *rlb_choose_channel(struct sk_buff *skb, struct bonding *bon
 		ether_addr_copy(client_info->mac_src, arp->mac_src);
 		client_info->slave = assigned_slave;
 
-		if (!ether_addr_equal_64bits(client_info->mac_dst, mac_bcast) &&
-		    !is_zero_ether_addr(client_info->mac_dst)) {
+		if (is_valid_ether_addr(client_info->mac_dst)) {
 			client_info->ntt = 1;
 			bond->alb_info.rx_ntt = 1;
 		} else {
@@ -1418,9 +1408,9 @@ int bond_alb_xmit(struct sk_buff *skb, struct net_device *bond_dev)
 	case ETH_P_IP: {
 		const struct iphdr *iph = ip_hdr(skb);
 
-		if (ether_addr_equal_64bits(eth_data->h_dest, mac_bcast) ||
-		    (iph->daddr == ip_bcast) ||
-		    (iph->protocol == IPPROTO_IGMP)) {
+		if (is_broadcast_ether_addr(eth_data->h_dest) ||
+		    iph->daddr == ip_bcast ||
+		    iph->protocol == IPPROTO_IGMP) {
 			do_tx_balance = false;
 			break;
 		}
@@ -1432,7 +1422,7 @@ int bond_alb_xmit(struct sk_buff *skb, struct net_device *bond_dev)
 		/* IPv6 doesn't really use broadcast mac address, but leave
 		 * that here just in case.
 		 */
-		if (ether_addr_equal_64bits(eth_data->h_dest, mac_bcast)) {
+		if (is_broadcast_ether_addr(eth_data->h_dest)) {
 			do_tx_balance = false;
 			break;
 		}
-- 
2.17.0

^ permalink raw reply related

* [PATCH net-next v2 4/4] bonding: allow carrier and link status to determine link state
From: Debabrata Banerjee @ 2018-05-14 18:48 UTC (permalink / raw)
  To: David S . Miller, netdev, Jay Vosburgh
  Cc: Veaceslav Falico, Andy Gospodarek, dbanerje
In-Reply-To: <20180514184810.15506-1-dbanerje@akamai.com>

In a mixed environment it may be difficult to tell if your hardware
support carrier, if it does not it can always report true. With a new
use_carrier option of 2, we can check both carrier and link status
sequentially, instead of one or the other

Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
---
 Documentation/networking/bonding.txt |  4 ++--
 drivers/net/bonding/bond_main.c      | 12 ++++++++----
 drivers/net/bonding/bond_options.c   |  7 ++++---
 3 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt
index 9ba04c0bab8d..f063730e7e73 100644
--- a/Documentation/networking/bonding.txt
+++ b/Documentation/networking/bonding.txt
@@ -828,8 +828,8 @@ use_carrier
 	MII / ETHTOOL ioctl method to determine the link state.
 
 	A value of 1 enables the use of netif_carrier_ok(), a value of
-	0 will use the deprecated MII / ETHTOOL ioctls.  The default
-	value is 1.
+	0 will use the deprecated MII / ETHTOOL ioctls. A value of 2
+	will check both.  The default value is 1.
 
 xmit_hash_policy
 
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index f7f8a49cb32b..7e9652c4b35c 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -132,7 +132,7 @@ MODULE_PARM_DESC(downdelay, "Delay before considering link down, "
 			    "in milliseconds");
 module_param(use_carrier, int, 0);
 MODULE_PARM_DESC(use_carrier, "Use netif_carrier_ok (vs MII ioctls) in miimon; "
-			      "0 for off, 1 for on (default)");
+			      "0 for off, 1 for on (default), 2 for carrier then legacy checks");
 module_param(mode, charp, 0);
 MODULE_PARM_DESC(mode, "Mode of operation; 0 for balance-rr, "
 		       "1 for active-backup, 2 for balance-xor, "
@@ -434,12 +434,16 @@ static int bond_check_dev_link(struct bonding *bond,
 	int (*ioctl)(struct net_device *, struct ifreq *, int);
 	struct ifreq ifr;
 	struct mii_ioctl_data *mii;
+	bool carrier = true;
 
 	if (!reporting && !netif_running(slave_dev))
 		return 0;
 
 	if (bond->params.use_carrier)
-		return netif_carrier_ok(slave_dev) ? BMSR_LSTATUS : 0;
+		carrier = netif_carrier_ok(slave_dev) ? BMSR_LSTATUS : 0;
+
+	if (!carrier)
+		return carrier;
 
 	/* Try to get link status using Ethtool first. */
 	if (slave_dev->ethtool_ops->get_link)
@@ -4399,8 +4403,8 @@ static int bond_check_params(struct bond_params *params)
 		downdelay = 0;
 	}
 
-	if ((use_carrier != 0) && (use_carrier != 1)) {
-		pr_warn("Warning: use_carrier module parameter (%d), not of valid value (0/1), so it was set to 1\n",
+	if (use_carrier < 0 || use_carrier > 2) {
+		pr_warn("Warning: use_carrier module parameter (%d), not of valid value (0-2), so it was set to 1\n",
 			use_carrier);
 		use_carrier = 1;
 	}
diff --git a/drivers/net/bonding/bond_options.c b/drivers/net/bonding/bond_options.c
index 8a945c9341d6..dba6cef05134 100644
--- a/drivers/net/bonding/bond_options.c
+++ b/drivers/net/bonding/bond_options.c
@@ -164,9 +164,10 @@ static const struct bond_opt_value bond_primary_reselect_tbl[] = {
 };
 
 static const struct bond_opt_value bond_use_carrier_tbl[] = {
-	{ "off", 0,  0},
-	{ "on",  1,  BOND_VALFLAG_DEFAULT},
-	{ NULL,  -1, 0}
+	{ "off",  0,  0},
+	{ "on",   1,  BOND_VALFLAG_DEFAULT},
+	{ "both", 2,  0},
+	{ NULL,  -1,  0}
 };
 
 static const struct bond_opt_value bond_all_slaves_active_tbl[] = {
-- 
2.17.0

^ permalink raw reply related

* Re: [PATCH 05/14] net: sched: always take reference to action
From: Vlad Buslov @ 2018-05-14 18:49 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, jhs, xiyou.wangcong, pablo, kadlec, fw, ast,
	daniel, edumazet, keescook, linux-kernel, netfilter-devel,
	coreteam, kliteyn
In-Reply-To: <20180514162301.GC2134@nanopsycho.orion>


On Mon 14 May 2018 at 16:23, Jiri Pirko <jiri@resnulli.us> wrote:
> Mon, May 14, 2018 at 04:27:06PM CEST, vladbu@mellanox.com wrote:
>>Without rtnl lock protection it is no longer safe to use pointer to tc
>>action without holding reference to it. (it can be destroyed concurrently)
>>
>>Remove unsafe action idr lookup function. Instead of it, implement safe tcf
>>idr check function that atomically looks up action in idr and increments
>>its reference and bind counters.
>>
>>Implement both action search and check using new safe function.
>>
>>Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
>>---
>> net/sched/act_api.c | 38 ++++++++++++++++----------------------
>> 1 file changed, 16 insertions(+), 22 deletions(-)
>>
>>diff --git a/net/sched/act_api.c b/net/sched/act_api.c
>>index 1331beb..9459cce 100644
>>--- a/net/sched/act_api.c
>>+++ b/net/sched/act_api.c
>>@@ -284,44 +284,38 @@ int tcf_generic_walker(struct tc_action_net *tn, struct sk_buff *skb,
>> }
>> EXPORT_SYMBOL(tcf_generic_walker);
>> 
>>-static struct tc_action *tcf_idr_lookup(u32 index, struct tcf_idrinfo *idrinfo)
>>+bool __tcf_idr_check(struct tc_action_net *tn, u32 index, struct tc_action **a,
>>+		     int bind)
>> {
>>-	struct tc_action *p = NULL;
>>+	struct tcf_idrinfo *idrinfo = tn->idrinfo;
>>+	struct tc_action *p;
>> 
>> 	spin_lock_bh(&idrinfo->lock);
>
> Why "_bh" variant is necessary here?

It is not my code.

>
>> 	p = idr_find(&idrinfo->action_idr, index);
>>+	if (p) {
>>+		refcount_inc(&p->tcfa_refcnt);
>>+		if (bind)
>>+			atomic_inc(&p->tcfa_bindcnt);
>>+	}
>> 	spin_unlock_bh(&idrinfo->lock);
>
> [...]

^ permalink raw reply

* Re: iproute2 - modifying routes in place
From: David Ahern @ 2018-05-14 18:49 UTC (permalink / raw)
  To: Ryan Whelan; +Cc: netdev
In-Reply-To: <CAM3m09SowB_i7eoC4FGH=gYBoYRTdBO5S4paVdJ4Mb1RtYfe=A@mail.gmail.com>

On 5/14/18 12:40 PM, Ryan Whelan wrote:
> Same behavior:
> 
> 
> root@rwhelan-linux ~
> # ip -6 route
> ::1 dev lo proto kernel metric 256 pref medium
> fd9b:caee:ff93:ceef:3431:3831:3930:3031 dev internal0 proto kernel metric
> 256 pref medium
> fd9b:caee:ff93:ceef:3431:3831:3930:3032 dev internal0 src
> fd9b:caee:ff93:ceef:3431:3831:3930:3031 metric 1024 pref medium
> fe80::/64 dev enp0s3 proto kernel metric 256 pref medium
> fe80::/64 dev enp0s8 proto kernel metric 256 pref medium
> fe80::/64 dev internal0 proto kernel metric 256 pref medium
> 
> root@rwhelan-linux ~
> # ip -6 route change fd9b:caee:ff93:ceef:3431:3831:3930:3032 dev internal0
> src fd9b:caee:ff93:ceef:3431:3831:3930:3031 metric 10
> RTNETLINK answers: No such file or directory

'change' only sets NLM_F_REPLACE. Since NLM_F_CREATE is not set ('ip ro
replace') it does not add a new route and expects one to exist. Your
table above shows the prefix with metric 256 not metric 10 so the route
does not match. You should be seeing a message in dmesg to this effect.

> 
> root@rwhelan-linux ~
> # ip -6 route replace fd9b:caee:ff93:ceef:3431:3831:3930:3032 dev internal0
> src fd9b:caee:ff93:ceef:3431:3831:3930:3031 metric 10

Adds a new entry because of NLM_F_CREATE.

> 
> root@rwhelan-linux ~
> # ip -6 route
> ::1 dev lo proto kernel metric 256 pref medium
> fd9b:caee:ff93:ceef:3431:3831:3930:3031 dev internal0 proto kernel metric
> 256 pref medium
> fd9b:caee:ff93:ceef:3431:3831:3930:3032 dev internal0 src
> fd9b:caee:ff93:ceef:3431:3831:3930:3031 metric 10 pref medium
> fd9b:caee:ff93:ceef:3431:3831:3930:3032 dev internal0 src
> fd9b:caee:ff93:ceef:3431:3831:3930:3031 metric 1024 pref medium
> fe80::/64 dev enp0s3 proto kernel metric 256 pref medium
> fe80::/64 dev enp0s8 proto kernel metric 256 pref medium
> fe80::/64 dev internal0 proto kernel metric 256 pref medium
> 
> root@rwhelan-linux ~
> # uname -a
> Linux rwhelan-linux 4.17.0-rc3-ipv6-route-bugs+ #2 SMP Mon May 14 11:30:38
> EDT 2018 x86_64 x86_64 x86_64 GNU/Linux
> On Sat, May 12, 2018 at 1:01 PM David Ahern <dsahern@gmail.com> wrote:
> 
>> On 5/11/18 4:42 AM, Ryan Whelan wrote:
>>> `ip route` has 2 subcommands that don't seem to work as expected and i'm
>>> not sure if its a bug, or if i'm misunderstanding the semantics.
> 
>> Can you try with ipv6/route-bugs branch in
>> https://github.com/dsahern/linux

^ permalink raw reply

* Re: [PATCH net-next 0/3] net: dsa: mv88e6xxx: remove Global 1 setup
From: David Miller @ 2018-05-14 18:50 UTC (permalink / raw)
  To: vivien.didelot; +Cc: netdev, linux-kernel, kernel, andrew, f.fainelli
In-Reply-To: <20180511211636.25995-1-vivien.didelot@savoirfairelinux.com>

From: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Date: Fri, 11 May 2018 17:16:33 -0400

> The mv88e6xxx driver is still writing arbitrary registers at setup time,
> e.g. priority override bits. Add ops for them and provide specific setup
> functions for priority and stats before getting rid of the erroneous
> mv88e6xxx_g1_setup code, as previously done with Global 2.

Series applied, thanks Vivien.

^ permalink raw reply

* Re: [PATCH] net/mlx5: Use 'kvfree()' for memory allocated by 'kvzalloc()'
From: David Miller @ 2018-05-14 18:56 UTC (permalink / raw)
  To: christophe.jaillet
  Cc: saeedm, matanb, leon, netdev, linux-rdma, linux-kernel,
	kernel-janitors
In-Reply-To: <20180512170925.23690-1-christophe.jaillet@wanadoo.fr>

From: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Date: Sat, 12 May 2018 19:09:25 +0200

> 'out' is allocated with 'kvzalloc()'. 'kvfree()' must be used to free it.
> 
> Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>

Saeed, I assume I will see this in one of your forthcoming pull
requests.

Thanks.

^ permalink raw reply

* [PATCH net-next v10 0/7] sched: Add Common Applications Kept Enhanced (cake) qdisc
From: Toke Høiland-Jørgensen @ 2018-05-14 19:00 UTC (permalink / raw)
  To: netdev; +Cc: cake

This patch series adds the CAKE qdisc, and has been split up to ease
review.

I have attempted to split out each configurable feature into its own patch.
The first commit adds the base shaper and packet scheduler, while
subsequent commits add the optional features. The full userspace API and
most data structures are included in this commit, but options not
understood in the base version will be ignored.

The result of applying the entire series is identical to the out of tree
version that have seen extensive testing in previous deployments, most
notably as an out of tree patch to OpenWrt. However, note that I have only
compile tested the individual patches; so the whole series should be
considered as a unit.

---
Changelog

v10:
  - Christmas tree gardening (fix variable declarations to be in reverse
    line length order)

v9:
  - Remove duplicated checks around kvfree() and just call it
    unconditionally.
  - Don't pass __GFP_NOWARN when allocating memory
  - Move options in cake_dump() that are related to optional features to
    later patches implementing the features.
  - Support attaching filters to the qdisc and use the classification
    result to select flow queue.
  - Support overriding diffserv priority tin from skb->priority

v8:
  - Remove inline keyword from function definitions
  - Simplify ACK filter; remove the complex state handling to make the
    logic easier to follow. This will potentially be a bit less efficient,
    but I have not been able to measure a difference.

v7:
  - Split up patch into a series to ease review.
  - Constify the ACK filter.

v6:
  - Fix 6in4 encapsulation checks in ACK filter code
  - Checkpatch fixes

v5:
  - Refactor ACK filter code and hopefully fix the safety issues
    properly this time.

v4:
  - Only split GSO packets if shaping at speeds <= 1Gbps
  - Fix overhead calculation code to also work for GSO packets
  - Don't re-implement kvzalloc()
  - Remove local header include from out-of-tree build (fixes kbuild-bot
    complaint).
  - Several fixes to the ACK filter:
    - Check pskb_may_pull() before deref of transport headers.
    - Don't run ACK filter logic on split GSO packets
    - Fix TCP sequence number compare to deal with wraparounds

v3:
  - Use IS_REACHABLE() macro to fix compilation when sch_cake is
    built-in and conntrack is a module.
  - Switch the stats output to use nested netlink attributes instead
    of a versioned struct.
  - Remove GPL boilerplate.
  - Fix array initialisation style.

v2:
  - Fix kbuild test bot complaint
  - Clean up the netlink ABI
  - Fix checkpatch complaints
  - A few tweaks to the behaviour of cake based on testing carried out
    while writing the paper.

---

Toke Høiland-Jørgensen (7):
      sched: Add Common Applications Kept Enhanced (cake) qdisc
      sch_cake: Add ingress mode
      sch_cake: Add optional ACK filter
      sch_cake: Add NAT awareness to packet classifier
      sch_cake: Add DiffServ handling
      sch_cake: Add overhead compensation support to the rate shaper
      sch_cake: Conditionally split GSO segments


 include/uapi/linux/pkt_sched.h |  105 ++
 net/sched/Kconfig              |   11 
 net/sched/Makefile             |    1 
 net/sched/sch_cake.c           | 2684 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 2801 insertions(+)
 create mode 100644 net/sched/sch_cake.c

^ permalink raw reply

* [PATCH net-next v10 5/7] sch_cake: Add DiffServ handling
From: Toke Høiland-Jørgensen @ 2018-05-14 19:00 UTC (permalink / raw)
  To: netdev; +Cc: cake
In-Reply-To: <152632431302.4861.16657365789045735410.stgit@alrua-kau>

This adds support for DiffServ-based priority queueing to CAKE. If the
shaper is in use, each priority tier gets its own virtual clock, which
limits that tier's rate to a fraction of the overall shaped rate, to
discourage trying to game the priority mechanism.

CAKE defaults to a simple, three-tier mode that interprets most code points
as "best effort", but places CS1 traffic into a low-priority "bulk" tier
which is assigned 1/16 of the total rate, and a few code points indicating
latency-sensitive or control traffic (specifically TOS4, VA, EF, CS6, CS7)
into a "latency sensitive" high-priority tier, which is assigned 1/4 rate.
The other supported DiffServ modes are a 4-tier mode matching the 802.11e
precedence rules, as well as two 8-tier modes, one of which implements
strict precedence of the eight priority levels.

This commit also adds an optional DiffServ 'wash' mode, which will zero out
the DSCP fields of any packet passing through CAKE. While this can
technically be done with other mechanisms in the kernel, having the feature
available in CAKE significantly decreases configuration complexity; and the
implementation cost is low on top of the other DiffServ-handling code.

Filters and applications can set the skb->priority field to override the
DSCP-based classification into tiers. If TC_H_MAJ(skb->priority) matches CAKE's
qdisc handle, the minor number will be interpreted as a priority tier if it is
less than or equal to the number of configured priority tiers.

Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
---
 net/sched/sch_cake.c |  408 +++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 401 insertions(+), 7 deletions(-)

diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c
index 2802bb2ace84..ccc6f26b306c 100644
--- a/net/sched/sch_cake.c
+++ b/net/sched/sch_cake.c
@@ -307,6 +307,68 @@ static void cobalt_set_enqueue_time(struct sk_buff *skb,
 
 static u16 quantum_div[CAKE_QUEUES + 1] = {0};
 
+/* Diffserv lookup tables */
+
+static const u8 precedence[] = {
+	0, 0, 0, 0, 0, 0, 0, 0,
+	1, 1, 1, 1, 1, 1, 1, 1,
+	2, 2, 2, 2, 2, 2, 2, 2,
+	3, 3, 3, 3, 3, 3, 3, 3,
+	4, 4, 4, 4, 4, 4, 4, 4,
+	5, 5, 5, 5, 5, 5, 5, 5,
+	6, 6, 6, 6, 6, 6, 6, 6,
+	7, 7, 7, 7, 7, 7, 7, 7,
+};
+
+static const u8 diffserv8[] = {
+	2, 5, 1, 2, 4, 2, 2, 2,
+	0, 2, 1, 2, 1, 2, 1, 2,
+	5, 2, 4, 2, 4, 2, 4, 2,
+	3, 2, 3, 2, 3, 2, 3, 2,
+	6, 2, 3, 2, 3, 2, 3, 2,
+	6, 2, 2, 2, 6, 2, 6, 2,
+	7, 2, 2, 2, 2, 2, 2, 2,
+	7, 2, 2, 2, 2, 2, 2, 2,
+};
+
+static const u8 diffserv4[] = {
+	0, 2, 0, 0, 2, 0, 0, 0,
+	1, 0, 0, 0, 0, 0, 0, 0,
+	2, 0, 2, 0, 2, 0, 2, 0,
+	2, 0, 2, 0, 2, 0, 2, 0,
+	3, 0, 2, 0, 2, 0, 2, 0,
+	3, 0, 0, 0, 3, 0, 3, 0,
+	3, 0, 0, 0, 0, 0, 0, 0,
+	3, 0, 0, 0, 0, 0, 0, 0,
+};
+
+static const u8 diffserv3[] = {
+	0, 0, 0, 0, 2, 0, 0, 0,
+	1, 0, 0, 0, 0, 0, 0, 0,
+	0, 0, 0, 0, 0, 0, 0, 0,
+	0, 0, 0, 0, 0, 0, 0, 0,
+	0, 0, 0, 0, 0, 0, 0, 0,
+	0, 0, 0, 0, 2, 0, 2, 0,
+	2, 0, 0, 0, 0, 0, 0, 0,
+	2, 0, 0, 0, 0, 0, 0, 0,
+};
+
+static const u8 besteffort[] = {
+	0, 0, 0, 0, 0, 0, 0, 0,
+	0, 0, 0, 0, 0, 0, 0, 0,
+	0, 0, 0, 0, 0, 0, 0, 0,
+	0, 0, 0, 0, 0, 0, 0, 0,
+	0, 0, 0, 0, 0, 0, 0, 0,
+	0, 0, 0, 0, 0, 0, 0, 0,
+	0, 0, 0, 0, 0, 0, 0, 0,
+	0, 0, 0, 0, 0, 0, 0, 0,
+};
+
+/* tin priority order for stats dumping */
+
+static const u8 normal_order[] = {0, 1, 2, 3, 4, 5, 6, 7};
+static const u8 bulk_order[] = {1, 0, 2, 3};
+
 #define REC_INV_SQRT_CACHE (16)
 static u32 cobalt_rec_inv_sqrt_cache[REC_INV_SQRT_CACHE] = {0};
 
@@ -1224,6 +1286,46 @@ static unsigned int cake_drop(struct Qdisc *sch, struct sk_buff **to_free)
 	return idx + (tin << 16);
 }
 
+static void cake_wash_diffserv(struct sk_buff *skb)
+{
+	switch (skb->protocol) {
+	case htons(ETH_P_IP):
+		ipv4_change_dsfield(ip_hdr(skb), INET_ECN_MASK, 0);
+		break;
+	case htons(ETH_P_IPV6):
+		ipv6_change_dsfield(ipv6_hdr(skb), INET_ECN_MASK, 0);
+		break;
+	default:
+		break;
+	}
+}
+
+static u8 cake_handle_diffserv(struct sk_buff *skb, u16 wash)
+{
+	u8 dscp;
+
+	switch (skb->protocol) {
+	case htons(ETH_P_IP):
+		dscp = ipv4_get_dsfield(ip_hdr(skb)) >> 2;
+		if (wash && dscp)
+			ipv4_change_dsfield(ip_hdr(skb), INET_ECN_MASK, 0);
+		return dscp;
+
+	case htons(ETH_P_IPV6):
+		dscp = ipv6_get_dsfield(ipv6_hdr(skb)) >> 2;
+		if (wash && dscp)
+			ipv6_change_dsfield(ipv6_hdr(skb), INET_ECN_MASK, 0);
+		return dscp;
+
+	case htons(ETH_P_ARP):
+		return 0x38;  /* CS7 - Net Control */
+
+	default:
+		/* If there is no Diffserv field, treat as best-effort */
+		return 0;
+	}
+}
+
 static void cake_reconfigure(struct Qdisc *sch);
 
 static s32 cake_enqueue(struct sk_buff *skb, struct Qdisc *sch,
@@ -1238,7 +1340,26 @@ static s32 cake_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 	struct cake_flow *flow;
 	u32 idx, tin;
 
-	tin = 0;
+	if (TC_H_MAJ(skb->priority) == sch->handle &&
+	    TC_H_MIN(skb->priority) > 0 &&
+	    TC_H_MIN(skb->priority) <= q->tin_cnt) {
+		tin = TC_H_MIN(skb->priority) - 1;
+
+		if (q->rate_flags & CAKE_FLAG_WASH)
+			cake_wash_diffserv(skb);
+	} else if (q->tin_mode != CAKE_DIFFSERV_BESTEFFORT) {
+		/* extract the Diffserv Precedence field, if it exists */
+		/* and clear DSCP bits if washing */
+		tin = q->tin_index[cake_handle_diffserv(skb,
+				q->rate_flags & CAKE_FLAG_WASH)];
+		if (unlikely(tin >= q->tin_cnt))
+			tin = 0;
+	} else {
+		tin = 0;
+		if (q->rate_flags & CAKE_FLAG_WASH)
+			cake_wash_diffserv(skb);
+	}
+
 	b = &q->tins[tin];
 
 	/* choose flow to insert into */
@@ -1720,18 +1841,274 @@ static void cake_set_rate(struct cake_tin_data *b, u64 rate, u32 mtu,
 	b->cparams.p_dec = 1 << 20; /* 1/4096 */
 }
 
-static void cake_reconfigure(struct Qdisc *sch)
+static int cake_config_besteffort(struct Qdisc *sch)
 {
 	struct cake_sched_data *q = qdisc_priv(sch);
 	struct cake_tin_data *b = &q->tins[0];
-	int c, ft = 0;
+	u32 mtu = psched_mtu(qdisc_dev(sch));
+	u32 rate = q->rate_bps;
 
 	q->tin_cnt = 1;
-	cake_set_rate(b, q->rate_bps, psched_mtu(qdisc_dev(sch)),
-		      US2TIME(q->target), US2TIME(q->interval));
+
+	q->tin_index = besteffort;
+	q->tin_order = normal_order;
+
+	cake_set_rate(b, rate, mtu, US2TIME(q->target), US2TIME(q->interval));
 	b->tin_quantum_band = 65535;
 	b->tin_quantum_prio = 65535;
 
+	return 0;
+}
+
+static int cake_config_precedence(struct Qdisc *sch)
+{
+	/* convert high-level (user visible) parameters into internal format */
+	struct cake_sched_data *q = qdisc_priv(sch);
+	u32 mtu = psched_mtu(qdisc_dev(sch));
+	u32 rate = q->rate_bps;
+	u32 quantum1 = 256;
+	u32 quantum2 = 256;
+	u32 i;
+
+	q->tin_cnt = 8;
+	q->tin_index = precedence;
+	q->tin_order = normal_order;
+
+	for (i = 0; i < q->tin_cnt; i++) {
+		struct cake_tin_data *b = &q->tins[i];
+
+		cake_set_rate(b, rate, mtu, US2TIME(q->target),
+			      US2TIME(q->interval));
+
+		b->tin_quantum_prio = max_t(u16, 1U, quantum1);
+		b->tin_quantum_band = max_t(u16, 1U, quantum2);
+
+		/* calculate next class's parameters */
+		rate  *= 7;
+		rate >>= 3;
+
+		quantum1  *= 3;
+		quantum1 >>= 1;
+
+		quantum2  *= 7;
+		quantum2 >>= 3;
+	}
+
+	return 0;
+}
+
+/*	List of known Diffserv codepoints:
+ *
+ *	Least Effort (CS1)
+ *	Best Effort (CS0)
+ *	Max Reliability & LLT "Lo" (TOS1)
+ *	Max Throughput (TOS2)
+ *	Min Delay (TOS4)
+ *	LLT "La" (TOS5)
+ *	Assured Forwarding 1 (AF1x) - x3
+ *	Assured Forwarding 2 (AF2x) - x3
+ *	Assured Forwarding 3 (AF3x) - x3
+ *	Assured Forwarding 4 (AF4x) - x3
+ *	Precedence Class 2 (CS2)
+ *	Precedence Class 3 (CS3)
+ *	Precedence Class 4 (CS4)
+ *	Precedence Class 5 (CS5)
+ *	Precedence Class 6 (CS6)
+ *	Precedence Class 7 (CS7)
+ *	Voice Admit (VA)
+ *	Expedited Forwarding (EF)
+
+ *	Total 25 codepoints.
+ */
+
+/*	List of traffic classes in RFC 4594:
+ *		(roughly descending order of contended priority)
+ *		(roughly ascending order of uncontended throughput)
+ *
+ *	Network Control (CS6,CS7)      - routing traffic
+ *	Telephony (EF,VA)         - aka. VoIP streams
+ *	Signalling (CS5)               - VoIP setup
+ *	Multimedia Conferencing (AF4x) - aka. video calls
+ *	Realtime Interactive (CS4)     - eg. games
+ *	Multimedia Streaming (AF3x)    - eg. YouTube, NetFlix, Twitch
+ *	Broadcast Video (CS3)
+ *	Low Latency Data (AF2x,TOS4)      - eg. database
+ *	Ops, Admin, Management (CS2,TOS1) - eg. ssh
+ *	Standard Service (CS0 & unrecognised codepoints)
+ *	High Throughput Data (AF1x,TOS2)  - eg. web traffic
+ *	Low Priority Data (CS1)           - eg. BitTorrent
+
+ *	Total 12 traffic classes.
+ */
+
+static int cake_config_diffserv8(struct Qdisc *sch)
+{
+/*	Pruned list of traffic classes for typical applications:
+ *
+ *		Network Control          (CS6, CS7)
+ *		Minimum Latency          (EF, VA, CS5, CS4)
+ *		Interactive Shell        (CS2, TOS1)
+ *		Low Latency Transactions (AF2x, TOS4)
+ *		Video Streaming          (AF4x, AF3x, CS3)
+ *		Bog Standard             (CS0 etc.)
+ *		High Throughput          (AF1x, TOS2)
+ *		Background Traffic       (CS1)
+ *
+ *		Total 8 traffic classes.
+ */
+
+	struct cake_sched_data *q = qdisc_priv(sch);
+	u32 mtu = psched_mtu(qdisc_dev(sch));
+	u32 rate = q->rate_bps;
+	u32 quantum1 = 256;
+	u32 quantum2 = 256;
+	u32 i;
+
+	q->tin_cnt = 8;
+
+	/* codepoint to class mapping */
+	q->tin_index = diffserv8;
+	q->tin_order = normal_order;
+
+	/* class characteristics */
+	for (i = 0; i < q->tin_cnt; i++) {
+		struct cake_tin_data *b = &q->tins[i];
+
+		cake_set_rate(b, rate, mtu, US2TIME(q->target),
+			      US2TIME(q->interval));
+
+		b->tin_quantum_prio = max_t(u16, 1U, quantum1);
+		b->tin_quantum_band = max_t(u16, 1U, quantum2);
+
+		/* calculate next class's parameters */
+		rate  *= 7;
+		rate >>= 3;
+
+		quantum1  *= 3;
+		quantum1 >>= 1;
+
+		quantum2  *= 7;
+		quantum2 >>= 3;
+	}
+
+	return 0;
+}
+
+static int cake_config_diffserv4(struct Qdisc *sch)
+{
+/*  Further pruned list of traffic classes for four-class system:
+ *
+ *	    Latency Sensitive  (CS7, CS6, EF, VA, CS5, CS4)
+ *	    Streaming Media    (AF4x, AF3x, CS3, AF2x, TOS4, CS2, TOS1)
+ *	    Best Effort        (CS0, AF1x, TOS2, and those not specified)
+ *	    Background Traffic (CS1)
+ *
+ *		Total 4 traffic classes.
+ */
+
+	struct cake_sched_data *q = qdisc_priv(sch);
+	u32 mtu = psched_mtu(qdisc_dev(sch));
+	u32 rate = q->rate_bps;
+	u32 quantum = 1024;
+
+	q->tin_cnt = 4;
+
+	/* codepoint to class mapping */
+	q->tin_index = diffserv4;
+	q->tin_order = bulk_order;
+
+	/* class characteristics */
+	cake_set_rate(&q->tins[0], rate, mtu,
+		      US2TIME(q->target), US2TIME(q->interval));
+	cake_set_rate(&q->tins[1], rate >> 4, mtu,
+		      US2TIME(q->target), US2TIME(q->interval));
+	cake_set_rate(&q->tins[2], rate >> 1, mtu,
+		      US2TIME(q->target), US2TIME(q->interval));
+	cake_set_rate(&q->tins[3], rate >> 2, mtu,
+		      US2TIME(q->target), US2TIME(q->interval));
+
+	/* priority weights */
+	q->tins[0].tin_quantum_prio = quantum;
+	q->tins[1].tin_quantum_prio = quantum >> 4;
+	q->tins[2].tin_quantum_prio = quantum << 2;
+	q->tins[3].tin_quantum_prio = quantum << 4;
+
+	/* bandwidth-sharing weights */
+	q->tins[0].tin_quantum_band = quantum;
+	q->tins[1].tin_quantum_band = quantum >> 4;
+	q->tins[2].tin_quantum_band = quantum >> 1;
+	q->tins[3].tin_quantum_band = quantum >> 2;
+
+	return 0;
+}
+
+static int cake_config_diffserv3(struct Qdisc *sch)
+{
+/*  Simplified Diffserv structure with 3 tins.
+ *		Low Priority		(CS1)
+ *		Best Effort
+ *		Latency Sensitive	(TOS4, VA, EF, CS6, CS7)
+ */
+	struct cake_sched_data *q = qdisc_priv(sch);
+	u32 mtu = psched_mtu(qdisc_dev(sch));
+	u32 rate = q->rate_bps;
+	u32 quantum = 1024;
+
+	q->tin_cnt = 3;
+
+	/* codepoint to class mapping */
+	q->tin_index = diffserv3;
+	q->tin_order = bulk_order;
+
+	/* class characteristics */
+	cake_set_rate(&q->tins[0], rate, mtu,
+		      US2TIME(q->target), US2TIME(q->interval));
+	cake_set_rate(&q->tins[1], rate >> 4, mtu,
+		      US2TIME(q->target), US2TIME(q->interval));
+	cake_set_rate(&q->tins[2], rate >> 2, mtu,
+		      US2TIME(q->target), US2TIME(q->interval));
+
+	/* priority weights */
+	q->tins[0].tin_quantum_prio = quantum;
+	q->tins[1].tin_quantum_prio = quantum >> 4;
+	q->tins[2].tin_quantum_prio = quantum << 4;
+
+	/* bandwidth-sharing weights */
+	q->tins[0].tin_quantum_band = quantum;
+	q->tins[1].tin_quantum_band = quantum >> 4;
+	q->tins[2].tin_quantum_band = quantum >> 2;
+
+	return 0;
+}
+
+static void cake_reconfigure(struct Qdisc *sch)
+{
+	struct cake_sched_data *q = qdisc_priv(sch);
+	int c, ft;
+
+	switch (q->tin_mode) {
+	case CAKE_DIFFSERV_BESTEFFORT:
+		ft = cake_config_besteffort(sch);
+		break;
+
+	case CAKE_DIFFSERV_PRECEDENCE:
+		ft = cake_config_precedence(sch);
+		break;
+
+	case CAKE_DIFFSERV_DIFFSERV8:
+		ft = cake_config_diffserv8(sch);
+		break;
+
+	case CAKE_DIFFSERV_DIFFSERV4:
+		ft = cake_config_diffserv4(sch);
+		break;
+
+	case CAKE_DIFFSERV_DIFFSERV3:
+	default:
+		ft = cake_config_diffserv3(sch);
+		break;
+	}
+
 	for (c = q->tin_cnt; c < CAKE_MAX_TINS; c++) {
 		cake_clear_tin(sch, c);
 		q->tins[c].cparams.mtu_time = q->tins[ft].cparams.mtu_time;
@@ -1775,6 +2152,16 @@ static int cake_change(struct Qdisc *sch, struct nlattr *opt,
 	if (tb[TCA_CAKE_BASE_RATE])
 		q->rate_bps = nla_get_u32(tb[TCA_CAKE_BASE_RATE]);
 
+	if (tb[TCA_CAKE_DIFFSERV_MODE])
+		q->tin_mode = nla_get_u32(tb[TCA_CAKE_DIFFSERV_MODE]);
+
+	if (tb[TCA_CAKE_WASH]) {
+		if (!!nla_get_u32(tb[TCA_CAKE_WASH]))
+			q->rate_flags |= CAKE_FLAG_WASH;
+		else
+			q->rate_flags &= ~CAKE_FLAG_WASH;
+	}
+
 	if (tb[TCA_CAKE_FLOW_MODE])
 		q->flow_mode = (nla_get_u32(tb[TCA_CAKE_FLOW_MODE]) &
 				CAKE_FLOW_MASK);
@@ -1844,7 +2231,7 @@ static int cake_init(struct Qdisc *sch, struct nlattr *opt,
 	int i, j, err;
 
 	sch->limit = 10240;
-	q->tin_mode = CAKE_DIFFSERV_BESTEFFORT;
+	q->tin_mode = CAKE_DIFFSERV_DIFFSERV3;
 	q->flow_mode  = CAKE_FLOW_TRIPLE;
 
 	q->rate_bps = 0; /* unlimited by default */
@@ -1952,6 +2339,13 @@ static int cake_dump(struct Qdisc *sch, struct sk_buff *skb)
 	if (nla_put_u32(skb, TCA_CAKE_NAT, !!(q->flow_mode & CAKE_FLOW_NAT_FLAG)))
 		goto nla_put_failure;
 
+	if (nla_put_u32(skb, TCA_CAKE_DIFFSERV_MODE, q->tin_mode))
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, TCA_CAKE_WASH,
+			!!(q->rate_flags & CAKE_FLAG_WASH)))
+		goto nla_put_failure;
+
 	return nla_nest_end(skb, opts);
 
 nla_put_failure:
@@ -1999,7 +2393,7 @@ static int cake_dump_stats(struct Qdisc *sch, struct gnet_dump *d)
 	} while (0)
 
 	for (i = 0; i < q->tin_cnt; i++) {
-		struct cake_tin_data *b = &q->tins[i];
+		struct cake_tin_data *b = &q->tins[q->tin_order[i]];
 
 		ts = nla_nest_start(d->skb, i + 1);
 		if (!ts)

^ permalink raw reply related

* [PATCH net-next v10 4/7] sch_cake: Add NAT awareness to packet classifier
From: Toke Høiland-Jørgensen @ 2018-05-14 19:00 UTC (permalink / raw)
  To: netdev; +Cc: cake
In-Reply-To: <152632431302.4861.16657365789045735410.stgit@alrua-kau>

When CAKE is deployed on a gateway that also performs NAT (which is a
common deployment mode), the host fairness mechanism cannot distinguish
internal hosts from each other, and so fails to work correctly.

To fix this, we add an optional NAT awareness mode, which will query the
kernel conntrack mechanism to obtain the pre-NAT addresses for each packet
and use that in the flow and host hashing.

When the shaper is enabled and the host is already performing NAT, the cost
of this lookup is negligible. However, in unlimited mode with no NAT being
performed, there is a significant CPU cost at higher bandwidths. For this
reason, the feature is turned off by default.

Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
---
 net/sched/sch_cake.c |   72 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 72 insertions(+)

diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c
index 4bc178c09f3a..2802bb2ace84 100644
--- a/net/sched/sch_cake.c
+++ b/net/sched/sch_cake.c
@@ -71,6 +71,12 @@
 #include <net/tcp.h>
 #include <net/flow_dissector.h>
 
+#if IS_REACHABLE(CONFIG_NF_CONNTRACK)
+#include <net/netfilter/nf_conntrack_core.h>
+#include <net/netfilter/nf_conntrack_zones.h>
+#include <net/netfilter/nf_conntrack.h>
+#endif
+
 #define CAKE_SET_WAYS (8)
 #define CAKE_MAX_TINS (8)
 #define CAKE_QUEUES (1024)
@@ -522,6 +528,60 @@ static bool cobalt_should_drop(struct cobalt_vars *vars,
 	return drop;
 }
 
+#if IS_REACHABLE(CONFIG_NF_CONNTRACK)
+
+static void cake_update_flowkeys(struct flow_keys *keys,
+				 const struct sk_buff *skb)
+{
+	const struct nf_conntrack_tuple *tuple;
+	enum ip_conntrack_info ctinfo;
+	struct nf_conn *ct;
+	bool rev = false;
+
+	if (tc_skb_protocol(skb) != htons(ETH_P_IP))
+		return;
+
+	ct = nf_ct_get(skb, &ctinfo);
+	if (ct) {
+		tuple = nf_ct_tuple(ct, CTINFO2DIR(ctinfo));
+	} else {
+		const struct nf_conntrack_tuple_hash *hash;
+		struct nf_conntrack_tuple srctuple;
+
+		if (!nf_ct_get_tuplepr(skb, skb_network_offset(skb),
+				       NFPROTO_IPV4, dev_net(skb->dev),
+				       &srctuple))
+			return;
+
+		hash = nf_conntrack_find_get(dev_net(skb->dev),
+					     &nf_ct_zone_dflt,
+					     &srctuple);
+		if (!hash)
+			return;
+
+		rev = true;
+		ct = nf_ct_tuplehash_to_ctrack(hash);
+		tuple = nf_ct_tuple(ct, !hash->tuple.dst.dir);
+	}
+
+	keys->addrs.v4addrs.src = rev ? tuple->dst.u3.ip : tuple->src.u3.ip;
+	keys->addrs.v4addrs.dst = rev ? tuple->src.u3.ip : tuple->dst.u3.ip;
+
+	if (keys->ports.ports) {
+		keys->ports.src = rev ? tuple->dst.u.all : tuple->src.u.all;
+		keys->ports.dst = rev ? tuple->src.u.all : tuple->dst.u.all;
+	}
+	if (rev)
+		nf_ct_put(ct);
+}
+#else
+static void cake_update_flowkeys(struct flow_keys *keys,
+				 const struct sk_buff *skb)
+{
+	/* There is nothing we can do here without CONNTRACK */
+}
+#endif
+
 /* Cake has several subtle multiple bit settings. In these cases you
  *  would be matching triple isolate mode as well.
  */
@@ -549,6 +609,9 @@ static u32 cake_hash(struct cake_tin_data *q, const struct sk_buff *skb,
 	skb_flow_dissect_flow_keys(skb, &keys,
 				   FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL);
 
+	if (flow_mode & CAKE_FLOW_NAT_FLAG)
+		cake_update_flowkeys(&keys, skb);
+
 	/* flow_hash_from_keys() sorts the addresses by value, so we have
 	 * to preserve their order in a separate data structure to treat
 	 * src and dst host addresses as independently selectable.
@@ -1716,6 +1779,12 @@ static int cake_change(struct Qdisc *sch, struct nlattr *opt,
 		q->flow_mode = (nla_get_u32(tb[TCA_CAKE_FLOW_MODE]) &
 				CAKE_FLOW_MASK);
 
+	if (tb[TCA_CAKE_NAT]) {
+		q->flow_mode &= ~CAKE_FLOW_NAT_FLAG;
+		q->flow_mode |= CAKE_FLOW_NAT_FLAG *
+			!!nla_get_u32(tb[TCA_CAKE_NAT]);
+	}
+
 	if (tb[TCA_CAKE_RTT]) {
 		q->interval = nla_get_u32(tb[TCA_CAKE_RTT]);
 
@@ -1880,6 +1949,9 @@ static int cake_dump(struct Qdisc *sch, struct sk_buff *skb)
 	if (nla_put_u32(skb, TCA_CAKE_ACK_FILTER, q->ack_filter))
 		goto nla_put_failure;
 
+	if (nla_put_u32(skb, TCA_CAKE_NAT, !!(q->flow_mode & CAKE_FLOW_NAT_FLAG)))
+		goto nla_put_failure;
+
 	return nla_nest_end(skb, opts);
 
 nla_put_failure:

^ permalink raw reply related

* [PATCH net-next v10 2/7] sch_cake: Add ingress mode
From: Toke Høiland-Jørgensen @ 2018-05-14 19:00 UTC (permalink / raw)
  To: netdev; +Cc: cake
In-Reply-To: <152632431302.4861.16657365789045735410.stgit@alrua-kau>

The ingress mode is meant to be enabled when CAKE runs downlink of the
actual bottleneck (such as on an IFB device). The mode changes the shaper
to also account dropped packets to the shaped rate, as these have already
traversed the bottleneck.

Enabling ingress mode will also tune the AQM to always keep at least two
packets queued *for each flow*. This is done by scaling the minimum queue
occupancy level that will disable the AQM by the number of active bulk
flows. The rationale for this is that retransmits are more expensive in
ingress mode, since dropped packets have to traverse the bottleneck again
when they are retransmitted; thus, being more lenient and keeping a minimum
number of packets queued will improve throughput in cases where the number
of active flows are so large that they saturate the bottleneck even at
their minimum window size.

This commit also adds a separate switch to enable ingress mode rate
autoscaling. If enabled, the autoscaling code will observe the actual
traffic rate and adjust the shaper rate to match it. This can help avoid
latency increases in the case where the actual bottleneck rate decreases
below the shaped rate. The scaling filters out spikes by an EWMA filter.

Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
---
 net/sched/sch_cake.c |   78 +++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 74 insertions(+), 4 deletions(-)

diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c
index e22c712602fa..179bfa9e501f 100644
--- a/net/sched/sch_cake.c
+++ b/net/sched/sch_cake.c
@@ -442,7 +442,8 @@ static bool cobalt_queue_empty(struct cobalt_vars *vars,
 static bool cobalt_should_drop(struct cobalt_vars *vars,
 			       struct cobalt_params *p,
 			       cobalt_time_t now,
-			       struct sk_buff *skb)
+			       struct sk_buff *skb,
+			       u32 bulk_flows)
 {
 	bool next_due, over_target, drop = false;
 	cobalt_tdiff_t sojourn, schedule;
@@ -465,6 +466,7 @@ static bool cobalt_should_drop(struct cobalt_vars *vars,
 	sojourn = now - cobalt_get_enqueue_time(skb);
 	schedule = now - vars->drop_next;
 	over_target = sojourn > p->target &&
+		      sojourn > p->mtu_time * bulk_flows * 2 &&
 		      sojourn > p->mtu_time * 4;
 	next_due = vars->count && schedule >= 0;
 
@@ -915,6 +917,9 @@ static unsigned int cake_drop(struct Qdisc *sch, struct sk_buff **to_free)
 	b->tin_dropped++;
 	sch->qstats.drops++;
 
+	if (q->rate_flags & CAKE_FLAG_INGRESS)
+		cake_advance_shaper(q, b, skb, now, true);
+
 	__qdisc_drop(skb, to_free);
 	sch->q.qlen--;
 
@@ -990,8 +995,39 @@ static s32 cake_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 		cake_heapify_up(q, b->overflow_idx[idx]);
 
 	/* incoming bandwidth capacity estimate */
-	q->avg_window_bytes = 0;
-	q->last_packet_time = now;
+	if (q->rate_flags & CAKE_FLAG_AUTORATE_INGRESS) {
+		u64 packet_interval = now - q->last_packet_time;
+
+		if (packet_interval > NSEC_PER_SEC)
+			packet_interval = NSEC_PER_SEC;
+
+		/* filter out short-term bursts, eg. wifi aggregation */
+		q->avg_packet_interval = cake_ewma(q->avg_packet_interval,
+						   packet_interval,
+			packet_interval > q->avg_packet_interval ? 2 : 8);
+
+		q->last_packet_time = now;
+
+		if (packet_interval > q->avg_packet_interval) {
+			u64 window_interval = now - q->avg_window_begin;
+			u64 b = q->avg_window_bytes * (u64)NSEC_PER_SEC;
+
+			do_div(b, window_interval);
+			q->avg_peak_bandwidth =
+				cake_ewma(q->avg_peak_bandwidth, b,
+					  b > q->avg_peak_bandwidth ? 2 : 8);
+			q->avg_window_bytes = 0;
+			q->avg_window_begin = now;
+
+			if (now - q->last_reconfig_time > (NSEC_PER_SEC / 4)) {
+				q->rate_bps = (q->avg_peak_bandwidth * 15) >> 4;
+				cake_reconfigure(sch);
+			}
+		}
+	} else {
+		q->avg_window_bytes = 0;
+		q->last_packet_time = now;
+	}
 
 	/* flowchain */
 	if (!flow->set || flow->set == CAKE_SET_DECAYING) {
@@ -1246,14 +1282,26 @@ static struct sk_buff *cake_dequeue(struct Qdisc *sch)
 		}
 
 		/* Last packet in queue may be marked, shouldn't be dropped */
-		if (!cobalt_should_drop(&flow->cvars, &b->cparams, now, skb) ||
+		if (!cobalt_should_drop(&flow->cvars, &b->cparams, now, skb,
+					(b->bulk_flow_count *
+					 !!(q->rate_flags &
+					    CAKE_FLAG_INGRESS))) ||
 		    !flow->head)
 			break;
 
+		/* drop this packet, get another one */
+		if (q->rate_flags & CAKE_FLAG_INGRESS) {
+			len = cake_advance_shaper(q, b, skb,
+						  now, true);
+			flow->deficit -= len;
+			b->tin_deficit -= len;
+		}
 		b->tin_dropped++;
 		qdisc_tree_reduce_backlog(sch, 1, qdisc_pkt_len(skb));
 		qdisc_qstats_drop(sch);
 		kfree_skb(skb);
+		if (q->rate_flags & CAKE_FLAG_INGRESS)
+			goto retry;
 	}
 
 	b->tin_ecn_mark += !!flow->cvars.ecn_marked;
@@ -1432,6 +1480,20 @@ static int cake_change(struct Qdisc *sch, struct nlattr *opt,
 			q->target = 1;
 	}
 
+	if (tb[TCA_CAKE_AUTORATE]) {
+		if (!!nla_get_u32(tb[TCA_CAKE_AUTORATE]))
+			q->rate_flags |= CAKE_FLAG_AUTORATE_INGRESS;
+		else
+			q->rate_flags &= ~CAKE_FLAG_AUTORATE_INGRESS;
+	}
+
+	if (tb[TCA_CAKE_INGRESS]) {
+		if (!!nla_get_u32(tb[TCA_CAKE_INGRESS]))
+			q->rate_flags |= CAKE_FLAG_INGRESS;
+		else
+			q->rate_flags &= ~CAKE_FLAG_INGRESS;
+	}
+
 	if (tb[TCA_CAKE_MEMORY])
 		q->buffer_config_limit = nla_get_u32(tb[TCA_CAKE_MEMORY]);
 
@@ -1554,6 +1616,14 @@ static int cake_dump(struct Qdisc *sch, struct sk_buff *skb)
 	if (nla_put_u32(skb, TCA_CAKE_MEMORY, q->buffer_config_limit))
 		goto nla_put_failure;
 
+	if (nla_put_u32(skb, TCA_CAKE_AUTORATE,
+			!!(q->rate_flags & CAKE_FLAG_AUTORATE_INGRESS)))
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, TCA_CAKE_INGRESS,
+			!!(q->rate_flags & CAKE_FLAG_INGRESS)))
+		goto nla_put_failure;
+
 	return nla_nest_end(skb, opts);
 
 nla_put_failure:

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox