Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net-next 4/5] tcp: implement mmap() for zero copy receive
From: Eric Dumazet @ 2018-04-19 23:15 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller
  Cc: netdev, Neal Cardwell, Yuchung Cheng, Soheil Hassas Yeganeh
In-Reply-To: <20180416173339.6310-5-edumazet@google.com>



On 04/16/2018 10:33 AM, Eric Dumazet wrote:
> Some networks can make sure TCP payload can exactly fit 4KB pages,
> with well chosen MSS/MTU and architectures.
> 
> Implement mmap() system call so that applications can avoid
> copying data without complex splice() games.
> 
> Note that a successful mmap( X bytes) on TCP socket is consuming
> bytes, as if recvmsg() has been done. (tp->copied += X)
> 

Oh well, I should have run this code with LOCKDEP enabled :/

[  974.320412] ======================================================
[  974.326631] WARNING: possible circular locking dependency detected
[  974.332816] 4.16.0-dbx-DEV #40 Not tainted
[  974.336927] ------------------------------------------------------
[  974.343107] b78299096/15790 is trying to acquire lock:
[  974.348246] 000000006074c9cf (sk_lock-AF_INET6){+.+.}, at: tcp_mmap+0x7c/0x550
[  974.355505] 
               but task is already holding lock:
[  974.361366] 000000008dbe063b (&mm->mmap_sem){++++}, at: vm_mmap_pgoff+0x99/0x100
[  974.368801] 
               which lock already depends on the new lock.

[  974.377010] 
               the existing dependency chain (in reverse order) is:
[  974.384501] 
               -> #1 (&mm->mmap_sem){++++}:
[  974.389911]        __might_fault+0x68/0x90
[  974.394025]        _copy_from_user+0x23/0xa0
[  974.398311]        sock_setsockopt+0x4a2/0xac0
[  974.402761]        __sys_setsockopt+0xd9/0xf0
[  974.407118]        SyS_setsockopt+0xe/0x20
[  974.411242]        do_syscall_64+0x6e/0x1a0
[  974.415431]        entry_SYSCALL_64_after_hwframe+0x42/0xb7
[  974.421011] 
               -> #0 (sk_lock-AF_INET6){+.+.}:
[  974.426690]        lock_acquire+0x95/0x1e0
[  974.430813]        lock_sock_nested+0x71/0xa0
[  974.435196]        tcp_mmap+0x7c/0x550
[  974.438940]        sock_mmap+0x23/0x30
[  974.442695]        mmap_region+0x3a4/0x5d0
[  974.446808]        do_mmap+0x313/0x530
[  974.450571]        vm_mmap_pgoff+0xc7/0x100
[  974.454769]        ksys_mmap_pgoff+0x1d5/0x260
[  974.459247]        SyS_mmap+0x1b/0x30
[  974.462936]        do_syscall_64+0x6e/0x1a0
[  974.467114]        entry_SYSCALL_64_after_hwframe+0x42/0xb7
[  974.472678] 
               other info that might help us debug this:

[  974.480677]  Possible unsafe locking scenario:

[  974.486600]        CPU0                    CPU1
[  974.491152]        ----                    ----
[  974.495684]   lock(&mm->mmap_sem);
[  974.499089]                                lock(sk_lock-AF_INET6);
[  974.505285]                                lock(&mm->mmap_sem);
[  974.511211]   lock(sk_lock-AF_INET6);
[  974.514885] 
                *** DEADLOCK ***

[  974.520825] 1 lock held by b78299096/15790:
[  974.525018]  #0: 000000008dbe063b (&mm->mmap_sem){++++}, at: vm_mmap_pgoff+0x99/0x100
[  974.532852] 
               stack backtrace:
[  974.537224] CPU: 25 PID: 15790 Comm: b78299096 Not tainted 4.16.0-dbx-DEV #40
[  974.544371] Hardware name: Intel RML,PCH/Iota_QC_19, BIOS 2.40.0 06/22/2016
[  974.551333] Call Trace:
[  974.553792]  dump_stack+0x70/0xa5
[  974.557111]  print_circular_bug.isra.39+0x1d8/0x1e6
[  974.561982]  __lock_acquire+0x1284/0x1340
[  974.565992]  ? tcp_mmap+0x7c/0x550
[  974.569419]  lock_acquire+0x95/0x1e0
[  974.573011]  ? lock_acquire+0x95/0x1e0
[  974.576767]  ? tcp_mmap+0x7c/0x550
[  974.580167]  lock_sock_nested+0x71/0xa0
[  974.584023]  ? tcp_mmap+0x7c/0x550
[  974.587437]  tcp_mmap+0x7c/0x550
[  974.590677]  sock_mmap+0x23/0x30
[  974.593909]  mmap_region+0x3a4/0x5d0
[  974.597506]  do_mmap+0x313/0x530
[  974.600749]  vm_mmap_pgoff+0xc7/0x100
[  974.604414]  ksys_mmap_pgoff+0x1d5/0x260
[  974.608341]  ? fd_install+0x25/0x30
[  974.611849]  ? trace_hardirqs_on_caller+0xef/0x180
[  974.616641]  SyS_mmap+0x1b/0x30
[  974.619804]  do_syscall_64+0x6e/0x1a0
[  974.623462]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
[  974.628549] RIP: 0033:0x433749
[  974.631600] RSP: 002b:00007ffd29fdb438 EFLAGS: 00000216 ORIG_RAX: 0000000000000009
[  974.639197] RAX: ffffffffffffffda RBX: 00000000004002e0 RCX: 0000000000433749
[  974.646323] RDX: 0000000000000008 RSI: 0000000000004000 RDI: 0000000020ab7000
[  974.653463] RBP: 00007ffd29fdb460 R08: 0000000000000003 R09: 0000000000000000
[  974.660603] R10: 0000000000000012 R11: 0000000000000216 R12: 0000000000401670
[  974.667737] R13: 0000000000401700 R14: 0000000000000000 R15: 0000000000000000


I am not sure we can keep mmap() API, since we probably need to first lock the socket,
then grab vm semaphore.

^ permalink raw reply

* Re: [pci PATCH v7 0/5] Add support for unmanaged SR-IOV
From: Alexander Duyck @ 2018-04-19 22:54 UTC (permalink / raw)
  To: Bjorn Helgaas, Duyck, Alexander H, linux-pci
  Cc: virtio-dev, kvm, Netdev, Daly, Dan, LKML, linux-nvme, Keith Busch,
	netanel, Don Dutile, Maximilian Heyne, Wang, Liang-min,
	Rustad, Mark D, David Woodhouse, Christoph Hellwig, dwmw,
	Michael S. Tsirkin
In-Reply-To: <20180315183449.3102.64791.stgit@localhost.localdomain>

On Thu, Mar 15, 2018 at 11:40 AM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> This series is meant to add support for SR-IOV on devices when the VFs are
> not managed by the kernel. Examples of recent patches attempting to do this
> include:
> virto - https://patchwork.kernel.org/patch/10241225/
> pci-stub - https://patchwork.kernel.org/patch/10109935/
> vfio - https://patchwork.kernel.org/patch/10103353/
> uio - https://patchwork.kernel.org/patch/9974031/
>
> Since this is quickly blowing up into a multi-driver problem it is probably
> best to implement this solution as generically as possible.
>
> This series is an attempt to do that. What we do with this patch set is
> provide a generic framework to enable SR-IOV in the case that the PF driver
> doesn't support managing the VFs itself.
>
> I based my patch set originally on the patch by Mark Rustad but there isn't
> much left after going through and cleaning out the bits that were no longer
> needed, and after incorporating the feedback from David Miller. At this point
> the only items to be fully reused was his patch description which is now
> present in patch 3 of the set.
>
> This solution is limited in scope to just adding support for devices that
> provide no functionality for SR-IOV other than allocating the VFs by
> calling pci_enable_sriov. Previous sets had included patches for VFIO, but
> for now I am dropping that as the scope of that work is larger then I
> think I can take on at this time.
>
> v2: Reduced scope back to just virtio_pci and vfio-pci
>     Broke into 3 patch set from single patch
>     Changed autoprobe behavior to always set when num_vfs is set non-zero
> v3: Updated Documentation to clarify when sriov_unmanaged_autoprobe is used
>     Wrapped vfio_pci_sriov_configure to fix build errors w/o SR-IOV in kernel
> v4: Dropped vfio-pci patch
>     Added ena and nvme to drivers now using pci_sriov_configure_unmanaged
>     Dropped pci_disable_sriov call in virtio_pci to be consistent with ena
> v5: Dropped sriov_unmanaged_autoprobe and pci_sriov_conifgure_unmanaged
>     Added new patch that enables pci_sriov_configure_simple
>     Updated drivers to use pci_sriov_configure_simple
> v6: Defined pci_sriov_configure_simple as NULL when SR-IOV is not enabled
>     Updated drivers to drop "#ifdef" checks for IOV
>     Added pci-pf-stub as place for PF-only drivers to add support
> v7: Dropped pci_id table explanation from pci-pf-stub driver
>     Updated pci_sriov_configure_simple to drop need for err value
>     Fixed comment explaining why pci_sriov_configure_simple is NULL
>

Just following up since this has been sitting in patchwork for just
over a month now
(https://patchwork.ozlabs.org/project/linux-pci/list/?series=34034).
I'm just wondering what the expectation is on getting these pulled
into the pci tree? I'm assuming that is the best place for these
patches. Are there any concerns I still need to address or are these
going to be pulled in at some point, and if so is there any ETA on
when that will be?

Thanks.

- Alex

^ permalink raw reply

* Re: WARNING in refcount_inc (3)
From: Eric Biggers @ 2018-04-19 22:45 UTC (permalink / raw)
  To: syzbot; +Cc: davem, kuznet, linux-kernel, netdev, syzkaller-bugs, yoshfuji
In-Reply-To: <001a113ec036acc2460568bd5427@google.com>

On Sat, Mar 31, 2018 at 04:01:02PM -0700, syzbot wrote:
> Hello,
> 
> syzbot hit the following crash on bpf-next commit
> 1379ef828a18d8f81c526b25e4d5685caa2cfd65 (Thu Mar 29 22:09:44 2018 +0000)
> Merge branch 'bpf-sockmap-ingress'
> syzbot dashboard link:
> https://syzkaller.appspot.com/bug?extid=6eaf536fd743f5e119c5
> 
> So far this crash happened 6 times on bpf-next.
> C reproducer: https://syzkaller.appspot.com/x/repro.c?id=6614614900998144
> syzkaller reproducer:
> https://syzkaller.appspot.com/x/repro.syz?id=5035340528091136
> Raw console output:
> https://syzkaller.appspot.com/x/log.txt?id=5063394046509056
> Kernel config:
> https://syzkaller.appspot.com/x/.config?id=-1280663959502969741
> compiler: gcc (GCC) 7.1.1 20170620
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+6eaf536fd743f5e119c5@syzkaller.appspotmail.com
> It will help syzbot understand when the bug is fixed. See footer for
> details.
> If you forward the report, please keep this part and the footer.
> 
> R13: 0000000000000005 R14: 0000000000001380 R15: 00007ffd314c8768
> ------------[ cut here ]------------
> ------------[ cut here ]------------
> refcount_t: increment on 0; use-after-free.
> refcount_t: underflow; use-after-free.
> WARNING: CPU: 1 PID: 4434 at lib/refcount.c:153 refcount_inc+0x47/0x50
> lib/refcount.c:153
> WARNING: CPU: 0 PID: 4437 at lib/refcount.c:187
> refcount_sub_and_test+0x167/0x1b0 lib/refcount.c:187
> Kernel panic - not syncing: panic_on_warn set ...
> 
> Modules linked in:
> CPU: 1 PID: 4434 Comm: syzkaller349430 Not tainted 4.16.0-rc6+ #41
> CPU: 0 PID: 4437 Comm: syzkaller349430 Not tainted 4.16.0-rc6+ #41
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> RIP: 0010:refcount_sub_and_test+0x167/0x1b0 lib/refcount.c:187
> Call Trace:
> RSP: 0018:ffff8801b061f728 EFLAGS: 00010286
>  __dump_stack lib/dump_stack.c:17 [inline]
>  dump_stack+0x194/0x24d lib/dump_stack.c:53
> RAX: dffffc0000000008 RBX: 0000000000000000 RCX: ffffffff815ba4be
> RDX: 0000000000000000 RSI: 1ffff100360c3e95 RDI: 1ffff100360c3e6a
> RBP: ffff8801b061f7b8 R08: 0000000000000000 R09: 0000000000000000
> R10: ffff8801b061f850 R11: 0000000000000000 R12: 1ffff100360c3ee6
>  panic+0x1e4/0x41c kernel/panic.c:183
> R13: 00000000ffffffff R14: 0000000000000001 R15: ffff8801b1be4184
> FS:  0000000001817880(0000) GS:ffff8801db200000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007ffd314c9000 CR3: 00000001b04a1006 CR4: 00000000001606f0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
>  __warn+0x1dc/0x200 kernel/panic.c:547
>  report_bug+0x1f4/0x2b0 lib/bug.c:186
>  fixup_bug.part.11+0x37/0x80 arch/x86/kernel/traps.c:178
>  fixup_bug arch/x86/kernel/traps.c:247 [inline]
>  do_error_trap+0x2d7/0x3e0 arch/x86/kernel/traps.c:296
>  refcount_dec_and_test+0x1a/0x20 lib/refcount.c:212
>  put_net include/net/net_namespace.h:222 [inline]
>  __sk_destruct+0x560/0x920 net/core/sock.c:1592
>  do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
>  invalid_op+0x1b/0x40 arch/x86/entry/entry_64.S:986
> RIP: 0010:refcount_inc+0x47/0x50 lib/refcount.c:153
> RSP: 0018:ffff8801b058f860 EFLAGS: 00010286
> RAX: dffffc0000000008 RBX: ffff8801ab55a1c4 RCX: ffffffff815ba4be
> RDX: 0000000000000000 RSI: 1ffff100360b1ebc RDI: 1ffff100360b1e91
> RBP: ffff8801b058f868 R08: 0000000000000000 R09: 0000000000000000
>  sk_destruct+0x47/0x80 net/core/sock.c:1601
> R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801b058faf8
>  __sk_free+0xf1/0x2b0 net/core/sock.c:1612
> R13: ffff8801af87b513 R14: ffff8801ab55a1c0 R15: ffff8801af87b501
>  sk_free+0x2a/0x40 net/core/sock.c:1623
>  sock_put include/net/sock.h:1661 [inline]
>  tcp_close+0x967/0x1190 net/ipv4/tcp.c:2329
>  get_net include/net/net_namespace.h:204 [inline]
>  sk_alloc+0x3f9/0x1440 net/core/sock.c:1540
>  inet_release+0xed/0x1c0 net/ipv4/af_inet.c:427
>  sock_release+0x8d/0x1e0 net/socket.c:594
>  sock_close+0x16/0x20 net/socket.c:1149
>  __fput+0x327/0x7e0 fs/file_table.c:209
>  ____fput+0x15/0x20 fs/file_table.c:243
>  task_work_run+0x199/0x270 kernel/task_work.c:113
>  inet_create+0x47c/0xf50 net/ipv4/af_inet.c:320
>  tracehook_notify_resume include/linux/tracehook.h:191 [inline]
>  exit_to_usermode_loop+0x275/0x2f0 arch/x86/entry/common.c:166
>  __sock_create+0x4d4/0x850 net/socket.c:1285
>  prepare_exit_to_usermode arch/x86/entry/common.c:196 [inline]
>  syscall_return_slowpath arch/x86/entry/common.c:265 [inline]
>  do_syscall_64+0x6ec/0x940 arch/x86/entry/common.c:292
>  sock_create net/socket.c:1325 [inline]
>  SYSC_socket net/socket.c:1355 [inline]
>  SyS_socket+0xeb/0x1d0 net/socket.c:1335
>  do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
>  entry_SYSCALL_64_after_hwframe+0x42/0xb7
> RIP: 0033:0x402950
> RSP: 002b:00007ffd314c8628 EFLAGS: 00000246
>  ORIG_RAX: 0000000000000003
> RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000402950
> RDX: 00000000000000e0 RSI: 00007ffd314c8f00 RDI: 0000000000000003
> RBP: 00007ffd314c8740 R08: 00007ffd314c864c R09: 0000000000000001
> R10: 00007ffd314c8740 R11: 0000000000000246 R12: 00000000006cf4c0
> R13: 00000000006cee40 R14: 0000000000001380 R15: 00007ffd314c8768
> Code:
>  entry_SYSCALL_64_after_hwframe+0x42/0xb7
> 5e
> RIP: 0033:0x4456a7
> 41
> RSP: 002b:00007ffd314c8628 EFLAGS: 00000202 ORIG_RAX: 0000000000000029
> 5f
> RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00000000004456a7
> 5d
> RDX: 0000000000000006 RSI: 0000000000000001 RDI: 0000000000000002
> RBP: 00007ffd314c8740 R08: 0000000000000000 R09: 0000000000000001
> c3
> R10: 0000000000000006 R11: 0000000000000202 R12: 0000000000000003
> e8
> R13: 0000000000000003 R14: 0000000000006cc2 R15: 00007ffd314c8768
> 0a 0b be fe 80 3d 20 c9 84 05 00 75 1a e8 fc 0a be fe 48 c7 c7 e0 78 e5 86
> c6 05 0b c9 84 05 01 e8 a9 16 8e fe <0f> 0b 31 db eb a3 e8 de 0a be fe 83 fb
> ff 0f 85 63 ff ff ff 31
> ---[ end trace dd327356f543ce46 ]---
> Dumping ftrace buffer:
>    (ftrace buffer empty)
> Kernel Offset: disabled
> Rebooting in 86400 seconds..
> 
> 
> ---
> This bug is generated by a dumb bot. It may contain errors.
> See https://goo.gl/tpsmEJ for details.
> Direct all questions to syzkaller@googlegroups.com.
> 
> syzbot will keep track of this bug report.
> If you forgot to add the Reported-by tag, once the fix for this bug is
> merged
> into any tree, please reply to this email with:
> #syz fix: exact-commit-title

Broken error handling when mounting rpc_pipefs is messing things up.
Fixed by patch in vfs/for-linus:

#syz fix: rpc_pipefs: deal with early sget() failures

- Eric

^ permalink raw reply

* Re: [PATCHv3 net 3/3] net: sched: ife: check on metadata length
From: Eric Dumazet @ 2018-04-19 22:24 UTC (permalink / raw)
  To: Alexander Aring, yotam.gi
  Cc: jhs, davem, xiyou.wangcong, jiri, yuvalm, netdev, kernel
In-Reply-To: <20180419221445.26205-4-aring@mojatatu.com>



On 04/19/2018 03:14 PM, Alexander Aring wrote:
> This patch checks if sk buffer is available to dererence ife header. If
> not then NULL will returned to signal an malformed ife packet. This
> avoids to crashing the kernel from outside.
> 
> Signed-off-by: Alexander Aring <aring@mojatatu.com>
> Reviewed-by: Yotam Gigi <yotam.gi@gmail.com>
> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
> ---
>  net/ife/ife.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/net/ife/ife.c b/net/ife/ife.c
> index 7fbe70a0af4b..570a18d4ca32 100644
> --- a/net/ife/ife.c
> +++ b/net/ife/ife.c
> @@ -70,6 +70,9 @@ void *ife_decode(struct sk_buff *skb, u16 *metalen)
>  	u16 ifehdrln;
>  
>  	ifehdr = (struct ifeheadr *) (skb->data + skb->dev->hard_header_len);
> +	if (!pskb_may_pull(skb, skb->dev->hard_header_len + IFE_METAHDRLEN))
> +		return NULL;
> +



No, you need to move here :

       ifehdr = (struct ifeheadr *) (skb->data + skb->dev->hard_header_len);

>  	ifehdrln = ntohs(ifehdr->metalen);
>  	total_pull = skb->dev->hard_header_len + ifehdrln;
>  
> 


Please do not rush, wait one day before sending V4, no need to flood netdev@

^ permalink raw reply

* Re: [PATCH net-next] ipv6: send netlink notifications for manually configured addresses
From: Lorenzo Bianconi @ 2018-04-19 22:23 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Thomas Haller
In-Reply-To: <20180417.140536.2292177928357979103.davem@davemloft.net>

> From: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
> Date: Tue, 17 Apr 2018 11:54:39 +0200
>
>> Send a netlink notification when userspace adds a manually configured
>> address if DAD is enabled and optimistic flag isn't set.
>> Moreover send RTM_DELADDR notifications for tentative addresses.
>>
>> Some userspace applications (e.g. NetworkManager) are interested in
>> addr netlink events albeit the address is still in tentative state,
>> however events are not sent if DAD process is not completed.
>> If the address is added and immediately removed userspace listeners
>> are not notified. This behaviour can be easily reproduced by using
>> veth interfaces:
>>
>> $ ip -b - <<EOF
>>> link add dev vm1 type veth peer name vm2
>>> link set dev vm1 up
>>> link set dev vm2 up
>>> addr add 2001:db8:a:b:1:2:3:4/64 dev vm1
>>> addr del 2001:db8:a:b:1:2:3:4/64 dev vm1
>> EOF
>>
>> This patch reverts the behaviour introduced by the commit f784ad3d79e5
>> ("ipv6: do not send RTM_DELADDR for tentative addresses")
>>
>> Suggested-by: Thomas Haller <thaller@redhat.com>
>> Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
>
> Ok, applied to net-next.
>
> It would be really nice if we clearly documented somewhere the exact
> situations where we desire ipv6 address netlink notifications to be
> sent out.
>
> Maybe even a common function that guards the event emission, where we
> encode the rules.  Or a comment somewhere prominent.  Something like
> that.
>
> Right now this isn't spelled out clearly anywhere, and that's probably
> why things keep being adjusted back and forth like this.

Sounds good, I will post a patch. Thanks

Regards,
Lorenzo

>
> Thank you.

^ permalink raw reply

* [PATCHv3 net 3/3] net: sched: ife: check on metadata length
From: Alexander Aring @ 2018-04-19 22:14 UTC (permalink / raw)
  To: yotam.gi
  Cc: jhs, davem, xiyou.wangcong, jiri, yuvalm, netdev, kernel,
	Alexander Aring
In-Reply-To: <20180419221445.26205-1-aring@mojatatu.com>

This patch checks if sk buffer is available to dererence ife header. If
not then NULL will returned to signal an malformed ife packet. This
avoids to crashing the kernel from outside.

Signed-off-by: Alexander Aring <aring@mojatatu.com>
Reviewed-by: Yotam Gigi <yotam.gi@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
---
 net/ife/ife.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/ife/ife.c b/net/ife/ife.c
index 7fbe70a0af4b..570a18d4ca32 100644
--- a/net/ife/ife.c
+++ b/net/ife/ife.c
@@ -70,6 +70,9 @@ void *ife_decode(struct sk_buff *skb, u16 *metalen)
 	u16 ifehdrln;
 
 	ifehdr = (struct ifeheadr *) (skb->data + skb->dev->hard_header_len);
+	if (!pskb_may_pull(skb, skb->dev->hard_header_len + IFE_METAHDRLEN))
+		return NULL;
+
 	ifehdrln = ntohs(ifehdr->metalen);
 	total_pull = skb->dev->hard_header_len + ifehdrln;
 
-- 
2.11.0

^ permalink raw reply related

* [PATCHv3 net 2/3] net: sched: ife: handle malformed tlv length
From: Alexander Aring @ 2018-04-19 22:14 UTC (permalink / raw)
  To: yotam.gi
  Cc: jhs, davem, xiyou.wangcong, jiri, yuvalm, netdev, kernel,
	Alexander Aring
In-Reply-To: <20180419221445.26205-1-aring@mojatatu.com>

There is currently no handling to check on a invalid tlv length. This
patch adds such handling to avoid killing the kernel with a malformed
ife packet.

Signed-off-by: Alexander Aring <aring@mojatatu.com>
Reviewed-by: Yotam Gigi <yotam.gi@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
---
 include/net/ife.h   |  3 ++-
 net/ife/ife.c       | 35 +++++++++++++++++++++++++++++++++--
 net/sched/act_ife.c |  7 ++++++-
 3 files changed, 41 insertions(+), 4 deletions(-)

diff --git a/include/net/ife.h b/include/net/ife.h
index 44b9c00f7223..e117617e3c34 100644
--- a/include/net/ife.h
+++ b/include/net/ife.h
@@ -12,7 +12,8 @@
 void *ife_encode(struct sk_buff *skb, u16 metalen);
 void *ife_decode(struct sk_buff *skb, u16 *metalen);
 
-void *ife_tlv_meta_decode(void *skbdata, u16 *attrtype, u16 *dlen, u16 *totlen);
+void *ife_tlv_meta_decode(void *skbdata, const void *ifehdr_end, u16 *attrtype,
+			  u16 *dlen, u16 *totlen);
 int ife_tlv_meta_encode(void *skbdata, u16 attrtype, u16 dlen,
 			const void *dval);
 
diff --git a/net/ife/ife.c b/net/ife/ife.c
index 7d1ec76e7f43..7fbe70a0af4b 100644
--- a/net/ife/ife.c
+++ b/net/ife/ife.c
@@ -92,12 +92,43 @@ struct meta_tlvhdr {
 	__be16 len;
 };
 
+static bool __ife_tlv_meta_valid(const unsigned char *skbdata,
+				 const unsigned char *ifehdr_end)
+{
+	const struct meta_tlvhdr *tlv;
+	u16 tlvlen;
+
+	if (unlikely(skbdata + sizeof(*tlv) > ifehdr_end))
+		return false;
+
+	tlv = (const struct meta_tlvhdr *)skbdata;
+	tlvlen = ntohs(tlv->len);
+
+	/* tlv length field is inc header, check on minimum */
+	if (tlvlen < NLA_HDRLEN)
+		return false;
+
+	/* overflow by NLA_ALIGN check */
+	if (NLA_ALIGN(tlvlen) < tlvlen)
+		return false;
+
+	if (unlikely(skbdata + NLA_ALIGN(tlvlen) > ifehdr_end))
+		return false;
+
+	return true;
+}
+
 /* Caller takes care of presenting data in network order
  */
-void *ife_tlv_meta_decode(void *skbdata, u16 *attrtype, u16 *dlen, u16 *totlen)
+void *ife_tlv_meta_decode(void *skbdata, const void *ifehdr_end, u16 *attrtype,
+			  u16 *dlen, u16 *totlen)
 {
-	struct meta_tlvhdr *tlv = (struct meta_tlvhdr *) skbdata;
+	struct meta_tlvhdr *tlv;
+
+	if (!__ife_tlv_meta_valid(skbdata, ifehdr_end))
+		return NULL;
 
+	tlv = (struct meta_tlvhdr *)skbdata;
 	*dlen = ntohs(tlv->len) - NLA_HDRLEN;
 	*attrtype = ntohs(tlv->type);
 
diff --git a/net/sched/act_ife.c b/net/sched/act_ife.c
index 49b8ab551fbe..8527cfdc446d 100644
--- a/net/sched/act_ife.c
+++ b/net/sched/act_ife.c
@@ -682,7 +682,12 @@ static int tcf_ife_decode(struct sk_buff *skb, const struct tc_action *a,
 		u16 mtype;
 		u16 dlen;
 
-		curr_data = ife_tlv_meta_decode(tlv_data, &mtype, &dlen, NULL);
+		curr_data = ife_tlv_meta_decode(tlv_data, ifehdr_end, &mtype,
+						&dlen, NULL);
+		if (!curr_data) {
+			qstats_drop_inc(this_cpu_ptr(ife->common.cpu_qstats));
+			return TC_ACT_SHOT;
+		}
 
 		if (find_decode_metaid(skb, ife, mtype, dlen, curr_data)) {
 			/* abuse overlimits to count when we receive metadata
-- 
2.11.0

^ permalink raw reply related

* [PATCHv3 net 1/3] net: sched: ife: signal not finding metaid
From: Alexander Aring @ 2018-04-19 22:14 UTC (permalink / raw)
  To: yotam.gi
  Cc: jhs, davem, xiyou.wangcong, jiri, yuvalm, netdev, kernel,
	Alexander Aring
In-Reply-To: <20180419221445.26205-1-aring@mojatatu.com>

We need to record stats for received metadata that we dont know how
to process. Have find_decode_metaid() return -ENOENT to capture this.

Signed-off-by: Alexander Aring <aring@mojatatu.com>
Reviewed-by: Yotam Gigi <yotam.gi@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
---
 net/sched/act_ife.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sched/act_ife.c b/net/sched/act_ife.c
index a5994cf0512b..49b8ab551fbe 100644
--- a/net/sched/act_ife.c
+++ b/net/sched/act_ife.c
@@ -652,7 +652,7 @@ static int find_decode_metaid(struct sk_buff *skb, struct tcf_ife_info *ife,
 		}
 	}
 
-	return 0;
+	return -ENOENT;
 }
 
 static int tcf_ife_decode(struct sk_buff *skb, const struct tc_action *a,
-- 
2.11.0

^ permalink raw reply related

* [PATCHv3 net 0/3] net: sched: ife: malformed ife packet fixes
From: Alexander Aring @ 2018-04-19 22:14 UTC (permalink / raw)
  To: yotam.gi
  Cc: jhs, davem, xiyou.wangcong, jiri, yuvalm, netdev, kernel,
	Alexander Aring

As promised at netdev 2.2 tc workshop I am working on adding scapy support for
tdc testing. It is still work in progress. I will submit the patches to tdc
later (they are not in good shape yet). The good news is I have been able to
find bugs which normal packet testing would not be able to find.
With fuzzy testing I was able to craft certain malformed packets that IFE
action was not able to deal with. This patch set fixes those bugs.

changes since v3:
 - use pskb_may_pull

changes since v2:
 - remove inline from __ife_tlv_meta_valid
 - add const to cast to meta_tlvhdr
 - add acked and reviewed tags

Alexander Aring (3):
  net: sched: ife: signal not finding metaid
  net: sched: ife: handle malformed tlv length
  net: sched: ife: check on metadata length

 include/net/ife.h   |  3 ++-
 net/ife/ife.c       | 38 ++++++++++++++++++++++++++++++++++++--
 net/sched/act_ife.c |  9 +++++++--
 3 files changed, 45 insertions(+), 5 deletions(-)

-- 
2.11.0

^ permalink raw reply

* Re: [PATCHv2 net 3/3] net: sched: ife: check on metadata length
From: Eric Dumazet @ 2018-04-19 21:50 UTC (permalink / raw)
  To: Alexander Aring, yotam.gi
  Cc: jhs, davem, xiyou.wangcong, jiri, yuvalm, netdev, kernel
In-Reply-To: <20180419214438.6801-4-aring@mojatatu.com>



On 04/19/2018 02:44 PM, Alexander Aring wrote:
> This patch checks if sk buffer is available to dererence ife header. If
> not then NULL will returned to signal an malformed ife packet. This
> avoids to crashing the kernel from outside.
> 
> Signed-off-by: Alexander Aring <aring@mojatatu.com>
> Reviewed-by: Yotam Gigi <yotam.gi@gmail.com>
> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
> ---
>  net/ife/ife.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/net/ife/ife.c b/net/ife/ife.c
> index 7fbe70a0af4b..93e8c36ce6ec 100644
> --- a/net/ife/ife.c
> +++ b/net/ife/ife.c
> @@ -70,6 +70,9 @@ void *ife_decode(struct sk_buff *skb, u16 *metalen)
>  	u16 ifehdrln;
>  
>  	ifehdr = (struct ifeheadr *) (skb->data + skb->dev->hard_header_len);
> +	if (skb->len < skb->dev->hard_header_len + IFE_METAHDRLEN)
> +		return NULL;
> +
>  	ifehdrln = ntohs(ifehdr->metalen);
>  	total_pull = skb->dev->hard_header_len + ifehdrln;
>  
> 

Nope, please use pskb_may_pull()

^ permalink raw reply

* [PATCHv2 net 3/3] net: sched: ife: check on metadata length
From: Alexander Aring @ 2018-04-19 21:44 UTC (permalink / raw)
  To: yotam.gi
  Cc: jhs, davem, xiyou.wangcong, jiri, yuvalm, netdev, kernel,
	Alexander Aring
In-Reply-To: <20180419214438.6801-1-aring@mojatatu.com>

This patch checks if sk buffer is available to dererence ife header. If
not then NULL will returned to signal an malformed ife packet. This
avoids to crashing the kernel from outside.

Signed-off-by: Alexander Aring <aring@mojatatu.com>
Reviewed-by: Yotam Gigi <yotam.gi@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
---
 net/ife/ife.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/ife/ife.c b/net/ife/ife.c
index 7fbe70a0af4b..93e8c36ce6ec 100644
--- a/net/ife/ife.c
+++ b/net/ife/ife.c
@@ -70,6 +70,9 @@ void *ife_decode(struct sk_buff *skb, u16 *metalen)
 	u16 ifehdrln;
 
 	ifehdr = (struct ifeheadr *) (skb->data + skb->dev->hard_header_len);
+	if (skb->len < skb->dev->hard_header_len + IFE_METAHDRLEN)
+		return NULL;
+
 	ifehdrln = ntohs(ifehdr->metalen);
 	total_pull = skb->dev->hard_header_len + ifehdrln;
 
-- 
2.11.0

^ permalink raw reply related

* [PATCHv2 net 2/3] net: sched: ife: handle malformed tlv length
From: Alexander Aring @ 2018-04-19 21:44 UTC (permalink / raw)
  To: yotam.gi
  Cc: jhs, davem, xiyou.wangcong, jiri, yuvalm, netdev, kernel,
	Alexander Aring
In-Reply-To: <20180419214438.6801-1-aring@mojatatu.com>

There is currently no handling to check on a invalid tlv length. This
patch adds such handling to avoid killing the kernel with a malformed
ife packet.

Signed-off-by: Alexander Aring <aring@mojatatu.com>
Reviewed-by: Yotam Gigi <yotam.gi@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
---
 include/net/ife.h   |  3 ++-
 net/ife/ife.c       | 35 +++++++++++++++++++++++++++++++++--
 net/sched/act_ife.c |  7 ++++++-
 3 files changed, 41 insertions(+), 4 deletions(-)

diff --git a/include/net/ife.h b/include/net/ife.h
index 44b9c00f7223..e117617e3c34 100644
--- a/include/net/ife.h
+++ b/include/net/ife.h
@@ -12,7 +12,8 @@
 void *ife_encode(struct sk_buff *skb, u16 metalen);
 void *ife_decode(struct sk_buff *skb, u16 *metalen);
 
-void *ife_tlv_meta_decode(void *skbdata, u16 *attrtype, u16 *dlen, u16 *totlen);
+void *ife_tlv_meta_decode(void *skbdata, const void *ifehdr_end, u16 *attrtype,
+			  u16 *dlen, u16 *totlen);
 int ife_tlv_meta_encode(void *skbdata, u16 attrtype, u16 dlen,
 			const void *dval);
 
diff --git a/net/ife/ife.c b/net/ife/ife.c
index 7d1ec76e7f43..7fbe70a0af4b 100644
--- a/net/ife/ife.c
+++ b/net/ife/ife.c
@@ -92,12 +92,43 @@ struct meta_tlvhdr {
 	__be16 len;
 };
 
+static bool __ife_tlv_meta_valid(const unsigned char *skbdata,
+				 const unsigned char *ifehdr_end)
+{
+	const struct meta_tlvhdr *tlv;
+	u16 tlvlen;
+
+	if (unlikely(skbdata + sizeof(*tlv) > ifehdr_end))
+		return false;
+
+	tlv = (const struct meta_tlvhdr *)skbdata;
+	tlvlen = ntohs(tlv->len);
+
+	/* tlv length field is inc header, check on minimum */
+	if (tlvlen < NLA_HDRLEN)
+		return false;
+
+	/* overflow by NLA_ALIGN check */
+	if (NLA_ALIGN(tlvlen) < tlvlen)
+		return false;
+
+	if (unlikely(skbdata + NLA_ALIGN(tlvlen) > ifehdr_end))
+		return false;
+
+	return true;
+}
+
 /* Caller takes care of presenting data in network order
  */
-void *ife_tlv_meta_decode(void *skbdata, u16 *attrtype, u16 *dlen, u16 *totlen)
+void *ife_tlv_meta_decode(void *skbdata, const void *ifehdr_end, u16 *attrtype,
+			  u16 *dlen, u16 *totlen)
 {
-	struct meta_tlvhdr *tlv = (struct meta_tlvhdr *) skbdata;
+	struct meta_tlvhdr *tlv;
+
+	if (!__ife_tlv_meta_valid(skbdata, ifehdr_end))
+		return NULL;
 
+	tlv = (struct meta_tlvhdr *)skbdata;
 	*dlen = ntohs(tlv->len) - NLA_HDRLEN;
 	*attrtype = ntohs(tlv->type);
 
diff --git a/net/sched/act_ife.c b/net/sched/act_ife.c
index 49b8ab551fbe..8527cfdc446d 100644
--- a/net/sched/act_ife.c
+++ b/net/sched/act_ife.c
@@ -682,7 +682,12 @@ static int tcf_ife_decode(struct sk_buff *skb, const struct tc_action *a,
 		u16 mtype;
 		u16 dlen;
 
-		curr_data = ife_tlv_meta_decode(tlv_data, &mtype, &dlen, NULL);
+		curr_data = ife_tlv_meta_decode(tlv_data, ifehdr_end, &mtype,
+						&dlen, NULL);
+		if (!curr_data) {
+			qstats_drop_inc(this_cpu_ptr(ife->common.cpu_qstats));
+			return TC_ACT_SHOT;
+		}
 
 		if (find_decode_metaid(skb, ife, mtype, dlen, curr_data)) {
 			/* abuse overlimits to count when we receive metadata
-- 
2.11.0

^ permalink raw reply related

* [PATCHv2 net 1/3] net: sched: ife: signal not finding metaid
From: Alexander Aring @ 2018-04-19 21:44 UTC (permalink / raw)
  To: yotam.gi
  Cc: jhs, davem, xiyou.wangcong, jiri, yuvalm, netdev, kernel,
	Alexander Aring
In-Reply-To: <20180419214438.6801-1-aring@mojatatu.com>

We need to record stats for received metadata that we dont know how
to process. Have find_decode_metaid() return -ENOENT to capture this.

Signed-off-by: Alexander Aring <aring@mojatatu.com>
Reviewed-by: Yotam Gigi <yotam.gi@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
---
 net/sched/act_ife.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sched/act_ife.c b/net/sched/act_ife.c
index a5994cf0512b..49b8ab551fbe 100644
--- a/net/sched/act_ife.c
+++ b/net/sched/act_ife.c
@@ -652,7 +652,7 @@ static int find_decode_metaid(struct sk_buff *skb, struct tcf_ife_info *ife,
 		}
 	}
 
-	return 0;
+	return -ENOENT;
 }
 
 static int tcf_ife_decode(struct sk_buff *skb, const struct tc_action *a,
-- 
2.11.0

^ permalink raw reply related

* [PATCHv2 net 0/3] net: sched: ife: malformed ife packet fixes
From: Alexander Aring @ 2018-04-19 21:44 UTC (permalink / raw)
  To: yotam.gi
  Cc: jhs, davem, xiyou.wangcong, jiri, yuvalm, netdev, kernel,
	Alexander Aring

As promised at netdev 2.2 tc workshop I am working on adding scapy support for
tdc testing. It is still work in progress. I will submit the patches to tdc
later (they are not in good shape yet). The good news is I have been able to
find bugs which normal packet testing would not be able to find.
With fuzzy testing I was able to craft certain malformed packets that IFE
action was not able to deal with. This patch set fixes those bugs.

changes since v2:
 - remove inline from __ife_tlv_meta_valid
 - add const to cast to meta_tlvhdr
 - add acked and reviewed tags

Alexander Aring (3):
  net: sched: ife: signal not finding metaid
  net: sched: ife: handle malformed tlv length
  net: sched: ife: check on metadata length

 include/net/ife.h   |  3 ++-
 net/ife/ife.c       | 38 ++++++++++++++++++++++++++++++++++++--
 net/sched/act_ife.c |  9 +++++++--
 3 files changed, 45 insertions(+), 5 deletions(-)

-- 
2.11.0

^ permalink raw reply

* Re: [PATCH v4 00/10] New network driver for Amiga X-Surf 100 (m68k)
From: Michael Schmitz @ 2018-04-19 21:36 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, Andrew Lunn, Finn Thain, Geert Uytterhoeven,
	Florian Fainelli, Linux/m68k, Michael Karcher
In-Reply-To: <20180419.161138.825724439328248224.davem@davemloft.net>

Thanks Dave!

And many thanks to all the reviewers and testers!

Cheers,

  Michael


On Fri, Apr 20, 2018 at 8:11 AM, David Miller <davem@davemloft.net> wrote:
> From: Michael Schmitz <schmitzmic@gmail.com>
> Date: Thu, 19 Apr 2018 14:05:17 +1200
>
>> This patch series adds support for the Individual Computers X-Surf 100
>> network card for m68k Amiga, a network adapter based on the AX88796 chip set.
>
> Series applied, thank you.

^ permalink raw reply

* [net-next:master 23/31] drivers/net/hyperv/rndis_filter.c:1243: undefined reference to `ucs2_as_utf8'
From: kbuild test robot @ 2018-04-19 21:33 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: kbuild-all, netdev

[-- Attachment #1: Type: text/plain, Size: 1692 bytes --]

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git master
head:   bda73d4ec943c4f7479603a59cad09a07c6c729a
commit: 0fe554a46a0ff855376053c7e4204673b7879f05 [23/31] hv_netvsc: propogate Hyper-V friendly name into interface alias
config: x86_64-randconfig-b0-04200208 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        git checkout 0fe554a46a0ff855376053c7e4204673b7879f05
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All errors (new ones prefixed by >>):

   drivers/net/hyperv/rndis_filter.o: In function `rndis_get_friendly_name':
>> drivers/net/hyperv/rndis_filter.c:1243: undefined reference to `ucs2_as_utf8'

vim +1243 drivers/net/hyperv/rndis_filter.c

  1226	
  1227	static void rndis_get_friendly_name(struct net_device *net,
  1228					    struct rndis_device *rndis_device,
  1229					    struct netvsc_device *net_device)
  1230	{
  1231		ucs2_char_t wname[256];
  1232		unsigned long len;
  1233		u8 ifalias[256];
  1234		u32 size;
  1235	
  1236		size = sizeof(wname);
  1237		if (rndis_filter_query_device(rndis_device, net_device,
  1238					      RNDIS_OID_GEN_FRIENDLY_NAME,
  1239					      wname, &size) != 0)
  1240			return;
  1241	
  1242		/* Convert Windows Unicode string to UTF-8 */
> 1243		len = ucs2_as_utf8(ifalias, wname, sizeof(ifalias));
  1244	
  1245		/* ignore the default value from host */
  1246		if (strcmp(ifalias, "Network Adapter") != 0)
  1247			dev_set_alias(net, ifalias, len);
  1248	}
  1249	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 31389 bytes --]

^ permalink raw reply

* [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_SG
From: Mikulas Patocka @ 2018-04-19 21:27 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: David Miller, Andrew Morton, linux-mm, eric.dumazet, edumazet,
	netdev, linux-kernel, jasowang, virtualization, dm-devel,
	Vlastimil Babka
In-Reply-To: <20180419193554-mutt-send-email-mst@kernel.org>



On Thu, 19 Apr 2018, Michael S. Tsirkin wrote:

> Maybe make it conditional on CONFIG_DEBUG_SG too?
> Otherwise I think you just trigger a hard to debug memory corruption.

OK, here I resend the patch with CONFIG_DEBUG_SG. With CONFIG_DEBUG_SG, 
the DMA API will print a stacktrace where the misuse happened, so it's 
much easier to debug than with CONFIG_DEBUG_VM.

Fedora doesn't use CONFIG_DEBUG_SG in its default kernel (it only uses it 
in the debugging kernel), so users won't be hurt by this.



From: Mikulas Patocka <mpatocka@redhat.com>
Subject: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_SG

The kvmalloc function tries to use kmalloc and falls back to vmalloc if
kmalloc fails.

Unfortunatelly, some kernel code has bugs - it uses kvmalloc and then
uses DMA-API on the returned memory or frees it with kfree. Such bugs were
found in the virtio-net driver, dm-integrity or RHEL7 powerpc-specific
code.

These bugs are hard to reproduce because vmalloc falls back to kmalloc
only if memory is fragmented.

In order to detect these bugs reliably I submit this patch that changes
kvmalloc to always use vmalloc if CONFIG_DEBUG_SG is turned on.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 mm/util.c |    2 ++
 1 file changed, 2 insertions(+)

Index: linux-2.6/mm/util.c
===================================================================
--- linux-2.6.orig/mm/util.c	2018-04-18 15:46:23.000000000 +0200
+++ linux-2.6/mm/util.c	2018-04-19 23:14:14.000000000 +0200
@@ -395,6 +395,7 @@ EXPORT_SYMBOL(vm_mmap);
  */
 void *kvmalloc_node(size_t size, gfp_t flags, int node)
 {
+#ifndef CONFIG_DEBUG_SG
 	gfp_t kmalloc_flags = flags;
 	void *ret;
 
@@ -426,6 +427,7 @@ void *kvmalloc_node(size_t size, gfp_t f
 	 */
 	if (ret || size <= PAGE_SIZE)
 		return ret;
+#endif
 
 	return __vmalloc_node_flags_caller(size, node, flags,
 			__builtin_return_address(0));

^ permalink raw reply

* [PATCH RFC net-next] net: ipvs: Adjust gso_size for IPPROTO_TCP
From: Martin KaFai Lau @ 2018-04-19 21:23 UTC (permalink / raw)
  To: netdev; +Cc: Tom Herbert, Eric Dumazet, Nikita Shirokov, kernel-team

This patch is not a proper fix and mainly serves for discussion purpose.
It is based on net-next which I have been using to debug the issue.

The change that works around the issue is in ensure_mtu_is_adequate().
Other changes are the rippling effect in function arg.

This bug was uncovered by one of our legacy service that
are still using ipvs for load balancing.  In that setup,
the ipvs encap the ipv6-tcp packet in another ipv6 hdr
before tx it out to eth0.

The problem is the kernel stack could pass a skb (which was
originated from a sys_write(tcp_fd)) to the driver with skb->len
bigger than the device MTU.  In one NIC setup (with gso and tso off)
that we are using, it upset the NIC/driver and caused the tx queue
stalled for tens of seconds which is how it got uncovered.
(On the NIC side, the NIC firmware and driver have been fixed
to avoid this tx queue stall after seeing this skb).

On the kernel side, based on the commit log, this bug should have
been exposed after commit 815d22e55b0e ("ip6ip6: Support for GSO/GRO").

Before commit 815d22e55b0e, ipv6_gso_segment() would just error
out (-EPROTONOSUPPORT) because the tx-ing packet is an ip6ip6.
Due to this error out, it avoid passing it to the driver.  The TCP
stack then timeout and the TCP mtu probing eventually kicked in to
lower the skb->len enough to avoid gso_segment.

After commit 815d22e55b0e, ipv6_gso_segment() -> ipv6_gso_segment()
-> tcp6_gso_segment() which segment the packet based on a mss
that does not account for the extra IPv6 hdr.

Here is a stack from the WARN_ON() that we added to the driver to
capture the issue:
[ 1128.611875] WARNING: CPU: 40 PID: 31495 at drivers/net/ethernet/mellanox/mlx5/core/en_tx.c:424 mlx5e_xmit+0x814
...
[ 1129.016536] Call Trace:
[ 1129.021412]  ? skb_release_data+0xfc/0x120
[ 1129.029587]  ? kfree_skbmem+0x64/0x70
[ 1129.036905]  dev_hard_start_xmit+0xa4/0x200
[ 1129.045262]  sch_direct_xmit+0x10f/0x280
[ 1129.053111]  __qdisc_run+0x223/0x5a0
[ 1129.060251]  __dev_queue_xmit+0x245/0x7d0
[ 1129.068268]  dev_queue_xmit+0x10/0x20
[ 1129.075573]  ? dev_queue_xmit+0x10/0x20
[ 1129.083218]  ip6_finish_output2+0x2db/0x490
[ 1129.091573]  ip6_finish_output+0x125/0x190
[ 1129.099754]  ip6_output+0x5f/0x100
[ 1129.106548]  ? ip6_fragment+0x9f0/0x9f0
[ 1129.114212]  ip6_local_out+0x35/0x40
[ 1129.121356]  ip_vs_tunnel_xmit_v6+0x267/0x290 [ip_vs]
[ 1129.131443]  ip_vs_in.part.24+0x302/0x710 [ip_vs]
[ 1129.140837]  ? ip_vs_in.part.24+0x302/0x710 [ip_vs]
[ 1129.150578]  ? ip_vs_conn_out_get+0x17/0x140 [ip_vs]
[ 1129.160493]  ? ip_vs_conn_out_get_proto+0x25/0x30 [ip_vs]
[ 1129.171273]  ip_vs_in+0x43/0x130 [ip_vs]
[ 1129.179109]  ip_vs_local_request6+0x26/0x30 [ip_vs]
[ 1129.188849]  nf_hook_slow+0x3e/0xc0
[ 1129.195800]  ip6_xmit+0x30b/0x540
[ 1129.202421]  ? ac6_proc_exit+0x20/0x20
[ 1129.209909]  inet6_csk_xmit+0x82/0xd0
[ 1129.217207]  ? lock_timer_base+0x76/0xa0
[ 1129.225043]  tcp_transmit_skb+0x56f/0xa40
[ 1129.233051]  tcp_write_xmit+0x2b2/0x11b0
[ 1129.240885]  __tcp_push_pending_frames+0x33/0xa0
[ 1129.250106]  tcp_push+0xde/0x100
[ 1129.256554]  tcp_sendmsg_locked+0x9ca/0xca0
[ 1129.264910]  tcp_sendmsg+0x2c/0x50
[ 1129.271703]  inet_sendmsg+0x31/0xb0
[ 1129.278672]  sock_write_iter+0xf8/0x110
[ 1129.286335]  new_sync_write+0xd9/0x120
[ 1129.293823]  vfs_write+0x18d/0x1e0
[ 1129.300614]  SyS_write+0x48/0xa0
[ 1129.307045]  do_syscall_64+0x69/0x1e0
[ 1129.314361]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
...
[ 1129.648183] ---[ end trace 635061c9c300799e ]---
[ 1129.657407] skb->len:1554 MTU:1522

The tcp flow is connecting from the address ending ':27:0' to the ':85'.

[host-a] > ip -6 r show table local
local 2401:db00:1011:1f01:face:b00c:0:85 dev lo src 2401:db00:1011:10af:face:0:27:0 metric 1024 advmss 1440 pref medium

[host-a] > ip -6 a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 state UNKNOWN qlen 1000
    inet6 2401:db00:1011:1f01:face:b00c:0:85/128 scope global
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
    inet6 2401:db00:1011:10af:face:0:27:0/64 scope global
       valid_lft forever preferred_lft forever

[host-a] > cat /proc/net/ip_vs
TCP  [2401:db00:1011:1f01:face:b00c:0000:0085]:01BB rr
  -> [2401:db00:1011:10cc:face:0000:0091:0000]:01BB      Tunnel  6772   9          6
  -> [2401:db00:1011:10d8:face:0000:0091:0000]:01BB      Tunnel  6772   8          6
  -> [2401:db00:1011:10d2:face:0000:0091:0000]:01BB      Tunnel  6772   19         7

[host-a] > openssl s_client -connect [2401:db00:1011:1f01:face:b00c:0:85]:443
send-something-long-here-to-trigger-the-bug

Changing the local route mtu to 1460 to account for the extra ipv6 tunnel header
can also side step the issue.  Like this:

> ip -6 r show table local
local 2401:db00:1011:1f01:face:b00c:0:85 dev lo src 2401:db00:1011:10af:face:0:27:0 metric 1024 mtu 1460 advmss 1440 pref medium

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 net/netfilter/ipvs/ip_vs_xmit.c | 49 +++++++++++++++++++++++++++--------------
 1 file changed, 33 insertions(+), 16 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
index 11c416f3d6e3..88cc0d53ebce 100644
--- a/net/netfilter/ipvs/ip_vs_xmit.c
+++ b/net/netfilter/ipvs/ip_vs_xmit.c
@@ -212,13 +212,15 @@ static inline void maybe_update_pmtu(int skb_af, struct sk_buff *skb, int mtu)
 		ort->dst.ops->update_pmtu(&ort->dst, sk, NULL, mtu);
 }
 
-static inline bool ensure_mtu_is_adequate(struct netns_ipvs *ipvs, int skb_af,
+static inline bool ensure_mtu_is_adequate(struct ip_vs_conn *cp,
 					  int rt_mode,
 					  struct ip_vs_iphdr *ipvsh,
 					  struct sk_buff *skb, int mtu)
 {
+	struct netns_ipvs *ipvs = cp->ipvs;
+
 #ifdef CONFIG_IP_VS_IPV6
-	if (skb_af == AF_INET6) {
+	if (cp->af == AF_INET6) {
 		struct net *net = ipvs->net;
 
 		if (unlikely(__mtu_check_toobig_v6(skb, mtu))) {
@@ -251,6 +253,17 @@ static inline bool ensure_mtu_is_adequate(struct netns_ipvs *ipvs, int skb_af,
 		}
 	}
 
+	if (skb_shinfo(skb)->gso_size && cp->protocol == IPPROTO_TCP) {
+		const struct tcphdr *th = (struct tcphdr *)skb_transport_header(skb);
+		unsigned short hdr_len = (th->doff << 2) +
+			skb_network_header_len(skb);
+
+		if (mtu > hdr_len && mtu - hdr_len < skb_shinfo(skb)->gso_size)
+			skb_decrease_gso_size(skb_shinfo(skb),
+					      skb_shinfo(skb)->gso_size -
+					      (mtu - hdr_len));
+	}
+
 	return true;
 }
 
@@ -305,13 +318,15 @@ static inline bool decrement_ttl(struct netns_ipvs *ipvs,
 
 /* Get route to destination or remote server */
 static int
-__ip_vs_get_out_rt(struct netns_ipvs *ipvs, int skb_af, struct sk_buff *skb,
+__ip_vs_get_out_rt(struct ip_vs_conn *cp, struct sk_buff *skb,
 		   struct ip_vs_dest *dest,
 		   __be32 daddr, int rt_mode, __be32 *ret_saddr,
 		   struct ip_vs_iphdr *ipvsh)
 {
+	struct netns_ipvs *ipvs = cp->ipvs;
 	struct net *net = ipvs->net;
 	struct ip_vs_dest_dst *dest_dst;
+	int skb_af = cp->af;
 	struct rtable *rt;			/* Route to the other host */
 	int mtu;
 	int local, noref = 1;
@@ -389,7 +404,7 @@ __ip_vs_get_out_rt(struct netns_ipvs *ipvs, int skb_af, struct sk_buff *skb,
 		maybe_update_pmtu(skb_af, skb, mtu);
 	}
 
-	if (!ensure_mtu_is_adequate(ipvs, skb_af, rt_mode, ipvsh, skb, mtu))
+	if (!ensure_mtu_is_adequate(cp, rt_mode, ipvsh, skb, mtu))
 		goto err_put;
 
 	skb_dst_drop(skb);
@@ -455,13 +470,15 @@ __ip_vs_route_output_v6(struct net *net, struct in6_addr *daddr,
  * Get route to destination or remote server
  */
 static int
-__ip_vs_get_out_rt_v6(struct netns_ipvs *ipvs, int skb_af, struct sk_buff *skb,
+__ip_vs_get_out_rt_v6(struct ip_vs_conn *cp, struct sk_buff *skb,
 		      struct ip_vs_dest *dest,
 		      struct in6_addr *daddr, struct in6_addr *ret_saddr,
 		      struct ip_vs_iphdr *ipvsh, int do_xfrm, int rt_mode)
 {
+	struct netns_ipvs *ipvs = cp->ipvs;
 	struct net *net = ipvs->net;
 	struct ip_vs_dest_dst *dest_dst;
+	int skb_af = cp->af;
 	struct rt6_info *rt;			/* Route to the other host */
 	struct dst_entry *dst;
 	int mtu;
@@ -541,7 +558,7 @@ __ip_vs_get_out_rt_v6(struct netns_ipvs *ipvs, int skb_af, struct sk_buff *skb,
 		maybe_update_pmtu(skb_af, skb, mtu);
 	}
 
-	if (!ensure_mtu_is_adequate(ipvs, skb_af, rt_mode, ipvsh, skb, mtu))
+	if (!ensure_mtu_is_adequate(cp, rt_mode, ipvsh, skb, mtu))
 		goto err_put;
 
 	skb_dst_drop(skb);
@@ -679,7 +696,7 @@ ip_vs_bypass_xmit(struct sk_buff *skb, struct ip_vs_conn *cp,
 
 	EnterFunction(10);
 
-	if (__ip_vs_get_out_rt(cp->ipvs, cp->af, skb, NULL, iph->daddr,
+	if (__ip_vs_get_out_rt(cp, skb, NULL, iph->daddr,
 			       IP_VS_RT_MODE_NON_LOCAL, NULL, ipvsh) < 0)
 		goto tx_error;
 
@@ -708,7 +725,7 @@ ip_vs_bypass_xmit_v6(struct sk_buff *skb, struct ip_vs_conn *cp,
 
 	EnterFunction(10);
 
-	if (__ip_vs_get_out_rt_v6(cp->ipvs, cp->af, skb, NULL,
+	if (__ip_vs_get_out_rt_v6(cp, skb, NULL,
 				  &iph->daddr, NULL,
 				  ipvsh, 0, IP_VS_RT_MODE_NON_LOCAL) < 0)
 		goto tx_error;
@@ -753,7 +770,7 @@ ip_vs_nat_xmit(struct sk_buff *skb, struct ip_vs_conn *cp,
 	}
 
 	was_input = rt_is_input_route(skb_rtable(skb));
-	local = __ip_vs_get_out_rt(cp->ipvs, cp->af, skb, cp->dest, cp->daddr.ip,
+	local = __ip_vs_get_out_rt(cp, skb, cp->dest, cp->daddr.ip,
 				   IP_VS_RT_MODE_LOCAL |
 				   IP_VS_RT_MODE_NON_LOCAL |
 				   IP_VS_RT_MODE_RDR, NULL, ipvsh);
@@ -839,7 +856,7 @@ ip_vs_nat_xmit_v6(struct sk_buff *skb, struct ip_vs_conn *cp,
 		IP_VS_DBG(10, "filled cport=%d\n", ntohs(*p));
 	}
 
-	local = __ip_vs_get_out_rt_v6(cp->ipvs, cp->af, skb, cp->dest,
+	local = __ip_vs_get_out_rt_v6(cp, skb, cp->dest,
 				      &cp->daddr.in6,
 				      NULL, ipvsh, 0,
 				      IP_VS_RT_MODE_LOCAL |
@@ -1031,7 +1048,7 @@ ip_vs_tunnel_xmit(struct sk_buff *skb, struct ip_vs_conn *cp,
 
 	EnterFunction(10);
 
-	local = __ip_vs_get_out_rt(ipvs, cp->af, skb, cp->dest, cp->daddr.ip,
+	local = __ip_vs_get_out_rt(cp, skb, cp->dest, cp->daddr.ip,
 				   IP_VS_RT_MODE_LOCAL |
 				   IP_VS_RT_MODE_NON_LOCAL |
 				   IP_VS_RT_MODE_CONNECT |
@@ -1129,7 +1146,7 @@ ip_vs_tunnel_xmit_v6(struct sk_buff *skb, struct ip_vs_conn *cp,
 
 	EnterFunction(10);
 
-	local = __ip_vs_get_out_rt_v6(cp->ipvs, cp->af, skb, cp->dest,
+	local = __ip_vs_get_out_rt_v6(cp, skb, cp->dest,
 				      &cp->daddr.in6,
 				      &saddr, ipvsh, 1,
 				      IP_VS_RT_MODE_LOCAL |
@@ -1218,7 +1235,7 @@ ip_vs_dr_xmit(struct sk_buff *skb, struct ip_vs_conn *cp,
 
 	EnterFunction(10);
 
-	local = __ip_vs_get_out_rt(cp->ipvs, cp->af, skb, cp->dest, cp->daddr.ip,
+	local = __ip_vs_get_out_rt(cp, skb, cp->dest, cp->daddr.ip,
 				   IP_VS_RT_MODE_LOCAL |
 				   IP_VS_RT_MODE_NON_LOCAL |
 				   IP_VS_RT_MODE_KNOWN_NH, NULL, ipvsh);
@@ -1252,7 +1269,7 @@ ip_vs_dr_xmit_v6(struct sk_buff *skb, struct ip_vs_conn *cp,
 
 	EnterFunction(10);
 
-	local = __ip_vs_get_out_rt_v6(cp->ipvs, cp->af, skb, cp->dest,
+	local = __ip_vs_get_out_rt_v6(cp, skb, cp->dest,
 				      &cp->daddr.in6,
 				      NULL, ipvsh, 0,
 				      IP_VS_RT_MODE_LOCAL |
@@ -1317,7 +1334,7 @@ ip_vs_icmp_xmit(struct sk_buff *skb, struct ip_vs_conn *cp,
 	rt_mode = (hooknum != NF_INET_FORWARD) ?
 		  IP_VS_RT_MODE_LOCAL | IP_VS_RT_MODE_NON_LOCAL |
 		  IP_VS_RT_MODE_RDR : IP_VS_RT_MODE_NON_LOCAL;
-	local = __ip_vs_get_out_rt(cp->ipvs, cp->af, skb, cp->dest, cp->daddr.ip, rt_mode,
+	local = __ip_vs_get_out_rt(cp, skb, cp->dest, cp->daddr.ip, rt_mode,
 				   NULL, iph);
 	if (local < 0)
 		goto tx_error;
@@ -1406,7 +1423,7 @@ ip_vs_icmp_xmit_v6(struct sk_buff *skb, struct ip_vs_conn *cp,
 	rt_mode = (hooknum != NF_INET_FORWARD) ?
 		  IP_VS_RT_MODE_LOCAL | IP_VS_RT_MODE_NON_LOCAL |
 		  IP_VS_RT_MODE_RDR : IP_VS_RT_MODE_NON_LOCAL;
-	local = __ip_vs_get_out_rt_v6(cp->ipvs, cp->af, skb, cp->dest,
+	local = __ip_vs_get_out_rt_v6(cp, skb, cp->dest,
 				      &cp->daddr.in6, NULL, ipvsh, 0, rt_mode);
 	if (local < 0)
 		goto tx_error;
-- 
2.9.5

^ permalink raw reply related

* Re: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM
From: Mikulas Patocka @ 2018-04-19 21:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: dm-devel, eric.dumazet, mst, netdev, linux-kernel, virtualization,
	linux-mm, edumazet, bhutchings, David Miller, Vlastimil Babka
In-Reply-To: <20180419124751.8884e516e99825d83da3d87a@linux-foundation.org>



On Thu, 19 Apr 2018, Andrew Morton wrote:

> On Thu, 19 Apr 2018 12:12:38 -0400 (EDT) Mikulas Patocka <mpatocka@redhat.com> wrote:
> 
> > The kvmalloc function tries to use kmalloc and falls back to vmalloc if
> > kmalloc fails.
> > 
> > Unfortunatelly, some kernel code has bugs - it uses kvmalloc and then
> > uses DMA-API on the returned memory or frees it with kfree. Such bugs were
> > found in the virtio-net driver, dm-integrity or RHEL7 powerpc-specific
> > code.
> > 
> > These bugs are hard to reproduce because vmalloc falls back to kmalloc
> > only if memory is fragmented.
> 
> Yes, that's nasty.
> 
> > In order to detect these bugs reliably I submit this patch that changes
> > kvmalloc to always use vmalloc if CONFIG_DEBUG_VM is turned on.
> > 
> > ...
> >
> > --- linux-2.6.orig/mm/util.c	2018-04-18 15:46:23.000000000 +0200
> > +++ linux-2.6/mm/util.c	2018-04-18 16:00:43.000000000 +0200
> > @@ -395,6 +395,7 @@ EXPORT_SYMBOL(vm_mmap);
> >   */
> >  void *kvmalloc_node(size_t size, gfp_t flags, int node)
> >  {
> > +#ifndef CONFIG_DEBUG_VM
> >  	gfp_t kmalloc_flags = flags;
> >  	void *ret;
> >  
> > @@ -426,6 +427,7 @@ void *kvmalloc_node(size_t size, gfp_t f
> >  	 */
> >  	if (ret || size <= PAGE_SIZE)
> >  		return ret;
> > +#endif
> >  
> >  	return __vmalloc_node_flags_caller(size, node, flags,
> >  			__builtin_return_address(0));
> 
> Well, it doesn't have to be done at compile-time, does it?  We could
> add a knob (in debugfs, presumably) which enables this at runtime. 
> That's far more user-friendly.

But who will turn it on in debugfs? It should be default for debugging 
kernels, so that users using them would report the error.

Conditioning it on CONFIG_DEBUG_SG is better than CONFIG_DEBUG_VM, it will 
print a stacktrace where the incorrect use happened.

Mikulas

^ permalink raw reply

* Re: [PATCH net 5/5] nfp: remove false positive offloads in flower vxlan
From: Or Gerlitz @ 2018-04-19 21:11 UTC (permalink / raw)
  To: John Hurley; +Cc: Jakub Kicinski, Linux Netdev List, oss-drivers, Simon Horman
In-Reply-To: <CAK+XE=mvnVaLnJqV+5dm+XaN90YjmFF_gkFCxPBmXAgL_ts=ng@mail.gmail.com>

On Thu, Apr 19, 2018 at 1:31 AM, John Hurley <john.hurley@netronome.com> wrote:
> On Wed, Apr 18, 2018 at 7:18 PM, Or Gerlitz <gerlitz.or@gmail.com> wrote:
>> On Wed, Apr 18, 2018 at 3:31 PM, John Hurley <john.hurley@netronome.com> wrote:
>>> On Wed, Apr 18, 2018 at 8:43 AM, Or Gerlitz <gerlitz.or@gmail.com> wrote:
>>>> On Fri, Nov 17, 2017 at 4:06 AM, Jakub Kicinski
>>>> <jakub.kicinski@netronome.com> wrote:
>>>>> From: John Hurley <john.hurley@netronome.com>
>>>>>
>>>>> Pass information to the match offload on whether or not the repr is the
>>>>> ingress or egress dev. Only accept tunnel matches if repr is the egress dev.
>>>>>
>>>>> This means rules such as the following are successfully offloaded:
>>>>> tc .. add dev vxlan0 .. enc_dst_port 4789 .. action redirect dev nfp_p0
>>>>>
>>>>> While rules such as the following are rejected:
>>>>> tc .. add dev nfp_p0 .. enc_dst_port 4789 .. action redirect dev vxlan0
>>>>
>>>> cool
>>>>
>>>>
>>>>> Also reject non tunnel flows that are offloaded to an egress dev.
>>>>> Non tunnel matches assume that the offload dev is the ingress port and
>>>>> offload a match accordingly.
>>>>
>>>> not following on the "Also" here, see below
>>>>
>>>>
>>>>> diff --git a/drivers/net/ethernet/netronome/nfp/flower/offload.c b/drivers/net/ethernet/netronome/nfp/flower/offload.c
>>>>> index a0193e0c24a0..f5d73b83dcc2 100644
>>>>> --- a/drivers/net/ethernet/netronome/nfp/flower/offload.c
>>>>> +++ b/drivers/net/ethernet/netronome/nfp/flower/offload.c
>>>>> @@ -131,7 +131,8 @@ static bool nfp_flower_check_higher_than_mac(struct tc_cls_flower_offload *f)
>>>>>
>>>>>  static int
>>>>>  nfp_flower_calculate_key_layers(struct nfp_fl_key_ls *ret_key_ls,
>>>>> -                               struct tc_cls_flower_offload *flow)
>>>>> +                               struct tc_cls_flower_offload *flow,
>>>>> +                               bool egress)
>>>>>  {
>>>>>         struct flow_dissector_key_basic *mask_basic = NULL;
>>>>>         struct flow_dissector_key_basic *key_basic = NULL;
>>>>> @@ -167,6 +168,9 @@ nfp_flower_calculate_key_layers(struct nfp_fl_key_ls *ret_key_ls,
>>>>>                         skb_flow_dissector_target(flow->dissector,
>>>>>                                                   FLOW_DISSECTOR_KEY_ENC_CONTROL,
>>>>>                                                   flow->key);
>>>>> +               if (!egress)
>>>>> +                       return -EOPNOTSUPP;
>>>>> +
>>>>>                 if (mask_enc_ctl->addr_type != 0xffff ||
>>>>>                     enc_ctl->addr_type != FLOW_DISSECTOR_KEY_IPV4_ADDRS)
>>>>>                         return -EOPNOTSUPP;
>>>>> @@ -194,6 +198,9 @@ nfp_flower_calculate_key_layers(struct nfp_fl_key_ls *ret_key_ls,
>>>>>
>>>>>                 key_layer |= NFP_FLOWER_LAYER_VXLAN;
>>>>>                 key_size += sizeof(struct nfp_flower_vxlan);
>>>>> +       } else if (egress) {
>>>>> +               /* Reject non tunnel matches offloaded to egress repr. */
>>>>> +               return -EOPNOTSUPP;
>>>>>         }
>>>>
>>>> with these two hunks we get: egress <- IFF -> encap match, right?
>>>>
>>>> (1) we can't offload the egress way if there isn't matching on encap headers
>>>> (2) we can't go the matching on encap headers way if we are not egress
>>>>
>>>
>>> yes, this is correct.
>>> With the block code and egdev offload, we do not have access to the
>>> ingress netdev when doing an offload.
>>> We need to use the encap headers (especially the enc_port) to
>>> distinguish the type of tunnel used and, therefore, require that the
>>> encap matches be present before offloading.
>>>
>>>> what other cases are rejected by this logic?
>>>>
>>>
>>> Yes, some other cases may be rejected (like veth mentioned below).
>>
>> my claim is that the veth case I mentioned below will not be rejected
>> if it has the matching on encap headers, and a wrong rule will be set
>> into hw, agree?
>>
>
> yes, unfortunately this is correct.
> Without having access to the ingress netdev we have to put as many
> restrictions as possible to ensure it is 'almost certainly' a given
> ingress netdev but extreme cases can bypass this.
>
>>> However, this is better than allowing rules to be incorrectly
>>> offloaded (as could have happened before these changes).
>>
>>> Currently, we are looking at offloading flows on other ingress devices
>>> such as bonds so this will require a change to the driver code here.
>>
>> for the ingress side, Jiri suggested that the slave devices (uplink reps),
>> will be just getting all the rules set on the bond, so I am not sure what
>> problem you see here... for decap it will be still vxlan --> vf rep and your
>> egress logic will allow it.
>>
>
> Yes, Jiri suggested on another thread that the bonds simply relay
> rules to their slaves.
> This will work fine if uplink reprs are enslaved by a bond before
> rules are added to it.
> It would also assume that uplink reprs are not removed from/added to
> the bond at later stages.
> Doing this would require flushing the bond rules or writing all
> existing rules to one of the slaves but not others.
> Do you have any opinions on handling such situations?

I looked now on the thread you've posted lately, there were some responses
on the matters you brought here. We'll (MLNX) get there soon I guess too.

^ permalink raw reply

* Re: [PATCH bpf-next v5 00/10] BTF: BPF Type Format
From: Martin KaFai Lau @ 2018-04-19 20:58 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: netdev, Alexei Starovoitov, Daniel Borkmann, kernel-team
In-Reply-To: <20180419194034.GB3254@kernel.org>

On Thu, Apr 19, 2018 at 04:40:34PM -0300, Arnaldo Carvalho de Melo wrote:
> Em Wed, Apr 18, 2018 at 03:55:56PM -0700, Martin KaFai Lau escreveu:
> > This patch introduces BPF Type Format (BTF).
> > 
> > BTF (BPF Type Format) is the meta data format which describes
> > the data types of BPF program/map.  Hence, it basically focus
> > on the C programming language which the modern BPF is primary
> > using.  The first use case is to provide a generic pretty print
> > capability for a BPF map.
> > 
> > A modified pahole that can convert dwarf to BTF is here:
> > https://github.com/iamkafai/pahole/tree/btf
> > (Arnaldo, there is some BTF_KIND numbering changes on
> >  Apr 18th, d61426c1571)
> 
> Thanks for letting me know, I'm starting to look at this,
Thanks for reviewing.  Feel free to comment directly on the github diff.
Also, I think it may make sense to wait for the kernel pieces to land
first.


> 
> - Arnaldo
>  
> > Please see individual patch for details.
> > 
> > v5:
> > - Remove BTF_KIND_FLOAT and BTF_KIND_FUNC which are not
> >   currently used.  They can be added in the future.
> >   Some bpf_df_xxx() are removed together.
> > - Add comment in patch 7 to clarify that the new bpffs_map_fops
> >   should not be extended further.
> > 
> > v4:
> > - Fix warning (remove unneeded semicolon)
> > - Remove a redundant variable (nr_bytes) from btf_int_check_meta() in
> >   patch 1.  Caught by W=1.
> > 
> > v3:
> > - Rebase to bpf-next
> > - Fix sparse warning (by adding static)
> > - Add BTF header logging: btf_verifier_log_hdr()
> > - Fix the alignment test on btf->type_off
> > - Add tests for the BTF header
> > - Lower the max BTF size to 16MB.  It should be enough
> >   for some time.  We could raise it later if it would
> >   be needed.
> > 
> > v2:
> > - Use kvfree where needed in patch 1 and 2
> > - Also consider BTF_INT_OFFSET() in the btf_int_check_meta()
> >   in patch 1
> > - Fix an incorrect goto target in map_create() during
> >   the btf-error-path in patch 7
> > - re-org some local vars to keep the rev xmas tree in btf.c
> > 
> > Martin KaFai Lau (10):
> >   bpf: btf: Introduce BPF Type Format (BTF)
> >   bpf: btf: Validate type reference
> >   bpf: btf: Check members of struct/union
> >   bpf: btf: Add pretty print capability for data with BTF type info
> >   bpf: btf: Add BPF_BTF_LOAD command
> >   bpf: btf: Add BPF_OBJ_GET_INFO_BY_FD support to BTF fd
> >   bpf: btf: Add pretty print support to the basic arraymap
> >   bpf: btf: Sync bpf.h and btf.h to tools/
> >   bpf: btf: Add BTF support to libbpf
> >   bpf: btf: Add BTF tests
> > 
> >  include/linux/bpf.h                          |   20 +-
> >  include/linux/btf.h                          |   48 +
> >  include/uapi/linux/bpf.h                     |   12 +
> >  include/uapi/linux/btf.h                     |  130 ++
> >  kernel/bpf/Makefile                          |    1 +
> >  kernel/bpf/arraymap.c                        |   50 +
> >  kernel/bpf/btf.c                             | 2064 ++++++++++++++++++++++++++
> >  kernel/bpf/inode.c                           |  156 +-
> >  kernel/bpf/syscall.c                         |   51 +-
> >  tools/include/uapi/linux/bpf.h               |   12 +
> >  tools/include/uapi/linux/btf.h               |  130 ++
> >  tools/lib/bpf/Build                          |    2 +-
> >  tools/lib/bpf/bpf.c                          |   92 +-
> >  tools/lib/bpf/bpf.h                          |   16 +
> >  tools/lib/bpf/btf.c                          |  374 +++++
> >  tools/lib/bpf/btf.h                          |   22 +
> >  tools/lib/bpf/libbpf.c                       |  148 +-
> >  tools/lib/bpf/libbpf.h                       |    3 +
> >  tools/testing/selftests/bpf/Makefile         |   26 +-
> >  tools/testing/selftests/bpf/test_btf.c       | 1669 +++++++++++++++++++++
> >  tools/testing/selftests/bpf/test_btf_haskv.c |   48 +
> >  tools/testing/selftests/bpf/test_btf_nokv.c  |   43 +
> >  22 files changed, 5076 insertions(+), 41 deletions(-)
> >  create mode 100644 include/linux/btf.h
> >  create mode 100644 include/uapi/linux/btf.h
> >  create mode 100644 kernel/bpf/btf.c
> >  create mode 100644 tools/include/uapi/linux/btf.h
> >  create mode 100644 tools/lib/bpf/btf.c
> >  create mode 100644 tools/lib/bpf/btf.h
> >  create mode 100644 tools/testing/selftests/bpf/test_btf.c
> >  create mode 100644 tools/testing/selftests/bpf/test_btf_haskv.c
> >  create mode 100644 tools/testing/selftests/bpf/test_btf_nokv.c
> > 
> > -- 
> > 2.9.5

^ permalink raw reply

* Re: [PATCH net-next 0/4] tracking TCP data delivery and ECN stats
From: Yuchung Cheng @ 2018-04-19 20:52 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Eric Dumazet, Neal Cardwell, Soheil Hassas Yeganeh
In-Reply-To: <20180419.130710.1404241211516484963.davem@davemloft.net>

On Thu, Apr 19, 2018 at 10:07 AM, David Miller <davem@davemloft.net> wrote:
>
> From: Yuchung Cheng <ycheng@google.com>
> Date: Tue, 17 Apr 2018 23:18:45 -0700
>
> > This patch series improve tracking the data delivery status
> >   1. minor improvement on SYN data
> >   2. accounting bytes delivered with CE marks
> >   3. exporting the delivery stats to applications
> >
> > s.t. users can get better sense of TCP performance at per host,
> > per connection, and even per application message level.
>
> Definitely useful, so series applied.
Thanks.

TCP socket is getting bigger and bigger :-( I am cooking a patch set
to simplify loss recovery that should help conserving the space.

>
> But it is not lost upon me that slowly over time tcp sockets are
> bloating quite a bit...

^ permalink raw reply

* Re: [PATCH net 0/1] net/smc: shutdown fix
From: David Miller @ 2018-04-19 20:39 UTC (permalink / raw)
  To: ubraun
  Cc: netdev, linux-s390, schwidefsky, heiko.carstens, raspl, stephen,
	ubraun
In-Reply-To: <20180419135640.2907-1-ubraun@linux.ibm.com>

From: Ursula Braun <ubraun@linux.ibm.com>
Date: Thu, 19 Apr 2018 15:56:39 +0200

> This patch fixes the problem and is a candidate for -stable.

Ok, queueud up.

^ permalink raw reply

* Re: [PATCH net 1/1] net/smc: fix shutdown in state SMC_LISTEN
From: David Miller @ 2018-04-19 20:39 UTC (permalink / raw)
  To: ubraun
  Cc: netdev, linux-s390, schwidefsky, heiko.carstens, raspl, stephen,
	ubraun
In-Reply-To: <20180419135640.2907-2-ubraun@linux.ibm.com>

From: Ursula Braun <ubraun@linux.ibm.com>
Date: Thu, 19 Apr 2018 15:56:40 +0200

> From: Ursula Braun <ubraun@linux.vnet.ibm.com>
> 
> Calling shutdown with SHUT_RD and SHUT_RDWR for a listening SMC socket
> crashes, because
>    commit 127f49705823 ("net/smc: release clcsock from tcp_listen_worker")
> releases the internal clcsock in smc_close_active() and sets smc->clcsock
> to NULL.
> For SHUT_RD the smc_close_active() call is removed.
> For SHUT_RDWR the kernel_sock_shutdown() call is omitted, since the
> clcsock is already released.
> 
> Fixes: 127f49705823 ("net/smc: release clcsock from tcp_listen_worker")
> Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
> Reported-by: Stephen Hemminger <stephen@networkplumber.org>

Applied, thank you.

^ permalink raw reply

* Re: [PATCH net] bnxt_en: Fix memory fault in bnxt_ethtool_init()
From: David Miller @ 2018-04-19 20:35 UTC (permalink / raw)
  To: michael.chan; +Cc: netdev, kernel-team
In-Reply-To: <1524122176-13511-1-git-send-email-michael.chan@broadcom.com>

From: Michael Chan <michael.chan@broadcom.com>
Date: Thu, 19 Apr 2018 03:16:16 -0400

> From: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
> 
> In some firmware images, the length of BNX_DIR_TYPE_PKG_LOG nvram type
> could be greater than the fixed buffer length of 4096 bytes allocated by
> the driver.  This was causing HWRM_NVM_READ to copy more data to the buffer
> than the allocated size, causing general protection fault.
> 
> Fix the issue by allocating the exact buffer length returned by
> HWRM_NVM_FIND_DIR_ENTRY, instead of 4096.  Move the kzalloc() call
> into the bnxt_get_pkgver() function.
> 
> Fixes: 3ebf6f0a09a2 ("bnxt_en: Add installed-package firmware version reporting via Ethtool GDRVINFO")
> Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
> Signed-off-by: Michael Chan <michael.chan@broadcom.com>

Applied, thanks Michael.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox