Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net] VSOCK: check sk state before receive
From: Jorgen S. Hansen @ 2018-06-04 16:02 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Hangbin Liu, netdev@vger.kernel.org, David S. Miller
In-Reply-To: <20180530091727.GF14623@stefanha-x1.localdomain>


> On May 30, 2018, at 11:17 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Sun, May 27, 2018 at 11:29:45PM +0800, Hangbin Liu wrote:
>> Hmm...Although I won't reproduce this bug with my reproducer after
>> apply my patch. I could still get a similiar issue with syzkaller sock vnet test.
>> 
>> It looks this patch is not complete. Here is the KASAN call trace with my patch.
>> I can also reproduce it without my patch.
> 
> Seems like a race between vmci_datagram_destroy_handle() and the
> delayed callback, vmci_transport_recv_dgram_cb().
> 
> I don't know the VMCI transport well so I'll leave this to Jorgen.

Yes, it looks like we are calling the delayed callback after we return from vmci_datagram_destroy_handle(). I’ll take a closer look at the VMCI side here - the refcounting of VMCI datagram endpoints should guard against this, since the delayed callback does a get on the datagram resource, so this could a VMCI driver issue, and not a problem in the VMCI transport for AF_VSOCK.

> 
>> ==================================================================
>> BUG: KASAN: use-after-free in vmci_transport_allow_dgram.part.7+0x155/0x1a0 [vmw_vsock_vmci_transport]
>> Read of size 4 at addr ffff880026a3a914 by task kworker/0:2/96
>> 
>> CPU: 0 PID: 96 Comm: kworker/0:2 Not tainted 4.17.0-rc6.vsock+ #28
>> Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
>> Workqueue: events dg_delayed_dispatch [vmw_vmci]
>> Call Trace:
>> __dump_stack lib/dump_stack.c:77 [inline]
>> dump_stack+0xdd/0x18e lib/dump_stack.c:113
>> print_address_description+0x7a/0x3e0 mm/kasan/report.c:256
>> kasan_report_error mm/kasan/report.c:354 [inline]
>> kasan_report+0x1dd/0x460 mm/kasan/report.c:412
>> vmci_transport_allow_dgram.part.7+0x155/0x1a0 [vmw_vsock_vmci_transport]
>> vmci_transport_recv_dgram_cb+0x5d/0x200 [vmw_vsock_vmci_transport]
>> dg_delayed_dispatch+0x99/0x1b0 [vmw_vmci]
>> process_one_work+0xa4e/0x1720 kernel/workqueue.c:2145
>> worker_thread+0x1df/0x1400 kernel/workqueue.c:2279
>> kthread+0x343/0x4b0 kernel/kthread.c:240
>> ret_from_fork+0x35/0x40 arch/x86/entry/entry_64.S:412
>> 
>> Allocated by task 2684:
>> set_track mm/kasan/kasan.c:460 [inline]
>> kasan_kmalloc+0xa0/0xd0 mm/kasan/kasan.c:553
>> slab_post_alloc_hook mm/slab.h:444 [inline]
>> slab_alloc_node mm/slub.c:2741 [inline]
>> slab_alloc mm/slub.c:2749 [inline]
>> kmem_cache_alloc+0x105/0x330 mm/slub.c:2754
>> sk_prot_alloc+0x6a/0x2c0 net/core/sock.c:1468
>> sk_alloc+0xc9/0xbb0 net/core/sock.c:1528
>> __vsock_create+0xc8/0x9b0 [vsock]
>> vsock_create+0xfd/0x1a0 [vsock]
>> __sock_create+0x310/0x690 net/socket.c:1285
>> sock_create net/socket.c:1325 [inline]
>> __sys_socket+0x101/0x240 net/socket.c:1355
>> __do_sys_socket net/socket.c:1364 [inline]
>> __se_sys_socket net/socket.c:1362 [inline]
>> __x64_sys_socket+0x7d/0xd0 net/socket.c:1362
>> do_syscall_64+0x175/0x630 arch/x86/entry/common.c:287
>> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> 
>> Freed by task 2684:
>> set_track mm/kasan/kasan.c:460 [inline]
>> __kasan_slab_free+0x130/0x180 mm/kasan/kasan.c:521
>> slab_free_hook mm/slub.c:1388 [inline]
>> slab_free_freelist_hook mm/slub.c:1415 [inline]
>> slab_free mm/slub.c:2988 [inline]
>> kmem_cache_free+0xce/0x410 mm/slub.c:3004
>> sk_prot_free net/core/sock.c:1509 [inline]
>> __sk_destruct+0x629/0x940 net/core/sock.c:1593
>> sk_destruct+0x4e/0x90 net/core/sock.c:1601
>> __sk_free+0xd3/0x320 net/core/sock.c:1612
>> sk_free+0x2a/0x30 net/core/sock.c:1623
>> __vsock_release+0x431/0x610 [vsock]
>> vsock_release+0x3c/0xc0 [vsock]
>> sock_release+0x91/0x200 net/socket.c:594
>> sock_close+0x17/0x20 net/socket.c:1149
>> __fput+0x368/0xa20 fs/file_table.c:209
>> task_work_run+0x1c5/0x2a0 kernel/task_work.c:113
>> exit_task_work include/linux/task_work.h:22 [inline]
>> do_exit+0x1876/0x26c0 kernel/exit.c:865
>> do_group_exit+0x159/0x3e0 kernel/exit.c:968
>> get_signal+0x65a/0x1780 kernel/signal.c:2482
>> do_signal+0xa4/0x1fe0 arch/x86/kernel/signal.c:810
>> exit_to_usermode_loop+0x1b8/0x260 arch/x86/entry/common.c:162
>> prepare_exit_to_usermode arch/x86/entry/common.c:196 [inline]
>> syscall_return_slowpath arch/x86/entry/common.c:265 [inline]
>> do_syscall_64+0x505/0x630 arch/x86/entry/common.c:290
>> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> 
>> The buggy address belongs to the object at ffff880026a3a600
>> which belongs to the cache AF_VSOCK of size 1056
>> The buggy address is located 788 bytes inside of
>> 1056-byte region [ffff880026a3a600, ffff880026a3aa20)
>> The buggy address belongs to the page:
>> page:ffffea00009a8e00 count:1 mapcount:0 mapping:0000000000000000 index:0x0 compound_mapcount: 0
>> flags: 0xfffffc0008100(slab|head)
>> raw: 000fffffc0008100 0000000000000000 0000000000000000 00000001000d000d
>> raw: dead000000000100 dead000000000200 ffff880034471a40 0000000000000000
>> page dumped because: kasan: bad access detected
>> 
>> Memory state around the buggy address:
>> ffff880026a3a800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> ffff880026a3a880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>> ffff880026a3a900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>                         ^
>> ffff880026a3a980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> ffff880026a3aa00: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc
>> ==================================================================


^ permalink raw reply

* [PATCH bpf-next] bpf: guard bpf_get_current_cgroup_id() with CONFIG_CGROUPS
From: Yonghong Song @ 2018-06-04 15:53 UTC (permalink / raw)
  To: ast, daniel, netdev; +Cc: kernel-team

Commit bf6fa2c893c5 ("bpf: implement bpf_get_current_cgroup_id()
helper") introduced a new helper bpf_get_current_cgroup_id().
The helper has a dependency on CONFIG_CGROUPS.

When CONFIG_CGROUPS is not defined, using the helper will result
the following verifier error:
  kernel subsystem misconfigured func bpf_get_current_cgroup_id#80
which is hard for users to interpret.
Guarding the reference to bpf_get_current_cgroup_id_proto with
CONFIG_CGROUPS will result in below better message:
  unknown func bpf_get_current_cgroup_id#80

Fixes: bf6fa2c893c5 ("bpf: implement bpf_get_current_cgroup_id() helper")
Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Yonghong Song <yhs@fb.com>
---
 kernel/trace/bpf_trace.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index e2ab5b7..0ae6829 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -564,8 +564,10 @@ tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_get_prandom_u32_proto;
 	case BPF_FUNC_probe_read_str:
 		return &bpf_probe_read_str_proto;
+#ifdef CONFIG_CGROUPS
 	case BPF_FUNC_get_current_cgroup_id:
 		return &bpf_get_current_cgroup_id_proto;
+#endif
 	default:
 		return NULL;
 	}
-- 
2.9.5

^ permalink raw reply related

* Re: [RFC feedback] AF_XDP and non-Intel hardware
From: Björn Töpel @ 2018-06-04 15:43 UTC (permalink / raw)
  To: MykytaI Iziumtsev
  Cc: Netdev, Björn Töpel, Karlsson, Magnus, Zhang, Qi Z,
	Francois Ozog, Ilias Apalodimas, Brian Brooks,
	Jesper Dangaard Brouer, Andy Gospodarek, michael.chan,
	Luke Gorrie
In-Reply-To: <CAPvmOuEeag+R9sa_yGRy2ZVpmcofoa7Ey6y1To3qi=QV4BV6ww@mail.gmail.com>

Den mån 4 juni 2018 kl 11:44 skrev Mykyta Iziumtsev
<mykyta.iziumtsev@linaro.org>:
>
> Hi Björn and Magnus,
>
> Here is a short update on the proposed changes to AF_XDP.
>
> During implementation phase I figured out that 'end-of-page' in RX
> descriptor's flags won't work, unfortunately. This is because the
> kernel driver can't know if the frame will be last on the page when RX
> descriptor is being filled in. Imagine that the driver is processing a
> frame on a page, and there is still 1000 bytes of non utilized memory
> beyond this frame. If the next frame (which haven't yet arrived) is
> less than that -- the NIC can place it on the same page, otherwise the
> NIC will skip remaining 1000 bytes and take next page. Of course, the
> driver can update RX descriptor retrospectively, or use 'empty' RX
> descriptors just to indicate page completion, but it's too complex or
> inefficient and as such doesn't look good.
>

I think both "update RX descriptor retrospectively" and "zero-length
end-of-chunk descriptor" are OK (as long as the zero-length descs much
less common that the non-zero one). Out of curiosity, does Chelsio (or
any other type-writer based) NIC has a public data sheet? It would be
interesting to see how this is solved in practice.

> What I could propose instead is to make the 'filling' and 'page
> completion' symmetric. That is, UMEM will have 1 queue to pass pages
> from application to the kernel driver (fring in your terminology) and
> one to indicate complete pages back to the application.
>
> That 'page completion' queue can be added as a third queue to UMEM, or
> alternatively TX completions can be decoupled from UMEM to make it an
> simple 'memory area' object with two directional queues which work
> with RX pages only. TX completions can be handled in a simpler manner
> as part of TX queues: just make sure consumer index is advanced only
> when the frame is finally sent out. The TX completion processing can
> be then done by observing consume index advance from application. That
> would  not only be more in line with best NIC ring design practices
> but will improve cache locality and eliminate critical section  if TX
> queues are assigned to different CPUs.
>
...and this is also a valid option. Adding an additional umem ring for
end-of-chunk completions, might make sense, but let's the zero-copy
implementations drive that requirement.

As you probably noted Magnus and I just sent out a patch set with an
updated descriptor layout. A next step would be adding support for
type-writer styled NICs.

Thanks for the information, and hopefully we'll see an additional
AF_XDP zero-copy implementation on list (nudge, nudge). ;-)

Björn

> With best regards,
> Mykyta
>
> On 22 May 2018 at 14:09, Björn Töpel <bjorn.topel@gmail.com> wrote:
> > 2018-05-22 9:45 GMT+02:00 Mykyta Iziumtsev <mykyta.iziumtsev@linaro.org>:
> >> On 21 May 2018 at 20:55, Björn Töpel <bjorn.topel@gmail.com> wrote:
> >>> 2018-05-21 14:34 GMT+02:00 Mykyta Iziumtsev <mykyta.iziumtsev@linaro.org>:
> >>>> Hi Björn and Magnus,
> >>>>
> >>>> (This thread is a follow up to private dialogue. The intention is to
> >>>> let community know that AF_XDP can be enhanced further to make it
> >>>> compatible with wider range of NIC vendors).
> >>>>
> >>>
> >>> Mykyta, thanks for doing the write-up and sending it to the netdev
> >>> list! The timing could not be better -- we need to settle on an uapi
> >>> that works for all vendors prior enabling it in the kernel.
> >>>
> >>>> There are two NIC variations which don't fit well with current AF_XDP proposal.
> >>>>
> >>>> The first variation is represented by some NXP and Cavium NICs. AF_XDP
> >>>> expects NIC to put incoming frames into slots defined by UMEM area.
> >>>> Here slot size is set in XDP_UMEM_REG xdp_umem.reg.frame_size and
> >>>> slots available to NIC are communicated to the kernel via UMEM fill
> >>>> queue. While Intel NICs support only one slot size, NXP and Cavium
> >>>> support multiple slot sizes to optimize memory usage (e.g. { 128, 512,
> >>>> 2048 }, please refer to [1] for rationale). On frame reception the NIC
> >>>> pulls a slot from best fit pool based on frame size.
> >>>>
> >>>> The second variation is represented by e.g. Chelsio T5/T6 and Netcope
> >>>> NICs. As shown above, AF_XDP expects NIC to put incoming frames at
> >>>> predefined addresses. This is not the case here as the NIC is in
> >>>> charge of placing frames in memory based on it's own algorithm. For
> >>>> example, Chelsio T5/T6 is expecting to get whole pages from the driver
> >>>> and puts incoming frames on the page in a 'tape recorder' fashion.
> >>>> Assuming 'base' is page address and flen[N] is an array of frame
> >>>> lengths, the frame placement in memory will look like that:
> >>>>   base + 0
> >>>>   base + frame_len[0]
> >>>>   base + frame_len[0] + frame_len[1]
> >>>>   base + frame_len[0] + frame_len[1] + frame_len[2]
> >>>>   ...
> >>>>
> >>>> To better support these two NIC variations I suggest to abandon 'frame
> >>>> size' structuring in UMEM and stick with 'pages' instead. The NIC
> >>>> kernel driver is then responsible for splitting provided pages into
> >>>> slots expected by underlying HW (or not splitting at all in case of
> >>>> Chelsio/Netcope).
> >>>>
> >>>
> >>> Let's call the first variation "multi-pool" and the second
> >>> "type-writer" for simplicity. The type-writer model is very
> >>> interesting, and makes a lot of sense when the PCIe bus is a
> >>> constraint.
> >>>
> >>>> On XDP_UMEM_REG the application needs to specify page_size. Then the
> >>>> application can pass empty pages to the kernel driver using UMEM
> >>>> 'fill' queue by specifying page offset within the UMEM area. xdp_desc
> >>>> format needs to be changed as well: frame location will be defined by
> >>>> offset from the beginning of UMEM area instead of frame index. As
> >>>> payload headroom can vary with AF_XDP we'll need to specify it in
> >>>> xdp_desc as well. Beside that it could be worth to consider changing
> >>>> payload length to u16 as 64k+ frames aren't very common in networking.
> >>>> The resulting xdp_desc would look like that:
> >>>>
> >>>> struct xdp_desc {
> >>>>         __u64 offset;
> >>>>         __u16 headroom;
> >>>>         __u16 len;
> >>>>         __u8 flags;
> >>>>         __u8 padding[3];
> >>>> };
> >>>>
> >>>
> >>> Let's expand a bit here; Instead of passing indicies to fixed sized
> >>> frames to the fill ring, we now pass a memory region. For simplicity,
> >>> let's say a page. The fill ring descriptor requires offset and
> >>> len. The offset is a global offset from an UMEM perspective, and len
> >>> is the size of the region.
> >>>
> >>
> >> I would rather stick with region equal to page (regular or huge page,
> >> defined by application). The page size can be extracted from
> >> vm_area_struct in XDP_UMEM_REG (preferred) or configured by
> >> application.
> >>
> >
> > Ok, thinking more about it I prefer this as well. This means that we
> > only need to grow the UMEM fring/cring descriptors to u64, and not
> > care about length. As you state below, this makes the validation
> > simple.
> >
> > We might consider exposing a "page size hint", that the user can set.
> > For 4G hugepage scenario, it might make sense to have a chunk *less*
> > than 4G to avoid HW Rx memory running low when the end of a chunk is
> > approaching.
> >
> > As for THP, I need to think about proper behavior here.
> >
> >>> On the Rx ring the descriptor, as you wrote, must be changed as well
> >>> to your suggestion above. Note, that headroom is still needed, since
> >>> XDP can change the size of a packet, so the fixed headroom stated in
> >>> UMEM registration is not sufficient.
> >>>
> >>> This model is obviously more flexible, but then also a bit more
> >>> complex. E.g. a fixed-frame NIC (like ixgbe), what should the
> >>> behaviour be? Should the fill ring entry be used only for *one* frame,
> >>> or chopped up to multiple receive frames? Should it be configurable?
> >>> Driver specific?
> >>
> >> I think driver-specific makes most sense here. In case of fixed-frame
> >> NIC the driver shall chop the ring entry into multiple receive frames.
> >>
> >
> > Let's start there, keeping the configuration space small.
> >
> >>>
> >>> Also, validating the entries in the fill queue require more work
> >>> (compared to the current index variant). Currently, we only skip
> >>> invalid indicies. What should we do when say, you pass a memory window
> >>> that is too narrow (say 128B) but correct from a UMEM
> >>> perspective. Going this path, we need to add pretty hard constraints
> >>> so we don't end up it too complex code -- because then it'll be too
> >>> slow.
> >>
> >> If we stick with pages -- the only possible erroneous input will be
> >> 'page out of UMEM boundaries'. The validation will be essentially:
> >>
> >> if ((offset > umem->size) || (offset & (umem->page_size - 1))
> >>     fail
> >>
> >> The question is what shall be done if validation fails ? Would
> >> SEGFAULT be reasonable ? This is more or less equivalent to
> >> dereferencing invalid pointer.
> >>
> >
> > The current scheme is simply dropping that kernel skips the invalid
> > fill ring entry. SIGSEGV is an interesting idea!
> >
> >>>
> >>>> In current proposal you have a notion of 'frame ownership': 'owned by
> >>>> kernel' or 'owned by application'. The ownership is transferred by
> >>>> means of enqueueing frame index in UMEM 'fill' queue (from application
> >>>> to kernel) or in UMEM 'tx completion' queue (from kernel to
> >>>> application). If you decide to adopt 'page' approach this notion needs
> >>>> to be changed a bit. This is because in case of packet forwarding one
> >>>> and the same page can be used for RX (parts of it enqueued in HW 'free
> >>>> lists') and TX (forwarding of previously RXed packets).
> >>>>
> >>>> I propose to define 'ownership' as a right to manipulate the
> >>>> partitioning of the page into frames. Whenever application passes a
> >>>> page to the kernel via UMEM 'fill' queue -- the ownership is
> >>>> transferred to the kernel. The application can't allocate packets on
> >>>> this page until kernel is done with it, but it can change payload of
> >>>> RXed packets before forwarding them. The kernel can pass ownership
> >>>> back by means of 'end-of-page' in xdp_desc.flags.
> >>>>
> >>>
> >>> I like the end-of-page mechanism.
> >>>
> >>>> The pages are definitely supposed to be recycled sooner or later. Even
> >>>> if it's not part of kernel API and the housekeeping implementation
> >>>> resided completely in application I still would like to propose
> >>>> possible (hopefully, cost efficient) solution to that. The recycling
> >>>> could be achieved by keeping refcount on pages and recycling the page
> >>>> only when it's owned by application and refcount reaches 0.
> >>>>
> >>>> Whenever application transfers page ownership to the kernel the
> >>>> refcount shall be initialized to 0. With each incoming RX xdp_desc the
> >>>> corresponding page needs to be identified (xdp_desc.offset >>
> >>>> PAGE_SHIFT) and refcount incremented. When the packet gets freed the
> >>>> refcount shall be decremented. If packet is forwarded in TX xdp_desc
> >>>> -- the refcount gets decremented only on TX completion (again,
> >>>> tx_completion.desc >> PAGE_SHIFT). For packets originating from the
> >>>> application itself the payload buffers needs to be allocated from
> >>>> empty page owned by the application and refcount needs to be
> >>>> incremented as well.
> >>>>
> >>>
> >>> Strictly speaking, we're not enforcing correctness in the current
> >>> solution. If the userspace application passes index 1 mulitple times
> >>> to the fill ring, and at the same time send index 1, things will
> >>> break. So, with the existing solution the userspace application
> >>> *still* needs to track the frames. With this new model, the
> >>> tracking/refcounting will be more complex. That might be a concern.
> >>>
> >>> For the multi-pool NICs I think we can still just have one UMEM, and
> >>> let the driver decide where in which pool to place this certain chunk
> >>> of memory. Thoughts?
> >>
> >> Definitely agree with that. This is HW specific and exposing it to the
> >> application would only harm portability.
> >>
> >
> > Good stuff, we're on the same page then.
> >
> >>>
> >>> Now, how do we go forward? I think this is very important, and I will
> >>> hack a copy-mode version for this new API. I'm a bit worried that the
> >>> complexity/configuration space will grow and impact performance, but
> >>> let's see.
> >>>
> >>> To prove that the API works for the NICs you mentioned, we need an
> >>> actual zero-copy implementation for them. Do you think Linaro could
> >>> work on a zero-copy variant for any of the NICs above?
> >>>
> >>
> >> Linaro will definitely contribute zero-copy implementation for some
> >> ARM-based NICs with 'multi-pool' variation.
> >
> > Very nice!
> >
> >> Implementation of
> >> 'type-writer' variation is up to Chelsio/Netcope, we only try to come
> >> up with API which (most probably) will fit them as well.
> >>
> >
> > Let's hope we get an implementation from these vendors as well! :-)
> >
> >
> > Björn
> >
> >>>
> >>> Again thanks for bringing this up!
> >>> Björn
> >>>
> >>>
> >>>
> >>>> [1] https://en.wikipedia.org/wiki/Internet_Mix
> >>>>
> >>>> With best regards,
> >>>> Mykyta

^ permalink raw reply

* RE: [PATCH net] net: hns: Fix the process of adding broadcast addresses to tcam
From: Salil Mehta @ 2018-06-04 15:39 UTC (permalink / raw)
  To: David Miller
  Cc: Zhuangyuzeng (Yisen), lipeng (Y), mehta.salil@opnsrc.net,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org, Linuxarm,
	wangxi (M)
In-Reply-To: <20180604.113552.1136710379930167373.davem@davemloft.net>

Hi Dave,
Thanks for pointing out. Will fix them.

Best regards
Salil

> -----Original Message-----
> From: David Miller [mailto:davem@davemloft.net]
> Sent: Monday, June 04, 2018 4:36 PM
> To: Salil Mehta <salil.mehta@huawei.com>
> Cc: Zhuangyuzeng (Yisen) <yisen.zhuang@huawei.com>; lipeng (Y)
> <lipeng321@huawei.com>; mehta.salil@opnsrc.net; netdev@vger.kernel.org;
> linux-kernel@vger.kernel.org; Linuxarm <linuxarm@huawei.com>; wangxi
> (M) <wangxi11@huawei.com>
> Subject: Re: [PATCH net] net: hns: Fix the process of adding broadcast
> addresses to tcam
> 
> From: Salil Mehta <salil.mehta@huawei.com>
> Date: Fri, 1 Jun 2018 18:00:11 +0100
> 
> > +		hns_dsaf_setup_mc_mask(dsaf_dev, mac_entry->in_port_num,
> > +			mc_mask, mac_entry->addr);
> 
> Please indent things properly when arguments span multiple lines.
> 
> The second line here should be precisely after the opennning
> parenthesis
> of the previous line.
> 
> > +		hns_dsaf_setup_mc_mask(dsaf_dev, mac_entry->in_port_num,
> > +			mc_mask, mac_entry->addr);
> 
> Likewise.

^ permalink raw reply

* Re: [RFC PATCH net-next] net: mvpp2: mvpp2_percpu_read_relaxed() can be static
From: David Miller @ 2018-06-04 15:36 UTC (permalink / raw)
  To: fengguang.wu
  Cc: maxime.chevallier, kbuild-all, netdev, antoine.tenart,
	thomas.petazzoni, ymarkman, stefanc, linux-kernel
In-Reply-To: <20180601194613.GA44770@xian>

From: kbuild test robot <fengguang.wu@intel.com>
Date: Sat, 2 Jun 2018 03:46:13 +0800

> 
> Fixes: db9d7d36eecc ("net: mvpp2: Split the PPv2 driver to a dedicated directory")
> Signed-off-by: kbuild test robot <fengguang.wu@intel.com>

Applied to net-next, thank you.

^ permalink raw reply

* Re: [PATCH net] net: hns: Fix the process of adding broadcast addresses to tcam
From: David Miller @ 2018-06-04 15:35 UTC (permalink / raw)
  To: salil.mehta
  Cc: yisen.zhuang, lipeng321, mehta.salil, netdev, linux-kernel,
	linuxarm, wangxi11
In-Reply-To: <20180601170011.22700-1-salil.mehta@huawei.com>

From: Salil Mehta <salil.mehta@huawei.com>
Date: Fri, 1 Jun 2018 18:00:11 +0100

> +		hns_dsaf_setup_mc_mask(dsaf_dev, mac_entry->in_port_num,
> +			mc_mask, mac_entry->addr);

Please indent things properly when arguments span multiple lines.

The second line here should be precisely after the opennning parenthesis
of the previous line.

> +		hns_dsaf_setup_mc_mask(dsaf_dev, mac_entry->in_port_num,
> +			mc_mask, mac_entry->addr);

Likewise.

^ permalink raw reply

* [PATCH net-next] net: sched: return error code when tcf proto is not found
From: Vlad Buslov @ 2018-06-04 15:32 UTC (permalink / raw)
  To: netdev; +Cc: davem, jhs, xiyou.wangcong, jiri, dan.carpenter, Vlad Buslov

If requested tcf proto is not found, get and del filter netlink protocol
handlers output error message to extack, but do not return actual error
code. Add check to return ENOENT when result of tp find function is NULL
pointer.

Fixes: c431f89b18a2 ("net: sched: split tc_ctl_tfilter into three
handlers")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>

---
 net/sched/cls_api.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index c06585fb2dc6..cdc3c87c53e6 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -1274,7 +1274,7 @@ static int tc_del_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 			       prio, false);
 	if (!tp || IS_ERR(tp)) {
 		NL_SET_ERR_MSG(extack, "Filter with specified priority/protocol not found");
-		err = PTR_ERR(tp);
+		err = tp ? PTR_ERR(tp) : -ENOENT;
 		goto errout;
 	} else if (tca[TCA_KIND] && nla_strcmp(tca[TCA_KIND], tp->ops->kind)) {
 		NL_SET_ERR_MSG(extack, "Specified filter kind does not match existing one");
@@ -1374,7 +1374,7 @@ static int tc_get_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 			       prio, false);
 	if (!tp || IS_ERR(tp)) {
 		NL_SET_ERR_MSG(extack, "Filter with specified priority/protocol not found");
-		err = PTR_ERR(tp);
+		err = tp ? PTR_ERR(tp) : -ENOENT;
 		goto errout;
 	} else if (tca[TCA_KIND] && nla_strcmp(tca[TCA_KIND], tp->ops->kind)) {
 		NL_SET_ERR_MSG(extack, "Specified filter kind does not match existing one");
-- 
2.7.5

^ permalink raw reply related

* Re: [PATCH] dt-bindings: net: ravb: Add support for r8a77990 SoC
From: Rob Herring @ 2018-06-04 15:32 UTC (permalink / raw)
  To: Sergei Shtylyov, Shimoda, Yoshihiro
  Cc: Simon Horman, David Miller, netdev,
	open list:MEDIA DRIVERS FOR RENESAS - FCP, Mark Rutland,
	devicetree
In-Reply-To: <19e43b97-9235-a752-7cf9-56a5e3bb281a@cogentembedded.com>

On Tue, May 15, 2018 at 3:43 AM, Sergei Shtylyov
<sergei.shtylyov@cogentembedded.com> wrote:
> On 5/13/2018 10:58 AM, Simon Horman wrote:
>
>>>> Add documentation for r8a77990 compatible string to renesas ravb device
>>>> tree bindings documentation.
>>>>
>>>> Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
>>>
>>>
>>> I'm assuming this isn't targetted at one of my trees.  Just FYI.
>>
>>
>> Hi Dave,
>>
>> I think this is appropriate for net-next but if not I can take it.
>
>
>    There's also Rob, and IIRC he has taken alike patch recently...

Applied.

Rob

^ permalink raw reply

* Re: [PATCH net] net/packet: refine check for priv area size
From: David Miller @ 2018-06-04 15:32 UTC (permalink / raw)
  To: edumazet; +Cc: netdev, eric.dumazet
In-Reply-To: <20180601162302.32717-1-edumazet@google.com>

From: Eric Dumazet <edumazet@google.com>
Date: Fri,  1 Jun 2018 09:23:02 -0700

> syzbot was able to trick af_packet again [1]

:-(

> Various commits tried to address the problem in the past,
> but failed to take into account V3 header size.
> 
> [1]
 ...
> Fixes: 2b6867c2ce76 ("net/packet: fix overflow in check for priv area size")
> Fixes: dc808110bb62 ("packet: handle too big packets for PACKET_V3")
> Fixes: f6fb8f100b80 ("af-packet: TPACKET_V3 flexible buffer implementation.")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reported-by: syzbot <syzkaller@googlegroups.com>

Applied and queued up for -stable, thanks Eric.

^ permalink raw reply

* Re: [PATCH] net: aquantia: make function aq_fw2x_get_mac_permanent static
From: David Miller @ 2018-06-04 15:31 UTC (permalink / raw)
  To: colin.king; +Cc: igor.russkikh, netdev, kernel-janitors, linux-kernel
In-Reply-To: <20180601152834.17958-1-colin.king@canonical.com>

From: Colin King <colin.king@canonical.com>
Date: Fri,  1 Jun 2018 16:28:34 +0100

> From: Colin Ian King <colin.king@canonical.com>
> 
> The function aq_fw2x_get_mac_permanent is local to the source and does
> not need to be in global scope, so make it static.
> 
> Cleans up sparse warning:
> warning: symbol 'aq_fw2x_get_mac_permanent' was not declared. Should it
> be static?
> 
> Signed-off-by: Colin Ian King <colin.king@canonical.com>

Applied to net-next, thank you.

^ permalink raw reply

* Re: [PATCH net-next v5 00/11] Modify action API for implementing lockless actions
From: David Miller @ 2018-06-04 15:28 UTC (permalink / raw)
  To: vladbu
  Cc: netdev, jhs, xiyou.wangcong, jiri, pablo, kadlec, fw, ast, daniel,
	edumazet, keescook, marcelo.leitner, kliteyn
In-Reply-To: <1527866118-21312-1-git-send-email-vladbu@mellanox.com>

From: Vlad Buslov <vladbu@mellanox.com>
Date: Fri,  1 Jun 2018 18:15:07 +0300

> Currently, all netlink protocol handlers for updating rules, actions and
> qdiscs are protected with single global rtnl lock which removes any
> possibility for parallelism. This patch set is a first step to remove
> rtnl lock dependency from TC rules update path.

Hello Vlad.

I've read over this patch series a few times, and even though I can't
find any issues yet, I think this series is complex and invasive
enough that it should receive more time for review by other experts.

Therefore I'm deferring this to the next merge window in order to
accomodate that.

Thank you.

^ permalink raw reply

* [RFC PATCH 4/4] net: Add support for subordinate traffic classes to netdev_pick_tx
From: Alexander Duyck @ 2018-06-04 15:17 UTC (permalink / raw)
  To: netdev, intel-wired-lan
In-Reply-To: <20180604150847.74754.8585.stgit@ahduyck-green-test.jf.intel.com>

This change makes it so that we can support the concept of subordinate
device traffic classes to the core networking code. In doing this we can
start pulling out the driver specific bits needed to support selecting a
queue based on an upper device.

The solution at is currently stands is only partially implemented. I have
the start of some XPS bits in here, but I would still need to allow for
configuration of the XPS maps on the queues reserved for the subordinate
devices. For now I am using the reference to the sb_dev XPS map as just a
way to skip the lookup of the lower device XPS map for now as that would
result in the wrong queue being picked.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   15 ++-----
 drivers/net/macvlan.c                         |   10 +----
 include/linux/netdevice.h                     |    4 +-
 net/core/dev.c                                |   55 +++++++++++++++----------
 4 files changed, 42 insertions(+), 42 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 05d6d48..c42498d 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -8218,20 +8218,18 @@ static void ixgbe_atr(struct ixgbe_ring *ring,
 					      input, common, ring->queue_index);
 }
 
+#ifdef IXGBE_FCOE
 static u16 ixgbe_select_queue(struct net_device *dev, struct sk_buff *skb,
 			      void *accel_priv, select_queue_fallback_t fallback)
 {
-	struct ixgbe_fwd_adapter *fwd_adapter = accel_priv;
-#ifdef IXGBE_FCOE
 	struct ixgbe_adapter *adapter;
 	struct ixgbe_ring_feature *f;
-#endif
 	int txq;
 
-	if (fwd_adapter) {
+	if (accel_priv) {
 		u8 tc = netdev_get_num_tc(dev) ?
 			netdev_get_prio_tc_map(dev, skb->priority) : 0;
-		struct net_device *vdev = fwd_adapter->netdev;
+		struct net_device *vdev = accel_priv;
 
 		txq = vdev->tc_to_txq[tc].offset;
 		txq += reciprocal_scale(skb_get_hash(skb),
@@ -8240,7 +8238,6 @@ static u16 ixgbe_select_queue(struct net_device *dev, struct sk_buff *skb,
 		return txq;
 	}
 
-#ifdef IXGBE_FCOE
 
 	/*
 	 * only execute the code below if protocol is FCoE
@@ -8267,11 +8264,9 @@ static u16 ixgbe_select_queue(struct net_device *dev, struct sk_buff *skb,
 		txq -= f->indices;
 
 	return txq + f->offset;
-#else
-	return fallback(dev, skb);
-#endif
 }
 
+#endif
 static int ixgbe_xmit_xdp_ring(struct ixgbe_adapter *adapter,
 			       struct xdp_frame *xdpf)
 {
@@ -10071,7 +10066,6 @@ static void ixgbe_xdp_flush(struct net_device *dev)
 	.ndo_open		= ixgbe_open,
 	.ndo_stop		= ixgbe_close,
 	.ndo_start_xmit		= ixgbe_xmit_frame,
-	.ndo_select_queue	= ixgbe_select_queue,
 	.ndo_set_rx_mode	= ixgbe_set_rx_mode,
 	.ndo_validate_addr	= eth_validate_addr,
 	.ndo_set_mac_address	= ixgbe_set_mac,
@@ -10094,6 +10088,7 @@ static void ixgbe_xdp_flush(struct net_device *dev)
 	.ndo_poll_controller	= ixgbe_netpoll,
 #endif
 #ifdef IXGBE_FCOE
+	.ndo_select_queue	= ixgbe_select_queue,
 	.ndo_fcoe_ddp_setup = ixgbe_fcoe_ddp_get,
 	.ndo_fcoe_ddp_target = ixgbe_fcoe_ddp_target,
 	.ndo_fcoe_ddp_done = ixgbe_fcoe_ddp_put,
diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index adde8fc..401e1d1 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -514,7 +514,6 @@ static int macvlan_queue_xmit(struct sk_buff *skb, struct net_device *dev)
 	const struct macvlan_dev *vlan = netdev_priv(dev);
 	const struct macvlan_port *port = vlan->port;
 	const struct macvlan_dev *dest;
-	void *accel_priv = NULL;
 
 	if (vlan->mode == MACVLAN_MODE_BRIDGE) {
 		const struct ethhdr *eth = (void *)skb->data;
@@ -533,15 +532,10 @@ static int macvlan_queue_xmit(struct sk_buff *skb, struct net_device *dev)
 			return NET_XMIT_SUCCESS;
 		}
 	}
-
-	/* For packets that are non-multicast and not bridged we will pass
-	 * the necessary information so that the lowerdev can distinguish
-	 * the source of the packets via the accel_priv value.
-	 */
-	accel_priv = vlan->accel_priv;
 xmit_world:
 	skb->dev = vlan->lowerdev;
-	return dev_queue_xmit_accel(skb, accel_priv);
+	return dev_queue_xmit_accel(skb,
+				    netdev_get_sb_channel(dev) ? dev : NULL);
 }
 
 static inline netdev_tx_t macvlan_netpoll_send_skb(struct macvlan_dev *vlan, struct sk_buff *skb)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index bbc8045..df5a582 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2072,7 +2072,7 @@ static inline void netdev_for_each_tx_queue(struct net_device *dev,
 
 struct netdev_queue *netdev_pick_tx(struct net_device *dev,
 				    struct sk_buff *skb,
-				    void *accel_priv);
+				    struct net_device *sb_dev);
 
 /* returns the headroom that the master device needs to take in account
  * when forwarding to this dev
@@ -2523,7 +2523,7 @@ struct net_device *__dev_get_by_flags(struct net *net, unsigned short flags,
 void dev_disable_lro(struct net_device *dev);
 int dev_loopback_xmit(struct net *net, struct sock *sk, struct sk_buff *newskb);
 int dev_queue_xmit(struct sk_buff *skb);
-int dev_queue_xmit_accel(struct sk_buff *skb, void *accel_priv);
+int dev_queue_xmit_accel(struct sk_buff *skb, struct net_device *sb_dev);
 int dev_direct_xmit(struct sk_buff *skb, u16 queue_id);
 int register_netdevice(struct net_device *dev);
 void unregister_netdevice_queue(struct net_device *dev, struct list_head *head);
diff --git a/net/core/dev.c b/net/core/dev.c
index eba255d..507e606 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2704,24 +2704,26 @@ void netif_device_attach(struct net_device *dev)
  * Returns a Tx hash based on the given packet descriptor a Tx queues' number
  * to be used as a distribution range.
  */
-static u16 skb_tx_hash(const struct net_device *dev, struct sk_buff *skb)
+static u16 skb_tx_hash(const struct net_device *dev,
+		       const struct net_device *sb_dev,
+		       struct sk_buff *skb)
 {
 	u32 hash;
 	u16 qoffset = 0;
 	u16 qcount = dev->real_num_tx_queues;
 
+	if (dev->num_tc) {
+		u8 tc = netdev_get_prio_tc_map(dev, skb->priority);
+
+		qoffset = sb_dev->tc_to_txq[tc].offset;
+		qcount = sb_dev->tc_to_txq[tc].count;
+	}
+
 	if (skb_rx_queue_recorded(skb)) {
 		hash = skb_get_rx_queue(skb);
 		while (unlikely(hash >= qcount))
 			hash -= qcount;
-		return hash;
-	}
-
-	if (dev->num_tc) {
-		u8 tc = netdev_get_prio_tc_map(dev, skb->priority);
-
-		qoffset = dev->tc_to_txq[tc].offset;
-		qcount = dev->tc_to_txq[tc].count;
+		return hash + qoffset;
 	}
 
 	return (u16) reciprocal_scale(skb_get_hash(skb), qcount) + qoffset;
@@ -3483,7 +3485,9 @@ int dev_loopback_xmit(struct net *net, struct sock *sk, struct sk_buff *skb)
 }
 #endif /* CONFIG_NET_EGRESS */
 
-static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
+static inline int get_xps_queue(struct net_device *dev,
+				struct net_device *sb_dev,
+				struct sk_buff *skb)
 {
 #ifdef CONFIG_XPS
 	struct xps_dev_maps *dev_maps;
@@ -3491,7 +3495,7 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
 	int queue_index = -1;
 
 	rcu_read_lock();
-	dev_maps = rcu_dereference(dev->xps_maps);
+	dev_maps = rcu_dereference(sb_dev->xps_maps);
 	if (dev_maps) {
 		unsigned int tci = skb->sender_cpu - 1;
 
@@ -3519,17 +3523,18 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
 #endif
 }
 
-static u16 __netdev_pick_tx(struct net_device *dev, struct sk_buff *skb)
+static u16 ___netdev_pick_tx(struct net_device *dev, struct sk_buff *skb,
+			     struct net_device *sb_dev)
 {
 	struct sock *sk = skb->sk;
 	int queue_index = sk_tx_queue_get(sk);
 
 	if (queue_index < 0 || skb->ooo_okay ||
 	    queue_index >= dev->real_num_tx_queues) {
-		int new_index = get_xps_queue(dev, skb);
+		int new_index = get_xps_queue(dev, sb_dev, skb);
 
 		if (new_index < 0)
-			new_index = skb_tx_hash(dev, skb);
+			new_index = skb_tx_hash(dev, sb_dev, skb);
 
 		if (queue_index != new_index && sk &&
 		    sk_fullsock(sk) &&
@@ -3542,9 +3547,14 @@ static u16 __netdev_pick_tx(struct net_device *dev, struct sk_buff *skb)
 	return queue_index;
 }
 
+static u16 __netdev_pick_tx(struct net_device *dev,
+			    struct sk_buff *skb)
+{
+	return ___netdev_pick_tx(dev, skb, NULL);
+}
 struct netdev_queue *netdev_pick_tx(struct net_device *dev,
 				    struct sk_buff *skb,
-				    void *accel_priv)
+				    struct net_device *sb_dev)
 {
 	int queue_index = 0;
 
@@ -3559,10 +3569,11 @@ struct netdev_queue *netdev_pick_tx(struct net_device *dev,
 		const struct net_device_ops *ops = dev->netdev_ops;
 
 		if (ops->ndo_select_queue)
-			queue_index = ops->ndo_select_queue(dev, skb, accel_priv,
+			queue_index = ops->ndo_select_queue(dev, skb, sb_dev,
 							    __netdev_pick_tx);
 		else
-			queue_index = __netdev_pick_tx(dev, skb);
+			queue_index = ___netdev_pick_tx(dev, skb,
+							sb_dev ? : dev);
 
 		queue_index = netdev_cap_txqueue(dev, queue_index);
 	}
@@ -3574,7 +3585,7 @@ struct netdev_queue *netdev_pick_tx(struct net_device *dev,
 /**
  *	__dev_queue_xmit - transmit a buffer
  *	@skb: buffer to transmit
- *	@accel_priv: private data used for L2 forwarding offload
+ *	@sb_dev: suboordinate device used for L2 forwarding offload
  *
  *	Queue a buffer for transmission to a network device. The caller must
  *	have set the device and priority and built the buffer before calling
@@ -3597,7 +3608,7 @@ struct netdev_queue *netdev_pick_tx(struct net_device *dev,
  *      the BH enable code must have IRQs enabled so that it will not deadlock.
  *          --BLG
  */
-static int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv)
+static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
 {
 	struct net_device *dev = skb->dev;
 	struct netdev_queue *txq;
@@ -3636,7 +3647,7 @@ static int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv)
 	else
 		skb_dst_force(skb);
 
-	txq = netdev_pick_tx(dev, skb, accel_priv);
+	txq = netdev_pick_tx(dev, skb, sb_dev);
 	q = rcu_dereference_bh(txq->qdisc);
 
 	trace_net_dev_queue(skb);
@@ -3710,9 +3721,9 @@ int dev_queue_xmit(struct sk_buff *skb)
 }
 EXPORT_SYMBOL(dev_queue_xmit);
 
-int dev_queue_xmit_accel(struct sk_buff *skb, void *accel_priv)
+int dev_queue_xmit_accel(struct sk_buff *skb, struct net_device *sb_dev)
 {
-	return __dev_queue_xmit(skb, accel_priv);
+	return __dev_queue_xmit(skb, sb_dev);
 }
 EXPORT_SYMBOL(dev_queue_xmit_accel);
 

^ permalink raw reply related

* [RFC PATCH 3/4] ixgbe: Add code to populate and use macvlan tc to Tx queue map
From: Alexander Duyck @ 2018-06-04 15:17 UTC (permalink / raw)
  To: netdev, intel-wired-lan
In-Reply-To: <20180604150847.74754.8585.stgit@ahduyck-green-test.jf.intel.com>

This patch makes it so that we use the tc_to_txq mapping in the macvlan
device in order to select the Tx queue for outgoing packets.

The idea here is to try and move away from using ixgbe_select_queue and to
come up with a generic way to make this work for devices going forward. By
encoding this information in the netdev this can become something that can
be used generically as a solution for similar setups going forward.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   44 ++++++++++++++++++++++---
 1 file changed, 38 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index f460c16..05d6d48 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -5271,6 +5271,8 @@ static void ixgbe_clean_rx_ring(struct ixgbe_ring *rx_ring)
 static int ixgbe_fwd_ring_up(struct ixgbe_adapter *adapter,
 			     struct ixgbe_fwd_adapter *accel)
 {
+	u16 rss_i = adapter->ring_feature[RING_F_RSS].indices;
+	int num_tc = netdev_get_num_tc(adapter->netdev);
 	struct net_device *vdev = accel->netdev;
 	int i, baseq, err;
 
@@ -5282,6 +5284,11 @@ static int ixgbe_fwd_ring_up(struct ixgbe_adapter *adapter,
 	accel->rx_base_queue = baseq;
 	accel->tx_base_queue = baseq;
 
+	/* record configuration for macvlan interface in vdev */
+	for (i = 0; i < num_tc; i++)
+		netdev_bind_sb_channel_queue(adapter->netdev, vdev,
+					     i, rss_i, baseq + (rss_i * i));
+
 	for (i = 0; i < adapter->num_rx_queues_per_pool; i++)
 		adapter->rx_ring[baseq + i]->netdev = vdev;
 
@@ -5306,6 +5313,10 @@ static int ixgbe_fwd_ring_up(struct ixgbe_adapter *adapter,
 
 	netdev_err(vdev, "L2FW offload disabled due to L2 filter error\n");
 
+	/* unbind the queues and drop the subordinate channel config */
+	netdev_unbind_sb_channel(adapter->netdev, vdev);
+	netdev_set_sb_channel(vdev, 0);
+
 	clear_bit(accel->pool, adapter->fwd_bitmask);
 	kfree(accel);
 
@@ -8211,18 +8222,22 @@ static u16 ixgbe_select_queue(struct net_device *dev, struct sk_buff *skb,
 			      void *accel_priv, select_queue_fallback_t fallback)
 {
 	struct ixgbe_fwd_adapter *fwd_adapter = accel_priv;
-	struct ixgbe_adapter *adapter;
-	int txq;
 #ifdef IXGBE_FCOE
+	struct ixgbe_adapter *adapter;
 	struct ixgbe_ring_feature *f;
 #endif
+	int txq;
 
 	if (fwd_adapter) {
-		adapter = netdev_priv(dev);
-		txq = reciprocal_scale(skb_get_hash(skb),
-				       adapter->num_rx_queues_per_pool);
+		u8 tc = netdev_get_num_tc(dev) ?
+			netdev_get_prio_tc_map(dev, skb->priority) : 0;
+		struct net_device *vdev = fwd_adapter->netdev;
+
+		txq = vdev->tc_to_txq[tc].offset;
+		txq += reciprocal_scale(skb_get_hash(skb),
+					vdev->tc_to_txq[tc].count);
 
-		return txq + fwd_adapter->tx_base_queue;
+		return txq;
 	}
 
 #ifdef IXGBE_FCOE
@@ -8776,6 +8791,11 @@ static int ixgbe_reassign_macvlan_pool(struct net_device *vdev, void *data)
 	/* if we cannot find a free pool then disable the offload */
 	netdev_err(vdev, "L2FW offload disabled due to lack of queue resources\n");
 	macvlan_release_l2fw_offload(vdev);
+
+	/* unbind the queues and drop the subordinate channel config */
+	netdev_unbind_sb_channel(adapter->netdev, vdev);
+	netdev_set_sb_channel(vdev, 0);
+
 	kfree(accel);
 
 	return 0;
@@ -9780,6 +9800,13 @@ static void *ixgbe_fwd_add(struct net_device *pdev, struct net_device *vdev)
 	if (!macvlan_supports_dest_filter(vdev))
 		return ERR_PTR(-EMEDIUMTYPE);
 
+	/* We need to lock down the macvlan to be a single queue device so that
+	 * we can reuse the tc_to_txq field in the macvlan netdev to represent
+	 * the queue mapping to our netdev.
+	 */
+	if (netif_is_multiqueue(vdev))
+		return ERR_PTR(-ERANGE);
+
 	pool = find_first_zero_bit(adapter->fwd_bitmask, adapter->num_rx_pools);
 	if (pool == adapter->num_rx_pools) {
 		u16 used_pools = adapter->num_vfs + adapter->num_rx_pools;
@@ -9836,6 +9863,7 @@ static void *ixgbe_fwd_add(struct net_device *pdev, struct net_device *vdev)
 		return ERR_PTR(-ENOMEM);
 
 	set_bit(pool, adapter->fwd_bitmask);
+	netdev_set_sb_channel(vdev, pool);
 	accel->pool = pool;
 	accel->netdev = vdev;
 
@@ -9877,6 +9905,10 @@ static void ixgbe_fwd_del(struct net_device *pdev, void *priv)
 		ring->netdev = NULL;
 	}
 
+	/* unbind the queues and drop the subordinate channel config */
+	netdev_unbind_sb_channel(pdev, accel->netdev);
+	netdev_set_sb_channel(accel->netdev, 0);
+
 	clear_bit(accel->pool, adapter->fwd_bitmask);
 	kfree(accel);
 }

^ permalink raw reply related

* [RFC PATCH 2/4] net: Add support for subordinate device traffic classes
From: Alexander Duyck @ 2018-06-04 15:17 UTC (permalink / raw)
  To: netdev, intel-wired-lan
In-Reply-To: <20180604150847.74754.8585.stgit@ahduyck-green-test.jf.intel.com>

This patch is meant to provide the basic tools needed to allow us to create
subordinate device traffic classes. The general idea here is to allow
subdividing the queues of a device into queue groups accessible through an
upper device such as a macvlan.

The idea here is to enforce the idea that an upper device has to be a
single queue device, ideally with IFF_NO_QUQUE set. With that being the
case we can pretty much guarantee that the tc_to_txq mappings and XPS maps
for the upper device are unused. As such we could reuse those in order to
support subdividing the lower device and distributing those queues between
the subordinate devices.

In order to distinguish between a regular set of traffic classes and if a
device is carrying subordinate traffic classes I changed num_tc from a u8
to a s16 value and use the negative values to represent the suboordinate
pool values. So starting at -1 and running to -32768 we can encode those as
pool values, and the existing values of 0 to 15 can be maintained.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 include/linux/netdevice.h |   16 ++++++++
 net/core/dev.c            |   89 +++++++++++++++++++++++++++++++++++++++++++++
 net/core/net-sysfs.c      |   21 ++++++++++-
 3 files changed, 124 insertions(+), 2 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 03ed492..bbc8045 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -569,6 +569,9 @@ struct netdev_queue {
 	 * (/sys/class/net/DEV/Q/trans_timeout)
 	 */
 	unsigned long		trans_timeout;
+
+	/* Suboordinate device that the queue has been assigned to */
+	struct net_device	*sb_dev;
 /*
  * write-mostly part
  */
@@ -1960,7 +1963,7 @@ struct net_device {
 #ifdef CONFIG_DCB
 	const struct dcbnl_rtnl_ops *dcbnl_ops;
 #endif
-	u8			num_tc;
+	s16			num_tc;
 	struct netdev_tc_txq	tc_to_txq[TC_MAX_QUEUE];
 	u8			prio_tc_map[TC_BITMASK + 1];
 
@@ -2014,6 +2017,17 @@ int netdev_get_num_tc(struct net_device *dev)
 	return dev->num_tc;
 }
 
+void netdev_unbind_sb_channel(struct net_device *dev,
+			      struct net_device *sb_dev);
+int netdev_bind_sb_channel_queue(struct net_device *dev,
+				 struct net_device *sb_dev,
+				 u8 tc, u16 count, u16 offset);
+int netdev_set_sb_channel(struct net_device *dev, u16 channel);
+static inline int netdev_get_sb_channel(struct net_device *dev)
+{
+	return max_t(int, -dev->num_tc, 0);
+}
+
 static inline
 struct netdev_queue *netdev_get_tx_queue(const struct net_device *dev,
 					 unsigned int index)
diff --git a/net/core/dev.c b/net/core/dev.c
index 034c2a9..eba255d 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2068,11 +2068,13 @@ int netdev_txq_to_tc(struct net_device *dev, unsigned int txq)
 		struct netdev_tc_txq *tc = &dev->tc_to_txq[0];
 		int i;
 
+		/* walk through the TCs and see if it falls into any of them */
 		for (i = 0; i < TC_MAX_QUEUE; i++, tc++) {
 			if ((txq - tc->offset) < tc->count)
 				return i;
 		}
 
+		/* didn't find it, just return -1 to indicate no match */
 		return -1;
 	}
 
@@ -2215,7 +2217,14 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
 	bool active = false;
 
 	if (dev->num_tc) {
+		/* Do not allow XPS on subordinate device directly */
 		num_tc = dev->num_tc;
+		if (num_tc < 0)
+			return -EINVAL;
+
+		/* If queue belongs to subordinate dev use its map */
+		dev = netdev_get_tx_queue(dev, index)->sb_dev ? : dev;
+
 		tc = netdev_txq_to_tc(dev, index);
 		if (tc < 0)
 			return -EINVAL;
@@ -2366,11 +2375,25 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
 EXPORT_SYMBOL(netif_set_xps_queue);
 
 #endif
+static void netdev_unbind_all_sb_channels(struct net_device *dev)
+{
+	struct netdev_queue *txq = &dev->_tx[dev->num_tx_queues];
+
+	/* Unbind any subordinate channels */
+	while (txq-- != &dev->_tx[0]) {
+		if (txq->sb_dev)
+			netdev_unbind_sb_channel(dev, txq->sb_dev);
+	}
+}
+
 void netdev_reset_tc(struct net_device *dev)
 {
 #ifdef CONFIG_XPS
 	netif_reset_xps_queues_gt(dev, 0);
 #endif
+	netdev_unbind_all_sb_channels(dev);
+
+	/* Reset TC configuration of device */
 	dev->num_tc = 0;
 	memset(dev->tc_to_txq, 0, sizeof(dev->tc_to_txq));
 	memset(dev->prio_tc_map, 0, sizeof(dev->prio_tc_map));
@@ -2399,11 +2422,77 @@ int netdev_set_num_tc(struct net_device *dev, u8 num_tc)
 #ifdef CONFIG_XPS
 	netif_reset_xps_queues_gt(dev, 0);
 #endif
+	netdev_unbind_all_sb_channels(dev);
+
 	dev->num_tc = num_tc;
 	return 0;
 }
 EXPORT_SYMBOL(netdev_set_num_tc);
 
+void netdev_unbind_sb_channel(struct net_device *dev,
+			      struct net_device *sb_dev)
+{
+	struct netdev_queue *txq = &dev->_tx[dev->num_tx_queues];
+
+#ifdef CONFIG_XPS
+	netif_reset_xps_queues_gt(sb_dev, 0);
+#endif
+	memset(sb_dev->tc_to_txq, 0, sizeof(sb_dev->tc_to_txq));
+	memset(sb_dev->prio_tc_map, 0, sizeof(sb_dev->prio_tc_map));
+
+	while (txq-- != &dev->_tx[0]) {
+		if (txq->sb_dev == sb_dev)
+			txq->sb_dev = NULL;
+	}
+}
+EXPORT_SYMBOL(netdev_unbind_sb_channel);
+
+int netdev_bind_sb_channel_queue(struct net_device *dev,
+				 struct net_device *sb_dev,
+				 u8 tc, u16 count, u16 offset)
+{
+	/* Make certain the sb_dev and dev are already configured */
+	if (sb_dev->num_tc >= 0 || tc >= dev->num_tc)
+		return -EINVAL;
+
+	/* We cannot hand out queues we don't have */
+	if ((offset + count) > dev->real_num_tx_queues)
+		return -EINVAL;
+
+	/* Record the mapping */
+	sb_dev->tc_to_txq[tc].count = count;
+	sb_dev->tc_to_txq[tc].offset = offset;
+
+	/* Provide a way for Tx queue to find the tc_to_txq map or
+	 * XPS map for itself.
+	 */
+	while (count--)
+		netdev_get_tx_queue(dev, count + offset)->sb_dev = sb_dev;
+
+	return 0;
+}
+EXPORT_SYMBOL(netdev_bind_sb_channel_queue);
+
+int netdev_set_sb_channel(struct net_device *dev, u16 channel)
+{
+	/* Do not use a multiqueue device to represent a subordinate channel */
+	if (netif_is_multiqueue(dev))
+		return -ENODEV;
+
+	/* We allow channels 1 - 32767 to be used for subordinate channels.
+	 * Channel 0 is meant to be "native" mode and used only to represent
+	 * the main root device. We allow writing 0 to reset the device back
+	 * to normal mode after being used as a subordinate channel.
+	 */
+	if (channel > S16_MAX)
+		return -EINVAL;
+
+	dev->num_tc = -channel;
+
+	return 0;
+}
+EXPORT_SYMBOL(netdev_set_sb_channel);
+
 /*
  * Routine to help set real_num_tx_queues. To avoid skbs mapped to queues
  * greater than real_num_tx_queues stale skbs on the qdisc must be flushed.
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 335c6a4..bd067b1 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1054,11 +1054,23 @@ static ssize_t traffic_class_show(struct netdev_queue *queue,
 		return -ENOENT;
 
 	index = get_netdev_queue_index(queue);
+
+	/* If queue belongs to subordinate dev use its tc mapping */
+	dev = netdev_get_tx_queue(dev, index)->sb_dev ? : dev;
+
 	tc = netdev_txq_to_tc(dev, index);
 	if (tc < 0)
 		return -EINVAL;
 
-	return sprintf(buf, "%u\n", tc);
+	/* We can report the traffic class one of two ways:
+	 * Subordinate device traffic classes are reported with the traffic
+	 * class first, and then the subordinate class so for example TC0 on
+	 * subordinate device 2 will be reported as "0-2". If the queue
+	 * belongs to the root device it will be reported with just the
+	 * traffic class, so just "0" for TC 0 for example.
+	 */
+	return dev->num_tc < 0 ? sprintf(buf, "%u%d\n", tc, dev->num_tc) :
+				 sprintf(buf, "%u\n", tc);
 }
 
 #ifdef CONFIG_XPS
@@ -1225,7 +1237,14 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
 	index = get_netdev_queue_index(queue);
 
 	if (dev->num_tc) {
+		/* Do not allow XPS on subordinate device directly */
 		num_tc = dev->num_tc;
+		if (num_tc < 0)
+			return -EINVAL;
+
+		/* If queue belongs to subordinate dev use its map */
+		dev = netdev_get_tx_queue(dev, index)->sb_dev ? : dev;
+
 		tc = netdev_txq_to_tc(dev, index);
 		if (tc < 0)
 			return -EINVAL;

^ permalink raw reply related

* [RFC PATCH 1/4] net-sysfs: Drop support for XPS and traffic_class on single queue device
From: Alexander Duyck @ 2018-06-04 15:17 UTC (permalink / raw)
  To: netdev, intel-wired-lan
In-Reply-To: <20180604150847.74754.8585.stgit@ahduyck-green-test.jf.intel.com>

This patch makes it so that we do not report the traffic class or allow XPS
configuration on single queue devices. This is mostly to avoid unnecessary
complexity with changes I have planned that will allow us to reuse
the unused tc_to_txq and XPS configuration on a single queue device to
allow it to make use of a subset of queues on an underlying device.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 net/core/net-sysfs.c |   15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index bb7e80f..335c6a4 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1047,9 +1047,14 @@ static ssize_t traffic_class_show(struct netdev_queue *queue,
 				  char *buf)
 {
 	struct net_device *dev = queue->dev;
-	int index = get_netdev_queue_index(queue);
-	int tc = netdev_txq_to_tc(dev, index);
+	int index;
+	int tc;
 
+	if (!netif_is_multiqueue(dev))
+		return -ENOENT;
+
+	index = get_netdev_queue_index(queue);
+	tc = netdev_txq_to_tc(dev, index);
 	if (tc < 0)
 		return -EINVAL;
 
@@ -1214,6 +1219,9 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
 	cpumask_var_t mask;
 	unsigned long index;
 
+	if (!netif_is_multiqueue(dev))
+		return -ENOENT;
+
 	index = get_netdev_queue_index(queue);
 
 	if (dev->num_tc) {
@@ -1260,6 +1268,9 @@ static ssize_t xps_cpus_store(struct netdev_queue *queue,
 	cpumask_var_t mask;
 	int err;
 
+	if (!netif_is_multiqueue(dev))
+		return -ENOENT;
+
 	if (!capable(CAP_NET_ADMIN))
 		return -EPERM;
 

^ permalink raw reply related

* [RFC PATCH 0/4] Add support for DCB and XPS with MACVLAN offload
From: Alexander Duyck @ 2018-06-04 15:17 UTC (permalink / raw)
  To: netdev, intel-wired-lan

This patch set is meant to address a number of shortcomings in the current
implementation of MACVLAN offload. The main issue being the fact that the
current Tx queue implementation for ixgbe required the use of
ndo_select_queue which doesn't necessarily play well with DCB or XPS.

I started this patch set to address that and thought I would submit the
start of the series as an RFC just to make sure I am headed in the right
direction and to see if there is anything I am overlooking. With these
patches applied we now start seeing what I am calling "subordinate
channels" which show themselves via a "-x" suffix where the "-x" can be
anything from "-1" to "-32767". So for example traffic class 0 with a
subordinate class of 5 will be called out as "0-5" if you dump the traffic
class via the sysfs value for the queue.

The main pieces that still need to be handled are the updating of the
ndo_select_queue function and the fallback call to allow support for
passing the accel_priv which I am changing to a netdev. I figure those
patches will be mostly noise so I didn't see the need to include them in
this RFC.

---

Alexander Duyck (4):
      net-sysfs: Drop support for XPS and traffic_class on single queue device
      net: Add support for subordinate device traffic classes
      ixgbe: Add code to populate and use macvlan tc to Tx queue map
      net: Add support for subordinate traffic classes to netdev_pick_tx


 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   55 +++++++---
 drivers/net/macvlan.c                         |   10 --
 include/linux/netdevice.h                     |   20 +++
 net/core/dev.c                                |  144 +++++++++++++++++++++----
 net/core/net-sysfs.c                          |   36 ++++++
 5 files changed, 215 insertions(+), 50 deletions(-)

^ permalink raw reply

* Re: INFO: task hung in ip6gre_exit_batch_net
From: Dmitry Vyukov @ 2018-06-04 15:22 UTC (permalink / raw)
  To: syzbot
  Cc: Christian Brauner, David Miller, David Ahern, Florian Westphal,
	Jiri Benc, Kirill Tkhai, LKML, Xin Long, mschiffer, netdev,
	syzkaller-bugs, Vladislav Yasevich
In-Reply-To: <0000000000006e4595056dd23c16@google.com>

On Mon, Jun 4, 2018 at 5:03 PM, syzbot
<syzbot+bf78a74f82c1cf19069e@syzkaller.appspotmail.com> wrote:
> Hello,
>
> syzbot found the following crash on:
>
> HEAD commit:    bc2dbc5420e8 Merge branch 'akpm' (patches from Andrew)
> git tree:       upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=164e42b7800000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=982e2df1b9e60b02
> dashboard link: https://syzkaller.appspot.com/bug?extid=bf78a74f82c1cf19069e
> compiler:       gcc (GCC) 8.0.1 20180413 (experimental)
>
> Unfortunately, I don't have any reproducer for this crash yet.
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+bf78a74f82c1cf19069e@syzkaller.appspotmail.com

Another hang on rtnl lock:

#syz dup: INFO: task hung in netdev_run_todo

May be related to "unregister_netdevice: waiting for DEV to become free":
https://syzkaller.appspot.com/bug?id=1a97a5bd119fd97995f752819fd87840ab9479a9

Any other explanations for massive hangs on rtnl lock for minutes?


> INFO: task kworker/u4:1:22 blocked for more than 120 seconds.
>       Not tainted 4.17.0-rc6+ #68
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> kworker/u4:1    D13944    22      2 0x80000000
> Workqueue: netns cleanup_net
> Call Trace:
>  context_switch kernel/sched/core.c:2859 [inline]
>  __schedule+0x801/0x1e30 kernel/sched/core.c:3501
>  schedule+0xef/0x430 kernel/sched/core.c:3545
>  schedule_preempt_disabled+0x10/0x20 kernel/sched/core.c:3603
>  __mutex_lock_common kernel/locking/mutex.c:833 [inline]
>  __mutex_lock+0xe38/0x17f0 kernel/locking/mutex.c:893
>  mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:908
>  rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74
>  ip6gre_exit_batch_net+0xd5/0x7d0 net/ipv6/ip6_gre.c:1585
>  ops_exit_list.isra.7+0x105/0x160 net/core/net_namespace.c:155
>  cleanup_net+0x51d/0xb20 net/core/net_namespace.c:523
>  process_one_work+0xc1e/0x1b50 kernel/workqueue.c:2145
>  worker_thread+0x1cc/0x1440 kernel/workqueue.c:2279
>  kthread+0x345/0x410 kernel/kthread.c:240
>  ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412
>
> Showing all locks held in the system:
> 4 locks held by kworker/u4:1/22:
>  #0: 0000000049a7b590 ((wq_completion)"%s""netns"){+.+.}, at:
> __write_once_size include/linux/compiler.h:215 [inline]
>  #0: 0000000049a7b590 ((wq_completion)"%s""netns"){+.+.}, at:
> arch_atomic64_set arch/x86/include/asm/atomic64_64.h:34 [inline]
>  #0: 0000000049a7b590 ((wq_completion)"%s""netns"){+.+.}, at: atomic64_set
> include/asm-generic/atomic-instrumented.h:40 [inline]
>  #0: 0000000049a7b590 ((wq_completion)"%s""netns"){+.+.}, at:
> atomic_long_set include/asm-generic/atomic-long.h:57 [inline]
>  #0: 0000000049a7b590 ((wq_completion)"%s""netns"){+.+.}, at: set_work_data
> kernel/workqueue.c:617 [inline]
>  #0: 0000000049a7b590 ((wq_completion)"%s""netns"){+.+.}, at:
> set_work_pool_and_clear_pending kernel/workqueue.c:644 [inline]
>  #0: 0000000049a7b590 ((wq_completion)"%s""netns"){+.+.}, at:
> process_one_work+0xaef/0x1b50 kernel/workqueue.c:2116
>  #1: 0000000030a00b6b (net_cleanup_work){+.+.}, at:
> process_one_work+0xb46/0x1b50 kernel/workqueue.c:2120
>  #2: 000000007eb35e65 (pernet_ops_rwsem){++++}, at: cleanup_net+0x11a/0xb20
> net/core/net_namespace.c:490
>  #3: 000000007eb32c75 (rtnl_mutex){+.+.}, at: rtnl_lock+0x17/0x20
> net/core/rtnetlink.c:74
> 3 locks held by kworker/1:1/24:
>  #0: 000000001c9e6580 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:
> __write_once_size include/linux/compiler.h:215 [inline]
>  #0: 000000001c9e6580 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:
> arch_atomic64_set arch/x86/include/asm/atomic64_64.h:34 [inline]
>  #0: 000000001c9e6580 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:
> atomic64_set include/asm-generic/atomic-instrumented.h:40 [inline]
>  #0: 000000001c9e6580 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:
> atomic_long_set include/asm-generic/atomic-long.h:57 [inline]
>  #0: 000000001c9e6580 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:
> set_work_data kernel/workqueue.c:617 [inline]
>  #0: 000000001c9e6580 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:
> set_work_pool_and_clear_pending kernel/workqueue.c:644 [inline]
>  #0: 000000001c9e6580 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:
> process_one_work+0xaef/0x1b50 kernel/workqueue.c:2116
>  #1: 000000009edcfbe7 ((addr_chk_work).work){+.+.}, at:
> process_one_work+0xb46/0x1b50 kernel/workqueue.c:2120
>  #2: 000000007eb32c75 (rtnl_mutex){+.+.}, at: rtnl_lock+0x17/0x20
> net/core/rtnetlink.c:74
> 2 locks held by khungtaskd/893:
>  #0: 000000007eeb621a (rcu_read_lock){....}, at:
> check_hung_uninterruptible_tasks kernel/hung_task.c:175 [inline]
>  #0: 000000007eeb621a (rcu_read_lock){....}, at: watchdog+0x1ff/0xf60
> kernel/hung_task.c:249
>  #1: 00000000239f1b5e (tasklist_lock){.+.+}, at:
> debug_show_all_locks+0xde/0x34a kernel/locking/lockdep.c:4470
> 2 locks held by getty/4481:
>  #0: 00000000cc114025 (&tty->ldisc_sem){++++}, at: ldsem_down_read+0x37/0x40
> drivers/tty/tty_ldsem.c:365
>  #1: 000000006ad1f3fc (&ldata->atomic_read_lock){+.+.}, at:
> n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
> 2 locks held by getty/4482:
>  #0: 00000000226a16cc (&tty->ldisc_sem){++++}, at: ldsem_down_read+0x37/0x40
> drivers/tty/tty_ldsem.c:365
>  #1: 000000008cee8cdc (&ldata->atomic_read_lock){+.+.}, at:
> n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
> 2 locks held by getty/4483:
>  #0: 0000000067bd3c39 (&tty->ldisc_sem){++++}, at: ldsem_down_read+0x37/0x40
> drivers/tty/tty_ldsem.c:365
>  #1: 000000005d8bc81d (&ldata->atomic_read_lock){+.+.}, at:
> n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
> 2 locks held by getty/4484:
>  #0: 00000000f0f8d839 (&tty->ldisc_sem){++++}, at: ldsem_down_read+0x37/0x40
> drivers/tty/tty_ldsem.c:365
>  #1: 00000000a9d5f091 (&ldata->atomic_read_lock){+.+.}, at:
> n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
> 2 locks held by getty/4485:
>  #0: 000000002c96ee9a (&tty->ldisc_sem){++++}, at: ldsem_down_read+0x37/0x40
> drivers/tty/tty_ldsem.c:365
>  #1: 0000000033338ac7 (&ldata->atomic_read_lock){+.+.}, at:
> n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
> 2 locks held by getty/4486:
>  #0: 00000000f6db39b5 (&tty->ldisc_sem){++++}, at: ldsem_down_read+0x37/0x40
> drivers/tty/tty_ldsem.c:365
>  #1: 00000000bb7c6099 (&ldata->atomic_read_lock){+.+.}, at:
> n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
> 2 locks held by getty/4487:
>  #0: 000000006be9659d (&tty->ldisc_sem){++++}, at: ldsem_down_read+0x37/0x40
> drivers/tty/tty_ldsem.c:365
>  #1: 00000000e2edd3d0 (&ldata->atomic_read_lock){+.+.}, at:
> n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
> 3 locks held by kworker/1:3/27147:
>  #0: 00000000c208aa7f ((wq_completion)"events"){+.+.}, at: __write_once_size
> include/linux/compiler.h:215 [inline]
>  #0: 00000000c208aa7f ((wq_completion)"events"){+.+.}, at: arch_atomic64_set
> arch/x86/include/asm/atomic64_64.h:34 [inline]
>  #0: 00000000c208aa7f ((wq_completion)"events"){+.+.}, at: atomic64_set
> include/asm-generic/atomic-instrumented.h:40 [inline]
>  #0: 00000000c208aa7f ((wq_completion)"events"){+.+.}, at: atomic_long_set
> include/asm-generic/atomic-long.h:57 [inline]
>  #0: 00000000c208aa7f ((wq_completion)"events"){+.+.}, at: set_work_data
> kernel/workqueue.c:617 [inline]
>  #0: 00000000c208aa7f ((wq_completion)"events"){+.+.}, at:
> set_work_pool_and_clear_pending kernel/workqueue.c:644 [inline]
>  #0: 00000000c208aa7f ((wq_completion)"events"){+.+.}, at:
> process_one_work+0xaef/0x1b50 kernel/workqueue.c:2116
>  #1: 00000000fa87e61f (deferred_process_work){+.+.}, at:
> process_one_work+0xb46/0x1b50 kernel/workqueue.c:2120
>  #2: 000000007eb32c75 (rtnl_mutex){+.+.}, at: rtnl_lock+0x17/0x20
> net/core/rtnetlink.c:74
> 1 lock held by syz-executor7/29665:
>  #0: 000000006e20d618 (sk_lock-AF_INET6){+.+.}, at: lock_sock
> include/net/sock.h:1469 [inline]
>  #0: 000000006e20d618 (sk_lock-AF_INET6){+.+.}, at:
> tls_sw_sendmsg+0x1b9/0x12e0 net/tls/tls_sw.c:384
> 1 lock held by syz-executor7/29689:
>  #0: 000000007eb32c75 (rtnl_mutex){+.+.}, at: rtnl_lock+0x17/0x20
> net/core/rtnetlink.c:74
>
> =============================================
>
> NMI backtrace for cpu 0
> CPU: 0 PID: 893 Comm: khungtaskd Not tainted 4.17.0-rc6+ #68
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:77 [inline]
>  dump_stack+0x1b9/0x294 lib/dump_stack.c:113
>  nmi_cpu_backtrace.cold.4+0x19/0xce lib/nmi_backtrace.c:103
>  nmi_trigger_cpumask_backtrace+0x151/0x192 lib/nmi_backtrace.c:62
>  arch_trigger_cpumask_backtrace+0x14/0x20 arch/x86/kernel/apic/hw_nmi.c:38
>  trigger_all_cpu_backtrace include/linux/nmi.h:138 [inline]
>  check_hung_task kernel/hung_task.c:132 [inline]
>  check_hung_uninterruptible_tasks kernel/hung_task.c:190 [inline]
>  watchdog+0xc10/0xf60 kernel/hung_task.c:249
>  kthread+0x345/0x410 kernel/kthread.c:240
>  ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412
> Sending NMI from CPU 0 to CPUs 1:
> NMI backtrace for cpu 1 skipped: idling at native_safe_halt+0x6/0x10
> arch/x86/include/asm/irqflags.h:54
>
>
> ---
> This bug is generated by a bot. It may contain errors.
> See https://goo.gl/tpsmEJ for more information about syzbot.
> syzbot engineers can be reached at syzkaller@googlegroups.com.
>
> syzbot will keep track of this bug report. See:
> https://goo.gl/tpsmEJ#bug-status-tracking for how to communicate with
> syzbot.
>
> --
> You received this message because you are subscribed to the Google Groups
> "syzkaller-bugs" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to syzkaller-bugs+unsubscribe@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/syzkaller-bugs/0000000000006e4595056dd23c16%40google.com.
> For more options, visit https://groups.google.com/d/optout.

^ permalink raw reply

* [bpf-next PATCH] bpf: sockmap, fix crash when ipv6 sock is added
From: John Fastabend @ 2018-06-04 15:21 UTC (permalink / raw)
  To: edumazet, ast, daniel; +Cc: netdev

This fixes a crash where we assign tcp_prot to IPv6 sockets instead
of tcpv6_prot.

Previously we overwrote the sk->prot field with tcp_prot even in the
AF_INET6 case. This patch ensures the correct tcp_prot and tcpv6_prot
are used. Further, only allow ESTABLISHED connections to join the
map per note in TLS ULP,

   /* The TLS ulp is currently supported only for TCP sockets
    * in ESTABLISHED state.
    * Supporting sockets in LISTEN state will require us
    * to modify the accept implementation to clone rather then
    * share the ulp context.
    */

Also tested with 'netserver -6' and 'netperf -H [IPv6]' as well as
'netperf -H [IPv4]'. The ESTABLISHED check resolves the previously
crashing case here.

Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support")
Reported-by: syzbot+5c063698bdbfac19f363@syzkaller.appspotmail.com
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Wei Wang <weiwan@google.com>
---
 kernel/bpf/sockmap.c |   23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 52a91d8..9e80118 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -41,6 +41,7 @@
 #include <linux/mm.h>
 #include <net/strparser.h>
 #include <net/tcp.h>
+#include <net/transp_v6.h>
 #include <linux/ptr_ring.h>
 #include <net/inet_common.h>
 #include <linux/sched/signal.h>
@@ -162,6 +163,8 @@ static bool bpf_tcp_stream_read(const struct sock *sk)
 }
 
 static struct proto tcp_bpf_proto;
+static struct proto tcpv6_bpf_proto;
+
 static int bpf_tcp_init(struct sock *sk)
 {
 	struct smap_psock *psock;
@@ -182,13 +185,21 @@ static int bpf_tcp_init(struct sock *sk)
 	psock->sk_proto = sk->sk_prot;
 
 	if (psock->bpf_tx_msg) {
+		tcpv6_bpf_proto.sendmsg = bpf_tcp_sendmsg;
+		tcpv6_bpf_proto.sendpage = bpf_tcp_sendpage;
+		tcpv6_bpf_proto.recvmsg = bpf_tcp_recvmsg;
+		tcpv6_bpf_proto.stream_memory_read = bpf_tcp_stream_read;
 		tcp_bpf_proto.sendmsg = bpf_tcp_sendmsg;
 		tcp_bpf_proto.sendpage = bpf_tcp_sendpage;
 		tcp_bpf_proto.recvmsg = bpf_tcp_recvmsg;
 		tcp_bpf_proto.stream_memory_read = bpf_tcp_stream_read;
 	}
 
-	sk->sk_prot = &tcp_bpf_proto;
+	if (sk->sk_family == AF_INET6)
+		sk->sk_prot = &tcpv6_bpf_proto;
+	else
+		sk->sk_prot = &tcp_bpf_proto;
+
 	rcu_read_unlock();
 	return 0;
 }
@@ -1113,6 +1124,8 @@ static int bpf_tcp_ulp_register(void)
 {
 	tcp_bpf_proto = tcp_prot;
 	tcp_bpf_proto.close = bpf_tcp_close;
+	tcpv6_bpf_proto = tcpv6_prot;
+	tcpv6_bpf_proto.close = bpf_tcp_close;
 	/* Once BPF TX ULP is registered it is never unregistered. It
 	 * will be in the ULP list for the lifetime of the system. Doing
 	 * duplicate registers is not a problem.
@@ -1716,6 +1729,14 @@ static int __sock_map_ctx_update_elem(struct bpf_map *map,
 	bool new = false;
 	int err = 0;
 
+	/* ULPs are currently supported only for TCP sockets in ESTABLISHED
+	 * state. Supporting sockets in LISTEN state will require us to
+	 * modify the accept implementation to clone rather then share the
+	 * ulp context.
+	 */
+	if (sock->sk_state != TCP_ESTABLISHED)
+		return -ENOTSUPP;
+
 	/* 1. If sock map has BPF programs those will be inherited by the
 	 * sock being added. If the sock is already attached to BPF programs
 	 * this results in an error.

^ permalink raw reply related

* [bug] cxgb4: vrf stopped working with cxgb4 card
From: AMG Zollner Robert @ 2018-06-04 15:03 UTC (permalink / raw)
  To: ganeshgr; +Cc: netdev, dsa

I have noticed that vrf is not working with kernel v4.15.0 but was 
working with v4.13.0 when using cxgb4 Chelsio driver (T520-cr)

Setup:
Two metal servers with a T520-cr card each, directly connected without a 
switch in between.

        SVR1  only ipfwd                 SVR2     with vrf
.----------------------------. .----------------------------------.
|                            |         |             |
|    192.168.8.1 [  ens2f4]--|---------|--[ens1f4] 192.168.8.2   |
|    192.168.9.1 [ens2f4d1]--|---------|--<ens1f4d1> 192.168.9.2 VRF=10   |
`----------------------------' `----------------------------------'

When vrf is not working there are no error messages (dmesg or iproute 
commands), tcpdump on the interface (SVR2.ens1f4d1) enslaved in vrf 10 
shows packets(arp req/reply) coming in and going out, but outgoing 
packets(arp reply) do not reach the other server SVR1.ens2f4d1


Bisect:
Found this commit to be the problem after doing a git bisect between 
v4.13..v4.15:

commit ba581f77df23c8ee70b372966e69cf10bc5453d8
Author: Ganesh Goudar <ganeshgr@chelsio.com>
Date:   Sat Sep 23 16:07:28 2017 +0530

     cxgb4: do DCB state reset in couple of places

     reset the driver's DCB state in couple of places
     where it was missing.


A bisect step was considered good when:
- successful ping from SVR1 to SVR2.ens1f4d1 vrf interface
- successful ping from SVR2 global to SVR2 vrf interface trough SVR1(l3 
forwarding) (this check was redundant,both tests fail or pass simultaneous)

The problem is still present on recent kernels also, checked v4.16.0 and 
v4.17.rc7

Disabling DCB for the card support fixes the problem ( Compiling kernel 
with "CONFIG_CHELSIO_T4_DCB=n")



This is my first time reporting a bug to the linux kernel and hope I 
have included the right amount of information. Please let me know if I 
have missed something.



Thank you,
Zollner Robert


--------
Logs:

VRF configured using folowing commands:

#!/bin/sh

CHDEV=ens1f4
VRF=vrf-recv

sysctl -w net.ipv4.tcp_l3mdev_accept=1
sysctl -w net.ipv4.udp_l3mdev_accept=1
sysctl -w net.ipv4.conf.all.accept_local=1

ifconfig ${CHDEV}   192.168.8.2/24
ifconfig ${CHDEV}d1 192.168.9.2/24

ip link add ${VRF} type vrf table 10
ip link set dev ${VRF} up

ip rule add pref 32765 table local
ip rule del pref 0

ip route add table 10 unreachable default metric 4278198272

ip link set dev ${CHDEV}d1 master ${VRF}

ip route add table 10 default via 192.168.9.1
ip route add 192.168.9.0/24 via 192.168.8.1

^ permalink raw reply

* Re: [RFC PATCH 1/2] net: macb: Add CAP to disable hardware TX checksum offloading
From: Nicolas Ferre @ 2018-06-04 15:13 UTC (permalink / raw)
  To: Jennifer Dahm, netdev, David S . Miller; +Cc: Nathan Sullivan
In-Reply-To: <1527284654-24835-2-git-send-email-jennifer.dahm@ni.com>

On 25/05/2018 at 23:44, Jennifer Dahm wrote:
> Certain PHYs have significant bugs in their TX checksum offloading
> that cannot be solved in software. In order to accommodate these PHYS,
> add a CAP to disable this hardware.
> 
> Signed-off-by: Jennifer Dahm <jennifer.dahm@ni.com>
> ---
>   drivers/net/ethernet/cadence/macb.h      | 1 +
>   drivers/net/ethernet/cadence/macb_main.c | 8 ++++++--
>   2 files changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/cadence/macb.h b/drivers/net/ethernet/cadence/macb.h
> index 8665982..6b85e97 100644
> --- a/drivers/net/ethernet/cadence/macb.h
> +++ b/drivers/net/ethernet/cadence/macb.h
> @@ -635,6 +635,7 @@
>   #define MACB_CAPS_USRIO_DISABLED		0x00000010
>   #define MACB_CAPS_JUMBO				0x00000020
>   #define MACB_CAPS_GEM_HAS_PTP			0x00000040
> +#define MACB_CAPS_DISABLE_TX_HW_CSUM		0x00000080

Nitpicking: "DISABLED" tends to be at the end of the name...

>   #define MACB_CAPS_FIFO_MODE			0x10000000
>   #define MACB_CAPS_GIGABIT_MODE_AVAILABLE	0x20000000
>   #define MACB_CAPS_SG_DISABLED			0x40000000
> diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
> index 3e93df5..a5d564b 100644
> --- a/drivers/net/ethernet/cadence/macb_main.c
> +++ b/drivers/net/ethernet/cadence/macb_main.c
> @@ -3360,8 +3360,12 @@ static int macb_init(struct platform_device *pdev)
>   		dev->hw_features |= MACB_NETIF_LSO;
>   
>   	/* Checksum offload is only available on gem with packet buffer */
> -	if (macb_is_gem(bp) && !(bp->caps & MACB_CAPS_FIFO_MODE))
> -		dev->hw_features |= NETIF_F_HW_CSUM | NETIF_F_RXCSUM;
> +	if (macb_is_gem(bp) && !(bp->caps & MACB_CAPS_FIFO_MODE)) {
> +		if (!(bp->caps & MACB_CAPS_DISABLE_TX_HW_CSUM))

Why not the other way around? negating a "disabled" feature is always 
challenge ;-)

> +			dev->hw_features |= NETIF_F_HW_CSUM | NETIF_F_RXCSUM;
> +		else
> +			dev->hw_features |= NETIF_F_RXCSUM;
> +	}
>   	if (bp->caps & MACB_CAPS_SG_DISABLED)
>   		dev->hw_features &= ~NETIF_F_SG;
>   	dev->features = dev->hw_features;
> 


-- 
Nicolas Ferre

^ permalink raw reply

* Re: [PATCH net-next v2 1/3] bpf: implement bpf_get_current_cgroup_id() helper
From: Yonghong Song @ 2018-06-04 15:11 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann; +Cc: ast, netdev, kernel-team
In-Reply-To: <20180604145801.7z25aqm4piw2fft7@ast-mbp>



On 6/4/18 7:58 AM, Alexei Starovoitov wrote:
> On Mon, Jun 04, 2018 at 11:08:35AM +0200, Daniel Borkmann wrote:
>> On 06/04/2018 12:59 AM, Yonghong Song wrote:
>>> bpf has been used extensively for tracing. For example, bcc
>>> contains an almost full set of bpf-based tools to trace kernel
>>> and user functions/events. Most tracing tools are currently
>>> either filtered based on pid or system-wide.
>>>
>>> Containers have been used quite extensively in industry and
>>> cgroup is often used together to provide resource isolation
>>> and protection. Several processes may run inside the same
>>> container. It is often desirable to get container-level tracing
>>> results as well, e.g. syscall count, function count, I/O
>>> activity, etc.
>>>
>>> This patch implements a new helper, bpf_get_current_cgroup_id(),
>>> which will return cgroup id based on the cgroup within which
>>> the current task is running.
>>>
>>> The later patch will provide an example to show that
>>> userspace can get the same cgroup id so it could
>>> configure a filter or policy in the bpf program based on
>>> task cgroup id.
>>>
>>> The helper is currently implemented for tracing. It can
>>> be added to other program types as well when needed.
>>>
>>> Acked-by: Alexei Starovoitov <ast@kernel.org>
>>> Signed-off-by: Yonghong Song <yhs@fb.com>
>>> ---
>>>   include/linux/bpf.h      |  1 +
>>>   include/uapi/linux/bpf.h |  8 +++++++-
>>>   kernel/bpf/core.c        |  1 +
>>>   kernel/bpf/helpers.c     | 15 +++++++++++++++
>>>   kernel/trace/bpf_trace.c |  2 ++
>>>   5 files changed, 26 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>>> index bbe2974..995c3b1 100644
>>> --- a/include/linux/bpf.h
>>> +++ b/include/linux/bpf.h
>>> @@ -746,6 +746,7 @@ extern const struct bpf_func_proto bpf_get_stackid_proto;
>>>   extern const struct bpf_func_proto bpf_get_stack_proto;
>>>   extern const struct bpf_func_proto bpf_sock_map_update_proto;
>>>   extern const struct bpf_func_proto bpf_sock_hash_update_proto;
>>> +extern const struct bpf_func_proto bpf_get_current_cgroup_id_proto;
>>>   
>>>   /* Shared helpers among cBPF and eBPF. */
>>>   void bpf_user_rnd_init_once(void);
>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>> index f0b6608..18712b0 100644
>>> --- a/include/uapi/linux/bpf.h
>>> +++ b/include/uapi/linux/bpf.h
>>> @@ -2070,6 +2070,11 @@ union bpf_attr {
>>>    * 		**CONFIG_SOCK_CGROUP_DATA** configuration option.
>>>    * 	Return
>>>    * 		The id is returned or 0 in case the id could not be retrieved.
>>> + *
>>> + * u64 bpf_get_current_cgroup_id(void)
>>> + * 	Return
>>> + * 		A 64-bit integer containing the current cgroup id based
>>> + * 		on the cgroup within which the current task is running.
>>>    */
>>>   #define __BPF_FUNC_MAPPER(FN)		\
>>>   	FN(unspec),			\
>>> @@ -2151,7 +2156,8 @@ union bpf_attr {
>>>   	FN(lwt_seg6_action),		\
>>>   	FN(rc_repeat),			\
>>>   	FN(rc_keydown),			\
>>> -	FN(skb_cgroup_id),
>>> +	FN(skb_cgroup_id),		\
>>> +	FN(get_current_cgroup_id),
>>>   
>>>   /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>>>    * function eBPF program intends to call
>>> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
>>> index 527587d..9f14937 100644
>>> --- a/kernel/bpf/core.c
>>> +++ b/kernel/bpf/core.c
>>> @@ -1765,6 +1765,7 @@ const struct bpf_func_proto bpf_get_current_uid_gid_proto __weak;
>>>   const struct bpf_func_proto bpf_get_current_comm_proto __weak;
>>>   const struct bpf_func_proto bpf_sock_map_update_proto __weak;
>>>   const struct bpf_func_proto bpf_sock_hash_update_proto __weak;
>>> +const struct bpf_func_proto bpf_get_current_cgroup_id_proto __weak;
>>>   
>>>   const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void)
>>>   {
>>> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
>>> index 3d24e23..73065e2 100644
>>> --- a/kernel/bpf/helpers.c
>>> +++ b/kernel/bpf/helpers.c
>>> @@ -179,3 +179,18 @@ const struct bpf_func_proto bpf_get_current_comm_proto = {
>>>   	.arg1_type	= ARG_PTR_TO_UNINIT_MEM,
>>>   	.arg2_type	= ARG_CONST_SIZE,
>>>   };
>>> +
>>> +#ifdef CONFIG_CGROUPS
>>> +BPF_CALL_0(bpf_get_current_cgroup_id)
>>> +{
>>> +	struct cgroup *cgrp = task_dfl_cgroup(current);
>>> +
>>> +	return cgrp->kn->id.id;
>>> +}
>>> +
>>> +const struct bpf_func_proto bpf_get_current_cgroup_id_proto = {
>>> +	.func		= bpf_get_current_cgroup_id,
>>> +	.gpl_only	= false,
>>> +	.ret_type	= RET_INTEGER,
>>> +};
>>> +#endif
>>
>> Nit: why not moving this function directly to bpf_trace.c?
> 
> my preference would be to keep it in helpers.c as-is.
> imo bpf_trace.c is only for things that depend on kernel/trace/
> 
>>> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
>>> index 752992c..e2ab5b7 100644
>>> --- a/kernel/trace/bpf_trace.c
>>> +++ b/kernel/trace/bpf_trace.c
>>> @@ -564,6 +564,8 @@ tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>>>   		return &bpf_get_prandom_u32_proto;
>>>   	case BPF_FUNC_probe_read_str:
>>>   		return &bpf_probe_read_str_proto;
>>> +	case BPF_FUNC_get_current_cgroup_id:
>>> +		return &bpf_get_current_cgroup_id_proto;
>>
>> When you have !CONFIG_CGROUPS, then it relies on the weak definition of the
>> bpf_get_current_cgroup_id_proto, which I would think at latest in fixup_bpf_calls()
>> bails out with 'kernel subsystem misconfigured func' due to func being NULL.
>>
>> Can't we just do the #ifdef CONFIG_CGROUPS around BPF_FUNC_get_current_cgroup_id
>> case instead? Then we bail out normally with 'unknown func' when cgroups are
>> not configured?

Thanks. Will send a followup patch soon.

> 
> good idea.
> 

^ permalink raw reply

* Re: [RFC PATCH 2/2] net: macb: Disable TX checksum offloading on all Zynq
From: Nicolas Ferre @ 2018-06-04 15:06 UTC (permalink / raw)
  To: Jennifer Dahm, netdev, David S . Miller, Michal Simek,
	Harini Katakam
  Cc: Nathan Sullivan
In-Reply-To: <1527284654-24835-3-git-send-email-jennifer.dahm@ni.com>

On 25/05/2018 at 23:44, Jennifer Dahm wrote:
> The Zynq ethernet hardware has checksum offloading bugs that cause
> small UDP packets (<= 2 bytes) to be sent with an incorrect checksum
> (0xffff) and forwarded UDP packets to be re-checksummed, which is
> illegal behavior. The best solution we have right now is to disable
> hardware TX checksum offloading entirely.
> 
> Signed-off-by: Jennifer Dahm <jennifer.dahm@ni.com>

Adding some xilinx people I know...

> ---
>   drivers/net/ethernet/cadence/macb_main.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
> index a5d564b..e8cc68a 100644
> --- a/drivers/net/ethernet/cadence/macb_main.c
> +++ b/drivers/net/ethernet/cadence/macb_main.c
> @@ -3807,7 +3807,8 @@ static const struct macb_config zynqmp_config = {
>   };
>   
>   static const struct macb_config zynq_config = {
> -	.caps = MACB_CAPS_GIGABIT_MODE_AVAILABLE | MACB_CAPS_NO_GIGABIT_HALF,
> +	.caps = MACB_CAPS_GIGABIT_MODE_AVAILABLE | MACB_CAPS_NO_GIGABIT_HALF
> +	      | MACB_CAPS_DISABLE_TX_HW_CSUM,
>   	.dma_burst_length = 16,
>   	.clk_init = macb_clk_init,
>   	.init = macb_init,
> 


-- 
Nicolas Ferre

^ permalink raw reply

* Re: [RFC PATCH 0/2] net: macb: Disable TX checksum offloading on all Zynq
From: Nicolas Ferre @ 2018-06-04 15:05 UTC (permalink / raw)
  To: Jennifer Dahm, netdev, David S . Miller
  Cc: Nathan Sullivan, Rafal Ozieblo, Claudiu Beznea, Harini Katakam
In-Reply-To: <1527284654-24835-1-git-send-email-jennifer.dahm@ni.com>

Jennifer,

On 25/05/2018 at 23:44, Jennifer Dahm wrote:
> During testing, I discovered that the Zynq GEM hardware overwrites all
> outgoing UDP packet checksums, which is illegal in packet forwarding
> cases. This happens both with and without the checksum-zeroing
> behavior  introduced  in  007e4ba3ee137f4700f39aa6dbaf01a71047c5f6
> ("net: macb: initialize checksum when using checksum offloading"). The
> only solution to both the small packet bug and the packet forwarding
> bug that I can find is to disable TX checksum offloading entirely.

Are the bugs listed above present in all revisions of the GEM IP, only 
for some revisions?
Is there an errata that describe this issue for the Zynq GEM?


> There's still the possibility that these bugs are actually with the
> driver software and not with the hardware. I've found several places
> where the checksum is set to 0xFFFF (the incorrect checksum found in
> small packets) when something goes wrong, and I can imagine a buggy
> driver writing over the checksum blindly when TX checksum offloading
> is enabled.
> 
> I would like feedback on two things:
> 1. Is it possible that the two bugs described above are caused by the
>     driver and not by the hardware? If so, where should I look to
>     implicate the driver?

Rafal,
Do you want to comment on this part?

> 2. Is this a problem we care enough about to completely disable TX
>     checksum offloading?

I would prefer to keep it enabled if possible. I mean I'm not against 
such a workaround if the HW is not able to handle such cases but 
offloading a CPU is always good when we have "low end" processors like 
ARM9 at 200MHz like the older AT91 that use this driver...

> Here is the testing procedure I used to reproduce these bugs on my
> machine. Specifically, without this patchset, step 9 fails. Without
> 007e4ba3ee, step 8 also fails.
> 
> 1. Set up the test environment:
>    a. Acquire a Zynq device with two ethernet ports. This is the DUT.
>    b. Acquire a USB-Ethernet adapter.
>    c. Acquire two ethernet cables.
>    d. Connect one Ethernet port on the DUT to your computer's network
>       switch.
>    e. Connect the other Ethernet port to the USB-Ethernet adapter and
>       plug that adapter into your computer.
>    f. Set up a Linux VM to send packets through the DUT. I recommend
>       using a VM here so that you can easily detach it from the primary
>       network to force outgoing traffic through the DUT.
>    g. Set up a computer with a packet inspecting program to receive and
>       inspect packets. This doesn't need to be a VM. For the purposes
>       of this test, I'll be using a Windows instance with WireShark.
> 2. Load the kernel you want to test onto the DUT, making sure to
>     include the `bridge` module.
> 3. Set up a bridge on the DUT. The following commands on the DUT
>     should work, replacing `eth0` and `eth1` with the two ethernet
>     interfaces on the DUT:
>     ```
>     brctl addbr test
>     brctl addif test eth0 eth1
>     ifconfig eth0 0.0.0.0
>     ifconfig eth1 0.0.0.0
>     dhclient test -v
>     ```
> 4. Disconnect the Linux VM from your host computer's network and
>     connect it to the USB-Ethernet adapter in order to force outgoing
>     network traffic through the DUT. If necessary, run dhclient on the
>     Linux VM to acquire an IP address.
> 5. Ensure that you can reach your Windows instance from your Linux VM
>     through the DUT (e.g. ping).
> 6. Start WireShark on your Windows instance and start monitoring
>     traffic on a specific, unused port (e.g. 61557).
> 7. Using netcat, send a few not-tiny UDP packets from your Linux VM to
>     your Windows instance to ensure that valid UDP packets are properly
>     forwarded. Ex:
>     ```
>     echo "hello world" | netcat -u <WindowsIP> 61557
>     ```
>     Inspect these packets to ensure that the data arrived intact and
>     that the checksum looks reasonable (i.e. not 0x0000 or 0xFFFF).
> 8. Using netcat, send a few tiny UDP packets (2 bytes or fewer) from
>     Linux VM to your Windows instance to ensure that the checksum is
>     reasonable. Ex:
>     ```
>     echo "h" | netcat -u <WindowsIP> 61557
>     ```
> 9. Using a custom program, send UDP packets with broken checksums
>     (e.g. 0xABCD) from your Linux VM to your Windows instance. Inspect
>     these packets with WireShark and make sure that the packet arrived
>     with the same checksum you sent it with.
> 
> For step 9, I wrote a C program using the Linux socket API that will
> send a properly formatted UDP packet with the payload "Hello!" and a
> (broken) checksum of 0xABCD to port 61557 on the host provided at the
> command line. I can send the full program if you would like, but here
> is the important part of it:
> ```
> struct custom_udp {
> 	int16_t s_port;
> 	int16_t d_port;
> 	int16_t length;
> 	int16_t check;
> 	char data[];
> };
> 
> int send_message(int sockfd, in_port_t port, const char *message) {
> 	struct custom_udp *frame;
> 	int16_t message_len;
> 	int16_t frame_len;
> 	int ret;
> 
> 	message_len = strlen(message) * sizeof(char);
> 	frame_len = sizeof(struct custom_udp) + message_len;
> 	frame = malloc(frame_len);
> 	frame->s_port = htons(0);
> 	frame->d_port = htons(port);
> 	frame->length = htons(frame_len);
> 	frame->check = htons(0xABCD);
> 	memmove(frame->data, message, message_len);
> 
> 	ret = write(sockfd, frame, frame_len);
> 	free(frame);
> 
> 	return ret;
> }

Thanks a lot for the detailed testing procedure. It will definitively 
help us reproduce the bug.

As a conclusion, I would say that we can use this patch as a last resort 
if we are sure that:
1/ the driver doesn't do weird things with TX CSUM in the code
2/ the hw is unable to handle these particular cases.


Best regards,
   Nicolas


> ```
> 
> Jennifer Dahm (1):
>    net/macb: Disable TX checksum offloading on all Zynq-7000
> 
>   drivers/net/ethernet/cadence/macb.h      |  1 +
>   drivers/net/ethernet/cadence/macb_main.c | 11 ++++++++---
>   2 files changed, 9 insertions(+), 3 deletions(-)
> 


-- 
Nicolas Ferre

^ permalink raw reply

* Re: [PATCH net] ipv4: igmp: hold wakelock to prevent delayed reports
From: David Miller @ 2018-06-04 15:06 UTC (permalink / raw)
  To: tejaswit; +Cc: netdev
In-Reply-To: <20180601140532.GA13253@tejaswit-linux.qualcomm.com>

From: Tejaswi Tanikella <tejaswit@codeaurora.org>
Date: Fri, 1 Jun 2018 19:35:41 +0530

> On receiving a IGMPv2/v3 query, based on max_delay set in the header a
> timer is started to send out a response after a random time within
> max_delay. If the system then moves into suspend state, Report is
> delayed until system wakes up.
> 
> In one reported scenario, on arm64 devices, max_delay was set to 10s,
> Reports were consistantly delayed if the timer is scheduled after 5 plus
> seconds.
> 
> Hold wakelock while starting the timer to prevent moving into suspend
> state.
> 
> Signed-off-by: Tejaswi Tanikella <tejaswit@codeaurora.org>

As Florian stated, this won't be the only networking facility to hit
a problem like this.  So, if we go down this route, we probably want
to generically solve this.

But I have a deeper concern.

Do we really want every timer based querying mechanism to prevent a
system from being able to suspend?

We get to the point where external entities can generate traffic which
can prevent a remote system from entering suspend state.

^ permalink raw reply

* INFO: task hung in ip6gre_exit_batch_net
From: syzbot @ 2018-06-04 15:03 UTC (permalink / raw)
  To: christian.brauner, davem, dsahern, fw, jbenc, ktkhai,
	linux-kernel, lucien.xin, mschiffer, netdev, syzkaller-bugs,
	vyasevich

Hello,

syzbot found the following crash on:

HEAD commit:    bc2dbc5420e8 Merge branch 'akpm' (patches from Andrew)
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=164e42b7800000
kernel config:  https://syzkaller.appspot.com/x/.config?x=982e2df1b9e60b02
dashboard link: https://syzkaller.appspot.com/bug?extid=bf78a74f82c1cf19069e
compiler:       gcc (GCC) 8.0.1 20180413 (experimental)

Unfortunately, I don't have any reproducer for this crash yet.

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+bf78a74f82c1cf19069e@syzkaller.appspotmail.com

INFO: task kworker/u4:1:22 blocked for more than 120 seconds.
       Not tainted 4.17.0-rc6+ #68
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/u4:1    D13944    22      2 0x80000000
Workqueue: netns cleanup_net
Call Trace:
  context_switch kernel/sched/core.c:2859 [inline]
  __schedule+0x801/0x1e30 kernel/sched/core.c:3501
  schedule+0xef/0x430 kernel/sched/core.c:3545
  schedule_preempt_disabled+0x10/0x20 kernel/sched/core.c:3603
  __mutex_lock_common kernel/locking/mutex.c:833 [inline]
  __mutex_lock+0xe38/0x17f0 kernel/locking/mutex.c:893
  mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:908
  rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74
  ip6gre_exit_batch_net+0xd5/0x7d0 net/ipv6/ip6_gre.c:1585
  ops_exit_list.isra.7+0x105/0x160 net/core/net_namespace.c:155
  cleanup_net+0x51d/0xb20 net/core/net_namespace.c:523
  process_one_work+0xc1e/0x1b50 kernel/workqueue.c:2145
  worker_thread+0x1cc/0x1440 kernel/workqueue.c:2279
  kthread+0x345/0x410 kernel/kthread.c:240
  ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412

Showing all locks held in the system:
4 locks held by kworker/u4:1/22:
  #0: 0000000049a7b590 ((wq_completion)"%s""netns"){+.+.}, at:  
__write_once_size include/linux/compiler.h:215 [inline]
  #0: 0000000049a7b590 ((wq_completion)"%s""netns"){+.+.}, at:  
arch_atomic64_set arch/x86/include/asm/atomic64_64.h:34 [inline]
  #0: 0000000049a7b590 ((wq_completion)"%s""netns"){+.+.}, at: atomic64_set  
include/asm-generic/atomic-instrumented.h:40 [inline]
  #0: 0000000049a7b590 ((wq_completion)"%s""netns"){+.+.}, at:  
atomic_long_set include/asm-generic/atomic-long.h:57 [inline]
  #0: 0000000049a7b590 ((wq_completion)"%s""netns"){+.+.}, at: set_work_data  
kernel/workqueue.c:617 [inline]
  #0: 0000000049a7b590 ((wq_completion)"%s""netns"){+.+.}, at:  
set_work_pool_and_clear_pending kernel/workqueue.c:644 [inline]
  #0: 0000000049a7b590 ((wq_completion)"%s""netns"){+.+.}, at:  
process_one_work+0xaef/0x1b50 kernel/workqueue.c:2116
  #1: 0000000030a00b6b (net_cleanup_work){+.+.}, at:  
process_one_work+0xb46/0x1b50 kernel/workqueue.c:2120
  #2: 000000007eb35e65 (pernet_ops_rwsem){++++}, at: cleanup_net+0x11a/0xb20  
net/core/net_namespace.c:490
  #3: 000000007eb32c75 (rtnl_mutex){+.+.}, at: rtnl_lock+0x17/0x20  
net/core/rtnetlink.c:74
3 locks held by kworker/1:1/24:
  #0: 000000001c9e6580 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:  
__write_once_size include/linux/compiler.h:215 [inline]
  #0: 000000001c9e6580 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:  
arch_atomic64_set arch/x86/include/asm/atomic64_64.h:34 [inline]
  #0: 000000001c9e6580 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:  
atomic64_set include/asm-generic/atomic-instrumented.h:40 [inline]
  #0: 000000001c9e6580 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:  
atomic_long_set include/asm-generic/atomic-long.h:57 [inline]
  #0: 000000001c9e6580 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:  
set_work_data kernel/workqueue.c:617 [inline]
  #0: 000000001c9e6580 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:  
set_work_pool_and_clear_pending kernel/workqueue.c:644 [inline]
  #0: 000000001c9e6580 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:  
process_one_work+0xaef/0x1b50 kernel/workqueue.c:2116
  #1: 000000009edcfbe7 ((addr_chk_work).work){+.+.}, at:  
process_one_work+0xb46/0x1b50 kernel/workqueue.c:2120
  #2: 000000007eb32c75 (rtnl_mutex){+.+.}, at: rtnl_lock+0x17/0x20  
net/core/rtnetlink.c:74
2 locks held by khungtaskd/893:
  #0: 000000007eeb621a (rcu_read_lock){....}, at:  
check_hung_uninterruptible_tasks kernel/hung_task.c:175 [inline]
  #0: 000000007eeb621a (rcu_read_lock){....}, at: watchdog+0x1ff/0xf60  
kernel/hung_task.c:249
  #1: 00000000239f1b5e (tasklist_lock){.+.+}, at:  
debug_show_all_locks+0xde/0x34a kernel/locking/lockdep.c:4470
2 locks held by getty/4481:
  #0: 00000000cc114025 (&tty->ldisc_sem){++++}, at:  
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
  #1: 000000006ad1f3fc (&ldata->atomic_read_lock){+.+.}, at:  
n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
2 locks held by getty/4482:
  #0: 00000000226a16cc (&tty->ldisc_sem){++++}, at:  
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
  #1: 000000008cee8cdc (&ldata->atomic_read_lock){+.+.}, at:  
n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
2 locks held by getty/4483:
  #0: 0000000067bd3c39 (&tty->ldisc_sem){++++}, at:  
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
  #1: 000000005d8bc81d (&ldata->atomic_read_lock){+.+.}, at:  
n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
2 locks held by getty/4484:
  #0: 00000000f0f8d839 (&tty->ldisc_sem){++++}, at:  
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
  #1: 00000000a9d5f091 (&ldata->atomic_read_lock){+.+.}, at:  
n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
2 locks held by getty/4485:
  #0: 000000002c96ee9a (&tty->ldisc_sem){++++}, at:  
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
  #1: 0000000033338ac7 (&ldata->atomic_read_lock){+.+.}, at:  
n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
2 locks held by getty/4486:
  #0: 00000000f6db39b5 (&tty->ldisc_sem){++++}, at:  
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
  #1: 00000000bb7c6099 (&ldata->atomic_read_lock){+.+.}, at:  
n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
2 locks held by getty/4487:
  #0: 000000006be9659d (&tty->ldisc_sem){++++}, at:  
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
  #1: 00000000e2edd3d0 (&ldata->atomic_read_lock){+.+.}, at:  
n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
3 locks held by kworker/1:3/27147:
  #0: 00000000c208aa7f ((wq_completion)"events"){+.+.}, at:  
__write_once_size include/linux/compiler.h:215 [inline]
  #0: 00000000c208aa7f ((wq_completion)"events"){+.+.}, at:  
arch_atomic64_set arch/x86/include/asm/atomic64_64.h:34 [inline]
  #0: 00000000c208aa7f ((wq_completion)"events"){+.+.}, at: atomic64_set  
include/asm-generic/atomic-instrumented.h:40 [inline]
  #0: 00000000c208aa7f ((wq_completion)"events"){+.+.}, at: atomic_long_set  
include/asm-generic/atomic-long.h:57 [inline]
  #0: 00000000c208aa7f ((wq_completion)"events"){+.+.}, at: set_work_data  
kernel/workqueue.c:617 [inline]
  #0: 00000000c208aa7f ((wq_completion)"events"){+.+.}, at:  
set_work_pool_and_clear_pending kernel/workqueue.c:644 [inline]
  #0: 00000000c208aa7f ((wq_completion)"events"){+.+.}, at:  
process_one_work+0xaef/0x1b50 kernel/workqueue.c:2116
  #1: 00000000fa87e61f (deferred_process_work){+.+.}, at:  
process_one_work+0xb46/0x1b50 kernel/workqueue.c:2120
  #2: 000000007eb32c75 (rtnl_mutex){+.+.}, at: rtnl_lock+0x17/0x20  
net/core/rtnetlink.c:74
1 lock held by syz-executor7/29665:
  #0: 000000006e20d618 (sk_lock-AF_INET6){+.+.}, at: lock_sock  
include/net/sock.h:1469 [inline]
  #0: 000000006e20d618 (sk_lock-AF_INET6){+.+.}, at:  
tls_sw_sendmsg+0x1b9/0x12e0 net/tls/tls_sw.c:384
1 lock held by syz-executor7/29689:
  #0: 000000007eb32c75 (rtnl_mutex){+.+.}, at: rtnl_lock+0x17/0x20  
net/core/rtnetlink.c:74

=============================================

NMI backtrace for cpu 0
CPU: 0 PID: 893 Comm: khungtaskd Not tainted 4.17.0-rc6+ #68
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x1b9/0x294 lib/dump_stack.c:113
  nmi_cpu_backtrace.cold.4+0x19/0xce lib/nmi_backtrace.c:103
  nmi_trigger_cpumask_backtrace+0x151/0x192 lib/nmi_backtrace.c:62
  arch_trigger_cpumask_backtrace+0x14/0x20 arch/x86/kernel/apic/hw_nmi.c:38
  trigger_all_cpu_backtrace include/linux/nmi.h:138 [inline]
  check_hung_task kernel/hung_task.c:132 [inline]
  check_hung_uninterruptible_tasks kernel/hung_task.c:190 [inline]
  watchdog+0xc10/0xf60 kernel/hung_task.c:249
  kthread+0x345/0x410 kernel/kthread.c:240
  ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412
Sending NMI from CPU 0 to CPUs 1:
NMI backtrace for cpu 1 skipped: idling at native_safe_halt+0x6/0x10  
arch/x86/include/asm/irqflags.h:54


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#bug-status-tracking for how to communicate with  
syzbot.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox