Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net 2/2] sfc: limit ARFS workitems in flight per channel
From: David Miller @ 2018-04-13 16:14 UTC (permalink / raw)
  To: ecree; +Cc: linux-net-drivers, netdev
In-Reply-To: <d56e5a40-48cc-d984-c5cd-e18fbc411ed3@solarflare.com>

From: Edward Cree <ecree@solarflare.com>
Date: Fri, 13 Apr 2018 16:59:07 +0100

> On 13/04/18 16:03, David Miller wrote:
>> Whilst you may not be able to program the filter into the hardware
>> synchronously, you should be able to allocate the ID and get all of
>> the software state setup.
> That's what we were doing before commit 3af0f34290f6 ("sfc: replace
>  asynchronous filter operations"), and (as mentioned in that commit)
>  that leads to (or at least the scheme we used had) race conditions
>  which I could not see a way to fix.  If the hardware resets (and
>  thus forgets all its filters) after we've done the software state
>  setup but before we reach the point of finalising the software state
>  after the hardware operation, we don't know what operations we need
>  to do to re-apply the software state to the hardware, because we
>  don't know whether the reset happened before or after the hardware
>  operation.

When an entry is successfully programmed into the chip, you update
the software state.

When the chip resets, you clear all of those state booleans to false.

Indeed, you would have to synchronize these things somehow.

Is the issue that you learn about the hardware reset asynchronously,
and therefore cannot determine if filter insertion programming
happened afterwards and thus is still in the chip?

You must have a table of all the entries, so that you can reprogram
the hardware should it reset.  Or do you not handle things that way
and it's a lossy system?

> Well, the alternative, even if the software state setup part _could_
>  be made synchronous, is to allow a potentially unbounded queue for
>  the hardware update part (I think there are even still cases in
>  which the exponential growth pathology is possible), causing the
>  filter insertions to be delayed an arbitrarily long time.  Either
>  the flow is still going by that time (in which case the backlog
>  limit approach will get a new ndo_rx_flow_steer request and insert
>  the filter too) or it isn't, in which case getting round to it
>  eventually is no better than dropping it immediately.  In fact it's
>  worse because now you waste time inserting a useless filter which
>  delays new requests even more.
> Besides, I'm fairly confident that the only cases in which you'll
>  even come close to hitting the limit are ones where ARFS wouldn't
>  do you much good anyway, such as:
> * Misconfigured interrupt affinities where ARFS is entirely pointless
> * Many short-lived flows (which greatly diminish the utility of ARFS)
> 
> So for multiple reasons, hitting the limit won't actually make
>  performance worse, although it will often be a sign that performance
>  will be bad for other reasons.

Understood, thanks for explaining.

Please respin your series with the updates you talked about and I'll
apply it.

But generally we do have this issue with various kinds of
configuration programming and async vs. sync.

^ permalink raw reply

* Re: [PATCH net 1/3] l2tp: hold reference on tunnels in netlink dumps
From: Guillaume Nault @ 2018-04-13 16:09 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, jchapman
In-Reply-To: <20180413.105703.352607094166315307.davem@davemloft.net>

On Fri, Apr 13, 2018 at 10:57:03AM -0400, David Miller wrote:
> From: Guillaume Nault <g.nault@alphalink.fr>
> Date: Thu, 12 Apr 2018 20:50:33 +0200
> 
> > l2tp_tunnel_find_nth() is unsafe: no reference is held on the returned
> > tunnel, therefore it can be freed whenever the caller uses it.
> > This patch defines l2tp_tunnel_get_nth() which works similarly, but
> > also takes a reference on the returned tunnel. The caller then has to
> > drop it after it stops using the tunnel.
> > 
> > Convert netlink dumps to make them safe against concurrent tunnel
> > deletion.
> > 
> > Fixes: 309795f4bec2 ("l2tp: Add netlink control API for L2TP")
> > Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
> 
> During the entire invocation of l2tp_nl_cmd_tunnel_dump(), the RTNL
> mutex is held.
> 
> Therefore no tunnel configuration changes may occur and the tunnel
> object will persist and is safe to access.
> 
Yes, but only for updates done with the genl API. For L2TPv2, the
tunnel can be created by connecting a PPPOL2TP and a UDP socket.
Closing these sockets destroys the tunnel without any RTNL
synchronisation.

> The netlink dump should be safe as-is.
> 
> Were you actually able to trigger a crash or KASAN warning or is
> this purely from code inspection?
> 
Yes, I have a KASAN use-after-free for this case. I remember I saw a
few complains about stack traces in commit messages, so I've stopped
putting them there. I can paste (stripped) traces again. Just let me
know if you have any preference.

Guillaume

^ permalink raw reply

* Re: [PATCH net 2/2] sfc: limit ARFS workitems in flight per channel
From: Edward Cree @ 2018-04-13 15:59 UTC (permalink / raw)
  To: David Miller; +Cc: linux-net-drivers, netdev
In-Reply-To: <20180413.110359.1956367740053085545.davem@davemloft.net>

On 13/04/18 16:03, David Miller wrote:
> Whilst you may not be able to program the filter into the hardware
> synchronously, you should be able to allocate the ID and get all of
> the software state setup.
That's what we were doing before commit 3af0f34290f6 ("sfc: replace
 asynchronous filter operations"), and (as mentioned in that commit)
 that leads to (or at least the scheme we used had) race conditions
 which I could not see a way to fix.  If the hardware resets (and
 thus forgets all its filters) after we've done the software state
 setup but before we reach the point of finalising the software state
 after the hardware operation, we don't know what operations we need
 to do to re-apply the software state to the hardware, because we
 don't know whether the reset happened before or after the hardware
 operation.
Actually, it's not the insertion itself that this can happen to - if
 the reset happens first then the insert will fail because other
 hardware state we reference isn't set up yet.  However, inserting
 one filter can necessitate removing some others (lower-priority
 multicast filters for the same flow); the code before 3af0f34290f6
 was actually capable of getting so confused that it could double-
 free a pointer.
This all gets sufficiently complicated that even if I can find a
 locking scheme that in theory gets it right, there's pretty much a
 100% chance that the actual implementation will contain bugs,
 probably very subtle ones that can't reliably be reproduced etc.
 All my instincts say to get away from that if at all possible.

> People really aren't going to be happy if their performance goes into
> the tank because their filter update rate, for whatever reason, hits
> this magic backlog limit.
Well, the alternative, even if the software state setup part _could_
 be made synchronous, is to allow a potentially unbounded queue for
 the hardware update part (I think there are even still cases in
 which the exponential growth pathology is possible), causing the
 filter insertions to be delayed an arbitrarily long time.  Either
 the flow is still going by that time (in which case the backlog
 limit approach will get a new ndo_rx_flow_steer request and insert
 the filter too) or it isn't, in which case getting round to it
 eventually is no better than dropping it immediately.  In fact it's
 worse because now you waste time inserting a useless filter which
 delays new requests even more.
Besides, I'm fairly confident that the only cases in which you'll
 even come close to hitting the limit are ones where ARFS wouldn't
 do you much good anyway, such as:
* Misconfigured interrupt affinities where ARFS is entirely pointless
* Many short-lived flows (which greatly diminish the utility of ARFS)

So for multiple reasons, hitting the limit won't actually make
 performance worse, although it will often be a sign that performance
 will be bad for other reasons.

-Ed

^ permalink raw reply

* Re: net: hang in unregister_netdevice: waiting for lo to become free
From: Dmitry Vyukov @ 2018-04-13 15:54 UTC (permalink / raw)
  To: Dan Streetman
  Cc: Tommi Rantala, Neil Horman, Xin Long, David Ahern,
	Daniel Borkmann, Cong Wang, David Miller, Eric Dumazet,
	Willem de Bruijn, Jakub Kicinski, Rasmus Villemoes, netdev, LKML,
	Alexey Kuznetsov, Hideaki YOSHIFUJI, syzkaller, Dan Streetman,
	Eric W. Biederman, Alexey Kodanev
In-Reply-To: <CALZtONBFdVsa_dqKTpzAp2wVPPfP4N=P7HT_e5e0vy7oAoxVew@mail.gmail.com>

On Fri, Apr 13, 2018 at 2:43 PM, Dan Streetman <ddstreet@ieee.org> wrote:
> On Thu, Apr 12, 2018 at 8:15 AM, Dmitry Vyukov <dvyukov@google.com> wrote:
>> On Wed, Feb 21, 2018 at 3:53 PM, Tommi Rantala
>> <tommi.t.rantala@nokia.com> wrote:
>>> On 20.02.2018 18:26, Neil Horman wrote:
>>>>
>>>> On Tue, Feb 20, 2018 at 09:14:41AM +0100, Dmitry Vyukov wrote:
>>>>>
>>>>> On Tue, Feb 20, 2018 at 8:56 AM, Tommi Rantala
>>>>> <tommi.t.rantala@nokia.com> wrote:
>>>>>>
>>>>>> On 19.02.2018 20:59, Dmitry Vyukov wrote:
>>>>>>>
>>>>>>> Is this meant to be fixed already? I am still seeing this on the
>>>>>>> latest upstream tree.
>>>>>>>
>>>>>>
>>>>>> These two commits are in v4.16-rc1:
>>>>>>
>>>>>> commit 4a31a6b19f9ddf498c81f5c9b089742b7472a6f8
>>>>>> Author: Tommi Rantala <tommi.t.rantala@nokia.com>
>>>>>> Date:   Mon Feb 5 21:48:14 2018 +0200
>>>>>>
>>>>>>      sctp: fix dst refcnt leak in sctp_v4_get_dst
>>>>>> ...
>>>>>>      Fixes: 410f03831 ("sctp: add routing output fallback")
>>>>>>      Fixes: 0ca50d12f ("sctp: fix src address selection if using
>>>>>> secondary
>>>>>> addresses")
>>>>>>
>>>>>>
>>>>>> commit 957d761cf91cdbb175ad7d8f5472336a4d54dbf2
>>>>>> Author: Alexey Kodanev <alexey.kodanev@oracle.com>
>>>>>> Date:   Mon Feb 5 15:10:35 2018 +0300
>>>>>>
>>>>>>      sctp: fix dst refcnt leak in sctp_v6_get_dst()
>>>>>> ...
>>>>>>      Fixes: dbc2b5e9a09e ("sctp: fix src address selection if using
>>>>>> secondary
>>>>>> addresses for ipv6")
>>>>>>
>>>>>>
>>>>>> I guess we missed something if it's still reproducible.
>>>>>>
>>>>>> I can check it later this week, unless someone else beat me to it.
>>>>>
>>>>>
>>>>> Hi Tommi,
>>>>>
>>>>> Hmmm, I can't claim that it's exactly the same bug. Perhaps it's
>>>>> another one then. But I am still seeing these:
>>>>>
>>>>> [   58.799130] unregister_netdevice: waiting for lo to become free.
>>>>> Usage count = 4
>>>>> [   60.847138] unregister_netdevice: waiting for lo to become free.
>>>>> Usage count = 4
>>>>> [   62.895093] unregister_netdevice: waiting for lo to become free.
>>>>> Usage count = 4
>>>>> [   64.943103] unregister_netdevice: waiting for lo to become free.
>>>>> Usage count = 4
>>>>>
>>>>> on upstream tree pulled ~12 hours ago.
>>>>>
>>>> Can you write a systemtap script to probe dev_hold, and dev_put, printing
>>>> out a
>>>> backtrace if the device name matches "lo".  That should tell us
>>>> definitively if
>>>> the problem is in the same location or not
>>>
>>>
>>> Hi Dmitry, I tested with the reproducer and the kernel .config file that you
>>> sent in the first email in this thread:
>>>
>>> With 4.16-rc2 unable to reproduce.
>>>
>>> With 4.15-rc9 bug reproducible, and I get "unregister_netdevice: waiting for
>>> lo to become free. Usage count = 3"
>>>
>>> With 4.15-rc9 and Alexey's "sctp: fix dst refcnt leak in sctp_v6_get_dst()"
>>> cherry-picked on top, unable to reproduce.
>>>
>>>
>>> Is syzkaller doing something else now to trigger the bug...?
>>> Can you still trigger the bug with the same reproducer?
>>
>> Hi Neil, Tommi,
>>
>> Reviving this old thread about "unregister_netdevice: waiting for lo
>> to become free. Usage count = 3" hangs.
>> I still did not have time to deep dive into what happens there (too
>> many bugs coming from syzbot). But this still actively happens and I
>> suspect accounts to a significant portion of various hang reports,
>> which are quite unpleasant.
>>
>> One idea that could make it all simpler:
>>
>> Is this wait loop in netdev_wait_allrefs() supposed to wait for any
>> prolonged periods of time under any non-buggy conditions? E.g. more
>> than 1-2 minutes?
>> If it only supposed to wait briefly for things that already supposed
>> to be shutting down, and we add a WARNING there after some timeout,
>> then syzbot will report all info how/when it happens, hopefully
>> extracting reproducers, and all the nice things.
>> But this WARNING should not have any false positives under any
>> realistic conditions (e.g. waiting for arrival of remote packets with
>> large timeouts).
>>
>> Looking at some task hung reports, it seems that this code holds some
>> mutexes, takes workqueue thread and prevents any progress with
>> destruction of other devices (and net namespace creation/destruction),
>> so I guess it should not wait for any indefinite periods of time?
>
> I'm working on this currently:
> https://bugs.launchpad.net/ubuntu/zesty/+source/linux/+bug/1711407
>
> I added a summary of what I've found to be the cause (or at least, one
> possible cause) of this:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407/comments/72
>
> I'm working on a patch to work around the main side-effect of this,
> which is hanging while holding the global net mutex.  Hangs will still
> happen (e.g. if a dst leaks) but should not affect anything else,
> other than a leak of the dst and its net namespace.
>
> Fixing the dst leaks is important too, of course, but a dst leak (or
> other cause) shouldn't break the entire system.

Leaking some memory is definitely better than hanging the system.

So I've made syzkaller to recognize "unregister_netdevice: waiting for
(.*) to become free" as a kernel bug:
https://github.com/google/syzkaller/commit/7a67784ca8bdc3b26cce2f0ec9a40d2dd9ec9396
Unfortunately it does not make it catch these bugs because creating a
net namespace per test is too damn slow, so namespaces are reused for
lots of tests and when/if it's eventually destroyed it's already too
late to find root cause.

But I've run a one-off experiment with prompt net namespace
destruction and syzkaller was able to easily extract a C reproducer:
https://gist.githubusercontent.com/dvyukov/d571e8fff24e127ca48a8c4790d42bfa/raw/52050e93ba9afbb5126b9d7bb39b7e71a82af016/gistfile1.txt

On upstream 16e205cf42da1f497b10a4a24f563e6c0d574eec with this config:
https://gist.githubusercontent.com/dvyukov/9663c57443adb21f2795b92ef0829d62/raw/bbea0652e23746096dd56855a28f6c681aebcdee/gistfile1.txt

this gives me:

[   83.183198] unregister_netdevice: waiting for lo to become free.
Usage count = 9
[   85.231202] unregister_netdevice: waiting for lo to become free.
Usage count = 9
...
[  523.511205] unregister_netdevice: waiting for lo to become free.
Usage count = 9
...

This is generated from this syzkaller program:

r0 = socket$inet6(0xa, 0x1, 0x84)
setsockopt$inet6_IPV6_XFRM_POLICY(r0, 0x29, 0x23,
&(0x7f0000000380)={{{@in6=@remote={0xfe, 0x80, [], 0xbb},
@in=@dev={0xac, 0x14, 0x14}, 0x0, 0x0, 0x0, 0x0, 0xa}, {}, {}, 0x0,
0x0, 0x1}, {{@in=@local={0xac, 0x14, 0x14, 0xaa}, 0x0, 0x32}, 0x0,
@in=@local={0xac, 0x14, 0x14, 0xaa}, 0x3504}}, 0xe8)
bind$inet6(r0, &(0x7f0000000000)={0xa, 0x4e20}, 0x1c)
connect$inet(r0, &(0x7f0000000040)={0x2, 0x4e20, @dev={0xac, 0x14,
0x14, 0xd}}, 0x10)
syz_emit_ethernet(0x3e, &(0x7f00000001c0)={@local={[0xaa, 0xaa, 0xaa,
0xaa, 0xaa], 0xaa}, @dev={[0xaa, 0xaa, 0xaa, 0xaa, 0xaa]}, [],
{@ipv6={0x86dd, {0x0, 0x6, "50a09c", 0x8, 0xffffff11, 0x0,
@remote={0xfe, 0x80, [], 0xbb}, @local={0xfe, 0x80, [], 0xaa}, {[],
@udp={0x0, 0x4e20, 0x8}}}}}}, &(0x7f0000000040))

So this seems to be related to IPv6 and/or xfrm and is potentially
caused by external packets (that syz_emit_ethernet call).

^ permalink raw reply

* RE: [PATCH v5 05/14] PCI: Add pcie_print_link_status() to log link speed and whether it's limited
From: Keller, Jacob E @ 2018-04-13 15:34 UTC (permalink / raw)
  To: Bjorn Helgaas, Jakub Kicinski
  Cc: Tal Gilboa, Tariq Toukan, Ariel Elior, Ganesh Goudar,
	Kirsher, Jeffrey T, everest-linux-l2@cavium.com,
	intel-wired-lan@lists.osuosl.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org
In-Reply-To: <20180413140633.GD46420@bhelgaas-glaptop.roam.corp.google.com>



> -----Original Message-----
> From: Bjorn Helgaas [mailto:helgaas@kernel.org]
> Sent: Friday, April 13, 2018 7:07 AM
> To: Jakub Kicinski <kubakici@wp.pl>
> Cc: Tal Gilboa <talgi@mellanox.com>; Tariq Toukan <tariqt@mellanox.com>;
> Keller, Jacob E <jacob.e.keller@intel.com>; Ariel Elior <ariel.elior@cavium.com>;
> Ganesh Goudar <ganeshgr@chelsio.com>; Kirsher, Jeffrey T
> <jeffrey.t.kirsher@intel.com>; everest-linux-l2@cavium.com; intel-wired-
> lan@lists.osuosl.org; netdev@vger.kernel.org; linux-kernel@vger.kernel.org;
> linux-pci@vger.kernel.org
> Subject: Re: [PATCH v5 05/14] PCI: Add pcie_print_link_status() to log link speed
> and whether it's limited
> 
> On Thu, Apr 12, 2018 at 09:32:49PM -0700, Jakub Kicinski wrote:
> > On Fri, 30 Mar 2018 16:05:18 -0500, Bjorn Helgaas wrote:
> > > +	if (bw_avail >= bw_cap)
> > > +		pci_info(dev, "%d Mb/s available bandwidth (%s x%d link)\n",
> > > +			 bw_cap, PCIE_SPEED2STR(speed_cap), width_cap);
> > > +	else
> > > +		pci_info(dev, "%d Mb/s available bandwidth, limited by %s x%d
> link at %s (capable of %d Mb/s with %s x%d link)\n",
> > > +			 bw_avail, PCIE_SPEED2STR(speed), width,
> > > +			 limiting_dev ? pci_name(limiting_dev) : "<unknown>",
> > > +			 bw_cap, PCIE_SPEED2STR(speed_cap), width_cap);
> >
> > I was just looking at using this new function to print PCIe BW for a
> > NIC, but I'm slightly worried that there is nothing in the message that
> > says PCIe...  For a NIC some people may interpret the bandwidth as NIC
> > bandwidth:
> >
> > [   39.839989] nfp 0000:04:00.0: Netronome Flow Processor NFP4000/NFP6000
> PCIe Card Probe
> > [   39.848943] nfp 0000:04:00.0: 63.008 Gb/s available bandwidth (8 GT/s x8 link)
> > [   39.857146] nfp 0000:04:00.0: RESERVED BARs: 0.0: General/MSI-X SRAM, 0.1:
> PCIe XPB/MSI-X PBA, 0.4: Explicit0, 0.5: Explicit1, fre4
> >
> > It's not a 63Gbps NIC...  I'm sorry if this was discussed before and I
> > didn't find it.  Would it make sense to add the "PCIe: " prefix to the
> > message like bnx2x used to do?  Like:
> >
> > nfp 0000:04:00.0: PCIe: 63.008 Gb/s available bandwidth (8 GT/s x8 link)
> 
> I agree, that does look potentially confusing.  How about this:
> 
>   nfp 0000:04:00.0: 63.008 Gb/s available PCIe bandwidth (8 GT/s x8 link)
> 
> I did have to look twice at this before I remembered that we're
> printing Gb/s (not GB/s).  Most of the references I found on the web
> use GB/s when talking about total PCIe bandwidth.
> 
> But either way I think it's definitely worth mentioning PCIe
> explicitly.

I also agree printing PCIe explicitly is good.

Thanks,
Jake

^ permalink raw reply

* Re: [PATCH net] virtio-net: add missing virtqueue kick when flushing packets
From: David Miller @ 2018-04-13 15:27 UTC (permalink / raw)
  To: jasowang; +Cc: mst, virtualization, netdev, linux-kernel, ktaka, daniel
In-Reply-To: <1523602705-5155-1-git-send-email-jasowang@redhat.com>

From: Jason Wang <jasowang@redhat.com>
Date: Fri, 13 Apr 2018 14:58:25 +0800

> We tends to batch submitting packets during XDP_TX. This requires to
> kick virtqueue after a batch, we tried to do it through
> xdp_do_flush_map() which only makes sense for devmap not XDP_TX. So
> explicitly kick the virtqueue in this case.
> 
> Reported-by: Kimitoshi Takahashi <ktaka@nii.ac.jp>
> Tested-by: Kimitoshi Takahashi <ktaka@nii.ac.jp>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Fixes: 186b3c998c50 ("virtio-net: support XDP_REDIRECT")
> Signed-off-by: Jason Wang <jasowang@redhat.com>

Applied and queued up for -stable, thanks Jason.

^ permalink raw reply

* Re: [RFC v2] virtio: support packed ring
From: Michael S. Tsirkin @ 2018-04-13 15:22 UTC (permalink / raw)
  To: Tiwei Bie; +Cc: jasowang, wexu, virtualization, linux-kernel, netdev, jfreimann
In-Reply-To: <20180401141216.8969-1-tiwei.bie@intel.com>

On Sun, Apr 01, 2018 at 10:12:16PM +0800, Tiwei Bie wrote:
> +static inline bool more_used(const struct vring_virtqueue *vq)
> +{
> +	return vq->packed ? more_used_packed(vq) : more_used_split(vq);
> +}
> +
> +void *virtqueue_get_buf_ctx_split(struct virtqueue *_vq, unsigned int *len,
> +				  void **ctx)
> +{
> +	struct vring_virtqueue *vq = to_vvq(_vq);
> +	void *ret;
> +	unsigned int i;
> +	u16 last_used;
> +
> +	START_USE(vq);
> +
> +	if (unlikely(vq->broken)) {
> +		END_USE(vq);
> +		return NULL;
> +	}
> +
> +	if (!more_used(vq)) {
> +		pr_debug("No more buffers in queue\n");
> +		END_USE(vq);
> +		return NULL;
> +	}

So virtqueue_get_buf_ctx_split should only call more_used_split.

to avoid such issues I think we should lay out the code like this:

XXX_split

XXX_packed

XXX wrappers

> +/* The standard layout

I'd drop standard here.

> for the packed ring is a continuous chunk of memory
> + * which looks like this.
> + *
> + * struct vring_packed
> + * {

Can the opening bracket go on the prev line pls?

> + *	// The actual descriptors (16 bytes each)
> + *	struct vring_packed_desc desc[num];
> + *
> + *	// Padding to the next align boundary.
> + *	char pad[];
> + *
> + *	// Driver Event Suppression
> + *	struct vring_packed_desc_event driver;
> + *
> + *	// Device Event Suppression
> + *	struct vring_packed_desc_event device;

Maybe that's how our driver does it but it's not based on spec
so I don't think this belongs in the header.

> + * };
> + */
> +
> +static inline unsigned vring_packed_size(unsigned int num, unsigned long align)
> +{
> +	return ((sizeof(struct vring_packed_desc) * num + align - 1)
> +		& ~(align - 1)) + sizeof(struct vring_packed_desc_event) * 2;
> +}
> +

Cant say this API makes sense for me.


>  #endif /* _UAPI_LINUX_VIRTIO_RING_H */
> -- 
> 2.11.0

^ permalink raw reply

* Re: Build error for samples/bpf/ due to commit d0266046ad54 ("x86: Remove FAST_FEATURE_TESTS")
From: Alexei Starovoitov @ 2018-04-13 15:07 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Peter Zijlstra, Daniel Borkmann, netdev@vger.kernel.org, LKML,
	Björn Töpel
In-Reply-To: <20180413152237.3727911f@redhat.com>

On Fri, Apr 13, 2018 at 03:22:37PM +0200, Jesper Dangaard Brouer wrote:
> Hi Peter,
> 
> Your commit d0266046ad54 ("x86: Remove FAST_FEATURE_TESTS") broke build
> for several samples/bpf programs. I'm unsure what the best way forward
> is to unbreak these...
> 
> The issue is that these samples are build with LLVM/clang (which
> doesn't like 'asm goto' constructs).  And they end up including
> arch/x86/include/asm/cpufeature.h via a long include path, see build
> examples below (through different path to include/linux/thread_info.h).
> 
> Maybe Alexei or Daniel have an idea how to work around this?
> As tools/testing/selftests/bpf/ does not seem to fail!?

Right. All of bpf tracing and samples/bpf/ broke.
Here is the proposed fix that we're asking Peter to apply and send to Linus asap.
https://lkml.org/lkml/2018/4/10/825

> Build error#1:
> --------------
> clang  -nostdinc -isystem /usr/lib/gcc/x86_64-redhat-linux/7/include -I./arch/x86/include -I./arch/x86/include/generated  -I./include -I./arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I./include/uapi -I./include/generated/uapi -include ./include/linux/kconfig.h  -Isamples/bpf \
>         -I./tools/testing/selftests/bpf/ \
>         -D__KERNEL__ -Wno-unused-value -Wno-pointer-sign \
>         -D__TARGET_ARCH_x86 -Wno-compare-distinct-pointer-types \
>         -Wno-gnu-variable-sized-type-not-at-end \
>         -Wno-address-of-packed-member -Wno-tautological-compare \
>         -Wno-unknown-warning-option  \
>         -O2 -emit-llvm -c samples/bpf/sockex2_kern.c -o -| llc -march=bpf -filetype=obj -o samples/bpf/sockex2_kern.o
> In file included from samples/bpf/sockex2_kern.c:3:
> In file included from ./include/uapi/linux/in.h:24:
> In file included from ./include/linux/socket.h:8:
> In file included from ./include/linux/uio.h:13:
> In file included from ./include/linux/thread_info.h:38:
> In file included from ./arch/x86/include/asm/thread_info.h:53:
> ./arch/x86/include/asm/cpufeature.h:150:2: error: 'asm goto' constructs are not supported yet
>         asm_volatile_goto("1: jmp 6f\n"
>         ^
> ./include/linux/compiler-gcc.h:290:42: note: expanded from macro 'asm_volatile_goto'
> #define asm_volatile_goto(x...) do { asm goto(x); asm (""); } while (0)

^ permalink raw reply

* Re: [PATCH net 2/2] sfc: limit ARFS workitems in flight per channel
From: David Miller @ 2018-04-13 15:03 UTC (permalink / raw)
  To: ecree; +Cc: linux-net-drivers, netdev
In-Reply-To: <c7a994ee-763c-1f94-2c1e-4348d7b4cc62@solarflare.com>

From: Edward Cree <ecree@solarflare.com>
Date: Fri, 13 Apr 2018 15:52:25 +0100

> On 13/04/18 15:45, David Miller wrote:
>> I understand the constraints you are working under, but do realize
>> that the real root of the problems is that you are implementing what
>> is defined clearly as a synchronous operation as asynchronous.
> Yes, it is unfortunate that we are unable to perform synchronous filter
>  insertions, but you go to war with the hardware you have :(

Whilst you may not be able to program the filter into the hardware
synchronously, you should be able to allocate the ID and get all of
the software state setup.

At least then you can return the proper return value from the netdev
op.

People really aren't going to be happy if their performance goes into
the tank because their filter update rate, for whatever reason, hits
this magic backlog limit.

^ permalink raw reply

* Re: tcp hang when socket fills up ?
From: Eric Dumazet @ 2018-04-13 15:01 UTC (permalink / raw)
  To: Dominique Martinet, netdev
In-Reply-To: <20180406090720.GA31845@nautica>



On 04/06/2018 02:07 AM, Dominique Martinet wrote:
> (current kernel: vanilla 4.14.29)
> 
> I've been running into troubles recently where a sockets "fills up" (as
> in, select() will no longer return it in its outfd / attempting to write
> to it after setting it to NONBLOCK will return -EWOULDBLOCK) and it
> doesn't seem to ever "unfill" until the tcp connexion timeout.
> 
> The previous time I pushed it down to the application for not trying to
> read the socket either as I assume the buffers could be shared and
> just waiting won't take data out, but this time I'm a bit more
> skeptical as socat waits for the fd in both read and write...
> 
> Let me take a minute to describe my setup, I don't think that how the
> socket was created matters but it might be interesting:
>  - I have two computers behind NATs, no port forwarding on either side
>  - One (call it C for client) runs ssh with a proxycommand ncat/socat to
> control the source port, e.g.
> $ ssh -o ProxyCommand="socat stdio tcp:<server public ip>:<port1>,sourceport=<port2>" server
>  - The server runs another socat to connect to that and forwards to ssh
> locally, e.g.
> $ socat tcp:<client public ip>:<port2>,sourceport=<port1> tcp:127.0.0.1:22
> 
> (yes, both are connect() calls, and that just works™ thanks to initial
> syn replay and conntrack on routers)
> 
> When things stall, the first socat is in a select with both fd in
> reading, so it's waiting data.
> The second socat has data coming from ssh and is in a select with the
> client-facing socket in both read and write - that select never returns
> so the socket is not available for reading or writing and does not free
> up until the connection eventually times out a few minutes later.
> 
> At this point, I only see tcp replays in tcpdump/wireshark. I've
> compared dumps from both sides and there are no lost packets, only
> reordering - there always is a batch of acks that were sent regularily
> coming in shortly before the hang. Here's the trace on the server:
> 
> 16:49:26.735042 IP <server local ip>.13317 > <client public ip>.31872: Flags [.], seq 70476:71850, ack 4190, win 307, options [nop,nop,TS val 1313937641 ecr 1617129473], length 1374
> 16:49:26.735046 IP <server local ip>.13317 > <client public ip>.31872: Flags [.], seq 71850:73224, ack 4190, win 307, options [nop,nop,TS val 1313937641 ecr 1617129473], length 1374
> 16:49:26.735334 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 41622, win 918, options [nop,nop,TS val 1617129478 ecr 1313937609], length 0
> 16:49:26.736005 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 42996, win 940, options [nop,nop,TS val 1617129478 ecr 1313937609], length 0
> 16:49:26.736402 IP <server local ip>.13317 > <client public ip>.31872: Flags [.], seq 73224:74598, ack 4190, win 307, options [nop,nop,TS val 1313937643 ecr 1617129473], length 1374
> 16:49:26.736408 IP <server local ip>.13317 > <client public ip>.31872: Flags [.], seq 74598:75972, ack 4190, win 307, options [nop,nop,TS val 1313937643 ecr 1617129473], length 1374
> 16:49:26.738561 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 44370, win 963, options [nop,nop,TS val 1617129482 ecr 1313937616], length 0
> 16:49:26.739539 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 45744, win 986, options [nop,nop,TS val 1617129482 ecr 1313937616], length 0
> 16:49:26.739882 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 47118, win 1008, options [nop,nop,TS val 1617129484 ecr 1313937617], length 0
> 16:49:26.740255 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 48492, win 1031, options [nop,nop,TS val 1617129484 ecr 1313937617], length 0
> 16:49:26.746756 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 49866, win 1053, options [nop,nop,TS val 1617129493 ecr 1313937627], length 0
> 16:49:26.747923 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 51240, win 1076, options [nop,nop,TS val 1617129494 ecr 1313937627], length 0
> 16:49:26.749083 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 52614, win 1099, options [nop,nop,TS val 1617129495 ecr 1313937629], length 0
> 16:49:26.750171 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 53988, win 1121, options [nop,nop,TS val 1617129496 ecr 1313937629], length 0
> 16:49:26.750808 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 55362, win 1144, options [nop,nop,TS val 1617129497 ecr 1313937629], length 0
> 16:49:26.754648 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 56736, win 1167, options [nop,nop,TS val 1617129500 ecr 1313937629], length 0
> 16:49:26.755985 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 58110, win 1189, options [nop,nop,TS val 1617129501 ecr 1313937630], length 0
> 16:49:26.758513 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 59484, win 1212, options [nop,nop,TS val 1617129502 ecr 1313937630], length 0
> 16:49:26.759096 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 60858, win 1234, options [nop,nop,TS val 1617129503 ecr 1313937635], length 0
> 16:49:26.759421 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 62232, win 1257, options [nop,nop,TS val 1617129503 ecr 1313937635], length 0
> 16:49:26.759755 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 63606, win 1280, options [nop,nop,TS val 1617129504 ecr 1313937636], length 0
> 16:49:26.760653 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 64980, win 1302, options [nop,nop,TS val 1617129505 ecr 1313937636], length 0
> 16:49:26.761453 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 66354, win 1325, options [nop,nop,TS val 1617129506 ecr 1313937638], length 0
> 16:49:26.762199 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 67728, win 1348, options [nop,nop,TS val 1617129507 ecr 1313937638], length 0
> 16:49:26.763547 IP <client public ip>.31872 > <server local ip>.13317: Flags [P.], seq 4190:4226, ack 67728, win 1348, options [nop,nop,TS val 1617129507 ecr 1313937638], length 36
> 16:49:26.763553 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 70476, win 1393, options [nop,nop,TS val 1617129508 ecr 1313937639], length 0
> 16:49:26.764298 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 73224, win 1438, options [nop,nop,TS val 1617129509 ecr 1313937641], length 0
> 16:49:26.764676 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 75972, win 1444, options [nop,nop,TS val 1617129510 ecr 1313937643], length 0
> 16:49:26.807754 IP <server local ip>.13317 > <client public ip>.31872: Flags [.], seq 75972:77346, ack 4190, win 307, options [nop,nop,TS val 1313937714 ecr 1617129473], length 1374
> 16:49:26.876467 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 77346, win 1444, options [nop,nop,TS val 1617129620 ecr 1313937714], length 0

There is no way a regular TCP stack (including linux) could send the following frame.


> 16:49:27.048760 IP <server local ip>.13317 > <client public ip>.31872: Flags [.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 1313937955 ecr 1617129473], length 1374

So something is mangling the packet, maybe NAT or something.

> 16:49:27.051791 IP <client public ip>.31872 > <server local ip>.13317: Flags [P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617129762 ecr 1313937714], length 36
> 16:49:27.076444 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 77346, win 1444, options [nop,nop,TS val 1617129822 ecr 1313937714,nop,nop,sack 1 {32004:33378}], length 0
> 16:49:27.371182 IP <client public ip>.31872 > <server local ip>.13317: Flags [P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617130018 ecr 1313937714], length 36


> 16:49:27.519862 IP <server local ip>.13317 > <client public ip>.31872: Flags [.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 1313938426 ecr 1617129473], length 1374



> 16:49:27.547662 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 77346, win 1444, options [nop,nop,TS val 1617130293 ecr 1313937714,nop,nop,sack 1 {32004:33378}], length 0
> 16:49:27.883372 IP <client public ip>.31872 > <server local ip>.13317: Flags [P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617130530 ecr 1313937714], length 36
> 16:49:28.511861 IP <server local ip>.13317 > <client public ip>.31872: Flags [.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 1313939418 ecr 1617129473], length 1374
> 16:49:28.538891 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 77346, win 1444, options [nop,nop,TS val 1617131285 ecr 1313937714,nop,nop,sack 1 {32004:33378}], length 0
> 16:49:28.907197 IP <client public ip>.31872 > <server local ip>.13317: Flags [P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617131554 ecr 1313937714], length 36
> 16:49:30.431864 IP <server local ip>.13317 > <client public ip>.31872: Flags [.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 1313941338 ecr 1617129473], length 1374
> 16:49:30.459127 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 77346, win 1444, options [nop,nop,TS val 1617133204 ecr 1313937714,nop,nop,sack 1 {32004:33378}], length 0
> 16:49:30.955388 IP <client public ip>.31872 > <server local ip>.13317: Flags [P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617133602 ecr 1313937714], length 36
> 16:49:34.207879 IP <server local ip>.13317 > <client public ip>.31872: Flags [.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 1313945114 ecr 1617129473], length 1374
> 16:49:34.235726 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 77346, win 1444, options [nop,nop,TS val 1617136981 ecr 1313937714,nop,nop,sack 1 {32004:33378}], length 0
> 16:49:35.256285 IP <client public ip>.31872 > <server local ip>.13317: Flags [P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617137954 ecr 1313937714], length 36
> 16:49:42.143864 IP <server local ip>.13317 > <client public ip>.31872: Flags [.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 1313953050 ecr 1617129473], length 1374
> 16:49:42.171531 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 77346, win 1444, options [nop,nop,TS val 1617144917 ecr 1313937714,nop,nop,sack 1 {32004:33378}], length 0
> 16:49:43.448262 IP <client public ip>.31872 > <server local ip>.13317: Flags [P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617146146 ecr 1313937714], length 36
> 
> (I still have the pcap files if someone wants to see them, but I'd
> rather not give my work IP publicly so will send it privately on
> request)
> 
> 
> At this point, only the same 3 packets keep being replayed over and
> over... According to 'ss' the Send-Q isn't empty on either side, the
> client has some ~1k to send but the server has much more (87k)
> After increasing the window size through net.*wmem sysctl it got stuck
> with over 1MB in Send-Q, which makes sense because the socket is full...
> 
> 
> So, what I don't get is, why are these acks continuously replayed? Given
> the timing it looks like the server doesn't take the client acks into
> account, despite having received that precise 33378 ack earlier and I
> believe it should accept a higher ack value anyway ?
> 
> The ultimate question being, how can I go about debugging that?
> I'm working on getting perf probe/crash to work on the server so I can
> look at the kernel side now, I can reproduce this semi-easily so I'm
> sure I'll get down to it eventually, but if anyone has an idea I'm all
> ears.
> 
> 
> Thanks!
> 

^ permalink raw reply

* Re: [PATCH net 1/3] l2tp: hold reference on tunnels in netlink dumps
From: David Miller @ 2018-04-13 14:57 UTC (permalink / raw)
  To: g.nault; +Cc: netdev, jchapman
In-Reply-To: <be5251775c626534633ce7658f6d7da93e7aecd5.1523558015.git.g.nault@alphalink.fr>

From: Guillaume Nault <g.nault@alphalink.fr>
Date: Thu, 12 Apr 2018 20:50:33 +0200

> l2tp_tunnel_find_nth() is unsafe: no reference is held on the returned
> tunnel, therefore it can be freed whenever the caller uses it.
> This patch defines l2tp_tunnel_get_nth() which works similarly, but
> also takes a reference on the returned tunnel. The caller then has to
> drop it after it stops using the tunnel.
> 
> Convert netlink dumps to make them safe against concurrent tunnel
> deletion.
> 
> Fixes: 309795f4bec2 ("l2tp: Add netlink control API for L2TP")
> Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>

During the entire invocation of l2tp_nl_cmd_tunnel_dump(), the RTNL
mutex is held.

Therefore no tunnel configuration changes may occur and the tunnel
object will persist and is safe to access.

The netlink dump should be safe as-is.

Were you actually able to trigger a crash or KASAN warning or is
this purely from code inspection?

^ permalink raw reply

* Re: [PATCH] netfilter: fix CONFIG_NF_REJECT_IPV6=m link error
From: Arnd Bergmann @ 2018-04-13 14:52 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: Jozsef Kadlecsik, Florian Westphal, David S. Miller,
	netfilter-devel, coreteam, Networking, Linux Kernel Mailing List
In-Reply-To: <20180413131558.3jw5dhub5gcyotyt@salvia>

On Fri, Apr 13, 2018 at 3:15 PM, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> On Mon, Apr 09, 2018 at 04:43:40PM +0200, Arnd Bergmann wrote:
>> On Mon, Apr 9, 2018 at 4:37 PM, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
>> > Hi Arnd,
>> >
>> > On Mon, Apr 09, 2018 at 12:53:12PM +0200, Arnd Bergmann wrote:
>> >> We get a new link error with CONFIG_NFT_REJECT_INET=y and CONFIG_NF_REJECT_IPV6=m
>> >
>> > I think we can update NFT_REJECT_INET so it depends on NFT_REJECT_IPV4
>> > and NFT_REJECT_IPV6. This doesn't allow here CONFIG_NFT_REJECT_INET=y
>> > and CONFIG_NF_REJECT_IPV6=m.
>> >
>> > I mean, just like we do with NFT_FIB_INET.
>>
>> That can only work if NFT_REJECT_INET can be made a 'tristate' symbol
>> again, so that code gets built as a loadable module if
>> CONFIG_NF_REJECT_IPV6=m.
>>
>> > BTW, I think this problem has been is not related to the recent patch,
>> > but something older that kbuild robot has triggered more easily for
>> > some reason?
>>
>> 02c7b25e5f54 is the one that turned NF_TABLES_INET into a 'bool'
>> symbol. NFT_REJECT depends on NF_TABLES_INET, so it used to
>> restricted to a loadable module with IPV6=m, but can now be
>> built-in, which causes that link error.
>
> Still one more spin on this, I would like to see if we have a way to
> fix this by simplifing things a bit.
>
> Would this one I'm attaching would work?

One disadvantage is that it makes the vmlinux bigger since
NF_REJECT_IPV{4,6} can no longer be a module at all now.

I suspect you also stil get a link error with IPV6=m, this time because
the nf_reject_ipv6.o file fails to link against the ipv6 code, e.g.
ipv6_skip_exthdr() and icmpv6_send() appear to be unreachable here.
I haven't tried that though, so I might be missing something.

        Arnd

^ permalink raw reply

* Re: [PATCH net 2/2] sfc: limit ARFS workitems in flight per channel
From: Edward Cree @ 2018-04-13 14:52 UTC (permalink / raw)
  To: David Miller; +Cc: linux-net-drivers, netdev
In-Reply-To: <97aad935-d5fb-713e-fd0f-d84bbd733a8f@solarflare.com>

On 13/04/18 13:36, Edward Cree wrote:
> It turns out this may all be moot anyway: I figured out why I was seeing
>  ARFS storms and it wasn't the configuration issue I originally blamed.
Hmm, correction, while the fix I mentioned in my previous email is needed,
 it doesn't prevent the ARFS storms (although seems to lessen their
 severity, given that the machine didn't actually fall over this time),
 so we do also need some kind of limiting.

On 12/04/18 16:33, David Miller wrote:
> Then simply make the work process a queue, and add entries to the queue
> here if the work is already scheduled.
>
> Is there a reason why that wouldn't work?
That has the same problem as the existing code, that the length of the queue
 can grow without bound, potentially causing a very long lag between the
 request and its execution.  This then can quickly become exponential as,
 while waiting for the filter to be inserted, further packets from the same
 flow are received (still unsteered) and trigger duplicate ARFS requests
 (though I suppose it would be possible to scan the queue for matching flow
 IDs; but the concurrency / locking problems with that are 'interesting').

I'm not sure why you object to the dropping of requests - it seems reasonable
 to me to treat ARFS as a 'best-effort' thing.  The packets will still be
 received (just not necessarily on the core nearest the application), and if
 the flow continues it will generate more steering requests after the ones
 currently in flight have been processed.
And in practice we only get into this situation in the first place when we
 have interrupt affinities configured in such a way as to make ARFS
 practically useless anyway, so our failure to insert the filters is not of
 great significance.

On 13/04/18 15:45, David Miller wrote:
> I understand the constraints you are working under, but do realize
> that the real root of the problems is that you are implementing what
> is defined clearly as a synchronous operation as asynchronous.
Yes, it is unfortunate that we are unable to perform synchronous filter
 insertions, but you go to war with the hardware you have :(

-Ed

^ permalink raw reply

* Re: [PATCH net 2/2] sfc: limit ARFS workitems in flight per channel
From: David Miller @ 2018-04-13 14:45 UTC (permalink / raw)
  To: ecree; +Cc: linux-net-drivers, netdev
In-Reply-To: <97aad935-d5fb-713e-fd0f-d84bbd733a8f@solarflare.com>

From: Edward Cree <ecree@solarflare.com>
Date: Fri, 13 Apr 2018 13:36:28 +0100

> It turns out this may all be moot anyway: I figured out why I was seeing
>  ARFS storms and it wasn't the configuration issue I originally blamed.
> My current ndo_rx_flow_steer() implementation, efx_filter_rfs(), returns
>  0 for success, but the caller expects a filter ID to be returned (which
>  we can't give it because we don't know what the filter ID will be until
>  we start mucking around in the software state that's now protected by a
>  sleepable lock).
> As a result, when we call rps_may_expire_flow(), and pass it the _actual_
>  filter ID, this doesn't match the one set_rps_cpu() recorded, so the
>  function returns true and we immediately expire the filter.  Then the
>  next packet to come along isn't steered, so ARFS asks us to insert a
>  steering filter again.
> As a quick fix I've simply tried making the rps_may_expire_flow() calls
>  also pass a filter ID of 0, which prevents the ARFS storms.  This is
>  safe; it may cause us to delay expiring a filter when flow_ids collide,
>  but that can happen anyway with other drivers' implementations (e.g.
>  mlx4 and mlx5 can potentially reuse filter IDs) so I presume it is OK.
> I'll post a v2 with that fix in place of this Patch #2 shortly, then try
>  to follow up with a counter-generated ID (similar to what mlx have).

I understand the constraints you are working under, but do realize
that the real root of the problems is that you are implementing what
is defined clearly as a synchronous operation as asynchronous.

^ permalink raw reply

* Re: Atlantic driver 4.16 stable request
From: David Miller @ 2018-04-13 14:43 UTC (permalink / raw)
  To: igor.russkikh; +Cc: netdev, Nadezhda.Krupnina, simon.edelhaus
In-Reply-To: <9af8632f-6f5a-686d-0da1-bdf5b5b6bb07@aquantia.com>

From: Igor Russkikh <igor.russkikh@aquantia.com>
Date: Fri, 13 Apr 2018 15:03:29 +0300

> Could you please consider queuing to v4.16:
> 
> 9a11aff net: aquantia: oops when shutdown on already stopped device
> cce96d1 net: aquantia: Regression on reset with 1.x firmware
> 
> These are both critical and well tested by our team.

Queued up, thank you.

^ permalink raw reply

* Re: [PATCH v5 05/14] PCI: Add pcie_print_link_status() to log link speed and whether it's limited
From: Bjorn Helgaas @ 2018-04-13 14:06 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Tal Gilboa, Tariq Toukan, Jacob Keller, Ariel Elior,
	Ganesh Goudar, Jeff Kirsher, everest-linux-l2, intel-wired-lan,
	netdev, linux-kernel, linux-pci
In-Reply-To: <20180412213249.06661048@cakuba.netronome.com>

On Thu, Apr 12, 2018 at 09:32:49PM -0700, Jakub Kicinski wrote:
> On Fri, 30 Mar 2018 16:05:18 -0500, Bjorn Helgaas wrote:
> > +	if (bw_avail >= bw_cap)
> > +		pci_info(dev, "%d Mb/s available bandwidth (%s x%d link)\n",
> > +			 bw_cap, PCIE_SPEED2STR(speed_cap), width_cap);
> > +	else
> > +		pci_info(dev, "%d Mb/s available bandwidth, limited by %s x%d link at %s (capable of %d Mb/s with %s x%d link)\n",
> > +			 bw_avail, PCIE_SPEED2STR(speed), width,
> > +			 limiting_dev ? pci_name(limiting_dev) : "<unknown>",
> > +			 bw_cap, PCIE_SPEED2STR(speed_cap), width_cap);
> 
> I was just looking at using this new function to print PCIe BW for a
> NIC, but I'm slightly worried that there is nothing in the message that
> says PCIe...  For a NIC some people may interpret the bandwidth as NIC
> bandwidth:
> 
> [   39.839989] nfp 0000:04:00.0: Netronome Flow Processor NFP4000/NFP6000 PCIe Card Probe
> [   39.848943] nfp 0000:04:00.0: 63.008 Gb/s available bandwidth (8 GT/s x8 link)
> [   39.857146] nfp 0000:04:00.0: RESERVED BARs: 0.0: General/MSI-X SRAM, 0.1: PCIe XPB/MSI-X PBA, 0.4: Explicit0, 0.5: Explicit1, fre4
> 
> It's not a 63Gbps NIC...  I'm sorry if this was discussed before and I
> didn't find it.  Would it make sense to add the "PCIe: " prefix to the
> message like bnx2x used to do?  Like:
> 
> nfp 0000:04:00.0: PCIe: 63.008 Gb/s available bandwidth (8 GT/s x8 link)

I agree, that does look potentially confusing.  How about this:

  nfp 0000:04:00.0: 63.008 Gb/s available PCIe bandwidth (8 GT/s x8 link)

I did have to look twice at this before I remembered that we're
printing Gb/s (not GB/s).  Most of the references I found on the web
use GB/s when talking about total PCIe bandwidth.

But either way I think it's definitely worth mentioning PCIe
explicitly.

^ permalink raw reply

* Build error for samples/bpf/ due to commit d0266046ad54 ("x86: Remove FAST_FEATURE_TESTS")
From: Jesper Dangaard Brouer @ 2018-04-13 13:22 UTC (permalink / raw)
  To: Peter Zijlstra, Daniel Borkmann, Alexei Starovoitov
  Cc: brouer, netdev@vger.kernel.org, LKML, Björn Töpel

Hi Peter,

Your commit d0266046ad54 ("x86: Remove FAST_FEATURE_TESTS") broke build
for several samples/bpf programs. I'm unsure what the best way forward
is to unbreak these...

The issue is that these samples are build with LLVM/clang (which
doesn't like 'asm goto' constructs).  And they end up including
arch/x86/include/asm/cpufeature.h via a long include path, see build
examples below (through different path to include/linux/thread_info.h).

Maybe Alexei or Daniel have an idea how to work around this?
As tools/testing/selftests/bpf/ does not seem to fail!?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

Build error#1:
--------------
clang  -nostdinc -isystem /usr/lib/gcc/x86_64-redhat-linux/7/include -I./arch/x86/include -I./arch/x86/include/generated  -I./include -I./arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I./include/uapi -I./include/generated/uapi -include ./include/linux/kconfig.h  -Isamples/bpf \
        -I./tools/testing/selftests/bpf/ \
        -D__KERNEL__ -Wno-unused-value -Wno-pointer-sign \
        -D__TARGET_ARCH_x86 -Wno-compare-distinct-pointer-types \
        -Wno-gnu-variable-sized-type-not-at-end \
        -Wno-address-of-packed-member -Wno-tautological-compare \
        -Wno-unknown-warning-option  \
        -O2 -emit-llvm -c samples/bpf/sockex2_kern.c -o -| llc -march=bpf -filetype=obj -o samples/bpf/sockex2_kern.o
In file included from samples/bpf/sockex2_kern.c:3:
In file included from ./include/uapi/linux/in.h:24:
In file included from ./include/linux/socket.h:8:
In file included from ./include/linux/uio.h:13:
In file included from ./include/linux/thread_info.h:38:
In file included from ./arch/x86/include/asm/thread_info.h:53:
./arch/x86/include/asm/cpufeature.h:150:2: error: 'asm goto' constructs are not supported yet
        asm_volatile_goto("1: jmp 6f\n"
        ^
./include/linux/compiler-gcc.h:290:42: note: expanded from macro 'asm_volatile_goto'
#define asm_volatile_goto(x...) do { asm goto(x); asm (""); } while (0)
                                         ^


Build error#2:
--------------
clang  -nostdinc -isystem /usr/lib/gcc/x86_64-redhat-linux/7/include -I./arch/x86/include -I./arch/x86/include/generated  -I./include -I./arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I./include/uapi -I./include/generated/uapi -include ./include/linux/kconfig.h  -Isamples/bpf \
        -I./tools/testing/selftests/bpf/ \
        -D__KERNEL__ -Wno-unused-value -Wno-pointer-sign \
        -D__TARGET_ARCH_x86 -Wno-compare-distinct-pointer-types \
        -Wno-gnu-variable-sized-type-not-at-end \
        -Wno-address-of-packed-member -Wno-tautological-compare \
        -Wno-unknown-warning-option  \
        -O2 -emit-llvm -c samples/bpf/tracex1_kern.c -o -| llc -march=bpf -filetype=obj -o samples/bpf/tracex1_kern.o
In file included from samples/bpf/tracex1_kern.c:7:
In file included from ./include/linux/skbuff.h:19:
In file included from ./include/linux/time.h:6:
In file included from ./include/linux/seqlock.h:36:
In file included from ./include/linux/spinlock.h:51:
In file included from ./include/linux/preempt.h:81:
In file included from ./arch/x86/include/asm/preempt.h:7:
In file included from ./include/linux/thread_info.h:38:
In file included from ./arch/x86/include/asm/thread_info.h:53:
./arch/x86/include/asm/cpufeature.h:150:2: error: 'asm goto' constructs are not supported yet
        asm_volatile_goto("1: jmp 6f\n"
        ^
./include/linux/compiler-gcc.h:290:42: note: expanded from macro 'asm_volatile_goto'
#define asm_volatile_goto(x...) do { asm goto(x); asm (""); } while (0)
                                         ^


Build error#3:
--------------
clang  -nostdinc -isystem /usr/lib/gcc/x86_64-redhat-linux/7/include -I./arch/x86/include -I./arch/x86/include/generated  -I./include -I./arch/x86
/include/uapi -I./arch/x86/include/generated/uapi -I./include/uapi -I./include/generated/uapi -include ./include/linux/kconfig.h  -Isamples/bpf \
        -I./tools/testing/selftests/bpf/ \
        -D__KERNEL__ -Wno-unused-value -Wno-pointer-sign \
        -D__TARGET_ARCH_x86 -Wno-compare-distinct-pointer-types \
        -Wno-gnu-variable-sized-type-not-at-end \
        -Wno-address-of-packed-member -Wno-tautological-compare \
        -Wno-unknown-warning-option  \
        -O2 -emit-llvm -c samples/bpf/xdp1_kern.c -o -| llc -march=bpf -filetype=obj -o samples/bpf/xdp1_kern.o
In file included from samples/bpf/xdp1_kern.c:9:
In file included from ./include/linux/in.h:23:
In file included from ./include/uapi/linux/in.h:24:
In file included from ./include/linux/socket.h:8:
In file included from ./include/linux/uio.h:13:
In file included from ./include/linux/thread_info.h:38:
In file included from ./arch/x86/include/asm/thread_info.h:53:
./arch/x86/include/asm/cpufeature.h:150:2: error: 'asm goto' constructs are not supported yet
        asm_volatile_goto("1: jmp 6f\n"
        ^
./include/linux/compiler-gcc.h:290:42: note: expanded from macro 'asm_volatile_goto'
#define asm_volatile_goto(x...) do { asm goto(x); asm (""); } while (0)

^ permalink raw reply

* Re: [PATCH] netfilter: fix CONFIG_NF_REJECT_IPV6=m link error
From: Pablo Neira Ayuso @ 2018-04-13 13:15 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Jozsef Kadlecsik, Florian Westphal, David S. Miller,
	netfilter-devel, coreteam, Networking, Linux Kernel Mailing List
In-Reply-To: <CAK8P3a1GH5OSu62TBh46S0OdiXTJODsvV3To5=Jp2nL+fz1Puw@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 1289 bytes --]

On Mon, Apr 09, 2018 at 04:43:40PM +0200, Arnd Bergmann wrote:
> On Mon, Apr 9, 2018 at 4:37 PM, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > Hi Arnd,
> >
> > On Mon, Apr 09, 2018 at 12:53:12PM +0200, Arnd Bergmann wrote:
> >> We get a new link error with CONFIG_NFT_REJECT_INET=y and CONFIG_NF_REJECT_IPV6=m
> >
> > I think we can update NFT_REJECT_INET so it depends on NFT_REJECT_IPV4
> > and NFT_REJECT_IPV6. This doesn't allow here CONFIG_NFT_REJECT_INET=y
> > and CONFIG_NF_REJECT_IPV6=m.
> >
> > I mean, just like we do with NFT_FIB_INET.
> 
> That can only work if NFT_REJECT_INET can be made a 'tristate' symbol
> again, so that code gets built as a loadable module if
> CONFIG_NF_REJECT_IPV6=m.
> 
> > BTW, I think this problem has been is not related to the recent patch,
> > but something older that kbuild robot has triggered more easily for
> > some reason?
> 
> 02c7b25e5f54 is the one that turned NF_TABLES_INET into a 'bool'
> symbol. NFT_REJECT depends on NF_TABLES_INET, so it used to
> restricted to a loadable module with IPV6=m, but can now be
> built-in, which causes that link error.

Still one more spin on this, I would like to see if we have a way to
fix this by simplifing things a bit.

Would this one I'm attaching would work?

Thanks for you patience.

[-- Attachment #2: 0001-netfilter-CONFIG_NF_REJECT_IPV-4-6-becomes-bool-togg.patch --]
[-- Type: text/x-diff, Size: 2586 bytes --]

>From af07bc7ff5d34ce54e7913233912c058e6699e3c Mon Sep 17 00:00:00 2001
From: Pablo Neira Ayuso <pablo@netfilter.org>
Date: Fri, 13 Apr 2018 10:48:40 +0200
Subject: [PATCH] netfilter: CONFIG_NF_REJECT_IPV{4,6} becomes bool toggle

Arnd reports that we get a new link error with CONFIG_NFT_REJECT_INET=y
and CONFIG_NF_REJECT_IPV6=m after larger parts of the nftables modules
are linked together:

net/netfilter/nft_reject_inet.o: In function `nft_reject_inet_eval':
nft_reject_inet.c:(.text+0x17c): undefined reference to `nf_send_unreach6'
nft_reject_inet.c:(.text+0x190): undefined reference to `nf_send_reset6'

The problem is that with NF_TABLES_INET set, we implicitly try to use
the ipv6 version as well for NFT_REJECT, but when CONFIG_IPV6 is set to
a loadable module, it's impossible to reach that.

This patch fixes this problem by building-in nf_reject_ipv{4,6}.c, IPv6
symbol dependencies for the IPv6 reject infrastructure are located in
exthdrs_core.c, ip6_checksum.c and ip6_icmp.c which are also built-in,
so let's do the same to simplify this.

Fixes: 02c7b25e5f54 ("netfilter: nf_tables: build-in filter chain type")
Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/ipv4/netfilter/Kconfig | 3 +--
 net/ipv6/netfilter/Kconfig | 3 +--
 net/netfilter/Kconfig      | 2 ++
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/netfilter/Kconfig b/net/ipv4/netfilter/Kconfig
index 280048e1e395..3e4e0ae2a9a1 100644
--- a/net/ipv4/netfilter/Kconfig
+++ b/net/ipv4/netfilter/Kconfig
@@ -104,8 +104,7 @@ config NF_LOG_IPV4
 	select NF_LOG_COMMON
 
 config NF_REJECT_IPV4
-	tristate "IPv4 packet rejection"
-	default m if NETFILTER_ADVANCED=n
+	bool "IPv4 packet rejection"
 
 config NF_NAT_IPV4
 	tristate "IPv4 NAT"
diff --git a/net/ipv6/netfilter/Kconfig b/net/ipv6/netfilter/Kconfig
index ccbfa83e4bb0..1e5d040a60b8 100644
--- a/net/ipv6/netfilter/Kconfig
+++ b/net/ipv6/netfilter/Kconfig
@@ -87,8 +87,7 @@ config NF_DUP_IPV6
 	  packet to be rerouted to another destination.
 
 config NF_REJECT_IPV6
-	tristate "IPv6 packet rejection"
-	default m if NETFILTER_ADVANCED=n
+	bool "IPv6 packet rejection"
 
 config NF_LOG_IPV6
 	tristate "IPv6 packet logging"
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index 4189f574f5ec..d7b3272fe821 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -609,6 +609,8 @@ config NFT_REJECT
 
 config NFT_REJECT_INET
 	depends on NF_TABLES_INET
+	select NF_REJECT_IPV4
+	select NF_REJECT_IPV6
 	default NFT_REJECT
 	tristate
 
-- 
2.11.0


^ permalink raw reply related

* Re: net: hang in unregister_netdevice: waiting for lo to become free
From: Dan Streetman @ 2018-04-13 12:43 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Tommi Rantala, Neil Horman, Xin Long, David Ahern,
	Daniel Borkmann, Cong Wang, David Miller, Eric Dumazet,
	Willem de Bruijn, Jakub Kicinski, Rasmus Villemoes, netdev, LKML,
	Alexey Kuznetsov, Hideaki YOSHIFUJI, syzkaller, Dan Streetman,
	Eric W. Biederman, Alexey Kodanev
In-Reply-To: <CACT4Y+Z2kdjsYUs-H9aSTcNW-FSKqm1__fg3eOq16H-E=X8oZg@mail.gmail.com>

On Thu, Apr 12, 2018 at 8:15 AM, Dmitry Vyukov <dvyukov@google.com> wrote:
> On Wed, Feb 21, 2018 at 3:53 PM, Tommi Rantala
> <tommi.t.rantala@nokia.com> wrote:
>> On 20.02.2018 18:26, Neil Horman wrote:
>>>
>>> On Tue, Feb 20, 2018 at 09:14:41AM +0100, Dmitry Vyukov wrote:
>>>>
>>>> On Tue, Feb 20, 2018 at 8:56 AM, Tommi Rantala
>>>> <tommi.t.rantala@nokia.com> wrote:
>>>>>
>>>>> On 19.02.2018 20:59, Dmitry Vyukov wrote:
>>>>>>
>>>>>> Is this meant to be fixed already? I am still seeing this on the
>>>>>> latest upstream tree.
>>>>>>
>>>>>
>>>>> These two commits are in v4.16-rc1:
>>>>>
>>>>> commit 4a31a6b19f9ddf498c81f5c9b089742b7472a6f8
>>>>> Author: Tommi Rantala <tommi.t.rantala@nokia.com>
>>>>> Date:   Mon Feb 5 21:48:14 2018 +0200
>>>>>
>>>>>      sctp: fix dst refcnt leak in sctp_v4_get_dst
>>>>> ...
>>>>>      Fixes: 410f03831 ("sctp: add routing output fallback")
>>>>>      Fixes: 0ca50d12f ("sctp: fix src address selection if using
>>>>> secondary
>>>>> addresses")
>>>>>
>>>>>
>>>>> commit 957d761cf91cdbb175ad7d8f5472336a4d54dbf2
>>>>> Author: Alexey Kodanev <alexey.kodanev@oracle.com>
>>>>> Date:   Mon Feb 5 15:10:35 2018 +0300
>>>>>
>>>>>      sctp: fix dst refcnt leak in sctp_v6_get_dst()
>>>>> ...
>>>>>      Fixes: dbc2b5e9a09e ("sctp: fix src address selection if using
>>>>> secondary
>>>>> addresses for ipv6")
>>>>>
>>>>>
>>>>> I guess we missed something if it's still reproducible.
>>>>>
>>>>> I can check it later this week, unless someone else beat me to it.
>>>>
>>>>
>>>> Hi Tommi,
>>>>
>>>> Hmmm, I can't claim that it's exactly the same bug. Perhaps it's
>>>> another one then. But I am still seeing these:
>>>>
>>>> [   58.799130] unregister_netdevice: waiting for lo to become free.
>>>> Usage count = 4
>>>> [   60.847138] unregister_netdevice: waiting for lo to become free.
>>>> Usage count = 4
>>>> [   62.895093] unregister_netdevice: waiting for lo to become free.
>>>> Usage count = 4
>>>> [   64.943103] unregister_netdevice: waiting for lo to become free.
>>>> Usage count = 4
>>>>
>>>> on upstream tree pulled ~12 hours ago.
>>>>
>>> Can you write a systemtap script to probe dev_hold, and dev_put, printing
>>> out a
>>> backtrace if the device name matches "lo".  That should tell us
>>> definitively if
>>> the problem is in the same location or not
>>
>>
>> Hi Dmitry, I tested with the reproducer and the kernel .config file that you
>> sent in the first email in this thread:
>>
>> With 4.16-rc2 unable to reproduce.
>>
>> With 4.15-rc9 bug reproducible, and I get "unregister_netdevice: waiting for
>> lo to become free. Usage count = 3"
>>
>> With 4.15-rc9 and Alexey's "sctp: fix dst refcnt leak in sctp_v6_get_dst()"
>> cherry-picked on top, unable to reproduce.
>>
>>
>> Is syzkaller doing something else now to trigger the bug...?
>> Can you still trigger the bug with the same reproducer?
>
> Hi Neil, Tommi,
>
> Reviving this old thread about "unregister_netdevice: waiting for lo
> to become free. Usage count = 3" hangs.
> I still did not have time to deep dive into what happens there (too
> many bugs coming from syzbot). But this still actively happens and I
> suspect accounts to a significant portion of various hang reports,
> which are quite unpleasant.
>
> One idea that could make it all simpler:
>
> Is this wait loop in netdev_wait_allrefs() supposed to wait for any
> prolonged periods of time under any non-buggy conditions? E.g. more
> than 1-2 minutes?
> If it only supposed to wait briefly for things that already supposed
> to be shutting down, and we add a WARNING there after some timeout,
> then syzbot will report all info how/when it happens, hopefully
> extracting reproducers, and all the nice things.
> But this WARNING should not have any false positives under any
> realistic conditions (e.g. waiting for arrival of remote packets with
> large timeouts).
>
> Looking at some task hung reports, it seems that this code holds some
> mutexes, takes workqueue thread and prevents any progress with
> destruction of other devices (and net namespace creation/destruction),
> so I guess it should not wait for any indefinite periods of time?

I'm working on this currently:
https://bugs.launchpad.net/ubuntu/zesty/+source/linux/+bug/1711407

I added a summary of what I've found to be the cause (or at least, one
possible cause) of this:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407/comments/72

I'm working on a patch to work around the main side-effect of this,
which is hanging while holding the global net mutex.  Hangs will still
happen (e.g. if a dst leaks) but should not affect anything else,
other than a leak of the dst and its net namespace.

Fixing the dst leaks is important too, of course, but a dst leak (or
other cause) shouldn't break the entire system.

^ permalink raw reply

* Re: [PATCH net 2/2] sfc: limit ARFS workitems in flight per channel
From: Edward Cree @ 2018-04-13 12:36 UTC (permalink / raw)
  To: David Miller; +Cc: linux-net-drivers, netdev
In-Reply-To: <20180412.113300.475644883265871487.davem@davemloft.net>

It turns out this may all be moot anyway: I figured out why I was seeing
 ARFS storms and it wasn't the configuration issue I originally blamed.
My current ndo_rx_flow_steer() implementation, efx_filter_rfs(), returns
 0 for success, but the caller expects a filter ID to be returned (which
 we can't give it because we don't know what the filter ID will be until
 we start mucking around in the software state that's now protected by a
 sleepable lock).
As a result, when we call rps_may_expire_flow(), and pass it the _actual_
 filter ID, this doesn't match the one set_rps_cpu() recorded, so the
 function returns true and we immediately expire the filter.  Then the
 next packet to come along isn't steered, so ARFS asks us to insert a
 steering filter again.
As a quick fix I've simply tried making the rps_may_expire_flow() calls
 also pass a filter ID of 0, which prevents the ARFS storms.  This is
 safe; it may cause us to delay expiring a filter when flow_ids collide,
 but that can happen anyway with other drivers' implementations (e.g.
 mlx4 and mlx5 can potentially reuse filter IDs) so I presume it is OK.
I'll post a v2 with that fix in place of this Patch #2 shortly, then try
 to follow up with a counter-generated ID (similar to what mlx have).

-Ed

^ permalink raw reply

* Atlantic driver 4.16 stable request
From: Igor Russkikh @ 2018-04-13 12:03 UTC (permalink / raw)
  To: David S . Miller; +Cc: netdev, Nadezhda Krupnina, Simon Edelhaus

Hi David,

Could you please consider queuing to v4.16:

9a11aff net: aquantia: oops when shutdown on already stopped device
cce96d1 net: aquantia: Regression on reset with 1.x firmware

These are both critical and well tested by our team.

Thanks in advance!

^ permalink raw reply

* [PATCH net] team: avoid adding twice the same option to the event list
From: Paolo Abeni @ 2018-04-13 11:59 UTC (permalink / raw)
  To: netdev; +Cc: Jiri Pirko, David S. Miller

When parsing the options provided by the user space,
team_nl_cmd_options_set() insert them in a temporary list to send
multiple events with a single message.
While each option's attribute is correctly validated, the code does
not check for duplicate entries before inserting into the event
list.

Exploiting the above, the syzbot was able to trigger the following
splat:

kernel BUG at lib/list_debug.c:31!
invalid opcode: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
    (ftrace buffer empty)
Modules linked in:
CPU: 0 PID: 4466 Comm: syzkaller556835 Not tainted 4.16.0+ #17
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
RIP: 0010:__list_add_valid+0xaa/0xb0 lib/list_debug.c:29
RSP: 0018:ffff8801b04bf248 EFLAGS: 00010286
RAX: 0000000000000058 RBX: ffff8801c8fc7a90 RCX: 0000000000000000
RDX: 0000000000000058 RSI: ffffffff815fbf41 RDI: ffffed0036097e3f
RBP: ffff8801b04bf260 R08: ffff8801b0b2a700 R09: ffffed003b604f90
R10: ffffed003b604f90 R11: ffff8801db027c87 R12: ffff8801c8fc7a90
R13: ffff8801c8fc7a90 R14: dffffc0000000000 R15: 0000000000000000
FS:  0000000000b98880(0000) GS:ffff8801db000000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000043fc30 CR3: 00000001afe8e000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
  __list_add include/linux/list.h:60 [inline]
  list_add include/linux/list.h:79 [inline]
  team_nl_cmd_options_set+0x9ff/0x12b0 drivers/net/team/team.c:2571
  genl_family_rcv_msg+0x889/0x1120 net/netlink/genetlink.c:599
  genl_rcv_msg+0xc6/0x170 net/netlink/genetlink.c:624
  netlink_rcv_skb+0x172/0x440 net/netlink/af_netlink.c:2448
  genl_rcv+0x28/0x40 net/netlink/genetlink.c:635
  netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline]
  netlink_unicast+0x58b/0x740 net/netlink/af_netlink.c:1336
  netlink_sendmsg+0x9f0/0xfa0 net/netlink/af_netlink.c:1901
  sock_sendmsg_nosec net/socket.c:629 [inline]
  sock_sendmsg+0xd5/0x120 net/socket.c:639
  ___sys_sendmsg+0x805/0x940 net/socket.c:2117
  __sys_sendmsg+0x115/0x270 net/socket.c:2155
  SYSC_sendmsg net/socket.c:2164 [inline]
  SyS_sendmsg+0x29/0x30 net/socket.c:2162
  do_syscall_64+0x29e/0x9d0 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x42/0xb7
RIP: 0033:0x4458b9
RSP: 002b:00007ffd1d4a7278 EFLAGS: 00000213 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 000000000000001b RCX: 00000000004458b9
RDX: 0000000000000010 RSI: 0000000020000d00 RDI: 0000000000000004
RBP: 00000000004a74ed R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000213 R12: 00007ffd1d4a7348
R13: 0000000000402a60 R14: 0000000000000000 R15: 0000000000000000
Code: 75 e8 eb a9 48 89 f7 48 89 75 e8 e8 d1 85 7b fe 48 8b 75 e8 eb bb 48
89 f2 48 89 d9 4c 89 e6 48 c7 c7 a0 84 d8 87 e8 ea 67 28 fe <0f> 0b 0f 1f
40 00 48 b8 00 00 00 00 00 fc ff df 55 48 89 e5 41
RIP: __list_add_valid+0xaa/0xb0 lib/list_debug.c:29 RSP: ffff8801b04bf248

This changeset addresses the avoiding list_add() if the current
option is already present in the event list.

Reported-and-tested-by: syzbot+4d4af685432dc0e56c91@syzkaller.appspotmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Fixes: 2fcdb2c9e659 ("team: allow to send multiple set events in one message")
---
 drivers/net/team/team.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
index a6c6ce19eeee..acbe84967834 100644
--- a/drivers/net/team/team.c
+++ b/drivers/net/team/team.c
@@ -261,6 +261,17 @@ static void __team_option_inst_mark_removed_port(struct team *team,
 	}
 }
 
+static bool __team_option_inst_tmp_find(const struct list_head *opts,
+					const struct team_option_inst *needle)
+{
+	struct team_option_inst *opt_inst;
+
+	list_for_each_entry(opt_inst, opts, tmp_list)
+		if (opt_inst == needle)
+			return true;
+	return false;
+}
+
 static int __team_options_register(struct team *team,
 				   const struct team_option *option,
 				   size_t option_count)
@@ -2568,6 +2579,14 @@ static int team_nl_cmd_options_set(struct sk_buff *skb, struct genl_info *info)
 			if (err)
 				goto team_put;
 			opt_inst->changed = true;
+
+			/* dumb/evil user-space can send us duplicate opt,
+			 * keep only the last one
+			 */
+			if (__team_option_inst_tmp_find(&opt_inst_list,
+							opt_inst))
+				continue;
+
 			list_add(&opt_inst->tmp_list, &opt_inst_list);
 		}
 		if (!opt_found) {
-- 
2.14.3

^ permalink raw reply related

* Re: net: hang in unregister_netdevice: waiting for lo to become free
From: Neil Horman @ 2018-04-13 11:37 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Tommi Rantala, Xin Long, David Ahern, Daniel Borkmann, Cong Wang,
	David Miller, Eric Dumazet, Willem de Bruijn, Jakub Kicinski,
	Rasmus Villemoes, netdev, LKML, Alexey Kuznetsov,
	Hideaki YOSHIFUJI, syzkaller, Dan Streetman, Eric W. Biederman,
	Alexey Kodanev, Marcelo Ricardo Leitner <marcelo.leit
In-Reply-To: <CACT4Y+Z2kdjsYUs-H9aSTcNW-FSKqm1__fg3eOq16H-E=X8oZg@mail.gmail.com>

On Thu, Apr 12, 2018 at 02:15:30PM +0200, Dmitry Vyukov wrote:
> On Wed, Feb 21, 2018 at 3:53 PM, Tommi Rantala
> <tommi.t.rantala@nokia.com> wrote:
> > On 20.02.2018 18:26, Neil Horman wrote:
> >>
> >> On Tue, Feb 20, 2018 at 09:14:41AM +0100, Dmitry Vyukov wrote:
> >>>
> >>> On Tue, Feb 20, 2018 at 8:56 AM, Tommi Rantala
> >>> <tommi.t.rantala@nokia.com> wrote:
> >>>>
> >>>> On 19.02.2018 20:59, Dmitry Vyukov wrote:
> >>>>>
> >>>>> Is this meant to be fixed already? I am still seeing this on the
> >>>>> latest upstream tree.
> >>>>>
> >>>>
> >>>> These two commits are in v4.16-rc1:
> >>>>
> >>>> commit 4a31a6b19f9ddf498c81f5c9b089742b7472a6f8
> >>>> Author: Tommi Rantala <tommi.t.rantala@nokia.com>
> >>>> Date:   Mon Feb 5 21:48:14 2018 +0200
> >>>>
> >>>>      sctp: fix dst refcnt leak in sctp_v4_get_dst
> >>>> ...
> >>>>      Fixes: 410f03831 ("sctp: add routing output fallback")
> >>>>      Fixes: 0ca50d12f ("sctp: fix src address selection if using
> >>>> secondary
> >>>> addresses")
> >>>>
> >>>>
> >>>> commit 957d761cf91cdbb175ad7d8f5472336a4d54dbf2
> >>>> Author: Alexey Kodanev <alexey.kodanev@oracle.com>
> >>>> Date:   Mon Feb 5 15:10:35 2018 +0300
> >>>>
> >>>>      sctp: fix dst refcnt leak in sctp_v6_get_dst()
> >>>> ...
> >>>>      Fixes: dbc2b5e9a09e ("sctp: fix src address selection if using
> >>>> secondary
> >>>> addresses for ipv6")
> >>>>
> >>>>
> >>>> I guess we missed something if it's still reproducible.
> >>>>
> >>>> I can check it later this week, unless someone else beat me to it.
> >>>
> >>>
> >>> Hi Tommi,
> >>>
> >>> Hmmm, I can't claim that it's exactly the same bug. Perhaps it's
> >>> another one then. But I am still seeing these:
> >>>
> >>> [   58.799130] unregister_netdevice: waiting for lo to become free.
> >>> Usage count = 4
> >>> [   60.847138] unregister_netdevice: waiting for lo to become free.
> >>> Usage count = 4
> >>> [   62.895093] unregister_netdevice: waiting for lo to become free.
> >>> Usage count = 4
> >>> [   64.943103] unregister_netdevice: waiting for lo to become free.
> >>> Usage count = 4
> >>>
> >>> on upstream tree pulled ~12 hours ago.
> >>>
> >> Can you write a systemtap script to probe dev_hold, and dev_put, printing
> >> out a
> >> backtrace if the device name matches "lo".  That should tell us
> >> definitively if
> >> the problem is in the same location or not
> >
> >
> > Hi Dmitry, I tested with the reproducer and the kernel .config file that you
> > sent in the first email in this thread:
> >
> > With 4.16-rc2 unable to reproduce.
> >
> > With 4.15-rc9 bug reproducible, and I get "unregister_netdevice: waiting for
> > lo to become free. Usage count = 3"
> >
> > With 4.15-rc9 and Alexey's "sctp: fix dst refcnt leak in sctp_v6_get_dst()"
> > cherry-picked on top, unable to reproduce.
> >
> >
> > Is syzkaller doing something else now to trigger the bug...?
> > Can you still trigger the bug with the same reproducer?
> 
> Hi Neil, Tommi,
> 
> Reviving this old thread about "unregister_netdevice: waiting for lo
> to become free. Usage count = 3" hangs.
> I still did not have time to deep dive into what happens there (too
> many bugs coming from syzbot). But this still actively happens and I
> suspect accounts to a significant portion of various hang reports,
> which are quite unpleasant.
> 
> One idea that could make it all simpler:
> 
> Is this wait loop in netdev_wait_allrefs() supposed to wait for any
> prolonged periods of time under any non-buggy conditions? E.g. more
> than 1-2 minutes?
As the name implies, its supposed to wait for the reference count to be zero
indefinately, but yes, under normal operation, its intended to not have to wait
very long at all.  The issuance of the NETDEV_UNREGISTER_FINAL notification is
meant to be a subscribable signal to any code path holding a reference that it
needs to be dropped so that the progress can be made.

Note that the "waiting for %s to become free" message is triggered after 10
seconds of waiting, and is likely the trigger you want, Its just an emergency
level log message rather a WARN.  I don't think we want to change that
permanently, but you could certainly alter it in the code to cause syzbot to
catch it (i.e. WARN_ON(time_after(jiffies, warning_time + 10 * HZ)) )


> If it only supposed to wait briefly for things that already supposed
> to be shutting down, and we add a WARNING there after some timeout,
> then syzbot will report all info how/when it happens, hopefully
> extracting reproducers, and all the nice things.
> But this WARNING should not have any false positives under any
> realistic conditions (e.g. waiting for arrival of remote packets with
> large timeouts).
> 
> Looking at some task hung reports, it seems that this code holds some
> mutexes, takes workqueue thread and prevents any progress with
> destruction of other devices (and net namespace creation/destruction),
> so I guess it should not wait for any indefinite periods of time?
Well, it drops everything and sleeps periodically, so its safe in and of itself.
The problem is its waiting for the reference count of a device to drop to zero,
and some other execution context isn't taking the appropriate action to do that.

Neil

> 

^ permalink raw reply

* Re: [net] xfrm: cover crypto status in xfrm_input
From: Steffen Klassert @ 2018-04-13 11:09 UTC (permalink / raw)
  To: Jacek Kalwas; +Cc: davem, netdev, intel-wired-lan
In-Reply-To: <20180412190315.3102-3-jacek.kalwas@intel.com>

On Thu, Apr 12, 2018 at 12:03:15PM -0700, Jacek Kalwas wrote:
> Status checking in xfrm_input doesn't cover CRYPTO_GENERIC_ERROR and
> CRYPTO_INVALID_PACKET_SYNTAX.
> 
> Given patch adds additional check for CRYPTO_INVALID_PACKET_SYNTAX and
> treats CRYPTO_GENERIC_ERROR as status matching LINUX_MIB_XFRMINERROR.
> 
> Signed-off-by: Jacek Kalwas <jacek.kalwas@intel.com>
> ---
>  net/xfrm/xfrm_input.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/net/xfrm/xfrm_input.c b/net/xfrm/xfrm_input.c
> index 352abca2605f..08d70ea774f9 100644
> --- a/net/xfrm/xfrm_input.c
> +++ b/net/xfrm/xfrm_input.c
> @@ -285,7 +285,12 @@ int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 spi, int encap_type)
>  					goto drop;
>  				}
>  
> -				XFRM_INC_STATS(net, LINUX_MIB_XFRMINBUFFERERROR);
> +				if (xo->status & CRYPTO_INVALID_PACKET_SYNTAX) {
> +					XFRM_INC_STATS(net, LINUX_MIB_XFRMINBUFFERERROR);
> +					goto drop;
> +				}

Please consider adding separate statistic counters for offloading.
Reusing some other counter does not make it more usfull as it is now.
Some time ago, each statistic counter was bumped at a unique place,
so it was easy to identify where the packet was dropped. Unfortunately 
this changed over the years. This was one of the concerns the userspace
IPsec developers had during the IPsec workshop we held recently. So I
think it is better to add new counters insted of reusing old ones here.

^ permalink raw reply

* Re: tcp hang when socket fills up ?
From: Dominique Martinet @ 2018-04-13  9:42 UTC (permalink / raw)
  To: netdev
In-Reply-To: <20180406090720.GA31845@nautica>


Note this is mostly me talking to myself, but in case anyone else hits
the same issues I might as well post my meagre progress.

Dominique Martinet wrote on Fri, Apr 06, 2018:
> (current kernel: vanilla 4.14.29)

reproduced in a fedora VM on that host with a 4.16.1-300.fc28.x86_64
kernel, since this one has DEBUG_INFO=y and I was lazy (but haven't seen
any patch about that kind of stuff recently so probably still valid)

Other main difference is the qdisc, VM is using fq_codel, host is
directly on wireless so mq with 4 pfifo_fast queues - it is harder to
reproduce on the VM (even on another VM with the same kernel) so I'd
put the difference down to the qdisc, but I was able to reproduce with
both ultimately.
(update: it actually was still fairly easy to reproduce until it got
later (coworkers left?), going from ~5-15s to reproduce to multiple
minutes, so this likely depends on net quality a lot. I couldn't
reproduce by fiddling with netem on a local network though...)

> [please find previous email for setup/tcpdump output]

So I have a crash dump with a socket / inet_sock that are blocked, since
this is a VM I even get gdb in bonus..

With the crash dump, I can confirm that the socket is not available for
writing (sk->sk_sndbuf - sk->sk_wmem_queued < sk->sk_wmem_queued >> 1),
but that doesn't help much if I can't tell why we're not taking acks in

With gdb, I set a breakpoint to tcp_ack (net/ipv4/tcp_input.c) as I
think that's the function that should handle my ack, and that gets the
replay.

First, abusing next, the flow seems to be something like
(I folded if/else not taken)

static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
{       
        struct inet_connection_sock *icsk = inet_csk(sk);
        struct tcp_sock *tp = tcp_sk(sk);
        struct tcp_sacktag_state sack_state;
        struct rate_sample rs = { .prior_delivered = 0 };
        u32 prior_snd_una = tp->snd_una;
        bool is_sack_reneg = tp->is_sack_reneg;
        u32 ack_seq = TCP_SKB_CB(skb)->seq;
        u32 ack = TCP_SKB_CB(skb)->ack_seq;
        bool is_dupack = false;
        int prior_packets = tp->packets_out;
        u32 delivered = tp->delivered;
        u32 lost = tp->lost;
        int rexmit = REXMIT_NONE; /* Flag to (re)transmit to recover
		losses */
        u32 prior_fack;

        sack_state.first_sackt = 0;
        sack_state.rate = &rs;

        /* We very likely will need to access rtx queue. */
        prefetch(sk->tcp_rtx_queue.rb_node);

        /* If the ack is older than previous acks
         * then we can probably ignore it.
         */
        if (before(ack, prior_snd_una)) {
		}

        /* If the ack includes data we haven't sent yet, discard
         * this segment (RFC793 Section 3.9).
         */
        if (after(ack, tp->snd_nxt))

        if (after(ack, prior_snd_una)) {
        }

        prior_fack = tcp_is_sack(tp) ? tcp_highest_sack_seq(tp) :
		tp->snd_una;
        rs.prior_in_flight = tcp_packets_in_flight(tp);

        /* ts_recent update must be made after we are sure that the
		packet                           
         * is in window.
         */
        if (flag & FLAG_UPDATE_TS_RECENT)

        if (!(flag & FLAG_SLOWPATH) && after(ack, prior_snd_una)) {
        } else {
                u32 ack_ev_flags = CA_ACK_SLOWPATH;

                if (ack_seq != TCP_SKB_CB(skb)->end_seq)
                else
                        NET_INC_STATS(sock_net(sk),
						LINUX_MIB_TCPPUREACKS);

                flag |= tcp_ack_update_window(sk, skb, ack, ack_seq);

                if (TCP_SKB_CB(skb)->sacked)

                if (tcp_ecn_rcv_ecn_echo(tp, tcp_hdr(skb))) {
                }

                if (flag & FLAG_WIN_UPDATE)

                tcp_in_ack_event(sk, ack_ev_flags);
        }
        
        /* We passed data and got it acked, remove any soft error
         * log. Something worked...
         */
        sk->sk_err_soft = 0;
        icsk->icsk_probes_out = 0;
        tp->rcv_tstamp = tcp_jiffies32;
        if (!prior_packets)
                goto no_queue;

no_queue:
        /* If data was DSACKed, see if we can undo a cwnd reduction. */
        if (flag & FLAG_DSACKING_ACK)
                tcp_fastretrans_alert(sk, prior_snd_una, is_dupack,
				&flag,
                                      &rexmit);
        /* If this ack opens up a zero window, clear backoff.  It was
         * being used to time the probes, and is probably far higher
		 than
         * it needs to be for normal retransmission.
         */
        tcp_ack_probe(sk);

        if (tp->tlp_high_seq)
                tcp_process_tlp_ack(sk, ack, flag);
        return 1;


And here is 'info local' towards the end of the function:
icsk = 0xffff88001740b800
tp = 0xffff88001740b800
sack_state = {reord = 84, first_sackt = 0, last_sackt = 0, rate = 0xffff88003fc83ae0, flag = 0, mss_now = 0}
rs = {prior_mstamp = 0, prior_delivered = 0, delivered = 0, interval_us = 0, rtt_us = 0, losses = 0, acked_sacked = 0, prior_in_flight = 0, is_app_limited = false, is_retrans = false, is_ack_delayed = false}
prior_snd_una = 2896339291
is_sack_reneg = false
ack_seq = 2651638615
ack = 2896339291
is_dupack = false
prior_packets = 0
delivered = 172
lost = 0
rexmit = 0
prior_fack = 2896339291
__vpp_verify = <optimized out>
pao_ID__ = <optimized out>
pao_tmp__ = <optimized out>
pao_ID__ = <optimized out>
pao_tmp__ = <optimized out>
pao_ID__ = <optimized out>
pao_tmp__ = <optimized out>


So as I was seeing, we have ack = prior_fack = prior_snd_una (which is > ack_seq)

We're in the !prior_packets goto to no_queue though so we don't pass
tcp_clean_rtx_queue - so that explains why the replays don't help. I'm
ok with that. But all these traces actually don't tell me why the first
time we saw that packed didn't take the ack and decreate sk_wmem_queued
properly.

This isn't easy enough to reproduce that I feel confident grabbing the
first call to that function that fails though, so I guess I'll actually
have to read the code I'm debugging instead of trying to just plow
through it...

Hints still welcome, if anyone has an idea.

-- 
Dominique Martinet | Asmadeus

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox