Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: Build error for samples/bpf/ due to commit d0266046ad54 ("x86: Remove FAST_FEATURE_TESTS")
From: Alexei Starovoitov @ 2018-04-13 15:07 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Peter Zijlstra, Daniel Borkmann, netdev@vger.kernel.org, LKML,
	Björn Töpel
In-Reply-To: <20180413152237.3727911f@redhat.com>

On Fri, Apr 13, 2018 at 03:22:37PM +0200, Jesper Dangaard Brouer wrote:
> Hi Peter,
> 
> Your commit d0266046ad54 ("x86: Remove FAST_FEATURE_TESTS") broke build
> for several samples/bpf programs. I'm unsure what the best way forward
> is to unbreak these...
> 
> The issue is that these samples are build with LLVM/clang (which
> doesn't like 'asm goto' constructs).  And they end up including
> arch/x86/include/asm/cpufeature.h via a long include path, see build
> examples below (through different path to include/linux/thread_info.h).
> 
> Maybe Alexei or Daniel have an idea how to work around this?
> As tools/testing/selftests/bpf/ does not seem to fail!?

Right. All of bpf tracing and samples/bpf/ broke.
Here is the proposed fix that we're asking Peter to apply and send to Linus asap.
https://lkml.org/lkml/2018/4/10/825

> Build error#1:
> --------------
> clang  -nostdinc -isystem /usr/lib/gcc/x86_64-redhat-linux/7/include -I./arch/x86/include -I./arch/x86/include/generated  -I./include -I./arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I./include/uapi -I./include/generated/uapi -include ./include/linux/kconfig.h  -Isamples/bpf \
>         -I./tools/testing/selftests/bpf/ \
>         -D__KERNEL__ -Wno-unused-value -Wno-pointer-sign \
>         -D__TARGET_ARCH_x86 -Wno-compare-distinct-pointer-types \
>         -Wno-gnu-variable-sized-type-not-at-end \
>         -Wno-address-of-packed-member -Wno-tautological-compare \
>         -Wno-unknown-warning-option  \
>         -O2 -emit-llvm -c samples/bpf/sockex2_kern.c -o -| llc -march=bpf -filetype=obj -o samples/bpf/sockex2_kern.o
> In file included from samples/bpf/sockex2_kern.c:3:
> In file included from ./include/uapi/linux/in.h:24:
> In file included from ./include/linux/socket.h:8:
> In file included from ./include/linux/uio.h:13:
> In file included from ./include/linux/thread_info.h:38:
> In file included from ./arch/x86/include/asm/thread_info.h:53:
> ./arch/x86/include/asm/cpufeature.h:150:2: error: 'asm goto' constructs are not supported yet
>         asm_volatile_goto("1: jmp 6f\n"
>         ^
> ./include/linux/compiler-gcc.h:290:42: note: expanded from macro 'asm_volatile_goto'
> #define asm_volatile_goto(x...) do { asm goto(x); asm (""); } while (0)

^ permalink raw reply

* Re: [RFC v2] virtio: support packed ring
From: Michael S. Tsirkin @ 2018-04-13 15:22 UTC (permalink / raw)
  To: Tiwei Bie; +Cc: jasowang, wexu, virtualization, linux-kernel, netdev, jfreimann
In-Reply-To: <20180401141216.8969-1-tiwei.bie@intel.com>

On Sun, Apr 01, 2018 at 10:12:16PM +0800, Tiwei Bie wrote:
> +static inline bool more_used(const struct vring_virtqueue *vq)
> +{
> +	return vq->packed ? more_used_packed(vq) : more_used_split(vq);
> +}
> +
> +void *virtqueue_get_buf_ctx_split(struct virtqueue *_vq, unsigned int *len,
> +				  void **ctx)
> +{
> +	struct vring_virtqueue *vq = to_vvq(_vq);
> +	void *ret;
> +	unsigned int i;
> +	u16 last_used;
> +
> +	START_USE(vq);
> +
> +	if (unlikely(vq->broken)) {
> +		END_USE(vq);
> +		return NULL;
> +	}
> +
> +	if (!more_used(vq)) {
> +		pr_debug("No more buffers in queue\n");
> +		END_USE(vq);
> +		return NULL;
> +	}

So virtqueue_get_buf_ctx_split should only call more_used_split.

to avoid such issues I think we should lay out the code like this:

XXX_split

XXX_packed

XXX wrappers

> +/* The standard layout

I'd drop standard here.

> for the packed ring is a continuous chunk of memory
> + * which looks like this.
> + *
> + * struct vring_packed
> + * {

Can the opening bracket go on the prev line pls?

> + *	// The actual descriptors (16 bytes each)
> + *	struct vring_packed_desc desc[num];
> + *
> + *	// Padding to the next align boundary.
> + *	char pad[];
> + *
> + *	// Driver Event Suppression
> + *	struct vring_packed_desc_event driver;
> + *
> + *	// Device Event Suppression
> + *	struct vring_packed_desc_event device;

Maybe that's how our driver does it but it's not based on spec
so I don't think this belongs in the header.

> + * };
> + */
> +
> +static inline unsigned vring_packed_size(unsigned int num, unsigned long align)
> +{
> +	return ((sizeof(struct vring_packed_desc) * num + align - 1)
> +		& ~(align - 1)) + sizeof(struct vring_packed_desc_event) * 2;
> +}
> +

Cant say this API makes sense for me.


>  #endif /* _UAPI_LINUX_VIRTIO_RING_H */
> -- 
> 2.11.0

^ permalink raw reply

* Re: [PATCH net] virtio-net: add missing virtqueue kick when flushing packets
From: David Miller @ 2018-04-13 15:27 UTC (permalink / raw)
  To: jasowang; +Cc: mst, virtualization, netdev, linux-kernel, ktaka, daniel
In-Reply-To: <1523602705-5155-1-git-send-email-jasowang@redhat.com>

From: Jason Wang <jasowang@redhat.com>
Date: Fri, 13 Apr 2018 14:58:25 +0800

> We tends to batch submitting packets during XDP_TX. This requires to
> kick virtqueue after a batch, we tried to do it through
> xdp_do_flush_map() which only makes sense for devmap not XDP_TX. So
> explicitly kick the virtqueue in this case.
> 
> Reported-by: Kimitoshi Takahashi <ktaka@nii.ac.jp>
> Tested-by: Kimitoshi Takahashi <ktaka@nii.ac.jp>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Fixes: 186b3c998c50 ("virtio-net: support XDP_REDIRECT")
> Signed-off-by: Jason Wang <jasowang@redhat.com>

Applied and queued up for -stable, thanks Jason.

^ permalink raw reply

* RE: [PATCH v5 05/14] PCI: Add pcie_print_link_status() to log link speed and whether it's limited
From: Keller, Jacob E @ 2018-04-13 15:34 UTC (permalink / raw)
  To: Bjorn Helgaas, Jakub Kicinski
  Cc: Tal Gilboa, Tariq Toukan, Ariel Elior, Ganesh Goudar,
	Kirsher, Jeffrey T, everest-linux-l2@cavium.com,
	intel-wired-lan@lists.osuosl.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org
In-Reply-To: <20180413140633.GD46420@bhelgaas-glaptop.roam.corp.google.com>



> -----Original Message-----
> From: Bjorn Helgaas [mailto:helgaas@kernel.org]
> Sent: Friday, April 13, 2018 7:07 AM
> To: Jakub Kicinski <kubakici@wp.pl>
> Cc: Tal Gilboa <talgi@mellanox.com>; Tariq Toukan <tariqt@mellanox.com>;
> Keller, Jacob E <jacob.e.keller@intel.com>; Ariel Elior <ariel.elior@cavium.com>;
> Ganesh Goudar <ganeshgr@chelsio.com>; Kirsher, Jeffrey T
> <jeffrey.t.kirsher@intel.com>; everest-linux-l2@cavium.com; intel-wired-
> lan@lists.osuosl.org; netdev@vger.kernel.org; linux-kernel@vger.kernel.org;
> linux-pci@vger.kernel.org
> Subject: Re: [PATCH v5 05/14] PCI: Add pcie_print_link_status() to log link speed
> and whether it's limited
> 
> On Thu, Apr 12, 2018 at 09:32:49PM -0700, Jakub Kicinski wrote:
> > On Fri, 30 Mar 2018 16:05:18 -0500, Bjorn Helgaas wrote:
> > > +	if (bw_avail >= bw_cap)
> > > +		pci_info(dev, "%d Mb/s available bandwidth (%s x%d link)\n",
> > > +			 bw_cap, PCIE_SPEED2STR(speed_cap), width_cap);
> > > +	else
> > > +		pci_info(dev, "%d Mb/s available bandwidth, limited by %s x%d
> link at %s (capable of %d Mb/s with %s x%d link)\n",
> > > +			 bw_avail, PCIE_SPEED2STR(speed), width,
> > > +			 limiting_dev ? pci_name(limiting_dev) : "<unknown>",
> > > +			 bw_cap, PCIE_SPEED2STR(speed_cap), width_cap);
> >
> > I was just looking at using this new function to print PCIe BW for a
> > NIC, but I'm slightly worried that there is nothing in the message that
> > says PCIe...  For a NIC some people may interpret the bandwidth as NIC
> > bandwidth:
> >
> > [   39.839989] nfp 0000:04:00.0: Netronome Flow Processor NFP4000/NFP6000
> PCIe Card Probe
> > [   39.848943] nfp 0000:04:00.0: 63.008 Gb/s available bandwidth (8 GT/s x8 link)
> > [   39.857146] nfp 0000:04:00.0: RESERVED BARs: 0.0: General/MSI-X SRAM, 0.1:
> PCIe XPB/MSI-X PBA, 0.4: Explicit0, 0.5: Explicit1, fre4
> >
> > It's not a 63Gbps NIC...  I'm sorry if this was discussed before and I
> > didn't find it.  Would it make sense to add the "PCIe: " prefix to the
> > message like bnx2x used to do?  Like:
> >
> > nfp 0000:04:00.0: PCIe: 63.008 Gb/s available bandwidth (8 GT/s x8 link)
> 
> I agree, that does look potentially confusing.  How about this:
> 
>   nfp 0000:04:00.0: 63.008 Gb/s available PCIe bandwidth (8 GT/s x8 link)
> 
> I did have to look twice at this before I remembered that we're
> printing Gb/s (not GB/s).  Most of the references I found on the web
> use GB/s when talking about total PCIe bandwidth.
> 
> But either way I think it's definitely worth mentioning PCIe
> explicitly.

I also agree printing PCIe explicitly is good.

Thanks,
Jake

^ permalink raw reply

* Re: net: hang in unregister_netdevice: waiting for lo to become free
From: Dmitry Vyukov @ 2018-04-13 15:54 UTC (permalink / raw)
  To: Dan Streetman
  Cc: Tommi Rantala, Neil Horman, Xin Long, David Ahern,
	Daniel Borkmann, Cong Wang, David Miller, Eric Dumazet,
	Willem de Bruijn, Jakub Kicinski, Rasmus Villemoes, netdev, LKML,
	Alexey Kuznetsov, Hideaki YOSHIFUJI, syzkaller, Dan Streetman,
	Eric W. Biederman, Alexey Kodanev
In-Reply-To: <CALZtONBFdVsa_dqKTpzAp2wVPPfP4N=P7HT_e5e0vy7oAoxVew@mail.gmail.com>

On Fri, Apr 13, 2018 at 2:43 PM, Dan Streetman <ddstreet@ieee.org> wrote:
> On Thu, Apr 12, 2018 at 8:15 AM, Dmitry Vyukov <dvyukov@google.com> wrote:
>> On Wed, Feb 21, 2018 at 3:53 PM, Tommi Rantala
>> <tommi.t.rantala@nokia.com> wrote:
>>> On 20.02.2018 18:26, Neil Horman wrote:
>>>>
>>>> On Tue, Feb 20, 2018 at 09:14:41AM +0100, Dmitry Vyukov wrote:
>>>>>
>>>>> On Tue, Feb 20, 2018 at 8:56 AM, Tommi Rantala
>>>>> <tommi.t.rantala@nokia.com> wrote:
>>>>>>
>>>>>> On 19.02.2018 20:59, Dmitry Vyukov wrote:
>>>>>>>
>>>>>>> Is this meant to be fixed already? I am still seeing this on the
>>>>>>> latest upstream tree.
>>>>>>>
>>>>>>
>>>>>> These two commits are in v4.16-rc1:
>>>>>>
>>>>>> commit 4a31a6b19f9ddf498c81f5c9b089742b7472a6f8
>>>>>> Author: Tommi Rantala <tommi.t.rantala@nokia.com>
>>>>>> Date:   Mon Feb 5 21:48:14 2018 +0200
>>>>>>
>>>>>>      sctp: fix dst refcnt leak in sctp_v4_get_dst
>>>>>> ...
>>>>>>      Fixes: 410f03831 ("sctp: add routing output fallback")
>>>>>>      Fixes: 0ca50d12f ("sctp: fix src address selection if using
>>>>>> secondary
>>>>>> addresses")
>>>>>>
>>>>>>
>>>>>> commit 957d761cf91cdbb175ad7d8f5472336a4d54dbf2
>>>>>> Author: Alexey Kodanev <alexey.kodanev@oracle.com>
>>>>>> Date:   Mon Feb 5 15:10:35 2018 +0300
>>>>>>
>>>>>>      sctp: fix dst refcnt leak in sctp_v6_get_dst()
>>>>>> ...
>>>>>>      Fixes: dbc2b5e9a09e ("sctp: fix src address selection if using
>>>>>> secondary
>>>>>> addresses for ipv6")
>>>>>>
>>>>>>
>>>>>> I guess we missed something if it's still reproducible.
>>>>>>
>>>>>> I can check it later this week, unless someone else beat me to it.
>>>>>
>>>>>
>>>>> Hi Tommi,
>>>>>
>>>>> Hmmm, I can't claim that it's exactly the same bug. Perhaps it's
>>>>> another one then. But I am still seeing these:
>>>>>
>>>>> [   58.799130] unregister_netdevice: waiting for lo to become free.
>>>>> Usage count = 4
>>>>> [   60.847138] unregister_netdevice: waiting for lo to become free.
>>>>> Usage count = 4
>>>>> [   62.895093] unregister_netdevice: waiting for lo to become free.
>>>>> Usage count = 4
>>>>> [   64.943103] unregister_netdevice: waiting for lo to become free.
>>>>> Usage count = 4
>>>>>
>>>>> on upstream tree pulled ~12 hours ago.
>>>>>
>>>> Can you write a systemtap script to probe dev_hold, and dev_put, printing
>>>> out a
>>>> backtrace if the device name matches "lo".  That should tell us
>>>> definitively if
>>>> the problem is in the same location or not
>>>
>>>
>>> Hi Dmitry, I tested with the reproducer and the kernel .config file that you
>>> sent in the first email in this thread:
>>>
>>> With 4.16-rc2 unable to reproduce.
>>>
>>> With 4.15-rc9 bug reproducible, and I get "unregister_netdevice: waiting for
>>> lo to become free. Usage count = 3"
>>>
>>> With 4.15-rc9 and Alexey's "sctp: fix dst refcnt leak in sctp_v6_get_dst()"
>>> cherry-picked on top, unable to reproduce.
>>>
>>>
>>> Is syzkaller doing something else now to trigger the bug...?
>>> Can you still trigger the bug with the same reproducer?
>>
>> Hi Neil, Tommi,
>>
>> Reviving this old thread about "unregister_netdevice: waiting for lo
>> to become free. Usage count = 3" hangs.
>> I still did not have time to deep dive into what happens there (too
>> many bugs coming from syzbot). But this still actively happens and I
>> suspect accounts to a significant portion of various hang reports,
>> which are quite unpleasant.
>>
>> One idea that could make it all simpler:
>>
>> Is this wait loop in netdev_wait_allrefs() supposed to wait for any
>> prolonged periods of time under any non-buggy conditions? E.g. more
>> than 1-2 minutes?
>> If it only supposed to wait briefly for things that already supposed
>> to be shutting down, and we add a WARNING there after some timeout,
>> then syzbot will report all info how/when it happens, hopefully
>> extracting reproducers, and all the nice things.
>> But this WARNING should not have any false positives under any
>> realistic conditions (e.g. waiting for arrival of remote packets with
>> large timeouts).
>>
>> Looking at some task hung reports, it seems that this code holds some
>> mutexes, takes workqueue thread and prevents any progress with
>> destruction of other devices (and net namespace creation/destruction),
>> so I guess it should not wait for any indefinite periods of time?
>
> I'm working on this currently:
> https://bugs.launchpad.net/ubuntu/zesty/+source/linux/+bug/1711407
>
> I added a summary of what I've found to be the cause (or at least, one
> possible cause) of this:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407/comments/72
>
> I'm working on a patch to work around the main side-effect of this,
> which is hanging while holding the global net mutex.  Hangs will still
> happen (e.g. if a dst leaks) but should not affect anything else,
> other than a leak of the dst and its net namespace.
>
> Fixing the dst leaks is important too, of course, but a dst leak (or
> other cause) shouldn't break the entire system.

Leaking some memory is definitely better than hanging the system.

So I've made syzkaller to recognize "unregister_netdevice: waiting for
(.*) to become free" as a kernel bug:
https://github.com/google/syzkaller/commit/7a67784ca8bdc3b26cce2f0ec9a40d2dd9ec9396
Unfortunately it does not make it catch these bugs because creating a
net namespace per test is too damn slow, so namespaces are reused for
lots of tests and when/if it's eventually destroyed it's already too
late to find root cause.

But I've run a one-off experiment with prompt net namespace
destruction and syzkaller was able to easily extract a C reproducer:
https://gist.githubusercontent.com/dvyukov/d571e8fff24e127ca48a8c4790d42bfa/raw/52050e93ba9afbb5126b9d7bb39b7e71a82af016/gistfile1.txt

On upstream 16e205cf42da1f497b10a4a24f563e6c0d574eec with this config:
https://gist.githubusercontent.com/dvyukov/9663c57443adb21f2795b92ef0829d62/raw/bbea0652e23746096dd56855a28f6c681aebcdee/gistfile1.txt

this gives me:

[   83.183198] unregister_netdevice: waiting for lo to become free.
Usage count = 9
[   85.231202] unregister_netdevice: waiting for lo to become free.
Usage count = 9
...
[  523.511205] unregister_netdevice: waiting for lo to become free.
Usage count = 9
...

This is generated from this syzkaller program:

r0 = socket$inet6(0xa, 0x1, 0x84)
setsockopt$inet6_IPV6_XFRM_POLICY(r0, 0x29, 0x23,
&(0x7f0000000380)={{{@in6=@remote={0xfe, 0x80, [], 0xbb},
@in=@dev={0xac, 0x14, 0x14}, 0x0, 0x0, 0x0, 0x0, 0xa}, {}, {}, 0x0,
0x0, 0x1}, {{@in=@local={0xac, 0x14, 0x14, 0xaa}, 0x0, 0x32}, 0x0,
@in=@local={0xac, 0x14, 0x14, 0xaa}, 0x3504}}, 0xe8)
bind$inet6(r0, &(0x7f0000000000)={0xa, 0x4e20}, 0x1c)
connect$inet(r0, &(0x7f0000000040)={0x2, 0x4e20, @dev={0xac, 0x14,
0x14, 0xd}}, 0x10)
syz_emit_ethernet(0x3e, &(0x7f00000001c0)={@local={[0xaa, 0xaa, 0xaa,
0xaa, 0xaa], 0xaa}, @dev={[0xaa, 0xaa, 0xaa, 0xaa, 0xaa]}, [],
{@ipv6={0x86dd, {0x0, 0x6, "50a09c", 0x8, 0xffffff11, 0x0,
@remote={0xfe, 0x80, [], 0xbb}, @local={0xfe, 0x80, [], 0xaa}, {[],
@udp={0x0, 0x4e20, 0x8}}}}}}, &(0x7f0000000040))

So this seems to be related to IPv6 and/or xfrm and is potentially
caused by external packets (that syz_emit_ethernet call).

^ permalink raw reply

* Re: [PATCH net 2/2] sfc: limit ARFS workitems in flight per channel
From: Edward Cree @ 2018-04-13 15:59 UTC (permalink / raw)
  To: David Miller; +Cc: linux-net-drivers, netdev
In-Reply-To: <20180413.110359.1956367740053085545.davem@davemloft.net>

On 13/04/18 16:03, David Miller wrote:
> Whilst you may not be able to program the filter into the hardware
> synchronously, you should be able to allocate the ID and get all of
> the software state setup.
That's what we were doing before commit 3af0f34290f6 ("sfc: replace
 asynchronous filter operations"), and (as mentioned in that commit)
 that leads to (or at least the scheme we used had) race conditions
 which I could not see a way to fix.  If the hardware resets (and
 thus forgets all its filters) after we've done the software state
 setup but before we reach the point of finalising the software state
 after the hardware operation, we don't know what operations we need
 to do to re-apply the software state to the hardware, because we
 don't know whether the reset happened before or after the hardware
 operation.
Actually, it's not the insertion itself that this can happen to - if
 the reset happens first then the insert will fail because other
 hardware state we reference isn't set up yet.  However, inserting
 one filter can necessitate removing some others (lower-priority
 multicast filters for the same flow); the code before 3af0f34290f6
 was actually capable of getting so confused that it could double-
 free a pointer.
This all gets sufficiently complicated that even if I can find a
 locking scheme that in theory gets it right, there's pretty much a
 100% chance that the actual implementation will contain bugs,
 probably very subtle ones that can't reliably be reproduced etc.
 All my instincts say to get away from that if at all possible.

> People really aren't going to be happy if their performance goes into
> the tank because their filter update rate, for whatever reason, hits
> this magic backlog limit.
Well, the alternative, even if the software state setup part _could_
 be made synchronous, is to allow a potentially unbounded queue for
 the hardware update part (I think there are even still cases in
 which the exponential growth pathology is possible), causing the
 filter insertions to be delayed an arbitrarily long time.  Either
 the flow is still going by that time (in which case the backlog
 limit approach will get a new ndo_rx_flow_steer request and insert
 the filter too) or it isn't, in which case getting round to it
 eventually is no better than dropping it immediately.  In fact it's
 worse because now you waste time inserting a useless filter which
 delays new requests even more.
Besides, I'm fairly confident that the only cases in which you'll
 even come close to hitting the limit are ones where ARFS wouldn't
 do you much good anyway, such as:
* Misconfigured interrupt affinities where ARFS is entirely pointless
* Many short-lived flows (which greatly diminish the utility of ARFS)

So for multiple reasons, hitting the limit won't actually make
 performance worse, although it will often be a sign that performance
 will be bad for other reasons.

-Ed

^ permalink raw reply

* Re: [PATCH net 1/3] l2tp: hold reference on tunnels in netlink dumps
From: Guillaume Nault @ 2018-04-13 16:09 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, jchapman
In-Reply-To: <20180413.105703.352607094166315307.davem@davemloft.net>

On Fri, Apr 13, 2018 at 10:57:03AM -0400, David Miller wrote:
> From: Guillaume Nault <g.nault@alphalink.fr>
> Date: Thu, 12 Apr 2018 20:50:33 +0200
> 
> > l2tp_tunnel_find_nth() is unsafe: no reference is held on the returned
> > tunnel, therefore it can be freed whenever the caller uses it.
> > This patch defines l2tp_tunnel_get_nth() which works similarly, but
> > also takes a reference on the returned tunnel. The caller then has to
> > drop it after it stops using the tunnel.
> > 
> > Convert netlink dumps to make them safe against concurrent tunnel
> > deletion.
> > 
> > Fixes: 309795f4bec2 ("l2tp: Add netlink control API for L2TP")
> > Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
> 
> During the entire invocation of l2tp_nl_cmd_tunnel_dump(), the RTNL
> mutex is held.
> 
> Therefore no tunnel configuration changes may occur and the tunnel
> object will persist and is safe to access.
> 
Yes, but only for updates done with the genl API. For L2TPv2, the
tunnel can be created by connecting a PPPOL2TP and a UDP socket.
Closing these sockets destroys the tunnel without any RTNL
synchronisation.

> The netlink dump should be safe as-is.
> 
> Were you actually able to trigger a crash or KASAN warning or is
> this purely from code inspection?
> 
Yes, I have a KASAN use-after-free for this case. I remember I saw a
few complains about stack traces in commit messages, so I've stopped
putting them there. I can paste (stripped) traces again. Just let me
know if you have any preference.

Guillaume

^ permalink raw reply

* Re: [PATCH net 2/2] sfc: limit ARFS workitems in flight per channel
From: David Miller @ 2018-04-13 16:14 UTC (permalink / raw)
  To: ecree; +Cc: linux-net-drivers, netdev
In-Reply-To: <d56e5a40-48cc-d984-c5cd-e18fbc411ed3@solarflare.com>

From: Edward Cree <ecree@solarflare.com>
Date: Fri, 13 Apr 2018 16:59:07 +0100

> On 13/04/18 16:03, David Miller wrote:
>> Whilst you may not be able to program the filter into the hardware
>> synchronously, you should be able to allocate the ID and get all of
>> the software state setup.
> That's what we were doing before commit 3af0f34290f6 ("sfc: replace
>  asynchronous filter operations"), and (as mentioned in that commit)
>  that leads to (or at least the scheme we used had) race conditions
>  which I could not see a way to fix.  If the hardware resets (and
>  thus forgets all its filters) after we've done the software state
>  setup but before we reach the point of finalising the software state
>  after the hardware operation, we don't know what operations we need
>  to do to re-apply the software state to the hardware, because we
>  don't know whether the reset happened before or after the hardware
>  operation.

When an entry is successfully programmed into the chip, you update
the software state.

When the chip resets, you clear all of those state booleans to false.

Indeed, you would have to synchronize these things somehow.

Is the issue that you learn about the hardware reset asynchronously,
and therefore cannot determine if filter insertion programming
happened afterwards and thus is still in the chip?

You must have a table of all the entries, so that you can reprogram
the hardware should it reset.  Or do you not handle things that way
and it's a lossy system?

> Well, the alternative, even if the software state setup part _could_
>  be made synchronous, is to allow a potentially unbounded queue for
>  the hardware update part (I think there are even still cases in
>  which the exponential growth pathology is possible), causing the
>  filter insertions to be delayed an arbitrarily long time.  Either
>  the flow is still going by that time (in which case the backlog
>  limit approach will get a new ndo_rx_flow_steer request and insert
>  the filter too) or it isn't, in which case getting round to it
>  eventually is no better than dropping it immediately.  In fact it's
>  worse because now you waste time inserting a useless filter which
>  delays new requests even more.
> Besides, I'm fairly confident that the only cases in which you'll
>  even come close to hitting the limit are ones where ARFS wouldn't
>  do you much good anyway, such as:
> * Misconfigured interrupt affinities where ARFS is entirely pointless
> * Many short-lived flows (which greatly diminish the utility of ARFS)
> 
> So for multiple reasons, hitting the limit won't actually make
>  performance worse, although it will often be a sign that performance
>  will be bad for other reasons.

Understood, thanks for explaining.

Please respin your series with the updates you talked about and I'll
apply it.

But generally we do have this issue with various kinds of
configuration programming and async vs. sync.

^ permalink raw reply

* Re: [PATCH net 1/3] l2tp: hold reference on tunnels in netlink dumps
From: David Miller @ 2018-04-13 16:15 UTC (permalink / raw)
  To: g.nault; +Cc: netdev, jchapman
In-Reply-To: <20180413160912.GA1405@alphalink.fr>

From: Guillaume Nault <g.nault@alphalink.fr>
Date: Fri, 13 Apr 2018 18:09:12 +0200

> On Fri, Apr 13, 2018 at 10:57:03AM -0400, David Miller wrote:
>> From: Guillaume Nault <g.nault@alphalink.fr>
>> Date: Thu, 12 Apr 2018 20:50:33 +0200
>> 
>> > l2tp_tunnel_find_nth() is unsafe: no reference is held on the returned
>> > tunnel, therefore it can be freed whenever the caller uses it.
>> > This patch defines l2tp_tunnel_get_nth() which works similarly, but
>> > also takes a reference on the returned tunnel. The caller then has to
>> > drop it after it stops using the tunnel.
>> > 
>> > Convert netlink dumps to make them safe against concurrent tunnel
>> > deletion.
>> > 
>> > Fixes: 309795f4bec2 ("l2tp: Add netlink control API for L2TP")
>> > Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
>> 
>> During the entire invocation of l2tp_nl_cmd_tunnel_dump(), the RTNL
>> mutex is held.
>> 
>> Therefore no tunnel configuration changes may occur and the tunnel
>> object will persist and is safe to access.
>> 
> Yes, but only for updates done with the genl API. For L2TPv2, the
> tunnel can be created by connecting a PPPOL2TP and a UDP socket.
> Closing these sockets destroys the tunnel without any RTNL
> synchronisation.

Right, that's the part I missed.  Thanks for explaining.

^ permalink raw reply

* Re: [PATCH net 2/2] sfc: limit ARFS workitems in flight per channel
From: Edward Cree @ 2018-04-13 16:24 UTC (permalink / raw)
  To: David Miller; +Cc: linux-net-drivers, netdev
In-Reply-To: <20180413.121415.787716401343641542.davem@davemloft.net>

On 13/04/18 17:14, David Miller wrote:
> Is the issue that you learn about the hardware reset asynchronously,
> and therefore cannot determine if filter insertion programming
> happened afterwards and thus is still in the chip?
Yes, pretty much.

> You must have a table of all the entries, so that you can reprogram
> the hardware should it reset.
Yes, we do have such a table; 'reprogram the hardware' happens in
 efx_ef10_filter_table_restore().

> Understood, thanks for explaining.
>
> Please respin your series with the updates you talked about and I'll
> apply it.
Will do, thanks.

-Ed

^ permalink raw reply

* Re: tcp hang when socket fills up ?
From: Michal Kubecek @ 2018-04-13 16:32 UTC (permalink / raw)
  To: netdev; +Cc: Dominique Martinet
In-Reply-To: <20180406090720.GA31845@nautica>

On Fri, Apr 06, 2018 at 11:07:20AM +0200, Dominique Martinet wrote:
> 16:49:26.735042 IP <server local ip>.13317 > <client public ip>.31872: Flags [.], seq 70476:71850, ack 4190, win 307, options [nop,nop,TS val 1313937641 ecr 1617129473], length 1374
> 16:49:26.735046 IP <server local ip>.13317 > <client public ip>.31872: Flags [.], seq 71850:73224, ack 4190, win 307, options [nop,nop,TS val 1313937641 ecr 1617129473], length 1374
> 16:49:26.735334 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 41622, win 918, options [nop,nop,TS val 1617129478 ecr 1313937609], length 0
> 16:49:26.736005 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 42996, win 940, options [nop,nop,TS val 1617129478 ecr 1313937609], length 0
> 16:49:26.736402 IP <server local ip>.13317 > <client public ip>.31872: Flags [.], seq 73224:74598, ack 4190, win 307, options [nop,nop,TS val 1313937643 ecr 1617129473], length 1374
> 16:49:26.736408 IP <server local ip>.13317 > <client public ip>.31872: Flags [.], seq 74598:75972, ack 4190, win 307, options [nop,nop,TS val 1313937643 ecr 1617129473], length 1374
> 16:49:26.738561 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 44370, win 963, options [nop,nop,TS val 1617129482 ecr 1313937616], length 0
> 16:49:26.739539 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 45744, win 986, options [nop,nop,TS val 1617129482 ecr 1313937616], length 0
> 16:49:26.739882 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 47118, win 1008, options [nop,nop,TS val 1617129484 ecr 1313937617], length 0
> 16:49:26.740255 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 48492, win 1031, options [nop,nop,TS val 1617129484 ecr 1313937617], length 0
> 16:49:26.746756 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 49866, win 1053, options [nop,nop,TS val 1617129493 ecr 1313937627], length 0
> 16:49:26.747923 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 51240, win 1076, options [nop,nop,TS val 1617129494 ecr 1313937627], length 0
> 16:49:26.749083 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 52614, win 1099, options [nop,nop,TS val 1617129495 ecr 1313937629], length 0
> 16:49:26.750171 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 53988, win 1121, options [nop,nop,TS val 1617129496 ecr 1313937629], length 0
> 16:49:26.750808 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 55362, win 1144, options [nop,nop,TS val 1617129497 ecr 1313937629], length 0
> 16:49:26.754648 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 56736, win 1167, options [nop,nop,TS val 1617129500 ecr 1313937629], length 0
> 16:49:26.755985 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 58110, win 1189, options [nop,nop,TS val 1617129501 ecr 1313937630], length 0
> 16:49:26.758513 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 59484, win 1212, options [nop,nop,TS val 1617129502 ecr 1313937630], length 0
> 16:49:26.759096 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 60858, win 1234, options [nop,nop,TS val 1617129503 ecr 1313937635], length 0
> 16:49:26.759421 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 62232, win 1257, options [nop,nop,TS val 1617129503 ecr 1313937635], length 0
> 16:49:26.759755 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 63606, win 1280, options [nop,nop,TS val 1617129504 ecr 1313937636], length 0
> 16:49:26.760653 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 64980, win 1302, options [nop,nop,TS val 1617129505 ecr 1313937636], length 0
> 16:49:26.761453 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 66354, win 1325, options [nop,nop,TS val 1617129506 ecr 1313937638], length 0
> 16:49:26.762199 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 67728, win 1348, options [nop,nop,TS val 1617129507 ecr 1313937638], length 0
> 16:49:26.763547 IP <client public ip>.31872 > <server local ip>.13317: Flags [P.], seq 4190:4226, ack 67728, win 1348, options [nop,nop,TS val 1617129507 ecr 1313937638], length 36
> 16:49:26.763553 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 70476, win 1393, options [nop,nop,TS val 1617129508 ecr 1313937639], length 0
> 16:49:26.764298 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 73224, win 1438, options [nop,nop,TS val 1617129509 ecr 1313937641], length 0
> 16:49:26.764676 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 75972, win 1444, options [nop,nop,TS val 1617129510 ecr 1313937643], length 0
> 16:49:26.807754 IP <server local ip>.13317 > <client public ip>.31872: Flags [.], seq 75972:77346, ack 4190, win 307, options [nop,nop,TS val 1313937714 ecr 1617129473], length 1374
> 16:49:26.876467 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 77346, win 1444, options [nop,nop,TS val 1617129620 ecr 1313937714], length 0
> 16:49:27.048760 IP <server local ip>.13317 > <client public ip>.31872: Flags [.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 1313937955 ecr 1617129473], length 1374
> 16:49:27.051791 IP <client public ip>.31872 > <server local ip>.13317: Flags [P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617129762 ecr 1313937714], length 36
> 16:49:27.076444 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 77346, win 1444, options [nop,nop,TS val 1617129822 ecr 1313937714,nop,nop,sack 1 {32004:33378}], length 0
> 16:49:27.371182 IP <client public ip>.31872 > <server local ip>.13317: Flags [P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617130018 ecr 1313937714], length 36
> 16:49:27.519862 IP <server local ip>.13317 > <client public ip>.31872: Flags [.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 1313938426 ecr 1617129473], length 1374
> 16:49:27.547662 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 77346, win 1444, options [nop,nop,TS val 1617130293 ecr 1313937714,nop,nop,sack 1 {32004:33378}], length 0
> 16:49:27.883372 IP <client public ip>.31872 > <server local ip>.13317: Flags [P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617130530 ecr 1313937714], length 36
> 16:49:28.511861 IP <server local ip>.13317 > <client public ip>.31872: Flags [.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 1313939418 ecr 1617129473], length 1374
> 16:49:28.538891 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 77346, win 1444, options [nop,nop,TS val 1617131285 ecr 1313937714,nop,nop,sack 1 {32004:33378}], length 0
> 16:49:28.907197 IP <client public ip>.31872 > <server local ip>.13317: Flags [P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617131554 ecr 1313937714], length 36
> 16:49:30.431864 IP <server local ip>.13317 > <client public ip>.31872: Flags [.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 1313941338 ecr 1617129473], length 1374
> 16:49:30.459127 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 77346, win 1444, options [nop,nop,TS val 1617133204 ecr 1313937714,nop,nop,sack 1 {32004:33378}], length 0
> 16:49:30.955388 IP <client public ip>.31872 > <server local ip>.13317: Flags [P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617133602 ecr 1313937714], length 36
> 16:49:34.207879 IP <server local ip>.13317 > <client public ip>.31872: Flags [.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 1313945114 ecr 1617129473], length 1374
> 16:49:34.235726 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 77346, win 1444, options [nop,nop,TS val 1617136981 ecr 1313937714,nop,nop,sack 1 {32004:33378}], length 0
> 16:49:35.256285 IP <client public ip>.31872 > <server local ip>.13317: Flags [P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617137954 ecr 1313937714], length 36
> 16:49:42.143864 IP <server local ip>.13317 > <client public ip>.31872: Flags [.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 1313953050 ecr 1617129473], length 1374
> 16:49:42.171531 IP <client public ip>.31872 > <server local ip>.13317: Flags [.], ack 77346, win 1444, options [nop,nop,TS val 1617144917 ecr 1313937714,nop,nop,sack 1 {32004:33378}], length 0
> 16:49:43.448262 IP <client public ip>.31872 > <server local ip>.13317: Flags [P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617146146 ecr 1313937714], length 36

The way I read this, server doesn't see anything sent by client since
some point shortly before the dump shown here starts (about 5ms). It
keeps sending data until 16:49:26.807754 (seq 77346) and then keeps
resending first (from its point of view) unacknowledged segment
(32004:33378) in exponentially growing intervals and ignores replies
from the client. Client apparently receives these retransmits and
replies with dupack (with D-SACK for 32004:33378) and retransmits of its
own first unacknowledged segment (4190:4226).

As we can see the client packets in the dump (which was taken on
server), it would mean they are dropped after the point where packet
socket would pass them to libpcap. That might be e.g. netfilter
(conntrack?) or the IP/TCP code detecting them to be invalid for some
reason (which is not obvious to me from the dump above).

There are two strange points:

1. While client apparently responds to all server retransmits, it does
so with TSecr=1313937714 (matching server packet from 16:49:26.807754)
rather than TSval of the packets it dupacks (1313937955 through
1313953050). This doesn't seem to follow the rules of RFC 7323
Section 4.3.

2. Window size values in acks from client grow with each acked packet by
22-23 (which might be ~1400 with scaling factor of 64). I would rather
expect advertised receive window to go down by 1374 with each received
segment and to grow by bigger steps with each read()/recv() call from
application.

We might get more insight if we saw the same connection on both sides.
>From what was presented here, my guess is that

  (1) received packets are dropped somewhere on server side (after they
      are cloned for the packet socket)
  (2) there is something wrong either on client side or between the two
      hosts (there is at least a NAT, IIUC)

Michal Kubecek

^ permalink raw reply

* [PATCH iproute2] utils: Do not reset family for default, any, all addresses
From: David Ahern @ 2018-04-13 16:36 UTC (permalink / raw)
  To: stephen; +Cc: netdev, whissi, David Ahern, Serhey Popovych

Thomas reported a change in behavior with respect to autodectecting
address families. Specifically, 'ip ro add default via fe80::1'
syntax was failing to treat fe80::1 as an IPv6 address as it did in
prior releases. The root causes appears to be a change in family when
the default keyword is parsed.

'default', 'any' and 'all' are relevant outside of AF_INET. Leave the
family arg as is for these when setting addr.

Fixes: 93fa12418dc6 ("utils: Always specify family and ->bytelen in get_prefix_1()")
Reported-by: Thomas Deutschmann <whissi@gentoo.org>
Signed-off-by: David Ahern <dsahern@gmail.com>
Cc: Serhey Popovych <serhe.popovych@gmail.com>
---
 lib/utils.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/utils.c b/lib/utils.c
index 60d7eb14b438..8a0bff0babeb 100644
--- a/lib/utils.c
+++ b/lib/utils.c
@@ -568,7 +568,7 @@ static int __get_addr_1(inet_prefix *addr, const char *name, int family)
 	if (strcmp(name, "default") == 0) {
 		if ((family == AF_DECnet) || (family == AF_MPLS))
 			return -1;
-		addr->family = (family != AF_UNSPEC) ? family : AF_INET;
+		addr->family = family;
 		addr->bytelen = af_byte_len(addr->family);
 		addr->bitlen = -2;
 		addr->flags |= PREFIXLEN_SPECIFIED;
@@ -579,7 +579,7 @@ static int __get_addr_1(inet_prefix *addr, const char *name, int family)
 	    strcmp(name, "any") == 0) {
 		if ((family == AF_DECnet) || (family == AF_MPLS))
 			return -1;
-		addr->family = AF_UNSPEC;
+		addr->family = family;
 		addr->bytelen = 0;
 		addr->bitlen = -2;
 		return 0;
-- 
2.11.0

^ permalink raw reply related

* Re: SRIOV switchdev mode BoF minutes
From: Samudrala, Sridhar @ 2018-04-13 16:49 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: David Miller, Anjali Singhai Jain, Andy Gospodarek, Michael Chan,
	Simon Horman, Jakub Kicinski, John Fastabend, Saeed Mahameed,
	Jiri Pirko, Rony Efraim, Linux Netdev List
In-Reply-To: <CAJ3xEMjqj5ENO8NRbcYFFCcS4NGwrNbYo8uydHEYMyXHjCBqhw@mail.gmail.com>

On 4/13/2018 1:57 AM, Or Gerlitz wrote:
> On Fri, Apr 13, 2018 at 11:56 AM, Or Gerlitz <gerlitz.or@gmail.com> wrote:
>> On Thu, Apr 12, 2018 at 11:33 PM, Samudrala, Sridhar
>> <sridhar.samudrala@intel.com> wrote:
>>> On 4/12/2018 1:20 PM, Or Gerlitz wrote:
>>>> On Thu, Apr 12, 2018 at 8:05 PM, Samudrala, Sridhar
>>>> <sridhar.samudrala@intel.com> wrote:
>>>>> On 11/12/2017 11:49 AM, Or Gerlitz wrote:
>>>>>> Hi Dave and all,
>>>>>>
>>>>>> During and after the BoF on SRIOV switchdev mode, we came into a
>>>>>> consensus among the developers from four different HW vendors (CC
>>>>>> audience) that a correct thing to do would be to disallow any new
>>>>>> extensions to the legacy mode.
>>>>>>
>>>>>> The idea is to put focus on the new mode and not add new UAPIs and
>>>>>> kernel code which was turned to be a wrong design which does not allow
>>>>>> for properly offloading a kernel switching SW model to e-switch HW.
>>>>>>
>>>>>> We also had a good session the day after regarding alignment for the
>>>>>> representation model of the uplink (physical port) and PF/s.
>>>>>>
>>>>>> The VF representor netdevs  exist for all drivers that support the new
>>>>>> mode but the representation for the uplink and PF wasn't the same for
>>>>>> all. The decision was to represent the uplink and PFs vports in the
>>>>>> same manner done for VFs, using rep netdevs. This alignment would
>>>>>> provide a more strict and clear view of the kernel model for e-switch
>>>>>> to users and upper layer control plane SW.
>>>>>>
>>>>> I don't see any changes in the Mellanox/other drivers to move to this new
>>>>> model to enable the uplink and PF port representors, any updates?
>>>> Yeah, I am worked on that but didn't get to finalize the upstreaming
>>>> so far.  I have resumed
>>>> the work and plan uplink rep in mlx5 to replace the PF being uplink rep
>>>> for 4.18
>>>>
>>>>> It would be really nice to highlight the pros and cons of the old versus
>>>>> the
>>>>> new model.
>>>>>
>>>>> We are looking into adding switchdev support for our new 100Gb ice driver
>>>>> and could use some feedback on the direction we should be taking.
>>>> good news.
>>>>
>>>> The uplink rep is clear cut that needs to be a rep device representing
>>>> the uplink just like vf
>>>> rep represents the vport toward the vf - please just do it correct
>>>> from the begining
>>>>
>>> Having an uplink rep will definitely help implement the slow path with
>>> flat/vlan network
>>> scenarios by not having to add PF to the bridge.
>>>
>>> But how do they help with a vxlan overlay scenario? In case of overlays, the
>>> slow path has to go via vxlan -> ip stack -> pf?
>> in  overlay networks scheme, the uplink has the VTEP ip and is not connected
> the uplink rep has the vtep ip
>
>> to the bridge, e.g you use ovs you have vf reps and vxlan ports connected to ovs
>> and the ip stack routes through the uplink rep

This changes the legacy mode behavior of configuring  vtep ip on the pf netdev.
How does host to host traffic expected to work when vtep ip is moved to uplink rep?


>>
>>> What about pf-rep?

Are you planning to create a pf-rep too? Is pf also treated similar to vf in switchdev mode?
All pf traffic goes to pf-rep and pf-rep traffic goes to pf by default without any rules
programmed?

^ permalink raw reply

* Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2
From: Christoph Hellwig @ 2018-04-13 16:49 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Christoph Hellwig, xdp-newbies@vger.kernel.org,
	netdev@vger.kernel.org, David Woodhouse, William Tu,
	Björn Töpel, Karlsson, Magnus, Alexander Duyck,
	Arnaldo Carvalho de Melo
In-Reply-To: <20180412173131.49f01252@redhat.com>

On Thu, Apr 12, 2018 at 05:31:31PM +0200, Jesper Dangaard Brouer wrote:
> > I guess that is because x86 selects it as the default as soon as
> > we have more than 4G memory. 
> 
> I were also confused why I ended up using SWIOTLB (SoftWare IO-TLB),
> that might explain it. And I'm not hitting the bounce-buffer case.
> 
> How do I control which DMA engine I use? (So, I can play a little)

At the lowest level you control it by:

 (1) setting the dma_ops pointer in struct device
 (2) if that is NULL by choosing what is returned from
     get_arch_dma_ops()

> 
> 
> > That should be solveable fairly easily with the per-device dma ops,
> > though.
> 
> I didn't understand this part.

What I mean with that is that we can start out setting dma_ops
to dma_direct_ops for everyone on x86 when we start out (that is assuming
we don't have an iommu), and only switching to swiotlb_dma_ops when
actually required by either a dma_mask that can't address all memory,
or some other special cases like SEV or broken bridges.

> I wanted to ask your opinion, on a hackish idea I have...
> Which is howto detect, if I can reuse the RX-DMA map address, for TX-DMA
> operation on another device (still/only calling sync_single_for_device).
> 
> With XDP_REDIRECT we are redirecting between net_device's. Usually
> we keep the RX-DMA mapping as we recycle the page. On the redirect to
> TX-device (via ndo_xdp_xmit) we do a new DMA map+unmap for TX.  The
> question is how to avoid this mapping(?).  In some cases, with some DMA
> engines (or lack of) I guess the DMA address is actually the same as
> the RX-DMA mapping dma_addr_t already known, right?  For those cases,
> would it be possible to just (re)use that address for TX?

You can't in any sensible way without breaking a lot of abstractions.
For dma direct ops that mapping will be the same unless the devices
have different dma_offsets in their struct device, or the architecture
overrides phys_to_dma entirely, in which case all bets are off.
If you have an iommu it depends on which devices are behind the same
iommu.

^ permalink raw reply

* Creating FOU tunnels to the same destination IP but different port
From: Kostas Peletidis @ 2018-04-13 16:57 UTC (permalink / raw)
  To: netdev

Hello,

I am having trouble with a particular case of setting up a fou tunnel
and I would really appreciate your help.

I have a remote multihomed host behind a NAT box and I want to create
a fou tunnel for each of its IP addresses, from my machine.

A typical case would be something like that (output from the local machine):

# ip tun
ipudp09602: ip/ip remote 135.196.22.100 local 172.31.0.140 ttl 225
ipudp00101: ip/ip remote 148.252.129.30 local 172.31.0.140 ttl 225
ipudp09604: ip/ip remote 77.247.11.249 local 172.31.0.140 ttl 225
tunl0: any/ip remote any local any ttl inherit nopmtudisc
ipudp00102: ip/ip remote 213.205.194.18 local 172.31.0.140 ttl 225

However, if the remote end has the same IP address with the remote end
of an existing tunnel (but a different remote port)
tunnel creation fails. In this example there is already a tunnel to
135.196.22.100:32270 and I wanted to create a new tunnel
to 135.196.22.100:24822 as below:

# ip link add name ipudp09603 mtu 1356 type ipip \
  remote 135.196.22.100 \
  local 172.31.0.140 \
  ttl 225 \
  encap fou \
     encap-sport 4500 \
     encap-dport 24822

RTNETLINK answers: File exists

The remote IP addresses in this case are identical because there is a
NAT box in the way, but the port numbers are different. The source
address and port are the same in all cases.

I noticed that ip_tunnel_find() does not check port numbers - being IP
and all - so I am thinking that a not-so-elegant way to do it is to
get the port numbers from the netlink request and have
ip_tunnel_find() compare them against encap.{sport, dport} of existing
tunnels.

Is there a better way to create a second fou tunnel to the same IP
address but a different port? Use of keys as unique tunnel IDs maybe?
Any feedback is appreciated. Thank you.

Regards,
Kostas

^ permalink raw reply

* Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2
From: Tushar Dave @ 2018-04-13 17:12 UTC (permalink / raw)
  To: Christoph Hellwig, Jesper Dangaard Brouer
  Cc: xdp-newbies@vger.kernel.org, netdev@vger.kernel.org,
	David Woodhouse, William Tu, Björn Töpel,
	Karlsson, Magnus, Alexander Duyck, Arnaldo Carvalho de Melo
In-Reply-To: <20180412145653.GA7172@lst.de>



On 04/12/2018 07:56 AM, Christoph Hellwig wrote:
> On Thu, Apr 12, 2018 at 04:51:23PM +0200, Christoph Hellwig wrote:
>> On Thu, Apr 12, 2018 at 03:50:29PM +0200, Jesper Dangaard Brouer wrote:
>>> ---------------
>>> Implement support for keeping the DMA mapping through the XDP return
>>> call, to remove RX map/unmap calls.  Implement bulking for XDP
>>> ndo_xdp_xmit and XDP return frame API.  Bulking allows to perform DMA
>>> bulking via scatter-gatter DMA calls, XDP TX need it for DMA
>>> map+unmap. The driver RX DMA-sync (to CPU) per packet calls are harder
>>> to mitigate (via bulk technique). Ask DMA maintainer for a common
>>> case direct call for swiotlb DMA sync call ;-)
>>
>> Why do you even end up in swiotlb code?  Once you bounce buffer your
>> performance is toast anyway..
> 
> I guess that is because x86 selects it as the default as soon as
> we have more than 4G memory. That should be solveable fairly easily
> with the per-device dma ops, though.\

I guess there is nothing we need to do!

On x86, in case of no intel iommu or iommu is disabled, you end up in
swiotlb for DMA API calls when system has 4G memory.
However, AFAICT, for 64bit DMA capable devices swiotlb DMA APIs do not
use bounce buffer until and unless you have swiotlb=force specified in
kernel commandline.

e.g. here is the snip:
dma_addr_t swiotlb_map_page(struct device *dev, struct page *page,
                             unsigned long offset, size_t size,
                             enum dma_data_direction dir,
                             unsigned long attrs)
{
         phys_addr_t map, phys = page_to_phys(page) + offset;
         dma_addr_t dev_addr = phys_to_dma(dev, phys);

         BUG_ON(dir == DMA_NONE);
         /*
          * If the address happens to be in the device's DMA window,
          * we can safely return the device addr and not worry about bounce
          * buffering it.
          */
         if (dma_capable(dev, dev_addr, size) && swiotlb_force != 
SWIOTLB_FORCE)
                 return dev_addr;


-Tushar

^ permalink raw reply

* imaging solutions
From: Ross @ 2018-04-13 14:51 UTC (permalink / raw)
  To: netdev

Hi,

Not sure if you received my email from last week.

We offer following image editing services:
images cutting out, clipping path, masking
jewelry photos retouching
beauty photos retouching
also wedding photos etc

If you want to test our quality of work.
You may send us one photo with instruction and we will work on it.

Hope to hear from you soon.

Regards,
Ross
The Studio Manager

^ permalink raw reply

* Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2
From: Christoph Hellwig @ 2018-04-13 17:26 UTC (permalink / raw)
  To: Tushar Dave
  Cc: Christoph Hellwig, Jesper Dangaard Brouer,
	xdp-newbies@vger.kernel.org, netdev@vger.kernel.org,
	David Woodhouse, William Tu, Björn Töpel,
	Karlsson, Magnus, Alexander Duyck, Arnaldo Carvalho de Melo
In-Reply-To: <fbcb4cc8-53fb-82cd-bd9d-76c5fdd47918@oracle.com>

On Fri, Apr 13, 2018 at 10:12:41AM -0700, Tushar Dave wrote:
> I guess there is nothing we need to do!
>
> On x86, in case of no intel iommu or iommu is disabled, you end up in
> swiotlb for DMA API calls when system has 4G memory.
> However, AFAICT, for 64bit DMA capable devices swiotlb DMA APIs do not
> use bounce buffer until and unless you have swiotlb=force specified in
> kernel commandline.

Sure.  But that means very sync_*_to_device and sync_*_to_cpu now
involves an indirect call to do exactly nothing, which in the workload
Jesper is looking at is causing a huge performance degradation due to
retpolines.

^ permalink raw reply

* Re: [PATCH net-next 1/3] net: ethernet: ave: add multiple clocks and resets support as required property
From: Rob Herring @ 2018-04-13 17:26 UTC (permalink / raw)
  To: Kunihiko Hayashi
  Cc: David Miller, netdev, Andrew Lunn, Florian Fainelli, Mark Rutland,
	linux-arm-kernel, linux-kernel, devicetree, Masahiro Yamada,
	Masami Hiramatsu, Jassi Brar
In-Reply-To: <1523255925-6469-2-git-send-email-hayashi.kunihiko@socionext.com>

On Mon, Apr 09, 2018 at 03:38:43PM +0900, Kunihiko Hayashi wrote:
> When the link is becoming up for Pro4 SoC, the kernel is stalled
> due to some missing clocks and resets.
> 
> The AVE block for Pro4 is connected to the GIO bus in the SoC.
> Without its clock/reset, the access to the AVE register makes the
> system stall.
> 
> In the same way, another MAC clock for Giga-bit Connection and
> the PHY clock are also required for Pro4 to activate the Giga-bit feature
> and to recognize the PHY.
> 
> To satisfy these requirements, this patch adds support for multiple clocks
> and resets, and adds the clock-names and reset-names to the binding because
> we need to distinguish clock/reset for the AVE main block and the others.
> 
> Also, make the resets a required property. Currently, "reset is
> optional" relies on that the bootloader or firmware has deasserted
> the reset before booting the kernel.  Drivers should work without
> such expectation.
> 
> Fixes: 4c270b55a5af ("net: ethernet: socionext: add AVE ethernet driver")
> Suggested-by: Masahiro Yamada <yamada.masahiro@socionext.com>
> Signed-off-by: Kunihiko Hayashi <hayashi.kunihiko@socionext.com>
> ---
>  .../bindings/net/socionext,uniphier-ave4.txt       |  13 ++-

Reviewed-by: Rob Herring <robh@kernel.org>

>  drivers/net/ethernet/socionext/sni_ave.c           | 108 ++++++++++++++++-----
>  2 files changed, 96 insertions(+), 25 deletions(-)

^ permalink raw reply

* Re: [PATCH net-next 2/3] dt-bindings: net: ave: add syscon-phy-mode property to configure phy-mode setting
From: Rob Herring @ 2018-04-13 17:27 UTC (permalink / raw)
  To: Kunihiko Hayashi
  Cc: David Miller, netdev, Andrew Lunn, Florian Fainelli, Mark Rutland,
	linux-arm-kernel, linux-kernel, devicetree, Masahiro Yamada,
	Masami Hiramatsu, Jassi Brar
In-Reply-To: <1523255925-6469-3-git-send-email-hayashi.kunihiko@socionext.com>

On Mon, Apr 09, 2018 at 03:38:44PM +0900, Kunihiko Hayashi wrote:
> Add "socionext,syscon-phy-mode" property to specify system controller that
> configures the settings about phy-mode.
> 
> Signed-off-by: Kunihiko Hayashi <hayashi.kunihiko@socionext.com>
> ---
>  Documentation/devicetree/bindings/net/socionext,uniphier-ave4.txt | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)

Reviewed-by: Rob Herring <robh@kernel.org>

^ permalink raw reply

* Re: [PATCH net] team: avoid adding twice the same option to the event list
From: David Miller @ 2018-04-13 18:07 UTC (permalink / raw)
  To: pabeni; +Cc: netdev, jiri
In-Reply-To: <0e295e62358a68b22c646adece4272a9bd0473f8.1523620752.git.pabeni@redhat.com>

From: Paolo Abeni <pabeni@redhat.com>
Date: Fri, 13 Apr 2018 13:59:25 +0200

> When parsing the options provided by the user space,
> team_nl_cmd_options_set() insert them in a temporary list to send
> multiple events with a single message.
> While each option's attribute is correctly validated, the code does
> not check for duplicate entries before inserting into the event
> list.
> 
> Exploiting the above, the syzbot was able to trigger the following
> splat:
 ...
> This changeset addresses the avoiding list_add() if the current
> option is already present in the event list.
> 
> Reported-and-tested-by: syzbot+4d4af685432dc0e56c91@syzkaller.appspotmail.com
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> Fixes: 2fcdb2c9e659 ("team: allow to send multiple set events in one message")

Looks good to me.

It's too bad that the tmp list entries don't get marked as they are
added, or get unlinked by the list processor.  Either scheme would
make the "already added" test a lot simpler.

Jiri, please review before I apply this.

Thanks.

^ permalink raw reply

* ethtool 4.16 released
From: John W. Linville @ 2018-04-13 18:02 UTC (permalink / raw)
  To: netdev

ethtool version 4.16 has been released.

Home page: https://www.kernel.org/pub/software/network/ethtool/
Download link:
https://www.kernel.org/pub/software/network/ethtool/ethtool-4.16.tar.xz

Release notes:

	* Feature: add support for extra RSS contexts and RSS steering filters
	* Feature: Document RSS context control and RSS filters
	* Fix: don't fall back to grxfhindir when context was specified
	* Fix: correct display of VF when showing vf/queue filters
	* Fix: show VF and queue in the help for -N
	* Fix: correct VF index values for the ring_cookie parameter
	* Feature: Add SFF 8636 date code parsing support

John
-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.

^ permalink raw reply

* [PATCH v2 net 0/3] sfc: ARFS fixes
From: Edward Cree @ 2018-04-13 18:16 UTC (permalink / raw)
  To: linux-net-drivers, David Miller; +Cc: netdev

Three issues introduced by my recent asynchronous filter handling changes:
1. The old filter_rfs_insert would replace a matching filter of equal
   priority; we need to pass the appropriate argument to filter_insert to
   make it do the same.
2. We're lying to the kernel with our return value from ndo_rx_flow_steer,
   so we need to lie consistently when calling rps_may_expire_flow.  This
   is only a partial fix, as the lie still prevents us from steering
   multiple flows with the same ID to different queues; a proper fix that
   stops us lying at all will hopefully follow later.
3. It's possible to cause the kernel to hammer ndo_rx_flow_steer very
   hard, so make sure we don't build up too huge a backlog of workitems.

Possibly it would be better to fix #3 on the kernel side; I have a patch
 which I think does that but it's not a regression in 4.17 so isn't 'net'
 material.
There's also the issue that we come up in the bad configuration that
 triggers #3 by default, but that too is a problem for another time.

Edward Cree (3):
  sfc: insert ARFS filters with replace_equal=true
  sfc: pass the correctly bogus filter_id to rps_may_expire_flow()
  sfc: limit ARFS workitems in flight per channel

 drivers/net/ethernet/sfc/ef10.c       |  3 +-
 drivers/net/ethernet/sfc/farch.c      |  2 +-
 drivers/net/ethernet/sfc/net_driver.h | 25 +++++++++++++++
 drivers/net/ethernet/sfc/rx.c         | 60 ++++++++++++++++++-----------------
 4 files changed, 58 insertions(+), 32 deletions(-)

^ permalink raw reply

* [PATCH] PCI: Add PCIe to pcie_print_link_status() messages
From: Jakub Kicinski @ 2018-04-13 18:16 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: oss-drivers, Tal Gilboa, Tariq Toukan, Jacob Keller,
	Ganesh Goudar, Jeff Kirsher, intel-wired-lan, netdev,
	linux-kernel, linux-pci, Jakub Kicinski

Currently the pcie_print_link_status() will print PCIe bandwidth
and link width information but does not mention it is pertaining
to the PCIe.  Since this and related functions are used exclusively
by networking drivers today users may get confused into thinking
that it's the NIC bandwidth that is being talked about.  Insert a
"PCIe" into the messages.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 drivers/pci/pci.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index aa86e904f93c..73a0a4993f6a 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -5273,11 +5273,11 @@ void pcie_print_link_status(struct pci_dev *dev)
 	bw_avail = pcie_bandwidth_available(dev, &limiting_dev, &speed, &width);
 
 	if (bw_avail >= bw_cap)
-		pci_info(dev, "%u.%03u Gb/s available bandwidth (%s x%d link)\n",
+		pci_info(dev, "%u.%03u Gb/s available PCIe bandwidth (%s x%d link)\n",
 			 bw_cap / 1000, bw_cap % 1000,
 			 PCIE_SPEED2STR(speed_cap), width_cap);
 	else
-		pci_info(dev, "%u.%03u Gb/s available bandwidth, limited by %s x%d link at %s (capable of %u.%03u Gb/s with %s x%d link)\n",
+		pci_info(dev, "%u.%03u Gb/s available PCIe bandwidth, limited by %s x%d link at %s (capable of %u.%03u Gb/s with %s x%d link)\n",
 			 bw_avail / 1000, bw_avail % 1000,
 			 PCIE_SPEED2STR(speed), width,
 			 limiting_dev ? pci_name(limiting_dev) : "<unknown>",
-- 
2.16.2

^ permalink raw reply related

* [PATCH v2 net 1/3] sfc: insert ARFS filters with replace_equal=true
From: Edward Cree @ 2018-04-13 18:17 UTC (permalink / raw)
  To: linux-net-drivers, David Miller; +Cc: netdev
In-Reply-To: <878265b2-a42a-d49e-0e68-0bbcabbabeaa@solarflare.com>

Necessary to allow redirecting a flow when the application moves.

Fixes: 3af0f34290f6 ("sfc: replace asynchronous filter operations")
Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 drivers/net/ethernet/sfc/rx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/sfc/rx.c b/drivers/net/ethernet/sfc/rx.c
index 95682831484e..13b0eb71dbf3 100644
--- a/drivers/net/ethernet/sfc/rx.c
+++ b/drivers/net/ethernet/sfc/rx.c
@@ -851,7 +851,7 @@ static void efx_filter_rfs_work(struct work_struct *data)
 	struct efx_channel *channel = efx_get_channel(efx, req->rxq_index);
 	int rc;
 
-	rc = efx->type->filter_insert(efx, &req->spec, false);
+	rc = efx->type->filter_insert(efx, &req->spec, true);
 	if (rc >= 0) {
 		/* Remember this so we can check whether to expire the filter
 		 * later.

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox