Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: bpf memory model. Was: [PATCH v4 bpf-next 1/9] bpf: introduce bpf_spin_lock
From: Alexei Starovoitov @ 2019-01-28 21:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexei Starovoitov, davem, daniel, jakub.kicinski, netdev,
	kernel-team, mingo, will.deacon, Paul McKenney, jannh
In-Reply-To: <20190128092408.GD28467@hirez.programming.kicks-ass.net>

On Mon, Jan 28, 2019 at 10:24:08AM +0100, Peter Zijlstra wrote:
> On Fri, Jan 25, 2019 at 04:17:26PM -0800, Alexei Starovoitov wrote:
> > On Fri, Jan 25, 2019 at 11:23:12AM +0100, Peter Zijlstra wrote:
> > > On Thu, Jan 24, 2019 at 03:58:59PM -0800, Alexei Starovoitov wrote:
> > > > On Thu, Jan 24, 2019 at 07:01:09PM +0100, Peter Zijlstra wrote:
> > > 
> > > > > And this would again be the moment where I go pester you about the BPF
> > > > > memory model :-)
> > > > 
> > > > hehe :)
> > > > How do you propose to define it in a way that it applies to all archs
> > > > and yet doesn't penalize x86 ?
> > > > "Assume minimum execution ordering model" the way kernel does
> > > > unfortunately is not usable, since bpf doesn't have a luxury
> > > > of using nice #defines that convert into nops on x86.
> > > 
> > > Why not? Surely the JIT can fix it up? That is, suppose you were to have
> > > smp_rmb() as a eBPF instruction then the JIT (which knows what
> > > architecture it is targeting) can simply avoid emitting instructions for
> > > it.
> > 
> > I'm all for adding new instructions that solve real use cases.
> > imo bpf_spin_lock() is the first step in helping with concurrency.
> > At plumbers conference we agreed to add new sync_fetch_and_add()
> > and xchg() instructions. That's a second step.
> > smp_rmb/wmb/mb should be added as well.
> > JITs will patch them depending on architecture.
> > 
> > What I want to avoid is to define the whole execution ordering model upfront.
> > We cannot say that BPF ISA is weakly ordered like alpha.
> > Most of the bpf progs are written and running on x86. We shouldn't
> > twist bpf developer's arm by artificially relaxing memory model.
> > BPF memory model is equal to memory model of underlying architecture.
> > What we can do is to make it bpf progs a bit more portable with
> > smp_rmb instructions, but we must not force weak execution on the developer.
> 
> Well, I agree with only introducing bits you actually need, and my
> smp_rmb() example might have been poorly chosen, smp_load_acquire() /
> smp_store_release() might have been a far more useful example.
> 
> But I disagree with the last part; we have to pick a model now;
> otherwise you'll pain yourself into a corner.
> 
> Also; Alpha isn't very relevant these days; however ARM64 does seem to
> be gaining a lot of attention and that is very much a weak architecture.
> Adding strongly ordered assumptions to BPF now, will penalize them in
> the long run.

arm64 is gaining attention just like riscV is gaining it too.
BPF jit for arm64 is very solid, while BPF jit for riscV is being worked on.
BPF is not picking sides in CPU HW and ISA battles.
Memory model is CPU HW design decision. BPF ISA cannot dictate HW design.
We're not saying today that BPF is strongly ordered.
BPF load/stores are behaving differently on x86 vs arm64.
We can add new instructions, but we cannot 'define' how load/stores behave
from memory model perspective.
For example, take atomicity of single byte load/store.
Not all archs have them atomic, but we cannot say to bpf programmers
to always assume non-atomic byte loads.

> > > Similarly; could something like this not also help with the spinlock
> > > thing? Use that generic test-and-set thing for the interpreter, but
> > > provide a better implementation in the JIT?
> > 
> > May be eventually. We can add cmpxchg insn, but the verifier still
> > doesn't support loops. We made a lot of progress in bounded loop research
> > over the last 2 years, but loops in bpf are still a year away.
> > We considered adding 'bpf_spin_lock' as a new instruction instead of helper call,
> > but that approach has a lot of negatives, so we went with the helper.
> 
> Ah, but the loop won't be in the BPF program itself. The BPF program
> would only have had the BPF_SPIN_LOCK instruction, the JIT them emits
> code similar to queued_spin_lock()/queued_spin_unlock() (or calls to
> out-of-line versions of them).

As I said we considered exactly that and such approach has a lot of downsides
comparing with the helper approach.
Pretty much every time new feature is added we're evaluating whether it
should be new instruction or new helper. 99% of the time we go with new helper.

> There isn't anything that mandates the JIT uses the exact same locking
> routines the interpreter does, is there?

sure. This bpf_spin_lock() helper can be optimized whichever way the kernel wants.
Like bpf_map_lookup_elem() call is _inlined_ by the verifier for certain map types.
JITs don't even need to do anything. It looks like function call from bpf prog
point of view, but in JITed code it is a sequence of native instructions.

Say tomorrow we find out that bpf_prog->bpf_spin_lock()->queued_spin_lock()
takes too much time then we can inline fast path of queued_spin_lock
directly into bpf prog and save function call cost.


^ permalink raw reply

* Re: [PATCH v4 bpf-next 1/9] bpf: introduce bpf_spin_lock
From: Alexei Starovoitov @ 2019-01-28 21:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexei Starovoitov, davem, daniel, jakub.kicinski, netdev,
	kernel-team, mingo, will.deacon, Paul McKenney, jannh
In-Reply-To: <20190128084310.GC28467@hirez.programming.kicks-ass.net>

On Mon, Jan 28, 2019 at 09:43:10AM +0100, Peter Zijlstra wrote:
> On Fri, Jan 25, 2019 at 03:42:43PM -0800, Alexei Starovoitov wrote:
> > On Fri, Jan 25, 2019 at 10:10:57AM +0100, Peter Zijlstra wrote:
> 
> > > What about the progs that run from SoftIRQ ? Since that bpf_prog_active
> > > thing isn't inside BPF_PROG_RUN() what is to stop say:
> > > 
> > >    reuseport_select_sock()
> > >      ...
> > >        BPF_PROG_RUN()
> > >          bpf_spin_lock()
> > >         <IRQ>
> > > 	  ...
> > > 	  BPF_PROG_RUN()
> > > 	    bpf_spin_lock() // forever more
> > > 
> > > 	</IRQ>
> > > 
> > > Unless you stick that bpf_prog_active stuff inside BPF_PROG_RUN itself,
> > > I don't see how you can fundamentally avoid this happening (now or in
> > > the future).
> 
> > But your issue above is valid.
> 
> > We don't use bpf_prog_active for networking progs, since we allow
> > for one level of nesting due to the classic SKF_AD_PAY_OFFSET legacy.
> > Also we allow tracing progs to nest with networking progs.
> > People using this actively.
> > Typically it's not an issue, since in networking there is no
> > arbitrary nesting (unlike kprobe/nmi in tracing),
> > but for bpf_spin_lock it can be, since the same map can be shared
> > by networking and tracing progs and above deadlock would be possible:
> > (first BPF_PROG_RUN will be from networking prog, then kprobe+bpf's
> > BPF_PROG_RUN accessing the same map with bpf_spin_lock)
> > 
> > So for now I'm going to allow bpf_spin_lock in networking progs only,
> > since there is no arbitrary nesting there.
> 
> Isn't that still broken? AFAIU networking progs can happen in task
> context (TX) and SoftIRQ context (RX), which can nest.

Sure. sendmsg side of networking can be interrupted by napi receive.
Both can have bpf progs attached at different points, but napi won't run
when bpf prog is running, because bpf prog disables preemption.
More so the whole networking stack can be recursive and there is
xmit_recursion counter to check for bad cases.
When bpf progs interact with networking they don't add to that recursion.
All of *redirect*() helpers do so outside of bpf preempt disabled context.
Also there is no nesting of the same networking prog type.
Like xdp/tc/lwt/cgroup bpf progs cannot be called recursively by design.
There are no arbitrary entry points unlike kprobe/tracepoint.
The only nesting is when socket filter _classic_ bpf prog is calling
SKF_AD_PAY_OFFSET legacy. That calls flow dissector which may call flow dissector
bpf prog. Classic bpf doesn't use bpf maps, so no deadlock issues.

> > And once we figure out the safety concerns for kprobe/tracepoint progs
> > we can enable bpf_spin_lock there too.
> > NMI bpf progs will never have bpf_spin_lock.
> 
> kprobe is like NMI, since it pokes an INT3 instruction which can trigger
> in the middle of IRQ-disabled or even in NMIs. Similar arguments can be
> made for tracepoints, they can happen 'anywhere'.

exactly. that's why there is bpf_prog_active to protect the kernel in general
for tracing bpf progs.


^ permalink raw reply

* Re: [PATCH] ucc_geth: Reset BQL queue when stopping device
From: Li Yang @ 2019-01-28 21:36 UTC (permalink / raw)
  To: Mathias Thore
  Cc: Christophe Leroy, netdev@vger.kernel.org,
	linuxppc-dev@lists.ozlabs.org, David Gounaris, Joakim Tjernlund
In-Reply-To: <DM6PR10MB37217EA1DBEFE3D5CEB5A85084960@DM6PR10MB3721.namprd10.prod.outlook.com>

On Mon, Jan 28, 2019 at 8:37 AM Mathias Thore
<Mathias.Thore@infinera.com> wrote:
>
> Hi,
>
>
> This is what we observed: there was a storm on the medium so that our controller could not do its TX, resulting in timeout. When timeout occurs, the driver clears all descriptors from the TX queue. The function called in this patch is used to reflect this clearing also in the BQL layer. Without it, the controller would get stuck, unable to perform TX, even several minutes after the storm had ended. Bringing the device down and then up again would solve the problem, but this patch also solves it automatically.

The explanation makes sense.  So this should only be required in the
timeout scenario instead of other clean up scenarios like device
shutdown?  If so, it probably it will be better to be done in
ucc_geth_timeout_work()?

>
>
> Some other drivers do the same, for example e1000e driver calls netdev_reset_queue in its e1000_clean_tx_ring function. It is possible that other drivers should do the same; I have no way of verifying this.
>
>
> Regards,
>
> Mathias
>
> --
>
>
> From: Christophe Leroy <christophe.leroy@c-s.fr>
> Sent: Monday, January 28, 2019 10:48 AM
> To: Mathias Thore; leoyang.li@nxp.com; netdev@vger.kernel.org; linuxppc-dev@lists.ozlabs.org; David Gounaris; Joakim Tjernlund
> Subject: Re: [PATCH] ucc_geth: Reset BQL queue when stopping device
>
>
> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you recognize the sender and know the content is safe.
>
>
> Hi,
>
> Le 28/01/2019 à 10:07, Mathias Thore a écrit :
> > After a timeout event caused by for example a broadcast storm, when
> > the MAC and PHY are reset, the BQL TX queue needs to be reset as
> > well. Otherwise, the device will exhibit severe performance issues
> > even after the storm has ended.
>
> What are the symptomns ?
>
> Is this reset needed on any network driver in that case, or is it
> something particular for the ucc_geth ?
> For instance, the freescale fs_enet doesn't have that reset. Should it
> have it too ?
>
> Christophe
>
> >
> > Co-authored-by: David Gounaris <david.gounaris@infinera.com>
> > Signed-off-by: Mathias Thore <mathias.thore@infinera.com>
> > ---
> >   drivers/net/ethernet/freescale/ucc_geth.c | 2 ++
> >   1 file changed, 2 insertions(+)
> >
> > diff --git a/drivers/net/ethernet/freescale/ucc_geth.c b/drivers/net/ethernet/freescale/ucc_geth.c
> > index c3d539e209ed..eb3e65e8868f 100644
> > --- a/drivers/net/ethernet/freescale/ucc_geth.c
> > +++ b/drivers/net/ethernet/freescale/ucc_geth.c
> > @@ -1879,6 +1879,8 @@ static void ucc_geth_free_tx(struct ucc_geth_private *ugeth)
> >       u16 i, j;
> >       u8 __iomem *bd;
> >
> > +     netdev_reset_queue(ugeth->ndev);
> > +
> >       ug_info = ugeth->ug_info;
> >       uf_info = &ug_info->uf_info;
> >
> >
>

^ permalink raw reply

* Re: [PATCH net 0/4] various compat ioctl fixes
From: Johannes Berg @ 2019-01-28 21:32 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, viro, robert
In-Reply-To: <20190128.112256.1993605129492088954.davem@davemloft.net>

On Mon, 2019-01-28 at 11:22 -0800, David Miller wrote:

> I see some back and forth between you and Al, where do we stand at
> this point?

I don't really know. I think neither of us _likes_ this code, in
particular the whole copy_in_user() thing is quite a mess. The
copy_in_user() also means that decnet (and similar things, if they
exist, I didn't see any but didn't audit all protocols carefully) have
no way of working in compat - it's not even clear to me if that'd return
-EFAULT or just do something really stupid, and maybe even dangerous?

(Dangerous because at least on x86, compat_alloc_user_space() uses stack
space, and if we alloc 40 bytes but decnet writes up to 42 (?) then we
could overwrite some stack by that? Maybe the 16-byte alignment in
compat_alloc_user_space() saves us, but it's all very fragile. Even with
the previous patch fixed, decnet's idea of "struct ifreq" is bigger than
"struct ifreq" actually is because sockaddr_dn is bigger, if I'm
counting it right then that's 42 in total)

At the same time, fixing all this _completely_ is not very realistic, it
would require passing the ifreq size through to lots of places and
making the user copy there take the size rather than sizeof(ifreq),
obviously the very least to the method decnet uses, i.e. sock->ioctl() I
think, but clearly that affects every other protocol too.
This was what my previous patch had done partially for the directly
handled ioctls (the revert of which is the first patch in this series).

> From what I can see this looks like probably the simplest way to
> fix this in net and -stable currently.

I tend to agree, at least to fix the regression.

We can still deliberate separately if we want to fix decnet for compat
or if nobody cares now. But perhaps better decnet broken (quite
obviously and detectably) like it basically always was, than IP broken
(subtly, if your struct ends up landing at the end of a page).

Al, care to speak up about this here?

johannes

^ permalink raw reply

* Re: [PATCH] tty: Fix WARNING in tty_set_termios
From: shuah @ 2019-01-28 21:29 UTC (permalink / raw)
  To: Al Viro
  Cc: marcel, johan.hedberg, w.d.hubbs, chris, kirk, samuel.thibault,
	gregkh, robh, jslaby, sameo, davem, arnd, nishka.dasgupta_ug18,
	m.maya.nakamura, santhameena13, zhongjiang, linux-bluetooth,
	linux-kernel, speakup, devel, linux-serial, linux-wireless,
	netdev, shuah
In-Reply-To: <20190126041416.GF2217@ZenIV.linux.org.uk>

On 1/25/19 9:14 PM, Al Viro wrote:
> On Fri, Jan 25, 2019 at 04:29:05PM -0700, Shuah Khan wrote:
>> tty_set_termios() has the following WARMN_ON which can be triggered with a
>> syscall to invoke TIOCGETD __NR_ioctl.
>>
>> WARN_ON(tty->driver->type == TTY_DRIVER_TYPE_PTY &&
>>                  tty->driver->subtype == PTY_TYPE_MASTER);
>> Reference: https://syzkaller.appspot.com/bug?id=2410d22f1d8e5984217329dd0884b01d99e3e48d
>>
>> A simple change would have been to print error message instead of WARN_ON.
>> However, the callers assume that tty_set_termios() always returns 0 and
>> don't check return value. The complete solution is fixing all the callers
>> to check error and bail out to fix the WARN_ON.
>>
>> This fix changes tty_set_termios() to return error and all the callers
>> to check error and bail out. The reproducer is used to reproduce the
>> problem and verify the fix.
> 
>> --- a/drivers/bluetooth/hci_ldisc.c
>> +++ b/drivers/bluetooth/hci_ldisc.c
>> @@ -321,6 +321,8 @@ void hci_uart_set_flow_control(struct hci_uart *hu, bool enable)
>>   		status = tty_set_termios(tty, &ktermios);
>>   		BT_DBG("Disabling hardware flow control: %s",
>>   		       status ? "failed" : "success");
>> +		if (status)
>> +			return;
> 
> Can that ldisc end up set on pty master?  And does it make any sense there?

The initial objective of the patch is to prevent the WARN_ON by making
the change to return error instead of WARN_ON. However, without changes
to places that don't check the return and keep making progress, there
will be secondary problems.

Without this change to return here, instead of WARN_ON, it will fail
with the following NULL pointer dereference at the next thing 
hci_uart_set_flow_control() attempts.

status = tty->driver->ops->tiocmget(tty);

kernel: [10140.649783] BUG: unable to handle kernel NULL pointer 
dereference at 0000000000000000
kernel: [10140.649786] #PF error: [INSTR]
kernel: [10140.649787] PGD 0 P4D 0
kernel: [10140.649790] Oops: 0010 [#1] SMP PTI
Jan 24 15:33:35 deneb kernel: [10140.649793] CPU: 2 PID: 55 Comm: 
kworker/u33:0 Tainted: G        W         5.0.0-rc3+ #5
kernel: [10140.649794] Hardware name: Dell Inc. OptiPlex 790/0HY9JP, 
BIOS A18 09/24/2013
Workqueue: hci0 hci_power_on [bluetooth]
kernel: [10140.649805] RIP: 0010:          (null)
kernel: [10140.649809] Code: Bad RIP value.
kernel: [10140.649810] RSP: 0018:ffffa01a8153fd28 EFLAGS: 00010282
kernel: [10140.649812] RAX: 0000000000000000 RBX: ffff8958d6bc4800 RCX: 
35ad8b0300000000
kernel: [10140.649814] RDX: ffffffff00000001 RSI: 0000000000000000 RDI: 
ffff8958d6bc4800
kernel: [10140.649816] RBP: ffffa01a8153fd78 R08: 0000000091773f09 R09: 
0000000000000003
kernel: [10140.649817] R10: ffff8958d6bc4a98 R11: 0000000000000720 R12: 
ffff895814500c00
kernel: [10140.649819] R13: ffff8958a858e000 R14: 0000000000000000 R15: 
ffff8958af1af440
kernel: [10140.649821] FS:  0000000000000000(0000) 
GS:ffff895925880000(0000) knlGS:0000000000000000
kernel: [10140.649823] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: [10140.649824] CR2: ffffffffffffffd6 CR3: 0000000083f46002 CR4: 
00000000000606e0
kernel: [10140.649826] Call Trace:
kernel: [10140.649830]  ? hci_uart_set_flow_control+0x20e/0x2c0 [hci_uart]
kernel: [10140.649836]  mrvl_setup+0x17/0x80 [hci_uart]
kernel: [10140.649840]  hci_uart_setup+0x56/0x160 [hci_uart]
kernel: [10140.649850]  hci_dev_do_open+0xe6/0x630 [bluetooth]
kernel: [10140.649860]  hci_power_on+0x52/0x220 [bluetooth]

> 
> IOW, I don't believe that this patch makes any sense.  If anything,
> we need to prevent unconditional tty_set_termios() on the path
> that *does* lead to calling it for pty.
> 

I don't think preventing unconditional tty_set_termios() is enough to
prevent secondary problems such as the one above.

For example, the following call chain leads to the WARN_ON that was
reported. Even if void hci_uart_set_baudrate() prevents the very first
tty_set_termios() call, its caller hci_uart_setup() continues with
more tty setup. It goes ahead to call driver setup callback. The
driver callback goes on to do more setup calling tty_set_termios().

WARN_ON call path:
  hci_uart_set_baudrate+0x1cc/0x250 drivers/bluetooth/hci_ldisc.c:378
  hci_uart_setup+0xa2/0x490 drivers/bluetooth/hci_ldisc.c:401
  hci_dev_do_open+0x6b1/0x1920 net/bluetooth/hci_core.c:1423

Once this WARN_ON is changed to return error, the following
happens, when hci_uart_setup() does driver setup callback.

kernel: [10140.649836]  mrvl_setup+0x17/0x80 [hci_uart]
kernel: [10140.649840]  hci_uart_setup+0x56/0x160 [hci_uart]
kernel: [10140.649850]  hci_dev_do_open+0xe6/0x630 [bluetooth]
kernel: [10140.649860]  hci_power_on+0x52/0x220 [bluetooth]

I think continuing to catch the invalid condition in tty_set_termios()
and preventing progress by checking return value is a straight forward
change to avoid secondary problems, and it might be difficult to catch
all the cases where it could fail. Here is the reproducer for reference:


#define _GNU_SOURCE

#include <endian.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

uint64_t r[1] = {0xffffffffffffffff};

int main(void)
{
   syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
   long res = 0;
   memcpy((void*)0x20000100, "/dev/ptmx\x00", 10);
   res = syscall(__NR_openat, 0xffffffffffffff9c, 0x20000100, 0, 0);
   if (res != -1)
     r[0] = res;
   *(uint32_t*)0x200000c0 = 0xf;
   syscall(__NR_ioctl, r[0], 0x5423, 0x200000c0);
   syscall(__NR_ioctl, r[0], 0x400455c8, 0xb);
   return 0;
}


thanks,
-- Shuah

^ permalink raw reply

* [PATCH net 2/2] net: ip6_gre: always reports o_key to userspace
From: Lorenzo Bianconi @ 2019-01-28 21:23 UTC (permalink / raw)
  To: davem; +Cc: netdev, u9012063
In-Reply-To: <cover.1548709944.git.lorenzo.bianconi@redhat.com>

As Erspan_v4, Erspan_v6 protocol relies on o_key to configure
session id header field. However TUNNEL_KEY bit is cleared in
ip6erspan_tunnel_xmit since ERSPAN protocol does not set the key field
of the external GRE header and so the configured o_key is not reported
to userspace. The issue can be triggered with the following reproducer:

$ip link add ip6erspan1 type ip6erspan local 2000::1 remote 2000::2 \
    key 1 seq erspan_ver 1
$ip link set ip6erspan1 up
ip -d link sh ip6erspan1

ip6erspan1@NONE: <BROADCAST,MULTICAST> mtu 1422 qdisc noop state DOWN mode DEFAULT
    link/ether ba:ff:09:24:c3:0e brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 1500
    ip6erspan remote 2000::2 local 2000::1 encaplimit 4 flowlabel 0x00000 ikey 0.0.0.1 iseq oseq

Fix the issue adding TUNNEL_KEY bit to the o_flags parameter in
ip6gre_fill_info

Fixes: 5a963eb61b7c ("ip6_gre: Add ERSPAN native tunnel support")
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
---
 net/ipv6/ip6_gre.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 4416368dbd49..801a9a0c217e 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -2098,12 +2098,17 @@ static int ip6gre_fill_info(struct sk_buff *skb, const struct net_device *dev)
 {
 	struct ip6_tnl *t = netdev_priv(dev);
 	struct __ip6_tnl_parm *p = &t->parms;
+	__be16 o_flags = p->o_flags;
+
+	if ((p->erspan_ver == 1 || p->erspan_ver == 2) &&
+	    !p->collect_md)
+		o_flags |= TUNNEL_KEY;
 
 	if (nla_put_u32(skb, IFLA_GRE_LINK, p->link) ||
 	    nla_put_be16(skb, IFLA_GRE_IFLAGS,
 			 gre_tnl_flags_to_gre_flags(p->i_flags)) ||
 	    nla_put_be16(skb, IFLA_GRE_OFLAGS,
-			 gre_tnl_flags_to_gre_flags(p->o_flags)) ||
+			 gre_tnl_flags_to_gre_flags(o_flags)) ||
 	    nla_put_be32(skb, IFLA_GRE_IKEY, p->i_key) ||
 	    nla_put_be32(skb, IFLA_GRE_OKEY, p->o_key) ||
 	    nla_put_in6_addr(skb, IFLA_GRE_LOCAL, &p->laddr) ||
-- 
2.20.1


^ permalink raw reply related

* [PATCH net 1/2] net: ip_gre: always reports o_key to userspace
From: Lorenzo Bianconi @ 2019-01-28 21:23 UTC (permalink / raw)
  To: davem; +Cc: netdev, u9012063
In-Reply-To: <cover.1548709944.git.lorenzo.bianconi@redhat.com>

Erspan protocol (version 1 and 2) relies on o_key to configure
session id header field. However TUNNEL_KEY bit is cleared in
erspan_xmit since ERSPAN protocol does not set the key field
of the external GRE header and so the configured o_key is not reported
to userspace. The issue can be triggered with the following reproducer:

$ip link add erspan1 type erspan local 192.168.0.1 remote 192.168.0.2 \
    key 1 seq erspan_ver 1
$ip link set erspan1 up
$ip -d link sh erspan1

erspan1@NONE: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc pfifo_fast state UNKNOWN mode DEFAULT
  link/ether 52:aa:99:95:9a:b5 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 1500
  erspan remote 192.168.0.2 local 192.168.0.1 ttl inherit ikey 0.0.0.1 iseq oseq erspan_index 0

Fix the issue adding TUNNEL_KEY bit to the o_flags parameter in
ipgre_fill_info

Fixes: 84e54fe0a5ea ("gre: introduce native tunnel support for ERSPAN")
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
---
 net/ipv4/ip_gre.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 20a64fe6254b..3978f807fa8b 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -1455,12 +1455,17 @@ static int ipgre_fill_info(struct sk_buff *skb, const struct net_device *dev)
 {
 	struct ip_tunnel *t = netdev_priv(dev);
 	struct ip_tunnel_parm *p = &t->parms;
+	__be16 o_flags = p->o_flags;
+
+	if ((t->erspan_ver == 1 || t->erspan_ver == 2) &&
+	    !t->collect_md)
+		o_flags |= TUNNEL_KEY;
 
 	if (nla_put_u32(skb, IFLA_GRE_LINK, p->link) ||
 	    nla_put_be16(skb, IFLA_GRE_IFLAGS,
 			 gre_tnl_flags_to_gre_flags(p->i_flags)) ||
 	    nla_put_be16(skb, IFLA_GRE_OFLAGS,
-			 gre_tnl_flags_to_gre_flags(p->o_flags)) ||
+			 gre_tnl_flags_to_gre_flags(o_flags)) ||
 	    nla_put_be32(skb, IFLA_GRE_IKEY, p->i_key) ||
 	    nla_put_be32(skb, IFLA_GRE_OKEY, p->o_key) ||
 	    nla_put_in_addr(skb, IFLA_GRE_LOCAL, p->iph.saddr) ||
-- 
2.20.1


^ permalink raw reply related

* [PATCH net 0/2] erspan: always reports output key to userspace
From: Lorenzo Bianconi @ 2019-01-28 21:23 UTC (permalink / raw)
  To: davem; +Cc: netdev, u9012063

Erspan protocol relies on output key to set session id header field.
However TUNNEL_KEY bit is cleared in order to not add key field to
the external GRE header and so the configured o_key is not reported
to userspace.
Fix the issue adding TUNNEL_KEY bit to the o_flags parameter dumping
device info

Lorenzo Bianconi (2):
  net: ip_gre: always reports o_key to userspace
  net: ip6_gre: always reports o_key to userspace

 net/ipv4/ip_gre.c  | 7 ++++++-
 net/ipv6/ip6_gre.c | 7 ++++++-
 2 files changed, 12 insertions(+), 2 deletions(-)

-- 
2.20.1

^ permalink raw reply

* Re: WoL broken in r8169.c since kernel 4.19
From: Heiner Kallweit @ 2019-01-28 21:21 UTC (permalink / raw)
  To: Marc Haber; +Cc: netdev@vger.kernel.org
In-Reply-To: <20190128205957.GI27062@torres.zugschlus.de>

On 28.01.2019 21:59, Marc Haber wrote:
> On Mon, Jan 28, 2019 at 08:02:47PM +0100, Heiner Kallweit wrote:
>> One more test .. Can you provide the output of the following under 4.18 and under 4.19?
>> It may not apply cleanly, but you get the idea. The message is written when suspending.
> 
> I booted, suspeneded, sent a magic packet that got ignored. I then woke
> the box up with the any key, went through the ethtool motions, suspended
> again, sent a magic paket that the machine acted on and woke up
> 
> Log says:
> 1 [1/4994]mh@fan:~ $ grep 'may wakeup' /var/log/syslog/syslog
> Jan 28 21:51:44 fan kernel: [   77.571211] may wakeup? 0
> Jan 28 21:54:11 fan kernel: [  183.994131] may wakeup? 1
> [2/4995]mh@fan:~ $ 
> 
> Greetings
> Marc
> 

Thanks, this makes clearer what's going on.
This change to r8169
bde135a672bf ("r8169: only enable PCI wakeups when WOL is active")
removed marking the device as wakeup-enabled to work around an
issue on certain notebooks (see commit message for details).

ethtool doesn't care about the current settings and sets the new ones
(wakeup-enabling the device). systemd however checks the current settings
and writes new ones only if there's a change. Therefore in your case
systemd does nothing and device doesn't get wakeup-enabled.

Not totally clear to me is why it works under 4.18.
And I wonder why the following didn't work for you, you said it makes
no difference. Could you try again the following in addition to the
latest debug output statement?

 drivers/net/ethernet/realtek/r8169.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 3e650bd9e..bd26d3f2e 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -7442,6 +7442,7 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 	}

 	tp->saved_wolopts = __rtl8169_get_wol(tp);
+	device_set_wakeup_enable(&pdev->dev, tp->saved_wolopts);

 	mutex_init(&tp->wk.mutex);
 	INIT_WORK(&tp->wk.work, rtl_task);
-- 

^ permalink raw reply related

* Re: [PATCH bpf-next v4 7/7] samples/bpf: Check the prog id before exiting
From: John Fastabend @ 2019-01-28 21:10 UTC (permalink / raw)
  To: Maciej Fijalkowski, daniel, ast; +Cc: netdev, jakub.kicinski, brouer
In-Reply-To: <20190128191613.11705-8-maciejromanfijalkowski@gmail.com>

On 1/28/19 11:16 AM, Maciej Fijalkowski wrote:
> From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> 
> Check the program id within the signal handler on polling xdp samples
> that were previously converted to libbpf usage. Avoid the situation of
> unloading the program that was not attached by sample that is exiting.
> 
> Reported-by: Michal Papaj <michal.papaj@intel.com>
> Reported-by: Jakub Spizewski <jakub.spizewski@intel.com>
> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> ---

Acked-by: John Fastabend <john.fastabend@gmail.com>


^ permalink raw reply

* Re: [PATCH bpf-next v4 6/7] libbpf: Add a support for getting xdp prog id on ifindex
From: John Fastabend @ 2019-01-28 21:07 UTC (permalink / raw)
  To: Maciej Fijalkowski, daniel, ast; +Cc: netdev, jakub.kicinski, brouer
In-Reply-To: <20190128191613.11705-7-maciejromanfijalkowski@gmail.com>

On 1/28/19 11:16 AM, Maciej Fijalkowski wrote:
> From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> 
> Since we have a dedicated netlink attributes for xdp setup on a
> particular interface, it is now possible to retrieve the program id that
> is currently attached to the interface. The use case is targeted for
> sample xdp programs, which will store the program id just after loading
> bpf program onto iface. On shutdown, the sample will make sure that it
> can unload the program by querying again the iface and verifying that
> both program id's matches.
> 
> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> ---

small nit.

> +
> +int bpf_get_link_xdp_id(int ifindex, __u32 *prog_id, __u32 flags)
> +{
> +	struct xdp_id_md xdp_id = {};
> +	int sock, ret;
> +	__u32 nl_pid;
> +	__u32 mask;
> +
> +	if (flags & ~XDP_FLAGS_MASK)
> +		return -EINVAL;
> +
> +	/* Check whether the single {HW,DRV,SKB} mode is set */
> +	flags &= XDP_FLAGS_MODES;
> +	mask = flags - 1;
> +	if (flags && flags & mask)
> +		return -EINVAL;
> +
> +	sock = libbpf_netlink_open(&nl_pid);
> +	if (sock < 0)
> +		return sock;
> +
> +	xdp_id.ifindex = ifindex;
> +	xdp_id.flags = flags;
> +
> +	ret = libbpf_nl_get_link(sock, nl_pid, get_xdp_id, &xdp_id);
> +	*prog_id = xdp_id.id;

just a nit but should we really set prog_id from user if there is
an error. Probably friendlier not to set caller data in error
case.

> +
> +	close(sock);
> +	return ret;
> +}
> +
>  int libbpf_nl_get_link(int sock, unsigned int nl_pid,
>  		       libbpf_dump_nlmsg_t dump_link_nlmsg, void *cookie)
>  {
> 


^ permalink raw reply

* Re: WoL broken in r8169.c since kernel 4.19
From: Marc Haber @ 2019-01-28 20:59 UTC (permalink / raw)
  To: Heiner Kallweit; +Cc: netdev@vger.kernel.org
In-Reply-To: <d677ecf0-b106-03a0-2c66-d817ff39d20a@gmail.com>

On Mon, Jan 28, 2019 at 08:02:47PM +0100, Heiner Kallweit wrote:
> One more test .. Can you provide the output of the following under 4.18 and under 4.19?
> It may not apply cleanly, but you get the idea. The message is written when suspending.

I booted, suspeneded, sent a magic packet that got ignored. I then woke
the box up with the any key, went through the ethtool motions, suspended
again, sent a magic paket that the machine acted on and woke up

Log says:
1 [1/4994]mh@fan:~ $ grep 'may wakeup' /var/log/syslog/syslog
Jan 28 21:51:44 fan kernel: [   77.571211] may wakeup? 0
Jan 28 21:54:11 fan kernel: [  183.994131] may wakeup? 1
[2/4995]mh@fan:~ $ 

Greetings
Marc

-- 
-----------------------------------------------------------------------------
Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany    |  lose things."    Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421

^ permalink raw reply

* Re: [PATCH bpf-next v4 5/7] samples/bpf: Add a "force" flag to XDP samples
From: John Fastabend @ 2019-01-28 20:57 UTC (permalink / raw)
  To: Maciej Fijalkowski, daniel, ast; +Cc: netdev, jakub.kicinski, brouer
In-Reply-To: <20190128191613.11705-6-maciejromanfijalkowski@gmail.com>

On 1/28/19 11:16 AM, Maciej Fijalkowski wrote:
> From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> 
> Make xdp samples consistent with iproute2 behavior and set the
> XDP_FLAGS_UPDATE_IF_NOEXIST by default when setting the xdp program on
> interface. Provide an option for user to force the program loading,
> which as a result will not include the mentioned flag in
> bpf_set_link_xdp_fd call.
> 
> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> ---

Patch is fine but I don't really know if being consistent with
iproute2 here gets us much. I guess if users find the "force"
behavior better for a sample that is OK though.

Acked-by: John Fastabend <john.fastabend@gmail.com>


^ permalink raw reply

* Re: [PATCH bpf-next v4 4/7] samples/bpf: Extend RLIMIT_MEMLOCK for xdp_{sample_pkts, router_ipv4}
From: John Fastabend @ 2019-01-28 20:54 UTC (permalink / raw)
  To: Maciej Fijalkowski, daniel, ast; +Cc: netdev, jakub.kicinski, brouer
In-Reply-To: <20190128191613.11705-5-maciejromanfijalkowski@gmail.com>

On 1/28/19 11:16 AM, Maciej Fijalkowski wrote:
> From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> 
> There is a common problem with xdp samples that happens when user wants
> to run a particular sample and some bpf program is already loaded. The
> default 64kb RLIMIT_MEMLOCK resource limit will cause a following error
> (assuming that xdp sample that is failing was converted to libbpf
> usage):
> 
> libbpf: Error in bpf_object__probe_name():Operation not permitted(1).
> Couldn't load basic 'r0 = 0' BPF program.
> libbpf: failed to load object './xdp_sample_pkts_kern.o'
> 
> Fix it in xdp_sample_pkts and xdp_router_ipv4 by setting RLIMIT_MEMLOCK
> to RLIM_INFINITY.
> 
> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---

Seems we do this for most samples so nice to have here as well.

Acked-by: John Fastabend <john.fastabend@gmail.com>


^ permalink raw reply

* Re: [PATCH RFC v2 2/3] net: Support GRO/GSO fraglist chaining.
From: Willem de Bruijn @ 2019-01-28 20:50 UTC (permalink / raw)
  To: Steffen Klassert
  Cc: Network Development, Willem de Bruijn, Paolo Abeni,
	Jason A. Donenfeld
In-Reply-To: <20190128085025.14532-3-steffen.klassert@secunet.com>

On Mon, Jan 28, 2019 at 2:53 AM Steffen Klassert
<steffen.klassert@secunet.com> wrote:
>
> This patch adds the core functions to chain/unchain
> GSO skbs at the frag_list pointer. This also adds
> a new GSO type SKB_GSO_FRAGLIST and a is_flist
> flag to napi_gro_cb which indicates that this
> flow will be GROed by fraglist chaining.
>
> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>

> +struct sk_buff *skb_segment_list(struct sk_buff *skb,
> +                                netdev_features_t features,
> +                                unsigned int offset)
> +{
> +       struct sk_buff *list_skb = skb_shinfo(skb)->frag_list;
> +       unsigned int tnl_hlen = skb_tnl_header_len(skb);
> +       unsigned int delta_truesize = 0;
> +       unsigned int delta_len = 0;
> +       struct sk_buff *tail = NULL;
> +       struct sk_buff *nskb;
> +
> +       skb_push(skb, -skb_network_offset(skb) + offset);
> +
> +       skb_shinfo(skb)->frag_list = NULL;
> +
> +       do {
> +               nskb = list_skb;
> +               list_skb = list_skb->next;
> +
> +               if (!tail)
> +                       skb->next = nskb;
> +               else
> +                       tail->next = nskb;
> +
> +               tail = nskb;
> +
> +               delta_len += nskb->len;
> +               delta_truesize += nskb->truesize;
> +
> +               skb_push(nskb, -skb_network_offset(nskb) + offset);
> +
> +               if (!secpath_exists(nskb))
> +                       __skb_ext_copy(nskb, skb);
> +
> +               memcpy(nskb->cb, skb->cb, sizeof(skb->cb));
> +
> +               nskb->ip_summed = CHECKSUM_NONE;
> +               nskb->csum_valid = 1;
> +               nskb->tstamp = skb->tstamp;
> +               nskb->dev = skb->dev;
> +               nskb->queue_mapping = skb->queue_mapping;
> +
> +               nskb->mac_len = skb->mac_len;
> +               nskb->mac_header = skb->mac_header;
> +               nskb->transport_header = skb->transport_header;
> +               nskb->network_header = skb->network_header;
> +               skb_dst_copy(nskb, skb);
> +
> +               skb_headers_offset_update(nskb, skb_headroom(nskb) - skb_headroom(skb));
> +               skb_copy_from_linear_data_offset(skb, -tnl_hlen,
> +                                                nskb->data - tnl_hlen,
> +                                                offset + tnl_hlen);
> +
> +               if (skb_needs_linearize(nskb, features) &&
> +                   __skb_linearize(nskb)) {
> +                       kfree_skb_list(skb->next);
> +                       skb->next = NULL;
> +                       return ERR_PTR(-ENOMEM);
> +               }
> +       } while (list_skb);
> +
> +       skb->truesize = skb->truesize - delta_truesize;
> +       skb->data_len = skb->data_len - delta_len;
> +       skb->len = skb->len - delta_len;
> +
> +       skb_gso_reset(skb);
> +
> +       skb->prev = tail;
> +
> +       if (skb_needs_linearize(skb, features) &&
> +           __skb_linearize(skb)) {
> +               skb->next = NULL;
> +               kfree_skb_list(skb->next);

inverse order

also, I would probably deduplicate with the same branch above in a new
err_linearize: block

^ permalink raw reply

* Re: [PATCH bpf-next v4 2/7] samples/bpf: xdp_redirect_cpu have not need for read_trace_pipe
From: John Fastabend @ 2019-01-28 20:50 UTC (permalink / raw)
  To: Maciej Fijalkowski, daniel, ast; +Cc: netdev, jakub.kicinski, brouer
In-Reply-To: <20190128191613.11705-3-maciejromanfijalkowski@gmail.com>

On 1/28/19 11:16 AM, Maciej Fijalkowski wrote:
> From: Jesper Dangaard Brouer <brouer@redhat.com>
> 
> The sample xdp_redirect_cpu is not using helper bpf_trace_printk.
> Thus it makes no sense that the --debug option us reading
> from /sys/kernel/debug/tracing/trace_pipe via read_trace_pipe.
> Simply remove it.
> 
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>

Acked-by: John Fastabend <john.fastabend@gmail.com>

^ permalink raw reply

* Re: [PATCH RFC v2 3/3] udp: Support UDP fraglist GRO/GSO.
From: Willem de Bruijn @ 2019-01-28 20:49 UTC (permalink / raw)
  To: Steffen Klassert
  Cc: Network Development, Willem de Bruijn, Paolo Abeni,
	Jason A. Donenfeld
In-Reply-To: <20190128085025.14532-4-steffen.klassert@secunet.com>

On Mon, Jan 28, 2019 at 2:51 AM Steffen Klassert
<steffen.klassert@secunet.com> wrote:
>
> This patch extends UDP GRO to support fraglist GRO/GSO
> by using the previously introduced infrastructure.
> All UDP packets that are not targeted to a GRO capable
> UDP sockets are going to fraglist GRO now (local input
> and forward).
> ---
>  net/ipv4/udp_offload.c | 45 ++++++++++++++++++++++++++++++++++++++----
>  net/ipv6/udp_offload.c |  9 +++++++++
>  2 files changed, 50 insertions(+), 4 deletions(-)
>
> diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
> index 584635db9231..c0be33216750 100644
> --- a/net/ipv4/udp_offload.c
> +++ b/net/ipv4/udp_offload.c
> @@ -188,6 +188,22 @@ struct sk_buff *skb_udp_tunnel_segment(struct sk_buff *skb,
>  }
>  EXPORT_SYMBOL(skb_udp_tunnel_segment);
>
> +static struct sk_buff *__udp_gso_segment_list(struct sk_buff *skb,
> +                                             netdev_features_t features)
> +{
> +       unsigned int mss = skb_shinfo(skb)->gso_size;
> +
> +       skb = skb_segment_list(skb, features, skb_mac_header_len(skb));
> +       if (IS_ERR(skb))
> +               return skb;
> +
> +       udp_hdr(skb)->len = htons(sizeof(struct udphdr) + mss);
> +       skb->ip_summed = CHECKSUM_NONE;
> +       skb->csum_valid = 1;

csum_valid is only used on ingress.

Hardcoding CHECKSUM_NONE is probably fine as long as this function is
only used for forwarding, assuming we don't care about verifiying
checksums in the forwarding case.

But this is fragile if we ever add local list segment output. Should
convert the checksum field in skb_forward_csum, instead of at the GSO
layer, just as for forwarding of non list skbs? Though that would
require traversing the list yet another time. Other option is to
already do this conversion when building the list in GRO.

The comment also applies to the same logic in skb_segment_list. As a
matter or fact, even if this belongs in GSO instead of core forwarding
or GRO, then probably both the list head and frag_list skbs should be
set in the same function, so skb_segment_list.

> +
> +       return skb;
> +}
> +
>  struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
>                                   netdev_features_t features)
>  {
> @@ -200,6 +216,9 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
>         __sum16 check;
>         __be16 newlen;
>
> +       if (skb_shinfo(gso_skb)->gso_type & SKB_GSO_FRAGLIST)
> +               return __udp_gso_segment_list(gso_skb, features);
> +
>         mss = skb_shinfo(gso_skb)->gso_size;
>         if (gso_skb->len <= sizeof(*uh) + mss)
>                 return ERR_PTR(-EINVAL);
> @@ -352,16 +371,15 @@ static struct sk_buff *udp_gro_receive_segment(struct list_head *head,
>         struct sk_buff *pp = NULL;
>         struct udphdr *uh2;
>         struct sk_buff *p;
> +       int ret;
>
>         /* requires non zero csum, for symmetry with GSO */
>         if (!uh->check) {
>                 NAPI_GRO_CB(skb)->flush = 1;
>                 return NULL;
>         }
> -

Accidental whitespace removal?

>         /* pull encapsulating udp header */
>         skb_gro_pull(skb, sizeof(struct udphdr));
> -       skb_gro_postpull_rcsum(skb, uh, sizeof(struct udphdr));
>
>         list_for_each_entry(p, head, list) {
>                 if (!NAPI_GRO_CB(p)->same_flow)
> @@ -379,8 +397,17 @@ static struct sk_buff *udp_gro_receive_segment(struct list_head *head,
>                  * Under small packet flood GRO count could elsewhere grow a lot
>                  * leading to execessive truesize values
>                  */
> -               if (!skb_gro_receive(p, skb) &&
> -                   NAPI_GRO_CB(p)->count >= UDP_GRO_CNT_MAX)
> +               if (NAPI_GRO_CB(skb)->is_flist) {
> +                       if (!pskb_may_pull(skb, skb_gro_offset(skb)))
> +                               return NULL;
> +                       ret = skb_gro_receive_list(p, skb);
> +               } else {
> +                       skb_gro_postpull_rcsum(skb, uh, sizeof(struct udphdr));
> +
> +                       ret = skb_gro_receive(p, skb);
> +               }
> +
> +               if (!ret && NAPI_GRO_CB(p)->count > UDP_GRO_CNT_MAX)
>                         pp = p;
>                 else if (uh->len != uh2->len)
>                         pp = p;
> @@ -402,6 +429,7 @@ struct sk_buff *udp_gro_receive(struct list_head *head, struct sk_buff *skb,
>         int flush = 1;
>
>         if (!sk || !udp_sk(sk)->gro_receive) {
> +               NAPI_GRO_CB(skb)->is_flist = sk ? !udp_sk(sk)->gro_enabled: 1;

This updates the choice of whether to use a list on each received skb.
Which is problematic as a socket can call the setsockopt in between
packets.

Actually, there no longer is a need for a route lookup for each skb at
all. We always apply GRO, which was the previous reason for the
lookup. And if a matching flow is found in the GRO table, we already
the choice to use a list is already stored.


>                 pp = call_gro_receive(udp_gro_receive_segment, head, skb);
>                 return pp;
>         }
> @@ -530,6 +558,15 @@ INDIRECT_CALLABLE_SCOPE int udp4_gro_complete(struct sk_buff *skb, int nhoff)
>         const struct iphdr *iph = ip_hdr(skb);
>         struct udphdr *uh = (struct udphdr *)(skb->data + nhoff);
>
> +       if (NAPI_GRO_CB(skb)->is_flist) {
> +               uh->len = htons(skb->len - nhoff);
> +
> +               skb_shinfo(skb)->gso_type |= (SKB_GSO_FRAGLIST|SKB_GSO_UDP_L4);
> +               skb_shinfo(skb)->gso_segs = NAPI_GRO_CB(skb)->count;
> +
> +               return 0;
> +       }
> +
>         if (uh->check)
>                 uh->check = ~udp_v4_check(skb->len - nhoff, iph->saddr,
>                                           iph->daddr, 0);
> diff --git a/net/ipv6/udp_offload.c b/net/ipv6/udp_offload.c
> index 5f7937a4f71a..7c3f28310baa 100644
> --- a/net/ipv6/udp_offload.c
> +++ b/net/ipv6/udp_offload.c
> @@ -154,6 +154,15 @@ INDIRECT_CALLABLE_SCOPE int udp6_gro_complete(struct sk_buff *skb, int nhoff)
>         const struct ipv6hdr *ipv6h = ipv6_hdr(skb);
>         struct udphdr *uh = (struct udphdr *)(skb->data + nhoff);
>
> +       if (NAPI_GRO_CB(skb)->is_flist) {
> +               uh->len = htons(skb->len - nhoff);
> +
> +               skb_shinfo(skb)->gso_type |= (SKB_GSO_FRAGLIST|SKB_GSO_UDP_L4);
> +               skb_shinfo(skb)->gso_segs = NAPI_GRO_CB(skb)->count;
> +
> +               return 0;
> +       }
> +
>         if (uh->check)
>                 uh->check = ~udp_v6_check(skb->len - nhoff, &ipv6h->saddr,
>                                           &ipv6h->daddr, 0);
> --
> 2.17.1
>

^ permalink raw reply

* Re: [PATCH bpf-next v4 1/7] libbpf: Add a helper for retrieving a map fd for a given name
From: John Fastabend @ 2019-01-28 20:49 UTC (permalink / raw)
  To: Maciej Fijalkowski, daniel, ast; +Cc: netdev, jakub.kicinski, brouer
In-Reply-To: <20190128191613.11705-2-maciejromanfijalkowski@gmail.com>

On 1/28/19 11:16 AM, Maciej Fijalkowski wrote:
> From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> 
> XDP samples are mostly cooperating with eBPF maps through their file
> descriptors. In case of a eBPF program that contains multiple maps it
> might be tiresome to iterate through them and call bpf_map__fd for each
> one. Add a helper mostly based on bpf_object__find_map_by_name, but
> instead of returning the struct bpf_map pointer, return map fd.
> 
> Suggested-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> ---

I've been carrying something similar around for awhile as well.

Acked-by: John Fastabend <john.fastabend@gmail.com>

^ permalink raw reply

* Re: [PATCH v4 bpf-next 1/9] bpf: introduce bpf_spin_lock
From: Alexei Starovoitov @ 2019-01-28 20:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexei Starovoitov, davem, daniel, jakub.kicinski, netdev,
	kernel-team, mingo, will.deacon, Paul McKenney, jannh
In-Reply-To: <20190128083508.GA28878@hirez.programming.kicks-ass.net>

On Mon, Jan 28, 2019 at 09:35:08AM +0100, Peter Zijlstra wrote:
> On Mon, Jan 28, 2019 at 09:31:23AM +0100, Peter Zijlstra wrote:
> > On Fri, Jan 25, 2019 at 03:42:43PM -0800, Alexei Starovoitov wrote:
> > > On Fri, Jan 25, 2019 at 10:10:57AM +0100, Peter Zijlstra wrote:
> > > > On Thu, Jan 24, 2019 at 03:58:59PM -0800, Alexei Starovoitov wrote:
> > 
> > > >  > nmi checks for bpf_prog_active==0. See bpf_overflow_handler.
> > 
> > > > yuck yuck yuck.. That's horrific :-( That means the whole BPF crud is
> > > > unreliable and events can go randomly missing.
> > > 
> > > bpf_prog_active is the mechanism to workaround non-reentrant pieces of the kernel.
> > 
> > 'the kernel' or 'bpf' ?
> > 
> > perf has a recursion counter per context (task,softirq,hardirq,nmi) and
> > that ensures that perf doesn't recurse in on itself while allowing the
> > nesting of these contexts.
> > 
> > But if BPF itself is not able to deal with such nesting that won't work
> > of course.
> 
> Ooh, later you say:
> 
> > Also we allow tracing progs to nest with networking progs.
> 
> Which seems to suggest BPF itself can suppord (limited) nesting.

well I'm not sure where is the boundary between bpf and the kernel :)
By non-reentrant I meant pieces of the kernel that bpf maps
or bpf helpers may use.
For example kmalloc-style bpf map is using rcu and map update/delete
do call_rcu() from inside bpf program.
If kprobe or tracepoint bpf prog is called from inner bits of
__call_rcu_core and the prog is doing map update we get into
recursive __call_rcu_core and things go bad.
Hence we disallow such situations when possible.
If bpf had its own infrastructure for everything then it would be
easy enough to allow proper recursion using mechanism similar
to what perf does with get_recursion_context.

^ permalink raw reply

* Re: [PATCH bpf-next] tools: bpftool: warn about risky prog array updates
From: Song Liu @ 2019-01-28 20:45 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Alexei Starovoitov, Daniel Borkmann, oss-drivers, Networking
In-Reply-To: <20190128182915.434-1-jakub.kicinski@netronome.com>

On Mon, Jan 28, 2019 at 10:30 AM Jakub Kicinski
<jakub.kicinski@netronome.com> wrote:
>
> When prog array is updated with bpftool users often refer
> to the map via the ID.  Unfortunately, that's likely
> to lead to confusion because prog arrays get flushed when
> the last user reference is gone.  If there is no other
> reference bpftool will create one, update successfully
> just to close the map again and have it flushed.
>
> Warn about this case in non-JSON mode.
>
> If the problem continues causing confusion we can remove
> the support for referring to a map by ID for prog array
> update completely.  For now it seems like the potential
> inconvenience to users who know what they're doing outweighs
> the benefit.
>
> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>

Acked-by: Song Liu <songliubraving@fb.com>

> ---
>  tools/bpf/bpftool/map.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c
> index 8cb0e26907ff..6f33818bb6b6 100644
> --- a/tools/bpf/bpftool/map.c
> +++ b/tools/bpf/bpftool/map.c
> @@ -426,6 +426,9 @@ static int parse_elem(char **argv, struct bpf_map_info *info,
>                                 p_err("not enough value arguments for map of progs");
>                                 return -1;
>                         }
> +                       if (is_prefix(*argv, "id"))
> +                               p_info("Warning: updating program array via MAP_ID, make sure this map is kept open\n"
> +                                      "         by some process or pinned otherwise update will be lost");
>
>                         fd = prog_parse_fd(&argc, &argv);
>                         if (fd < 0)
> --
> 2.19.2
>

^ permalink raw reply

* Re: [PATCH bpf] tools: bpftool: fix crash with un-owned prog arrays
From: Song Liu @ 2019-01-28 20:42 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Alexei Starovoitov, Daniel Borkmann, oss-drivers, Networking
In-Reply-To: <20190128180121.31362-1-jakub.kicinski@netronome.com>

On Mon, Jan 28, 2019 at 10:06 AM Jakub Kicinski
<jakub.kicinski@netronome.com> wrote:
>
> Prog arrays don't have 'owner_prog_type' and 'owner_jited'
> fields in their fdinfo when they are created.  Those fields
> are set and reported when first program is checked for
> compatibility by bpf_prog_array_compatible().
>
> This means that bpftool cannot expect the fields to always
> be there.  Currently trying to show maps on a system with
> an un-owned prog array leads to a crash:
>
> $ bpftool map show
> 389: prog_array  name tail_call_map  flags 0x0
> Error: key 'owner_prog_type' not found in fdinfo
> Error: key 'owner_jited' not found in fdinfo
>        key 4B  value 4B  max_entries 4  memlock 4096B
>        Segmentation fault (core dumped)
>
> We pass a NULL pointer to atoi().
>
> Remove the assumption that fdinfo keys are always present.
> Add missing validations and remove the p_err() calls which
> may lead to broken JSON output as caller will not propagate
> the failure.
>
> Fixes: 99a44bef5870 ("tools: bpftool: add owner_prog_type and owner_jited to bpftool output")
> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>

Acked-by: Song Liu <songliubraving@fb.com>

> ---
>  tools/bpf/bpftool/common.c |  6 +-----
>  tools/bpf/bpftool/map.c    | 17 ++++++++---------
>  2 files changed, 9 insertions(+), 14 deletions(-)
>
> diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c
> index 897483457bf0..f7261fad45c1 100644
> --- a/tools/bpf/bpftool/common.c
> +++ b/tools/bpf/bpftool/common.c
> @@ -297,10 +297,8 @@ char *get_fdinfo(int fd, const char *key)
>         snprintf(path, sizeof(path), "/proc/self/fdinfo/%d", fd);
>
>         fdi = fopen(path, "r");
> -       if (!fdi) {
> -               p_err("can't open fdinfo: %s", strerror(errno));
> +       if (!fdi)
>                 return NULL;
> -       }
>
>         while ((n = getline(&line, &line_n, fdi)) > 0) {
>                 char *value;
> @@ -313,7 +311,6 @@ char *get_fdinfo(int fd, const char *key)
>
>                 value = strchr(line, '\t');
>                 if (!value || !value[1]) {
> -                       p_err("malformed fdinfo!?");
>                         free(line);
>                         return NULL;
>                 }
> @@ -326,7 +323,6 @@ char *get_fdinfo(int fd, const char *key)
>                 return line;
>         }
>
> -       p_err("key '%s' not found in fdinfo", key);
>         free(line);
>         fclose(fdi);
>         return NULL;
> diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c
> index 29a3468c6cf6..1ef1ee2280a2 100644
> --- a/tools/bpf/bpftool/map.c
> +++ b/tools/bpf/bpftool/map.c
> @@ -513,10 +513,9 @@ static int show_map_close_json(int fd, struct bpf_map_info *info)
>                                 jsonw_uint_field(json_wtr, "owner_prog_type",
>                                                  prog_type);
>                 }
> -               if (atoi(owner_jited))
> -                       jsonw_bool_field(json_wtr, "owner_jited", true);
> -               else
> -                       jsonw_bool_field(json_wtr, "owner_jited", false);
> +               if (owner_jited)
> +                       jsonw_bool_field(json_wtr, "owner_jited",
> +                                        !!atoi(owner_jited));
>
>                 free(owner_prog_type);
>                 free(owner_jited);
> @@ -569,7 +568,8 @@ static int show_map_close_plain(int fd, struct bpf_map_info *info)
>                 char *owner_prog_type = get_fdinfo(fd, "owner_prog_type");
>                 char *owner_jited = get_fdinfo(fd, "owner_jited");
>
> -               printf("\n\t");
> +               if (owner_prog_type || owner_jited)
> +                       printf("\n\t");
>                 if (owner_prog_type) {
>                         unsigned int prog_type = atoi(owner_prog_type);
>
> @@ -579,10 +579,9 @@ static int show_map_close_plain(int fd, struct bpf_map_info *info)
>                         else
>                                 printf("owner_prog_type %d  ", prog_type);
>                 }
> -               if (atoi(owner_jited))
> -                       printf("owner jited");
> -               else
> -                       printf("owner not jited");
> +               if (owner_jited)
> +                       printf("owner%s jited",
> +                              atoi(owner_jited) ? "" : " not");
>
>                 free(owner_prog_type);
>                 free(owner_jited);
> --
> 2.19.2
>

^ permalink raw reply

* Re: [PATCH bpf-next v3 3/3] selftests/bpf: add simple BPF_PROG_TEST_RUN examples for flow dissector
From: Song Liu @ 2019-01-28 20:37 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Networking, David S . Miller, Alexei Starovoitov, Daniel Borkmann
In-Reply-To: <20190128165355.229403-4-sdf@google.com>

On Mon, Jan 28, 2019 at 8:55 AM Stanislav Fomichev <sdf@google.com> wrote:
>
> Use existing pkt_v4 and pkt_v6 to make sure flow_keys are what we want.
>
> Also, add new bpf_flow_load routine (and flow_dissector_load.h header)
> that loads bpf_flow.o program and does all required setup.
>
> Signed-off-by: Stanislav Fomichev <sdf@google.com>

Acked-by: Song Liu <songliubraving@fb.com>

> ---
>  tools/testing/selftests/bpf/Makefile          |  3 +
>  .../selftests/bpf/flow_dissector_load.c       | 43 ++--------
>  .../selftests/bpf/flow_dissector_load.h       | 55 +++++++++++++
>  tools/testing/selftests/bpf/test_progs.c      | 78 ++++++++++++++++++-
>  4 files changed, 139 insertions(+), 40 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/flow_dissector_load.h
>
> diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
> index 89b0d1799ff3..c566090e5657 100644
> --- a/tools/testing/selftests/bpf/Makefile
> +++ b/tools/testing/selftests/bpf/Makefile
> @@ -148,6 +148,9 @@ $(OUTPUT)/test_xdp_noinline.o: CLANG_FLAGS += -fno-inline
>  $(OUTPUT)/test_queue_map.o: test_queue_stack_map.h
>  $(OUTPUT)/test_stack_map.o: test_queue_stack_map.h
>
> +$(OUTPUT)/flow_dissector_load.o: flow_dissector_load.h
> +$(OUTPUT)/test_progs.o: flow_dissector_load.h
> +
>  BTF_LLC_PROBE := $(shell $(LLC) -march=bpf -mattr=help 2>&1 | grep dwarfris)
>  BTF_PAHOLE_PROBE := $(shell $(BTF_PAHOLE) --help 2>&1 | grep BTF)
>  BTF_OBJCOPY_PROBE := $(shell $(LLVM_OBJCOPY) --help 2>&1 | grep -i 'usage.*llvm')
> diff --git a/tools/testing/selftests/bpf/flow_dissector_load.c b/tools/testing/selftests/bpf/flow_dissector_load.c
> index ae8180b11d5f..77cafa66d048 100644
> --- a/tools/testing/selftests/bpf/flow_dissector_load.c
> +++ b/tools/testing/selftests/bpf/flow_dissector_load.c
> @@ -12,6 +12,7 @@
>  #include <bpf/libbpf.h>
>
>  #include "bpf_rlimit.h"
> +#include "flow_dissector_load.h"
>
>  const char *cfg_pin_path = "/sys/fs/bpf/flow_dissector";
>  const char *cfg_map_name = "jmp_table";
> @@ -21,46 +22,13 @@ char *cfg_path_name;
>
>  static void load_and_attach_program(void)
>  {
> -       struct bpf_program *prog, *main_prog;
> -       struct bpf_map *prog_array;
> -       int i, fd, prog_fd, ret;
> +       int prog_fd, ret;
>         struct bpf_object *obj;
> -       int prog_array_fd;
>
> -       ret = bpf_prog_load(cfg_path_name, BPF_PROG_TYPE_FLOW_DISSECTOR, &obj,
> -                           &prog_fd);
> +       ret = bpf_flow_load(&obj, cfg_path_name, cfg_section_name,
> +                           cfg_map_name, &prog_fd);
>         if (ret)
> -               error(1, 0, "bpf_prog_load %s", cfg_path_name);
> -
> -       main_prog = bpf_object__find_program_by_title(obj, cfg_section_name);
> -       if (!main_prog)
> -               error(1, 0, "bpf_object__find_program_by_title %s",
> -                     cfg_section_name);
> -
> -       prog_fd = bpf_program__fd(main_prog);
> -       if (prog_fd < 0)
> -               error(1, 0, "bpf_program__fd");
> -
> -       prog_array = bpf_object__find_map_by_name(obj, cfg_map_name);
> -       if (!prog_array)
> -               error(1, 0, "bpf_object__find_map_by_name %s", cfg_map_name);
> -
> -       prog_array_fd = bpf_map__fd(prog_array);
> -       if (prog_array_fd < 0)
> -               error(1, 0, "bpf_map__fd %s", cfg_map_name);
> -
> -       i = 0;
> -       bpf_object__for_each_program(prog, obj) {
> -               fd = bpf_program__fd(prog);
> -               if (fd < 0)
> -                       error(1, 0, "bpf_program__fd");
> -
> -               if (fd != prog_fd) {
> -                       printf("%d: %s\n", i, bpf_program__title(prog, false));
> -                       bpf_map_update_elem(prog_array_fd, &i, &fd, BPF_ANY);
> -                       ++i;
> -               }
> -       }
> +               error(1, 0, "bpf_flow_load %s", cfg_path_name);
>
>         ret = bpf_prog_attach(prog_fd, 0 /* Ignore */, BPF_FLOW_DISSECTOR, 0);
>         if (ret)
> @@ -69,7 +37,6 @@ static void load_and_attach_program(void)
>         ret = bpf_object__pin(obj, cfg_pin_path);
>         if (ret)
>                 error(1, 0, "bpf_object__pin %s", cfg_pin_path);
> -
>  }
>
>  static void detach_program(void)
> diff --git a/tools/testing/selftests/bpf/flow_dissector_load.h b/tools/testing/selftests/bpf/flow_dissector_load.h
> new file mode 100644
> index 000000000000..41dd6959feb0
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/flow_dissector_load.h
> @@ -0,0 +1,55 @@
> +/* SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) */
> +#ifndef FLOW_DISSECTOR_LOAD
> +#define FLOW_DISSECTOR_LOAD
> +
> +#include <bpf/bpf.h>
> +#include <bpf/libbpf.h>
> +
> +static inline int bpf_flow_load(struct bpf_object **obj,
> +                               const char *path,
> +                               const char *section_name,
> +                               const char *map_name,
> +                               int *prog_fd)
> +{
> +       struct bpf_program *prog, *main_prog;
> +       struct bpf_map *prog_array;
> +       int prog_array_fd;
> +       int ret, fd, i;
> +
> +       ret = bpf_prog_load(path, BPF_PROG_TYPE_FLOW_DISSECTOR, obj,
> +                           prog_fd);
> +       if (ret)
> +               return ret;
> +
> +       main_prog = bpf_object__find_program_by_title(*obj, section_name);
> +       if (!main_prog)
> +               return ret;
> +
> +       *prog_fd = bpf_program__fd(main_prog);
> +       if (*prog_fd < 0)
> +               return ret;
> +
> +       prog_array = bpf_object__find_map_by_name(*obj, map_name);
> +       if (!prog_array)
> +               return ret;
> +
> +       prog_array_fd = bpf_map__fd(prog_array);
> +       if (prog_array_fd < 0)
> +               return ret;
> +
> +       i = 0;
> +       bpf_object__for_each_program(prog, *obj) {
> +               fd = bpf_program__fd(prog);
> +               if (fd < 0)
> +                       return fd;
> +
> +               if (fd != *prog_fd) {
> +                       bpf_map_update_elem(prog_array_fd, &i, &fd, BPF_ANY);
> +                       ++i;
> +               }
> +       }
> +
> +       return 0;
> +}
> +
> +#endif /* FLOW_DISSECTOR_LOAD */
> diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c
> index 126fc624290d..5f46680d4ad4 100644
> --- a/tools/testing/selftests/bpf/test_progs.c
> +++ b/tools/testing/selftests/bpf/test_progs.c
> @@ -39,6 +39,7 @@ typedef __u16 __sum16;
>  #include "bpf_endian.h"
>  #include "bpf_rlimit.h"
>  #include "trace_helpers.h"
> +#include "flow_dissector_load.h"
>
>  static int error_cnt, pass_cnt;
>  static bool jit_enabled;
> @@ -53,9 +54,10 @@ static struct {
>  } __packed pkt_v4 = {
>         .eth.h_proto = __bpf_constant_htons(ETH_P_IP),
>         .iph.ihl = 5,
> -       .iph.protocol = 6,
> +       .iph.protocol = IPPROTO_TCP,
>         .iph.tot_len = __bpf_constant_htons(MAGIC_BYTES),
>         .tcp.urg_ptr = 123,
> +       .tcp.doff = 5,
>  };
>
>  /* ipv6 test vector */
> @@ -65,9 +67,10 @@ static struct {
>         struct tcphdr tcp;
>  } __packed pkt_v6 = {
>         .eth.h_proto = __bpf_constant_htons(ETH_P_IPV6),
> -       .iph.nexthdr = 6,
> +       .iph.nexthdr = IPPROTO_TCP,
>         .iph.payload_len = __bpf_constant_htons(MAGIC_BYTES),
>         .tcp.urg_ptr = 123,
> +       .tcp.doff = 5,
>  };
>
>  #define _CHECK(condition, tag, duration, format...) ({                 \
> @@ -1882,6 +1885,76 @@ static void test_queue_stack_map(int type)
>         bpf_object__close(obj);
>  }
>
> +#define CHECK_FLOW_KEYS(desc, got, expected)                           \
> +       CHECK(memcmp(&got, &expected, sizeof(got)) != 0,                \
> +             desc,                                                     \
> +             "nhoff=%u/%u "                                            \
> +             "thoff=%u/%u "                                            \
> +             "addr_proto=0x%x/0x%x "                                   \
> +             "is_frag=%u/%u "                                          \
> +             "is_first_frag=%u/%u "                                    \
> +             "is_encap=%u/%u "                                         \
> +             "n_proto=0x%x/0x%x "                                      \
> +             "sport=%u/%u "                                            \
> +             "dport=%u/%u\n",                                          \
> +             got.nhoff, expected.nhoff,                                \
> +             got.thoff, expected.thoff,                                \
> +             got.addr_proto, expected.addr_proto,                      \
> +             got.is_frag, expected.is_frag,                            \
> +             got.is_first_frag, expected.is_first_frag,                \
> +             got.is_encap, expected.is_encap,                          \
> +             got.n_proto, expected.n_proto,                            \
> +             got.sport, expected.sport,                                \
> +             got.dport, expected.dport)
> +
> +static struct bpf_flow_keys pkt_v4_flow_keys = {
> +       .nhoff = 0,
> +       .thoff = sizeof(struct iphdr),
> +       .addr_proto = ETH_P_IP,
> +       .ip_proto = IPPROTO_TCP,
> +       .n_proto = bpf_htons(ETH_P_IP),
> +};
> +
> +static struct bpf_flow_keys pkt_v6_flow_keys = {
> +       .nhoff = 0,
> +       .thoff = sizeof(struct ipv6hdr),
> +       .addr_proto = ETH_P_IPV6,
> +       .ip_proto = IPPROTO_TCP,
> +       .n_proto = bpf_htons(ETH_P_IPV6),
> +};
> +
> +static void test_flow_dissector(void)
> +{
> +       struct bpf_flow_keys flow_keys;
> +       struct bpf_object *obj;
> +       __u32 duration, retval;
> +       int err, prog_fd;
> +       __u32 size;
> +
> +       err = bpf_flow_load(&obj, "./bpf_flow.o", "flow_dissector",
> +                           "jmp_table", &prog_fd);
> +       if (err) {
> +               error_cnt++;
> +               return;
> +       }
> +
> +       err = bpf_prog_test_run(prog_fd, 10, &pkt_v4, sizeof(pkt_v4),
> +                               &flow_keys, &size, &retval, &duration);
> +       CHECK(size != sizeof(flow_keys) || err || retval != 1, "ipv4",
> +             "err %d errno %d retval %d duration %d size %u/%lu\n",
> +             err, errno, retval, duration, size, sizeof(flow_keys));
> +       CHECK_FLOW_KEYS("ipv4_flow_keys", flow_keys, pkt_v4_flow_keys);
> +
> +       err = bpf_prog_test_run(prog_fd, 10, &pkt_v6, sizeof(pkt_v6),
> +                               &flow_keys, &size, &retval, &duration);
> +       CHECK(size != sizeof(flow_keys) || err || retval != 1, "ipv6",
> +             "err %d errno %d retval %d duration %d size %u/%lu\n",
> +             err, errno, retval, duration, size, sizeof(flow_keys));
> +       CHECK_FLOW_KEYS("ipv6_flow_keys", flow_keys, pkt_v6_flow_keys);
> +
> +       bpf_object__close(obj);
> +}
> +
>  int main(void)
>  {
>         srand(time(NULL));
> @@ -1909,6 +1982,7 @@ int main(void)
>         test_reference_tracking();
>         test_queue_stack_map(QUEUE);
>         test_queue_stack_map(STACK);
> +       test_flow_dissector();
>
>         printf("Summary: %d PASSED, %d FAILED\n", pass_cnt, error_cnt);
>         return error_cnt ? EXIT_FAILURE : EXIT_SUCCESS;
> --
> 2.20.1.495.gaa96b0ce6b-goog
>

^ permalink raw reply

* Re: [PATCH bpf-next v3 2/3] bpf: add BPF_PROG_TEST_RUN support for flow dissector
From: Song Liu @ 2019-01-28 20:37 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Networking, David S . Miller, Alexei Starovoitov, Daniel Borkmann
In-Reply-To: <20190128165355.229403-3-sdf@google.com>

On Mon, Jan 28, 2019 at 8:55 AM Stanislav Fomichev <sdf@google.com> wrote:
>
> The input is packet data, the output is struct bpf_flow_key. This should
> make it easy to test flow dissector programs without elaborate
> setup.
>
> Signed-off-by: Stanislav Fomichev <sdf@google.com>

Acked-by: Song Liu <songliubraving@fb.com>


> ---
>  include/linux/bpf.h |  3 ++
>  net/bpf/test_run.c  | 82 +++++++++++++++++++++++++++++++++++++++++++++
>  net/core/filter.c   |  1 +
>  3 files changed, 86 insertions(+)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 3851529062ec..0394f1f9213b 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -404,6 +404,9 @@ int bpf_prog_test_run_xdp(struct bpf_prog *prog, const union bpf_attr *kattr,
>                           union bpf_attr __user *uattr);
>  int bpf_prog_test_run_skb(struct bpf_prog *prog, const union bpf_attr *kattr,
>                           union bpf_attr __user *uattr);
> +int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog,
> +                                    const union bpf_attr *kattr,
> +                                    union bpf_attr __user *uattr);
>
>  /* an array of programs to be executed under rcu_lock.
>   *
> diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
> index fa2644d276ef..2c5172b33209 100644
> --- a/net/bpf/test_run.c
> +++ b/net/bpf/test_run.c
> @@ -240,3 +240,85 @@ int bpf_prog_test_run_xdp(struct bpf_prog *prog, const union bpf_attr *kattr,
>         kfree(data);
>         return ret;
>  }
> +
> +int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog,
> +                                    const union bpf_attr *kattr,
> +                                    union bpf_attr __user *uattr)
> +{
> +       u32 size = kattr->test.data_size_in;
> +       u32 repeat = kattr->test.repeat;
> +       struct bpf_flow_keys flow_keys;
> +       u64 time_start, time_spent = 0;
> +       struct bpf_skb_data_end *cb;
> +       u32 retval, duration;
> +       struct sk_buff *skb;
> +       struct sock *sk;
> +       void *data;
> +       int ret;
> +       u32 i;
> +
> +       if (prog->type != BPF_PROG_TYPE_FLOW_DISSECTOR)
> +               return -EINVAL;
> +
> +       data = bpf_test_init(kattr, size, NET_SKB_PAD + NET_IP_ALIGN,
> +                            SKB_DATA_ALIGN(sizeof(struct skb_shared_info)));
> +       if (IS_ERR(data))
> +               return PTR_ERR(data);
> +
> +       sk = kzalloc(sizeof(*sk), GFP_USER);
> +       if (!sk) {
> +               kfree(data);
> +               return -ENOMEM;
> +       }
> +       sock_net_set(sk, current->nsproxy->net_ns);
> +       sock_init_data(NULL, sk);
> +
> +       skb = build_skb(data, 0);
> +       if (!skb) {
> +               kfree(data);
> +               kfree(sk);
> +               return -ENOMEM;
> +       }
> +       skb->sk = sk;
> +
> +       skb_reserve(skb, NET_SKB_PAD + NET_IP_ALIGN);
> +       __skb_put(skb, size);
> +       skb->protocol = eth_type_trans(skb,
> +                                      current->nsproxy->net_ns->loopback_dev);
> +       skb_reset_network_header(skb);
> +
> +       cb = (struct bpf_skb_data_end *)skb->cb;
> +       cb->qdisc_cb.flow_keys = &flow_keys;
> +
> +       if (!repeat)
> +               repeat = 1;
> +
> +       time_start = ktime_get_ns();
> +       for (i = 0; i < repeat; i++) {
> +               preempt_disable();
> +               rcu_read_lock();
> +               retval = __skb_flow_bpf_dissect(prog, skb,
> +                                               &flow_keys_dissector,
> +                                               &flow_keys);
> +               rcu_read_unlock();
> +               preempt_enable();
> +
> +               if (need_resched()) {
> +                       if (signal_pending(current))
> +                               break;
> +                       time_spent += ktime_get_ns() - time_start;
> +                       cond_resched();
> +                       time_start = ktime_get_ns();
> +               }
> +       }
> +       time_spent += ktime_get_ns() - time_start;
> +       do_div(time_spent, repeat);
> +       duration = time_spent > U32_MAX ? U32_MAX : (u32)time_spent;
> +
> +       ret = bpf_test_finish(kattr, uattr, &flow_keys, sizeof(flow_keys),
> +                             retval, duration);
> +
> +       kfree_skb(skb);
> +       kfree(sk);
> +       return ret;
> +}
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 8e587dd1da20..8ce421796ac6 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -7711,6 +7711,7 @@ const struct bpf_verifier_ops flow_dissector_verifier_ops = {
>  };
>
>  const struct bpf_prog_ops flow_dissector_prog_ops = {
> +       .test_run               = bpf_prog_test_run_flow_dissector,
>  };
>
>  int sk_detach_filter(struct sock *sk)
> --
> 2.20.1.495.gaa96b0ce6b-goog
>

^ permalink raw reply

* Re: [PATCH net] sk_msg: Always cancel strp work before freeing the psock
From: Song Liu @ 2019-01-28 20:35 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: Networking, John Fastabend, Daniel Borkmann, Marek Majkowski
In-Reply-To: <20190128091335.20908-1-jakub@cloudflare.com>

On Mon, Jan 28, 2019 at 1:15 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>
> Despite having stopped the parser, we still need to deinitialize it by
> calling strp_done so that it cancels its work. Otherwise the worker
> thread can run after we have freed the parser, and attempt to access its
> workqueue resulting in a use-after-free:
>
> ==================================================================
> BUG: KASAN: use-after-free in pwq_activate_delayed_work+0x1b/0x1d0
> Read of size 8 at addr ffff888069975240 by task kworker/u2:2/93
>
> CPU: 0 PID: 93 Comm: kworker/u2:2 Not tainted 5.0.0-rc2-00335-g28f9d1a3d4fe-dirty #14
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014
> Workqueue:            (null) (kstrp)
> Call Trace:
>  print_address_description+0x6e/0x2b0
>  ? pwq_activate_delayed_work+0x1b/0x1d0
>  kasan_report+0xfd/0x177
>  ? pwq_activate_delayed_work+0x1b/0x1d0
>  ? pwq_activate_delayed_work+0x1b/0x1d0
>  pwq_activate_delayed_work+0x1b/0x1d0
>  ? process_one_work+0x4aa/0x660
>  pwq_dec_nr_in_flight+0x9b/0x100
>  worker_thread+0x82/0x680
>  ? process_one_work+0x660/0x660
>  kthread+0x1b9/0x1e0
>  ? __kthread_create_on_node+0x250/0x250
>  ret_from_fork+0x1f/0x30
>
> Allocated by task 111:
>  sk_psock_init+0x3c/0x1b0
>  sock_map_link.isra.2+0x103/0x4b0
>  sock_map_update_common+0x94/0x270
>  sock_map_update_elem+0x145/0x160
>  __se_sys_bpf+0x152e/0x1e10
>  do_syscall_64+0xb2/0x3e0
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> Freed by task 112:
>  kfree+0x7f/0x140
>  process_one_work+0x40b/0x660
>  worker_thread+0x82/0x680
>  kthread+0x1b9/0x1e0
>  ret_from_fork+0x1f/0x30
>
> The buggy address belongs to the object at ffff888069975180
>  which belongs to the cache kmalloc-512 of size 512
> The buggy address is located 192 bytes inside of
>  512-byte region [ffff888069975180, ffff888069975380)
> The buggy address belongs to the page:
> page:ffffea0001a65d00 count:1 mapcount:0 mapping:ffff88806d401280 index:0x0 compound_mapcount: 0
> flags: 0x4000000000010200(slab|head)
> raw: 4000000000010200 dead000000000100 dead000000000200 ffff88806d401280
> raw: 0000000000000000 00000000800c000c 00000001ffffffff 0000000000000000
> page dumped because: kasan: bad access detected
>
> Memory state around the buggy address:
>  ffff888069975100: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>  ffff888069975180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >ffff888069975200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>                                            ^
>  ffff888069975280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>  ffff888069975300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ==================================================================
>
> Reported-by: Marek Majkowski <marek@cloudflare.com>
> Link: https://lore.kernel.org/netdev/CAJPywTLwgXNEZ2dZVoa=udiZmtrWJ0q5SuBW64aYs0Y1khXX3A@mail.gmail.com
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>

Acked-by: Song Liu <songliubraving@fb.com>

> ---
>  net/core/skmsg.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> index d6d5c20d7044..8c826603bf36 100644
> --- a/net/core/skmsg.c
> +++ b/net/core/skmsg.c
> @@ -545,8 +545,7 @@ static void sk_psock_destroy_deferred(struct work_struct *gc)
>         struct sk_psock *psock = container_of(gc, struct sk_psock, gc);
>
>         /* No sk_callback_lock since already detached. */
> -       if (psock->parser.enabled)
> -               strp_done(&psock->parser.strp);
> +       strp_done(&psock->parser.strp);
>
>         cancel_work_sync(&psock->work);
>
> --
> 2.17.2
>

^ permalink raw reply

* Re: [PATCH net-next v1 1/2] net: dsa: mv88e6xxx: Default CMODE to 1000BaseX only on 6390X
From: Florian Fainelli @ 2019-01-28 20:34 UTC (permalink / raw)
  To: Marek Behún, netdev; +Cc: Andrew Lunn, David Miller
In-Reply-To: <20190124154309.24987-1-marek.behun@nic.cz>

On 1/24/19 7:43 AM, Marek Behún wrote:
> Commit 787799a9d555 sets the SERDES interfaces of 6390 and 6390X to
> 1000BaseX, but this is only needed on 6390X, since there are SERDES
> interfaces which can be used on lower ports on 6390.
> 
> This commit fixes this by returning to previous behaviour on 6390.
> (Previous behaviour means that CMODE is not set at all if requested mode
> is NA).
> 
> This is needed on Turris MOX, where the 88e6190 is connected to CPU in
> 2500BaseX mode.
> 
> Fixes: 787799a9d555 ("net: dsa: mv88e6xxx: Default ports 9/10 6390X CMODE to 1000BaseX")
> Signed-off-by: Marek Behún <marek.behun@nic.cz>

I suppose for now, this is the best way to approach that problem given
the shortcomings of the fixed link support in net/dsa/port.c:

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>

Thanks!
-- 
Florian

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox