Netdev List
* Re: [RFC bpf-next 07/11] bpf: Add helper to retrieve socket in BPF
From: Joe Stringer @ 2018-05-12  0:54 UTC
  To: Martin KaFai Lau; +Cc: Joe Stringer, daniel, netdev, ast, john fastabend
In-Reply-To: <20180511214111.hi6ax6qoe37al37q@kafai-mbp.dhcp.thefacebook.com>

On 11 May 2018 at 14:41, Martin KaFai Lau <kafai@fb.com> wrote:
> On Fri, May 11, 2018 at 02:08:01PM -0700, Joe Stringer wrote:
>> On 10 May 2018 at 22:00, Martin KaFai Lau <kafai@fb.com> wrote:
>> > On Wed, May 09, 2018 at 02:07:05PM -0700, Joe Stringer wrote:
>> >> This patch adds a new BPF helper function, sk_lookup() which allows BPF
>> >> programs to find out if there is a socket listening on this host, and
>> >> returns a socket pointer which the BPF program can then access to
>> >> determine, for instance, whether to forward or drop traffic. sk_lookup()
>> >> takes a reference on the socket, so when a BPF program makes use of this
>> >> function, it must subsequently pass the returned pointer into the newly
>> >> added sk_release() to return the reference.
>> >>
>> >> By way of example, the following pseudocode would filter inbound
>> >> connections at XDP if there is no corresponding service listening for
>> >> the traffic:
>> >>
>> >>   struct bpf_sock_tuple tuple;
>> >>   struct bpf_sock_ops *sk;
>> >>
>> >>   populate_tuple(ctx, &tuple); // Extract the 5tuple from the packet
>> >>   sk = bpf_sk_lookup(ctx, &tuple, sizeof tuple, netns, 0);
>> >>   if (!sk) {
>> >>     // Couldn't find a socket listening for this traffic. Drop.
>> >>     return TC_ACT_SHOT;
>> >>   }
>> >>   bpf_sk_release(sk, 0);
>> >>   return TC_ACT_OK;
>> >>
>> >> Signed-off-by: Joe Stringer <joe@wand.net.nz>
>> >> ---
>>
>> ...
>>
>> >> @@ -4032,6 +4036,96 @@ static const struct bpf_func_proto bpf_skb_get_xfrm_state_proto = {
>> >>  };
>> >>  #endif
>> >>
>> >> +struct sock *
>> >> +sk_lookup(struct net *net, struct bpf_sock_tuple *tuple) {
>> > Would it be possible to have another version that
>> > returns a sk without taking its refcnt?
>> > It may have performance benefit.
>>
>> Not really. The sockets are not RCU-protected, and established sockets
>> may be torn down without notice. If we don't take a reference, there's
>> no guarantee that the socket will continue to exist for the duration
>> of running the BPF program.
>>
>> From what I follow, the comment below has a hidden implication which
>> is that sockets without SOCK_RCU_FREE, eg established sockets, may be
>> directly freed regardless of RCU.
> Right, SOCK_RCU_FREE sk is the one I am concerned about.
> For example, a TCP_LISTEN socket does not require taking a refcnt
> now.  Doing a bpf_sk_lookup() may have a rather big
> impact on handling a TCP SYN flood.  Or is the usual intention
> to redirect instead of passing it up to the stack?

I see. If you're only interested in listen sockets, then this series
could probably be extended with a new flag, e.g. something like
BPF_F_SK_FIND_LISTENERS, which restricts the set of possible sockets
found to only listen sockets; the implementation would then call into
__inet_lookup_listener() instead of inet_lookup(). The presence of
that flag in the relevant register during the CALL instruction would
tell the verifier not to reference-track the result, and there would
need to be a check on the release to ensure that this unreferenced
socket is never released. Just a thought, completely untested, and I
could still be missing some detail.

That said, I don't completely follow how you would expect to handle
the traffic for sockets that are already established - the helper
would no longer find those sockets, so you wouldn't know whether to
pass the traffic up the stack for established traffic or not.
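
As a rough illustration of the verifier-side idea above, here is a
small userspace model in C. It is a sketch under assumptions, not
verifier code: the opcode names and mock_verify() are hypothetical, it
tracks a single socket pointer, and OP_LOOKUP_LISTENER stands in for a
lookup made with the proposed BPF_F_SK_FIND_LISTENERS flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical opcodes for a toy program; none of these exist in the kernel. */
enum mock_op {
	OP_LOOKUP,          /* lookup that takes a reference */
	OP_LOOKUP_LISTENER, /* lookup with the proposed flag: no reference */
	OP_RELEASE,         /* release the tracked socket pointer */
	OP_EXIT,
};

/*
 * Toy reference tracker for one socket pointer.
 * Returns 0 if the program would be accepted, -1 if rejected.
 */
static int mock_verify(const enum mock_op *prog, size_t n)
{
	bool have_sk = false;    /* a socket pointer is currently live */
	bool referenced = false; /* ... and it holds a reference */
	size_t i;

	assert(prog != NULL);
	for (i = 0; i < n; i++) {
		switch (prog[i]) {
		case OP_LOOKUP:
		case OP_LOOKUP_LISTENER:
			if (have_sk && referenced)
				return -1; /* overwriting would leak a reference */
			have_sk = true;
			referenced = (prog[i] == OP_LOOKUP);
			break;
		case OP_RELEASE:
			if (!have_sk || !referenced)
				return -1; /* nothing to release, or unreferenced */
			have_sk = false;
			referenced = false;
			break;
		case OP_EXIT:
			return (have_sk && referenced) ? -1 : 0;
		}
	}
	return -1; /* ran off the end without OP_EXIT */
}
```

In this model a referenced lookup followed directly by exit is rejected
as a leak, while a listener lookup may exit directly but must never be
passed to the release operation.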


* Re: safe skb resetting after decapsulation and encapsulation
From: Md. Islam @ 2018-05-12  2:07 UTC
  To: Jason A. Donenfeld; +Cc: Netdev
In-Reply-To: <CAHmME9qBtB_Vwxzs+fry2UPVQ6yCWNrmVttX4wXu+RiNq3A7sw@mail.gmail.com>

I'm not an expert on this, but it looks about right. You can take a
look at build_skb() or __build_skb(). They show the fields that need
to be set before passing the skb to netif_receive_skb/netif_rx.

On Fri, May 11, 2018 at 6:56 PM, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> Hey Netdev,
>
> A UDP skb comes in via the encap_rcv interface. I do a lot of wild
> things to the bytes in the skb -- change where the head starts, modify
> a few fragments, decrypt some stuff, trim off some things at the end,
> etc. In other words, I'm decapsulating the skb in a pretty intense
> way. I benefit from reusing the same skb, performance wise, but after
> I'm done processing it, it's really a totally new skb. Eventually it's
> time to pass off my skb to netif_receive_skb/netif_rx, but before I do
> that, I need to "reinitialize" the skb. (The same goes for when
> sending out an skb -- I get it from userspace via ndo_start_xmit, do
> crazy things to it, and eventually pass it off to the udp_tunnel send
> functions, but first "reinitializing" it.)
>
> At the moment I'm using a function that looks like this:
>
> static void jasons_wild_and_crazy_skb_reset(struct sk_buff *skb)
> {
>     skb_scrub_packet(skb, true); //1
>     memset(&skb->headers_start, 0, offsetof(struct sk_buff,
> headers_end) - offsetof(struct sk_buff, headers_start)); //2
>     skb->queue_mapping = 0; //3
>     skb->nohdr = 0; //4
>     skb->peeked = 0; //5
>     skb->mac_len = 0; //6
>     skb->dev = NULL; //7
> #ifdef CONFIG_NET_SCHED
>     skb->tc_index = 0; //8
>     skb_reset_tc(skb); //9
> #endif
>     skb->hdr_len = skb_headroom(skb); //10
>     skb_reset_mac_header(skb); //11
>     skb_reset_network_header(skb); //12
>     skb_probe_transport_header(skb, 0); //13
>     skb_reset_inner_headers(skb); //14
> }
>
> I'm sure that some of this is wrong. Most of it is based on part of an
> Octeon ethernet driver I read a few years ago. I numbered each
> statement above, hoping to go through it with you all in detail here,
> and see what we can cut away and see what we can approve.
>
> 1. Obviously correct and required.
> 2. This is probably wrong. At least it causes crashes when receiving
> packets from RHEL 7.5's latest i40e driver in their vendor
> frankenkernel, because those flags there have some critical bits
> related to allocation. But there are a lot of flags in there that I might
> consider going through one by one and zeroing out.
> 3-5. Fields that should be zero, I assume, after
> decapsulating/decrypting (and encapsulating/encrypting).
> 6. WireGuard is layer 3, so there's no mac.
> 7. We're later going to change the dev this came in on.
> 8-9: Same flakey rationale as 2,3-5.
> 10: Since the headroom has changed during the various modifications, I
> need to let the packet field know about it.
> 11-14: The beginning of the headers has changed, and so resetting and
> probing is necessary for this to work at all.
>
> So I'm wondering - how much of this is necessary? How much am I
> unnecessarily reinventing things that exist elsewhere? I'm pretty sure
> in most cases the driver would work with only 1,10-14, but I worry
> that bad things would happen in more unusual configurations. I've
> tried to systematically go through the entire stack and see where
> these might be used or not used, but it seems really inconsistent.
>
> So, I'm writing to ask whether somebody has an easy simplification or
> rule for handling this kind of intense decapsulation/decryption (and,
> in the other direction, encapsulation/encryption) operation. I'd
> like to make sure I get this down solid.
>
> Thanks,
> Jason
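
A small userspace sketch of what statement 2 does may help explain the
crashes described in item 2. The struct below is a mock, not the real
struct sk_buff: mock_skb, alloc_flags and mock_reset_headers() are
hypothetical names, and alloc_flags merely stands in for the
allocation-related bits mentioned above. The point is only that the
offsetof() range zeroes every field between the two markers, whether or
not the caller meant to touch it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Mock layout for illustration only; NOT the real struct sk_buff. */
struct mock_skb {
	void *head;                     /* before the range: untouched */
	unsigned char headers_start[1]; /* marker: range starts here */
	unsigned char alloc_flags;      /* hypothetical allocation-related bits */
	unsigned int mark;
	unsigned char headers_end[1];   /* marker: range stops before here */
	unsigned int len;               /* after the range: untouched */
};

/* Zero everything from headers_start up to, but not including, headers_end. */
static void mock_reset_headers(struct mock_skb *skb)
{
	size_t start = offsetof(struct mock_skb, headers_start);
	size_t end = offsetof(struct mock_skb, headers_end);

	assert(skb != NULL && end > start);
	memset((unsigned char *)skb + start, 0, end - start);
}
```

After mock_reset_headers(), alloc_flags and mark are both zero while
head and len keep their values, so any state that later code relies on
inside the range has to be saved and restored by hand, or the range has
to be cleared field by field.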



-- 
Tamim
PhD Candidate,
Kent State University
http://web.cs.kent.edu/~mislam4/


* Re: possible deadlock in sk_diag_fill
From: Dmitry Vyukov @ 2018-05-12  7:46 UTC
  To: Andrei Vagin; +Cc: syzbot, avagin, David Miller, LKML, netdev, syzkaller-bugs
In-Reply-To: <20180511183358.GA1492@outlook.office365.com>

On Fri, May 11, 2018 at 8:33 PM, Andrei Vagin <avagin@virtuozzo.com> wrote:
> On Sat, May 05, 2018 at 10:59:02AM -0700, syzbot wrote:
>> Hello,
>>
>> syzbot found the following crash on:
>>
>> HEAD commit:    c1c07416cdd4 Merge tag 'kbuild-fixes-v4.17' of git://git.k..
>> git tree:       upstream
>> console output: https://syzkaller.appspot.com/x/log.txt?x=12164c97800000
>> kernel config:  https://syzkaller.appspot.com/x/.config?x=5a1dc06635c10d27
>> dashboard link: https://syzkaller.appspot.com/bug?extid=c1872be62e587eae9669
>> compiler:       gcc (GCC) 8.0.1 20180413 (experimental)
>> userspace arch: i386
>>
>> Unfortunately, I don't have any reproducer for this crash yet.
>>
>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>> Reported-by: syzbot+c1872be62e587eae9669@syzkaller.appspotmail.com
>>
>>
>> ======================================================
>> WARNING: possible circular locking dependency detected
>> 4.17.0-rc3+ #59 Not tainted
>> ------------------------------------------------------
>> syz-executor1/25282 is trying to acquire lock:
>> 000000004fddf743 (&(&u->lock)->rlock/1){+.+.}, at: sk_diag_dump_icons
>> net/unix/diag.c:82 [inline]
>> 000000004fddf743 (&(&u->lock)->rlock/1){+.+.}, at:
>> sk_diag_fill.isra.5+0xa43/0x10d0 net/unix/diag.c:144
>>
>> but task is already holding lock:
>> 00000000b6895645 (rlock-AF_UNIX){+.+.}, at: spin_lock
>> include/linux/spinlock.h:310 [inline]
>> 00000000b6895645 (rlock-AF_UNIX){+.+.}, at: sk_diag_dump_icons
>> net/unix/diag.c:64 [inline]
>> 00000000b6895645 (rlock-AF_UNIX){+.+.}, at: sk_diag_fill.isra.5+0x94e/0x10d0
>> net/unix/diag.c:144
>>
>> which lock already depends on the new lock.
>
> In the code, we have a comment which explains why it is safe to take this lock
>
> /*
>  * The state lock is outer for the same sk's
>  * queue lock. With the other's queue locked it's
>  * OK to lock the state.
>  */
> unix_state_lock_nested(req);
>
> The question is how to explain this to lockdep.

Do I understand it correctly that (&u->lock)->rlock associated with
AF_UNIX is locked under rlock-AF_UNIX, and then rlock-AF_UNIX is
locked under (&u->lock)->rlock associated with AF_NETLINK? If so, I
think we need to split (&u->lock)->rlock by family too, so that we
have u->lock-AF_UNIX and u->lock-AF_NETLINK.
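
To show why splitting a lock class can make such a report go away, here
is a toy userspace model of class-based lock ordering. It is only a
sketch: mock_lockdep, mock_reachable() and mock_lock_order() are
hypothetical names, the class ids are plain integers chosen by the
caller, and real lockdep tracks far more state. Whether splitting by
family is the right cut for this report is a separate question that the
model does not answer.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MOCK_MAX_CLASSES 8

/* edge[a][b] means "class b was acquired while class a was held". */
struct mock_lockdep {
	bool edge[MOCK_MAX_CLASSES][MOCK_MAX_CLASSES];
};

/* Depth-first search: is class "to" reachable from class "from"? */
static bool mock_reachable(const struct mock_lockdep *ld, int from, int to,
			   bool *seen)
{
	int next;

	if (from == to)
		return true;
	seen[from] = true;
	for (next = 0; next < MOCK_MAX_CLASSES; next++) {
		if (ld->edge[from][next] && !seen[next] &&
		    mock_reachable(ld, next, to, seen))
			return true;
	}
	return false;
}

/*
 * Record that class "acquired" is taken while class "held" is held.
 * Returns false, without recording the edge, if that would close a cycle.
 */
static bool mock_lock_order(struct mock_lockdep *ld, int held, int acquired)
{
	bool seen[MOCK_MAX_CLASSES];

	assert(held >= 0 && held < MOCK_MAX_CLASSES);
	assert(acquired >= 0 && acquired < MOCK_MAX_CLASSES);
	memset(seen, 0, sizeof(seen));
	if (mock_reachable(ld, acquired, held, seen))
		return false;
	ld->edge[held][acquired] = true;
	return true;
}
```

With a single class for every state lock, recording "state then queue"
followed by "queue then state" is refused as a cycle. If the second
state lock is given its own class id, both orderings are recorded
without complaint.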



>> the existing dependency chain (in reverse order) is:
>>
>> -> #1 (rlock-AF_UNIX){+.+.}:
>>        __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
>>        _raw_spin_lock_irqsave+0x96/0xc0 kernel/locking/spinlock.c:152
>>        skb_queue_tail+0x26/0x150 net/core/skbuff.c:2900
>>        unix_dgram_sendmsg+0xf77/0x1730 net/unix/af_unix.c:1797
>>        sock_sendmsg_nosec net/socket.c:629 [inline]
>>        sock_sendmsg+0xd5/0x120 net/socket.c:639
>>        ___sys_sendmsg+0x525/0x940 net/socket.c:2117
>>        __sys_sendmmsg+0x3bb/0x6f0 net/socket.c:2205
>>        __compat_sys_sendmmsg net/compat.c:770 [inline]
>>        __do_compat_sys_sendmmsg net/compat.c:777 [inline]
>>        __se_compat_sys_sendmmsg net/compat.c:774 [inline]
>>        __ia32_compat_sys_sendmmsg+0x9f/0x100 net/compat.c:774
>>        do_syscall_32_irqs_on arch/x86/entry/common.c:323 [inline]
>>        do_fast_syscall_32+0x345/0xf9b arch/x86/entry/common.c:394
>>        entry_SYSENTER_compat+0x70/0x7f arch/x86/entry/entry_64_compat.S:139
>>
>> -> #0 (&(&u->lock)->rlock/1){+.+.}:
>>        lock_acquire+0x1dc/0x520 kernel/locking/lockdep.c:3920
>>        _raw_spin_lock_nested+0x28/0x40 kernel/locking/spinlock.c:354
>>        sk_diag_dump_icons net/unix/diag.c:82 [inline]
>>        sk_diag_fill.isra.5+0xa43/0x10d0 net/unix/diag.c:144
>>        sk_diag_dump net/unix/diag.c:178 [inline]
>>        unix_diag_dump+0x35f/0x550 net/unix/diag.c:206
>>        netlink_dump+0x507/0xd20 net/netlink/af_netlink.c:2226
>>        __netlink_dump_start+0x51a/0x780 net/netlink/af_netlink.c:2323
>>        netlink_dump_start include/linux/netlink.h:214 [inline]
>>        unix_diag_handler_dump+0x3f4/0x7b0 net/unix/diag.c:307
>>        __sock_diag_cmd net/core/sock_diag.c:230 [inline]
>>        sock_diag_rcv_msg+0x2e0/0x3d0 net/core/sock_diag.c:261
>>        netlink_rcv_skb+0x172/0x440 net/netlink/af_netlink.c:2448
>>        sock_diag_rcv+0x2a/0x40 net/core/sock_diag.c:272
>>        netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline]
>>        netlink_unicast+0x58b/0x740 net/netlink/af_netlink.c:1336
>>        netlink_sendmsg+0x9f0/0xfa0 net/netlink/af_netlink.c:1901
>>        sock_sendmsg_nosec net/socket.c:629 [inline]
>>        sock_sendmsg+0xd5/0x120 net/socket.c:639
>>        sock_write_iter+0x35a/0x5a0 net/socket.c:908
>>        call_write_iter include/linux/fs.h:1784 [inline]
>>        new_sync_write fs/read_write.c:474 [inline]
>>        __vfs_write+0x64d/0x960 fs/read_write.c:487
>>        vfs_write+0x1f8/0x560 fs/read_write.c:549
>>        ksys_write+0xf9/0x250 fs/read_write.c:598
>>        __do_sys_write fs/read_write.c:610 [inline]
>>        __se_sys_write fs/read_write.c:607 [inline]
>>        __ia32_sys_write+0x71/0xb0 fs/read_write.c:607
>>        do_syscall_32_irqs_on arch/x86/entry/common.c:323 [inline]
>>        do_fast_syscall_32+0x345/0xf9b arch/x86/entry/common.c:394
>>        entry_SYSENTER_compat+0x70/0x7f arch/x86/entry/entry_64_compat.S:139
>>
>> other info that might help us debug this:
>>
>>  Possible unsafe locking scenario:
>>
>>        CPU0                    CPU1
>>        ----                    ----
>>   lock(rlock-AF_UNIX);
>>                                lock(&(&u->lock)->rlock/1);
>>                                lock(rlock-AF_UNIX);
>>   lock(&(&u->lock)->rlock/1);
>>
>>  *** DEADLOCK ***
>>
>> 5 locks held by syz-executor1/25282:
>>  #0: 000000003919e1bd (sock_diag_mutex){+.+.}, at: sock_diag_rcv+0x1b/0x40
>> net/core/sock_diag.c:271
>>  #1: 000000004f328d3e (sock_diag_table_mutex){+.+.}, at: __sock_diag_cmd
>> net/core/sock_diag.c:225 [inline]
>>  #1: 000000004f328d3e (sock_diag_table_mutex){+.+.}, at:
>> sock_diag_rcv_msg+0x169/0x3d0 net/core/sock_diag.c:261
>>  #2: 000000004cc04dbb (nlk_cb_mutex-SOCK_DIAG){+.+.}, at:
>> netlink_dump+0x98/0xd20 net/netlink/af_netlink.c:2182
>>  #3: 00000000accdef41 (unix_table_lock){+.+.}, at: spin_lock
>> include/linux/spinlock.h:310 [inline]
>>  #3: 00000000accdef41 (unix_table_lock){+.+.}, at:
>> unix_diag_dump+0x10a/0x550 net/unix/diag.c:192
>>  #4: 00000000b6895645 (rlock-AF_UNIX){+.+.}, at: spin_lock
>> include/linux/spinlock.h:310 [inline]
>>  #4: 00000000b6895645 (rlock-AF_UNIX){+.+.}, at: sk_diag_dump_icons
>> net/unix/diag.c:64 [inline]
>>  #4: 00000000b6895645 (rlock-AF_UNIX){+.+.}, at:
>> sk_diag_fill.isra.5+0x94e/0x10d0 net/unix/diag.c:144
>>
>> stack backtrace:
>> CPU: 1 PID: 25282 Comm: syz-executor1 Not tainted 4.17.0-rc3+ #59
>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
>> Google 01/01/2011
>> Call Trace:
>>  __dump_stack lib/dump_stack.c:77 [inline]
>>  dump_stack+0x1b9/0x294 lib/dump_stack.c:113
>>  print_circular_bug.isra.36.cold.54+0x1bd/0x27d
>> kernel/locking/lockdep.c:1223
>>  check_prev_add kernel/locking/lockdep.c:1863 [inline]
>>  check_prevs_add kernel/locking/lockdep.c:1976 [inline]
>>  validate_chain kernel/locking/lockdep.c:2417 [inline]
>>  __lock_acquire+0x343e/0x5140 kernel/locking/lockdep.c:3431
>>  lock_acquire+0x1dc/0x520 kernel/locking/lockdep.c:3920
>>  _raw_spin_lock_nested+0x28/0x40 kernel/locking/spinlock.c:354
>>  sk_diag_dump_icons net/unix/diag.c:82 [inline]
>>  sk_diag_fill.isra.5+0xa43/0x10d0 net/unix/diag.c:144
>>  sk_diag_dump net/unix/diag.c:178 [inline]
>>  unix_diag_dump+0x35f/0x550 net/unix/diag.c:206
>>  netlink_dump+0x507/0xd20 net/netlink/af_netlink.c:2226
>>  __netlink_dump_start+0x51a/0x780 net/netlink/af_netlink.c:2323
>>  netlink_dump_start include/linux/netlink.h:214 [inline]
>>  unix_diag_handler_dump+0x3f4/0x7b0 net/unix/diag.c:307
>>  __sock_diag_cmd net/core/sock_diag.c:230 [inline]
>>  sock_diag_rcv_msg+0x2e0/0x3d0 net/core/sock_diag.c:261
>>  netlink_rcv_skb+0x172/0x440 net/netlink/af_netlink.c:2448
>>  sock_diag_rcv+0x2a/0x40 net/core/sock_diag.c:272
>>  netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline]
>>  netlink_unicast+0x58b/0x740 net/netlink/af_netlink.c:1336
>>  netlink_sendmsg+0x9f0/0xfa0 net/netlink/af_netlink.c:1901
>>  sock_sendmsg_nosec net/socket.c:629 [inline]
>>  sock_sendmsg+0xd5/0x120 net/socket.c:639
>>  sock_write_iter+0x35a/0x5a0 net/socket.c:908
>>  call_write_iter include/linux/fs.h:1784 [inline]
>>  new_sync_write fs/read_write.c:474 [inline]
>>  __vfs_write+0x64d/0x960 fs/read_write.c:487
>>  vfs_write+0x1f8/0x560 fs/read_write.c:549
>>  ksys_write+0xf9/0x250 fs/read_write.c:598
>>  __do_sys_write fs/read_write.c:610 [inline]
>>  __se_sys_write fs/read_write.c:607 [inline]
>>  __ia32_sys_write+0x71/0xb0 fs/read_write.c:607
>>  do_syscall_32_irqs_on arch/x86/entry/common.c:323 [inline]
>>  do_fast_syscall_32+0x345/0xf9b arch/x86/entry/common.c:394
>>  entry_SYSENTER_compat+0x70/0x7f arch/x86/entry/entry_64_compat.S:139
>> RIP: 0023:0xf7f8ccb9
>> RSP: 002b:00000000f5f880ac EFLAGS: 00000282 ORIG_RAX: 0000000000000004
>> RAX: ffffffffffffffda RBX: 0000000000000017 RCX: 000000002058bfe4
>> RDX: 0000000000000029 RSI: 0000000000000000 RDI: 0000000000000000
>> RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
>> R10: 0000000000000000 R11: 0000000000000296 R12: 0000000000000000
>> R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
>>
>>
>> ---
>> This bug is generated by a bot. It may contain errors.
>> See https://goo.gl/tpsmEJ for more information about syzbot.
>> syzbot engineers can be reached at syzkaller@googlegroups.com.
>>
>> syzbot will keep track of this bug report.
>> If you forgot to add the Reported-by tag, once the fix for this bug is
>> merged
>> into any tree, please reply to this email with:
>> #syz fix: exact-commit-title
>> To mark this as a duplicate of another syzbot report, please reply with:
>> #syz dup: exact-subject-of-another-report
>> If it's a one-off invalid bug report, please reply with:
>> #syz invalid
>> Note: if the crash happens again, it will cause creation of a new bug
>> report.
>> Note: all commands must start from beginning of the line in the email body.


* Re: [PATCH bpf-next 0/4] samples: bpf: fix build after move to full libbpf
From: Jesper Dangaard Brouer @ 2018-05-12  8:39 UTC
  To: Jakub Kicinski
  Cc: brouer, alexei.starovoitov, daniel, oss-drivers, netdev,
	Björn Töpel
In-Reply-To: <20180512001729.21634-1-jakub.kicinski@netronome.com>

On Fri, 11 May 2018 17:17:25 -0700
Jakub Kicinski <jakub.kicinski@netronome.com> wrote:

> Following patches address build issues after recent move to libbpf.
> For out-of-tree builds we would see the following error:
> 
> gcc: error: samples/bpf/../../tools/lib/bpf/libbpf.a: No such file or directory
> 
> The mini-library called libbpf.h in samples is renamed to bpf_insn.h;
> using linux/filter.h seems not completely trivial since some samples
> get upset when the order of the include search path is changed.  We do have
> to rename libbpf.h, however, because otherwise it's hard to reliably
> get to libbpf's header in out-of-tree builds.


Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>

Thank you for doing this... this mini-library, also called libbpf.h, has
confused me before, and I bet it would confuse others as well.
Glad to see it being renamed :-)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: KASAN: use-after-free Write in xt_rateest_put
From: Dmitry Vyukov @ 2018-05-12  8:39 UTC
  To: syzbot, Cong Wang
  Cc: coreteam, David Miller, Florian Westphal, Jozsef Kadlecsik, LKML,
	netdev, netfilter-devel, Pablo Neira Ayuso, syzkaller-bugs
In-Reply-To: <94eb2c0b9a14f575660563eb788d@google.com>

On Mon, Jan 29, 2018 at 3:58 PM, syzbot
<syzbot+551ff4604e832588433e@syzkaller.appspotmail.com> wrote:
> Hello,
>
> syzbot hit the following crash on upstream commit
> 24b1cccf922914f3d6eeb84036dde8338bc03abb (Sun Jan 28 20:24:36 2018 +0000)
> Merge branch 'x86-pti-for-linus' of
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
>
> C reproducer is attached.
> syzkaller reproducer is attached.
> Raw console output is attached.
> compiler: gcc (GCC) 7.1.1 20170620
> .config is attached.
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+551ff4604e832588433e@syzkaller.appspotmail.com
> It will help syzbot understand when the bug is fixed. See footer for
> details.
> If you forward the report, please keep this part and the footer.

This was bisected as fixed by:

#syz fix: netfilter: xt_RATEEST: acquire xt_rateest_mutex for hash insert

https://gist.githubusercontent.com/dvyukov/9d5b710cf4f429969b93aa90ec217c29/raw/68c1fee7f7e133574a0787c9e46d97a6cf521759/gistfile1.txt

> ==================================================================
> BUG: KASAN: use-after-free in __hlist_del include/linux/list.h:651 [inline]
> BUG: KASAN: use-after-free in hlist_del include/linux/list.h:656 [inline]
> BUG: KASAN: use-after-free in xt_rateest_put+0x2e3/0x300
> net/netfilter/xt_RATEEST.c:65
> Write of size 8 at addr ffff8801d4d40e58 by task syzkaller770396/3682
>
> CPU: 1 PID: 3682 Comm: syzkaller770396 Not tainted 4.15.0-rc9+ #284
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:17 [inline]
>  dump_stack+0x194/0x257 lib/dump_stack.c:53
>  print_address_description+0x73/0x250 mm/kasan/report.c:252
>  kasan_report_error mm/kasan/report.c:351 [inline]
>  kasan_report+0x25b/0x340 mm/kasan/report.c:409
>  __asan_report_store8_noabort+0x17/0x20 mm/kasan/report.c:435
>  __hlist_del include/linux/list.h:651 [inline]
>  hlist_del include/linux/list.h:656 [inline]
>  xt_rateest_put+0x2e3/0x300 net/netfilter/xt_RATEEST.c:65
>  xt_rateest_tg_destroy+0x50/0x70 net/netfilter/xt_RATEEST.c:154
>  cleanup_entry+0x242/0x380 net/ipv6/netfilter/ip6_tables.c:678
>  __do_replace+0x7e6/0xab0 net/ipv6/netfilter/ip6_tables.c:1115
>  do_replace net/ipv6/netfilter/ip6_tables.c:1171 [inline]
>  do_ip6t_set_ctl+0x40f/0x5f0 net/ipv6/netfilter/ip6_tables.c:1693
>  nf_sockopt net/netfilter/nf_sockopt.c:106 [inline]
>  nf_setsockopt+0x67/0xc0 net/netfilter/nf_sockopt.c:115
>  ipv6_setsockopt+0x115/0x150 net/ipv6/ipv6_sockglue.c:928
>  udpv6_setsockopt+0x45/0x80 net/ipv6/udp.c:1452
>  sock_common_setsockopt+0x95/0xd0 net/core/sock.c:2968
>  SYSC_setsockopt net/socket.c:1831 [inline]
>  SyS_setsockopt+0x189/0x360 net/socket.c:1810
>  entry_SYSCALL_64_fastpath+0x29/0xa0
> RIP: 0033:0x4412d9
> RSP: 002b:00007fffb8cbf5f8 EFLAGS: 00000203 ORIG_RAX: 0000000000000036
> RAX: ffffffffffffffda RBX: ffffffffffffffff RCX: 00000000004412d9
> RDX: 0000000000000040 RSI: 0000000000000029 RDI: 0000000000000326
> RBP: f6fcce9cd855ec40 R08: 00000000000003b8 R09: 0000000000000000
> R10: 0000000020019c48 R11: 0000000000000203 R12: fbfe5b6031634428
> R13: 5826b2d59f7fe9a1 R14: fd7217c033abf8b5 R15: 0000000000000000
>
> Allocated by task 3687:
>  save_stack+0x43/0xd0 mm/kasan/kasan.c:447
>  set_track mm/kasan/kasan.c:459 [inline]
>  kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551
>  kmem_cache_alloc_trace+0x136/0x750 mm/slab.c:3610
>  kmalloc include/linux/slab.h:499 [inline]
>  kzalloc include/linux/slab.h:688 [inline]
>  xt_rateest_tg_checkentry+0x25a/0xaa0 net/netfilter/xt_RATEEST.c:120
>  xt_check_target+0x22c/0x7d0 net/netfilter/x_tables.c:845
>  check_target net/ipv6/netfilter/ip6_tables.c:538 [inline]
>  find_check_entry.isra.7+0x935/0xcf0 net/ipv6/netfilter/ip6_tables.c:580
>  translate_table+0xf52/0x1690 net/ipv6/netfilter/ip6_tables.c:749
>  do_replace net/ipv6/netfilter/ip6_tables.c:1167 [inline]
>  do_ip6t_set_ctl+0x370/0x5f0 net/ipv6/netfilter/ip6_tables.c:1693
>  nf_sockopt net/netfilter/nf_sockopt.c:106 [inline]
>  nf_setsockopt+0x67/0xc0 net/netfilter/nf_sockopt.c:115
>  ipv6_setsockopt+0x115/0x150 net/ipv6/ipv6_sockglue.c:928
>  udpv6_setsockopt+0x45/0x80 net/ipv6/udp.c:1452
>  sock_common_setsockopt+0x95/0xd0 net/core/sock.c:2968
>  SYSC_setsockopt net/socket.c:1831 [inline]
>  SyS_setsockopt+0x189/0x360 net/socket.c:1810
>  entry_SYSCALL_64_fastpath+0x29/0xa0
>
> Freed by task 3682:
>  save_stack+0x43/0xd0 mm/kasan/kasan.c:447
>  set_track mm/kasan/kasan.c:459 [inline]
>  kasan_slab_free+0x71/0xc0 mm/kasan/kasan.c:524
>  __cache_free mm/slab.c:3488 [inline]
>  kfree+0xd6/0x260 mm/slab.c:3803
>  __rcu_reclaim kernel/rcu/rcu.h:190 [inline]
>  rcu_do_batch kernel/rcu/tree.c:2758 [inline]
>  invoke_rcu_callbacks kernel/rcu/tree.c:3012 [inline]
>  __rcu_process_callbacks kernel/rcu/tree.c:2979 [inline]
>  rcu_process_callbacks+0xe94/0x17f0 kernel/rcu/tree.c:2996
>  __do_softirq+0x2d7/0xb85 kernel/softirq.c:285
>
> The buggy address belongs to the object at ffff8801d4d40e00
>  which belongs to the cache kmalloc-192 of size 192
> The buggy address is located 88 bytes inside of
>  192-byte region [ffff8801d4d40e00, ffff8801d4d40ec0)
> The buggy address belongs to the page:
> page:ffffea0007535000 count:1 mapcount:0 mapping:ffff8801d4d40000 index:0x0
> flags: 0x2fffc0000000100(slab)
> raw: 02fffc0000000100 ffff8801d4d40000 0000000000000000 0000000100000010
> raw: ffffea00075251e0 ffff8801dac01148 ffff8801dac00040 0000000000000000
> page dumped because: kasan: bad access detected
>
> Memory state around the buggy address:
>  ffff8801d4d40d00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>  ffff8801d4d40d80: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
> >ffff8801d4d40e00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>                                                     ^
>  ffff8801d4d40e80: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
>  ffff8801d4d40f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ==================================================================
>
>
> ---
> This bug is generated by a dumb bot. It may contain errors.
> See https://goo.gl/tpsmEJ for details.
> Direct all questions to syzkaller@googlegroups.com.
>
> syzbot will keep track of this bug report.
> If you forgot to add the Reported-by tag, once the fix for this bug is
> merged
> into any tree, please reply to this email with:
> #syz fix: exact-commit-title
> If you want to test a patch for this bug, please reply with:
> #syz test: git://repo/address.git branch
> and provide the patch inline or as an attachment.
> To mark this as a duplicate of another syzbot report, please reply with:
> #syz dup: exact-subject-of-another-report
> If it's a one-off invalid bug report, please reply with:
> #syz invalid
> Note: if the crash happens again, it will cause creation of a new bug
> report.
> Note: all commands must start from beginning of the line in the email body.


* Re: rsi: fix spelling mistake: "thead" -> "thread"
From: Kalle Valo @ 2018-05-12  8:55 UTC
  To: Colin Ian King
  Cc: David S . Miller, Amitkumar Karwar, Prameela Rani Garnepudi,
	linux-wireless, netdev, kernel-janitors, linux-kernel
In-Reply-To: <20180510143217.23488-1-colin.king@canonical.com>

Colin Ian King <colin.king@canonical.com> wrote:

> From: Colin Ian King <colin.king@canonical.com>
> 
> Trivial fix to spelling mistake in rsi_dbg debug message text
> 
> Signed-off-by: Colin Ian King <colin.king@canonical.com>

Patch applied to wireless-drivers-next.git, thanks.

b41c39367669 rsi: fix spelling mistake: "thead" -> "thread"

-- 
https://patchwork.kernel.org/patch/10391879/

https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches


* [PATCH net] xfrm6: avoid potential infinite loop in _decode_session6()
From: Eric Dumazet @ 2018-05-12  9:49 UTC
  To: David S . Miller
  Cc: netdev, Eric Dumazet, Eric Dumazet, Steffen Klassert,
	Nicolas Dichtel

syzbot found a way to trigger an infinite loop by overflowing the
@offset variable, which was forced to use u16 for some very
obscure reason in the past.

We probably want to look at NEXTHDR_FRAGMENT handling which looks
wrong, in a separate patch.

In net-next, we shall try to use skb_header_pointer() instead of
pskb_may_pull().

watchdog: BUG: soft lockup - CPU#1 stuck for 134s! [syz-executor738:4553]
Modules linked in:
irq event stamp: 13885653
hardirqs last  enabled at (13885652): [<ffffffff878009d5>] restore_regs_and_return_to_kernel+0x0/0x2b
hardirqs last disabled at (13885653): [<ffffffff87800905>] interrupt_entry+0xb5/0xf0 arch/x86/entry/entry_64.S:625
softirqs last  enabled at (13614028): [<ffffffff84df0809>] tun_napi_alloc_frags drivers/net/tun.c:1478 [inline]
softirqs last  enabled at (13614028): [<ffffffff84df0809>] tun_get_user+0x1dd9/0x4290 drivers/net/tun.c:1825
softirqs last disabled at (13614032): [<ffffffff84df1b6f>] tun_get_user+0x313f/0x4290 drivers/net/tun.c:1942
CPU: 1 PID: 4553 Comm: syz-executor738 Not tainted 4.17.0-rc3+ #40
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:check_kcov_mode kernel/kcov.c:67 [inline]
RIP: 0010:__sanitizer_cov_trace_pc+0x20/0x50 kernel/kcov.c:101
RSP: 0018:ffff8801d8cfe250 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
RAX: ffff8801d88a8080 RBX: ffff8801d7389e40 RCX: 0000000000000006
RDX: 0000000000000000 RSI: ffffffff868da4ad RDI: ffff8801c8a53277
RBP: ffff8801d8cfe250 R08: ffff8801d88a8080 R09: ffff8801d8cfe3e8
R10: ffffed003b19fc87 R11: ffff8801d8cfe43f R12: ffff8801c8a5327f
R13: 0000000000000000 R14: ffff8801c8a4e5fe R15: ffff8801d8cfe3e8
FS:  0000000000d88940(0000) GS:ffff8801daf00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffff600400 CR3: 00000001acab3000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 _decode_session6+0xc1d/0x14f0 net/ipv6/xfrm6_policy.c:150
 __xfrm_decode_session+0x71/0x140 net/xfrm/xfrm_policy.c:2368
 xfrm_decode_session_reverse include/net/xfrm.h:1213 [inline]
 icmpv6_route_lookup+0x395/0x6e0 net/ipv6/icmp.c:372
 icmp6_send+0x1982/0x2da0 net/ipv6/icmp.c:551
 icmpv6_send+0x17a/0x300 net/ipv6/ip6_icmp.c:43
 ip6_input_finish+0x14e1/0x1a30 net/ipv6/ip6_input.c:305
 NF_HOOK include/linux/netfilter.h:288 [inline]
 ip6_input+0xe1/0x5e0 net/ipv6/ip6_input.c:327
 dst_input include/net/dst.h:450 [inline]
 ip6_rcv_finish+0x29c/0xa10 net/ipv6/ip6_input.c:71
 NF_HOOK include/linux/netfilter.h:288 [inline]
 ipv6_rcv+0xeb8/0x2040 net/ipv6/ip6_input.c:208
 __netif_receive_skb_core+0x2468/0x3650 net/core/dev.c:4646
 __netif_receive_skb+0x2c/0x1e0 net/core/dev.c:4711
 netif_receive_skb_internal+0x126/0x7b0 net/core/dev.c:4785
 napi_frags_finish net/core/dev.c:5226 [inline]
 napi_gro_frags+0x631/0xc40 net/core/dev.c:5299
 tun_get_user+0x3168/0x4290 drivers/net/tun.c:1951
 tun_chr_write_iter+0xb9/0x154 drivers/net/tun.c:1996
 call_write_iter include/linux/fs.h:1784 [inline]
 do_iter_readv_writev+0x859/0xa50 fs/read_write.c:680
 do_iter_write+0x185/0x5f0 fs/read_write.c:959
 vfs_writev+0x1c7/0x330 fs/read_write.c:1004
 do_writev+0x112/0x2f0 fs/read_write.c:1039
 __do_sys_writev fs/read_write.c:1112 [inline]
 __se_sys_writev fs/read_write.c:1109 [inline]
 __x64_sys_writev+0x75/0xb0 fs/read_write.c:1109
 do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reported-by: syzbot+0053c8...@syzkaller.appspotmail.com
---
 net/ipv6/xfrm6_policy.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/xfrm6_policy.c b/net/ipv6/xfrm6_policy.c
index 416fe67271a920f5a86dd3007c03e3113f857f8a..86dba282a147ce6ad4b3e4e2f3b5c81962493130 100644
--- a/net/ipv6/xfrm6_policy.c
+++ b/net/ipv6/xfrm6_policy.c
@@ -126,7 +126,7 @@ _decode_session6(struct sk_buff *skb, struct flowi *fl, int reverse)
 	struct flowi6 *fl6 = &fl->u.ip6;
 	int onlyproto = 0;
 	const struct ipv6hdr *hdr = ipv6_hdr(skb);
-	u16 offset = sizeof(*hdr);
+	u32 offset = sizeof(*hdr);
 	struct ipv6_opt_hdr *exthdr;
 	const unsigned char *nh = skb_network_header(skb);
 	u16 nhoff = IP6CB(skb)->nhoff;
-- 
2.17.0.441.gb46fe60e1d-goog
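
For illustration, the effect of the one-line type change above can be reproduced in user space. This is a minimal sketch, assuming the problem being fixed is the 16-bit offset wrapping around while headers are skipped; walk_u16(), walk_u32() and the lengths used are hypothetical and are not taken from the kernel.

```c
#include <stdint.h>

/* Hypothetical model: advance an offset by hdrlen bytes, steps times,
 * the way a parser skips over consecutive headers. */
static uint32_t walk_u16(uint16_t start, uint16_t hdrlen, unsigned int steps)
{
	uint16_t offset = start;

	while (steps--)
		offset += hdrlen;	/* result is truncated modulo 65536 */
	return offset;
}

/* Same walk with a 32-bit offset: no truncation for these sizes. */
static uint32_t walk_u32(uint32_t start, uint32_t hdrlen, unsigned int steps)
{
	uint32_t offset = start;

	while (steps--)
		offset += hdrlen;
	return offset;
}
```

With a start of 40 and 2048 steps of 32 bytes, the 16-bit variant ends up back at 40, so a loop that compares the offset against a length could revisit the same data; the 32-bit variant reaches 65576.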


* Re: [PATCH 14/32] net/tcp: convert to ->poll_mask
From: Christoph Hellwig @ 2018-05-12 10:09 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Christoph Hellwig, viro, Avi Kivity, linux-aio, linux-fsdevel,
	netdev, linux-api, linux-kernel
In-Reply-To: <96284b0e-0d4e-a944-4fd5-933d12cf53cb@gmail.com>

On Fri, May 11, 2018 at 06:13:11AM -0700, Eric Dumazet wrote:
> > +struct wait_queue_head *tcp_get_poll_head(struct socket *sock, __poll_t events)
> > +{
> > +	sock_poll_busy_loop(sock, events);
> > +	sock_rps_record_flow(sock->sk);
> 
> Why are you adding sock_rps_record_flow() ?

Because I mismerged the removal of it from tcp_poll in
'net: revert "Update RFS target at poll for tcp/udp"'.

Thanks for the heads-up, this will be removed in the next version.



* [PATCH] 3c59x: convert to generic DMA API
From: Christoph Hellwig @ 2018-05-12 10:16 UTC (permalink / raw)
  To: netdev; +Cc: linux-pci, linux-kernel, tedheadster

This driver supports EISA devices in addition to PCI devices, and relied
on the legacy behavior of the pci_dma* shims passing a NULL pointer on
to the DMA API, and on the DMA API being able to handle that.  When the
NULL forwarding was removed, EISA support broke.  Fix this by converting
the driver to the generic DMA API instead of the legacy PCI shims.

Fixes: 4167b2ad ("PCI: Remove NULL device handling from PCI DMA API")
Reported-by: tedheadster <tedheadster@gmail.com>
Tested-by: tedheadster <tedheadster@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/net/ethernet/3com/3c59x.c | 104 +++++++++++++++---------------
 1 file changed, 51 insertions(+), 53 deletions(-)

diff --git a/drivers/net/ethernet/3com/3c59x.c b/drivers/net/ethernet/3com/3c59x.c
index 36c8950dbd2d..176861bd2252 100644
--- a/drivers/net/ethernet/3com/3c59x.c
+++ b/drivers/net/ethernet/3com/3c59x.c
@@ -1212,9 +1212,9 @@ static int vortex_probe1(struct device *gendev, void __iomem *ioaddr, int irq,
 	vp->mii.reg_num_mask = 0x1f;
 
 	/* Makes sure rings are at least 16 byte aligned. */
-	vp->rx_ring = pci_alloc_consistent(pdev, sizeof(struct boom_rx_desc) * RX_RING_SIZE
+	vp->rx_ring = dma_alloc_coherent(gendev, sizeof(struct boom_rx_desc) * RX_RING_SIZE
 					   + sizeof(struct boom_tx_desc) * TX_RING_SIZE,
-					   &vp->rx_ring_dma);
+					   &vp->rx_ring_dma, GFP_KERNEL);
 	retval = -ENOMEM;
 	if (!vp->rx_ring)
 		goto free_device;
@@ -1476,11 +1476,10 @@ static int vortex_probe1(struct device *gendev, void __iomem *ioaddr, int irq,
 		return 0;
 
 free_ring:
-	pci_free_consistent(pdev,
-						sizeof(struct boom_rx_desc) * RX_RING_SIZE
-							+ sizeof(struct boom_tx_desc) * TX_RING_SIZE,
-						vp->rx_ring,
-						vp->rx_ring_dma);
+	dma_free_coherent(gendev,
+		sizeof(struct boom_rx_desc) * RX_RING_SIZE +
+		sizeof(struct boom_tx_desc) * TX_RING_SIZE,
+		vp->rx_ring, vp->rx_ring_dma);
 free_device:
 	free_netdev(dev);
 	pr_err(PFX "vortex_probe1 fails.  Returns %d\n", retval);
@@ -1751,9 +1750,9 @@ vortex_open(struct net_device *dev)
 				break;			/* Bad news!  */
 
 			skb_reserve(skb, NET_IP_ALIGN);	/* Align IP on 16 byte boundaries */
-			dma = pci_map_single(VORTEX_PCI(vp), skb->data,
-					     PKT_BUF_SZ, PCI_DMA_FROMDEVICE);
-			if (dma_mapping_error(&VORTEX_PCI(vp)->dev, dma))
+			dma = dma_map_single(vp->gendev, skb->data,
+					     PKT_BUF_SZ, DMA_FROM_DEVICE);
+			if (dma_mapping_error(vp->gendev, dma))
 				break;
 			vp->rx_ring[i].addr = cpu_to_le32(dma);
 		}
@@ -2067,9 +2066,9 @@ vortex_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	if (vp->bus_master) {
 		/* Set the bus-master controller to transfer the packet. */
 		int len = (skb->len + 3) & ~3;
-		vp->tx_skb_dma = pci_map_single(VORTEX_PCI(vp), skb->data, len,
-						PCI_DMA_TODEVICE);
-		if (dma_mapping_error(&VORTEX_PCI(vp)->dev, vp->tx_skb_dma)) {
+		vp->tx_skb_dma = dma_map_single(vp->gendev, skb->data, len,
+						DMA_TO_DEVICE);
+		if (dma_mapping_error(vp->gendev, vp->tx_skb_dma)) {
 			dev_kfree_skb_any(skb);
 			dev->stats.tx_dropped++;
 			return NETDEV_TX_OK;
@@ -2168,9 +2167,9 @@ boomerang_start_xmit(struct sk_buff *skb, struct net_device *dev)
 			vp->tx_ring[entry].status = cpu_to_le32(skb->len | TxIntrUploaded | AddTCPChksum | AddUDPChksum);
 
 	if (!skb_shinfo(skb)->nr_frags) {
-		dma_addr = pci_map_single(VORTEX_PCI(vp), skb->data, skb->len,
-					  PCI_DMA_TODEVICE);
-		if (dma_mapping_error(&VORTEX_PCI(vp)->dev, dma_addr))
+		dma_addr = dma_map_single(vp->gendev, skb->data, skb->len,
+					  DMA_TO_DEVICE);
+		if (dma_mapping_error(vp->gendev, dma_addr))
 			goto out_dma_err;
 
 		vp->tx_ring[entry].frag[0].addr = cpu_to_le32(dma_addr);
@@ -2178,9 +2177,9 @@ boomerang_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	} else {
 		int i;
 
-		dma_addr = pci_map_single(VORTEX_PCI(vp), skb->data,
-					  skb_headlen(skb), PCI_DMA_TODEVICE);
-		if (dma_mapping_error(&VORTEX_PCI(vp)->dev, dma_addr))
+		dma_addr = dma_map_single(vp->gendev, skb->data,
+					  skb_headlen(skb), DMA_TO_DEVICE);
+		if (dma_mapping_error(vp->gendev, dma_addr))
 			goto out_dma_err;
 
 		vp->tx_ring[entry].frag[0].addr = cpu_to_le32(dma_addr);
@@ -2189,21 +2188,21 @@ boomerang_start_xmit(struct sk_buff *skb, struct net_device *dev)
 		for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
 			skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
 
-			dma_addr = skb_frag_dma_map(&VORTEX_PCI(vp)->dev, frag,
+			dma_addr = skb_frag_dma_map(vp->gendev, frag,
 						    0,
 						    frag->size,
 						    DMA_TO_DEVICE);
-			if (dma_mapping_error(&VORTEX_PCI(vp)->dev, dma_addr)) {
+			if (dma_mapping_error(vp->gendev, dma_addr)) {
 				for(i = i-1; i >= 0; i--)
-					dma_unmap_page(&VORTEX_PCI(vp)->dev,
+					dma_unmap_page(vp->gendev,
 						       le32_to_cpu(vp->tx_ring[entry].frag[i+1].addr),
 						       le32_to_cpu(vp->tx_ring[entry].frag[i+1].length),
 						       DMA_TO_DEVICE);
 
-				pci_unmap_single(VORTEX_PCI(vp),
+				dma_unmap_single(vp->gendev,
 						 le32_to_cpu(vp->tx_ring[entry].frag[0].addr),
 						 le32_to_cpu(vp->tx_ring[entry].frag[0].length),
-						 PCI_DMA_TODEVICE);
+						 DMA_TO_DEVICE);
 
 				goto out_dma_err;
 			}
@@ -2218,8 +2217,8 @@ boomerang_start_xmit(struct sk_buff *skb, struct net_device *dev)
 		}
 	}
 #else
-	dma_addr = pci_map_single(VORTEX_PCI(vp), skb->data, skb->len, PCI_DMA_TODEVICE);
-	if (dma_mapping_error(&VORTEX_PCI(vp)->dev, dma_addr))
+	dma_addr = dma_map_single(vp->gendev, skb->data, skb->len, DMA_TO_DEVICE);
+	if (dma_mapping_error(vp->gendev, dma_addr))
 		goto out_dma_err;
 	vp->tx_ring[entry].addr = cpu_to_le32(dma_addr);
 	vp->tx_ring[entry].length = cpu_to_le32(skb->len | LAST_FRAG);
@@ -2254,7 +2253,7 @@ boomerang_start_xmit(struct sk_buff *skb, struct net_device *dev)
 out:
 	return NETDEV_TX_OK;
 out_dma_err:
-	dev_err(&VORTEX_PCI(vp)->dev, "Error mapping dma buffer\n");
+	dev_err(vp->gendev, "Error mapping dma buffer\n");
 	goto out;
 }
 
@@ -2322,7 +2321,7 @@ vortex_interrupt(int irq, void *dev_id)
 		if (status & DMADone) {
 			if (ioread16(ioaddr + Wn7_MasterStatus) & 0x1000) {
 				iowrite16(0x1000, ioaddr + Wn7_MasterStatus); /* Ack the event. */
-				pci_unmap_single(VORTEX_PCI(vp), vp->tx_skb_dma, (vp->tx_skb->len + 3) & ~3, PCI_DMA_TODEVICE);
+				dma_unmap_single(vp->gendev, vp->tx_skb_dma, (vp->tx_skb->len + 3) & ~3, DMA_TO_DEVICE);
 				pkts_compl++;
 				bytes_compl += vp->tx_skb->len;
 				dev_kfree_skb_irq(vp->tx_skb); /* Release the transferred buffer */
@@ -2459,19 +2458,19 @@ boomerang_interrupt(int irq, void *dev_id)
 					struct sk_buff *skb = vp->tx_skbuff[entry];
 #if DO_ZEROCOPY
 					int i;
-					pci_unmap_single(VORTEX_PCI(vp),
+					dma_unmap_single(vp->gendev,
 							le32_to_cpu(vp->tx_ring[entry].frag[0].addr),
 							le32_to_cpu(vp->tx_ring[entry].frag[0].length)&0xFFF,
-							PCI_DMA_TODEVICE);
+							DMA_TO_DEVICE);
 
 					for (i=1; i<=skb_shinfo(skb)->nr_frags; i++)
-							pci_unmap_page(VORTEX_PCI(vp),
+							dma_unmap_page(vp->gendev,
 											 le32_to_cpu(vp->tx_ring[entry].frag[i].addr),
 											 le32_to_cpu(vp->tx_ring[entry].frag[i].length)&0xFFF,
-											 PCI_DMA_TODEVICE);
+											 DMA_TO_DEVICE);
 #else
-					pci_unmap_single(VORTEX_PCI(vp),
-						le32_to_cpu(vp->tx_ring[entry].addr), skb->len, PCI_DMA_TODEVICE);
+					dma_unmap_single(vp->gendev,
+						le32_to_cpu(vp->tx_ring[entry].addr), skb->len, DMA_TO_DEVICE);
 #endif
 					pkts_compl++;
 					bytes_compl += skb->len;
@@ -2561,14 +2560,14 @@ static int vortex_rx(struct net_device *dev)
 				/* 'skb_put()' points to the start of sk_buff data area. */
 				if (vp->bus_master &&
 					! (ioread16(ioaddr + Wn7_MasterStatus) & 0x8000)) {
-					dma_addr_t dma = pci_map_single(VORTEX_PCI(vp), skb_put(skb, pkt_len),
-									   pkt_len, PCI_DMA_FROMDEVICE);
+					dma_addr_t dma = dma_map_single(vp->gendev, skb_put(skb, pkt_len),
+									   pkt_len, DMA_FROM_DEVICE);
 					iowrite32(dma, ioaddr + Wn7_MasterAddr);
 					iowrite16((skb->len + 3) & ~3, ioaddr + Wn7_MasterLen);
 					iowrite16(StartDMAUp, ioaddr + EL3_CMD);
 					while (ioread16(ioaddr + Wn7_MasterStatus) & 0x8000)
 						;
-					pci_unmap_single(VORTEX_PCI(vp), dma, pkt_len, PCI_DMA_FROMDEVICE);
+					dma_unmap_single(vp->gendev, dma, pkt_len, DMA_FROM_DEVICE);
 				} else {
 					ioread32_rep(ioaddr + RX_FIFO,
 					             skb_put(skb, pkt_len),
@@ -2635,11 +2634,11 @@ boomerang_rx(struct net_device *dev)
 			if (pkt_len < rx_copybreak &&
 			    (skb = netdev_alloc_skb(dev, pkt_len + 2)) != NULL) {
 				skb_reserve(skb, 2);	/* Align IP on 16 byte boundaries */
-				pci_dma_sync_single_for_cpu(VORTEX_PCI(vp), dma, PKT_BUF_SZ, PCI_DMA_FROMDEVICE);
+				dma_sync_single_for_cpu(vp->gendev, dma, PKT_BUF_SZ, DMA_FROM_DEVICE);
 				/* 'skb_put()' points to the start of sk_buff data area. */
 				skb_put_data(skb, vp->rx_skbuff[entry]->data,
 					     pkt_len);
-				pci_dma_sync_single_for_device(VORTEX_PCI(vp), dma, PKT_BUF_SZ, PCI_DMA_FROMDEVICE);
+				dma_sync_single_for_device(vp->gendev, dma, PKT_BUF_SZ, DMA_FROM_DEVICE);
 				vp->rx_copy++;
 			} else {
 				/* Pre-allocate the replacement skb.  If it or its
@@ -2651,9 +2650,9 @@ boomerang_rx(struct net_device *dev)
 					dev->stats.rx_dropped++;
 					goto clear_complete;
 				}
-				newdma = pci_map_single(VORTEX_PCI(vp), newskb->data,
-							PKT_BUF_SZ, PCI_DMA_FROMDEVICE);
-				if (dma_mapping_error(&VORTEX_PCI(vp)->dev, newdma)) {
+				newdma = dma_map_single(vp->gendev, newskb->data,
+							PKT_BUF_SZ, DMA_FROM_DEVICE);
+				if (dma_mapping_error(vp->gendev, newdma)) {
 					dev->stats.rx_dropped++;
 					consume_skb(newskb);
 					goto clear_complete;
@@ -2664,7 +2663,7 @@ boomerang_rx(struct net_device *dev)
 				vp->rx_skbuff[entry] = newskb;
 				vp->rx_ring[entry].addr = cpu_to_le32(newdma);
 				skb_put(skb, pkt_len);
-				pci_unmap_single(VORTEX_PCI(vp), dma, PKT_BUF_SZ, PCI_DMA_FROMDEVICE);
+				dma_unmap_single(vp->gendev, dma, PKT_BUF_SZ, DMA_FROM_DEVICE);
 				vp->rx_nocopy++;
 			}
 			skb->protocol = eth_type_trans(skb, dev);
@@ -2761,8 +2760,8 @@ vortex_close(struct net_device *dev)
 	if (vp->full_bus_master_rx) { /* Free Boomerang bus master Rx buffers. */
 		for (i = 0; i < RX_RING_SIZE; i++)
 			if (vp->rx_skbuff[i]) {
-				pci_unmap_single(	VORTEX_PCI(vp), le32_to_cpu(vp->rx_ring[i].addr),
-									PKT_BUF_SZ, PCI_DMA_FROMDEVICE);
+				dma_unmap_single(vp->gendev, le32_to_cpu(vp->rx_ring[i].addr),
+									PKT_BUF_SZ, DMA_FROM_DEVICE);
 				dev_kfree_skb(vp->rx_skbuff[i]);
 				vp->rx_skbuff[i] = NULL;
 			}
@@ -2775,12 +2774,12 @@ vortex_close(struct net_device *dev)
 				int k;
 
 				for (k=0; k<=skb_shinfo(skb)->nr_frags; k++)
-						pci_unmap_single(VORTEX_PCI(vp),
+						dma_unmap_single(vp->gendev,
 										 le32_to_cpu(vp->tx_ring[i].frag[k].addr),
 										 le32_to_cpu(vp->tx_ring[i].frag[k].length)&0xFFF,
-										 PCI_DMA_TODEVICE);
+										 DMA_TO_DEVICE);
 #else
-				pci_unmap_single(VORTEX_PCI(vp), le32_to_cpu(vp->tx_ring[i].addr), skb->len, PCI_DMA_TODEVICE);
+				dma_unmap_single(vp->gendev, le32_to_cpu(vp->tx_ring[i].addr), skb->len, DMA_TO_DEVICE);
 #endif
 				dev_kfree_skb(skb);
 				vp->tx_skbuff[i] = NULL;
@@ -3288,11 +3287,10 @@ static void vortex_remove_one(struct pci_dev *pdev)
 
 	pci_iounmap(pdev, vp->ioaddr);
 
-	pci_free_consistent(pdev,
-						sizeof(struct boom_rx_desc) * RX_RING_SIZE
-							+ sizeof(struct boom_tx_desc) * TX_RING_SIZE,
-						vp->rx_ring,
-						vp->rx_ring_dma);
+	dma_free_coherent(&pdev->dev,
+			sizeof(struct boom_rx_desc) * RX_RING_SIZE +
+			sizeof(struct boom_tx_desc) * TX_RING_SIZE,
+			vp->rx_ring, vp->rx_ring_dma);
 
 	pci_release_regions(pdev);
 
-- 
2.17.0
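
The difference between the two approaches described in the commit message can be modelled in a few lines of user-space C. This is only a sketch: the struct layouts, struct vortex_model, legacy_shim_device() and generic_dma_device() are hypothetical stand-ins, not the kernel definitions.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
struct device { const char *bus; };
struct pci_dev { int id; struct device dev; };

struct vortex_model {
	struct device *gendev;	/* always set: the PCI or EISA device */
	struct pci_dev *pdev;	/* NULL for EISA cards */
};

/* Legacy behavior as described above: no PCI device means a NULL
 * device pointer is forwarded to the DMA API. */
static struct device *legacy_shim_device(const struct vortex_model *vp)
{
	return vp->pdev ? &vp->pdev->dev : NULL;
}

/* Converted driver: always hand the generic device to the DMA API. */
static struct device *generic_dma_device(const struct vortex_model *vp)
{
	return vp->gendev;
}
```

For a PCI card both functions return the same pointer; for an EISA card only the generic variant yields a usable device.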


* [PATCH] ipvlan: flush arp table when mac address changed
From: liuqifa @ 2018-05-12 11:00 UTC (permalink / raw)
  To: davem, dsahern, maheshb, weiyongjun1, maowenan, dingtianhong,
	liuqifa
  Cc: netdev, linux-kernel

From: Keefe Liu <liuqifa@huawei.com>

When the master device's MAC address changes, commit 32c10bbfe914
("ipvlan: always use the current L2 addr of the master") also updates
the MAC address of the IPvlan devices, but it doesn't flush the IPvlan
devices' ARP tables.

Signed-off-by: Keefe Liu <liuqifa@huawei.com>
---
 drivers/net/ipvlan/ipvlan_main.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ipvlan/ipvlan_main.c b/drivers/net/ipvlan/ipvlan_main.c
index 450eec2..a1edfe1 100644
--- a/drivers/net/ipvlan/ipvlan_main.c
+++ b/drivers/net/ipvlan/ipvlan_main.c
@@ -7,6 +7,8 @@
  *
  */
 
+#include <net/neighbour.h>
+#include <net/arp.h>
 #include "ipvlan.h"
 
 static unsigned int ipvlan_netid __read_mostly;
@@ -792,8 +794,10 @@ static int ipvlan_device_event(struct notifier_block *unused,
 		break;
 
 	case NETDEV_CHANGEADDR:
-		list_for_each_entry(ipvlan, &port->ipvlans, pnode)
+		list_for_each_entry(ipvlan, &port->ipvlans, pnode) {
 			ether_addr_copy(ipvlan->dev->dev_addr, dev->dev_addr);
+			neigh_changeaddr(&arp_tbl, ipvlan->dev);
+		}
 		break;
 
 	case NETDEV_PRE_TYPE_CHANGE:
-- 
1.8.3.1


* Re: [PATCH net-next 3/8] sctp: move the flush of ctrl chunks into its own function
From: Marcelo Ricardo Leitner @ 2018-05-12 13:29 UTC (permalink / raw)
  To: netdev; +Cc: linux-sctp, Neil Horman, Vlad Yasevich, Xin Long
In-Reply-To: <23e474529812a3e804b1a4311ec48f250b81bbc5.1526077476.git.marcelo.leitner@gmail.com>

On Fri, May 11, 2018 at 08:28:45PM -0300, Marcelo Ricardo Leitner wrote:
> Named sctp_outq_flush_ctrl and, with that, keep the contexts contained.

kbuild bot spotted some issues with this patch. They were corrected
later on in the series, but I should fix them here.

Will post a v2 later today.


* Re: [PATCH net-next] net:sched: add gkprio scheduler
From: Jamal Hadi Salim @ 2018-05-12 14:48 UTC (permalink / raw)
  To: Michel Machado, Cong Wang
  Cc: Nishanth Devarajan, Jiri Pirko, David Miller,
	Linux Kernel Network Developers, Cody Doucette
In-Reply-To: <32b58e4e-c06c-0266-e4d9-caca365e46ef@digirati.com.br>

Sorry for the latency..

On 09/05/18 01:37 PM, Michel Machado wrote:
> On 05/09/2018 10:43 AM, Jamal Hadi Salim wrote:
>> On 08/05/18 10:27 PM, Cong Wang wrote:
>>> On Tue, May 8, 2018 at 6:29 AM, Jamal Hadi Salim <jhs@mojatatu.com> 
>>> wrote:

> 
> I like the suggestion of extending skbmod to mark skbprio based on ds. 
> Given that DSprio would no longer depend on the DS field, would you have 
> a name suggestion for this new queue discipline since the name "prio" is 
> currently in use?
> 

Not sure what to call it.
My struggle is still with the intended end goal of the qdisc.
It looks like the prio qdisc except for the enqueue part, which attempts
to use a shared global queue size for all prios. I would have
pointed to other approaches which use a global priority queue pool
and do early congestion detection, like RED or variants like GRED, but
those use average values of the queue lengths, not instantaneous values
such as you do.
I am tempted to say - based on my current understanding - that you don't
need a new qdisc; rather you need to map your dsfields to skbprio
(via skbmod) and stick with the prio qdisc. I also think the skbmod
mapping is useful regardless of this need.

> What should be the range of priorities that this new queue discipline 
> would accept? skb->priority is of type __u32, but supporting 2^32 
> priorities would require too large of an array to index packets by 
> priority; the DS field is only 6 bits long. Do you have a use case in 
> mind to guide us here?
>

Look at the priomap or prio2band arrangement on the prio qdisc
or pfifo_fast qdisc. You take an skbprio as an index into the array
and retrieve a queue to enqueue to. The size of the array is 16.
In the past this was based, IIRC, on IP precedence + 1 bit. Those map
similarly to DS fields (class selectors, assured forwarding etc). So
there is no need to even increase the array beyond the current 16.
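
A rough user-space sketch of that lookup, combined with strict-priority dequeue, could look like this. The table is the default pfifo_fast map as far as I recall it; band_for_priority() and next_band_to_dequeue() are hypothetical names and only model the idea.

```c
#include <stddef.h>

#define TC_PRIO_MAX 15
#define BANDS 3

/* Index is skb->priority & TC_PRIO_MAX, value is the band
 * (band 0 is served first). */
static const unsigned char prio2band[TC_PRIO_MAX + 1] = {
	1, 2, 2, 2, 1, 2, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1
};

/* Enqueue side: pick the band for a packet priority. */
static unsigned int band_for_priority(unsigned int skb_priority)
{
	return prio2band[skb_priority & TC_PRIO_MAX];
}

/* Dequeue side: return the lowest-numbered non-empty band, or -1 if
 * every band is empty. qlen[] holds the per-band queue lengths. */
static int next_band_to_dequeue(const unsigned int qlen[BANDS])
{
	int band;

	for (band = 0; band < BANDS; band++)
		if (qlen[band])
			return band;
	return -1;
}
```

An skbmod-style mapping from the DS field would only need to produce an index in the 0..15 range for this table to work unchanged.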

>> I find the cleverness in changing the highest/low prios confusing.
>> It looks error-prone (I guess that is why there is a BUG check)
>> To the authors: Is there a document/paper on the theory of this thing
>> as to why no explicit queues are "faster"?
> 
> The priority orientation in GKprio is due to two factors: failing safe 
> and elegance. If zero were the highest priority, any operational mistake 
> that leads not-classified packets through GKprio would potentially 
> disrupt the system. We are humans, we'll make mistakes. The elegance 
> aspect comes from the fact that the assigned priority is not massaged to 
> fit the DS field. We find it helpful while inspecting packets on the wire.
> 
> The reason for us to avoid explicit queues in GKprio, which could change 
> the behavior within a given priority, is to closely abide to the 
> expected behavior assumed to prove Theorem 4.1 in the paper "Portcullis: 
> Protecting Connection Setup from Denial-of-Capability Attacks":
> 
> https://dl.acm.org/citation.cfm?id=1282413
> 

Paper seems to be behind a paywall. Googling didn't help.
My concern is still the science behind this; if you had written up
some test setup which shows how you concluded this was a better
approach to DoS prevention and showed some numbers, it would have
greatly helped clarify.

>> 1) I agree that using multiple queues as in prio qdisc would make it
>> more manageable; does not necessarily need to be classful if you
>> use implicit skbprio classification. i.e. on enqueue use a priority
>> map to select a queue; on dequeue always dequeue from highest prio
>> until it has no more packets to send.
> 
> In my reply to Cong, I point out that there is a technical limitation in 
> the interface of queue disciplines that forbids GKprio to have explicit 
> sub-queues:
> 
> https://www.mail-archive.com/netdev@vger.kernel.org/msg234201.html
> 
>> 2) Dropping already enqueued packets will not work well for
>> local feedback (__NET_XMIT_BYPASS return code is about the
>> packet that has been dropped from earlier enqueueing because
>> it is lower priority - it does not signify anything about the
>> current skb, which actually just got enqueued).
>> Perhaps (off the top of my head) the answer is to always enqueue
>> packets on high priority when their limit is exceeded as long as lower
>> prio has some space. This means you'd have to increment low prio
>> accounting if their space is used.
> 
> I don't understand the point you are making here. Could you develop it 
> further?
> 

Sorry - I meant NET_XMIT_CN.
If you drop an already enqueued packet, it makes sense to signal that
using NET_XMIT_CN.
This does not make sense for forwarded packets, but it does
for locally sourced packets.
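
As a sketch of what I mean, here is a small user-space model of a shared-limit priority queue. It assumes higher numbers mean higher priority; struct shared_q, model_enqueue(), NPRIO and LIMIT are hypothetical, and only the three return codes mirror the kernel's values.

```c
#define NET_XMIT_SUCCESS 0x00
#define NET_XMIT_DROP    0x01	/* the skb being enqueued was dropped */
#define NET_XMIT_CN      0x02	/* congestion notification */

#define NPRIO 4
#define LIMIT 2

struct shared_q {
	unsigned int qlen[NPRIO];	/* packets queued per priority */
	unsigned int total;		/* packets queued overall */
};

static int model_enqueue(struct shared_q *q, unsigned int prio)
{
	unsigned int low;

	if (prio >= NPRIO)
		return NET_XMIT_DROP;
	if (q->total < LIMIT) {
		q->qlen[prio]++;
		q->total++;
		return NET_XMIT_SUCCESS;
	}
	/* Queue is full: find the lowest priority that has packets. */
	for (low = 0; low < NPRIO && !q->qlen[low]; low++)
		;
	if (low >= prio)
		return NET_XMIT_DROP;	/* new packet is not more important */
	q->qlen[low]--;			/* evict an older, lower-prio packet */
	q->qlen[prio]++;
	return NET_XMIT_CN;		/* new skb queued, but something was dropped */
}
```

The caller of the last enqueue learns that the queue is congested even though its own packet was accepted, which is the feedback a local sender can act on.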

cheers,
jamal


* [PATCH bpf-next v5 0/6] ipv6: sr: introduce seg6local End.BPF action
From: Mathieu Xhonneux @ 2018-05-12 17:25 UTC (permalink / raw)
  To: netdev; +Cc: dlebrun, alexei.starovoitov

As of Linux 4.14, it is possible to define advanced local processing for
IPv6 packets with a Segment Routing Header through the seg6local LWT
infrastructure. This LWT implements the network programming principles
defined in the IETF “SRv6 Network Programming” draft.

The implemented operations are generic, and it would be very interesting to
be able to implement user-specific seg6local actions without having to
modify the kernel directly. To do so, this patchset adds an End.BPF action
to seg6local, powered by some specific Segment Routing-related helpers,
which provide SR functionalities that can be applied to the packet. This
BPF hook would then allow implementing specific actions at native kernel
speed, such as OAM features, advanced SR SDN policies, SRv6 actions like
Segment Routing Header (SRH) encapsulation depending on the content of
the packet, etc.

This patchset is divided into 6 patches, whose main features are:

- A new seg6local action End.BPF with the corresponding new BPF program
  type BPF_PROG_TYPE_LWT_SEG6LOCAL. Such an attached BPF program can be
  passed to the LWT seg6local through netlink, the same way as the LWT
  BPF hook operates.
- 3 new BPF helpers for the seg6local BPF hook, allowing a program to
  edit/grow/shrink an SRH and apply some of the generic SRv6 actions to
  a packet.
- 1 new BPF helper for the LWT BPF IN hook, allowing a program to add an
  SRH through encapsulation (via IPv6 encapsulation, or inlining if the
  packet already contains an IPv6 header).

As this patchset adds a new LWT BPF hook, I took into account the result of
the discussions when the LWT BPF infrastructure got merged. Hence, the
seg6local BPF hook doesn’t allow write access to skb->data directly; only
the SRH can be modified, through specific helpers, which ensures that the
integrity of the packet is maintained.
More details are available in the related patch messages.

The performance of this BPF hook has been assessed with the BPF JIT
enabled on an Intel Xeon X3440 processor with 4 cores and 8 threads
clocked at 2.53 GHz. No throughput loss is observed with the seg6local
BPF hook when the BPF program does nothing (440kpps). Adding an 8-byte
TLV (1 call each to bpf_lwt_seg6_adjust_srh and bpf_lwt_seg6_store_bytes)
drops the throughput to 410kpps, and inlining an SRH via
bpf_lwt_seg6_action drops the throughput to 420kpps.
All throughputs are stable.

-------
v2: move the SRH integrity state from skb->cb to a per-cpu buffer
v3: - document helpers in man-page style
    - fix kbuild bugs
    - un-break BPF LWT out hook
    - bpf_push_seg6_encap is now static
    - preempt_enable is now called when the packet is dropped in
      input_action_end_bpf
v4: fix kbuild bugs when CONFIG_IPV6=m
v5: fix kbuild sparse warnings when CONFIG_IPV6=m

Thanks.


Mathieu Xhonneux (6):
  ipv6: sr: make seg6.h includable without IPv6
  ipv6: sr: export function lookup_nexthop
  bpf: Add IPv6 Segment Routing helpers
  bpf: Split lwt inout verifier structures
  ipv6: sr: Add seg6local action End.BPF
  selftests/bpf: test for seg6local End.BPF action

 include/linux/bpf_types.h                         |   5 +-
 include/net/seg6.h                                |   7 +-
 include/net/seg6_local.h                          |  32 ++
 include/uapi/linux/bpf.h                          |  98 ++++-
 include/uapi/linux/seg6_local.h                   |   3 +
 kernel/bpf/verifier.c                             |   1 +
 net/core/filter.c                                 | 390 ++++++++++++++++---
 net/ipv6/Kconfig                                  |   5 +
 net/ipv6/seg6_local.c                             | 180 ++++++++-
 tools/include/uapi/linux/bpf.h                    |  98 ++++-
 tools/lib/bpf/libbpf.c                            |   1 +
 tools/testing/selftests/bpf/Makefile              |   5 +-
 tools/testing/selftests/bpf/bpf_helpers.h         |  12 +
 tools/testing/selftests/bpf/test_lwt_seg6local.c  | 438 ++++++++++++++++++++++
 tools/testing/selftests/bpf/test_lwt_seg6local.sh | 140 +++++++
 15 files changed, 1340 insertions(+), 75 deletions(-)
 create mode 100644 include/net/seg6_local.h
 create mode 100644 tools/testing/selftests/bpf/test_lwt_seg6local.c
 create mode 100755 tools/testing/selftests/bpf/test_lwt_seg6local.sh

-- 
2.16.1


* [PATCH bpf-next v5 1/6] ipv6: sr: make seg6.h includable without IPv6
From: Mathieu Xhonneux @ 2018-05-12 17:25 UTC (permalink / raw)
  To: netdev; +Cc: dlebrun, alexei.starovoitov
In-Reply-To: <cover.1526143526.git.m.xhonneux@gmail.com>

include/net/seg6.h cannot be included in a source file if CONFIG_IPV6 is
not enabled:
   include/net/seg6.h: In function 'seg6_pernet':
>> include/net/seg6.h:52:14: error: 'struct net' has no member named
                                        'ipv6'; did you mean 'ipv4'?
     return net->ipv6.seg6_data;
                 ^~~~
                 ipv4

This commit makes seg6_pernet return NULL if IPv6 is not compiled in, hence
allowing seg6.h to be included regardless of the configuration.

Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
---
 include/net/seg6.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/include/net/seg6.h b/include/net/seg6.h
index 099bad59dc90..70b4cfac52d7 100644
--- a/include/net/seg6.h
+++ b/include/net/seg6.h
@@ -49,7 +49,11 @@ struct seg6_pernet_data {
 
 static inline struct seg6_pernet_data *seg6_pernet(struct net *net)
 {
+#if IS_ENABLED(CONFIG_IPV6)
 	return net->ipv6.seg6_data;
+#else
+	return NULL;
+#endif
 }
 
 extern int seg6_init(void);
-- 
2.16.1


* [PATCH bpf-next v5 2/6] ipv6: sr: export function lookup_nexthop
From: Mathieu Xhonneux @ 2018-05-12 17:25 UTC (permalink / raw)
  To: netdev; +Cc: dlebrun, alexei.starovoitov
In-Reply-To: <cover.1526143526.git.m.xhonneux@gmail.com>

The function lookup_nexthop is essential to implement most of the seg6local
actions. As we want to provide a BPF helper that can apply some of these
actions to the packet being processed, the helper should be able to call
this function, hence the need to make it public (as seg6_lookup_nexthop).

Moreover, if an argument is incorrect or if the next hop cannot be found,
an error should be returned by the BPF helper so the BPF program can adapt
its processing of the packet (return an error, properly force the drop,
...). This patch hence makes this function return dst->error to indicate a
possible error.

Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
Acked-by: David Lebrun <dlebrun@google.com>
---
 include/net/seg6.h       |  3 ++-
 include/net/seg6_local.h | 24 ++++++++++++++++++++++++
 net/ipv6/seg6_local.c    | 20 +++++++++++---------
 3 files changed, 37 insertions(+), 10 deletions(-)
 create mode 100644 include/net/seg6_local.h

diff --git a/include/net/seg6.h b/include/net/seg6.h
index 70b4cfac52d7..e029e301faa5 100644
--- a/include/net/seg6.h
+++ b/include/net/seg6.h
@@ -67,5 +67,6 @@ extern bool seg6_validate_srh(struct ipv6_sr_hdr *srh, int len);
 extern int seg6_do_srh_encap(struct sk_buff *skb, struct ipv6_sr_hdr *osrh,
 			     int proto);
 extern int seg6_do_srh_inline(struct sk_buff *skb, struct ipv6_sr_hdr *osrh);
-
+extern int seg6_lookup_nexthop(struct sk_buff *skb, struct in6_addr *nhaddr,
+			       u32 tbl_id);
 #endif
diff --git a/include/net/seg6_local.h b/include/net/seg6_local.h
new file mode 100644
index 000000000000..57498b23085d
--- /dev/null
+++ b/include/net/seg6_local.h
@@ -0,0 +1,24 @@
+/*
+ *  SR-IPv6 implementation
+ *
+ *  Authors:
+ *  David Lebrun <david.lebrun@uclouvain.be>
+ *  eBPF support: Mathieu Xhonneux <m.xhonneux@gmail.com>
+ *
+ *
+ *  This program is free software; you can redistribute it and/or
+ *      modify it under the terms of the GNU General Public License
+ *      as published by the Free Software Foundation; either version
+ *      2 of the License, or (at your option) any later version.
+ */
+
+#ifndef _NET_SEG6_LOCAL_H
+#define _NET_SEG6_LOCAL_H
+
+#include <linux/net.h>
+#include <linux/ipv6.h>
+
+extern int seg6_lookup_nexthop(struct sk_buff *skb, struct in6_addr *nhaddr,
+			       u32 tbl_id);
+
+#endif
diff --git a/net/ipv6/seg6_local.c b/net/ipv6/seg6_local.c
index 45722327375a..e9b23fb924ad 100644
--- a/net/ipv6/seg6_local.c
+++ b/net/ipv6/seg6_local.c
@@ -30,6 +30,7 @@
 #ifdef CONFIG_IPV6_SEG6_HMAC
 #include <net/seg6_hmac.h>
 #endif
+#include <net/seg6_local.h>
 #include <linux/etherdevice.h>
 
 struct seg6_local_lwt;
@@ -140,8 +141,8 @@ static void advance_nextseg(struct ipv6_sr_hdr *srh, struct in6_addr *daddr)
 	*daddr = *addr;
 }
 
-static void lookup_nexthop(struct sk_buff *skb, struct in6_addr *nhaddr,
-			   u32 tbl_id)
+int seg6_lookup_nexthop(struct sk_buff *skb, struct in6_addr *nhaddr,
+			u32 tbl_id)
 {
 	struct net *net = dev_net(skb->dev);
 	struct ipv6hdr *hdr = ipv6_hdr(skb);
@@ -187,6 +188,7 @@ static void lookup_nexthop(struct sk_buff *skb, struct in6_addr *nhaddr,
 
 	skb_dst_drop(skb);
 	skb_dst_set(skb, dst);
+	return dst->error;
 }
 
 /* regular endpoint function */
@@ -200,7 +202,7 @@ static int input_action_end(struct sk_buff *skb, struct seg6_local_lwt *slwt)
 
 	advance_nextseg(srh, &ipv6_hdr(skb)->daddr);
 
-	lookup_nexthop(skb, NULL, 0);
+	seg6_lookup_nexthop(skb, NULL, 0);
 
 	return dst_input(skb);
 
@@ -220,7 +222,7 @@ static int input_action_end_x(struct sk_buff *skb, struct seg6_local_lwt *slwt)
 
 	advance_nextseg(srh, &ipv6_hdr(skb)->daddr);
 
-	lookup_nexthop(skb, &slwt->nh6, 0);
+	seg6_lookup_nexthop(skb, &slwt->nh6, 0);
 
 	return dst_input(skb);
 
@@ -239,7 +241,7 @@ static int input_action_end_t(struct sk_buff *skb, struct seg6_local_lwt *slwt)
 
 	advance_nextseg(srh, &ipv6_hdr(skb)->daddr);
 
-	lookup_nexthop(skb, NULL, slwt->table);
+	seg6_lookup_nexthop(skb, NULL, slwt->table);
 
 	return dst_input(skb);
 
@@ -331,7 +333,7 @@ static int input_action_end_dx6(struct sk_buff *skb,
 	if (!ipv6_addr_any(&slwt->nh6))
 		nhaddr = &slwt->nh6;
 
-	lookup_nexthop(skb, nhaddr, 0);
+	seg6_lookup_nexthop(skb, nhaddr, 0);
 
 	return dst_input(skb);
 drop:
@@ -380,7 +382,7 @@ static int input_action_end_dt6(struct sk_buff *skb,
 	if (!pskb_may_pull(skb, sizeof(struct ipv6hdr)))
 		goto drop;
 
-	lookup_nexthop(skb, NULL, slwt->table);
+	seg6_lookup_nexthop(skb, NULL, slwt->table);
 
 	return dst_input(skb);
 
@@ -406,7 +408,7 @@ static int input_action_end_b6(struct sk_buff *skb, struct seg6_local_lwt *slwt)
 	ipv6_hdr(skb)->payload_len = htons(skb->len - sizeof(struct ipv6hdr));
 	skb_set_transport_header(skb, sizeof(struct ipv6hdr));
 
-	lookup_nexthop(skb, NULL, 0);
+	seg6_lookup_nexthop(skb, NULL, 0);
 
 	return dst_input(skb);
 
@@ -438,7 +440,7 @@ static int input_action_end_b6_encap(struct sk_buff *skb,
 	ipv6_hdr(skb)->payload_len = htons(skb->len - sizeof(struct ipv6hdr));
 	skb_set_transport_header(skb, sizeof(struct ipv6hdr));
 
-	lookup_nexthop(skb, NULL, 0);
+	seg6_lookup_nexthop(skb, NULL, 0);
 
 	return dst_input(skb);
 
-- 
2.16.1

* [PATCH bpf-next v5 3/6] bpf: Add IPv6 Segment Routing helpers
From: Mathieu Xhonneux @ 2018-05-12 17:25 UTC (permalink / raw)
  To: netdev; +Cc: dlebrun, alexei.starovoitov
In-Reply-To: <cover.1526143526.git.m.xhonneux@gmail.com>

The BPF seg6local hook should be powerful enough to let users implement
most of the use cases one could think of. After some thought, we
concluded that the following actions should be possible on an SRv6
packet, which requires three specific helpers:
    - bpf_lwt_seg6_store_bytes: Modify non-sensitive fields of the SRH
    - bpf_lwt_seg6_adjust_srh: Grow or shrink an SRH
                               (to add/delete TLVs)
    - bpf_lwt_seg6_action: Apply some SRv6 network programming actions
                           (specifically End.X, End.T, End.B6 and
                            End.B6.Encap)

The specifications of these helpers are provided in the patch (see
include/uapi/linux/bpf.h).

The non-sensitive fields of the SRH are the flags, the tag and the TLVs.
The other fields cannot be modified, to maintain the integrity of the
SRH. Flags, tag and TLVs can safely be modified because their validity
can be checked afterwards via seg6_validate_srh. Modifying the segments
directly is not allowed. To add segments to the path, stack a new SRH
using the End.B6 action via bpf_lwt_seg6_action.
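
To make the writable regions concrete, here is a minimal userspace sketch
of such an offset check. It is not the code from this patch: the
struct srh_fixed mirror and the srh_write_allowed() name are illustrative
assumptions, and offsets are taken relative to the start of the SRH.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace mirror of the 8-byte fixed part of the SRH. */
struct srh_fixed {
	uint8_t nexthdr;
	uint8_t hdrlen;        /* 8-byte units, not counting the first 8 bytes */
	uint8_t type;
	uint8_t segments_left;
	uint8_t first_segment; /* index of the last entry in the segment list */
	uint8_t flags;
	uint16_t tag;
};

/*
 * Return true if the write [off, off + len) stays entirely inside
 * flags/tag, or entirely inside the TLV area that follows the segments.
 */
static bool srh_write_allowed(const struct srh_fixed *srh,
			      size_t off, size_t len)
{
	size_t total = ((size_t)srh->hdrlen + 1) * 8;
	size_t tlv_start = sizeof(*srh) +
			   ((size_t)srh->first_segment + 1) * 16;
	size_t end = off + len;

	if (len == 0 || end < off)
		return false;

	/* flags (1 byte) and tag (2 bytes) close the fixed part */
	if (off >= offsetof(struct srh_fixed, flags) && end <= sizeof(*srh))
		return true;

	/* TLVs live between the last segment and the end of the SRH */
	return off >= tlv_start && end <= total;
}
```

With hdrlen = 5 and first_segment = 1 (two segments plus one 8-byte TLV),
a write at offset 5 or 40 is accepted, while a write at offset 8 (the
first segment) is rejected.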

Growing, shrinking or editing TLVs via the helpers will flag the SRH as
invalid, and it will have to be re-validated before re-entering the IPv6
layer. This flag is stored in a per-CPU buffer, along with the current
header length in bytes.

Storing the SRH length in bytes in the per-CPU buffer is mandatory when
using bpf_lwt_seg6_adjust_srh. The Header Ext. Length field expresses the
SRH length in 8-byte units (a padding TLV can be inserted to reach the
8-byte boundary). While a BPF program adds or deletes TLVs, the SRH may
temporarily be in an invalid state where its length is not a multiple of
8 bytes, hence the need to store the length in bytes separately. The
caller of the BPF program can then use this value to check that the
final length of the SRH is valid. Again, a final SRH modified by a BPF
program that does not respect the 8-byte boundary is considered invalid
and discarded.
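
As a rough illustration of the length bookkeeping described above, the
conversion from a byte count back to the Header Ext. Length field could
look like the sketch below. The function name srh_bytes_to_hdrlen() is
hypothetical; the byte count excludes the first 8 bytes of the SRH, as
the field itself does.

```c
#include <stdint.h>

/*
 * Convert an SRH length in bytes (first 8 bytes excluded) into the value
 * of the Header Ext. Length field. Returns 0 on success, or -1 if the
 * length is not a multiple of 8 or does not fit in the 8-bit field.
 */
static int srh_bytes_to_hdrlen(uint16_t len_bytes, uint8_t *hdrlen)
{
	if (len_bytes & 7)
		return -1;	/* not on an 8-byte boundary */
	if ((len_bytes >> 3) > UINT8_MAX)
		return -1;	/* would overflow the field */

	*hdrlen = (uint8_t)(len_bytes >> 3);
	return 0;
}
```

For example, growing a 40-byte SRH body by 4 bytes yields 44, which is
rejected; adding 4 more bytes of padding yields 48, which maps to a
field value of 6.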

Finally, a fourth helper is provided, bpf_lwt_push_encap, which is
available from the LWT BPF IN hook, but not from the seg6local BPF one.
This helper adds a Segment Routing Header to a non-SRv6 packet, either
with a new outer IPv6 header or by inlining it directly in the existing
IPv6 header. It is required to dynamically encapsulate non-SRv6 packets
with an SRH, as the BPF seg6local hook only works on traffic that
already contains an SRH. This is the BPF equivalent of the seg6 LWT
infrastructure, which achieves the same purpose but with a static SRH
per route.
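
For readers wondering what the *hdr* buffer passed to bpf_lwt_push_encap
might contain, the following userspace sketch lays out an SRH with n
segments and no TLVs. The build_srh() helper and the macro names are
assumptions made for illustration; whether a given buffer is accepted is
still decided by seg6_validate_srh in the kernel.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SRH_FIXED_LEN    8	/* fixed part of the SRH, in bytes */
#define SEG_LEN          16	/* one IPv6 address */
#define SRH_ROUTING_TYPE 4	/* routing header type used by SRv6 */

/*
 * Fill buf with an SRH carrying n segments (given in on-wire order, as a
 * flat array of n * 16 bytes) and no TLVs. Returns the SRH length in
 * bytes, or 0 if n is out of range or buf is too small.
 */
static size_t build_srh(uint8_t *buf, size_t buflen,
			const uint8_t *segs, unsigned int n)
{
	size_t total = SRH_FIXED_LEN + (size_t)n * SEG_LEN;

	if (n == 0 || n > 127 || total > buflen)
		return 0;

	memset(buf, 0, SRH_FIXED_LEN);	/* nexthdr, flags and tag left at 0 */
	buf[1] = (uint8_t)(n * 2);	/* hdrlen: 2 units of 8 bytes per segment */
	buf[2] = SRH_ROUTING_TYPE;
	buf[3] = (uint8_t)(n - 1);	/* segments_left */
	buf[4] = (uint8_t)(n - 1);	/* first_segment */
	memcpy(buf + SRH_FIXED_LEN, segs, (size_t)n * SEG_LEN);

	return total;
}
```

With two segments the resulting SRH is 40 bytes long and its length
field is 4.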

These helpers require CONFIG_IPV6=y (and not =m).

Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
Acked-by: David Lebrun <dlebrun@google.com>
---
 include/net/seg6_local.h |   8 ++
 include/uapi/linux/bpf.h |  97 +++++++++++++++-
 net/core/filter.c        | 282 +++++++++++++++++++++++++++++++++++++++++++----
 net/ipv6/Kconfig         |   5 +
 net/ipv6/seg6_local.c    |   2 +
 5 files changed, 369 insertions(+), 25 deletions(-)

diff --git a/include/net/seg6_local.h b/include/net/seg6_local.h
index 57498b23085d..661fd5b4d3e0 100644
--- a/include/net/seg6_local.h
+++ b/include/net/seg6_local.h
@@ -15,10 +15,18 @@
 #ifndef _NET_SEG6_LOCAL_H
 #define _NET_SEG6_LOCAL_H
 
+#include <linux/percpu.h>
 #include <linux/net.h>
 #include <linux/ipv6.h>
 
 extern int seg6_lookup_nexthop(struct sk_buff *skb, struct in6_addr *nhaddr,
 			       u32 tbl_id);
 
+struct seg6_bpf_srh_state {
+	bool valid;
+	u16 hdrlen;
+};
+
+DECLARE_PER_CPU(struct seg6_bpf_srh_state, seg6_bpf_srh_states);
+
 #endif
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 02e4112510f8..0349c91329fd 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1828,7 +1828,6 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- *
  * int bpf_fib_lookup(void *ctx, struct bpf_fib_lookup *params, int plen, u32 flags)
  *	Description
  *		Do FIB lookup in kernel tables using parameters in *params*.
@@ -1855,6 +1854,90 @@ union bpf_attr {
  *             Egress device index on success, 0 if packet needs to continue
  *             up the stack for further processing or a negative error in case
  *             of failure.
+ *
+ * int bpf_lwt_push_encap(struct sk_buff *skb, u32 type, void *hdr, u32 len)
+ *	Description
+ *		Encapsulate the packet associated to *skb* within a Layer 3
+ *		protocol header. This header is provided in the buffer at
+ *		address *hdr*, with *len* its size in bytes. *type* indicates
+ *		the protocol of the header and can be one of:
+ *
+ *		**BPF_LWT_ENCAP_SEG6**
+ *			IPv6 encapsulation with Segment Routing Header
+ *			(**struct ipv6_sr_hdr**). *hdr* only contains the SRH,
+ *			the IPv6 header is computed by the kernel.
+ *		**BPF_LWT_ENCAP_SEG6_INLINE**
+ *			Only works if *skb* contains an IPv6 packet. Insert a
+ *			Segment Routing Header (**struct ipv6_sr_hdr**) inside
+ *			the IPv6 header.
+ *
+ * 		A call to this helper is susceptible to change the underlying
+ * 		packet buffer. Therefore, at load time, all checks on pointers
+ * 		previously done by the verifier are invalidated and must be
+ * 		performed again, if the helper is used in combination with
+ * 		direct packet access.
+ *	Return
+ * 		0 on success, or a negative error in case of failure.
+ *
+ * int bpf_lwt_seg6_store_bytes(struct sk_buff *skb, u32 offset, const void *from, u32 len)
+ *	Description
+ *		Store *len* bytes from address *from* into the packet
+ *		associated to *skb*, at *offset*. Only the flags, tag and TLVs
+ *		inside the outermost IPv6 Segment Routing Header can be
+ *		modified through this helper.
+ *
+ * 		A call to this helper is susceptible to change the underlying
+ * 		packet buffer. Therefore, at load time, all checks on pointers
+ * 		previously done by the verifier are invalidated and must be
+ * 		performed again, if the helper is used in combination with
+ * 		direct packet access.
+ *	Return
+ * 		0 on success, or a negative error in case of failure.
+ *
+ * int bpf_lwt_seg6_adjust_srh(struct sk_buff *skb, u32 offset, s32 delta)
+ *	Description
+ *		Adjust the size allocated to TLVs in the outermost IPv6
+ *		Segment Routing Header contained in the packet associated to
+ *		*skb*, at position *offset* by *delta* bytes. Only offsets
+ *		after the segments are accepted. *delta* can be either
+ *		positive (growing) or negative (shrinking).
+ *
+ * 		A call to this helper is susceptible to change the underlying
+ * 		packet buffer. Therefore, at load time, all checks on pointers
+ * 		previously done by the verifier are invalidated and must be
+ * 		performed again, if the helper is used in combination with
+ * 		direct packet access.
+ *	Return
+ * 		0 on success, or a negative error in case of failure.
+ *
+ * int bpf_lwt_seg6_action(struct sk_buff *skb, u32 action, void *param, u32 param_len)
+ *	Description
+ *		Apply an IPv6 Segment Routing action of type *action* to the
+ *		packet associated to *skb*. Each action takes a parameter
+ *		contained at address *param*, and of length *param_len* bytes.
+ *		*action* can be one of:
+ *
+ *		**SEG6_LOCAL_ACTION_END_X**
+ *			End.X action: Endpoint with Layer-3 cross-connect.
+ *			Type of *param*: **struct in6_addr**.
+ *		**SEG6_LOCAL_ACTION_END_T**
+ *			End.T action: Endpoint with specific IPv6 table lookup.
+ *			Type of *param*: **int**.
+ *		**SEG6_LOCAL_ACTION_END_B6**
+ *			End.B6 action: Endpoint bound to an SRv6 policy.
+ *			Type of param: **struct ipv6_sr_hdr**.
+ *		**SEG6_LOCAL_ACTION_END_B6_ENCAP**
+ *			End.B6.Encap action: Endpoint bound to an SRv6
+ *			encapsulation policy.
+ *			Type of param: **struct ipv6_sr_hdr**.
+ *
+ * 		A call to this helper is susceptible to change the underlying
+ * 		packet buffer. Therefore, at load time, all checks on pointers
+ * 		previously done by the verifier are invalidated and must be
+ * 		performed again, if the helper is used in combination with
+ * 		direct packet access.
+ *	Return
+ * 		0 on success, or a negative error in case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -1926,7 +2009,11 @@ union bpf_attr {
 	FN(skb_get_xfrm_state),		\
 	FN(get_stack),			\
 	FN(skb_load_bytes_relative),	\
-	FN(fib_lookup),
+	FN(fib_lookup),			\
+	FN(lwt_push_encap),		\
+	FN(lwt_seg6_store_bytes),	\
+	FN(lwt_seg6_adjust_srh),	\
+	FN(lwt_seg6_action),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -1993,6 +2080,12 @@ enum bpf_hdr_start_off {
 	BPF_HDR_START_NET,
 };
 
+/* Encapsulation type for BPF_FUNC_lwt_push_encap helper. */
+enum bpf_lwt_encap_mode {
+	BPF_LWT_ENCAP_SEG6,
+	BPF_LWT_ENCAP_SEG6_INLINE
+};
+
 /* user accessible mirror of in-kernel sk_buff.
  * new fields can only be added to the end of this structure
  */
diff --git a/net/core/filter.c b/net/core/filter.c
index ca60d2872da4..67b4ab4ec404 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -64,6 +64,10 @@
 #include <net/ip_fib.h>
 #include <net/flow.h>
 #include <net/arp.h>
+#include <net/ipv6.h>
+#include <linux/seg6_local.h>
+#include <net/seg6.h>
+#include <net/seg6_local.h>
 
 /**
  *	sk_filter_trim_cap - run a packet through a socket filter
@@ -3326,28 +3330,6 @@ static const struct bpf_func_proto bpf_xdp_redirect_map_proto = {
 	.arg3_type      = ARG_ANYTHING,
 };
 
-bool bpf_helper_changes_pkt_data(void *func)
-{
-	if (func == bpf_skb_vlan_push ||
-	    func == bpf_skb_vlan_pop ||
-	    func == bpf_skb_store_bytes ||
-	    func == bpf_skb_change_proto ||
-	    func == bpf_skb_change_head ||
-	    func == bpf_skb_change_tail ||
-	    func == bpf_skb_adjust_room ||
-	    func == bpf_skb_pull_data ||
-	    func == bpf_clone_redirect ||
-	    func == bpf_l3_csum_replace ||
-	    func == bpf_l4_csum_replace ||
-	    func == bpf_xdp_adjust_head ||
-	    func == bpf_xdp_adjust_meta ||
-	    func == bpf_msg_pull_data ||
-	    func == bpf_xdp_adjust_tail)
-		return true;
-
-	return false;
-}
-
 static unsigned long bpf_skb_copy(void *dst_buff, const void *skb,
 				  unsigned long off, unsigned long len)
 {
@@ -4295,6 +4277,261 @@ static const struct bpf_func_proto bpf_skb_fib_lookup_proto = {
 	.arg4_type	= ARG_ANYTHING,
 };
 
+#if IS_ENABLED(CONFIG_IPV6_SEG6_BPF)
+static int bpf_push_seg6_encap(struct sk_buff *skb, u32 type, void *hdr, u32 len)
+{
+	int err;
+	struct ipv6_sr_hdr *srh = (struct ipv6_sr_hdr *)hdr;
+
+	if (!seg6_validate_srh(srh, len))
+		return -EINVAL;
+
+	switch (type) {
+	case BPF_LWT_ENCAP_SEG6_INLINE:
+		if (skb->protocol != htons(ETH_P_IPV6))
+			return -EBADMSG;
+
+		err = seg6_do_srh_inline(skb, srh);
+		break;
+	case BPF_LWT_ENCAP_SEG6:
+		skb_reset_inner_headers(skb);
+		skb->encapsulation = 1;
+		err = seg6_do_srh_encap(skb, srh, IPPROTO_IPV6);
+		break;
+	default:
+		return -EINVAL;
+	}
+	if (err)
+		return err;
+
+	ipv6_hdr(skb)->payload_len = htons(skb->len - sizeof(struct ipv6hdr));
+	skb_set_transport_header(skb, sizeof(struct ipv6hdr));
+
+	return seg6_lookup_nexthop(skb, NULL, 0);
+}
+#endif /* CONFIG_IPV6_SEG6_BPF */
+
+BPF_CALL_4(bpf_lwt_push_encap, struct sk_buff *, skb, u32, type, void *, hdr,
+	   u32, len)
+{
+	switch (type) {
+#if IS_ENABLED(CONFIG_IPV6_SEG6_BPF)
+	case BPF_LWT_ENCAP_SEG6:
+	case BPF_LWT_ENCAP_SEG6_INLINE:
+		return bpf_push_seg6_encap(skb, type, hdr, len);
+#endif
+	default:
+		return -EINVAL;
+	}
+}
+
+static const struct bpf_func_proto bpf_lwt_push_encap_proto = {
+	.func		= bpf_lwt_push_encap,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_ANYTHING,
+	.arg3_type	= ARG_PTR_TO_MEM,
+	.arg4_type	= ARG_CONST_SIZE
+};
+
+BPF_CALL_4(bpf_lwt_seg6_store_bytes, struct sk_buff *, skb, u32, offset,
+	   const void *, from, u32, len)
+{
+#if IS_ENABLED(CONFIG_IPV6_SEG6_BPF)
+	struct seg6_bpf_srh_state *srh_state =
+		this_cpu_ptr(&seg6_bpf_srh_states);
+	void *srh_tlvs, *srh_end, *ptr;
+	struct ipv6_sr_hdr *srh;
+	int srhoff = 0;
+
+	if (ipv6_find_hdr(skb, &srhoff, IPPROTO_ROUTING, NULL, NULL) < 0)
+		return -EINVAL;
+
+	srh = (struct ipv6_sr_hdr *)(skb->data + srhoff);
+	srh_tlvs = (void *)((char *)srh + ((srh->first_segment + 1) << 4));
+	srh_end = (void *)((char *)srh + sizeof(*srh) + srh_state->hdrlen);
+
+	ptr = skb->data + offset;
+	if (ptr >= srh_tlvs && ptr + len <= srh_end)
+		srh_state->valid = 0;
+	else if (ptr < (void *)&srh->flags ||
+		 ptr + len > (void *)&srh->segments)
+		return -EFAULT;
+
+	if (unlikely(bpf_try_make_writable(skb, offset + len)))
+		return -EFAULT;
+
+	memcpy(ptr, from, len);
+	return 0;
+#else /* CONFIG_IPV6_SEG6_BPF */
+	return -EOPNOTSUPP;
+#endif
+}
+
+static const struct bpf_func_proto bpf_lwt_seg6_store_bytes_proto = {
+	.func		= bpf_lwt_seg6_store_bytes,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_ANYTHING,
+	.arg3_type	= ARG_PTR_TO_MEM,
+	.arg4_type	= ARG_CONST_SIZE
+};
+
+BPF_CALL_4(bpf_lwt_seg6_action, struct sk_buff *, skb,
+	   u32, action, void *, param, u32, param_len)
+{
+#if IS_ENABLED(CONFIG_IPV6_SEG6_BPF)
+	struct seg6_bpf_srh_state *srh_state =
+		this_cpu_ptr(&seg6_bpf_srh_states);
+	struct ipv6_sr_hdr *srh;
+	int srhoff = 0;
+	int err;
+
+	if (ipv6_find_hdr(skb, &srhoff, IPPROTO_ROUTING, NULL, NULL) < 0)
+		return -EINVAL;
+	srh = (struct ipv6_sr_hdr *)(skb->data + srhoff);
+
+	if (!srh_state->valid) {
+		if (unlikely((srh_state->hdrlen & 7) != 0))
+			return -EBADMSG;
+
+		srh->hdrlen = (u8)(srh_state->hdrlen >> 3);
+		if (unlikely(!seg6_validate_srh(srh, (srh->hdrlen + 1) << 3)))
+			return -EBADMSG;
+
+		srh_state->valid = 1;
+	}
+
+	switch (action) {
+	case SEG6_LOCAL_ACTION_END_X:
+		if (param_len != sizeof(struct in6_addr))
+			return -EINVAL;
+		return seg6_lookup_nexthop(skb, (struct in6_addr *)param, 0);
+	case SEG6_LOCAL_ACTION_END_T:
+		if (param_len != sizeof(int))
+			return -EINVAL;
+		return seg6_lookup_nexthop(skb, NULL, *(int *)param);
+	case SEG6_LOCAL_ACTION_END_B6:
+		err = bpf_push_seg6_encap(skb, BPF_LWT_ENCAP_SEG6_INLINE,
+					  param, param_len);
+		if (!err)
+			srh_state->hdrlen =
+				((struct ipv6_sr_hdr *)param)->hdrlen << 3;
+		return err;
+	case SEG6_LOCAL_ACTION_END_B6_ENCAP:
+		err = bpf_push_seg6_encap(skb, BPF_LWT_ENCAP_SEG6,
+					  param, param_len);
+		if (!err)
+			srh_state->hdrlen =
+				((struct ipv6_sr_hdr *)param)->hdrlen << 3;
+		return err;
+	default:
+		return -EINVAL;
+	}
+#else /* CONFIG_IPV6_SEG6_BPF */
+	return -EOPNOTSUPP;
+#endif
+}
+
+static const struct bpf_func_proto bpf_lwt_seg6_action_proto = {
+	.func		= bpf_lwt_seg6_action,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_ANYTHING,
+	.arg3_type	= ARG_PTR_TO_MEM,
+	.arg4_type	= ARG_CONST_SIZE
+};
+
+BPF_CALL_3(bpf_lwt_seg6_adjust_srh, struct sk_buff *, skb, u32, offset,
+	   s32, len)
+{
+#if IS_ENABLED(CONFIG_IPV6_SEG6_BPF)
+	struct seg6_bpf_srh_state *srh_state =
+		this_cpu_ptr(&seg6_bpf_srh_states);
+	void *srh_end, *srh_tlvs, *ptr;
+	struct ipv6_sr_hdr *srh;
+	struct ipv6hdr *hdr;
+	int srhoff = 0;
+	int ret;
+
+	if (ipv6_find_hdr(skb, &srhoff, IPPROTO_ROUTING, NULL, NULL) < 0)
+		return -EINVAL;
+	srh = (struct ipv6_sr_hdr *)(skb->data + srhoff);
+
+	srh_tlvs = (void *)((unsigned char *)srh + sizeof(*srh) +
+			((srh->first_segment + 1) << 4));
+	srh_end = (void *)((unsigned char *)srh + sizeof(*srh) +
+			srh_state->hdrlen);
+	ptr = skb->data + offset;
+
+	if (unlikely(ptr < srh_tlvs || ptr > srh_end))
+		return -EFAULT;
+	if (unlikely(len < 0 && (void *)((char *)ptr - len) > srh_end))
+		return -EFAULT;
+
+	if (len > 0) {
+		ret = skb_cow_head(skb, len);
+		if (unlikely(ret < 0))
+			return ret;
+
+		ret = bpf_skb_net_hdr_push(skb, offset, len);
+	} else {
+		ret = bpf_skb_net_hdr_pop(skb, offset, -1 * len);
+	}
+	if (unlikely(ret < 0))
+		return ret;
+
+	hdr = (struct ipv6hdr *)skb->data;
+	hdr->payload_len = htons(skb->len - sizeof(struct ipv6hdr));
+
+	bpf_compute_data_pointers(skb);
+	srh_state->hdrlen += len;
+	srh_state->valid = 0;
+	return 0;
+#else /* CONFIG_IPV6_SEG6_BPF */
+	return -EOPNOTSUPP;
+#endif
+}
+
+static const struct bpf_func_proto bpf_lwt_seg6_adjust_srh_proto = {
+	.func		= bpf_lwt_seg6_adjust_srh,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_ANYTHING,
+	.arg3_type	= ARG_ANYTHING,
+};
+
+bool bpf_helper_changes_pkt_data(void *func)
+{
+	if (func == bpf_skb_vlan_push ||
+	    func == bpf_skb_vlan_pop ||
+	    func == bpf_skb_store_bytes ||
+	    func == bpf_skb_change_proto ||
+	    func == bpf_skb_change_head ||
+	    func == bpf_skb_change_tail ||
+	    func == bpf_skb_adjust_room ||
+	    func == bpf_skb_pull_data ||
+	    func == bpf_clone_redirect ||
+	    func == bpf_l3_csum_replace ||
+	    func == bpf_l4_csum_replace ||
+	    func == bpf_xdp_adjust_head ||
+	    func == bpf_xdp_adjust_meta ||
+	    func == bpf_msg_pull_data ||
+	    func == bpf_xdp_adjust_tail ||
+	    func == bpf_lwt_push_encap ||
+	    func == bpf_lwt_seg6_store_bytes ||
+	    func == bpf_lwt_seg6_adjust_srh ||
+	    func == bpf_lwt_seg6_action
+	    )
+		return true;
+
+	return false;
+}
+
 static const struct bpf_func_proto *
 bpf_base_func_proto(enum bpf_func_id func_id)
 {
@@ -4703,7 +4940,6 @@ static bool lwt_is_valid_access(int off, int size,
 	return bpf_skb_is_valid_access(off, size, type, prog, info);
 }
 
-
 /* Attach type specific accesses */
 static bool __sock_filter_check_attach_type(int off,
 					    enum bpf_access_type access_type,
diff --git a/net/ipv6/Kconfig b/net/ipv6/Kconfig
index 6794ddf0547c..f0e8a762ae0c 100644
--- a/net/ipv6/Kconfig
+++ b/net/ipv6/Kconfig
@@ -330,4 +330,9 @@ config IPV6_SEG6_HMAC
 
 	  If unsure, say N.
 
+config IPV6_SEG6_BPF
+	def_bool y
+	depends on IPV6_SEG6_LWTUNNEL
+	depends on IPV6 = y
+
 endif # IPV6
diff --git a/net/ipv6/seg6_local.c b/net/ipv6/seg6_local.c
index e9b23fb924ad..ae68c1ef8fb0 100644
--- a/net/ipv6/seg6_local.c
+++ b/net/ipv6/seg6_local.c
@@ -449,6 +449,8 @@ static int input_action_end_b6_encap(struct sk_buff *skb,
 	return err;
 }
 
+DEFINE_PER_CPU(struct seg6_bpf_srh_state, seg6_bpf_srh_states);
+
 static struct seg6_action_desc seg6_action_table[] = {
 	{
 		.action		= SEG6_LOCAL_ACTION_END,
-- 
2.16.1

* [PATCH bpf-next v5 4/6] bpf: Split lwt inout verifier structures
From: Mathieu Xhonneux @ 2018-05-12 17:25 UTC (permalink / raw)
  To: netdev; +Cc: dlebrun, alexei.starovoitov
In-Reply-To: <cover.1526143526.git.m.xhonneux@gmail.com>

The new bpf_lwt_push_encap helper should only be accessible within the
LWT BPF IN hook, and not the OUT one, as calling it there may lead to an
skb_under_panic.

At the moment, both LWT BPF IN and OUT share the same list of helpers,
whose calls are authorized by the verifier. This patch separates the
verifier ops for the IN and OUT hooks, and allows the IN hook to call the
bpf_lwt_push_encap helper.

This patch also takes the opportunity to group all lwt_*_func_proto
functions together for clarity. At the moment, sock_ops_func_proto sits
between lwt_inout_func_proto and lwt_xmit_func_proto.

Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
Acked-by: David Lebrun <dlebrun@google.com>
---
 include/linux/bpf_types.h |  4 +--
 net/core/filter.c         | 83 +++++++++++++++++++++++++++++------------------
 2 files changed, 54 insertions(+), 33 deletions(-)

diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index d7df1b323082..cc9d7e031330 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -9,8 +9,8 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_XDP, xdp)
 BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SKB, cg_skb)
 BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SOCK, cg_sock)
 BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SOCK_ADDR, cg_sock_addr)
-BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_IN, lwt_inout)
-BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_OUT, lwt_inout)
+BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_IN, lwt_in)
+BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_OUT, lwt_out)
 BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_XMIT, lwt_xmit)
 BPF_PROG_TYPE(BPF_PROG_TYPE_SOCK_OPS, sock_ops)
 BPF_PROG_TYPE(BPF_PROG_TYPE_SK_SKB, sk_skb)
diff --git a/net/core/filter.c b/net/core/filter.c
index 67b4ab4ec404..71434204b037 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4715,33 +4715,6 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	}
 }
 
-static const struct bpf_func_proto *
-lwt_inout_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
-{
-	switch (func_id) {
-	case BPF_FUNC_skb_load_bytes:
-		return &bpf_skb_load_bytes_proto;
-	case BPF_FUNC_skb_pull_data:
-		return &bpf_skb_pull_data_proto;
-	case BPF_FUNC_csum_diff:
-		return &bpf_csum_diff_proto;
-	case BPF_FUNC_get_cgroup_classid:
-		return &bpf_get_cgroup_classid_proto;
-	case BPF_FUNC_get_route_realm:
-		return &bpf_get_route_realm_proto;
-	case BPF_FUNC_get_hash_recalc:
-		return &bpf_get_hash_recalc_proto;
-	case BPF_FUNC_perf_event_output:
-		return &bpf_skb_event_output_proto;
-	case BPF_FUNC_get_smp_processor_id:
-		return &bpf_get_smp_processor_id_proto;
-	case BPF_FUNC_skb_under_cgroup:
-		return &bpf_skb_under_cgroup_proto;
-	default:
-		return bpf_base_func_proto(func_id);
-	}
-}
-
 static const struct bpf_func_proto *
 sock_ops_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
@@ -4801,6 +4774,44 @@ sk_skb_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	}
 }
 
+static const struct bpf_func_proto *
+lwt_out_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+	switch (func_id) {
+	case BPF_FUNC_skb_load_bytes:
+		return &bpf_skb_load_bytes_proto;
+	case BPF_FUNC_skb_pull_data:
+		return &bpf_skb_pull_data_proto;
+	case BPF_FUNC_csum_diff:
+		return &bpf_csum_diff_proto;
+	case BPF_FUNC_get_cgroup_classid:
+		return &bpf_get_cgroup_classid_proto;
+	case BPF_FUNC_get_route_realm:
+		return &bpf_get_route_realm_proto;
+	case BPF_FUNC_get_hash_recalc:
+		return &bpf_get_hash_recalc_proto;
+	case BPF_FUNC_perf_event_output:
+		return &bpf_skb_event_output_proto;
+	case BPF_FUNC_get_smp_processor_id:
+		return &bpf_get_smp_processor_id_proto;
+	case BPF_FUNC_skb_under_cgroup:
+		return &bpf_skb_under_cgroup_proto;
+	default:
+		return bpf_base_func_proto(func_id);
+	}
+}
+
+static const struct bpf_func_proto *
+lwt_in_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+	switch (func_id) {
+	case BPF_FUNC_lwt_push_encap:
+		return &bpf_lwt_push_encap_proto;
+	default:
+		return lwt_out_func_proto(func_id, prog);
+	}
+}
+
 static const struct bpf_func_proto *
 lwt_xmit_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
@@ -4832,7 +4843,7 @@ lwt_xmit_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	case BPF_FUNC_set_hash_invalid:
 		return &bpf_set_hash_invalid_proto;
 	default:
-		return lwt_inout_func_proto(func_id, prog);
+		return lwt_out_func_proto(func_id, prog);
 	}
 }
 
@@ -6405,13 +6416,23 @@ const struct bpf_prog_ops cg_skb_prog_ops = {
 	.test_run		= bpf_prog_test_run_skb,
 };
 
-const struct bpf_verifier_ops lwt_inout_verifier_ops = {
-	.get_func_proto		= lwt_inout_func_proto,
+const struct bpf_verifier_ops lwt_in_verifier_ops = {
+	.get_func_proto		= lwt_in_func_proto,
+	.is_valid_access	= lwt_is_valid_access,
+	.convert_ctx_access	= bpf_convert_ctx_access,
+};
+
+const struct bpf_prog_ops lwt_in_prog_ops = {
+	.test_run		= bpf_prog_test_run_skb,
+};
+
+const struct bpf_verifier_ops lwt_out_verifier_ops = {
+	.get_func_proto		= lwt_out_func_proto,
 	.is_valid_access	= lwt_is_valid_access,
 	.convert_ctx_access	= bpf_convert_ctx_access,
 };
 
-const struct bpf_prog_ops lwt_inout_prog_ops = {
+const struct bpf_prog_ops lwt_out_prog_ops = {
 	.test_run		= bpf_prog_test_run_skb,
 };
 
-- 
2.16.1

* [PATCH bpf-next v5 5/6] ipv6: sr: Add seg6local action End.BPF
From: Mathieu Xhonneux @ 2018-05-12 17:25 UTC (permalink / raw)
  To: netdev; +Cc: dlebrun, alexei.starovoitov
In-Reply-To: <cover.1526143526.git.m.xhonneux@gmail.com>

This patch adds the End.BPF action to the LWT seg6local infrastructure.
This action works like any other seg6local End action: the packet must
carry an IPv6 header with an SRH, and its DA must be equal to the SID of
the action. The action also advances the SRH to the next segment, so the
BPF program does not have to take care of this.

Since the BPF program must not be a source of instability in the kernel,
it is important to ensure that the integrity of the packet is maintained
before yielding it back to the IPv6 layer. The hook hence keeps track of
whether the SRH has been altered through the helpers, and re-validates
its content if needed with seg6_validate_srh. The state kept for
validation is stored in a per-CPU buffer. The BPF program is not allowed
to directly write into the packet, and only some fields of the SRH can
be altered through the helper bpf_lwt_seg6_store_bytes.

Performance profiling has shown that the SRH re-validation does not
induce a significant overhead. If the altered SRH is deemed invalid, the
packet is dropped.

This validation is also done before executing any action through
bpf_lwt_seg6_action, and will not be performed again if the SRH is not
modified after calling the action.

The BPF program may return one of three codes:
    - BPF_OK: the End.BPF action will look up the next destination through
          seg6_lookup_nexthop.
    - BPF_REDIRECT: if an action has been executed through the
          bpf_lwt_seg6_action helper, the BPF program should return this
          value, as the skb's destination is already set and the default
          lookup should not be performed.
    - BPF_DROP: the packet will be dropped.
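
The handling of these return codes can be modelled in userspace roughly
as follows. This is a simplified sketch, not the kernel code: the
MODEL_* constants are local stand-ins (their numeric values are my
assumption of what the uapi enum uses), and the single boolean collapses
the length and seg6_validate_srh checks into one input.

```c
#include <stdbool.h>

/* Local stand-ins for the BPF return codes; values are an assumption. */
enum model_ret {
	MODEL_BPF_OK = 0,
	MODEL_BPF_DROP = 2,
	MODEL_BPF_REDIRECT = 7,
};

enum verdict {
	VERDICT_DROP,
	VERDICT_LOOKUP_THEN_FORWARD,	/* default nexthop lookup, then forward */
	VERDICT_FORWARD_AS_IS,		/* destination already set by an action */
};

/*
 * Decide what happens to the packet after the BPF program returned ret.
 * srh_passes_validation is true if the SRH was left untouched or if it
 * passed re-validation after being modified.
 */
static enum verdict end_bpf_verdict(int ret, bool srh_passes_validation)
{
	switch (ret) {
	case MODEL_BPF_OK:
	case MODEL_BPF_REDIRECT:
		break;
	default:	/* explicit drop or an illegal return value */
		return VERDICT_DROP;
	}

	if (!srh_passes_validation)
		return VERDICT_DROP;

	return ret == MODEL_BPF_REDIRECT ? VERDICT_FORWARD_AS_IS
					 : VERDICT_LOOKUP_THEN_FORWARD;
}
```

Note that an invalid SRH leads to a drop even when the program asked for
a redirect.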

Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
Acked-by: David Lebrun <dlebrun@google.com>
---
 include/linux/bpf_types.h       |   1 +
 include/uapi/linux/bpf.h        |   1 +
 include/uapi/linux/seg6_local.h |   3 +
 kernel/bpf/verifier.c           |   1 +
 net/core/filter.c               |  25 +++++++
 net/ipv6/seg6_local.c           | 158 +++++++++++++++++++++++++++++++++++++++-
 tools/lib/bpf/libbpf.c          |   1 +
 7 files changed, 187 insertions(+), 3 deletions(-)

diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index cc9d7e031330..6a979f95f986 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -12,6 +12,7 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SOCK_ADDR, cg_sock_addr)
 BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_IN, lwt_in)
 BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_OUT, lwt_out)
 BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_XMIT, lwt_xmit)
+BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_SEG6LOCAL, lwt_seg6local)
 BPF_PROG_TYPE(BPF_PROG_TYPE_SOCK_OPS, sock_ops)
 BPF_PROG_TYPE(BPF_PROG_TYPE_SK_SKB, sk_skb)
 BPF_PROG_TYPE(BPF_PROG_TYPE_SK_MSG, sk_msg)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 0349c91329fd..c6a213075368 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -140,6 +140,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_SK_MSG,
 	BPF_PROG_TYPE_RAW_TRACEPOINT,
 	BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
+	BPF_PROG_TYPE_LWT_SEG6LOCAL,
 };
 
 enum bpf_attach_type {
diff --git a/include/uapi/linux/seg6_local.h b/include/uapi/linux/seg6_local.h
index ef2d8c3e76c1..aadcc11fb918 100644
--- a/include/uapi/linux/seg6_local.h
+++ b/include/uapi/linux/seg6_local.h
@@ -25,6 +25,7 @@ enum {
 	SEG6_LOCAL_NH6,
 	SEG6_LOCAL_IIF,
 	SEG6_LOCAL_OIF,
+	SEG6_LOCAL_BPF,
 	__SEG6_LOCAL_MAX,
 };
 #define SEG6_LOCAL_MAX (__SEG6_LOCAL_MAX - 1)
@@ -59,6 +60,8 @@ enum {
 	SEG6_LOCAL_ACTION_END_AS	= 13,
 	/* forward to SR-unaware VNF with masquerading */
 	SEG6_LOCAL_ACTION_END_AM	= 14,
+	/* custom BPF action */
+	SEG6_LOCAL_ACTION_END_BPF	= 15,
 
 	__SEG6_LOCAL_ACTION_MAX,
 };
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index d92d9c37affd..c6b5eadcad16 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1262,6 +1262,7 @@ static bool may_access_direct_pkt_data(struct bpf_verifier_env *env,
 	switch (env->prog->type) {
 	case BPF_PROG_TYPE_LWT_IN:
 	case BPF_PROG_TYPE_LWT_OUT:
+	case BPF_PROG_TYPE_LWT_SEG6LOCAL:
 		/* dst_input() and dst_output() can't write for now */
 		if (t == BPF_WRITE)
 			return false;
diff --git a/net/core/filter.c b/net/core/filter.c
index 71434204b037..d69771e56d1f 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4847,6 +4847,21 @@ lwt_xmit_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	}
 }
 
+static const struct bpf_func_proto *
+lwt_seg6local_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+	switch (func_id) {
+	case BPF_FUNC_lwt_seg6_store_bytes:
+		return &bpf_lwt_seg6_store_bytes_proto;
+	case BPF_FUNC_lwt_seg6_action:
+		return &bpf_lwt_seg6_action_proto;
+	case BPF_FUNC_lwt_seg6_adjust_srh:
+		return &bpf_lwt_seg6_adjust_srh_proto;
+	default:
+		return lwt_out_func_proto(func_id, prog);
+	}
+}
+
 static bool bpf_skb_is_valid_access(int off, int size, enum bpf_access_type type,
 				    const struct bpf_prog *prog,
 				    struct bpf_insn_access_aux *info)
@@ -6447,6 +6462,16 @@ const struct bpf_prog_ops lwt_xmit_prog_ops = {
 	.test_run		= bpf_prog_test_run_skb,
 };
 
+const struct bpf_verifier_ops lwt_seg6local_verifier_ops = {
+	.get_func_proto		= lwt_seg6local_func_proto,
+	.is_valid_access	= lwt_is_valid_access,
+	.convert_ctx_access	= bpf_convert_ctx_access,
+};
+
+const struct bpf_prog_ops lwt_seg6local_prog_ops = {
+	.test_run		= bpf_prog_test_run_skb,
+};
+
 const struct bpf_verifier_ops cg_sock_verifier_ops = {
 	.get_func_proto		= sock_filter_func_proto,
 	.is_valid_access	= sock_filter_is_valid_access,
diff --git a/net/ipv6/seg6_local.c b/net/ipv6/seg6_local.c
index ae68c1ef8fb0..2ac887da63e2 100644
--- a/net/ipv6/seg6_local.c
+++ b/net/ipv6/seg6_local.c
@@ -1,8 +1,9 @@
 /*
  *  SR-IPv6 implementation
  *
- *  Author:
+ *  Authors:
  *  David Lebrun <david.lebrun@uclouvain.be>
+ *  eBPF support: Mathieu Xhonneux <m.xhonneux@gmail.com>
  *
  *
  *  This program is free software; you can redistribute it and/or
@@ -32,6 +33,7 @@
 #endif
 #include <net/seg6_local.h>
 #include <linux/etherdevice.h>
+#include <linux/bpf.h>
 
 struct seg6_local_lwt;
 
@@ -42,6 +44,11 @@ struct seg6_action_desc {
 	int static_headroom;
 };
 
+struct bpf_lwt_prog {
+	struct bpf_prog *prog;
+	char *name;
+};
+
 struct seg6_local_lwt {
 	int action;
 	struct ipv6_sr_hdr *srh;
@@ -50,6 +57,7 @@ struct seg6_local_lwt {
 	struct in6_addr nh6;
 	int iif;
 	int oif;
+	struct bpf_lwt_prog bpf;
 
 	int headroom;
 	struct seg6_action_desc *desc;
@@ -451,6 +459,69 @@ static int input_action_end_b6_encap(struct sk_buff *skb,
 
 DEFINE_PER_CPU(struct seg6_bpf_srh_state, seg6_bpf_srh_states);
 
+static int input_action_end_bpf(struct sk_buff *skb,
+				struct seg6_local_lwt *slwt)
+{
+	struct seg6_bpf_srh_state *srh_state =
+		this_cpu_ptr(&seg6_bpf_srh_states);
+	struct seg6_bpf_srh_state local_srh_state;
+	struct ipv6_sr_hdr *srh;
+	int srhoff = 0;
+	int ret;
+
+	srh = get_and_validate_srh(skb);
+	if (!srh)
+		goto drop;
+	advance_nextseg(srh, &ipv6_hdr(skb)->daddr);
+
+	/* preempt_disable is needed to protect the per-CPU buffer srh_state,
+	 * which is also accessed by the bpf_lwt_seg6_* helpers
+	 */
+	preempt_disable();
+	srh_state->hdrlen = srh->hdrlen << 3;
+	srh_state->valid = 1;
+
+	rcu_read_lock();
+	bpf_compute_data_pointers(skb);
+	ret = bpf_prog_run_save_cb(slwt->bpf.prog, skb);
+	rcu_read_unlock();
+
+	local_srh_state = *srh_state;
+	preempt_enable();
+
+	switch (ret) {
+	case BPF_OK:
+	case BPF_REDIRECT:
+		break;
+	case BPF_DROP:
+		goto drop;
+	default:
+		pr_warn_once("bpf-seg6local: Illegal return value %u\n", ret);
+		goto drop;
+	}
+
+	if (unlikely((local_srh_state.hdrlen & 7) != 0))
+		goto drop;
+
+	if (ipv6_find_hdr(skb, &srhoff, IPPROTO_ROUTING, NULL, NULL) < 0)
+		goto drop;
+	srh = (struct ipv6_sr_hdr *)(skb->data + srhoff);
+	srh->hdrlen = (u8)(local_srh_state.hdrlen >> 3);
+
+	if (!local_srh_state.valid &&
+	    unlikely(!seg6_validate_srh(srh, (srh->hdrlen + 1) << 3)))
+		goto drop;
+
+	if (ret != BPF_REDIRECT)
+		seg6_lookup_nexthop(skb, NULL, 0);
+
+	return dst_input(skb);
+
+drop:
+	kfree_skb(skb);
+	return -EINVAL;
+}
+
 static struct seg6_action_desc seg6_action_table[] = {
 	{
 		.action		= SEG6_LOCAL_ACTION_END,
@@ -497,7 +568,13 @@ static struct seg6_action_desc seg6_action_table[] = {
 		.attrs		= (1 << SEG6_LOCAL_SRH),
 		.input		= input_action_end_b6_encap,
 		.static_headroom	= sizeof(struct ipv6hdr),
-	}
+	},
+	{
+		.action		= SEG6_LOCAL_ACTION_END_BPF,
+		.attrs		= (1 << SEG6_LOCAL_BPF),
+		.input		= input_action_end_bpf,
+	},
+
 };
 
 static struct seg6_action_desc *__get_action_desc(int action)
@@ -542,6 +619,7 @@ static const struct nla_policy seg6_local_policy[SEG6_LOCAL_MAX + 1] = {
 				    .len = sizeof(struct in6_addr) },
 	[SEG6_LOCAL_IIF]	= { .type = NLA_U32 },
 	[SEG6_LOCAL_OIF]	= { .type = NLA_U32 },
+	[SEG6_LOCAL_BPF]	= { .type = NLA_NESTED },
 };
 
 static int parse_nla_srh(struct nlattr **attrs, struct seg6_local_lwt *slwt)
@@ -719,6 +797,71 @@ static int cmp_nla_oif(struct seg6_local_lwt *a, struct seg6_local_lwt *b)
 	return 0;
 }
 
+#define MAX_PROG_NAME 256
+static const struct nla_policy bpf_prog_policy[LWT_BPF_PROG_MAX + 1] = {
+	[LWT_BPF_PROG_FD]   = { .type = NLA_U32, },
+	[LWT_BPF_PROG_NAME] = { .type = NLA_NUL_STRING,
+				.len = MAX_PROG_NAME },
+};
+
+static int parse_nla_bpf(struct nlattr **attrs, struct seg6_local_lwt *slwt)
+{
+	struct nlattr *tb[LWT_BPF_PROG_MAX + 1];
+	struct bpf_prog *p;
+	int ret;
+	u32 fd;
+
+	ret = nla_parse_nested(tb, LWT_BPF_PROG_MAX, attrs[SEG6_LOCAL_BPF],
+			       bpf_prog_policy, NULL);
+	if (ret < 0)
+		return ret;
+
+	if (!tb[LWT_BPF_PROG_FD] || !tb[LWT_BPF_PROG_NAME])
+		return -EINVAL;
+
+	slwt->bpf.name = nla_memdup(tb[LWT_BPF_PROG_NAME], GFP_KERNEL);
+	if (!slwt->bpf.name)
+		return -ENOMEM;
+
+	fd = nla_get_u32(tb[LWT_BPF_PROG_FD]);
+	p = bpf_prog_get_type(fd, BPF_PROG_TYPE_LWT_SEG6LOCAL);
+	if (IS_ERR(p)) {
+		kfree(slwt->bpf.name);
+		return PTR_ERR(p);
+	}
+	slwt->bpf.prog = p;
+	return 0;
+}
+
+static int put_nla_bpf(struct sk_buff *skb, struct seg6_local_lwt *slwt)
+{
+	struct nlattr *nest;
+
+	if (!slwt->bpf.prog)
+		return 0;
+
+	nest = nla_nest_start(skb, SEG6_LOCAL_BPF);
+	if (!nest)
+		return -EMSGSIZE;
+
+	if (slwt->bpf.name &&
+	    nla_put_string(skb, LWT_BPF_PROG_NAME, slwt->bpf.name))
+		return -EMSGSIZE;
+
+	return nla_nest_end(skb, nest);
+}
+
+static int cmp_nla_bpf(struct seg6_local_lwt *a, struct seg6_local_lwt *b)
+{
+	if (!a->bpf.name && !b->bpf.name)
+		return 0;
+
+	if (!a->bpf.name || !b->bpf.name)
+		return 1;
+
+	return strcmp(a->bpf.name, b->bpf.name);
+}
+
 struct seg6_action_param {
 	int (*parse)(struct nlattr **attrs, struct seg6_local_lwt *slwt);
 	int (*put)(struct sk_buff *skb, struct seg6_local_lwt *slwt);
@@ -749,6 +892,11 @@ static struct seg6_action_param seg6_action_params[SEG6_LOCAL_MAX + 1] = {
 	[SEG6_LOCAL_OIF]	= { .parse = parse_nla_oif,
 				    .put = put_nla_oif,
 				    .cmp = cmp_nla_oif },
+
+	[SEG6_LOCAL_BPF]	= { .parse = parse_nla_bpf,
+				    .put = put_nla_bpf,
+				    .cmp = cmp_nla_bpf },
+
 };
 
 static int parse_nla_action(struct nlattr **attrs, struct seg6_local_lwt *slwt)
@@ -797,7 +945,6 @@ static int seg6_local_build_state(struct nlattr *nla, unsigned int family,
 
 	err = nla_parse_nested(tb, SEG6_LOCAL_MAX, nla, seg6_local_policy,
 			       extack);
-
 	if (err < 0)
 		return err;
 
@@ -886,6 +1033,11 @@ static int seg6_local_get_encap_size(struct lwtunnel_state *lwt)
 	if (attrs & (1 << SEG6_LOCAL_OIF))
 		nlsize += nla_total_size(4);
 
+	if (attrs & (1 << SEG6_LOCAL_BPF))
+		nlsize += nla_total_size(sizeof(struct nlattr)) +
+		       nla_total_size(MAX_PROG_NAME) +
+		       nla_total_size(4);
+
 	return nlsize;
 }
 
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index df54c4c9e48a..68865dc83aea 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -1450,6 +1450,7 @@ static bool bpf_prog_type__needs_kver(enum bpf_prog_type type)
 	case BPF_PROG_TYPE_LWT_IN:
 	case BPF_PROG_TYPE_LWT_OUT:
 	case BPF_PROG_TYPE_LWT_XMIT:
+	case BPF_PROG_TYPE_LWT_SEG6LOCAL:
 	case BPF_PROG_TYPE_SOCK_OPS:
 	case BPF_PROG_TYPE_SK_SKB:
 	case BPF_PROG_TYPE_CGROUP_DEVICE:
-- 
2.16.1

* [PATCH bpf-next v5 6/6] selftests/bpf: test for seg6local End.BPF action
From: Mathieu Xhonneux @ 2018-05-12 17:25 UTC (permalink / raw)
  To: netdev; +Cc: dlebrun, alexei.starovoitov
In-Reply-To: <cover.1526143526.git.m.xhonneux@gmail.com>

Add a new test for the seg6local End.BPF action. The following helpers
are also tested:

- bpf_lwt_push_encap within the LWT BPF IN hook
- bpf_lwt_seg6_action
- bpf_lwt_seg6_adjust_srh
- bpf_lwt_seg6_store_bytes

A chain of End.BPF actions is built. The SRH is injected through a LWT
BPF IN hook before the chain. Each End.BPF action validates the previous
one, otherwise the packet is dropped.
The test succeeds if the last node in the chain receives the packet and
the UDP datagram contained can be retrieved from userspace.
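
As a sanity check on the SRH sizes the BPF programs in this test expect
(hdrlen 8 after encapsulation, 11 once the Egress TLV is added), the length
arithmetic can be sketched in plain C. This is only an illustration:
srh_pad_len() and srh_hdrlen() are hypothetical names that do not appear in
the patch, and the padding rule is assumed to be the one used by
add_tlv()/delete_tlv() in test_lwt_seg6local.c.

```c
/* Sizes mirror the structures declared in test_lwt_seg6local.c. */
#define SRH_FIXED_LEN	8	/* sizeof(struct ip6_srh_t) */
#define SEGMENT_LEN	16	/* sizeof(struct ip6_addr_t) */

/* Hypothetical helper: size of the Padding TLV needed to bring an SRH of
 * len bytes up to a multiple of 8. A Padding TLV has a 2-byte header, so
 * a 1-byte pad is impossible and is replaced by 9 bytes.
 */
unsigned int srh_pad_len(unsigned int len)
{
	unsigned int pad = 8 - (len % 8);

	if (pad == 1)
		return 9;
	if (pad == 8)
		return 0;
	return pad;
}

/* Hypothetical helper: value of the hdrlen field for an SRH carrying
 * nsegs segments and tlv_len bytes of non-padding TLVs. hdrlen counts
 * 8-byte units and does not include the first 8 bytes.
 */
unsigned int srh_hdrlen(unsigned int nsegs, unsigned int tlv_len)
{
	unsigned int len = SRH_FIXED_LEN + nsegs * SEGMENT_LEN + tlv_len;

	len += srh_pad_len(len);
	return (len >> 3) - 1;
}
```

With 4 segments and no TLV, srh_hdrlen(4, 0) gives 8. Adding the 20-byte
Egress TLV, srh_hdrlen(4, 20) gives 11 (92 bytes plus a 4-byte Padding TLV),
which matches the values checked in pop_egr and inspect_t.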

Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
---
 tools/include/uapi/linux/bpf.h                    |  98 ++++-
 tools/testing/selftests/bpf/Makefile              |   5 +-
 tools/testing/selftests/bpf/bpf_helpers.h         |  12 +
 tools/testing/selftests/bpf/test_lwt_seg6local.c  | 438 ++++++++++++++++++++++
 tools/testing/selftests/bpf/test_lwt_seg6local.sh | 140 +++++++
 5 files changed, 689 insertions(+), 4 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_lwt_seg6local.c
 create mode 100755 tools/testing/selftests/bpf/test_lwt_seg6local.sh

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 02e4112510f8..c6a213075368 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -140,6 +140,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_SK_MSG,
 	BPF_PROG_TYPE_RAW_TRACEPOINT,
 	BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
+	BPF_PROG_TYPE_LWT_SEG6LOCAL,
 };
 
 enum bpf_attach_type {
@@ -1828,7 +1829,6 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- *
  * int bpf_fib_lookup(void *ctx, struct bpf_fib_lookup *params, int plen, u32 flags)
  *	Description
  *		Do FIB lookup in kernel tables using parameters in *params*.
@@ -1855,6 +1855,90 @@ union bpf_attr {
  *             Egress device index on success, 0 if packet needs to continue
  *             up the stack for further processing or a negative error in case
  *             of failure.
+ *
+ * int bpf_lwt_push_encap(struct sk_buff *skb, u32 type, void *hdr, u32 len)
+ *	Description
+ *		Encapsulate the packet associated to *skb* within a Layer 3
+ *		protocol header. This header is provided in the buffer at
+ *		address *hdr*, with *len* its size in bytes. *type* indicates
+ *		the protocol of the header and can be one of:
+ *
+ *		**BPF_LWT_ENCAP_SEG6**
+ *			IPv6 encapsulation with Segment Routing Header
+ *			(**struct ipv6_sr_hdr**). *hdr* only contains the SRH,
+ *			the IPv6 header is computed by the kernel.
+ *		**BPF_LWT_ENCAP_SEG6_INLINE**
+ *			Only works if *skb* contains an IPv6 packet. Insert a
+ *			Segment Routing Header (**struct ipv6_sr_hdr**) inside
+ *			the IPv6 header.
+ *
+ * 		A call to this helper may change the underlying packet
+ * 		buffer. Therefore, at load time, all checks on pointers
+ * 		previously done by the verifier are invalidated and must be
+ * 		performed again, if the helper is used in combination with
+ * 		direct packet access.
+ *	Return
+ * 		0 on success, or a negative error in case of failure.
+ *
+ * int bpf_lwt_seg6_store_bytes(struct sk_buff *skb, u32 offset, const void *from, u32 len)
+ *	Description
+ *		Store *len* bytes from address *from* into the packet
+ *		associated to *skb*, at *offset*. Only the flags, tag and TLVs
+ *		inside the outermost IPv6 Segment Routing Header can be
+ *		modified through this helper.
+ *
+ * 		A call to this helper may change the underlying packet
+ * 		buffer. Therefore, at load time, all checks on pointers
+ * 		previously done by the verifier are invalidated and must be
+ * 		performed again, if the helper is used in combination with
+ * 		direct packet access.
+ *	Return
+ * 		0 on success, or a negative error in case of failure.
+ *
+ * int bpf_lwt_seg6_adjust_srh(struct sk_buff *skb, u32 offset, s32 delta)
+ *	Description
+ *		Adjust the size allocated to TLVs in the outermost IPv6
+ *		Segment Routing Header contained in the packet associated to
+ *		*skb*, at position *offset* by *delta* bytes. Only offsets
+ *		after the segments are accepted. *delta* can be either
+ *		positive (growing) or negative (shrinking).
+ *
+ * 		A call to this helper may change the underlying packet
+ * 		buffer. Therefore, at load time, all checks on pointers
+ * 		previously done by the verifier are invalidated and must be
+ * 		performed again, if the helper is used in combination with
+ * 		direct packet access.
+ *	Return
+ * 		0 on success, or a negative error in case of failure.
+ *
+ * int bpf_lwt_seg6_action(struct sk_buff *skb, u32 action, void *param, u32 param_len)
+ *	Description
+ *		Apply an IPv6 Segment Routing action of type *action* to the
+ *		packet associated to *skb*. Each action takes a parameter
+ *		contained at address *param*, and of length *param_len* bytes.
+ *		*action* can be one of:
+ *
+ *		**SEG6_LOCAL_ACTION_END_X**
+ *			End.X action: Endpoint with Layer-3 cross-connect.
+ *			Type of *param*: **struct in6_addr**.
+ *		**SEG6_LOCAL_ACTION_END_T**
+ *			End.T action: Endpoint with specific IPv6 table lookup.
+ *			Type of *param*: **int**.
+ *		**SEG6_LOCAL_ACTION_END_B6**
+ *			End.B6 action: Endpoint bound to an SRv6 policy.
+ *			Type of *param*: **struct ipv6_sr_hdr**.
+ *		**SEG6_LOCAL_ACTION_END_B6_ENCAP**
+ *			End.B6.Encap action: Endpoint bound to an SRv6
+ *			encapsulation policy.
+ *			Type of *param*: **struct ipv6_sr_hdr**.
+ *
+ * 		A call to this helper may change the underlying packet
+ * 		buffer. Therefore, at load time, all checks on pointers
+ * 		previously done by the verifier are invalidated and must be
+ * 		performed again, if the helper is used in combination with
+ * 		direct packet access.
+ *	Return
+ * 		0 on success, or a negative error in case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -1926,7 +2010,11 @@ union bpf_attr {
 	FN(skb_get_xfrm_state),		\
 	FN(get_stack),			\
 	FN(skb_load_bytes_relative),	\
-	FN(fib_lookup),
+	FN(fib_lookup),			\
+	FN(lwt_push_encap),		\
+	FN(lwt_seg6_store_bytes),	\
+	FN(lwt_seg6_adjust_srh),	\
+	FN(lwt_seg6_action),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -1993,6 +2081,12 @@ enum bpf_hdr_start_off {
 	BPF_HDR_START_NET,
 };
 
+/* Encapsulation type for BPF_FUNC_lwt_push_encap helper. */
+enum bpf_lwt_encap_mode {
+	BPF_LWT_ENCAP_SEG6,
+	BPF_LWT_ENCAP_SEG6_INLINE
+};
+
 /* user accessible mirror of in-kernel sk_buff.
  * new fields can only be added to the end of this structure
  */
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 438d4f93875b..009aa27be884 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -33,7 +33,7 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o test
 	sample_map_ret0.o test_tcpbpf_kern.o test_stacktrace_build_id.o \
 	sockmap_tcp_msg_prog.o connect4_prog.o connect6_prog.o test_adjust_tail.o \
 	test_btf_haskv.o test_btf_nokv.o test_sockmap_kern.o test_tunnel_kern.o \
-	test_get_stack_rawtp.o
+	test_get_stack_rawtp.o test_lwt_seg6local.o
 
 # Order correspond to 'make run_tests' order
 TEST_PROGS := test_kmod.sh \
@@ -42,7 +42,8 @@ TEST_PROGS := test_kmod.sh \
 	test_xdp_meta.sh \
 	test_offload.py \
 	test_sock_addr.sh \
-	test_tunnel.sh
+	test_tunnel.sh \
+	test_lwt_seg6local.sh
 
 # Compile but not part of 'make run_tests'
 TEST_GEN_PROGS_EXTENDED = test_libbpf_open test_sock_addr
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index 2375d06c706b..b29f2215361b 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -106,6 +106,18 @@ static int (*bpf_get_stack)(void *ctx, void *buf, int size, int flags) =
 static int (*bpf_fib_lookup)(void *ctx, struct bpf_fib_lookup *params,
 			     int plen, __u32 flags) =
 	(void *) BPF_FUNC_fib_lookup;
+static int (*bpf_lwt_push_encap)(void *ctx, unsigned int type, void *hdr,
+				 unsigned int len) =
+	(void *) BPF_FUNC_lwt_push_encap;
+static int (*bpf_lwt_seg6_store_bytes)(void *ctx, unsigned int offset,
+				       void *from, unsigned int len) =
+	(void *) BPF_FUNC_lwt_seg6_store_bytes;
+static int (*bpf_lwt_seg6_action)(void *ctx, unsigned int action, void *param,
+				  unsigned int param_len) =
+	(void *) BPF_FUNC_lwt_seg6_action;
+static int (*bpf_lwt_seg6_adjust_srh)(void *ctx, unsigned int offset,
+				      int delta) =
+	(void *) BPF_FUNC_lwt_seg6_adjust_srh;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
diff --git a/tools/testing/selftests/bpf/test_lwt_seg6local.c b/tools/testing/selftests/bpf/test_lwt_seg6local.c
new file mode 100644
index 000000000000..d752bc1fe81c
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_lwt_seg6local.c
@@ -0,0 +1,438 @@
+#include <stddef.h>
+#include <inttypes.h>
+#include <errno.h>
+#include <linux/seg6_local.h>
+#include <linux/bpf.h>
+#include "bpf_helpers.h"
+#include "bpf_endian.h"
+
+#define bpf_printk(fmt, ...)				\
+({							\
+	char ____fmt[] = fmt;				\
+	bpf_trace_printk(____fmt, sizeof(____fmt),	\
+			##__VA_ARGS__);			\
+})
+
+/* Packet parsing state machine helpers. */
+#define cursor_advance(_cursor, _len) \
+	({ void *_tmp = _cursor; _cursor += _len; _tmp; })
+
+#define SR6_FLAG_ALERT (1 << 4)
+
+#define htonll(x) ((bpf_htonl(1)) == 1 ? (x) : ((uint64_t)bpf_htonl((x) & \
+				0xFFFFFFFF) << 32) | bpf_htonl((x) >> 32))
+#define ntohll(x) ((bpf_ntohl(1)) == 1 ? (x) : ((uint64_t)bpf_ntohl((x) & \
+				0xFFFFFFFF) << 32) | bpf_ntohl((x) >> 32))
+#define BPF_PACKET_HEADER __attribute__((packed))
+
+struct ip6_t {
+	unsigned int ver:4;
+	unsigned int priority:8;
+	unsigned int flow_label:20;
+	unsigned short payload_len;
+	unsigned char next_header;
+	unsigned char hop_limit;
+	unsigned long long src_hi;
+	unsigned long long src_lo;
+	unsigned long long dst_hi;
+	unsigned long long dst_lo;
+} BPF_PACKET_HEADER;
+
+struct ip6_addr_t {
+	unsigned long long hi;
+	unsigned long long lo;
+} BPF_PACKET_HEADER;
+
+struct ip6_srh_t {
+	unsigned char nexthdr;
+	unsigned char hdrlen;
+	unsigned char type;
+	unsigned char segments_left;
+	unsigned char first_segment;
+	unsigned char flags;
+	unsigned short tag;
+
+	struct ip6_addr_t segments[0];
+} BPF_PACKET_HEADER;
+
+struct sr6_tlv_t {
+	unsigned char type;
+	unsigned char len;
+	unsigned char value[0];
+} BPF_PACKET_HEADER;
+
+__attribute__((always_inline)) struct ip6_srh_t *get_srh(struct __sk_buff *skb)
+{
+	void *cursor, *data_end;
+	struct ip6_srh_t *srh;
+	struct ip6_t *ip;
+	uint8_t *ipver;
+
+	data_end = (void *)(long)skb->data_end;
+	cursor = (void *)(long)skb->data;
+	ipver = (uint8_t *)cursor;
+
+	if ((void *)ipver + sizeof(*ipver) > data_end)
+		return NULL;
+
+	if ((*ipver >> 4) != 6)
+		return NULL;
+
+	ip = cursor_advance(cursor, sizeof(*ip));
+	if ((void *)ip + sizeof(*ip) > data_end)
+		return NULL;
+
+	if (ip->next_header != 43)
+		return NULL;
+
+	srh = cursor_advance(cursor, sizeof(*srh));
+	if ((void *)srh + sizeof(*srh) > data_end)
+		return NULL;
+
+	if (srh->type != 4)
+		return NULL;
+
+	return srh;
+}
+
+__attribute__((always_inline))
+int update_tlv_pad(struct __sk_buff *skb, uint32_t new_pad,
+		   uint32_t old_pad, uint32_t pad_off)
+{
+	int err;
+
+	if (new_pad != old_pad) {
+		err = bpf_lwt_seg6_adjust_srh(skb, pad_off,
+					  (int) new_pad - (int) old_pad);
+		if (err)
+			return err;
+	}
+
+	if (new_pad > 0) {
+		char pad_tlv_buf[16] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+					0, 0, 0};
+		struct sr6_tlv_t *pad_tlv = (struct sr6_tlv_t *) pad_tlv_buf;
+
+		pad_tlv->type = SR6_TLV_PADDING;
+		pad_tlv->len = new_pad - 2;
+
+		err = bpf_lwt_seg6_store_bytes(skb, pad_off,
+					       (void *)pad_tlv_buf, new_pad);
+		if (err)
+			return err;
+	}
+
+	return 0;
+}
+
+__attribute__((always_inline))
+int is_valid_tlv_boundary(struct __sk_buff *skb, struct ip6_srh_t *srh,
+			  uint32_t *tlv_off, uint32_t *pad_size,
+			  uint32_t *pad_off)
+{
+	uint32_t srh_off, cur_off;
+	int offset_valid = 0;
+	int err;
+
+	srh_off = (char *)srh - (char *)(long)skb->data;
+	// cur_off = end of segments, start of possible TLVs
+	cur_off = srh_off + sizeof(*srh) +
+		sizeof(struct ip6_addr_t) * (srh->first_segment + 1);
+
+	*pad_off = 0;
+
+	// we can only go as far as ~10 TLVs due to the BPF max stack size
+	#pragma clang loop unroll(full)
+	for (int i = 0; i < 10; i++) {
+		struct sr6_tlv_t tlv;
+
+		if (cur_off == *tlv_off)
+			offset_valid = 1;
+
+		if (cur_off >= srh_off + ((srh->hdrlen + 1) << 3))
+			break;
+
+		err = bpf_skb_load_bytes(skb, cur_off, &tlv, sizeof(tlv));
+		if (err)
+			return err;
+
+		if (tlv.type == SR6_TLV_PADDING) {
+			*pad_size = tlv.len + sizeof(tlv);
+			*pad_off = cur_off;
+
+			if (*tlv_off == srh_off) {
+				*tlv_off = cur_off;
+				offset_valid = 1;
+			}
+			break;
+
+		} else if (tlv.type == SR6_TLV_HMAC) {
+			break;
+		}
+
+		cur_off += sizeof(tlv) + tlv.len;
+	} // we reached the padding or HMAC TLVs, or the end of the SRH
+
+	if (*pad_off == 0)
+		*pad_off = cur_off;
+
+	if (*tlv_off == -1)
+		*tlv_off = cur_off;
+	else if (!offset_valid)
+		return -EINVAL;
+
+	return 0;
+}
+
+__attribute__((always_inline))
+int add_tlv(struct __sk_buff *skb, struct ip6_srh_t *srh, uint32_t tlv_off,
+	    struct sr6_tlv_t *itlv, uint8_t tlv_size)
+{
+	uint32_t srh_off = (char *)srh - (char *)(long)skb->data;
+	uint8_t len_remaining, new_pad;
+	uint32_t pad_off = 0;
+	uint32_t pad_size = 0;
+	uint32_t partial_srh_len;
+	int err;
+
+	if (tlv_off != -1)
+		tlv_off += srh_off;
+
+	if (itlv->type == SR6_TLV_PADDING || itlv->type == SR6_TLV_HMAC)
+		return -EINVAL;
+
+	err = is_valid_tlv_boundary(skb, srh, &tlv_off, &pad_size, &pad_off);
+	if (err)
+		return err;
+
+	err = bpf_lwt_seg6_adjust_srh(skb, tlv_off, sizeof(*itlv) + itlv->len);
+	if (err)
+		return err;
+
+	err = bpf_lwt_seg6_store_bytes(skb, tlv_off, (void *)itlv, tlv_size);
+	if (err)
+		return err;
+
+	// the following can't be moved inside update_tlv_pad because the
+	// bpf verifier has some issues with it
+	pad_off += sizeof(*itlv) + itlv->len;
+	partial_srh_len = pad_off - srh_off;
+	len_remaining = partial_srh_len % 8;
+	new_pad = 8 - len_remaining;
+
+	if (new_pad == 1) // cannot pad for 1 byte only
+		new_pad = 9;
+	else if (new_pad == 8)
+		new_pad = 0;
+
+	return update_tlv_pad(skb, new_pad, pad_size, pad_off);
+}
+
+__attribute__((always_inline))
+int delete_tlv(struct __sk_buff *skb, struct ip6_srh_t *srh,
+	       uint32_t tlv_off)
+{
+	uint32_t srh_off = (char *)srh - (char *)(long)skb->data;
+	uint8_t len_remaining, new_pad;
+	uint32_t partial_srh_len;
+	uint32_t pad_off = 0;
+	uint32_t pad_size = 0;
+	struct sr6_tlv_t tlv;
+	int err;
+
+	tlv_off += srh_off;
+
+	err = is_valid_tlv_boundary(skb, srh, &tlv_off, &pad_size, &pad_off);
+	if (err)
+		return err;
+
+	err = bpf_skb_load_bytes(skb, tlv_off, &tlv, sizeof(tlv));
+	if (err)
+		return err;
+
+	err = bpf_lwt_seg6_adjust_srh(skb, tlv_off, -(sizeof(tlv) + tlv.len));
+	if (err)
+		return err;
+
+	pad_off -= sizeof(tlv) + tlv.len;
+	partial_srh_len = pad_off - srh_off;
+	len_remaining = partial_srh_len % 8;
+	new_pad = 8 - len_remaining;
+	if (new_pad == 1) // cannot pad for 1 byte only
+		new_pad = 9;
+	else if (new_pad == 8)
+		new_pad = 0;
+
+	return update_tlv_pad(skb, new_pad, pad_size, pad_off);
+}
+
+__attribute__((always_inline))
+int has_egr_tlv(struct __sk_buff *skb, struct ip6_srh_t *srh)
+{
+	int tlv_offset = sizeof(struct ip6_t) + sizeof(struct ip6_srh_t) +
+		((srh->first_segment + 1) << 4);
+	struct sr6_tlv_t tlv;
+
+	if (bpf_skb_load_bytes(skb, tlv_offset, &tlv, sizeof(struct sr6_tlv_t)))
+		return 0;
+
+	if (tlv.type == SR6_TLV_EGRESS && tlv.len == 18) {
+		struct ip6_addr_t egr_addr;
+
+		if (bpf_skb_load_bytes(skb, tlv_offset + 4, &egr_addr, 16))
+			return 0;
+
+		// check if egress TLV value is correct
+		if (ntohll(egr_addr.hi) == 0xfd00000000000000 &&
+				ntohll(egr_addr.lo) == 0x4)
+			return 1;
+	}
+
+	return 0;
+}
+
+// This function will push an SRH with segments fd00::1, fd00::2, fd00::3,
+// fd00::4
+SEC("encap_srh")
+int __encap_srh(struct __sk_buff *skb)
+{
+	bpf_printk("got pkt\n");
+	unsigned long long hi = 0xfd00000000000000;
+	struct ip6_addr_t *seg;
+	struct ip6_srh_t *srh;
+	char srh_buf[72]; // room for 4 segments
+	int err;
+
+	srh = (struct ip6_srh_t *)srh_buf;
+	srh->nexthdr = 0;
+	srh->hdrlen = 8;
+	srh->type = 4;
+	srh->segments_left = 3;
+	srh->first_segment = 3;
+	srh->flags = 0;
+	srh->tag = 0;
+
+	seg = (struct ip6_addr_t *)((char *)srh + sizeof(*srh));
+
+	#pragma clang loop unroll(full)
+	for (unsigned long long lo = 0; lo < 4; lo++) {
+		seg->lo = htonll(4 - lo);
+		seg->hi = htonll(hi);
+		seg = (struct ip6_addr_t *)((char *)seg + sizeof(*seg));
+	}
+
+	err = bpf_lwt_push_encap(skb, 0, (void *)srh, sizeof(srh_buf));
+	if (err)
+		return BPF_DROP;
+
+	return BPF_REDIRECT;
+}
+
+// Add an Egress TLV fd00::4, add the flag A,
+// and apply End.X action to fc42::1
+SEC("add_egr_x")
+int __add_egr_x(struct __sk_buff *skb)
+{
+	unsigned long long hi = 0xfc42000000000000;
+	unsigned long long lo = 0x1;
+	struct ip6_srh_t *srh = get_srh(skb);
+	uint8_t new_flags = SR6_FLAG_ALERT;
+	struct ip6_addr_t addr;
+	int err, offset;
+
+	if (srh == NULL)
+		return BPF_DROP;
+
+	uint8_t tlv[20] = {2, 18, 0, 0, 0xfd, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
+			   0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x4};
+
+	err = add_tlv(skb, srh, (srh->hdrlen+1) << 3,
+		      (struct sr6_tlv_t *)&tlv, 20);
+	if (err)
+		return BPF_DROP;
+
+	offset = sizeof(struct ip6_t) + offsetof(struct ip6_srh_t, flags);
+	err = bpf_lwt_seg6_store_bytes(skb, offset,
+				       (void *)&new_flags, sizeof(new_flags));
+	if (err)
+		return BPF_DROP;
+
+	addr.lo = htonll(lo);
+	addr.hi = htonll(hi);
+	err = bpf_lwt_seg6_action(skb, SEG6_LOCAL_ACTION_END_X,
+				  (void *)&addr, sizeof(addr));
+	if (err)
+		return BPF_DROP;
+	return BPF_REDIRECT;
+}
+
+// Pop the Egress TLV, reset the flags, set the tag to 2442 and finally do a
+// simple End action
+SEC("pop_egr")
+int __pop_egr(struct __sk_buff *skb)
+{
+	struct ip6_srh_t *srh = get_srh(skb);
+	uint16_t new_tag = bpf_htons(2442);
+	uint8_t new_flags = 0;
+	int err, offset;
+
+	if (srh == NULL)
+		return BPF_DROP;
+
+	if (srh->flags != SR6_FLAG_ALERT)
+		return BPF_DROP;
+
+	if (srh->hdrlen != 11) // 4 segments + Egress TLV + Padding TLV
+		return BPF_DROP;
+
+	if (!has_egr_tlv(skb, srh))
+		return BPF_DROP;
+
+	err = delete_tlv(skb, srh, 8 + (srh->first_segment + 1) * 16);
+	if (err)
+		return BPF_DROP;
+
+	offset = sizeof(struct ip6_t) + offsetof(struct ip6_srh_t, flags);
+	if (bpf_lwt_seg6_store_bytes(skb, offset, (void *)&new_flags,
+				     sizeof(new_flags)))
+		return BPF_DROP;
+
+	offset = sizeof(struct ip6_t) + offsetof(struct ip6_srh_t, tag);
+	if (bpf_lwt_seg6_store_bytes(skb, offset, (void *)&new_tag,
+				     sizeof(new_tag)))
+		return BPF_DROP;
+
+	return BPF_OK;
+}
+
+// Check that the Egress TLV and flag have been removed and that the tag is
+// correct, then apply an End.T action to reach the last segment
+SEC("inspect_t")
+int __inspect_t(struct __sk_buff *skb)
+{
+	struct ip6_srh_t *srh = get_srh(skb);
+	int table = 117;
+	int err;
+
+	if (srh == NULL)
+		return BPF_DROP;
+
+	if (srh->flags != 0)
+		return BPF_DROP;
+
+	if (srh->tag != bpf_htons(2442))
+		return BPF_DROP;
+
+	if (srh->hdrlen != 8) // 4 segments
+		return BPF_DROP;
+
+	err = bpf_lwt_seg6_action(skb, SEG6_LOCAL_ACTION_END_T,
+				  (void *)&table, sizeof(table));
+
+	if (err)
+		return BPF_DROP;
+
+	return BPF_REDIRECT;
+}
+
+char __license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/test_lwt_seg6local.sh b/tools/testing/selftests/bpf/test_lwt_seg6local.sh
new file mode 100755
index 000000000000..1c77994b5e71
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_lwt_seg6local.sh
@@ -0,0 +1,140 @@
+#!/bin/bash
+# Connects 6 network namespaces through veths.
+# Each NS may have different IPv6 global scope addresses:
+#   NS1 ---- NS2 ---- NS3 ---- NS4 ---- NS5 ---- NS6
+# fb00::1           fd00::1  fd00::2  fd00::3  fb00::6
+#                            fc42::1           fd00::4
+#
+# All IPv6 packets going to fb00::/16 through NS2 will be encapsulated in an
+# IPv6 header with a Segment Routing Header, with segments:
+# 	fd00::1 -> fd00::2 -> fd00::3 -> fd00::4
+#
+# 3 fd00::/16 IPv6 addresses are bound to seg6local End.BPF actions:
+# - fd00::1 : add a TLV, change the flags and apply an End.X action to fc42::1
+# - fd00::2 : remove the TLV, change the flags, add a tag
+# - fd00::3 : apply an End.T action to fd00::4, through routing table 117
+#
+# fd00::4 is a simple Segment Routing node decapsulating the inner IPv6 packet.
+# Each End.BPF action will validate the operations applied on the SRH by the
+# previous BPF program in the chain, otherwise the packet is dropped.
+#
+# A UDP datagram is sent from fb00::1 to fb00::6. The test succeeds if this
+# datagram can be read on NS6 when binding to fb00::6.
+
+TMP_FILE="/tmp/selftest_lwt_seg6local.txt"
+
+cleanup()
+{
+	if [ "$?" = "0" ]; then
+		echo "selftests: test_lwt_seg6local [PASS]";
+	else
+		echo "selftests: test_lwt_seg6local [FAILED]";
+	fi
+
+	set +e
+	ip netns del ns1 2> /dev/null
+	ip netns del ns2 2> /dev/null
+	ip netns del ns3 2> /dev/null
+	ip netns del ns4 2> /dev/null
+	ip netns del ns5 2> /dev/null
+	ip netns del ns6 2> /dev/null
+	rm -f $TMP_FILE
+}
+
+set -e
+
+ip netns add ns1
+ip netns add ns2
+ip netns add ns3
+ip netns add ns4
+ip netns add ns5
+ip netns add ns6
+
+trap cleanup 0 2 3 6 9
+
+ip link add veth1 type veth peer name veth2
+ip link add veth3 type veth peer name veth4
+ip link add veth5 type veth peer name veth6
+ip link add veth7 type veth peer name veth8
+ip link add veth9 type veth peer name veth10
+
+ip link set veth1 netns ns1
+ip link set veth2 netns ns2
+ip link set veth3 netns ns2
+ip link set veth4 netns ns3
+ip link set veth5 netns ns3
+ip link set veth6 netns ns4
+ip link set veth7 netns ns4
+ip link set veth8 netns ns5
+ip link set veth9 netns ns5
+ip link set veth10 netns ns6
+
+ip netns exec ns1 ip link set dev veth1 up
+ip netns exec ns2 ip link set dev veth2 up
+ip netns exec ns2 ip link set dev veth3 up
+ip netns exec ns3 ip link set dev veth4 up
+ip netns exec ns3 ip link set dev veth5 up
+ip netns exec ns4 ip link set dev veth6 up
+ip netns exec ns4 ip link set dev veth7 up
+ip netns exec ns5 ip link set dev veth8 up
+ip netns exec ns5 ip link set dev veth9 up
+ip netns exec ns6 ip link set dev veth10 up
+ip netns exec ns6 ip link set dev lo up
+
+# All link scope addresses and routes required between veths
+ip netns exec ns1 ip -6 addr add fb00::12/16 dev veth1 scope link
+ip netns exec ns1 ip -6 route add fb00::21 dev veth1 scope link
+ip netns exec ns2 ip -6 addr add fb00::21/16 dev veth2 scope link
+ip netns exec ns2 ip -6 addr add fb00::34/16 dev veth3 scope link
+ip netns exec ns2 ip -6 route add fb00::43 dev veth3 scope link
+ip netns exec ns3 ip -6 route add fb00::65 dev veth5 scope link
+ip netns exec ns3 ip -6 addr add fb00::43/16 dev veth4 scope link
+ip netns exec ns3 ip -6 addr add fb00::56/16 dev veth5 scope link
+ip netns exec ns4 ip -6 addr add fb00::65/16 dev veth6 scope link
+ip netns exec ns4 ip -6 addr add fb00::78/16 dev veth7 scope link
+ip netns exec ns4 ip -6 route add fb00::87 dev veth7 scope link
+ip netns exec ns5 ip -6 addr add fb00::87/16 dev veth8 scope link
+ip netns exec ns5 ip -6 addr add fb00::910/16 dev veth9 scope link
+ip netns exec ns5 ip -6 route add fb00::109 dev veth9 scope link
+ip netns exec ns5 ip -6 route add fb00::109 table 117 dev veth9 scope link
+ip netns exec ns6 ip -6 addr add fb00::109/16 dev veth10 scope link
+
+ip netns exec ns1 ip -6 addr add fb00::1/16 dev lo
+ip netns exec ns1 ip -6 route add fb00::6 dev veth1 via fb00::21
+
+ip netns exec ns2 ip -6 route add fb00::6 encap bpf in obj test_lwt_seg6local.o sec encap_srh dev veth2
+ip netns exec ns2 ip -6 route add fd00::1 dev veth3 via fb00::43 scope link
+
+ip netns exec ns3 ip -6 route add fc42::1 dev veth5 via fb00::65
+ip netns exec ns3 ip -6 route add fd00::1 encap seg6local action End.BPF obj test_lwt_seg6local.o sec add_egr_x dev veth4
+
+ip netns exec ns4 ip -6 route add fd00::2 encap seg6local action End.BPF obj test_lwt_seg6local.o sec pop_egr dev veth6
+ip netns exec ns4 ip -6 addr add fc42::1 dev lo
+ip netns exec ns4 ip -6 route add fd00::3 dev veth7 via fb00::87
+
+ip netns exec ns5 ip -6 route add fd00::4 table 117 dev veth9 via fb00::109
+ip netns exec ns5 ip -6 route add fd00::3 encap seg6local action End.BPF obj test_lwt_seg6local.o sec inspect_t dev veth8
+
+ip netns exec ns6 ip -6 addr add fb00::6/16 dev lo
+ip netns exec ns6 ip -6 addr add fd00::4/16 dev lo
+
+ip netns exec ns1 sysctl net.ipv6.conf.all.forwarding=1 > /dev/null
+ip netns exec ns2 sysctl net.ipv6.conf.all.forwarding=1 > /dev/null
+ip netns exec ns3 sysctl net.ipv6.conf.all.forwarding=1 > /dev/null
+ip netns exec ns4 sysctl net.ipv6.conf.all.forwarding=1 > /dev/null
+ip netns exec ns5 sysctl net.ipv6.conf.all.forwarding=1 > /dev/null
+
+ip netns exec ns6 sysctl net.ipv6.conf.all.seg6_enabled=1 > /dev/null
+ip netns exec ns6 sysctl net.ipv6.conf.lo.seg6_enabled=1 > /dev/null
+ip netns exec ns6 sysctl net.ipv6.conf.veth10.seg6_enabled=1 > /dev/null
+
+ip netns exec ns6 nc -l -6 -u -d 7330 > $TMP_FILE &
+ip netns exec ns1 bash -c "echo 'foobar' | nc -w0 -6 -u -p 2121 -s fb00::1 fb00::6 7330"
+sleep 5 # wait long enough to ensure the UDP datagram arrived at the last segment
+kill -INT $!
+
+if [[ $(< $TMP_FILE) != "foobar" ]]; then
+	exit 1
+fi
+
+exit 0
-- 
2.16.1

* Re: [PATCH bpf-next v5 0/6] ipv6: sr: introduce seg6local End.BPF action
From: Mathieu Xhonneux @ 2018-05-12 15:30 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: netdev
In-Reply-To: <cover.1526143526.git.m.xhonneux@gmail.com>

Sorry that v4 still triggered warnings from the kbuild bot; this
version should be OK.

2018-05-12 18:25 GMT+01:00 Mathieu Xhonneux <m.xhonneux@gmail.com>:
> As of Linux 4.14, it is possible to define advanced local processing for
> IPv6 packets with a Segment Routing Header through the seg6local LWT
> infrastructure. This LWT implements the network programming principles
> defined in the IETF “SRv6 Network Programming” draft.
>
> The implemented operations are generic, and it would be useful to be able
> to implement user-specific seg6local actions without having to modify the
> kernel directly. To do so, this patchset adds an End.BPF action to
> seg6local, backed by Segment Routing-specific helpers that provide SR
> functionality which can be applied to the packet. This BPF hook then makes
> it possible to implement specific actions at native kernel speed, such as
> OAM features, advanced SR SDN policies, or SRv6 actions like Segment
> Routing Header (SRH) encapsulation depending on the content of the packet.
>
> This patchset is divided into 6 patches, whose main features are:
>
> - A new seg6local action End.BPF with the corresponding new BPF program
>   type BPF_PROG_TYPE_LWT_SEG6LOCAL. Such a BPF program can be passed to
>   the LWT seg6local through netlink, in the same way as for the LWT
>   BPF hook.
> - 3 new BPF helpers for the seg6local BPF hook, which allow editing,
>   growing, or shrinking an SRH and applying some of the generic SRv6
>   actions to a packet.
> - 1 new BPF helper for the LWT BPF IN hook, which allows adding an SRH
>   through encapsulation (via IPv6 encapsulation, or inlining if the packet
>   already contains an IPv6 header).
>
> As this patchset adds a new LWT BPF hook, I took into account the result of
> the discussions when the LWT BPF infrastructure got merged. Hence, the
> seg6local BPF hook doesn’t allow write access to skb->data directly; only
> the SRH can be modified, through specific helpers, which ensures that the
> integrity of the packet is maintained.
> More details are available in the related patch messages.
>
> The performance of this BPF hook has been assessed with the BPF JIT
> enabled on an Intel Xeon X3440 processor with 4 cores and 8 threads
> clocked at 2.53 GHz. No throughput loss is observed with the seg6local
> BPF hook when the BPF program does nothing (440kpps). Adding an 8-byte
> TLV (1 call each to bpf_lwt_seg6_adjust_srh and bpf_lwt_seg6_store_bytes)
> drops the throughput to 410kpps, and inlining an SRH via
> bpf_lwt_seg6_action drops the throughput to 420kpps.
> All throughputs are stable.
>
> -------
> v2: move the SRH integrity state from skb->cb to a per-cpu buffer
> v3: - document helpers in man-page style
>     - fix kbuild bugs
>     - un-break BPF LWT out hook
>     - bpf_push_seg6_encap is now static
>     - preempt_enable is now called when the packet is dropped in
>       input_action_end_bpf
> v4: fix kbuild bugs when CONFIG_IPV6=m
> v5: fix kbuild sparse warnings when CONFIG_IPV6=m
>
> Thanks.
>
>
> Mathieu Xhonneux (6):
>   ipv6: sr: make seg6.h includable without IPv6
>   ipv6: sr: export function lookup_nexthop
>   bpf: Add IPv6 Segment Routing helpers
>   bpf: Split lwt inout verifier structures
>   ipv6: sr: Add seg6local action End.BPF
>   selftests/bpf: test for seg6local End.BPF action
>
>  include/linux/bpf_types.h                         |   5 +-
>  include/net/seg6.h                                |   7 +-
>  include/net/seg6_local.h                          |  32 ++
>  include/uapi/linux/bpf.h                          |  98 ++++-
>  include/uapi/linux/seg6_local.h                   |   3 +
>  kernel/bpf/verifier.c                             |   1 +
>  net/core/filter.c                                 | 390 ++++++++++++++++---
>  net/ipv6/Kconfig                                  |   5 +
>  net/ipv6/seg6_local.c                             | 180 ++++++++-
>  tools/include/uapi/linux/bpf.h                    |  98 ++++-
>  tools/lib/bpf/libbpf.c                            |   1 +
>  tools/testing/selftests/bpf/Makefile              |   5 +-
>  tools/testing/selftests/bpf/bpf_helpers.h         |  12 +
>  tools/testing/selftests/bpf/test_lwt_seg6local.c  | 438 ++++++++++++++++++++++
>  tools/testing/selftests/bpf/test_lwt_seg6local.sh | 140 +++++++
>  15 files changed, 1340 insertions(+), 75 deletions(-)
>  create mode 100644 include/net/seg6_local.h
>  create mode 100644 tools/testing/selftests/bpf/test_lwt_seg6local.c
>  create mode 100755 tools/testing/selftests/bpf/test_lwt_seg6local.sh
>
> --
> 2.16.1
>
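
For readers following the "adjust, then store" sequence mentioned in the
cover letter for the 8-byte TLV case, here is a minimal userspace sketch of
the idea. It is only a model: struct model_srh, model_srh_init(),
model_adjust_srh() and model_store_bytes() are hypothetical names, not the
kernel helpers, and the choice to update the length byte inside the grow
step is an assumption of this model rather than something the cover letter
states. The one fact relied on is the standard IPv6 extension header
layout, where byte 1 holds the header length in 8-octet units, not counting
the first 8 octets.

```c
#include <stdint.h>
#include <string.h>

#define MODEL_SRH_MAX 256	/* arbitrary buffer size for this model */

/* Hypothetical in-memory SRH; bytes[1] is the Hdr Ext Len field. */
struct model_srh {
	uint8_t bytes[MODEL_SRH_MAX];
	size_t len;		/* total SRH length in bytes */
};

/* Build an SRH with the 8-byte fixed part and nsegs 128-bit segments. */
static int model_srh_init(struct model_srh *srh, unsigned int nsegs)
{
	if (nsegs > 15)		/* 8 + 16 * 15 = 248 still fits the buffer */
		return -1;
	memset(srh, 0, sizeof(*srh));
	srh->len = 8 + 16 * (size_t)nsegs;
	srh->bytes[1] = (uint8_t)((srh->len - 8) / 8);
	return 0;
}

/* Step 1 (model of "adjust"): open a zeroed gap of delta bytes at offset. */
static int model_adjust_srh(struct model_srh *srh, size_t offset, size_t delta)
{
	/* The SRH must stay a multiple of 8 bytes; the fixed part is off limits. */
	if (delta % 8 != 0 || offset < 8 || offset > srh->len ||
	    srh->len + delta > MODEL_SRH_MAX)
		return -1;
	memmove(srh->bytes + offset + delta, srh->bytes + offset,
		srh->len - offset);
	memset(srh->bytes + offset, 0, delta);
	srh->len += delta;
	srh->bytes[1] = (uint8_t)((srh->len - 8) / 8);
	return 0;
}

/* Step 2 (model of "store"): copy n bytes into the SRH at offset. */
static int model_store_bytes(struct model_srh *srh, size_t offset,
			     const void *from, size_t n)
{
	if (offset < 8 || offset > srh->len || n > srh->len - offset)
		return -1;
	memcpy(srh->bytes + offset, from, n);
	return 0;
}
```

With two segments the modelled SRH is 40 bytes long (length field 4);
growing it by 8 bytes at offset 40 and storing an 8-byte TLV there gives 48
bytes (length field 5). A growth that is not a multiple of 8 is rejected,
since the length field cannot express it.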

^ permalink raw reply

* Re: [PATCH bpf v3] x86/cpufeature: bpf hack for clang not supporting asm goto
From: Alexei Starovoitov @ 2018-05-12 16:03 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Peter Zijlstra, Yonghong Song, Ingo Molnar, Linus Torvalds,
	Alexei Starovoitov, Daniel Borkmann, LKML, X86 ML,
	Network Development, Kernel Team, Thomas Gleixner

On Thu, May 10, 2018 at 10:58 AM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> I see no option but to fix the kernel, regardless of whether it's
> called user space breakage or kernel breakage.

Peter,

could you please ack the patch or, better yet, take it into the tip tree
and send it to Linus asap?
rc5 is almost here and we haven't had full test coverage
for more than a month due to this issue.

Thanks

^ permalink raw reply

* Re: iproute2 - modifying routes in place
From: David Ahern @ 2018-05-12 17:00 UTC (permalink / raw)
  To: Ryan Whelan, netdev
In-Reply-To: <CAM3m09SxxFRfhssnj9k21D=PVfJ-yRx7P2DnfpLK2C4QHvcrtQ@mail.gmail.com>

On 5/11/18 4:42 AM, Ryan Whelan wrote:
> `ip route` has 2 subcommands that don't seem to work as expected and I'm
> not sure if it's a bug, or if I'm misunderstanding the semantics.

Can you try with the ipv6/route-bugs branch in
https://github.com/dsahern/linux?

^ permalink raw reply

* [PATCH] net/mlx5: Use 'kvfree()' for memory allocated by 'kvzalloc()'
From: Christophe JAILLET @ 2018-05-12 17:09 UTC (permalink / raw)
  To: saeedm, matanb, leon, davem
  Cc: netdev, linux-rdma, linux-kernel, kernel-janitors,
	Christophe JAILLET

'out' is allocated with 'kvzalloc()'. 'kvfree()' must be used to free it.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
---
 drivers/net/ethernet/mellanox/mlx5/core/vport.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/vport.c b/drivers/net/ethernet/mellanox/mlx5/core/vport.c
index 177e076b8d17..49968a4db758 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/vport.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/vport.c
@@ -511,7 +511,7 @@ int mlx5_query_nic_vport_system_image_guid(struct mlx5_core_dev *mdev,
 	*system_image_guid = MLX5_GET64(query_nic_vport_context_out, out,
 					nic_vport_context.system_image_guid);
 
-	kfree(out);
+	kvfree(out);
 
 	return 0;
 }
-- 
2.17.0
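
As background on why the free routine has to match the allocator, here is
a small userspace sketch. It is only a model, not kernel code: model_buf,
model_kvzalloc(), model_kfree() and model_kvfree() are hypothetical names,
and the fixed 4096-byte threshold is an assumption made for illustration
(the real kvzalloc() decides between the slab allocator and vmalloc
differently). The point it tries to show is that a kfree()-style routine
is only valid for slab-backed memory, while a kvfree()-style routine
handles either backing.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace model; none of these are kernel functions. */
enum model_backing { MODEL_SLAB, MODEL_VMALLOC };

struct model_buf {
	enum model_backing backing;	/* which allocator produced the buffer */
	void *mem;
};

#define MODEL_SLAB_LIMIT 4096	/* assumed threshold, illustration only */

/* Models kvzalloc(): zeroed memory, "vmalloc"-backed when the size is large. */
static struct model_buf model_kvzalloc(size_t size)
{
	struct model_buf b;

	b.backing = size <= MODEL_SLAB_LIMIT ? MODEL_SLAB : MODEL_VMALLOC;
	b.mem = calloc(1, size);
	return b;
}

/* Models kfree(): only valid for slab-backed memory; false flags misuse. */
static bool model_kfree(struct model_buf *b)
{
	if (b->backing != MODEL_SLAB)
		return false;	/* the real kernel would not detect this */
	free(b->mem);
	b->mem = NULL;
	return true;
}

/* Models kvfree(): valid for both backings. */
static void model_kvfree(struct model_buf *b)
{
	free(b->mem);
	b->mem = NULL;
}
```

In this model a 64-byte buffer can be released with either routine, but a
1 MiB buffer is refused by model_kfree() and has to go through
model_kvfree(). Since the caller cannot know in advance which backing a
kvzalloc() allocation ended up with, pairing it with kvfree() is the safe
choice.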

^ permalink raw reply related

* Re: [PATCH bpf-next 3/4] samples: bpf: fix build after move to compiling full libbpf.a
From: Jakub Kicinski @ 2018-05-12 19:38 UTC (permalink / raw)
  To: alexei.starovoitov, daniel; +Cc: oss-drivers, netdev, Björn Töpel
In-Reply-To: <20180512001729.21634-4-jakub.kicinski@netronome.com>

On Fri, 11 May 2018 17:17:28 -0700, Jakub Kicinski wrote:
> There are many ways users may compile samples; some of them got
> broken by commit 5f9380572b4b ("samples: bpf: compile and link
> against full libbpf").  Improve path resolution and make libbpf
> building a dependency of source files to force its build.
> 
> Samples should now again build with any of:
>  cd samples/bpf; make
>  make samples/bpf
>  make -C samples/bpf
>  cd samples/bpf; make O=builddir
>  make samples/bpf O=builddir
>  make -C samples/bpf O=builddir
> 
> Fixes: 5f9380572b4b ("samples: bpf: compile and link against full libbpf")
> Reported-by: Björn Töpel <bjorn.topel@gmail.com>
> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>

Unfortunately Björn reports this still doesn't fix the build for him.
Investigating further.

^ permalink raw reply

* Re: KASAN: use-after-free Read in sctp_packet_transmit
From: Eric Biggers @ 2018-05-12 20:00 UTC (permalink / raw)
  To: syzbot
  Cc: davem, linux-kernel, linux-sctp, netdev, nhorman, syzkaller-bugs,
	vyasevich
In-Reply-To: <94eb2c1fcf4cf899b405620eaa66@google.com>

On Fri, Jan 05, 2018 at 02:07:01PM -0800, syzbot wrote:
> Hello,
> 
> syzkaller hit the following crash on
> 8a4816cad00bf14642f0ed6043b32d29a05006ce
> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/master
> compiler: gcc (GCC) 7.1.1 20170620
> .config is attached
> Raw console output is attached.
> Unfortunately, I don't have any reproducer for this bug yet.
> 
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+5adcca18fca253b4cb15@syzkaller.appspotmail.com
> It will help syzbot understand when the bug is fixed. See footer for
> details.
> If you forward the report, please keep this part and the footer.
> 
> ==================================================================
> BUG: KASAN: use-after-free in sctp_packet_transmit+0x3505/0x3750
> net/sctp/output.c:643
> Read of size 8 at addr ffff8801bda9fb80 by task modprobe/23740
> 
> CPU: 1 PID: 23740 Comm: modprobe Not tainted 4.15.0-rc5+ #175
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Call Trace:
>  <IRQ>
>  __dump_stack lib/dump_stack.c:17 [inline]
>  dump_stack+0x194/0x257 lib/dump_stack.c:53
>  print_address_description+0x73/0x250 mm/kasan/report.c:252
>  kasan_report_error mm/kasan/report.c:351 [inline]
>  kasan_report+0x25b/0x340 mm/kasan/report.c:409
>  __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:430
>  sctp_packet_transmit+0x3505/0x3750 net/sctp/output.c:643
>  sctp_outq_flush+0x121b/0x4060 net/sctp/outqueue.c:1197
>  sctp_outq_uncork+0x5a/0x70 net/sctp/outqueue.c:776
>  sctp_cmd_interpreter net/sctp/sm_sideeffect.c:1807 [inline]
>  sctp_side_effects net/sctp/sm_sideeffect.c:1210 [inline]
>  sctp_do_sm+0x4e0/0x6ed0 net/sctp/sm_sideeffect.c:1181
>  sctp_generate_heartbeat_event+0x292/0x3f0 net/sctp/sm_sideeffect.c:406
>  call_timer_fn+0x228/0x820 kernel/time/timer.c:1320
>  expire_timers kernel/time/timer.c:1357 [inline]
>  __run_timers+0x7ee/0xb70 kernel/time/timer.c:1660
>  run_timer_softirq+0x4c/0xb0 kernel/time/timer.c:1686
>  __do_softirq+0x2d7/0xb85 kernel/softirq.c:285
>  invoke_softirq kernel/softirq.c:365 [inline]
>  irq_exit+0x1cc/0x200 kernel/softirq.c:405
>  exiting_irq arch/x86/include/asm/apic.h:540 [inline]
>  smp_apic_timer_interrupt+0x16b/0x700 arch/x86/kernel/apic/apic.c:1052
>  apic_timer_interrupt+0xa9/0xb0 arch/x86/entry/entry_64.S:904
>  </IRQ>
> RIP: 0010:__preempt_count_add arch/x86/include/asm/preempt.h:76 [inline]
> RIP: 0010:__rcu_read_lock include/linux/rcupdate.h:83 [inline]
> RIP: 0010:rcu_read_lock include/linux/rcupdate.h:629 [inline]
> RIP: 0010:__is_insn_slot_addr+0x8f/0x330 kernel/kprobes.c:303
> RSP: 0018:ffff8801d4937430 EFLAGS: 00000283 ORIG_RAX: ffffffffffffff11
> RAX: ffff8801bf13c000 RBX: ffffffff8656dd00 RCX: ffffffff8170bd88
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff8656dd00
> RBP: ffff8801d4937518 R08: 0000000000000000 R09: 1ffff1003a926e67
> R10: ffff8801d4937300 R11: 0000000000000000 R12: 0000000000000000
> R13: 0000000000000000 R14: ffff8801d49374f0 R15: ffff8801dae230c0
>  is_kprobe_insn_slot include/linux/kprobes.h:318 [inline]
>  kernel_text_address+0x132/0x140 kernel/extable.c:150
>  __kernel_text_address+0xd/0x40 kernel/extable.c:107
>  unwind_get_return_address+0x61/0xa0 arch/x86/kernel/unwind_frame.c:18
>  __save_stack_trace+0x7e/0xd0 arch/x86/kernel/stacktrace.c:45
>  save_stack_trace+0x1a/0x20 arch/x86/kernel/stacktrace.c:60
>  save_stack+0x43/0xd0 mm/kasan/kasan.c:447
>  set_track mm/kasan/kasan.c:459 [inline]
>  kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551
>  kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:489
>  kmem_cache_alloc+0x12e/0x760 mm/slab.c:3544
>  kmem_cache_zalloc include/linux/slab.h:678 [inline]
>  file_alloc_security security/selinux/hooks.c:369 [inline]
>  selinux_file_alloc_security+0xae/0x190 security/selinux/hooks.c:3454
>  security_file_alloc+0x6d/0xa0 security/security.c:873
>  get_empty_filp+0x189/0x4f0 fs/file_table.c:129
>  path_openat+0xed/0x3530 fs/namei.c:3496
>  do_filp_open+0x25b/0x3b0 fs/namei.c:3554
>  do_sys_open+0x502/0x6d0 fs/open.c:1059
>  SYSC_open fs/open.c:1077 [inline]
>  SyS_open+0x2d/0x40 fs/open.c:1072
>  entry_SYSCALL_64_fastpath+0x23/0x9a
> RIP: 0033:0x7efdff1bb120
> RSP: 002b:00007ffde6213c08 EFLAGS: 00000246 ORIG_RAX: 0000000000000002
> RAX: ffffffffffffffda RBX: 000055c34fab4090 RCX: 00007efdff1bb120
> RDX: 00000000000001b6 RSI: 0000000000080000 RDI: 00007ffde6213d20
> RBP: 00007ffde6214d90 R08: 0000000000000008 R09: 0000000000000001
> R10: 0000000000000000 R11: 0000000000000246 R12: 000055c34fab4090
> R13: 00007ffde6215de0 R14: 0000000000000000 R15: 0000000000000000
> 
> Allocated by task 23739:
>  save_stack+0x43/0xd0 mm/kasan/kasan.c:447
>  set_track mm/kasan/kasan.c:459 [inline]
>  kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551
>  kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:489
>  kmem_cache_alloc+0x12e/0x760 mm/slab.c:3544
>  kmem_cache_zalloc include/linux/slab.h:678 [inline]
>  sctp_chunkify+0xce/0x3f0 net/sctp/sm_make_chunk.c:1329
>  _sctp_make_chunk+0x13c/0x260 net/sctp/sm_make_chunk.c:1397
>  sctp_make_control+0x39/0x150 net/sctp/sm_make_chunk.c:1433
>  sctp_make_heartbeat+0x90/0x420 net/sctp/sm_make_chunk.c:1151
>  sctp_sf_heartbeat.isra.24+0x26/0x180 net/sctp/sm_statefuns.c:973
>  sctp_sf_do_prm_requestheartbeat+0x27/0x100 net/sctp/sm_statefuns.c:5251
>  sctp_do_sm+0x192/0x6ed0 net/sctp/sm_sideeffect.c:1178
>  sctp_primitive_REQUESTHEARTBEAT+0xa0/0xd0 net/sctp/primitive.c:200
>  sctp_apply_peer_addr_params+0x759/0xf30 net/sctp/socket.c:2462
>  sctp_setsockopt_peer_addr_params+0x36f/0x5f0 net/sctp/socket.c:2658
>  sctp_setsockopt+0x199a/0x61a0 net/sctp/socket.c:4173
>  sock_common_setsockopt+0x95/0xd0 net/core/sock.c:2978
>  SYSC_setsockopt net/socket.c:1821 [inline]
>  SyS_setsockopt+0x189/0x360 net/socket.c:1800
>  entry_SYSCALL_64_fastpath+0x23/0x9a
> 
> Freed by task 23739:
>  save_stack+0x43/0xd0 mm/kasan/kasan.c:447
>  set_track mm/kasan/kasan.c:459 [inline]
>  kasan_slab_free+0x71/0xc0 mm/kasan/kasan.c:524
>  __cache_free mm/slab.c:3488 [inline]
>  kmem_cache_free+0x83/0x2a0 mm/slab.c:3746
>  sctp_chunk_destroy net/sctp/sm_make_chunk.c:1450 [inline]
>  sctp_chunk_put+0x2fd/0x420 net/sctp/sm_make_chunk.c:1473
>  sctp_chunk_free+0x53/0x60 net/sctp/sm_make_chunk.c:1460
>  sctp_packet_transmit+0xf5d/0x3750 net/sctp/output.c:646
>  sctp_outq_flush+0x121b/0x4060 net/sctp/outqueue.c:1197
>  sctp_outq_uncork+0x5a/0x70 net/sctp/outqueue.c:776
>  sctp_cmd_interpreter net/sctp/sm_sideeffect.c:1807 [inline]
>  sctp_side_effects net/sctp/sm_sideeffect.c:1210 [inline]
>  sctp_do_sm+0x4e0/0x6ed0 net/sctp/sm_sideeffect.c:1181
>  sctp_primitive_REQUESTHEARTBEAT+0xa0/0xd0 net/sctp/primitive.c:200
>  sctp_apply_peer_addr_params+0x759/0xf30 net/sctp/socket.c:2462
>  sctp_setsockopt_peer_addr_params+0x36f/0x5f0 net/sctp/socket.c:2658
>  sctp_setsockopt+0x199a/0x61a0 net/sctp/socket.c:4173
>  sock_common_setsockopt+0x95/0xd0 net/core/sock.c:2978
>  SYSC_setsockopt net/socket.c:1821 [inline]
>  SyS_setsockopt+0x189/0x360 net/socket.c:1800
>  entry_SYSCALL_64_fastpath+0x23/0x9a
> 
> The buggy address belongs to the object at ffff8801bda9fb80
>  which belongs to the cache sctp_chunk of size 256
> The buggy address is located 0 bytes inside of
>  256-byte region [ffff8801bda9fb80, ffff8801bda9fc80)
> The buggy address belongs to the page:
> page:00000000d1261812 count:1 mapcount:0 mapping:000000003e733284 index:0x0
> flags: 0x2fffc0000000100(slab)
> raw: 02fffc0000000100 ffff8801bda9f040 0000000000000000 000000010000000c
> raw: ffffea000714c9e0 ffffea0006fa8520 ffff8801d3246c80 0000000000000000
> page dumped because: kasan: bad access detected
> 
> Memory state around the buggy address:
>  ffff8801bda9fa80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>  ffff8801bda9fb00: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
> > ffff8801bda9fb80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>                    ^
>  ffff8801bda9fc00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>  ffff8801bda9fc80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
> ==================================================================
> 
> 
> ---
> This bug is generated by a dumb bot. It may contain errors.
> See https://goo.gl/tpsmEJ for details.
> Direct all questions to syzkaller@googlegroups.com.
> 
> syzbot will keep track of this bug report.
> If you forgot to add the Reported-by tag, once the fix for this bug is
> merged
> into any tree, please reply to this email with:
> #syz fix: exact-commit-title
> To mark this as a duplicate of another syzbot report, please reply with:
> #syz dup: exact-subject-of-another-report
> If it's a one-off invalid bug report, please reply with:
> #syz invalid
> Note: if the crash happens again, it will cause creation of a new bug
> report.
> Note: all commands must start from beginning of the line in the email body.

No reproducer, this only happened once (Jan 5 on net-next), and there have been
a lot of SCTP fixes in the meantime, including commit 6910e25de225 ("sctp:
remove sctp_chunk_put from fail_mark err path in sctp_ulpevent_make_rcvmsg")
which may be relevant since it fixed a case of incorrect reference counting of
'struct sctp_chunk', which is the struct in which the use-after-free occurred
here.  So I'm just invalidating this bug:

#syz invalid

- Eric

^ permalink raw reply

