Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH iproute2-next 1/1] tc: jsonify skbedit action
From: David Ahern @ 2018-04-08 17:50 UTC (permalink / raw)
  To: Roman Mashak, dsahern; +Cc: stephen, netdev, kernel, jhs, xiyou.wangcong, jiri
In-Reply-To: <1522783461-14269-1-git-send-email-mrv@mojatatu.com>

On 4/3/18 1:24 PM, Roman Mashak wrote:
>  	if (tb[TCA_SKBEDIT_PTYPE] != NULL) {
> -		ptype = RTA_DATA(tb[TCA_SKBEDIT_PTYPE]);
> -		if (*ptype == PACKET_HOST)
> -			fprintf(f, " ptype host");
> -		else if (*ptype == PACKET_BROADCAST)
> -			fprintf(f, " ptype broadcast");
> -		else if (*ptype == PACKET_MULTICAST)
> -			fprintf(f, " ptype multicast");
> -		else if (*ptype == PACKET_OTHERHOST)
> -			fprintf(f, " ptype otherhost");
> +		ptype = rta_getattr_u16(tb[TCA_SKBEDIT_PTYPE]);
> +		if (ptype == PACKET_HOST)
> +			print_string(PRINT_ANY, "ptype", " %s", "ptype host");
> +		else if (ptype == PACKET_BROADCAST)
> +			print_string(PRINT_ANY, "ptype", " %s",
> +				     "ptype broadcast");
> +		else if (ptype == PACKET_MULTICAST)
> +			print_string(PRINT_ANY, "ptype", " %s",
> +				     "ptype multicast");
> +		else if (ptype == PACKET_OTHERHOST)
> +			print_string(PRINT_ANY, "ptype", " %s",
> +				     "ptype otherhost");

Shouldn't that be:
    print_string(PRINT_ANY, "ptype", "ptype %s", "otherhost");

And ditto for the other strings.

>  		else
> -			fprintf(f, " ptype %d", *ptype);
> +			print_uint(PRINT_ANY, "ptype", " %u", ptype);

And then this one needs 'ptype' before %u

^ permalink raw reply

* Re: [PATCH iproute2-next 1/1] tc: jsonify connmark action
From: David Ahern @ 2018-04-08 17:54 UTC (permalink / raw)
  To: Roman Mashak; +Cc: stephen, netdev, kernel, jhs, xiyou.wangcong, jiri
In-Reply-To: <1522760995-12441-1-git-send-email-mrv@mojatatu.com>

On 4/3/18 7:09 AM, Roman Mashak wrote:
> Signed-off-by: Roman Mashak <mrv@mojatatu.com>
> ---
>  tc/m_connmark.c | 16 ++++++++++------
>  1 file changed, 10 insertions(+), 6 deletions(-)

applied to iproute2-next

^ permalink raw reply

* Re: [PATCH iproute2-next 1/1] tc: jsonify tunnel_key action
From: David Ahern @ 2018-04-08 17:54 UTC (permalink / raw)
  To: Roman Mashak; +Cc: stephen, netdev, kernel, jhs, xiyou.wangcong, jiri
In-Reply-To: <1522862478-24580-1-git-send-email-mrv@mojatatu.com>

On 4/4/18 11:21 AM, Roman Mashak wrote:
> Signed-off-by: Roman Mashak <mrv@mojatatu.com>
> ---
>  tc/m_tunnel_key.c | 36 +++++++++++++++++++++++++-----------
>  1 file changed, 25 insertions(+), 11 deletions(-)
> 

applied to iproute2-next

^ permalink raw reply

* Re: [PATCH] make net_gso_ok return false when gso_type is zero(invalid)
From: Wenhua Shi @ 2018-04-08 18:41 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-kernel
In-Reply-To: <20180408.125114.2111042327285982582.davem@davemloft.net>

2018-04-08 18:51 GMT+02:00 David Miller <davem@davemloft.net>:
>
> From: Wenhua Shi <march511@gmail.com>
> Date: Fri,  6 Apr 2018 03:43:39 +0200
>
> > Signed-off-by: Wenhua Shi <march511@gmail.com>
>
> This precondition should be made impossible instead of having to do
> an extra check everywhere that this helper is invoked, many of which
> are in fast paths.

I believe the precondition you said is quite true. In my situation, I
have to disable GSO for some packet and I notice that it leads to a
worse performance (slower than 1Mbps, was almost 800Mbps).

Here's the hook I use on debian 9.4, kernel version 4.9:

    #include <linux/init.h>
    #include <linux/module.h>
    #include <linux/kernel.h>
    #include <linux/netfilter.h>
    #include <linux/netfilter_ipv4.h>
    #include <linux/netfilter_ipv6.h>
    #include <linux/skbuff.h>
    #include <linux/tcp.h>
    #include <linux/ip.h>

    unsigned int hook_outgoing (
        void * priv,
        struct sk_buff * skb,
        const struct nf_hook_state * state)
    {
        /* for some reason I have to disable GSO */
        skb_gso_reset(skb);

        /* After I force sk_can_gso to return false here, the
performance comes back normal. */
        // skb->sk->sk_gso_type = ~0;

        return NF_ACCEPT;

    }

    static struct nf_hook_ops hook =
    {
        .hook = hook_outgoing,
        .pf = PF_INET,
        .hooknum = NF_INET_POST_ROUTING,
        .priority = NF_IP_PRI_LAST,
    };

    static int __init init_testing(void)
    {
        nf_register_hook(&hook);
        return 0;
    }

    static void __exit exit_testing(void)
    {
        nf_unregister_hook(&hook);
    }

    module_init(init_testing);
    module_exit(exit_testing);


Here are the performance measurements.
Without the previous hook:

    root@debian-s-1vcpu-1gb-sfo1-01:~/test# iperf -c myanothernormaldebian -d
    ------------------------------------------------------------
    Server listening on TCP port 5001
    TCP window size: 85.3 KByte (default)
    ------------------------------------------------------------
    ------------------------------------------------------------
    Client connecting to myanothernormaldebian, TCP port 5001
    TCP window size:  255 KByte (default)
    ------------------------------------------------------------
    [  3] local 192.241.204.XXX port 60528 connected with
104.131.148.XXX port 5001
    [  5] local 192.241.204.XXX port 5001 connected with
104.131.148.XXX port 58576
    [ ID] Interval       Transfer     Bandwidth
    [  3]  0.0-10.0 sec   922 MBytes   773 Mbits/sec
    [  5]  0.0-10.1 sec  1.00 GBytes   849 Mbits/sec

And with the previous hook:

    root@debian-s-1vcpu-1gb-sfo1-01:~/test# iperf -c myanothernormaldebian -d
    ------------------------------------------------------------
    Server listening on TCP port 5001
    TCP window size: 85.3 KByte (default)
    ------------------------------------------------------------
    ------------------------------------------------------------
    Client connecting to myanothernormaldebian, TCP port 5001
    TCP window size: 85.0 KByte (default)
    ------------------------------------------------------------
    [  3] local 192.241.204.XXX port 60530 connected with
104.131.148.XXX port 5001
    [  5] local 192.241.204.XXX port 5001 connected with
104.131.148.XXX port 58578
    [ ID] Interval       Transfer     Bandwidth
    [  5]  0.0-10.2 sec  1.02 GBytes   864 Mbits/sec
    [  3]  0.0-13.5 sec   170 KBytes   103 Kbits/sec



Or it's just because of that I'm disabling the GSO in a wrong way?

^ permalink raw reply

* Re: suspicious RCU usage at ./include/net/inet_sock.h:LINE
From: Eric Biggers @ 2018-04-08 19:29 UTC (permalink / raw)
  To: syzbot
  Cc: davem, dccp, dvyukov, gerrit, kuznet, linux-kernel, netdev,
	syzkaller-bugs, yoshfuji
In-Reply-To: <001a113f8d5653a67c0561346ed3@google.com>

On Mon, Dec 25, 2017 at 05:45:00PM -0800, syzbot wrote:
> syzkaller has found reproducer for the following crash on
> fba961ab29e5ffb055592442808bb0f7962e05da
> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/master
> compiler: gcc (GCC) 7.1.1 20170620
> .config is attached
> Raw console output is attached.
> C reproducer is attached
> syzkaller reproducer is attached. See https://goo.gl/kgGztJ
> for information about syzkaller reproducers
> 
> 
> Can not set IPV6_FL_F_REFLECT if flowlabel_consistency sysctl is enable
> 
> =============================
> WARNING: suspicious RCU usage
> 4.15.0-rc4+ #164 Not tainted
> -----------------------------
> ./include/net/inet_sock.h:136 suspicious rcu_dereference_check() usage!
> 
> other info that might help us debug this:
> 
> 
> rcu_scheduler_active = 2, debug_locks = 1
> 1 lock held by syzkaller667189/5780:
>  #0:  (sk_lock-AF_INET6){+.+.}, at: [<000000008d7d4e62>] lock_sock
> include/net/sock.h:1462 [inline]
>  #0:  (sk_lock-AF_INET6){+.+.}, at: [<000000008d7d4e62>]
> do_ipv6_setsockopt.isra.9+0x23d/0x38f0 net/ipv6/ipv6_sockglue.c:167
> 
> stack backtrace:
> CPU: 0 PID: 5780 Comm: syzkaller667189 Not tainted 4.15.0-rc4+ #164
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:17 [inline]
>  dump_stack+0x194/0x257 lib/dump_stack.c:53
>  lockdep_rcu_suspicious+0x123/0x170 kernel/locking/lockdep.c:4585
>  ireq_opt_deref include/net/inet_sock.h:135 [inline]
>  inet_csk_route_req+0x824/0xca0 net/ipv4/inet_connection_sock.c:544
>  dccp_v4_send_response+0xa7/0x640 net/dccp/ipv4.c:485
>  dccp_v4_conn_request+0x9ee/0x11b0 net/dccp/ipv4.c:633
>  dccp_v6_conn_request+0xd30/0x1350 net/dccp/ipv6.c:317
>  dccp_rcv_state_process+0x574/0x1620 net/dccp/input.c:612
>  dccp_v4_do_rcv+0xeb/0x160 net/dccp/ipv4.c:682
>  dccp_v6_do_rcv+0x81a/0x9b0 net/dccp/ipv6.c:578
>  sk_backlog_rcv include/net/sock.h:907 [inline]
>  __release_sock+0x124/0x360 net/core/sock.c:2274
>  release_sock+0xa4/0x2a0 net/core/sock.c:2789
>  do_ipv6_setsockopt.isra.9+0x50f/0x38f0 net/ipv6/ipv6_sockglue.c:898
>  ipv6_setsockopt+0xd7/0x150 net/ipv6/ipv6_sockglue.c:922
>  dccp_setsockopt+0x85/0xd0 net/dccp/proto.c:573
>  sock_common_setsockopt+0x95/0xd0 net/core/sock.c:2978
>  SYSC_setsockopt net/socket.c:1821 [inline]
>  SyS_setsockopt+0x189/0x360 net/socket.c:1800
>  entry_SYSCALL_64_fastpath+0x1f/0x96
> RIP: 0033:0x445ec9
> RSP: 002b:00007fa001b58db8 EFLAGS: 00000297 ORIG_RAX: 0000000000000036
> RAX: ffffffffffffffda RBX: 00000000006dbc24 RCX: 0000000000445ec9
> RDX: 0000000000000020 RSI: 0000000000000029 RDI: 0000000000000004
> RBP: 00000000006dbc20 R08: 0000000000000020 R09: 0000000000000000
> R10: 000000002030a000 R11: 0000000000000297 R12: 0000000000000000
> R13: 00007fff809eec1f R14: 00007fa001b599c0 R15: 0000000000000001
> 
> =============================
> WARNING: suspicious RCU usage
> 4.15.0-rc4+ #164 Not tainted
> -----------------------------
> ./include/net/inet_sock.h:136 suspicious rcu_dereference_check() usage!
> 
> other info that might help us debug this:
> 
> 
> rcu_scheduler_active = 2, debug_locks = 1
> 1 lock held by syzkaller667189/5780:
>  #0:  (sk_lock-AF_INET6){+.+.}, at: [<000000008d7d4e62>] lock_sock
> include/net/sock.h:1462 [inline]
>  #0:  (sk_lock-AF_INET6){+.+.}, at: [<000000008d7d4e62>]
> do_ipv6_setsockopt.isra.9+0x23d/0x38f0 net/ipv6/ipv6_sockglue.c:167
> 
> stack backtrace:
> CPU: 0 PID: 5780 Comm: syzkaller667189 Not tainted 4.15.0-rc4+ #164
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:17 [inline]
>  dump_stack+0x194/0x257 lib/dump_stack.c:53
>  lockdep_rcu_suspicious+0x123/0x170 kernel/locking/lockdep.c:4585
>  ireq_opt_deref include/net/inet_sock.h:135 [inline]
>  dccp_v4_send_response+0x4b0/0x640 net/dccp/ipv4.c:496
>  dccp_v4_conn_request+0x9ee/0x11b0 net/dccp/ipv4.c:633
>  dccp_v6_conn_request+0xd30/0x1350 net/dccp/ipv6.c:317
>  dccp_rcv_state_process+0x574/0x1620 net/dccp/input.c:612
>  dccp_v4_do_rcv+0xeb/0x160 net/dccp/ipv4.c:682
>  dccp_v6_do_rcv+0x81a/0x9b0 net/dccp/ipv6.c:578
>  sk_backlog_rcv include/net/sock.h:907 [inline]
>  __release_sock+0x124/0x360 net/core/sock.c:2274
>  release_sock+0xa4/0x2a0 net/core/sock.c:2789
>  do_ipv6_setsockopt.isra.9+0x50f/0x38f0 net/ipv6/ipv6_sockglue.c:898
>  ipv6_setsockopt+0xd7/0x150 net/ipv6/ipv6_sockglue.c:922
>  dccp_setsockopt+0x85/0xd0 net/dccp/proto.c:573
>  sock_common_setsockopt+0x95/0xd0 net/core/sock.c:2978
>  SYSC_setsockopt net/socket.c:1821 [inline]
>  SyS_setsockopt+0x189/0x360 net/socket.c:1800
>  entry_SYSCALL_64_fastpath+0x1f/0x96
> RIP: 0033:0x445ec9
> RSP: 002b:00007fa001b58db8 EFLAGS: 00000297 ORIG_RAX: 0000000000000036
> RAX: ffffffffffffffda RBX: 00000000006dbc24 RCX: 0000000000445ec9
> RDX: 0000000000000020 RSI: 0000000000000029 RDI: 0000000000000004
> RBP: 00000000006dbc20 R08: 0000000000000020 R09: 0000000000000000
> R10: 000000002030a000 R11: 0000000000000297 R12: 0000000000000000
> R13: 00007fff809eec1f R14: 00007fa001b599c0 R15: 0000000000000001

syzbot stopped hitting this for some reason, but the bug is still there.
Here's a simplified reproducer that works on Linus' tree as of today:

#include <linux/in.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

int main()
{
        int is_parent = (fork() != 0);
        for (;;) {
                int fd = socket(AF_INET, SOCK_DCCP, 0);
                struct sockaddr_in addr = {
                        .sin_family = AF_INET,
                        .sin_port = htobe16(0x4e23),
                };
                if (is_parent) {
                        connect(fd, (void *)&addr, sizeof(addr));
                } else {
                        bind(fd, (void *)&addr, sizeof(addr));
                        listen(fd, 100);
                        setsockopt(fd, 0, 0xFFFF, NULL, 0);
                }
                close(fd);
        }
}

^ permalink raw reply

* Re: WARNING in kcm_exit_net (2)
From: Eric Biggers @ 2018-04-08 19:38 UTC (permalink / raw)
  To: syzbot
  Cc: davem, ebiggers, edumazet, linux-kernel, mingo, netdev,
	syzkaller-bugs, tom, xiaolou4617, xiyou.wangcong
In-Reply-To: <001a113ed51610bad9055f2d130e@google.com>

On Wed, Nov 29, 2017 at 10:08:01PM -0800, syzbot wrote:
> Hello,
> 
> syzkaller hit the following crash on
> 1d3b78bbc6e983fabb3fbf91b76339bf66e4a12c
> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/master
> compiler: gcc (GCC) 7.1.1 20170620
> .config is attached
> Raw console output is attached.
> 
> Unfortunately, I don't have any reproducer for this bug yet.
> 
> 
> WARNING: CPU: 1 PID: 4099 at net/kcm/kcmsock.c:2014 kcm_exit_net+0x317/0x360
> net/kcm/kcmsock.c:2014
> Kernel panic - not syncing: panic_on_warn set ...
> 
> CPU: 1 PID: 4099 Comm: kworker/u4:9 Not tainted 4.14.0+ #129
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Workqueue: netns cleanup_net
> device lo entered promiscuous mode
> Call Trace:
>  __dump_stack lib/dump_stack.c:17 [inline]
>  dump_stack+0x194/0x257 lib/dump_stack.c:53
>  panic+0x1e4/0x41c kernel/panic.c:183
>  __warn+0x1dc/0x200 kernel/panic.c:547
>  report_bug+0x211/0x2d0 lib/bug.c:184
>  fixup_bug.part.11+0x37/0x80 arch/x86/kernel/traps.c:177
>  fixup_bug arch/x86/kernel/traps.c:246 [inline]
>  do_error_trap+0x2d7/0x3e0 arch/x86/kernel/traps.c:295
>  do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:314
>  invalid_op+0x18/0x20 arch/x86/entry/entry_64.S:926
> RIP: 0010:kcm_exit_net+0x317/0x360 net/kcm/kcmsock.c:2014
> RSP: 0000:ffff8801d9d27198 EFLAGS: 00010293
> RAX: ffff8801c0884540 RBX: 1ffff1003b3a4e33 RCX: ffffffff84a738e7
> RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000286
> RBP: ffff8801d9d27260 R08: 0000000000000003 R09: 1ffff1003b3a4e0c
> R10: ffff8801c0884540 R11: 0000000000000003 R12: 1ffff1003b3a4e37
> R13: ffff8801d9d27238 R14: ffff8801c5fec8a0 R15: ffff8801c4b62e40
>  ops_exit_list.isra.6+0xae/0x150 net/core/net_namespace.c:142
>  cleanup_net+0x5c7/0xb60 net/core/net_namespace.c:484
>  process_one_work+0xbfd/0x1be0 kernel/workqueue.c:2112
>  worker_thread+0x223/0x1990 kernel/workqueue.c:2246
>  kthread+0x37a/0x440 kernel/kthread.c:238
>  ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:437
> Dumping ftrace buffer:
>    (ftrace buffer empty)
> Kernel Offset: disabled
> Rebooting in 86400 seconds..
> 
> 
> ---
> This bug is generated by a dumb bot. It may contain errors.
> See https://goo.gl/tpsmEJ for details.
> Direct all questions to syzkaller@googlegroups.com.
> Please credit me with: Reported-by: syzbot <syzkaller@googlegroups.com>
> 
> syzbot will keep track of this bug report.
> Once a fix for this bug is committed, please reply to this email with:
> #syz fix: exact-commit-title
> To mark this as a duplicate of another syzbot report, please reply with:
> #syz dup: exact-subject-of-another-report
> If it's a one-off invalid bug report, please reply with:
> #syz invalid

No reproducer, this last occurred on Dec 26 (103 days ago, commit fba961ab29e),
and there have been several potentially relevant KCM fixes since then such as
581e7226a5d ("kcm: Only allow TCP sockets to be attached to a KCM mux") and
e5571240236 ("kcm: Check if sk_user_data already set in kcm_attach").  So I am
invalidating this for syzbot, but if anyone thinks this may still be a bug then
feel free to look into it.

#syz invalid

Eric

^ permalink raw reply

* Re: WARNING in skb_warn_bad_offload
From: Eric Biggers @ 2018-04-08 19:53 UTC (permalink / raw)
  To: syzbot
  Cc: Willem de Bruijn, Dmitry Vyukov, David Miller, Jamal Hadi Salim,
	Jiří Pírko, LKML, netdev, syzkaller-bugs,
	Cong Wang
In-Reply-To: <CACT4Y+ZDRNWj1iRqLvFLjOXJYc2XERWPh+Qa0hNxUgMD1uaeBA@mail.gmail.com>

On Wed, Nov 01, 2017 at 09:50:18PM +0300, 'Dmitry Vyukov' via syzkaller-bugs wrote:
> On Wed, Nov 1, 2017 at 9:48 PM, syzbot
> <bot+abb66e15eb1b298dfe4a13375f18a278d5940e6f@syzkaller.appspotmail.com>
> wrote:
> > Hello,
> >
> > syzkaller hit the following crash on
> > 720bbe532b7c8f5613b48dea627fc58ed9ace707
> > git://git.cmpxchg.org/linux-mmots.git/master
> > compiler: gcc (GCC) 7.1.1 20170620
> > .config is attached
> > Raw console output is attached.
> > C reproducer is attached
> > syzkaller reproducer is attached. See https://goo.gl/kgGztJ
> > for information about syzkaller reproducers
> 
> 
> This also happens on more recent commits, including linux-next
> 36ef71cae353f88fd6e095e2aaa3e5953af1685d (Oct 20):
> 
> syz0: caps=(0x00000400000058c1, 0x0000000000000000) len=4203
> data_len=2810 gso_size=8465 gso_type=3 ip_summed=0
> ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 3473 at net/core/dev.c:2618
> skb_warn_bad_offload.cold.139+0x224/0x261 net/core/dev.c:2613
> Kernel panic - not syncing: panic_on_warn set ...
> 
> CPU: 0 PID: 3473 Comm: a.out Not tainted 4.14.0-rc5-next-20171018 #15
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:16 [inline]
>  dump_stack+0x1a8/0x272 lib/dump_stack.c:52
>  panic+0x21e/0x4b7 kernel/panic.c:183
>  __warn.cold.6+0x182/0x187 kernel/panic.c:546
>  report_bug+0x232/0x330 lib/bug.c:183
>  fixup_bug+0x3f/0x90 arch/x86/kernel/traps.c:177
>  do_trap_no_signal arch/x86/kernel/traps.c:211 [inline]
>  do_trap+0x132/0x280 arch/x86/kernel/traps.c:260
>  do_error_trap+0x11f/0x390 arch/x86/kernel/traps.c:297
>  do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:310
>  invalid_op+0x18/0x20 arch/x86/entry/entry_64.S:905
> RIP: 0010:skb_warn_bad_offload.cold.139+0x224/0x261 net/core/dev.c:2613
> RSP: 0018:ffff880064797038 EFLAGS: 00010286
> RAX: 000000000000006f RBX: ffff88006365efe8 RCX: 0000000000000000
> RDX: 000000000000006f RSI: ffffffff815c88c1 RDI: ffffed000c8f2dfd
> RBP: ffff880064797090 R08: ffff8800686f86c0 R09: 0000000000000002
> R10: ffff8800686f86c0 R11: 0000000000000000 R12: ffff8800538b1680
> R13: 0000000000000000 R14: ffff8800538b1680 R15: 0000000000002111
>  __skb_gso_segment+0x69e/0x860 net/core/dev.c:2824
>  skb_gso_segment include/linux/netdevice.h:3971 [inline]
>  validate_xmit_skb+0x29f/0xca0 net/core/dev.c:3074
>  validate_xmit_skb_list+0xb7/0x120 net/core/dev.c:3125
>  sch_direct_xmit+0x5b5/0x710 net/sched/sch_generic.c:181
>  __dev_xmit_skb net/core/dev.c:3206 [inline]
>  __dev_queue_xmit+0x1e41/0x2350 net/core/dev.c:3473
>  dev_queue_xmit+0x17/0x20 net/core/dev.c:3538
>  packet_snd net/packet/af_packet.c:2956 [inline]
>  packet_sendmsg+0x487a/0x64b0 net/packet/af_packet.c:2981
>  sock_sendmsg_nosec net/socket.c:632 [inline]
>  sock_sendmsg+0xd2/0x120 net/socket.c:642
>  ___sys_sendmsg+0x7cc/0x900 net/socket.c:2048
>  __sys_sendmsg+0xe6/0x220 net/socket.c:2082
>  SYSC_sendmsg net/socket.c:2093 [inline]
>  SyS_sendmsg+0x36/0x60 net/socket.c:2089
>  entry_SYSCALL_64_fastpath+0x1f/0xbe
> RIP: 0033:0x44bab9
> RSP: 002b:00000000007eff18 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
> RAX: ffffffffffffffda RBX: 0000000020001046 RCX: 000000000044bab9
> RDX: 0000000000004010 RSI: 00000000207fcfc8 RDI: 0000000000000004
> RBP: 0000000000000086 R08: 850b2da14d2a3706 R09: 0000000000000000
> R10: 1b91126b7f398aaa R11: 0000000000000246 R12: 0000000000000000
> R13: 0000000000407950 R14: 00000000004079e0 R15: 0000000000000000
> 
> 
> 
> 
> 
> > ------------[ cut here ]------------
> > WARNING: CPU: 0 PID: 2986 at net/core/dev.c:2585
> > skb_warn_bad_offload+0x2a9/0x380 net/core/dev.c:2580
> > Kernel panic - not syncing: panic_on_warn set ...
> >
> > CPU: 0 PID: 2986 Comm: syzkaller546001 Not tainted 4.13.0-mm1+ #7
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> > Google 01/01/2011
> > Call Trace:
> >  __dump_stack lib/dump_stack.c:16 [inline]
> >  dump_stack+0x194/0x257 lib/dump_stack.c:52
> >  panic+0x1e4/0x417 kernel/panic.c:181
> >  __warn+0x1c4/0x1d9 kernel/panic.c:542
> >  report_bug+0x211/0x2d0 lib/bug.c:183
> >  fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:178
> >  do_trap_no_signal arch/x86/kernel/traps.c:212 [inline]
> >  do_trap+0x260/0x390 arch/x86/kernel/traps.c:261
> >  do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:298
> >  do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:311
> >  invalid_op+0x18/0x20 arch/x86/entry/entry_64.S:905
> > RIP: 0010:skb_warn_bad_offload+0x2a9/0x380 net/core/dev.c:2580
> > RSP: 0018:ffff8801ce73f0a0 EFLAGS: 00010282
> > RAX: 000000000000006f RBX: ffff8801cd84cde0 RCX: 0000000000000000
> > RDX: 000000000000006f RSI: 1ffff10039ce7dd4 RDI: ffffed0039ce7e08
> > RBP: ffff8801ce73f0f8 R08: ffff8801ce73e790 R09: 0000000000000000
> > R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801ce7802c0
> > R13: 0000000000000000 R14: ffff8801ce7802c0 R15: 0000000000002111
> >  __skb_gso_segment+0x607/0x7f0 net/core/dev.c:2791
> >  skb_gso_segment include/linux/netdevice.h:3951 [inline]
> >  validate_xmit_skb+0x4ba/0xb20 net/core/dev.c:3041
> >  validate_xmit_skb_list+0xb7/0x120 net/core/dev.c:3092
> >  sch_direct_xmit+0x3b6/0x6d0 net/sched/sch_generic.c:181
> >  __dev_xmit_skb net/core/dev.c:3173 [inline]
> >  __dev_queue_xmit+0x15fe/0x1e40 net/core/dev.c:3440
> >  dev_queue_xmit+0x17/0x20 net/core/dev.c:3505
> >  packet_snd net/packet/af_packet.c:2950 [inline]
> >  packet_sendmsg+0x3bbf/0x6030 net/packet/af_packet.c:2975
> >  sock_sendmsg_nosec net/socket.c:633 [inline]
> >  sock_sendmsg+0xca/0x110 net/socket.c:643
> >  ___sys_sendmsg+0x75b/0x8a0 net/socket.c:2049
> >  __sys_sendmsg+0xe5/0x210 net/socket.c:2083
> >  SYSC_sendmsg net/socket.c:2094 [inline]
> >  SyS_sendmsg+0x2d/0x50 net/socket.c:2090
> >  entry_SYSCALL_64_fastpath+0x1f/0xbe
> > RIP: 0033:0x445489
> > RSP: 002b:00000000007efe68 EFLAGS: 00000217 ORIG_RAX: 000000000000002e
> > RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000445489
> > RDX: 0000000000004010 RSI: 00000000207fcfc8 RDI: 0000000000000004
> > RBP: 0000000000000082 R08: 000000000000cdf3 R09: 000000000000cdf3
> > R10: 0000000000000004 R11: 0000000000000217 R12: 0000000000402ae0
> > R13: 0000000000402b70 R14: 0000000000000000 R15: 0000000000000000
> > Dumping ftrace buffer:
> >    (ftrace buffer empty)
> > Kernel Offset: disabled
> > Rebooting in 86400 seconds..
> >
> >
> > ---
> > This bug is generated by a dumb bot. It may contain errors.
> > See https://goo.gl/tpsmEJ for details.
> > Direct all questions to syzkaller@googlegroups.com.
> > Please credit me with: Reported-by: syzbot <syzkaller@googlegroups.com>
> >
> > syzbot will keep track of this bug report.
> > Once a fix for this bug is committed, please reply to this email with:
> > #syz fix: exact-commit-title

Apparently fixed by commit 8d74e9f88d65a, so telling syzbot:

#syz fix: net: avoid skb_warn_bad_offload on IS_ERR

- Eric

^ permalink raw reply

* [PATCH v3] dp83640: Ensure against premature access to PHY registers after reset
From: Esben Haabendal @ 2018-04-08 20:17 UTC (permalink / raw)
  To: netdev
  Cc: Esben Haabendal, Richard Cochran, Andrew Lunn, Florian Fainelli,
	linux-kernel
In-Reply-To: <20180406170844.4248-1-esben.haabendal@gmail.com>

From: Esben Haabendal <eha@deif.com>

The datasheet specifies a 3uS pause after performing a software
reset. The default implementation of genphy_soft_reset() does not
provide this, so implement soft_reset with the needed pause.

Signed-off-by: Esben Haabendal <eha@deif.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
---
 drivers/net/phy/dp83640.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/drivers/net/phy/dp83640.c b/drivers/net/phy/dp83640.c
index 654f42d00092..a6c87793d899 100644
--- a/drivers/net/phy/dp83640.c
+++ b/drivers/net/phy/dp83640.c
@@ -1207,6 +1207,23 @@ static void dp83640_remove(struct phy_device *phydev)
 	kfree(dp83640);
 }
 
+static int dp83640_soft_reset(struct phy_device *phydev)
+{
+	int ret;
+
+	ret = genphy_soft_reset(phydev);
+	if (ret < 0)
+		return ret;
+
+	/* From DP83640 datasheet: "Software driver code must wait 3 us
+	 * following a software reset before allowing further serial MII
+	 * operations with the DP83640."
+	 */
+	udelay(10);		/* Taking udelay inaccuracy into account */
+
+	return 0;
+}
+
 static int dp83640_config_init(struct phy_device *phydev)
 {
 	struct dp83640_private *dp83640 = phydev->priv;
@@ -1501,6 +1518,7 @@ static struct phy_driver dp83640_driver = {
 	.flags		= PHY_HAS_INTERRUPT,
 	.probe		= dp83640_probe,
 	.remove		= dp83640_remove,
+	.soft_reset	= dp83640_soft_reset,
 	.config_init	= dp83640_config_init,
 	.ack_interrupt  = dp83640_ack_interrupt,
 	.config_intr    = dp83640_config_intr,
-- 
2.16.3

^ permalink raw reply related

* Re: [PATCH bpf-next v8 05/11] seccomp,landlock: Enforce Landlock programs per process hierarchy
From: Andy Lutomirski @ 2018-04-08 21:06 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Andy Lutomirski, Alexei Starovoitov, Daniel Borkmann, LKML,
	Alexei Starovoitov, Arnaldo Carvalho de Melo, Casey Schaufler,
	David Drysdale, David S . Miller, Eric W . Biederman, Jann Horn,
	Jonathan Corbet, Michael Kerrisk, Kees Cook, Paul Moore,
	Sargun Dhillon, Serge E . Hallyn, Shuah Khan, Tejun Heo,
	Thomas Graf, Tycho Andersen, Will Drewry, Kernel Hardening,
	Linux 
In-Reply-To: <498f8193-c909-78b2-e4ca-c1dd05605255@digikod.net>

On Sun, Apr 8, 2018 at 6:13 AM, Mickaël Salaün <mic@digikod.net> wrote:
>
> On 02/27/2018 10:48 PM, Mickaël Salaün wrote:
>>
>> On 27/02/2018 17:39, Andy Lutomirski wrote:
>>> On Tue, Feb 27, 2018 at 5:32 AM, Alexei Starovoitov
>>> <alexei.starovoitov@gmail.com> wrote:
>>>> On Tue, Feb 27, 2018 at 05:20:55AM +0000, Andy Lutomirski wrote:
>>>>> On Tue, Feb 27, 2018 at 4:54 AM, Alexei Starovoitov
>>>>> <alexei.starovoitov@gmail.com> wrote:
>>>>>> On Tue, Feb 27, 2018 at 04:40:34AM +0000, Andy Lutomirski wrote:
>>>>>>> On Tue, Feb 27, 2018 at 2:08 AM, Alexei Starovoitov
>>>>>>> <alexei.starovoitov@gmail.com> wrote:
>>>>>>>> On Tue, Feb 27, 2018 at 01:41:15AM +0100, Mickaël Salaün wrote:
>>>>>>>>> The seccomp(2) syscall can be used by a task to apply a Landlock program
>>>>>>>>> to itself. As a seccomp filter, a Landlock program is enforced for the
>>>>>>>>> current task and all its future children. A program is immutable and a
>>>>>>>>> task can only add new restricting programs to itself, forming a list of
>>>>>>>>> programss.
>>>>>>>>>
>>>>>>>>> A Landlock program is tied to a Landlock hook. If the action on a kernel
>>>>>>>>> object is allowed by the other Linux security mechanisms (e.g. DAC,
>>>>>>>>> capabilities, other LSM), then a Landlock hook related to this kind of
>>>>>>>>> object is triggered. The list of programs for this hook is then
>>>>>>>>> evaluated. Each program return a 32-bit value which can deny the action
>>>>>>>>> on a kernel object with a non-zero value. If every programs of the list
>>>>>>>>> return zero, then the action on the object is allowed.
>>>>>>>>>
>>>>>>>>> Multiple Landlock programs can be chained to share a 64-bits value for a
>>>>>>>>> call chain (e.g. evaluating multiple elements of a file path).  This
>>>>>>>>> chaining is restricted when a process construct this chain by loading a
>>>>>>>>> program, but additional checks are performed when it requests to apply
>>>>>>>>> this chain of programs to itself.  The restrictions ensure that it is
>>>>>>>>> not possible to call multiple programs in a way that would imply to
>>>>>>>>> handle multiple shared values (i.e. cookies) for one chain.  For now,
>>>>>>>>> only a fs_pick program can be chained to the same type of program,
>>>>>>>>> because it may make sense if they have different triggers (cf. next
>>>>>>>>> commits).  This restrictions still allows to reuse Landlock programs in
>>>>>>>>> a safe way (e.g. use the same loaded fs_walk program with multiple
>>>>>>>>> chains of fs_pick programs).
>>>>>>>>>
>>>>>>>>> Signed-off-by: Mickaël Salaün <mic@digikod.net>
>>>>>>>>
>>>>>>>> ...
>>>>>>>>
>>>>>>>>> +struct landlock_prog_set *landlock_prepend_prog(
>>>>>>>>> +             struct landlock_prog_set *current_prog_set,
>>>>>>>>> +             struct bpf_prog *prog)
>>>>>>>>> +{
>>>>>>>>> +     struct landlock_prog_set *new_prog_set = current_prog_set;
>>>>>>>>> +     unsigned long pages;
>>>>>>>>> +     int err;
>>>>>>>>> +     size_t i;
>>>>>>>>> +     struct landlock_prog_set tmp_prog_set = {};
>>>>>>>>> +
>>>>>>>>> +     if (prog->type != BPF_PROG_TYPE_LANDLOCK_HOOK)
>>>>>>>>> +             return ERR_PTR(-EINVAL);
>>>>>>>>> +
>>>>>>>>> +     /* validate memory size allocation */
>>>>>>>>> +     pages = prog->pages;
>>>>>>>>> +     if (current_prog_set) {
>>>>>>>>> +             size_t i;
>>>>>>>>> +
>>>>>>>>> +             for (i = 0; i < ARRAY_SIZE(current_prog_set->programs); i++) {
>>>>>>>>> +                     struct landlock_prog_list *walker_p;
>>>>>>>>> +
>>>>>>>>> +                     for (walker_p = current_prog_set->programs[i];
>>>>>>>>> +                                     walker_p; walker_p = walker_p->prev)
>>>>>>>>> +                             pages += walker_p->prog->pages;
>>>>>>>>> +             }
>>>>>>>>> +             /* count a struct landlock_prog_set if we need to allocate one */
>>>>>>>>> +             if (refcount_read(&current_prog_set->usage) != 1)
>>>>>>>>> +                     pages += round_up(sizeof(*current_prog_set), PAGE_SIZE)
>>>>>>>>> +                             / PAGE_SIZE;
>>>>>>>>> +     }
>>>>>>>>> +     if (pages > LANDLOCK_PROGRAMS_MAX_PAGES)
>>>>>>>>> +             return ERR_PTR(-E2BIG);
>>>>>>>>> +
>>>>>>>>> +     /* ensure early that we can allocate enough memory for the new
>>>>>>>>> +      * prog_lists */
>>>>>>>>> +     err = store_landlock_prog(&tmp_prog_set, current_prog_set, prog);
>>>>>>>>> +     if (err)
>>>>>>>>> +             return ERR_PTR(err);
>>>>>>>>> +
>>>>>>>>> +     /*
>>>>>>>>> +      * Each task_struct points to an array of prog list pointers.  These
>>>>>>>>> +      * tables are duplicated when additions are made (which means each
>>>>>>>>> +      * table needs to be refcounted for the processes using it). When a new
>>>>>>>>> +      * table is created, all the refcounters on the prog_list are bumped (to
>>>>>>>>> +      * track each table that references the prog). When a new prog is
>>>>>>>>> +      * added, it's just prepended to the list for the new table to point
>>>>>>>>> +      * at.
>>>>>>>>> +      *
>>>>>>>>> +      * Manage all the possible errors before this step to not uselessly
>>>>>>>>> +      * duplicate current_prog_set and avoid a rollback.
>>>>>>>>> +      */
>>>>>>>>> +     if (!new_prog_set) {
>>>>>>>>> +             /*
>>>>>>>>> +              * If there is no Landlock program set used by the current task,
>>>>>>>>> +              * then create a new one.
>>>>>>>>> +              */
>>>>>>>>> +             new_prog_set = new_landlock_prog_set();
>>>>>>>>> +             if (IS_ERR(new_prog_set))
>>>>>>>>> +                     goto put_tmp_lists;
>>>>>>>>> +     } else if (refcount_read(&current_prog_set->usage) > 1) {
>>>>>>>>> +             /*
>>>>>>>>> +              * If the current task is not the sole user of its Landlock
>>>>>>>>> +              * program set, then duplicate them.
>>>>>>>>> +              */
>>>>>>>>> +             new_prog_set = new_landlock_prog_set();
>>>>>>>>> +             if (IS_ERR(new_prog_set))
>>>>>>>>> +                     goto put_tmp_lists;
>>>>>>>>> +             for (i = 0; i < ARRAY_SIZE(new_prog_set->programs); i++) {
>>>>>>>>> +                     new_prog_set->programs[i] =
>>>>>>>>> +                             READ_ONCE(current_prog_set->programs[i]);
>>>>>>>>> +                     if (new_prog_set->programs[i])
>>>>>>>>> +                             refcount_inc(&new_prog_set->programs[i]->usage);
>>>>>>>>> +             }
>>>>>>>>> +
>>>>>>>>> +             /*
>>>>>>>>> +              * Landlock program set from the current task will not be freed
>>>>>>>>> +              * here because the usage is strictly greater than 1. It is
>>>>>>>>> +              * only prevented to be freed by another task thanks to the
>>>>>>>>> +              * caller of landlock_prepend_prog() which should be locked if
>>>>>>>>> +              * needed.
>>>>>>>>> +              */
>>>>>>>>> +             landlock_put_prog_set(current_prog_set);
>>>>>>>>> +     }
>>>>>>>>> +
>>>>>>>>> +     /* prepend tmp_prog_set to new_prog_set */
>>>>>>>>> +     for (i = 0; i < ARRAY_SIZE(tmp_prog_set.programs); i++) {
>>>>>>>>> +             /* get the last new list */
>>>>>>>>> +             struct landlock_prog_list *last_list =
>>>>>>>>> +                     tmp_prog_set.programs[i];
>>>>>>>>> +
>>>>>>>>> +             if (last_list) {
>>>>>>>>> +                     while (last_list->prev)
>>>>>>>>> +                             last_list = last_list->prev;
>>>>>>>>> +                     /* no need to increment usage (pointer replacement) */
>>>>>>>>> +                     last_list->prev = new_prog_set->programs[i];
>>>>>>>>> +                     new_prog_set->programs[i] = tmp_prog_set.programs[i];
>>>>>>>>> +             }
>>>>>>>>> +     }
>>>>>>>>> +     new_prog_set->chain_last = tmp_prog_set.chain_last;
>>>>>>>>> +     return new_prog_set;
>>>>>>>>> +
>>>>>>>>> +put_tmp_lists:
>>>>>>>>> +     for (i = 0; i < ARRAY_SIZE(tmp_prog_set.programs); i++)
>>>>>>>>> +             put_landlock_prog_list(tmp_prog_set.programs[i]);
>>>>>>>>> +     return new_prog_set;
>>>>>>>>> +}
>>>>>>>>
>>>>>>>> Nack on the chaining concept.
>>>>>>>> Please do not reinvent the wheel.
>>>>>>>> There is an existing mechanism for attaching/detaching/quering multiple
>>>>>>>> programs attached to cgroup and tracing hooks that are also
>>>>>>>> efficiently executed via BPF_PROG_RUN_ARRAY.
>>>>>>>> Please use that instead.
>>>>>>>>
>>>>>>>
>>>>>>> I don't see how that would help.  Suppose you add a filter, then
>>>>>>> fork(), and then the child adds another filter.  Do you want to
>>>>>>> duplicate the entire array?  You certainly can't *modify* the array
>>>>>>> because you'll affect processes that shouldn't be affected.
>>>>>>>
>>>>>>> In contrast, doing this through seccomp like the earlier patches
>>>>>>> seemed just fine to me, and seccomp already had the right logic.
>>>>>>
>>>>>> it doesn't look to me that existing seccomp side of managing fork
>>>>>> situation can be reused. Here there is an attempt to add 'chaining'
>>>>>> concept which sort of an extension of existing seccomp style,
>>>>>> but somehow heavily done on bpf side and contradicts cgroup/tracing.
>>>>>>
>>>>>
>>>>> I don't see why the seccomp way can't be used.  I agree with you that
>>>>> the seccomp *style* shouldn't be used in bpf code like this, but I
>>>>> think that Landlock programs can and should just live in the existing
>>>>> seccomp chain.  If the existing seccomp code needs some modification
>>>>> to make this work, then so be it.
>>>>
>>>> +1
>>>> if that was the case...
>>>> but that's not my reading of the patch set.
>>>
>>> An earlier version of the patch set used the seccomp filter chain.
>>> Mickaël, what exactly was wrong with that approach other than that the
>>> seccomp() syscall was awkward for you to use?  You could add a
>>> seccomp_add_landlock_rule() syscall if you needed to.
>>
>> Nothing was wrong about about that, this part did not changed (see my
>> next comment).
>>
>>>
>>> As a side comment, why is this an LSM at all, let alone a non-stacking
>>> LSM?  It would make a lot more sense to me to make Landlock depend on
>>> having LSMs configured in but to call the landlock hooks directly from
>>> the security_xyz() hooks.
>>
>> See Casey's answer and his patch series: https://lwn.net/Articles/741963/
>>
>>>
>>>>
>>>>> In other words, the kernel already has two kinds of chaining:
>>>>> seccomp's and bpf's.  bpf's doesn't work right for this type of usage
>>>>> across fork(), whereas seccomp's already handles that case correctly.
>>>>> (In contrast, seccomp's is totally wrong for cgroup-attached filters.)
>>>>>  So IMO Landlock should use the seccomp core code and call into bpf
>>>>> for the actual filtering.
>>>>
>>>> +1
>>>> in cgroup we had to invent this new BPF_PROG_RUN_ARRAY mechanism,
>>>> since cgroup hierarchy can be complicated with bpf progs attached
>>>> at different levels with different override/multiprog properties,
>>>> so walking link list and checking all flags at run-time would have
>>>> been too slow. That's why we added compute_effective_progs().
>>>
>>> If we start adding override flags to Landlock, I think we're doing it
>>> wrong.   With cgroup bpf programs, the whole mess is set up by the
>>> administrator.  With seccomp, and with Landlock if done correctly, it
>>> *won't* be set up by the administrator, so the chance that everyone
>>> gets all the flags right is about zero.  All attached filters should
>>> run unconditionally.
>>
>>
>> There is a misunderstanding about this chaining mechanism. This should
>> not be confused with the list of seccomp filters nor the cgroup
>> hierarchies. Landlock programs can be stacked the same way seccomp's
>> filters can (cf. struct landlock_prog_set, the "chain_last" field is an
>> optimization which is not used for this struct handling). This stackable
>> property did not changed from the previous patch series. The chaining
>> mechanism is for another use case, which does not make sense for seccomp
>> filters nor other eBPF program types, at least for now, from what I can
>> tell.
>>
>> You may want to get a look at my talk at FOSDEM
>> (https://landlock.io/talks/2018-02-04_landlock-fosdem.pdf), especially
>> slides 11 and 12.
>>
>> Let me explain my reasoning about this program chaining thing.
>>
>> To check if an action on a file is allowed, we first need to identify
>> this file and match it to the security policy. In a previous
>> (non-public) patch series, I tried to use one type of eBPF program to
>> check every kind of access to a file. To be able to identify a file, I
>> relied on an eBPF map, similar to the current inode map. This map store
>> a set of references to file descriptors. I then created a function
>> bpf_is_file_beneath() to check if the requested file was beneath a file
>> in the map. This way, no chaining, only one eBPF program type to check
>> an access to a file... but some issues then emerged. First, this design
>> create a side-channel which help an attacker using such a program to
>> infer some information not normally available, for example to get a hint
>> on where a file descriptor (received from a UNIX socket) come from.
>> Another issue is that this type of program would be called for each
>> component of a path. Indeed, when the kernel check if an access to a
>> file is allowed, it walk through all of the directories in its path
>> (checking if the current process is allowed to execute them). That first
>> attempt led me to rethink the way we could filter an access to a file
>> *path*.
>>
>> To minimize the number of called to an eBPF program dedicated to
>> validate an access to a file path, I decided to create three subtype of
>> eBPF programs. The FS_WALK type is called when walking through every
>> directory of a file path (except the last one if it is the target). We
>> can then restrict this type of program to the minimum set of functions
>> it is allowed to call and the minimum set of data available from its
>> context. The first implicit chaining is for this type of program. To be
>> able to evaluate a path while being called for all its components, this
>> program need to store a state (to remember what was the parent directory
>> of this path). There is no "previous" field in the subtype for this
>> program because it is chained with itself, for each directories. This
>> enable to create a FS_WALK program to evaluate a file hierarchy, thank
>> to the inode map which can be used to check if a directory of this
>> hierarchy is part of an allowed (or denied) list of directories. This
>> design enables to express a file hierarchy in a programmatic way,
>> without requiring an eBPF helper to do the job (unlike my first experiment).
>>
>> The explicit chaining is used to tied a path evaluation (with a FS_WALK
>> program) to an access to the actual file being requested (the last
>> component of a file path), with a FS_PICK program. It is only at this
>> time that the kernel check for the requested action (e.g. read, write,
>> chdir, append...). To be able to filter such access request we can have
>> one call to the same program for every action and let this program check
>> for which action it was called. However, this design does not allow the
>> kernel to know if the current action is indeed handled by this program.
>> Hence, it is not possible to implement a cache mechanism to only call
>> this program if it knows how to handle this action.
>>
>> The approach I took for this FS_PICK type of program is to add to its
>> subtype which action it can handle (with the "triggers" bitfield, seen
>> as ORed actions). This way, the kernel knows if a call to a FS_PICK
>> program is necessary. If the user wants to enforce a different security
>> policy according to the action requested on a file, then it needs
>> multiple FS_PICK programs. However, to reduce the number of such
>> programs, this patch series allow a FS_PICK program to be chained with
>> another, the same way a FS_WALK is chained with itself. This way, if the
>> user want to check if the action is a for example an "open" and a "read"
>> and not a "map" and a "read", then it can chain multiple FS_PICK
>> programs with different triggers actions. The OR check performed by the
>> kernel is not a limitation then, only a way to know if a call to an eBPF
>> program is needed.
>>
>> The last type of program is FS_GET. This one is called when a process
>> get a struct file or change its working directory. This is the only
>> program type able (and allowed) to tag a file. This restriction is
>> important to not being subject to resource exhaustion attacks (i.e.
>> tagging every inode accessible to an attacker, which would allocate too
>> much kernel memory).
>>
>> This design gives room for improvements to create a cache of eBPF
>> context (input data, including maps if any), with the result of an eBPF
>> program. This would help limit the number of call to an eBPF program the
>> same way SELinux or other kernel components do to limit costly checks.
>>
>> The eBPF maps of progs are useful to call the same type of eBPF
>> program. It does not fit with this use case because we may want multiple
>> eBPF program according to the action requested on a kernel object (e.g.
>> FS_GET). The other reason is because the eBPF program does not know what
>> will be the next (type of) access check performed by the kernel.
>>
>> To say it another way, this chaining mechanism is a way to split a
>> kernel object evaluation with multiple specialized programs, each of
>> them being able to deal with data tied to their type. Using a monolithic
>> eBPF program to check everything does not scale and does not fit with
>> unprivileged use either.
>>
>> As a side note, the cookie value is only an ephemeral value to keep a
>> state between multiple programs call. It can be used to create a state
>> machine for an object evaluation.
>>
>> I don't see a way to do an efficient and programmatic path evaluation,
>> with different access checks, with the current eBPF features. Please let
>> me know if you know how to do it another way.
>>
>
> Andy, Alexei, Daniel, what do you think about this Landlock program
> chaining and cookie?
>

Can you give a small pseudocode real world example that acutally needs
chaining?  The mechanism is quite complicated and I'd like to
understand how it'll be used.

^ permalink raw reply

* Re: [PATCH net 0/8] net: fix uninit-values in networking stack
From: David Miller @ 2018-04-08 21:18 UTC (permalink / raw)
  To: eric.dumazet; +Cc: edumazet, netdev
In-Reply-To: <0a539248-4e45-a429-f708-d906c145c736@gmail.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sun, 8 Apr 2018 09:55:58 -0700

> I also have a report of a WARN() in ip_rt_bug(), added in commit
> c378a9c019cf5e017d1ed24954b54fae7bebd2bc by Dave Jones.
> 
> Not sure what to do, maybe revert, since ip_rt_bug() is not catastrophic.

Let's not do the revert, I wouldn't have seen the backtrace which
points where this bug is if we had.

icmp_route_lookup(), in one branch, does an input route lookup and
uses the result of that to send the icmp message.

That can't be right, input routes should never be used for
transmitting traffice and that's how we end up at ip_rt_bug().

^ permalink raw reply

* Re: pull request: bluetooth 2018-04-08
From: David Miller @ 2018-04-08 21:19 UTC (permalink / raw)
  To: johan.hedberg; +Cc: linux-bluetooth, netdev
In-Reply-To: <20180408174702.GA11420@x1c.home>

From: Johan Hedberg <johan.hedberg@gmail.com>
Date: Sun, 8 Apr 2018 20:47:02 +0300

> Here's one important Bluetooth fix for the 4.17-rc series that's needed
> to pass several Bluetooth qualification test cases.
> 
> Let me know if there are any issues pulling. Thanks.

Pulled, thank you.

^ permalink raw reply

* Re: BUG: please report to dccp@vger.kernel.org => prev = 0, last = 0 at net/dccp/ccids/lib/packet_history.c:LINE/tfrc_rx_hist_sample_rtt()
From: Eric Biggers @ 2018-04-08 21:57 UTC (permalink / raw)
  To: dccp, gerrit
  Cc: davem, garsilva, syzbot, linux-kernel, netdev, syzkaller-bugs
In-Reply-To: <001a1145f0b4076234056309aab9@google.com>

On Thu, Jan 18, 2018 at 01:34:02AM -0800, syzbot wrote:
> syzbot has found reproducer for the following crash on linux-next commit
> a362f6d2cdbd089dd7040ba66dcb0ad276a20cf7 (Thu Jan 18 07:07:54 2018 +0000)
> Add linux-next specific files for 20180118
> 
> So far this crash happened 185 times on linux-next, mmots, net-next,
> upstream.
> C reproducer is attached.
> syzkaller reproducer is attached.
> Raw console output is attached.
> compiler: gcc (GCC) 7.1.1 20170620
> .config is attached.
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by:
> syzbot+3ca02e1a9272a28e8959b32039154c5605164653@syzkaller.appspotmail.com
> It will help syzbot understand when the bug is fixed.
> 
> BUG: please report to dccp@vger.kernel.org => prev = 0, last = 0 at
> net/dccp/ccids/lib/packet_history.c:425/tfrc_rx_hist_sample_rtt()
> CPU: 1 PID: 6246 Comm: syzkaller158939 Not tainted 4.15.0-rc8-next-20180118+
> #100
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Call Trace:
>  <IRQ>
>  __dump_stack lib/dump_stack.c:17 [inline]
>  dump_stack+0x194/0x257 lib/dump_stack.c:53
>  tfrc_rx_hist_sample_rtt+0x407/0x4d0 net/dccp/ccids/lib/packet_history.c:422
>  ccid3_hc_rx_packet_recv+0x696/0xeb3 net/dccp/ccids/ccid3.c:765
>  ccid_hc_rx_packet_recv net/dccp/ccid.h:185 [inline]
>  dccp_deliver_input_to_ccids+0xd9/0x250 net/dccp/input.c:180
>  dccp_rcv_established+0x88/0xb0 net/dccp/input.c:378
>  dccp_v4_do_rcv+0x135/0x160 net/dccp/ipv4.c:653
>  sk_backlog_rcv include/net/sock.h:908 [inline]
>  __sk_receive_skb+0x33e/0xc10 net/core/sock.c:513
>  dccp_v4_rcv+0xf5f/0x1c80 net/dccp/ipv4.c:874
>  ip_local_deliver_finish+0x2f1/0xc50 net/ipv4/ip_input.c:216
>  NF_HOOK include/linux/netfilter.h:288 [inline]
>  ip_local_deliver+0x1ce/0x6e0 net/ipv4/ip_input.c:257
>  dst_input include/net/dst.h:449 [inline]
>  ip_rcv_finish+0x953/0x1e30 net/ipv4/ip_input.c:397
>  NF_HOOK include/linux/netfilter.h:288 [inline]
>  ip_rcv+0xc5a/0x1840 net/ipv4/ip_input.c:493
>  __netif_receive_skb_core+0x1a41/0x3460 net/core/dev.c:4537
>  __netif_receive_skb+0x2c/0x1b0 net/core/dev.c:4602
>  process_backlog+0x203/0x740 net/core/dev.c:5282
>  napi_poll net/core/dev.c:5680 [inline]
>  net_rx_action+0x792/0x1910 net/core/dev.c:5746
>  __do_softirq+0x2d7/0xb85 kernel/softirq.c:285
>  do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1150
>  </IRQ>
>  do_softirq.part.19+0x14d/0x190 kernel/softirq.c:329
>  do_softirq kernel/softirq.c:177 [inline]
>  __local_bh_enable_ip+0x1ee/0x230 kernel/softirq.c:182
>  local_bh_enable include/linux/bottom_half.h:32 [inline]
>  rcu_read_unlock_bh include/linux/rcupdate.h:726 [inline]
>  ip_finish_output2+0x962/0x1550 net/ipv4/ip_output.c:231
>  ip_finish_output+0x864/0xd10 net/ipv4/ip_output.c:317
>  NF_HOOK_COND include/linux/netfilter.h:277 [inline]
>  ip_output+0x1d2/0x860 net/ipv4/ip_output.c:405
>  dst_output include/net/dst.h:443 [inline]
>  ip_local_out+0x95/0x160 net/ipv4/ip_output.c:124
>  ip_queue_xmit+0x8c0/0x18e0 net/ipv4/ip_output.c:504
>  dccp_transmit_skb+0x9ac/0x10f0 net/dccp/output.c:142
>  dccp_xmit_packet+0x215/0x740 net/dccp/output.c:281
>  dccp_write_xmit+0x17d/0x1d0 net/dccp/output.c:363
>  dccp_sendmsg+0x95f/0xdc0 net/dccp/proto.c:813
>  inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:764
>  sock_sendmsg_nosec net/socket.c:630 [inline]
>  sock_sendmsg+0xca/0x110 net/socket.c:640
>  ___sys_sendmsg+0x767/0x8b0 net/socket.c:2020
>  __sys_sendmsg+0xe5/0x210 net/socket.c:2054
>  SYSC_sendmsg net/socket.c:2065 [inline]
>  SyS_sendmsg+0x2d/0x50 net/socket.c:2061
>  entry_SYSCALL_64_fastpath+0x29/0xa0
> RIP: 0033:0x446469
> RSP: 002b:00007fcecb23bda8 EFLAGS: 00000293 ORIG_RAX: 000000000000002e
> RAX: ffffffffffffffda RBX: 00000000006dbc3c RCX: 0000000000446469
> RDX: 0000000000000080 RSI: 00000000206c8000 RDI: 0000000000000005
> RBP: 00000000006dbc38 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000293 R12: f8e4cbe49e572d45
> R13: 54c1b85d98aba1df R14: a6eaa24dbeb18c29 R15: 000000000000000c
> 

This is still happening.  It *might* be related to the other bug "suspicious RCU
usage at ./include/net/inet_sock.h:LINE".  Here's a simplified reproducer for
this one:

#include <linux/dccp.h>
#include <linux/in.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

int main()
{
        struct sockaddr_in addr = { .sin_family = AF_INET };
        socklen_t addrlen = sizeof(addr);
        int fd;

        while (fork())
                wait(NULL);
        fd = socket(AF_INET, SOCK_DCCP, 0);
        bind(fd, (void *)&addr, addrlen);
        getsockname(fd, (void *)&addr, &addrlen);
        listen(fd, 100);
        if (fork()) {
                fd = socket(AF_INET, SOCK_DCCP, 0);
                setsockopt(fd, SOL_DCCP, DCCP_SOCKOPT_CCID, "\x03", 1);
                connect(fd, (void *)&addr, sizeof(addr));
        } else {
                fd = accept(fd, NULL, 0);
        }
        for (int i = 0; i < 1000; i++)
                write(fd, "X", 1);
}

^ permalink raw reply

* Re: [PATCH bpf-next v8 05/11] seccomp,landlock: Enforce Landlock programs per process hierarchy
From: Mickaël Salaün @ 2018-04-08 22:01 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Alexei Starovoitov, Daniel Borkmann, LKML, Alexei Starovoitov,
	Arnaldo Carvalho de Melo, Casey Schaufler, David Drysdale,
	David S . Miller, Eric W . Biederman, Jann Horn, Jonathan Corbet,
	Michael Kerrisk, Kees Cook, Paul Moore, Sargun Dhillon,
	Serge E . Hallyn, Shuah Khan, Tejun Heo, Thomas Graf,
	Tycho Andersen, Will Drewry
In-Reply-To: <CALCETrViaXEx1iQ6q8bEEWSLchj=FH6LjcRY6+hjMx8A+rtgDQ@mail.gmail.com>


[-- Attachment #1.1: Type: text/plain, Size: 22278 bytes --]


On 04/08/2018 11:06 PM, Andy Lutomirski wrote:
> On Sun, Apr 8, 2018 at 6:13 AM, Mickaël Salaün <mic@digikod.net> wrote:
>>
>> On 02/27/2018 10:48 PM, Mickaël Salaün wrote:
>>>
>>> On 27/02/2018 17:39, Andy Lutomirski wrote:
>>>> On Tue, Feb 27, 2018 at 5:32 AM, Alexei Starovoitov
>>>> <alexei.starovoitov@gmail.com> wrote:
>>>>> On Tue, Feb 27, 2018 at 05:20:55AM +0000, Andy Lutomirski wrote:
>>>>>> On Tue, Feb 27, 2018 at 4:54 AM, Alexei Starovoitov
>>>>>> <alexei.starovoitov@gmail.com> wrote:
>>>>>>> On Tue, Feb 27, 2018 at 04:40:34AM +0000, Andy Lutomirski wrote:
>>>>>>>> On Tue, Feb 27, 2018 at 2:08 AM, Alexei Starovoitov
>>>>>>>> <alexei.starovoitov@gmail.com> wrote:
>>>>>>>>> On Tue, Feb 27, 2018 at 01:41:15AM +0100, Mickaël Salaün wrote:
>>>>>>>>>> The seccomp(2) syscall can be used by a task to apply a Landlock program
>>>>>>>>>> to itself. As a seccomp filter, a Landlock program is enforced for the
>>>>>>>>>> current task and all its future children. A program is immutable and a
>>>>>>>>>> task can only add new restricting programs to itself, forming a list of
>>>>>>>>>> programss.
>>>>>>>>>>
>>>>>>>>>> A Landlock program is tied to a Landlock hook. If the action on a kernel
>>>>>>>>>> object is allowed by the other Linux security mechanisms (e.g. DAC,
>>>>>>>>>> capabilities, other LSM), then a Landlock hook related to this kind of
>>>>>>>>>> object is triggered. The list of programs for this hook is then
>>>>>>>>>> evaluated. Each program return a 32-bit value which can deny the action
>>>>>>>>>> on a kernel object with a non-zero value. If every programs of the list
>>>>>>>>>> return zero, then the action on the object is allowed.
>>>>>>>>>>
>>>>>>>>>> Multiple Landlock programs can be chained to share a 64-bits value for a
>>>>>>>>>> call chain (e.g. evaluating multiple elements of a file path).  This
>>>>>>>>>> chaining is restricted when a process construct this chain by loading a
>>>>>>>>>> program, but additional checks are performed when it requests to apply
>>>>>>>>>> this chain of programs to itself.  The restrictions ensure that it is
>>>>>>>>>> not possible to call multiple programs in a way that would imply to
>>>>>>>>>> handle multiple shared values (i.e. cookies) for one chain.  For now,
>>>>>>>>>> only a fs_pick program can be chained to the same type of program,
>>>>>>>>>> because it may make sense if they have different triggers (cf. next
>>>>>>>>>> commits).  This restrictions still allows to reuse Landlock programs in
>>>>>>>>>> a safe way (e.g. use the same loaded fs_walk program with multiple
>>>>>>>>>> chains of fs_pick programs).
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Mickaël Salaün <mic@digikod.net>
>>>>>>>>>
>>>>>>>>> ...
>>>>>>>>>
>>>>>>>>>> +struct landlock_prog_set *landlock_prepend_prog(
>>>>>>>>>> +             struct landlock_prog_set *current_prog_set,
>>>>>>>>>> +             struct bpf_prog *prog)
>>>>>>>>>> +{
>>>>>>>>>> +     struct landlock_prog_set *new_prog_set = current_prog_set;
>>>>>>>>>> +     unsigned long pages;
>>>>>>>>>> +     int err;
>>>>>>>>>> +     size_t i;
>>>>>>>>>> +     struct landlock_prog_set tmp_prog_set = {};
>>>>>>>>>> +
>>>>>>>>>> +     if (prog->type != BPF_PROG_TYPE_LANDLOCK_HOOK)
>>>>>>>>>> +             return ERR_PTR(-EINVAL);
>>>>>>>>>> +
>>>>>>>>>> +     /* validate memory size allocation */
>>>>>>>>>> +     pages = prog->pages;
>>>>>>>>>> +     if (current_prog_set) {
>>>>>>>>>> +             size_t i;
>>>>>>>>>> +
>>>>>>>>>> +             for (i = 0; i < ARRAY_SIZE(current_prog_set->programs); i++) {
>>>>>>>>>> +                     struct landlock_prog_list *walker_p;
>>>>>>>>>> +
>>>>>>>>>> +                     for (walker_p = current_prog_set->programs[i];
>>>>>>>>>> +                                     walker_p; walker_p = walker_p->prev)
>>>>>>>>>> +                             pages += walker_p->prog->pages;
>>>>>>>>>> +             }
>>>>>>>>>> +             /* count a struct landlock_prog_set if we need to allocate one */
>>>>>>>>>> +             if (refcount_read(&current_prog_set->usage) != 1)
>>>>>>>>>> +                     pages += round_up(sizeof(*current_prog_set), PAGE_SIZE)
>>>>>>>>>> +                             / PAGE_SIZE;
>>>>>>>>>> +     }
>>>>>>>>>> +     if (pages > LANDLOCK_PROGRAMS_MAX_PAGES)
>>>>>>>>>> +             return ERR_PTR(-E2BIG);
>>>>>>>>>> +
>>>>>>>>>> +     /* ensure early that we can allocate enough memory for the new
>>>>>>>>>> +      * prog_lists */
>>>>>>>>>> +     err = store_landlock_prog(&tmp_prog_set, current_prog_set, prog);
>>>>>>>>>> +     if (err)
>>>>>>>>>> +             return ERR_PTR(err);
>>>>>>>>>> +
>>>>>>>>>> +     /*
>>>>>>>>>> +      * Each task_struct points to an array of prog list pointers.  These
>>>>>>>>>> +      * tables are duplicated when additions are made (which means each
>>>>>>>>>> +      * table needs to be refcounted for the processes using it). When a new
>>>>>>>>>> +      * table is created, all the refcounters on the prog_list are bumped (to
>>>>>>>>>> +      * track each table that references the prog). When a new prog is
>>>>>>>>>> +      * added, it's just prepended to the list for the new table to point
>>>>>>>>>> +      * at.
>>>>>>>>>> +      *
>>>>>>>>>> +      * Manage all the possible errors before this step to not uselessly
>>>>>>>>>> +      * duplicate current_prog_set and avoid a rollback.
>>>>>>>>>> +      */
>>>>>>>>>> +     if (!new_prog_set) {
>>>>>>>>>> +             /*
>>>>>>>>>> +              * If there is no Landlock program set used by the current task,
>>>>>>>>>> +              * then create a new one.
>>>>>>>>>> +              */
>>>>>>>>>> +             new_prog_set = new_landlock_prog_set();
>>>>>>>>>> +             if (IS_ERR(new_prog_set))
>>>>>>>>>> +                     goto put_tmp_lists;
>>>>>>>>>> +     } else if (refcount_read(&current_prog_set->usage) > 1) {
>>>>>>>>>> +             /*
>>>>>>>>>> +              * If the current task is not the sole user of its Landlock
>>>>>>>>>> +              * program set, then duplicate them.
>>>>>>>>>> +              */
>>>>>>>>>> +             new_prog_set = new_landlock_prog_set();
>>>>>>>>>> +             if (IS_ERR(new_prog_set))
>>>>>>>>>> +                     goto put_tmp_lists;
>>>>>>>>>> +             for (i = 0; i < ARRAY_SIZE(new_prog_set->programs); i++) {
>>>>>>>>>> +                     new_prog_set->programs[i] =
>>>>>>>>>> +                             READ_ONCE(current_prog_set->programs[i]);
>>>>>>>>>> +                     if (new_prog_set->programs[i])
>>>>>>>>>> +                             refcount_inc(&new_prog_set->programs[i]->usage);
>>>>>>>>>> +             }
>>>>>>>>>> +
>>>>>>>>>> +             /*
>>>>>>>>>> +              * Landlock program set from the current task will not be freed
>>>>>>>>>> +              * here because the usage is strictly greater than 1. It is
>>>>>>>>>> +              * only prevented to be freed by another task thanks to the
>>>>>>>>>> +              * caller of landlock_prepend_prog() which should be locked if
>>>>>>>>>> +              * needed.
>>>>>>>>>> +              */
>>>>>>>>>> +             landlock_put_prog_set(current_prog_set);
>>>>>>>>>> +     }
>>>>>>>>>> +
>>>>>>>>>> +     /* prepend tmp_prog_set to new_prog_set */
>>>>>>>>>> +     for (i = 0; i < ARRAY_SIZE(tmp_prog_set.programs); i++) {
>>>>>>>>>> +             /* get the last new list */
>>>>>>>>>> +             struct landlock_prog_list *last_list =
>>>>>>>>>> +                     tmp_prog_set.programs[i];
>>>>>>>>>> +
>>>>>>>>>> +             if (last_list) {
>>>>>>>>>> +                     while (last_list->prev)
>>>>>>>>>> +                             last_list = last_list->prev;
>>>>>>>>>> +                     /* no need to increment usage (pointer replacement) */
>>>>>>>>>> +                     last_list->prev = new_prog_set->programs[i];
>>>>>>>>>> +                     new_prog_set->programs[i] = tmp_prog_set.programs[i];
>>>>>>>>>> +             }
>>>>>>>>>> +     }
>>>>>>>>>> +     new_prog_set->chain_last = tmp_prog_set.chain_last;
>>>>>>>>>> +     return new_prog_set;
>>>>>>>>>> +
>>>>>>>>>> +put_tmp_lists:
>>>>>>>>>> +     for (i = 0; i < ARRAY_SIZE(tmp_prog_set.programs); i++)
>>>>>>>>>> +             put_landlock_prog_list(tmp_prog_set.programs[i]);
>>>>>>>>>> +     return new_prog_set;
>>>>>>>>>> +}
>>>>>>>>>
>>>>>>>>> Nack on the chaining concept.
>>>>>>>>> Please do not reinvent the wheel.
>>>>>>>>> There is an existing mechanism for attaching/detaching/quering multiple
>>>>>>>>> programs attached to cgroup and tracing hooks that are also
>>>>>>>>> efficiently executed via BPF_PROG_RUN_ARRAY.
>>>>>>>>> Please use that instead.
>>>>>>>>>
>>>>>>>>
>>>>>>>> I don't see how that would help.  Suppose you add a filter, then
>>>>>>>> fork(), and then the child adds another filter.  Do you want to
>>>>>>>> duplicate the entire array?  You certainly can't *modify* the array
>>>>>>>> because you'll affect processes that shouldn't be affected.
>>>>>>>>
>>>>>>>> In contrast, doing this through seccomp like the earlier patches
>>>>>>>> seemed just fine to me, and seccomp already had the right logic.
>>>>>>>
>>>>>>> it doesn't look to me that existing seccomp side of managing fork
>>>>>>> situation can be reused. Here there is an attempt to add 'chaining'
>>>>>>> concept which sort of an extension of existing seccomp style,
>>>>>>> but somehow heavily done on bpf side and contradicts cgroup/tracing.
>>>>>>>
>>>>>>
>>>>>> I don't see why the seccomp way can't be used.  I agree with you that
>>>>>> the seccomp *style* shouldn't be used in bpf code like this, but I
>>>>>> think that Landlock programs can and should just live in the existing
>>>>>> seccomp chain.  If the existing seccomp code needs some modification
>>>>>> to make this work, then so be it.
>>>>>
>>>>> +1
>>>>> if that was the case...
>>>>> but that's not my reading of the patch set.
>>>>
>>>> An earlier version of the patch set used the seccomp filter chain.
>>>> Mickaël, what exactly was wrong with that approach other than that the
>>>> seccomp() syscall was awkward for you to use?  You could add a
>>>> seccomp_add_landlock_rule() syscall if you needed to.
>>>
>>> Nothing was wrong about about that, this part did not changed (see my
>>> next comment).
>>>
>>>>
>>>> As a side comment, why is this an LSM at all, let alone a non-stacking
>>>> LSM?  It would make a lot more sense to me to make Landlock depend on
>>>> having LSMs configured in but to call the landlock hooks directly from
>>>> the security_xyz() hooks.
>>>
>>> See Casey's answer and his patch series: https://lwn.net/Articles/741963/
>>>
>>>>
>>>>>
>>>>>> In other words, the kernel already has two kinds of chaining:
>>>>>> seccomp's and bpf's.  bpf's doesn't work right for this type of usage
>>>>>> across fork(), whereas seccomp's already handles that case correctly.
>>>>>> (In contrast, seccomp's is totally wrong for cgroup-attached filters.)
>>>>>>  So IMO Landlock should use the seccomp core code and call into bpf
>>>>>> for the actual filtering.
>>>>>
>>>>> +1
>>>>> in cgroup we had to invent this new BPF_PROG_RUN_ARRAY mechanism,
>>>>> since cgroup hierarchy can be complicated with bpf progs attached
>>>>> at different levels with different override/multiprog properties,
>>>>> so walking link list and checking all flags at run-time would have
>>>>> been too slow. That's why we added compute_effective_progs().
>>>>
>>>> If we start adding override flags to Landlock, I think we're doing it
>>>> wrong.   With cgroup bpf programs, the whole mess is set up by the
>>>> administrator.  With seccomp, and with Landlock if done correctly, it
>>>> *won't* be set up by the administrator, so the chance that everyone
>>>> gets all the flags right is about zero.  All attached filters should
>>>> run unconditionally.
>>>
>>>
>>> There is a misunderstanding about this chaining mechanism. This should
>>> not be confused with the list of seccomp filters nor the cgroup
>>> hierarchies. Landlock programs can be stacked the same way seccomp's
>>> filters can (cf. struct landlock_prog_set, the "chain_last" field is an
>>> optimization which is not used for this struct handling). This stackable
>>> property did not changed from the previous patch series. The chaining
>>> mechanism is for another use case, which does not make sense for seccomp
>>> filters nor other eBPF program types, at least for now, from what I can
>>> tell.
>>>
>>> You may want to get a look at my talk at FOSDEM
>>> (https://landlock.io/talks/2018-02-04_landlock-fosdem.pdf), especially
>>> slides 11 and 12.
>>>
>>> Let me explain my reasoning about this program chaining thing.
>>>
>>> To check if an action on a file is allowed, we first need to identify
>>> this file and match it to the security policy. In a previous
>>> (non-public) patch series, I tried to use one type of eBPF program to
>>> check every kind of access to a file. To be able to identify a file, I
>>> relied on an eBPF map, similar to the current inode map. This map store
>>> a set of references to file descriptors. I then created a function
>>> bpf_is_file_beneath() to check if the requested file was beneath a file
>>> in the map. This way, no chaining, only one eBPF program type to check
>>> an access to a file... but some issues then emerged. First, this design
>>> create a side-channel which help an attacker using such a program to
>>> infer some information not normally available, for example to get a hint
>>> on where a file descriptor (received from a UNIX socket) come from.
>>> Another issue is that this type of program would be called for each
>>> component of a path. Indeed, when the kernel check if an access to a
>>> file is allowed, it walk through all of the directories in its path
>>> (checking if the current process is allowed to execute them). That first
>>> attempt led me to rethink the way we could filter an access to a file
>>> *path*.
>>>
>>> To minimize the number of called to an eBPF program dedicated to
>>> validate an access to a file path, I decided to create three subtype of
>>> eBPF programs. The FS_WALK type is called when walking through every
>>> directory of a file path (except the last one if it is the target). We
>>> can then restrict this type of program to the minimum set of functions
>>> it is allowed to call and the minimum set of data available from its
>>> context. The first implicit chaining is for this type of program. To be
>>> able to evaluate a path while being called for all its components, this
>>> program need to store a state (to remember what was the parent directory
>>> of this path). There is no "previous" field in the subtype for this
>>> program because it is chained with itself, for each directories. This
>>> enable to create a FS_WALK program to evaluate a file hierarchy, thank
>>> to the inode map which can be used to check if a directory of this
>>> hierarchy is part of an allowed (or denied) list of directories. This
>>> design enables to express a file hierarchy in a programmatic way,
>>> without requiring an eBPF helper to do the job (unlike my first experiment).
>>>
>>> The explicit chaining is used to tied a path evaluation (with a FS_WALK
>>> program) to an access to the actual file being requested (the last
>>> component of a file path), with a FS_PICK program. It is only at this
>>> time that the kernel check for the requested action (e.g. read, write,
>>> chdir, append...). To be able to filter such access request we can have
>>> one call to the same program for every action and let this program check
>>> for which action it was called. However, this design does not allow the
>>> kernel to know if the current action is indeed handled by this program.
>>> Hence, it is not possible to implement a cache mechanism to only call
>>> this program if it knows how to handle this action.
>>>
>>> The approach I took for this FS_PICK type of program is to add to its
>>> subtype which action it can handle (with the "triggers" bitfield, seen
>>> as ORed actions). This way, the kernel knows if a call to a FS_PICK
>>> program is necessary. If the user wants to enforce a different security
>>> policy according to the action requested on a file, then it needs
>>> multiple FS_PICK programs. However, to reduce the number of such
>>> programs, this patch series allow a FS_PICK program to be chained with
>>> another, the same way a FS_WALK is chained with itself. This way, if the
>>> user want to check if the action is a for example an "open" and a "read"
>>> and not a "map" and a "read", then it can chain multiple FS_PICK
>>> programs with different triggers actions. The OR check performed by the
>>> kernel is not a limitation then, only a way to know if a call to an eBPF
>>> program is needed.
>>>
>>> The last type of program is FS_GET. This one is called when a process
>>> get a struct file or change its working directory. This is the only
>>> program type able (and allowed) to tag a file. This restriction is
>>> important to not being subject to resource exhaustion attacks (i.e.
>>> tagging every inode accessible to an attacker, which would allocate too
>>> much kernel memory).
>>>
>>> This design gives room for improvements to create a cache of eBPF
>>> context (input data, including maps if any), with the result of an eBPF
>>> program. This would help limit the number of call to an eBPF program the
>>> same way SELinux or other kernel components do to limit costly checks.
>>>
>>> The eBPF maps of progs are useful to call the same type of eBPF
>>> program. It does not fit with this use case because we may want multiple
>>> eBPF program according to the action requested on a kernel object (e.g.
>>> FS_GET). The other reason is because the eBPF program does not know what
>>> will be the next (type of) access check performed by the kernel.
>>>
>>> To say it another way, this chaining mechanism is a way to split a
>>> kernel object evaluation with multiple specialized programs, each of
>>> them being able to deal with data tied to their type. Using a monolithic
>>> eBPF program to check everything does not scale and does not fit with
>>> unprivileged use either.
>>>
>>> As a side note, the cookie value is only an ephemeral value to keep a
>>> state between multiple programs call. It can be used to create a state
>>> machine for an object evaluation.
>>>
>>> I don't see a way to do an efficient and programmatic path evaluation,
>>> with different access checks, with the current eBPF features. Please let
>>> me know if you know how to do it another way.
>>>
>>
>> Andy, Alexei, Daniel, what do you think about this Landlock program
>> chaining and cookie?
>>
> 
> Can you give a small pseudocode real world example that acutally needs
> chaining?  The mechanism is quite complicated and I'd like to
> understand how it'll be used.
> 

Here is the interesting part from the example (patch 09/11):

+SEC("maps")
+struct bpf_map_def inode_map = {
+	.type = BPF_MAP_TYPE_INODE,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(u64),
+	.max_entries = 20,
+};
+
+SEC("subtype/landlock1")
+static union bpf_prog_subtype _subtype1 = {
+	.landlock_hook = {
+		.type = LANDLOCK_HOOK_FS_WALK,
+	}
+};
+
+static __always_inline __u64 update_cookie(__u64 cookie, __u8 lookup,
+		void *inode, void *chain, bool freeze)
+{
+	__u64 map_allow = 0;
+
+	if (cookie == 0) {
+		cookie = bpf_inode_get_tag(inode, chain);
+		if (cookie)
+			return cookie;
+		/* only look for the first match in the map, ignore nested
+		 * paths in this example */
+		map_allow = bpf_inode_map_lookup(&inode_map, inode);
+		if (map_allow)
+			cookie = 1 | map_allow;
+	} else {
+		if (cookie & COOKIE_VALUE_FREEZED)
+			return cookie;
+		map_allow = cookie & _MAP_MARK_MASK;
+		cookie &= ~_MAP_MARK_MASK;
+		switch (lookup) {
+		case LANDLOCK_CTX_FS_WALK_INODE_LOOKUP_DOTDOT:
+			cookie--;
+			break;
+		case LANDLOCK_CTX_FS_WALK_INODE_LOOKUP_DOT:
+			break;
+		default:
+			/* ignore _MAP_MARK_MASK overflow in this example */
+			cookie++;
+			break;
+		}
+		if (cookie >= 1)
+			cookie |= map_allow;
+	}
+	/* do not modify the cookie for each fs_pick */
+	if (freeze && cookie)
+		cookie |= COOKIE_VALUE_FREEZED;
+	return cookie;
+}
+
+SEC("landlock1")
+int fs_walk(struct landlock_ctx_fs_walk *ctx)
+{
+	ctx->cookie = update_cookie(ctx->cookie, ctx->inode_lookup,
+			(void *)ctx->inode, (void *)ctx->chain, false);
+	return LANDLOCK_RET_ALLOW;
+}

The program "landlock1" is called for every directory execution (except
the last one if it is the leaf of a path). This enables to identify a
file hierarchy with only a (one dimension) list of file descriptors
(i.e. inode_map).

Underneath, the Landlock LSM part looks if there is an associated path
walk (nameidata) with each inode access request. If there is one, then
the cookie associated with the path walk (if any) is made available
through the eBPF program context. This enables to develop a state
machine with an eBPF program to "evaluate" a file path (without string
parsing).

The goal with this chaining mechanism is to be able to express a complex
kernel object like a file, with multiple run of one or more eBPF
programs, as a multilayer evaluation. This semantic may only make sense
for the user/developer and his security policy. We must keep in mind
that this object identification should be available to unprivileged
processes. This means that we must be very careful to what kind of
information are available to an eBPF program because this can then leak
to a process (e.g. through a map). With this mechanism, only information
already available to user space is available to the eBPF program.

In this example, the complexity of the path evaluation is in the eBPF
program. We can then keep the kernel code more simple and generic. This
enables more flexibility for a security policy definition.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* Re: [PATCH] net: bridge: add missing NULL checks
From: Nikolay Aleksandrov @ 2018-04-08 22:25 UTC (permalink / raw)
  To: Laszlo Toth, Stephen Hemminger, David S. Miller; +Cc: netdev, bridge
In-Reply-To: <20180408174934.GA3895@laszlth>

On 08/04/18 20:49, Laszlo Toth wrote:
> br_port_get_rtnl() can return NULL
> 
> Signed-off-by: Laszlo Toth <laszlth@gmail.com>
> ---
>   net/bridge/br_netlink.c | 12 ++++++++++--
>   1 file changed, 10 insertions(+), 2 deletions(-)
> 

Nacked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
More below.

> diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
> index 015f465c..cbec11f 100644
> --- a/net/bridge/br_netlink.c
> +++ b/net/bridge/br_netlink.c
> @@ -939,14 +939,17 @@ static int br_port_slave_changelink(struct net_device *brdev,
>   				    struct nlattr *data[],
>   				    struct netlink_ext_ack *extack)
>   {
> +	struct net_bridge_port *port = br_port_get_rtnl(dev);
>   	struct net_bridge *br = netdev_priv(brdev);
>   	int ret;
>   
>   	if (!data)
>   		return 0;
> +	if (!port)
> +		return -EINVAL;
>   

If we're here, it means the master device of dev is a bridge => dev is a bridge port,
since we're running with RTNL that cannot change, so this check is unnecessary.

Have you actually hit a bug with this code ?

>   	spin_lock_bh(&br->lock);
> -	ret = br_setport(br_port_get_rtnl(dev), data);
> +	ret = br_setport(port, data);
>   	spin_unlock_bh(&br->lock);
>   
>   	return ret;
> @@ -956,7 +959,12 @@ static int br_port_fill_slave_info(struct sk_buff *skb,
>   				   const struct net_device *brdev,
>   				   const struct net_device *dev)
>   {
> -	return br_port_fill_attrs(skb, br_port_get_rtnl(dev));
> +	struct net_bridge_port *port = br_port_get_rtnl(dev);
> +
> +	if (!port)
> +		return -EINVAL;
> +
> +	return br_port_fill_attrs(skb, port);

Same rationale here, fill_slave_info is called via a master device's ops
under RTNL, which means dev is a bridge port and that also cannot change.

If you have hit a bug with this code, can we see the trace ?
The problem might be elsewhere.

Thanks,
  Nik

>   }
>   
>   static size_t br_port_get_slave_size(const struct net_device *brdev,
> 

^ permalink raw reply

* pull-request: bpf 2018-04-09
From: Daniel Borkmann @ 2018-04-08 22:28 UTC (permalink / raw)
  To: davem; +Cc: daniel, ast, netdev

Hi David,

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Two sockmap fixes: i) fix a potential warning when a socket with
   pending cork data is closed by freeing the memory right when the
   socket is closed instead of seeing still outstanding memory at
   garbage collector time, ii) fix a NULL pointer deref in case of
   duplicates release calls, so make sure to only reset the sk_prot
   pointer when it's in a valid state to do so, both from John.

2) Fix a compilation warning in bpf_prog_attach_check_attach_type()
   by moving the function under CONFIG_CGROUP_BPF ifdef since only
   used there, from Anders.

Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Thanks a lot!

----------------------------------------------------------------

The following changes since commit 4608f064532c28c0ea3c03fe26a3a5909852811a:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-next (2018-04-03 14:08:58 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git 

for you to fetch changes up to 33491588c1fb2c76ed114a211ad0ee76c16b5a0c:

  kernel/bpf/syscall: fix warning defined but not used (2018-04-04 11:08:36 +0200)

----------------------------------------------------------------
Anders Roxell (1):
      kernel/bpf/syscall: fix warning defined but not used

John Fastabend (2):
      bpf: sockmap, free memory on sock close with cork data
      bpf: sockmap, duplicates release calls may NULL sk_prot

 kernel/bpf/sockmap.c | 12 ++++++++++--
 kernel/bpf/syscall.c | 24 ++++++++++++------------
 2 files changed, 22 insertions(+), 14 deletions(-)

^ permalink raw reply

* Re: KASAN: use-after-free Read in inet_create
From: Eric Biggers @ 2018-04-08 23:17 UTC (permalink / raw)
  To: linux-rdma, rds-devel, Santosh Shilimkar
  Cc: syzbot, davem, kuznet, linux-kernel, netdev, syzkaller-bugs,
	yoshfuji
In-Reply-To: <001a1144d1c8e819f6055fee7118@google.com>

[+RDS list and maintainer]

On Sat, Dec 09, 2017 at 12:50:01PM -0800, syzbot wrote:
> Hello,
> 
> syzkaller hit the following crash on
> 82bcf1def3b5f1251177ad47c44f7e17af039b4b
> git://git.cmpxchg.org/linux-mmots.git/master
> compiler: gcc (GCC) 7.1.1 20170620
> .config is attached
> Raw console output is attached.
> 
> Unfortunately, I don't have any reproducer for this bug yet.
> 
> 
> ==================================================================
> BUG: KASAN: use-after-free in inet_create+0xda0/0xf50 net/ipv4/af_inet.c:338
> Read of size 4 at addr ffff8801bde28554 by task kworker/u4:5/3492
> 
> CPU: 0 PID: 3492 Comm: kworker/u4:5 Not tainted 4.15.0-rc2-mm1+ #39
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Workqueue: krdsd rds_connect_worker
> Call Trace:
>  __dump_stack lib/dump_stack.c:17 [inline]
>  dump_stack+0x194/0x257 lib/dump_stack.c:53
>  print_address_description+0x73/0x250 mm/kasan/report.c:252
>  kasan_report_error mm/kasan/report.c:351 [inline]
>  kasan_report+0x25b/0x340 mm/kasan/report.c:409
>  __asan_report_load4_noabort+0x14/0x20 mm/kasan/report.c:429
>  inet_create+0xda0/0xf50 net/ipv4/af_inet.c:338
>  __sock_create+0x4d4/0x850 net/socket.c:1265
>  sock_create_kern+0x3f/0x50 net/socket.c:1311
>  rds_tcp_conn_path_connect+0x26f/0x920 net/rds/tcp_connect.c:108
>  rds_connect_worker+0x156/0x1f0 net/rds/threads.c:165
>  process_one_work+0xbfd/0x1bc0 kernel/workqueue.c:2113
>  worker_thread+0x223/0x1990 kernel/workqueue.c:2247
>  kthread+0x37a/0x440 kernel/kthread.c:238
>  ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:524
> 
> Allocated by task 3362:
>  save_stack+0x43/0xd0 mm/kasan/kasan.c:447
>  set_track mm/kasan/kasan.c:459 [inline]
>  kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551
>  kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:489
>  kmem_cache_alloc+0x12e/0x760 mm/slab.c:3548
>  kmem_cache_zalloc include/linux/slab.h:695 [inline]
>  net_alloc net/core/net_namespace.c:362 [inline]
>  copy_net_ns+0x196/0x580 net/core/net_namespace.c:402
>  create_new_namespaces+0x425/0x880 kernel/nsproxy.c:107
>  unshare_nsproxy_namespaces+0xae/0x1e0 kernel/nsproxy.c:206
>  SYSC_unshare kernel/fork.c:2421 [inline]
>  SyS_unshare+0x653/0xfa0 kernel/fork.c:2371
>  entry_SYSCALL_64_fastpath+0x1f/0x96
> 
> Freed by task 35:
>  save_stack+0x43/0xd0 mm/kasan/kasan.c:447
>  set_track mm/kasan/kasan.c:459 [inline]
>  kasan_slab_free+0x71/0xc0 mm/kasan/kasan.c:524
>  __cache_free mm/slab.c:3492 [inline]
>  kmem_cache_free+0x77/0x280 mm/slab.c:3750
>  net_free+0xca/0x110 net/core/net_namespace.c:378
>  net_drop_ns.part.11+0x26/0x30 net/core/net_namespace.c:385
>  net_drop_ns net/core/net_namespace.c:384 [inline]
>  cleanup_net+0x895/0xb60 net/core/net_namespace.c:502
>  process_one_work+0xbfd/0x1bc0 kernel/workqueue.c:2113
>  worker_thread+0x223/0x1990 kernel/workqueue.c:2247
>  kthread+0x37a/0x440 kernel/kthread.c:238
>  ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:524
> 
> The buggy address belongs to the object at ffff8801bde28080
>  which belongs to the cache net_namespace of size 6272
> The buggy address is located 1236 bytes inside of
>  6272-byte region [ffff8801bde28080, ffff8801bde29900)
> The buggy address belongs to the page:
> page:00000000df6a4dc0 count:1 mapcount:0 mapping:00000000553659f1 index:0x0
> compound_mapcount: 0
> flags: 0x2fffc0000008100(slab|head)
> raw: 02fffc0000008100 ffff8801bde28080 0000000000000000 0000000100000001
> raw: ffffea0006f75da0 ffffea0006f60220 ffff8801d989fe00 0000000000000000
> page dumped because: kasan: bad access detected
> 
> Memory state around the buggy address:
>  ffff8801bde28400: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>  ffff8801bde28480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > ffff8801bde28500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>                                                  ^
>  ffff8801bde28580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>  ffff8801bde28600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ==================================================================
> 
> 
> ---
> This bug is generated by a dumb bot. It may contain errors.
> See https://goo.gl/tpsmEJ for details.
> Direct all questions to syzkaller@googlegroups.com.
> Please credit me with: Reported-by: syzbot <syzkaller@googlegroups.com>
> 
> syzbot will keep track of this bug report.
> Once a fix for this bug is merged into any tree, reply to this email with:
> #syz fix: exact-commit-title
> To mark this as a duplicate of another syzbot report, please reply with:
> #syz dup: exact-subject-of-another-report
> If it's a one-off invalid bug report, please reply with:
> #syz invalid
> Note: if the crash happens again, it will cause creation of a new bug
> report.
> Note: all commands must start from beginning of the line in the email body.
> 

This is still happening regularly, though syzbot hasn't been able to generate a
reproducer yet.  All the reports seem to involve rds_connect_worker()
encountering a freed network namespace (struct net) when calling
sock_create_kern() from rds_tcp_conn_path_connect().  Probably something in RDS
needs to be taking a reference to the network namespace and isn't, or the RDS
workqueue isn't being shut down correctly.  You can see all reports of this on
the syzbot dashboard at
https://syzkaller.appspot.com/bug?id=1f45ae538a0453220337ccb84962249fdd67107f.
Last one was April 5 on Linus' tree (commit 3e968c9f1401088).

- Eric

^ permalink raw reply

* DPAA TX Issues
From: Jacob S. Moroni @ 2018-04-08 23:46 UTC (permalink / raw)
  To: madalin.bucur; +Cc: netdev

Hello Madalin,

I've been experiencing some issues with the DPAA Ethernet driver,
specifically related to frame transmission. Hopefully you can point
me in the right direction.

TLDR: Attempting to transmit faster than a few frames per second causes
the TX FQ CGR to enter into the congested state and remain there forever,
even after transmission stops.

The hardware is a T2080RDB, running from the tip of net-next, using
the standard t2080rdb device tree and corenet64_smp_defconfig kernel
config. No changes were made to any of the files. The issue occurs
with 4.16.1 stable as well. In fact, the only time I've been able
to achieve reliable frame transmission was with the SDK 4.1 kernel.

For my tests, I'm running iperf3 both with and without the -R
option (send/receive). When using a USB Ethernet adapter, there
are no issues.

The issue is that it seems like the TX frame queues are getting
"stuck" when attempting to transmit at rates greater than a few frames
per second. Ping works fine, but it seems like anything that could
potentially cause multiple TX frames to be enqueued causes issues.

If I run iperf3 in reverse mode (with the T2080RDB receiving), then
I can achieve ~940 Mbps, but this is also somewhat unreliable.

If I run it with the T2080RDB transmitting, the test will never
complete. Sometimes it starts transmitting for a few seconds then stops,
and other times it never even starts. This also seems to force the
interface into a bad state.

The ethtool stats show that the interface has entered
congestion a few times, and that it's currently congested. The fact
that it's currently congested even after stopping transmission
indicates that the FQ somehow stopped being drained. I've also
noticed that whenever this issue occurs, the TX confirmation
counters are always less than the TX packet counters.

When it gets into this state, I can see that the memory usage is
climbing, up until about the point of where the CGR threshold
is (about 100 MB).

Any idea what could prevent the TX FQ from being drained? My first
guess was flow control, but it's completely disabled.

I tried messing with the egress congestion threshold, workqueue
assignments, etc., but nothing seemed to have any effect.

If you need any more information or want me to run any tests,
please let me know.

Thanks,
-- 
  Jacob S. Moroni
  mail@jakemoroni.com

^ permalink raw reply

* Re: [RFC] connector: add group_exit_code and signal_flags fields to exit_proc_event
From: Evgeniy Polyakov @ 2018-04-08 23:49 UTC (permalink / raw)
  To: Stefan Strogin, matthltc@us.ibm.com
  Cc: netdev@vger.kernel.org, xe-linux-external@cisco.com,
	Victor Kamensky, Taras Kondratiuk, Ruslan Bilovol
In-Reply-To: <307aa191-05b6-5227-19a0-13741c839fd1@cisco.com>

Hi everyone

Sorry for that late reply

01.03.2018, 21:58, "Stefan Strogin" <sstrogin@cisco.com>:
> So I was thinking to add these two fields to union event_data:
> task->signal->group_exit_code
> task->signal->flags
> This won't increase size of struct proc_event (because of comm_proc_event)
> and shouldn't break backward compatibility for the user-space. But it will
> add some useful information about what caused the process death.
> What do you think, is it an acceptable approach?

As I saw in other discussion, doesn't it break userspace API, or you are sure that no sizes has been increased?
You are using the same structure as used for plain signals and add group status there, how will userspace react,
if it was compiled with older headers? What if it uses zero-field alignment, i.e. allocating exactly the size of structure with byte precision?

^ permalink raw reply

* Re: pull-request: bpf 2018-04-09
From: David Miller @ 2018-04-08 23:58 UTC (permalink / raw)
  To: daniel; +Cc: ast, netdev
In-Reply-To: <20180408222847.27289-1-daniel@iogearbox.net>

From: Daniel Borkmann <daniel@iogearbox.net>
Date: Mon,  9 Apr 2018 00:28:47 +0200

> The following pull-request contains BPF updates for your *net* tree.
> 
> The main changes are:
> 
> 1) Two sockmap fixes: i) fix a potential warning when a socket with
>    pending cork data is closed by freeing the memory right when the
>    socket is closed instead of seeing still outstanding memory at
>    garbage collector time, ii) fix a NULL pointer deref in case of
>    duplicates release calls, so make sure to only reset the sk_prot
>    pointer when it's in a valid state to do so, both from John.
> 
> 2) Fix a compilation warning in bpf_prog_attach_check_attach_type()
>    by moving the function under CONFIG_CGROUP_BPF ifdef since only
>    used there, from Anders.
> 
> Please consider pulling these changes from:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Pulled, thanks Daniel.

^ permalink raw reply

* Re: [PATCH v3] dp83640: Ensure against premature access to PHY registers after reset
From: David Miller @ 2018-04-08 23:59 UTC (permalink / raw)
  To: esben.haabendal
  Cc: netdev, eha, richardcochran, andrew, f.fainelli, linux-kernel
In-Reply-To: <20180408201702.23299-1-esben.haabendal@gmail.com>

From: Esben Haabendal <esben.haabendal@gmail.com>
Date: Sun,  8 Apr 2018 22:17:01 +0200

> From: Esben Haabendal <eha@deif.com>
> 
> The datasheet specifies a 3uS pause after performing a software
> reset. The default implementation of genphy_soft_reset() does not
> provide this, so implement soft_reset with the needed pause.
> 
> Signed-off-by: Esben Haabendal <eha@deif.com>
> Reviewed-by: Andrew Lunn <andrew@lunn.ch>

Applied, thank you.

^ permalink raw reply

* [PATCH AUTOSEL for 4.9 160/293] MIPS: Give __secure_computing() access to syscall arguments.
From: Sasha Levin @ 2018-04-09  0:24 UTC (permalink / raw)
  To: stable@vger.kernel.org, linux-kernel@vger.kernel.org
  Cc: David Daney, Alexei Starovoitov, Daniel Borkmann, Matt Redfearn,
	netdev@vger.kernel.org, linux-mips@linux-mips.org, Ralf Baechle,
	Sasha Levin
In-Reply-To: <20180409002239.163177-1-alexander.levin@microsoft.com>

From: David Daney <david.daney@cavium.com>

[ Upstream commit 669c4092225f0ed5df12ebee654581b558a5e3ed ]

KProbes of __seccomp_filter() are not very useful without access to
the syscall arguments.

Do what x86 does, and populate a struct seccomp_data to be passed to
__secure_computing().  This allows samples/bpf/tracex5 to extract a
sensible trace.

Signed-off-by: David Daney <david.daney@cavium.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Matt Redfearn <matt.redfearn@imgtec.com>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/16368/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
---
 arch/mips/kernel/ptrace.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/arch/mips/kernel/ptrace.c b/arch/mips/kernel/ptrace.c
index 0c8ae2cc6380..956dae7e6a69 100644
--- a/arch/mips/kernel/ptrace.c
+++ b/arch/mips/kernel/ptrace.c
@@ -1011,8 +1011,26 @@ asmlinkage long syscall_trace_enter(struct pt_regs *regs, long syscall)
 	    tracehook_report_syscall_entry(regs))
 		return -1;
 
-	if (secure_computing(NULL) == -1)
-		return -1;
+#ifdef CONFIG_SECCOMP
+	if (unlikely(test_thread_flag(TIF_SECCOMP))) {
+		int ret, i;
+		struct seccomp_data sd;
+
+		sd.nr = syscall;
+		sd.arch = syscall_get_arch();
+		for (i = 0; i < 6; i++) {
+			unsigned long v, r;
+
+			r = mips_get_syscall_arg(&v, current, regs, i);
+			sd.args[i] = r ? 0 : v;
+		}
+		sd.instruction_pointer = KSTK_EIP(current);
+
+		ret = __secure_computing(&sd);
+		if (ret == -1)
+			return ret;
+	}
+#endif
 
 	if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
 		trace_sys_enter(regs, regs->regs[2]);
-- 
2.15.1

^ permalink raw reply related

* Re: KASAN: use-after-free Read in inet_create
From: Sowmini Varadhan @ 2018-04-09  1:04 UTC (permalink / raw)
  To: Eric Biggers
  Cc: linux-rdma, rds-devel, Santosh Shilimkar, syzbot, davem, kuznet,
	linux-kernel, netdev, syzkaller-bugs, yoshfuji
In-Reply-To: <20180408231756.GI685@sol.localdomain>

#syz dup: KASAN: use-after-free Read in rds_cong_queue_updates

There  are a number of manifestations of this bug, basically
all suggest that the connect/reconnect etc workqs are somehow
being scheduled after the netns is deleted, despite the
code refactoring in Commit  3db6e0d172c (and looks like
the WARN_ONs in that commit are not even being triggered).
We've not been able to reproduce this issues, and without
a crash dump (or some hint of other threads that were running
at the time of the problem) are working on figuring out
the root-cause by code-inspection.

--Sowmini

^ permalink raw reply

* [PATCH net-next] net/ncsi: Refactor MAC, VLAN filters
From: Samuel Mendoza-Jonas @ 2018-04-09  2:11 UTC (permalink / raw)
  To: netdev; +Cc: Samuel Mendoza-Jonas, David S . Miller, linux-kernel, openbmc

The NCSI driver defines a generic ncsi_channel_filter struct that can be
used to store arbitrarily formatted filters, and several generic methods
of accessing data stored in such a filter.
However in both the driver and as defined in the NCSI specification
there are only two actual filters: VLAN ID filters and MAC address
filters. The splitting of the MAC filter into unicast, multicast, and
mixed is also technically not necessary as these are stored in the same
location in hardware.

To save complexity, particularly in the set up and accessing of these
generic filters, remove them in favour of two specific structs. These
can be acted on directly and do not need several generic helper
functions to use.

This also fixes a memory error found by KASAN on ARM32 (which is not
upstream yet), where response handlers accessing a filter's data field
could write past allocated memory.

[  114.926512] ==================================================================
[  114.933861] BUG: KASAN: slab-out-of-bounds in ncsi_configure_channel+0x4b8/0xc58
[  114.941304] Read of size 2 at addr 94888558 by task kworker/0:2/546
[  114.947593]
[  114.949146] CPU: 0 PID: 546 Comm: kworker/0:2 Not tainted 4.16.0-rc6-00119-ge156398bfcad #13
...
[  115.170233] The buggy address belongs to the object at 94888540
[  115.170233]  which belongs to the cache kmalloc-32 of size 32
[  115.181917] The buggy address is located 24 bytes inside of
[  115.181917]  32-byte region [94888540, 94888560)
[  115.192115] The buggy address belongs to the page:
[  115.196943] page:9eeac100 count:1 mapcount:0 mapping:94888000 index:0x94888fc1
[  115.204200] flags: 0x100(slab)
[  115.207330] raw: 00000100 94888000 94888fc1 0000003f 00000001 9eea2014 9eecaa74 96c003e0
[  115.215444] page dumped because: kasan: bad access detected
[  115.221036]
[  115.222544] Memory state around the buggy address:
[  115.227384]  94888400: fb fb fb fb fc fc fc fc 04 fc fc fc fc fc fc fc
[  115.233959]  94888480: 00 00 00 fc fc fc fc fc 00 04 fc fc fc fc fc fc
[  115.240529] >94888500: 00 00 04 fc fc fc fc fc 00 00 04 fc fc fc fc fc
[  115.247077]                                             ^
[  115.252523]  94888580: 00 04 fc fc fc fc fc fc 06 fc fc fc fc fc fc fc
[  115.259093]  94888600: 00 00 06 fc fc fc fc fc 00 00 04 fc fc fc fc fc
[  115.265639] ==================================================================

Reported-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>
---
 net/ncsi/internal.h     |  34 +++---
 net/ncsi/ncsi-manage.c  | 226 +++++++++-------------------------------
 net/ncsi/ncsi-netlink.c |  20 ++--
 net/ncsi/ncsi-rsp.c     | 178 +++++++++++++------------------
 4 files changed, 147 insertions(+), 311 deletions(-)

diff --git a/net/ncsi/internal.h b/net/ncsi/internal.h
index 8da84312cd3b..8055e3965cef 100644
--- a/net/ncsi/internal.h
+++ b/net/ncsi/internal.h
@@ -68,15 +68,6 @@ enum {
 	NCSI_MODE_MAX
 };
 
-enum {
-	NCSI_FILTER_BASE	= 0,
-	NCSI_FILTER_VLAN	= 0,
-	NCSI_FILTER_UC,
-	NCSI_FILTER_MC,
-	NCSI_FILTER_MIXED,
-	NCSI_FILTER_MAX
-};
-
 struct ncsi_channel_version {
 	u32 version;		/* Supported BCD encoded NCSI version */
 	u32 alpha2;		/* Supported BCD encoded NCSI version */
@@ -98,11 +89,18 @@ struct ncsi_channel_mode {
 	u32 data[8];	/* Data entries                */
 };
 
-struct ncsi_channel_filter {
-	u32 index;	/* Index of channel filters          */
-	u32 total;	/* Total entries in the filter table */
-	u64 bitmap;	/* Bitmap of valid entries           */
-	u32 data[];	/* Data for the valid entries        */
+struct ncsi_channel_mac_filter {
+	u8	n_uc;
+	u8	n_mc;
+	u8	n_mixed;
+	u64	bitmap;
+	unsigned char	*addrs;
+};
+
+struct ncsi_channel_vlan_filter {
+	u8	n_vids;
+	u64	bitmap;
+	u16	*vids;
 };
 
 struct ncsi_channel_stats {
@@ -186,7 +184,9 @@ struct ncsi_channel {
 	struct ncsi_channel_version version;
 	struct ncsi_channel_cap	    caps[NCSI_CAP_MAX];
 	struct ncsi_channel_mode    modes[NCSI_MODE_MAX];
-	struct ncsi_channel_filter  *filters[NCSI_FILTER_MAX];
+	/* Filtering Settings */
+	struct ncsi_channel_mac_filter	mac_filter;
+	struct ncsi_channel_vlan_filter	vlan_filter;
 	struct ncsi_channel_stats   stats;
 	struct {
 		struct timer_list   timer;
@@ -320,10 +320,6 @@ extern spinlock_t ncsi_dev_lock;
 	list_for_each_entry_rcu(nc, &np->channels, node)
 
 /* Resources */
-u32 *ncsi_get_filter(struct ncsi_channel *nc, int table, int index);
-int ncsi_find_filter(struct ncsi_channel *nc, int table, void *data);
-int ncsi_add_filter(struct ncsi_channel *nc, int table, void *data);
-int ncsi_remove_filter(struct ncsi_channel *nc, int table, int index);
 void ncsi_start_channel_monitor(struct ncsi_channel *nc);
 void ncsi_stop_channel_monitor(struct ncsi_channel *nc);
 struct ncsi_channel *ncsi_find_channel(struct ncsi_package *np,
diff --git a/net/ncsi/ncsi-manage.c b/net/ncsi/ncsi-manage.c
index c3695ba0cf94..5561e221b71f 100644
--- a/net/ncsi/ncsi-manage.c
+++ b/net/ncsi/ncsi-manage.c
@@ -27,125 +27,6 @@
 LIST_HEAD(ncsi_dev_list);
 DEFINE_SPINLOCK(ncsi_dev_lock);
 
-static inline int ncsi_filter_size(int table)
-{
-	int sizes[] = { 2, 6, 6, 6 };
-
-	BUILD_BUG_ON(ARRAY_SIZE(sizes) != NCSI_FILTER_MAX);
-	if (table < NCSI_FILTER_BASE || table >= NCSI_FILTER_MAX)
-		return -EINVAL;
-
-	return sizes[table];
-}
-
-u32 *ncsi_get_filter(struct ncsi_channel *nc, int table, int index)
-{
-	struct ncsi_channel_filter *ncf;
-	int size;
-
-	ncf = nc->filters[table];
-	if (!ncf)
-		return NULL;
-
-	size = ncsi_filter_size(table);
-	if (size < 0)
-		return NULL;
-
-	return ncf->data + size * index;
-}
-
-/* Find the first active filter in a filter table that matches the given
- * data parameter. If data is NULL, this returns the first active filter.
- */
-int ncsi_find_filter(struct ncsi_channel *nc, int table, void *data)
-{
-	struct ncsi_channel_filter *ncf;
-	void *bitmap;
-	int index, size;
-	unsigned long flags;
-
-	ncf = nc->filters[table];
-	if (!ncf)
-		return -ENXIO;
-
-	size = ncsi_filter_size(table);
-	if (size < 0)
-		return size;
-
-	spin_lock_irqsave(&nc->lock, flags);
-	bitmap = (void *)&ncf->bitmap;
-	index = -1;
-	while ((index = find_next_bit(bitmap, ncf->total, index + 1))
-	       < ncf->total) {
-		if (!data || !memcmp(ncf->data + size * index, data, size)) {
-			spin_unlock_irqrestore(&nc->lock, flags);
-			return index;
-		}
-	}
-	spin_unlock_irqrestore(&nc->lock, flags);
-
-	return -ENOENT;
-}
-
-int ncsi_add_filter(struct ncsi_channel *nc, int table, void *data)
-{
-	struct ncsi_channel_filter *ncf;
-	int index, size;
-	void *bitmap;
-	unsigned long flags;
-
-	size = ncsi_filter_size(table);
-	if (size < 0)
-		return size;
-
-	index = ncsi_find_filter(nc, table, data);
-	if (index >= 0)
-		return index;
-
-	ncf = nc->filters[table];
-	if (!ncf)
-		return -ENODEV;
-
-	spin_lock_irqsave(&nc->lock, flags);
-	bitmap = (void *)&ncf->bitmap;
-	do {
-		index = find_next_zero_bit(bitmap, ncf->total, 0);
-		if (index >= ncf->total) {
-			spin_unlock_irqrestore(&nc->lock, flags);
-			return -ENOSPC;
-		}
-	} while (test_and_set_bit(index, bitmap));
-
-	memcpy(ncf->data + size * index, data, size);
-	spin_unlock_irqrestore(&nc->lock, flags);
-
-	return index;
-}
-
-int ncsi_remove_filter(struct ncsi_channel *nc, int table, int index)
-{
-	struct ncsi_channel_filter *ncf;
-	int size;
-	void *bitmap;
-	unsigned long flags;
-
-	size = ncsi_filter_size(table);
-	if (size < 0)
-		return size;
-
-	ncf = nc->filters[table];
-	if (!ncf || index >= ncf->total)
-		return -ENODEV;
-
-	spin_lock_irqsave(&nc->lock, flags);
-	bitmap = (void *)&ncf->bitmap;
-	if (test_and_clear_bit(index, bitmap))
-		memset(ncf->data + size * index, 0, size);
-	spin_unlock_irqrestore(&nc->lock, flags);
-
-	return 0;
-}
-
 static void ncsi_report_link(struct ncsi_dev_priv *ndp, bool force_down)
 {
 	struct ncsi_dev *nd = &ndp->ndev;
@@ -339,20 +220,13 @@ struct ncsi_channel *ncsi_add_channel(struct ncsi_package *np, unsigned char id)
 static void ncsi_remove_channel(struct ncsi_channel *nc)
 {
 	struct ncsi_package *np = nc->package;
-	struct ncsi_channel_filter *ncf;
 	unsigned long flags;
-	int i;
 
-	/* Release filters */
 	spin_lock_irqsave(&nc->lock, flags);
-	for (i = 0; i < NCSI_FILTER_MAX; i++) {
-		ncf = nc->filters[i];
-		if (!ncf)
-			continue;
 
-		nc->filters[i] = NULL;
-		kfree(ncf);
-	}
+	/* Release filters */
+	kfree(nc->mac_filter.addrs);
+	kfree(nc->vlan_filter.vids);
 
 	nc->state = NCSI_CHANNEL_INACTIVE;
 	spin_unlock_irqrestore(&nc->lock, flags);
@@ -670,32 +544,26 @@ static void ncsi_suspend_channel(struct ncsi_dev_priv *ndp)
 static int clear_one_vid(struct ncsi_dev_priv *ndp, struct ncsi_channel *nc,
 			 struct ncsi_cmd_arg *nca)
 {
+	struct ncsi_channel_vlan_filter *ncf;
+	unsigned long flags;
+	void *bitmap;
 	int index;
-	u32 *data;
 	u16 vid;
 
-	index = ncsi_find_filter(nc, NCSI_FILTER_VLAN, NULL);
-	if (index < 0) {
-		/* Filter table empty */
-		return -1;
-	}
+	ncf = &nc->vlan_filter;
+	bitmap = &ncf->bitmap;
 
-	data = ncsi_get_filter(nc, NCSI_FILTER_VLAN, index);
-	if (!data) {
-		netdev_err(ndp->ndev.dev,
-			   "NCSI: failed to retrieve filter %d\n", index);
-		/* Set the VLAN id to 0 - this will still disable the entry in
-		 * the filter table, but we won't know what it was.
-		 */
-		vid = 0;
-	} else {
-		vid = *(u16 *)data;
+	spin_lock_irqsave(&nc->lock, flags);
+	index = find_next_bit(bitmap, ncf->n_vids, 0);
+	if (index >= ncf->n_vids) {
+		spin_unlock_irqrestore(&nc->lock, flags);
+		return -1;
 	}
+	vid = ncf->vids[index];
 
-	netdev_printk(KERN_DEBUG, ndp->ndev.dev,
-		      "NCSI: removed vlan tag %u at index %d\n",
-		      vid, index + 1);
-	ncsi_remove_filter(nc, NCSI_FILTER_VLAN, index);
+	clear_bit(index, bitmap);
+	ncf->vids[index] = 0;
+	spin_unlock_irqrestore(&nc->lock, flags);
 
 	nca->type = NCSI_PKT_CMD_SVF;
 	nca->words[1] = vid;
@@ -711,45 +579,55 @@ static int clear_one_vid(struct ncsi_dev_priv *ndp, struct ncsi_channel *nc,
 static int set_one_vid(struct ncsi_dev_priv *ndp, struct ncsi_channel *nc,
 		       struct ncsi_cmd_arg *nca)
 {
+	struct ncsi_channel_vlan_filter *ncf;
 	struct vlan_vid *vlan = NULL;
-	int index = 0;
+	unsigned long flags;
+	int i, index;
+	void *bitmap;
+	u16 vid;
 
+	if (list_empty(&ndp->vlan_vids))
+		return -1;
+
+	ncf = &nc->vlan_filter;
+	bitmap = &ncf->bitmap;
+
+	spin_lock_irqsave(&nc->lock, flags);
+
+	rcu_read_lock();
 	list_for_each_entry_rcu(vlan, &ndp->vlan_vids, list) {
-		index = ncsi_find_filter(nc, NCSI_FILTER_VLAN, &vlan->vid);
-		if (index < 0) {
-			/* New tag to add */
-			netdev_printk(KERN_DEBUG, ndp->ndev.dev,
-				      "NCSI: new vlan id to set: %u\n",
-				      vlan->vid);
+		vid = vlan->vid;
+		for (i = 0; i < ncf->n_vids; i++)
+			if (ncf->vids[i] == vid) {
+				vid = 0;
+				break;
+			}
+		if (vid)
 			break;
-		}
-		netdev_printk(KERN_DEBUG, ndp->ndev.dev,
-			      "vid %u already at filter pos %d\n",
-			      vlan->vid, index);
 	}
+	rcu_read_unlock();
 
-	if (!vlan || index >= 0) {
-		netdev_printk(KERN_DEBUG, ndp->ndev.dev,
-			      "no vlan ids left to set\n");
+	if (!vid) {
+		/* No VLAN ID is not set */
+		spin_unlock_irqrestore(&nc->lock, flags);
 		return -1;
 	}
 
-	index = ncsi_add_filter(nc, NCSI_FILTER_VLAN, &vlan->vid);
-	if (index < 0) {
+	index = find_next_zero_bit(bitmap, ncf->n_vids, 0);
+	if (index < 0 || index >= ncf->n_vids) {
 		netdev_err(ndp->ndev.dev,
-			   "Failed to add new VLAN tag, error %d\n", index);
-		if (index == -ENOSPC)
-			netdev_err(ndp->ndev.dev,
-				   "Channel %u already has all VLAN filters set\n",
-				   nc->id);
+			   "Channel %u already has all VLAN filters set\n",
+			   nc->id);
+		spin_unlock_irqrestore(&nc->lock, flags);
 		return -1;
 	}
 
-	netdev_printk(KERN_DEBUG, ndp->ndev.dev,
-		      "NCSI: set vid %u in packet, index %u\n",
-		      vlan->vid, index + 1);
+	ncf->vids[index] = vid;
+	set_bit(index, bitmap);
+	spin_unlock_irqrestore(&nc->lock, flags);
+
 	nca->type = NCSI_PKT_CMD_SVF;
-	nca->words[1] = vlan->vid;
+	nca->words[1] = vid;
 	/* HW filter index starts at 1 */
 	nca->bytes[6] = index + 1;
 	nca->bytes[7] = 0x01;
diff --git a/net/ncsi/ncsi-netlink.c b/net/ncsi/ncsi-netlink.c
index 8d7e849d4825..b09ef77bf4cd 100644
--- a/net/ncsi/ncsi-netlink.c
+++ b/net/ncsi/ncsi-netlink.c
@@ -58,10 +58,9 @@ static int ncsi_write_channel_info(struct sk_buff *skb,
 				   struct ncsi_dev_priv *ndp,
 				   struct ncsi_channel *nc)
 {
-	struct nlattr *vid_nest;
-	struct ncsi_channel_filter *ncf;
+	struct ncsi_channel_vlan_filter *ncf;
 	struct ncsi_channel_mode *m;
-	u32 *data;
+	struct nlattr *vid_nest;
 	int i;
 
 	nla_put_u32(skb, NCSI_CHANNEL_ATTR_ID, nc->id);
@@ -79,18 +78,13 @@ static int ncsi_write_channel_info(struct sk_buff *skb,
 	vid_nest = nla_nest_start(skb, NCSI_CHANNEL_ATTR_VLAN_LIST);
 	if (!vid_nest)
 		return -ENOMEM;
-	ncf = nc->filters[NCSI_FILTER_VLAN];
+	ncf = &nc->vlan_filter;
 	i = -1;
-	if (ncf) {
-		while ((i = find_next_bit((void *)&ncf->bitmap, ncf->total,
-					  i + 1)) < ncf->total) {
-			data = ncsi_get_filter(nc, NCSI_FILTER_VLAN, i);
-			/* Uninitialised channels will have 'zero' vlan ids */
-			if (!data || !*data)
-				continue;
+	while ((i = find_next_bit((void *)&ncf->bitmap, ncf->n_vids,
+				  i + 1)) < ncf->n_vids) {
+		if (ncf->vids[i])
 			nla_put_u16(skb, NCSI_CHANNEL_ATTR_VLAN_ID,
-				    *(u16 *)data);
-		}
+				    ncf->vids[i]);
 	}
 	nla_nest_end(skb, vid_nest);
 
diff --git a/net/ncsi/ncsi-rsp.c b/net/ncsi/ncsi-rsp.c
index efd933ff5570..ce9497966ebe 100644
--- a/net/ncsi/ncsi-rsp.c
+++ b/net/ncsi/ncsi-rsp.c
@@ -334,9 +334,9 @@ static int ncsi_rsp_handler_svf(struct ncsi_request *nr)
 	struct ncsi_rsp_pkt *rsp;
 	struct ncsi_dev_priv *ndp = nr->ndp;
 	struct ncsi_channel *nc;
-	struct ncsi_channel_filter *ncf;
-	unsigned short vlan;
-	int ret;
+	struct ncsi_channel_vlan_filter *ncf;
+	unsigned long flags;
+	void *bitmap;
 
 	/* Find the package and channel */
 	rsp = (struct ncsi_rsp_pkt *)skb_network_header(nr->rsp);
@@ -346,22 +346,23 @@ static int ncsi_rsp_handler_svf(struct ncsi_request *nr)
 		return -ENODEV;
 
 	cmd = (struct ncsi_cmd_svf_pkt *)skb_network_header(nr->cmd);
-	ncf = nc->filters[NCSI_FILTER_VLAN];
-	if (!ncf)
-		return -ENOENT;
-	if (cmd->index >= ncf->total)
+	ncf = &nc->vlan_filter;
+	if (cmd->index > ncf->n_vids)
 		return -ERANGE;
 
-	/* Add or remove the VLAN filter */
+	/* Add or remove the VLAN filter. Remember HW indexes from 1 */
+	spin_lock_irqsave(&nc->lock, flags);
+	bitmap = &ncf->bitmap;
 	if (!(cmd->enable & 0x1)) {
-		/* HW indexes from 1 */
-		ret = ncsi_remove_filter(nc, NCSI_FILTER_VLAN, cmd->index - 1);
+		if (test_and_clear_bit(cmd->index - 1, bitmap))
+			ncf->vids[cmd->index - 1] = 0;
 	} else {
-		vlan = ntohs(cmd->vlan);
-		ret = ncsi_add_filter(nc, NCSI_FILTER_VLAN, &vlan);
+		set_bit(cmd->index - 1, bitmap);
+		ncf->vids[cmd->index - 1] = ntohs(cmd->vlan);
 	}
+	spin_unlock_irqrestore(&nc->lock, flags);
 
-	return ret;
+	return 0;
 }
 
 static int ncsi_rsp_handler_ev(struct ncsi_request *nr)
@@ -422,8 +423,12 @@ static int ncsi_rsp_handler_sma(struct ncsi_request *nr)
 	struct ncsi_rsp_pkt *rsp;
 	struct ncsi_dev_priv *ndp = nr->ndp;
 	struct ncsi_channel *nc;
-	struct ncsi_channel_filter *ncf;
+	struct ncsi_channel_mac_filter *ncf;
+	unsigned long flags;
 	void *bitmap;
+	bool enabled;
+	int index;
+
 
 	/* Find the package and channel */
 	rsp = (struct ncsi_rsp_pkt *)skb_network_header(nr->rsp);
@@ -436,31 +441,23 @@ static int ncsi_rsp_handler_sma(struct ncsi_request *nr)
 	 * isn't supported yet.
 	 */
 	cmd = (struct ncsi_cmd_sma_pkt *)skb_network_header(nr->cmd);
-	switch (cmd->at_e >> 5) {
-	case 0x0:	/* UC address */
-		ncf = nc->filters[NCSI_FILTER_UC];
-		break;
-	case 0x1:	/* MC address */
-		ncf = nc->filters[NCSI_FILTER_MC];
-		break;
-	default:
-		return -EINVAL;
-	}
+	enabled = cmd->at_e & 0x1;
+	ncf = &nc->mac_filter;
+	bitmap = &ncf->bitmap;
 
-	/* Sanity check on the filter */
-	if (!ncf)
-		return -ENOENT;
-	else if (cmd->index >= ncf->total)
+	if (cmd->index > ncf->n_uc + ncf->n_mc + ncf->n_mixed)
 		return -ERANGE;
 
-	bitmap = &ncf->bitmap;
-	if (cmd->at_e & 0x1) {
-		set_bit(cmd->index, bitmap);
-		memcpy(ncf->data + 6 * cmd->index, cmd->mac, 6);
+	index = (cmd->index - 1) * ETH_ALEN;
+	spin_lock_irqsave(&nc->lock, flags);
+	if (enabled) {
+		set_bit(cmd->index - 1, bitmap);
+		memcpy(&ncf->addrs[index], cmd->mac, ETH_ALEN);
 	} else {
-		clear_bit(cmd->index, bitmap);
-		memset(ncf->data + 6 * cmd->index, 0, 6);
+		clear_bit(cmd->index - 1, bitmap);
+		memset(&ncf->addrs[index], 0, ETH_ALEN);
 	}
+	spin_unlock_irqrestore(&nc->lock, flags);
 
 	return 0;
 }
@@ -631,9 +628,7 @@ static int ncsi_rsp_handler_gc(struct ncsi_request *nr)
 	struct ncsi_rsp_gc_pkt *rsp;
 	struct ncsi_dev_priv *ndp = nr->ndp;
 	struct ncsi_channel *nc;
-	struct ncsi_channel_filter *ncf;
-	size_t size, entry_size;
-	int cnt, i;
+	size_t size;
 
 	/* Find the channel */
 	rsp = (struct ncsi_rsp_gc_pkt *)skb_network_header(nr->rsp);
@@ -655,64 +650,40 @@ static int ncsi_rsp_handler_gc(struct ncsi_request *nr)
 	nc->caps[NCSI_CAP_VLAN].cap = rsp->vlan_mode &
 				      NCSI_CAP_VLAN_MASK;
 
-	/* Build filters */
-	for (i = 0; i < NCSI_FILTER_MAX; i++) {
-		switch (i) {
-		case NCSI_FILTER_VLAN:
-			cnt = rsp->vlan_cnt;
-			entry_size = 2;
-			break;
-		case NCSI_FILTER_MIXED:
-			cnt = rsp->mixed_cnt;
-			entry_size = 6;
-			break;
-		case NCSI_FILTER_MC:
-			cnt = rsp->mc_cnt;
-			entry_size = 6;
-			break;
-		case NCSI_FILTER_UC:
-			cnt = rsp->uc_cnt;
-			entry_size = 6;
-			break;
-		default:
-			continue;
-		}
-
-		if (!cnt || nc->filters[i])
-			continue;
-
-		size = sizeof(*ncf) + cnt * entry_size;
-		ncf = kzalloc(size, GFP_ATOMIC);
-		if (!ncf) {
-			pr_warn("%s: Cannot alloc filter table (%d)\n",
-				__func__, i);
-			return -ENOMEM;
-		}
-
-		ncf->index = i;
-		ncf->total = cnt;
-		if (i == NCSI_FILTER_VLAN) {
-			/* Set VLAN filters active so they are cleared in
-			 * first configuration state
-			 */
-			ncf->bitmap = U64_MAX;
-		} else {
-			ncf->bitmap = 0x0ul;
-		}
-		nc->filters[i] = ncf;
-	}
+	size = (rsp->uc_cnt + rsp->mc_cnt + rsp->mixed_cnt) * ETH_ALEN;
+	nc->mac_filter.addrs = kzalloc(size, GFP_KERNEL);
+	if (!nc->mac_filter.addrs)
+		return -ENOMEM;
+	nc->mac_filter.n_uc = rsp->uc_cnt;
+	nc->mac_filter.n_mc = rsp->mc_cnt;
+	nc->mac_filter.n_mixed = rsp->mixed_cnt;
+
+	nc->vlan_filter.vids = kcalloc(rsp->vlan_cnt,
+				       sizeof(*nc->vlan_filter.vids),
+				       GFP_KERNEL);
+	if (!nc->vlan_filter.vids)
+		return -ENOMEM;
+	/* Set VLAN filters active so they are cleared in the first
+	 * configuration state
+	 */
+	nc->vlan_filter.bitmap = U64_MAX;
+	nc->vlan_filter.n_vids = rsp->vlan_cnt;
 
 	return 0;
 }
 
 static int ncsi_rsp_handler_gp(struct ncsi_request *nr)
 {
-	struct ncsi_rsp_gp_pkt *rsp;
+	struct ncsi_channel_vlan_filter *ncvf;
+	struct ncsi_channel_mac_filter *ncmf;
 	struct ncsi_dev_priv *ndp = nr->ndp;
+	struct ncsi_rsp_gp_pkt *rsp;
 	struct ncsi_channel *nc;
-	unsigned short enable, vlan;
+	unsigned short enable;
 	unsigned char *pdata;
-	int table, i;
+	unsigned long flags;
+	void *bitmap;
+	int i;
 
 	/* Find the channel */
 	rsp = (struct ncsi_rsp_gp_pkt *)skb_network_header(nr->rsp);
@@ -746,36 +717,33 @@ static int ncsi_rsp_handler_gp(struct ncsi_request *nr)
 	/* MAC addresses filter table */
 	pdata = (unsigned char *)rsp + 48;
 	enable = rsp->mac_enable;
+	ncmf = &nc->mac_filter;
+	spin_lock_irqsave(&nc->lock, flags);
+	bitmap = &ncmf->bitmap;
 	for (i = 0; i < rsp->mac_cnt; i++, pdata += 6) {
-		if (i >= (nc->filters[NCSI_FILTER_UC]->total +
-			  nc->filters[NCSI_FILTER_MC]->total))
-			table = NCSI_FILTER_MIXED;
-		else if (i >= nc->filters[NCSI_FILTER_UC]->total)
-			table = NCSI_FILTER_MC;
-		else
-			table = NCSI_FILTER_UC;
-
 		if (!(enable & (0x1 << i)))
-			continue;
-
-		if (ncsi_find_filter(nc, table, pdata) >= 0)
-			continue;
+			clear_bit(i, bitmap);
+		else
+			set_bit(i, bitmap);
 
-		ncsi_add_filter(nc, table, pdata);
+		memcpy(&ncmf->addrs[i * ETH_ALEN], pdata, ETH_ALEN);
 	}
+	spin_unlock_irqrestore(&nc->lock, flags);
 
 	/* VLAN filter table */
 	enable = ntohs(rsp->vlan_enable);
+	ncvf = &nc->vlan_filter;
+	bitmap = &ncvf->bitmap;
+	spin_lock_irqsave(&nc->lock, flags);
 	for (i = 0; i < rsp->vlan_cnt; i++, pdata += 2) {
 		if (!(enable & (0x1 << i)))
-			continue;
-
-		vlan = ntohs(*(__be16 *)pdata);
-		if (ncsi_find_filter(nc, NCSI_FILTER_VLAN, &vlan) >= 0)
-			continue;
+			clear_bit(i, bitmap);
+		else
+			set_bit(i, bitmap);
 
-		ncsi_add_filter(nc, NCSI_FILTER_VLAN, &vlan);
+		ncvf->vids[i] = ntohs(*(__be16 *)pdata);
 	}
+	spin_unlock_irqrestore(&nc->lock, flags);
 
 	return 0;
 }
-- 
2.17.0

^ permalink raw reply related

* Re: kernel BUG at drivers/vhost/vhost.c:LINE! (2)
From: Stefan Hajnoczi @ 2018-04-09  2:37 UTC (permalink / raw)
  To: syzbot
  Cc: Jason Wang, kvm, linux-kernel, Michael S. Tsirkin, netdev,
	syzkaller-bugs, Linux Virtualization
In-Reply-To: <001a11449424f11322056932b09c@google.com>

On Sat, Apr 7, 2018 at 3:02 AM, syzbot
<syzbot+65a84dde0214b0387ccd@syzkaller.appspotmail.com> wrote:
> syzbot hit the following crash on upstream commit
> 38c23685b273cfb4ccf31a199feccce3bdcb5d83 (Fri Apr 6 04:29:35 2018 +0000)
> Merge tag 'armsoc-drivers' of
> git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
> syzbot dashboard link:
> https://syzkaller.appspot.com/bug?extid=65a84dde0214b0387ccd

To prevent duplicated work: I am working on this one.

Stefan

>
> So far this crash happened 4 times on upstream.
> C reproducer: https://syzkaller.appspot.com/x/repro.c?id=6586748079439872
> syzkaller reproducer:
> https://syzkaller.appspot.com/x/repro.syz?id=5974272052822016
> Raw console output:
> https://syzkaller.appspot.com/x/log.txt?id=6224632407392256
> Kernel config:
> https://syzkaller.appspot.com/x/.config?id=-5813481738265533882
> compiler: gcc (GCC) 8.0.1 20180301 (experimental)
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+65a84dde0214b0387ccd@syzkaller.appspotmail.com
> It will help syzbot understand when the bug is fixed. See footer for
> details.
> If you forward the report, please keep this part and the footer.
>
> ------------[ cut here ]------------
> kernel BUG at drivers/vhost/vhost.c:1652!
> invalid opcode: 0000 [#1] SMP KASAN
> ------------[ cut here ]------------
> Dumping ftrace buffer:
> kernel BUG at drivers/vhost/vhost.c:1652!
>    (ftrace buffer empty)
> Modules linked in:
> CPU: 1 PID: 4461 Comm: syzkaller684218 Not tainted 4.16.0+ #3
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> RIP: 0010:set_bit_to_user drivers/vhost/vhost.c:1652 [inline]
> RIP: 0010:log_write+0x42a/0x4d0 drivers/vhost/vhost.c:1676
> RSP: 0018:ffff8801b256f920 EFLAGS: 00010293
> RAX: ffff8801adc9e2c0 RBX: dffffc0000000000 RCX: ffffffff85924a0f
> RDX: 0000000000000000 RSI: ffffffff85924cea RDI: 0000000000000005
> RBP: ffff8801b256fa58 R08: ffff8801adc9e2c0 R09: ffffed003962412d
> R10: ffff8801b256fad8 R11: ffff8801cb12096f R12: 0001ffffffffffff
> R13: ffffed00364adf36 R14: 0000000000000000 R15: ffff8801b256fa30
> FS:  00007fdf24b19700(0000) GS:ffff8801db100000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000020bf6000 CR3: 00000001ae6a7000 CR4: 00000000001406e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
>  vhost_update_used_flags+0x3af/0x4a0 drivers/vhost/vhost.c:1723
>  vhost_vq_init_access+0x117/0x590 drivers/vhost/vhost.c:1763
>  vhost_vsock_start drivers/vhost/vsock.c:446 [inline]
>  vhost_vsock_dev_ioctl+0x751/0x920 drivers/vhost/vsock.c:678
>  vfs_ioctl fs/ioctl.c:46 [inline]
>  file_ioctl fs/ioctl.c:500 [inline]
>  do_vfs_ioctl+0x1cf/0x1650 fs/ioctl.c:684
>  ksys_ioctl+0xa9/0xd0 fs/ioctl.c:701
>  SYSC_ioctl fs/ioctl.c:708 [inline]
>  SyS_ioctl+0x24/0x30 fs/ioctl.c:706
>  do_syscall_64+0x29e/0x9d0 arch/x86/entry/common.c:287
>  entry_SYSCALL_64_after_hwframe+0x42/0xb7
> RIP: 0033:0x4456c9
> RSP: 002b:00007fdf24b18da8 EFLAGS: 00000297 ORIG_RAX: 0000000000000010
> RAX: ffffffffffffffda RBX: 00000000006dac24 RCX: 00000000004456c9
> RDX: 0000000020f82ffc RSI: 000000004004af61 RDI: 000000000000001b
> RBP: 00000000006dac20 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000297 R12: 6b636f73762d7473
> R13: 6f68762f7665642f R14: fffffffffffffffc R15: 0000000000000007
> Code: e8 7c 5e e4 fb 4c 89 ef e8 e4 16 06 fc 48 8d 85 58 ff ff ff 48 c1 e8
> 03 c6 04 18 f8 e9 46 ff ff ff 45 31 f6 eb 91 e8 56 5e e4 fb <0f> 0b e8 4f 5e
> e4 fb 48 c7 c6 a0 a3 24 88 4c 89 ef e8 60 b6 10
> RIP: set_bit_to_user drivers/vhost/vhost.c:1652 [inline] RSP:
> ffff8801b256f920
> RIP: log_write+0x42a/0x4d0 drivers/vhost/vhost.c:1676 RSP: ffff8801b256f920
> invalid opcode: 0000 [#2] SMP KASAN
> ---[ end trace 0d0ff45aa44d8a23 ]---
> Dumping ftrace buffer:
>    (ftrace buffer empty)
> Modules linked in:
>
>
> ---
> This bug is generated by a dumb bot. It may contain errors.
> See https://goo.gl/tpsmEJ for details.
> Direct all questions to syzkaller@googlegroups.com.
>
> syzbot will keep track of this bug report.
> If you forgot to add the Reported-by tag, once the fix for this bug is
> merged
> into any tree, please reply to this email with:
> #syz fix: exact-commit-title
> If you want to test a patch for this bug, please reply with:
> #syz test: git://repo/address.git branch
> and provide the patch inline or as an attachment.
> To mark this as a duplicate of another syzbot report, please reply with:
> #syz dup: exact-subject-of-another-report
> If it's a one-off invalid bug report, please reply with:
> #syz invalid
> Note: if the crash happens again, it will cause creation of a new bug
> report.
> Note: all commands must start from beginning of the line in the email body.

^ permalink raw reply

* Re: [PATCH net-next] net/ncsi: Refactor MAC, VLAN filters
From: David Miller @ 2018-04-09  2:40 UTC (permalink / raw)
  To: sam; +Cc: netdev, linux-kernel, openbmc
In-Reply-To: <20180409021128.10055-1-sam@mendozajonas.com>

The net-next tree is closed at this time, please resend this when the
merge window is over and the net-next tree opens back up.

Thank you.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox