Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH 1/2] net/fsl_pq_mdio: Allow explicit speficition of TBIPA address
From: David Miller @ 2018-04-08 16:45 UTC (permalink / raw)
  To: esben.haabendal
  Cc: netdev, eha, claudiu.manoil, robh+dt, mark.rutland, garsilva,
	devicetree, linux-kernel
In-Reply-To: <20180406123836.12019-1-esben.haabendal@gmail.com>

From: Esben Haabendal <esben.haabendal@gmail.com>
Date: Fri,  6 Apr 2018 14:38:34 +0200

> From: Esben Haabendal <eha@deif.com>
> 
> This introduces a simpler and generic method for for finding (and mapping)
> the TBIPA register.
> 
> Instead of relying of complicated logic for finding the TBIPA register
> address based on the MDIO or MII register block base
> address, which even in some cases relies on undocumented shadow registers,
> a second "reg" entry for the mdio bus devicetree node specifies the TBIPA
> register.
> 
> Backwards compatibility is kept, as the existing logic is applied when
> only a single "reg" mapping is specified.
> 
> Signed-off-by: Esben Haabendal <eha@deif.com>

Applied.

^ permalink raw reply

* Re: [PATCH] ARM: dts: ls1021a: Specify TBIPA register address
From: David Miller @ 2018-04-08 16:45 UTC (permalink / raw)
  To: esben.haabendal
  Cc: netdev, robh+dt, mark.rutland, devicetree, linux-kernel,
	claudiu.manoil, garsilva, eha
In-Reply-To: <20180406124635.12319-1-esben.haabendal@gmail.com>

From: Esben Haabendal <esben.haabendal@gmail.com>
Date: Fri,  6 Apr 2018 14:46:35 +0200

> From: Esben Haabendal <eha@deif.com>
> 
> The current (mildly evil) fsl_pq_mdio code uses an undocumented shadow of
> the TBIPA register on LS1021A, which happens to be read-only.
> Changing TBI PHY address therefore does not work on LS1021A.
> 
> The real (and documented) address of the TBIPA registere lies in the eTSEC
> block and not in MDIO/MII, which is read/write, so using that fixes
> the problem.
> 
> Signed-off-by: Esben Haabendal <eha@deif.com>

Applied.

^ permalink raw reply

* Re: [patch net] devlink: convert occ_get op to separate registration
From: David Miller @ 2018-04-08 16:46 UTC (permalink / raw)
  To: jiri; +Cc: netdev, idosch, jakub.kicinski, dsahern, mlxsw
In-Reply-To: <20180405201321.16626-1-jiri@resnulli.us>

From: Jiri Pirko <jiri@resnulli.us>
Date: Thu,  5 Apr 2018 22:13:21 +0200

> From: Jiri Pirko <jiri@mellanox.com>
> 
> This resolves race during initialization where the resources with
> ops are registered before driver and the structures used by occ_get
> op is initialized. So keep occ_get callbacks registered only when
> all structs are initialized.
 ...
> Fixes: d9f9b9a4d05f ("devlink: Add support for resource abstraction")
> Signed-off-by: Jiri Pirko <jiri@mellanox.com>

Applied and queued up for -stable, thanks.

^ permalink raw reply

* Re: [PATCH net 0/8] net: fix uninit-values in networking stack
From: David Miller @ 2018-04-08 16:49 UTC (permalink / raw)
  To: eric.dumazet; +Cc: edumazet, netdev
In-Reply-To: <3748e283-5172-199c-660d-6814afe51823@gmail.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sun, 8 Apr 2018 09:38:13 -0700

> On 04/07/2018 07:40 PM, David Miller wrote:
>> From: Eric Dumazet <edumazet@google.com>
>> Date: Sat,  7 Apr 2018 13:42:35 -0700
>> 
>>> It seems syzbot got new features enabled, and fired some interesting
>>> reports. Oh well.
>> 
>> Series applied, however in patch #7 the condition syzbot detects
>> cannot happen.
>> 
>> In all code paths that lead to __mkroute_output() with res->type
>> uninitialized, __mkroute_output() will reassign the local variable
>> 'type' before reading it.
> 
> Well, we have :
> 
> u16 type = res->type;
> ...
> 
>        if (ipv4_is_lbcast(fl4->daddr))
>                 type = RTN_BROADCAST;
>         else if (ipv4_is_multicast(fl4->daddr))
>                 type = RTN_MULTICAST;
>         else if (ipv4_is_zeronet(fl4->daddr))
>                 return ERR_PTR(-EINVAL);
> 
> ...
> 
> if (type == RTN_BROADCAST) {  /* This is where KMSAN complained */
> 
> So it looks like type could have been random at this point.

Ok, then.  It seems that the requirement is:

	fl4->flowi4_oif is non-zero
	fl4->daddr is neither local multicast nor lbcast
	fl4->flowi4_proto is IPPROTO_IGMP

Then we can trigger such a sequence of events.

^ permalink raw reply

* Re: [PATCH] make net_gso_ok return false when gso_type is zero(invalid)
From: David Miller @ 2018-04-08 16:51 UTC (permalink / raw)
  To: march511; +Cc: netdev, linux-kernel
In-Reply-To: <20180406014340.1562-1-march511@gmail.com>

From: Wenhua Shi <march511@gmail.com>
Date: Fri,  6 Apr 2018 03:43:39 +0200

> Signed-off-by: Wenhua Shi <march511@gmail.com>

This precondition should be made impossible instead of having to do
an extra check everywhere that this helper is invoked, many of which
are in fast paths.

^ permalink raw reply

* Re: [PATCH] vhost-net: set packet weight of tx polling to 2 * vq size
From: David Miller @ 2018-04-08 16:52 UTC (permalink / raw)
  To: haibinzhang
  Cc: mst, jasowang, kvm, virtualization, netdev, linux-kernel,
	lidongchen, yunfangtai
In-Reply-To: <88D661ADF6AFBF42B2AB88D8E7682B0901FC47D3@EXMBX-SZMAIL011.tencent.com>

From: haibinzhang(张海斌) <haibinzhang@tencent.com>
Date: Fri, 6 Apr 2018 08:22:37 +0000

> handle_tx will delay rx for tens or even hundreds of milliseconds when tx busy
> polling udp packets with small length(e.g. 1byte udp payload), because setting
> VHOST_NET_WEIGHT takes into account only sent-bytes but no single packet length.
> 
> Ping-Latencies shown below were tested between two Virtual Machines using
> netperf (UDP_STREAM, len=1), and then another machine pinged the client:
...
> Signed-off-by: Haibin Zhang <haibinzhang@tencent.com>
> Signed-off-by: Yunfang Tai <yunfangtai@tencent.com>
> Signed-off-by: Lidong Chen <lidongchen@tencent.com>

Michael and Jason, please review.

^ permalink raw reply

* Re: [PATCH net 0/8] net: fix uninit-values in networking stack
From: Eric Dumazet @ 2018-04-08 16:55 UTC (permalink / raw)
  To: David Miller; +Cc: edumazet, netdev
In-Reply-To: <20180408.124959.590910633534141941.davem@davemloft.net>



On 04/08/2018 09:49 AM, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Sun, 8 Apr 2018 09:38:13 -0700
> 
>> On 04/07/2018 07:40 PM, David Miller wrote:
>>> From: Eric Dumazet <edumazet@google.com>
>>> Date: Sat,  7 Apr 2018 13:42:35 -0700
>>>
>>>> It seems syzbot got new features enabled, and fired some interesting
>>>> reports. Oh well.
>>>
>>> Series applied, however in patch #7 the condition syzbot detects
>>> cannot happen.
>>>
>>> In all code paths that lead to __mkroute_output() with res->type
>>> uninitialized, __mkroute_output() will reassign the local variable
>>> 'type' before reading it.
>>
>> Well, we have :
>>
>> u16 type = res->type;
>> ...
>>
>>        if (ipv4_is_lbcast(fl4->daddr))
>>                 type = RTN_BROADCAST;
>>         else if (ipv4_is_multicast(fl4->daddr))
>>                 type = RTN_MULTICAST;
>>         else if (ipv4_is_zeronet(fl4->daddr))
>>                 return ERR_PTR(-EINVAL);
>>
>> ...
>>
>> if (type == RTN_BROADCAST) {  /* This is where KMSAN complained */
>>
>> So it looks like type could have been random at this point.
> 
> Ok, then.  It seems that the requirement is:
> 
> 	fl4->flowi4_oif is non-zero
> 	fl4->daddr is neither local multicast nor lbcast
> 	fl4->flowi4_proto is IPPROTO_IGMP
> 
> Then we can trigger such a sequence of events.
> 

OK, maybe some more work then ;)


I also have a report of a WARN() in ip_rt_bug(), added in commit c378a9c019cf5e017d1ed24954b54fae7bebd2bc
by Dave Jones.

Not sure what to do, maybe revert, since ip_rt_bug() is not catastrophic.

WARNING: CPU: 0 PID: 11678 at net/ipv4/route.c:1213 ip_rt_bug+0x15/0x20 net/ipv4/route.c:1212
Kernel panic - not syncing: panic_on_warn set ...

CPU: 0 PID: 11678 Comm: kworker/u4:7 Not tainted 4.16.0-rc6+ #289
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 <IRQ>
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x194/0x24d lib/dump_stack.c:53
 panic+0x1e4/0x41c kernel/panic.c:183
 __warn+0x1dc/0x200 kernel/panic.c:547
 report_bug+0x1f4/0x2b0 lib/bug.c:186
 fixup_bug.part.10+0x37/0x80 arch/x86/kernel/traps.c:178
 fixup_bug arch/x86/kernel/traps.c:247 [inline]
 do_error_trap+0x2d7/0x3e0 arch/x86/kernel/traps.c:296
 do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
 invalid_op+0x1b/0x40 arch/x86/entry/entry_64.S:986
RIP: 0010:ip_rt_bug+0x15/0x20 net/ipv4/route.c:1212
RSP: 0018:ffff8801db007290 EFLAGS: 00010282
RAX: dffffc0000000000 RBX: ffff8801d8dda3c0 RCX: ffffffff856c31ca
RDX: 0000000000000100 RSI: ffffffff8858c300 RDI: 0000000000000282
RBP: ffff8801db007298 R08: 1ffff1003b600de1 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801d8dda3c0
R13: ffff88019bdb2200 R14: ffff88019bdeed80 R15: ffff8801d8dda418
 dst_output include/net/dst.h:444 [inline]
 ip_local_out+0x95/0x160 net/ipv4/ip_output.c:124
 ip_send_skb+0x3c/0xc0 net/ipv4/ip_output.c:1414
 ip_push_pending_frames+0x64/0x80 net/ipv4/ip_output.c:1434
 icmp_push_reply+0x395/0x4f0 net/ipv4/icmp.c:394
 icmp_send+0x1136/0x19b0 net/ipv4/icmp.c:741
 ipv4_link_failure+0x2a/0x1b0 net/ipv4/route.c:1200
 dst_link_failure include/net/dst.h:427 [inline]
 arp_error_report+0xae/0x180 net/ipv4/arp.c:297
 neigh_invalidate+0x225/0x530 net/core/neighbour.c:883
 neigh_timer_handler+0x897/0xd60 net/core/neighbour.c:969
 call_timer_fn+0x228/0x820 kernel/time/timer.c:1326
 expire_timers kernel/time/timer.c:1363 [inline]
 __run_timers+0x7ee/0xb70 kernel/time/timer.c:1666
 run_timer_softirq+0x4c/0x70 kernel/time/timer.c:1692
 __do_softirq+0x2d7/0xb85 kernel/softirq.c:285
 invoke_softirq kernel/softirq.c:365 [inline]
 irq_exit+0x1cc/0x200 kernel/softirq.c:405
 exiting_irq arch/x86/include/asm/apic.h:541 [inline]
 smp_apic_timer_interrupt+0x16b/0x700 arch/x86/kernel/apic/apic.c:1052
 apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:857
 </IRQ>

^ permalink raw reply

* pull request: bluetooth 2018-04-08
From: Johan Hedberg @ 2018-04-08 17:47 UTC (permalink / raw)
  To: davem; +Cc: linux-bluetooth, netdev

[-- Attachment #1: Type: text/plain, Size: 1135 bytes --]

Hi Dave,

Here's one important Bluetooth fix for the 4.17-rc series that's needed
to pass several Bluetooth qualification test cases.

Let me know if there are any issues pulling. Thanks.

Johan

---
The following changes since commit b5dbc28762fd3fd40ba76303be0c7f707826f982:

  Merge tag 'kbuild-fixes-v4.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild (2018-03-30 18:53:57 -1000)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth.git for-upstream

for you to fetch changes up to 082f2300cfa1a3d9d5221c38c5eba85d4ab98bd8:

  Bluetooth: Fix connection if directed advertising and privacy is used (2018-04-03 16:12:56 +0200)

----------------------------------------------------------------
Szymon Janc (1):
      Bluetooth: Fix connection if directed advertising and privacy is used

 include/net/bluetooth/hci_core.h |  2 +-
 net/bluetooth/hci_conn.c         | 29 +++++++++++++++++++++--------
 net/bluetooth/hci_event.c        | 15 +++++++++++----
 net/bluetooth/l2cap_core.c       |  2 +-
 4 files changed, 34 insertions(+), 14 deletions(-)

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 801 bytes --]

^ permalink raw reply

* [PATCH] net: bridge: add missing NULL checks
From: Laszlo Toth @ 2018-04-08 17:49 UTC (permalink / raw)
  To: Stephen Hemminger, David S. Miller; +Cc: bridge, netdev

br_port_get_rtnl() can return NULL

Signed-off-by: Laszlo Toth <laszlth@gmail.com>
---
 net/bridge/br_netlink.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
index 015f465c..cbec11f 100644
--- a/net/bridge/br_netlink.c
+++ b/net/bridge/br_netlink.c
@@ -939,14 +939,17 @@ static int br_port_slave_changelink(struct net_device *brdev,
 				    struct nlattr *data[],
 				    struct netlink_ext_ack *extack)
 {
+	struct net_bridge_port *port = br_port_get_rtnl(dev);
 	struct net_bridge *br = netdev_priv(brdev);
 	int ret;
 
 	if (!data)
 		return 0;
+	if (!port)
+		return -EINVAL;
 
 	spin_lock_bh(&br->lock);
-	ret = br_setport(br_port_get_rtnl(dev), data);
+	ret = br_setport(port, data);
 	spin_unlock_bh(&br->lock);
 
 	return ret;
@@ -956,7 +959,12 @@ static int br_port_fill_slave_info(struct sk_buff *skb,
 				   const struct net_device *brdev,
 				   const struct net_device *dev)
 {
-	return br_port_fill_attrs(skb, br_port_get_rtnl(dev));
+	struct net_bridge_port *port = br_port_get_rtnl(dev);
+
+	if (!port)
+		return -EINVAL;
+
+	return br_port_fill_attrs(skb, port);
 }
 
 static size_t br_port_get_slave_size(const struct net_device *brdev,
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH iproute2-next 1/1] tc: jsonify skbedit action
From: David Ahern @ 2018-04-08 17:50 UTC (permalink / raw)
  To: Roman Mashak, dsahern; +Cc: stephen, netdev, kernel, jhs, xiyou.wangcong, jiri
In-Reply-To: <1522783461-14269-1-git-send-email-mrv@mojatatu.com>

On 4/3/18 1:24 PM, Roman Mashak wrote:
>  	if (tb[TCA_SKBEDIT_PTYPE] != NULL) {
> -		ptype = RTA_DATA(tb[TCA_SKBEDIT_PTYPE]);
> -		if (*ptype == PACKET_HOST)
> -			fprintf(f, " ptype host");
> -		else if (*ptype == PACKET_BROADCAST)
> -			fprintf(f, " ptype broadcast");
> -		else if (*ptype == PACKET_MULTICAST)
> -			fprintf(f, " ptype multicast");
> -		else if (*ptype == PACKET_OTHERHOST)
> -			fprintf(f, " ptype otherhost");
> +		ptype = rta_getattr_u16(tb[TCA_SKBEDIT_PTYPE]);
> +		if (ptype == PACKET_HOST)
> +			print_string(PRINT_ANY, "ptype", " %s", "ptype host");
> +		else if (ptype == PACKET_BROADCAST)
> +			print_string(PRINT_ANY, "ptype", " %s",
> +				     "ptype broadcast");
> +		else if (ptype == PACKET_MULTICAST)
> +			print_string(PRINT_ANY, "ptype", " %s",
> +				     "ptype multicast");
> +		else if (ptype == PACKET_OTHERHOST)
> +			print_string(PRINT_ANY, "ptype", " %s",
> +				     "ptype otherhost");

Shouldn't that be:
    print_string(PRINT_ANY, "ptype", "ptype %s", "otherhost");

And ditto for the other strings.

>  		else
> -			fprintf(f, " ptype %d", *ptype);
> +			print_uint(PRINT_ANY, "ptype", " %u", ptype);

And then this one needs 'ptype' before %u

^ permalink raw reply

* Re: [PATCH iproute2-next 1/1] tc: jsonify connmark action
From: David Ahern @ 2018-04-08 17:54 UTC (permalink / raw)
  To: Roman Mashak; +Cc: stephen, netdev, kernel, jhs, xiyou.wangcong, jiri
In-Reply-To: <1522760995-12441-1-git-send-email-mrv@mojatatu.com>

On 4/3/18 7:09 AM, Roman Mashak wrote:
> Signed-off-by: Roman Mashak <mrv@mojatatu.com>
> ---
>  tc/m_connmark.c | 16 ++++++++++------
>  1 file changed, 10 insertions(+), 6 deletions(-)

applied to iproute2-next

^ permalink raw reply

* Re: [PATCH iproute2-next 1/1] tc: jsonify tunnel_key action
From: David Ahern @ 2018-04-08 17:54 UTC (permalink / raw)
  To: Roman Mashak; +Cc: stephen, netdev, kernel, jhs, xiyou.wangcong, jiri
In-Reply-To: <1522862478-24580-1-git-send-email-mrv@mojatatu.com>

On 4/4/18 11:21 AM, Roman Mashak wrote:
> Signed-off-by: Roman Mashak <mrv@mojatatu.com>
> ---
>  tc/m_tunnel_key.c | 36 +++++++++++++++++++++++++-----------
>  1 file changed, 25 insertions(+), 11 deletions(-)
> 

applied to iproute2-next

^ permalink raw reply

* Re: [PATCH] make net_gso_ok return false when gso_type is zero(invalid)
From: Wenhua Shi @ 2018-04-08 18:41 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-kernel
In-Reply-To: <20180408.125114.2111042327285982582.davem@davemloft.net>

2018-04-08 18:51 GMT+02:00 David Miller <davem@davemloft.net>:
>
> From: Wenhua Shi <march511@gmail.com>
> Date: Fri,  6 Apr 2018 03:43:39 +0200
>
> > Signed-off-by: Wenhua Shi <march511@gmail.com>
>
> This precondition should be made impossible instead of having to do
> an extra check everywhere that this helper is invoked, many of which
> are in fast paths.

I believe the precondition you said is quite true. In my situation, I
have to disable GSO for some packet and I notice that it leads to a
worse performance (slower than 1Mbps, was almost 800Mbps).

Here's the hook I use on debian 9.4, kernel version 4.9:

    #include <linux/init.h>
    #include <linux/module.h>
    #include <linux/kernel.h>
    #include <linux/netfilter.h>
    #include <linux/netfilter_ipv4.h>
    #include <linux/netfilter_ipv6.h>
    #include <linux/skbuff.h>
    #include <linux/tcp.h>
    #include <linux/ip.h>

    unsigned int hook_outgoing (
        void * priv,
        struct sk_buff * skb,
        const struct nf_hook_state * state)
    {
        /* for some reason I have to disable GSO */
        skb_gso_reset(skb);

        /* After I force sk_can_gso to return false here, the
performance comes back normal. */
        // skb->sk->sk_gso_type = ~0;

        return NF_ACCEPT;

    }

    static struct nf_hook_ops hook =
    {
        .hook = hook_outgoing,
        .pf = PF_INET,
        .hooknum = NF_INET_POST_ROUTING,
        .priority = NF_IP_PRI_LAST,
    };

    static int __init init_testing(void)
    {
        nf_register_hook(&hook);
        return 0;
    }

    static void __exit exit_testing(void)
    {
        nf_unregister_hook(&hook);
    }

    module_init(init_testing);
    module_exit(exit_testing);


Here are the performance measurements.
Without the previous hook:

    root@debian-s-1vcpu-1gb-sfo1-01:~/test# iperf -c myanothernormaldebian -d
    ------------------------------------------------------------
    Server listening on TCP port 5001
    TCP window size: 85.3 KByte (default)
    ------------------------------------------------------------
    ------------------------------------------------------------
    Client connecting to myanothernormaldebian, TCP port 5001
    TCP window size:  255 KByte (default)
    ------------------------------------------------------------
    [  3] local 192.241.204.XXX port 60528 connected with
104.131.148.XXX port 5001
    [  5] local 192.241.204.XXX port 5001 connected with
104.131.148.XXX port 58576
    [ ID] Interval       Transfer     Bandwidth
    [  3]  0.0-10.0 sec   922 MBytes   773 Mbits/sec
    [  5]  0.0-10.1 sec  1.00 GBytes   849 Mbits/sec

And with the previous hook:

    root@debian-s-1vcpu-1gb-sfo1-01:~/test# iperf -c myanothernormaldebian -d
    ------------------------------------------------------------
    Server listening on TCP port 5001
    TCP window size: 85.3 KByte (default)
    ------------------------------------------------------------
    ------------------------------------------------------------
    Client connecting to myanothernormaldebian, TCP port 5001
    TCP window size: 85.0 KByte (default)
    ------------------------------------------------------------
    [  3] local 192.241.204.XXX port 60530 connected with
104.131.148.XXX port 5001
    [  5] local 192.241.204.XXX port 5001 connected with
104.131.148.XXX port 58578
    [ ID] Interval       Transfer     Bandwidth
    [  5]  0.0-10.2 sec  1.02 GBytes   864 Mbits/sec
    [  3]  0.0-13.5 sec   170 KBytes   103 Kbits/sec



Or it's just because of that I'm disabling the GSO in a wrong way?

^ permalink raw reply

* Re: suspicious RCU usage at ./include/net/inet_sock.h:LINE
From: Eric Biggers @ 2018-04-08 19:29 UTC (permalink / raw)
  To: syzbot
  Cc: davem, dccp, dvyukov, gerrit, kuznet, linux-kernel, netdev,
	syzkaller-bugs, yoshfuji
In-Reply-To: <001a113f8d5653a67c0561346ed3@google.com>

On Mon, Dec 25, 2017 at 05:45:00PM -0800, syzbot wrote:
> syzkaller has found reproducer for the following crash on
> fba961ab29e5ffb055592442808bb0f7962e05da
> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/master
> compiler: gcc (GCC) 7.1.1 20170620
> .config is attached
> Raw console output is attached.
> C reproducer is attached
> syzkaller reproducer is attached. See https://goo.gl/kgGztJ
> for information about syzkaller reproducers
> 
> 
> Can not set IPV6_FL_F_REFLECT if flowlabel_consistency sysctl is enable
> 
> =============================
> WARNING: suspicious RCU usage
> 4.15.0-rc4+ #164 Not tainted
> -----------------------------
> ./include/net/inet_sock.h:136 suspicious rcu_dereference_check() usage!
> 
> other info that might help us debug this:
> 
> 
> rcu_scheduler_active = 2, debug_locks = 1
> 1 lock held by syzkaller667189/5780:
>  #0:  (sk_lock-AF_INET6){+.+.}, at: [<000000008d7d4e62>] lock_sock
> include/net/sock.h:1462 [inline]
>  #0:  (sk_lock-AF_INET6){+.+.}, at: [<000000008d7d4e62>]
> do_ipv6_setsockopt.isra.9+0x23d/0x38f0 net/ipv6/ipv6_sockglue.c:167
> 
> stack backtrace:
> CPU: 0 PID: 5780 Comm: syzkaller667189 Not tainted 4.15.0-rc4+ #164
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:17 [inline]
>  dump_stack+0x194/0x257 lib/dump_stack.c:53
>  lockdep_rcu_suspicious+0x123/0x170 kernel/locking/lockdep.c:4585
>  ireq_opt_deref include/net/inet_sock.h:135 [inline]
>  inet_csk_route_req+0x824/0xca0 net/ipv4/inet_connection_sock.c:544
>  dccp_v4_send_response+0xa7/0x640 net/dccp/ipv4.c:485
>  dccp_v4_conn_request+0x9ee/0x11b0 net/dccp/ipv4.c:633
>  dccp_v6_conn_request+0xd30/0x1350 net/dccp/ipv6.c:317
>  dccp_rcv_state_process+0x574/0x1620 net/dccp/input.c:612
>  dccp_v4_do_rcv+0xeb/0x160 net/dccp/ipv4.c:682
>  dccp_v6_do_rcv+0x81a/0x9b0 net/dccp/ipv6.c:578
>  sk_backlog_rcv include/net/sock.h:907 [inline]
>  __release_sock+0x124/0x360 net/core/sock.c:2274
>  release_sock+0xa4/0x2a0 net/core/sock.c:2789
>  do_ipv6_setsockopt.isra.9+0x50f/0x38f0 net/ipv6/ipv6_sockglue.c:898
>  ipv6_setsockopt+0xd7/0x150 net/ipv6/ipv6_sockglue.c:922
>  dccp_setsockopt+0x85/0xd0 net/dccp/proto.c:573
>  sock_common_setsockopt+0x95/0xd0 net/core/sock.c:2978
>  SYSC_setsockopt net/socket.c:1821 [inline]
>  SyS_setsockopt+0x189/0x360 net/socket.c:1800
>  entry_SYSCALL_64_fastpath+0x1f/0x96
> RIP: 0033:0x445ec9
> RSP: 002b:00007fa001b58db8 EFLAGS: 00000297 ORIG_RAX: 0000000000000036
> RAX: ffffffffffffffda RBX: 00000000006dbc24 RCX: 0000000000445ec9
> RDX: 0000000000000020 RSI: 0000000000000029 RDI: 0000000000000004
> RBP: 00000000006dbc20 R08: 0000000000000020 R09: 0000000000000000
> R10: 000000002030a000 R11: 0000000000000297 R12: 0000000000000000
> R13: 00007fff809eec1f R14: 00007fa001b599c0 R15: 0000000000000001
> 
> =============================
> WARNING: suspicious RCU usage
> 4.15.0-rc4+ #164 Not tainted
> -----------------------------
> ./include/net/inet_sock.h:136 suspicious rcu_dereference_check() usage!
> 
> other info that might help us debug this:
> 
> 
> rcu_scheduler_active = 2, debug_locks = 1
> 1 lock held by syzkaller667189/5780:
>  #0:  (sk_lock-AF_INET6){+.+.}, at: [<000000008d7d4e62>] lock_sock
> include/net/sock.h:1462 [inline]
>  #0:  (sk_lock-AF_INET6){+.+.}, at: [<000000008d7d4e62>]
> do_ipv6_setsockopt.isra.9+0x23d/0x38f0 net/ipv6/ipv6_sockglue.c:167
> 
> stack backtrace:
> CPU: 0 PID: 5780 Comm: syzkaller667189 Not tainted 4.15.0-rc4+ #164
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:17 [inline]
>  dump_stack+0x194/0x257 lib/dump_stack.c:53
>  lockdep_rcu_suspicious+0x123/0x170 kernel/locking/lockdep.c:4585
>  ireq_opt_deref include/net/inet_sock.h:135 [inline]
>  dccp_v4_send_response+0x4b0/0x640 net/dccp/ipv4.c:496
>  dccp_v4_conn_request+0x9ee/0x11b0 net/dccp/ipv4.c:633
>  dccp_v6_conn_request+0xd30/0x1350 net/dccp/ipv6.c:317
>  dccp_rcv_state_process+0x574/0x1620 net/dccp/input.c:612
>  dccp_v4_do_rcv+0xeb/0x160 net/dccp/ipv4.c:682
>  dccp_v6_do_rcv+0x81a/0x9b0 net/dccp/ipv6.c:578
>  sk_backlog_rcv include/net/sock.h:907 [inline]
>  __release_sock+0x124/0x360 net/core/sock.c:2274
>  release_sock+0xa4/0x2a0 net/core/sock.c:2789
>  do_ipv6_setsockopt.isra.9+0x50f/0x38f0 net/ipv6/ipv6_sockglue.c:898
>  ipv6_setsockopt+0xd7/0x150 net/ipv6/ipv6_sockglue.c:922
>  dccp_setsockopt+0x85/0xd0 net/dccp/proto.c:573
>  sock_common_setsockopt+0x95/0xd0 net/core/sock.c:2978
>  SYSC_setsockopt net/socket.c:1821 [inline]
>  SyS_setsockopt+0x189/0x360 net/socket.c:1800
>  entry_SYSCALL_64_fastpath+0x1f/0x96
> RIP: 0033:0x445ec9
> RSP: 002b:00007fa001b58db8 EFLAGS: 00000297 ORIG_RAX: 0000000000000036
> RAX: ffffffffffffffda RBX: 00000000006dbc24 RCX: 0000000000445ec9
> RDX: 0000000000000020 RSI: 0000000000000029 RDI: 0000000000000004
> RBP: 00000000006dbc20 R08: 0000000000000020 R09: 0000000000000000
> R10: 000000002030a000 R11: 0000000000000297 R12: 0000000000000000
> R13: 00007fff809eec1f R14: 00007fa001b599c0 R15: 0000000000000001

syzbot stopped hitting this for some reason, but the bug is still there.
Here's a simplified reproducer that works on Linus' tree as of today:

#include <linux/in.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

int main()
{
        int is_parent = (fork() != 0);
        for (;;) {
                int fd = socket(AF_INET, SOCK_DCCP, 0);
                struct sockaddr_in addr = {
                        .sin_family = AF_INET,
                        .sin_port = htobe16(0x4e23),
                };
                if (is_parent) {
                        connect(fd, (void *)&addr, sizeof(addr));
                } else {
                        bind(fd, (void *)&addr, sizeof(addr));
                        listen(fd, 100);
                        setsockopt(fd, 0, 0xFFFF, NULL, 0);
                }
                close(fd);
        }
}

^ permalink raw reply

* Re: WARNING in kcm_exit_net (2)
From: Eric Biggers @ 2018-04-08 19:38 UTC (permalink / raw)
  To: syzbot
  Cc: davem, ebiggers, edumazet, linux-kernel, mingo, netdev,
	syzkaller-bugs, tom, xiaolou4617, xiyou.wangcong
In-Reply-To: <001a113ed51610bad9055f2d130e@google.com>

On Wed, Nov 29, 2017 at 10:08:01PM -0800, syzbot wrote:
> Hello,
> 
> syzkaller hit the following crash on
> 1d3b78bbc6e983fabb3fbf91b76339bf66e4a12c
> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/master
> compiler: gcc (GCC) 7.1.1 20170620
> .config is attached
> Raw console output is attached.
> 
> Unfortunately, I don't have any reproducer for this bug yet.
> 
> 
> WARNING: CPU: 1 PID: 4099 at net/kcm/kcmsock.c:2014 kcm_exit_net+0x317/0x360
> net/kcm/kcmsock.c:2014
> Kernel panic - not syncing: panic_on_warn set ...
> 
> CPU: 1 PID: 4099 Comm: kworker/u4:9 Not tainted 4.14.0+ #129
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Workqueue: netns cleanup_net
> device lo entered promiscuous mode
> Call Trace:
>  __dump_stack lib/dump_stack.c:17 [inline]
>  dump_stack+0x194/0x257 lib/dump_stack.c:53
>  panic+0x1e4/0x41c kernel/panic.c:183
>  __warn+0x1dc/0x200 kernel/panic.c:547
>  report_bug+0x211/0x2d0 lib/bug.c:184
>  fixup_bug.part.11+0x37/0x80 arch/x86/kernel/traps.c:177
>  fixup_bug arch/x86/kernel/traps.c:246 [inline]
>  do_error_trap+0x2d7/0x3e0 arch/x86/kernel/traps.c:295
>  do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:314
>  invalid_op+0x18/0x20 arch/x86/entry/entry_64.S:926
> RIP: 0010:kcm_exit_net+0x317/0x360 net/kcm/kcmsock.c:2014
> RSP: 0000:ffff8801d9d27198 EFLAGS: 00010293
> RAX: ffff8801c0884540 RBX: 1ffff1003b3a4e33 RCX: ffffffff84a738e7
> RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000286
> RBP: ffff8801d9d27260 R08: 0000000000000003 R09: 1ffff1003b3a4e0c
> R10: ffff8801c0884540 R11: 0000000000000003 R12: 1ffff1003b3a4e37
> R13: ffff8801d9d27238 R14: ffff8801c5fec8a0 R15: ffff8801c4b62e40
>  ops_exit_list.isra.6+0xae/0x150 net/core/net_namespace.c:142
>  cleanup_net+0x5c7/0xb60 net/core/net_namespace.c:484
>  process_one_work+0xbfd/0x1be0 kernel/workqueue.c:2112
>  worker_thread+0x223/0x1990 kernel/workqueue.c:2246
>  kthread+0x37a/0x440 kernel/kthread.c:238
>  ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:437
> Dumping ftrace buffer:
>    (ftrace buffer empty)
> Kernel Offset: disabled
> Rebooting in 86400 seconds..
> 
> 
> ---
> This bug is generated by a dumb bot. It may contain errors.
> See https://goo.gl/tpsmEJ for details.
> Direct all questions to syzkaller@googlegroups.com.
> Please credit me with: Reported-by: syzbot <syzkaller@googlegroups.com>
> 
> syzbot will keep track of this bug report.
> Once a fix for this bug is committed, please reply to this email with:
> #syz fix: exact-commit-title
> To mark this as a duplicate of another syzbot report, please reply with:
> #syz dup: exact-subject-of-another-report
> If it's a one-off invalid bug report, please reply with:
> #syz invalid

No reproducer, this last occurred on Dec 26 (103 days ago, commit fba961ab29e),
and there have been several potentially relevant KCM fixes since then such as
581e7226a5d ("kcm: Only allow TCP sockets to be attached to a KCM mux") and
e5571240236 ("kcm: Check if sk_user_data already set in kcm_attach").  So I am
invalidating this for syzbot, but if anyone thinks this may still be a bug then
feel free to look into it.

#syz invalid

Eric

^ permalink raw reply

* Re: WARNING in skb_warn_bad_offload
From: Eric Biggers @ 2018-04-08 19:53 UTC (permalink / raw)
  To: syzbot
  Cc: Willem de Bruijn, Dmitry Vyukov, David Miller, Jamal Hadi Salim,
	Jiří Pírko, LKML, netdev, syzkaller-bugs,
	Cong Wang
In-Reply-To: <CACT4Y+ZDRNWj1iRqLvFLjOXJYc2XERWPh+Qa0hNxUgMD1uaeBA@mail.gmail.com>

On Wed, Nov 01, 2017 at 09:50:18PM +0300, 'Dmitry Vyukov' via syzkaller-bugs wrote:
> On Wed, Nov 1, 2017 at 9:48 PM, syzbot
> <bot+abb66e15eb1b298dfe4a13375f18a278d5940e6f@syzkaller.appspotmail.com>
> wrote:
> > Hello,
> >
> > syzkaller hit the following crash on
> > 720bbe532b7c8f5613b48dea627fc58ed9ace707
> > git://git.cmpxchg.org/linux-mmots.git/master
> > compiler: gcc (GCC) 7.1.1 20170620
> > .config is attached
> > Raw console output is attached.
> > C reproducer is attached
> > syzkaller reproducer is attached. See https://goo.gl/kgGztJ
> > for information about syzkaller reproducers
> 
> 
> This also happens on more recent commits, including linux-next
> 36ef71cae353f88fd6e095e2aaa3e5953af1685d (Oct 20):
> 
> syz0: caps=(0x00000400000058c1, 0x0000000000000000) len=4203
> data_len=2810 gso_size=8465 gso_type=3 ip_summed=0
> ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 3473 at net/core/dev.c:2618
> skb_warn_bad_offload.cold.139+0x224/0x261 net/core/dev.c:2613
> Kernel panic - not syncing: panic_on_warn set ...
> 
> CPU: 0 PID: 3473 Comm: a.out Not tainted 4.14.0-rc5-next-20171018 #15
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:16 [inline]
>  dump_stack+0x1a8/0x272 lib/dump_stack.c:52
>  panic+0x21e/0x4b7 kernel/panic.c:183
>  __warn.cold.6+0x182/0x187 kernel/panic.c:546
>  report_bug+0x232/0x330 lib/bug.c:183
>  fixup_bug+0x3f/0x90 arch/x86/kernel/traps.c:177
>  do_trap_no_signal arch/x86/kernel/traps.c:211 [inline]
>  do_trap+0x132/0x280 arch/x86/kernel/traps.c:260
>  do_error_trap+0x11f/0x390 arch/x86/kernel/traps.c:297
>  do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:310
>  invalid_op+0x18/0x20 arch/x86/entry/entry_64.S:905
> RIP: 0010:skb_warn_bad_offload.cold.139+0x224/0x261 net/core/dev.c:2613
> RSP: 0018:ffff880064797038 EFLAGS: 00010286
> RAX: 000000000000006f RBX: ffff88006365efe8 RCX: 0000000000000000
> RDX: 000000000000006f RSI: ffffffff815c88c1 RDI: ffffed000c8f2dfd
> RBP: ffff880064797090 R08: ffff8800686f86c0 R09: 0000000000000002
> R10: ffff8800686f86c0 R11: 0000000000000000 R12: ffff8800538b1680
> R13: 0000000000000000 R14: ffff8800538b1680 R15: 0000000000002111
>  __skb_gso_segment+0x69e/0x860 net/core/dev.c:2824
>  skb_gso_segment include/linux/netdevice.h:3971 [inline]
>  validate_xmit_skb+0x29f/0xca0 net/core/dev.c:3074
>  validate_xmit_skb_list+0xb7/0x120 net/core/dev.c:3125
>  sch_direct_xmit+0x5b5/0x710 net/sched/sch_generic.c:181
>  __dev_xmit_skb net/core/dev.c:3206 [inline]
>  __dev_queue_xmit+0x1e41/0x2350 net/core/dev.c:3473
>  dev_queue_xmit+0x17/0x20 net/core/dev.c:3538
>  packet_snd net/packet/af_packet.c:2956 [inline]
>  packet_sendmsg+0x487a/0x64b0 net/packet/af_packet.c:2981
>  sock_sendmsg_nosec net/socket.c:632 [inline]
>  sock_sendmsg+0xd2/0x120 net/socket.c:642
>  ___sys_sendmsg+0x7cc/0x900 net/socket.c:2048
>  __sys_sendmsg+0xe6/0x220 net/socket.c:2082
>  SYSC_sendmsg net/socket.c:2093 [inline]
>  SyS_sendmsg+0x36/0x60 net/socket.c:2089
>  entry_SYSCALL_64_fastpath+0x1f/0xbe
> RIP: 0033:0x44bab9
> RSP: 002b:00000000007eff18 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
> RAX: ffffffffffffffda RBX: 0000000020001046 RCX: 000000000044bab9
> RDX: 0000000000004010 RSI: 00000000207fcfc8 RDI: 0000000000000004
> RBP: 0000000000000086 R08: 850b2da14d2a3706 R09: 0000000000000000
> R10: 1b91126b7f398aaa R11: 0000000000000246 R12: 0000000000000000
> R13: 0000000000407950 R14: 00000000004079e0 R15: 0000000000000000
> 
> 
> 
> 
> 
> > ------------[ cut here ]------------
> > WARNING: CPU: 0 PID: 2986 at net/core/dev.c:2585
> > skb_warn_bad_offload+0x2a9/0x380 net/core/dev.c:2580
> > Kernel panic - not syncing: panic_on_warn set ...
> >
> > CPU: 0 PID: 2986 Comm: syzkaller546001 Not tainted 4.13.0-mm1+ #7
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> > Google 01/01/2011
> > Call Trace:
> >  __dump_stack lib/dump_stack.c:16 [inline]
> >  dump_stack+0x194/0x257 lib/dump_stack.c:52
> >  panic+0x1e4/0x417 kernel/panic.c:181
> >  __warn+0x1c4/0x1d9 kernel/panic.c:542
> >  report_bug+0x211/0x2d0 lib/bug.c:183
> >  fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:178
> >  do_trap_no_signal arch/x86/kernel/traps.c:212 [inline]
> >  do_trap+0x260/0x390 arch/x86/kernel/traps.c:261
> >  do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:298
> >  do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:311
> >  invalid_op+0x18/0x20 arch/x86/entry/entry_64.S:905
> > RIP: 0010:skb_warn_bad_offload+0x2a9/0x380 net/core/dev.c:2580
> > RSP: 0018:ffff8801ce73f0a0 EFLAGS: 00010282
> > RAX: 000000000000006f RBX: ffff8801cd84cde0 RCX: 0000000000000000
> > RDX: 000000000000006f RSI: 1ffff10039ce7dd4 RDI: ffffed0039ce7e08
> > RBP: ffff8801ce73f0f8 R08: ffff8801ce73e790 R09: 0000000000000000
> > R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801ce7802c0
> > R13: 0000000000000000 R14: ffff8801ce7802c0 R15: 0000000000002111
> >  __skb_gso_segment+0x607/0x7f0 net/core/dev.c:2791
> >  skb_gso_segment include/linux/netdevice.h:3951 [inline]
> >  validate_xmit_skb+0x4ba/0xb20 net/core/dev.c:3041
> >  validate_xmit_skb_list+0xb7/0x120 net/core/dev.c:3092
> >  sch_direct_xmit+0x3b6/0x6d0 net/sched/sch_generic.c:181
> >  __dev_xmit_skb net/core/dev.c:3173 [inline]
> >  __dev_queue_xmit+0x15fe/0x1e40 net/core/dev.c:3440
> >  dev_queue_xmit+0x17/0x20 net/core/dev.c:3505
> >  packet_snd net/packet/af_packet.c:2950 [inline]
> >  packet_sendmsg+0x3bbf/0x6030 net/packet/af_packet.c:2975
> >  sock_sendmsg_nosec net/socket.c:633 [inline]
> >  sock_sendmsg+0xca/0x110 net/socket.c:643
> >  ___sys_sendmsg+0x75b/0x8a0 net/socket.c:2049
> >  __sys_sendmsg+0xe5/0x210 net/socket.c:2083
> >  SYSC_sendmsg net/socket.c:2094 [inline]
> >  SyS_sendmsg+0x2d/0x50 net/socket.c:2090
> >  entry_SYSCALL_64_fastpath+0x1f/0xbe
> > RIP: 0033:0x445489
> > RSP: 002b:00000000007efe68 EFLAGS: 00000217 ORIG_RAX: 000000000000002e
> > RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000445489
> > RDX: 0000000000004010 RSI: 00000000207fcfc8 RDI: 0000000000000004
> > RBP: 0000000000000082 R08: 000000000000cdf3 R09: 000000000000cdf3
> > R10: 0000000000000004 R11: 0000000000000217 R12: 0000000000402ae0
> > R13: 0000000000402b70 R14: 0000000000000000 R15: 0000000000000000
> > Dumping ftrace buffer:
> >    (ftrace buffer empty)
> > Kernel Offset: disabled
> > Rebooting in 86400 seconds..
> >
> >
> > ---
> > This bug is generated by a dumb bot. It may contain errors.
> > See https://goo.gl/tpsmEJ for details.
> > Direct all questions to syzkaller@googlegroups.com.
> > Please credit me with: Reported-by: syzbot <syzkaller@googlegroups.com>
> >
> > syzbot will keep track of this bug report.
> > Once a fix for this bug is committed, please reply to this email with:
> > #syz fix: exact-commit-title

Apparently fixed by commit 8d74e9f88d65a, so telling syzbot:

#syz fix: net: avoid skb_warn_bad_offload on IS_ERR

- Eric

^ permalink raw reply

* [PATCH v3] dp83640: Ensure against premature access to PHY registers after reset
From: Esben Haabendal @ 2018-04-08 20:17 UTC (permalink / raw)
  To: netdev
  Cc: Esben Haabendal, Richard Cochran, Andrew Lunn, Florian Fainelli,
	linux-kernel
In-Reply-To: <20180406170844.4248-1-esben.haabendal@gmail.com>

From: Esben Haabendal <eha@deif.com>

The datasheet specifies a 3uS pause after performing a software
reset. The default implementation of genphy_soft_reset() does not
provide this, so implement soft_reset with the needed pause.

Signed-off-by: Esben Haabendal <eha@deif.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
---
 drivers/net/phy/dp83640.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/drivers/net/phy/dp83640.c b/drivers/net/phy/dp83640.c
index 654f42d00092..a6c87793d899 100644
--- a/drivers/net/phy/dp83640.c
+++ b/drivers/net/phy/dp83640.c
@@ -1207,6 +1207,23 @@ static void dp83640_remove(struct phy_device *phydev)
 	kfree(dp83640);
 }
 
+static int dp83640_soft_reset(struct phy_device *phydev)
+{
+	int ret;
+
+	ret = genphy_soft_reset(phydev);
+	if (ret < 0)
+		return ret;
+
+	/* From DP83640 datasheet: "Software driver code must wait 3 us
+	 * following a software reset before allowing further serial MII
+	 * operations with the DP83640."
+	 */
+	udelay(10);		/* Taking udelay inaccuracy into account */
+
+	return 0;
+}
+
 static int dp83640_config_init(struct phy_device *phydev)
 {
 	struct dp83640_private *dp83640 = phydev->priv;
@@ -1501,6 +1518,7 @@ static struct phy_driver dp83640_driver = {
 	.flags		= PHY_HAS_INTERRUPT,
 	.probe		= dp83640_probe,
 	.remove		= dp83640_remove,
+	.soft_reset	= dp83640_soft_reset,
 	.config_init	= dp83640_config_init,
 	.ack_interrupt  = dp83640_ack_interrupt,
 	.config_intr    = dp83640_config_intr,
-- 
2.16.3

^ permalink raw reply related

* Re: [PATCH bpf-next v8 05/11] seccomp,landlock: Enforce Landlock programs per process hierarchy
From: Andy Lutomirski @ 2018-04-08 21:06 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Andy Lutomirski, Alexei Starovoitov, Daniel Borkmann, LKML,
	Alexei Starovoitov, Arnaldo Carvalho de Melo, Casey Schaufler,
	David Drysdale, David S . Miller, Eric W . Biederman, Jann Horn,
	Jonathan Corbet, Michael Kerrisk, Kees Cook, Paul Moore,
	Sargun Dhillon, Serge E . Hallyn, Shuah Khan, Tejun Heo,
	Thomas Graf, Tycho Andersen, Will Drewry, Kernel Hardening,
	Linux 
In-Reply-To: <498f8193-c909-78b2-e4ca-c1dd05605255@digikod.net>

On Sun, Apr 8, 2018 at 6:13 AM, Mickaël Salaün <mic@digikod.net> wrote:
>
> On 02/27/2018 10:48 PM, Mickaël Salaün wrote:
>>
>> On 27/02/2018 17:39, Andy Lutomirski wrote:
>>> On Tue, Feb 27, 2018 at 5:32 AM, Alexei Starovoitov
>>> <alexei.starovoitov@gmail.com> wrote:
>>>> On Tue, Feb 27, 2018 at 05:20:55AM +0000, Andy Lutomirski wrote:
>>>>> On Tue, Feb 27, 2018 at 4:54 AM, Alexei Starovoitov
>>>>> <alexei.starovoitov@gmail.com> wrote:
>>>>>> On Tue, Feb 27, 2018 at 04:40:34AM +0000, Andy Lutomirski wrote:
>>>>>>> On Tue, Feb 27, 2018 at 2:08 AM, Alexei Starovoitov
>>>>>>> <alexei.starovoitov@gmail.com> wrote:
>>>>>>>> On Tue, Feb 27, 2018 at 01:41:15AM +0100, Mickaël Salaün wrote:
>>>>>>>>> The seccomp(2) syscall can be used by a task to apply a Landlock program
>>>>>>>>> to itself. As a seccomp filter, a Landlock program is enforced for the
>>>>>>>>> current task and all its future children. A program is immutable and a
>>>>>>>>> task can only add new restricting programs to itself, forming a list of
>>>>>>>>> programss.
>>>>>>>>>
>>>>>>>>> A Landlock program is tied to a Landlock hook. If the action on a kernel
>>>>>>>>> object is allowed by the other Linux security mechanisms (e.g. DAC,
>>>>>>>>> capabilities, other LSM), then a Landlock hook related to this kind of
>>>>>>>>> object is triggered. The list of programs for this hook is then
>>>>>>>>> evaluated. Each program return a 32-bit value which can deny the action
>>>>>>>>> on a kernel object with a non-zero value. If every programs of the list
>>>>>>>>> return zero, then the action on the object is allowed.
>>>>>>>>>
>>>>>>>>> Multiple Landlock programs can be chained to share a 64-bits value for a
>>>>>>>>> call chain (e.g. evaluating multiple elements of a file path).  This
>>>>>>>>> chaining is restricted when a process construct this chain by loading a
>>>>>>>>> program, but additional checks are performed when it requests to apply
>>>>>>>>> this chain of programs to itself.  The restrictions ensure that it is
>>>>>>>>> not possible to call multiple programs in a way that would imply to
>>>>>>>>> handle multiple shared values (i.e. cookies) for one chain.  For now,
>>>>>>>>> only a fs_pick program can be chained to the same type of program,
>>>>>>>>> because it may make sense if they have different triggers (cf. next
>>>>>>>>> commits).  This restrictions still allows to reuse Landlock programs in
>>>>>>>>> a safe way (e.g. use the same loaded fs_walk program with multiple
>>>>>>>>> chains of fs_pick programs).
>>>>>>>>>
>>>>>>>>> Signed-off-by: Mickaël Salaün <mic@digikod.net>
>>>>>>>>
>>>>>>>> ...
>>>>>>>>
>>>>>>>>> +struct landlock_prog_set *landlock_prepend_prog(
>>>>>>>>> +             struct landlock_prog_set *current_prog_set,
>>>>>>>>> +             struct bpf_prog *prog)
>>>>>>>>> +{
>>>>>>>>> +     struct landlock_prog_set *new_prog_set = current_prog_set;
>>>>>>>>> +     unsigned long pages;
>>>>>>>>> +     int err;
>>>>>>>>> +     size_t i;
>>>>>>>>> +     struct landlock_prog_set tmp_prog_set = {};
>>>>>>>>> +
>>>>>>>>> +     if (prog->type != BPF_PROG_TYPE_LANDLOCK_HOOK)
>>>>>>>>> +             return ERR_PTR(-EINVAL);
>>>>>>>>> +
>>>>>>>>> +     /* validate memory size allocation */
>>>>>>>>> +     pages = prog->pages;
>>>>>>>>> +     if (current_prog_set) {
>>>>>>>>> +             size_t i;
>>>>>>>>> +
>>>>>>>>> +             for (i = 0; i < ARRAY_SIZE(current_prog_set->programs); i++) {
>>>>>>>>> +                     struct landlock_prog_list *walker_p;
>>>>>>>>> +
>>>>>>>>> +                     for (walker_p = current_prog_set->programs[i];
>>>>>>>>> +                                     walker_p; walker_p = walker_p->prev)
>>>>>>>>> +                             pages += walker_p->prog->pages;
>>>>>>>>> +             }
>>>>>>>>> +             /* count a struct landlock_prog_set if we need to allocate one */
>>>>>>>>> +             if (refcount_read(&current_prog_set->usage) != 1)
>>>>>>>>> +                     pages += round_up(sizeof(*current_prog_set), PAGE_SIZE)
>>>>>>>>> +                             / PAGE_SIZE;
>>>>>>>>> +     }
>>>>>>>>> +     if (pages > LANDLOCK_PROGRAMS_MAX_PAGES)
>>>>>>>>> +             return ERR_PTR(-E2BIG);
>>>>>>>>> +
>>>>>>>>> +     /* ensure early that we can allocate enough memory for the new
>>>>>>>>> +      * prog_lists */
>>>>>>>>> +     err = store_landlock_prog(&tmp_prog_set, current_prog_set, prog);
>>>>>>>>> +     if (err)
>>>>>>>>> +             return ERR_PTR(err);
>>>>>>>>> +
>>>>>>>>> +     /*
>>>>>>>>> +      * Each task_struct points to an array of prog list pointers.  These
>>>>>>>>> +      * tables are duplicated when additions are made (which means each
>>>>>>>>> +      * table needs to be refcounted for the processes using it). When a new
>>>>>>>>> +      * table is created, all the refcounters on the prog_list are bumped (to
>>>>>>>>> +      * track each table that references the prog). When a new prog is
>>>>>>>>> +      * added, it's just prepended to the list for the new table to point
>>>>>>>>> +      * at.
>>>>>>>>> +      *
>>>>>>>>> +      * Manage all the possible errors before this step to not uselessly
>>>>>>>>> +      * duplicate current_prog_set and avoid a rollback.
>>>>>>>>> +      */
>>>>>>>>> +     if (!new_prog_set) {
>>>>>>>>> +             /*
>>>>>>>>> +              * If there is no Landlock program set used by the current task,
>>>>>>>>> +              * then create a new one.
>>>>>>>>> +              */
>>>>>>>>> +             new_prog_set = new_landlock_prog_set();
>>>>>>>>> +             if (IS_ERR(new_prog_set))
>>>>>>>>> +                     goto put_tmp_lists;
>>>>>>>>> +     } else if (refcount_read(&current_prog_set->usage) > 1) {
>>>>>>>>> +             /*
>>>>>>>>> +              * If the current task is not the sole user of its Landlock
>>>>>>>>> +              * program set, then duplicate them.
>>>>>>>>> +              */
>>>>>>>>> +             new_prog_set = new_landlock_prog_set();
>>>>>>>>> +             if (IS_ERR(new_prog_set))
>>>>>>>>> +                     goto put_tmp_lists;
>>>>>>>>> +             for (i = 0; i < ARRAY_SIZE(new_prog_set->programs); i++) {
>>>>>>>>> +                     new_prog_set->programs[i] =
>>>>>>>>> +                             READ_ONCE(current_prog_set->programs[i]);
>>>>>>>>> +                     if (new_prog_set->programs[i])
>>>>>>>>> +                             refcount_inc(&new_prog_set->programs[i]->usage);
>>>>>>>>> +             }
>>>>>>>>> +
>>>>>>>>> +             /*
>>>>>>>>> +              * Landlock program set from the current task will not be freed
>>>>>>>>> +              * here because the usage is strictly greater than 1. It is
>>>>>>>>> +              * only prevented to be freed by another task thanks to the
>>>>>>>>> +              * caller of landlock_prepend_prog() which should be locked if
>>>>>>>>> +              * needed.
>>>>>>>>> +              */
>>>>>>>>> +             landlock_put_prog_set(current_prog_set);
>>>>>>>>> +     }
>>>>>>>>> +
>>>>>>>>> +     /* prepend tmp_prog_set to new_prog_set */
>>>>>>>>> +     for (i = 0; i < ARRAY_SIZE(tmp_prog_set.programs); i++) {
>>>>>>>>> +             /* get the last new list */
>>>>>>>>> +             struct landlock_prog_list *last_list =
>>>>>>>>> +                     tmp_prog_set.programs[i];
>>>>>>>>> +
>>>>>>>>> +             if (last_list) {
>>>>>>>>> +                     while (last_list->prev)
>>>>>>>>> +                             last_list = last_list->prev;
>>>>>>>>> +                     /* no need to increment usage (pointer replacement) */
>>>>>>>>> +                     last_list->prev = new_prog_set->programs[i];
>>>>>>>>> +                     new_prog_set->programs[i] = tmp_prog_set.programs[i];
>>>>>>>>> +             }
>>>>>>>>> +     }
>>>>>>>>> +     new_prog_set->chain_last = tmp_prog_set.chain_last;
>>>>>>>>> +     return new_prog_set;
>>>>>>>>> +
>>>>>>>>> +put_tmp_lists:
>>>>>>>>> +     for (i = 0; i < ARRAY_SIZE(tmp_prog_set.programs); i++)
>>>>>>>>> +             put_landlock_prog_list(tmp_prog_set.programs[i]);
>>>>>>>>> +     return new_prog_set;
>>>>>>>>> +}
>>>>>>>>
>>>>>>>> Nack on the chaining concept.
>>>>>>>> Please do not reinvent the wheel.
>>>>>>>> There is an existing mechanism for attaching/detaching/quering multiple
>>>>>>>> programs attached to cgroup and tracing hooks that are also
>>>>>>>> efficiently executed via BPF_PROG_RUN_ARRAY.
>>>>>>>> Please use that instead.
>>>>>>>>
>>>>>>>
>>>>>>> I don't see how that would help.  Suppose you add a filter, then
>>>>>>> fork(), and then the child adds another filter.  Do you want to
>>>>>>> duplicate the entire array?  You certainly can't *modify* the array
>>>>>>> because you'll affect processes that shouldn't be affected.
>>>>>>>
>>>>>>> In contrast, doing this through seccomp like the earlier patches
>>>>>>> seemed just fine to me, and seccomp already had the right logic.
>>>>>>
>>>>>> it doesn't look to me that existing seccomp side of managing fork
>>>>>> situation can be reused. Here there is an attempt to add 'chaining'
>>>>>> concept which sort of an extension of existing seccomp style,
>>>>>> but somehow heavily done on bpf side and contradicts cgroup/tracing.
>>>>>>
>>>>>
>>>>> I don't see why the seccomp way can't be used.  I agree with you that
>>>>> the seccomp *style* shouldn't be used in bpf code like this, but I
>>>>> think that Landlock programs can and should just live in the existing
>>>>> seccomp chain.  If the existing seccomp code needs some modification
>>>>> to make this work, then so be it.
>>>>
>>>> +1
>>>> if that was the case...
>>>> but that's not my reading of the patch set.
>>>
>>> An earlier version of the patch set used the seccomp filter chain.
>>> Mickaël, what exactly was wrong with that approach other than that the
>>> seccomp() syscall was awkward for you to use?  You could add a
>>> seccomp_add_landlock_rule() syscall if you needed to.
>>
>> Nothing was wrong about about that, this part did not changed (see my
>> next comment).
>>
>>>
>>> As a side comment, why is this an LSM at all, let alone a non-stacking
>>> LSM?  It would make a lot more sense to me to make Landlock depend on
>>> having LSMs configured in but to call the landlock hooks directly from
>>> the security_xyz() hooks.
>>
>> See Casey's answer and his patch series: https://lwn.net/Articles/741963/
>>
>>>
>>>>
>>>>> In other words, the kernel already has two kinds of chaining:
>>>>> seccomp's and bpf's.  bpf's doesn't work right for this type of usage
>>>>> across fork(), whereas seccomp's already handles that case correctly.
>>>>> (In contrast, seccomp's is totally wrong for cgroup-attached filters.)
>>>>>  So IMO Landlock should use the seccomp core code and call into bpf
>>>>> for the actual filtering.
>>>>
>>>> +1
>>>> in cgroup we had to invent this new BPF_PROG_RUN_ARRAY mechanism,
>>>> since cgroup hierarchy can be complicated with bpf progs attached
>>>> at different levels with different override/multiprog properties,
>>>> so walking link list and checking all flags at run-time would have
>>>> been too slow. That's why we added compute_effective_progs().
>>>
>>> If we start adding override flags to Landlock, I think we're doing it
>>> wrong.   With cgroup bpf programs, the whole mess is set up by the
>>> administrator.  With seccomp, and with Landlock if done correctly, it
>>> *won't* be set up by the administrator, so the chance that everyone
>>> gets all the flags right is about zero.  All attached filters should
>>> run unconditionally.
>>
>>
>> There is a misunderstanding about this chaining mechanism. This should
>> not be confused with the list of seccomp filters nor the cgroup
>> hierarchies. Landlock programs can be stacked the same way seccomp's
>> filters can (cf. struct landlock_prog_set, the "chain_last" field is an
>> optimization which is not used for this struct handling). This stackable
>> property did not changed from the previous patch series. The chaining
>> mechanism is for another use case, which does not make sense for seccomp
>> filters nor other eBPF program types, at least for now, from what I can
>> tell.
>>
>> You may want to get a look at my talk at FOSDEM
>> (https://landlock.io/talks/2018-02-04_landlock-fosdem.pdf), especially
>> slides 11 and 12.
>>
>> Let me explain my reasoning about this program chaining thing.
>>
>> To check if an action on a file is allowed, we first need to identify
>> this file and match it to the security policy. In a previous
>> (non-public) patch series, I tried to use one type of eBPF program to
>> check every kind of access to a file. To be able to identify a file, I
>> relied on an eBPF map, similar to the current inode map. This map store
>> a set of references to file descriptors. I then created a function
>> bpf_is_file_beneath() to check if the requested file was beneath a file
>> in the map. This way, no chaining, only one eBPF program type to check
>> an access to a file... but some issues then emerged. First, this design
>> create a side-channel which help an attacker using such a program to
>> infer some information not normally available, for example to get a hint
>> on where a file descriptor (received from a UNIX socket) come from.
>> Another issue is that this type of program would be called for each
>> component of a path. Indeed, when the kernel check if an access to a
>> file is allowed, it walk through all of the directories in its path
>> (checking if the current process is allowed to execute them). That first
>> attempt led me to rethink the way we could filter an access to a file
>> *path*.
>>
>> To minimize the number of called to an eBPF program dedicated to
>> validate an access to a file path, I decided to create three subtype of
>> eBPF programs. The FS_WALK type is called when walking through every
>> directory of a file path (except the last one if it is the target). We
>> can then restrict this type of program to the minimum set of functions
>> it is allowed to call and the minimum set of data available from its
>> context. The first implicit chaining is for this type of program. To be
>> able to evaluate a path while being called for all its components, this
>> program need to store a state (to remember what was the parent directory
>> of this path). There is no "previous" field in the subtype for this
>> program because it is chained with itself, for each directories. This
>> enable to create a FS_WALK program to evaluate a file hierarchy, thank
>> to the inode map which can be used to check if a directory of this
>> hierarchy is part of an allowed (or denied) list of directories. This
>> design enables to express a file hierarchy in a programmatic way,
>> without requiring an eBPF helper to do the job (unlike my first experiment).
>>
>> The explicit chaining is used to tied a path evaluation (with a FS_WALK
>> program) to an access to the actual file being requested (the last
>> component of a file path), with a FS_PICK program. It is only at this
>> time that the kernel check for the requested action (e.g. read, write,
>> chdir, append...). To be able to filter such access request we can have
>> one call to the same program for every action and let this program check
>> for which action it was called. However, this design does not allow the
>> kernel to know if the current action is indeed handled by this program.
>> Hence, it is not possible to implement a cache mechanism to only call
>> this program if it knows how to handle this action.
>>
>> The approach I took for this FS_PICK type of program is to add to its
>> subtype which action it can handle (with the "triggers" bitfield, seen
>> as ORed actions). This way, the kernel knows if a call to a FS_PICK
>> program is necessary. If the user wants to enforce a different security
>> policy according to the action requested on a file, then it needs
>> multiple FS_PICK programs. However, to reduce the number of such
>> programs, this patch series allow a FS_PICK program to be chained with
>> another, the same way a FS_WALK is chained with itself. This way, if the
>> user want to check if the action is a for example an "open" and a "read"
>> and not a "map" and a "read", then it can chain multiple FS_PICK
>> programs with different triggers actions. The OR check performed by the
>> kernel is not a limitation then, only a way to know if a call to an eBPF
>> program is needed.
>>
>> The last type of program is FS_GET. This one is called when a process
>> get a struct file or change its working directory. This is the only
>> program type able (and allowed) to tag a file. This restriction is
>> important to not being subject to resource exhaustion attacks (i.e.
>> tagging every inode accessible to an attacker, which would allocate too
>> much kernel memory).
>>
>> This design gives room for improvements to create a cache of eBPF
>> context (input data, including maps if any), with the result of an eBPF
>> program. This would help limit the number of call to an eBPF program the
>> same way SELinux or other kernel components do to limit costly checks.
>>
>> The eBPF maps of progs are useful to call the same type of eBPF
>> program. It does not fit with this use case because we may want multiple
>> eBPF program according to the action requested on a kernel object (e.g.
>> FS_GET). The other reason is because the eBPF program does not know what
>> will be the next (type of) access check performed by the kernel.
>>
>> To say it another way, this chaining mechanism is a way to split a
>> kernel object evaluation with multiple specialized programs, each of
>> them being able to deal with data tied to their type. Using a monolithic
>> eBPF program to check everything does not scale and does not fit with
>> unprivileged use either.
>>
>> As a side note, the cookie value is only an ephemeral value to keep a
>> state between multiple programs call. It can be used to create a state
>> machine for an object evaluation.
>>
>> I don't see a way to do an efficient and programmatic path evaluation,
>> with different access checks, with the current eBPF features. Please let
>> me know if you know how to do it another way.
>>
>
> Andy, Alexei, Daniel, what do you think about this Landlock program
> chaining and cookie?
>

Can you give a small pseudocode real world example that acutally needs
chaining?  The mechanism is quite complicated and I'd like to
understand how it'll be used.

^ permalink raw reply

* Re: [PATCH net 0/8] net: fix uninit-values in networking stack
From: David Miller @ 2018-04-08 21:18 UTC (permalink / raw)
  To: eric.dumazet; +Cc: edumazet, netdev
In-Reply-To: <0a539248-4e45-a429-f708-d906c145c736@gmail.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sun, 8 Apr 2018 09:55:58 -0700

> I also have a report of a WARN() in ip_rt_bug(), added in commit
> c378a9c019cf5e017d1ed24954b54fae7bebd2bc by Dave Jones.
> 
> Not sure what to do, maybe revert, since ip_rt_bug() is not catastrophic.

Let's not do the revert, I wouldn't have seen the backtrace which
points where this bug is if we had.

icmp_route_lookup(), in one branch, does an input route lookup and
uses the result of that to send the icmp message.

That can't be right, input routes should never be used for
transmitting traffice and that's how we end up at ip_rt_bug().

^ permalink raw reply

* Re: pull request: bluetooth 2018-04-08
From: David Miller @ 2018-04-08 21:19 UTC (permalink / raw)
  To: johan.hedberg; +Cc: linux-bluetooth, netdev
In-Reply-To: <20180408174702.GA11420@x1c.home>

From: Johan Hedberg <johan.hedberg@gmail.com>
Date: Sun, 8 Apr 2018 20:47:02 +0300

> Here's one important Bluetooth fix for the 4.17-rc series that's needed
> to pass several Bluetooth qualification test cases.
> 
> Let me know if there are any issues pulling. Thanks.

Pulled, thank you.

^ permalink raw reply

* Re: BUG: please report to dccp@vger.kernel.org => prev = 0, last = 0 at net/dccp/ccids/lib/packet_history.c:LINE/tfrc_rx_hist_sample_rtt()
From: Eric Biggers @ 2018-04-08 21:57 UTC (permalink / raw)
  To: dccp, gerrit
  Cc: davem, garsilva, syzbot, linux-kernel, netdev, syzkaller-bugs
In-Reply-To: <001a1145f0b4076234056309aab9@google.com>

On Thu, Jan 18, 2018 at 01:34:02AM -0800, syzbot wrote:
> syzbot has found reproducer for the following crash on linux-next commit
> a362f6d2cdbd089dd7040ba66dcb0ad276a20cf7 (Thu Jan 18 07:07:54 2018 +0000)
> Add linux-next specific files for 20180118
> 
> So far this crash happened 185 times on linux-next, mmots, net-next,
> upstream.
> C reproducer is attached.
> syzkaller reproducer is attached.
> Raw console output is attached.
> compiler: gcc (GCC) 7.1.1 20170620
> .config is attached.
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by:
> syzbot+3ca02e1a9272a28e8959b32039154c5605164653@syzkaller.appspotmail.com
> It will help syzbot understand when the bug is fixed.
> 
> BUG: please report to dccp@vger.kernel.org => prev = 0, last = 0 at
> net/dccp/ccids/lib/packet_history.c:425/tfrc_rx_hist_sample_rtt()
> CPU: 1 PID: 6246 Comm: syzkaller158939 Not tainted 4.15.0-rc8-next-20180118+
> #100
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Call Trace:
>  <IRQ>
>  __dump_stack lib/dump_stack.c:17 [inline]
>  dump_stack+0x194/0x257 lib/dump_stack.c:53
>  tfrc_rx_hist_sample_rtt+0x407/0x4d0 net/dccp/ccids/lib/packet_history.c:422
>  ccid3_hc_rx_packet_recv+0x696/0xeb3 net/dccp/ccids/ccid3.c:765
>  ccid_hc_rx_packet_recv net/dccp/ccid.h:185 [inline]
>  dccp_deliver_input_to_ccids+0xd9/0x250 net/dccp/input.c:180
>  dccp_rcv_established+0x88/0xb0 net/dccp/input.c:378
>  dccp_v4_do_rcv+0x135/0x160 net/dccp/ipv4.c:653
>  sk_backlog_rcv include/net/sock.h:908 [inline]
>  __sk_receive_skb+0x33e/0xc10 net/core/sock.c:513
>  dccp_v4_rcv+0xf5f/0x1c80 net/dccp/ipv4.c:874
>  ip_local_deliver_finish+0x2f1/0xc50 net/ipv4/ip_input.c:216
>  NF_HOOK include/linux/netfilter.h:288 [inline]
>  ip_local_deliver+0x1ce/0x6e0 net/ipv4/ip_input.c:257
>  dst_input include/net/dst.h:449 [inline]
>  ip_rcv_finish+0x953/0x1e30 net/ipv4/ip_input.c:397
>  NF_HOOK include/linux/netfilter.h:288 [inline]
>  ip_rcv+0xc5a/0x1840 net/ipv4/ip_input.c:493
>  __netif_receive_skb_core+0x1a41/0x3460 net/core/dev.c:4537
>  __netif_receive_skb+0x2c/0x1b0 net/core/dev.c:4602
>  process_backlog+0x203/0x740 net/core/dev.c:5282
>  napi_poll net/core/dev.c:5680 [inline]
>  net_rx_action+0x792/0x1910 net/core/dev.c:5746
>  __do_softirq+0x2d7/0xb85 kernel/softirq.c:285
>  do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1150
>  </IRQ>
>  do_softirq.part.19+0x14d/0x190 kernel/softirq.c:329
>  do_softirq kernel/softirq.c:177 [inline]
>  __local_bh_enable_ip+0x1ee/0x230 kernel/softirq.c:182
>  local_bh_enable include/linux/bottom_half.h:32 [inline]
>  rcu_read_unlock_bh include/linux/rcupdate.h:726 [inline]
>  ip_finish_output2+0x962/0x1550 net/ipv4/ip_output.c:231
>  ip_finish_output+0x864/0xd10 net/ipv4/ip_output.c:317
>  NF_HOOK_COND include/linux/netfilter.h:277 [inline]
>  ip_output+0x1d2/0x860 net/ipv4/ip_output.c:405
>  dst_output include/net/dst.h:443 [inline]
>  ip_local_out+0x95/0x160 net/ipv4/ip_output.c:124
>  ip_queue_xmit+0x8c0/0x18e0 net/ipv4/ip_output.c:504
>  dccp_transmit_skb+0x9ac/0x10f0 net/dccp/output.c:142
>  dccp_xmit_packet+0x215/0x740 net/dccp/output.c:281
>  dccp_write_xmit+0x17d/0x1d0 net/dccp/output.c:363
>  dccp_sendmsg+0x95f/0xdc0 net/dccp/proto.c:813
>  inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:764
>  sock_sendmsg_nosec net/socket.c:630 [inline]
>  sock_sendmsg+0xca/0x110 net/socket.c:640
>  ___sys_sendmsg+0x767/0x8b0 net/socket.c:2020
>  __sys_sendmsg+0xe5/0x210 net/socket.c:2054
>  SYSC_sendmsg net/socket.c:2065 [inline]
>  SyS_sendmsg+0x2d/0x50 net/socket.c:2061
>  entry_SYSCALL_64_fastpath+0x29/0xa0
> RIP: 0033:0x446469
> RSP: 002b:00007fcecb23bda8 EFLAGS: 00000293 ORIG_RAX: 000000000000002e
> RAX: ffffffffffffffda RBX: 00000000006dbc3c RCX: 0000000000446469
> RDX: 0000000000000080 RSI: 00000000206c8000 RDI: 0000000000000005
> RBP: 00000000006dbc38 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000293 R12: f8e4cbe49e572d45
> R13: 54c1b85d98aba1df R14: a6eaa24dbeb18c29 R15: 000000000000000c
> 

This is still happening.  It *might* be related to the other bug "suspicious RCU
usage at ./include/net/inet_sock.h:LINE".  Here's a simplified reproducer for
this one:

#include <linux/dccp.h>
#include <linux/in.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

int main()
{
        struct sockaddr_in addr = { .sin_family = AF_INET };
        socklen_t addrlen = sizeof(addr);
        int fd;

        while (fork())
                wait(NULL);
        fd = socket(AF_INET, SOCK_DCCP, 0);
        bind(fd, (void *)&addr, addrlen);
        getsockname(fd, (void *)&addr, &addrlen);
        listen(fd, 100);
        if (fork()) {
                fd = socket(AF_INET, SOCK_DCCP, 0);
                setsockopt(fd, SOL_DCCP, DCCP_SOCKOPT_CCID, "\x03", 1);
                connect(fd, (void *)&addr, sizeof(addr));
        } else {
                fd = accept(fd, NULL, 0);
        }
        for (int i = 0; i < 1000; i++)
                write(fd, "X", 1);
}

^ permalink raw reply

* Re: [PATCH bpf-next v8 05/11] seccomp,landlock: Enforce Landlock programs per process hierarchy
From: Mickaël Salaün @ 2018-04-08 22:01 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Alexei Starovoitov, Daniel Borkmann, LKML, Alexei Starovoitov,
	Arnaldo Carvalho de Melo, Casey Schaufler, David Drysdale,
	David S . Miller, Eric W . Biederman, Jann Horn, Jonathan Corbet,
	Michael Kerrisk, Kees Cook, Paul Moore, Sargun Dhillon,
	Serge E . Hallyn, Shuah Khan, Tejun Heo, Thomas Graf,
	Tycho Andersen, Will Drewry
In-Reply-To: <CALCETrViaXEx1iQ6q8bEEWSLchj=FH6LjcRY6+hjMx8A+rtgDQ@mail.gmail.com>


[-- Attachment #1.1: Type: text/plain, Size: 22278 bytes --]


On 04/08/2018 11:06 PM, Andy Lutomirski wrote:
> On Sun, Apr 8, 2018 at 6:13 AM, Mickaël Salaün <mic@digikod.net> wrote:
>>
>> On 02/27/2018 10:48 PM, Mickaël Salaün wrote:
>>>
>>> On 27/02/2018 17:39, Andy Lutomirski wrote:
>>>> On Tue, Feb 27, 2018 at 5:32 AM, Alexei Starovoitov
>>>> <alexei.starovoitov@gmail.com> wrote:
>>>>> On Tue, Feb 27, 2018 at 05:20:55AM +0000, Andy Lutomirski wrote:
>>>>>> On Tue, Feb 27, 2018 at 4:54 AM, Alexei Starovoitov
>>>>>> <alexei.starovoitov@gmail.com> wrote:
>>>>>>> On Tue, Feb 27, 2018 at 04:40:34AM +0000, Andy Lutomirski wrote:
>>>>>>>> On Tue, Feb 27, 2018 at 2:08 AM, Alexei Starovoitov
>>>>>>>> <alexei.starovoitov@gmail.com> wrote:
>>>>>>>>> On Tue, Feb 27, 2018 at 01:41:15AM +0100, Mickaël Salaün wrote:
>>>>>>>>>> The seccomp(2) syscall can be used by a task to apply a Landlock program
>>>>>>>>>> to itself. As a seccomp filter, a Landlock program is enforced for the
>>>>>>>>>> current task and all its future children. A program is immutable and a
>>>>>>>>>> task can only add new restricting programs to itself, forming a list of
>>>>>>>>>> programss.
>>>>>>>>>>
>>>>>>>>>> A Landlock program is tied to a Landlock hook. If the action on a kernel
>>>>>>>>>> object is allowed by the other Linux security mechanisms (e.g. DAC,
>>>>>>>>>> capabilities, other LSM), then a Landlock hook related to this kind of
>>>>>>>>>> object is triggered. The list of programs for this hook is then
>>>>>>>>>> evaluated. Each program return a 32-bit value which can deny the action
>>>>>>>>>> on a kernel object with a non-zero value. If every programs of the list
>>>>>>>>>> return zero, then the action on the object is allowed.
>>>>>>>>>>
>>>>>>>>>> Multiple Landlock programs can be chained to share a 64-bits value for a
>>>>>>>>>> call chain (e.g. evaluating multiple elements of a file path).  This
>>>>>>>>>> chaining is restricted when a process construct this chain by loading a
>>>>>>>>>> program, but additional checks are performed when it requests to apply
>>>>>>>>>> this chain of programs to itself.  The restrictions ensure that it is
>>>>>>>>>> not possible to call multiple programs in a way that would imply to
>>>>>>>>>> handle multiple shared values (i.e. cookies) for one chain.  For now,
>>>>>>>>>> only a fs_pick program can be chained to the same type of program,
>>>>>>>>>> because it may make sense if they have different triggers (cf. next
>>>>>>>>>> commits).  This restrictions still allows to reuse Landlock programs in
>>>>>>>>>> a safe way (e.g. use the same loaded fs_walk program with multiple
>>>>>>>>>> chains of fs_pick programs).
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Mickaël Salaün <mic@digikod.net>
>>>>>>>>>
>>>>>>>>> ...
>>>>>>>>>
>>>>>>>>>> +struct landlock_prog_set *landlock_prepend_prog(
>>>>>>>>>> +             struct landlock_prog_set *current_prog_set,
>>>>>>>>>> +             struct bpf_prog *prog)
>>>>>>>>>> +{
>>>>>>>>>> +     struct landlock_prog_set *new_prog_set = current_prog_set;
>>>>>>>>>> +     unsigned long pages;
>>>>>>>>>> +     int err;
>>>>>>>>>> +     size_t i;
>>>>>>>>>> +     struct landlock_prog_set tmp_prog_set = {};
>>>>>>>>>> +
>>>>>>>>>> +     if (prog->type != BPF_PROG_TYPE_LANDLOCK_HOOK)
>>>>>>>>>> +             return ERR_PTR(-EINVAL);
>>>>>>>>>> +
>>>>>>>>>> +     /* validate memory size allocation */
>>>>>>>>>> +     pages = prog->pages;
>>>>>>>>>> +     if (current_prog_set) {
>>>>>>>>>> +             size_t i;
>>>>>>>>>> +
>>>>>>>>>> +             for (i = 0; i < ARRAY_SIZE(current_prog_set->programs); i++) {
>>>>>>>>>> +                     struct landlock_prog_list *walker_p;
>>>>>>>>>> +
>>>>>>>>>> +                     for (walker_p = current_prog_set->programs[i];
>>>>>>>>>> +                                     walker_p; walker_p = walker_p->prev)
>>>>>>>>>> +                             pages += walker_p->prog->pages;
>>>>>>>>>> +             }
>>>>>>>>>> +             /* count a struct landlock_prog_set if we need to allocate one */
>>>>>>>>>> +             if (refcount_read(&current_prog_set->usage) != 1)
>>>>>>>>>> +                     pages += round_up(sizeof(*current_prog_set), PAGE_SIZE)
>>>>>>>>>> +                             / PAGE_SIZE;
>>>>>>>>>> +     }
>>>>>>>>>> +     if (pages > LANDLOCK_PROGRAMS_MAX_PAGES)
>>>>>>>>>> +             return ERR_PTR(-E2BIG);
>>>>>>>>>> +
>>>>>>>>>> +     /* ensure early that we can allocate enough memory for the new
>>>>>>>>>> +      * prog_lists */
>>>>>>>>>> +     err = store_landlock_prog(&tmp_prog_set, current_prog_set, prog);
>>>>>>>>>> +     if (err)
>>>>>>>>>> +             return ERR_PTR(err);
>>>>>>>>>> +
>>>>>>>>>> +     /*
>>>>>>>>>> +      * Each task_struct points to an array of prog list pointers.  These
>>>>>>>>>> +      * tables are duplicated when additions are made (which means each
>>>>>>>>>> +      * table needs to be refcounted for the processes using it). When a new
>>>>>>>>>> +      * table is created, all the refcounters on the prog_list are bumped (to
>>>>>>>>>> +      * track each table that references the prog). When a new prog is
>>>>>>>>>> +      * added, it's just prepended to the list for the new table to point
>>>>>>>>>> +      * at.
>>>>>>>>>> +      *
>>>>>>>>>> +      * Manage all the possible errors before this step to not uselessly
>>>>>>>>>> +      * duplicate current_prog_set and avoid a rollback.
>>>>>>>>>> +      */
>>>>>>>>>> +     if (!new_prog_set) {
>>>>>>>>>> +             /*
>>>>>>>>>> +              * If there is no Landlock program set used by the current task,
>>>>>>>>>> +              * then create a new one.
>>>>>>>>>> +              */
>>>>>>>>>> +             new_prog_set = new_landlock_prog_set();
>>>>>>>>>> +             if (IS_ERR(new_prog_set))
>>>>>>>>>> +                     goto put_tmp_lists;
>>>>>>>>>> +     } else if (refcount_read(&current_prog_set->usage) > 1) {
>>>>>>>>>> +             /*
>>>>>>>>>> +              * If the current task is not the sole user of its Landlock
>>>>>>>>>> +              * program set, then duplicate them.
>>>>>>>>>> +              */
>>>>>>>>>> +             new_prog_set = new_landlock_prog_set();
>>>>>>>>>> +             if (IS_ERR(new_prog_set))
>>>>>>>>>> +                     goto put_tmp_lists;
>>>>>>>>>> +             for (i = 0; i < ARRAY_SIZE(new_prog_set->programs); i++) {
>>>>>>>>>> +                     new_prog_set->programs[i] =
>>>>>>>>>> +                             READ_ONCE(current_prog_set->programs[i]);
>>>>>>>>>> +                     if (new_prog_set->programs[i])
>>>>>>>>>> +                             refcount_inc(&new_prog_set->programs[i]->usage);
>>>>>>>>>> +             }
>>>>>>>>>> +
>>>>>>>>>> +             /*
>>>>>>>>>> +              * Landlock program set from the current task will not be freed
>>>>>>>>>> +              * here because the usage is strictly greater than 1. It is
>>>>>>>>>> +              * only prevented to be freed by another task thanks to the
>>>>>>>>>> +              * caller of landlock_prepend_prog() which should be locked if
>>>>>>>>>> +              * needed.
>>>>>>>>>> +              */
>>>>>>>>>> +             landlock_put_prog_set(current_prog_set);
>>>>>>>>>> +     }
>>>>>>>>>> +
>>>>>>>>>> +     /* prepend tmp_prog_set to new_prog_set */
>>>>>>>>>> +     for (i = 0; i < ARRAY_SIZE(tmp_prog_set.programs); i++) {
>>>>>>>>>> +             /* get the last new list */
>>>>>>>>>> +             struct landlock_prog_list *last_list =
>>>>>>>>>> +                     tmp_prog_set.programs[i];
>>>>>>>>>> +
>>>>>>>>>> +             if (last_list) {
>>>>>>>>>> +                     while (last_list->prev)
>>>>>>>>>> +                             last_list = last_list->prev;
>>>>>>>>>> +                     /* no need to increment usage (pointer replacement) */
>>>>>>>>>> +                     last_list->prev = new_prog_set->programs[i];
>>>>>>>>>> +                     new_prog_set->programs[i] = tmp_prog_set.programs[i];
>>>>>>>>>> +             }
>>>>>>>>>> +     }
>>>>>>>>>> +     new_prog_set->chain_last = tmp_prog_set.chain_last;
>>>>>>>>>> +     return new_prog_set;
>>>>>>>>>> +
>>>>>>>>>> +put_tmp_lists:
>>>>>>>>>> +     for (i = 0; i < ARRAY_SIZE(tmp_prog_set.programs); i++)
>>>>>>>>>> +             put_landlock_prog_list(tmp_prog_set.programs[i]);
>>>>>>>>>> +     return new_prog_set;
>>>>>>>>>> +}
>>>>>>>>>
>>>>>>>>> Nack on the chaining concept.
>>>>>>>>> Please do not reinvent the wheel.
>>>>>>>>> There is an existing mechanism for attaching/detaching/quering multiple
>>>>>>>>> programs attached to cgroup and tracing hooks that are also
>>>>>>>>> efficiently executed via BPF_PROG_RUN_ARRAY.
>>>>>>>>> Please use that instead.
>>>>>>>>>
>>>>>>>>
>>>>>>>> I don't see how that would help.  Suppose you add a filter, then
>>>>>>>> fork(), and then the child adds another filter.  Do you want to
>>>>>>>> duplicate the entire array?  You certainly can't *modify* the array
>>>>>>>> because you'll affect processes that shouldn't be affected.
>>>>>>>>
>>>>>>>> In contrast, doing this through seccomp like the earlier patches
>>>>>>>> seemed just fine to me, and seccomp already had the right logic.
>>>>>>>
>>>>>>> it doesn't look to me that existing seccomp side of managing fork
>>>>>>> situation can be reused. Here there is an attempt to add 'chaining'
>>>>>>> concept which sort of an extension of existing seccomp style,
>>>>>>> but somehow heavily done on bpf side and contradicts cgroup/tracing.
>>>>>>>
>>>>>>
>>>>>> I don't see why the seccomp way can't be used.  I agree with you that
>>>>>> the seccomp *style* shouldn't be used in bpf code like this, but I
>>>>>> think that Landlock programs can and should just live in the existing
>>>>>> seccomp chain.  If the existing seccomp code needs some modification
>>>>>> to make this work, then so be it.
>>>>>
>>>>> +1
>>>>> if that was the case...
>>>>> but that's not my reading of the patch set.
>>>>
>>>> An earlier version of the patch set used the seccomp filter chain.
>>>> Mickaël, what exactly was wrong with that approach other than that the
>>>> seccomp() syscall was awkward for you to use?  You could add a
>>>> seccomp_add_landlock_rule() syscall if you needed to.
>>>
>>> Nothing was wrong about about that, this part did not changed (see my
>>> next comment).
>>>
>>>>
>>>> As a side comment, why is this an LSM at all, let alone a non-stacking
>>>> LSM?  It would make a lot more sense to me to make Landlock depend on
>>>> having LSMs configured in but to call the landlock hooks directly from
>>>> the security_xyz() hooks.
>>>
>>> See Casey's answer and his patch series: https://lwn.net/Articles/741963/
>>>
>>>>
>>>>>
>>>>>> In other words, the kernel already has two kinds of chaining:
>>>>>> seccomp's and bpf's.  bpf's doesn't work right for this type of usage
>>>>>> across fork(), whereas seccomp's already handles that case correctly.
>>>>>> (In contrast, seccomp's is totally wrong for cgroup-attached filters.)
>>>>>>  So IMO Landlock should use the seccomp core code and call into bpf
>>>>>> for the actual filtering.
>>>>>
>>>>> +1
>>>>> in cgroup we had to invent this new BPF_PROG_RUN_ARRAY mechanism,
>>>>> since cgroup hierarchy can be complicated with bpf progs attached
>>>>> at different levels with different override/multiprog properties,
>>>>> so walking link list and checking all flags at run-time would have
>>>>> been too slow. That's why we added compute_effective_progs().
>>>>
>>>> If we start adding override flags to Landlock, I think we're doing it
>>>> wrong.   With cgroup bpf programs, the whole mess is set up by the
>>>> administrator.  With seccomp, and with Landlock if done correctly, it
>>>> *won't* be set up by the administrator, so the chance that everyone
>>>> gets all the flags right is about zero.  All attached filters should
>>>> run unconditionally.
>>>
>>>
>>> There is a misunderstanding about this chaining mechanism. This should
>>> not be confused with the list of seccomp filters nor the cgroup
>>> hierarchies. Landlock programs can be stacked the same way seccomp's
>>> filters can (cf. struct landlock_prog_set, the "chain_last" field is an
>>> optimization which is not used for this struct handling). This stackable
>>> property did not changed from the previous patch series. The chaining
>>> mechanism is for another use case, which does not make sense for seccomp
>>> filters nor other eBPF program types, at least for now, from what I can
>>> tell.
>>>
>>> You may want to get a look at my talk at FOSDEM
>>> (https://landlock.io/talks/2018-02-04_landlock-fosdem.pdf), especially
>>> slides 11 and 12.
>>>
>>> Let me explain my reasoning about this program chaining thing.
>>>
>>> To check if an action on a file is allowed, we first need to identify
>>> this file and match it to the security policy. In a previous
>>> (non-public) patch series, I tried to use one type of eBPF program to
>>> check every kind of access to a file. To be able to identify a file, I
>>> relied on an eBPF map, similar to the current inode map. This map store
>>> a set of references to file descriptors. I then created a function
>>> bpf_is_file_beneath() to check if the requested file was beneath a file
>>> in the map. This way, no chaining, only one eBPF program type to check
>>> an access to a file... but some issues then emerged. First, this design
>>> create a side-channel which help an attacker using such a program to
>>> infer some information not normally available, for example to get a hint
>>> on where a file descriptor (received from a UNIX socket) come from.
>>> Another issue is that this type of program would be called for each
>>> component of a path. Indeed, when the kernel check if an access to a
>>> file is allowed, it walk through all of the directories in its path
>>> (checking if the current process is allowed to execute them). That first
>>> attempt led me to rethink the way we could filter an access to a file
>>> *path*.
>>>
>>> To minimize the number of called to an eBPF program dedicated to
>>> validate an access to a file path, I decided to create three subtype of
>>> eBPF programs. The FS_WALK type is called when walking through every
>>> directory of a file path (except the last one if it is the target). We
>>> can then restrict this type of program to the minimum set of functions
>>> it is allowed to call and the minimum set of data available from its
>>> context. The first implicit chaining is for this type of program. To be
>>> able to evaluate a path while being called for all its components, this
>>> program need to store a state (to remember what was the parent directory
>>> of this path). There is no "previous" field in the subtype for this
>>> program because it is chained with itself, for each directories. This
>>> enable to create a FS_WALK program to evaluate a file hierarchy, thank
>>> to the inode map which can be used to check if a directory of this
>>> hierarchy is part of an allowed (or denied) list of directories. This
>>> design enables to express a file hierarchy in a programmatic way,
>>> without requiring an eBPF helper to do the job (unlike my first experiment).
>>>
>>> The explicit chaining is used to tied a path evaluation (with a FS_WALK
>>> program) to an access to the actual file being requested (the last
>>> component of a file path), with a FS_PICK program. It is only at this
>>> time that the kernel check for the requested action (e.g. read, write,
>>> chdir, append...). To be able to filter such access request we can have
>>> one call to the same program for every action and let this program check
>>> for which action it was called. However, this design does not allow the
>>> kernel to know if the current action is indeed handled by this program.
>>> Hence, it is not possible to implement a cache mechanism to only call
>>> this program if it knows how to handle this action.
>>>
>>> The approach I took for this FS_PICK type of program is to add to its
>>> subtype which action it can handle (with the "triggers" bitfield, seen
>>> as ORed actions). This way, the kernel knows if a call to a FS_PICK
>>> program is necessary. If the user wants to enforce a different security
>>> policy according to the action requested on a file, then it needs
>>> multiple FS_PICK programs. However, to reduce the number of such
>>> programs, this patch series allow a FS_PICK program to be chained with
>>> another, the same way a FS_WALK is chained with itself. This way, if the
>>> user want to check if the action is a for example an "open" and a "read"
>>> and not a "map" and a "read", then it can chain multiple FS_PICK
>>> programs with different triggers actions. The OR check performed by the
>>> kernel is not a limitation then, only a way to know if a call to an eBPF
>>> program is needed.
>>>
>>> The last type of program is FS_GET. This one is called when a process
>>> get a struct file or change its working directory. This is the only
>>> program type able (and allowed) to tag a file. This restriction is
>>> important to not being subject to resource exhaustion attacks (i.e.
>>> tagging every inode accessible to an attacker, which would allocate too
>>> much kernel memory).
>>>
>>> This design gives room for improvements to create a cache of eBPF
>>> context (input data, including maps if any), with the result of an eBPF
>>> program. This would help limit the number of call to an eBPF program the
>>> same way SELinux or other kernel components do to limit costly checks.
>>>
>>> The eBPF maps of progs are useful to call the same type of eBPF
>>> program. It does not fit with this use case because we may want multiple
>>> eBPF program according to the action requested on a kernel object (e.g.
>>> FS_GET). The other reason is because the eBPF program does not know what
>>> will be the next (type of) access check performed by the kernel.
>>>
>>> To say it another way, this chaining mechanism is a way to split a
>>> kernel object evaluation with multiple specialized programs, each of
>>> them being able to deal with data tied to their type. Using a monolithic
>>> eBPF program to check everything does not scale and does not fit with
>>> unprivileged use either.
>>>
>>> As a side note, the cookie value is only an ephemeral value to keep a
>>> state between multiple programs call. It can be used to create a state
>>> machine for an object evaluation.
>>>
>>> I don't see a way to do an efficient and programmatic path evaluation,
>>> with different access checks, with the current eBPF features. Please let
>>> me know if you know how to do it another way.
>>>
>>
>> Andy, Alexei, Daniel, what do you think about this Landlock program
>> chaining and cookie?
>>
> 
> Can you give a small pseudocode real world example that acutally needs
> chaining?  The mechanism is quite complicated and I'd like to
> understand how it'll be used.
> 

Here is the interesting part from the example (patch 09/11):

+SEC("maps")
+struct bpf_map_def inode_map = {
+	.type = BPF_MAP_TYPE_INODE,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(u64),
+	.max_entries = 20,
+};
+
+SEC("subtype/landlock1")
+static union bpf_prog_subtype _subtype1 = {
+	.landlock_hook = {
+		.type = LANDLOCK_HOOK_FS_WALK,
+	}
+};
+
+static __always_inline __u64 update_cookie(__u64 cookie, __u8 lookup,
+		void *inode, void *chain, bool freeze)
+{
+	__u64 map_allow = 0;
+
+	if (cookie == 0) {
+		cookie = bpf_inode_get_tag(inode, chain);
+		if (cookie)
+			return cookie;
+		/* only look for the first match in the map, ignore nested
+		 * paths in this example */
+		map_allow = bpf_inode_map_lookup(&inode_map, inode);
+		if (map_allow)
+			cookie = 1 | map_allow;
+	} else {
+		if (cookie & COOKIE_VALUE_FREEZED)
+			return cookie;
+		map_allow = cookie & _MAP_MARK_MASK;
+		cookie &= ~_MAP_MARK_MASK;
+		switch (lookup) {
+		case LANDLOCK_CTX_FS_WALK_INODE_LOOKUP_DOTDOT:
+			cookie--;
+			break;
+		case LANDLOCK_CTX_FS_WALK_INODE_LOOKUP_DOT:
+			break;
+		default:
+			/* ignore _MAP_MARK_MASK overflow in this example */
+			cookie++;
+			break;
+		}
+		if (cookie >= 1)
+			cookie |= map_allow;
+	}
+	/* do not modify the cookie for each fs_pick */
+	if (freeze && cookie)
+		cookie |= COOKIE_VALUE_FREEZED;
+	return cookie;
+}
+
+SEC("landlock1")
+int fs_walk(struct landlock_ctx_fs_walk *ctx)
+{
+	ctx->cookie = update_cookie(ctx->cookie, ctx->inode_lookup,
+			(void *)ctx->inode, (void *)ctx->chain, false);
+	return LANDLOCK_RET_ALLOW;
+}

The program "landlock1" is called for every directory execution (except
the last one if it is the leaf of a path). This enables to identify a
file hierarchy with only a (one dimension) list of file descriptors
(i.e. inode_map).

Underneath, the Landlock LSM part looks if there is an associated path
walk (nameidata) with each inode access request. If there is one, then
the cookie associated with the path walk (if any) is made available
through the eBPF program context. This enables to develop a state
machine with an eBPF program to "evaluate" a file path (without string
parsing).

The goal with this chaining mechanism is to be able to express a complex
kernel object like a file, with multiple run of one or more eBPF
programs, as a multilayer evaluation. This semantic may only make sense
for the user/developer and his security policy. We must keep in mind
that this object identification should be available to unprivileged
processes. This means that we must be very careful to what kind of
information are available to an eBPF program because this can then leak
to a process (e.g. through a map). With this mechanism, only information
already available to user space is available to the eBPF program.

In this example, the complexity of the path evaluation is in the eBPF
program. We can then keep the kernel code more simple and generic. This
enables more flexibility for a security policy definition.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* Re: [PATCH] net: bridge: add missing NULL checks
From: Nikolay Aleksandrov @ 2018-04-08 22:25 UTC (permalink / raw)
  To: Laszlo Toth, Stephen Hemminger, David S. Miller; +Cc: netdev, bridge
In-Reply-To: <20180408174934.GA3895@laszlth>

On 08/04/18 20:49, Laszlo Toth wrote:
> br_port_get_rtnl() can return NULL
> 
> Signed-off-by: Laszlo Toth <laszlth@gmail.com>
> ---
>   net/bridge/br_netlink.c | 12 ++++++++++--
>   1 file changed, 10 insertions(+), 2 deletions(-)
> 

Nacked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
More below.

> diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
> index 015f465c..cbec11f 100644
> --- a/net/bridge/br_netlink.c
> +++ b/net/bridge/br_netlink.c
> @@ -939,14 +939,17 @@ static int br_port_slave_changelink(struct net_device *brdev,
>   				    struct nlattr *data[],
>   				    struct netlink_ext_ack *extack)
>   {
> +	struct net_bridge_port *port = br_port_get_rtnl(dev);
>   	struct net_bridge *br = netdev_priv(brdev);
>   	int ret;
>   
>   	if (!data)
>   		return 0;
> +	if (!port)
> +		return -EINVAL;
>   

If we're here, it means the master device of dev is a bridge => dev is a bridge port,
since we're running with RTNL that cannot change, so this check is unnecessary.

Have you actually hit a bug with this code ?

>   	spin_lock_bh(&br->lock);
> -	ret = br_setport(br_port_get_rtnl(dev), data);
> +	ret = br_setport(port, data);
>   	spin_unlock_bh(&br->lock);
>   
>   	return ret;
> @@ -956,7 +959,12 @@ static int br_port_fill_slave_info(struct sk_buff *skb,
>   				   const struct net_device *brdev,
>   				   const struct net_device *dev)
>   {
> -	return br_port_fill_attrs(skb, br_port_get_rtnl(dev));
> +	struct net_bridge_port *port = br_port_get_rtnl(dev);
> +
> +	if (!port)
> +		return -EINVAL;
> +
> +	return br_port_fill_attrs(skb, port);

Same rationale here, fill_slave_info is called via a master device's ops
under RTNL, which means dev is a bridge port and that also cannot change.

If you have hit a bug with this code, can we see the trace ?
The problem might be elsewhere.

Thanks,
  Nik

>   }
>   
>   static size_t br_port_get_slave_size(const struct net_device *brdev,
> 

^ permalink raw reply

* pull-request: bpf 2018-04-09
From: Daniel Borkmann @ 2018-04-08 22:28 UTC (permalink / raw)
  To: davem; +Cc: daniel, ast, netdev

Hi David,

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Two sockmap fixes: i) fix a potential warning when a socket with
   pending cork data is closed by freeing the memory right when the
   socket is closed instead of seeing still outstanding memory at
   garbage collector time, ii) fix a NULL pointer deref in case of
   duplicates release calls, so make sure to only reset the sk_prot
   pointer when it's in a valid state to do so, both from John.

2) Fix a compilation warning in bpf_prog_attach_check_attach_type()
   by moving the function under CONFIG_CGROUP_BPF ifdef since only
   used there, from Anders.

Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Thanks a lot!

----------------------------------------------------------------

The following changes since commit 4608f064532c28c0ea3c03fe26a3a5909852811a:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-next (2018-04-03 14:08:58 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git 

for you to fetch changes up to 33491588c1fb2c76ed114a211ad0ee76c16b5a0c:

  kernel/bpf/syscall: fix warning defined but not used (2018-04-04 11:08:36 +0200)

----------------------------------------------------------------
Anders Roxell (1):
      kernel/bpf/syscall: fix warning defined but not used

John Fastabend (2):
      bpf: sockmap, free memory on sock close with cork data
      bpf: sockmap, duplicates release calls may NULL sk_prot

 kernel/bpf/sockmap.c | 12 ++++++++++--
 kernel/bpf/syscall.c | 24 ++++++++++++------------
 2 files changed, 22 insertions(+), 14 deletions(-)

^ permalink raw reply

* Re: KASAN: use-after-free Read in inet_create
From: Eric Biggers @ 2018-04-08 23:17 UTC (permalink / raw)
  To: linux-rdma, rds-devel, Santosh Shilimkar
  Cc: syzbot, davem, kuznet, linux-kernel, netdev, syzkaller-bugs,
	yoshfuji
In-Reply-To: <001a1144d1c8e819f6055fee7118@google.com>

[+RDS list and maintainer]

On Sat, Dec 09, 2017 at 12:50:01PM -0800, syzbot wrote:
> Hello,
> 
> syzkaller hit the following crash on
> 82bcf1def3b5f1251177ad47c44f7e17af039b4b
> git://git.cmpxchg.org/linux-mmots.git/master
> compiler: gcc (GCC) 7.1.1 20170620
> .config is attached
> Raw console output is attached.
> 
> Unfortunately, I don't have any reproducer for this bug yet.
> 
> 
> ==================================================================
> BUG: KASAN: use-after-free in inet_create+0xda0/0xf50 net/ipv4/af_inet.c:338
> Read of size 4 at addr ffff8801bde28554 by task kworker/u4:5/3492
> 
> CPU: 0 PID: 3492 Comm: kworker/u4:5 Not tainted 4.15.0-rc2-mm1+ #39
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Workqueue: krdsd rds_connect_worker
> Call Trace:
>  __dump_stack lib/dump_stack.c:17 [inline]
>  dump_stack+0x194/0x257 lib/dump_stack.c:53
>  print_address_description+0x73/0x250 mm/kasan/report.c:252
>  kasan_report_error mm/kasan/report.c:351 [inline]
>  kasan_report+0x25b/0x340 mm/kasan/report.c:409
>  __asan_report_load4_noabort+0x14/0x20 mm/kasan/report.c:429
>  inet_create+0xda0/0xf50 net/ipv4/af_inet.c:338
>  __sock_create+0x4d4/0x850 net/socket.c:1265
>  sock_create_kern+0x3f/0x50 net/socket.c:1311
>  rds_tcp_conn_path_connect+0x26f/0x920 net/rds/tcp_connect.c:108
>  rds_connect_worker+0x156/0x1f0 net/rds/threads.c:165
>  process_one_work+0xbfd/0x1bc0 kernel/workqueue.c:2113
>  worker_thread+0x223/0x1990 kernel/workqueue.c:2247
>  kthread+0x37a/0x440 kernel/kthread.c:238
>  ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:524
> 
> Allocated by task 3362:
>  save_stack+0x43/0xd0 mm/kasan/kasan.c:447
>  set_track mm/kasan/kasan.c:459 [inline]
>  kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551
>  kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:489
>  kmem_cache_alloc+0x12e/0x760 mm/slab.c:3548
>  kmem_cache_zalloc include/linux/slab.h:695 [inline]
>  net_alloc net/core/net_namespace.c:362 [inline]
>  copy_net_ns+0x196/0x580 net/core/net_namespace.c:402
>  create_new_namespaces+0x425/0x880 kernel/nsproxy.c:107
>  unshare_nsproxy_namespaces+0xae/0x1e0 kernel/nsproxy.c:206
>  SYSC_unshare kernel/fork.c:2421 [inline]
>  SyS_unshare+0x653/0xfa0 kernel/fork.c:2371
>  entry_SYSCALL_64_fastpath+0x1f/0x96
> 
> Freed by task 35:
>  save_stack+0x43/0xd0 mm/kasan/kasan.c:447
>  set_track mm/kasan/kasan.c:459 [inline]
>  kasan_slab_free+0x71/0xc0 mm/kasan/kasan.c:524
>  __cache_free mm/slab.c:3492 [inline]
>  kmem_cache_free+0x77/0x280 mm/slab.c:3750
>  net_free+0xca/0x110 net/core/net_namespace.c:378
>  net_drop_ns.part.11+0x26/0x30 net/core/net_namespace.c:385
>  net_drop_ns net/core/net_namespace.c:384 [inline]
>  cleanup_net+0x895/0xb60 net/core/net_namespace.c:502
>  process_one_work+0xbfd/0x1bc0 kernel/workqueue.c:2113
>  worker_thread+0x223/0x1990 kernel/workqueue.c:2247
>  kthread+0x37a/0x440 kernel/kthread.c:238
>  ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:524
> 
> The buggy address belongs to the object at ffff8801bde28080
>  which belongs to the cache net_namespace of size 6272
> The buggy address is located 1236 bytes inside of
>  6272-byte region [ffff8801bde28080, ffff8801bde29900)
> The buggy address belongs to the page:
> page:00000000df6a4dc0 count:1 mapcount:0 mapping:00000000553659f1 index:0x0
> compound_mapcount: 0
> flags: 0x2fffc0000008100(slab|head)
> raw: 02fffc0000008100 ffff8801bde28080 0000000000000000 0000000100000001
> raw: ffffea0006f75da0 ffffea0006f60220 ffff8801d989fe00 0000000000000000
> page dumped because: kasan: bad access detected
> 
> Memory state around the buggy address:
>  ffff8801bde28400: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>  ffff8801bde28480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > ffff8801bde28500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>                                                  ^
>  ffff8801bde28580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>  ffff8801bde28600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ==================================================================
> 
> 
> ---
> This bug is generated by a dumb bot. It may contain errors.
> See https://goo.gl/tpsmEJ for details.
> Direct all questions to syzkaller@googlegroups.com.
> Please credit me with: Reported-by: syzbot <syzkaller@googlegroups.com>
> 
> syzbot will keep track of this bug report.
> Once a fix for this bug is merged into any tree, reply to this email with:
> #syz fix: exact-commit-title
> To mark this as a duplicate of another syzbot report, please reply with:
> #syz dup: exact-subject-of-another-report
> If it's a one-off invalid bug report, please reply with:
> #syz invalid
> Note: if the crash happens again, it will cause creation of a new bug
> report.
> Note: all commands must start from beginning of the line in the email body.
> 

This is still happening regularly, though syzbot hasn't been able to generate a
reproducer yet.  All the reports seem to involve rds_connect_worker()
encountering a freed network namespace (struct net) when calling
sock_create_kern() from rds_tcp_conn_path_connect().  Probably something in RDS
needs to be taking a reference to the network namespace and isn't, or the RDS
workqueue isn't being shut down correctly.  You can see all reports of this on
the syzbot dashboard at
https://syzkaller.appspot.com/bug?id=1f45ae538a0453220337ccb84962249fdd67107f.
Last one was April 5 on Linus' tree (commit 3e968c9f1401088).

- Eric

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox