Netdev List

* [PATCH bpf-next 2/8] samples: bpf: Makefile: remove target for native build
From: Ivan Khoronzhuk @ 2019-09-04 21:22 UTC (permalink / raw)
  To: ast, daniel, yhs, davem, jakub.kicinski, hawk, john.fastabend
  Cc: linux-kernel, netdev, bpf, clang-built-linux, Ivan Khoronzhuk
In-Reply-To: <20190904212212.13052-1-ivan.khoronzhuk@linaro.org>

There is no need to set --target for a native build; at least on arm, the
default target is used anyway. On arm, with at least clang 5 through 10,
setting it causes an error like:

clang: warning: unknown platform, assuming -mfloat-abi=soft
LLVM ERROR: Unsupported calling convention
make[2]: *** [/home/root/snapshot/samples/bpf/Makefile:299:
/home/root/snapshot/samples/bpf/sockex1_kern.o] Error 1

Only setting a real triple helps (--target=arm-linux-gnueabihf), or
dropping the target option so that the default one is used. Dropping it,
so that the default target (which is native) is used, looks like the
better choice.

Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
---
 samples/bpf/Makefile | 2 --
 1 file changed, 2 deletions(-)

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 61b7394b811e..a2953357927e 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -197,8 +197,6 @@ BTF_PAHOLE ?= pahole
 ifdef CROSS_COMPILE
 HOSTCC = $(CROSS_COMPILE)gcc
 CLANG_ARCH_ARGS = --target=$(notdir $(CROSS_COMPILE:%-=%))
-else
-CLANG_ARCH_ARGS = -target $(ARCH)
 endif
 
 # Don't evaluate probes and warnings if we need to run make recursively
-- 
2.17.1



* [PATCH bpf-next 1/8] samples: bpf: Makefile: use --target from cross-compile
From: Ivan Khoronzhuk @ 2019-09-04 21:22 UTC (permalink / raw)
  To: ast, daniel, yhs, davem, jakub.kicinski, hawk, john.fastabend
  Cc: linux-kernel, netdev, bpf, clang-built-linux, Ivan Khoronzhuk
In-Reply-To: <20190904212212.13052-1-ivan.khoronzhuk@linaro.org>

When cross-compiling, the target triple can be derived from the
cross-compile prefix, as is done for CLANG_FLAGS in the kernel Makefile.
So copy that approach from the kernel Makefile.

Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
---
 samples/bpf/Makefile | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 1d9be26b4edd..61b7394b811e 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -196,6 +196,8 @@ BTF_PAHOLE ?= pahole
 # Detect that we're cross compiling and use the cross compiler
 ifdef CROSS_COMPILE
 HOSTCC = $(CROSS_COMPILE)gcc
+CLANG_ARCH_ARGS = --target=$(notdir $(CROSS_COMPILE:%-=%))
+else
 CLANG_ARCH_ARGS = -target $(ARCH)
 endif
 
-- 
2.17.1
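
For readers less used to make substitution references, the expression
$(notdir $(CROSS_COMPILE:%-=%)) from the hunk above can be modelled in
user space roughly as below. This is only a sketch: the function name and
the example prefixes are made up for illustration, and it handles a single
word, whereas make applies the substitution to every word in the variable.

```c
#include <assert.h>
#include <string.h>

/*
 * User-space model of $(notdir $(CROSS_COMPILE:%-=%)) for one word:
 * drop the directory part, then strip one trailing '-' if present.
 * The result is copied into out (truncated to outlen - 1 characters).
 */
static char *triple_from_prefix(const char *prefix, char *out, size_t outlen)
{
	const char *base;
	size_t len;

	assert(prefix && out && outlen > 0);

	base = strrchr(prefix, '/');
	base = base ? base + 1 : prefix;

	len = strlen(base);
	if (len && base[len - 1] == '-')
		len--;
	if (len >= outlen)
		len = outlen - 1;

	memcpy(out, base, len);
	out[len] = '\0';
	return out;
}
```

With a prefix such as arm-linux-gnueabihf- this yields the triple that the
2/8 commit message reports as working with --target.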



* Re: [PATCH] net/skbuff: silence warnings under memory pressure
From: Qian Cai @ 2019-09-04 20:42 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Sergey Senozhatsky, Michal Hocko, Eric Dumazet, davem, netdev,
	linux-mm, linux-kernel, Petr Mladek, Steven Rostedt
In-Reply-To: <20190904144850.GA8296@tigerII.localdomain>

On Wed, 2019-09-04 at 23:48 +0900, Sergey Senozhatsky wrote:
> On (09/04/19 08:14), Qian Cai wrote:
> > > Plus one more check - waitqueue_active(&log_wait). printk() adds
> > > pending irq_work only if there is a user-space process sleeping on
> > > log_wait and irq_work is not already scheduled. If the syslog is
> > > active or there is no one to wake up then we don't queue irq_work.
> > 
> > Another possibility for this potential livelock is that those printk() from
> > warn_alloc(), dump_stack() and show_mem() increase the time it needs to
> > process build_skb() allocation failures significantly under memory pressure.
> > As the result, ksoftirqd() could be rescheduled during that time via a
> > different CPU (this is a large x86 NUMA system anyway),
> > 
> > [83605.577256][   C31]  run_ksoftirqd+0x1f/0x40
> > [83605.577256][   C31]  smpboot_thread_fn+0x255/0x440
> > [83605.577256][   C31]  kthread+0x1df/0x200
> > [83605.577256][   C31]  ret_from_fork+0x35/0x40
> 
> Hum hum hum...
> 
> So I can, _probably_, think of several patches.
> 
> First, move wake_up_klogd() back to console_unlock().
> 
> Second, move `printk_pending' out of per-CPU region and make it global.
> So we will have just one printk irq_work scheduled across all CPUs;
> currently we have one irq_work per CPU. I think I sent a patch a long
> long time ago, but we never discussed it, as far as I remember.
> 
> > In addition, those printk() will deal with console drivers or even a
> > networking console, so it is probably not unusual that it could call
> > irq_exit()->__do_softirq() at one point and then this livelock.
> 
> Do you use netcon? Because this, theoretically, can open up one more
> vector. netcon allocates skbs from ->write() path. We call con drivers'
> ->write() from printk_safe context, so should netcon skb allocation
> warn we will schedule one more irq_work on that CPU to flush per-CPU
> printk_safe buffer.
> 
> If this is the case, then we can stop calling console_driver() under
> printk_safe. I sent a patch a while ago, but we agreed to keep the
> things the way they are, for the time being.
> 
> Let me think more.

To summarize, those all look to me like good long-term improvements that would
reduce the likelihood of this kind of livelock in general, especially for other
unknown allocations that happen while processing softirqs. It is still up in
the air whether they fix it 100% in all situations, though, as printk() is
going to take more time and could deal with console hardware that involves
irq_exit() anyway.

On the other hand, adding __GFP_NOWARN to the build_skb() allocation will fix
this known NET_TX_SOFTIRQ case, which is common when ksoftirqd is involved, at
least in the short term. It even has the benefit of reducing the overall
warn_alloc() noise out there.

I can resubmit with an updated changelog. Does that make sense?


* Re: WARNING in hso_free_net_device
From: Hui Peng @ 2019-09-04 20:27 UTC (permalink / raw)
  To: syzbot+44d53c7255bb1aea22d2, alexios.zavras, andreyknvl, davem,
	gregkh, linux-kernel, linux-usb, mathias.payer, netdev, rfontana,
	syzkaller-bugs, tglx
In-Reply-To: <0000000000002a95df0591a4f114@google.com>

Hi, all:

I looked at the bug a little.

The issue is that in the error handling code, hso_free_net_device
unregisters the net_device (hso_net->net) by calling unregister_netdev. In
the error handling code path, hso_net->net has not been registered yet.

I think there are two ways to solve the issue:

1. fix it in drivers/net/usb/hso.c to avoid unregistering the
net_device when it has not been registered yet

2. fix it in unregister_netdev. We can add a field in net_device to
record whether it is registered, and make unregister_netdev return if
the net_device is not registered yet.

What do you guys think ?
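
To make option 1 concrete, here is a small user-space model of a free path
that only unregisters a device that was actually registered. Everything in
it (the fake_* names, the enum, the call counter) is a hypothetical
stand-in, not the hso or netdev code. As far as I can tell, struct
net_device already carries a reg_state field that could serve this purpose,
so option 2 may not need a new field, but that should be checked against
the tree.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
enum fake_reg_state { FAKE_UNINITIALIZED, FAKE_REGISTERED, FAKE_UNREGISTERED };

struct fake_netdev {
	enum fake_reg_state reg_state;
	int unregister_calls;	/* counts calls, for demonstration only */
};

static void fake_register(struct fake_netdev *dev)
{
	dev->reg_state = FAKE_REGISTERED;
}

static void fake_unregister(struct fake_netdev *dev)
{
	/* Models the WARNING in the report: the device must be registered. */
	assert(dev->reg_state == FAKE_REGISTERED);
	dev->reg_state = FAKE_UNREGISTERED;
	dev->unregister_calls++;
}

/* Option 1: the driver's free path skips unregistration on early failure. */
static void fake_free_net_device(struct fake_netdev *dev)
{
	assert(dev != NULL);
	if (dev->reg_state == FAKE_REGISTERED)
		fake_unregister(dev);
	/* ... the remaining resources would be released here ... */
}
```

The same free function then works both for the probe error path (device
never registered) and for a normal disconnect.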

On 9/3/19 8:08 AM, syzbot wrote:
> Hello,
>
> syzbot found the following crash on:
>
> HEAD commit:    eea39f24 usb-fuzzer: main usb gadget fuzzer driver
> git tree:       https://github.com/google/kasan.git usb-fuzzer
> console output: https://syzkaller.appspot.com/x/log.txt?x=15f17e61600000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=d0c62209eedfd54e
> dashboard link: https://syzkaller.appspot.com/bug?extid=44d53c7255bb1aea22d2
> compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=10ffdd12600000
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=15a738fe600000
>
> IMPORTANT: if you fix the bug, please add the following tag to the
> commit:
> Reported-by: syzbot+44d53c7255bb1aea22d2@syzkaller.appspotmail.com
>
> usb 1-1: config 0 has no interface number 0
> usb 1-1: New USB device found, idVendor=0af0, idProduct=d257,
> bcdDevice=4e.87
> usb 1-1: New USB device strings: Mfr=0, Product=0, SerialNumber=0
> usb 1-1: config 0 descriptor??
> hso 1-1:0.15: Can't find BULK IN endpoint
> ------------[ cut here ]------------
> WARNING: CPU: 1 PID: 83 at net/core/dev.c:8167
> rollback_registered_many.cold+0x41/0x1bc net/core/dev.c:8167
> Kernel panic - not syncing: panic_on_warn set ...
> CPU: 1 PID: 83 Comm: kworker/1:2 Not tainted 5.3.0-rc5+ #28
> Hardware name: Google Google Compute Engine/Google Compute Engine,
> BIOS Google 01/01/2011
> Workqueue: usb_hub_wq hub_event
> Call Trace:
>  __dump_stack lib/dump_stack.c:77 [inline]
>  dump_stack+0xca/0x13e lib/dump_stack.c:113
>  panic+0x2a3/0x6da kernel/panic.c:219
>  __warn.cold+0x20/0x4a kernel/panic.c:576
>  report_bug+0x262/0x2a0 lib/bug.c:186
>  fixup_bug arch/x86/kernel/traps.c:179 [inline]
>  fixup_bug arch/x86/kernel/traps.c:174 [inline]
>  do_error_trap+0x12b/0x1e0 arch/x86/kernel/traps.c:272
>  do_invalid_op+0x32/0x40 arch/x86/kernel/traps.c:291
>  invalid_op+0x23/0x30 arch/x86/entry/entry_64.S:1028
> RIP: 0010:rollback_registered_many.cold+0x41/0x1bc net/core/dev.c:8167
> Code: ff e8 c7 26 90 fc 48 c7 c7 40 ec 63 86 e8 24 c8 7a fc 0f 0b e9
> 93 be ff ff e8 af 26 90 fc 48 c7 c7 40 ec 63 86 e8 0c c8 7a fc <0f> 0b
> 4c 89 e7 e8 f9 12 34 fd 31 ff 41 89 c4 89 c6 e8 bd 27 90 fc
> RSP: 0018:ffff8881d934f088 EFLAGS: 00010282
> RAX: 0000000000000024 RBX: ffff8881d2ad4400 RCX: 0000000000000000
> RDX: 0000000000000000 RSI: ffffffff81288cfd RDI: ffffed103b269e03
> RBP: ffff8881d934f1b8 R08: 0000000000000024 R09: fffffbfff11ad794
> R10: fffffbfff11ad793 R11: ffffffff88d6bc9f R12: ffff8881d2ad4470
> R13: ffff8881d934f148 R14: dffffc0000000000 R15: 0000000000000000
>  rollback_registered+0xf2/0x1c0 net/core/dev.c:8243
>  unregister_netdevice_queue net/core/dev.c:9290 [inline]
>  unregister_netdevice_queue+0x1d7/0x2b0 net/core/dev.c:9283
>  unregister_netdevice include/linux/netdevice.h:2631 [inline]
>  unregister_netdev+0x18/0x20 net/core/dev.c:9331
>  hso_free_net_device+0xff/0x300 drivers/net/usb/hso.c:2366
>  hso_create_net_device+0x76d/0x9c0 drivers/net/usb/hso.c:2554
>  hso_probe+0x28d/0x1a46 drivers/net/usb/hso.c:2931
>  usb_probe_interface+0x305/0x7a0 drivers/usb/core/driver.c:361
>  really_probe+0x281/0x6d0 drivers/base/dd.c:548
>  driver_probe_device+0x101/0x1b0 drivers/base/dd.c:721
>  __device_attach_driver+0x1c2/0x220 drivers/base/dd.c:828
>  bus_for_each_drv+0x162/0x1e0 drivers/base/bus.c:454
>  __device_attach+0x217/0x360 drivers/base/dd.c:894
>  bus_probe_device+0x1e4/0x290 drivers/base/bus.c:514
>  device_add+0xae6/0x16f0 drivers/base/core.c:2165
>  usb_set_configuration+0xdf6/0x1670 drivers/usb/core/message.c:2023
>  generic_probe+0x9d/0xd5 drivers/usb/core/generic.c:210
>  usb_probe_device+0x99/0x100 drivers/usb/core/driver.c:266
>  really_probe+0x281/0x6d0 drivers/base/dd.c:548
>  driver_probe_device+0x101/0x1b0 drivers/base/dd.c:721
>  __device_attach_driver+0x1c2/0x220 drivers/base/dd.c:828
>  bus_for_each_drv+0x162/0x1e0 drivers/base/bus.c:454
>  __device_attach+0x217/0x360 drivers/base/dd.c:894
>  bus_probe_device+0x1e4/0x290 drivers/base/bus.c:514
>  device_add+0xae6/0x16f0 drivers/base/core.c:2165
>  usb_new_device.cold+0x6a4/0xe79 drivers/usb/core/hub.c:2536
>  hub_port_connect drivers/usb/core/hub.c:5098 [inline]
>  hub_port_connect_change drivers/usb/core/hub.c:5213 [inline]
>  port_event drivers/usb/core/hub.c:5359 [inline]
>  hub_event+0x1b5c/0x3640 drivers/usb/core/hub.c:5441
>  process_one_work+0x92b/0x1530 kernel/workqueue.c:2269
>  worker_thread+0x96/0xe20 kernel/workqueue.c:2415
>  kthread+0x318/0x420 kernel/kthread.c:255
>  ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
> Kernel Offset: disabled
> Rebooting in 86400 seconds..
>
>
> ---
> This bug is generated by a bot. It may contain errors.
> See https://goo.gl/tpsmEJ for more information about syzbot.
> syzbot engineers can be reached at syzkaller@googlegroups.com.
>
> syzbot will keep track of this bug report. See:
> https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
> syzbot can test patches for this bug, for details see:
> https://goo.gl/tpsmEJ#testing-patches


* Re: [PATCH v2 2/2] net: phy: adin: implement Energy Detect Powerdown mode via phy-tunable
From: Andrew Lunn @ 2019-09-04 20:03 UTC (permalink / raw)
  To: Alexandru Ardelean; +Cc: netdev, linux-kernel, f.fainelli, hkallweit1, davem
In-Reply-To: <20190904162322.17542-3-alexandru.ardelean@analog.com>

On Wed, Sep 04, 2019 at 07:23:22PM +0300, Alexandru Ardelean wrote:
> This driver becomes the first user of the kernel's `ETHTOOL_PHY_EDPD`
> phy-tunable feature.
> EDPD is also enabled by default on PHY config_init, but can be disabled via
> the phy-tunable control.
> 
> When enabling EDPD, it's also a good idea (for the ADIN PHYs) to enable TX
> periodic pulses, so that in case the other PHY is also on EDPD mode, there
> is no lock-up situation where both sides are waiting for the other to
> transmit.
> 
> Via the phy-tunable control, TX pulses can be disabled if specifying 0
> `tx-interval` via ethtool.
> 
> Signed-off-by: Alexandru Ardelean <alexandru.ardelean@analog.com>
> ---
>  drivers/net/phy/adin.c | 50 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 50 insertions(+)
> 
> diff --git a/drivers/net/phy/adin.c b/drivers/net/phy/adin.c
> index 4dec83df048d..742728ab2a5d 100644
> --- a/drivers/net/phy/adin.c
> +++ b/drivers/net/phy/adin.c
> @@ -26,6 +26,11 @@
>  
>  #define ADIN1300_RX_ERR_CNT			0x0014
>  
> +#define ADIN1300_PHY_CTRL_STATUS2		0x0015
> +#define   ADIN1300_NRG_PD_EN			BIT(3)
> +#define   ADIN1300_NRG_PD_TX_EN			BIT(2)
> +#define   ADIN1300_NRG_PD_STATUS		BIT(1)
> +
>  #define ADIN1300_PHY_CTRL2			0x0016
>  #define   ADIN1300_DOWNSPEED_AN_100_EN		BIT(11)
>  #define   ADIN1300_DOWNSPEED_AN_10_EN		BIT(10)
> @@ -328,12 +333,51 @@ static int adin_set_downshift(struct phy_device *phydev, u8 cnt)
>  			    ADIN1300_DOWNSPEEDS_EN);
>  }
>  
> +static int adin_get_edpd(struct phy_device *phydev, u16 *tx_interval)
> +{
> +	int val;
> +
> +	val = phy_read(phydev, ADIN1300_PHY_CTRL_STATUS2);
> +	if (val < 0)
> +		return val;
> +
> +	if (ADIN1300_NRG_PD_EN & val) {
> +		if (val & ADIN1300_NRG_PD_TX_EN)
> +			*tx_interval = 1;

What does 1 mean? 1 pico second, one hour? Anything but zero seconds?
Does the datasheet specify what it actually does? Maybe you should be
using ETHTOOL_PHY_EDPD_DFLT_TX_INTERVAL here, to indicate you actually
have no idea, but it is the default for this PHY?

> +		else
> +			*tx_interval = ETHTOOL_PHY_EDPD_NO_TX;
> +	} else {
> +		*tx_interval = ETHTOOL_PHY_EDPD_DISABLE;
> +	}
> +
> +	return 0;
> +}
> +
> +static int adin_set_edpd(struct phy_device *phydev, u16 tx_interval)
> +{
> +	u16 val;
> +
> +	if (tx_interval == ETHTOOL_PHY_EDPD_DISABLE)
> +		return phy_clear_bits(phydev, ADIN1300_PHY_CTRL_STATUS2,
> +				(ADIN1300_NRG_PD_EN | ADIN1300_NRG_PD_TX_EN));
> +
> +	val = ADIN1300_NRG_PD_EN;
> +	if (tx_interval != ETHTOOL_PHY_EDPD_NO_TX)
> +		val |= ADIN1300_NRG_PD_TX_EN;

So you silently accept any interval? That sounds wrong. You really
should be returning -EINVAL for any value other than, either 1, or
maybe ETHTOOL_PHY_EDPD_DFLT_TX_INTERVAL, if you change the get
function.

      Andrew
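
A rough user-space sketch of the stricter set path asked for above. It
takes the ETHTOOL_PHY_EDPD_DFLT_TX_INTERVAL variant of the suggestion
(rather than accepting 1), uses the constant values from the v2 patches,
and returns the register bits instead of writing the PHY. The function
name is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Values copied from the v2 patches under review. */
#define ETHTOOL_PHY_EDPD_DFLT_TX_INTERVAL	0x7fff
#define ETHTOOL_PHY_EDPD_NO_TX			0x8000
#define ETHTOOL_PHY_EDPD_DISABLE		0

#define ADIN1300_NRG_PD_EN	(1u << 3)
#define ADIN1300_NRG_PD_TX_EN	(1u << 2)

/*
 * Compute the NRG_PD bits for a requested tx_interval, or return
 * -EINVAL for any interval this model does not know how to honour.
 */
static int edpd_bits_for_interval(uint16_t tx_interval, uint16_t *bits)
{
	assert(bits);

	switch (tx_interval) {
	case ETHTOOL_PHY_EDPD_DISABLE:
		*bits = 0;
		return 0;
	case ETHTOOL_PHY_EDPD_NO_TX:
		*bits = ADIN1300_NRG_PD_EN;
		return 0;
	case ETHTOOL_PHY_EDPD_DFLT_TX_INTERVAL:
		*bits = ADIN1300_NRG_PD_EN | ADIN1300_NRG_PD_TX_EN;
		return 0;
	default:
		/* Arbitrary intervals are rejected instead of silently accepted. */
		return -EINVAL;
	}
}
```

The get path would then report ETHTOOL_PHY_EDPD_DFLT_TX_INTERVAL when the
TX enable bit is set, so that get and set stay symmetric.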


* Re: [PATCH v2 1/2] ethtool: implement Energy Detect Powerdown support via phy-tunable
From: Andrew Lunn @ 2019-09-04 19:53 UTC (permalink / raw)
  To: Alexandru Ardelean; +Cc: netdev, linux-kernel, f.fainelli, hkallweit1, davem
In-Reply-To: <20190904162322.17542-2-alexandru.ardelean@analog.com>

On Wed, Sep 04, 2019 at 07:23:21PM +0300, Alexandru Ardelean wrote:

Hi Alexandru

Somewhere we need a comment stating what EDPD means. Here would be a
good place.

> +#define ETHTOOL_PHY_EDPD_DFLT_TX_INTERVAL	0x7fff
> +#define ETHTOOL_PHY_EDPD_NO_TX			0x8000
> +#define ETHTOOL_PHY_EDPD_DISABLE		0

I think you are passing a u16. So why not 0xfffe and 0xffff?  We also
need to make it clear what the units are for interval. This file
specifies the contract between the kernel and user space. So we need
to clearly define what we mean here. Lots of comments are better than
no comments.

   Andrew
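
Something along these lines is the kind of commented definition meant
above. It is only a sketch: which of 0xfffe / 0xffff goes to which name is
an arbitrary choice here, the unit of the interval is deliberately left
open because this thread has not settled it, and the helper function is
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * EDPD: Energy Detect Power Down. The PHY powers down parts of itself
 * while no energy (no link partner) is detected on the wire.
 *
 * The tunable is a u16 TX interval:
 *   0          EDPD disabled
 *   1..0xfffd  explicit TX interval (unit still to be defined)
 *   0xfffe     EDPD enabled, no TX pulses (assumed assignment)
 *   0xffff     EDPD enabled, driver default interval (assumed assignment)
 */
#define ETHTOOL_PHY_EDPD_DISABLE		0
#define ETHTOOL_PHY_EDPD_NO_TX			0xfffe
#define ETHTOOL_PHY_EDPD_DFLT_TX_INTERVAL	0xffff

static_assert(ETHTOOL_PHY_EDPD_DFLT_TX_INTERVAL <= UINT16_MAX,
	      "special values must fit the u16 tunable");

/* True if v is a plain interval rather than one of the special values. */
static bool edpd_is_explicit_interval(uint16_t v)
{
	return v != ETHTOOL_PHY_EDPD_DISABLE &&
	       v != ETHTOOL_PHY_EDPD_NO_TX &&
	       v != ETHTOOL_PHY_EDPD_DFLT_TX_INTERVAL;
}
```

Putting the special values at the top of the range keeps every smaller
non-zero value available as a real interval.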


* [PATCH][next] net/mlx5: fix spelling mistake "offlaods" -> "offloads"
From: Colin King @ 2019-09-04 19:38 UTC (permalink / raw)
  To: Saeed Mahameed, Leon Romanovsky, David S . Miller, netdev,
	linux-rdma
  Cc: kernel-janitors, linux-kernel

From: Colin Ian King <colin.king@canonical.com>

There is a spelling mistake in a NL_SET_ERR_MSG_MOD error message.
Fix it.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/devlink.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/devlink.c b/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
index 7bf7b6fbc776..381925c90d94 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
@@ -133,7 +133,7 @@ static int mlx5_devlink_fs_mode_validate(struct devlink *devlink, u32 id,
 
 		else if (eswitch_mode == MLX5_ESWITCH_OFFLOADS) {
 			NL_SET_ERR_MSG_MOD(extack,
-					   "Software managed steering is not supported when eswitch offlaods enabled.");
+					   "Software managed steering is not supported when eswitch offloads enabled.");
 			err = -EOPNOTSUPP;
 		}
 	} else {
-- 
2.20.1



* [PATCH 4/5] netfilter: ctnetlink: honor IPS_OFFLOAD flag
From: Pablo Neira Ayuso @ 2019-09-04 19:36 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <20190904193646.23830-1-pablo@netfilter.org>

If this flag is set, timeout and state are irrelevant to userspace.

Fixes: 90964016e5d3 ("netfilter: nf_conntrack: add IPS_OFFLOAD status bit")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nf_conntrack_netlink.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index 6aa01eb6fe99..e2d13cd18875 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -553,10 +553,8 @@ ctnetlink_fill_info(struct sk_buff *skb, u32 portid, u32 seq, u32 type,
 		goto nla_put_failure;
 
 	if (ctnetlink_dump_status(skb, ct) < 0 ||
-	    ctnetlink_dump_timeout(skb, ct) < 0 ||
 	    ctnetlink_dump_acct(skb, ct, type) < 0 ||
 	    ctnetlink_dump_timestamp(skb, ct) < 0 ||
-	    ctnetlink_dump_protoinfo(skb, ct) < 0 ||
 	    ctnetlink_dump_helpinfo(skb, ct) < 0 ||
 	    ctnetlink_dump_mark(skb, ct) < 0 ||
 	    ctnetlink_dump_secctx(skb, ct) < 0 ||
@@ -568,6 +566,11 @@ ctnetlink_fill_info(struct sk_buff *skb, u32 portid, u32 seq, u32 type,
 	    ctnetlink_dump_ct_synproxy(skb, ct) < 0)
 		goto nla_put_failure;
 
+	if (!test_bit(IPS_OFFLOAD_BIT, &ct->status) &&
+	    (ctnetlink_dump_timeout(skb, ct) < 0 ||
+	     ctnetlink_dump_protoinfo(skb, ct) < 0))
+		goto nla_put_failure;
+
 	nlmsg_end(skb, nlh);
 	return skb->len;
 
-- 
2.11.0



* [PATCH 2/5] netfilter: nft_socket: fix erroneous socket assignment
From: Pablo Neira Ayuso @ 2019-09-04 19:36 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <20190904193646.23830-1-pablo@netfilter.org>

From: Fernando Fernandez Mancera <ffmancera@riseup.net>

The socket assignment is wrong, see skb_orphan():
When skb->destructor callback is not set, but skb->sk is set, this hits BUG().

Link: https://bugzilla.redhat.com/show_bug.cgi?id=1651813
Fixes: 554ced0a6e29 ("netfilter: nf_tables: add support for native socket matching")
Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nft_socket.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/netfilter/nft_socket.c b/net/netfilter/nft_socket.c
index d7f3776dfd71..637ce3e8c575 100644
--- a/net/netfilter/nft_socket.c
+++ b/net/netfilter/nft_socket.c
@@ -47,9 +47,6 @@ static void nft_socket_eval(const struct nft_expr *expr,
 		return;
 	}
 
-	/* So that subsequent socket matching not to require other lookups. */
-	skb->sk = sk;
-
 	switch(priv->key) {
 	case NFT_SOCKET_TRANSPARENT:
 		nft_reg_store8(dest, inet_sk_transparent(sk));
@@ -66,6 +63,9 @@ static void nft_socket_eval(const struct nft_expr *expr,
 		WARN_ON(1);
 		regs->verdict.code = NFT_BREAK;
 	}
+
+	if (sk != skb->sk)
+		sock_gen_put(sk);
 }
 
 static const struct nla_policy nft_socket_policy[NFTA_SOCKET_MAX + 1] = {
-- 
2.11.0



* [PATCH 5/5] netfilter: nf_flow_table: set default timeout after successful insertion
From: Pablo Neira Ayuso @ 2019-09-04 19:36 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <20190904193646.23830-1-pablo@netfilter.org>

Set up the default timeout for this new entry otherwise the garbage
collector might quickly remove it right after the flowtable insertion.

Fixes: ac2a66665e23 ("netfilter: add generic flow table infrastructure")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nf_flow_table_core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netfilter/nf_flow_table_core.c b/net/netfilter/nf_flow_table_core.c
index 80a8f9ae4c93..a0b4bf654de2 100644
--- a/net/netfilter/nf_flow_table_core.c
+++ b/net/netfilter/nf_flow_table_core.c
@@ -217,7 +217,7 @@ int flow_offload_add(struct nf_flowtable *flow_table, struct flow_offload *flow)
 		return err;
 	}
 
-	flow->timeout = (u32)jiffies;
+	flow->timeout = (u32)jiffies + NF_FLOW_TIMEOUT;
 	return 0;
 }
 EXPORT_SYMBOL_GPL(flow_offload_add);
-- 
2.11.0
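
To illustrate why the old assignment lets the garbage collector reap the
entry at once, here is a user-space model of a wrap-safe deadline check on
32-bit tick counters. The tick rate, the timeout value and all names are
hypothetical stand-ins, and the expiry test is assumed to be the usual
signed-difference comparison rather than copied from the flowtable code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FAKE_HZ			1000u		/* hypothetical tick rate */
#define FAKE_FLOW_TIMEOUT	(30u * FAKE_HZ)	/* stand-in for NF_FLOW_TIMEOUT */

static_assert(FAKE_FLOW_TIMEOUT < INT32_MAX,
	      "the timeout must fit the signed delta used below");

/*
 * Wrap-safe "has this deadline passed?" test. The cast of the unsigned
 * difference to int32_t relies on two's complement conversion, as
 * provided by gcc and clang.
 */
static bool flow_has_expired(uint32_t timeout, uint32_t now)
{
	return (int32_t)(timeout - now) <= 0;
}

/* Before the fix: the deadline is "now", so the entry is already expired. */
static uint32_t timeout_before_fix(uint32_t now)
{
	return now;
}

/* After the fix: the deadline lies one full timeout in the future. */
static uint32_t timeout_after_fix(uint32_t now)
{
	return now + FAKE_FLOW_TIMEOUT;
}
```

With the old value the very first garbage collection pass sees an expired
entry; with the new one the entry survives until the timeout has elapsed,
including across a counter wrap.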



* [PATCH 3/5] netfilter: nft_fib_netdev: Terminate rule eval if protocol=IPv6 and ipv6 module is disabled
From: Pablo Neira Ayuso @ 2019-09-04 19:36 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <20190904193646.23830-1-pablo@netfilter.org>

From: Leonardo Bras <leonardo@linux.ibm.com>

If IPv6 is disabled on boot (ipv6.disable=1), but nft_fib_inet ends up
dealing with an IPv6 packet, it causes a kernel panic in
fib6_node_lookup_1(), crashing in bad_page_fault.

The panic is caused by trying to dereference a very low address (0x38
on ppc64le), due to ipv6.fib6_main_tbl = NULL.
BUG: Kernel NULL pointer dereference at 0x00000038

The kernel panic was reproduced on a host that disabled IPv6 on boot and
has to process guest packets (coming from a bridge) using its ip6tables.

Terminate rule evaluation when packet protocol is IPv6 but the ipv6 module
is not loaded.

Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nft_fib_netdev.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/netfilter/nft_fib_netdev.c b/net/netfilter/nft_fib_netdev.c
index 2cf3f32fe6d2..a2e726ae7f07 100644
--- a/net/netfilter/nft_fib_netdev.c
+++ b/net/netfilter/nft_fib_netdev.c
@@ -14,6 +14,7 @@
 #include <linux/netfilter/nf_tables.h>
 #include <net/netfilter/nf_tables_core.h>
 #include <net/netfilter/nf_tables.h>
+#include <net/ipv6.h>

 #include <net/netfilter/nft_fib.h>

@@ -34,6 +35,8 @@ static void nft_fib_netdev_eval(const struct nft_expr *expr,
 		}
 		break;
 	case ETH_P_IPV6:
+		if (!ipv6_mod_enabled())
+			break;
 		switch (priv->result) {
 		case NFT_FIB_RESULT_OIF:
 		case NFT_FIB_RESULT_OIFNAME:
-- 
2.11.0


* [PATCH 1/5] netfilter: bridge: Drops IPv6 packets if IPv6 module is not loaded
From: Pablo Neira Ayuso @ 2019-09-04 19:36 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <20190904193646.23830-1-pablo@netfilter.org>

From: Leonardo Bras <leonardo@linux.ibm.com>

A kernel panic can happen if a host has disabled IPv6 on boot and has to
process guest packets (coming from a bridge) using its ip6tables.

IPv6 packets need to be dropped if the IPv6 module is not loaded, and the
host ip6tables will be used.

Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/bridge/br_netfilter_hooks.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/net/bridge/br_netfilter_hooks.c b/net/bridge/br_netfilter_hooks.c
index d3f9592f4ff8..af7800103e51 100644
--- a/net/bridge/br_netfilter_hooks.c
+++ b/net/bridge/br_netfilter_hooks.c
@@ -496,6 +496,10 @@ static unsigned int br_nf_pre_routing(void *priv,
 		if (!brnet->call_ip6tables &&
 		    !br_opt_get(br, BROPT_NF_CALL_IP6TABLES))
 			return NF_ACCEPT;
+		if (!ipv6_mod_enabled()) {
+			pr_warn_once("Module ipv6 is disabled, so call_ip6tables is not supported.");
+			return NF_DROP;
+		}
 
 		nf_bridge_pull_encap_header_rcsum(skb);
 		return br_nf_pre_routing_ipv6(priv, skb, state);
-- 
2.11.0



* [PATCH 0/5] Netfilter fixes for net
From: Pablo Neira Ayuso @ 2019-09-04 19:36 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev

Hi,

The following patchset contains Netfilter fixes for net:

1) br_netfilter drops IPv6 packets if ipv6 is disabled, from Leonardo Bras.

2) nft_socket hits BUG() due to illegal skb->sk caching, patch from
   Fernando Fernandez Mancera.

3) nft_fib_netdev could be called with ipv6 disabled, leading to crash
   in the fib lookup, also from Leonardo.

4) ctnetlink honors IPS_OFFLOAD flag, just like nf_conntrack sysctl does.

5) Properly set up flowtable entry timeout, otherwise immediate
   removal by garbage collector might occur.

You can pull these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git

Thanks.

----------------------------------------------------------------

The following changes since commit e33b4325e60e146c2317a8b548cbd633239ff83b:

  net: stmmac: dwmac-sun8i: Variable "val" in function sun8i_dwmac_set_syscon() could be uninitialized (2019-09-02 11:48:15 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git HEAD

for you to fetch changes up to 110e48725db6262f260f10727d0fb2d3d25895e4:

  netfilter: nf_flow_table: set default timeout after successful insertion (2019-09-03 22:55:42 +0200)

----------------------------------------------------------------
Fernando Fernandez Mancera (1):
      netfilter: nft_socket: fix erroneous socket assignment

Leonardo Bras (2):
      netfilter: bridge: Drops IPv6 packets if IPv6 module is not loaded
      netfilter: nft_fib_netdev: Terminate rule eval if protocol=IPv6 and ipv6 module is disabled

Pablo Neira Ayuso (2):
      netfilter: ctnetlink: honor IPS_OFFLOAD flag
      netfilter: nf_flow_table: set default timeout after successful insertion

 net/bridge/br_netfilter_hooks.c      | 4 ++++
 net/netfilter/nf_conntrack_netlink.c | 7 +++++--
 net/netfilter/nf_flow_table_core.c   | 2 +-
 net/netfilter/nft_fib_netdev.c       | 3 +++
 net/netfilter/nft_socket.c           | 6 +++---
 5 files changed, 16 insertions(+), 6 deletions(-)


* [PATCH][next] net/mlx5: fix missing assignment of variable err
From: Colin King @ 2019-09-04 19:29 UTC (permalink / raw)
  To: Maor Gottlieb, Saeed Mahameed, Leon Romanovsky, David S . Miller,
	netdev, linux-rdma
  Cc: kernel-janitors, linux-kernel

From: Colin Ian King <colin.king@canonical.com>

The error return from a call to mlx5_flow_namespace_set_peer is not
being assigned to variable err and hence the error check following
the call is currently not working.  Fix this by assigning the return
value to err as intended.

Addresses-Coverity: ("Logically dead code")
Fixes: 8463daf17e80 ("net/mlx5: Add support to use SMFS in switchdev mode")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index afa623b15a38..00d71db15f22 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -1651,7 +1651,7 @@ static int mlx5_esw_offloads_set_ns_peer(struct mlx5_eswitch *esw,
 		if (err)
 			return err;
 
-		mlx5_flow_namespace_set_peer(peer_ns, ns);
+		err = mlx5_flow_namespace_set_peer(peer_ns, ns);
 		if (err) {
 			mlx5_flow_namespace_set_peer(ns, NULL);
 			return err;
-- 
2.20.1
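
For anyone wondering why Coverity calls this "Logically dead code", the
pattern can be reproduced in a few lines of user-space C. The stub and the
two wrapper functions below are hypothetical; only the control flow mirrors
the hunk above.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stub standing in for mlx5_flow_namespace_set_peer(). */
static int fake_set_peer(int should_fail)
{
	return should_fail ? -EOPNOTSUPP : 0;
}

/* Before the fix: the second return value is dropped. */
static int pair_before_fix(int second_call_fails)
{
	int err;

	err = fake_set_peer(0);
	if (err)
		return err;

	fake_set_peer(second_call_fails);	/* result discarded */
	assert(err == 0);	/* err is unchanged, so the check below is dead */
	if (err)
		return err;
	return 0;
}

/* After the fix: the second return value reaches the error check. */
static int pair_after_fix(int second_call_fails)
{
	int err;

	err = fake_set_peer(0);
	if (err)
		return err;

	err = fake_set_peer(second_call_fails);
	if (err)
		return err;
	return 0;
}
```

In the model, a failure of the second call is silently reported as success
before the fix and propagated to the caller after it.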



* Re: Is bug 200755 in anyone's queue??
From: Eric Dumazet @ 2019-09-04 19:26 UTC (permalink / raw)
  To: Willem de Bruijn, Steve Zabele
  Cc: Eric Dumazet, Mark KEATON, Network Development, shum@canndrew.org,
	vladimir116@gmail.com, saifi.khan@strikr.in, Daniel Borkmann,
	on2k16nm@gmail.com, Stephen Hemminger
In-Reply-To: <CA+FuTSf24VrjOxS9Kg3+DFEYn7ihe6vMj5o7rggOz_6KH_rNpQ@mail.gmail.com>



On 9/4/19 5:46 PM, Willem de Bruijn wrote:
> On Wed, Sep 4, 2019 at 10:51 AM Steve Zabele <zabele@comcast.net> wrote:
>>
>> I think a dual table approach makes a lot of sense here, especially if we look at the different use cases. For the DNS server example, almost certainly there will not be any connected sockets using the server port, so a test of whether the connected table is empty (maybe a boolean stored with the unconnected table?)


UDP hash tables are shared among netns, and the hash function depends on a netns salt (net_hash_mix()).

(see udp_hashfn() definition)

So a boolean would be polluted on a slot having both a non-connected socket on netns A and a connected socket on netns B.


>> should get to the existing code very quickly and not require accessing the memory holding the connected table. For our use case, the connected sockets persist for long periods (at network timescales at least) and so any rehashing should be infrequent and so have limited impact on performance overall.
>>
>> So does a dual table approach seem workable to other folks that know the internals?
> 
> Let me take a stab and compare. A dual table does bring it more in
> line with how the TCP code is structured.
> 
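
To put numbers on the slot sharing described above, here is a tiny
user-space model. The function has the same shape as udp_hashfn() as I
read it (salt added to the port, then masked); the salts, the table size
and the ports used with it are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the slot computation: per-netns salt added before masking. */
static uint32_t model_udp_hashfn(uint32_t netns_salt, uint32_t port,
				 uint32_t mask)
{
	/* mask must be (power of two) - 1, as for a table of that size */
	assert(((mask + 1) & mask) == 0);
	return (port + netns_salt) & mask;
}
```

With a 128-slot table (mask 127), salt 0 for netns A and salt 5 for netns
B, port 53 in A and port 48 in B both land in slot 53. A per-slot
"has connected sockets" boolean set by B's connected socket would then
also be seen by lookups for A's unconnected DNS socket.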


* Re: [PATCH net 03/11] bonding: split IFF_BONDING into IFF_BONDING and IFF_BONDING_SLAVE
From: Jay Vosburgh @ 2019-09-04 19:22 UTC (permalink / raw)
  To: Taehee Yoo
  Cc: davem, netdev, vfalico, andy, jiri, sd, roopa, saeedm, manishc,
	rahulv, kys, haiyangz, sthemmin, sashal, hare, varun, ubraun,
	kgraul
In-Reply-To: <20190904183927.14754-1-ap420073@gmail.com>

Taehee Yoo <ap420073@gmail.com> wrote:

>The IFF_BONDING means bonding master or bonding slave device.
>
>->ndo_add_slave() sets IFF_BONDING flag and ->ndo_del_slave() removes
>IFF_BONDING flag.
>This routine causes a problem when bonding devices are nested.
>
>bond0<--bond1
>
>Both bond0 and bond1 are bonding devices and they should keep the
>IFF_BONDING flag until they are removed.
>But bond1 would lose IFF_BONDING at ->ndo_del_slave() because that routine
>cannot check whether the slave device is itself a bonding device.
>So this patch splits IFF_BONDING into IFF_BONDING and
>IFF_BONDING_SLAVE. IFF_BONDING is the bonding master flag and
>IFF_BONDING_SLAVE is the bonding slave flag.
>
>Test commands:
>    ip link add bond0 type bond
>    ip link add bond1 type bond
>    ip link set bond1 master bond0
>    ip link set bond1 nomaster
>    ip link del bond1 type bond
>    ip link add bond1 type bond
>
>Splat looks like:
>[  149.201107] proc_dir_entry 'bonding/bond1' already registered
>[  149.208013] WARNING: CPU: 1 PID: 1308 at fs/proc/generic.c:361 proc_register+0x2a9/0x3e0
>[  149.208866] Modules linked in: bonding veth openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv4 ip_tables6
>[  149.208866] CPU: 1 PID: 1308 Comm: ip Not tainted 5.3.0-rc7+ #322
>[  149.208866] RIP: 0010:proc_register+0x2a9/0x3e0
>[  149.208866] Code: 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 39 01 00 00 48 8b 04 24 48 89 ea 48 c7 c7 a0 a0 13 89 48 8b b0 0
>[  149.208866] RSP: 0018:ffff88810df9f098 EFLAGS: 00010286
>[  149.208866] RAX: dffffc0000000008 RBX: ffff8880b5d3aa50 RCX: ffffffff87cdec92
>[  149.208866] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffff888116bf6a8c
>[  149.208866] RBP: ffff8880b5d3acd3 R08: ffffed1022d7ff71 R09: ffffed1022d7ff71
>[  149.208866] R10: 0000000000000001 R11: ffffed1022d7ff70 R12: ffff8880b5d3abe8
>[  149.208866] R13: ffff8880b5d3acd2 R14: dffffc0000000000 R15: ffffed1016ba759a
>[  149.208866] FS:  00007f4bd1f650c0(0000) GS:ffff888116a00000(0000) knlGS:0000000000000000
>[  149.208866] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>[  149.208866] CR2: 000055e7ca686118 CR3: 0000000106fd4000 CR4: 00000000001006e0
>[  149.208866] Call Trace:
>[  149.208866]  proc_create_seq_private+0xb3/0xf0
>[  149.208866]  bond_create_proc_entry+0x1b3/0x3f0 [bonding]
>[  149.208866]  bond_netdev_event+0x433/0x970 [bonding]
>[  149.208866]  ? __module_text_address+0x13/0x140
>[  149.208866]  notifier_call_chain+0x90/0x160
>[  149.208866]  register_netdevice+0x9b3/0xd70
>[  149.208866]  ? alloc_netdev_mqs+0x854/0xc10
>[  149.208866]  ? netdev_change_features+0xa0/0xa0
>[  149.208866]  ? rtnl_create_link+0x2ed/0xad0
>[  149.208866]  bond_newlink+0x2a/0x60 [bonding]
>[  149.208866]  __rtnl_newlink+0xb75/0x1180
>[  ... ]
>
>Fixes: 0b680e753724 ("[PATCH] bonding: Add priv_flag to avoid event mishandling")

	I'm not sure this Fixes is technically correct, as I don't think
nesting bonds has induced an oops since 2006.  I don't think nesting
bonds really does anything useful, but it's been allowed for years (but
has been broken on and off all that time) so I'm a bit leery of simply
disallowing nesting of bonds for fear it would break something already
in use.

	In any event, it would be desirable if this fix could be changed
to not need a new priv_flag, as this patch would consume the last free
bit in netdev_priv_flags.  A bond master device that is also a slave
should have IFF_MASTER set in dev->flags, which could be tested at
removal time to avoid clearing IFF_BONDING.
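
	A minimal user-space sketch of the alternative suggested above: keep a
single IFF_BONDING bit and skip clearing it when the released slave is
itself a master. The struct, the function name and the flag values are
hypothetical stand-ins for illustration, not the bonding driver's code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel flag bits; values are illustrative. */
#define IFF_MASTER  0x400u	/* models a bit in dev->flags */
#define IFF_BONDING 0x004u	/* models a bit in dev->priv_flags */

/* Hypothetical model of the two flag words of a net_device. */
struct model_dev {
	unsigned int flags;
	unsigned int priv_flags;
};

/* Model of the release path: strip IFF_BONDING only from a slave that
 * is not itself a bond master, so a nested bond keeps its flag.
 */
static void model_release_slave(struct model_dev *slave)
{
	assert(slave != NULL);
	if (!(slave->flags & IFF_MASTER))
		slave->priv_flags &= ~IFF_BONDING;
}
```

	With this shape a plain NIC loses IFF_BONDING on release, while a
nested bond such as bond1 keeps it, and no extra priv_flag bit is used.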

	-J

>Signed-off-by: Taehee Yoo <ap420073@gmail.com>
>---
> drivers/net/bonding/bond_main.c                     | 13 +++++--------
> .../net/ethernet/qlogic/netxen/netxen_nic_main.c    |  2 +-
> drivers/net/hyperv/netvsc_drv.c                     |  3 +--
> drivers/scsi/fcoe/fcoe.c                            |  2 +-
> drivers/target/iscsi/cxgbit/cxgbit_cm.c             |  2 +-
> include/linux/netdevice.h                           |  9 ++++++---
> 6 files changed, 15 insertions(+), 16 deletions(-)
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index 931d9d935686..abd008c31c9a 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -1560,7 +1560,7 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev,
> 		goto err_restore_mac;
> 	}
> 
>-	slave_dev->priv_flags |= IFF_BONDING;
>+	slave_dev->priv_flags |= IFF_BONDING_SLAVE;
> 	/* initialize slave stats */
> 	dev_get_stats(new_slave->dev, &new_slave->slave_stats);
> 
>@@ -1816,7 +1816,7 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev,
> 	slave_disable_netpoll(new_slave);
> 
> err_close:
>-	slave_dev->priv_flags &= ~IFF_BONDING;
>+	slave_dev->priv_flags &= ~IFF_BONDING_SLAVE;
> 	dev_close(slave_dev);
> 
> err_restore_mac:
>@@ -2017,7 +2017,7 @@ static int __bond_release_one(struct net_device *bond_dev,
> 	else
> 		dev_set_mtu(slave_dev, slave->original_mtu);
> 
>-	slave_dev->priv_flags &= ~IFF_BONDING;
>+	slave_dev->priv_flags &= ~IFF_BONDING_SLAVE;
> 
> 	bond_free_slave(slave);
> 
>@@ -3221,10 +3221,7 @@ static int bond_netdev_event(struct notifier_block *this,
> 	netdev_dbg(event_dev, "%s received %s\n",
> 		   __func__, netdev_cmd_to_name(event));
> 
>-	if (!(event_dev->priv_flags & IFF_BONDING))
>-		return NOTIFY_DONE;
>-
>-	if (event_dev->flags & IFF_MASTER) {
>+	if (netif_is_bond_master(event_dev)) {
> 		int ret;
> 
> 		ret = bond_master_netdev_event(event, event_dev);
>@@ -3232,7 +3229,7 @@ static int bond_netdev_event(struct notifier_block *this,
> 			return ret;
> 	}
> 
>-	if (event_dev->flags & IFF_SLAVE)
>+	if (netif_is_bond_slave(event_dev))
> 		return bond_slave_netdev_event(event, event_dev);
> 
> 	return NOTIFY_DONE;
>diff --git a/drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c b/drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c
>index 58e2eaf77014..5e0389ba1f13 100644
>--- a/drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c
>+++ b/drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c
>@@ -3340,7 +3340,7 @@ static void netxen_config_master(struct net_device *dev, unsigned long event)
> 	 * released and is dev_close()ed in bond_release()
> 	 * just before IFF_BONDING is stripped.
> 	 */
>-	if (!master && dev->priv_flags & IFF_BONDING)
>+	if (!master && netif_is_bond_slave(dev))
> 		netxen_free_ip_list(adapter, true);
> }
> 
>diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
>index e8fce6d715ef..6831202d9bcb 100644
>--- a/drivers/net/hyperv/netvsc_drv.c
>+++ b/drivers/net/hyperv/netvsc_drv.c
>@@ -2439,8 +2439,7 @@ static int netvsc_netdev_event(struct notifier_block *this,
> 		return NOTIFY_DONE;
> 
> 	/* Avoid Bonding master dev with same MAC registering as VF */
>-	if ((event_dev->priv_flags & IFF_BONDING) &&
>-	    (event_dev->flags & IFF_MASTER))
>+	if (netif_is_bond_master(event_dev))
> 		return NOTIFY_DONE;
> 
> 	switch (event) {
>diff --git a/drivers/scsi/fcoe/fcoe.c b/drivers/scsi/fcoe/fcoe.c
>index 00dd47bcbb1e..750a6540eb9d 100644
>--- a/drivers/scsi/fcoe/fcoe.c
>+++ b/drivers/scsi/fcoe/fcoe.c
>@@ -307,7 +307,7 @@ static int fcoe_interface_setup(struct fcoe_interface *fcoe,
> 	}
> 
> 	/* Do not support for bonding device */
>-	if (netdev->priv_flags & IFF_BONDING && netdev->flags & IFF_MASTER) {
>+	if (netif_is_bond_master(netdev)) {
> 		FCOE_NETDEV_DBG(netdev, "Bonded interfaces not supported\n");
> 		return -EOPNOTSUPP;
> 	}
>diff --git a/drivers/target/iscsi/cxgbit/cxgbit_cm.c b/drivers/target/iscsi/cxgbit/cxgbit_cm.c
>index c70caf4ea490..16c8cae333b2 100644
>--- a/drivers/target/iscsi/cxgbit/cxgbit_cm.c
>+++ b/drivers/target/iscsi/cxgbit/cxgbit_cm.c
>@@ -247,7 +247,7 @@ struct cxgbit_device *cxgbit_find_device(struct net_device *ndev, u8 *port_id)
> 
> static struct net_device *cxgbit_get_real_dev(struct net_device *ndev)
> {
>-	if (ndev->priv_flags & IFF_BONDING) {
>+	if (netif_is_bond_master(ndev) || netif_is_bond_slave(ndev)) {
> 		pr_err("Bond devices are not supported. Interface:%s\n",
> 		       ndev->name);
> 		return NULL;
>diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>index 5bb5756129af..a2c47f43e54b 100644
>--- a/include/linux/netdevice.h
>+++ b/include/linux/netdevice.h
>@@ -1441,7 +1441,7 @@ struct net_device_ops {
>  *
>  * @IFF_802_1Q_VLAN: 802.1Q VLAN device
>  * @IFF_EBRIDGE: Ethernet bridging device
>- * @IFF_BONDING: bonding master or slave
>+ * @IFF_BONDING: bonding master
>  * @IFF_ISATAP: ISATAP interface (RFC4214)
>  * @IFF_WAN_HDLC: WAN HDLC device
>  * @IFF_XMIT_DST_RELEASE: dev_hard_start_xmit() is allowed to
>@@ -1474,6 +1474,7 @@ struct net_device_ops {
>  * @IFF_FAILOVER_SLAVE: device is lower dev of a failover master device
>  * @IFF_L3MDEV_RX_HANDLER: only invoke the rx handler of L3 master device
>  * @IFF_LIVE_RENAME_OK: rename is allowed while device is up and running
>+ * @IFF_BONDING_SLAVE: bonding slave
>  */
> enum netdev_priv_flags {
> 	IFF_802_1Q_VLAN			= 1<<0,
>@@ -1507,6 +1508,7 @@ enum netdev_priv_flags {
> 	IFF_FAILOVER_SLAVE		= 1<<28,
> 	IFF_L3MDEV_RX_HANDLER		= 1<<29,
> 	IFF_LIVE_RENAME_OK		= 1<<30,
>+	IFF_BONDING_SLAVE		= 1<<31,
> };
> 
> #define IFF_802_1Q_VLAN			IFF_802_1Q_VLAN
>@@ -1539,6 +1541,7 @@ enum netdev_priv_flags {
> #define IFF_FAILOVER_SLAVE		IFF_FAILOVER_SLAVE
> #define IFF_L3MDEV_RX_HANDLER		IFF_L3MDEV_RX_HANDLER
> #define IFF_LIVE_RENAME_OK		IFF_LIVE_RENAME_OK
>+#define IFF_BONDING_SLAVE		IFF_BONDING_SLAVE
> 
> /**
>  *	struct net_device - The DEVICE structure.
>@@ -4569,12 +4572,12 @@ static inline bool netif_is_macvlan_port(const struct net_device *dev)
> 
> static inline bool netif_is_bond_master(const struct net_device *dev)
> {
>-	return dev->flags & IFF_MASTER && dev->priv_flags & IFF_BONDING;
>+	return dev->priv_flags & IFF_BONDING;
> }
> 
> static inline bool netif_is_bond_slave(const struct net_device *dev)
> {
>-	return dev->flags & IFF_SLAVE && dev->priv_flags & IFF_BONDING;
>+	return dev->priv_flags & IFF_BONDING_SLAVE;
> }
> 
> static inline bool netif_supports_nofcs(struct net_device *dev)
>-- 
>2.17.1
>

---
	-Jay Vosburgh, jay.vosburgh@canonical.com

* Re: [PATCH iproute2] nexthop: Add space after blackhole
From: Stephen Hemminger @ 2019-09-04 19:06 UTC (permalink / raw)
  To: David Ahern; +Cc: netdev, David Ahern
In-Reply-To: <20190904150952.17274-1-dsahern@kernel.org>

On Wed,  4 Sep 2019 08:09:52 -0700
David Ahern <dsahern@kernel.org> wrote:

> From: David Ahern <dsahern@gmail.com>
> 
> The space after 'blackhole' is missing; add it to properly separate the
> protocol when it is given.
> 
> Signed-off-by: David Ahern <dsahern@gmail.com>

Applied with Fixes tag

* Re: [PATCH iproute2] devlink: fix segfault on health command
From: Stephen Hemminger @ 2019-09-04 19:03 UTC (permalink / raw)
  To: Andrea Claudi; +Cc: netdev, dsahern
In-Reply-To: <1b981424e70678675af12bc391fbff02625640b8.1567617745.git.aclaudi@redhat.com>

On Wed,  4 Sep 2019 19:26:14 +0200
Andrea Claudi <aclaudi@redhat.com> wrote:

> devlink segfaults when using grace_period without reporter
> 
> $ devlink health set pci/0000:00:09.0 grace_period 3500
> Segmentation fault
> 
> devlink is instead supposed to gracefully fail printing a warning
> message
> 
> $ devlink health set pci/0000:00:09.0 grace_period 3500
> Reporter's name is expected.
> 
> This happens because DL_OPT_HEALTH_REPORTER_NAME and
> DL_OPT_HEALTH_REPORTER_GRACEFUL_PERIOD are both defined as BIT(27).
> When dl_opts_put() parses options and grace_period is set, it erroneously
> tries to set reporter name to null.
> 
> This is fixed by simply shifting the enumeration by 1 bit, starting with
> DL_OPT_HEALTH_REPORTER_GRACEFUL_PERIOD.
> 
> Fixes: b18d89195b16 ("devlink: Add devlink health set command")
> Signed-off-by: Andrea Claudi <aclaudi@redhat.com>

Thanks, applied
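
The bit collision described in the quoted commit message can be modeled
in a few lines. This is only a sketch: the OPT_* names, the struct and
model_opts_put() are hypothetical and do not mirror devlink's real option
table or its dl_opts_put().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BIT(n) (1ULL << (n))

/* Hypothetical option bits modeling the collision. */
#define OPT_REPORTER_NAME BIT(27)
#define OPT_GRACE_BUGGY   BIT(27)	/* same bit as the name: the bug */
#define OPT_GRACE_FIXED   BIT(28)	/* shifted by one bit: the fix */

struct model_opts {
	uint64_t present;		/* bitmask of options the user gave */
	const char *reporter_name;	/* NULL when no name was parsed */
	uint64_t grace_period;
};

/* Returns -1 when the name bit looks set although no name was parsed,
 * which is the case that would dereference a NULL string; 0 otherwise.
 */
static int model_opts_put(const struct model_opts *o)
{
	assert(o != NULL);
	if ((o->present & OPT_REPORTER_NAME) && o->reporter_name == NULL)
		return -1;
	return 0;
}
```

With OPT_GRACE_BUGGY, setting only the grace period also appears to set
the reporter name; with OPT_GRACE_FIXED the two options no longer alias.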

* Re: [PATCH net 00/11] net: fix nested device bugs
From: Stephen Hemminger @ 2019-09-04 18:58 UTC (permalink / raw)
  To: Taehee Yoo
  Cc: davem, netdev, j.vosburgh, vfalico, andy, jiri, sd, roopa, saeedm,
	manishc, rahulv, kys, haiyangz, sthemmin, sashal, hare, varun,
	ubraun, kgraul
In-Reply-To: <20190904183828.14260-1-ap420073@gmail.com>

On Thu,  5 Sep 2019 03:38:28 +0900
Taehee Yoo <ap420073@gmail.com> wrote:

> This patchset fixes several bugs related to the nested device
> infrastructure.
> The current nesting infrastructure code doesn't limit the depth level of
> devices. Nested devices can be handled recursively; at that point,
> a huge amount of memory is needed and a stack overflow could occur.
> The device types below have the same bug:
> VLAN, BONDING, TEAM, MACSEC, MACVLAN and VXLAN.
> 
> Test commands:
>     ip link add dummy0 type dummy
>     ip link add vlan1 link dummy0 type vlan id 1
> 
>     for i in {2..100}
>     do
> 	    let A=$i-1
> 	    ip link add name vlan$i link vlan$A type vlan id $i
>     done
>     ip link del dummy0
> 
> 1st patch actually fixes the root cause.
> It adds new common variables {upper/lower}_level that represent
> depth level. upper_level variable is depth of upper devices.
> lower_level variable is depth of lower devices.
> 
>       [U][L]       [U][L]
> vlan1  1  5  vlan4  1  4
> vlan2  2  4  vlan5  2  3
> vlan3  3  3    |
>   |            |
>   +------------+
>   |
> vlan6  4  2
> dummy0 5  1
> 
> After this patch, the nesting infrastructure code uses these variables to
> check the depth level.
> 
> Patches 2, 4, 5, 6 and 7 fix lockdep-related problems.
> Before these patches, devices use a static lockdep map.
> So, if devices of the same type are nested, lockdep will warn about
> a recursive locking situation.
> These patches make these devices use a dynamic lockdep key instead of
> a static lock key or subclass.
> 
> 3rd patch splits IFF_BONDING flag into IFF_BONDING and IFF_BONDING_SLAVE.
> Before this patch, there is only the IFF_BONDING flag, which means
> a bonding master or a bonding slave device.
> But this single flag can be a problem when bonding devices are
> nested.
> 
> 8th patch fixes a refcnt leak in the macsec module.
> 
> 9th patch adds ignore flag to an adjacent structure.
> In order to exchange an adjacent node safely, ignore flag is needed.
> 
> 10th patch makes vxlan add an adjacent link to limit depth level.
> 
> 11th patch removes unnecessary variables and callback.
> 
> Taehee Yoo (11):
>   net: core: limit nested device depth
>   vlan: use dynamic lockdep key instead of subclass
>   bonding: split IFF_BONDING into IFF_BONDING and IFF_BONDING_SLAVE
>   bonding: use dynamic lockdep key instead of subclass
>   team: use dynamic lockdep key instead of static key
>   macsec: use dynamic lockdep key instead of subclass
>   macvlan: use dynamic lockdep key instead of subclass
>   macsec: fix refcnt leak in module exit routine
>   net: core: add ignore flag to netdev_adjacent structure
>   vxlan: add adjacent link to limit depth level
>   net: remove unnecessary variables and callback
> 
>  drivers/net/bonding/bond_alb.c                |   2 +-
>  drivers/net/bonding/bond_main.c               |  87 ++++--
>  .../net/ethernet/mellanox/mlx5/core/en_tc.c   |   2 +-
>  .../ethernet/qlogic/netxen/netxen_nic_main.c  |   2 +-
>  drivers/net/hyperv/netvsc_drv.c               |   3 +-
>  drivers/net/macsec.c                          |  50 ++--
>  drivers/net/macvlan.c                         |  36 ++-
>  drivers/net/team/team.c                       |  61 ++++-
>  drivers/net/vxlan.c                           |  71 ++++-
>  drivers/scsi/fcoe/fcoe.c                      |   2 +-
>  drivers/target/iscsi/cxgbit/cxgbit_cm.c       |   2 +-
>  include/linux/if_macvlan.h                    |   3 +-
>  include/linux/if_team.h                       |   5 +
>  include/linux/if_vlan.h                       |  13 +-
>  include/linux/netdevice.h                     |  29 +-
>  include/net/bonding.h                         |   4 +-
>  include/net/vxlan.h                           |   1 +
>  net/8021q/vlan.c                              |   1 -
>  net/8021q/vlan_dev.c                          |  32 +--
>  net/core/dev.c                                | 252 ++++++++++++++++--
>  net/core/dev_addr_lists.c                     |  12 +-
>  net/smc/smc_core.c                            |   2 +-
>  net/smc/smc_pnet.c                            |   2 +-
>  23 files changed, 519 insertions(+), 155 deletions(-)
> 

The network receive path already avoids excessive stack
depth. Maybe the real problem is in the lockdep code.
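
For reference, the {upper/lower}_level bookkeeping from the quoted cover
letter can be modeled in user space roughly as follows. This is a sketch
under assumptions: struct model_dev and the model_* helpers are
hypothetical, each device has at most one lower device as in the vlan
example, and the levels are recomputed on demand rather than cached as
the patch describes.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_UPPERS 4

/* Hypothetical model of a stacked device; not struct net_device. */
struct model_dev {
	const char *name;
	struct model_dev *lower;		/* device this one sits on */
	struct model_dev *uppers[MAX_UPPERS];	/* devices stacked on this one */
	int n_uppers;
};

/* Stack 'upper' on top of 'lower'. */
static void model_link(struct model_dev *upper, struct model_dev *lower)
{
	assert(upper != NULL && lower != NULL);
	assert(lower->n_uppers < MAX_UPPERS);
	upper->lower = lower;
	lower->uppers[lower->n_uppers++] = upper;
}

/* Depth of this device plus everything below it. */
static int model_lower_level(const struct model_dev *d)
{
	return d->lower ? 1 + model_lower_level(d->lower) : 1;
}

/* Depth of this device plus the tallest stack above it. */
static int model_upper_level(const struct model_dev *d)
{
	int i, max = 0;

	for (i = 0; i < d->n_uppers; i++) {
		int level = model_upper_level(d->uppers[i]);

		if (level > max)
			max = level;
	}
	return 1 + max;
}
```

Linking vlan1 -> vlan2 -> vlan3 -> vlan6 -> dummy0 and
vlan4 -> vlan5 -> vlan6 reproduces the [U][L] table in the cover letter,
for example 4/2 for vlan6 and 5/1 for dummy0.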

* Re: [PATCH net] dev: Delay the free of the percpu refcount
From: Subash Abhinov Kasiviswanathan @ 2019-09-04 18:50 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: davem, netdev, Sean Tranchetti
In-Reply-To: <adbe6efaabd34538fa424e028bdc6699@codeaurora.org>

On 2019-08-30 15:03, Subash Abhinov Kasiviswanathan wrote:
>> This looks bogus.
>> 
>> Whatever layer tries to access dev refcnt after free_netdev() has been
>> called is buggy.
>> 
>> I would rather trap early and fix the root cause.
>> 
>> Untested patch :
>> 
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index b5d28dadf964..8080f1305417 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -3723,6 +3723,7 @@ void netdev_run_todo(void);
>>   */
>>  static inline void dev_put(struct net_device *dev)
>>  {
>> +       BUG_ON(!dev->pcpu_refcnt);
>>         this_cpu_dec(*dev->pcpu_refcnt);
>>  }
>> 
>> @@ -3734,6 +3735,7 @@ static inline void dev_put(struct net_device *dev)
>>   */
>>  static inline void dev_hold(struct net_device *dev)
>>  {
>> +       BUG_ON(!dev->pcpu_refcnt);
>>         this_cpu_inc(*dev->pcpu_refcnt);
>>  }
> 
> Hello Eric
> 
> I am seeing a similar crash with your patch as well.
> The NULL dev->pcpu_refcnt was caught by the BUG you added.
> 
>    786.510217:   <6> kernel BUG at include/linux/netdevice.h:3633!
>    786.510263:   <2> pc : in_dev_finish_destroy+0xcc/0xd0
>    786.510267:   <2> lr : in_dev_finish_destroy+0x2c/0xd0
>    786.511220:   <2> Call trace:
>    786.511225:   <2>  in_dev_finish_destroy+0xcc/0xd0
>    786.511230:   <2>  in_dev_rcu_put+0x24/0x30
>    786.511237:   <2>  rcu_nocb_kthread+0x43c/0x468
>    786.511243:   <2>  kthread+0x118/0x128
>    786.511249:   <2>  ret_from_fork+0x10/0x1c
> 
> This seems to be happening when there is an allocation failure
> in the IPv6 notifier callback only.
> 
> I had added some additional debug to narrow down the refcount
> validity along the callers of the dev_put/dev_hold.
> refcnt valid below shows that the pointer dev->pcpu_refcnt is valid
> while refcnt null shows the case where dev->pcpu_refcnt is NULL.
> The last dev_put happens after free_netdev leading to the
> dev->pcpu_refcnt to be accessed when NULL.
> 
> 309.908501:   <6> dev_hold() ffffffe13c9df000 ip6_vti0 refcnt valid
> setup_net+0xa0/0x210 -> ops_init+0x88/0x110
> 309.908674:   <6> dev_hold() ffffffe13c9df000 ip6_vti0 refcnt valid
> register_netdevice+0x29c/0x5b0 -> netdev_register_kobject+0xd8/0x150
> 309.908696:   <6> dev_hold() ffffffe13c9df000 ip6_vti0 refcnt valid
> register_netdevice+0x29c/0x5b0 -> netdev_register_kobject+0x100/0x150
> 309.908717:   <6> dev_hold() ffffffe13c9df000 ip6_vti0 refcnt valid
> vti6_init_net+0x188/0x1c0 -> register_netdev+0x28/0x40
> 309.908763:   <6> neighbour: dev_hold() ffffffe13c9df000 ip6_vti0
> refcnt valid inetdev_event+0x43c/0x528 -> inetdev_init+0x80/0x1e0
> 309.908835:   <6> dev_hold() ffffffe13c9df000 ip6_vti0 refcnt valid
> raw_notifier_call_chain+0x3c/0x68 -> inetdev_event+0x43c/0x528
> 309.908882:   <6> neighbour: dev_hold() ffffffe13c9df000 ip6_vti0
> refcnt valid addrconf_notify+0x42c/0xe58 -> ipv6_add_dev+0xe4/0x588
> 309.908890:   <6> IPv6: dev_hold() ffffffe13c9df000 ip6_vti0 refcnt
> valid raw_notifier_call_chain+0x3c/0x68 -> addrconf_notify+0x42c/0xe58
> 309.908906:   <6> stress-ng-clone: page allocation failure: order:0,
> mode:0x6040c0(GFP_KERNEL|__GFP_COMP), nodemask=(null)
> 309.908910:   <6> stress-ng-clone cpuset=foreground mems_allowed=0
> 309.908925:   <2> Call trace:
> 309.908931:   <2>  dump_backtrace+0x0/0x158
> 309.908934:   <2>  show_stack+0x14/0x20
> 309.908939:   <2>  dump_stack+0xc4/0xfc
> 309.908944:   <2>  warn_alloc+0xf8/0x168
> 309.908947:   <2>  __alloc_pages_nodemask+0xff4/0x1018
> 309.908955:   <2>  new_slab+0x128/0x5b8
> 309.908958:   <2>  ___slab_alloc+0x4cc/0x5f8
> 309.908960:   <2>  kmem_cache_alloc_trace+0x2a4/0x2c0
> 309.908963:   <2>  ipv6_add_dev+0x220/0x588
> 309.908966:   <2>  addrconf_notify+0x42c/0xe58
> 309.908969:   <2>  raw_notifier_call_chain+0x3c/0x68
> 309.908972:   <2>  register_netdevice+0x3c4/0x5b0
> 309.908974:   <2>  register_netdev+0x28/0x40
> 309.908978:   <2>  vti6_init_net+0x188/0x1c0
> 309.908981:   <2>  ops_init+0x88/0x110
> 309.908983:   <2>  setup_net+0xa0/0x210
> 309.908986:   <2>  copy_net_ns+0xa8/0x130
> 309.908990:   <2>  create_new_namespaces+0x138/0x170
> 309.908993:   <2>  unshare_nsproxy_namespaces+0x68/0x90
> 309.908999:   <2>  ksys_unshare+0x17c/0x248
> 309.909001:   <2>  __arm64_sys_unshare+0x10/0x20
> 309.909004:   <2>  el0_svc_common+0xa0/0x158
> 309.909007:   <2>  el0_svc_handler+0x6c/0x88
> 309.909010:   <2>  el0_svc+0x8/0xc
> 309.909021:   <6> neighbour: dev_put() ffffffe13c9df000 ip6_vti0
> refcnt valid addrconf_notify+0x42c/0xe58 -> ipv6_add_dev+0x400/0x588
> 309.909030:   <6> IPv6: dev_put() ffffffe13c9df000 ip6_vti0 refcnt
> valid raw_notifier_call_chain+0x3c/0x68 -> addrconf_notify+0x42c/0xe58
> 309.918097:   <6> neighbour: dev_put() ffffffe13c9df000 ip6_vti0
> refcnt valid raw_notifier_call_chain+0x3c/0x68 ->
> inetdev_event+0x290/0x528
> 309.918249:   <6> dev_put() ffffffe13c9df000 ip6_vti0 refcnt valid
> register_netdevice+0x3f8/0x5b0 -> rollback_registered_many+0x488/0x658
> 309.918318:   <6> dev_put() ffffffe13c9df000 ip6_vti0 refcnt valid
> net_rx_queue_update_kobjects+0x1ec/0x238 -> kobject_put+0x7c/0xc0
> 309.918405:   <6> dev_put() ffffffe13c9df000 ip6_vti0 refcnt valid
> netdev_queue_update_kobjects+0x1dc/0x228 -> kobject_put+0x7c/0xc0
> 309.918759:   <6> dev_put() ffffffe13c9df000 ip6_vti0 refcnt valid
> register_netdev+0x28/0x40 -> register_netdevice+0x3f8/0x5b0
> 309.918778:   <6> free_netdev() ffffffe13c9df000 ip6_vti0 refcnt valid
> ops_init+0x88/0x110 -> vti6_init_net+0x1ac/0x1c0
> 309.980671:   <6> dev_put() ffffffe13c9df000 ip6_vti0 refcnt null
> rcu_nocb_kthread+0x43c/0x468 -> in_dev_rcu_put+0x24/0x30
> 309.980838:   <6> kernel BUG at include/linux/netdevice.h:3636!

Hello Eric

Could you please review this? The main concern here is that
dev_put() is getting called after free_netdev() causing a NULL
dereference.

309.918778:   <6> free_netdev() ffffffe13c9df000 ip6_vti0 refcnt valid
ops_init+0x88/0x110 -> vti6_init_net+0x1ac/0x1c0
309.980671:   <6> dev_put() ffffffe13c9df000 ip6_vti0 refcnt null
rcu_nocb_kthread+0x43c/0x468 -> in_dev_rcu_put+0x24/0x30

I am able to reproduce this consistently and am open to testing
out patches.
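
The ordering problem in the log above (free_netdev() first, a late
dev_put() from the RCU callback afterwards) can be modeled in user space
as below. This is a sketch only: the model_* names are hypothetical, a
plain int stands in for the per-CPU counter, and the put path returns an
error where the kernel's BUG_ON() would fire.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model of a device with a separately allocated refcount. */
struct model_dev {
	int *refcnt;
};

static int model_dev_init(struct model_dev *d)
{
	assert(d != NULL);
	d->refcnt = calloc(1, sizeof(*d->refcnt));
	return d->refcnt ? 0 : -1;
}

/* Models the free path: the counter is released and the pointer cleared. */
static void model_free_netdev(struct model_dev *d)
{
	free(d->refcnt);
	d->refcnt = NULL;
}

/* Returns -1 when called after model_free_netdev(), 0 otherwise. */
static int model_dev_hold(struct model_dev *d)
{
	if (!d->refcnt)
		return -1;
	(*d->refcnt)++;
	return 0;
}

/* Returns -1 when called after model_free_netdev(), 0 otherwise. */
static int model_dev_put(struct model_dev *d)
{
	if (!d->refcnt)
		return -1;
	(*d->refcnt)--;
	return 0;
}
```

A hold that is still outstanding when model_free_netdev() runs leaves a
later model_dev_put() with nothing to decrement, which matches the
"refcnt null" line for in_dev_rcu_put in the trace.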

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

* [PATCH v3 bpf-next 2/3] bpf: implement CAP_BPF
From: Alexei Starovoitov @ 2019-09-04 18:43 UTC (permalink / raw)
  To: davem; +Cc: daniel, peterz, luto, netdev, bpf, kernel-team, linux-api
In-Reply-To: <20190904184335.360074-1-ast@kernel.org>

Implement permissions as stated in uapi/linux/capability.h
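
The helpers used throughout the diff below (capable_bpf(),
capable_bpf_tracing(), ...) are not defined in this patch, so the
following is only a user-space sketch of how such checks might compose.
The M_CAP_* bits, struct model_cred and the model_* functions are
hypothetical, and the rules (CAP_SYS_ADMIN still grants everything,
tracing needs both narrower capabilities) are assumptions, not taken from
this message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical capability bits for the model; not the kernel's numbering. */
enum {
	M_CAP_SYS_ADMIN = 1 << 0,
	M_CAP_NET_ADMIN = 1 << 1,
	M_CAP_BPF       = 1 << 2,
	M_CAP_TRACING   = 1 << 3,
};

struct model_cred {
	unsigned int caps;
};

static bool model_capable(const struct model_cred *c, unsigned int cap)
{
	assert(c != NULL);
	return (c->caps & cap) != 0;
}

/* Assumed rule: CAP_SYS_ADMIN keeps working, CAP_BPF is the narrower grant. */
static bool model_capable_bpf(const struct model_cred *c)
{
	return model_capable(c, M_CAP_SYS_ADMIN) ||
	       model_capable(c, M_CAP_BPF);
}

/* Assumed rule: tracing programs need both CAP_BPF and CAP_TRACING. */
static bool model_capable_bpf_tracing(const struct model_cred *c)
{
	return model_capable(c, M_CAP_SYS_ADMIN) ||
	       (model_capable(c, M_CAP_BPF) &&
		model_capable(c, M_CAP_TRACING));
}
```

Under these assumed rules a task holding only the BPF capability passes
the plain check but not the tracing one.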

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/arraymap.c                       |  2 +-
 kernel/bpf/cgroup.c                         |  2 +-
 kernel/bpf/core.c                           |  4 +-
 kernel/bpf/hashtab.c                        |  4 +-
 kernel/bpf/lpm_trie.c                       |  2 +-
 kernel/bpf/queue_stack_maps.c               |  2 +-
 kernel/bpf/reuseport_array.c                |  2 +-
 kernel/bpf/stackmap.c                       |  2 +-
 kernel/bpf/syscall.c                        | 32 ++++++++------
 kernel/bpf/verifier.c                       |  4 +-
 kernel/trace/bpf_trace.c                    |  2 +-
 net/core/bpf_sk_storage.c                   |  2 +-
 net/core/filter.c                           | 10 +++--
 tools/testing/selftests/bpf/test_verifier.c | 46 +++++++++++++++++----
 14 files changed, 77 insertions(+), 39 deletions(-)

diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 1c65ce0098a9..149f868a02dc 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -73,7 +73,7 @@ static struct bpf_map *array_map_alloc(union bpf_attr *attr)
 	bool percpu = attr->map_type == BPF_MAP_TYPE_PERCPU_ARRAY;
 	int ret, numa_node = bpf_map_attr_numa_node(attr);
 	u32 elem_size, index_mask, max_entries;
-	bool unpriv = !capable(CAP_SYS_ADMIN);
+	bool unpriv = !capable_bpf();
 	u64 cost, array_size, mask64;
 	struct bpf_map_memory mem;
 	struct bpf_array *array;
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index 6a6a154cfa7b..9c659ba5c146 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -795,7 +795,7 @@ cgroup_base_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	case BPF_FUNC_get_current_cgroup_id:
 		return &bpf_get_current_cgroup_id_proto;
 	case BPF_FUNC_trace_printk:
-		if (capable(CAP_SYS_ADMIN))
+		if (capable_bpf_tracing())
 			return bpf_get_trace_printk_proto();
 		/* fall through */
 	default:
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 8191a7db2777..6b53c064e8e6 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -646,7 +646,7 @@ static bool bpf_prog_kallsyms_verify_off(const struct bpf_prog *fp)
 void bpf_prog_kallsyms_add(struct bpf_prog *fp)
 {
 	if (!bpf_prog_kallsyms_candidate(fp) ||
-	    !capable(CAP_SYS_ADMIN))
+	    !capable_bpf())
 		return;
 
 	spin_lock_bh(&bpf_lock);
@@ -768,7 +768,7 @@ static int bpf_jit_charge_modmem(u32 pages)
 {
 	if (atomic_long_add_return(pages, &bpf_jit_current) >
 	    (bpf_jit_limit >> PAGE_SHIFT)) {
-		if (!capable(CAP_SYS_ADMIN)) {
+		if (!capable_bpf()) {
 			atomic_long_sub(pages, &bpf_jit_current);
 			return -EPERM;
 		}
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 22066a62c8c9..0fae5c45f425 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -244,9 +244,9 @@ static int htab_map_alloc_check(union bpf_attr *attr)
 	BUILD_BUG_ON(offsetof(struct htab_elem, fnode.next) !=
 		     offsetof(struct htab_elem, hash_node.pprev));
 
-	if (lru && !capable(CAP_SYS_ADMIN))
+	if (lru && !capable_bpf())
 		/* LRU implementation is much complicated than other
-		 * maps.  Hence, limit to CAP_SYS_ADMIN for now.
+		 * maps.  Hence, limit to CAP_BPF.
 		 */
 		return -EPERM;
 
diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
index 56e6c75d354d..11da3be8a4e5 100644
--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -543,7 +543,7 @@ static struct bpf_map *trie_alloc(union bpf_attr *attr)
 	u64 cost = sizeof(*trie), cost_per_node;
 	int ret;
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!capable_bpf())
 		return ERR_PTR(-EPERM);
 
 	/* check sanity of attributes */
diff --git a/kernel/bpf/queue_stack_maps.c b/kernel/bpf/queue_stack_maps.c
index f697647ceb54..d83afac32863 100644
--- a/kernel/bpf/queue_stack_maps.c
+++ b/kernel/bpf/queue_stack_maps.c
@@ -45,7 +45,7 @@ static bool queue_stack_map_is_full(struct bpf_queue_stack *qs)
 /* Called from syscall */
 static int queue_stack_map_alloc_check(union bpf_attr *attr)
 {
-	if (!capable(CAP_SYS_ADMIN))
+	if (!capable_bpf())
 		return -EPERM;
 
 	/* check sanity of attributes */
diff --git a/kernel/bpf/reuseport_array.c b/kernel/bpf/reuseport_array.c
index 50c083ba978c..b268fe4b2972 100644
--- a/kernel/bpf/reuseport_array.c
+++ b/kernel/bpf/reuseport_array.c
@@ -154,7 +154,7 @@ static struct bpf_map *reuseport_array_alloc(union bpf_attr *attr)
 	struct bpf_map_memory mem;
 	u64 array_size;
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!capable_bpf())
 		return ERR_PTR(-EPERM);
 
 	array_size = sizeof(*array);
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 052580c33d26..477063c63b27 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -90,7 +90,7 @@ static struct bpf_map *stack_map_alloc(union bpf_attr *attr)
 	u64 cost, n_buckets;
 	int err;
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!capable_tracing())
 		return ERR_PTR(-EPERM);
 
 	if (attr->map_flags & ~STACK_CREATE_FLAG_MASK)
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index ca60eafa6922..2b832eeafda9 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1176,7 +1176,7 @@ static int map_freeze(const union bpf_attr *attr)
 		err = -EBUSY;
 		goto err_put;
 	}
-	if (!capable(CAP_SYS_ADMIN)) {
+	if (!capable_bpf()) {
 		err = -EPERM;
 		goto err_put;
 	}
@@ -1635,7 +1635,7 @@ static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr)
 
 	if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) &&
 	    (attr->prog_flags & BPF_F_ANY_ALIGNMENT) &&
-	    !capable(CAP_SYS_ADMIN))
+	    !capable_bpf())
 		return -EPERM;
 
 	/* copy eBPF program license from user space */
@@ -1648,11 +1648,11 @@ static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr)
 	is_gpl = license_is_gpl_compatible(license);
 
 	if (attr->insn_cnt == 0 ||
-	    attr->insn_cnt > (capable(CAP_SYS_ADMIN) ? BPF_COMPLEXITY_LIMIT_INSNS : BPF_MAXINSNS))
+	    attr->insn_cnt > (capable_bpf() ? BPF_COMPLEXITY_LIMIT_INSNS : BPF_MAXINSNS))
 		return -E2BIG;
 	if (type != BPF_PROG_TYPE_SOCKET_FILTER &&
 	    type != BPF_PROG_TYPE_CGROUP_SKB &&
-	    !capable(CAP_SYS_ADMIN))
+	    !capable_bpf())
 		return -EPERM;
 
 	bpf_prog_load_fixup_attach_type(attr);
@@ -1803,6 +1803,9 @@ static int bpf_raw_tracepoint_open(const union bpf_attr *attr)
 	char tp_name[128];
 	int tp_fd, err;
 
+	if (!capable_bpf_tracing())
+		return -EPERM;
+
 	if (strncpy_from_user(tp_name, u64_to_user_ptr(attr->raw_tracepoint.name),
 			      sizeof(tp_name) - 1) < 0)
 		return -EFAULT;
@@ -2081,7 +2084,10 @@ static int bpf_prog_test_run(const union bpf_attr *attr,
 	struct bpf_prog *prog;
 	int ret = -ENOTSUPP;
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!capable_bpf_net_admin())
+		/* test_run callback is available for networking progs only.
+		 * Add capable_bpf_tracing() above when tracing progs become runnable.
+		 */
 		return -EPERM;
 	if (CHECK_ATTR(BPF_PROG_TEST_RUN))
 		return -EINVAL;
@@ -2118,7 +2124,7 @@ static int bpf_obj_get_next_id(const union bpf_attr *attr,
 	if (CHECK_ATTR(BPF_OBJ_GET_NEXT_ID) || next_id >= INT_MAX)
 		return -EINVAL;
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!capable_bpf())
 		return -EPERM;
 
 	next_id++;
@@ -2144,7 +2150,7 @@ static int bpf_prog_get_fd_by_id(const union bpf_attr *attr)
 	if (CHECK_ATTR(BPF_PROG_GET_FD_BY_ID))
 		return -EINVAL;
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!capable_bpf())
 		return -EPERM;
 
 	spin_lock_bh(&prog_idr_lock);
@@ -2178,7 +2184,7 @@ static int bpf_map_get_fd_by_id(const union bpf_attr *attr)
 	    attr->open_flags & ~BPF_OBJ_FLAG_MASK)
 		return -EINVAL;
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!capable_bpf())
 		return -EPERM;
 
 	f_flags = bpf_get_file_flag(attr->open_flags);
@@ -2353,7 +2359,7 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog,
 	info.run_time_ns = stats.nsecs;
 	info.run_cnt = stats.cnt;
 
-	if (!capable(CAP_SYS_ADMIN)) {
+	if (!capable_bpf()) {
 		info.jited_prog_len = 0;
 		info.xlated_prog_len = 0;
 		info.nr_jited_ksyms = 0;
@@ -2671,7 +2677,7 @@ static int bpf_btf_load(const union bpf_attr *attr)
 	if (CHECK_ATTR(BPF_BTF_LOAD))
 		return -EINVAL;
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!capable_bpf())
 		return -EPERM;
 
 	return btf_new_fd(attr);
@@ -2684,7 +2690,7 @@ static int bpf_btf_get_fd_by_id(const union bpf_attr *attr)
 	if (CHECK_ATTR(BPF_BTF_GET_FD_BY_ID))
 		return -EINVAL;
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!capable_bpf())
 		return -EPERM;
 
 	return btf_get_fd_by_id(attr->btf_id);
@@ -2753,7 +2759,7 @@ static int bpf_task_fd_query(const union bpf_attr *attr,
 	if (CHECK_ATTR(BPF_TASK_FD_QUERY))
 		return -EINVAL;
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!capable_bpf_tracing())
 		return -EPERM;
 
 	if (attr->task_fd_query.flags != 0)
@@ -2821,7 +2827,7 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
 	union bpf_attr attr = {};
 	int err;
 
-	if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN))
+	if (sysctl_unprivileged_bpf_disabled && !capable_bpf())
 		return -EPERM;
 
 	err = bpf_check_uarg_tail_zero(uattr, sizeof(attr), size);
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index f340cfe53c6e..b2a229557602 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -987,7 +987,7 @@ static void __mark_reg_unbounded(struct bpf_reg_state *reg)
 	reg->umax_value = U64_MAX;
 
 	/* constant backtracking is enabled for root only for now */
-	reg->precise = capable(CAP_SYS_ADMIN) ? false : true;
+	reg->precise = capable_bpf() ? false : true;
 }
 
 /* Mark a register as having a completely unknown (scalar) value. */
@@ -9233,7 +9233,7 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr,
 		env->insn_aux_data[i].orig_idx = i;
 	env->prog = *prog;
 	env->ops = bpf_verifier_ops[env->prog->type];
-	is_priv = capable(CAP_SYS_ADMIN);
+	is_priv = capable_bpf();
 
 	/* grab the mutex to protect few globals used by verifier */
 	if (!is_priv)
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index ca1255d14576..cdf8d6c8a430 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1246,7 +1246,7 @@ int perf_event_query_prog_array(struct perf_event *event, void __user *info)
 	u32 *ids, prog_cnt, ids_len;
 	int ret;
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!capable_bpf_tracing())
 		return -EPERM;
 	if (event->attr.type != PERF_TYPE_TRACEPOINT)
 		return -EINVAL;
diff --git a/net/core/bpf_sk_storage.c b/net/core/bpf_sk_storage.c
index da5639a5bd3b..aa74be21f5b6 100644
--- a/net/core/bpf_sk_storage.c
+++ b/net/core/bpf_sk_storage.c
@@ -616,7 +616,7 @@ static int bpf_sk_storage_map_alloc_check(union bpf_attr *attr)
 	    !attr->btf_key_type_id || !attr->btf_value_type_id)
 		return -EINVAL;
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!capable_bpf())
 		return -EPERM;
 
 	if (attr->value_size >= KMALLOC_MAX_SIZE -
diff --git a/net/core/filter.c b/net/core/filter.c
index 17bc9af8f156..fae09b4d4da4 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5990,7 +5990,7 @@ bpf_base_func_proto(enum bpf_func_id func_id)
 		break;
 	}
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!capable_bpf())
 		return NULL;
 
 	switch (func_id) {
@@ -5999,7 +5999,9 @@ bpf_base_func_proto(enum bpf_func_id func_id)
 	case BPF_FUNC_spin_unlock:
 		return &bpf_spin_unlock_proto;
 	case BPF_FUNC_trace_printk:
-		return bpf_get_trace_printk_proto();
+		if (capable_bpf_tracing())
+			return bpf_get_trace_printk_proto();
+		/* fall through */
 	default:
 		return NULL;
 	}
@@ -6563,7 +6565,7 @@ static bool cg_skb_is_valid_access(int off, int size,
 		return false;
 	case bpf_ctx_range(struct __sk_buff, data):
 	case bpf_ctx_range(struct __sk_buff, data_end):
-		if (!capable(CAP_SYS_ADMIN))
+		if (!capable_bpf())
 			return false;
 		break;
 	}
@@ -6575,7 +6577,7 @@ static bool cg_skb_is_valid_access(int off, int size,
 		case bpf_ctx_range_till(struct __sk_buff, cb[0], cb[4]):
 			break;
 		case bpf_ctx_range(struct __sk_buff, tstamp):
-			if (!capable(CAP_SYS_ADMIN))
+			if (!capable_bpf())
 				return false;
 			break;
 		default:
diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index d27fd929abb9..0d5567962c4e 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -807,10 +807,20 @@ static void do_test_fixup(struct bpf_test *test, enum bpf_prog_type prog_type,
 	}
 }
 
+struct libcap {
+	struct __user_cap_header_struct hdr;
+	struct __user_cap_data_struct data[2];
+};
+
 static int set_admin(bool admin)
 {
 	cap_t caps;
-	const cap_value_t cap_val = CAP_SYS_ADMIN;
+	/* need CAP_BPF to load progs and CAP_NET_ADMIN to run networking progs,
+	 * and CAP_TRACING to create stackmap
+	 */
+	const cap_value_t cap_net_admin = CAP_NET_ADMIN;
+	const cap_value_t cap_sys_admin = CAP_SYS_ADMIN;
+	struct libcap *cap;
 	int ret = -1;
 
 	caps = cap_get_proc();
@@ -818,11 +828,26 @@ static int set_admin(bool admin)
 		perror("cap_get_proc");
 		return -1;
 	}
-	if (cap_set_flag(caps, CAP_EFFECTIVE, 1, &cap_val,
+	cap = (struct libcap *)caps;
+	if (cap_set_flag(caps, CAP_EFFECTIVE, 1, &cap_sys_admin, CAP_CLEAR)) {
+		perror("cap_set_flag clear admin");
+		goto out;
+	}
+	if (cap_set_flag(caps, CAP_EFFECTIVE, 1, &cap_net_admin,
 				admin ? CAP_SET : CAP_CLEAR)) {
-		perror("cap_set_flag");
+		perror("cap_set_flag set_or_clear net");
 		goto out;
 	}
+	/* libcap is likely old and simply ignores CAP_BPF and CAP_TRACING,
+	 * so update effective bits manually
+	 */
+	if (admin) {
+		cap->data[1].effective |= 1 << (38 /* CAP_BPF */ - 32);
+		cap->data[1].effective |= 1 << (39 /* CAP_TRACING */ - 32);
+	} else {
+		cap->data[1].effective &= ~(1 << (38 - 32));
+		cap->data[1].effective &= ~(1 << (39 - 32));
+	}
 	if (cap_set_proc(caps)) {
 		perror("cap_set_proc");
 		goto out;
@@ -1051,9 +1076,11 @@ static void do_test_single(struct bpf_test *test, bool unpriv,
 
 static bool is_admin(void)
 {
+	cap_flag_value_t net_priv = CAP_CLEAR;
+	bool tracing_priv = false;
+	bool bpf_priv = false;
+	struct libcap *cap;
 	cap_t caps;
-	cap_flag_value_t sysadmin = CAP_CLEAR;
-	const cap_value_t cap_val = CAP_SYS_ADMIN;
 
 #ifdef CAP_IS_SUPPORTED
 	if (!CAP_IS_SUPPORTED(CAP_SETFCAP)) {
@@ -1066,11 +1093,14 @@ static bool is_admin(void)
 		perror("cap_get_proc");
 		return false;
 	}
-	if (cap_get_flag(caps, cap_val, CAP_EFFECTIVE, &sysadmin))
-		perror("cap_get_flag");
+	cap = (struct libcap *)caps;
+	bpf_priv = cap->data[1].effective & (1 << (38/* CAP_BPF */ - 32));
+	tracing_priv = cap->data[1].effective & (1 << (39/* CAP_TRACING */ - 32));
+	if (cap_get_flag(caps, CAP_NET_ADMIN, CAP_EFFECTIVE, &net_priv))
+		perror("cap_get_flag NET");
 	if (cap_free(caps))
 		perror("cap_free");
-	return (sysadmin == CAP_SET);
+	return bpf_priv && tracing_priv && net_priv == CAP_SET;
 }
 
 static void get_unpriv_disabled()
-- 
2.20.0
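
A note on the selftest hunk above: it sets CAP_BPF and CAP_TRACING by hand in data[1] because capability numbers 32 and above live in the second 32-bit word of the capability set. The sketch below only illustrates that index arithmetic in plain userspace C. struct model_cap_words and the model_cap_*() helpers are hypothetical names, not part of libcap or the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two effective words in struct libcap above. */
struct model_cap_words {
	uint32_t effective[2];
};

/* Set or clear one capability bit: word = cap / 32, bit = cap % 32. */
static inline void model_cap_set(struct model_cap_words *w, int cap, bool on)
{
	assert(cap >= 0 && cap < 64);

	uint32_t bit = UINT32_C(1) << (cap % 32);

	if (on)
		w->effective[cap / 32] |= bit;
	else
		w->effective[cap / 32] &= ~bit;
}

/* Test one capability bit using the same word/bit split. */
static inline bool model_cap_test(const struct model_cap_words *w, int cap)
{
	assert(cap >= 0 && cap < 64);
	return (w->effective[cap / 32] >> (cap % 32)) & 1u;
}
```

With the numbers proposed in this series, 38 maps to bit 6 and 39 to bit 7 of word 1, which matches the open-coded shifts by (38 - 32) and (39 - 32) in set_admin() and is_admin().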


* [PATCH v3 bpf-next 3/3] perf: implement CAP_TRACING
From: Alexei Starovoitov @ 2019-09-04 18:43 UTC (permalink / raw)
  To: davem; +Cc: daniel, peterz, luto, netdev, bpf, kernel-team, linux-api
In-Reply-To: <20190904184335.360074-1-ast@kernel.org>

Implement permissions as stated in uapi/linux/capability.h

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 arch/powerpc/perf/core-book3s.c |  4 ++--
 arch/x86/events/intel/bts.c     |  2 +-
 arch/x86/events/intel/core.c    |  2 +-
 arch/x86/events/intel/p4.c      |  2 +-
 kernel/events/core.c            | 14 +++++++-------
 kernel/events/hw_breakpoint.c   |  2 +-
 kernel/trace/trace_event_perf.c |  4 ++--
 7 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index ca92e01d0bd1..a204a3c6c68b 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -204,7 +204,7 @@ static inline void perf_get_data_addr(struct pt_regs *regs, u64 *addrp)
 	if (!(mmcra & MMCRA_SAMPLE_ENABLE) || sdar_valid)
 		*addrp = mfspr(SPRN_SDAR);
 
-	if (perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN) &&
+	if (perf_paranoid_kernel() && !capable_tracing() &&
 		is_kernel_addr(mfspr(SPRN_SDAR)))
 		*addrp = 0;
 }
@@ -472,7 +472,7 @@ static void power_pmu_bhrb_read(struct cpu_hw_events *cpuhw)
 			 * exporting it to userspace (avoid exposure of regions
 			 * where we could have speculative execution)
 			 */
-			if (perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN) &&
+			if (perf_paranoid_kernel() && !capable_tracing() &&
 				is_kernel_addr(addr))
 				continue;
 
diff --git a/arch/x86/events/intel/bts.c b/arch/x86/events/intel/bts.c
index 5ee3fed881d3..bd713b2dd7c2 100644
--- a/arch/x86/events/intel/bts.c
+++ b/arch/x86/events/intel/bts.c
@@ -550,7 +550,7 @@ static int bts_event_init(struct perf_event *event)
 	 * users to profile the kernel.
 	 */
 	if (event->attr.exclude_kernel && perf_paranoid_kernel() &&
-	    !capable(CAP_SYS_ADMIN))
+	    !capable_tracing())
 		return -EACCES;
 
 	if (x86_add_exclusive(x86_lbr_exclusive_bts))
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 648260b5f367..277b12c054fa 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3307,7 +3307,7 @@ static int intel_pmu_hw_config(struct perf_event *event)
 	if (x86_pmu.version < 3)
 		return -EINVAL;
 
-	if (perf_paranoid_cpu() && !capable(CAP_SYS_ADMIN))
+	if (perf_paranoid_cpu() && !capable_tracing())
 		return -EACCES;
 
 	event->hw.config |= ARCH_PERFMON_EVENTSEL_ANY;
diff --git a/arch/x86/events/intel/p4.c b/arch/x86/events/intel/p4.c
index dee579efb2b2..f379a358c9cb 100644
--- a/arch/x86/events/intel/p4.c
+++ b/arch/x86/events/intel/p4.c
@@ -776,7 +776,7 @@ static int p4_validate_raw_event(struct perf_event *event)
 	 * the user needs special permissions to be able to use it
 	 */
 	if (p4_ht_active() && p4_event_bind_map[v].shared) {
-		if (perf_paranoid_cpu() && !capable(CAP_SYS_ADMIN))
+		if (perf_paranoid_cpu() && !capable_tracing())
 			return -EACCES;
 	}
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 0463c1151bae..eaba102e5d91 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4134,7 +4134,7 @@ find_get_context(struct pmu *pmu, struct task_struct *task,
 
 	if (!task) {
 		/* Must be root to operate on a CPU event: */
-		if (perf_paranoid_cpu() && !capable(CAP_SYS_ADMIN))
+		if (perf_paranoid_cpu() && !capable_tracing())
 			return ERR_PTR(-EACCES);
 
 		cpuctx = per_cpu_ptr(pmu->pmu_cpu_context, cpu);
@@ -8741,7 +8741,7 @@ static int perf_kprobe_event_init(struct perf_event *event)
 	if (event->attr.type != perf_kprobe.type)
 		return -ENOENT;
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!capable_tracing())
 		return -EACCES;
 
 	/*
@@ -8801,7 +8801,7 @@ static int perf_uprobe_event_init(struct perf_event *event)
 	if (event->attr.type != perf_uprobe.type)
 		return -ENOENT;
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!capable_tracing())
 		return -EACCES;
 
 	/*
@@ -10588,7 +10588,7 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
 		}
 		/* privileged levels capture (kernel, hv): check permissions */
 		if ((mask & PERF_SAMPLE_BRANCH_PERM_PLM)
-		    && perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN))
+		    && perf_paranoid_kernel() && !capable_tracing())
 			return -EACCES;
 	}
 
@@ -10807,12 +10807,12 @@ SYSCALL_DEFINE5(perf_event_open,
 		return err;
 
 	if (!attr.exclude_kernel) {
-		if (perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN))
+		if (perf_paranoid_kernel() && !capable_tracing())
 			return -EACCES;
 	}
 
 	if (attr.namespaces) {
-		if (!capable(CAP_SYS_ADMIN))
+		if (!capable_tracing())
 			return -EACCES;
 	}
 
@@ -10826,7 +10826,7 @@ SYSCALL_DEFINE5(perf_event_open,
 
 	/* Only privileged users can get physical addresses */
 	if ((attr.sample_type & PERF_SAMPLE_PHYS_ADDR) &&
-	    perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN))
+	    perf_paranoid_kernel() && !capable_tracing())
 		return -EACCES;
 
 	/*
diff --git a/kernel/events/hw_breakpoint.c b/kernel/events/hw_breakpoint.c
index c5cd852fe86b..8bc4d7d8c913 100644
--- a/kernel/events/hw_breakpoint.c
+++ b/kernel/events/hw_breakpoint.c
@@ -404,7 +404,7 @@ static int hw_breakpoint_parse(struct perf_event *bp,
 		 * Don't let unprivileged users set a breakpoint in the trap
 		 * path to avoid trap recursion attacks.
 		 */
-		if (!capable(CAP_SYS_ADMIN))
+		if (!capable_tracing())
 			return -EPERM;
 	}
 
diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
index 0892e38ed6fb..6861307f14d6 100644
--- a/kernel/trace/trace_event_perf.c
+++ b/kernel/trace/trace_event_perf.c
@@ -46,7 +46,7 @@ static int perf_trace_event_perm(struct trace_event_call *tp_event,
 
 	/* The ftrace function trace is allowed only for root. */
 	if (ftrace_event_is_function(tp_event)) {
-		if (perf_paranoid_tracepoint_raw() && !capable(CAP_SYS_ADMIN))
+		if (perf_paranoid_tracepoint_raw() && !capable_tracing())
 			return -EPERM;
 
 		if (!is_sampling_event(p_event))
@@ -82,7 +82,7 @@ static int perf_trace_event_perm(struct trace_event_call *tp_event,
 	 * ...otherwise raw tracepoint data can be a severe data leak,
 	 * only allow root to have these.
 	 */
-	if (perf_paranoid_tracepoint_raw() && !capable(CAP_SYS_ADMIN))
+	if (perf_paranoid_tracepoint_raw() && !capable_tracing())
 		return -EPERM;
 
 	return 0;
-- 
2.20.0


* [PATCH v3 bpf-next 1/3] capability: introduce CAP_BPF and CAP_TRACING
From: Alexei Starovoitov @ 2019-09-04 18:43 UTC (permalink / raw)
  To: davem; +Cc: daniel, peterz, luto, netdev, bpf, kernel-team, linux-api

Split BPF and perf/tracing operations that are allowed under
CAP_SYS_ADMIN into corresponding CAP_BPF and CAP_TRACING.
For backward compatibility include them in CAP_SYS_ADMIN as well.

The end result provides a simple safety model for applications that use BPF:
- for tracing program types
  BPF_PROG_TYPE_{KPROBE, TRACEPOINT, PERF_EVENT, RAW_TRACEPOINT, etc}
  use CAP_BPF and CAP_TRACING
- for networking program types
  BPF_PROG_TYPE_{SCHED_CLS, XDP, CGROUP_SKB, SK_SKB, etc}
  use CAP_BPF and CAP_NET_ADMIN

There are a few exceptions to this simple rule:
- bpf_trace_printk() is allowed in networking programs, but it uses the
  ftrace mechanism, hence this helper additionally needs CAP_TRACING.
- cpumap is used by XDP programs. Currently it's kept under CAP_SYS_ADMIN,
  but could be relaxed to CAP_NET_ADMIN in the future.
- BPF_F_ZERO_SEED flag for hash/lru map is allowed under CAP_SYS_ADMIN only
  to discourage production use.
- BPF HW offload is allowed under CAP_SYS_ADMIN.
- cg_sysctl, cg_device, lirc program types are neither networking nor tracing.
  They can be loaded under CAP_BPF, but attach is allowed under CAP_NET_ADMIN.
  This will be cleaned up in the future.
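
The rule and the bpf_trace_printk() exception can be modeled in userspace as a handful of predicates over an effective capability mask. This is only a sketch: the model_*() names are hypothetical, the predicates are patterned on the capable_*() helpers this patch adds to include/linux/capability.h, and 38/39 are the numbers proposed here for CAP_BPF and CAP_TRACING.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 12 and 21 are the existing uapi values; 38 and 39 are proposed by this patch. */
enum {
	MODEL_CAP_NET_ADMIN = 12,
	MODEL_CAP_SYS_ADMIN = 21,
	MODEL_CAP_BPF       = 38,
	MODEL_CAP_TRACING   = 39,
};

#define MODEL_CAP_BIT(cap) (UINT64_C(1) << (cap))

/* Hypothetical replacement for capable(): test one bit of a 64-bit mask. */
static inline bool model_capable(uint64_t eff, int cap)
{
	assert(cap >= 0 && cap < 64);
	return (eff & MODEL_CAP_BIT(cap)) != 0;
}

static inline bool model_capable_bpf(uint64_t eff)
{
	return model_capable(eff, MODEL_CAP_SYS_ADMIN) ||
	       model_capable(eff, MODEL_CAP_BPF);
}

static inline bool model_capable_bpf_tracing(uint64_t eff)
{
	return model_capable(eff, MODEL_CAP_SYS_ADMIN) ||
	       (model_capable(eff, MODEL_CAP_BPF) &&
		model_capable(eff, MODEL_CAP_TRACING));
}

static inline bool model_capable_bpf_net_admin(uint64_t eff)
{
	return model_capable_bpf(eff) &&
	       model_capable(eff, MODEL_CAP_NET_ADMIN);
}

/* The exception: trace_printk needs tracing rights even in networking progs. */
static inline bool model_trace_printk_allowed(uint64_t eff)
{
	return model_capable_bpf(eff) && model_capable_bpf_tracing(eff);
}
```

A mask with only CAP_BPF and CAP_NET_ADMIN passes the networking check but not the trace_printk one. Adding CAP_TRACING flips the latter.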

userid=nobody + (CAP_TRACING | CAP_NET_ADMIN) + CAP_BPF is safer than the
typical setup of existing bpf applications, which run with userid=root
or under sudo. It is still not secure, since these capabilities:
- allow bpf progs to access arbitrary memory
- let tasks access any bpf map
- let tasks attach/detach any bpf prog

The bpftool, bpftrace and bcc tools binaries should not be installed with
cap_bpf+cap_tracing, since unprivileged users would then be able to read
kernel secrets.

CAP_BPF, CAP_NET_ADMIN and CAP_TRACING are roughly equal in terms of the
damage they can do to the system.
Example:
CAP_NET_ADMIN can stop network traffic. CAP_BPF can write into map
and if that map is used by firewall-like bpf prog the network traffic
may stop.
CAP_BPF allows many bpf prog_load commands in parallel. The verifier
may consume a large amount of memory and significantly slow down the system.
CAP_TRACING allows many kprobes that can slow down the system.

In the future more fine-grained bpf permissions may be added.

v2->v3:
- dropped ftrace and kallsyms from CAP_TRACING description.
  In the future these mechanisms can start using it too.
- added CAP_SYS_ADMIN backward compatibility.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/capability.h          | 18 +++++++++++
 include/uapi/linux/capability.h     | 48 ++++++++++++++++++++++++++++-
 security/selinux/include/classmap.h |  4 +--
 3 files changed, 67 insertions(+), 3 deletions(-)

diff --git a/include/linux/capability.h b/include/linux/capability.h
index ecce0f43c73a..13eb49c75797 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -247,6 +247,24 @@ static inline bool ns_capable_setid(struct user_namespace *ns, int cap)
 	return true;
 }
 #endif /* CONFIG_MULTIUSER */
+
+static inline bool capable_bpf(void)
+{
+	return capable(CAP_SYS_ADMIN) || capable(CAP_BPF);
+}
+static inline bool capable_tracing(void)
+{
+	return capable(CAP_SYS_ADMIN) || capable(CAP_TRACING);
+}
+static inline bool capable_bpf_tracing(void)
+{
+	return capable(CAP_SYS_ADMIN) || (capable(CAP_BPF) && capable(CAP_TRACING));
+}
+static inline bool capable_bpf_net_admin(void)
+{
+	return (capable(CAP_SYS_ADMIN) || capable(CAP_BPF)) && capable(CAP_NET_ADMIN);
+}
+
 extern bool privileged_wrt_inode_uidgid(struct user_namespace *ns, const struct inode *inode);
 extern bool capable_wrt_inode_uidgid(const struct inode *inode, int cap);
 extern bool file_ns_capable(const struct file *file, struct user_namespace *ns, int cap);
diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
index 240fdb9a60f6..fb71dee0ac2b 100644
--- a/include/uapi/linux/capability.h
+++ b/include/uapi/linux/capability.h
@@ -366,8 +366,54 @@ struct vfs_ns_cap_data {
 
 #define CAP_AUDIT_READ		37
 
+/*
+ * CAP_BPF allows the following BPF operations:
+ * - Loading all types of BPF programs
+ * - Creating all types of BPF maps except:
+ *    - stackmap that needs CAP_TRACING
+ *    - devmap that needs CAP_NET_ADMIN
+ *    - cpumap that needs CAP_SYS_ADMIN
+ * - Advanced verifier features
+ *   - Indirect variable access
+ *   - Bounded loops
+ *   - BPF to BPF function calls
+ *   - Scalar precision tracking
+ *   - Larger complexity limits
+ *   - Dead code elimination
+ *   - And potentially other features
+ * - Use of pointer-to-integer conversions in BPF programs
+ * - Bypassing of speculation attack hardening measures
+ * - Loading BPF Type Format (BTF) data
+ * - Iterate system wide loaded programs, maps, BTF objects
+ * - Retrieve xlated and JITed code of BPF programs
+ * - Access maps and programs via id
+ * - Use bpf_spin_lock() helper
+ *
+ * CAP_BPF and CAP_TRACING together allow the following:
+ * - bpf_probe_read to read arbitrary kernel memory
+ * - bpf_trace_printk to print data to ftrace ring buffer
+ * - Attach to raw_tracepoint
+ * - Query association between kprobe/tracepoint and bpf program
+ *
+ * CAP_BPF and CAP_NET_ADMIN together allow the following:
+ * - Attach to cgroup-bpf hooks and query
+ * - skb, xdp, flow_dissector test_run command
+ *
+ * CAP_NET_ADMIN allows:
+ * - Attach networking bpf programs to xdp, tc, lwt, flow dissector
+ */
+#define CAP_BPF			38
+
+/*
+ * CAP_TRACING allows:
+ * - Full use of perf_event_open(), similarly to the effect of
+ *   kernel.perf_event_paranoid == -1
+ * - Creation of [ku][ret]probe
+ * - Attach tracing bpf programs to perf events
+ */
+#define CAP_TRACING		39
 
-#define CAP_LAST_CAP         CAP_AUDIT_READ
+#define CAP_LAST_CAP         CAP_TRACING
 
 #define cap_valid(x) ((x) >= 0 && (x) <= CAP_LAST_CAP)
 
diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h
index 201f7e588a29..0b364e245163 100644
--- a/security/selinux/include/classmap.h
+++ b/security/selinux/include/classmap.h
@@ -26,9 +26,9 @@
 	    "audit_control", "setfcap"
 
 #define COMMON_CAP2_PERMS  "mac_override", "mac_admin", "syslog", \
-		"wake_alarm", "block_suspend", "audit_read"
+		"wake_alarm", "block_suspend", "audit_read", "bpf", "tracing"
 
-#if CAP_LAST_CAP > CAP_AUDIT_READ
+#if CAP_LAST_CAP > CAP_TRACING
 #error New capability defined, please update COMMON_CAP2_PERMS.
 #endif
 
-- 
2.20.0


* [PATCH net 11/11] net: remove unnecessary variables and callback
From: Taehee Yoo @ 2019-09-04 18:41 UTC (permalink / raw)
  To: davem, netdev, j.vosburgh, vfalico, andy, jiri, sd, roopa, saeedm,
	manishc, rahulv, kys, haiyangz, sthemmin, sashal, hare, varun,
	ubraun, kgraul
  Cc: ap420073

This patch removes the variables and the callback that are related to the
nested device structure.
Devices that can be nested have their own nest_level variable that
represents the depth of nested devices.
In the previous patch, new {lower/upper}_level variables were added and
they replace the old private nest_level variables.
So, this patch removes all 'nest_level' variables.

In order to avoid lockdep warnings, ->ndo_get_lock_subclass() was added
to get the lockdep subclass value, which is actually the lower nested
depth value.
But now, devices use dynamic lockdep keys to avoid lockdep warnings
instead of the subclass.
So, this patch removes the ->ndo_get_lock_subclass() callback.
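
To illustrate the difference between walking the lower devices on every query and reading a cached depth, here is a small userspace model. model_nest_level() is patterned on the dev_get_nest_level() function removed below. struct model_dev, model_link() and cached_level are hypothetical. The cache here uses the same zero-based convention as the removed function, which is an assumption and may differ from the base value of the kernel's lower_level.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_LOWER 4

/* Hypothetical device: a few lower devices plus a cached depth. */
struct model_dev {
	struct model_dev *lower[MODEL_MAX_LOWER];
	size_t n_lower;
	int cached_level;	/* 0 for a device with nothing below it */
};

/* Recursive walk, patterned on the removed dev_get_nest_level(). */
static inline int model_nest_level(const struct model_dev *dev)
{
	int max_nest = -1;

	for (size_t i = 0; i < dev->n_lower; i++) {
		int nest = model_nest_level(dev->lower[i]);

		if (max_nest < nest)
			max_nest = nest;
	}
	return max_nest + 1;
}

/*
 * Stack 'upper' on top of 'lower' and refresh the cached depth of 'upper'.
 * Only correct when devices are stacked bottom-up: this sketch does not
 * propagate the new depth to devices already above 'upper'.
 */
static inline int model_link(struct model_dev *upper, struct model_dev *lower)
{
	if (upper->n_lower >= MODEL_MAX_LOWER)
		return -1;
	upper->lower[upper->n_lower++] = lower;
	if (upper->cached_level < lower->cached_level + 1)
		upper->cached_level = lower->cached_level + 1;
	return 0;
}
```

Once the depth is maintained at link time, a caller reads one field instead of recursing, which is the shape of the smc and bonding changes in the diff.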

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
---
 drivers/net/bonding/bond_alb.c                |  2 +-
 drivers/net/bonding/bond_main.c               | 14 -------------
 .../net/ethernet/mellanox/mlx5/core/en_tc.c   |  2 +-
 drivers/net/macsec.c                          |  9 ---------
 drivers/net/macvlan.c                         |  7 -------
 include/linux/if_macvlan.h                    |  1 -
 include/linux/if_vlan.h                       | 12 -----------
 include/linux/netdevice.h                     | 12 -----------
 include/net/bonding.h                         |  1 -
 net/8021q/vlan.c                              |  1 -
 net/8021q/vlan_dev.c                          |  6 ------
 net/core/dev.c                                | 20 -------------------
 net/core/dev_addr_lists.c                     | 12 +++++------
 net/smc/smc_core.c                            |  2 +-
 net/smc/smc_pnet.c                            |  2 +-
 15 files changed, 10 insertions(+), 93 deletions(-)

diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
index 8c79bad2a9a5..4f2e6910c623 100644
--- a/drivers/net/bonding/bond_alb.c
+++ b/drivers/net/bonding/bond_alb.c
@@ -952,7 +952,7 @@ static int alb_upper_dev_walk(struct net_device *upper, void *_data)
 	struct bond_vlan_tag *tags;
 
 	if (is_vlan_dev(upper) &&
-	    bond->nest_level == vlan_get_encap_level(upper) - 1) {
+	    bond->dev->lower_level == upper->lower_level - 1) {
 		if (upper->addr_assign_type == NET_ADDR_STOLEN) {
 			alb_send_lp_vid(slave, mac_addr,
 					vlan_dev_vlan_proto(upper),
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 2b16683bb8b8..2749423649eb 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1733,8 +1733,6 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev,
 		goto err_upper_unlink;
 	}
 
-	bond->nest_level = dev_get_nest_level(bond_dev) + 1;
-
 	/* If the mode uses primary, then the following is handled by
 	 * bond_change_active_slave().
 	 */
@@ -1982,9 +1980,6 @@ static int __bond_release_one(struct net_device *bond_dev,
 	if (!bond_has_slaves(bond)) {
 		bond_set_carrier(bond);
 		eth_hw_addr_random(bond_dev);
-		bond->nest_level = SINGLE_DEPTH_NESTING;
-	} else {
-		bond->nest_level = dev_get_nest_level(bond_dev) + 1;
 	}
 
 	unblock_netpoll_tx();
@@ -3467,13 +3462,6 @@ static void bond_fold_stats(struct rtnl_link_stats64 *_res,
 	}
 }
 
-static int bond_get_nest_level(struct net_device *bond_dev)
-{
-	struct bonding *bond = netdev_priv(bond_dev);
-
-	return bond->nest_level;
-}
-
 static void bond_get_stats(struct net_device *bond_dev,
 			   struct rtnl_link_stats64 *stats)
 {
@@ -4293,7 +4281,6 @@ static const struct net_device_ops bond_netdev_ops = {
 	.ndo_neigh_setup	= bond_neigh_setup,
 	.ndo_vlan_rx_add_vid	= bond_vlan_rx_add_vid,
 	.ndo_vlan_rx_kill_vid	= bond_vlan_rx_kill_vid,
-	.ndo_get_lock_subclass  = bond_get_nest_level,
 #ifdef CONFIG_NET_POLL_CONTROLLER
 	.ndo_netpoll_setup	= bond_netpoll_setup,
 	.ndo_netpoll_cleanup	= bond_netpoll_cleanup,
@@ -4817,7 +4804,6 @@ static int bond_init(struct net_device *bond_dev)
 	if (!bond->wq)
 		return -ENOMEM;
 
-	bond->nest_level = SINGLE_DEPTH_NESTING;
 	bond_dev_set_lockdep_class(bond_dev);
 
 	list_add_tail(&bond->bond_list, &bn->dev_list);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 00b2d4a86159..e056f9aad8df 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -2797,7 +2797,7 @@ static int add_vlan_pop_action(struct mlx5e_priv *priv,
 			       struct mlx5_esw_flow_attr *attr,
 			       u32 *action)
 {
-	int nest_level = vlan_get_encap_level(attr->parse_attr->filter_dev);
+	int nest_level = attr->parse_attr->filter_dev->lower_level;
 	struct flow_action_entry vlan_act = {
 		.id = FLOW_ACTION_VLAN_POP,
 	};
diff --git a/drivers/net/macsec.c b/drivers/net/macsec.c
index 41ec1ed0d545..c0cb595f2bba 100644
--- a/drivers/net/macsec.c
+++ b/drivers/net/macsec.c
@@ -269,7 +269,6 @@ struct macsec_dev {
 	struct gro_cells gro_cells;
 	struct lock_class_key xmit_lock_key;
 	struct lock_class_key addr_lock_key;
-	unsigned int nest_level;
 };
 
 /**
@@ -2988,11 +2987,6 @@ static int macsec_get_iflink(const struct net_device *dev)
 	return macsec_priv(dev)->real_dev->ifindex;
 }
 
-static int macsec_get_nest_level(struct net_device *dev)
-{
-	return macsec_priv(dev)->nest_level;
-}
-
 static const struct net_device_ops macsec_netdev_ops = {
 	.ndo_init		= macsec_dev_init,
 	.ndo_uninit		= macsec_dev_uninit,
@@ -3006,7 +3000,6 @@ static const struct net_device_ops macsec_netdev_ops = {
 	.ndo_start_xmit		= macsec_start_xmit,
 	.ndo_get_stats64	= macsec_get_stats64,
 	.ndo_get_iflink		= macsec_get_iflink,
-	.ndo_get_lock_subclass  = macsec_get_nest_level,
 };
 
 static const struct device_type macsec_type = {
@@ -3289,8 +3282,6 @@ static int macsec_newlink(struct net *net, struct net_device *dev,
 	if (err < 0)
 		return err;
 
-	macsec->nest_level = dev_get_nest_level(real_dev) + 1;
-
 	err = netdev_upper_dev_link(real_dev, dev, extack);
 	if (err < 0)
 		goto unregister;
diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index dae368a2e8d1..2c14bc606514 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -867,11 +867,6 @@ static int macvlan_do_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
 #define MACVLAN_STATE_MASK \
 	((1<<__LINK_STATE_NOCARRIER) | (1<<__LINK_STATE_DORMANT))
 
-static int macvlan_get_nest_level(struct net_device *dev)
-{
-	return ((struct macvlan_dev *)netdev_priv(dev))->nest_level;
-}
-
 static void macvlan_dev_set_lockdep_one(struct net_device *dev,
 					struct netdev_queue *txq,
 					void *_unused)
@@ -1180,7 +1175,6 @@ static const struct net_device_ops macvlan_netdev_ops = {
 	.ndo_fdb_add		= macvlan_fdb_add,
 	.ndo_fdb_del		= macvlan_fdb_del,
 	.ndo_fdb_dump		= ndo_dflt_fdb_dump,
-	.ndo_get_lock_subclass  = macvlan_get_nest_level,
 #ifdef CONFIG_NET_POLL_CONTROLLER
 	.ndo_poll_controller	= macvlan_dev_poll_controller,
 	.ndo_netpoll_setup	= macvlan_dev_netpoll_setup,
@@ -1464,7 +1458,6 @@ int macvlan_common_newlink(struct net *src_net, struct net_device *dev,
 	vlan->dev      = dev;
 	vlan->port     = port;
 	vlan->set_features = MACVLAN_FEATURES;
-	vlan->nest_level = dev_get_nest_level(lowerdev) + 1;
 
 	vlan->mode     = MACVLAN_MODE_VEPA;
 	if (data && data[IFLA_MACVLAN_MODE])
diff --git a/include/linux/if_macvlan.h b/include/linux/if_macvlan.h
index ea5b41823287..e9202edcf101 100644
--- a/include/linux/if_macvlan.h
+++ b/include/linux/if_macvlan.h
@@ -29,7 +29,6 @@ struct macvlan_dev {
 	netdev_features_t	set_features;
 	enum macvlan_mode	mode;
 	u16			flags;
-	int			nest_level;
 	unsigned int		macaddr_count;
 	struct lock_class_key xmit_lock_key;
 	struct lock_class_key addr_lock_key;
diff --git a/include/linux/if_vlan.h b/include/linux/if_vlan.h
index 1aed9f613e90..6f30284a58e5 100644
--- a/include/linux/if_vlan.h
+++ b/include/linux/if_vlan.h
@@ -182,8 +182,6 @@ struct vlan_dev_priv {
 #ifdef CONFIG_NET_POLL_CONTROLLER
 	struct netpoll				*netpoll;
 #endif
-	unsigned int				nest_level;
-
 	struct lock_class_key			xmit_lock_key;
 	struct lock_class_key			addr_lock_key;
 };
@@ -224,11 +222,6 @@ extern void vlan_vids_del_by_dev(struct net_device *dev,
 
 extern bool vlan_uses_dev(const struct net_device *dev);
 
-static inline int vlan_get_encap_level(struct net_device *dev)
-{
-	BUG_ON(!is_vlan_dev(dev));
-	return vlan_dev_priv(dev)->nest_level;
-}
 #else
 static inline struct net_device *
 __vlan_find_dev_deep_rcu(struct net_device *real_dev,
@@ -298,11 +291,6 @@ static inline bool vlan_uses_dev(const struct net_device *dev)
 {
 	return false;
 }
-static inline int vlan_get_encap_level(struct net_device *dev)
-{
-	BUG();
-	return 0;
-}
 #endif
 
 /**
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index d9ca4a79f715..6bf493028c41 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1408,7 +1408,6 @@ struct net_device_ops {
 	void			(*ndo_dfwd_del_station)(struct net_device *pdev,
 							void *priv);
 
-	int			(*ndo_get_lock_subclass)(struct net_device *dev);
 	int			(*ndo_set_tx_maxrate)(struct net_device *dev,
 						      int queue_index,
 						      u32 maxrate);
@@ -4050,16 +4049,6 @@ static inline void netif_addr_lock(struct net_device *dev)
 	spin_lock(&dev->addr_list_lock);
 }
 
-static inline void netif_addr_lock_nested(struct net_device *dev)
-{
-	int subclass = SINGLE_DEPTH_NESTING;
-
-	if (dev->netdev_ops->ndo_get_lock_subclass)
-		subclass = dev->netdev_ops->ndo_get_lock_subclass(dev);
-
-	spin_lock_nested(&dev->addr_list_lock, subclass);
-}
-
 static inline void netif_addr_lock_bh(struct net_device *dev)
 {
 	spin_lock_bh(&dev->addr_list_lock);
@@ -4337,7 +4326,6 @@ void netdev_lower_state_changed(struct net_device *lower_dev,
 extern u8 netdev_rss_key[NETDEV_RSS_KEY_LEN] __read_mostly;
 void netdev_rss_key_fill(void *buffer, size_t len);
 
-int dev_get_nest_level(struct net_device *dev);
 int skb_checksum_help(struct sk_buff *skb);
 int skb_crc32c_csum_help(struct sk_buff *skb);
 int skb_csum_hwoffload_help(struct sk_buff *skb,
diff --git a/include/net/bonding.h b/include/net/bonding.h
index c39ac7061e41..74f41dd73866 100644
--- a/include/net/bonding.h
+++ b/include/net/bonding.h
@@ -203,7 +203,6 @@ struct bonding {
 	struct   slave __rcu *primary_slave;
 	struct   bond_up_slave __rcu *slave_arr; /* Array of usable slaves */
 	bool     force_primary;
-	u32      nest_level;
 	s32      slave_cnt; /* never change this value outside the attach/detach wrappers */
 	int     (*recv_probe)(const struct sk_buff *, struct bonding *,
 			      struct slave *);
diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c
index 54728d2eda18..d4bcfd8f95bf 100644
--- a/net/8021q/vlan.c
+++ b/net/8021q/vlan.c
@@ -172,7 +172,6 @@ int register_vlan_dev(struct net_device *dev, struct netlink_ext_ack *extack)
 	if (err < 0)
 		goto out_uninit_mvrp;
 
-	vlan->nest_level = dev_get_nest_level(real_dev) + 1;
 	err = register_netdevice(dev);
 	if (err < 0)
 		goto out_uninit_mvrp;
diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c
index 12bc80650087..e8707827540c 100644
--- a/net/8021q/vlan_dev.c
+++ b/net/8021q/vlan_dev.c
@@ -514,11 +514,6 @@ static void vlan_dev_set_lockdep_class(struct net_device *dev)
 	netdev_for_each_tx_queue(dev, vlan_dev_set_lockdep_one, NULL);
 }
 
-static int vlan_dev_get_lock_subclass(struct net_device *dev)
-{
-	return vlan_dev_priv(dev)->nest_level;
-}
-
 static const struct header_ops vlan_header_ops = {
 	.create	 = vlan_dev_hard_header,
 	.parse	 = eth_header_parse,
@@ -814,7 +809,6 @@ static const struct net_device_ops vlan_netdev_ops = {
 	.ndo_netpoll_cleanup	= vlan_dev_netpoll_cleanup,
 #endif
 	.ndo_fix_features	= vlan_dev_fix_features,
-	.ndo_get_lock_subclass  = vlan_dev_get_lock_subclass,
 	.ndo_get_iflink		= vlan_dev_get_iflink,
 };
 
diff --git a/net/core/dev.c b/net/core/dev.c
index ac055b531c96..73a69a7a3553 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -7510,26 +7510,6 @@ void *netdev_lower_dev_get_private(struct net_device *dev,
 }
 EXPORT_SYMBOL(netdev_lower_dev_get_private);
 
-
-int dev_get_nest_level(struct net_device *dev)
-{
-	struct net_device *lower = NULL;
-	struct list_head *iter;
-	int max_nest = -1;
-	int nest;
-
-	ASSERT_RTNL();
-
-	netdev_for_each_lower_dev(dev, lower, iter) {
-		nest = dev_get_nest_level(lower);
-		if (max_nest < nest)
-			max_nest = nest;
-	}
-
-	return max_nest + 1;
-}
-EXPORT_SYMBOL(dev_get_nest_level);
-
 /**
  * netdev_lower_change - Dispatch event about lower device state change
  * @lower_dev: device
diff --git a/net/core/dev_addr_lists.c b/net/core/dev_addr_lists.c
index 6393ba930097..2f949b5a1eb9 100644
--- a/net/core/dev_addr_lists.c
+++ b/net/core/dev_addr_lists.c
@@ -637,7 +637,7 @@ int dev_uc_sync(struct net_device *to, struct net_device *from)
 	if (to->addr_len != from->addr_len)
 		return -EINVAL;
 
-	netif_addr_lock_nested(to);
+	netif_addr_lock(to);
 	err = __hw_addr_sync(&to->uc, &from->uc, to->addr_len);
 	if (!err)
 		__dev_set_rx_mode(to);
@@ -667,7 +667,7 @@ int dev_uc_sync_multiple(struct net_device *to, struct net_device *from)
 	if (to->addr_len != from->addr_len)
 		return -EINVAL;
 
-	netif_addr_lock_nested(to);
+	netif_addr_lock(to);
 	err = __hw_addr_sync_multiple(&to->uc, &from->uc, to->addr_len);
 	if (!err)
 		__dev_set_rx_mode(to);
@@ -691,7 +691,7 @@ void dev_uc_unsync(struct net_device *to, struct net_device *from)
 		return;
 
 	netif_addr_lock_bh(from);
-	netif_addr_lock_nested(to);
+	netif_addr_lock(to);
 	__hw_addr_unsync(&to->uc, &from->uc, to->addr_len);
 	__dev_set_rx_mode(to);
 	netif_addr_unlock(to);
@@ -858,7 +858,7 @@ int dev_mc_sync(struct net_device *to, struct net_device *from)
 	if (to->addr_len != from->addr_len)
 		return -EINVAL;
 
-	netif_addr_lock_nested(to);
+	netif_addr_lock(to);
 	err = __hw_addr_sync(&to->mc, &from->mc, to->addr_len);
 	if (!err)
 		__dev_set_rx_mode(to);
@@ -888,7 +888,7 @@ int dev_mc_sync_multiple(struct net_device *to, struct net_device *from)
 	if (to->addr_len != from->addr_len)
 		return -EINVAL;
 
-	netif_addr_lock_nested(to);
+	netif_addr_lock(to);
 	err = __hw_addr_sync_multiple(&to->mc, &from->mc, to->addr_len);
 	if (!err)
 		__dev_set_rx_mode(to);
@@ -912,7 +912,7 @@ void dev_mc_unsync(struct net_device *to, struct net_device *from)
 		return;
 
 	netif_addr_lock_bh(from);
-	netif_addr_lock_nested(to);
+	netif_addr_lock(to);
 	__hw_addr_unsync(&to->mc, &from->mc, to->addr_len);
 	__dev_set_rx_mode(to);
 	netif_addr_unlock(to);
diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index 4ca50ddf8d16..a2e91b8d04b3 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -558,7 +558,7 @@ int smc_vlan_by_tcpsk(struct socket *clcsock, struct smc_init_info *ini)
 	}
 
 	rtnl_lock();
-	nest_lvl = dev_get_nest_level(ndev);
+	nest_lvl = ndev->lower_level;
 	for (i = 0; i < nest_lvl; i++) {
 		struct list_head *lower = &ndev->adj_list.lower;
 
diff --git a/net/smc/smc_pnet.c b/net/smc/smc_pnet.c
index bab2da8cf17a..2920b006f65c 100644
--- a/net/smc/smc_pnet.c
+++ b/net/smc/smc_pnet.c
@@ -718,7 +718,7 @@ static struct net_device *pnet_find_base_ndev(struct net_device *ndev)
 	int i, nest_lvl;
 
 	rtnl_lock();
-	nest_lvl = dev_get_nest_level(ndev);
+	nest_lvl = ndev->lower_level;
 	for (i = 0; i < nest_lvl; i++) {
 		struct list_head *lower = &ndev->adj_list.lower;
 
-- 
2.17.1



* [PATCH net 10/11] vxlan: add adjacent link to limit depth level
From: Taehee Yoo @ 2019-09-04 18:41 UTC (permalink / raw)
  To: davem, netdev, j.vosburgh, vfalico, andy, jiri, sd, roopa, saeedm,
	manishc, rahulv, kys, haiyangz, sthemmin, sashal, hare, varun,
	ubraun, kgraul
  Cc: ap420073

The current vxlan code doesn't limit the number of nested devices.
Nested devices are handled recursively, and this recursion needs a
large amount of stack memory. So an unlimited number of nested devices
can cause a stack overflow.

In order to fix this issue, this patch adds adjacent links.
The adjacent link APIs internally check the depth level.

Test commands:
    ip link add dummy0 type dummy
    ip link add vxlan0 type vxlan id 0 group 239.1.1.1 dev dummy0 \
	    dstport 4789
    for i in {1..100}
    do
            let A=$i-1
            ip link add vxlan$i type vxlan id $i group 239.1.1.1 \
		    dev vxlan$A dstport 4789
    done
    ip link del dummy0

The top upper link is vxlan100 and the lowest link is vxlan0.
When vxlan0 is deleted, the upper devices are deleted recursively.
This needs a large amount of stack memory, so it causes a stack overflow.

Splat looks like:
[  229.628477] =============================================================================
[  229.629785] BUG page->ptl (Not tainted): Padding overwritten. 0x0000000026abf214-0x0000000091f6abb2
[  229.629785] -----------------------------------------------------------------------------
[  229.629785]
[  229.655439] ==================================================================
[  229.629785] INFO: Slab 0x00000000ff7cfda8 objects=19 used=19 fp=0x00000000fe33776c flags=0x200000000010200
[  229.655688] BUG: KASAN: stack-out-of-bounds in unmap_single_vma+0x25a/0x2e0
[  229.655688] Read of size 8 at addr ffff888113076928 by task vlan-network-in/2334
[  229.655688]
[  229.629785] Padding 0000000026abf214: 00 80 14 0d 81 88 ff ff 68 91 81 14 81 88 ff ff  ........h.......
[  229.629785] Padding 0000000001e24790: 38 91 81 14 81 88 ff ff 68 91 81 14 81 88 ff ff  8.......h.......
[  229.629785] Padding 00000000b39397c8: 33 30 62 a7 ff ff ff ff ff eb 60 22 10 f1 ff 1f  30b.......`"....
[  229.629785] Padding 00000000bc98f53a: 80 60 07 13 81 88 ff ff 00 80 14 0d 81 88 ff ff  .`..............
[  229.629785] Padding 000000002aa8123d: 68 91 81 14 81 88 ff ff f7 21 17 a7 ff ff ff ff  h........!......
[  229.629785] Padding 000000001c8c2369: 08 81 14 0d 81 88 ff ff 03 02 00 00 00 00 00 00  ................
[  229.629785] Padding 000000004e290c5d: 21 90 a2 21 10 ed ff ff 00 00 00 00 00 fc ff df  !..!............
[  229.629785] Padding 000000000e25d731: 18 60 07 13 81 88 ff ff c0 8b 13 05 81 88 ff ff  .`..............
[  229.629785] Padding 000000007adc7ab3: b3 8a b5 41 00 00 00 00                          ...A....
[  229.629785] FIX page->ptl: Restoring 0x0000000026abf214-0x0000000091f6abb2=0x5a
[  ... ]


Fixes: acaf4e70997f ("net: vxlan: when lower dev unregisters remove vxlan dev as well")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
---
 drivers/net/vxlan.c | 71 ++++++++++++++++++++++++++++++++++++++-------
 include/net/vxlan.h |  1 +
 2 files changed, 62 insertions(+), 10 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 3d9bcc957f7d..0d5c8d22d8a4 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -3567,6 +3567,8 @@ static int __vxlan_dev_create(struct net *net, struct net_device *dev,
 	struct vxlan_net *vn = net_generic(net, vxlan_net_id);
 	struct vxlan_dev *vxlan = netdev_priv(dev);
 	struct vxlan_fdb *f = NULL;
+	struct net_device *remote_dev = NULL;
+	struct vxlan_rdst *dst = &vxlan->default_dst;
 	bool unregister = false;
 	int err;
 
@@ -3577,14 +3579,14 @@ static int __vxlan_dev_create(struct net *net, struct net_device *dev,
 	dev->ethtool_ops = &vxlan_ethtool_ops;
 
 	/* create an fdb entry for a valid default destination */
-	if (!vxlan_addr_any(&vxlan->default_dst.remote_ip)) {
+	if (!vxlan_addr_any(&dst->remote_ip)) {
 		err = vxlan_fdb_create(vxlan, all_zeros_mac,
-				       &vxlan->default_dst.remote_ip,
+				       &dst->remote_ip,
 				       NUD_REACHABLE | NUD_PERMANENT,
 				       vxlan->cfg.dst_port,
-				       vxlan->default_dst.remote_vni,
-				       vxlan->default_dst.remote_vni,
-				       vxlan->default_dst.remote_ifindex,
+				       dst->remote_vni,
+				       dst->remote_vni,
+				       dst->remote_ifindex,
 				       NTF_SELF, &f);
 		if (err)
 			return err;
@@ -3595,26 +3597,43 @@ static int __vxlan_dev_create(struct net *net, struct net_device *dev,
 		goto errout;
 	unregister = true;
 
+	if (dst->remote_ifindex) {
+		remote_dev = __dev_get_by_index(net, dst->remote_ifindex);
+		if (!remote_dev)
+			goto errout;
+
+		err = netdev_upper_dev_link(remote_dev, dev, extack);
+		if (err)
+			goto errout;
+	}
+
 	err = rtnl_configure_link(dev, NULL);
 	if (err)
-		goto errout;
+		goto unlink;
 
 	if (f) {
-		vxlan_fdb_insert(vxlan, all_zeros_mac,
-				 vxlan->default_dst.remote_vni, f);
+		vxlan_fdb_insert(vxlan, all_zeros_mac, dst->remote_vni, f);
 
 		/* notify default fdb entry */
 		err = vxlan_fdb_notify(vxlan, f, first_remote_rtnl(f),
 				       RTM_NEWNEIGH, true, extack);
 		if (err) {
 			vxlan_fdb_destroy(vxlan, f, false, false);
+			if (remote_dev)
+				netdev_upper_dev_unlink(remote_dev, dev);
 			goto unregister;
 		}
 	}
 
 	list_add(&vxlan->next, &vn->vxlan_list);
+	if (remote_dev) {
+		dst->remote_dev = remote_dev;
+		dev_hold(remote_dev);
+	}
 	return 0;
-
+unlink:
+	if (remote_dev)
+		netdev_upper_dev_unlink(remote_dev, dev);
 errout:
 	/* unregister_netdevice() destroys the default FDB entry with deletion
 	 * notification. But the addition notification was not sent yet, so
@@ -3936,6 +3955,8 @@ static int vxlan_changelink(struct net_device *dev, struct nlattr *tb[],
 	struct net_device *lowerdev;
 	struct vxlan_config conf;
 	int err;
+	bool linked = false;
+	bool disabled = false;
 
 	err = vxlan_nl2conf(tb, data, dev, &conf, true, extack);
 	if (err)
@@ -3946,6 +3967,16 @@ static int vxlan_changelink(struct net_device *dev, struct nlattr *tb[],
 	if (err)
 		return err;
 
+	if (lowerdev) {
+		if (dst->remote_dev && lowerdev != dst->remote_dev) {
+			netdev_adjacent_dev_disable(dst->remote_dev, dev);
+			disabled = true;
+		}
+		err = netdev_upper_dev_link(lowerdev, dev, extack);
+		if (err)
+			goto err;
+		linked = true;
+	}
 	/* handle default dst entry */
 	if (!vxlan_addr_equal(&conf.remote_ip, &dst->remote_ip)) {
 		u32 hash_index = fdb_head_index(vxlan, all_zeros_mac, conf.vni);
@@ -3962,7 +3993,7 @@ static int vxlan_changelink(struct net_device *dev, struct nlattr *tb[],
 					       NTF_SELF, true, extack);
 			if (err) {
 				spin_unlock_bh(&vxlan->hash_lock[hash_index]);
-				return err;
+				goto err;
 			}
 		}
 		if (!vxlan_addr_any(&dst->remote_ip))
@@ -3979,8 +4010,24 @@ static int vxlan_changelink(struct net_device *dev, struct nlattr *tb[],
 	if (conf.age_interval != vxlan->cfg.age_interval)
 		mod_timer(&vxlan->age_timer, jiffies);
 
+	if (disabled) {
+		netdev_adjacent_dev_enable(dst->remote_dev, dev);
+		netdev_upper_dev_unlink(dst->remote_dev, dev);
+		dev_put(dst->remote_dev);
+	}
+	if (linked) {
+		dst->remote_dev = lowerdev;
+		dev_hold(dst->remote_dev);
+	}
+
 	vxlan_config_apply(dev, &conf, lowerdev, vxlan->net, true);
 	return 0;
+err:
+	if (linked)
+		netdev_upper_dev_unlink(lowerdev, dev);
+	if (disabled)
+		netdev_adjacent_dev_enable(dst->remote_dev, dev);
+	return err;
 }
 
 static void vxlan_dellink(struct net_device *dev, struct list_head *head)
@@ -3991,6 +4038,10 @@ static void vxlan_dellink(struct net_device *dev, struct list_head *head)
 
 	list_del(&vxlan->next);
 	unregister_netdevice_queue(dev, head);
+	if (vxlan->default_dst.remote_dev) {
+		netdev_upper_dev_unlink(vxlan->default_dst.remote_dev, dev);
+		dev_put(vxlan->default_dst.remote_dev);
+	}
 }
 
 static size_t vxlan_get_size(const struct net_device *dev)
diff --git a/include/net/vxlan.h b/include/net/vxlan.h
index dc1583a1fb8a..08e237d7aa73 100644
--- a/include/net/vxlan.h
+++ b/include/net/vxlan.h
@@ -197,6 +197,7 @@ struct vxlan_rdst {
 	u8			 offloaded:1;
 	__be32			 remote_vni;
 	u32			 remote_ifindex;
+	struct net_device	 *remote_dev;
 	struct list_head	 list;
 	struct rcu_head		 rcu;
 	struct dst_cache	 dst_cache;
-- 
2.17.1

