* Re: bpf memory model. Was: [PATCH v4 bpf-next 1/9] bpf: introduce bpf_spin_lock
From: Alexei Starovoitov @ 2019-01-30 2:32 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Alexei Starovoitov, davem, daniel, jakub.kicinski, netdev,
kernel-team, mingo, will.deacon, Paul McKenney, jannh
In-Reply-To: <20190129091654.GD28485@hirez.programming.kicks-ass.net>
On Tue, Jan 29, 2019 at 10:16:54AM +0100, Peter Zijlstra wrote:
> On Mon, Jan 28, 2019 at 01:56:24PM -0800, Alexei Starovoitov wrote:
> > On Mon, Jan 28, 2019 at 10:24:08AM +0100, Peter Zijlstra wrote:
>
> > > Ah, but the loop won't be in the BPF program itself. The BPF program
> > > would only have had the BPF_SPIN_LOCK instruction, the JIT them emits
> > > code similar to queued_spin_lock()/queued_spin_unlock() (or calls to
> > > out-of-line versions of them).
> >
> > As I said we considered exactly that and such approach has a lot of downsides
> > comparing with the helper approach.
> > Pretty much every time new feature is added we're evaluating whether it
> > should be new instruction or new helper. 99% of the time we go with new helper.
>
> Ah; it seems I'm confused on helper vs instruction. As in, I've no idea
> what a helper is.
bpf helper is a normal kernel function that can be called from bpf program.
In assembler it's a direct function call.
> > > There isn't anything that mandates the JIT uses the exact same locking
> > > routines the interpreter does, is there?
> >
> > sure. This bpf_spin_lock() helper can be optimized whichever way the kernel wants.
> > Like bpf_map_lookup_elem() call is _inlined_ by the verifier for certain map types.
> > JITs don't even need to do anything. It looks like function call from bpf prog
> > point of view, but in JITed code it is a sequence of native instructions.
> >
> > Say tomorrow we find out that bpf_prog->bpf_spin_lock()->queued_spin_lock()
> > takes too much time then we can inline fast path of queued_spin_lock
> > directly into bpf prog and save function call cost.
>
> OK, so then the JIT can optimize helpers. Would it not make sense to
> have the simple test-and-set spinlock in the generic code and have the
> JITs use arch_spinlock_t where appropriate?
I think that pretty much the same as what I have with qspinlock.
Instead of taking a risk how JIT writers implement bpf_spin_lock optimization
I'm using qspinlock on architectures that are known to support it.
So instead of starting with dumb test-and-set there will be faster
qspinlock from the start on x86, arm64 and few others archs.
Those are the archs we care about the most anyway. Other archs can take
time to optimize it (if optimizations are necessary at all).
In general hacking JITs is much harder and more error prone than
changing core and adding helpers. Hence we avoid touching JITs
as much as possible.
Like map_lookup inlining optimization we do only when JIT is on.
And we do it purely in the generic core. See array_map_gen_lookup().
We generate bpf instructions only to feed them into JITs so they
can replace them with native asm. That is much easier to implement
correctly than if we were doing inlining in every JIT independently.
^ permalink raw reply
* Re: [PATCH iproute2-next v2] netns: add subcommand to attach an existing network namespace
From: David Ahern @ 2019-01-30 2:32 UTC (permalink / raw)
To: Matteo Croce, netdev; +Cc: Stephen Hemminger, Andrea Claudi
In-Reply-To: <20190129150115.19965-1-mcroce@redhat.com>
On 1/29/19 8:01 AM, Matteo Croce wrote:
> ip tracks namespaces with dummy files in /var/run/netns/, but can't see
> namespaces created with other tools.
> Creating the dummy file and bind mounting the correct procfs entry will
> make ip aware of that namespace.
> Add an ip netns subcommand to automate this task.
>
> Signed-off-by: Matteo Croce <mcroce@redhat.com>
> ---
> ip/ipnetns.c | 62 ++++++++++++++++++++++++++++++++++-----------
> man/man8/ip-netns.8 | 10 ++++++++
> 2 files changed, 57 insertions(+), 15 deletions(-)
>
applied to iproute2-next. Thanks
^ permalink raw reply
* Re: [PATCH iproute2-next 2/2] ss: add AF_XDP support
From: David Ahern @ 2019-01-30 2:39 UTC (permalink / raw)
To: bjorn.topel, netdev, stephen
Cc: Björn Töpel, magnus.karlsson, magnus.karlsson
In-Reply-To: <20190125071848.25959-3-bjorn.topel@gmail.com>
On 1/25/19 12:18 AM, bjorn.topel@gmail.com wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
>
> AF_XDP is an address family that is optimized for high performance
> packet processing.
>
> This patch adds AF_XDP support to ss(8) so that sockets can be queried
> and monitored.
>
> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
> ---
> man/man8/ss.8 | 9 ++-
> misc/ss.c | 168 +++++++++++++++++++++++++++++++++++++++++++++++++-
> 2 files changed, 172 insertions(+), 5 deletions(-)
>
AF_XDP is not currently defined for a number of distributions. Add a
definition to include/utils.h similar to what is done for MPLS.
Also, please add example output to the commit log.
^ permalink raw reply
* Re: ethtool - manual changes in ethtool-copy.h
From: Maciej Żenczykowski @ 2019-01-30 2:42 UTC (permalink / raw)
To: John W. Linville; +Cc: Michal Kubecek, Linux NetDev
In-Reply-To: <CANP3RGf-MmOYgBV4fQwoRbg_+OAvKTFKGM2EorR3ig9WfevHZg@mail.gmail.com>
Eh, I think this patch can simply be reverted.
This is a dupe. ethtool-copy.h is only included from internal.h which
already includes:
...
#ifndef ETHTOOL_INTERNAL_H__
#define ETHTOOL_INTERNAL_H__
/* Some platforms (eg. ppc64) need __SANE_USERSPACE_TYPES__ before
* <linux/types.h> to select 'int-ll64.h' and avoid compile warnings
* when printing __u64 with %llu.
*/
#define __SANE_USERSPACE_TYPES__
...
which came in:
commit c0a2c04b3cbf6d399a2551654401957ddb529a50
Author: Maciej Żenczykowski <maze@google.com>
Date: Fri Mar 11 09:58:14 2016 -0800
internal.h: change to new sane kernel headers on 64-bit archs
On ppc64, this fixes:
...
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David Decotigny <decot@googlers.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
I think I missed this while trying to synchronize internal with upstream repos.
(the local copy of the patch was obviously in the wrong spot)
^ permalink raw reply
* [PATCH] Revert "ethtool: change to new sane powerpc64 kernel headers"
From: Maciej Żenczykowski @ 2019-01-30 2:48 UTC (permalink / raw)
To: Maciej Żenczykowski, John W . Linville; +Cc: netdev
In-Reply-To: <CANP3RGdkA4LNNiC3UPcz3b4oGMBPFD2uEyKBMovXJz79eqiO-g@mail.gmail.com>
From: Maciej Żenczykowski <maze@google.com>
This reverts commit 4df55c81996dfb1dbe98c93ee62d8067ed5073a9.
It turns out this is not needed due to:
commit c0a2c04b3cbf6d399a2551654401957ddb529a50
internal.h: change to new sane kernel headers on 64-bit archs
which I apparently entirely forgot about while trying
to synchronize internal and upstream git repositories.
Change-Id: I56d90a3c1e9b66c30526824fb7bc41aab01d85d1
Signed-off-by: Maciej Żenczykowski <maze@google.com>
---
ethtool-copy.h | 6 ------
1 file changed, 6 deletions(-)
diff --git a/ethtool-copy.h b/ethtool-copy.h
index 7772a4970987..6bfbb85f9402 100644
--- a/ethtool-copy.h
+++ b/ethtool-copy.h
@@ -14,12 +14,6 @@
#ifndef _LINUX_ETHTOOL_H
#define _LINUX_ETHTOOL_H
-#ifdef __powerpc64__
-/* Powerpc needs __SANE_USERSPACE_TYPES__ before <linux/types.h> to select
- * 'int-ll64.h' and avoid compile warnings when printing __u64 with %llu.
- */
-#define __SANE_USERSPACE_TYPES__
-#endif
#include <linux/kernel.h>
#include <linux/types.h>
#include <linux/if_ether.h>
--
2.20.1.495.gaa96b0ce6b-goog
^ permalink raw reply related
* Re: BUG: vnet0 selects TX queue 11, but real number of TX queues is 11
From: Stanislav Fomichev @ 2019-01-30 2:52 UTC (permalink / raw)
To: George Amanakis; +Cc: linux-kernel, Netdev
In-Reply-To: <20190130021621.17250-1-gamanakis@gmail.com>
On Tue, Jan 29, 2019 at 6:16 PM George Amanakis <gamanakis@gmail.com> wrote:
>
> Since 4.20.4 when running a KVM with vhost_net I am seeing in dmesg:
> vnet0 selects TX queue 11, but real number of TX queues is 11
>
> The corresponding part in the xml definition of the virtual machine is:
> -------8<-------
> <interface type='bridge'>
> <mac address='xx:xx:xx:xx:xx:xx'/>
> <source bridge='br0'/>
> <model type='virtio'/>
> <driver name='vhost' queues='12'>
> </driver>
> <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
> </interface>
> -------8<-------
>
> Doing a git-bisect with 4.20.3 last known good, and 4.20.4 as bad, this
> commit turned up:
> -------8<-------
> commit 9ff0436e2c3575ffe64d359fb3b67aee237dc519
> Author: Stanislav Fomichev <sdf@google.com>
> Date: Mon Jan 7 13:38:38 2019 -0800
>
> tun: publish tfile after it's fully initialized
>
> [ Upstream commit 0b7959b6257322f7693b08a459c505d4938646f2 ]
>
> BUG: unable to handle kernel NULL pointer dereference at
> 00000000000000d1
> -------8<-------
>
>
> Applying the following patch corrects it in 4.20.5. Would this be the
> correct thing to do?
Ouch, tun_set_real_num_queues uses tun->numqueues internally :-(
Your patch looks good to me, care to do a proper submission (with a Fixes tag)?
I wonder whether we should use it as an opportunity to also do
something like the following (to make it more explicit):
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 18656c4094b3..ea9928b3b930 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -632,10 +632,10 @@ static inline bool tun_not_capable(struct tun_struct *tun)
!ns_capable(net->user_ns, CAP_NET_ADMIN);
}
-static void tun_set_real_num_queues(struct tun_struct *tun)
+static void tun_set_real_num_queues(struct tun_struct *tun, unsigned int nr)
{
- netif_set_real_num_tx_queues(tun->dev, tun->numqueues);
- netif_set_real_num_rx_queues(tun->dev, tun->numqueues);
+ netif_set_real_num_tx_queues(tun->dev, nr);
+ netif_set_real_num_rx_queues(tun->dev, nr);
}
static void tun_disable_queue(struct tun_struct *tun, struct tun_file *tfile)
@@ -712,7 +712,7 @@ static void __tun_detach(struct tun_file *tfile, bool clean)
tun_flow_delete_by_queue(tun, tun->numqueues + 1);
/* Drop read queue */
tun_queue_purge(tfile);
- tun_set_real_num_queues(tun);
+ tun_set_real_num_queues(tun, tun->numqueues);
} else if (tfile->detached && clean) {
tun = tun_enable_queue(tfile);
sock_put(&tfile->sk);
@@ -866,8 +866,6 @@ static int tun_attach(struct tun_struct *tun,
struct file *file,
if (rtnl_dereference(tun->xdp_prog))
sock_set_flag(&tfile->sk, SOCK_XDP);
- tun_set_real_num_queues(tun);
-
/* device is allowed to go away first, so no need to hold extra
* refcnt.
*/
@@ -879,6 +877,7 @@ static int tun_attach(struct tun_struct *tun,
struct file *file,
rcu_assign_pointer(tfile->tun, tun);
rcu_assign_pointer(tun->tfiles[tun->numqueues], tfile);
tun->numqueues++;
+ tun_set_real_num_queues(tun, tun->numqueues);
out:
return err;
}
> ---
> drivers/net/tun.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 6658658246d2..e0dc004c6483 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -862,8 +862,6 @@ static int tun_attach(struct tun_struct *tun, struct file *file,
> if (rtnl_dereference(tun->xdp_prog))
> sock_set_flag(&tfile->sk, SOCK_XDP);
>
> - tun_set_real_num_queues(tun);
> -
> /* device is allowed to go away first, so no need to hold extra
> * refcnt.
> */
> @@ -875,6 +873,9 @@ static int tun_attach(struct tun_struct *tun, struct file *file,
> rcu_assign_pointer(tfile->tun, tun);
> rcu_assign_pointer(tun->tfiles[tun->numqueues], tfile);
> tun->numqueues++;
> +
> + tun_set_real_num_queues(tun);
> +
> out:
> return err;
> }
> --
> 2.20.1
>
^ permalink raw reply related
* [PATCH][RESEND] ath10k: snoc: remove set but not used variable 'ar_snoc'
From: YueHaibing @ 2019-01-30 3:09 UTC (permalink / raw)
To: kvalo, davem, ath10k; +Cc: linux-kernel, netdev, linux-wireless, YueHaibing
Fixes gcc '-Wunused-but-set-variable' warning:
drivers/net/wireless/ath/ath10k/snoc.c: In function 'ath10k_snoc_tx_pipe_cleanup':
drivers/net/wireless/ath/ath10k/snoc.c:681:22: warning:
variable 'ar_snoc' set but not used [-Wunused-but-set-variable]
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Brian Norris <briannorris@chromium.org>
---
drivers/net/wireless/ath/ath10k/snoc.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/drivers/net/wireless/ath/ath10k/snoc.c b/drivers/net/wireless/ath/ath10k/snoc.c
index 54efe6b..b7493df 100644
--- a/drivers/net/wireless/ath/ath10k/snoc.c
+++ b/drivers/net/wireless/ath/ath10k/snoc.c
@@ -875,13 +875,11 @@ static void ath10k_snoc_tx_pipe_cleanup(struct ath10k_snoc_pipe *snoc_pipe)
{
struct ath10k_ce_pipe *ce_pipe;
struct ath10k_ce_ring *ce_ring;
- struct ath10k_snoc *ar_snoc;
struct sk_buff *skb;
struct ath10k *ar;
int i;
ar = snoc_pipe->hif_ce_state;
- ar_snoc = ath10k_snoc_priv(ar);
ce_pipe = snoc_pipe->ce_hdl;
ce_ring = ce_pipe->src_ring;
--
2.7.4
^ permalink raw reply related
* [PATCH][RESEND] rtlwifi: remove set but not used variable 'cmd_seq'
From: YueHaibing @ 2019-01-30 3:15 UTC (permalink / raw)
To: kvalo, davem, pkshih; +Cc: linux-kernel, netdev, linux-wireless, YueHaibing
Fixes gcc '-Wunused-but-set-variable' warning:
drivers/net/wireless/realtek/rtlwifi/base.c: In function 'rtl_c2h_content_parsing':
drivers/net/wireless/realtek/rtlwifi/base.c:2313:13: warning:
variable 'cmd_seq' set but not used [-Wunused-but-set-variable]
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Acked-by: Ping-Ke Shih <pkshih@realtek.com>
---
drivers/net/wireless/realtek/rtlwifi/base.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/net/wireless/realtek/rtlwifi/base.c b/drivers/net/wireless/realtek/rtlwifi/base.c
index ef9b502..ee726f4 100644
--- a/drivers/net/wireless/realtek/rtlwifi/base.c
+++ b/drivers/net/wireless/realtek/rtlwifi/base.c
@@ -2311,11 +2311,10 @@ static void rtl_c2h_content_parsing(struct ieee80211_hw *hw,
struct rtl_priv *rtlpriv = rtl_priv(hw);
struct rtl_hal_ops *hal_ops = rtlpriv->cfg->ops;
const struct rtl_btc_ops *btc_ops = rtlpriv->btcoexist.btc_ops;
- u8 cmd_id, cmd_seq, cmd_len;
+ u8 cmd_id, cmd_len;
u8 *cmd_buf = NULL;
cmd_id = GET_C2H_CMD_ID(skb->data);
- cmd_seq = GET_C2H_SEQ(skb->data);
cmd_len = skb->len - C2H_DATA_OFFSET;
cmd_buf = GET_C2H_DATA_PTR(skb->data);
--
2.7.4
^ permalink raw reply related
* Re: [PATCH bpf 0/2] bpf: btf: allow typedef func_proto
From: Alexei Starovoitov @ 2019-01-30 3:18 UTC (permalink / raw)
To: Yonghong Song; +Cc: netdev, Alexei Starovoitov, Daniel Borkmann, kernel-team
In-Reply-To: <20190130003816.1043826-1-yhs@fb.com>
On Tue, Jan 29, 2019 at 04:38:16PM -0800, Yonghong Song wrote:
> The current btf implementation disallows the typedef of
> a func_proto type. This actually is allowed per C standard.
> This patch fixed btf verification to permit such types.
> Patch #1 fixed the kernel side and Patch #2 fixed
> the tools test_btf test.
Applied, Thanks
^ permalink raw reply
* Re: [PATCH bpf-next v3 1/4] bpf: add plumbing for BPF_LWT_ENCAP_IP in bpf_lwt_push_encap
From: David Ahern @ 2019-01-30 3:29 UTC (permalink / raw)
To: Peter Oskolkov, Alexei Starovoitov, Daniel Borkmann, netdev
Cc: Peter Oskolkov, Willem de Bruijn
In-Reply-To: <20190129011217.192510-2-posk@google.com>
On 1/28/19 6:12 PM, Peter Oskolkov wrote
> @@ -2583,7 +2594,15 @@ enum bpf_ret_code {
> BPF_DROP = 2,
> /* 3-6 reserved */
> BPF_REDIRECT = 7,
> - /* >127 are reserved for prog type specific return codes */
> + /* >127 are reserved for prog type specific return codes.
> + *
> + * BPF_LWT_REROUTE: used by BPF_PROG_TYPE_LWT_IN and
> + * BPF_PROG_TYPE_LWT_XMIT to indicate that skb's dst
> + * has changed and appropriate dst_input() or dst_output()
> + * action has to be taken (this is an L3 redirect, as
> + * opposed to L2 redirect represented by BPF_REDIRECT above).
> + */
> + BPF_LWT_REROUTE = 128,
> };
What happens if a program pushes a new header onto the skb and does not
return BPF_LWT_REROUTE?
Might be better to move the route lookup and dst swap to run_lwt_bpf and
only do it if the program returns BPF_LWT_REROUTE. That allows calling
bpf_push_ip_encap without requiring a route lookup. That might be fine
as long as their is not a protocol mismatch (ipv4 packet gets an ipv6
header or vice versa). But then, I think you have the mismatch problem
now if the program does not return BPF_LWT_REROUTE.
^ permalink raw reply
* Re: [PATCHv4 2/3] net: dsa: mt7530: support the 7530 switch on the Mediatek MT7621 SoC
From: Florian Fainelli @ 2019-01-30 3:39 UTC (permalink / raw)
To: gerg, sean.wang, linux-mediatek, bjorn, andrew, vivien.didelot,
netdev
Cc: rene, john, neil
In-Reply-To: <20190130012406.28271-3-gerg@kernel.org>
On 1/29/19 5:24 PM, gerg@kernel.org wrote:
> From: Greg Ungerer <gerg@kernel.org>
>
> The MediaTek MT7621 SoC device contains a 7530 switch, and the existing
> linux kernel 7530 DSA switch driver can be used with it.
>
> The bulk of the changes required stem from the 7621 having different
> regulator and pad setup. The existing setup of these in the 7530
> driver appears to be very specific to its implemtation in the Mediatek
> 7623 SoC. (Not entirely surprising given the 7623 is a quad core ARM
> based SoC, and the 7621 is a dual core, dual thread MIPS based SoC).
>
> Create a new devicetree type, "mediatek,mt7621", to support the 7530
> switch in the 7621 SoC. There appears to be no usable ID register to
> distinguish it from a 7530 in other hardware at runtime. This is used
> to carry out the appropriate configuration and setup.
>
> Signed-off-by: Greg Ungerer <gerg@kernel.org>
> Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
--
Florian
^ permalink raw reply
* Re: [PATCHv4 3/3] dt-bindings: net: dsa: add new MT7530 binding to support MT7621
From: Florian Fainelli @ 2019-01-30 3:39 UTC (permalink / raw)
To: gerg, sean.wang, linux-mediatek, bjorn, andrew, vivien.didelot,
netdev
Cc: rene, john, neil
In-Reply-To: <20190130012406.28271-4-gerg@kernel.org>
On 1/29/19 5:24 PM, gerg@kernel.org wrote:
> From: Greg Ungerer <gerg@kernel.org>
>
> Add devicetree binding to support the compatible mt7530 switch as used
> in the MediaTek MT7621 SoC.
>
> Signed-off-by: Greg Ungerer <gerg@kernel.org>
> Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
--
Florian
^ permalink raw reply
* Re: [PATCHv4 1/3] net: ethernet: mediatek: support MT7621 SoC ethernet hardware
From: Florian Fainelli @ 2019-01-30 3:40 UTC (permalink / raw)
To: gerg, sean.wang, linux-mediatek, bjorn, andrew, vivien.didelot,
netdev
Cc: rene, john, neil
In-Reply-To: <20190130012406.28271-2-gerg@kernel.org>
On 1/29/19 5:24 PM, gerg@kernel.org wrote:
> From: Bjørn Mork <bjorn@mork.no>
>
> The Mediatek MT7621 SoC contains the same ethernet hardware module as
> used on a number of other MediaTek SoC parts. There are some minor
> differences to deal with but we can use the same driver to support
> them all.
>
> This patch is based on work by Bjørn Mork <bjorn@mork.no>, and his
> original patch is at:
>
> https://github.com/bmork/LEDE/commit/3293bc63f5461ca1eb0bbc4ed90145335e7e3404
>
> There is an additional compatible devicetree type added, and the primary
> change to the code required is to support a single interrupt (for both
> RX and TX interrupts).
>
> Signed-off-by: Bjørn Mork <bjorn@mork.no>
> [gerg@kernel.org: rebase to mainline and irq handler fix]
> Signed-off-by: Greg Ungerer <gerg@kernel.org>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
> ---
> drivers/net/ethernet/mediatek/Kconfig | 2 +-
> drivers/net/ethernet/mediatek/mtk_eth_soc.c | 48 ++++++++++++++++++---
> drivers/net/ethernet/mediatek/mtk_eth_soc.h | 4 ++
> 3 files changed, 46 insertions(+), 8 deletions(-)
>
> v2: first in series for this change
> v3: rebase onto 5.0-rc3
> v4: rebase onto 5.0-rc4
>
> diff --git a/drivers/net/ethernet/mediatek/Kconfig b/drivers/net/ethernet/mediatek/Kconfig
> index f9149d2a4694..43656f961891 100644
> --- a/drivers/net/ethernet/mediatek/Kconfig
> +++ b/drivers/net/ethernet/mediatek/Kconfig
> @@ -1,6 +1,6 @@
> config NET_VENDOR_MEDIATEK
> bool "MediaTek ethernet driver"
> - depends on ARCH_MEDIATEK
> + depends on ARCH_MEDIATEK || SOC_MT7621
Would be nice to add COMPILE_TEST there as well at some point so someone
can easily build test changes to the driver.
--
Florian
^ permalink raw reply
* [PATCH] tun: move the call to tun_set_real_num_queues
From: George Amanakis @ 2019-01-30 3:50 UTC (permalink / raw)
To: davem; +Cc: linux-kernel, sdf, netdev, George Amanakis
Call tun_set_real_num_queues() after the increment of tun->numqueues
since the former depends on it. Otherwise, the number of queues is not
correctly accounted for, which results to warnings similar to:
"vnet0 selects TX queue 11, but real number of TX queues is 11".
Fixes: 0b7959b62573 ("tun: publish tfile after it's fully initialized")
Reported-and-tested-by: George Amanakis <gamanakis@gmail.com>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
drivers/net/tun.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 18656c4094b3..fed298c0cb39 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -866,8 +866,6 @@ static int tun_attach(struct tun_struct *tun, struct file *file,
if (rtnl_dereference(tun->xdp_prog))
sock_set_flag(&tfile->sk, SOCK_XDP);
- tun_set_real_num_queues(tun);
-
/* device is allowed to go away first, so no need to hold extra
* refcnt.
*/
@@ -879,6 +877,7 @@ static int tun_attach(struct tun_struct *tun, struct file *file,
rcu_assign_pointer(tfile->tun, tun);
rcu_assign_pointer(tun->tfiles[tun->numqueues], tfile);
tun->numqueues++;
+ tun_set_real_num_queues(tun);
out:
return err;
}
--
2.20.1
^ permalink raw reply related
* [PATCH] tun: move the call to tun_set_real_num_queues
From: George Amanakis @ 2019-01-30 3:53 UTC (permalink / raw)
To: davem; +Cc: linux-kernel, sdf, netdev, George Amanakis
In-Reply-To: <CAKH8qBvEintAf5Dmz+8kCTW1V_BRUDxSAZy8WoYSyNoLW+JHfQ@mail.gmail.com>
Call tun_set_real_num_queues() after the increment of tun->numqueues
since the former depends on it. Otherwise, the number of queues is not
correctly accounted for, which results to warnings similar to:
"vnet0 selects TX queue 11, but real number of TX queues is 11".
Fixes: 0b7959b62573 ("tun: publish tfile after it's fully initialized")
Reported-and-tested-by: George Amanakis <gamanakis@gmail.com>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
drivers/net/tun.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 18656c4094b3..fed298c0cb39 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -866,8 +866,6 @@ static int tun_attach(struct tun_struct *tun, struct file *file,
if (rtnl_dereference(tun->xdp_prog))
sock_set_flag(&tfile->sk, SOCK_XDP);
- tun_set_real_num_queues(tun);
-
/* device is allowed to go away first, so no need to hold extra
* refcnt.
*/
@@ -879,6 +877,7 @@ static int tun_attach(struct tun_struct *tun, struct file *file,
rcu_assign_pointer(tfile->tun, tun);
rcu_assign_pointer(tun->tfiles[tun->numqueues], tfile);
tun->numqueues++;
+ tun_set_real_num_queues(tun);
out:
return err;
}
--
2.20.1
^ permalink raw reply related
* Re: [PATCHv4 1/3] net: ethernet: mediatek: support MT7621 SoC ethernet hardware
From: Sean Wang @ 2019-01-30 3:54 UTC (permalink / raw)
To: Florian Fainelli
Cc: gerg, Sean Wang (王志亘), linux-mediatek, bjorn,
Andrew Lunn, vivien.didelot, netdev, neil, rene, John Crispin
In-Reply-To: <48bb34ce-9bb0-2dc1-0e17-55e000a23114@gmail.com>
On Tue, Jan 29, 2019 at 7:40 PM Florian Fainelli <f.fainelli@gmail.com> wrote:
>
>
>
> On 1/29/19 5:24 PM, gerg@kernel.org wrote:
> > From: Bjørn Mork <bjorn@mork.no>
> >
> > The Mediatek MT7621 SoC contains the same ethernet hardware module as
> > used on a number of other MediaTek SoC parts. There are some minor
> > differences to deal with but we can use the same driver to support
> > them all.
> >
> > This patch is based on work by Bjørn Mork <bjorn@mork.no>, and his
> > original patch is at:
> >
> > https://github.com/bmork/LEDE/commit/3293bc63f5461ca1eb0bbc4ed90145335e7e3404
> >
> > There is an additional compatible devicetree type added, and the primary
> > change to the code required is to support a single interrupt (for both
> > RX and TX interrupts).
> >
> > Signed-off-by: Bjørn Mork <bjorn@mork.no>
> > [gerg@kernel.org: rebase to mainline and irq handler fix]
> > Signed-off-by: Greg Ungerer <gerg@kernel.org>
>
> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
>
Acked-by: Sean Wang <sean.wang@kernel.org>
> > ---
> > drivers/net/ethernet/mediatek/Kconfig | 2 +-
> > drivers/net/ethernet/mediatek/mtk_eth_soc.c | 48 ++++++++++++++++++---
> > drivers/net/ethernet/mediatek/mtk_eth_soc.h | 4 ++
> > 3 files changed, 46 insertions(+), 8 deletions(-)
> >
> > v2: first in series for this change
> > v3: rebase onto 5.0-rc3
> > v4: rebase onto 5.0-rc4
> >
> > diff --git a/drivers/net/ethernet/mediatek/Kconfig b/drivers/net/ethernet/mediatek/Kconfig
> > index f9149d2a4694..43656f961891 100644
> > --- a/drivers/net/ethernet/mediatek/Kconfig
> > +++ b/drivers/net/ethernet/mediatek/Kconfig
> > @@ -1,6 +1,6 @@
> > config NET_VENDOR_MEDIATEK
> > bool "MediaTek ethernet driver"
> > - depends on ARCH_MEDIATEK
> > + depends on ARCH_MEDIATEK || SOC_MT7621
>
> Would be nice to add COMPILE_TEST there as well at some point so someone
> can easily build test changes to the driver.
okay, I will do it in another patch
>
> --
> Florian
>
> _______________________________________________
> Linux-mediatek mailing list
> Linux-mediatek@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-mediatek
^ permalink raw reply
* Re: [PATCHv4 3/3] dt-bindings: net: dsa: add new MT7530 binding to support MT7621
From: Sean Wang @ 2019-01-30 3:54 UTC (permalink / raw)
To: Florian Fainelli
Cc: gerg, Sean Wang (王志亘), linux-mediatek, bjorn,
Andrew Lunn, vivien.didelot, netdev, neil, rene, John Crispin
In-Reply-To: <9fd01c61-fb62-52fc-e587-43a03602bae2@gmail.com>
On Tue, Jan 29, 2019 at 7:40 PM Florian Fainelli <f.fainelli@gmail.com> wrote:
>
>
>
> On 1/29/19 5:24 PM, gerg@kernel.org wrote:
> > From: Greg Ungerer <gerg@kernel.org>
> >
> > Add devicetree binding to support the compatible mt7530 switch as used
> > in the MediaTek MT7621 SoC.
> >
> > Signed-off-by: Greg Ungerer <gerg@kernel.org>
> > Reviewed-by: Andrew Lunn <andrew@lunn.ch>
>
> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Sean Wang <sean.wang@kernel.org>
> --
> Florian
>
> _______________________________________________
> Linux-mediatek mailing list
> Linux-mediatek@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-mediatek
^ permalink raw reply
* Re: [PATCHv4 2/3] net: dsa: mt7530: support the 7530 switch on the Mediatek MT7621 SoC
From: Sean Wang @ 2019-01-30 3:55 UTC (permalink / raw)
To: Florian Fainelli
Cc: gerg, Sean Wang (王志亘), linux-mediatek, bjorn,
Andrew Lunn, vivien.didelot, netdev, neil, rene, John Crispin
In-Reply-To: <d030fb8f-c707-9aa2-5ebc-e8100188a5c2@gmail.com>
On Tue, Jan 29, 2019 at 7:39 PM Florian Fainelli <f.fainelli@gmail.com> wrote:
>
>
>
> On 1/29/19 5:24 PM, gerg@kernel.org wrote:
> > From: Greg Ungerer <gerg@kernel.org>
> >
> > The MediaTek MT7621 SoC device contains a 7530 switch, and the existing
> > linux kernel 7530 DSA switch driver can be used with it.
> >
> > The bulk of the changes required stem from the 7621 having different
> > regulator and pad setup. The existing setup of these in the 7530
> > driver appears to be very specific to its implemtation in the Mediatek
> > 7623 SoC. (Not entirely surprising given the 7623 is a quad core ARM
> > based SoC, and the 7621 is a dual core, dual thread MIPS based SoC).
> >
> > Create a new devicetree type, "mediatek,mt7621", to support the 7530
> > switch in the 7621 SoC. There appears to be no usable ID register to
> > distinguish it from a 7530 in other hardware at runtime. This is used
> > to carry out the appropriate configuration and setup.
> >
> > Signed-off-by: Greg Ungerer <gerg@kernel.org>
> > Reviewed-by: Andrew Lunn <andrew@lunn.ch>
>
> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Sean Wang <sean.wang@kernel.org>
> --
> Florian
>
> _______________________________________________
> Linux-mediatek mailing list
> Linux-mediatek@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-mediatek
^ permalink raw reply
* Re: [PATCH net-next] cxgb4: cxgb4_tc_u32: use struct_size() in kvzalloc()
From: Gustavo A. R. Silva @ 2019-01-30 3:57 UTC (permalink / raw)
To: David Miller; +Cc: arjun, netdev, linux-kernel
In-Reply-To: <20190129.105724.1719800687435588791.davem@davemloft.net>
On 1/29/19 12:57 PM, David Miller wrote:
>>
>> Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
>
> Applied, thanks.
>
Thanks, Dave.
--
Gustavo
^ permalink raw reply
* [PATCH bpf-next 0/4] bpf: fixes for lockdep and deadlock
From: Alexei Starovoitov @ 2019-01-30 4:04 UTC (permalink / raw)
To: davem; +Cc: daniel, peterz, edumazet, jannh, netdev, kernel-team
In addition to preempt_disable patch for socket filters
https://patchwork.ozlabs.org/patch/1032437/
the first three patches fix various lockdep false positives.
Last patch fixes potential deadlock in stackmap access from
tracing bpf prog and from syscall.
Alexei Starovoitov (3):
bpf: fix lockdep false positive in percpu_freelist
bpf: fix lockdep false positive in stackmap
bpf: fix lockdep false positive in bpf_prog_register
Martin KaFai Lau (1):
bpf: Fix syscall's stackmap lookup potential deadlock
kernel/bpf/hashtab.c | 4 ++--
kernel/bpf/percpu_freelist.c | 41 +++++++++++++++++++++++++-----------
kernel/bpf/percpu_freelist.h | 4 ++++
kernel/bpf/stackmap.c | 2 +-
kernel/bpf/syscall.c | 12 +++++++++--
kernel/trace/bpf_trace.c | 14 ++----------
6 files changed, 48 insertions(+), 29 deletions(-)
--
2.20.0
^ permalink raw reply
* [PATCH bpf-next 1/4] bpf: fix lockdep false positive in percpu_freelist
From: Alexei Starovoitov @ 2019-01-30 4:04 UTC (permalink / raw)
To: davem; +Cc: daniel, peterz, edumazet, jannh, netdev, kernel-team
In-Reply-To: <20190130040458.2544340-1-ast@kernel.org>
Lockdep warns about false positive:
[ 12.492084] 00000000e6b28347 (&head->lock){+...}, at: pcpu_freelist_push+0x2a/0x40
[ 12.492696] but this lock was taken by another, HARDIRQ-safe lock in the past:
[ 12.493275] (&rq->lock){-.-.}
[ 12.493276]
[ 12.493276]
[ 12.493276] and interrupts could create inverse lock ordering between them.
[ 12.493276]
[ 12.494435]
[ 12.494435] other info that might help us debug this:
[ 12.494979] Possible interrupt unsafe locking scenario:
[ 12.494979]
[ 12.495518] CPU0 CPU1
[ 12.495879] ---- ----
[ 12.496243] lock(&head->lock);
[ 12.496502] local_irq_disable();
[ 12.496969] lock(&rq->lock);
[ 12.497431] lock(&head->lock);
[ 12.497890] <Interrupt>
[ 12.498104] lock(&rq->lock);
[ 12.498368]
[ 12.498368] *** DEADLOCK ***
[ 12.498368]
[ 12.498837] 1 lock held by dd/276:
[ 12.499110] #0: 00000000c58cb2ee (rcu_read_lock){....}, at: trace_call_bpf+0x5e/0x240
[ 12.499747]
[ 12.499747] the shortest dependencies between 2nd lock and 1st lock:
[ 12.500389] -> (&rq->lock){-.-.} {
[ 12.500669] IN-HARDIRQ-W at:
[ 12.500934] _raw_spin_lock+0x2f/0x40
[ 12.501373] scheduler_tick+0x4c/0xf0
[ 12.501812] update_process_times+0x40/0x50
[ 12.502294] tick_periodic+0x27/0xb0
[ 12.502723] tick_handle_periodic+0x1f/0x60
[ 12.503203] timer_interrupt+0x11/0x20
[ 12.503651] __handle_irq_event_percpu+0x43/0x2c0
[ 12.504167] handle_irq_event_percpu+0x20/0x50
[ 12.504674] handle_irq_event+0x37/0x60
[ 12.505139] handle_level_irq+0xa7/0x120
[ 12.505601] handle_irq+0xa1/0x150
[ 12.506018] do_IRQ+0x77/0x140
[ 12.506411] ret_from_intr+0x0/0x1d
[ 12.506834] _raw_spin_unlock_irqrestore+0x53/0x60
[ 12.507362] __setup_irq+0x481/0x730
[ 12.507789] setup_irq+0x49/0x80
[ 12.508195] hpet_time_init+0x21/0x32
[ 12.508644] x86_late_time_init+0xb/0x16
[ 12.509106] start_kernel+0x390/0x42a
[ 12.509554] secondary_startup_64+0xa4/0xb0
[ 12.510034] IN-SOFTIRQ-W at:
[ 12.510305] _raw_spin_lock+0x2f/0x40
[ 12.510772] try_to_wake_up+0x1c7/0x4e0
[ 12.511220] swake_up_locked+0x20/0x40
[ 12.511657] swake_up_one+0x1a/0x30
[ 12.512070] rcu_process_callbacks+0xc5/0x650
[ 12.512553] __do_softirq+0xe6/0x47b
[ 12.512978] irq_exit+0xc3/0xd0
[ 12.513372] smp_apic_timer_interrupt+0xa9/0x250
[ 12.513876] apic_timer_interrupt+0xf/0x20
[ 12.514343] default_idle+0x1c/0x170
[ 12.514765] do_idle+0x199/0x240
[ 12.515159] cpu_startup_entry+0x19/0x20
[ 12.515614] start_kernel+0x422/0x42a
[ 12.516045] secondary_startup_64+0xa4/0xb0
[ 12.516521] INITIAL USE at:
[ 12.516774] _raw_spin_lock_irqsave+0x38/0x50
[ 12.517258] rq_attach_root+0x16/0xd0
[ 12.517685] sched_init+0x2f2/0x3eb
[ 12.518096] start_kernel+0x1fb/0x42a
[ 12.518525] secondary_startup_64+0xa4/0xb0
[ 12.518986] }
[ 12.519132] ... key at: [<ffffffff82b7bc28>] __key.71384+0x0/0x8
[ 12.519649] ... acquired at:
[ 12.519892] pcpu_freelist_pop+0x7b/0xd0
[ 12.520221] bpf_get_stackid+0x1d2/0x4d0
[ 12.520563] ___bpf_prog_run+0x8b4/0x11a0
[ 12.520887]
[ 12.521008] -> (&head->lock){+...} {
[ 12.521292] HARDIRQ-ON-W at:
[ 12.521539] _raw_spin_lock+0x2f/0x40
[ 12.521950] pcpu_freelist_push+0x2a/0x40
[ 12.522396] bpf_get_stackid+0x494/0x4d0
[ 12.522828] ___bpf_prog_run+0x8b4/0x11a0
[ 12.523296] INITIAL USE at:
[ 12.523537] _raw_spin_lock+0x2f/0x40
[ 12.523944] pcpu_freelist_populate+0xc0/0x120
[ 12.524417] htab_map_alloc+0x405/0x500
[ 12.524835] __do_sys_bpf+0x1a3/0x1a90
[ 12.525253] do_syscall_64+0x4a/0x180
[ 12.525659] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 12.526167] }
[ 12.526311] ... key at: [<ffffffff838f7668>] __key.13130+0x0/0x8
[ 12.526812] ... acquired at:
[ 12.527047] __lock_acquire+0x521/0x1350
[ 12.527371] lock_acquire+0x98/0x190
[ 12.527680] _raw_spin_lock+0x2f/0x40
[ 12.527994] pcpu_freelist_push+0x2a/0x40
[ 12.528325] bpf_get_stackid+0x494/0x4d0
[ 12.528645] ___bpf_prog_run+0x8b4/0x11a0
[ 12.528970]
[ 12.529092]
[ 12.529092] stack backtrace:
[ 12.529444] CPU: 0 PID: 276 Comm: dd Not tainted 5.0.0-rc3-00018-g2fa53f892422 #475
[ 12.530043] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
[ 12.530750] Call Trace:
[ 12.530948] dump_stack+0x5f/0x8b
[ 12.531248] check_usage_backwards+0x10c/0x120
[ 12.531598] ? ___bpf_prog_run+0x8b4/0x11a0
[ 12.531935] ? mark_lock+0x382/0x560
[ 12.532229] mark_lock+0x382/0x560
[ 12.532496] ? print_shortest_lock_dependencies+0x180/0x180
[ 12.532928] __lock_acquire+0x521/0x1350
[ 12.533271] ? find_get_entry+0x17f/0x2e0
[ 12.533586] ? find_get_entry+0x19c/0x2e0
[ 12.533902] ? lock_acquire+0x98/0x190
[ 12.534196] lock_acquire+0x98/0x190
[ 12.534482] ? pcpu_freelist_push+0x2a/0x40
[ 12.534810] _raw_spin_lock+0x2f/0x40
[ 12.535099] ? pcpu_freelist_push+0x2a/0x40
[ 12.535432] pcpu_freelist_push+0x2a/0x40
[ 12.535750] bpf_get_stackid+0x494/0x4d0
[ 12.536062] ___bpf_prog_run+0x8b4/0x11a0
It has been explained that is a false positive here:
https://lkml.org/lkml/2018/7/25/756
Recap:
- stackmap uses pcpu_freelist
- The lock in pcpu_freelist is a percpu lock
- stackmap is only used by tracing bpf_prog
- A tracing bpf_prog cannot be run if another bpf_prog
has already been running (ensured by the percpu bpf_prog_active counter).
Eric pointed out that this lockdep splats stops other
legit lockdep splats in selftests/bpf/test_progs.c.
Fix this by calling local_irq_save/restore for stackmap.
Another false positive had also been worked around by calling
local_irq_save in commit 89ad2fa3f043 ("bpf: fix lockdep splat").
That commit added unnecessary irq_save/restore to fast path of
bpf hash map. irqs are already disabled at that point, since htab
is holding per bucket spin_lock with irqsave.
Let's reduce overhead for htab by introducing __pcpu_freelist_push/pop
function w/o irqsave and convert pcpu_freelist_push/pop to irqsave
to be used elsewhere (right now only in stackmap).
It stops lockdep false positive in stackmap with a bit of acceptable overhead.
Fixes: 557c0c6e7df8 ("bpf: convert stackmap to pre-allocation")
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
kernel/bpf/hashtab.c | 4 ++--
kernel/bpf/percpu_freelist.c | 41 +++++++++++++++++++++++++-----------
kernel/bpf/percpu_freelist.h | 4 ++++
3 files changed, 35 insertions(+), 14 deletions(-)
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 4b7c76765d9d..f9274114c88d 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -686,7 +686,7 @@ static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
}
if (htab_is_prealloc(htab)) {
- pcpu_freelist_push(&htab->freelist, &l->fnode);
+ __pcpu_freelist_push(&htab->freelist, &l->fnode);
} else {
atomic_dec(&htab->count);
l->htab = htab;
@@ -748,7 +748,7 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
} else {
struct pcpu_freelist_node *l;
- l = pcpu_freelist_pop(&htab->freelist);
+ l = __pcpu_freelist_pop(&htab->freelist);
if (!l)
return ERR_PTR(-E2BIG);
l_new = container_of(l, struct htab_elem, fnode);
diff --git a/kernel/bpf/percpu_freelist.c b/kernel/bpf/percpu_freelist.c
index 673fa6fe2d73..0c1b4ba9e90e 100644
--- a/kernel/bpf/percpu_freelist.c
+++ b/kernel/bpf/percpu_freelist.c
@@ -28,8 +28,8 @@ void pcpu_freelist_destroy(struct pcpu_freelist *s)
free_percpu(s->freelist);
}
-static inline void __pcpu_freelist_push(struct pcpu_freelist_head *head,
- struct pcpu_freelist_node *node)
+static inline void ___pcpu_freelist_push(struct pcpu_freelist_head *head,
+ struct pcpu_freelist_node *node)
{
raw_spin_lock(&head->lock);
node->next = head->first;
@@ -37,12 +37,22 @@ static inline void __pcpu_freelist_push(struct pcpu_freelist_head *head,
raw_spin_unlock(&head->lock);
}
-void pcpu_freelist_push(struct pcpu_freelist *s,
+void __pcpu_freelist_push(struct pcpu_freelist *s,
struct pcpu_freelist_node *node)
{
struct pcpu_freelist_head *head = this_cpu_ptr(s->freelist);
- __pcpu_freelist_push(head, node);
+ ___pcpu_freelist_push(head, node);
+}
+
+void pcpu_freelist_push(struct pcpu_freelist *s,
+ struct pcpu_freelist_node *node)
+{
+ unsigned long flags;
+
+ local_irq_save(flags);
+ __pcpu_freelist_push(s, node);
+ local_irq_restore(flags);
}
void pcpu_freelist_populate(struct pcpu_freelist *s, void *buf, u32 elem_size,
@@ -63,7 +73,7 @@ void pcpu_freelist_populate(struct pcpu_freelist *s, void *buf, u32 elem_size,
for_each_possible_cpu(cpu) {
again:
head = per_cpu_ptr(s->freelist, cpu);
- __pcpu_freelist_push(head, buf);
+ ___pcpu_freelist_push(head, buf);
i++;
buf += elem_size;
if (i == nr_elems)
@@ -74,14 +84,12 @@ void pcpu_freelist_populate(struct pcpu_freelist *s, void *buf, u32 elem_size,
local_irq_restore(flags);
}
-struct pcpu_freelist_node *pcpu_freelist_pop(struct pcpu_freelist *s)
+struct pcpu_freelist_node *__pcpu_freelist_pop(struct pcpu_freelist *s)
{
struct pcpu_freelist_head *head;
struct pcpu_freelist_node *node;
- unsigned long flags;
int orig_cpu, cpu;
- local_irq_save(flags);
orig_cpu = cpu = raw_smp_processor_id();
while (1) {
head = per_cpu_ptr(s->freelist, cpu);
@@ -89,16 +97,25 @@ struct pcpu_freelist_node *pcpu_freelist_pop(struct pcpu_freelist *s)
node = head->first;
if (node) {
head->first = node->next;
- raw_spin_unlock_irqrestore(&head->lock, flags);
+ raw_spin_unlock(&head->lock);
return node;
}
raw_spin_unlock(&head->lock);
cpu = cpumask_next(cpu, cpu_possible_mask);
if (cpu >= nr_cpu_ids)
cpu = 0;
- if (cpu == orig_cpu) {
- local_irq_restore(flags);
+ if (cpu == orig_cpu)
return NULL;
- }
}
}
+
+struct pcpu_freelist_node *pcpu_freelist_pop(struct pcpu_freelist *s)
+{
+ struct pcpu_freelist_node *ret;
+ unsigned long flags;
+
+ local_irq_save(flags);
+ ret = __pcpu_freelist_pop(s);
+ local_irq_restore(flags);
+ return ret;
+}
diff --git a/kernel/bpf/percpu_freelist.h b/kernel/bpf/percpu_freelist.h
index 3049aae8ea1e..c3960118e617 100644
--- a/kernel/bpf/percpu_freelist.h
+++ b/kernel/bpf/percpu_freelist.h
@@ -22,8 +22,12 @@ struct pcpu_freelist_node {
struct pcpu_freelist_node *next;
};
+/* pcpu_freelist_* do spin_lock_irqsave. */
void pcpu_freelist_push(struct pcpu_freelist *, struct pcpu_freelist_node *);
struct pcpu_freelist_node *pcpu_freelist_pop(struct pcpu_freelist *);
+/* __pcpu_freelist_* do spin_lock only. caller must disable irqs. */
+void __pcpu_freelist_push(struct pcpu_freelist *, struct pcpu_freelist_node *);
+struct pcpu_freelist_node *__pcpu_freelist_pop(struct pcpu_freelist *);
void pcpu_freelist_populate(struct pcpu_freelist *s, void *buf, u32 elem_size,
u32 nr_elems);
int pcpu_freelist_init(struct pcpu_freelist *);
--
2.20.0
^ permalink raw reply related
* [PATCH bpf-next 2/4] bpf: fix lockdep false positive in stackmap
From: Alexei Starovoitov @ 2019-01-30 4:04 UTC (permalink / raw)
To: davem; +Cc: daniel, peterz, edumazet, jannh, netdev, kernel-team
In-Reply-To: <20190130040458.2544340-1-ast@kernel.org>
Lockdep warns about false positive:
[ 11.211460] ------------[ cut here ]------------
[ 11.211936] DEBUG_LOCKS_WARN_ON(depth <= 0)
[ 11.211985] WARNING: CPU: 0 PID: 141 at ../kernel/locking/lockdep.c:3592 lock_release+0x1ad/0x280
[ 11.213134] Modules linked in:
[ 11.213413] CPU: 0 PID: 141 Comm: systemd-journal Not tainted 5.0.0-rc3-00018-g2fa53f892422-dirty #476
[ 11.214191] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
[ 11.214954] RIP: 0010:lock_release+0x1ad/0x280
[ 11.217036] RSP: 0018:ffff88813ba03f50 EFLAGS: 00010086
[ 11.217516] RAX: 000000000000001f RBX: ffff8881378d8000 RCX: 0000000000000000
[ 11.218179] RDX: ffffffff810d3e9e RSI: 0000000000000001 RDI: ffffffff810d3eb3
[ 11.218851] RBP: ffff8881393e2b08 R08: 0000000000000002 R09: 0000000000000000
[ 11.219504] R10: 0000000000000000 R11: ffff88813ba03d9d R12: ffffffff8118dfa2
[ 11.220162] R13: 0000000000000086 R14: 0000000000000000 R15: 0000000000000000
[ 11.220717] FS: 00007f3c8cf35780(0000) GS:ffff88813ba00000(0000) knlGS:0000000000000000
[ 11.221348] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 11.221822] CR2: 00007f5825d92080 CR3: 00000001378c8005 CR4: 00000000003606f0
[ 11.222381] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 11.222951] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 11.223508] Call Trace:
[ 11.223705] <IRQ>
[ 11.223874] ? __local_bh_enable+0x7a/0x80
[ 11.224199] up_read+0x1c/0xa0
[ 11.224446] do_up_read+0x12/0x20
[ 11.224713] irq_work_run_list+0x43/0x70
[ 11.225030] irq_work_run+0x26/0x50
[ 11.225310] smp_irq_work_interrupt+0x57/0x1f0
[ 11.225662] irq_work_interrupt+0xf/0x20
since rw_semaphore is released in a different task vs task that locked the sema.
It is expected behavior.
Silence the warning by using up_read_non_owner().
Fixes: bae77c5eb5b2 ("bpf: enable stackmap with build_id in nmi context")
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
kernel/bpf/stackmap.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index d43b14535827..4b79e7c251e5 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -44,7 +44,7 @@ static void do_up_read(struct irq_work *entry)
struct stack_map_irq_work *work;
work = container_of(entry, struct stack_map_irq_work, irq_work);
- up_read(work->sem);
+ up_read_non_owner(work->sem);
work->sem = NULL;
}
--
2.20.0
^ permalink raw reply related
* [PATCH bpf-next 4/4] bpf: Fix syscall's stackmap lookup potential deadlock
From: Alexei Starovoitov @ 2019-01-30 4:04 UTC (permalink / raw)
To: davem; +Cc: daniel, peterz, edumazet, jannh, netdev, kernel-team
In-Reply-To: <20190130040458.2544340-1-ast@kernel.org>
From: Martin KaFai Lau <kafai@fb.com>
The map_lookup_elem used to not acquiring spinlock
in order to optimize the reader.
It was true until commit 557c0c6e7df8 ("bpf: convert stackmap to pre-allocation")
The syscall's map_lookup_elem(stackmap) calls bpf_stackmap_copy().
bpf_stackmap_copy() may find the elem no longer needed after the copy is done.
If that is the case, pcpu_freelist_push() saves this elem for reuse later.
This push requires a spinlock.
If a tracing bpf_prog got run in the middle of the syscall's
map_lookup_elem(stackmap) and this tracing bpf_prog is calling
bpf_get_stackid(stackmap) which also requires the same pcpu_freelist's
spinlock, it may end up with a dead lock situation as reported by
Eric Dumazet in https://patchwork.ozlabs.org/patch/1030266/
The situation is the same as the syscall's map_update_elem() which
needs to acquire the pcpu_freelist's spinlock and could race
with tracing bpf_prog. Hence, this patch fixes it by protecting
bpf_stackmap_copy() with this_cpu_inc(bpf_prog_active)
to prevent tracing bpf_prog from running.
A later syscall's map_lookup_elem commit f1a2e44a3aec ("bpf: add queue and stack maps")
also acquires a spinlock and races with tracing bpf_prog similarly.
Hence, this patch is forward looking and protects the majority
of the map lookups. bpf_map_offload_lookup_elem() is the exception
since it is for network bpf_prog only (i.e. never called by tracing
bpf_prog).
Fixes: 557c0c6e7df8 ("bpf: convert stackmap to pre-allocation")
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
kernel/bpf/syscall.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index b155cd17c1bd..8577bb7f8be6 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -713,8 +713,13 @@ static int map_lookup_elem(union bpf_attr *attr)
if (bpf_map_is_dev_bound(map)) {
err = bpf_map_offload_lookup_elem(map, key, value);
- } else if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
- map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH) {
+ goto done;
+ }
+
+ preempt_disable();
+ this_cpu_inc(bpf_prog_active);
+ if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
+ map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH) {
err = bpf_percpu_hash_copy(map, key, value);
} else if (map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY) {
err = bpf_percpu_array_copy(map, key, value);
@@ -744,7 +749,10 @@ static int map_lookup_elem(union bpf_attr *attr)
}
rcu_read_unlock();
}
+ this_cpu_dec(bpf_prog_active);
+ preempt_enable();
+done:
if (err)
goto free_value;
--
2.20.0
^ permalink raw reply related
* [PATCH bpf-next 3/4] bpf: fix lockdep false positive in bpf_prog_register
From: Alexei Starovoitov @ 2019-01-30 4:04 UTC (permalink / raw)
To: davem; +Cc: daniel, peterz, edumazet, jannh, netdev, kernel-team
In-Reply-To: <20190130040458.2544340-1-ast@kernel.org>
Lockdep warns about false positive:
[ 13.007000] WARNING: possible circular locking dependency detected
[ 13.007587] 5.0.0-rc3-00018-g2fa53f892422-dirty #477 Not tainted
[ 13.008124] ------------------------------------------------------
[ 13.008624] test_progs/246 is trying to acquire lock:
[ 13.009030] 0000000094160d1d (tracepoints_mutex){+.+.}, at: tracepoint_probe_register_prio+0x2d/0x300
[ 13.009770]
[ 13.009770] but task is already holding lock:
[ 13.010239] 00000000d663ef86 (bpf_event_mutex){+.+.}, at: bpf_probe_register+0x1d/0x60
[ 13.010877]
[ 13.010877] which lock already depends on the new lock.
[ 13.010877]
[ 13.011532]
[ 13.011532] the existing dependency chain (in reverse order) is:
[ 13.012129]
[ 13.012129] -> #4 (bpf_event_mutex){+.+.}:
[ 13.012582] perf_event_query_prog_array+0x9b/0x130
[ 13.013016] _perf_ioctl+0x3aa/0x830
[ 13.013354] perf_ioctl+0x2e/0x50
[ 13.013668] do_vfs_ioctl+0x8f/0x6a0
[ 13.014003] ksys_ioctl+0x70/0x80
[ 13.014320] __x64_sys_ioctl+0x16/0x20
[ 13.014668] do_syscall_64+0x4a/0x180
[ 13.015007] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 13.015469]
[ 13.015469] -> #3 (&cpuctx_mutex){+.+.}:
[ 13.015910] perf_event_init_cpu+0x5a/0x90
[ 13.016291] perf_event_init+0x1b2/0x1de
[ 13.016654] start_kernel+0x2b8/0x42a
[ 13.016995] secondary_startup_64+0xa4/0xb0
[ 13.017382]
[ 13.017382] -> #2 (pmus_lock){+.+.}:
[ 13.017794] perf_event_init_cpu+0x21/0x90
[ 13.018172] cpuhp_invoke_callback+0xb3/0x960
[ 13.018573] _cpu_up+0xa7/0x140
[ 13.018871] do_cpu_up+0xa4/0xc0
[ 13.019178] smp_init+0xcd/0xd2
[ 13.019483] kernel_init_freeable+0x123/0x24f
[ 13.019878] kernel_init+0xa/0x110
[ 13.020201] ret_from_fork+0x24/0x30
[ 13.020541]
[ 13.020541] -> #1 (cpu_hotplug_lock.rw_sem){++++}:
[ 13.021051] static_key_slow_inc+0xe/0x20
[ 13.021424] tracepoint_probe_register_prio+0x28c/0x300
[ 13.021891] perf_trace_event_init+0x11f/0x250
[ 13.022297] perf_trace_init+0x6b/0xa0
[ 13.022644] perf_tp_event_init+0x25/0x40
[ 13.023011] perf_try_init_event+0x6b/0x90
[ 13.023386] perf_event_alloc+0x9a8/0xc40
[ 13.023754] __do_sys_perf_event_open+0x1dd/0xd30
[ 13.024173] do_syscall_64+0x4a/0x180
[ 13.024519] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 13.024968]
[ 13.024968] -> #0 (tracepoints_mutex){+.+.}:
[ 13.025434] __mutex_lock+0x86/0x970
[ 13.025764] tracepoint_probe_register_prio+0x2d/0x300
[ 13.026215] bpf_probe_register+0x40/0x60
[ 13.026584] bpf_raw_tracepoint_open.isra.34+0xa4/0x130
[ 13.027042] __do_sys_bpf+0x94f/0x1a90
[ 13.027389] do_syscall_64+0x4a/0x180
[ 13.027727] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 13.028171]
[ 13.028171] other info that might help us debug this:
[ 13.028171]
[ 13.028807] Chain exists of:
[ 13.028807] tracepoints_mutex --> &cpuctx_mutex --> bpf_event_mutex
[ 13.028807]
[ 13.029666] Possible unsafe locking scenario:
[ 13.029666]
[ 13.030140] CPU0 CPU1
[ 13.030510] ---- ----
[ 13.030875] lock(bpf_event_mutex);
[ 13.031166] lock(&cpuctx_mutex);
[ 13.031645] lock(bpf_event_mutex);
[ 13.032135] lock(tracepoints_mutex);
[ 13.032441]
[ 13.032441] *** DEADLOCK ***
[ 13.032441]
[ 13.032911] 1 lock held by test_progs/246:
[ 13.033239] #0: 00000000d663ef86 (bpf_event_mutex){+.+.}, at: bpf_probe_register+0x1d/0x60
[ 13.033909]
[ 13.033909] stack backtrace:
[ 13.034258] CPU: 1 PID: 246 Comm: test_progs Not tainted 5.0.0-rc3-00018-g2fa53f892422-dirty #477
[ 13.034964] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
[ 13.035657] Call Trace:
[ 13.035859] dump_stack+0x5f/0x8b
[ 13.036130] print_circular_bug.isra.37+0x1ce/0x1db
[ 13.036526] __lock_acquire+0x1158/0x1350
[ 13.036852] ? lock_acquire+0x98/0x190
[ 13.037154] lock_acquire+0x98/0x190
[ 13.037447] ? tracepoint_probe_register_prio+0x2d/0x300
[ 13.037876] __mutex_lock+0x86/0x970
[ 13.038167] ? tracepoint_probe_register_prio+0x2d/0x300
[ 13.038600] ? tracepoint_probe_register_prio+0x2d/0x300
[ 13.039028] ? __mutex_lock+0x86/0x970
[ 13.039337] ? __mutex_lock+0x24a/0x970
[ 13.039649] ? bpf_probe_register+0x1d/0x60
[ 13.039992] ? __bpf_trace_sched_wake_idle_without_ipi+0x10/0x10
[ 13.040478] ? tracepoint_probe_register_prio+0x2d/0x300
[ 13.040906] tracepoint_probe_register_prio+0x2d/0x300
[ 13.041325] bpf_probe_register+0x40/0x60
[ 13.041649] bpf_raw_tracepoint_open.isra.34+0xa4/0x130
[ 13.042068] ? __might_fault+0x3e/0x90
[ 13.042374] __do_sys_bpf+0x94f/0x1a90
[ 13.042678] do_syscall_64+0x4a/0x180
[ 13.042975] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 13.043382] RIP: 0033:0x7f23b10a07f9
[ 13.045155] RSP: 002b:00007ffdef42fdd8 EFLAGS: 00000202 ORIG_RAX: 0000000000000141
[ 13.045759] RAX: ffffffffffffffda RBX: 00007ffdef42ff70 RCX: 00007f23b10a07f9
[ 13.046326] RDX: 0000000000000070 RSI: 00007ffdef42fe10 RDI: 0000000000000011
[ 13.046893] RBP: 00007ffdef42fdf0 R08: 0000000000000038 R09: 00007ffdef42fe10
[ 13.047462] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
[ 13.048029] R13: 0000000000000016 R14: 00007f23b1db4690 R15: 0000000000000000
Lockdep seems to be confusing different mutexes.
Such deadlock is not possible.
Since tracepoints_mutex will be taken in tracepoint_probe_register/unregister()
there is no need to take bpf_event_mutex too.
bpf_event_mutex is protecting modifications to prog array used in kprobe/perf bpf progs.
bpf_raw_tracepoints don't need to take this mutex.
Fixes: c4f6699dfcb8 ("bpf: introduce BPF_RAW_TRACEPOINT")
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
kernel/trace/bpf_trace.c | 14 ++------------
1 file changed, 2 insertions(+), 12 deletions(-)
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 8b068adb9da1..f1a86a0d881d 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1204,22 +1204,12 @@ static int __bpf_probe_register(struct bpf_raw_event_map *btp, struct bpf_prog *
int bpf_probe_register(struct bpf_raw_event_map *btp, struct bpf_prog *prog)
{
- int err;
-
- mutex_lock(&bpf_event_mutex);
- err = __bpf_probe_register(btp, prog);
- mutex_unlock(&bpf_event_mutex);
- return err;
+ return __bpf_probe_register(btp, prog);
}
int bpf_probe_unregister(struct bpf_raw_event_map *btp, struct bpf_prog *prog)
{
- int err;
-
- mutex_lock(&bpf_event_mutex);
- err = tracepoint_probe_unregister(btp->tp, (void *)btp->bpf_func, prog);
- mutex_unlock(&bpf_event_mutex);
- return err;
+ return tracepoint_probe_unregister(btp->tp, (void *)btp->bpf_func, prog);
}
int bpf_get_perf_event_info(const struct perf_event *event, u32 *prog_id,
--
2.20.0
^ permalink raw reply related
* Re: [PATCH bpf-next 0/4] bpf: fixes for lockdep and deadlock
From: Alexei Starovoitov @ 2019-01-30 4:07 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: David S. Miller, Daniel Borkmann, Peter Zijlstra, Eric Dumazet,
Jann Horn, Network Development, Kernel Team
In-Reply-To: <20190130040458.2544340-1-ast@kernel.org>
On Tue, Jan 29, 2019 at 8:06 PM Alexei Starovoitov <ast@kernel.org> wrote:
>
> In addition to preempt_disable patch for socket filters
> https://patchwork.ozlabs.org/patch/1032437/
> the first three patches fix various lockdep false positives.
> Last patch fixes potential deadlock in stackmap access from
> tracing bpf prog and from syscall.
Typo in subject.
All patches are for 'bpf' tree.
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox