Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH bpf] selftests/bpf: fix attach_probe on s390
From: Ilya Leoshkevich @ 2019-07-12 13:41 UTC (permalink / raw)
  To: bpf, netdev; +Cc: gor, heiko.carstens, Ilya Leoshkevich

attach_probe test fails, because it cannot install a kprobe on a
non-existent sys_nanosleep symbol.

Use the correct symbol name for the nanosleep syscall on 64-bit s390.
Don't bother adding one for 31-bit mode, since tests are compiled only
in 64-bit mode.

Fixes: 1e8611bbdfc9 ("selftests/bpf: add kprobe/uprobe selftests")
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Acked-by: Vasily Gorbik <gor@linux.ibm.com>
---
 tools/testing/selftests/bpf/prog_tests/attach_probe.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/testing/selftests/bpf/prog_tests/attach_probe.c b/tools/testing/selftests/bpf/prog_tests/attach_probe.c
index a4686395522c..47af4afc5013 100644
--- a/tools/testing/selftests/bpf/prog_tests/attach_probe.c
+++ b/tools/testing/selftests/bpf/prog_tests/attach_probe.c
@@ -23,6 +23,8 @@ ssize_t get_base_addr() {
 
 #ifdef __x86_64__
 #define SYS_KPROBE_NAME "__x64_sys_nanosleep"
+#elif defined(__s390x__)
+#define SYS_KPROBE_NAME "__s390x_sys_nanosleep"
 #else
 #define SYS_KPROBE_NAME "sys_nanosleep"
 #endif
-- 
2.21.0


^ permalink raw reply related

* Re: [PATCH][bpf-next] bpf: verifier: avoid fall-through warnings
From: Daniel Borkmann @ 2019-07-12 13:41 UTC (permalink / raw)
  To: Gustavo A. R. Silva, Alexei Starovoitov, Martin KaFai Lau,
	Song Liu, Yonghong Song, Andrii Nakryiko, Lawrence Brakmo
  Cc: netdev, bpf, linux-kernel, Kees Cook
In-Reply-To: <20190711162233.GA6977@embeddedor>

On 07/11/2019 06:22 PM, Gustavo A. R. Silva wrote:
> In preparation to enabling -Wimplicit-fallthrough, this patch silences
> the following warning:
> 
> kernel/bpf/verifier.c: In function ‘check_return_code’:
> kernel/bpf/verifier.c:6106:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
>    if (env->prog->expected_attach_type == BPF_CGROUP_UDP4_RECVMSG ||
>       ^
> kernel/bpf/verifier.c:6109:2: note: here
>   case BPF_PROG_TYPE_CGROUP_SKB:
>   ^~~~
> 
> Warning level 3 was used: -Wimplicit-fallthrough=3
> 
> Notice that is much clearer to explicitly add breaks in each case
> statement (that actually contains some code), rather than letting
> the code to fall through.
> 
> This patch is part of the ongoing efforts to enable
> -Wimplicit-fallthrough.
> 
> Acked-by: Andrii Nakryiko <andriin@fb.com>
> Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>

Looks good, applied to bpf, thanks.

^ permalink raw reply

* RE: [PATCH v2] tipc: ensure head->lock is initialised
From: Jon Maloy @ 2019-07-12 13:35 UTC (permalink / raw)
  To: Chris Packham, eric.dumazet@gmail.com, ying.xue@windriver.com,
	davem@davemloft.net
  Cc: linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
	tipc-discussion@lists.sourceforge.net
In-Reply-To: <20190711224115.21499-1-chris.packham@alliedtelesis.co.nz>

Acked-by: Jon Maloy <jon.maloy@ericsson.com>

Actually, this was not what I meant, but it solves our problem in a simple and safe way, for now.
Later, when net-next opens, I will revert this and do it the right way, which is to change from skb_queue_purge() to __skb_queue_purge as Eric correctly noted.
That change has to be done at four locations, at least,  and is too intrusive to post to 'net' now.
I'll send a cleanup patch when net-next re-opens.

BR
///jon
 

> -----Original Message-----
> From: Chris Packham <chris.packham@alliedtelesis.co.nz>
> Sent: 11-Jul-19 18:41
> To: Jon Maloy <jon.maloy@ericsson.com>; eric.dumazet@gmail.com;
> ying.xue@windriver.com; davem@davemloft.net
> Cc: linux-kernel@vger.kernel.org; netdev@vger.kernel.org; tipc-
> discussion@lists.sourceforge.net; Chris Packham
> <chris.packham@alliedtelesis.co.nz>
> Subject: [PATCH v2] tipc: ensure head->lock is initialised
> 
> tipc_named_node_up() creates a skb list. It passes the list to
> tipc_node_xmit() which has some code paths that can call
> skb_queue_purge() which relies on the list->lock being initialised.
> 
> The spin_lock is only needed if the messages end up on the receive path but
> when the list is created in tipc_named_node_up() we don't necessarily know if
> it is going to end up there.
> 
> Once all the skb list users are updated in tipc it will then be possible to update
> them to use the unlocked variants of the skb list functions and initialise the
> lock when we know the message will follow the receive path.
> 
> Signed-off-by: Chris Packham <chris.packham@alliedtelesis.co.nz>
> ---
> 
> I'm updating our products to use the latest kernel. One change that we have
> that doesn't appear to have been upstreamed is related to the following soft
> lockup.
> 
> NMI watchdog: BUG: soft lockup - CPU#3 stuck for 23s! [swapper/3:0]
> Modules linked in: tipc jitterentropy_rng echainiv drbg platform_driver(O)
> ipifwd(PO)
> CPU: 3 PID: 0 Comm: swapper/3 Tainted: P           O    4.4.6-at1 #1
> task: a3054e00 ti: ac6b4000 task.ti: a307a000
> NIP: 806891c4 LR: 804f5060 CTR: 804f50d0
> REGS: ac6b59b0 TRAP: 0901   Tainted: P           O     (4.4.6-at1)
> MSR: 00029002 <CE,EE,ME>  CR: 84002088  XER: 20000000
> 
> GPR00: 804f50fc ac6b5a60 a3054e00 00029002 00000101 01001011
> 00000000 00000001
> GPR08: 00021002 c1502d1c ac6b5ae4 00000000 804f50d0 NIP [806891c4]
> _raw_spin_lock_irqsave+0x44/0x80 LR [804f5060] skb_dequeue+0x20/0x90
> Call Trace:
> [ac6b5a80] [804f50fc] skb_queue_purge+0x2c/0x50 [ac6b5a90] [c1511058]
> tipc_node_xmit+0x138/0x170 [tipc] [ac6b5ad0] [c1509e58]
> tipc_named_node_up+0x88/0xa0 [tipc] [ac6b5b00] [c150fc1c]
> tipc_netlink_compat_stop+0x9bc/0xf50 [tipc] [ac6b5b20] [c1511638]
> tipc_rcv+0x418/0x9b0 [tipc] [ac6b5bc0] [c150218c]
> tipc_bcast_stop+0xfc/0x7b0 [tipc] [ac6b5bd0] [80504e38]
> __netif_receive_skb_core+0x468/0xa10
> [ac6b5c70] [805082fc] netif_receive_skb_internal+0x3c/0xe0
> [ac6b5ca0] [80642a48] br_handle_frame_finish+0x1d8/0x4d0
> [ac6b5d10] [80642f30] br_handle_frame+0x1f0/0x330 [ac6b5d60]
> [80504ec8] __netif_receive_skb_core+0x4f8/0xa10
> [ac6b5e00] [805082fc] netif_receive_skb_internal+0x3c/0xe0
> [ac6b5e30] [8044c868] _dpa_rx+0x148/0x5c0 [ac6b5ea0] [8044b0c8]
> priv_rx_default_dqrr+0x98/0x170 [ac6b5ed0] [804d1338]
> qman_p_poll_dqrr+0x1b8/0x240 [ac6b5f00] [8044b1c0]
> dpaa_eth_poll+0x20/0x60 [ac6b5f20] [805087cc]
> net_rx_action+0x15c/0x320 [ac6b5f80] [8002594c]
> __do_softirq+0x13c/0x250 [ac6b5fe0] [80025c34] irq_exit+0xb4/0xf0
> [ac6b5ff0] [8000d81c] call_do_irq+0x24/0x3c [a307be60] [80004acc]
> do_IRQ+0x8c/0x120 [a307be80] [8000f450] ret_from_except+0x0/0x18
> --- interrupt: 501 at arch_cpu_idle+0x24/0x70
> 
> Eyeballing the code I think it can still happen since tipc_named_node_up
> allocates struct sk_buff_head head on the stack so it could have arbitrary
> content.
> 
> Changes in v2:
> - fixup commit subject
> - add more information to commit message from mailing list discussion
> 
>  net/tipc/name_distr.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/tipc/name_distr.c b/net/tipc/name_distr.c index
> 61219f0b9677..44abc8e9c990 100644
> --- a/net/tipc/name_distr.c
> +++ b/net/tipc/name_distr.c
> @@ -190,7 +190,7 @@ void tipc_named_node_up(struct net *net, u32
> dnode)
>  	struct name_table *nt = tipc_name_table(net);
>  	struct sk_buff_head head;
> 
> -	__skb_queue_head_init(&head);
> +	skb_queue_head_init(&head);
> 
>  	read_lock_bh(&nt->cluster_scope_lock);
>  	named_distribute(net, &head, dnode, &nt->cluster_scope);
> --
> 2.22.0


^ permalink raw reply

* [PATCH] net: hisilicon: Use devm_platform_ioremap_resource
From: Jiangfeng Xiao @ 2019-07-12 13:16 UTC (permalink / raw)
  To: davem, yisen.zhuang, salil.mehta, xiaojiangfeng
  Cc: netdev, linux-kernel, sergei.shtylyov, leeyou.li, nixiaoming

Use devm_platform_ioremap_resource instead of
devm_ioremap_resource. Make the code simpler.

Signed-off-by: Jiangfeng Xiao <xiaojiangfeng@huawei.com>
---
 drivers/net/ethernet/hisilicon/hip04_eth.c    | 7 ++-----
 drivers/net/ethernet/hisilicon/hisi_femac.c   | 7 ++-----
 drivers/net/ethernet/hisilicon/hix5hd2_gmac.c | 7 ++-----
 drivers/net/ethernet/hisilicon/hns_mdio.c     | 4 +---
 4 files changed, 7 insertions(+), 18 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hip04_eth.c b/drivers/net/ethernet/hisilicon/hip04_eth.c
index 6256357..d604528 100644
--- a/drivers/net/ethernet/hisilicon/hip04_eth.c
+++ b/drivers/net/ethernet/hisilicon/hip04_eth.c
@@ -899,7 +899,6 @@ static int hip04_mac_probe(struct platform_device *pdev)
 	struct of_phandle_args arg;
 	struct net_device *ndev;
 	struct hip04_priv *priv;
-	struct resource *res;
 	int irq;
 	int ret;
 
@@ -912,16 +911,14 @@ static int hip04_mac_probe(struct platform_device *pdev)
 	platform_set_drvdata(pdev, ndev);
 	SET_NETDEV_DEV(ndev, &pdev->dev);
 
-	res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
-	priv->base = devm_ioremap_resource(d, res);
+	priv->base = devm_platform_ioremap_resource(pdev, 0);
 	if (IS_ERR(priv->base)) {
 		ret = PTR_ERR(priv->base);
 		goto init_fail;
 	}
 
 #if defined(CONFIG_HI13X1_GMAC)
-	res = platform_get_resource(pdev, IORESOURCE_MEM, 1);
-	priv->sysctrl_base = devm_ioremap_resource(d, res);
+	priv->sysctrl_base = devm_platform_ioremap_resource(pdev, 1);
 	if (IS_ERR(priv->sysctrl_base)) {
 		ret = PTR_ERR(priv->sysctrl_base);
 		goto init_fail;
diff --git a/drivers/net/ethernet/hisilicon/hisi_femac.c b/drivers/net/ethernet/hisilicon/hisi_femac.c
index d2e019d..689f18e 100644
--- a/drivers/net/ethernet/hisilicon/hisi_femac.c
+++ b/drivers/net/ethernet/hisilicon/hisi_femac.c
@@ -781,7 +781,6 @@ static int hisi_femac_drv_probe(struct platform_device *pdev)
 {
 	struct device *dev = &pdev->dev;
 	struct device_node *node = dev->of_node;
-	struct resource *res;
 	struct net_device *ndev;
 	struct hisi_femac_priv *priv;
 	struct phy_device *phy;
@@ -799,15 +798,13 @@ static int hisi_femac_drv_probe(struct platform_device *pdev)
 	priv->dev = dev;
 	priv->ndev = ndev;
 
-	res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
-	priv->port_base = devm_ioremap_resource(dev, res);
+	priv->port_base = devm_platform_ioremap_resource(pdev, 0);
 	if (IS_ERR(priv->port_base)) {
 		ret = PTR_ERR(priv->port_base);
 		goto out_free_netdev;
 	}
 
-	res = platform_get_resource(pdev, IORESOURCE_MEM, 1);
-	priv->glb_base = devm_ioremap_resource(dev, res);
+	priv->glb_base = devm_platform_ioremap_resource(pdev, 1);
 	if (IS_ERR(priv->glb_base)) {
 		ret = PTR_ERR(priv->glb_base);
 		goto out_free_netdev;
diff --git a/drivers/net/ethernet/hisilicon/hix5hd2_gmac.c b/drivers/net/ethernet/hisilicon/hix5hd2_gmac.c
index 89ef764..3499705 100644
--- a/drivers/net/ethernet/hisilicon/hix5hd2_gmac.c
+++ b/drivers/net/ethernet/hisilicon/hix5hd2_gmac.c
@@ -1097,7 +1097,6 @@ static int hix5hd2_dev_probe(struct platform_device *pdev)
 	const struct of_device_id *of_id = NULL;
 	struct net_device *ndev;
 	struct hix5hd2_priv *priv;
-	struct resource *res;
 	struct mii_bus *bus;
 	const char *mac_addr;
 	int ret;
@@ -1119,15 +1118,13 @@ static int hix5hd2_dev_probe(struct platform_device *pdev)
 	}
 	priv->hw_cap = (unsigned long)of_id->data;
 
-	res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
-	priv->base = devm_ioremap_resource(dev, res);
+	priv->base = devm_platform_ioremap_resource(pdev, 0);
 	if (IS_ERR(priv->base)) {
 		ret = PTR_ERR(priv->base);
 		goto out_free_netdev;
 	}
 
-	res = platform_get_resource(pdev, IORESOURCE_MEM, 1);
-	priv->ctrl_base = devm_ioremap_resource(dev, res);
+	priv->ctrl_base = devm_platform_ioremap_resource(pdev, 1);
 	if (IS_ERR(priv->ctrl_base)) {
 		ret = PTR_ERR(priv->ctrl_base);
 		goto out_free_netdev;
diff --git a/drivers/net/ethernet/hisilicon/hns_mdio.c b/drivers/net/ethernet/hisilicon/hns_mdio.c
index 918cab1..3e863a7 100644
--- a/drivers/net/ethernet/hisilicon/hns_mdio.c
+++ b/drivers/net/ethernet/hisilicon/hns_mdio.c
@@ -417,7 +417,6 @@ static int hns_mdio_probe(struct platform_device *pdev)
 {
 	struct hns_mdio_device *mdio_dev;
 	struct mii_bus *new_bus;
-	struct resource *res;
 	int ret = -ENODEV;
 
 	if (!pdev) {
@@ -442,8 +441,7 @@ static int hns_mdio_probe(struct platform_device *pdev)
 	new_bus->priv = mdio_dev;
 	new_bus->parent = &pdev->dev;
 
-	res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
-	mdio_dev->vbase = devm_ioremap_resource(&pdev->dev, res);
+	mdio_dev->vbase = devm_platform_ioremap_resource(pdev, 0);
 	if (IS_ERR(mdio_dev->vbase)) {
 		ret = PTR_ERR(mdio_dev->vbase);
 		return ret;
-- 
1.8.5.6


^ permalink raw reply related

* Re: [PATCH v3 0/3] kernel/notifier.c: avoid duplicate registration
From: Xiaoming Ni @ 2019-07-12 13:11 UTC (permalink / raw)
  To: Vasily Averin, adobriyan@gmail.com, akpm@linux-foundation.org,
	anna.schumaker@netapp.com, arjan@linux.intel.com,
	bfields@fieldses.org, chuck.lever@oracle.com, davem@davemloft.net,
	gregkh@linuxfoundation.org, jlayton@kernel.org, luto@kernel.org,
	mingo@kernel.org, Nadia.Derbey@bull.net,
	paulmck@linux.vnet.ibm.com, semen.protsenko@linaro.org,
	stable@kernel.org, stern@rowland.harvard.edu, tglx@linutronix.de,
	torvalds@linux-foundation.org, trond.myklebust@hammerspace.com,
	viresh.kumar@linaro.org
  Cc: Huangjianhui (Alex), Dailei, linux-kernel@vger.kernel.org,
	linux-nfs@vger.kernel.org, netdev@vger.kernel.org
In-Reply-To: <d70ba831-85c7-d5a3-670a-144fa4d139cc@virtuozzo.com>

On 2019/7/11 21:57, Vasily Averin wrote:
> On 7/11/19 4:55 AM, Nixiaoming wrote:
>> On Wed, July 10, 2019 1:49 PM Vasily Averin wrote:
>>> On 7/10/19 6:09 AM, Xiaoming Ni wrote:
>>>> Registering the same notifier to a hook repeatedly can cause the hook
>>>> list to form a ring or lose other members of the list.
>>>
>>> I think is not enough to _prevent_ 2nd register attempt,
>>> it's enough to detect just attempt and generate warning to mark host in bad state.
>>>
>>
>> Duplicate registration is prevented in my patch, not just "mark host in bad state"
>>
>> Duplicate registration is checked and exited in notifier_chain_cond_register()
>>
>> Duplicate registration was checked in notifier_chain_register() but only 
>> the alarm was triggered without exiting. added by commit 831246570d34692e 
>> ("kernel/notifier.c: double register detection")
>>
>> My patch is like a combination of 831246570d34692e and notifier_chain_cond_register(),
>>  which triggers an alarm and exits when a duplicate registration is detected.
>>
>>> Unexpected 2nd register of the same hook most likely will lead to 2nd unregister,
>>> and it can lead to host crash in any time: 
>>> you can unregister notifier on first attempt it can be too early, it can be still in use.
>>> on the other hand you can never call 2nd unregister at all.
>>
>> Since the member was not added to the linked list at the time of the second registration, 
>> no linked list ring was formed. 
>> The member is released on the first unregistration and -ENOENT on the second unregistration.
>> After patching, the fault has been alleviated
> 
> You are wrong here.
> 2nd notifier's registration is a pure bug, this should never happen.
> If you know the way to reproduce this situation -- you need to fix it. 
> 
> 2nd registration can happen in 2 cases:
> 1) missed rollback, when someone forget to call unregister after successfull registration, 
> and then tried to call register again. It can lead to crash for example when according module will be unloaded.
> 2) some subsystem is registered twice, for example from  different namespaces.
> in this case unregister called during sybsystem cleanup in first namespace will incorrectly remove notifier used 
> in second namespace, it also can lead to unexpacted behaviour.
> 
So in these two cases, is it more reasonable to trigger BUG() directly when checking for duplicate registration ?
But why does current notifier_chain_register() just trigger WARN() without exiting ?
notifier_chain_cond_register() direct exit without triggering WARN() ?

Thanks

Xiaoming Ni

>> It may be more helpful to return an error code when someone tries to register the same
>> notification program a second time.
> 
> You are wrong again here, it is senseless.
> If you have detected 2nd register -- your node is already in bad state.
> 
>> But I noticed that notifier_chain_cond_register() returns 0 when duplicate registration 
>> is detected. At the same time, in all the existing export function comments of notify,
>> "Currently always returns zero"
>>
>> I am a bit confused: which is better?
>>
>>>
>>> Unfortunately I do not see any ways to handle such cases properly,
>>> and it seems for me your patches does not resolve this problem.
>>>
>>> Am I missed something probably?
>>>
>>>> case1: An infinite loop in notifier_chain_register() can cause soft lockup
>>>>         atomic_notifier_chain_register(&test_notifier_list, &test1);
>>>>         atomic_notifier_chain_register(&test_notifier_list, &test1);
>>>>         atomic_notifier_chain_register(&test_notifier_list, &test2);
>>
>> Thanks
>>
>> Xiaoming Ni
>>
> 
> .
> 


^ permalink raw reply

* Re: [PATCH v2 bpf] selftests/bpf: fix bpf_target_sparc check
From: Daniel Borkmann @ 2019-07-12 13:08 UTC (permalink / raw)
  To: Ilya Leoshkevich, bpf, netdev; +Cc: andrii.nakryiko
In-Reply-To: <20190710115654.44841-1-iii@linux.ibm.com>

On 07/10/2019 01:56 PM, Ilya Leoshkevich wrote:
> bpf_helpers.h fails to compile on sparc: the code should be checking
> for defined(bpf_target_sparc), but checks simply for bpf_target_sparc.
> 
> Also change #ifdef bpf_target_powerpc to #if defined() for consistency.
> 
> Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>

Applied, thanks!

^ permalink raw reply

* Re: [PATCH bpf] xdp: fix potential deadlock on socket mutex
From: Daniel Borkmann @ 2019-07-12 13:07 UTC (permalink / raw)
  To: Ilya Maximets, netdev
  Cc: linux-kernel, bpf, xdp-newbies, David S. Miller,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Jakub Kicinski, Alexei Starovoitov
In-Reply-To: <20190708110344.23278-1-i.maximets@samsung.com>

On 07/08/2019 01:03 PM, Ilya Maximets wrote:
> There are 2 call chains:
> 
>   a) xsk_bind --> xdp_umem_assign_dev
>   b) unregister_netdevice_queue --> xsk_notifier
> 
> with the following locking order:
> 
>   a) xs->mutex --> rtnl_lock
>   b) rtnl_lock --> xdp.lock --> xs->mutex
> 
> Different order of taking 'xs->mutex' and 'rtnl_lock' could produce a
> deadlock here. Fix that by moving the 'rtnl_lock' before 'xs->lock' in
> the bind call chain (a).
> 
> Reported-by: syzbot+bf64ec93de836d7f4c2c@syzkaller.appspotmail.com
> Fixes: 455302d1c9ae ("xdp: fix hang while unregistering device bound to xdp socket")
> Signed-off-by: Ilya Maximets <i.maximets@samsung.com>

Applied, thanks!

^ permalink raw reply

* [PATCH] ipvs: remove unnecessary space
From: yangxingwu @ 2019-07-12 13:07 UTC (permalink / raw)
  To: wensong
  Cc: horms, ja, pablo, kadlec, fw, davem, netdev, lvs-devel,
	netfilter-devel, coreteam, linux-kernel, joe, yangxingwu
In-Reply-To: <80a4e132f3be48899904eccdc023f5c53229840b.camel@perches.com>

this patch removes the extra space and use bitmap_zalloc instead

Signed-off-by: yangxingwu <xingwu.yang@gmail.com>
---
 net/netfilter/ipvs/ip_vs_mh.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_mh.c b/net/netfilter/ipvs/ip_vs_mh.c
index 94d9d34..3229867 100644
--- a/net/netfilter/ipvs/ip_vs_mh.c
+++ b/net/netfilter/ipvs/ip_vs_mh.c
@@ -174,8 +174,7 @@ static int ip_vs_mh_populate(struct ip_vs_mh_state *s,
 		return 0;
 	}
 
-	table =  kcalloc(BITS_TO_LONGS(IP_VS_MH_TAB_SIZE),
-			 sizeof(unsigned long), GFP_KERNEL);
+	table = bitmap_zalloc(IP_VS_MH_TAB_SIZE, GFP_KERNEL);
 	if (!table)
 		return -ENOMEM;
 
-- 
1.8.3.1


^ permalink raw reply related

* Re: [PATCH bpf] xdp: fix possible cq entry leak
From: Daniel Borkmann @ 2019-07-12 13:07 UTC (permalink / raw)
  To: Ilya Maximets, netdev
  Cc: linux-kernel, bpf, xdp-newbies, David S. Miller,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Jakub Kicinski, Alexei Starovoitov
In-Reply-To: <20190704142503.23501-1-i.maximets@samsung.com>

On 07/04/2019 04:25 PM, Ilya Maximets wrote:
> Completion queue address reservation could not be undone.
> In case of bad 'queue_id' or skb allocation failure, reserved entry
> will be leaked reducing the total capacity of completion queue.
> 
> Fix that by moving reservation to the point where failure is not
> possible. Additionally, 'queue_id' checking moved out from the loop
> since there is no point to check it there.
> 
> Fixes: 35fcde7f8deb ("xsk: support for Tx")
> Signed-off-by: Ilya Maximets <i.maximets@samsung.com>

Applied, thanks!

^ permalink raw reply

* Re: [PATCH bpf-next] libbpf: fix ptr to u64 conversion warning on 32-bit platforms
From: Daniel Borkmann @ 2019-07-12 13:06 UTC (permalink / raw)
  To: Matt Hart, Yonghong Song
  Cc: Andrii Nakryiko, andrii.nakryiko@gmail.com, Alexei Starovoitov,
	bpf@vger.kernel.org, netdev@vger.kernel.org, Kernel Team
In-Reply-To: <CAH+k93ExQpYy+g+WUNvv+bDDzDcJR-2WYongJqv4WbMcPV=sRA@mail.gmail.com>

On 07/12/2019 02:56 PM, Matt Hart wrote:
> On Tue, 9 Jul 2019 at 05:30, Yonghong Song <yhs@fb.com> wrote:
>> On 7/8/19 9:00 PM, Andrii Nakryiko wrote:
>>> On 32-bit platforms compiler complains about conversion:
>>>
>>> libbpf.c: In function ‘perf_event_open_probe’:
>>> libbpf.c:4112:17: error: cast from pointer to integer of different
>>> size [-Werror=pointer-to-int-cast]
>>>    attr.config1 = (uint64_t)(void *)name; /* kprobe_func or uprobe_path */
>>>                   ^
>>>
>>> Reported-by: Matt Hart <matthew.hart@linaro.org>
>>> Fixes: b26500274767 ("libbpf: add kprobe/uprobe attach API")
>>> Tested-by: Matt Hart <matthew.hart@linaro.org>
>>> Signed-off-by: Andrii Nakryiko <andriin@fb.com>
>>
>> Acked-by: Yonghong Song <yhs@fb.com>
> 
> How do we get this merged? I see the build failure has now propagated
> up to mainline :(

I just applied the fix to bpf tree, will go its usual route to mainline.

^ permalink raw reply

* Re: [PATCH bpf-next] bpf: fix precision bit propagation for BPF_ST instructions
From: Daniel Borkmann @ 2019-07-12 13:05 UTC (permalink / raw)
  To: Andrii Nakryiko, andrii.nakryiko, ast, bpf, netdev, kernel-team
In-Reply-To: <20190709033244.1596200-1-andriin@fb.com>

On 07/09/2019 05:32 AM, Andrii Nakryiko wrote:
> When backtracking instructions to propagate precision bit for registers
> and stack slots, one class of instructions (BPF_ST) weren't handled
> causing extra stack slots to be propagated into parent state. Parent
> state might not have that much stack allocated, though, which causes
> warning on invalid stack slot usage.
> 
> This patch adds handling of BPF_ST instructions:
> 
> BPF_MEM | <size> | BPF_ST:   *(size *) (dst_reg + off) = imm32
> 
> Reported-by: syzbot+4da3ff23081bafe74fc2@syzkaller.appspotmail.com
> Fixes: b5dc0163d8fd ("bpf: precise scalar_value tracking")
> Cc: Alexei Starovoitov <ast@fb.com>
> Signed-off-by: Andrii Nakryiko <andriin@fb.com>

Applied, thanks!

^ permalink raw reply

* Re: [PATCH v2 bpf-next 0/3] fix BTF verification size resolution
From: Daniel Borkmann @ 2019-07-12 12:59 UTC (permalink / raw)
  To: Yonghong Song, Andrii Nakryiko, bpf@vger.kernel.org,
	netdev@vger.kernel.org, Alexei Starovoitov
  Cc: andrii.nakryiko@gmail.com, Kernel Team
In-Reply-To: <0143c2e9-ac0d-33de-3019-85016d771c76@fb.com>

On 07/12/2019 08:03 AM, Yonghong Song wrote:
> On 7/10/19 11:53 PM, Andrii Nakryiko wrote:
>> BTF size resolution logic isn't always resolving type size correctly, leading
>> to erroneous map creation failures due to value size mismatch.
>>
>> This patch set:
>> 1. fixes the issue (patch #1);
>> 2. adds tests for trickier cases (patch #2);
>> 3. and converts few test cases utilizing BTF-defined maps, that previously
>>     couldn't use typedef'ed arrays due to kernel bug (patch #3).
>>
>> Patch #1 can be applied against bpf tree, but selftest ones (#2 and #3) have
>> to go against bpf-next for now.
> 
> Why #2 and #3 have to go to bpf-next? bpf tree also accepts tests, 
> AFAIK. Maybe leave for Daniel and Alexei to decide in this particular case.

Yes, corresponding test cases for fixes are also accepted for bpf tree, thanks.

>> Andrii Nakryiko (3):
>>    bpf: fix BTF verifier size resolution logic
>>    selftests/bpf: add trickier size resolution tests
>>    selftests/bpf: use typedef'ed arrays as map values
> 
> Looks good to me. Except minor comments in patch 1/3, Ack the series.
> Acked-by: Yonghong Song <yhs@fb.com>
> 
>>
>>   kernel/bpf/btf.c                              | 14 ++-
>>   .../bpf/progs/test_get_stack_rawtp.c          |  3 +-
>>   .../bpf/progs/test_stacktrace_build_id.c      |  3 +-
>>   .../selftests/bpf/progs/test_stacktrace_map.c |  2 +-
>>   tools/testing/selftests/bpf/test_btf.c        | 88 +++++++++++++++++++
>>   5 files changed, 102 insertions(+), 8 deletions(-)
>>


^ permalink raw reply

* Re: [PATCH bpf-next] libbpf: fix ptr to u64 conversion warning on 32-bit platforms
From: Matt Hart @ 2019-07-12 12:56 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Andrii Nakryiko, andrii.nakryiko@gmail.com, Alexei Starovoitov,
	daniel@iogearbox.net, bpf@vger.kernel.org, netdev@vger.kernel.org,
	Kernel Team
In-Reply-To: <b4b00fad-3f99-c36a-a510-0b281a1f2bd7@fb.com>

On Tue, 9 Jul 2019 at 05:30, Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 7/8/19 9:00 PM, Andrii Nakryiko wrote:
> > On 32-bit platforms compiler complains about conversion:
> >
> > libbpf.c: In function ‘perf_event_open_probe’:
> > libbpf.c:4112:17: error: cast from pointer to integer of different
> > size [-Werror=pointer-to-int-cast]
> >    attr.config1 = (uint64_t)(void *)name; /* kprobe_func or uprobe_path */
> >                   ^
> >
> > Reported-by: Matt Hart <matthew.hart@linaro.org>
> > Fixes: b26500274767 ("libbpf: add kprobe/uprobe attach API")
> > Tested-by: Matt Hart <matthew.hart@linaro.org>
> > Signed-off-by: Andrii Nakryiko <andriin@fb.com>
>
> Acked-by: Yonghong Song <yhs@fb.com>
>

How do we get this merged? I see the build failure has now propagated
up to mainline :(

> > ---
> >   tools/lib/bpf/libbpf.c | 4 ++--
> >   1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> > index ed07789b3e62..794dd5064ae8 100644
> > --- a/tools/lib/bpf/libbpf.c
> > +++ b/tools/lib/bpf/libbpf.c
> > @@ -4126,8 +4126,8 @@ static int perf_event_open_probe(bool uprobe, bool retprobe, const char *name,
> >       }
> >       attr.size = sizeof(attr);
> >       attr.type = type;
> > -     attr.config1 = (uint64_t)(void *)name; /* kprobe_func or uprobe_path */
> > -     attr.config2 = offset;                 /* kprobe_addr or probe_offset */
> > +     attr.config1 = ptr_to_u64(name); /* kprobe_func or uprobe_path */
> > +     attr.config2 = offset;           /* kprobe_addr or probe_offset */
> >
> >       /* pid filter is meaningful only for uprobes */
> >       pfd = syscall(__NR_perf_event_open, &attr,
> >

^ permalink raw reply

* Re: [PATCH 00/12] treewide: Fix GENMASK misuses
From: Andrzej Hajda @ 2019-07-12 12:54 UTC (permalink / raw)
  To: Joe Perches, Andrew Morton, Patrick Venture, Nancy Yuen,
	Benjamin Fair, Andrew Jeffery, openbmc, linux-kernel,
	linux-aspeed, linux-arm-kernel, linux-amlogic, netdev,
	linux-mediatek, linux-stm32, linux-wireless, linux-media
  Cc: linux-iio, devel, alsa-devel, linux-mmc, dri-devel
In-Reply-To: <cover.1562734889.git.joe@perches.com>

Hi Joe,

On 10.07.2019 07:04, Joe Perches wrote:
> These GENMASK uses are inverted argument order and the
> actual masks produced are incorrect.  Fix them.
>
> Add checkpatch tests to help avoid more misuses too.
>
> Joe Perches (12):
>   checkpatch: Add GENMASK tests
>   clocksource/drivers/npcm: Fix misuse of GENMASK macro
>   drm: aspeed_gfx: Fix misuse of GENMASK macro
>   iio: adc: max9611: Fix misuse of GENMASK macro
>   irqchip/gic-v3-its: Fix misuse of GENMASK macro
>   mmc: meson-mx-sdio: Fix misuse of GENMASK macro
>   net: ethernet: mediatek: Fix misuses of GENMASK macro
>   net: stmmac: Fix misuses of GENMASK macro
>   rtw88: Fix misuse of GENMASK macro
>   phy: amlogic: G12A: Fix misuse of GENMASK macro
>   staging: media: cedrus: Fix misuse of GENMASK macro
>   ASoC: wcd9335: Fix misuse of GENMASK macro
>
>  drivers/clocksource/timer-npcm7xx.c               |  2 +-
>  drivers/gpu/drm/aspeed/aspeed_gfx.h               |  2 +-
>  drivers/iio/adc/max9611.c                         |  2 +-
>  drivers/irqchip/irq-gic-v3-its.c                  |  2 +-
>  drivers/mmc/host/meson-mx-sdio.c                  |  2 +-
>  drivers/net/ethernet/mediatek/mtk_eth_soc.h       |  2 +-
>  drivers/net/ethernet/mediatek/mtk_sgmii.c         |  2 +-
>  drivers/net/ethernet/stmicro/stmmac/descs.h       |  2 +-
>  drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c |  4 ++--
>  drivers/net/wireless/realtek/rtw88/rtw8822b.c     |  2 +-
>  drivers/phy/amlogic/phy-meson-g12a-usb2.c         |  2 +-
>  drivers/staging/media/sunxi/cedrus/cedrus_regs.h  |  2 +-
>  scripts/checkpatch.pl                             | 15 +++++++++++++++
>  sound/soc/codecs/wcd-clsh-v2.c                    |  2 +-
>  14 files changed, 29 insertions(+), 14 deletions(-)
>
After adding following compile time check:

------

diff --git a/Makefile b/Makefile
index 5102b2bbd224..ac4ea5f443a9 100644
--- a/Makefile
+++ b/Makefile
@@ -457,7 +457,7 @@ KBUILD_AFLAGS   := -D__ASSEMBLY__ -fno-PIE
 KBUILD_CFLAGS   := -Wall -Wundef -Werror=strict-prototypes -Wno-trigraphs \
                   -fno-strict-aliasing -fno-common -fshort-wchar -fno-PIE \
                   -Werror=implicit-function-declaration
-Werror=implicit-int \
-                  -Wno-format-security \
+                  -Wno-format-security -Werror=div-by-zero \
                   -std=gnu89
 KBUILD_CPPFLAGS := -D__KERNEL__
 KBUILD_AFLAGS_KERNEL :=
diff --git a/include/linux/bits.h b/include/linux/bits.h
index 669d69441a62..61d74b103055 100644
--- a/include/linux/bits.h
+++ b/include/linux/bits.h
@@ -19,11 +19,11 @@
  * GENMASK_ULL(39, 21) gives us the 64bit vector 0x000000ffffe00000.
  */
 #define GENMASK(h, l) \
-       (((~UL(0)) - (UL(1) << (l)) + 1) & \
+       (((~UL(0)) - (UL(1) << (l)) + 1 + 0/((h) >= (l))) & \
         (~UL(0) >> (BITS_PER_LONG - 1 - (h))))
 
 #define GENMASK_ULL(h, l) \
-       (((~ULL(0)) - (ULL(1) << (l)) + 1) & \
+       (((~ULL(0)) - (ULL(1) << (l)) + 1 + 0/((h) >= (l))) & \
         (~ULL(0) >> (BITS_PER_LONG_LONG - 1 - (h))))
 
 #endif /* __LINUX_BITS_H */

-------

I was able to detect one more GENMASK misue (AARCH64, allyesconfig):

  CC      drivers/phy/rockchip/phy-rockchip-inno-hdmi.o
In file included from ../include/linux/bitops.h:5:0,
                 from ../include/linux/kernel.h:12,
                 from ../include/linux/clk.h:13,
                 from ../drivers/phy/rockchip/phy-rockchip-inno-hdmi.c:9:
../drivers/phy/rockchip/phy-rockchip-inno-hdmi.c: In function
‘inno_hdmi_phy_rk3328_power_on’:
../include/linux/bits.h:22:37: error: division by zero [-Werror=div-by-zero]
  (((~UL(0)) - (UL(1) << (l)) + 1 + 0/((h) >= (l))) & \
                                     ^
../drivers/phy/rockchip/phy-rockchip-inno-hdmi.c:24:42: note: in
expansion of macro ‘GENMASK’
 #define UPDATE(x, h, l)  (((x) << (l)) & GENMASK((h), (l)))
                                          ^~~~~~~
../drivers/phy/rockchip/phy-rockchip-inno-hdmi.c:201:50: note: in
expansion of macro ‘UPDATE’
 #define RK3328_TERM_RESISTOR_CALIB_SPEED_7_0(x)  UPDATE(x, 7, 9)
                                                  ^~~~~~
../drivers/phy/rockchip/phy-rockchip-inno-hdmi.c:1046:26: note: in
expansion of macro ‘RK3328_TERM_RESISTOR_CALIB_SPEED_7_0’
   inno_write(inno, 0xc6, RK3328_TERM_RESISTOR_CALIB_SPEED_7_0(v));


Of course I do not advise to add the check as is to Kernel - it is
undefined behavior area AFAIK.


Regards

Andrzej


^ permalink raw reply related

* Re: bonded active-backup ethernet-wifi drops packets
From: Brian J. Murrell @ 2019-07-12 12:51 UTC (permalink / raw)
  To: netdev
In-Reply-To: <6134.1562373984@famine>

[-- Attachment #1: Type: text/plain, Size: 1255 bytes --]

On Fri, 2019-07-05 at 17:46 -0700, Jay Vosburgh wrote:
> 
> 	I did set this up and test it, but haven't had time to analyze
> in depth.
> 
> 	What I saw was that ping (IPv4) flood worked fine, bonded or
> not, over a span of several hours.

Interesting.  In contrast to my experience.

> However, ping6 showed small numbers
> of drops on a ping6 flood when bonded, on the order of 200 drops out
> of
> 48,000,000 requests sent.

I wonder if that's indicative of what I'm seeing.  Strange that you
only see it on IPv6 though.  I'm seeing it on IPv4.

> Zero losses when no bond in the stack.

That's what I see for IPv4.

> Both
> tests to the same peer connected to the same switch.

Ditto.

> All of the above
> with the bond using the Ethernet slave.

Also ditto.  Wifi introduces latencies (at least) which mask the
underlying issue.

> I haven't tracked down where
> those losses are occurring, so I don't know if it's on the transmit
> or
> receive sides (or both).

Personally, I suspect it's on the receive.  I suspect the host I am
testing from sends the ICMP echo requests just fine.  It's just not
getting the ICMP echo responses back.

Any ideas on further avenues to debugging this?

Cheers,
b.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* Re: [PATCH bpf-next v9 0/2] bpf: Allow bpf_skb_event_output for more prog types
From: Daniel Borkmann @ 2019-07-12 12:42 UTC (permalink / raw)
  To: Allan Zhang, Andrii Nakryiko
  Cc: Networking, bpf, Song Liu, Alexei Starovoitov
In-Reply-To: <CAMHgqJ71XDWSZDFOvcvu-sjDzrVp8+G8VdMx1fNUs5+xefhUSQ@mail.gmail.com>

On 07/10/2019 09:10 PM, Allan Zhang wrote:
> Sure, thanks. will do as suggested.

Yep, bpf-next is currently closed due to merge window, please resubmit
once it reopens.

Thanks,
Daniel

^ permalink raw reply

* Re: [PATCH v6 bpf-next 0/3] bpf: add bpf_descendant_of helper
From: Daniel Borkmann @ 2019-07-12 12:41 UTC (permalink / raw)
  To: Javier Honduvilla Coto, netdev; +Cc: yhs, kernel-team, jonhaslam
In-Reply-To: <20190710180025.94726-1-javierhonduco@fb.com>

On 07/10/2019 08:00 PM, Javier Honduvilla Coto wrote:
> Hi all,
> 
> This patch adds the bpf_descendant_of helper which accepts a PID and
> returns 1 if the PID of the process currently being executed is a
> descendant of it or if it's itself. Returns 0 otherwise. The passed
> PID should be the one as seen from the "global" pid namespace as the
> processes' PIDs in the hierarchy are resolved using the context of said
> initial namespace.
> 
> This is very useful in tracing programs when we want to filter by a
> given PID and all the children it might spawn. The current workarounds
> most people implement for this purpose have issues:
> 
> - Attaching to process spawning syscalls and dynamically add those PIDs
> to some bpf map that would be used to filter is cumbersome and
> potentially racy.
> - Unrolling some loop to perform what this helper is doing consumes lots
> of instructions. That and the impossibility to jump backwards makes it
> really hard to be correct in really large process chains.
> 
> 
> Let me know what do you think!
> 
> Thanks,
> 
> ---
> Changes in V6:
>         - Small style fix
>         - Clarify in the docs that we are resolving PIDs using the global,
> initial PID namespace, and the provided *pid* argument should be global, too
>         - Changed the way we assert on the helper return value
> 
> Changes in V5:
>         - Addressed code review feedback
>         - Renamed from progenyof => descendant_of as suggested by Jon Haslam
> and Brendan Gregg
> 
> Changes in V4:
>         - Rebased on latest bpf-next after merge window
> 
> Changes in V3:
>         - Removed RCU read (un)locking as BPF programs alredy run in RCU locked
>                 context
>         - progenyof(0) now returns 1, which, semantically makes more sense
>         - Added new test case for PID 0 and changed sentinel value for errors
>         - Rebase on latest bpf-next/master
>         - Used my work email as somehow I accidentally used my personal one in v2
> 
> Changes in V2:
>         - Adding missing docs in include/uapi/linux/bpf.h
> 

bpf-next is currently closed due to merge window, please resubmit once it reopens.

Thanks,
Daniel

^ permalink raw reply

* Re: [PATCH] MAINTAINERS: update BPF JIT S390 maintainers
From: Daniel Borkmann @ 2019-07-12 12:40 UTC (permalink / raw)
  To: David Miller, gor
  Cc: ast, heiko.carstens, borntraeger, iii, netdev, bpf, linux-s390
In-Reply-To: <20190711.113343.906691840255971211.davem@davemloft.net>

On 07/11/2019 08:33 PM, David Miller wrote:
> From: Vasily Gorbik <gor@linux.ibm.com>
> Date: Wed, 10 Jul 2019 13:34:54 +0200
> 
>> Dave, Alexei, Daniel,
>> would you take it via one of your trees? Or should I take it via s390?
> 
> I think it can go via the bpf tree.

Yep, just applied to bpf, thanks!

^ permalink raw reply

* Re: [PATCH net-next 00/11] Add drop monitor for offloaded data paths
From: Toke Høiland-Jørgensen @ 2019-07-12 12:33 UTC (permalink / raw)
  To: Neil Horman
  Cc: Ido Schimmel, David Miller, netdev, jiri, mlxsw, dsahern, roopa,
	nikolay, andy, pablo, jakub.kicinski, pieter.jansenvanvuuren,
	andrew, f.fainelli, vivien.didelot, idosch
In-Reply-To: <20190712121859.GB13696@hmswarspite.think-freely.org>

Neil Horman <nhorman@tuxdriver.com> writes:

> On Fri, Jul 12, 2019 at 11:27:55AM +0200, Toke Høiland-Jørgensen wrote:
>> Neil Horman <nhorman@tuxdriver.com> writes:
>> 
>> > On Thu, Jul 11, 2019 at 03:39:09PM +0300, Ido Schimmel wrote:
>> >> On Sun, Jul 07, 2019 at 12:45:41PM -0700, David Miller wrote:
>> >> > From: Ido Schimmel <idosch@idosch.org>
>> >> > Date: Sun,  7 Jul 2019 10:58:17 +0300
>> >> > 
>> >> > > Users have several ways to debug the kernel and understand why a packet
>> >> > > was dropped. For example, using "drop monitor" and "perf". Both
>> >> > > utilities trace kfree_skb(), which is the function called when a packet
>> >> > > is freed as part of a failure. The information provided by these tools
>> >> > > is invaluable when trying to understand the cause of a packet loss.
>> >> > > 
>> >> > > In recent years, large portions of the kernel data path were offloaded
>> >> > > to capable devices. Today, it is possible to perform L2 and L3
>> >> > > forwarding in hardware, as well as tunneling (IP-in-IP and VXLAN).
>> >> > > Different TC classifiers and actions are also offloaded to capable
>> >> > > devices, at both ingress and egress.
>> >> > > 
>> >> > > However, when the data path is offloaded it is not possible to achieve
>> >> > > the same level of introspection as tools such "perf" and "drop monitor"
>> >> > > become irrelevant.
>> >> > > 
>> >> > > This patchset aims to solve this by allowing users to monitor packets
>> >> > > that the underlying device decided to drop along with relevant metadata
>> >> > > such as the drop reason and ingress port.
>> >> > 
>> >> > We are now going to have 5 or so ways to capture packets passing through
>> >> > the system, this is nonsense.
>> >> > 
>> >> > AF_PACKET, kfree_skb drop monitor, perf, XDP perf events, and now this
>> >> > devlink thing.
>> >> > 
>> >> > This is insanity, too many ways to do the same thing and therefore the
>> >> > worst possible user experience.
>> >> > 
>> >> > Pick _ONE_ method to trap packets and forward normal kfree_skb events,
>> >> > XDP perf events, and these taps there too.
>> >> > 
>> >> > I mean really, think about it from the average user's perspective.  To
>> >> > see all drops/pkts I have to attach a kfree_skb tracepoint, and not just
>> >> > listen on devlink but configure a special tap thing beforehand and then
>> >> > if someone is using XDP I gotta setup another perf event buffer capture
>> >> > thing too.
>> >> 
>> >> Dave,
>> >> 
>> >> Before I start working on v2, I would like to get your feedback on the
>> >> high level plan. Also adding Neil who is the maintainer of drop_monitor
>> >> (and counterpart DropWatch tool [1]).
>> >> 
>> >> IIUC, the problem you point out is that users need to use different
>> >> tools to monitor packet drops based on where these drops occur
>> >> (SW/HW/XDP).
>> >> 
>> >> Therefore, my plan is to extend the existing drop_monitor netlink
>> >> channel to also cover HW drops. I will add a new message type and a new
>> >> multicast group for HW drops and encode in the message what is currently
>> >> encoded in the devlink events.
>> >> 
>> > A few things here:
>> > IIRC we don't announce individual hardware drops, drivers record them in
>> > internal structures, and they are retrieved on demand via ethtool calls, so you
>> > will either need to include some polling (probably not a very performant idea),
>> > or some sort of flagging mechanism to indicate that on the next message sent to
>> > user space you should go retrieve hw stats from a given interface.  I certainly
>> > wouldn't mind seeing this happen, but its more work than just adding a new
>> > netlink message.
>> >
>> > Also, regarding XDP drops, we wont see them if the xdp program is offloaded to
>> > hardware (you'll need your hw drop gathering mechanism for that), but for xdp
>> > programs run on the cpu, dropwatch should alrady catch those.  I.e. if the xdp
>> > program returns a DROP result for a packet being processed, the OS will call
>> > kfree_skb on its behalf, and dropwatch wil call that.
>> 
>> There is no skb by the time an XDP program runs, so this is not true. As
>> I mentioned upthread, there's a tracepoint that will get called if an
>> error occurs (or the program returns XDP_ABORTED), but in most cases,
>> XDP_DROP just means that the packet silently disappears...
>> 
> As I noted, thats only true for xdp programs that are offloaded to hardware, I
> was only speaking for XDP programs that run on the cpu.  For the former case, we
> obviously need some other mechanism to detect drops, but for cpu executed xdp
> programs, the OS is responsible for freeing skbs associated with programs the
> return XDP_DROP.

Ah, I think maybe you're thinking of generic XDP (also referred to as
skb mode)? That is a separate mode; an XDP program loaded in "native
mode" (or "driver mode") runs on the CPU, but before the skb is created;
this is the common case for XDP, and there is no skb and thus no drop
notification in this mode.

There is *also* an offload mode for XDP programs, but that is only
supported by netronome cards thus far, so not as commonly used...

-Toke

^ permalink raw reply

* Re: [PATCH net-next 00/11] Add drop monitor for offloaded data paths
From: Neil Horman @ 2019-07-12 12:18 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Ido Schimmel, David Miller, netdev, jiri, mlxsw, dsahern, roopa,
	nikolay, andy, pablo, jakub.kicinski, pieter.jansenvanvuuren,
	andrew, f.fainelli, vivien.didelot, idosch
In-Reply-To: <87r26vvbz8.fsf@toke.dk>

On Fri, Jul 12, 2019 at 11:27:55AM +0200, Toke Høiland-Jørgensen wrote:
> Neil Horman <nhorman@tuxdriver.com> writes:
> 
> > On Thu, Jul 11, 2019 at 03:39:09PM +0300, Ido Schimmel wrote:
> >> On Sun, Jul 07, 2019 at 12:45:41PM -0700, David Miller wrote:
> >> > From: Ido Schimmel <idosch@idosch.org>
> >> > Date: Sun,  7 Jul 2019 10:58:17 +0300
> >> > 
> >> > > Users have several ways to debug the kernel and understand why a packet
> >> > > was dropped. For example, using "drop monitor" and "perf". Both
> >> > > utilities trace kfree_skb(), which is the function called when a packet
> >> > > is freed as part of a failure. The information provided by these tools
> >> > > is invaluable when trying to understand the cause of a packet loss.
> >> > > 
> >> > > In recent years, large portions of the kernel data path were offloaded
> >> > > to capable devices. Today, it is possible to perform L2 and L3
> >> > > forwarding in hardware, as well as tunneling (IP-in-IP and VXLAN).
> >> > > Different TC classifiers and actions are also offloaded to capable
> >> > > devices, at both ingress and egress.
> >> > > 
> >> > > However, when the data path is offloaded it is not possible to achieve
> >> > > the same level of introspection as tools such "perf" and "drop monitor"
> >> > > become irrelevant.
> >> > > 
> >> > > This patchset aims to solve this by allowing users to monitor packets
> >> > > that the underlying device decided to drop along with relevant metadata
> >> > > such as the drop reason and ingress port.
> >> > 
> >> > We are now going to have 5 or so ways to capture packets passing through
> >> > the system, this is nonsense.
> >> > 
> >> > AF_PACKET, kfree_skb drop monitor, perf, XDP perf events, and now this
> >> > devlink thing.
> >> > 
> >> > This is insanity, too many ways to do the same thing and therefore the
> >> > worst possible user experience.
> >> > 
> >> > Pick _ONE_ method to trap packets and forward normal kfree_skb events,
> >> > XDP perf events, and these taps there too.
> >> > 
> >> > I mean really, think about it from the average user's perspective.  To
> >> > see all drops/pkts I have to attach a kfree_skb tracepoint, and not just
> >> > listen on devlink but configure a special tap thing beforehand and then
> >> > if someone is using XDP I gotta setup another perf event buffer capture
> >> > thing too.
> >> 
> >> Dave,
> >> 
> >> Before I start working on v2, I would like to get your feedback on the
> >> high level plan. Also adding Neil who is the maintainer of drop_monitor
> >> (and counterpart DropWatch tool [1]).
> >> 
> >> IIUC, the problem you point out is that users need to use different
> >> tools to monitor packet drops based on where these drops occur
> >> (SW/HW/XDP).
> >> 
> >> Therefore, my plan is to extend the existing drop_monitor netlink
> >> channel to also cover HW drops. I will add a new message type and a new
> >> multicast group for HW drops and encode in the message what is currently
> >> encoded in the devlink events.
> >> 
> > A few things here:
> > IIRC we don't announce individual hardware drops, drivers record them in
> > internal structures, and they are retrieved on demand via ethtool calls, so you
> > will either need to include some polling (probably not a very performant idea),
> > or some sort of flagging mechanism to indicate that on the next message sent to
> > user space you should go retrieve hw stats from a given interface.  I certainly
> > wouldn't mind seeing this happen, but its more work than just adding a new
> > netlink message.
> >
> > Also, regarding XDP drops, we wont see them if the xdp program is offloaded to
> > hardware (you'll need your hw drop gathering mechanism for that), but for xdp
> > programs run on the cpu, dropwatch should alrady catch those.  I.e. if the xdp
> > program returns a DROP result for a packet being processed, the OS will call
> > kfree_skb on its behalf, and dropwatch wil call that.
> 
> There is no skb by the time an XDP program runs, so this is not true. As
> I mentioned upthread, there's a tracepoint that will get called if an
> error occurs (or the program returns XDP_ABORTED), but in most cases,
> XDP_DROP just means that the packet silently disappears...
> 
As I noted, thats only true for xdp programs that are offloaded to hardware, I
was only speaking for XDP programs that run on the cpu.  For the former case, we
obviously need some other mechanism to detect drops, but for cpu executed xdp
programs, the OS is responsible for freeing skbs associated with programs the
return XDP_DROP.

Neil

> -Toke
> 

^ permalink raw reply

* Re: [PATCH v1 1/6] rcu: Add support for consolidated-RCU reader checking
From: Oleg Nesterov @ 2019-07-12 12:12 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: linux-kernel, Alexey Kuznetsov, Bjorn Helgaas, Borislav Petkov,
	c0d1n61at3, David S. Miller, edumazet, Greg Kroah-Hartman,
	Hideaki YOSHIFUJI, H. Peter Anvin, Ingo Molnar, Josh Triplett,
	keescook, kernel-hardening, Lai Jiangshan, Len Brown, linux-acpi,
	linux-pci, linux-pm, Mathieu Desnoyers, neilb, netdev,
	Paul E. McKenney, Pavel Machek, peterz, Rafael J. Wysocki,
	Rasmus Villemoes, rcu, Steven Rostedt, Tejun Heo, Thomas Gleixner,
	will, maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)
In-Reply-To: <20190711234401.220336-2-joel@joelfernandes.org>

On 07/11, Joel Fernandes (Google) wrote:
>
> +int rcu_read_lock_any_held(void)

rcu_sync_is_idle() wants it. You have my ack in advance ;)

Oleg.


^ permalink raw reply

* Re: [PATCH net-next 00/11] Add drop monitor for offloaded data paths
From: Neil Horman @ 2019-07-12 12:05 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Ido Schimmel, David Miller, netdev, jiri, mlxsw, dsahern, roopa,
	nikolay, andy, pablo, jakub.kicinski, pieter.jansenvanvuuren,
	andrew, vivien.didelot, idosch
In-Reply-To: <69d0917f-895f-6239-4044-76944432e8ca@gmail.com>

On Thu, Jul 11, 2019 at 08:40:34PM -0700, Florian Fainelli wrote:
> 
> 
> On 7/11/2019 4:53 PM, Neil Horman wrote:
> >> I would like to emphasize that the configuration of whether these
> >> dropped packets are even sent to the CPU from the device still needs to
> >> reside in devlink given this is the go-to tool for device-specific
> >> configuration. In addition, these drop traps are a small subset of the
> >> entire packet traps devices support and all have similar needs such as
> >> HW policer configuration and statistics.
> >>
> >> In the future we might also want to report events that indicate the
> >> formation of possible problems. For example, in case packets are queued
> >> above a certain threshold or for long periods of time. I hope we could
> >> re-use drop_monitor for this as well, thereby making it the go-to
> >> channel for diagnosing current and to-be problems in the data path.
> >>
> > Thats an interesting idea, but dropwatch certainly isn't currently setup for
> > that kind of messaging.  It may be worth creating a v2 of the netlink protocol
> > and really thinking out what you want to communicate.
> 
> Is not what you describe more or less what Ido has been doing here with
> this patch series?
possibly, I was only CCed on this thread halfway throught the conversation, and
only on the cover letter, I've not had a chance to look at the entire series

Neil

> -- 
> Florian
> 

^ permalink raw reply

* Re: [PATCH v1 1/6] rcu: Add support for consolidated-RCU reader checking
From: Peter Zijlstra @ 2019-07-12 11:11 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: linux-kernel, Alexey Kuznetsov, Bjorn Helgaas, Borislav Petkov,
	c0d1n61at3, David S. Miller, edumazet, Greg Kroah-Hartman,
	Hideaki YOSHIFUJI, H. Peter Anvin, Ingo Molnar, Josh Triplett,
	keescook, kernel-hardening, Lai Jiangshan, Len Brown, linux-acpi,
	linux-pci, linux-pm, Mathieu Desnoyers, neilb, netdev, oleg,
	Paul E. McKenney, Pavel Machek, Rafael J. Wysocki,
	Rasmus Villemoes, rcu, Steven Rostedt, Tejun Heo, Thomas Gleixner,
	will, maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)
In-Reply-To: <20190711234401.220336-2-joel@joelfernandes.org>

On Thu, Jul 11, 2019 at 07:43:56PM -0400, Joel Fernandes (Google) wrote:
> +int rcu_read_lock_any_held(void)
> +{
> +	int lockdep_opinion = 0;
> +
> +	if (!debug_lockdep_rcu_enabled())
> +		return 1;
> +	if (!rcu_is_watching())
> +		return 0;
> +	if (!rcu_lockdep_current_cpu_online())
> +		return 0;
> +
> +	/* Preemptible RCU flavor */
> +	if (lock_is_held(&rcu_lock_map))

you forgot debug_locks here.

> +		return 1;
> +
> +	/* BH flavor */
> +	if (in_softirq() || irqs_disabled())

I'm not sure I'd put irqs_disabled() under BH, also this entire
condition is superfluous, see below.

> +		return 1;
> +
> +	/* Sched flavor */
> +	if (debug_locks)
> +		lockdep_opinion = lock_is_held(&rcu_sched_lock_map);
> +	return lockdep_opinion || !preemptible();

that !preemptible() turns into:

  !(preempt_count()==0 && !irqs_disabled())

which is:

  preempt_count() != 0 || irqs_disabled()

and already includes irqs_disabled() and in_softirq().

> +}

So maybe something lke:

	if (debug_locks && (lock_is_held(&rcu_lock_map) ||
			    lock_is_held(&rcu_sched_lock_map)))
		return true;

	return !preemptible();



^ permalink raw reply

* Re: [PATCH v1 1/6] rcu: Add support for consolidated-RCU reader checking
From: Peter Zijlstra @ 2019-07-12 11:01 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: linux-kernel, Alexey Kuznetsov, Bjorn Helgaas, Borislav Petkov,
	c0d1n61at3, David S. Miller, edumazet, Greg Kroah-Hartman,
	Hideaki YOSHIFUJI, H. Peter Anvin, Ingo Molnar, Josh Triplett,
	keescook, kernel-hardening, Lai Jiangshan, Len Brown, linux-acpi,
	linux-pci, linux-pm, Mathieu Desnoyers, neilb, netdev, oleg,
	Paul E. McKenney, Pavel Machek, Rafael J. Wysocki,
	Rasmus Villemoes, rcu, Steven Rostedt, Tejun Heo, Thomas Gleixner,
	will, maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)
In-Reply-To: <20190711234401.220336-2-joel@joelfernandes.org>

On Thu, Jul 11, 2019 at 07:43:56PM -0400, Joel Fernandes (Google) wrote:
> This patch adds support for checking RCU reader sections in list
> traversal macros. Optionally, if the list macro is called under SRCU or
> other lock/mutex protection, then appropriate lockdep expressions can be
> passed to make the checks pass.
> 
> Existing list_for_each_entry_rcu() invocations don't need to pass the
> optional fourth argument (cond) unless they are under some non-RCU
> protection and needs to make lockdep check pass.
> 
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>  include/linux/rculist.h  | 29 ++++++++++++++++++++++++-----
>  include/linux/rcupdate.h |  7 +++++++
>  kernel/rcu/Kconfig.debug | 11 +++++++++++
>  kernel/rcu/update.c      | 26 ++++++++++++++++++++++++++
>  4 files changed, 68 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/rculist.h b/include/linux/rculist.h
> index e91ec9ddcd30..78c15ec6b2c9 100644
> --- a/include/linux/rculist.h
> +++ b/include/linux/rculist.h
> @@ -40,6 +40,23 @@ static inline void INIT_LIST_HEAD_RCU(struct list_head *list)
>   */
>  #define list_next_rcu(list)	(*((struct list_head __rcu **)(&(list)->next)))
>  
> +/*
> + * Check during list traversal that we are within an RCU reader
> + */
> +
> +#define SIXTH_ARG(a1, a2, a3, a4, a5, a6, ...) a6
> +#define COUNT_VARGS(...) SIXTH_ARG(dummy, ## __VA_ARGS__, 4, 3, 2, 1, 0)

You don't seem to actually use it in this patch; also linux/kernel.h has
COUNT_ARGS().

^ permalink raw reply

* Re: [RFC] virtio-net: share receive_*() and add_recvbuf_*() with virtio-vsock
From: Jason Wang @ 2019-07-12 10:14 UTC (permalink / raw)
  To: Stefano Garzarella, Michael S. Tsirkin
  Cc: Stefan Hajnoczi, virtualization, netdev
In-Reply-To: <20190712100033.xs3xesz2plfwj3ag@steredhat>


On 2019/7/12 下午6:00, Stefano Garzarella wrote:
> On Thu, Jul 11, 2019 at 03:52:21PM -0400, Michael S. Tsirkin wrote:
>> On Thu, Jul 11, 2019 at 01:41:34PM +0200, Stefano Garzarella wrote:
>>> On Thu, Jul 11, 2019 at 03:37:00PM +0800, Jason Wang wrote:
>>>> On 2019/7/10 下午11:37, Stefano Garzarella wrote:
>>>>> Hi,
>>>>> as Jason suggested some months ago, I looked better at the virtio-net driver to
>>>>> understand if we can reuse some parts also in the virtio-vsock driver, since we
>>>>> have similar challenges (mergeable buffers, page allocation, small
>>>>> packets, etc.).
>>>>>
>>>>> Initially, I would add the skbuff in the virtio-vsock in order to re-use
>>>>> receive_*() functions.
>>>>
>>>> Yes, that will be a good step.
>>>>
>>> Okay, I'll go on this way.
>>>
>>>>> Then I would move receive_[small, big, mergeable]() and
>>>>> add_recvbuf_[small, big, mergeable]() outside of virtio-net driver, in order to
>>>>> call them also from virtio-vsock. I need to do some refactoring (e.g. leave the
>>>>> XDP part on the virtio-net driver), but I think it is feasible.
>>>>>
>>>>> The idea is to create a virtio-skb.[h,c] where put these functions and a new
>>>>> object where stores some attributes needed (e.g. hdr_len ) and status (e.g.
>>>>> some fields of struct receive_queue).
>>>>
>>>> My understanding is we could be more ambitious here. Do you see any blocker
>>>> for reusing virtio-net directly? It's better to reuse not only the functions
>>>> but also the logic like NAPI to avoid re-inventing something buggy and
>>>> duplicated.
>>>>
>>> These are my concerns:
>>> - virtio-vsock is not a "net_device", so a lot of code related to
>>>    ethtool, net devices (MAC address, MTU, speed, VLAN, XDP, offloading) will be
>>>    not used by virtio-vsock.


Linux support device other than ethernet, so it should not be a problem.


>>>
>>> - virtio-vsock has a different header. We can consider it as part of
>>>    virtio_net payload, but it precludes the compatibility with old hosts. This
>>>    was one of the major doubts that made me think about using only the
>>>    send/recv skbuff functions, that it shouldn't break the compatibility.


We can extend the current vnet header helper for it to work for vsock.


>>>
>>>>> This is an idea of virtio-skb.h that
>>>>> I have in mind:
>>>>>       struct virtskb;
>>>>
>>>> What fields do you want to store in virtskb? It looks to be exist sk_buff is
>>>> flexible enough to us?
>>> My idea is to store queues information, like struct receive_queue or
>>> struct send_queue, and some device attributes (e.g. hdr_len ).


If you reuse skb or virtnet_info, there is not necessary.


>>>
>>>>
>>>>>       struct sk_buff *virtskb_receive_small(struct virtskb *vs, ...);
>>>>>       struct sk_buff *virtskb_receive_big(struct virtskb *vs, ...);
>>>>>       struct sk_buff *virtskb_receive_mergeable(struct virtskb *vs, ...);
>>>>>
>>>>>       int virtskb_add_recvbuf_small(struct virtskb*vs, ...);
>>>>>       int virtskb_add_recvbuf_big(struct virtskb *vs, ...);
>>>>>       int virtskb_add_recvbuf_mergeable(struct virtskb *vs, ...);
>>>>>
>>>>> For the Guest->Host path it should be easier, so maybe I can add a
>>>>> "virtskb_send(struct virtskb *vs, struct sk_buff *skb)" with a part of the code
>>>>> of xmit_skb().
>>>>
>>>> I may miss something, but I don't see any thing that prevents us from using
>>>> xmit_skb() directly.
>>>>
>>> Yes, but my initial idea was to make it more parametric and not related to the
>>> virtio_net_hdr, so the 'hdr_len' could be a parameter and the
>>> 'num_buffers' should be handled by the caller.
>>>
>>>>> Let me know if you have in mind better names or if I should put these function
>>>>> in another place.
>>>>>
>>>>> I would like to leave the control part completely separate, so, for example,
>>>>> the two drivers will negotiate the features independently and they will call
>>>>> the right virtskb_receive_*() function based on the negotiation.
>>>>
>>>> If it's one the issue of negotiation, we can simply change the
>>>> virtnet_probe() to deal with different devices.
>>>>
>>>>
>>>>> I already started to work on it, but before to do more steps and send an RFC
>>>>> patch, I would like to hear your opinion.
>>>>> Do you think that makes sense?
>>>>> Do you see any issue or a better solution?
>>>>
>>>> I still think we need to seek a way of adding some codes on virtio-net.c
>>>> directly if there's no huge different in the processing of TX/RX. That would
>>>> save us a lot time.
>>> After the reading of the buffers from the virtqueue I think the process
>>> is slightly different, because virtio-net will interface with the network
>>> stack, while virtio-vsock will interface with the vsock-core (socket).
>>> So the virtio-vsock implements the following:
>>> - control flow mechanism to avoid to loose packets, informing the peer
>>>    about the amount of memory available in the receive queue using some
>>>    fields in the virtio_vsock_hdr
>>> - de-multiplexing parsing the virtio_vsock_hdr and choosing the right
>>>    socket depending on the port
>>> - socket state handling


I think it's just a branch, for ethernet, go for networking stack. 
otherwise go for vsock core?


>>>
>>> We can use the virtio-net as transport, but we should add a lot of
>>> code to skip "net device" stuff when it is used by the virtio-vsock.


This could be another choice, but consider it was not transparent to the 
admin and require new features, we may seek a transparent solution here.


>>> This could break something in virtio-net, for this reason, I thought to reuse
>>> only the send/recv functions starting from the idea to split the virtio-net
>>> driver in two parts:
>>> a. one with all stuff related to the network stack
>>> b. one with the stuff needed to communicate with the host
>>>
>>> And use skbuff to communicate between parts. In this way, virtio-vsock
>>> can use only the b part.
>>>
>>> Maybe we can do this split in a better way, but I'm not sure it is
>>> simple.
>>>
>>> Thanks,
>>> Stefano
>> Frankly, skb is a huge structure which adds a lot of
>> overhead. I am not sure that using it is such a great idea
>> if building a device that does not have to interface
>> with the networking stack.


I believe vsock is mainly used for stream performance not for PPS. So 
the impact should be minimal. We can use other metadata, just need 
branch in recv_xxx().


> Thanks for the advice!
>
>> So I agree with Jason in theory. To clarify, he is basically saying
>> current implementation is all wrong, it should be a protocol and we
>> should teach networking stack that there are reliable net devices that
>> handle just this protocol. We could add a flag in virtio net that
>> will say it's such a device.
>>
>> Whether it's doable, I don't know, and it's definitely not simple - in
>> particular you will have to also re-implement existing devices in these
>> terms, and not just virtio - vmware vsock too.


Merging vsock protocol to exist networking stack could be a long term 
goal, I believe for the first phase, we can seek to use virtio-net first.


>>
>> If you want to do a POC you can add a new address family,
>> that's easier.
> Very interesting!
> I agree with you. In this way we can completely split the protocol
> logic, from the device.
>
> As you said, it will not simple to do, but can be an opportunity to learn
> better the Linux networking stack!
> I'll try to do a PoC with AF_VSOCK2 that will use the virtio-net.


I suggest to do this step by step:

1) use virtio-net but keep some protocol logic

2) separate protocol logic and merge it to exist Linux networking stack

Thanks


>> Just reusing random functions won't help, net stack
>> is very heavy, if it manages to outperform vsock it's
>> because vsock was not written with performance in mind.
>> But the smarts are in the core not virtio driver.
>> What makes vsock slow is design decisions like
>> using a workqueue to process packets,
>> not batching memory management etc etc.
>> All things that net core does for virtio net.
> Got it :)
>
> Michael, Jason, thank you very much! Your suggestions are very useful!
>
> Stefano

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox