Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH v4 bpf-next 1/4] selftests/bpf: compile progs with -D__TARGET_ARCH_$(SRCARCH)
From: Ilya Leoshkevich @ 2019-07-11 14:29 UTC (permalink / raw)
  To: bpf, netdev; +Cc: ys114321, daniel, sdf, davem, ast, Ilya Leoshkevich
In-Reply-To: <20190711142930.68809-1-iii@linux.ibm.com>

This opens up the possibility of accessing registers in an
arch-independent way.

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
---
 tools/testing/selftests/bpf/Makefile | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 2620406a53ec..ad84450e4ab8 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -1,4 +1,5 @@
 # SPDX-License-Identifier: GPL-2.0
+include ../../../scripts/Makefile.arch
 
 LIBDIR := ../../../lib
 BPFDIR := $(LIBDIR)/bpf
@@ -138,7 +139,8 @@ CLANG_SYS_INCLUDES := $(shell $(CLANG) -v -E - </dev/null 2>&1 \
 
 CLANG_FLAGS = -I. -I./include/uapi -I../../../include/uapi \
 	      $(CLANG_SYS_INCLUDES) \
-	      -Wno-compare-distinct-pointer-types
+	      -Wno-compare-distinct-pointer-types \
+	      -D__TARGET_ARCH_$(SRCARCH)
 
 $(OUTPUT)/test_l4lb_noinline.o: CLANG_FLAGS += -fno-inline
 $(OUTPUT)/test_xdp_noinline.o: CLANG_FLAGS += -fno-inline
-- 
2.21.0


^ permalink raw reply related

* [PATCH v4 bpf-next 0/4] selftests/bpf: fix compiling loop{1,2,3}.c on s390
From: Ilya Leoshkevich @ 2019-07-11 14:29 UTC (permalink / raw)
  To: bpf, netdev; +Cc: ys114321, daniel, sdf, davem, ast, Ilya Leoshkevich

Use PT_REGS_RC(ctx) instead of ctx->rax, which is not present on s390.

This patch series consists of three preparatory commits, which make it
possible to use PT_REGS_RC in BPF selftests, followed by the actual fix.

> > Will this also work for 32-bit x86?
> Thanks, this is a good catch: this builds, but makes 64-bit accesses, as
> if it used the 64-bit variant of pt_regs. I will fix this.
I found four problems in this area:

1. Selftest tracing progs are built with -target bpf, leading to struct
   pt_regs and friends being interpreted incorrectly.
2. When the Makefile is adjusted to build them without -target bpf, it
   still lacks -m32/-m64, leading to a similar issue.
3. There is no __i386__ define, leading to incorrect userspace struct
   pt_regs variant being chosen for x86.
4. Finally, there is an issue in my patch: when 1-3 are fixed, it fails
   to build, since i386 defines yet another set of field names.

I will send fixes for problems 1-3 separately, I believe for this patch
series to be correct, it's enough to fix #4 (which I did by adding
another #ifdef).

I've also changed ARCH to SRCARCH in patch #1, since while ARCH can be
e.g. "i386", SRCARCH always corresponds to directory names under arch/.

v1->v2: Split into multiple patches.
v2->v3: Added arm64 support.
v3->v4: Added i386 support, use SRCARCH instead of ARCH.

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>

^ permalink raw reply

* [PATCH] libertas: add terminating entry to fw_table
From: Oliver Neukum @ 2019-07-11 14:27 UTC (permalink / raw)
  To: davem, netdev; +Cc: Oliver Neukum

In case no firmware was found, the system would happily read
and try to load garbage. Terminate the table properly.

Signed-off-by: Oliver Neukum <oneukum@suse.com>
Fixes: ce84bb69f50e6 ("libertas USB: convert to asynchronous firmware loading")
Reported-by: syzbot+8a8f48672560c8ca59dd@syzkaller.appspotmail.com
---
 drivers/net/wireless/marvell/libertas/if_usb.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/wireless/marvell/libertas/if_usb.c b/drivers/net/wireless/marvell/libertas/if_usb.c
index f1622f0ff8c9..b79c65547f4c 100644
--- a/drivers/net/wireless/marvell/libertas/if_usb.c
+++ b/drivers/net/wireless/marvell/libertas/if_usb.c
@@ -50,7 +50,10 @@ static const struct lbs_fw_table fw_table[] = {
 	{ MODEL_8388, "libertas/usb8388_v5.bin", NULL },
 	{ MODEL_8388, "libertas/usb8388.bin", NULL },
 	{ MODEL_8388, "usb8388.bin", NULL },
-	{ MODEL_8682, "libertas/usb8682.bin", NULL }
+	{ MODEL_8682, "libertas/usb8682.bin", NULL },
+
+	/*terminating entry - keep at end */
+	{ MODEL_8388, NULL, NULL }
 };
 
 static const struct usb_device_id if_usb_table[] = {
-- 
2.16.4


^ permalink raw reply related

* Re: [PATCH v3 0/3] kernel/notifier.c: avoid duplicate registration
From: Vasily Averin @ 2019-07-11 13:57 UTC (permalink / raw)
  To: Nixiaoming, adobriyan@gmail.com, akpm@linux-foundation.org,
	anna.schumaker@netapp.com, arjan@linux.intel.com,
	bfields@fieldses.org, chuck.lever@oracle.com, davem@davemloft.net,
	gregkh@linuxfoundation.org, jlayton@kernel.org, luto@kernel.org,
	mingo@kernel.org, Nadia.Derbey@bull.net,
	paulmck@linux.vnet.ibm.com, semen.protsenko@linaro.org,
	stable@kernel.org, stern@rowland.harvard.edu, tglx@linutronix.de,
	torvalds@linux-foundation.org, trond.myklebust@hammerspace.com,
	viresh.kumar@linaro.org
  Cc: Huangjianhui (Alex), Dailei, linux-kernel@vger.kernel.org,
	linux-nfs@vger.kernel.org, netdev@vger.kernel.org
In-Reply-To: <E490CD805F7529488761C40FD9D26EF12AC9D068@dggemm507-mbx.china.huawei.com>

On 7/11/19 4:55 AM, Nixiaoming wrote:
> On Wed, July 10, 2019 1:49 PM Vasily Averin wrote:
>> On 7/10/19 6:09 AM, Xiaoming Ni wrote:
>>> Registering the same notifier to a hook repeatedly can cause the hook
>>> list to form a ring or lose other members of the list.
>>
>> I think is not enough to _prevent_ 2nd register attempt,
>> it's enough to detect just attempt and generate warning to mark host in bad state.
>>
> 
> Duplicate registration is prevented in my patch, not just "mark host in bad state"
> 
> Duplicate registration is checked and exited in notifier_chain_cond_register()
> 
> Duplicate registration was checked in notifier_chain_register() but only 
> the alarm was triggered without exiting. added by commit 831246570d34692e 
> ("kernel/notifier.c: double register detection")
> 
> My patch is like a combination of 831246570d34692e and notifier_chain_cond_register(),
>  which triggers an alarm and exits when a duplicate registration is detected.
> 
>> Unexpected 2nd register of the same hook most likely will lead to 2nd unregister,
>> and it can lead to host crash in any time: 
>> you can unregister notifier on first attempt it can be too early, it can be still in use.
>> on the other hand you can never call 2nd unregister at all.
> 
> Since the member was not added to the linked list at the time of the second registration, 
> no linked list ring was formed. 
> The member is released on the first unregistration and -ENOENT on the second unregistration.
> After patching, the fault has been alleviated

You are wrong here.
2nd notifier's registration is a pure bug, this should never happen.
If you know the way to reproduce this situation -- you need to fix it. 

2nd registration can happen in 2 cases:
1) missed rollback, when someone forget to call unregister after successfull registration, 
and then tried to call register again. It can lead to crash for example when according module will be unloaded.
2) some subsystem is registered twice, for example from  different namespaces.
in this case unregister called during sybsystem cleanup in first namespace will incorrectly remove notifier used 
in second namespace, it also can lead to unexpacted behaviour.

> It may be more helpful to return an error code when someone tries to register the same
> notification program a second time.

You are wrong again here, it is senseless.
If you have detected 2nd register -- your node is already in bad state.

> But I noticed that notifier_chain_cond_register() returns 0 when duplicate registration 
> is detected. At the same time, in all the existing export function comments of notify,
> "Currently always returns zero"
> 
> I am a bit confused: which is better?
> 
>>
>> Unfortunately I do not see any ways to handle such cases properly,
>> and it seems for me your patches does not resolve this problem.
>>
>> Am I missed something probably?
>>
>>> case1: An infinite loop in notifier_chain_register() can cause soft lockup
>>>         atomic_notifier_chain_register(&test_notifier_list, &test1);
>>>         atomic_notifier_chain_register(&test_notifier_list, &test1);
>>>         atomic_notifier_chain_register(&test_notifier_list, &test2);
> 
> Thanks
> 
> Xiaoming Ni
> 

^ permalink raw reply

* Re: [PATCH net-next,v2 3/3] net: flow_offload: add flow_block structure and use it
From: Jiri Pirko @ 2019-07-11 13:23 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel, davem, netdev, jakub.kicinski
In-Reply-To: <20190711130923.2483-3-pablo@netfilter.org>

Thu, Jul 11, 2019 at 03:09:23PM CEST, pablo@netfilter.org wrote:
>This object stores the flow block callbacks that are attached to this
>block. This patch restores block sharing.
>
>Fixes: da3eeb904ff4 ("net: flow_offload: add list handling functions")
>Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
>---
>v3: add flow_block_init() - Jiri Pirko.

and rename flow/flow_block/

[...]


>@@ -951,7 +952,7 @@ struct nft_stats {
>  *	@stats: per-cpu chain stats
>  *	@chain: the chain
>  *	@dev_name: device name that this base chain is attached to (if any)
>- *	@cb_list: list of flow block callbacks (for hardware offload)
>+ *	@flow: flow block (for hardware offload)

You missed rename here: s/flow:/flow_block:/


>  */
> struct nft_base_chain {
> 	struct nf_hook_ops		ops;
>@@ -961,7 +962,7 @@ struct nft_base_chain {
> 	struct nft_stats __percpu	*stats;
> 	struct nft_chain		chain;
> 	char 				dev_name[IFNAMSIZ];
>-	struct list_head		cb_list;
>+	struct flow_block		flow_block;
> };
> 
> static inline struct nft_base_chain *nft_base_chain(const struct nft_chain *chain)
>diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
>index 9482e060483b..6b6b01234dd9 100644
>--- a/include/net/sch_generic.h
>+++ b/include/net/sch_generic.h
>@@ -399,7 +399,7 @@ struct tcf_block {
> 	refcount_t refcnt;
> 	struct net *net;
> 	struct Qdisc *q;
>-	struct list_head cb_list;
>+	struct flow_block flow_block;
> 	struct list_head owner_list;
> 	bool keep_dst;
> 	unsigned int offloadcnt; /* Number of oddloaded filters */
>diff --git a/net/core/flow_offload.c b/net/core/flow_offload.c
>index a800fa78d96c..935c7f81a9ef 100644
>--- a/net/core/flow_offload.c
>+++ b/net/core/flow_offload.c
>@@ -198,7 +198,7 @@ struct flow_block_cb *flow_block_cb_lookup(struct flow_block_offload *f,

Reminding the block arg here.


> {
> 	struct flow_block_cb *block_cb;
> 
>-	list_for_each_entry(block_cb, f->driver_block_list, driver_list) {
>+	list_for_each_entry(block_cb, &f->block->cb_list, list) {
> 		if (block_cb->cb == cb &&
> 		    block_cb->cb_ident == cb_ident)
> 			return block_cb;

[...]


^ permalink raw reply

* Re: [PATCH net-next 3/3] net: flow_offload: add flow_block structure and use it
From: Jiri Pirko @ 2019-07-11 13:18 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel, davem, netdev, jakub.kicinski
In-Reply-To: <20190711131300.ojbo5hxzvv6wi44t@salvia>

Thu, Jul 11, 2019 at 03:13:00PM CEST, pablo@netfilter.org wrote:
>On Thu, Jul 11, 2019 at 10:06:09AM +0200, Jiri Pirko wrote:
>> Thu, Jul 11, 2019 at 02:12:35AM CEST, pablo@netfilter.org wrote:
>> >This object stores the flow block callbacks that are attached to this
>> >block. This patch restores block sharing.
>> >
>> >Fixes: da3eeb904ff4 ("net: flow_offload: add list handling functions")
>> >Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
>> >---
>> > include/net/flow_offload.h        |  5 +++++
>> > include/net/netfilter/nf_tables.h |  5 +++--
>> > include/net/sch_generic.h         |  2 +-
>> > net/core/flow_offload.c           |  2 +-
>> > net/netfilter/nf_tables_api.c     |  2 +-
>> > net/netfilter/nf_tables_offload.c |  5 +++--
>> > net/sched/cls_api.c               | 10 +++++++---
>> > 7 files changed, 21 insertions(+), 10 deletions(-)
>> >
>> >diff --git a/include/net/flow_offload.h b/include/net/flow_offload.h
>> >index 98bf3af5c84d..e50d94736829 100644
>> >--- a/include/net/flow_offload.h
>> >+++ b/include/net/flow_offload.h
>> >@@ -248,6 +248,10 @@ enum flow_block_binder_type {
>> > 	FLOW_BLOCK_BINDER_TYPE_CLSACT_EGRESS,
>> > };
>> > 
>> >+struct flow_block {
>> >+	struct list_head cb_list;
>> >+};
>> >+
>> > struct netlink_ext_ack;
>> > 
>> > struct flow_block_offload {
>> >@@ -255,6 +259,7 @@ struct flow_block_offload {
>> > 	enum flow_block_binder_type binder_type;
>> > 	bool block_shared;
>> > 	struct net *net;
>> >+	struct flow_block *block;
>> > 	struct list_head cb_list;
>> > 	struct list_head *driver_block_list;
>> > 	struct netlink_ext_ack *extack;
>> >diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
>> >index 35dfdd9f69b3..00658462f89b 100644
>> >--- a/include/net/netfilter/nf_tables.h
>> >+++ b/include/net/netfilter/nf_tables.h
>> >@@ -11,6 +11,7 @@
>> > #include <linux/rhashtable.h>
>> > #include <net/netfilter/nf_flow_table.h>
>> > #include <net/netlink.h>
>> >+#include <net/flow_offload.h>
>> > 
>> > struct module;
>> > 
>> >@@ -951,7 +952,7 @@ struct nft_stats {
>> >  *	@stats: per-cpu chain stats
>> >  *	@chain: the chain
>> >  *	@dev_name: device name that this base chain is attached to (if any)
>> >- *	@cb_list: list of flow block callbacks (for hardware offload)
>> >+ *	@block: flow block (for hardware offload)
>> >  */
>> > struct nft_base_chain {
>> > 	struct nf_hook_ops		ops;
>> >@@ -961,7 +962,7 @@ struct nft_base_chain {
>> > 	struct nft_stats __percpu	*stats;
>> > 	struct nft_chain		chain;
>> > 	char 				dev_name[IFNAMSIZ];
>> >-	struct list_head		cb_list;
>> >+	struct flow_block		block;
>> > };
>> > 
>> > static inline struct nft_base_chain *nft_base_chain(const struct nft_chain *chain)
>> >diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
>> >index 9482e060483b..58041cb0ce15 100644
>> >--- a/include/net/sch_generic.h
>> >+++ b/include/net/sch_generic.h
>> >@@ -399,7 +399,7 @@ struct tcf_block {
>> > 	refcount_t refcnt;
>> > 	struct net *net;
>> > 	struct Qdisc *q;
>> >-	struct list_head cb_list;
>> >+	struct flow_block flow;
>> 
>> It is not a "flow", that is confusing. It should be named "flow_block".
>
>Done.
>
>> > 	struct list_head owner_list;
>> > 	bool keep_dst;
>> > 	unsigned int offloadcnt; /* Number of oddloaded filters */
>> >diff --git a/net/core/flow_offload.c b/net/core/flow_offload.c
>> >index a800fa78d96c..935c7f81a9ef 100644
>> >--- a/net/core/flow_offload.c
>> >+++ b/net/core/flow_offload.c
>> >@@ -198,7 +198,7 @@ struct flow_block_cb *flow_block_cb_lookup(struct flow_block_offload *f,
>> > {
>> > 	struct flow_block_cb *block_cb;
>> > 
>> >-	list_for_each_entry(block_cb, f->driver_block_list, driver_list) {
>> >+	list_for_each_entry(block_cb, &f->block->cb_list, list) {
>> 
>> Please made struct flow_block *block and argument of cb_lookup instead
>> of struct flow_block_offload *f (as it was previously).
>
>I can do so if you insist, this will make this fix larger.

Yes please. Thanks!


>
>> > 		if (block_cb->cb == cb &&
>> > 		    block_cb->cb_ident == cb_ident)
>> > 			return block_cb;
>> >diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
>> >index ed17a7c29b86..c565f146435b 100644
>> >--- a/net/netfilter/nf_tables_api.c
>> >+++ b/net/netfilter/nf_tables_api.c
>> >@@ -1662,7 +1662,7 @@ static int nf_tables_addchain(struct nft_ctx *ctx, u8 family, u8 genmask,
>> > 
>> > 		chain->flags |= NFT_BASE_CHAIN | flags;
>> > 		basechain->policy = NF_ACCEPT;
>> >-		INIT_LIST_HEAD(&basechain->cb_list);
>> >+		INIT_LIST_HEAD(&basechain->block.cb_list);
>> > 	} else {
>> > 		chain = kzalloc(sizeof(*chain), GFP_KERNEL);
>> > 		if (chain == NULL)
>> >diff --git a/net/netfilter/nf_tables_offload.c b/net/netfilter/nf_tables_offload.c
>> >index 2c3302845f67..2a184277ee58 100644
>> >--- a/net/netfilter/nf_tables_offload.c
>> >+++ b/net/netfilter/nf_tables_offload.c
>> >@@ -116,7 +116,7 @@ static int nft_setup_cb_call(struct nft_base_chain *basechain,
>> > 	struct flow_block_cb *block_cb;
>> > 	int err;
>> > 
>> >-	list_for_each_entry(block_cb, &basechain->cb_list, list) {
>> >+	list_for_each_entry(block_cb, &basechain->block.cb_list, list) {
>> > 		err = block_cb->cb(type, type_data, block_cb->cb_priv);
>> > 		if (err < 0)
>> > 			return err;
>> >@@ -154,7 +154,7 @@ static int nft_flow_offload_rule(struct nft_trans *trans,
>> > static int nft_flow_offload_bind(struct flow_block_offload *bo,
>> > 				 struct nft_base_chain *basechain)
>> > {
>> >-	list_splice(&bo->cb_list, &basechain->cb_list);
>> >+	list_splice(&bo->cb_list, &basechain->block.cb_list);
>> > 	return 0;
>> > }
>> > 
>> >@@ -198,6 +198,7 @@ static int nft_flow_offload_chain(struct nft_trans *trans,
>> > 		return -EOPNOTSUPP;
>> > 
>> > 	bo.command = cmd;
>> >+	bo.block = &basechain->block;
>> > 	bo.binder_type = FLOW_BLOCK_BINDER_TYPE_CLSACT_INGRESS;
>> > 	bo.extack = &extack;
>> > 	INIT_LIST_HEAD(&bo.cb_list);
>> >diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
>> >index 51fbe6e95a92..66181961ad6f 100644
>> >--- a/net/sched/cls_api.c
>> >+++ b/net/sched/cls_api.c
>> >@@ -691,6 +691,8 @@ static void tc_indr_block_ing_cmd(struct tc_indr_block_dev *indr_dev,
>> > 	if (!indr_dev->block)
>> > 		return;
>> > 
>> >+	bo.block = &indr_dev->block->flow;
>> >+
>> > 	indr_block_cb->cb(indr_dev->dev, indr_block_cb->cb_priv, TC_SETUP_BLOCK,
>> > 			  &bo);
>> > 	tcf_block_setup(indr_dev->block, &bo);
>> >@@ -775,6 +777,7 @@ static void tc_indr_block_call(struct tcf_block *block, struct net_device *dev,
>> > 		.command	= command,
>> > 		.binder_type	= ei->binder_type,
>> > 		.net		= dev_net(dev),
>> >+		.block		= &block->flow,
>> > 		.block_shared	= tcf_block_shared(block),
>> > 		.extack		= extack,
>> > 	};
>> >@@ -810,6 +813,7 @@ static int tcf_block_offload_cmd(struct tcf_block *block,
>> > 	bo.net = dev_net(dev);
>> > 	bo.command = command;
>> > 	bo.binder_type = ei->binder_type;
>> >+	bo.block = &block->flow;
>> > 	bo.block_shared = tcf_block_shared(block);
>> > 	bo.extack = extack;
>> > 	INIT_LIST_HEAD(&bo.cb_list);
>> >@@ -988,7 +992,7 @@ static struct tcf_block *tcf_block_create(struct net *net, struct Qdisc *q,
>> > 	}
>> > 	mutex_init(&block->lock);
>> > 	INIT_LIST_HEAD(&block->chain_list);
>> >-	INIT_LIST_HEAD(&block->cb_list);
>> >+	INIT_LIST_HEAD(&block->flow.cb_list);
>> 
>> With introduction of struct flow_block, please introduce also a helper
>> to init this struct. Does not look right to init it from user codes
>> (tc/nft).
>
>Done.

^ permalink raw reply

* Re: [PATCH net-next 3/3] net: flow_offload: add flow_block structure and use it
From: Pablo Neira Ayuso @ 2019-07-11 13:13 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: netfilter-devel, davem, netdev, jakub.kicinski
In-Reply-To: <20190711080609.GF2291@nanopsycho>

On Thu, Jul 11, 2019 at 10:06:09AM +0200, Jiri Pirko wrote:
> Thu, Jul 11, 2019 at 02:12:35AM CEST, pablo@netfilter.org wrote:
> >This object stores the flow block callbacks that are attached to this
> >block. This patch restores block sharing.
> >
> >Fixes: da3eeb904ff4 ("net: flow_offload: add list handling functions")
> >Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
> >---
> > include/net/flow_offload.h        |  5 +++++
> > include/net/netfilter/nf_tables.h |  5 +++--
> > include/net/sch_generic.h         |  2 +-
> > net/core/flow_offload.c           |  2 +-
> > net/netfilter/nf_tables_api.c     |  2 +-
> > net/netfilter/nf_tables_offload.c |  5 +++--
> > net/sched/cls_api.c               | 10 +++++++---
> > 7 files changed, 21 insertions(+), 10 deletions(-)
> >
> >diff --git a/include/net/flow_offload.h b/include/net/flow_offload.h
> >index 98bf3af5c84d..e50d94736829 100644
> >--- a/include/net/flow_offload.h
> >+++ b/include/net/flow_offload.h
> >@@ -248,6 +248,10 @@ enum flow_block_binder_type {
> > 	FLOW_BLOCK_BINDER_TYPE_CLSACT_EGRESS,
> > };
> > 
> >+struct flow_block {
> >+	struct list_head cb_list;
> >+};
> >+
> > struct netlink_ext_ack;
> > 
> > struct flow_block_offload {
> >@@ -255,6 +259,7 @@ struct flow_block_offload {
> > 	enum flow_block_binder_type binder_type;
> > 	bool block_shared;
> > 	struct net *net;
> >+	struct flow_block *block;
> > 	struct list_head cb_list;
> > 	struct list_head *driver_block_list;
> > 	struct netlink_ext_ack *extack;
> >diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
> >index 35dfdd9f69b3..00658462f89b 100644
> >--- a/include/net/netfilter/nf_tables.h
> >+++ b/include/net/netfilter/nf_tables.h
> >@@ -11,6 +11,7 @@
> > #include <linux/rhashtable.h>
> > #include <net/netfilter/nf_flow_table.h>
> > #include <net/netlink.h>
> >+#include <net/flow_offload.h>
> > 
> > struct module;
> > 
> >@@ -951,7 +952,7 @@ struct nft_stats {
> >  *	@stats: per-cpu chain stats
> >  *	@chain: the chain
> >  *	@dev_name: device name that this base chain is attached to (if any)
> >- *	@cb_list: list of flow block callbacks (for hardware offload)
> >+ *	@block: flow block (for hardware offload)
> >  */
> > struct nft_base_chain {
> > 	struct nf_hook_ops		ops;
> >@@ -961,7 +962,7 @@ struct nft_base_chain {
> > 	struct nft_stats __percpu	*stats;
> > 	struct nft_chain		chain;
> > 	char 				dev_name[IFNAMSIZ];
> >-	struct list_head		cb_list;
> >+	struct flow_block		block;
> > };
> > 
> > static inline struct nft_base_chain *nft_base_chain(const struct nft_chain *chain)
> >diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> >index 9482e060483b..58041cb0ce15 100644
> >--- a/include/net/sch_generic.h
> >+++ b/include/net/sch_generic.h
> >@@ -399,7 +399,7 @@ struct tcf_block {
> > 	refcount_t refcnt;
> > 	struct net *net;
> > 	struct Qdisc *q;
> >-	struct list_head cb_list;
> >+	struct flow_block flow;
> 
> It is not a "flow", that is confusing. It should be named "flow_block".

Done.

> > 	struct list_head owner_list;
> > 	bool keep_dst;
> > 	unsigned int offloadcnt; /* Number of oddloaded filters */
> >diff --git a/net/core/flow_offload.c b/net/core/flow_offload.c
> >index a800fa78d96c..935c7f81a9ef 100644
> >--- a/net/core/flow_offload.c
> >+++ b/net/core/flow_offload.c
> >@@ -198,7 +198,7 @@ struct flow_block_cb *flow_block_cb_lookup(struct flow_block_offload *f,
> > {
> > 	struct flow_block_cb *block_cb;
> > 
> >-	list_for_each_entry(block_cb, f->driver_block_list, driver_list) {
> >+	list_for_each_entry(block_cb, &f->block->cb_list, list) {
> 
> Please made struct flow_block *block and argument of cb_lookup instead
> of struct flow_block_offload *f (as it was previously).

I can do so if you insist, this will make this fix larger.

> > 		if (block_cb->cb == cb &&
> > 		    block_cb->cb_ident == cb_ident)
> > 			return block_cb;
> >diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
> >index ed17a7c29b86..c565f146435b 100644
> >--- a/net/netfilter/nf_tables_api.c
> >+++ b/net/netfilter/nf_tables_api.c
> >@@ -1662,7 +1662,7 @@ static int nf_tables_addchain(struct nft_ctx *ctx, u8 family, u8 genmask,
> > 
> > 		chain->flags |= NFT_BASE_CHAIN | flags;
> > 		basechain->policy = NF_ACCEPT;
> >-		INIT_LIST_HEAD(&basechain->cb_list);
> >+		INIT_LIST_HEAD(&basechain->block.cb_list);
> > 	} else {
> > 		chain = kzalloc(sizeof(*chain), GFP_KERNEL);
> > 		if (chain == NULL)
> >diff --git a/net/netfilter/nf_tables_offload.c b/net/netfilter/nf_tables_offload.c
> >index 2c3302845f67..2a184277ee58 100644
> >--- a/net/netfilter/nf_tables_offload.c
> >+++ b/net/netfilter/nf_tables_offload.c
> >@@ -116,7 +116,7 @@ static int nft_setup_cb_call(struct nft_base_chain *basechain,
> > 	struct flow_block_cb *block_cb;
> > 	int err;
> > 
> >-	list_for_each_entry(block_cb, &basechain->cb_list, list) {
> >+	list_for_each_entry(block_cb, &basechain->block.cb_list, list) {
> > 		err = block_cb->cb(type, type_data, block_cb->cb_priv);
> > 		if (err < 0)
> > 			return err;
> >@@ -154,7 +154,7 @@ static int nft_flow_offload_rule(struct nft_trans *trans,
> > static int nft_flow_offload_bind(struct flow_block_offload *bo,
> > 				 struct nft_base_chain *basechain)
> > {
> >-	list_splice(&bo->cb_list, &basechain->cb_list);
> >+	list_splice(&bo->cb_list, &basechain->block.cb_list);
> > 	return 0;
> > }
> > 
> >@@ -198,6 +198,7 @@ static int nft_flow_offload_chain(struct nft_trans *trans,
> > 		return -EOPNOTSUPP;
> > 
> > 	bo.command = cmd;
> >+	bo.block = &basechain->block;
> > 	bo.binder_type = FLOW_BLOCK_BINDER_TYPE_CLSACT_INGRESS;
> > 	bo.extack = &extack;
> > 	INIT_LIST_HEAD(&bo.cb_list);
> >diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
> >index 51fbe6e95a92..66181961ad6f 100644
> >--- a/net/sched/cls_api.c
> >+++ b/net/sched/cls_api.c
> >@@ -691,6 +691,8 @@ static void tc_indr_block_ing_cmd(struct tc_indr_block_dev *indr_dev,
> > 	if (!indr_dev->block)
> > 		return;
> > 
> >+	bo.block = &indr_dev->block->flow;
> >+
> > 	indr_block_cb->cb(indr_dev->dev, indr_block_cb->cb_priv, TC_SETUP_BLOCK,
> > 			  &bo);
> > 	tcf_block_setup(indr_dev->block, &bo);
> >@@ -775,6 +777,7 @@ static void tc_indr_block_call(struct tcf_block *block, struct net_device *dev,
> > 		.command	= command,
> > 		.binder_type	= ei->binder_type,
> > 		.net		= dev_net(dev),
> >+		.block		= &block->flow,
> > 		.block_shared	= tcf_block_shared(block),
> > 		.extack		= extack,
> > 	};
> >@@ -810,6 +813,7 @@ static int tcf_block_offload_cmd(struct tcf_block *block,
> > 	bo.net = dev_net(dev);
> > 	bo.command = command;
> > 	bo.binder_type = ei->binder_type;
> >+	bo.block = &block->flow;
> > 	bo.block_shared = tcf_block_shared(block);
> > 	bo.extack = extack;
> > 	INIT_LIST_HEAD(&bo.cb_list);
> >@@ -988,7 +992,7 @@ static struct tcf_block *tcf_block_create(struct net *net, struct Qdisc *q,
> > 	}
> > 	mutex_init(&block->lock);
> > 	INIT_LIST_HEAD(&block->chain_list);
> >-	INIT_LIST_HEAD(&block->cb_list);
> >+	INIT_LIST_HEAD(&block->flow.cb_list);
> 
> With introduction of struct flow_block, please introduce also a helper
> to init this struct. Does not look right to init it from user codes
> (tc/nft).

Done.

^ permalink raw reply

* [PATCH net-next,v2 2/3] net: flow_offload: rename tc_setup_cb_t to flow_setup_cb_t
From: Pablo Neira Ayuso @ 2019-07-11 13:09 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev, jiri, jakub.kicinski
In-Reply-To: <20190711130923.2483-1-pablo@netfilter.org>

Rename this type definition and adapt users.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c |  2 +-
 drivers/net/ethernet/mscc/ocelot_tc.c          |  2 +-
 include/net/flow_offload.h                     | 16 ++++++++++------
 include/net/pkt_cls.h                          |  4 ++--
 include/net/sch_generic.h                      |  6 ++----
 net/core/flow_offload.c                        |  9 +++++----
 net/dsa/slave.c                                |  2 +-
 net/sched/cls_api.c                            |  2 +-
 net/sched/cls_bpf.c                            |  2 +-
 net/sched/cls_flower.c                         |  2 +-
 net/sched/cls_matchall.c                       |  2 +-
 net/sched/cls_u32.c                            |  6 +++---
 12 files changed, 29 insertions(+), 26 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index a469035400cf..51cd0b6f1f3e 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -1679,7 +1679,7 @@ static int mlxsw_sp_setup_tc_block(struct mlxsw_sp_port *mlxsw_sp_port,
 				   struct flow_block_offload *f)
 {
 	struct flow_block_cb *block_cb;
-	tc_setup_cb_t *cb;
+	flow_setup_cb_t *cb;
 	bool ingress;
 	int err;
 
diff --git a/drivers/net/ethernet/mscc/ocelot_tc.c b/drivers/net/ethernet/mscc/ocelot_tc.c
index abbcb66bf5ac..fba9512e9ca6 100644
--- a/drivers/net/ethernet/mscc/ocelot_tc.c
+++ b/drivers/net/ethernet/mscc/ocelot_tc.c
@@ -134,7 +134,7 @@ static int ocelot_setup_tc_block(struct ocelot_port *port,
 				 struct flow_block_offload *f)
 {
 	struct flow_block_cb *block_cb;
-	tc_setup_cb_t *cb;
+	flow_setup_cb_t *cb;
 	int err;
 
 	netdev_dbg(port->dev, "tc_block command %d, binder_type %d\n",
diff --git a/include/net/flow_offload.h b/include/net/flow_offload.h
index aa9b5287b231..98bf3af5c84d 100644
--- a/include/net/flow_offload.h
+++ b/include/net/flow_offload.h
@@ -3,7 +3,6 @@
 
 #include <linux/kernel.h>
 #include <net/flow_dissector.h>
-#include <net/sch_generic.h>
 
 struct flow_match {
 	struct flow_dissector	*dissector;
@@ -261,23 +260,27 @@ struct flow_block_offload {
 	struct netlink_ext_ack *extack;
 };
 
+enum tc_setup_type;
+typedef int flow_setup_cb_t(enum tc_setup_type type, void *type_data,
+			    void *cb_priv);
+
 struct flow_block_cb {
 	struct list_head	driver_list;
 	struct list_head	list;
-	tc_setup_cb_t		*cb;
+	flow_setup_cb_t		*cb;
 	void			*cb_ident;
 	void			*cb_priv;
 	void			(*release)(void *cb_priv);
 	unsigned int		refcnt;
 };
 
-struct flow_block_cb *flow_block_cb_alloc(tc_setup_cb_t *cb,
+struct flow_block_cb *flow_block_cb_alloc(flow_setup_cb_t *cb,
 					  void *cb_ident, void *cb_priv,
 					  void (*release)(void *cb_priv));
 void flow_block_cb_free(struct flow_block_cb *block_cb);
 
 struct flow_block_cb *flow_block_cb_lookup(struct flow_block_offload *offload,
-					   tc_setup_cb_t *cb, void *cb_ident);
+					   flow_setup_cb_t *cb, void *cb_ident);
 
 void *flow_block_cb_priv(struct flow_block_cb *block_cb);
 void flow_block_cb_incref(struct flow_block_cb *block_cb);
@@ -295,11 +298,12 @@ static inline void flow_block_cb_remove(struct flow_block_cb *block_cb,
 	list_move(&block_cb->list, &offload->cb_list);
 }
 
-bool flow_block_cb_is_busy(tc_setup_cb_t *cb, void *cb_ident,
+bool flow_block_cb_is_busy(flow_setup_cb_t *cb, void *cb_ident,
 			   struct list_head *driver_block_list);
 
 int flow_block_cb_setup_simple(struct flow_block_offload *f,
-			       struct list_head *driver_list, tc_setup_cb_t *cb,
+			       struct list_head *driver_list,
+			       flow_setup_cb_t *cb,
 			       void *cb_ident, void *cb_priv, bool ingress_only);
 
 enum flow_cls_command {
diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index 841faadceb6e..cee651b76a1f 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -126,14 +126,14 @@ static inline struct Qdisc *tcf_block_q(struct tcf_block *block)
 }
 
 static inline
-int tc_setup_cb_block_register(struct tcf_block *block, tc_setup_cb_t *cb,
+int tc_setup_cb_block_register(struct tcf_block *block, flow_setup_cb_t *cb,
 			       void *cb_priv)
 {
 	return 0;
 }
 
 static inline
-void tc_setup_cb_block_unregister(struct tcf_block *block, tc_setup_cb_t *cb,
+void tc_setup_cb_block_unregister(struct tcf_block *block, flow_setup_cb_t *cb,
 				  void *cb_priv)
 {
 }
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 855167bbc372..9482e060483b 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -15,6 +15,7 @@
 #include <linux/mutex.h>
 #include <net/gen_stats.h>
 #include <net/rtnetlink.h>
+#include <net/flow_offload.h>
 
 struct Qdisc_ops;
 struct qdisc_walker;
@@ -22,9 +23,6 @@ struct tcf_walker;
 struct module;
 struct bpf_flow_keys;
 
-typedef int tc_setup_cb_t(enum tc_setup_type type,
-			  void *type_data, void *cb_priv);
-
 typedef int tc_indr_block_bind_cb_t(struct net_device *dev, void *cb_priv,
 				    enum tc_setup_type type, void *type_data);
 
@@ -313,7 +311,7 @@ struct tcf_proto_ops {
 	void			(*walk)(struct tcf_proto *tp,
 					struct tcf_walker *arg, bool rtnl_held);
 	int			(*reoffload)(struct tcf_proto *tp, bool add,
-					     tc_setup_cb_t *cb, void *cb_priv,
+					     flow_setup_cb_t *cb, void *cb_priv,
 					     struct netlink_ext_ack *extack);
 	void			(*bind_class)(void *, u32, unsigned long);
 	void *			(*tmplt_create)(struct net *net,
diff --git a/net/core/flow_offload.c b/net/core/flow_offload.c
index 507de4b48815..a800fa78d96c 100644
--- a/net/core/flow_offload.c
+++ b/net/core/flow_offload.c
@@ -165,7 +165,7 @@ void flow_rule_match_enc_opts(const struct flow_rule *rule,
 }
 EXPORT_SYMBOL(flow_rule_match_enc_opts);
 
-struct flow_block_cb *flow_block_cb_alloc(tc_setup_cb_t *cb,
+struct flow_block_cb *flow_block_cb_alloc(flow_setup_cb_t *cb,
 					  void *cb_ident, void *cb_priv,
 					  void (*release)(void *cb_priv))
 {
@@ -194,7 +194,7 @@ void flow_block_cb_free(struct flow_block_cb *block_cb)
 EXPORT_SYMBOL(flow_block_cb_free);
 
 struct flow_block_cb *flow_block_cb_lookup(struct flow_block_offload *f,
-					   tc_setup_cb_t *cb, void *cb_ident)
+					   flow_setup_cb_t *cb, void *cb_ident)
 {
 	struct flow_block_cb *block_cb;
 
@@ -226,7 +226,7 @@ unsigned int flow_block_cb_decref(struct flow_block_cb *block_cb)
 }
 EXPORT_SYMBOL(flow_block_cb_decref);
 
-bool flow_block_cb_is_busy(tc_setup_cb_t *cb, void *cb_ident,
+bool flow_block_cb_is_busy(flow_setup_cb_t *cb, void *cb_ident,
 			   struct list_head *driver_block_list)
 {
 	struct flow_block_cb *block_cb;
@@ -243,7 +243,8 @@ EXPORT_SYMBOL(flow_block_cb_is_busy);
 
 int flow_block_cb_setup_simple(struct flow_block_offload *f,
 			       struct list_head *driver_block_list,
-			       tc_setup_cb_t *cb, void *cb_ident, void *cb_priv,
+			       flow_setup_cb_t *cb,
+			       void *cb_ident, void *cb_priv,
 			       bool ingress_only)
 {
 	struct flow_block_cb *block_cb;
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 6ca9ec58f881..d697a64fb564 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -951,7 +951,7 @@ static int dsa_slave_setup_tc_block(struct net_device *dev,
 				    struct flow_block_offload *f)
 {
 	struct flow_block_cb *block_cb;
-	tc_setup_cb_t *cb;
+	flow_setup_cb_t *cb;
 
 	if (f->binder_type == FLOW_BLOCK_BINDER_TYPE_CLSACT_INGRESS)
 		cb = dsa_slave_setup_tc_block_cb_ig;
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 278014e26aec..51fbe6e95a92 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -1514,7 +1514,7 @@ void tcf_block_put(struct tcf_block *block)
 EXPORT_SYMBOL(tcf_block_put);
 
 static int
-tcf_block_playback_offloads(struct tcf_block *block, tc_setup_cb_t *cb,
+tcf_block_playback_offloads(struct tcf_block *block, flow_setup_cb_t *cb,
 			    void *cb_priv, bool add, bool offload_in_use,
 			    struct netlink_ext_ack *extack)
 {
diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
index 691f71830134..3f7a9c02b70c 100644
--- a/net/sched/cls_bpf.c
+++ b/net/sched/cls_bpf.c
@@ -651,7 +651,7 @@ static void cls_bpf_walk(struct tcf_proto *tp, struct tcf_walker *arg,
 	}
 }
 
-static int cls_bpf_reoffload(struct tcf_proto *tp, bool add, tc_setup_cb_t *cb,
+static int cls_bpf_reoffload(struct tcf_proto *tp, bool add, flow_setup_cb_t *cb,
 			     void *cb_priv, struct netlink_ext_ack *extack)
 {
 	struct cls_bpf_head *head = rtnl_dereference(tp->root);
diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index 38d6e85693fc..054123742e32 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -1800,7 +1800,7 @@ fl_get_next_hw_filter(struct tcf_proto *tp, struct cls_fl_filter *f, bool add)
 	return NULL;
 }
 
-static int fl_reoffload(struct tcf_proto *tp, bool add, tc_setup_cb_t *cb,
+static int fl_reoffload(struct tcf_proto *tp, bool add, flow_setup_cb_t *cb,
 			void *cb_priv, struct netlink_ext_ack *extack)
 {
 	struct tcf_block *block = tp->chain->block;
diff --git a/net/sched/cls_matchall.c b/net/sched/cls_matchall.c
index a30d2f8feb32..455ea2793f9b 100644
--- a/net/sched/cls_matchall.c
+++ b/net/sched/cls_matchall.c
@@ -282,7 +282,7 @@ static void mall_walk(struct tcf_proto *tp, struct tcf_walker *arg,
 	arg->count++;
 }
 
-static int mall_reoffload(struct tcf_proto *tp, bool add, tc_setup_cb_t *cb,
+static int mall_reoffload(struct tcf_proto *tp, bool add, flow_setup_cb_t *cb,
 			  void *cb_priv, struct netlink_ext_ack *extack)
 {
 	struct cls_mall_head *head = rtnl_dereference(tp->root);
diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index be9e46c77e8b..8614088edd1b 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -1152,7 +1152,7 @@ static void u32_walk(struct tcf_proto *tp, struct tcf_walker *arg,
 }
 
 static int u32_reoffload_hnode(struct tcf_proto *tp, struct tc_u_hnode *ht,
-			       bool add, tc_setup_cb_t *cb, void *cb_priv,
+			       bool add, flow_setup_cb_t *cb, void *cb_priv,
 			       struct netlink_ext_ack *extack)
 {
 	struct tc_cls_u32_offload cls_u32 = {};
@@ -1172,7 +1172,7 @@ static int u32_reoffload_hnode(struct tcf_proto *tp, struct tc_u_hnode *ht,
 }
 
 static int u32_reoffload_knode(struct tcf_proto *tp, struct tc_u_knode *n,
-			       bool add, tc_setup_cb_t *cb, void *cb_priv,
+			       bool add, flow_setup_cb_t *cb, void *cb_priv,
 			       struct netlink_ext_ack *extack)
 {
 	struct tc_u_hnode *ht = rtnl_dereference(n->ht_down);
@@ -1213,7 +1213,7 @@ static int u32_reoffload_knode(struct tcf_proto *tp, struct tc_u_knode *n,
 	return 0;
 }
 
-static int u32_reoffload(struct tcf_proto *tp, bool add, tc_setup_cb_t *cb,
+static int u32_reoffload(struct tcf_proto *tp, bool add, flow_setup_cb_t *cb,
 			 void *cb_priv, struct netlink_ext_ack *extack)
 {
 	struct tc_u_common *tp_c = tp->data;
-- 
2.11.0


^ permalink raw reply related

* [PATCH net-next,v2 3/3] net: flow_offload: add flow_block structure and use it
From: Pablo Neira Ayuso @ 2019-07-11 13:09 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev, jiri, jakub.kicinski
In-Reply-To: <20190711130923.2483-1-pablo@netfilter.org>

This object stores the flow block callbacks that are attached to this
block. This patch restores block sharing.

Fixes: da3eeb904ff4 ("net: flow_offload: add list handling functions")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
v3: add flow_block_init() - Jiri Pirko.

 include/net/flow_offload.h        | 10 ++++++++++
 include/net/netfilter/nf_tables.h |  5 +++--
 include/net/sch_generic.h         |  2 +-
 net/core/flow_offload.c           |  2 +-
 net/netfilter/nf_tables_api.c     |  2 +-
 net/netfilter/nf_tables_offload.c |  5 +++--
 net/sched/cls_api.c               | 10 +++++++---
 7 files changed, 26 insertions(+), 10 deletions(-)

diff --git a/include/net/flow_offload.h b/include/net/flow_offload.h
index 98bf3af5c84d..8d90d96ace82 100644
--- a/include/net/flow_offload.h
+++ b/include/net/flow_offload.h
@@ -248,6 +248,10 @@ enum flow_block_binder_type {
 	FLOW_BLOCK_BINDER_TYPE_CLSACT_EGRESS,
 };
 
+struct flow_block {
+	struct list_head cb_list;
+};
+
 struct netlink_ext_ack;
 
 struct flow_block_offload {
@@ -255,6 +259,7 @@ struct flow_block_offload {
 	enum flow_block_binder_type binder_type;
 	bool block_shared;
 	struct net *net;
+	struct flow_block *block;
 	struct list_head cb_list;
 	struct list_head *driver_block_list;
 	struct netlink_ext_ack *extack;
@@ -336,4 +341,9 @@ flow_cls_offload_flow_rule(struct flow_cls_offload *flow_cmd)
 	return flow_cmd->rule;
 }
 
+static inline void flow_block_init(struct flow_block *flow_block)
+{
+	INIT_LIST_HEAD(&flow_block->cb_list);
+}
+
 #endif /* _NET_FLOW_OFFLOAD_H */
diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
index 35dfdd9f69b3..b02d8fa80b95 100644
--- a/include/net/netfilter/nf_tables.h
+++ b/include/net/netfilter/nf_tables.h
@@ -11,6 +11,7 @@
 #include <linux/rhashtable.h>
 #include <net/netfilter/nf_flow_table.h>
 #include <net/netlink.h>
+#include <net/flow_offload.h>
 
 struct module;
 
@@ -951,7 +952,7 @@ struct nft_stats {
  *	@stats: per-cpu chain stats
  *	@chain: the chain
  *	@dev_name: device name that this base chain is attached to (if any)
- *	@cb_list: list of flow block callbacks (for hardware offload)
+ *	@flow: flow block (for hardware offload)
  */
 struct nft_base_chain {
 	struct nf_hook_ops		ops;
@@ -961,7 +962,7 @@ struct nft_base_chain {
 	struct nft_stats __percpu	*stats;
 	struct nft_chain		chain;
 	char 				dev_name[IFNAMSIZ];
-	struct list_head		cb_list;
+	struct flow_block		flow_block;
 };
 
 static inline struct nft_base_chain *nft_base_chain(const struct nft_chain *chain)
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 9482e060483b..6b6b01234dd9 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -399,7 +399,7 @@ struct tcf_block {
 	refcount_t refcnt;
 	struct net *net;
 	struct Qdisc *q;
-	struct list_head cb_list;
+	struct flow_block flow_block;
 	struct list_head owner_list;
 	bool keep_dst;
 	unsigned int offloadcnt; /* Number of oddloaded filters */
diff --git a/net/core/flow_offload.c b/net/core/flow_offload.c
index a800fa78d96c..935c7f81a9ef 100644
--- a/net/core/flow_offload.c
+++ b/net/core/flow_offload.c
@@ -198,7 +198,7 @@ struct flow_block_cb *flow_block_cb_lookup(struct flow_block_offload *f,
 {
 	struct flow_block_cb *block_cb;
 
-	list_for_each_entry(block_cb, f->driver_block_list, driver_list) {
+	list_for_each_entry(block_cb, &f->block->cb_list, list) {
 		if (block_cb->cb == cb &&
 		    block_cb->cb_ident == cb_ident)
 			return block_cb;
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index ed17a7c29b86..11acfc3ee37a 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -1662,7 +1662,7 @@ static int nf_tables_addchain(struct nft_ctx *ctx, u8 family, u8 genmask,
 
 		chain->flags |= NFT_BASE_CHAIN | flags;
 		basechain->policy = NF_ACCEPT;
-		INIT_LIST_HEAD(&basechain->cb_list);
+		flow_block_init(&basechain->flow_block);
 	} else {
 		chain = kzalloc(sizeof(*chain), GFP_KERNEL);
 		if (chain == NULL)
diff --git a/net/netfilter/nf_tables_offload.c b/net/netfilter/nf_tables_offload.c
index 2c3302845f67..64f5fd5f240e 100644
--- a/net/netfilter/nf_tables_offload.c
+++ b/net/netfilter/nf_tables_offload.c
@@ -116,7 +116,7 @@ static int nft_setup_cb_call(struct nft_base_chain *basechain,
 	struct flow_block_cb *block_cb;
 	int err;
 
-	list_for_each_entry(block_cb, &basechain->cb_list, list) {
+	list_for_each_entry(block_cb, &basechain->flow_block.cb_list, list) {
 		err = block_cb->cb(type, type_data, block_cb->cb_priv);
 		if (err < 0)
 			return err;
@@ -154,7 +154,7 @@ static int nft_flow_offload_rule(struct nft_trans *trans,
 static int nft_flow_offload_bind(struct flow_block_offload *bo,
 				 struct nft_base_chain *basechain)
 {
-	list_splice(&bo->cb_list, &basechain->cb_list);
+	list_splice(&bo->cb_list, &basechain->flow_block.cb_list);
 	return 0;
 }
 
@@ -198,6 +198,7 @@ static int nft_flow_offload_chain(struct nft_trans *trans,
 		return -EOPNOTSUPP;
 
 	bo.command = cmd;
+	bo.block = &basechain->flow_block;
 	bo.binder_type = FLOW_BLOCK_BINDER_TYPE_CLSACT_INGRESS;
 	bo.extack = &extack;
 	INIT_LIST_HEAD(&bo.cb_list);
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 51fbe6e95a92..3b9720fe1e60 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -691,6 +691,8 @@ static void tc_indr_block_ing_cmd(struct tc_indr_block_dev *indr_dev,
 	if (!indr_dev->block)
 		return;
 
+	bo.block = &indr_dev->block->flow_block;
+
 	indr_block_cb->cb(indr_dev->dev, indr_block_cb->cb_priv, TC_SETUP_BLOCK,
 			  &bo);
 	tcf_block_setup(indr_dev->block, &bo);
@@ -775,6 +777,7 @@ static void tc_indr_block_call(struct tcf_block *block, struct net_device *dev,
 		.command	= command,
 		.binder_type	= ei->binder_type,
 		.net		= dev_net(dev),
+		.block		= &block->flow_block,
 		.block_shared	= tcf_block_shared(block),
 		.extack		= extack,
 	};
@@ -810,6 +813,7 @@ static int tcf_block_offload_cmd(struct tcf_block *block,
 	bo.net = dev_net(dev);
 	bo.command = command;
 	bo.binder_type = ei->binder_type;
+	bo.block = &block->flow_block;
 	bo.block_shared = tcf_block_shared(block);
 	bo.extack = extack;
 	INIT_LIST_HEAD(&bo.cb_list);
@@ -987,8 +991,8 @@ static struct tcf_block *tcf_block_create(struct net *net, struct Qdisc *q,
 		return ERR_PTR(-ENOMEM);
 	}
 	mutex_init(&block->lock);
+	flow_block_init(&block->flow_block);
 	INIT_LIST_HEAD(&block->chain_list);
-	INIT_LIST_HEAD(&block->cb_list);
 	INIT_LIST_HEAD(&block->owner_list);
 	INIT_LIST_HEAD(&block->chain0.filter_chain_list);
 
@@ -1570,7 +1574,7 @@ static int tcf_block_bind(struct tcf_block *block,
 
 		i++;
 	}
-	list_splice(&bo->cb_list, &block->cb_list);
+	list_splice(&bo->cb_list, &block->flow_block.cb_list);
 
 	return 0;
 
@@ -3155,7 +3159,7 @@ int tc_setup_cb_call(struct tcf_block *block, enum tc_setup_type type,
 	if (block->nooffloaddevcnt && err_stop)
 		return -EOPNOTSUPP;
 
-	list_for_each_entry(block_cb, &block->cb_list, list) {
+	list_for_each_entry(block_cb, &block->flow_block.cb_list, list) {
 		err = block_cb->cb(type, type_data, block_cb->cb_priv);
 		if (err) {
 			if (err_stop)
-- 
2.11.0


^ permalink raw reply related

* [PATCH net-next,v2 1/3] net: flow_offload: remove netns parameter from flow_block_cb_alloc()
From: Pablo Neira Ayuso @ 2019-07-11 13:09 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev, jiri, jakub.kicinski

No need to annotate the netns on the flow block callback object,
flow_block_cb_is_busy() already checks for used blocks.

Fixes: d63db30c8537 ("net: flow_offload: add flow_block_cb_alloc() and flow_block_cb_free()")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
v2: no changes

 drivers/net/ethernet/mellanox/mlx5/core/en_rep.c    | 3 +--
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c      | 5 ++---
 drivers/net/ethernet/mscc/ocelot_flower.c           | 3 +--
 drivers/net/ethernet/mscc/ocelot_tc.c               | 2 +-
 drivers/net/ethernet/netronome/nfp/flower/offload.c | 6 ++----
 include/net/flow_offload.h                          | 3 +--
 net/core/flow_offload.c                             | 9 +++------
 net/dsa/slave.c                                     | 2 +-
 8 files changed, 12 insertions(+), 21 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
index 7245d287633d..2162412073c5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
@@ -735,8 +735,7 @@ mlx5e_rep_indr_setup_tc_block(struct net_device *netdev,
 		list_add(&indr_priv->list,
 			 &rpriv->uplink_priv.tc_indr_block_priv_list);
 
-		block_cb = flow_block_cb_alloc(f->net,
-					       mlx5e_rep_indr_setup_block_cb,
+		block_cb = flow_block_cb_alloc(mlx5e_rep_indr_setup_block_cb,
 					       indr_priv, indr_priv,
 					       mlx5e_rep_indr_tc_block_unbind);
 		if (IS_ERR(block_cb)) {
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index 4d34d42b3b0e..a469035400cf 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -1610,8 +1610,7 @@ mlxsw_sp_setup_tc_block_flower_bind(struct mlxsw_sp_port *mlxsw_sp_port,
 		acl_block = mlxsw_sp_acl_block_create(mlxsw_sp, f->net);
 		if (!acl_block)
 			return -ENOMEM;
-		block_cb = flow_block_cb_alloc(f->net,
-					       mlxsw_sp_setup_tc_block_cb_flower,
+		block_cb = flow_block_cb_alloc(mlxsw_sp_setup_tc_block_cb_flower,
 					       mlxsw_sp, acl_block,
 					       mlxsw_sp_tc_block_flower_release);
 		if (IS_ERR(block_cb)) {
@@ -1702,7 +1701,7 @@ static int mlxsw_sp_setup_tc_block(struct mlxsw_sp_port *mlxsw_sp_port,
 					  &mlxsw_sp_block_cb_list))
 			return -EBUSY;
 
-		block_cb = flow_block_cb_alloc(f->net, cb, mlxsw_sp_port,
+		block_cb = flow_block_cb_alloc(cb, mlxsw_sp_port,
 					       mlxsw_sp_port, NULL);
 		if (IS_ERR(block_cb))
 			return PTR_ERR(block_cb);
diff --git a/drivers/net/ethernet/mscc/ocelot_flower.c b/drivers/net/ethernet/mscc/ocelot_flower.c
index 7aaddc09c185..6a11aea8b186 100644
--- a/drivers/net/ethernet/mscc/ocelot_flower.c
+++ b/drivers/net/ethernet/mscc/ocelot_flower.c
@@ -323,8 +323,7 @@ int ocelot_setup_tc_block_flower_bind(struct ocelot_port *port,
 		if (!port_block)
 			return -ENOMEM;

-		block_cb = flow_block_cb_alloc(f->net,
-					       ocelot_setup_tc_block_cb_flower,
+		block_cb = flow_block_cb_alloc(ocelot_setup_tc_block_cb_flower,
 					       port, port_block,
 					       ocelot_tc_block_unbind);
 		if (IS_ERR(block_cb)) {
diff --git a/drivers/net/ethernet/mscc/ocelot_tc.c b/drivers/net/ethernet/mscc/ocelot_tc.c
index 9e6464ffae5d..abbcb66bf5ac 100644
--- a/drivers/net/ethernet/mscc/ocelot_tc.c
+++ b/drivers/net/ethernet/mscc/ocelot_tc.c
@@ -156,7 +156,7 @@ static int ocelot_setup_tc_block(struct ocelot_port *port,
 		if (flow_block_cb_is_busy(cb, port, &ocelot_block_cb_list))
 			return -EBUSY;
 
-		block_cb = flow_block_cb_alloc(f->net, cb, port, port, NULL);
+		block_cb = flow_block_cb_alloc(cb, port, port, NULL);
 		if (IS_ERR(block_cb))
 			return PTR_ERR(block_cb);
 
diff --git a/drivers/net/ethernet/netronome/nfp/flower/offload.c b/drivers/net/ethernet/netronome/nfp/flower/offload.c
index 7e725fa60347..a0f8892bb4b5 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/offload.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/offload.c
@@ -1324,8 +1324,7 @@ static int nfp_flower_setup_tc_block(struct net_device *netdev,
 					  &nfp_block_cb_list))
 			return -EBUSY;
 
-		block_cb = flow_block_cb_alloc(f->net,
-					       nfp_flower_setup_tc_block_cb,
+		block_cb = flow_block_cb_alloc(nfp_flower_setup_tc_block_cb,
 					       repr, repr, NULL);
 		if (IS_ERR(block_cb))
 			return PTR_ERR(block_cb);
@@ -1430,8 +1429,7 @@ nfp_flower_setup_indr_tc_block(struct net_device *netdev, struct nfp_app *app,
 		cb_priv->app = app;
 		list_add(&cb_priv->list, &priv->indr_block_cb_priv);
 
-		block_cb = flow_block_cb_alloc(f->net,
-					       nfp_flower_setup_indr_block_cb,
+		block_cb = flow_block_cb_alloc(nfp_flower_setup_indr_block_cb,
 					       cb_priv, cb_priv,
 					       nfp_flower_setup_indr_tc_release);
 		if (IS_ERR(block_cb)) {
diff --git a/include/net/flow_offload.h b/include/net/flow_offload.h
index db337299e81e..aa9b5287b231 100644
--- a/include/net/flow_offload.h
+++ b/include/net/flow_offload.h
@@ -264,7 +264,6 @@ struct flow_block_offload {
 struct flow_block_cb {
 	struct list_head	driver_list;
 	struct list_head	list;
-	struct net		*net;
 	tc_setup_cb_t		*cb;
 	void			*cb_ident;
 	void			*cb_priv;
@@ -272,7 +271,7 @@ struct flow_block_cb {
 	unsigned int		refcnt;
 };
 
-struct flow_block_cb *flow_block_cb_alloc(struct net *net, tc_setup_cb_t *cb,
+struct flow_block_cb *flow_block_cb_alloc(tc_setup_cb_t *cb,
 					  void *cb_ident, void *cb_priv,
 					  void (*release)(void *cb_priv));
 void flow_block_cb_free(struct flow_block_cb *block_cb);
diff --git a/net/core/flow_offload.c b/net/core/flow_offload.c
index 76f8db3841d7..507de4b48815 100644
--- a/net/core/flow_offload.c
+++ b/net/core/flow_offload.c
@@ -165,7 +165,7 @@ void flow_rule_match_enc_opts(const struct flow_rule *rule,
 }
 EXPORT_SYMBOL(flow_rule_match_enc_opts);
 
-struct flow_block_cb *flow_block_cb_alloc(struct net *net, tc_setup_cb_t *cb,
+struct flow_block_cb *flow_block_cb_alloc(tc_setup_cb_t *cb,
 					  void *cb_ident, void *cb_priv,
 					  void (*release)(void *cb_priv))
 {
@@ -175,7 +175,6 @@ struct flow_block_cb *flow_block_cb_alloc(struct net *net, tc_setup_cb_t *cb,
 	if (!block_cb)
 		return ERR_PTR(-ENOMEM);
 
-	block_cb->net = net;
 	block_cb->cb = cb;
 	block_cb->cb_ident = cb_ident;
 	block_cb->cb_priv = cb_priv;
@@ -200,8 +199,7 @@ struct flow_block_cb *flow_block_cb_lookup(struct flow_block_offload *f,
 	struct flow_block_cb *block_cb;
 
 	list_for_each_entry(block_cb, f->driver_block_list, driver_list) {
-		if (block_cb->net == f->net &&
-		    block_cb->cb == cb &&
+		if (block_cb->cb == cb &&
 		    block_cb->cb_ident == cb_ident)
 			return block_cb;
 	}
@@ -261,8 +259,7 @@ int flow_block_cb_setup_simple(struct flow_block_offload *f,
 		if (flow_block_cb_is_busy(cb, cb_ident, driver_block_list))
 			return -EBUSY;
 
-		block_cb = flow_block_cb_alloc(f->net, cb, cb_ident,
-					       cb_priv, NULL);
+		block_cb = flow_block_cb_alloc(cb, cb_ident, cb_priv, NULL);
 		if (IS_ERR(block_cb))
 			return PTR_ERR(block_cb);
 
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 614c38ece104..6ca9ec58f881 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -967,7 +967,7 @@ static int dsa_slave_setup_tc_block(struct net_device *dev,
 		if (flow_block_cb_is_busy(cb, dev, &dsa_slave_block_cb_list))
 			return -EBUSY;
 
-		block_cb = flow_block_cb_alloc(f->net, cb, dev, dev, NULL);
+		block_cb = flow_block_cb_alloc(cb, dev, dev, NULL);
 		if (IS_ERR(block_cb))
 			return PTR_ERR(block_cb);
 
-- 
2.11.0


^ permalink raw reply related

* RE: [PATCH] tipc: ensure skb->lock is initialised
From: Jon Maloy @ 2019-07-11 12:55 UTC (permalink / raw)
  To: Chris Packham, Eric Dumazet, ying.xue@windriver.com,
	davem@davemloft.net
  Cc: netdev@vger.kernel.org, tipc-discussion@lists.sourceforge.net,
	linux-kernel@vger.kernel.org
In-Reply-To: <4d2ac0ce7f974184ac43b71f19aee7a3@svr-chch-ex1.atlnz.lc>



> -----Original Message-----
> From: Chris Packham <Chris.Packham@alliedtelesis.co.nz>
> Sent: 10-Jul-19 16:58
> To: Jon Maloy <jon.maloy@ericsson.com>; Eric Dumazet
> <eric.dumazet@gmail.com>; ying.xue@windriver.com;
> davem@davemloft.net
> Cc: netdev@vger.kernel.org; tipc-discussion@lists.sourceforge.net; linux-
> kernel@vger.kernel.org
> Subject: Re: [PATCH] tipc: ensure skb->lock is initialised
> 
> 
> On 11/07/19 1:10 AM, Jon Maloy wrote:
> >> -----Original Message-----
> >> From: Eric Dumazet <eric.dumazet@gmail.com>
> >> Sent: 10-Jul-19 04:00
> >> To: Jon Maloy <jon.maloy@ericsson.com>; Eric Dumazet
> >> <eric.dumazet@gmail.com>; Chris Packham
> >> <Chris.Packham@alliedtelesis.co.nz>; ying.xue@windriver.com;
> >> davem@davemloft.net
> >> Cc: netdev@vger.kernel.org; tipc-discussion@lists.sourceforge.net;
> >> linux- kernel@vger.kernel.org
> >> Subject: Re: [PATCH] tipc: ensure skb->lock is initialised
> >>
> >>
> >>
> >> On 7/9/19 10:15 PM, Jon Maloy wrote:
> >>>
> >>> It is not only for lockdep purposes, -it is essential.  But please
> >>> provide details
> >> about where you see that more fixes are needed.
> >>>
> >>
> >> Simple fact that you detect a problem only when skb_queue_purge() is
> >> called should talk by itself.
> >>
> >> As I stated, there are many places where the list is manipulated
> >> _without_ its spinlock being held.
> >
> > Yes, and that is the way it should be on the send path.
> >
> >>
> >> You want consistency, then
> >>
> >> - grab the spinlock all the time.
> >> - Or do not ever use it.
> >
> > That is exactly what we are doing.
> > - The send path doesn't need the spinlock, and never grabs it.
> > - The receive path does need it, and always grabs it.
> >
> > However, since we don't know from the beginning which path a created
> > message will follow, we initialize the queue spinlock "just in case"
> > when it is created, even though it may never be used later.
> > You can see this as a violation of the principle you are stating
> > above, but it is a prize that is worth paying, given savings in code
> > volume, complexity and performance.
> >
> >>
> >> Do not initialize the spinlock just in case a path will use
> >> skb_queue_purge() (instead of using __skb_queue_purge())
> >
> > I am ok with that. I think we can agree that Chris goes for that
> > solution, so we can get this bug fixed.
> 
> So would you like a v2 with an improved commit message? I note that I said
> skb->lock instead of head->lock in the subject line so at least that should be
> corrected.

Yes, unless Eric has any more objections. 
I addition, I have realized that we can change from skb_queue_head_init()  to __skb_queue_head_init() at all the disputed locations in the socket code.
Then, we do a separate call to spin_lock_init(&list->lock) at the moment we find that the message will follow the receive path, i.e., in tipc_node_xmit().
That should make everybody happy. 
I will post a patch when net-next re-opens.

BR
///jon



^ permalink raw reply

* Re: [PATCH net-next 00/11] Add drop monitor for offloaded data paths
From: Ido Schimmel @ 2019-07-11 12:39 UTC (permalink / raw)
  To: David Miller, nhorman
  Cc: netdev, jiri, mlxsw, dsahern, roopa, nikolay, andy, pablo,
	jakub.kicinski, pieter.jansenvanvuuren, andrew, f.fainelli,
	vivien.didelot, idosch
In-Reply-To: <20190707.124541.451040901050013496.davem@davemloft.net>

On Sun, Jul 07, 2019 at 12:45:41PM -0700, David Miller wrote:
> From: Ido Schimmel <idosch@idosch.org>
> Date: Sun,  7 Jul 2019 10:58:17 +0300
> 
> > Users have several ways to debug the kernel and understand why a packet
> > was dropped. For example, using "drop monitor" and "perf". Both
> > utilities trace kfree_skb(), which is the function called when a packet
> > is freed as part of a failure. The information provided by these tools
> > is invaluable when trying to understand the cause of a packet loss.
> > 
> > In recent years, large portions of the kernel data path were offloaded
> > to capable devices. Today, it is possible to perform L2 and L3
> > forwarding in hardware, as well as tunneling (IP-in-IP and VXLAN).
> > Different TC classifiers and actions are also offloaded to capable
> > devices, at both ingress and egress.
> > 
> > However, when the data path is offloaded it is not possible to achieve
> > the same level of introspection as tools such "perf" and "drop monitor"
> > become irrelevant.
> > 
> > This patchset aims to solve this by allowing users to monitor packets
> > that the underlying device decided to drop along with relevant metadata
> > such as the drop reason and ingress port.
> 
> We are now going to have 5 or so ways to capture packets passing through
> the system, this is nonsense.
> 
> AF_PACKET, kfree_skb drop monitor, perf, XDP perf events, and now this
> devlink thing.
> 
> This is insanity, too many ways to do the same thing and therefore the
> worst possible user experience.
> 
> Pick _ONE_ method to trap packets and forward normal kfree_skb events,
> XDP perf events, and these taps there too.
> 
> I mean really, think about it from the average user's perspective.  To
> see all drops/pkts I have to attach a kfree_skb tracepoint, and not just
> listen on devlink but configure a special tap thing beforehand and then
> if someone is using XDP I gotta setup another perf event buffer capture
> thing too.

Dave,

Before I start working on v2, I would like to get your feedback on the
high level plan. Also adding Neil who is the maintainer of drop_monitor
(and counterpart DropWatch tool [1]).

IIUC, the problem you point out is that users need to use different
tools to monitor packet drops based on where these drops occur
(SW/HW/XDP).

Therefore, my plan is to extend the existing drop_monitor netlink
channel to also cover HW drops. I will add a new message type and a new
multicast group for HW drops and encode in the message what is currently
encoded in the devlink events.

I would like to emphasize that the configuration of whether these
dropped packets are even sent to the CPU from the device still needs to
reside in devlink given this is the go-to tool for device-specific
configuration. In addition, these drop traps are a small subset of the
entire packet traps devices support and all have similar needs such as
HW policer configuration and statistics.

In the future we might also want to report events that indicate the
formation of possible problems. For example, in case packets are queued
above a certain threshold or for long periods of time. I hope we could
re-use drop_monitor for this as well, thereby making it the go-to
channel for diagnosing current and to-be problems in the data path.

Thanks

[1] https://github.com/nhorman/dropwatch

^ permalink raw reply

* Re:  Re: Re: linux-next: build failure after merge of the net-next tree
From: Bernard Metzler @ 2019-07-11 12:29 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Leon Romanovsky, Stephen Rothwell, Doug Ledford, David Miller,
	Networking, Linux Next Mailing List, Linux Kernel Mailing List
In-Reply-To: <20190711115235.GA25821@mellanox.com>

-----"Jason Gunthorpe" <jgg@mellanox.com> wrote: -----

>To: "Bernard Metzler" <BMT@zurich.ibm.com>
>From: "Jason Gunthorpe" <jgg@mellanox.com>
>Date: 07/11/2019 01:53PM
>Cc: "Leon Romanovsky" <leon@kernel.org>, "Stephen Rothwell"
><sfr@canb.auug.org.au>, "Doug Ledford" <dledford@redhat.com>, "David
>Miller" <davem@davemloft.net>, "Networking" <netdev@vger.kernel.org>,
>"Linux Next Mailing List" <linux-next@vger.kernel.org>, "Linux Kernel
>Mailing List" <linux-kernel@vger.kernel.org>
>Subject: [EXTERNAL] Re: Re: linux-next: build failure after merge of
>the net-next tree
>
>On Thu, Jul 11, 2019 at 08:00:49AM +0000, Bernard Metzler wrote:
>
>> That listen will not sleep. The socket is just marked
>> listening. 
>
>Eh? siw_listen_address() calls siw_cep_alloc() which does:
>
>	struct siw_cep *cep = kzalloc(sizeof(*cep), GFP_KERNEL);
>
>Which is sleeping. Many other cases too.
>
>Jason
>
>
Ah, true! I was after really deep sleeps like user level
socket accept() calls ;) So you are correct of course.

Thanks!
Bernard


^ permalink raw reply

* Re: [oss-drivers] Re: [RFC bpf-next 2/8] bpf: extend list based insn patching infra to verification layer
From: Jiong Wang @ 2019-07-11 12:20 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, Alexei Starovoitov, Daniel Borkmann, Edward Cree,
	Naveen N. Rao, Andrii Nakryiko, Jakub Kicinski, bpf, Networking,
	oss-drivers
In-Reply-To: <87o920235d.fsf@netronome.com>


Jiong Wang writes:

> Andrii Nakryiko writes:
>
>> On Thu, Jul 4, 2019 at 2:32 PM Jiong Wang <jiong.wang@netronome.com> wrote:
>>>
>>> Verification layer also needs to handle auxiliar info as well as adjusting
>>> subprog start.
>>>
>>> At this layer, insns inside patch buffer could be jump, but they should
>>> have been resolved, meaning they shouldn't jump to insn outside of the
>>> patch buffer. Lineration function for this layer won't touch insns inside
>>> patch buffer.
>>>
>>> Adjusting subprog is finished along with adjusting jump target when the
>>> input will cover bpf to bpf call insn, re-register subprog start is cheap.
>>> But adjustment when there is insn deleteion is not considered yet.
>>>
>>> Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
>>> ---
>>>  kernel/bpf/verifier.c | 150 ++++++++++++++++++++++++++++++++++++++++++++++++++
>>>  1 file changed, 150 insertions(+)
>>>
>>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>>> index a2e7637..2026d64 100644
>>> --- a/kernel/bpf/verifier.c
>>> +++ b/kernel/bpf/verifier.c
>>> @@ -8350,6 +8350,156 @@ static void opt_hard_wire_dead_code_branches(struct bpf_verifier_env *env)
>>>         }
>>>  }
>>>
>>> +/* Linearize bpf list insn to array (verifier layer). */
>>> +static struct bpf_verifier_env *
>>> +verifier_linearize_list_insn(struct bpf_verifier_env *env,
>>> +                            struct bpf_list_insn *list)
>>
>> It's unclear why this returns env back? It's not allocating a new env,
>> so it's weird and unnecessary. Just return error code.
>
> The reason is I was thinking we have two layers in BPF, the core and the
> verifier.
>
> For core layer (the relevant file is core.c), when doing patching, the
> input is insn list and bpf_prog, the linearization should linearize the
> insn list into insn array, and also whatever others affect inside bpf_prog
> due to changing on insns, for example line info inside prog->aux. So the
> return value is bpf_prog for core layer linearization hook. 
>
> For verifier layer, it is similar, but the context if bpf_verifier_env, the
> linearization hook should linearize the insn list, and also those affected
> inside env, for example bpf_insn_aux_data, so the return value is
> bpf_verifier_env, meaning returning an updated verifier context
> (bpf_verifier_env) after insn list linearization.

Realized your point is no new env is allocated, so just return error
code. Yes, the env pointer is not changed, just internal data is
updated. Return bpf_verifier_env mostly is trying to make the hook more
clear that it returns an updated "context" where the linearization happens,
for verifier layer, it is bpf_verifier_env, and for core layer, it is
bpf_prog, so return value was designed to return these two types.

>
> Make sense?
>
> Regards,
> Jiong
>
>>
>>> +{
>>> +       u32 *idx_map, idx, orig_cnt, fini_cnt = 0;
>>> +       struct bpf_subprog_info *new_subinfo;
>>> +       struct bpf_insn_aux_data *new_data;
>>> +       struct bpf_prog *prog = env->prog;
>>> +       struct bpf_verifier_env *ret_env;
>>> +       struct bpf_insn *insns, *insn;
>>> +       struct bpf_list_insn *elem;
>>> +       int ret;
>>> +
>>> +       /* Calculate final size. */
>>> +       for (elem = list; elem; elem = elem->next)
>>> +               if (!(elem->flag & LIST_INSN_FLAG_REMOVED))
>>> +                       fini_cnt++;
>>> +
>>> +       orig_cnt = prog->len;
>>> +       insns = prog->insnsi;
>>> +       /* If prog length remains same, nothing else to do. */
>>> +       if (fini_cnt == orig_cnt) {
>>> +               for (insn = insns, elem = list; elem; elem = elem->next, insn++)
>>> +                       *insn = elem->insn;
>>> +               return env;
>>> +       }
>>> +       /* Realloc insn buffer when necessary. */
>>> +       if (fini_cnt > orig_cnt)
>>> +               prog = bpf_prog_realloc(prog, bpf_prog_size(fini_cnt),
>>> +                                       GFP_USER);
>>> +       if (!prog)
>>> +               return ERR_PTR(-ENOMEM);
>>> +       insns = prog->insnsi;
>>> +       prog->len = fini_cnt;
>>> +       ret_env = env;
>>> +
>>> +       /* idx_map[OLD_IDX] = NEW_IDX */
>>> +       idx_map = kvmalloc(orig_cnt * sizeof(u32), GFP_KERNEL);
>>> +       if (!idx_map)
>>> +               return ERR_PTR(-ENOMEM);
>>> +       memset(idx_map, 0xff, orig_cnt * sizeof(u32));
>>> +
>>> +       /* Use the same alloc method used when allocating env->insn_aux_data. */
>>> +       new_data = vzalloc(array_size(sizeof(*new_data), fini_cnt));
>>> +       if (!new_data) {
>>> +               kvfree(idx_map);
>>> +               return ERR_PTR(-ENOMEM);
>>> +       }
>>> +
>>> +       /* Copy over insn + calculate idx_map. */
>>> +       for (idx = 0, elem = list; elem; elem = elem->next) {
>>> +               int orig_idx = elem->orig_idx - 1;
>>> +
>>> +               if (orig_idx >= 0) {
>>> +                       idx_map[orig_idx] = idx;
>>> +
>>> +                       if (elem->flag & LIST_INSN_FLAG_REMOVED)
>>> +                               continue;
>>> +
>>> +                       new_data[idx] = env->insn_aux_data[orig_idx];
>>> +
>>> +                       if (elem->flag & LIST_INSN_FLAG_PATCHED)
>>> +                               new_data[idx].zext_dst =
>>> +                                       insn_has_def32(env, &elem->insn);
>>> +               } else {
>>> +                       new_data[idx].seen = true;
>>> +                       new_data[idx].zext_dst = insn_has_def32(env,
>>> +                                                               &elem->insn);
>>> +               }
>>> +               insns[idx++] = elem->insn;
>>> +       }
>>> +
>>> +       new_subinfo = kvzalloc(sizeof(env->subprog_info), GFP_KERNEL);
>>> +       if (!new_subinfo) {
>>> +               kvfree(idx_map);
>>> +               vfree(new_data);
>>> +               return ERR_PTR(-ENOMEM);
>>> +       }
>>> +       memcpy(new_subinfo, env->subprog_info, sizeof(env->subprog_info));
>>> +       memset(env->subprog_info, 0, sizeof(env->subprog_info));
>>> +       env->subprog_cnt = 0;
>>> +       env->prog = prog;
>>> +       ret = add_subprog(env, 0);
>>> +       if (ret < 0) {
>>> +               ret_env = ERR_PTR(ret);
>>> +               goto free_all_ret;
>>> +       }
>>> +       /* Relocate jumps using idx_map.
>>> +        *   old_dst = jmp_insn.old_target + old_pc + 1;
>>> +        *   new_dst = idx_map[old_dst] = jmp_insn.new_target + new_pc + 1;
>>> +        *   jmp_insn.new_target = new_dst - new_pc - 1;
>>> +        */
>>> +       for (idx = 0, elem = list; elem; elem = elem->next) {
>>> +               int orig_idx = elem->orig_idx;
>>> +
>>> +               if (elem->flag & LIST_INSN_FLAG_REMOVED)
>>> +                       continue;
>>> +               if ((elem->flag & LIST_INSN_FLAG_PATCHED) || !orig_idx) {
>>> +                       idx++;
>>> +                       continue;
>>> +               }
>>> +
>>> +               ret = bpf_jit_adj_imm_off(&insns[idx], orig_idx - 1, idx,
>>> +                                         idx_map);
>>> +               if (ret < 0) {
>>> +                       ret_env = ERR_PTR(ret);
>>> +                       goto free_all_ret;
>>> +               }
>>> +               /* Recalculate subprog start as we are at bpf2bpf call insn. */
>>> +               if (ret > 0) {
>>> +                       ret = add_subprog(env, idx + insns[idx].imm + 1);
>>> +                       if (ret < 0) {
>>> +                               ret_env = ERR_PTR(ret);
>>> +                               goto free_all_ret;
>>> +                       }
>>> +               }
>>> +               idx++;
>>> +       }
>>> +       if (ret < 0) {
>>> +               ret_env = ERR_PTR(ret);
>>> +               goto free_all_ret;
>>> +       }
>>> +
>>> +       env->subprog_info[env->subprog_cnt].start = fini_cnt;
>>> +       for (idx = 0; idx <= env->subprog_cnt; idx++)
>>> +               new_subinfo[idx].start = env->subprog_info[idx].start;
>>> +       memcpy(env->subprog_info, new_subinfo, sizeof(env->subprog_info));
>>> +
>>> +       /* Adjust linfo.
>>> +        * FIXME: no support for insn removal at the moment.
>>> +        */
>>> +       if (prog->aux->nr_linfo) {
>>> +               struct bpf_line_info *linfo = prog->aux->linfo;
>>> +               u32 nr_linfo = prog->aux->nr_linfo;
>>> +
>>> +               for (idx = 0; idx < nr_linfo; idx++)
>>> +                       linfo[idx].insn_off = idx_map[linfo[idx].insn_off];
>>> +       }
>>> +       vfree(env->insn_aux_data);
>>> +       env->insn_aux_data = new_data;
>>> +       goto free_mem_list_ret;
>>> +free_all_ret:
>>> +       vfree(new_data);
>>> +free_mem_list_ret:
>>> +       kvfree(new_subinfo);
>>> +       kvfree(idx_map);
>>> +       return ret_env;
>>> +}
>>> +
>>>  static int opt_remove_dead_code(struct bpf_verifier_env *env)
>>>  {
>>>         struct bpf_insn_aux_data *aux_data = env->insn_aux_data;
>>> --
>>> 2.7.4
>>>


^ permalink raw reply

* Re: [bpf-next v3 05/12] selftests/bpf: Allow passing more information to BPF prog test run
From: Krzesimir Nowak @ 2019-07-11 12:17 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: open list, Alban Crequy, Iago López Galeiras,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev,
	Networking, bpf, xdp-newbies
In-Reply-To: <CAEf4BzYYdrcwJKg271ZL7kPJNYyZEGdxQeuUNbfPk=EjewuHeQ@mail.gmail.com>

On Thu, Jul 11, 2019 at 3:17 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Mon, Jul 8, 2019 at 3:42 PM Krzesimir Nowak <krzesimir@kinvolk.io> wrote:
> >
> > The test case can now specify a custom length of the data member,
> > context data and its length, which will be passed to
> > bpf_prog_test_run_xattr. For backward compatilibity, if the data
> > length is 0 (which is what will happen when the field is left
> > unspecified in the designated initializer of a struct), then the
> > length passed to the bpf_prog_test_run_xattr is TEST_DATA_LEN.
> >
> > Also for backward compatilibity, if context data length is 0, NULL is
> > passed as a context to bpf_prog_test_run_xattr. This is to avoid
> > breaking other tests, where context data being NULL and context data
> > length being 0 is handled differently from the case where context data
> > is not NULL and context data length is 0.
> >
> > Custom lengths still can't be greater than hardcoded 64 bytes for data
> > and 192 for context data.
> >
> > 192 for context data was picked to allow passing struct
> > bpf_perf_event_data as a context for perf event programs. The struct
> > is quite large, because it contains struct pt_regs.
> >
> > Test runs for perf event programs will not allow the copying the data
> > back to data_out buffer, so they require data_out_size to be zero and
> > data_out to be NULL. Since test_verifier hardcodes it, make it
> > possible to override the size. Overriding the size to zero will cause
> > the buffer to be NULL.
> >
> > Changes since v2:
> > - Allow overriding the data out size and buffer.
> >
> > Signed-off-by: Krzesimir Nowak <krzesimir@kinvolk.io>
> > ---
> >  tools/testing/selftests/bpf/test_verifier.c | 105 +++++++++++++++++---
> >  1 file changed, 93 insertions(+), 12 deletions(-)
> >
> > diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
> > index 1640ba9f12c1..6f124cc4ee34 100644
> > --- a/tools/testing/selftests/bpf/test_verifier.c
> > +++ b/tools/testing/selftests/bpf/test_verifier.c
> > @@ -54,6 +54,7 @@
> >  #define MAX_TEST_RUNS  8
> >  #define POINTER_VALUE  0xcafe4all
> >  #define TEST_DATA_LEN  64
> > +#define TEST_CTX_LEN   192
> >
> >  #define F_NEEDS_EFFICIENT_UNALIGNED_ACCESS     (1 << 0)
> >  #define F_LOAD_WITH_STRICT_ALIGNMENT           (1 << 1)
> > @@ -96,7 +97,12 @@ struct bpf_test {
> >         enum bpf_prog_type prog_type;
> >         uint8_t flags;
> >         __u8 data[TEST_DATA_LEN];
> > +       __u32 data_len;
> > +       __u8 ctx[TEST_CTX_LEN];
> > +       __u32 ctx_len;
> >         void (*fill_helper)(struct bpf_test *self);
> > +       bool override_data_out_len;
> > +       __u32 overridden_data_out_len;
> >         uint8_t runs;
> >         struct {
> >                 uint32_t retval, retval_unpriv;
> > @@ -104,6 +110,9 @@ struct bpf_test {
> >                         __u8 data[TEST_DATA_LEN];
> >                         __u64 data64[TEST_DATA_LEN / 8];
> >                 };
> > +               __u32 data_len;
> > +               __u8 ctx[TEST_CTX_LEN];
> > +               __u32 ctx_len;
> >         } retvals[MAX_TEST_RUNS];
> >  };
> >
> > @@ -818,21 +827,35 @@ static int set_admin(bool admin)
> >  }
> >
> >  static int do_prog_test_run(int fd_prog, bool unpriv, uint32_t expected_val,
> > -                           void *data, size_t size_data)
> > +                           void *data, size_t size_data, void *ctx,
> > +                           size_t size_ctx, u32 *overridden_data_out_size)
> >  {
> > -       __u8 tmp[TEST_DATA_LEN << 2];
> > -       __u32 size_tmp = sizeof(tmp);
> > -       int saved_errno;
> > -       int err;
> >         struct bpf_prog_test_run_attr attr = {
> >                 .prog_fd = fd_prog,
> >                 .repeat = 1,
> >                 .data_in = data,
> >                 .data_size_in = size_data,
> > -               .data_out = tmp,
> > -               .data_size_out = size_tmp,
> > +               .ctx_in = ctx,
> > +               .ctx_size_in = size_ctx,
> >         };
> > +       __u8 tmp[TEST_DATA_LEN << 2];
> > +       __u32 size_tmp = sizeof(tmp);
> > +       __u32 size_buf = size_tmp;
> > +       __u8 *buf = tmp;
> > +       int saved_errno;
> > +       int err;
> >
> > +       if (overridden_data_out_size)
> > +               size_buf = *overridden_data_out_size;
> > +       if (size_buf > size_tmp) {
> > +               printf("FAIL: out data size (%d) greater than a buffer size (%d) ",
> > +                      size_buf, size_tmp);
> > +               return -EINVAL;
> > +       }
> > +       if (!size_buf)
> > +               buf = NULL;
> > +       attr.data_size_out = size_buf;
> > +       attr.data_out = buf;
> >         if (unpriv)
> >                 set_admin(true);
> >         err = bpf_prog_test_run_xattr(&attr);
> > @@ -956,13 +979,45 @@ static void do_test_single(struct bpf_test *test, bool unpriv,
> >         if (!alignment_prevented_execution && fd_prog >= 0) {
> >                 uint32_t expected_val;
> >                 int i;
> > +               __u32 size_data;
> > +               __u32 size_ctx;
> > +               bool bad_size;
> > +               void *ctx;
> > +               __u32 *overridden_data_out_size;
> >
> >                 if (!test->runs) {
> > +                       if (test->data_len > 0)
> > +                               size_data = test->data_len;
> > +                       else
> > +                               size_data = sizeof(test->data);
> > +                       if (test->override_data_out_len)
> > +                               overridden_data_out_size = &test->overridden_data_out_len;
> > +                       else
> > +                               overridden_data_out_size = NULL;
> > +                       size_ctx = test->ctx_len;
> > +                       bad_size = false;
>
> I hated all this duplication of logic, which with this patch becomes
> even more expansive, so I removed it. Please see [0]. Can you please
> apply that patch and add all this new logic only once?
>
>   [0] https://patchwork.ozlabs.org/patch/1130601/

Will do.

>
> >                         expected_val = unpriv && test->retval_unpriv ?
> >                                 test->retval_unpriv : test->retval;
> >
> > -                       err = do_prog_test_run(fd_prog, unpriv, expected_val,
> > -                                              test->data, sizeof(test->data));
> > +                       if (size_data > sizeof(test->data)) {
> > +                               printf("FAIL: data size (%u) greater than TEST_DATA_LEN (%lu) ", size_data, sizeof(test->data));
> > +                               bad_size = true;
> > +                       }
> > +                       if (size_ctx > sizeof(test->ctx)) {
> > +                               printf("FAIL: ctx size (%u) greater than TEST_CTX_LEN (%lu) ", size_ctx, sizeof(test->ctx));
>
> These look like way too long lines, wrap them?

Ah, yeah, these can be wrapped easily. Will do.

>
> > +                               bad_size = true;
> > +                       }
> > +                       if (size_ctx)
> > +                               ctx = test->ctx;
> > +                       else
> > +                               ctx = NULL;
>
> nit: single line:
>
> ctx = size_ctx ? test->ctx : NULL;
>
> > +                       if (bad_size)
> > +                               err = 1;
> > +                       else
> > +                               err = do_prog_test_run(fd_prog, unpriv, expected_val,
> > +                                                      test->data, size_data,
> > +                                                      ctx, size_ctx,
> > +                                                      overridden_data_out_size);
> >                         if (err)
> >                                 run_errs++;
> >                         else
> > @@ -970,14 +1025,40 @@ static void do_test_single(struct bpf_test *test, bool unpriv,
> >                 }
> >
> >                 for (i = 0; i < test->runs; i++) {
> > +                       if (test->retvals[i].data_len > 0)
> > +                               size_data = test->retvals[i].data_len;
> > +                       else
> > +                               size_data = sizeof(test->retvals[i].data);
> > +                       if (test->override_data_out_len)
> > +                               overridden_data_out_size = &test->overridden_data_out_len;
> > +                       else
> > +                               overridden_data_out_size = NULL;
> > +                       size_ctx = test->retvals[i].ctx_len;
> > +                       bad_size = false;
> >                         if (unpriv && test->retvals[i].retval_unpriv)
> >                                 expected_val = test->retvals[i].retval_unpriv;
> >                         else
> >                                 expected_val = test->retvals[i].retval;
> >
> > -                       err = do_prog_test_run(fd_prog, unpriv, expected_val,
> > -                                              test->retvals[i].data,
> > -                                              sizeof(test->retvals[i].data));
> > +                       if (size_data > sizeof(test->retvals[i].data)) {
> > +                               printf("FAIL: data size (%u) at run %i greater than TEST_DATA_LEN (%lu) ", size_data, i + 1, sizeof(test->retvals[i].data));
> > +                               bad_size = true;
> > +                       }
> > +                       if (size_ctx > sizeof(test->retvals[i].ctx)) {
> > +                               printf("FAIL: ctx size (%u) at run %i greater than TEST_CTX_LEN (%lu) ", size_ctx, i + 1, sizeof(test->retvals[i].ctx));
> > +                               bad_size = true;
> > +                       }
> > +                       if (size_ctx)
> > +                               ctx = test->retvals[i].ctx;
> > +                       else
> > +                               ctx = NULL;
> > +                       if (bad_size)
> > +                               err = 1;
> > +                       else
> > +                               err = do_prog_test_run(fd_prog, unpriv, expected_val,
> > +                                                      test->retvals[i].data, size_data,
> > +                                                      ctx, size_ctx,
> > +                                                      overridden_data_out_size);
> >                         if (err) {
> >                                 printf("(run %d/%d) ", i + 1, test->runs);
> >                                 run_errs++;
> > --
> > 2.20.1
> >



-- 
Kinvolk GmbH | Adalbertstr.6a, 10999 Berlin | tel: +491755589364
Geschäftsführer/Directors: Alban Crequy, Chris Kühl, Iago López Galeiras
Registergericht/Court of registration: Amtsgericht Charlottenburg
Registernummer/Registration number: HRB 171414 B
Ust-ID-Nummer/VAT ID number: DE302207000

^ permalink raw reply

* Re: [PATCH bpf-next] selftests/bpf: remove logic duplication in test_verifier.c
From: Krzesimir Nowak @ 2019-07-11 12:13 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, kernel-team, ast, Daniel Borkmann, bpf,
	Networking
In-Reply-To: <20190711010844.1285018-1-andriin@fb.com>

On Thu, Jul 11, 2019 at 3:08 AM Andrii Nakryiko <andriin@fb.com> wrote:
>
> test_verifier tests can specify single- and multi-runs tests. Internally
> logic of handling them is duplicated. Get rid of it by making single run
> retval specification to be a first retvals spec.
>
> Cc: Krzesimir Nowak <krzesimir@kinvolk.io>
> Signed-off-by: Andrii Nakryiko <andriin@fb.com>

Looks good, one nit below.

Acked-by: Krzesimir Nowak <krzesimir@kinvolk.io>

> ---
>  tools/testing/selftests/bpf/test_verifier.c | 37 ++++++++++-----------
>  1 file changed, 18 insertions(+), 19 deletions(-)
>
> diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
> index b0773291012a..120ecdf4a7db 100644
> --- a/tools/testing/selftests/bpf/test_verifier.c
> +++ b/tools/testing/selftests/bpf/test_verifier.c
> @@ -86,7 +86,7 @@ struct bpf_test {
>         int fixup_sk_storage_map[MAX_FIXUPS];
>         const char *errstr;
>         const char *errstr_unpriv;
> -       uint32_t retval, retval_unpriv, insn_processed;
> +       uint32_t insn_processed;
>         int prog_len;
>         enum {
>                 UNDEF,
> @@ -95,16 +95,24 @@ struct bpf_test {
>         } result, result_unpriv;
>         enum bpf_prog_type prog_type;
>         uint8_t flags;
> -       __u8 data[TEST_DATA_LEN];
>         void (*fill_helper)(struct bpf_test *self);
>         uint8_t runs;
> -       struct {
> -               uint32_t retval, retval_unpriv;
> -               union {
> -                       __u8 data[TEST_DATA_LEN];
> -                       __u64 data64[TEST_DATA_LEN / 8];
> +       union {
> +               struct {

Maybe consider moving the struct definition outside to further the
removal of the duplication?

> +                       uint32_t retval, retval_unpriv;
> +                       union {
> +                               __u8 data[TEST_DATA_LEN];
> +                               __u64 data64[TEST_DATA_LEN / 8];
> +                       };
>                 };
> -       } retvals[MAX_TEST_RUNS];
> +               struct {
> +                       uint32_t retval, retval_unpriv;
> +                       union {
> +                               __u8 data[TEST_DATA_LEN];
> +                               __u64 data64[TEST_DATA_LEN / 8];
> +                       };
> +               } retvals[MAX_TEST_RUNS];
> +       };
>         enum bpf_attach_type expected_attach_type;
>  };
>
> @@ -949,17 +957,8 @@ static void do_test_single(struct bpf_test *test, bool unpriv,
>                 uint32_t expected_val;
>                 int i;
>
> -               if (!test->runs) {
> -                       expected_val = unpriv && test->retval_unpriv ?
> -                               test->retval_unpriv : test->retval;
> -
> -                       err = do_prog_test_run(fd_prog, unpriv, expected_val,
> -                                              test->data, sizeof(test->data));
> -                       if (err)
> -                               run_errs++;
> -                       else
> -                               run_successes++;
> -               }
> +               if (!test->runs)
> +                       test->runs = 1;
>
>                 for (i = 0; i < test->runs; i++) {
>                         if (unpriv && test->retvals[i].retval_unpriv)
> --
> 2.17.1
>


-- 
Kinvolk GmbH | Adalbertstr.6a, 10999 Berlin | tel: +491755589364
Geschäftsführer/Directors: Alban Crequy, Chris Kühl, Iago López Galeiras
Registergericht/Court of registration: Amtsgericht Charlottenburg
Registernummer/Registration number: HRB 171414 B
Ust-ID-Nummer/VAT ID number: DE302207000

^ permalink raw reply

* Re: [bpf-next v3 04/12] selftests/bpf: Use bpf_prog_test_run_xattr
From: Krzesimir Nowak @ 2019-07-11 12:07 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: open list, Alban Crequy, Iago López Galeiras,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev,
	Networking, bpf, xdp-newbies
In-Reply-To: <CAEf4BzZoOw=1B8vV53iAxz8LDULOPVF-he4C_usoUQSdXU+oSg@mail.gmail.com>

On Thu, Jul 11, 2019 at 2:03 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Mon, Jul 8, 2019 at 3:43 PM Krzesimir Nowak <krzesimir@kinvolk.io> wrote:
> >
> > The bpf_prog_test_run_xattr function gives more options to set up a
> > test run of a BPF program than the bpf_prog_test_run function.
> >
> > We will need this extra flexibility to pass ctx data later.
> >
> > Signed-off-by: Krzesimir Nowak <krzesimir@kinvolk.io>
> > ---
>
> lgtm, with some nits below
>
> Acked-by: Andrii Nakryiko <andriin@fb.com>
>
> >  tools/testing/selftests/bpf/test_verifier.c | 16 +++++++++++-----
> >  1 file changed, 11 insertions(+), 5 deletions(-)
> >
> > diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
> > index c7541f572932..1640ba9f12c1 100644
> > --- a/tools/testing/selftests/bpf/test_verifier.c
> > +++ b/tools/testing/selftests/bpf/test_verifier.c
> > @@ -822,14 +822,20 @@ static int do_prog_test_run(int fd_prog, bool unpriv, uint32_t expected_val,
> >  {
> >         __u8 tmp[TEST_DATA_LEN << 2];
> >         __u32 size_tmp = sizeof(tmp);
>
> nit: this is now is not needed as a separate local variable, inline?

I think I'm using this variable in a followup commit, but I'll look closely.

>
> > -       uint32_t retval;
> >         int saved_errno;
> >         int err;
> > +       struct bpf_prog_test_run_attr attr = {
> > +               .prog_fd = fd_prog,
> > +               .repeat = 1,
> > +               .data_in = data,
> > +               .data_size_in = size_data,
> > +               .data_out = tmp,
> > +               .data_size_out = size_tmp,
> > +       };
> >
> >         if (unpriv)
> >                 set_admin(true);
> > -       err = bpf_prog_test_run(fd_prog, 1, data, size_data,
> > -                               tmp, &size_tmp, &retval, NULL);
> > +       err = bpf_prog_test_run_xattr(&attr);
> >         saved_errno = errno;
> >         if (unpriv)
> >                 set_admin(false);
> > @@ -846,9 +852,9 @@ static int do_prog_test_run(int fd_prog, bool unpriv, uint32_t expected_val,
> >                         return err;
> >                 }
> >         }
> > -       if (retval != expected_val &&
> > +       if (attr.retval != expected_val &&
> >             expected_val != POINTER_VALUE) {
>
> this if condition now fits one line, can you please combine? thanks!

Sure.

>
> > -               printf("FAIL retval %d != %d ", retval, expected_val);
> > +               printf("FAIL retval %d != %d ", attr.retval, expected_val);
> >                 return 1;
> >         }
> >
> > --
> > 2.20.1
> >



-- 
Kinvolk GmbH | Adalbertstr.6a, 10999 Berlin | tel: +491755589364
Geschäftsführer/Directors: Alban Crequy, Chris Kühl, Iago López Galeiras
Registergericht/Court of registration: Amtsgericht Charlottenburg
Registernummer/Registration number: HRB 171414 B
Ust-ID-Nummer/VAT ID number: DE302207000

^ permalink raw reply

* Re: [bpf-next v3 03/12] selftests/bpf: Avoid another case of errno clobbering
From: Krzesimir Nowak @ 2019-07-11 12:05 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: open list, Alban Crequy, Iago López Galeiras,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev,
	Networking, bpf, xdp-newbies
In-Reply-To: <CAEf4BzbM6EiCkN5mwK1YP-NC2bavkkHV7nFR-PXCWGOvVt7nTg@mail.gmail.com>

On Thu, Jul 11, 2019 at 1:57 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Mon, Jul 8, 2019 at 3:43 PM Krzesimir Nowak <krzesimir@kinvolk.io> wrote:
> >
> > Commit 8184d44c9a57 ("selftests/bpf: skip verifier tests for
> > unsupported program types") added a check for an unsupported program
> > type. The function doing it changes errno, so test_verifier should
> > save it before calling it if test_verifier wants to print a reason why
> > verifying a BPF program of a supported type failed.
> >
> > Changes since v2:
> > - Move the declaration to fit the reverse christmas tree style.
> >
> > Fixes: 8184d44c9a57 ("selftests/bpf: skip verifier tests for unsupported program types")
> > Cc: Stanislav Fomichev <sdf@google.com>
> > Signed-off-by: Krzesimir Nowak <krzesimir@kinvolk.io>
> > ---
>
> Acked-by: Andrii Nakryiko <andriin@fb.com>
>
> >  tools/testing/selftests/bpf/test_verifier.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
> > index 3fe126e0083b..c7541f572932 100644
> > --- a/tools/testing/selftests/bpf/test_verifier.c
> > +++ b/tools/testing/selftests/bpf/test_verifier.c
> > @@ -864,6 +864,7 @@ static void do_test_single(struct bpf_test *test, bool unpriv,
> >         int run_errs, run_successes;
> >         int map_fds[MAX_NR_MAPS];
> >         const char *expected_err;
> > +       int saved_errno;
> >         int fixup_skips;
>
> nit: combine those ints? or even with i and err below as well?

Will do.

>
> >         __u32 pflags;
> >         int i, err;
> > @@ -894,6 +895,7 @@ static void do_test_single(struct bpf_test *test, bool unpriv,
> >                 pflags |= BPF_F_ANY_ALIGNMENT;
> >         fd_prog = bpf_verify_program(prog_type, prog, prog_len, pflags,
> >                                      "GPL", 0, bpf_vlog, sizeof(bpf_vlog), 4);
> > +       saved_errno = errno;
> >         if (fd_prog < 0 && !bpf_probe_prog_type(prog_type, 0)) {
> >                 printf("SKIP (unsupported program type %d)\n", prog_type);
> >                 skips++;
> > @@ -910,7 +912,7 @@ static void do_test_single(struct bpf_test *test, bool unpriv,
> >         if (expected_ret == ACCEPT) {
> >                 if (fd_prog < 0) {
> >                         printf("FAIL\nFailed to load prog '%s'!\n",
> > -                              strerror(errno));
> > +                              strerror(saved_errno));
> >                         goto fail_log;
> >                 }
> >  #ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> > --
> > 2.20.1
> >



-- 
Kinvolk GmbH | Adalbertstr.6a, 10999 Berlin | tel: +491755589364
Geschäftsführer/Directors: Alban Crequy, Chris Kühl, Iago López Galeiras
Registergericht/Court of registration: Amtsgericht Charlottenburg
Registernummer/Registration number: HRB 171414 B
Ust-ID-Nummer/VAT ID number: DE302207000

^ permalink raw reply

* Re: [bpf-next v3 02/12] selftests/bpf: Avoid a clobbering of errno
From: Krzesimir Nowak @ 2019-07-11 12:04 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: open list, Alban Crequy, Iago López Galeiras,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev,
	Networking, bpf, xdp-newbies
In-Reply-To: <CAEf4BzYra9njHOB8t6kxRu6n5NJdjjAG541OLt8ci=0zbbcUSg@mail.gmail.com>

On Thu, Jul 11, 2019 at 1:52 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Mon, Jul 8, 2019 at 3:42 PM Krzesimir Nowak <krzesimir@kinvolk.io> wrote:
> >
> > Save errno right after bpf_prog_test_run returns, so we later check
> > the error code actually set by bpf_prog_test_run, not by some libcap
> > function.
> >
> > Changes since v1:
> > - Fix the "Fixes:" tag to mention actual commit that introduced the
> >   bug
> >
> > Changes since v2:
> > - Move the declaration so it fits the reverse christmas tree style.
> >
> > Cc: Daniel Borkmann <daniel@iogearbox.net>
> > Fixes: 832c6f2c29ec ("bpf: test make sure to run unpriv test cases in test_verifier")
> > Signed-off-by: Krzesimir Nowak <krzesimir@kinvolk.io>
> > ---
> >  tools/testing/selftests/bpf/test_verifier.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
> > index b8d065623ead..3fe126e0083b 100644
> > --- a/tools/testing/selftests/bpf/test_verifier.c
> > +++ b/tools/testing/selftests/bpf/test_verifier.c
> > @@ -823,16 +823,18 @@ static int do_prog_test_run(int fd_prog, bool unpriv, uint32_t expected_val,
> >         __u8 tmp[TEST_DATA_LEN << 2];
> >         __u32 size_tmp = sizeof(tmp);
> >         uint32_t retval;
> > +       int saved_errno;
> >         int err;
> >
> >         if (unpriv)
> >                 set_admin(true);
> >         err = bpf_prog_test_run(fd_prog, 1, data, size_data,
> >                                 tmp, &size_tmp, &retval, NULL);
>
> Given err is either 0 or -1, how about instead making err useful right
> here without extra variable?
>
> if (bpf_prog_test_run(...))
>         err = errno;

I change it later to bpf_prog_test_run_xattr, which can also return
-EINVAL and then errno is not set. But this one probably should not be
triggered by the test code. So not sure, probably would be better to
keep it as is for consistency?

>
> > +       saved_errno = errno;
> >         if (unpriv)
> >                 set_admin(false);
> >         if (err) {
> > -               switch (errno) {
> > +               switch (saved_errno) {
> >                 case 524/*ENOTSUPP*/:
>
> ENOTSUPP is defined in include/linux/errno.h, is there any problem
> with using this in selftests?

I just used whatever there was earlier. Seems like <linux/errno.h> is
not copied to tools include directory.

>
> >                         printf("Did not run the program (not supported) ");
> >                         return 0;
> > --
> > 2.20.1
> >



-- 
Kinvolk GmbH | Adalbertstr.6a, 10999 Berlin | tel: +491755589364
Geschäftsführer/Directors: Alban Crequy, Chris Kühl, Iago López Galeiras
Registergericht/Court of registration: Amtsgericht Charlottenburg
Registernummer/Registration number: HRB 171414 B
Ust-ID-Nummer/VAT ID number: DE302207000

^ permalink raw reply

* Re: [oss-drivers] Re: [RFC bpf-next 2/8] bpf: extend list based insn patching infra to verification layer
From: Jiong Wang @ 2019-07-11 11:59 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Jiong Wang, Alexei Starovoitov, Daniel Borkmann, Edward Cree,
	Naveen N. Rao, Andrii Nakryiko, Jakub Kicinski, bpf, Networking,
	oss-drivers
In-Reply-To: <CAEf4BzaF-Bvj9veA1EYu5GWQrWOu=ttX064YTrB4yNQ4neJZOQ@mail.gmail.com>


Andrii Nakryiko writes:

> On Thu, Jul 4, 2019 at 2:32 PM Jiong Wang <jiong.wang@netronome.com> wrote:
>>
>> Verification layer also needs to handle auxiliar info as well as adjusting
>> subprog start.
>>
>> At this layer, insns inside patch buffer could be jump, but they should
>> have been resolved, meaning they shouldn't jump to insn outside of the
>> patch buffer. Lineration function for this layer won't touch insns inside
>> patch buffer.
>>
>> Adjusting subprog is finished along with adjusting jump target when the
>> input will cover bpf to bpf call insn, re-register subprog start is cheap.
>> But adjustment when there is insn deleteion is not considered yet.
>>
>> Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
>> ---
>>  kernel/bpf/verifier.c | 150 ++++++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 150 insertions(+)
>>
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index a2e7637..2026d64 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
>> @@ -8350,6 +8350,156 @@ static void opt_hard_wire_dead_code_branches(struct bpf_verifier_env *env)
>>         }
>>  }
>>
>> +/* Linearize bpf list insn to array (verifier layer). */
>> +static struct bpf_verifier_env *
>> +verifier_linearize_list_insn(struct bpf_verifier_env *env,
>> +                            struct bpf_list_insn *list)
>
> It's unclear why this returns env back? It's not allocating a new env,
> so it's weird and unnecessary. Just return error code.

The reason is I was thinking we have two layers in BPF, the core and the
verifier.

For core layer (the relevant file is core.c), when doing patching, the
input is insn list and bpf_prog, the linearization should linearize the
insn list into insn array, and also whatever others affect inside bpf_prog
due to changing on insns, for example line info inside prog->aux. So the
return value is bpf_prog for core layer linearization hook. 

For verifier layer, it is similar, but the context if bpf_verifier_env, the
linearization hook should linearize the insn list, and also those affected
inside env, for example bpf_insn_aux_data, so the return value is
bpf_verifier_env, meaning returning an updated verifier context
(bpf_verifier_env) after insn list linearization.

Make sense?

Regards,
Jiong

>
>> +{
>> +       u32 *idx_map, idx, orig_cnt, fini_cnt = 0;
>> +       struct bpf_subprog_info *new_subinfo;
>> +       struct bpf_insn_aux_data *new_data;
>> +       struct bpf_prog *prog = env->prog;
>> +       struct bpf_verifier_env *ret_env;
>> +       struct bpf_insn *insns, *insn;
>> +       struct bpf_list_insn *elem;
>> +       int ret;
>> +
>> +       /* Calculate final size. */
>> +       for (elem = list; elem; elem = elem->next)
>> +               if (!(elem->flag & LIST_INSN_FLAG_REMOVED))
>> +                       fini_cnt++;
>> +
>> +       orig_cnt = prog->len;
>> +       insns = prog->insnsi;
>> +       /* If prog length remains same, nothing else to do. */
>> +       if (fini_cnt == orig_cnt) {
>> +               for (insn = insns, elem = list; elem; elem = elem->next, insn++)
>> +                       *insn = elem->insn;
>> +               return env;
>> +       }
>> +       /* Realloc insn buffer when necessary. */
>> +       if (fini_cnt > orig_cnt)
>> +               prog = bpf_prog_realloc(prog, bpf_prog_size(fini_cnt),
>> +                                       GFP_USER);
>> +       if (!prog)
>> +               return ERR_PTR(-ENOMEM);
>> +       insns = prog->insnsi;
>> +       prog->len = fini_cnt;
>> +       ret_env = env;
>> +
>> +       /* idx_map[OLD_IDX] = NEW_IDX */
>> +       idx_map = kvmalloc(orig_cnt * sizeof(u32), GFP_KERNEL);
>> +       if (!idx_map)
>> +               return ERR_PTR(-ENOMEM);
>> +       memset(idx_map, 0xff, orig_cnt * sizeof(u32));
>> +
>> +       /* Use the same alloc method used when allocating env->insn_aux_data. */
>> +       new_data = vzalloc(array_size(sizeof(*new_data), fini_cnt));
>> +       if (!new_data) {
>> +               kvfree(idx_map);
>> +               return ERR_PTR(-ENOMEM);
>> +       }
>> +
>> +       /* Copy over insn + calculate idx_map. */
>> +       for (idx = 0, elem = list; elem; elem = elem->next) {
>> +               int orig_idx = elem->orig_idx - 1;
>> +
>> +               if (orig_idx >= 0) {
>> +                       idx_map[orig_idx] = idx;
>> +
>> +                       if (elem->flag & LIST_INSN_FLAG_REMOVED)
>> +                               continue;
>> +
>> +                       new_data[idx] = env->insn_aux_data[orig_idx];
>> +
>> +                       if (elem->flag & LIST_INSN_FLAG_PATCHED)
>> +                               new_data[idx].zext_dst =
>> +                                       insn_has_def32(env, &elem->insn);
>> +               } else {
>> +                       new_data[idx].seen = true;
>> +                       new_data[idx].zext_dst = insn_has_def32(env,
>> +                                                               &elem->insn);
>> +               }
>> +               insns[idx++] = elem->insn;
>> +       }
>> +
>> +       new_subinfo = kvzalloc(sizeof(env->subprog_info), GFP_KERNEL);
>> +       if (!new_subinfo) {
>> +               kvfree(idx_map);
>> +               vfree(new_data);
>> +               return ERR_PTR(-ENOMEM);
>> +       }
>> +       memcpy(new_subinfo, env->subprog_info, sizeof(env->subprog_info));
>> +       memset(env->subprog_info, 0, sizeof(env->subprog_info));
>> +       env->subprog_cnt = 0;
>> +       env->prog = prog;
>> +       ret = add_subprog(env, 0);
>> +       if (ret < 0) {
>> +               ret_env = ERR_PTR(ret);
>> +               goto free_all_ret;
>> +       }
>> +       /* Relocate jumps using idx_map.
>> +        *   old_dst = jmp_insn.old_target + old_pc + 1;
>> +        *   new_dst = idx_map[old_dst] = jmp_insn.new_target + new_pc + 1;
>> +        *   jmp_insn.new_target = new_dst - new_pc - 1;
>> +        */
>> +       for (idx = 0, elem = list; elem; elem = elem->next) {
>> +               int orig_idx = elem->orig_idx;
>> +
>> +               if (elem->flag & LIST_INSN_FLAG_REMOVED)
>> +                       continue;
>> +               if ((elem->flag & LIST_INSN_FLAG_PATCHED) || !orig_idx) {
>> +                       idx++;
>> +                       continue;
>> +               }
>> +
>> +               ret = bpf_jit_adj_imm_off(&insns[idx], orig_idx - 1, idx,
>> +                                         idx_map);
>> +               if (ret < 0) {
>> +                       ret_env = ERR_PTR(ret);
>> +                       goto free_all_ret;
>> +               }
>> +               /* Recalculate subprog start as we are at bpf2bpf call insn. */
>> +               if (ret > 0) {
>> +                       ret = add_subprog(env, idx + insns[idx].imm + 1);
>> +                       if (ret < 0) {
>> +                               ret_env = ERR_PTR(ret);
>> +                               goto free_all_ret;
>> +                       }
>> +               }
>> +               idx++;
>> +       }
>> +       if (ret < 0) {
>> +               ret_env = ERR_PTR(ret);
>> +               goto free_all_ret;
>> +       }
>> +
>> +       env->subprog_info[env->subprog_cnt].start = fini_cnt;
>> +       for (idx = 0; idx <= env->subprog_cnt; idx++)
>> +               new_subinfo[idx].start = env->subprog_info[idx].start;
>> +       memcpy(env->subprog_info, new_subinfo, sizeof(env->subprog_info));
>> +
>> +       /* Adjust linfo.
>> +        * FIXME: no support for insn removal at the moment.
>> +        */
>> +       if (prog->aux->nr_linfo) {
>> +               struct bpf_line_info *linfo = prog->aux->linfo;
>> +               u32 nr_linfo = prog->aux->nr_linfo;
>> +
>> +               for (idx = 0; idx < nr_linfo; idx++)
>> +                       linfo[idx].insn_off = idx_map[linfo[idx].insn_off];
>> +       }
>> +       vfree(env->insn_aux_data);
>> +       env->insn_aux_data = new_data;
>> +       goto free_mem_list_ret;
>> +free_all_ret:
>> +       vfree(new_data);
>> +free_mem_list_ret:
>> +       kvfree(new_subinfo);
>> +       kvfree(idx_map);
>> +       return ret_env;
>> +}
>> +
>>  static int opt_remove_dead_code(struct bpf_verifier_env *env)
>>  {
>>         struct bpf_insn_aux_data *aux_data = env->insn_aux_data;
>> --
>> 2.7.4
>>


^ permalink raw reply

* Re: [RFC bpf-next 1/8] bpf: introducing list based insn patching infra to core layer
From: Jiong Wang @ 2019-07-11 11:53 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Jiong Wang, Alexei Starovoitov, Daniel Borkmann, Edward Cree,
	Naveen N. Rao, Andrii Nakryiko, Jakub Kicinski, bpf, Networking,
	oss-drivers
In-Reply-To: <CAEf4BzbR-MQa=TTVir0m-kMeWOxtgnZx+XqAB6neEW+RMBrKEA@mail.gmail.com>


Andrii Nakryiko writes:

> On Thu, Jul 4, 2019 at 2:32 PM Jiong Wang <jiong.wang@netronome.com> wrote:
>>
>> This patch introduces list based bpf insn patching infra to bpf core layer
>> which is lower than verification layer.
>>
>> This layer has bpf insn sequence as the solo input, therefore the tasks
>> to be finished during list linerization is:
>>   - copy insn
>>   - relocate jumps
>>   - relocation line info.
>>
>> Suggested-by: Alexei Starovoitov <ast@kernel.org>
>> Suggested-by: Edward Cree <ecree@solarflare.com>
>> Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
>> ---
>>  include/linux/filter.h |  25 +++++
>>  kernel/bpf/core.c      | 268 +++++++++++++++++++++++++++++++++++++++++++++++++
>>  2 files changed, 293 insertions(+)
>>
>> diff --git a/include/linux/filter.h b/include/linux/filter.h
>> index 1fe53e7..1fea68c 100644
>> --- a/include/linux/filter.h
>> +++ b/include/linux/filter.h
>> @@ -842,6 +842,31 @@ struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
>>                                        const struct bpf_insn *patch, u32 len);
>>  int bpf_remove_insns(struct bpf_prog *prog, u32 off, u32 cnt);
>>
>> +int bpf_jit_adj_imm_off(struct bpf_insn *insn, int old_idx, int new_idx,
>> +                       int idx_map[]);
>> +
>> +#define LIST_INSN_FLAG_PATCHED 0x1
>> +#define LIST_INSN_FLAG_REMOVED 0x2
>> +struct bpf_list_insn {
>> +       struct bpf_insn insn;
>> +       struct bpf_list_insn *next;
>> +       s32 orig_idx;
>> +       u32 flag;
>> +};
>> +
>> +struct bpf_list_insn *bpf_create_list_insn(struct bpf_prog *prog);
>> +void bpf_destroy_list_insn(struct bpf_list_insn *list);
>> +/* Replace LIST_INSN with new list insns generated from PATCH. */
>> +struct bpf_list_insn *bpf_patch_list_insn(struct bpf_list_insn *list_insn,
>> +                                         const struct bpf_insn *patch,
>> +                                         u32 len);
>> +/* Pre-patch list_insn with insns inside PATCH, meaning LIST_INSN is not
>> + * touched. New list insns are inserted before it.
>> + */
>> +struct bpf_list_insn *bpf_prepatch_list_insn(struct bpf_list_insn *list_insn,
>> +                                            const struct bpf_insn *patch,
>> +                                            u32 len);
>> +
>>  void bpf_clear_redirect_map(struct bpf_map *map);
>>
>>  static inline bool xdp_return_frame_no_direct(void)
>> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
>> index e2c1b43..e60703e 100644
>> --- a/kernel/bpf/core.c
>> +++ b/kernel/bpf/core.c
>> @@ -502,6 +502,274 @@ int bpf_remove_insns(struct bpf_prog *prog, u32 off, u32 cnt)
>>         return WARN_ON_ONCE(bpf_adj_branches(prog, off, off + cnt, off, false));
>>  }
>>
>> +int bpf_jit_adj_imm_off(struct bpf_insn *insn, int old_idx, int new_idx,
>> +                       s32 idx_map[])
>> +{
>> +       u8 code = insn->code;
>> +       s64 imm;
>> +       s32 off;
>> +
>> +       if (BPF_CLASS(code) != BPF_JMP && BPF_CLASS(code) != BPF_JMP32)
>> +               return 0;
>> +
>> +       if (BPF_CLASS(code) == BPF_JMP &&
>> +           (BPF_OP(code) == BPF_EXIT ||
>> +            (BPF_OP(code) == BPF_CALL && insn->src_reg != BPF_PSEUDO_CALL)))
>> +               return 0;
>> +
>> +       /* BPF to BPF call. */
>> +       if (BPF_OP(code) == BPF_CALL) {
>> +               imm = idx_map[old_idx + insn->imm + 1] - new_idx - 1;
>> +               if (imm < S32_MIN || imm > S32_MAX)
>> +                       return -ERANGE;
>> +               insn->imm = imm;
>> +               return 1;
>> +       }
>> +
>> +       /* Jump. */
>> +       off = idx_map[old_idx + insn->off + 1] - new_idx - 1;
>> +       if (off < S16_MIN || off > S16_MAX)
>> +               return -ERANGE;
>> +       insn->off = off;
>> +       return 0;
>> +}
>> +
>> +void bpf_destroy_list_insn(struct bpf_list_insn *list)
>> +{
>> +       struct bpf_list_insn *elem, *next;
>> +
>> +       for (elem = list; elem; elem = next) {
>> +               next = elem->next;
>> +               kvfree(elem);
>> +       }
>> +}
>> +
>> +struct bpf_list_insn *bpf_create_list_insn(struct bpf_prog *prog)
>> +{
>> +       unsigned int idx, len = prog->len;
>> +       struct bpf_list_insn *hdr, *prev;
>> +       struct bpf_insn *insns;
>> +
>> +       hdr = kvzalloc(sizeof(*hdr), GFP_KERNEL);
>> +       if (!hdr)
>> +               return ERR_PTR(-ENOMEM);
>> +
>> +       insns = prog->insnsi;
>> +       hdr->insn = insns[0];
>> +       hdr->orig_idx = 1;
>> +       prev = hdr;
>
> I'm not sure why you need this "prologue" instead of handling first
> instruction uniformly in for loop below?

It is because the head of the list doesn't have precessor, so no need of
the prev->next assignment, not could do a check inside the loop to rule the
head out when doing it.

>> +
>> +       for (idx = 1; idx < len; idx++) {
>> +               struct bpf_list_insn *node = kvzalloc(sizeof(*node),
>> +                                                     GFP_KERNEL);
>> +
>> +               if (!node) {
>> +                       /* Destroy what has been allocated. */
>> +                       bpf_destroy_list_insn(hdr);
>> +                       return ERR_PTR(-ENOMEM);
>> +               }
>> +               node->insn = insns[idx];
>> +               node->orig_idx = idx + 1;
>
> Why orig_idx is 1-based? It's really confusing.

orig_idx == 0 means one insn is without original insn, means it is an new
insn generated for patching purpose.

While the LIST_INSN_FLAG_PATCHED in the RFC means one insn in original prog
is patched.

I had been trying to differenciate above two cases, but yes, they are
confusing and differenciating them might be useless, if an insn in original
prog is patched, all its info could be treated as clobbered and needing
re-calculating or should do conservative assumption.

>
>> +               prev->next = node;
>> +               prev = node;
>> +       }
>> +
>> +       return hdr;
>> +}
>> +
>> +/* Linearize bpf list insn to array. */
>> +static struct bpf_prog *bpf_linearize_list_insn(struct bpf_prog *prog,
>> +                                               struct bpf_list_insn *list)
>> +{
>> +       u32 *idx_map, idx, prev_idx, fini_cnt = 0, orig_cnt = prog->len;
>> +       struct bpf_insn *insns, *insn;
>> +       struct bpf_list_insn *elem;
>> +
>> +       /* Calculate final size. */
>> +       for (elem = list; elem; elem = elem->next)
>> +               if (!(elem->flag & LIST_INSN_FLAG_REMOVED))
>> +                       fini_cnt++;
>> +
>> +       insns = prog->insnsi;
>> +       /* If prog length remains same, nothing else to do. */
>> +       if (fini_cnt == orig_cnt) {
>> +               for (insn = insns, elem = list; elem; elem = elem->next, insn++)
>> +                       *insn = elem->insn;
>> +               return prog;
>> +       }
>> +       /* Realloc insn buffer when necessary. */
>> +       if (fini_cnt > orig_cnt)
>> +               prog = bpf_prog_realloc(prog, bpf_prog_size(fini_cnt),
>> +                                       GFP_USER);
>> +       if (!prog)
>> +               return ERR_PTR(-ENOMEM);
>> +       insns = prog->insnsi;
>> +       prog->len = fini_cnt;
>> +
>> +       /* idx_map[OLD_IDX] = NEW_IDX */
>> +       idx_map = kvmalloc(orig_cnt * sizeof(u32), GFP_KERNEL);
>> +       if (!idx_map)
>> +               return ERR_PTR(-ENOMEM);
>> +       memset(idx_map, 0xff, orig_cnt * sizeof(u32));
>> +
>> +       /* Copy over insn + calculate idx_map. */
>> +       for (idx = 0, elem = list; elem; elem = elem->next) {
>> +               int orig_idx = elem->orig_idx - 1;
>> +
>> +               if (orig_idx >= 0) {
>> +                       idx_map[orig_idx] = idx;
>> +
>> +                       if (elem->flag & LIST_INSN_FLAG_REMOVED)
>> +                               continue;
>> +               }
>> +               insns[idx++] = elem->insn;
>> +       }
>> +
>> +       /* Relocate jumps using idx_map.
>> +        *   old_dst = jmp_insn.old_target + old_pc + 1;
>> +        *   new_dst = idx_map[old_dst] = jmp_insn.new_target + new_pc + 1;
>> +        *   jmp_insn.new_target = new_dst - new_pc - 1;
>> +        */
>> +       for (idx = 0, prev_idx = 0, elem = list; elem; elem = elem->next) {
>> +               int ret, orig_idx;
>> +
>> +               /* A removed insn doesn't increase new_pc */
>> +               if (elem->flag & LIST_INSN_FLAG_REMOVED)
>> +                       continue;
>> +
>> +               orig_idx = elem->orig_idx - 1;
>> +               ret = bpf_jit_adj_imm_off(&insns[idx],
>> +                                         orig_idx >= 0 ? orig_idx : prev_idx,
>> +                                         idx, idx_map);
>> +               idx++;
>> +               if (ret < 0) {
>> +                       kvfree(idx_map);
>> +                       return ERR_PTR(ret);
>> +               }
>> +               if (orig_idx >= 0)
>> +                       /* Record prev_idx. it is used for relocating jump insn
>> +                        * inside patch buffer. For example, when doing jit
>> +                        * blinding, a jump could be moved to some other
>> +                        * positions inside the patch buffer, and its old_dst
>> +                        * could be calculated using prev_idx.
>> +                        */
>> +                       prev_idx = orig_idx;
>> +       }
>> +
>> +       /* Adjust linfo.
>> +        *
>> +        * NOTE: the prog reached core layer has been adjusted to contain insns
>> +        *       for single function, however linfo contains information for
>> +        *       whole program, so we need to make sure linfo beyond current
>> +        *       function is handled properly.
>> +        */
>> +       if (prog->aux->nr_linfo) {
>> +               u32 linfo_idx, insn_start, insn_end, nr_linfo, idx, delta;
>> +               struct bpf_line_info *linfo;
>> +
>> +               linfo_idx = prog->aux->linfo_idx;
>> +               linfo = &prog->aux->linfo[linfo_idx];
>> +               insn_start = linfo[0].insn_off;
>> +               insn_end = insn_start + orig_cnt;
>> +               nr_linfo = prog->aux->nr_linfo - linfo_idx;
>> +               delta = fini_cnt - orig_cnt;
>> +               for (idx = 0; idx < nr_linfo; idx++) {
>> +                       int adj_off;
>> +
>> +                       if (linfo[idx].insn_off >= insn_end) {
>> +                               linfo[idx].insn_off += delta;
>> +                               continue;
>> +                       }
>> +
>> +                       adj_off = linfo[idx].insn_off - insn_start;
>> +                       linfo[idx].insn_off = idx_map[adj_off] + insn_start;
>> +               }
>> +       }
>> +       kvfree(idx_map);
>> +
>> +       return prog;
>> +}
>> +
>> +struct bpf_list_insn *bpf_patch_list_insn(struct bpf_list_insn *list_insn,
>> +                                         const struct bpf_insn *patch,
>> +                                         u32 len)
>> +{
>> +       struct bpf_list_insn *prev, *next;
>> +       u32 insn_delta = len - 1;
>> +       u32 idx;
>> +
>> +       list_insn->insn = *patch;
>> +       list_insn->flag |= LIST_INSN_FLAG_PATCHED;
>> +
>> +       /* Since our patchlet doesn't expand the image, we're done. */
>> +       if (insn_delta == 0)
>> +               return list_insn;
>> +
>> +       len--;
>> +       patch++;
>> +
>> +       prev = list_insn;
>> +       next = list_insn->next;
>> +       for (idx = 0; idx < len; idx++) {
>> +               struct bpf_list_insn *node = kvzalloc(sizeof(*node),
>> +                                                     GFP_KERNEL);
>> +
>> +               if (!node) {
>> +                       /* Link what's allocated, so list destroyer could
>> +                        * free them.
>> +                        */
>> +                       prev->next = next;
>
> Why this special handling, if you can just insert element so that list
> is well-formed after each instruction?

Good idea, just always do "node->next = next", the "prev->next = node" in
next round will fix it.

>
>> +                       return ERR_PTR(-ENOMEM);
>> +               }
>> +
>> +               node->insn = patch[idx];
>> +               prev->next = node;
>> +               prev = node;
>
> E.g.,
>
> node->next = next;
> prev->next = node;
> prev = node;
>
>> +       }
>> +
>> +       prev->next = next;
>
> And no need for this either.
>
>> +       return prev;
>> +}
>> +
>> +struct bpf_list_insn *bpf_prepatch_list_insn(struct bpf_list_insn *list_insn,
>> +                                            const struct bpf_insn *patch,
>> +                                            u32 len)
>
> prepatch and patch functions should share the same logic.
>
> Prepend is just that - insert all instructions from buffer before current insns.
> Patch -> replace current one with first instriction in a buffer, then
> prepend remaining ones before the next instruction (so patch should
> call info prepend, with adjusted count and array pointer).

Ack, there indeed has quite a few things to simplify.

>
>> +{
>> +       struct bpf_list_insn *prev, *node, *begin_node;
>> +       u32 idx;
>> +
>> +       if (!len)
>> +               return list_insn;
>> +
>> +       node = kvzalloc(sizeof(*node), GFP_KERNEL);
>> +       if (!node)
>> +               return ERR_PTR(-ENOMEM);
>> +       node->insn = patch[0];
>> +       begin_node = node;
>> +       prev = node;
>> +
>> +       for (idx = 1; idx < len; idx++) {
>> +               node = kvzalloc(sizeof(*node), GFP_KERNEL);
>> +               if (!node) {
>> +                       node = begin_node;
>> +                       /* Release what's has been allocated. */
>> +                       while (node) {
>> +                               struct bpf_list_insn *next = node->next;
>> +
>> +                               kvfree(node);
>> +                               node = next;
>> +                       }
>> +                       return ERR_PTR(-ENOMEM);
>> +               }
>> +               node->insn = patch[idx];
>> +               prev->next = node;
>> +               prev = node;
>> +       }
>> +
>> +       prev->next = list_insn;
>> +       return begin_node;
>> +}
>> +
>>  void bpf_prog_kallsyms_del_subprogs(struct bpf_prog *fp)
>>  {
>>         int i;
>> --
>> 2.7.4
>>


^ permalink raw reply

* Re: Re: linux-next: build failure after merge of the net-next tree
From: Jason Gunthorpe @ 2019-07-11 11:52 UTC (permalink / raw)
  To: Bernard Metzler
  Cc: Leon Romanovsky, Stephen Rothwell, Doug Ledford, David Miller,
	Networking, Linux Next Mailing List, Linux Kernel Mailing List
In-Reply-To: <OF360C0EBE.4A489B94-ON00258434.002B10B7-00258434.002C0536@notes.na.collabserv.com>

On Thu, Jul 11, 2019 at 08:00:49AM +0000, Bernard Metzler wrote:

> That listen will not sleep. The socket is just marked
> listening. 

Eh? siw_listen_address() calls siw_cep_alloc() which does:

	struct siw_cep *cep = kzalloc(sizeof(*cep), GFP_KERNEL);

Which is sleeping. Many other cases too.

Jason

^ permalink raw reply

* Re: [RFC] virtio-net: share receive_*() and add_recvbuf_*() with virtio-vsock
From: Stefano Garzarella @ 2019-07-11 11:41 UTC (permalink / raw)
  To: Jason Wang, Michael S. Tsirkin, Stefan Hajnoczi; +Cc: virtualization, netdev
In-Reply-To: <9574bc38-4c5c-2325-986b-430e4a2b6661@redhat.com>

On Thu, Jul 11, 2019 at 03:37:00PM +0800, Jason Wang wrote:
> 
> On 2019/7/10 下午11:37, Stefano Garzarella wrote:
> > Hi,
> > as Jason suggested some months ago, I looked better at the virtio-net driver to
> > understand if we can reuse some parts also in the virtio-vsock driver, since we
> > have similar challenges (mergeable buffers, page allocation, small
> > packets, etc.).
> > 
> > Initially, I would add the skbuff in the virtio-vsock in order to re-use
> > receive_*() functions.
> 
> 
> Yes, that will be a good step.
> 

Okay, I'll go on this way.

> 
> > Then I would move receive_[small, big, mergeable]() and
> > add_recvbuf_[small, big, mergeable]() outside of virtio-net driver, in order to
> > call them also from virtio-vsock. I need to do some refactoring (e.g. leave the
> > XDP part on the virtio-net driver), but I think it is feasible.
> > 
> > The idea is to create a virtio-skb.[h,c] where put these functions and a new
> > object where stores some attributes needed (e.g. hdr_len ) and status (e.g.
> > some fields of struct receive_queue).
> 
> 
> My understanding is we could be more ambitious here. Do you see any blocker
> for reusing virtio-net directly? It's better to reuse not only the functions
> but also the logic like NAPI to avoid re-inventing something buggy and
> duplicated.
> 

These are my concerns:
- virtio-vsock is not a "net_device", so a lot of code related to
  ethtool, net devices (MAC address, MTU, speed, VLAN, XDP, offloading) will be
  not used by virtio-vsock.

- virtio-vsock has a different header. We can consider it as part of
  virtio_net payload, but it precludes the compatibility with old hosts. This
  was one of the major doubts that made me think about using only the
  send/recv skbuff functions, that it shouldn't break the compatibility.

> 
> > This is an idea of virtio-skb.h that
> > I have in mind:
> >      struct virtskb;
> 
> 
> What fields do you want to store in virtskb? It looks to be exist sk_buff is
> flexible enough to us?

My idea is to store queues information, like struct receive_queue or
struct send_queue, and some device attributes (e.g. hdr_len ).

> 
> 
> > 
> >      struct sk_buff *virtskb_receive_small(struct virtskb *vs, ...);
> >      struct sk_buff *virtskb_receive_big(struct virtskb *vs, ...);
> >      struct sk_buff *virtskb_receive_mergeable(struct virtskb *vs, ...);
> > 
> >      int virtskb_add_recvbuf_small(struct virtskb*vs, ...);
> >      int virtskb_add_recvbuf_big(struct virtskb *vs, ...);
> >      int virtskb_add_recvbuf_mergeable(struct virtskb *vs, ...);
> > 
> > For the Guest->Host path it should be easier, so maybe I can add a
> > "virtskb_send(struct virtskb *vs, struct sk_buff *skb)" with a part of the code
> > of xmit_skb().
> 
> 
> I may miss something, but I don't see any thing that prevents us from using
> xmit_skb() directly.
> 

Yes, but my initial idea was to make it more parametric and not related to the
virtio_net_hdr, so the 'hdr_len' could be a parameter and the
'num_buffers' should be handled by the caller.

> 
> > 
> > Let me know if you have in mind better names or if I should put these function
> > in another place.
> > 
> > I would like to leave the control part completely separate, so, for example,
> > the two drivers will negotiate the features independently and they will call
> > the right virtskb_receive_*() function based on the negotiation.
> 
> 
> If it's one the issue of negotiation, we can simply change the
> virtnet_probe() to deal with different devices.
> 
> 
> > 
> > I already started to work on it, but before to do more steps and send an RFC
> > patch, I would like to hear your opinion.
> > Do you think that makes sense?
> > Do you see any issue or a better solution?
> 
> 
> I still think we need to seek a way of adding some codes on virtio-net.c
> directly if there's no huge different in the processing of TX/RX. That would
> save us a lot time.

After the reading of the buffers from the virtqueue I think the process
is slightly different, because virtio-net will interface with the network
stack, while virtio-vsock will interface with the vsock-core (socket).
So the virtio-vsock implements the following:
- control flow mechanism to avoid to loose packets, informing the peer
  about the amount of memory available in the receive queue using some
  fields in the virtio_vsock_hdr
- de-multiplexing parsing the virtio_vsock_hdr and choosing the right
  socket depending on the port
- socket state handling

We can use the virtio-net as transport, but we should add a lot of
code to skip "net device" stuff when it is used by the virtio-vsock.
This could break something in virtio-net, for this reason, I thought to reuse
only the send/recv functions starting from the idea to split the virtio-net
driver in two parts:
a. one with all stuff related to the network stack
b. one with the stuff needed to communicate with the host

And use skbuff to communicate between parts. In this way, virtio-vsock
can use only the b part.

Maybe we can do this split in a better way, but I'm not sure it is
simple.

Thanks,
Stefano

^ permalink raw reply

* Re: [bpf-next v3 01/12] selftests/bpf: Print a message when tester could not run a program
From: Krzesimir Nowak @ 2019-07-11 11:36 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: open list, Alban Crequy, Iago López Galeiras,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev,
	Networking, bpf, xdp-newbies
In-Reply-To: <CAEf4BzYDOyU52wdCinm9cxxvNijpTJgQbCg9UxcO1QKk6vWhNA@mail.gmail.com>

On Thu, Jul 11, 2019 at 1:45 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Mon, Jul 8, 2019 at 3:42 PM Krzesimir Nowak <krzesimir@kinvolk.io> wrote:
> >
> > This prints a message when the error is about program type being not
> > supported by the test runner or because of permissions problem. This
> > is to see if the program we expected to run was actually executed.
> >
> > The messages are open-coded because strerror(ENOTSUPP) returns
> > "Unknown error 524".
> >
> > Changes since v2:
> > - Also print "FAIL" on an unexpected bpf_prog_test_run error, so there
> >   is a corresponding "FAIL" message for each failed test.
> >
> > Signed-off-by: Krzesimir Nowak <krzesimir@kinvolk.io>
> > ---
> >  tools/testing/selftests/bpf/test_verifier.c | 17 +++++++++++++----
> >  1 file changed, 13 insertions(+), 4 deletions(-)
> >
> > diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
> > index c5514daf8865..b8d065623ead 100644
> > --- a/tools/testing/selftests/bpf/test_verifier.c
> > +++ b/tools/testing/selftests/bpf/test_verifier.c
> > @@ -831,11 +831,20 @@ static int do_prog_test_run(int fd_prog, bool unpriv, uint32_t expected_val,
> >                                 tmp, &size_tmp, &retval, NULL);
> >         if (unpriv)
> >                 set_admin(false);
> > -       if (err && errno != 524/*ENOTSUPP*/ && errno != EPERM) {
> > -               printf("Unexpected bpf_prog_test_run error ");
> > -               return err;
> > +       if (err) {
> > +               switch (errno) {
> > +               case 524/*ENOTSUPP*/:
> > +                       printf("Did not run the program (not supported) ");
> > +                       return 0;
> > +               case EPERM:
> > +                       printf("Did not run the program (no permission) ");
>
> Let's add "SKIP: " prefix to these?

Not sure about it. The important part of the test (the program being
verified by the kernel's verifier) was still executed, so the test is
not really skipped.


>
> > +                       return 0;
> > +               default:
> > +                       printf("FAIL: Unexpected bpf_prog_test_run error (%s) ", strerror(saved_errno));
> > +                       return err;
> > +               }
> >         }
> > -       if (!err && retval != expected_val &&
> > +       if (retval != expected_val &&
> >             expected_val != POINTER_VALUE) {
> >                 printf("FAIL retval %d != %d ", retval, expected_val);
> >                 return 1;
> > --
> > 2.20.1
> >



--
Kinvolk GmbH | Adalbertstr.6a, 10999 Berlin | tel: +491755589364
Geschäftsführer/Directors: Alban Crequy, Chris Kühl, Iago López Galeiras
Registergericht/Court of registration: Amtsgericht Charlottenburg
Registernummer/Registration number: HRB 171414 B
Ust-ID-Nummer/VAT ID number: DE302207000

^ permalink raw reply

* Re: [RFC bpf-next 0/8] bpf: accelerate insn patching speed
From: Jiong Wang @ 2019-07-11 11:22 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Jiong Wang, Alexei Starovoitov, Daniel Borkmann, Edward Cree,
	Naveen N. Rao, Andrii Nakryiko, Jakub Kicinski, bpf, Networking,
	oss-drivers
In-Reply-To: <CAEf4BzavePpW-C+zORN1kwSUJAWuJ3LxZ6QGxqaE9msxCq8ZLA@mail.gmail.com>


Andrii Nakryiko writes:

> On Thu, Jul 4, 2019 at 2:31 PM Jiong Wang <jiong.wang@netronome.com> wrote:
>>
>> This is an RFC based on latest bpf-next about acclerating insn patching
>> speed, it is now near the shape of final PATCH set, and we could see the
>> changes migrating to list patching would brings, so send out for
>> comments. Most of the info are in cover letter. I splitted the code in a
>> way to show API migration more easily.
>
>
> Hey Jiong,
>
>
> Sorry, took me a while to get to this and learn more about instruction
> patching. Overall this looks good and I think is a good direction.
> I'll post high-level feedback here, and some more
> implementation-specific ones in corresponding patches.

Great, thanks very much for the feedbacks. Most of your feedbacks are
hitting those pain points I exactly had ran into. For some of them, I
thought similar solutions like yours, but failed due to various
reasons. Let's go through them again, I could have missed some important
things.

Please see my replies below.

>>
>> Test Results
>> ===
>>   - Full pass on test_verifier/test_prog/test_prog_32 under all three
>>     modes (interpreter, JIT, JIT with blinding).
>>
>>   - Benchmarking shows 10 ~ 15x faster on medium sized prog, and reduce
>>     patching time from 5100s (nearly one and a half hour) to less than
>>     0.5s for 1M insn patching.
>>
>> Known Issues
>> ===
>>   - The following warning is triggered when running scale test which
>>     contains 1M insns and patching:
>>       warning of mm/page_alloc.c:4639 __alloc_pages_nodemask+0x29e/0x330
>>
>>     This is caused by existing code, it can be reproduced on bpf-next
>>     master with jit blinding enabled, then run scale unit test, it will
>>     shown up after half an hour. After this set, patching is very fast, so
>>     it shows up quickly.
>>
>>   - No line info adjustment support when doing insn delete, subprog adj
>>     is with bug when doing insn delete as well. Generally, removal of insns
>>     could possibly cause remove of entire line or subprog, therefore
>>     entries of prog->aux->linfo or env->subprog needs to be deleted. I
>>     don't have good idea and clean code for integrating this into the
>>     linearization code at the moment, will do more experimenting,
>>     appreciate ideas and suggestions on this.
>
> Is there any specific problem to detect which line info to delete? Or
> what am I missing besides careful implementation?

Mostly line info and subprog info are range info which covers a range of
insns. Deleting insns could causing you adjusting the range or removing one
range entirely. subprog info could be fully recalcuated during
linearization while line info I need some careful implementation and I
failed to have clean code for this during linearization also as said no
unit tests to help me understand whether the code is correct or not.

I will described this latter, spent too much time writing the following
reply. Might worth an separate discussion thread.

>>
>>     Insn delete doesn't happen on normal programs, for example Cilium
>>     benchmarks, and happens rarely on test_progs, so the test coverage is
>>     not good. That's also why this RFC have a full pass on selftest with
>>     this known issue.
>
> I hope you'll add test for deletion (and w/ corresponding line info)
> in final patch set :)

Will try. Need to spend some time on BTF format.
>
>>
>>   - Could further use mem pool to accelerate the speed, changes are trivial
>>     on top of this RFC, and could be 2x extra faster. Not included in this
>>     RFC as reducing the algo complexity from quadratic to linear of insn
>>     number is the first step.
>
> Honestly, I think that would add more complexity than necessary, and I
> think we can further speed up performance without that, see below.
>
>>
>> Background
>> ===
>> This RFC aims to accelerate BPF insn patching speed, patching means expand
>> one bpf insn at any offset inside bpf prog into a set of new insns, or
>> remove insns.
>>
>> At the moment, insn patching is quadratic of insn number, this is due to
>> branch targets of jump insns needs to be adjusted, and the algo used is:
>>
>>   for insn inside prog
>>     patch insn + regeneate bpf prog
>>     for insn inside new prog
>>       adjust jump target
>>
>> This is causing significant time spending when a bpf prog requires large
>> amount of patching on different insns. Benchmarking shows it could take
>> more than half minutes to finish patching when patching number is more
>> than 50K, and the time spent could be more than one hour when patching
>> number is around 1M.
>>
>>   15000   :    3s
>>   45000   :   29s
>>   95000   :  125s
>>   195000  :  712s
>>   1000000 : 5100s
>>
>> This RFC introduces new patching infrastructure. Before doing insn
>> patching, insns in bpf prog are turned into a singly linked list, insert
>> new insns just insert new list node, delete insns just set delete flag.
>> And finally, the list is linearized back into array, and branch target
>> adjustment is done for all jump insns during linearization. This algo
>> brings the time complexity from quadratic to linear of insn number.
>>
>> Benchmarking shows the new patching infrastructure could be 10 ~ 15x faster
>> on medium sized prog, and for a 1M patching it reduce the time from 5100s
>> to less than 0.5s.
>>
>> Patching API
>> ===
>> Insn patching could happen on two layers inside BPF. One is "core layer"
>> where only BPF insns are patched. The other is "verification layer" where
>> insns have corresponding aux info as well high level subprog info, so
>> insn patching means aux info needs to be patched as well, and subprog info
>> needs to be adjusted. BPF prog also has debug info associated, so line info
>> should always be updated after insn patching.
>>
>> So, list creation, destroy, insert, delete is the same for both layer,
>> but lineration is different. "verification layer" patching require extra
>> work. Therefore the patch APIs are:
>>
>>    list creation:                bpf_create_list_insn
>>    list patch:                   bpf_patch_list_insn
>>    list pre-patch:               bpf_prepatch_list_insn
>
> I think pre-patch name is very confusing, until I read full
> description I couldn't understand what it's supposed to be used for.
> Speaking of bpf_patch_list_insn, patch is also generic enough to leave
> me wondering whether instruction buffer is inserted after instruction,
> or instruction is replaced with a bunch of instructions.
>
> So how about two more specific names:
> bpf_patch_list_insn -> bpf_list_insn_replace (meaning replace given
> instruction with a list of patch instructions)
> bpf_prepatch_list_insn -> bpf_list_insn_prepend (well, I think this
> one is pretty clear).

My sense on English word is not great, will switch to above which indeed
reads more clear.

>>    list lineration (core layer): prog = bpf_linearize_list_insn(prog, list)
>>    list lineration (veri layer): env = verifier_linearize_list_insn(env, list)
>
> These two functions are both quite involved, as well as share a lot of
> common code. I'd rather have one linearize instruction, that takes env
> as an optional parameter. If env is specified (which is the case for
> all cases except for constant blinding pass), then adjust aux_data and
> subprogs along the way.

Two version of lineration and how to unify them was a painpoint to me. I
thought to factor out some of the common code out, but it actually doesn't
count much, the final size counting + insnsi resize parts are the same,
then things start to diverge since the "Copy over insn" loop.

verifier layer needs to copy and initialize aux data etc. And jump
relocation is different. At core layer, the use case is JIT blinding which
could expand an jump_imm insn into a and/or/jump_reg sequence, and the
jump_reg is at the end of the patch buffer, it should be relocated. While
all use case in verifier layer, no jump in the prog will be patched and all
new jumps in patch buffer will jump inside the buffer locally so no need to
resolve.

And yes we could unify them into one and control the diverge using
argument, but then where to place the function is an issue. My
understanding is verifier.c is designed to be on top of core.c and core.c
should not reference and no need to be aware of any verifier specific data
structures, for example env or bpf_aux_insn_data etc.

So, in this RFC, I had choosed to write separate linerization function for
core and verifier layer. Does this make sense?

>
> This would keep logic less duplicated and shouldn't complexity beyond
> few null checks in few places.
>
>>    list destroy:                 bpf_destroy_list_insn
>>
>
> I'd also add a macro foreach_list_insn instead of explicit for loops
> in multiple places. That would also allow to skip deleted instructions
> transparently.
>
>> list patch could change the insn at patch point, it will invalid the aux
>
> typo: invalid -> invalidate

Ack.

>
>> info at patching point. list pre-patch insert new insns before patch point
>> where the insn and associated aux info are not touched, it is used for
>> example in convert_ctx_access when generating prologue.
>>
>> Typical API sequence for one patching pass:
>>
>>    struct bpf_list_insn list = bpf_create_list_insn(struct bpf_prog);
>>    for (elem = list; elem; elem = elem->next)
>>       patch_buf = gen_patch_buf_logic;
>>       elem = bpf_patch_list_insn(elem, patch_buf, cnt);
>>    bpf_prog = bpf_linearize_list_insn(list)
>>    bpf_destroy_list_insn(list)
>>
>> Several patching passes could also share the same list:
>>
>>    struct bpf_list_insn list = bpf_create_list_insn(struct bpf_prog);
>>    for (elem = list; elem; elem = elem->next)
>>       patch_buf = gen_patch_buf_logic1;
>>       elem = bpf_patch_list_insn(elem, patch_buf, cnt);
>>    for (elem = list; elem; elem = elem->next)
>>       patch_buf = gen_patch_buf_logic2;
>>       elem = bpf_patch_list_insn(elem, patch_buf, cnt);
>>    bpf_prog = bpf_linearize_list_insn(list)
>>    bpf_destroy_list_insn(list)
>>
>> but note new inserted insns int early passes won't have aux info except
>> zext info. So, if one patch pass requires all aux info updated and
>> recalculated for all insns including those pathced, it should first
>> linearize the old list, then re-create the list. The RFC always create and
>> linearize the list for each migrated patching pass separately.
>
> I think we should do just one list creation, few passes of patching
> and then linearize once. That will save quite a lot of memory
> allocation and will speed up a lot of things. All the verifier
> patching happens one after the other without any other functionality
> in between, so there shouldn't be any problem.

Yes, as mentioned above, it is possible and I had tried to do it in an very
initial impl. IIRC convert_ctx_access + fixup_bpf_calls could share the
same list, but then the 32-bit zero extension insertion pass requires
aux.zext_dst set properly for all instructions including those patched
one which we need to linearize the list first (as we set zext_dst during
linerization), or the other choice is we do the zext_dst initialization
during bpf_patch_list_insn, but this then make bpf_patch_list_insn diverge
between core and verifier layer.

> As for aux_data. We can solve that even more simply and reliably by
> storing a pointer along the struct bpf_list_insn

This is exactly what I had implemented initially, but then the issue is how
to handle aux_data for patched insn? IIRC I was leave it as a NULL pointer,
but later found zext_dst info is required for all insns, so I end up
duplicating zext_dst in bpf_list_insn.

This leads me worrying we need to keep duplicating fields there as soon as
there is new similar requirements in future patching pass and I thought it
might be better to just reference the aux_data inside env using orig_idx,
this avoids duplicating information, but we need to make sure used fields
inside aux_data for patched insn update-to-date during linearization or
patching list.

> (btw, how about calling it bpf_patchable_insn?).

No preference, will use this one.

> Here's how I propose to represent this patchable instruction:
>
> struct bpf_list_insn {
>        struct bpf_insn insn;
>        struct bpf_list_insn *next;
>        struct bpf_list_insn *target;
>        struct bpf_insn_aux_data *aux_data;
>        s32 orig_idx; // can repurpose this to have three meanings:
>                      // -2 - deleted
>                      // -1 - patched/inserted insn
>                      // >=0 - original idx

I actually had experimented the -2/-1/0 trick, exactly the same number
assignment :) IIRC the code was not clear compared with using flag, the
reason seems to be:
  1. we still need orig_idx of an patched insn somehow, meaning negate the
     index.
  2. somehow somecode need to know whether one insn is deleted or patched
     after the negation, so I end up with some ugly code.

Anyway, I might had not thought hard enough on this, I will retry using the
special index instead of flag, hopefully I could have clean code this time.

> };
>
> The idea would be as follows:
> 1. when creating original list, target pointer will point directly to
> a patchable instruction wrapper for jumps/calls. This will allow to
> stop tracking and re-calculating jump offsets and instruction indicies
> until linearization.

Not sure I have followed the idea of "target" pointer. At the moment we are
using index mapping array (generated as by-product during coping insn).

While the "target" pointer means to during list initialization, each jump
insn will have target initialized to the list node of the converted jump
destination insn, and all those non-jump insns are with NULL? Then during
linearization you assign index to each list node (could be done as
by-product of other pass) before insn coping which could then relocate the
insn during the coping as the "target" would have final index calculated?
Am I following correctly?

> 2. aux_data is also filled at that point. Later at linearization time
> you'd just iterate over all the instructions in final order and copy
> original aux_data, if it's present. And then just repace env's
> aux_data array at the end, should be very simple and fast.

As explained, I am worried making aux_data a pointer will causing
duplicating some fields into list_insn if the fields are required for
patched insns.

> 3. during fix_bpf_calls, zext, ctx rewrite passes, we'll reuse the
> same list of instructions and those passes will just keep inserting
> instruction buffers. Given we have restriction that all the jumps are
> only within patch buffer, it will be trivial to construct proper
> patchable instruction wrappers for newly added instructions, with NULL
> for aux_data and possibly non-NULL target (if it's a JMP insn).
> 4. After those passes, linearize, adjust subprogs (for this you'll
> probably still need to create index mapping, right?), copy or create
> new aux_data.
> 5. Done.
>
> What do you think? I think this should be overall simpler and faster.
> But let me know if I'm missing something.

Thanks for all these thoughts, they are very good suggestions and reminds
me to revisit some points I had forgotten. I will do the following things:

  1. retry the negative index solution to eliminate flag if the result code
     could be clean.
  2. the "target" pointer seems make sense, it makes list_insn bigger but
     normally space trade with time, so I will try to implement it to see
     how the code looks like.
  3. I still have concerns on making aux_data as pointer. Mostly due to
     patched insn will have NULL pointer and in case aux info of patched
     insn is required, we need to duplicate info inside list_insn. For
     example 32-bit zext opt requires zext_dst.

Regards,
Jiong

>>
>> Compared with old patching code, this new infrastructure has much less core
>> code, even though the final code has a couple of extra lines but that is
>> mostly due to for list based infrastructure, we need to do more error
>> checks, so the list and associated aux data structure could be freed when
>> errors happens.
>>
>> Patching Restrictions
>> ===
>>   - For core layer, the linearization assume no new jumps inside patch buf.
>>     Currently, the only user of this layer is jit blinding.
>>   - For verifier layer, there could be new jumps inside patch buf, but
>>     they should have branch target resolved themselves, meaning new jumps
>>     doesn't jump to insns out of the patch buf. This is the case for all
>>     existing verifier layer users.
>>   - bpf_insn_aux_data for all patched insns including the one at patch
>>     point are invalidated, only 32-bit zext info will be recalcuated.
>>     If the aux data of insn at patch point needs to be retained, it is
>>     purely insn insertion, so need to use the pre-patch API.
>>
>> I plan to send out a PATCH set once I finished insn deletion line info adj
>> support, please have a looks at this RFC, and appreciate feedbacks.
>>
>> Jiong Wang (8):
>>   bpf: introducing list based insn patching infra to core layer
>>   bpf: extend list based insn patching infra to verification layer
>>   bpf: migrate jit blinding to list patching infra
>>   bpf: migrate convert_ctx_accesses to list patching infra
>>   bpf: migrate fixup_bpf_calls to list patching infra
>>   bpf: migrate zero extension opt to list patching infra
>>   bpf: migrate insn remove to list patching infra
>>   bpf: delete all those code around old insn patching infrastructure
>>
>>  include/linux/bpf_verifier.h |   1 -
>>  include/linux/filter.h       |  27 +-
>>  kernel/bpf/core.c            | 431 +++++++++++++++++-----------
>>  kernel/bpf/verifier.c        | 649 +++++++++++++++++++------------------------
>>  4 files changed, 580 insertions(+), 528 deletions(-)
>>
>> --
>> 2.7.4
>>


^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox