* [PATCH v4 bpf-next 00/14] bpf: Introduce bpf_mem_cache_free_rcu().
From: Alexei Starovoitov @ 2023-07-06 3:34 UTC (permalink / raw)
To: daniel, andrii, void, houtao, paulmck; +Cc: tj, rcu, netdev, bpf, kernel-team
From: Alexei Starovoitov <ast@kernel.org>
v3->v4:
- extra patch 14 from Hou to check for object leaks.
- fixed the race/leak in free_by_rcu_ttrace. Extra hunk in patch 8.
- added Acks and fixed typos.
v2->v3:
- dropped _tail optimization for free_by_rcu_ttrace
- new patch 5 to refactor inc/dec of c->active
- change 'draining' logic in patch 7
- add rcu_barrier in patch 12
- __llist_add-> llist_add(waiting_for_gp_ttrace) in patch 9 to fix race
- David's Ack in patch 13 and explanation that migrate_disable cannot be removed just yet.
v1->v2:
- Fixed race condition spotted by Hou. Patch 7.
v1:
Introduce bpf_mem_cache_free_rcu(), which is similar to kfree_rcu() except that
the objects will go through an additional RCU tasks trace grace period
before being freed into slab.
Patches 1-9 - a bunch of prep work
Patch 10 - a patch from Paul that exports rcu_request_urgent_qs_task().
Patch 12 - the main bpf_mem_cache_free_rcu patch.
Patch 13 - use it in bpf_cpumask.
bpf_local_storage, bpf_obj_drop, qp-trie will be other users eventually.
With an additional hack patch to htab that replaces bpf_mem_cache_free with bpf_mem_cache_free_rcu,
the benchmark results are the following:
- map_perf_test 4 8 16348 1000000
drops from 800k to 600k. Waiting for RCU GP makes objects cache cold.
- bench htab-mem -a -p 8
A 20% drop in performance and a big increase in memory, from 3 Mbyte to 50 Mbyte, as expected.
- bench htab-mem -a -p 16 --use-case add_del_on_diff_cpu
Same performance and better memory consumption.
Before these patches this bench would OOM (with or without 'reuse after GP');
patch 8 addresses the issue.
In the end, the performance drop and additional memory consumption due to _rcu()
were expected and came out within a reasonable margin.
Without Paul's patch 10 the memory consumption in 'bench htab-mem' is in Gbytes
which wouldn't be acceptable.
Patch 8 is a heuristic to address the 'alloc on one cpu, free on another' issue.
It works well in practice. One can probably construct an artificial benchmark
that makes the heuristic ineffective, but we have to trade off performance, code complexity,
and memory consumption.
The life cycle of objects:
  alloc: dequeue free_llist
  free: enqueue free_llist
  free_llist above high watermark -> free_by_rcu_ttrace
  free_rcu: enqueue free_by_rcu -> waiting_for_gp
  after RCU GP waiting_for_gp -> free_by_rcu_ttrace
  free_by_rcu_ttrace -> waiting_for_gp_ttrace -> slab
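For callers the conversion is mechanical. A minimal sketch modeled on the bpf_cpumask
patch in this series ('struct my_obj' and 'my_obj_ma' are made-up names used only for
illustration; the bpf_mem_alloc instance is assumed to be set up elsewhere with
bpf_mem_alloc_init(&my_obj_ma, sizeof(struct my_obj), false)):

#include <linux/bpf_mem_alloc.h>
#include <linux/refcount.h>

struct my_obj {
        refcount_t usage;
        /* ... payload ... */
};

static struct bpf_mem_alloc my_obj_ma;

static void my_obj_release(struct my_obj *obj)
{
        if (!refcount_dec_and_test(&obj->usage))
                return;

        migrate_disable();
        /* Instead of open-coding call_rcu() + bpf_mem_cache_free(),
         * let the allocator defer reuse past an RCU GP and the return
         * to slab past an RCU tasks trace GP.
         */
        bpf_mem_cache_free_rcu(&my_obj_ma, obj);
        migrate_enable();
}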
Alexei Starovoitov (12):
bpf: Rename few bpf_mem_alloc fields.
bpf: Simplify code of destroy_mem_alloc() with kmemdup().
bpf: Let free_all() return the number of freed elements.
bpf: Refactor alloc_bulk().
bpf: Factor out inc/dec of active flag into helpers.
bpf: Further refactor alloc_bulk().
bpf: Change bpf_mem_cache draining process.
bpf: Add a hint to allocated objects.
bpf: Allow reuse from waiting_for_gp_ttrace list.
selftests/bpf: Improve test coverage of bpf_mem_alloc.
bpf: Introduce bpf_mem_free_rcu() similar to kfree_rcu().
bpf: Convert bpf_cpumask to bpf_mem_cache_free_rcu.
Hou Tao (1):
bpf: Add object leak check.
Paul E. McKenney (1):
rcu: Export rcu_request_urgent_qs_task()
include/linux/bpf_mem_alloc.h | 2 +
include/linux/rcutiny.h | 2 +
include/linux/rcutree.h | 1 +
kernel/bpf/cpumask.c | 20 +-
kernel/bpf/memalloc.c | 378 +++++++++++++-----
kernel/rcu/rcu.h | 2 -
.../testing/selftests/bpf/progs/linked_list.c | 2 +-
7 files changed, 298 insertions(+), 109 deletions(-)
--
2.34.1
* [PATCH v4 bpf-next 01/14] bpf: Rename few bpf_mem_alloc fields.
From: Alexei Starovoitov @ 2023-07-06 3:34 UTC (permalink / raw)
To: daniel, andrii, void, houtao, paulmck; +Cc: tj, rcu, netdev, bpf, kernel-team
From: Alexei Starovoitov <ast@kernel.org>
Rename:
- struct rcu_head rcu;
- struct llist_head free_by_rcu;
- struct llist_head waiting_for_gp;
- atomic_t call_rcu_in_progress;
+ struct llist_head free_by_rcu_ttrace;
+ struct llist_head waiting_for_gp_ttrace;
+ struct rcu_head rcu_ttrace;
+ atomic_t call_rcu_ttrace_in_progress;
...
- static void do_call_rcu(struct bpf_mem_cache *c)
+ static void do_call_rcu_ttrace(struct bpf_mem_cache *c)
to better indicate intended use.
The 'tasks trace' is shortened to 'ttrace' to reduce verbosity.
No functional changes.
Later patches will add free_by_rcu/waiting_for_gp fields to be used with normal RCU.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Hou Tao <houtao1@huawei.com>
---
kernel/bpf/memalloc.c | 57 ++++++++++++++++++++++---------------------
1 file changed, 29 insertions(+), 28 deletions(-)
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 0668bcd7c926..cc5b8adb4c83 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -99,10 +99,11 @@ struct bpf_mem_cache {
int low_watermark, high_watermark, batch;
int percpu_size;
- struct rcu_head rcu;
- struct llist_head free_by_rcu;
- struct llist_head waiting_for_gp;
- atomic_t call_rcu_in_progress;
+ /* list of objects to be freed after RCU tasks trace GP */
+ struct llist_head free_by_rcu_ttrace;
+ struct llist_head waiting_for_gp_ttrace;
+ struct rcu_head rcu_ttrace;
+ atomic_t call_rcu_ttrace_in_progress;
};
struct bpf_mem_caches {
@@ -165,18 +166,18 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
old_memcg = set_active_memcg(memcg);
for (i = 0; i < cnt; i++) {
/*
- * free_by_rcu is only manipulated by irq work refill_work().
+ * free_by_rcu_ttrace is only manipulated by irq work refill_work().
* IRQ works on the same CPU are called sequentially, so it is
* safe to use __llist_del_first() here. If alloc_bulk() is
* invoked by the initial prefill, there will be no running
* refill_work(), so __llist_del_first() is fine as well.
*
- * In most cases, objects on free_by_rcu are from the same CPU.
+ * In most cases, objects on free_by_rcu_ttrace are from the same CPU.
* If some objects come from other CPUs, it doesn't incur any
* harm because NUMA_NO_NODE means the preference for current
* numa node and it is not a guarantee.
*/
- obj = __llist_del_first(&c->free_by_rcu);
+ obj = __llist_del_first(&c->free_by_rcu_ttrace);
if (!obj) {
/* Allocate, but don't deplete atomic reserves that typical
* GFP_ATOMIC would do. irq_work runs on this cpu and kmalloc
@@ -232,10 +233,10 @@ static void free_all(struct llist_node *llnode, bool percpu)
static void __free_rcu(struct rcu_head *head)
{
- struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu);
+ struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu_ttrace);
- free_all(llist_del_all(&c->waiting_for_gp), !!c->percpu_size);
- atomic_set(&c->call_rcu_in_progress, 0);
+ free_all(llist_del_all(&c->waiting_for_gp_ttrace), !!c->percpu_size);
+ atomic_set(&c->call_rcu_ttrace_in_progress, 0);
}
static void __free_rcu_tasks_trace(struct rcu_head *head)
@@ -254,32 +255,32 @@ static void enque_to_free(struct bpf_mem_cache *c, void *obj)
struct llist_node *llnode = obj;
/* bpf_mem_cache is a per-cpu object. Freeing happens in irq_work.
- * Nothing races to add to free_by_rcu list.
+ * Nothing races to add to free_by_rcu_ttrace list.
*/
- __llist_add(llnode, &c->free_by_rcu);
+ __llist_add(llnode, &c->free_by_rcu_ttrace);
}
-static void do_call_rcu(struct bpf_mem_cache *c)
+static void do_call_rcu_ttrace(struct bpf_mem_cache *c)
{
struct llist_node *llnode, *t;
- if (atomic_xchg(&c->call_rcu_in_progress, 1))
+ if (atomic_xchg(&c->call_rcu_ttrace_in_progress, 1))
return;
- WARN_ON_ONCE(!llist_empty(&c->waiting_for_gp));
- llist_for_each_safe(llnode, t, __llist_del_all(&c->free_by_rcu))
- /* There is no concurrent __llist_add(waiting_for_gp) access.
+ WARN_ON_ONCE(!llist_empty(&c->waiting_for_gp_ttrace));
+ llist_for_each_safe(llnode, t, __llist_del_all(&c->free_by_rcu_ttrace))
+ /* There is no concurrent __llist_add(waiting_for_gp_ttrace) access.
* It doesn't race with llist_del_all either.
- * But there could be two concurrent llist_del_all(waiting_for_gp):
+ * But there could be two concurrent llist_del_all(waiting_for_gp_ttrace):
* from __free_rcu() and from drain_mem_cache().
*/
- __llist_add(llnode, &c->waiting_for_gp);
+ __llist_add(llnode, &c->waiting_for_gp_ttrace);
/* Use call_rcu_tasks_trace() to wait for sleepable progs to finish.
* If RCU Tasks Trace grace period implies RCU grace period, free
* these elements directly, else use call_rcu() to wait for normal
* progs to finish and finally do free_one() on each element.
*/
- call_rcu_tasks_trace(&c->rcu, __free_rcu_tasks_trace);
+ call_rcu_tasks_trace(&c->rcu_ttrace, __free_rcu_tasks_trace);
}
static void free_bulk(struct bpf_mem_cache *c)
@@ -307,7 +308,7 @@ static void free_bulk(struct bpf_mem_cache *c)
/* and drain free_llist_extra */
llist_for_each_safe(llnode, t, llist_del_all(&c->free_llist_extra))
enque_to_free(c, llnode);
- do_call_rcu(c);
+ do_call_rcu_ttrace(c);
}
static void bpf_mem_refill(struct irq_work *work)
@@ -441,13 +442,13 @@ static void drain_mem_cache(struct bpf_mem_cache *c)
/* No progs are using this bpf_mem_cache, but htab_map_free() called
* bpf_mem_cache_free() for all remaining elements and they can be in
- * free_by_rcu or in waiting_for_gp lists, so drain those lists now.
+ * free_by_rcu_ttrace or in waiting_for_gp_ttrace lists, so drain those lists now.
*
- * Except for waiting_for_gp list, there are no concurrent operations
+ * Except for waiting_for_gp_ttrace list, there are no concurrent operations
* on these lists, so it is safe to use __llist_del_all().
*/
- free_all(__llist_del_all(&c->free_by_rcu), percpu);
- free_all(llist_del_all(&c->waiting_for_gp), percpu);
+ free_all(__llist_del_all(&c->free_by_rcu_ttrace), percpu);
+ free_all(llist_del_all(&c->waiting_for_gp_ttrace), percpu);
free_all(__llist_del_all(&c->free_llist), percpu);
free_all(__llist_del_all(&c->free_llist_extra), percpu);
}
@@ -462,7 +463,7 @@ static void free_mem_alloc_no_barrier(struct bpf_mem_alloc *ma)
static void free_mem_alloc(struct bpf_mem_alloc *ma)
{
- /* waiting_for_gp lists was drained, but __free_rcu might
+ /* waiting_for_gp_ttrace lists was drained, but __free_rcu might
* still execute. Wait for it now before we freeing percpu caches.
*
* rcu_barrier_tasks_trace() doesn't imply synchronize_rcu_tasks_trace(),
@@ -535,7 +536,7 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
*/
irq_work_sync(&c->refill_work);
drain_mem_cache(c);
- rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
+ rcu_in_progress += atomic_read(&c->call_rcu_ttrace_in_progress);
}
/* objcg is the same across cpus */
if (c->objcg)
@@ -550,7 +551,7 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
c = &cc->cache[i];
irq_work_sync(&c->refill_work);
drain_mem_cache(c);
- rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
+ rcu_in_progress += atomic_read(&c->call_rcu_ttrace_in_progress);
}
}
if (c->objcg)
--
2.34.1
* [PATCH v4 bpf-next 02/14] bpf: Simplify code of destroy_mem_alloc() with kmemdup().
From: Alexei Starovoitov @ 2023-07-06 3:34 UTC (permalink / raw)
To: daniel, andrii, void, houtao, paulmck; +Cc: tj, rcu, netdev, bpf, kernel-team
From: Alexei Starovoitov <ast@kernel.org>
Use kmemdup() to simplify the code.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Hou Tao <houtao1@huawei.com>
---
kernel/bpf/memalloc.c | 7 ++-----
1 file changed, 2 insertions(+), 5 deletions(-)
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index cc5b8adb4c83..b0011217be6c 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -499,7 +499,7 @@ static void destroy_mem_alloc(struct bpf_mem_alloc *ma, int rcu_in_progress)
return;
}
- copy = kmalloc(sizeof(*ma), GFP_KERNEL);
+ copy = kmemdup(ma, sizeof(*ma), GFP_KERNEL);
if (!copy) {
/* Slow path with inline barrier-s */
free_mem_alloc(ma);
@@ -507,10 +507,7 @@ static void destroy_mem_alloc(struct bpf_mem_alloc *ma, int rcu_in_progress)
}
/* Defer barriers into worker to let the rest of map memory to be freed */
- copy->cache = ma->cache;
- ma->cache = NULL;
- copy->caches = ma->caches;
- ma->caches = NULL;
+ memset(ma, 0, sizeof(*ma));
INIT_WORK(&copy->work, free_mem_alloc_deferred);
queue_work(system_unbound_wq, &copy->work);
}
--
2.34.1
* [PATCH v4 bpf-next 03/14] bpf: Let free_all() return the number of freed elements.
From: Alexei Starovoitov @ 2023-07-06 3:34 UTC (permalink / raw)
To: daniel, andrii, void, houtao, paulmck; +Cc: tj, rcu, netdev, bpf, kernel-team
From: Alexei Starovoitov <ast@kernel.org>
Let free_all() helper return the number of freed elements.
It's not used in this patch, but helps in debug/development of bpf_mem_alloc.
For example this diff for __free_rcu():
- free_all(llist_del_all(&c->waiting_for_gp_ttrace), !!c->percpu_size);
+ printk("cpu %d freed %d objs after tasks trace\n", raw_smp_processor_id(),
+ free_all(llist_del_all(&c->waiting_for_gp_ttrace), !!c->percpu_size));
would show how busy RCU tasks trace is.
In an artificial benchmark where one cpu is allocating and a different cpu is freeing,
RCU tasks trace won't be able to keep up and the list of objects
would keep growing from thousands to millions, eventually OOMing.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Hou Tao <houtao1@huawei.com>
---
kernel/bpf/memalloc.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index b0011217be6c..693651d2648b 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -223,12 +223,16 @@ static void free_one(void *obj, bool percpu)
kfree(obj);
}
-static void free_all(struct llist_node *llnode, bool percpu)
+static int free_all(struct llist_node *llnode, bool percpu)
{
struct llist_node *pos, *t;
+ int cnt = 0;
- llist_for_each_safe(pos, t, llnode)
+ llist_for_each_safe(pos, t, llnode) {
free_one(pos, percpu);
+ cnt++;
+ }
+ return cnt;
}
static void __free_rcu(struct rcu_head *head)
--
2.34.1
* [PATCH v4 bpf-next 04/14] bpf: Refactor alloc_bulk().
From: Alexei Starovoitov @ 2023-07-06 3:34 UTC (permalink / raw)
To: daniel, andrii, void, houtao, paulmck; +Cc: tj, rcu, netdev, bpf, kernel-team
From: Alexei Starovoitov <ast@kernel.org>
Factor out the inner body of alloc_bulk() into a separate helper.
No functional changes.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Hou Tao <houtao1@huawei.com>
---
kernel/bpf/memalloc.c | 46 ++++++++++++++++++++++++-------------------
1 file changed, 26 insertions(+), 20 deletions(-)
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 693651d2648b..9693b1f8cbda 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -154,11 +154,35 @@ static struct mem_cgroup *get_memcg(const struct bpf_mem_cache *c)
#endif
}
+static void add_obj_to_free_list(struct bpf_mem_cache *c, void *obj)
+{
+ unsigned long flags;
+
+ if (IS_ENABLED(CONFIG_PREEMPT_RT))
+ /* In RT irq_work runs in per-cpu kthread, so disable
+ * interrupts to avoid preemption and interrupts and
+ * reduce the chance of bpf prog executing on this cpu
+ * when active counter is busy.
+ */
+ local_irq_save(flags);
+ /* alloc_bulk runs from irq_work which will not preempt a bpf
+ * program that does unit_alloc/unit_free since IRQs are
+ * disabled there. There is no race to increment 'active'
+ * counter. It protects free_llist from corruption in case NMI
+ * bpf prog preempted this loop.
+ */
+ WARN_ON_ONCE(local_inc_return(&c->active) != 1);
+ __llist_add(obj, &c->free_llist);
+ c->free_cnt++;
+ local_dec(&c->active);
+ if (IS_ENABLED(CONFIG_PREEMPT_RT))
+ local_irq_restore(flags);
+}
+
/* Mostly runs from irq_work except __init phase. */
static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
{
struct mem_cgroup *memcg = NULL, *old_memcg;
- unsigned long flags;
void *obj;
int i;
@@ -188,25 +212,7 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
if (!obj)
break;
}
- if (IS_ENABLED(CONFIG_PREEMPT_RT))
- /* In RT irq_work runs in per-cpu kthread, so disable
- * interrupts to avoid preemption and interrupts and
- * reduce the chance of bpf prog executing on this cpu
- * when active counter is busy.
- */
- local_irq_save(flags);
- /* alloc_bulk runs from irq_work which will not preempt a bpf
- * program that does unit_alloc/unit_free since IRQs are
- * disabled there. There is no race to increment 'active'
- * counter. It protects free_llist from corruption in case NMI
- * bpf prog preempted this loop.
- */
- WARN_ON_ONCE(local_inc_return(&c->active) != 1);
- __llist_add(obj, &c->free_llist);
- c->free_cnt++;
- local_dec(&c->active);
- if (IS_ENABLED(CONFIG_PREEMPT_RT))
- local_irq_restore(flags);
+ add_obj_to_free_list(c, obj);
}
set_active_memcg(old_memcg);
mem_cgroup_put(memcg);
--
2.34.1
* [PATCH v4 bpf-next 05/14] bpf: Factor out inc/dec of active flag into helpers.
From: Alexei Starovoitov @ 2023-07-06 3:34 UTC (permalink / raw)
To: daniel, andrii, void, houtao, paulmck; +Cc: tj, rcu, netdev, bpf, kernel-team
From: Alexei Starovoitov <ast@kernel.org>
Factor out local_inc/dec_return(&c->active) into helpers.
No functional changes.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Hou Tao <houtao1@huawei.com>
---
kernel/bpf/memalloc.c | 30 ++++++++++++++++++------------
1 file changed, 18 insertions(+), 12 deletions(-)
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 9693b1f8cbda..052fc801fb9f 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -154,17 +154,15 @@ static struct mem_cgroup *get_memcg(const struct bpf_mem_cache *c)
#endif
}
-static void add_obj_to_free_list(struct bpf_mem_cache *c, void *obj)
+static void inc_active(struct bpf_mem_cache *c, unsigned long *flags)
{
- unsigned long flags;
-
if (IS_ENABLED(CONFIG_PREEMPT_RT))
/* In RT irq_work runs in per-cpu kthread, so disable
* interrupts to avoid preemption and interrupts and
* reduce the chance of bpf prog executing on this cpu
* when active counter is busy.
*/
- local_irq_save(flags);
+ local_irq_save(*flags);
/* alloc_bulk runs from irq_work which will not preempt a bpf
* program that does unit_alloc/unit_free since IRQs are
* disabled there. There is no race to increment 'active'
@@ -172,13 +170,25 @@ static void add_obj_to_free_list(struct bpf_mem_cache *c, void *obj)
* bpf prog preempted this loop.
*/
WARN_ON_ONCE(local_inc_return(&c->active) != 1);
- __llist_add(obj, &c->free_llist);
- c->free_cnt++;
+}
+
+static void dec_active(struct bpf_mem_cache *c, unsigned long flags)
+{
local_dec(&c->active);
if (IS_ENABLED(CONFIG_PREEMPT_RT))
local_irq_restore(flags);
}
+static void add_obj_to_free_list(struct bpf_mem_cache *c, void *obj)
+{
+ unsigned long flags;
+
+ inc_active(c, &flags);
+ __llist_add(obj, &c->free_llist);
+ c->free_cnt++;
+ dec_active(c, flags);
+}
+
/* Mostly runs from irq_work except __init phase. */
static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
{
@@ -300,17 +310,13 @@ static void free_bulk(struct bpf_mem_cache *c)
int cnt;
do {
- if (IS_ENABLED(CONFIG_PREEMPT_RT))
- local_irq_save(flags);
- WARN_ON_ONCE(local_inc_return(&c->active) != 1);
+ inc_active(c, &flags);
llnode = __llist_del_first(&c->free_llist);
if (llnode)
cnt = --c->free_cnt;
else
cnt = 0;
- local_dec(&c->active);
- if (IS_ENABLED(CONFIG_PREEMPT_RT))
- local_irq_restore(flags);
+ dec_active(c, flags);
if (llnode)
enque_to_free(c, llnode);
} while (cnt > (c->high_watermark + c->low_watermark) / 2);
--
2.34.1
* [PATCH v4 bpf-next 06/14] bpf: Further refactor alloc_bulk().
From: Alexei Starovoitov @ 2023-07-06 3:34 UTC (permalink / raw)
To: daniel, andrii, void, houtao, paulmck; +Cc: tj, rcu, netdev, bpf, kernel-team
From: Alexei Starovoitov <ast@kernel.org>
In certain scenarios alloc_bulk() might be taking free objects mainly from the
free_by_rcu_ttrace list. In such a case get_memcg() and set_active_memcg() are
redundant, but they show up in the perf profile. Split the loop and only set memcg
when allocating from slab. There is no performance difference from this patch alone, but
it helps in combination with further patches.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Hou Tao <houtao1@huawei.com>
---
kernel/bpf/memalloc.c | 30 ++++++++++++++++++------------
1 file changed, 18 insertions(+), 12 deletions(-)
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 052fc801fb9f..0ee566a7719a 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -196,8 +196,6 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
void *obj;
int i;
- memcg = get_memcg(c);
- old_memcg = set_active_memcg(memcg);
for (i = 0; i < cnt; i++) {
/*
* free_by_rcu_ttrace is only manipulated by irq work refill_work().
@@ -212,16 +210,24 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
* numa node and it is not a guarantee.
*/
obj = __llist_del_first(&c->free_by_rcu_ttrace);
- if (!obj) {
- /* Allocate, but don't deplete atomic reserves that typical
- * GFP_ATOMIC would do. irq_work runs on this cpu and kmalloc
- * will allocate from the current numa node which is what we
- * want here.
- */
- obj = __alloc(c, node, GFP_NOWAIT | __GFP_NOWARN | __GFP_ACCOUNT);
- if (!obj)
- break;
- }
+ if (!obj)
+ break;
+ add_obj_to_free_list(c, obj);
+ }
+ if (i >= cnt)
+ return;
+
+ memcg = get_memcg(c);
+ old_memcg = set_active_memcg(memcg);
+ for (; i < cnt; i++) {
+ /* Allocate, but don't deplete atomic reserves that typical
+ * GFP_ATOMIC would do. irq_work runs on this cpu and kmalloc
+ * will allocate from the current numa node which is what we
+ * want here.
+ */
+ obj = __alloc(c, node, GFP_NOWAIT | __GFP_NOWARN | __GFP_ACCOUNT);
+ if (!obj)
+ break;
add_obj_to_free_list(c, obj);
}
set_active_memcg(old_memcg);
--
2.34.1
* [PATCH v4 bpf-next 07/14] bpf: Change bpf_mem_cache draining process.
From: Alexei Starovoitov @ 2023-07-06 3:34 UTC (permalink / raw)
To: daniel, andrii, void, houtao, paulmck; +Cc: tj, rcu, netdev, bpf, kernel-team
From: Alexei Starovoitov <ast@kernel.org>
The next patch will introduce cross-cpu llist access, and the existing
irq_work_sync() + drain_mem_cache() + rcu_barrier_tasks_trace() mechanism will
not be enough, since irq_work_sync() + drain_mem_cache() on cpu A won't
guarantee that the llists on cpu A are empty. The free_bulk() on cpu B might add
objects back to the llists of cpu A. Add a 'bool draining' flag.
The modified sequence looks like:
for_each_cpu:
  WRITE_ONCE(c->draining, true); // do_call_rcu_ttrace() won't be doing call_rcu() any more
  irq_work_sync(); // wait for irq_work callback (free_bulk) to finish
  drain_mem_cache(); // free all objects
rcu_barrier_tasks_trace(); // wait for RCU callbacks to execute
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
kernel/bpf/memalloc.c | 18 +++++++++---------
1 file changed, 9 insertions(+), 9 deletions(-)
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 0ee566a7719a..2615f296f052 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -98,6 +98,7 @@ struct bpf_mem_cache {
int free_cnt;
int low_watermark, high_watermark, batch;
int percpu_size;
+ bool draining;
/* list of objects to be freed after RCU tasks trace GP */
struct llist_head free_by_rcu_ttrace;
@@ -301,6 +302,12 @@ static void do_call_rcu_ttrace(struct bpf_mem_cache *c)
* from __free_rcu() and from drain_mem_cache().
*/
__llist_add(llnode, &c->waiting_for_gp_ttrace);
+
+ if (unlikely(READ_ONCE(c->draining))) {
+ __free_rcu(&c->rcu_ttrace);
+ return;
+ }
+
/* Use call_rcu_tasks_trace() to wait for sleepable progs to finish.
* If RCU Tasks Trace grace period implies RCU grace period, free
* these elements directly, else use call_rcu() to wait for normal
@@ -544,15 +551,7 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
rcu_in_progress = 0;
for_each_possible_cpu(cpu) {
c = per_cpu_ptr(ma->cache, cpu);
- /*
- * refill_work may be unfinished for PREEMPT_RT kernel
- * in which irq work is invoked in a per-CPU RT thread.
- * It is also possible for kernel with
- * arch_irq_work_has_interrupt() being false and irq
- * work is invoked in timer interrupt. So waiting for
- * the completion of irq work to ease the handling of
- * concurrency.
- */
+ WRITE_ONCE(c->draining, true);
irq_work_sync(&c->refill_work);
drain_mem_cache(c);
rcu_in_progress += atomic_read(&c->call_rcu_ttrace_in_progress);
@@ -568,6 +567,7 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
cc = per_cpu_ptr(ma->caches, cpu);
for (i = 0; i < NUM_CACHES; i++) {
c = &cc->cache[i];
+ WRITE_ONCE(c->draining, true);
irq_work_sync(&c->refill_work);
drain_mem_cache(c);
rcu_in_progress += atomic_read(&c->call_rcu_ttrace_in_progress);
--
2.34.1
* [PATCH v4 bpf-next 08/14] bpf: Add a hint to allocated objects.
From: Alexei Starovoitov @ 2023-07-06 3:34 UTC (permalink / raw)
To: daniel, andrii, void, houtao, paulmck; +Cc: tj, rcu, netdev, bpf, kernel-team
From: Alexei Starovoitov <ast@kernel.org>
To address the OOM issue when one cpu is allocating and another cpu is freeing, add
a target bpf_mem_cache hint to allocated objects, and when the local cpu's free_llist
overflows, free to that bpf_mem_cache. The hint addresses the OOM while
maintaining the same performance for the common case when alloc/free are done on the
same cpu.
Note that do_call_rcu_ttrace() now has to check the 'draining' flag in one more case,
since do_call_rcu_ttrace() is called not only for the current cpu.
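At its core the hint reuses the hidden llist_node space at the start of every object.
A condensed view of the diff below ('c', 'tgt' and 'llnode' are the variables from
unit_alloc(), unit_free() and free_bulk() respectively):

/* unit_alloc(): stamp the object with the cache it was taken from.
 * The first LLIST_NODE_SZ bytes are unused while the bpf program owns
 * the object, so they can carry the owning cache pointer.
 */
*(struct bpf_mem_cache **)llnode = c;

/* unit_free(): read the stamp back and remember it as the target cache */
c->tgt = *(struct bpf_mem_cache **)llnode;

/* free_bulk(): overflowing objects are queued to the hinted cache, so in
 * an 'alloc on cpu A, free on cpu B' workload the memory flows back to
 * cpu A's cache, where alloc_bulk() can reuse it, instead of growing
 * cpu B's lists without bound.
 */
enque_to_free(tgt, llnode);
do_call_rcu_ttrace(tgt);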
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Hou Tao <houtao1@huawei.com>
---
kernel/bpf/memalloc.c | 50 +++++++++++++++++++++++++++----------------
1 file changed, 31 insertions(+), 19 deletions(-)
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 2615f296f052..9986c6b7df4d 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -99,6 +99,7 @@ struct bpf_mem_cache {
int low_watermark, high_watermark, batch;
int percpu_size;
bool draining;
+ struct bpf_mem_cache *tgt;
/* list of objects to be freed after RCU tasks trace GP */
struct llist_head free_by_rcu_ttrace;
@@ -199,18 +200,11 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
for (i = 0; i < cnt; i++) {
/*
- * free_by_rcu_ttrace is only manipulated by irq work refill_work().
- * IRQ works on the same CPU are called sequentially, so it is
- * safe to use __llist_del_first() here. If alloc_bulk() is
- * invoked by the initial prefill, there will be no running
- * refill_work(), so __llist_del_first() is fine as well.
- *
- * In most cases, objects on free_by_rcu_ttrace are from the same CPU.
- * If some objects come from other CPUs, it doesn't incur any
- * harm because NUMA_NO_NODE means the preference for current
- * numa node and it is not a guarantee.
+ * For every 'c' llist_del_first(&c->free_by_rcu_ttrace); is
+ * done only by one CPU == current CPU. Other CPUs might
+ * llist_add() and llist_del_all() in parallel.
*/
- obj = __llist_del_first(&c->free_by_rcu_ttrace);
+ obj = llist_del_first(&c->free_by_rcu_ttrace);
if (!obj)
break;
add_obj_to_free_list(c, obj);
@@ -284,18 +278,23 @@ static void enque_to_free(struct bpf_mem_cache *c, void *obj)
/* bpf_mem_cache is a per-cpu object. Freeing happens in irq_work.
* Nothing races to add to free_by_rcu_ttrace list.
*/
- __llist_add(llnode, &c->free_by_rcu_ttrace);
+ llist_add(llnode, &c->free_by_rcu_ttrace);
}
static void do_call_rcu_ttrace(struct bpf_mem_cache *c)
{
struct llist_node *llnode, *t;
- if (atomic_xchg(&c->call_rcu_ttrace_in_progress, 1))
+ if (atomic_xchg(&c->call_rcu_ttrace_in_progress, 1)) {
+ if (unlikely(READ_ONCE(c->draining))) {
+ llnode = llist_del_all(&c->free_by_rcu_ttrace);
+ free_all(llnode, !!c->percpu_size);
+ }
return;
+ }
WARN_ON_ONCE(!llist_empty(&c->waiting_for_gp_ttrace));
- llist_for_each_safe(llnode, t, __llist_del_all(&c->free_by_rcu_ttrace))
+ llist_for_each_safe(llnode, t, llist_del_all(&c->free_by_rcu_ttrace))
/* There is no concurrent __llist_add(waiting_for_gp_ttrace) access.
* It doesn't race with llist_del_all either.
* But there could be two concurrent llist_del_all(waiting_for_gp_ttrace):
@@ -318,10 +317,13 @@ static void do_call_rcu_ttrace(struct bpf_mem_cache *c)
static void free_bulk(struct bpf_mem_cache *c)
{
+ struct bpf_mem_cache *tgt = c->tgt;
struct llist_node *llnode, *t;
unsigned long flags;
int cnt;
+ WARN_ON_ONCE(tgt->unit_size != c->unit_size);
+
do {
inc_active(c, &flags);
llnode = __llist_del_first(&c->free_llist);
@@ -331,13 +333,13 @@ static void free_bulk(struct bpf_mem_cache *c)
cnt = 0;
dec_active(c, flags);
if (llnode)
- enque_to_free(c, llnode);
+ enque_to_free(tgt, llnode);
} while (cnt > (c->high_watermark + c->low_watermark) / 2);
/* and drain free_llist_extra */
llist_for_each_safe(llnode, t, llist_del_all(&c->free_llist_extra))
- enque_to_free(c, llnode);
- do_call_rcu_ttrace(c);
+ enque_to_free(tgt, llnode);
+ do_call_rcu_ttrace(tgt);
}
static void bpf_mem_refill(struct irq_work *work)
@@ -436,6 +438,7 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu)
c->unit_size = unit_size;
c->objcg = objcg;
c->percpu_size = percpu_size;
+ c->tgt = c;
prefill_mem_cache(c, cpu);
}
ma->cache = pc;
@@ -458,6 +461,7 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu)
c = &cc->cache[i];
c->unit_size = sizes[i];
c->objcg = objcg;
+ c->tgt = c;
prefill_mem_cache(c, cpu);
}
}
@@ -476,7 +480,7 @@ static void drain_mem_cache(struct bpf_mem_cache *c)
* Except for waiting_for_gp_ttrace list, there are no concurrent operations
* on these lists, so it is safe to use __llist_del_all().
*/
- free_all(__llist_del_all(&c->free_by_rcu_ttrace), percpu);
+ free_all(llist_del_all(&c->free_by_rcu_ttrace), percpu);
free_all(llist_del_all(&c->waiting_for_gp_ttrace), percpu);
free_all(__llist_del_all(&c->free_llist), percpu);
free_all(__llist_del_all(&c->free_llist_extra), percpu);
@@ -601,8 +605,10 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c)
local_irq_save(flags);
if (local_inc_return(&c->active) == 1) {
llnode = __llist_del_first(&c->free_llist);
- if (llnode)
+ if (llnode) {
cnt = --c->free_cnt;
+ *(struct bpf_mem_cache **)llnode = c;
+ }
}
local_dec(&c->active);
local_irq_restore(flags);
@@ -626,6 +632,12 @@ static void notrace unit_free(struct bpf_mem_cache *c, void *ptr)
BUILD_BUG_ON(LLIST_NODE_SZ > 8);
+ /*
+ * Remember bpf_mem_cache that allocated this object.
+ * The hint is not accurate.
+ */
+ c->tgt = *(struct bpf_mem_cache **)llnode;
+
local_irq_save(flags);
if (local_inc_return(&c->active) == 1) {
__llist_add(llnode, &c->free_llist);
--
2.34.1
* [PATCH v4 bpf-next 09/14] bpf: Allow reuse from waiting_for_gp_ttrace list.
From: Alexei Starovoitov @ 2023-07-06 3:34 UTC (permalink / raw)
To: daniel, andrii, void, houtao, paulmck; +Cc: tj, rcu, netdev, bpf, kernel-team
From: Alexei Starovoitov <ast@kernel.org>
alloc_bulk() can reuse elements from free_by_rcu_ttrace.
Let it reuse from waiting_for_gp_ttrace as well to avoid unnecessary kmalloc().
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
kernel/bpf/memalloc.c | 16 ++++++++++------
1 file changed, 10 insertions(+), 6 deletions(-)
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 9986c6b7df4d..e5a87f6cf2cc 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -212,6 +212,15 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
if (i >= cnt)
return;
+ for (; i < cnt; i++) {
+ obj = llist_del_first(&c->waiting_for_gp_ttrace);
+ if (!obj)
+ break;
+ add_obj_to_free_list(c, obj);
+ }
+ if (i >= cnt)
+ return;
+
memcg = get_memcg(c);
old_memcg = set_active_memcg(memcg);
for (; i < cnt; i++) {
@@ -295,12 +304,7 @@ static void do_call_rcu_ttrace(struct bpf_mem_cache *c)
WARN_ON_ONCE(!llist_empty(&c->waiting_for_gp_ttrace));
llist_for_each_safe(llnode, t, llist_del_all(&c->free_by_rcu_ttrace))
- /* There is no concurrent __llist_add(waiting_for_gp_ttrace) access.
- * It doesn't race with llist_del_all either.
- * But there could be two concurrent llist_del_all(waiting_for_gp_ttrace):
- * from __free_rcu() and from drain_mem_cache().
- */
- __llist_add(llnode, &c->waiting_for_gp_ttrace);
+ llist_add(llnode, &c->waiting_for_gp_ttrace);
if (unlikely(READ_ONCE(c->draining))) {
__free_rcu(&c->rcu_ttrace);
--
2.34.1
* [PATCH v4 bpf-next 10/14] rcu: Export rcu_request_urgent_qs_task()
From: Alexei Starovoitov @ 2023-07-06 3:34 UTC (permalink / raw)
To: daniel, andrii, void, houtao, paulmck; +Cc: tj, rcu, netdev, bpf, kernel-team
From: "Paul E. McKenney" <paulmck@kernel.org>
If a CPU is executing a long series of non-sleeping system calls,
RCU grace periods can be delayed for on the order of a couple hundred
milliseconds. This is normally not a problem, but if each system call
does a call_rcu(), those callbacks can stack up. RCU will eventually
notice this callback storm, but use of rcu_request_urgent_qs_task()
allows the code invoking call_rcu() to give RCU a heads up.
This function is not for general use, not yet, anyway.
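For context, the intended calling pattern, condensed from how patch 12 later in this
series uses it in check_free_by_rcu() (only a sketch; 'c' is the per-cpu bpf_mem_cache
from that patch):

if (atomic_xchg(&c->call_rcu_in_progress, 1)) {
        /* A previous call_rcu() callback is still in flight and objects
         * keep piling up. Instead of queueing thousands more callbacks
         * to eventually trip rcutree.qhimark, give RCU a heads up so the
         * current grace period ends sooner.
         */
        rcu_request_urgent_qs_task(current);
        return;
}
/* ... move free_by_rcu to waiting_for_gp ... */
call_rcu_hurry(&c->rcu, __free_by_rcu);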
Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
include/linux/rcutiny.h | 2 ++
include/linux/rcutree.h | 1 +
kernel/rcu/rcu.h | 2 --
3 files changed, 3 insertions(+), 2 deletions(-)
diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
index 7f17acf29dda..7b949292908a 100644
--- a/include/linux/rcutiny.h
+++ b/include/linux/rcutiny.h
@@ -138,6 +138,8 @@ static inline int rcu_needs_cpu(void)
return 0;
}
+static inline void rcu_request_urgent_qs_task(struct task_struct *t) { }
+
/*
* Take advantage of the fact that there is only one CPU, which
* allows us to ignore virtualization-based context switches.
diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
index 56bccb5a8fde..126f6b418f6a 100644
--- a/include/linux/rcutree.h
+++ b/include/linux/rcutree.h
@@ -21,6 +21,7 @@ void rcu_softirq_qs(void);
void rcu_note_context_switch(bool preempt);
int rcu_needs_cpu(void);
void rcu_cpu_stall_reset(void);
+void rcu_request_urgent_qs_task(struct task_struct *t);
/*
* Note a virtualization-based context switch. This is simply a
diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index 98c1544cf572..f95cfb5bf2ee 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -493,7 +493,6 @@ static inline void rcu_expedite_gp(void) { }
static inline void rcu_unexpedite_gp(void) { }
static inline void rcu_async_hurry(void) { }
static inline void rcu_async_relax(void) { }
-static inline void rcu_request_urgent_qs_task(struct task_struct *t) { }
#else /* #ifdef CONFIG_TINY_RCU */
bool rcu_gp_is_normal(void); /* Internal RCU use. */
bool rcu_gp_is_expedited(void); /* Internal RCU use. */
@@ -508,7 +507,6 @@ void show_rcu_tasks_gp_kthreads(void);
#else /* #ifdef CONFIG_TASKS_RCU_GENERIC */
static inline void show_rcu_tasks_gp_kthreads(void) {}
#endif /* #else #ifdef CONFIG_TASKS_RCU_GENERIC */
-void rcu_request_urgent_qs_task(struct task_struct *t);
#endif /* #else #ifdef CONFIG_TINY_RCU */
#define RCU_SCHEDULER_INACTIVE 0
--
2.34.1
* [PATCH v4 bpf-next 11/14] selftests/bpf: Improve test coverage of bpf_mem_alloc.
From: Alexei Starovoitov @ 2023-07-06 3:34 UTC (permalink / raw)
To: daniel, andrii, void, houtao, paulmck; +Cc: tj, rcu, netdev, bpf, kernel-team
From: Alexei Starovoitov <ast@kernel.org>
bpf_obj_new() calls bpf_mem_alloc(), but doing alloc/free of 8 elements
does not trigger the watermark conditions in bpf_mem_alloc.
Increase to 200 elements to make sure alloc_bulk/free_bulk are exercised.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Hou Tao <houtao1@huawei.com>
---
tools/testing/selftests/bpf/progs/linked_list.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/bpf/progs/linked_list.c b/tools/testing/selftests/bpf/progs/linked_list.c
index 57440a554304..84d1777a9e6c 100644
--- a/tools/testing/selftests/bpf/progs/linked_list.c
+++ b/tools/testing/selftests/bpf/progs/linked_list.c
@@ -96,7 +96,7 @@ static __always_inline
int list_push_pop_multiple(struct bpf_spin_lock *lock, struct bpf_list_head *head, bool leave_in_map)
{
struct bpf_list_node *n;
- struct foo *f[8], *pf;
+ struct foo *f[200], *pf;
int i;
/* Loop following this check adds nodes 2-at-a-time in order to
--
2.34.1
* [PATCH v4 bpf-next 12/14] bpf: Introduce bpf_mem_free_rcu() similar to kfree_rcu().
From: Alexei Starovoitov @ 2023-07-06 3:34 UTC (permalink / raw)
To: daniel, andrii, void, houtao, paulmck; +Cc: tj, rcu, netdev, bpf, kernel-team
From: Alexei Starovoitov <ast@kernel.org>
Introduce bpf_mem_[cache_]free_rcu() similar to kfree_rcu().
Unlike bpf_mem_[cache_]free(), which links objects into the per-cpu free list for
immediate reuse, the _rcu() flavor waits for an RCU grace period and then moves
objects into the free_by_rcu_ttrace list, where they wait for an RCU
tasks trace grace period before being freed into slab.
The life cycle of objects:
  alloc: dequeue free_llist
  free: enqueue free_llist
  free_rcu: enqueue free_by_rcu -> waiting_for_gp
  free_llist above high watermark -> free_by_rcu_ttrace
  after RCU GP waiting_for_gp -> free_by_rcu_ttrace
  free_by_rcu_ttrace -> waiting_for_gp_ttrace -> slab
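From a caller's point of view the contract is kfree_rcu()-like: an object freed with
the _rcu() flavor is not reused before an RCU GP and not returned to slab before an
RCU tasks trace GP, so RCU readers may keep dereferencing it. A rough sketch
(lookup_obj(), unlink_obj(), consume() and my_obj_ma are illustrative names, not part
of this patch):

struct my_obj *obj;

/* reader side */
rcu_read_lock();
obj = lookup_obj(key);          /* may race with the removal below */
if (obj)
        consume(obj);           /* safe: no reuse before the RCU GP */
rcu_read_unlock();

/* updater side */
obj = unlink_obj(key);
if (obj)
        bpf_mem_free_rcu(&my_obj_ma, obj);  /* instead of bpf_mem_free() */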
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
include/linux/bpf_mem_alloc.h | 2 +
kernel/bpf/memalloc.c | 129 +++++++++++++++++++++++++++++++++-
2 files changed, 128 insertions(+), 3 deletions(-)
diff --git a/include/linux/bpf_mem_alloc.h b/include/linux/bpf_mem_alloc.h
index 3929be5743f4..d644bbb298af 100644
--- a/include/linux/bpf_mem_alloc.h
+++ b/include/linux/bpf_mem_alloc.h
@@ -27,10 +27,12 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma);
/* kmalloc/kfree equivalent: */
void *bpf_mem_alloc(struct bpf_mem_alloc *ma, size_t size);
void bpf_mem_free(struct bpf_mem_alloc *ma, void *ptr);
+void bpf_mem_free_rcu(struct bpf_mem_alloc *ma, void *ptr);
/* kmem_cache_alloc/free equivalent: */
void *bpf_mem_cache_alloc(struct bpf_mem_alloc *ma);
void bpf_mem_cache_free(struct bpf_mem_alloc *ma, void *ptr);
+void bpf_mem_cache_free_rcu(struct bpf_mem_alloc *ma, void *ptr);
void bpf_mem_cache_raw_free(void *ptr);
void *bpf_mem_cache_alloc_flags(struct bpf_mem_alloc *ma, gfp_t flags);
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index e5a87f6cf2cc..17ef2e9b278a 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -101,6 +101,15 @@ struct bpf_mem_cache {
bool draining;
struct bpf_mem_cache *tgt;
+ /* list of objects to be freed after RCU GP */
+ struct llist_head free_by_rcu;
+ struct llist_node *free_by_rcu_tail;
+ struct llist_head waiting_for_gp;
+ struct llist_node *waiting_for_gp_tail;
+ struct rcu_head rcu;
+ atomic_t call_rcu_in_progress;
+ struct llist_head free_llist_extra_rcu;
+
/* list of objects to be freed after RCU tasks trace GP */
struct llist_head free_by_rcu_ttrace;
struct llist_head waiting_for_gp_ttrace;
@@ -346,6 +355,69 @@ static void free_bulk(struct bpf_mem_cache *c)
do_call_rcu_ttrace(tgt);
}
+static void __free_by_rcu(struct rcu_head *head)
+{
+ struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu);
+ struct bpf_mem_cache *tgt = c->tgt;
+ struct llist_node *llnode;
+
+ llnode = llist_del_all(&c->waiting_for_gp);
+ if (!llnode)
+ goto out;
+
+ llist_add_batch(llnode, c->waiting_for_gp_tail, &tgt->free_by_rcu_ttrace);
+
+ /* Objects went through regular RCU GP. Send them to RCU tasks trace */
+ do_call_rcu_ttrace(tgt);
+out:
+ atomic_set(&c->call_rcu_in_progress, 0);
+}
+
+static void check_free_by_rcu(struct bpf_mem_cache *c)
+{
+ struct llist_node *llnode, *t;
+ unsigned long flags;
+
+ /* drain free_llist_extra_rcu */
+ if (unlikely(!llist_empty(&c->free_llist_extra_rcu))) {
+ inc_active(c, &flags);
+ llist_for_each_safe(llnode, t, llist_del_all(&c->free_llist_extra_rcu))
+ if (__llist_add(llnode, &c->free_by_rcu))
+ c->free_by_rcu_tail = llnode;
+ dec_active(c, flags);
+ }
+
+ if (llist_empty(&c->free_by_rcu))
+ return;
+
+ if (atomic_xchg(&c->call_rcu_in_progress, 1)) {
+ /*
+ * Instead of kmalloc-ing new rcu_head and triggering 10k
+ * call_rcu() to hit rcutree.qhimark and force RCU to notice
+ * the overload just ask RCU to hurry up. There could be many
+ * objects in free_by_rcu list.
+ * This hint reduces memory consumption for an artificial
+ * benchmark from 2 Gbyte to 150 Mbyte.
+ */
+ rcu_request_urgent_qs_task(current);
+ return;
+ }
+
+ WARN_ON_ONCE(!llist_empty(&c->waiting_for_gp));
+
+ inc_active(c, &flags);
+ WRITE_ONCE(c->waiting_for_gp.first, __llist_del_all(&c->free_by_rcu));
+ c->waiting_for_gp_tail = c->free_by_rcu_tail;
+ dec_active(c, flags);
+
+ if (unlikely(READ_ONCE(c->draining))) {
+ free_all(llist_del_all(&c->waiting_for_gp), !!c->percpu_size);
+ atomic_set(&c->call_rcu_in_progress, 0);
+ } else {
+ call_rcu_hurry(&c->rcu, __free_by_rcu);
+ }
+}
+
static void bpf_mem_refill(struct irq_work *work)
{
struct bpf_mem_cache *c = container_of(work, struct bpf_mem_cache, refill_work);
@@ -360,6 +432,8 @@ static void bpf_mem_refill(struct irq_work *work)
alloc_bulk(c, c->batch, NUMA_NO_NODE);
else if (cnt > c->high_watermark)
free_bulk(c);
+
+ check_free_by_rcu(c);
}
static void notrace irq_work_raise(struct bpf_mem_cache *c)
@@ -488,6 +562,9 @@ static void drain_mem_cache(struct bpf_mem_cache *c)
free_all(llist_del_all(&c->waiting_for_gp_ttrace), percpu);
free_all(__llist_del_all(&c->free_llist), percpu);
free_all(__llist_del_all(&c->free_llist_extra), percpu);
+ free_all(__llist_del_all(&c->free_by_rcu), percpu);
+ free_all(__llist_del_all(&c->free_llist_extra_rcu), percpu);
+ free_all(llist_del_all(&c->waiting_for_gp), percpu);
}
static void free_mem_alloc_no_barrier(struct bpf_mem_alloc *ma)
@@ -500,8 +577,8 @@ static void free_mem_alloc_no_barrier(struct bpf_mem_alloc *ma)
static void free_mem_alloc(struct bpf_mem_alloc *ma)
{
- /* waiting_for_gp_ttrace lists was drained, but __free_rcu might
- * still execute. Wait for it now before we freeing percpu caches.
+ /* waiting_for_gp[_ttrace] lists were drained, but RCU callbacks
+ * might still execute. Wait for them.
*
* rcu_barrier_tasks_trace() doesn't imply synchronize_rcu_tasks_trace(),
* but rcu_barrier_tasks_trace() and rcu_barrier() below are only used
@@ -510,7 +587,8 @@ static void free_mem_alloc(struct bpf_mem_alloc *ma)
* rcu_trace_implies_rcu_gp(), it will be OK to skip rcu_barrier() by
* using rcu_trace_implies_rcu_gp() as well.
*/
- rcu_barrier_tasks_trace();
+ rcu_barrier(); /* wait for __free_by_rcu */
+ rcu_barrier_tasks_trace(); /* wait for __free_rcu */
if (!rcu_trace_implies_rcu_gp())
rcu_barrier();
free_mem_alloc_no_barrier(ma);
@@ -563,6 +641,7 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
irq_work_sync(&c->refill_work);
drain_mem_cache(c);
rcu_in_progress += atomic_read(&c->call_rcu_ttrace_in_progress);
+ rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
}
/* objcg is the same across cpus */
if (c->objcg)
@@ -579,6 +658,7 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
irq_work_sync(&c->refill_work);
drain_mem_cache(c);
rcu_in_progress += atomic_read(&c->call_rcu_ttrace_in_progress);
+ rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
}
}
if (c->objcg)
@@ -663,6 +743,27 @@ static void notrace unit_free(struct bpf_mem_cache *c, void *ptr)
irq_work_raise(c);
}
+static void notrace unit_free_rcu(struct bpf_mem_cache *c, void *ptr)
+{
+ struct llist_node *llnode = ptr - LLIST_NODE_SZ;
+ unsigned long flags;
+
+ c->tgt = *(struct bpf_mem_cache **)llnode;
+
+ local_irq_save(flags);
+ if (local_inc_return(&c->active) == 1) {
+ if (__llist_add(llnode, &c->free_by_rcu))
+ c->free_by_rcu_tail = llnode;
+ } else {
+ llist_add(llnode, &c->free_llist_extra_rcu);
+ }
+ local_dec(&c->active);
+ local_irq_restore(flags);
+
+ if (!atomic_read(&c->call_rcu_in_progress))
+ irq_work_raise(c);
+}
+
/* Called from BPF program or from sys_bpf syscall.
* In both cases migration is disabled.
*/
@@ -696,6 +797,20 @@ void notrace bpf_mem_free(struct bpf_mem_alloc *ma, void *ptr)
unit_free(this_cpu_ptr(ma->caches)->cache + idx, ptr);
}
+void notrace bpf_mem_free_rcu(struct bpf_mem_alloc *ma, void *ptr)
+{
+ int idx;
+
+ if (!ptr)
+ return;
+
+ idx = bpf_mem_cache_idx(ksize(ptr - LLIST_NODE_SZ));
+ if (idx < 0)
+ return;
+
+ unit_free_rcu(this_cpu_ptr(ma->caches)->cache + idx, ptr);
+}
+
void notrace *bpf_mem_cache_alloc(struct bpf_mem_alloc *ma)
{
void *ret;
@@ -712,6 +827,14 @@ void notrace bpf_mem_cache_free(struct bpf_mem_alloc *ma, void *ptr)
unit_free(this_cpu_ptr(ma->cache), ptr);
}
+void notrace bpf_mem_cache_free_rcu(struct bpf_mem_alloc *ma, void *ptr)
+{
+ if (!ptr)
+ return;
+
+ unit_free_rcu(this_cpu_ptr(ma->cache), ptr);
+}
+
/* Directly does a kfree() without putting 'ptr' back to the free_llist
* for reuse and without waiting for a rcu_tasks_trace gp.
* The caller must first go through the rcu_tasks_trace gp for 'ptr'
--
2.34.1
* [PATCH v4 bpf-next 13/14] bpf: Convert bpf_cpumask to bpf_mem_cache_free_rcu.
From: Alexei Starovoitov @ 2023-07-06 3:34 UTC (permalink / raw)
To: daniel, andrii, void, houtao, paulmck; +Cc: tj, rcu, netdev, bpf, kernel-team
From: Alexei Starovoitov <ast@kernel.org>
Convert bpf_cpumask to bpf_mem_cache_free_rcu.
Note that migrate_disable() in bpf_cpumask_release() is still necessary, since
bpf_cpumask_release() is a dtor. bpf_obj_free_fields() can be converted to do
migrate_disable() there in a follow up.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: David Vernet <void@manifault.com>
---
kernel/bpf/cpumask.c | 20 ++++++--------------
1 file changed, 6 insertions(+), 14 deletions(-)
diff --git a/kernel/bpf/cpumask.c b/kernel/bpf/cpumask.c
index 938a60ff4295..6983af8e093c 100644
--- a/kernel/bpf/cpumask.c
+++ b/kernel/bpf/cpumask.c
@@ -9,7 +9,6 @@
/**
* struct bpf_cpumask - refcounted BPF cpumask wrapper structure
* @cpumask: The actual cpumask embedded in the struct.
- * @rcu: The RCU head used to free the cpumask with RCU safety.
* @usage: Object reference counter. When the refcount goes to 0, the
* memory is released back to the BPF allocator, which provides
* RCU safety.
@@ -25,7 +24,6 @@
*/
struct bpf_cpumask {
cpumask_t cpumask;
- struct rcu_head rcu;
refcount_t usage;
};
@@ -82,16 +80,6 @@ __bpf_kfunc struct bpf_cpumask *bpf_cpumask_acquire(struct bpf_cpumask *cpumask)
return cpumask;
}
-static void cpumask_free_cb(struct rcu_head *head)
-{
- struct bpf_cpumask *cpumask;
-
- cpumask = container_of(head, struct bpf_cpumask, rcu);
- migrate_disable();
- bpf_mem_cache_free(&bpf_cpumask_ma, cpumask);
- migrate_enable();
-}
-
/**
* bpf_cpumask_release() - Release a previously acquired BPF cpumask.
* @cpumask: The cpumask being released.
@@ -102,8 +90,12 @@ static void cpumask_free_cb(struct rcu_head *head)
*/
__bpf_kfunc void bpf_cpumask_release(struct bpf_cpumask *cpumask)
{
- if (refcount_dec_and_test(&cpumask->usage))
- call_rcu(&cpumask->rcu, cpumask_free_cb);
+ if (!refcount_dec_and_test(&cpumask->usage))
+ return;
+
+ migrate_disable();
+ bpf_mem_cache_free_rcu(&bpf_cpumask_ma, cpumask);
+ migrate_enable();
}
/**
--
2.34.1
* [PATCH v4 bpf-next 14/14] bpf: Add object leak check.
From: Alexei Starovoitov @ 2023-07-06 3:34 UTC (permalink / raw)
To: daniel, andrii, void, houtao, paulmck; +Cc: tj, rcu, netdev, bpf, kernel-team
From: Hou Tao <houtao1@huawei.com>
The object leak check is cheap. Do it unconditionally to spot difficult races
in bpf_mem_alloc.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
kernel/bpf/memalloc.c | 35 +++++++++++++++++++++++++++++++++++
1 file changed, 35 insertions(+)
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 17ef2e9b278a..51d6389e5152 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -567,8 +567,43 @@ static void drain_mem_cache(struct bpf_mem_cache *c)
free_all(llist_del_all(&c->waiting_for_gp), percpu);
}
+static void check_mem_cache(struct bpf_mem_cache *c)
+{
+ WARN_ON_ONCE(!llist_empty(&c->free_by_rcu_ttrace));
+ WARN_ON_ONCE(!llist_empty(&c->waiting_for_gp_ttrace));
+ WARN_ON_ONCE(!llist_empty(&c->free_llist));
+ WARN_ON_ONCE(!llist_empty(&c->free_llist_extra));
+ WARN_ON_ONCE(!llist_empty(&c->free_by_rcu));
+ WARN_ON_ONCE(!llist_empty(&c->free_llist_extra_rcu));
+ WARN_ON_ONCE(!llist_empty(&c->waiting_for_gp));
+}
+
+static void check_leaked_objs(struct bpf_mem_alloc *ma)
+{
+ struct bpf_mem_caches *cc;
+ struct bpf_mem_cache *c;
+ int cpu, i;
+
+ if (ma->cache) {
+ for_each_possible_cpu(cpu) {
+ c = per_cpu_ptr(ma->cache, cpu);
+ check_mem_cache(c);
+ }
+ }
+ if (ma->caches) {
+ for_each_possible_cpu(cpu) {
+ cc = per_cpu_ptr(ma->caches, cpu);
+ for (i = 0; i < NUM_CACHES; i++) {
+ c = &cc->cache[i];
+ check_mem_cache(c);
+ }
+ }
+ }
+}
+
static void free_mem_alloc_no_barrier(struct bpf_mem_alloc *ma)
{
+ check_leaked_objs(ma);
free_percpu(ma->cache);
free_percpu(ma->caches);
ma->cache = NULL;
--
2.34.1
* Re: [PATCH v4 bpf-next 07/14] bpf: Change bpf_mem_cache draining process.
From: Hou Tao @ 2023-07-06 12:55 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: tj, rcu, netdev, bpf, kernel-team, daniel, andrii, void, paulmck
On 7/6/2023 11:34 AM, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
>
> The next patch will introduce cross-cpu llist access and existing
> irq_work_sync() + drain_mem_cache() + rcu_barrier_tasks_trace() mechanism will
> not be enough, since irq_work_sync() + drain_mem_cache() on cpu A won't
> guarantee that llist on cpu A are empty. The free_bulk() on cpu B might add
> objects back to llist of cpu A. Add 'bool draining' flag.
> The modified sequence looks like:
> for_each_cpu:
> WRITE_ONCE(c->draining, true); // do_call_rcu_ttrace() won't be doing call_rcu() any more
> irq_work_sync(); // wait for irq_work callback (free_bulk) to finish
> drain_mem_cache(); // free all objects
> rcu_barrier_tasks_trace(); // wait for RCU callbacks to execute
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Hou Tao <houtao1@huawei.com>
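The destroy-side ordering quoted above, written out as a minimal kernel-style sketch. This is not the exact mainline code: names are taken from kernel/bpf/memalloc.c in this series and should be treated as approximate, the percpu 'caches' case and error handling are omitted, and placing rcu_barrier_tasks_trace() after the loop is a simplification.
static void drain_all_caches_sketch(struct bpf_mem_alloc *ma)
{
	struct bpf_mem_cache *c;
	int cpu;

	for_each_possible_cpu(cpu) {
		c = per_cpu_ptr(ma->cache, cpu);
		/* stop do_call_rcu_ttrace() from queuing new call_rcu() work */
		WRITE_ONCE(c->draining, true);
		/* wait for in-flight irq_work callbacks (alloc_bulk()/free_bulk()) */
		irq_work_sync(&c->refill_work);
		/* free every object still sitting on the per-cpu llists */
		drain_mem_cache(c);
	}
	/* wait for RCU tasks trace callbacks queued before draining was set */
	rcu_barrier_tasks_trace();
}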
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v4 bpf-next 12/14] bpf: Introduce bpf_mem_free_rcu() similar to kfree_rcu().
2023-07-06 3:34 ` [PATCH v4 bpf-next 12/14] bpf: Introduce bpf_mem_free_rcu() similar to kfree_rcu() Alexei Starovoitov
@ 2023-07-07 1:45 ` Hou Tao
2023-07-07 2:10 ` Alexei Starovoitov
0 siblings, 1 reply; 31+ messages in thread
From: Hou Tao @ 2023-07-07 1:45 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: tj, rcu, netdev, bpf, kernel-team, daniel, andrii, void, paulmck
On 7/6/2023 11:34 AM, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
>
> Introduce bpf_mem_[cache_]free_rcu() similar to kfree_rcu().
> Unlike bpf_mem_[cache_]free() that links objects for immediate reuse into
> per-cpu free list the _rcu() flavor waits for RCU grace period and then moves
> objects into free_by_rcu_ttrace list where they are waiting for RCU
> task trace grace period to be freed into slab.
>
> The life cycle of objects:
> alloc: dequeue free_llist
> free: enqeueu free_llist
> free_rcu: enqueue free_by_rcu -> waiting_for_gp
> free_llist above high watermark -> free_by_rcu_ttrace
> after RCU GP waiting_for_gp -> free_by_rcu_ttrace
> free_by_rcu_ttrace -> waiting_for_gp_ttrace -> slab
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Hou Tao <houtao1@huawei.com>
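For readers following the series, a hedged caller-side sketch of how the two free flavors differ. The prototypes are the ones declared in include/linux/bpf_mem_alloc.h; struct my_obj and the example_* functions are made up for illustration.
#include <linux/bpf_mem_alloc.h>
#include <linux/preempt.h>

struct my_obj { int payload; };		/* hypothetical element type */

static struct bpf_mem_alloc ma;

static int example_init(void)
{
	/* one fixed-size cache; 'false' means plain (not per-cpu) memory */
	return bpf_mem_alloc_init(&ma, sizeof(struct my_obj), false);
}

static void example_alloc_and_free(void)
{
	struct my_obj *obj;

	migrate_disable();	/* mirrors the bpf_cpumask conversion earlier in this thread */
	obj = bpf_mem_cache_alloc(&ma);
	if (obj) {
		/* bpf_mem_cache_free(): obj may be handed out again by the very
		 * next allocation on this cpu.
		 * bpf_mem_cache_free_rcu(): obj is reused only after an RCU GP
		 * and returned to slab only after a further RCU tasks trace GP.
		 */
		bpf_mem_cache_free_rcu(&ma, obj);
	}
	migrate_enable();
}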
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v4 bpf-next 09/14] bpf: Allow reuse from waiting_for_gp_ttrace list.
2023-07-06 3:34 ` [PATCH v4 bpf-next 09/14] bpf: Allow reuse from waiting_for_gp_ttrace list Alexei Starovoitov
@ 2023-07-07 2:07 ` Hou Tao
2023-07-07 2:12 ` Alexei Starovoitov
0 siblings, 1 reply; 31+ messages in thread
From: Hou Tao @ 2023-07-07 2:07 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: tj, rcu, netdev, bpf, kernel-team, daniel, andrii, void, paulmck
Hi,
On 7/6/2023 11:34 AM, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
>
> alloc_bulk() can reuse elements from free_by_rcu_ttrace.
> Let it reuse from waiting_for_gp_ttrace as well to avoid unnecessary kmalloc().
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
> kernel/bpf/memalloc.c | 16 ++++++++++------
> 1 file changed, 10 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> index 9986c6b7df4d..e5a87f6cf2cc 100644
> --- a/kernel/bpf/memalloc.c
> +++ b/kernel/bpf/memalloc.c
> @@ -212,6 +212,15 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
> if (i >= cnt)
> return;
>
> + for (; i < cnt; i++) {
> + obj = llist_del_first(&c->waiting_for_gp_ttrace);
> + if (!obj)
> + break;
> + add_obj_to_free_list(c, obj);
> + }
> + if (i >= cnt)
> + return;
I still think using llist_del_first() here is not safe as reported in
[1]. Not sure whether or not invoking enque_to_free() firstly for
free_llist_extra will close the race completely. Will check later.
[1]:
https://lore.kernel.org/rcu/957dd5cd-0855-1197-7045-4cb1590bd753@huaweicloud.com/
> +
> memcg = get_memcg(c);
> old_memcg = set_active_memcg(memcg);
> for (; i < cnt; i++) {
> @@ -295,12 +304,7 @@ static void do_call_rcu_ttrace(struct bpf_mem_cache *c)
>
> WARN_ON_ONCE(!llist_empty(&c->waiting_for_gp_ttrace));
> llist_for_each_safe(llnode, t, llist_del_all(&c->free_by_rcu_ttrace))
> - /* There is no concurrent __llist_add(waiting_for_gp_ttrace) access.
> - * It doesn't race with llist_del_all either.
> - * But there could be two concurrent llist_del_all(waiting_for_gp_ttrace):
> - * from __free_rcu() and from drain_mem_cache().
> - */
> - __llist_add(llnode, &c->waiting_for_gp_ttrace);
> + llist_add(llnode, &c->waiting_for_gp_ttrace);
>
> if (unlikely(READ_ONCE(c->draining))) {
> __free_rcu(&c->rcu_ttrace);
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v4 bpf-next 12/14] bpf: Introduce bpf_mem_free_rcu() similar to kfree_rcu().
2023-07-07 1:45 ` Hou Tao
@ 2023-07-07 2:10 ` Alexei Starovoitov
2023-07-07 4:05 ` Hou Tao
0 siblings, 1 reply; 31+ messages in thread
From: Alexei Starovoitov @ 2023-07-07 2:10 UTC (permalink / raw)
To: Hou Tao
Cc: Tejun Heo, rcu, Network Development, bpf, Kernel Team,
Daniel Borkmann, Andrii Nakryiko, David Vernet, Paul E. McKenney
[-- Attachment #1: Type: text/plain, Size: 1388 bytes --]
On Thu, Jul 6, 2023 at 6:45 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
>
>
> On 7/6/2023 11:34 AM, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > Introduce bpf_mem_[cache_]free_rcu() similar to kfree_rcu().
> > Unlike bpf_mem_[cache_]free() that links objects for immediate reuse into
> > per-cpu free list the _rcu() flavor waits for RCU grace period and then moves
> > objects into free_by_rcu_ttrace list where they are waiting for RCU
> > task trace grace period to be freed into slab.
> >
> > The life cycle of objects:
> > alloc: dequeue free_llist
> > free: enqeueu free_llist
> > free_rcu: enqueue free_by_rcu -> waiting_for_gp
> > free_llist above high watermark -> free_by_rcu_ttrace
> > after RCU GP waiting_for_gp -> free_by_rcu_ttrace
> > free_by_rcu_ttrace -> waiting_for_gp_ttrace -> slab
> >
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>
> Acked-by: Hou Tao <houtao1@huawei.com>
Thank you very much for code reviews and feedback.
btw I still believe that ABA is a non-issue and prefer to keep the code as-is,
but for the sake of the experiment I've converted it to a spin_lock
(see the attached patch, which I think uglifies the code),
and performance across bench htab-mem and map_perf_test
seems to be about the same.
That was a bit surprising to me.
Could you please benchmark it on your system?
[-- Attachment #2: 0001-bpf-Address-hypothetical-ABA-issue-with-llist_del_fi.patch --]
[-- Type: application/octet-stream, Size: 6679 bytes --]
From 0d6c06d65da81cb070238a3a89015c1927085fff Mon Sep 17 00:00:00 2001
From: Alexei Starovoitov <ast@kernel.org>
Date: Thu, 6 Jul 2023 19:05:24 -0700
Subject: [PATCH bpf-next] bpf: Address hypothetical ABA issue with
llist_del_first
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
kernel/bpf/hashtab.c | 2 +-
kernel/bpf/memalloc.c | 49 +++++++++++++++++++++++++++++++++----------
2 files changed, 39 insertions(+), 12 deletions(-)
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 56d3da7d0bc6..aebb97b1c575 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -876,7 +876,7 @@ static void htab_elem_free(struct bpf_htab *htab, struct htab_elem *l)
check_and_free_fields(htab, l);
if (htab->map.map_type == BPF_MAP_TYPE_PERCPU_HASH)
bpf_mem_cache_free(&htab->pcpu_ma, l->ptr_to_pptr);
- bpf_mem_cache_free(&htab->ma, l);
+ bpf_mem_cache_free_rcu(&htab->ma, l);
}
static void htab_put_fd_value(struct bpf_htab *htab, struct htab_elem *l)
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 51d6389e5152..6ef331e639ee 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -115,6 +115,7 @@ struct bpf_mem_cache {
struct llist_head waiting_for_gp_ttrace;
struct rcu_head rcu_ttrace;
atomic_t call_rcu_ttrace_in_progress;
+ raw_spinlock_t lock;
};
struct bpf_mem_caches {
@@ -204,29 +205,34 @@ static void add_obj_to_free_list(struct bpf_mem_cache *c, void *obj)
static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
{
struct mem_cgroup *memcg = NULL, *old_memcg;
+ unsigned long flags;
void *obj;
int i;
+ raw_spin_lock_irqsave(&c->lock, flags);
for (i = 0; i < cnt; i++) {
/*
* For every 'c' llist_del_first(&c->free_by_rcu_ttrace); is
* done only by one CPU == current CPU. Other CPUs might
* llist_add() and llist_del_all() in parallel.
*/
- obj = llist_del_first(&c->free_by_rcu_ttrace);
+ obj = __llist_del_first(&c->free_by_rcu_ttrace);
if (!obj)
break;
add_obj_to_free_list(c, obj);
}
- if (i >= cnt)
+ if (i >= cnt) {
+ raw_spin_unlock_irqrestore(&c->lock, flags);
return;
+ }
for (; i < cnt; i++) {
- obj = llist_del_first(&c->waiting_for_gp_ttrace);
+ obj = __llist_del_first(&c->waiting_for_gp_ttrace);
if (!obj)
break;
add_obj_to_free_list(c, obj);
}
+ raw_spin_unlock_irqrestore(&c->lock, flags);
if (i >= cnt)
return;
@@ -273,8 +279,13 @@ static int free_all(struct llist_node *llnode, bool percpu)
static void __free_rcu(struct rcu_head *head)
{
struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu_ttrace);
+ struct llist_node *llnode;
+ unsigned long flags;
- free_all(llist_del_all(&c->waiting_for_gp_ttrace), !!c->percpu_size);
+ raw_spin_lock_irqsave(&c->lock, flags);
+ llnode = __llist_del_all(&c->waiting_for_gp_ttrace);
+ raw_spin_unlock_irqrestore(&c->lock, flags);
+ free_all(llnode, !!c->percpu_size);
atomic_set(&c->call_rcu_ttrace_in_progress, 0);
}
@@ -292,28 +303,36 @@ static void __free_rcu_tasks_trace(struct rcu_head *head)
static void enque_to_free(struct bpf_mem_cache *c, void *obj)
{
struct llist_node *llnode = obj;
+ unsigned long flags;
/* bpf_mem_cache is a per-cpu object. Freeing happens in irq_work.
* Nothing races to add to free_by_rcu_ttrace list.
*/
- llist_add(llnode, &c->free_by_rcu_ttrace);
+ raw_spin_lock_irqsave(&c->lock, flags);
+ __llist_add(llnode, &c->free_by_rcu_ttrace);
+ raw_spin_unlock_irqrestore(&c->lock, flags);
}
static void do_call_rcu_ttrace(struct bpf_mem_cache *c)
{
struct llist_node *llnode, *t;
+ unsigned long flags;
if (atomic_xchg(&c->call_rcu_ttrace_in_progress, 1)) {
if (unlikely(READ_ONCE(c->draining))) {
- llnode = llist_del_all(&c->free_by_rcu_ttrace);
+ raw_spin_lock_irqsave(&c->lock, flags);
+ llnode = __llist_del_all(&c->free_by_rcu_ttrace);
+ raw_spin_unlock_irqrestore(&c->lock, flags);
free_all(llnode, !!c->percpu_size);
}
return;
}
+ raw_spin_lock_irqsave(&c->lock, flags);
WARN_ON_ONCE(!llist_empty(&c->waiting_for_gp_ttrace));
- llist_for_each_safe(llnode, t, llist_del_all(&c->free_by_rcu_ttrace))
- llist_add(llnode, &c->waiting_for_gp_ttrace);
+ llist_for_each_safe(llnode, t, __llist_del_all(&c->free_by_rcu_ttrace))
+ __llist_add(llnode, &c->waiting_for_gp_ttrace);
+ raw_spin_unlock_irqrestore(&c->lock, flags);
if (unlikely(READ_ONCE(c->draining))) {
__free_rcu(&c->rcu_ttrace);
@@ -360,12 +379,15 @@ static void __free_by_rcu(struct rcu_head *head)
struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu);
struct bpf_mem_cache *tgt = c->tgt;
struct llist_node *llnode;
+ unsigned long flags;
llnode = llist_del_all(&c->waiting_for_gp);
if (!llnode)
goto out;
- llist_add_batch(llnode, c->waiting_for_gp_tail, &tgt->free_by_rcu_ttrace);
+ raw_spin_lock_irqsave(&tgt->lock, flags);
+ __llist_add_batch(llnode, c->waiting_for_gp_tail, &tgt->free_by_rcu_ttrace);
+ raw_spin_unlock_irqrestore(&tgt->lock, flags);
/* Objects went through regular RCU GP. Send them to RCU tasks trace */
do_call_rcu_ttrace(tgt);
@@ -517,6 +539,7 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu)
c->objcg = objcg;
c->percpu_size = percpu_size;
c->tgt = c;
+ raw_spin_lock_init(&c->lock);
prefill_mem_cache(c, cpu);
}
ma->cache = pc;
@@ -540,6 +563,7 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu)
c->unit_size = sizes[i];
c->objcg = objcg;
c->tgt = c;
+ raw_spin_lock_init(&c->lock);
prefill_mem_cache(c, cpu);
}
}
@@ -550,6 +574,7 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu)
static void drain_mem_cache(struct bpf_mem_cache *c)
{
bool percpu = !!c->percpu_size;
+ unsigned long flags;
/* No progs are using this bpf_mem_cache, but htab_map_free() called
* bpf_mem_cache_free() for all remaining elements and they can be in
@@ -558,8 +583,10 @@ static void drain_mem_cache(struct bpf_mem_cache *c)
* Except for waiting_for_gp_ttrace list, there are no concurrent operations
* on these lists, so it is safe to use __llist_del_all().
*/
- free_all(llist_del_all(&c->free_by_rcu_ttrace), percpu);
- free_all(llist_del_all(&c->waiting_for_gp_ttrace), percpu);
+ raw_spin_lock_irqsave(&c->lock, flags);
+ free_all(__llist_del_all(&c->free_by_rcu_ttrace), percpu);
+ free_all(__llist_del_all(&c->waiting_for_gp_ttrace), percpu);
+ raw_spin_unlock_irqrestore(&c->lock, flags);
free_all(__llist_del_all(&c->free_llist), percpu);
free_all(__llist_del_all(&c->free_llist_extra), percpu);
free_all(__llist_del_all(&c->free_by_rcu), percpu);
--
2.34.1
^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [PATCH v4 bpf-next 09/14] bpf: Allow reuse from waiting_for_gp_ttrace list.
2023-07-07 2:07 ` Hou Tao
@ 2023-07-07 2:12 ` Alexei Starovoitov
2023-07-07 3:38 ` Hou Tao
0 siblings, 1 reply; 31+ messages in thread
From: Alexei Starovoitov @ 2023-07-07 2:12 UTC (permalink / raw)
To: Hou Tao
Cc: Tejun Heo, rcu, Network Development, bpf, Kernel Team,
Daniel Borkmann, Andrii Nakryiko, David Vernet, Paul E. McKenney
On Thu, Jul 6, 2023 at 7:07 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 7/6/2023 11:34 AM, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > alloc_bulk() can reuse elements from free_by_rcu_ttrace.
> > Let it reuse from waiting_for_gp_ttrace as well to avoid unnecessary kmalloc().
> >
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > ---
> > kernel/bpf/memalloc.c | 16 ++++++++++------
> > 1 file changed, 10 insertions(+), 6 deletions(-)
> >
> > diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> > index 9986c6b7df4d..e5a87f6cf2cc 100644
> > --- a/kernel/bpf/memalloc.c
> > +++ b/kernel/bpf/memalloc.c
> > @@ -212,6 +212,15 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
> > if (i >= cnt)
> > return;
> >
> > + for (; i < cnt; i++) {
> > + obj = llist_del_first(&c->waiting_for_gp_ttrace);
> > + if (!obj)
> > + break;
> > + add_obj_to_free_list(c, obj);
> > + }
> > + if (i >= cnt)
> > + return;
>
> I still think using llist_del_first() here is not safe as reported in
> [1]. Not sure whether or not invoking enque_to_free() firstly for
> free_llist_extra will close the race completely. Will check later.
lol. see my reply a second ago in the other thread.
and it's not just waiting_for_gp_ttrace. free_by_rcu_ttrace is similar.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v4 bpf-next 09/14] bpf: Allow reuse from waiting_for_gp_ttrace list.
2023-07-07 2:12 ` Alexei Starovoitov
@ 2023-07-07 3:38 ` Hou Tao
2023-07-07 4:16 ` Alexei Starovoitov
0 siblings, 1 reply; 31+ messages in thread
From: Hou Tao @ 2023-07-07 3:38 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Tejun Heo, rcu, Network Development, bpf, Kernel Team,
Daniel Borkmann, Andrii Nakryiko, David Vernet, Paul E. McKenney
Hi,
On 7/7/2023 10:12 AM, Alexei Starovoitov wrote:
> On Thu, Jul 6, 2023 at 7:07 PM Hou Tao <houtao@huaweicloud.com> wrote:
>> Hi,
>>
>> On 7/6/2023 11:34 AM, Alexei Starovoitov wrote:
>>> From: Alexei Starovoitov <ast@kernel.org>
>>>
>>> alloc_bulk() can reuse elements from free_by_rcu_ttrace.
>>> Let it reuse from waiting_for_gp_ttrace as well to avoid unnecessary kmalloc().
>>>
>>> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>>> ---
>>> kernel/bpf/memalloc.c | 16 ++++++++++------
>>> 1 file changed, 10 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
>>> index 9986c6b7df4d..e5a87f6cf2cc 100644
>>> --- a/kernel/bpf/memalloc.c
>>> +++ b/kernel/bpf/memalloc.c
>>> @@ -212,6 +212,15 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
>>> if (i >= cnt)
>>> return;
>>>
>>> + for (; i < cnt; i++) {
>>> + obj = llist_del_first(&c->waiting_for_gp_ttrace);
>>> + if (!obj)
>>> + break;
>>> + add_obj_to_free_list(c, obj);
>>> + }
>>> + if (i >= cnt)
>>> + return;
>> I still think using llist_del_first() here is not safe as reported in
>> [1]. Not sure whether or not invoking enque_to_free() firstly for
>> free_llist_extra will close the race completely. Will check later.
> lol. see my reply a second ago in the other thread.
>
> and it's not just waiting_for_gp_ttrace. free_by_rcu_ttrace is similar.
I think free_by_rcu_ttrace is different, because the reuse is only
possible after one tasks trace RCU grace period as shown below, and the
concurrent llist_del_first() must have been completed when the head is
reused and re-added into free_by_rcu_ttrace again.
// c0->free_by_rcu_ttrace
A -> B -> C -> nil
P1:
alloc_bulk()
llist_del_first(&c->free_by_rcu_ttrace)
entry = A
next = B
P2:
do_call_rcu_ttrace()
// c->free_by_rcu_ttrace->first = NULL
llist_del_all(&c->free_by_rcu_ttrace)
move to c->waiting_for_gp_ttrace
P1:
llist_del_first()
return NULL
// A is only reusable after one task trace RCU grace
// llist_del_first() must have been completed
__free_rcu_tasks_trace
free_all(llist_del_all(&c->waiting_for_gp_ttrace))
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v4 bpf-next 12/14] bpf: Introduce bpf_mem_free_rcu() similar to kfree_rcu().
2023-07-07 2:10 ` Alexei Starovoitov
@ 2023-07-07 4:05 ` Hou Tao
2023-07-08 7:00 ` Hou Tao
0 siblings, 1 reply; 31+ messages in thread
From: Hou Tao @ 2023-07-07 4:05 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Tejun Heo, rcu, Network Development, bpf, Kernel Team,
Daniel Borkmann, Andrii Nakryiko, David Vernet, Paul E. McKenney
Hi,
On 7/7/2023 10:10 AM, Alexei Starovoitov wrote:
> On Thu, Jul 6, 2023 at 6:45 PM Hou Tao <houtao@huaweicloud.com> wrote:
>>
>>
>> On 7/6/2023 11:34 AM, Alexei Starovoitov wrote:
>>> From: Alexei Starovoitov <ast@kernel.org>
>>>
>>> Introduce bpf_mem_[cache_]free_rcu() similar to kfree_rcu().
>>> Unlike bpf_mem_[cache_]free() that links objects for immediate reuse into
>>> per-cpu free list the _rcu() flavor waits for RCU grace period and then moves
>>> objects into free_by_rcu_ttrace list where they are waiting for RCU
>>> task trace grace period to be freed into slab.
>>>
>>> The life cycle of objects:
>>> alloc: dequeue free_llist
>>> free: enqeueu free_llist
>>> free_rcu: enqueue free_by_rcu -> waiting_for_gp
>>> free_llist above high watermark -> free_by_rcu_ttrace
>>> after RCU GP waiting_for_gp -> free_by_rcu_ttrace
>>> free_by_rcu_ttrace -> waiting_for_gp_ttrace -> slab
>>>
>>> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>> Acked-by: Hou Tao <houtao1@huawei.com>
> Thank you very much for code reviews and feedback.
You are welcome. I also learn a lot from this great patch set.
>
> btw I still believe that ABA is a non issue and prefer to keep the code as-is,
> but for the sake of experiment I've converted it to spin_lock
> (see attached patch which I think uglifies the code)
> and performance across bench htab-mem and map_perf_test
> seems to be about the same.
> Which was a bit surprising to me.
> Could you please benchmark it on your system?
Will do that later. It seems that if there is no cross-CPU allocation and
free, the only possible contention is between __free_rcu() on CPU x and
alloc_bulk()/free_bulk() on a different CPU.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v4 bpf-next 09/14] bpf: Allow reuse from waiting_for_gp_ttrace list.
2023-07-07 3:38 ` Hou Tao
@ 2023-07-07 4:16 ` Alexei Starovoitov
2023-07-07 4:37 ` Hou Tao
0 siblings, 1 reply; 31+ messages in thread
From: Alexei Starovoitov @ 2023-07-07 4:16 UTC (permalink / raw)
To: Hou Tao
Cc: Tejun Heo, rcu, Network Development, bpf, Kernel Team,
Daniel Borkmann, Andrii Nakryiko, David Vernet, Paul E. McKenney
On Thu, Jul 6, 2023 at 8:39 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 7/7/2023 10:12 AM, Alexei Starovoitov wrote:
> > On Thu, Jul 6, 2023 at 7:07 PM Hou Tao <houtao@huaweicloud.com> wrote:
> >> Hi,
> >>
> >> On 7/6/2023 11:34 AM, Alexei Starovoitov wrote:
> >>> From: Alexei Starovoitov <ast@kernel.org>
> >>>
> >>> alloc_bulk() can reuse elements from free_by_rcu_ttrace.
> >>> Let it reuse from waiting_for_gp_ttrace as well to avoid unnecessary kmalloc().
> >>>
> >>> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> >>> ---
> >>> kernel/bpf/memalloc.c | 16 ++++++++++------
> >>> 1 file changed, 10 insertions(+), 6 deletions(-)
> >>>
> >>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> >>> index 9986c6b7df4d..e5a87f6cf2cc 100644
> >>> --- a/kernel/bpf/memalloc.c
> >>> +++ b/kernel/bpf/memalloc.c
> >>> @@ -212,6 +212,15 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
> >>> if (i >= cnt)
> >>> return;
> >>>
> >>> + for (; i < cnt; i++) {
> >>> + obj = llist_del_first(&c->waiting_for_gp_ttrace);
> >>> + if (!obj)
> >>> + break;
> >>> + add_obj_to_free_list(c, obj);
> >>> + }
> >>> + if (i >= cnt)
> >>> + return;
> >> I still think using llist_del_first() here is not safe as reported in
> >> [1]. Not sure whether or not invoking enque_to_free() firstly for
> >> free_llist_extra will close the race completely. Will check later.
> > lol. see my reply a second ago in the other thread.
> >
> > and it's not just waiting_for_gp_ttrace. free_by_rcu_ttrace is similar.
>
> I think free_by_rcu_ttrace is different, because the reuse is only
> possible after one tasks trace RCU grace period as shown below, and the
> concurrent llist_del_first() must have been completed when the head is
> reused and re-added into free_by_rcu_ttrace again.
>
> // c0->free_by_rcu_ttrace
> A -> B -> C -> nil
>
> P1:
> alloc_bulk()
> llist_del_first(&c->free_by_rcu_ttrace)
> entry = A
> next = B
>
> P2:
> do_call_rcu_ttrace()
> // c->free_by_rcu_ttrace->first = NULL
> llist_del_all(&c->free_by_rcu_ttrace)
> move to c->waiting_for_gp_ttrace
>
> P1:
> llist_del_first()
> return NULL
>
> // A is only reusable after one task trace RCU grace
> // llist_del_first() must have been completed
"must have been completed" ?
I guess you're assuming that alloc_bulk() from irq_work
is running within rcu_tasks_trace critical section,
so __free_rcu_tasks_trace() callback will execute after
irq work completed?
I don't think that's the case.
If vCPU P1 is stopped for a looong time by the host,
P2 can execute __free_rcu_tasks_trace() (or P3 can, since
tasks trace callbacks execute in a kthread that is not bound
to any cpu).
__free_rcu_tasks_trace() will free the object into slab.
Then kmalloc() can return the same object, and it eventually gets put back into
free_by_rcu_ttrace.
Since you believe that waiting_for_gp_ttrace ABA is possible here,
it's the same probability. imo both are lower than a bit flip due
to cosmic rays, which is actually observable in practice.
> __free_rcu_tasks_trace
> free_all(llist_del_all(&c->waiting_for_gp_ttrace))
>
>
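For reference, the window both sides are describing lives inside llist_del_first() itself. Below is a simplified rendering of its cmpxchg loop, paraphrased from include/linux/llist.h rather than copied verbatim, with the spot where an ABA can bite marked in a comment.
struct llist_node *llist_del_first_sketch(struct llist_head *head)
{
	struct llist_node *entry, *next;

	entry = smp_load_acquire(&head->first);
	do {
		if (entry == NULL)
			return NULL;
		next = READ_ONCE(entry->next);		/* (1) read A->next == B */
		/*
		 * ABA window: if another cpu now does llist_del_all(), object A
		 * is freed, reallocated and llist_add()ed back as the first node
		 * with a different ->next, the cmpxchg below still succeeds
		 * because the pointer value matches, and the stale 'next' read
		 * at (1) becomes head->first.
		 */
	} while (!try_cmpxchg(&head->first, &entry, next));	/* (2) pop A */

	return entry;
}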
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v4 bpf-next 09/14] bpf: Allow reuse from waiting_for_gp_ttrace list.
2023-07-07 4:16 ` Alexei Starovoitov
@ 2023-07-07 4:37 ` Hou Tao
2023-07-07 16:11 ` Alexei Starovoitov
0 siblings, 1 reply; 31+ messages in thread
From: Hou Tao @ 2023-07-07 4:37 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Tejun Heo, rcu, Network Development, bpf, Kernel Team,
Daniel Borkmann, Andrii Nakryiko, David Vernet, Paul E. McKenney
Hi,
On 7/7/2023 12:16 PM, Alexei Starovoitov wrote:
> On Thu, Jul 6, 2023 at 8:39 PM Hou Tao <houtao@huaweicloud.com> wrote:
>> Hi,
>>
>> On 7/7/2023 10:12 AM, Alexei Starovoitov wrote:
>>> On Thu, Jul 6, 2023 at 7:07 PM Hou Tao <houtao@huaweicloud.com> wrote:
>>>> Hi,
>>>>
>>>> On 7/6/2023 11:34 AM, Alexei Starovoitov wrote:
>>>>
SNIP
>>> and it's not just waiting_for_gp_ttrace. free_by_rcu_ttrace is similar.
>> I think free_by_rcu_ttrace is different, because the reuse is only
>> possible after one tasks trace RCU grace period as shown below, and the
>> concurrent llist_del_first() must have been completed when the head is
>> reused and re-added into free_by_rcu_ttrace again.
>>
>> // c0->free_by_rcu_ttrace
>> A -> B -> C -> nil
>>
>> P1:
>> alloc_bulk()
>> llist_del_first(&c->free_by_rcu_ttrace)
>> entry = A
>> next = B
>>
>> P2:
>> do_call_rcu_ttrace()
>> // c->free_by_rcu_ttrace->first = NULL
>> llist_del_all(&c->free_by_rcu_ttrace)
>> move to c->waiting_for_gp_ttrace
>>
>> P1:
>> llist_del_first()
>> return NULL
>>
>> // A is only reusable after one task trace RCU grace
>> // llist_del_first() must have been completed
> "must have been completed" ?
>
> I guess you're assuming that alloc_bulk() from irq_work
> is running within rcu_tasks_trace critical section,
> so __free_rcu_tasks_trace() callback will execute after
> irq work completed?
> I don't think that's the case.
Yes. The following are my original thoughts. Correct me if I am wrong:
1. llist_del_first() must be running concurrently with llist_del_all().
If llist_del_first() runs after llist_del_all(), it will return NULL
directly.
2. call_rcu_tasks_trace() must happen after llist_del_all(), else the
elements in free_by_rcu_ttrace will not be freed back to slab.
3. call_rcu_tasks_trace() will wait for one tasks trace RCU grace period
before calling __free_rcu_tasks_trace().
4. llist_del_first() is running in a context with irqs disabled, so the
tasks trace RCU grace period will wait for the end of llist_del_first().
It seems you thought step 4) is not true, right?
> In vCPU P1 is stopped for looong time by host,
> P2 can execute __free_rcu_tasks_trace (or P3, since
> tasks trace callbacks execute in a kthread that is not bound
> to any cpu).
> __free_rcu_tasks_trace() will free it into slab.
> Then kmalloc the same obj and eventually put it back into
> free_by_rcu_ttrace.
>
> Since you believe that waiting_for_gp_ttrace ABA is possible
> here it's the same probability. imo both lower than a bit flip due
> to cosmic rays which is actually observable in practice.
>
>> __free_rcu_tasks_trace
>> free_all(llist_del_all(&c->waiting_for_gp_ttrace))
>>
>>
> .
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v4 bpf-next 09/14] bpf: Allow reuse from waiting_for_gp_ttrace list.
2023-07-07 4:37 ` Hou Tao
@ 2023-07-07 16:11 ` Alexei Starovoitov
2023-07-07 17:47 ` Paul E. McKenney
0 siblings, 1 reply; 31+ messages in thread
From: Alexei Starovoitov @ 2023-07-07 16:11 UTC (permalink / raw)
To: Hou Tao
Cc: Tejun Heo, rcu, Network Development, bpf, Kernel Team,
Daniel Borkmann, Andrii Nakryiko, David Vernet, Paul E. McKenney
On Thu, Jul 6, 2023 at 9:37 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 7/7/2023 12:16 PM, Alexei Starovoitov wrote:
> > On Thu, Jul 6, 2023 at 8:39 PM Hou Tao <houtao@huaweicloud.com> wrote:
> >> Hi,
> >>
> >> On 7/7/2023 10:12 AM, Alexei Starovoitov wrote:
> >>> On Thu, Jul 6, 2023 at 7:07 PM Hou Tao <houtao@huaweicloud.com> wrote:
> >>>> Hi,
> >>>>
> >>>> On 7/6/2023 11:34 AM, Alexei Starovoitov wrote:
> >>>>
> SNIP
> >>> and it's not just waiting_for_gp_ttrace. free_by_rcu_ttrace is similar.
> >> I think free_by_rcu_ttrace is different, because the reuse is only
> >> possible after one tasks trace RCU grace period as shown below, and the
> >> concurrent llist_del_first() must have been completed when the head is
> >> reused and re-added into free_by_rcu_ttrace again.
> >>
> >> // c0->free_by_rcu_ttrace
> >> A -> B -> C -> nil
> >>
> >> P1:
> >> alloc_bulk()
> >> llist_del_first(&c->free_by_rcu_ttrace)
> >> entry = A
> >> next = B
> >>
> >> P2:
> >> do_call_rcu_ttrace()
> >> // c->free_by_rcu_ttrace->first = NULL
> >> llist_del_all(&c->free_by_rcu_ttrace)
> >> move to c->waiting_for_gp_ttrace
> >>
> >> P1:
> >> llist_del_first()
> >> return NULL
> >>
> >> // A is only reusable after one task trace RCU grace
> >> // llist_del_first() must have been completed
> > "must have been completed" ?
> >
> > I guess you're assuming that alloc_bulk() from irq_work
> > is running within rcu_tasks_trace critical section,
> > so __free_rcu_tasks_trace() callback will execute after
> > irq work completed?
> > I don't think that's the case.
>
> Yes. The following is my original thoughts. Correct me if I was wrong:
>
> 1. llist_del_first() must be running concurrently with llist_del_all().
> If llist_del_first() runs after llist_del_all(), it will return NULL
> directly.
> 2. call_rcu_tasks_trace() must happen after llist_del_all(), else the
> elements in free_by_rcu_ttrace will not be freed back to slab.
> 3. call_rcu_tasks_trace() will wait for one tasks trace RCU grace period
> to call __free_rcu_tasks_trace()
> 4. llist_del_first() in running in an context with irq-disabled, so the
> tasks trace RCU grace period will wait for the end of llist_del_first()
>
> It seems you thought step 4) is not true, right ?
Yes. I think so. For two reasons:
1.
I believe irq disabled region isn't considered equivalent
to rcu_read_lock_trace() region.
Paul,
could you clarify ?
2.
Even if 1 is incorrect, in RT llist_del_first() from alloc_bulk()
runs "in a per-CPU thread in preemptible context."
See irq_work_run_list.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v4 bpf-next 09/14] bpf: Allow reuse from waiting_for_gp_ttrace list.
2023-07-07 16:11 ` Alexei Starovoitov
@ 2023-07-07 17:47 ` Paul E. McKenney
2023-07-07 22:22 ` Joel Fernandes
2023-07-08 7:03 ` Hou Tao
0 siblings, 2 replies; 31+ messages in thread
From: Paul E. McKenney @ 2023-07-07 17:47 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Hou Tao, Tejun Heo, rcu, Network Development, bpf, Kernel Team,
Daniel Borkmann, Andrii Nakryiko, David Vernet
On Fri, Jul 07, 2023 at 09:11:22AM -0700, Alexei Starovoitov wrote:
> On Thu, Jul 6, 2023 at 9:37 PM Hou Tao <houtao@huaweicloud.com> wrote:
> >
> > Hi,
> >
> > On 7/7/2023 12:16 PM, Alexei Starovoitov wrote:
> > > On Thu, Jul 6, 2023 at 8:39 PM Hou Tao <houtao@huaweicloud.com> wrote:
> > >> Hi,
> > >>
> > >> On 7/7/2023 10:12 AM, Alexei Starovoitov wrote:
> > >>> On Thu, Jul 6, 2023 at 7:07 PM Hou Tao <houtao@huaweicloud.com> wrote:
> > >>>> Hi,
> > >>>>
> > >>>> On 7/6/2023 11:34 AM, Alexei Starovoitov wrote:
> > >>>>
> > SNIP
> > >>> and it's not just waiting_for_gp_ttrace. free_by_rcu_ttrace is similar.
> > >> I think free_by_rcu_ttrace is different, because the reuse is only
> > >> possible after one tasks trace RCU grace period as shown below, and the
> > >> concurrent llist_del_first() must have been completed when the head is
> > >> reused and re-added into free_by_rcu_ttrace again.
> > >>
> > >> // c0->free_by_rcu_ttrace
> > >> A -> B -> C -> nil
> > >>
> > >> P1:
> > >> alloc_bulk()
> > >> llist_del_first(&c->free_by_rcu_ttrace)
> > >> entry = A
> > >> next = B
> > >>
> > >> P2:
> > >> do_call_rcu_ttrace()
> > >> // c->free_by_rcu_ttrace->first = NULL
> > >> llist_del_all(&c->free_by_rcu_ttrace)
> > >> move to c->waiting_for_gp_ttrace
> > >>
> > >> P1:
> > >> llist_del_first()
> > >> return NULL
> > >>
> > >> // A is only reusable after one task trace RCU grace
> > >> // llist_del_first() must have been completed
> > > "must have been completed" ?
> > >
> > > I guess you're assuming that alloc_bulk() from irq_work
> > > is running within rcu_tasks_trace critical section,
> > > so __free_rcu_tasks_trace() callback will execute after
> > > irq work completed?
> > > I don't think that's the case.
> >
> > Yes. The following is my original thoughts. Correct me if I was wrong:
> >
> > 1. llist_del_first() must be running concurrently with llist_del_all().
> > If llist_del_first() runs after llist_del_all(), it will return NULL
> > directly.
> > 2. call_rcu_tasks_trace() must happen after llist_del_all(), else the
> > elements in free_by_rcu_ttrace will not be freed back to slab.
> > 3. call_rcu_tasks_trace() will wait for one tasks trace RCU grace period
> > to call __free_rcu_tasks_trace()
> > 4. llist_del_first() in running in an context with irq-disabled, so the
> > tasks trace RCU grace period will wait for the end of llist_del_first()
> >
> > It seems you thought step 4) is not true, right ?
>
> Yes. I think so. For two reasons:
>
> 1.
> I believe irq disabled region isn't considered equivalent
> to rcu_read_lock_trace() region.
>
> Paul,
> could you clarify ?
You are correct, Alexei. Unlike vanilla RCU, RCU Tasks Trace does not
count irq-disabled regions of code as readers.
But why not just put an rcu_read_lock_trace() and a matching
rcu_read_unlock_trace() within that irq-disabled region of code?
For completeness, if it were not for CONFIG_TASKS_TRACE_RCU_READ_MB,
Hou Tao would be correct from a strict current-implementation
viewpoint. The reason is that, given the current implementation in
CONFIG_TASKS_TRACE_RCU_READ_MB=n kernels, a task must either block or
take an IPI in order for the grace-period machinery to realize that this
task is done with all prior readers.
However, we need to account for the possibility of IPI-free
implementations, for example, if the real-time guys decide to start
making heavy use of BPF sleepable programs. They would then insist on
getting rid of those IPIs for CONFIG_PREEMPT_RT=y kernels. At which
point, irq-disabled regions of code will absolutely not act as
RCU tasks trace readers.
Again, why not just put an rcu_read_lock_trace() and a matching
rcu_read_unlock_trace() within that irq-disabled region of code?
Or maybe there is a better workaround.
> 2.
> Even if 1 is incorrect, in RT llist_del_first() from alloc_bulk()
> runs "in a per-CPU thread in preemptible context."
> See irq_work_run_list.
Agreed, under RT, "interrupt handlers" often run in task context.
Thanx, Paul
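To make the suggestion concrete, one possible shape of it for the free_by_rcu_ttrace reuse loop in alloc_bulk() is sketched below. This is an illustration only, not the code that was merged; struct bpf_mem_cache, add_obj_to_free_list() and llist_del_first() are the ones used in this series.
#include <linux/llist.h>
#include <linux/rcupdate_trace.h>

/* sketch: make the reuse loop an explicit RCU tasks trace reader instead of
 * relying on irqs being disabled
 */
static int refill_from_free_by_rcu_ttrace(struct bpf_mem_cache *c, int i, int cnt)
{
	void *obj;

	rcu_read_lock_trace();
	for (; i < cnt; i++) {
		obj = llist_del_first(&c->free_by_rcu_ttrace);
		if (!obj)
			break;
		add_obj_to_free_list(c, obj);
	}
	rcu_read_unlock_trace();
	return i;	/* number of free-list slots filled so far */
}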
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v4 bpf-next 09/14] bpf: Allow reuse from waiting_for_gp_ttrace list.
2023-07-07 17:47 ` Paul E. McKenney
@ 2023-07-07 22:22 ` Joel Fernandes
2023-07-08 7:03 ` Hou Tao
1 sibling, 0 replies; 31+ messages in thread
From: Joel Fernandes @ 2023-07-07 22:22 UTC (permalink / raw)
To: paulmck
Cc: Alexei Starovoitov, Hou Tao, Tejun Heo, rcu, Network Development,
bpf, Kernel Team, Daniel Borkmann, Andrii Nakryiko, David Vernet
On Fri, Jul 7, 2023 at 1:47 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Fri, Jul 07, 2023 at 09:11:22AM -0700, Alexei Starovoitov wrote:
> > On Thu, Jul 6, 2023 at 9:37 PM Hou Tao <houtao@huaweicloud.com> wrote:
> > >
> > > Hi,
> > >
> > > On 7/7/2023 12:16 PM, Alexei Starovoitov wrote:
> > > > On Thu, Jul 6, 2023 at 8:39 PM Hou Tao <houtao@huaweicloud.com> wrote:
> > > >> Hi,
> > > >>
> > > >> On 7/7/2023 10:12 AM, Alexei Starovoitov wrote:
> > > >>> On Thu, Jul 6, 2023 at 7:07 PM Hou Tao <houtao@huaweicloud.com> wrote:
> > > >>>> Hi,
> > > >>>>
> > > >>>> On 7/6/2023 11:34 AM, Alexei Starovoitov wrote:
> > > >>>>
> > > SNIP
> > > >>> and it's not just waiting_for_gp_ttrace. free_by_rcu_ttrace is similar.
> > > >> I think free_by_rcu_ttrace is different, because the reuse is only
> > > >> possible after one tasks trace RCU grace period as shown below, and the
> > > >> concurrent llist_del_first() must have been completed when the head is
> > > >> reused and re-added into free_by_rcu_ttrace again.
> > > >>
> > > >> // c0->free_by_rcu_ttrace
> > > >> A -> B -> C -> nil
> > > >>
> > > >> P1:
> > > >> alloc_bulk()
> > > >> llist_del_first(&c->free_by_rcu_ttrace)
> > > >> entry = A
> > > >> next = B
> > > >>
> > > >> P2:
> > > >> do_call_rcu_ttrace()
> > > >> // c->free_by_rcu_ttrace->first = NULL
> > > >> llist_del_all(&c->free_by_rcu_ttrace)
> > > >> move to c->waiting_for_gp_ttrace
> > > >>
> > > >> P1:
> > > >> llist_del_first()
> > > >> return NULL
> > > >>
> > > >> // A is only reusable after one task trace RCU grace
> > > >> // llist_del_first() must have been completed
> > > > "must have been completed" ?
> > > >
> > > > I guess you're assuming that alloc_bulk() from irq_work
> > > > is running within rcu_tasks_trace critical section,
> > > > so __free_rcu_tasks_trace() callback will execute after
> > > > irq work completed?
> > > > I don't think that's the case.
> > >
> > > Yes. The following is my original thoughts. Correct me if I was wrong:
> > >
> > > 1. llist_del_first() must be running concurrently with llist_del_all().
> > > If llist_del_first() runs after llist_del_all(), it will return NULL
> > > directly.
> > > 2. call_rcu_tasks_trace() must happen after llist_del_all(), else the
> > > elements in free_by_rcu_ttrace will not be freed back to slab.
> > > 3. call_rcu_tasks_trace() will wait for one tasks trace RCU grace period
> > > to call __free_rcu_tasks_trace()
> > > 4. llist_del_first() in running in an context with irq-disabled, so the
> > > tasks trace RCU grace period will wait for the end of llist_del_first()
> > >
> > > It seems you thought step 4) is not true, right ?
> >
> > Yes. I think so. For two reasons:
> >
> > 1.
> > I believe irq disabled region isn't considered equivalent
> > to rcu_read_lock_trace() region.
> >
> > Paul,
> > could you clarify ?
>
> You are correct, Alexei. Unlike vanilla RCU, RCU Tasks Trace does not
> count irq-disabled regions of code as readers.
>
> But why not just put an rcu_read_lock_trace() and a matching
> rcu_read_unlock_trace() within that irq-disabled region of code?
>
> For completeness, if it were not for CONFIG_TASKS_TRACE_RCU_READ_MB,
> Hou Tao would be correct from a strict current-implementation
> viewpoint. The reason is that, given the current implementation in
> CONFIG_TASKS_TRACE_RCU_READ_MB=n kernels, a task must either block or
> take an IPI in order for the grace-period machinery to realize that this
> task is done with all prior readers.
>
> However, we need to account for the possibility of IPI-free
> implementations, for example, if the real-time guys decide to start
> making heavy use of BPF sleepable programs. They would then insist on
> getting rid of those IPIs for CONFIG_PREEMPT_RT=y kernels. At which
> point, irq-disabled regions of code will absolutely not act as
> RCU tasks trace readers.
>
> Again, why not just put an rcu_read_lock_trace() and a matching
> rcu_read_unlock_trace() within that irq-disabled region of code?
If I remember correctly, the general guidance is to always put an
explicit marker if it is in an RCU-reader, instead of relying on
implementation details. So the suggestion to put the marker instead of
relying on IRQ disabling does align with that.
Thanks.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v4 bpf-next 12/14] bpf: Introduce bpf_mem_free_rcu() similar to kfree_rcu().
2023-07-07 4:05 ` Hou Tao
@ 2023-07-08 7:00 ` Hou Tao
0 siblings, 0 replies; 31+ messages in thread
From: Hou Tao @ 2023-07-08 7:00 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Tejun Heo, rcu, Network Development, bpf, Kernel Team,
Daniel Borkmann, Andrii Nakryiko, David Vernet, Paul E. McKenney
Hi,
On 7/7/2023 12:05 PM, Hou Tao wrote:
> Hi,
>
> On 7/7/2023 10:10 AM, Alexei Starovoitov wrote:
>> On Thu, Jul 6, 2023 at 6:45 PM Hou Tao <houtao@huaweicloud.com> wrote:
>>>
>>> On 7/6/2023 11:34 AM, Alexei Starovoitov wrote:
>>>> From: Alexei Starovoitov <ast@kernel.org>
>>>>
>>>> Introduce bpf_mem_[cache_]free_rcu() similar to kfree_rcu().
>>>> Unlike bpf_mem_[cache_]free() that links objects for immediate reuse into
>>>> per-cpu free list the _rcu() flavor waits for RCU grace period and then moves
>>>> objects into free_by_rcu_ttrace list where they are waiting for RCU
>>>> task trace grace period to be freed into slab.
>>>>
>>>> The life cycle of objects:
>>>> alloc: dequeue free_llist
>>>> free: enqeueu free_llist
>>>> free_rcu: enqueue free_by_rcu -> waiting_for_gp
>>>> free_llist above high watermark -> free_by_rcu_ttrace
>>>> after RCU GP waiting_for_gp -> free_by_rcu_ttrace
>>>> free_by_rcu_ttrace -> waiting_for_gp_ttrace -> slab
>>>>
>>>> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>>> Acked-by: Hou Tao <houtao1@huawei.com>
>> Thank you very much for code reviews and feedback.
> You are welcome. I also learn a lot from this great patch set.
>
>> btw I still believe that ABA is a non issue and prefer to keep the code as-is,
>> but for the sake of experiment I've converted it to spin_lock
>> (see attached patch which I think uglifies the code)
>> and performance across bench htab-mem and map_perf_test
>> seems to be about the same.
>> Which was a bit surprising to me.
>> Could you please benchmark it on your system?
> Will do that later. It seems if there is no cross-CPU allocation and
> free, the only possible contention is between __free_rcu() on CPU x and
> alloc_bulk()/free_bulk() on a different CPU.
>
For my local VM setup, the spin-lock also doesn't make much difference
under both htab-mem and map_perf_test, as shown below.
without spin-lock
normal bpf ma
=============
overwrite            per-prod-op: 54.16 ± 0.79k/s, avg mem: 159.99 ± 40.80MiB, peak mem: 251.41MiB
batch_add_batch_del  per-prod-op: 83.87 ± 0.86k/s, avg mem: 70.52 ± 22.73MiB, peak mem: 121.31MiB
add_del_on_diff_cpu  per-prod-op: 25.98 ± 0.13k/s, avg mem: 17.88 ± 1.84MiB, peak mem: 22.86MiB
./map_perf_test 4 8 16384
0:hash_map_perf kmalloc 361532 events per sec
2:hash_map_perf kmalloc 352594 events per sec
6:hash_map_perf kmalloc 356007 events per sec
5:hash_map_perf kmalloc 354184 events per sec
3:hash_map_perf kmalloc 348720 events per sec
1:hash_map_perf kmalloc 346332 events per sec
7:hash_map_perf kmalloc 352126 events per sec
4:hash_map_perf kmalloc 339459 events per sec
with spin-lock
normal bpf ma
=============
overwrite            per-prod-op: 54.72 ± 0.96k/s, avg mem: 133.99 ± 34.04MiB, peak mem: 221.60MiB
batch_add_batch_del  per-prod-op: 82.90 ± 1.86k/s, avg mem: 55.91 ± 11.05MiB, peak mem: 103.82MiB
add_del_on_diff_cpu  per-prod-op: 26.75 ± 0.10k/s, avg mem: 18.55 ± 1.24MiB, peak mem: 23.11MiB
./map_perf_test 4 8 16384
1:hash_map_perf kmalloc 361750 events per sec
2:hash_map_perf kmalloc 360976 events per sec
6:hash_map_perf kmalloc 361745 events per sec
0:hash_map_perf kmalloc 350349 events per sec
7:hash_map_perf kmalloc 359125 events per sec
3:hash_map_perf kmalloc 352683 events per sec
5:hash_map_perf kmalloc 350897 events per sec
4:hash_map_perf kmalloc 331215 events per sec
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v4 bpf-next 09/14] bpf: Allow reuse from waiting_for_gp_ttrace list.
2023-07-07 17:47 ` Paul E. McKenney
2023-07-07 22:22 ` Joel Fernandes
@ 2023-07-08 7:03 ` Hou Tao
2023-07-10 4:45 ` Paul E. McKenney
1 sibling, 1 reply; 31+ messages in thread
From: Hou Tao @ 2023-07-08 7:03 UTC (permalink / raw)
To: paulmck, Alexei Starovoitov
Cc: Tejun Heo, rcu, Network Development, bpf, Kernel Team,
Daniel Borkmann, Andrii Nakryiko, David Vernet
Hi,
On 7/8/2023 1:47 AM, Paul E. McKenney wrote:
> On Fri, Jul 07, 2023 at 09:11:22AM -0700, Alexei Starovoitov wrote:
>> On Thu, Jul 6, 2023 at 9:37 PM Hou Tao <houtao@huaweicloud.com> wrote:
SNIP
>>> I guess you're assuming that alloc_bulk() from irq_work
>>> is running within rcu_tasks_trace critical section,
>>> so __free_rcu_tasks_trace() callback will execute after
>>> irq work completed?
>>> I don't think that's the case.
>>> Yes. The following is my original thoughts. Correct me if I was wrong:
>>>
>>> 1. llist_del_first() must be running concurrently with llist_del_all().
>>> If llist_del_first() runs after llist_del_all(), it will return NULL
>>> directly.
>>> 2. call_rcu_tasks_trace() must happen after llist_del_all(), else the
>>> elements in free_by_rcu_ttrace will not be freed back to slab.
>>> 3. call_rcu_tasks_trace() will wait for one tasks trace RCU grace period
>>> to call __free_rcu_tasks_trace()
>>> 4. llist_del_first() in running in an context with irq-disabled, so the
>>> tasks trace RCU grace period will wait for the end of llist_del_first()
>>>
>>> It seems you thought step 4) is not true, right ?
>> Yes. I think so. For two reasons:
>>
>> 1.
>> I believe irq disabled region isn't considered equivalent
>> to rcu_read_lock_trace() region.
>>
>> Paul,
>> could you clarify ?
> You are correct, Alexei. Unlike vanilla RCU, RCU Tasks Trace does not
> count irq-disabled regions of code as readers.
I see. But I still have one question: considering that in the current
implementation one Tasks Trace RCU grace period implies one vanilla RCU
grace period (aka rcu_trace_implies_rcu_gp()), then in my naive
understanding of RCU, does that mean __free_rcu_tasks_trace() will be
invoked after the expiration of the current Tasks Trace RCU grace period?
And does it also mean __free_rcu_tasks_trace() will be invoked
after the expiration of the current vanilla RCU grace period? If
these two conditions above are true, does it mean
__free_rcu_tasks_trace() will wait for the irq-disabled code region?
> But why not just put an rcu_read_lock_trace() and a matching
> rcu_read_unlock_trace() within that irq-disabled region of code?
>
> For completeness, if it were not for CONFIG_TASKS_TRACE_RCU_READ_MB,
> Hou Tao would be correct from a strict current-implementation
> viewpoint. The reason is that, given the current implementation in
> CONFIG_TASKS_TRACE_RCU_READ_MB=n kernels, a task must either block or
> take an IPI in order for the grace-period machinery to realize that this
> task is done with all prior readers.
Thanks for the detailed explanation.
> However, we need to account for the possibility of IPI-free
> implementations, for example, if the real-time guys decide to start
> making heavy use of BPF sleepable programs. They would then insist on
> getting rid of those IPIs for CONFIG_PREEMPT_RT=y kernels. At which
> point, irq-disabled regions of code will absolutely not act as
> RCU tasks trace readers.
>
> Again, why not just put an rcu_read_lock_trace() and a matching
> rcu_read_unlock_trace() within that irq-disabled region of code?
>
> Or maybe there is a better workaround.
Yes. I think we could use rcu_read_{lock,unlock}_trace to fix the ABA
problem for free_by_rcu_ttrace.
>
>> 2.
>> Even if 1 is incorrect, in RT llist_del_first() from alloc_bulk()
>> runs "in a per-CPU thread in preemptible context."
>> See irq_work_run_list.
> Agreed, under RT, "interrupt handlers" often run in task context.
Yes, I missed that. I misread alloc_bulk(), and it seems it only does
inc_active() for c->free_llist.
> Thanx, Paul
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v4 bpf-next 09/14] bpf: Allow reuse from waiting_for_gp_ttrace list.
2023-07-08 7:03 ` Hou Tao
@ 2023-07-10 4:45 ` Paul E. McKenney
0 siblings, 0 replies; 31+ messages in thread
From: Paul E. McKenney @ 2023-07-10 4:45 UTC (permalink / raw)
To: Hou Tao
Cc: Alexei Starovoitov, Tejun Heo, rcu, Network Development, bpf,
Kernel Team, Daniel Borkmann, Andrii Nakryiko, David Vernet
On Sat, Jul 08, 2023 at 03:03:40PM +0800, Hou Tao wrote:
> Hi,
>
> On 7/8/2023 1:47 AM, Paul E. McKenney wrote:
> > On Fri, Jul 07, 2023 at 09:11:22AM -0700, Alexei Starovoitov wrote:
> >> On Thu, Jul 6, 2023 at 9:37 PM Hou Tao <houtao@huaweicloud.com> wrote:
> SNIP
> >>> I guess you're assuming that alloc_bulk() from irq_work
> >>> is running within rcu_tasks_trace critical section,
> >>> so __free_rcu_tasks_trace() callback will execute after
> >>> irq work completed?
> >>> I don't think that's the case.
> >>> Yes. The following is my original thoughts. Correct me if I was wrong:
> >>>
> >>> 1. llist_del_first() must be running concurrently with llist_del_all().
> >>> If llist_del_first() runs after llist_del_all(), it will return NULL
> >>> directly.
> >>> 2. call_rcu_tasks_trace() must happen after llist_del_all(), else the
> >>> elements in free_by_rcu_ttrace will not be freed back to slab.
> >>> 3. call_rcu_tasks_trace() will wait for one tasks trace RCU grace period
> >>> to call __free_rcu_tasks_trace()
> >>> 4. llist_del_first() in running in an context with irq-disabled, so the
> >>> tasks trace RCU grace period will wait for the end of llist_del_first()
> >>>
> >>> It seems you thought step 4) is not true, right ?
> >> Yes. I think so. For two reasons:
> >>
> >> 1.
> >> I believe irq disabled region isn't considered equivalent
> >> to rcu_read_lock_trace() region.
> >>
> >> Paul,
> >> could you clarify ?
> > You are correct, Alexei. Unlike vanilla RCU, RCU Tasks Trace does not
> > count irq-disabled regions of code as readers.
>
> I see. But I still have one question: considering that in current
> implementation one Tasks Trace RCU grace period implies one vanilla RCU
> grace period (aka rcu_trace_implies_rcu_gp), so in my naive
> understanding of RCU, does that mean __free_rcu_tasks_trace() will be
> invoked after the expiration of current Task Trace RCU grace period,
> right ? And does it also mean __free_rcu_tasks_trace() will be invoked
> after the expiration of current vanilla RCU grace period, right ? If
> these two conditions above are true, does it mean
> __free_rcu_tasks_trace() will wait for the irq-disabled code reigion ?
First, good show digging into the code!
However, this is guaranteed only if rcu_trace_implies_rcu_gp(), which
right now happens to return the constant true. In other words, that is
an accident of the current implementation. To rely on this, you must
check the return value of rcu_trace_implies_rcu_gp() and then have some
alternative code that does not rely on that synchronize_rcu().
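A minimal sketch of that pattern: check rcu_trace_implies_rcu_gp() and fall back to chaining an explicit vanilla RCU callback when it returns false. The name obj_free_cb() and the surrounding usage are placeholders for illustration, not the actual memalloc code.
/* runs after an RCU tasks trace grace period,
 * e.g. queued via call_rcu_tasks_trace(&obj->rcu, free_after_both_gps)
 */
static void free_after_both_gps(struct rcu_head *head)
{
	if (rcu_trace_implies_rcu_gp())
		/* the tasks trace GP already implied a vanilla RCU GP */
		obj_free_cb(head);
	else
		/* otherwise wait for one more vanilla RCU GP before freeing */
		call_rcu(head, obj_free_cb);
}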
> > But why not just put an rcu_read_lock_trace() and a matching
> > rcu_read_unlock_trace() within that irq-disabled region of code?
> >
> > For completeness, if it were not for CONFIG_TASKS_TRACE_RCU_READ_MB,
> > Hou Tao would be correct from a strict current-implementation
> > viewpoint. The reason is that, given the current implementation in
> > CONFIG_TASKS_TRACE_RCU_READ_MB=n kernels, a task must either block or
> > take an IPI in order for the grace-period machinery to realize that this
> > task is done with all prior readers.
>
> Thanks for the detailed explanation.
>
> > However, we need to account for the possibility of IPI-free
> > implementations, for example, if the real-time guys decide to start
> > making heavy use of BPF sleepable programs. They would then insist on
> > getting rid of those IPIs for CONFIG_PREEMPT_RT=y kernels. At which
> > point, irq-disabled regions of code will absolutely not act as
> > RCU tasks trace readers.
> >
> > Again, why not just put an rcu_read_lock_trace() and a matching
> > rcu_read_unlock_trace() within that irq-disabled region of code?
> >
> > Or maybe there is a better workaround.
>
> Yes. I think we could use rcu_read_{lock,unlock}_trace to fix the ABA
> problem for free_by_rcu_ttrace.
That sounds good to me!
Thanx, Paul
> >> 2.
> >> Even if 1 is incorrect, in RT llist_del_first() from alloc_bulk()
> >> runs "in a per-CPU thread in preemptible context."
> >> See irq_work_run_list.
> > Agreed, under RT, "interrupt handlers" often run in task context.
>
> Yes, I missed that. I misread alloc_bulk(), and it seems it only does
> inc_active() for c->free_llist.
> > Thanx, Paul
>
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v4 bpf-next 00/14] bpf: Introduce bpf_mem_cache_free_rcu().
2023-07-06 3:34 [PATCH v4 bpf-next 00/14] bpf: Introduce bpf_mem_cache_free_rcu() Alexei Starovoitov
` (13 preceding siblings ...)
2023-07-06 3:34 ` [PATCH v4 bpf-next 14/14] bpf: Add object leak check Alexei Starovoitov
@ 2023-07-12 21:50 ` patchwork-bot+netdevbpf
14 siblings, 0 replies; 31+ messages in thread
From: patchwork-bot+netdevbpf @ 2023-07-12 21:50 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: daniel, andrii, void, houtao, paulmck, tj, rcu, netdev, bpf,
kernel-team
Hello:
This series was applied to bpf/bpf-next.git (master)
by Daniel Borkmann <daniel@iogearbox.net>:
On Wed, 5 Jul 2023 20:34:33 -0700 you wrote:
> From: Alexei Starovoitov <ast@kernel.org>
>
> v3->v4:
> - extra patch 14 from Hou to check for object leaks.
> - fixed the race/leak in free_by_rcu_ttrace. Extra hunk in patch 8.
> - added Acks and fixed typos.
>
> [...]
Here is the summary with links:
- [v4,bpf-next,01/14] bpf: Rename few bpf_mem_alloc fields.
https://git.kernel.org/bpf/bpf-next/c/12c8d0f4c870
- [v4,bpf-next,02/14] bpf: Simplify code of destroy_mem_alloc() with kmemdup().
https://git.kernel.org/bpf/bpf-next/c/a80672d7e10e
- [v4,bpf-next,03/14] bpf: Let free_all() return the number of freed elements.
https://git.kernel.org/bpf/bpf-next/c/9de3e81521b4
- [v4,bpf-next,04/14] bpf: Refactor alloc_bulk().
https://git.kernel.org/bpf/bpf-next/c/05ae68656a8e
- [v4,bpf-next,05/14] bpf: Factor out inc/dec of active flag into helpers.
https://git.kernel.org/bpf/bpf-next/c/18e027b1c7c6
- [v4,bpf-next,06/14] bpf: Further refactor alloc_bulk().
https://git.kernel.org/bpf/bpf-next/c/7468048237b8
- [v4,bpf-next,07/14] bpf: Change bpf_mem_cache draining process.
https://git.kernel.org/bpf/bpf-next/c/d114dde245f9
- [v4,bpf-next,08/14] bpf: Add a hint to allocated objects.
https://git.kernel.org/bpf/bpf-next/c/822fb26bdb55
- [v4,bpf-next,09/14] bpf: Allow reuse from waiting_for_gp_ttrace list.
https://git.kernel.org/bpf/bpf-next/c/04fabf00b4d3
- [v4,bpf-next,10/14] rcu: Export rcu_request_urgent_qs_task()
https://git.kernel.org/bpf/bpf-next/c/43a89baecfe2
- [v4,bpf-next,11/14] selftests/bpf: Improve test coverage of bpf_mem_alloc.
https://git.kernel.org/bpf/bpf-next/c/f76faa65c971
- [v4,bpf-next,12/14] bpf: Introduce bpf_mem_free_rcu() similar to kfree_rcu().
https://git.kernel.org/bpf/bpf-next/c/5af6807bdb10
- [v4,bpf-next,13/14] bpf: Convert bpf_cpumask to bpf_mem_cache_free_rcu.
https://git.kernel.org/bpf/bpf-next/c/8e07bb9ebcd9
- [v4,bpf-next,14/14] bpf: Add object leak check.
https://git.kernel.org/bpf/bpf-next/c/4ed8b5bcfada
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply [flat|nested] 31+ messages in thread