* [PATCH v2 bpf-next 00/13] bpf: Introduce bpf_mem_cache_free_rcu().
@ 2023-06-24  3:13 Alexei Starovoitov
  2023-06-24  3:13 ` [PATCH v2 bpf-next 01/13] bpf: Rename few bpf_mem_alloc fields Alexei Starovoitov
                   ` (13 more replies)
  0 siblings, 14 replies; 37+ messages in thread
From: Alexei Starovoitov @ 2023-06-24  3:13 UTC (permalink / raw)
  To: daniel, andrii, void, houtao, paulmck; +Cc: tj, rcu, netdev, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

v1->v2:
- Fixed race condition spotted by Hou. Patch 7.

v1:

Introduce bpf_mem_cache_free_rcu() that is similar to kfree_rcu except
the objects will go through an additional RCU tasks trace grace period
before being freed into slab.

Patches 1-9 - a bunch of prep work
Patch 10 - a patch from Paul that exports rcu_request_urgent_qs_task().
Patch 12 - the main bpf_mem_cache_free_rcu patch.
Patch 13 - use it in bpf_cpumask.

bpf_local_storage, bpf_obj_drop, qp-trie will be other users eventually.

With an additional hack patch to htab that replaces bpf_mem_cache_free with bpf_mem_cache_free_rcu,
the benchmark results are:
- map_perf_test 4 8 16348 1000000
drops from 800k to 600k. Waiting for RCU GP makes objects cache cold.

- bench htab-mem -a -p 8
20% drop in performance and big increase in memory. From 3 Mbyte to 50 Mbyte. As expected.

- bench htab-mem -a -p 16 --use-case add_del_on_diff_cpu
Same performance and better memory consumption.
Before these patches this bench would OOM (with or without 'reuse after GP')
Patch 8 addresses the issue.

In the end, the performance drop and additional memory consumption due to _rcu()
were expected and came out within a reasonable margin.
Without Paul's patch 10 the memory consumption in 'bench htab-mem' is in Gbytes
which wouldn't be acceptable.

Patch 8 is a heuristic to address the 'alloc on one cpu, free on another' issue.
It works well in practice. One can probably construct an artificial benchmark
to make the heuristic ineffective, but we have to trade off performance, code complexity,
and memory consumption.

The life cycle of objects:
alloc: dequeue free_llist
free: enqueue free_llist
free_llist above high watermark -> free_by_rcu_ttrace
free_rcu: enqueue free_by_rcu -> waiting_for_gp
after RCU GP waiting_for_gp -> free_by_rcu_ttrace
free_by_rcu_ttrace -> waiting_for_gp_ttrace -> slab
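
For reference, a minimal usage sketch of the new API from a hypothetical
caller (only bpf_mem_alloc_init/bpf_mem_cache_alloc/bpf_mem_cache_free_rcu
come from this series and the existing allocator; the element type, the
function names and the migrate_disable() pairing mirroring patch 13 are
made up for illustration):

#include <linux/bpf_mem_alloc.h>
#include <linux/preempt.h>
#include <linux/types.h>

struct my_elem {
        u64 payload;
};

static struct bpf_mem_alloc my_ma;      /* hypothetical fixed-size cache */

static int my_init(void)
{
        /* fixed object size, not per-cpu memory */
        return bpf_mem_alloc_init(&my_ma, sizeof(struct my_elem), false);
}

static struct my_elem *my_alloc(void)
{
        struct my_elem *e;

        /* bpf_mem_alloc expects migration to be disabled */
        migrate_disable();
        e = bpf_mem_cache_alloc(&my_ma);
        migrate_enable();
        return e;
}

static void my_free(struct my_elem *e)
{
        /* object is not reused until a regular RCU GP and an RCU tasks
         * trace GP have both elapsed
         */
        migrate_disable();
        bpf_mem_cache_free_rcu(&my_ma, e);
        migrate_enable();
}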

Alexei Starovoitov (12):
  bpf: Rename few bpf_mem_alloc fields.
  bpf: Simplify code of destroy_mem_alloc() with kmemdup().
  bpf: Let free_all() return the number of freed elements.
  bpf: Refactor alloc_bulk().
  bpf: Further refactor alloc_bulk().
  bpf: Optimize moving objects from free_by_rcu_ttrace to
    waiting_for_gp_ttrace.
  bpf: Change bpf_mem_cache draining process.
  bpf: Add a hint to allocated objects.
  bpf: Allow reuse from waiting_for_gp_ttrace list.
  selftests/bpf: Improve test coverage of bpf_mem_alloc.
  bpf: Introduce bpf_mem_free_rcu() similar to kfree_rcu().
  bpf: Convert bpf_cpumask to bpf_mem_cache_free_rcu.

Paul E. McKenney (1):
  rcu: Export rcu_request_urgent_qs_task()

 include/linux/bpf_mem_alloc.h                 |   2 +
 include/linux/rcutiny.h                       |   2 +
 include/linux/rcutree.h                       |   1 +
 kernel/bpf/cpumask.c                          |  20 +-
 kernel/bpf/memalloc.c                         | 322 +++++++++++++-----
 kernel/rcu/rcu.h                              |   2 -
 .../testing/selftests/bpf/progs/linked_list.c |   2 +-
 7 files changed, 251 insertions(+), 100 deletions(-)

-- 
2.34.1



* [PATCH v2 bpf-next 01/13] bpf: Rename few bpf_mem_alloc fields.
  2023-06-24  3:13 [PATCH v2 bpf-next 00/13] bpf: Introduce bpf_mem_cache_free_rcu() Alexei Starovoitov
@ 2023-06-24  3:13 ` Alexei Starovoitov
  2023-06-24  3:13 ` [PATCH v2 bpf-next 02/13] bpf: Simplify code of destroy_mem_alloc() with kmemdup() Alexei Starovoitov
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 37+ messages in thread
From: Alexei Starovoitov @ 2023-06-24  3:13 UTC (permalink / raw)
  To: daniel, andrii, void, houtao, paulmck; +Cc: tj, rcu, netdev, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Rename:
-       struct rcu_head rcu;
-       struct llist_head free_by_rcu;
-       struct llist_head waiting_for_gp;
-       atomic_t call_rcu_in_progress;
+       struct llist_head free_by_rcu_ttrace;
+       struct llist_head waiting_for_gp_ttrace;
+       struct rcu_head rcu_ttrace;
+       atomic_t call_rcu_ttrace_in_progress;
...
-	static void do_call_rcu(struct bpf_mem_cache *c)
+	static void do_call_rcu_ttrace(struct bpf_mem_cache *c)

to better indicate intended use.

The 'tasks trace' is shortened to 'ttrace' to reduce verbosity.
No functional changes.

Later patches will add free_by_rcu/waiting_for_gp fields to be used with normal RCU.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/memalloc.c | 57 ++++++++++++++++++++++---------------------
 1 file changed, 29 insertions(+), 28 deletions(-)

diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 0668bcd7c926..cc5b8adb4c83 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -99,10 +99,11 @@ struct bpf_mem_cache {
 	int low_watermark, high_watermark, batch;
 	int percpu_size;
 
-	struct rcu_head rcu;
-	struct llist_head free_by_rcu;
-	struct llist_head waiting_for_gp;
-	atomic_t call_rcu_in_progress;
+	/* list of objects to be freed after RCU tasks trace GP */
+	struct llist_head free_by_rcu_ttrace;
+	struct llist_head waiting_for_gp_ttrace;
+	struct rcu_head rcu_ttrace;
+	atomic_t call_rcu_ttrace_in_progress;
 };
 
 struct bpf_mem_caches {
@@ -165,18 +166,18 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
 	old_memcg = set_active_memcg(memcg);
 	for (i = 0; i < cnt; i++) {
 		/*
-		 * free_by_rcu is only manipulated by irq work refill_work().
+		 * free_by_rcu_ttrace is only manipulated by irq work refill_work().
 		 * IRQ works on the same CPU are called sequentially, so it is
 		 * safe to use __llist_del_first() here. If alloc_bulk() is
 		 * invoked by the initial prefill, there will be no running
 		 * refill_work(), so __llist_del_first() is fine as well.
 		 *
-		 * In most cases, objects on free_by_rcu are from the same CPU.
+		 * In most cases, objects on free_by_rcu_ttrace are from the same CPU.
 		 * If some objects come from other CPUs, it doesn't incur any
 		 * harm because NUMA_NO_NODE means the preference for current
 		 * numa node and it is not a guarantee.
 		 */
-		obj = __llist_del_first(&c->free_by_rcu);
+		obj = __llist_del_first(&c->free_by_rcu_ttrace);
 		if (!obj) {
 			/* Allocate, but don't deplete atomic reserves that typical
 			 * GFP_ATOMIC would do. irq_work runs on this cpu and kmalloc
@@ -232,10 +233,10 @@ static void free_all(struct llist_node *llnode, bool percpu)
 
 static void __free_rcu(struct rcu_head *head)
 {
-	struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu);
+	struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu_ttrace);
 
-	free_all(llist_del_all(&c->waiting_for_gp), !!c->percpu_size);
-	atomic_set(&c->call_rcu_in_progress, 0);
+	free_all(llist_del_all(&c->waiting_for_gp_ttrace), !!c->percpu_size);
+	atomic_set(&c->call_rcu_ttrace_in_progress, 0);
 }
 
 static void __free_rcu_tasks_trace(struct rcu_head *head)
@@ -254,32 +255,32 @@ static void enque_to_free(struct bpf_mem_cache *c, void *obj)
 	struct llist_node *llnode = obj;
 
 	/* bpf_mem_cache is a per-cpu object. Freeing happens in irq_work.
-	 * Nothing races to add to free_by_rcu list.
+	 * Nothing races to add to free_by_rcu_ttrace list.
 	 */
-	__llist_add(llnode, &c->free_by_rcu);
+	__llist_add(llnode, &c->free_by_rcu_ttrace);
 }
 
-static void do_call_rcu(struct bpf_mem_cache *c)
+static void do_call_rcu_ttrace(struct bpf_mem_cache *c)
 {
 	struct llist_node *llnode, *t;
 
-	if (atomic_xchg(&c->call_rcu_in_progress, 1))
+	if (atomic_xchg(&c->call_rcu_ttrace_in_progress, 1))
 		return;
 
-	WARN_ON_ONCE(!llist_empty(&c->waiting_for_gp));
-	llist_for_each_safe(llnode, t, __llist_del_all(&c->free_by_rcu))
-		/* There is no concurrent __llist_add(waiting_for_gp) access.
+	WARN_ON_ONCE(!llist_empty(&c->waiting_for_gp_ttrace));
+	llist_for_each_safe(llnode, t, __llist_del_all(&c->free_by_rcu_ttrace))
+		/* There is no concurrent __llist_add(waiting_for_gp_ttrace) access.
 		 * It doesn't race with llist_del_all either.
-		 * But there could be two concurrent llist_del_all(waiting_for_gp):
+		 * But there could be two concurrent llist_del_all(waiting_for_gp_ttrace):
 		 * from __free_rcu() and from drain_mem_cache().
 		 */
-		__llist_add(llnode, &c->waiting_for_gp);
+		__llist_add(llnode, &c->waiting_for_gp_ttrace);
 	/* Use call_rcu_tasks_trace() to wait for sleepable progs to finish.
 	 * If RCU Tasks Trace grace period implies RCU grace period, free
 	 * these elements directly, else use call_rcu() to wait for normal
 	 * progs to finish and finally do free_one() on each element.
 	 */
-	call_rcu_tasks_trace(&c->rcu, __free_rcu_tasks_trace);
+	call_rcu_tasks_trace(&c->rcu_ttrace, __free_rcu_tasks_trace);
 }
 
 static void free_bulk(struct bpf_mem_cache *c)
@@ -307,7 +308,7 @@ static void free_bulk(struct bpf_mem_cache *c)
 	/* and drain free_llist_extra */
 	llist_for_each_safe(llnode, t, llist_del_all(&c->free_llist_extra))
 		enque_to_free(c, llnode);
-	do_call_rcu(c);
+	do_call_rcu_ttrace(c);
 }
 
 static void bpf_mem_refill(struct irq_work *work)
@@ -441,13 +442,13 @@ static void drain_mem_cache(struct bpf_mem_cache *c)
 
 	/* No progs are using this bpf_mem_cache, but htab_map_free() called
 	 * bpf_mem_cache_free() for all remaining elements and they can be in
-	 * free_by_rcu or in waiting_for_gp lists, so drain those lists now.
+	 * free_by_rcu_ttrace or in waiting_for_gp_ttrace lists, so drain those lists now.
 	 *
-	 * Except for waiting_for_gp list, there are no concurrent operations
+	 * Except for waiting_for_gp_ttrace list, there are no concurrent operations
 	 * on these lists, so it is safe to use __llist_del_all().
 	 */
-	free_all(__llist_del_all(&c->free_by_rcu), percpu);
-	free_all(llist_del_all(&c->waiting_for_gp), percpu);
+	free_all(__llist_del_all(&c->free_by_rcu_ttrace), percpu);
+	free_all(llist_del_all(&c->waiting_for_gp_ttrace), percpu);
 	free_all(__llist_del_all(&c->free_llist), percpu);
 	free_all(__llist_del_all(&c->free_llist_extra), percpu);
 }
@@ -462,7 +463,7 @@ static void free_mem_alloc_no_barrier(struct bpf_mem_alloc *ma)
 
 static void free_mem_alloc(struct bpf_mem_alloc *ma)
 {
-	/* waiting_for_gp lists was drained, but __free_rcu might
+	/* waiting_for_gp_ttrace lists was drained, but __free_rcu might
 	 * still execute. Wait for it now before we freeing percpu caches.
 	 *
 	 * rcu_barrier_tasks_trace() doesn't imply synchronize_rcu_tasks_trace(),
@@ -535,7 +536,7 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
 			 */
 			irq_work_sync(&c->refill_work);
 			drain_mem_cache(c);
-			rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
+			rcu_in_progress += atomic_read(&c->call_rcu_ttrace_in_progress);
 		}
 		/* objcg is the same across cpus */
 		if (c->objcg)
@@ -550,7 +551,7 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
 				c = &cc->cache[i];
 				irq_work_sync(&c->refill_work);
 				drain_mem_cache(c);
-				rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
+				rcu_in_progress += atomic_read(&c->call_rcu_ttrace_in_progress);
 			}
 		}
 		if (c->objcg)
-- 
2.34.1



* [PATCH v2 bpf-next 02/13] bpf: Simplify code of destroy_mem_alloc() with kmemdup().
  2023-06-24  3:13 [PATCH v2 bpf-next 00/13] bpf: Introduce bpf_mem_cache_free_rcu() Alexei Starovoitov
  2023-06-24  3:13 ` [PATCH v2 bpf-next 01/13] bpf: Rename few bpf_mem_alloc fields Alexei Starovoitov
@ 2023-06-24  3:13 ` Alexei Starovoitov
  2023-06-24  3:13 ` [PATCH v2 bpf-next 03/13] bpf: Let free_all() return the number of freed elements Alexei Starovoitov
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 37+ messages in thread
From: Alexei Starovoitov @ 2023-06-24  3:13 UTC (permalink / raw)
  To: daniel, andrii, void, houtao, paulmck; +Cc: tj, rcu, netdev, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Use kmemdup() to simplify the code.
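
For context, kmemdup() combines the allocation and the copy; a tiny sketch
of the idiom (generic illustration, not this patch's code):

#include <linux/slab.h>
#include <linux/string.h>
#include <linux/bpf_mem_alloc.h>

/* duplicate a small struct in one call instead of kmalloc() + field copies */
static struct bpf_mem_alloc *dup_ma(const struct bpf_mem_alloc *src)
{
        /* returns NULL on allocation failure, like kmalloc() */
        return kmemdup(src, sizeof(*src), GFP_KERNEL);
}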

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/memalloc.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index cc5b8adb4c83..b0011217be6c 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -499,7 +499,7 @@ static void destroy_mem_alloc(struct bpf_mem_alloc *ma, int rcu_in_progress)
 		return;
 	}
 
-	copy = kmalloc(sizeof(*ma), GFP_KERNEL);
+	copy = kmemdup(ma, sizeof(*ma), GFP_KERNEL);
 	if (!copy) {
 		/* Slow path with inline barrier-s */
 		free_mem_alloc(ma);
@@ -507,10 +507,7 @@ static void destroy_mem_alloc(struct bpf_mem_alloc *ma, int rcu_in_progress)
 	}
 
 	/* Defer barriers into worker to let the rest of map memory to be freed */
-	copy->cache = ma->cache;
-	ma->cache = NULL;
-	copy->caches = ma->caches;
-	ma->caches = NULL;
+	memset(ma, 0, sizeof(*ma));
 	INIT_WORK(&copy->work, free_mem_alloc_deferred);
 	queue_work(system_unbound_wq, &copy->work);
 }
-- 
2.34.1



* [PATCH v2 bpf-next 03/13] bpf: Let free_all() return the number of freed elements.
  2023-06-24  3:13 [PATCH v2 bpf-next 00/13] bpf: Introduce bpf_mem_cache_free_rcu() Alexei Starovoitov
  2023-06-24  3:13 ` [PATCH v2 bpf-next 01/13] bpf: Rename few bpf_mem_alloc fields Alexei Starovoitov
  2023-06-24  3:13 ` [PATCH v2 bpf-next 02/13] bpf: Simplify code of destroy_mem_alloc() with kmemdup() Alexei Starovoitov
@ 2023-06-24  3:13 ` Alexei Starovoitov
  2023-06-24  3:13 ` [PATCH v2 bpf-next 04/13] bpf: Refactor alloc_bulk() Alexei Starovoitov
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 37+ messages in thread
From: Alexei Starovoitov @ 2023-06-24  3:13 UTC (permalink / raw)
  To: daniel, andrii, void, houtao, paulmck; +Cc: tj, rcu, netdev, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Let free_all() helper return the number of freed elements.
It's not used in this patch, but helps in debug/development of bpf_mem_alloc.

For example this diff for __free_rcu():
-       free_all(llist_del_all(&c->waiting_for_gp_ttrace), !!c->percpu_size);
+       printk("cpu %d freed %d objs after tasks trace\n", raw_smp_processor_id(),
+       	free_all(llist_del_all(&c->waiting_for_gp_ttrace), !!c->percpu_size));

would show how busy RCU tasks trace is.
In an artificial benchmark where one cpu is allocating and a different cpu is freeing,
RCU tasks trace won't be able to keep up and the list of objects
will keep growing from thousands to millions, eventually OOMing.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/memalloc.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index b0011217be6c..693651d2648b 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -223,12 +223,16 @@ static void free_one(void *obj, bool percpu)
 	kfree(obj);
 }
 
-static void free_all(struct llist_node *llnode, bool percpu)
+static int free_all(struct llist_node *llnode, bool percpu)
 {
 	struct llist_node *pos, *t;
+	int cnt = 0;
 
-	llist_for_each_safe(pos, t, llnode)
+	llist_for_each_safe(pos, t, llnode) {
 		free_one(pos, percpu);
+		cnt++;
+	}
+	return cnt;
 }
 
 static void __free_rcu(struct rcu_head *head)
-- 
2.34.1



* [PATCH v2 bpf-next 04/13] bpf: Refactor alloc_bulk().
  2023-06-24  3:13 [PATCH v2 bpf-next 00/13] bpf: Introduce bpf_mem_cache_free_rcu() Alexei Starovoitov
                   ` (2 preceding siblings ...)
  2023-06-24  3:13 ` [PATCH v2 bpf-next 03/13] bpf: Let free_all() return the number of freed elements Alexei Starovoitov
@ 2023-06-24  3:13 ` Alexei Starovoitov
  2023-06-24  3:13 ` [PATCH v2 bpf-next 05/13] bpf: Further refactor alloc_bulk() Alexei Starovoitov
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 37+ messages in thread
From: Alexei Starovoitov @ 2023-06-24  3:13 UTC (permalink / raw)
  To: daniel, andrii, void, houtao, paulmck; +Cc: tj, rcu, netdev, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Factor out the inner body of alloc_bulk() into a separate helper.
No functional changes.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/memalloc.c | 46 ++++++++++++++++++++++++-------------------
 1 file changed, 26 insertions(+), 20 deletions(-)

diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 693651d2648b..9693b1f8cbda 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -154,11 +154,35 @@ static struct mem_cgroup *get_memcg(const struct bpf_mem_cache *c)
 #endif
 }
 
+static void add_obj_to_free_list(struct bpf_mem_cache *c, void *obj)
+{
+	unsigned long flags;
+
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		/* In RT irq_work runs in per-cpu kthread, so disable
+		 * interrupts to avoid preemption and interrupts and
+		 * reduce the chance of bpf prog executing on this cpu
+		 * when active counter is busy.
+		 */
+		local_irq_save(flags);
+	/* alloc_bulk runs from irq_work which will not preempt a bpf
+	 * program that does unit_alloc/unit_free since IRQs are
+	 * disabled there. There is no race to increment 'active'
+	 * counter. It protects free_llist from corruption in case NMI
+	 * bpf prog preempted this loop.
+	 */
+	WARN_ON_ONCE(local_inc_return(&c->active) != 1);
+	__llist_add(obj, &c->free_llist);
+	c->free_cnt++;
+	local_dec(&c->active);
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		local_irq_restore(flags);
+}
+
 /* Mostly runs from irq_work except __init phase. */
 static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
 {
 	struct mem_cgroup *memcg = NULL, *old_memcg;
-	unsigned long flags;
 	void *obj;
 	int i;
 
@@ -188,25 +212,7 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
 			if (!obj)
 				break;
 		}
-		if (IS_ENABLED(CONFIG_PREEMPT_RT))
-			/* In RT irq_work runs in per-cpu kthread, so disable
-			 * interrupts to avoid preemption and interrupts and
-			 * reduce the chance of bpf prog executing on this cpu
-			 * when active counter is busy.
-			 */
-			local_irq_save(flags);
-		/* alloc_bulk runs from irq_work which will not preempt a bpf
-		 * program that does unit_alloc/unit_free since IRQs are
-		 * disabled there. There is no race to increment 'active'
-		 * counter. It protects free_llist from corruption in case NMI
-		 * bpf prog preempted this loop.
-		 */
-		WARN_ON_ONCE(local_inc_return(&c->active) != 1);
-		__llist_add(obj, &c->free_llist);
-		c->free_cnt++;
-		local_dec(&c->active);
-		if (IS_ENABLED(CONFIG_PREEMPT_RT))
-			local_irq_restore(flags);
+		add_obj_to_free_list(c, obj);
 	}
 	set_active_memcg(old_memcg);
 	mem_cgroup_put(memcg);
-- 
2.34.1



* [PATCH v2 bpf-next 05/13] bpf: Further refactor alloc_bulk().
  2023-06-24  3:13 [PATCH v2 bpf-next 00/13] bpf: Introduce bpf_mem_cache_free_rcu() Alexei Starovoitov
                   ` (3 preceding siblings ...)
  2023-06-24  3:13 ` [PATCH v2 bpf-next 04/13] bpf: Refactor alloc_bulk() Alexei Starovoitov
@ 2023-06-24  3:13 ` Alexei Starovoitov
  2023-06-24  3:13 ` [PATCH v2 bpf-next 06/13] bpf: Optimize moving objects from free_by_rcu_ttrace to waiting_for_gp_ttrace Alexei Starovoitov
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 37+ messages in thread
From: Alexei Starovoitov @ 2023-06-24  3:13 UTC (permalink / raw)
  To: daniel, andrii, void, houtao, paulmck; +Cc: tj, rcu, netdev, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

In certain scenarios alloc_bulk() might be taking free objects mainly from the
free_by_rcu_ttrace list. In such a case get_memcg() and set_active_memcg() are
redundant, but they show up in the perf profile. Split the loop and only set the memcg
when allocating from slab. There is no performance difference from this patch alone, but
it helps in combination with further patches.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/memalloc.c | 30 ++++++++++++++++++------------
 1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 9693b1f8cbda..b07368d77343 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -186,8 +186,6 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
 	void *obj;
 	int i;
 
-	memcg = get_memcg(c);
-	old_memcg = set_active_memcg(memcg);
 	for (i = 0; i < cnt; i++) {
 		/*
 		 * free_by_rcu_ttrace is only manipulated by irq work refill_work().
@@ -202,16 +200,24 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
 		 * numa node and it is not a guarantee.
 		 */
 		obj = __llist_del_first(&c->free_by_rcu_ttrace);
-		if (!obj) {
-			/* Allocate, but don't deplete atomic reserves that typical
-			 * GFP_ATOMIC would do. irq_work runs on this cpu and kmalloc
-			 * will allocate from the current numa node which is what we
-			 * want here.
-			 */
-			obj = __alloc(c, node, GFP_NOWAIT | __GFP_NOWARN | __GFP_ACCOUNT);
-			if (!obj)
-				break;
-		}
+		if (!obj)
+			break;
+		add_obj_to_free_list(c, obj);
+	}
+	if (i >= cnt)
+		return;
+
+	memcg = get_memcg(c);
+	old_memcg = set_active_memcg(memcg);
+	for (; i < cnt; i++) {
+		/* Allocate, but don't deplete atomic reserves that typical
+		 * GFP_ATOMIC would do. irq_work runs on this cpu and kmalloc
+		 * will allocate from the current numa node which is what we
+		 * want here.
+		 */
+		obj = __alloc(c, node, GFP_NOWAIT | __GFP_NOWARN | __GFP_ACCOUNT);
+		if (!obj)
+			break;
 		add_obj_to_free_list(c, obj);
 	}
 	set_active_memcg(old_memcg);
-- 
2.34.1



* [PATCH v2 bpf-next 06/13] bpf: Optimize moving objects from free_by_rcu_ttrace to waiting_for_gp_ttrace.
  2023-06-24  3:13 [PATCH v2 bpf-next 00/13] bpf: Introduce bpf_mem_cache_free_rcu() Alexei Starovoitov
                   ` (4 preceding siblings ...)
  2023-06-24  3:13 ` [PATCH v2 bpf-next 05/13] bpf: Further refactor alloc_bulk() Alexei Starovoitov
@ 2023-06-24  3:13 ` Alexei Starovoitov
  2023-06-24  3:13 ` [PATCH v2 bpf-next 07/13] bpf: Change bpf_mem_cache draining process Alexei Starovoitov
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 37+ messages in thread
From: Alexei Starovoitov @ 2023-06-24  3:13 UTC (permalink / raw)
  To: daniel, andrii, void, houtao, paulmck; +Cc: tj, rcu, netdev, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Optimize moving objects from free_by_rcu_ttrace to waiting_for_gp_ttrace
by remembering the tail.
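
The idiom, in a simplified sketch (placeholder names, not the patch code):
track the last node whenever the list transitions from empty to non-empty,
then splice the whole chain in O(1) with __llist_add_batch() instead of
walking it node by node:

#include <linux/llist.h>

static LLIST_HEAD(src);
static struct llist_node *src_tail;
static LLIST_HEAD(dst);

static void push(struct llist_node *n)
{
        if (__llist_add(n, &src))       /* true when 'src' was empty */
                src_tail = n;
}

static void splice_all_to_dst(void)
{
        struct llist_node *first = __llist_del_all(&src);

        if (first)
                __llist_add_batch(first, src_tail, &dst);
}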

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/memalloc.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index b07368d77343..4fd79bd51f5a 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -101,6 +101,7 @@ struct bpf_mem_cache {
 
 	/* list of objects to be freed after RCU tasks trace GP */
 	struct llist_head free_by_rcu_ttrace;
+	struct llist_node *free_by_rcu_ttrace_tail;
 	struct llist_head waiting_for_gp_ttrace;
 	struct rcu_head rcu_ttrace;
 	atomic_t call_rcu_ttrace_in_progress;
@@ -273,24 +274,27 @@ static void enque_to_free(struct bpf_mem_cache *c, void *obj)
 	/* bpf_mem_cache is a per-cpu object. Freeing happens in irq_work.
 	 * Nothing races to add to free_by_rcu_ttrace list.
 	 */
-	__llist_add(llnode, &c->free_by_rcu_ttrace);
+	if (__llist_add(llnode, &c->free_by_rcu_ttrace))
+		c->free_by_rcu_ttrace_tail = llnode;
 }
 
 static void do_call_rcu_ttrace(struct bpf_mem_cache *c)
 {
-	struct llist_node *llnode, *t;
+	struct llist_node *llnode;
 
 	if (atomic_xchg(&c->call_rcu_ttrace_in_progress, 1))
 		return;
 
 	WARN_ON_ONCE(!llist_empty(&c->waiting_for_gp_ttrace));
-	llist_for_each_safe(llnode, t, __llist_del_all(&c->free_by_rcu_ttrace))
+	llnode = __llist_del_all(&c->free_by_rcu_ttrace);
+	if (llnode)
 		/* There is no concurrent __llist_add(waiting_for_gp_ttrace) access.
 		 * It doesn't race with llist_del_all either.
 		 * But there could be two concurrent llist_del_all(waiting_for_gp_ttrace):
 		 * from __free_rcu() and from drain_mem_cache().
 		 */
-		__llist_add(llnode, &c->waiting_for_gp_ttrace);
+		__llist_add_batch(llnode, c->free_by_rcu_ttrace_tail,
+				  &c->waiting_for_gp_ttrace);
 	/* Use call_rcu_tasks_trace() to wait for sleepable progs to finish.
 	 * If RCU Tasks Trace grace period implies RCU grace period, free
 	 * these elements directly, else use call_rcu() to wait for normal
-- 
2.34.1



* [PATCH v2 bpf-next 07/13] bpf: Change bpf_mem_cache draining process.
  2023-06-24  3:13 [PATCH v2 bpf-next 00/13] bpf: Introduce bpf_mem_cache_free_rcu() Alexei Starovoitov
                   ` (5 preceding siblings ...)
  2023-06-24  3:13 ` [PATCH v2 bpf-next 06/13] bpf: Optimize moving objects from free_by_rcu_ttrace to waiting_for_gp_ttrace Alexei Starovoitov
@ 2023-06-24  3:13 ` Alexei Starovoitov
  2023-06-24  3:13 ` [PATCH v2 bpf-next 08/13] bpf: Add a hint to allocated objects Alexei Starovoitov
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 37+ messages in thread
From: Alexei Starovoitov @ 2023-06-24  3:13 UTC (permalink / raw)
  To: daniel, andrii, void, houtao, paulmck; +Cc: tj, rcu, netdev, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

The next patch will introduce cross-cpu llist access and the existing
irq_work_sync() + drain_mem_cache() + rcu_barrier_tasks_trace() mechanism will
not be enough, since irq_work_sync() + drain_mem_cache() on cpu A won't
guarantee that the llists on cpu A are empty. free_bulk() on cpu B might add
objects back to the llists of cpu A. Add a 'bool draining' flag and set it on all
cpus before proceeding with irq_work_sync().
The modified sequence looks like:
for_each_cpu:
  WRITE_ONCE(c->draining, true); // make RCU callback a nop
  irq_work_sync(); // wait for irq_work callback (free_bulk) to finish
for_each_cpu:
  drain_mem_cache(); // free all objects
rcu_barrier_tasks_trace(); // wait for RCU callbacks to execute as a nop
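
A rough C rendering of that sequence (sketch only; in the patch the two loops
live in bpf_mem_alloc_destroy() and the final barrier is issued later via
free_mem_alloc()):

static void drain_sketch(struct bpf_mem_alloc *ma)
{
        struct bpf_mem_cache *c;
        int cpu;

        for_each_possible_cpu(cpu) {
                c = per_cpu_ptr(ma->cache, cpu);
                WRITE_ONCE(c->draining, true);  /* RCU callbacks become nops */
                irq_work_sync(&c->refill_work); /* wait for free_bulk() */
        }
        for_each_possible_cpu(cpu) {
                c = per_cpu_ptr(ma->cache, cpu);
                drain_mem_cache(c);             /* free all queued objects */
        }
        rcu_barrier_tasks_trace();              /* flush the now-nop callbacks */
}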

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/memalloc.c | 24 +++++++++++++++---------
 1 file changed, 15 insertions(+), 9 deletions(-)

diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 4fd79bd51f5a..d68a854f45ee 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -98,6 +98,7 @@ struct bpf_mem_cache {
 	int free_cnt;
 	int low_watermark, high_watermark, batch;
 	int percpu_size;
+	bool draining;
 
 	/* list of objects to be freed after RCU tasks trace GP */
 	struct llist_head free_by_rcu_ttrace;
@@ -252,7 +253,10 @@ static void __free_rcu(struct rcu_head *head)
 {
 	struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu_ttrace);
 
+	if (unlikely(READ_ONCE(c->draining)))
+		goto out;
 	free_all(llist_del_all(&c->waiting_for_gp_ttrace), !!c->percpu_size);
+out:
 	atomic_set(&c->call_rcu_ttrace_in_progress, 0);
 }
 
@@ -542,16 +546,11 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
 		rcu_in_progress = 0;
 		for_each_possible_cpu(cpu) {
 			c = per_cpu_ptr(ma->cache, cpu);
-			/*
-			 * refill_work may be unfinished for PREEMPT_RT kernel
-			 * in which irq work is invoked in a per-CPU RT thread.
-			 * It is also possible for kernel with
-			 * arch_irq_work_has_interrupt() being false and irq
-			 * work is invoked in timer interrupt. So waiting for
-			 * the completion of irq work to ease the handling of
-			 * concurrency.
-			 */
+			WRITE_ONCE(c->draining, true);
 			irq_work_sync(&c->refill_work);
+		}
+		for_each_possible_cpu(cpu) {
+			c = per_cpu_ptr(ma->cache, cpu);
 			drain_mem_cache(c);
 			rcu_in_progress += atomic_read(&c->call_rcu_ttrace_in_progress);
 		}
@@ -566,7 +565,14 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
 			cc = per_cpu_ptr(ma->caches, cpu);
 			for (i = 0; i < NUM_CACHES; i++) {
 				c = &cc->cache[i];
+				WRITE_ONCE(c->draining, true);
 				irq_work_sync(&c->refill_work);
+			}
+		}
+		for_each_possible_cpu(cpu) {
+			cc = per_cpu_ptr(ma->caches, cpu);
+			for (i = 0; i < NUM_CACHES; i++) {
+				c = &cc->cache[i];
 				drain_mem_cache(c);
 				rcu_in_progress += atomic_read(&c->call_rcu_ttrace_in_progress);
 			}
-- 
2.34.1



* [PATCH v2 bpf-next 08/13] bpf: Add a hint to allocated objects.
  2023-06-24  3:13 [PATCH v2 bpf-next 00/13] bpf: Introduce bpf_mem_cache_free_rcu() Alexei Starovoitov
                   ` (6 preceding siblings ...)
  2023-06-24  3:13 ` [PATCH v2 bpf-next 07/13] bpf: Change bpf_mem_cache draining process Alexei Starovoitov
@ 2023-06-24  3:13 ` Alexei Starovoitov
  2023-06-24  3:13 ` [PATCH v2 bpf-next 09/13] bpf: Allow reuse from waiting_for_gp_ttrace list Alexei Starovoitov
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 37+ messages in thread
From: Alexei Starovoitov @ 2023-06-24  3:13 UTC (permalink / raw)
  To: daniel, andrii, void, houtao, paulmck; +Cc: tj, rcu, netdev, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

To address the OOM issue when one cpu is allocating and another cpu is freeing, add
a target bpf_mem_cache hint to allocated objects and, when the local cpu free_llist
overflows, free to that bpf_mem_cache. The hint addresses the OOM while
maintaining the same performance for the common case when alloc/free are done on the
same cpu.
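
The mechanism, as a generic illustration (userspace-style C with made-up
names; the patch does the same thing with the hidden llist_node slot kept in
front of every bpf_mem_alloc object):

#include <stddef.h>

struct cache;                           /* stands in for bpf_mem_cache */

struct hdr {                            /* hidden header (llist_node slot) */
        union {
                struct hdr *next;       /* used while on a free list */
                struct cache *owner;    /* used while handed out */
        };
};

static void *obj_from_hdr(struct hdr *h) { return h + 1; }
static struct hdr *hdr_from_obj(void *p) { return (struct hdr *)p - 1; }

/* alloc path: stash the allocating cache in the header before handing out */
static void *cache_alloc(struct cache *c, struct hdr *h)
{
        h->owner = c;
        return obj_from_hdr(h);
}

/* free path: read the hint back; overflowed objects are then queued to the
 * allocating cache instead of the freeing cpu's cache
 */
static struct cache *free_target(void *p)
{
        return hdr_from_obj(p)->owner;
}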

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/memalloc.c | 46 ++++++++++++++++++++++++++-----------------
 1 file changed, 28 insertions(+), 18 deletions(-)

diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index d68a854f45ee..692a9a30c1dc 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -99,6 +99,7 @@ struct bpf_mem_cache {
 	int low_watermark, high_watermark, batch;
 	int percpu_size;
 	bool draining;
+	struct bpf_mem_cache *tgt;
 
 	/* list of objects to be freed after RCU tasks trace GP */
 	struct llist_head free_by_rcu_ttrace;
@@ -190,18 +191,11 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
 
 	for (i = 0; i < cnt; i++) {
 		/*
-		 * free_by_rcu_ttrace is only manipulated by irq work refill_work().
-		 * IRQ works on the same CPU are called sequentially, so it is
-		 * safe to use __llist_del_first() here. If alloc_bulk() is
-		 * invoked by the initial prefill, there will be no running
-		 * refill_work(), so __llist_del_first() is fine as well.
-		 *
-		 * In most cases, objects on free_by_rcu_ttrace are from the same CPU.
-		 * If some objects come from other CPUs, it doesn't incur any
-		 * harm because NUMA_NO_NODE means the preference for current
-		 * numa node and it is not a guarantee.
+		 * For every 'c' llist_del_first(&c->free_by_rcu_ttrace); is
+		 * done only by one CPU == current CPU. Other CPUs might
+		 * llist_add() and llist_del_all() in parallel.
 		 */
-		obj = __llist_del_first(&c->free_by_rcu_ttrace);
+		obj = llist_del_first(&c->free_by_rcu_ttrace);
 		if (!obj)
 			break;
 		add_obj_to_free_list(c, obj);
@@ -278,7 +272,7 @@ static void enque_to_free(struct bpf_mem_cache *c, void *obj)
 	/* bpf_mem_cache is a per-cpu object. Freeing happens in irq_work.
 	 * Nothing races to add to free_by_rcu_ttrace list.
 	 */
-	if (__llist_add(llnode, &c->free_by_rcu_ttrace))
+	if (llist_add(llnode, &c->free_by_rcu_ttrace))
 		c->free_by_rcu_ttrace_tail = llnode;
 }
 
@@ -290,7 +284,7 @@ static void do_call_rcu_ttrace(struct bpf_mem_cache *c)
 		return;
 
 	WARN_ON_ONCE(!llist_empty(&c->waiting_for_gp_ttrace));
-	llnode = __llist_del_all(&c->free_by_rcu_ttrace);
+	llnode = llist_del_all(&c->free_by_rcu_ttrace);
 	if (llnode)
 		/* There is no concurrent __llist_add(waiting_for_gp_ttrace) access.
 		 * It doesn't race with llist_del_all either.
@@ -303,16 +297,22 @@ static void do_call_rcu_ttrace(struct bpf_mem_cache *c)
 	 * If RCU Tasks Trace grace period implies RCU grace period, free
 	 * these elements directly, else use call_rcu() to wait for normal
 	 * progs to finish and finally do free_one() on each element.
+	 *
+	 * call_rcu_tasks_trace() enqueues to a global queue, so it's ok
+	 * that current cpu bpf_mem_cache != target bpf_mem_cache.
 	 */
 	call_rcu_tasks_trace(&c->rcu_ttrace, __free_rcu_tasks_trace);
 }
 
 static void free_bulk(struct bpf_mem_cache *c)
 {
+	struct bpf_mem_cache *tgt = c->tgt;
 	struct llist_node *llnode, *t;
 	unsigned long flags;
 	int cnt;
 
+	WARN_ON_ONCE(tgt->unit_size != c->unit_size);
+
 	do {
 		if (IS_ENABLED(CONFIG_PREEMPT_RT))
 			local_irq_save(flags);
@@ -326,13 +326,13 @@ static void free_bulk(struct bpf_mem_cache *c)
 		if (IS_ENABLED(CONFIG_PREEMPT_RT))
 			local_irq_restore(flags);
 		if (llnode)
-			enque_to_free(c, llnode);
+			enque_to_free(tgt, llnode);
 	} while (cnt > (c->high_watermark + c->low_watermark) / 2);
 
 	/* and drain free_llist_extra */
 	llist_for_each_safe(llnode, t, llist_del_all(&c->free_llist_extra))
-		enque_to_free(c, llnode);
-	do_call_rcu_ttrace(c);
+		enque_to_free(tgt, llnode);
+	do_call_rcu_ttrace(tgt);
 }
 
 static void bpf_mem_refill(struct irq_work *work)
@@ -431,6 +431,7 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu)
 			c->unit_size = unit_size;
 			c->objcg = objcg;
 			c->percpu_size = percpu_size;
+			c->tgt = c;
 			prefill_mem_cache(c, cpu);
 		}
 		ma->cache = pc;
@@ -453,6 +454,7 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu)
 			c = &cc->cache[i];
 			c->unit_size = sizes[i];
 			c->objcg = objcg;
+			c->tgt = c;
 			prefill_mem_cache(c, cpu);
 		}
 	}
@@ -471,7 +473,7 @@ static void drain_mem_cache(struct bpf_mem_cache *c)
 	 * Except for waiting_for_gp_ttrace list, there are no concurrent operations
 	 * on these lists, so it is safe to use __llist_del_all().
 	 */
-	free_all(__llist_del_all(&c->free_by_rcu_ttrace), percpu);
+	free_all(llist_del_all(&c->free_by_rcu_ttrace), percpu);
 	free_all(llist_del_all(&c->waiting_for_gp_ttrace), percpu);
 	free_all(__llist_del_all(&c->free_llist), percpu);
 	free_all(__llist_del_all(&c->free_llist_extra), percpu);
@@ -605,8 +607,10 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c)
 	local_irq_save(flags);
 	if (local_inc_return(&c->active) == 1) {
 		llnode = __llist_del_first(&c->free_llist);
-		if (llnode)
+		if (llnode) {
 			cnt = --c->free_cnt;
+			*(struct bpf_mem_cache **)llnode = c;
+		}
 	}
 	local_dec(&c->active);
 	local_irq_restore(flags);
@@ -630,6 +634,12 @@ static void notrace unit_free(struct bpf_mem_cache *c, void *ptr)
 
 	BUILD_BUG_ON(LLIST_NODE_SZ > 8);
 
+	/*
+	 * Remember bpf_mem_cache that allocated this object.
+	 * The hint is not accurate.
+	 */
+	c->tgt = *(struct bpf_mem_cache **)llnode;
+
 	local_irq_save(flags);
 	if (local_inc_return(&c->active) == 1) {
 		__llist_add(llnode, &c->free_llist);
-- 
2.34.1



* [PATCH v2 bpf-next 09/13] bpf: Allow reuse from waiting_for_gp_ttrace list.
  2023-06-24  3:13 [PATCH v2 bpf-next 00/13] bpf: Introduce bpf_mem_cache_free_rcu() Alexei Starovoitov
                   ` (7 preceding siblings ...)
  2023-06-24  3:13 ` [PATCH v2 bpf-next 08/13] bpf: Add a hint to allocated objects Alexei Starovoitov
@ 2023-06-24  3:13 ` Alexei Starovoitov
  2023-06-26  3:30   ` Hou Tao
  2023-06-24  3:13 ` [PATCH v2 bpf-next 10/13] rcu: Export rcu_request_urgent_qs_task() Alexei Starovoitov
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 37+ messages in thread
From: Alexei Starovoitov @ 2023-06-24  3:13 UTC (permalink / raw)
  To: daniel, andrii, void, houtao, paulmck; +Cc: tj, rcu, netdev, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

alloc_bulk() can reuse elements from free_by_rcu_ttrace.
Let it reuse from waiting_for_gp_ttrace as well to avoid unnecessary kmalloc().

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/memalloc.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 692a9a30c1dc..666917c16e87 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -203,6 +203,15 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
 	if (i >= cnt)
 		return;
 
+	for (; i < cnt; i++) {
+		obj = llist_del_first(&c->waiting_for_gp_ttrace);
+		if (!obj)
+			break;
+		add_obj_to_free_list(c, obj);
+	}
+	if (i >= cnt)
+		return;
+
 	memcg = get_memcg(c);
 	old_memcg = set_active_memcg(memcg);
 	for (; i < cnt; i++) {
-- 
2.34.1



* [PATCH v2 bpf-next 10/13] rcu: Export rcu_request_urgent_qs_task()
  2023-06-24  3:13 [PATCH v2 bpf-next 00/13] bpf: Introduce bpf_mem_cache_free_rcu() Alexei Starovoitov
                   ` (8 preceding siblings ...)
  2023-06-24  3:13 ` [PATCH v2 bpf-next 09/13] bpf: Allow reuse from waiting_for_gp_ttrace list Alexei Starovoitov
@ 2023-06-24  3:13 ` Alexei Starovoitov
  2023-06-24  3:13 ` [PATCH v2 bpf-next 11/13] selftests/bpf: Improve test coverage of bpf_mem_alloc Alexei Starovoitov
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 37+ messages in thread
From: Alexei Starovoitov @ 2023-06-24  3:13 UTC (permalink / raw)
  To: daniel, andrii, void, houtao, paulmck; +Cc: tj, rcu, netdev, bpf, kernel-team

From: "Paul E. McKenney" <paulmck@kernel.org>

If a CPU is executing a long series of non-sleeping system calls,
RCU grace periods can be delayed for on the order of a couple hundred
milliseconds.  This is normally not a problem, but if each system call
does a call_rcu(), those callbacks can stack up.  RCU will eventually
notice this callback storm, but use of rcu_request_urgent_qs_task()
allows the code invoking call_rcu() to give RCU a heads up.

This function is not for general use, not yet, anyway.
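
For concreteness, this is how patch 12 below uses the export (abridged from
check_free_by_rcu(); only the shape matters here):

	if (atomic_xchg(&c->call_rcu_in_progress, 1)) {
		/* previous callback still pending: don't queue another one,
		 * just ask RCU to finish the current grace period sooner
		 */
		rcu_request_urgent_qs_task(current);
		return;
	}
	...
	call_rcu_hurry(&c->rcu, __free_by_rcu);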

Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/rcutiny.h | 2 ++
 include/linux/rcutree.h | 1 +
 kernel/rcu/rcu.h        | 2 --
 3 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
index 7f17acf29dda..7b949292908a 100644
--- a/include/linux/rcutiny.h
+++ b/include/linux/rcutiny.h
@@ -138,6 +138,8 @@ static inline int rcu_needs_cpu(void)
 	return 0;
 }
 
+static inline void rcu_request_urgent_qs_task(struct task_struct *t) { }
+
 /*
  * Take advantage of the fact that there is only one CPU, which
  * allows us to ignore virtualization-based context switches.
diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
index 56bccb5a8fde..126f6b418f6a 100644
--- a/include/linux/rcutree.h
+++ b/include/linux/rcutree.h
@@ -21,6 +21,7 @@ void rcu_softirq_qs(void);
 void rcu_note_context_switch(bool preempt);
 int rcu_needs_cpu(void);
 void rcu_cpu_stall_reset(void);
+void rcu_request_urgent_qs_task(struct task_struct *t);
 
 /*
  * Note a virtualization-based context switch.  This is simply a
diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index 4a1b9622598b..6f5fb3f7ebf3 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -493,7 +493,6 @@ static inline void rcu_expedite_gp(void) { }
 static inline void rcu_unexpedite_gp(void) { }
 static inline void rcu_async_hurry(void) { }
 static inline void rcu_async_relax(void) { }
-static inline void rcu_request_urgent_qs_task(struct task_struct *t) { }
 #else /* #ifdef CONFIG_TINY_RCU */
 bool rcu_gp_is_normal(void);     /* Internal RCU use. */
 bool rcu_gp_is_expedited(void);  /* Internal RCU use. */
@@ -508,7 +507,6 @@ void show_rcu_tasks_gp_kthreads(void);
 #else /* #ifdef CONFIG_TASKS_RCU_GENERIC */
 static inline void show_rcu_tasks_gp_kthreads(void) {}
 #endif /* #else #ifdef CONFIG_TASKS_RCU_GENERIC */
-void rcu_request_urgent_qs_task(struct task_struct *t);
 #endif /* #else #ifdef CONFIG_TINY_RCU */
 
 #define RCU_SCHEDULER_INACTIVE	0
-- 
2.34.1



* [PATCH v2 bpf-next 11/13] selftests/bpf: Improve test coverage of bpf_mem_alloc.
  2023-06-24  3:13 [PATCH v2 bpf-next 00/13] bpf: Introduce bpf_mem_cache_free_rcu() Alexei Starovoitov
                   ` (9 preceding siblings ...)
  2023-06-24  3:13 ` [PATCH v2 bpf-next 10/13] rcu: Export rcu_request_urgent_qs_task() Alexei Starovoitov
@ 2023-06-24  3:13 ` Alexei Starovoitov
  2023-06-24  3:13 ` [PATCH v2 bpf-next 12/13] bpf: Introduce bpf_mem_free_rcu() similar to kfree_rcu() Alexei Starovoitov
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 37+ messages in thread
From: Alexei Starovoitov @ 2023-06-24  3:13 UTC (permalink / raw)
  To: daniel, andrii, void, houtao, paulmck; +Cc: tj, rcu, netdev, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

bpf_obj_new() calls bpf_mem_alloc(), but doing alloc/free of 8 elements
does not trigger the watermark conditions in bpf_mem_alloc.
Increase to 200 elements to make sure alloc_bulk/free_bulk is exercised.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 tools/testing/selftests/bpf/progs/linked_list.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/progs/linked_list.c b/tools/testing/selftests/bpf/progs/linked_list.c
index 57440a554304..84d1777a9e6c 100644
--- a/tools/testing/selftests/bpf/progs/linked_list.c
+++ b/tools/testing/selftests/bpf/progs/linked_list.c
@@ -96,7 +96,7 @@ static __always_inline
 int list_push_pop_multiple(struct bpf_spin_lock *lock, struct bpf_list_head *head, bool leave_in_map)
 {
 	struct bpf_list_node *n;
-	struct foo *f[8], *pf;
+	struct foo *f[200], *pf;
 	int i;
 
 	/* Loop following this check adds nodes 2-at-a-time in order to
-- 
2.34.1



* [PATCH v2 bpf-next 12/13] bpf: Introduce bpf_mem_free_rcu() similar to kfree_rcu().
  2023-06-24  3:13 [PATCH v2 bpf-next 00/13] bpf: Introduce bpf_mem_cache_free_rcu() Alexei Starovoitov
                   ` (10 preceding siblings ...)
  2023-06-24  3:13 ` [PATCH v2 bpf-next 11/13] selftests/bpf: Improve test coverage of bpf_mem_alloc Alexei Starovoitov
@ 2023-06-24  3:13 ` Alexei Starovoitov
  2023-06-24  6:49   ` Hou Tao
                     ` (3 more replies)
  2023-06-24  3:13 ` [PATCH v2 bpf-next 13/13] bpf: Convert bpf_cpumask to bpf_mem_cache_free_rcu Alexei Starovoitov
  2023-06-24  7:09 ` [PATCH v2 bpf-next 00/13] bpf: Introduce bpf_mem_cache_free_rcu() Hou Tao
  13 siblings, 4 replies; 37+ messages in thread
From: Alexei Starovoitov @ 2023-06-24  3:13 UTC (permalink / raw)
  To: daniel, andrii, void, houtao, paulmck; +Cc: tj, rcu, netdev, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Introduce bpf_mem_[cache_]free_rcu() similar to kfree_rcu().
Unlike bpf_mem_[cache_]free(), which links objects for immediate reuse into
a per-cpu free list, the _rcu() flavor waits for an RCU grace period and then moves
objects into the free_by_rcu_ttrace list, where they wait for an RCU
tasks trace grace period before being freed into slab.

The life cycle of objects:
alloc: dequeue free_llist
free: enqueue free_llist
free_rcu: enqueue free_by_rcu -> waiting_for_gp
free_llist above high watermark -> free_by_rcu_ttrace
after RCU GP waiting_for_gp -> free_by_rcu_ttrace
free_by_rcu_ttrace -> waiting_for_gp_ttrace -> slab
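
Walking one object X through those lists, with the function names from this
patch (irq_work raised as needed in between):

bpf_mem_[cache_]free_rcu(X): unit_free_rcu() puts X on free_by_rcu
                             (or free_llist_extra_rcu if 'active' is busy)
irq_work -> check_free_by_rcu(): X moves to waiting_for_gp, call_rcu_hurry()
regular RCU GP elapses -> __free_by_rcu(): X moves to tgt->free_by_rcu_ttrace
do_call_rcu_ttrace(): X moves to waiting_for_gp_ttrace, call_rcu_tasks_trace()
RCU tasks trace GP elapses: free_all() returns X to slab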

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/bpf_mem_alloc.h |   2 +
 kernel/bpf/memalloc.c         | 117 +++++++++++++++++++++++++++++++++-
 2 files changed, 117 insertions(+), 2 deletions(-)

diff --git a/include/linux/bpf_mem_alloc.h b/include/linux/bpf_mem_alloc.h
index 3929be5743f4..d644bbb298af 100644
--- a/include/linux/bpf_mem_alloc.h
+++ b/include/linux/bpf_mem_alloc.h
@@ -27,10 +27,12 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma);
 /* kmalloc/kfree equivalent: */
 void *bpf_mem_alloc(struct bpf_mem_alloc *ma, size_t size);
 void bpf_mem_free(struct bpf_mem_alloc *ma, void *ptr);
+void bpf_mem_free_rcu(struct bpf_mem_alloc *ma, void *ptr);
 
 /* kmem_cache_alloc/free equivalent: */
 void *bpf_mem_cache_alloc(struct bpf_mem_alloc *ma);
 void bpf_mem_cache_free(struct bpf_mem_alloc *ma, void *ptr);
+void bpf_mem_cache_free_rcu(struct bpf_mem_alloc *ma, void *ptr);
 void bpf_mem_cache_raw_free(void *ptr);
 void *bpf_mem_cache_alloc_flags(struct bpf_mem_alloc *ma, gfp_t flags);
 
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 666917c16e87..dc144c54d502 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -101,6 +101,15 @@ struct bpf_mem_cache {
 	bool draining;
 	struct bpf_mem_cache *tgt;
 
+	/* list of objects to be freed after RCU GP */
+	struct llist_head free_by_rcu;
+	struct llist_node *free_by_rcu_tail;
+	struct llist_head waiting_for_gp;
+	struct llist_node *waiting_for_gp_tail;
+	struct rcu_head rcu;
+	atomic_t call_rcu_in_progress;
+	struct llist_head free_llist_extra_rcu;
+
 	/* list of objects to be freed after RCU tasks trace GP */
 	struct llist_head free_by_rcu_ttrace;
 	struct llist_node *free_by_rcu_ttrace_tail;
@@ -344,6 +353,60 @@ static void free_bulk(struct bpf_mem_cache *c)
 	do_call_rcu_ttrace(tgt);
 }
 
+static void __free_by_rcu(struct rcu_head *head)
+{
+	struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu);
+	struct bpf_mem_cache *tgt = c->tgt;
+	struct llist_node *llnode;
+
+	if (unlikely(READ_ONCE(c->draining)))
+		goto out;
+
+	llnode = llist_del_all(&c->waiting_for_gp);
+	if (!llnode)
+		goto out;
+
+	if (llist_add_batch(llnode, c->waiting_for_gp_tail, &tgt->free_by_rcu_ttrace))
+		tgt->free_by_rcu_ttrace_tail = c->waiting_for_gp_tail;
+
+	/* Objects went through regular RCU GP. Send them to RCU tasks trace */
+	do_call_rcu_ttrace(tgt);
+out:
+	atomic_set(&c->call_rcu_in_progress, 0);
+}
+
+static void check_free_by_rcu(struct bpf_mem_cache *c)
+{
+	struct llist_node *llnode, *t;
+
+	if (llist_empty(&c->free_by_rcu) && llist_empty(&c->free_llist_extra_rcu))
+		return;
+
+	/* drain free_llist_extra_rcu */
+	llist_for_each_safe(llnode, t, llist_del_all(&c->free_llist_extra_rcu))
+		if (__llist_add(llnode, &c->free_by_rcu))
+			c->free_by_rcu_tail = llnode;
+
+	if (atomic_xchg(&c->call_rcu_in_progress, 1)) {
+		/*
+		 * Instead of kmalloc-ing new rcu_head and triggering 10k
+		 * call_rcu() to hit rcutree.qhimark and force RCU to notice
+		 * the overload just ask RCU to hurry up. There could be many
+		 * objects in free_by_rcu list.
+		 * This hint reduces memory consumption for an artifical
+		 * benchmark from 2 Gbyte to 150 Mbyte.
+		 */
+		rcu_request_urgent_qs_task(current);
+		return;
+	}
+
+	WARN_ON_ONCE(!llist_empty(&c->waiting_for_gp));
+
+	WRITE_ONCE(c->waiting_for_gp.first, __llist_del_all(&c->free_by_rcu));
+	c->waiting_for_gp_tail = c->free_by_rcu_tail;
+	call_rcu_hurry(&c->rcu, __free_by_rcu);
+}
+
 static void bpf_mem_refill(struct irq_work *work)
 {
 	struct bpf_mem_cache *c = container_of(work, struct bpf_mem_cache, refill_work);
@@ -358,6 +421,8 @@ static void bpf_mem_refill(struct irq_work *work)
 		alloc_bulk(c, c->batch, NUMA_NO_NODE);
 	else if (cnt > c->high_watermark)
 		free_bulk(c);
+
+	check_free_by_rcu(c);
 }
 
 static void notrace irq_work_raise(struct bpf_mem_cache *c)
@@ -486,6 +551,9 @@ static void drain_mem_cache(struct bpf_mem_cache *c)
 	free_all(llist_del_all(&c->waiting_for_gp_ttrace), percpu);
 	free_all(__llist_del_all(&c->free_llist), percpu);
 	free_all(__llist_del_all(&c->free_llist_extra), percpu);
+	free_all(__llist_del_all(&c->free_by_rcu), percpu);
+	free_all(__llist_del_all(&c->free_llist_extra_rcu), percpu);
+	free_all(llist_del_all(&c->waiting_for_gp), percpu);
 }
 
 static void free_mem_alloc_no_barrier(struct bpf_mem_alloc *ma)
@@ -498,8 +566,8 @@ static void free_mem_alloc_no_barrier(struct bpf_mem_alloc *ma)
 
 static void free_mem_alloc(struct bpf_mem_alloc *ma)
 {
-	/* waiting_for_gp_ttrace lists was drained, but __free_rcu might
-	 * still execute. Wait for it now before we freeing percpu caches.
+	/* waiting_for_gp[_ttrace] lists were drained, but RCU callbacks
+	 * might still execute. Wait for them.
 	 *
 	 * rcu_barrier_tasks_trace() doesn't imply synchronize_rcu_tasks_trace(),
 	 * but rcu_barrier_tasks_trace() and rcu_barrier() below are only used
@@ -564,6 +632,7 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
 			c = per_cpu_ptr(ma->cache, cpu);
 			drain_mem_cache(c);
 			rcu_in_progress += atomic_read(&c->call_rcu_ttrace_in_progress);
+			rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
 		}
 		/* objcg is the same across cpus */
 		if (c->objcg)
@@ -586,6 +655,7 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
 				c = &cc->cache[i];
 				drain_mem_cache(c);
 				rcu_in_progress += atomic_read(&c->call_rcu_ttrace_in_progress);
+				rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
 			}
 		}
 		if (c->objcg)
@@ -670,6 +740,27 @@ static void notrace unit_free(struct bpf_mem_cache *c, void *ptr)
 		irq_work_raise(c);
 }
 
+static void notrace unit_free_rcu(struct bpf_mem_cache *c, void *ptr)
+{
+	struct llist_node *llnode = ptr - LLIST_NODE_SZ;
+	unsigned long flags;
+
+	c->tgt = *(struct bpf_mem_cache **)llnode;
+
+	local_irq_save(flags);
+	if (local_inc_return(&c->active) == 1) {
+		if (__llist_add(llnode, &c->free_by_rcu))
+			c->free_by_rcu_tail = llnode;
+	} else {
+		llist_add(llnode, &c->free_llist_extra_rcu);
+	}
+	local_dec(&c->active);
+	local_irq_restore(flags);
+
+	if (!atomic_read(&c->call_rcu_in_progress))
+		irq_work_raise(c);
+}
+
 /* Called from BPF program or from sys_bpf syscall.
  * In both cases migration is disabled.
  */
@@ -703,6 +794,20 @@ void notrace bpf_mem_free(struct bpf_mem_alloc *ma, void *ptr)
 	unit_free(this_cpu_ptr(ma->caches)->cache + idx, ptr);
 }
 
+void notrace bpf_mem_free_rcu(struct bpf_mem_alloc *ma, void *ptr)
+{
+	int idx;
+
+	if (!ptr)
+		return;
+
+	idx = bpf_mem_cache_idx(ksize(ptr - LLIST_NODE_SZ));
+	if (idx < 0)
+		return;
+
+	unit_free_rcu(this_cpu_ptr(ma->caches)->cache + idx, ptr);
+}
+
 void notrace *bpf_mem_cache_alloc(struct bpf_mem_alloc *ma)
 {
 	void *ret;
@@ -719,6 +824,14 @@ void notrace bpf_mem_cache_free(struct bpf_mem_alloc *ma, void *ptr)
 	unit_free(this_cpu_ptr(ma->cache), ptr);
 }
 
+void notrace bpf_mem_cache_free_rcu(struct bpf_mem_alloc *ma, void *ptr)
+{
+	if (!ptr)
+		return;
+
+	unit_free_rcu(this_cpu_ptr(ma->cache), ptr);
+}
+
 /* Directly does a kfree() without putting 'ptr' back to the free_llist
  * for reuse and without waiting for a rcu_tasks_trace gp.
  * The caller must first go through the rcu_tasks_trace gp for 'ptr'
-- 
2.34.1



* [PATCH v2 bpf-next 13/13] bpf: Convert bpf_cpumask to bpf_mem_cache_free_rcu.
  2023-06-24  3:13 [PATCH v2 bpf-next 00/13] bpf: Introduce bpf_mem_cache_free_rcu() Alexei Starovoitov
                   ` (11 preceding siblings ...)
  2023-06-24  3:13 ` [PATCH v2 bpf-next 12/13] bpf: Introduce bpf_mem_free_rcu() similar to kfree_rcu() Alexei Starovoitov
@ 2023-06-24  3:13 ` Alexei Starovoitov
  2023-06-26 15:42   ` David Vernet
  2023-06-24  7:09 ` [PATCH v2 bpf-next 00/13] bpf: Introduce bpf_mem_cache_free_rcu() Hou Tao
  13 siblings, 1 reply; 37+ messages in thread
From: Alexei Starovoitov @ 2023-06-24  3:13 UTC (permalink / raw)
  To: daniel, andrii, void, houtao, paulmck; +Cc: tj, rcu, netdev, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Convert bpf_cpumask to bpf_mem_cache_free_rcu.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/cpumask.c | 20 ++++++--------------
 1 file changed, 6 insertions(+), 14 deletions(-)

diff --git a/kernel/bpf/cpumask.c b/kernel/bpf/cpumask.c
index 938a60ff4295..6983af8e093c 100644
--- a/kernel/bpf/cpumask.c
+++ b/kernel/bpf/cpumask.c
@@ -9,7 +9,6 @@
 /**
  * struct bpf_cpumask - refcounted BPF cpumask wrapper structure
  * @cpumask:	The actual cpumask embedded in the struct.
- * @rcu:	The RCU head used to free the cpumask with RCU safety.
  * @usage:	Object reference counter. When the refcount goes to 0, the
  *		memory is released back to the BPF allocator, which provides
  *		RCU safety.
@@ -25,7 +24,6 @@
  */
 struct bpf_cpumask {
 	cpumask_t cpumask;
-	struct rcu_head rcu;
 	refcount_t usage;
 };
 
@@ -82,16 +80,6 @@ __bpf_kfunc struct bpf_cpumask *bpf_cpumask_acquire(struct bpf_cpumask *cpumask)
 	return cpumask;
 }
 
-static void cpumask_free_cb(struct rcu_head *head)
-{
-	struct bpf_cpumask *cpumask;
-
-	cpumask = container_of(head, struct bpf_cpumask, rcu);
-	migrate_disable();
-	bpf_mem_cache_free(&bpf_cpumask_ma, cpumask);
-	migrate_enable();
-}
-
 /**
  * bpf_cpumask_release() - Release a previously acquired BPF cpumask.
  * @cpumask: The cpumask being released.
@@ -102,8 +90,12 @@ static void cpumask_free_cb(struct rcu_head *head)
  */
 __bpf_kfunc void bpf_cpumask_release(struct bpf_cpumask *cpumask)
 {
-	if (refcount_dec_and_test(&cpumask->usage))
-		call_rcu(&cpumask->rcu, cpumask_free_cb);
+	if (!refcount_dec_and_test(&cpumask->usage))
+		return;
+
+	migrate_disable();
+	bpf_mem_cache_free_rcu(&bpf_cpumask_ma, cpumask);
+	migrate_enable();
 }
 
 /**
-- 
2.34.1



* Re: [PATCH v2 bpf-next 12/13] bpf: Introduce bpf_mem_free_rcu() similar to kfree_rcu().
  2023-06-24  3:13 ` [PATCH v2 bpf-next 12/13] bpf: Introduce bpf_mem_free_rcu() similar to kfree_rcu() Alexei Starovoitov
@ 2023-06-24  6:49   ` Hou Tao
  2023-06-25 11:20     ` Hou Tao
  2023-06-28  0:52     ` Alexei Starovoitov
  2023-06-24  7:53   ` Hou Tao
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 37+ messages in thread
From: Hou Tao @ 2023-06-24  6:49 UTC (permalink / raw)
  To: Alexei Starovoitov, daniel, andrii, void, paulmck
  Cc: tj, rcu, netdev, bpf, kernel-team

Hi,

On 6/24/2023 11:13 AM, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
>
SNIP
>  
> +static void __free_by_rcu(struct rcu_head *head)
> +{
> +	struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu);
> +	struct bpf_mem_cache *tgt = c->tgt;
> +	struct llist_node *llnode;
> +
> +	if (unlikely(READ_ONCE(c->draining)))
> +		goto out;
> +
> +	llnode = llist_del_all(&c->waiting_for_gp);
> +	if (!llnode)
> +		goto out;
> +
> +	if (llist_add_batch(llnode, c->waiting_for_gp_tail, &tgt->free_by_rcu_ttrace))
> +		tgt->free_by_rcu_ttrace_tail = c->waiting_for_gp_tail;
Got a null-ptr dereference oops when running multiple test_maps and the
htab-mem benchmark after hacking htab to use bpf_mem_cache_free_rcu().
I think it happened as follows:

// c->tgt
P1: __free_by_rcu()
        // c->tgt is the same as P1
        P2: __free_by_rcu()

// return true
P1: llist_add_batch(&tgt->free_by_rcu_ttrace)
        // return false
        P2: llist_add_batch(&tgt->free_by_rcu_ttrace)
        P2: do_call_rcu_ttrace
        // return false
        P2: xchg(tgt->call_rcu_ttrace_in_progress, 1)
        // llnode is not NULL
        P2: llnode = llist_del_all(&c->free_by_rcu_ttrace)
        // BAD: c->free_by_rcu_ttrace_tail is NULL, so oops
        P2: __llist_add_batch(llnode, c->free_by_rcu_ttrace_tail)

P1: tgt->free_by_rcu_ttrace_tail = X   

I don't have a good fix for the problem except adding a spin-lock for
free_by_rcu_ttrace and free_by_rcu_ttrace_tail.
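
A rough sketch of what that could look like (untested; the lock is a
hypothetical new field, not something in this series), with the same lock
also needed around the llist_del_all() plus tail read in do_call_rcu_ttrace():

	unsigned long flags;

	raw_spin_lock_irqsave(&tgt->free_by_rcu_ttrace_lock, flags);
	if (llist_add_batch(llnode, c->waiting_for_gp_tail,
			    &tgt->free_by_rcu_ttrace))
		tgt->free_by_rcu_ttrace_tail = c->waiting_for_gp_tail;
	raw_spin_unlock_irqrestore(&tgt->free_by_rcu_ttrace_lock, flags);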



* Re: [PATCH v2 bpf-next 00/13] bpf: Introduce bpf_mem_cache_free_rcu().
  2023-06-24  3:13 [PATCH v2 bpf-next 00/13] bpf: Introduce bpf_mem_cache_free_rcu() Alexei Starovoitov
                   ` (12 preceding siblings ...)
  2023-06-24  3:13 ` [PATCH v2 bpf-next 13/13] bpf: Convert bpf_cpumask to bpf_mem_cache_free_rcu Alexei Starovoitov
@ 2023-06-24  7:09 ` Hou Tao
  13 siblings, 0 replies; 37+ messages in thread
From: Hou Tao @ 2023-06-24  7:09 UTC (permalink / raw)
  To: Alexei Starovoitov, daniel, andrii, void, paulmck
  Cc: tj, rcu, netdev, bpf, kernel-team

Hi,

On 6/24/2023 11:13 AM, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
>
> v1->v2:
> - Fixed race condition spotted by Hou. Patch 7.
>
> v1:
>
> Introduce bpf_mem_cache_free_rcu() that is similar to kfree_rcu except
> the objects will go through an additional RCU tasks trace grace period
> before being freed into slab.
>
> Patches 1-9 - a bunch of prep work
> Patch 10 - a patch from Paul that exports rcu_request_urgent_qs_task().
> Patch 12 - the main bpf_mem_cache_free_rcu patch.
> Patch 13 - use it in bpf_cpumask.
Should we allow mixed use of bpf_mem_cache_free() and
bpf_mem_free_rcu()? From the implementation it seems mixed use is
allowed, right? I will try to hack htab to test such a use case.
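
A hypothetical caller illustrating the kind of mixed use in question
(both objects are assumed to come from the same, already initialized
bpf_mem_alloc):

static void example_mixed_free(struct bpf_mem_alloc *ma,
			       void *fast_obj, void *rcu_obj)
{
	/* immediate reuse via the per-cpu free list */
	bpf_mem_cache_free(ma, fast_obj);
	/* reuse only after RCU and RCU tasks trace grace periods */
	bpf_mem_cache_free_rcu(ma, rcu_obj);
}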


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 bpf-next 12/13] bpf: Introduce bpf_mem_free_rcu() similar to kfree_rcu().
  2023-06-24  3:13 ` [PATCH v2 bpf-next 12/13] bpf: Introduce bpf_mem_free_rcu() similar to kfree_rcu() Alexei Starovoitov
  2023-06-24  6:49   ` Hou Tao
@ 2023-06-24  7:53   ` Hou Tao
  2023-06-28  0:54     ` Alexei Starovoitov
  2023-06-24  9:05   ` Hou Tao
  2023-06-25 11:15   ` Hou Tao
  3 siblings, 1 reply; 37+ messages in thread
From: Hou Tao @ 2023-06-24  7:53 UTC (permalink / raw)
  To: Alexei Starovoitov, daniel, andrii, void, paulmck
  Cc: tj, rcu, netdev, bpf, kernel-team

Hi,

On 6/24/2023 11:13 AM, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
>
> Introduce bpf_mem_[cache_]free_rcu() similar to kfree_rcu().
> Unlike bpf_mem_[cache_]free() that links objects for immediate reuse into
> per-cpu free list the _rcu() flavor waits for RCU grace period and then moves
> objects into free_by_rcu_ttrace list where they are waiting for RCU
> task trace grace period to be freed into slab.
>
SNIP
> +static void check_free_by_rcu(struct bpf_mem_cache *c)
> +{
> +	struct llist_node *llnode, *t;
> +
> +	if (llist_empty(&c->free_by_rcu) && llist_empty(&c->free_llist_extra_rcu))
> +		return;
> +
> +	/* drain free_llist_extra_rcu */
> +	llist_for_each_safe(llnode, t, llist_del_all(&c->free_llist_extra_rcu))
> +		if (__llist_add(llnode, &c->free_by_rcu))
> +			c->free_by_rcu_tail = llnode;

Just like add_obj_to_free_list(), we should do the conditional
local_irq_save(flags) and local_inc_return(&c->active) for free_by_rcu
as well, otherwise free_by_rcu may be corrupted by a bpf program running
in NMI context. A sketch of the pattern follows at the end of this
message.
> +
> +	if (atomic_xchg(&c->call_rcu_in_progress, 1)) {
> +		/*
> +		 * Instead of kmalloc-ing new rcu_head and triggering 10k
> +		 * call_rcu() to hit rcutree.qhimark and force RCU to notice
> +		 * the overload just ask RCU to hurry up. There could be many
> +		 * objects in free_by_rcu list.
> +		 * This hint reduces memory consumption for an artificial
> +		 * benchmark from 2 Gbyte to 150 Mbyte.
> +		 */
> +		rcu_request_urgent_qs_task(current);
> +		return;
> +	}
> +
> +	WARN_ON_ONCE(!llist_empty(&c->waiting_for_gp));
> +
> +	WRITE_ONCE(c->waiting_for_gp.first, __llist_del_all(&c->free_by_rcu));
> +	c->waiting_for_gp_tail = c->free_by_rcu_tail;
> +	call_rcu_hurry(&c->rcu, __free_by_rcu);

The same protection is needed for c->free_by_rcu_tail as well.
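
A rough sketch of the protection pattern, mirroring
add_obj_to_free_list() (field names are from this series; the exact
placement inside check_free_by_rcu() is left to the next revision):

	struct llist_node *llnode, *t;
	unsigned long flags;

	if (IS_ENABLED(CONFIG_PREEMPT_RT))
		local_irq_save(flags);
	WARN_ON_ONCE(local_inc_return(&c->active) != 1);
	llist_for_each_safe(llnode, t, llist_del_all(&c->free_llist_extra_rcu))
		if (__llist_add(llnode, &c->free_by_rcu))
			c->free_by_rcu_tail = llnode;
	local_dec(&c->active);
	if (IS_ENABLED(CONFIG_PREEMPT_RT))
		local_irq_restore(flags);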


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 bpf-next 12/13] bpf: Introduce bpf_mem_free_rcu() similar to kfree_rcu().
  2023-06-24  3:13 ` [PATCH v2 bpf-next 12/13] bpf: Introduce bpf_mem_free_rcu() similar to kfree_rcu() Alexei Starovoitov
  2023-06-24  6:49   ` Hou Tao
  2023-06-24  7:53   ` Hou Tao
@ 2023-06-24  9:05   ` Hou Tao
  2023-06-25 11:15   ` Hou Tao
  3 siblings, 0 replies; 37+ messages in thread
From: Hou Tao @ 2023-06-24  9:05 UTC (permalink / raw)
  To: Alexei Starovoitov, daniel, andrii, void, paulmck
  Cc: tj, rcu, netdev, bpf, kernel-team

Hi,

On 6/24/2023 11:13 AM, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
>
> Introduce bpf_mem_[cache_]free_rcu() similar to kfree_rcu().
> Unlike bpf_mem_[cache_]free() that links objects for immediate reuse into
> per-cpu free list the _rcu() flavor waits for RCU grace period and then moves
> objects into free_by_rcu_ttrace list where they are waiting for RCU
> task trace grace period to be freed into slab.
SNIP
> +static void __free_by_rcu(struct rcu_head *head)
> +{
> +	struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu);
> +	struct bpf_mem_cache *tgt = c->tgt;
> +	struct llist_node *llnode;
> +
> +	if (unlikely(READ_ONCE(c->draining)))
> +		goto out;
Because the reading of c->draining and the llist_add_batch(...,
free_by_rcu_ttrace) are both lockless, checking draining here cannot
prevent the leak of objects in c->free_by_rcu_ttrace, as shown below
(hope the formatting is OK now). A simple fix is to drain
free_by_rcu_ttrace twice, as suggested before. Another option is to
check c->draining again in __free_by_rcu() when atomic_xchg() returns 1
and to call free_all(free_by_rcu_ttrace) if draining is true; a sketch
of that option follows the diagram below.

P1: bpf_mem_alloc_destroy()
    P2: __free_by_rcu()

    // got false
    P2: read c->draining

P1: c->draining = true
P1: llist_del_all(&c->free_by_rcu_ttrace)

    // add to free_by_rcu_ttrace again
    P2: llist_add_batch(..., &tgt->free_by_rcu_ttrace)
        P2: do_call_rcu_ttrace()
            // call_rcu_ttrace_in_progress is 1, so xchg return 1
            // and it doesn't being moved to waiting_for_gp_ttrace
            P2: atomic_xchg(&c->call_rcu_ttrace_in_progress, 1)

// got 1
P1: atomic_read(&c->call_rcu_ttrace_in_progress)
// objects in free_by_rcu_ttrace is leaked
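
A sketch of the second option (names are taken from this series, the
helper and its placement are only a suggestion): when the xchg() in
do_call_rcu_ttrace() returns 1, re-check draining and drop whatever
raced onto free_by_rcu_ttrace after the destroy path emptied it:

static void drop_objects_raced_with_draining(struct bpf_mem_cache *c)
{
	/* called when atomic_xchg(&c->call_rcu_ttrace_in_progress, 1)
	 * returned 1, i.e. another callback or the destroy path already
	 * owns free_by_rcu_ttrace
	 */
	if (unlikely(READ_ONCE(c->draining)))
		free_all(llist_del_all(&c->free_by_rcu_ttrace),
			 !!c->percpu_size);
}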

c->draining also can't guarantee that bpf_mem_alloc_destroy() will wait
for the inflight call_rcu_tasks_trace() callback, as shown in the
following two cases (these are the same cases reported against v1; I
only reformatted the diagrams). I suggest changing
bpf_mem_alloc_destroy() as follows:

        if (ma->cache) {
                rcu_in_progress = 0;
                for_each_possible_cpu(cpu) {
                        c = per_cpu_ptr(ma->cache, cpu);
                        irq_work_sync(&c->refill_work);
                        drain_mem_cache(c);
                        rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
                }
                for_each_possible_cpu(cpu) {
                        c = per_cpu_ptr(ma->cache, cpu);
                        rcu_in_progress += atomic_read(&c->call_rcu_ttrace_in_progress);
                }


Case 1:

P1: bpf_mem_alloc_destroy()
            P2: __free_by_rcu()

            // got false
            P2: c->draining
P1: c->draining = true

// got 0
P1: atomic_read(&c->call_rcu_ttrace_in_progress)

            P2: do_call_rcu_ttrace()
                // return 0
                P2: atomic_xchg(&c->call_rcu_ttrace_in_progress, 1)
                P2: call_rcu_tasks_trace()
            P2: atomic_set(&c->call_rcu_in_progress, 0)

// also got 0
P1: atomic_read(&c->call_rcu_in_progress)
// won't wait for the inflight __free_rcu_tasks_trace

Case 2:

P1: bpf_mem_alloc_destroy
                P2: __free_by_rcu for c1

                P2: read c1->draining
P1: c0->draining = true
P1: c1->draining = true

// both of in_progress counter is 0
P1: read c0->call_rcu_in_progress
P1: read c0->call_rcu_ttrace_in_progress

                // c1->tgt is c0
                // c1->call_rcu_in_progress is 1
                // c0->call_rcu_ttrace_in_progress is 0
                P2: llist_add_batch(..., c0->free_by_rcu_ttrace)
                P2: xchg(c0->call_rcu_ttrace_in_progress, 1)
                P2: call_rcu_tasks_trace(c0)
                P2: c1->call_rcu_in_progress = 0

// both of in_progress counter is 0
P1: read c1->call_rcu_in_progress
P1: read c1->call_rcu_ttrace_in_progress

// BAD! There is still inflight tasks trace RCU callback
P1: free_mem_alloc_no_barrier()

> +
> +	llnode = llist_del_all(&c->waiting_for_gp);
> +	if (!llnode)
> +		goto out;
> +
> +	if (llist_add_batch(llnode, c->waiting_for_gp_tail, &tgt->free_by_rcu_ttrace))
> +		tgt->free_by_rcu_ttrace_tail = c->waiting_for_gp_tail;
> +
> +	/* Objects went through regular RCU GP. Send them to RCU tasks trace */
> +	do_call_rcu_ttrace(tgt);
> +out:
> +	atomic_set(&c->call_rcu_in_progress, 0);
> +}
> +


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 bpf-next 12/13] bpf: Introduce bpf_mem_free_rcu() similar to kfree_rcu().
  2023-06-24  3:13 ` [PATCH v2 bpf-next 12/13] bpf: Introduce bpf_mem_free_rcu() similar to kfree_rcu() Alexei Starovoitov
                     ` (2 preceding siblings ...)
  2023-06-24  9:05   ` Hou Tao
@ 2023-06-25 11:15   ` Hou Tao
  2023-06-28  0:56     ` Alexei Starovoitov
  3 siblings, 1 reply; 37+ messages in thread
From: Hou Tao @ 2023-06-25 11:15 UTC (permalink / raw)
  To: Alexei Starovoitov, daniel, andrii, void, paulmck
  Cc: tj, rcu, netdev, bpf, kernel-team

Hi,

On 6/24/2023 11:13 AM, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
>
> Introduce bpf_mem_[cache_]free_rcu() similar to kfree_rcu().
> Unlike bpf_mem_[cache_]free() that links objects for immediate reuse into
> per-cpu free list the _rcu() flavor waits for RCU grace period and then moves
> objects into free_by_rcu_ttrace list where they are waiting for RCU
> task trace grace period to be freed into slab.
SNIP
>  
>  static void free_mem_alloc_no_barrier(struct bpf_mem_alloc *ma)
> @@ -498,8 +566,8 @@ static void free_mem_alloc_no_barrier(struct bpf_mem_alloc *ma)
>  
>  static void free_mem_alloc(struct bpf_mem_alloc *ma)
>  {
> -	/* waiting_for_gp_ttrace lists was drained, but __free_rcu might
> -	 * still execute. Wait for it now before we freeing percpu caches.
> +	/* waiting_for_gp[_ttrace] lists were drained, but RCU callbacks
> +	 * might still execute. Wait for them.
>  	 *
>  	 * rcu_barrier_tasks_trace() doesn't imply synchronize_rcu_tasks_trace(),
>  	 * but rcu_barrier_tasks_trace() and rcu_barrier() below are only used
I think an extra rcu_barrier() before rcu_barrier_tasks_trace() is still
needed here, otherwise free_mem_alloc() will not wait for an inflight
__free_by_rcu() and there will be an oops in rcu_do_batch().
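
A sketch of the resulting wait sequence (rcu_barrier(),
rcu_barrier_tasks_trace() and rcu_trace_implies_rcu_gp() are existing
kernel APIs; the ordering is the suggestion):

static void free_mem_alloc_sketch(struct bpf_mem_alloc *ma)
{
	rcu_barrier();			/* wait for inflight __free_by_rcu() */
	rcu_barrier_tasks_trace();	/* wait for inflight __free_rcu_tasks_trace() */
	if (!rcu_trace_implies_rcu_gp())
		rcu_barrier();		/* __free_rcu() may be chained via call_rcu() */
	free_mem_alloc_no_barrier(ma);
}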


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 bpf-next 12/13] bpf: Introduce bpf_mem_free_rcu() similar to kfree_rcu().
  2023-06-24  6:49   ` Hou Tao
@ 2023-06-25 11:20     ` Hou Tao
  2023-06-28  0:52     ` Alexei Starovoitov
  1 sibling, 0 replies; 37+ messages in thread
From: Hou Tao @ 2023-06-25 11:20 UTC (permalink / raw)
  To: Alexei Starovoitov, daniel, andrii, void, paulmck
  Cc: tj, rcu, netdev, bpf, kernel-team

Hi,

On 6/24/2023 2:49 PM, Hou Tao wrote:
> Hi,
>
> On 6/24/2023 11:13 AM, Alexei Starovoitov wrote:
>> From: Alexei Starovoitov <ast@kernel.org>
>>
> SNIP
>>  
>> +static void __free_by_rcu(struct rcu_head *head)
>> +{
>> +	struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu);
>> +	struct bpf_mem_cache *tgt = c->tgt;
>> +	struct llist_node *llnode;
>> +
>> +	if (unlikely(READ_ONCE(c->draining)))
>> +		goto out;
>> +
>> +	llnode = llist_del_all(&c->waiting_for_gp);
>> +	if (!llnode)
>> +		goto out;
>> +
>> +	if (llist_add_batch(llnode, c->waiting_for_gp_tail, &tgt->free_by_rcu_ttrace))
>> +		tgt->free_by_rcu_ttrace_tail = c->waiting_for_gp_tail;
> Got a null-ptr dereference oops when running multiple test_maps and
> htab-mem benchmark after hacking htab to use bpf_mem_cache_free_rcu().
> And I think it happened as follow:
The same null-ptr dereference happened even before switching htab to use
bpf_mem_cache_free_rcu(). So it seems that after introducing c->tgt,
there can be two concurrent free_bulk() calls with the same c->tgt.
>
> // c->tgt
> P1: __free_by_rcu()
>         // c->tgt is the same as P1
>         P2: __free_by_rcu()
>
> // return true
> P1: llist_add_batch(&tgt->free_by_rcu_ttrace)
>         // return false
>         P2: llist_add_batch(&tgt->free_by_rcu_ttrace)
>         P2: do_call_rcu_ttrace
>         // return false
>         P2: xchg(tgt->call_rcu_ttrace_in_progress, 1)
>         // llnode is not NULL
>         P2: llnode = llist_del_all(&c->free_by_rcu_ttrace)
>         // BAD: c->free_by_rcu_ttrace_tail is NULL, so oops
>         P2: __llist_add_batch(llnode, c->free_by_rcu_ttrace_tail)
>
> P1: tgt->free_by_rcu_ttrace_tail = X   
>
> I don't have a good fix for the problem except adding a spin-lock for
> free_by_rcu_ttrace and free_by_rcu_ttrace_tail.
>
>
> .


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 bpf-next 09/13] bpf: Allow reuse from waiting_for_gp_ttrace list.
  2023-06-24  3:13 ` [PATCH v2 bpf-next 09/13] bpf: Allow reuse from waiting_for_gp_ttrace list Alexei Starovoitov
@ 2023-06-26  3:30   ` Hou Tao
  2023-06-26  4:42     ` Alexei Starovoitov
  0 siblings, 1 reply; 37+ messages in thread
From: Hou Tao @ 2023-06-26  3:30 UTC (permalink / raw)
  To: Alexei Starovoitov, daniel, andrii, void, paulmck
  Cc: tj, rcu, netdev, bpf, kernel-team

Hi,

On 6/24/2023 11:13 AM, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
>
> alloc_bulk() can reuse elements from free_by_rcu_ttrace.
> Let it reuse from waiting_for_gp_ttrace as well to avoid unnecessary kmalloc().
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
>  kernel/bpf/memalloc.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> index 692a9a30c1dc..666917c16e87 100644
> --- a/kernel/bpf/memalloc.c
> +++ b/kernel/bpf/memalloc.c
> @@ -203,6 +203,15 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
>  	if (i >= cnt)
>  		return;
>  
> +	for (; i < cnt; i++) {
> +		obj = llist_del_first(&c->waiting_for_gp_ttrace);
After allowing reuse of elements from waiting_for_gp_ttrace, there may
be concurrent llist_del_first() and llist_del_all() calls, as shown
below. llist_del_first() is not safe here because the elements freed
from free_rcu() can be reused immediately, and head->first may be added
back to c0->waiting_for_gp_ttrace by another process.

// c0
alloc_bulk()
    llist_del_first(&c->waiting_for_gp_ttrace)

// c1->tgt = c0
free_rcu()
    llist_del_all(&c->waiting_for_gp_ttrace)

> +		if (!obj)
> +			break;
> +		add_obj_to_free_list(c, obj);
> +	}
> +	if (i >= cnt)
> +		return;
> +
>  	memcg = get_memcg(c);
>  	old_memcg = set_active_memcg(memcg);
>  	for (; i < cnt; i++) {


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 bpf-next 09/13] bpf: Allow reuse from waiting_for_gp_ttrace list.
  2023-06-26  3:30   ` Hou Tao
@ 2023-06-26  4:42     ` Alexei Starovoitov
  2023-06-26  7:16       ` Hou Tao
  0 siblings, 1 reply; 37+ messages in thread
From: Alexei Starovoitov @ 2023-06-26  4:42 UTC (permalink / raw)
  To: Hou Tao
  Cc: Daniel Borkmann, Andrii Nakryiko, David Vernet, Paul E. McKenney,
	Tejun Heo, rcu, Network Development, bpf, Kernel Team

On Sun, Jun 25, 2023 at 8:30 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 6/24/2023 11:13 AM, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > alloc_bulk() can reuse elements from free_by_rcu_ttrace.
> > Let it reuse from waiting_for_gp_ttrace as well to avoid unnecessary kmalloc().
> >
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > ---
> >  kernel/bpf/memalloc.c | 9 +++++++++
> >  1 file changed, 9 insertions(+)
> >
> > diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> > index 692a9a30c1dc..666917c16e87 100644
> > --- a/kernel/bpf/memalloc.c
> > +++ b/kernel/bpf/memalloc.c
> > @@ -203,6 +203,15 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
> >       if (i >= cnt)
> >               return;
> >
> > +     for (; i < cnt; i++) {
> > +             obj = llist_del_first(&c->waiting_for_gp_ttrace);
> After allowing to reuse elements from waiting_for_gp_ttrace, there may
> be concurrent llist_del_first() and llist_del_all() as shown below and
> llist_del_first() is not safe because the elements freed from free_rcu()
> could be reused immediately and head->first may be added back to
> c0->waiting_for_gp_ttrace by other process.
>
> // c0
> alloc_bulk()
>     llist_del_first(&c->waiting_for_gp_ttrace)
>
> // c1->tgt = c0
> free_rcu()
>     llist_del_all(&c->waiting_for_gp_ttrace)

I'm still thinking about how to fix the other issues you've reported,
but this one, I believe, is fine.
Are you basing 'not safe' on a comment?
Why xchg(&head->first, NULL); on one cpu and
try_cmpxchg(&head->first, &entry, next);
is unsafe?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 bpf-next 09/13] bpf: Allow reuse from waiting_for_gp_ttrace list.
  2023-06-26  4:42     ` Alexei Starovoitov
@ 2023-06-26  7:16       ` Hou Tao
  2023-06-28  0:59         ` Alexei Starovoitov
  0 siblings, 1 reply; 37+ messages in thread
From: Hou Tao @ 2023-06-26  7:16 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel Borkmann, Andrii Nakryiko, David Vernet, Paul E. McKenney,
	Tejun Heo, rcu, Network Development, bpf, Kernel Team

Hi,

On 6/26/2023 12:42 PM, Alexei Starovoitov wrote:
> On Sun, Jun 25, 2023 at 8:30 PM Hou Tao <houtao@huaweicloud.com> wrote:
>> Hi,
>>
>> On 6/24/2023 11:13 AM, Alexei Starovoitov wrote:
>>> From: Alexei Starovoitov <ast@kernel.org>
>>>
>>> alloc_bulk() can reuse elements from free_by_rcu_ttrace.
>>> Let it reuse from waiting_for_gp_ttrace as well to avoid unnecessary kmalloc().
>>>
>>> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>>> ---
>>>  kernel/bpf/memalloc.c | 9 +++++++++
>>>  1 file changed, 9 insertions(+)
>>>
>>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
>>> index 692a9a30c1dc..666917c16e87 100644
>>> --- a/kernel/bpf/memalloc.c
>>> +++ b/kernel/bpf/memalloc.c
>>> @@ -203,6 +203,15 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
>>>       if (i >= cnt)
>>>               return;
>>>
>>> +     for (; i < cnt; i++) {
>>> +             obj = llist_del_first(&c->waiting_for_gp_ttrace);
>> After allowing to reuse elements from waiting_for_gp_ttrace, there may
>> be concurrent llist_del_first() and llist_del_all() as shown below and
>> llist_del_first() is not safe because the elements freed from free_rcu()
>> could be reused immediately and head->first may be added back to
>> c0->waiting_for_gp_ttrace by other process.
>>
>> // c0
>> alloc_bulk()
>>     llist_del_first(&c->waiting_for_gp_ttrace)
>>
>> // c1->tgt = c0
>> free_rcu()
>>     llist_del_all(&c->waiting_for_gp_ttrace)
> I'm still thinking about how to fix the other issues you've reported,
> but this one, I believe, is fine.
> Are you basing 'not safe' on a comment?
> Why xchg(&head->first, NULL); on one cpu and
> try_cmpxchg(&head->first, &entry, next);
> is unsafe?
No, I didn't reason about it only based on the comment in llist.h. The
reason I thought it was not safe is that llist_del_first() may have an
ABA problem, as you pointed out earlier: after __free_rcu(), the
elements on waiting_for_gp_ttrace become available immediately, and
waiting_for_gp_ttrace->first may be reused and then added back as
waiting_for_gp_ttrace->first again, as shown below.

// c0->waiting_for_gp_ttrace 
A -> B -> C -> nil 
 
P1: 
alloc_bulk(c0) 
    llist_del_first(&c0->waiting_for_gp_ttrace) 
        entry = A 
        next = B 
 
    P2: __free_rcu 
        // A/B/C is freed back to slab 
        // c0->waiting_for_gp_ttrace->first = NULL 
        free_all(llist_del_all(c0)) 
 
    P3: alloc_bulk(c1) 
        // got A 
        __kmalloc() 
 
    // free A (from c1), ..., last free X (allocated from c0) 
    P3: unit_free(c1)
        // the last freed element X is from c0
        c1->tgt = c0 
        c1->free_llist->first -> X -> Y -> ... -> A 
    P3: free_bulk(c1) 
        enque_to_free(c0) 
            c0->free_by_rcu_ttrace->first -> A -> ... -> Y -> X 
        __llist_add_batch(c0->waiting_for_gp_ttrace) 
            c0->waiting_for_gp_ttrace = A -> ... -> Y -> X 

P1: 
    // A is added back as first again
    // but llist_del_first() didn't know
    try_cmpxhg(&c0->waiting_for_gp_ttrace->first, A, B) 
    // c0->waiting_for_gp_trrace is corrupted 
    c0->waiting_for_gp_ttrace->first = B
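
For reference, llist_del_first() is roughly the following
compare-and-exchange loop (paraphrased from lib/llist.c, see the file
for the exact code), which is why a re-added A lets the try_cmpxchg()
succeed with a stale 'next':

struct llist_node *llist_del_first(struct llist_head *head)
{
	struct llist_node *entry, *next;

	entry = smp_load_acquire(&head->first);
	do {
		if (entry == NULL)
			return NULL;
		/* 'entry' may meanwhile be freed and reused elsewhere */
		next = READ_ONCE(entry->next);
	} while (!try_cmpxchg(&head->first, &entry, next));

	return entry;
}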


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 bpf-next 13/13] bpf: Convert bpf_cpumask to bpf_mem_cache_free_rcu.
  2023-06-24  3:13 ` [PATCH v2 bpf-next 13/13] bpf: Convert bpf_cpumask to bpf_mem_cache_free_rcu Alexei Starovoitov
@ 2023-06-26 15:42   ` David Vernet
  2023-06-26 16:09     ` Alexei Starovoitov
  0 siblings, 1 reply; 37+ messages in thread
From: David Vernet @ 2023-06-26 15:42 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: daniel, andrii, houtao, paulmck, tj, rcu, netdev, bpf,
	kernel-team

On Fri, Jun 23, 2023 at 08:13:33PM -0700, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> Convert bpf_cpumask to bpf_mem_cache_free_rcu.
> 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Acked-by: David Vernet <void@manifault.com>

LGTM, thanks for cleaning this up. I left one drive-by comment /
observation below, but it's not a blocker for this patch / series.

> ---
>  kernel/bpf/cpumask.c | 20 ++++++--------------
>  1 file changed, 6 insertions(+), 14 deletions(-)
> 
> diff --git a/kernel/bpf/cpumask.c b/kernel/bpf/cpumask.c
> index 938a60ff4295..6983af8e093c 100644
> --- a/kernel/bpf/cpumask.c
> +++ b/kernel/bpf/cpumask.c
> @@ -9,7 +9,6 @@
>  /**
>   * struct bpf_cpumask - refcounted BPF cpumask wrapper structure
>   * @cpumask:	The actual cpumask embedded in the struct.
> - * @rcu:	The RCU head used to free the cpumask with RCU safety.
>   * @usage:	Object reference counter. When the refcount goes to 0, the
>   *		memory is released back to the BPF allocator, which provides
>   *		RCU safety.
> @@ -25,7 +24,6 @@
>   */
>  struct bpf_cpumask {
>  	cpumask_t cpumask;
> -	struct rcu_head rcu;
>  	refcount_t usage;
>  };
>  
> @@ -82,16 +80,6 @@ __bpf_kfunc struct bpf_cpumask *bpf_cpumask_acquire(struct bpf_cpumask *cpumask)
>  	return cpumask;
>  }
>  
> -static void cpumask_free_cb(struct rcu_head *head)
> -{
> -	struct bpf_cpumask *cpumask;
> -
> -	cpumask = container_of(head, struct bpf_cpumask, rcu);
> -	migrate_disable();
> -	bpf_mem_cache_free(&bpf_cpumask_ma, cpumask);
> -	migrate_enable();
> -}
> -
>  /**
>   * bpf_cpumask_release() - Release a previously acquired BPF cpumask.
>   * @cpumask: The cpumask being released.
> @@ -102,8 +90,12 @@ static void cpumask_free_cb(struct rcu_head *head)
>   */
>  __bpf_kfunc void bpf_cpumask_release(struct bpf_cpumask *cpumask)
>  {
> -	if (refcount_dec_and_test(&cpumask->usage))
> -		call_rcu(&cpumask->rcu, cpumask_free_cb);
> +	if (!refcount_dec_and_test(&cpumask->usage))
> +		return;
> +
> +	migrate_disable();
> +	bpf_mem_cache_free_rcu(&bpf_cpumask_ma, cpumask);
> +	migrate_enable();

The fact that callers have to disable migration like this in order to
safely free the memory feels a bit leaky. Is there any reason we can't
move this into bpf_mem_{cache_}free_rcu()?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 bpf-next 13/13] bpf: Convert bpf_cpumask to bpf_mem_cache_free_rcu.
  2023-06-26 15:42   ` David Vernet
@ 2023-06-26 16:09     ` Alexei Starovoitov
  2023-06-26 17:55       ` David Vernet
  0 siblings, 1 reply; 37+ messages in thread
From: Alexei Starovoitov @ 2023-06-26 16:09 UTC (permalink / raw)
  To: David Vernet
  Cc: Daniel Borkmann, Andrii Nakryiko, Hou Tao, Paul E. McKenney,
	Tejun Heo, rcu, Network Development, bpf, Kernel Team

On Mon, Jun 26, 2023 at 8:42 AM David Vernet <void@manifault.com> wrote:
>
> On Fri, Jun 23, 2023 at 08:13:33PM -0700, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > Convert bpf_cpumask to bpf_mem_cache_free_rcu.
> >
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>
> Acked-by: David Vernet <void@manifault.com>
>
> LGTM, thanks for cleaning this up. I left one drive-by comment /
> observation below, but it's not a blocker for this patch / series.
>
> > ---
> >  kernel/bpf/cpumask.c | 20 ++++++--------------
> >  1 file changed, 6 insertions(+), 14 deletions(-)
> >
> > diff --git a/kernel/bpf/cpumask.c b/kernel/bpf/cpumask.c
> > index 938a60ff4295..6983af8e093c 100644
> > --- a/kernel/bpf/cpumask.c
> > +++ b/kernel/bpf/cpumask.c
> > @@ -9,7 +9,6 @@
> >  /**
> >   * struct bpf_cpumask - refcounted BPF cpumask wrapper structure
> >   * @cpumask: The actual cpumask embedded in the struct.
> > - * @rcu:     The RCU head used to free the cpumask with RCU safety.
> >   * @usage:   Object reference counter. When the refcount goes to 0, the
> >   *           memory is released back to the BPF allocator, which provides
> >   *           RCU safety.
> > @@ -25,7 +24,6 @@
> >   */
> >  struct bpf_cpumask {
> >       cpumask_t cpumask;
> > -     struct rcu_head rcu;
> >       refcount_t usage;
> >  };
> >
> > @@ -82,16 +80,6 @@ __bpf_kfunc struct bpf_cpumask *bpf_cpumask_acquire(struct bpf_cpumask *cpumask)
> >       return cpumask;
> >  }
> >
> > -static void cpumask_free_cb(struct rcu_head *head)
> > -{
> > -     struct bpf_cpumask *cpumask;
> > -
> > -     cpumask = container_of(head, struct bpf_cpumask, rcu);
> > -     migrate_disable();
> > -     bpf_mem_cache_free(&bpf_cpumask_ma, cpumask);
> > -     migrate_enable();
> > -}
> > -
> >  /**
> >   * bpf_cpumask_release() - Release a previously acquired BPF cpumask.
> >   * @cpumask: The cpumask being released.
> > @@ -102,8 +90,12 @@ static void cpumask_free_cb(struct rcu_head *head)
> >   */
> >  __bpf_kfunc void bpf_cpumask_release(struct bpf_cpumask *cpumask)
> >  {
> > -     if (refcount_dec_and_test(&cpumask->usage))
> > -             call_rcu(&cpumask->rcu, cpumask_free_cb);
> > +     if (!refcount_dec_and_test(&cpumask->usage))
> > +             return;
> > +
> > +     migrate_disable();
> > +     bpf_mem_cache_free_rcu(&bpf_cpumask_ma, cpumask);
> > +     migrate_enable();
>
> The fact that callers have to disable migration like this in order to
> safely free the memory feels a bit leaky. Is there any reason we can't
> move this into bpf_mem_{cache_}free_rcu()?

migrate_disable/enable() are actually not necessary here.
We can call bpf_mem_cache_free_rcu() directly from any kfunc.
Explicit migrate_disable() is only necessary from syscall.
I believe rcu callbacks also cannot migrate, so the existing
code probably doesn't need them either.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 bpf-next 13/13] bpf: Convert bpf_cpumask to bpf_mem_cache_free_rcu.
  2023-06-26 16:09     ` Alexei Starovoitov
@ 2023-06-26 17:55       ` David Vernet
  2023-06-26 17:59         ` Alexei Starovoitov
  0 siblings, 1 reply; 37+ messages in thread
From: David Vernet @ 2023-06-26 17:55 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel Borkmann, Andrii Nakryiko, Hou Tao, Paul E. McKenney,
	Tejun Heo, rcu, Network Development, bpf, Kernel Team

On Mon, Jun 26, 2023 at 09:09:20AM -0700, Alexei Starovoitov wrote:
> On Mon, Jun 26, 2023 at 8:42 AM David Vernet <void@manifault.com> wrote:
> >
> > On Fri, Jun 23, 2023 at 08:13:33PM -0700, Alexei Starovoitov wrote:
> > > From: Alexei Starovoitov <ast@kernel.org>
> > >
> > > Convert bpf_cpumask to bpf_mem_cache_free_rcu.
> > >
> > > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> >
> > Acked-by: David Vernet <void@manifault.com>
> >
> > LGTM, thanks for cleaning this up. I left one drive-by comment /
> > observation below, but it's not a blocker for this patch / series.
> >
> > > ---
> > >  kernel/bpf/cpumask.c | 20 ++++++--------------
> > >  1 file changed, 6 insertions(+), 14 deletions(-)
> > >
> > > diff --git a/kernel/bpf/cpumask.c b/kernel/bpf/cpumask.c
> > > index 938a60ff4295..6983af8e093c 100644
> > > --- a/kernel/bpf/cpumask.c
> > > +++ b/kernel/bpf/cpumask.c
> > > @@ -9,7 +9,6 @@
> > >  /**
> > >   * struct bpf_cpumask - refcounted BPF cpumask wrapper structure
> > >   * @cpumask: The actual cpumask embedded in the struct.
> > > - * @rcu:     The RCU head used to free the cpumask with RCU safety.
> > >   * @usage:   Object reference counter. When the refcount goes to 0, the
> > >   *           memory is released back to the BPF allocator, which provides
> > >   *           RCU safety.
> > > @@ -25,7 +24,6 @@
> > >   */
> > >  struct bpf_cpumask {
> > >       cpumask_t cpumask;
> > > -     struct rcu_head rcu;
> > >       refcount_t usage;
> > >  };
> > >
> > > @@ -82,16 +80,6 @@ __bpf_kfunc struct bpf_cpumask *bpf_cpumask_acquire(struct bpf_cpumask *cpumask)
> > >       return cpumask;
> > >  }
> > >
> > > -static void cpumask_free_cb(struct rcu_head *head)
> > > -{
> > > -     struct bpf_cpumask *cpumask;
> > > -
> > > -     cpumask = container_of(head, struct bpf_cpumask, rcu);
> > > -     migrate_disable();
> > > -     bpf_mem_cache_free(&bpf_cpumask_ma, cpumask);
> > > -     migrate_enable();
> > > -}
> > > -
> > >  /**
> > >   * bpf_cpumask_release() - Release a previously acquired BPF cpumask.
> > >   * @cpumask: The cpumask being released.
> > > @@ -102,8 +90,12 @@ static void cpumask_free_cb(struct rcu_head *head)
> > >   */
> > >  __bpf_kfunc void bpf_cpumask_release(struct bpf_cpumask *cpumask)
> > >  {
> > > -     if (refcount_dec_and_test(&cpumask->usage))
> > > -             call_rcu(&cpumask->rcu, cpumask_free_cb);
> > > +     if (!refcount_dec_and_test(&cpumask->usage))
> > > +             return;
> > > +
> > > +     migrate_disable();
> > > +     bpf_mem_cache_free_rcu(&bpf_cpumask_ma, cpumask);
> > > +     migrate_enable();
> >
> > The fact that callers have to disable migration like this in order to
> > safely free the memory feels a bit leaky. Is there any reason we can't
> > move this into bpf_mem_{cache_}free_rcu()?
> 
> migrate_disable/enable() are actually not necessary here.
> We can call bpf_mem_cache_free_rcu() directly from any kfunc.

Could you please clarify why? Can't we migrate if the kfunc is called
from a sleepable struct_ops callback? If migration is always disabled
for any kfunc then I agree these migrate_{en,dis}able() calls can be
removed. Otherwise from my reading of the code we'd race between calling
this_cpu_ptr() and the local_irq_save() in unit_free().

Thanks,
David

> Explicit migrate_disable() is only necessary from syscall.
>
> I believe rcu callbacks also cannot migrate, so the existing
> code probably doesn't need them either.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 bpf-next 13/13] bpf: Convert bpf_cpumask to bpf_mem_cache_free_rcu.
  2023-06-26 17:55       ` David Vernet
@ 2023-06-26 17:59         ` Alexei Starovoitov
  2023-06-26 18:01           ` David Vernet
  0 siblings, 1 reply; 37+ messages in thread
From: Alexei Starovoitov @ 2023-06-26 17:59 UTC (permalink / raw)
  To: David Vernet
  Cc: Daniel Borkmann, Andrii Nakryiko, Hou Tao, Paul E. McKenney,
	Tejun Heo, rcu, Network Development, bpf, Kernel Team

On Mon, Jun 26, 2023 at 10:55 AM David Vernet <void@manifault.com> wrote:
>
> > > > +
> > > > +     migrate_disable();
> > > > +     bpf_mem_cache_free_rcu(&bpf_cpumask_ma, cpumask);
> > > > +     migrate_enable();
> > >
> > > The fact that callers have to disable migration like this in order to
> > > safely free the memory feels a bit leaky. Is there any reason we can't
> > > move this into bpf_mem_{cache_}free_rcu()?
> >
> > migrate_disable/enable() are actually not necessary here.
> > We can call bpf_mem_cache_free_rcu() directly from any kfunc.
>
> Could you please clarify why? Can't we migrate if the kfunc is called
> from a sleepable struct_ops callback?

migration is disabled for all bpf progs including sleepable.

> If migration is always disabled
> for any kfunc then I agree these migrate_{en,dis}able() calls can be
> removed. Otherwise from my reading of the code we'd race between calling
> this_cpu_ptr() and the local_irq_save() in unit_free().
>
> Thanks,
> David
>
> > Explicit migrate_disable() is only necessary from syscall.
> >
> > I believe rcu callbacks also cannot migrate, so the existing
> > code probably doesn't need them either.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 bpf-next 13/13] bpf: Convert bpf_cpumask to bpf_mem_cache_free_rcu.
  2023-06-26 17:59         ` Alexei Starovoitov
@ 2023-06-26 18:01           ` David Vernet
  0 siblings, 0 replies; 37+ messages in thread
From: David Vernet @ 2023-06-26 18:01 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel Borkmann, Andrii Nakryiko, Hou Tao, Paul E. McKenney,
	Tejun Heo, rcu, Network Development, bpf, Kernel Team

On Mon, Jun 26, 2023 at 10:59:40AM -0700, Alexei Starovoitov wrote:
> On Mon, Jun 26, 2023 at 10:55 AM David Vernet <void@manifault.com> wrote:
> >
> > > > > +
> > > > > +     migrate_disable();
> > > > > +     bpf_mem_cache_free_rcu(&bpf_cpumask_ma, cpumask);
> > > > > +     migrate_enable();
> > > >
> > > > The fact that callers have to disable migration like this in order to
> > > > safely free the memory feels a bit leaky. Is there any reason we can't
> > > > move this into bpf_mem_{cache_}free_rcu()?
> > >
> > > migrate_disable/enable() are actually not necessary here.
> > > We can call bpf_mem_cache_free_rcu() directly from any kfunc.
> >
> > Could you please clarify why? Can't we migrate if the kfunc is called
> > from a sleepable struct_ops callback?
> 
> migration is disabled for all bpf progs including sleepable.

Fair enough :-) Then yep, let's remove these. Feel free to do so when
you send v2, and keep my Ack. Otherwise I'm happy to send a follow-on
patch after this series is merged.
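
For completeness, the release path with the pair dropped would just be
(a sketch based on the hunk quoted above, not the final patch):

__bpf_kfunc void bpf_cpumask_release(struct bpf_cpumask *cpumask)
{
	if (!refcount_dec_and_test(&cpumask->usage))
		return;

	bpf_mem_cache_free_rcu(&bpf_cpumask_ma, cpumask);
}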

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 bpf-next 12/13] bpf: Introduce bpf_mem_free_rcu() similar to kfree_rcu().
  2023-06-24  6:49   ` Hou Tao
  2023-06-25 11:20     ` Hou Tao
@ 2023-06-28  0:52     ` Alexei Starovoitov
  2023-06-28  1:36       ` Hou Tao
  1 sibling, 1 reply; 37+ messages in thread
From: Alexei Starovoitov @ 2023-06-28  0:52 UTC (permalink / raw)
  To: Hou Tao, Alexei Starovoitov, daniel, andrii, void, paulmck
  Cc: tj, rcu, netdev, bpf, kernel-team

On 6/23/23 11:49 PM, Hou Tao wrote:
> Hi,
> 
> On 6/24/2023 11:13 AM, Alexei Starovoitov wrote:
>> From: Alexei Starovoitov <ast@kernel.org>
>>
> SNIP
>>   
>> +static void __free_by_rcu(struct rcu_head *head)
>> +{
>> +	struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu);
>> +	struct bpf_mem_cache *tgt = c->tgt;
>> +	struct llist_node *llnode;
>> +
>> +	if (unlikely(READ_ONCE(c->draining)))
>> +		goto out;
>> +
>> +	llnode = llist_del_all(&c->waiting_for_gp);
>> +	if (!llnode)
>> +		goto out;
>> +
>> +	if (llist_add_batch(llnode, c->waiting_for_gp_tail, &tgt->free_by_rcu_ttrace))
>> +		tgt->free_by_rcu_ttrace_tail = c->waiting_for_gp_tail;
> Got a null-ptr dereference oops when running multiple test_maps and
> htab-mem benchmark after hacking htab to use bpf_mem_cache_free_rcu().
> And I think it happened as follow:
> 
> // c->tgt
> P1: __free_by_rcu()
>          // c->tgt is the same as P1
>          P2: __free_by_rcu()
> 
> // return true
> P1: llist_add_batch(&tgt->free_by_rcu_ttrace)
>          // return false
>          P2: llist_add_batch(&tgt->free_by_rcu_ttrace)
>          P2: do_call_rcu_ttrace
>          // return false
>          P2: xchg(tgt->call_rcu_ttrace_in_progress, 1)
>          // llnode is not NULL
>          P2: llnode = llist_del_all(&c->free_by_rcu_ttrace)
>          // BAD: c->free_by_rcu_ttrace_tail is NULL, so oops
>          P2: __llist_add_batch(llnode, c->free_by_rcu_ttrace_tail)
> 
> P1: tgt->free_by_rcu_ttrace_tail = X
> 
> I don't have a good fix for the problem except adding a spin-lock for
> free_by_rcu_ttrace and free_by_rcu_ttrace_tail.

null-ptr is probably something else, since the race window is
extremely tiny.
In my testing this optimization doesn't buy much.
So I'll just drop the _tail optimization and switch to for_each(del_all)
to move elements. We can revisit later.
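
A sketch of that for_each(del_all) move (field names from this series);
with it, no tail pointer is needed on the target cache:

static void move_to_free_by_rcu_ttrace(struct bpf_mem_cache *c,
				       struct bpf_mem_cache *tgt)
{
	struct llist_node *llnode, *t;

	llist_for_each_safe(llnode, t, llist_del_all(&c->waiting_for_gp))
		llist_add(llnode, &tgt->free_by_rcu_ttrace);
}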


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 bpf-next 12/13] bpf: Introduce bpf_mem_free_rcu() similar to kfree_rcu().
  2023-06-24  7:53   ` Hou Tao
@ 2023-06-28  0:54     ` Alexei Starovoitov
  0 siblings, 0 replies; 37+ messages in thread
From: Alexei Starovoitov @ 2023-06-28  0:54 UTC (permalink / raw)
  To: Hou Tao, Alexei Starovoitov, daniel, andrii, void, paulmck
  Cc: tj, rcu, netdev, bpf, kernel-team

On 6/24/23 12:53 AM, Hou Tao wrote:
> Hi,
> 
> On 6/24/2023 11:13 AM, Alexei Starovoitov wrote:
>> From: Alexei Starovoitov <ast@kernel.org>
>>
>> Introduce bpf_mem_[cache_]free_rcu() similar to kfree_rcu().
>> Unlike bpf_mem_[cache_]free() that links objects for immediate reuse into
>> per-cpu free list the _rcu() flavor waits for RCU grace period and then moves
>> objects into free_by_rcu_ttrace list where they are waiting for RCU
>> task trace grace period to be freed into slab.
>>
> SNIP
>> +static void check_free_by_rcu(struct bpf_mem_cache *c)
>> +{
>> +	struct llist_node *llnode, *t;
>> +
>> +	if (llist_empty(&c->free_by_rcu) && llist_empty(&c->free_llist_extra_rcu))
>> +		return;
>> +
>> +	/* drain free_llist_extra_rcu */
>> +	llist_for_each_safe(llnode, t, llist_del_all(&c->free_llist_extra_rcu))
>> +		if (__llist_add(llnode, &c->free_by_rcu))
>> +			c->free_by_rcu_tail = llnode;
> 
> Just like add_obj_to_free_list(),  we should do conditional
> local_irq_save(flags) and local_inc_return(&c->active) as well for
> free_by_rcu, otherwise free_by_rcu may be corrupted by bpf program
> running in a NMI context.

Good catch. Will do.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 bpf-next 12/13] bpf: Introduce bpf_mem_free_rcu() similar to kfree_rcu().
  2023-06-25 11:15   ` Hou Tao
@ 2023-06-28  0:56     ` Alexei Starovoitov
  2023-06-28  1:43       ` Hou Tao
  0 siblings, 1 reply; 37+ messages in thread
From: Alexei Starovoitov @ 2023-06-28  0:56 UTC (permalink / raw)
  To: Hou Tao, Alexei Starovoitov, daniel, andrii, void, paulmck
  Cc: tj, rcu, netdev, bpf, kernel-team

On 6/25/23 4:15 AM, Hou Tao wrote:
> Hi,
> 
> On 6/24/2023 11:13 AM, Alexei Starovoitov wrote:
>> From: Alexei Starovoitov <ast@kernel.org>
>>
>> Introduce bpf_mem_[cache_]free_rcu() similar to kfree_rcu().
>> Unlike bpf_mem_[cache_]free() that links objects for immediate reuse into
>> per-cpu free list the _rcu() flavor waits for RCU grace period and then moves
>> objects into free_by_rcu_ttrace list where they are waiting for RCU
>> task trace grace period to be freed into slab.
> SNIP
>>   
>>   static void free_mem_alloc_no_barrier(struct bpf_mem_alloc *ma)
>> @@ -498,8 +566,8 @@ static void free_mem_alloc_no_barrier(struct bpf_mem_alloc *ma)
>>   
>>   static void free_mem_alloc(struct bpf_mem_alloc *ma)
>>   {
>> -	/* waiting_for_gp_ttrace lists was drained, but __free_rcu might
>> -	 * still execute. Wait for it now before we freeing percpu caches.
>> +	/* waiting_for_gp[_ttrace] lists were drained, but RCU callbacks
>> +	 * might still execute. Wait for them.
>>   	 *
>>   	 * rcu_barrier_tasks_trace() doesn't imply synchronize_rcu_tasks_trace(),
>>   	 * but rcu_barrier_tasks_trace() and rcu_barrier() below are only used
> I think an extra rcu_barrier() before rcu_barrier_tasks_trace() is still
> needed here, otherwise free_mem_alloc will not wait for inflight
> __free_by_rcu() and there will oops in rcu_do_batch().

Agree. I got confused by rcu_trace_implies_rcu_gp().
rcu_barrier() is necessary.

re: draining.
I'll switch to do if (draing) free_all; else call_rcu; scheme
to address potential memory leak though I wasn't able to repro it.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 bpf-next 09/13] bpf: Allow reuse from waiting_for_gp_ttrace list.
  2023-06-26  7:16       ` Hou Tao
@ 2023-06-28  0:59         ` Alexei Starovoitov
       [not found]           ` <957dd5cd-0855-1197-7045-4cb1590bd753@huaweicloud.com>
  0 siblings, 1 reply; 37+ messages in thread
From: Alexei Starovoitov @ 2023-06-28  0:59 UTC (permalink / raw)
  To: Hou Tao, Alexei Starovoitov
  Cc: Daniel Borkmann, Andrii Nakryiko, David Vernet, Paul E. McKenney,
	Tejun Heo, rcu, Network Development, bpf, Kernel Team

On 6/26/23 12:16 AM, Hou Tao wrote:
> Hi,
> 
> On 6/26/2023 12:42 PM, Alexei Starovoitov wrote:
>> On Sun, Jun 25, 2023 at 8:30 PM Hou Tao <houtao@huaweicloud.com> wrote:
>>> Hi,
>>>
>>> On 6/24/2023 11:13 AM, Alexei Starovoitov wrote:
>>>> From: Alexei Starovoitov <ast@kernel.org>
>>>>
>>>> alloc_bulk() can reuse elements from free_by_rcu_ttrace.
>>>> Let it reuse from waiting_for_gp_ttrace as well to avoid unnecessary kmalloc().
>>>>
>>>> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>>>> ---
>>>>   kernel/bpf/memalloc.c | 9 +++++++++
>>>>   1 file changed, 9 insertions(+)
>>>>
>>>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
>>>> index 692a9a30c1dc..666917c16e87 100644
>>>> --- a/kernel/bpf/memalloc.c
>>>> +++ b/kernel/bpf/memalloc.c
>>>> @@ -203,6 +203,15 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
>>>>        if (i >= cnt)
>>>>                return;
>>>>
>>>> +     for (; i < cnt; i++) {
>>>> +             obj = llist_del_first(&c->waiting_for_gp_ttrace);
>>> After allowing to reuse elements from waiting_for_gp_ttrace, there may
>>> be concurrent llist_del_first() and llist_del_all() as shown below and
>>> llist_del_first() is not safe because the elements freed from free_rcu()
>>> could be reused immediately and head->first may be added back to
>>> c0->waiting_for_gp_ttrace by other process.
>>>
>>> // c0
>>> alloc_bulk()
>>>      llist_del_first(&c->waiting_for_gp_ttrace)
>>>
>>> // c1->tgt = c0
>>> free_rcu()
>>>      llist_del_all(&c->waiting_for_gp_ttrace)
>> I'm still thinking about how to fix the other issues you've reported,
>> but this one, I believe, is fine.
>> Are you basing 'not safe' on a comment?
>> Why xchg(&head->first, NULL); on one cpu and
>> try_cmpxchg(&head->first, &entry, next);
>> is unsafe?
> No, I didn't reason it only based on the comment in llist.h. The reason
> I though it was not safe because llist_del_first() may have ABA problem
> as pointed by you early, because after __free_rcu(), the elements on
> waiting_for_gp_ttrace would be available immediately and
> waiting_for_gp_ttrace->first may be reused and then added back to
> waiting_for_gp_ttrace->first again as shown below.
> 
> // c0->waiting_for_gp_ttrace
> A -> B -> C -> nil
>   
> P1:
> alloc_bulk(c0)
>      llist_del_first(&c0->waiting_for_gp_ttrace)
>          entry = A
>          next = B
>   
>      P2: __free_rcu
>          // A/B/C is freed back to slab
>          // c0->waiting_for_gp_ttrace->first = NULL
>          free_all(llist_del_all(c0))
>   
>      P3: alloc_bulk(c1)
>          // got A
>          __kmalloc()
>   
>      // free A (from c1), ..., last free X (allocated from c0)
>      P3: unit_free(c1)
>          // the last freed element X is from c0
>          c1->tgt = c0
>          c1->free_llist->first -> X -> Y -> ... -> A
>      P3: free_bulk(c1)
>          enque_to_free(c0)
>              c0->free_by_rcu_ttrace->first -> A -> ... -> Y -> X
>          __llist_add_batch(c0->waiting_for_gp_ttrace)
>              c0->waiting_for_gp_ttrace = A -> ... -> Y -> X

In theory that's possible, but for this to happen one cpu needs
to be a thousand times slower than all the others, and since there is
no preemption in llist_del_first() I don't think we need to worry about
it. Also, with removal of the _tail optimization the above
llist_add_batch(waiting_for_gp_ttrace)
will become a loop, so the reused element will end up at the very end
instead of at the top, and one cpu would have to be a million times
slower, which is not realistic.

> P1:
>      // A is added back as first again
>      // but llist_del_first() didn't know
>      try_cmpxhg(&c0->waiting_for_gp_ttrace->first, A, B)
>      // c0->waiting_for_gp_trrace is corrupted
>      c0->waiting_for_gp_ttrace->first = B
> 


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 bpf-next 12/13] bpf: Introduce bpf_mem_free_rcu() similar to kfree_rcu().
  2023-06-28  0:52     ` Alexei Starovoitov
@ 2023-06-28  1:36       ` Hou Tao
  0 siblings, 0 replies; 37+ messages in thread
From: Hou Tao @ 2023-06-28  1:36 UTC (permalink / raw)
  To: Alexei Starovoitov, Alexei Starovoitov
  Cc: tj, rcu, netdev, bpf, kernel-team, daniel, void, andrii, paulmck

Hi,

On 6/28/2023 8:52 AM, Alexei Starovoitov wrote:
> On 6/23/23 11:49 PM, Hou Tao wrote:
>> Hi,
>>
>> On 6/24/2023 11:13 AM, Alexei Starovoitov wrote:
>>> From: Alexei Starovoitov <ast@kernel.org>
>>>
>> SNIP
>>>   +static void __free_by_rcu(struct rcu_head *head)
>>> +{
>>> +    struct bpf_mem_cache *c = container_of(head, struct
>>> bpf_mem_cache, rcu);
>>> +    struct bpf_mem_cache *tgt = c->tgt;
>>> +    struct llist_node *llnode;
>>> +
>>> +    if (unlikely(READ_ONCE(c->draining)))
>>> +        goto out;
>>> +
>>> +    llnode = llist_del_all(&c->waiting_for_gp);
>>> +    if (!llnode)
>>> +        goto out;
>>> +
>>> +    if (llist_add_batch(llnode, c->waiting_for_gp_tail,
>>> &tgt->free_by_rcu_ttrace))
>>> +        tgt->free_by_rcu_ttrace_tail = c->waiting_for_gp_tail;
>> Got a null-ptr dereference oops when running multiple test_maps and
>> htab-mem benchmark after hacking htab to use bpf_mem_cache_free_rcu().
>> And I think it happened as follow:
>>
>> // c->tgt
>> P1: __free_by_rcu()
>>          // c->tgt is the same as P1
>>          P2: __free_by_rcu()
>>
>> // return true
>> P1: llist_add_batch(&tgt->free_by_rcu_ttrace)
>>          // return false
>>          P2: llist_add_batch(&tgt->free_by_rcu_ttrace)
>>          P2: do_call_rcu_ttrace
>>          // return false
>>          P2: xchg(tgt->call_rcu_ttrace_in_progress, 1)
>>          // llnode is not NULL
>>          P2: llnode = llist_del_all(&c->free_by_rcu_ttrace)
>>          // BAD: c->free_by_rcu_ttrace_tail is NULL, so oops
>>          P2: __llist_add_batch(llnode, c->free_by_rcu_ttrace_tail)
>>
>> P1: tgt->free_by_rcu_ttrace_tail = X
>>
>> I don't have a good fix for the problem except adding a spin-lock for
>> free_by_rcu_ttrace and free_by_rcu_ttrace_tail.
>
> null-ptr is probably something else, since the race window is
> extremely tiny.

The null-ptr dereference is indeed due to free_by_rcu_ttrace_tail being
NULL. The oops occurred multiple times and I have checked the vmcore to
confirm it.

> In my testing this optimization doesn't buy much.
> So I'll just drop _tail optimization and switch to for_each(del_all)
> to move elements. We can revisit later.

OK


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 bpf-next 12/13] bpf: Introduce bpf_mem_free_rcu() similar to kfree_rcu().
  2023-06-28  0:56     ` Alexei Starovoitov
@ 2023-06-28  1:43       ` Hou Tao
  2023-06-28  1:51         ` Alexei Starovoitov
  0 siblings, 1 reply; 37+ messages in thread
From: Hou Tao @ 2023-06-28  1:43 UTC (permalink / raw)
  To: Alexei Starovoitov, Alexei Starovoitov
  Cc: tj, rcu, netdev, bpf, kernel-team, daniel, void, andrii, paulmck

Hi,

On 6/28/2023 8:56 AM, Alexei Starovoitov wrote:
> On 6/25/23 4:15 AM, Hou Tao wrote:
>> Hi,
>>
>> On 6/24/2023 11:13 AM, Alexei Starovoitov wrote:
>>> From: Alexei Starovoitov <ast@kernel.org>
>>>
>>> Introduce bpf_mem_[cache_]free_rcu() similar to kfree_rcu().
>>> Unlike bpf_mem_[cache_]free() that links objects for immediate reuse
>>> into
>>> per-cpu free list the _rcu() flavor waits for RCU grace period and
>>> then moves
>>> objects into free_by_rcu_ttrace list where they are waiting for RCU
>>> task trace grace period to be freed into slab.
>> SNIP
>>>     static void free_mem_alloc_no_barrier(struct bpf_mem_alloc *ma)
>>> @@ -498,8 +566,8 @@ static void free_mem_alloc_no_barrier(struct
>>> bpf_mem_alloc *ma)
>>>     static void free_mem_alloc(struct bpf_mem_alloc *ma)
>>>   {
>>> -    /* waiting_for_gp_ttrace lists was drained, but __free_rcu might
>>> -     * still execute. Wait for it now before we freeing percpu caches.
>>> +    /* waiting_for_gp[_ttrace] lists were drained, but RCU callbacks
>>> +     * might still execute. Wait for them.
>>>        *
>>>        * rcu_barrier_tasks_trace() doesn't imply
>>> synchronize_rcu_tasks_trace(),
>>>        * but rcu_barrier_tasks_trace() and rcu_barrier() below are
>>> only used
>> I think an extra rcu_barrier() before rcu_barrier_tasks_trace() is still
>> needed here, otherwise free_mem_alloc will not wait for inflight
>> __free_by_rcu() and there will oops in rcu_do_batch().
>
> Agree. I got confused by rcu_trace_implies_rcu_gp().
> rcu_barrier() is necessary.
>
> re: draining.
> I'll switch to do if (draing) free_all; else call_rcu; scheme
> to address potential memory leak though I wasn't able to repro it.
For v2 it was also hard for me to reproduce the leak problem. But after
injecting a delay with udelay() in __free_by_rcu()/__free_rcu() right
after reading c->draining, it was relatively easy to reproduce the
problems.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 bpf-next 12/13] bpf: Introduce bpf_mem_free_rcu() similar to kfree_rcu().
  2023-06-28  1:43       ` Hou Tao
@ 2023-06-28  1:51         ` Alexei Starovoitov
  2023-06-28  2:03           ` Hou Tao
  0 siblings, 1 reply; 37+ messages in thread
From: Alexei Starovoitov @ 2023-06-28  1:51 UTC (permalink / raw)
  To: Hou Tao
  Cc: Alexei Starovoitov, Tejun Heo, rcu, Network Development, bpf,
	Kernel Team, Daniel Borkmann, David Vernet, Andrii Nakryiko,
	Paul E. McKenney

On Tue, Jun 27, 2023 at 6:43 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 6/28/2023 8:56 AM, Alexei Starovoitov wrote:
> > On 6/25/23 4:15 AM, Hou Tao wrote:
> >> Hi,
> >>
> >> On 6/24/2023 11:13 AM, Alexei Starovoitov wrote:
> >>> From: Alexei Starovoitov <ast@kernel.org>
> >>>
> >>> Introduce bpf_mem_[cache_]free_rcu() similar to kfree_rcu().
> >>> Unlike bpf_mem_[cache_]free() that links objects for immediate reuse
> >>> into
> >>> per-cpu free list the _rcu() flavor waits for RCU grace period and
> >>> then moves
> >>> objects into free_by_rcu_ttrace list where they are waiting for RCU
> >>> task trace grace period to be freed into slab.
> >> SNIP
> >>>     static void free_mem_alloc_no_barrier(struct bpf_mem_alloc *ma)
> >>> @@ -498,8 +566,8 @@ static void free_mem_alloc_no_barrier(struct
> >>> bpf_mem_alloc *ma)
> >>>     static void free_mem_alloc(struct bpf_mem_alloc *ma)
> >>>   {
> >>> -    /* waiting_for_gp_ttrace lists was drained, but __free_rcu might
> >>> -     * still execute. Wait for it now before we freeing percpu caches.
> >>> +    /* waiting_for_gp[_ttrace] lists were drained, but RCU callbacks
> >>> +     * might still execute. Wait for them.
> >>>        *
> >>>        * rcu_barrier_tasks_trace() doesn't imply
> >>> synchronize_rcu_tasks_trace(),
> >>>        * but rcu_barrier_tasks_trace() and rcu_barrier() below are
> >>> only used
> >> I think an extra rcu_barrier() before rcu_barrier_tasks_trace() is still
> >> needed here, otherwise free_mem_alloc will not wait for inflight
> >> __free_by_rcu() and there will oops in rcu_do_batch().
> >
> > Agree. I got confused by rcu_trace_implies_rcu_gp().
> > rcu_barrier() is necessary.
> >
> > re: draining.
> > I'll switch to do if (draing) free_all; else call_rcu; scheme
> > to address potential memory leak though I wasn't able to repro it.
> For v2, it was also hard for me to reproduce the leak problem. But after
> I injected some delay by using udelay() in __free_by_rcu/__free_rcu()
> after reading c->draining, it was relatively easy to reproduce the problems.

1. Please respin htab bench.
We're still discussing patching without having the same base line.

2. 'adding udelay()' is too vague. Pls post a diff hunk of what exactly
you mean.

3. I'll send v3 shortly. Let's move discussion there.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 bpf-next 12/13] bpf: Introduce bpf_mem_free_rcu() similar to kfree_rcu().
  2023-06-28  1:51         ` Alexei Starovoitov
@ 2023-06-28  2:03           ` Hou Tao
  0 siblings, 0 replies; 37+ messages in thread
From: Hou Tao @ 2023-06-28  2:03 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Alexei Starovoitov, Tejun Heo, rcu, Network Development, bpf,
	Kernel Team, Daniel Borkmann, David Vernet, Andrii Nakryiko,
	Paul E. McKenney

Hi,

On 6/28/2023 9:51 AM, Alexei Starovoitov wrote:
> On Tue, Jun 27, 2023 at 6:43 PM Hou Tao <houtao@huaweicloud.com> wrote:
>>
SNIP
>>> re: draining.
>>> I'll switch to do if (draing) free_all; else call_rcu; scheme
>>> to address potential memory leak though I wasn't able to repro it.
>> For v2, it was also hard for me to reproduce the leak problem. But after
>> I injected some delay by using udelay() in __free_by_rcu/__free_rcu()
>> after reading c->draining, it was relatively easy to reproduce the problems.
> 1. Please respin htab bench.
> We're still discussing patching without having the same base line.
Almost done. I need to rerun the benchmark to update the numbers in the
commit message.
>
> 2. 'adding udelay()' is too vague. Pls post a diff hunk of what exactly
> you mean.

--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -4,6 +4,7 @@
 #include <linux/llist.h>
 #include <linux/bpf.h>
 #include <linux/irq_work.h>
+#include <linux/delay.h>
 #include <linux/bpf_mem_alloc.h>
 #include <linux/memcontrol.h>
 #include <asm/local.h>
@@ -261,12 +262,17 @@ static int free_all(struct llist_node *llnode, bool percpu)
        return cnt;
 }

+static unsigned int delay;
+module_param(delay, uint, 0644);
+
 static void __free_rcu(struct rcu_head *head)
 {
        struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu_ttrace);

        if (unlikely(READ_ONCE(c->draining)))
                goto out;
+       if (delay)
+               udelay(delay);
        free_all(llist_del_all(&c->waiting_for_gp_ttrace), !!c->percpu_size);
 out:
        atomic_set(&c->call_rcu_ttrace_in_progress, 0);
@@ -361,6 +367,8 @@ static void __free_by_rcu(struct rcu_head *head)

        if (unlikely(READ_ONCE(c->draining)))
                goto out;
+       if (delay)
+               udelay(delay);

        llnode = llist_del_all(&c->waiting_for_gp);
        if (!llnode)

>
> 3. I'll send v3 shortly. Let's move discussion there.
Sure.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 bpf-next 09/13] bpf: Allow reuse from waiting_for_gp_ttrace list.
       [not found]           ` <957dd5cd-0855-1197-7045-4cb1590bd753@huaweicloud.com>
@ 2023-06-28 17:38             ` Paul E. McKenney
  0 siblings, 0 replies; 37+ messages in thread
From: Paul E. McKenney @ 2023-06-28 17:38 UTC (permalink / raw)
  To: Hou Tao
  Cc: Alexei Starovoitov, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, David Vernet, Tejun Heo, rcu,
	Network Development, bpf, Kernel Team

On Wed, Jun 28, 2023 at 04:09:14PM +0800, Hou Tao wrote:
> Hi,
> 
> On 6/28/2023 8:59 AM, Alexei Starovoitov wrote:
> > On 6/26/23 12:16 AM, Hou Tao wrote:
> >> Hi,
> >>
> >> On 6/26/2023 12:42 PM, Alexei Starovoitov wrote:
> >>> On Sun, Jun 25, 2023 at 8:30 PM Hou Tao <houtao@huaweicloud.com> wrote:
> >>>> Hi,
> >>>>
> >>>> On 6/24/2023 11:13 AM, Alexei Starovoitov wrote:
> >>>>> From: Alexei Starovoitov <ast@kernel.org>
> >>>>>
> >>>>> alloc_bulk() can reuse elements from free_by_rcu_ttrace.
> >>>>> Let it reuse from waiting_for_gp_ttrace as well to avoid
> >>>>> unnecessary kmalloc().
> >>>>>
> >>>>> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> >>>>> ---
> >>>>>   kernel/bpf/memalloc.c | 9 +++++++++
> >>>>>   1 file changed, 9 insertions(+)
> >>>>>
> SNIP
> >>        // free A (from c1), ..., last free X (allocated from c0)
> >>      P3: unit_free(c1)
> >>          // the last freed element X is from c0
> >>          c1->tgt = c0
> >>          c1->free_llist->first -> X -> Y -> ... -> A
> >>      P3: free_bulk(c1)
> >>          enque_to_free(c0)
> >>              c0->free_by_rcu_ttrace->first -> A -> ... -> Y -> X
> >>          __llist_add_batch(c0->waiting_for_gp_ttrace)
> >>              c0->waiting_for_gp_ttrace = A -> ... -> Y -> X
> >
> > In theory that's possible, but for this to happen one cpu needs
> > to be thousand times slower than all others and since there is no
> > preemption in llist_del_first I don't think we need to worry about it.
> 
> Not sure whether or not such a case is possible in a VM; after all,
> CPU X is just a thread on the host and it may be preempted at any time
> and for any duration.

vCPU preemption can happen even with guest-OS interrupts disabled, and
such preemption can persist for hundreds of milliseconds, or even for
several seconds.  So admittedly quite rare, but also quite possible.

							Thanx, Paul

> > Also with removal of _tail optimization the above
> > llist_add_batch(waiting_for_gp_ttrace)
> > will become a loop, so reused element will be at the very end
> > instead of top, so one cpu to million times slower which is not
> > realistic.
> 
> It is still possible that A will be added back as
> waiting_for_gp_ttrace->first after switching to llist_add(), as shown
> below. My question is: how much benefit is there in reusing from
> waiting_for_gp_ttrace?
> 
>     // free A (from c1), ..., last free X (allocated from c0) 
>     P3: unit_free(c1)
>         // the last freed element X is allocated from c0
>         c1->tgt = c0
>         c1->free_llist->first -> A -> ... -> Y
>         c1->free_llist_extra -> X
> 
>     P3: free_bulk(c1)
>         enque_to_free(c0) 
>             c0->free_by_rcu_ttrace->first -> Y -> ... A
>             c0->free_by_rcu_ttrace->first -> X -> Y -> ... A
> 
>         llist_add(c0->waiting_for_gp_ttrace)
>             c0->waiting_for_gp_ttrace = A -> .. -> Y -> X
> 
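
(The per-element loop Alexei refers to above would look roughly like the
sketch below; this is an illustration of the idea, not necessarily the
exact code of the next revision:)

	struct llist_node *llnode, *t;

	llist_for_each_safe(llnode, t, llist_del_all(&c->free_by_rcu_ttrace))
		llist_add(llnode, &c->waiting_for_gp_ttrace);

Because llist_add() pushes one element at a time onto the head, the order
of free_by_rcu_ttrace (X -> Y -> ... -> A) is reversed, so A, the element a
stalled llist_del_first() may still hold a pointer to, again ends up as
waiting_for_gp_ttrace->first, which is the point of the trace above.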
> >
> >> P1:
> >>      // A is added back as the first element again,
> >>      // but the in-flight llist_del_first() doesn't know that
> >>      try_cmpxchg(&c0->waiting_for_gp_ttrace->first, A, B)
> >>      // c0->waiting_for_gp_ttrace is corrupted
> >>      c0->waiting_for_gp_ttrace->first = B
> >>
> 
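
(For reference, the window Hou's trace points at lies inside
llist_del_first(); the sketch below paraphrases lib/llist.c and annotates
where an arbitrarily long stall, e.g. the vCPU preemption Paul describes,
lets the popped node be freed and re-queued:)

	struct llist_node *llist_del_first(struct llist_head *head)
	{
		struct llist_node *entry, *next;

		entry = smp_load_acquire(&head->first);	/* reads A */
		do {
			if (entry == NULL)
				return NULL;
			next = READ_ONCE(entry->next);	/* reads B */
			/*
			 * If this CPU stalls here long enough for A to be
			 * freed, reallocated and pushed back as head->first
			 * with a different ->next chain, the cmpxchg below
			 * still succeeds: it only compares the head against
			 * A, so it installs the stale B and the nodes queued
			 * behind the re-added A are lost.
			 */
		} while (!try_cmpxchg(&head->first, &entry, next));

		return entry;
	}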


Thread overview: 37+ messages in thread
2023-06-24  3:13 [PATCH v2 bpf-next 00/13] bpf: Introduce bpf_mem_cache_free_rcu() Alexei Starovoitov
2023-06-24  3:13 ` [PATCH v2 bpf-next 01/13] bpf: Rename few bpf_mem_alloc fields Alexei Starovoitov
2023-06-24  3:13 ` [PATCH v2 bpf-next 02/13] bpf: Simplify code of destroy_mem_alloc() with kmemdup() Alexei Starovoitov
2023-06-24  3:13 ` [PATCH v2 bpf-next 03/13] bpf: Let free_all() return the number of freed elements Alexei Starovoitov
2023-06-24  3:13 ` [PATCH v2 bpf-next 04/13] bpf: Refactor alloc_bulk() Alexei Starovoitov
2023-06-24  3:13 ` [PATCH v2 bpf-next 05/13] bpf: Further refactor alloc_bulk() Alexei Starovoitov
2023-06-24  3:13 ` [PATCH v2 bpf-next 06/13] bpf: Optimize moving objects from free_by_rcu_ttrace to waiting_for_gp_ttrace Alexei Starovoitov
2023-06-24  3:13 ` [PATCH v2 bpf-next 07/13] bpf: Change bpf_mem_cache draining process Alexei Starovoitov
2023-06-24  3:13 ` [PATCH v2 bpf-next 08/13] bpf: Add a hint to allocated objects Alexei Starovoitov
2023-06-24  3:13 ` [PATCH v2 bpf-next 09/13] bpf: Allow reuse from waiting_for_gp_ttrace list Alexei Starovoitov
2023-06-26  3:30   ` Hou Tao
2023-06-26  4:42     ` Alexei Starovoitov
2023-06-26  7:16       ` Hou Tao
2023-06-28  0:59         ` Alexei Starovoitov
     [not found]           ` <957dd5cd-0855-1197-7045-4cb1590bd753@huaweicloud.com>
2023-06-28 17:38             ` Paul E. McKenney
2023-06-24  3:13 ` [PATCH v2 bpf-next 10/13] rcu: Export rcu_request_urgent_qs_task() Alexei Starovoitov
2023-06-24  3:13 ` [PATCH v2 bpf-next 11/13] selftests/bpf: Improve test coverage of bpf_mem_alloc Alexei Starovoitov
2023-06-24  3:13 ` [PATCH v2 bpf-next 12/13] bpf: Introduce bpf_mem_free_rcu() similar to kfree_rcu() Alexei Starovoitov
2023-06-24  6:49   ` Hou Tao
2023-06-25 11:20     ` Hou Tao
2023-06-28  0:52     ` Alexei Starovoitov
2023-06-28  1:36       ` Hou Tao
2023-06-24  7:53   ` Hou Tao
2023-06-28  0:54     ` Alexei Starovoitov
2023-06-24  9:05   ` Hou Tao
2023-06-25 11:15   ` Hou Tao
2023-06-28  0:56     ` Alexei Starovoitov
2023-06-28  1:43       ` Hou Tao
2023-06-28  1:51         ` Alexei Starovoitov
2023-06-28  2:03           ` Hou Tao
2023-06-24  3:13 ` [PATCH v2 bpf-next 13/13] bpf: Convert bpf_cpumask to bpf_mem_cache_free_rcu Alexei Starovoitov
2023-06-26 15:42   ` David Vernet
2023-06-26 16:09     ` Alexei Starovoitov
2023-06-26 17:55       ` David Vernet
2023-06-26 17:59         ` Alexei Starovoitov
2023-06-26 18:01           ` David Vernet
2023-06-24  7:09 ` [PATCH v2 bpf-next 00/13] bpf: Introduce bpf_mem_cache_free_rcu() Hou Tao
