* [PATCH bpf-next v2 1/2] bpf: Reuse freed element in free_by_rcu during allocation
From: Hou Tao @ 2022-12-09 1:09 UTC
To: bpf
Cc: Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
Stanislav Fomichev, Jiri Olsa, John Fastabend, Paul E. McKenney,
rcu, houtao1
From: Hou Tao <houtao1@huawei.com>
When there are batched freeing operations on a specific CPU, part of
the freed elements ((high_watermark - low_watermark) / 2 + 1 of them)
will be indirectly moved into the waiting_for_gp list through the
free_by_rcu list. After call_rcu_in_progress becomes false again, the
remaining elements in the free_by_rcu list will be moved to the
waiting_for_gp list by the next invocation of free_bulk(). However, if
the expiration of the RCU tasks trace grace period is relatively slow,
no elements in the free_by_rcu list will be moved in the meantime.
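[ Editor's aside: to make the flow above concrete, here is a simplified
sketch of the pre-patch free path, paraphrased from
kernel/bpf/memalloc.c; the local_irq/active protection, PREEMPT_RT
handling, and free_llist_extra draining are omitted, so this is not the
exact kernel source. ]

/* Simplified paraphrase, not the exact source. Once
 * call_rcu_in_progress is set, newly freed elements pile up on
 * free_by_rcu until the RCU tasks trace grace period expires.
 */
static void do_call_rcu(struct bpf_mem_cache *c)
{
	struct llist_node *llnode, *t;

	/* Only one RCU tasks trace callback in flight per cache. */
	if (atomic_xchg(&c->call_rcu_in_progress, 1))
		return;

	/* Move everything accumulated on free_by_rcu to waiting_for_gp
	 * and let __free_rcu_tasks_trace() free it after the grace period.
	 */
	llist_for_each_safe(llnode, t, __llist_del_all(&c->free_by_rcu))
		__llist_add(llnode, &c->waiting_for_gp);
	call_rcu_tasks_trace(&c->rcu, __free_rcu_tasks_trace);
}

static void free_bulk(struct bpf_mem_cache *c)
{
	struct llist_node *llnode;
	int cnt;

	/* Drain free_llist down to the midpoint of the watermarks, i.e.
	 * move (high_watermark - low_watermark) / 2 + 1 elements per
	 * batch onto free_by_rcu.
	 */
	do {
		llnode = __llist_del_first(&c->free_llist);
		cnt = llnode ? --c->free_cnt : 0;
		if (llnode)
			__llist_add(llnode, &c->free_by_rcu);
	} while (cnt > (c->high_watermark + c->low_watermark) / 2);

	do_call_rcu(c);
}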
So, instead of invoking __alloc_percpu_gfp() or kmalloc_node() to
allocate a new object, alloc_bulk() now first checks whether there is a
freed element in the free_by_rcu list and reuses it if one is available.
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
---
kernel/bpf/memalloc.c | 21 ++++++++++++++++++---
1 file changed, 18 insertions(+), 3 deletions(-)
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 8f0d65f2474a..04d96d1b98a3 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -171,9 +171,24 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
memcg = get_memcg(c);
old_memcg = set_active_memcg(memcg);
for (i = 0; i < cnt; i++) {
- obj = __alloc(c, node);
- if (!obj)
- break;
+ /*
+ * free_by_rcu is only manipulated by irq work refill_work().
+ * irq_work callbacks on the same CPU run sequentially, so it is
+ * safe to use __llist_del_first() here. If alloc_bulk() is
+ * invoked by the initial prefill, there will be no running
+ * refill_work(), so __llist_del_first() is fine as well.
+ *
+ * In most cases, objects on free_by_rcu are from the same CPU.
+ * If some objects come from other CPUs, it doesn't cause any
+ * harm because NUMA_NO_NODE only expresses a preference for the
+ * current NUMA node rather than a guarantee.
+ */
+ obj = __llist_del_first(&c->free_by_rcu);
+ if (!obj) {
+ obj = __alloc(c, node);
+ if (!obj)
+ break;
+ }
if (IS_ENABLED(CONFIG_PREEMPT_RT))
/* In RT irq_work runs in per-cpu kthread, so disable
* interrupts to avoid preemption and interrupts and
--
2.29.2
* [PATCH bpf-next v2 2/2] bpf: Skip rcu_barrier() if rcu_trace_implies_rcu_gp() is true
From: Hou Tao @ 2022-12-09 1:09 UTC
To: bpf
Cc: Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
Stanislav Fomichev, Jiri Olsa, John Fastabend, Paul E. McKenney,
rcu, houtao1
From: Hou Tao <houtao1@huawei.com>
If there are pending RCU callbacks, free_mem_alloc() will use
rcu_barrier_tasks_trace() and rcu_barrier() to wait for the pending
__free_rcu_tasks_trace() and __free_rcu() callbacks.
If rcu_trace_implies_rcu_gp() is true, no __free_rcu() callback will be
pending, so it is OK to skip rcu_barrier() as well.
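[ Editor's aside: for context, the free-side callback that the reasoning
above relies on looks roughly like this, abridged from
kernel/bpf/memalloc.c as of this series; not the exact source. ]

static void __free_rcu_tasks_trace(struct rcu_head *head)
{
	/* If an RCU tasks trace grace period also implies a regular RCU
	 * grace period, the objects can be freed right away and no
	 * __free_rcu() callback is ever queued via call_rcu(). That is
	 * exactly the case in which free_mem_alloc() may skip
	 * rcu_barrier().
	 */
	if (rcu_trace_implies_rcu_gp())
		__free_rcu(head);
	else
		call_rcu(head, __free_rcu);
}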
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Hou Tao <houtao1@huawei.com>
---
kernel/bpf/memalloc.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 04d96d1b98a3..ebcc3dd0fa19 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -464,9 +464,17 @@ static void free_mem_alloc(struct bpf_mem_alloc *ma)
{
/* waiting_for_gp lists were drained, but __free_rcu might
* still execute. Wait for it now before freeing the percpu caches.
+ *
+ * rcu_barrier_tasks_trace() doesn't imply synchronize_rcu_tasks_trace(),
+ * but rcu_barrier_tasks_trace() and rcu_barrier() below are only used
+ * to wait for the pending __free_rcu_tasks_trace() and __free_rcu(),
+ * so if call_rcu(head, __free_rcu) is skipped due to
+ * rcu_trace_implies_rcu_gp(), it will be OK to skip rcu_barrier() by
+ * using rcu_trace_implies_rcu_gp() as well.
*/
rcu_barrier_tasks_trace();
- rcu_barrier();
+ if (!rcu_trace_implies_rcu_gp())
+ rcu_barrier();
free_mem_alloc_no_barrier(ma);
}
--
2.29.2
* Re: [PATCH bpf-next v2 0/2] Misc optimizations for bpf mem allocator
From: patchwork-bot+netdevbpf @ 2022-12-09 2:00 UTC
To: Hou Tao
Cc: bpf, martin.lau, andrii, song, haoluo, yhs, ast, daniel, kpsingh,
sdf, jolsa, john.fastabend, paulmck, rcu, houtao1
Hello:
This series was applied to bpf/bpf-next.git (master)
by Alexei Starovoitov <ast@kernel.org>:
On Fri, 9 Dec 2022 09:09:45 +0800 you wrote:
> From: Hou Tao <houtao1@huawei.com>
>
> Hi,
>
> The patchset contains misc optimizations for the bpf mem allocator.
> Patch 1 fixes the OOM problem found while running the hash-table
> update benchmark from the qp-trie patchset [0]. The benchmark adds
> htab elements in batch and then deletes them in batch, so freed
> objects stack up on free_by_rcu while waiting for the expiration of
> the RCU grace period. There can be tens of thousands of such freed
> objects, and since they are not available for reuse, adding htab
> elements keeps triggering new allocations.
>
> [...]
Here is the summary with links:
- [bpf-next,v2,1/2] bpf: Reuse freed element in free_by_rcu during allocation
https://git.kernel.org/bpf/bpf-next/c/0893d6007db5
- [bpf-next,v2,2/2] bpf: Skip rcu_barrier() if rcu_trace_implies_rcu_gp() is true
https://git.kernel.org/bpf/bpf-next/c/822ed78fab13
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html