* [PATCH bpf-next v2 1/2] bpf: Reuse freed element in free_by_rcu during allocation
From: Hou Tao @ 2022-12-09 1:09 UTC
To: bpf
Cc: Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
Stanislav Fomichev, Jiri Olsa, John Fastabend, Paul E. McKenney,
rcu, houtao1
From: Hou Tao <houtao1@huawei.com>
When there are batched freeing operations on a specific CPU, part of
the freed elements ((high_watermark - low_watermark) / 2 + 1 of them)
will be indirectly moved into the waiting_for_gp list through the
free_by_rcu list. After call_rcu_in_progress becomes false again, the
remaining elements in the free_by_rcu list will be moved to the
waiting_for_gp list by the next invocation of free_bulk(). However, if
the expiration of the RCU tasks trace grace period is relatively slow,
no elements in the free_by_rcu list will be moved in the meantime.
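[ Editor's aside: to make the flow above concrete, here is a simplified
sketch of the pre-patch free path, paraphrased from
kernel/bpf/memalloc.c; the local_irq/active protection, PREEMPT_RT
handling, and free_llist_extra draining are omitted, so this is not the
exact kernel source. ]

/* Simplified paraphrase, not the exact source. Once
 * call_rcu_in_progress is set, newly freed elements pile up on
 * free_by_rcu until the RCU tasks trace grace period expires.
 */
static void do_call_rcu(struct bpf_mem_cache *c)
{
	struct llist_node *llnode, *t;

	/* Only one RCU tasks trace callback in flight per cache. */
	if (atomic_xchg(&c->call_rcu_in_progress, 1))
		return;

	/* Move everything accumulated on free_by_rcu to waiting_for_gp
	 * and let __free_rcu_tasks_trace() free it after the grace period.
	 */
	llist_for_each_safe(llnode, t, __llist_del_all(&c->free_by_rcu))
		__llist_add(llnode, &c->waiting_for_gp);
	call_rcu_tasks_trace(&c->rcu, __free_rcu_tasks_trace);
}

static void free_bulk(struct bpf_mem_cache *c)
{
	struct llist_node *llnode;
	int cnt;

	/* Drain free_llist down to the midpoint of the watermarks, i.e.
	 * move (high_watermark - low_watermark) / 2 + 1 elements per
	 * batch onto free_by_rcu.
	 */
	do {
		llnode = __llist_del_first(&c->free_llist);
		cnt = llnode ? --c->free_cnt : 0;
		if (llnode)
			__llist_add(llnode, &c->free_by_rcu);
	} while (cnt > (c->high_watermark + c->low_watermark) / 2);

	do_call_rcu(c);
}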
So, instead of invoking __alloc_percpu_gfp() or kmalloc_node() to
allocate a new object, alloc_bulk() now first checks whether there is a
freed element in the free_by_rcu list and reuses it if one is available.
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
---
kernel/bpf/memalloc.c | 21 ++++++++++++++++++---
1 file changed, 18 insertions(+), 3 deletions(-)
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 8f0d65f2474a..04d96d1b98a3 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -171,9 +171,24 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
memcg = get_memcg(c);
old_memcg = set_active_memcg(memcg);
for (i = 0; i < cnt; i++) {
- obj = __alloc(c, node);
- if (!obj)
- break;
+ /*
+ * free_by_rcu is only manipulated by irq work refill_work().
+ * irq_work callbacks on the same CPU run sequentially, so it is
+ * safe to use __llist_del_first() here. If alloc_bulk() is
+ * invoked by the initial prefill, there will be no running
+ * refill_work(), so __llist_del_first() is fine as well.
+ *
+ * In most cases, objects on free_by_rcu are from the same CPU.
+ * If some objects come from other CPUs, it doesn't cause any
+ * harm because NUMA_NO_NODE only expresses a preference for the
+ * current NUMA node rather than a guarantee.
+ */
+ obj = __llist_del_first(&c->free_by_rcu);
+ if (!obj) {
+ obj = __alloc(c, node);
+ if (!obj)
+ break;
+ }
if (IS_ENABLED(CONFIG_PREEMPT_RT))
/* In RT irq_work runs in per-cpu kthread, so disable
* interrupts to avoid preemption and interrupts and
--
2.29.2
* [PATCH bpf-next v2 2/2] bpf: Skip rcu_barrier() if rcu_trace_implies_rcu_gp() is true
From: Hou Tao @ 2022-12-09 1:09 UTC
To: bpf
Cc: Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
Stanislav Fomichev, Jiri Olsa, John Fastabend, Paul E. McKenney,
rcu, houtao1
From: Hou Tao <houtao1@huawei.com>
If there are pending RCU callbacks, free_mem_alloc() will use
rcu_barrier_tasks_trace() and rcu_barrier() to wait for the pending
__free_rcu_tasks_trace() and __free_rcu() callbacks.
If rcu_trace_implies_rcu_gp() is true, no __free_rcu() callback will be
pending, so it is OK to skip rcu_barrier() as well.
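[ Editor's aside: for context, the free-side callback that the reasoning
above relies on looks roughly like this, abridged from
kernel/bpf/memalloc.c as of this series; not the exact source. ]

static void __free_rcu_tasks_trace(struct rcu_head *head)
{
	/* If an RCU tasks trace grace period also implies a regular RCU
	 * grace period, the objects can be freed right away and no
	 * __free_rcu() callback is ever queued via call_rcu(). That is
	 * exactly the case in which free_mem_alloc() may skip
	 * rcu_barrier().
	 */
	if (rcu_trace_implies_rcu_gp())
		__free_rcu(head);
	else
		call_rcu(head, __free_rcu);
}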
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Hou Tao <houtao1@huawei.com>
---
kernel/bpf/memalloc.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 04d96d1b98a3..ebcc3dd0fa19 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -464,9 +464,17 @@ static void free_mem_alloc(struct bpf_mem_alloc *ma)
{
/* waiting_for_gp lists were drained, but __free_rcu might
* still execute. Wait for it now before freeing the percpu caches.
+ *
+ * rcu_barrier_tasks_trace() doesn't imply synchronize_rcu_tasks_trace(),
+ * but rcu_barrier_tasks_trace() and rcu_barrier() below are only used
+ * to wait for the pending __free_rcu_tasks_trace() and __free_rcu(),
+ * so if call_rcu(head, __free_rcu) is skipped due to
+ * rcu_trace_implies_rcu_gp(), it will be OK to skip rcu_barrier() by
+ * using rcu_trace_implies_rcu_gp() as well.
*/
rcu_barrier_tasks_trace();
- rcu_barrier();
+ if (!rcu_trace_implies_rcu_gp())
+ rcu_barrier();
free_mem_alloc_no_barrier(ma);
}
--
2.29.2
* Re: [PATCH bpf-next v2 0/2] Misc optimizations for bpf mem allocator
From: patchwork-bot+netdevbpf @ 2022-12-09 2:00 UTC
To: Hou Tao
Cc: bpf, martin.lau, andrii, song, haoluo, yhs, ast, daniel, kpsingh,
sdf, jolsa, john.fastabend, paulmck, rcu, houtao1
Hello:
This series was applied to bpf/bpf-next.git (master)
by Alexei Starovoitov <ast@kernel.org>:
On Fri, 9 Dec 2022 09:09:45 +0800 you wrote:
> From: Hou Tao <houtao1@huawei.com>
>
> Hi,
>
> The patchset contains misc optimizations for the bpf mem allocator.
> Patch 1 fixes the OOM problem found while running the hash-table
> update benchmark from the qp-trie patchset [0]. The benchmark adds
> htab elements in batch and then deletes them in batch, so freed
> objects stack up on free_by_rcu while waiting for the expiration of
> the RCU grace period. There can be tens of thousands of such freed
> objects, and since they are not available for reuse, adding htab
> elements keeps triggering new allocations.
>
> [...]
Here is the summary with links:
- [bpf-next,v2,1/2] bpf: Reuse freed element in free_by_rcu during allocation
https://git.kernel.org/bpf/bpf-next/c/0893d6007db5
- [bpf-next,v2,2/2] bpf: Skip rcu_barrier() if rcu_trace_implies_rcu_gp() is true
https://git.kernel.org/bpf/bpf-next/c/822ed78fab13
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html