* [PATCH bpf 0/2] Wait for busy refill_work when destorying bpf memory allocator
From: Hou Tao @ 2022-10-19 11:55 UTC
To: bpf, Alexei Starovoitov
Cc: Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa, John Fastabend, houtao1

From: Hou Tao <houtao1@huawei.com>

Hi,

This patchset fixes a problem with bpf memory allocator destruction on a
PREEMPT_RT kernel or a kernel where arch_irq_work_has_interrupt() is
false (e.g. a 1-cpu arm32 host). The root cause is that refill_work may
still be busy while the allocator is being destroyed, which may lead to
an oops or other problems as shown in patch #1.

Patch #1 fixes the problem by waiting for the completion of the irq work
during destruction, and patch #2 is just a clean-up patch based on
patch #1. Please see individual patches for more details.

Comments are always welcome.

Hou Tao (2):
  bpf: Wait for busy refill_work when destorying bpf memory allocator
  bpf: Use __llist_del_all() whenever possbile during memory draining

 kernel/bpf/memalloc.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

--
2.29.2
* [PATCH bpf 1/2] bpf: Wait for busy refill_work when destorying bpf memory allocator
From: Hou Tao @ 2022-10-19 11:55 UTC
To: bpf, Alexei Starovoitov
Cc: Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa, John Fastabend, houtao1

From: Hou Tao <houtao1@huawei.com>

A busy irq work is an unfinished irq work, and it can be in either the
pending state or the running state. When destroying the bpf memory
allocator, refill_work may be busy on a PREEMPT_RT kernel, in which irq
work is invoked in a per-CPU RT-kthread. It may also be busy on a kernel
where arch_irq_work_has_interrupt() is false (e.g. a 1-cpu arm32 host),
in which irq work is invoked in the timer interrupt.

The busy refill_work leads to various issues. The obvious one is that
there will be concurrent operations on free_by_rcu and free_llist between
irq work and memory draining. Another is that call_rcu_in_progress is
not a reliable check for a pending RCU callback, because do_call_rcu()
may not have been invoked by the irq work yet. The last is a
use-after-free if the irq work is freed before its callback is invoked,
as shown below:

 BUG: kernel NULL pointer dereference, address: 0000000000000000
 #PF: supervisor instruction fetch in kernel mode
 #PF: error_code(0x0010) - not-present page
 PGD 12ab94067 P4D 12ab94067 PUD 1796b4067 PMD 0
 Oops: 0010 [#1] PREEMPT_RT SMP
 CPU: 5 PID: 64 Comm: irq_work/5 Not tainted 6.0.0-rt11+ #1
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
 RIP: 0010:0x0
 Code: Unable to access opcode bytes at 0xffffffffffffffd6.
 RSP: 0018:ffffadc080293e78 EFLAGS: 00010286
 RAX: 0000000000000000 RBX: ffffcdc07fb6a388 RCX: ffffa05000a2e000
 RDX: ffffa05000a2e000 RSI: ffffffff96cc9827 RDI: ffffcdc07fb6a388
 ......
 Call Trace:
  <TASK>
  irq_work_single+0x24/0x60
  irq_work_run_list+0x24/0x30
  run_irq_workd+0x23/0x30
  smpboot_thread_fn+0x203/0x300
  kthread+0x126/0x150
  ret_from_fork+0x1f/0x30
  </TASK>

Considering the ease of concurrency handling and the short wait time of
irq_work_sync() under PREEMPT_RT (when running two test_maps on a
PREEMPT_RT kernel and a 72-cpu host, the max wait time is about 8ms and
the 99th percentile is 10us), just wait for busy refill_work to complete
before memory draining and memory freeing.

Fixes: 7c8199e24fa0 ("bpf: Introduce any context BPF specific memory allocator.")
Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 kernel/bpf/memalloc.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 94f0f63443a6..48e606aaacf0 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -497,6 +497,16 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
 		rcu_in_progress = 0;
 		for_each_possible_cpu(cpu) {
 			c = per_cpu_ptr(ma->cache, cpu);
+			/*
+			 * refill_work may be unfinished for PREEMPT_RT kernel
+			 * in which irq work is invoked in a per-CPU RT thread.
+			 * It is also possible for kernel with
+			 * arch_irq_work_has_interrupt() being false and irq
+			 * work is inovked in timer interrupt. So wait for the
+			 * completion of irq work to ease the handling of
+			 * concurrency.
+			 */
+			irq_work_sync(&c->refill_work);
 			drain_mem_cache(c);
 			rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
 		}
@@ -511,6 +521,7 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
 			cc = per_cpu_ptr(ma->caches, cpu);
 			for (i = 0; i < NUM_CACHES; i++) {
 				c = &cc->cache[i];
+				irq_work_sync(&c->refill_work);
 				drain_mem_cache(c);
 				rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
 			}
--
2.29.2
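[Editor's illustration, not part of the patch] The failure mode and the fix in patch 1 can be sketched with a small user-space model. Everything below is hypothetical: the `model_*` names, the single deferred slot and the "wiped callback" stand-in for freed memory are assumptions made for illustration and are not the kernel's irq_work implementation. The sketch only shows why waiting for the busy flag to clear before destruction avoids calling into destroyed state.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical flags: a queued work item is pending and busy. */
#define MODEL_WORK_PENDING 0x1	/* queued, callback not started yet */
#define MODEL_WORK_BUSY    0x2	/* queued or running, i.e. unfinished */

struct model_work {
	atomic_int flags;
	void (*func)(struct model_work *work);
};

/* One slot standing in for the deferred context's list of work items. */
static struct model_work *deferred_slot;
static int callbacks_run;
static int wiped_callback_hits;

static void refill_cb(struct model_work *work)
{
	(void)work;
	callbacks_run++;
}

/* Queue the work: it stays busy until the deferred context has run it. */
static void model_work_queue(struct model_work *work)
{
	atomic_store(&work->flags, MODEL_WORK_PENDING | MODEL_WORK_BUSY);
	deferred_slot = work;
}

static int model_work_is_busy(struct model_work *work)
{
	return atomic_load(&work->flags) & MODEL_WORK_BUSY;
}

/* What an RT kthread or a timer interrupt would do at some later time. */
static void model_deferred_context_run(void)
{
	struct model_work *work = deferred_slot;

	if (!work)
		return;
	deferred_slot = NULL;
	atomic_fetch_and(&work->flags, ~MODEL_WORK_PENDING);
	if (work->func)
		work->func(work);
	else
		wiped_callback_hits++;	/* a real kernel would jump to address 0 */
	atomic_fetch_and(&work->flags, ~MODEL_WORK_BUSY);
}

/*
 * Wait until the work is no longer busy. This model is single-threaded,
 * so the deferred context is driven from inside the wait loop.
 */
static void model_work_sync(struct model_work *work)
{
	while (model_work_is_busy(work))
		model_deferred_context_run();
}

/* Stand-in for freeing the allocator: the callback pointer is wiped. */
static void model_destroy(struct model_work *work, int wait_first)
{
	if (wait_first)
		model_work_sync(work);
	work->func = NULL;
}

/* Returns how often a wiped callback would have been invoked. */
static int model_scenario(int wait_first)
{
	struct model_work work;

	atomic_init(&work.flags, 0);
	work.func = refill_cb;
	wiped_callback_hits = 0;

	model_work_queue(&work);
	model_destroy(&work, wait_first);
	model_deferred_context_run();	/* deferred context runs after destroy */
	return wiped_callback_hits;
}
```

Calling `model_scenario(0)` models destruction without the wait and reports one hit on a wiped callback, while `model_scenario(1)` waits first, lets the callback run once, and reports none.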
* Re: [PATCH bpf 1/2] bpf: Wait for busy refill_work when destorying bpf memory allocator 2022-10-19 11:55 ` [PATCH bpf 1/2] bpf: " Hou Tao @ 2022-10-19 18:38 ` sdf 2022-10-20 1:07 ` Hou Tao 0 siblings, 1 reply; 11+ messages in thread From: sdf @ 2022-10-19 18:38 UTC (permalink / raw) To: Hou Tao Cc: bpf, Alexei Starovoitov, Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh, Jiri Olsa, John Fastabend, houtao1 On 10/19, Hou Tao wrote: > From: Hou Tao <houtao1@huawei.com> > A busy irq work is an unfinished irq work and it can be either in the > pending state or in the running state. When destroying bpf memory > allocator, refill_work may be busy for PREEMPT_RT kernel in which irq > work is invoked in a per-CPU RT-kthread. It is also possible for kernel > with arch_irq_work_has_interrupt() being false (e.g. 1-cpu arm32 host) > and irq work is inovked in timer interrupt. > The busy refill_work leads to various issues. The obvious one is that > there will be concurrent operations on free_by_rcu and free_list between > irq work and memory draining. Another one is call_rcu_in_progress will > not be reliable for the checking of pending RCU callback because > do_call_rcu() may has not been invoked by irq work. The other is there > will be use-after-free if irq work is freed before the callback of > irq work is invoked as shown below: > BUG: kernel NULL pointer dereference, address: 0000000000000000 > #PF: supervisor instruction fetch in kernel mode > #PF: error_code(0x0010) - not-present page > PGD 12ab94067 P4D 12ab94067 PUD 1796b4067 PMD 0 > Oops: 0010 [#1] PREEMPT_RT SMP > CPU: 5 PID: 64 Comm: irq_work/5 Not tainted 6.0.0-rt11+ #1 > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) > RIP: 0010:0x0 > Code: Unable to access opcode bytes at 0xffffffffffffffd6. 
> RSP: 0018:ffffadc080293e78 EFLAGS: 00010286 > RAX: 0000000000000000 RBX: ffffcdc07fb6a388 RCX: ffffa05000a2e000 > RDX: ffffa05000a2e000 RSI: ffffffff96cc9827 RDI: ffffcdc07fb6a388 > ...... > Call Trace: > <TASK> > irq_work_single+0x24/0x60 > irq_work_run_list+0x24/0x30 > run_irq_workd+0x23/0x30 > smpboot_thread_fn+0x203/0x300 > kthread+0x126/0x150 > ret_from_fork+0x1f/0x30 > </TASK> > Considering the ease of concurrency handling and the short wait time > used for irq_work_sync() under PREEMPT_RT (When running two test_maps on > PREEMPT_RT kernel and 72-cpus host, the max wait time is about 8ms and > the 99th percentile is 10us), just waiting for busy refill_work to > complete before memory draining and memory freeing. > Fixes: 7c8199e24fa0 ("bpf: Introduce any context BPF specific memory > allocator.") > Signed-off-by: Hou Tao <houtao1@huawei.com> > --- > kernel/bpf/memalloc.c | 11 +++++++++++ > 1 file changed, 11 insertions(+) > diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c > index 94f0f63443a6..48e606aaacf0 100644 > --- a/kernel/bpf/memalloc.c > +++ b/kernel/bpf/memalloc.c > @@ -497,6 +497,16 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma) > rcu_in_progress = 0; > for_each_possible_cpu(cpu) { > c = per_cpu_ptr(ma->cache, cpu); > + /* > + * refill_work may be unfinished for PREEMPT_RT kernel > + * in which irq work is invoked in a per-CPU RT thread. > + * It is also possible for kernel with > + * arch_irq_work_has_interrupt() being false and irq > + * work is inovked in timer interrupt. So wait for the > + * completion of irq work to ease the handling of > + * concurrency. > + */ > + irq_work_sync(&c->refill_work); Does it make sense to guard these with "IS_ENABLED(CONFIG_PREEMPT_RT)" ? We do have a bunch of them sprinkled already to run alloc/free with irqs disabled. I was also trying to see if adding local_irq_save inside drain_mem_cache to pair with the ones from refill might work, but waiting for irq to finish seems easier... 
Maybe also move both of these into some new "static void irq_work_wait"
to make it clear that the PREEMPT_RT comment applies to both of them?

Or maybe that helper should do 'for_each_possible_cpu(cpu)
irq_work_sync(&c->refill_work);'
in the PREEMPT_RT case so we don't have to call it twice?

>  			drain_mem_cache(c);
>  			rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
>  		}
> @@ -511,6 +521,7 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
>  			cc = per_cpu_ptr(ma->caches, cpu);
>  			for (i = 0; i < NUM_CACHES; i++) {
>  				c = &cc->cache[i];
> +				irq_work_sync(&c->refill_work);
>  				drain_mem_cache(c);
>  				rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
>  			}
> --
> 2.29.2
* Re: [PATCH bpf 1/2] bpf: Wait for busy refill_work when destorying bpf memory allocator 2022-10-19 18:38 ` sdf @ 2022-10-20 1:07 ` Hou Tao 2022-10-20 17:49 ` Stanislav Fomichev 0 siblings, 1 reply; 11+ messages in thread From: Hou Tao @ 2022-10-20 1:07 UTC (permalink / raw) To: sdf Cc: bpf, Alexei Starovoitov, Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh, Jiri Olsa, John Fastabend, houtao1 Hi, On 10/20/2022 2:38 AM, sdf@google.com wrote: > On 10/19, Hou Tao wrote: >> From: Hou Tao <houtao1@huawei.com> > >> A busy irq work is an unfinished irq work and it can be either in the >> pending state or in the running state. When destroying bpf memory >> allocator, refill_work may be busy for PREEMPT_RT kernel in which irq >> work is invoked in a per-CPU RT-kthread. It is also possible for kernel >> with arch_irq_work_has_interrupt() being false (e.g. 1-cpu arm32 host) >> and irq work is inovked in timer interrupt. > >> The busy refill_work leads to various issues. The obvious one is that >> there will be concurrent operations on free_by_rcu and free_list between >> irq work and memory draining. Another one is call_rcu_in_progress will >> not be reliable for the checking of pending RCU callback because >> do_call_rcu() may has not been invoked by irq work. The other is there >> will be use-after-free if irq work is freed before the callback of >> irq work is invoked as shown below: > >> BUG: kernel NULL pointer dereference, address: 0000000000000000 >> #PF: supervisor instruction fetch in kernel mode >> #PF: error_code(0x0010) - not-present page >> PGD 12ab94067 P4D 12ab94067 PUD 1796b4067 PMD 0 >> Oops: 0010 [#1] PREEMPT_RT SMP >> CPU: 5 PID: 64 Comm: irq_work/5 Not tainted 6.0.0-rt11+ #1 >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) >> RIP: 0010:0x0 >> Code: Unable to access opcode bytes at 0xffffffffffffffd6. 
>> RSP: 0018:ffffadc080293e78 EFLAGS: 00010286 >> RAX: 0000000000000000 RBX: ffffcdc07fb6a388 RCX: ffffa05000a2e000 >> RDX: ffffa05000a2e000 RSI: ffffffff96cc9827 RDI: ffffcdc07fb6a388 >> ...... >> Call Trace: >> <TASK> >> irq_work_single+0x24/0x60 >> irq_work_run_list+0x24/0x30 >> run_irq_workd+0x23/0x30 >> smpboot_thread_fn+0x203/0x300 >> kthread+0x126/0x150 >> ret_from_fork+0x1f/0x30 >> </TASK> > >> Considering the ease of concurrency handling and the short wait time >> used for irq_work_sync() under PREEMPT_RT (When running two test_maps on >> PREEMPT_RT kernel and 72-cpus host, the max wait time is about 8ms and >> the 99th percentile is 10us), just waiting for busy refill_work to >> complete before memory draining and memory freeing. > >> Fixes: 7c8199e24fa0 ("bpf: Introduce any context BPF specific memory >> allocator.") >> Signed-off-by: Hou Tao <houtao1@huawei.com> >> --- >> kernel/bpf/memalloc.c | 11 +++++++++++ >> 1 file changed, 11 insertions(+) > >> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c >> index 94f0f63443a6..48e606aaacf0 100644 >> --- a/kernel/bpf/memalloc.c >> +++ b/kernel/bpf/memalloc.c >> @@ -497,6 +497,16 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma) >> rcu_in_progress = 0; >> for_each_possible_cpu(cpu) { >> c = per_cpu_ptr(ma->cache, cpu); >> + /* >> + * refill_work may be unfinished for PREEMPT_RT kernel >> + * in which irq work is invoked in a per-CPU RT thread. >> + * It is also possible for kernel with >> + * arch_irq_work_has_interrupt() being false and irq >> + * work is inovked in timer interrupt. So wait for the >> + * completion of irq work to ease the handling of >> + * concurrency. >> + */ >> + irq_work_sync(&c->refill_work); > > Does it make sense to guard these with "IS_ENABLED(CONFIG_PREEMPT_RT)" ? > We do have a bunch of them sprinkled already to run alloc/free with > irqs disabled. No. 
As said in the commit message and the comments, irq_work_sync() is needed
for both a PREEMPT_RT kernel and a kernel where arch_irq_work_has_interrupt()
is false. For other kernels, irq_work_sync() doesn't incur any overhead,
because it is just a simple memory read through irq_work_is_busy() and nothing
else. The reason is that on these kernels the irq work must already have
completed by the time bpf_mem_alloc_destroy() is invoked.

void irq_work_sync(struct irq_work *work)
{
        /* Remove code snippet for PREEMPT_RT and arch_irq_work_has_interrupt() */
        /* irq wor*/
        while (irq_work_is_busy(work))
                cpu_relax();
}
>
> I was also trying to see if adding local_irq_save inside drain_mem_cache
> to pair with the ones from refill might work, but waiting for irq to
> finish seems easier...
Disabling hard irq works, but irq_work_sync() is still needed to ensure the
irq work has completed before its memory is freed.
>
> Maybe also move both of these in some new "static void irq_work_wait"
> to make it clear that the PREEMPT_RT comment applies to both of them?
>
> Or maybe that helper should do 'for_each_possible_cpu(cpu)
> irq_work_sync(&c->refill_work);'
> in the PREEMPT_RT case so we don't have to call it twice?
drain_mem_cache() is also time-consuming sometimes, so I think it is better to
interleave irq_work_sync() and drain_mem_cache() to reduce waiting time.
>
>>  			drain_mem_cache(c);
>>  			rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
>>  		}
>> @@ -511,6 +521,7 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
>>  			cc = per_cpu_ptr(ma->caches, cpu);
>>  			for (i = 0; i < NUM_CACHES; i++) {
>>  				c = &cc->cache[i];
>> +				irq_work_sync(&c->refill_work);
>>  				drain_mem_cache(c);
>>  				rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
>>  			}
>> --
>> 2.29.2
>
> .
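[Editor's illustration, not part of the thread] The argument for interleaving irq_work_sync() with drain_mem_cache() can be made concrete with a toy timing model. The numbers and the function names below are invented for illustration, and the model assumes that each refill_work finishes at a fixed time regardless of what the destroying task is doing. It is a sketch of the reasoning, not a measurement.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical sample data, in arbitrary time units:
 * sample_finish[i] is when cache i's refill_work completes,
 * sample_drain[i] is how long draining cache i takes.
 */
static const long sample_finish[] = { 2, 5, 9 };
static const long sample_drain[]  = { 4, 4, 4 };

/* Wait for every work item first, then drain every cache. */
static long two_pass_total(const long *finish, const long *drain, size_t n)
{
	long now = 0;
	size_t i;

	for (i = 0; i < n; i++)		/* pass 1: sync each work item */
		if (finish[i] > now)
			now = finish[i];
	for (i = 0; i < n; i++)		/* pass 2: drain each cache */
		now += drain[i];
	return now;
}

/* Sync and drain one cache at a time, as the patch does. */
static long interleaved_total(const long *finish, const long *drain, size_t n)
{
	long now = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (finish[i] > now)	/* wait only if the work is still busy */
			now = finish[i];
		now += drain[i];	/* later work keeps running meanwhile */
	}
	return now;
}
```

With the sample data the two-pass order finishes at time 21 (9 spent waiting plus 12 draining), while the interleaved order finishes at time 14, because draining earlier caches overlaps with the completion of later work items.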
* Re: [PATCH bpf 1/2] bpf: Wait for busy refill_work when destorying bpf memory allocator 2022-10-20 1:07 ` Hou Tao @ 2022-10-20 17:49 ` Stanislav Fomichev 2022-10-21 1:06 ` Hou Tao 0 siblings, 1 reply; 11+ messages in thread From: Stanislav Fomichev @ 2022-10-20 17:49 UTC (permalink / raw) To: Hou Tao Cc: bpf, Alexei Starovoitov, Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh, Jiri Olsa, John Fastabend, houtao1 On Wed, Oct 19, 2022 at 6:08 PM Hou Tao <houtao@huaweicloud.com> wrote: > > Hi, > > On 10/20/2022 2:38 AM, sdf@google.com wrote: > > On 10/19, Hou Tao wrote: > >> From: Hou Tao <houtao1@huawei.com> > > > >> A busy irq work is an unfinished irq work and it can be either in the > >> pending state or in the running state. When destroying bpf memory > >> allocator, refill_work may be busy for PREEMPT_RT kernel in which irq > >> work is invoked in a per-CPU RT-kthread. It is also possible for kernel > >> with arch_irq_work_has_interrupt() being false (e.g. 1-cpu arm32 host) > >> and irq work is inovked in timer interrupt. > > > >> The busy refill_work leads to various issues. The obvious one is that > >> there will be concurrent operations on free_by_rcu and free_list between > >> irq work and memory draining. Another one is call_rcu_in_progress will > >> not be reliable for the checking of pending RCU callback because > >> do_call_rcu() may has not been invoked by irq work. 
The other is there > >> will be use-after-free if irq work is freed before the callback of > >> irq work is invoked as shown below: > > > >> BUG: kernel NULL pointer dereference, address: 0000000000000000 > >> #PF: supervisor instruction fetch in kernel mode > >> #PF: error_code(0x0010) - not-present page > >> PGD 12ab94067 P4D 12ab94067 PUD 1796b4067 PMD 0 > >> Oops: 0010 [#1] PREEMPT_RT SMP > >> CPU: 5 PID: 64 Comm: irq_work/5 Not tainted 6.0.0-rt11+ #1 > >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) > >> RIP: 0010:0x0 > >> Code: Unable to access opcode bytes at 0xffffffffffffffd6. > >> RSP: 0018:ffffadc080293e78 EFLAGS: 00010286 > >> RAX: 0000000000000000 RBX: ffffcdc07fb6a388 RCX: ffffa05000a2e000 > >> RDX: ffffa05000a2e000 RSI: ffffffff96cc9827 RDI: ffffcdc07fb6a388 > >> ...... > >> Call Trace: > >> <TASK> > >> irq_work_single+0x24/0x60 > >> irq_work_run_list+0x24/0x30 > >> run_irq_workd+0x23/0x30 > >> smpboot_thread_fn+0x203/0x300 > >> kthread+0x126/0x150 > >> ret_from_fork+0x1f/0x30 > >> </TASK> > > > >> Considering the ease of concurrency handling and the short wait time > >> used for irq_work_sync() under PREEMPT_RT (When running two test_maps on > >> PREEMPT_RT kernel and 72-cpus host, the max wait time is about 8ms and > >> the 99th percentile is 10us), just waiting for busy refill_work to > >> complete before memory draining and memory freeing. 
> > > >> Fixes: 7c8199e24fa0 ("bpf: Introduce any context BPF specific memory > >> allocator.") > >> Signed-off-by: Hou Tao <houtao1@huawei.com> > >> --- > >> kernel/bpf/memalloc.c | 11 +++++++++++ > >> 1 file changed, 11 insertions(+) > > > >> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c > >> index 94f0f63443a6..48e606aaacf0 100644 > >> --- a/kernel/bpf/memalloc.c > >> +++ b/kernel/bpf/memalloc.c > >> @@ -497,6 +497,16 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma) > >> rcu_in_progress = 0; > >> for_each_possible_cpu(cpu) { > >> c = per_cpu_ptr(ma->cache, cpu); > >> + /* > >> + * refill_work may be unfinished for PREEMPT_RT kernel > >> + * in which irq work is invoked in a per-CPU RT thread. > >> + * It is also possible for kernel with > >> + * arch_irq_work_has_interrupt() being false and irq > >> + * work is inovked in timer interrupt. So wait for the > >> + * completion of irq work to ease the handling of > >> + * concurrency. > >> + */ > >> + irq_work_sync(&c->refill_work); > > > > Does it make sense to guard these with "IS_ENABLED(CONFIG_PREEMPT_RT)" ? > > We do have a bunch of them sprinkled already to run alloc/free with > > irqs disabled. > No. As said in the commit message and the comments, irq_work_sync() is needed > for both PREEMPT_RT kernel and kernel with arch_irq_work_has_interrupt() being > false. And for other kernels, irq_work_sync() doesn't incur any overhead, > because it is just a simple memory read through irq_work_is_busy() and nothing > else. The reason is the irq work must have been completed when invoking > bpf_mem_alloc_destroy() for these kernels. > > void irq_work_sync(struct irq_work *work) > { > /* Remove code snippet for PREEMPT_RT and arch_irq_work_has_interrupt() */ > /* irq wor*/ > while (irq_work_is_busy(work)) > cpu_relax(); > } I see, thanks for clarifying! I was so carried away with that PREEMPT_RT that I missed the fact that arch_irq_work_has_interrupt is a separate thing. 
Agreed that doing irq_work_sync won't hurt in a non-preempt/non-has_interrupt case. In this case, can you still do a respin and fix the spelling issue in the comment? You can slap my acked-by for the v2: Acked-by: Stanislav Fomichev <sdf@google.com> s/work is inovked in timer interrupt. So wait for the/... invoked .../ > > > > I was also trying to see if adding local_irq_save inside drain_mem_cache > > to pair with the ones from refill might work, but waiting for irq to > > finish seems easier... > Disabling hard irq works, but irq_work_sync() is still needed to ensure it is > completed before freeing its memory. > > > > Maybe also move both of these in some new "static void irq_work_wait" > > to make it clear that the PREEMT_RT comment applies to both of them? > > > > Or maybe that helper should do 'for_each_possible_cpu(cpu) > > irq_work_sync(&c->refill_work);' > > in the PREEMPT_RT case so we don't have to call it twice? > drain_mem_cache() is also time consuming somethings, so I think it is better to > interleave irq_work_sync() and drain_mem_cache() to reduce waiting time. > > > > >> drain_mem_cache(c); > >> rcu_in_progress += atomic_read(&c->call_rcu_in_progress); > >> } > >> @@ -511,6 +521,7 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma) > >> cc = per_cpu_ptr(ma->caches, cpu); > >> for (i = 0; i < NUM_CACHES; i++) { > >> c = &cc->cache[i]; > >> + irq_work_sync(&c->refill_work); > >> drain_mem_cache(c); > >> rcu_in_progress += atomic_read(&c->call_rcu_in_progress); > >> } > >> -- > >> 2.29.2 > > > > . > ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH bpf 1/2] bpf: Wait for busy refill_work when destorying bpf memory allocator 2022-10-20 17:49 ` Stanislav Fomichev @ 2022-10-21 1:06 ` Hou Tao 0 siblings, 0 replies; 11+ messages in thread From: Hou Tao @ 2022-10-21 1:06 UTC (permalink / raw) To: Stanislav Fomichev Cc: bpf, Alexei Starovoitov, Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh, Jiri Olsa, John Fastabend, houtao1 Hi, On 10/21/2022 1:49 AM, Stanislav Fomichev wrote: > On Wed, Oct 19, 2022 at 6:08 PM Hou Tao <houtao@huaweicloud.com> wrote: >> Hi, >> >> On 10/20/2022 2:38 AM, sdf@google.com wrote: >>> On 10/19, Hou Tao wrote: >>>> From: Hou Tao <houtao1@huawei.com> SNIP >>>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c >>>> index 94f0f63443a6..48e606aaacf0 100644 >>>> --- a/kernel/bpf/memalloc.c >>>> +++ b/kernel/bpf/memalloc.c >>>> @@ -497,6 +497,16 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma) >>>> rcu_in_progress = 0; >>>> for_each_possible_cpu(cpu) { >>>> c = per_cpu_ptr(ma->cache, cpu); >>>> + /* >>>> + * refill_work may be unfinished for PREEMPT_RT kernel >>>> + * in which irq work is invoked in a per-CPU RT thread. >>>> + * It is also possible for kernel with >>>> + * arch_irq_work_has_interrupt() being false and irq >>>> + * work is inovked in timer interrupt. So wait for the >>>> + * completion of irq work to ease the handling of >>>> + * concurrency. >>>> + */ >>>> + irq_work_sync(&c->refill_work); >>> Does it make sense to guard these with "IS_ENABLED(CONFIG_PREEMPT_RT)" ? >>> We do have a bunch of them sprinkled already to run alloc/free with >>> irqs disabled. >> No. As said in the commit message and the comments, irq_work_sync() is needed >> for both PREEMPT_RT kernel and kernel with arch_irq_work_has_interrupt() being >> false. And for other kernels, irq_work_sync() doesn't incur any overhead, >> because it is just a simple memory read through irq_work_is_busy() and nothing >> else. 
The reason is the irq work must have been completed when invoking
>> bpf_mem_alloc_destroy() for these kernels.
>>
>> void irq_work_sync(struct irq_work *work)
>> {
>>         /* Remove code snippet for PREEMPT_RT and arch_irq_work_has_interrupt() */
>>         /* irq wor*/
>>         while (irq_work_is_busy(work))
>>                 cpu_relax();
>> }
> I see, thanks for clarifying! I was so carried away with that
> PREEMPT_RT that I missed the fact that arch_irq_work_has_interrupt is
> a separate thing. Agreed that doing irq_work_sync won't hurt in a
> non-preempt/non-has_interrupt case.
>
> In this case, can you still do a respin and fix the spelling issue in
> the comment? You can slap my acked-by for the v2:
>
> Acked-by: Stanislav Fomichev <sdf@google.com>
>
> s/work is inovked in timer interrupt. So wait for the/... invoked .../
Thanks. Will update the commit message and the comments in v2 to fix the typos
and to note that there is no overhead on a non-PREEMPT_RT kernel with
arch_irq_work_has_interrupt() being true.
>
>>> I was also trying to see if adding local_irq_save inside drain_mem_cache
>>> to pair with the ones from refill might work, but waiting for irq to
>>> finish seems easier...
>> Disabling hard irq works, but irq_work_sync() is still needed to ensure it is
>> completed before freeing its memory.
>>> Maybe also move both of these in some new "static void irq_work_wait"
>>> to make it clear that the PREEMPT_RT comment applies to both of them?
>>>
>>> Or maybe that helper should do 'for_each_possible_cpu(cpu)
>>> irq_work_sync(&c->refill_work);'
>>> in the PREEMPT_RT case so we don't have to call it twice?
>> drain_mem_cache() is also time consuming sometimes, so I think it is better to
>> interleave irq_work_sync() and drain_mem_cache() to reduce waiting time.
>> >>>> drain_mem_cache(c); >>>> rcu_in_progress += atomic_read(&c->call_rcu_in_progress); >>>> } >>>> @@ -511,6 +521,7 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma) >>>> cc = per_cpu_ptr(ma->caches, cpu); >>>> for (i = 0; i < NUM_CACHES; i++) { >>>> c = &cc->cache[i]; >>>> + irq_work_sync(&c->refill_work); >>>> drain_mem_cache(c); >>>> rcu_in_progress += atomic_read(&c->call_rcu_in_progress); >>>> } >>>> -- >>>> 2.29.2 >>> . > . ^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH bpf 2/2] bpf: Use __llist_del_all() whenever possbile during memory draining
From: Hou Tao @ 2022-10-19 11:55 UTC
To: bpf, Alexei Starovoitov
Cc: Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa, John Fastabend, houtao1

From: Hou Tao <houtao1@huawei.com>

Except for waiting_for_gp list, there are no concurrent operations on
free_by_rcu, free_llist and free_llist_extra lists, so use
__llist_del_all() instead of llist_del_all(). waiting_for_gp list can be
deleted by RCU callback concurrently, so still use llist_del_all().

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 kernel/bpf/memalloc.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 48e606aaacf0..7f45744a09f7 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -422,14 +422,17 @@ static void drain_mem_cache(struct bpf_mem_cache *c)
 	/* No progs are using this bpf_mem_cache, but htab_map_free() called
 	 * bpf_mem_cache_free() for all remaining elements and they can be in
 	 * free_by_rcu or in waiting_for_gp lists, so drain those lists now.
+	 *
+	 * Except for waiting_for_gp list, there are no concurrent operations
+	 * on these lists, so it is safe to use __llist_del_all().
 	 */
 	llist_for_each_safe(llnode, t, __llist_del_all(&c->free_by_rcu))
 		free_one(c, llnode);
 	llist_for_each_safe(llnode, t, llist_del_all(&c->waiting_for_gp))
 		free_one(c, llnode);
-	llist_for_each_safe(llnode, t, llist_del_all(&c->free_llist))
+	llist_for_each_safe(llnode, t, __llist_del_all(&c->free_llist))
 		free_one(c, llnode);
-	llist_for_each_safe(llnode, t, llist_del_all(&c->free_llist_extra))
+	llist_for_each_safe(llnode, t, __llist_del_all(&c->free_llist_extra))
 		free_one(c, llnode);
 }

--
2.29.2
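[Editor's illustration, not part of the patch] The distinction patch 2 relies on can be sketched in user space with C11 atomics. The `model_*` names are hypothetical and this is not the kernel's llist code. The sketch assumes only the general idea described in the commit message: one variant detaches the whole list with a single atomic exchange and so tolerates concurrent users, the other uses a separate load and store and is only correct when the caller has exclusive access.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct model_node {
	struct model_node *next;
};

struct model_head {
	_Atomic(struct model_node *) first;
};

/* Push with a plain load and store: valid only with exclusive access. */
static void model_add_exclusive(struct model_node *node, struct model_head *head)
{
	node->next = atomic_load_explicit(&head->first, memory_order_relaxed);
	atomic_store_explicit(&head->first, node, memory_order_relaxed);
}

/* Detach the whole list in one atomic step: safe against concurrent users. */
static struct model_node *model_del_all(struct model_head *head)
{
	return atomic_exchange(&head->first, (struct model_node *)NULL);
}

/*
 * Detach with a separate load and store. An update made by another
 * context between the two steps would be lost, so the caller must be
 * the only one touching the list.
 */
static struct model_node *model_del_all_exclusive(struct model_head *head)
{
	struct model_node *first;

	first = atomic_load_explicit(&head->first, memory_order_relaxed);
	atomic_store_explicit(&head->first, (struct model_node *)NULL,
			      memory_order_relaxed);
	return first;
}

/* Walk a detached list and count the nodes that would be freed. */
static int model_drain(struct model_node *node)
{
	int freed = 0;

	while (node) {
		struct model_node *next = node->next;	/* read before release */

		node->next = NULL;
		freed++;
		node = next;
	}
	return freed;
}

/* Build a three-node list and drain it with the chosen variant. */
static int model_demo(int exclusive)
{
	struct model_node nodes[3];
	struct model_head head;
	int i;

	atomic_init(&head.first, (struct model_node *)NULL);
	for (i = 0; i < 3; i++)
		model_add_exclusive(&nodes[i], &head);
	return model_drain(exclusive ? model_del_all_exclusive(&head)
				     : model_del_all(&head));
}
```

Without concurrency both variants drain the same three nodes. They differ only when another context can touch the list, which is why the patch keeps the atomic variant for waiting_for_gp, where an RCU callback may still run.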
* Re: [PATCH bpf 2/2] bpf: Use __llist_del_all() whenever possbile during memory draining 2022-10-19 11:55 ` [PATCH bpf 2/2] bpf: Use __llist_del_all() whenever possbile during memory draining Hou Tao @ 2022-10-19 19:00 ` sdf 2022-10-20 1:17 ` Hou Tao 0 siblings, 1 reply; 11+ messages in thread From: sdf @ 2022-10-19 19:00 UTC (permalink / raw) To: Hou Tao Cc: bpf, Alexei Starovoitov, Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh, Jiri Olsa, John Fastabend, houtao1 On 10/19, Hou Tao wrote: > From: Hou Tao <houtao1@huawei.com> > Except for waiting_for_gp list, there are no concurrent operations on > free_by_rcu, free_llist and free_llist_extra lists, so use > __llist_del_all() instead of llist_del_all(). waiting_for_gp list can be > deleted by RCU callback concurrently, so still use llist_del_all(). > Signed-off-by: Hou Tao <houtao1@huawei.com> > --- > kernel/bpf/memalloc.c | 7 +++++-- > 1 file changed, 5 insertions(+), 2 deletions(-) > diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c > index 48e606aaacf0..7f45744a09f7 100644 > --- a/kernel/bpf/memalloc.c > +++ b/kernel/bpf/memalloc.c > @@ -422,14 +422,17 @@ static void drain_mem_cache(struct bpf_mem_cache *c) > /* No progs are using this bpf_mem_cache, but htab_map_free() called > * bpf_mem_cache_free() for all remaining elements and they can be in > * free_by_rcu or in waiting_for_gp lists, so drain those lists now. > + * > + * Except for waiting_for_gp list, there are no concurrent operations > + * on these lists, so it is safe to use __llist_del_all(). 
>  	 */
>  	llist_for_each_safe(llnode, t, __llist_del_all(&c->free_by_rcu))
>  		free_one(c, llnode);
>  	llist_for_each_safe(llnode, t, llist_del_all(&c->waiting_for_gp))
>  		free_one(c, llnode);
> -	llist_for_each_safe(llnode, t, llist_del_all(&c->free_llist))
> +	llist_for_each_safe(llnode, t, __llist_del_all(&c->free_llist))
>  		free_one(c, llnode);
> -	llist_for_each_safe(llnode, t, llist_del_all(&c->free_llist_extra))
> +	llist_for_each_safe(llnode, t, __llist_del_all(&c->free_llist_extra))
>  		free_one(c, llnode);

Acked-by: Stanislav Fomichev <sdf@google.com>

Seems safe even without the previous patch? OTOH, do we really care
about __llist vs llist in the cleanup path? Might be safer to always
do llist_del_all everywhere?

>  }
> --
> 2.29.2
* Re: [PATCH bpf 2/2] bpf: Use __llist_del_all() whenever possbile during memory draining 2022-10-19 19:00 ` sdf @ 2022-10-20 1:17 ` Hou Tao 2022-10-20 17:52 ` Stanislav Fomichev 0 siblings, 1 reply; 11+ messages in thread From: Hou Tao @ 2022-10-20 1:17 UTC (permalink / raw) To: sdf Cc: bpf, Alexei Starovoitov, Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh, Jiri Olsa, John Fastabend, houtao1 Hi, On 10/20/2022 3:00 AM, sdf@google.com wrote: > On 10/19, Hou Tao wrote: >> From: Hou Tao <houtao1@huawei.com> > >> Except for waiting_for_gp list, there are no concurrent operations on >> free_by_rcu, free_llist and free_llist_extra lists, so use >> __llist_del_all() instead of llist_del_all(). waiting_for_gp list can be >> deleted by RCU callback concurrently, so still use llist_del_all(). > >> Signed-off-by: Hou Tao <houtao1@huawei.com> >> --- >> kernel/bpf/memalloc.c | 7 +++++-- >> 1 file changed, 5 insertions(+), 2 deletions(-) > >> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c >> index 48e606aaacf0..7f45744a09f7 100644 >> --- a/kernel/bpf/memalloc.c >> +++ b/kernel/bpf/memalloc.c >> @@ -422,14 +422,17 @@ static void drain_mem_cache(struct bpf_mem_cache *c) >> /* No progs are using this bpf_mem_cache, but htab_map_free() called >> * bpf_mem_cache_free() for all remaining elements and they can be in >> * free_by_rcu or in waiting_for_gp lists, so drain those lists now. >> + * >> + * Except for waiting_for_gp list, there are no concurrent operations >> + * on these lists, so it is safe to use __llist_del_all(). 
>> */ >> llist_for_each_safe(llnode, t, __llist_del_all(&c->free_by_rcu)) >> free_one(c, llnode); >> llist_for_each_safe(llnode, t, llist_del_all(&c->waiting_for_gp)) >> free_one(c, llnode); >> - llist_for_each_safe(llnode, t, llist_del_all(&c->free_llist)) >> + llist_for_each_safe(llnode, t, __llist_del_all(&c->free_llist)) >> free_one(c, llnode); >> - llist_for_each_safe(llnode, t, llist_del_all(&c->free_llist_extra)) >> + llist_for_each_safe(llnode, t, __llist_del_all(&c->free_llist_extra)) >> free_one(c, llnode); > > Acked-by: Stanislav Fomichev <sdf@google.com> Thanks for the Acked-by. > > Seems safe even without the previous patch? OTOH, do we really care > about __llist vs llist in the cleanup path? Might be safer to always > do llist_del_all() everywhere? No. Before patch #1, free_llist can be manipulated concurrently by both irq work and memory draining. Using llist_del_all(&c->free_llist) doesn't help either, because irq work uses the non-atomic __llist_add/__llist_del helpers. Basically there is no difference between the __llist and llist helpers in the cleanup path, but I think it is better to clarify the possible concurrent accesses and codify these assumptions. > >> } > >> -- >> 2.29.2
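The distinction between the two helper families in the reply above can be illustrated with a minimal userspace sketch in C11. This is a model under stated assumptions, not the kernel code: the `model_*` names and `count_nodes()` are hypothetical, and the kernel's real helpers in include/linux/llist.h differ in detail. The sketch only shows the shape of the difference: the plain variant detaches the list with a single atomic exchange, while the `__` variant does a separate read and write and is therefore only correct when nothing else can touch the list.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Userspace model of a lock-less singly linked list (illustrative only). */
struct llist_node { struct llist_node *next; };
struct llist_head { _Atomic(struct llist_node *) first; };

/* Models llist_add(): a compare-and-swap loop, safe against concurrent adders. */
static void model_llist_add(struct llist_node *n, struct llist_head *h)
{
    struct llist_node *first = atomic_load(&h->first);

    do {
        n->next = first;
    } while (!atomic_compare_exchange_weak(&h->first, &first, n));
}

/* Models llist_del_all(): detaches the whole list in one atomic exchange. */
static struct llist_node *model_llist_del_all(struct llist_head *h)
{
    return atomic_exchange(&h->first, NULL);
}

/*
 * Models __llist_del_all(): a separate read and write. A node added by
 * another context between the two steps would be lost, so the caller must
 * guarantee that there is no concurrent access.
 */
static struct llist_node *model_llist_del_all_nonatomic(struct llist_head *h)
{
    struct llist_node *first =
        atomic_load_explicit(&h->first, memory_order_relaxed);

    atomic_store_explicit(&h->first, NULL, memory_order_relaxed);
    return first;
}

/* Hypothetical helper: counts the nodes of a detached list. */
static int count_nodes(const struct llist_node *n)
{
    int count = 0;

    for (; n; n = n->next)
        count++;
    return count;
}
```

Note that the atomic variant only protects against other atomic users. If one side updates the list with plain reads and writes, as the reply says irq work does, switching the draining side to the atomic helper does not make the combination safe.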
* Re: [PATCH bpf 2/2] bpf: Use __llist_del_all() whenever possbile during memory draining 2022-10-20 1:17 ` Hou Tao @ 2022-10-20 17:52 ` Stanislav Fomichev 2022-10-21 1:09 ` Hou Tao 0 siblings, 1 reply; 11+ messages in thread From: Stanislav Fomichev @ 2022-10-20 17:52 UTC (permalink / raw) To: Hou Tao Cc: bpf, Alexei Starovoitov, Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh, Jiri Olsa, John Fastabend, houtao1 On Wed, Oct 19, 2022 at 6:18 PM Hou Tao <houtao@huaweicloud.com> wrote: > > Hi, > > On 10/20/2022 3:00 AM, sdf@google.com wrote: > > On 10/19, Hou Tao wrote: > >> From: Hou Tao <houtao1@huawei.com> > > > >> Except for waiting_for_gp list, there are no concurrent operations on > >> free_by_rcu, free_llist and free_llist_extra lists, so use > >> __llist_del_all() instead of llist_del_all(). waiting_for_gp list can be > >> deleted by RCU callback concurrently, so still use llist_del_all(). > > > >> Signed-off-by: Hou Tao <houtao1@huawei.com> > >> --- > >> kernel/bpf/memalloc.c | 7 +++++-- > >> 1 file changed, 5 insertions(+), 2 deletions(-) > > > >> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c > >> index 48e606aaacf0..7f45744a09f7 100644 > >> --- a/kernel/bpf/memalloc.c > >> +++ b/kernel/bpf/memalloc.c > >> @@ -422,14 +422,17 @@ static void drain_mem_cache(struct bpf_mem_cache *c) > >> /* No progs are using this bpf_mem_cache, but htab_map_free() called > >> * bpf_mem_cache_free() for all remaining elements and they can be in > >> * free_by_rcu or in waiting_for_gp lists, so drain those lists now. > >> + * > >> + * Except for waiting_for_gp list, there are no concurrent operations > >> + * on these lists, so it is safe to use __llist_del_all(). 
> >> */ > >> llist_for_each_safe(llnode, t, __llist_del_all(&c->free_by_rcu)) > >> free_one(c, llnode); > >> llist_for_each_safe(llnode, t, llist_del_all(&c->waiting_for_gp)) > >> free_one(c, llnode); > >> - llist_for_each_safe(llnode, t, llist_del_all(&c->free_llist)) > >> + llist_for_each_safe(llnode, t, __llist_del_all(&c->free_llist)) > >> free_one(c, llnode); > >> - llist_for_each_safe(llnode, t, llist_del_all(&c->free_llist_extra)) > >> + llist_for_each_safe(llnode, t, __llist_del_all(&c->free_llist_extra)) > >> free_one(c, llnode); > > > > Acked-by: Stanislav Fomichev <sdf@google.com> > Thanks for the Acked-by. > > > > Seems safe even without the previous patch? OTOH, do we really care > > about __llist vs llist in the cleanup path? Might be safer to always > > do llist_del_all() everywhere? > No. Before patch #1, free_llist can be manipulated concurrently by both irq work > and memory draining. Using llist_del_all(&c->free_llist) doesn't help either, because > irq work uses the non-atomic __llist_add/__llist_del helpers. Basically there is no difference > between the __llist and llist helpers in the cleanup path, but I think it is better to > clarify the possible concurrent accesses and codify these assumptions. But this is still mostly relevant only for the PREEMPT_RT or !arch_irq_work_has_interrupt() case, right? On other kernels, the irq work should have finished long before we get to drain_mem_cache().
* Re: [PATCH bpf 2/2] bpf: Use __llist_del_all() whenever possbile during memory draining 2022-10-20 17:52 ` Stanislav Fomichev @ 2022-10-21 1:09 ` Hou Tao 0 siblings, 0 replies; 11+ messages in thread From: Hou Tao @ 2022-10-21 1:09 UTC (permalink / raw) To: Stanislav Fomichev Cc: bpf, Alexei Starovoitov, Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh, Jiri Olsa, John Fastabend, houtao1 Hi, On 10/21/2022 1:52 AM, Stanislav Fomichev wrote: > On Wed, Oct 19, 2022 at 6:18 PM Hou Tao <houtao@huaweicloud.com> wrote: SNIP >>>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c >>>> index 48e606aaacf0..7f45744a09f7 100644 >>>> --- a/kernel/bpf/memalloc.c >>>> +++ b/kernel/bpf/memalloc.c >>>> @@ -422,14 +422,17 @@ static void drain_mem_cache(struct bpf_mem_cache *c) >>>> /* No progs are using this bpf_mem_cache, but htab_map_free() called >>>> * bpf_mem_cache_free() for all remaining elements and they can be in >>>> * free_by_rcu or in waiting_for_gp lists, so drain those lists now. >>>> + * >>>> + * Except for waiting_for_gp list, there are no concurrent operations >>>> + * on these lists, so it is safe to use __llist_del_all(). >>>> */ >>>> llist_for_each_safe(llnode, t, __llist_del_all(&c->free_by_rcu)) >>>> free_one(c, llnode); >>>> llist_for_each_safe(llnode, t, llist_del_all(&c->waiting_for_gp)) >>>> free_one(c, llnode); >>>> - llist_for_each_safe(llnode, t, llist_del_all(&c->free_llist)) >>>> + llist_for_each_safe(llnode, t, __llist_del_all(&c->free_llist)) >>>> free_one(c, llnode); >>>> - llist_for_each_safe(llnode, t, llist_del_all(&c->free_llist_extra)) >>>> + llist_for_each_safe(llnode, t, __llist_del_all(&c->free_llist_extra)) >>>> free_one(c, llnode); >>> Acked-by: Stanislav Fomichev <sdf@google.com> >> Thanks for the Acked-by. >>> Seems safe even without the previous patch? OTOH, do we really care >>> about __lllist vs llist in the cleanup path? Might be safer to always >>> do llist_del_all everywhere? 
>> No. Before patch #1, free_llist can be manipulated concurrently by both irq work >> and memory draining. Using llist_del_all(&c->free_llist) doesn't help either, because >> irq work uses the non-atomic __llist_add/__llist_del helpers. Basically there is no difference >> between the __llist and llist helpers in the cleanup path, but I think it is better to >> clarify the possible concurrent accesses and codify these assumptions. > But this is still mostly relevant only for the PREEMPT_RT or > !arch_irq_work_has_interrupt() case, right? > On other kernels, the irq work should have finished long before we get to drain_mem_cache(). Yes. Concurrent access to free_llist is only possible in the PREEMPT_RT and !arch_irq_work_has_interrupt() cases.
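The ordering that the thread settles on (let any busy refill work finish, then drain with the non-atomic helpers) can be sketched as a single-threaded userspace model in C. Everything here is hypothetical and for illustration only: `cache_model`, `run_refill()`, `sync_refill()`, `drain()` and `destroy()` are made-up names, and the "sync" step simply runs the deferred work to completion, whereas the cover letter describes the real fix as waiting for the completion of the irq work.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical single-threaded model; all names are illustrative. */
struct node { struct node *next; };

struct cache_model {
    struct node *free_list;  /* stands in for free_llist */
    bool refill_pending;     /* stands in for a busy refill_work */
    struct node *refill_obj; /* object the deferred refill will add */
};

/* Deferred refill: pushes one object with a plain, non-atomic update. */
static void run_refill(struct cache_model *c)
{
    if (!c->refill_pending)
        return;
    c->refill_obj->next = c->free_list;
    c->free_list = c->refill_obj;
    c->refill_pending = false;
}

/* Stand-in for waiting on the work: in this model, run it to completion. */
static void sync_refill(struct cache_model *c)
{
    run_refill(c);
}

/* Drain with a plain (non-atomic) detach; returns the number of objects released. */
static int drain(struct cache_model *c)
{
    struct node *n = c->free_list;
    int released = 0;

    c->free_list = NULL;
    while (n) {
        struct node *next = n->next;

        n->next = NULL;
        released++;
        n = next;
    }
    return released;
}

/* Destroy path in the order discussed: sync first, then drain. */
static int destroy(struct cache_model *c)
{
    sync_refill(c);
    return drain(c);
}
```

With one object already on the list and one refill pending, `destroy()` releases both. Calling `drain()` first and letting `run_refill()` fire afterwards leaves an object on a list that its owner believes is empty, which is the kind of late access the series is meant to rule out.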
Thread overview: 11+ messages
2022-10-19 11:55 [PATCH bpf 0/2] Wait for busy refill_work when destorying bpf memory allocator Hou Tao
2022-10-19 11:55 ` [PATCH bpf 1/2] bpf: " Hou Tao
2022-10-19 18:38   ` sdf
2022-10-20  1:07     ` Hou Tao
2022-10-20 17:49       ` Stanislav Fomichev
2022-10-21  1:06         ` Hou Tao
2022-10-19 11:55 ` [PATCH bpf 2/2] bpf: Use __llist_del_all() whenever possbile during memory draining Hou Tao
2022-10-19 19:00   ` sdf
2022-10-20  1:17     ` Hou Tao
2022-10-20 17:52       ` Stanislav Fomichev
2022-10-21  1:09         ` Hou Tao