* WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342
@ 2023-08-25 10:32 Björn Töpel
2023-08-25 15:28 ` Yonghong Song
2023-08-26 13:44 ` RISC-V uprobe bug (Was: Re: WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342) Björn Töpel
0 siblings, 2 replies; 31+ messages in thread
From: Björn Töpel @ 2023-08-25 10:32 UTC (permalink / raw)
To: bpf; +Cc: linux-riscv
I'm chasing a workqueue hang on RISC-V/qemu (TCG), using the bpf
selftests on bpf-next 9e3b47abeb8f.
I'm able to reproduce the hang by multiple runs of:
| ./test_progs -a link_api -a linked_list
I'm currently investigating that.
But! Sometimes (every blue moon) I get a warn_on_once hit:
| ------------[ cut here ]------------
| WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342 bpf_mem_refill+0x1fc/0x206
| Modules linked in: bpf_testmod(OE)
| CPU: 3 PID: 261 Comm: test_progs-cpuv Tainted: G OE N 6.5.0-rc5-01743-gdcb152bb8328 #2
| Hardware name: riscv-virtio,qemu (DT)
| epc : bpf_mem_refill+0x1fc/0x206
| ra : irq_work_single+0x68/0x70
| epc : ffffffff801b1bc4 ra : ffffffff8015fe84 sp : ff2000000001be20
| gp : ffffffff82d26138 tp : ff6000008477a800 t0 : 0000000000046600
| t1 : ffffffff812b6ddc t2 : 0000000000000000 s0 : ff2000000001be70
| s1 : ff5ffffffffe8998 a0 : ff5ffffffffe8998 a1 : ff600003fef4b000
| a2 : 000000000000003f a3 : ffffffff80008250 a4 : 0000000000000060
| a5 : 0000000000000080 a6 : 0000000000000000 a7 : 0000000000735049
| s2 : ff5ffffffffe8998 s3 : 0000000000000022 s4 : 0000000000001000
| s5 : 0000000000000007 s6 : ff5ffffffffe8570 s7 : ffffffff82d6bd30
| s8 : 000000000000003f s9 : ffffffff82d2c5e8 s10: 000000000000ffff
| s11: ffffffff82d2c5d8 t3 : ffffffff81ea8f28 t4 : 0000000000000000
| t5 : ff6000008fd28278 t6 : 0000000000040000
| status: 0000000200000100 badaddr: 0000000000000000 cause: 0000000000000003
| [<ffffffff801b1bc4>] bpf_mem_refill+0x1fc/0x206
| [<ffffffff8015fe84>] irq_work_single+0x68/0x70
| [<ffffffff8015feb4>] irq_work_run_list+0x28/0x36
| [<ffffffff8015fefa>] irq_work_run+0x38/0x66
| [<ffffffff8000828a>] handle_IPI+0x3a/0xb4
| [<ffffffff800a5c3a>] handle_percpu_devid_irq+0xa4/0x1f8
| [<ffffffff8009fafa>] generic_handle_domain_irq+0x28/0x36
| [<ffffffff800ae570>] ipi_mux_process+0xac/0xfa
| [<ffffffff8000a8ea>] sbi_ipi_handle+0x2e/0x88
| [<ffffffff8009fafa>] generic_handle_domain_irq+0x28/0x36
| [<ffffffff807ee70e>] riscv_intc_irq+0x36/0x4e
| [<ffffffff812b5d3a>] handle_riscv_irq+0x54/0x86
| [<ffffffff812b6904>] do_irq+0x66/0x98
| ---[ end trace 0000000000000000 ]---
Code:
| static void free_bulk(struct bpf_mem_cache *c)
| {
| struct bpf_mem_cache *tgt = c->tgt;
| struct llist_node *llnode, *t;
| unsigned long flags;
| int cnt;
|
| WARN_ON_ONCE(tgt->unit_size != c->unit_size);
| ...
I'm not well versed in the memory allocator; before I dive into it --
has anyone else hit this? Any ideas on why the WARN_ON_ONCE is hit?
Björn
_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv
* Re: WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342
2023-08-25 10:32 WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342 Björn Töpel
@ 2023-08-25 15:28 ` Yonghong Song
2023-08-25 18:53 ` Alexei Starovoitov
2023-08-26 3:48 ` Hou Tao
2023-08-26 13:44 ` RISC-V uprobe bug (Was: Re: WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342) Björn Töpel
1 sibling, 2 replies; 31+ messages in thread
From: Yonghong Song @ 2023-08-25 15:28 UTC (permalink / raw)
To: Björn Töpel, bpf; +Cc: linux-riscv
On 8/25/23 3:32 AM, Björn Töpel wrote:
> I'm chasing a workqueue hang on RISC-V/qemu (TCG), using the bpf
> selftests on bpf-next 9e3b47abeb8f.
>
> I'm able to reproduce the hang by multiple runs of:
> | ./test_progs -a link_api -a linked_list
> I'm currently investigating that.
>
> But! Sometimes (every blue moon) I get a warn_on_once hit:
> [... trace and code snipped ...]
>
> I'm not well versed in the memory allocator; Before I dive into it --
> has anyone else hit it? Ideas on why the warn_on_once is hit?
Maybe take a look at commit
822fb26bdb55 ("bpf: Add a hint to allocated objects").
In the above patch, we have
+ /*
+ * Remember bpf_mem_cache that allocated this object.
+ * The hint is not accurate.
+ */
+ c->tgt = *(struct bpf_mem_cache **)llnode;
I suspect that the warning may be related to the above.
I tried the above ./test_progs command line (running multiple instances
at the same time) and didn't trigger the issue.
>
>
> Björn
>
* Re: WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342
2023-08-25 15:28 ` Yonghong Song
@ 2023-08-25 18:53 ` Alexei Starovoitov
2023-08-25 19:49 ` Alexei Starovoitov
2023-08-26 3:48 ` Hou Tao
1 sibling, 1 reply; 31+ messages in thread
From: Alexei Starovoitov @ 2023-08-25 18:53 UTC (permalink / raw)
To: Yonghong Song, Kumar Kartikeya Dwivedi, Dave Marchevsky
Cc: Björn Töpel, bpf, linux-riscv
On Fri, Aug 25, 2023 at 8:28 AM Yonghong Song <yonghong.song@linux.dev> wrote:
>
>
>
> On 8/25/23 3:32 AM, Björn Töpel wrote:
> > I'm chasing a workqueue hang on RISC-V/qemu (TCG), using the bpf
> > selftests on bpf-next 9e3b47abeb8f.
> >
> > I'm able to reproduce the hang by multiple runs of:
> > | ./test_progs -a link_api -a linked_list
> > I'm currently investigating that.
> >
> > But! Sometimes (every blue moon) I get a warn_on_once hit:
> > [... trace and code snipped ...]
> >
> > I'm not well versed in the memory allocator; Before I dive into it --
> > has anyone else hit it? Ideas on why the warn_on_once is hit?
>
> Maybe take a look at the patch
> 822fb26bdb55 bpf: Add a hint to allocated objects.
>
> In the above patch, we have
>
> + /*
> + * Remember bpf_mem_cache that allocated this object.
> + * The hint is not accurate.
> + */
> + c->tgt = *(struct bpf_mem_cache **)llnode;
>
> I suspect that the warning may be related to the above.
> I tried the above ./test_progs command line (running multiple
> at the same time) and didn't trigger the issue.
Interesting. Thanks for the report.
I wasn't able to repro the warn either, but I can confirm that the linked_list
test is super slow on a kernel with debug options on, and may appear to "hang",
since it doesn't respond to key presses for many seconds.
time ./test_progs -a linked_list
Summary: 1/132 PASSED, 0 SKIPPED, 0 FAILED
real 0m35.897s
user 0m1.860s
sys 0m32.105s
That's not normal.
Kumar, Dave,
do you have any suggestions?
* Re: WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342
2023-08-25 18:53 ` Alexei Starovoitov
@ 2023-08-25 19:49 ` Alexei Starovoitov
2023-08-25 21:31 ` Andrii Nakryiko
0 siblings, 1 reply; 31+ messages in thread
From: Alexei Starovoitov @ 2023-08-25 19:49 UTC (permalink / raw)
To: Yonghong Song, Kumar Kartikeya Dwivedi, Dave Marchevsky,
Andrii Nakryiko
Cc: Björn Töpel, bpf, linux-riscv
On Fri, Aug 25, 2023 at 11:53 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Aug 25, 2023 at 8:28 AM Yonghong Song <yonghong.song@linux.dev> wrote:
> >
> >
> >
> > On 8/25/23 3:32 AM, Björn Töpel wrote:
> > > I'm chasing a workqueue hang on RISC-V/qemu (TCG), using the bpf
> > > selftests on bpf-next 9e3b47abeb8f.
> > >
> > > I'm able to reproduce the hang by multiple runs of:
> > > | ./test_progs -a link_api -a linked_list
> > > I'm currently investigating that.
> > >
> > > But! Sometimes (every blue moon) I get a warn_on_once hit:
> > > [... trace and code snipped ...]
> > >
> > > I'm not well versed in the memory allocator; Before I dive into it --
> > > has anyone else hit it? Ideas on why the warn_on_once is hit?
> >
> > Maybe take a look at the patch
> > 822fb26bdb55 bpf: Add a hint to allocated objects.
> >
> > In the above patch, we have
> >
> > + /*
> > + * Remember bpf_mem_cache that allocated this object.
> > + * The hint is not accurate.
> > + */
> > + c->tgt = *(struct bpf_mem_cache **)llnode;
> >
> > I suspect that the warning may be related to the above.
> > I tried the above ./test_progs command line (running multiple
> > at the same time) and didn't trigger the issue.
>
> Interesting. Thanks for the report.
> I wasn't able to repro the warn either, but I can confirm that linked_list
> test is super slow on the kernel with debugs on and may appear to "hang",
> since it doesn't respond to key press for many seconds.
>
> time ./test_progs -a linked_list
> Summary: 1/132 PASSED, 0 SKIPPED, 0 FAILED
>
> real 0m35.897s
> user 0m1.860s
> sys 0m32.105s
>
> That's not normal.
>
> Kumar, Dave,
>
> do you have any suggestions?
Andrii,
this issue seems to be a scalability issue with libbpf.
The perf report looks like:
9.89% test_progs [kernel.kallsyms] [k] __lock_acquire
7.70% test_progs [kernel.kallsyms] [k] check_preemption_disabled
2.61% test_progs [kernel.kallsyms] [k] asm_exc_page_fault
2.32% test_progs [kernel.kallsyms] [k] rcu_is_watching
2.30% test_progs [kernel.kallsyms] [k] memcpy_orig
2.26% test_progs test_progs [.] btf_find_by_name_kind
2.18% test_progs libc-2.28.so [.] __strcmp_avx2
2.14% test_progs [kernel.kallsyms] [k] lock_acquire
2.10% test_progs [kernel.kallsyms] [k] mark_lock.part.48
2.07% test_progs test_progs [.] btf_kind
2.05% test_progs [kernel.kallsyms] [k] clear_page_erms
1.89% test_progs test_progs [.] btf_type_by_id
1.83% test_progs [kernel.kallsyms] [k] lock_is_held_type
1.75% test_progs [kernel.kallsyms] [k] lock_release
1.31% test_progs [kernel.kallsyms] [k] unwind_next_frame
1.29% test_progs test_progs [.] btf__type_by_id
1.28% test_progs [kernel.kallsyms] [k] rep_movs_alternative
1.09% test_progs [kernel.kallsyms] [k] __orc_find
1.06% swapper [kernel.kallsyms] [k] __lock_acquire
1.02% test_progs test_progs [.] btf_type_size
0.84% test_progs test_progs [.] btf_parse_type_sec
0.78% test_progs [kernel.kallsyms] [k] __create_object
0.76% test_progs [kernel.kallsyms] [k] __lookup_object
0.75% test_progs test_progs [.] btf__str_by_offset
The top 5 functions come from the kernel due to libbpf
reading /sys/kernel/btf/vmlinux.
Then libbpf proceeds to do a search in the BTF.
I don't know whether libbpf is doing too many unnecessary reads of the
vmlinux BTF or what is actually happening, but something isn't right.
This test shouldn't be causing libbpf to be the main consumer of CPU.
Not sure whether other tests are similarly affected.
This is what I've debugged so far.
* Re: WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342
2023-08-25 19:49 ` Alexei Starovoitov
@ 2023-08-25 21:31 ` Andrii Nakryiko
2023-08-26 22:49 ` Kumar Kartikeya Dwivedi
0 siblings, 1 reply; 31+ messages in thread
From: Andrii Nakryiko @ 2023-08-25 21:31 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Yonghong Song, Kumar Kartikeya Dwivedi, Dave Marchevsky,
Andrii Nakryiko, Björn Töpel, bpf, linux-riscv
On Fri, Aug 25, 2023 at 12:49 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Aug 25, 2023 at 11:53 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Fri, Aug 25, 2023 at 8:28 AM Yonghong Song <yonghong.song@linux.dev> wrote:
> > >
> > >
> > >
> > > On 8/25/23 3:32 AM, Björn Töpel wrote:
> > > > I'm chasing a workqueue hang on RISC-V/qemu (TCG), using the bpf
> > > > selftests on bpf-next 9e3b47abeb8f.
> > > >
> > > > I'm able to reproduce the hang by multiple runs of:
> > > > | ./test_progs -a link_api -a linked_list
> > > > I'm currently investigating that.
> > > >
> > > > But! Sometimes (every blue moon) I get a warn_on_once hit:
> > > > [... trace and code snipped ...]
> > > >
> > > > I'm not well versed in the memory allocator; Before I dive into it --
> > > > has anyone else hit it? Ideas on why the warn_on_once is hit?
> > >
> > > Maybe take a look at the patch
> > > 822fb26bdb55 bpf: Add a hint to allocated objects.
> > >
> > > In the above patch, we have
> > >
> > > + /*
> > > + * Remember bpf_mem_cache that allocated this object.
> > > + * The hint is not accurate.
> > > + */
> > > + c->tgt = *(struct bpf_mem_cache **)llnode;
> > >
> > > I suspect that the warning may be related to the above.
> > > I tried the above ./test_progs command line (running multiple
> > > at the same time) and didn't trigger the issue.
> >
> > Interesting. Thanks for the report.
> > I wasn't able to repro the warn either, but I can confirm that linked_list
> > test is super slow on the kernel with debugs on and may appear to "hang",
> > since it doesn't respond to key press for many seconds.
> >
> > time ./test_progs -a linked_list
> > Summary: 1/132 PASSED, 0 SKIPPED, 0 FAILED
> >
> > real 0m35.897s
> > user 0m1.860s
> > sys 0m32.105s
> >
> > That's not normal.
> >
> > Kumar, Dave,
> >
> > do you have any suggestions?
>
> Andrii,
>
> this issue seems to be a scalability issue with libbpf.
> The perf report looks like:
>
> [... perf report snipped ...]
>
> The top 5 functions coming from kernel due to libbpf
> reading /sys/kernel/btf/vmlinux
> Then libbpf proceeded to do a search in btf.
This test loads 132 skeletons/BPF objects sequentially. Each BPF
object load should cause exactly one /sys/kernel/btf/vmlinux load. Do
you see multiple vmlinux reads per BPF object?
We then do a linear search for each ksym, so depending on the number
of ksyms, that can scale badly. But linked_list.bpf.o and
linked_list_fail.bpf.o have only a few ksyms:
[72] DATASEC '.ksyms' size=0 vlen=6
type_id=33 offset=0 size=0 (FUNC 'bpf_obj_new_impl')
type_id=37 offset=0 size=0 (FUNC 'bpf_list_pop_front')
type_id=39 offset=0 size=0 (FUNC 'bpf_obj_drop_impl')
type_id=41 offset=0 size=0 (FUNC 'bpf_list_pop_back')
type_id=43 offset=0 size=0 (FUNC 'bpf_list_push_front_impl')
type_id=51 offset=0 size=0 (FUNC 'bpf_list_push_back_impl')
We can improve this part by either pre-building hashmap index for BTF
types or sorting all ksyms and then doing one linear pass over vmlinux
BTF, binary searching. But so far it wasn't a big bottleneck in
practice.
And running this linked_list test on a production kernel, it's not that
slow, considering how much work it's doing:
sudo ./test_progs -a linked_list 1.54s user 0.58s system 97% cpu 2.178 total
Also, we are building selftests+libbpf in debug mode. So I have worse
news: with my BTF validation patch applied, this test becomes
extremely slow, because all those 100+ vmlinux BTF parses are now
also doing a bunch of validation, and none of the added small static
functions are inlined in debug mode. I need to look into this a bit
more and see if I can mitigate this without switching all selftests to
release mode.
>
> Don't know whether libbpf is doing too many unnecessary reads on vmlinux btf
> or what is actually happening, but something isn't right.
> This test shouldn't be causing libbpf to be the main consumer of cpu.
> Not sure whether other tests are similarly affected.
> This is what I've debugged so far.
* Re: WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342
2023-08-25 15:28 ` Yonghong Song
2023-08-25 18:53 ` Alexei Starovoitov
@ 2023-08-26 3:48 ` Hou Tao
2023-08-26 9:23 ` Björn Töpel
1 sibling, 1 reply; 31+ messages in thread
From: Hou Tao @ 2023-08-26 3:48 UTC (permalink / raw)
To: Björn Töpel, bpf; +Cc: linux-riscv, yonghong.song, Alexei Starovoitov
[-- Attachment #1: Type: text/plain, Size: 4128 bytes --]
Hi,
On 8/25/2023 11:28 PM, Yonghong Song wrote:
>
>
> On 8/25/23 3:32 AM, Björn Töpel wrote:
>> I'm chasing a workqueue hang on RISC-V/qemu (TCG), using the bpf
>> selftests on bpf-next 9e3b47abeb8f.
>>
>> I'm able to reproduce the hang by multiple runs of:
>> | ./test_progs -a link_api -a linked_list
>> I'm currently investigating that.
>>
>> But! Sometimes (every blue moon) I get a warn_on_once hit:
>> [... trace and code snipped ...]
>>
>> I'm not well versed in the memory allocator; Before I dive into it --
>> has anyone else hit it? Ideas on why the warn_on_once is hit?
>
> Maybe take a look at the patch
> 822fb26bdb55 bpf: Add a hint to allocated objects.
>
> In the above patch, we have
>
> + /*
> + * Remember bpf_mem_cache that allocated this object.
> + * The hint is not accurate.
> + */
> + c->tgt = *(struct bpf_mem_cache **)llnode;
>
> I suspect that the warning may be related to the above.
> I tried the above ./test_progs command line (running multiple
> at the same time) and didn't trigger the issue.
The extra 8 bytes before the freed pointer are used to save the pointer
to the original bpf memory allocator cache the freed object came from,
so unit_free() can free the object back to the original allocator and
prevent an alloc-and-free imbalance.
I suspect that a wrong pointer was passed to bpf_obj_drop, but I did not
find anything suspicious after checking linked_list. Another possibility
is a write-after-free problem which corrupts the extra 8 bytes before
the freed pointer. Could you please apply the following debug patch to
check whether or not the extra 8 bytes are corrupted?
>
>>
>>
>> Björn
>>
>
> .
[-- Attachment #2: 0001-bpf-Debug-for-bpf_mem_free.patch --]
[-- Type: text/plain, Size: 1526 bytes --]
From 69e9a281077eadcc73a49876ee6c4103ea94b257 Mon Sep 17 00:00:00 2001
From: Hou Tao <houtao1@huawei.com>
Date: Sat, 26 Aug 2023 11:30:45 +0800
Subject: [PATCH] bpf: Debug for bpf_mem_free()
Signed-off-by: Hou Tao <houtao1@huawei.com>
---
kernel/bpf/memalloc.c | 18 +++++++++++++++++-
1 file changed, 17 insertions(+), 1 deletion(-)
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 662838a34629..fb4fa0605a60 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -830,6 +830,9 @@ void notrace *bpf_mem_alloc(struct bpf_mem_alloc *ma, size_t size)
void notrace bpf_mem_free(struct bpf_mem_alloc *ma, void *ptr)
{
+ struct bpf_mem_cache *from, *to;
+ struct bpf_mem_caches *cc;
+ static int once;
int idx;
if (!ptr)
@@ -839,7 +842,20 @@ void notrace bpf_mem_free(struct bpf_mem_alloc *ma, void *ptr)
if (idx < 0)
return;
- unit_free(this_cpu_ptr(ma->caches)->cache + idx, ptr);
+ cc = this_cpu_ptr(ma->caches);
+ to = cc->cache + idx;
+ from = *(struct bpf_mem_cache **)(ptr - LLIST_NODE_SZ);
+ if (!once && to->unit_size != from->unit_size) {
+ once = true;
+ pr_err("bad cache %px: got size %u work %px, cache %px exp size %u work %px\n",
+ from, from->unit_size, from->refill_work.func,
+ to, to->unit_size, to->refill_work.func);
+ WARN_ON(1);
+ print_hex_dump(KERN_ERR, "", DUMP_PREFIX_OFFSET, 16, 1,
+ from, sizeof(*from), false);
+ }
+
+ unit_free(to, ptr);
}
void notrace bpf_mem_free_rcu(struct bpf_mem_alloc *ma, void *ptr)
--
2.29.2
[-- Attachment #3: Type: text/plain, Size: 161 bytes --]
* Re: WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342
2023-08-26 3:48 ` Hou Tao
@ 2023-08-26 9:23 ` Björn Töpel
2023-08-26 10:27 ` Hou Tao
0 siblings, 1 reply; 31+ messages in thread
From: Björn Töpel @ 2023-08-26 9:23 UTC (permalink / raw)
To: Hou Tao, bpf; +Cc: linux-riscv, yonghong.song, Alexei Starovoitov
Hou Tao <houtao@huaweicloud.com> writes:
> Hi,
>
> On 8/25/2023 11:28 PM, Yonghong Song wrote:
>>
>>
>> On 8/25/23 3:32 AM, Björn Töpel wrote:
>>> I'm chasing a workqueue hang on RISC-V/qemu (TCG), using the bpf
>>> selftests on bpf-next 9e3b47abeb8f.
>>>
>>> I'm able to reproduce the hang by multiple runs of:
>>> | ./test_progs -a link_api -a linked_list
>>> I'm currently investigating that.
>>>
>>> But! Sometimes (every blue moon) I get a warn_on_once hit:
>>> | ------------[ cut here ]------------
>>> | WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342
>>> bpf_mem_refill+0x1fc/0x206
>>> | Modules linked in: bpf_testmod(OE)
>>> | CPU: 3 PID: 261 Comm: test_progs-cpuv Tainted: G OE
>>> N 6.5.0-rc5-01743-gdcb152bb8328 #2
>>> | Hardware name: riscv-virtio,qemu (DT)
>>> | epc : bpf_mem_refill+0x1fc/0x206
>>> | ra : irq_work_single+0x68/0x70
>>> | epc : ffffffff801b1bc4 ra : ffffffff8015fe84 sp : ff2000000001be20
>>> | gp : ffffffff82d26138 tp : ff6000008477a800 t0 : 0000000000046600
>>> | t1 : ffffffff812b6ddc t2 : 0000000000000000 s0 : ff2000000001be70
>>> | s1 : ff5ffffffffe8998 a0 : ff5ffffffffe8998 a1 : ff600003fef4b000
>>> | a2 : 000000000000003f a3 : ffffffff80008250 a4 : 0000000000000060
>>> | a5 : 0000000000000080 a6 : 0000000000000000 a7 : 0000000000735049
>>> | s2 : ff5ffffffffe8998 s3 : 0000000000000022 s4 : 0000000000001000
>>> | s5 : 0000000000000007 s6 : ff5ffffffffe8570 s7 : ffffffff82d6bd30
>>> | s8 : 000000000000003f s9 : ffffffff82d2c5e8 s10: 000000000000ffff
>>> | s11: ffffffff82d2c5d8 t3 : ffffffff81ea8f28 t4 : 0000000000000000
>>> | t5 : ff6000008fd28278 t6 : 0000000000040000
>>> | status: 0000000200000100 badaddr: 0000000000000000 cause:
>>> 0000000000000003
>>> | [<ffffffff801b1bc4>] bpf_mem_refill+0x1fc/0x206
>>> | [<ffffffff8015fe84>] irq_work_single+0x68/0x70
>>> | [<ffffffff8015feb4>] irq_work_run_list+0x28/0x36
>>> | [<ffffffff8015fefa>] irq_work_run+0x38/0x66
>>> | [<ffffffff8000828a>] handle_IPI+0x3a/0xb4
>>> | [<ffffffff800a5c3a>] handle_percpu_devid_irq+0xa4/0x1f8
>>> | [<ffffffff8009fafa>] generic_handle_domain_irq+0x28/0x36
>>> | [<ffffffff800ae570>] ipi_mux_process+0xac/0xfa
>>> | [<ffffffff8000a8ea>] sbi_ipi_handle+0x2e/0x88
>>> | [<ffffffff8009fafa>] generic_handle_domain_irq+0x28/0x36
>>> | [<ffffffff807ee70e>] riscv_intc_irq+0x36/0x4e
>>> | [<ffffffff812b5d3a>] handle_riscv_irq+0x54/0x86
>>> | [<ffffffff812b6904>] do_irq+0x66/0x98
>>> | ---[ end trace 0000000000000000 ]---
>>>
>>> Code:
>>> | static void free_bulk(struct bpf_mem_cache *c)
>>> | {
>>> | struct bpf_mem_cache *tgt = c->tgt;
>>> | struct llist_node *llnode, *t;
>>> | unsigned long flags;
>>> | int cnt;
>>> |
>>> | WARN_ON_ONCE(tgt->unit_size != c->unit_size);
>>> | ...
>>>
>>> I'm not well versed in the memory allocator; Before I dive into it --
>>> has anyone else hit it? Ideas on why the warn_on_once is hit?
>>
>> Maybe take a look at the patch
>> 822fb26bdb55 bpf: Add a hint to allocated objects.
>>
>> In the above patch, we have
>>
>> + /*
>> + * Remember bpf_mem_cache that allocated this object.
>> + * The hint is not accurate.
>> + */
>> + c->tgt = *(struct bpf_mem_cache **)llnode;
>>
>> I suspect that the warning may be related to the above.
>> I tried the above ./test_progs command line (running multiple
>> at the same time) and didn't trigger the issue.
>
> The extra 8 bytes before the freed pointer are used to save the pointer
> of the original bpf memory allocator that the freed pointer came from,
> so that unit_free() can free the pointer back to the original allocator
> and prevent an alloc-and-free imbalance.
>
> I suspect that a wrong pointer was passed to bpf_obj_drop, but I did not
> find anything suspicious after checking linked_list. Another possibility
> is a write-after-free problem which corrupts the extra 8 bytes before
> the freed pointer. Could you please apply the following debug patch to
> check whether or not the extra 8 bytes are corrupted?
Thanks for getting back!
I took your patch for a run, and there's a hit:
| bad cache ff5ffffffffe8570: got size 96 work ffffffff801b19c8, cache ff5ffffffffe8980 exp size 128 work ffffffff801b19c8
| WARNING: CPU: 3 PID: 265 at kernel/bpf/memalloc.c:847 bpf_mem_free+0xb2/0xe0
| ---[ end trace 0000000000000000 ]---
| 00000000: 80 e3 dc 85 00 00 60 ff 00 00 00 00 00 00 00 00
| 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
| 00000020: 20 00 00 00 00 00 00 00 c8 19 1b 80 ff ff ff ff
| 00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
| 00000040: 60 00 00 00 33 00 00 00 20 00 00 00 60 00 00 00
| 00000050: 30 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
| 00000060: 70 85 fe ff ff ff 5f ff 00 00 00 00 00 00 00 00
| 00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
| 00000080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
| 00000090: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
| 000000a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
| 000000b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
| 000000c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
| WARNING: CPU: 2 PID: 265 at kernel/bpf/memalloc.c:342 bpf_mem_refill+0x1fc/0x206
| ---[ end trace 0000000000000000 ]---
Björn
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342
2023-08-26 9:23 ` Björn Töpel
@ 2023-08-26 10:27 ` Hou Tao
2023-08-26 10:49 ` Björn Töpel
0 siblings, 1 reply; 31+ messages in thread
From: Hou Tao @ 2023-08-26 10:27 UTC (permalink / raw)
To: Björn Töpel, bpf; +Cc: linux-riscv, yonghong.song, Alexei Starovoitov
Hi,
On 8/26/2023 5:23 PM, Björn Töpel wrote:
> Hou Tao <houtao@huaweicloud.com> writes:
>
>> Hi,
>>
>> On 8/25/2023 11:28 PM, Yonghong Song wrote:
>>>
>>> On 8/25/23 3:32 AM, Björn Töpel wrote:
>>>> I'm chasing a workqueue hang on RISC-V/qemu (TCG), using the bpf
>>>> selftests on bpf-next 9e3b47abeb8f.
>>>>
>>>> I'm able to reproduce the hang by multiple runs of:
>>>> | ./test_progs -a link_api -a linked_list
>>>> I'm currently investigating that.
>>>>
>>>> But! Sometimes (every blue moon) I get a warn_on_once hit:
>>>> | ------------[ cut here ]------------
>>>> | WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342
>>>> bpf_mem_refill+0x1fc/0x206
>>>> | Modules linked in: bpf_testmod(OE)
>>>> | CPU: 3 PID: 261 Comm: test_progs-cpuv Tainted: G OE
>>>> N 6.5.0-rc5-01743-gdcb152bb8328 #2
>>>> | Hardware name: riscv-virtio,qemu (DT)
>>>> | epc : bpf_mem_refill+0x1fc/0x206
>>>> | ra : irq_work_single+0x68/0x70
>>>> | epc : ffffffff801b1bc4 ra : ffffffff8015fe84 sp : ff2000000001be20
>>>> | gp : ffffffff82d26138 tp : ff6000008477a800 t0 : 0000000000046600
>>>> | t1 : ffffffff812b6ddc t2 : 0000000000000000 s0 : ff2000000001be70
>>>> | s1 : ff5ffffffffe8998 a0 : ff5ffffffffe8998 a1 : ff600003fef4b000
>>>> | a2 : 000000000000003f a3 : ffffffff80008250 a4 : 0000000000000060
>>>> | a5 : 0000000000000080 a6 : 0000000000000000 a7 : 0000000000735049
>>>> | s2 : ff5ffffffffe8998 s3 : 0000000000000022 s4 : 0000000000001000
>>>> | s5 : 0000000000000007 s6 : ff5ffffffffe8570 s7 : ffffffff82d6bd30
>>>> | s8 : 000000000000003f s9 : ffffffff82d2c5e8 s10: 000000000000ffff
>>>> | s11: ffffffff82d2c5d8 t3 : ffffffff81ea8f28 t4 : 0000000000000000
>>>> | t5 : ff6000008fd28278 t6 : 0000000000040000
>>>> | status: 0000000200000100 badaddr: 0000000000000000 cause:
>>>> 0000000000000003
>>>> | [<ffffffff801b1bc4>] bpf_mem_refill+0x1fc/0x206
>>>> | [<ffffffff8015fe84>] irq_work_single+0x68/0x70
>>>> | [<ffffffff8015feb4>] irq_work_run_list+0x28/0x36
>>>> | [<ffffffff8015fefa>] irq_work_run+0x38/0x66
>>>> | [<ffffffff8000828a>] handle_IPI+0x3a/0xb4
>>>> | [<ffffffff800a5c3a>] handle_percpu_devid_irq+0xa4/0x1f8
>>>> | [<ffffffff8009fafa>] generic_handle_domain_irq+0x28/0x36
>>>> | [<ffffffff800ae570>] ipi_mux_process+0xac/0xfa
>>>> | [<ffffffff8000a8ea>] sbi_ipi_handle+0x2e/0x88
>>>> | [<ffffffff8009fafa>] generic_handle_domain_irq+0x28/0x36
>>>> | [<ffffffff807ee70e>] riscv_intc_irq+0x36/0x4e
>>>> | [<ffffffff812b5d3a>] handle_riscv_irq+0x54/0x86
>>>> | [<ffffffff812b6904>] do_irq+0x66/0x98
>>>> | ---[ end trace 0000000000000000 ]---
>>>>
>>>> Code:
>>>> | static void free_bulk(struct bpf_mem_cache *c)
>>>> | {
>>>> | struct bpf_mem_cache *tgt = c->tgt;
>>>> | struct llist_node *llnode, *t;
>>>> | unsigned long flags;
>>>> | int cnt;
>>>> |
>>>> | WARN_ON_ONCE(tgt->unit_size != c->unit_size);
>>>> | ...
>>>>
>>>> I'm not well versed in the memory allocator; Before I dive into it --
>>>> has anyone else hit it? Ideas on why the warn_on_once is hit?
>>> Maybe take a look at the patch
>>> 822fb26bdb55 bpf: Add a hint to allocated objects.
>>>
>>> In the above patch, we have
>>>
>>> + /*
>>> + * Remember bpf_mem_cache that allocated this object.
>>> + * The hint is not accurate.
>>> + */
>>> + c->tgt = *(struct bpf_mem_cache **)llnode;
>>>
>>> I suspect that the warning may be related to the above.
>>> I tried the above ./test_progs command line (running multiple
>>> at the same time) and didn't trigger the issue.
>> The extra 8 bytes before the freed pointer are used to save the pointer
>> of the original bpf memory allocator that the freed pointer came from,
>> so that unit_free() can free the pointer back to the original allocator
>> and prevent an alloc-and-free imbalance.
>>
>> I suspect that a wrong pointer was passed to bpf_obj_drop, but I did not
>> find anything suspicious after checking linked_list. Another possibility
>> is a write-after-free problem which corrupts the extra 8 bytes before
>> the freed pointer. Could you please apply the following debug patch to
>> check whether or not the extra 8 bytes are corrupted?
> Thanks for getting back!
>
> I took your patch for a run, and there's a hit:
> | bad cache ff5ffffffffe8570: got size 96 work ffffffff801b19c8, cache ff5ffffffffe8980 exp size 128 work ffffffff801b19c8
The extra 8 bytes are not corrupted. Both of these bpf_mem_cache
pointers are valid, and they are in the cache array defined in
bpf_mem_caches. The BPF memory allocator allocated the pointer from the
96-byte sized cache, but tried to free it through the 128-byte sized
cache.
Now I suspect there is no 96-byte slab in your system and ksize(ptr -
LLIST_NODE_SZ) returns 128, so the BPF memory allocator selected the
128-byte sized cache instead of the 96-byte one. Could you please check
the value of KMALLOC_MIN_SIZE in your kernel .config and use the
following command to check whether there is a 96-byte slab in your
system:
$ cat /proc/slabinfo |grep kmalloc-96
dma-kmalloc-96         0      0     96   42    1 : tunables 0 0 0 : slabdata 0 0 0
kmalloc-96          1865   2268     96   42    1 : tunables 0 0 0 : slabdata 54 54 0
On my system, slab has a 96-byte cache, so the grep produces output, but
I suspect there will be no output on yours.
> | WARNING: CPU: 3 PID: 265 at kernel/bpf/memalloc.c:847 bpf_mem_free+0xb2/0xe0
> | ---[ end trace 0000000000000000 ]---
> | 00000000: 80 e3 dc 85 00 00 60 ff 00 00 00 00 00 00 00 00
> | 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> | 00000020: 20 00 00 00 00 00 00 00 c8 19 1b 80 ff ff ff ff
> | 00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> | 00000040: 60 00 00 00 33 00 00 00 20 00 00 00 60 00 00 00
> | 00000050: 30 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> | 00000060: 70 85 fe ff ff ff 5f ff 00 00 00 00 00 00 00 00
> | 00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> | 00000080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> | 00000090: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> | 000000a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> | 000000b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> | 000000c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> | WARNING: CPU: 2 PID: 265 at kernel/bpf/memalloc.c:342 bpf_mem_refill+0x1fc/0x206
> | ---[ end trace 0000000000000000 ]---
>
>
> Björn
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342
2023-08-26 10:27 ` Hou Tao
@ 2023-08-26 10:49 ` Björn Töpel
2023-08-27 8:37 ` Björn Töpel
0 siblings, 1 reply; 31+ messages in thread
From: Björn Töpel @ 2023-08-26 10:49 UTC (permalink / raw)
To: Hou Tao, bpf; +Cc: linux-riscv, yonghong.song, Alexei Starovoitov
Hou Tao <houtao@huaweicloud.com> writes:
> Hi,
>
> On 8/26/2023 5:23 PM, Björn Töpel wrote:
>> Hou Tao <houtao@huaweicloud.com> writes:
>>
>>> Hi,
>>>
>>> On 8/25/2023 11:28 PM, Yonghong Song wrote:
>>>>
>>>> On 8/25/23 3:32 AM, Björn Töpel wrote:
>>>>> I'm chasing a workqueue hang on RISC-V/qemu (TCG), using the bpf
>>>>> selftests on bpf-next 9e3b47abeb8f.
>>>>>
>>>>> I'm able to reproduce the hang by multiple runs of:
>>>>> | ./test_progs -a link_api -a linked_list
>>>>> I'm currently investigating that.
>>>>>
>>>>> But! Sometimes (every blue moon) I get a warn_on_once hit:
>>>>> | ------------[ cut here ]------------
>>>>> | WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342
>>>>> bpf_mem_refill+0x1fc/0x206
>>>>> | Modules linked in: bpf_testmod(OE)
>>>>> | CPU: 3 PID: 261 Comm: test_progs-cpuv Tainted: G OE
>>>>> N 6.5.0-rc5-01743-gdcb152bb8328 #2
>>>>> | Hardware name: riscv-virtio,qemu (DT)
>>>>> | epc : bpf_mem_refill+0x1fc/0x206
>>>>> | ra : irq_work_single+0x68/0x70
>>>>> | epc : ffffffff801b1bc4 ra : ffffffff8015fe84 sp : ff2000000001be20
>>>>> | gp : ffffffff82d26138 tp : ff6000008477a800 t0 : 0000000000046600
>>>>> | t1 : ffffffff812b6ddc t2 : 0000000000000000 s0 : ff2000000001be70
>>>>> | s1 : ff5ffffffffe8998 a0 : ff5ffffffffe8998 a1 : ff600003fef4b000
>>>>> | a2 : 000000000000003f a3 : ffffffff80008250 a4 : 0000000000000060
>>>>> | a5 : 0000000000000080 a6 : 0000000000000000 a7 : 0000000000735049
>>>>> | s2 : ff5ffffffffe8998 s3 : 0000000000000022 s4 : 0000000000001000
>>>>> | s5 : 0000000000000007 s6 : ff5ffffffffe8570 s7 : ffffffff82d6bd30
>>>>> | s8 : 000000000000003f s9 : ffffffff82d2c5e8 s10: 000000000000ffff
>>>>> | s11: ffffffff82d2c5d8 t3 : ffffffff81ea8f28 t4 : 0000000000000000
>>>>> | t5 : ff6000008fd28278 t6 : 0000000000040000
>>>>> | status: 0000000200000100 badaddr: 0000000000000000 cause:
>>>>> 0000000000000003
>>>>> | [<ffffffff801b1bc4>] bpf_mem_refill+0x1fc/0x206
>>>>> | [<ffffffff8015fe84>] irq_work_single+0x68/0x70
>>>>> | [<ffffffff8015feb4>] irq_work_run_list+0x28/0x36
>>>>> | [<ffffffff8015fefa>] irq_work_run+0x38/0x66
>>>>> | [<ffffffff8000828a>] handle_IPI+0x3a/0xb4
>>>>> | [<ffffffff800a5c3a>] handle_percpu_devid_irq+0xa4/0x1f8
>>>>> | [<ffffffff8009fafa>] generic_handle_domain_irq+0x28/0x36
>>>>> | [<ffffffff800ae570>] ipi_mux_process+0xac/0xfa
>>>>> | [<ffffffff8000a8ea>] sbi_ipi_handle+0x2e/0x88
>>>>> | [<ffffffff8009fafa>] generic_handle_domain_irq+0x28/0x36
>>>>> | [<ffffffff807ee70e>] riscv_intc_irq+0x36/0x4e
>>>>> | [<ffffffff812b5d3a>] handle_riscv_irq+0x54/0x86
>>>>> | [<ffffffff812b6904>] do_irq+0x66/0x98
>>>>> | ---[ end trace 0000000000000000 ]---
>>>>>
>>>>> Code:
>>>>> | static void free_bulk(struct bpf_mem_cache *c)
>>>>> | {
>>>>> | struct bpf_mem_cache *tgt = c->tgt;
>>>>> | struct llist_node *llnode, *t;
>>>>> | unsigned long flags;
>>>>> | int cnt;
>>>>> |
>>>>> | WARN_ON_ONCE(tgt->unit_size != c->unit_size);
>>>>> | ...
>>>>>
>>>>> I'm not well versed in the memory allocator; Before I dive into it --
>>>>> has anyone else hit it? Ideas on why the warn_on_once is hit?
>>>> Maybe take a look at the patch
>>>> 822fb26bdb55 bpf: Add a hint to allocated objects.
>>>>
>>>> In the above patch, we have
>>>>
>>>> + /*
>>>> + * Remember bpf_mem_cache that allocated this object.
>>>> + * The hint is not accurate.
>>>> + */
>>>> + c->tgt = *(struct bpf_mem_cache **)llnode;
>>>>
>>>> I suspect that the warning may be related to the above.
>>>> I tried the above ./test_progs command line (running multiple
>>>> at the same time) and didn't trigger the issue.
>>> The extra 8 bytes before the freed pointer are used to save the pointer
>>> of the original bpf memory allocator that the freed pointer came from,
>>> so that unit_free() can free the pointer back to the original allocator
>>> and prevent an alloc-and-free imbalance.
>>>
>>> I suspect that a wrong pointer was passed to bpf_obj_drop, but I did not
>>> find anything suspicious after checking linked_list. Another possibility
>>> is a write-after-free problem which corrupts the extra 8 bytes before
>>> the freed pointer. Could you please apply the following debug patch to
>>> check whether or not the extra 8 bytes are corrupted?
>> Thanks for getting back!
>>
>> I took your patch for a run, and there's a hit:
>> | bad cache ff5ffffffffe8570: got size 96 work ffffffff801b19c8, cache ff5ffffffffe8980 exp size 128 work ffffffff801b19c8
>
> The extra 8 bytes are not corrupted. Both of these bpf_mem_cache
> pointers are valid, and they are in the cache array defined in
> bpf_mem_caches. The BPF memory allocator allocated the pointer from the
> 96-byte sized cache, but tried to free it through the 128-byte sized
> cache.
>
> Now I suspect there is no 96-byte slab in your system and ksize(ptr -
> LLIST_NODE_SZ) returns 128, so the BPF memory allocator selected the
> 128-byte sized cache instead of the 96-byte one. Could you please check
> the value of KMALLOC_MIN_SIZE in your kernel .config and use the
> following command to check whether there is a 96-byte slab in your
> system:
KMALLOC_MIN_SIZE is 64.
> $ cat /proc/slabinfo |grep kmalloc-96
> dma-kmalloc-96         0      0     96   42    1 : tunables 0 0 0 : slabdata 0 0 0
> kmalloc-96          1865   2268     96   42    1 : tunables 0 0 0 : slabdata 54 54 0
>
> On my system, slab has a 96-byte cache, so the grep produces output,
> but I suspect there will be no output on yours.
You're right! No kmalloc-96.
Björn
^ permalink raw reply [flat|nested] 31+ messages in thread
* RISC-V uprobe bug (Was: Re: WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342)
2023-08-25 10:32 WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342 Björn Töpel
2023-08-25 15:28 ` Yonghong Song
@ 2023-08-26 13:44 ` Björn Töpel
2023-08-26 18:12 ` Nam Cao
1 sibling, 1 reply; 31+ messages in thread
From: Björn Töpel @ 2023-08-26 13:44 UTC (permalink / raw)
To: linux-riscv, Guo Ren
Cc: bpf, Hou Tao, yonghong.song, Alexei Starovoitov, Puranjay Mohan
Björn Töpel <bjorn@kernel.org> writes:
> I'm chasing a workqueue hang on RISC-V/qemu (TCG), using the bpf
> selftests on bpf-next 9e3b47abeb8f.
>
> I'm able to reproduce the hang by multiple runs of:
> | ./test_progs -a link_api -a linked_list
> I'm currently investigating that.
+Guo for uprobe
This was an interesting bug. The hang is an ebreak (RISC-V breakpoint),
that puts the kernel into an infinite loop.
To reproduce, simply run the BPF selftest:
./test_progs -v -a link_api -a linked_list
First the link_api test is run, which exercises the uprobe
functionality. The link_api test completes, but test_progs still has the
uprobe active/enabled. Next, the linked_list test triggers a WARN_ON
(which is implemented via ebreak as well).
Now, handle_break() is entered, and uprobe_breakpoint_handler() returns
true, exiting handle_break(), which returns to the WARN ebreak, and we
have a merry-go-round.
Luckily for the RISC-V folks, the BPF memory allocator had a WARN that
surfaced the bug! ;-)
This patch fixes the issue, but there is probably a prettier variant:
--8<--
diff --git a/arch/riscv/kernel/traps.c b/arch/riscv/kernel/traps.c
index f798c853bede..1198cb879d2f 100644
--- a/arch/riscv/kernel/traps.c
+++ b/arch/riscv/kernel/traps.c
@@ -248,23 +248,29 @@ static inline unsigned long get_break_insn_length(unsigned long pc)
void handle_break(struct pt_regs *regs)
{
+ bool user = user_mode(regs);
+
#ifdef CONFIG_KPROBES
- if (kprobe_single_step_handler(regs))
- return;
+ if (!user) {
+ if (kprobe_single_step_handler(regs))
+ return;
- if (kprobe_breakpoint_handler(regs))
- return;
+ if (kprobe_breakpoint_handler(regs))
+ return;
+ }
#endif
#ifdef CONFIG_UPROBES
- if (uprobe_single_step_handler(regs))
- return;
+ if (user) {
+ if (uprobe_single_step_handler(regs))
+ return;
- if (uprobe_breakpoint_handler(regs))
- return;
+ if (uprobe_breakpoint_handler(regs))
+ return;
+ }
#endif
current->thread.bad_cause = regs->cause;
- if (user_mode(regs))
+ if (user)
force_sig_fault(SIGTRAP, TRAP_BRKPT, (void __user *)regs->epc);
#ifdef CONFIG_KGDB
else if (notify_die(DIE_TRAP, "EBREAK", regs, 0, regs->cause, SIGTRAP)
--8<--
I'll cook a cleaner/proper patch for this, unless the uprobes folks have
a better solution.
Björn
^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: RISC-V uprobe bug (Was: Re: WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342)
2023-08-26 13:44 ` RISC-V uprobe bug (Was: Re: WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342) Björn Töpel
@ 2023-08-26 18:12 ` Nam Cao
2023-08-26 18:31 ` Nam Cao
2023-08-27 8:11 ` Björn Töpel
0 siblings, 2 replies; 31+ messages in thread
From: Nam Cao @ 2023-08-26 18:12 UTC (permalink / raw)
To: Björn Töpel
Cc: linux-riscv, Guo Ren, bpf, Hou Tao, yonghong.song,
Alexei Starovoitov, Puranjay Mohan
On Sat, Aug 26, 2023 at 03:44:48PM +0200, Björn Töpel wrote:
> Björn Töpel <bjorn@kernel.org> writes:
>
> > I'm chasing a workqueue hang on RISC-V/qemu (TCG), using the bpf
> > selftests on bpf-next 9e3b47abeb8f.
> >
> > I'm able to reproduce the hang by multiple runs of:
> > | ./test_progs -a link_api -a linked_list
> > I'm currently investigating that.
>
> +Guo for uprobe
>
> This was an interesting bug. The hang is an ebreak (RISC-V breakpoint),
> that puts the kernel into an infinite loop.
>
> To reproduce, simply run the BPF selftest:
> ./test_progs -v -a link_api -a linked_list
>
> First the link_api test is being run, which exercises the uprobe
> functionality. The link_api test completes, and test_progs will still
> have the uprobe active/enabled. Next the linked_list test triggered a
> WARN_ON (which is implemented via ebreak as well).
>
> Now, handle_break() is entered, and uprobe_breakpoint_handler() returns
> true, exiting handle_break(), which returns to the WARN ebreak, and we
> have a merry-go-round.
>
> Luckily for the RISC-V folks, the BPF memory allocator had a WARN that
> surfaced the bug! ;-)
Thanks for the analysis.
I couldn't reproduce the problem, so I am just taking a guess here. The
problem is that uprobes didn't find a probe point at that ebreak
instruction. However, it also doesn't think an ebreak instruction is
there, so it gets confused and simply returns to the ebreak instruction,
and everything repeats.
The reason uprobes didn't think there is an ebreak instruction is that
is_trap_insn() only returns true for either a 32-bit ebreak, or a 16-bit
c.ebreak when the C extension is available, but not both. So a 32-bit
ebreak is not correctly recognized as a trap instruction.
If my guess is correct, the following should fix it. Can you please try
it and see if it works? (This is the first time I've sent a patch this
way, so please let me know if you can't apply it.)
Best regards,
Nam
---
arch/riscv/kernel/probes/uprobes.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/arch/riscv/kernel/probes/uprobes.c b/arch/riscv/kernel/probes/uprobes.c
index 194f166b2cc4..91f4ce101cd1 100644
--- a/arch/riscv/kernel/probes/uprobes.c
+++ b/arch/riscv/kernel/probes/uprobes.c
@@ -3,6 +3,7 @@
#include <linux/highmem.h>
#include <linux/ptrace.h>
#include <linux/uprobes.h>
+#include <asm/insn.h>
#include "decode-insn.h"
@@ -17,6 +18,15 @@ bool is_swbp_insn(uprobe_opcode_t *insn)
#endif
}
+bool is_trap_insn(uprobe_opcode_t *insn)
+{
+#ifdef CONFIG_RISCV_ISA_C
+ if (riscv_insn_is_c_ebreak(*insn))
+ return true;
+#endif
+ return riscv_insn_is_ebreak(*insn);
+}
+
unsigned long uprobe_get_swbp_addr(struct pt_regs *regs)
{
return instruction_pointer(regs);
--
2.34.1
^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: RISC-V uprobe bug (Was: Re: WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342)
2023-08-26 18:12 ` Nam Cao
@ 2023-08-26 18:31 ` Nam Cao
2023-08-27 8:11 ` Björn Töpel
1 sibling, 0 replies; 31+ messages in thread
From: Nam Cao @ 2023-08-26 18:31 UTC (permalink / raw)
To: Björn Töpel
Cc: linux-riscv, Guo Ren, bpf, Hou Tao, yonghong.song,
Alexei Starovoitov, Puranjay Mohan
On Sat, Aug 26, 2023 at 08:12:30PM +0200, Nam Cao wrote:
> On Sat, Aug 26, 2023 at 03:44:48PM +0200, Björn Töpel wrote:
> > Björn Töpel <bjorn@kernel.org> writes:
> >
> > > I'm chasing a workqueue hang on RISC-V/qemu (TCG), using the bpf
> > > selftests on bpf-next 9e3b47abeb8f.
> > >
> > > I'm able to reproduce the hang by multiple runs of:
> > > | ./test_progs -a link_api -a linked_list
> > > I'm currently investigating that.
> >
> > +Guo for uprobe
> >
> > This was an interesting bug. The hang is an ebreak (RISC-V breakpoint),
> > that puts the kernel into an infinite loop.
> >
> > To reproduce, simply run the BPF selftest:
> > ./test_progs -v -a link_api -a linked_list
> >
> > First the link_api test is being run, which exercises the uprobe
> > functionality. The link_api test completes, and test_progs will still
> > have the uprobe active/enabled. Next the linked_list test triggered a
> > WARN_ON (which is implemented via ebreak as well).
> >
> > Now, handle_break() is entered, and uprobe_breakpoint_handler()
> > returns true, exiting handle_break(), which returns to the WARN
> > ebreak, and we have a merry-go-round.
> >
> > Luckily for the RISC-V folks, the BPF memory allocator had a WARN
> > that surfaced the bug! ;-)
>
> Thanks for the analysis.
>
> I couldn't reproduce the problem, so I am just taking a guess here. The
> problem is that uprobes didn't find a probe point at that ebreak
> instruction. However, it also doesn't think an ebreak instruction is
> there, so it gets confused and simply returns to the ebreak
> instruction, and everything repeats.
>
> The reason uprobes didn't think there is an ebreak instruction is that
> is_trap_insn() only returns true for either a 32-bit ebreak, or a
> 16-bit c.ebreak when the C extension is available, but not both. So a
> 32-bit ebreak is not correctly recognized as a trap instruction.
I feel I wasn't very clear about this: I was talking about handle_swbp()
in kernel/events/uprobes.c. In this function, the call to
find_active_uprobe() finds no uprobe. Then uprobes checks whether the
trap instruction is still there by calling is_trap_insn(), which
incorrectly says "no". So uprobes assumes it is safe to just come back
to that address. If is_trap_insn() correctly returned true, then uprobes
would know that this is an ebreak, but not a probe, and handle things
correctly.
Best regards,
Nam
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342
2023-08-25 21:31 ` Andrii Nakryiko
@ 2023-08-26 22:49 ` Kumar Kartikeya Dwivedi
0 siblings, 0 replies; 31+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2023-08-26 22:49 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Alexei Starovoitov, Yonghong Song, Dave Marchevsky,
Andrii Nakryiko, Björn Töpel, bpf, linux-riscv
On Sat, 26 Aug 2023 at 03:01, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>
> On Fri, Aug 25, 2023 at 12:49 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Fri, Aug 25, 2023 at 11:53 AM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Fri, Aug 25, 2023 at 8:28 AM Yonghong Song <yonghong.song@linux.dev> wrote:
> > > >
> > > >
> > > >
> > > > On 8/25/23 3:32 AM, Björn Töpel wrote:
> > > > > I'm chasing a workqueue hang on RISC-V/qemu (TCG), using the bpf
> > > > > selftests on bpf-next 9e3b47abeb8f.
> > > > >
> > > > > I'm able to reproduce the hang by multiple runs of:
> > > > > | ./test_progs -a link_api -a linked_list
> > > > > I'm currently investigating that.
> > > > >
> > > > > But! Sometimes (every blue moon) I get a warn_on_once hit:
> > > > > | ------------[ cut here ]------------
> > > > > | WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342 bpf_mem_refill+0x1fc/0x206
> > > > > | Modules linked in: bpf_testmod(OE)
> > > > > | CPU: 3 PID: 261 Comm: test_progs-cpuv Tainted: G OE N 6.5.0-rc5-01743-gdcb152bb8328 #2
> > > > > | Hardware name: riscv-virtio,qemu (DT)
> > > > > | epc : bpf_mem_refill+0x1fc/0x206
> > > > > | ra : irq_work_single+0x68/0x70
> > > > > | epc : ffffffff801b1bc4 ra : ffffffff8015fe84 sp : ff2000000001be20
> > > > > | gp : ffffffff82d26138 tp : ff6000008477a800 t0 : 0000000000046600
> > > > > | t1 : ffffffff812b6ddc t2 : 0000000000000000 s0 : ff2000000001be70
> > > > > | s1 : ff5ffffffffe8998 a0 : ff5ffffffffe8998 a1 : ff600003fef4b000
> > > > > | a2 : 000000000000003f a3 : ffffffff80008250 a4 : 0000000000000060
> > > > > | a5 : 0000000000000080 a6 : 0000000000000000 a7 : 0000000000735049
> > > > > | s2 : ff5ffffffffe8998 s3 : 0000000000000022 s4 : 0000000000001000
> > > > > | s5 : 0000000000000007 s6 : ff5ffffffffe8570 s7 : ffffffff82d6bd30
> > > > > | s8 : 000000000000003f s9 : ffffffff82d2c5e8 s10: 000000000000ffff
> > > > > | s11: ffffffff82d2c5d8 t3 : ffffffff81ea8f28 t4 : 0000000000000000
> > > > > | t5 : ff6000008fd28278 t6 : 0000000000040000
> > > > > | status: 0000000200000100 badaddr: 0000000000000000 cause: 0000000000000003
> > > > > | [<ffffffff801b1bc4>] bpf_mem_refill+0x1fc/0x206
> > > > > | [<ffffffff8015fe84>] irq_work_single+0x68/0x70
> > > > > | [<ffffffff8015feb4>] irq_work_run_list+0x28/0x36
> > > > > | [<ffffffff8015fefa>] irq_work_run+0x38/0x66
> > > > > | [<ffffffff8000828a>] handle_IPI+0x3a/0xb4
> > > > > | [<ffffffff800a5c3a>] handle_percpu_devid_irq+0xa4/0x1f8
> > > > > | [<ffffffff8009fafa>] generic_handle_domain_irq+0x28/0x36
> > > > > | [<ffffffff800ae570>] ipi_mux_process+0xac/0xfa
> > > > > | [<ffffffff8000a8ea>] sbi_ipi_handle+0x2e/0x88
> > > > > | [<ffffffff8009fafa>] generic_handle_domain_irq+0x28/0x36
> > > > > | [<ffffffff807ee70e>] riscv_intc_irq+0x36/0x4e
> > > > > | [<ffffffff812b5d3a>] handle_riscv_irq+0x54/0x86
> > > > > | [<ffffffff812b6904>] do_irq+0x66/0x98
> > > > > | ---[ end trace 0000000000000000 ]---
> > > > >
> > > > > Code:
> > > > > | static void free_bulk(struct bpf_mem_cache *c)
> > > > > | {
> > > > > | struct bpf_mem_cache *tgt = c->tgt;
> > > > > | struct llist_node *llnode, *t;
> > > > > | unsigned long flags;
> > > > > | int cnt;
> > > > > |
> > > > > | WARN_ON_ONCE(tgt->unit_size != c->unit_size);
> > > > > | ...
> > > > >
> > > > > I'm not well versed in the memory allocator; Before I dive into it --
> > > > > has anyone else hit it? Ideas on why the warn_on_once is hit?
> > > >
> > > > Maybe take a look at the patch
> > > > 822fb26bdb55 bpf: Add a hint to allocated objects.
> > > >
> > > > In the above patch, we have
> > > >
> > > > + /*
> > > > + * Remember bpf_mem_cache that allocated this object.
> > > > + * The hint is not accurate.
> > > > + */
> > > > + c->tgt = *(struct bpf_mem_cache **)llnode;
> > > >
> > > > I suspect that the warning may be related to the above.
> > > > I tried the above ./test_progs command line (running multiple
> > > > at the same time) and didn't trigger the issue.
> > >
> > > Interesting. Thanks for the report.
> > > I wasn't able to repro the warn either, but I can confirm that the linked_list
> > > test is super slow on a kernel with debug options on and may appear to "hang",
> > > since it doesn't respond to key presses for many seconds.
> > >
> > > time ./test_progs -a linked_list
> > > Summary: 1/132 PASSED, 0 SKIPPED, 0 FAILED
> > >
> > > real 0m35.897s
> > > user 0m1.860s
> > > sys 0m32.105s
> > >
> > > That's not normal.
> > >
> > > Kumar, Dave,
> > >
> > > do you have any suggestions?
> >
> > Andrii,
> >
> > this issue seems to be a scalability issue with libbpf.
> > The perf report looks like:
> >
> > 9.89% test_progs [kernel.kallsyms] [k] __lock_acquire
> > 7.70% test_progs [kernel.kallsyms] [k] check_preemption_disabled
> > 2.61% test_progs [kernel.kallsyms] [k] asm_exc_page_fault
> > 2.32% test_progs [kernel.kallsyms] [k] rcu_is_watching
> > 2.30% test_progs [kernel.kallsyms] [k] memcpy_orig
> > 2.26% test_progs test_progs [.] btf_find_by_name_kind
> > 2.18% test_progs libc-2.28.so [.] __strcmp_avx2
> > 2.14% test_progs [kernel.kallsyms] [k] lock_acquire
> > 2.10% test_progs [kernel.kallsyms] [k] mark_lock.part.48
> > 2.07% test_progs test_progs [.] btf_kind
> > 2.05% test_progs [kernel.kallsyms] [k] clear_page_erms
> > 1.89% test_progs test_progs [.] btf_type_by_id
> > 1.83% test_progs [kernel.kallsyms] [k] lock_is_held_type
> > 1.75% test_progs [kernel.kallsyms] [k] lock_release
> > 1.31% test_progs [kernel.kallsyms] [k] unwind_next_frame
> > 1.29% test_progs test_progs [.] btf__type_by_id
> > 1.28% test_progs [kernel.kallsyms] [k] rep_movs_alternative
> > 1.09% test_progs [kernel.kallsyms] [k] __orc_find
> > 1.06% swapper [kernel.kallsyms] [k] __lock_acquire
> > 1.02% test_progs test_progs [.] btf_type_size
> > 0.84% test_progs test_progs [.] btf_parse_type_sec
> > 0.78% test_progs [kernel.kallsyms] [k] __create_object
> > 0.76% test_progs [kernel.kallsyms] [k] __lookup_object
> > 0.75% test_progs test_progs [.] btf__str_by_offset
> >
> > The top 5 functions come from the kernel due to libbpf
> > reading /sys/kernel/btf/vmlinux.
> > Then libbpf proceeds to do a search in BTF.
>
> This test loads 132 skeletons/BPF objects sequentially. Each BPF
> object load should cause exactly one /sys/kernel/btf/vmlinux load. Do
> you see multiple vmlinux reads per BPF object?
>
> We then do a linear search for each ksym, so depending on the amount
> of ksyms, that can scale badly. But linked_list.bpf.o and
> linked_list_fail.bpf.o have only a few ksyms:
>
> [72] DATASEC '.ksyms' size=0 vlen=6
> type_id=33 offset=0 size=0 (FUNC 'bpf_obj_new_impl')
> type_id=37 offset=0 size=0 (FUNC 'bpf_list_pop_front')
> type_id=39 offset=0 size=0 (FUNC 'bpf_obj_drop_impl')
> type_id=41 offset=0 size=0 (FUNC 'bpf_list_pop_back')
> type_id=43 offset=0 size=0 (FUNC 'bpf_list_push_front_impl')
> type_id=51 offset=0 size=0 (FUNC 'bpf_list_push_back_impl')
>
> We can improve this part by either pre-building hashmap index for BTF
> types or sorting all ksyms and then doing one linear pass over vmlinux
> BTF, binary searching. But so far it wasn't a big bottleneck in
> practice.
>
> And running this linked_list test on production kernel, it's not that
> slow, considering how much work it's doing:
>
> sudo ./test_progs -a linked_list 1.54s user 0.58s system 97% cpu 2.178 total
>
>
> Also, we are building selftests+libbpf in debug mode. So I have worse
> news, with my BTF validation patch applied, this test becomes
> extremely slow, because all those 100+ vmlinux BTF parsings are now
> also doing a bunch of validation, and none of the added small static
> functions are inlined in debug mode. I need to look into this a bit
> more and see if I can mitigate this without switching all selftests to
> release mode.
>
Looking at Björn's update, I believe the problem is unrelated to the
linked_list test.
I think the main slowdown here comes from reopening the BPF object with
the failing tests over and over, once for each specific program.
If we rework things so that the object is opened only once and each
failing program is tested from that single open, we can avoid the
slowdown.
If this makes sense, I can work on this improvement.
_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: RISC-V uprobe bug (Was: Re: WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342)
2023-08-26 18:12 ` Nam Cao
2023-08-26 18:31 ` Nam Cao
@ 2023-08-27 8:11 ` Björn Töpel
2023-08-27 8:35 ` Nam Cao
1 sibling, 1 reply; 31+ messages in thread
From: Björn Töpel @ 2023-08-27 8:11 UTC (permalink / raw)
To: Nam Cao
Cc: linux-riscv, Guo Ren, bpf, Hou Tao, yonghong.song,
Alexei Starovoitov, Puranjay Mohan
Nam,
Nam Cao <namcaov@gmail.com> writes:
> On Sat, Aug 26, 2023 at 03:44:48PM +0200, Björn Töpel wrote:
>> Björn Töpel <bjorn@kernel.org> writes:
>>
>> > I'm chasing a workqueue hang on RISC-V/qemu (TCG), using the bpf
>> > selftests on bpf-next 9e3b47abeb8f.
>> >
>> > I'm able to reproduce the hang by multiple runs of:
>> > | ./test_progs -a link_api -a linked_list
>> > I'm currently investigating that.
>>
>> +Guo for uprobe
>>
>> This was an interesting bug. The hang is an ebreak (RISC-V breakpoint),
>> that puts the kernel into an infinite loop.
>>
>> To reproduce, simply run the BPF selftest:
>> ./test_progs -v -a link_api -a linked_list
>>
>> First the link_api test is being run, which exercises the uprobe
>> functionality. The link_api test completes, and test_progs will still
>> have the uprobe active/enabled. Next the linked_list test triggered a
>> WARN_ON (which is implemented via ebreak as well).
>>
>> Now, handle_break() is entered, and the uprobe_breakpoint_handler()
>> returns true exiting the handle_break(), which returns to the WARN
>> ebreak, and we have merry-go-round.
>>
>> Lucky for the RISC-V folks, the BPF memory handler had a WARN that
>> surfaced the bug! ;-)
>
> Thanks for the analysis.
>
> I couldn't reproduce the problem, so I am just taking a guess here. The problem
> is because uprobes didn't find a probe point at that ebreak instruction. However,
> it also doesn't think an ebreak instruction is there, so it gets confused and just
> returns to the ebreak instruction, and everything repeats.
>
> The reason uprobes didn't think there is an ebreak instruction is that
> is_trap_insn() only returns true for a 32-bit ebreak, or for a 16-bit c.ebreak if
> the C extension is available, but never both. So a 32-bit ebreak is not correctly
> recognized as a trap instruction.
>
> If my guess is correct, the following should fix it. Can you please try if it works?
>
> (this is the first time I send a patch this way, so please let me know if you can't apply)
>
> Best regards,
> Nam
>
> ---
> arch/riscv/kernel/probes/uprobes.c | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/arch/riscv/kernel/probes/uprobes.c b/arch/riscv/kernel/probes/uprobes.c
> index 194f166b2cc4..91f4ce101cd1 100644
> --- a/arch/riscv/kernel/probes/uprobes.c
> +++ b/arch/riscv/kernel/probes/uprobes.c
> @@ -3,6 +3,7 @@
> #include <linux/highmem.h>
> #include <linux/ptrace.h>
> #include <linux/uprobes.h>
> +#include <asm/insn.h>
>
> #include "decode-insn.h"
>
> @@ -17,6 +18,15 @@ bool is_swbp_insn(uprobe_opcode_t *insn)
> #endif
> }
>
> +bool is_trap_insn(uprobe_opcode_t *insn)
> +{
> +#ifdef CONFIG_RISCV_ISA_C
> + if (riscv_insn_is_c_ebreak(*insn))
> + return true;
> +#endif
> + return riscv_insn_is_ebreak(*insn);
> +}
> +
> unsigned long uprobe_get_swbp_addr(struct pt_regs *regs)
> {
> return instruction_pointer(regs);
> --
> 2.34.1
The default implementation of is_trap_insn(), which RISC-V uses, calls
is_swbp_insn(), which already does what your patch does. Your patch does
not address the issue.
We're taking an ebreak trap from kernel space. In this case we should
never look for a userland (uprobe) handler at all, only the kprobe
handlers should be considered.
In this case, TIF_UPROBE is incorrectly set, and then incorrectly (not)
handled in the "common entry" exit path, which takes us into the
infinite loop.
Björn
* Re: RISC-V uprobe bug (Was: Re: WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342)
2023-08-27 8:11 ` Björn Töpel
@ 2023-08-27 8:35 ` Nam Cao
2023-08-27 9:04 ` Björn Töpel
0 siblings, 1 reply; 31+ messages in thread
From: Nam Cao @ 2023-08-27 8:35 UTC (permalink / raw)
To: Björn Töpel
Cc: linux-riscv, Guo Ren, bpf, Hou Tao, yonghong.song,
Alexei Starovoitov, Puranjay Mohan
On Sun, Aug 27, 2023 at 10:11:25AM +0200, Björn Töpel wrote:
> The default implementation of is_trap_insn() which RISC-V is using calls
> is_swbp_insn(), which is doing what your patch does. Your patch does not
> address the issue.
is_swbp_insn() does this:
#ifdef CONFIG_RISCV_ISA_C
return (*insn & 0xffff) == UPROBE_SWBP_INSN;
#else
return *insn == UPROBE_SWBP_INSN;
#endif
...so it doesn't even check for 32-bit ebreak if C extension is on. My patch
is not the same.
But okay, if it doesn't solve the problem, then I must be wrong somewhere.
> We're taking an ebreak trap from kernel space. In this case we should
> never look for a userland (uprobe) handler at all, only the kprobe
> handlers should be considered.
>
> In this case, the TIF_UPROBE is incorrectly set, and incorrectly (not)
> handled in the "common entry" exit path, which takes us to the infinite
> loop.
This change makes a lot of sense; there is no reason to check for uprobes
if the exception comes from the kernel.
Best regards,
Nam
* Re: WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342
2023-08-26 10:49 ` Björn Töpel
@ 2023-08-27 8:37 ` Björn Töpel
2023-08-27 14:53 ` Yonghong Song
0 siblings, 1 reply; 31+ messages in thread
From: Björn Töpel @ 2023-08-27 8:37 UTC (permalink / raw)
To: Hou Tao, bpf; +Cc: linux-riscv, yonghong.song, Alexei Starovoitov
Björn Töpel <bjorn@kernel.org> writes:
> Hou Tao <houtao@huaweicloud.com> writes:
>
>> Hi,
>>
>> On 8/26/2023 5:23 PM, Björn Töpel wrote:
>>> Hou Tao <houtao@huaweicloud.com> writes:
>>>
>>>> Hi,
>>>>
>>>> On 8/25/2023 11:28 PM, Yonghong Song wrote:
>>>>>
>>>>> On 8/25/23 3:32 AM, Björn Töpel wrote:
>>>>>> I'm chasing a workqueue hang on RISC-V/qemu (TCG), using the bpf
>>>>>> selftests on bpf-next 9e3b47abeb8f.
>>>>>>
>>>>>> I'm able to reproduce the hang by multiple runs of:
>>>>>> | ./test_progs -a link_api -a linked_list
>>>>>> I'm currently investigating that.
>>>>>>
>>>>>> But! Sometimes (every blue moon) I get a warn_on_once hit:
>>>>>> | ------------[ cut here ]------------
>>>>>> | WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342
>>>>>> bpf_mem_refill+0x1fc/0x206
>>>>>> | Modules linked in: bpf_testmod(OE)
>>>>>> | CPU: 3 PID: 261 Comm: test_progs-cpuv Tainted: G OE
>>>>>> N 6.5.0-rc5-01743-gdcb152bb8328 #2
>>>>>> | Hardware name: riscv-virtio,qemu (DT)
>>>>>> | epc : bpf_mem_refill+0x1fc/0x206
>>>>>> | ra : irq_work_single+0x68/0x70
>>>>>> | epc : ffffffff801b1bc4 ra : ffffffff8015fe84 sp : ff2000000001be20
>>>>>> | gp : ffffffff82d26138 tp : ff6000008477a800 t0 : 0000000000046600
>>>>>> | t1 : ffffffff812b6ddc t2 : 0000000000000000 s0 : ff2000000001be70
>>>>>> | s1 : ff5ffffffffe8998 a0 : ff5ffffffffe8998 a1 : ff600003fef4b000
>>>>>> | a2 : 000000000000003f a3 : ffffffff80008250 a4 : 0000000000000060
>>>>>> | a5 : 0000000000000080 a6 : 0000000000000000 a7 : 0000000000735049
>>>>>> | s2 : ff5ffffffffe8998 s3 : 0000000000000022 s4 : 0000000000001000
>>>>>> | s5 : 0000000000000007 s6 : ff5ffffffffe8570 s7 : ffffffff82d6bd30
>>>>>> | s8 : 000000000000003f s9 : ffffffff82d2c5e8 s10: 000000000000ffff
>>>>>> | s11: ffffffff82d2c5d8 t3 : ffffffff81ea8f28 t4 : 0000000000000000
>>>>>> | t5 : ff6000008fd28278 t6 : 0000000000040000
>>>>>> | status: 0000000200000100 badaddr: 0000000000000000 cause:
>>>>>> 0000000000000003
>>>>>> | [<ffffffff801b1bc4>] bpf_mem_refill+0x1fc/0x206
>>>>>> | [<ffffffff8015fe84>] irq_work_single+0x68/0x70
>>>>>> | [<ffffffff8015feb4>] irq_work_run_list+0x28/0x36
>>>>>> | [<ffffffff8015fefa>] irq_work_run+0x38/0x66
>>>>>> | [<ffffffff8000828a>] handle_IPI+0x3a/0xb4
>>>>>> | [<ffffffff800a5c3a>] handle_percpu_devid_irq+0xa4/0x1f8
>>>>>> | [<ffffffff8009fafa>] generic_handle_domain_irq+0x28/0x36
>>>>>> | [<ffffffff800ae570>] ipi_mux_process+0xac/0xfa
>>>>>> | [<ffffffff8000a8ea>] sbi_ipi_handle+0x2e/0x88
>>>>>> | [<ffffffff8009fafa>] generic_handle_domain_irq+0x28/0x36
>>>>>> | [<ffffffff807ee70e>] riscv_intc_irq+0x36/0x4e
>>>>>> | [<ffffffff812b5d3a>] handle_riscv_irq+0x54/0x86
>>>>>> | [<ffffffff812b6904>] do_irq+0x66/0x98
>>>>>> | ---[ end trace 0000000000000000 ]---
>>>>>>
>>>>>> Code:
>>>>>> | static void free_bulk(struct bpf_mem_cache *c)
>>>>>> | {
>>>>>> | struct bpf_mem_cache *tgt = c->tgt;
>>>>>> | struct llist_node *llnode, *t;
>>>>>> | unsigned long flags;
>>>>>> | int cnt;
>>>>>> |
>>>>>> | WARN_ON_ONCE(tgt->unit_size != c->unit_size);
>>>>>> | ...
>>>>>>
>>>>>> I'm not well versed in the memory allocator; Before I dive into it --
>>>>>> has anyone else hit it? Ideas on why the warn_on_once is hit?
>>>>> Maybe take a look at the patch
>>>>> 822fb26bdb55 bpf: Add a hint to allocated objects.
>>>>>
>>>>> In the above patch, we have
>>>>>
>>>>> + /*
>>>>> + * Remember bpf_mem_cache that allocated this object.
>>>>> + * The hint is not accurate.
>>>>> + */
>>>>> + c->tgt = *(struct bpf_mem_cache **)llnode;
>>>>>
>>>>> I suspect that the warning may be related to the above.
>>>>> I tried the above ./test_progs command line (running multiple
>>>>> at the same time) and didn't trigger the issue.
>>>> The extra 8-bytes before the freed pointer is used to save the pointer
>>>> of the original bpf memory allocator where the freed pointer came from,
>>>> so unit_free() could free the pointer back to the original allocator to
>>>> prevent alloc-and-free unbalance.
>>>>
>>>> I suspect that a wrong pointer was passed to bpf_obj_drop, but do not
>>>> find anything suspicious after checking linked_list. Another possibility
>>>> is that there is write-after-free problem which corrupts the extra
>>>> 8-bytes before the freed pointer. Could you please apply the following
>>>> debug patch to check whether or not the extra 8-bytes are corrupted ?
>>> Thanks for getting back!
>>>
>>> I took your patch for a run, and there's a hit:
>>> | bad cache ff5ffffffffe8570: got size 96 work ffffffff801b19c8, cache ff5ffffffffe8980 exp size 128 work ffffffff801b19c8
>>
>> The extra 8 bytes are not corrupted. Both of these bpf_mem_cache pointers
>> are valid, and both are in the cache array defined in bpf_mem_caches. The
>> BPF memory allocator allocated the pointer from the 96-byte sized cache,
>> but it tried to free the pointer through the 128-byte sized cache.
>>
>> Now I suspect there is no 96-byte slab in your system, so ksize(ptr -
>> LLIST_NODE_SZ) returns 128 and the BPF memory allocator selected the
>> 128-byte sized cache instead of the 96-byte one. Could you please
>> check the value of KMALLOC_MIN_SIZE in your kernel .config and use the
>> following command to check whether there is a 96-byte slab in your system:
>
> KMALLOC_MIN_SIZE is 64.
>
>> $ cat /proc/slabinfo |grep kmalloc-96
>> dma-kmalloc-96 0 0 96 42 1 : tunables 0 0
>> 0 : slabdata 0 0 0
>> kmalloc-96 1865 2268 96 42 1 : tunables 0 0
>> 0 : slabdata 54 54 0
>>
>> On my system, slab has a 96-byte cache, so grep outputs something, but I
>> think there will be no output on your system.
>
> You're right! No kmalloc-96.
To get rid of the warning, limit available sizes from
bpf_mem_alloc_init()?
Björn
* Re: RISC-V uprobe bug (Was: Re: WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342)
2023-08-27 8:35 ` Nam Cao
@ 2023-08-27 9:04 ` Björn Töpel
2023-08-27 9:39 ` Nam Cao
0 siblings, 1 reply; 31+ messages in thread
From: Björn Töpel @ 2023-08-27 9:04 UTC (permalink / raw)
To: Nam Cao
Cc: linux-riscv, Guo Ren, bpf, Hou Tao, yonghong.song,
Alexei Starovoitov, Puranjay Mohan
Nam Cao <namcaov@gmail.com> writes:
> On Sun, Aug 27, 2023 at 10:11:25AM +0200, Björn Töpel wrote:
>> The default implementation of is_trap_insn() which RISC-V is using calls
>> is_swbp_insn(), which is doing what your patch does. Your patch does not
>> address the issue.
>
> is_swbp_insn() does this:
>
> #ifdef CONFIG_RISCV_ISA_C
> return (*insn & 0xffff) == UPROBE_SWBP_INSN;
> #else
> return *insn == UPROBE_SWBP_INSN;
> #endif
>
> ...so it doesn't even check for 32-bit ebreak if C extension is on. My patch
> is not the same.
Ah, was too quick.
AFAIU uprobes *always* uses c.ebreak when CONFIG_RISCV_ISA_C is set, and
ebreak otherwise. That's the reason is_swbp_insn() is implemented like
that. If that's not the case, there's another bug, which your patch
addresses.
In that case we need an arch-specific set_swbp() implementation together
with your patch.
Guo, thoughts? ...text patching on RISC-V is still a big WIP.
> But okay, if it doesn't solve the problem, then I must be wrong somewhere.
Yes, this bug is another issue.
>> We're taking an ebreak trap from kernel space. In this case we should
>> never look for a userland (uprobe) handler at all, only the kprobe
>> handlers should be considered.
>>
>> In this case, the TIF_UPROBE is incorrectly set, and incorrectly (not)
>> handled in the "common entry" exit path, which takes us to the infinite
>> loop.
>
> This change makes a lot of sense, no reason to check for uprobes if exception
> comes from the kernel.
Ok! I sent a patch proper for this.
Björn
* Re: RISC-V uprobe bug (Was: Re: WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342)
2023-08-27 9:04 ` Björn Töpel
@ 2023-08-27 9:39 ` Nam Cao
2023-08-27 19:20 ` Björn Töpel
0 siblings, 1 reply; 31+ messages in thread
From: Nam Cao @ 2023-08-27 9:39 UTC (permalink / raw)
To: Björn Töpel
Cc: linux-riscv, Guo Ren, bpf, Hou Tao, yonghong.song,
Alexei Starovoitov, Puranjay Mohan
On Sun, Aug 27, 2023 at 11:04:34AM +0200, Björn Töpel wrote:
> Nam Cao <namcaov@gmail.com> writes:
>
> > On Sun, Aug 27, 2023 at 10:11:25AM +0200, Björn Töpel wrote:
> >> The default implementation of is_trap_insn() which RISC-V is using calls
> >> is_swbp_insn(), which is doing what your patch does. Your patch does not
> >> address the issue.
> >
> > is_swbp_insn() does this:
> >
> > #ifdef CONFIG_RISCV_ISA_C
> > return (*insn & 0xffff) == UPROBE_SWBP_INSN;
> > #else
> > return *insn == UPROBE_SWBP_INSN;
> > #endif
> >
> > ...so it doesn't even check for 32-bit ebreak if C extension is on. My patch
> > is not the same.
>
> Ah, was too quick.
>
> AFAIU uprobes *always* uses c.ebreak when CONFIG_RISCV_ISA_C is set, and
> ebreak otherwise. That's the reason is_swbp_insn() is implemented like
> that.
That's what I understand too.
> If that's not the case, there's a another bug, that your patches
> addresses.
I think it's a bug regardless. is_trap_insn() is used by uprobes to figure out
whether an instruction generates a trap exception, not just whether it is an
"SWBP" instruction. The reason is that when a trap occurs but uprobes doesn't
see a probe installed there, it needs is_trap_insn() to figure out whether the
trap was generated by an ebreak from something else, or whether the probe was
just removed. In the latter case, uprobes will simply return, because the probe
has already been removed, so that should be safe. That's why I think the
incorrect is_swbp_insn() would cause a hang: uprobes incorrectly thinks there
is no ebreak there, so it assumes it is okay to go back, but there actually is
one.
So, from my understanding, if uprobes encounters a 32-bit ebreak for any reason,
the kernel will hang. I think your patch is a great addition nonetheless, but I
am guessing that it only masks the problem by preventing uprobes from seeing the
32-bit ebreak in this specific test, not really solving it. So, if there is a
32-bit ebreak in userspace, the bug still causes the kernel to hang.
I am still quite confident of my logic, so I would be very surprised if my fix
doesn't solve the reported hang. Do you mind testing my patch? My potato of a
laptop unfortunately cannot run the test :(
Best regards,
Nam
* Re: WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342
2023-08-27 8:37 ` Björn Töpel
@ 2023-08-27 14:53 ` Yonghong Song
2023-08-28 13:57 ` Hou Tao
0 siblings, 1 reply; 31+ messages in thread
From: Yonghong Song @ 2023-08-27 14:53 UTC (permalink / raw)
To: Björn Töpel, Hou Tao, bpf; +Cc: linux-riscv, Alexei Starovoitov
On 8/27/23 1:37 AM, Björn Töpel wrote:
> Björn Töpel <bjorn@kernel.org> writes:
>
>> Hou Tao <houtao@huaweicloud.com> writes:
>>
>>> Hi,
>>>
>>> On 8/26/2023 5:23 PM, Björn Töpel wrote:
>>>> Hou Tao <houtao@huaweicloud.com> writes:
>>>>
>>>>> Hi,
>>>>>
>>>>> On 8/25/2023 11:28 PM, Yonghong Song wrote:
>>>>>>
>>>>>> On 8/25/23 3:32 AM, Björn Töpel wrote:
>>>>>>> I'm chasing a workqueue hang on RISC-V/qemu (TCG), using the bpf
>>>>>>> selftests on bpf-next 9e3b47abeb8f.
>>>>>>>
>>>>>>> I'm able to reproduce the hang by multiple runs of:
>>>>>>> | ./test_progs -a link_api -a linked_list
>>>>>>> I'm currently investigating that.
>>>>>>>
>>>>>>> But! Sometimes (every blue moon) I get a warn_on_once hit:
>>>>>>> | ------------[ cut here ]------------
>>>>>>> | WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342
>>>>>>> bpf_mem_refill+0x1fc/0x206
>>>>>>> | Modules linked in: bpf_testmod(OE)
>>>>>>> | CPU: 3 PID: 261 Comm: test_progs-cpuv Tainted: G OE
>>>>>>> N 6.5.0-rc5-01743-gdcb152bb8328 #2
>>>>>>> | Hardware name: riscv-virtio,qemu (DT)
>>>>>>> | epc : bpf_mem_refill+0x1fc/0x206
>>>>>>> | ra : irq_work_single+0x68/0x70
>>>>>>> | epc : ffffffff801b1bc4 ra : ffffffff8015fe84 sp : ff2000000001be20
>>>>>>> | gp : ffffffff82d26138 tp : ff6000008477a800 t0 : 0000000000046600
>>>>>>> | t1 : ffffffff812b6ddc t2 : 0000000000000000 s0 : ff2000000001be70
>>>>>>> | s1 : ff5ffffffffe8998 a0 : ff5ffffffffe8998 a1 : ff600003fef4b000
>>>>>>> | a2 : 000000000000003f a3 : ffffffff80008250 a4 : 0000000000000060
>>>>>>> | a5 : 0000000000000080 a6 : 0000000000000000 a7 : 0000000000735049
>>>>>>> | s2 : ff5ffffffffe8998 s3 : 0000000000000022 s4 : 0000000000001000
>>>>>>> | s5 : 0000000000000007 s6 : ff5ffffffffe8570 s7 : ffffffff82d6bd30
>>>>>>> | s8 : 000000000000003f s9 : ffffffff82d2c5e8 s10: 000000000000ffff
>>>>>>> | s11: ffffffff82d2c5d8 t3 : ffffffff81ea8f28 t4 : 0000000000000000
>>>>>>> | t5 : ff6000008fd28278 t6 : 0000000000040000
>>>>>>> | status: 0000000200000100 badaddr: 0000000000000000 cause:
>>>>>>> 0000000000000003
>>>>>>> | [<ffffffff801b1bc4>] bpf_mem_refill+0x1fc/0x206
>>>>>>> | [<ffffffff8015fe84>] irq_work_single+0x68/0x70
>>>>>>> | [<ffffffff8015feb4>] irq_work_run_list+0x28/0x36
>>>>>>> | [<ffffffff8015fefa>] irq_work_run+0x38/0x66
>>>>>>> | [<ffffffff8000828a>] handle_IPI+0x3a/0xb4
>>>>>>> | [<ffffffff800a5c3a>] handle_percpu_devid_irq+0xa4/0x1f8
>>>>>>> | [<ffffffff8009fafa>] generic_handle_domain_irq+0x28/0x36
>>>>>>> | [<ffffffff800ae570>] ipi_mux_process+0xac/0xfa
>>>>>>> | [<ffffffff8000a8ea>] sbi_ipi_handle+0x2e/0x88
>>>>>>> | [<ffffffff8009fafa>] generic_handle_domain_irq+0x28/0x36
>>>>>>> | [<ffffffff807ee70e>] riscv_intc_irq+0x36/0x4e
>>>>>>> | [<ffffffff812b5d3a>] handle_riscv_irq+0x54/0x86
>>>>>>> | [<ffffffff812b6904>] do_irq+0x66/0x98
>>>>>>> | ---[ end trace 0000000000000000 ]---
>>>>>>>
>>>>>>> Code:
>>>>>>> | static void free_bulk(struct bpf_mem_cache *c)
>>>>>>> | {
>>>>>>> | struct bpf_mem_cache *tgt = c->tgt;
>>>>>>> | struct llist_node *llnode, *t;
>>>>>>> | unsigned long flags;
>>>>>>> | int cnt;
>>>>>>> |
>>>>>>> | WARN_ON_ONCE(tgt->unit_size != c->unit_size);
>>>>>>> | ...
>>>>>>>
>>>>>>> I'm not well versed in the memory allocator; Before I dive into it --
>>>>>>> has anyone else hit it? Ideas on why the warn_on_once is hit?
>>>>>> Maybe take a look at the patch
>>>>>> 822fb26bdb55 bpf: Add a hint to allocated objects.
>>>>>>
>>>>>> In the above patch, we have
>>>>>>
>>>>>> + /*
>>>>>> + * Remember bpf_mem_cache that allocated this object.
>>>>>> + * The hint is not accurate.
>>>>>> + */
>>>>>> + c->tgt = *(struct bpf_mem_cache **)llnode;
>>>>>>
>>>>>> I suspect that the warning may be related to the above.
>>>>>> I tried the above ./test_progs command line (running multiple
>>>>>> at the same time) and didn't trigger the issue.
>>>>> The extra 8-bytes before the freed pointer is used to save the pointer
>>>>> of the original bpf memory allocator where the freed pointer came from,
>>>>> so unit_free() could free the pointer back to the original allocator to
>>>>> prevent alloc-and-free unbalance.
>>>>>
>>>>> I suspect that a wrong pointer was passed to bpf_obj_drop, but do not
>>>>> find anything suspicious after checking linked_list. Another possibility
>>>>> is that there is write-after-free problem which corrupts the extra
>>>>> 8-bytes before the freed pointer. Could you please apply the following
>>>>> debug patch to check whether or not the extra 8-bytes are corrupted ?
>>>> Thanks for getting back!
>>>>
>>>> I took your patch for a run, and there's a hit:
>>>> | bad cache ff5ffffffffe8570: got size 96 work ffffffff801b19c8, cache ff5ffffffffe8980 exp size 128 work ffffffff801b19c8
>>>
>>> The extra 8-bytes are not corrupted. Both of these two bpf_mem_cache are
>>> valid and there are in the cache array defined in bpf_mem_caches. BPF
>>> memory allocator allocated the pointer from 96-bytes sized-cache, but it
>>> tried to free the pointer through 128-bytes sized-cache.
>>>
>>> Now I suspect there is no 96-bytes slab in your system and ksize(ptr -
>>> LLIST_NODE_SZ) returns 128, so bpf memory allocator selected the
>>> 128-byte sized-cache instead of 96-bytes sized-cache. Could you please
>>> check the value of KMALLOC_MIN_SIZE in your kernel .config and using the
>>> following command to check whether there is 96-bytes slab in your system:
>>
>> KMALLOC_MIN_SIZE is 64.
>>
>>> $ cat /proc/slabinfo |grep kmalloc-96
>>> dma-kmalloc-96 0 0 96 42 1 : tunables 0 0
>>> 0 : slabdata 0 0 0
>>> kmalloc-96 1865 2268 96 42 1 : tunables 0 0
>>> 0 : slabdata 54 54 0
>>>
>>> In my system, slab has 96-bytes cached, so grep outputs something, but I
>>> think there will no output in your system.
>>
>> You're right! No kmalloc-96.
>
> To get rid of the warning, limit available sizes from
> bpf_mem_alloc_init()?
Do you know why your system does not have kmalloc-96?
>
>
> Björn
* Re: RISC-V uprobe bug (Was: Re: WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342)
2023-08-27 9:39 ` Nam Cao
@ 2023-08-27 19:20 ` Björn Töpel
2023-08-27 19:41 ` Nam Cao
2023-08-27 20:15 ` Nam Cao
0 siblings, 2 replies; 31+ messages in thread
From: Björn Töpel @ 2023-08-27 19:20 UTC (permalink / raw)
To: Nam Cao
Cc: linux-riscv, Guo Ren, bpf, Hou Tao, yonghong.song,
Alexei Starovoitov, Puranjay Mohan
Nam Cao <namcaov@gmail.com> writes:
> On Sun, Aug 27, 2023 at 11:04:34AM +0200, Björn Töpel wrote:
>> Nam Cao <namcaov@gmail.com> writes:
>>
>> > On Sun, Aug 27, 2023 at 10:11:25AM +0200, Björn Töpel wrote:
>> >> The default implementation of is_trap_insn() which RISC-V is using calls
>> >> is_swbp_insn(), which is doing what your patch does. Your patch does not
>> >> address the issue.
>> >
>> > is_swbp_insn() does this:
>> >
>> > #ifdef CONFIG_RISCV_ISA_C
>> > return (*insn & 0xffff) == UPROBE_SWBP_INSN;
>> > #else
>> > return *insn == UPROBE_SWBP_INSN;
>> > #endif
>> >
>> > ...so it doesn't even check for 32-bit ebreak if C extension is on. My patch
>> > is not the same.
>>
>> Ah, was too quick.
>>
>> AFAIU uprobes *always* uses c.ebreak when CONFIG_RISCV_ISA_C is set, and
>> ebreak otherwise. That's the reason is_swbp_insn() is implemented like
>> that.
>
> That's what I understand too.
>
>> If that's not the case, there's a another bug, that your patches
>> addresses.
>
> I think it's a bug regardless. is_trap_insn() is used by uprobes to figure out
> if there is an instruction that generates trap exception, not just instructions
> that are "SWBP". The reason is because when there is a trap, but uprobes doesn't
> see a probe installed here, it needs is_trap_insn() to figure out if the trap
> is generated by ebreak from something else, or because the probe is just removed.
> In the latter case, uprobes will return back, because probe has already been removed,
> so it should be safe to do so. That's why I think the incorrect is_swbp_insn()
> would cause a hang, because uprobes incorrectly thinks there is no ebreak there,
> so it should be okay to go back, but there actually is.
>
> So, from my understanding, if uprobes encounter a 32-bit ebreak for any reason,
> the kernel would hang. I think your patch is a great addition nonetheless, but I
> am guessing that it only masks the problem by preventing uprobes from seeing the
> 32-bit ebreak in the specific test, not really solve it. So, if there is a 32-bit
> ebreak in userspace, the bug still causes the kernel to hang.
>
> I am still quite confident of my logic, so I would be very suprised if my fix
> doesn't solve the reported hang. Do you mind testing my patch? My potato of a
> laptop unfortunately cannot run the test :(
Maybe I wasn't clear, sorry for that! I did take the patch for a spin,
and it did not solve this particular problem.
When we're taking a trap from *kernel* mode, we should never deal with
uprobes at all. Have a look at uprobe_pre_sstep_notifier(); this
function returns 1, which means that the trap handler exits
prematurely.
The code you're referring to is called from uprobe_notify_resume(), and
will never be entered, because we're not exiting the trap to
userland. Have a look in kernel/entry/common.c (search for
e.g. TIF_UPROBE).
Now, for your concern, which I see as a potentially different bug, not at
all related to my issue of "trap from kernel mode touches uprobe
incorrectly": a "random" ebreak from *userland* is trapped, and when
uprobes is enabled this sets the kernel into a hang. I suggest you try to
write a simple program to reproduce this!
I had a quick look in the uprobe handling code, and AFAIU is_trap_insn()
is used as an additional check when installing the uprobe, and when
searching for an active uprobe. I'm still a bit puzzled how the issue
you're describing could trigger. A reproducer would help!
Cheers,
Björn
* Re: RISC-V uprobe bug (Was: Re: WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342)
2023-08-27 19:20 ` Björn Töpel
@ 2023-08-27 19:41 ` Nam Cao
2023-08-27 20:15 ` Nam Cao
1 sibling, 0 replies; 31+ messages in thread
From: Nam Cao @ 2023-08-27 19:41 UTC (permalink / raw)
To: Björn Töpel
Cc: linux-riscv, Guo Ren, bpf, Hou Tao, yonghong.song,
Alexei Starovoitov, Puranjay Mohan
On Sun, Aug 27, 2023 at 09:20:44PM +0200, Björn Töpel wrote:
> Nam Cao <namcaov@gmail.com> writes:
>
> > On Sun, Aug 27, 2023 at 11:04:34AM +0200, Björn Töpel wrote:
> >> Nam Cao <namcaov@gmail.com> writes:
> >>
> >> > On Sun, Aug 27, 2023 at 10:11:25AM +0200, Björn Töpel wrote:
> >> >> The default implementation of is_trap_insn() which RISC-V is using calls
> >> >> is_swbp_insn(), which is doing what your patch does. Your patch does not
> >> >> address the issue.
> >> >
> >> > is_swbp_insn() does this:
> >> >
> >> > #ifdef CONFIG_RISCV_ISA_C
> >> > return (*insn & 0xffff) == UPROBE_SWBP_INSN;
> >> > #else
> >> > return *insn == UPROBE_SWBP_INSN;
> >> > #endif
> >> >
> >> > ...so it doesn't even check for 32-bit ebreak if C extension is on. My patch
> >> > is not the same.
> >>
> >> Ah, was too quick.
> >>
> >> AFAIU uprobes *always* uses c.ebreak when CONFIG_RISCV_ISA_C is set, and
> >> ebreak otherwise. That's the reason is_swbp_insn() is implemented like
> >> that.
> >
> > That's what I understand too.
> >
> >> If that's not the case, there's another bug that your patch
> >> addresses.
> >
> > I think it's a bug regardless. is_trap_insn() is used by uprobes to figure out
> > if there is an instruction that generates trap exception, not just instructions
> > that are "SWBP". The reason is because when there is a trap, but uprobes doesn't
> > see a probe installed here, it needs is_trap_insn() to figure out if the trap
> > is generated by ebreak from something else, or because the probe is just removed.
> > In the latter case, uprobes will return back, because probe has already been removed,
> > so it should be safe to do so. That's why I think the incorrect is_swbp_insn()
> > would cause a hang, because uprobes incorrectly thinks there is no ebreak there,
> > so it should be okay to go back, but there actually is.
> >
> > So, from my understanding, if uprobes encounter a 32-bit ebreak for any reason,
> > the kernel would hang. I think your patch is a great addition nonetheless, but I
> > am guessing that it only masks the problem by preventing uprobes from seeing the
> > 32-bit ebreak in the specific test, not really solving it. So, if there is a 32-bit
> > ebreak in userspace, the bug still causes the kernel to hang.
> >
> > I am still quite confident of my logic, so I would be very surprised if my fix
> > doesn't solve the reported hang. Do you mind testing my patch? My potato of a
> > laptop unfortunately cannot run the test :(
>
> Maybe I wasn't clear, sorry for that! I did take the patch for a spin,
> and it did not solve this particular problem.
Okay, thanks for the confirmation!
> When we're taking a trap from *kernel* mode, we should never deal with
> uprobes at all. Have a look at uprobe_pre_sstep_notifier(): this
> function returns 1, which means that the trap handler exits
> prematurely.
>
> The code you're referring to (called from uprobe_notify_resume()) will
> never be entered, because we're not exiting the trap to userland. Have
> a look in kernel/entry/common.c (search for e.g. TIF_UPROBE).
I will think about this part a bit and answer later. Let me reply to the
part below first.
> Now, to your concern, which I see as a potentially different bug, not at
> all related to my issue ("trap from kernel mode touches uprobes
> incorrectly"): a "random" ebreak from *userland* that is trapped while
> uprobes are enabled will hang the kernel. I suggest you try to write a
> simple program to reproduce this!
>
> I had a quick look in the uprobe handling code, and AFAIU is_trap_insn()
> is used when installing the uprobe as an additional check, and when
> searching for an active uprobe. I'm still a bit puzzled how the issue
> you're describing could trigger. A reproducer would help!
I have just reproduced the problem, using this small program:
.global _start
_start:
addi x0, x1, 0
addi x0, x1, 1
addi x0, x1, 2
.option push
.option arch, -c
ebreak
.option pop
ecall
Compile that with
riscv64-linux-gnu-gcc test.s -nostdlib -static -o ebreak
And setup uprobes by:
mount -t tracefs nodev /sys/kernel/tracing/
echo "p /ebreak:0x0000010c" > /sys/kernel/tracing/uprobe_events
echo 1 > /sys/kernel/tracing/events/uprobes/enable
(obviously you would have to adjust the offset value to match the _start
symbol of your binary)
Then I execute the program, and the kernel loops indefinitely (it keeps
going in and out of the exception handler).
With my patch applied, the kernel no longer loops.
So I think it is a valid issue, and I will send a proper patch to fix this.
Best regards,
Nam
* Re: RISC-V uprobe bug (Was: Re: WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342)
2023-08-27 19:20 ` Björn Töpel
2023-08-27 19:41 ` Nam Cao
@ 2023-08-27 20:15 ` Nam Cao
1 sibling, 0 replies; 31+ messages in thread
From: Nam Cao @ 2023-08-27 20:15 UTC (permalink / raw)
To: Björn Töpel
Cc: linux-riscv, Guo Ren, bpf, Hou Tao, yonghong.song,
Alexei Starovoitov, Puranjay Mohan
On Sun, Aug 27, 2023 at 09:20:44PM +0200, Björn Töpel wrote:
> Nam Cao <namcaov@gmail.com> writes:
>
> > On Sun, Aug 27, 2023 at 11:04:34AM +0200, Björn Töpel wrote:
> >> Nam Cao <namcaov@gmail.com> writes:
> >>
> >> > On Sun, Aug 27, 2023 at 10:11:25AM +0200, Björn Töpel wrote:
> >> >> The default implementation of is_trap_insn() which RISC-V is using calls
> >> >> is_swbp_insn(), which is doing what your patch does. Your patch does not
> >> >> address the issue.
> >> >
> >> > is_swbp_insn() does this:
> >> >
> >> > #ifdef CONFIG_RISCV_ISA_C
> >> > return (*insn & 0xffff) == UPROBE_SWBP_INSN;
> >> > #else
> >> > return *insn == UPROBE_SWBP_INSN;
> >> > #endif
> >> >
> >> > ...so it doesn't even check for 32-bit ebreak if C extension is on. My patch
> >> > is not the same.
> >>
> >> Ah, was too quick.
> >>
> >> AFAIU uprobes *always* uses c.ebreak when CONFIG_RISCV_ISA_C is set, and
> >> ebreak otherwise. That's the reason is_swbp_insn() is implemented like
> >> that.
> >
> > That's what I understand too.
> >
> >> If that's not the case, there's another bug that your patch
> >> addresses.
> >
> > I think it's a bug regardless. is_trap_insn() is used by uprobes to figure out
> > if there is an instruction that generates trap exception, not just instructions
> > that are "SWBP". The reason is because when there is a trap, but uprobes doesn't
> > see a probe installed here, it needs is_trap_insn() to figure out if the trap
> > is generated by ebreak from something else, or because the probe is just removed.
> > In the latter case, uprobes will return back, because probe has already been removed,
> > so it should be safe to do so. That's why I think the incorrect is_swbp_insn()
> > would cause a hang, because uprobes incorrectly thinks there is no ebreak there,
> > so it should be okay to go back, but there actually is.
> >
> > So, from my understanding, if uprobes encounter a 32-bit ebreak for any reason,
> > the kernel would hang. I think your patch is a great addition nonetheless, but I
> > am guessing that it only masks the problem by preventing uprobes from seeing the
> > 32-bit ebreak in the specific test, not really solving it. So, if there is a 32-bit
> > ebreak in userspace, the bug still causes the kernel to hang.
> >
> > I am still quite confident of my logic, so I would be very surprised if my fix
> > doesn't solve the reported hang. Do you mind testing my patch? My potato of a
> > laptop unfortunately cannot run the test :(
>
> Maybe I wasn't clear, sorry for that! I did take the patch for a spin,
> and it did not solve this particular problem.
>
> When we're taking a trap from *kernel* mode, we should never deal with
> uprobes at all. Have a look at uprobe_pre_sstep_notifier(): this
> function returns 1, which means that the trap handler exits
> prematurely.
>
> The code you're referring to (called from uprobe_notify_resume()) will
> never be entered, because we're not exiting the trap to userland. Have
> a look in kernel/entry/common.c (search for e.g. TIF_UPROBE).
Ah right, uprobe_notify_resume() is not called if we do not return to
user space. My bad, I thought it was called. Thanks for the discussion;
now I can see why my patch is irrelevant, and your patch is the correct
fix for the reported problem.
Best regards,
Nam
* Re: WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342
2023-08-27 14:53 ` Yonghong Song
@ 2023-08-28 13:57 ` Hou Tao
2023-08-29 0:54 ` Yonghong Song
` (2 more replies)
0 siblings, 3 replies; 31+ messages in thread
From: Hou Tao @ 2023-08-28 13:57 UTC (permalink / raw)
To: yonghong.song, Björn Töpel, bpf; +Cc: linux-riscv, Alexei Starovoitov
[-- Attachment #1: Type: text/plain, Size: 7315 bytes --]
Hi,
On 8/27/2023 10:53 PM, Yonghong Song wrote:
>
>
> On 8/27/23 1:37 AM, Björn Töpel wrote:
>> Björn Töpel <bjorn@kernel.org> writes:
>>
>>> Hou Tao <houtao@huaweicloud.com> writes:
>>>
>>>> Hi,
>>>>
>>>> On 8/26/2023 5:23 PM, Björn Töpel wrote:
>>>>> Hou Tao <houtao@huaweicloud.com> writes:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> On 8/25/2023 11:28 PM, Yonghong Song wrote:
>>>>>>>
>>>>>>> On 8/25/23 3:32 AM, Björn Töpel wrote:
>>>>>>>> I'm chasing a workqueue hang on RISC-V/qemu (TCG), using the bpf
>>>>>>>> selftests on bpf-next 9e3b47abeb8f.
>>>>>>>>
>>>>>>>> I'm able to reproduce the hang by multiple runs of:
>>>>>>>> | ./test_progs -a link_api -a linked_list
>>>>>>>> I'm currently investigating that.
>>>>>>>>
>>>>>>>> But! Sometimes (every blue moon) I get a warn_on_once hit:
>>>>>>>> | ------------[ cut here ]------------
>>>>>>>> | WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342
>>>>>>>> bpf_mem_refill+0x1fc/0x206
>>>>>>>> | Modules linked in: bpf_testmod(OE)
>>>>>>>> | CPU: 3 PID: 261 Comm: test_progs-cpuv Tainted: G OE
>>>>>>>> N 6.5.0-rc5-01743-gdcb152bb8328 #2
>>>>>>>> | Hardware name: riscv-virtio,qemu (DT)
>>>>>>>> | epc : bpf_mem_refill+0x1fc/0x206
>>>>>>>> | ra : irq_work_single+0x68/0x70
>>>>>>>> | epc : ffffffff801b1bc4 ra : ffffffff8015fe84 sp :
>>>>>>>> ff2000000001be20
>>>>>>>> | gp : ffffffff82d26138 tp : ff6000008477a800 t0 :
>>>>>>>> 0000000000046600
>>>>>>>> | t1 : ffffffff812b6ddc t2 : 0000000000000000 s0 :
>>>>>>>> ff2000000001be70
>>>>>>>> | s1 : ff5ffffffffe8998 a0 : ff5ffffffffe8998 a1 :
>>>>>>>> ff600003fef4b000
>>>>>>>> | a2 : 000000000000003f a3 : ffffffff80008250 a4 :
>>>>>>>> 0000000000000060
>>>>>>>> | a5 : 0000000000000080 a6 : 0000000000000000 a7 :
>>>>>>>> 0000000000735049
>>>>>>>> | s2 : ff5ffffffffe8998 s3 : 0000000000000022 s4 :
>>>>>>>> 0000000000001000
>>>>>>>> | s5 : 0000000000000007 s6 : ff5ffffffffe8570 s7 :
>>>>>>>> ffffffff82d6bd30
>>>>>>>> | s8 : 000000000000003f s9 : ffffffff82d2c5e8 s10:
>>>>>>>> 000000000000ffff
>>>>>>>> | s11: ffffffff82d2c5d8 t3 : ffffffff81ea8f28 t4 :
>>>>>>>> 0000000000000000
>>>>>>>> | t5 : ff6000008fd28278 t6 : 0000000000040000
>>>>>>>> | status: 0000000200000100 badaddr: 0000000000000000 cause:
>>>>>>>> 0000000000000003
>>>>>>>> | [<ffffffff801b1bc4>] bpf_mem_refill+0x1fc/0x206
>>>>>>>> | [<ffffffff8015fe84>] irq_work_single+0x68/0x70
>>>>>>>> | [<ffffffff8015feb4>] irq_work_run_list+0x28/0x36
>>>>>>>> | [<ffffffff8015fefa>] irq_work_run+0x38/0x66
>>>>>>>> | [<ffffffff8000828a>] handle_IPI+0x3a/0xb4
>>>>>>>> | [<ffffffff800a5c3a>] handle_percpu_devid_irq+0xa4/0x1f8
>>>>>>>> | [<ffffffff8009fafa>] generic_handle_domain_irq+0x28/0x36
>>>>>>>> | [<ffffffff800ae570>] ipi_mux_process+0xac/0xfa
>>>>>>>> | [<ffffffff8000a8ea>] sbi_ipi_handle+0x2e/0x88
>>>>>>>> | [<ffffffff8009fafa>] generic_handle_domain_irq+0x28/0x36
>>>>>>>> | [<ffffffff807ee70e>] riscv_intc_irq+0x36/0x4e
>>>>>>>> | [<ffffffff812b5d3a>] handle_riscv_irq+0x54/0x86
>>>>>>>> | [<ffffffff812b6904>] do_irq+0x66/0x98
>>>>>>>> | ---[ end trace 0000000000000000 ]---
>>>>>>>>
>>>>>>>> Code:
>>>>>>>> | static void free_bulk(struct bpf_mem_cache *c)
>>>>>>>> | {
>>>>>>>> | struct bpf_mem_cache *tgt = c->tgt;
>>>>>>>> | struct llist_node *llnode, *t;
>>>>>>>> | unsigned long flags;
>>>>>>>> | int cnt;
>>>>>>>> |
>>>>>>>> | WARN_ON_ONCE(tgt->unit_size != c->unit_size);
>>>>>>>> | ...
>>>>>>>>
>>>>>>>> I'm not well versed in the memory allocator; Before I dive into
>>>>>>>> it --
>>>>>>>> has anyone else hit it? Ideas on why the warn_on_once is hit?
>>>>>>> Maybe take a look at the patch
>>>>>>> 822fb26bdb55 bpf: Add a hint to allocated objects.
>>>>>>>
>>>>>>> In the above patch, we have
>>>>>>>
>>>>>>> + /*
>>>>>>> + * Remember bpf_mem_cache that allocated this object.
>>>>>>> + * The hint is not accurate.
>>>>>>> + */
>>>>>>> + c->tgt = *(struct bpf_mem_cache **)llnode;
>>>>>>>
>>>>>>> I suspect that the warning may be related to the above.
>>>>>>> I tried the above ./test_progs command line (running multiple
>>>>>>> at the same time) and didn't trigger the issue.
>>>>>> The extra 8-bytes before the freed pointer is used to save the
>>>>>> pointer
>>>>>> of the original bpf memory allocator where the freed pointer came
>>>>>> from,
>>>>>> so unit_free() could free the pointer back to the original
>>>>>> allocator to
>>>>>> prevent alloc-and-free unbalance.
>>>>>>
>>>>>> I suspect that a wrong pointer was passed to bpf_obj_drop, but do
>>>>>> not
>>>>>> find anything suspicious after checking linked_list. Another
>>>>>> possibility
>>>>>> is that there is write-after-free problem which corrupts the extra
>>>>>> 8-bytes before the freed pointer. Could you please apply the
>>>>>> following
>>>>>> debug patch to check whether or not the extra 8-bytes are
>>>>>> corrupted ?
>>>>> Thanks for getting back!
>>>>>
>>>>> I took your patch for a run, and there's a hit:
>>>>> | bad cache ff5ffffffffe8570: got size 96 work
>>>>> ffffffff801b19c8, cache ff5ffffffffe8980 exp size 128 work
>>>>> ffffffff801b19c8
>>>>
>>>> The extra 8-bytes are not corrupted. Both of these two
>>>> bpf_mem_cache are
>>>> valid and they are in the cache array defined in bpf_mem_caches. BPF
>>>> memory allocator allocated the pointer from 96-bytes sized-cache,
>>>> but it
>>>> tried to free the pointer through 128-bytes sized-cache.
>>>>
>>>> Now I suspect there is no 96-bytes slab in your system and ksize(ptr -
>>>> LLIST_NODE_SZ) returns 128, so bpf memory allocator selected the
>>>> 128-byte sized-cache instead of 96-bytes sized-cache. Could you please
>>>> check the value of KMALLOC_MIN_SIZE in your kernel .config and
>>>> using the
>>>> following command to check whether there is 96-bytes slab in your
>>>> system:
>>>
>>> KMALLOC_MIN_SIZE is 64.
>>>
>>>> $ cat /proc/slabinfo |grep kmalloc-96
>>>> dma-kmalloc-96 0 0 96 42 1 : tunables 0 0
>>>> 0 : slabdata 0 0 0
>>>> kmalloc-96 1865 2268 96 42 1 : tunables 0 0
>>>> 0 : slabdata 54 54 0
>>>>
>>>> In my system, slab has 96-bytes cached, so grep outputs something,
>>>> but I
>>>> think there will be no output in your system.
>>>
>>> You're right! No kmalloc-96.
>>
>> To get rid of the warning, limit available sizes from
>> bpf_mem_alloc_init()?
It is not enough. We need to adjust size_index accordingly during
initialization. Could you please try the attached patch below? It is
not a formal patch, and I am considering disabling prefilling for these
redirected bpf_mem_caches.
>
> Do you know why your system does not have kmalloc-96?
According to the implementation of setup_kmalloc_cache_index_table() and
create_kmalloc_caches(), when KMALLOC_MIN_SIZE is greater than 64,
kmalloc-96 will be omitted. If KMALLOC_MIN_SIZE is greater than 128,
kmalloc-192 will be omitted as well.
>
>>
>>
>> Björn
>
> .
[-- Attachment #2: 0001-bpf-Adjust-size_index-according-to-the-value-of-KMAL.patch --]
[-- Type: text/plain, Size: 1649 bytes --]
From c2ee572e0db09919c56d81edcad160335ca8a80e Mon Sep 17 00:00:00 2001
From: Hou Tao <houtao1@huawei.com>
Date: Mon, 28 Aug 2023 11:42:19 +0800
Subject: [PATCH] bpf: Adjust size_index according to the value of
KMALLOC_MIN_SIZE
Signed-off-by: Hou Tao <houtao1@huawei.com>
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index fb4fa0605a60..d137d6f1fb21 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -938,3 +938,41 @@ void notrace *bpf_mem_cache_alloc_flags(struct bpf_mem_alloc *ma, gfp_t flags)
return !ret ? NULL : ret + LLIST_NODE_SZ;
}
+
+/* Most of the logic is taken from setup_kmalloc_cache_index_table() */
+static __init int bpf_mem_cache_adjust_size(void)
+{
+ unsigned int size, index;
+
+ /* Normally KMALLOC_MIN_SIZE is 8-bytes, but it can be
+ * up-to 256-bytes.
+ */
+ size = KMALLOC_MIN_SIZE;
+ if (size <= 192)
+ index = size_index[(size - 1) / 8];
+ else
+ index = fls(size - 1) - 1;
+ for (size = 8; size < KMALLOC_MIN_SIZE && size <= 192; size += 8)
+ size_index[(size - 1) / 8] = index;
+
+ /* The alignment is 64-bytes, so disable 96-bytes cache and use
+ * 128-bytes cache instead.
+ */
+ if (KMALLOC_MIN_SIZE >= 64) {
+ index = size_index[(128 - 1) / 8];
+ for (size = 64 + 8; size <= 96; size += 8)
+ size_index[(size - 1) / 8] = index;
+ }
+
+ /* The alignment is 128-bytes, so disable 192-bytes cache and use
+ * 256-bytes cache instead.
+ */
+ if (KMALLOC_MIN_SIZE >= 128) {
+ index = fls(256 - 1) - 1;
+ for (size = 128 + 8; size <= 192; size += 8)
+ size_index[(size - 1) / 8] = index;
+ }
+
+ return 0;
+}
+subsys_initcall(bpf_mem_cache_adjust_size);
--
2.29.2
* Re: WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342
2023-08-28 13:57 ` Hou Tao
@ 2023-08-29 0:54 ` Yonghong Song
2023-08-29 7:26 ` Björn Töpel
2023-08-29 15:26 ` Alexei Starovoitov
2 siblings, 0 replies; 31+ messages in thread
From: Yonghong Song @ 2023-08-29 0:54 UTC (permalink / raw)
To: Hou Tao, Björn Töpel, bpf; +Cc: linux-riscv, Alexei Starovoitov
On 8/28/23 6:57 AM, Hou Tao wrote:
> Hi,
>
> On 8/27/2023 10:53 PM, Yonghong Song wrote:
>>
>>
>> On 8/27/23 1:37 AM, Björn Töpel wrote:
>>> Björn Töpel <bjorn@kernel.org> writes:
>>>
>>>> Hou Tao <houtao@huaweicloud.com> writes:
>>>>
>>>>> Hi,
>>>>>
>>>>> On 8/26/2023 5:23 PM, Björn Töpel wrote:
>>>>>> Hou Tao <houtao@huaweicloud.com> writes:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 8/25/2023 11:28 PM, Yonghong Song wrote:
>>>>>>>>
>>>>>>>> On 8/25/23 3:32 AM, Björn Töpel wrote:
>>>>>>>>> I'm chasing a workqueue hang on RISC-V/qemu (TCG), using the bpf
>>>>>>>>> selftests on bpf-next 9e3b47abeb8f.
>>>>>>>>>
>>>>>>>>> I'm able to reproduce the hang by multiple runs of:
>>>>>>>>> | ./test_progs -a link_api -a linked_list
>>>>>>>>> I'm currently investigating that.
>>>>>>>>>
>>>>>>>>> But! Sometimes (every blue moon) I get a warn_on_once hit:
>>>>>>>>> | ------------[ cut here ]------------
>>>>>>>>> | WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342
>>>>>>>>> bpf_mem_refill+0x1fc/0x206
>>>>>>>>> | Modules linked in: bpf_testmod(OE)
>>>>>>>>> | CPU: 3 PID: 261 Comm: test_progs-cpuv Tainted: G OE
>>>>>>>>> N 6.5.0-rc5-01743-gdcb152bb8328 #2
>>>>>>>>> | Hardware name: riscv-virtio,qemu (DT)
>>>>>>>>> | epc : bpf_mem_refill+0x1fc/0x206
>>>>>>>>> | ra : irq_work_single+0x68/0x70
>>>>>>>>> | epc : ffffffff801b1bc4 ra : ffffffff8015fe84 sp :
>>>>>>>>> ff2000000001be20
>>>>>>>>> | gp : ffffffff82d26138 tp : ff6000008477a800 t0 :
>>>>>>>>> 0000000000046600
>>>>>>>>> | t1 : ffffffff812b6ddc t2 : 0000000000000000 s0 :
>>>>>>>>> ff2000000001be70
>>>>>>>>> | s1 : ff5ffffffffe8998 a0 : ff5ffffffffe8998 a1 :
>>>>>>>>> ff600003fef4b000
>>>>>>>>> | a2 : 000000000000003f a3 : ffffffff80008250 a4 :
>>>>>>>>> 0000000000000060
>>>>>>>>> | a5 : 0000000000000080 a6 : 0000000000000000 a7 :
>>>>>>>>> 0000000000735049
>>>>>>>>> | s2 : ff5ffffffffe8998 s3 : 0000000000000022 s4 :
>>>>>>>>> 0000000000001000
>>>>>>>>> | s5 : 0000000000000007 s6 : ff5ffffffffe8570 s7 :
>>>>>>>>> ffffffff82d6bd30
>>>>>>>>> | s8 : 000000000000003f s9 : ffffffff82d2c5e8 s10:
>>>>>>>>> 000000000000ffff
>>>>>>>>> | s11: ffffffff82d2c5d8 t3 : ffffffff81ea8f28 t4 :
>>>>>>>>> 0000000000000000
>>>>>>>>> | t5 : ff6000008fd28278 t6 : 0000000000040000
>>>>>>>>> | status: 0000000200000100 badaddr: 0000000000000000 cause:
>>>>>>>>> 0000000000000003
>>>>>>>>> | [<ffffffff801b1bc4>] bpf_mem_refill+0x1fc/0x206
>>>>>>>>> | [<ffffffff8015fe84>] irq_work_single+0x68/0x70
>>>>>>>>> | [<ffffffff8015feb4>] irq_work_run_list+0x28/0x36
>>>>>>>>> | [<ffffffff8015fefa>] irq_work_run+0x38/0x66
>>>>>>>>> | [<ffffffff8000828a>] handle_IPI+0x3a/0xb4
>>>>>>>>> | [<ffffffff800a5c3a>] handle_percpu_devid_irq+0xa4/0x1f8
>>>>>>>>> | [<ffffffff8009fafa>] generic_handle_domain_irq+0x28/0x36
>>>>>>>>> | [<ffffffff800ae570>] ipi_mux_process+0xac/0xfa
>>>>>>>>> | [<ffffffff8000a8ea>] sbi_ipi_handle+0x2e/0x88
>>>>>>>>> | [<ffffffff8009fafa>] generic_handle_domain_irq+0x28/0x36
>>>>>>>>> | [<ffffffff807ee70e>] riscv_intc_irq+0x36/0x4e
>>>>>>>>> | [<ffffffff812b5d3a>] handle_riscv_irq+0x54/0x86
>>>>>>>>> | [<ffffffff812b6904>] do_irq+0x66/0x98
>>>>>>>>> | ---[ end trace 0000000000000000 ]---
>>>>>>>>>
>>>>>>>>> Code:
>>>>>>>>> | static void free_bulk(struct bpf_mem_cache *c)
>>>>>>>>> | {
>>>>>>>>> | struct bpf_mem_cache *tgt = c->tgt;
>>>>>>>>> | struct llist_node *llnode, *t;
>>>>>>>>> | unsigned long flags;
>>>>>>>>> | int cnt;
>>>>>>>>> |
>>>>>>>>> | WARN_ON_ONCE(tgt->unit_size != c->unit_size);
>>>>>>>>> | ...
>>>>>>>>>
>>>>>>>>> I'm not well versed in the memory allocator; Before I dive into
>>>>>>>>> it --
>>>>>>>>> has anyone else hit it? Ideas on why the warn_on_once is hit?
>>>>>>>> Maybe take a look at the patch
>>>>>>>> 822fb26bdb55 bpf: Add a hint to allocated objects.
>>>>>>>>
>>>>>>>> In the above patch, we have
>>>>>>>>
>>>>>>>> + /*
>>>>>>>> + * Remember bpf_mem_cache that allocated this object.
>>>>>>>> + * The hint is not accurate.
>>>>>>>> + */
>>>>>>>> + c->tgt = *(struct bpf_mem_cache **)llnode;
>>>>>>>>
>>>>>>>> I suspect that the warning may be related to the above.
>>>>>>>> I tried the above ./test_progs command line (running multiple
>>>>>>>> at the same time) and didn't trigger the issue.
>>>>>>> The extra 8-bytes before the freed pointer is used to save the
>>>>>>> pointer
>>>>>>> of the original bpf memory allocator where the freed pointer came
>>>>>>> from,
>>>>>>> so unit_free() could free the pointer back to the original
>>>>>>> allocator to
>>>>>>> prevent alloc-and-free unbalance.
>>>>>>>
>>>>>>> I suspect that a wrong pointer was passed to bpf_obj_drop, but do
>>>>>>> not
>>>>>>> find anything suspicious after checking linked_list. Another
>>>>>>> possibility
>>>>>>> is that there is write-after-free problem which corrupts the extra
>>>>>>> 8-bytes before the freed pointer. Could you please apply the
>>>>>>> following
>>>>>>> debug patch to check whether or not the extra 8-bytes are
>>>>>>> corrupted ?
>>>>>> Thanks for getting back!
>>>>>>
>>>>>> I took your patch for a run, and there's a hit:
>>>>>> | bad cache ff5ffffffffe8570: got size 96 work
>>>>>> ffffffff801b19c8, cache ff5ffffffffe8980 exp size 128 work
>>>>>> ffffffff801b19c8
>>>>>
>>>>> The extra 8-bytes are not corrupted. Both of these two
>>>>> bpf_mem_cache are
>>>>> valid and they are in the cache array defined in bpf_mem_caches. BPF
>>>>> memory allocator allocated the pointer from 96-bytes sized-cache,
>>>>> but it
>>>>> tried to free the pointer through 128-bytes sized-cache.
>>>>>
>>>>> Now I suspect there is no 96-bytes slab in your system and ksize(ptr -
>>>>> LLIST_NODE_SZ) returns 128, so bpf memory allocator selected the
>>>>> 128-byte sized-cache instead of 96-bytes sized-cache. Could you please
>>>>> check the value of KMALLOC_MIN_SIZE in your kernel .config and
>>>>> using the
>>>>> following command to check whether there is 96-bytes slab in your
>>>>> system:
>>>>
>>>> KMALLOC_MIN_SIZE is 64.
>>>>
>>>>> $ cat /proc/slabinfo |grep kmalloc-96
>>>>> dma-kmalloc-96 0 0 96 42 1 : tunables 0 0
>>>>> 0 : slabdata 0 0 0
>>>>> kmalloc-96 1865 2268 96 42 1 : tunables 0 0
>>>>> 0 : slabdata 54 54 0
>>>>>
>>>>> In my system, slab has 96-bytes cached, so grep outputs something,
>>>>> but I
>>>>> think there will be no output in your system.
>>>>
>>>> You're right! No kmalloc-96.
>>>
>>> To get rid of the warning, limit available sizes from
>>> bpf_mem_alloc_init()?
>
> It is not enough. We need to adjust size_index accordingly during
> initialization. Could you please try the attached patch below? It is
> not a formal patch, and I am considering disabling prefilling for these
> redirected bpf_mem_caches.
>>
>> Do you know why your system does not have kmalloc-96?
>
> According to the implementation of setup_kmalloc_cache_index_table() and
> create_kmalloc_caches(), when KMALLOC_MIN_SIZE is greater than 64,
> kmalloc-96 will be omitted. If KMALLOC_MIN_SIZE is greater than 128,
> kmalloc-192 will be omitted as well.
Thanks! This is indeed the case, except the above 'greater than' should
be 'greater than or equal to'. For example, KMALLOC_MIN_SIZE greater
than or equal to 64 means the minimum alignment is 64, so kmalloc-96 is
removed. Similarly for greater than or equal to 128.
I agree it is a good idea to align the memalloc cache unit sizes with
the underlying kmalloc cache unit sizes.
>>
>>>
>>>
>>> Björn
>>
>> .
>
* Re: WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342
2023-08-28 13:57 ` Hou Tao
2023-08-29 0:54 ` Yonghong Song
@ 2023-08-29 7:26 ` Björn Töpel
2023-08-29 11:46 ` Björn Töpel
2023-08-29 12:54 ` Björn Töpel
2023-08-29 15:26 ` Alexei Starovoitov
2 siblings, 2 replies; 31+ messages in thread
From: Björn Töpel @ 2023-08-29 7:26 UTC (permalink / raw)
To: Hou Tao, yonghong.song, bpf; +Cc: linux-riscv, Alexei Starovoitov
Hou Tao <houtao@huaweicloud.com> writes:
> Hi,
>
> On 8/27/2023 10:53 PM, Yonghong Song wrote:
>>
>>
>> On 8/27/23 1:37 AM, Björn Töpel wrote:
>>> Björn Töpel <bjorn@kernel.org> writes:
>>>
>>>> Hou Tao <houtao@huaweicloud.com> writes:
>>>>
>>>>> Hi,
>>>>>
>>>>> On 8/26/2023 5:23 PM, Björn Töpel wrote:
>>>>>> Hou Tao <houtao@huaweicloud.com> writes:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 8/25/2023 11:28 PM, Yonghong Song wrote:
>>>>>>>>
>>>>>>>> On 8/25/23 3:32 AM, Björn Töpel wrote:
>>>>>>>>> I'm chasing a workqueue hang on RISC-V/qemu (TCG), using the bpf
>>>>>>>>> selftests on bpf-next 9e3b47abeb8f.
>>>>>>>>>
>>>>>>>>> I'm able to reproduce the hang by multiple runs of:
>>>>>>>>> | ./test_progs -a link_api -a linked_list
>>>>>>>>> I'm currently investigating that.
>>>>>>>>>
>>>>>>>>> But! Sometimes (every blue moon) I get a warn_on_once hit:
>>>>>>>>> | ------------[ cut here ]------------
>>>>>>>>> | WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342
>>>>>>>>> bpf_mem_refill+0x1fc/0x206
>>>>>>>>> | Modules linked in: bpf_testmod(OE)
>>>>>>>>> | CPU: 3 PID: 261 Comm: test_progs-cpuv Tainted: G OE
>>>>>>>>> N 6.5.0-rc5-01743-gdcb152bb8328 #2
>>>>>>>>> | Hardware name: riscv-virtio,qemu (DT)
>>>>>>>>> | epc : bpf_mem_refill+0x1fc/0x206
>>>>>>>>> | ra : irq_work_single+0x68/0x70
>>>>>>>>> | epc : ffffffff801b1bc4 ra : ffffffff8015fe84 sp :
>>>>>>>>> ff2000000001be20
>>>>>>>>> | gp : ffffffff82d26138 tp : ff6000008477a800 t0 :
>>>>>>>>> 0000000000046600
>>>>>>>>> | t1 : ffffffff812b6ddc t2 : 0000000000000000 s0 :
>>>>>>>>> ff2000000001be70
>>>>>>>>> | s1 : ff5ffffffffe8998 a0 : ff5ffffffffe8998 a1 :
>>>>>>>>> ff600003fef4b000
>>>>>>>>> | a2 : 000000000000003f a3 : ffffffff80008250 a4 :
>>>>>>>>> 0000000000000060
>>>>>>>>> | a5 : 0000000000000080 a6 : 0000000000000000 a7 :
>>>>>>>>> 0000000000735049
>>>>>>>>> | s2 : ff5ffffffffe8998 s3 : 0000000000000022 s4 :
>>>>>>>>> 0000000000001000
>>>>>>>>> | s5 : 0000000000000007 s6 : ff5ffffffffe8570 s7 :
>>>>>>>>> ffffffff82d6bd30
>>>>>>>>> | s8 : 000000000000003f s9 : ffffffff82d2c5e8 s10:
>>>>>>>>> 000000000000ffff
>>>>>>>>> | s11: ffffffff82d2c5d8 t3 : ffffffff81ea8f28 t4 :
>>>>>>>>> 0000000000000000
>>>>>>>>> | t5 : ff6000008fd28278 t6 : 0000000000040000
>>>>>>>>> | status: 0000000200000100 badaddr: 0000000000000000 cause:
>>>>>>>>> 0000000000000003
>>>>>>>>> | [<ffffffff801b1bc4>] bpf_mem_refill+0x1fc/0x206
>>>>>>>>> | [<ffffffff8015fe84>] irq_work_single+0x68/0x70
>>>>>>>>> | [<ffffffff8015feb4>] irq_work_run_list+0x28/0x36
>>>>>>>>> | [<ffffffff8015fefa>] irq_work_run+0x38/0x66
>>>>>>>>> | [<ffffffff8000828a>] handle_IPI+0x3a/0xb4
>>>>>>>>> | [<ffffffff800a5c3a>] handle_percpu_devid_irq+0xa4/0x1f8
>>>>>>>>> | [<ffffffff8009fafa>] generic_handle_domain_irq+0x28/0x36
>>>>>>>>> | [<ffffffff800ae570>] ipi_mux_process+0xac/0xfa
>>>>>>>>> | [<ffffffff8000a8ea>] sbi_ipi_handle+0x2e/0x88
>>>>>>>>> | [<ffffffff8009fafa>] generic_handle_domain_irq+0x28/0x36
>>>>>>>>> | [<ffffffff807ee70e>] riscv_intc_irq+0x36/0x4e
>>>>>>>>> | [<ffffffff812b5d3a>] handle_riscv_irq+0x54/0x86
>>>>>>>>> | [<ffffffff812b6904>] do_irq+0x66/0x98
>>>>>>>>> | ---[ end trace 0000000000000000 ]---
>>>>>>>>>
>>>>>>>>> Code:
>>>>>>>>> | static void free_bulk(struct bpf_mem_cache *c)
>>>>>>>>> | {
>>>>>>>>> | struct bpf_mem_cache *tgt = c->tgt;
>>>>>>>>> | struct llist_node *llnode, *t;
>>>>>>>>> | unsigned long flags;
>>>>>>>>> | int cnt;
>>>>>>>>> |
>>>>>>>>> | WARN_ON_ONCE(tgt->unit_size != c->unit_size);
>>>>>>>>> | ...
>>>>>>>>>
>>>>>>>>> I'm not well versed in the memory allocator; Before I dive into
>>>>>>>>> it --
>>>>>>>>> has anyone else hit it? Ideas on why the warn_on_once is hit?
>>>>>>>> Maybe take a look at the patch
>>>>>>>> 822fb26bdb55 bpf: Add a hint to allocated objects.
>>>>>>>>
>>>>>>>> In the above patch, we have
>>>>>>>>
>>>>>>>> + /*
>>>>>>>> + * Remember bpf_mem_cache that allocated this object.
>>>>>>>> + * The hint is not accurate.
>>>>>>>> + */
>>>>>>>> + c->tgt = *(struct bpf_mem_cache **)llnode;
>>>>>>>>
>>>>>>>> I suspect that the warning may be related to the above.
>>>>>>>> I tried the above ./test_progs command line (running multiple
>>>>>>>> at the same time) and didn't trigger the issue.
>>>>>>> The extra 8-bytes before the freed pointer is used to save the
>>>>>>> pointer
>>>>>>> of the original bpf memory allocator where the freed pointer came
>>>>>>> from,
>>>>>>> so unit_free() could free the pointer back to the original
>>>>>>> allocator to
>>>>>>> prevent alloc-and-free unbalance.
>>>>>>>
>>>>>>> I suspect that a wrong pointer was passed to bpf_obj_drop, but do
>>>>>>> not
>>>>>>> find anything suspicious after checking linked_list. Another
>>>>>>> possibility
>>>>>>> is that there is write-after-free problem which corrupts the extra
>>>>>>> 8-bytes before the freed pointer. Could you please apply the
>>>>>>> following
>>>>>>> debug patch to check whether or not the extra 8-bytes are
>>>>>>> corrupted ?
>>>>>> Thanks for getting back!
>>>>>>
>>>>>> I took your patch for a run, and there's a hit:
>>>>>> | bad cache ff5ffffffffe8570: got size 96 work
>>>>>> ffffffff801b19c8, cache ff5ffffffffe8980 exp size 128 work
>>>>>> ffffffff801b19c8
>>>>>
>>>>> The extra 8-bytes are not corrupted. Both of these two
>>>>> bpf_mem_cache are
>>>>> valid and they are in the cache array defined in bpf_mem_caches. BPF
>>>>> memory allocator allocated the pointer from 96-bytes sized-cache,
>>>>> but it
>>>>> tried to free the pointer through 128-bytes sized-cache.
>>>>>
>>>>> Now I suspect there is no 96-byte slab in your system and ksize(ptr -
>>>>> LLIST_NODE_SZ) returns 128, so the bpf memory allocator selected the
>>>>> 128-byte sized cache instead of the 96-byte one. Could you please
>>>>> check the value of KMALLOC_MIN_SIZE in your kernel .config and use the
>>>>> following command to check whether there is a 96-byte slab in your
>>>>> system:
>>>>
>>>> KMALLOC_MIN_SIZE is 64.
>>>>
>>>>> $ cat /proc/slabinfo |grep kmalloc-96
>>>>> dma-kmalloc-96 0 0 96 42 1 : tunables 0 0
>>>>> 0 : slabdata 0 0 0
>>>>> kmalloc-96 1865 2268 96 42 1 : tunables 0 0
>>>>> 0 : slabdata 54 54 0
>>>>>
>>>>> In my system, slab has a 96-byte cache, so grep outputs something,
>>>>> but I think there will be no output in your system.
>>>>
>>>> You're right! No kmalloc-96.
>>>
>>> To get rid of the warning, limit available sizes from
>>> bpf_mem_alloc_init()?
>
> It is not enough. We need to adjust size_index accordingly during
> initialization. Could you please try the attached patch below? It is
> not a formal patch, and I am considering disabling prefilling for these
> redirected bpf_mem_caches.
Sorry for the slow response; I'll take it for a spin today.
Björn
_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342
2023-08-29 7:26 ` Björn Töpel
@ 2023-08-29 11:46 ` Björn Töpel
2023-08-30 12:15 ` Hou Tao
2023-08-29 12:54 ` Björn Töpel
1 sibling, 1 reply; 31+ messages in thread
From: Björn Töpel @ 2023-08-29 11:46 UTC (permalink / raw)
To: Hou Tao, yonghong.song, bpf; +Cc: linux-riscv, Alexei Starovoitov
Björn Töpel <bjorn@kernel.org> writes:
> Hou Tao <houtao@huaweicloud.com> writes:
>
>> Hi,
>>
>> On 8/27/2023 10:53 PM, Yonghong Song wrote:
>>>
>>>
>>> On 8/27/23 1:37 AM, Björn Töpel wrote:
>>>> Björn Töpel <bjorn@kernel.org> writes:
>>>>
>>>>> Hou Tao <houtao@huaweicloud.com> writes:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> On 8/26/2023 5:23 PM, Björn Töpel wrote:
>>>>>>> Hou Tao <houtao@huaweicloud.com> writes:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On 8/25/2023 11:28 PM, Yonghong Song wrote:
>>>>>>>>>
>>>>>>>>> On 8/25/23 3:32 AM, Björn Töpel wrote:
>>>>>>>>>> I'm chasing a workqueue hang on RISC-V/qemu (TCG), using the bpf
>>>>>>>>>> selftests on bpf-next 9e3b47abeb8f.
>>>>>>>>>>
>>>>>>>>>> I'm able to reproduce the hang by multiple runs of:
>>>>>>>>>> | ./test_progs -a link_api -a linked_list
>>>>>>>>>> I'm currently investigating that.
>>>>>>>>>>
>>>>>>>>>> But! Sometimes (every blue moon) I get a warn_on_once hit:
>>>>>>>>>> | ------------[ cut here ]------------
>>>>>>>>>> | WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342
>>>>>>>>>> bpf_mem_refill+0x1fc/0x206
>>>>>>>>>> | Modules linked in: bpf_testmod(OE)
>>>>>>>>>> | CPU: 3 PID: 261 Comm: test_progs-cpuv Tainted: G OE
>>>>>>>>>> N 6.5.0-rc5-01743-gdcb152bb8328 #2
>>>>>>>>>> | Hardware name: riscv-virtio,qemu (DT)
>>>>>>>>>> | epc : bpf_mem_refill+0x1fc/0x206
>>>>>>>>>> | ra : irq_work_single+0x68/0x70
>>>>>>>>>> | epc : ffffffff801b1bc4 ra : ffffffff8015fe84 sp :
>>>>>>>>>> ff2000000001be20
>>>>>>>>>> | gp : ffffffff82d26138 tp : ff6000008477a800 t0 :
>>>>>>>>>> 0000000000046600
>>>>>>>>>> | t1 : ffffffff812b6ddc t2 : 0000000000000000 s0 :
>>>>>>>>>> ff2000000001be70
>>>>>>>>>> | s1 : ff5ffffffffe8998 a0 : ff5ffffffffe8998 a1 :
>>>>>>>>>> ff600003fef4b000
>>>>>>>>>> | a2 : 000000000000003f a3 : ffffffff80008250 a4 :
>>>>>>>>>> 0000000000000060
>>>>>>>>>> | a5 : 0000000000000080 a6 : 0000000000000000 a7 :
>>>>>>>>>> 0000000000735049
>>>>>>>>>> | s2 : ff5ffffffffe8998 s3 : 0000000000000022 s4 :
>>>>>>>>>> 0000000000001000
>>>>>>>>>> | s5 : 0000000000000007 s6 : ff5ffffffffe8570 s7 :
>>>>>>>>>> ffffffff82d6bd30
>>>>>>>>>> | s8 : 000000000000003f s9 : ffffffff82d2c5e8 s10:
>>>>>>>>>> 000000000000ffff
>>>>>>>>>> | s11: ffffffff82d2c5d8 t3 : ffffffff81ea8f28 t4 :
>>>>>>>>>> 0000000000000000
>>>>>>>>>> | t5 : ff6000008fd28278 t6 : 0000000000040000
>>>>>>>>>> | status: 0000000200000100 badaddr: 0000000000000000 cause:
>>>>>>>>>> 0000000000000003
>>>>>>>>>> | [<ffffffff801b1bc4>] bpf_mem_refill+0x1fc/0x206
>>>>>>>>>> | [<ffffffff8015fe84>] irq_work_single+0x68/0x70
>>>>>>>>>> | [<ffffffff8015feb4>] irq_work_run_list+0x28/0x36
>>>>>>>>>> | [<ffffffff8015fefa>] irq_work_run+0x38/0x66
>>>>>>>>>> | [<ffffffff8000828a>] handle_IPI+0x3a/0xb4
>>>>>>>>>> | [<ffffffff800a5c3a>] handle_percpu_devid_irq+0xa4/0x1f8
>>>>>>>>>> | [<ffffffff8009fafa>] generic_handle_domain_irq+0x28/0x36
>>>>>>>>>> | [<ffffffff800ae570>] ipi_mux_process+0xac/0xfa
>>>>>>>>>> | [<ffffffff8000a8ea>] sbi_ipi_handle+0x2e/0x88
>>>>>>>>>> | [<ffffffff8009fafa>] generic_handle_domain_irq+0x28/0x36
>>>>>>>>>> | [<ffffffff807ee70e>] riscv_intc_irq+0x36/0x4e
>>>>>>>>>> | [<ffffffff812b5d3a>] handle_riscv_irq+0x54/0x86
>>>>>>>>>> | [<ffffffff812b6904>] do_irq+0x66/0x98
>>>>>>>>>> | ---[ end trace 0000000000000000 ]---
>>>>>>>>>>
>>>>>>>>>> Code:
>>>>>>>>>> | static void free_bulk(struct bpf_mem_cache *c)
>>>>>>>>>> | {
>>>>>>>>>> | struct bpf_mem_cache *tgt = c->tgt;
>>>>>>>>>> | struct llist_node *llnode, *t;
>>>>>>>>>> | unsigned long flags;
>>>>>>>>>> | int cnt;
>>>>>>>>>> |
>>>>>>>>>> | WARN_ON_ONCE(tgt->unit_size != c->unit_size);
>>>>>>>>>> | ...
>>>>>>>>>>
>>>>>>>>>> I'm not well versed in the memory allocator; Before I dive into
>>>>>>>>>> it --
>>>>>>>>>> has anyone else hit it? Ideas on why the warn_on_once is hit?
>>>>>>>>> Maybe take a look at the patch
>>>>>>>>> 822fb26bdb55 bpf: Add a hint to allocated objects.
>>>>>>>>>
>>>>>>>>> In the above patch, we have
>>>>>>>>>
>>>>>>>>> + /*
>>>>>>>>> + * Remember bpf_mem_cache that allocated this object.
>>>>>>>>> + * The hint is not accurate.
>>>>>>>>> + */
>>>>>>>>> + c->tgt = *(struct bpf_mem_cache **)llnode;
>>>>>>>>>
>>>>>>>>> I suspect that the warning may be related to the above.
>>>>>>>>> I tried the above ./test_progs command line (running multiple
>>>>>>>>> at the same time) and didn't trigger the issue.
>>>>>>>> The extra 8-bytes before the freed pointer is used to save the
>>>>>>>> pointer
>>>>>>>> of the original bpf memory allocator where the freed pointer came
>>>>>>>> from,
>>>>>>>> so unit_free() could free the pointer back to the original
>>>>>>>> allocator to
>>>>>>>> prevent alloc-and-free unbalance.
>>>>>>>>
>>>>>>>> I suspect that a wrong pointer was passed to bpf_obj_drop, but do
>>>>>>>> not
>>>>>>>> find anything suspicious after checking linked_list. Another
>>>>>>>> possibility
>>>>>>>> is that there is write-after-free problem which corrupts the extra
>>>>>>>> 8-bytes before the freed pointer. Could you please apply the
>>>>>>>> following
>>>>>>>> debug patch to check whether or not the extra 8-bytes are
>>>>>>>> corrupted ?
>>>>>>> Thanks for getting back!
>>>>>>>
>>>>>>> I took your patch for a run, and there's a hit:
>>>>>>> | bad cache ff5ffffffffe8570: got size 96 work
>>>>>>> ffffffff801b19c8, cache ff5ffffffffe8980 exp size 128 work
>>>>>>> ffffffff801b19c8
>>>>>>
>>>>>> The extra 8 bytes are not corrupted. Both of these bpf_mem_cache
>>>>>> entries are valid, and they are in the cache array defined in
>>>>>> bpf_mem_caches. The BPF memory allocator allocated the pointer from
>>>>>> the 96-byte sized cache, but tried to free it through the 128-byte
>>>>>> sized cache.
>>>>>>
>>>>>> Now I suspect there is no 96-bytes slab in your system and ksize(ptr -
>>>>>> LLIST_NODE_SZ) returns 128, so bpf memory allocator selected the
>>>>>> 128-byte sized-cache instead of 96-bytes sized-cache. Could you please
>>>>>> check the value of KMALLOC_MIN_SIZE in your kernel .config and
>>>>>> using the
>>>>>> following command to check whether there is 96-bytes slab in your
>>>>>> system:
>>>>>
>>>>> KMALLOC_MIN_SIZE is 64.
>>>>>
>>>>>> $ cat /proc/slabinfo |grep kmalloc-96
>>>>>> dma-kmalloc-96 0 0 96 42 1 : tunables 0 0
>>>>>> 0 : slabdata 0 0 0
>>>>>> kmalloc-96 1865 2268 96 42 1 : tunables 0 0
>>>>>> 0 : slabdata 54 54 0
>>>>>>
>>>>>> In my system, slab has a 96-byte cache, so grep outputs something,
>>>>>> but I think there will be no output in your system.
>>>>>
>>>>> You're right! No kmalloc-96.
>>>>
>>>> To get rid of the warning, limit available sizes from
>>>> bpf_mem_alloc_init()?
>>
>> It is not enough. We need to adjust size_index accordingly during
>> initialization. Could you please try the attached patch below? It is
>> not a formal patch, and I am considering disabling prefilling for these
>> redirected bpf_mem_caches.
>
> Sorry for the slow response; I'll take it for a spin today.
Hmm, on a related note, RISC-V has this change queued up [1], which will
introduce -96 and friends. Are there any other archs supporting the BPF
allocator where this is a concern? If not, I suggest simply leaving the
code as is.
Björn
[1] https://lore.kernel.org/all/20230718152214.2907-1-jszhang@kernel.org/
* Re: WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342
2023-08-29 7:26 ` Björn Töpel
2023-08-29 11:46 ` Björn Töpel
@ 2023-08-29 12:54 ` Björn Töpel
1 sibling, 0 replies; 31+ messages in thread
From: Björn Töpel @ 2023-08-29 12:54 UTC (permalink / raw)
To: Hou Tao, yonghong.song, bpf; +Cc: linux-riscv, Alexei Starovoitov
Björn Töpel <bjorn@kernel.org> writes:
> Hou Tao <houtao@huaweicloud.com> writes:
>
>> Hi,
>>
>> On 8/27/2023 10:53 PM, Yonghong Song wrote:
>>>
>>>
>>> On 8/27/23 1:37 AM, Björn Töpel wrote:
>>>> Björn Töpel <bjorn@kernel.org> writes:
>>>>
>>>>> Hou Tao <houtao@huaweicloud.com> writes:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> On 8/26/2023 5:23 PM, Björn Töpel wrote:
>>>>>>> Hou Tao <houtao@huaweicloud.com> writes:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On 8/25/2023 11:28 PM, Yonghong Song wrote:
>>>>>>>>>
>>>>>>>>> On 8/25/23 3:32 AM, Björn Töpel wrote:
>>>>>>>>>> I'm chasing a workqueue hang on RISC-V/qemu (TCG), using the bpf
>>>>>>>>>> selftests on bpf-next 9e3b47abeb8f.
>>>>>>>>>>
>>>>>>>>>> I'm able to reproduce the hang by multiple runs of:
>>>>>>>>>> | ./test_progs -a link_api -a linked_list
>>>>>>>>>> I'm currently investigating that.
>>>>>>>>>>
>>>>>>>>>> But! Sometimes (every blue moon) I get a warn_on_once hit:
>>>>>>>>>> | ------------[ cut here ]------------
>>>>>>>>>> | WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342
>>>>>>>>>> bpf_mem_refill+0x1fc/0x206
>>>>>>>>>> | Modules linked in: bpf_testmod(OE)
>>>>>>>>>> | CPU: 3 PID: 261 Comm: test_progs-cpuv Tainted: G OE
>>>>>>>>>> N 6.5.0-rc5-01743-gdcb152bb8328 #2
>>>>>>>>>> | Hardware name: riscv-virtio,qemu (DT)
>>>>>>>>>> | epc : bpf_mem_refill+0x1fc/0x206
>>>>>>>>>> | ra : irq_work_single+0x68/0x70
>>>>>>>>>> | epc : ffffffff801b1bc4 ra : ffffffff8015fe84 sp :
>>>>>>>>>> ff2000000001be20
>>>>>>>>>> | gp : ffffffff82d26138 tp : ff6000008477a800 t0 :
>>>>>>>>>> 0000000000046600
>>>>>>>>>> | t1 : ffffffff812b6ddc t2 : 0000000000000000 s0 :
>>>>>>>>>> ff2000000001be70
>>>>>>>>>> | s1 : ff5ffffffffe8998 a0 : ff5ffffffffe8998 a1 :
>>>>>>>>>> ff600003fef4b000
>>>>>>>>>> | a2 : 000000000000003f a3 : ffffffff80008250 a4 :
>>>>>>>>>> 0000000000000060
>>>>>>>>>> | a5 : 0000000000000080 a6 : 0000000000000000 a7 :
>>>>>>>>>> 0000000000735049
>>>>>>>>>> | s2 : ff5ffffffffe8998 s3 : 0000000000000022 s4 :
>>>>>>>>>> 0000000000001000
>>>>>>>>>> | s5 : 0000000000000007 s6 : ff5ffffffffe8570 s7 :
>>>>>>>>>> ffffffff82d6bd30
>>>>>>>>>> | s8 : 000000000000003f s9 : ffffffff82d2c5e8 s10:
>>>>>>>>>> 000000000000ffff
>>>>>>>>>> | s11: ffffffff82d2c5d8 t3 : ffffffff81ea8f28 t4 :
>>>>>>>>>> 0000000000000000
>>>>>>>>>> | t5 : ff6000008fd28278 t6 : 0000000000040000
>>>>>>>>>> | status: 0000000200000100 badaddr: 0000000000000000 cause:
>>>>>>>>>> 0000000000000003
>>>>>>>>>> | [<ffffffff801b1bc4>] bpf_mem_refill+0x1fc/0x206
>>>>>>>>>> | [<ffffffff8015fe84>] irq_work_single+0x68/0x70
>>>>>>>>>> | [<ffffffff8015feb4>] irq_work_run_list+0x28/0x36
>>>>>>>>>> | [<ffffffff8015fefa>] irq_work_run+0x38/0x66
>>>>>>>>>> | [<ffffffff8000828a>] handle_IPI+0x3a/0xb4
>>>>>>>>>> | [<ffffffff800a5c3a>] handle_percpu_devid_irq+0xa4/0x1f8
>>>>>>>>>> | [<ffffffff8009fafa>] generic_handle_domain_irq+0x28/0x36
>>>>>>>>>> | [<ffffffff800ae570>] ipi_mux_process+0xac/0xfa
>>>>>>>>>> | [<ffffffff8000a8ea>] sbi_ipi_handle+0x2e/0x88
>>>>>>>>>> | [<ffffffff8009fafa>] generic_handle_domain_irq+0x28/0x36
>>>>>>>>>> | [<ffffffff807ee70e>] riscv_intc_irq+0x36/0x4e
>>>>>>>>>> | [<ffffffff812b5d3a>] handle_riscv_irq+0x54/0x86
>>>>>>>>>> | [<ffffffff812b6904>] do_irq+0x66/0x98
>>>>>>>>>> | ---[ end trace 0000000000000000 ]---
>>>>>>>>>>
>>>>>>>>>> Code:
>>>>>>>>>> | static void free_bulk(struct bpf_mem_cache *c)
>>>>>>>>>> | {
>>>>>>>>>> | struct bpf_mem_cache *tgt = c->tgt;
>>>>>>>>>> | struct llist_node *llnode, *t;
>>>>>>>>>> | unsigned long flags;
>>>>>>>>>> | int cnt;
>>>>>>>>>> |
>>>>>>>>>> | WARN_ON_ONCE(tgt->unit_size != c->unit_size);
>>>>>>>>>> | ...
>>>>>>>>>>
>>>>>>>>>> I'm not well versed in the memory allocator; Before I dive into
>>>>>>>>>> it --
>>>>>>>>>> has anyone else hit it? Ideas on why the warn_on_once is hit?
>>>>>>>>> Maybe take a look at the patch
>>>>>>>>> 822fb26bdb55 bpf: Add a hint to allocated objects.
>>>>>>>>>
>>>>>>>>> In the above patch, we have
>>>>>>>>>
>>>>>>>>> + /*
>>>>>>>>> + * Remember bpf_mem_cache that allocated this object.
>>>>>>>>> + * The hint is not accurate.
>>>>>>>>> + */
>>>>>>>>> + c->tgt = *(struct bpf_mem_cache **)llnode;
>>>>>>>>>
>>>>>>>>> I suspect that the warning may be related to the above.
>>>>>>>>> I tried the above ./test_progs command line (running multiple
>>>>>>>>> at the same time) and didn't trigger the issue.
>>>>>>>> The extra 8-bytes before the freed pointer is used to save the
>>>>>>>> pointer
>>>>>>>> of the original bpf memory allocator where the freed pointer came
>>>>>>>> from,
>>>>>>>> so unit_free() could free the pointer back to the original
>>>>>>>> allocator to
>>>>>>>> prevent alloc-and-free unbalance.
>>>>>>>>
>>>>>>>> I suspect that a wrong pointer was passed to bpf_obj_drop, but do
>>>>>>>> not
>>>>>>>> find anything suspicious after checking linked_list. Another
>>>>>>>> possibility
>>>>>>>> is that there is write-after-free problem which corrupts the extra
>>>>>>>> 8-bytes before the freed pointer. Could you please apply the
>>>>>>>> following
>>>>>>>> debug patch to check whether or not the extra 8-bytes are
>>>>>>>> corrupted ?
>>>>>>> Thanks for getting back!
>>>>>>>
>>>>>>> I took your patch for a run, and there's a hit:
>>>>>>> | bad cache ff5ffffffffe8570: got size 96 work
>>>>>>> ffffffff801b19c8, cache ff5ffffffffe8980 exp size 128 work
>>>>>>> ffffffff801b19c8
>>>>>>
>>>>>> The extra 8 bytes are not corrupted. Both of these bpf_mem_cache
>>>>>> entries are valid, and they are in the cache array defined in
>>>>>> bpf_mem_caches. The BPF memory allocator allocated the pointer from
>>>>>> the 96-byte sized cache, but tried to free it through the 128-byte
>>>>>> sized cache.
>>>>>>
>>>>>> Now I suspect there is no 96-bytes slab in your system and ksize(ptr -
>>>>>> LLIST_NODE_SZ) returns 128, so bpf memory allocator selected the
>>>>>> 128-byte sized-cache instead of 96-bytes sized-cache. Could you please
>>>>>> check the value of KMALLOC_MIN_SIZE in your kernel .config and
>>>>>> using the
>>>>>> following command to check whether there is 96-bytes slab in your
>>>>>> system:
>>>>>
>>>>> KMALLOC_MIN_SIZE is 64.
>>>>>
>>>>>> $ cat /proc/slabinfo |grep kmalloc-96
>>>>>> dma-kmalloc-96 0 0 96 42 1 : tunables 0 0
>>>>>> 0 : slabdata 0 0 0
>>>>>> kmalloc-96 1865 2268 96 42 1 : tunables 0 0
>>>>>> 0 : slabdata 54 54 0
>>>>>>
>>>>>> In my system, slab has a 96-byte cache, so grep outputs something,
>>>>>> but I think there will be no output in your system.
>>>>>
>>>>> You're right! No kmalloc-96.
>>>>
>>>> To get rid of the warning, limit available sizes from
>>>> bpf_mem_alloc_init()?
>>
>> It is not enough. We need to adjust size_index accordingly during
>> initialization. Could you please try the attached patch below? It is
>> not a formal patch, and I am considering disabling prefilling for these
>> redirected bpf_mem_caches.
>
> Sorry for the slow response; I'll take it for a spin today.
The warning is indeed gone with this fix.
Björn
* Re: WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342
2023-08-28 13:57 ` Hou Tao
2023-08-29 0:54 ` Yonghong Song
2023-08-29 7:26 ` Björn Töpel
@ 2023-08-29 15:26 ` Alexei Starovoitov
2023-08-30 12:08 ` Hou Tao
2 siblings, 1 reply; 31+ messages in thread
From: Alexei Starovoitov @ 2023-08-29 15:26 UTC (permalink / raw)
To: Hou Tao; +Cc: Yonghong Song, Björn Töpel, bpf, linux-riscv
On Mon, Aug 28, 2023 at 6:57 AM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 8/27/2023 10:53 PM, Yonghong Song wrote:
> >
> >
> > On 8/27/23 1:37 AM, Björn Töpel wrote:
> >> Björn Töpel <bjorn@kernel.org> writes:
> >>
> >>> Hou Tao <houtao@huaweicloud.com> writes:
> >>>
> >>>> Hi,
> >>>>
> >>>> On 8/26/2023 5:23 PM, Björn Töpel wrote:
> >>>>> Hou Tao <houtao@huaweicloud.com> writes:
> >>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> On 8/25/2023 11:28 PM, Yonghong Song wrote:
> >>>>>>>
> >>>>>>> On 8/25/23 3:32 AM, Björn Töpel wrote:
> >>>>>>>> I'm chasing a workqueue hang on RISC-V/qemu (TCG), using the bpf
> >>>>>>>> selftests on bpf-next 9e3b47abeb8f.
> >>>>>>>>
> >>>>>>>> I'm able to reproduce the hang by multiple runs of:
> >>>>>>>> | ./test_progs -a link_api -a linked_list
> >>>>>>>> I'm currently investigating that.
> >>>>>>>>
> >>>>>>>> But! Sometimes (every blue moon) I get a warn_on_once hit:
> >>>>>>>> | ------------[ cut here ]------------
> >>>>>>>> | WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342
> >>>>>>>> bpf_mem_refill+0x1fc/0x206
> >>>>>>>> | Modules linked in: bpf_testmod(OE)
> >>>>>>>> | CPU: 3 PID: 261 Comm: test_progs-cpuv Tainted: G OE
> >>>>>>>> N 6.5.0-rc5-01743-gdcb152bb8328 #2
> >>>>>>>> | Hardware name: riscv-virtio,qemu (DT)
> >>>>>>>> | epc : bpf_mem_refill+0x1fc/0x206
> >>>>>>>> | ra : irq_work_single+0x68/0x70
> >>>>>>>> | epc : ffffffff801b1bc4 ra : ffffffff8015fe84 sp :
> >>>>>>>> ff2000000001be20
> >>>>>>>> | gp : ffffffff82d26138 tp : ff6000008477a800 t0 :
> >>>>>>>> 0000000000046600
> >>>>>>>> | t1 : ffffffff812b6ddc t2 : 0000000000000000 s0 :
> >>>>>>>> ff2000000001be70
> >>>>>>>> | s1 : ff5ffffffffe8998 a0 : ff5ffffffffe8998 a1 :
> >>>>>>>> ff600003fef4b000
> >>>>>>>> | a2 : 000000000000003f a3 : ffffffff80008250 a4 :
> >>>>>>>> 0000000000000060
> >>>>>>>> | a5 : 0000000000000080 a6 : 0000000000000000 a7 :
> >>>>>>>> 0000000000735049
> >>>>>>>> | s2 : ff5ffffffffe8998 s3 : 0000000000000022 s4 :
> >>>>>>>> 0000000000001000
> >>>>>>>> | s5 : 0000000000000007 s6 : ff5ffffffffe8570 s7 :
> >>>>>>>> ffffffff82d6bd30
> >>>>>>>> | s8 : 000000000000003f s9 : ffffffff82d2c5e8 s10:
> >>>>>>>> 000000000000ffff
> >>>>>>>> | s11: ffffffff82d2c5d8 t3 : ffffffff81ea8f28 t4 :
> >>>>>>>> 0000000000000000
> >>>>>>>> | t5 : ff6000008fd28278 t6 : 0000000000040000
> >>>>>>>> | status: 0000000200000100 badaddr: 0000000000000000 cause:
> >>>>>>>> 0000000000000003
> >>>>>>>> | [<ffffffff801b1bc4>] bpf_mem_refill+0x1fc/0x206
> >>>>>>>> | [<ffffffff8015fe84>] irq_work_single+0x68/0x70
> >>>>>>>> | [<ffffffff8015feb4>] irq_work_run_list+0x28/0x36
> >>>>>>>> | [<ffffffff8015fefa>] irq_work_run+0x38/0x66
> >>>>>>>> | [<ffffffff8000828a>] handle_IPI+0x3a/0xb4
> >>>>>>>> | [<ffffffff800a5c3a>] handle_percpu_devid_irq+0xa4/0x1f8
> >>>>>>>> | [<ffffffff8009fafa>] generic_handle_domain_irq+0x28/0x36
> >>>>>>>> | [<ffffffff800ae570>] ipi_mux_process+0xac/0xfa
> >>>>>>>> | [<ffffffff8000a8ea>] sbi_ipi_handle+0x2e/0x88
> >>>>>>>> | [<ffffffff8009fafa>] generic_handle_domain_irq+0x28/0x36
> >>>>>>>> | [<ffffffff807ee70e>] riscv_intc_irq+0x36/0x4e
> >>>>>>>> | [<ffffffff812b5d3a>] handle_riscv_irq+0x54/0x86
> >>>>>>>> | [<ffffffff812b6904>] do_irq+0x66/0x98
> >>>>>>>> | ---[ end trace 0000000000000000 ]---
> >>>>>>>>
> >>>>>>>> Code:
> >>>>>>>> | static void free_bulk(struct bpf_mem_cache *c)
> >>>>>>>> | {
> >>>>>>>> | struct bpf_mem_cache *tgt = c->tgt;
> >>>>>>>> | struct llist_node *llnode, *t;
> >>>>>>>> | unsigned long flags;
> >>>>>>>> | int cnt;
> >>>>>>>> |
> >>>>>>>> | WARN_ON_ONCE(tgt->unit_size != c->unit_size);
> >>>>>>>> | ...
> >>>>>>>>
> >>>>>>>> I'm not well versed in the memory allocator; Before I dive into
> >>>>>>>> it --
> >>>>>>>> has anyone else hit it? Ideas on why the warn_on_once is hit?
> >>>>>>> Maybe take a look at the patch
> >>>>>>> 822fb26bdb55 bpf: Add a hint to allocated objects.
> >>>>>>>
> >>>>>>> In the above patch, we have
> >>>>>>>
> >>>>>>> + /*
> >>>>>>> + * Remember bpf_mem_cache that allocated this object.
> >>>>>>> + * The hint is not accurate.
> >>>>>>> + */
> >>>>>>> + c->tgt = *(struct bpf_mem_cache **)llnode;
> >>>>>>>
> >>>>>>> I suspect that the warning may be related to the above.
> >>>>>>> I tried the above ./test_progs command line (running multiple
> >>>>>>> at the same time) and didn't trigger the issue.
> >>>>>> The extra 8-bytes before the freed pointer is used to save the
> >>>>>> pointer
> >>>>>> of the original bpf memory allocator where the freed pointer came
> >>>>>> from,
> >>>>>> so unit_free() could free the pointer back to the original
> >>>>>> allocator to
> >>>>>> prevent alloc-and-free unbalance.
> >>>>>>
> >>>>>> I suspect that a wrong pointer was passed to bpf_obj_drop, but do
> >>>>>> not
> >>>>>> find anything suspicious after checking linked_list. Another
> >>>>>> possibility
> >>>>>> is that there is write-after-free problem which corrupts the extra
> >>>>>> 8-bytes before the freed pointer. Could you please apply the
> >>>>>> following
> >>>>>> debug patch to check whether or not the extra 8-bytes are
> >>>>>> corrupted ?
> >>>>> Thanks for getting back!
> >>>>>
> >>>>> I took your patch for a run, and there's a hit:
> >>>>> | bad cache ff5ffffffffe8570: got size 96 work
> >>>>> ffffffff801b19c8, cache ff5ffffffffe8980 exp size 128 work
> >>>>> ffffffff801b19c8
> >>>>
> >>>> The extra 8 bytes are not corrupted. Both of these bpf_mem_cache
> >>>> entries are valid, and they are in the cache array defined in
> >>>> bpf_mem_caches. The BPF memory allocator allocated the pointer from
> >>>> the 96-byte sized cache, but tried to free it through the 128-byte
> >>>> sized cache.
> >>>>
> >>>> Now I suspect there is no 96-bytes slab in your system and ksize(ptr -
> >>>> LLIST_NODE_SZ) returns 128, so bpf memory allocator selected the
> >>>> 128-byte sized-cache instead of 96-bytes sized-cache. Could you please
> >>>> check the value of KMALLOC_MIN_SIZE in your kernel .config and
> >>>> using the
> >>>> following command to check whether there is 96-bytes slab in your
> >>>> system:
> >>>
> >>> KMALLOC_MIN_SIZE is 64.
> >>>
> >>>> $ cat /proc/slabinfo |grep kmalloc-96
> >>>> dma-kmalloc-96 0 0 96 42 1 : tunables 0 0
> >>>> 0 : slabdata 0 0 0
> >>>> kmalloc-96 1865 2268 96 42 1 : tunables 0 0
> >>>> 0 : slabdata 54 54 0
> >>>>
> >>>> In my system, slab has a 96-byte cache, so grep outputs something,
> >>>> but I think there will be no output in your system.
> >>>
> >>> You're right! No kmalloc-96.
> >>
> >> To get rid of the warning, limit available sizes from
> >> bpf_mem_alloc_init()?
>
> It is not enough. We need to adjust size_index accordingly during
> initialization. Could you please try the attached patch below? It is
> not a formal patch, and I am considering disabling prefilling for these
> redirected bpf_mem_caches.
> >
> > Do you know why your system does not have kmalloc-96?
>
> According to the implementation of setup_kmalloc_cache_index_table() and
> create_kmalloc_caches(), when KMALLOC_MIN_SIZE is 64 or larger,
> kmalloc-96 is omitted. When KMALLOC_MIN_SIZE is 128 or larger,
> kmalloc-192 is omitted as well.
Great catch. The fix looks good.
Please submit it officially and add an error check to bpf_mem_alloc_init()
that verifies that ksize() matches the expectations.
The alternative is to use kmalloc_size_roundup() during alloc for
checking instead of ksize().
Technically we can use kmalloc_size_roundup in unit_alloc() and avoid
setup_kmalloc_cache_index_table()-like copy paste, but performance
overhead might be too high.
So your patch + error check at bpf_mem_alloc_init() is preferred.
* Re: WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342
2023-08-29 15:26 ` Alexei Starovoitov
@ 2023-08-30 12:08 ` Hou Tao
2023-08-30 21:05 ` Alexei Starovoitov
0 siblings, 1 reply; 31+ messages in thread
From: Hou Tao @ 2023-08-30 12:08 UTC (permalink / raw)
To: Alexei Starovoitov; +Cc: Yonghong Song, Björn Töpel, bpf, linux-riscv
Hi,
On 8/29/2023 11:26 PM, Alexei Starovoitov wrote:
> On Mon, Aug 28, 2023 at 6:57 AM Hou Tao <houtao@huaweicloud.com> wrote:
>> Hi,
>>
>> On 8/27/2023 10:53 PM, Yonghong Song wrote:
>>>
>>> On 8/27/23 1:37 AM, Björn Töpel wrote:
>>>> Björn Töpel <bjorn@kernel.org> writes:
>>>>
>>>>> Hou Tao <houtao@huaweicloud.com> writes:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> On 8/26/2023 5:23 PM, Björn Töpel wrote:
>>>>>>> Hou Tao <houtao@huaweicloud.com> writes:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On 8/25/2023 11:28 PM, Yonghong Song wrote:
>>>>>>>>> On 8/25/23 3:32 AM, Björn Töpel wrote:
>>>>>>>>>> I'm chasing a workqueue hang on RISC-V/qemu (TCG), using the bpf
>>>>>>>>>> selftests on bpf-next 9e3b47abeb8f.
>>>>>>>>>>
>>>>>>>>>> I'm able to reproduce the hang by multiple runs of:
>>>>>>>>>> | ./test_progs -a link_api -a linked_list
>>>>>>>>>> I'm currently investigating that.
>>>>>>>>>>
>>>>>>>>>> But! Sometimes (every blue moon) I get a warn_on_once hit:
>>>>>>>>>> | ------------[ cut here ]------------
>>>>>>>>>> | WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342
>>>>>>>>>> bpf_mem_refill+0x1fc/0x206
>>>>>>>>>> | Modules linked in: bpf_testmod(OE)
>>>>>>>>>> | CPU: 3 PID: 261 Comm: test_progs-cpuv Tainted: G OE
>>>>>>>>>> N 6.5.0-rc5-01743-gdcb152bb8328 #2
>>>>>>>>>> | Hardware name: riscv-virtio,qemu (DT)
>>>>>>>>>> | epc : bpf_mem_refill+0x1fc/0x206
>>>>>>>>>> | ra : irq_work_single+0x68/0x70
>>>>>>>>>> | epc : ffffffff801b1bc4 ra : ffffffff8015fe84 sp :
>>>>>>>>>> ff2000000001be20
>>>>>>>>>> | gp : ffffffff82d26138 tp : ff6000008477a800 t0 :
>>>>>>>>>> 0000000000046600
>>>>>>>>>> | t1 : ffffffff812b6ddc t2 : 0000000000000000 s0 :
>>>>>>>>>> ff2000000001be70
>>>>>>>>>> | s1 : ff5ffffffffe8998 a0 : ff5ffffffffe8998 a1 :
>>>>>>>>>> ff600003fef4b000
>>>>>>>>>> | a2 : 000000000000003f a3 : ffffffff80008250 a4 :
>>>>>>>>>> 0000000000000060
>>>>>>>>>> | a5 : 0000000000000080 a6 : 0000000000000000 a7 :
>>>>>>>>>> 0000000000735049
>>>>>>>>>> | s2 : ff5ffffffffe8998 s3 : 0000000000000022 s4 :
>>>>>>>>>> 0000000000001000
>>>>>>>>>> | s5 : 0000000000000007 s6 : ff5ffffffffe8570 s7 :
>>>>>>>>>> ffffffff82d6bd30
>>>>>>>>>> | s8 : 000000000000003f s9 : ffffffff82d2c5e8 s10:
>>>>>>>>>> 000000000000ffff
>>>>>>>>>> | s11: ffffffff82d2c5d8 t3 : ffffffff81ea8f28 t4 :
>>>>>>>>>> 0000000000000000
>>>>>>>>>> | t5 : ff6000008fd28278 t6 : 0000000000040000
>>>>>>>>>> | status: 0000000200000100 badaddr: 0000000000000000 cause:
>>>>>>>>>> 0000000000000003
>>>>>>>>>> | [<ffffffff801b1bc4>] bpf_mem_refill+0x1fc/0x206
>>>>>>>>>> | [<ffffffff8015fe84>] irq_work_single+0x68/0x70
>>>>>>>>>> | [<ffffffff8015feb4>] irq_work_run_list+0x28/0x36
>>>>>>>>>> | [<ffffffff8015fefa>] irq_work_run+0x38/0x66
>>>>>>>>>> | [<ffffffff8000828a>] handle_IPI+0x3a/0xb4
>>>>>>>>>> | [<ffffffff800a5c3a>] handle_percpu_devid_irq+0xa4/0x1f8
>>>>>>>>>> | [<ffffffff8009fafa>] generic_handle_domain_irq+0x28/0x36
>>>>>>>>>> | [<ffffffff800ae570>] ipi_mux_process+0xac/0xfa
>>>>>>>>>> | [<ffffffff8000a8ea>] sbi_ipi_handle+0x2e/0x88
>>>>>>>>>> | [<ffffffff8009fafa>] generic_handle_domain_irq+0x28/0x36
>>>>>>>>>> | [<ffffffff807ee70e>] riscv_intc_irq+0x36/0x4e
>>>>>>>>>> | [<ffffffff812b5d3a>] handle_riscv_irq+0x54/0x86
>>>>>>>>>> | [<ffffffff812b6904>] do_irq+0x66/0x98
>>>>>>>>>> | ---[ end trace 0000000000000000 ]---
>>>>>>>>>>
>>>>>>>>>> Code:
>>>>>>>>>> | static void free_bulk(struct bpf_mem_cache *c)
>>>>>>>>>> | {
>>>>>>>>>> | struct bpf_mem_cache *tgt = c->tgt;
>>>>>>>>>> | struct llist_node *llnode, *t;
>>>>>>>>>> | unsigned long flags;
>>>>>>>>>> | int cnt;
>>>>>>>>>> |
>>>>>>>>>> | WARN_ON_ONCE(tgt->unit_size != c->unit_size);
>>>>>>>>>> | ...
>>>>>>>>>>
>>>>>>>>>> I'm not well versed in the memory allocator; Before I dive into
>>>>>>>>>> it --
>>>>>>>>>> has anyone else hit it? Ideas on why the warn_on_once is hit?
>>>>>>>>> Maybe take a look at the patch
>>>>>>>>> 822fb26bdb55 bpf: Add a hint to allocated objects.
>>>>>>>>>
>>>>>>>>> In the above patch, we have
>>>>>>>>>
>>>>>>>>> + /*
>>>>>>>>> + * Remember bpf_mem_cache that allocated this object.
>>>>>>>>> + * The hint is not accurate.
>>>>>>>>> + */
>>>>>>>>> + c->tgt = *(struct bpf_mem_cache **)llnode;
>>>>>>>>>
>>>>>>>>> I suspect that the warning may be related to the above.
>>>>>>>>> I tried the above ./test_progs command line (running multiple
>>>>>>>>> at the same time) and didn't trigger the issue.
>>>>>>>> The extra 8-bytes before the freed pointer is used to save the
>>>>>>>> pointer
>>>>>>>> of the original bpf memory allocator where the freed pointer came
>>>>>>>> from,
>>>>>>>> so unit_free() could free the pointer back to the original
>>>>>>>> allocator to
>>>>>>>> prevent alloc-and-free unbalance.
>>>>>>>>
>>>>>>>> I suspect that a wrong pointer was passed to bpf_obj_drop, but do
>>>>>>>> not
>>>>>>>> find anything suspicious after checking linked_list. Another
>>>>>>>> possibility
>>>>>>>> is that there is write-after-free problem which corrupts the extra
>>>>>>>> 8-bytes before the freed pointer. Could you please apply the
>>>>>>>> following
>>>>>>>> debug patch to check whether or not the extra 8-bytes are
>>>>>>>> corrupted ?
>>>>>>> Thanks for getting back!
>>>>>>>
>>>>>>> I took your patch for a run, and there's a hit:
>>>>>>> | bad cache ff5ffffffffe8570: got size 96 work
>>>>>>> ffffffff801b19c8, cache ff5ffffffffe8980 exp size 128 work
>>>>>>> ffffffff801b19c8
>>>>>> The extra 8 bytes are not corrupted. Both of these bpf_mem_cache
>>>>>> entries are valid, and they are in the cache array defined in
>>>>>> bpf_mem_caches. The BPF memory allocator allocated the pointer from
>>>>>> the 96-byte sized cache, but tried to free it through the 128-byte
>>>>>> sized cache.
>>>>>>
>>>>>> Now I suspect there is no 96-bytes slab in your system and ksize(ptr -
>>>>>> LLIST_NODE_SZ) returns 128, so bpf memory allocator selected the
>>>>>> 128-byte sized-cache instead of 96-bytes sized-cache. Could you please
>>>>>> check the value of KMALLOC_MIN_SIZE in your kernel .config and
>>>>>> using the
>>>>>> following command to check whether there is 96-bytes slab in your
>>>>>> system:
>>>>> KMALLOC_MIN_SIZE is 64.
>>>>>
>>>>>> $ cat /proc/slabinfo |grep kmalloc-96
>>>>>> dma-kmalloc-96 0 0 96 42 1 : tunables 0 0
>>>>>> 0 : slabdata 0 0 0
>>>>>> kmalloc-96 1865 2268 96 42 1 : tunables 0 0
>>>>>> 0 : slabdata 54 54 0
>>>>>>
>>>>>> In my system, slab has a 96-byte cache, so grep outputs something,
>>>>>> but I think there will be no output in your system.
>>>>> You're right! No kmalloc-96.
>>>> To get rid of the warning, limit available sizes from
>>>> bpf_mem_alloc_init()?
>> It is not enough. We need to adjust size_index accordingly during
>> initialization. Could you please try the attached patch below? It is
>> not a formal patch, and I am considering disabling prefilling for these
>> redirected bpf_mem_caches.
>>> Do you know why your system does not have kmalloc-96?
>> According to the implementation of setup_kmalloc_cache_index_table() and
>> create_kmalloc_caches(), when KMALLOC_MIN_SIZE is greater than 64,
>> kmalloc-96 will be omitted. If KMALLOC_MIN_SIZE is greater than 128,
>> kmalloc-192 will be omitted as well.
> Great catch. The fix looks good.
> Please submit it officially and add an error check to bpf_mem_alloc_init()
> that verifies that ksize() matches the expectations.
Do you mean checking that the return values of ksize() for these prefill
objects in free_llist are as expected, right?
> The alternative is to use kmalloc_size_roundup() during alloc for
> checking instead of ksize().
> Technically we can use kmalloc_size_roundup in unit_alloc() and avoid
> setup_kmalloc_cache_index_table()-like copy paste, but performance
> overhead might be too high.
> So your patch + error check at bpf_mem_alloc_init() is preferred.
I see. Using kmalloc_size_roundup() in bpf_mem_alloc() is indeed an
alternative solution. Will post it.
_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv
* Re: WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342
2023-08-29 11:46 ` Björn Töpel
@ 2023-08-30 12:15 ` Hou Tao
0 siblings, 0 replies; 31+ messages in thread
From: Hou Tao @ 2023-08-30 12:15 UTC (permalink / raw)
To: Björn Töpel, yonghong.song, bpf; +Cc: linux-riscv, Alexei Starovoitov
Hi,
On 8/29/2023 7:46 PM, Björn Töpel wrote:
> Björn Töpel <bjorn@kernel.org> writes:
>
>> Hou Tao <houtao@huaweicloud.com> writes:
>>
>>> Hi,
>>>
>>> On 8/27/2023 10:53 PM, Yonghong Song wrote:
>>>>
>>>> On 8/27/23 1:37 AM, Björn Töpel wrote:
>>>>> Björn Töpel <bjorn@kernel.org> writes:
>>>>>
>>>>>> Hou Tao <houtao@huaweicloud.com> writes:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 8/26/2023 5:23 PM, Björn Töpel wrote:
>>>>>>>> Hou Tao <houtao@huaweicloud.com> writes:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> On 8/25/2023 11:28 PM, Yonghong Song wrote:
>>>>>>>>>> On 8/25/23 3:32 AM, Björn Töpel wrote:
>>>>>>>>>>> I'm chasing a workqueue hang on RISC-V/qemu (TCG), using the bpf
>>>>>>>>>>> selftests on bpf-next 9e3b47abeb8f.
>>>>>>>>>>>
>>>>>>>>>>> I'm able to reproduce the hang by multiple runs of:
>>>>>>>>>>> | ./test_progs -a link_api -a linked_list
>>>>>>>>>>> I'm currently investigating that.
>>>>>>>>>>>
>>>>>>>>>>> But! Sometimes (every blue moon) I get a warn_on_once hit:
>>>>>>>>>>> | ------------[ cut here ]------------
>>>>>>>>>>> | WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342
>>>>>>>>>>> bpf_mem_refill+0x1fc/0x206
>>>>>>>>>>> | Modules linked in: bpf_testmod(OE)
>>>>>>>>>>> | CPU: 3 PID: 261 Comm: test_progs-cpuv Tainted: G OE
>>>>>>>>>>> N 6.5.0-rc5-01743-gdcb152bb8328 #2
>>>>>>>>>>> | Hardware name: riscv-virtio,qemu (DT)
>>>>>>>>>>> | epc : bpf_mem_refill+0x1fc/0x206
>>>>>>>>>>> | ra : irq_work_single+0x68/0x70
>>>>>>>>>>> | epc : ffffffff801b1bc4 ra : ffffffff8015fe84 sp :
>>>>>>>>>>> ff2000000001be20
>>>>>>>>>>> | gp : ffffffff82d26138 tp : ff6000008477a800 t0 :
>>>>>>>>>>> 0000000000046600
>>>>>>>>>>> | t1 : ffffffff812b6ddc t2 : 0000000000000000 s0 :
>>>>>>>>>>> ff2000000001be70
>>>>>>>>>>> | s1 : ff5ffffffffe8998 a0 : ff5ffffffffe8998 a1 :
>>>>>>>>>>> ff600003fef4b000
>>>>>>>>>>> | a2 : 000000000000003f a3 : ffffffff80008250 a4 :
>>>>>>>>>>> 0000000000000060
>>>>>>>>>>> | a5 : 0000000000000080 a6 : 0000000000000000 a7 :
>>>>>>>>>>> 0000000000735049
>>>>>>>>>>> | s2 : ff5ffffffffe8998 s3 : 0000000000000022 s4 :
>>>>>>>>>>> 0000000000001000
>>>>>>>>>>> | s5 : 0000000000000007 s6 : ff5ffffffffe8570 s7 :
>>>>>>>>>>> ffffffff82d6bd30
>>>>>>>>>>> | s8 : 000000000000003f s9 : ffffffff82d2c5e8 s10:
>>>>>>>>>>> 000000000000ffff
>>>>>>>>>>> | s11: ffffffff82d2c5d8 t3 : ffffffff81ea8f28 t4 :
>>>>>>>>>>> 0000000000000000
>>>>>>>>>>> | t5 : ff6000008fd28278 t6 : 0000000000040000
>>>>>>>>>>> | status: 0000000200000100 badaddr: 0000000000000000 cause:
>>>>>>>>>>> 0000000000000003
>>>>>>>>>>> | [<ffffffff801b1bc4>] bpf_mem_refill+0x1fc/0x206
>>>>>>>>>>> | [<ffffffff8015fe84>] irq_work_single+0x68/0x70
>>>>>>>>>>> | [<ffffffff8015feb4>] irq_work_run_list+0x28/0x36
>>>>>>>>>>> | [<ffffffff8015fefa>] irq_work_run+0x38/0x66
>>>>>>>>>>> | [<ffffffff8000828a>] handle_IPI+0x3a/0xb4
>>>>>>>>>>> | [<ffffffff800a5c3a>] handle_percpu_devid_irq+0xa4/0x1f8
>>>>>>>>>>> | [<ffffffff8009fafa>] generic_handle_domain_irq+0x28/0x36
>>>>>>>>>>> | [<ffffffff800ae570>] ipi_mux_process+0xac/0xfa
>>>>>>>>>>> | [<ffffffff8000a8ea>] sbi_ipi_handle+0x2e/0x88
>>>>>>>>>>> | [<ffffffff8009fafa>] generic_handle_domain_irq+0x28/0x36
>>>>>>>>>>> | [<ffffffff807ee70e>] riscv_intc_irq+0x36/0x4e
>>>>>>>>>>> | [<ffffffff812b5d3a>] handle_riscv_irq+0x54/0x86
>>>>>>>>>>> | [<ffffffff812b6904>] do_irq+0x66/0x98
>>>>>>>>>>> | ---[ end trace 0000000000000000 ]---
>>>>>>>>>>>
>>>>>>>>>>> Code:
>>>>>>>>>>> | static void free_bulk(struct bpf_mem_cache *c)
>>>>>>>>>>> | {
>>>>>>>>>>> | struct bpf_mem_cache *tgt = c->tgt;
>>>>>>>>>>> | struct llist_node *llnode, *t;
>>>>>>>>>>> | unsigned long flags;
>>>>>>>>>>> | int cnt;
>>>>>>>>>>> |
>>>>>>>>>>> | WARN_ON_ONCE(tgt->unit_size != c->unit_size);
>>>>>>>>>>> | ...
>>>>>>>>>>>
>>>>>>>>>>> I'm not well versed in the memory allocator; Before I dive into
>>>>>>>>>>> it --
>>>>>>>>>>> has anyone else hit it? Ideas on why the warn_on_once is hit?
>>>>>>>>>> Maybe take a look at the patch
>>>>>>>>>> 822fb26bdb55 bpf: Add a hint to allocated objects.
>>>>>>>>>>
>>>>>>>>>> In the above patch, we have
>>>>>>>>>>
>>>>>>>>>> + /*
>>>>>>>>>> + * Remember bpf_mem_cache that allocated this object.
>>>>>>>>>> + * The hint is not accurate.
>>>>>>>>>> + */
>>>>>>>>>> + c->tgt = *(struct bpf_mem_cache **)llnode;
>>>>>>>>>>
>>>>>>>>>> I suspect that the warning may be related to the above.
>>>>>>>>>> I tried the above ./test_progs command line (running multiple
>>>>>>>>>> at the same time) and didn't trigger the issue.
>>>>>>>>> The extra 8 bytes before the freed pointer are used to save a
>>>>>>>>> pointer to the original bpf_mem_cache from which the freed pointer
>>>>>>>>> came, so unit_free() can free the pointer back to the original
>>>>>>>>> allocator and prevent an alloc/free imbalance.
>>>>>>>>>
>>>>>>>>> I suspected that a wrong pointer was passed to bpf_obj_drop, but I
>>>>>>>>> did not find anything suspicious after checking linked_list. Another
>>>>>>>>> possibility is a write-after-free problem which corrupts the extra
>>>>>>>>> 8 bytes before the freed pointer. Could you please apply the
>>>>>>>>> following debug patch to check whether or not the extra 8 bytes are
>>>>>>>>> corrupted?
>>>>>>>> Thanks for getting back!
>>>>>>>>
>>>>>>>> I took your patch for a run, and there's a hit:
>>>>>>>> | bad cache ff5ffffffffe8570: got size 96 work
>>>>>>>> ffffffff801b19c8, cache ff5ffffffffe8980 exp size 128 work
>>>>>>>> ffffffff801b19c8
>>>>>>> The extra 8 bytes are not corrupted. Both of these bpf_mem_cache
>>>>>>> entries are valid, and both are in the cache array defined in
>>>>>>> bpf_mem_caches. The BPF memory allocator allocated the pointer from
>>>>>>> the 96-byte sized cache, but tried to free it through the 128-byte
>>>>>>> sized cache.
>>>>>>>
>>>>>>> Now I suspect there is no 96-byte slab in your system, and ksize(ptr -
>>>>>>> LLIST_NODE_SZ) returns 128, so the BPF memory allocator selected the
>>>>>>> 128-byte sized cache instead of the 96-byte one. Could you please
>>>>>>> check the value of KMALLOC_MIN_SIZE in your kernel .config, and use
>>>>>>> the following command to check whether there is a 96-byte slab in
>>>>>>> your system:
>>>>>> KMALLOC_MIN_SIZE is 64.
>>>>>>
>>>>>>> $ cat /proc/slabinfo |grep kmalloc-96
>>>>>>> dma-kmalloc-96 0 0 96 42 1 : tunables 0 0
>>>>>>> 0 : slabdata 0 0 0
>>>>>>> kmalloc-96 1865 2268 96 42 1 : tunables 0 0
>>>>>>> 0 : slabdata 54 54 0
>>>>>>>
>>>>>>> In my system, slab has a 96-byte cache, so grep outputs something,
>>>>>>> but I think there will be no output in your system.
>>>>>> You're right! No kmalloc-96.
>>>>> To get rid of the warning, limit available sizes from
>>>>> bpf_mem_alloc_init()?
>>> That is not enough. We also need to adjust size_index accordingly during
>>> initialization. Could you please try the attached patch below? It is
>>> not a formal patch, and I am considering disabling prefilling for these
>>> redirected bpf_mem_caches.
>> Sorry for the slow response; I'll take it for a spin today.
Thanks for the test.
> Hmm, on a related note, RISC-V has this change queued up [1], which will
> introduce -96 and friends. Are there any other archs supporting the BPF
> allocator where this is a concern? If not, I suggest simply leaving the
> code as is.
Thanks for the information. I am not a micro-architecture expert, but it
seems arm32 sets KMALLOC_MIN_SIZE to 64 or 128. Besides the arch-related
setting, if we switch from SLUB to the deprecated SLAB, a similar
problem will be reported, because CONFIG_SLAB sets KMALLOC_MIN_SIZE
to 32 (KMALLOC_SHIFT_LOW is 5).
>
>
> Björn
>
> [1] https://lore.kernel.org/all/20230718152214.2907-1-jszhang@kernel.org/
* Re: WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342
2023-08-30 12:08 ` Hou Tao
@ 2023-08-30 21:05 ` Alexei Starovoitov
0 siblings, 0 replies; 31+ messages in thread
From: Alexei Starovoitov @ 2023-08-30 21:05 UTC (permalink / raw)
To: Hou Tao; +Cc: Yonghong Song, Björn Töpel, bpf, linux-riscv
On Wed, Aug 30, 2023 at 5:09 AM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 8/29/2023 11:26 PM, Alexei Starovoitov wrote:
> > On Mon, Aug 28, 2023 at 6:57 AM Hou Tao <houtao@huaweicloud.com> wrote:
> >> Hi,
> >>
> >> On 8/27/2023 10:53 PM, Yonghong Song wrote:
> >>>
> >>> On 8/27/23 1:37 AM, Björn Töpel wrote:
> >>>> Björn Töpel <bjorn@kernel.org> writes:
> >>>>
> >>>>> Hou Tao <houtao@huaweicloud.com> writes:
> >>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> On 8/26/2023 5:23 PM, Björn Töpel wrote:
> >>>>>>> Hou Tao <houtao@huaweicloud.com> writes:
> >>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> On 8/25/2023 11:28 PM, Yonghong Song wrote:
> >>>>>>>>> On 8/25/23 3:32 AM, Björn Töpel wrote:
> >>>>>>>>>> I'm chasing a workqueue hang on RISC-V/qemu (TCG), using the bpf
> >>>>>>>>>> selftests on bpf-next 9e3b47abeb8f.
> >>>>>>>>>>
> >>>>>>>>>> I'm able to reproduce the hang by multiple runs of:
> >>>>>>>>>> | ./test_progs -a link_api -a linked_list
> >>>>>>>>>> I'm currently investigating that.
> >>>>>>>>>>
> >>>>>>>>>> But! Sometimes (every blue moon) I get a warn_on_once hit:
> >>>>>>>>>> | ------------[ cut here ]------------
> >>>>>>>>>> | WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342
> >>>>>>>>>> bpf_mem_refill+0x1fc/0x206
> >>>>>>>>>> | Modules linked in: bpf_testmod(OE)
> >>>>>>>>>> | CPU: 3 PID: 261 Comm: test_progs-cpuv Tainted: G OE
> >>>>>>>>>> N 6.5.0-rc5-01743-gdcb152bb8328 #2
> >>>>>>>>>> | Hardware name: riscv-virtio,qemu (DT)
> >>>>>>>>>> | epc : bpf_mem_refill+0x1fc/0x206
> >>>>>>>>>> | ra : irq_work_single+0x68/0x70
> >>>>>>>>>> | epc : ffffffff801b1bc4 ra : ffffffff8015fe84 sp :
> >>>>>>>>>> ff2000000001be20
> >>>>>>>>>> | gp : ffffffff82d26138 tp : ff6000008477a800 t0 :
> >>>>>>>>>> 0000000000046600
> >>>>>>>>>> | t1 : ffffffff812b6ddc t2 : 0000000000000000 s0 :
> >>>>>>>>>> ff2000000001be70
> >>>>>>>>>> | s1 : ff5ffffffffe8998 a0 : ff5ffffffffe8998 a1 :
> >>>>>>>>>> ff600003fef4b000
> >>>>>>>>>> | a2 : 000000000000003f a3 : ffffffff80008250 a4 :
> >>>>>>>>>> 0000000000000060
> >>>>>>>>>> | a5 : 0000000000000080 a6 : 0000000000000000 a7 :
> >>>>>>>>>> 0000000000735049
> >>>>>>>>>> | s2 : ff5ffffffffe8998 s3 : 0000000000000022 s4 :
> >>>>>>>>>> 0000000000001000
> >>>>>>>>>> | s5 : 0000000000000007 s6 : ff5ffffffffe8570 s7 :
> >>>>>>>>>> ffffffff82d6bd30
> >>>>>>>>>> | s8 : 000000000000003f s9 : ffffffff82d2c5e8 s10:
> >>>>>>>>>> 000000000000ffff
> >>>>>>>>>> | s11: ffffffff82d2c5d8 t3 : ffffffff81ea8f28 t4 :
> >>>>>>>>>> 0000000000000000
> >>>>>>>>>> | t5 : ff6000008fd28278 t6 : 0000000000040000
> >>>>>>>>>> | status: 0000000200000100 badaddr: 0000000000000000 cause:
> >>>>>>>>>> 0000000000000003
> >>>>>>>>>> | [<ffffffff801b1bc4>] bpf_mem_refill+0x1fc/0x206
> >>>>>>>>>> | [<ffffffff8015fe84>] irq_work_single+0x68/0x70
> >>>>>>>>>> | [<ffffffff8015feb4>] irq_work_run_list+0x28/0x36
> >>>>>>>>>> | [<ffffffff8015fefa>] irq_work_run+0x38/0x66
> >>>>>>>>>> | [<ffffffff8000828a>] handle_IPI+0x3a/0xb4
> >>>>>>>>>> | [<ffffffff800a5c3a>] handle_percpu_devid_irq+0xa4/0x1f8
> >>>>>>>>>> | [<ffffffff8009fafa>] generic_handle_domain_irq+0x28/0x36
> >>>>>>>>>> | [<ffffffff800ae570>] ipi_mux_process+0xac/0xfa
> >>>>>>>>>> | [<ffffffff8000a8ea>] sbi_ipi_handle+0x2e/0x88
> >>>>>>>>>> | [<ffffffff8009fafa>] generic_handle_domain_irq+0x28/0x36
> >>>>>>>>>> | [<ffffffff807ee70e>] riscv_intc_irq+0x36/0x4e
> >>>>>>>>>> | [<ffffffff812b5d3a>] handle_riscv_irq+0x54/0x86
> >>>>>>>>>> | [<ffffffff812b6904>] do_irq+0x66/0x98
> >>>>>>>>>> | ---[ end trace 0000000000000000 ]---
> >>>>>>>>>>
> >>>>>>>>>> Code:
> >>>>>>>>>> | static void free_bulk(struct bpf_mem_cache *c)
> >>>>>>>>>> | {
> >>>>>>>>>> | struct bpf_mem_cache *tgt = c->tgt;
> >>>>>>>>>> | struct llist_node *llnode, *t;
> >>>>>>>>>> | unsigned long flags;
> >>>>>>>>>> | int cnt;
> >>>>>>>>>> |
> >>>>>>>>>> | WARN_ON_ONCE(tgt->unit_size != c->unit_size);
> >>>>>>>>>> | ...
> >>>>>>>>>>
> >>>>>>>>>> I'm not well versed in the memory allocator; Before I dive into
> >>>>>>>>>> it --
> >>>>>>>>>> has anyone else hit it? Ideas on why the warn_on_once is hit?
> >>>>>>>>> Maybe take a look at the patch
> >>>>>>>>> 822fb26bdb55 bpf: Add a hint to allocated objects.
> >>>>>>>>>
> >>>>>>>>> In the above patch, we have
> >>>>>>>>>
> >>>>>>>>> + /*
> >>>>>>>>> + * Remember bpf_mem_cache that allocated this object.
> >>>>>>>>> + * The hint is not accurate.
> >>>>>>>>> + */
> >>>>>>>>> + c->tgt = *(struct bpf_mem_cache **)llnode;
> >>>>>>>>>
> >>>>>>>>> I suspect that the warning may be related to the above.
> >>>>>>>>> I tried the above ./test_progs command line (running multiple
> >>>>>>>>> at the same time) and didn't trigger the issue.
> >>>>>>>> The extra 8 bytes before the freed pointer are used to save a
> >>>>>>>> pointer to the original bpf_mem_cache from which the freed pointer
> >>>>>>>> came, so unit_free() can free the pointer back to the original
> >>>>>>>> allocator and prevent an alloc/free imbalance.
> >>>>>>>>
> >>>>>>>> I suspected that a wrong pointer was passed to bpf_obj_drop, but I
> >>>>>>>> did not find anything suspicious after checking linked_list. Another
> >>>>>>>> possibility is a write-after-free problem which corrupts the extra
> >>>>>>>> 8 bytes before the freed pointer. Could you please apply the
> >>>>>>>> following debug patch to check whether or not the extra 8 bytes are
> >>>>>>>> corrupted?
> >>>>>>> Thanks for getting back!
> >>>>>>>
> >>>>>>> I took your patch for a run, and there's a hit:
> >>>>>>> | bad cache ff5ffffffffe8570: got size 96 work
> >>>>>>> ffffffff801b19c8, cache ff5ffffffffe8980 exp size 128 work
> >>>>>>> ffffffff801b19c8
> >>>>>> The extra 8 bytes are not corrupted. Both of these bpf_mem_cache
> >>>>>> entries are valid, and both are in the cache array defined in
> >>>>>> bpf_mem_caches. The BPF memory allocator allocated the pointer from
> >>>>>> the 96-byte sized cache, but tried to free it through the 128-byte
> >>>>>> sized cache.
> >>>>>>
> >>>>>> Now I suspect there is no 96-byte slab in your system, and ksize(ptr -
> >>>>>> LLIST_NODE_SZ) returns 128, so the BPF memory allocator selected the
> >>>>>> 128-byte sized cache instead of the 96-byte one. Could you please
> >>>>>> check the value of KMALLOC_MIN_SIZE in your kernel .config, and use
> >>>>>> the following command to check whether there is a 96-byte slab in
> >>>>>> your system:
> >>>>> KMALLOC_MIN_SIZE is 64.
> >>>>>
> >>>>>> $ cat /proc/slabinfo |grep kmalloc-96
> >>>>>> dma-kmalloc-96 0 0 96 42 1 : tunables 0 0
> >>>>>> 0 : slabdata 0 0 0
> >>>>>> kmalloc-96 1865 2268 96 42 1 : tunables 0 0
> >>>>>> 0 : slabdata 54 54 0
> >>>>>>
> >>>>>> In my system, slab has a 96-byte cache, so grep outputs something,
> >>>>>> but I think there will be no output in your system.
> >>>>> You're right! No kmalloc-96.
> >>>> To get rid of the warning, limit available sizes from
> >>>> bpf_mem_alloc_init()?
> >> That is not enough. We also need to adjust size_index accordingly during
> >> initialization. Could you please try the attached patch below? It is
> >> not a formal patch, and I am considering disabling prefilling for these
> >> redirected bpf_mem_caches.
> >>> Do you know why your system does not have kmalloc-96?
> >> According to the implementation of setup_kmalloc_cache_index_table() and
> >> create_kmalloc_caches(), when KMALLOC_MIN_SIZE is greater than 64,
> >> kmalloc-96 will be omitted. If KMALLOC_MIN_SIZE is greater than 128,
> >> kmalloc-192 will be omitted as well.
> > Great catch. The fix looks good.
> > Please submit it officially and add an error check to bpf_mem_alloc_init()
> > that verifies that ksize() matches the expectations.
>
> Do you mean checking that the return values of ksize() for these prefill
> objects in free_llist are as expected, right?
I'd like to avoid adding extra flags to alloc_bulk() and passing them along.
Instead, prefill_mem_cache() can peek at the 1st element after alloc_bulk()
and check its ksize.
> > The alternative is to use kmalloc_size_roundup() during alloc for
> > checking instead of ksize().
> > Technically we can use kmalloc_size_roundup in unit_alloc() and avoid
> > setup_kmalloc_cache_index_table()-like copy paste, but performance
> > overhead might be too high.
> > So your patch + error check at bpf_mem_alloc_init() is preferred.
> I see. Using kmalloc_size_roundup() in bpf_mem_alloc() is indeed an
> alternative solution. Will post it.
No need. I think perf degradation for a corner case is prohibitive.
End of thread, other threads: [~2023-08-30 21:05 UTC | newest]
Thread overview: 31+ messages
-- links below jump to the message on this page --
2023-08-25 10:32 WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342 Björn Töpel
2023-08-25 15:28 ` Yonghong Song
2023-08-25 18:53 ` Alexei Starovoitov
2023-08-25 19:49 ` Alexei Starovoitov
2023-08-25 21:31 ` Andrii Nakryiko
2023-08-26 22:49 ` Kumar Kartikeya Dwivedi
2023-08-26 3:48 ` Hou Tao
2023-08-26 9:23 ` Björn Töpel
2023-08-26 10:27 ` Hou Tao
2023-08-26 10:49 ` Björn Töpel
2023-08-27 8:37 ` Björn Töpel
2023-08-27 14:53 ` Yonghong Song
2023-08-28 13:57 ` Hou Tao
2023-08-29 0:54 ` Yonghong Song
2023-08-29 7:26 ` Björn Töpel
2023-08-29 11:46 ` Björn Töpel
2023-08-30 12:15 ` Hou Tao
2023-08-29 12:54 ` Björn Töpel
2023-08-29 15:26 ` Alexei Starovoitov
2023-08-30 12:08 ` Hou Tao
2023-08-30 21:05 ` Alexei Starovoitov
2023-08-26 13:44 ` RISC-V uprobe bug (Was: Re: WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342) Björn Töpel
2023-08-26 18:12 ` Nam Cao
2023-08-26 18:31 ` Nam Cao
2023-08-27 8:11 ` Björn Töpel
2023-08-27 8:35 ` Nam Cao
2023-08-27 9:04 ` Björn Töpel
2023-08-27 9:39 ` Nam Cao
2023-08-27 19:20 ` Björn Töpel
2023-08-27 19:41 ` Nam Cao
2023-08-27 20:15 ` Nam Cao