From: "Toke Høiland-Jørgensen" <toke@redhat.com>
To: Hou Tao <houtao@huaweicloud.com>, bpf@vger.kernel.org
Cc: "Martin KaFai Lau" <martin.lau@linux.dev>,
"Alexei Starovoitov" <alexei.starovoitov@gmail.com>,
"Andrii Nakryiko" <andrii@kernel.org>,
"Eduard Zingerman" <eddyz87@gmail.com>,
"Song Liu" <song@kernel.org>, "Hao Luo" <haoluo@google.com>,
"Yonghong Song" <yonghong.song@linux.dev>,
"Daniel Borkmann" <daniel@iogearbox.net>,
"KP Singh" <kpsingh@kernel.org>,
"Stanislav Fomichev" <sdf@fomichev.me>,
"Jiri Olsa" <jolsa@kernel.org>,
"John Fastabend" <john.fastabend@gmail.com>,
"Sebastian Andrzej Siewior" <bigeasy@linutronix.de>,
"Thomas Gleixner" <tglx@linutronix.de>,
"Thomas Weißschuh" <linux@weissschuh.net>,
houtao1@huawei.com, xukuohai@huawei.com
Subject: Re: [PATCH bpf v2 6/9] bpf: Switch to bpf mem allocator for LPM trie
Date: Fri, 29 Nov 2024 13:01:20 +0100 [thread overview]
Message-ID: <87iks6i6zz.fsf@toke.dk> (raw)
In-Reply-To: <20241127004641.1118269-7-houtao@huaweicloud.com>
Hou Tao <houtao@huaweicloud.com> writes:
> From: Hou Tao <houtao1@huawei.com>
>
> Multiple syzbot warnings have been reported. These warnings are mainly
> about the lock order between trie->lock and kmalloc()'s internal lock.
> See report [1] as an example:
>
> ======================================================
> WARNING: possible circular locking dependency detected
> 6.10.0-rc7-syzkaller-00003-g4376e966ecb7 #0 Not tainted
> ------------------------------------------------------
> syz.3.2069/15008 is trying to acquire lock:
> ffff88801544e6d8 (&n->list_lock){-.-.}-{2:2}, at: get_partial_node ...
>
> but task is already holding lock:
> ffff88802dcc89f8 (&trie->lock){-.-.}-{2:2}, at: trie_update_elem ...
>
> which lock already depends on the new lock.
>
> the existing dependency chain (in reverse order) is:
>
> -> #1 (&trie->lock){-.-.}-{2:2}:
> __raw_spin_lock_irqsave
> _raw_spin_lock_irqsave+0x3a/0x60
> trie_delete_elem+0xb0/0x820
> ___bpf_prog_run+0x3e51/0xabd0
> __bpf_prog_run32+0xc1/0x100
> bpf_dispatcher_nop_func
> ......
> bpf_trace_run2+0x231/0x590
> __bpf_trace_contention_end+0xca/0x110
> trace_contention_end.constprop.0+0xea/0x170
> __pv_queued_spin_lock_slowpath+0x28e/0xcc0
> pv_queued_spin_lock_slowpath
> queued_spin_lock_slowpath
> queued_spin_lock
> do_raw_spin_lock+0x210/0x2c0
> __raw_spin_lock_irqsave
> _raw_spin_lock_irqsave+0x42/0x60
> __put_partials+0xc3/0x170
> qlink_free
> qlist_free_all+0x4e/0x140
> kasan_quarantine_reduce+0x192/0x1e0
> __kasan_slab_alloc+0x69/0x90
> kasan_slab_alloc
> slab_post_alloc_hook
> slab_alloc_node
> kmem_cache_alloc_node_noprof+0x153/0x310
> __alloc_skb+0x2b1/0x380
> ......
>
> -> #0 (&n->list_lock){-.-.}-{2:2}:
> check_prev_add
> check_prevs_add
> validate_chain
> __lock_acquire+0x2478/0x3b30
> lock_acquire
> lock_acquire+0x1b1/0x560
> __raw_spin_lock_irqsave
> _raw_spin_lock_irqsave+0x3a/0x60
> get_partial_node.part.0+0x20/0x350
> get_partial_node
> get_partial
> ___slab_alloc+0x65b/0x1870
> __slab_alloc.constprop.0+0x56/0xb0
> __slab_alloc_node
> slab_alloc_node
> __do_kmalloc_node
> __kmalloc_node_noprof+0x35c/0x440
> kmalloc_node_noprof
> bpf_map_kmalloc_node+0x98/0x4a0
> lpm_trie_node_alloc
> trie_update_elem+0x1ef/0xe00
> bpf_map_update_value+0x2c1/0x6c0
> map_update_elem+0x623/0x910
> __sys_bpf+0x90c/0x49a0
> ...
>
> other info that might help us debug this:
>
> Possible unsafe locking scenario:
>
> CPU0 CPU1
> ---- ----
> lock(&trie->lock);
> lock(&n->list_lock);
> lock(&trie->lock);
> lock(&n->list_lock);
>
> *** DEADLOCK ***
>
> [1]: https://syzkaller.appspot.com/bug?extid=9045c0a3d5a7f1b119f7
>
> A bpf program attached to trace_contention_end() triggers after
> acquiring &n->list_lock. The program invokes trie_delete_elem(), which
> then acquires trie->lock. However, it is possible that another
> process is invoking trie_update_elem(). trie_update_elem() will acquire
> trie->lock first, then invoke kmalloc_node(). kmalloc_node() may invoke
> get_partial_node() and try to acquire &n->list_lock (not necessarily the
> same lock object). Therefore, lockdep warns about the circular locking
> dependency.
>
> Invoking kmalloc() before acquiring trie->lock could fix the warning.
> However, since BPF programs call be invoked from any context (e.g.,
> through kprobe/tracepoint/fentry), there may still be lock ordering
> problems for internal locks in kmalloc() or trie->lock itself.
>
> To eliminate these potential lock ordering problems with kmalloc()'s
> internal locks, replacing kmalloc()/kfree()/kfree_rcu() with equivalent
> BPF memory allocator APIs that can be invoked in any context. The lock
> ordering problems with trie->lock (e.g., reentrance) will be handled
> separately.
>
> Two aspects of this change require explanation:
>
> 1. Intermediate and leaf nodes are allocated from the same allocator.
> The value size of LPM trie is usually small and only use one allocator
> reduces the memory overhead of BPF memory allocator.
>
> 2. nodes are freed before invoking spin_unlock_irqrestore(). Therefore,
> there is no need to add paired migrate_{disable|enable}() calls for
> these free operations.
>
> Signed-off-by: Hou Tao <houtao1@huawei.com>
I agree with Alexei's comments, but otherwise:
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
next prev parent reply other threads:[~2024-11-29 12:01 UTC|newest]
Thread overview: 28+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-11-27 0:46 [PATCH bpf v2 0/9] Fixes for LPM trie Hou Tao
2024-11-27 0:46 ` [PATCH bpf v2 1/9] bpf: Remove unnecessary check when updating " Hou Tao
2024-12-02 16:08 ` Daniel Borkmann
2024-11-27 0:46 ` [PATCH bpf v2 2/9] bpf: Remove unnecessary kfree(im_node) in lpm_trie_update_elem Hou Tao
2024-12-02 16:10 ` Daniel Borkmann
2024-11-27 0:46 ` [PATCH bpf v2 3/9] bpf: Handle BPF_EXIST and BPF_NOEXIST for LPM trie Hou Tao
2024-12-02 17:17 ` Daniel Borkmann
2024-11-27 0:46 ` [PATCH bpf v2 4/9] bpf: Handle in-place update for full LPM trie correctly Hou Tao
2024-11-29 11:45 ` Toke Høiland-Jørgensen
2024-11-27 0:46 ` [PATCH bpf v2 5/9] bpf: Fix exact match conditions in trie_get_next_key() Hou Tao
2024-11-27 0:46 ` [PATCH bpf v2 6/9] bpf: Switch to bpf mem allocator for LPM trie Hou Tao
2024-11-27 5:51 ` Alexei Starovoitov
2024-11-28 4:12 ` Hou Tao
2024-11-29 12:01 ` Toke Høiland-Jørgensen [this message]
2024-11-27 0:46 ` [PATCH bpf v2 7/9] bpf: Use raw_spinlock_t " Hou Tao
2024-11-29 12:18 ` Toke Høiland-Jørgensen
2024-12-03 1:42 ` Alexei Starovoitov
2024-12-05 8:52 ` Hou Tao
2024-12-05 9:47 ` Toke Høiland-Jørgensen
2024-12-15 9:37 ` Hou Tao
2024-12-15 16:51 ` Toke Høiland-Jørgensen
2024-12-05 17:06 ` Alexei Starovoitov
2024-12-06 0:48 ` Hou Tao
2024-12-06 1:40 ` Alexei Starovoitov
2024-11-27 0:46 ` [PATCH bpf v2 8/9] selftests/bpf: Move test_lpm_map.c to map_tests Hou Tao
2024-11-27 0:46 ` [PATCH bpf v2 9/9] selftests/bpf: Add more test cases for LPM trie Hou Tao
2024-11-27 5:39 ` Alexei Starovoitov
2024-11-27 8:02 ` Hou Tao
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87iks6i6zz.fsf@toke.dk \
--to=toke@redhat.com \
--cc=alexei.starovoitov@gmail.com \
--cc=andrii@kernel.org \
--cc=bigeasy@linutronix.de \
--cc=bpf@vger.kernel.org \
--cc=daniel@iogearbox.net \
--cc=eddyz87@gmail.com \
--cc=haoluo@google.com \
--cc=houtao1@huawei.com \
--cc=houtao@huaweicloud.com \
--cc=john.fastabend@gmail.com \
--cc=jolsa@kernel.org \
--cc=kpsingh@kernel.org \
--cc=linux@weissschuh.net \
--cc=martin.lau@linux.dev \
--cc=sdf@fomichev.me \
--cc=song@kernel.org \
--cc=tglx@linutronix.de \
--cc=xukuohai@huawei.com \
--cc=yonghong.song@linux.dev \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.