From: "Toke Høiland-Jørgensen" <toke@kernel.org>
To: Hou Tao <houtao@huaweicloud.com>, bpf@vger.kernel.org
Cc: Martin KaFai Lau <martin.lau@linux.dev>,
Alexei Starovoitov <alexei.starovoitov@gmail.com>,
Andrii Nakryiko <andrii@kernel.org>,
Eduard Zingerman <eddyz87@gmail.com>, Song Liu <song@kernel.org>,
Hao Luo <haoluo@google.com>,
Yonghong Song <yonghong.song@linux.dev>,
Daniel Borkmann <daniel@iogearbox.net>,
KP Singh <kpsingh@kernel.org>,
Stanislav Fomichev <sdf@fomichev.me>,
Jiri Olsa <jolsa@kernel.org>,
John Fastabend <john.fastabend@gmail.com>,
Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
houtao1@huawei.com, xukuohai@huawei.com
Subject: Re: [PATCH bpf-next v3 4/5] bpf: Cancel the running bpf_timer through kworker for PREEMPT_RT
Date: Fri, 17 Jan 2025 13:40:56 +0100 [thread overview]
Message-ID: <87ldv9obon.fsf@toke.dk> (raw)
In-Reply-To: <20250117101816.2101857-5-houtao@huaweicloud.com>
Hou Tao <houtao@huaweicloud.com> writes:
> From: Hou Tao <houtao1@huawei.com>
>
> During the update procedure, when overwrite element in a pre-allocated
> htab, the freeing of old_element is protected by the bucket lock. The
> reason why the bucket lock is necessary is that the old_element has
> already been stashed in htab->extra_elems after alloc_htab_elem()
> returns. If freeing the old_element after the bucket lock is unlocked,
> the stashed element may be reused by concurrent update procedure and the
> freeing of old_element will run concurrently with the reuse of the
> old_element. However, the invocation of check_and_free_fields() may
> acquire a spin-lock which violates the lockdep rule because its caller
> has already held a raw-spin-lock (bucket lock). The following warning
> will be reported when such race happens:
>
> BUG: scheduling while atomic: test_progs/676/0x00000003
> 3 locks held by test_progs/676:
> #0: ffffffff864b0240 (rcu_read_lock_trace){....}-{0:0}, at: bpf_prog_test_run_syscall+0x2c0/0x830
> #1: ffff88810e961188 (&htab->lockdep_key){....}-{2:2}, at: htab_map_update_elem+0x306/0x1500
> #2: ffff8881f4eac1b8 (&base->softirq_expiry_lock){....}-{2:2}, at: hrtimer_cancel_wait_running+0xe9/0x1b0
> Modules linked in: bpf_testmod(O)
> Preemption disabled at:
> [<ffffffff817837a3>] htab_map_update_elem+0x293/0x1500
> CPU: 0 UID: 0 PID: 676 Comm: test_progs Tainted: G ... 6.12.0+ #11
> Tainted: [W]=WARN, [O]=OOT_MODULE
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)...
> Call Trace:
> <TASK>
> dump_stack_lvl+0x57/0x70
> dump_stack+0x10/0x20
> __schedule_bug+0x120/0x170
> __schedule+0x300c/0x4800
> schedule_rtlock+0x37/0x60
> rtlock_slowlock_locked+0x6d9/0x54c0
> rt_spin_lock+0x168/0x230
> hrtimer_cancel_wait_running+0xe9/0x1b0
> hrtimer_cancel+0x24/0x30
> bpf_timer_delete_work+0x1d/0x40
> bpf_timer_cancel_and_free+0x5e/0x80
> bpf_obj_free_fields+0x262/0x4a0
> check_and_free_fields+0x1d0/0x280
> htab_map_update_elem+0x7fc/0x1500
> bpf_prog_9f90bc20768e0cb9_overwrite_cb+0x3f/0x43
> bpf_prog_ea601c4649694dbd_overwrite_timer+0x5d/0x7e
> bpf_prog_test_run_syscall+0x322/0x830
> __sys_bpf+0x135d/0x3ca0
> __x64_sys_bpf+0x75/0xb0
> x64_sys_call+0x1b5/0xa10
> do_syscall_64+0x3b/0xc0
> entry_SYSCALL_64_after_hwframe+0x4b/0x53
> ...
> </TASK>
>
> It seems feasible to break the reuse and refill of per-cpu extra_elems
> into two independent parts: reuse the per-cpu extra_elems with bucket
> lock being held and refill the old_element as per-cpu extra_elems after
> the bucket lock is unlocked. However, it will make the concurrent
> overwrite procedures on the same CPU return unexpected -E2BIG error when
> the map is full.
>
> Therefore, the patch fixes the lock problem by breaking the cancelling
> of bpf_timer into two steps for PREEMPT_RT:
> 1) use hrtimer_try_to_cancel() and check its return value
> 2) if the timer is running, use hrtimer_cancel() through a kworker to
> cancel it again
> Considering that the current implementation of hrtimer_cancel() will try
> to acquire a being held softirq_expiry_lock when the current timer is
> running, these steps above are reasonable. However, it also has
> downside. When the timer is running, the cancelling of the timer is
> delayed when releasing the last map uref. The delay is also fixable
> (e.g., break the cancelling of bpf timer into two parts: one part in
> locked scope, another one in unlocked scope), it can be revised later if
> necessary.
>
> It is a bit hard to decide the right fix tag. One reason is that the
> problem depends on PREEMPT_RT which is enabled in v6.12. Considering the
> softirq_expiry_lock lock exists since v5.4 and bpf_timer is introduced
> in v5.15, the bpf_timer commit is used in the fixes tag and an extra
> depends-on tag is added to state the dependency on PREEMPT_RT.
>
> Fixes: b00628b1c7d5 ("bpf: Introduce bpf timers.")
> Depends-on: v6.12+ with PREEMPT_RT enabled
> Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Closes: https://lore.kernel.org/bpf/20241106084527.4gPrMnHt@linutronix.de
> Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@kernel.org>
next prev parent reply other threads:[~2025-01-17 12:41 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-01-17 10:18 [PATCH bpf-next v3 0/5] Free htab element out of bucket lock Hou Tao
2025-01-17 10:18 ` [PATCH bpf-next v3 1/5] bpf: Free special fields after unlock in htab_lru_map_delete_node() Hou Tao
2025-01-17 12:31 ` Toke Høiland-Jørgensen
2025-01-17 10:18 ` [PATCH bpf-next v3 2/5] bpf: Bail out early in __htab_map_lookup_and_delete_elem() Hou Tao
2025-01-17 12:31 ` Toke Høiland-Jørgensen
2025-01-17 10:18 ` [PATCH bpf-next v3 3/5] bpf: Free element after unlock " Hou Tao
2025-01-17 12:35 ` Toke Høiland-Jørgensen
2025-01-20 8:49 ` Hou Tao
2025-01-20 8:52 ` Toke Høiland-Jørgensen
2025-01-21 1:15 ` Hou Tao
2025-01-21 11:04 ` Toke Høiland-Jørgensen
2025-01-17 10:18 ` [PATCH bpf-next v3 4/5] bpf: Cancel the running bpf_timer through kworker for PREEMPT_RT Hou Tao
2025-01-17 12:40 ` Toke Høiland-Jørgensen [this message]
2025-01-17 10:18 ` [PATCH bpf-next v3 5/5] selftests/bpf: Add test case for the freeing of bpf_timer Hou Tao
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87ldv9obon.fsf@toke.dk \
--to=toke@kernel.org \
--cc=alexei.starovoitov@gmail.com \
--cc=andrii@kernel.org \
--cc=bigeasy@linutronix.de \
--cc=bpf@vger.kernel.org \
--cc=daniel@iogearbox.net \
--cc=eddyz87@gmail.com \
--cc=haoluo@google.com \
--cc=houtao1@huawei.com \
--cc=houtao@huaweicloud.com \
--cc=john.fastabend@gmail.com \
--cc=jolsa@kernel.org \
--cc=kpsingh@kernel.org \
--cc=martin.lau@linux.dev \
--cc=sdf@fomichev.me \
--cc=song@kernel.org \
--cc=xukuohai@huawei.com \
--cc=yonghong.song@linux.dev \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox