All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Toke Høiland-Jørgensen" <toke@kernel.org>
To: Hou Tao <houtao@huaweicloud.com>, bpf@vger.kernel.org
Cc: Martin KaFai Lau <martin.lau@linux.dev>,
	Alexei Starovoitov <alexei.starovoitov@gmail.com>,
	Andrii Nakryiko <andrii@kernel.org>,
	Eduard Zingerman <eddyz87@gmail.com>, Song Liu <song@kernel.org>,
	Hao Luo <haoluo@google.com>,
	Yonghong Song <yonghong.song@linux.dev>,
	Daniel Borkmann <daniel@iogearbox.net>,
	KP Singh <kpsingh@kernel.org>,
	Stanislav Fomichev <sdf@fomichev.me>,
	Jiri Olsa <jolsa@kernel.org>,
	John Fastabend <john.fastabend@gmail.com>,
	Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
	houtao1@huawei.com, xukuohai@huawei.com
Subject: Re: [PATCH bpf-next v3 4/5] bpf: Cancel the running bpf_timer through kworker for PREEMPT_RT
Date: Fri, 17 Jan 2025 13:40:56 +0100	[thread overview]
Message-ID: <87ldv9obon.fsf@toke.dk> (raw)
In-Reply-To: <20250117101816.2101857-5-houtao@huaweicloud.com>

Hou Tao <houtao@huaweicloud.com> writes:

> From: Hou Tao <houtao1@huawei.com>
>
> During the update procedure, when overwrite element in a pre-allocated
> htab, the freeing of old_element is protected by the bucket lock. The
> reason why the bucket lock is necessary is that the old_element has
> already been stashed in htab->extra_elems after alloc_htab_elem()
> returns. If freeing the old_element after the bucket lock is unlocked,
> the stashed element may be reused by concurrent update procedure and the
> freeing of old_element will run concurrently with the reuse of the
> old_element. However, the invocation of check_and_free_fields() may
> acquire a spin-lock which violates the lockdep rule because its caller
> has already held a raw-spin-lock (bucket lock). The following warning
> will be reported when such race happens:
>
>   BUG: scheduling while atomic: test_progs/676/0x00000003
>   3 locks held by test_progs/676:
>   #0: ffffffff864b0240 (rcu_read_lock_trace){....}-{0:0}, at: bpf_prog_test_run_syscall+0x2c0/0x830
>   #1: ffff88810e961188 (&htab->lockdep_key){....}-{2:2}, at: htab_map_update_elem+0x306/0x1500
>   #2: ffff8881f4eac1b8 (&base->softirq_expiry_lock){....}-{2:2}, at: hrtimer_cancel_wait_running+0xe9/0x1b0
>   Modules linked in: bpf_testmod(O)
>   Preemption disabled at:
>   [<ffffffff817837a3>] htab_map_update_elem+0x293/0x1500
>   CPU: 0 UID: 0 PID: 676 Comm: test_progs Tainted: G ... 6.12.0+ #11
>   Tainted: [W]=WARN, [O]=OOT_MODULE
>   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)...
>   Call Trace:
>   <TASK>
>   dump_stack_lvl+0x57/0x70
>   dump_stack+0x10/0x20
>   __schedule_bug+0x120/0x170
>   __schedule+0x300c/0x4800
>   schedule_rtlock+0x37/0x60
>   rtlock_slowlock_locked+0x6d9/0x54c0
>   rt_spin_lock+0x168/0x230
>   hrtimer_cancel_wait_running+0xe9/0x1b0
>   hrtimer_cancel+0x24/0x30
>   bpf_timer_delete_work+0x1d/0x40
>   bpf_timer_cancel_and_free+0x5e/0x80
>   bpf_obj_free_fields+0x262/0x4a0
>   check_and_free_fields+0x1d0/0x280
>   htab_map_update_elem+0x7fc/0x1500
>   bpf_prog_9f90bc20768e0cb9_overwrite_cb+0x3f/0x43
>   bpf_prog_ea601c4649694dbd_overwrite_timer+0x5d/0x7e
>   bpf_prog_test_run_syscall+0x322/0x830
>   __sys_bpf+0x135d/0x3ca0
>   __x64_sys_bpf+0x75/0xb0
>   x64_sys_call+0x1b5/0xa10
>   do_syscall_64+0x3b/0xc0
>   entry_SYSCALL_64_after_hwframe+0x4b/0x53
>   ...
>   </TASK>
>
> It seems feasible to break the reuse and refill of per-cpu extra_elems
> into two independent parts: reuse the per-cpu extra_elems with bucket
> lock being held and refill the old_element as per-cpu extra_elems after
> the bucket lock is unlocked. However, it will make the concurrent
> overwrite procedures on the same CPU return unexpected -E2BIG error when
> the map is full.
>
> Therefore, the patch fixes the lock problem by breaking the cancelling
> of bpf_timer into two steps for PREEMPT_RT:
> 1) use hrtimer_try_to_cancel() and check its return value
> 2) if the timer is running, use hrtimer_cancel() through a kworker to
>    cancel it again
> Considering that the current implementation of hrtimer_cancel() will try
> to acquire a being held softirq_expiry_lock when the current timer is
> running, these steps above are reasonable. However, it also has
> downside. When the timer is running, the cancelling of the timer is
> delayed when releasing the last map uref. The delay is also fixable
> (e.g., break the cancelling of bpf timer into two parts: one part in
> locked scope, another one in unlocked scope), it can be revised later if
> necessary.
>
> It is a bit hard to decide the right fix tag. One reason is that the
> problem depends on PREEMPT_RT which is enabled in v6.12. Considering the
> softirq_expiry_lock lock exists since v5.4 and bpf_timer is introduced
> in v5.15, the bpf_timer commit is used in the fixes tag and an extra
> depends-on tag is added to state the dependency on PREEMPT_RT.
>
> Fixes: b00628b1c7d5 ("bpf: Introduce bpf timers.")
> Depends-on: v6.12+ with PREEMPT_RT enabled
> Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Closes: https://lore.kernel.org/bpf/20241106084527.4gPrMnHt@linutronix.de
> Signed-off-by: Hou Tao <houtao1@huawei.com>

Reviewed-by: Toke Høiland-Jørgensen <toke@kernel.org>

  reply	other threads:[~2025-01-17 12:41 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-01-17 10:18 [PATCH bpf-next v3 0/5] Free htab element out of bucket lock Hou Tao
2025-01-17 10:18 ` [PATCH bpf-next v3 1/5] bpf: Free special fields after unlock in htab_lru_map_delete_node() Hou Tao
2025-01-17 12:31   ` Toke Høiland-Jørgensen
2025-01-17 10:18 ` [PATCH bpf-next v3 2/5] bpf: Bail out early in __htab_map_lookup_and_delete_elem() Hou Tao
2025-01-17 12:31   ` Toke Høiland-Jørgensen
2025-01-17 10:18 ` [PATCH bpf-next v3 3/5] bpf: Free element after unlock " Hou Tao
2025-01-17 12:35   ` Toke Høiland-Jørgensen
2025-01-20  8:49     ` Hou Tao
2025-01-20  8:52       ` Toke Høiland-Jørgensen
2025-01-21  1:15         ` Hou Tao
2025-01-21 11:04           ` Toke Høiland-Jørgensen
2025-01-17 10:18 ` [PATCH bpf-next v3 4/5] bpf: Cancel the running bpf_timer through kworker for PREEMPT_RT Hou Tao
2025-01-17 12:40   ` Toke Høiland-Jørgensen [this message]
2025-01-17 10:18 ` [PATCH bpf-next v3 5/5] selftests/bpf: Add test case for the freeing of bpf_timer Hou Tao

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87ldv9obon.fsf@toke.dk \
    --to=toke@kernel.org \
    --cc=alexei.starovoitov@gmail.com \
    --cc=andrii@kernel.org \
    --cc=bigeasy@linutronix.de \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=eddyz87@gmail.com \
    --cc=haoluo@google.com \
    --cc=houtao1@huawei.com \
    --cc=houtao@huaweicloud.com \
    --cc=john.fastabend@gmail.com \
    --cc=jolsa@kernel.org \
    --cc=kpsingh@kernel.org \
    --cc=martin.lau@linux.dev \
    --cc=sdf@fomichev.me \
    --cc=song@kernel.org \
    --cc=xukuohai@huawei.com \
    --cc=yonghong.song@linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.