From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 049F228E7; Thu, 7 Sep 2023 16:05:48 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 9E6C6C4339A; Thu, 7 Sep 2023 16:05:48 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1694102748; bh=SETT7D9Th+Jmeeyrb8PHMHUPrAPO4I03JKooKx3IGLw=; h=From:To:Cc:Subject:In-Reply-To:References:Date:From; b=hCwjAgHBzZGgtUsCJOQlz9rSflUuq4ESDb4TQsx9InqZBSUwb3AmlsNyN4hVSDqT3 9Rg330ymDc127ePJ7cQoGr6f3pLH8ZQkXlMpsvDJ2hp7SZKa1Xgx+uCUVHYKs9b7DR Ua+VjPChRSJ7wANDTZzON/yNG3isfWZQNfci7YlNTTxUdwH08Ibo4s29k+yOAxy9te LX99eL2i6bGSq0IsmDUVLIPhxxBd+Tb/vqyTNy1QoTmKjJMmh3e/947uiRX/zHzicV PmTwwO8gxmD3EdI2zI+LYSvwJIAW6BpFitoPpw8NA6bRS+Zt0CihKesj7LlZyf3Oyn xHkDy/Fy+B6pg== Received: by alrua-x1.borgediget.toke.dk (Postfix, from userid 1000) id 2EB88DC684E; Thu, 7 Sep 2023 18:05:45 +0200 (CEST) From: Toke =?utf-8?Q?H=C3=B8iland-J=C3=B8rgensen?= To: Alexei Starovoitov Cc: Kumar Kartikeya Dwivedi , Hsin-Wei Hung , Alexei Starovoitov , Daniel Borkmann , Andrii Nakryiko , Martin KaFai Lau , Song Liu , Yonghong Song , John Fastabend , KP Singh , Network Development , bpf , Arnaldo Carvalho de Melo Subject: Re: Possible deadlock in bpf queue map In-Reply-To: References: <87o7ienuss.fsf@toke.dk> <87fs3qnnh4.fsf@toke.dk> X-Clacks-Overhead: GNU Terry Pratchett Date: Thu, 07 Sep 2023 18:05:45 +0200 Message-ID: <8734zqnf3a.fsf@toke.dk> Precedence: bulk X-Mailing-List: bpf@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Alexei Starovoitov writes: > On Thu, Sep 7, 2023 at 6:04=E2=80=AFAM Toke H=C3=B8iland-J=C3=B8rgensen <= toke@kernel.org> wrote: >> >> Kumar Kartikeya Dwivedi writes: >> >> > On Thu, 7 Sept 2023 at 12:26, Toke H=C3=B8iland-J=C3=B8rgensen wrote: >> >> >> >> +Arnaldo >> >> >> >> > Hi, >> >> > >> >> > Our bpf fuzzer, a customized Syzkaller, triggered a lockdep warning= in >> >> > the bpf queue map in v5.15. Since queue_stack_maps.c has no major c= hanges >> >> > since v5.15, we think this should still exist in the latest kernel. >> >> > The bug can be occasionally triggered, and we suspect one of the >> >> > eBPF programs involved to be the following one. We also attached th= e lockdep >> >> > warning at the end. >> >> > >> >> > #define DEFINE_BPF_MAP_NO_KEY(the_map, TypeOfMap, MapFlags, >> >> > TypeOfValue, MaxEntries) \ >> >> > struct { = \ >> >> > __uint(type, TypeOfMap); = \ >> >> > __uint(map_flags, (MapFlags)); = \ >> >> > __uint(max_entries, (MaxEntries)); = \ >> >> > __type(value, TypeOfValue); = \ >> >> > } the_map SEC(".maps"); >> >> > >> >> > DEFINE_BPF_MAP_NO_KEY(map_0, BPF_MAP_TYPE_QUEUE, 0 | BPF_F_WRONLY, >> >> > struct_0, 162); >> >> > SEC("perf_event") >> >> > int func(struct bpf_perf_event_data *ctx) { >> >> > char v0[96] =3D {}; >> >> > uint64_t v1 =3D 0; >> >> > v1 =3D bpf_map_pop_elem(&map_0, v0); >> >> > return 163819661; >> >> > } >> >> > >> >> > >> >> > The program is attached to the following perf event. >> >> > >> >> > struct perf_event_attr attr_type_hw =3D { >> >> > .type =3D PERF_TYPE_HARDWARE, >> >> > .config =3D PERF_COUNT_HW_CPU_CYCLES, >> >> > .sample_freq =3D 50, >> >> > .inherit =3D 1, >> >> > .freq =3D 1, >> >> > }; >> >> > >> >> > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3DWARNING: inconsistent lock state >> >> > 5.15.26+ #2 Not tainted >> >> > -------------------------------- >> >> > inconsistent {INITIAL USE} -> {IN-NMI} usage. >> >> > syz-executor.5/19749 [HC1[1]:SC0[0]:HE0:SE1] takes: >> >> > ffff88804c9fc198 (&qs->lock){..-.}-{2:2}, at: __queue_map_get+0x31/= 0x250 >> >> > {INITIAL USE} state was registered at: >> >> > lock_acquire+0x1a3/0x4b0 >> >> > _raw_spin_lock_irqsave+0x48/0x60 >> >> > __queue_map_get+0x31/0x250 >> >> > bpf_prog_577904e86c81dead_func+0x12/0x4b4 >> >> > trace_call_bpf+0x262/0x5d0 >> >> > perf_trace_run_bpf_submit+0x91/0x1c0 >> >> > perf_trace_sched_switch+0x46c/0x700 >> >> > __schedule+0x11b5/0x24a0 >> >> > schedule+0xd4/0x270 >> >> > futex_wait_queue_me+0x25f/0x520 >> >> > futex_wait+0x1e0/0x5f0 >> >> > do_futex+0x395/0x1890 >> >> > __x64_sys_futex+0x1cb/0x480 >> >> > do_syscall_64+0x3b/0xc0 >> >> > entry_SYSCALL_64_after_hwframe+0x44/0xae >> >> > irq event stamp: 13640 >> >> > hardirqs last enabled at (13639): [] >> >> > _raw_spin_unlock_irq+0x24/0x40 >> >> > hardirqs last disabled at (13640): [] >> >> > _raw_spin_lock_irqsave+0x5d/0x60 >> >> > softirqs last enabled at (13464): [] __sys_bpf+0= x3e15/0x4e80 >> >> > softirqs last disabled at (13462): [] __sys_bpf+0= x3dd3/0x4e80 >> >> > >> >> > other info that might help us debug this: >> >> > Possible unsafe locking scenario: >> >> > >> >> > CPU0 >> >> > ---- >> >> > lock(&qs->lock); >> >> > >> >> > lock(&qs->lock); >> >> >> >> Hmm, so that lock() uses raw_spin_lock_irqsave(), which *should* be >> >> disabling interrupts entirely for the critical section. But I guess a >> >> Perf hardware event can still trigger? Which seems like it would >> >> potentially wreak havoc with lots of things, not just this queue map >> >> function? >> >> >> >> No idea how to protect against this, though. Hoping Arnaldo knows? :) >> >> >> > >> > The locking should probably be protected by a percpu integer counter, >> > incremented and decremented before and after the lock is taken, >> > respectively. If it is already non-zero, then -EBUSY should be >> > returned. It is similar to what htab_lock_bucket protects against in >> > hashtab.c. >> >> Ah, neat! Okay, seems straight-forward enough to replicate. Hsin, could >> you please check if the patch below gets rid of the splat? > > Instead of adding all this complexity for the map that is so rarely used > it's easier to disallow it perf_event prog types. > Or add a single if (in_nmi()) return -EBUSY. Heh, I was actually thinking the "nmi-safe lock" thing might be something that could be neatly encapsulated into the lock library helpers. But OK, so you mean just something like the below, then? I'll send a proper patch for that later if no one objects (or beats me to it) :) -Toke diff --git a/kernel/bpf/queue_stack_maps.c b/kernel/bpf/queue_stack_maps.c index 8d2ddcb7566b..138525424745 100644 --- a/kernel/bpf/queue_stack_maps.c +++ b/kernel/bpf/queue_stack_maps.c @@ -98,6 +98,9 @@ static long __queue_map_get(struct bpf_map *map, void *va= lue, bool delete) int err =3D 0; void *ptr; =20 + if (in_nmi()) + return -EBUSY; + raw_spin_lock_irqsave(&qs->lock, flags); =20 if (queue_stack_map_is_empty(qs)) { @@ -128,6 +131,9 @@ static long __stack_map_get(struct bpf_map *map, void *= value, bool delete) void *ptr; u32 index; =20 + if (in_nmi()) + return -EBUSY; + raw_spin_lock_irqsave(&qs->lock, flags); =20 if (queue_stack_map_is_empty(qs)) { @@ -193,6 +199,9 @@ static long queue_stack_map_push_elem(struct bpf_map *m= ap, void *value, if (flags & BPF_NOEXIST || flags > BPF_EXIST) return -EINVAL; =20 + if (in_nmi()) + return -EBUSY; + raw_spin_lock_irqsave(&qs->lock, irq_flags); =20 if (queue_stack_map_is_full(qs)) {