Re: [PATCH bpf-next 4/4] selftests/bpf: Add kptr nmi deadlock reproducer

BPF List
 help / color / mirror / Atom feed

From: sashiko-bot@kernel.org
To: "Justin Suess" <utilityemal77@gmail.com>
Cc: bpf@vger.kernel.org
Subject: Re: [PATCH bpf-next 4/4] selftests/bpf: Add kptr nmi deadlock reproducer
Date: Wed, 29 Apr 2026 03:39:41 +0000	[thread overview]
Message-ID: <20260429033942.0F4A4C19425@smtp.kernel.org> (raw)
In-Reply-To: <20260428201422.1518903-5-utilityemal77@gmail.com>

Thank you for your contribution! Sashiko AI review found 7 potential issue(s) to consider:
- [High] has_nmi_handler_btf() checks for the wrong BTF kind and name, causing the test to unconditionally skip.
- [High] BPF_PROG signature mismatch for the tracepoint causes verifier rejection.
- [High] Race condition drops the kptr before the task is reaped, bypassing the deadlock reproducer.
- [High] Race window in Hash map test leaks task_struct and causes a test timeout.
- [High] Expected hash map deadlock avoidance aborts incorrectly fail the test.
- [Medium] find_test_cpu incorrectly fails the test on single-core systems.
- [Low] Unnecessary manual validation of skeleton file descriptors.
--

commit 5f0b05bfc83b9c9298c4508b14319e9427f9679a
Author: Justin Suess <utilityemal77@gmail.com>
selftests/bpf: Add kptr nmi deadlock reproducer

This commit adds a deadlock reproducer for task struct kptrs, using an NMI
PMU event to trigger deletions while freeing the last task reference. It
includes test cases for array and hash map types.

> diff --git a/tools/testing/selftests/bpf/prog_tests/task_kptr_nmi_deadlock_repro.c b/tools/testing/selftests/bpf/prog_tests/task_kptr_nmi_deadlock_repro.c
> new file mode 100644
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/task_kptr_nmi_deadlock_repro.c
[ ... ]
> +static int find_test_cpu(void)
> +{
> +	cpu_set_t cpuset;
> +	int cpu, err;
> +
> +	err = sched_getaffinity(0, sizeof(cpuset), &cpuset);
> +	if (!ASSERT_OK(err, "sched_getaffinity"))
> +		return -1;
> +
> +	for (cpu = 1; cpu < CPU_SETSIZE; cpu++) {

Is it intentional to start iterating at cpu 1 here? On single-core systems or
environments affined only to CPU 0, this will exit without a match and trigger
the ASSERT_TRUE failure below.

> +		if (CPU_ISSET(cpu, &cpuset))
> +			return cpu;
> +	}
> +
> +	ASSERT_TRUE(false, "cpu_available");
> +	return -1;
> +}
[ ... ]
> +static bool has_nmi_handler_btf(void)
> +{
> +	struct btf *btf;
> +	int id;
> +
> +	btf = btf__load_vmlinux_btf();
> +	if (libbpf_get_error(btf)) {
> +		printf("SKIP:no vmlinux BTF\n");
> +		test__skip();
> +		return false;
> +	}
> +
> +	id = btf__find_by_name_kind(btf, "nmi_handler", BTF_KIND_FUNC);

Will this lookup always fail and cause the test to unconditionally skip? Kernel
tracepoints generate a BTF_KIND_TYPEDEF named btf_trace_nmi_handler rather
than a BTF_KIND_FUNC named after the tracepoint.

> +	btf__free(btf);
> +	if (id <= 0) {
> +		printf("SKIP:no BTF FUNC nmi_handler\n");
> +		test__skip();
> +		return false;
> +	}
> +
> +	return true;
> +}
[ ... ]
> +static bool stash_exited_tasks(struct task_kptr_nmi_deadlock_repro *skel)
> +{
> +	int i, status;
> +
> +	for (i = 0; i < STASHED_TASKS; i++) {
> +		int pipefd[2];
> +		char sync;
> +		pid_t child_pid;

Can the race condition here cause the reproducer to bypass the deadlock
entirely?

Since NMIs are firing continuously via the PMU while the child processes are
forked, running, and waited on, the NMI handler could frequently delete the
map element and release the kptr while the kernel still holds its own
reference to the active or zombie task.

If the NMI drops a non-final reference, __put_task_struct won't be invoked
from the NMI context, missing the intended deadlock scenario.

> +
> +		if (!ASSERT_OK(pipe2(pipefd, O_CLOEXEC), "pipe2"))
> +			return false;
[ ... ]
> +	err = task_kptr_nmi_deadlock_repro__load(skel);
> +	if (!ASSERT_OK(err, "task_kptr_nmi_deadlock_repro__load"))
> +		goto cleanup;
> +
> +	if (bpf_program__fd(skel->progs.clear_task_kptrs_from_nmi) < 0) {

This isn't a bug, but is it necessary to check the file descriptor here? The
BPF skeleton guarantees that all programs and maps have valid file descriptors
after a successful load.

> +		test__skip();
> +		goto cleanup;
> +	}
[ ... ]
> diff --git a/tools/testing/selftests/bpf/progs/task_kptr_nmi_deadlock_repro.c b/tools/testing/selftests/bpf/progs/task_kptr_nmi_deadlock_repro.c
> new file mode 100644
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/task_kptr_nmi_deadlock_repro.c
[ ... ]
> +SEC("lsm.s/file_open")
> +int insert_task_kptr_from_lsm(struct file *ctx_file)
> +{
> +	struct task_map_value init = {};
> +	struct task_map_value *slot;
> +	struct task_struct *task, *acquired;
> +	__u32 pid;
> +	int i, ret;
> +
> +	(void)ctx_file;
> +
> +	pid = bpf_get_current_pid_tgid() >> 32;
> +	i = find_target_slot(pid);
> +	if (i < 0)
> +		return 0;
> +
> +	task = bpf_get_current_task_btf();
> +	acquired = bpf_task_acquire(task);
> +	if (!acquired) {
> +		task_kptr_nmi_err = TASK_KPTR_NMI_ACQUIRE_ERR;
> +		return 0;
> +	}
> +
> +	/*
> +	 * Race is OK for these specific map types. Userspace may
> +	 * have modified the array, causing inconsistency. This
> +	 * error TASK_KPTR_NMI_CREATE_ERR is non-fatal for test
> +	 * purposes. But may be important if this test is
> +	 * extended for other map types.
> +	 */
> +	switch (task_kptr_nmi_map_type) {
> +	case TASK_KPTR_NMI_MAP_HASH:
> +		pid = i;
> +		ret = bpf_map_update_elem(&stashed_tasks_hash, &pid, &init,
> +					  BPF_NOEXIST);
> +		if (ret && ret != -EEXIST) {
> +			task_kptr_nmi_err = TASK_KPTR_NMI_CREATE_ERR;
> +			goto release_task;
> +		}
> +		slot = bpf_map_lookup_elem(&stashed_tasks_hash, &pid);

Could this leak the task struct and cause a test timeout?

If an NMI fires immediately after the lookup but before bpf_kptr_xchg() is
called in stash_task(), clear_hash_tasks() will successfully delete the
empty element. The LSM hook would then resume and write the acquired task
struct into the deleted element's memory.

Since it is no longer in the map, subsequent NMIs cannot find it, causing a
timeout while waiting for task_kptr_nmi_deleted to increment.

> +		if (!slot) {
> +			task_kptr_nmi_err = TASK_KPTR_NMI_LOOKUP_ERR;
> +			goto release_task;
> +		}
> +		break;
[ ... ]
> +static __always_inline void clear_hash_tasks(void)
> +{
> +	int i;
> +
> +	for (i = 0; i < MAX_TARGET_PIDS; i++) {
> +		__u32 slot = i;
> +
> +		if (!task_kptr_nmi_pids[i])
> +			continue;
> +		if (!bpf_map_delete_elem(&stashed_tasks_hash, &slot)) {
> +			task_kptr_nmi_live[i] = 0;
> +			task_kptr_nmi_deleted++;
> +		} else if (bpf_map_lookup_elem(&stashed_tasks_hash, &slot)) {
> +			set_delete_err(-EIO);

Does this incorrectly fail the test during an expected hash map deadlock
avoidance abort?

BPF hash maps use a bucket lock for concurrent updates and deletions. If the
NMI interrupts the LSM hook while it holds the bucket lock,
bpf_map_delete_elem() aborts to avoid deadlocking and returns -EBUSY.

Treating this resilient spinlock abort as a fatal -EIO error makes the test
inherently flaky.

> +		}
> +	}
> +}
[ ... ]
> +SEC("?tp_btf/nmi_handler")
> +int BPF_PROG(clear_task_kptrs_from_nmi, void *handler, void *regs, s64 delta_ns,
> +	     int handled)

Does this program signature match the tracepoint?

The actual nmi_handler tracepoint provides only 3 arguments, but the BPF_PROG
macro is used here with 4 arguments. The macro automatically expands to read
ctx[3], which the verifier will reject as an out-of-bounds context access
(offset >= 24), causing the program to fail to load unconditionally.

> +{
> +	(void)handler;
> +	(void)regs;
> +	(void)delta_ns;
> +	(void)handled;
> +
> +	if (task_kptr_nmi_deleted >= task_kptr_nmi_inserted)
> +		return 0;

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260428201422.1518903-1-utilityemal77@gmail.com?part=4

     prev parent reply	other threads:[~2026-04-29  3:39 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-04-28 20:14 [PATCH bpf-next 0/4] bpf: Fix NMI deadlock in referenced kptr destructors Justin Suess
2026-04-28 20:14 ` [PATCH bpf-next 1/4] bpf: Limit fields used in btf_record_equal comparisons Justin Suess
2026-04-28 20:14 ` [PATCH bpf-next 2/4] bpf: Use rcu_work in BTF teardown Justin Suess
2026-04-29  1:49   ` sashiko-bot
2026-04-28 20:14 ` [PATCH bpf-next 3/4] bpf: Fix deadlock in kptr dtor in nmi Justin Suess
2026-04-29  2:29   ` sashiko-bot
2026-04-29  9:37   ` Alexei Starovoitov
2026-04-29 16:21     ` Justin Suess
2026-05-02 14:33       ` Justin Suess
2026-04-28 20:14 ` [PATCH bpf-next 4/4] selftests/bpf: Add kptr nmi deadlock reproducer Justin Suess
2026-04-29  3:39   ` sashiko-bot [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260429033942.0F4A4C19425@smtp.kernel.org \
    --to=sashiko-bot@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=sashiko@lists.linux.dev \
    --cc=utilityemal77@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox