public inbox for bpf@vger.kernel.org
* [PATCH bpf-next v3 0/3] Arena Spin Lock
@ 2025-03-05  1:18 Kumar Kartikeya Dwivedi
  2025-03-05  1:18 ` [PATCH bpf-next v3 1/3] selftests/bpf: Introduce cond_break_label Kumar Kartikeya Dwivedi
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-03-05  1:18 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Eduard Zingerman, Tejun Heo, Emil Tsalapatis,
	Barret Rhoden, Josh Don, Dohyun Kim, kkd, kernel-team

This set provides an implementation of a queued spin lock for BPF arena.
There is no support for resiliency or recovery from deadlocks yet; we
will wait for the rqspinlock patch set to land before incorporating such
support.

One minor change compared to the qspinlock algorithm in the kernel is
that we don't have the trylock fallback when the nesting count exceeds 4.
The maximum number of supported CPUs is 1024, but this can be increased
in the future if necessary.

The API supports returning an error, so resiliency support can be added
in the future. Callers are still expected to check for and handle any
potential errors.

Errors are returned when the spin loops time out, when the number of
CPUs is greater than 1024, or in the extreme edge case of an NMI
interrupting an NMI that interrupted a hardIRQ that interrupted a
softIRQ task, all of them contending in the slow path at the same time,
which is unsupported.
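
As an illustration, a minimal usage sketch (mirroring the selftest in
patch 3; the lock word lives in the arena, and the return value must
always be checked):

  arena_spinlock_t __arena lock;

  SEC("tc")
  int prog(void *ctx)
  {
  	unsigned long flags;
  	int ret;

  	ret = arena_spin_lock_irqsave(&lock, flags);
  	if (ret) /* -EBUSY, -EOPNOTSUPP, or -ETIMEDOUT */
  		return ret;
  	/* ... critical section ... */
  	arena_spin_unlock_irqrestore(&lock, flags);
  	return 0;
  }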

Changelog:
----------
v2 -> v3
v2: https://lore.kernel.org/bpf/20250118162238.2621311-1-memxor@gmail.com

 * Rename to arena_spin_lock
 * Introduce cond_break_label macro to jump to label from cond_break.
 * Drop trylock fallback when nesting count exceeds 4.
 * Fix bug in try_cmpxchg implementation.
 * Add tests with critical sections of varying lengths.
 * Add comments for _Generic trick to drop __arena tag.
 * Fix bug due to qnodes being placed on first page, leading to CPU 0's
   node being indistinguishable from NULL.

v1 -> v2
v1: https://lore.kernel.org/bpf/20250117223754.1020174-1-memxor@gmail.com

 * Fix definition of lock in selftest

Kumar Kartikeya Dwivedi (3):
  selftests/bpf: Introduce cond_break_label
  selftests/bpf: Introduce arena spin lock
  selftests/bpf: Add tests for arena spin lock

 .../selftests/bpf/bpf_arena_spin_lock.h       | 505 ++++++++++++++++++
 tools/testing/selftests/bpf/bpf_atomic.h      | 132 +++++
 .../testing/selftests/bpf/bpf_experimental.h  |  15 +-
 .../bpf/prog_tests/arena_spin_lock.c          | 102 ++++
 .../selftests/bpf/progs/arena_spin_lock.c     |  51 ++
 5 files changed, 799 insertions(+), 6 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/bpf_arena_spin_lock.h
 create mode 100644 tools/testing/selftests/bpf/bpf_atomic.h
 create mode 100644 tools/testing/selftests/bpf/prog_tests/arena_spin_lock.c
 create mode 100644 tools/testing/selftests/bpf/progs/arena_spin_lock.c


base-commit: 42ba8a49d085e0c2ad50fb9a8ec954c9762b6e01
-- 
2.47.1


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH bpf-next v3 1/3] selftests/bpf: Introduce cond_break_label
  2025-03-05  1:18 [PATCH bpf-next v3 0/3] Arena Spin Lock Kumar Kartikeya Dwivedi
@ 2025-03-05  1:18 ` Kumar Kartikeya Dwivedi
  2025-03-05  1:18 ` [PATCH bpf-next v3 2/3] selftests/bpf: Introduce arena spin lock Kumar Kartikeya Dwivedi
  2025-03-05  1:18 ` [PATCH bpf-next v3 3/3] selftests/bpf: Add tests for " Kumar Kartikeya Dwivedi
  2 siblings, 0 replies; 11+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-03-05  1:18 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Eduard Zingerman, Tejun Heo, Emil Tsalapatis,
	Barret Rhoden, Josh Don, Dohyun Kim, kkd, kernel-team

Add a new cond_break_label macro that jumps to the specified label when
the cond_break termination check fires, allowing us to better handle the
uncontrolled termination of a loop.
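
For example, a retry loop that must provably terminate can divert to an
error path at a label instead of merely breaking out (a sketch of the
pattern the next patch uses in its try_cmpxchg loops; the error handling
at the label is illustrative):

  old = atomic_read(&lock->val);
  do {
  	new = old | _Q_PENDING_VAL;
  	/* prove to the verifier that the loop terminates */
  	cond_break_label(out);
  } while (!atomic_try_cmpxchg_acquire(&lock->val, &old, new));
  return old;
  out:
  	return -ETIMEDOUT;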

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 tools/testing/selftests/bpf/bpf_experimental.h | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/tools/testing/selftests/bpf/bpf_experimental.h b/tools/testing/selftests/bpf/bpf_experimental.h
index cd8ecd39c3f3..6535c8ae3c46 100644
--- a/tools/testing/selftests/bpf/bpf_experimental.h
+++ b/tools/testing/selftests/bpf/bpf_experimental.h
@@ -368,12 +368,12 @@ l_true:												\
 	ret;						\
 	})
 
-#define cond_break					\
+#define __cond_break(expr)				\
 	({ __label__ l_break, l_continue;		\
 	asm volatile goto("may_goto %l[l_break]"	\
 		      :::: l_break);			\
 	goto l_continue;				\
-	l_break: break;					\
+	l_break: expr;					\
 	l_continue:;					\
 	})
 #else
@@ -392,7 +392,7 @@ l_true:												\
 	ret;						\
 	})
 
-#define cond_break					\
+#define __cond_break(expr)				\
 	({ __label__ l_break, l_continue;		\
 	asm volatile goto("1:.byte 0xe5;		\
 		      .byte 0;				\
@@ -400,7 +400,7 @@ l_true:												\
 		      .short 0"				\
 		      :::: l_break);			\
 	goto l_continue;				\
-	l_break: break;					\
+	l_break: expr;					\
 	l_continue:;					\
 	})
 #else
@@ -418,7 +418,7 @@ l_true:												\
 	ret;						\
 	})
 
-#define cond_break					\
+#define __cond_break(expr)				\
 	({ __label__ l_break, l_continue;		\
 	asm volatile goto("1:.byte 0xe5;		\
 		      .byte 0;				\
@@ -426,12 +426,15 @@ l_true:												\
 		      .short 0"				\
 		      :::: l_break);			\
 	goto l_continue;				\
-	l_break: break;					\
+	l_break: expr;					\
 	l_continue:;					\
 	})
 #endif
 #endif
 
+#define cond_break __cond_break(break)
+#define cond_break_label(label) __cond_break(goto label)
+
 #ifndef bpf_nop_mov
 #define bpf_nop_mov(var) \
 	asm volatile("%[reg]=%[reg]"::[reg]"r"((short)var))
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH bpf-next v3 2/3] selftests/bpf: Introduce arena spin lock
  2025-03-05  1:18 [PATCH bpf-next v3 0/3] Arena Spin Lock Kumar Kartikeya Dwivedi
  2025-03-05  1:18 ` [PATCH bpf-next v3 1/3] selftests/bpf: Introduce cond_break_label Kumar Kartikeya Dwivedi
@ 2025-03-05  1:18 ` Kumar Kartikeya Dwivedi
  2025-03-05  2:27   ` Alexei Starovoitov
  2025-03-05  1:18 ` [PATCH bpf-next v3 3/3] selftests/bpf: Add tests for " Kumar Kartikeya Dwivedi
  2 siblings, 1 reply; 11+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-03-05  1:18 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Eduard Zingerman, Tejun Heo, Emil Tsalapatis,
	Barret Rhoden, Josh Don, Dohyun Kim, kkd, kernel-team

Implement the queued spin lock algorithm as a BPF program for lock
words living in BPF arena.

The algorithm is copied from kernel/locking/qspinlock.c and adapted for
BPF use.

We first implement abstract helpers for portable atomics and
acquire/release load instructions, relying on the presence of X86_64
and on implementation details of its JIT to elide expensive barriers,
and falling back to slow but correct implementations elsewhere. When
support for acquire/release load/store instructions lands, we can
improve this state.

Then, the qspinlock algorithm is adapted to remove the dependence on
multi-word atomics due to lack of support in the BPF ISA. For instance,
xchg_tail cannot use a 16-bit xchg, and needs to be implemented as a
32-bit try_cmpxchg loop.

Loops which are seemingly infinite from the verifier's PoV are
annotated with the cond_break_label macro to return an error. Only
1024 CPUs (NR_CPUS) are supported.

We need to allocate 1025 qnodes, one more than NR_CPUs, since if libbpf
maps the qnode global variable starting at the first page of the arena,
and the lower 32-bits are zeroed for the base address, the first node
for CPU 0 will become indistinguishable from a NULL pointer, leading to
spurious timeouts and failures.

Note that the slow path is a global function, hence the verifier doesn't
know the return value's precise range. The recommended usage is to
always test against zero for success, and not ret < 0 for error, as the
verifier would assume the ret > 0 case has not been accounted for.
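
Concretely, a sketch of the recommended check:

  ret = arena_spin_lock(&lock);
  if (ret)	/* correct: test against zero */
  	return ret;
  /* if (ret < 0)	incorrect: verifier assumes ret > 0 is unhandled */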

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 .../selftests/bpf/bpf_arena_spin_lock.h       | 505 ++++++++++++++++++
 tools/testing/selftests/bpf/bpf_atomic.h      | 132 +++++
 2 files changed, 637 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/bpf_arena_spin_lock.h
 create mode 100644 tools/testing/selftests/bpf/bpf_atomic.h

diff --git a/tools/testing/selftests/bpf/bpf_arena_spin_lock.h b/tools/testing/selftests/bpf/bpf_arena_spin_lock.h
new file mode 100644
index 000000000000..cc7de78e0373
--- /dev/null
+++ b/tools/testing/selftests/bpf/bpf_arena_spin_lock.h
@@ -0,0 +1,505 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2025 Meta Platforms, Inc. and affiliates. */
+#ifndef BPF_ARENA_SPIN_LOCK_H
+#define BPF_ARENA_SPIN_LOCK_H
+
+#include <vmlinux.h>
+#include <bpf/bpf_helpers.h>
+#include "bpf_atomic.h"
+
+#define arch_mcs_spin_lock_contended_label(l, label) smp_cond_load_acquire_label(l, VAL, label)
+#define arch_mcs_spin_unlock_contended(l) smp_store_release((l), 1)
+
+#if defined(ENABLE_ATOMICS_TESTS) && defined(__BPF_FEATURE_ADDR_SPACE_CAST)
+
+#define EBUSY 16
+#define EOPNOTSUPP 95
+#define ETIMEDOUT 110
+
+#ifndef __arena
+#define __arena __attribute__((address_space(1)))
+#endif
+
+extern unsigned long CONFIG_NR_CPUS __kconfig;
+
+#define arena_spinlock_t struct qspinlock
+/* FIXME: Using typedef causes CO-RE relocation error */
+/* typedef struct qspinlock arena_spinlock_t; */
+
+struct arena_mcs_spinlock {
+	struct arena_mcs_spinlock __arena *next;
+	int locked;
+	int count;
+};
+
+struct arena_qnode {
+	struct arena_mcs_spinlock mcs;
+};
+
+#define _Q_MAX_NODES		4
+#define _Q_PENDING_LOOPS	1
+
+/*
+ * Bitfields in the atomic value:
+ *
+ *  0- 7: locked byte
+ *     8: pending
+ *  9-15: not used
+ * 16-17: tail index
+ * 18-31: tail cpu (+1)
+ */
+#define _Q_MAX_CPUS		1024
+
+#define	_Q_SET_MASK(type)	(((1U << _Q_ ## type ## _BITS) - 1)\
+				      << _Q_ ## type ## _OFFSET)
+#define _Q_LOCKED_OFFSET	0
+#define _Q_LOCKED_BITS		8
+#define _Q_LOCKED_MASK		_Q_SET_MASK(LOCKED)
+
+#define _Q_PENDING_OFFSET	(_Q_LOCKED_OFFSET + _Q_LOCKED_BITS)
+#define _Q_PENDING_BITS		8
+#define _Q_PENDING_MASK		_Q_SET_MASK(PENDING)
+
+#define _Q_TAIL_IDX_OFFSET	(_Q_PENDING_OFFSET + _Q_PENDING_BITS)
+#define _Q_TAIL_IDX_BITS	2
+#define _Q_TAIL_IDX_MASK	_Q_SET_MASK(TAIL_IDX)
+
+#define _Q_TAIL_CPU_OFFSET	(_Q_TAIL_IDX_OFFSET + _Q_TAIL_IDX_BITS)
+#define _Q_TAIL_CPU_BITS	(32 - _Q_TAIL_CPU_OFFSET)
+#define _Q_TAIL_CPU_MASK	_Q_SET_MASK(TAIL_CPU)
+
+#define _Q_TAIL_OFFSET		_Q_TAIL_IDX_OFFSET
+#define _Q_TAIL_MASK		(_Q_TAIL_IDX_MASK | _Q_TAIL_CPU_MASK)
+
+#define _Q_LOCKED_VAL		(1U << _Q_LOCKED_OFFSET)
+#define _Q_PENDING_VAL		(1U << _Q_PENDING_OFFSET)
+
+#define likely(x) __builtin_expect(!!(x), 1)
+#define unlikely(x) __builtin_expect(!!(x), 0)
+
+/*
+ * We always index with + 1, in case we unfortunately place the qnodes at
+ * pg_offset=0 and then CPU 0's qnode is indistinguishable from NULL if lower
+ * 32-bits of the node pointer are 0.
+ */
+struct arena_qnode __arena qnodes[_Q_MAX_CPUS + 1][_Q_MAX_NODES];
+
+static inline u32 encode_tail(int cpu, int idx)
+{
+	u32 tail;
+
+	tail  = (cpu + 1) << _Q_TAIL_CPU_OFFSET;
+	tail |= idx << _Q_TAIL_IDX_OFFSET; /* assume < 4 */
+
+	return tail;
+}
+
+static inline struct arena_mcs_spinlock __arena *decode_tail(u32 tail)
+{
+	u32 cpu = (tail >> _Q_TAIL_CPU_OFFSET) - 1;
+	u32 idx = (tail &  _Q_TAIL_IDX_MASK) >> _Q_TAIL_IDX_OFFSET;
+
+	/* See comments over definition of qnodes for the + 1. */
+	if (likely(idx < _Q_MAX_NODES && cpu < _Q_MAX_CPUS))
+		return &qnodes[cpu + 1][idx].mcs;
+	bpf_printk("RUNTIME ERROR: %s idx=%u and cpu=%u are out-of-bounds!!!", __func__, idx, cpu);
+	return NULL;
+}
+
+static inline
+struct arena_mcs_spinlock __arena *grab_mcs_node(struct arena_mcs_spinlock __arena *base, int idx)
+{
+	return &((struct arena_qnode __arena *)base + idx)->mcs;
+}
+
+#define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
+
+/**
+ * xchg_tail - Put in the new queue tail code word & retrieve previous one
+ * @lock : Pointer to queued spinlock structure
+ * @tail : The new queue tail code word
+ * Return: The previous queue tail code word
+ *
+ * xchg(lock, tail)
+ *
+ * p,*,* -> n,*,* ; prev = xchg(lock, node)
+ */
+static __always_inline u32 xchg_tail(arena_spinlock_t __arena *lock, u32 tail)
+{
+	u32 old, new;
+
+	old = atomic_read(&lock->val);
+	do {
+		new = (old & _Q_LOCKED_PENDING_MASK) | tail;
+		/*
+		 * We can use relaxed semantics since the caller ensures that
+		 * the MCS node is properly initialized before updating the
+		 * tail.
+		 */
+		/*
+		 * These loops are not expected to stall, but we still need to
+		 * prove to the verifier they will terminate eventually.
+		 */
+		cond_break_label(out);
+	} while (!atomic_try_cmpxchg_relaxed(&lock->val, &old, new));
+
+	return old;
+out:
+	bpf_printk("RUNTIME ERROR: %s unexpected cond_break exit!!!", __func__);
+	return old;
+}
+
+/**
+ * clear_pending - clear the pending bit.
+ * @lock: Pointer to queued spinlock structure
+ *
+ * *,1,* -> *,0,*
+ */
+static __always_inline void clear_pending(arena_spinlock_t __arena *lock)
+{
+	WRITE_ONCE(lock->pending, 0);
+}
+
+/**
+ * clear_pending_set_locked - take ownership and clear the pending bit.
+ * @lock: Pointer to queued spinlock structure
+ *
+ * *,1,0 -> *,0,1
+ *
+ * Lock stealing is not allowed if this function is used.
+ */
+static __always_inline void clear_pending_set_locked(arena_spinlock_t __arena *lock)
+{
+	WRITE_ONCE(lock->locked_pending, _Q_LOCKED_VAL);
+}
+
+/**
+ * set_locked - Set the lock bit and own the lock
+ * @lock: Pointer to queued spinlock structure
+ *
+ * *,*,0 -> *,0,1
+ */
+static __always_inline void set_locked(arena_spinlock_t __arena *lock)
+{
+	WRITE_ONCE(lock->locked, _Q_LOCKED_VAL);
+}
+
+static __always_inline
+u32 arena_fetch_set_pending_acquire(arena_spinlock_t __arena *lock)
+{
+	u32 old, new;
+
+	old = atomic_read(&lock->val);
+	do {
+		new = old | _Q_PENDING_VAL;
+		/*
+		 * These loops are not expected to stall, but we still need to
+		 * prove to the verifier they will terminate eventually.
+		 */
+		cond_break_label(out);
+	} while (!atomic_try_cmpxchg_acquire(&lock->val, &old, new));
+
+	return old;
+out:
+	bpf_printk("RUNTIME ERROR: %s unexpected cond_break exit!!!", __func__);
+	return old;
+}
+
+/**
+ * arena_spin_trylock - try to acquire the queued spinlock
+ * @lock : Pointer to queued spinlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static __always_inline int arena_spin_trylock(arena_spinlock_t __arena *lock)
+{
+	int val = atomic_read(&lock->val);
+
+	if (unlikely(val))
+		return 0;
+
+	return likely(atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL));
+}
+
+__noinline
+int arena_spin_lock_slowpath(arena_spinlock_t __arena __arg_arena *lock, u32 val)
+{
+	struct arena_mcs_spinlock __arena *prev, *next, *node0, *node;
+	int ret = -ETIMEDOUT;
+	u32 old, tail;
+	int idx;
+
+	/*
+	 * Wait for in-progress pending->locked hand-overs with a bounded
+	 * number of spins so that we guarantee forward progress.
+	 *
+	 * 0,1,0 -> 0,0,1
+	 */
+	if (val == _Q_PENDING_VAL) {
+		int cnt = _Q_PENDING_LOOPS;
+		val = atomic_cond_read_relaxed_label(&lock->val,
+						     (VAL != _Q_PENDING_VAL) || !cnt--,
+						     release_err);
+	}
+
+	/*
+	 * If we observe any contention; queue.
+	 */
+	if (val & ~_Q_LOCKED_MASK)
+		goto queue;
+
+	/*
+	 * trylock || pending
+	 *
+	 * 0,0,* -> 0,1,* -> 0,0,1 pending, trylock
+	 */
+	val = arena_fetch_set_pending_acquire(lock);
+
+	/*
+	 * If we observe contention, there is a concurrent locker.
+	 *
+	 * Undo and queue; our setting of PENDING might have made the
+	 * n,0,0 -> 0,0,0 transition fail and it will now be waiting
+	 * on @next to become !NULL.
+	 */
+	if (unlikely(val & ~_Q_LOCKED_MASK)) {
+
+		/* Undo PENDING if we set it. */
+		if (!(val & _Q_PENDING_MASK))
+			clear_pending(lock);
+
+		goto queue;
+	}
+
+	/*
+	 * We're pending, wait for the owner to go away.
+	 *
+	 * 0,1,1 -> *,1,0
+	 *
+	 * this wait loop must be a load-acquire such that we match the
+	 * store-release that clears the locked bit and create lock
+	 * sequentiality; this is because not all
+	 * clear_pending_set_locked() implementations imply full
+	 * barriers.
+	 */
+	if (val & _Q_LOCKED_MASK)
+		smp_cond_load_acquire_label(&lock->locked, !VAL, release_err);
+
+	/*
+	 * take ownership and clear the pending bit.
+	 *
+	 * 0,1,0 -> 0,0,1
+	 */
+	clear_pending_set_locked(lock);
+	return 0;
+
+	/*
+	 * End of pending bit optimistic spinning and beginning of MCS
+	 * queuing.
+	 */
+queue:
+	/* See comments over definition of qnodes for the + 1. */
+	node0 = &(qnodes[bpf_get_smp_processor_id() + 1])[0].mcs;
+	idx = node0->count++;
+	tail = encode_tail(bpf_get_smp_processor_id(), idx);
+
+	/*
+	 * 4 nodes are allocated based on the assumption that there will not be
+	 * nested NMIs taking spinlocks. That may not be true in some
+	 * architectures even though the chance of needing more than 4 nodes
+	 * will still be extremely unlikely. When that happens, we simply return
+	 * an error. Original qspinlock has a trylock fallback in this case.
+	 */
+	if (unlikely(idx >= _Q_MAX_NODES)) {
+		ret = -EBUSY;
+		goto release_node_err;
+	}
+
+	node = grab_mcs_node(node0, idx);
+
+	/*
+	 * Ensure that we increment the head node->count before initialising
+	 * the actual node. If the compiler is kind enough to reorder these
+	 * stores, then an IRQ could overwrite our assignments.
+	 */
+	barrier();
+
+	node->locked = 0;
+	node->next = NULL;
+
+	/*
+	 * We touched a (possibly) cold cacheline in the per-cpu queue node;
+	 * attempt the trylock once more in the hope someone let go while we
+	 * weren't watching.
+	 */
+	if (arena_spin_trylock(lock))
+		goto release;
+
+	/*
+	 * Ensure that the initialisation of @node is complete before we
+	 * publish the updated tail via xchg_tail() and potentially link
+	 * @node into the waitqueue via WRITE_ONCE(prev->next, node) below.
+	 */
+	smp_wmb();
+
+	/*
+	 * Publish the updated tail.
+	 * We have already touched the queueing cacheline; don't bother with
+	 * pending stuff.
+	 *
+	 * p,*,* -> n,*,*
+	 */
+	old = xchg_tail(lock, tail);
+	next = NULL;
+
+	/*
+	 * if there was a previous node; link it and wait until reaching the
+	 * head of the waitqueue.
+	 */
+	if (old & _Q_TAIL_MASK) {
+		prev = decode_tail(old);
+
+		/* Link @node into the waitqueue. */
+		WRITE_ONCE(prev->next, node);
+
+		arch_mcs_spin_lock_contended_label(&node->locked, release_node_err);
+
+		/*
+		 * While waiting for the MCS lock, the next pointer may have
+		 * been set by another lock waiter. We cannot prefetch here
+		 * due to lack of equivalent instruction in BPF ISA.
+		 */
+		next = READ_ONCE(node->next);
+	}
+
+	/*
+	 * we're at the head of the waitqueue, wait for the owner & pending to
+	 * go away.
+	 *
+	 * *,x,y -> *,0,0
+	 *
+	 * this wait loop must use a load-acquire such that we match the
+	 * store-release that clears the locked bit and create lock
+	 * sequentiality; this is because the set_locked() function below
+	 * does not imply a full barrier.
+	 */
+	val = atomic_cond_read_acquire_label(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK),
+					     release_node_err);
+
+	/*
+	 * claim the lock:
+	 *
+	 * n,0,0 -> 0,0,1 : lock, uncontended
+	 * *,*,0 -> *,*,1 : lock, contended
+	 *
+	 * If the queue head is the only one in the queue (lock value == tail)
+	 * and nobody is pending, clear the tail code and grab the lock.
+	 * Otherwise, we only need to grab the lock.
+	 */
+
+	/*
+	 * In the PV case we might already have _Q_LOCKED_VAL set, because
+	 * of lock stealing; therefore we must also allow:
+	 *
+	 * n,0,1 -> 0,0,1
+	 *
+	 * Note: at this point: (val & _Q_PENDING_MASK) == 0, because of the
+	 *       above wait condition, therefore any concurrent setting of
+	 *       PENDING will make the uncontended transition fail.
+	 */
+	if ((val & _Q_TAIL_MASK) == tail) {
+		if (atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
+			goto release; /* No contention */
+	}
+
+	/*
+	 * Either somebody is queued behind us or _Q_PENDING_VAL got set
+	 * which will then detect the remaining tail and queue behind us
+	 * ensuring we'll see a @next.
+	 */
+	set_locked(lock);
+
+	/*
+	 * contended path; wait for next if not observed yet, release.
+	 */
+	if (!next)
+		next = smp_cond_load_relaxed_label(&node->next, (VAL), release_node_err);
+
+	arch_mcs_spin_unlock_contended(&next->locked);
+
+release:;
+	/*
+	 * release the node
+	 *
+	 * Doing a normal dec vs this_cpu_dec is fine. An upper context always
+	 * decrements the count it incremented before returning, thus we're fine.
+	 * For contexts interrupting us, they either observe our dec or not.
+	 * Just ensure the compiler doesn't reorder this statement, as
+	 * this_cpu_dec would have implied that implicitly.
+	 */
+	barrier();
+	node0->count--;
+	return 0;
+release_node_err:
+	barrier();
+	node0->count--;
+	goto release_err;
+release_err:
+	return ret;
+}
+
+/**
+ * arena_spin_lock - acquire a queued spinlock
+ * @lock: Pointer to queued spinlock structure
+ *
+ * The return value _must_ be tested against zero for success.
+ * On error, returned value will be negative.
+ */
+static __always_inline int arena_spin_lock(arena_spinlock_t __arena *lock)
+{
+	int val = 0;
+
+	if (CONFIG_NR_CPUS > 1024)
+		return -EOPNOTSUPP;
+
+	bpf_preempt_disable();
+	if (likely(atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL)))
+		return 0;
+
+	val = arena_spin_lock_slowpath(lock, val);
+	/* FIXME: bpf_assert_range(-MAX_ERRNO, 0) once we have it working for all cases. */
+	if (val)
+		bpf_preempt_enable();
+	return val;
+}
+
+/**
+ * arena_spin_unlock - release a queued spinlock
+ * @lock : Pointer to queued spinlock structure
+ */
+static __always_inline void arena_spin_unlock(arena_spinlock_t __arena *lock)
+{
+	/*
+	 * unlock() needs release semantics:
+	 */
+	smp_store_release(&lock->locked, 0);
+	bpf_preempt_enable();
+}
+
+#define arena_spin_lock_irqsave(lock, flags)             \
+	({                                               \
+		int __ret;                               \
+		bpf_local_irq_save(&(flags));            \
+		__ret = arena_spin_lock((lock));         \
+		if (__ret)                               \
+			bpf_local_irq_restore(&(flags)); \
+		(__ret);                                 \
+	})
+
+#define arena_spin_unlock_irqrestore(lock, flags) \
+	({                                        \
+		arena_spin_unlock((lock));        \
+		bpf_local_irq_restore(&(flags));  \
+	})
+
+#endif
+
+#endif /* BPF_ARENA_SPIN_LOCK_H */
diff --git a/tools/testing/selftests/bpf/bpf_atomic.h b/tools/testing/selftests/bpf/bpf_atomic.h
new file mode 100644
index 000000000000..06defb9e050d
--- /dev/null
+++ b/tools/testing/selftests/bpf/bpf_atomic.h
@@ -0,0 +1,132 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2025 Meta Platforms, Inc. and affiliates. */
+#ifndef BPF_ATOMIC_H
+#define BPF_ATOMIC_H
+
+#include <vmlinux.h>
+#include <bpf/bpf_helpers.h>
+#include "bpf_experimental.h"
+
+extern bool CONFIG_X86_64 __kconfig __weak;
+
+#define __scalar_type_to_expr_cases(type) \
+	unsigned type : (unsigned type)0, signed type : (signed type)0
+/*
+ * This is lifted from __unqual_scalar_typeof in the kernel (which is used to
+ * lose const qualifier etc.), but adapted to also cover pointers. It is
+ * necessary because we ascertain type to create local variables in macros
+ * below, but for pointers with __arena tag, we'll ascertain the underlying type
+ * with the tag, causing a compilation error (as local variables that are not
+ * pointers may not have __arena tag). This trick allows losing the qualifier.
+ */
+#define __unqual_typeof(x)                              \
+	typeof(_Generic((x),                            \
+		char: (char)0,                          \
+		__scalar_type_to_expr_cases(char),      \
+		__scalar_type_to_expr_cases(short),     \
+		__scalar_type_to_expr_cases(int),       \
+		__scalar_type_to_expr_cases(long),      \
+		__scalar_type_to_expr_cases(long long), \
+		default: (void *)0))
+
+/* No-op for BPF */
+#define cpu_relax() ({})
+
+#define READ_ONCE(x) (*(volatile typeof(x) *)&(x))
+
+#define WRITE_ONCE(x, val) ((*(volatile typeof(x) *)&(x)) = (val))
+
+#define cmpxchg(p, old, new) __sync_val_compare_and_swap((p), old, new)
+
+#define try_cmpxchg(p, pold, new)                                 \
+	({                                                        \
+		__unqual_typeof(*(pold)) __o = *(pold);           \
+		__unqual_typeof(*(p)) __r = cmpxchg(p, __o, new); \
+		if (__r != __o)                                   \
+			*(pold) = __r;                            \
+		__r == __o;                                       \
+	})
+
+#define try_cmpxchg_relaxed(p, pold, new) try_cmpxchg(p, pold, new)
+
+#define try_cmpxchg_acquire(p, pold, new) try_cmpxchg(p, pold, new)
+
+#define smp_mb()                                 \
+	({                                       \
+		unsigned long __val;             \
+		__sync_fetch_and_add(&__val, 0); \
+	})
+
+#define smp_rmb()                   \
+	({                          \
+		if (!CONFIG_X86_64) \
+			smp_mb();   \
+		else                \
+			barrier();  \
+	})
+
+#define smp_wmb()                   \
+	({                          \
+		if (!CONFIG_X86_64) \
+			smp_mb();   \
+		else                \
+			barrier();  \
+	})
+
+/* Control dependency provides LOAD->STORE, provide LOAD->LOAD */
+#define smp_acquire__after_ctrl_dep() ({ smp_rmb(); })
+
+#define smp_load_acquire(p)                                  \
+	({                                                   \
+		__unqual_typeof(*(p)) __v = READ_ONCE(*(p)); \
+		if (!CONFIG_X86_64)                          \
+			smp_mb();                            \
+		barrier();                                   \
+		__v;                                         \
+	})
+
+#define smp_store_release(p, val)      \
+	({                             \
+		if (!CONFIG_X86_64)    \
+			smp_mb();      \
+		barrier();             \
+		WRITE_ONCE(*(p), val); \
+	})
+
+#define smp_cond_load_relaxed_label(p, cond_expr, label)                \
+	({                                                              \
+		typeof(p) __ptr = (p);                                  \
+		__unqual_typeof(*(p)) VAL;                              \
+		for (;;) {                                              \
+			VAL = (__unqual_typeof(*(p)))READ_ONCE(*__ptr); \
+			if (cond_expr)                                  \
+				break;                                  \
+			cond_break_label(label);			\
+			cpu_relax();                                    \
+		}                                                       \
+		(typeof(*(p)))VAL;                                      \
+	})
+
+#define smp_cond_load_acquire_label(p, cond_expr, label)                  \
+	({                                                                \
+		__unqual_typeof(*p) __val =                               \
+			smp_cond_load_relaxed_label(p, cond_expr, label); \
+		smp_acquire__after_ctrl_dep();                            \
+		(typeof(*(p)))__val;                                      \
+	})
+
+#define atomic_read(p) READ_ONCE((p)->counter)
+
+#define atomic_cond_read_relaxed_label(p, cond_expr, label) \
+	smp_cond_load_relaxed_label(&(p)->counter, cond_expr, label)
+
+#define atomic_cond_read_acquire_label(p, cond_expr, label) \
+	smp_cond_load_acquire_label(&(p)->counter, cond_expr, label)
+
+#define atomic_try_cmpxchg_relaxed(p, pold, new) \
+	try_cmpxchg_relaxed(&(p)->counter, pold, new)
+
+#define atomic_try_cmpxchg_acquire(p, pold, new) \
+	try_cmpxchg_acquire(&(p)->counter, pold, new)
+
+#endif /* BPF_ATOMIC_H */
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH bpf-next v3 3/3] selftests/bpf: Add tests for arena spin lock
  2025-03-05  1:18 [PATCH bpf-next v3 0/3] Arena Spin Lock Kumar Kartikeya Dwivedi
  2025-03-05  1:18 ` [PATCH bpf-next v3 1/3] selftests/bpf: Introduce cond_break_label Kumar Kartikeya Dwivedi
  2025-03-05  1:18 ` [PATCH bpf-next v3 2/3] selftests/bpf: Introduce arena spin lock Kumar Kartikeya Dwivedi
@ 2025-03-05  1:18 ` Kumar Kartikeya Dwivedi
  2025-03-05  2:03   ` Alexei Starovoitov
  2 siblings, 1 reply; 11+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-03-05  1:18 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Eduard Zingerman, Tejun Heo, Emil Tsalapatis,
	Barret Rhoden, Josh Don, Dohyun Kim, kkd, kernel-team

Add some basic selftests for qspinlock built over BPF arena using
cond_break_label macro.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 .../bpf/prog_tests/arena_spin_lock.c          | 102 ++++++++++++++++++
 .../selftests/bpf/progs/arena_spin_lock.c     |  51 +++++++++
 2 files changed, 153 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/arena_spin_lock.c
 create mode 100644 tools/testing/selftests/bpf/progs/arena_spin_lock.c

diff --git a/tools/testing/selftests/bpf/prog_tests/arena_spin_lock.c b/tools/testing/selftests/bpf/prog_tests/arena_spin_lock.c
new file mode 100644
index 000000000000..2cc078ed1ddb
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/arena_spin_lock.c
@@ -0,0 +1,102 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2025 Meta Platforms, Inc. and affiliates. */
+#include <test_progs.h>
+#include <network_helpers.h>
+#include <sys/sysinfo.h>
+
+struct qspinlock { int val; };
+typedef struct qspinlock arena_spinlock_t;
+
+struct arena_qnode {
+	unsigned long next;
+	int count;
+	int locked;
+};
+
+#include "arena_spin_lock.skel.h"
+
+static long cpu;
+int *counter;
+
+static void *spin_lock_thread(void *arg)
+{
+	int err, prog_fd = *(u32 *)arg;
+	LIBBPF_OPTS(bpf_test_run_opts, topts,
+		.data_in = &pkt_v4,
+		.data_size_in = sizeof(pkt_v4),
+		.repeat = 1,
+	);
+	cpu_set_t cpuset;
+
+	CPU_ZERO(&cpuset);
+	CPU_SET(__sync_fetch_and_add(&cpu, 1), &cpuset);
+	ASSERT_OK(pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset), "cpu affinity");
+
+	while (*READ_ONCE(counter) <= 1000) {
+		err = bpf_prog_test_run_opts(prog_fd, &topts);
+		if (!ASSERT_OK(err, "test_run err"))
+			break;
+		if (!ASSERT_EQ((int)topts.retval, 0, "test_run retval"))
+			break;
+	}
+	pthread_exit(arg);
+}
+
+static void test_arena_spin_lock_size(int size)
+{
+	LIBBPF_OPTS(bpf_test_run_opts, topts);
+	struct arena_spin_lock *skel;
+	pthread_t thread_id[16];
+	int prog_fd, i, err;
+	void *ret;
+
+	if (get_nprocs() < 2) {
+		test__skip();
+		return;
+	}
+
+	skel = arena_spin_lock__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "arena_spin_lock__open_and_load"))
+		return;
+	if (skel->data->test_skip == 2) {
+		test__skip();
+		goto end;
+	}
+	counter = &skel->bss->counter;
+	skel->bss->cs_count = size;
+
+	prog_fd = bpf_program__fd(skel->progs.prog);
+	for (i = 0; i < 16; i++) {
+		err = pthread_create(&thread_id[i], NULL, &spin_lock_thread, &prog_fd);
+		if (!ASSERT_OK(err, "pthread_create"))
+			goto end;
+	}
+
+	for (i = 0; i < 16; i++) {
+		if (!ASSERT_OK(pthread_join(thread_id[i], &ret), "pthread_join"))
+			goto end;
+		if (!ASSERT_EQ(ret, &prog_fd, "ret == prog_fd"))
+			goto end;
+	}
+end:
+	arena_spin_lock__destroy(skel);
+	return;
+}
+
+void test_arena_spin_lock(void)
+{
+	if (test__start_subtest("arena_spin_lock_1"))
+		test_arena_spin_lock_size(1);
+	cpu = 0;
+	if (test__start_subtest("arena_spin_lock_1000"))
+		test_arena_spin_lock_size(1000);
+	cpu = 0;
+	if (test__start_subtest("arena_spin_lock_10000"))
+		test_arena_spin_lock_size(10000);
+	cpu = 0;
+	if (test__start_subtest("arena_spin_lock_100000"))
+		test_arena_spin_lock_size(100000);
+	cpu = 0;
+	if (test__start_subtest("arena_spin_lock_500000"))
+		test_arena_spin_lock_size(500000);
+}
diff --git a/tools/testing/selftests/bpf/progs/arena_spin_lock.c b/tools/testing/selftests/bpf/progs/arena_spin_lock.c
new file mode 100644
index 000000000000..3e8ce807e028
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/arena_spin_lock.c
@@ -0,0 +1,51 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2025 Meta Platforms, Inc. and affiliates. */
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_helpers.h>
+#include "bpf_misc.h"
+#include "bpf_arena_spin_lock.h"
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARENA);
+	__uint(map_flags, BPF_F_MMAPABLE);
+	__uint(max_entries, 100); /* number of pages */
+#ifdef __TARGET_ARCH_arm64
+	__ulong(map_extra, 0x1ull << 32); /* start of mmap() region */
+#else
+	__ulong(map_extra, 0x1ull << 44); /* start of mmap() region */
+#endif
+} arena SEC(".maps");
+
+int cs_count;
+
+#if defined(ENABLE_ATOMICS_TESTS) && defined(__BPF_FEATURE_ADDR_SPACE_CAST)
+arena_spinlock_t __arena lock;
+void *ptr;
+int test_skip = 1;
+#else
+int test_skip = 2;
+#endif
+
+int counter;
+
+SEC("tc")
+int prog(void *ctx)
+{
+	int ret = -2;
+
+#if defined(ENABLE_ATOMICS_TESTS) && defined(__BPF_FEATURE_ADDR_SPACE_CAST)
+	unsigned long flags;
+
+	ptr = &arena;
+	if ((ret = arena_spin_lock_irqsave(&lock, flags)))
+		return ret;
+	WRITE_ONCE(counter, READ_ONCE(counter) + 1);
+	bpf_repeat(cs_count);
+	ret = 0;
+	arena_spin_unlock_irqrestore(&lock, flags);
+#endif
+	return ret;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH bpf-next v3 3/3] selftests/bpf: Add tests for arena spin lock
  2025-03-05  1:18 ` [PATCH bpf-next v3 3/3] selftests/bpf: Add tests for " Kumar Kartikeya Dwivedi
@ 2025-03-05  2:03   ` Alexei Starovoitov
  2025-03-05  2:14     ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 11+ messages in thread
From: Alexei Starovoitov @ 2025-03-05  2:03 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Eduard Zingerman, Tejun Heo, Emil Tsalapatis,
	Barret Rhoden, Josh Don, Dohyun Kim, kkd, Kernel Team

On Tue, Mar 4, 2025 at 5:18 PM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
>
> Add some basic selftests for qspinlock built over BPF arena using
> cond_break_label macro.
>
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
>  .../bpf/prog_tests/arena_spin_lock.c          | 102 ++++++++++++++++++
>  .../selftests/bpf/progs/arena_spin_lock.c     |  51 +++++++++
>  2 files changed, 153 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/arena_spin_lock.c
>  create mode 100644 tools/testing/selftests/bpf/progs/arena_spin_lock.c
>
> diff --git a/tools/testing/selftests/bpf/prog_tests/arena_spin_lock.c b/tools/testing/selftests/bpf/prog_tests/arena_spin_lock.c
> new file mode 100644
> index 000000000000..2cc078ed1ddb
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/arena_spin_lock.c
> @@ -0,0 +1,102 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2025 Meta Platforms, Inc. and affiliates. */
> +#include <test_progs.h>
> +#include <network_helpers.h>
> +#include <sys/sysinfo.h>
> +
> +struct qspinlock { int val; };
> +typedef struct qspinlock arena_spinlock_t;
> +
> +struct arena_qnode {
> +       unsigned long next;
> +       int count;
> +       int locked;
> +};
> +
> +#include "arena_spin_lock.skel.h"
> +
> +static long cpu;
> +int *counter;
> +
> +static void *spin_lock_thread(void *arg)
> +{
> +       int err, prog_fd = *(u32 *)arg;
> +       LIBBPF_OPTS(bpf_test_run_opts, topts,
> +               .data_in = &pkt_v4,
> +               .data_size_in = sizeof(pkt_v4),
> +               .repeat = 1,
> +       );

Why bother with 'tc' prog type?
Pick syscall type, and above will be shorter:
LIBBPF_OPTS(bpf_test_run_opts, opts);

> +       cpu_set_t cpuset;
> +
> +       CPU_ZERO(&cpuset);
> +       CPU_SET(__sync_fetch_and_add(&cpu, 1), &cpuset);
> +       ASSERT_OK(pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset), "cpu affinity");
> +
> +       while (*READ_ONCE(counter) <= 1000) {

READ_ONCE(*counter) ?

but why add this user->kernel switch overhead?
Use .repeat = 1000,
one bpf_prog_test_run_opts() call,
and check at the end that *counter == 1000?
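
Something like this (an untested sketch of that shape for the existing
tc program; with a single thread the final count would be 1000):

LIBBPF_OPTS(bpf_test_run_opts, topts,
	.data_in = &pkt_v4,
	.data_size_in = sizeof(pkt_v4),
	.repeat = 1000,
);

err = bpf_prog_test_run_opts(prog_fd, &topts);
ASSERT_OK(err, "test_run err");
ASSERT_EQ(READ_ONCE(*counter), 1000, "counter");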

> +               err = bpf_prog_test_run_opts(prog_fd, &topts);
> +               if (!ASSERT_OK(err, "test_run err"))
> +                       break;
> +               if (!ASSERT_EQ((int)topts.retval, 0, "test_run retval"))
> +                       break;
> +       }
> +       pthread_exit(arg);
> +}
> +
> +static void test_arena_spin_lock_size(int size)
> +{
> +       LIBBPF_OPTS(bpf_test_run_opts, topts);
> +       struct arena_spin_lock *skel;
> +       pthread_t thread_id[16];
> +       int prog_fd, i, err;
> +       void *ret;
> +
> +       if (get_nprocs() < 2) {
> +               test__skip();
> +               return;
> +       }
> +
> +       skel = arena_spin_lock__open_and_load();
> +       if (!ASSERT_OK_PTR(skel, "arena_spin_lock__open_and_load"))
> +               return;
> +       if (skel->data->test_skip == 2) {
> +               test__skip();
> +               goto end;
> +       }
> +       counter = &skel->bss->counter;
> +       skel->bss->cs_count = size;
> +
> +       prog_fd = bpf_program__fd(skel->progs.prog);
> +       for (i = 0; i < 16; i++) {
> +               err = pthread_create(&thread_id[i], NULL, &spin_lock_thread, &prog_fd);
> +               if (!ASSERT_OK(err, "pthread_create"))
> +                       goto end;
> +       }
> +
> +       for (i = 0; i < 16; i++) {
> +               if (!ASSERT_OK(pthread_join(thread_id[i], &ret), "pthread_join"))
> +                       goto end;
> +               if (!ASSERT_EQ(ret, &prog_fd, "ret == prog_fd"))
> +                       goto end;
> +       }
> +end:
> +       arena_spin_lock__destroy(skel);
> +       return;
> +}
> +
> +void test_arena_spin_lock(void)
> +{
> +       if (test__start_subtest("arena_spin_lock_1"))
> +               test_arena_spin_lock_size(1);
> +       cpu = 0;
> +       if (test__start_subtest("arena_spin_lock_1000"))
> +               test_arena_spin_lock_size(1000);
> +       cpu = 0;
> +       if (test__start_subtest("arena_spin_lock_10000"))
> +               test_arena_spin_lock_size(10000);
> +       cpu = 0;
> +       if (test__start_subtest("arena_spin_lock_100000"))
> +               test_arena_spin_lock_size(100000);
> +       cpu = 0;
> +       if (test__start_subtest("arena_spin_lock_500000"))
> +               test_arena_spin_lock_size(500000);

Do 10k and 500k make a difference?
I suspect 1, 1k and 100k would cover the interesting range.

> +}
> diff --git a/tools/testing/selftests/bpf/progs/arena_spin_lock.c b/tools/testing/selftests/bpf/progs/arena_spin_lock.c
> new file mode 100644
> index 000000000000..3e8ce807e028
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/arena_spin_lock.c
> @@ -0,0 +1,51 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2025 Meta Platforms, Inc. and affiliates. */
> +#include <vmlinux.h>
> +#include <bpf/bpf_tracing.h>
> +#include <bpf/bpf_helpers.h>
> +#include "bpf_misc.h"
> +#include "bpf_arena_spin_lock.h"
> +
> +struct {
> +       __uint(type, BPF_MAP_TYPE_ARENA);
> +       __uint(map_flags, BPF_F_MMAPABLE);
> +       __uint(max_entries, 100); /* number of pages */
> +#ifdef __TARGET_ARCH_arm64
> +       __ulong(map_extra, 0x1ull << 32); /* start of mmap() region */
> +#else
> +       __ulong(map_extra, 0x1ull << 44); /* start of mmap() region */
> +#endif
> +} arena SEC(".maps");
> +
> +int cs_count;
> +
> +#if defined(ENABLE_ATOMICS_TESTS) && defined(__BPF_FEATURE_ADDR_SPACE_CAST)
> +arena_spinlock_t __arena lock;
> +void *ptr;
> +int test_skip = 1;
> +#else
> +int test_skip = 2;
> +#endif
> +
> +int counter;
> +
> +SEC("tc")
> +int prog(void *ctx)
> +{
> +       int ret = -2;
> +
> +#if defined(ENABLE_ATOMICS_TESTS) && defined(__BPF_FEATURE_ADDR_SPACE_CAST)
> +       unsigned long flags;
> +
> +       ptr = &arena;

Is it really necessary?

> +       if ((ret = arena_spin_lock_irqsave(&lock, flags)))
> +               return ret;
> +       WRITE_ONCE(counter, READ_ONCE(counter) + 1);
> +       bpf_repeat(cs_count);
> +       ret = 0;
> +       arena_spin_unlock_irqrestore(&lock, flags);
> +#endif
> +       return ret;
> +}
> +
> +char _license[] SEC("license") = "GPL";
> --
> 2.47.1
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH bpf-next v3 3/3] selftests/bpf: Add tests for arena spin lock
  2025-03-05  2:03   ` Alexei Starovoitov
@ 2025-03-05  2:14     ` Kumar Kartikeya Dwivedi
  2025-03-05  2:24       ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 11+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-03-05  2:14 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Eduard Zingerman, Tejun Heo, Emil Tsalapatis,
	Barret Rhoden, Josh Don, Dohyun Kim, kkd, Kernel Team

On Wed, 5 Mar 2025 at 03:04, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Mar 4, 2025 at 5:18 PM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
> >
> > Add some basic selftests for qspinlock built over BPF arena using
> > cond_break_label macro.
> >
> > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > ---
> >  .../bpf/prog_tests/arena_spin_lock.c          | 102 ++++++++++++++++++
> >  .../selftests/bpf/progs/arena_spin_lock.c     |  51 +++++++++
> >  2 files changed, 153 insertions(+)
> >  create mode 100644 tools/testing/selftests/bpf/prog_tests/arena_spin_lock.c
> >  create mode 100644 tools/testing/selftests/bpf/progs/arena_spin_lock.c
> >
> > diff --git a/tools/testing/selftests/bpf/prog_tests/arena_spin_lock.c b/tools/testing/selftests/bpf/prog_tests/arena_spin_lock.c
> > new file mode 100644
> > index 000000000000..2cc078ed1ddb
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/prog_tests/arena_spin_lock.c
> > @@ -0,0 +1,102 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/* Copyright (c) 2025 Meta Platforms, Inc. and affiliates. */
> > +#include <test_progs.h>
> > +#include <network_helpers.h>
> > +#include <sys/sysinfo.h>
> > +
> > +struct qspinlock { int val; };
> > +typedef struct qspinlock arena_spinlock_t;
> > +
> > +struct arena_qnode {
> > +       unsigned long next;
> > +       int count;
> > +       int locked;
> > +};
> > +
> > +#include "arena_spin_lock.skel.h"
> > +
> > +static long cpu;
> > +int *counter;
> > +
> > +static void *spin_lock_thread(void *arg)
> > +{
> > +       int err, prog_fd = *(u32 *)arg;
> > +       LIBBPF_OPTS(bpf_test_run_opts, topts,
> > +               .data_in = &pkt_v4,
> > +               .data_size_in = sizeof(pkt_v4),
> > +               .repeat = 1,
> > +       );
>
> Why bother with 'tc' prog type?
> Pick syscall type, and above will be shorter:
> LIBBPF_OPTS(bpf_test_run_opts, opts);
>

Ack.

> > +       cpu_set_t cpuset;
> > +
> > +       CPU_ZERO(&cpuset);
> > +       CPU_SET(__sync_fetch_and_add(&cpu, 1), &cpuset);
> > +       ASSERT_OK(pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset), "cpu affinity");
> > +
> > +       while (*READ_ONCE(counter) <= 1000) {
>
> READ_ONCE(*counter) ?
>
> but why add this user->kernel switch overhead.
> Use .repeat = 1000
> one bpf_prog_test_run_opts()
> and check at the end that *counter == 1000 ?

Ok.

>
> > +               err = bpf_prog_test_run_opts(prog_fd, &topts);
> > +               if (!ASSERT_OK(err, "test_run err"))
> > +                       break;
> > +               if (!ASSERT_EQ((int)topts.retval, 0, "test_run retval"))
> > +                       break;
> > +       }
> > +       pthread_exit(arg);
> > +}
> > +
> > +static void test_arena_spin_lock_size(int size)
> > +{
> > +       LIBBPF_OPTS(bpf_test_run_opts, topts);
> > +       struct arena_spin_lock *skel;
> > +       pthread_t thread_id[16];
> > +       int prog_fd, i, err;
> > +       void *ret;
> > +
> > +       if (get_nprocs() < 2) {
> > +               test__skip();
> > +               return;
> > +       }
> > +
> > +       skel = arena_spin_lock__open_and_load();
> > +       if (!ASSERT_OK_PTR(skel, "arena_spin_lock__open_and_load"))
> > +               return;
> > +       if (skel->data->test_skip == 2) {
> > +               test__skip();
> > +               goto end;
> > +       }
> > +       counter = &skel->bss->counter;
> > +       skel->bss->cs_count = size;
> > +
> > +       prog_fd = bpf_program__fd(skel->progs.prog);
> > +       for (i = 0; i < 16; i++) {
> > +               err = pthread_create(&thread_id[i], NULL, &spin_lock_thread, &prog_fd);
> > +               if (!ASSERT_OK(err, "pthread_create"))
> > +                       goto end;
> > +       }
> > +
> > +       for (i = 0; i < 16; i++) {
> > +               if (!ASSERT_OK(pthread_join(thread_id[i], &ret), "pthread_join"))
> > +                       goto end;
> > +               if (!ASSERT_EQ(ret, &prog_fd, "ret == prog_fd"))
> > +                       goto end;
> > +       }
> > +end:
> > +       arena_spin_lock__destroy(skel);
> > +       return;
> > +}
> > +
> > +void test_arena_spin_lock(void)
> > +{
> > +       if (test__start_subtest("arena_spin_lock_1"))
> > +               test_arena_spin_lock_size(1);
> > +       cpu = 0;
> > +       if (test__start_subtest("arena_spin_lock_1000"))
> > +               test_arena_spin_lock_size(1000);
> > +       cpu = 0;
> > +       if (test__start_subtest("arena_spin_lock_10000"))
> > +               test_arena_spin_lock_size(10000);
> > +       cpu = 0;
> > +       if (test__start_subtest("arena_spin_lock_100000"))
> > +               test_arena_spin_lock_size(100000);
> > +       cpu = 0;
> > +       if (test__start_subtest("arena_spin_lock_500000"))
> > +               test_arena_spin_lock_size(500000);
>
> Do 10k and 500k make a difference?
> I suspect 1, 1k and 100k would cover the interesting range.

They do make a difference inside a VM, but not on bare-metal.
I can stick to three sizes though.

>
> > +}
> > diff --git a/tools/testing/selftests/bpf/progs/arena_spin_lock.c b/tools/testing/selftests/bpf/progs/arena_spin_lock.c
> > new file mode 100644
> > index 000000000000..3e8ce807e028
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/progs/arena_spin_lock.c
> > @@ -0,0 +1,51 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/* Copyright (c) 2025 Meta Platforms, Inc. and affiliates. */
> > +#include <vmlinux.h>
> > +#include <bpf/bpf_tracing.h>
> > +#include <bpf/bpf_helpers.h>
> > +#include "bpf_misc.h"
> > +#include "bpf_arena_spin_lock.h"
> > +
> > +struct {
> > +       __uint(type, BPF_MAP_TYPE_ARENA);
> > +       __uint(map_flags, BPF_F_MMAPABLE);
> > +       __uint(max_entries, 100); /* number of pages */
> > +#ifdef __TARGET_ARCH_arm64
> > +       __ulong(map_extra, 0x1ull << 32); /* start of mmap() region */
> > +#else
> > +       __ulong(map_extra, 0x1ull << 44); /* start of mmap() region */
> > +#endif
> > +} arena SEC(".maps");
> > +
> > +int cs_count;
> > +
> > +#if defined(ENABLE_ATOMICS_TESTS) && defined(__BPF_FEATURE_ADDR_SPACE_CAST)
> > +arena_spinlock_t __arena lock;
> > +void *ptr;
> > +int test_skip = 1;
> > +#else
> > +int test_skip = 2;
> > +#endif
> > +
> > +int counter;
> > +
> > +SEC("tc")
> > +int prog(void *ctx)
> > +{
> > +       int ret = -2;
> > +
> > +#if defined(ENABLE_ATOMICS_TESTS) && defined(__BPF_FEATURE_ADDR_SPACE_CAST)
> > +       unsigned long flags;
> > +
> > +       ptr = &arena;
>
> Is it really necessary?

Probably a remnant from previous versions, but sometimes, if you don't
do this, the verifier gives you an error saying addr_space_cast cannot
be used in a program with no associated arena.

That probably happens when it never sees a reference to the arena map
anywhere in the program; assigning &arena creates that reference. But
in this case I dropped it and it still works.

>
> > +       if ((ret = arena_spin_lock_irqsave(&lock, flags)))
> > +               return ret;
> > +       WRITE_ONCE(counter, READ_ONCE(counter) + 1);
> > +       bpf_repeat(cs_count);
> > +       ret = 0;
> > +       arena_spin_unlock_irqrestore(&lock, flags);
> > +#endif
> > +       return ret;
> > +}
> > +
> > +char _license[] SEC("license") = "GPL";
> > --
> > 2.47.1
> >

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH bpf-next v3 3/3] selftests/bpf: Add tests for arena spin lock
  2025-03-05  2:14     ` Kumar Kartikeya Dwivedi
@ 2025-03-05  2:24       ` Kumar Kartikeya Dwivedi
  0 siblings, 0 replies; 11+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-03-05  2:24 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Eduard Zingerman, Tejun Heo, Emil Tsalapatis,
	Barret Rhoden, Josh Don, Dohyun Kim, kkd, Kernel Team

On Wed, 5 Mar 2025 at 03:14, Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
>
> On Wed, 5 Mar 2025 at 03:04, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Tue, Mar 4, 2025 at 5:18 PM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
> > >
> > > Add some basic selftests for qspinlock built over BPF arena using
> > > cond_break_label macro.
> > >
> > > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > > ---
> > >  .../bpf/prog_tests/arena_spin_lock.c          | 102 ++++++++++++++++++
> > >  .../selftests/bpf/progs/arena_spin_lock.c     |  51 +++++++++
> > >  2 files changed, 153 insertions(+)
> > >  create mode 100644 tools/testing/selftests/bpf/prog_tests/arena_spin_lock.c
> > >  create mode 100644 tools/testing/selftests/bpf/progs/arena_spin_lock.c
> > >
> > > diff --git a/tools/testing/selftests/bpf/prog_tests/arena_spin_lock.c b/tools/testing/selftests/bpf/prog_tests/arena_spin_lock.c
> > > new file mode 100644
> > > index 000000000000..2cc078ed1ddb
> > > --- /dev/null
> > > +++ b/tools/testing/selftests/bpf/prog_tests/arena_spin_lock.c
> > > @@ -0,0 +1,102 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +/* Copyright (c) 2025 Meta Platforms, Inc. and affiliates. */
> > > +#include <test_progs.h>
> > > +#include <network_helpers.h>
> > > +#include <sys/sysinfo.h>
> > > +
> > > +struct qspinlock { int val; };
> > > +typedef struct qspinlock arena_spinlock_t;
> > > +
> > > +struct arena_qnode {
> > > +       unsigned long next;
> > > +       int count;
> > > +       int locked;
> > > +};
> > > +
> > > +#include "arena_spin_lock.skel.h"
> > > +
> > > +static long cpu;
> > > +int *counter;
> > > +
> > > +static void *spin_lock_thread(void *arg)
> > > +{
> > > +       int err, prog_fd = *(u32 *)arg;
> > > +       LIBBPF_OPTS(bpf_test_run_opts, topts,
> > > +               .data_in = &pkt_v4,
> > > +               .data_size_in = sizeof(pkt_v4),
> > > +               .repeat = 1,
> > > +       );
> >
> > Why bother with 'tc' prog type?
> > Pick syscall type, and above will be shorter:
> > LIBBPF_OPTS(bpf_test_run_opts, opts);
> >
>
> Ack.

Sadly, syscall prog_test_run doesn't support 'repeat' field, so we'll
have to stick with tc.

>
> > > +       cpu_set_t cpuset;
> > > +
> > > +       CPU_ZERO(&cpuset);
> > > +       CPU_SET(__sync_fetch_and_add(&cpu, 1), &cpuset);
> > > +       ASSERT_OK(pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset), "cpu affinity");
> > > +
> > > +       while (*READ_ONCE(counter) <= 1000) {
> >
> > READ_ONCE(*counter) ?
> >
> > but why add this user->kernel switch overhead.
> > Use .repeat = 1000
> > one bpf_prog_test_run_opts()
> > and check at the end that *counter == 1000 ?
>
> Ok.
>

One of the reasons to do this was to give the other threads time to
catch up with the first thread as it starts storming through the
counter increments. Otherwise, with a short critical section, there is
no contention at all and most code paths don't get triggered.
But I'll use a pthread_barrier instead.
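
Something like this (a sketch; the barrier is hypothetical and would be
initialized for all 16 threads before they are spawned):

static pthread_barrier_t barrier; /* pthread_barrier_init(&barrier, NULL, 16); */

static void *spin_lock_thread(void *arg)
{
	...
	/* rendezvous so all threads start contending at once */
	pthread_barrier_wait(&barrier);
	err = bpf_prog_test_run_opts(prog_fd, &topts);
	...
}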

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH bpf-next v3 2/3] selftests/bpf: Introduce arena spin lock
  2025-03-05  1:18 ` [PATCH bpf-next v3 2/3] selftests/bpf: Introduce arena spin lock Kumar Kartikeya Dwivedi
@ 2025-03-05  2:27   ` Alexei Starovoitov
  2025-03-05  3:14     ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 11+ messages in thread
From: Alexei Starovoitov @ 2025-03-05  2:27 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Eduard Zingerman, Tejun Heo, Emil Tsalapatis,
	Barret Rhoden, Josh Don, Dohyun Kim, kkd, Kernel Team

On Tue, Mar 4, 2025 at 5:19 PM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
>
> Implement the queued spin lock algorithm as a BPF program for lock
> words living in BPF arena.
>
> The algorithm is copied from kernel/locking/qspinlock.c and adapted for
> BPF use.
>
> We first implement abstract helpers for portable atomics and
> acquire/release load instructions, relying on the presence of X86_64
> and on implementation details of its JIT to elide expensive barriers,
> and falling back to slow but correct implementations elsewhere. When
> support for acquire/release load/store instructions lands, we can
> improve this state.
>
> Then, the qspinlock algorithm is adapted to remove the dependence on
> multi-word atomics due to lack of support in the BPF ISA. For instance,
> xchg_tail cannot use a 16-bit xchg, and needs to be implemented as a
> 32-bit try_cmpxchg loop.
>
> Loops which are seemingly infinite from the verifier's PoV are
> annotated with the cond_break_label macro to return an error. Only
> 1024 CPUs (NR_CPUS) are supported.
>
> We need to allocate 1025 qnodes, one more than NR_CPUs, since if libbpf
> maps the qnode global variable starting at the first page of the arena,
> and the lower 32-bits are zeroed for the base address, the first node
> for CPU 0 will become indistinguishable from a NULL pointer, leading to
> spurious timeouts and failures.
>
> Note that the slow path is a global function, hence the verifier doesn't
> know the return value's precise range. The recommended usage is to
> always test against zero for success, and not ret < 0 for error, as the
> verifier would assume the ret > 0 case has not been accounted for.
>
> [...]
>
> +static inline struct arena_mcs_spinlock __arena *decode_tail(u32 tail)
> +{
> +       u32 cpu = (tail >> _Q_TAIL_CPU_OFFSET) - 1;
> +       u32 idx = (tail &  _Q_TAIL_IDX_MASK) >> _Q_TAIL_IDX_OFFSET;
> +
> +       /* See comments over definition of qnodes for the + 1. */
> +       if (likely(idx < _Q_MAX_NODES && cpu < _Q_MAX_CPUS))
> +               return &qnodes[cpu + 1][idx].mcs;
> +       bpf_printk("RUNTIME ERROR: %s idx=%u and cpu=%u are out-of-bounds!!!", __func__, idx, cpu);

Is it really possible? 1024 check is done early and
idx cannot be out of bounds.
So only to detect corruption?

> +       return NULL;
> +}
> +
> [...]
>
> diff --git a/tools/testing/selftests/bpf/bpf_atomic.h b/tools/testing/selftests/bpf/bpf_atomic.h
> [...]
>
> +#define __scalar_type_to_expr_cases(type) \
> +       unsigned type : (unsigned type)0, signed type : (signed type)0
> +/*
> + * This is lifted from __unqual_scalar_typeof in the kernel (which is used to
> + * lose const qualifier etc.), but adapted to also cover pointers. It is
> + * necessary because we ascertain type to create local variables in macros
> + * below, but for pointers with __arena tag, we'll ascertain the underlying type
> + * with the tag, causing a compilation error (as local variables that are not
> + * pointers may not have __arena tag). This trick allows losing the qualifier.
> + */

I bet that's what's causing llvm to miss the cases where it needs
to emit cast_user(), or more likely it's forcing llvm to emit
cast_kern() where it shouldn't be doing it.
Not sure which pointer you saw as indistinguishable from NULL.
I suspect it's smp_cond_load_relaxed_label(&node->next, ..
which is doing:
__unqual_typeof(*(p)) VAL;
(__unqual_typeof(*(p)))READ_ONCE(*__ptr);

and llvm will insert cast_kern() there,
so if (VAL) always sees upper 32-bit as zero.
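
Roughly, what I suspect the resulting codegen does (a sketch under that
assumption, not the exact asm llvm emits):

  for (;;) {
          next = READ_ONCE(node->next);  /* 0x100000000000 in memory */
          /* llvm inserts: next = addr_space_cast(next, 0x0, 0x1),   */
          /* which keeps only the lower 32 bits of the arena offset  */
          if (next)                      /* never true for page 0    */
                  break;
          cond_break_label(release_node_err); /* -> spurious timeout */
  }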

So I suspect it's not a zero page issue.

> +#define __unqual_typeof(x)                              \
> +       typeof(_Generic((x),                            \
> +               char: (char)0,                          \
> +               __scalar_type_to_expr_cases(char),      \
> +               __scalar_type_to_expr_cases(short),     \
> +               __scalar_type_to_expr_cases(int),       \
> +               __scalar_type_to_expr_cases(long),      \
> +               __scalar_type_to_expr_cases(long long), \
> +               default: (void *)0))
> [...]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH bpf-next v3 2/3] selftests/bpf: Introduce arena spin lock
  2025-03-05  2:27   ` Alexei Starovoitov
@ 2025-03-05  3:14     ` Kumar Kartikeya Dwivedi
  2025-03-05  3:40       ` Alexei Starovoitov
  0 siblings, 1 reply; 11+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-03-05  3:14 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Eduard Zingerman, Tejun Heo, Emil Tsalapatis,
	Barret Rhoden, Josh Don, Dohyun Kim, kkd, Kernel Team

On Wed, 5 Mar 2025 at 03:27, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Mar 4, 2025 at 5:19 PM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
> >
> > [...]
> >
> > +static inline struct arena_mcs_spinlock __arena *decode_tail(u32 tail)
> > +{
> > +       u32 cpu = (tail >> _Q_TAIL_CPU_OFFSET) - 1;
> > +       u32 idx = (tail &  _Q_TAIL_IDX_MASK) >> _Q_TAIL_IDX_OFFSET;
> > +
> > +       /* See comments over definition of qnodes for the + 1. */
> > +       if (likely(idx < _Q_MAX_NODES && cpu < _Q_MAX_CPUS))
> > +               return &qnodes[cpu + 1][idx].mcs;
> > +       bpf_printk("RUNTIME ERROR: %s idx=%u and cpu=%u are out-of-bounds!!!", __func__, idx, cpu);
>
> Is it really possible? 1024 check is done early and
> idx cannot be out of bounds.
> So only to detect corruption?

Yeah, shouldn't be possible unless corruption happens, but I can drop it.

>
> > [...]
> >
> > diff --git a/tools/testing/selftests/bpf/bpf_atomic.h b/tools/testing/selftests/bpf/bpf_atomic.h
> > [...]
> >
> > +#define __scalar_type_to_expr_cases(type) \
> > +       unsigned type : (unsigned type)0, signed type : (signed type)0
> > +/*
> > + * This is lifted from __unqual_scalar_typeof in the kernel (which is used to
> > + * lose const qualifier etc.), but adapted to also cover pointers. It is
> > + * necessary because we ascertain type to create local variables in macros
> > + * below, but for pointers with __arena tag, we'll ascertain the underlying type
> > + * with the tag, causing a compilation error (as local variables that are not
> > + * pointers may not have __arena tag). This trick allows losing the qualifier.
> > + */
>
> I bet that's what's causing llvm to miss the cases where it needs
> to emit cast_user(), or more likely it's forcing llvm to emit
> cast_kern() where it shouldn't be doing it.
> Not sure which pointer you saw as indistinguishable from NULL.
> I suspect it's smp_cond_load_relaxed_label(&node->next, ..

Yes.

> which is doing:
> __unqual_typeof(*(p)) VAL;
> (__unqual_typeof(*(p)))READ_ONCE(*__ptr);
>
> and llvm will insert cast_kern() there,

Yes, I do see an r1 = addr_space_cast(r2, 0x0, 0x1).
r2 is node->next loaded from arena pointer 'node'.

But I can't understand why that's a problem.

If I do
for (;;) {
  next = READ_ONCE(node->next);
  if (next)
     break;
  cond_break_label(...);
}

instead of the macro, everything works ok.

But that's because LLVM didn't insert a cast, and the verifier sees
next as a scalar.
So if next is 0x100000000000, it will see 0x100000000000.
With cast_kern it only sees 0.
It will probably be cast once we try to write to next->locked later on.

I would gather there are a lot of other cases where someone dereferences
a pointer before doing some pointer equality comparison.
In that case we might end up in the same situation.
ptr = load_from_arena;
x = ptr->xyz;
if (ptr == ptr2) { ... }

The extra cast_kern is certainly causing this to surface, but I am not
sure whether it's something to fix in the macro.

> so if (VAL) always sees upper 32-bit as zero.
>
> So I suspect it's not a zero page issue.
>

When I bpf_printk the node address of the qnode of CPU 0, it is
0x100000000000 i.e. user_vm_start. This is the pointer that's misdetected.
So it appears to be on the first page.
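
(And (u32)0x100000000000 == 0, so once anything truncates the node
pointer to its lower 32 bits -- which, if the above is right, is what
the inserted cast effectively does -- it becomes literally the same
value as NULL.)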



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH bpf-next v3 2/3] selftests/bpf: Introduce arena spin lock
  2025-03-05  3:14     ` Kumar Kartikeya Dwivedi
@ 2025-03-05  3:40       ` Alexei Starovoitov
  2025-03-05  4:28         ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 11+ messages in thread
From: Alexei Starovoitov @ 2025-03-05  3:40 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Eduard Zingerman, Tejun Heo, Emil Tsalapatis,
	Barret Rhoden, Josh Don, Dohyun Kim, kkd, Kernel Team

On Tue, Mar 4, 2025 at 7:15 PM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
>
> > which is doing:
> > __unqual_typeof(*(p)) VAL;
> > (__unqual_typeof(*(p)))READ_ONCE(*__ptr);
> >
> > and llvm will insert cast_kern() there,
>
> Yes, I do see an r1 = addr_space_cast(r2, 0x0, 0x1).
> r2 is node->next loaded from arena pointer 'node'.
>
> But I can't understand why that's a problem.
>
> If I do
> for (;;) {
>   next = READ_ONCE(node->next);
>   if (next)
>      break;
>   cond_break_label(...);
> }
>
> instead of the macro, everything works ok.

because the above doesn't have addr space casts.

> But that's because LLVM didn't insert a cast, and the verifier sees
> next as a scalar.
> So if next is 0x100000000000, it will see 0x100000000000.
> With cast_kern it only sees 0.

right.

> > It will probably be cast once we try to write to next->locked later on.

not quite.
In a typical program llvm will emit the bare minimum of cast_user()s,
because all pointers are valid full 64-bit user space addresses all the
time. The cast_kern() is needed for reads/writes through the pointer
if it's not a kernel pointer yet.
See list_add_head() in bpf_arena_list.h, which has a bunch of explicit
cast_kern/user calls (with llvm only a fraction of them will actually
be emitted), but they illustrate the idea:
        cast_user(first);
        cast_kern(n); // before writing into 'n' it has to be 'kern'
        WRITE_ONCE(n->next, first); // first has to be full 64-bit
        cast_kern(first); // ignore this one :) it's my mistake,
                          // it should be after the 'if'
        if (first) {
                tmp = &n->next;
                cast_user(tmp);
                WRITE_ONCE(first->pprev, tmp);
        }
        cast_user(n);
        WRITE_ONCE(h->first, n);

> I would gather there are a lot of other cases where someone
> dereferences a pointer before doing some pointer equality comparison.
> In that case we might end up in the same situation.
> ptr = load_from_arena;
> x = ptr->xyz;
> if (ptr == ptr2) { ... }

There shouldn't be any issues here.
The 'load from arena' will return a full 64-bit value, and pointers
should be stored as full 64-bit in memory.
ptr->xyz (assuming xyz is another pointer) will read a full 64-bit
value too.

> The extra cast_kern is certainly causing this to surface, but I am not
> sure whether it's something to fix in the macro.

I think it's a macro issue due to casting the addr space off.

> > so if (VAL) always sees upper 32-bit as zero.
> >
> > So I suspect it's not a zero page issue.
> >
>
> When I bpf_printk the node address of the qnode of CPU 0, it is
> 0x100000000000 i.e. user_vm_start. This is the pointer that's misdetected.
> So it appears to be on the first page.

yes, and it looks like the addr passed into printk is the correct full
64-bit value, as it should be.
So this part:
  return &qnodes[cpu + 1][idx].mcs;
is fine.
It's full 64-bit.
  &((struct arena_qnode __arena *)base + idx)->mcs;
is also ok.

There are no addr space casts there.
But the macro is problematic.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH bpf-next v3 2/3] selftests/bpf: Introduce arena spin lock
  2025-03-05  3:40       ` Alexei Starovoitov
@ 2025-03-05  4:28         ` Kumar Kartikeya Dwivedi
  0 siblings, 0 replies; 11+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-03-05  4:28 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Eduard Zingerman, Tejun Heo, Emil Tsalapatis,
	Barret Rhoden, Josh Don, Dohyun Kim, kkd, Kernel Team

On Wed, 5 Mar 2025 at 04:41, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> [...]
>
> There are no addr space casts there.
> But the macro is problematic.

Ok, makes sense. Pointer values should always be the full 64-bit
equivalent. I fixed up __unqual_typeof so it doesn't cause the extra
cast; it isn't necessary in the case of pointers anyway.
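
Roughly, the idea (a sketch; the exact fix may differ once it goes
through the tests) is to keep the address space in _Generic's default
case, so the macro no longer decays arena pointers to plain 'void *'
and llvm has no reason to emit an addr_space_cast:

#define __unqual_typeof(x)                              \
        typeof(_Generic((x),                            \
                char: (char)0,                          \
                __scalar_type_to_expr_cases(char),      \
                __scalar_type_to_expr_cases(short),     \
                __scalar_type_to_expr_cases(int),       \
                __scalar_type_to_expr_cases(long),      \
                __scalar_type_to_expr_cases(long long), \
                default: (void __arena *)0))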

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2025-03-05  4:29 UTC | newest]

Thread overview: 11+ messages
2025-03-05  1:18 [PATCH bpf-next v3 0/3] Arena Spin Lock Kumar Kartikeya Dwivedi
2025-03-05  1:18 ` [PATCH bpf-next v3 1/3] selftests/bpf: Introduce cond_break_label Kumar Kartikeya Dwivedi
2025-03-05  1:18 ` [PATCH bpf-next v3 2/3] selftests/bpf: Introduce arena spin lock Kumar Kartikeya Dwivedi
2025-03-05  2:27   ` Alexei Starovoitov
2025-03-05  3:14     ` Kumar Kartikeya Dwivedi
2025-03-05  3:40       ` Alexei Starovoitov
2025-03-05  4:28         ` Kumar Kartikeya Dwivedi
2025-03-05  1:18 ` [PATCH bpf-next v3 3/3] selftests/bpf: Add tests for " Kumar Kartikeya Dwivedi
2025-03-05  2:03   ` Alexei Starovoitov
2025-03-05  2:14     ` Kumar Kartikeya Dwivedi
2025-03-05  2:24       ` Kumar Kartikeya Dwivedi
