* [PATCH v1 0/7] perf bench: Add qspinlock benchmark
@ 2025-07-29  2:26 Yuzhuo Jing
  2025-07-29  2:26 ` [PATCH v1 1/7] tools: Import cmpxchg and xchg functions Yuzhuo Jing
                   ` (8 more replies)
  0 siblings, 9 replies; 18+ messages in thread
From: Yuzhuo Jing @ 2025-07-29  2:26 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, Liang Kan, Yuzhuo Jing, Yuzhuo Jing,
	Andrea Parri, Palmer Dabbelt, Charlie Jenkins,
	Sebastian Andrzej Siewior, Kumar Kartikeya Dwivedi,
	Alexei Starovoitov, Barret Rhoden, Alexandre Ghiti, Guo Ren,
	linux-kernel, linux-perf-users

As an effort to improve the perf bench subcommand, this patch series
adds a benchmark for the kernel's queued spinlock (qspinlock)
implementation.

This series imports the necessary kernel definitions such as atomics,
introduces a userspace per-cpu adapter, and imports the qspinlock
implementation from the kernel tree to the tools tree with minimal
adaptation.

The new subcommand makes it convenient to investigate the performance
of kernel lock implementations, for example with sampling:

    perf record -- ./perf bench sync qspinlock -t5
    perf report

Yuzhuo Jing (7):
  tools: Import cmpxchg and xchg functions
  tools: Import smp_cond_load and atomic_cond_read
  tools: Partial import of prefetch.h
  tools: Implement userspace per-cpu
  perf bench: Import qspinlock from kernel
  perf bench: Add 'bench sync qspinlock' subcommand
  perf bench sync: Add latency histogram functionality

 tools/arch/x86/include/asm/atomic.h           |  14 +
 tools/arch/x86/include/asm/cmpxchg.h          | 113 +++++
 tools/include/asm-generic/atomic-gcc.h        |  47 ++
 tools/include/asm/barrier.h                   |  58 +++
 tools/include/linux/atomic.h                  |  27 ++
 tools/include/linux/compiler_types.h          |  30 ++
 tools/include/linux/percpu-simulate.h         | 128 ++++++
 tools/include/linux/prefetch.h                |  41 ++
 tools/perf/bench/Build                        |   2 +
 tools/perf/bench/bench.h                      |   1 +
 .../perf/bench/include/mcs_spinlock-private.h | 115 +++++
 tools/perf/bench/include/mcs_spinlock.h       |  19 +
 tools/perf/bench/include/qspinlock-private.h  | 204 +++++++++
 tools/perf/bench/include/qspinlock.h          | 153 +++++++
 tools/perf/bench/include/qspinlock_types.h    |  98 +++++
 tools/perf/bench/qspinlock.c                  | 411 ++++++++++++++++++
 tools/perf/bench/sync.c                       | 329 ++++++++++++++
 tools/perf/builtin-bench.c                    |   7 +
 tools/perf/check-headers.sh                   |  32 ++
 19 files changed, 1829 insertions(+)
 create mode 100644 tools/include/linux/percpu-simulate.h
 create mode 100644 tools/include/linux/prefetch.h
 create mode 100644 tools/perf/bench/include/mcs_spinlock-private.h
 create mode 100644 tools/perf/bench/include/mcs_spinlock.h
 create mode 100644 tools/perf/bench/include/qspinlock-private.h
 create mode 100644 tools/perf/bench/include/qspinlock.h
 create mode 100644 tools/perf/bench/include/qspinlock_types.h
 create mode 100644 tools/perf/bench/qspinlock.c
 create mode 100644 tools/perf/bench/sync.c

-- 
2.50.1.487.gc89ff58d15-goog



* [PATCH v1 1/7] tools: Import cmpxchg and xchg functions
  2025-07-29  2:26 [PATCH v1 0/7] perf bench: Add qspinlock benchmark Yuzhuo Jing
@ 2025-07-29  2:26 ` Yuzhuo Jing
  2025-07-31  4:52   ` Namhyung Kim
  2025-08-08  6:11   ` kernel test robot
  2025-07-29  2:26 ` [PATCH v1 2/7] tools: Import smp_cond_load and atomic_cond_read Yuzhuo Jing
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 18+ messages in thread
From: Yuzhuo Jing @ 2025-07-29  2:26 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, Liang Kan, Yuzhuo Jing, Yuzhuo Jing,
	Andrea Parri, Palmer Dabbelt, Charlie Jenkins,
	Sebastian Andrzej Siewior, Kumar Kartikeya Dwivedi,
	Alexei Starovoitov, Barret Rhoden, Alexandre Ghiti, Guo Ren,
	linux-kernel, linux-perf-users

Import the atomic functions needed by qspinlock.  The x86
implementation is copied verbatim, and the generic implementation uses
compiler builtins.
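
For reference, the imported try_cmpxchg() follows the usual kernel
semantics: on failure it writes the current value back into *old, so a
retry loop needs no explicit re-read.  A minimal, illustrative usage
sketch (shared_counter_add() is a made-up name, not part of this patch):

    static inline int shared_counter_add(int *counter, int delta)
    {
            int old = *counter;

            /* On failure, try_cmpxchg() refreshes 'old' with the current value. */
            while (!try_cmpxchg(counter, &old, old + delta))
                    ;
            return old;
    }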

Signed-off-by: Yuzhuo Jing <yuzhuo@google.com>
---
 tools/arch/x86/include/asm/atomic.h    |  14 +++
 tools/arch/x86/include/asm/cmpxchg.h   | 113 +++++++++++++++++++++++++
 tools/include/asm-generic/atomic-gcc.h |  47 ++++++++++
 tools/include/linux/atomic.h           |  24 ++++++
 tools/include/linux/compiler_types.h   |  24 ++++++
 5 files changed, 222 insertions(+)

diff --git a/tools/arch/x86/include/asm/atomic.h b/tools/arch/x86/include/asm/atomic.h
index 365cf182df12..a55ffd4eb5f1 100644
--- a/tools/arch/x86/include/asm/atomic.h
+++ b/tools/arch/x86/include/asm/atomic.h
@@ -71,6 +71,20 @@ static __always_inline int atomic_cmpxchg(atomic_t *v, int old, int new)
 	return cmpxchg(&v->counter, old, new);
 }
 
+static __always_inline bool atomic_try_cmpxchg(atomic_t *v, int *old, int new)
+{
+	return try_cmpxchg(&v->counter, old, new);
+}
+
+static __always_inline int atomic_fetch_or(int i, atomic_t *v)
+{
+	int val = atomic_read(v);
+
+	do { } while (!atomic_try_cmpxchg(v, &val, val | i));
+
+	return val;
+}
+
 static inline int test_and_set_bit(long nr, unsigned long *addr)
 {
 	GEN_BINARY_RMWcc(LOCK_PREFIX __ASM_SIZE(bts), *addr, "Ir", nr, "%0", "c");
diff --git a/tools/arch/x86/include/asm/cmpxchg.h b/tools/arch/x86/include/asm/cmpxchg.h
index 0ed9ca2766ad..5372da8b27fc 100644
--- a/tools/arch/x86/include/asm/cmpxchg.h
+++ b/tools/arch/x86/include/asm/cmpxchg.h
@@ -8,6 +8,8 @@
  * Non-existant functions to indicate usage errors at link time
  * (or compile-time if the compiler implements __compiletime_error().
  */
+extern void __xchg_wrong_size(void)
+	__compiletime_error("Bad argument size for xchg");
 extern void __cmpxchg_wrong_size(void)
 	__compiletime_error("Bad argument size for cmpxchg");
 
@@ -27,6 +29,49 @@ extern void __cmpxchg_wrong_size(void)
 #define	__X86_CASE_Q	-1		/* sizeof will never return -1 */
 #endif
 
+/* 
+ * An exchange-type operation, which takes a value and a pointer, and
+ * returns the old value.
+ */
+#define __xchg_op(ptr, arg, op, lock)					\
+	({								\
+	        __typeof__ (*(ptr)) __ret = (arg);			\
+		switch (sizeof(*(ptr))) {				\
+		case __X86_CASE_B:					\
+			asm_inline volatile (lock #op "b %b0, %1"	\
+				      : "+q" (__ret), "+m" (*(ptr))	\
+				      : : "memory", "cc");		\
+			break;						\
+		case __X86_CASE_W:					\
+			asm_inline volatile (lock #op "w %w0, %1"	\
+				      : "+r" (__ret), "+m" (*(ptr))	\
+				      : : "memory", "cc");		\
+			break;						\
+		case __X86_CASE_L:					\
+			asm_inline volatile (lock #op "l %0, %1"	\
+				      : "+r" (__ret), "+m" (*(ptr))	\
+				      : : "memory", "cc");		\
+			break;						\
+		case __X86_CASE_Q:					\
+			asm_inline volatile (lock #op "q %q0, %1"	\
+				      : "+r" (__ret), "+m" (*(ptr))	\
+				      : : "memory", "cc");		\
+			break;						\
+		default:						\
+			__ ## op ## _wrong_size();			\
+		}							\
+		__ret;							\
+	})
+
+/*
+ * Note: no "lock" prefix even on SMP: xchg always implies lock anyway.
+ * Since this is generally used to protect other memory information, we
+ * use "asm volatile" and "memory" clobbers to prevent gcc from moving
+ * information around.
+ */
+#define xchg(ptr, v)	__xchg_op((ptr), (v), xchg, "")
+
 /*
  * Atomic compare and exchange.  Compare OLD with MEM, if identical,
  * store NEW in MEM.  Return the initial value in MEM.  Success is
@@ -86,5 +131,73 @@ extern void __cmpxchg_wrong_size(void)
 #define cmpxchg(ptr, old, new)						\
 	__cmpxchg(ptr, old, new, sizeof(*(ptr)))
 
+#define __raw_try_cmpxchg(_ptr, _pold, _new, size, lock)		\
+({									\
+	bool success;							\
+	__typeof__(_ptr) _old = (__typeof__(_ptr))(_pold);		\
+	__typeof__(*(_ptr)) __old = *_old;				\
+	__typeof__(*(_ptr)) __new = (_new);				\
+	switch (size) {							\
+	case __X86_CASE_B:						\
+	{								\
+		volatile u8 *__ptr = (volatile u8 *)(_ptr);		\
+		asm_inline volatile(lock "cmpxchgb %[new], %[ptr]"	\
+			     CC_SET(z)					\
+			     : CC_OUT(z) (success),			\
+			       [ptr] "+m" (*__ptr),			\
+			       [old] "+a" (__old)			\
+			     : [new] "q" (__new)			\
+			     : "memory");				\
+		break;							\
+	}								\
+	case __X86_CASE_W:						\
+	{								\
+		volatile u16 *__ptr = (volatile u16 *)(_ptr);		\
+		asm_inline volatile(lock "cmpxchgw %[new], %[ptr]"	\
+			     CC_SET(z)					\
+			     : CC_OUT(z) (success),			\
+			       [ptr] "+m" (*__ptr),			\
+			       [old] "+a" (__old)			\
+			     : [new] "r" (__new)			\
+			     : "memory");				\
+		break;							\
+	}								\
+	case __X86_CASE_L:						\
+	{								\
+		volatile u32 *__ptr = (volatile u32 *)(_ptr);		\
+		asm_inline volatile(lock "cmpxchgl %[new], %[ptr]"	\
+			     CC_SET(z)					\
+			     : CC_OUT(z) (success),			\
+			       [ptr] "+m" (*__ptr),			\
+			       [old] "+a" (__old)			\
+			     : [new] "r" (__new)			\
+			     : "memory");				\
+		break;							\
+	}								\
+	case __X86_CASE_Q:						\
+	{								\
+		volatile u64 *__ptr = (volatile u64 *)(_ptr);		\
+		asm_inline volatile(lock "cmpxchgq %[new], %[ptr]"	\
+			     CC_SET(z)					\
+			     : CC_OUT(z) (success),			\
+			       [ptr] "+m" (*__ptr),			\
+			       [old] "+a" (__old)			\
+			     : [new] "r" (__new)			\
+			     : "memory");				\
+		break;							\
+	}								\
+	default:							\
+		__cmpxchg_wrong_size();					\
+	}								\
+	if (unlikely(!success))						\
+		*_old = __old;						\
+	likely(success);						\
+})
+
+#define __try_cmpxchg(ptr, pold, new, size)				\
+	__raw_try_cmpxchg((ptr), (pold), (new), (size), LOCK_PREFIX)
+
+#define try_cmpxchg(ptr, pold, new) 				\
+	__try_cmpxchg((ptr), (pold), (new), sizeof(*(ptr)))
 
 #endif	/* TOOLS_ASM_X86_CMPXCHG_H */
diff --git a/tools/include/asm-generic/atomic-gcc.h b/tools/include/asm-generic/atomic-gcc.h
index 9b3c528bab92..08b7b3b36873 100644
--- a/tools/include/asm-generic/atomic-gcc.h
+++ b/tools/include/asm-generic/atomic-gcc.h
@@ -62,6 +62,12 @@ static inline int atomic_dec_and_test(atomic_t *v)
 	return __sync_sub_and_fetch(&v->counter, 1) == 0;
 }
 
+#define xchg(ptr, v) \
+	__atomic_exchange_n(ptr, v, __ATOMIC_SEQ_CST)
+
+#define xchg_relaxed(ptr, v) \
+	__atomic_exchange_n(ptr, v, __ATOMIC_RELAXED)
+
 #define cmpxchg(ptr, oldval, newval) \
 	__sync_val_compare_and_swap(ptr, oldval, newval)
 
@@ -70,6 +76,47 @@ static inline int atomic_cmpxchg(atomic_t *v, int oldval, int newval)
 	return cmpxchg(&(v)->counter, oldval, newval);
 }
 
+/**
+ * atomic_try_cmpxchg() - atomic compare and exchange with full ordering
+ * @v: pointer to atomic_t
+ * @old: pointer to int value to compare with
+ * @new: int value to assign
+ *
+ * If (@v == @old), atomically updates @v to @new with full ordering.
+ * Otherwise, @v is not modified, @old is updated to the current value of @v,
+ * and relaxed ordering is provided.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_try_cmpxchg() there.
+ *
+ * Return: @true if the exchange occurred, @false otherwise.
+ */
+static __always_inline bool
+atomic_try_cmpxchg(atomic_t *v, int *old, int new)
+{
+	int r, o = *old;
+	r = atomic_cmpxchg(v, o, new);
+	if (unlikely(r != o))
+		*old = r;
+	return likely(r == o);
+}
+
+/**
+ * atomic_fetch_or() - atomic bitwise OR with full ordering
+ * @i: int value
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v | @i) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_fetch_or() there.
+ *
+ * Return: The original value of @v.
+ */
+static __always_inline int
+atomic_fetch_or(int i, atomic_t *v)
+{
+	return __sync_fetch_and_or(&v->counter, i);
+}
+
 static inline int test_and_set_bit(long nr, unsigned long *addr)
 {
 	unsigned long mask = BIT_MASK(nr);
diff --git a/tools/include/linux/atomic.h b/tools/include/linux/atomic.h
index 01907b33537e..332a34177995 100644
--- a/tools/include/linux/atomic.h
+++ b/tools/include/linux/atomic.h
@@ -12,4 +12,28 @@ void atomic_long_set(atomic_long_t *v, long i);
 #define  atomic_cmpxchg_release         atomic_cmpxchg
 #endif /* atomic_cmpxchg_relaxed */
 
+#ifndef atomic_cmpxchg_acquire
+#define atomic_cmpxchg_acquire		atomic_cmpxchg
+#endif
+
+#ifndef atomic_try_cmpxchg_acquire
+#define atomic_try_cmpxchg_acquire	atomic_try_cmpxchg
+#endif
+
+#ifndef atomic_try_cmpxchg_relaxed
+#define atomic_try_cmpxchg_relaxed	atomic_try_cmpxchg
+#endif
+
+#ifndef atomic_fetch_or_acquire
+#define atomic_fetch_or_acquire		atomic_fetch_or
+#endif
+
+#ifndef xchg_relaxed
+#define xchg_relaxed		xchg
+#endif
+
+#ifndef cmpxchg_release
+#define cmpxchg_release		cmpxchg
+#endif
+
 #endif /* __TOOLS_LINUX_ATOMIC_H */
diff --git a/tools/include/linux/compiler_types.h b/tools/include/linux/compiler_types.h
index d09f9dc172a4..9a2a2f8d7b6c 100644
--- a/tools/include/linux/compiler_types.h
+++ b/tools/include/linux/compiler_types.h
@@ -31,6 +31,28 @@
 # define __cond_lock(x,c) (c)
 #endif /* __CHECKER__ */
 
+/*
+ * __unqual_scalar_typeof(x) - Declare an unqualified scalar type, leaving
+ *			       non-scalar types unchanged.
+ */
+/*
+ * Prefer C11 _Generic for better compile-times and simpler code. Note: 'char'
+ * is not type-compatible with 'signed char', and we define a separate case.
+ */
+#define __scalar_type_to_expr_cases(type)				\
+		unsigned type:	(unsigned type)0,			\
+		signed type:	(signed type)0
+
+#define __unqual_scalar_typeof(x) typeof(				\
+		_Generic((x),						\
+			 char:	(char)0,				\
+			 __scalar_type_to_expr_cases(char),		\
+			 __scalar_type_to_expr_cases(short),		\
+			 __scalar_type_to_expr_cases(int),		\
+			 __scalar_type_to_expr_cases(long),		\
+			 __scalar_type_to_expr_cases(long long),	\
+			 default: (x)))
+
 /* Compiler specific macros. */
 #ifdef __GNUC__
 #include <linux/compiler-gcc.h>
@@ -40,4 +62,6 @@
 #define asm_goto_output(x...) asm goto(x)
 #endif
 
+#define asm_inline asm
+
 #endif /* __LINUX_COMPILER_TYPES_H */
-- 
2.50.1.487.gc89ff58d15-goog



* [PATCH v1 2/7] tools: Import smp_cond_load and atomic_cond_read
  2025-07-29  2:26 [PATCH v1 0/7] perf bench: Add qspinlock benchmark Yuzhuo Jing
  2025-07-29  2:26 ` [PATCH v1 1/7] tools: Import cmpxchg and xchg functions Yuzhuo Jing
@ 2025-07-29  2:26 ` Yuzhuo Jing
  2025-07-29  2:26 ` [PATCH v1 3/7] tools: Partial import of prefetch.h Yuzhuo Jing
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: Yuzhuo Jing @ 2025-07-29  2:26 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, Liang Kan, Yuzhuo Jing, Yuzhuo Jing,
	Andrea Parri, Palmer Dabbelt, Charlie Jenkins,
	Sebastian Andrzej Siewior, Kumar Kartikeya Dwivedi,
	Alexei Starovoitov, Barret Rhoden, Alexandre Ghiti, Guo Ren,
	linux-kernel, linux-perf-users

Import the generic barrier implementation of
smp_cond_load_{acquire,relaxed} and the macro definitions of
atomic_cond_read_{acquire,relaxed}.
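
For reference, a minimal usage sketch of the imported helpers (the
'ready' flag is illustrative, not part of this patch).  Inside the
condition expression, VAL names the freshly loaded value:

    int ready;

    /* Publisher thread: */
    smp_store_release(&ready, 1);

    /*
     * Waiter thread: spin until 'ready' becomes non-zero, with ACQUIRE
     * ordering so later reads see the publisher's earlier stores.
     */
    smp_cond_load_acquire(&ready, VAL != 0);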

Signed-off-by: Yuzhuo Jing <yuzhuo@google.com>
---
 tools/include/asm/barrier.h  | 58 ++++++++++++++++++++++++++++++++++++
 tools/include/linux/atomic.h |  3 ++
 2 files changed, 61 insertions(+)

diff --git a/tools/include/asm/barrier.h b/tools/include/asm/barrier.h
index 0c21678ac5e6..5150c955c1c9 100644
--- a/tools/include/asm/barrier.h
+++ b/tools/include/asm/barrier.h
@@ -63,3 +63,61 @@ do {						\
 	___p1;					\
 })
 #endif
+
+#ifndef cpu_relax
+#define cpu_relax() ({})
+#endif
+
+/**
+ * smp_acquire__after_ctrl_dep() - Provide ACQUIRE ordering after a control dependency
+ *
+ * A control dependency provides a LOAD->STORE order, the additional RMB
+ * provides LOAD->LOAD order, together they provide LOAD->{LOAD,STORE} order,
+ * aka. (load)-ACQUIRE.
+ *
+ * Architectures that do not do load speculation can have this be barrier().
+ */
+#ifndef smp_acquire__after_ctrl_dep
+#define smp_acquire__after_ctrl_dep()		smp_rmb()
+#endif
+
+/**
+ * smp_cond_load_relaxed() - (Spin) wait for cond with no ordering guarantees
+ * @ptr: pointer to the variable to wait on
+ * @cond: boolean expression to wait for
+ *
+ * Equivalent to using READ_ONCE() on the condition variable.
+ *
+ * Due to C lacking lambda expressions we load the value of *ptr into a
+ * pre-named variable @VAL to be used in @cond.
+ */
+#ifndef smp_cond_load_relaxed
+#define smp_cond_load_relaxed(ptr, cond_expr) ({		\
+	typeof(ptr) __PTR = (ptr);				\
+	__unqual_scalar_typeof(*ptr) VAL;			\
+	for (;;) {						\
+		VAL = READ_ONCE(*__PTR);			\
+		if (cond_expr)					\
+			break;					\
+		cpu_relax();					\
+	}							\
+	(typeof(*ptr))VAL;					\
+})
+#endif
+
+/**
+ * smp_cond_load_acquire() - (Spin) wait for cond with ACQUIRE ordering
+ * @ptr: pointer to the variable to wait on
+ * @cond: boolean expression to wait for
+ *
+ * Equivalent to using smp_load_acquire() on the condition variable but employs
+ * the control dependency of the wait to reduce the barrier on many platforms.
+ */
+#ifndef smp_cond_load_acquire
+#define smp_cond_load_acquire(ptr, cond_expr) ({		\
+	__unqual_scalar_typeof(*ptr) _val;			\
+	_val = smp_cond_load_relaxed(ptr, cond_expr);		\
+	smp_acquire__after_ctrl_dep();				\
+	(typeof(*ptr))_val;					\
+})
+#endif
diff --git a/tools/include/linux/atomic.h b/tools/include/linux/atomic.h
index 332a34177995..6baee2c41b55 100644
--- a/tools/include/linux/atomic.h
+++ b/tools/include/linux/atomic.h
@@ -36,4 +36,7 @@ void atomic_long_set(atomic_long_t *v, long i);
 #define cmpxchg_release		cmpxchg
 #endif
 
+#define atomic_cond_read_acquire(v, c) smp_cond_load_acquire(&(v)->counter, (c))
+#define atomic_cond_read_relaxed(v, c) smp_cond_load_relaxed(&(v)->counter, (c))
+
 #endif /* __TOOLS_LINUX_ATOMIC_H */
-- 
2.50.1.487.gc89ff58d15-goog



* [PATCH v1 3/7] tools: Partial import of prefetch.h
  2025-07-29  2:26 [PATCH v1 0/7] perf bench: Add qspinlock benchmark Yuzhuo Jing
  2025-07-29  2:26 ` [PATCH v1 1/7] tools: Import cmpxchg and xchg functions Yuzhuo Jing
  2025-07-29  2:26 ` [PATCH v1 2/7] tools: Import smp_cond_load and atomic_cond_read Yuzhuo Jing
@ 2025-07-29  2:26 ` Yuzhuo Jing
  2025-07-31  4:54   ` Namhyung Kim
  2025-07-29  2:26 ` [PATCH v1 4/7] tools: Implement userspace per-cpu Yuzhuo Jing
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 18+ messages in thread
From: Yuzhuo Jing @ 2025-07-29  2:26 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, Liang Kan, Yuzhuo Jing, Yuzhuo Jing,
	Andrea Parri, Palmer Dabbelt, Charlie Jenkins,
	Sebastian Andrzej Siewior, Kumar Kartikeya Dwivedi,
	Alexei Starovoitov, Barret Rhoden, Alexandre Ghiti, Guo Ren,
	linux-kernel, linux-perf-users

Import only prefetch() and prefetchw(), but not the page- and
range-related helpers.
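
For reference, the typical use in the qspinlock code imported later in
this series is to prefetch the next MCS node for write before it is
dereferenced in the unlock path, along the lines of:

    next = READ_ONCE(node->next);
    if (next)
            prefetchw(next);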

Signed-off-by: Yuzhuo Jing <yuzhuo@google.com>
---
 tools/include/linux/prefetch.h | 41 ++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)
 create mode 100644 tools/include/linux/prefetch.h

diff --git a/tools/include/linux/prefetch.h b/tools/include/linux/prefetch.h
new file mode 100644
index 000000000000..1ed8678f4824
--- /dev/null
+++ b/tools/include/linux/prefetch.h
@@ -0,0 +1,41 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ *  Generic cache management functions. Everything is arch-specific,  
+ *  but this header exists to make sure the defines/functions can be
+ *  used in a generic way.
+ *
+ *  2000-11-13  Arjan van de Ven   <arjan@fenrus.demon.nl>
+ *
+ */
+
+#ifndef _LINUX_PREFETCH_H
+#define _LINUX_PREFETCH_H
+
+/*
+	prefetch(x) attempts to pre-emptively get the memory pointed to
+	by address "x" into the CPU L1 cache. 
+	prefetch(x) should not cause any kind of exception, prefetch(0) is
+	specifically ok.
+
+	prefetch() should be defined by the architecture, if not, the 
+	#define below provides a no-op define.	
+	
+	There are 2 prefetch() macros:
+	
+	prefetch(x)  	- prefetches the cacheline at "x" for read
+	prefetchw(x)	- prefetches the cacheline at "x" for write
+	
+	there is also PREFETCH_STRIDE which is the architecture-preferred
+	"lookahead" size for prefetching streamed operations.
+	
+*/
+
+#ifndef ARCH_HAS_PREFETCH
+#define prefetch(x) __builtin_prefetch(x)
+#endif
+
+#ifndef ARCH_HAS_PREFETCHW
+#define prefetchw(x) __builtin_prefetch(x,1)
+#endif
+
+#endif
-- 
2.50.1.487.gc89ff58d15-goog



* [PATCH v1 4/7] tools: Implement userspace per-cpu
  2025-07-29  2:26 [PATCH v1 0/7] perf bench: Add qspinlock benchmark Yuzhuo Jing
                   ` (2 preceding siblings ...)
  2025-07-29  2:26 ` [PATCH v1 3/7] tools: Partial import of prefetch.h Yuzhuo Jing
@ 2025-07-29  2:26 ` Yuzhuo Jing
  2025-07-31  5:07   ` Namhyung Kim
  2025-07-29  2:26 ` [PATCH v1 5/7] perf bench: Import qspinlock from kernel Yuzhuo Jing
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 18+ messages in thread
From: Yuzhuo Jing @ 2025-07-29  2:26 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, Liang Kan, Yuzhuo Jing, Yuzhuo Jing,
	Andrea Parri, Palmer Dabbelt, Charlie Jenkins,
	Sebastian Andrzej Siewior, Kumar Kartikeya Dwivedi,
	Alexei Starovoitov, Barret Rhoden, Alexandre Ghiti, Guo Ren,
	linux-kernel, linux-perf-users

Implement userspace per-cpu support for imported kernel code.  Compared
with a simple thread-local definition, the kernel per-cpu machinery
provides 1) a guarantee of static lifetime even after a thread exits,
and 2) the ability to access another CPU's per-cpu data.

This patch adds an alternative implementation and interface for
userspace per-cpu.  The kernel implementation relies on special ELF
sections and offset calculation.  For simplicity, this version defines a
PERCPU_MAX-length global array for each per-cpu variable and uses a
thread-local CPU id for indexing.
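
For illustration, a minimal usage sketch of the interface added below
('counters', 'nr_ops', 'worker' and 'nthreads' are made-up names, not
part of this patch):

    DECLARE_PER_CPU_ALIGNED(counters, u64, nr_ops);
    DEFINE_PER_CPU_ALIGNED(counters);

    static void *worker(void *arg)
    {
            /* Each thread must pick a unique id below PERCPU_MAX. */
            set_this_cpu_id((u32)(unsigned long)arg);
            this_cpu_inc(counters, nr_ops);
            return NULL;
    }

    /* After joining the threads, another thread can read the data: */
    u64 total = 0;
    for (u32 cpu = 0; cpu < nthreads; cpu++)
            total += per_cpu_value(counters, nr_ops, cpu);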

Signed-off-by: Yuzhuo Jing <yuzhuo@google.com>
---
 tools/include/linux/compiler_types.h  |   3 +
 tools/include/linux/percpu-simulate.h | 128 ++++++++++++++++++++++++++
 2 files changed, 131 insertions(+)
 create mode 100644 tools/include/linux/percpu-simulate.h

diff --git a/tools/include/linux/compiler_types.h b/tools/include/linux/compiler_types.h
index 9a2a2f8d7b6c..46550c500b8c 100644
--- a/tools/include/linux/compiler_types.h
+++ b/tools/include/linux/compiler_types.h
@@ -31,6 +31,9 @@
 # define __cond_lock(x,c) (c)
 #endif /* __CHECKER__ */
 
+/* Per-cpu checker flag does not use address space attribute in userspace */
+#define __percpu
+
 /*
  * __unqual_scalar_typeof(x) - Declare an unqualified scalar type, leaving
  *			       non-scalar types unchanged.
diff --git a/tools/include/linux/percpu-simulate.h b/tools/include/linux/percpu-simulate.h
new file mode 100644
index 000000000000..a6af2f2211eb
--- /dev/null
+++ b/tools/include/linux/percpu-simulate.h
@@ -0,0 +1,128 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Userspace implementation of per_cpu_ptr for adapted kernel code.
+ *
+ * Userspace code does not have, and does not need, a per-cpu concept; it can
+ * instead declare variables as thread-local.  However, the kernel per-cpu
+ * machinery further provides 1) the guarantee of static lifetime even after a
+ * thread exits, and 2) the ability to access another CPU's per-cpu data.  This
+ * file provides a simple implementation of that functionality, with slightly
+ * different APIs and without linker script changes.
+ *
+ * 2025  Yuzhuo Jing <yuzhuo@google.com>
+ */
+#ifndef __PERCPU_SIMULATE_H__
+#define __PERCPU_SIMULATE_H__
+
+#include <assert.h>
+
+#include <linux/compiler.h>
+#include <linux/types.h>
+
+/*
+ * The maximum supported number of CPUs.  Per-cpu variables are defined as a
+ * PERCPU_MAX length array, indexed by a thread-local cpu id.
+ */
+#define PERCPU_MAX 4096
+
+#ifdef ASSERT_PERCPU
+#define __check_cpu_id(cpu)						\
+({									\
+	u32 cpuid = (cpu);						\
+	assert(cpuid < PERCPU_MAX);					\
+	cpuid;								\
+})
+#else
+#define __check_cpu_id(cpu)	(cpu)
+#endif
+
+/*
+ * Use weak symbol: only define __thread_per_cpu_id variable if any perf tool
+ * includes this header file.
+ */
+_Thread_local u32 __thread_per_cpu_id __weak;
+
+static inline u32 get_this_cpu_id(void)
+{
+	return __thread_per_cpu_id;
+}
+
+/*
+ * The user code must call this function inside of each thread that uses
+ * per-cpu data structures.  The user code can choose an id of their choice,
+ * but must ensure each thread uses a different id.
+ *
+ * Safety: asserts CPU id smaller than PERCPU_MAX if ASSERT_PERCPU is defined.
+ */
+static inline void set_this_cpu_id(u32 id)
+{
+	__thread_per_cpu_id = __check_cpu_id(id);
+}
+
+/*
+ * Declare a per-cpu data structure.  This only declares the data type and
+ * array length.  Different per-cpu data are differentiated by a key (identifier).
+ *
+ * Different from the kernel version, this API must be called before the actual
+ * definition (i.e. DEFINE_PER_CPU_ALIGNED).
+ *
+ * Note that this implementation does not support prepending static qualifier,
+ * or appending assignment expressions.
+ */
+#define DECLARE_PER_CPU_ALIGNED(key, type, data) \
+	extern struct __percpu_type_##key { \
+		type data; \
+	} __percpu_data_##key[PERCPU_MAX]
+
+/*
+ * Define the per-cpu data storage for a given key.  This uses a previously
+ * defined data type in DECLARE_PER_CPU_ALIGNED.
+ *
+ * Different from the kernel version, this API only accepts a key name.
+ */
+#define DEFINE_PER_CPU_ALIGNED(key) \
+	struct __percpu_type_##key __percpu_data_##key[PERCPU_MAX]
+
+#define __raw_per_cpu_value(key, field, cpu) \
+	(__percpu_data_##key[cpu].field)
+
+/*
+ * Get a pointer of per-cpu data for a given key.
+ *
+ * Different from the kernel version, users of this API don't need to pass the
+ * address of the base variable (through `&varname').
+ *
+ * Safety: asserts CPU id smaller than PERCPU_MAX if ASSERT_PERCPU is defined.
+ */
+#define per_cpu_ptr(key, field, cpu) (&per_cpu_value(key, field, cpu))
+#define this_cpu_ptr(key, field) (&this_cpu_value(key, field))
+
+/*
+ * Additional APIs for direct value access.  Effectively, `*per_cpu_ptr(...)'.
+ *
+ * Safety: asserts CPU id smaller than PERCPU_MAX if ASSERT_PERCPU is defined.
+ */
+#define per_cpu_value(key, field, cpu) \
+	(__raw_per_cpu_value(key, field, __check_cpu_id(cpu)))
+#define this_cpu_value(key, field) \
+	(__raw_per_cpu_value(key, field, __thread_per_cpu_id))
+
+/*
+ * Helper functions of simple per-cpu operations.
+ *
+ * The kernel version differentiates __this_cpu_* from this_cpu_* for
+ * preemption/interrupt-safe contexts, but the userspace version defines them
+ * as the same.
+ */
+
+#define __this_cpu_add(key, field, val)	(this_cpu_value(key, field) += (val))
+#define __this_cpu_sub(key, field, val)	(this_cpu_value(key, field) -= (val))
+#define __this_cpu_inc(key, field)	(++this_cpu_value(key, field))
+#define __this_cpu_dec(key, field)	(--this_cpu_value(key, field))
+
+#define this_cpu_add	__this_cpu_add
+#define this_cpu_sub	__this_cpu_sub
+#define this_cpu_inc	__this_cpu_inc
+#define this_cpu_dec	__this_cpu_dec
+
+#endif /* __PERCPU_SIMULATE_H__ */
-- 
2.50.1.487.gc89ff58d15-goog



* [PATCH v1 5/7] perf bench: Import qspinlock from kernel
  2025-07-29  2:26 [PATCH v1 0/7] perf bench: Add qspinlock benchmark Yuzhuo Jing
                   ` (3 preceding siblings ...)
  2025-07-29  2:26 ` [PATCH v1 4/7] tools: Implement userspace per-cpu Yuzhuo Jing
@ 2025-07-29  2:26 ` Yuzhuo Jing
  2025-07-29  2:26 ` [PATCH v1 6/7] perf bench: Add 'bench sync qspinlock' subcommand Yuzhuo Jing
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: Yuzhuo Jing @ 2025-07-29  2:26 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, Liang Kan, Yuzhuo Jing, Yuzhuo Jing,
	Andrea Parri, Palmer Dabbelt, Charlie Jenkins,
	Sebastian Andrzej Siewior, Kumar Kartikeya Dwivedi,
	Alexei Starovoitov, Barret Rhoden, Alexandre Ghiti, Guo Ren,
	linux-kernel, linux-perf-users

Import the qspinlock implementation from the kernel with
userland-specific adaptations.  Update tools/perf/check-headers.sh so
that future changes to the kernel files are detected.
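
For illustration, a minimal usage sketch of the imported API from a
benchmark thread (do_work() is a made-up name; set_this_cpu_id() must
have been called so the slow path can pick this thread's per-cpu MCS
node):

    static struct qspinlock lock = __ARCH_SPIN_LOCK_UNLOCKED;

    static void *worker(void *arg)
    {
            set_this_cpu_id((u32)(unsigned long)arg); /* unique id per thread */

            queued_spin_lock(&lock);
            do_work();                      /* critical section */
            queued_spin_unlock(&lock);
            return NULL;
    }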

Signed-off-by: Yuzhuo Jing <yuzhuo@google.com>
---
 tools/include/linux/compiler_types.h          |   3 +
 .../perf/bench/include/mcs_spinlock-private.h | 115 +++++
 tools/perf/bench/include/mcs_spinlock.h       |  19 +
 tools/perf/bench/include/qspinlock-private.h  | 204 +++++++++
 tools/perf/bench/include/qspinlock.h          | 153 +++++++
 tools/perf/bench/include/qspinlock_types.h    |  98 +++++
 tools/perf/bench/qspinlock.c                  | 411 ++++++++++++++++++
 tools/perf/check-headers.sh                   |  32 ++
 8 files changed, 1035 insertions(+)
 create mode 100644 tools/perf/bench/include/mcs_spinlock-private.h
 create mode 100644 tools/perf/bench/include/mcs_spinlock.h
 create mode 100644 tools/perf/bench/include/qspinlock-private.h
 create mode 100644 tools/perf/bench/include/qspinlock.h
 create mode 100644 tools/perf/bench/include/qspinlock_types.h
 create mode 100644 tools/perf/bench/qspinlock.c

diff --git a/tools/include/linux/compiler_types.h b/tools/include/linux/compiler_types.h
index 46550c500b8c..261a508ef5bd 100644
--- a/tools/include/linux/compiler_types.h
+++ b/tools/include/linux/compiler_types.h
@@ -34,6 +34,9 @@
 /* Per-cpu checker flag does not use address space attribute in userspace */
 #define __percpu
 
+/* Do not change lock sections in user space */
+#define __lockfunc
+
 /*
  * __unqual_scalar_typeof(x) - Declare an unqualified scalar type, leaving
  *			       non-scalar types unchanged.
diff --git a/tools/perf/bench/include/mcs_spinlock-private.h b/tools/perf/bench/include/mcs_spinlock-private.h
new file mode 100644
index 000000000000..f9e4bab804db
--- /dev/null
+++ b/tools/perf/bench/include/mcs_spinlock-private.h
@@ -0,0 +1,115 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * MCS lock defines
+ *
+ * This file contains the main data structure and API definitions of MCS lock.
+ *
+ * The MCS lock (proposed by Mellor-Crummey and Scott) is a simple spin-lock
+ * with the desirable properties of being fair, and with each cpu trying
+ * to acquire the lock spinning on a local variable.
+ * It avoids expensive cache bounces that common test-and-set spin-lock
+ * implementations incur.
+ */
+#ifndef __LINUX_MCS_SPINLOCK_H
+#define __LINUX_MCS_SPINLOCK_H
+
+#include <stddef.h>
+#include <linux/atomic.h>
+#include "mcs_spinlock.h"
+
+#ifndef arch_mcs_spin_lock_contended
+/*
+ * Using smp_cond_load_acquire() provides the acquire semantics
+ * required so that subsequent operations happen after the
+ * lock is acquired. Additionally, some architectures such as
+ * ARM64 would like to do spin-waiting instead of purely
+ * spinning, and smp_cond_load_acquire() provides that behavior.
+ */
+#define arch_mcs_spin_lock_contended(l)					\
+	smp_cond_load_acquire(l, VAL)
+#endif
+
+#ifndef arch_mcs_spin_unlock_contended
+/*
+ * smp_store_release() provides a memory barrier to ensure all
+ * operations in the critical section has been completed before
+ * unlocking.
+ */
+#define arch_mcs_spin_unlock_contended(l)				\
+	smp_store_release((l), 1)
+#endif
+
+/*
+ * Note: the smp_load_acquire/smp_store_release pair is not
+ * sufficient to form a full memory barrier across
+ * cpus for many architectures (except x86) for mcs_unlock and mcs_lock.
+ * For applications that need a full barrier across multiple cpus
+ * with mcs_unlock and mcs_lock pair, smp_mb__after_unlock_lock() should be
+ * used after mcs_lock.
+ */
+
+/*
+ * In order to acquire the lock, the caller should declare a local node and
+ * pass a reference of the node to this function in addition to the lock.
+ * If the lock has already been acquired, then this will proceed to spin
+ * on this node->locked until the previous lock holder sets the node->locked
+ * in mcs_spin_unlock().
+ */
+static inline
+void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
+{
+	struct mcs_spinlock *prev;
+
+	/* Init node */
+	node->locked = 0;
+	node->next   = NULL;
+
+	/*
+	 * We rely on the full barrier with global transitivity implied by the
+	 * below xchg() to order the initialization stores above against any
+	 * observation of @node. And to provide the ACQUIRE ordering associated
+	 * with a LOCK primitive.
+	 */
+	prev = xchg(lock, node);
+	if (likely(prev == NULL)) {
+		/*
+		 * Lock acquired, don't need to set node->locked to 1. Threads
+		 * only spin on its own node->locked value for lock acquisition.
+		 * However, since this thread can immediately acquire the lock
+		 * and does not proceed to spin on its own node->locked, this
+		 * value won't be used. If a debug mode is needed to
+		 * audit lock status, then set node->locked value here.
+		 */
+		return;
+	}
+	WRITE_ONCE(prev->next, node);
+
+	/* Wait until the lock holder passes the lock down. */
+	arch_mcs_spin_lock_contended(&node->locked);
+}
+
+/*
+ * Releases the lock. The caller should pass in the corresponding node that
+ * was used to acquire the lock.
+ */
+static inline
+void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
+{
+	struct mcs_spinlock *next = READ_ONCE(node->next);
+
+	if (likely(!next)) {
+		/*
+		 * Release the lock by setting it to NULL
+		 */
+		if (likely(cmpxchg_release(lock, node, NULL) == node))
+			return;
+		/* Wait until the next pointer is set */
+		while (!(next = READ_ONCE(node->next)))
+			cpu_relax();
+	}
+
+	/* Pass lock to next waiter. */
+	arch_mcs_spin_unlock_contended(&next->locked);
+}
+
+#endif /* __LINUX_MCS_SPINLOCK_H */
diff --git a/tools/perf/bench/include/mcs_spinlock.h b/tools/perf/bench/include/mcs_spinlock.h
new file mode 100644
index 000000000000..39c94012b88a
--- /dev/null
+++ b/tools/perf/bench/include/mcs_spinlock.h
@@ -0,0 +1,19 @@
+#ifndef __ASM_MCS_SPINLOCK_H
+#define __ASM_MCS_SPINLOCK_H
+
+struct mcs_spinlock {
+	struct mcs_spinlock *next;
+	int locked; /* 1 if lock acquired */
+	int count;  /* nesting count, see qspinlock.c */
+};
+
+/*
+ * Architectures can define their own:
+ *
+ *   arch_mcs_spin_lock_contended(l)
+ *   arch_mcs_spin_unlock_contended(l)
+ *
+ * See kernel/locking/mcs_spinlock.c.
+ */
+
+#endif /* __ASM_MCS_SPINLOCK_H */
diff --git a/tools/perf/bench/include/qspinlock-private.h b/tools/perf/bench/include/qspinlock-private.h
new file mode 100644
index 000000000000..699f70bac980
--- /dev/null
+++ b/tools/perf/bench/include/qspinlock-private.h
@@ -0,0 +1,204 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Queued spinlock defines
+ *
+ * This file contains macro definitions and functions shared between different
+ * qspinlock slow path implementations.
+ */
+#ifndef __LINUX_QSPINLOCK_H
+#define __LINUX_QSPINLOCK_H
+
+#include <linux/percpu-simulate.h>
+#include <linux/compiler.h>
+#include <linux/compiler_types.h>
+#include <linux/atomic.h>
+#include "qspinlock_types.h"
+#include "mcs_spinlock.h"
+
+#define _Q_MAX_NODES	4
+
+/*
+ * The pending bit spinning loop count.
+ * This heuristic is used to limit the number of lockword accesses
+ * made by atomic_cond_read_relaxed when waiting for the lock to
+ * transition out of the "== _Q_PENDING_VAL" state. We don't spin
+ * indefinitely because there's no guarantee that we'll make forward
+ * progress.
+ */
+#ifndef _Q_PENDING_LOOPS
+#define _Q_PENDING_LOOPS	1
+#endif
+
+/*
+ * On 64-bit architectures, the mcs_spinlock structure will be 16 bytes in
+ * size and four of them will fit nicely in one 64-byte cacheline. For
+ * pvqspinlock, however, we need more space for extra data. To accommodate
+ * that, we insert two more long words to pad it up to 32 bytes. IOW, only
+ * two of them can fit in a cacheline in this case. That is OK as it is rare
+ * to have more than 2 levels of slowpath nesting in actual use. We don't
+ * want to penalize pvqspinlocks to optimize for a rare case in native
+ * qspinlocks.
+ */
+struct qnode {
+	struct mcs_spinlock mcs;
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+	long reserved[2];
+#endif
+};
+
+DECLARE_PER_CPU_ALIGNED(qnodes, struct qnode, qnodes[_Q_MAX_NODES]);
+
+/*
+ * We must be able to distinguish between no-tail and the tail at 0:0,
+ * therefore increment the cpu number by one.
+ */
+
+static inline __pure u32 encode_tail(int cpu, int idx)
+{
+	u32 tail;
+
+	tail  = (cpu + 1) << _Q_TAIL_CPU_OFFSET;
+	tail |= idx << _Q_TAIL_IDX_OFFSET; /* assume < 4 */
+
+	return tail;
+}
+
+static inline __pure struct mcs_spinlock *decode_tail(u32 tail)
+{
+	int cpu = (tail >> _Q_TAIL_CPU_OFFSET) - 1;
+	int idx = (tail &  _Q_TAIL_IDX_MASK) >> _Q_TAIL_IDX_OFFSET;
+
+	return per_cpu_ptr(qnodes, qnodes[idx].mcs, cpu);
+}
+
+static inline __pure
+struct mcs_spinlock *grab_mcs_node(struct mcs_spinlock *base, int idx)
+{
+	return &((struct qnode *)base + idx)->mcs;
+}
+
+#define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
+
+#if _Q_PENDING_BITS == 8
+/**
+ * clear_pending - clear the pending bit.
+ * @lock: Pointer to queued spinlock structure
+ *
+ * *,1,* -> *,0,*
+ */
+static __always_inline void clear_pending(struct qspinlock *lock)
+{
+	WRITE_ONCE(lock->pending, 0);
+}
+
+/**
+ * clear_pending_set_locked - take ownership and clear the pending bit.
+ * @lock: Pointer to queued spinlock structure
+ *
+ * *,1,0 -> *,0,1
+ *
+ * Lock stealing is not allowed if this function is used.
+ */
+static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
+{
+	WRITE_ONCE(lock->locked_pending, _Q_LOCKED_VAL);
+}
+
+/*
+ * xchg_tail - Put in the new queue tail code word & retrieve previous one
+ * @lock : Pointer to queued spinlock structure
+ * @tail : The new queue tail code word
+ * Return: The previous queue tail code word
+ *
+ * xchg(lock, tail), which heads an address dependency
+ *
+ * p,*,* -> n,*,* ; prev = xchg(lock, node)
+ */
+static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
+{
+	/*
+	 * We can use relaxed semantics since the caller ensures that the
+	 * MCS node is properly initialized before updating the tail.
+	 */
+	return (u32)xchg_relaxed(&lock->tail,
+				 tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
+}
+
+#else /* _Q_PENDING_BITS == 8 */
+
+/**
+ * clear_pending - clear the pending bit.
+ * @lock: Pointer to queued spinlock structure
+ *
+ * *,1,* -> *,0,*
+ */
+static __always_inline void clear_pending(struct qspinlock *lock)
+{
+	atomic_andnot(_Q_PENDING_VAL, &lock->val);
+}
+
+/**
+ * clear_pending_set_locked - take ownership and clear the pending bit.
+ * @lock: Pointer to queued spinlock structure
+ *
+ * *,1,0 -> *,0,1
+ */
+static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
+{
+	atomic_add(-_Q_PENDING_VAL + _Q_LOCKED_VAL, &lock->val);
+}
+
+/**
+ * xchg_tail - Put in the new queue tail code word & retrieve previous one
+ * @lock : Pointer to queued spinlock structure
+ * @tail : The new queue tail code word
+ * Return: The previous queue tail code word
+ *
+ * xchg(lock, tail)
+ *
+ * p,*,* -> n,*,* ; prev = xchg(lock, node)
+ */
+static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
+{
+	u32 old, new;
+
+	old = atomic_read(&lock->val);
+	do {
+		new = (old & _Q_LOCKED_PENDING_MASK) | tail;
+		/*
+		 * We can use relaxed semantics since the caller ensures that
+		 * the MCS node is properly initialized before updating the
+		 * tail.
+		 */
+	} while (!atomic_try_cmpxchg_relaxed(&lock->val, &old, new));
+
+	return old;
+}
+#endif /* _Q_PENDING_BITS == 8 */
+
+/**
+ * queued_fetch_set_pending_acquire - fetch the whole lock value and set pending
+ * @lock : Pointer to queued spinlock structure
+ * Return: The previous lock value
+ *
+ * *,*,* -> *,1,*
+ */
+#ifndef queued_fetch_set_pending_acquire
+static __always_inline u32 queued_fetch_set_pending_acquire(struct qspinlock *lock)
+{
+	return atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);
+}
+#endif
+
+/**
+ * set_locked - Set the lock bit and own the lock
+ * @lock: Pointer to queued spinlock structure
+ *
+ * *,*,0 -> *,0,1
+ */
+static __always_inline void set_locked(struct qspinlock *lock)
+{
+	WRITE_ONCE(lock->locked, _Q_LOCKED_VAL);
+}
+
+#endif /* __LINUX_QSPINLOCK_H */
diff --git a/tools/perf/bench/include/qspinlock.h b/tools/perf/bench/include/qspinlock.h
new file mode 100644
index 000000000000..2c5b00121929
--- /dev/null
+++ b/tools/perf/bench/include/qspinlock.h
@@ -0,0 +1,153 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Queued spinlock
+ *
+ * A 'generic' spinlock implementation that is based on MCS locks. For an
+ * architecture that's looking for a 'generic' spinlock, please first consider
+ * ticket-lock.h and only come looking here when you've considered all the
+ * constraints below and can show your hardware does actually perform better
+ * with qspinlock.
+ *
+ * qspinlock relies on atomic_*_release()/atomic_*_acquire() to be RCsc (or no
+ * weaker than RCtso if you're power), where regular code only expects atomic_t
+ * to be RCpc.
+ *
+ * qspinlock relies on a far greater (compared to asm-generic/spinlock.h) set
+ * of atomic operations to behave well together, please audit them carefully to
+ * ensure they all have forward progress. Many atomic operations may default to
+ * cmpxchg() loops which will not have good forward progress properties on
+ * LL/SC architectures.
+ *
+ * One notable example is atomic_fetch_or_acquire(), which x86 cannot (cheaply)
+ * do. Carefully read the patches that introduced
+ * queued_fetch_set_pending_acquire().
+ *
+ * qspinlock also heavily relies on mixed size atomic operations, in specific
+ * it requires architectures to have xchg16; something which many LL/SC
+ * architectures need to implement as a 32bit and+or in order to satisfy the
+ * forward progress guarantees mentioned above.
+ *
+ * Further reading on mixed size atomics that might be relevant:
+ *
+ *   http://www.cl.cam.ac.uk/~pes20/popl17/mixed-size.pdf
+ *
+ * (C) Copyright 2013-2015 Hewlett-Packard Development Company, L.P.
+ * (C) Copyright 2015 Hewlett-Packard Enterprise Development LP
+ *
+ * Authors: Waiman Long <waiman.long@hpe.com>
+ */
+#ifndef __ASM_GENERIC_QSPINLOCK_H
+#define __ASM_GENERIC_QSPINLOCK_H
+
+#include "qspinlock_types.h"
+#include <linux/atomic.h>
+#include <asm/barrier.h>
+
+#ifndef queued_spin_is_locked
+/**
+ * queued_spin_is_locked - is the spinlock locked?
+ * @lock: Pointer to queued spinlock structure
+ * Return: 1 if it is locked, 0 otherwise
+ */
+static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
+{
+	/*
+	 * Any !0 state indicates it is locked, even if _Q_LOCKED_VAL
+	 * isn't immediately observable.
+	 */
+	return atomic_read(&lock->val);
+}
+#endif
+
+/**
+ * queued_spin_value_unlocked - is the spinlock structure unlocked?
+ * @lock: queued spinlock structure
+ * Return: 1 if it is unlocked, 0 otherwise
+ *
+ * N.B. Whenever there are tasks waiting for the lock, it is considered
+ *      locked wrt the lockref code to avoid lock stealing by the lockref
+ *      code and change things underneath the lock. This also allows some
+ *      optimizations to be applied without conflict with lockref.
+ */
+static __always_inline int queued_spin_value_unlocked(struct qspinlock lock)
+{
+	return !lock.val.counter;
+}
+
+/**
+ * queued_spin_is_contended - check if the lock is contended
+ * @lock : Pointer to queued spinlock structure
+ * Return: 1 if lock contended, 0 otherwise
+ */
+static __always_inline int queued_spin_is_contended(struct qspinlock *lock)
+{
+	return atomic_read(&lock->val) & ~_Q_LOCKED_MASK;
+}
+/**
+ * queued_spin_trylock - try to acquire the queued spinlock
+ * @lock : Pointer to queued spinlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static __always_inline int queued_spin_trylock(struct qspinlock *lock)
+{
+	int val = atomic_read(&lock->val);
+
+	if (unlikely(val))
+		return 0;
+
+	return likely(atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL));
+}
+
+extern void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+
+#ifndef queued_spin_lock
+/**
+ * queued_spin_lock - acquire a queued spinlock
+ * @lock: Pointer to queued spinlock structure
+ */
+static __always_inline void queued_spin_lock(struct qspinlock *lock)
+{
+	int val = 0;
+
+	if (likely(atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL)))
+		return;
+
+	queued_spin_lock_slowpath(lock, val);
+}
+#endif
+
+#ifndef queued_spin_unlock
+/**
+ * queued_spin_unlock - release a queued spinlock
+ * @lock : Pointer to queued spinlock structure
+ */
+static __always_inline void queued_spin_unlock(struct qspinlock *lock)
+{
+	/*
+	 * unlock() needs release semantics:
+	 */
+	smp_store_release(&lock->locked, 0);
+}
+#endif
+
+#ifndef virt_spin_lock
+static __always_inline bool virt_spin_lock(struct qspinlock *lock __maybe_unused)
+{
+	return false;
+}
+#endif
+
+#ifndef __no_arch_spinlock_redefine
+/*
+ * Remapping spinlock architecture specific functions to the corresponding
+ * queued spinlock functions.
+ */
+#define arch_spin_is_locked(l)		queued_spin_is_locked(l)
+#define arch_spin_is_contended(l)	queued_spin_is_contended(l)
+#define arch_spin_value_unlocked(l)	queued_spin_value_unlocked(l)
+#define arch_spin_lock(l)		queued_spin_lock(l)
+#define arch_spin_trylock(l)		queued_spin_trylock(l)
+#define arch_spin_unlock(l)		queued_spin_unlock(l)
+#endif
+
+#endif /* __ASM_GENERIC_QSPINLOCK_H */
diff --git a/tools/perf/bench/include/qspinlock_types.h b/tools/perf/bench/include/qspinlock_types.h
new file mode 100644
index 000000000000..93a959689070
--- /dev/null
+++ b/tools/perf/bench/include/qspinlock_types.h
@@ -0,0 +1,98 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Queued spinlock
+ *
+ * (C) Copyright 2013-2015 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <waiman.long@hp.com>
+ */
+#ifndef __ASM_GENERIC_QSPINLOCK_TYPES_H
+#define __ASM_GENERIC_QSPINLOCK_TYPES_H
+
+#include <linux/percpu-simulate.h>
+#include <linux/types.h>
+
+#define CONFIG_NR_CPUS PERCPU_MAX
+
+typedef struct qspinlock {
+	union {
+		atomic_t val;
+
+		/*
+		 * By using the whole 2nd least significant byte for the
+		 * pending bit, we can allow better optimization of the lock
+		 * acquisition for the pending bit holder.
+		 */
+#ifdef __LITTLE_ENDIAN
+		struct {
+			u8	locked;
+			u8	pending;
+		};
+		struct {
+			u16	locked_pending;
+			u16	tail;
+		};
+#else
+		struct {
+			u16	tail;
+			u16	locked_pending;
+		};
+		struct {
+			u8	reserved[2];
+			u8	pending;
+			u8	locked;
+		};
+#endif
+	};
+} arch_spinlock_t;
+
+/*
+ * Initializer
+ */
+#define	__ARCH_SPIN_LOCK_UNLOCKED	{ { .val = ATOMIC_INIT(0) } }
+
+/*
+ * Bitfields in the atomic value:
+ *
+ * When NR_CPUS < 16K
+ *  0- 7: locked byte
+ *     8: pending
+ *  9-15: not used
+ * 16-17: tail index
+ * 18-31: tail cpu (+1)
+ *
+ * When NR_CPUS >= 16K
+ *  0- 7: locked byte
+ *     8: pending
+ *  9-10: tail index
+ * 11-31: tail cpu (+1)
+ */
+#define	_Q_SET_MASK(type)	(((1U << _Q_ ## type ## _BITS) - 1)\
+				      << _Q_ ## type ## _OFFSET)
+#define _Q_LOCKED_OFFSET	0
+#define _Q_LOCKED_BITS		8
+#define _Q_LOCKED_MASK		_Q_SET_MASK(LOCKED)
+
+#define _Q_PENDING_OFFSET	(_Q_LOCKED_OFFSET + _Q_LOCKED_BITS)
+#if CONFIG_NR_CPUS < (1U << 14)
+#define _Q_PENDING_BITS		8
+#else
+#define _Q_PENDING_BITS		1
+#endif
+#define _Q_PENDING_MASK		_Q_SET_MASK(PENDING)
+
+#define _Q_TAIL_IDX_OFFSET	(_Q_PENDING_OFFSET + _Q_PENDING_BITS)
+#define _Q_TAIL_IDX_BITS	2
+#define _Q_TAIL_IDX_MASK	_Q_SET_MASK(TAIL_IDX)
+
+#define _Q_TAIL_CPU_OFFSET	(_Q_TAIL_IDX_OFFSET + _Q_TAIL_IDX_BITS)
+#define _Q_TAIL_CPU_BITS	(32 - _Q_TAIL_CPU_OFFSET)
+#define _Q_TAIL_CPU_MASK	_Q_SET_MASK(TAIL_CPU)
+
+#define _Q_TAIL_OFFSET		_Q_TAIL_IDX_OFFSET
+#define _Q_TAIL_MASK		(_Q_TAIL_IDX_MASK | _Q_TAIL_CPU_MASK)
+
+#define _Q_LOCKED_VAL		(1U << _Q_LOCKED_OFFSET)
+#define _Q_PENDING_VAL		(1U << _Q_PENDING_OFFSET)
+
+#endif /* __ASM_GENERIC_QSPINLOCK_TYPES_H */
diff --git a/tools/perf/bench/qspinlock.c b/tools/perf/bench/qspinlock.c
new file mode 100644
index 000000000000..b678dd16b059
--- /dev/null
+++ b/tools/perf/bench/qspinlock.c
@@ -0,0 +1,411 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Queued spinlock
+ *
+ * (C) Copyright 2013-2015 Hewlett-Packard Development Company, L.P.
+ * (C) Copyright 2013-2014,2018 Red Hat, Inc.
+ * (C) Copyright 2015 Intel Corp.
+ * (C) Copyright 2015 Hewlett-Packard Enterprise Development LP
+ *
+ * Authors: Waiman Long <longman@redhat.com>
+ *          Peter Zijlstra <peterz@infradead.org>
+ */
+
+#ifndef _GEN_PV_LOCK_SLOWPATH
+
+#include <linux/build_bug.h>
+#include <linux/percpu-simulate.h>
+#include <linux/atomic.h>
+#include <linux/prefetch.h>
+#include <asm/byteorder.h>
+#include "include/qspinlock.h"
+
+#define lockevent_inc(x) ({})
+#define lockevent_cond_inc(x, y) ({})
+#define trace_contention_begin(x, y) ({})
+#define trace_contention_end(x, y) ({})
+
+#define smp_processor_id get_this_cpu_id
+
+/*
+ * Include queued spinlock definitions and statistics code
+ */
+#include "include/qspinlock-private.h"
+
+/*
+ * The basic principle of a queue-based spinlock can best be understood
+ * by studying a classic queue-based spinlock implementation called the
+ * MCS lock. A copy of the original MCS lock paper ("Algorithms for Scalable
+ * Synchronization on Shared-Memory Multiprocessors by Mellor-Crummey and
+ * Scott") is available at
+ *
+ * https://bugzilla.kernel.org/show_bug.cgi?id=206115
+ *
+ * This queued spinlock implementation is based on the MCS lock, however to
+ * make it fit the 4 bytes we assume spinlock_t to be, and preserve its
+ * existing API, we must modify it somehow.
+ *
+ * In particular; where the traditional MCS lock consists of a tail pointer
+ * (8 bytes) and needs the next pointer (another 8 bytes) of its own node to
+ * unlock the next pending (next->locked), we compress both these: {tail,
+ * next->locked} into a single u32 value.
+ *
+ * Since a spinlock disables recursion of its own context and there is a limit
+ * to the contexts that can nest; namely: task, softirq, hardirq, nmi. As there
+ * are at most 4 nesting levels, it can be encoded by a 2-bit number. Now
+ * we can encode the tail by combining the 2-bit nesting level with the cpu
+ * number. With one byte for the lock value and 3 bytes for the tail, only a
+ * 32-bit word is now needed. Even though we only need 1 bit for the lock,
+ * we extend it to a full byte to achieve better performance for architectures
+ * that support atomic byte write.
+ *
+ * We also change the first spinner to spin on the lock bit instead of its
+ * node; whereby avoiding the need to carry a node from lock to unlock, and
+ * preserving existing lock API. This also makes the unlock code simpler and
+ * faster.
+ *
+ * N.B. The current implementation only supports architectures that allow
+ *      atomic operations on smaller 8-bit and 16-bit data types.
+ *
+ */
+
+#include "include/mcs_spinlock-private.h"
+
+/*
+ * Per-CPU queue node structures; we can never have more than 4 nested
+ * contexts: task, softirq, hardirq, nmi.
+ *
+ * Exactly fits one 64-byte cacheline on a 64-bit architecture.
+ *
+ * PV doubles the storage and uses the second cacheline for PV state.
+ */
+DEFINE_PER_CPU_ALIGNED(qnodes);
+
+/*
+ * Generate the native code for queued_spin_unlock_slowpath(); provide NOPs for
+ * all the PV callbacks.
+ */
+
+static __always_inline void __pv_init_node(struct mcs_spinlock *node __maybe_unused) { }
+static __always_inline void __pv_wait_node(struct mcs_spinlock *node __maybe_unused,
+					   struct mcs_spinlock *prev __maybe_unused) { }
+static __always_inline void __pv_kick_node(struct qspinlock *lock __maybe_unused,
+					   struct mcs_spinlock *node __maybe_unused) { }
+static __always_inline u32  __pv_wait_head_or_lock(struct qspinlock *lock __maybe_unused,
+						   struct mcs_spinlock *node __maybe_unused)
+						   { return 0; }
+
+#define pv_enabled()		false
+
+#define pv_init_node		__pv_init_node
+#define pv_wait_node		__pv_wait_node
+#define pv_kick_node		__pv_kick_node
+#define pv_wait_head_or_lock	__pv_wait_head_or_lock
+
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+#define queued_spin_lock_slowpath	native_queued_spin_lock_slowpath
+#endif
+
+#endif /* _GEN_PV_LOCK_SLOWPATH */
+
+/**
+ * queued_spin_lock_slowpath - acquire the queued spinlock
+ * @lock: Pointer to queued spinlock structure
+ * @val: Current value of the queued spinlock 32-bit word
+ *
+ * (queue tail, pending bit, lock value)
+ *
+ *              fast     :    slow                                  :    unlock
+ *                       :                                          :
+ * uncontended  (0,0,0) -:--> (0,0,1) ------------------------------:--> (*,*,0)
+ *                       :       | ^--------.------.             /  :
+ *                       :       v           \      \            |  :
+ * pending               :    (0,1,1) +--> (0,1,0)   \           |  :
+ *                       :       | ^--'              |           |  :
+ *                       :       v                   |           |  :
+ * uncontended           :    (n,x,y) +--> (n,0,0) --'           |  :
+ *   queue               :       | ^--'                          |  :
+ *                       :       v                               |  :
+ * contended             :    (*,x,y) +--> (*,0,0) ---> (*,0,1) -'  :
+ *   queue               :         ^--'                             :
+ */
+void __lockfunc queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
+{
+	struct mcs_spinlock *prev, *next, *node;
+	u32 old, tail;
+	int idx;
+
+	BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
+
+	if (pv_enabled())
+		goto pv_queue;
+
+	if (virt_spin_lock(lock))
+		return;
+
+	/*
+	 * Wait for in-progress pending->locked hand-overs with a bounded
+	 * number of spins so that we guarantee forward progress.
+	 *
+	 * 0,1,0 -> 0,0,1
+	 */
+	if (val == _Q_PENDING_VAL) {
+		int cnt = _Q_PENDING_LOOPS;
+		val = atomic_cond_read_relaxed(&lock->val,
+					       (VAL != _Q_PENDING_VAL) || !cnt--);
+	}
+
+	/*
+	 * If we observe any contention; queue.
+	 */
+	if (val & ~_Q_LOCKED_MASK)
+		goto queue;
+
+	/*
+	 * trylock || pending
+	 *
+	 * 0,0,* -> 0,1,* -> 0,0,1 pending, trylock
+	 */
+	val = queued_fetch_set_pending_acquire(lock);
+
+	/*
+	 * If we observe contention, there is a concurrent locker.
+	 *
+	 * Undo and queue; our setting of PENDING might have made the
+	 * n,0,0 -> 0,0,0 transition fail and it will now be waiting
+	 * on @next to become !NULL.
+	 */
+	if (unlikely(val & ~_Q_LOCKED_MASK)) {
+
+		/* Undo PENDING if we set it. */
+		if (!(val & _Q_PENDING_MASK))
+			clear_pending(lock);
+
+		goto queue;
+	}
+
+	/*
+	 * We're pending, wait for the owner to go away.
+	 *
+	 * 0,1,1 -> *,1,0
+	 *
+	 * this wait loop must be a load-acquire such that we match the
+	 * store-release that clears the locked bit and create lock
+	 * sequentiality; this is because not all
+	 * clear_pending_set_locked() implementations imply full
+	 * barriers.
+	 */
+	if (val & _Q_LOCKED_MASK)
+		smp_cond_load_acquire(&lock->locked, !VAL);
+
+	/*
+	 * take ownership and clear the pending bit.
+	 *
+	 * 0,1,0 -> 0,0,1
+	 */
+	clear_pending_set_locked(lock);
+	lockevent_inc(lock_pending);
+	return;
+
+	/*
+	 * End of pending bit optimistic spinning and beginning of MCS
+	 * queuing.
+	 */
+queue:
+	lockevent_inc(lock_slowpath);
+pv_queue:
+	node = this_cpu_ptr(qnodes, qnodes[0].mcs);
+	idx = node->count++;
+	tail = encode_tail(smp_processor_id(), idx);
+
+	trace_contention_begin(lock, LCB_F_SPIN);
+
+	/*
+	 * 4 nodes are allocated based on the assumption that there will
+	 * not be nested NMIs taking spinlocks. That may not be true in
+	 * some architectures even though the chance of needing more than
+	 * 4 nodes will still be extremely unlikely. When that happens,
+	 * we fall back to spinning on the lock directly without using
+	 * any MCS node. This is not the most elegant solution, but is
+	 * simple enough.
+	 */
+	if (unlikely(idx >= _Q_MAX_NODES)) {
+		lockevent_inc(lock_no_node);
+		while (!queued_spin_trylock(lock))
+			cpu_relax();
+		goto release;
+	}
+
+	node = grab_mcs_node(node, idx);
+
+	/*
+	 * Keep counts of non-zero index values:
+	 */
+	lockevent_cond_inc(lock_use_node2 + idx - 1, idx);
+
+	/*
+	 * Ensure that we increment the head node->count before initialising
+	 * the actual node. If the compiler is kind enough to reorder these
+	 * stores, then an IRQ could overwrite our assignments.
+	 */
+	barrier();
+
+	node->locked = 0;
+	node->next = NULL;
+	pv_init_node(node);
+
+	/*
+	 * We touched a (possibly) cold cacheline in the per-cpu queue node;
+	 * attempt the trylock once more in the hope someone let go while we
+	 * weren't watching.
+	 */
+	if (queued_spin_trylock(lock))
+		goto release;
+
+	/*
+	 * Ensure that the initialisation of @node is complete before we
+	 * publish the updated tail via xchg_tail() and potentially link
+	 * @node into the waitqueue via WRITE_ONCE(prev->next, node) below.
+	 */
+	smp_wmb();
+
+	/*
+	 * Publish the updated tail.
+	 * We have already touched the queueing cacheline; don't bother with
+	 * pending stuff.
+	 *
+	 * p,*,* -> n,*,*
+	 */
+	old = xchg_tail(lock, tail);
+	next = NULL;
+
+	/*
+	 * if there was a previous node; link it and wait until reaching the
+	 * head of the waitqueue.
+	 */
+	if (old & _Q_TAIL_MASK) {
+		prev = decode_tail(old);
+
+		/* Link @node into the waitqueue. */
+		WRITE_ONCE(prev->next, node);
+
+		pv_wait_node(node, prev);
+		arch_mcs_spin_lock_contended(&node->locked);
+
+		/*
+		 * While waiting for the MCS lock, the next pointer may have
+		 * been set by another lock waiter. We optimistically load
+		 * the next pointer & prefetch the cacheline for writing
+		 * to reduce latency in the upcoming MCS unlock operation.
+		 */
+		next = READ_ONCE(node->next);
+		if (next)
+			prefetchw(next);
+	}
+
+	/*
+	 * we're at the head of the waitqueue, wait for the owner & pending to
+	 * go away.
+	 *
+	 * *,x,y -> *,0,0
+	 *
+	 * this wait loop must use a load-acquire such that we match the
+	 * store-release that clears the locked bit and create lock
+	 * sequentiality; this is because the set_locked() function below
+	 * does not imply a full barrier.
+	 *
+	 * The PV pv_wait_head_or_lock function, if active, will acquire
+	 * the lock and return a non-zero value. So we have to skip the
+	 * atomic_cond_read_acquire() call. As the next PV queue head hasn't
+	 * been designated yet, there is no way for the locked value to become
+	 * _Q_SLOW_VAL. So both the set_locked() and the
+	 * atomic_cmpxchg_relaxed() calls will be safe.
+	 *
+	 * If PV isn't active, 0 will be returned instead.
+	 *
+	 */
+	if ((val = pv_wait_head_or_lock(lock, node)))
+		goto locked;
+
+	val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK));
+
+locked:
+	/*
+	 * claim the lock:
+	 *
+	 * n,0,0 -> 0,0,1 : lock, uncontended
+	 * *,*,0 -> *,*,1 : lock, contended
+	 *
+	 * If the queue head is the only one in the queue (lock value == tail)
+	 * and nobody is pending, clear the tail code and grab the lock.
+	 * Otherwise, we only need to grab the lock.
+	 */
+
+	/*
+	 * In the PV case we might already have _Q_LOCKED_VAL set, because
+	 * of lock stealing; therefore we must also allow:
+	 *
+	 * n,0,1 -> 0,0,1
+	 *
+	 * Note: at this point: (val & _Q_PENDING_MASK) == 0, because of the
+	 *       above wait condition, therefore any concurrent setting of
+	 *       PENDING will make the uncontended transition fail.
+	 */
+	if ((val & _Q_TAIL_MASK) == tail) {
+		if (atomic_try_cmpxchg_relaxed(&lock->val, (int *)&val, _Q_LOCKED_VAL))
+			goto release; /* No contention */
+	}
+
+	/*
+	 * Either somebody is queued behind us or _Q_PENDING_VAL got set
+	 * which will then detect the remaining tail and queue behind us
+	 * ensuring we'll see a @next.
+	 */
+	set_locked(lock);
+
+	/*
+	 * contended path; wait for next if not observed yet, release.
+	 */
+	if (!next)
+		next = smp_cond_load_relaxed(&node->next, (VAL));
+
+	arch_mcs_spin_unlock_contended(&next->locked);
+	pv_kick_node(lock, next);
+
+release:
+	trace_contention_end(lock, 0);
+
+	/*
+	 * release the node
+	 */
+	__this_cpu_dec(qnodes, qnodes[0].mcs.count);
+}
+
+/*
+ * Generate the paravirt code for queued_spin_unlock_slowpath().
+ */
+#if !defined(_GEN_PV_LOCK_SLOWPATH) && defined(CONFIG_PARAVIRT_SPINLOCKS)
+#define _GEN_PV_LOCK_SLOWPATH
+
+#undef  pv_enabled
+#define pv_enabled()	true
+
+#undef pv_init_node
+#undef pv_wait_node
+#undef pv_kick_node
+#undef pv_wait_head_or_lock
+
+#undef  queued_spin_lock_slowpath
+#define queued_spin_lock_slowpath	__pv_queued_spin_lock_slowpath
+
+#include "qspinlock_paravirt.h"
+#include "qspinlock.c"
+
+bool nopvspin;
+static __init int parse_nopvspin(char *arg)
+{
+	nopvspin = true;
+	return 0;
+}
+early_param("nopvspin", parse_nopvspin);
+#endif
diff --git a/tools/perf/check-headers.sh b/tools/perf/check-headers.sh
index be519c433ce4..b827b10e19c1 100755
--- a/tools/perf/check-headers.sh
+++ b/tools/perf/check-headers.sh
@@ -118,6 +118,25 @@ check_2 () {
   fi
 }
 
+check_2_sed () {
+  tools_file=$1
+  orig_file=$2
+  sed_cmd="$3"
+
+  shift
+  shift
+  shift
+
+  cmd="diff $* <(sed '$sed_cmd' $tools_file) $orig_file > /dev/null"
+
+  if [ -f "$orig_file" ] && ! eval "$cmd"
+  then
+    FAILURES+=(
+      "$tools_file $orig_file"
+    )
+  fi
+}
+
 check () {
   file=$1
 
@@ -207,6 +226,19 @@ check_2 tools/perf/arch/parisc/entry/syscalls/syscall.tbl arch/parisc/entry/sysc
 check_2 tools/perf/arch/arm64/entry/syscalls/syscall_32.tbl arch/arm64/entry/syscalls/syscall_32.tbl
 check_2 tools/perf/arch/arm64/entry/syscalls/syscall_64.tbl arch/arm64/entry/syscalls/syscall_64.tbl
 
+# diff qspinlock files
+qsl_sed='s/ __maybe_unused//'
+qsl_common='-I "^#include" -I __percpu -I this_cpu_ -I per_cpu_ -I decode_tail \
+	-I DECLARE_PER_CPU_ALIGNED -I DEFINE_PER_CPU_ALIGNED -I CONFIG_NR_CPUS -B'
+check_2_sed tools/perf/bench/include/qspinlock.h	include/asm-generic/qspinlock.h "$qsl_sed" "$qsl_common"
+check_2 tools/perf/bench/include/qspinlock_types.h	include/asm-generic/qspinlock_types.h "$qsl_common"
+check_2 tools/perf/bench/include/mcs_spinlock.h		include/asm-generic/mcs_spinlock.h
+check_2 tools/perf/bench/include/qspinlock-private.h	kernel/locking/qspinlock.h	"$qsl_common"
+check_2 tools/perf/bench/include/mcs_spinlock-private.h	kernel/locking/mcs_spinlock.h	"$qsl_common"
+check_2_sed tools/perf/bench/qspinlock.c		kernel/locking/qspinlock.c	"$qsl_sed" \
+	"$qsl_common"' -I EXPORT_SYMBOL -I "^#define lockevent_" -I "^#define trace_" \
+        -I smp_processor_id -I atomic_try_cmpxchg_relaxed'
+
 for i in "${BEAUTY_FILES[@]}"
 do
   beauty_check "$i" -B
-- 
2.50.1.487.gc89ff58d15-goog


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v1 6/7] perf bench: Add 'bench sync qspinlock' subcommand
  2025-07-29  2:26 [PATCH v1 0/7] perf bench: Add qspinlock benchmark Yuzhuo Jing
                   ` (4 preceding siblings ...)
  2025-07-29  2:26 ` [PATCH v1 5/7] perf bench: Import qspinlock from kernel Yuzhuo Jing
@ 2025-07-29  2:26 ` Yuzhuo Jing
  2025-07-31  5:16   ` Namhyung Kim
  2025-07-29  2:26 ` [PATCH v1 7/7] perf bench sync: Add latency histogram functionality Yuzhuo Jing
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 18+ messages in thread
From: Yuzhuo Jing @ 2025-07-29  2:26 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, Liang Kan, Yuzhuo Jing, Yuzhuo Jing,
	Andrea Parri, Palmer Dabbelt, Charlie Jenkins,
	Sebastian Andrzej Siewior, Kumar Kartikeya Dwivedi,
	Alexei Starovoitov, Barret Rhoden, Alexandre Ghiti, Guo Ren,
	linux-kernel, linux-perf-users

Benchmark the kernel queued spinlock implementation in user space.  Support
setting the number of threads and the number of acquire/release operations
per thread.
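
An example invocation and output (the latency figure below is illustrative,
not a measured result):

    $ perf bench sync qspinlock -t 4
    Running with 4 threads.
    Lock-unlock latency of 4 threads: 95.1234 ns.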

Signed-off-by: Yuzhuo Jing <yuzhuo@google.com>
---
 tools/perf/bench/Build     |   2 +
 tools/perf/bench/bench.h   |   1 +
 tools/perf/bench/sync.c    | 234 +++++++++++++++++++++++++++++++++++++
 tools/perf/builtin-bench.c |   7 ++
 4 files changed, 244 insertions(+)
 create mode 100644 tools/perf/bench/sync.c

diff --git a/tools/perf/bench/Build b/tools/perf/bench/Build
index b558ab98719f..13558279fa0e 100644
--- a/tools/perf/bench/Build
+++ b/tools/perf/bench/Build
@@ -19,6 +19,8 @@ perf-bench-y += evlist-open-close.o
 perf-bench-y += breakpoint.o
 perf-bench-y += pmu-scan.o
 perf-bench-y += uprobe.o
+perf-bench-y += sync.o
+perf-bench-y += qspinlock.o
 
 perf-bench-$(CONFIG_X86_64) += mem-memcpy-x86-64-asm.o
 perf-bench-$(CONFIG_X86_64) += mem-memset-x86-64-asm.o
diff --git a/tools/perf/bench/bench.h b/tools/perf/bench/bench.h
index 9f736423af53..dd6c8b6126d3 100644
--- a/tools/perf/bench/bench.h
+++ b/tools/perf/bench/bench.h
@@ -22,6 +22,7 @@ int bench_numa(int argc, const char **argv);
 int bench_sched_messaging(int argc, const char **argv);
 int bench_sched_pipe(int argc, const char **argv);
 int bench_sched_seccomp_notify(int argc, const char **argv);
+int bench_sync_qspinlock(int argc, const char **argv);
 int bench_syscall_basic(int argc, const char **argv);
 int bench_syscall_getpgid(int argc, const char **argv);
 int bench_syscall_fork(int argc, const char **argv);
diff --git a/tools/perf/bench/sync.c b/tools/perf/bench/sync.c
new file mode 100644
index 000000000000..2685cb66584c
--- /dev/null
+++ b/tools/perf/bench/sync.c
@@ -0,0 +1,234 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Synchronization benchmark.
+ *
+ * 2025  Yuzhuo Jing <yuzhuo@google.com>
+ */
+#include <bits/time.h>
+#include <err.h>
+#include <inttypes.h>
+#include <perf/cpumap.h>
+#include <pthread.h>
+#include <stdbool.h>
+#include <string.h>
+#include <subcmd/parse-options.h>
+#include <sys/cdefs.h>
+
+#include "bench.h"
+
+#include "include/qspinlock.h"
+
+#define NS 1000000000ull
+#define CACHELINE_SIZE 64
+
+static unsigned int nthreads;
+static unsigned long nspins = 10000ul;
+
+struct barrier_t;
+
+typedef void(*lock_fn)(void *);
+
+/*
+ * Lock operation definition to support multiple implementations of locks.
+ *
+ * The lock and unlock functions take a single argument: the data pointer.
+ */
+struct lock_ops {
+	lock_fn lock;
+	lock_fn unlock;
+	void *data;
+};
+
+struct worker {
+	pthread_t thd;
+	unsigned int tid;
+	struct lock_ops *ops;
+	struct barrier_t *barrier;
+	u64 runtime;		// in nanoseconds
+};
+
+static const struct option options[] = {
+	OPT_UINTEGER('t',	"threads",	&nthreads,
+		"Specify number of threads (default: number of CPUs)."),
+	OPT_ULONG('n',		"spins",	&nspins,
+		"Number of lock acquire operations per thread (default: 10,000 times)."),
+	OPT_END()
+};
+
+static const char *const bench_sync_usage[] = {
+	"perf bench sync qspinlock <options>",
+	NULL
+};
+
+/*
+ * An atomic-based barrier.  Expected to have lower latency than a pthread
+ * barrier, which sleeps the thread.
+ */
+struct barrier_t {
+	unsigned int count __aligned(CACHELINE_SIZE);
+};
+
+/*
+ * Wait on the atomic barrier until all participating threads have arrived.
+ * Busy-waits instead of sleeping to keep wakeup latency low.
+ */
+__always_inline void wait_barrier(struct barrier_t *b)
+{
+	if (__atomic_sub_fetch(&b->count, 1, __ATOMIC_RELAXED) == 0)
+		return;
+	while (__atomic_load_n(&b->count, __ATOMIC_RELAXED))
+		;
+}
+
+static int bench_sync_lock_generic(struct lock_ops *ops, int argc, const char **argv);
+
+/*
+ * Benchmark of linux kernel queued spinlock in user land.
+ */
+int bench_sync_qspinlock(int argc, const char **argv)
+{
+	struct qspinlock lock = __ARCH_SPIN_LOCK_UNLOCKED;
+	struct lock_ops ops = {
+		.lock = (lock_fn)queued_spin_lock,
+		.unlock = (lock_fn)queued_spin_unlock,
+		.data = &lock,
+	};
+	return bench_sync_lock_generic(&ops, argc, argv);
+}
+
+/*
+ * A busy loop to acquire and release the given lock N times.
+ */
+static void lock_loop(const struct lock_ops *ops, unsigned long n)
+{
+	unsigned long i;
+
+	for (i = 0; i < n; ++i) {
+		ops->lock(ops->data);
+		ops->unlock(ops->data);
+	}
+}
+
+/*
+ * Thread worker function.  Runs the lock loop N/5 times before and after
+ * the main timed loop.
+ */
+static void *sync_workerfn(void *args)
+{
+	struct worker *worker = (struct worker *)args;
+	struct timespec starttime, endtime;
+
+	set_this_cpu_id(worker->tid);
+
+	/* Barrier to let all threads start together */
+	wait_barrier(worker->barrier);
+
+	/* Warmup loop (not counted) to keep the below loop contended. */
+	lock_loop(worker->ops, nspins / 5);
+
+	clock_gettime(CLOCK_THREAD_CPUTIME_ID, &starttime);
+	lock_loop(worker->ops, nspins);
+	clock_gettime(CLOCK_THREAD_CPUTIME_ID, &endtime);
+
+	/* Tail loop (not counted) to keep the above loop contended. */
+	lock_loop(worker->ops, nspins / 5);
+
+	worker->runtime = (endtime.tv_sec - starttime.tv_sec) * NS
+		+ endtime.tv_nsec - starttime.tv_nsec;
+
+	return NULL;
+}
+
+/*
+ * Generic lock synchronization benchmark function.  Sets up threads and
+ * thread affinities.
+ */
+static int bench_sync_lock_generic(struct lock_ops *ops, int argc, const char **argv)
+{
+	struct perf_cpu_map *online_cpus;
+	unsigned int online_cpus_nr;
+	struct worker *workers;
+	u64 totaltime = 0, total_spins, avg_ns, avg_ns_dot;
+	struct barrier_t barrier;
+	cpu_set_t *cpuset;
+	size_t cpuset_size;
+
+	argc = parse_options(argc, argv, options, bench_sync_usage, 0);
+	if (argc) {
+		usage_with_options(bench_sync_usage, options);
+		exit(EXIT_FAILURE);
+	}
+
+	/* CPU count setup. */
+	online_cpus = perf_cpu_map__new_online_cpus();
+	if (!online_cpus)
+		err(EXIT_FAILURE, "No online CPUs available");
+	online_cpus_nr = perf_cpu_map__nr(online_cpus);
+
+	if (!nthreads) /* default to the number of CPUs */
+		nthreads = online_cpus_nr;
+
+	workers = calloc(nthreads, sizeof(*workers));
+	if (!workers)
+		err(EXIT_FAILURE, "calloc");
+
+	barrier.count = nthreads;
+
+	printf("Running with %u threads.\n", nthreads);
+
+	cpuset = CPU_ALLOC(online_cpus_nr);
+	if (!cpuset)
+		err(EXIT_FAILURE, "Cannot allocate cpuset.");
+	cpuset_size = CPU_ALLOC_SIZE(online_cpus_nr);
+
+	/* Create worker data structures, set CPU affinity, and start threads. */
+	for (unsigned int i = 0; i < nthreads; ++i) {
+		pthread_attr_t thread_attr;
+		int ret;
+
+		/* Basic worker thread information */
+		workers[i].tid = i;
+		workers[i].barrier = &barrier;
+		workers[i].ops = ops;
+
+		/* Set CPU affinity */
+		pthread_attr_init(&thread_attr);
+		CPU_ZERO_S(cpuset_size, cpuset);
+		CPU_SET_S(perf_cpu_map__cpu(online_cpus, i % online_cpus_nr).cpu,
+			cpuset_size, cpuset);
+
+		if (pthread_attr_setaffinity_np(&thread_attr, cpuset_size, cpuset))
+			err(EXIT_FAILURE, "Pthread set affinity failed");
+
+		/* Create and block thread */
+		ret = pthread_create(&workers[i].thd, &thread_attr, sync_workerfn, &workers[i]);
+		if (ret != 0)
+			err(EXIT_FAILURE, "Error creating thread: %s", strerror(ret));
+
+		pthread_attr_destroy(&thread_attr);
+	}
+
+	CPU_FREE(cpuset);
+
+	for (unsigned int i = 0; i < nthreads; ++i) {
+		int ret = pthread_join(workers[i].thd, NULL);
+
+		if (ret)
+			err(EXIT_FAILURE, "pthread_join");
+	}
+
+	/* Calculate overall average latency. */
+	for (unsigned int i = 0; i < nthreads; ++i)
+		totaltime += workers[i].runtime;
+
+	total_spins = (u64)nthreads * nspins;
+	avg_ns = totaltime / total_spins;
+	avg_ns_dot = (totaltime % total_spins) * 10000 / total_spins;
+
+	printf("Lock-unlock latency of %u threads: %"PRIu64".%"PRIu64" ns.\n",
+			nthreads, avg_ns, avg_ns_dot);
+
+	free(workers);
+
+	return 0;
+}
diff --git a/tools/perf/builtin-bench.c b/tools/perf/builtin-bench.c
index 2c1a9f3d847a..cfe6f6dc6ed4 100644
--- a/tools/perf/builtin-bench.c
+++ b/tools/perf/builtin-bench.c
@@ -52,6 +52,12 @@ static struct bench sched_benchmarks[] = {
 	{ NULL,		NULL,						NULL			}
 };
 
+static struct bench sync_benchmarks[] = {
+	{ "qspinlock",	"Benchmark for queued spinlock",		bench_sync_qspinlock	},
+	{ "all",	"Run all synchronization benchmarks",		NULL			},
+	{ NULL,		NULL,						NULL			}
+};
+
 static struct bench syscall_benchmarks[] = {
 	{ "basic",	"Benchmark for basic getppid(2) calls",		bench_syscall_basic	},
 	{ "getpgid",	"Benchmark for getpgid(2) calls",		bench_syscall_getpgid	},
@@ -122,6 +128,7 @@ struct collection {
 
 static struct collection collections[] = {
 	{ "sched",	"Scheduler and IPC benchmarks",			sched_benchmarks	},
+	{ "sync",	"Synchronization benchmarks",			sync_benchmarks		},
 	{ "syscall",	"System call benchmarks",			syscall_benchmarks	},
 	{ "mem",	"Memory access benchmarks",			mem_benchmarks		},
 #ifdef HAVE_LIBNUMA_SUPPORT
-- 
2.50.1.487.gc89ff58d15-goog


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v1 7/7] perf bench sync: Add latency histogram functionality
  2025-07-29  2:26 [PATCH v1 0/7] perf bench: Add qspinlock benchmark Yuzhuo Jing
                   ` (5 preceding siblings ...)
  2025-07-29  2:26 ` [PATCH v1 6/7] perf bench: Add 'bench sync qspinlock' subcommand Yuzhuo Jing
@ 2025-07-29  2:26 ` Yuzhuo Jing
  2025-07-31  5:18   ` Namhyung Kim
  2025-07-31  5:24   ` Namhyung Kim
  2025-07-31  4:51 ` [PATCH v1 0/7] perf bench: Add qspinlock benchmark Namhyung Kim
  2025-08-04 14:28 ` Mark Rutland
  8 siblings, 2 replies; 18+ messages in thread
From: Yuzhuo Jing @ 2025-07-29  2:26 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, Liang Kan, Yuzhuo Jing, Yuzhuo Jing,
	Andrea Parri, Palmer Dabbelt, Charlie Jenkins,
	Sebastian Andrzej Siewior, Kumar Kartikeya Dwivedi,
	Alexei Starovoitov, Barret Rhoden, Alexandre Ghiti, Guo Ren,
	linux-kernel, linux-perf-users

Add an option to print a histogram of lock acquire latencies (measured in
TSC cycles).
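
An example invocation and output (all numbers below are illustrative, not
measured results); buckets are labeled by their upper bound and empty
buckets are skipped:

    $ perf bench sync qspinlock -t 4 --hist --hist-interval 500
    Running with 4 threads.
    Lock-unlock latency of 4 threads: 101.2345 ns.
    Lock acquire histogram:
    Bucket, Count
    500, 36210
    1000, 2847
    1500, 743
    2000, 200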

Signed-off-by: Yuzhuo Jing <yuzhuo@google.com>
---
 tools/perf/bench/sync.c | 97 ++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 96 insertions(+), 1 deletion(-)

diff --git a/tools/perf/bench/sync.c b/tools/perf/bench/sync.c
index 2685cb66584c..c85e9853c72a 100644
--- a/tools/perf/bench/sync.c
+++ b/tools/perf/bench/sync.c
@@ -15,14 +15,19 @@
 #include <sys/cdefs.h>
 
 #include "bench.h"
+#include "../util/tsc.h"
 
 #include "include/qspinlock.h"
 
 #define NS 1000000000ull
 #define CACHELINE_SIZE 64
 
+#define DEFAULT_HIST_INTERVAL 1000
+
 static unsigned int nthreads;
 static unsigned long nspins = 10000ul;
+static bool do_hist;
+static u64 hist_interval = DEFAULT_HIST_INTERVAL;
 
 struct barrier_t;
 
@@ -45,6 +50,7 @@ struct worker {
 	struct lock_ops *ops;
 	struct barrier_t *barrier;
 	u64 runtime;		// in nanoseconds
+	u64 *lock_latency;	// in TSCs
 };
 
 static const struct option options[] = {
@@ -52,6 +58,10 @@ static const struct option options[] = {
 		"Specify number of threads (default: number of CPUs)."),
 	OPT_ULONG('n',		"spins",	&nspins,
 		"Number of lock acquire operations per thread (default: 10,000 times)."),
+	OPT_BOOLEAN(0,		"hist",		&do_hist,
+		"Print a histogram of lock acquire TSCs."),
+	OPT_U64(0,	"hist-interval",	&hist_interval,
+		"Histogram bucket size (default 1,000 TSCs)."),
 	OPT_END()
 };
 
@@ -109,6 +119,25 @@ static void lock_loop(const struct lock_ops *ops, unsigned long n)
 	}
 }
 
+/*
+ * A busy loop to acquire and release the given lock N times, and also collect
+ * all acquire latencies for histogram use.  Note that the latency of the
+ * rdtsc() reads themselves is also included.
+ */
+static void lock_loop_timing(const struct lock_ops *ops, unsigned long n, u64 *sample_buffer)
+{
+	unsigned long i;
+	u64 t1, t2;
+
+	for (i = 0; i < n; ++i) {
+		t1 = rdtsc();
+		ops->lock(ops->data);
+		t2 = rdtsc();
+		ops->unlock(ops->data);
+		sample_buffer[i] = t2 - t1;
+	}
+}
+
 /*
  * Thread worker function.  Runs the lock loop N/5 times before and after
  * the main timed loop.
@@ -127,7 +156,10 @@ static void *sync_workerfn(void *args)
 	lock_loop(worker->ops, nspins / 5);
 
 	clock_gettime(CLOCK_THREAD_CPUTIME_ID, &starttime);
-	lock_loop(worker->ops, nspins);
+	if (worker->lock_latency)
+		lock_loop_timing(worker->ops, nspins, worker->lock_latency);
+	else
+		lock_loop(worker->ops, nspins);
 	clock_gettime(CLOCK_THREAD_CPUTIME_ID, &endtime);
 
 	/* Tail loop (not counted) to keep the above loop contended. */
@@ -139,6 +171,57 @@ static void *sync_workerfn(void *args)
 	return NULL;
 }
 
+/*
+ * Calculate and print a histogram.
+ */
+static void print_histogram(struct worker *workers)
+{
+	u64 tsc_max = 0;
+	u64 *buckets;
+	unsigned long nbuckets;
+
+	if (hist_interval == 0)
+		hist_interval = DEFAULT_HIST_INTERVAL;
+
+	printf("Lock acquire histogram:\n");
+
+	/* Find the max latency to determine the number of buckets needed. */
+	for (unsigned int i = 0; i < nthreads; ++i) {
+		struct worker *w = workers + i;
+
+		for (unsigned long j = 0; j < nspins; ++j)
+			tsc_max = max(w->lock_latency[j], tsc_max);
+	}
+	nbuckets = (tsc_max + hist_interval - 1) / hist_interval;
+
+	/* Allocate the actual buckets.  The bucket representation could be
+	 * optimized if the latency distribution is sparse.
+	 */
+	buckets = calloc(nbuckets, sizeof(*buckets));
+	if (!buckets)
+		err(EXIT_FAILURE, "calloc");
+
+	/* Iterate through all latencies again to fill the buckets. */
+	for (unsigned int i = 0; i < nthreads; ++i) {
+		struct worker *w = workers + i;
+
+		for (unsigned long j = 0; j < nspins; ++j) {
+			u64 latency = w->lock_latency[j];
+			++buckets[latency / hist_interval];
+		}
+	}
+
+	/* Print the histogram as a table. */
+	printf("Bucket, Count\n");
+	for (unsigned long i = 0; i < nbuckets; ++i) {
+		if (buckets[i] == 0)
+			continue;
+		printf("%"PRIu64", %"PRIu64"\n", hist_interval * (i + 1), buckets[i]);
+	}
+
+	free(buckets);
+}
+
 /*
  * Generic lock synchronization benchmark function.  Sets up threads and
  * thread affinities.
@@ -191,6 +274,12 @@ static int bench_sync_lock_generic(struct lock_ops *ops, int argc, const char **
 		workers[i].barrier = &barrier;
 		workers[i].ops = ops;
 
+		if (do_hist) {
+			workers[i].lock_latency = calloc(nspins, sizeof(*workers[i].lock_latency));
+			if (!workers[i].lock_latency)
+				err(EXIT_FAILURE, "calloc");
+		}
+
 		/* Set CPU affinity */
 		pthread_attr_init(&thread_attr);
 		CPU_ZERO_S(cpuset_size, cpuset);
@@ -228,6 +317,12 @@ static int bench_sync_lock_generic(struct lock_ops *ops, int argc, const char **
 	printf("Lock-unlock latency of %u threads: %"PRIu64".%"PRIu64" ns.\n",
 			nthreads, avg_ns, avg_ns_dot);
 
+	/* Print histogram if requested. */
+	if (do_hist)
+		print_histogram(workers);
+
+	for (unsigned int i = 0; i < nthreads; ++i)
+		free(workers[i].lock_latency);
 	free(workers);
 
 	return 0;
-- 
2.50.1.487.gc89ff58d15-goog


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v1 0/7] perf bench: Add qspinlock benchmark
  2025-07-29  2:26 [PATCH v1 0/7] perf bench: Add qspinlock benchmark Yuzhuo Jing
                   ` (6 preceding siblings ...)
  2025-07-29  2:26 ` [PATCH v1 7/7] perf bench sync: Add latency histogram functionality Yuzhuo Jing
@ 2025-07-31  4:51 ` Namhyung Kim
  2025-08-04 14:28 ` Mark Rutland
  8 siblings, 0 replies; 18+ messages in thread
From: Namhyung Kim @ 2025-07-31  4:51 UTC (permalink / raw)
  To: Yuzhuo Jing
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang Kan, Yuzhuo Jing, Andrea Parri,
	Palmer Dabbelt, Charlie Jenkins, Sebastian Andrzej Siewior,
	Kumar Kartikeya Dwivedi, Alexei Starovoitov, Barret Rhoden,
	Alexandre Ghiti, Guo Ren, linux-kernel, linux-perf-users

Hello,

On Mon, Jul 28, 2025 at 07:26:33PM -0700, Yuzhuo Jing wrote:
> As an effort to improve the perf bench subcommand, this patch series
> adds benchmark for the kernel's queued spinlock implementation.
> 
> This series imports necessary kernel definitions such as atomics,
> introduces userspace per-cpu adapter, and imports the qspinlock
> implementation from the kernel tree to tools tree, with minimum
> adaptions.

But I'm curious how you handled the differences between kernel and user space.
For example, normally kernel spinlocks imply no preemption but we cannot
guarantee that in userspace.

> 
> This subcommand enables convenient commands to investigate the
> performance of kernel lock implementations, such as using sampling:
> 
>     perf record -- ./perf bench sync qspinlock -t5
>     perf report

It'd be nice if you could share an example output of the change.

Thanks,
Namhyung

> 
> Yuzhuo Jing (7):
>   tools: Import cmpxchg and xchg functions
>   tools: Import smp_cond_load and atomic_cond_read
>   tools: Partial import of prefetch.h
>   tools: Implement userspace per-cpu
>   perf bench: Import qspinlock from kernel
>   perf bench: Add 'bench sync qspinlock' subcommand
>   perf bench sync: Add latency histogram functionality
> 
>  tools/arch/x86/include/asm/atomic.h           |  14 +
>  tools/arch/x86/include/asm/cmpxchg.h          | 113 +++++
>  tools/include/asm-generic/atomic-gcc.h        |  47 ++
>  tools/include/asm/barrier.h                   |  58 +++
>  tools/include/linux/atomic.h                  |  27 ++
>  tools/include/linux/compiler_types.h          |  30 ++
>  tools/include/linux/percpu-simulate.h         | 128 ++++++
>  tools/include/linux/prefetch.h                |  41 ++
>  tools/perf/bench/Build                        |   2 +
>  tools/perf/bench/bench.h                      |   1 +
>  .../perf/bench/include/mcs_spinlock-private.h | 115 +++++
>  tools/perf/bench/include/mcs_spinlock.h       |  19 +
>  tools/perf/bench/include/qspinlock-private.h  | 204 +++++++++
>  tools/perf/bench/include/qspinlock.h          | 153 +++++++
>  tools/perf/bench/include/qspinlock_types.h    |  98 +++++
>  tools/perf/bench/qspinlock.c                  | 411 ++++++++++++++++++
>  tools/perf/bench/sync.c                       | 329 ++++++++++++++
>  tools/perf/builtin-bench.c                    |   7 +
>  tools/perf/check-headers.sh                   |  32 ++
>  19 files changed, 1829 insertions(+)
>  create mode 100644 tools/include/linux/percpu-simulate.h
>  create mode 100644 tools/include/linux/prefetch.h
>  create mode 100644 tools/perf/bench/include/mcs_spinlock-private.h
>  create mode 100644 tools/perf/bench/include/mcs_spinlock.h
>  create mode 100644 tools/perf/bench/include/qspinlock-private.h
>  create mode 100644 tools/perf/bench/include/qspinlock.h
>  create mode 100644 tools/perf/bench/include/qspinlock_types.h
>  create mode 100644 tools/perf/bench/qspinlock.c
>  create mode 100644 tools/perf/bench/sync.c
> 
> -- 
> 2.50.1.487.gc89ff58d15-goog
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v1 1/7] tools: Import cmpxchg and xchg functions
  2025-07-29  2:26 ` [PATCH v1 1/7] tools: Import cmpxchg and xchg functions Yuzhuo Jing
@ 2025-07-31  4:52   ` Namhyung Kim
  2025-08-08  6:11   ` kernel test robot
  1 sibling, 0 replies; 18+ messages in thread
From: Namhyung Kim @ 2025-07-31  4:52 UTC (permalink / raw)
  To: Yuzhuo Jing
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang Kan, Yuzhuo Jing, Andrea Parri,
	Palmer Dabbelt, Charlie Jenkins, Sebastian Andrzej Siewior,
	Kumar Kartikeya Dwivedi, Alexei Starovoitov, Barret Rhoden,
	Alexandre Ghiti, Guo Ren, linux-kernel, linux-perf-users

On Mon, Jul 28, 2025 at 07:26:34PM -0700, Yuzhuo Jing wrote:
> Import necessary atomic functions used by qspinlock.  Copied x86
> implementation verbatim, and used compiler builtin for generic
> implementation.

Why x86 only?  Can we just use the generic version always?

Thanks,
Namhyung

> 
> Signed-off-by: Yuzhuo Jing <yuzhuo@google.com>
> ---
>  tools/arch/x86/include/asm/atomic.h    |  14 +++
>  tools/arch/x86/include/asm/cmpxchg.h   | 113 +++++++++++++++++++++++++
>  tools/include/asm-generic/atomic-gcc.h |  47 ++++++++++
>  tools/include/linux/atomic.h           |  24 ++++++
>  tools/include/linux/compiler_types.h   |  24 ++++++
>  5 files changed, 222 insertions(+)
> 
> diff --git a/tools/arch/x86/include/asm/atomic.h b/tools/arch/x86/include/asm/atomic.h
> index 365cf182df12..a55ffd4eb5f1 100644
> --- a/tools/arch/x86/include/asm/atomic.h
> +++ b/tools/arch/x86/include/asm/atomic.h
> @@ -71,6 +71,20 @@ static __always_inline int atomic_cmpxchg(atomic_t *v, int old, int new)
>  	return cmpxchg(&v->counter, old, new);
>  }
>  
> +static __always_inline bool atomic_try_cmpxchg(atomic_t *v, int *old, int new)
> +{
> +	return try_cmpxchg(&v->counter, old, new);
> +}
> +
> +static __always_inline int atomic_fetch_or(int i, atomic_t *v)
> +{
> +	int val = atomic_read(v);
> +
> +	do { } while (!atomic_try_cmpxchg(v, &val, val | i));
> +
> +	return val;
> +}
> +
>  static inline int test_and_set_bit(long nr, unsigned long *addr)
>  {
>  	GEN_BINARY_RMWcc(LOCK_PREFIX __ASM_SIZE(bts), *addr, "Ir", nr, "%0", "c");
> diff --git a/tools/arch/x86/include/asm/cmpxchg.h b/tools/arch/x86/include/asm/cmpxchg.h
> index 0ed9ca2766ad..5372da8b27fc 100644
> --- a/tools/arch/x86/include/asm/cmpxchg.h
> +++ b/tools/arch/x86/include/asm/cmpxchg.h
> @@ -8,6 +8,8 @@
>   * Non-existant functions to indicate usage errors at link time
>   * (or compile-time if the compiler implements __compiletime_error().
>   */
> +extern void __xchg_wrong_size(void)
> +	__compiletime_error("Bad argument size for xchg");
>  extern void __cmpxchg_wrong_size(void)
>  	__compiletime_error("Bad argument size for cmpxchg");
>  
> @@ -27,6 +29,49 @@ extern void __cmpxchg_wrong_size(void)
>  #define	__X86_CASE_Q	-1		/* sizeof will never return -1 */
>  #endif
>  
> +/* 
> + * An exchange-type operation, which takes a value and a pointer, and
> + * returns the old value.
> + */
> +#define __xchg_op(ptr, arg, op, lock)					\
> +	({								\
> +	        __typeof__ (*(ptr)) __ret = (arg);			\
> +		switch (sizeof(*(ptr))) {				\
> +		case __X86_CASE_B:					\
> +			asm_inline volatile (lock #op "b %b0, %1"	\
> +				      : "+q" (__ret), "+m" (*(ptr))	\
> +				      : : "memory", "cc");		\
> +			break;						\
> +		case __X86_CASE_W:					\
> +			asm_inline volatile (lock #op "w %w0, %1"	\
> +				      : "+r" (__ret), "+m" (*(ptr))	\
> +				      : : "memory", "cc");		\
> +			break;						\
> +		case __X86_CASE_L:					\
> +			asm_inline volatile (lock #op "l %0, %1"	\
> +				      : "+r" (__ret), "+m" (*(ptr))	\
> +				      : : "memory", "cc");		\
> +			break;						\
> +		case __X86_CASE_Q:					\
> +			asm_inline volatile (lock #op "q %q0, %1"	\
> +				      : "+r" (__ret), "+m" (*(ptr))	\
> +				      : : "memory", "cc");		\
> +			break;						\
> +		default:						\
> +			__ ## op ## _wrong_size();			\
> +		__cmpxchg_wrong_size();					\
> +		}							\
> +		__ret;							\
> +	})
> +
> +/*
> + * Note: no "lock" prefix even on SMP: xchg always implies lock anyway.
> + * Since this is generally used to protect other memory information, we
> + * use "asm volatile" and "memory" clobbers to prevent gcc from moving
> + * information around.
> + */
> +#define xchg(ptr, v)	__xchg_op((ptr), (v), xchg, "")
> +
>  /*
>   * Atomic compare and exchange.  Compare OLD with MEM, if identical,
>   * store NEW in MEM.  Return the initial value in MEM.  Success is
> @@ -86,5 +131,73 @@ extern void __cmpxchg_wrong_size(void)
>  #define cmpxchg(ptr, old, new)						\
>  	__cmpxchg(ptr, old, new, sizeof(*(ptr)))
>  
> +#define __raw_try_cmpxchg(_ptr, _pold, _new, size, lock)		\
> +({									\
> +	bool success;							\
> +	__typeof__(_ptr) _old = (__typeof__(_ptr))(_pold);		\
> +	__typeof__(*(_ptr)) __old = *_old;				\
> +	__typeof__(*(_ptr)) __new = (_new);				\
> +	switch (size) {							\
> +	case __X86_CASE_B:						\
> +	{								\
> +		volatile u8 *__ptr = (volatile u8 *)(_ptr);		\
> +		asm_inline volatile(lock "cmpxchgb %[new], %[ptr]"	\
> +			     CC_SET(z)					\
> +			     : CC_OUT(z) (success),			\
> +			       [ptr] "+m" (*__ptr),			\
> +			       [old] "+a" (__old)			\
> +			     : [new] "q" (__new)			\
> +			     : "memory");				\
> +		break;							\
> +	}								\
> +	case __X86_CASE_W:						\
> +	{								\
> +		volatile u16 *__ptr = (volatile u16 *)(_ptr);		\
> +		asm_inline volatile(lock "cmpxchgw %[new], %[ptr]"	\
> +			     CC_SET(z)					\
> +			     : CC_OUT(z) (success),			\
> +			       [ptr] "+m" (*__ptr),			\
> +			       [old] "+a" (__old)			\
> +			     : [new] "r" (__new)			\
> +			     : "memory");				\
> +		break;							\
> +	}								\
> +	case __X86_CASE_L:						\
> +	{								\
> +		volatile u32 *__ptr = (volatile u32 *)(_ptr);		\
> +		asm_inline volatile(lock "cmpxchgl %[new], %[ptr]"	\
> +			     CC_SET(z)					\
> +			     : CC_OUT(z) (success),			\
> +			       [ptr] "+m" (*__ptr),			\
> +			       [old] "+a" (__old)			\
> +			     : [new] "r" (__new)			\
> +			     : "memory");				\
> +		break;							\
> +	}								\
> +	case __X86_CASE_Q:						\
> +	{								\
> +		volatile u64 *__ptr = (volatile u64 *)(_ptr);		\
> +		asm_inline volatile(lock "cmpxchgq %[new], %[ptr]"	\
> +			     CC_SET(z)					\
> +			     : CC_OUT(z) (success),			\
> +			       [ptr] "+m" (*__ptr),			\
> +			       [old] "+a" (__old)			\
> +			     : [new] "r" (__new)			\
> +			     : "memory");				\
> +		break;							\
> +	}								\
> +	default:							\
> +		__cmpxchg_wrong_size();					\
> +	}								\
> +	if (unlikely(!success))						\
> +		*_old = __old;						\
> +	likely(success);						\
> +})
> +
> +#define __try_cmpxchg(ptr, pold, new, size)				\
> +	__raw_try_cmpxchg((ptr), (pold), (new), (size), LOCK_PREFIX)
> +
> +#define try_cmpxchg(ptr, pold, new) 				\
> +	__try_cmpxchg((ptr), (pold), (new), sizeof(*(ptr)))
>  
>  #endif	/* TOOLS_ASM_X86_CMPXCHG_H */
> diff --git a/tools/include/asm-generic/atomic-gcc.h b/tools/include/asm-generic/atomic-gcc.h
> index 9b3c528bab92..08b7b3b36873 100644
> --- a/tools/include/asm-generic/atomic-gcc.h
> +++ b/tools/include/asm-generic/atomic-gcc.h
> @@ -62,6 +62,12 @@ static inline int atomic_dec_and_test(atomic_t *v)
>  	return __sync_sub_and_fetch(&v->counter, 1) == 0;
>  }
>  
> +#define xchg(ptr, v) \
> +	__atomic_exchange_n(ptr, v, __ATOMIC_SEQ_CST)
> +
> +#define xchg_relaxed(ptr, v) \
> +	__atomic_exchange_n(ptr, v, __ATOMIC_RELAXED)
> +
>  #define cmpxchg(ptr, oldval, newval) \
>  	__sync_val_compare_and_swap(ptr, oldval, newval)
>  
> @@ -70,6 +76,47 @@ static inline int atomic_cmpxchg(atomic_t *v, int oldval, int newval)
>  	return cmpxchg(&(v)->counter, oldval, newval);
>  }
>  
> +/**
> + * atomic_try_cmpxchg() - atomic compare and exchange with full ordering
> + * @v: pointer to atomic_t
> + * @old: pointer to int value to compare with
> + * @new: int value to assign
> + *
> + * If (@v == @old), atomically updates @v to @new with full ordering.
> + * Otherwise, @v is not modified, @old is updated to the current value of @v,
> + * and relaxed ordering is provided.
> + *
> + * Unsafe to use in noinstr code; use raw_atomic_try_cmpxchg() there.
> + *
> + * Return: @true if the exchange occured, @false otherwise.
> + */
> +static __always_inline bool
> +atomic_try_cmpxchg(atomic_t *v, int *old, int new)
> +{
> +	int r, o = *old;
> +	r = atomic_cmpxchg(v, o, new);
> +	if (unlikely(r != o))
> +		*old = r;
> +	return likely(r == o);
> +}
> +
> +/**
> + * atomic_fetch_or() - atomic bitwise OR with full ordering
> + * @i: int value
> + * @v: pointer to atomic_t
> + *
> + * Atomically updates @v to (@v | @i) with full ordering.
> + *
> + * Unsafe to use in noinstr code; use raw_atomic_fetch_or() there.
> + *
> + * Return: The original value of @v.
> + */
> +static __always_inline int
> +atomic_fetch_or(int i, atomic_t *v)
> +{
> +	return __sync_fetch_and_or(&v->counter, i);
> +}
> +
>  static inline int test_and_set_bit(long nr, unsigned long *addr)
>  {
>  	unsigned long mask = BIT_MASK(nr);
> diff --git a/tools/include/linux/atomic.h b/tools/include/linux/atomic.h
> index 01907b33537e..332a34177995 100644
> --- a/tools/include/linux/atomic.h
> +++ b/tools/include/linux/atomic.h
> @@ -12,4 +12,28 @@ void atomic_long_set(atomic_long_t *v, long i);
>  #define  atomic_cmpxchg_release         atomic_cmpxchg
>  #endif /* atomic_cmpxchg_relaxed */
>  
> +#ifndef atomic_cmpxchg_acquire
> +#define atomic_cmpxchg_acquire		atomic_cmpxchg
> +#endif
> +
> +#ifndef atomic_try_cmpxchg_acquire
> +#define atomic_try_cmpxchg_acquire	atomic_try_cmpxchg
> +#endif
> +
> +#ifndef atomic_try_cmpxchg_relaxed
> +#define atomic_try_cmpxchg_relaxed	atomic_try_cmpxchg
> +#endif
> +
> +#ifndef atomic_fetch_or_acquire
> +#define atomic_fetch_or_acquire		atomic_fetch_or
> +#endif
> +
> +#ifndef xchg_relaxed
> +#define xchg_relaxed		xchg
> +#endif
> +
> +#ifndef cmpxchg_release
> +#define cmpxchg_release		cmpxchg
> +#endif
> +
>  #endif /* __TOOLS_LINUX_ATOMIC_H */
> diff --git a/tools/include/linux/compiler_types.h b/tools/include/linux/compiler_types.h
> index d09f9dc172a4..9a2a2f8d7b6c 100644
> --- a/tools/include/linux/compiler_types.h
> +++ b/tools/include/linux/compiler_types.h
> @@ -31,6 +31,28 @@
>  # define __cond_lock(x,c) (c)
>  #endif /* __CHECKER__ */
>  
> +/*
> + * __unqual_scalar_typeof(x) - Declare an unqualified scalar type, leaving
> + *			       non-scalar types unchanged.
> + */
> +/*
> + * Prefer C11 _Generic for better compile-times and simpler code. Note: 'char'
> + * is not type-compatible with 'signed char', and we define a separate case.
> + */
> +#define __scalar_type_to_expr_cases(type)				\
> +		unsigned type:	(unsigned type)0,			\
> +		signed type:	(signed type)0
> +
> +#define __unqual_scalar_typeof(x) typeof(				\
> +		_Generic((x),						\
> +			 char:	(char)0,				\
> +			 __scalar_type_to_expr_cases(char),		\
> +			 __scalar_type_to_expr_cases(short),		\
> +			 __scalar_type_to_expr_cases(int),		\
> +			 __scalar_type_to_expr_cases(long),		\
> +			 __scalar_type_to_expr_cases(long long),	\
> +			 default: (x)))
> +
>  /* Compiler specific macros. */
>  #ifdef __GNUC__
>  #include <linux/compiler-gcc.h>
> @@ -40,4 +62,6 @@
>  #define asm_goto_output(x...) asm goto(x)
>  #endif
>  
> +#define asm_inline asm
> +
>  #endif /* __LINUX_COMPILER_TYPES_H */
> -- 
> 2.50.1.487.gc89ff58d15-goog
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v1 3/7] tools: Partial import of prefetch.h
  2025-07-29  2:26 ` [PATCH v1 3/7] tools: Partial import of prefetch.h Yuzhuo Jing
@ 2025-07-31  4:54   ` Namhyung Kim
  0 siblings, 0 replies; 18+ messages in thread
From: Namhyung Kim @ 2025-07-31  4:54 UTC (permalink / raw)
  To: Yuzhuo Jing
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang Kan, Yuzhuo Jing, Andrea Parri,
	Palmer Dabbelt, Charlie Jenkins, Sebastian Andrzej Siewior,
	Kumar Kartikeya Dwivedi, Alexei Starovoitov, Barret Rhoden,
	Alexandre Ghiti, Guo Ren, linux-kernel, linux-perf-users

On Mon, Jul 28, 2025 at 07:26:36PM -0700, Yuzhuo Jing wrote:
> Import only prefetch and prefetchw but not page and range related
> methods.
> 
> Signed-off-by: Yuzhuo Jing <yuzhuo@google.com>
> ---
>  tools/include/linux/prefetch.h | 41 ++++++++++++++++++++++++++++++++++
>  1 file changed, 41 insertions(+)
>  create mode 100644 tools/include/linux/prefetch.h
> 
> diff --git a/tools/include/linux/prefetch.h b/tools/include/linux/prefetch.h
> new file mode 100644
> index 000000000000..1ed8678f4824
> --- /dev/null
> +++ b/tools/include/linux/prefetch.h
> @@ -0,0 +1,41 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + *  Generic cache management functions. Everything is arch-specific,  
> + *  but this header exists to make sure the defines/functions can be
> + *  used in a generic way.
> + *
> + *  2000-11-13  Arjan van de Ven   <arjan@fenrus.demon.nl>
> + *
> + */
> +
> +#ifndef _LINUX_PREFETCH_H
> +#define _LINUX_PREFETCH_H
> +
> +/*
> +	prefetch(x) attempts to pre-emptively get the memory pointed to
> +	by address "x" into the CPU L1 cache. 
> +	prefetch(x) should not cause any kind of exception, prefetch(0) is
> +	specifically ok.
> +
> +	prefetch() should be defined by the architecture, if not, the 
> +	#define below provides a no-op define.	
> +	
> +	There are 2 prefetch() macros:
> +	
> +	prefetch(x)  	- prefetches the cacheline at "x" for read
> +	prefetchw(x)	- prefetches the cacheline at "x" for write
> +	
> +	there is also PREFETCH_STRIDE which is the architecure-preferred 
> +	"lookahead" size for prefetching streamed operations.
> +	
> +*/
> +
> +#ifndef ARCH_HAS_PREFETCH
> +#define prefetch(x) __builtin_prefetch(x)
> +#endif

Do we have ARCH_HAS_PREFETCH somewhere?

Thanks,
Namhyung

> +
> +#ifndef ARCH_HAS_PREFETCHW
> +#define prefetchw(x) __builtin_prefetch(x,1)
> +#endif
> +
> +#endif
> -- 
> 2.50.1.487.gc89ff58d15-goog
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v1 4/7] tools: Implement userspace per-cpu
  2025-07-29  2:26 ` [PATCH v1 4/7] tools: Implement userspace per-cpu Yuzhuo Jing
@ 2025-07-31  5:07   ` Namhyung Kim
  0 siblings, 0 replies; 18+ messages in thread
From: Namhyung Kim @ 2025-07-31  5:07 UTC (permalink / raw)
  To: Yuzhuo Jing
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang Kan, Yuzhuo Jing, Andrea Parri,
	Palmer Dabbelt, Charlie Jenkins, Sebastian Andrzej Siewior,
	Kumar Kartikeya Dwivedi, Alexei Starovoitov, Barret Rhoden,
	Alexandre Ghiti, Guo Ren, linux-kernel, linux-perf-users

On Mon, Jul 28, 2025 at 07:26:37PM -0700, Yuzhuo Jing wrote:
> Implement userspace per-cpu for imported kernel code.  Compared with
> simple thread-local definition, the kernel per-cpu provides 1) a
> guarantee of static lifetime even when thread exits, and 2) the ability
> to access other CPU's per-cpu data.
> 
> This patch adds an alternative implementation and interface for
> userspace per-cpu.  The kernel implementation uses special ELF sections
> and offset calculation.  For simplicity, this version defines a
> PERCPU_MAX length global array for each per-cpu data, and uses a
> thread-local cpu id for indexing.
> 
> Signed-off-by: Yuzhuo Jing <yuzhuo@google.com>
> ---
>  tools/include/linux/compiler_types.h  |   3 +
>  tools/include/linux/percpu-simulate.h | 128 ++++++++++++++++++++++++++
>  2 files changed, 131 insertions(+)
>  create mode 100644 tools/include/linux/percpu-simulate.h
> 
> diff --git a/tools/include/linux/compiler_types.h b/tools/include/linux/compiler_types.h
> index 9a2a2f8d7b6c..46550c500b8c 100644
> --- a/tools/include/linux/compiler_types.h
> +++ b/tools/include/linux/compiler_types.h
> @@ -31,6 +31,9 @@
>  # define __cond_lock(x,c) (c)
>  #endif /* __CHECKER__ */
>  
> +/* Per-cpu checker flag does not use address space attribute in userspace */
> +#define __percpu
> +
>  /*
>   * __unqual_scalar_typeof(x) - Declare an unqualified scalar type, leaving
>   *			       non-scalar types unchanged.
> diff --git a/tools/include/linux/percpu-simulate.h b/tools/include/linux/percpu-simulate.h
> new file mode 100644
> index 000000000000..a6af2f2211eb
> --- /dev/null
> +++ b/tools/include/linux/percpu-simulate.h
> @@ -0,0 +1,128 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Userspace implementation of per_cpu_ptr for adapted kernel code.
> + *
> + * Userspace code does not have and does not need a per-cpu concept, but
> + * instead can declare variables as thread-local.  However, the kernel per-cpu
> + * further provides 1) the guarantee of static lifetime when thread exits, and
> + * 2) the ability to access other CPU's per-cpu data.  This file provides a
> + * simple implementation of such functionality, but with slightly different
> + * APIs and without linker script changes.
> + *
> + * 2025  Yuzhuo Jing <yuzhuo@google.com>
> + */
> +#ifndef __PERCPU_SIMULATE_H__
> +#define __PERCPU_SIMULATE_H__
> +
> +#include <assert.h>
> +
> +#include <linux/compiler.h>
> +#include <linux/types.h>
> +
> +/*
> + * The maximum supported number of CPUs.  Per-cpu variables are defined as a
> + * PERCPU_MAX length array, indexed by a thread-local cpu id.
> + */
> +#define PERCPU_MAX 4096
> +
> +#ifdef ASSERT_PERCPU
> +#define __check_cpu_id(cpu)						\
> +({									\
> +	u32 cpuid = (cpu);						\
> +	assert(cpuid < PERCPU_MAX);					\
> +	cpuid;								\
> +})
> +#else
> +#define __check_cpu_id(cpu)	(cpu)
> +#endif
> +
> +/*
> + * Use weak symbol: only define __thread_per_cpu_id variable if any perf tool
> + * includes this header file.
> + */
> +_Thread_local u32 __thread_per_cpu_id __weak;

Is there any overhead (or some indirection) when using the thread local
variable?

> +
> +static inline u32 get_this_cpu_id(void)
> +{
> +	return __thread_per_cpu_id;
> +}
> +
> +/*
> + * The user code must call this function inside of each thread that uses
> + * per-cpu data structures.  The user code can choose an id of their choice,
> + * but must ensure each thread uses a different id.
> + *
> + * Safety: asserts CPU id smaller than PERCPU_MAX if ASSERT_PERCPU is defined.
> + */
> +static inline void set_this_cpu_id(u32 id)
> +{
> +	__thread_per_cpu_id = __check_cpu_id(id);
> +}
> +
> +/*
> + * Declare a per-cpu data structure.  This only declares the data type and
> + * array length. Different per-cpu data are differentiated by a key (identifer).
> + *
> + * Different from the kernel version, this API must be called before the actual
> + * definition (i.e. DEFINE_PER_CPU_ALIGNED).
> + *
> + * Note that this implementation does not support prepending static qualifier,
> + * or appending assignment expressions.
> + */
> +#define DECLARE_PER_CPU_ALIGNED(key, type, data) \
> +	extern struct __percpu_type_##key { \
> +		type data; \
> +	} __percpu_data_##key[PERCPU_MAX]
> +
> +/*
> + * Define the per-cpu data storage for a given key.  This uses a previously
> + * defined data type in DECLARE_PER_CPU_ALIGNED.
> + *
> + * Different from the kernel version, this API only accepts a key name.
> + */
> +#define DEFINE_PER_CPU_ALIGNED(key) \
> +	struct __percpu_type_##key __percpu_data_##key[PERCPU_MAX]

How do these APIs guarantee the alignment?

Thanks,
Namhyung

> +
> +#define __raw_per_cpu_value(key, field, cpu) \
> +	(__percpu_data_##key[cpu].field)
> +
> +/*
> + * Get a pointer of per-cpu data for a given key.
> + *
> + * Different from the kernel version, users of this API don't need to pass the
> + * address of the base variable (through `&varname').
> + *
> + * Safety: asserts CPU id smaller than PERCPU_MAX if ASSERT_PERCPU is defined.
> + */
> +#define per_cpu_ptr(key, field, cpu) (&per_cpu_value(key, field, cpu))
> +#define this_cpu_ptr(key, field) (&this_cpu_value(key, field))
> +
> +/*
> + * Additional APIs for direct value access.  Effectively, `*per_cpu_ptr(...)'.
> + *
> + * Safety: asserts CPU id smaller than PERCPU_MAX if ASSERT_PERCPU is defined.
> + */
> +#define per_cpu_value(key, field, cpu) \
> +	(__raw_per_cpu_value(key, field, __check_cpu_id(cpu)))
> +#define this_cpu_value(key, field) \
> +	(__raw_per_cpu_value(key, field, __thread_per_cpu_id))
> +
> +/*
> + * Helper functions of simple per-cpu operations.
> + *
> + * The kernel version differentiates __this_cpu_* from this_cpu_* for
> + * preemption/interrupt-safe contexts, but the userspace version defines them
> + * as the same.
> + */
> +
> +#define __this_cpu_add(key, field, val)	(this_cpu_value(key, field) += (val))
> +#define __this_cpu_sub(key, field, val)	(this_cpu_value(key, field) -= (val))
> +#define __this_cpu_inc(key, field)	(++this_cpu_value(key, field))
> +#define __this_cpu_dec(key, field)	(--this_cpu_value(key, field))
> +
> +#define this_cpu_add	__this_cpu_add
> +#define this_cpu_sub	__this_cpu_sub
> +#define this_cpu_inc	__this_cpu_inc
> +#define this_cpu_dec	__this_cpu_dec
> +
> +#endif /* __PERCPU_SIMULATE_H__ */
> -- 
> 2.50.1.487.gc89ff58d15-goog
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v1 6/7] perf bench: Add 'bench sync qspinlock' subcommand
  2025-07-29  2:26 ` [PATCH v1 6/7] perf bench: Add 'bench sync qspinlock' subcommand Yuzhuo Jing
@ 2025-07-31  5:16   ` Namhyung Kim
  2025-07-31 13:19     ` Yuzhuo Jing
  0 siblings, 1 reply; 18+ messages in thread
From: Namhyung Kim @ 2025-07-31  5:16 UTC (permalink / raw)
  To: Yuzhuo Jing
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang Kan, Yuzhuo Jing, Andrea Parri,
	Palmer Dabbelt, Charlie Jenkins, Sebastian Andrzej Siewior,
	Kumar Kartikeya Dwivedi, Alexei Starovoitov, Barret Rhoden,
	Alexandre Ghiti, Guo Ren, linux-kernel, linux-perf-users

On Mon, Jul 28, 2025 at 07:26:39PM -0700, Yuzhuo Jing wrote:
> Benchmark kernel queued spinlock implementation in user space.  Support
> settings of the number of threads and the number of acquire/releases.

My general advice is to add an example command line and its output to the
commit message whenever a change is user-visible.  Also please update
Documentation/perf-bench.txt.

Thanks,
Namhyung

> 
> Signed-off-by: Yuzhuo Jing <yuzhuo@google.com>
> ---
>  tools/perf/bench/Build     |   2 +
>  tools/perf/bench/bench.h   |   1 +
>  tools/perf/bench/sync.c    | 234 +++++++++++++++++++++++++++++++++++++
>  tools/perf/builtin-bench.c |   7 ++
>  4 files changed, 244 insertions(+)
>  create mode 100644 tools/perf/bench/sync.c
> 
> diff --git a/tools/perf/bench/Build b/tools/perf/bench/Build
> index b558ab98719f..13558279fa0e 100644
> --- a/tools/perf/bench/Build
> +++ b/tools/perf/bench/Build
> @@ -19,6 +19,8 @@ perf-bench-y += evlist-open-close.o
>  perf-bench-y += breakpoint.o
>  perf-bench-y += pmu-scan.o
>  perf-bench-y += uprobe.o
> +perf-bench-y += sync.o
> +perf-bench-y += qspinlock.o
>  
>  perf-bench-$(CONFIG_X86_64) += mem-memcpy-x86-64-asm.o
>  perf-bench-$(CONFIG_X86_64) += mem-memset-x86-64-asm.o
> diff --git a/tools/perf/bench/bench.h b/tools/perf/bench/bench.h
> index 9f736423af53..dd6c8b6126d3 100644
> --- a/tools/perf/bench/bench.h
> +++ b/tools/perf/bench/bench.h
> @@ -22,6 +22,7 @@ int bench_numa(int argc, const char **argv);
>  int bench_sched_messaging(int argc, const char **argv);
>  int bench_sched_pipe(int argc, const char **argv);
>  int bench_sched_seccomp_notify(int argc, const char **argv);
> +int bench_sync_qspinlock(int argc, const char **argv);
>  int bench_syscall_basic(int argc, const char **argv);
>  int bench_syscall_getpgid(int argc, const char **argv);
>  int bench_syscall_fork(int argc, const char **argv);
> diff --git a/tools/perf/bench/sync.c b/tools/perf/bench/sync.c
> new file mode 100644
> index 000000000000..2685cb66584c
> --- /dev/null
> +++ b/tools/perf/bench/sync.c
> @@ -0,0 +1,234 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Synchronization benchmark.
> + *
> + * 2025  Yuzhuo Jing <yuzhuo@google.com>
> + */
> +#include <bits/time.h>
> +#include <err.h>
> +#include <inttypes.h>
> +#include <perf/cpumap.h>
> +#include <pthread.h>
> +#include <stdbool.h>
> +#include <string.h>
> +#include <subcmd/parse-options.h>
> +#include <sys/cdefs.h>
> +
> +#include "bench.h"
> +
> +#include "include/qspinlock.h"
> +
> +#define NS 1000000000ull
> +#define CACHELINE_SIZE 64
> +
> +static unsigned int nthreads;
> +static unsigned long nspins = 10000ul;
> +
> +struct barrier_t;
> +
> +typedef void(*lock_fn)(void *);
> +
> +/*
> + * Lock operation definition to support multiple implmentations of locks.
> + *
> + * The lock and unlock functions only take one variable, the data pointer.
> + */
> +struct lock_ops {
> +	lock_fn lock;
> +	lock_fn unlock;
> +	void *data;
> +};
> +
> +struct worker {
> +	pthread_t thd;
> +	unsigned int tid;
> +	struct lock_ops *ops;
> +	struct barrier_t *barrier;
> +	u64 runtime;		// in nanoseconds
> +};
> +
> +static const struct option options[] = {
> +	OPT_UINTEGER('t',	"threads",	&nthreads,
> +		"Specify number of threads (default: number of CPUs)."),
> +	OPT_ULONG('n',		"spins",	&nspins,
> +		"Number of lock acquire operations per thread (default: 10,000 times)."),
> +	OPT_END()
> +};
> +
> +static const char *const bench_sync_usage[] = {
> +	"perf bench sync qspinlock <options>",
> +	NULL
> +};
> +
> +/*
> + * A atomic-based barrier.  Expect to have lower latency than pthread barrier
> + * that sleeps the thread.
> + */
> +struct barrier_t {
> +	unsigned int count __aligned(CACHELINE_SIZE);
> +};
> +
> +/*
> + * A atomic-based barrier.  Expect to have lower latency than pthread barrier
> + * that sleeps the thread.
> + */
> +__always_inline void wait_barrier(struct barrier_t *b)
> +{
> +	if (__atomic_sub_fetch(&b->count, 1, __ATOMIC_RELAXED) == 0)
> +		return;
> +	while (__atomic_load_n(&b->count, __ATOMIC_RELAXED))
> +		;
> +}
> +
> +static int bench_sync_lock_generic(struct lock_ops *ops, int argc, const char **argv);
> +
> +/*
> + * Benchmark of linux kernel queued spinlock in user land.
> + */
> +int bench_sync_qspinlock(int argc, const char **argv)
> +{
> +	struct qspinlock lock = __ARCH_SPIN_LOCK_UNLOCKED;
> +	struct lock_ops ops = {
> +		.lock = (lock_fn)queued_spin_lock,
> +		.unlock = (lock_fn)queued_spin_unlock,
> +		.data = &lock,
> +	};
> +	return bench_sync_lock_generic(&ops, argc, argv);
> +}
> +
> +/*
> + * A busy loop to acquire and release the given lock N times.
> + */
> +static void lock_loop(const struct lock_ops *ops, unsigned long n)
> +{
> +	unsigned long i;
> +
> +	for (i = 0; i < n; ++i) {
> +		ops->lock(ops->data);
> +		ops->unlock(ops->data);
> +	}
> +}
> +
> +/*
> + * Thread worker function.  Runs lock loop for N/5 times before and after
> + * the main timed loop.
> + */
> +static void *sync_workerfn(void *args)
> +{
> +	struct worker *worker = (struct worker *)args;
> +	struct timespec starttime, endtime;
> +
> +	set_this_cpu_id(worker->tid);
> +
> +	/* Barrier to let all threads start together */
> +	wait_barrier(worker->barrier);
> +
> +	/* Warmup loop (not counted) to keep the below loop contended. */
> +	lock_loop(worker->ops, nspins / 5);
> +
> +	clock_gettime(CLOCK_THREAD_CPUTIME_ID, &starttime);
> +	lock_loop(worker->ops, nspins);
> +	clock_gettime(CLOCK_THREAD_CPUTIME_ID, &endtime);
> +
> +	/* Tail loop (not counted) to keep the above loop contended. */
> +	lock_loop(worker->ops, nspins / 5);
> +
> +	worker->runtime = (endtime.tv_sec - starttime.tv_sec) * NS
> +		+ endtime.tv_nsec - starttime.tv_nsec;
> +
> +	return NULL;
> +}
> +
> +/*
> + * Generic lock synchronization benchmark function.  Sets up threads and
> + * thread affinities.
> + */
> +static int bench_sync_lock_generic(struct lock_ops *ops, int argc, const char **argv)
> +{
> +	struct perf_cpu_map *online_cpus;
> +	unsigned int online_cpus_nr;
> +	struct worker *workers;
> +	u64 totaltime = 0, total_spins, avg_ns, avg_ns_dot;
> +	struct barrier_t barrier;
> +	cpu_set_t *cpuset;
> +	size_t cpuset_size;
> +
> +	argc = parse_options(argc, argv, options, bench_sync_usage, 0);
> +	if (argc) {
> +		usage_with_options(bench_sync_usage, options);
> +		exit(EXIT_FAILURE);
> +	}
> +
> +	/* CPU count setup. */
> +	online_cpus = perf_cpu_map__new_online_cpus();
> +	if (!online_cpus)
> +		err(EXIT_FAILURE, "No online CPUs available");
> +	online_cpus_nr = perf_cpu_map__nr(online_cpus);
> +
> +	if (!nthreads) /* default to the number of CPUs */
> +		nthreads = online_cpus_nr;
> +
> +	workers = calloc(nthreads, sizeof(*workers));
> +	if (!workers)
> +		err(EXIT_FAILURE, "calloc");
> +
> +	barrier.count = nthreads;
> +
> +	printf("Running with %u threads.\n", nthreads);
> +
> +	cpuset = CPU_ALLOC(online_cpus_nr);
> +	if (!cpuset)
> +		err(EXIT_FAILURE, "Cannot allocate cpuset.");
> +	cpuset_size = CPU_ALLOC_SIZE(online_cpus_nr);
> +
> +	/* Create worker data structures, set CPU affinity, and create   */
> +	for (unsigned int i = 0; i < nthreads; ++i) {
> +		pthread_attr_t thread_attr;
> +		int ret;
> +
> +		/* Basic worker thread information */
> +		workers[i].tid = i;
> +		workers[i].barrier = &barrier;
> +		workers[i].ops = ops;
> +
> +		/* Set CPU affinity */
> +		pthread_attr_init(&thread_attr);
> +		CPU_ZERO_S(cpuset_size, cpuset);
> +		CPU_SET_S(perf_cpu_map__cpu(online_cpus, i % online_cpus_nr).cpu,
> +			cpuset_size, cpuset);
> +
> +		if (pthread_attr_setaffinity_np(&thread_attr, cpuset_size, cpuset))
> +			err(EXIT_FAILURE, "Pthread set affinity failed");
> +
> +		/* Create and block thread */
> +		ret = pthread_create(&workers[i].thd, &thread_attr, sync_workerfn, &workers[i]);
> +		if (ret != 0)
> +			err(EXIT_FAILURE, "Error creating thread: %s", strerror(ret));
> +
> +		pthread_attr_destroy(&thread_attr);
> +	}
> +
> +	CPU_FREE(cpuset);
> +
> +	for (unsigned int i = 0; i < nthreads; ++i) {
> +		int ret = pthread_join(workers[i].thd, NULL);
> +
> +		if (ret)
> +			err(EXIT_FAILURE, "pthread_join");
> +	}
> +
> +	/* Calculate overall average latency. */
> +	for (unsigned int i = 0; i < nthreads; ++i)
> +		totaltime += workers[i].runtime;
> +
> +	total_spins = (u64)nthreads * nspins;
> +	avg_ns = totaltime / total_spins;
> +	avg_ns_dot = (totaltime % total_spins) * 10000 / total_spins;
> +
> +	printf("Lock-unlock latency of %u threads: %"PRIu64".%"PRIu64" ns.\n",
> +			nthreads, avg_ns, avg_ns_dot);
> +
> +	free(workers);
> +
> +	return 0;
> +}
> diff --git a/tools/perf/builtin-bench.c b/tools/perf/builtin-bench.c
> index 2c1a9f3d847a..cfe6f6dc6ed4 100644
> --- a/tools/perf/builtin-bench.c
> +++ b/tools/perf/builtin-bench.c
> @@ -52,6 +52,12 @@ static struct bench sched_benchmarks[] = {
>  	{ NULL,		NULL,						NULL			}
>  };
>  
> +static struct bench sync_benchmarks[] = {
> +	{ "qspinlock",	"Benchmark for queued spinlock",		bench_sync_qspinlock	},
> +	{ "all",	"Run all synchronization benchmarks",		NULL			},
> +	{ NULL,		NULL,						NULL			}
> +};
> +
>  static struct bench syscall_benchmarks[] = {
>  	{ "basic",	"Benchmark for basic getppid(2) calls",		bench_syscall_basic	},
>  	{ "getpgid",	"Benchmark for getpgid(2) calls",		bench_syscall_getpgid	},
> @@ -122,6 +128,7 @@ struct collection {
>  
>  static struct collection collections[] = {
>  	{ "sched",	"Scheduler and IPC benchmarks",			sched_benchmarks	},
> +	{ "sync",	"Synchronization benchmarks",			sync_benchmarks		},
>  	{ "syscall",	"System call benchmarks",			syscall_benchmarks	},
>  	{ "mem",	"Memory access benchmarks",			mem_benchmarks		},
>  #ifdef HAVE_LIBNUMA_SUPPORT
> -- 
> 2.50.1.487.gc89ff58d15-goog
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread
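
The quoted code above shows the overall shape of each worker: an atomic
start barrier, an uncounted warmup loop, a timed loop, an uncounted tail
loop, and per-thread runtime accumulation.  For readers who want to play
with that shape outside the perf tree, the following sketch reproduces it
as a self-contained program.  It is only an illustration under stated
assumptions: a plain pthread_mutex_t stands in for the imported qspinlock,
the thread count is fixed at 4, and the CPU-affinity handling is omitted.
It is not part of the series.

    /* Minimal sketch of the benchmark structure; pthread_mutex_t is a
     * stand-in for the qspinlock, purely for illustration. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define NSEC_PER_SEC 1000000000ull

    static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;
    static atomic_uint start_count;
    static unsigned long nspins = 10000;

    struct demo_worker {
    	pthread_t thd;
    	unsigned long long runtime_ns;
    };

    /* Acquire and release the stand-in lock n times. */
    static void lock_loop(unsigned long n)
    {
    	for (unsigned long i = 0; i < n; i++) {
    		pthread_mutex_lock(&demo_lock);
    		pthread_mutex_unlock(&demo_lock);
    	}
    }

    static void *workerfn(void *arg)
    {
    	struct demo_worker *w = arg;
    	struct timespec t1, t2;

    	/* Atomic start barrier: spin until every thread has arrived. */
    	if (atomic_fetch_sub(&start_count, 1) != 1)
    		while (atomic_load(&start_count))
    			;

    	lock_loop(nspins / 5);		/* warmup, not counted */
    	clock_gettime(CLOCK_THREAD_CPUTIME_ID, &t1);
    	lock_loop(nspins);		/* timed section */
    	clock_gettime(CLOCK_THREAD_CPUTIME_ID, &t2);
    	lock_loop(nspins / 5);		/* tail, keeps the timed loop contended */

    	w->runtime_ns = (t2.tv_sec - t1.tv_sec) * NSEC_PER_SEC
    			+ t2.tv_nsec - t1.tv_nsec;
    	return NULL;
    }

    int main(void)
    {
    	unsigned int nthreads = 4;	/* assumed fixed; the patch defaults to the CPU count */
    	struct demo_worker *workers = calloc(nthreads, sizeof(*workers));
    	unsigned long long total = 0;

    	if (!workers)
    		return 1;
    	atomic_store(&start_count, nthreads);
    	for (unsigned int i = 0; i < nthreads; i++)
    		pthread_create(&workers[i].thd, NULL, workerfn, &workers[i]);
    	for (unsigned int i = 0; i < nthreads; i++)
    		pthread_join(workers[i].thd, NULL);
    	for (unsigned int i = 0; i < nthreads; i++)
    		total += workers[i].runtime_ns;

    	printf("avg lock+unlock: %llu ns\n",
    	       total / ((unsigned long long)nthreads * nspins));
    	free(workers);
    	return 0;
    }

Built with "cc -O2 -pthread", this keeps the same measurement discipline
as the patch: only the middle loop is timed, while the warmup and tail
loops keep the lock contended on both sides of it.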

* Re: [PATCH v1 7/7] perf bench sync: Add latency histogram functionality
  2025-07-29  2:26 ` [PATCH v1 7/7] perf bench sync: Add latency histogram functionality Yuzhuo Jing
@ 2025-07-31  5:18   ` Namhyung Kim
  2025-07-31  5:24   ` Namhyung Kim
  1 sibling, 0 replies; 18+ messages in thread
From: Namhyung Kim @ 2025-07-31  5:18 UTC (permalink / raw)
  To: Yuzhuo Jing
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang Kan, Yuzhuo Jing, Andrea Parri,
	Palmer Dabbelt, Charlie Jenkins, Sebastian Andrzej Siewior,
	Kumar Kartikeya Dwivedi, Alexei Starovoitov, Barret Rhoden,
	Alexandre Ghiti, Guo Ren, linux-kernel, linux-perf-users

On Mon, Jul 28, 2025 at 07:26:40PM -0700, Yuzhuo Jing wrote:
> Add an option to print the histogram of lock acquire latencies (unit in
> TSCs).

Same advice as for the previous patch.

Thanks,
Namhyung

> 
> Signed-off-by: Yuzhuo Jing <yuzhuo@google.com>
> ---
>  tools/perf/bench/sync.c | 97 ++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 96 insertions(+), 1 deletion(-)
> 
> diff --git a/tools/perf/bench/sync.c b/tools/perf/bench/sync.c
> index 2685cb66584c..c85e9853c72a 100644
> --- a/tools/perf/bench/sync.c
> +++ b/tools/perf/bench/sync.c
> @@ -15,14 +15,19 @@
>  #include <sys/cdefs.h>
>  
>  #include "bench.h"
> +#include "../util/tsc.h"
>  
>  #include "include/qspinlock.h"
>  
>  #define NS 1000000000ull
>  #define CACHELINE_SIZE 64
>  
> +#define DEFAULT_HIST_INTERVAL 1000
> +
>  static unsigned int nthreads;
>  static unsigned long nspins = 10000ul;
> +static bool do_hist;
> +static u64 hist_interval = DEFAULT_HIST_INTERVAL;
>  
>  struct barrier_t;
>  
> @@ -45,6 +50,7 @@ struct worker {
>  	struct lock_ops *ops;
>  	struct barrier_t *barrier;
>  	u64 runtime;		// in nanoseconds
> +	u64 *lock_latency;	// in TSCs
>  };
>  
>  static const struct option options[] = {
> @@ -52,6 +58,10 @@ static const struct option options[] = {
>  		"Specify number of threads (default: number of CPUs)."),
>  	OPT_ULONG('n',		"spins",	&nspins,
>  		"Number of lock acquire operations per thread (default: 10,000 times)."),
> +	OPT_BOOLEAN(0,		"hist",		&do_hist,
> +		"Print a histogram of lock acquire TSCs."),
> +	OPT_U64(0,	"hist-interval",	&hist_interval,
> +		"Histogram bucket size (default 1,000 TSCs)."),
>  	OPT_END()
>  };
>  
> @@ -109,6 +119,25 @@ static void lock_loop(const struct lock_ops *ops, unsigned long n)
>  	}
>  }
>  
> +/*
> + * A busy loop to acquire and release the given lock N times, and also collect
> + * all acquire latencies, for histogram use.  Note that the TSC operations
> + * latency itself is also included.
> + */
> +static void lock_loop_timing(const struct lock_ops *ops, unsigned long n, u64 *sample_buffer)
> +{
> +	unsigned long i;
> +	u64 t1, t2;
> +
> +	for (i = 0; i < n; ++i) {
> +		t1 = rdtsc();
> +		ops->lock(ops->data);
> +		t2 = rdtsc();
> +		ops->unlock(ops->data);
> +		sample_buffer[i] = t2 - t1;
> +	}
> +}
> +
>  /*
>   * Thread worker function.  Runs lock loop for N/5 times before and after
>   * the main timed loop.
> @@ -127,7 +156,10 @@ static void *sync_workerfn(void *args)
>  	lock_loop(worker->ops, nspins / 5);
>  
>  	clock_gettime(CLOCK_THREAD_CPUTIME_ID, &starttime);
> -	lock_loop(worker->ops, nspins);
> +	if (worker->lock_latency)
> +		lock_loop_timing(worker->ops, nspins, worker->lock_latency);
> +	else
> +		lock_loop(worker->ops, nspins);
>  	clock_gettime(CLOCK_THREAD_CPUTIME_ID, &endtime);
>  
>  	/* Tail loop (not counted) to keep the above loop contended. */
> @@ -139,6 +171,57 @@ static void *sync_workerfn(void *args)
>  	return NULL;
>  }
>  
> +/*
> + * Calculate and print a histogram.
> + */
> +static void print_histogram(struct worker *workers)
> +{
> +	u64 tsc_max = 0;
> +	u64 *buckets;
> +	unsigned long nbuckets;
> +
> +	if (hist_interval == 0)
> +		hist_interval = DEFAULT_HIST_INTERVAL;
> +
> +	printf("Lock acquire histogram:\n");
> +
> +	/* Calculate the max TSC value to get the number of buckets needed. */
> +	for (unsigned int i = 0; i < nthreads; ++i) {
> +		struct worker *w = workers + i;
> +
> +		for (unsigned long j = 0; j < nspins; ++j)
> +			tsc_max = max(w->lock_latency[j], tsc_max);
> +	}
> +	nbuckets = (tsc_max + hist_interval - 1) / hist_interval;
> +
> +	/* Allocate the buckets.  The bucket representation could be optimized
> +	 * if the latency distribution is sparse.
> +	 */
> +	buckets = calloc(nbuckets, sizeof(*buckets));
> +	if (!buckets)
> +		err(EXIT_FAILURE, "calloc");
> +
> +	/* Iterate through all latencies again to fill the buckets. */
> +	for (unsigned int i = 0; i < nthreads; ++i) {
> +		struct worker *w = workers + i;
> +
> +		for (unsigned long j = 0; j < nspins; ++j) {
> +			u64 latency = w->lock_latency[j];
> +			++buckets[latency / hist_interval];
> +		}
> +	}
> +
> +	/* Print the histogram as a table. */
> +	printf("Bucket, Count\n");
> +	for (unsigned long i = 0; i < nbuckets; ++i) {
> +		if (buckets[i] == 0)
> +			continue;
> +		printf("%"PRIu64", %"PRIu64"\n", hist_interval * (i + 1), buckets[i]);
> +	}
> +
> +	free(buckets);
> +}
> +
>  /*
>   * Generic lock synchronization benchmark function.  Sets up threads and
>   * thread affinities.
> @@ -191,6 +274,12 @@ static int bench_sync_lock_generic(struct lock_ops *ops, int argc, const char **
>  		workers[i].barrier = &barrier;
>  		workers[i].ops = ops;
>  
> +		if (do_hist) {
> +			workers[i].lock_latency = calloc(nspins, sizeof(*workers[i].lock_latency));
> +			if (!workers[i].lock_latency)
> +				err(EXIT_FAILURE, "calloc");
> +		}
> +
>  		/* Set CPU affinity */
>  		pthread_attr_init(&thread_attr);
>  		CPU_ZERO_S(cpuset_size, cpuset);
> @@ -228,6 +317,12 @@ static int bench_sync_lock_generic(struct lock_ops *ops, int argc, const char **
>  	printf("Lock-unlock latency of %u threads: %"PRIu64".%"PRIu64" ns.\n",
>  			nthreads, avg_ns, avg_ns_dot);
>  
> +	/* Print histogram if requested. */
> +	if (do_hist)
> +		print_histogram(workers);
> +
> +	for (unsigned int i = 0; i < nthreads; ++i)
> +		free(workers[i].lock_latency);
>  	free(workers);
>  
>  	return 0;
> -- 
> 2.50.1.487.gc89ff58d15-goog
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread
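
As a concrete reading of the bucketing in the quoted code: with the
default interval of 1,000 TSCs, a sample of 2,500 falls into bucket index
2500 / 1000 = 2, and that bucket is printed under the label
1000 * (2 + 1) = 3000, i.e. each row is keyed by the exclusive upper bound
of its bucket.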

* Re: [PATCH v1 7/7] perf bench sync: Add latency histogram functionality
  2025-07-29  2:26 ` [PATCH v1 7/7] perf bench sync: Add latency histogram functionality Yuzhuo Jing
  2025-07-31  5:18   ` Namhyung Kim
@ 2025-07-31  5:24   ` Namhyung Kim
  1 sibling, 0 replies; 18+ messages in thread
From: Namhyung Kim @ 2025-07-31  5:24 UTC (permalink / raw)
  To: Yuzhuo Jing
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang Kan, Yuzhuo Jing, Andrea Parri,
	Palmer Dabbelt, Charlie Jenkins, Sebastian Andrzej Siewior,
	Kumar Kartikeya Dwivedi, Alexei Starovoitov, Barret Rhoden,
	Alexandre Ghiti, Guo Ren, linux-kernel, linux-perf-users

On Mon, Jul 28, 2025 at 07:26:40PM -0700, Yuzhuo Jing wrote:
> Add an option to print the histogram of lock acquire latencies (unit in
> TSCs).
> 
> Signed-off-by: Yuzhuo Jing <yuzhuo@google.com>
> ---
>  tools/perf/bench/sync.c | 97 ++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 96 insertions(+), 1 deletion(-)
> 
> diff --git a/tools/perf/bench/sync.c b/tools/perf/bench/sync.c
> index 2685cb66584c..c85e9853c72a 100644
> --- a/tools/perf/bench/sync.c
> +++ b/tools/perf/bench/sync.c
> @@ -15,14 +15,19 @@
>  #include <sys/cdefs.h>
>  
>  #include "bench.h"
> +#include "../util/tsc.h"
>  
>  #include "include/qspinlock.h"
>  
>  #define NS 1000000000ull
>  #define CACHELINE_SIZE 64
>  
> +#define DEFAULT_HIST_INTERVAL 1000
> +
>  static unsigned int nthreads;
>  static unsigned long nspins = 10000ul;
> +static bool do_hist;
> +static u64 hist_interval = DEFAULT_HIST_INTERVAL;
>  
>  struct barrier_t;
>  
> @@ -45,6 +50,7 @@ struct worker {
>  	struct lock_ops *ops;
>  	struct barrier_t *barrier;
>  	u64 runtime;		// in nanoseconds
> +	u64 *lock_latency;	// in TSCs

Why TSC?  Is it for x86 only?

Thanks,
Namhyung

>  };
>  
>  static const struct option options[] = {
> @@ -52,6 +58,10 @@ static const struct option options[] = {
>  		"Specify number of threads (default: number of CPUs)."),
>  	OPT_ULONG('n',		"spins",	&nspins,
>  		"Number of lock acquire operations per thread (default: 10,000 times)."),
> +	OPT_BOOLEAN(0,		"hist",		&do_hist,
> +		"Print a histogram of lock acquire TSCs."),
> +	OPT_U64(0,	"hist-interval",	&hist_interval,
> +		"Histogram bucket size (default 1,000 TSCs)."),
>  	OPT_END()
>  };
>  
> @@ -109,6 +119,25 @@ static void lock_loop(const struct lock_ops *ops, unsigned long n)
>  	}
>  }
>  
> +/*
> + * A busy loop to acquire and release the given lock N times, and also collect
> + * all acquire latencies, for histogram use.  Note that the TSC operations
> + * latency itself is also included.
> + */
> +static void lock_loop_timing(const struct lock_ops *ops, unsigned long n, u64 *sample_buffer)
> +{
> +	unsigned long i;
> +	u64 t1, t2;
> +
> +	for (i = 0; i < n; ++i) {
> +		t1 = rdtsc();
> +		ops->lock(ops->data);
> +		t2 = rdtsc();
> +		ops->unlock(ops->data);
> +		sample_buffer[i] = t2 - t1;
> +	}
> +}
> +
>  /*
>   * Thread worker function.  Runs lock loop for N/5 times before and after
>   * the main timed loop.
> @@ -127,7 +156,10 @@ static void *sync_workerfn(void *args)
>  	lock_loop(worker->ops, nspins / 5);
>  
>  	clock_gettime(CLOCK_THREAD_CPUTIME_ID, &starttime);
> -	lock_loop(worker->ops, nspins);
> +	if (worker->lock_latency)
> +		lock_loop_timing(worker->ops, nspins, worker->lock_latency);
> +	else
> +		lock_loop(worker->ops, nspins);
>  	clock_gettime(CLOCK_THREAD_CPUTIME_ID, &endtime);
>  
>  	/* Tail loop (not counted) to keep the above loop contended. */
> @@ -139,6 +171,57 @@ static void *sync_workerfn(void *args)
>  	return NULL;
>  }
>  
> +/*
> + * Calculate and print a histogram.
> + */
> +static void print_histogram(struct worker *workers)
> +{
> +	u64 tsc_max = 0;
> +	u64 *buckets;
> +	unsigned long nbuckets;
> +
> +	if (hist_interval == 0)
> +		hist_interval = DEFAULT_HIST_INTERVAL;
> +
> +	printf("Lock acquire histogram:\n");
> +
> +	/* Calculate the max TSC value to get the number of buckets needed. */
> +	for (unsigned int i = 0; i < nthreads; ++i) {
> +		struct worker *w = workers + i;
> +
> +		for (unsigned long j = 0; j < nspins; ++j)
> +			tsc_max = max(w->lock_latency[j], tsc_max);
> +	}
> +	nbuckets = (tsc_max + hist_interval - 1) / hist_interval;
> +
> +	/* Allocate the buckets.  The bucket representation could be optimized
> +	 * if the latency distribution is sparse.
> +	 */
> +	buckets = calloc(nbuckets, sizeof(*buckets));
> +	if (!buckets)
> +		err(EXIT_FAILURE, "calloc");
> +
> +	/* Iterate through all latencies again to fill the buckets. */
> +	for (unsigned int i = 0; i < nthreads; ++i) {
> +		struct worker *w = workers + i;
> +
> +		for (unsigned long j = 0; j < nspins; ++j) {
> +			u64 latency = w->lock_latency[j];
> +			++buckets[latency / hist_interval];
> +		}
> +	}
> +
> +	/* Print the histogram as a table. */
> +	printf("Bucket, Count\n");
> +	for (unsigned long i = 0; i < nbuckets; ++i) {
> +		if (buckets[i] == 0)
> +			continue;
> +		printf("%"PRIu64", %"PRIu64"\n", hist_interval * (i + 1), buckets[i]);
> +	}
> +
> +	free(buckets);
> +}
> +
>  /*
>   * Generic lock synchronization benchmark function.  Sets up threads and
>   * thread affinities.
> @@ -191,6 +274,12 @@ static int bench_sync_lock_generic(struct lock_ops *ops, int argc, const char **
>  		workers[i].barrier = &barrier;
>  		workers[i].ops = ops;
>  
> +		if (do_hist) {
> +			workers[i].lock_latency = calloc(nspins, sizeof(*workers[i].lock_latency));
> +			if (!workers[i].lock_latency)
> +				err(EXIT_FAILURE, "calloc");
> +		}
> +
>  		/* Set CPU affinity */
>  		pthread_attr_init(&thread_attr);
>  		CPU_ZERO_S(cpuset_size, cpuset);
> @@ -228,6 +317,12 @@ static int bench_sync_lock_generic(struct lock_ops *ops, int argc, const char **
>  	printf("Lock-unlock latency of %u threads: %"PRIu64".%"PRIu64" ns.\n",
>  			nthreads, avg_ns, avg_ns_dot);
>  
> +	/* Print histogram if requested. */
> +	if (do_hist)
> +		print_histogram(workers);
> +
> +	for (unsigned int i = 0; i < nthreads; ++i)
> +		free(workers[i].lock_latency);
>  	free(workers);
>  
>  	return 0;
> -- 
> 2.50.1.487.gc89ff58d15-goog
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread
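
Namhyung's question above points at a genuine portability limit: rdtsc()
is x86-specific, so latencies expressed in TSCs only make sense there.
One conceivable direction, sketched purely as an assumption and not
something the series does, is a timestamp helper that falls back to a
monotonic clock on other architectures:

    #include <stdint.h>
    #include <time.h>

    /* Hypothetical portable timestamp helper, only to illustrate the
     * concern above; the series itself uses rdtsc() from util/tsc.h. */
    static inline uint64_t sample_ts(void)
    {
    #if defined(__x86_64__) || defined(__i386__)
    	uint32_t lo, hi;

    	/* Raw TSC read; units are CPU reference cycles. */
    	__asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    	return ((uint64_t)hi << 32) | lo;
    #else
    	struct timespec ts;

    	/* Fall back to a monotonic clock; units become nanoseconds. */
    	clock_gettime(CLOCK_MONOTONIC, &ts);
    	return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
    #endif
    }

The unit mismatch between the two branches is exactly why the question
matters: a --hist-interval expressed in TSCs does not translate directly
to other architectures.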

* Re: [PATCH v1 6/7] perf bench: Add 'bench sync qspinlock' subcommand
  2025-07-31  5:16   ` Namhyung Kim
@ 2025-07-31 13:19     ` Yuzhuo Jing
  0 siblings, 0 replies; 18+ messages in thread
From: Yuzhuo Jing @ 2025-07-31 13:19 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang Kan, Yuzhuo Jing, Andrea Parri,
	Charlie Jenkins, Sebastian Andrzej Siewior,
	Kumar Kartikeya Dwivedi, Alexei Starovoitov, Barret Rhoden,
	Alexandre Ghiti, Guo Ren, linux-kernel, linux-perf-users

On Wed, Jul 30, 2025 at 10:16 PM Namhyung Kim <namhyung@kernel.org> wrote:
>
> On Mon, Jul 28, 2025 at 07:26:39PM -0700, Yuzhuo Jing wrote:
> > Benchmark the kernel queued spinlock implementation in user space.  Supports
> > setting the number of threads and the number of acquire/release operations.
>
> My general advice is that you'd better add an example command line and
> output in the commit message if you add any user-visible changes.  Also
> please update the Documentation/perf-bench.txt.
>
> Thanks,
> Namhyung

Hi Namhyung. Thanks for the review and for the advice! I will update
the commit messages and documentation later in a v2 patch.

Best regards,
Yuzhuo

> >
> > Signed-off-by: Yuzhuo Jing <yuzhuo@google.com>
> > ---
> >  tools/perf/bench/Build     |   2 +
> >  tools/perf/bench/bench.h   |   1 +
> >  tools/perf/bench/sync.c    | 234 +++++++++++++++++++++++++++++++++++++
> >  tools/perf/builtin-bench.c |   7 ++
> >  4 files changed, 244 insertions(+)
> >  create mode 100644 tools/perf/bench/sync.c
> >
> > diff --git a/tools/perf/bench/Build b/tools/perf/bench/Build
> > index b558ab98719f..13558279fa0e 100644
> > --- a/tools/perf/bench/Build
> > +++ b/tools/perf/bench/Build
> > @@ -19,6 +19,8 @@ perf-bench-y += evlist-open-close.o
> >  perf-bench-y += breakpoint.o
> >  perf-bench-y += pmu-scan.o
> >  perf-bench-y += uprobe.o
> > +perf-bench-y += sync.o
> > +perf-bench-y += qspinlock.o
> >
> >  perf-bench-$(CONFIG_X86_64) += mem-memcpy-x86-64-asm.o
> >  perf-bench-$(CONFIG_X86_64) += mem-memset-x86-64-asm.o
> > diff --git a/tools/perf/bench/bench.h b/tools/perf/bench/bench.h
> > index 9f736423af53..dd6c8b6126d3 100644
> > --- a/tools/perf/bench/bench.h
> > +++ b/tools/perf/bench/bench.h
> > @@ -22,6 +22,7 @@ int bench_numa(int argc, const char **argv);
> >  int bench_sched_messaging(int argc, const char **argv);
> >  int bench_sched_pipe(int argc, const char **argv);
> >  int bench_sched_seccomp_notify(int argc, const char **argv);
> > +int bench_sync_qspinlock(int argc, const char **argv);
> >  int bench_syscall_basic(int argc, const char **argv);
> >  int bench_syscall_getpgid(int argc, const char **argv);
> >  int bench_syscall_fork(int argc, const char **argv);
> > diff --git a/tools/perf/bench/sync.c b/tools/perf/bench/sync.c
> > new file mode 100644
> > index 000000000000..2685cb66584c
> > --- /dev/null
> > +++ b/tools/perf/bench/sync.c
> > @@ -0,0 +1,234 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * Synchronization benchmark.
> > + *
> > + * 2025  Yuzhuo Jing <yuzhuo@google.com>
> > + */
> > +#include <bits/time.h>
> > +#include <err.h>
> > +#include <inttypes.h>
> > +#include <perf/cpumap.h>
> > +#include <pthread.h>
> > +#include <stdbool.h>
> > +#include <string.h>
> > +#include <subcmd/parse-options.h>
> > +#include <sys/cdefs.h>
> > +
> > +#include "bench.h"
> > +
> > +#include "include/qspinlock.h"
> > +
> > +#define NS 1000000000ull
> > +#define CACHELINE_SIZE 64
> > +
> > +static unsigned int nthreads;
> > +static unsigned long nspins = 10000ul;
> > +
> > +struct barrier_t;
> > +
> > +typedef void(*lock_fn)(void *);
> > +
> > +/*
> > + * Lock operation definition to support multiple implementations of locks.
> > + *
> > + * The lock and unlock functions only take one variable, the data pointer.
> > + */
> > +struct lock_ops {
> > +     lock_fn lock;
> > +     lock_fn unlock;
> > +     void *data;
> > +};
> > +
> > +struct worker {
> > +     pthread_t thd;
> > +     unsigned int tid;
> > +     struct lock_ops *ops;
> > +     struct barrier_t *barrier;
> > +     u64 runtime;            // in nanoseconds
> > +};
> > +
> > +static const struct option options[] = {
> > +     OPT_UINTEGER('t',       "threads",      &nthreads,
> > +             "Specify number of threads (default: number of CPUs)."),
> > +     OPT_ULONG('n',          "spins",        &nspins,
> > +             "Number of lock acquire operations per thread (default: 10,000 times)."),
> > +     OPT_END()
> > +};
> > +
> > +static const char *const bench_sync_usage[] = {
> > +     "perf bench sync qspinlock <options>",
> > +     NULL
> > +};
> > +
> > +/*
> > + * An atomic-based barrier.  Expected to have lower latency than a pthread
> > + * barrier, which sleeps the thread.
> > + */
> > +struct barrier_t {
> > +     unsigned int count __aligned(CACHELINE_SIZE);
> > +};
> > +
> > +/*
> > + * Wait until all threads have reached the barrier.  Busy-waits instead of
> > + * sleeping, so it should have lower latency than a pthread barrier wait.
> > + */
> > +__always_inline void wait_barrier(struct barrier_t *b)
> > +{
> > +     if (__atomic_sub_fetch(&b->count, 1, __ATOMIC_RELAXED) == 0)
> > +             return;
> > +     while (__atomic_load_n(&b->count, __ATOMIC_RELAXED))
> > +             ;
> > +}
> > +
> > +static int bench_sync_lock_generic(struct lock_ops *ops, int argc, const char **argv);
> > +
> > +/*
> > + * Benchmark of linux kernel queued spinlock in user land.
> > + */
> > +int bench_sync_qspinlock(int argc, const char **argv)
> > +{
> > +     struct qspinlock lock = __ARCH_SPIN_LOCK_UNLOCKED;
> > +     struct lock_ops ops = {
> > +             .lock = (lock_fn)queued_spin_lock,
> > +             .unlock = (lock_fn)queued_spin_unlock,
> > +             .data = &lock,
> > +     };
> > +     return bench_sync_lock_generic(&ops, argc, argv);
> > +}
> > +
> > +/*
> > + * A busy loop to acquire and release the given lock N times.
> > + */
> > +static void lock_loop(const struct lock_ops *ops, unsigned long n)
> > +{
> > +     unsigned long i;
> > +
> > +     for (i = 0; i < n; ++i) {
> > +             ops->lock(ops->data);
> > +             ops->unlock(ops->data);
> > +     }
> > +}
> > +
> > +/*
> > + * Thread worker function.  Runs lock loop for N/5 times before and after
> > + * the main timed loop.
> > + */
> > +static void *sync_workerfn(void *args)
> > +{
> > +     struct worker *worker = (struct worker *)args;
> > +     struct timespec starttime, endtime;
> > +
> > +     set_this_cpu_id(worker->tid);
> > +
> > +     /* Barrier to let all threads start together */
> > +     wait_barrier(worker->barrier);
> > +
> > +     /* Warmup loop (not counted) to keep the below loop contended. */
> > +     lock_loop(worker->ops, nspins / 5);
> > +
> > +     clock_gettime(CLOCK_THREAD_CPUTIME_ID, &starttime);
> > +     lock_loop(worker->ops, nspins);
> > +     clock_gettime(CLOCK_THREAD_CPUTIME_ID, &endtime);
> > +
> > +     /* Tail loop (not counted) to keep the above loop contended. */
> > +     lock_loop(worker->ops, nspins / 5);
> > +
> > +     worker->runtime = (endtime.tv_sec - starttime.tv_sec) * NS
> > +             + endtime.tv_nsec - starttime.tv_nsec;
> > +
> > +     return NULL;
> > +}
> > +
> > +/*
> > + * Generic lock synchronization benchmark function.  Sets up threads and
> > + * thread affinities.
> > + */
> > +static int bench_sync_lock_generic(struct lock_ops *ops, int argc, const char **argv)
> > +{
> > +     struct perf_cpu_map *online_cpus;
> > +     unsigned int online_cpus_nr;
> > +     struct worker *workers;
> > +     u64 totaltime = 0, total_spins, avg_ns, avg_ns_dot;
> > +     struct barrier_t barrier;
> > +     cpu_set_t *cpuset;
> > +     size_t cpuset_size;
> > +
> > +     argc = parse_options(argc, argv, options, bench_sync_usage, 0);
> > +     if (argc) {
> > +             usage_with_options(bench_sync_usage, options);
> > +             exit(EXIT_FAILURE);
> > +     }
> > +
> > +     /* CPU count setup. */
> > +     online_cpus = perf_cpu_map__new_online_cpus();
> > +     if (!online_cpus)
> > +             err(EXIT_FAILURE, "No online CPUs available");
> > +     online_cpus_nr = perf_cpu_map__nr(online_cpus);
> > +
> > +     if (!nthreads) /* default to the number of CPUs */
> > +             nthreads = online_cpus_nr;
> > +
> > +     workers = calloc(nthreads, sizeof(*workers));
> > +     if (!workers)
> > +             err(EXIT_FAILURE, "calloc");
> > +
> > +     barrier.count = nthreads;
> > +
> > +     printf("Running with %u threads.\n", nthreads);
> > +
> > +     cpuset = CPU_ALLOC(online_cpus_nr);
> > +     if (!cpuset)
> > +             err(EXIT_FAILURE, "Cannot allocate cpuset.");
> > +     cpuset_size = CPU_ALLOC_SIZE(online_cpus_nr);
> > +
> > +     /* Create worker data structures, set CPU affinity, and create threads. */
> > +     for (unsigned int i = 0; i < nthreads; ++i) {
> > +             pthread_attr_t thread_attr;
> > +             int ret;
> > +
> > +             /* Basic worker thread information */
> > +             workers[i].tid = i;
> > +             workers[i].barrier = &barrier;
> > +             workers[i].ops = ops;
> > +
> > +             /* Set CPU affinity */
> > +             pthread_attr_init(&thread_attr);
> > +             CPU_ZERO_S(cpuset_size, cpuset);
> > +             CPU_SET_S(perf_cpu_map__cpu(online_cpus, i % online_cpus_nr).cpu,
> > +                     cpuset_size, cpuset);
> > +
> > +             if (pthread_attr_setaffinity_np(&thread_attr, cpuset_size, cpuset))
> > +                     err(EXIT_FAILURE, "Pthread set affinity failed");
> > +
> > +             /* Create and block thread */
> > +             ret = pthread_create(&workers[i].thd, &thread_attr, sync_workerfn, &workers[i]);
> > +             if (ret != 0)
> > +                     err(EXIT_FAILURE, "Error creating thread: %s", strerror(ret));
> > +
> > +             pthread_attr_destroy(&thread_attr);
> > +     }
> > +
> > +     CPU_FREE(cpuset);
> > +
> > +     for (unsigned int i = 0; i < nthreads; ++i) {
> > +             int ret = pthread_join(workers[i].thd, NULL);
> > +
> > +             if (ret)
> > +                     err(EXIT_FAILURE, "pthread_join");
> > +     }
> > +
> > +     /* Calculate overall average latency. */
> > +     for (unsigned int i = 0; i < nthreads; ++i)
> > +             totaltime += workers[i].runtime;
> > +
> > +     total_spins = (u64)nthreads * nspins;
> > +     avg_ns = totaltime / total_spins;
> > +     avg_ns_dot = (totaltime % total_spins) * 10000 / total_spins;
> > +
> > +     printf("Lock-unlock latency of %u threads: %"PRIu64".%"PRIu64" ns.\n",
> > +                     nthreads, avg_ns, avg_ns_dot);
> > +
> > +     free(workers);
> > +
> > +     return 0;
> > +}
> > diff --git a/tools/perf/builtin-bench.c b/tools/perf/builtin-bench.c
> > index 2c1a9f3d847a..cfe6f6dc6ed4 100644
> > --- a/tools/perf/builtin-bench.c
> > +++ b/tools/perf/builtin-bench.c
> > @@ -52,6 +52,12 @@ static struct bench sched_benchmarks[] = {
> >       { NULL,         NULL,                                           NULL                    }
> >  };
> >
> > +static struct bench sync_benchmarks[] = {
> > +     { "qspinlock",  "Benchmark for queued spinlock",                bench_sync_qspinlock    },
> > +     { "all",        "Run all synchronization benchmarks",           NULL                    },
> > +     { NULL,         NULL,                                           NULL                    }
> > +};
> > +
> >  static struct bench syscall_benchmarks[] = {
> >       { "basic",      "Benchmark for basic getppid(2) calls",         bench_syscall_basic     },
> >       { "getpgid",    "Benchmark for getpgid(2) calls",               bench_syscall_getpgid   },
> > @@ -122,6 +128,7 @@ struct collection {
> >
> >  static struct collection collections[] = {
> >       { "sched",      "Scheduler and IPC benchmarks",                 sched_benchmarks        },
> > +     { "sync",       "Synchronization benchmarks",                   sync_benchmarks         },
> >       { "syscall",    "System call benchmarks",                       syscall_benchmarks      },
> >       { "mem",        "Memory access benchmarks",                     mem_benchmarks          },
> >  #ifdef HAVE_LIBNUMA_SUPPORT
> > --
> > 2.50.1.487.gc89ff58d15-goog
> >

^ permalink raw reply	[flat|nested] 18+ messages in thread
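
For reference, the example Namhyung asks for could take roughly the
following shape in a v2 commit message.  The command and the two output
lines come from the code in this patch; the placeholder values stand in
for real measurements:

    $ perf bench sync qspinlock -t 4 -n 100000
    Running with 4 threads.
    Lock-unlock latency of 4 threads: <avg>.<frac> ns.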

* Re: [PATCH v1 0/7] perf bench: Add qspinlock benchmark
  2025-07-29  2:26 [PATCH v1 0/7] perf bench: Add qspinlock benchmark Yuzhuo Jing
                   ` (7 preceding siblings ...)
  2025-07-31  4:51 ` [PATCH v1 0/7] perf bench: Add qspinlock benchmark Namhyung Kim
@ 2025-08-04 14:28 ` Mark Rutland
  8 siblings, 0 replies; 18+ messages in thread
From: Mark Rutland @ 2025-08-04 14:28 UTC (permalink / raw)
  To: Yuzhuo Jing
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang Kan, Yuzhuo Jing, Andrea Parri,
	Palmer Dabbelt, Charlie Jenkins, Sebastian Andrzej Siewior,
	Kumar Kartikeya Dwivedi, Alexei Starovoitov, Barret Rhoden,
	Alexandre Ghiti, Guo Ren, linux-kernel, linux-perf-users

On Mon, Jul 28, 2025 at 07:26:33PM -0700, Yuzhuo Jing wrote:
> As an effort to improve the perf bench subcommand, this patch series
> adds benchmark for the kernel's queued spinlock implementation.
> 
> This series imports necessary kernel definitions such as atomics,
> introduces userspace per-cpu adapter, and imports the qspinlock
> implementation from the kernel tree to tools tree, with minimum
> adaptions.

Who is this intended to be useful for, and when would they use this?

This doesn't serve as a benchmark of the host kernel, since it tests
whatever stale copy of the qspinlock code was built into the perf
binary.

I can understand that being able to test the code in userspace may be
helpful when making some changes, but why does this need to be built
into the perf tool?

Mark.

> This subcommand enables convenient commands to investigate the
> performance of kernel lock implementations, such as using sampling:
> 
>     perf record -- ./perf bench sync qspinlock -t5
>     perf report
> 
> Yuzhuo Jing (7):
>   tools: Import cmpxchg and xchg functions
>   tools: Import smp_cond_load and atomic_cond_read
>   tools: Partial import of prefetch.h
>   tools: Implement userspace per-cpu
>   perf bench: Import qspinlock from kernel
>   perf bench: Add 'bench sync qspinlock' subcommand
>   perf bench sync: Add latency histogram functionality
> 
>  tools/arch/x86/include/asm/atomic.h           |  14 +
>  tools/arch/x86/include/asm/cmpxchg.h          | 113 +++++
>  tools/include/asm-generic/atomic-gcc.h        |  47 ++
>  tools/include/asm/barrier.h                   |  58 +++
>  tools/include/linux/atomic.h                  |  27 ++
>  tools/include/linux/compiler_types.h          |  30 ++
>  tools/include/linux/percpu-simulate.h         | 128 ++++++
>  tools/include/linux/prefetch.h                |  41 ++
>  tools/perf/bench/Build                        |   2 +
>  tools/perf/bench/bench.h                      |   1 +
>  .../perf/bench/include/mcs_spinlock-private.h | 115 +++++
>  tools/perf/bench/include/mcs_spinlock.h       |  19 +
>  tools/perf/bench/include/qspinlock-private.h  | 204 +++++++++
>  tools/perf/bench/include/qspinlock.h          | 153 +++++++
>  tools/perf/bench/include/qspinlock_types.h    |  98 +++++
>  tools/perf/bench/qspinlock.c                  | 411 ++++++++++++++++++
>  tools/perf/bench/sync.c                       | 329 ++++++++++++++
>  tools/perf/builtin-bench.c                    |   7 +
>  tools/perf/check-headers.sh                   |  32 ++
>  19 files changed, 1829 insertions(+)
>  create mode 100644 tools/include/linux/percpu-simulate.h
>  create mode 100644 tools/include/linux/prefetch.h
>  create mode 100644 tools/perf/bench/include/mcs_spinlock-private.h
>  create mode 100644 tools/perf/bench/include/mcs_spinlock.h
>  create mode 100644 tools/perf/bench/include/qspinlock-private.h
>  create mode 100644 tools/perf/bench/include/qspinlock.h
>  create mode 100644 tools/perf/bench/include/qspinlock_types.h
>  create mode 100644 tools/perf/bench/qspinlock.c
>  create mode 100644 tools/perf/bench/sync.c
> 
> -- 
> 2.50.1.487.gc89ff58d15-goog
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v1 1/7] tools: Import cmpxchg and xchg functions
  2025-07-29  2:26 ` [PATCH v1 1/7] tools: Import cmpxchg and xchg functions Yuzhuo Jing
  2025-07-31  4:52   ` Namhyung Kim
@ 2025-08-08  6:11   ` kernel test robot
  1 sibling, 0 replies; 18+ messages in thread
From: kernel test robot @ 2025-08-08  6:11 UTC (permalink / raw)
  To: Yuzhuo Jing
  Cc: oe-lkp, lkp, linux-kernel, xudong.hao, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Liang Kan, Yuzhuo Jing, Yuzhuo Jing, Andrea Parri, Palmer Dabbelt,
	Charlie Jenkins, Sebastian Andrzej Siewior,
	Kumar Kartikeya Dwivedi, Alexei Starovoitov, Barret Rhoden,
	Alexandre Ghiti, Guo Ren, linux-perf-users, oliver.sang



Hello,

kernel test robot noticed "kernel-selftests.kvm.make.fail" on:

commit: 108c296547f3e749d89c270aa6319894f014f01c ("[PATCH v1 1/7] tools: Import cmpxchg and xchg functions")
url: https://github.com/intel-lab-lkp/linux/commits/Yuzhuo-Jing/tools-Import-cmpxchg-and-xchg-functions/20250729-102940
base: https://git.kernel.org/cgit/linux/kernel/git/perf/perf-tools-next.git perf-tools-next
patch link: https://lore.kernel.org/all/20250729022640.3134066-2-yuzhuo@google.com/
patch subject: [PATCH v1 1/7] tools: Import cmpxchg and xchg functions

in testcase: kernel-selftests
version: kernel-selftests-x86_64-186f3edfdd41-1_20250803
with following parameters:

	group: kvm



config: x86_64-rhel-9.4-kselftests
compiler: gcc-12
test machine: 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480+ (Sapphire Rapids) with 256G memory

(please refer to attached dmesg/kmsg for entire log/backtrace)


If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202508080716.5744484-lkp@intel.com

KERNEL SELFTESTS: linux_headers_dir is /usr/src/linux-headers-x86_64-rhel-9.4-kselftests-108c296547f3e749d89c270aa6319894f014f01c
2025-08-05 10:05:53 sed -i s/default_timeout=45/default_timeout=300/ kselftest/runner.sh
2025-08-05 10:05:53 make -j224 TARGETS=kvm
make[1]: Entering directory '/usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-108c296547f3e749d89c270aa6319894f014f01c/tools/testing/selftests/kvm'
gcc -D_GNU_SOURCE=  -I/usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-108c296547f3e749d89c270aa6319894f014f01c/tools/testing/selftests/cgroup/lib/include -Wall -Wstrict-prototypes -Wuninitialized -O2 -g -std=gnu99 -Wno-gnu-variable-sized-type-not-at-end -MD -MP -DCONFIG_64BIT -fno-builtin-memcmp -fno-builtin-memcpy -fno-builtin-memset -fno-builtin-strnlen -fno-stack-protector -fno-PIE -fno-strict-aliasing -I/usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-108c296547f3e749d89c270aa6319894f014f01c/tools/testing/selftests/../../../tools/include -I/usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-108c296547f3e749d89c270aa6319894f014f01c/tools/testing/selftests/../../../tools/arch/x86/include -I/usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-108c296547f3e749d89c270aa6319894f014f01c/tools/testing/selftests/../../../usr/include/ -Iinclude -I. -Iinclude/x86 -I ../rseq -I..  -isystem /usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-108c296547f3e749d89c270aa6319894f014f01c/usr/include -march=x86-64-v2   -c demand_paging_test.c -o /usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-108c296547f3e749d89c270aa6319894f014f01c/tools/testing/selftests/kvm/demand_paging_test.o

...

gcc -D_GNU_SOURCE=  -I/usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-108c296547f3e749d89c270aa6319894f014f01c/tools/testing/selftests/cgroup/lib/include -Wall -Wstrict-prototypes -Wuninitialized -O2 -g -std=gnu99 -Wno-gnu-variable-sized-type-not-at-end -MD -MP -DCONFIG_64BIT -fno-builtin-memcmp -fno-builtin-memcpy -fno-builtin-memset -fno-builtin-strnlen -fno-stack-protector -fno-PIE -fno-strict-aliasing -I/usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-108c296547f3e749d89c270aa6319894f014f01c/tools/testing/selftests/../../../tools/include -I/usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-108c296547f3e749d89c270aa6319894f014f01c/tools/testing/selftests/../../../tools/arch/x86/include -I/usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-108c296547f3e749d89c270aa6319894f014f01c/tools/testing/selftests/../../../usr/include/ -Iinclude -I. -Iinclude/x86 -I ../rseq -I..  -isystem /usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-108c296547f3e749d89c270aa6319894f014f01c/usr/include -march=x86-64-v2   -c pre_fault_memory_test.c -o /usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-108c296547f3e749d89c270aa6319894f014f01c/tools/testing/selftests/kvm/pre_fault_memory_test.o
gcc -D_GNU_SOURCE=  -I/usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-108c296547f3e749d89c270aa6319894f014f01c/tools/testing/selftests/cgroup/lib/include -Wall -Wstrict-prototypes -Wuninitialized -O2 -g -std=gnu99 -Wno-gnu-variable-sized-type-not-at-end -MD -MP -DCONFIG_64BIT -fno-builtin-memcmp -fno-builtin-memcpy -fno-builtin-memset -fno-builtin-strnlen -fno-stack-protector -fno-PIE -fno-strict-aliasing -I/usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-108c296547f3e749d89c270aa6319894f014f01c/tools/testing/selftests/../../../tools/include -I/usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-108c296547f3e749d89c270aa6319894f014f01c/tools/testing/selftests/../../../tools/arch/x86/include -I/usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-108c296547f3e749d89c270aa6319894f014f01c/tools/testing/selftests/../../../usr/include/ -Iinclude -Ix86 -Iinclude/x86 -I ../rseq -I..  -isystem /usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-108c296547f3e749d89c270aa6319894f014f01c/usr/include -march=x86-64-v2   -c x86/nx_huge_pages_test.c -o /usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-108c296547f3e749d89c270aa6319894f014f01c/tools/testing/selftests/kvm/x86/nx_huge_pages_test.o
In file included from /usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-108c296547f3e749d89c270aa6319894f014f01c/tools/testing/selftests/../../../tools/include/linux/bits.h:34,
                 from /usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-108c296547f3e749d89c270aa6319894f014f01c/tools/testing/selftests/../../../tools/arch/x86/include/asm/msr-index.h:5,
                 from include/x86/processor.h:13,
                 from include/x86/apic.h:11,
                 from x86/fix_hypercall_test.c:13:
/usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-108c296547f3e749d89c270aa6319894f014f01c/tools/testing/selftests/../../../tools/include/linux/overflow.h:31: warning: "is_signed_type" redefined
   31 | #define is_signed_type(type)       (((type)(-1)) < (type)1)
      | 
In file included from include/kvm_test_harness.h:11,
                 from x86/fix_hypercall_test.c:12:
../kselftest_harness.h:754: note: this is the location of the previous definition
  754 | #define is_signed_type(var)       (!!(((__typeof__(var))(-1)) < (__typeof__(var))1))
      | 
In file included from x86/svm_nested_soft_inject_test.c:11:
/usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-108c296547f3e749d89c270aa6319894f014f01c/tools/testing/selftests/../../../tools/include/asm/../../arch/x86/include/asm/atomic.h:79:28: error: expected declaration specifiers or ‘...’ before ‘(’ token
   79 | static __always_inline int atomic_fetch_or(int i, atomic_t *v)
      |                            ^~~~~~~~~~~~~~~
/usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-108c296547f3e749d89c270aa6319894f014f01c/tools/testing/selftests/../../../tools/include/asm/../../arch/x86/include/asm/atomic.h:79:28: error: expected declaration specifiers or ‘...’ before ‘(’ token
   79 | static __always_inline int atomic_fetch_or(int i, atomic_t *v)
      |                            ^~~~~~~~~~~~~~~
/usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-108c296547f3e749d89c270aa6319894f014f01c/tools/testing/selftests/../../../tools/include/asm/../../arch/x86/include/asm/atomic.h:79:28: error: expected declaration specifiers or ‘...’ before numeric constant
   79 | static __always_inline int atomic_fetch_or(int i, atomic_t *v)
      |                            ^~~~~~~~~~~~~~~
In file included from /usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-108c296547f3e749d89c270aa6319894f014f01c/tools/testing/selftests/../../../tools/include/linux/bits.h:34,
                 from /usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-108c296547f3e749d89c270aa6319894f014f01c/tools/testing/selftests/../../../tools/include/linux/bitops.h:14,
                 from /usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-108c296547f3e749d89c270aa6319894f014f01c/tools/testing/selftests/../../../tools/include/linux/hashtable.h:13,
                 from include/kvm_util.h:11,
                 from x86/userspace_msr_exit_test.c:11:
/usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-108c296547f3e749d89c270aa6319894f014f01c/tools/testing/selftests/../../../tools/include/linux/overflow.h:31: warning: "is_signed_type" redefined
   31 | #define is_signed_type(type)       (((type)(-1)) < (type)1)
      | 
In file included from include/kvm_test_harness.h:11,
                 from x86/userspace_msr_exit_test.c:9:
../kselftest_harness.h:754: note: this is the location of the previous definition
  754 | #define is_signed_type(var)       (!!(((__typeof__(var))(-1)) < (__typeof__(var))1))
      | 
cc1: note: unrecognized command-line option ‘-Wno-gnu-variable-sized-type-not-at-end’ may have been intended to silence earlier diagnostics
make[1]: *** [Makefile.kvm:299: /usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-108c296547f3e749d89c270aa6319894f014f01c/tools/testing/selftests/kvm/x86/svm_nested_soft_inject_test.o] Error 1
make[1]: *** Waiting for unfinished jobs....
In file included from /usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-108c296547f3e749d89c270aa6319894f014f01c/tools/testing/selftests/../../../tools/include/linux/bits.h:34,
                 from /usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-108c296547f3e749d89c270aa6319894f014f01c/tools/testing/selftests/../../../tools/include/linux/bitops.h:14,
                 from /usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-108c296547f3e749d89c270aa6319894f014f01c/tools/testing/selftests/../../../tools/include/linux/hashtable.h:13,
                 from include/kvm_util.h:11,
                 from x86/sync_regs_test.c:20:
/usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-108c296547f3e749d89c270aa6319894f014f01c/tools/testing/selftests/../../../tools/include/linux/overflow.h:31: warning: "is_signed_type" redefined
   31 | #define is_signed_type(type)       (((type)(-1)) < (type)1)
      | 
In file included from include/kvm_test_harness.h:11,
                 from x86/sync_regs_test.c:18:
../kselftest_harness.h:754: note: this is the location of the previous definition
  754 | #define is_signed_type(var)       (!!(((__typeof__(var))(-1)) < (__typeof__(var))1))
      | 
In file included from include/kvm_test_harness.h:11,
                 from x86/vmx_pmu_caps_test.c:17:
../kselftest_harness.h:754: warning: "is_signed_type" redefined
  754 | #define is_signed_type(var)       (!!(((__typeof__(var))(-1)) < (__typeof__(var))1))
      | 
In file included from /usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-108c296547f3e749d89c270aa6319894f014f01c/tools/testing/selftests/../../../tools/include/linux/bits.h:34,
                 from /usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-108c296547f3e749d89c270aa6319894f014f01c/tools/testing/selftests/../../../tools/include/linux/bitops.h:14,
                 from /usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-108c296547f3e749d89c270aa6319894f014f01c/tools/testing/selftests/../../../tools/include/linux/bitmap.h:7,
                 from x86/vmx_pmu_caps_test.c:15:
/usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-108c296547f3e749d89c270aa6319894f014f01c/tools/testing/selftests/../../../tools/include/linux/overflow.h:31: note: this is the location of the previous definition
   31 | #define is_signed_type(type)       (((type)(-1)) < (type)1)
      | 
In file included from memslot_perf_test.c:12:
/usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-108c296547f3e749d89c270aa6319894f014f01c/tools/testing/selftests/../../../tools/include/asm/../../arch/x86/include/asm/atomic.h:79:28: error: expected declaration specifiers or ‘...’ before ‘(’ token
   79 | static __always_inline int atomic_fetch_or(int i, atomic_t *v)
      |                            ^~~~~~~~~~~~~~~
/usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-108c296547f3e749d89c270aa6319894f014f01c/tools/testing/selftests/../../../tools/include/asm/../../arch/x86/include/asm/atomic.h:79:28: error: expected declaration specifiers or ‘...’ before ‘(’ token
   79 | static __always_inline int atomic_fetch_or(int i, atomic_t *v)
      |                            ^~~~~~~~~~~~~~~
/usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-108c296547f3e749d89c270aa6319894f014f01c/tools/testing/selftests/../../../tools/include/asm/../../arch/x86/include/asm/atomic.h:79:28: error: expected declaration specifiers or ‘...’ before numeric constant
   79 | static __always_inline int atomic_fetch_or(int i, atomic_t *v)
      |                            ^~~~~~~~~~~~~~~
cc1: note: unrecognized command-line option ‘-Wno-gnu-variable-sized-type-not-at-end’ may have been intended to silence earlier diagnostics
make[1]: *** [Makefile.kvm:299: /usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-108c296547f3e749d89c270aa6319894f014f01c/tools/testing/selftests/kvm/memslot_perf_test.o] Error 1
cc1: note: unrecognized command-line option ‘-Wno-gnu-variable-sized-type-not-at-end’ may have been intended to silence earlier diagnostics
cc1: note: unrecognized command-line option ‘-Wno-gnu-variable-sized-type-not-at-end’ may have been intended to silence earlier diagnostics
cc1: note: unrecognized command-line option ‘-Wno-gnu-variable-sized-type-not-at-end’ may have been intended to silence earlier diagnostics
cc1: note: unrecognized command-line option ‘-Wno-gnu-variable-sized-type-not-at-end’ may have been intended to silence earlier diagnostics
make[1]: Leaving directory '/usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-108c296547f3e749d89c270aa6319894f014f01c/tools/testing/selftests/kvm'
make: *** [Makefile:207: all] Error 2



The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20250808/202508080716.5744484-lkp@intel.com



-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 18+ messages in thread
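
The three diagnostics in the robot log above (twice "before '(' token",
once "before numeric constant", all pointing at the new atomic_fetch_or()
definition) are the pattern GCC typically emits when the identifier being
defined is already visible as a function-like macro that expands its two
arguments plus a memory-order constant.  A minimal, hypothetical
reproduction, not the actual headers involved:

    /* Hypothetical collision: an earlier header defines the name as a
     * function-like macro ... */
    #define atomic_fetch_or(i, v)  __atomic_fetch_or((v), (i), __ATOMIC_SEQ_CST)

    /* ... so the line below is no longer a function definition.  It
     * expands to "static inline int __atomic_fetch_or((int *v), (int i), 5)",
     * which GCC reports as in the log: twice "expected declaration
     * specifiers or '...' before '(' token" and once "before numeric
     * constant". */
    static inline int atomic_fetch_or(int i, int *v)
    {
    	return (int)__atomic_fetch_or(v, i, __ATOMIC_SEQ_CST);
    }

Whatever the exact header interaction is here, the kernel tree usually
avoids such clashes by guarding the definition with #ifndef
atomic_fetch_or or by the "#define atomic_fetch_or atomic_fetch_or"
self-definition idiom.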

end of thread, other threads:[~2025-08-08  6:11 UTC | newest]

Thread overview: 18+ messages
2025-07-29  2:26 [PATCH v1 0/7] perf bench: Add qspinlock benchmark Yuzhuo Jing
2025-07-29  2:26 ` [PATCH v1 1/7] tools: Import cmpxchg and xchg functions Yuzhuo Jing
2025-07-31  4:52   ` Namhyung Kim
2025-08-08  6:11   ` kernel test robot
2025-07-29  2:26 ` [PATCH v1 2/7] tools: Import smp_cond_load and atomic_cond_read Yuzhuo Jing
2025-07-29  2:26 ` [PATCH v1 3/7] tools: Partial import of prefetch.h Yuzhuo Jing
2025-07-31  4:54   ` Namhyung Kim
2025-07-29  2:26 ` [PATCH v1 4/7] tools: Implement userspace per-cpu Yuzhuo Jing
2025-07-31  5:07   ` Namhyung Kim
2025-07-29  2:26 ` [PATCH v1 5/7] perf bench: Import qspinlock from kernel Yuzhuo Jing
2025-07-29  2:26 ` [PATCH v1 6/7] perf bench: Add 'bench sync qspinlock' subcommand Yuzhuo Jing
2025-07-31  5:16   ` Namhyung Kim
2025-07-31 13:19     ` Yuzhuo Jing
2025-07-29  2:26 ` [PATCH v1 7/7] perf bench sync: Add latency histogram functionality Yuzhuo Jing
2025-07-31  5:18   ` Namhyung Kim
2025-07-31  5:24   ` Namhyung Kim
2025-07-31  4:51 ` [PATCH v1 0/7] perf bench: Add qspinlock benchmark Namhyung Kim
2025-08-04 14:28 ` Mark Rutland
