* [PATCH v3 0/5] Tracing contention lock owner call stack
@ 2025-01-29 0:14 Chun-Tse Shao
2025-01-29 0:14 ` [PATCH v3 1/5] perf lock: Add bpf maps for owner stack tracing Chun-Tse Shao
` (4 more replies)
0 siblings, 5 replies; 8+ messages in thread
From: Chun-Tse Shao @ 2025-01-29 0:14 UTC (permalink / raw)
To: linux-kernel
Cc: Chun-Tse Shao, peterz, mingo, acme, namhyung, mark.rutland,
alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
nathan, ndesaulniers, morbo, justinstitt, linux-perf-users, bpf,
llvm
For perf lock contention, the current owner tracking (-o option) only
works with per-thread mode (-t option). Enabling call stack mode for the
owner can be useful for diagnosing why a system is running slow due to
lock contention.
Example output:
$ sudo ~/linux/tools/perf/perf lock con -abvo -Y mutex -E16 perf bench sched pipe
...
contended total wait max wait avg wait type caller
171 1.55 ms 20.26 us 9.06 us mutex pipe_read+0x57
0xffffffffac6318e7 pipe_read+0x57
0xffffffffac623862 vfs_read+0x332
0xffffffffac62434b ksys_read+0xbb
0xfffffffface604b2 do_syscall_64+0x82
0xffffffffad00012f entry_SYSCALL_64_after_hwframe+0x76
36 193.71 us 15.27 us 5.38 us mutex pipe_write+0x50
0xffffffffac631ee0 pipe_write+0x50
0xffffffffac6241db vfs_write+0x3bb
0xffffffffac6244ab ksys_write+0xbb
0xfffffffface604b2 do_syscall_64+0x82
0xffffffffad00012f entry_SYSCALL_64_after_hwframe+0x76
4 51.22 us 16.47 us 12.80 us mutex do_epoll_wait+0x24d
0xffffffffac691f0d do_epoll_wait+0x24d
0xffffffffac69249b do_epoll_pwait.part.0+0xb
0xffffffffac693ba5 __x64_sys_epoll_pwait+0x95
0xfffffffface604b2 do_syscall_64+0x82
0xffffffffad00012f entry_SYSCALL_64_after_hwframe+0x76
2 20.88 us 11.95 us 10.44 us mutex do_epoll_wait+0x24d
0xffffffffac691f0d do_epoll_wait+0x24d
0xffffffffac693943 __x64_sys_epoll_wait+0x73
0xfffffffface604b2 do_syscall_64+0x82
0xffffffffad00012f entry_SYSCALL_64_after_hwframe+0x76
1 7.33 us 7.33 us 7.33 us mutex do_epoll_ctl+0x6c1
0xffffffffac692e01 do_epoll_ctl+0x6c1
0xffffffffac6937e0 __x64_sys_epoll_ctl+0x70
0xfffffffface604b2 do_syscall_64+0x82
0xffffffffad00012f entry_SYSCALL_64_after_hwframe+0x76
1 6.64 us 6.64 us 6.64 us mutex do_epoll_ctl+0x3d4
0xffffffffac692b14 do_epoll_ctl+0x3d4
0xffffffffac6937e0 __x64_sys_epoll_ctl+0x70
0xfffffffface604b2 do_syscall_64+0x82
0xffffffffad00012f entry_SYSCALL_64_after_hwframe+0x76
=== owner stack trace ===
3 31.24 us 15.27 us 10.41 us mutex pipe_read+0x348
0xffffffffac631bd8 pipe_read+0x348
0xffffffffac623862 vfs_read+0x332
0xffffffffac62434b ksys_read+0xbb
0xfffffffface604b2 do_syscall_64+0x82
0xffffffffad00012f entry_SYSCALL_64_after_hwframe+0x76
...
v3: Edit based on Namhyung's review.
v2: Fix logic deficit in patch 2/4.
Chun-Tse Shao (5):
perf lock: Add bpf maps for owner stack tracing
perf lock: Retrieve owner callstack in bpf program
perf lock: Make rb_tree helper functions generic
perf lock: Report owner stack in usermode
perf lock: Update documentation for -o option in contention mode
tools/perf/builtin-lock.c | 60 +++-
tools/perf/util/bpf_lock_contention.c | 70 ++++-
.../perf/util/bpf_skel/lock_contention.bpf.c | 277 +++++++++++++++++-
tools/perf/util/bpf_skel/lock_data.h | 7 +
tools/perf/util/lock-contention.h | 7 +
5 files changed, 398 insertions(+), 23 deletions(-)
--
2.48.1.262.g85cc9f2d1e-goog
^ permalink raw reply [flat|nested] 8+ messages in thread
* [PATCH v3 1/5] perf lock: Add bpf maps for owner stack tracing
2025-01-29 0:14 [PATCH v3 0/5] Tracing contention lock owner call stack Chun-Tse Shao
@ 2025-01-29 0:14 ` Chun-Tse Shao
2025-01-29 18:01 ` Namhyung Kim
2025-01-29 0:14 ` [PATCH v3 2/5] perf lock: Retrieve owner callstack in bpf program Chun-Tse Shao
` (3 subsequent siblings)
4 siblings, 1 reply; 8+ messages in thread
From: Chun-Tse Shao @ 2025-01-29 0:14 UTC (permalink / raw)
To: linux-kernel
Cc: Chun-Tse Shao, peterz, mingo, acme, namhyung, mark.rutland,
alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
nathan, ndesaulniers, morbo, justinstitt, linux-perf-users, bpf,
llvm
Add a struct and a few bpf maps in order to trace the owner stack.
`struct owner_tracing_data`: Contains the owner's pid, stack id, a
timestamp for when the owner acquires the lock, and the count of lock
waiters.
`stack_buf`: Percpu buffer for retrieving the owner stacktrace.
`owner_stacks`: Maps an owner stacktrace to a customized owner stack id.
`owner_data`: Maps a lock address to `struct owner_tracing_data` in the
bpf program.
`owner_stat`: For reporting the owner stacktrace in usermode.
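The interning scheme behind `owner_stacks` (a full stacktrace keyed to a
small id, with a fresh id assigned on first sight) can be sketched in
plain C. This is a hypothetical userspace analogy: `MAX_STACK`,
`MAX_TRACES`, the `traces[]` array, and the linear search are
illustrative only; the real map is a BPF hash keyed by the stacktrace,
and ids are handed out by the BPF program in a later patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_STACK   8	/* illustrative; the real depth comes from con->max_stack */
#define MAX_TRACES 64

static uint64_t traces[MAX_TRACES][MAX_STACK];
static int32_t nr_traces;

/* Return a small stable id for a fixed-size stacktrace, interning it on
 * first sight. Ids start at 1, mirroring the id_gen counter in the BPF
 * code; -1 means the table is full. */
static int32_t get_owner_stack_id(const uint64_t *stacktrace)
{
	for (int32_t i = 0; i < nr_traces; i++) {
		if (!memcmp(traces[i], stacktrace, sizeof(traces[i])))
			return i + 1;
	}
	if (nr_traces >= MAX_TRACES)
		return -1;
	memcpy(traces[nr_traces], stacktrace, sizeof(traces[0]));
	return ++nr_traces;
}
```

Looking up the same stacktrace twice yields the same id, which is what
lets `owner_stat` aggregate statistics per owner call stack.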
Signed-off-by: Chun-Tse Shao <ctshao@google.com>
---
tools/perf/util/bpf_lock_contention.c | 16 +++++--
.../perf/util/bpf_skel/lock_contention.bpf.c | 42 ++++++++++++++++---
tools/perf/util/bpf_skel/lock_data.h | 7 ++++
3 files changed, 57 insertions(+), 8 deletions(-)
diff --git a/tools/perf/util/bpf_lock_contention.c b/tools/perf/util/bpf_lock_contention.c
index fc8666222399..795e2374facc 100644
--- a/tools/perf/util/bpf_lock_contention.c
+++ b/tools/perf/util/bpf_lock_contention.c
@@ -131,9 +131,19 @@ int lock_contention_prepare(struct lock_contention *con)
else
bpf_map__set_max_entries(skel->maps.task_data, 1);
- if (con->save_callstack)
- bpf_map__set_max_entries(skel->maps.stacks, con->map_nr_entries);
- else
+ if (con->save_callstack) {
+ bpf_map__set_max_entries(skel->maps.stacks,
+ con->map_nr_entries);
+ if (con->owner) {
+ bpf_map__set_value_size(skel->maps.stack_buf, con->max_stack * sizeof(u64));
+ bpf_map__set_key_size(skel->maps.owner_stacks,
+ con->max_stack * sizeof(u64));
+ bpf_map__set_max_entries(skel->maps.owner_stacks, con->map_nr_entries);
+ bpf_map__set_max_entries(skel->maps.owner_data, con->map_nr_entries);
+ bpf_map__set_max_entries(skel->maps.owner_stat, con->map_nr_entries);
+ skel->rodata->max_stack = con->max_stack;
+ }
+ } else
bpf_map__set_max_entries(skel->maps.stacks, 1);
if (target__has_cpu(target)) {
diff --git a/tools/perf/util/bpf_skel/lock_contention.bpf.c b/tools/perf/util/bpf_skel/lock_contention.bpf.c
index 6533ea9b044c..b4961dd86222 100644
--- a/tools/perf/util/bpf_skel/lock_contention.bpf.c
+++ b/tools/perf/util/bpf_skel/lock_contention.bpf.c
@@ -19,13 +19,37 @@
#define LCB_F_PERCPU (1U << 4)
#define LCB_F_MUTEX (1U << 5)
-/* callstack storage */
+ /* buffer for owner stacktrace */
struct {
- __uint(type, BPF_MAP_TYPE_STACK_TRACE);
+ __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
__uint(key_size, sizeof(__u32));
__uint(value_size, sizeof(__u64));
- __uint(max_entries, MAX_ENTRIES);
-} stacks SEC(".maps");
+ __uint(max_entries, 1);
+} stack_buf SEC(".maps");
+
+/* a map for tracing owner stacktrace to owner stack id */
+struct {
+ __uint(type, BPF_MAP_TYPE_HASH);
+ __uint(key_size, sizeof(__u64)); // owner stacktrace
+ __uint(value_size, sizeof(__u64)); // owner stack id
+ __uint(max_entries, 1);
+} owner_stacks SEC(".maps");
+
+/* a map for tracing lock address to owner data */
+struct {
+ __uint(type, BPF_MAP_TYPE_HASH);
+ __uint(key_size, sizeof(__u64)); // lock address
+ __uint(value_size, sizeof(struct owner_tracing_data));
+ __uint(max_entries, 1);
+} owner_data SEC(".maps");
+
+/* a map for contention_key (stores owner stack id) to contention data */
+struct {
+ __uint(type, BPF_MAP_TYPE_HASH);
+ __uint(key_size, sizeof(struct contention_key));
+ __uint(value_size, sizeof(struct contention_data));
+ __uint(max_entries, 1);
+} owner_stat SEC(".maps");
/* maintain timestamp at the beginning of contention */
struct {
@@ -43,6 +67,14 @@ struct {
__uint(max_entries, 1);
} tstamp_cpu SEC(".maps");
+/* callstack storage */
+struct {
+ __uint(type, BPF_MAP_TYPE_STACK_TRACE);
+ __uint(key_size, sizeof(__u32));
+ __uint(value_size, sizeof(__u64));
+ __uint(max_entries, MAX_ENTRIES);
+} stacks SEC(".maps");
+
/* actual lock contention statistics */
struct {
__uint(type, BPF_MAP_TYPE_HASH);
@@ -143,6 +175,7 @@ const volatile int needs_callstack;
const volatile int stack_skip;
const volatile int lock_owner;
const volatile int use_cgroup_v2;
+const volatile int max_stack;
/* determine the key of lock stat */
const volatile int aggr_mode;
@@ -466,7 +499,6 @@ int contention_end(u64 *ctx)
return 0;
need_delete = true;
}
-
duration = bpf_ktime_get_ns() - pelem->timestamp;
if ((__s64)duration < 0) {
__sync_fetch_and_add(&time_fail, 1);
diff --git a/tools/perf/util/bpf_skel/lock_data.h b/tools/perf/util/bpf_skel/lock_data.h
index c15f734d7fc4..15f5743bd409 100644
--- a/tools/perf/util/bpf_skel/lock_data.h
+++ b/tools/perf/util/bpf_skel/lock_data.h
@@ -3,6 +3,13 @@
#ifndef UTIL_BPF_SKEL_LOCK_DATA_H
#define UTIL_BPF_SKEL_LOCK_DATA_H
+struct owner_tracing_data {
+ u32 pid; // Who has the lock.
+ u32 count; // How many waiters for this lock.
+ u64 timestamp; // The time when the owner acquired the lock while contention was ongoing.
+ s32 stack_id; // Identifier for `owner_stat`, stored as the value in `owner_stacks`.
+};
+
struct tstamp_data {
u64 timestamp;
u64 lock;
--
2.48.1.262.g85cc9f2d1e-goog
^ permalink raw reply related [flat|nested] 8+ messages in thread
* [PATCH v3 2/5] perf lock: Retrieve owner callstack in bpf program
2025-01-29 0:14 [PATCH v3 0/5] Tracing contention lock owner call stack Chun-Tse Shao
2025-01-29 0:14 ` [PATCH v3 1/5] perf lock: Add bpf maps for owner stack tracing Chun-Tse Shao
@ 2025-01-29 0:14 ` Chun-Tse Shao
2025-01-29 18:48 ` Namhyung Kim
2025-01-29 0:14 ` [PATCH v3 3/5] perf lock: Make rb_tree helper functions generic Chun-Tse Shao
` (2 subsequent siblings)
4 siblings, 1 reply; 8+ messages in thread
From: Chun-Tse Shao @ 2025-01-29 0:14 UTC (permalink / raw)
To: linux-kernel
Cc: Chun-Tse Shao, peterz, mingo, acme, namhyung, mark.rutland,
alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
nathan, ndesaulniers, morbo, justinstitt, linux-perf-users, bpf,
llvm
Trace the owner callstack in `contention_begin()` and `contention_end()`,
and store it in the `owner_stat` bpf map.
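The per-stack-id accounting that both tracepoints perform on
`owner_stat` (accumulating the time an owner held the lock while waiters
were contending) follows a common count/total/min/max pattern. A minimal
sketch, assuming a plain struct in place of the BPF hash map and
ignoring the atomicity concerns called out in the FIXMEs below:

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the fields updated in owner_stat's contention_data values. */
struct contention_data {
	uint64_t total_time, min_time, max_time;
	uint32_t count;
};

/* Fold one owner-held duration into the running statistics. The first
 * sample initializes min/max, matching the BPF_NOEXIST insert path. */
static void account_owner_time(struct contention_data *cdata, uint64_t duration)
{
	if (cdata->count == 0) {
		cdata->min_time = duration;
		cdata->max_time = duration;
	} else {
		if (duration < cdata->min_time)
			cdata->min_time = duration;
		if (duration > cdata->max_time)
			cdata->max_time = duration;
	}
	cdata->total_time += duration;
	cdata->count++;
}
```

In the BPF program this logic appears in three places (new waiter with a
changed owner stack, contention end, and owner-stack refresh), always
keyed by the owner stack id in `contention_key`.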
Signed-off-by: Chun-Tse Shao <ctshao@google.com>
---
.../perf/util/bpf_skel/lock_contention.bpf.c | 237 +++++++++++++++++-
1 file changed, 235 insertions(+), 2 deletions(-)
diff --git a/tools/perf/util/bpf_skel/lock_contention.bpf.c b/tools/perf/util/bpf_skel/lock_contention.bpf.c
index b4961dd86222..1ad2a0793c37 100644
--- a/tools/perf/util/bpf_skel/lock_contention.bpf.c
+++ b/tools/perf/util/bpf_skel/lock_contention.bpf.c
@@ -1,5 +1,6 @@
// SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
// Copyright (c) 2022 Google
+#include "linux/bpf.h"
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
@@ -7,6 +8,7 @@
#include <asm-generic/errno-base.h>
#include "lock_data.h"
+#include <time.h>
/* for collect_lock_syms(). 4096 was rejected by the verifier */
#define MAX_CPUS 1024
@@ -31,7 +33,7 @@ struct {
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__uint(key_size, sizeof(__u64)); // owner stacktrace
- __uint(value_size, sizeof(__u64)); // owner stack id
+ __uint(value_size, sizeof(__s32)); // owner stack id
__uint(max_entries, 1);
} owner_stacks SEC(".maps");
@@ -197,6 +199,9 @@ int data_fail;
int task_map_full;
int data_map_full;
+struct task_struct *bpf_task_from_pid(s32 pid) __ksym;
+void bpf_task_release(struct task_struct *p) __ksym;
+
static inline __u64 get_current_cgroup_id(void)
{
struct task_struct *task;
@@ -420,6 +425,27 @@ static inline struct tstamp_data *get_tstamp_elem(__u32 flags)
return pelem;
}
+static inline s32 get_owner_stack_id(u64 *stacktrace)
+{
+ s32 *id;
+ static s32 id_gen = 1;
+
+ id = bpf_map_lookup_elem(&owner_stacks, stacktrace);
+ if (id)
+ return *id;
+
+ // FIXME: currently `a = __sync_fetch_and_add(...)` causes an "Invalid usage of the XADD
+ // return value" error in the BPF program: https://github.com/llvm/llvm-project/issues/91888
+ bpf_map_update_elem(&owner_stacks, stacktrace, &id_gen, BPF_NOEXIST);
+ __sync_fetch_and_add(&id_gen, 1);
+
+ id = bpf_map_lookup_elem(&owner_stacks, stacktrace);
+ if (id)
+ return *id;
+
+ return -1;
+}
+
SEC("tp_btf/contention_begin")
int contention_begin(u64 *ctx)
{
@@ -437,6 +463,91 @@ int contention_begin(u64 *ctx)
pelem->flags = (__u32)ctx[1];
if (needs_callstack) {
+ u32 i = 0;
+ u32 id = 0;
+ int owner_pid;
+ u64 *buf;
+ struct task_struct *task;
+ struct owner_tracing_data *otdata;
+
+ if (!lock_owner)
+ goto skip_owner_begin;
+
+ task = get_lock_owner(pelem->lock, pelem->flags);
+ if (!task)
+ goto skip_owner_begin;
+
+ owner_pid = BPF_CORE_READ(task, pid);
+
+ buf = bpf_map_lookup_elem(&stack_buf, &i);
+ if (!buf)
+ goto skip_owner_begin;
+ for (i = 0; i < max_stack; i++)
+ buf[i] = 0x0;
+
+ task = bpf_task_from_pid(owner_pid);
+ if (task) {
+ bpf_get_task_stack(task, buf, max_stack * sizeof(unsigned long), 0);
+ bpf_task_release(task);
+ }
+
+ otdata = bpf_map_lookup_elem(&owner_data, &pelem->lock);
+ id = get_owner_stack_id(buf);
+
+ // Contention just happened, or corner case: `lock` is owned by a process other
+ // than `owner_pid`. For the corner case we treat it as an unexpected internal
+ // error and just ignore the previous tracing record.
+ if (!otdata || otdata->pid != owner_pid) {
+ struct owner_tracing_data first = {
+ .pid = owner_pid,
+ .timestamp = pelem->timestamp,
+ .count = 1,
+ .stack_id = id,
+ };
+ bpf_map_update_elem(&owner_data, &pelem->lock, &first, BPF_ANY);
+ }
+ // Contention is ongoing and new waiter joins.
+ else {
+ __sync_fetch_and_add(&otdata->count, 1);
+
+ // The owner is the same, but stacktrace might be changed. In this case we
+ // store/update `owner_stat` based on current owner stack id.
+ if (id != otdata->stack_id) {
+ u64 duration = otdata->timestamp - pelem->timestamp;
+ struct contention_key ckey = {
+ .stack_id = id,
+ .pid = 0,
+ .lock_addr_or_cgroup = 0,
+ };
+ struct contention_data *cdata =
+ bpf_map_lookup_elem(&owner_stat, &ckey);
+
+ if (!cdata) {
+ struct contention_data first = {
+ .total_time = duration,
+ .max_time = duration,
+ .min_time = duration,
+ .count = 1,
+ .flags = pelem->flags,
+ };
+ bpf_map_update_elem(&owner_stat, &ckey, &first,
+ BPF_NOEXIST);
+ } else {
+ __sync_fetch_and_add(&cdata->total_time, duration);
+ __sync_fetch_and_add(&cdata->count, 1);
+
+ /* FIXME: need atomic operations */
+ if (cdata->max_time < duration)
+ cdata->max_time = duration;
+ if (cdata->min_time > duration)
+ cdata->min_time = duration;
+ }
+
+ otdata->timestamp = pelem->timestamp;
+ otdata->stack_id = id;
+ }
+ }
+skip_owner_begin:
pelem->stack_id = bpf_get_stackid(ctx, &stacks,
BPF_F_FAST_STACK_CMP | stack_skip);
if (pelem->stack_id < 0)
@@ -473,6 +584,7 @@ int contention_end(u64 *ctx)
struct tstamp_data *pelem;
struct contention_key key = {};
struct contention_data *data;
+ __u64 timestamp;
__u64 duration;
bool need_delete = false;
@@ -499,12 +611,133 @@ int contention_end(u64 *ctx)
return 0;
need_delete = true;
}
- duration = bpf_ktime_get_ns() - pelem->timestamp;
+ timestamp = bpf_ktime_get_ns();
+ duration = timestamp - pelem->timestamp;
if ((__s64)duration < 0) {
__sync_fetch_and_add(&time_fail, 1);
goto out;
}
+ if (needs_callstack && lock_owner) {
+ u64 owner_time;
+ struct contention_key ckey = {};
+ struct contention_data *cdata;
+ struct owner_tracing_data *otdata;
+
+ otdata = bpf_map_lookup_elem(&owner_data, &pelem->lock);
+ if (!otdata)
+ goto skip_owner_end;
+
+ // Update `owner_stat`.
+ owner_time = timestamp - otdata->timestamp;
+ ckey.stack_id = otdata->stack_id;
+ cdata = bpf_map_lookup_elem(&owner_stat, &ckey);
+
+ if (!cdata) {
+ struct contention_data first = {
+ .total_time = owner_time,
+ .max_time = owner_time,
+ .min_time = owner_time,
+ .count = 1,
+ .flags = pelem->flags,
+ };
+ bpf_map_update_elem(&owner_stat, &ckey, &first, BPF_NOEXIST);
+ } else {
+ __sync_fetch_and_add(&cdata->total_time, owner_time);
+ __sync_fetch_and_add(&cdata->count, 1);
+
+ /* FIXME: need atomic operations */
+ if (cdata->max_time < owner_time)
+ cdata->max_time = owner_time;
+ if (cdata->min_time > owner_time)
+ cdata->min_time = owner_time;
+ }
+
+ // No contention is occurring, delete `lock` entry in `owner_data`.
+ if (otdata->count <= 1)
+ bpf_map_delete_elem(&owner_data, &pelem->lock);
+ // Contention is still ongoing, with a new owner (current task). `owner_data`
+ // should be updated accordingly.
+ else {
+ u32 i = 0;
+ u64 *buf;
+
+ // FIXME: __sync_fetch_and_sub(&otdata->count, 1) causes compile error.
+ otdata->count--;
+
+ buf = bpf_map_lookup_elem(&stack_buf, &i);
+ if (!buf)
+ goto skip_owner_end;
+ for (i = 0; i < (u32)max_stack; i++)
+ buf[i] = 0x0;
+
+ // ctx[1] has the return code of the lock function.
+ // If ctx[1] is not 0, the current task terminates lock waiting without
+ // acquiring it. Owner is not changed, but we still need to update the owner
+ // stack.
+ if (!ctx[1]) {
+ s32 id = 0;
+ struct task_struct *task = bpf_task_from_pid(otdata->pid);
+
+ if (task) {
+ bpf_get_task_stack(task, buf,
+ max_stack * sizeof(unsigned long), 0);
+ bpf_task_release(task);
+ }
+
+ id = get_owner_stack_id(buf);
+
+ // If owner stack is changed, update `owner_data` and `owner_stat`
+ // accordingly.
+ if (id != otdata->stack_id) {
+ u64 duration = otdata->timestamp - pelem->timestamp;
+ struct contention_key ckey = {
+ .stack_id = id,
+ .pid = 0,
+ .lock_addr_or_cgroup = 0,
+ };
+ struct contention_data *cdata =
+ bpf_map_lookup_elem(&owner_stat, &ckey);
+
+ if (!cdata) {
+ struct contention_data first = {
+ .total_time = duration,
+ .max_time = duration,
+ .min_time = duration,
+ .count = 1,
+ .flags = pelem->flags,
+ };
+ bpf_map_update_elem(&owner_stat, &ckey, &first,
+ BPF_NOEXIST);
+ } else {
+ __sync_fetch_and_add(&cdata->total_time, duration);
+ __sync_fetch_and_add(&cdata->count, 1);
+
+ /* FIXME: need atomic operations */
+ if (cdata->max_time < duration)
+ cdata->max_time = duration;
+ if (cdata->min_time > duration)
+ cdata->min_time = duration;
+ }
+
+ otdata->timestamp = pelem->timestamp;
+ otdata->stack_id = id;
+ }
+ }
+ // If ctx[1] is 0, then update tracing data with the current task, which is
+ // the new owner.
+ else {
+ otdata->pid = pid;
+ otdata->timestamp = timestamp;
+
+ bpf_get_task_stack(bpf_get_current_task_btf(), buf,
+ max_stack * sizeof(unsigned long), 0);
+ otdata->stack_id = get_owner_stack_id(buf);
+ }
+ }
+ }
+skip_owner_end:
+
switch (aggr_mode) {
case LOCK_AGGR_CALLER:
key.stack_id = pelem->stack_id;
--
2.48.1.262.g85cc9f2d1e-goog
^ permalink raw reply related [flat|nested] 8+ messages in thread
* [PATCH v3 3/5] perf lock: Make rb_tree helper functions generic
2025-01-29 0:14 [PATCH v3 0/5] Tracing contention lock owner call stack Chun-Tse Shao
2025-01-29 0:14 ` [PATCH v3 1/5] perf lock: Add bpf maps for owner stack tracing Chun-Tse Shao
2025-01-29 0:14 ` [PATCH v3 2/5] perf lock: Retrieve owner callstack in bpf program Chun-Tse Shao
@ 2025-01-29 0:14 ` Chun-Tse Shao
2025-01-29 0:15 ` [PATCH v3 4/5] perf lock: Report owner stack in usermode Chun-Tse Shao
2025-01-29 0:15 ` [PATCH v3 5/5] perf lock: Update documentation for -o option in contention mode Chun-Tse Shao
4 siblings, 0 replies; 8+ messages in thread
From: Chun-Tse Shao @ 2025-01-29 0:14 UTC (permalink / raw)
To: linux-kernel
Cc: Chun-Tse Shao, peterz, mingo, acme, namhyung, mark.rutland,
alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
nathan, ndesaulniers, morbo, justinstitt, linux-perf-users, bpf,
llvm
The rb_tree helper functions can be reused for parsing `owner_lock_stat`
into an rb tree for sorting.
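The pattern this patch extracts — an insert routine parameterized by the
tree root and a comparator, plus a pop of the leftmost node — can be
sketched with an unbalanced binary search tree standing in for the
kernel rbtree (the node layout and the `bigger_key()` callback here are
illustrative, not the perf structs):

```c
#include <assert.h>
#include <stddef.h>

struct node {
	struct node *left, *right;
	long key;
};

/* Generic insert: walk to a leaf slot, going left when the comparator
 * ranks the new node first, like the genericized insert_to(). */
static void insert_to(struct node **root, struct node *n,
		      int (*bigger)(struct node *, struct node *))
{
	while (*root)
		root = bigger(n, *root) ? &(*root)->left : &(*root)->right;
	*root = n;
}

/* Erase and return the leftmost (first-ranked) node, like pop_from(). */
static struct node *pop_from(struct node **root)
{
	struct node *n;

	while (*root && (*root)->left)
		root = &(*root)->left;
	n = *root;
	if (n)
		*root = n->right;
	return n;
}

/* Example comparator: rank larger keys first. */
static int bigger_key(struct node *a, struct node *b)
{
	return a->key > b->key;
}
```

Popping repeatedly yields the nodes in comparator order, which is
exactly how the sorted contention report is emitted; the real code keeps
the rbtree's rebalancing, so only the tree shape differs.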
Signed-off-by: Chun-Tse Shao <ctshao@google.com>
---
tools/perf/builtin-lock.c | 34 +++++++++++++++++++++++-----------
1 file changed, 23 insertions(+), 11 deletions(-)
diff --git a/tools/perf/builtin-lock.c b/tools/perf/builtin-lock.c
index 5d405cd8e696..9bebc186286f 100644
--- a/tools/perf/builtin-lock.c
+++ b/tools/perf/builtin-lock.c
@@ -418,16 +418,13 @@ static void combine_lock_stats(struct lock_stat *st)
rb_insert_color(&st->rb, &sorted);
}
-static void insert_to_result(struct lock_stat *st,
- int (*bigger)(struct lock_stat *, struct lock_stat *))
+static void insert_to(struct rb_root *rr, struct lock_stat *st,
+ int (*bigger)(struct lock_stat *, struct lock_stat *))
{
- struct rb_node **rb = &result.rb_node;
+ struct rb_node **rb = &rr->rb_node;
struct rb_node *parent = NULL;
struct lock_stat *p;
- if (combine_locks && st->combined)
- return;
-
while (*rb) {
p = container_of(*rb, struct lock_stat, rb);
parent = *rb;
@@ -439,13 +436,21 @@ static void insert_to_result(struct lock_stat *st,
}
rb_link_node(&st->rb, parent, rb);
- rb_insert_color(&st->rb, &result);
+ rb_insert_color(&st->rb, rr);
}
-/* returns left most element of result, and erase it */
-static struct lock_stat *pop_from_result(void)
+static inline void insert_to_result(struct lock_stat *st,
+ int (*bigger)(struct lock_stat *,
+ struct lock_stat *))
+{
+ if (combine_locks && st->combined)
+ return;
+ insert_to(&result, st, bigger);
+}
+
+static inline struct lock_stat *pop_from(struct rb_root *rr)
{
- struct rb_node *node = result.rb_node;
+ struct rb_node *node = rr->rb_node;
if (!node)
return NULL;
@@ -453,8 +458,15 @@ static struct lock_stat *pop_from_result(void)
while (node->rb_left)
node = node->rb_left;
- rb_erase(node, &result);
+ rb_erase(node, rr);
return container_of(node, struct lock_stat, rb);
+
+}
+
+/* returns left most element of result, and erase it */
+static struct lock_stat *pop_from_result(void)
+{
+ return pop_from(&result);
}
struct trace_lock_handler {
--
2.48.1.262.g85cc9f2d1e-goog
^ permalink raw reply related [flat|nested] 8+ messages in thread
* [PATCH v3 4/5] perf lock: Report owner stack in usermode
2025-01-29 0:14 [PATCH v3 0/5] Tracing contention lock owner call stack Chun-Tse Shao
` (2 preceding siblings ...)
2025-01-29 0:14 ` [PATCH v3 3/5] perf lock: Make rb_tree helper functions generic Chun-Tse Shao
@ 2025-01-29 0:15 ` Chun-Tse Shao
2025-01-29 0:15 ` [PATCH v3 5/5] perf lock: Update documentation for -o option in contention mode Chun-Tse Shao
4 siblings, 0 replies; 8+ messages in thread
From: Chun-Tse Shao @ 2025-01-29 0:15 UTC (permalink / raw)
To: linux-kernel
Cc: Chun-Tse Shao, peterz, mingo, acme, namhyung, mark.rutland,
alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
nathan, ndesaulniers, morbo, justinstitt, linux-perf-users, bpf,
llvm
Parse `owner_lock_stat` into an rb tree, and report owner lock stats
with stack traces in order.
Example output:
$ sudo ~/linux/tools/perf/perf lock con -abvo -Y mutex-spin -E3 perf bench sched pipe
...
contended total wait max wait avg wait type caller
171 1.55 ms 20.26 us 9.06 us mutex pipe_read+0x57
0xffffffffac6318e7 pipe_read+0x57
0xffffffffac623862 vfs_read+0x332
0xffffffffac62434b ksys_read+0xbb
0xfffffffface604b2 do_syscall_64+0x82
0xffffffffad00012f entry_SYSCALL_64_after_hwframe+0x76
36 193.71 us 15.27 us 5.38 us mutex pipe_write+0x50
0xffffffffac631ee0 pipe_write+0x50
0xffffffffac6241db vfs_write+0x3bb
0xffffffffac6244ab ksys_write+0xbb
0xfffffffface604b2 do_syscall_64+0x82
0xffffffffad00012f entry_SYSCALL_64_after_hwframe+0x76
4 51.22 us 16.47 us 12.80 us mutex do_epoll_wait+0x24d
0xffffffffac691f0d do_epoll_wait+0x24d
0xffffffffac69249b do_epoll_pwait.part.0+0xb
0xffffffffac693ba5 __x64_sys_epoll_pwait+0x95
0xfffffffface604b2 do_syscall_64+0x82
0xffffffffad00012f entry_SYSCALL_64_after_hwframe+0x76
=== owner stack trace ===
3 31.24 us 15.27 us 10.41 us mutex pipe_read+0x348
0xffffffffac631bd8 pipe_read+0x348
0xffffffffac623862 vfs_read+0x332
0xffffffffac62434b ksys_read+0xbb
0xfffffffface604b2 do_syscall_64+0x82
0xffffffffad00012f entry_SYSCALL_64_after_hwframe+0x76
...
Signed-off-by: Chun-Tse Shao <ctshao@google.com>
---
tools/perf/builtin-lock.c | 20 ++++++++--
tools/perf/util/bpf_lock_contention.c | 54 +++++++++++++++++++++++++++
tools/perf/util/lock-contention.h | 7 ++++
3 files changed, 78 insertions(+), 3 deletions(-)
diff --git a/tools/perf/builtin-lock.c b/tools/perf/builtin-lock.c
index 9bebc186286f..d9b0d7472aea 100644
--- a/tools/perf/builtin-lock.c
+++ b/tools/perf/builtin-lock.c
@@ -42,6 +42,7 @@
#include <linux/zalloc.h>
#include <linux/err.h>
#include <linux/stringify.h>
+#include <linux/rbtree.h>
static struct perf_session *session;
static struct target target;
@@ -1817,6 +1818,22 @@ static void print_contention_result(struct lock_contention *con)
break;
}
+ if (con->owner && con->save_callstack) {
+ struct rb_root root = RB_ROOT;
+
+ if (symbol_conf.field_sep)
+ fprintf(lock_output, "# owner stack trace:\n");
+ else
+ fprintf(lock_output, "\n=== owner stack trace ===\n\n");
+ while ((st = pop_owner_stack_trace(con)))
+ insert_to(&root, st, compare);
+
+ while ((st = pop_from(&root))) {
+ print_lock_stat(con, st);
+ zfree(&st);
+ }
+ }
+
if (print_nr_entries) {
/* update the total/bad stats */
while ((st = pop_from_result())) {
@@ -1962,9 +1979,6 @@ static int check_lock_contention_options(const struct option *options,
}
}
- if (show_lock_owner)
- show_thread_stats = true;
-
return 0;
}
diff --git a/tools/perf/util/bpf_lock_contention.c b/tools/perf/util/bpf_lock_contention.c
index 795e2374facc..cf3267e46589 100644
--- a/tools/perf/util/bpf_lock_contention.c
+++ b/tools/perf/util/bpf_lock_contention.c
@@ -549,6 +549,60 @@ static const char *lock_contention_get_name(struct lock_contention *con,
return name_buf;
}
+struct lock_stat *pop_owner_stack_trace(struct lock_contention *con)
+{
+ int stacks_fd, stat_fd;
+ u64 *stack_trace;
+ s32 stack_id;
+ struct contention_key ckey = {};
+ struct contention_data cdata = {};
+ size_t stack_size = con->max_stack * sizeof(*stack_trace);
+ struct lock_stat *st;
+ char name[KSYM_NAME_LEN];
+
+ stacks_fd = bpf_map__fd(skel->maps.owner_stacks);
+ stat_fd = bpf_map__fd(skel->maps.owner_stat);
+ if (!stacks_fd || !stat_fd)
+ return NULL;
+
+ stack_trace = zalloc(stack_size);
+ if (stack_trace == NULL)
+ return NULL;
+
+ if (bpf_map_get_next_key(stacks_fd, NULL, stack_trace))
+ return NULL;
+
+ bpf_map_lookup_elem(stacks_fd, stack_trace, &stack_id);
+ ckey.stack_id = stack_id;
+ bpf_map_lookup_elem(stat_fd, &ckey, &cdata);
+
+ st = zalloc(sizeof(struct lock_stat));
+ if (!st)
+ return NULL;
+
+ strcpy(name,
+ stack_trace[0] ? lock_contention_get_name(con, NULL, stack_trace, 0) : "unknown");
+
+ st->name = strdup(name);
+ if (!st->name)
+ return NULL;
+
+ st->flags = cdata.flags;
+ st->nr_contended = cdata.count;
+ st->wait_time_total = cdata.total_time;
+ st->wait_time_max = cdata.max_time;
+ st->wait_time_min = cdata.min_time;
+ st->callstack = stack_trace;
+
+ if (cdata.count)
+ st->avg_wait_time = cdata.total_time / cdata.count;
+
+ bpf_map_delete_elem(stacks_fd, stack_trace);
+ bpf_map_delete_elem(stat_fd, &ckey);
+
+ return st;
+}
+
int lock_contention_read(struct lock_contention *con)
{
int fd, stack, err = 0;
diff --git a/tools/perf/util/lock-contention.h b/tools/perf/util/lock-contention.h
index a09f7fe877df..97fd33c57f17 100644
--- a/tools/perf/util/lock-contention.h
+++ b/tools/perf/util/lock-contention.h
@@ -168,6 +168,8 @@ int lock_contention_stop(void);
int lock_contention_read(struct lock_contention *con);
int lock_contention_finish(struct lock_contention *con);
+struct lock_stat *pop_owner_stack_trace(struct lock_contention *con);
+
#else /* !HAVE_BPF_SKEL */
static inline int lock_contention_prepare(struct lock_contention *con __maybe_unused)
@@ -187,6 +189,11 @@ static inline int lock_contention_read(struct lock_contention *con __maybe_unuse
return 0;
}
+static inline struct lock_stat *pop_owner_stack_trace(struct lock_contention *con __maybe_unused)
+{
+ return NULL;
+}
+
#endif /* HAVE_BPF_SKEL */
#endif /* PERF_LOCK_CONTENTION_H */
--
2.48.1.262.g85cc9f2d1e-goog
^ permalink raw reply related [flat|nested] 8+ messages in thread
* [PATCH v3 5/5] perf lock: Update documentation for -o option in contention mode
2025-01-29 0:14 [PATCH v3 0/5] Tracing contention lock owner call stack Chun-Tse Shao
` (3 preceding siblings ...)
2025-01-29 0:15 ` [PATCH v3 4/5] perf lock: Report owner stack in usermode Chun-Tse Shao
@ 2025-01-29 0:15 ` Chun-Tse Shao
4 siblings, 0 replies; 8+ messages in thread
From: Chun-Tse Shao @ 2025-01-29 0:15 UTC (permalink / raw)
To: linux-kernel
Cc: Chun-Tse Shao, peterz, mingo, acme, namhyung, mark.rutland,
alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
nathan, ndesaulniers, morbo, justinstitt, linux-perf-users, bpf,
llvm
This patch also decouples -o from -t, and shows a warning to notify
users of the new behavior for -ov.
Signed-off-by: Chun-Tse Shao <ctshao@google.com>
---
tools/perf/builtin-lock.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/tools/perf/builtin-lock.c b/tools/perf/builtin-lock.c
index d9b0d7472aea..b925be06b0d8 100644
--- a/tools/perf/builtin-lock.c
+++ b/tools/perf/builtin-lock.c
@@ -1818,7 +1818,7 @@ static void print_contention_result(struct lock_contention *con)
break;
}
- if (con->owner && con->save_callstack) {
+ if (con->owner && con->save_callstack && verbose > 0) {
struct rb_root root = RB_ROOT;
if (symbol_conf.field_sep)
@@ -1979,6 +1979,11 @@ static int check_lock_contention_options(const struct option *options,
}
}
+ if (show_lock_owner && !show_thread_stats) {
+ pr_warning("Now -o try to show owner's callstack instead of pid and comm.\n");
+ pr_warning("Please use -t option too to keep the old behavior.\n");
+ }
+
return 0;
}
@@ -2570,7 +2575,8 @@ int cmd_lock(int argc, const char **argv)
"Filter specific address/symbol of locks", parse_lock_addr),
OPT_CALLBACK('S', "callstack-filter", NULL, "NAMES",
"Filter specific function in the callstack", parse_call_stack),
- OPT_BOOLEAN('o', "lock-owner", &show_lock_owner, "show lock owners instead of waiters"),
+ OPT_BOOLEAN('o', "lock-owner", &show_lock_owner, "show lock owners instead of waiters.\n"
+ "\t\t\tThis option can be combined with -t, which shows owner's per thread lock stats, or -v, which shows owner's stacktrace"),
OPT_STRING_NOEMPTY('x', "field-separator", &symbol_conf.field_sep, "separator",
"print result in CSV format with custom separator"),
OPT_BOOLEAN(0, "lock-cgroup", &show_lock_cgroups, "show lock stats by cgroup"),
--
2.48.1.262.g85cc9f2d1e-goog
^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: [PATCH v3 1/5] perf lock: Add bpf maps for owner stack tracing
2025-01-29 0:14 ` [PATCH v3 1/5] perf lock: Add bpf maps for owner stack tracing Chun-Tse Shao
@ 2025-01-29 18:01 ` Namhyung Kim
0 siblings, 0 replies; 8+ messages in thread
From: Namhyung Kim @ 2025-01-29 18:01 UTC (permalink / raw)
To: Chun-Tse Shao
Cc: linux-kernel, peterz, mingo, acme, mark.rutland,
alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
nathan, ndesaulniers, morbo, justinstitt, linux-perf-users, bpf,
llvm
On Tue, Jan 28, 2025 at 04:14:57PM -0800, Chun-Tse Shao wrote:
> Add a struct and few bpf maps in order to tracing owner stack.
> `struct owner_tracing_data`: Contains owner's pid, stack id, timestamp for
> when the owner acquires lock, and the count of lock waiters.
> `stack_buf`: Percpu buffer for retrieving owner stacktrace.
> `owner_stacks`: For tracing owner stacktrace to customized owner stack id.
> `owner_data`: For tracing lock_address to `struct owner_tracing_data` in
> bpf program.
> `owner_stat`: For reporting owner stacktrace in usermode.
>
> Signed-off-by: Chun-Tse Shao <ctshao@google.com>
> ---
> tools/perf/util/bpf_lock_contention.c | 16 +++++--
> .../perf/util/bpf_skel/lock_contention.bpf.c | 42 ++++++++++++++++---
> tools/perf/util/bpf_skel/lock_data.h | 7 ++++
> 3 files changed, 57 insertions(+), 8 deletions(-)
>
> diff --git a/tools/perf/util/bpf_lock_contention.c b/tools/perf/util/bpf_lock_contention.c
> index fc8666222399..795e2374facc 100644
> --- a/tools/perf/util/bpf_lock_contention.c
> +++ b/tools/perf/util/bpf_lock_contention.c
> @@ -131,9 +131,19 @@ int lock_contention_prepare(struct lock_contention *con)
> else
> bpf_map__set_max_entries(skel->maps.task_data, 1);
>
> - if (con->save_callstack)
> - bpf_map__set_max_entries(skel->maps.stacks, con->map_nr_entries);
> - else
> + if (con->save_callstack) {
> + bpf_map__set_max_entries(skel->maps.stacks,
> + con->map_nr_entries);
It can be on the same line as it used to be.
> + if (con->owner) {
> + bpf_map__set_value_size(skel->maps.stack_buf, con->max_stack * sizeof(u64));
> + bpf_map__set_key_size(skel->maps.owner_stacks,
> + con->max_stack * sizeof(u64));
> + bpf_map__set_max_entries(skel->maps.owner_stacks, con->map_nr_entries);
> + bpf_map__set_max_entries(skel->maps.owner_data, con->map_nr_entries);
> + bpf_map__set_max_entries(skel->maps.owner_stat, con->map_nr_entries);
> + skel->rodata->max_stack = con->max_stack;
> + }
> + } else
Please add braces even for a single-line statement when the matching if block
has braces.
> bpf_map__set_max_entries(skel->maps.stacks, 1);
>
> if (target__has_cpu(target)) {
> diff --git a/tools/perf/util/bpf_skel/lock_contention.bpf.c b/tools/perf/util/bpf_skel/lock_contention.bpf.c
> index 6533ea9b044c..b4961dd86222 100644
> --- a/tools/perf/util/bpf_skel/lock_contention.bpf.c
> +++ b/tools/perf/util/bpf_skel/lock_contention.bpf.c
> @@ -19,13 +19,37 @@
> #define LCB_F_PERCPU (1U << 4)
> #define LCB_F_MUTEX (1U << 5)
>
> -/* callstack storage */
> + /* buffer for owner stacktrace */
> struct {
> - __uint(type, BPF_MAP_TYPE_STACK_TRACE);
> + __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
> __uint(key_size, sizeof(__u32));
> __uint(value_size, sizeof(__u64));
> - __uint(max_entries, MAX_ENTRIES);
> -} stacks SEC(".maps");
> + __uint(max_entries, 1);
> +} stack_buf SEC(".maps");
> +
> +/* a map for tracing owner stacktrace to owner stack id */
> +struct {
> + __uint(type, BPF_MAP_TYPE_HASH);
> + __uint(key_size, sizeof(__u64)); // owner stacktrace
> + __uint(value_size, sizeof(__u64)); // owner stack id
> + __uint(max_entries, 1);
> +} owner_stacks SEC(".maps");
> +
> +/* a map for tracing lock address to owner data */
> +struct {
> + __uint(type, BPF_MAP_TYPE_HASH);
> + __uint(key_size, sizeof(__u64)); // lock address
> + __uint(value_size, sizeof(struct owner_tracing_data));
> + __uint(max_entries, 1);
> +} owner_data SEC(".maps");
> +
> +/* a map for contention_key (stores owner stack id) to contention data */
> +struct {
> + __uint(type, BPF_MAP_TYPE_HASH);
> + __uint(key_size, sizeof(struct contention_key));
> + __uint(value_size, sizeof(struct contention_data));
> + __uint(max_entries, 1);
> +} owner_stat SEC(".maps");
>
> /* maintain timestamp at the beginning of contention */
> struct {
> @@ -43,6 +67,14 @@ struct {
> __uint(max_entries, 1);
> } tstamp_cpu SEC(".maps");
>
> +/* callstack storage */
> +struct {
> + __uint(type, BPF_MAP_TYPE_STACK_TRACE);
> + __uint(key_size, sizeof(__u32));
> + __uint(value_size, sizeof(__u64));
> + __uint(max_entries, MAX_ENTRIES);
> +} stacks SEC(".maps");
Can you please keep this at the top of the maps so that it minimizes the
diff? I think it's better to move the new owner-related stuff to the
bottom.
> +
> /* actual lock contention statistics */
> struct {
> __uint(type, BPF_MAP_TYPE_HASH);
> @@ -143,6 +175,7 @@ const volatile int needs_callstack;
> const volatile int stack_skip;
> const volatile int lock_owner;
> const volatile int use_cgroup_v2;
> +const volatile int max_stack;
>
> /* determine the key of lock stat */
> const volatile int aggr_mode;
> @@ -466,7 +499,6 @@ int contention_end(u64 *ctx)
> return 0;
> need_delete = true;
> }
> -
Not needed.
Thanks,
Namhyung
> duration = bpf_ktime_get_ns() - pelem->timestamp;
> if ((__s64)duration < 0) {
> __sync_fetch_and_add(&time_fail, 1);
> diff --git a/tools/perf/util/bpf_skel/lock_data.h b/tools/perf/util/bpf_skel/lock_data.h
> index c15f734d7fc4..15f5743bd409 100644
> --- a/tools/perf/util/bpf_skel/lock_data.h
> +++ b/tools/perf/util/bpf_skel/lock_data.h
> @@ -3,6 +3,13 @@
> #ifndef UTIL_BPF_SKEL_LOCK_DATA_H
> #define UTIL_BPF_SKEL_LOCK_DATA_H
>
> +struct owner_tracing_data {
> + u32 pid; // Who has the lock.
> + u32 count; // How many waiters for this lock.
> + u64 timestamp; // The time while the owner acquires lock and contention is going on.
> + s32 stack_id; // Identifier for `owner_stat`, which stores as value in `owner_stacks`
> +};
> +
> struct tstamp_data {
> u64 timestamp;
> u64 lock;
> --
> 2.48.1.262.g85cc9f2d1e-goog
>
* Re: [PATCH v3 2/5] perf lock: Retrieve owner callstack in bpf program
2025-01-29 0:14 ` [PATCH v3 2/5] perf lock: Retrieve owner callstack in bpf program Chun-Tse Shao
@ 2025-01-29 18:48 ` Namhyung Kim
0 siblings, 0 replies; 8+ messages in thread
From: Namhyung Kim @ 2025-01-29 18:48 UTC (permalink / raw)
To: Chun-Tse Shao
Cc: linux-kernel, peterz, mingo, acme, mark.rutland,
alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
nathan, ndesaulniers, morbo, justinstitt, linux-perf-users, bpf,
llvm
On Tue, Jan 28, 2025 at 04:14:58PM -0800, Chun-Tse Shao wrote:
> Tracing owner callstack in `contention_begin()` and `contention_end()`,
> and storing in `owner_stat` bpf map.
Can you please elaborate? It'd be helpful to describe how the lock
owner is tracked and how multiple waiters are handled. Note that the owner
will change when contention_end() is called (without an error).
And please mention that it can fail to get an owner and its callstack.
Actually that's very common and only mutex (and rwsem for writing?)
would support owner tracking.
>
> Signed-off-by: Chun-Tse Shao <ctshao@google.com>
> ---
> .../perf/util/bpf_skel/lock_contention.bpf.c | 237 +++++++++++++++++-
> 1 file changed, 235 insertions(+), 2 deletions(-)
>
> diff --git a/tools/perf/util/bpf_skel/lock_contention.bpf.c b/tools/perf/util/bpf_skel/lock_contention.bpf.c
> index b4961dd86222..1ad2a0793c37 100644
> --- a/tools/perf/util/bpf_skel/lock_contention.bpf.c
> +++ b/tools/perf/util/bpf_skel/lock_contention.bpf.c
> @@ -1,5 +1,6 @@
> // SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
> // Copyright (c) 2022 Google
> +#include "linux/bpf.h"
Why did you add this?
> #include "vmlinux.h"
> #include <bpf/bpf_helpers.h>
> #include <bpf/bpf_tracing.h>
> @@ -7,6 +8,7 @@
> #include <asm-generic/errno-base.h>
>
> #include "lock_data.h"
> +#include <time.h>
And this too. I remember that adding a header sometimes caused trouble in
the build, so I'm trying to be careful about this. Roughly, I don't think
you need new structs or functions, so I'm wondering why it was added.
>
> /* for collect_lock_syms(). 4096 was rejected by the verifier */
> #define MAX_CPUS 1024
> @@ -31,7 +33,7 @@ struct {
> struct {
> __uint(type, BPF_MAP_TYPE_HASH);
> __uint(key_size, sizeof(__u64)); // owner stacktrace
> - __uint(value_size, sizeof(__u64)); // owner stack id
> + __uint(value_size, sizeof(__s32)); // owner stack id
> __uint(max_entries, 1);
> } owner_stacks SEC(".maps");
>
> @@ -197,6 +199,9 @@ int data_fail;
> int task_map_full;
> int data_map_full;
>
> +struct task_struct *bpf_task_from_pid(s32 pid) __ksym;
> +void bpf_task_release(struct task_struct *p) __ksym;
These should be declared __weak, and you need to check whether they are
NULL before use to support ancient kernels. If they are NULL, owner
callstack tracking should be rejected. Or you can check it in userspace
using the vmlinux BTF and set lock_owner to false if it's not available.
Please take a look at check_slab_cache_iter() for example.
> +
> static inline __u64 get_current_cgroup_id(void)
> {
> struct task_struct *task;
> @@ -420,6 +425,27 @@ static inline struct tstamp_data *get_tstamp_elem(__u32 flags)
> return pelem;
> }
>
> +static inline s32 get_owner_stack_id(u64 *stacktrace)
> +{
> + s32 *id;
> + static s32 id_gen = 1;
> +
> + id = bpf_map_lookup_elem(&owner_stacks, stacktrace);
> + if (id)
> + return *id;
> +
> + // FIXME: currently `a = __sync_fetch_and_add(...)` cause "Invalid usage of the XADD return
> + // value" error in BPF program: https://github.com/llvm/llvm-project/issues/91888
> + bpf_map_update_elem(&owner_stacks, stacktrace, &id_gen, BPF_NOEXIST);
> + __sync_fetch_and_add(&id_gen, 1);
I'm afraid it doesn't guarantee a unique stack id.
I believe BPF has atomic instructions that can do ADD and FETCH. I'm
not sure if it's a kernel or compiler version issue. Have you tried
compare-and-exchange?
> +
> + id = bpf_map_lookup_elem(&owner_stacks, stacktrace);
> + if (id)
> + return *id;
> +
> + return -1;
> +}
> +
> SEC("tp_btf/contention_begin")
> int contention_begin(u64 *ctx)
> {
> @@ -437,6 +463,91 @@ int contention_begin(u64 *ctx)
> pelem->flags = (__u32)ctx[1];
>
> if (needs_callstack) {
> + u32 i = 0;
> + u32 id = 0;
> + int owner_pid;
> + u64 *buf;
> + struct task_struct *task;
> + struct owner_tracing_data *otdata;
> +
> + if (!lock_owner)
> + goto skip_owner_begin;
> +
> + task = get_lock_owner(pelem->lock, pelem->flags);
> + if (!task)
> + goto skip_owner_begin;
> +
> + owner_pid = BPF_CORE_READ(task, pid);
> +
> + buf = bpf_map_lookup_elem(&stack_buf, &i);
> + if (!buf)
> + goto skip_owner_begin;
> + for (i = 0; i < max_stack; i++)
> + buf[i] = 0x0;
> +
> + task = bpf_task_from_pid(owner_pid);
> + if (task) {
> + bpf_get_task_stack(task, buf, max_stack * sizeof(unsigned long), 0);
> + bpf_task_release(task);
> + }
> +
> + otdata = bpf_map_lookup_elem(&owner_data, &pelem->lock);
> + id = get_owner_stack_id(buf);
> +
> + // Contention just happens, or corner case `lock` is owned by process not
> + // `owner_pid`. For the corner case we treat it as unexpected internal error and
> + // just ignore the precvious tracing record.
Typo precvious
It's unfortunate the comment style is mixed. I'm not sure if we have a
strict rule for BPF code but at least we need to be consistent in a file.
Probably // is ok for a short comment at the end of line. But please
use C-style block comments for multi-line messages.
> + if (!otdata || otdata->pid != owner_pid) {
> + struct owner_tracing_data first = {
> + .pid = owner_pid,
> + .timestamp = pelem->timestamp,
> + .count = 1,
> + .stack_id = id,
> + };
> + bpf_map_update_elem(&owner_data, &pelem->lock, &first, BPF_ANY);
> + }
> + // Contention is ongoing and new waiter joins.
> + else {
> + __sync_fetch_and_add(&otdata->count, 1);
> +
> + // The owner is the same, but stacktrace might be changed. In this case we
> + // store/update `owner_stat` based on current owner stack id.
> + if (id != otdata->stack_id) {
> + u64 duration = otdata->timestamp - pelem->timestamp;
> + struct contention_key ckey = {
> + .stack_id = id,
> + .pid = 0,
> + .lock_addr_or_cgroup = 0,
> + };
> + struct contention_data *cdata =
> + bpf_map_lookup_elem(&owner_stat, &ckey);
> +
> + if (!cdata) {
> + struct contention_data first = {
> + .total_time = duration,
> + .max_time = duration,
> + .min_time = duration,
> + .count = 1,
> + .flags = pelem->flags,
> + };
> + bpf_map_update_elem(&owner_stat, &ckey, &first,
> + BPF_NOEXIST);
> + } else {
> + __sync_fetch_and_add(&cdata->total_time, duration);
> + __sync_fetch_and_add(&cdata->count, 1);
> +
> + /* FIXME: need atomic operations */
> + if (cdata->max_time < duration)
> + cdata->max_time = duration;
> + if (cdata->min_time > duration)
> + cdata->min_time = duration;
> + }
This code block is repeated at least 3 times. Can you factor it out
into a function?
> +
> + otdata->timestamp = pelem->timestamp;
> + otdata->stack_id = id;
> + }
> + }
> +skip_owner_begin:
> pelem->stack_id = bpf_get_stackid(ctx, &stacks,
> BPF_F_FAST_STACK_CMP | stack_skip);
> if (pelem->stack_id < 0)
> @@ -473,6 +584,7 @@ int contention_end(u64 *ctx)
> struct tstamp_data *pelem;
> struct contention_key key = {};
> struct contention_data *data;
> + __u64 timestamp;
> __u64 duration;
> bool need_delete = false;
>
> @@ -499,12 +611,133 @@ int contention_end(u64 *ctx)
> return 0;
> need_delete = true;
> }
> - duration = bpf_ktime_get_ns() - pelem->timestamp;
> + timestamp = bpf_ktime_get_ns();
> + duration = timestamp - pelem->timestamp;
> if ((__s64)duration < 0) {
> __sync_fetch_and_add(&time_fail, 1);
> goto out;
> }
>
> + if (needs_callstack && lock_owner) {
> + u64 owner_time;
> + struct contention_key ckey = {};
> + struct contention_data *cdata;
> + struct owner_tracing_data *otdata;
> +
> + otdata = bpf_map_lookup_elem(&owner_data, &pelem->lock);
> + if (!otdata)
> + goto skip_owner_end;
> +
> + // Update `owner_stat`.
> + owner_time = timestamp - otdata->timestamp;
> + ckey.stack_id = otdata->stack_id;
> + cdata = bpf_map_lookup_elem(&owner_stat, &ckey);
> +
> + if (!cdata) {
> + struct contention_data first = {
> + .total_time = owner_time,
> + .max_time = owner_time,
> + .min_time = owner_time,
> + .count = 1,
> + .flags = pelem->flags,
> + };
> + bpf_map_update_elem(&owner_stat, &ckey, &first, BPF_NOEXIST);
> + } else {
> + __sync_fetch_and_add(&cdata->total_time, owner_time);
> + __sync_fetch_and_add(&cdata->count, 1);
> +
> + /* FIXME: need atomic operations */
> + if (cdata->max_time < owner_time)
> + cdata->max_time = owner_time;
> + if (cdata->min_time > owner_time)
> + cdata->min_time = owner_time;
> + }
> +
> + // No contention is occurring, delete `lock` entry in `owner_data`.
> + if (otdata->count <= 1)
> + bpf_map_delete_elem(&owner_data, &pelem->lock);
> + // Contention is still ongoing, with a new owner (current task). `owner_data`
> + // should be updated accordingly.
> + else {
> + u32 i = 0;
> + u64 *buf;
> +
> + // FIXME: __sync_fetch_and_sub(&otdata->count, 1) causes compile error.
> + otdata->count--;
What about __sync_fetch_and_add(..., -1)? It doesn't seem that the BPF
atomic operations include SUB.
https://docs.kernel.org/bpf/standardization/instruction-set.html#atomic-operations
> +
> + buf = bpf_map_lookup_elem(&stack_buf, &i);
> + if (!buf)
> + goto skip_owner_end;
> + for (i = 0; i < (u32)max_stack; i++)
> + buf[i] = 0x0;
> +
> + // ctx[1] has the return code of the lock function.
Why not add 's32 ret = (s32)ctx[1]'?
Thanks,
Namhyung
> + // If ctx[1] is not 0, the current task terminates lock waiting without
> + // acquiring it. Owner is not changed, but we still need to update the owner
> + // stack.
> + if (!ctx[1]) {
> + s32 id = 0;
> + struct task_struct *task = bpf_task_from_pid(otdata->pid);
> +
> + if (task) {
> + bpf_get_task_stack(task, buf,
> + max_stack * sizeof(unsigned long), 0);
> + bpf_task_release(task);
> + }
> +
> + id = get_owner_stack_id(buf);
> +
> + // If owner stack is changed, update `owner_data` and `owner_stat`
> + // accordingly.
> + if (id != otdata->stack_id) {
> + u64 duration = otdata->timestamp - pelem->timestamp;
> + struct contention_key ckey = {
> + .stack_id = id,
> + .pid = 0,
> + .lock_addr_or_cgroup = 0,
> + };
> + struct contention_data *cdata =
> + bpf_map_lookup_elem(&owner_stat, &ckey);
> +
> + if (!cdata) {
> + struct contention_data first = {
> + .total_time = duration,
> + .max_time = duration,
> + .min_time = duration,
> + .count = 1,
> + .flags = pelem->flags,
> + };
> + bpf_map_update_elem(&owner_stat, &ckey, &first,
> + BPF_NOEXIST);
> + } else {
> + __sync_fetch_and_add(&cdata->total_time, duration);
> + __sync_fetch_and_add(&cdata->count, 1);
> +
> + /* FIXME: need atomic operations */
> + if (cdata->max_time < duration)
> + cdata->max_time = duration;
> + if (cdata->min_time > duration)
> + cdata->min_time = duration;
> + }
> +
> + otdata->timestamp = pelem->timestamp;
> + otdata->stack_id = id;
> + }
> + }
> + // If ctx[1] is 0, then update tracinng data with the current task, which is
> + // the new owner.
> + else {
> + otdata->pid = pid;
> + otdata->timestamp = timestamp;
> +
> + bpf_get_task_stack(bpf_get_current_task_btf(), buf,
> + max_stack * sizeof(unsigned long), 0);
> + otdata->stack_id = get_owner_stack_id(buf);
> + }
> + }
> + }
> +skip_owner_end:
> +
> switch (aggr_mode) {
> case LOCK_AGGR_CALLER:
> key.stack_id = pelem->stack_id;
> --
> 2.48.1.262.g85cc9f2d1e-goog
>
end of thread, other threads:[~2025-01-29 18:48 UTC | newest]
Thread overview: 8+ messages
2025-01-29 0:14 [PATCH v3 0/5] Tracing contention lock owner call stack Chun-Tse Shao
2025-01-29 0:14 ` [PATCH v3 1/5] perf lock: Add bpf maps for owner stack tracing Chun-Tse Shao
2025-01-29 18:01 ` Namhyung Kim
2025-01-29 0:14 ` [PATCH v3 2/5] perf lock: Retrieve owner callstack in bpf program Chun-Tse Shao
2025-01-29 18:48 ` Namhyung Kim
2025-01-29 0:14 ` [PATCH v3 3/5] perf lock: Make rb_tree helper functions generic Chun-Tse Shao
2025-01-29 0:15 ` [PATCH v3 4/5] perf lock: Report owner stack in usermode Chun-Tse Shao
2025-01-29 0:15 ` [PATCH v3 5/5] perf lock: Update documentation for -o option in contention mode Chun-Tse Shao