[PATCH v4 0/5] Tracing contention lock owner call stack

linux-perf-users.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v4 0/5] Tracing contention lock owner call stack
@ 2025-01-30  5:21 Chun-Tse Shao
  2025-01-30  5:21 ` [PATCH v4 1/5] perf lock: Add bpf maps for owner stack tracing Chun-Tse Shao
                   ` (5 more replies)
  0 siblings, 6 replies; 10+ messages in thread
From: Chun-Tse Shao @ 2025-01-30  5:21 UTC (permalink / raw)
  To: linux-kernel
  Cc: Chun-Tse Shao, peterz, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	linux-perf-users, bpf

For perf lock contention, the current owner tracking (-o option) only
works with per-thread mode (-t option). Enabling call stack mode for
owner can be useful for diagnosing why a system running slow in
lock contention.

Example output:
  $ sudo ~/linux/tools/perf/perf lock con -abvo -Y mutex -E16 perf bench sched pipe
   ...
   contended   total wait     max wait     avg wait         type   caller

         171      1.55 ms     20.26 us      9.06 us        mutex   pipe_read+0x57
                          0xffffffffac6318e7  pipe_read+0x57
                          0xffffffffac623862  vfs_read+0x332
                          0xffffffffac62434b  ksys_read+0xbb
                          0xfffffffface604b2  do_syscall_64+0x82
                          0xffffffffad00012f  entry_SYSCALL_64_after_hwframe+0x76
          36    193.71 us     15.27 us      5.38 us        mutex   pipe_write+0x50
                          0xffffffffac631ee0  pipe_write+0x50
                          0xffffffffac6241db  vfs_write+0x3bb
                          0xffffffffac6244ab  ksys_write+0xbb
                          0xfffffffface604b2  do_syscall_64+0x82
                          0xffffffffad00012f  entry_SYSCALL_64_after_hwframe+0x76
           4     51.22 us     16.47 us     12.80 us        mutex   do_epoll_wait+0x24d
                          0xffffffffac691f0d  do_epoll_wait+0x24d
                          0xffffffffac69249b  do_epoll_pwait.part.0+0xb
                          0xffffffffac693ba5  __x64_sys_epoll_pwait+0x95
                          0xfffffffface604b2  do_syscall_64+0x82
                          0xffffffffad00012f  entry_SYSCALL_64_after_hwframe+0x76
           2     20.88 us     11.95 us     10.44 us        mutex   do_epoll_wait+0x24d
                          0xffffffffac691f0d  do_epoll_wait+0x24d
                          0xffffffffac693943  __x64_sys_epoll_wait+0x73
                          0xfffffffface604b2  do_syscall_64+0x82
                          0xffffffffad00012f  entry_SYSCALL_64_after_hwframe+0x76
           1      7.33 us      7.33 us      7.33 us        mutex   do_epoll_ctl+0x6c1
                          0xffffffffac692e01  do_epoll_ctl+0x6c1
                          0xffffffffac6937e0  __x64_sys_epoll_ctl+0x70
                          0xfffffffface604b2  do_syscall_64+0x82
                          0xffffffffad00012f  entry_SYSCALL_64_after_hwframe+0x76
           1      6.64 us      6.64 us      6.64 us        mutex   do_epoll_ctl+0x3d4
                          0xffffffffac692b14  do_epoll_ctl+0x3d4
                          0xffffffffac6937e0  __x64_sys_epoll_ctl+0x70
                          0xfffffffface604b2  do_syscall_64+0x82
                          0xffffffffad00012f  entry_SYSCALL_64_after_hwframe+0x76

  === owner stack trace ===

           3     31.24 us     15.27 us     10.41 us        mutex   pipe_read+0x348
                          0xffffffffac631bd8  pipe_read+0x348
                          0xffffffffac623862  vfs_read+0x332
                          0xffffffffac62434b  ksys_read+0xbb
                          0xfffffffface604b2  do_syscall_64+0x82
                          0xffffffffad00012f  entry_SYSCALL_64_after_hwframe+0x76
  ...

v4: Edit based on Namhyung's review.

v3: Edit based on Namhyung's review.

v2: Fix logic deficit in patch 2/4.

Chun-Tse Shao (5):
  perf lock: Add bpf maps for owner stack tracing
  perf lock: Retrieve owner callstack in bpf program
  perf lock: Make rb_tree helper functions generic
  perf lock: Report owner stack in usermode
  perf lock: Update documentation for -o option in contention mode

 tools/perf/builtin-lock.c                     |  59 +++-
 tools/perf/util/bpf_lock_contention.c         |  68 ++++-
 .../perf/util/bpf_skel/lock_contention.bpf.c  | 281 +++++++++++++++++-
 tools/perf/util/bpf_skel/lock_data.h          |   7 +
 tools/perf/util/lock-contention.h             |   7 +
 5 files changed, 405 insertions(+), 17 deletions(-)

--
2.48.1.362.g079036d154-goog


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH v4 1/5] perf lock: Add bpf maps for owner stack tracing
  2025-01-30  5:21 [PATCH v4 0/5] Tracing contention lock owner call stack Chun-Tse Shao
@ 2025-01-30  5:21 ` Chun-Tse Shao
  2025-01-30  5:21 ` [PATCH v4 2/5] perf lock: Retrieve owner callstack in bpf program Chun-Tse Shao
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 10+ messages in thread
From: Chun-Tse Shao @ 2025-01-30  5:21 UTC (permalink / raw)
  To: linux-kernel
  Cc: Chun-Tse Shao, peterz, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	linux-perf-users, bpf

Add a struct and few bpf maps in order to tracing owner stack.
`struct owner_tracing_data`: Contains owner's pid, stack id, timestamp for
  when the owner acquires lock, and the count of lock waiters.
`stack_buf`: Percpu buffer for retrieving owner stacktrace.
`owner_stacks`: For tracing owner stacktrace to customized owner stack id.
`owner_data`: For tracing lock_address to `struct owner_tracing_data` in
  bpf program.
`owner_stat`: For reporting owner stacktrace in usermode.

Signed-off-by: Chun-Tse Shao <ctshao@google.com>
---
 tools/perf/util/bpf_lock_contention.c         | 14 ++++++--
 .../perf/util/bpf_skel/lock_contention.bpf.c  | 33 +++++++++++++++++++
 tools/perf/util/bpf_skel/lock_data.h          |  7 ++++
 3 files changed, 52 insertions(+), 2 deletions(-)

diff --git a/tools/perf/util/bpf_lock_contention.c b/tools/perf/util/bpf_lock_contention.c
index fc8666222399..76542b86e83f 100644
--- a/tools/perf/util/bpf_lock_contention.c
+++ b/tools/perf/util/bpf_lock_contention.c
@@ -131,10 +131,20 @@ int lock_contention_prepare(struct lock_contention *con)
 	else
 		bpf_map__set_max_entries(skel->maps.task_data, 1);
 
-	if (con->save_callstack)
+	if (con->save_callstack) {
 		bpf_map__set_max_entries(skel->maps.stacks, con->map_nr_entries);
-	else
+		if (con->owner) {
+			bpf_map__set_value_size(skel->maps.stack_buf, con->max_stack * sizeof(u64));
+			bpf_map__set_key_size(skel->maps.owner_stacks,
+						con->max_stack * sizeof(u64));
+			bpf_map__set_max_entries(skel->maps.owner_stacks, con->map_nr_entries);
+			bpf_map__set_max_entries(skel->maps.owner_data, con->map_nr_entries);
+			bpf_map__set_max_entries(skel->maps.owner_stat, con->map_nr_entries);
+			skel->rodata->max_stack = con->max_stack;
+		}
+	} else {
 		bpf_map__set_max_entries(skel->maps.stacks, 1);
+	}
 
 	if (target__has_cpu(target)) {
 		skel->rodata->has_cpu = 1;
diff --git a/tools/perf/util/bpf_skel/lock_contention.bpf.c b/tools/perf/util/bpf_skel/lock_contention.bpf.c
index 6533ea9b044c..23fe9cc980ae 100644
--- a/tools/perf/util/bpf_skel/lock_contention.bpf.c
+++ b/tools/perf/util/bpf_skel/lock_contention.bpf.c
@@ -27,6 +27,38 @@ struct {
 	__uint(max_entries, MAX_ENTRIES);
 } stacks SEC(".maps");
 
+/* buffer for owner stacktrace */
+struct {
+	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, sizeof(__u64));
+	__uint(max_entries, 1);
+} stack_buf SEC(".maps");
+
+/* a map for tracing owner stacktrace to owner stack id */
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(key_size, sizeof(__u64)); // owner stacktrace
+	__uint(value_size, sizeof(__s32)); // owner stack id
+	__uint(max_entries, 1);
+} owner_stacks SEC(".maps");
+
+/* a map for tracing lock address to owner data */
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(key_size, sizeof(__u64)); // lock address
+	__uint(value_size, sizeof(struct owner_tracing_data));
+	__uint(max_entries, 1);
+} owner_data SEC(".maps");
+
+/* a map for contention_key (stores owner stack id) to contention data */
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(key_size, sizeof(struct contention_key));
+	__uint(value_size, sizeof(struct contention_data));
+	__uint(max_entries, 1);
+} owner_stat SEC(".maps");
+
 /* maintain timestamp at the beginning of contention */
 struct {
 	__uint(type, BPF_MAP_TYPE_HASH);
@@ -143,6 +175,7 @@ const volatile int needs_callstack;
 const volatile int stack_skip;
 const volatile int lock_owner;
 const volatile int use_cgroup_v2;
+const volatile int max_stack;
 
 /* determine the key of lock stat */
 const volatile int aggr_mode;
diff --git a/tools/perf/util/bpf_skel/lock_data.h b/tools/perf/util/bpf_skel/lock_data.h
index c15f734d7fc4..15f5743bd409 100644
--- a/tools/perf/util/bpf_skel/lock_data.h
+++ b/tools/perf/util/bpf_skel/lock_data.h
@@ -3,6 +3,13 @@
 #ifndef UTIL_BPF_SKEL_LOCK_DATA_H
 #define UTIL_BPF_SKEL_LOCK_DATA_H
 
+struct owner_tracing_data {
+	u32 pid; // Who has the lock.
+	u32 count; // How many waiters for this lock.
+	u64 timestamp; // The time while the owner acquires lock and contention is going on.
+	s32 stack_id; // Identifier for `owner_stat`, which stores as value in `owner_stacks`
+};
+
 struct tstamp_data {
 	u64 timestamp;
 	u64 lock;
-- 
2.48.1.362.g079036d154-goog


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH v4 2/5] perf lock: Retrieve owner callstack in bpf program
  2025-01-30  5:21 [PATCH v4 0/5] Tracing contention lock owner call stack Chun-Tse Shao
  2025-01-30  5:21 ` [PATCH v4 1/5] perf lock: Add bpf maps for owner stack tracing Chun-Tse Shao
@ 2025-01-30  5:21 ` Chun-Tse Shao
  2025-01-31 23:48   ` Namhyung Kim
  2025-01-30  5:21 ` [PATCH v4 3/5] perf lock: Make rb_tree helper functions generic Chun-Tse Shao
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 10+ messages in thread
From: Chun-Tse Shao @ 2025-01-30  5:21 UTC (permalink / raw)
  To: linux-kernel
  Cc: Chun-Tse Shao, peterz, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	linux-perf-users, bpf

Tracks lock contention by tracing owner callstacks in
`contention_begin()` and `contention_end()`, storing data in the
owner_stat BPF map. `contention_begin()` records the owner and their
callstack. `contention_end()` updates contention statistics (count,
time), decrements the waiter count, and removes the record when no
waiters remain. Statistics are also updated if the owner's callstack
changes. Note that owner and its callstack retrieval may fail.

To elaborate the process in detail:
  /*
   * In `contention_begin(), the current task is the lock waiter`. We
   * create/update `owner_data` for the given `lock` address.
  contention_begin() {
    Try to get owner task. Skip entire process if fails.
    Try to get owner stack based on task. Use empty stack if fails.
    Store owner stack into `owner_stacks` and create `stack_id`. If fail
      to store, use negative `stack_id`, which will be ignored while
      reporting in usermode.
    Retrieve `owner_tracing_data` in `owner_data` with given `lock`
      address.

    /*
     * The first case means contention just happens, or mismatched owner
     * infomation so we just drop the previous record.
     */
    if (`owner_tracing_data` does not exist ||
        the recorded owner `pid` does not match with the detected owner) {
      Create `owner_tracing_data` with info from detected owner, and
        store it in `owner_data` with key `lock` address.
    }
    /*
     * The second case means contention is on going. One more waiter is
     * joining the lock contention. Both `owner_data` and `owner_stat`
     * should be updated.
     */
    else {
      `owner_tracing_data.count`++

      Create `contention_key` with owner `stack_id` and lookup
        `contention_data` in `owner_stat`.
      if (`contention_data` does not exist) {
        Create new entry for `contention_key`:`contention_data` in
          `owner_stat`.
      }
      else {
        Update the `count` and `total_time` in existing
        `contention_data`.
      }

      Update `timestamp` and `stack_id` in `owner_tracing_data`.
    }
  }

  /*
   * In `contention_end()`, the current task will be the new owner of
   * the `lock`, if `ctx[1]` is not 0.
   */
  contention_end() {
    Lookup `owner_tracing_data` in `owner_data` with key `lock`.

    Create `contention_key` with `owner_tracing_data.stack_id` and
      lookup `contention_data` in `owner_stat`.
    if (`contention_data` does not exist) {
      Create new entry for `contention_key`:`contention_data` in
        `owner_stat`.
    }
    else {
      Update the `count` and `total_time` in existing `contention_data`.
    }

    /*
     * There is no more waiters, contention is over, delete the record.
     */
    if (`owner_tracing_data.count` <= 1) {
      delete this record in `owner_data`.
    }
    /*
     * Contention is still on going.
     */
    else {
      `owner_tracing_data.count`--

      if (`!ctx[0]`) {
        The current task exits without acquiring the lock. However we
          check for the recorded owner if the stack is changed, and
          update `onwer_data` and `owner_stat` accordingly.
      }
      else {
        The current task is the new owner, retrieve its stack, store it
          in `owner_stack` and update `owner_tracing_data`.
      }
    }
  }

Signed-off-by: Chun-Tse Shao <ctshao@google.com>
---
 .../perf/util/bpf_skel/lock_contention.bpf.c  | 248 +++++++++++++++++-
 1 file changed, 247 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/bpf_skel/lock_contention.bpf.c b/tools/perf/util/bpf_skel/lock_contention.bpf.c
index 23fe9cc980ae..b12df873379f 100644
--- a/tools/perf/util/bpf_skel/lock_contention.bpf.c
+++ b/tools/perf/util/bpf_skel/lock_contention.bpf.c
@@ -197,6 +197,9 @@ int data_fail;
 int task_map_full;
 int data_map_full;
 
+struct task_struct *bpf_task_from_pid(s32 pid) __ksym __weak;
+void bpf_task_release(struct task_struct *p) __ksym __weak;
+
 static inline __u64 get_current_cgroup_id(void)
 {
 	struct task_struct *task;
@@ -420,6 +423,26 @@ static inline struct tstamp_data *get_tstamp_elem(__u32 flags)
 	return pelem;
 }
 
+static inline s32 get_owner_stack_id(u64 *stacktrace)
+{
+	s32 *id, new_id;
+	static s64 id_gen = 1;
+
+	id = bpf_map_lookup_elem(&owner_stacks, stacktrace);
+	if (id)
+		return *id;
+
+	new_id = (s32)__sync_fetch_and_add(&id_gen, 1);
+
+	bpf_map_update_elem(&owner_stacks, stacktrace, &new_id, BPF_NOEXIST);
+
+	id = bpf_map_lookup_elem(&owner_stacks, stacktrace);
+	if (id)
+		return *id;
+
+	return -1;
+}
+
 SEC("tp_btf/contention_begin")
 int contention_begin(u64 *ctx)
 {
@@ -437,6 +460,97 @@ int contention_begin(u64 *ctx)
 	pelem->flags = (__u32)ctx[1];
 
 	if (needs_callstack) {
+		u32 i = 0;
+		u32 id = 0;
+		int owner_pid;
+		u64 *buf;
+		struct task_struct *task;
+		struct owner_tracing_data *otdata;
+
+		if (!lock_owner)
+			goto skip_owner_begin;
+
+		task = get_lock_owner(pelem->lock, pelem->flags);
+		if (!task)
+			goto skip_owner_begin;
+
+		owner_pid = BPF_CORE_READ(task, pid);
+
+		buf = bpf_map_lookup_elem(&stack_buf, &i);
+		if (!buf)
+			goto skip_owner_begin;
+		for (i = 0; i < max_stack; i++)
+			buf[i] = 0x0;
+
+		if (bpf_task_from_pid) {
+			task = bpf_task_from_pid(owner_pid);
+			if (task) {
+				bpf_get_task_stack(task, buf, max_stack * sizeof(unsigned long), 0);
+				bpf_task_release(task);
+			}
+		}
+
+		otdata = bpf_map_lookup_elem(&owner_data, &pelem->lock);
+		id = get_owner_stack_id(buf);
+
+		/*
+		 * Contention just happens, or corner case `lock` is owned by process not
+		 * `owner_pid`. For the corner case we treat it as unexpected internal error and
+		 * just ignore the precvious tracing record.
+		 */
+		if (!otdata || otdata->pid != owner_pid) {
+			struct owner_tracing_data first = {
+				.pid = owner_pid,
+				.timestamp = pelem->timestamp,
+				.count = 1,
+				.stack_id = id,
+			};
+			bpf_map_update_elem(&owner_data, &pelem->lock, &first, BPF_ANY);
+		}
+		/* Contention is ongoing and new waiter joins */
+		else {
+			__sync_fetch_and_add(&otdata->count, 1);
+
+			/*
+			 * The owner is the same, but stacktrace might be changed. In this case we
+			 * store/update `owner_stat` based on current owner stack id.
+			 */
+			if (id != otdata->stack_id) {
+				u64 duration = otdata->timestamp - pelem->timestamp;
+				struct contention_key ckey = {
+					.stack_id = id,
+					.pid = 0,
+					.lock_addr_or_cgroup = 0,
+				};
+				struct contention_data *cdata =
+					bpf_map_lookup_elem(&owner_stat, &ckey);
+
+				if (!cdata) {
+					struct contention_data first = {
+						.total_time = duration,
+						.max_time = duration,
+						.min_time = duration,
+						.count = 1,
+						.flags = pelem->flags,
+					};
+					bpf_map_update_elem(&owner_stat, &ckey, &first,
+							    BPF_NOEXIST);
+				} else {
+					__sync_fetch_and_add(&cdata->total_time, duration);
+					__sync_fetch_and_add(&cdata->count, 1);
+
+					/* FIXME: need atomic operations */
+					if (cdata->max_time < duration)
+						cdata->max_time = duration;
+					if (cdata->min_time > duration)
+						cdata->min_time = duration;
+				}
+
+				otdata->timestamp = pelem->timestamp;
+				otdata->stack_id = id;
+			}
+		}
+skip_owner_begin:
 		pelem->stack_id = bpf_get_stackid(ctx, &stacks,
 						  BPF_F_FAST_STACK_CMP | stack_skip);
 		if (pelem->stack_id < 0)
@@ -473,6 +587,7 @@ int contention_end(u64 *ctx)
 	struct tstamp_data *pelem;
 	struct contention_key key = {};
 	struct contention_data *data;
+	__u64 timestamp;
 	__u64 duration;
 	bool need_delete = false;
 
@@ -500,12 +615,143 @@ int contention_end(u64 *ctx)
 		need_delete = true;
 	}
 
-	duration = bpf_ktime_get_ns() - pelem->timestamp;
+	timestamp = bpf_ktime_get_ns();
+	duration = timestamp - pelem->timestamp;
 	if ((__s64)duration < 0) {
 		__sync_fetch_and_add(&time_fail, 1);
 		goto out;
 	}
 
+	if (needs_callstack && lock_owner) {
+		u64 owner_time;
+		struct contention_key ckey = {};
+		struct contention_data *cdata;
+		struct owner_tracing_data *otdata;
+
+		otdata = bpf_map_lookup_elem(&owner_data, &pelem->lock);
+		if (!otdata)
+			goto skip_owner_end;
+
+		/* Update `owner_stat` */
+		owner_time = timestamp - otdata->timestamp;
+		ckey.stack_id = otdata->stack_id;
+		cdata = bpf_map_lookup_elem(&owner_stat, &ckey);
+
+		if (!cdata) {
+			struct contention_data first = {
+				.total_time = owner_time,
+				.max_time = owner_time,
+				.min_time = owner_time,
+				.count = 1,
+				.flags = pelem->flags,
+			};
+			bpf_map_update_elem(&owner_stat, &ckey, &first, BPF_NOEXIST);
+		} else {
+			__sync_fetch_and_add(&cdata->total_time, owner_time);
+			__sync_fetch_and_add(&cdata->count, 1);
+
+			/* FIXME: need atomic operations */
+			if (cdata->max_time < owner_time)
+				cdata->max_time = owner_time;
+			if (cdata->min_time > owner_time)
+				cdata->min_time = owner_time;
+		}
+
+		/* No contention is occurring, delete `lock` entry in `owner_data` */
+		if (otdata->count <= 1)
+			bpf_map_delete_elem(&owner_data, &pelem->lock);
+		/*
+		 * Contention is still ongoing, with a new owner (current task). `owner_data`
+		 * should be updated accordingly.
+		 */
+		else {
+			u32 i = 0;
+			u64 *buf;
+
+			__sync_fetch_and_add(&otdata->count, -1);
+
+			buf = bpf_map_lookup_elem(&stack_buf, &i);
+			if (!buf)
+				goto skip_owner_end;
+			for (i = 0; i < (u32)max_stack; i++)
+				buf[i] = 0x0;
+
+			/*
+			 * ctx[1] has the return code of the lock function.
+			 * If ctx[1] is not 0, the current task terminates lock waiting without
+			 * acquiring it. Owner is not changed, but we still need to update the owner
+			 * stack.
+			 */
+			if (!ctx[1]) {
+				s32 id = 0;
+				struct task_struct *task = NULL;
+
+				if (bpf_task_from_pid)
+					task = bpf_task_from_pid(otdata->pid);
+
+				if (task) {
+					bpf_get_task_stack(task, buf,
+							   max_stack * sizeof(unsigned long), 0);
+					bpf_task_release(task);
+				}
+
+				id = get_owner_stack_id(buf);
+
+				/*
+				 * If owner stack is changed, update `owner_data` and `owner_stat`
+				 * accordingly.
+				 */
+				if (id != otdata->stack_id) {
+					u64 duration = otdata->timestamp - pelem->timestamp;
+					struct contention_key ckey = {
+						.stack_id = id,
+						.pid = 0,
+						.lock_addr_or_cgroup = 0,
+					};
+					struct contention_data *cdata =
+						bpf_map_lookup_elem(&owner_stat, &ckey);
+
+					if (!cdata) {
+						struct contention_data first = {
+							.total_time = duration,
+							.max_time = duration,
+							.min_time = duration,
+							.count = 1,
+							.flags = pelem->flags,
+						};
+						bpf_map_update_elem(&owner_stat, &ckey, &first,
+								    BPF_NOEXIST);
+					} else {
+						__sync_fetch_and_add(&cdata->total_time, duration);
+						__sync_fetch_and_add(&cdata->count, 1);
+
+						/* FIXME: need atomic operations */
+						if (cdata->max_time < duration)
+							cdata->max_time = duration;
+						if (cdata->min_time > duration)
+							cdata->min_time = duration;
+					}
+
+					otdata->timestamp = pelem->timestamp;
+					otdata->stack_id = id;
+				}
+			}
+			/*
+			 * If ctx[1] is 0, then update tracinng data with the current task, which is
+			 * the new owner.
+			 */
+			else {
+				otdata->pid = pid;
+				otdata->timestamp = timestamp;
+
+				bpf_get_task_stack(bpf_get_current_task_btf(), buf,
+						   max_stack * sizeof(unsigned long), 0);
+				otdata->stack_id = get_owner_stack_id(buf);
+			}
+		}
+	}
+skip_owner_end:
+
 	switch (aggr_mode) {
 	case LOCK_AGGR_CALLER:
 		key.stack_id = pelem->stack_id;
-- 
2.48.1.362.g079036d154-goog


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH v4 3/5] perf lock: Make rb_tree helper functions generic
  2025-01-30  5:21 [PATCH v4 0/5] Tracing contention lock owner call stack Chun-Tse Shao
  2025-01-30  5:21 ` [PATCH v4 1/5] perf lock: Add bpf maps for owner stack tracing Chun-Tse Shao
  2025-01-30  5:21 ` [PATCH v4 2/5] perf lock: Retrieve owner callstack in bpf program Chun-Tse Shao
@ 2025-01-30  5:21 ` Chun-Tse Shao
  2025-01-30  5:21 ` [PATCH v4 4/5] perf lock: Report owner stack in usermode Chun-Tse Shao
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 10+ messages in thread
From: Chun-Tse Shao @ 2025-01-30  5:21 UTC (permalink / raw)
  To: linux-kernel
  Cc: Chun-Tse Shao, peterz, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	linux-perf-users, bpf

The rb_tree helper functions can be reused for parsing `owner_lock_stat`
into rb tree for sorting.

Signed-off-by: Chun-Tse Shao <ctshao@google.com>
---
 tools/perf/builtin-lock.c | 34 +++++++++++++++++++++++-----------
 1 file changed, 23 insertions(+), 11 deletions(-)

diff --git a/tools/perf/builtin-lock.c b/tools/perf/builtin-lock.c
index 5d405cd8e696..9bebc186286f 100644
--- a/tools/perf/builtin-lock.c
+++ b/tools/perf/builtin-lock.c
@@ -418,16 +418,13 @@ static void combine_lock_stats(struct lock_stat *st)
 	rb_insert_color(&st->rb, &sorted);
 }
 
-static void insert_to_result(struct lock_stat *st,
-			     int (*bigger)(struct lock_stat *, struct lock_stat *))
+static void insert_to(struct rb_root *rr, struct lock_stat *st,
+		      int (*bigger)(struct lock_stat *, struct lock_stat *))
 {
-	struct rb_node **rb = &result.rb_node;
+	struct rb_node **rb = &rr->rb_node;
 	struct rb_node *parent = NULL;
 	struct lock_stat *p;
 
-	if (combine_locks && st->combined)
-		return;
-
 	while (*rb) {
 		p = container_of(*rb, struct lock_stat, rb);
 		parent = *rb;
@@ -439,13 +436,21 @@ static void insert_to_result(struct lock_stat *st,
 	}
 
 	rb_link_node(&st->rb, parent, rb);
-	rb_insert_color(&st->rb, &result);
+	rb_insert_color(&st->rb, rr);
 }
 
-/* returns left most element of result, and erase it */
-static struct lock_stat *pop_from_result(void)
+static inline void insert_to_result(struct lock_stat *st,
+				    int (*bigger)(struct lock_stat *,
+						  struct lock_stat *))
+{
+	if (combine_locks && st->combined)
+		return;
+	insert_to(&result, st, bigger);
+}
+
+static inline struct lock_stat *pop_from(struct rb_root *rr)
 {
-	struct rb_node *node = result.rb_node;
+	struct rb_node *node = rr->rb_node;
 
 	if (!node)
 		return NULL;
@@ -453,8 +458,15 @@ static struct lock_stat *pop_from_result(void)
 	while (node->rb_left)
 		node = node->rb_left;
 
-	rb_erase(node, &result);
+	rb_erase(node, rr);
 	return container_of(node, struct lock_stat, rb);
+
+}
+
+/* returns left most element of result, and erase it */
+static struct lock_stat *pop_from_result(void)
+{
+	return pop_from(&result);
 }
 
 struct trace_lock_handler {
-- 
2.48.1.362.g079036d154-goog


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH v4 4/5] perf lock: Report owner stack in usermode
  2025-01-30  5:21 [PATCH v4 0/5] Tracing contention lock owner call stack Chun-Tse Shao
                   ` (2 preceding siblings ...)
  2025-01-30  5:21 ` [PATCH v4 3/5] perf lock: Make rb_tree helper functions generic Chun-Tse Shao
@ 2025-01-30  5:21 ` Chun-Tse Shao
  2025-01-30  5:21 ` [PATCH v4 5/5] perf lock: Update documentation for -o option in contention mode Chun-Tse Shao
  2025-01-31 22:51 ` [PATCH v4 0/5] Tracing contention lock owner call stack Namhyung Kim
  5 siblings, 0 replies; 10+ messages in thread
From: Chun-Tse Shao @ 2025-01-30  5:21 UTC (permalink / raw)
  To: linux-kernel
  Cc: Chun-Tse Shao, peterz, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	linux-perf-users, bpf

Parse `owner_lock_stat` into a rb tree, and report owner lock stats with
stack trace in order.

Example output:
  $ sudo ~/linux/tools/perf/perf lock con -abvo -Y mutex-spin -E3 perf bench sched pipe
  ...
   contended   total wait     max wait     avg wait         type   caller

         171      1.55 ms     20.26 us      9.06 us        mutex   pipe_read+0x57
                          0xffffffffac6318e7  pipe_read+0x57
                          0xffffffffac623862  vfs_read+0x332
                          0xffffffffac62434b  ksys_read+0xbb
                          0xfffffffface604b2  do_syscall_64+0x82
                          0xffffffffad00012f  entry_SYSCALL_64_after_hwframe+0x76
          36    193.71 us     15.27 us      5.38 us        mutex   pipe_write+0x50
                          0xffffffffac631ee0  pipe_write+0x50
                          0xffffffffac6241db  vfs_write+0x3bb
                          0xffffffffac6244ab  ksys_write+0xbb
                          0xfffffffface604b2  do_syscall_64+0x82
                          0xffffffffad00012f  entry_SYSCALL_64_after_hwframe+0x76
           4     51.22 us     16.47 us     12.80 us        mutex   do_epoll_wait+0x24d
                          0xffffffffac691f0d  do_epoll_wait+0x24d
                          0xffffffffac69249b  do_epoll_pwait.part.0+0xb
                          0xffffffffac693ba5  __x64_sys_epoll_pwait+0x95
                          0xfffffffface604b2  do_syscall_64+0x82
                          0xffffffffad00012f  entry_SYSCALL_64_after_hwframe+0x76

  === owner stack trace ===

           3     31.24 us     15.27 us     10.41 us        mutex   pipe_read+0x348
                          0xffffffffac631bd8  pipe_read+0x348
                          0xffffffffac623862  vfs_read+0x332
                          0xffffffffac62434b  ksys_read+0xbb
                          0xfffffffface604b2  do_syscall_64+0x82
                          0xffffffffad00012f  entry_SYSCALL_64_after_hwframe+0x76
  ...

Signed-off-by: Chun-Tse Shao <ctshao@google.com>
---
 tools/perf/builtin-lock.c             | 19 ++++++++--
 tools/perf/util/bpf_lock_contention.c | 54 +++++++++++++++++++++++++++
 tools/perf/util/lock-contention.h     |  7 ++++
 3 files changed, 77 insertions(+), 3 deletions(-)

diff --git a/tools/perf/builtin-lock.c b/tools/perf/builtin-lock.c
index 9bebc186286f..3dc100cf30ef 100644
--- a/tools/perf/builtin-lock.c
+++ b/tools/perf/builtin-lock.c
@@ -1817,6 +1817,22 @@ static void print_contention_result(struct lock_contention *con)
 			break;
 	}
 
+	if (con->owner && con->save_callstack) {
+		struct rb_root root = RB_ROOT;
+
+		if (symbol_conf.field_sep)
+			fprintf(lock_output, "# owner stack trace:\n");
+		else
+			fprintf(lock_output, "\n=== owner stack trace ===\n\n");
+		while ((st = pop_owner_stack_trace(con)))
+			insert_to(&root, st, compare);
+
+		while ((st = pop_from(&root))) {
+			print_lock_stat(con, st);
+			zfree(st);
+		}
+	}
+
 	if (print_nr_entries) {
 		/* update the total/bad stats */
 		while ((st = pop_from_result())) {
@@ -1962,9 +1978,6 @@ static int check_lock_contention_options(const struct option *options,
 		}
 	}
 
-	if (show_lock_owner)
-		show_thread_stats = true;
-
 	return 0;
 }
 
diff --git a/tools/perf/util/bpf_lock_contention.c b/tools/perf/util/bpf_lock_contention.c
index 76542b86e83f..dc83b02c9724 100644
--- a/tools/perf/util/bpf_lock_contention.c
+++ b/tools/perf/util/bpf_lock_contention.c
@@ -549,6 +549,60 @@ static const char *lock_contention_get_name(struct lock_contention *con,
 	return name_buf;
 }
 
+struct lock_stat *pop_owner_stack_trace(struct lock_contention *con)
+{
+	int stacks_fd, stat_fd;
+	u64 *stack_trace;
+	s32 stack_id;
+	struct contention_key ckey = {};
+	struct contention_data cdata = {};
+	size_t stack_size = con->max_stack * sizeof(*stack_trace);
+	struct lock_stat *st;
+	char name[KSYM_NAME_LEN];
+
+	stacks_fd = bpf_map__fd(skel->maps.owner_stacks);
+	stat_fd = bpf_map__fd(skel->maps.owner_stat);
+	if (!stacks_fd || !stat_fd)
+		return NULL;
+
+	stack_trace = zalloc(stack_size);
+	if (stack_trace == NULL)
+		return NULL;
+
+	if (bpf_map_get_next_key(stacks_fd, NULL, stack_trace))
+		return NULL;
+
+	bpf_map_lookup_elem(stacks_fd, stack_trace, &stack_id);
+	ckey.stack_id = stack_id;
+	bpf_map_lookup_elem(stat_fd, &ckey, &cdata);
+
+	st = zalloc(sizeof(struct lock_stat));
+	if (!st)
+		return NULL;
+
+	strcpy(name,
+	       stack_trace[0] ? lock_contention_get_name(con, NULL, stack_trace, 0) : "unknown");
+
+	st->name = strdup(name);
+	if (!st->name)
+		return NULL;
+
+	st->flags = cdata.flags;
+	st->nr_contended = cdata.count;
+	st->wait_time_total = cdata.total_time;
+	st->wait_time_max = cdata.max_time;
+	st->wait_time_min = cdata.min_time;
+	st->callstack = stack_trace;
+
+	if (cdata.count)
+		st->avg_wait_time = cdata.total_time / cdata.count;
+
+	bpf_map_delete_elem(stacks_fd, stack_trace);
+	bpf_map_delete_elem(stat_fd, &ckey);
+
+	return st;
+}
+
 int lock_contention_read(struct lock_contention *con)
 {
 	int fd, stack, err = 0;
diff --git a/tools/perf/util/lock-contention.h b/tools/perf/util/lock-contention.h
index a09f7fe877df..97fd33c57f17 100644
--- a/tools/perf/util/lock-contention.h
+++ b/tools/perf/util/lock-contention.h
@@ -168,6 +168,8 @@ int lock_contention_stop(void);
 int lock_contention_read(struct lock_contention *con);
 int lock_contention_finish(struct lock_contention *con);
 
+struct lock_stat *pop_owner_stack_trace(struct lock_contention *con);
+
 #else  /* !HAVE_BPF_SKEL */
 
 static inline int lock_contention_prepare(struct lock_contention *con __maybe_unused)
@@ -187,6 +189,11 @@ static inline int lock_contention_read(struct lock_contention *con __maybe_unuse
 	return 0;
 }
 
+struct lock_stat *pop_owner_stack_trace(struct lock_contention *con)
+{
+	return NULL;
+}
+
 #endif  /* HAVE_BPF_SKEL */
 
 #endif  /* PERF_LOCK_CONTENTION_H */
-- 
2.48.1.362.g079036d154-goog


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH v4 5/5] perf lock: Update documentation for -o option in contention mode
  2025-01-30  5:21 [PATCH v4 0/5] Tracing contention lock owner call stack Chun-Tse Shao
                   ` (3 preceding siblings ...)
  2025-01-30  5:21 ` [PATCH v4 4/5] perf lock: Report owner stack in usermode Chun-Tse Shao
@ 2025-01-30  5:21 ` Chun-Tse Shao
  2025-01-31 22:51 ` [PATCH v4 0/5] Tracing contention lock owner call stack Namhyung Kim
  5 siblings, 0 replies; 10+ messages in thread
From: Chun-Tse Shao @ 2025-01-30  5:21 UTC (permalink / raw)
  To: linux-kernel
  Cc: Chun-Tse Shao, peterz, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	linux-perf-users, bpf

This patch also decouple -o with -t, and shows warning to notify the new
behavior for -ov.

Signed-off-by: Chun-Tse Shao <ctshao@google.com>
---
 tools/perf/builtin-lock.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/tools/perf/builtin-lock.c b/tools/perf/builtin-lock.c
index 3dc100cf30ef..e16bda6ce525 100644
--- a/tools/perf/builtin-lock.c
+++ b/tools/perf/builtin-lock.c
@@ -1817,7 +1817,7 @@ static void print_contention_result(struct lock_contention *con)
 			break;
 	}
 
-	if (con->owner && con->save_callstack) {
+	if (con->owner && con->save_callstack && verbose > 0) {
 		struct rb_root root = RB_ROOT;
 
 		if (symbol_conf.field_sep)
@@ -1978,6 +1978,11 @@ static int check_lock_contention_options(const struct option *options,
 		}
 	}
 
+	if (show_lock_owner && !show_thread_stats) {
+		pr_warning("Now -o try to show owner's callstack instead of pid and comm.\n");
+		pr_warning("Please use -t option too to keep the old behavior.\n");
+	}
+
 	return 0;
 }
 
@@ -2569,7 +2574,8 @@ int cmd_lock(int argc, const char **argv)
 		     "Filter specific address/symbol of locks", parse_lock_addr),
 	OPT_CALLBACK('S', "callstack-filter", NULL, "NAMES",
 		     "Filter specific function in the callstack", parse_call_stack),
-	OPT_BOOLEAN('o', "lock-owner", &show_lock_owner, "show lock owners instead of waiters"),
+	OPT_BOOLEAN('o', "lock-owner", &show_lock_owner, "show lock owners instead of waiters.\n"
+		"\t\t\tThis option can be combined with -t, which shows owner's per thread lock stats, or -v, which shows owner's stacktrace"),
 	OPT_STRING_NOEMPTY('x', "field-separator", &symbol_conf.field_sep, "separator",
 		   "print result in CSV format with custom separator"),
 	OPT_BOOLEAN(0, "lock-cgroup", &show_lock_cgroups, "show lock stats by cgroup"),
-- 
2.48.1.362.g079036d154-goog


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH v4 0/5] Tracing contention lock owner call stack
  2025-01-30  5:21 [PATCH v4 0/5] Tracing contention lock owner call stack Chun-Tse Shao
                   ` (4 preceding siblings ...)
  2025-01-30  5:21 ` [PATCH v4 5/5] perf lock: Update documentation for -o option in contention mode Chun-Tse Shao
@ 2025-01-31 22:51 ` Namhyung Kim
  5 siblings, 0 replies; 10+ messages in thread
From: Namhyung Kim @ 2025-01-31 22:51 UTC (permalink / raw)
  To: Chun-Tse Shao
  Cc: linux-kernel, peterz, mingo, acme, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	linux-perf-users, bpf

Hello,

On Wed, Jan 29, 2025 at 09:21:34PM -0800, Chun-Tse Shao wrote:
> 
> v4: Edit based on Namhyung's review.
> 
> v3: Edit based on Namhyung's review.
> 
> v2: Fix logic deficit in patch 2/4.

This change log doesn't give any meaningful info.  You'd better say what
exactly you changed.

Thanks,
Namhyung

> 
> Chun-Tse Shao (5):
>   perf lock: Add bpf maps for owner stack tracing
>   perf lock: Retrieve owner callstack in bpf program
>   perf lock: Make rb_tree helper functions generic
>   perf lock: Report owner stack in usermode
>   perf lock: Update documentation for -o option in contention mode
> 
>  tools/perf/builtin-lock.c                     |  59 +++-
>  tools/perf/util/bpf_lock_contention.c         |  68 ++++-
>  .../perf/util/bpf_skel/lock_contention.bpf.c  | 281 +++++++++++++++++-
>  tools/perf/util/bpf_skel/lock_data.h          |   7 +
>  tools/perf/util/lock-contention.h             |   7 +
>  5 files changed, 405 insertions(+), 17 deletions(-)
> 
> --
> 2.48.1.362.g079036d154-goog
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v4 2/5] perf lock: Retrieve owner callstack in bpf program
  2025-01-30  5:21 ` [PATCH v4 2/5] perf lock: Retrieve owner callstack in bpf program Chun-Tse Shao
@ 2025-01-31 23:48   ` Namhyung Kim
  2025-02-11 23:26     ` Chun-Tse Shao
  0 siblings, 1 reply; 10+ messages in thread
From: Namhyung Kim @ 2025-01-31 23:48 UTC (permalink / raw)
  To: Chun-Tse Shao
  Cc: linux-kernel, peterz, mingo, acme, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	linux-perf-users, bpf

On Wed, Jan 29, 2025 at 09:21:36PM -0800, Chun-Tse Shao wrote:
> Tracks lock contention by tracing owner callstacks in
> `contention_begin()` and `contention_end()`, storing data in the
> owner_stat BPF map. `contention_begin()` records the owner and their
> callstack. `contention_end()` updates contention statistics (count,
> time), decrements the waiter count, and removes the record when no
> waiters remain. Statistics are also updated if the owner's callstack
> changes. Note that owner and its callstack retrieval may fail.
> 
> To elaborate the process in detail:
>   /*
>    * In `contention_begin(), the current task is the lock waiter`. We
>    * create/update `owner_data` for the given `lock` address.
>   contention_begin() {
>     Try to get owner task. Skip entire process if fails.
>     Try to get owner stack based on task. Use empty stack if fails.
>     Store owner stack into `owner_stacks` and create `stack_id`. If fail
>       to store, use negative `stack_id`, which will be ignored while
>       reporting in usermode.
>     Retrieve `owner_tracing_data` in `owner_data` with given `lock`
>       address.
> 
>     /*
>      * The first case means contention just happens, or mismatched owner
>      * infomation so we just drop the previous record.
>      */
>     if (`owner_tracing_data` does not exist ||
>         the recorded owner `pid` does not match with the detected owner) {
>       Create `owner_tracing_data` with info from detected owner, and
>         store it in `owner_data` with key `lock` address.
>     }
>     /*
>      * The second case means contention is on going. One more waiter is
>      * joining the lock contention. Both `owner_data` and `owner_stat`
>      * should be updated.
>      */
>     else {
>       `owner_tracing_data.count`++
> 
>       Create `contention_key` with owner `stack_id` and lookup
>         `contention_data` in `owner_stat`.
>       if (`contention_data` does not exist) {
>         Create new entry for `contention_key`:`contention_data` in
>           `owner_stat`.
>       }
>       else {
>         Update the `count` and `total_time` in existing
>         `contention_data`.
>       }
> 
>       Update `timestamp` and `stack_id` in `owner_tracing_data`.
>     }
>   }
> 
>   /*
>    * In `contention_end()`, the current task will be the new owner of
>    * the `lock`, if `ctx[1]` is not 0.
>    */
>   contention_end() {
>     Lookup `owner_tracing_data` in `owner_data` with key `lock`.
> 
>     Create `contention_key` with `owner_tracing_data.stack_id` and
>       lookup `contention_data` in `owner_stat`.
>     if (`contention_data` does not exist) {
>       Create new entry for `contention_key`:`contention_data` in
>         `owner_stat`.
>     }
>     else {
>       Update the `count` and `total_time` in existing `contention_data`.
>     }
> 
>     /*
>      * There is no more waiters, contention is over, delete the record.
>      */
>     if (`owner_tracing_data.count` <= 1) {
>       delete this record in `owner_data`.
>     }
>     /*
>      * Contention is still on going.
>      */
>     else {
>       `owner_tracing_data.count`--
> 
>       if (`!ctx[0]`) {
>         The current task exits without acquiring the lock. However we
>           check for the recorded owner if the stack is changed, and
>           update `onwer_data` and `owner_stat` accordingly.
>       }
>       else {
>         The current task is the new owner, retrieve its stack, store it
>           in `owner_stack` and update `owner_tracing_data`.
>       }
>     }
>   }

I think this is too much detail. :)

I'd say something like this:

This implements per-callstack aggregation of lock owners in addition to
per-thread.  The owner callstack is captured using bpf_get_task_stack()
at contention_begin and it also adds a custom stackid function for the
owner stacks to be compared easily.

The owner info is kept in a hash map using lock addr as a key to handle
multiple waiters for the same lock.  At contention_end, it updates the
owner lock stat based on the info that was saved at contention_begin.
If there are more waiters, it'd update the owner pid to itself as
contention_end means it gets the lock now.  But it also needs to check
the return value of the lock function in case task was killed by a signal
or something.

> 
> Signed-off-by: Chun-Tse Shao <ctshao@google.com>
> ---
>  .../perf/util/bpf_skel/lock_contention.bpf.c  | 248 +++++++++++++++++-
>  1 file changed, 247 insertions(+), 1 deletion(-)
> 
> diff --git a/tools/perf/util/bpf_skel/lock_contention.bpf.c b/tools/perf/util/bpf_skel/lock_contention.bpf.c
> index 23fe9cc980ae..b12df873379f 100644
> --- a/tools/perf/util/bpf_skel/lock_contention.bpf.c
> +++ b/tools/perf/util/bpf_skel/lock_contention.bpf.c
> @@ -197,6 +197,9 @@ int data_fail;
>  int task_map_full;
>  int data_map_full;
>  
> +struct task_struct *bpf_task_from_pid(s32 pid) __ksym __weak;
> +void bpf_task_release(struct task_struct *p) __ksym __weak;
> +
>  static inline __u64 get_current_cgroup_id(void)
>  {
>  	struct task_struct *task;
> @@ -420,6 +423,26 @@ static inline struct tstamp_data *get_tstamp_elem(__u32 flags)
>  	return pelem;
>  }
>  
> +static inline s32 get_owner_stack_id(u64 *stacktrace)
> +{
> +	s32 *id, new_id;
> +	static s64 id_gen = 1;
> +
> +	id = bpf_map_lookup_elem(&owner_stacks, stacktrace);
> +	if (id)
> +		return *id;
> +
> +	new_id = (s32)__sync_fetch_and_add(&id_gen, 1);
> +
> +	bpf_map_update_elem(&owner_stacks, stacktrace, &new_id, BPF_NOEXIST);
> +
> +	id = bpf_map_lookup_elem(&owner_stacks, stacktrace);
> +	if (id)
> +		return *id;
> +
> +	return -1;
> +}
> +
>  SEC("tp_btf/contention_begin")
>  int contention_begin(u64 *ctx)
>  {
> @@ -437,6 +460,97 @@ int contention_begin(u64 *ctx)
>  	pelem->flags = (__u32)ctx[1];
>  
>  	if (needs_callstack) {
> +		u32 i = 0;
> +		u32 id = 0;
> +		int owner_pid;
> +		u64 *buf;
> +		struct task_struct *task;
> +		struct owner_tracing_data *otdata;
> +
> +		if (!lock_owner)
> +			goto skip_owner_begin;
> +
> +		task = get_lock_owner(pelem->lock, pelem->flags);
> +		if (!task)
> +			goto skip_owner_begin;
> +
> +		owner_pid = BPF_CORE_READ(task, pid);
> +
> +		buf = bpf_map_lookup_elem(&stack_buf, &i);
> +		if (!buf)
> +			goto skip_owner_begin;
> +		for (i = 0; i < max_stack; i++)
> +			buf[i] = 0x0;
> +
> +		if (bpf_task_from_pid) {

I think you can do this instead:

		if (bpf_task_from_pid == NULL)
			goto skip_owner_begin;

nit: it can be just 'skip_owner'. :)


> +			task = bpf_task_from_pid(owner_pid);
> +			if (task) {
> +				bpf_get_task_stack(task, buf, max_stack * sizeof(unsigned long), 0);
> +				bpf_task_release(task);
> +			}
> +		}
> +
> +		otdata = bpf_map_lookup_elem(&owner_data, &pelem->lock);
> +		id = get_owner_stack_id(buf);
> +
> +		/*
> +		 * Contention just happens, or corner case `lock` is owned by process not
> +		 * `owner_pid`. For the corner case we treat it as unexpected internal error and
> +		 * just ignore the precvious tracing record.
> +		 */
> +		if (!otdata || otdata->pid != owner_pid) {
> +			struct owner_tracing_data first = {
> +				.pid = owner_pid,
> +				.timestamp = pelem->timestamp,
> +				.count = 1,
> +				.stack_id = id,
> +			};
> +			bpf_map_update_elem(&owner_data, &pelem->lock, &first, BPF_ANY);
> +		}
> +		/* Contention is ongoing and new waiter joins */
> +		else {
> +			__sync_fetch_and_add(&otdata->count, 1);
> +
> +			/*
> +			 * The owner is the same, but stacktrace might be changed. In this case we
> +			 * store/update `owner_stat` based on current owner stack id.
> +			 */
> +			if (id != otdata->stack_id) {
> +				u64 duration = otdata->timestamp - pelem->timestamp;

Isn't it the opposite?

	u64 duration = pelem->timestamp - otdata->timestamp;


> +				struct contention_key ckey = {
> +					.stack_id = id,
> +					.pid = 0,
> +					.lock_addr_or_cgroup = 0,
> +				};
> +				struct contention_data *cdata =
> +					bpf_map_lookup_elem(&owner_stat, &ckey);
> +
> +				if (!cdata) {
> +					struct contention_data first = {
> +						.total_time = duration,
> +						.max_time = duration,
> +						.min_time = duration,
> +						.count = 1,
> +						.flags = pelem->flags,
> +					};
> +					bpf_map_update_elem(&owner_stat, &ckey, &first,
> +							    BPF_NOEXIST);
> +				} else {
> +					__sync_fetch_and_add(&cdata->total_time, duration);
> +					__sync_fetch_and_add(&cdata->count, 1);
> +
> +					/* FIXME: need atomic operations */
> +					if (cdata->max_time < duration)
> +						cdata->max_time = duration;
> +					if (cdata->min_time > duration)
> +						cdata->min_time = duration;
> +				}

And as I said before, can we move this block out as a function?

> +
> +				otdata->timestamp = pelem->timestamp;
> +				otdata->stack_id = id;
> +			}
> +		}
> +skip_owner_begin:
>  		pelem->stack_id = bpf_get_stackid(ctx, &stacks,
>  						  BPF_F_FAST_STACK_CMP | stack_skip);
>  		if (pelem->stack_id < 0)
> @@ -473,6 +587,7 @@ int contention_end(u64 *ctx)
>  	struct tstamp_data *pelem;
>  	struct contention_key key = {};
>  	struct contention_data *data;
> +	__u64 timestamp;
>  	__u64 duration;
>  	bool need_delete = false;
>  
> @@ -500,12 +615,143 @@ int contention_end(u64 *ctx)
>  		need_delete = true;
>  	}
>  
> -	duration = bpf_ktime_get_ns() - pelem->timestamp;
> +	timestamp = bpf_ktime_get_ns();
> +	duration = timestamp - pelem->timestamp;
>  	if ((__s64)duration < 0) {
>  		__sync_fetch_and_add(&time_fail, 1);
>  		goto out;
>  	}
>  
> +	if (needs_callstack && lock_owner) {
> +		u64 owner_time;
> +		struct contention_key ckey = {};
> +		struct contention_data *cdata;
> +		struct owner_tracing_data *otdata;
> +
> +		otdata = bpf_map_lookup_elem(&owner_data, &pelem->lock);
> +		if (!otdata)
> +			goto skip_owner_end;
> +
> +		/* Update `owner_stat` */
> +		owner_time = timestamp - otdata->timestamp;
> +		ckey.stack_id = otdata->stack_id;
> +		cdata = bpf_map_lookup_elem(&owner_stat, &ckey);
> +
> +		if (!cdata) {
> +			struct contention_data first = {
> +				.total_time = owner_time,
> +				.max_time = owner_time,
> +				.min_time = owner_time,
> +				.count = 1,
> +				.flags = pelem->flags,
> +			};
> +			bpf_map_update_elem(&owner_stat, &ckey, &first, BPF_NOEXIST);
> +		} else {
> +			__sync_fetch_and_add(&cdata->total_time, owner_time);
> +			__sync_fetch_and_add(&cdata->count, 1);
> +
> +			/* FIXME: need atomic operations */
> +			if (cdata->max_time < owner_time)
> +				cdata->max_time = owner_time;
> +			if (cdata->min_time > owner_time)
> +				cdata->min_time = owner_time;
> +		}
> +
> +		/* No contention is occurring, delete `lock` entry in `owner_data` */
> +		if (otdata->count <= 1)
> +			bpf_map_delete_elem(&owner_data, &pelem->lock);
> +		/*
> +		 * Contention is still ongoing, with a new owner (current task). `owner_data`
> +		 * should be updated accordingly.
> +		 */
> +		else {
> +			u32 i = 0;
> +			u64 *buf;
> +
> +			__sync_fetch_and_add(&otdata->count, -1);
> +
> +			buf = bpf_map_lookup_elem(&stack_buf, &i);
> +			if (!buf)
> +				goto skip_owner_end;
> +			for (i = 0; i < (u32)max_stack; i++)
> +				buf[i] = 0x0;
> +
> +			/*
> +			 * ctx[1] has the return code of the lock function.

Then I think it's clearer to have a local variable named 'ret' or so.


> +			 * If ctx[1] is not 0, the current task terminates lock waiting without
> +			 * acquiring it. Owner is not changed, but we still need to update the owner
> +			 * stack.
> +			 */
> +			if (!ctx[1]) {

This doesn't match to the comment.  It should be:

			if (ret < 0) {


> +				s32 id = 0;
> +				struct task_struct *task = NULL;
> +
> +				if (bpf_task_from_pid)

Same as the above.  No need to go down if you cannot get the task and
stack.

				if (bpf_task_from_pid == NULL)
					goto skip_owner;


> +					task = bpf_task_from_pid(otdata->pid);
> +
> +				if (task) {
> +					bpf_get_task_stack(task, buf,
> +							   max_stack * sizeof(unsigned long), 0);
> +					bpf_task_release(task);
> +				}
> +
> +				id = get_owner_stack_id(buf);
> +
> +				/*
> +				 * If owner stack is changed, update `owner_data` and `owner_stat`
> +				 * accordingly.
> +				 */
> +				if (id != otdata->stack_id) {
> +					u64 duration = otdata->timestamp - pelem->timestamp;
> +					struct contention_key ckey = {
> +						.stack_id = id,
> +						.pid = 0,
> +						.lock_addr_or_cgroup = 0,
> +					};
> +					struct contention_data *cdata =
> +						bpf_map_lookup_elem(&owner_stat, &ckey);
> +
> +					if (!cdata) {
> +						struct contention_data first = {
> +							.total_time = duration,
> +							.max_time = duration,
> +							.min_time = duration,
> +							.count = 1,
> +							.flags = pelem->flags,
> +						};
> +						bpf_map_update_elem(&owner_stat, &ckey, &first,
> +								    BPF_NOEXIST);
> +					} else {
> +						__sync_fetch_and_add(&cdata->total_time, duration);
> +						__sync_fetch_and_add(&cdata->count, 1);
> +
> +						/* FIXME: need atomic operations */
> +						if (cdata->max_time < duration)
> +							cdata->max_time = duration;
> +						if (cdata->min_time > duration)
> +							cdata->min_time = duration;
> +					}
> +
> +					otdata->timestamp = pelem->timestamp;
> +					otdata->stack_id = id;
> +				}
> +			}
> +			/*
> +			 * If ctx[1] is 0, then update tracinng data with the current task, which is
> +			 * the new owner.
> +			 */
> +			else {
> +				otdata->pid = pid;
> +				otdata->timestamp = timestamp;
> +
> +				bpf_get_task_stack(bpf_get_current_task_btf(), buf,
> +						   max_stack * sizeof(unsigned long), 0);

This would be meaningless since it's still in the contention path.
Current callstack will be the same as the waiter callstack.  You'd 
better just invalidate callstack here and let the next waiter update
it.

Thanks,
Namhyung


> +				otdata->stack_id = get_owner_stack_id(buf);
> +			}
> +		}
> +	}
> +skip_owner_end:
> +
>  	switch (aggr_mode) {
>  	case LOCK_AGGR_CALLER:
>  		key.stack_id = pelem->stack_id;
> -- 
> 2.48.1.362.g079036d154-goog
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v4 2/5] perf lock: Retrieve owner callstack in bpf program
  2025-01-31 23:48   ` Namhyung Kim
@ 2025-02-11 23:26     ` Chun-Tse Shao
  2025-02-12  1:45       ` Namhyung Kim
  0 siblings, 1 reply; 10+ messages in thread
From: Chun-Tse Shao @ 2025-02-11 23:26 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: linux-kernel, peterz, mingo, acme, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	linux-perf-users, bpf

Hi Namhyung, thanks for your review first!


On Fri, Jan 31, 2025 at 3:48 PM Namhyung Kim <namhyung@kernel.org> wrote:
>
> On Wed, Jan 29, 2025 at 09:21:36PM -0800, Chun-Tse Shao wrote:
> > Tracks lock contention by tracing owner callstacks in
> > `contention_begin()` and `contention_end()`, storing data in the
> > owner_stat BPF map. `contention_begin()` records the owner and their
> > callstack. `contention_end()` updates contention statistics (count,
> > time), decrements the waiter count, and removes the record when no
> > waiters remain. Statistics are also updated if the owner's callstack
> > changes. Note that owner and its callstack retrieval may fail.
> >
> > To elaborate the process in detail:
> >   /*
> >    * In `contention_begin(), the current task is the lock waiter`. We
> >    * create/update `owner_data` for the given `lock` address.
> >   contention_begin() {
> >     Try to get owner task. Skip entire process if fails.
> >     Try to get owner stack based on task. Use empty stack if fails.
> >     Store owner stack into `owner_stacks` and create `stack_id`. If fail
> >       to store, use negative `stack_id`, which will be ignored while
> >       reporting in usermode.
> >     Retrieve `owner_tracing_data` in `owner_data` with given `lock`
> >       address.
> >
> >     /*
> >      * The first case means contention just happens, or mismatched owner
> >      * infomation so we just drop the previous record.
> >      */
> >     if (`owner_tracing_data` does not exist ||
> >         the recorded owner `pid` does not match with the detected owner) {
> >       Create `owner_tracing_data` with info from detected owner, and
> >         store it in `owner_data` with key `lock` address.
> >     }
> >     /*
> >      * The second case means contention is on going. One more waiter is
> >      * joining the lock contention. Both `owner_data` and `owner_stat`
> >      * should be updated.
> >      */
> >     else {
> >       `owner_tracing_data.count`++
> >
> >       Create `contention_key` with owner `stack_id` and lookup
> >         `contention_data` in `owner_stat`.
> >       if (`contention_data` does not exist) {
> >         Create new entry for `contention_key`:`contention_data` in
> >           `owner_stat`.
> >       }
> >       else {
> >         Update the `count` and `total_time` in existing
> >         `contention_data`.
> >       }
> >
> >       Update `timestamp` and `stack_id` in `owner_tracing_data`.
> >     }
> >   }
> >
> >   /*
> >    * In `contention_end()`, the current task will be the new owner of
> >    * the `lock`, if `ctx[1]` is not 0.
> >    */
> >   contention_end() {
> >     Lookup `owner_tracing_data` in `owner_data` with key `lock`.
> >
> >     Create `contention_key` with `owner_tracing_data.stack_id` and
> >       lookup `contention_data` in `owner_stat`.
> >     if (`contention_data` does not exist) {
> >       Create new entry for `contention_key`:`contention_data` in
> >         `owner_stat`.
> >     }
> >     else {
> >       Update the `count` and `total_time` in existing `contention_data`.
> >     }
> >
> >     /*
> >      * There is no more waiters, contention is over, delete the record.
> >      */
> >     if (`owner_tracing_data.count` <= 1) {
> >       delete this record in `owner_data`.
> >     }
> >     /*
> >      * Contention is still on going.
> >      */
> >     else {
> >       `owner_tracing_data.count`--
> >
> >       if (`!ctx[0]`) {
> >         The current task exits without acquiring the lock. However we
> >           check for the recorded owner if the stack is changed, and
> >           update `onwer_data` and `owner_stat` accordingly.
> >       }
> >       else {
> >         The current task is the new owner, retrieve its stack, store it
> >           in `owner_stack` and update `owner_tracing_data`.
> >       }
> >     }
> >   }
>
> I think this is too much detail. :)
>
> I'd say something like this:
>
> This implements per-callstack aggregation of lock owners in addition to
> per-thread.  The owner callstack is captured using bpf_get_task_stack()
> at contention_begin and it also adds a custom stackid function for the
> owner stacks to be compared easily.
>
> The owner info is kept in a hash map using lock addr as a key to handle
> multiple waiters for the same lock.  At contention_end, it updates the
> owner lock stat based on the info that was saved at contention_begin.
> If there are more waiters, it'd update the owner pid to itself as
> contention_end means it gets the lock now.  But it also needs to check
> the return value of the lock function in case task was killed by a signal
> or something.
>

Thanks, I will just reuse this description. :D

> >
> > Signed-off-by: Chun-Tse Shao <ctshao@google.com>
> > ---
> >  .../perf/util/bpf_skel/lock_contention.bpf.c  | 248 +++++++++++++++++-
> >  1 file changed, 247 insertions(+), 1 deletion(-)
> >
> > diff --git a/tools/perf/util/bpf_skel/lock_contention.bpf.c b/tools/perf/util/bpf_skel/lock_contention.bpf.c
> > index 23fe9cc980ae..b12df873379f 100644
> > --- a/tools/perf/util/bpf_skel/lock_contention.bpf.c
> > +++ b/tools/perf/util/bpf_skel/lock_contention.bpf.c
> > @@ -197,6 +197,9 @@ int data_fail;
> >  int task_map_full;
> >  int data_map_full;
> >
> > +struct task_struct *bpf_task_from_pid(s32 pid) __ksym __weak;
> > +void bpf_task_release(struct task_struct *p) __ksym __weak;
> > +
> >  static inline __u64 get_current_cgroup_id(void)
> >  {
> >       struct task_struct *task;
> > @@ -420,6 +423,26 @@ static inline struct tstamp_data *get_tstamp_elem(__u32 flags)
> >       return pelem;
> >  }
> >
> > +static inline s32 get_owner_stack_id(u64 *stacktrace)
> > +{
> > +     s32 *id, new_id;
> > +     static s64 id_gen = 1;
> > +
> > +     id = bpf_map_lookup_elem(&owner_stacks, stacktrace);
> > +     if (id)
> > +             return *id;
> > +
> > +     new_id = (s32)__sync_fetch_and_add(&id_gen, 1);
> > +
> > +     bpf_map_update_elem(&owner_stacks, stacktrace, &new_id, BPF_NOEXIST);
> > +
> > +     id = bpf_map_lookup_elem(&owner_stacks, stacktrace);
> > +     if (id)
> > +             return *id;
> > +
> > +     return -1;
> > +}
> > +
> >  SEC("tp_btf/contention_begin")
> >  int contention_begin(u64 *ctx)
> >  {
> > @@ -437,6 +460,97 @@ int contention_begin(u64 *ctx)
> >       pelem->flags = (__u32)ctx[1];
> >
> >       if (needs_callstack) {
> > +             u32 i = 0;
> > +             u32 id = 0;
> > +             int owner_pid;
> > +             u64 *buf;
> > +             struct task_struct *task;
> > +             struct owner_tracing_data *otdata;
> > +
> > +             if (!lock_owner)
> > +                     goto skip_owner_begin;
> > +
> > +             task = get_lock_owner(pelem->lock, pelem->flags);
> > +             if (!task)
> > +                     goto skip_owner_begin;
> > +
> > +             owner_pid = BPF_CORE_READ(task, pid);
> > +
> > +             buf = bpf_map_lookup_elem(&stack_buf, &i);
> > +             if (!buf)
> > +                     goto skip_owner_begin;
> > +             for (i = 0; i < max_stack; i++)
> > +                     buf[i] = 0x0;
> > +
> > +             if (bpf_task_from_pid) {
>
> I think you can do this instead:
>
>                 if (bpf_task_from_pid == NULL)
>                         goto skip_owner_begin;
>
> nit: it can be just 'skip_owner'. :)
>
>
> > +                     task = bpf_task_from_pid(owner_pid);
> > +                     if (task) {
> > +                             bpf_get_task_stack(task, buf, max_stack * sizeof(unsigned long), 0);
> > +                             bpf_task_release(task);
> > +                     }
> > +             }
> > +
> > +             otdata = bpf_map_lookup_elem(&owner_data, &pelem->lock);
> > +             id = get_owner_stack_id(buf);
> > +
> > +             /*
> > +              * Contention just happens, or corner case `lock` is owned by process not
> > +              * `owner_pid`. For the corner case we treat it as unexpected internal error and
> > +              * just ignore the precvious tracing record.
> > +              */
> > +             if (!otdata || otdata->pid != owner_pid) {
> > +                     struct owner_tracing_data first = {
> > +                             .pid = owner_pid,
> > +                             .timestamp = pelem->timestamp,
> > +                             .count = 1,
> > +                             .stack_id = id,
> > +                     };
> > +                     bpf_map_update_elem(&owner_data, &pelem->lock, &first, BPF_ANY);
> > +             }
> > +             /* Contention is ongoing and new waiter joins */
> > +             else {
> > +                     __sync_fetch_and_add(&otdata->count, 1);
> > +
> > +                     /*
> > +                      * The owner is the same, but stacktrace might be changed. In this case we
> > +                      * store/update `owner_stat` based on current owner stack id.
> > +                      */
> > +                     if (id != otdata->stack_id) {
> > +                             u64 duration = otdata->timestamp - pelem->timestamp;
>
> Isn't it the opposite?
>
>         u64 duration = pelem->timestamp - otdata->timestamp;
>
>
> > +                             struct contention_key ckey = {
> > +                                     .stack_id = id,
> > +                                     .pid = 0,
> > +                                     .lock_addr_or_cgroup = 0,
> > +                             };
> > +                             struct contention_data *cdata =
> > +                                     bpf_map_lookup_elem(&owner_stat, &ckey);
> > +
> > +                             if (!cdata) {
> > +                                     struct contention_data first = {
> > +                                             .total_time = duration,
> > +                                             .max_time = duration,
> > +                                             .min_time = duration,
> > +                                             .count = 1,
> > +                                             .flags = pelem->flags,
> > +                                     };
> > +                                     bpf_map_update_elem(&owner_stat, &ckey, &first,
> > +                                                         BPF_NOEXIST);
> > +                             } else {
> > +                                     __sync_fetch_and_add(&cdata->total_time, duration);
> > +                                     __sync_fetch_and_add(&cdata->count, 1);
> > +
> > +                                     /* FIXME: need atomic operations */
> > +                                     if (cdata->max_time < duration)
> > +                                             cdata->max_time = duration;
> > +                                     if (cdata->min_time > duration)
> > +                                             cdata->min_time = duration;
> > +                             }
>
> And as I said before, can we move this block out as a function?
>
> > +
> > +                             otdata->timestamp = pelem->timestamp;
> > +                             otdata->stack_id = id;
> > +                     }
> > +             }
> > +skip_owner_begin:
> >               pelem->stack_id = bpf_get_stackid(ctx, &stacks,
> >                                                 BPF_F_FAST_STACK_CMP | stack_skip);
> >               if (pelem->stack_id < 0)
> > @@ -473,6 +587,7 @@ int contention_end(u64 *ctx)
> >       struct tstamp_data *pelem;
> >       struct contention_key key = {};
> >       struct contention_data *data;
> > +     __u64 timestamp;
> >       __u64 duration;
> >       bool need_delete = false;
> >
> > @@ -500,12 +615,143 @@ int contention_end(u64 *ctx)
> >               need_delete = true;
> >       }
> >
> > -     duration = bpf_ktime_get_ns() - pelem->timestamp;
> > +     timestamp = bpf_ktime_get_ns();
> > +     duration = timestamp - pelem->timestamp;
> >       if ((__s64)duration < 0) {
> >               __sync_fetch_and_add(&time_fail, 1);
> >               goto out;
> >       }
> >
> > +     if (needs_callstack && lock_owner) {
> > +             u64 owner_time;
> > +             struct contention_key ckey = {};
> > +             struct contention_data *cdata;
> > +             struct owner_tracing_data *otdata;
> > +
> > +             otdata = bpf_map_lookup_elem(&owner_data, &pelem->lock);
> > +             if (!otdata)
> > +                     goto skip_owner_end;
> > +
> > +             /* Update `owner_stat` */
> > +             owner_time = timestamp - otdata->timestamp;
> > +             ckey.stack_id = otdata->stack_id;
> > +             cdata = bpf_map_lookup_elem(&owner_stat, &ckey);
> > +
> > +             if (!cdata) {
> > +                     struct contention_data first = {
> > +                             .total_time = owner_time,
> > +                             .max_time = owner_time,
> > +                             .min_time = owner_time,
> > +                             .count = 1,
> > +                             .flags = pelem->flags,
> > +                     };
> > +                     bpf_map_update_elem(&owner_stat, &ckey, &first, BPF_NOEXIST);
> > +             } else {
> > +                     __sync_fetch_and_add(&cdata->total_time, owner_time);
> > +                     __sync_fetch_and_add(&cdata->count, 1);
> > +
> > +                     /* FIXME: need atomic operations */
> > +                     if (cdata->max_time < owner_time)
> > +                             cdata->max_time = owner_time;
> > +                     if (cdata->min_time > owner_time)
> > +                             cdata->min_time = owner_time;
> > +             }
> > +
> > +             /* No contention is occurring, delete `lock` entry in `owner_data` */
> > +             if (otdata->count <= 1)
> > +                     bpf_map_delete_elem(&owner_data, &pelem->lock);
> > +             /*
> > +              * Contention is still ongoing, with a new owner (current task). `owner_data`
> > +              * should be updated accordingly.
> > +              */
> > +             else {
> > +                     u32 i = 0;
> > +                     u64 *buf;
> > +
> > +                     __sync_fetch_and_add(&otdata->count, -1);
> > +
> > +                     buf = bpf_map_lookup_elem(&stack_buf, &i);
> > +                     if (!buf)
> > +                             goto skip_owner_end;
> > +                     for (i = 0; i < (u32)max_stack; i++)
> > +                             buf[i] = 0x0;
> > +
> > +                     /*
> > +                      * ctx[1] has the return code of the lock function.
>
> Then I think it's clearer to have a local variable named 'ret' or so.
>
>
> > +                      * If ctx[1] is not 0, the current task terminates lock waiting without
> > +                      * acquiring it. Owner is not changed, but we still need to update the owner
> > +                      * stack.
> > +                      */
> > +                     if (!ctx[1]) {
>
> This doesn't match to the comment.  It should be:
>
>                         if (ret < 0) {
>
>
> > +                             s32 id = 0;
> > +                             struct task_struct *task = NULL;
> > +
> > +                             if (bpf_task_from_pid)
>
> Same as the above.  No need to go down if you cannot get the task and
> stack.
>
>                                 if (bpf_task_from_pid == NULL)
>                                         goto skip_owner;
>
>
> > +                                     task = bpf_task_from_pid(otdata->pid);
> > +
> > +                             if (task) {
> > +                                     bpf_get_task_stack(task, buf,
> > +                                                        max_stack * sizeof(unsigned long), 0);
> > +                                     bpf_task_release(task);
> > +                             }
> > +
> > +                             id = get_owner_stack_id(buf);
> > +
> > +                             /*
> > +                              * If owner stack is changed, update `owner_data` and `owner_stat`
> > +                              * accordingly.
> > +                              */
> > +                             if (id != otdata->stack_id) {
> > +                                     u64 duration = otdata->timestamp - pelem->timestamp;
> > +                                     struct contention_key ckey = {
> > +                                             .stack_id = id,
> > +                                             .pid = 0,
> > +                                             .lock_addr_or_cgroup = 0,
> > +                                     };
> > +                                     struct contention_data *cdata =
> > +                                             bpf_map_lookup_elem(&owner_stat, &ckey);
> > +
> > +                                     if (!cdata) {
> > +                                             struct contention_data first = {
> > +                                                     .total_time = duration,
> > +                                                     .max_time = duration,
> > +                                                     .min_time = duration,
> > +                                                     .count = 1,
> > +                                                     .flags = pelem->flags,
> > +                                             };
> > +                                             bpf_map_update_elem(&owner_stat, &ckey, &first,
> > +                                                                 BPF_NOEXIST);
> > +                                     } else {
> > +                                             __sync_fetch_and_add(&cdata->total_time, duration);
> > +                                             __sync_fetch_and_add(&cdata->count, 1);
> > +
> > +                                             /* FIXME: need atomic operations */
> > +                                             if (cdata->max_time < duration)
> > +                                                     cdata->max_time = duration;
> > +                                             if (cdata->min_time > duration)
> > +                                                     cdata->min_time = duration;
> > +                                     }
> > +
> > +                                     otdata->timestamp = pelem->timestamp;
> > +                                     otdata->stack_id = id;
> > +                             }
> > +                     }
> > +                     /*
> > +                      * If ctx[1] is 0, then update tracinng data with the current task, which is
> > +                      * the new owner.
> > +                      */
> > +                     else {
> > +                             otdata->pid = pid;
> > +                             otdata->timestamp = timestamp;
> > +
> > +                             bpf_get_task_stack(bpf_get_current_task_btf(), buf,
> > +                                                max_stack * sizeof(unsigned long), 0);
>
> This would be meaningless since it's still in the contention path.
> Current callstack will be the same as the waiter callstack.  You'd
> better just invalidate callstack here and let the next waiter update
> it.

I wonder why this is meaningless. In this situation, the lock owner is
transferred to the current task, and there is at least one more
waiter, the contention is still ongoing. `otdata` is for tracing the
lock owner, so it should be correctly updated with the new owner,
which is the current task.
>
> Thanks,
> Namhyung
>
>
> > +                             otdata->stack_id = get_owner_stack_id(buf);
> > +                     }
> > +             }
> > +     }
> > +skip_owner_end:
> > +
> >       switch (aggr_mode) {
> >       case LOCK_AGGR_CALLER:
> >               key.stack_id = pelem->stack_id;
> > --
> > 2.48.1.362.g079036d154-goog
> >

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v4 2/5] perf lock: Retrieve owner callstack in bpf program
  2025-02-11 23:26     ` Chun-Tse Shao
@ 2025-02-12  1:45       ` Namhyung Kim
  0 siblings, 0 replies; 10+ messages in thread
From: Namhyung Kim @ 2025-02-12  1:45 UTC (permalink / raw)
  To: Chun-Tse Shao
  Cc: linux-kernel, peterz, mingo, acme, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	linux-perf-users, bpf

On Tue, Feb 11, 2025 at 03:26:04PM -0800, Chun-Tse Shao wrote:
> Hi Namhyung, thanks for your review first!
> 
> 
> On Fri, Jan 31, 2025 at 3:48 PM Namhyung Kim <namhyung@kernel.org> wrote:
> >
> > On Wed, Jan 29, 2025 at 09:21:36PM -0800, Chun-Tse Shao wrote:
> > > Tracks lock contention by tracing owner callstacks in
> > > `contention_begin()` and `contention_end()`, storing data in the
> > > owner_stat BPF map. `contention_begin()` records the owner and their
> > > callstack. `contention_end()` updates contention statistics (count,
> > > time), decrements the waiter count, and removes the record when no
> > > waiters remain. Statistics are also updated if the owner's callstack
> > > changes. Note that owner and its callstack retrieval may fail.
> > >
> > > To elaborate the process in detail:
> > >   /*
> > >    * In `contention_begin(), the current task is the lock waiter`. We
> > >    * create/update `owner_data` for the given `lock` address.
> > >   contention_begin() {
> > >     Try to get owner task. Skip entire process if fails.
> > >     Try to get owner stack based on task. Use empty stack if fails.
> > >     Store owner stack into `owner_stacks` and create `stack_id`. If fail
> > >       to store, use negative `stack_id`, which will be ignored while
> > >       reporting in usermode.
> > >     Retrieve `owner_tracing_data` in `owner_data` with given `lock`
> > >       address.
> > >
> > >     /*
> > >      * The first case means contention just happens, or mismatched owner
> > >      * infomation so we just drop the previous record.
> > >      */
> > >     if (`owner_tracing_data` does not exist ||
> > >         the recorded owner `pid` does not match with the detected owner) {
> > >       Create `owner_tracing_data` with info from detected owner, and
> > >         store it in `owner_data` with key `lock` address.
> > >     }
> > >     /*
> > >      * The second case means contention is on going. One more waiter is
> > >      * joining the lock contention. Both `owner_data` and `owner_stat`
> > >      * should be updated.
> > >      */
> > >     else {
> > >       `owner_tracing_data.count`++
> > >
> > >       Create `contention_key` with owner `stack_id` and lookup
> > >         `contention_data` in `owner_stat`.
> > >       if (`contention_data` does not exist) {
> > >         Create new entry for `contention_key`:`contention_data` in
> > >           `owner_stat`.
> > >       }
> > >       else {
> > >         Update the `count` and `total_time` in existing
> > >         `contention_data`.
> > >       }
> > >
> > >       Update `timestamp` and `stack_id` in `owner_tracing_data`.
> > >     }
> > >   }
> > >
> > >   /*
> > >    * In `contention_end()`, the current task will be the new owner of
> > >    * the `lock`, if `ctx[1]` is not 0.
> > >    */
> > >   contention_end() {
> > >     Lookup `owner_tracing_data` in `owner_data` with key `lock`.
> > >
> > >     Create `contention_key` with `owner_tracing_data.stack_id` and
> > >       lookup `contention_data` in `owner_stat`.
> > >     if (`contention_data` does not exist) {
> > >       Create new entry for `contention_key`:`contention_data` in
> > >         `owner_stat`.
> > >     }
> > >     else {
> > >       Update the `count` and `total_time` in existing `contention_data`.
> > >     }
> > >
> > >     /*
> > >      * There is no more waiters, contention is over, delete the record.
> > >      */
> > >     if (`owner_tracing_data.count` <= 1) {
> > >       delete this record in `owner_data`.
> > >     }
> > >     /*
> > >      * Contention is still on going.
> > >      */
> > >     else {
> > >       `owner_tracing_data.count`--
> > >
> > >       if (`!ctx[0]`) {
> > >         The current task exits without acquiring the lock. However we
> > >           check for the recorded owner if the stack is changed, and
> > >           update `onwer_data` and `owner_stat` accordingly.
> > >       }
> > >       else {
> > >         The current task is the new owner, retrieve its stack, store it
> > >           in `owner_stack` and update `owner_tracing_data`.
> > >       }
> > >     }
> > >   }
> >
> > I think this is too much detail. :)
> >
> > I'd say something like this:
> >
> > This implements per-callstack aggregation of lock owners in addition to
> > per-thread.  The owner callstack is captured using bpf_get_task_stack()
> > at contention_begin and it also adds a custom stackid function for the
> > owner stacks to be compared easily.
> >
> > The owner info is kept in a hash map using lock addr as a key to handle
> > multiple waiters for the same lock.  At contention_end, it updates the
> > owner lock stat based on the info that was saved at contention_begin.
> > If there are more waiters, it'd update the owner pid to itself as
> > contention_end means it gets the lock now.  But it also needs to check
> > the return value of the lock function in case task was killed by a signal
> > or something.
> >
> 
> Thanks, I will just reuse this description. :D
> 
> > >
> > > Signed-off-by: Chun-Tse Shao <ctshao@google.com>
> > > ---
> > >  .../perf/util/bpf_skel/lock_contention.bpf.c  | 248 +++++++++++++++++-
> > >  1 file changed, 247 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/tools/perf/util/bpf_skel/lock_contention.bpf.c b/tools/perf/util/bpf_skel/lock_contention.bpf.c
> > > index 23fe9cc980ae..b12df873379f 100644
> > > --- a/tools/perf/util/bpf_skel/lock_contention.bpf.c
> > > +++ b/tools/perf/util/bpf_skel/lock_contention.bpf.c
> > > @@ -197,6 +197,9 @@ int data_fail;
> > >  int task_map_full;
> > >  int data_map_full;
> > >
> > > +struct task_struct *bpf_task_from_pid(s32 pid) __ksym __weak;
> > > +void bpf_task_release(struct task_struct *p) __ksym __weak;
> > > +
> > >  static inline __u64 get_current_cgroup_id(void)
> > >  {
> > >       struct task_struct *task;
> > > @@ -420,6 +423,26 @@ static inline struct tstamp_data *get_tstamp_elem(__u32 flags)
> > >       return pelem;
> > >  }
> > >
> > > +static inline s32 get_owner_stack_id(u64 *stacktrace)
> > > +{
> > > +     s32 *id, new_id;
> > > +     static s64 id_gen = 1;
> > > +
> > > +     id = bpf_map_lookup_elem(&owner_stacks, stacktrace);
> > > +     if (id)
> > > +             return *id;
> > > +
> > > +     new_id = (s32)__sync_fetch_and_add(&id_gen, 1);
> > > +
> > > +     bpf_map_update_elem(&owner_stacks, stacktrace, &new_id, BPF_NOEXIST);
> > > +
> > > +     id = bpf_map_lookup_elem(&owner_stacks, stacktrace);
> > > +     if (id)
> > > +             return *id;
> > > +
> > > +     return -1;
> > > +}
> > > +
> > >  SEC("tp_btf/contention_begin")
> > >  int contention_begin(u64 *ctx)
> > >  {
> > > @@ -437,6 +460,97 @@ int contention_begin(u64 *ctx)
> > >       pelem->flags = (__u32)ctx[1];
> > >
> > >       if (needs_callstack) {
> > > +             u32 i = 0;
> > > +             u32 id = 0;
> > > +             int owner_pid;
> > > +             u64 *buf;
> > > +             struct task_struct *task;
> > > +             struct owner_tracing_data *otdata;
> > > +
> > > +             if (!lock_owner)
> > > +                     goto skip_owner_begin;
> > > +
> > > +             task = get_lock_owner(pelem->lock, pelem->flags);
> > > +             if (!task)
> > > +                     goto skip_owner_begin;
> > > +
> > > +             owner_pid = BPF_CORE_READ(task, pid);
> > > +
> > > +             buf = bpf_map_lookup_elem(&stack_buf, &i);
> > > +             if (!buf)
> > > +                     goto skip_owner_begin;
> > > +             for (i = 0; i < max_stack; i++)
> > > +                     buf[i] = 0x0;
> > > +
> > > +             if (bpf_task_from_pid) {
> >
> > I think you can do this instead:
> >
> >                 if (bpf_task_from_pid == NULL)
> >                         goto skip_owner_begin;
> >
> > nit: it can be just 'skip_owner'. :)
> >
> >
> > > +                     task = bpf_task_from_pid(owner_pid);
> > > +                     if (task) {
> > > +                             bpf_get_task_stack(task, buf, max_stack * sizeof(unsigned long), 0);
> > > +                             bpf_task_release(task);
> > > +                     }
> > > +             }
> > > +
> > > +             otdata = bpf_map_lookup_elem(&owner_data, &pelem->lock);
> > > +             id = get_owner_stack_id(buf);
> > > +
> > > +             /*
> > > +              * Contention just happens, or corner case `lock` is owned by process not
> > > +              * `owner_pid`. For the corner case we treat it as unexpected internal error and
> > > +              * just ignore the precvious tracing record.
> > > +              */
> > > +             if (!otdata || otdata->pid != owner_pid) {
> > > +                     struct owner_tracing_data first = {
> > > +                             .pid = owner_pid,
> > > +                             .timestamp = pelem->timestamp,
> > > +                             .count = 1,
> > > +                             .stack_id = id,
> > > +                     };
> > > +                     bpf_map_update_elem(&owner_data, &pelem->lock, &first, BPF_ANY);
> > > +             }
> > > +             /* Contention is ongoing and new waiter joins */
> > > +             else {
> > > +                     __sync_fetch_and_add(&otdata->count, 1);
> > > +
> > > +                     /*
> > > +                      * The owner is the same, but stacktrace might be changed. In this case we
> > > +                      * store/update `owner_stat` based on current owner stack id.
> > > +                      */
> > > +                     if (id != otdata->stack_id) {
> > > +                             u64 duration = otdata->timestamp - pelem->timestamp;
> >
> > Isn't it the opposite?
> >
> >         u64 duration = pelem->timestamp - otdata->timestamp;
> >
> >
> > > +                             struct contention_key ckey = {
> > > +                                     .stack_id = id,
> > > +                                     .pid = 0,
> > > +                                     .lock_addr_or_cgroup = 0,
> > > +                             };
> > > +                             struct contention_data *cdata =
> > > +                                     bpf_map_lookup_elem(&owner_stat, &ckey);
> > > +
> > > +                             if (!cdata) {
> > > +                                     struct contention_data first = {
> > > +                                             .total_time = duration,
> > > +                                             .max_time = duration,
> > > +                                             .min_time = duration,
> > > +                                             .count = 1,
> > > +                                             .flags = pelem->flags,
> > > +                                     };
> > > +                                     bpf_map_update_elem(&owner_stat, &ckey, &first,
> > > +                                                         BPF_NOEXIST);
> > > +                             } else {
> > > +                                     __sync_fetch_and_add(&cdata->total_time, duration);
> > > +                                     __sync_fetch_and_add(&cdata->count, 1);
> > > +
> > > +                                     /* FIXME: need atomic operations */
> > > +                                     if (cdata->max_time < duration)
> > > +                                             cdata->max_time = duration;
> > > +                                     if (cdata->min_time > duration)
> > > +                                             cdata->min_time = duration;
> > > +                             }
> >
> > And as I said before, can we move this block out as a function?
> >
> > > +
> > > +                             otdata->timestamp = pelem->timestamp;
> > > +                             otdata->stack_id = id;
> > > +                     }
> > > +             }
> > > +skip_owner_begin:
> > >               pelem->stack_id = bpf_get_stackid(ctx, &stacks,
> > >                                                 BPF_F_FAST_STACK_CMP | stack_skip);
> > >               if (pelem->stack_id < 0)
> > > @@ -473,6 +587,7 @@ int contention_end(u64 *ctx)
> > >       struct tstamp_data *pelem;
> > >       struct contention_key key = {};
> > >       struct contention_data *data;
> > > +     __u64 timestamp;
> > >       __u64 duration;
> > >       bool need_delete = false;
> > >
> > > @@ -500,12 +615,143 @@ int contention_end(u64 *ctx)
> > >               need_delete = true;
> > >       }
> > >
> > > -     duration = bpf_ktime_get_ns() - pelem->timestamp;
> > > +     timestamp = bpf_ktime_get_ns();
> > > +     duration = timestamp - pelem->timestamp;
> > >       if ((__s64)duration < 0) {
> > >               __sync_fetch_and_add(&time_fail, 1);
> > >               goto out;
> > >       }
> > >
> > > +     if (needs_callstack && lock_owner) {
> > > +             u64 owner_time;
> > > +             struct contention_key ckey = {};
> > > +             struct contention_data *cdata;
> > > +             struct owner_tracing_data *otdata;
> > > +
> > > +             otdata = bpf_map_lookup_elem(&owner_data, &pelem->lock);
> > > +             if (!otdata)
> > > +                     goto skip_owner_end;
> > > +
> > > +             /* Update `owner_stat` */
> > > +             owner_time = timestamp - otdata->timestamp;
> > > +             ckey.stack_id = otdata->stack_id;
> > > +             cdata = bpf_map_lookup_elem(&owner_stat, &ckey);
> > > +
> > > +             if (!cdata) {
> > > +                     struct contention_data first = {
> > > +                             .total_time = owner_time,
> > > +                             .max_time = owner_time,
> > > +                             .min_time = owner_time,
> > > +                             .count = 1,
> > > +                             .flags = pelem->flags,
> > > +                     };
> > > +                     bpf_map_update_elem(&owner_stat, &ckey, &first, BPF_NOEXIST);
> > > +             } else {
> > > +                     __sync_fetch_and_add(&cdata->total_time, owner_time);
> > > +                     __sync_fetch_and_add(&cdata->count, 1);
> > > +
> > > +                     /* FIXME: need atomic operations */
> > > +                     if (cdata->max_time < owner_time)
> > > +                             cdata->max_time = owner_time;
> > > +                     if (cdata->min_time > owner_time)
> > > +                             cdata->min_time = owner_time;
> > > +             }
> > > +
> > > +             /* No contention is occurring, delete `lock` entry in `owner_data` */
> > > +             if (otdata->count <= 1)
> > > +                     bpf_map_delete_elem(&owner_data, &pelem->lock);
> > > +             /*
> > > +              * Contention is still ongoing, with a new owner (current task). `owner_data`
> > > +              * should be updated accordingly.
> > > +              */
> > > +             else {
> > > +                     u32 i = 0;
> > > +                     u64 *buf;
> > > +
> > > +                     __sync_fetch_and_add(&otdata->count, -1);
> > > +
> > > +                     buf = bpf_map_lookup_elem(&stack_buf, &i);
> > > +                     if (!buf)
> > > +                             goto skip_owner_end;
> > > +                     for (i = 0; i < (u32)max_stack; i++)
> > > +                             buf[i] = 0x0;
> > > +
> > > +                     /*
> > > +                      * ctx[1] has the return code of the lock function.
> >
> > Then I think it's clearer to have a local variable named 'ret' or so.
> >
> >
> > > +                      * If ctx[1] is not 0, the current task terminates lock waiting without
> > > +                      * acquiring it. Owner is not changed, but we still need to update the owner
> > > +                      * stack.
> > > +                      */
> > > +                     if (!ctx[1]) {
> >
> > This doesn't match to the comment.  It should be:
> >
> >                         if (ret < 0) {
> >
> >
> > > +                             s32 id = 0;
> > > +                             struct task_struct *task = NULL;
> > > +
> > > +                             if (bpf_task_from_pid)
> >
> > Same as the above.  No need to go down if you cannot get the task and
> > stack.
> >
> >                                 if (bpf_task_from_pid == NULL)
> >                                         goto skip_owner;
> >
> >
> > > +                                     task = bpf_task_from_pid(otdata->pid);
> > > +
> > > +                             if (task) {
> > > +                                     bpf_get_task_stack(task, buf,
> > > +                                                        max_stack * sizeof(unsigned long), 0);
> > > +                                     bpf_task_release(task);
> > > +                             }
> > > +
> > > +                             id = get_owner_stack_id(buf);
> > > +
> > > +                             /*
> > > +                              * If owner stack is changed, update `owner_data` and `owner_stat`
> > > +                              * accordingly.
> > > +                              */
> > > +                             if (id != otdata->stack_id) {
> > > +                                     u64 duration = otdata->timestamp - pelem->timestamp;
> > > +                                     struct contention_key ckey = {
> > > +                                             .stack_id = id,
> > > +                                             .pid = 0,
> > > +                                             .lock_addr_or_cgroup = 0,
> > > +                                     };
> > > +                                     struct contention_data *cdata =
> > > +                                             bpf_map_lookup_elem(&owner_stat, &ckey);
> > > +
> > > +                                     if (!cdata) {
> > > +                                             struct contention_data first = {
> > > +                                                     .total_time = duration,
> > > +                                                     .max_time = duration,
> > > +                                                     .min_time = duration,
> > > +                                                     .count = 1,
> > > +                                                     .flags = pelem->flags,
> > > +                                             };
> > > +                                             bpf_map_update_elem(&owner_stat, &ckey, &first,
> > > +                                                                 BPF_NOEXIST);
> > > +                                     } else {
> > > +                                             __sync_fetch_and_add(&cdata->total_time, duration);
> > > +                                             __sync_fetch_and_add(&cdata->count, 1);
> > > +
> > > +                                             /* FIXME: need atomic operations */
> > > +                                             if (cdata->max_time < duration)
> > > +                                                     cdata->max_time = duration;
> > > +                                             if (cdata->min_time > duration)
> > > +                                                     cdata->min_time = duration;
> > > +                                     }
> > > +
> > > +                                     otdata->timestamp = pelem->timestamp;
> > > +                                     otdata->stack_id = id;
> > > +                             }
> > > +                     }
> > > +                     /*
> > > +                      * If ctx[1] is 0, then update tracinng data with the current task, which is
> > > +                      * the new owner.
> > > +                      */
> > > +                     else {
> > > +                             otdata->pid = pid;
> > > +                             otdata->timestamp = timestamp;
> > > +
> > > +                             bpf_get_task_stack(bpf_get_current_task_btf(), buf,
> > > +                                                max_stack * sizeof(unsigned long), 0);
> >
> > This would be meaningless since it's still in the contention path.
> > Current callstack will be the same as the waiter callstack.  You'd
> > better just invalidate callstack here and let the next waiter update
> > it.
> 
> I wonder why this is meaningless. In this situation, the lock owner is
> transferred to the current task, and there is at least one more
> waiter, the contention is still ongoing. `otdata` is for tracing the
> lock owner, so it should be correctly updated with the new owner,
> which is the current task.

Yep, but I meant it has the same callstack as with waiters and provides
no additional information about the owner.

Thanks,
Namhyung

> >
> > > +                             otdata->stack_id = get_owner_stack_id(buf);
> > > +                     }
> > > +             }
> > > +     }
> > > +skip_owner_end:
> > > +
> > >       switch (aggr_mode) {
> > >       case LOCK_AGGR_CALLER:
> > >               key.stack_id = pelem->stack_id;
> > > --
> > > 2.48.1.362.g079036d154-goog
> > >

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2025-02-12  1:45 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-01-30  5:21 [PATCH v4 0/5] Tracing contention lock owner call stack Chun-Tse Shao
2025-01-30  5:21 ` [PATCH v4 1/5] perf lock: Add bpf maps for owner stack tracing Chun-Tse Shao
2025-01-30  5:21 ` [PATCH v4 2/5] perf lock: Retrieve owner callstack in bpf program Chun-Tse Shao
2025-01-31 23:48   ` Namhyung Kim
2025-02-11 23:26     ` Chun-Tse Shao
2025-02-12  1:45       ` Namhyung Kim
2025-01-30  5:21 ` [PATCH v4 3/5] perf lock: Make rb_tree helper functions generic Chun-Tse Shao
2025-01-30  5:21 ` [PATCH v4 4/5] perf lock: Report owner stack in usermode Chun-Tse Shao
2025-01-30  5:21 ` [PATCH v4 5/5] perf lock: Update documentation for -o option in contention mode Chun-Tse Shao
2025-01-31 22:51 ` [PATCH v4 0/5] Tracing contention lock owner call stack Namhyung Kim

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).