* [PATCH v3 1/3] perf lock contention: Clear lock addr after use
@ 2023-10-20 20:47 Namhyung Kim
  2023-10-20 20:47 ` [PATCH v3 2/3] perf lock contention: Check race in tstamp elem creation Namhyung Kim
                   ` (3 more replies)
  0 siblings, 4 replies; 7+ messages in thread
From: Namhyung Kim @ 2023-10-20 20:47 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: Ian Rogers, Adrian Hunter, Peter Zijlstra, Ingo Molnar, LKML,
	linux-perf-users, Song Liu, Hao Luo, bpf

It checks the current lock to calculate the delta of the contention
time.  The address is saved in the tstamp map, which is allocated at
the beginning of contention and released at the end of contention.

But it's possible for bpf_map_delete_elem() to fail.  In that case, the
element in the tstamp map is kept for the current lock, and it makes the
next contention for the same lock be tracked incorrectly.  Specifically,
the next contention begin will see the existing element for the task and
just return.  Then the next contention end will see the element and
calculate the time using the timestamp from the previous begin.

This can result in one large reported value covering two small
contentions that happened some time apart.  Let's clear the lock address
so that it can be updated next time even if bpf_map_delete_elem() fails.
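
A minimal sketch of the resulting exit pattern (the tstamp_release()
helper is hypothetical; the actual patch open-codes these two lines at
each exit path of contention_end(), as the diff below shows):

/* Invalidate the saved lock address before the delete: even if
 * bpf_map_delete_elem() fails, the stale element can no longer match
 * the next contention_begin/end pair for this task, and contention
 * begin will simply refill it.
 */
static inline void tstamp_release(struct tstamp_data *pelem, __u32 pid)
{
	pelem->lock = 0;                    /* always succeeds */
	bpf_map_delete_elem(&tstamp, &pid); /* best effort; may fail */
}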

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/perf/util/bpf_skel/lock_contention.bpf.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/tools/perf/util/bpf_skel/lock_contention.bpf.c b/tools/perf/util/bpf_skel/lock_contention.bpf.c
index 4900a5dfb4a4..b11179452e19 100644
--- a/tools/perf/util/bpf_skel/lock_contention.bpf.c
+++ b/tools/perf/util/bpf_skel/lock_contention.bpf.c
@@ -389,6 +389,7 @@ int contention_end(u64 *ctx)
 
 	duration = bpf_ktime_get_ns() - pelem->timestamp;
 	if ((__s64)duration < 0) {
+		pelem->lock = 0;
 		bpf_map_delete_elem(&tstamp, &pid);
 		__sync_fetch_and_add(&time_fail, 1);
 		return 0;
@@ -422,6 +423,7 @@ int contention_end(u64 *ctx)
 	data = bpf_map_lookup_elem(&lock_stat, &key);
 	if (!data) {
 		if (data_map_full) {
+			pelem->lock = 0;
 			bpf_map_delete_elem(&tstamp, &pid);
 			__sync_fetch_and_add(&data_fail, 1);
 			return 0;
@@ -445,6 +447,7 @@ int contention_end(u64 *ctx)
 				data_map_full = 1;
 			__sync_fetch_and_add(&data_fail, 1);
 		}
+		pelem->lock = 0;
 		bpf_map_delete_elem(&tstamp, &pid);
 		return 0;
 	}
@@ -458,6 +461,7 @@ int contention_end(u64 *ctx)
 	if (data->min_time > duration)
 		data->min_time = duration;
 
+	pelem->lock = 0;
 	bpf_map_delete_elem(&tstamp, &pid);
 	return 0;
 }
-- 
2.42.0.655.g421f12c284-goog



* [PATCH v3 2/3] perf lock contention: Check race in tstamp elem creation
  2023-10-20 20:47 [PATCH v3 1/3] perf lock contention: Clear lock addr after use Namhyung Kim
@ 2023-10-20 20:47 ` Namhyung Kim
  2023-10-25  4:13   ` Ian Rogers
  2023-10-20 20:47 ` [PATCH v3 3/3] perf lock contention: Use per-cpu array map for spinlocks Namhyung Kim
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 7+ messages in thread
From: Namhyung Kim @ 2023-10-20 20:47 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: Ian Rogers, Adrian Hunter, Peter Zijlstra, Ingo Molnar, LKML,
	linux-perf-users, Song Liu, Hao Luo, bpf

When pelem is NULL, it creates a new entry with zero data.  But the
task might be preempted by an IRQ/NMI just before calling
bpf_map_update_elem(), and then there's a chance the update is done
twice for the same pid.  So it'd be better to use the BPF_NOEXIST flag
and check the return value to prevent the race.
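
A condensed sketch of the fix (matching the diff below), with a comment
spelling out the interleaving it guards against:

if (pelem == NULL) {
	struct tstamp_data zero = {};

	/* An IRQ/NMI can hit after the lookup returned NULL and run
	 * its own contention_begin(), creating and filling the element
	 * for this pid.  An unconditional update (BPF_ANY) would then
	 * wipe the live timestamp with zeros.  With BPF_NOEXIST the
	 * second update fails instead and this probe backs off.
	 */
	if (bpf_map_update_elem(&tstamp, &pid, &zero, BPF_NOEXIST) < 0) {
		__sync_fetch_and_add(&task_fail, 1);
		return 0;
	}
}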

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/perf/util/bpf_skel/lock_contention.bpf.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/bpf_skel/lock_contention.bpf.c b/tools/perf/util/bpf_skel/lock_contention.bpf.c
index b11179452e19..69d31fd77cd0 100644
--- a/tools/perf/util/bpf_skel/lock_contention.bpf.c
+++ b/tools/perf/util/bpf_skel/lock_contention.bpf.c
@@ -328,7 +328,11 @@ int contention_begin(u64 *ctx)
 	if (pelem == NULL) {
 		struct tstamp_data zero = {};
 
-		bpf_map_update_elem(&tstamp, &pid, &zero, BPF_ANY);
+		if (bpf_map_update_elem(&tstamp, &pid, &zero, BPF_NOEXIST) < 0) {
+			__sync_fetch_and_add(&task_fail, 1);
+			return 0;
+		}
+
 		pelem = bpf_map_lookup_elem(&tstamp, &pid);
 		if (pelem == NULL) {
 			__sync_fetch_and_add(&task_fail, 1);
-- 
2.42.0.655.g421f12c284-goog



* [PATCH v3 3/3] perf lock contention: Use per-cpu array map for spinlocks
  2023-10-20 20:47 [PATCH v3 1/3] perf lock contention: Clear lock addr after use Namhyung Kim
  2023-10-20 20:47 ` [PATCH v3 2/3] perf lock contention: Check race in tstamp elem creation Namhyung Kim
@ 2023-10-20 20:47 ` Namhyung Kim
  2023-10-25  4:15   ` Ian Rogers
  2023-10-25  4:13 ` [PATCH v3 1/3] perf lock contention: Clear lock addr after use Ian Rogers
  2023-10-26 17:11 ` Namhyung Kim
  3 siblings, 1 reply; 7+ messages in thread
From: Namhyung Kim @ 2023-10-20 20:47 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: Ian Rogers, Adrian Hunter, Peter Zijlstra, Ingo Molnar, LKML,
	linux-perf-users, Song Liu, Hao Luo, bpf

Currently the lock contention timestamp is maintained in a hash map
keyed by pid.  That means it needs to get and release a map element
(which is protected by a spinlock!) on each contention begin and end
pair.  This can impact performance if there is a lot of contention
(usually from spinlocks).

It used to use task local storage, but that had an issue with memory
allocation in some critical paths.  Although that's addressed in recent
kernels IIUC, the tool should support old kernels too.  So it cannot
simply switch to task local storage, at least for now.

As spinlocks create a lot of contention and disable preemption while
spinning, a per-cpu array can be used to keep the timestamp, avoiding
the overhead of hash map updates and deletes.

In contention_begin, it's easy to check the lock type since it can see
the flags, but contention_end cannot see them.  So let's try the
per-cpu array first (unconditionally): if it has an active element
(lock != 0), that element should be used, and the per-task tstamp map
should not be used until the per-cpu array element is cleared, which
means the nested spinlock contention (if any) has finished and it now
sees the (outer) lock.
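
A condensed sketch of the resulting dispatch at the top of
contention_end() (extracted from the diff below):

__u32 pid = 0, idx = 0;
bool need_delete = false;
struct tstamp_data *pelem;

/* Probe the per-cpu slot first: an active entry (lock != 0) can only
 * belong to a spinning lock, because preemption was disabled when that
 * contention began.
 */
pelem = bpf_map_lookup_elem(&tstamp_cpu, &idx);
if (pelem && pelem->lock) {
	if (pelem->lock != ctx[0])
		return 0;  /* ends a different lock than the tracked one */
} else {
	/* No spinning lock in flight: fall back to the per-task map. */
	pid = bpf_get_current_pid_tgid();
	pelem = bpf_map_lookup_elem(&tstamp, &pid);
	if (!pelem || pelem->lock != ctx[0])
		return 0;
	need_delete = true;  /* hash elements are deleted, per-cpu cleared */
}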

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 .../perf/util/bpf_skel/lock_contention.bpf.c  | 89 +++++++++++++++----
 1 file changed, 72 insertions(+), 17 deletions(-)

diff --git a/tools/perf/util/bpf_skel/lock_contention.bpf.c b/tools/perf/util/bpf_skel/lock_contention.bpf.c
index 69d31fd77cd0..95cd8414f6ef 100644
--- a/tools/perf/util/bpf_skel/lock_contention.bpf.c
+++ b/tools/perf/util/bpf_skel/lock_contention.bpf.c
@@ -42,6 +42,14 @@ struct {
 	__uint(max_entries, MAX_ENTRIES);
 } tstamp SEC(".maps");
 
+/* maintain per-CPU timestamp at the beginning of contention */
+struct {
+	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, sizeof(struct tstamp_data));
+	__uint(max_entries, 1);
+} tstamp_cpu SEC(".maps");
+
 /* actual lock contention statistics */
 struct {
 	__uint(type, BPF_MAP_TYPE_HASH);
@@ -311,34 +319,57 @@ static inline __u32 check_lock_type(__u64 lock, __u32 flags)
 	return 0;
 }
 
-SEC("tp_btf/contention_begin")
-int contention_begin(u64 *ctx)
+static inline struct tstamp_data *get_tstamp_elem(__u32 flags)
 {
 	__u32 pid;
 	struct tstamp_data *pelem;
 
-	if (!enabled || !can_record(ctx))
-		return 0;
+	/* Use per-cpu array map for spinlock and rwlock */
+	if (flags == (LCB_F_SPIN | LCB_F_READ) || flags == LCB_F_SPIN ||
+	    flags == (LCB_F_SPIN | LCB_F_WRITE)) {
+		__u32 idx = 0;
+
+		pelem = bpf_map_lookup_elem(&tstamp_cpu, &idx);
+		/* Do not update the element for nested locks */
+		if (pelem && pelem->lock)
+			pelem = NULL;
+		return pelem;
+	}
 
 	pid = bpf_get_current_pid_tgid();
 	pelem = bpf_map_lookup_elem(&tstamp, &pid);
+	/* Do not update the element for nested locks */
 	if (pelem && pelem->lock)
-		return 0;
+		return NULL;
 
 	if (pelem == NULL) {
 		struct tstamp_data zero = {};
 
 		if (bpf_map_update_elem(&tstamp, &pid, &zero, BPF_NOEXIST) < 0) {
 			__sync_fetch_and_add(&task_fail, 1);
-			return 0;
+			return NULL;
 		}
 
 		pelem = bpf_map_lookup_elem(&tstamp, &pid);
 		if (pelem == NULL) {
 			__sync_fetch_and_add(&task_fail, 1);
-			return 0;
+			return NULL;
 		}
 	}
+	return pelem;
+}
+
+SEC("tp_btf/contention_begin")
+int contention_begin(u64 *ctx)
+{
+	struct tstamp_data *pelem;
+
+	if (!enabled || !can_record(ctx))
+		return 0;
+
+	pelem = get_tstamp_elem(ctx[1]);
+	if (pelem == NULL)
+		return 0;
 
 	pelem->timestamp = bpf_ktime_get_ns();
 	pelem->lock = (__u64)ctx[0];
@@ -377,24 +408,42 @@ int contention_begin(u64 *ctx)
 SEC("tp_btf/contention_end")
 int contention_end(u64 *ctx)
 {
-	__u32 pid;
+	__u32 pid = 0, idx = 0;
 	struct tstamp_data *pelem;
 	struct contention_key key = {};
 	struct contention_data *data;
 	__u64 duration;
+	bool need_delete = false;
 
 	if (!enabled)
 		return 0;
 
-	pid = bpf_get_current_pid_tgid();
-	pelem = bpf_map_lookup_elem(&tstamp, &pid);
-	if (!pelem || pelem->lock != ctx[0])
-		return 0;
+	/*
+	 * For spinlock and rwlock, the timestamp is kept in the per-cpu
+	 * map.  However, contention_end does not have the flags, so it
+	 * cannot know whether to read the per-cpu or the hash map.
+	 *
+	 * Try the per-cpu map first and check if there's active contention.
+	 * If there is, do not read the hash map: the task cannot have gone
+	 * to sleeping locks before releasing the spinning locks.
+	 */
+	pelem = bpf_map_lookup_elem(&tstamp_cpu, &idx);
+	if (pelem && pelem->lock) {
+		if (pelem->lock != ctx[0])
+			return 0;
+	} else {
+		pid = bpf_get_current_pid_tgid();
+		pelem = bpf_map_lookup_elem(&tstamp, &pid);
+		if (!pelem || pelem->lock != ctx[0])
+			return 0;
+		need_delete = true;
+	}
 
 	duration = bpf_ktime_get_ns() - pelem->timestamp;
 	if ((__s64)duration < 0) {
 		pelem->lock = 0;
-		bpf_map_delete_elem(&tstamp, &pid);
+		if (need_delete)
+			bpf_map_delete_elem(&tstamp, &pid);
 		__sync_fetch_and_add(&time_fail, 1);
 		return 0;
 	}
@@ -406,8 +455,11 @@ int contention_end(u64 *ctx)
 	case LOCK_AGGR_TASK:
 		if (lock_owner)
 			key.pid = pelem->flags;
-		else
+		else {
+			if (!need_delete)
+				pid = bpf_get_current_pid_tgid();
 			key.pid = pid;
+		}
 		if (needs_callstack)
 			key.stack_id = pelem->stack_id;
 		break;
@@ -428,7 +480,8 @@ int contention_end(u64 *ctx)
 	if (!data) {
 		if (data_map_full) {
 			pelem->lock = 0;
-			bpf_map_delete_elem(&tstamp, &pid);
+			if (need_delete)
+				bpf_map_delete_elem(&tstamp, &pid);
 			__sync_fetch_and_add(&data_fail, 1);
 			return 0;
 		}
@@ -452,7 +505,8 @@ int contention_end(u64 *ctx)
 			__sync_fetch_and_add(&data_fail, 1);
 		}
 		pelem->lock = 0;
-		bpf_map_delete_elem(&tstamp, &pid);
+		if (need_delete)
+			bpf_map_delete_elem(&tstamp, &pid);
 		return 0;
 	}
 
@@ -466,7 +520,8 @@ int contention_end(u64 *ctx)
 		data->min_time = duration;
 
 	pelem->lock = 0;
-	bpf_map_delete_elem(&tstamp, &pid);
+	if (need_delete)
+		bpf_map_delete_elem(&tstamp, &pid);
 	return 0;
 }
 
-- 
2.42.0.655.g421f12c284-goog



* Re: [PATCH v3 1/3] perf lock contention: Clear lock addr after use
  2023-10-20 20:47 [PATCH v3 1/3] perf lock contention: Clear lock addr after use Namhyung Kim
  2023-10-20 20:47 ` [PATCH v3 2/3] perf lock contention: Check race in tstamp elem creation Namhyung Kim
  2023-10-20 20:47 ` [PATCH v3 3/3] perf lock contention: Use per-cpu array map for spinlocks Namhyung Kim
@ 2023-10-25  4:13 ` Ian Rogers
  2023-10-26 17:11 ` Namhyung Kim
  3 siblings, 0 replies; 7+ messages in thread
From: Ian Rogers @ 2023-10-25  4:13 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Adrian Hunter,
	Peter Zijlstra, Ingo Molnar, LKML, linux-perf-users, Song Liu,
	Hao Luo, bpf

On Fri, Oct 20, 2023 at 1:47 PM Namhyung Kim <namhyung@kernel.org> wrote:
>
> It checks the current lock to calculate the delta of the contention
> time.  The address is saved in the tstamp map, which is allocated at
> the beginning of contention and released at the end of contention.
>
> But it's possible for bpf_map_delete_elem() to fail.  In that case, the
> element in the tstamp map is kept for the current lock, and it makes the
> next contention for the same lock be tracked incorrectly.  Specifically,
> the next contention begin will see the existing element for the task and
> just return.  Then the next contention end will see the element and
> calculate the time using the timestamp from the previous begin.
>
> This can result in one large reported value covering two small
> contentions that happened some time apart.  Let's clear the lock address
> so that it can be updated next time even if bpf_map_delete_elem() fails.
>
> Signed-off-by: Namhyung Kim <namhyung@kernel.org>

Acked-by: Ian Rogers <irogers@google.com>

Thanks,
Ian

> ---
>  tools/perf/util/bpf_skel/lock_contention.bpf.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/tools/perf/util/bpf_skel/lock_contention.bpf.c b/tools/perf/util/bpf_skel/lock_contention.bpf.c
> index 4900a5dfb4a4..b11179452e19 100644
> --- a/tools/perf/util/bpf_skel/lock_contention.bpf.c
> +++ b/tools/perf/util/bpf_skel/lock_contention.bpf.c
> @@ -389,6 +389,7 @@ int contention_end(u64 *ctx)
>
>         duration = bpf_ktime_get_ns() - pelem->timestamp;
>         if ((__s64)duration < 0) {
> +               pelem->lock = 0;
>                 bpf_map_delete_elem(&tstamp, &pid);
>                 __sync_fetch_and_add(&time_fail, 1);
>                 return 0;
> @@ -422,6 +423,7 @@ int contention_end(u64 *ctx)
>         data = bpf_map_lookup_elem(&lock_stat, &key);
>         if (!data) {
>                 if (data_map_full) {
> +                       pelem->lock = 0;
>                         bpf_map_delete_elem(&tstamp, &pid);
>                         __sync_fetch_and_add(&data_fail, 1);
>                         return 0;
> @@ -445,6 +447,7 @@ int contention_end(u64 *ctx)
>                                 data_map_full = 1;
>                         __sync_fetch_and_add(&data_fail, 1);
>                 }
> +               pelem->lock = 0;
>                 bpf_map_delete_elem(&tstamp, &pid);
>                 return 0;
>         }
> @@ -458,6 +461,7 @@ int contention_end(u64 *ctx)
>         if (data->min_time > duration)
>                 data->min_time = duration;
>
> +       pelem->lock = 0;
>         bpf_map_delete_elem(&tstamp, &pid);
>         return 0;
>  }
> --
> 2.42.0.655.g421f12c284-goog
>


* Re: [PATCH v3 2/3] perf lock contention: Check race in tstamp elem creation
  2023-10-20 20:47 ` [PATCH v3 2/3] perf lock contention: Check race in tstamp elem creation Namhyung Kim
@ 2023-10-25  4:13   ` Ian Rogers
  0 siblings, 0 replies; 7+ messages in thread
From: Ian Rogers @ 2023-10-25  4:13 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Adrian Hunter,
	Peter Zijlstra, Ingo Molnar, LKML, linux-perf-users, Song Liu,
	Hao Luo, bpf

On Fri, Oct 20, 2023 at 1:47 PM Namhyung Kim <namhyung@kernel.org> wrote:
>
> When pelem is NULL, it creates a new entry with zero data.  But the
> task might be preempted by an IRQ/NMI just before calling
> bpf_map_update_elem(), and then there's a chance the update is done
> twice for the same pid.  So it'd be better to use the BPF_NOEXIST flag
> and check the return value to prevent the race.
>
> Signed-off-by: Namhyung Kim <namhyung@kernel.org>

Acked-by: Ian Rogers <irogers@google.com>

Thanks,
Ian
> ---
>  tools/perf/util/bpf_skel/lock_contention.bpf.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/tools/perf/util/bpf_skel/lock_contention.bpf.c b/tools/perf/util/bpf_skel/lock_contention.bpf.c
> index b11179452e19..69d31fd77cd0 100644
> --- a/tools/perf/util/bpf_skel/lock_contention.bpf.c
> +++ b/tools/perf/util/bpf_skel/lock_contention.bpf.c
> @@ -328,7 +328,11 @@ int contention_begin(u64 *ctx)
>         if (pelem == NULL) {
>                 struct tstamp_data zero = {};
>
> -               bpf_map_update_elem(&tstamp, &pid, &zero, BPF_ANY);
> +               if (bpf_map_update_elem(&tstamp, &pid, &zero, BPF_NOEXIST) < 0) {
> +                       __sync_fetch_and_add(&task_fail, 1);
> +                       return 0;
> +               }
> +
>                 pelem = bpf_map_lookup_elem(&tstamp, &pid);
>                 if (pelem == NULL) {
>                         __sync_fetch_and_add(&task_fail, 1);
> --
> 2.42.0.655.g421f12c284-goog
>


* Re: [PATCH v3 3/3] perf lock contention: Use per-cpu array map for spinlocks
  2023-10-20 20:47 ` [PATCH v3 3/3] perf lock contention: Use per-cpu array map for spinlocks Namhyung Kim
@ 2023-10-25  4:15   ` Ian Rogers
  0 siblings, 0 replies; 7+ messages in thread
From: Ian Rogers @ 2023-10-25  4:15 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Adrian Hunter,
	Peter Zijlstra, Ingo Molnar, LKML, linux-perf-users, Song Liu,
	Hao Luo, bpf

On Fri, Oct 20, 2023 at 1:47 PM Namhyung Kim <namhyung@kernel.org> wrote:
>
> Currently the lock contention timestamp is maintained in a hash map
> keyed by pid.  That means it needs to get and release a map element
> (which is protected by a spinlock!) on each contention begin and end
> pair.  This can impact performance if there is a lot of contention
> (usually from spinlocks).
>
> It used to use task local storage, but that had an issue with memory
> allocation in some critical paths.  Although that's addressed in recent
> kernels IIUC, the tool should support old kernels too.  So it cannot
> simply switch to task local storage, at least for now.
>
> As spinlocks create a lot of contention and disable preemption while
> spinning, a per-cpu array can be used to keep the timestamp, avoiding
> the overhead of hash map updates and deletes.
>
> In contention_begin, it's easy to check the lock type since it can see
> the flags, but contention_end cannot see them.  So let's try the
> per-cpu array first (unconditionally): if it has an active element
> (lock != 0), that element should be used, and the per-task tstamp map
> should not be used until the per-cpu array element is cleared, which
> means the nested spinlock contention (if any) has finished and it now
> sees the (outer) lock.
>
> Signed-off-by: Namhyung Kim <namhyung@kernel.org>

Acked-by: Ian Rogers <irogers@google.com>

Thanks,
Ian

> ---
>  .../perf/util/bpf_skel/lock_contention.bpf.c  | 89 +++++++++++++++----
>  1 file changed, 72 insertions(+), 17 deletions(-)
>
> diff --git a/tools/perf/util/bpf_skel/lock_contention.bpf.c b/tools/perf/util/bpf_skel/lock_contention.bpf.c
> index 69d31fd77cd0..95cd8414f6ef 100644
> --- a/tools/perf/util/bpf_skel/lock_contention.bpf.c
> +++ b/tools/perf/util/bpf_skel/lock_contention.bpf.c
> @@ -42,6 +42,14 @@ struct {
>         __uint(max_entries, MAX_ENTRIES);
>  } tstamp SEC(".maps");
>
> +/* maintain per-CPU timestamp at the beginning of contention */
> +struct {
> +       __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
> +       __uint(key_size, sizeof(__u32));
> +       __uint(value_size, sizeof(struct tstamp_data));
> +       __uint(max_entries, 1);
> +} tstamp_cpu SEC(".maps");
> +
>  /* actual lock contention statistics */
>  struct {
>         __uint(type, BPF_MAP_TYPE_HASH);
> @@ -311,34 +319,57 @@ static inline __u32 check_lock_type(__u64 lock, __u32 flags)
>         return 0;
>  }
>
> -SEC("tp_btf/contention_begin")
> -int contention_begin(u64 *ctx)
> +static inline struct tstamp_data *get_tstamp_elem(__u32 flags)
>  {
>         __u32 pid;
>         struct tstamp_data *pelem;
>
> -       if (!enabled || !can_record(ctx))
> -               return 0;
> +       /* Use per-cpu array map for spinlock and rwlock */
> +       if (flags == (LCB_F_SPIN | LCB_F_READ) || flags == LCB_F_SPIN ||
> +           flags == (LCB_F_SPIN | LCB_F_WRITE)) {
> +               __u32 idx = 0;
> +
> +               pelem = bpf_map_lookup_elem(&tstamp_cpu, &idx);
> +               /* Do not update the element for nested locks */
> +               if (pelem && pelem->lock)
> +                       pelem = NULL;
> +               return pelem;
> +       }
>
>         pid = bpf_get_current_pid_tgid();
>         pelem = bpf_map_lookup_elem(&tstamp, &pid);
> +       /* Do not update the element for nested locks */
>         if (pelem && pelem->lock)
> -               return 0;
> +               return NULL;
>
>         if (pelem == NULL) {
>                 struct tstamp_data zero = {};
>
>                 if (bpf_map_update_elem(&tstamp, &pid, &zero, BPF_NOEXIST) < 0) {
>                         __sync_fetch_and_add(&task_fail, 1);
> -                       return 0;
> +                       return NULL;
>                 }
>
>                 pelem = bpf_map_lookup_elem(&tstamp, &pid);
>                 if (pelem == NULL) {
>                         __sync_fetch_and_add(&task_fail, 1);
> -                       return 0;
> +                       return NULL;
>                 }
>         }
> +       return pelem;
> +}
> +
> +SEC("tp_btf/contention_begin")
> +int contention_begin(u64 *ctx)
> +{
> +       struct tstamp_data *pelem;
> +
> +       if (!enabled || !can_record(ctx))
> +               return 0;
> +
> +       pelem = get_tstamp_elem(ctx[1]);
> +       if (pelem == NULL)
> +               return 0;
>
>         pelem->timestamp = bpf_ktime_get_ns();
>         pelem->lock = (__u64)ctx[0];
> @@ -377,24 +408,42 @@ int contention_begin(u64 *ctx)
>  SEC("tp_btf/contention_end")
>  int contention_end(u64 *ctx)
>  {
> -       __u32 pid;
> +       __u32 pid = 0, idx = 0;
>         struct tstamp_data *pelem;
>         struct contention_key key = {};
>         struct contention_data *data;
>         __u64 duration;
> +       bool need_delete = false;
>
>         if (!enabled)
>                 return 0;
>
> -       pid = bpf_get_current_pid_tgid();
> -       pelem = bpf_map_lookup_elem(&tstamp, &pid);
> -       if (!pelem || pelem->lock != ctx[0])
> -               return 0;
> +       /*
> +        * For spinlock and rwlock, the timestamp is kept in the per-cpu
> +        * map.  However, contention_end does not have the flags, so it
> +        * cannot know whether to read the per-cpu or the hash map.
> +        *
> +        * Try the per-cpu map first and check if there's active contention.
> +        * If there is, do not read the hash map: the task cannot have gone
> +        * to sleeping locks before releasing the spinning locks.
> +        */
> +       pelem = bpf_map_lookup_elem(&tstamp_cpu, &idx);
> +       if (pelem && pelem->lock) {
> +               if (pelem->lock != ctx[0])
> +                       return 0;
> +       } else {
> +               pid = bpf_get_current_pid_tgid();
> +               pelem = bpf_map_lookup_elem(&tstamp, &pid);
> +               if (!pelem || pelem->lock != ctx[0])
> +                       return 0;
> +               need_delete = true;
> +       }
>
>         duration = bpf_ktime_get_ns() - pelem->timestamp;
>         if ((__s64)duration < 0) {
>                 pelem->lock = 0;
> -               bpf_map_delete_elem(&tstamp, &pid);
> +               if (need_delete)
> +                       bpf_map_delete_elem(&tstamp, &pid);
>                 __sync_fetch_and_add(&time_fail, 1);
>                 return 0;
>         }
> @@ -406,8 +455,11 @@ int contention_end(u64 *ctx)
>         case LOCK_AGGR_TASK:
>                 if (lock_owner)
>                         key.pid = pelem->flags;
> -               else
> +               else {
> +                       if (!need_delete)
> +                               pid = bpf_get_current_pid_tgid();
>                         key.pid = pid;
> +               }
>                 if (needs_callstack)
>                         key.stack_id = pelem->stack_id;
>                 break;
> @@ -428,7 +480,8 @@ int contention_end(u64 *ctx)
>         if (!data) {
>                 if (data_map_full) {
>                         pelem->lock = 0;
> -                       bpf_map_delete_elem(&tstamp, &pid);
> +                       if (need_delete)
> +                               bpf_map_delete_elem(&tstamp, &pid);
>                         __sync_fetch_and_add(&data_fail, 1);
>                         return 0;
>                 }
> @@ -452,7 +505,8 @@ int contention_end(u64 *ctx)
>                         __sync_fetch_and_add(&data_fail, 1);
>                 }
>                 pelem->lock = 0;
> -               bpf_map_delete_elem(&tstamp, &pid);
> +               if (need_delete)
> +                       bpf_map_delete_elem(&tstamp, &pid);
>                 return 0;
>         }
>
> @@ -466,7 +520,8 @@ int contention_end(u64 *ctx)
>                 data->min_time = duration;
>
>         pelem->lock = 0;
> -       bpf_map_delete_elem(&tstamp, &pid);
> +       if (need_delete)
> +               bpf_map_delete_elem(&tstamp, &pid);
>         return 0;
>  }
>
> --
> 2.42.0.655.g421f12c284-goog
>


* Re: [PATCH v3 1/3] perf lock contention: Clear lock addr after use
  2023-10-20 20:47 [PATCH v3 1/3] perf lock contention: Clear lock addr after use Namhyung Kim
                   ` (2 preceding siblings ...)
  2023-10-25  4:13 ` [PATCH v3 1/3] perf lock contention: Clear lock addr after use Ian Rogers
@ 2023-10-26 17:11 ` Namhyung Kim
  3 siblings, 0 replies; 7+ messages in thread
From: Namhyung Kim @ 2023-10-26 17:11 UTC (permalink / raw)
  To: Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: LKML, Ingo Molnar, Hao Luo, Song Liu, Adrian Hunter, bpf,
	linux-perf-users, Ian Rogers, Peter Zijlstra

On Fri, 20 Oct 2023 13:47:39 -0700, Namhyung Kim wrote:
> It checks the current lock to calculate the delta of the contention
> time.  The address is saved in the tstamp map, which is allocated at
> the beginning of contention and released at the end of contention.
> 
> But it's possible for bpf_map_delete_elem() to fail.  In that case, the
> element in the tstamp map is kept for the current lock, and it makes the
> next contention for the same lock be tracked incorrectly.  Specifically,
> the next contention begin will see the existing element for the task and
> just return.  Then the next contention end will see the element and
> calculate the time using the timestamp from the previous begin.
> 
> [...]

Applied to perf-tools-next, thanks!


