[PATCHSET 0/7] perf lock contention: Improve performance if map is full (v1)

linux-perf-users.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCHSET 0/7] perf lock contention: Improve performance if map is full (v1)
@ 2023-04-06 21:06 Namhyung Kim
  2023-04-06 21:06 ` [PATCH 1/7] perf lock contention: Simplify parse_lock_type() Namhyung Kim
                   ` (8 more replies)
  0 siblings, 9 replies; 10+ messages in thread
From: Namhyung Kim @ 2023-04-06 21:06 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: Ian Rogers, Adrian Hunter, Peter Zijlstra, Ingo Molnar, LKML,
	linux-perf-users, Song Liu, Hao Luo, bpf, Juri Lelli

Hello,

I got a report that the overhead of perf lock contention is too big in
some cases.  It was running the task aggregation mode (-t) at the moment
and there were lots of tasks contending each other.

It turned out that the hash map update is a problem.  The result is saved
in the lock_stat hash map which is pre-allocated.  The BPF program never
deletes data in the map, but just adds.  But if the map is full, (try to)
update the map becomes a very heavy operation - since it needs to check
every CPU's freelist to get a new node to save the result.  But we know
it'd fail when the map is full.  No need to update then.

I've checked it on my 64 CPU machine with this.

    $ perf bench sched messaging -g 1000
    # Running 'sched/messaging' benchmark:
    # 20 sender and receiver processes per group
    # 1000 groups == 40000 processes run

         Total time: 2.825 [sec]

And I used the task mode, so that it can guarantee the map is full.
The default map entry size is 16K and this workload has 40K tasks.

Before:
    $ sudo ./perf lock con -abt -E3 -- perf bench sched messaging -g 1000
    # Running 'sched/messaging' benchmark:
    # 20 sender and receiver processes per group
    # 1000 groups == 40000 processes run

         Total time: 11.299 [sec]
     contended   total wait     max wait     avg wait          pid   comm

         19284      3.51 s       3.70 ms    181.91 us      1305863   sched-messaging
           243     84.09 ms    466.67 us    346.04 us      1336608   sched-messaging
           177     66.35 ms     12.08 ms    374.88 us      1220416   node

After:
    $ sudo ./perf lock con -abt -E3 -- perf bench sched messaging -g 1000
    # Running 'sched/messaging' benchmark:
    # 20 sender and receiver processes per group
    # 1000 groups == 40000 processes run

         Total time: 3.044 [sec]
     contended   total wait     max wait     avg wait          pid   comm

         18743    591.92 ms    442.96 us     31.58 us      1431454   sched-messaging
            51    210.64 ms    207.45 ms      4.13 ms      1468724   sched-messaging
            81     68.61 ms     65.79 ms    847.07 us      1463183   sched-messaging

    === output for debug ===

    bad: 1164137, total: 2253341
    bad rate: 51.66 %
    histogram of failure reasons
           task: 0
          stack: 0
           time: 0
           data: 1164137

The first few patches are small cleanups and fixes.  You can get the code
from 'perf/lock-map-v1' branch in

  git://git.kernel.org/pub/scm/linux/kernel/git/namhyung/linux-perf.git

Thanks,
Namhyung

Namhyung Kim (7):
  perf lock contention: Simplify parse_lock_type()
  perf lock contention: Use -M for --map-nr-entries
  perf lock contention: Update default map size to 16384
  perf lock contention: Add data failure stat
  perf lock contention: Update total/bad stats for hidden entries
  perf lock contention: Revise needs_callstack() condition
  perf lock contention: Do not try to update if hash map is full

 tools/perf/Documentation/perf-lock.txt        |  4 +-
 tools/perf/builtin-lock.c                     | 64 ++++++++-----------
 tools/perf/util/bpf_lock_contention.c         |  7 +-
 .../perf/util/bpf_skel/lock_contention.bpf.c  | 29 +++++++--
 tools/perf/util/bpf_skel/lock_data.h          |  3 +
 tools/perf/util/lock-contention.h             |  2 +
 6 files changed, 60 insertions(+), 49 deletions(-)


base-commit: e5116f46d44b72ede59a6923829f68a8b8f84e76
-- 
2.40.0.577.gac1e443424-goog


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 1/7] perf lock contention: Simplify parse_lock_type()
  2023-04-06 21:06 [PATCHSET 0/7] perf lock contention: Improve performance if map is full (v1) Namhyung Kim
@ 2023-04-06 21:06 ` Namhyung Kim
  2023-04-06 21:06 ` [PATCH 2/7] perf lock contention: Use -M for --map-nr-entries Namhyung Kim
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Namhyung Kim @ 2023-04-06 21:06 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: Ian Rogers, Adrian Hunter, Peter Zijlstra, Ingo Molnar, LKML,
	linux-perf-users, Song Liu, Hao Luo, bpf, Juri Lelli

The get_type_flag() should check both str and name fields in the
lock_type_table so that it can find the appropriate flag without retrying
with ':R' or ':W' suffix from the caller.

Also fix a typo in the rt-mutex.

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/perf/builtin-lock.c | 43 ++++++++-------------------------------
 1 file changed, 8 insertions(+), 35 deletions(-)

diff --git a/tools/perf/builtin-lock.c b/tools/perf/builtin-lock.c
index 32ec58fb80e4..695ce6bd64f7 100644
--- a/tools/perf/builtin-lock.c
+++ b/tools/perf/builtin-lock.c
@@ -1557,7 +1557,7 @@ static const struct {
 	{ LCB_F_SPIN | LCB_F_WRITE,	"rwlock:W",	"rwlock" },
 	{ LCB_F_READ,			"rwsem:R",	"rwsem" },
 	{ LCB_F_WRITE,			"rwsem:W",	"rwsem" },
-	{ LCB_F_RT,			"rt=mutex",	"rt-mutex" },
+	{ LCB_F_RT,			"rt-mutex",	"rt-mutex" },
 	{ LCB_F_RT | LCB_F_READ,	"rwlock-rt:R",	"rwlock-rt" },
 	{ LCB_F_RT | LCB_F_WRITE,	"rwlock-rt:W",	"rwlock-rt" },
 	{ LCB_F_PERCPU | LCB_F_READ,	"pcpu-sem:R",	"percpu-rwsem" },
@@ -1596,6 +1596,10 @@ static unsigned int get_type_flag(const char *str)
 		if (!strcmp(lock_type_table[i].name, str))
 			return lock_type_table[i].flags;
 	}
+	for (unsigned int i = 0; i < ARRAY_SIZE(lock_type_table); i++) {
+		if (!strcmp(lock_type_table[i].str, str))
+			return lock_type_table[i].flags;
+	}
 	return UINT_MAX;
 }
 
@@ -2121,46 +2125,15 @@ static int parse_lock_type(const struct option *opt __maybe_unused, const char *
 		unsigned int flags = get_type_flag(tok);
 
 		if (flags == -1U) {
-			char buf[32];
-
-			if (strchr(tok, ':'))
-			    continue;
-
-			/* try :R and :W suffixes for rwlock, rwsem, ... */
-			scnprintf(buf, sizeof(buf), "%s:R", tok);
-			flags = get_type_flag(buf);
-			if (flags != UINT_MAX) {
-				if (!add_lock_type(flags)) {
-					ret = -1;
-					break;
-				}
-			}
-
-			scnprintf(buf, sizeof(buf), "%s:W", tok);
-			flags = get_type_flag(buf);
-			if (flags != UINT_MAX) {
-				if (!add_lock_type(flags)) {
-					ret = -1;
-					break;
-				}
-			}
-			continue;
+			pr_err("Unknown lock flags: %s\n", tok);
+			ret = -1;
+			break;
 		}
 
 		if (!add_lock_type(flags)) {
 			ret = -1;
 			break;
 		}
-
-		if (!strcmp(tok, "mutex")) {
-			flags = get_type_flag("mutex-spin");
-			if (flags != UINT_MAX) {
-				if (!add_lock_type(flags)) {
-					ret = -1;
-					break;
-				}
-			}
-		}
 	}
 
 	free(s);
-- 
2.40.0.577.gac1e443424-goog


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 2/7] perf lock contention: Use -M for --map-nr-entries
  2023-04-06 21:06 [PATCHSET 0/7] perf lock contention: Improve performance if map is full (v1) Namhyung Kim
  2023-04-06 21:06 ` [PATCH 1/7] perf lock contention: Simplify parse_lock_type() Namhyung Kim
@ 2023-04-06 21:06 ` Namhyung Kim
  2023-04-06 21:06 ` [PATCH 3/7] perf lock contention: Update default map size to 16384 Namhyung Kim
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Namhyung Kim @ 2023-04-06 21:06 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: Ian Rogers, Adrian Hunter, Peter Zijlstra, Ingo Molnar, LKML,
	linux-perf-users, Song Liu, Hao Luo, bpf, Juri Lelli

Users often want to change the map size, let's add a short option (-M)
for that.

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/perf/Documentation/perf-lock.txt | 1 +
 tools/perf/builtin-lock.c              | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/perf/Documentation/perf-lock.txt b/tools/perf/Documentation/perf-lock.txt
index 37aae194a2a1..b5e5d088d51c 100644
--- a/tools/perf/Documentation/perf-lock.txt
+++ b/tools/perf/Documentation/perf-lock.txt
@@ -155,6 +155,7 @@ CONTENTION OPTIONS
 --tid=<value>::
         Record events on existing thread ID (comma separated list).
 
+-M::
 --map-nr-entries=<value>::
 	Maximum number of BPF map entries (default: 10240).
 
diff --git a/tools/perf/builtin-lock.c b/tools/perf/builtin-lock.c
index 695ce6bd64f7..ef56bf90058d 100644
--- a/tools/perf/builtin-lock.c
+++ b/tools/perf/builtin-lock.c
@@ -2294,7 +2294,7 @@ int cmd_lock(int argc, const char **argv)
 		   "Trace on existing process id"),
 	OPT_STRING(0, "tid", &target.tid, "tid",
 		   "Trace on existing thread id (exclusive to --pid)"),
-	OPT_CALLBACK(0, "map-nr-entries", &bpf_map_entries, "num",
+	OPT_CALLBACK('M', "map-nr-entries", &bpf_map_entries, "num",
 		     "Max number of BPF map entries", parse_map_entry),
 	OPT_CALLBACK(0, "max-stack", &max_stack_depth, "num",
 		     "Set the maximum stack depth when collecting lopck contention, "
-- 
2.40.0.577.gac1e443424-goog


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 3/7] perf lock contention: Update default map size to 16384
  2023-04-06 21:06 [PATCHSET 0/7] perf lock contention: Improve performance if map is full (v1) Namhyung Kim
  2023-04-06 21:06 ` [PATCH 1/7] perf lock contention: Simplify parse_lock_type() Namhyung Kim
  2023-04-06 21:06 ` [PATCH 2/7] perf lock contention: Use -M for --map-nr-entries Namhyung Kim
@ 2023-04-06 21:06 ` Namhyung Kim
  2023-04-06 21:06 ` [PATCH 4/7] perf lock contention: Add data failure stat Namhyung Kim
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Namhyung Kim @ 2023-04-06 21:06 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: Ian Rogers, Adrian Hunter, Peter Zijlstra, Ingo Molnar, LKML,
	linux-perf-users, Song Liu, Hao Luo, bpf, Juri Lelli

The BPF hash map will align the map size to a power of 2.  So 10k would
be 16k anyway.  Let's have the actual size to avoid confusions.

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/perf/Documentation/perf-lock.txt         | 3 ++-
 tools/perf/builtin-lock.c                      | 2 +-
 tools/perf/util/bpf_skel/lock_contention.bpf.c | 5 +----
 tools/perf/util/bpf_skel/lock_data.h           | 3 +++
 4 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/tools/perf/Documentation/perf-lock.txt b/tools/perf/Documentation/perf-lock.txt
index b5e5d088d51c..6e5ba3cd2b72 100644
--- a/tools/perf/Documentation/perf-lock.txt
+++ b/tools/perf/Documentation/perf-lock.txt
@@ -157,7 +157,8 @@ CONTENTION OPTIONS
 
 -M::
 --map-nr-entries=<value>::
-	Maximum number of BPF map entries (default: 10240).
+	Maximum number of BPF map entries (default: 16384).
+	This will be aligned to a power of 2.
 
 --max-stack=<value>::
 	Maximum stack depth when collecting lock contention (default: 8).
diff --git a/tools/perf/builtin-lock.c b/tools/perf/builtin-lock.c
index ef56bf90058d..9b92c7a5aefb 100644
--- a/tools/perf/builtin-lock.c
+++ b/tools/perf/builtin-lock.c
@@ -60,7 +60,7 @@ static bool show_thread_stats;
 static bool show_lock_addrs;
 static bool show_lock_owner;
 static bool use_bpf;
-static unsigned long bpf_map_entries = 10240;
+static unsigned long bpf_map_entries = MAX_ENTRIES;
 static int max_stack_depth = CONTENTION_STACK_DEPTH;
 static int stack_skip = CONTENTION_STACK_SKIP;
 static int print_nr_entries = INT_MAX / 2;
diff --git a/tools/perf/util/bpf_skel/lock_contention.bpf.c b/tools/perf/util/bpf_skel/lock_contention.bpf.c
index 3f4ee3992e81..f9d2d792ccc8 100644
--- a/tools/perf/util/bpf_skel/lock_contention.bpf.c
+++ b/tools/perf/util/bpf_skel/lock_contention.bpf.c
@@ -7,9 +7,6 @@
 
 #include "lock_data.h"
 
-/* default buffer size */
-#define MAX_ENTRIES  10240
-
 /* for collect_lock_syms().  4096 was rejected by the verifier */
 #define MAX_CPUS  1024
 
@@ -63,7 +60,7 @@ struct {
 	__uint(type, BPF_MAP_TYPE_HASH);
 	__uint(key_size, sizeof(__u64));
 	__uint(value_size, sizeof(__u32));
-	__uint(max_entries, 16384);
+	__uint(max_entries, MAX_ENTRIES);
 } lock_syms SEC(".maps");
 
 struct {
diff --git a/tools/perf/util/bpf_skel/lock_data.h b/tools/perf/util/bpf_skel/lock_data.h
index 1ba61cb4d480..260062a9f2ab 100644
--- a/tools/perf/util/bpf_skel/lock_data.h
+++ b/tools/perf/util/bpf_skel/lock_data.h
@@ -15,6 +15,9 @@ struct contention_task_data {
 	char comm[TASK_COMM_LEN];
 };
 
+/* default buffer size */
+#define MAX_ENTRIES  16384
+
 /*
  * Upper bits of the flags in the contention_data are used to identify
  * some well-known locks which do not have symbols (non-global locks).
-- 
2.40.0.577.gac1e443424-goog


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 4/7] perf lock contention: Add data failure stat
  2023-04-06 21:06 [PATCHSET 0/7] perf lock contention: Improve performance if map is full (v1) Namhyung Kim
                   ` (2 preceding siblings ...)
  2023-04-06 21:06 ` [PATCH 3/7] perf lock contention: Update default map size to 16384 Namhyung Kim
@ 2023-04-06 21:06 ` Namhyung Kim
  2023-04-06 21:06 ` [PATCH 5/7] perf lock contention: Update total/bad stats for hidden entries Namhyung Kim
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Namhyung Kim @ 2023-04-06 21:06 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: Ian Rogers, Adrian Hunter, Peter Zijlstra, Ingo Molnar, LKML,
	linux-perf-users, Song Liu, Hao Luo, bpf, Juri Lelli

It's possible to fail to update the data when the lock_stat map is full.
We should check that case and show the number at the end.

  $ sudo ./perf lock con -ablv -E3 -- ./perf bench sched messaging
  ...
   contended   total wait     max wait     avg wait            address   symbol

        6157    208.48 ms     69.29 us     33.86 us   ffff934c001c1f00    (spinlock)
        4030     72.04 ms     61.84 us     17.88 us   ffff934c000415c0    (spinlock)
        3201     50.30 ms     47.73 us     15.71 us   ffff934c2eead850    (spinlock)

  === output for debug ===

  bad: 0, total: 13388
  bad rate: 0.00 %
  histogram of failure reasons
         task: 0
        stack: 0
         time: 0
         data: 0      <----- added

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/perf/builtin-lock.c                      | 4 +++-
 tools/perf/util/bpf_lock_contention.c          | 1 +
 tools/perf/util/bpf_skel/lock_contention.bpf.c | 4 +++-
 tools/perf/util/lock-contention.h              | 1 +
 4 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/tools/perf/builtin-lock.c b/tools/perf/builtin-lock.c
index 9b92c7a5aefb..01b318d6c80a 100644
--- a/tools/perf/builtin-lock.c
+++ b/tools/perf/builtin-lock.c
@@ -1626,7 +1626,7 @@ static void sort_contention_result(void)
 static void print_bpf_events(int total, struct lock_contention_fails *fails)
 {
 	/* Output for debug, this have to be removed */
-	int broken = fails->task + fails->stack + fails->time;
+	int broken = fails->task + fails->stack + fails->time + fails->data;
 
 	if (quiet || total == 0 || (broken == 0 && verbose <= 0))
 		return;
@@ -1640,7 +1640,9 @@ static void print_bpf_events(int total, struct lock_contention_fails *fails)
 	pr_info(" %10s: %d\n", "task", fails->task);
 	pr_info(" %10s: %d\n", "stack", fails->stack);
 	pr_info(" %10s: %d\n", "time", fails->time);
+	pr_info(" %10s: %d\n", "data", fails->data);
 }
+
 static void print_contention_result(struct lock_contention *con)
 {
 	struct lock_stat *st;
diff --git a/tools/perf/util/bpf_lock_contention.c b/tools/perf/util/bpf_lock_contention.c
index 8a5d0eb44189..0071058ac3d2 100644
--- a/tools/perf/util/bpf_lock_contention.c
+++ b/tools/perf/util/bpf_lock_contention.c
@@ -262,6 +262,7 @@ int lock_contention_read(struct lock_contention *con)
 	con->fails.task = skel->bss->task_fail;
 	con->fails.stack = skel->bss->stack_fail;
 	con->fails.time = skel->bss->time_fail;
+	con->fails.data = skel->bss->data_fail;
 
 	stack_trace = zalloc(stack_size);
 	if (stack_trace == NULL)
diff --git a/tools/perf/util/bpf_skel/lock_contention.bpf.c b/tools/perf/util/bpf_skel/lock_contention.bpf.c
index f9d2d792ccc8..cb87c98e5340 100644
--- a/tools/perf/util/bpf_skel/lock_contention.bpf.c
+++ b/tools/perf/util/bpf_skel/lock_contention.bpf.c
@@ -124,6 +124,7 @@ int aggr_mode;
 int task_fail;
 int stack_fail;
 int time_fail;
+int data_fail;
 
 static inline int can_record(u64 *ctx)
 {
@@ -380,7 +381,8 @@ int contention_end(u64 *ctx)
 		if (aggr_mode == LOCK_AGGR_ADDR)
 			first.flags |= check_lock_type(pelem->lock, pelem->flags);
 
-		bpf_map_update_elem(&lock_stat, &key, &first, BPF_NOEXIST);
+		if (bpf_map_update_elem(&lock_stat, &key, &first, BPF_NOEXIST) < 0)
+			__sync_fetch_and_add(&data_fail, 1);
 		bpf_map_delete_elem(&tstamp, &pid);
 		return 0;
 	}
diff --git a/tools/perf/util/lock-contention.h b/tools/perf/util/lock-contention.h
index 10c28302420c..3ed1cad370fc 100644
--- a/tools/perf/util/lock-contention.h
+++ b/tools/perf/util/lock-contention.h
@@ -126,6 +126,7 @@ struct lock_contention_fails {
 	int task;
 	int stack;
 	int time;
+	int data;
 };
 
 struct lock_contention {
-- 
2.40.0.577.gac1e443424-goog


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 5/7] perf lock contention: Update total/bad stats for hidden entries
  2023-04-06 21:06 [PATCHSET 0/7] perf lock contention: Improve performance if map is full (v1) Namhyung Kim
                   ` (3 preceding siblings ...)
  2023-04-06 21:06 ` [PATCH 4/7] perf lock contention: Add data failure stat Namhyung Kim
@ 2023-04-06 21:06 ` Namhyung Kim
  2023-04-06 21:06 ` [PATCH 6/7] perf lock contention: Revise needs_callstack() condition Namhyung Kim
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Namhyung Kim @ 2023-04-06 21:06 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: Ian Rogers, Adrian Hunter, Peter Zijlstra, Ingo Molnar, LKML,
	linux-perf-users, Song Liu, Hao Luo, bpf, Juri Lelli

When -E option is used, it only prints the given number of entries but
the event stat at the end should have the numbers for entire entries.

Likewise, -S option will hide entries that don't have the named
function in the callstack.  Also update event stat for them.

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/perf/builtin-lock.c             | 11 +++++++++++
 tools/perf/util/bpf_lock_contention.c |  4 +++-
 tools/perf/util/lock-contention.h     |  1 +
 3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/tools/perf/builtin-lock.c b/tools/perf/builtin-lock.c
index 01b318d6c80a..7742fa255c44 100644
--- a/tools/perf/builtin-lock.c
+++ b/tools/perf/builtin-lock.c
@@ -1726,6 +1726,17 @@ static void print_contention_result(struct lock_contention *con)
 			break;
 	}
 
+	if (print_nr_entries) {
+		/* update the total/bad stats */
+		while ((st = pop_from_result())) {
+			total += use_bpf ? st->nr_contended : 1;
+			if (st->broken)
+				bad++;
+		}
+	}
+	/* some entries are collected but hidden by the callstack filter */
+	total += con->nr_filtered;
+
 	if (use_bpf)
 		print_bpf_events(total, &con->fails);
 	else
diff --git a/tools/perf/util/bpf_lock_contention.c b/tools/perf/util/bpf_lock_contention.c
index 0071058ac3d2..ea4f697d2a9f 100644
--- a/tools/perf/util/bpf_lock_contention.c
+++ b/tools/perf/util/bpf_lock_contention.c
@@ -299,8 +299,10 @@ int lock_contention_read(struct lock_contention *con)
 		if (con->save_callstack) {
 			bpf_map_lookup_elem(stack, &key.stack_id, stack_trace);
 
-			if (!match_callstack_filter(machine, stack_trace))
+			if (!match_callstack_filter(machine, stack_trace)) {
+				con->nr_filtered += data.count;
 				goto next;
+			}
 		}
 
 		switch (con->aggr_mode) {
diff --git a/tools/perf/util/lock-contention.h b/tools/perf/util/lock-contention.h
index 3ed1cad370fc..fa16532c971c 100644
--- a/tools/perf/util/lock-contention.h
+++ b/tools/perf/util/lock-contention.h
@@ -141,6 +141,7 @@ struct lock_contention {
 	int stack_skip;
 	int aggr_mode;
 	int owner;
+	int nr_filtered;
 	bool save_callstack;
 };
 
-- 
2.40.0.577.gac1e443424-goog


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 6/7] perf lock contention: Revise needs_callstack() condition
  2023-04-06 21:06 [PATCHSET 0/7] perf lock contention: Improve performance if map is full (v1) Namhyung Kim
                   ` (4 preceding siblings ...)
  2023-04-06 21:06 ` [PATCH 5/7] perf lock contention: Update total/bad stats for hidden entries Namhyung Kim
@ 2023-04-06 21:06 ` Namhyung Kim
  2023-04-06 21:06 ` [PATCH 7/7] perf lock contention: Do not try to update if hash map is full Namhyung Kim
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Namhyung Kim @ 2023-04-06 21:06 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: Ian Rogers, Adrian Hunter, Peter Zijlstra, Ingo Molnar, LKML,
	linux-perf-users, Song Liu, Hao Luo, bpf, Juri Lelli

It needs callstacks for two reasons:

 * for stack aggregation mode, the map key is the stack id and it can
   also show the full stack traces when -v is used

 * for other aggregation modes, the stack filter can be used to limit
   lock contentions from known call paths

The -v option is meaningful (in terms of stack trace) only for stack
aggregation mode, so it should not set the save_callstack for other
mode like with -t or -l options.

I've noticed this with the following command line:

  $ sudo ./perf lock con -ablv -E 3 -M 16 -- ./perf bench sched messaging
  ...
   contended   total wait     max wait     avg wait            address   symbol

          88      4.59 ms    108.07 us     52.13 us   ffff935757f46ec0    (spinlock)
          33    905.22 us     73.67 us     27.43 us   ffff935757f41700    (spinlock)
          28    703.69 us     79.28 us     25.13 us   ffff938a3d9b0c80   rq_lock (spinlock)

  === output for debug ===

  bad: 12272, total: 12421
  bad rate: 98.80 %
  histogram of failure reasons
         task: 8285
        stack: 3987    <---------- here
         time: 0
         data: 0

It should not have any failure on stacks since it doesn't use it.
No functional change intended.

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/perf/builtin-lock.c             | 2 +-
 tools/perf/util/bpf_lock_contention.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/perf/builtin-lock.c b/tools/perf/builtin-lock.c
index 7742fa255c44..4e24351b18bd 100644
--- a/tools/perf/builtin-lock.c
+++ b/tools/perf/builtin-lock.c
@@ -77,7 +77,7 @@ static enum lock_aggr_mode aggr_mode = LOCK_AGGR_ADDR;
 
 static bool needs_callstack(void)
 {
-	return verbose > 0 || !list_empty(&callstack_filters);
+	return !list_empty(&callstack_filters);
 }
 
 static struct thread_stat *thread_stat_find(u32 tid)
diff --git a/tools/perf/util/bpf_lock_contention.c b/tools/perf/util/bpf_lock_contention.c
index ea4f697d2a9f..9e20fa8ade09 100644
--- a/tools/perf/util/bpf_lock_contention.c
+++ b/tools/perf/util/bpf_lock_contention.c
@@ -346,7 +346,7 @@ int lock_contention_read(struct lock_contention *con)
 		if (data.count)
 			st->avg_wait_time = data.total_time / data.count;
 
-		if (con->save_callstack) {
+		if (con->aggr_mode == LOCK_AGGR_CALLER && verbose > 0) {
 			st->callstack = memdup(stack_trace, stack_size);
 			if (st->callstack == NULL)
 				break;
-- 
2.40.0.577.gac1e443424-goog


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 7/7] perf lock contention: Do not try to update if hash map is full
  2023-04-06 21:06 [PATCHSET 0/7] perf lock contention: Improve performance if map is full (v1) Namhyung Kim
                   ` (5 preceding siblings ...)
  2023-04-06 21:06 ` [PATCH 6/7] perf lock contention: Revise needs_callstack() condition Namhyung Kim
@ 2023-04-06 21:06 ` Namhyung Kim
  2023-04-07  0:35 ` [PATCHSET 0/7] perf lock contention: Improve performance if map is full (v1) Ian Rogers
  2023-04-07  0:41 ` Arnaldo Carvalho de Melo
  8 siblings, 0 replies; 10+ messages in thread
From: Namhyung Kim @ 2023-04-06 21:06 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: Ian Rogers, Adrian Hunter, Peter Zijlstra, Ingo Molnar, LKML,
	linux-perf-users, Song Liu, Hao Luo, bpf, Juri Lelli

It doesn't delete data in the task_data and lock_stat maps.  The data
is kept there until it's consumed by userspace at the end.  But it calls
bpf_map_update_elem() again and again, and the data will be discarded if
the map is full.  This is not good.

Worse, in the bpf_map_update_elem(), it keeps trying to get a new node
even if the map was full.  I guess it makes sense if it deletes some node
like in the tstamp map (that's why I didn't make the change there).

In a pre-allocated hash map, that means it'd iterate all CPU to check the
freelist.  And it has a bad performance impact on large machines.

I've checked it on my 64 CPU machine with this.

  $ perf bench sched messaging -g 1000
  # Running 'sched/messaging' benchmark:
  # 20 sender and receiver processes per group
  # 1000 groups == 40000 processes run

       Total time: 2.825 [sec]

And I used the task mode, so that it can guarantee the map is full.
The default map entry size is 16K and this workload has 40K tasks.

Before:
  $ sudo ./perf lock con -abt -E3 -- perf bench sched messaging -g 1000
  # Running 'sched/messaging' benchmark:
  # 20 sender and receiver processes per group
  # 1000 groups == 40000 processes run

       Total time: 11.299 [sec]
   contended   total wait     max wait     avg wait          pid   comm

       19284      3.51 s       3.70 ms    181.91 us      1305863   sched-messaging
         243     84.09 ms    466.67 us    346.04 us      1336608   sched-messaging
         177     66.35 ms     12.08 ms    374.88 us      1220416   node

For some reason, it didn't report the data failures.  But you can see the
total time in the workload is increased a lot (2.8 -> 11.3).  If it fails
early when the map is full, it goes back to normal.

After:
  $ sudo ./perf lock con -abt -E3 -- perf bench sched messaging -g 1000
  # Running 'sched/messaging' benchmark:
  # 20 sender and receiver processes per group
  # 1000 groups == 40000 processes run

       Total time: 3.044 [sec]
   contended   total wait     max wait     avg wait          pid   comm

       18743    591.92 ms    442.96 us     31.58 us      1431454   sched-messaging
          51    210.64 ms    207.45 ms      4.13 ms      1468724   sched-messaging
          81     68.61 ms     65.79 ms    847.07 us      1463183   sched-messaging

  === output for debug ===

  bad: 1164137, total: 2253341
  bad rate: 51.66 %
  histogram of failure reasons
         task: 0
        stack: 0
         time: 0
         data: 1164137

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 .../perf/util/bpf_skel/lock_contention.bpf.c  | 22 ++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/tools/perf/util/bpf_skel/lock_contention.bpf.c b/tools/perf/util/bpf_skel/lock_contention.bpf.c
index cb87c98e5340..23f6e63544ed 100644
--- a/tools/perf/util/bpf_skel/lock_contention.bpf.c
+++ b/tools/perf/util/bpf_skel/lock_contention.bpf.c
@@ -4,6 +4,7 @@
 #include <bpf/bpf_helpers.h>
 #include <bpf/bpf_tracing.h>
 #include <bpf/bpf_core_read.h>
+#include <asm-generic/errno-base.h>
 
 #include "lock_data.h"
 
@@ -126,6 +127,9 @@ int stack_fail;
 int time_fail;
 int data_fail;
 
+int task_map_full;
+int data_map_full;
+
 static inline int can_record(u64 *ctx)
 {
 	if (has_cpu) {
@@ -177,11 +181,12 @@ static inline int update_task_data(struct task_struct *task)
 		return -1;
 
 	p = bpf_map_lookup_elem(&task_data, &pid);
-	if (p == NULL) {
+	if (p == NULL && !task_map_full) {
 		struct contention_task_data data = {};
 
 		BPF_CORE_READ_STR_INTO(&data.comm, task, comm);
-		bpf_map_update_elem(&task_data, &pid, &data, BPF_NOEXIST);
+		if (bpf_map_update_elem(&task_data, &pid, &data, BPF_NOEXIST) == -E2BIG)
+			task_map_full = 1;
 	}
 
 	return 0;
@@ -370,6 +375,12 @@ int contention_end(u64 *ctx)
 
 	data = bpf_map_lookup_elem(&lock_stat, &key);
 	if (!data) {
+		if (data_map_full) {
+			bpf_map_delete_elem(&tstamp, &pid);
+			__sync_fetch_and_add(&data_fail, 1);
+			return 0;
+		}
+
 		struct contention_data first = {
 			.total_time = duration,
 			.max_time = duration,
@@ -377,12 +388,17 @@ int contention_end(u64 *ctx)
 			.count = 1,
 			.flags = pelem->flags,
 		};
+		int err;
 
 		if (aggr_mode == LOCK_AGGR_ADDR)
 			first.flags |= check_lock_type(pelem->lock, pelem->flags);
 
-		if (bpf_map_update_elem(&lock_stat, &key, &first, BPF_NOEXIST) < 0)
+		err = bpf_map_update_elem(&lock_stat, &key, &first, BPF_NOEXIST);
+		if (err < 0) {
+			if (err == -E2BIG)
+				data_map_full = 1;
 			__sync_fetch_and_add(&data_fail, 1);
+		}
 		bpf_map_delete_elem(&tstamp, &pid);
 		return 0;
 	}
-- 
2.40.0.577.gac1e443424-goog


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCHSET 0/7] perf lock contention: Improve performance if map is full (v1)
  2023-04-06 21:06 [PATCHSET 0/7] perf lock contention: Improve performance if map is full (v1) Namhyung Kim
                   ` (6 preceding siblings ...)
  2023-04-06 21:06 ` [PATCH 7/7] perf lock contention: Do not try to update if hash map is full Namhyung Kim
@ 2023-04-07  0:35 ` Ian Rogers
  2023-04-07  0:41 ` Arnaldo Carvalho de Melo
  8 siblings, 0 replies; 10+ messages in thread
From: Ian Rogers @ 2023-04-07  0:35 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Adrian Hunter,
	Peter Zijlstra, Ingo Molnar, LKML, linux-perf-users, Song Liu,
	Hao Luo, bpf, Juri Lelli

On Thu, Apr 6, 2023 at 2:06 PM Namhyung Kim <namhyung@kernel.org> wrote:
>
> Hello,
>
> I got a report that the overhead of perf lock contention is too big in
> some cases.  It was running the task aggregation mode (-t) at the moment
> and there were lots of tasks contending each other.
>
> It turned out that the hash map update is a problem.  The result is saved
> in the lock_stat hash map which is pre-allocated.  The BPF program never
> deletes data in the map, but just adds.  But if the map is full, (try to)
> update the map becomes a very heavy operation - since it needs to check
> every CPU's freelist to get a new node to save the result.  But we know
> it'd fail when the map is full.  No need to update then.
>
> I've checked it on my 64 CPU machine with this.
>
>     $ perf bench sched messaging -g 1000
>     # Running 'sched/messaging' benchmark:
>     # 20 sender and receiver processes per group
>     # 1000 groups == 40000 processes run
>
>          Total time: 2.825 [sec]
>
> And I used the task mode, so that it can guarantee the map is full.
> The default map entry size is 16K and this workload has 40K tasks.
>
> Before:
>     $ sudo ./perf lock con -abt -E3 -- perf bench sched messaging -g 1000
>     # Running 'sched/messaging' benchmark:
>     # 20 sender and receiver processes per group
>     # 1000 groups == 40000 processes run
>
>          Total time: 11.299 [sec]
>      contended   total wait     max wait     avg wait          pid   comm
>
>          19284      3.51 s       3.70 ms    181.91 us      1305863   sched-messaging
>            243     84.09 ms    466.67 us    346.04 us      1336608   sched-messaging
>            177     66.35 ms     12.08 ms    374.88 us      1220416   node
>
> After:
>     $ sudo ./perf lock con -abt -E3 -- perf bench sched messaging -g 1000
>     # Running 'sched/messaging' benchmark:
>     # 20 sender and receiver processes per group
>     # 1000 groups == 40000 processes run
>
>          Total time: 3.044 [sec]
>      contended   total wait     max wait     avg wait          pid   comm
>
>          18743    591.92 ms    442.96 us     31.58 us      1431454   sched-messaging
>             51    210.64 ms    207.45 ms      4.13 ms      1468724   sched-messaging
>             81     68.61 ms     65.79 ms    847.07 us      1463183   sched-messaging
>
>     === output for debug ===
>
>     bad: 1164137, total: 2253341
>     bad rate: 51.66 %
>     histogram of failure reasons
>            task: 0
>           stack: 0
>            time: 0
>            data: 1164137
>
> The first few patches are small cleanups and fixes.  You can get the code
> from 'perf/lock-map-v1' branch in
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/namhyung/linux-perf.git
>
> Thanks,
> Namhyung
>
> Namhyung Kim (7):
>   perf lock contention: Simplify parse_lock_type()
>   perf lock contention: Use -M for --map-nr-entries
>   perf lock contention: Update default map size to 16384
>   perf lock contention: Add data failure stat
>   perf lock contention: Update total/bad stats for hidden entries
>   perf lock contention: Revise needs_callstack() condition
>   perf lock contention: Do not try to update if hash map is full

Series:
Acked-by: Ian Rogers <irogers@google.com>

Thanks,
Ian

>  tools/perf/Documentation/perf-lock.txt        |  4 +-
>  tools/perf/builtin-lock.c                     | 64 ++++++++-----------
>  tools/perf/util/bpf_lock_contention.c         |  7 +-
>  .../perf/util/bpf_skel/lock_contention.bpf.c  | 29 +++++++--
>  tools/perf/util/bpf_skel/lock_data.h          |  3 +
>  tools/perf/util/lock-contention.h             |  2 +
>  6 files changed, 60 insertions(+), 49 deletions(-)
>
>
> base-commit: e5116f46d44b72ede59a6923829f68a8b8f84e76
> --
> 2.40.0.577.gac1e443424-goog
>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCHSET 0/7] perf lock contention: Improve performance if map is full (v1)
  2023-04-06 21:06 [PATCHSET 0/7] perf lock contention: Improve performance if map is full (v1) Namhyung Kim
                   ` (7 preceding siblings ...)
  2023-04-07  0:35 ` [PATCHSET 0/7] perf lock contention: Improve performance if map is full (v1) Ian Rogers
@ 2023-04-07  0:41 ` Arnaldo Carvalho de Melo
  8 siblings, 0 replies; 10+ messages in thread
From: Arnaldo Carvalho de Melo @ 2023-04-07  0:41 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Jiri Olsa, Ian Rogers, Adrian Hunter, Peter Zijlstra, Ingo Molnar,
	LKML, linux-perf-users, Song Liu, Hao Luo, bpf, Juri Lelli

Em Thu, Apr 06, 2023 at 02:06:04PM -0700, Namhyung Kim escreveu:
> Hello,
> 
> I got a report that the overhead of perf lock contention is too big in
> some cases.  It was running the task aggregation mode (-t) at the moment
> and there were lots of tasks contending each other.
> 
> It turned out that the hash map update is a problem.  The result is saved
> in the lock_stat hash map which is pre-allocated.  The BPF program never
> deletes data in the map, but just adds.  But if the map is full, (try to)
> update the map becomes a very heavy operation - since it needs to check
> every CPU's freelist to get a new node to save the result.  But we know
> it'd fail when the map is full.  No need to update then.

Thanks, applied.

- Arnaldo

 
> I've checked it on my 64 CPU machine with this.
> 
>     $ perf bench sched messaging -g 1000
>     # Running 'sched/messaging' benchmark:
>     # 20 sender and receiver processes per group
>     # 1000 groups == 40000 processes run
> 
>          Total time: 2.825 [sec]
> 
> And I used the task mode, so that it can guarantee the map is full.
> The default map entry size is 16K and this workload has 40K tasks.
> 
> Before:
>     $ sudo ./perf lock con -abt -E3 -- perf bench sched messaging -g 1000
>     # Running 'sched/messaging' benchmark:
>     # 20 sender and receiver processes per group
>     # 1000 groups == 40000 processes run
> 
>          Total time: 11.299 [sec]
>      contended   total wait     max wait     avg wait          pid   comm
> 
>          19284      3.51 s       3.70 ms    181.91 us      1305863   sched-messaging
>            243     84.09 ms    466.67 us    346.04 us      1336608   sched-messaging
>            177     66.35 ms     12.08 ms    374.88 us      1220416   node
> 
> After:
>     $ sudo ./perf lock con -abt -E3 -- perf bench sched messaging -g 1000
>     # Running 'sched/messaging' benchmark:
>     # 20 sender and receiver processes per group
>     # 1000 groups == 40000 processes run
> 
>          Total time: 3.044 [sec]
>      contended   total wait     max wait     avg wait          pid   comm
> 
>          18743    591.92 ms    442.96 us     31.58 us      1431454   sched-messaging
>             51    210.64 ms    207.45 ms      4.13 ms      1468724   sched-messaging
>             81     68.61 ms     65.79 ms    847.07 us      1463183   sched-messaging
> 
>     === output for debug ===
> 
>     bad: 1164137, total: 2253341
>     bad rate: 51.66 %
>     histogram of failure reasons
>            task: 0
>           stack: 0
>            time: 0
>            data: 1164137
> 
> The first few patches are small cleanups and fixes.  You can get the code
> from 'perf/lock-map-v1' branch in
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/namhyung/linux-perf.git
> 
> Thanks,
> Namhyung
> 
> Namhyung Kim (7):
>   perf lock contention: Simplify parse_lock_type()
>   perf lock contention: Use -M for --map-nr-entries
>   perf lock contention: Update default map size to 16384
>   perf lock contention: Add data failure stat
>   perf lock contention: Update total/bad stats for hidden entries
>   perf lock contention: Revise needs_callstack() condition
>   perf lock contention: Do not try to update if hash map is full
> 
>  tools/perf/Documentation/perf-lock.txt        |  4 +-
>  tools/perf/builtin-lock.c                     | 64 ++++++++-----------
>  tools/perf/util/bpf_lock_contention.c         |  7 +-
>  .../perf/util/bpf_skel/lock_contention.bpf.c  | 29 +++++++--
>  tools/perf/util/bpf_skel/lock_data.h          |  3 +
>  tools/perf/util/lock-contention.h             |  2 +
>  6 files changed, 60 insertions(+), 49 deletions(-)
> 
> 
> base-commit: e5116f46d44b72ede59a6923829f68a8b8f84e76
> -- 
> 2.40.0.577.gac1e443424-goog
> 

-- 

- Arnaldo

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2023-04-07  0:41 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-04-06 21:06 [PATCHSET 0/7] perf lock contention: Improve performance if map is full (v1) Namhyung Kim
2023-04-06 21:06 ` [PATCH 1/7] perf lock contention: Simplify parse_lock_type() Namhyung Kim
2023-04-06 21:06 ` [PATCH 2/7] perf lock contention: Use -M for --map-nr-entries Namhyung Kim
2023-04-06 21:06 ` [PATCH 3/7] perf lock contention: Update default map size to 16384 Namhyung Kim
2023-04-06 21:06 ` [PATCH 4/7] perf lock contention: Add data failure stat Namhyung Kim
2023-04-06 21:06 ` [PATCH 5/7] perf lock contention: Update total/bad stats for hidden entries Namhyung Kim
2023-04-06 21:06 ` [PATCH 6/7] perf lock contention: Revise needs_callstack() condition Namhyung Kim
2023-04-06 21:06 ` [PATCH 7/7] perf lock contention: Do not try to update if hash map is full Namhyung Kim
2023-04-07  0:35 ` [PATCHSET 0/7] perf lock contention: Improve performance if map is full (v1) Ian Rogers
2023-04-07  0:41 ` Arnaldo Carvalho de Melo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).