Question: perf report & top memory usage

public inbox for linux-perf-users@vger.kernel.org

* Question: perf report & top memory usage
From: Stephen Brennan @ 2026-02-17 19:08 UTC
  To: linux-perf-users
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, James Clark, Stephen Brennan

Hello all,

I had an interesting case where perf report required 35 GiB of memory to create
a report for a 400 MiB data file. Unfortunately I don't believe I can share the
perf.data, but I did an analysis and wanted to share what I found and ask some
questions.

The particular data file contains 1,087,091 samples, with call chains, generated
by a pretty standard "perf record -a -g sleep 10" on a machine with 76 CPUs. I
looked at the perf report code and profiled memory allocations. Three items
seemed to dominate memory use:

1. Histogram columns. The default being "comm,dso,symbol". The more buckets that
   the data is broken into, the more memory is used, and the histogram columns
   directly control this.

2. Callchains. The default is to track them when the perf.data contains them,
   though it can be disabled with "-g none". The data structure storing call
   chains seems pretty efficient (a prefix tree) but it looks like there is one
   per histogram bucket. This makes sense, but it seems duplicative with #3.

3. Accumulating child overhead. The default is to do this, creating the
   "Children" column in the report. The implementation walks the stack for each
   sample, creating a histogram bucket for each stack frame (even if no samples
   were observed actually executing in those symbols).

My understanding is that the 35 GiB memory usage then comes from a sort of
combinatorial explosion. In this data file, nearly every process has a unique
comm with numeric identifiers embedded within (e.g. "db1234"). This means that
the default "comm,dso,symbol" sort will result in a large number of buckets. The
call stacks are reasonably deep (though not absurdly so). There are many
non-leaf functions in the call stacks which don't have any Self samples. Child
overhead accumulation creates more buckets than there are samples: around 1.3
million buckets, compared to 1 million samples.
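
To make the multiplication concrete, here is a toy model in C. It is only a sketch and not perf's actual data structures: every name in it (toy_hists, toy_count_buckets, and so on) is made up for illustration. It keys buckets on comm and symbol only, leaving out dso and the per-bucket callchain tree, and counts how many distinct buckets appear when the same stack is sampled in several processes.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_BUCKETS 256
#define KEY_LEN 64

/* A toy "histogram": a flat set of distinct bucket keys. */
struct toy_hists {
	char keys[MAX_BUCKETS][KEY_LEN];
	size_t nr;
};

/* Insert "comm/symbol" if absent; linear search is fine for a toy. */
static void toy_hists__add(struct toy_hists *h, const char *comm,
			   const char *sym)
{
	char key[KEY_LEN];
	size_t i;

	snprintf(key, sizeof(key), "%s/%s", comm, sym);
	for (i = 0; i < h->nr; i++)
		if (!strcmp(h->keys[i], key))
			return;
	if (h->nr < MAX_BUCKETS)
		strcpy(h->keys[h->nr++], key);
}

/*
 * Count buckets for nr_procs processes that all sampled the same stack.
 * stack[0] is the leaf. With children set, every frame gets a bucket;
 * otherwise only the leaf does. With unique_comm set, each process has
 * its own comm ("db0", "db1", ...); otherwise they all share "db".
 */
static size_t toy_count_buckets(int nr_procs, const char **stack, int depth,
				int children, int unique_comm)
{
	struct toy_hists h = { .nr = 0 };
	char comm[16];
	int p, f;

	for (p = 0; p < nr_procs; p++) {
		if (unique_comm)
			snprintf(comm, sizeof(comm), "db%d", p);
		else
			snprintf(comm, sizeof(comm), "db");
		for (f = 0; f < (children ? depth : 1); f++)
			toy_hists__add(&h, comm, stack[f]);
	}
	return h.nr;
}
```

For 10 processes sharing one 5-frame stack, this model yields 1 bucket with a shared comm and no child accumulation, 5 with child accumulation, 10 with unique comms, and 50 with both. The two factors multiply, which is the effect described above.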

From this perspective, the memory usage makes sense to me. I understand that I
could tweak any combination of those knobs to ameliorate the issue. The most
straightforward option is to use "-s dso,symbol" because the "comm" column
wasn't informative for this workload. I also created a new histogram column
implementation (see below) that represents a command with any digits stripped,
so that the commands could still be grouped together, without the numeric
identifiers disrupting the bucketing. These solutions reduce memory used to 5.1
and 5.4 GiB respectively.
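
As a standalone illustration of the digit-ignoring comparison, here is a minimal sketch in plain C. The function name name_cmp_skip_digits is hypothetical and this is not the code from the patch; it only shows the idea of skipping digits on both sides before comparing the remaining characters.

```c
#include <assert.h>
#include <ctype.h>

/*
 * Compare two names while skipping every ASCII digit, so that "db1234"
 * and "db7" compare equal. Returns <0, 0 or >0 like strcmp().
 */
static int name_cmp_skip_digits(const char *a, const char *b)
{
	for (;;) {
		/* isdigit() needs a value representable as unsigned char. */
		while (isdigit((unsigned char)*a))
			a++;
		while (isdigit((unsigned char)*b))
			b++;
		if (*a != *b)
			return (unsigned char)*a - (unsigned char)*b;
		if (!*a)
			return 0;	/* both strings ended together */
		a++;
		b++;
	}
}
```

With this comparison, "kworker/3:1" and "kworker/12:0" land in the same bucket. The trade-off is that names which differ only in their digits are merged whether or not the digits are an identifier.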

My concern is that most users aren't prepared to dive into this sort of detail,
especially when they're likely already in the middle of an analysis of some
other performance issue. While they may be familiar with the call graph options
and event selection choices, in my experience they generally aren't aware of the
many options that "perf report" provides. They certainly aren't aware of these
memory trade-offs, especially for what seemed like an innocuous 10-second data
collection at the default sample rate.

To sum up, I have the following questions:

1. Does my analysis make sense and seem consistent with your understanding?
2. Does anybody else deal with this sort of memory usage issue, and have
strategies they can share?
3. Does the patch below for the custom column make sense to submit? I know it's
rather workload specific, but it could be useful for others in this situation.

Thanks,
Stephen

Patch for the "commIgnoreDigit" column. For this workload, it reduced perf
report's peak RSS from 35 GiB to around 5.4 GiB, when used in place of "comm":

From df1452ae742d933b45c18d9dde090c11fb3cf846 Mon Sep 17 00:00:00 2001
From: Stephen Brennan <stephen.s.brennan@oracle.com>
Date: Wed, 3 Dec 2025 16:01:49 -0800
Subject: [PATCH 1/1] tools: perf: add commIgnoreDigit

The "comm" column allows grouping events by the process command. It is
intended to group like programs, despite having different PIDs. But some
workloads may adjust their own command, so that a unique identifier
(e.g. a PID or some other numeric value) is part of the command name.
This destroys the utility of "comm", forcing perf to place each unique
process name into its own bucket, which can contribute to a
combinatorial explosion of memory use in perf report.

Create a less strict version of this column, which ignores digits when
comparing command names. This allows "similar looking" processes to
again be placed in the same bucket.

Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
---
 tools/perf/util/hist.c |  2 +
 tools/perf/util/hist.h |  1 +
 tools/perf/util/sort.c | 88 +++++++++++++++++++++++++++++++++++++++++-
 tools/perf/util/sort.h |  1 +
 4 files changed, 91 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
index ef4b569f7df46..5f691d9b0272d 100644
--- a/tools/perf/util/hist.c
+++ b/tools/perf/util/hist.c
@@ -110,6 +110,8 @@ void hists__calc_col_len(struct hists *hists, struct hist_entry *h)
 	len = thread__comm_len(h->thread);
 	if (hists__new_col_len(hists, HISTC_COMM, len))
 		hists__set_col_len(hists, HISTC_THREAD, len + 8);
+	if (hists__new_col_len(hists, HISTC_COMM_IGNORE_DIGIT, len))
+		hists__set_col_len(hists, HISTC_THREAD, len + 8);

 	if (h->ms.map) {
 		len = dso__name_len(map__dso(h->ms.map));
diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
index 1d5ea632ca4e1..ae7e98bd9e46d 100644
--- a/tools/perf/util/hist.h
+++ b/tools/perf/util/hist.h
@@ -44,6 +44,7 @@ enum hist_column {
 	HISTC_THREAD,
 	HISTC_TGID,
 	HISTC_COMM,
+	HISTC_COMM_IGNORE_DIGIT,
 	HISTC_CGROUP_ID,
 	HISTC_CGROUP,
 	HISTC_PARENT,
diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
index f3a565b0e2307..656b5cc62a730 100644
--- a/tools/perf/util/sort.c
+++ b/tools/perf/util/sort.c
@@ -1,4 +1,5 @@
 // SPDX-License-Identifier: GPL-2.0
+#include <ctype.h>
 #include <errno.h>
 #include <inttypes.h>
 #include <regex.h>
@@ -265,6 +266,89 @@ struct sort_entry sort_comm = {
 	.se_width_idx	= HISTC_COMM,
 };

+/* --sort commIgnoreDigit */
+
+static int64_t strcmp_nodigit(const char *left, const char *right)
+{
+	for (;;) {
+		while (*left && isdigit(*left)) left++;
+		while (*right && isdigit(*right)) right++;
+		if (*left == *right && !*left) {
+			return 0;
+		} else if (*left == *right) {
+			left++;
+			right++;
+		} else {
+			return (int64_t)*left - (int64_t)*right;
+		}
+	}
+}
+
+static int64_t
+sort__commIgnoreDigit_cmp(struct hist_entry *left, struct hist_entry *right)
+{
+	return strcmp_nodigit(comm__str(right->comm), comm__str(left->comm));
+}
+
+static int64_t
+sort__commIgnoreDigit_collapse(struct hist_entry *left, struct hist_entry *right)
+{
+	return strcmp_nodigit(comm__str(right->comm), comm__str(left->comm));
+}
+
+static int64_t
+sort__commIgnoreDigit_sort(struct hist_entry *left, struct hist_entry *right)
+{
+	return strcmp_nodigit(comm__str(right->comm), comm__str(left->comm));
+}
+
+static int hist_entry__commIgnoreDigit_snprintf(struct hist_entry *he, char *bf,
+						size_t size, unsigned int width)
+{
+	int ret = 0;
+	unsigned int print_len, printed = 0, start = 0, end = 0;
+	bool in_digit;
+	const char *comm = comm__str(he->comm), *print;
+	while (printed < width && printed < size && comm[start]) {
+		in_digit = !!isdigit(comm[start]);
+		end = start + 1;
+		while (comm[end] && !!isdigit(comm[end]) == in_digit) end++;
+		if (in_digit) {
+			print_len = 3; /* <N> */
+			print = "<N>";
+		} else {
+			print_len = end - start;
+			print = &comm[start];
+		}
+		print_len = min(print_len, width - printed);
+		ret = repsep_snprintf(bf + printed, size - printed, "%-.*s",
+					print_len, print);
+		if (ret < 0)
+			return ret;
+		start = end;
+		printed += ret;
+	}
+	/* Pad to width if necessary */
+	if (printed < width && printed < size) {
+		ret = repsep_snprintf(bf + printed, size - printed, "%-*.*s",
+				       width - printed, width - printed, "");
+		if (ret < 0)
+			return ret;
+		printed += ret;
+	}
+	return printed;
+}
+
+struct sort_entry sort_commIgnoreDigit = {
+	.se_header	= "CommandIgnoreDigit",
+	.se_cmp		= sort__commIgnoreDigit_cmp,
+	.se_collapse	= sort__commIgnoreDigit_collapse,
+	.se_sort	= sort__commIgnoreDigit_sort,
+	.se_snprintf	= hist_entry__commIgnoreDigit_snprintf,
+	.se_filter	= hist_entry__thread_filter,
+	.se_width_idx	= HISTC_COMM_IGNORE_DIGIT,
+};
+
 /* --sort dso */

 static int64_t _sort__dso_cmp(struct map *map_l, struct map *map_r)
@@ -2576,6 +2660,7 @@ static struct sort_dimension common_sort_dimensions[] = {
 	DIM(SORT_PID, "pid", sort_thread),
 	DIM(SORT_TGID, "tgid", sort_tgid),
 	DIM(SORT_COMM, "comm", sort_comm),
+	DIM(SORT_COMM_IGNORE_DIGIT, "commIgnoreDigit", sort_commIgnoreDigit),
 	DIM(SORT_DSO, "dso", sort_dso),
 	DIM(SORT_SYM, "symbol", sort_sym),
 	DIM(SORT_PARENT, "parent", sort_parent),
@@ -3675,7 +3760,7 @@ int sort_dimension__add(struct perf_hpp_list *list, const char *tok,
 			list->socket = 1;
 		} else if (sd->entry == &sort_thread) {
 			list->thread = 1;
-		} else if (sd->entry == &sort_comm) {
+		} else if (sd->entry == &sort_comm || sd->entry == &sort_commIgnoreDigit) {
 			list->comm = 1;
 		} else if (sd->entry == &sort_type_offset) {
 			symbol_conf.annotate_data_member = true;
@@ -4022,6 +4107,7 @@ static bool get_elide(int idx, FILE *output)
 	case HISTC_DSO:
 		return __get_elide(symbol_conf.dso_list, "dso", output);
 	case HISTC_COMM:
+	case HISTC_COMM_IGNORE_DIGIT:
 		return __get_elide(symbol_conf.comm_list, "comm", output);
 	default:
 		break;
diff --git a/tools/perf/util/sort.h b/tools/perf/util/sort.h
index d7787958e06b9..6819934b4d48a 100644
--- a/tools/perf/util/sort.h
+++ b/tools/perf/util/sort.h
@@ -43,6 +43,7 @@ enum sort_type {
 	/* common sort keys */
 	SORT_PID,
 	SORT_COMM,
+	SORT_COMM_IGNORE_DIGIT,
 	SORT_DSO,
 	SORT_SYM,
 	SORT_PARENT,
-- 
2.47.3

* Re: Question: perf report & top memory usage
From: Namhyung Kim @ 2026-02-18  1:52 UTC
  To: Stephen Brennan
  Cc: linux-perf-users, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, James Clark

Hello,

On Tue, Feb 17, 2026 at 11:08:19AM -0800, Stephen Brennan wrote:
> Hello all,
> 
> I had an interesting case where perf report required 35 GiB of memory to create
> a report for a 400 MiB data file. Unfortunately I don't believe I can share the
> perf.data, but I did an analysis and wanted to share what I found and ask some
> questions.
> 
> The particular data file contains 1,087,091 samples, with call chains, generated
> by a pretty standard "perf record -a -g sleep 10" on a machine with 76 CPUs. I
> looked at the perf report code and profiled memory allocations. Three items
> seemed to dominate memory use:
> 
> 1. Histogram columns. The default being "comm,dso,symbol". The more buckets that
>    the data is broken into, the more memory is used, and the histogram columns
>    directly control this.
> 
> 2. Callchains. The default is to track them when the perf.data contains them,
>    though it can be disabled with "-g none". The data structure storing call
>    chains seems pretty efficient (a prefix tree) but it looks like there is one
>    per histogram bucket. This makes sense, but it seems duplicative with #3.
> 
> 3. Accumulating child overhead. The default is to do this, creating the
>    "Children" column in the report. The implementation walks the stack for each
>    sample, creating a histogram bucket for each stack frame (even if no samples
>    were observed actually executing in those symbols).
> 
> My understanding is that the 35 GiB memory usage then comes from a sort of
> combinatorial explosion. In this data file, nearly every process has a unique
> comm with numeric identifiers embedded within (e.g. "db1234"). This means that
> the default "comm,dso,symbol" sort will result in a large number of buckets. The
> call stacks are reasonably deep (though not absurdly so). There are many
> non-leaf functions in the call stacks which don't have any Self samples. Child
> overhead accumulation creates more buckets than there are samples: around 1.3
> million buckets, compared to 1 million samples.
> 
> From this perspective, the memory usage makes sense to me. I understand that I
> could tweak any combination of those knobs to ameliorate the issue. The most
> straightforward option is to use "-s dso,symbol" because the "comm" column
> wasn't informative for this workload. I also created a new histogram column
> implementation (see below) that represents a command with any digits stripped,
> so that the commands could still be grouped together, without the numeric
> identifiers disrupting the bucketing. These solutions reduce memory used to 5.1
> and 5.4 GiB respectively.
> 
> My concern is that most users aren't prepared to dive into this sort of detail,
> especially when they're likely already in the middle of an analysis of some
> other performance issue. While they may be familiar with the call graph options
> and event selection choices, in my experience they generally aren't aware of the
> many options that "perf report" provides. They certainly aren't aware of these
> memory trade-offs, especially for what seemed like an innocuous 10-second data
> collection at the default sample rate.
> 
> To sum up, I have the following questions:
> 
> 1. Does my analysis make sense and seem consistent with your understanding?

Yes, it does!

> 2. Does anybody else deal with this sort of memory usage issue, and have
> strategies they can share?

No, but I think "-s dso,sym" should work fine.  And it's the default
sort key for `perf top` command.

> 3. Does the patch below for the custom column make sense to submit? I know it's
> rather workload specific, but it could be useful for others in this situation.

I think it makes sense and could be useful.  Maybe it's good to group
kworker threads together.

Different options would be

1. to add an option to enable it for the existing 'comm' sort key.
2. to have a regex pattern or so to specify names to merge.

But a separate sort key as you did seems to be fine.
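
For what it's worth, option 2 could be sketched roughly as below with the POSIX regex API. This is only an illustration under assumptions: the helper names comm_pattern_init and comm_bucket_name are hypothetical, nothing like them exists in perf today, and how the pattern and label would be passed on the command line is left open.

```c
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Compile 'pattern' as a POSIX extended regex; returns 0 on success. */
static int comm_pattern_init(regex_t *re, const char *pattern)
{
	/* REG_NOSUB: only match/no-match is needed, not sub-matches. */
	return regcomp(re, pattern, REG_EXTENDED | REG_NOSUB);
}

/*
 * Hypothetical helper: return the name a comm should be bucketed under.
 * Comms matching the compiled pattern share the bucket 'label'; all
 * other comms keep their own name.
 */
static const char *comm_bucket_name(const regex_t *re, const char *label,
				    const char *comm)
{
	return regexec(re, comm, 0, NULL, 0) == 0 ? label : comm;
}
```

Compiling "^kworker/" once and mapping every matching comm to a label such as "kworker/*" would group the kworker threads while leaving other names untouched.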

> 
> Patch for the "commIgnoreDigit" column. For this workload, it reduced perf
> report's peak RSS from 35 GiB to around 5.4 GiB, when used in place of "comm":

We don't use CamelCase in the perf code base.  But otherwise looks ok.

Thanks,
Namhyung

* Re: Question: perf report & top memory usage
From: Stephen Brennan @ 2026-03-05 18:02 UTC
  To: Namhyung Kim
  Cc: linux-perf-users, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, James Clark

Namhyung Kim <namhyung@kernel.org> writes:
> Hello,
>
> On Tue, Feb 17, 2026 at 11:08:19AM -0800, Stephen Brennan wrote:
>> Hello all,
>> 
>> I had an interesting case where perf report required 35 GiB of memory to create
>> a report for a 400 MiB data file. Unfortunately I don't believe I can share the
>> perf.data, but I did an analysis and wanted to share what I found and ask some
>> questions.
>> 
>> The particular data file contains 1,087,091 samples, with call chains, generated
>> by a pretty standard "perf record -a -g sleep 10" on a machine with 76 CPUs. I
>> looked at the perf report code and profiled memory allocations. Three items
>> seemed to dominate memory use:
>> 
>> 1. Histogram columns. The default being "comm,dso,symbol". The more buckets that
>>    the data is broken into, the more memory is used, and the histogram columns
>>    directly control this.
>> 
>> 2. Callchains. The default is to track them when the perf.data contains them,
>>    though it can be disabled with "-g none". The data structure storing call
>>    chains seems pretty efficient (a prefix tree) but it looks like there is one
>>    per histogram bucket. This makes sense, but it seems duplicative with #3.
>> 
>> 3. Accumulating child overhead. The default is to do this, creating the
>>    "Children" column in the report. The implementation walks the stack for each
>>    sample, creating a histogram bucket for each stack frame (even if no samples
>>    were observed actually executing in those symbols).
>> 
>> My understanding is that the 35 GiB memory usage then comes from a sort of
>> combinatorial explosion. In this data file, nearly every process has a unique
>> comm with numeric identifiers embedded within (e.g. "db1234"). This means that
>> the default "comm,dso,symbol" sort will result in a large number of buckets. The
>> call stacks are reasonably deep (though not absurdly so). There are many
>> non-leaf functions in the call stacks which don't have any Self samples. Child
>> overhead accumulation creates more buckets than there are samples: around 1.3
>> million buckets, compared to 1 million samples.
>> 
>> From this perspective, the memory usage makes sense to me. I understand that I
>> could tweak any combination of those knobs to ameliorate the issue. The most
>> straightforward option is to use "-s dso,symbol" because the "comm" column
>> wasn't informative for this workload. I also created a new histogram column
>> implementation (see below) that represents a command with any digits stripped,
>> so that the commands could still be grouped together, without the numeric
>> identifiers disrupting the bucketing. These solutions reduce memory used to 5.1
>> and 5.4 GiB respectively.
>> 
>> My concern is that most users aren't prepared to dive into this sort of detail,
>> especially when they're likely already in the middle of an analysis of some
>> other performance issue. While they may be familiar with the call graph options
>> and event selection choices, in my experience they generally aren't aware of the
>> many options that "perf report" provides. They certainly aren't aware of these
>> memory trade-offs, especially for what seemed like an innocuous 10-second data
>> collection at the default sample rate.
>> 
>> To sum up, I have the following questions:
>> 
>> 1. Does my analysis make sense and seem consistent with your understanding?
>
> Yes, it does!
>
>> 2. Does anybody else deal with this sort of memory usage issue, and have
>> strategies they can share?
>
> No, but I think "-s dso,sym" should work fine.  And it's the default
> sort key for `perf top` command.

Interesting, I did overlook that this is the default for perf top, thanks.

>> 3. Does the patch below for the custom column make sense to submit? I know it's
>> rather workload specific, but it could be useful for others in this situation.
>
> I think it makes sense and could be useful.  Maybe it's good to group
> kworker threads together.
>
> Different options would be
>
> 1. to add an option to enable it for the existing 'comm' sort key.
> 2. to have a regex pattern or so to specify names to merge.
>
> But a separate sort key as you did seems to be fine.

Thank you, those are good options as well. I hadn't considered the regex
approach before. I'll think more about it, because I do like that
flexibility.

>> 
>> Patch for the "commIgnoreDigit" column. For this workload, it reduced perf
>> report's peak RSS from 35 GiB to around 5.4 GiB, when used in place of "comm":
>
> We don't use CamelCase in the perf code base.  But otherwise looks ok.
>
> Thanks,
> Namhyung

Thanks Namhyung! I'll go ahead and send this patch without camel case
for further discussion.

Stephen
