Re: Question: perf report & top memory usage


From: Stephen Brennan <stephen.s.brennan@oracle.com>
To: Namhyung Kim <namhyung@kernel.org>
Cc: linux-perf-users@vger.kernel.org,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	Arnaldo Carvalho de Melo <acme@kernel.org>,
	Mark Rutland <mark.rutland@arm.com>,
	Alexander Shishkin <alexander.shishkin@linux.intel.com>,
	Jiri Olsa <jolsa@kernel.org>, Ian Rogers <irogers@google.com>,
	Adrian Hunter <adrian.hunter@intel.com>,
	James Clark <james.clark@linaro.org>
Subject: Re: Question: perf report & top memory usage
Date: Thu, 05 Mar 2026 10:02:01 -0800
Message-ID: <87qzpym9ye.fsf@oracle.com>
In-Reply-To: <aZUbSLeRCSdTp55m@google.com>

Namhyung Kim <namhyung@kernel.org> writes:
> Hello,
>
> On Tue, Feb 17, 2026 at 11:08:19AM -0800, Stephen Brennan wrote:
>> Hello all,
>> 
>> I had an interesting case where perf report required 35 GiB of memory to create
>> a report for a 400 MiB data file. Unfortunately I don't believe I can share the
>> perf.data, but I did an analysis and wanted to share what I found and ask some
>> questions.
>> 
>> The particular data file contains 1,087,091 samples, with call chains, generated
>> by a pretty standard "perf record -a -g sleep 10" on a machine with 76 CPUs. I
>> looked at the perf report code and profiled memory allocations. Three items
>> seemed to dominate memory use:
>> 
>> 1. Histogram columns. The default being "comm,dso,symbol". The more buckets that
>>    the data is broken into, the more memory is used, and the histogram columns
>>    directly control this.
>> 
>> 2. Callchains. The default is to track them when the perf.data contains them,
>>    though it can be disabled with "-g none". The data structure storing call
>>    chains seems pretty efficient (a prefix tree) but it looks like there is one
>>    per histogram bucket. This makes sense, but it seems duplicative with #3.
>> 
>> 3. Accumulating child overhead. The default is to do this, creating the
>>    "Children" column in the report. The implementation walks the stack for each
>>    sample, creating a histogram bucket for each stack frame (even if no samples
>>    were observed actually executing in those symbols).
>> 
>> My understanding is that the 35 GiB memory usage then comes from a sort of
>> combinatorial explosion. In this data file, nearly every process has a unique
>> comm with numeric identifiers embedded within (e.g. "db1234"). This means that
>> the default "comm,dso,symbol" sort will result in a large number of buckets. The
>> call stacks are reasonably deep (though not absurdly so). There are many
>> non-leaf functions in the call stacks which don't have any Self samples. Child
>> overhead accumulation creates more buckets than there are samples: around 1.3
>> million buckets, compared to 1 million samples.
>> 
>> From this perspective, the memory usage makes sense to me. I understand that I
>> could tweak any combination of those knobs to ameliorate the issue. The most
>> straightforward option is to use "-s dso,symbol" because the "comm" column
>> wasn't informative for this workload. I also created a new histogram column
>> implementation (see below) that represents a command with any digits stripped,
>> so that the commands could still be grouped together, without the numeric
>> identifiers disrupting the bucketing. These solutions reduce memory used to 5.1
>> and 5.4 GiB respectively.
>> 
>> My concern is that most users aren't prepared to dive into this sort of detail,
>> especially when they're likely already in the middle of an analysis of some
>> other performance issue. While they may be familiar with the call graph options
>> and event selection choices, in my experience they generally aren't aware of the
>> many options that "perf report" provides. They certainly aren't aware of these
>> memory trade-offs, especially for what seemed like an innocuous 10-second data
>> collection at the default sample rate.
>> 
>> To sum up, I have the following questions:
>> 
>> 1. Does my analysis make sense and seem consistent with your understanding?
>
> Yes, it does!
>
>> 2. Does anybody else deal with this sort of memory usage issue, and have
>> strategies they can share?
>
> No, but I think "-s dso,sym" should work fine.  And it's the default
> sort key for the `perf top` command.

Interesting, I did overlook that this is the default for perf top, thanks.

>> 3. Does the patch below for the custom column make sense to submit? I know it's
>> rather workload specific, but it could be useful for others in this situation.
>
> I think it makes sense and could be useful.  Maybe it's good to group
> kworker threads together.
>
> Different options would be
>
> 1. to add an option to enable it for the existing 'comm' sort key.
> 2. to have a regex pattern or so to specify names to merge.
>
> But a separate sort key as you did seems to be fine.

Thank you, those are good options as well. I hadn't considered the regex
approach before. I'll think more about it, because I do like that
flexibility.
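
As a rough sketch of what the regex option might look like (purely
illustrative and not part of the patch below): the idea would be to rewrite
each comm into a canonical key by replacing every match of a user-supplied
pattern with a placeholder, and then bucket on that key. The helper names
comm_normalize() and append(), and the reuse of "<N>" as the placeholder,
are assumptions for this example. It uses only POSIX regexec() and assumes
the pattern was compiled without REG_NOSUB.

```c
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Append len bytes of src to out, keeping out NUL-terminated. */
static int append(char *out, size_t size, size_t *used,
		  const char *src, size_t len)
{
	if (len >= size - *used)
		return -1;	/* would not fit with the terminating NUL */
	memcpy(out + *used, src, len);
	*used += len;
	out[*used] = '\0';
	return 0;
}

/*
 * Hypothetical helper: copy comm into out, replacing every match of re
 * with "<N>". Returns 0 on success, -1 if out is too small.
 */
static int comm_normalize(const regex_t *re, const char *comm,
			  char *out, size_t size)
{
	size_t used = 0;
	int eflags = 0;
	regmatch_t m;

	assert(re && comm && out && size > 0);
	out[0] = '\0';
	while (*comm) {
		/* No match, or an empty match: copy the rest verbatim. */
		if (regexec(re, comm, 1, &m, eflags) != 0 || m.rm_eo == m.rm_so)
			return append(out, size, &used, comm, strlen(comm));
		/* Copy the text before the match, then the placeholder. */
		if (append(out, size, &used, comm, (size_t)m.rm_so) ||
		    append(out, size, &used, "<N>", 3))
			return -1;
		comm += m.rm_eo;
		/* Later searches no longer start at the beginning of the string. */
		eflags = REG_NOTBOL;
	}
	return 0;
}
```

With the pattern "[0-9]+" compiled using REG_EXTENDED, this would behave
like the digit-ignoring column, while other patterns could merge different
families of names. Whether the normalized key should be cached per comm
rather than recomputed on every comparison is left open here.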

>> 
>> Patch for the "commIgnoreDigit" column. For this workload, it reduced perf
>> report's peak RSS from 35 GiB to around 5.4 GiB, when used in place of "comm":
>
> We don't use CamelCase in the perf code base.  But otherwise looks ok.
>
> Thanks,
> Namhyung

Thanks Namhyung! I'll go ahead and send this patch without camel case
for further discussion.

Stephen

>> 
>> From df1452ae742d933b45c18d9dde090c11fb3cf846 Mon Sep 17 00:00:00 2001
>> From: Stephen Brennan <stephen.s.brennan@oracle.com>
>> Date: Wed, 3 Dec 2025 16:01:49 -0800
>> Subject: [PATCH 1/1] tools: perf: add commIgnoreDigit
>> 
>> The "comm" column allows grouping events by the process command. It is
>> intended to group like programs, despite having different PIDs. But some
>> workloads may adjust their own command, so that a unique identifier
>> (e.g. a PID or some other numeric value) is part of the command name.
>> This destroys the utility of "comm", forcing perf to place each unique
>> process name into its own bucket, which can contribute to a
>> combinatorial explosion of memory use in perf report.
>> 
>> Create a less strict version of this column, which ignores digits when
>> comparing command names. This allows "similar looking" processes to
>> again be placed in the same bucket.
>> 
>> Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
>> ---
>>  tools/perf/util/hist.c |  2 +
>>  tools/perf/util/hist.h |  1 +
>>  tools/perf/util/sort.c | 88 +++++++++++++++++++++++++++++++++++++++++-
>>  tools/perf/util/sort.h |  1 +
>>  4 files changed, 91 insertions(+), 1 deletion(-)
>> 
>> diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
>> index ef4b569f7df46..5f691d9b0272d 100644
>> --- a/tools/perf/util/hist.c
>> +++ b/tools/perf/util/hist.c
>> @@ -110,6 +110,8 @@ void hists__calc_col_len(struct hists *hists, struct hist_entry *h)
>>  	len = thread__comm_len(h->thread);
>>  	if (hists__new_col_len(hists, HISTC_COMM, len))
>>  		hists__set_col_len(hists, HISTC_THREAD, len + 8);
>> +	if (hists__new_col_len(hists, HISTC_COMM_IGNORE_DIGIT, len))
>> +		hists__set_col_len(hists, HISTC_THREAD, len + 8);
>>  
>>  	if (h->ms.map) {
>>  		len = dso__name_len(map__dso(h->ms.map));
>> diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
>> index 1d5ea632ca4e1..ae7e98bd9e46d 100644
>> --- a/tools/perf/util/hist.h
>> +++ b/tools/perf/util/hist.h
>> @@ -44,6 +44,7 @@ enum hist_column {
>>  	HISTC_THREAD,
>>  	HISTC_TGID,
>>  	HISTC_COMM,
>> +	HISTC_COMM_IGNORE_DIGIT,
>>  	HISTC_CGROUP_ID,
>>  	HISTC_CGROUP,
>>  	HISTC_PARENT,
>> diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
>> index f3a565b0e2307..656b5cc62a730 100644
>> --- a/tools/perf/util/sort.c
>> +++ b/tools/perf/util/sort.c
>> @@ -1,4 +1,5 @@
>>  // SPDX-License-Identifier: GPL-2.0
>> +#include <ctype.h>
>>  #include <errno.h>
>>  #include <inttypes.h>
>>  #include <regex.h>
>> @@ -265,6 +266,89 @@ struct sort_entry sort_comm = {
>>  	.se_width_idx	= HISTC_COMM,
>>  };
>>  
>> +/* --sort commIgnoreDigit */
>> +
>> +static int64_t strcmp_nodigit(const char *left, const char *right)
>> +{
>> +	for (;;) {
>> +		while (*left && isdigit(*left)) left++;
>> +		while (*right && isdigit(*right)) right++;
>> +		if (*left == *right && !*left) {
>> +			return 0;
>> +		} else if (*left == *right) {
>> +			left++;
>> +			right++;
>> +		} else {
>> +			return (int64_t)*left - (int64_t)*right;
>> +		}
>> +	}
>> +}
>> +
>> +static int64_t
>> +sort__commIgnoreDigit_cmp(struct hist_entry *left, struct hist_entry *right)
>> +{
>> +	return strcmp_nodigit(comm__str(right->comm), comm__str(left->comm));
>> +}
>> +
>> +static int64_t
>> +sort__commIgnoreDigit_collapse(struct hist_entry *left, struct hist_entry *right)
>> +{
>> +	return strcmp_nodigit(comm__str(right->comm), comm__str(left->comm));
>> +}
>> +
>> +static int64_t
>> +sort__commIgnoreDigit_sort(struct hist_entry *left, struct hist_entry *right)
>> +{
>> +	return strcmp_nodigit(comm__str(right->comm), comm__str(left->comm));
>> +}
>> +
>> +static int hist_entry__commIgnoreDigit_snprintf(struct hist_entry *he, char *bf,
>> +						size_t size, unsigned int width)
>> +{
>> +	int ret = 0;
>> +	unsigned int print_len, printed = 0, start = 0, end = 0;
>> +	bool in_digit;
>> +	const char *comm = comm__str(he->comm), *print;
>> +	while (printed < width && printed < size && comm[start]) {
>> +		in_digit = !!isdigit(comm[start]);
>> +		end = start + 1;
>> +		while (comm[end] && !!isdigit(comm[end]) == in_digit) end++;
>> +		if (in_digit) {
>> +			print_len = 3; /* <N> */
>> +			print = "<N>";
>> +		} else {
>> +			print_len = end - start;
>> +			print = &comm[start];
>> +		}
>> +		print_len = min(print_len, width - printed);
>> +		ret = repsep_snprintf(bf + printed, size - printed, "%-.*s",
>> +					print_len, print);
>> +		if (ret < 0)
>> +			return ret;
>> +		start = end;
>> +		printed += ret;
>> +	}
>> +	/* Pad to width if necessary */
>> +	if (printed < width && printed < size) {
>> +		ret = repsep_snprintf(bf + printed, size - printed, "%-*.*s",
>> +				       width - printed, width - printed, "");
>> +		if (ret < 0)
>> +			return ret;
>> +		printed += ret;
>> +	}
>> +	return printed;
>> +}
>> +
>> +struct sort_entry sort_commIgnoreDigit = {
>> +	.se_header	= "CommandIgnoreDigit",
>> +	.se_cmp		= sort__commIgnoreDigit_cmp,
>> +	.se_collapse	= sort__commIgnoreDigit_collapse,
>> +	.se_sort	= sort__commIgnoreDigit_sort,
>> +	.se_snprintf	= hist_entry__commIgnoreDigit_snprintf,
>> +	.se_filter	= hist_entry__thread_filter,
>> +	.se_width_idx	= HISTC_COMM_IGNORE_DIGIT,
>> +};
>> +
>>  /* --sort dso */
>>  
>>  static int64_t _sort__dso_cmp(struct map *map_l, struct map *map_r)
>> @@ -2576,6 +2660,7 @@ static struct sort_dimension common_sort_dimensions[] = {
>>  	DIM(SORT_PID, "pid", sort_thread),
>>  	DIM(SORT_TGID, "tgid", sort_tgid),
>>  	DIM(SORT_COMM, "comm", sort_comm),
>> +	DIM(SORT_COMM_IGNORE_DIGIT, "commIgnoreDigit", sort_commIgnoreDigit),
>>  	DIM(SORT_DSO, "dso", sort_dso),
>>  	DIM(SORT_SYM, "symbol", sort_sym),
>>  	DIM(SORT_PARENT, "parent", sort_parent),
>> @@ -3675,7 +3760,7 @@ int sort_dimension__add(struct perf_hpp_list *list, const char *tok,
>>  			list->socket = 1;
>>  		} else if (sd->entry == &sort_thread) {
>>  			list->thread = 1;
>> -		} else if (sd->entry == &sort_comm) {
>> +		} else if (sd->entry == &sort_comm || sd->entry == &sort_commIgnoreDigit) {
>>  			list->comm = 1;
>>  		} else if (sd->entry == &sort_type_offset) {
>>  			symbol_conf.annotate_data_member = true;
>> @@ -4022,6 +4107,7 @@ static bool get_elide(int idx, FILE *output)
>>  	case HISTC_DSO:
>>  		return __get_elide(symbol_conf.dso_list, "dso", output);
>>  	case HISTC_COMM:
>> +	case HISTC_COMM_IGNORE_DIGIT:
>>  		return __get_elide(symbol_conf.comm_list, "comm", output);
>>  	default:
>>  		break;
>> diff --git a/tools/perf/util/sort.h b/tools/perf/util/sort.h
>> index d7787958e06b9..6819934b4d48a 100644
>> --- a/tools/perf/util/sort.h
>> +++ b/tools/perf/util/sort.h
>> @@ -43,6 +43,7 @@ enum sort_type {
>>  	/* common sort keys */
>>  	SORT_PID,
>>  	SORT_COMM,
>> +	SORT_COMM_IGNORE_DIGIT,
>>  	SORT_DSO,
>>  	SORT_SYM,
>>  	SORT_PARENT,
>> -- 
>> 2.47.3
>>
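
For experimenting with the comparison outside the perf tree, the
digit-skipping logic from the patch can be sketched as a standalone
function. This is an adaptation rather than the patch itself: the name
comm_cmp_nodigit() is made up for this example, it returns int instead of
int64_t, and it casts to unsigned char before calling isdigit(), which the
C standard requires for bytes that may be negative as plain char.

```c
#include <assert.h>
#include <ctype.h>

/*
 * Compare two strings as if every ASCII digit had been deleted from both.
 * Returns 0 when they match, otherwise the difference of the first pair
 * of non-digit bytes that differ.
 */
static int comm_cmp_nodigit(const char *left, const char *right)
{
	assert(left && right);
	for (;;) {
		/* Skip any run of digits on either side. */
		while (isdigit((unsigned char)*left))
			left++;
		while (isdigit((unsigned char)*right))
			right++;
		if (*left != *right)
			return (unsigned char)*left - (unsigned char)*right;
		if (!*left)
			return 0;	/* both strings ended together */
		left++;
		right++;
	}
}
```

Under this comparison "db1234" and "db99" are equal, as are "kworker/0:1"
and "kworker/12:3", so such threads would land in the same bucket. Note
that digits are ignored wherever they appear, so "12ab34" and "ab" also
compare equal.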


