linux-perf-users.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Feng Tang <feng.tang@intel.com>
To: Leo Yan <leo.yan@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	Arnaldo Carvalho de Melo <acme@kernel.org>,
	Mark Rutland <mark.rutland@arm.com>,
	Alexander Shishkin <alexander.shishkin@linux.intel.com>,
	Jiri Olsa <jolsa@kernel.org>,
	"Namhyung Kim" <namhyung@kernel.org>,
	Joe Mario <jmario@redhat.com>, <linux-perf-users@vger.kernel.org>,
	<linux-kernel@vger.kernel.org>, "Andi Kleen" <ak@linux.intel.com>,
	Kan Liang <kan.liang@linux.intel.com>,
	"Xing Zhengjun" <zhengjun.xing@linux.intel.com>
Subject: Re: [PATCH] perf c2c: Add report option to show false sharing in adjacent cachelines
Date: Mon, 13 Feb 2023 16:24:01 +0800	[thread overview]
Message-ID: <Y+nzoaJcnMFokDMX@feng-clx> (raw)
In-Reply-To: <Y+nyakPnVDpVjn24@leoy-yangtze.lan>

Hi Leo,

On Mon, Feb 13, 2023 at 04:18:50PM +0800, Leo Yan wrote:
> On Mon, Feb 13, 2023 at 11:17:33AM +0800, Feng Tang wrote:
> > Many platforms have feature of adjacent cachelines prefetch, when it
> > is enabled, for data in RAM of 2 cachelines (2N and 2N+1) granularity,
> > if one is fetched to cache, the other one could likely be fetched too,
> > which sort of extends the cacheline size to double, thus the false
> > sharing could happens in adjacent cachelines.
> > 
> > 0Day has captured performance changed related with this [1], and some
> > commercial software explicitly makes its hot global variables 128 bytes
> > aligned (2 cache lines) to avoid this kind of extended false sharing.
> > 
> > So add an option "-a" or "--double-cl" for c2c report to show false
> > sharing in double cache line granularity, which acts just like the
> > cacheline size is doubled. There is no change to c2c record. The
> > hardware HITM events are still per cacheline. The option just changes
> > the granularity of how events are grouped and displayed.
> > 
> > In the c2c report below (will-it-scale's pagefault2 case on old kernel):
> > 
> >   ----------------------------------------------------------------------
> >      26       31        2        0        0        0  0xffff888103ec6000
> >   ----------------------------------------------------------------------
> >    35.48%   50.00%    0.00%    0.00%    0.00%   0x10     0       1  0xffffffff8133148b      1153        66       971     3748        74  [k] get_mem_cgroup_from_mm
> >     6.45%    0.00%    0.00%    0.00%    0.00%   0x10     0       1  0xffffffff813396e4       570         0      1531      879        75  [k] mem_cgroup_charge
> >    25.81%   50.00%    0.00%    0.00%    0.00%   0x54     0       1  0xffffffff81331472       949        70       593     3359        74  [k] get_mem_cgroup_from_mm
> >    19.35%    0.00%    0.00%    0.00%    0.00%   0x54     0       1  0xffffffff81339686      1352         0      1073     1022        74  [k] mem_cgroup_charge
> >     9.68%    0.00%    0.00%    0.00%    0.00%   0x54     0       1  0xffffffff813396d6      1401         0       863      768        74  [k] mem_cgroup_charge
> >     3.23%    0.00%    0.00%    0.00%    0.00%   0x54     0       1  0xffffffff81333106       618         0       804       11         9  [k] uncharge_batch
> > 
> > The offset 0x10 and 0x54 used to displayed in 2 groups, and now they
> > are listed together to give users a hint.
> > 
> > [1]. https://lore.kernel.org/lkml/20201102091543.GM31092@shao2-debian/
> > 
> > Signed-off-by: Feng Tang <feng.tang@intel.com>
> > Reviewed-by: Andi Kleen <ak@linux.intel.com>
> > ---
> >  tools/perf/Documentation/perf-c2c.txt |  6 ++++++
> >  tools/perf/builtin-c2c.c              | 22 +++++++++++++---------
> >  tools/perf/util/cacheline.h           | 25 ++++++++++++++++++++-----
> >  tools/perf/util/sort.c                | 13 ++++++++++---
> >  tools/perf/util/sort.h                |  1 +
> >  5 files changed, 50 insertions(+), 17 deletions(-)
> > 
> > diff --git a/tools/perf/Documentation/perf-c2c.txt b/tools/perf/Documentation/perf-c2c.txt
> > index 5c5eb2def83e..a8e1e40d270e 100644
> > --- a/tools/perf/Documentation/perf-c2c.txt
> > +++ b/tools/perf/Documentation/perf-c2c.txt
> > @@ -126,6 +126,12 @@ REPORT OPTIONS
> >  	The known limitations include exception handing such as
> >  	setjmp/longjmp will have calls/returns not match.
> >  
> > +-a::
> > +--double-cl::
> > +	Group HITM events in double cacheline granularity. Some architecture
> > +	has Adjacent Cacheline Prefetch feature, which behaves like the
> > +	cachline size is doubled.
> 
> Sorry I gave my review tags too quick...
> 
> Just found several typos, please fix it.

Thanks for the catch, will fix them.

> s/Some architecture has/Some architectures have
> 
> This is just literal fixing, except x86, I don't know if any other
> architecture has Adjacent Cacheline Prefetch feature :)
> 
> s/cachline/cacheline
> 
> > +
> >  C2C RECORD
> >  ----------
> >  The perf c2c record command setup options related to HITM cacheline analysis
> > diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
> > index 52d94c7dd836..7d495db7e5a2 100644
> > --- a/tools/perf/builtin-c2c.c
> > +++ b/tools/perf/builtin-c2c.c
> > @@ -524,7 +524,7 @@ static int dcacheline_entry(struct perf_hpp_fmt *fmt, struct perf_hpp *hpp,
> >  	char buf[20];
> >  
> >  	if (he->mem_info)
> > -		addr = cl_address(he->mem_info->daddr.addr);
> > +		addr = cl_address(he->mem_info->daddr.addr, chk_double_cl);
> >  
> >  	return scnprintf(hpp->buf, hpp->size, "%*s", width, HEX_STR(buf, addr));
> >  }
> > @@ -562,7 +562,7 @@ static int offset_entry(struct perf_hpp_fmt *fmt, struct perf_hpp *hpp,
> >  	char buf[20];
> >  
> >  	if (he->mem_info)
> > -		addr = cl_offset(he->mem_info->daddr.al_addr);
> > +		addr = cl_offset(he->mem_info->daddr.al_addr, chk_double_cl);
> >  
> >  	return scnprintf(hpp->buf, hpp->size, "%*s", width, HEX_STR(buf, addr));
> >  }
> > @@ -574,9 +574,10 @@ offset_cmp(struct perf_hpp_fmt *fmt __maybe_unused,
> >  	uint64_t l = 0, r = 0;
> >  
> >  	if (left->mem_info)
> > -		l = cl_offset(left->mem_info->daddr.addr);
> > +		l = cl_offset(left->mem_info->daddr.addr, chk_double_cl);
> > +
> >  	if (right->mem_info)
> > -		r = cl_offset(right->mem_info->daddr.addr);
> > +		r = cl_offset(right->mem_info->daddr.addr, chk_double_cl);
> >  
> >  	return (int64_t)(r - l);
> >  }
> > @@ -2590,7 +2591,7 @@ perf_c2c_cacheline_browser__title(struct hist_browser *browser,
> >  	he = cl_browser->he;
> >  
> >  	if (he->mem_info)
> > -		addr = cl_address(he->mem_info->daddr.addr);
> > +		addr = cl_address(he->mem_info->daddr.addr, chk_double_cl);
> >  
> >  	scnprintf(bf, size, "Cacheline 0x%lx", addr);
> >  	return 0;
> > @@ -2788,15 +2789,16 @@ static int ui_quirks(void)
> >  	if (!c2c.use_stdio) {
> >  		dim_offset.width  = 5;
> >  		dim_offset.header = header_offset_tui;
> > -		nodestr = "CL";
> > +		nodestr = chk_double_cl ? "Double-CL" : "CL";
> >  	}
> >  
> >  	dim_percent_costly_snoop.header = percent_costly_snoop_header[c2c.display];
> >  
> >  	/* Fix the zero line for dcacheline column. */
> > -	buf = fill_line("Cacheline", dim_dcacheline.width +
> > -				     dim_dcacheline_node.width +
> > -				     dim_dcacheline_count.width + 4);
> > +	buf = fill_line(chk_double_cl ? "Double-Cacheline" : "Cacheline",
> > +				dim_dcacheline.width +
> > +				dim_dcacheline_node.width +
> > +				dim_dcacheline_count.width + 4);
> >  	if (!buf)
> >  		return -ENOMEM;
> >  
> > @@ -3037,6 +3039,8 @@ static int perf_c2c__report(int argc, const char **argv)
> >  	OPT_BOOLEAN('f', "force", &symbol_conf.force, "don't complain, do it"),
> >  	OPT_BOOLEAN(0, "stitch-lbr", &c2c.stitch_lbr,
> >  		    "Enable LBR callgraph stitching approach"),
> > +	OPT_BOOLEAN('a', "double-cl", &chk_double_cl,
> > +		    "Check adjacent cachline false sharing"),
> 
> I personally think the word "Detect" is better than "Check".

Ok will use this. The possiblity of finding a false sharing of
adjacent cachelines is indeed much less.

> s/cachline/cacheline

Will fix.

> >  	OPT_PARENT(c2c_options),
> >  	OPT_END()
> >  	};
> > diff --git a/tools/perf/util/cacheline.h b/tools/perf/util/cacheline.h
> > index dec8c0fb1f4a..630d16731b4f 100644
> > --- a/tools/perf/util/cacheline.h
> > +++ b/tools/perf/util/cacheline.h
> > @@ -6,16 +6,31 @@
> >  
> >  int __pure cacheline_size(void);
> >  
> > -static inline u64 cl_address(u64 address)
> > +
> > +/*
> > + * Some architecture has 'Adjacent Cacheline Prefetch' feature,
> 
> s/Some architecture has/Some architectures have

Will change.

> > + * which performs like the cacheline size being doubled.
> > + */
> > +static inline u64 cl_address(u64 address, bool double_cl)
> >  {
> > +	u64 size = cacheline_size();
> > +
> > +	if (double_cl)
> > +		size *= 2;
> > +
> >  	/* return the cacheline of the address */
> > -	return (address & ~(cacheline_size() - 1));
> > +	return (address & ~(size - 1));
> >  }
> >  
> > -static inline u64 cl_offset(u64 address)
> > +static inline u64 cl_offset(u64 address, bool double_cl)
> >  {
> > -	/* return the cacheline of the address */
> > -	return (address & (cacheline_size() - 1));
> > +	u64 size = cacheline_size();
> > +
> > +	if (double_cl)
> > +		size *= 2;
> > +
> > +	/* return the offset inside cachline */
> 
> s/cachline/cacheline
 
Will fix.

> > +	return (address & (size - 1));
> >  }
> >  
> >  #endif // PERF_CACHELINE_H
> > diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
> > index e188f74698dd..148b28f0a7e2 100644
> > --- a/tools/perf/util/sort.c
> > +++ b/tools/perf/util/sort.c
> > @@ -52,6 +52,13 @@ enum sort_mode	sort__mode = SORT_MODE__NORMAL;
> >  static const char *const dynamic_headers[] = {"local_ins_lat", "ins_lat", "local_p_stage_cyc", "p_stage_cyc"};
> >  static const char *const arch_specific_sort_keys[] = {"local_p_stage_cyc", "p_stage_cyc"};
> >  
> > +/*
> > + * Some architecture has Adjacent Cacheline Prefetch feature, which behaves
> > + * like the cachline size is doubled. Enable this flag to check things in
> 
> s/Some architecture has/Some architectures have
> s/cachline/cacheline
> 
> With these typo fixes, you could add my reviewed and tested tags.

Thanks a lot for the review and helping to test on Arm64 platform
which I don't have access to!

- Feng

> Thanks,
> Leo
> 
> > + * double cacheline granularity.
> > + */
> > +bool chk_double_cl;
> > +
> >  /*
> >   * Replaces all occurrences of a char used with the:
> >   *
> > @@ -1499,8 +1506,8 @@ sort__dcacheline_cmp(struct hist_entry *left, struct hist_entry *right)
> >  
> >  addr:
> >  	/* al_addr does all the right addr - start + offset calculations */
> > -	l = cl_address(left->mem_info->daddr.al_addr);
> > -	r = cl_address(right->mem_info->daddr.al_addr);
> > +	l = cl_address(left->mem_info->daddr.al_addr, chk_double_cl);
> > +	r = cl_address(right->mem_info->daddr.al_addr, chk_double_cl);
> >  
> >  	if (l > r) return -1;
> >  	if (l < r) return 1;
> > @@ -1519,7 +1526,7 @@ static int hist_entry__dcacheline_snprintf(struct hist_entry *he, char *bf,
> >  	if (he->mem_info) {
> >  		struct map *map = he->mem_info->daddr.ms.map;
> >  
> > -		addr = cl_address(he->mem_info->daddr.al_addr);
> > +		addr = cl_address(he->mem_info->daddr.al_addr, chk_double_cl);
> >  		ms = &he->mem_info->daddr.ms;
> >  
> >  		/* print [s] for shared data mmaps */
> > diff --git a/tools/perf/util/sort.h b/tools/perf/util/sort.h
> > index 921715e6aec4..04f0a6dc7381 100644
> > --- a/tools/perf/util/sort.h
> > +++ b/tools/perf/util/sort.h
> > @@ -35,6 +35,7 @@ extern struct sort_entry sort_sym_from;
> >  extern struct sort_entry sort_sym_to;
> >  extern struct sort_entry sort_srcline;
> >  extern const char default_mem_sort_order[];
> > +extern bool chk_double_cl;
> >  
> >  struct res_sample {
> >  	u64 time;
> > -- 
> > 2.30.2
> > 

  reply	other threads:[~2023-02-13  8:29 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-02-13  3:17 [PATCH] perf c2c: Add report option to show false sharing in adjacent cachelines Feng Tang
2023-02-13  8:02 ` Leo Yan
2023-02-13  8:18 ` Leo Yan
2023-02-13  8:24   ` Feng Tang [this message]
2023-02-13  8:54     ` Leo Yan
2023-02-13 13:39 ` Joe Mario
2023-02-13 14:04   ` Feng Tang
2023-02-14 19:31   ` Arnaldo Carvalho de Melo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Y+nzoaJcnMFokDMX@feng-clx \
    --to=feng.tang@intel.com \
    --cc=acme@kernel.org \
    --cc=ak@linux.intel.com \
    --cc=alexander.shishkin@linux.intel.com \
    --cc=jmario@redhat.com \
    --cc=jolsa@kernel.org \
    --cc=kan.liang@linux.intel.com \
    --cc=leo.yan@linaro.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-perf-users@vger.kernel.org \
    --cc=mark.rutland@arm.com \
    --cc=mingo@redhat.com \
    --cc=namhyung@kernel.org \
    --cc=peterz@infradead.org \
    --cc=zhengjun.xing@linux.intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).