linux-perf-users.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] perf/x86/amd: Warn only on new bits set
@ 2024-05-24 14:10 Breno Leitao
  2024-06-06  5:34 ` Sandipan Das
  2024-06-25 11:57 ` Peter Zijlstra
  0 siblings, 2 replies; 7+ messages in thread
From: Breno Leitao @ 2024-05-24 14:10 UTC (permalink / raw)
  To: sandipan.das, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Liang, Kan, Thomas Gleixner, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin
  Cc: leit, Paul E . McKenney, open list:PERFORMANCE EVENTS SUBSYSTEM,
	open list:PERFORMANCE EVENTS SUBSYSTEM

Warning at every leaking bits can cause a flood of message, triggering
vairous stall-warning mechanisms to fire, including CSD locks, which
makes the machine to be unusable.

Track the bits that are being leaked, and only warn when a new bit is
set.

Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
---
 arch/x86/events/amd/core.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/amd/core.c b/arch/x86/events/amd/core.c
index 1fc4ce44e743..df0ba2382d13 100644
--- a/arch/x86/events/amd/core.c
+++ b/arch/x86/events/amd/core.c
@@ -941,11 +941,12 @@ static int amd_pmu_v2_snapshot_branch_stack(struct perf_branch_entry *entries, u
 static int amd_pmu_v2_handle_irq(struct pt_regs *regs)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+	static atomic64_t status_warned = ATOMIC64_INIT(0);
+	u64 reserved, status, mask, new_bits;
 	struct perf_sample_data data;
 	struct hw_perf_event *hwc;
 	struct perf_event *event;
 	int handled = 0, idx;
-	u64 reserved, status, mask;
 	bool pmu_enabled;
 
 	/*
@@ -1010,7 +1011,11 @@ static int amd_pmu_v2_handle_irq(struct pt_regs *regs)
 	 * the corresponding PMCs are expected to be inactive according to the
 	 * active_mask
 	 */
-	WARN_ON(status > 0);
+	if (status > 0) {
+		new_bits = atomic64_fetch_or(status, &status_warned) ^ atomic64_read(&status_warned);
+		// A new bit was set for the very first time.
+		WARN(new_bits, "New overflows for inactive PMCs: %llx\n", new_bits);
+	}
 
 	/* Clear overflow and freeze bits */
 	amd_pmu_ack_global_status(~status);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] perf/x86/amd: Warn only on new bits set
  2024-05-24 14:10 [PATCH] perf/x86/amd: Warn only on new bits set Breno Leitao
@ 2024-06-06  5:34 ` Sandipan Das
  2024-06-25 11:57 ` Peter Zijlstra
  1 sibling, 0 replies; 7+ messages in thread
From: Sandipan Das @ 2024-06-06  5:34 UTC (permalink / raw)
  To: Breno Leitao, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Liang, Kan, Thomas Gleixner, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin
  Cc: leit, Paul E . McKenney, open list:PERFORMANCE EVENTS SUBSYSTEM,
	open list:PERFORMANCE EVENTS SUBSYSTEM

On 5/24/2024 7:40 PM, Breno Leitao wrote:
> Warning at every leaking bits can cause a flood of message, triggering
> vairous stall-warning mechanisms to fire, including CSD locks, which
> makes the machine to be unusable.
> 
> Track the bits that are being leaked, and only warn when a new bit is
> set.
> 
> Suggested-by: Paul E. McKenney <paulmck@kernel.org>
> Signed-off-by: Breno Leitao <leitao@debian.org>

Reviewed-by: Sandipan Das <sandipan.das@amd.com>

> ---
>  arch/x86/events/amd/core.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/events/amd/core.c b/arch/x86/events/amd/core.c
> index 1fc4ce44e743..df0ba2382d13 100644
> --- a/arch/x86/events/amd/core.c
> +++ b/arch/x86/events/amd/core.c
> @@ -941,11 +941,12 @@ static int amd_pmu_v2_snapshot_branch_stack(struct perf_branch_entry *entries, u
>  static int amd_pmu_v2_handle_irq(struct pt_regs *regs)
>  {
>  	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> +	static atomic64_t status_warned = ATOMIC64_INIT(0);
> +	u64 reserved, status, mask, new_bits;
>  	struct perf_sample_data data;
>  	struct hw_perf_event *hwc;
>  	struct perf_event *event;
>  	int handled = 0, idx;
> -	u64 reserved, status, mask;
>  	bool pmu_enabled;
>  
>  	/*
> @@ -1010,7 +1011,11 @@ static int amd_pmu_v2_handle_irq(struct pt_regs *regs)
>  	 * the corresponding PMCs are expected to be inactive according to the
>  	 * active_mask
>  	 */
> -	WARN_ON(status > 0);
> +	if (status > 0) {
> +		new_bits = atomic64_fetch_or(status, &status_warned) ^ atomic64_read(&status_warned);
> +		// A new bit was set for the very first time.
> +		WARN(new_bits, "New overflows for inactive PMCs: %llx\n", new_bits);
> +	}
>  
>  	/* Clear overflow and freeze bits */
>  	amd_pmu_ack_global_status(~status);


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] perf/x86/amd: Warn only on new bits set
  2024-05-24 14:10 [PATCH] perf/x86/amd: Warn only on new bits set Breno Leitao
  2024-06-06  5:34 ` Sandipan Das
@ 2024-06-25 11:57 ` Peter Zijlstra
  2024-06-25 14:47   ` Paul E. McKenney
  1 sibling, 1 reply; 7+ messages in thread
From: Peter Zijlstra @ 2024-06-25 11:57 UTC (permalink / raw)
  To: Breno Leitao
  Cc: sandipan.das, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, Thomas Gleixner, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, leit, Paul E . McKenney,
	open list:PERFORMANCE EVENTS SUBSYSTEM,
	open list:PERFORMANCE EVENTS SUBSYSTEM

On Fri, May 24, 2024 at 07:10:20AM -0700, Breno Leitao wrote:
> Warning at every leaking bits can cause a flood of message, triggering
> vairous stall-warning mechanisms to fire, including CSD locks, which
> makes the machine to be unusable.
> 
> Track the bits that are being leaked, and only warn when a new bit is
> set.
> 
> Suggested-by: Paul E. McKenney <paulmck@kernel.org>
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
>  arch/x86/events/amd/core.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/events/amd/core.c b/arch/x86/events/amd/core.c
> index 1fc4ce44e743..df0ba2382d13 100644
> --- a/arch/x86/events/amd/core.c
> +++ b/arch/x86/events/amd/core.c
> @@ -941,11 +941,12 @@ static int amd_pmu_v2_snapshot_branch_stack(struct perf_branch_entry *entries, u
>  static int amd_pmu_v2_handle_irq(struct pt_regs *regs)
>  {
>  	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> +	static atomic64_t status_warned = ATOMIC64_INIT(0);
> +	u64 reserved, status, mask, new_bits;
>  	struct perf_sample_data data;
>  	struct hw_perf_event *hwc;
>  	struct perf_event *event;
>  	int handled = 0, idx;
> -	u64 reserved, status, mask;
>  	bool pmu_enabled;
>  
>  	/*
> @@ -1010,7 +1011,11 @@ static int amd_pmu_v2_handle_irq(struct pt_regs *regs)
>  	 * the corresponding PMCs are expected to be inactive according to the
>  	 * active_mask
>  	 */
> -	WARN_ON(status > 0);
> +	if (status > 0) {
> +		new_bits = atomic64_fetch_or(status, &status_warned) ^ atomic64_read(&status_warned);
> +		// A new bit was set for the very first time.
> +		WARN(new_bits, "New overflows for inactive PMCs: %llx\n", new_bits);
> +	}

Why not just a WARN_ON_ONCE() instead? This really shouldn't be
happening in the first place.


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] perf/x86/amd: Warn only on new bits set
  2024-06-25 11:57 ` Peter Zijlstra
@ 2024-06-25 14:47   ` Paul E. McKenney
  2024-06-26  8:51     ` Peter Zijlstra
  0 siblings, 1 reply; 7+ messages in thread
From: Paul E. McKenney @ 2024-06-25 14:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Breno Leitao, sandipan.das, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, Liang, Kan, Thomas Gleixner,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, leit,
	open list:PERFORMANCE EVENTS SUBSYSTEM,
	open list:PERFORMANCE EVENTS SUBSYSTEM

On Tue, Jun 25, 2024 at 01:57:34PM +0200, Peter Zijlstra wrote:
> On Fri, May 24, 2024 at 07:10:20AM -0700, Breno Leitao wrote:
> > Warning at every leaking bits can cause a flood of message, triggering
> > vairous stall-warning mechanisms to fire, including CSD locks, which
> > makes the machine to be unusable.
> > 
> > Track the bits that are being leaked, and only warn when a new bit is
> > set.
> > 
> > Suggested-by: Paul E. McKenney <paulmck@kernel.org>
> > Signed-off-by: Breno Leitao <leitao@debian.org>
> > ---
> >  arch/x86/events/amd/core.c | 9 +++++++--
> >  1 file changed, 7 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/x86/events/amd/core.c b/arch/x86/events/amd/core.c
> > index 1fc4ce44e743..df0ba2382d13 100644
> > --- a/arch/x86/events/amd/core.c
> > +++ b/arch/x86/events/amd/core.c
> > @@ -941,11 +941,12 @@ static int amd_pmu_v2_snapshot_branch_stack(struct perf_branch_entry *entries, u
> >  static int amd_pmu_v2_handle_irq(struct pt_regs *regs)
> >  {
> >  	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> > +	static atomic64_t status_warned = ATOMIC64_INIT(0);
> > +	u64 reserved, status, mask, new_bits;
> >  	struct perf_sample_data data;
> >  	struct hw_perf_event *hwc;
> >  	struct perf_event *event;
> >  	int handled = 0, idx;
> > -	u64 reserved, status, mask;
> >  	bool pmu_enabled;
> >  
> >  	/*
> > @@ -1010,7 +1011,11 @@ static int amd_pmu_v2_handle_irq(struct pt_regs *regs)
> >  	 * the corresponding PMCs are expected to be inactive according to the
> >  	 * active_mask
> >  	 */
> > -	WARN_ON(status > 0);
> > +	if (status > 0) {
> > +		new_bits = atomic64_fetch_or(status, &status_warned) ^ atomic64_read(&status_warned);
> > +		// A new bit was set for the very first time.
> > +		WARN(new_bits, "New overflows for inactive PMCs: %llx\n", new_bits);
> > +	}
> 
> Why not just a WARN_ON_ONCE() instead? This really shouldn't be
> happening in the first place.

We did consider that, but seeing the full set of bits that shouldn't
have been happening in the first place helps with debuggging.

But is there a better way to accumulate and print the full set of
unexpected bits?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] perf/x86/amd: Warn only on new bits set
  2024-06-25 14:47   ` Paul E. McKenney
@ 2024-06-26  8:51     ` Peter Zijlstra
  2024-06-26 13:47       ` Paul E. McKenney
  2024-06-26 13:57       ` Breno Leitao
  0 siblings, 2 replies; 7+ messages in thread
From: Peter Zijlstra @ 2024-06-26  8:51 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Breno Leitao, sandipan.das, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, Liang, Kan, Thomas Gleixner,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, leit,
	open list:PERFORMANCE EVENTS SUBSYSTEM,
	open list:PERFORMANCE EVENTS SUBSYSTEM

On Tue, Jun 25, 2024 at 07:47:06AM -0700, Paul E. McKenney wrote:
> On Tue, Jun 25, 2024 at 01:57:34PM +0200, Peter Zijlstra wrote:
> > On Fri, May 24, 2024 at 07:10:20AM -0700, Breno Leitao wrote:
> > > Warning at every leaking bits can cause a flood of message, triggering
> > > vairous stall-warning mechanisms to fire, including CSD locks, which
> > > makes the machine to be unusable.
> > > 
> > > Track the bits that are being leaked, and only warn when a new bit is
> > > set.
> > > 
> > > Suggested-by: Paul E. McKenney <paulmck@kernel.org>
> > > Signed-off-by: Breno Leitao <leitao@debian.org>
> > > ---
> > >  arch/x86/events/amd/core.c | 9 +++++++--
> > >  1 file changed, 7 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/arch/x86/events/amd/core.c b/arch/x86/events/amd/core.c
> > > index 1fc4ce44e743..df0ba2382d13 100644
> > > --- a/arch/x86/events/amd/core.c
> > > +++ b/arch/x86/events/amd/core.c
> > > @@ -941,11 +941,12 @@ static int amd_pmu_v2_snapshot_branch_stack(struct perf_branch_entry *entries, u
> > >  static int amd_pmu_v2_handle_irq(struct pt_regs *regs)
> > >  {
> > >  	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> > > +	static atomic64_t status_warned = ATOMIC64_INIT(0);
> > > +	u64 reserved, status, mask, new_bits;
> > >  	struct perf_sample_data data;
> > >  	struct hw_perf_event *hwc;
> > >  	struct perf_event *event;
> > >  	int handled = 0, idx;
> > > -	u64 reserved, status, mask;
> > >  	bool pmu_enabled;
> > >  
> > >  	/*
> > > @@ -1010,7 +1011,11 @@ static int amd_pmu_v2_handle_irq(struct pt_regs *regs)
> > >  	 * the corresponding PMCs are expected to be inactive according to the
> > >  	 * active_mask
> > >  	 */
> > > -	WARN_ON(status > 0);
> > > +	if (status > 0) {
> > > +		new_bits = atomic64_fetch_or(status, &status_warned) ^ atomic64_read(&status_warned);
> > > +		// A new bit was set for the very first time.
> > > +		WARN(new_bits, "New overflows for inactive PMCs: %llx\n", new_bits);
> > > +	}
> > 
> > Why not just a WARN_ON_ONCE() instead? This really shouldn't be
> > happening in the first place.
> 
> We did consider that, but seeing the full set of bits that shouldn't
> have been happening in the first place helps with debuggging.
> 
> But is there a better way to accumulate and print the full set of
> unexpected bits?

Dunno, I was just wondering if the whole thing wasn't massive overkill.
The changelog wasn't really explaining much here.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] perf/x86/amd: Warn only on new bits set
  2024-06-26  8:51     ` Peter Zijlstra
@ 2024-06-26 13:47       ` Paul E. McKenney
  2024-06-26 13:57       ` Breno Leitao
  1 sibling, 0 replies; 7+ messages in thread
From: Paul E. McKenney @ 2024-06-26 13:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Breno Leitao, sandipan.das, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, Liang, Kan, Thomas Gleixner,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, leit,
	open list:PERFORMANCE EVENTS SUBSYSTEM,
	open list:PERFORMANCE EVENTS SUBSYSTEM

On Wed, Jun 26, 2024 at 10:51:53AM +0200, Peter Zijlstra wrote:
> On Tue, Jun 25, 2024 at 07:47:06AM -0700, Paul E. McKenney wrote:
> > On Tue, Jun 25, 2024 at 01:57:34PM +0200, Peter Zijlstra wrote:
> > > On Fri, May 24, 2024 at 07:10:20AM -0700, Breno Leitao wrote:
> > > > Warning at every leaking bits can cause a flood of message, triggering
> > > > vairous stall-warning mechanisms to fire, including CSD locks, which
> > > > makes the machine to be unusable.
> > > > 
> > > > Track the bits that are being leaked, and only warn when a new bit is
> > > > set.
> > > > 
> > > > Suggested-by: Paul E. McKenney <paulmck@kernel.org>
> > > > Signed-off-by: Breno Leitao <leitao@debian.org>
> > > > ---
> > > >  arch/x86/events/amd/core.c | 9 +++++++--
> > > >  1 file changed, 7 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/arch/x86/events/amd/core.c b/arch/x86/events/amd/core.c
> > > > index 1fc4ce44e743..df0ba2382d13 100644
> > > > --- a/arch/x86/events/amd/core.c
> > > > +++ b/arch/x86/events/amd/core.c
> > > > @@ -941,11 +941,12 @@ static int amd_pmu_v2_snapshot_branch_stack(struct perf_branch_entry *entries, u
> > > >  static int amd_pmu_v2_handle_irq(struct pt_regs *regs)
> > > >  {
> > > >  	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> > > > +	static atomic64_t status_warned = ATOMIC64_INIT(0);
> > > > +	u64 reserved, status, mask, new_bits;
> > > >  	struct perf_sample_data data;
> > > >  	struct hw_perf_event *hwc;
> > > >  	struct perf_event *event;
> > > >  	int handled = 0, idx;
> > > > -	u64 reserved, status, mask;
> > > >  	bool pmu_enabled;
> > > >  
> > > >  	/*
> > > > @@ -1010,7 +1011,11 @@ static int amd_pmu_v2_handle_irq(struct pt_regs *regs)
> > > >  	 * the corresponding PMCs are expected to be inactive according to the
> > > >  	 * active_mask
> > > >  	 */
> > > > -	WARN_ON(status > 0);
> > > > +	if (status > 0) {
> > > > +		new_bits = atomic64_fetch_or(status, &status_warned) ^ atomic64_read(&status_warned);
> > > > +		// A new bit was set for the very first time.
> > > > +		WARN(new_bits, "New overflows for inactive PMCs: %llx\n", new_bits);
> > > > +	}
> > > 
> > > Why not just a WARN_ON_ONCE() instead? This really shouldn't be
> > > happening in the first place.
> > 
> > We did consider that, but seeing the full set of bits that shouldn't
> > have been happening in the first place helps with debuggging.
> > 
> > But is there a better way to accumulate and print the full set of
> > unexpected bits?
> 
> Dunno, I was just wondering if the whole thing wasn't massive overkill.
> The changelog wasn't really explaining much here.

For me, the two additional executable lines of code seems to be a small
price to pay to be able to see all the bits, not just the first one that
happened to be set.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] perf/x86/amd: Warn only on new bits set
  2024-06-26  8:51     ` Peter Zijlstra
  2024-06-26 13:47       ` Paul E. McKenney
@ 2024-06-26 13:57       ` Breno Leitao
  1 sibling, 0 replies; 7+ messages in thread
From: Breno Leitao @ 2024-06-26 13:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, sandipan.das, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Liang, Kan, Thomas Gleixner, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, leit, open list:PERFORMANCE EVENTS SUBSYSTEM,
	open list:PERFORMANCE EVENTS SUBSYSTEM

Hello Peter,

On Wed, Jun 26, 2024 at 10:51:53AM +0200, Peter Zijlstra wrote:
> On Tue, Jun 25, 2024 at 07:47:06AM -0700, Paul E. McKenney wrote:
> > On Tue, Jun 25, 2024 at 01:57:34PM +0200, Peter Zijlstra wrote:

> > > Why not just a WARN_ON_ONCE() instead? This really shouldn't be
> > > happening in the first place.

> > We did consider that, but seeing the full set of bits that shouldn't
> > have been happening in the first place helps with debuggging.
> > 
> > But is there a better way to accumulate and print the full set of
> > unexpected bits?

> Dunno, I was just wondering if the whole thing wasn't massive overkill.
> The changelog wasn't really explaining much here.

I can help with some motivation, if it helps.

	1) This problem happens on random machines, rarely

	2) When this problem happens, there is a flood warnings,
	   sometimes it causes the whole machine to be unusable.

	3) It is hard to figure out what is the root cause, and to
	   reproduce the problem.

	4) There isn't information about what bits are being leaked.

That said, this patch will help with the following issues:

	1) It will tell us which bits are being set, so, it is easy to
	   communicate it back to vendor, and to do a root-cause analyzes.

	2) It avoid the machine to be unusable, because, worst case
	   scenario, we get less than 60 WARNs.

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2024-06-26 13:57 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-05-24 14:10 [PATCH] perf/x86/amd: Warn only on new bits set Breno Leitao
2024-06-06  5:34 ` Sandipan Das
2024-06-25 11:57 ` Peter Zijlstra
2024-06-25 14:47   ` Paul E. McKenney
2024-06-26  8:51     ` Peter Zijlstra
2024-06-26 13:47       ` Paul E. McKenney
2024-06-26 13:57       ` Breno Leitao

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).