public inbox for linux-kernel@vger.kernel.org
* [PATCH v2 0/2] Two semi-related perf throttling fixes
@ 2026-04-29 17:36 Calvin Owens
  2026-04-29 17:36 ` [PATCH v2 1/2] perf/x86: Avoid double accounting of PMU NMI latencies Calvin Owens
  2026-04-29 17:36 ` [PATCH v2 2/2] perf: Don't throttle based on NMI watchdog events Calvin Owens
  0 siblings, 2 replies; 9+ messages in thread
From: Calvin Owens @ 2026-04-29 17:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-perf-users, x86, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	James Clark, Thomas Gleixner, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Andi Kleen

Hi all,

This is a refresh of this little series:
https://lore.kernel.org/lkml/cover.1774969692.git.calvin@wbinvd.org/

Patch [1/2] is the same, patch [2/2] has two minor fixes described in
the patch itself. The remainder of this cover is the same as in v1.

Cheers,
Calvin

---

In the course of investigating [1], I set out to understand why this
sequence of messages is printed every boot, even when nobody is using
perf at all:

    perf: interrupt took too long (2516 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
    perf: interrupt took too long (3156 > 3145), lowering kernel.perf_event_max_sample_rate to 63000
    perf: interrupt took too long (4014 > 3945), lowering kernel.perf_event_max_sample_rate to 49000
    perf: interrupt took too long (5035 > 5017), lowering kernel.perf_event_max_sample_rate to 39000
    perf: interrupt took too long (6302 > 6293), lowering kernel.perf_event_max_sample_rate to 31000
    perf: interrupt took too long (7879 > 7877), lowering kernel.perf_event_max_sample_rate to 25000
    perf: interrupt took too long (9852 > 9848), lowering kernel.perf_event_max_sample_rate to 20000

It turns out this happens because of how the dynamic sample rate
throttling interacts with the perf hardware watchdog. Patch [2/2] is my
attempt to prevent the dynamic throttling logic from acting based solely
on the latency of the watchdog NMI.

Intel CPUs were happy with that. But AMD CPUs still printed the messages!

That happens because AMD CPUs have a second PMU facility with its own
NMI handler, and both NMI handlers average in their latency, even when
they don't actually handle the NMI.

Patch [1/2] fixes that, which is a correctness issue entirely
independent of patch [2/2]. But it also happens to be required for patch
[2/2] to achieve its goal on AMD CPUs, so I sent them together.

Thanks,
Calvin

[1] https://lore.kernel.org/all/acMe-QZUel-bBYUh@mozart.vkv.me/

Calvin Owens (2):
  perf/x86: Avoid double accounting of PMU NMI latencies
  perf: Don't throttle based on NMI watchdog events

 arch/x86/events/amd/ibs.c |  6 +++---
 arch/x86/events/core.c    |  3 ++-
 kernel/events/core.c      | 16 ++++++++++++++++
 3 files changed, 21 insertions(+), 4 deletions(-)

-- 
2.47.3


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH v2 1/2] perf/x86: Avoid double accounting of PMU NMI latencies
  2026-04-29 17:36 [PATCH v2 0/2] Two semi-related perf throttling fixes Calvin Owens
@ 2026-04-29 17:36 ` Calvin Owens
  2026-04-29 17:36 ` [PATCH v2 2/2] perf: Don't throttle based on NMI watchdog events Calvin Owens
  1 sibling, 0 replies; 9+ messages in thread
From: Calvin Owens @ 2026-04-29 17:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-perf-users, x86, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	James Clark, Thomas Gleixner, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Andi Kleen

Because NMIs always poll all handlers, calling perf_sample_event_took()
unconditionally in perf_ibs_nmi_handler() and perf_event_nmi_handler()
causes two latency numbers to be fed into the exponentially weighted
moving average for each NMI on AMD machines, one of which is much
smaller than the other:

    <...>-70985   [029] d.Z1. 13311.704313: nmi_handler: perf_event_nmi_handler() delta_ns: 6732 handled: 1
    <...>-70985   [029] d.Z1. 13311.704317: nmi_handler: nmi_cpu_backtrace_handler() delta_ns: 1673 handled: 0
    <...>-70985   [029] d.Z1. 13311.704319: nmi_handler: perf_ibs_nmi_handler() delta_ns: 2064 handled: 0

This can bias the average unrealistically low, in this case because the
latency of perf_ibs_handle_irq() doing nothing is averaged with the
latency of amd_pmu_v2_handle_irq() doing real work:

    # bpftrace -e 'kprobe:perf_sample_event_took {\
    printf("%s: cpu=%02d sample_len_ns=%d\n", strftime("%S.%f", nsecs), cpu(), arg0); }'
    Attached 1 probe
    02.836860: cpu=17 sample_len_ns=7775
    02.836871: cpu=17 sample_len_ns=1492  // avg=4634
    03.042803: cpu=20 sample_len_ns=4298
    03.042810: cpu=20 sample_len_ns=1152  // avg=2725
    03.204410: cpu=27 sample_len_ns=6973
    03.204420: cpu=27 sample_len_ns=1302  // avg=4137
    03.622364: cpu=00 sample_len_ns=5270
    03.622371: cpu=00 sample_len_ns=992   // avg=3131
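
For reference, the running average is maintained roughly like this (a
simplified paraphrase of the existing logic in perf_sample_event_took(),
shown only to illustrate the dilution):

    running_len = __this_cpu_read(running_sample_length);
    running_len -= running_len / NR_ACCUMULATED_SAMPLES;  /* decay by 1/128 */
    running_len += sample_len_ns;
    __this_cpu_write(running_sample_length, running_len);

    /* avg_len is what gets compared against perf_sample_allowed_ns */
    avg_len = running_len / NR_ACCUMULATED_SAMPLES;

Every "handled nothing" sample from the IBS handler is folded in exactly
like a real one, dragging avg_len toward the midpoint of the two.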

Avoid the problem by only accounting the latency of the handler which
actually handled the NMI.

Fixes: c2872d381f1a ("perf/x86/ibs: Add IBS interrupt to the dynamic throttle")
Reviewed-by: Andi Kleen <ak@kernel.org>
Signed-off-by: Calvin Owens <calvin@wbinvd.org>
---
Changes in v2:
* No changes except tags: added Andi's Reviewed-by.

 arch/x86/events/amd/ibs.c | 6 +++---
 arch/x86/events/core.c    | 3 ++-
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/x86/events/amd/ibs.c b/arch/x86/events/amd/ibs.c
index e0bd5051db2a..b451a1b2c9ad 100644
--- a/arch/x86/events/amd/ibs.c
+++ b/arch/x86/events/amd/ibs.c
@@ -1599,10 +1599,10 @@ perf_ibs_nmi_handler(unsigned int cmd, struct pt_regs *regs)
 	handled += perf_ibs_handle_irq(&perf_ibs_fetch, regs);
 	handled += perf_ibs_handle_irq(&perf_ibs_op, regs);
 
-	if (handled)
+	if (handled) {
 		inc_irq_stat(apic_perf_irqs);
-
-	perf_sample_event_took(sched_clock() - stamp);
+		perf_sample_event_took(sched_clock() - stamp);
+	}
 
 	return handled;
 }
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 810ab21ffd99..d1c7612e2e5b 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1814,7 +1814,8 @@ perf_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
 	ret = static_call(x86_pmu_handle_irq)(regs);
 	finish_clock = sched_clock();
 
-	perf_sample_event_took(finish_clock - start_clock);
+	if (ret)
+		perf_sample_event_took(finish_clock - start_clock);
 
 	return ret;
 }
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v2 2/2] perf: Don't throttle based on NMI watchdog events
  2026-04-29 17:36 [PATCH v2 0/2] Two semi-related perf throttling fixes Calvin Owens
  2026-04-29 17:36 ` [PATCH v2 1/2] perf/x86: Avoid double accounting of PMU NMI latencies Calvin Owens
@ 2026-04-29 17:36 ` Calvin Owens
  2026-04-29 22:08   ` Calvin Owens
  2026-05-01 20:54   ` Peter Zijlstra
  1 sibling, 2 replies; 9+ messages in thread
From: Calvin Owens @ 2026-04-29 17:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-perf-users, x86, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	James Clark, Thomas Gleixner, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Andi Kleen

The throttling logic in perf_sample_event_took() assumes the NMI is
running at the maximum allowed sample rate. While this makes sense most
of the time, it wildly overestimates the runtime of the NMI for the perf
hardware watchdog:

    # bpftrace -e 'kprobe:perf_sample_event_took { \
	    printf("%s: cpu=%02d time_taken=%dns\n", \
	    strftime("%H:%M:%S.%f", nsecs), cpu(), arg0); }'
    03:12:13.087003: cpu=00 time_taken=3190ns
    03:12:13.486789: cpu=01 time_taken=2918ns
    03:12:18.075288: cpu=03 time_taken=3308ns
    03:12:19.797207: cpu=02 time_taken=2581ns
    03:12:23.110317: cpu=00 time_taken=2823ns
    03:12:23.510308: cpu=01 time_taken=2943ns
    03:12:29.229348: cpu=03 time_taken=3669ns
    03:12:31.656306: cpu=02 time_taken=3262ns

The NMI for the watchdog runs for 2-4us every ten seconds, but the
math done in perf_sample_event_took() concludes it is running for
200-400ms every second!
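
Concretely, with the default sysctls (kernel.perf_event_max_sample_rate
= 100000 and kernel.perf_cpu_time_max_percent = 25), and assuming an
even 3us per sample, the arithmetic works out to roughly:

    allowed budget:   (1s / 100000) * 25%         = 2500ns per sample
    assumed CPU time: 3us * 100000 samples/second = 300ms per second
    actual CPU time:  3us every 10 seconds        = 0.3us per second

...which is why the very first message below trips the 2500ns threshold.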

When it is the only PMU event running, it can take minutes to hours of
samples from the watchdog for the moving average to accumulate to
something near the real mean, which causes the same little "litany" of
sample rate throttles to happen every time Linux boots with the perf
hardware watchdog enabled:

    perf: interrupt took too long (2526 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
    perf: interrupt took too long (3177 > 3157), lowering kernel.perf_event_max_sample_rate to 62000
    perf: interrupt took too long (3979 > 3971), lowering kernel.perf_event_max_sample_rate to 50000
    perf: interrupt took too long (4983 > 4973), lowering kernel.perf_event_max_sample_rate to 40000

This serves no purpose: it doesn't actually affect the runtime of the
watchdog NMI at all. It confuses users, because it suggests their
machine is spinning its wheels in interrupts when it isn't.

Because the watchdog NMI is so infrequent, we can avoid throttling it by
making the throttling a two-step process: load and update a timestamp
whenever we think we need to throttle, and only actually proceed to
throttle if the last time that happened was less than one second ago.

This is inelegant, but it avoids touching the hot path and preserves
current throttling behavior for real PMU use, at the cost of delaying
the throttling by a single NMI.

Signed-off-by: Calvin Owens <calvin@wbinvd.org>
---
Changes in v2:
* Sashiko spotted that nesting NMIs could cause throttling to be skipped
  incorrectly. Its suggested fixes were terrible, so I ignored them and
  used __this_cpu_cmpxchg() to solve the problem instead.
* Sashiko notes that NMIs which last over one second firing on a period
  of greater than one second will also skip throttling: I don't believe
  that could ever actually happen, but this isn't a hot path, so I added
  a check that the current NMI took less than a second just to be sure.

 kernel/events/core.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 6d1f8bad7e1c..c2a33cb194ce 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -623,6 +623,7 @@ core_initcall(init_events_core_sysctls);
  */
 #define NR_ACCUMULATED_SAMPLES 128
 static DEFINE_PER_CPU(u64, running_sample_length);
+static DEFINE_PER_CPU(u64, last_throttle_clock);
 
 static u64 __report_avg;
 static u64 __report_allowed;
@@ -643,6 +644,8 @@ void perf_sample_event_took(u64 sample_len_ns)
 	u64 max_len = READ_ONCE(perf_sample_allowed_ns);
 	u64 running_len;
 	u64 avg_len;
+	u64 last;
+	u64 now;
 	u32 max;
 
 	if (max_len == 0)
@@ -663,6 +666,19 @@ void perf_sample_event_took(u64 sample_len_ns)
 	if (avg_len <= max_len)
 		return;
 
+	/*
+	 * Very infrequent events like the perf counter hard watchdog
+	 * can trigger spurious throttling: skip throttling if the prior
+	 * NMI got here more than one second before this NMI began. But
+	 * never skip throttling if NMIs are nesting, or if any NMI runs
+	 * for longer than one second.
+	 */
+	now = local_clock();
+	last = __this_cpu_read(last_throttle_clock);
+	if (__this_cpu_cmpxchg(last_throttle_clock, last, now) == last &&
+	    now - last > NSEC_PER_SEC && sample_len_ns < NSEC_PER_SEC)
+		return;
+
 	__report_avg = avg_len;
 	__report_allowed = max_len;
 
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH v2 2/2] perf: Don't throttle based on NMI watchdog events
  2026-04-29 17:36 ` [PATCH v2 2/2] perf: Don't throttle based on NMI watchdog events Calvin Owens
@ 2026-04-29 22:08   ` Calvin Owens
  2026-04-29 22:15     ` Ian Rogers
  2026-05-01 17:07     ` Calvin Owens
  2026-05-01 20:54   ` Peter Zijlstra
  1 sibling, 2 replies; 9+ messages in thread
From: Calvin Owens @ 2026-04-29 22:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-perf-users, x86, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	James Clark, Thomas Gleixner, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Andi Kleen

On Wednesday 04/29 at 20:38 +0000, sashiko-bot@kernel.org wrote:
> > @@ -636,6 +639,19 @@ void perf_sample_event_took(u64 sample_len_ns)
> >  	if (avg_len <= max_len)
> >  		return;
> >  
> > +	/*
> > +	 * Very infrequent events like the perf counter hard watchdog
> > +	 * can trigger spurious throttling: skip throttling if the prior
> > +	 * NMI got here more than one second before this NMI began. But
> > +	 * never skip throttling if NMIs are nesting, or if any NMI runs
> > +	 * for longer than one second.
> > +	 */
> > +	now = local_clock();
> > +	last = __this_cpu_read(last_throttle_clock);
> > +	if (__this_cpu_cmpxchg(last_throttle_clock, last, now) == last &&
> 
> On 32-bit architectures, is it safe to use __this_cpu_cmpxchg() on a 64-bit
> value in NMI context?
>
> For architectures like 32-bit x86, the generic fallback implementation
> achieves atomicity by disabling interrupts, which does not mask NMIs. If a
> nested NMI interrupts the sequence halfway through reading or writing the
> 32-bit halves of the u64, could this result in a torn read or write?
> 
> A corrupted timestamp could cause a massive wrap-around in the time gap
> calculation, perpetually satisfying the > NSEC_PER_SEC bypass condition and
> silently disabling PMU throttling for all events on that CPU.
> 
> > +	    now - last > NSEC_PER_SEC && sample_len_ns < NSEC_PER_SEC)

If this is a problem, isn't it also a problem for the 64-bit store after
updating the EWMA just above this?

I guess last_throttle_clock could be a u32 and use the low clock bits,
that's sufficient with the one second limit... but I would appreciate a
real human opinion :)
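
Something like this, maybe (a rough, untested sketch just to illustrate;
last_throttle_clock32 would be a new per-cpu u32):

    /*
     * local_clock()'s low 32 bits wrap every ~4.3 seconds, so as long
     * as the subtraction is done in u32 the one second comparison still
     * works, and an aliased gap just falls back to throttling.
     */
    u32 now = (u32)local_clock();
    u32 last = __this_cpu_read(last_throttle_clock32);

    if (__this_cpu_cmpxchg(last_throttle_clock32, last, now) == last &&
        (u32)(now - last) > (u32)NSEC_PER_SEC && sample_len_ns < NSEC_PER_SEC)
            return;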

> The comment indicates the intent is to skip throttling if the prior NMI
> occurred more than one second before this NMI began.

My comment is not very clear, I suppose. It should say:

    "...skip throttling if the prior attempt to throttle occurred more
    than one second ago, and the current NMI runtime was less than one
    second. But never skip throttling if NMIs are nesting."

> However, local_clock() is sampled at the end of the current NMI, so now
> includes the current NMI's duration.
> 
> If an NMI takes a very long time (e.g., 900ms) but started only 200ms after
> the previous NMI ended, the true gap before it began is 200ms. Because now
> includes the 900ms duration, now - last evaluates to 1100ms, which bypasses
> the throttling.
> 
> Could this artificially inflate the time gap for long-running NMIs,
> directly defeating the throttling mechanism?

I think it's just saying "the code doesn't do what the comment says",
which is true. But what it is describing is sort of addressed below too.

> Additionally, is it possible for user-defined PMU events with expensive
> overhead (like heavy eBPF programs attached to slow tracepoints) to exploit
> this hardcoded frequency bypass?

No, I don't think ebpf programs run in NMI context?

> If an event triggers slower than 1Hz but takes a very long time to execute
> (e.g., 999ms), both now - last > NSEC_PER_SEC and sample_len_ns <
> NSEC_PER_SEC will be true.
> 
> Such an event would bypass the moving average calculation and never be
> throttled, allowing it to consume nearly 100% of the CPU while it runs and
> violating the default 25% CPU time limit set by
> sysctl_perf_cpu_time_max_percent.

No matter where the limit is defined, NMIs firing at just under that
limit will skip throttling. But I guess it could be something like:

    now - last > NSEC_PER_SEC && sample_len_ns < NSEC_PER_MSEC

...to sort of constrain the worst case.

But this all feels really arbitrary. I thought about using the watchdog
interval sysctl minus some padding as the upper bound, but that seemed
unnecessarily complicated.

Maybe this second patch is just more trouble than it's worth...
especially if, as Andi noted earlier, the PMU watchdog is not long for
this world...

Thanks,
Calvin

> -- 
> Sashiko AI review · https://sashiko.dev/#/patchset/cover.1777483573.git.calvin@wbinvd.org?part=2

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2 2/2] perf: Don't throttle based on NMI watchdog events
  2026-04-29 22:08   ` Calvin Owens
@ 2026-04-29 22:15     ` Ian Rogers
  2026-04-29 22:41       ` Calvin Owens
  2026-05-01 17:07     ` Calvin Owens
  1 sibling, 1 reply; 9+ messages in thread
From: Ian Rogers @ 2026-04-29 22:15 UTC (permalink / raw)
  To: Calvin Owens
  Cc: linux-kernel, linux-perf-users, x86, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Adrian Hunter, James Clark,
	Thomas Gleixner, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Andi Kleen

On Wed, Apr 29, 2026 at 3:09 PM Calvin Owens <calvin@wbinvd.org> wrote:
>
> On Wednesday 04/29 at 20:38 +0000, sashiko-bot@kernel.org wrote:
> > > @@ -636,6 +639,19 @@ void perf_sample_event_took(u64 sample_len_ns)
> > >     if (avg_len <= max_len)
> > >             return;
> > >
> > > +   /*
> > > +    * Very infrequent events like the perf counter hard watchdog
> > > +    * can trigger spurious throttling: skip throttling if the prior
> > > +    * NMI got here more than one second before this NMI began. But
> > > +    * never skip throttling if NMIs are nesting, or if any NMI runs
> > > +    * for longer than one second.
> > > +    */
> > > +   now = local_clock();
> > > +   last = __this_cpu_read(last_throttle_clock);
> > > +   if (__this_cpu_cmpxchg(last_throttle_clock, last, now) == last &&
> >
> > On 32-bit architectures, is it safe to use __this_cpu_cmpxchg() on a 64-bit
> > value in NMI context?
> >
> > For architectures like 32-bit x86, the generic fallback implementation
> > achieves atomicity by disabling interrupts, which does not mask NMIs. If a
> > nested NMI interrupts the sequence halfway through reading or writing the
> > 32-bit halves of the u64, could this result in a torn read or write?
> >
> > A corrupted timestamp could cause a massive wrap-around in the time gap
> > calculation, perpetually satisfying the > NSEC_PER_SEC bypass condition and
> > silently disabling PMU throttling for all events on that CPU.
> >
> > > +       now - last > NSEC_PER_SEC && sample_len_ns < NSEC_PER_SEC)
>
> If this is a problem, isn't it also a problem for the 64-bit store after
> updating the EWMA just above this?
>
> I guess last_throttle_clock could be a u32 and use the low clock bits,
> that's sufficient with the one second limit... but I would appreciate a
> real human opinion :)
>
> > The comment indicates the intent is to skip throttling if the prior NMI
> > occurred more than one second before this NMI began.
>
> My comment is not very clear, I suppose. It should say:
>
>     "...skip throttling if the prior attempt to throttle occurred more
>     than one second ago, and the current NMI runtime was less than one
>     second. But never skip throttling if NMIs are nesting."
>
> > However, local_clock() is sampled at the end of the current NMI, so now
> > includes the current NMI's duration.
> >
> > If an NMI takes a very long time (e.g., 900ms) but started only 200ms after
> > the previous NMI ended, the true gap before it began is 200ms. Because now
> > includes the 900ms duration, now - last evaluates to 1100ms, which bypasses
> > the throttling.
> >
> > Could this artificially inflate the time gap for long-running NMIs,
> > directly defeating the throttling mechanism?
>
> I think it's just saying "the code doesn't do what the comment says",
> which is true. But what it is describing is sort of addressed below too.
>
> > Additionally, is it possible for user-defined PMU events with expensive
> > overhead (like heavy eBPF programs attached to slow tracepoints) to exploit
> > this hardcoded frequency bypass?
>
> No, I don't think ebpf programs run in NMI context?

Unfortunately Sashiko isn't currently reading email replies, but you
could copy-paste this feedback into an AI.

BPF programs can run in the NMI context, which means they can
introduce delays. I'm not familiar with all the side-effects of and
restrictions on this.

Thanks,
Ian

> > If an event triggers slower than 1Hz but takes a very long time to execute
> > (e.g., 999ms), both now - last > NSEC_PER_SEC and sample_len_ns <
> > NSEC_PER_SEC will be true.
> >
> > Such an event would bypass the moving average calculation and never be
> > throttled, allowing it to consume nearly 100% of the CPU while it runs and
> > violating the default 25% CPU time limit set by
> > sysctl_perf_cpu_time_max_percent.
>
> No matter where the limit is defined, NMIs firing at just under that
> limit will skip throttling. But I guess it could be something like:
>
>     now - last > NSEC_PER_SEC && sample_len_ns < NSEC_PER_MSEC
>
> ...to sort of constrain the worst case.
>
> But this all feels really arbitrary. I thought about using the watchdog
> interval sysctl minus some padding as the upper bound, but that seemed
> unnecessarily complicated.
>
> Maybe this second patch is just more trouble than it's worth...
> especially if, as Andi noted earlier, the PMU watchdog is not long for
> this world...
>
> Thanks,
> Calvin
>
> > --
> > Sashiko AI review · https://sashiko.dev/#/patchset/cover.1777483573.git.calvin@wbinvd.org?part=2

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2 2/2] perf: Don't throttle based on NMI watchdog events
  2026-04-29 22:15     ` Ian Rogers
@ 2026-04-29 22:41       ` Calvin Owens
  0 siblings, 0 replies; 9+ messages in thread
From: Calvin Owens @ 2026-04-29 22:41 UTC (permalink / raw)
  To: Ian Rogers
  Cc: linux-kernel, linux-perf-users, x86, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Adrian Hunter, James Clark,
	Thomas Gleixner, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Andi Kleen

On Wednesday 04/29 at 15:15 -0700, Ian Rogers wrote:
> On Wed, Apr 29, 2026 at 3:09 PM Calvin Owens <calvin@wbinvd.org> wrote:
> > On Wednesday 04/29 at 20:38 +0000, sashiko-bot@kernel.org wrote:
> > > Additionally, is it possible for user-defined PMU events with expensive
> > > overhead (like heavy eBPF programs attached to slow tracepoints) to exploit
> > > this hardcoded frequency bypass?
> >
> > No, I don't think ebpf programs run in NMI context?
>
> Unfortunately Sashiko isn't currently reading email replies, but you
> could copy-paste this feedback into an AI.

I know it doesn't reply. I was hoping for input from somebody who knew;
more automated guessing from more LLMs is not really helpful at all :)

> BPF programs can run in the NMI context, which means they can
> introduce delays. I'm not familiar with all the side-effects of and
> restrictions on this.

Thanks for the clarification.

Thinking about it more: it seems to me that if they run from NMIs and
their runtime is unbounded, this is just a consequence of that and not a
new problem. If they are somehow bounded, then this must be bounded in
the same way.

Either way, I don't think it's a real problem as far as this patch goes.
But maybe I'm not being imaginative enough.

Thanks,
Calvin

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2 2/2] perf: Don't throttle based on NMI watchdog events
  2026-04-29 22:08   ` Calvin Owens
  2026-04-29 22:15     ` Ian Rogers
@ 2026-05-01 17:07     ` Calvin Owens
  1 sibling, 0 replies; 9+ messages in thread
From: Calvin Owens @ 2026-05-01 17:07 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-perf-users, x86, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	James Clark, Thomas Gleixner, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, linux-kernel

On Wednesday 04/29 at 15:08 -0700, Calvin Owens wrote:
> On Wednesday 04/29 at 20:38 +0000, sashiko-bot@kernel.org wrote:
> > > @@ -636,6 +639,19 @@ void perf_sample_event_took(u64 sample_len_ns)
> > >  	if (avg_len <= max_len)
> > >  		return;
> > >  
> > > +	/*
> > > +	 * Very infrequent events like the perf counter hard watchdog
> > > +	 * can trigger spurious throttling: skip throttling if the prior
> > > +	 * NMI got here more than one second before this NMI began. But
> > > +	 * never skip throttling if NMIs are nesting, or if any NMI runs
> > > +	 * for longer than one second.
> > > +	 */
> > > +	now = local_clock();
> > > +	last = __this_cpu_read(last_throttle_clock);
> > > +	if (__this_cpu_cmpxchg(last_throttle_clock, last, now) == last &&
> > 
> > On 32-bit architectures, is it safe to use __this_cpu_cmpxchg() on a 64-bit
> > value in NMI context?
> >
> > For architectures like 32-bit x86, the generic fallback implementation
> > achieves atomicity by disabling interrupts, which does not mask NMIs. If a
> > nested NMI interrupts the sequence halfway through reading or writing the
> > 32-bit halves of the u64, could this result in a torn read or write?
> > 
> > A corrupted timestamp could cause a massive wrap-around in the time gap
> > calculation, perpetually satisfying the > NSEC_PER_SEC bypass condition and
> > silently disabling PMU throttling for all events on that CPU.
> > 
> > > +	    now - last > NSEC_PER_SEC && sample_len_ns < NSEC_PER_SEC)
> 
> If this is a problem, isn't it also a problem for the 64-bit store after
> updating the EWMA just above this?
> 
> I guess last_throttle_clock could be a u32 and use the low clock bits,
> that's sufficient with the one second limit... but I would appreciate a
> real human opinion :)

Andi, since you had looked at the original patch, do you have any
thoughts about this?

I think the LLM is splitting hairs and wasting all of our time: I don't
want to waste more of my and everyone else's time on the autofeedback
unless a real human being agrees it matters. I feel like I've already
been far too generous to it TBH...

This dynamic throttling only exists to prevent users from shooting
themselves in the foot and losing the machine. All the edge cases the
LLM has described only transiently delay throttling when they occur,
and I don't think that matters at all.

Thanks,
Calvin

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2 2/2] perf: Don't throttle based on NMI watchdog events
  2026-04-29 17:36 ` [PATCH v2 2/2] perf: Don't throttle based on NMI watchdog events Calvin Owens
  2026-04-29 22:08   ` Calvin Owens
@ 2026-05-01 20:54   ` Peter Zijlstra
  2026-05-02  9:52     ` Calvin Owens
  1 sibling, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2026-05-01 20:54 UTC (permalink / raw)
  To: Calvin Owens
  Cc: linux-kernel, linux-perf-users, x86, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	James Clark, Thomas Gleixner, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Andi Kleen

On Wed, Apr 29, 2026 at 10:36:11AM -0700, Calvin Owens wrote:
> The throttling logic in perf_sample_event_took() assumes the NMI is
> running at the maximum allowed sample rate. While this makes sense most
> of the time, it wildly overestimates the runtime of the NMI for the perf
> hardware watchdog:
> 
>     # bpftrace -e 'kprobe:perf_sample_event_took { \
> 	    printf("%s: cpu=%02d time_taken=%dns\n", \
> 	    strftime("%H:%M:%S.%f", nsecs), cpu(), arg0); }'
>     03:12:13.087003: cpu=00 time_taken=3190ns
>     03:12:13.486789: cpu=01 time_taken=2918ns
>     03:12:18.075288: cpu=03 time_taken=3308ns
>     03:12:19.797207: cpu=02 time_taken=2581ns
>     03:12:23.110317: cpu=00 time_taken=2823ns
>     03:12:23.510308: cpu=01 time_taken=2943ns
>     03:12:29.229348: cpu=03 time_taken=3669ns
>     03:12:31.656306: cpu=02 time_taken=3262ns
> 
> The NMI for the watchdog runs for 2-4us every ten seconds, but the
> math done in perf_sample_event_took() concludes it is running for
> 200-400ms every second!

For argument's sake, let's say this is an even 3us; this means we can run:

  250ms / 3us = 83333

such NMIs every second to consume 25% of CPU time. Which is in line with
the numbers it then reports no?

> When it is the only PMU event running, it can take minutes to hours of
> samples from the watchdog for the moving average to accumulate to
> something near the real mean, which causes the same little "litany" of
> sample rate throttles to happen every time Linux boots with the perf
> hardware watchdog enabled:
> 
>     perf: interrupt took too long (2526 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
>     perf: interrupt took too long (3177 > 3157), lowering kernel.perf_event_max_sample_rate to 62000
>     perf: interrupt took too long (3979 > 3971), lowering kernel.perf_event_max_sample_rate to 50000
>     perf: interrupt took too long (4983 > 4973), lowering kernel.perf_event_max_sample_rate to 40000
> 
> This serves no purpose: it doesn't actually affect the runtime of the
> watchdog NMI at all. It confuses users, because it suggests their
> machine is spinning its wheels in interrupts when it isn't.
> 
> Because the watchdog NMI is so infrequent, we can avoid throttling it by
> making the throttling a two-step process: load and update a timestamp
> whenever we think we need to throttle, and only actually proceed to
> throttle if the last time that happened was less than one second ago.
> 
> This is inelegant, but it avoids touching the hot path and preserves
> current throttling behavior for real PMU use, at the cost of delaying
> the throttling by a single NMI.

This makes no sense, and it is quite broken. There is no throttling and
you still need to update the numbers.

I'm thinking less AI and more real human should be involved here. If you
cannot make sense of either the code or the AI babbling, step away.

> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 6d1f8bad7e1c..c2a33cb194ce 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -623,6 +623,7 @@ core_initcall(init_events_core_sysctls);
>   */
>  #define NR_ACCUMULATED_SAMPLES 128
>  static DEFINE_PER_CPU(u64, running_sample_length);
> +static DEFINE_PER_CPU(u64, last_throttle_clock);
>  
>  static u64 __report_avg;
>  static u64 __report_allowed;
> @@ -643,6 +644,8 @@ void perf_sample_event_took(u64 sample_len_ns)
>  	u64 max_len = READ_ONCE(perf_sample_allowed_ns);
>  	u64 running_len;
>  	u64 avg_len;
> +	u64 last;
> +	u64 now;
>  	u32 max;
>  
>  	if (max_len == 0)
> @@ -663,6 +666,19 @@ void perf_sample_event_took(u64 sample_len_ns)
>  	if (avg_len <= max_len)
>  		return;
>  
> +	/*
> +	 * Very infrequent events like the perf counter hard watchdog
> +	 * can trigger spurious throttling: skip throttling if the prior
> +	 * NMI got here more than one second before this NMI began. But
> +	 * never skip throttling if NMIs are nesting, or if any NMI runs
> +	 * for longer than one second.
> +	 */
> +	now = local_clock();
> +	last = __this_cpu_read(last_throttle_clock);
> +	if (__this_cpu_cmpxchg(last_throttle_clock, last, now) == last &&
> +	    now - last > NSEC_PER_SEC && sample_len_ns < NSEC_PER_SEC)
> +		return;
> +
>  	__report_avg = avg_len;
>  	__report_allowed = max_len;
>  
> -- 
> 2.47.3
> 

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2 2/2] perf: Don't throttle based on NMI watchdog events
  2026-05-01 20:54   ` Peter Zijlstra
@ 2026-05-02  9:52     ` Calvin Owens
  0 siblings, 0 replies; 9+ messages in thread
From: Calvin Owens @ 2026-05-02  9:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-perf-users, x86, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	James Clark, Thomas Gleixner, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Andi Kleen

On Friday 05/01 at 22:54 +0200, Peter Zijlstra wrote:
> On Wed, Apr 29, 2026 at 10:36:11AM -0700, Calvin Owens wrote:
> > The throttling logic in perf_sample_event_took() assumes the NMI is
> > running at the maximum allowed sample rate. While this makes sense most
> > of the time, it wildly overestimates the runtime of the NMI for the perf
> > hardware watchdog:
> > 
> >     # bpftrace -e 'kprobe:perf_sample_event_took { \
> > 	    printf("%s: cpu=%02d time_taken=%dns\n", \
> > 	    strftime("%H:%M:%S.%f", nsecs), cpu(), arg0); }'
> >     03:12:13.087003: cpu=00 time_taken=3190ns
> >     03:12:13.486789: cpu=01 time_taken=2918ns
> >     03:12:18.075288: cpu=03 time_taken=3308ns
> >     03:12:19.797207: cpu=02 time_taken=2581ns
> >     03:12:23.110317: cpu=00 time_taken=2823ns
> >     03:12:23.510308: cpu=01 time_taken=2943ns
> >     03:12:29.229348: cpu=03 time_taken=3669ns
> >     03:12:31.656306: cpu=02 time_taken=3262ns
> > 
> > The NMI for the watchdog runs for 2-4us every ten seconds, but the
> > math done in perf_sample_event_took() concludes it is running for
> > 200-400ms every second!
> 
> For argument's sake, let's say this is an even 3us; this means we can run:
> 
>   250ms / 3us = 83333
> 
> such NMIs every second to consume 25% of CPU time. Which is in line with
> the numbers it then reports no?

The watchdog NMI latency is not remotely predictive of the "real" NMI
latency in the way I think you're assuming.

These are watchdog NMIs on a znver4 machine:

    17:50:15.322551: cpu=11 time_taken=3878ns
    17:50:15.624184: cpu=02 time_taken=3547ns
    17:50:15.756226: cpu=15 time_taken=3817ns
    17:50:15.826175: cpu=19 time_taken=3386ns

...vs the "real thing" with perf running on the same machine:

    02:21:02.801929: cpu=13 time_taken=321ns
    02:21:02.801937: cpu=24 time_taken=270ns
    02:21:02.801966: cpu=23 time_taken=461ns
    02:21:02.801971: cpu=12 time_taken=310ns

This machine ends up with a lower perf_event_max_sample_rate when the
hardware watchdog is enabled, because of this effect (which obviously
varies a lot with what options you pass to perf).

But the point I was trying to make is that perf_event_max_sample_rate is
completely orthogonal to the 0.1hz watchdog NMI.

The current logic updates a sysctl that can have no possible effect on
the watchdog, based on an extrapolated worst case from the watchdog,
that cannot possibly actually occur with the watchdog. That seems
fundamentally silly to me.

I only actually care because it is user visible in the form of the
random confusing throttling messages. I don't care that
perf_event_max_sample_rate ends up artificially lower, and I didn't try
to fix that.

> > When it is the only PMU event running, it can take minutes to hours of
> > samples from the watchdog for the moving average to accumulate to
> > something near the real mean, which causes the same little "litany" of
> > sample rate throttles to happen every time Linux boots with the perf
> > hardware watchdog enabled:
> > 
> >     perf: interrupt took too long (2526 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
> >     perf: interrupt took too long (3177 > 3157), lowering kernel.perf_event_max_sample_rate to 62000
> >     perf: interrupt took too long (3979 > 3971), lowering kernel.perf_event_max_sample_rate to 50000
> >     perf: interrupt took too long (4983 > 4973), lowering kernel.perf_event_max_sample_rate to 40000
> >
> > This serves no purpose: it doesn't actually affect the runtime of the
> > watchdog NMI at all. It confuses users, because it suggests their
> > machine is spinning its wheels in interrupts when it isn't.
> > 
> > Because the watchdog NMI is so infrequent, we can avoid throttling it by
> > making the throttling a two-step process: load and update a timestamp
> > whenever we think we need to throttle, and only actually proceed to
> > throttle if the last time that happened was less than one second ago.
> > 
> > This is inelegant, but it avoids touching the hot path and preserves
> > current throttling behavior for real PMU use, at the cost of delaying
> > the throttling by a single NMI.
> 
> This makes no sense, and it is quite broken. There is no throttling and
> you still need to update the numbers.

The ewma is updated above the patch context; that behavior doesn't
change at all.

Are you seeing __report_avg below it? That's for the deferred printk().
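
To make that concrete, the shape of the function with this patch applied
is roughly this (paraphrased, not the literal code):

    running_len = __this_cpu_read(running_sample_length);
    running_len -= running_len / NR_ACCUMULATED_SAMPLES;
    running_len += sample_len_ns;
    __this_cpu_write(running_sample_length, running_len);  /* ewma: unchanged */

    avg_len = running_len / NR_ACCUMULATED_SAMPLES;
    if (avg_len <= max_len)
        return;

    /* the new check sits here and can only skip the steps below */

    __report_avg = avg_len;      /* these only feed the deferred printk */
    __report_allowed = max_len;

    /* ...lower the sample rate sysctl, queue the printk irq_work... */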

I don't understand what "there is no throttling" means here, sorry.

In practice this all works exactly the way I'm describing: the
throttling happens immediately the first time perf is actually used on
the system:

    10:24:55 mahler kernel: perf: interrupt took too long (2503 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
    10:24:55 mahler kernel: perf: interrupt took too long (3178 > 3128), lowering kernel.perf_event_max_sample_rate to 62000
    10:24:55 mahler kernel: perf: interrupt took too long (3974 > 3972), lowering kernel.perf_event_max_sample_rate to 50000

...instead of randomly over the first hour of uptime like it does today:

    15:55:44 mahler kernel: perf: interrupt took too long (2518 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
    16:00:23 mahler kernel: perf: interrupt took too long (3163 > 3147), lowering kernel.perf_event_max_sample_rate to 63000
    16:10:18 mahler kernel: perf: interrupt took too long (3978 > 3953), lowering kernel.perf_event_max_sample_rate to 50000

This random throttling after boot isn't unique to my machines: most bare
metal servers I've interacted with over 10+ years do this. If I had a
nickel for every time somebody asked me why it happens when perf isn't
running, I could almost afford to pay what it cost google to give us
that worthless LLM review :)

> I'm thinking less AI and more real human should be involved here. If you
> cannot make sense of either the code or the AI babbling, step away.

The only LLM involved at all here is this one autoreview bot from google
that didn't ask for my permission to be involved.

I was simply trying to be generous by engaging with it. Generally, I've
been impressed with it, but in this particular case I feel strongly it's
been actively worse than nothing.

I will ignore it completely in the future when sending you patches.

> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > index 6d1f8bad7e1c..c2a33cb194ce 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -623,6 +623,7 @@ core_initcall(init_events_core_sysctls);
> >   */
> >  #define NR_ACCUMULATED_SAMPLES 128
> >  static DEFINE_PER_CPU(u64, running_sample_length);
> > +static DEFINE_PER_CPU(u64, last_throttle_clock);
> >  
> >  static u64 __report_avg;
> >  static u64 __report_allowed;
> > @@ -643,6 +644,8 @@ void perf_sample_event_took(u64 sample_len_ns)
> >  	u64 max_len = READ_ONCE(perf_sample_allowed_ns);
> >  	u64 running_len;
> >  	u64 avg_len;
> > +	u64 last;
> > +	u64 now;
> >  	u32 max;
> >  
> >  	if (max_len == 0)
> > @@ -663,6 +666,19 @@ void perf_sample_event_took(u64 sample_len_ns)
> >  	if (avg_len <= max_len)
> >  		return;
> >  
> > +	/*
> > +	 * Very infrequent events like the perf counter hard watchdog
> > +	 * can trigger spurious throttling: skip throttling if the prior
> > +	 * NMI got here more than one second before this NMI began. But
> > +	 * never skip throttling if NMIs are nesting, or if any NMI runs
> > +	 * for longer than one second.
> > +	 */
> > +	now = local_clock();
> > +	last = __this_cpu_read(last_throttle_clock);
> > +	if (__this_cpu_cmpxchg(last_throttle_clock, last, now) == last &&
> > +	    now - last > NSEC_PER_SEC && sample_len_ns < NSEC_PER_SEC)
> > +		return;
> > +
> >  	__report_avg = avg_len;
> >  	__report_allowed = max_len;
> >  
> > -- 
> > 2.47.3
> > 

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2026-05-02  9:52 UTC | newest]

Thread overview: 9+ messages
2026-04-29 17:36 [PATCH v2 0/2] Two semi-related perf throttling fixes Calvin Owens
2026-04-29 17:36 ` [PATCH v2 1/2] perf/x86: Avoid double accounting of PMU NMI latencies Calvin Owens
2026-04-29 17:36 ` [PATCH v2 2/2] perf: Don't throttle based on NMI watchdog events Calvin Owens
2026-04-29 22:08   ` Calvin Owens
2026-04-29 22:15     ` Ian Rogers
2026-04-29 22:41       ` Calvin Owens
2026-05-01 17:07     ` Calvin Owens
2026-05-01 20:54   ` Peter Zijlstra
2026-05-02  9:52     ` Calvin Owens
