Re: [PATCH v2 2/2] perf: Don't throttle based on NMI watchdog events

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

From: Calvin Owens <calvin@wbinvd.org>
To: linux-kernel@vger.kernel.org
Cc: linux-perf-users@vger.kernel.org, x86@kernel.org,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	Arnaldo Carvalho de Melo <acme@kernel.org>,
	Namhyung Kim <namhyung@kernel.org>,
	Mark Rutland <mark.rutland@arm.com>,
	Alexander Shishkin <alexander.shishkin@linux.intel.com>,
	Jiri Olsa <jolsa@kernel.org>, Ian Rogers <irogers@google.com>,
	Adrian Hunter <adrian.hunter@intel.com>,
	James Clark <james.clark@linaro.org>,
	Thomas Gleixner <tglx@kernel.org>, Borislav Petkov <bp@alien8.de>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	"H. Peter Anvin" <hpa@zytor.com>, Andi Kleen <ak@linux.intel.com>
Subject: Re: [PATCH v2 2/2] perf: Don't throttle based on NMI watchdog events
Date: Wed, 29 Apr 2026 15:08:55 -0700	[thread overview]
Message-ID: <afKBd87b4AKbCYuL@mozart.vkv.me> (raw)
In-Reply-To: <33d87384aa5f96af76949d1399476779dd4f4fce.1777483573.git.calvin@wbinvd.org>

On Wednesday 04/29 at 20:38 +0000, sashiko-bot@kernel.org wrote:
> > @@ -636,6 +639,19 @@ void perf_sample_event_took(u64 sample_len_ns)
> >  	if (avg_len <= max_len)
> >  		return;
> >  
> > +	/*
> > +	 * Very infrequent events like the perf counter hard watchdog
> > +	 * can trigger spurious throttling: skip throttling if the prior
> > +	 * NMI got here more than one second before this NMI began. But
> > +	 * never skip throttling if NMIs are nesting, or if any NMI runs
> > +	 * for longer than one second.
> > +	 */
> > +	now = local_clock();
> > +	last = __this_cpu_read(last_throttle_clock);
> > +	if (__this_cpu_cmpxchg(last_throttle_clock, last, now) == last &&
> 
> On 32-bit architectures, is it safe to use __this_cpu_cmpxchg() on a 64-bit
> value in NMI context?
>
> For architectures like 32-bit x86, the generic fallback implementation
> achieves atomicity by disabling interrupts, which does not mask NMIs. If a
> nested NMI interrupts the sequence halfway through reading or writing the
> 32-bit halves of the u64, could this result in a torn read or write?
> 
> A corrupted timestamp could cause a massive wrap-around in the time gap
> calculation, perpetually satisfying the > NSEC_PER_SEC bypass condition and
> silently disabling PMU throttling for all events on that CPU.
> 
> > +	    now - last > NSEC_PER_SEC && sample_len_ns < NSEC_PER_SEC)

If this is a problem, isn't it also a problem for the 64-bit store after
updating the EWMA just above this?

I guess last_throttle_clock could be a u32 and use the low clock bits,
that's sufficient with the one second limit... but I would appreciate a
real human opinion :)

> The comment indicates the intent is to skip throttling if the prior NMI
> occurred more than one second before this NMI began.

My comment is not very clear, I suppose. It should say:

    "...skip throttling if the prior attempt to throttle occurred more
    than one second ago, and the current NMI runtime was less than one
    second. But never skip throttling if NMIs are nesting."

> However, local_clock() is sampled at the end of the current NMI, so now
> includes the current NMI's duration.
> 
> If an NMI takes a very long time (e.g., 900ms) but started only 200ms after
> the previous NMI ended, the true gap before it began is 200ms. Because now
> includes the 900ms duration, now - last evaluates to 1100ms, which bypasses
> the throttling.
> 
> Could this artificially inflate the time gap for long-running NMIs,
> directly defeating the throttling mechanism?

I think it's just saying "the code doesn't do what the comment says",
which is true. But what it is describing is sort of addressed below too.

> Additionally, is it possible for user-defined PMU events with expensive
> overhead (like heavy eBPF programs attached to slow tracepoints) to exploit
> this hardcoded frequency bypass?

No, I don't think ebpf programs run in NMI context?

> If an event triggers slower than 1Hz but takes a very long time to execute
> (e.g., 999ms), both now - last > NSEC_PER_SEC and sample_len_ns <
> NSEC_PER_SEC will be true.
> 
> Such an event would bypass the moving average calculation and never be
> throttled, allowing it to consume nearly 100% of the CPU while it runs and
> violating the default 25% CPU time limit set by
> sysctl_perf_cpu_time_max_percent.

No matter where the limit is defined, NMIs firing at just under that
limit will skip throttling. But I guess it could be something like:

    now - last > NSEC_PER_SEC && sample_len_ns < NSEC_PER_MSEC

...to sort of costrain the worst case.

But this all feels really arbitrary. I thought about using the watchdog
interval sysctl minus some padding as the upper bound, but that seemed
unnecessarily complicated.

Maybe this second patch is just more trouble than its worth...
especially if, as Andi noted earlier, the PMU watchdog is not long for
this world...

Thanks,
Calvin

> -- 
> Sashiko AI review · https://sashiko.dev/#/patchset/cover.1777483573.git.calvin@wbinvd.org?part=2

next prev parent reply	other threads:[~2026-04-29 22:09 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-04-29 17:36 [PATCH v2 0/2] Two semi-related perf throttling fixes Calvin Owens
2026-04-29 17:36 ` [PATCH v2 1/2] perf/x86: Avoid double accounting of PMU NMI latencies Calvin Owens
2026-04-29 17:36 ` [PATCH v2 2/2] perf: Don't throttle based on NMI watchdog events Calvin Owens
2026-04-29 22:08   ` Calvin Owens [this message]
2026-04-29 22:15     ` Ian Rogers
2026-04-29 22:41       ` Calvin Owens
2026-05-01 17:07     ` Calvin Owens
2026-05-01 20:54   ` Peter Zijlstra
2026-05-02  9:52     ` Calvin Owens

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=afKBd87b4AKbCYuL@mozart.vkv.me \
    --to=calvin@wbinvd.org \
    --cc=acme@kernel.org \
    --cc=adrian.hunter@intel.com \
    --cc=ak@linux.intel.com \
    --cc=alexander.shishkin@linux.intel.com \
    --cc=bp@alien8.de \
    --cc=dave.hansen@linux.intel.com \
    --cc=hpa@zytor.com \
    --cc=irogers@google.com \
    --cc=james.clark@linaro.org \
    --cc=jolsa@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-perf-users@vger.kernel.org \
    --cc=mark.rutland@arm.com \
    --cc=mingo@redhat.com \
    --cc=namhyung@kernel.org \
    --cc=peterz@infradead.org \
    --cc=tglx@kernel.org \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox