linux-arch.vger.kernel.org archive mirror
From: Thomas Gleixner <tglx@linutronix.de>
To: Luigi Rizzo <lrizzo@google.com>, Marc Zyngier <maz@kernel.org>,
	Luigi Rizzo <rizzo.unipi@gmail.com>,
	Paolo Abeni <pabeni@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Sean Christopherson <seanjc@google.com>,
	Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	Bjorn Helgaas <bhelgaas@google.com>,
	Willem de Bruijn <willemb@google.com>,
	Luigi Rizzo <lrizzo@google.com>
Subject: Re: [PATCH v2 4/8] genirq: soft_moderation: implement adaptive moderation
Date: Mon, 17 Nov 2025 21:51:29 +0100	[thread overview]
Message-ID: <87jyzo757y.ffs@tglx> (raw)
In-Reply-To: <20251116182839.939139-5-lrizzo@google.com>

On Sun, Nov 16 2025 at 18:28, Luigi Rizzo wrote:
> Add two control parameters (target_irq_rate and hardirq_percent)
> to indicate the desired maximum values for these two metrics.
>
> Every update_ms the hook in handle_irq_event() recomputes the total and
> local interrupt rate and the amount of time spent in hardirq, compares
> the values with the targets, and adjusts the moderation delay up or down.
>
> The interrupt rate is computed in a scalable way by counting interrupts
> per-CPU, and aggregating the value in a global variable only every
> update_ms. Only CPUs that actively process interrupts are actually
> accessing the shared variable, so the system scales well even on very
> large servers.

You still fail to explain why this is required, why per-CPU moderation
is not sufficient, and what the benefits of this approach are.

I'm happy to refer you to Documentation once more. It's well explained
there.

> +/* Measure and assess time spent in hardirq. */
> +static inline bool hardirq_high(struct irq_mod_state *ms, u64 delta_time, u32 hardirq_percent)
> +{
> +	bool above_threshold = false;
> +
> +	if (IS_ENABLED(CONFIG_IRQ_TIME_ACCOUNTING)) {
> +		u64 irqtime, cur = kcpustat_this_cpu->cpustat[CPUTIME_IRQ];
> +
> +		irqtime = cur - ms->last_irqtime;
> +		ms->last_irqtime = cur;
> +
> +		above_threshold = irqtime * 100 > delta_time * hardirq_percent;
> +		ms->hardirq_high += above_threshold;
> +	}
> +	return above_threshold;
> +}
> +
> +/* Measure and assess total and per-CPU interrupt rates. */
> +static inline bool irqrate_high(struct irq_mod_state *ms, u64 delta_time,
> +				u32 target_rate, u32 steps)
> +{
> +	u64 irq_rate, my_irq_rate, tmp, delta_irqs, active_cpus;
> +	bool my_rate_high, global_rate_high;
> +
> +	my_irq_rate = (ms->irq_count * NSEC_PER_SEC) / delta_time;
> +	/* Accumulate global counter and compute global irq rate. */
> +	tmp = atomic_long_add_return(ms->irq_count, &irq_mod_info.total_intrs);
> +	ms->irq_count = 1;
> +	delta_irqs = tmp - ms->last_total_irqs;
> +	ms->last_total_irqs = tmp;
> +	irq_rate = (delta_irqs * NSEC_PER_SEC) / delta_time;
> +
> +	/*
> +	 * Count how many CPUs handled interrupts in the last interval, needed
> +	 * to determine the per-CPU target (target_rate / active_cpus).
> +	 * Each active CPU increments the global counter approximately every
> +	 * update_ns. Scale the value by (update_ns / delta_time) to get the
> +	 * correct value. Also apply rounding and make sure active_cpus > 0.
> +	 */
> +	tmp = atomic_long_add_return(1, &irq_mod_info.total_cpus);
> +	active_cpus = tmp - ms->last_total_cpus;
> +	ms->last_total_cpus = tmp;
> +	active_cpus = (active_cpus * ms->update_ns + delta_time / 2) / delta_time;
> +	if (active_cpus < 1)
> +		active_cpus = 1;
> +
> +	/* Compare with global and per-CPU targets. */
> +	global_rate_high = irq_rate > target_rate;
> +	my_rate_high = my_irq_rate * active_cpus * irq_mod_info.scale_cpus > target_rate * 100;
> +
> +	/* Statistics. */
> +	smooth_avg(&ms->irq_rate, irq_rate, steps);
> +	smooth_avg(&ms->my_irq_rate, my_irq_rate, steps);
> +	smooth_avg(&ms->scaled_cpu_count, active_cpus * 256, steps);
> +	ms->my_irq_high += my_rate_high;
> +	ms->irq_high += global_rate_high;
> +
> +	/* Moderate on this CPU only if both global and local rates are high. */

Because it's desired that CPUs can be starved by interrupts as long as
enough other CPUs only have a very low rate? I'm failing to understand
that logic, and the comprehensive rationale in the change log does not
help either.

> +	return global_rate_high && my_rate_high;
> +}
> +
> +/* Adjust the moderation delay, called at most every update_ns. */
> +void __irq_moderation_adjust_delay(struct irq_mod_state *ms, u64 delta_time)
> +{
> +	u32 hardirq_percent = READ_ONCE(irq_mod_info.hardirq_percent);
> +	u32 target_rate = READ_ONCE(irq_mod_info.target_irq_rate);
> +	bool below_target = true;
> +	u32 steps;
> +
> +	if (target_rate == 0 && hardirq_percent == 0) {
> +		/* Use fixed moderation delay. */
> +		ms->mod_ns = ms->delay_ns;
> +		ms->irq_rate = 0;
> +		ms->my_irq_rate = 0;
> +		ms->scaled_cpu_count = 0;
> +		return;
> +	}
> +
> +	/* Compute decay steps based on elapsed time, bound to a reasonable value. */
> +	steps = delta_time > 10 * ms->update_ns ? 10 : 1 + (delta_time / ms->update_ns);
> +
> +	if (target_rate > 0 && irqrate_high(ms, delta_time, target_rate, steps))
> +		below_target = false;
> +
> +	if (hardirq_percent > 0 && hardirq_high(ms, delta_time, hardirq_percent))
> +		below_target = false;

So that rate limits a per-CPU overload, but only when IRQTIME accounting
is enabled. Oh well...

> +	/* Controller: adjust delay with exponential decay/grow. */
> +	if (below_target) {
> +		ms->mod_ns -= ms->mod_ns * steps / (steps + irq_mod_info.decay_factor);
> +		if (ms->mod_ns < 100)
> +			ms->mod_ns = 0;

These random numbers used to limit things are truly interesting and
easy to understand - NOT.

> +	} else {
> +		/* Exponential grow does not restart if value is too small. */
> +		if (ms->mod_ns < 500)
> +			ms->mod_ns = 500;
> +		ms->mod_ns += ms->mod_ns * steps / (steps + irq_mod_info.grow_factor);
> +		if (ms->mod_ns > ms->delay_ns)
> +			ms->mod_ns = ms->delay_ns;
> +	}

Why does this need separate grow and decay factors? Just because more
knobs are better?

> +}
> +
>  /* Moderation timer handler. */
>  static enum hrtimer_restart timer_cb(struct hrtimer *timer)
>  {
> @@ -169,8 +289,10 @@ static inline int set_moderation_mode(struct irq_desc *desc, bool enable)
>  /* Print statistics */
>  static int moderation_show(struct seq_file *p, void *v)
>  {
> +	ulong irq_rate = 0, irq_high = 0, my_irq_high = 0, hardirq_high = 0;
>  	uint delay_us = irq_mod_info.delay_us;
> -	int j;
> +	u64 now = ktime_get_ns();
> +	int j, active_cpus = 0;
>  
>  #define HEAD_FMT "%5s  %8s  %8s  %4s  %4s  %8s  %11s  %11s  %11s  %11s  %11s  %11s  %11s  %9s\n"
>  #define BODY_FMT "%5u  %8u  %8u  %4u  %4u  %8u  %11u  %11u  %11u  %11u  %11u  %11u  %11u  %9u\n"
> @@ -182,6 +304,23 @@ static int moderation_show(struct seq_file *p, void *v)
>  
>  	for_each_possible_cpu(j) {
>  		struct irq_mod_state *ms = per_cpu_ptr(&irq_mod_state, j);
> +		u64 age_ms = min((now - ms->last_ns) / NSEC_PER_MSEC, (u64)999999);

Not new in this patch: All of those accesses to remote CPU state are
racy and need to be annotated. This clearly never saw any testing with
KCSAN enabled. I'm happy to point to Documentation once again.

What's new is that this one is obviously broken on 32-bit because read
and write of a 64-bit value are split.

> +		if (age_ms < 10000) {
> +			/* Average irq_rate over recently active CPUs. */
> +			active_cpus++;
> +			irq_rate += ms->irq_rate;
> +		} else {
> +			ms->irq_rate = 0;
> +			ms->my_irq_rate = 0;
> +			ms->scaled_cpu_count = 64;
> +			ms->scaled_src_count = 64;
> +			ms->mod_ns = 0;
> +		}
> +
> +		irq_high += ms->irq_high;
> +		my_irq_high += ms->my_irq_high;
> +		hardirq_high += ms->hardirq_high;
>  
>  		seq_printf(p, BODY_FMT,
>  			   j, ms->irq_rate, ms->my_irq_rate,
> @@ -195,9 +334,34 @@ static int moderation_show(struct seq_file *p, void *v)
>  	seq_printf(p, "\n"
>  		   "enabled              %s\n"
>  		   "delay_us             %u\n"
> -		   "timer_rounds         %u\n",
> +		   "timer_rounds         %u\n"
> +		   "target_irq_rate      %u\n"
> +		   "hardirq_percent      %u\n"
> +		   "update_ms            %u\n"
> +		   "scale_cpus           %u\n"
> +		   "count_timer_calls    %s\n"
> +		   "count_msi_calls      %s\n"
> +		   "decay_factor         %u\n"
> +		   "grow_factor          %u\n",
>  		   str_yes_no(delay_us > 0),
> -		   delay_us, irq_mod_info.timer_rounds);
> +		   delay_us, irq_mod_info.timer_rounds,
> +		   irq_mod_info.target_irq_rate, irq_mod_info.hardirq_percent,
> +		   irq_mod_info.update_ms, irq_mod_info.scale_cpus,
> +		   str_yes_no(irq_mod_info.count_timer_calls),
> +		   str_yes_no(irq_mod_info.count_msi_calls),
> +		   irq_mod_info.decay_factor, irq_mod_info.grow_factor);
> +
> +	seq_printf(p,
> +		   "irq_rate             %lu\n"
> +		   "irq_high             %lu\n"
> +		   "my_irq_high          %lu\n"
> +		   "hardirq_percent_high %lu\n"
> +		   "total_interrupts     %lu\n"
> +		   "total_cpus           %lu\n",
> +		   active_cpus ? irq_rate / active_cpus : 0,
> +		   irq_high, my_irq_high, hardirq_high,
> +		   READ_ONCE(*((ulong *)&irq_mod_info.total_intrs)),
> +		   READ_ONCE(*((ulong *)&irq_mod_info.total_cpus)));

The more I look at this insane amount of telemetry, the more I am
convinced that this is overly complex just for complexity's sake.

Thanks,

        tglx


Thread overview: 32+ messages
2025-11-16 18:28 [PATCH v2 0/8] platform wide software interrupt moderation Luigi Rizzo
2025-11-16 18:28 ` [PATCH v2 1/8] genirq: platform wide interrupt moderation: Documentation, Kconfig, irq_desc Luigi Rizzo
2025-11-16 18:28 ` [PATCH v2 2/8] genirq: soft_moderation: add base files, procfs Luigi Rizzo
2025-11-17 11:12   ` kernel test robot
2025-11-17 16:01   ` Thomas Gleixner
2025-11-17 16:16     ` Luigi Rizzo
2025-11-17 19:35       ` Thomas Gleixner
2025-11-17 21:56   ` kernel test robot
2025-11-16 18:28 ` [PATCH v2 3/8] genirq: soft_moderation: implement fixed moderation Luigi Rizzo
2025-11-17 19:30   ` Thomas Gleixner
2025-11-17 23:16     ` Thomas Gleixner
2025-11-17 23:59       ` Luigi Rizzo
2025-11-18  8:34         ` Thomas Gleixner
2025-11-18 10:09           ` Luigi Rizzo
2025-11-18 16:31             ` Thomas Gleixner
2025-11-18 18:25               ` Luigi Rizzo
2025-11-18 23:06                 ` Luigi Rizzo
2025-11-19 14:43                   ` Thomas Gleixner
2025-11-21 10:58                     ` Luigi Rizzo
2025-11-21 14:33                       ` Luigi Rizzo
2025-11-22 14:08                       ` Thomas Gleixner
2025-11-16 18:28 ` [PATCH v2 4/8] genirq: soft_moderation: implement adaptive moderation Luigi Rizzo
2025-11-17 20:51   ` Thomas Gleixner [this message]
2025-11-17 21:34     ` Luigi Rizzo
2025-11-17 23:05       ` Thomas Gleixner
2025-11-18  9:00       ` Thomas Gleixner
2025-11-16 18:28 ` [PATCH v2 5/8] x86/irq: soft_moderation: add support for posted_msi (intel) Luigi Rizzo
2025-11-17 21:14   ` Thomas Gleixner
2025-11-17 21:36   ` kernel test robot
2025-11-16 18:28 ` [PATCH v2 6/8] genirq: soft_moderation: helpers for per-driver defaults Luigi Rizzo
2025-11-16 18:28 ` [PATCH v2 7/8] nvme-pci: add module parameter for default moderation mode Luigi Rizzo
2025-11-16 18:28 ` [PATCH v2 8/8] vfio-pci: " Luigi Rizzo
