All of lore.kernel.org
 help / color / mirror / Atom feed
From: Thomas Gleixner <tglx@linutronix.de>
To: Luigi Rizzo <lrizzo@google.com>, Marc Zyngier <maz@kernel.org>,
	Luigi Rizzo <rizzo.unipi@gmail.com>,
	Paolo Abeni <pabeni@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Sean Christopherson <seanjc@google.com>,
	Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	Bjorn Helgaas <bhelgaas@google.com>,
	Willem de Bruijn <willemb@google.com>,
	Luigi Rizzo <lrizzo@google.com>
Subject: Re: [PATCH 4/6] genirq: soft_moderation: implement adaptive moderation
Date: Thu, 13 Nov 2025 11:15:05 +0100	[thread overview]
Message-ID: <87o6p6ck7q.ffs@tglx> (raw)
In-Reply-To: <20251112192408.3646835-5-lrizzo@google.com>

On Wed, Nov 12 2025 at 19:24, Luigi Rizzo wrote:
> Add two control parameters (target_irq_rate and hardirq_percent)
> to indicate the desired maximum values for these two metrics.
>
> Every update_ms the hook in handle_irq_event() recomputes the total and
> local interrupt rate and the amount of time spent in hardirq, compares
> the values with the targets, and adjusts the moderation delay up or down.
>
> The interrupt rate is computed in a scalable way by counting interrupts
> per-CPU, and aggregating the value in a global variable only every
> update_ms. Only CPUs that actively process interrupts are actually
> accessing the shared variable, so the system scales well even on very
> large servers.

This explains the what. But it does not provide any rationale for it.

> +/* Adjust the moderation delay, called at most every update_ns */
> +void __irq_moderation_adjust_delay(struct irq_mod_state *ms, u64 delta_time, u64 update_ns)
> +{
> +	/* Fetch configuration */
> +	u32 target_rate = clamp(irq_mod_info.target_irq_rate, 0u, 50000000u);
> +	u32 hardirq_percent = clamp(irq_mod_info.hardirq_percent, 0u, 100u);

This is all racy against a concurrent write, so that requires
READ_ONCE(), no?

> +	bool below_target = true;
> +	/* Compute decay steps based on elapsed time */
> +	u32 steps = delta_time > 10 * update_ns ? 10 : 1 + (delta_time / update_ns);
> +
> +	if (target_rate == 0 && hardirq_percent == 0) {	/* use fixed delay */
> +		ms->mod_ns = ms->delay_ns;
> +		ms->irq_rate = 0;
> +		ms->my_irq_rate = 0;
> +		ms->cpu_count = 0;
> +		return;
> +	}
> +
> +	if (target_rate > 0) {	/* control total and individual CPU rate */
> +		u64 irq_rate, my_irq_rate, tmp, delta_irqs, num_cpus;
> +		bool my_rate_ok, global_rate_ok;
> +
> +		/* Update global number of interrupts */
> +		if (ms->irq_count < 1)	/* make sure it is always > 0 */

And how does it become < 1?

> +			ms->irq_count = 1;
> +		tmp = atomic_long_add_return(ms->irq_count, &irq_mod_info.total_intrs);
> +		delta_irqs = tmp - ms->last_total_irqs;
> +
> +		/* Compute global rate, check if we are ok */
> +		irq_rate = (delta_irqs * NSEC_PER_SEC) / delta_time;
> +		global_rate_ok = irq_rate < target_rate;
> +
> +		ms->last_total_irqs = tmp;

I really had to find this update. Can you please just keep that stuff
together?

        tmp = ...;
        delta = tmp - ms->xxxx;
        ms->xxxx = tmp;

> +		/*
> +		 * num_cpus is the number of CPUs actively handling interrupts
> +		 * in the last interval. CPUs handling less than the fair share
> +		 * target_rate / num_cpus do not need to be throttled.
> +		 */
> +		tmp = atomic_long_add_return(1, &irq_mod_info.total_cpus);
> +		num_cpus = tmp - ms->last_total_cpus;
> +		/* scale proportionally to time, reduce errors if we are idle for too long */
> +		num_cpus = 1 + (num_cpus * update_ns + delta_time / 2) / delta_time;

I still fail to see why this global scale is required and how it is
correct. This can starve out a particular CPU which handles the bulk of
interrupts, no?

> +		/* Short intervals may underestimate sources. Apply a scale factor */
> +		num_cpus = num_cpus * get_scale_cpus() / 100;
> +
> +		/* Compute our rate, check if we are ok */
> +		my_irq_rate = (ms->irq_count * NSEC_PER_SEC) / delta_time;
> +		my_rate_ok = my_irq_rate * num_cpus < target_rate;
> +
> +		ms->irq_count = 1;	/* reset for next cycle */
> +		ms->last_total_cpus = tmp;
> +
> +		/* Use instantaneous rates to react. */
> +		below_target = global_rate_ok || my_rate_ok;
> +
> +		/* Statistics (rates are smoothed averages) */
> +		smooth_avg(&ms->irq_rate, irq_rate, steps);
> +		smooth_avg(&ms->my_irq_rate, my_irq_rate, steps);
> +		smooth_avg(&ms->cpu_count, num_cpus * 256, steps); /* scaled */
> +		ms->my_irq_high += !my_rate_ok;
> +		ms->irq_high += !global_rate_ok;
> +	}
> +
> +	if (hardirq_percent > 0) {		/* control time spent in hardirq */
> +		u64 cur = kcpustat_this_cpu->cpustat[CPUTIME_IRQ];

Depends on CONFIG_IRQ_TIME_ACCOUNTING=y, no?

> +		u64 irqtime = cur - ms->last_irqtime;
> +		bool hardirq_ok = irqtime * 100 < delta_time * hardirq_percent;
> +
> +		below_target &= hardirq_ok;
> +		ms->last_irqtime = cur;
> +		ms->hardirq_high += !hardirq_ok;	/* statistics */
> +	}
> +
> +	/* Controller: move mod_ns up/down if we are above/below target */
> +	if (below_target) {
> +		ms->mod_ns -= ms->mod_ns * steps / (steps + get_decay_factor());
> +		if (ms->mod_ns < 100)
> +			ms->mod_ns = 0;
> +	} else if (ms->mod_ns < 500) {
> +		ms->mod_ns = 500;

Random numbers pulled out of thin air. That's all over the place.

> +	} else {
> +		ms->mod_ns += ms->mod_ns * steps / (steps + get_grow_factor());
> +		if (ms->mod_ns > ms->delay_ns)
> +			ms->mod_ns = ms->delay_ns;	/* cap to delay_ns */
> +	}
> +}

Thanks,

        tglx

  reply	other threads:[~2025-11-13 10:15 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-11-12 19:24 [PATCH 0/6] platform wide software interrupt moderation Luigi Rizzo
2025-11-12 19:24 ` [PATCH 1/6] genirq: platform wide interrupt moderation: Documentation, Kconfig, irq_desc Luigi Rizzo
2025-11-13  8:17   ` Thomas Gleixner
2025-11-13  9:44   ` Thomas Gleixner
2025-11-13 13:25   ` Marc Zyngier
2025-11-13 13:33     ` Luigi Rizzo
2025-11-13 14:42       ` Marc Zyngier
2025-11-13 14:55         ` Luigi Rizzo
2025-11-13 19:02           ` Marc Zyngier
2025-11-12 19:24 ` [PATCH 2/6] genirq: soft_moderation: add base files, procfs hooks Luigi Rizzo
2025-11-13  9:29   ` Thomas Gleixner
2025-11-13 10:24     ` Thomas Gleixner
2025-11-13 22:42       ` Luigi Rizzo
2025-11-13 22:32     ` Luigi Rizzo
2025-11-13  9:40   ` Thomas Gleixner
2025-11-12 19:24 ` [PATCH 3/6] genirq: soft_moderation: activate hooks in handle_irq_event() Luigi Rizzo
2025-11-13  9:45   ` Thomas Gleixner
2025-11-14  8:27     ` Luigi Rizzo
2025-11-12 19:24 ` [PATCH 4/6] genirq: soft_moderation: implement adaptive moderation Luigi Rizzo
2025-11-13 10:15   ` Thomas Gleixner [this message]
2025-11-12 19:24 ` [PATCH 5/6] x86/irq: soft_moderation: add support for posted_msi (intel) Luigi Rizzo
2025-11-12 19:24 ` [PATCH 6/6] genirq: soft_moderation: implement per-driver defaults (nvme and vfio) Luigi Rizzo
2025-11-13 10:18   ` Thomas Gleixner
2025-11-13 10:42     ` Luigi Rizzo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87o6p6ck7q.ffs@tglx \
    --to=tglx@linutronix.de \
    --cc=akpm@linux-foundation.org \
    --cc=bhelgaas@google.com \
    --cc=jacob.jun.pan@linux.intel.com \
    --cc=linux-arch@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=lrizzo@google.com \
    --cc=maz@kernel.org \
    --cc=pabeni@redhat.com \
    --cc=rizzo.unipi@gmail.com \
    --cc=seanjc@google.com \
    --cc=willemb@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.