From: Thomas Gleixner <tglx@linutronix.de>
To: Luigi Rizzo <lrizzo@google.com>, Marc Zyngier <maz@kernel.org>,
Luigi Rizzo <rizzo.unipi@gmail.com>,
Paolo Abeni <pabeni@redhat.com>,
Andrew Morton <akpm@linux-foundation.org>,
Sean Christopherson <seanjc@google.com>,
Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
Bjorn Helgaas <bhelgaas@google.com>,
Willem de Bruijn <willemb@google.com>,
Luigi Rizzo <lrizzo@google.com>
Subject: Re: [PATCH 4/6] genirq: soft_moderation: implement adaptive moderation
Date: Thu, 13 Nov 2025 11:15:05 +0100 [thread overview]
Message-ID: <87o6p6ck7q.ffs@tglx> (raw)
In-Reply-To: <20251112192408.3646835-5-lrizzo@google.com>
On Wed, Nov 12 2025 at 19:24, Luigi Rizzo wrote:
> Add two control parameters (target_irq_rate and hardirq_percent)
> to indicate the desired maximum values for these two metrics.
>
> Every update_ms the hook in handle_irq_event() recomputes the total and
> local interrupt rate and the amount of time spent in hardirq, compares
> the values with the targets, and adjusts the moderation delay up or down.
>
> The interrupt rate is computed in a scalable way by counting interrupts
> per-CPU, and aggregating the value in a global variable only every
> update_ms. Only CPUs that actively process interrupts are actually
> accessing the shared variable, so the system scales well even on very
> large servers.
This explains the what. But it does not provide any rationale for it.
> +/* Adjust the moderation delay, called at most every update_ns */
> +void __irq_moderation_adjust_delay(struct irq_mod_state *ms, u64 delta_time, u64 update_ns)
> +{
> + /* Fetch configuration */
> + u32 target_rate = clamp(irq_mod_info.target_irq_rate, 0u, 50000000u);
> + u32 hardirq_percent = clamp(irq_mod_info.hardirq_percent, 0u, 100u);
This is all racy against a concurrent write, so that requires
READ_ONCE(), no?
> + bool below_target = true;
> + /* Compute decay steps based on elapsed time */
> + u32 steps = delta_time > 10 * update_ns ? 10 : 1 + (delta_time / update_ns);
> +
> + if (target_rate == 0 && hardirq_percent == 0) { /* use fixed delay */
> + ms->mod_ns = ms->delay_ns;
> + ms->irq_rate = 0;
> + ms->my_irq_rate = 0;
> + ms->cpu_count = 0;
> + return;
> + }
> +
> + if (target_rate > 0) { /* control total and individual CPU rate */
> + u64 irq_rate, my_irq_rate, tmp, delta_irqs, num_cpus;
> + bool my_rate_ok, global_rate_ok;
> +
> + /* Update global number of interrupts */
> + if (ms->irq_count < 1) /* make sure it is always > 0 */
And how does it become < 1?
> + ms->irq_count = 1;
> + tmp = atomic_long_add_return(ms->irq_count, &irq_mod_info.total_intrs);
> + delta_irqs = tmp - ms->last_total_irqs;
> +
> + /* Compute global rate, check if we are ok */
> + irq_rate = (delta_irqs * NSEC_PER_SEC) / delta_time;
> + global_rate_ok = irq_rate < target_rate;
> +
> + ms->last_total_irqs = tmp;
I really had to find this update. Can you please just keep that stuff
together?
tmp = ...;
delta = tmp - ms->xxxx;
ms->xxxx = tmp;
> + /*
> + * num_cpus is the number of CPUs actively handling interrupts
> + * in the last interval. CPUs handling less than the fair share
> + * target_rate / num_cpus do not need to be throttled.
> + */
> + tmp = atomic_long_add_return(1, &irq_mod_info.total_cpus);
> + num_cpus = tmp - ms->last_total_cpus;
> + /* scale proportionally to time, reduce errors if we are idle for too long */
> + num_cpus = 1 + (num_cpus * update_ns + delta_time / 2) / delta_time;
I still fail to see why this global scale is required and how it is
correct. This can starve out a particular CPU which handles the bulk of
interrupts, no?
> + /* Short intervals may underestimate sources. Apply a scale factor */
> + num_cpus = num_cpus * get_scale_cpus() / 100;
> +
> + /* Compute our rate, check if we are ok */
> + my_irq_rate = (ms->irq_count * NSEC_PER_SEC) / delta_time;
> + my_rate_ok = my_irq_rate * num_cpus < target_rate;
> +
> + ms->irq_count = 1; /* reset for next cycle */
> + ms->last_total_cpus = tmp;
> +
> + /* Use instantaneous rates to react. */
> + below_target = global_rate_ok || my_rate_ok;
> +
> + /* Statistics (rates are smoothed averages) */
> + smooth_avg(&ms->irq_rate, irq_rate, steps);
> + smooth_avg(&ms->my_irq_rate, my_irq_rate, steps);
> + smooth_avg(&ms->cpu_count, num_cpus * 256, steps); /* scaled */
> + ms->my_irq_high += !my_rate_ok;
> + ms->irq_high += !global_rate_ok;
> + }
> +
> + if (hardirq_percent > 0) { /* control time spent in hardirq */
> + u64 cur = kcpustat_this_cpu->cpustat[CPUTIME_IRQ];
Depends on CONFIG_IRQ_TIME_ACCOUNTING=y, no?
> + u64 irqtime = cur - ms->last_irqtime;
> + bool hardirq_ok = irqtime * 100 < delta_time * hardirq_percent;
> +
> + below_target &= hardirq_ok;
> + ms->last_irqtime = cur;
> + ms->hardirq_high += !hardirq_ok; /* statistics */
> + }
> +
> + /* Controller: move mod_ns up/down if we are above/below target */
> + if (below_target) {
> + ms->mod_ns -= ms->mod_ns * steps / (steps + get_decay_factor());
> + if (ms->mod_ns < 100)
> + ms->mod_ns = 0;
> + } else if (ms->mod_ns < 500) {
> + ms->mod_ns = 500;
Random numbers pulled out of thin air. That's all over the place.
> + } else {
> + ms->mod_ns += ms->mod_ns * steps / (steps + get_grow_factor());
> + if (ms->mod_ns > ms->delay_ns)
> + ms->mod_ns = ms->delay_ns; /* cap to delay_ns */
> + }
> +}
Thanks,
tglx
next prev parent reply other threads:[~2025-11-13 10:15 UTC|newest]
Thread overview: 24+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-11-12 19:24 [PATCH 0/6] platform wide software interrupt moderation Luigi Rizzo
2025-11-12 19:24 ` [PATCH 1/6] genirq: platform wide interrupt moderation: Documentation, Kconfig, irq_desc Luigi Rizzo
2025-11-13 8:17 ` Thomas Gleixner
2025-11-13 9:44 ` Thomas Gleixner
2025-11-13 13:25 ` Marc Zyngier
2025-11-13 13:33 ` Luigi Rizzo
2025-11-13 14:42 ` Marc Zyngier
2025-11-13 14:55 ` Luigi Rizzo
2025-11-13 19:02 ` Marc Zyngier
2025-11-12 19:24 ` [PATCH 2/6] genirq: soft_moderation: add base files, procfs hooks Luigi Rizzo
2025-11-13 9:29 ` Thomas Gleixner
2025-11-13 10:24 ` Thomas Gleixner
2025-11-13 22:42 ` Luigi Rizzo
2025-11-13 22:32 ` Luigi Rizzo
2025-11-13 9:40 ` Thomas Gleixner
2025-11-12 19:24 ` [PATCH 3/6] genirq: soft_moderation: activate hooks in handle_irq_event() Luigi Rizzo
2025-11-13 9:45 ` Thomas Gleixner
2025-11-14 8:27 ` Luigi Rizzo
2025-11-12 19:24 ` [PATCH 4/6] genirq: soft_moderation: implement adaptive moderation Luigi Rizzo
2025-11-13 10:15 ` Thomas Gleixner [this message]
2025-11-12 19:24 ` [PATCH 5/6] x86/irq: soft_moderation: add support for posted_msi (intel) Luigi Rizzo
2025-11-12 19:24 ` [PATCH 6/6] genirq: soft_moderation: implement per-driver defaults (nvme and vfio) Luigi Rizzo
2025-11-13 10:18 ` Thomas Gleixner
2025-11-13 10:42 ` Luigi Rizzo
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87o6p6ck7q.ffs@tglx \
--to=tglx@linutronix.de \
--cc=akpm@linux-foundation.org \
--cc=bhelgaas@google.com \
--cc=jacob.jun.pan@linux.intel.com \
--cc=linux-arch@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=lrizzo@google.com \
--cc=maz@kernel.org \
--cc=pabeni@redhat.com \
--cc=rizzo.unipi@gmail.com \
--cc=seanjc@google.com \
--cc=willemb@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).