Re: [PATCH v2 4/8] genirq: soft_moderation: implement adaptive moderation


From: Thomas Gleixner <tglx@linutronix.de>
To: Luigi Rizzo <lrizzo@google.com>
Cc: Marc Zyngier <maz@kernel.org>,
	Luigi Rizzo <rizzo.unipi@gmail.com>,
	Paolo Abeni <pabeni@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Sean Christopherson <seanjc@google.com>,
	Jacob Pan <jacob.jun.pan@linux.intel.com>,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	Bjorn Helgaas <bhelgaas@google.com>,
	Willem de Bruijn <willemb@google.com>
Subject: Re: [PATCH v2 4/8] genirq: soft_moderation: implement adaptive moderation
Date: Tue, 18 Nov 2025 00:05:25 +0100
Message-ID: <87cy5g6z0q.ffs@tglx>
In-Reply-To: <CAMOZA0KKJ9S45-LnLtYKn-L8dL71tsfs29c6ZL3bkuTcNXorAw@mail.gmail.com>

On Mon, Nov 17 2025 at 22:34, Luigi Rizzo wrote:
> On Mon, Nov 17, 2025 at 9:51 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>> On Sun, Nov 16 2025 at 18:28, Luigi Rizzo wrote:
>> > Add two control parameters (target_irq_rate and hardirq_percent)
>> > to indicate the desired maximum values for these two metrics.
>> >
>> > Every update_ms the hook in handle_irq_event() recomputes the total and
>> > local interrupt rate and the amount of time spent in hardirq, compares
>> > the values with the targets, and adjusts the moderation delay up or down.
>> >
>> > The interrupt rate is computed in a scalable way by counting interrupts
>> > per-CPU, and aggregating the value in a global variable only every
>> > update_ms. Only CPUs that actively process interrupts are actually
>> > accessing the shared variable, so the system scales well even on very
>> > large servers.
>>
>> You still fail to explain why this is required and why a per CPU
>> moderation is not sufficient and what the benefits of the approach are.
>
> It was explained in the first patch of the series and in Documentation/.

Change logs of a patch have to be self-contained whether you like it or
not. I'm not chasing random bits of information across a series or
cover letter, and once a patch is applied it becomes even harder to make
sense of it when the change log contains zero information.

> (First world problem, for sure: I have examples for AMD, Intel, Arm,
> all of them with 100+ CPUs per numa node, and 160-480 CPUs total)
> On some of the above platforms, MSIx interrupts cause heavy serialization
> of all other PCIe requests. As a result, when the total interrupt rate exceeds
> 1-2M intrs/s, I/O throughput degrades by up to 4x and more.
> 
> To deal with this with per CPU moderation, without shared state, each CPU
> cannot allow more than some 5Kintrs/s, which means fixed moderation
> should be set at 200us, and adaptive moderation should jump to such
> delays as soon as the local rate reaches 5K intrs/s.
>
> In reality, it is very unlikely that all CPUs are actively handling such high
> rates, so if we know better, we can adjust or remove moderation individually,
> based on actual local and total interrupt rates and number of active CPUs.
> 
> The purpose of this global mechanism is to figure out whether we are
> approaching a dangerous rate, and do individual tuning.

Makes sense, but I'm not convinced yet that this needs to be as complex
as it is.
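
For reference, the per-CPU budget arithmetic in the quoted paragraph can
be sketched as below. This is an illustration only: the helper names are
hypothetical, and the figure of 200 active CPUs sharing a 1M intrs/s
budget is an assumption picked so that the result matches the quoted
5K intrs/s and 200us. None of it is code from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: equal share of a global interrupt budget per CPU. */
static uint64_t per_cpu_rate_budget(uint64_t global_target_rate,
				    unsigned int nr_cpus)
{
	assert(nr_cpus > 0);
	return global_target_rate / nr_cpus;
}

/* Hypothetical helper: minimum gap between interrupts (ns) for a rate. */
static uint64_t min_delay_ns(uint64_t rate_per_sec)
{
	assert(rate_per_sec > 0);
	return 1000000000ULL / rate_per_sec;
}
```

With a 1,000,000 intrs/s budget spread over 200 CPUs, each CPU gets
5,000 intrs/s, which corresponds to a fixed delay of 200,000 ns (200us).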

>> > +     /* Compare with global and per-CPU targets. */
>> > +     global_rate_high = irq_rate > target_rate;
>> > +     my_rate_high = my_irq_rate * active_cpus * irq_mod_info.scale_cpus > target_rate * 100;
>> > [...]
>> > +     /* Moderate on this CPU only if both global and local rates are high. */
>>
>> Because it's desired that CPUs can be starved by interrupts when enough
>> other CPUs only have a very low rate? I'm failing to understand that
>> logic and the comprehensive rationale in the change log does not help either.
>
> The comment could be worded better, s/moderate/bump delay up/
>
> The mechanism aims to make total_rate < target, by gently kicking
> individual delays up or down based on the condition
>  total_rate > target && local_rate > target / active_cpus ? bump_up()
> : bump_down()
>
> If the control is effective, the total rate will be within bounds and
> nobody suffers, neither the CPUs handling <1K intr/s nor the lonely CPU
> handling 100K+ intr/s.
>
> If suddenly the rates go up, the CPUs with higher rates will be moderated more,
> hopefully converging to a new equilibrium.
> As any control system it has limits on what it can do.

I understand that, but without proper information in code and change log
anyone exposed to this code 6 months down the road will bump his head on
the wall when staring at it (including you).
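
To make the condition discussed above concrete, here is a minimal sketch
of the decision, written as a standalone function. The function and
parameter names are hypothetical, and treating scale_cpus as a
percentage is an assumption inferred from the "* 100" in the quoted
hunk. It is not the patch's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical, simplified decision: raise the local delay only when the
 * system-wide rate is over target AND this CPU exceeds its scaled share.
 * scale_cpus_percent is assumed to be a percentage (100 == fair share).
 */
static bool should_bump_up(uint64_t total_rate, uint64_t local_rate,
			   uint64_t target_rate, unsigned int active_cpus,
			   unsigned int scale_cpus_percent)
{
	bool global_rate_high, local_rate_high;

	assert(active_cpus > 0);

	global_rate_high = total_rate > target_rate;
	/* Multiply both sides instead of dividing to avoid truncation. */
	local_rate_high = local_rate * active_cpus * scale_cpus_percent >
			  target_rate * 100;

	return global_rate_high && local_rate_high;
}
```

In this sketch a CPU handling 500 intrs/s is never bumped up while ten
CPUs share a 1M intrs/s target, even when the total is over target; a
CPU handling 200K intrs/s is bumped up only while the total exceeds the
target.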

>> > [...]
>> > +     if (target_rate > 0 && irqrate_high(ms, delta_time, target_rate, steps))
>> > +             below_target = false;
>> > +
>> > +     if (hardirq_percent > 0 && hardirq_high(ms, delta_time, hardirq_percent))
>> > +             below_target = false;
>>
>> So that rate limits a per CPU overload, but only when IRQTIME accounting
>> is enabled. Oh well...
>
> I can add checks to disallow setting the per-CPU overload when IRQTIME
> accounting is not present.

That solves what? It disables the setting, but that does not make the
functionality any different. Also the compiler is smart enough to
eliminate all that code because the return value of hardirq_high() is
constant.
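
A sketch of what that looks like, under stated assumptions: the macro
SKETCH_IRQ_TIME_ACCOUNTING stands in for the real config option, and the
signatures are simplified and hypothetical (the patch passes a state
pointer instead). When the accounting variant is not built, the stub
returns a constant and the caller's check folds away.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: this macro stands in for the IRQTIME accounting option. */
#ifdef SKETCH_IRQ_TIME_ACCOUNTING
static bool hardirq_high(uint64_t hardirq_ns, uint64_t delta_ns,
			 unsigned int percent)
{
	/* True when hardirq time exceeds percent% of the interval. */
	return hardirq_ns * 100 > delta_ns * percent;
}
#else
/* No accounting data available: the result is a compile-time constant. */
static inline bool hardirq_high(uint64_t hardirq_ns, uint64_t delta_ns,
				unsigned int percent)
{
	(void)hardirq_ns;
	(void)delta_ns;
	(void)percent;
	return false;
}
#endif

/* Hypothetical caller mirroring the shape of the quoted hunk. */
static bool is_below_target(uint64_t hardirq_ns, uint64_t delta_ns,
			    unsigned int percent)
{
	bool below_target = true;

	assert(percent <= 100);
	if (percent > 0 && hardirq_high(hardirq_ns, delta_ns, percent))
		below_target = false;	/* dead code with the stub */

	return below_target;
}
```

Built without the macro, is_below_target() returns true no matter how
much hardirq time is reported, so rejecting the setting changes nothing
about the behaviour.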

>> > +     } else {
>> > +             /* Exponential grow does not restart if value is too small. */
>> > +             if (ms->mod_ns < 500)
>> > +                     ms->mod_ns = 500;
>> > +             ms->mod_ns += ms->mod_ns * steps / (steps + irq_mod_info.grow_factor);
>> > +             if (ms->mod_ns > ms->delay_ns)
>> > +                     ms->mod_ns = ms->delay_ns;
>> > +     }
>>
>> Why does this need separate grow and decay factors? Just because more
>> knobs are better?
>
> Like in TCP, brake aggressively (grow factor is smaller) to respond
> quickly to overload, and accelerate prudently (decay factor is higher)
> to avoid reacting too optimistically.

Why do I have to ask for all this information piecewise?
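
For anyone reading along, the grow step from the quoted hunk can be
restated as a standalone function, as a rough sketch. The function name
and the flat parameter list are hypothetical; the arithmetic follows the
quoted lines. The decay path is not part of the quoted hunk, so it is
not sketched here.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical restatement of the quoted grow step:
 * floor at 500 ns, grow by steps / (steps + grow_factor), clamp to delay_ns.
 */
static uint32_t grow_delay(uint32_t mod_ns, uint32_t delay_ns,
			   uint32_t steps, uint32_t grow_factor)
{
	uint64_t next;

	assert((uint64_t)steps + grow_factor > 0);

	/* Exponential growth cannot restart from a value that is too small. */
	if (mod_ns < 500)
		mod_ns = 500;

	/* 64-bit intermediate so the multiplication cannot overflow. */
	next = mod_ns + (uint64_t)mod_ns * steps / ((uint64_t)steps + grow_factor);

	return next > delay_ns ? delay_ns : (uint32_t)next;
}
```

With steps = 1, a smaller grow_factor gives a larger increment per
update: grow_factor = 1 adds 50% of the current delay, while
grow_factor = 3 adds 25%.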

Thanks,

        tglx
