* [PATCH 0/6] platform wide software interrupt moderation
@ 2025-11-12 19:24 Luigi Rizzo
2025-11-12 19:24 ` [PATCH 1/6] genirq: platform wide interrupt moderation: Documentation, Kconfig, irq_desc Luigi Rizzo
` (5 more replies)
0 siblings, 6 replies; 24+ messages in thread
From: Luigi Rizzo @ 2025-11-12 19:24 UTC (permalink / raw)
To: Thomas Gleixner, Marc Zyngier, Luigi Rizzo, Paolo Abeni,
Andrew Morton, Sean Christopherson, Jacob Pan
Cc: linux-kernel, linux-arch, Bjorn Helgaas, Willem de Bruijn,
Luigi Rizzo
Platform wide software interrupt moderation specifically addresses a
limitation of platforms, from many vendors, whose I/O performance drops
significantly when the total rate of MSI-X interrupts is too high (e.g
1-3M intr/s depending on the platform).
Conventional interrupt moderation operates separately on each source,
hence the configuration should target the worst case. On large servers
with hundreds of interrupt sources, keeping the total rate bounded would
require delays of 100-200us; and adaptive moderation would have to reach
those delays with as little as 10K intr/s per source. These values are
unacceptable for RPC or transactional workloads.
To address this problem, this code measures efficiently the total and
per-CPU interrupt rates, so that individual moderation delays can be
adjusted based on actual global and local load. This way, the system
controls both global interrupt rates and individual CPU load, and
tunes delays so they are normally 0 or very small except during actual
local/global overload.
Configuration is easy and robust. System administrators specify the
maximum targets (moderation delay; interrupt rate; and fraction of time
spent in hardirq), and per-CPU control loops adjust actual delays to try
and keep metrics within the bounds.
There is no need for exact targets, because the system is adaptive.
Values like delay_us=100, target_irq_rate=1000000, hardirq_percent=70
are good almost everywhere.
The system does not rely on any special hardware feature except from
MSI-X Pending Bit Array (PBA), a mandatory component of MSI-X
Boot defaults are set via module parameters (/sys/module/irq_moderation
and /sys/module/${DRIVER}) or at runtime via /proc/irq/moderation, which
is also used to export statistics. Moderation on individual irq can be
turned on/off via /proc/irq/NN/moderation .
PERFORMANCE BENEFITS:
Below are some experimental results under high load (before/after rates
are measured with conventional moderation or with this change):
- 100Gbps NIC, 32 queues: rx goes from 50-60Gbps to 92.8 Gbps (line rate).
- 200Gbps NIC, 10 VMs (total 160 queues): rx goes from 30 Gbps to 190 Gbps (line rate).
- 12 SSD, 96 queues: 4K random read goes from 6M to 20.5 MIOPS (device max).
In all cases, latency up to p95 is unaffected at low/moderate load,
even if compared with no moderation at all.
IMPLEMENTATION
- Most of the code, including module parameters and procfs hooks for
configuration and telemetry, is in two files
include/linux/irq_moderation.h and kernel/irq/irq_moderation.c.
- struct irq_desc is extended with a list entry and one field indicating
whether this source should use moderation
- handle_irq_event() and sysrec_posted_msi_notification() have small
inline hooks to track interrupts and trigger moderation as needed.
- per-CPU state is initialized via hooks in per-architecture files
- optional device driver module parameters can be added to set driver
defaults to enable/disable moderation
Luigi Rizzo (6):
genirq: platform wide interrupt moderation: Documentation, Kconfig,
irq_desc
genirq: soft_moderation: add base files, procfs hooks
genirq: soft_moderation: activate hooks in handle_irq_event()
genirq: soft_moderation: implement adaptive moderation
x86/irq: soft_moderation: add support for posted_msi (intel)
genirq: soft_moderation: implement per-driver defaults (nvme and vfio)
Documentation/core-api/irq/index.rst | 1 +
Documentation/core-api/irq/irq-moderation.rst | 215 ++++++++
arch/x86/kernel/cpu/common.c | 1 +
arch/x86/kernel/irq.c | 12 +
drivers/irqchip/irq-gic-v3.c | 2 +
drivers/nvme/host/pci.c | 3 +
drivers/vfio/pci/vfio_pci_intrs.c | 3 +
include/linux/interrupt.h | 28 +
include/linux/irq_moderation.h | 265 ++++++++++
include/linux/irqdesc.h | 5 +
kernel/irq/Kconfig | 11 +
kernel/irq/Makefile | 1 +
kernel/irq/handle.c | 3 +
kernel/irq/irq_moderation.c | 482 ++++++++++++++++++
kernel/irq/irqdesc.c | 1 +
kernel/irq/proc.c | 2 +
16 files changed, 1035 insertions(+)
create mode 100644 Documentation/core-api/irq/irq-moderation.rst
create mode 100644 include/linux/irq_moderation.h
create mode 100644 kernel/irq/irq_moderation.c
--
2.51.2.1041.gc1ab5b90ca-goog
^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH 1/6] genirq: platform wide interrupt moderation: Documentation, Kconfig, irq_desc
2025-11-12 19:24 [PATCH 0/6] platform wide software interrupt moderation Luigi Rizzo
@ 2025-11-12 19:24 ` Luigi Rizzo
2025-11-13 8:17 ` Thomas Gleixner
` (2 more replies)
2025-11-12 19:24 ` [PATCH 2/6] genirq: soft_moderation: add base files, procfs hooks Luigi Rizzo
` (4 subsequent siblings)
5 siblings, 3 replies; 24+ messages in thread
From: Luigi Rizzo @ 2025-11-12 19:24 UTC (permalink / raw)
To: Thomas Gleixner, Marc Zyngier, Luigi Rizzo, Paolo Abeni,
Andrew Morton, Sean Christopherson, Jacob Pan
Cc: linux-kernel, linux-arch, Bjorn Helgaas, Willem de Bruijn,
Luigi Rizzo
Platform wide software interrupt moderation ("soft_moderation" in this
patch series) specifically addresses a limitation of platforms from many
vendors whose I/O performance drops significantly when the total rate of
MSI-X interrupts is too high (e.g 1-3M intr/s depending on the platform).
Conventional interrupt moderation operates separately on each source,
hence the configuration should target the worst case. On large servers
with hundreds of interrupt sources, keeping the total rate bounded would
require delays of 100-200us; and adaptive moderation would have to reach
those delays with as little as 10K intr/s per source. These values are
unacceptable for RPC or transactional workloads.
To address this problem, this code measures efficiently the total and
per-CPU interrupt rates, so that individual moderation delays can be
adjusted based on actual global and local load. This way, the system
controls both global interrupt rates and individual CPU load, and
tunes delays so they are normally 0 or very small except during actual
local/global overload.
Configuration is easy and robust. System administrators specify the
maximum targets (moderation delay; interrupt rate; and fraction of time
spent in hardirq), and per-CPU control loops adjust actual delays to try
and keep metrics within the bounds.
There is no need for exact targets, because the system is adaptive; the
defaults delay_us=100, target_irq_rate=1000000, hardirq_frac=70 intr/s,
are good almost everywhere.
The system does not rely on any special hardware feature except from
MSI-X Pending Bit Array (PBA), a mandatory component of MSI-X
Boot defaults are set via module parameters (/sys/module/irq_moderation
and /sys/module/${DRIVER}) or at runtime via /proc/irq/moderation, which
is also used to export statistics. Moderation on individual irq can be
turned on/off via /proc/irq/NN/moderation .
The system does not rely on any special hardware feature except from
MSI-X Pending Bit Array (PBA), a mandatory component of MSI-X
This initial patch adds Documentation, Kconfig option, two fields in
struct irq_desc, and prototypes in include/linux/interrupt.h
No functional impact.
Enabling the option will just extend struct irq_desc with two fields.
CONFIG_SOFT_IRQ_MODERATION=y
---
Documentation/core-api/irq/index.rst | 1 +
Documentation/core-api/irq/irq-moderation.rst | 215 ++++++++++++++++++
include/linux/interrupt.h | 15 ++
include/linux/irqdesc.h | 5 +
kernel/irq/Kconfig | 11 +
5 files changed, 247 insertions(+)
create mode 100644 Documentation/core-api/irq/irq-moderation.rst
diff --git a/Documentation/core-api/irq/index.rst b/Documentation/core-api/irq/index.rst
index 0d65d11e54200..b5a6e2ade69bb 100644
--- a/Documentation/core-api/irq/index.rst
+++ b/Documentation/core-api/irq/index.rst
@@ -9,3 +9,4 @@ IRQs
irq-affinity
irq-domain
irqflags-tracing
+ irq-moderation
diff --git a/Documentation/core-api/irq/irq-moderation.rst b/Documentation/core-api/irq/irq-moderation.rst
new file mode 100644
index 0000000000000..ff12dbabc701b
--- /dev/null
+++ b/Documentation/core-api/irq/irq-moderation.rst
@@ -0,0 +1,215 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+===========================================
+Platform wide software interrupt moderation
+===========================================
+
+:Author: Luigi Rizzo <lrizzo@google.com>
+
+.. contents:: :depth: 2
+
+Introduction
+------------
+
+Platform Wide software interrupt moderation is a variant of moderation
+that adjusts the delay based on platform-wide metrics, instead of
+considering each source separately. It then uses hrtimers to implement
+adaptive, per-CPU moderation in software, without requiring any specific
+hardware support other than Pending Bit Array, a standard feature
+of MSI-X.
+
+To understand the motivation for this feature, we start with some
+background on interrupt moderation.
+
+* **Interrupt** is a mechanism to **notify** the CPU of **events**
+ that should be handled by software, for example, **completions**
+ of I/O requests (network tx/rx, disk read/writes...).
+
+* Each event typically issues one interrupt, which is then processed by
+ software before the next interrupt can be issued. If more events fire
+ in the meantime, the next interrupt notifies all of them. This is called
+ **coalescing**, and it can happen unintentionally, as in the example.
+
+* Coalescing amortizes the fixed costs of processing one interrupt over
+ multiple events, so it improves efficiency. This suggested the idea
+ that we could intentionally make coalescing more likely. This is called
+ interrupt **moderation**.
+
+* The most common and robust implementation of moderation enforces
+ some minimum **delay** between subsequent interrupts, using a timer
+ in the device or in software. Most NICs support programmable hardware
+ moderation, with timer granularity down to 1us or so. NVME also
+ specifies hardware moderation timers, with 100us granularity.
+
+* One downside of moderation, especially with **fixed** delay, is that
+ even with moderate load, the notification latency can increase by as
+ much as the moderation delay. This is undesirable for transactional
+ workloads. At high load the extra delay is less problematic, because
+ the queueing delays that occur can be one or more orders of magnitude
+ bigger.
+
+* To address this problem, software can dynamically adjust the delay, making
+ it proportional to the I/O rate. This is called **adaptive** moderation,
+ and it is commonly implemented in network device drivers.
+
+In summary, interrupt moderation, as normally implemented, is a very
+effective way to reduce interrupt processing costs on a per-source
+basis. The modest compromise on the extra latency can be removed with
+adaptive moderation.
+
+MOTIVATION
+~~~~~~~~~~
+
+There is one aspect that per-source moderation does not address.
+
+Several Systems-on-Chip (SoC) from all vendors (Intel, AMD, ARM), show
+huge reduction in I/O throughput (up to 3-4x times slower for high speed
+NICs or SSDs) in presence of high MSI-X interrupt rates across the entire
+platform (1-3M intr/s total, depending on the SoC). Note that unaffected
+SoCs can sustain 20-30M intr/s from MSI-X without performance degradation.
+
+The above performance degradation is not caused by overload of individual
+CPUs. What matters is the total MSI-X interrupt rate, across either
+individual PCIe root port, or the entire SoC. The specific root cause
+depends on the SoC, but generally some internal block (in the PCIe root
+port, or in the IOMMU block) applies excessive serialization around
+MSI-X writes. This in turn causes huge delays in other PCIe transactions,
+leading to the observed performance drop.
+
+EXISTING MITIGATIONS AND WHY THEY ARE INSUFFICIENT:
+===================================================
+
+* As mentioned, traditional interrupt moderation operates on individual sources.
+ Large server platforms can have hundreds of sources (NIC or NVME
+ queues). To stay within the total platform limit (e.g. 1-3M intrs/s)
+ one would need very large delays (100-200us), which are undesirable
+ for transactional workloads (RPC, database).
+
+* Per-source adaptive moderation has the same problem. The adaptive control
+ cannot tell whether other sources are active, so in order to be effective
+ it must assume the worst case and jump to large delays with as little
+ as 10K intrs/s, even if no other sources are active.
+
+* Back in 2020 we addressed this very problem for network devices with
+ the ``napi_defer_hard_irq`` mechanism: after the first interrupt,
+ NAPI does not rearm the queue interrupt, and instead runs the next
+ softirq handler from an hrtimer. It keeps using the timer as interrupt
+ source until one or a few handler calls generate no activity, at which
+ point interrupts are re-enabled.
+
+ That way, under load, device interrupts (which are problematic for
+ the platform) are highly reduced, and replaced with less problematic
+ timer interrupts. There are a few down sides though:
+
+ * the feature is only available for NAPI devices
+
+ * the timer delay is not adaptive, and must still be tuned based on the number
+ of active sources
+
+ * the system generates extra calls to the NAPI handler
+
+ * it has non-intuitive interaction with devices that share tx/rx interrupts
+ or implement optimizations that expect interrupts to be enabled.
+
+PLATFORM WIDE ADAPTIVE MODERATION
+---------------------------------
+
+Platform-Wide adaptive interrupt moderation is designed to overcome the
+above limitations. The system operates as follows:
+
+* on each interrupt, increments a per-CPU counter for number of interrupts
+
+* opportunistically, every ms or so, each CPU scalably accumulates
+ the values across the entire system, so it can compute the total and
+ per-CPU interrupt rate, and the number of CPUs actively processing
+ interrupts.
+
+* with the above information and a system-wide, configurable
+ ``target_irq_rate``, each CPU computes whether it is processing more
+ or less than its fair share of interrupts. A simple control loop then
+ adjusts up/down its per-CPU moderation delay. The value varies from 0 (disabled)
+ to a configurable upper bound ``delay_us``.
+
+* when the per-CPU moderation delay goes above a threshold (e.g. 10us), the first
+ interrupt served by that CPU will start an hrtimer to fire after the
+ adaptive delay. All interrupts sources served by that CPU will be
+ disabled as they come.
+
+* when the timer fires, all disabled sources are re-enabled. The Pending
+ Bit Array feature of MSI will avoid that interrupt generated while
+ disabled are lost.
+
+This scheme is effective in keeping the total interrupt rate under
+control as long as the configuration parameters are sensible
+(``delay_us < #CPUs / target_irq_rate``).
+
+It also lends itself to some extensions, specifically:
+
+* **protect against hardirq overload**. It is possible for one CPU
+ handling interrupts to be overwhelmed by hardirq processing. The
+ control loop can be extended to declare an overload situation when the
+ percentage of time spent in hardirq is above a configurable threshold
+ ``hardirq_percent``. Moderation can thus kick in to keep the load within bounds.
+
+* **reduce latency using timer-based polling**. Similar to ``napi_defer_hard_irq``
+ described earlier, once interrupts are disabled and we have an hrtimer active,
+ we keep the timer active for a few rounds and run the handler from a timer callback
+ instead of waiting for an interrupt. The ``timer_rounds`` parameter controls this behavior,
+
+ Say the control loop settles on 120us delay to stay within the global MSI-X rate limit.
+ By setting ``timer_rounds=2``, each time we have a hardware interrupt, the handler
+ will be called two more times by the timer. As a consequence, in the same conditions,
+ the same global MSI-X rate will be reached with just 120/3 = 40us delay, thus improving
+ latency significantly (note that those handlers call do cause extra CPU work, so we
+ may lose some of the efficiency gains coming from large delays).
+
+CONFIGURATION
+-------------
+
+Configuration of this system is done via module parameters
+``irq_moderation.${name}=${value}`` (for boot-time defaults)
+or writing ``echo "${name}=${value}" > /proc/irq/soft_moderation``
+for run-time configuration.
+
+Here are the existing module parameters
+
+* ``delay_us`` (0: off, range 0-500)
+
+ The maximum moderation delay. 0 means moderation is globally disabled.
+
+* ``target_irq_rate`` (0 off, range 0-50000000)
+
+ The maximum irq rate across the entire platform. The adaptive algorithm will adjust
+ delays to stay within the target. Use 0 to disable this control.
+
+* ``hardirq_percent`` (0 off, range 0-100)
+
+ The maximum percentage of CPU time spent in hardirq. The adaptive algorithm will adjust
+ delays to stay within the target. Use 0 to disable this control.
+
+* ``timer_rounds`` (0 0ff, range 0-20)
+
+ Once the moderation timer is activated, how many extra timer rounds to do before
+ re-enabling interrupts.
+
+* ``update_ms`` (default 1, range 1-100)
+
+ How often the adaptive control should adjust delays. The default value (1ms) should be good
+ in most circumstances.
+
+Interrupt moderation can be enabled/disabled on individual IRQs as follows:
+
+* module parameter ``${driver}.soft_moderation=1`` (default 0) selects
+ whether to use moderation at device probe time.
+
+* ``echo 1 > /proc/irq/*/${irq_name}/../soft_moderation`` (default 0, disabled) toggles
+ moderation on/off for specific IRQs once they are attached.
+
+**INTEL SPECIFIC**
+
+Recent intel CPUs support a kernel feature, enabled via boot parameter ``intremap=posted_msi``,
+that routes all interrupts targeting one CPU via a special interrupt, called **posted_msi**,
+whose handler in turn calls the individual interrupt handlers.
+
+The ``posted_msi`` kernel feature always uses moderation if enabled (``delay_us > 0``) and
+individual IRQs do not need to be enabled individually.
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 51b6484c04934..007201c8db6dd 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -872,6 +872,21 @@ extern int early_irq_init(void);
extern int arch_probe_nr_irqs(void);
extern int arch_early_irq_init(void);
+#ifdef CONFIG_IRQ_SOFT_MODERATION
+
+void irq_moderation_percpu_init(void);
+void irq_moderation_init_fields(struct irq_desc *desc);
+/* add/remove /proc/irq/NN/soft_moderation */
+void irq_moderation_procfs_entry(struct irq_desc *desc, umode_t umode);
+
+#else /* empty stubs to avoid conditional compilation */
+
+static inline void irq_moderation_percpu_init(void) {}
+static inline void irq_moderation_init_fields(struct irq_desc *desc) {}
+static inline void irq_moderation_procfs_entry(struct irq_desc *desc, umode_t umode) {};
+
+#endif
+
/*
* We want to know which function is an entrypoint of a hardirq or a softirq.
*/
diff --git a/include/linux/irqdesc.h b/include/linux/irqdesc.h
index fd091c35d5721..4eb05bc456abe 100644
--- a/include/linux/irqdesc.h
+++ b/include/linux/irqdesc.h
@@ -112,6 +112,11 @@ struct irq_desc {
#endif
struct mutex request_mutex;
int parent_irq;
+#ifdef CONFIG_IRQ_SOFT_MODERATION
+ /* mode: 0: off, 1: disable_irq_nosync() */
+ u8 moderation_mode; /* written in procfs, read on irq */
+ struct list_head ms_node; /* per-CPU list of moderated irq_desc */
+#endif
struct module *owner;
const char *name;
#ifdef CONFIG_HARDIRQS_SW_RESEND
diff --git a/kernel/irq/Kconfig b/kernel/irq/Kconfig
index 1b4254d19a73e..fb9c78b1deea8 100644
--- a/kernel/irq/Kconfig
+++ b/kernel/irq/Kconfig
@@ -155,6 +155,17 @@ config IRQ_KUNIT_TEST
endmenu
+# Support platform wide software interrupt moderation
+config IRQ_SOFT_MODERATION
+ bool "Enable platform wide software interrupt moderation"
+ help
+ Enable platform wide software interrupt moderation.
+ Uses a local timer to delay interrupts in configurable ways
+ and depending on various global system load indicators
+ and targets.
+
+ If you don't know what to do here, say N.
+
config GENERIC_IRQ_MULTI_HANDLER
bool
help
--
2.51.2.1041.gc1ab5b90ca-goog
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH 2/6] genirq: soft_moderation: add base files, procfs hooks
2025-11-12 19:24 [PATCH 0/6] platform wide software interrupt moderation Luigi Rizzo
2025-11-12 19:24 ` [PATCH 1/6] genirq: platform wide interrupt moderation: Documentation, Kconfig, irq_desc Luigi Rizzo
@ 2025-11-12 19:24 ` Luigi Rizzo
2025-11-13 9:29 ` Thomas Gleixner
2025-11-13 9:40 ` Thomas Gleixner
2025-11-12 19:24 ` [PATCH 3/6] genirq: soft_moderation: activate hooks in handle_irq_event() Luigi Rizzo
` (3 subsequent siblings)
5 siblings, 2 replies; 24+ messages in thread
From: Luigi Rizzo @ 2025-11-12 19:24 UTC (permalink / raw)
To: Thomas Gleixner, Marc Zyngier, Luigi Rizzo, Paolo Abeni,
Andrew Morton, Sean Christopherson, Jacob Pan
Cc: linux-kernel, linux-arch, Bjorn Helgaas, Willem de Bruijn,
Luigi Rizzo
Add the main core files that implement soft_moderation, limited to static
moderation, plus related small changes to include/linux/interrupt.h,
kernel/irq/Makefile, and kernel/irq/proc.c
- include/linux/irq_moderation.h has the two main struct, prototypes
and inline hooks
- kernel/irq/irq_moderation.c has the procfs handlers
The code is not yet hooked to the interrupt handler, so
the feature is disabled but we can see the module parameters
/sys/module/irq_moderation/parameters and read/write the procfs entries
/proc/irq/soft_moderation and /proc/irq/NN/soft_moderation.
Examples:
cat /proc/irq/soft_moderation
echo "delay_us=345" > /proc/irq/soft_moderation
echo 1 | tee /proc/irq/*/nvme*/../soft_moderation
Change-Id: I472d9b5b31770aa2787f062f7fe5d411882be60e
---
include/linux/irq_moderation.h | 196 ++++++++++++++++++++
kernel/irq/Makefile | 1 +
kernel/irq/irq_moderation.c | 315 +++++++++++++++++++++++++++++++++
kernel/irq/proc.c | 2 +
4 files changed, 514 insertions(+)
create mode 100644 include/linux/irq_moderation.h
create mode 100644 kernel/irq/irq_moderation.c
diff --git a/include/linux/irq_moderation.h b/include/linux/irq_moderation.h
new file mode 100644
index 0000000000000..4d90d7c4ca26b
--- /dev/null
+++ b/include/linux/irq_moderation.h
@@ -0,0 +1,196 @@
+/* SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause */
+
+#ifndef _LINUX_IRQ_MODERATION_H
+#define _LINUX_IRQ_MODERATION_H
+
+/*
+ * Platform wide software interrupt moderation, see
+ * Documentation/core-api/irq/irq-moderation.rst
+ */
+
+#include <linux/kernel.h>
+#include <linux/hrtimer.h>
+#include <linux/irq.h>
+#include <linux/irqdesc.h>
+
+#ifdef CONFIG_IRQ_SOFT_MODERATION
+
+/* Global configuration parameters and state */
+struct irq_mod_info {
+ /* These fields are written to by all CPUs */
+ ____cacheline_aligned
+ atomic_long_t total_intrs; /* running count updated every update_ms */
+ atomic_long_t total_cpus; /* as above, active CPUs in this interval */
+
+ /* These are mostly read (frequently), so use a different cacheline */
+ ____cacheline_aligned
+ u64 procfs_write_ns; /* last write to /proc/irq/soft_moderation */
+ uint delay_us; /* fixed delay, or maximum for adaptive */
+ uint target_irq_rate; /* target interrupt rate */
+ uint hardirq_percent; /* target maximum hardirq percentage */
+ uint timer_rounds; /* how many timer polls once moderation fires */
+ uint update_ms; /* how often to update delay/rate/fraction */
+ uint scale_cpus; /* (percent) scale factor to estimate active CPUs */
+ uint count_timer_calls; /* count timer calls for irq limits */
+ uint count_msi_calls; /* count calls from posted_msi for irq limits */
+ uint decay_factor; /* keep it at 16 */
+ uint grow_factor; /* keep it at 8 */
+ int pad[] ____cacheline_aligned;
+};
+
+extern struct irq_mod_info irq_mod_info;
+
+/* Per-CPU moderation state */
+struct irq_mod_state {
+ struct hrtimer timer; /* moderation timer */
+ struct list_head descs; /* moderated irq_desc on this CPU */
+
+ /* Counters on last time we updated moderation delay */
+ u64 last_ns; /* time of last update */
+ u64 last_irqtime; /* from cpustat[CPUTIME_IRQ] */
+ u64 last_total_irqs;
+ u64 last_total_cpus;
+
+ bool in_posted_msi; /* don't suppress handle_irq, set in posted_msi handler */
+ bool kick_posted_msi; /* kick posted_msi from the timer callback */
+
+ u32 cycles; /* calls since last ktime_get_ns() */
+ s32 irq_count; /* irqs in the last cycle, signed as we also decrement */
+ u32 delay_ns; /* fetched from irq_mod_info */
+ u32 mod_ns; /* recomputed every update_ms */
+ u32 sleep_ns; /* accumulated time for actual delay */
+ s32 rounds_left; /* how many rounds left for moderation */
+
+ /* Statistics */
+ u32 irq_rate; /* smoothed global irq rate */;
+ u32 my_irq_rate; /* smoothed irq rate for this CPU */;
+ u32 cpu_count; /* smoothed CPU count (scaled) */;
+ u32 src_count; /* smoothed irq sources count (scaled) */;
+ u32 irq_high; /* how many times above each threshold */
+ u32 my_irq_high;
+ u32 hardirq_high;
+ u32 timer_set; /* counters for various events */
+ u32 timer_fire;
+ u32 disable_irq;
+ u32 enable_irq;
+ u32 timer_calls;
+ u32 from_posted_msi;
+ u32 stray_irq;
+ int pad[] ____cacheline_aligned;
+};
+
+DECLARE_PER_CPU_ALIGNED(struct irq_mod_state, irq_mod_state);
+
+static inline void irq_moderation_start_timer(struct irq_mod_state *ms)
+{
+ ms->timer_set++;
+ ms->rounds_left = clamp(READ_ONCE(irq_mod_info.timer_rounds), 0u, 20u) + 1;
+ hrtimer_start_range_ns(&ms->timer, ns_to_ktime(ms->sleep_ns),
+ /*range*/2000, HRTIMER_MODE_REL_PINNED_HARD);
+}
+
+static inline bool irq_moderation_enabled(void)
+{
+ return READ_ONCE(irq_mod_info.delay_us);
+}
+
+static inline uint irq_moderation_get_update_ms(void)
+{
+ return clamp(READ_ONCE(irq_mod_info.update_ms), 1u, 100u);
+}
+
+/* Called on each interrupt for adaptive moderation delay adjustment */
+static inline void irq_moderation_adjust_delay(struct irq_mod_state *ms)
+{
+ u64 now, delta_time, update_ns;
+
+ ms->irq_count++;
+ if (ms->cycles++ < 16) /* ktime_get_ns() is expensive, don't do too often */
+ return;
+ ms->cycles = 0;
+ now = ktime_get_ns();
+ delta_time = now - ms->last_ns;
+ update_ns = irq_moderation_get_update_ms() * NSEC_PER_MSEC;
+
+ /* Run approximately every update_ns, a little bit early is ok */
+ if (delta_time < update_ns - 5000)
+ return;
+
+ /* Fetch important state */
+ ms->delay_ns = clamp(irq_mod_info.delay_us, 1u, 500u) * NSEC_PER_USEC;
+
+ ms->last_ns = now;
+ ms->mod_ns = ms->delay_ns;
+}
+
+/* Return true if timer is active or delay is large enough to require moderation */
+static inline bool irq_moderation_needed(struct irq_mod_state *ms)
+{
+ if (!hrtimer_is_queued(&ms->timer)) {
+ ms->sleep_ns += ms->mod_ns; /* accumulate sleep time */
+ if (ms->sleep_ns < 10000) /* no moderation if too small */
+ return false;
+ }
+ return true;
+}
+
+void disable_irq_nosync(unsigned int irq);
+
+/*
+ * Use in handle_irq_event() before calling the handler. Decide whether this
+ * desc should be moderated, and in case disable the irq and add the desc to
+ * the list for this CPU.
+ */
+static inline void irq_moderation_hook(struct irq_desc *desc)
+{
+ struct irq_mod_state *ms = this_cpu_ptr(&irq_mod_state);
+
+ if (!irq_moderation_enabled())
+ return;
+
+ if (!READ_ONCE(desc->moderation_mode))
+ return;
+
+ irq_moderation_adjust_delay(ms);
+
+ if (!list_empty(&desc->ms_node)) {
+ /*
+ * Very unlikely, stray interrupt while the desc is moderated.
+ * Unfortunately we cannot ignore it, just count it.
+ */
+ ms->stray_irq++;
+ return;
+ }
+
+ if (!irq_moderation_needed(ms))
+ return;
+
+ list_add(&desc->ms_node, &ms->descs); /* Add to list of moderated desc */
+ /*
+ * disable the irq. This will also cause irq_can_handle() return false
+ * (through irq_can_handle_actions()), and that will prevent a handler
+ * instance to be run again while the descriptor is being moderated.
+ *
+ * irq_moderation_epilogue() will then start the timer if needed.
+ */
+ ms->disable_irq++;
+ disable_irq_nosync(desc->irq_data.irq); /* desc must be unlocked */
+}
+
+/* After the handler, if desc is moderated, make sure the timer is active. */
+static inline void irq_moderation_epilogue(const struct irq_desc *desc)
+{
+ struct irq_mod_state *ms = this_cpu_ptr(&irq_mod_state);
+
+ if (!list_empty(&desc->ms_node) && !hrtimer_is_queued(&ms->timer))
+ irq_moderation_start_timer(ms);
+}
+
+#else /* empty stubs to avoid conditional compilation */
+
+static inline void irq_moderation_hook(struct irq_desc *desc) {}
+static inline void irq_moderation_epilogue(const struct irq_desc *desc) {}
+
+#endif /* CONFIG_IRQ_SOFT_MODERATION */
+
+#endif /* _LINUX_IRQ_MODERATION_H */
diff --git a/kernel/irq/Makefile b/kernel/irq/Makefile
index 6ab3a40556670..c06da43d644f2 100644
--- a/kernel/irq/Makefile
+++ b/kernel/irq/Makefile
@@ -9,6 +9,7 @@ obj-$(CONFIG_GENERIC_IRQ_CHIP) += generic-chip.o
obj-$(CONFIG_GENERIC_IRQ_PROBE) += autoprobe.o
obj-$(CONFIG_IRQ_DOMAIN) += irqdomain.o
obj-$(CONFIG_IRQ_SIM) += irq_sim.o
+obj-$(CONFIG_IRQ_SOFT_MODERATION) += irq_moderation.o
obj-$(CONFIG_PROC_FS) += proc.o
obj-$(CONFIG_GENERIC_PENDING_IRQ) += migration.o
obj-$(CONFIG_GENERIC_IRQ_MIGRATION) += cpuhotplug.o
diff --git a/kernel/irq/irq_moderation.c b/kernel/irq/irq_moderation.c
new file mode 100644
index 0000000000000..a9d2bdcf4d8c7
--- /dev/null
+++ b/kernel/irq/irq_moderation.c
@@ -0,0 +1,315 @@
+// SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause
+
+#include <linux/module.h>
+#include <linux/seq_file.h>
+#include <linux/proc_fs.h>
+
+#include <linux/irq_moderation.h>
+#include <linux/kernel_stat.h> /* interrupt.h, kcpustat_this_cpu */
+#include "internals.h"
+
+/*
+ * Platform-wide software interrupt moderation.
+ *
+ * see Documentation/core-api/irq/irq-moderation.rst
+ *
+ * === MOTIVATION AND OPERATION ===
+ *
+ * Some platforms show reduced I/O performance when the total device interrupt
+ * rate across the entire platform becomes too high. This code implements
+ * per-CPU adaptive moderation based on the total interrupt rate, as opposed
+ * to conventional moderation that operates separately on each source.
+ *
+ * It computes the total interrupt rate and number of sources, and uses the
+ * information to adaptively disable individual interrupts for small amounts
+ * of time using per-CPU hrtimers and MSI-X Pending Bit Array. Specifically:
+ *
+ * - a hook in handle_irq_event(), which applies only on sources configured
+ * to use moderation, updates statistics and check whether we need
+ * moderation on that CPU/irq. If so, calls disable_irq_nosync() and starts
+ * an hrtimer with appropriate delay.
+ *
+ * - the timer callback calls enable_irq() for all disabled interrupts on that
+ * CPU. That in turn will generate interrupts if there are pending events.
+ *
+ * === CONFIGURATION ===
+ *
+ * The following can be controlled at boot time via module parameters
+ *
+ * irq_moderation.${NAME}=${VALUE}
+ *
+ * or at runtime by writing
+ *
+ * echo "${NAME}=${VALUE}" > /proc/irq/soft_moderation
+ *
+ * delay_us (default 50, range 10..500, 0 DISABLES MODERATION)
+ * Fixed or maximum moderation delay. A reasonable range is 20..100, higher
+ * values can be useful if the hardirq handler is performing a significant
+ * amount of work.
+ *
+ * FIXED MODERATION mode requires target_irq_rate=0, hardirq_percent=0
+ *
+ * target_irq_rate (default 1M, 0 off, range 0..50M)
+ * the total irq rate above which moderation kicks in.
+ * Not particularly critical, a value in the 500K-1M range is usually ok
+ *
+ * hardirq_percent (default 70, 0 off, range 10..100)
+ * the hardirq percentage above which moderation kicks in.
+ * 50-90 is a reasonable range.
+ *
+ * timer_rounds (default 0, max 20)
+ * once moderation triggers, periodically run handler zero or more
+ * times using a timer rather than interrupts. This is similar to
+ * napi_defer_hard_irqs on NICs.
+ * A small value may help control load in interrupt-challenged platforms.
+ *
+ * update_ms (default 1, range 1...100)
+ * how often the load is measured and moderation delay updated.
+ *
+ * Moderation can be enabled/disabled for individual interrupts with
+ *
+ * echo "on" > /proc/irq/NN/soft_moderation # use "off" to disable
+ *
+ * === MONITORING ===
+ *
+ * cat /proc/irq/soft_moderation shows per-CPU and global statistics.
+ *
+ */
+
+static_assert(offsetof(struct irq_mod_info, procfs_write_ns) == 64);
+
+struct irq_mod_info irq_mod_info ____cacheline_aligned = {
+ .delay_us = 100,
+ .update_ms = 1,
+ .count_timer_calls = true,
+};
+
+module_param_named(delay_us, irq_mod_info.delay_us, uint, 0444);
+MODULE_PARM_DESC(delay_us, "Max moderation delay us, 0 = moderation off, range 10..500.");
+
+module_param_named(timer_rounds, irq_mod_info.timer_rounds, uint, 0444);
+MODULE_PARM_DESC(timer_rounds, "How many extra timer polls once moderation triggered.");
+
+DEFINE_PER_CPU_ALIGNED(struct irq_mod_state, irq_mod_state);
+
+static void smooth_avg(u32 *dst, u32 val, u32 steps)
+{
+ *dst = ((64 - steps) * *dst + steps * val) / 64;
+}
+
+/* moderation timer handler, called in hardintr context */
+static enum hrtimer_restart moderation_timer_cb(struct hrtimer *timer)
+{
+ struct irq_mod_state *ms = this_cpu_ptr(&irq_mod_state);
+ struct irq_desc *desc, *next;
+ uint srcs = 0;
+
+ ms->timer_fire++;
+ WARN_ONCE(ms->timer_set != ms->timer_fire,
+ "CPU %d timer set %d fire %d (lost events?)\n",
+ smp_processor_id(), ms->timer_set, ms->timer_fire);
+
+ ms->rounds_left--;
+
+ if (ms->rounds_left > 0) {
+ /* Timer still alive. Just call the handlers */
+ list_for_each_entry_safe(desc, next, &ms->descs, ms_node) {
+ ms->irq_count += irq_mod_info.count_timer_calls;
+ ms->timer_calls++;
+ handle_irq_event_percpu(desc);
+ }
+ ms->timer_set++;
+ hrtimer_forward_now(&ms->timer, ms->sleep_ns);
+ return HRTIMER_RESTART;
+ }
+
+ /* Last round, remove from list and enable_irq() */
+ list_for_each_entry_safe(desc, next, &ms->descs, ms_node) {
+ list_del(&desc->ms_node);
+ INIT_LIST_HEAD(&desc->ms_node);
+ srcs++;
+ ms->enable_irq++;
+ enable_irq(desc->irq_data.irq); /* ok if the sync_lock/unlock are NULL */
+ }
+ smooth_avg(&ms->src_count, srcs * 256, 1);
+
+ ms->sleep_ns = 0; /* prepare to accumulate next moderation delay */
+
+ WARN_ONCE(ms->disable_irq != ms->enable_irq,
+ "CPU %d irq disable %d enable %d (%s)\n",
+ smp_processor_id(), ms->disable_irq, ms->enable_irq,
+ "bookkeeping error, some irq qill be stuck");
+
+ return HRTIMER_NORESTART;
+}
+
+/* Initialize moderation state in desc_set_defaults() */
+void irq_moderation_init_fields(struct irq_desc *desc)
+{
+ INIT_LIST_HEAD(&desc->ms_node);
+ desc->moderation_mode = 0;
+}
+
+/* Per-CPU state initialization */
+void irq_moderation_percpu_init(void)
+{
+ struct irq_mod_state *ms = this_cpu_ptr(&irq_mod_state);
+
+ hrtimer_setup(&ms->timer, moderation_timer_cb, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED_HARD);
+ INIT_LIST_HEAD(&ms->descs);
+}
+
+static void set_moderation_mode(struct irq_desc *desc, bool mode)
+{
+ if (desc) {
+ struct irq_chip *chip = desc->irq_data.chip;
+
+ /* Make sure this is msi and we can run enable_irq from irq context */
+ mode &= desc->handle_irq == handle_edge_irq && chip && chip->irq_bus_lock == NULL &&
+ chip->irq_bus_sync_unlock == NULL;
+ if (mode != desc->moderation_mode)
+ desc->moderation_mode = mode;
+ }
+}
+
+#pragma clang diagnostic error "-Wformat"
+/* Print statistics */
+static int moderation_show(struct seq_file *p, void *v)
+{
+ ulong irq_rate = 0, irq_high = 0, my_irq_high = 0, hardirq_high = 0;
+ uint delay_us = irq_mod_info.delay_us;
+ struct irq_mod_state *ms;
+ u64 now = ktime_get_ns();
+ int j, active_cpus = 0;
+ struct irq_desc *desc = p->private;
+
+ if (desc) {
+ seq_printf(p, "%s\n", desc->moderation_mode ? "on" : "off");
+ return 0;
+ }
+
+ seq_puts(p, "# CPU irq/s my_irq/s cpus srcs delay_ns irq_hi my_irq_hi"
+ " hardirq_hi timer_set disable_irq from_msi timer_calls stray_irq\n");
+ for_each_online_cpu(j) {
+ ms = per_cpu_ptr(&irq_mod_state, j);
+ if (now - ms->last_ns > NSEC_PER_SEC) {
+ ms->my_irq_rate = 0;
+ ms->irq_rate = 0;
+ ms->cpu_count = 0;
+ } else { /* average irq_rate over active CPUs */
+ active_cpus++;
+ irq_rate += ms->irq_rate;
+ }
+ /* compute totals */
+ irq_high += ms->irq_high;
+ my_irq_high += ms->my_irq_high;
+ hardirq_high += ms->hardirq_high;
+
+ seq_printf(p, "%5u %8u %8u %4u %4u %8u %11u %11u %11u %11u %11u %11u %11u %9u\n",
+ j, ms->irq_rate, ms->my_irq_rate, (ms->cpu_count + 128) / 256,
+ (ms->src_count + 128) / 256, ms->mod_ns, ms->irq_high, ms->my_irq_high,
+ ms->hardirq_high, ms->timer_set, ms->disable_irq,
+ ms->from_posted_msi, ms->timer_calls, ms->stray_irq);
+ }
+
+ seq_printf(p, "\n"
+ "enabled %s\n"
+ "delay_us %u\n"
+ "timer_rounds %u\n"
+ "count_timer_calls %s\n",
+ str_yes_no(delay_us),
+ delay_us,
+ irq_mod_info.timer_rounds,
+ str_yes_no(irq_mod_info.count_timer_calls));
+
+ return 0;
+}
+
+static int moderation_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, moderation_show, pde_data(inode));
+}
+
+/* helpers to set values */
+struct var_names {
+ const char *name;
+ uint *val;
+ int min;
+ int max;
+} var_names[] = {
+ { "delay_us", &irq_mod_info.delay_us, 0, 500 },
+ { "timer_rounds", &irq_mod_info.timer_rounds, 0, 50 },
+ { "count_timer_calls", &irq_mod_info.count_timer_calls, 0, 1 },
+ {}
+};
+
+static int set_parameter(const char *buf, int len)
+{
+ struct var_names *n;
+ int l, val;
+
+ for (n = var_names; n->name; n++) {
+ l = strlen(n->name);
+ if (len >= l + 2 && !strncmp(buf, n->name, l) && buf[l] == '=')
+ break;
+ }
+ if (!n->name || !n->val)
+ return -EINVAL;
+ if (kstrtouint(buf + l + 1, 0, &val))
+ return -EINVAL;
+ WRITE_ONCE(*(n->val), clamp(val, n->min, n->max));
+ irq_mod_info.procfs_write_ns = ktime_get_ns();
+ return len;
+}
+
+static ssize_t moderation_write(struct file *f, const char __user *buf, size_t count, loff_t *ppos)
+{
+ char val[40]; /* bounded string size */
+ struct irq_desc *desc = (struct irq_desc *)pde_data(file_inode(f));
+
+ if (count == 0 || count + 1 > sizeof(val))
+ return -EINVAL;
+ if (copy_from_user(val, buf, count))
+ return -EFAULT;
+ val[count] = '\0';
+ if (val[count - 1] == '\n')
+ val[count - 1] = '\0';
+ if (!desc)
+ return set_parameter(val, count);
+
+ if (!strcmp(val, "off") || !strcmp(val, "0"))
+ set_moderation_mode(desc, false);
+ else if (!strcmp(val, "on") || !strcmp(val, "1"))
+ set_moderation_mode(desc, true);
+ else
+ return -EINVAL;
+ return count; /* consume all */
+}
+
+static const struct proc_ops proc_ops = {
+ .proc_open = moderation_open,
+ .proc_read = seq_read,
+ .proc_lseek = seq_lseek,
+ .proc_release = single_release,
+ .proc_write = moderation_write,
+};
+
+void irq_moderation_procfs_entry(struct irq_desc *desc, umode_t umode)
+{
+ if (umode)
+ proc_create_data("soft_moderation", umode, desc->dir, &proc_ops, desc);
+ else
+ remove_proc_entry("soft_moderation", desc->dir);
+}
+
+static int __init init_irq_moderation(void)
+{
+ proc_create_data("irq/soft_moderation", 0644, NULL, &proc_ops, (void *)0);
+ return 0;
+}
+
+MODULE_LICENSE("Dual BSD/GPL");
+MODULE_VERSION("1.0");
+MODULE_AUTHOR("Luigi Rizzo <lrizzo@google.com>");
+MODULE_DESCRIPTION("Platform wide software interrupt moderation");
+module_init(init_irq_moderation);
diff --git a/kernel/irq/proc.c b/kernel/irq/proc.c
index 29c2404e743be..3d2b583b9443e 100644
--- a/kernel/irq/proc.c
+++ b/kernel/irq/proc.c
@@ -374,6 +374,7 @@ void register_irq_proc(unsigned int irq, struct irq_desc *desc)
proc_create_single_data("effective_affinity_list", 0444, desc->dir,
irq_effective_aff_list_proc_show, irqp);
# endif
+ irq_moderation_procfs_entry(desc, 0644);
#endif
proc_create_single_data("spurious", 0444, desc->dir,
irq_spurious_proc_show, (void *)(long)irq);
@@ -395,6 +396,7 @@ void unregister_irq_proc(unsigned int irq, struct irq_desc *desc)
remove_proc_entry("effective_affinity", desc->dir);
remove_proc_entry("effective_affinity_list", desc->dir);
# endif
+ irq_moderation_procfs_entry(desc, 0); /* remove */
#endif
remove_proc_entry("spurious", desc->dir);
--
2.51.2.1041.gc1ab5b90ca-goog
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH 3/6] genirq: soft_moderation: activate hooks in handle_irq_event()
2025-11-12 19:24 [PATCH 0/6] platform wide software interrupt moderation Luigi Rizzo
2025-11-12 19:24 ` [PATCH 1/6] genirq: platform wide interrupt moderation: Documentation, Kconfig, irq_desc Luigi Rizzo
2025-11-12 19:24 ` [PATCH 2/6] genirq: soft_moderation: add base files, procfs hooks Luigi Rizzo
@ 2025-11-12 19:24 ` Luigi Rizzo
2025-11-13 9:45 ` Thomas Gleixner
2025-11-12 19:24 ` [PATCH 4/6] genirq: soft_moderation: implement adaptive moderation Luigi Rizzo
` (2 subsequent siblings)
5 siblings, 1 reply; 24+ messages in thread
From: Luigi Rizzo @ 2025-11-12 19:24 UTC (permalink / raw)
To: Thomas Gleixner, Marc Zyngier, Luigi Rizzo, Paolo Abeni,
Andrew Morton, Sean Christopherson, Jacob Pan
Cc: linux-kernel, linux-arch, Bjorn Helgaas, Willem de Bruijn,
Luigi Rizzo
Activate soft_moderation via the hooks in handle_irq_event()
and per-CPU and irq_desc initialization.
This change only implements fixed moderation. It needs to be
explicitly enabled at runtime on individual interrupts.
Example (kernel built with CONFIG_SOFT_IRQ_MODERATION=y)
# enable fixed moderation
echo "delay_us=400" > /proc/irq/soft_moderation
# enable on network interrupts (change name as appropriate)
echo on | tee /proc/irq/*/*eth*/../soft_moderation
# show it works by looking at counters in /proc/irq/soft_moderation
cat /proc/irq/soft_moderation
# Show runtime impact on ping times changing delay_us
ping -n -f -q -c 1000 ${some_nearby_host}
echo "delay_us=100" > /proc/irq/soft_moderation
ping -n -f -q -c 1000 ${some_nearby_host}
Configuration via module parameters (irq_moderation.${name}=${value}) or
echo "${name}=${value}" > /proc/irq/soft_moderation)
delay_us 0=off, range 1-500, default 100
how long an interrupt is disabled after it fires. Small values are
accumulated until they are large enough, e.g. 10us. As an example, a 2us value
means that the timer is set only every 5 interrupts.
timer_rounds 0-20, default 0
How many extra timer runs before re-enabling interrupts. This allows
reducing the number of MSI interrupts while keeping delay_us small.
This is similar to the "napi_defer_hard_irqs" option in NAPI, but with
some subtle differences (e.g. here the number of rounds is
deterministic, and interrupts are disabled at MSI level).
Change-Id: I47c5059ad537fcb9561f924620cf68e1d648aae6
---
arch/x86/kernel/cpu/common.c | 1 +
drivers/irqchip/irq-gic-v3.c | 2 ++
kernel/irq/handle.c | 3 +++
kernel/irq/irqdesc.c | 1 +
4 files changed, 7 insertions(+)
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 02d97834a1d4d..1953419fde6ff 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -2440,6 +2440,7 @@ void cpu_init(void)
intel_posted_msi_init();
}
+ irq_moderation_percpu_init();
mmgrab(&init_mm);
cur->active_mm = &init_mm;
diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
index 3de351e66ee84..902bcbf9d85d8 100644
--- a/drivers/irqchip/irq-gic-v3.c
+++ b/drivers/irqchip/irq-gic-v3.c
@@ -1226,6 +1226,8 @@ static void gic_cpu_sys_reg_init(void)
WARN_ON(gic_dist_security_disabled() != cpus_have_security_disabled);
}
+ irq_moderation_percpu_init();
+
/*
* Some firmwares hand over to the kernel with the BPR changed from
* its reset value (and with a value large enough to prevent
diff --git a/kernel/irq/handle.c b/kernel/irq/handle.c
index e103451243a0b..2cacceaaea9d0 100644
--- a/kernel/irq/handle.c
+++ b/kernel/irq/handle.c
@@ -12,6 +12,7 @@
#include <linux/random.h>
#include <linux/sched.h>
#include <linux/interrupt.h>
+#include <linux/irq_moderation.h>
#include <linux/kernel_stat.h>
#include <asm/irq_regs.h>
@@ -254,9 +255,11 @@ irqreturn_t handle_irq_event(struct irq_desc *desc)
irqd_set(&desc->irq_data, IRQD_IRQ_INPROGRESS);
raw_spin_unlock(&desc->lock);
+ irq_moderation_hook(desc); /* may disable irq so must run unlocked */
ret = handle_irq_event_percpu(desc);
raw_spin_lock(&desc->lock);
+ irq_moderation_epilogue(desc); /* start moderation timer if needed */
irqd_clear(&desc->irq_data, IRQD_IRQ_INPROGRESS);
return ret;
}
diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index db714d3014b5f..e3efbecf5b937 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -134,6 +134,7 @@ static void desc_set_defaults(unsigned int irq, struct irq_desc *desc, int node,
desc->tot_count = 0;
desc->name = NULL;
desc->owner = owner;
+ irq_moderation_init_fields(desc);
for_each_possible_cpu(cpu)
*per_cpu_ptr(desc->kstat_irqs, cpu) = (struct irqstat) { };
desc_smp_init(desc, node, affinity);
--
2.51.2.1041.gc1ab5b90ca-goog
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH 4/6] genirq: soft_moderation: implement adaptive moderation
2025-11-12 19:24 [PATCH 0/6] platform wide software interrupt moderation Luigi Rizzo
` (2 preceding siblings ...)
2025-11-12 19:24 ` [PATCH 3/6] genirq: soft_moderation: activate hooks in handle_irq_event() Luigi Rizzo
@ 2025-11-12 19:24 ` Luigi Rizzo
2025-11-13 10:15 ` Thomas Gleixner
2025-11-12 19:24 ` [PATCH 5/6] x86/irq: soft_moderation: add support for posted_msi (intel) Luigi Rizzo
2025-11-12 19:24 ` [PATCH 6/6] genirq: soft_moderation: implement per-driver defaults (nvme and vfio) Luigi Rizzo
5 siblings, 1 reply; 24+ messages in thread
From: Luigi Rizzo @ 2025-11-12 19:24 UTC (permalink / raw)
To: Thomas Gleixner, Marc Zyngier, Luigi Rizzo, Paolo Abeni,
Andrew Morton, Sean Christopherson, Jacob Pan
Cc: linux-kernel, linux-arch, Bjorn Helgaas, Willem de Bruijn,
Luigi Rizzo
Add two control parameters (target_irq_rate and hardirq_percent)
to indicate the desired maximum values for these two metrics.
Every update_ms the hook in handle_irq_event() recomputes the total and
local interrupt rate and the amount of time spent in hardirq, compares
the values with the targets, and adjusts the moderation delay up or down.
The interrupt rate is computed in a scalable way by counting interrupts
per-CPU, and aggregating the value in a global variable only every
update_ms. Only CPUs that actively process interrupts are actually
accessing the shared variable, so the system scales well even on very
large servers.
EXAMPLE TESTING
Need some workload that can exceed the limits, such as heavy
network or disk traffic. For testing, one can use very low thresholds
(e.g. target_irq_rate=50000, hardirq_frac=10) to make it easier to go
above the limit.
# configure maximum delay (which is also the fixed moderation delay)
echo "delay_us=400" > /proc/irq/soft_moderation
# enable on network interrupts (change name as appropriate)
echo on | tee /proc/irq/*/*eth*/../soft_moderation
# ping times should reflect the 400us
ping -n -f -q -c 1000 ${some_nearby_host}
# show actual per-cpu delays and statistics
less /proc/irq/soft_moderation
# configure adaptive bounds. The control loop will adjust values
# based on actual load
echo "target_irq_rate=1000000" > /proc/irq/soft_moderation
echo "hardirq_percent=70" > /proc/irq/soft_moderation
# ping times now should be much lower
ping -n -f -q -c 1000 ${some_nearby_host}
# show actual per-cpu delays and statistics
less /proc/irq/soft_moderation
By generating high interrupt or hardirq load, one can also test
the effectiveness of the control scheme and the sensitivity to
control parameters.
NEW PARAMETERS
target_irq_rate 0 off, 0-50000000, default 0
the total maximum acceptable interrupt rate.
hardirq_percent 0 off, 0-100, default 0
the maximum acceptable percentage of time spent in hardirq.
update_ms 1-100, default 1
how often the control loop will readjust the delay.
Change-Id: I3cdc72041be1e3c793013d8804f484cdcbb455ab
---
include/linux/irq_moderation.h | 9 ++-
kernel/irq/irq_moderation.c | 143 ++++++++++++++++++++++++++++++++-
2 files changed, 147 insertions(+), 5 deletions(-)
diff --git a/include/linux/irq_moderation.h b/include/linux/irq_moderation.h
index 4d90d7c4ca26b..45df60230e42e 100644
--- a/include/linux/irq_moderation.h
+++ b/include/linux/irq_moderation.h
@@ -89,6 +89,8 @@ static inline void irq_moderation_start_timer(struct irq_mod_state *ms)
/*range*/2000, HRTIMER_MODE_REL_PINNED_HARD);
}
+void __irq_moderation_adjust_delay(struct irq_mod_state *ms, u64 delta_time, u64 update_ns);
+
static inline bool irq_moderation_enabled(void)
{
return READ_ONCE(irq_mod_info.delay_us);
@@ -119,8 +121,13 @@ static inline void irq_moderation_adjust_delay(struct irq_mod_state *ms)
/* Fetch important state */
ms->delay_ns = clamp(irq_mod_info.delay_us, 1u, 500u) * NSEC_PER_USEC;
+ /* If config changed, restart from the highest delay */
+ if (ktime_compare(irq_mod_info.procfs_write_ns, ms->last_ns) > 0)
+ ms->mod_ns = ms->delay_ns;
+
ms->last_ns = now;
- ms->mod_ns = ms->delay_ns;
+ /* Do the expensive processing */
+ __irq_moderation_adjust_delay(ms, delta_time, update_ns);
}
/* Return true if timer is active or delay is large enough to require moderation */
diff --git a/kernel/irq/irq_moderation.c b/kernel/irq/irq_moderation.c
index a9d2bdcf4d8c7..0229697a6a95a 100644
--- a/kernel/irq/irq_moderation.c
+++ b/kernel/irq/irq_moderation.c
@@ -81,22 +81,127 @@ static_assert(offsetof(struct irq_mod_info, procfs_write_ns) == 64);
struct irq_mod_info irq_mod_info ____cacheline_aligned = {
.delay_us = 100,
.update_ms = 1,
+ .scale_cpus = 100,
.count_timer_calls = true,
+ .decay_factor = 16,
+ .grow_factor = 8,
};
module_param_named(delay_us, irq_mod_info.delay_us, uint, 0444);
MODULE_PARM_DESC(delay_us, "Max moderation delay us, 0 = moderation off, range 10..500.");
+module_param_named(hardirq_percent, irq_mod_info.hardirq_percent, uint, 0444);
+MODULE_PARM_DESC(hardirq_percent, "Target max hardirq percentage, 0 off.");
+
+module_param_named(target_irq_rate, irq_mod_info.target_irq_rate, uint, 0444);
+MODULE_PARM_DESC(target_irq_rate, "Target max interrupt rate, 0 off.");
+
module_param_named(timer_rounds, irq_mod_info.timer_rounds, uint, 0444);
MODULE_PARM_DESC(timer_rounds, "How many extra timer polls once moderation triggered.");
+module_param_named(update_ms, irq_mod_info.update_ms, uint, 0444);
+MODULE_PARM_DESC(update_ms, "Update interval in milliseconds, range 1-100");
+
DEFINE_PER_CPU_ALIGNED(struct irq_mod_state, irq_mod_state);
+static inline uint get_grow_factor(void) { return clamp(irq_mod_info.grow_factor, 8u, 64u); }
+static inline uint get_decay_factor(void) { return clamp(irq_mod_info.decay_factor, 8u, 64u); }
+static inline uint get_scale_cpus(void) { return clamp(irq_mod_info.scale_cpus, 50u, 1000u); }
+
static void smooth_avg(u32 *dst, u32 val, u32 steps)
{
*dst = ((64 - steps) * *dst + steps * val) / 64;
}
+/* Adjust the moderation delay, called at most every update_ns */
+void __irq_moderation_adjust_delay(struct irq_mod_state *ms, u64 delta_time, u64 update_ns)
+{
+ /* Fetch configuration */
+ u32 target_rate = clamp(irq_mod_info.target_irq_rate, 0u, 50000000u);
+ u32 hardirq_percent = clamp(irq_mod_info.hardirq_percent, 0u, 100u);
+ bool below_target = true;
+ /* Compute decay steps based on elapsed time */
+ u32 steps = delta_time > 10 * update_ns ? 10 : 1 + (delta_time / update_ns);
+
+ if (target_rate == 0 && hardirq_percent == 0) { /* use fixed delay */
+ ms->mod_ns = ms->delay_ns;
+ ms->irq_rate = 0;
+ ms->my_irq_rate = 0;
+ ms->cpu_count = 0;
+ return;
+ }
+
+ if (target_rate > 0) { /* control total and individual CPU rate */
+ u64 irq_rate, my_irq_rate, tmp, delta_irqs, num_cpus;
+ bool my_rate_ok, global_rate_ok;
+
+ /* Update global number of interrupts */
+ if (ms->irq_count < 1) /* make sure it is always > 0 */
+ ms->irq_count = 1;
+ tmp = atomic_long_add_return(ms->irq_count, &irq_mod_info.total_intrs);
+ delta_irqs = tmp - ms->last_total_irqs;
+
+ /* Compute global rate, check if we are ok */
+ irq_rate = (delta_irqs * NSEC_PER_SEC) / delta_time;
+ global_rate_ok = irq_rate < target_rate;
+
+ ms->last_total_irqs = tmp;
+
+ /*
+ * num_cpus is the number of CPUs actively handling interrupts
+ * in the last interval. CPUs handling less than the fair share
+ * target_rate / num_cpus do not need to be throttled.
+ */
+ tmp = atomic_long_add_return(1, &irq_mod_info.total_cpus);
+ num_cpus = tmp - ms->last_total_cpus;
+ /* scale proportionally to time, reduce errors if we are idle for too long */
+ num_cpus = 1 + (num_cpus * update_ns + delta_time / 2) / delta_time;
+
+ /* Short intervals may underestimate sources. Apply a scale factor */
+ num_cpus = num_cpus * get_scale_cpus() / 100;
+
+ /* Compute our rate, check if we are ok */
+ my_irq_rate = (ms->irq_count * NSEC_PER_SEC) / delta_time;
+ my_rate_ok = my_irq_rate * num_cpus < target_rate;
+
+ ms->irq_count = 1; /* reset for next cycle */
+ ms->last_total_cpus = tmp;
+
+ /* Use instantaneous rates to react. */
+ below_target = global_rate_ok || my_rate_ok;
+
+ /* Statistics (rates are smoothed averages) */
+ smooth_avg(&ms->irq_rate, irq_rate, steps);
+ smooth_avg(&ms->my_irq_rate, my_irq_rate, steps);
+ smooth_avg(&ms->cpu_count, num_cpus * 256, steps); /* scaled */
+ ms->my_irq_high += !my_rate_ok;
+ ms->irq_high += !global_rate_ok;
+ }
+
+ if (hardirq_percent > 0) { /* control time spent in hardirq */
+ u64 cur = kcpustat_this_cpu->cpustat[CPUTIME_IRQ];
+ u64 irqtime = cur - ms->last_irqtime;
+ bool hardirq_ok = irqtime * 100 < delta_time * hardirq_percent;
+
+ below_target &= hardirq_ok;
+ ms->last_irqtime = cur;
+ ms->hardirq_high += !hardirq_ok; /* statistics */
+ }
+
+ /* Controller: move mod_ns up/down if we are above/below target */
+ if (below_target) {
+ ms->mod_ns -= ms->mod_ns * steps / (steps + get_decay_factor());
+ if (ms->mod_ns < 100)
+ ms->mod_ns = 0;
+ } else if (ms->mod_ns < 500) {
+ ms->mod_ns = 500;
+ } else {
+ ms->mod_ns += ms->mod_ns * steps / (steps + get_grow_factor());
+ if (ms->mod_ns > ms->delay_ns)
+ ms->mod_ns = ms->delay_ns; /* cap to delay_ns */
+ }
+}
+
/* moderation timer handler, called in hardintr context */
static enum hrtimer_restart moderation_timer_cb(struct hrtimer *timer)
{
@@ -172,6 +277,13 @@ static void set_moderation_mode(struct irq_desc *desc, bool mode)
}
}
+/* irq_to_desc is not exported. Wrap it in this function for a specific use. */
+void irq_moderation_set_mode(int irq, bool mode)
+{
+ set_moderation_mode(irq_to_desc(irq), mode);
+}
+EXPORT_SYMBOL(irq_moderation_set_mode);
+
#pragma clang diagnostic error "-Wformat"
/* Print statistics */
static int moderation_show(struct seq_file *p, void *v)
@@ -215,12 +327,32 @@ static int moderation_show(struct seq_file *p, void *v)
seq_printf(p, "\n"
"enabled %s\n"
"delay_us %u\n"
+ "target_irq_rate %u\n"
+ "hardirq_percent %u\n"
"timer_rounds %u\n"
- "count_timer_calls %s\n",
+ "update_ms %u\n"
+ "scale_cpus %u\n"
+ "count_timer_calls %s\n"
+ "decay_factor %u\n"
+ "grow_factor %u\n",
str_yes_no(delay_us),
- delay_us,
- irq_mod_info.timer_rounds,
- str_yes_no(irq_mod_info.count_timer_calls));
+ delay_us, irq_mod_info.target_irq_rate, irq_mod_info.hardirq_percent,
+ irq_mod_info.timer_rounds, irq_mod_info.update_ms,
+ irq_mod_info.scale_cpus,
+ str_yes_no(irq_mod_info.count_timer_calls),
+ get_decay_factor(), get_grow_factor());
+
+ seq_printf(p,
+ "irq_rate %lu\n"
+ "irq_high %lu\n"
+ "my_irq_high %lu\n"
+ "hardirq_percent_high %lu\n"
+ "total_interrupts %lu\n"
+ "total_cpus %lu\n",
+ active_cpus ? irq_rate / active_cpus : 0,
+ irq_high, my_irq_high, hardirq_high,
+ READ_ONCE(*((ulong *)&irq_mod_info.total_intrs)),
+ READ_ONCE(*((ulong *)&irq_mod_info.total_cpus)));
return 0;
}
@@ -238,7 +370,10 @@ struct var_names {
int max;
} var_names[] = {
{ "delay_us", &irq_mod_info.delay_us, 0, 500 },
+ { "target_irq_rate", &irq_mod_info.target_irq_rate, 0, 50000000 },
+ { "hardirq_percent", &irq_mod_info.hardirq_percent, 0, 100 },
{ "timer_rounds", &irq_mod_info.timer_rounds, 0, 50 },
+ { "update_ms", &irq_mod_info.update_ms, 1, 100 },
{ "count_timer_calls", &irq_mod_info.count_timer_calls, 0, 1 },
{}
};
--
2.51.2.1041.gc1ab5b90ca-goog
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH 5/6] x86/irq: soft_moderation: add support for posted_msi (intel)
2025-11-12 19:24 [PATCH 0/6] platform wide software interrupt moderation Luigi Rizzo
` (3 preceding siblings ...)
2025-11-12 19:24 ` [PATCH 4/6] genirq: soft_moderation: implement adaptive moderation Luigi Rizzo
@ 2025-11-12 19:24 ` Luigi Rizzo
2025-11-12 19:24 ` [PATCH 6/6] genirq: soft_moderation: implement per-driver defaults (nvme and vfio) Luigi Rizzo
5 siblings, 0 replies; 24+ messages in thread
From: Luigi Rizzo @ 2025-11-12 19:24 UTC (permalink / raw)
To: Thomas Gleixner, Marc Zyngier, Luigi Rizzo, Paolo Abeni,
Andrew Morton, Sean Christopherson, Jacob Pan
Cc: linux-kernel, linux-arch, Bjorn Helgaas, Willem de Bruijn,
Luigi Rizzo
On recent Intel CPUs, kernels compiled with CONFIG_X86_POSTED_MSI=y,
and the boot option "intremap=posted_msi", all MSI interrupts
that hit a CPU issue a single POSTED_MSI interrupt processed by
sysvec_posted_msi_notification() instead of having separate interrupts.
This change adds soft moderation hooks to the above handler.
Soft moderation on posted_msi does not require per-source enable,
irq_moderation.delay_us > 0 suffices.
To test it, run a kernel with the above options and enable moderation by
setting delay_us > 0. The column "from_msi" in /proc/irq/soft_moderation
will show a non-zero value.
Change-Id: Idcd6005f05048c4b3f9d002c8587039b46bc9d73
---
arch/x86/kernel/irq.c | 12 +++++++
include/linux/irq_moderation.h | 62 ++++++++++++++++++++++++++++++++++
kernel/irq/irq_moderation.c | 39 ++++++++++++++++-----
3 files changed, 104 insertions(+), 9 deletions(-)
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 10721a1252269..241d57fadc30c 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -4,6 +4,7 @@
*/
#include <linux/cpu.h>
#include <linux/interrupt.h>
+#include <linux/irq_moderation.h>
#include <linux/kernel_stat.h>
#include <linux/of.h>
#include <linux/seq_file.h>
@@ -448,6 +449,13 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_posted_msi_notification)
inc_irq_stat(posted_msi_notification_count);
irq_enter();
+ if (posted_msi_moderation_enabled()) {
+ if (posted_msi_should_rearm(handle_pending_pir(pid->pir, regs)))
+ goto rearm;
+ else
+ goto common_end;
+ }
+
/*
* Max coalescing count includes the extra round of handle_pending_pir
* after clearing the outstanding notification bit. Hence, at most
@@ -458,6 +466,7 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_posted_msi_notification)
break;
}
+rearm:
/*
* Clear outstanding notification bit to allow new IRQ notifications,
* do this last to maximize the window of interrupt coalescing.
@@ -471,6 +480,9 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_posted_msi_notification)
*/
handle_pending_pir(pid->pir, regs);
+common_end:
+ posted_msi_moderation_epilogue();
+
apic_eoi();
irq_exit();
set_irq_regs(old_regs);
diff --git a/include/linux/irq_moderation.h b/include/linux/irq_moderation.h
index 45df60230e42e..aabcfaba1aefc 100644
--- a/include/linux/irq_moderation.h
+++ b/include/linux/irq_moderation.h
@@ -155,6 +155,14 @@ static inline void irq_moderation_hook(struct irq_desc *desc)
if (!irq_moderation_enabled())
return;
+#ifdef CONFIG_X86_POSTED_MSI
+ if (ms->in_posted_msi) { /* these calls are not moderated */
+ ms->from_posted_msi++;
+ ms->irq_count += irq_mod_info.count_msi_calls;
+ return;
+ }
+#endif
+
if (!READ_ONCE(desc->moderation_mode))
return;
@@ -193,11 +201,65 @@ static inline void irq_moderation_epilogue(const struct irq_desc *desc)
irq_moderation_start_timer(ms);
}
+#ifdef CONFIG_X86_POSTED_MSI
+/*
+ * Helpers for to sysvec_posted_msi_notification(), use as follows
+ *
+ * if (posted_msi_moderation_enabled()) {
+ * if (posted_msi_should_rearm(handle_pending_pir(pid->pir, regs)))
+ * goto rearm;
+ * else
+ * goto common_end;
+ * }
+ * ...
+ * common_end:
+ * posted_msi_moderation_epilogue();
+ */
+static inline bool posted_msi_moderation_enabled(void)
+{
+ struct irq_mod_state *ms = this_cpu_ptr(&irq_mod_state);
+
+ if (!irq_moderation_enabled())
+ return false;
+ irq_moderation_adjust_delay(ms);
+ ms->in_posted_msi = true; /* tell handler to not throttle these calls */
+ return true;
+}
+
+/* Decide whether to rearm or not posted_msi */
+static inline bool posted_msi_should_rearm(bool work_done)
+{
+ struct irq_mod_state *ms = this_cpu_ptr(&irq_mod_state);
+
+ if (ms->rounds_left > 0) /* timer pending, no rearm */
+ return false;
+ if (!work_done) /* no work done, can rearm */
+ return true;
+ if (!irq_moderation_needed(ms)) /* moderation not needed, rearm */
+ return true;
+ ms->kick_posted_msi = true; /* do kick in timer callback */
+ irq_moderation_start_timer(ms);
+ return false; /* timer now active, no rearm */
+}
+
+/* Cleanup state set in posted_msi_moderation_enabled() */
+static inline void posted_msi_moderation_epilogue(void)
+{
+ this_cpu_ptr(&irq_mod_state)->in_posted_msi = false;
+}
+#endif
+
#else /* empty stubs to avoid conditional compilation */
static inline void irq_moderation_hook(struct irq_desc *desc) {}
static inline void irq_moderation_epilogue(const struct irq_desc *desc) {}
+#ifdef CONFIG_X86_POSTED_MSI
+static inline bool posted_msi_moderation_enabled(void) { return false; }
+static inline bool posted_msi_should_rearm(bool work_done) { return false; }
+static inline void posted_msi_moderation_epilogue(void) {}
+#endif
+
#endif /* CONFIG_IRQ_SOFT_MODERATION */
#endif /* _LINUX_IRQ_MODERATION_H */
diff --git a/kernel/irq/irq_moderation.c b/kernel/irq/irq_moderation.c
index 0229697a6a95a..672e161ecf29e 100644
--- a/kernel/irq/irq_moderation.c
+++ b/kernel/irq/irq_moderation.c
@@ -7,6 +7,15 @@
#include <linux/irq_moderation.h>
#include <linux/kernel_stat.h> /* interrupt.h, kcpustat_this_cpu */
#include "internals.h"
+#ifdef CONFIG_X86
+#include <asm/apic.h>
+#endif
+
+#ifdef CONFIG_IRQ_REMAP
+extern bool enable_posted_msi;
+#else
+static bool enable_posted_msi;
+#endif
/*
* Platform-wide software interrupt moderation.
@@ -29,6 +38,10 @@
* moderation on that CPU/irq. If so, calls disable_irq_nosync() and starts
* an hrtimer with appropriate delay.
*
+ * - Intel only: using "intremap=posted_msi", all the above is done in
+ * sysvec_posted_msi_notification(). In this case all host device interrupts
+ * are subject to moderation.
+ *
* - the timer callback calls enable_irq() for all disabled interrupts on that
* CPU. That in turn will generate interrupts if there are pending events.
*
@@ -82,6 +95,7 @@ struct irq_mod_info irq_mod_info ____cacheline_aligned = {
.delay_us = 100,
.update_ms = 1,
.scale_cpus = 100,
+ .count_msi_calls = true,
.count_timer_calls = true,
.decay_factor = 16,
.grow_factor = 8,
@@ -216,6 +230,17 @@ static enum hrtimer_restart moderation_timer_cb(struct hrtimer *timer)
ms->rounds_left--;
+#ifdef CONFIG_X86_POSTED_MSI
+ if (ms->kick_posted_msi) {
+ if (ms->rounds_left <= 0)
+ ms->kick_posted_msi = false;
+ /* Next call will be from timer, count it conditionally */
+ ms->irq_count -= !irq_mod_info.count_timer_calls;
+ ms->timer_calls++;
+ apic->send_IPI_self(POSTED_MSI_NOTIFICATION_VECTOR);
+ }
+#endif
+
if (ms->rounds_left > 0) {
/* Timer still alive. Just call the handlers */
list_for_each_entry_safe(desc, next, &ms->descs, ms_node) {
@@ -277,13 +302,6 @@ static void set_moderation_mode(struct irq_desc *desc, bool mode)
}
}
-/* irq_to_desc is not exported. Wrap it in this function for a specific use. */
-void irq_moderation_set_mode(int irq, bool mode)
-{
- set_moderation_mode(irq_to_desc(irq), mode);
-}
-EXPORT_SYMBOL(irq_moderation_set_mode);
-
#pragma clang diagnostic error "-Wformat"
/* Print statistics */
static int moderation_show(struct seq_file *p, void *v)
@@ -325,7 +343,7 @@ static int moderation_show(struct seq_file *p, void *v)
}
seq_printf(p, "\n"
- "enabled %s\n"
+ "enabled %s%s\n"
"delay_us %u\n"
"target_irq_rate %u\n"
"hardirq_percent %u\n"
@@ -333,13 +351,15 @@ static int moderation_show(struct seq_file *p, void *v)
"update_ms %u\n"
"scale_cpus %u\n"
"count_timer_calls %s\n"
+ "count_msi_calls %s\n"
"decay_factor %u\n"
"grow_factor %u\n",
- str_yes_no(delay_us),
+ str_yes_no(delay_us), enable_posted_msi ? " (also on posted_msi)" : "",
delay_us, irq_mod_info.target_irq_rate, irq_mod_info.hardirq_percent,
irq_mod_info.timer_rounds, irq_mod_info.update_ms,
irq_mod_info.scale_cpus,
str_yes_no(irq_mod_info.count_timer_calls),
+ str_yes_no(irq_mod_info.count_msi_calls),
get_decay_factor(), get_grow_factor());
seq_printf(p,
@@ -375,6 +395,7 @@ struct var_names {
{ "timer_rounds", &irq_mod_info.timer_rounds, 0, 50 },
{ "update_ms", &irq_mod_info.update_ms, 1, 100 },
{ "count_timer_calls", &irq_mod_info.count_timer_calls, 0, 1 },
+ { "count_msi_calls", &irq_mod_info.count_msi_calls, 0, 1 },
{}
};
--
2.51.2.1041.gc1ab5b90ca-goog
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH 6/6] genirq: soft_moderation: implement per-driver defaults (nvme and vfio)
2025-11-12 19:24 [PATCH 0/6] platform wide software interrupt moderation Luigi Rizzo
` (4 preceding siblings ...)
2025-11-12 19:24 ` [PATCH 5/6] x86/irq: soft_moderation: add support for posted_msi (intel) Luigi Rizzo
@ 2025-11-12 19:24 ` Luigi Rizzo
2025-11-13 10:18 ` Thomas Gleixner
5 siblings, 1 reply; 24+ messages in thread
From: Luigi Rizzo @ 2025-11-12 19:24 UTC (permalink / raw)
To: Thomas Gleixner, Marc Zyngier, Luigi Rizzo, Paolo Abeni,
Andrew Morton, Sean Christopherson, Jacob Pan
Cc: linux-kernel, linux-arch, Bjorn Helgaas, Willem de Bruijn,
Luigi Rizzo
Introduce helpers to implement per-driver module parameters to enable
moderation at boot/probe time.
As an example, use the helpers in nvme and vfio drivers.
To test, boot a kernel with
${driver}.soft_moderation=1 # enables moderation.
and verify with "cat /proc/irq/soft_moderation" that
the counters increase.
Change-Id: Iaad4110977deb96df845501895e0043bd93fc350
---
drivers/nvme/host/pci.c | 3 +++
drivers/vfio/pci/vfio_pci_intrs.c | 3 +++
include/linux/interrupt.h | 13 +++++++++++++
kernel/irq/irq_moderation.c | 11 +++++++++++
4 files changed, 30 insertions(+)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 72fb675a696f4..b9d7bce30061f 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -72,6 +72,8 @@
static_assert(MAX_PRP_RANGE / NVME_CTRL_PAGE_SIZE <=
(1 /* prp1 */ + NVME_MAX_NR_DESCRIPTORS * PRPS_PER_PAGE));
+DEFINE_IRQ_MODERATION_MODE_PARAMETER;
+
static int use_threaded_interrupts;
module_param(use_threaded_interrupts, int, 0444);
@@ -1989,6 +1991,7 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid, bool polled)
result = queue_request_irq(nvmeq);
if (result < 0)
goto release_sq;
+ IRQ_MODERATION_SET_DEFAULT_MODE(pci_irq_vector(to_pci_dev(dev->dev), vector));
}
set_bit(NVMEQ_ENABLED, &nvmeq->flags);
diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 30d3e921cb0de..e54d88cfe601d 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -22,6 +22,8 @@
#include "vfio_pci_priv.h"
+DEFINE_IRQ_MODERATION_MODE_PARAMETER;
+
struct vfio_pci_irq_ctx {
struct vfio_pci_core_device *vdev;
struct eventfd_ctx *trigger;
@@ -317,6 +319,7 @@ static int vfio_intx_enable(struct vfio_pci_core_device *vdev,
vfio_irq_ctx_free(vdev, ctx, 0);
return ret;
}
+ IRQ_MODERATION_SET_DEFAULT_MODE(pdev->irq);
return 0;
}
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 007201c8db6dd..c7d68d8ec49d7 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -879,12 +879,25 @@ void irq_moderation_init_fields(struct irq_desc *desc);
/* add/remove /proc/irq/NN/soft_moderation */
void irq_moderation_procfs_entry(struct irq_desc *desc, umode_t umode);
+/* helpers for per-driver moderation mode settings */
+#define DEFINE_IRQ_MODERATION_MODE_PARAMETER \
+ static bool soft_moderation; \
+ module_param(soft_moderation, bool, 0644); \
+ MODULE_PARM_DESC(soft_moderation, "0: off, 1: disable_irq")
+
+void irq_moderation_set_mode(int irq, bool mode);
+#define IRQ_MODERATION_SET_DEFAULT_MODE(_irq) \
+ irq_moderation_set_mode(_irq, READ_ONCE(soft_moderation))
+
#else /* empty stubs to avoid conditional compilation */
static inline void irq_moderation_percpu_init(void) {}
static inline void irq_moderation_init_fields(struct irq_desc *desc) {}
static inline void irq_moderation_procfs_entry(struct irq_desc *desc, umode_t umode) {};
+#define DEFINE_IRQ_MODERATION_MODE_PARAMETER
+#define IRQ_MODERATION_SET_DEFAULT_MODE(_irq)
+
#endif
/*
diff --git a/kernel/irq/irq_moderation.c b/kernel/irq/irq_moderation.c
index 672e161ecf29e..3b3962dae33d1 100644
--- a/kernel/irq/irq_moderation.c
+++ b/kernel/irq/irq_moderation.c
@@ -83,6 +83,10 @@ static bool enable_posted_msi;
*
* echo "on" > /proc/irq/NN/soft_moderation # use "off" to disable
*
+ * For selected drivers, the default can also be supplied via module parameters
+ *
+ * ${DRIVER}.soft_moderation=1
+ *
* === MONITORING ===
*
* cat /proc/irq/soft_moderation shows per-CPU and global statistics.
@@ -302,6 +306,13 @@ static void set_moderation_mode(struct irq_desc *desc, bool mode)
}
}
+/* irq_to_desc is not exported. Wrap it in this function for a specific use. */
+void irq_moderation_set_mode(int irq, bool mode)
+{
+ set_moderation_mode(irq_to_desc(irq), mode);
+}
+EXPORT_SYMBOL(irq_moderation_set_mode);
+
#pragma clang diagnostic error "-Wformat"
/* Print statistics */
static int moderation_show(struct seq_file *p, void *v)
--
2.51.2.1041.gc1ab5b90ca-goog
^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH 1/6] genirq: platform wide interrupt moderation: Documentation, Kconfig, irq_desc
2025-11-12 19:24 ` [PATCH 1/6] genirq: platform wide interrupt moderation: Documentation, Kconfig, irq_desc Luigi Rizzo
@ 2025-11-13 8:17 ` Thomas Gleixner
2025-11-13 9:44 ` Thomas Gleixner
2025-11-13 13:25 ` Marc Zyngier
2 siblings, 0 replies; 24+ messages in thread
From: Thomas Gleixner @ 2025-11-13 8:17 UTC (permalink / raw)
To: Luigi Rizzo, Marc Zyngier, Luigi Rizzo, Paolo Abeni,
Andrew Morton, Sean Christopherson, Jacob Pan
Cc: linux-kernel, linux-arch, Bjorn Helgaas, Willem de Bruijn,
Luigi Rizzo
On Wed, Nov 12 2025 at 19:24, Luigi Rizzo wrote:
> Platform wide software interrupt moderation ("soft_moderation" in this
> patch series) specifically addresses a limitation of platforms from many
"in this patch series" becomes meaningless when this is actually merged
into git.
> vendors whose I/O performance drops significantly when the total rate of
> MSI-X interrupts is too high (e.g 1-3M intr/s depending on the platform).
>
> Conventional interrupt moderation operates separately on each source,
> hence the configuration should target the worst case. On large servers
> with hundreds of interrupt sources, keeping the total rate bounded would
> require delays of 100-200us; and adaptive moderation would have to reach
> those delays with as little as 10K intr/s per source. These values are
> unacceptable for RPC or transactional workloads.
>
> To address this problem, this code measures efficiently the total and
> per-CPU interrupt rates, so that individual moderation delays can be
> adjusted based on actual global and local load. This way, the system
> controls both global interrupt rates and individual CPU load, and
> tunes delays so they are normally 0 or very small except during actual
> local/global overload.
>
> Configuration is easy and robust. System administrators specify the
> maximum targets (moderation delay; interrupt rate; and fraction of time
> spent in hardirq), and per-CPU control loops adjust actual delays to try
> and keep metrics within the bounds.
>
> There is no need for exact targets, because the system is adaptive; the
> defaults delay_us=100, target_irq_rate=1000000, hardirq_frac=70 intr/s,
> are good almost everywhere.
>
> The system does not rely on any special hardware feature except from
> MSI-X Pending Bit Array (PBA), a mandatory component of MSI-X
>
> Boot defaults are set via module parameters (/sys/module/irq_moderation
> and /sys/module/${DRIVER}) or at runtime via /proc/irq/moderation, which
> is also used to export statistics. Moderation on individual irq can be
> turned on/off via /proc/irq/NN/moderation .
>
> The system does not rely on any special hardware feature except from
> MSI-X Pending Bit Array (PBA), a mandatory component of MSI-X
You said that 8 lines above already, but I can't see where the PBA
dependency actually is.
> This initial patch adds Documentation, Kconfig option, two fields in
git grep 'This patch' Documentation/process/
> struct irq_desc, and prototypes in include/linux/interrupt.h
Please do not describe trivial implementation details in the change log.
> No functional impact.
>
> Enabling the option will just extend struct irq_desc with two fields.
>
> CONFIG_SOFT_IRQ_MODERATION=y
> ---
Clearly lacks a 'Signed-off-by' as all the other patches do.
The patch submission rules are well documented.
https://www.kernel.org/doc/html/latest/process/submitting-patches.html
and please also read and follow:
https://www.kernel.org/doc/html/latest/process/maintainer-tip.html
> +#ifdef CONFIG_IRQ_SOFT_MODERATION
> +
> +void irq_moderation_percpu_init(void);
> +void irq_moderation_init_fields(struct irq_desc *desc);
> +/* add/remove /proc/irq/NN/soft_moderation */
> +void irq_moderation_procfs_entry(struct irq_desc *desc, umode_t umode);
> +
> +#else /* empty stubs to avoid conditional compilation */
The canonical format is:
#else /* CONFIG_IRQ_SOFT_MODERATION */
...
#endif /* !CONFIG_IRQ_SOFT_MODERATION */
But that aside, what's the point of adding these prototypes and stubs
without an implementation?
> /*
> * We want to know which function is an entrypoint of a hardirq or a softirq.
> */
> diff --git a/include/linux/irqdesc.h b/include/linux/irqdesc.h
> index fd091c35d5721..4eb05bc456abe 100644
> --- a/include/linux/irqdesc.h
> +++ b/include/linux/irqdesc.h
> @@ -112,6 +112,11 @@ struct irq_desc {
> #endif
> struct mutex request_mutex;
> int parent_irq;
> +#ifdef CONFIG_IRQ_SOFT_MODERATION
> + /* mode: 0: off, 1: disable_irq_nosync() */
> + u8 moderation_mode; /* written in procfs, read on irq */
> + struct list_head ms_node; /* per-CPU list of moderated irq_desc */
Don't use tail comments and document the members in the exisiting struct
documentation. It's not that hard to keep something consistent.
Again. What's the point of adding this without usage?
You can also avoid the #ifdeffery by doing:
#ifdef CONFIG_IRQ_SOFT_MODERATION
struct irq_desc_mod {
....
....
};
#else
struct irq_desc_mod { };
#endif
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/6] genirq: soft_moderation: add base files, procfs hooks
2025-11-12 19:24 ` [PATCH 2/6] genirq: soft_moderation: add base files, procfs hooks Luigi Rizzo
@ 2025-11-13 9:29 ` Thomas Gleixner
2025-11-13 10:24 ` Thomas Gleixner
2025-11-13 22:32 ` Luigi Rizzo
2025-11-13 9:40 ` Thomas Gleixner
1 sibling, 2 replies; 24+ messages in thread
From: Thomas Gleixner @ 2025-11-13 9:29 UTC (permalink / raw)
To: Luigi Rizzo, Marc Zyngier, Luigi Rizzo, Paolo Abeni,
Andrew Morton, Sean Christopherson, Jacob Pan
Cc: linux-kernel, linux-arch, Bjorn Helgaas, Willem de Bruijn,
Luigi Rizzo
On Wed, Nov 12 2025 at 19:24, Luigi Rizzo wrote:
> Add the main core files that implement soft_moderation, limited to static
> moderation, plus related small changes to include/linux/interrupt.h,
> kernel/irq/Makefile, and kernel/irq/proc.c
>
> - include/linux/irq_moderation.h has the two main struct, prototypes
> and inline hooks
>
> - kernel/irq/irq_moderation.c has the procfs handlers
I can see that from the patch. This has zero useful information. See
Documentation.
> The code is not yet hooked to the interrupt handler, so
> the feature is disabled but we can see the module parameters
Who is 'we'?
> /sys/module/irq_moderation/parameters and read/write the procfs entries
> /proc/irq/soft_moderation and /proc/irq/NN/soft_moderation.
>
> Examples:
> cat /proc/irq/soft_moderation
> echo "delay_us=345" > /proc/irq/soft_moderation
> echo 1 | tee /proc/irq/*/nvme*/../soft_moderation
That's impressive to be able to write into proc/...
> +/*
> + * Platform wide software interrupt moderation, see
> + * Documentation/core-api/irq/irq-moderation.rst
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/hrtimer.h>
> +#include <linux/irq.h>
> +#include <linux/irqdesc.h>
includes are ordered alphabetically.
> +#ifdef CONFIG_IRQ_SOFT_MODERATION
> +
> +/* Global configuration parameters and state */
> +struct irq_mod_info {
> + /* These fields are written to by all CPUs */
> + ____cacheline_aligned
> + atomic_long_t total_intrs; /* running count updated every update_ms */
> + atomic_long_t total_cpus; /* as above, active CPUs in this interval */
https://www.kernel.org/doc/html/latest/process/maintainer-tip.html#struct-declarations-and-initializers
> +
> + /* These are mostly read (frequently), so use a different cacheline */
> + ____cacheline_aligned
> + u64 procfs_write_ns; /* last write to /proc/irq/soft_moderation */
> + uint delay_us; /* fixed delay, or maximum for adaptive */
> + uint target_irq_rate; /* target interrupt rate */
> + uint hardirq_percent; /* target maximum hardirq percentage */
> + uint timer_rounds; /* how many timer polls once moderation fires */
> + uint update_ms; /* how often to update delay/rate/fraction */
> + uint scale_cpus; /* (percent) scale factor to estimate active CPUs */
> + uint count_timer_calls; /* count timer calls for irq limits */
> + uint count_msi_calls; /* count calls from posted_msi for irq limits */
> + uint decay_factor; /* keep it at 16 */
> + uint grow_factor; /* keep it at 8 */
> + int pad[] ____cacheline_aligned;
> +};
> +
> +extern struct irq_mod_info irq_mod_info;
> +
> +/* Per-CPU moderation state */
> +struct irq_mod_state {
> + struct hrtimer timer; /* moderation timer */
> + struct list_head descs; /* moderated irq_desc on this CPU */
> +
> + /* Counters on last time we updated moderation delay */
> + u64 last_ns; /* time of last update */
> + u64 last_irqtime; /* from cpustat[CPUTIME_IRQ] */
> + u64 last_total_irqs;
> + u64 last_total_cpus;
> +
> + bool in_posted_msi; /* don't suppress handle_irq, set in posted_msi handler */
> + bool kick_posted_msi; /* kick posted_msi from the timer callback */
> +
> + u32 cycles; /* calls since last ktime_get_ns() */
> + s32 irq_count; /* irqs in the last cycle, signed as we also decrement */
> + u32 delay_ns; /* fetched from irq_mod_info */
> + u32 mod_ns; /* recomputed every update_ms */
> + u32 sleep_ns; /* accumulated time for actual delay */
> + s32 rounds_left; /* how many rounds left for moderation */
> +
> + /* Statistics */
> + u32 irq_rate; /* smoothed global irq rate */;
> + u32 my_irq_rate; /* smoothed irq rate for this CPU */;
> + u32 cpu_count; /* smoothed CPU count (scaled) */;
> + u32 src_count; /* smoothed irq sources count (scaled) */;
> + u32 irq_high; /* how many times above each threshold */
> + u32 my_irq_high;
> + u32 hardirq_high;
> + u32 timer_set; /* counters for various events */
> + u32 timer_fire;
> + u32 disable_irq;
> + u32 enable_irq;
> + u32 timer_calls;
> + u32 from_posted_msi;
> + u32 stray_irq;
> + int pad[] ____cacheline_aligned;
> +};
Most of this is not used at all at this point. So why introduce massive
data structures with fields w/o usage. That's unreviewable.
> +static inline bool irq_moderation_enabled(void)
> +{
> + return READ_ONCE(irq_mod_info.delay_us);
This really wants to be a static key.
> +}
> +
> +/* Called on each interrupt for adaptive moderation delay adjustment */
> +static inline void irq_moderation_adjust_delay(struct irq_mod_state *ms)
> +{
> + u64 now, delta_time, update_ns;
> +
> + ms->irq_count++;
> + if (ms->cycles++ < 16) /* ktime_get_ns() is expensive, don't do too often */
https://www.kernel.org/doc/html/latest/process/maintainer-tip.html#comment-style
> + return;
> + ms->cycles = 0;
> + now = ktime_get_ns();
> + delta_time = now - ms->last_ns;
> + update_ns = irq_moderation_get_update_ms() * NSEC_PER_MSEC;
> +
> + /* Run approximately every update_ns, a little bit early is ok */
> + if (delta_time < update_ns - 5000)
> + return;
> +
> + /* Fetch important state */
That's a truly helpful comment. It tells me exactly nothing.
> + ms->delay_ns = clamp(irq_mod_info.delay_us, 1u, 500u) * NSEC_PER_USEC;
> +
> + ms->last_ns = now;
> + ms->mod_ns = ms->delay_ns;
> +}
> +
> +/* Return true if timer is active or delay is large enough to require moderation */
> +static inline bool irq_moderation_needed(struct irq_mod_state *ms)
> +{
> + if (!hrtimer_is_queued(&ms->timer)) {
> + ms->sleep_ns += ms->mod_ns; /* accumulate sleep time */
> + if (ms->sleep_ns < 10000) /* no moderation if too small */
Don't use magic constants. Use proper defines with descriptive names.
> + return false;
> + }
> + return true;
> +}
> +
> +void disable_irq_nosync(unsigned int irq);
> +
> +/*
> + * Use in handle_irq_event() before calling the handler. Decide whether this
> + * desc should be moderated, and in case disable the irq and add the desc to
> + * the list for this CPU.
> + */
> +static inline void irq_moderation_hook(struct irq_desc *desc)
> +{
> + struct irq_mod_state *ms = this_cpu_ptr(&irq_mod_state);
> +
> + if (!irq_moderation_enabled())
> + return;
> +
> + if (!READ_ONCE(desc->moderation_mode))
> + return;
> +
> + irq_moderation_adjust_delay(ms);
> +
> + if (!list_empty(&desc->ms_node)) {
> + /*
> + * Very unlikely, stray interrupt while the desc is moderated.
> + * Unfortunately we cannot ignore it, just count it.
What means 'cannot ignore it' ?
> + */
> + ms->stray_irq++;
> + return;
> + }
> +
> + if (!irq_moderation_needed(ms))
> + return;
> +
> + list_add(&desc->ms_node, &ms->descs); /* Add to list of moderated desc */
That assumes that the interrupt is targeted to a specific CPU. How does
that interact with affinity changes or platforms which use multi-CPU affinities?
> + /*
> + * disable the irq. This will also cause irq_can_handle() return false
Sentences start with an uppercase letter.
> + * (through irq_can_handle_actions()), and that will prevent a handler
> + * instance to be run again while the descriptor is being moderated.
> + *
> + * irq_moderation_epilogue() will then start the timer if needed.
> + */
> + ms->disable_irq++;
> + disable_irq_nosync(desc->irq_data.irq); /* desc must be unlocked */
How does this work with suspend?
> +}
> +
> +/* After the handler, if desc is moderated, make sure the timer is active. */
> +static inline void irq_moderation_epilogue(const struct irq_desc *desc)
> +{
> + struct irq_mod_state *ms = this_cpu_ptr(&irq_mod_state);
Lacks a check whether moderation is enabled or not.
> + if (!list_empty(&desc->ms_node) && !hrtimer_is_queued(&ms->timer))
> + irq_moderation_start_timer(ms);
> +}
> +
> +#else /* empty stubs to avoid conditional compilation */
> +
> +static inline void irq_moderation_hook(struct irq_desc *desc) {}
> +static inline void irq_moderation_epilogue(const struct irq_desc *desc) {}
> +
> +#endif /* CONFIG_IRQ_SOFT_MODERATION */
> +
> +#endif /* _LINUX_IRQ_MODERATION_H */
> diff --git a/kernel/irq/Makefile b/kernel/irq/Makefile
> index 6ab3a40556670..c06da43d644f2 100644
> --- a/kernel/irq/Makefile
> +++ b/kernel/irq/irq_moderation.c
> @@ -0,0 +1,315 @@
> +// SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause
> +
> +#include <linux/module.h>
> +#include <linux/seq_file.h>
> +#include <linux/proc_fs.h>
> +
> +#include <linux/irq_moderation.h>
> +#include <linux/kernel_stat.h> /* interrupt.h, kcpustat_this_cpu */
These comments are pointless.
> +static_assert(offsetof(struct irq_mod_info, procfs_write_ns) == 64);
What guarantees that a platform has a cacheline size of 64?
> +struct irq_mod_info irq_mod_info ____cacheline_aligned = {
> + .delay_us = 100,
> + .update_ms = 1,
> + .count_timer_calls = true,
> +};
> +
> +module_param_named(delay_us, irq_mod_info.delay_us, uint, 0444);
> +MODULE_PARM_DESC(delay_us, "Max moderation delay us, 0 = moderation off, range 10..500.");
> +
> +module_param_named(timer_rounds, irq_mod_info.timer_rounds, uint, 0444);
> +MODULE_PARM_DESC(timer_rounds, "How many extra timer polls once moderation triggered.");
> +
> +DEFINE_PER_CPU_ALIGNED(struct irq_mod_state, irq_mod_state);
> +
> +static void smooth_avg(u32 *dst, u32 val, u32 steps)
> +{
> + *dst = ((64 - steps) * *dst + steps * val) / 64;
> +}
> +
> +/* moderation timer handler, called in hardintr context */
> +static enum hrtimer_restart moderation_timer_cb(struct hrtimer *timer)
> +{
> + struct irq_mod_state *ms = this_cpu_ptr(&irq_mod_state);
> + struct irq_desc *desc, *next;
> + uint srcs = 0;
> +
> + ms->timer_fire++;
> + WARN_ONCE(ms->timer_set != ms->timer_fire,
> + "CPU %d timer set %d fire %d (lost events?)\n",
> + smp_processor_id(), ms->timer_set, ms->timer_fire);
> +
> + ms->rounds_left--;
> +
> + if (ms->rounds_left > 0) {
> + /* Timer still alive. Just call the handlers */
> + list_for_each_entry_safe(desc, next, &ms->descs, ms_node) {
> + ms->irq_count += irq_mod_info.count_timer_calls;
> + ms->timer_calls++;
> + handle_irq_event_percpu(desc);
That lacks setting the INPROGRESS flag.
> + }
> + ms->timer_set++;
> + hrtimer_forward_now(&ms->timer, ms->sleep_ns);
> + return HRTIMER_RESTART;
> + }
> +
> + /* Last round, remove from list and enable_irq() */
> + list_for_each_entry_safe(desc, next, &ms->descs, ms_node) {
> + list_del(&desc->ms_node);
> + INIT_LIST_HEAD(&desc->ms_node);
> + srcs++;
> + ms->enable_irq++;
> + enable_irq(desc->irq_data.irq); /* ok if the sync_lock/unlock are NULL */
> + }
> + smooth_avg(&ms->src_count, srcs * 256, 1);
> +
> + ms->sleep_ns = 0; /* prepare to accumulate next moderation delay */
> +
> + WARN_ONCE(ms->disable_irq != ms->enable_irq,
> + "CPU %d irq disable %d enable %d (%s)\n",
> + smp_processor_id(), ms->disable_irq, ms->enable_irq,
> + "bookkeeping error, some irq qill be stuck");
> +
> + return HRTIMER_NORESTART;
> +}
> +
> +/* Initialize moderation state in desc_set_defaults() */
> +void irq_moderation_init_fields(struct irq_desc *desc)
> +{
> + INIT_LIST_HEAD(&desc->ms_node);
> + desc->moderation_mode = 0;
> +}
> +
> +/* Per-CPU state initialization */
> +void irq_moderation_percpu_init(void)
> +{
> + struct irq_mod_state *ms = this_cpu_ptr(&irq_mod_state);
> +
> + hrtimer_setup(&ms->timer, moderation_timer_cb, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED_HARD);
> + INIT_LIST_HEAD(&ms->descs);
> +}
Where is this called?
> +static void set_moderation_mode(struct irq_desc *desc, bool mode)
> +{
> + if (desc) {
Why would desc be NULL?
> + struct irq_chip *chip = desc->irq_data.chip;
> +
> + /* Make sure this is msi and we can run enable_irq from irq context */
> + mode &= desc->handle_irq == handle_edge_irq && chip && chip->irq_bus_lock == NULL &&
> + chip->irq_bus_sync_unlock == NULL;
> + if (mode != desc->moderation_mode)
> + desc->moderation_mode = mode;
> + }
> +}
> +
> +#pragma clang diagnostic error "-Wformat"
What's that for?
> +/* Print statistics */
> +static int moderation_show(struct seq_file *p, void *v)
> +{
> + ulong irq_rate = 0, irq_high = 0, my_irq_high = 0, hardirq_high = 0;
> + uint delay_us = irq_mod_info.delay_us;
> + struct irq_mod_state *ms;
> + u64 now = ktime_get_ns();
> + int j, active_cpus = 0;
> + struct irq_desc *desc = p->private;
> +
> + if (desc) {
> + seq_printf(p, "%s\n", desc->moderation_mode ? "on" : "off");
> + return 0;
> + }
> +
> + seq_puts(p, "# CPU irq/s my_irq/s cpus srcs delay_ns irq_hi my_irq_hi"
> + " hardirq_hi timer_set disable_irq from_msi timer_calls stray_irq\n");
> + for_each_online_cpu(j) {
Racy agains CPU hotplug.
> + ms = per_cpu_ptr(&irq_mod_state, j);
> + if (now - ms->last_ns > NSEC_PER_SEC) {
> + ms->my_irq_rate = 0;
> + ms->irq_rate = 0;
> + ms->cpu_count = 0;
> + } else { /* average irq_rate over active CPUs */
> + active_cpus++;
> + irq_rate += ms->irq_rate;
> + }
> + /* compute totals */
> + irq_high += ms->irq_high;
> + my_irq_high += ms->my_irq_high;
> + hardirq_high += ms->hardirq_high;
> +
> + seq_printf(p, "%5u %8u %8u %4u %4u %8u %11u %11u %11u %11u %11u %11u %11u %9u\n",
> + j, ms->irq_rate, ms->my_irq_rate, (ms->cpu_count + 128) / 256,
> + (ms->src_count + 128) / 256, ms->mod_ns, ms->irq_high, ms->my_irq_high,
> + ms->hardirq_high, ms->timer_set, ms->disable_irq,
> + ms->from_posted_msi, ms->timer_calls, ms->stray_irq);
> + }
> +
> + seq_printf(p, "\n"
> + "enabled %s\n"
> + "delay_us %u\n"
> + "timer_rounds %u\n"
> + "count_timer_calls %s\n",
> + str_yes_no(delay_us),
> + delay_us,
> + irq_mod_info.timer_rounds,
> + str_yes_no(irq_mod_info.count_timer_calls));
> +
> + return 0;
> +}
> +
> +static int moderation_open(struct inode *inode, struct file *file)
> +{
> + return single_open(file, moderation_show, pde_data(inode));
> +}
> +
> +/* helpers to set values */
> +struct var_names {
> + const char *name;
> + uint *val;
> + int min;
> + int max;
> +} var_names[] = {
> + { "delay_us", &irq_mod_info.delay_us, 0, 500 },
> + { "timer_rounds", &irq_mod_info.timer_rounds, 0, 50 },
> + { "count_timer_calls", &irq_mod_info.count_timer_calls, 0, 1 },
> + {}
Don't use zero terminated arrays. Use ARRAY_SIZE() for loop termination.
> +};
> +
> +static int set_parameter(const char *buf, int len)
> +{
> + struct var_names *n;
> + int l, val;
> +
> + for (n = var_names; n->name; n++) {
> + l = strlen(n->name);
> + if (len >= l + 2 && !strncmp(buf, n->name, l) && buf[l] == '=')
> + break;
> + }
> + if (!n->name || !n->val)
> + return -EINVAL;
> + if (kstrtouint(buf + l + 1, 0, &val))
> + return -EINVAL;
> + WRITE_ONCE(*(n->val), clamp(val, n->min, n->max));
> + irq_mod_info.procfs_write_ns = ktime_get_ns();
The point of this time stamp is?
> + return len;
> +}
> +
> +static ssize_t moderation_write(struct file *f, const char __user *buf, size_t count, loff_t *ppos)
> +{
> + char val[40]; /* bounded string size */
> + struct irq_desc *desc = (struct irq_desc *)pde_data(file_inode(f));
> +
> + if (count == 0 || count + 1 > sizeof(val))
> + return -EINVAL;
> + if (copy_from_user(val, buf, count))
> + return -EFAULT;
> + val[count] = '\0';
> + if (val[count - 1] == '\n')
> + val[count - 1] = '\0';
> + if (!desc)
> + return set_parameter(val, count);
> +
> + if (!strcmp(val, "off") || !strcmp(val, "0"))
> + set_moderation_mode(desc, false);
> + else if (!strcmp(val, "on") || !strcmp(val, "1"))
> + set_moderation_mode(desc, true);
kstrtobool() exists for a reason.
> + else
> + return -EINVAL;
> + return count; /* consume all */
Code reuse is fine, but this really want's to be seperated for the
control parameters and the per interrupt ones.
> +}
> +
> +static const struct proc_ops proc_ops = {
> + .proc_open = moderation_open,
> + .proc_read = seq_read,
> + .proc_lseek = seq_lseek,
> + .proc_release = single_release,
> + .proc_write = moderation_write,
> +};
> +
> +void irq_moderation_procfs_entry(struct irq_desc *desc, umode_t umode)
> +{
> + if (umode)
Seriously? What's wrong with having irq_moderation_procfs_add/remove() ?
> + proc_create_data("soft_moderation", umode, desc->dir, &proc_ops, desc);
> + else
> + remove_proc_entry("soft_moderation", desc->dir);
> +}
> +
> +static int __init init_irq_moderation(void)
> +{
> + proc_create_data("irq/soft_moderation", 0644, NULL, &proc_ops, (void *)0);
> + return 0;
> +}
The selection of code which goes into this patch is pretty random. Some
handling functions and a big pile of procfs muck. Reviewing this series
is a pain.
Thanks,
tglx
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/6] genirq: soft_moderation: add base files, procfs hooks
2025-11-12 19:24 ` [PATCH 2/6] genirq: soft_moderation: add base files, procfs hooks Luigi Rizzo
2025-11-13 9:29 ` Thomas Gleixner
@ 2025-11-13 9:40 ` Thomas Gleixner
1 sibling, 0 replies; 24+ messages in thread
From: Thomas Gleixner @ 2025-11-13 9:40 UTC (permalink / raw)
To: Luigi Rizzo, Marc Zyngier, Luigi Rizzo, Paolo Abeni,
Andrew Morton, Sean Christopherson, Jacob Pan
Cc: linux-kernel, linux-arch, Bjorn Helgaas, Willem de Bruijn,
Luigi Rizzo
On Wed, Nov 12 2025 at 19:24, Luigi Rizzo wrote:
> Add the main core files that implement soft_moderation, limited to static
> moderation, plus related small changes to include/linux/interrupt.h,
> kernel/irq/Makefile, and kernel/irq/proc.c
>
> - include/linux/irq_moderation.h has the two main struct, prototypes
> and inline hooks
And none of this should be in a global visible header. All of that is
only relevant to the core interrupt code, so that want's to be a core
local header file.
Thanks,
tglx
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/6] genirq: platform wide interrupt moderation: Documentation, Kconfig, irq_desc
2025-11-12 19:24 ` [PATCH 1/6] genirq: platform wide interrupt moderation: Documentation, Kconfig, irq_desc Luigi Rizzo
2025-11-13 8:17 ` Thomas Gleixner
@ 2025-11-13 9:44 ` Thomas Gleixner
2025-11-13 13:25 ` Marc Zyngier
2 siblings, 0 replies; 24+ messages in thread
From: Thomas Gleixner @ 2025-11-13 9:44 UTC (permalink / raw)
To: Luigi Rizzo, Marc Zyngier, Luigi Rizzo, Paolo Abeni,
Andrew Morton, Sean Christopherson, Jacob Pan
Cc: linux-kernel, linux-arch, Bjorn Helgaas, Willem de Bruijn,
Luigi Rizzo
On Wed, Nov 12 2025 at 19:24, Luigi Rizzo wrote:
> +# Support platform wide software interrupt moderation
> +config IRQ_SOFT_MODERATION
> + bool "Enable platform wide software interrupt moderation"
Lacks 'depends on PROCFS'
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 3/6] genirq: soft_moderation: activate hooks in handle_irq_event()
2025-11-12 19:24 ` [PATCH 3/6] genirq: soft_moderation: activate hooks in handle_irq_event() Luigi Rizzo
@ 2025-11-13 9:45 ` Thomas Gleixner
2025-11-14 8:27 ` Luigi Rizzo
0 siblings, 1 reply; 24+ messages in thread
From: Thomas Gleixner @ 2025-11-13 9:45 UTC (permalink / raw)
To: Luigi Rizzo, Marc Zyngier, Luigi Rizzo, Paolo Abeni,
Andrew Morton, Sean Christopherson, Jacob Pan
Cc: linux-kernel, linux-arch, Bjorn Helgaas, Willem de Bruijn,
Luigi Rizzo
On Wed, Nov 12 2025 at 19:24, Luigi Rizzo wrote:
Forgot to mention it on the earlier patches. The subject line is wrong
in multiple aspects. See documentation.
> arch/x86/kernel/cpu/common.c | 1 +
> drivers/irqchip/irq-gic-v3.c | 2 ++
How are those related to the subject?
> kernel/irq/handle.c | 3 +++
> kernel/irq/irqdesc.c | 1 +
> 4 files changed, 7 insertions(+)
>
> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> index 02d97834a1d4d..1953419fde6ff 100644
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -2440,6 +2440,7 @@ void cpu_init(void)
>
> intel_posted_msi_init();
> }
> + irq_moderation_percpu_init();
Why is this called in architecture specific code? There is absolutely
nothing architecture specific about this. The CPU hotplug infrastructure
can handle this just fine in a generic way.
> #include <asm/irq_regs.h>
> @@ -254,9 +255,11 @@ irqreturn_t handle_irq_event(struct irq_desc *desc)
> irqd_set(&desc->irq_data, IRQD_IRQ_INPROGRESS);
> raw_spin_unlock(&desc->lock);
>
> + irq_moderation_hook(desc); /* may disable irq so must run unlocked */
That's just wrong. That can trivially be implemented in a way which
works with the lock held.
> ret = handle_irq_event_percpu(desc);
> raw_spin_lock(&desc->lock);
> + irq_moderation_epilogue(desc); /* start moderation timer if needed */
> irqd_clear(&desc->irq_data, IRQD_IRQ_INPROGRESS);
> return ret;
> }
> diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
> index db714d3014b5f..e3efbecf5b937 100644
> --- a/kernel/irq/irqdesc.c
> +++ b/kernel/irq/irqdesc.c
> @@ -134,6 +134,7 @@ static void desc_set_defaults(unsigned int irq, struct irq_desc *desc, int node,
> desc->tot_count = 0;
> desc->name = NULL;
> desc->owner = owner;
> + irq_moderation_init_fields(desc);
That's clearly part of activation in handle_irq_event() ....
Thanks
tglx
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 4/6] genirq: soft_moderation: implement adaptive moderation
2025-11-12 19:24 ` [PATCH 4/6] genirq: soft_moderation: implement adaptive moderation Luigi Rizzo
@ 2025-11-13 10:15 ` Thomas Gleixner
0 siblings, 0 replies; 24+ messages in thread
From: Thomas Gleixner @ 2025-11-13 10:15 UTC (permalink / raw)
To: Luigi Rizzo, Marc Zyngier, Luigi Rizzo, Paolo Abeni,
Andrew Morton, Sean Christopherson, Jacob Pan
Cc: linux-kernel, linux-arch, Bjorn Helgaas, Willem de Bruijn,
Luigi Rizzo
On Wed, Nov 12 2025 at 19:24, Luigi Rizzo wrote:
> Add two control parameters (target_irq_rate and hardirq_percent)
> to indicate the desired maximum values for these two metrics.
>
> Every update_ms the hook in handle_irq_event() recomputes the total and
> local interrupt rate and the amount of time spent in hardirq, compares
> the values with the targets, and adjusts the moderation delay up or down.
>
> The interrupt rate is computed in a scalable way by counting interrupts
> per-CPU, and aggregating the value in a global variable only every
> update_ms. Only CPUs that actively process interrupts are actually
> accessing the shared variable, so the system scales well even on very
> large servers.
This explains the what. But it does not provide any rationale for it.
> +/* Adjust the moderation delay, called at most every update_ns */
> +void __irq_moderation_adjust_delay(struct irq_mod_state *ms, u64 delta_time, u64 update_ns)
> +{
> + /* Fetch configuration */
> + u32 target_rate = clamp(irq_mod_info.target_irq_rate, 0u, 50000000u);
> + u32 hardirq_percent = clamp(irq_mod_info.hardirq_percent, 0u, 100u);
This is all racy against a concurrent write, so that requires
READ_ONCE(), no?
> + bool below_target = true;
> + /* Compute decay steps based on elapsed time */
> + u32 steps = delta_time > 10 * update_ns ? 10 : 1 + (delta_time / update_ns);
> +
> + if (target_rate == 0 && hardirq_percent == 0) { /* use fixed delay */
> + ms->mod_ns = ms->delay_ns;
> + ms->irq_rate = 0;
> + ms->my_irq_rate = 0;
> + ms->cpu_count = 0;
> + return;
> + }
> +
> + if (target_rate > 0) { /* control total and individual CPU rate */
> + u64 irq_rate, my_irq_rate, tmp, delta_irqs, num_cpus;
> + bool my_rate_ok, global_rate_ok;
> +
> + /* Update global number of interrupts */
> + if (ms->irq_count < 1) /* make sure it is always > 0 */
And how does it become < 1?
> + ms->irq_count = 1;
> + tmp = atomic_long_add_return(ms->irq_count, &irq_mod_info.total_intrs);
> + delta_irqs = tmp - ms->last_total_irqs;
> +
> + /* Compute global rate, check if we are ok */
> + irq_rate = (delta_irqs * NSEC_PER_SEC) / delta_time;
> + global_rate_ok = irq_rate < target_rate;
> +
> + ms->last_total_irqs = tmp;
I really had to find this update. Can you please just keep that stuff
together?
tmp = ...;
delta = tmp - ms->xxxx;
ms->xxxx = tmp;
> + /*
> + * num_cpus is the number of CPUs actively handling interrupts
> + * in the last interval. CPUs handling less than the fair share
> + * target_rate / num_cpus do not need to be throttled.
> + */
> + tmp = atomic_long_add_return(1, &irq_mod_info.total_cpus);
> + num_cpus = tmp - ms->last_total_cpus;
> + /* scale proportionally to time, reduce errors if we are idle for too long */
> + num_cpus = 1 + (num_cpus * update_ns + delta_time / 2) / delta_time;
I still fail to see why this global scale is required and how it is
correct. This can starve out a particular CPU which handles the bulk of
interrupts, no?
> + /* Short intervals may underestimate sources. Apply a scale factor */
> + num_cpus = num_cpus * get_scale_cpus() / 100;
> +
> + /* Compute our rate, check if we are ok */
> + my_irq_rate = (ms->irq_count * NSEC_PER_SEC) / delta_time;
> + my_rate_ok = my_irq_rate * num_cpus < target_rate;
> +
> + ms->irq_count = 1; /* reset for next cycle */
> + ms->last_total_cpus = tmp;
> +
> + /* Use instantaneous rates to react. */
> + below_target = global_rate_ok || my_rate_ok;
> +
> + /* Statistics (rates are smoothed averages) */
> + smooth_avg(&ms->irq_rate, irq_rate, steps);
> + smooth_avg(&ms->my_irq_rate, my_irq_rate, steps);
> + smooth_avg(&ms->cpu_count, num_cpus * 256, steps); /* scaled */
> + ms->my_irq_high += !my_rate_ok;
> + ms->irq_high += !global_rate_ok;
> + }
> +
> + if (hardirq_percent > 0) { /* control time spent in hardirq */
> + u64 cur = kcpustat_this_cpu->cpustat[CPUTIME_IRQ];
Depends on CONFIG_IRQ_TIME_ACCOUNTING=y, no?
> + u64 irqtime = cur - ms->last_irqtime;
> + bool hardirq_ok = irqtime * 100 < delta_time * hardirq_percent;
> +
> + below_target &= hardirq_ok;
> + ms->last_irqtime = cur;
> + ms->hardirq_high += !hardirq_ok; /* statistics */
> + }
> +
> + /* Controller: move mod_ns up/down if we are above/below target */
> + if (below_target) {
> + ms->mod_ns -= ms->mod_ns * steps / (steps + get_decay_factor());
> + if (ms->mod_ns < 100)
> + ms->mod_ns = 0;
> + } else if (ms->mod_ns < 500) {
> + ms->mod_ns = 500;
Random numbers pulled out of thin air. That's all over the place.
> + } else {
> + ms->mod_ns += ms->mod_ns * steps / (steps + get_grow_factor());
> + if (ms->mod_ns > ms->delay_ns)
> + ms->mod_ns = ms->delay_ns; /* cap to delay_ns */
> + }
> +}
Thanks,
tglx
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 6/6] genirq: soft_moderation: implement per-driver defaults (nvme and vfio)
2025-11-12 19:24 ` [PATCH 6/6] genirq: soft_moderation: implement per-driver defaults (nvme and vfio) Luigi Rizzo
@ 2025-11-13 10:18 ` Thomas Gleixner
2025-11-13 10:42 ` Luigi Rizzo
0 siblings, 1 reply; 24+ messages in thread
From: Thomas Gleixner @ 2025-11-13 10:18 UTC (permalink / raw)
To: Luigi Rizzo, Marc Zyngier, Luigi Rizzo, Paolo Abeni,
Andrew Morton, Sean Christopherson, Jacob Pan
Cc: linux-kernel, linux-arch, Bjorn Helgaas, Willem de Bruijn,
Luigi Rizzo
On Wed, Nov 12 2025 at 19:24, Luigi Rizzo wrote:
> Introduce helpers to implement per-driver module parameters to enable
> moderation at boot/probe time.
>
> As an example, use the helpers in nvme and vfio drivers.
>
> To test, boot a kernel with
>
> ${driver}.soft_moderation=1 # enables moderation.
>
> and verify with "cat /proc/irq/soft_moderation" that
> the counters increase.
>
> Change-Id: Iaad4110977deb96df845501895e0043bd93fc350
> ---
> drivers/nvme/host/pci.c | 3 +++
> drivers/vfio/pci/vfio_pci_intrs.c | 3 +++
> include/linux/interrupt.h | 13 +++++++++++++
> kernel/irq/irq_moderation.c | 11 +++++++++++
Again a pile of randomly picked files. The proper way is documented:
1) Add infrastructure first
2) Use infrastructure in drivers, one patch per subsystem
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 72fb675a696f4..b9d7bce30061f 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -72,6 +72,8 @@
> static_assert(MAX_PRP_RANGE / NVME_CTRL_PAGE_SIZE <=
> (1 /* prp1 */ + NVME_MAX_NR_DESCRIPTORS * PRPS_PER_PAGE));
>
> +DEFINE_IRQ_MODERATION_MODE_PARAMETER;
> +
> static int use_threaded_interrupts;
> module_param(use_threaded_interrupts, int, 0444);
>
> @@ -1989,6 +1991,7 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid, bool polled)
> result = queue_request_irq(nvmeq);
> if (result < 0)
> goto release_sq;
> + IRQ_MODERATION_SET_DEFAULT_MODE(pci_irq_vector(to_pci_dev(dev->dev), vector));
What's the point of this IRQ_MODERATION_SET_DEFAULT_MODE() wrapper?
Thanks,
tglx
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/6] genirq: soft_moderation: add base files, procfs hooks
2025-11-13 9:29 ` Thomas Gleixner
@ 2025-11-13 10:24 ` Thomas Gleixner
2025-11-13 22:42 ` Luigi Rizzo
2025-11-13 22:32 ` Luigi Rizzo
1 sibling, 1 reply; 24+ messages in thread
From: Thomas Gleixner @ 2025-11-13 10:24 UTC (permalink / raw)
To: Luigi Rizzo, Marc Zyngier, Luigi Rizzo, Paolo Abeni,
Andrew Morton, Sean Christopherson, Jacob Pan
Cc: linux-kernel, linux-arch, Bjorn Helgaas, Willem de Bruijn,
Luigi Rizzo
On Thu, Nov 13 2025 at 10:29, Thomas Gleixner wrote:
> On Wed, Nov 12 2025 at 19:24, Luigi Rizzo wrote:
>> +static void set_moderation_mode(struct irq_desc *desc, bool mode)
>> +{
>> + if (desc) {
>
> Why would desc be NULL?
>
>> + struct irq_chip *chip = desc->irq_data.chip;
>> +
>> + /* Make sure this is msi and we can run enable_irq from irq context */
>> + mode &= desc->handle_irq == handle_edge_irq && chip && chip->irq_bus_lock == NULL &&
>> + chip->irq_bus_sync_unlock == NULL;
Q: How does this make sure that it is MSI?
A: Not at all
Q: How is this protected against concurrent modification?
A: Not at all
>> + if (mode != desc->moderation_mode)
>> + desc->moderation_mode = mode;
This conditional is truly making a difference.
Thanks
tglx
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 6/6] genirq: soft_moderation: implement per-driver defaults (nvme and vfio)
2025-11-13 10:18 ` Thomas Gleixner
@ 2025-11-13 10:42 ` Luigi Rizzo
0 siblings, 0 replies; 24+ messages in thread
From: Luigi Rizzo @ 2025-11-13 10:42 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Marc Zyngier, Luigi Rizzo, Paolo Abeni, Andrew Morton,
Sean Christopherson, Jacob Pan, linux-kernel, linux-arch,
Bjorn Helgaas, Willem de Bruijn
On Thu, Nov 13, 2025 at 11:18 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Wed, Nov 12 2025 at 19:24, Luigi Rizzo wrote:
...
> > --- a/drivers/nvme/host/pci.c
> > +++ b/drivers/nvme/host/pci.c
> > @@ -72,6 +72,8 @@
> > static_assert(MAX_PRP_RANGE / NVME_CTRL_PAGE_SIZE <=
> > (1 /* prp1 */ + NVME_MAX_NR_DESCRIPTORS * PRPS_PER_PAGE));
> >
> > +DEFINE_IRQ_MODERATION_MODE_PARAMETER;
> > +
> > static int use_threaded_interrupts;
> > module_param(use_threaded_interrupts, int, 0444);
> >
> > @@ -1989,6 +1991,7 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid, bool polled)
> > result = queue_request_irq(nvmeq);
> > if (result < 0)
> > goto release_sq;
> > + IRQ_MODERATION_SET_DEFAULT_MODE(pci_irq_vector(to_pci_dev(dev->dev), vector));
>
> What's the point of this IRQ_MODERATION_SET_DEFAULT_MODE() wrapper?
I wanted to avoid exposing the module parameter name,
so that adding support in a driver is mechanical and we can
change the name in one place for all (during this review or later).
Ideally I would have preferred some generic parameter
irq_moderation.enable_on="nvme vfio ..."
to set defaults without having to patch individual drivers.
But I have not found a way to go from the irq_desc to the driver's name
(desc->name is often renamed especially for NICs)
cheers
luigi
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/6] genirq: platform wide interrupt moderation: Documentation, Kconfig, irq_desc
2025-11-12 19:24 ` [PATCH 1/6] genirq: platform wide interrupt moderation: Documentation, Kconfig, irq_desc Luigi Rizzo
2025-11-13 8:17 ` Thomas Gleixner
2025-11-13 9:44 ` Thomas Gleixner
@ 2025-11-13 13:25 ` Marc Zyngier
2025-11-13 13:33 ` Luigi Rizzo
2 siblings, 1 reply; 24+ messages in thread
From: Marc Zyngier @ 2025-11-13 13:25 UTC (permalink / raw)
To: Luigi Rizzo
Cc: Thomas Gleixner, Luigi Rizzo, Paolo Abeni, Andrew Morton,
Sean Christopherson, Jacob Pan, linux-kernel, linux-arch,
Bjorn Helgaas, Willem de Bruijn
On Wed, 12 Nov 2025 19:24:03 +0000,
Luigi Rizzo <lrizzo@google.com> wrote:
[...]
> The system does not rely on any special hardware feature except from
> MSI-X Pending Bit Array (PBA), a mandatory component of MSI-X
Is this stuff PCI specific? if so, Why? What is the actual dependency
on PBA? It is it just that you are relying on the ability to mask
interrupts without losing them, something that is pretty much a given
on any architecture?
[...]
> +Platform Wide software interrupt moderation is a variant of moderation
> +that adjusts the delay based on platform-wide metrics, instead of
> +considering each source separately. It then uses hrtimers to implement
> +adaptive, per-CPU moderation in software, without requiring any specific
> +hardware support other than Pending Bit Array, a standard feature
> +of MSI-X.
> +
> +To understand the motivation for this feature, we start with some
> +background on interrupt moderation.
This reads like marketing blurb. This is an API documentation, and it
shouldn't be a description of your motivations for building it the way
you did. I'd suggest you stick to the API, and keep the motivations
for the cover letter.
> +
> +* **Interrupt** is a mechanism to **notify** the CPU of **events**
> + that should be handled by software, for example, **completions**
> + of I/O requests (network tx/rx, disk read/writes...).
That's only half of the truth, as this description only applies to
*edge* interrupts. Level interrupts report a change in *state*, not an
event.
How do you plan to deal with interrupt moderation for level
interrupts?
M.
--
Without deviation from the norm, progress is not possible.
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/6] genirq: platform wide interrupt moderation: Documentation, Kconfig, irq_desc
2025-11-13 13:25 ` Marc Zyngier
@ 2025-11-13 13:33 ` Luigi Rizzo
2025-11-13 14:42 ` Marc Zyngier
0 siblings, 1 reply; 24+ messages in thread
From: Luigi Rizzo @ 2025-11-13 13:33 UTC (permalink / raw)
To: Marc Zyngier
Cc: Thomas Gleixner, Luigi Rizzo, Paolo Abeni, Andrew Morton,
Sean Christopherson, Jacob Pan, linux-kernel, linux-arch,
Bjorn Helgaas, Willem de Bruijn
On Thu, Nov 13, 2025 at 2:25 PM Marc Zyngier <maz@kernel.org> wrote:
>
> On Wed, 12 Nov 2025 19:24:03 +0000,
> Luigi Rizzo <lrizzo@google.com> wrote:
>
> [...]
>
> > The system does not rely on any special hardware feature except from
> > MSI-X Pending Bit Array (PBA), a mandatory component of MSI-X
>
> Is this stuff PCI specific? if so, Why? What is the actual dependency
> on PBA? It is it just that you are relying on the ability to mask
> interrupts without losing them, something that is pretty much a given
> on any architecture?
You are right, I was overly restrictive. I only need what you say,
will replace the text accordingly.
>
> [...]
> > +To understand the motivation for this feature, we start with some
> > +background on interrupt moderation.
>
> This reads like marketing blurb. This is an API documentation, and it
> shouldn't be a description of your motivations for building it the way
> you did. I'd suggest you stick to the API, and keep the motivations
> for the cover letter.
ok will remove it.
Sorry if it reads like marketing, that is very very far from my intentions.
I just wanted to give background information in a way that is easy
to access from the source tree.
>
> > +
> > +* **Interrupt** is a mechanism to **notify** the CPU of **events**
> > + that should be handled by software, for example, **completions**
> > + of I/O requests (network tx/rx, disk read/writes...).
>
> That's only half of the truth, as this description only applies to
> *edge* interrupts. Level interrupts report a change in *state*, not an
> event.
>
> How do you plan to deal with interrupt moderation for level
> interrupts?
I don't. This is restricted to edge interrupts.
cheers
luigi
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/6] genirq: platform wide interrupt moderation: Documentation, Kconfig, irq_desc
2025-11-13 13:33 ` Luigi Rizzo
@ 2025-11-13 14:42 ` Marc Zyngier
2025-11-13 14:55 ` Luigi Rizzo
0 siblings, 1 reply; 24+ messages in thread
From: Marc Zyngier @ 2025-11-13 14:42 UTC (permalink / raw)
To: Luigi Rizzo
Cc: Thomas Gleixner, Luigi Rizzo, Paolo Abeni, Andrew Morton,
Sean Christopherson, Jacob Pan, linux-kernel, linux-arch,
Bjorn Helgaas, Willem de Bruijn
On Thu, 13 Nov 2025 13:33:51 +0000,
Luigi Rizzo <lrizzo@google.com> wrote:
>
> On Thu, Nov 13, 2025 at 2:25 PM Marc Zyngier <maz@kernel.org> wrote:
> >
> > On Wed, 12 Nov 2025 19:24:03 +0000,
> > Luigi Rizzo <lrizzo@google.com> wrote:
> >
> > [...]
> >
> > > The system does not rely on any special hardware feature except from
> > > MSI-X Pending Bit Array (PBA), a mandatory component of MSI-X
> >
> > Is this stuff PCI specific? if so, Why? What is the actual dependency
> > on PBA? It is it just that you are relying on the ability to mask
> > interrupts without losing them, something that is pretty much a given
> > on any architecture?
>
> You are right, I was overly restrictive. I only need what you say,
> will replace the text accordingly.
>
> >
> > [...]
> > > +To understand the motivation for this feature, we start with some
> > > +background on interrupt moderation.
> >
> > This reads like marketing blurb. This is an API documentation, and it
> > shouldn't be a description of your motivations for building it the way
> > you did. I'd suggest you stick to the API, and keep the motivations
> > for the cover letter.
>
> ok will remove it.
> Sorry if it reads like marketing, that is very very far from my intentions.
> I just wanted to give background information in a way that is easy
> to access from the source tree.
Background information is only valid at a given point in time, for a
particular configuration, and rarely translate into something that
spans architectures and implementation in a universal way. For a
start, the quoted figures for interrupt rates are pretty much
irrelevant -- think of reading this in 10 years...
The descriptions are also massively x86-specific. That's probably OK
for the stuff you care about, but I'd certainly would want things to
be a bit more abstract and applicable to all architectures.
> > > +* **Interrupt** is a mechanism to **notify** the CPU of **events**
> > > + that should be handled by software, for example, **completions**
> > > + of I/O requests (network tx/rx, disk read/writes...).
> >
> > That's only half of the truth, as this description only applies to
> > *edge* interrupts. Level interrupts report a change in *state*, not an
> > event.
> >
> > How do you plan to deal with interrupt moderation for level
> > interrupts?
>
> I don't. This is restricted to edge interrupts.
I also note that since you explicitly check for handle_edge_irq() in
set_moderation_mode(), this will not work on anything GIC related, or
any other architecture that uses the fasteoi flows. I really wonder
why you are not looking at the actual trigger mode instead...
Until you fix it, please refrain from touching the GICv3 code, and
make sure this is solely enabled on x86 -- it clearly wasn't tested on
anything else.
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/6] genirq: platform wide interrupt moderation: Documentation, Kconfig, irq_desc
2025-11-13 14:42 ` Marc Zyngier
@ 2025-11-13 14:55 ` Luigi Rizzo
2025-11-13 19:02 ` Marc Zyngier
0 siblings, 1 reply; 24+ messages in thread
From: Luigi Rizzo @ 2025-11-13 14:55 UTC (permalink / raw)
To: Marc Zyngier
Cc: Thomas Gleixner, Luigi Rizzo, Paolo Abeni, Andrew Morton,
Sean Christopherson, Jacob Pan, linux-kernel, linux-arch,
Bjorn Helgaas, Willem de Bruijn
On Thu, Nov 13, 2025 at 3:42 PM Marc Zyngier <maz@kernel.org> wrote:
> [...]
>
> The descriptions are also massively x86-specific. That's probably OK
> for the stuff you care about, but I'd certainly would want things to
> be a bit more abstract and applicable to all architectures.
Absolutely.
> I also note that since you explicitly check for handle_edge_irq() in
> set_moderation_mode(), this will not work on anything GIC related, or
> any other architecture that uses the fasteoi flows. I really wonder
> why you are not looking at the actual trigger mode instead...
sure, that would be the best thing. Any suggestions on how to fix the
check ?
>
> Until you fix it, please refrain from touching the GICv3 code, and
> make sure this is solely enabled on x86 -- it clearly wasn't tested on
> anything else.
FWIW I did verify correct operation and performance boost on arm64,
both network and nvme (this was a previous version which
did not restrict to handle_edge_irq).
Also FWIW there should be nothing architecture-specific in this series.
The only reason I touched arch-specific code was to add initialization
of percpu structs, and that only because I had not found a better way
to run it before interrupts were activated.
Following Thomas' suggestion, this is now being replaced with
a more appropriate, architecture-independent code below
/* Per-CPU state initialization */
static void irq_moderation_percpu_init(void *data)
{
struct irq_mod_state *ms = this_cpu_ptr(&irq_mod_state);
hrtimer_setup(&ms->timer, moderation_timer_cb,
CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED_HARD);
INIT_LIST_HEAD(&ms->descs);
}
static int cpuhp_setup_cb(uint cpu)
{
irq_moderation_percpu_init(NULL);
return 0;
}
static int __init init_irq_moderation(void)
{
on_each_cpu(irq_moderation_percpu_init, NULL, 1);
irq_mod_info.initialized = 1;
cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
"moderation:online", cpuhp_setup_cb, NULL);
proc_create_data("irq/soft_moderation", 0644, NULL, &proc_ops,
(void *)0);
return 0;
}
cheers
luigi
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/6] genirq: platform wide interrupt moderation: Documentation, Kconfig, irq_desc
2025-11-13 14:55 ` Luigi Rizzo
@ 2025-11-13 19:02 ` Marc Zyngier
0 siblings, 0 replies; 24+ messages in thread
From: Marc Zyngier @ 2025-11-13 19:02 UTC (permalink / raw)
To: Luigi Rizzo
Cc: Thomas Gleixner, Luigi Rizzo, Paolo Abeni, Andrew Morton,
Sean Christopherson, Jacob Pan, linux-kernel, linux-arch,
Bjorn Helgaas, Willem de Bruijn
On Thu, 13 Nov 2025 14:55:55 +0000,
Luigi Rizzo <lrizzo@google.com> wrote:
>
> On Thu, Nov 13, 2025 at 3:42 PM Marc Zyngier <maz@kernel.org> wrote:
> > [...]
>
> >
> > The descriptions are also massively x86-specific. That's probably OK
> > for the stuff you care about, but I'd certainly would want things to
> > be a bit more abstract and applicable to all architectures.
>
> Absolutely.
>
> > I also note that since you explicitly check for handle_edge_irq() in
> > set_moderation_mode(), this will not work on anything GIC related, or
> > any other architecture that uses the fasteoi flows. I really wonder
> > why you are not looking at the actual trigger mode instead...
>
> sure, that would be the best thing. Any suggestions on how to fix the
> check ?
I made that suggestion already: check the trigger mode (the interrupt
"type") and only apply this to edge interrupts (and stop mentioning
MSIs, for which some architectures have a level variant).
> > Until you fix it, please refrain from touching the GICv3 code, and
> > make sure this is solely enabled on x86 -- it clearly wasn't tested on
> > anything else.
>
> FWIW I did verify correct operation and performance boost on arm64,
> both network and nvme (this was a previous version which
> did not restrict to handle_edge_irq).
You do realise that what is not on the list doesn't exist, right? ;-)
> Also FWIW there should be nothing architecture-specific in this series.
We're in strong agreement here.
M.
--
Without deviation from the norm, progress is not possible.
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/6] genirq: soft_moderation: add base files, procfs hooks
2025-11-13 9:29 ` Thomas Gleixner
2025-11-13 10:24 ` Thomas Gleixner
@ 2025-11-13 22:32 ` Luigi Rizzo
1 sibling, 0 replies; 24+ messages in thread
From: Luigi Rizzo @ 2025-11-13 22:32 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Marc Zyngier, Luigi Rizzo, Paolo Abeni, Andrew Morton,
Sean Christopherson, Jacob Pan, linux-kernel, linux-arch,
Bjorn Helgaas, Willem de Bruijn
On Thu, Nov 13, 2025 at 10:29 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Wed, Nov 12 2025 at 19:24, Luigi Rizzo wrote:
> > [...]
most other comments will be addressed in the next version, but I wanted
to address a few of them here:
> > /sys/module/irq_moderation/parameters and read/write the procfs entries
> > /proc/irq/soft_moderation and /proc/irq/NN/soft_moderation.
> >
> > Examples:
> > cat /proc/irq/soft_moderation
> > echo "delay_us=345" > /proc/irq/soft_moderation
> > echo 1 | tee /proc/irq/*/nvme*/../soft_moderation
>
> That's impressive to be able to write into proc/...
That is already the case for /proc/irq/default_smp_affinity and
/proc/irq/NN/smp_affinity[_list] so I followed the same approach.
> > +
> > +/* After the handler, if desc is moderated, make sure the timer is active. */
> > +static inline void irq_moderation_epilogue(const struct irq_desc *desc)
> > +{
> > + struct irq_mod_state *ms = this_cpu_ptr(&irq_mod_state);
>
> Lacks a check whether moderation is enabled or not.
That would be redundant, !list_empty(desc->ms_node) is only true
if moderation is enabled and many other conditions in the previous hook.
> > + /* Timer still alive. Just call the handlers */
> > + list_for_each_entry_safe(desc, next, &ms->descs, ms_node) {
> > + ms->irq_count += irq_mod_info.count_timer_calls;
> > + ms->timer_calls++;
> > + handle_irq_event_percpu(desc);
>
> That lacks setting the INPROGRESS flag.
I did it in an earlier test version, but was wondering if that is
actually needed.
While the irqdesc is in the timer list, irqd_irq_disabled() is true and that
makes irq_can_handle_actions() and irq_can_handle() return false.
> > +#pragma clang diagnostic error "-Wformat"
>
> What's that for?
print() formats with a lot of entries are error prone, and the above
catches a lot of trivial mistakes, so I felt it was useful to keep it.
cheers
luigi
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/6] genirq: soft_moderation: add base files, procfs hooks
2025-11-13 10:24 ` Thomas Gleixner
@ 2025-11-13 22:42 ` Luigi Rizzo
0 siblings, 0 replies; 24+ messages in thread
From: Luigi Rizzo @ 2025-11-13 22:42 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Marc Zyngier, Luigi Rizzo, Paolo Abeni, Andrew Morton,
Sean Christopherson, Jacob Pan, linux-kernel, linux-arch,
Bjorn Helgaas, Willem de Bruijn
On Thu, Nov 13, 2025 at 11:24 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Thu, Nov 13 2025 at 10:29, Thomas Gleixner wrote:
> > On Wed, Nov 12 2025 at 19:24, Luigi Rizzo wrote:
> >> [...]
> >> + struct irq_chip *chip = desc->irq_data.chip;
> >> +
> >> + /* Make sure this is msi and we can run enable_irq from irq context */
> >> + mode &= desc->handle_irq == handle_edge_irq && chip && chip->irq_bus_lock == NULL &&
> >> + chip->irq_bus_sync_unlock == NULL;
>
> Q: How does this make sure that it is MSI?
> A: Not at all
Yeah the MSI thing was misguided, as Marc suggested I will replace the condition
with !irqd_is_level_type(&desc->irq_data)
> Q: How is this protected against concurrent modification?
> A: Not at all
Is there such a risk, specifically regarding changing desc->chip ?
I expected desc->chip to be set when the irq is attached and never modified
until destruction. Then, this function is called from within the
procfs handler so
the desc is already set. If it gets destroyed, changing mode is harmless.
cheers
luigi
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 3/6] genirq: soft_moderation: activate hooks in handle_irq_event()
2025-11-13 9:45 ` Thomas Gleixner
@ 2025-11-14 8:27 ` Luigi Rizzo
0 siblings, 0 replies; 24+ messages in thread
From: Luigi Rizzo @ 2025-11-14 8:27 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Marc Zyngier, Luigi Rizzo, Paolo Abeni, Andrew Morton,
Sean Christopherson, Jacob Pan, linux-kernel, linux-arch,
Bjorn Helgaas, Willem de Bruijn
On Thu, Nov 13, 2025 at 10:45 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Wed, Nov 12 2025 at 19:24, Luigi Rizzo wrote:
> > [...]
> > @@ -254,9 +255,11 @@ irqreturn_t handle_irq_event(struct irq_desc *desc)
> > irqd_set(&desc->irq_data, IRQD_IRQ_INPROGRESS);
> > raw_spin_unlock(&desc->lock);
> >
> > + irq_moderation_hook(desc); /* may disable irq so must run unlocked */
>
> That's just wrong. That can trivially be implemented in a way which
> works with the lock held.
Do you mean I can move the call before the unlock and use __disable_irq()
given that moderation is only enabled on interrupts that have
irq_bus_lock == NULL, so disable_irq_nosync() which is
scoped_irqdesc_get_and_buslock(irq, IRQ_GET_DESC_CHECK_GLOBAL) {
__disable_irq(scoped_irqdesc);
return 0;
}
effectively collapses to raw_spin_lock()
This would save a lock/unlock dance and a few conditionals?
> > @@ -134,6 +134,7 @@ static void desc_set_defaults(unsigned int irq, struct irq_desc *desc, int node,
> > desc->tot_count = 0;
> > desc->name = NULL;
> > desc->owner = owner;
> > + irq_moderation_init_fields(desc);
>
> That's clearly part of activation in handle_irq_event() ....
I don't follow. As far as I understand desc_set_defaults() makes sure we
have a clean descriptor after release/before allocation. This includes
clearing the moderation enable flag from a previous user of this desc,
and the flag is only read in the hook in handle_irq_event().
cheers
luigi
^ permalink raw reply [flat|nested] 24+ messages in thread
end of thread, other threads:[~2025-11-14 8:28 UTC | newest]
Thread overview: 24+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-11-12 19:24 [PATCH 0/6] platform wide software interrupt moderation Luigi Rizzo
2025-11-12 19:24 ` [PATCH 1/6] genirq: platform wide interrupt moderation: Documentation, Kconfig, irq_desc Luigi Rizzo
2025-11-13 8:17 ` Thomas Gleixner
2025-11-13 9:44 ` Thomas Gleixner
2025-11-13 13:25 ` Marc Zyngier
2025-11-13 13:33 ` Luigi Rizzo
2025-11-13 14:42 ` Marc Zyngier
2025-11-13 14:55 ` Luigi Rizzo
2025-11-13 19:02 ` Marc Zyngier
2025-11-12 19:24 ` [PATCH 2/6] genirq: soft_moderation: add base files, procfs hooks Luigi Rizzo
2025-11-13 9:29 ` Thomas Gleixner
2025-11-13 10:24 ` Thomas Gleixner
2025-11-13 22:42 ` Luigi Rizzo
2025-11-13 22:32 ` Luigi Rizzo
2025-11-13 9:40 ` Thomas Gleixner
2025-11-12 19:24 ` [PATCH 3/6] genirq: soft_moderation: activate hooks in handle_irq_event() Luigi Rizzo
2025-11-13 9:45 ` Thomas Gleixner
2025-11-14 8:27 ` Luigi Rizzo
2025-11-12 19:24 ` [PATCH 4/6] genirq: soft_moderation: implement adaptive moderation Luigi Rizzo
2025-11-13 10:15 ` Thomas Gleixner
2025-11-12 19:24 ` [PATCH 5/6] x86/irq: soft_moderation: add support for posted_msi (intel) Luigi Rizzo
2025-11-12 19:24 ` [PATCH 6/6] genirq: soft_moderation: implement per-driver defaults (nvme and vfio) Luigi Rizzo
2025-11-13 10:18 ` Thomas Gleixner
2025-11-13 10:42 ` Luigi Rizzo
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).