* [PATCH v2 1/8] genirq: platform wide interrupt moderation: Documentation, Kconfig, irq_desc
2025-11-16 18:28 [PATCH v2 0/8] platform wide software interrupt moderation Luigi Rizzo
@ 2025-11-16 18:28 ` Luigi Rizzo
2025-11-16 18:28 ` [PATCH v2 2/8] genirq: soft_moderation: add base files, procfs Luigi Rizzo
` (6 subsequent siblings)
7 siblings, 0 replies; 32+ messages in thread
From: Luigi Rizzo @ 2025-11-16 18:28 UTC (permalink / raw)
To: Thomas Gleixner, Marc Zyngier, Luigi Rizzo, Paolo Abeni,
Andrew Morton, Sean Christopherson, Jacob Pan
Cc: linux-kernel, linux-arch, Bjorn Helgaas, Willem de Bruijn,
Luigi Rizzo
Platform wide software interrupt moderation ("soft_moderation")
specifically addresses a limitation of platforms from many vendors
whose I/O performance drops significantly when the total rate of MSI-X
interrupts is too high (e.g. 1-3M intr/s, depending on the platform).
Conventional interrupt moderation operates separately on each source,
hence the configuration should target the worst case. On large servers
with hundreds of interrupt sources, keeping the total rate bounded would
require delays of 100-200us, and adaptive moderation would have to reach
those delays with as little as 10K intr/s per source. These values are
unacceptable for RPC or transactional workloads.
To address this problem, soft_moderation efficiently measures the total
and per-CPU interrupt rates, so that individual moderation delays
can be adjusted based on actual global and local load. This way, the
system controls both global interrupt rates and individual CPU load,
and tunes delays so they are normally 0 or very small except during
actual local or global overload.
soft_moderation does not rely on any special hardware feature except
requiring the device to record which sources have pending interrupts.
Configuration is easy and robust. System administrators specify the
maximum targets (moderation delay, interrupt rate, and fraction of time
spent in hardirq), and per-CPU control loops adjust the actual delays to
keep the metrics within those targets.
There is no need for exact targets, because the system is adaptive;
values like delay_us=100, target_irq_rate=1000000, hardirq_percent=70 are
good almost everywhere.
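For intuition, here is a minimal sketch of the per-CPU control loop idea;
the names (adjust_delay_ns, fair_share) and the exact grow/decay policy are
illustrative assumptions, not the code added by this series:

	/*
	 * Illustrative only: adjust this CPU's moderation delay from the
	 * measured global and per-CPU interrupt rates. The real patches
	 * use smoothed rates and configurable grow/decay factors.
	 */
	static unsigned int adjust_delay_ns(unsigned int cur_ns, unsigned int max_ns,
					    unsigned long global_rate,
					    unsigned long my_rate,
					    unsigned long target_rate,
					    unsigned int active_cpus)
	{
		unsigned long fair_share = target_rate / (active_cpus ? active_cpus : 1);

		if (global_rate > target_rate || my_rate > fair_share) {
			/* Over target: grow the delay toward the configured maximum. */
			cur_ns += cur_ns / 8 + 1000;
			return cur_ns > max_ns ? max_ns : cur_ns;
		}
		/* Under target: decay toward 0 so the added latency vanishes. */
		return cur_ns >= 16 ? cur_ns - cur_ns / 16 : 0;
	}

With delay_us=100 and target_irq_rate=1000000, such a loop settles at zero
delay on lightly loaded CPUs and approaches the 100us cap only under
sustained overload.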
Boot defaults are set via module parameters (/sys/module/irq_moderation
and /sys/module/${DRIVER}); runtime configuration is done via
/proc/irq/soft_moderation, which also exports statistics. Moderation on
individual irqs can be turned on/off via /proc/irq/NN/soft_moderation.
No functional change. Enabling the option just extends struct
irq_desc.
CONFIG_IRQ_SOFT_MODERATION=y
Change-Id: I33ad84525a0956a1164033e690104bf9fb40ecac
Signed-off-by: Luigi Rizzo <lrizzo@google.com>
---
Documentation/core-api/irq/index.rst | 1 +
Documentation/core-api/irq/irq-moderation.rst | 154 ++++++++++++++++++
include/linux/irqdesc.h | 18 ++
kernel/irq/Kconfig | 12 ++
4 files changed, 185 insertions(+)
create mode 100644 Documentation/core-api/irq/irq-moderation.rst
diff --git a/Documentation/core-api/irq/index.rst b/Documentation/core-api/irq/index.rst
index 0d65d11e54200..b5a6e2ade69bb 100644
--- a/Documentation/core-api/irq/index.rst
+++ b/Documentation/core-api/irq/index.rst
@@ -9,3 +9,4 @@ IRQs
irq-affinity
irq-domain
irqflags-tracing
+ irq-moderation
diff --git a/Documentation/core-api/irq/irq-moderation.rst b/Documentation/core-api/irq/irq-moderation.rst
new file mode 100644
index 0000000000000..71cad4fd5c3f4
--- /dev/null
+++ b/Documentation/core-api/irq/irq-moderation.rst
@@ -0,0 +1,154 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+===========================================
+Platform wide software interrupt moderation
+===========================================
+
+:Author: Luigi Rizzo <lrizzo@google.com>
+
+.. contents:: :depth: 2
+
+Introduction
+------------
+
+Platform-wide software interrupt moderation is a variant of moderation
+that adjusts the delay based on platform-wide metrics, instead of
+considering each source separately. It then uses hrtimers to implement
+adaptive, per-CPU moderation in software, without requiring any specific
+hardware support other than the Pending Bit Array, a standard feature
+of MSI-X.
+
+The most common and robust implementation of moderation enforces
+some minimum **delay** between subsequent interrupts, using a timer
+in the device or in software. Most NICs support programmable hardware
+moderation, with timer granularity down to 1us or so. NVMe also
+specifies hardware moderation timers, with 100us granularity.
+
+One downside of moderation, especially with **fixed** delay, is that
+even with moderate load, the notification latency can increase by as
+much as the moderation delay. This is undesirable for transactional
+workloads. At high load the extra delay is less problematic, because
+the queueing delays that occur can be one or more orders of magnitude
+bigger.
+
+To address this problem, software can dynamically adjust the delay, making
+it proportional to the I/O rate. This is called **adaptive** moderation,
+and it is commonly implemented in network device drivers.
+
+There is one aspect that per-source moderation does not address.
+
+Several Systems-on-Chip (SoCs) from all vendors (Intel, AMD, ARM) show a
+huge reduction in I/O throughput (up to 3-4x slower for high-speed NICs
+or SSDs) in the presence of high MSI-X interrupt rates across the entire
+platform (1-3M intr/s total, depending on the SoC). Note that unaffected
+SoCs can sustain 20-30M intr/s from MSI-X without performance degradation.
+
+The above performance degradation is not caused by overload of individual
+CPUs. What matters is the total MSI-X interrupt rate across either an
+individual PCIe root port or the entire SoC. The specific root cause
+depends on the SoC, but generally some internal block (in the PCIe root
+port, or in the IOMMU block) applies excessive serialization around
+MSI-X writes. This in turn causes huge delays in other PCIe transactions,
+leading to the observed performance drop.
+
+PLATFORM WIDE ADAPTIVE MODERATION
+---------------------------------
+
+Platform-wide adaptive interrupt moderation addresses the problem by
+operating as follows (all parameters are configurable via module
+parameters ``irq_moderation.${name}=${value}`` or /proc/irq/soft_moderation):
+
+* On each interrupt, a per-CPU interrupt counter is incremented.
+
+* Opportunistically, every ``update_ms`` or so, each CPU scalably
+ accumulates the values across the entire system, computing the global
+ and per-CPU interrupt rates and the number of CPUs actively processing
+ interrupts, ``active_cpus``.
+
+* Based on a configurable ``target_irq_rate``, each CPU computes its
+ fair share (``target_irq_rate / active_cpus``) and whether the global
+ and per-CPU rates are above the targets. A simple control loop then
+ adjusts its per-CPU moderation delay, ``mod_ns``, up or down between
+ 0 (disabled) and a configurable maximum ``delay_us``.
+
+* When ``mod_ns`` is above a threshold (e.g. 10us), the first interrupt
+ served by that CPU starts an hrtimer set to fire after ``mod_ns`` nanoseconds.
+ All interrupt sources served by that CPU will be disabled as they arrive.
+
+* When the timer fires, all disabled sources are re-enabled, allowing pending
+ interrupts to be processed again.
+
+This scheme is effective in keeping the total interrupt rate under
+control as long as the configuration parameters are sensible, i.e.
+``delay_us < #CPUs / target_irq_rate`` (e.g. below 128us for 128 CPUs at a 1M intr/s target).
+
+It also lends itself to some extensions, specifically:
+
+* **protect against hardirq overload**. It is possible for one CPU
+ handling interrupts to be overwhelmed by hardirq processing. The
+ control loop can be extended to declare an overload situation when the
+ percentage of time spent in hardirq is above a configurable threshold
+ ``hardirq_percent``. Moderation can thus kick in to keep the load within bounds.
+
+* **reduce latency using timer-based polling**. Similar to NAPI's
+ ``napi_defer_hard_irqs``, once interrupts are disabled and an hrtimer is active,
+ we keep the timer active for a few rounds and run the handler from a timer callback
+ instead of waiting for an interrupt. The ``timer_rounds`` parameter controls this behavior.
+
+ Say the control loop settles on 120us delay to stay within the global MSI-X rate limit.
+ By setting ``timer_rounds=2``, each time we have a hardware interrupt, the handler
+ will be called two more times by the timer. As a consequence, in the same conditions,
+ the same global MSI-X rate will be reached with just 120/3 = 40us delay, thus improving
+ latency significantly (note that those handler calls do cause extra CPU work, so we
+ may lose some of the efficiency gains coming from large delays).
+
+CONFIGURATION
+-------------
+
+Configuration of this system is done via module parameters
+``irq_moderation.${name}=${value}`` (for boot-time defaults)
+or by writing ``echo "${name}=${value}" > /proc/irq/soft_moderation``
+for run-time configuration.
+
+These are the existing module parameters:
+
+* ``delay_us`` (0: off, range 0-500)
+
+ The maximum moderation delay. 0 means moderation is globally disabled.
+
+* ``target_irq_rate`` (0: off, range 0-50000000)
+
+ The maximum irq rate across the entire platform. The adaptive algorithm will adjust
+ delays to stay within the target. Use 0 to disable this control.
+
+* ``hardirq_percent`` (0: off, range 0-100)
+
+ The maximum percentage of CPU time spent in hardirq. The adaptive algorithm will adjust
+ delays to stay within the target. Use 0 to disable this control.
+
+* ``timer_rounds`` (0: off, range 0-20)
+
+ Once the moderation timer is activated, how many extra timer rounds to do before
+ re-enabling interrupts.
+
+* ``update_ms`` (default 1, range 1-100)
+
+ How often the adaptive control should adjust delays. The default value (1ms) should be good
+ in most circumstances.
+
+Interrupt moderation can be enabled/disabled on individual IRQs as follows:
+
+* module parameter ``${driver}.soft_moderation=1`` (default 0) selects
+ whether to use moderation at device probe time.
+
+* ``echo 1 > /proc/irq/*/${irq_name}/../soft_moderation`` (default 0, disabled) toggles
+ moderation on/off for specific IRQs once they are attached.
+
+**INTEL SPECIFIC**
+
+Recent Intel CPUs support a kernel feature, enabled via the boot parameter ``intremap=posted_msi``,
+that routes all interrupts targeting one CPU via a special interrupt, called **posted_msi**,
+whose handler in turn calls the individual interrupt handlers.
+
+The ``posted_msi`` kernel feature always uses moderation when it is globally
+enabled (``delay_us > 0``); individual IRQs do not need to be enabled separately.
diff --git a/include/linux/irqdesc.h b/include/linux/irqdesc.h
index fd091c35d5721..25df3c51772fa 100644
--- a/include/linux/irqdesc.h
+++ b/include/linux/irqdesc.h
@@ -29,6 +29,23 @@ struct irqstat {
#endif
};
+#ifdef CONFIG_IRQ_SOFT_MODERATION
+/**
+ * struct irq_desc_mod - interrupt moderation information
+ * @ms_node: per-CPU list of moderated irq_desc
+ * @enable: enable moderation on this descriptor
+ */
+struct irq_desc_mod {
+ struct list_head ms_node;
+ bool enable;
+};
+
+void irq_moderation_init_fields(struct irq_desc_mod *mod);
+#else
+struct irq_desc_mod {};
+static inline void irq_moderation_init_fields(struct irq_desc_mod *mod) {}
+#endif
+
/**
* struct irq_desc - interrupt descriptor
* @irq_common_data: per irq and chip data passed down to chip functions
@@ -112,6 +129,7 @@ struct irq_desc {
#endif
struct mutex request_mutex;
int parent_irq;
+ struct irq_desc_mod mod;
struct module *owner;
const char *name;
#ifdef CONFIG_HARDIRQS_SW_RESEND
diff --git a/kernel/irq/Kconfig b/kernel/irq/Kconfig
index 1b4254d19a73e..c258973b63459 100644
--- a/kernel/irq/Kconfig
+++ b/kernel/irq/Kconfig
@@ -155,6 +155,18 @@ config IRQ_KUNIT_TEST
endmenu
+# Support platform wide software interrupt moderation
+config IRQ_SOFT_MODERATION
+ bool "Enable platform wide software interrupt moderation"
+ depends on PROC_FS=y
+ help
+ Enable platform wide software interrupt moderation.
+ This uses a local timer to delay interrupts in configurable
+ ways, depending on various global system load indicators
+ and targets.
+
+ If you don't know what to do here, say N.
+
config GENERIC_IRQ_MULTI_HANDLER
bool
help
--
2.52.0.rc1.455.g30608eb744-goog
* [PATCH v2 2/8] genirq: soft_moderation: add base files, procfs
2025-11-16 18:28 [PATCH v2 0/8] platform wide software interrupt moderation Luigi Rizzo
2025-11-16 18:28 ` [PATCH v2 1/8] genirq: platform wide interrupt moderation: Documentation, Kconfig, irq_desc Luigi Rizzo
@ 2025-11-16 18:28 ` Luigi Rizzo
2025-11-17 11:12 ` kernel test robot
` (2 more replies)
2025-11-16 18:28 ` [PATCH v2 3/8] genirq: soft_moderation: implement fixed moderation Luigi Rizzo
` (5 subsequent siblings)
7 siblings, 3 replies; 32+ messages in thread
From: Luigi Rizzo @ 2025-11-16 18:28 UTC (permalink / raw)
To: Thomas Gleixner, Marc Zyngier, Luigi Rizzo, Paolo Abeni,
Andrew Morton, Sean Christopherson, Jacob Pan
Cc: linux-kernel, linux-arch, Bjorn Helgaas, Willem de Bruijn,
Luigi Rizzo
Add the core files that implement procfs and module parameter support
for soft_moderation. This exposes the module parameters under
/sys/module/irq_moderation/parameters and adds the read/write procfs
entries /proc/irq/soft_moderation and /proc/irq/NN/soft_moderation.
Examples:
cat /proc/irq/soft_moderation
echo "delay_us=345" > /proc/irq/soft_moderation
echo 1 | tee /proc/irq/*/nvme*/../soft_moderation
No functional change.
Change-Id: I83fc9568e70885cc02e7fcd4dbe141d9ee329c82
Signed-off-by: Luigi Rizzo <lrizzo@google.com>
---
kernel/irq/Makefile | 1 +
kernel/irq/irq_moderation.c | 305 ++++++++++++++++++++++++++++++++++++
kernel/irq/irq_moderation.h | 149 ++++++++++++++++++
kernel/irq/irqdesc.c | 1 +
kernel/irq/proc.c | 3 +
5 files changed, 459 insertions(+)
create mode 100644 kernel/irq/irq_moderation.c
create mode 100644 kernel/irq/irq_moderation.h
diff --git a/kernel/irq/Makefile b/kernel/irq/Makefile
index 6ab3a40556670..c06da43d644f2 100644
--- a/kernel/irq/Makefile
+++ b/kernel/irq/Makefile
@@ -9,6 +9,7 @@ obj-$(CONFIG_GENERIC_IRQ_CHIP) += generic-chip.o
obj-$(CONFIG_GENERIC_IRQ_PROBE) += autoprobe.o
obj-$(CONFIG_IRQ_DOMAIN) += irqdomain.o
obj-$(CONFIG_IRQ_SIM) += irq_sim.o
+obj-$(CONFIG_IRQ_SOFT_MODERATION) += irq_moderation.o
obj-$(CONFIG_PROC_FS) += proc.o
obj-$(CONFIG_GENERIC_PENDING_IRQ) += migration.o
obj-$(CONFIG_GENERIC_IRQ_MIGRATION) += cpuhotplug.o
diff --git a/kernel/irq/irq_moderation.c b/kernel/irq/irq_moderation.c
new file mode 100644
index 0000000000000..3a907b8f65698
--- /dev/null
+++ b/kernel/irq/irq_moderation.c
@@ -0,0 +1,305 @@
+// SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause
+
+#include <linux/cpuhotplug.h>
+#include <linux/irq.h>
+#include <linux/irqdesc.h>
+#include <linux/kernel_stat.h>
+#include <linux/module.h>
+#include <linux/seq_file.h>
+#include <linux/proc_fs.h>
+
+#include "internals.h"
+#include "irq_moderation.h"
+
+/*
+ * Platform-wide software interrupt moderation.
+ *
+ * see Documentation/core-api/irq/irq-moderation.rst
+ *
+ * === MOTIVATION AND OPERATION ===
+ *
+ * Some platforms show reduced I/O performance when the total device interrupt
+ * rate across the entire platform becomes too high. This code implements
+ * per-CPU adaptive moderation based on the total interrupt rate, as opposed
+ * to conventional moderation that operates separately on each source.
+ *
+ * It computes the total interrupt rate and number of sources, and uses the
+ * information to adaptively disable individual interrupts for small amounts
+ * of time using per-CPU hrtimers. Specifically:
+ *
+ * - a hook in handle_irq_event(), which applies only to sources configured
+ * to use moderation, updates statistics and checks whether moderation is
+ * needed on that CPU/irq. If so, it calls disable_irq_nosync() and starts
+ * an hrtimer with an appropriate delay.
+ *
+ * - the timer callback calls enable_irq() for all disabled interrupts on that
+ * CPU. That in turn will generate interrupts if there are pending events.
+ *
+ * === CONFIGURATION ===
+ *
+ * The following can be controlled at boot time via module parameters
+ *
+ * irq_moderation.${NAME}=${VALUE}
+ *
+ * or at runtime by writing
+ *
+ * echo "${NAME}=${VALUE}" > /proc/irq/soft_moderation
+ *
+ * delay_us (default 0, suggested 100, range 0-500, 0 DISABLES MODERATION)
+ * Fixed or maximum moderation delay. A reasonable range is 20..100, higher
+ * values can be useful if the hardirq handler is performing a significant
+ * amount of work.
+ *
+ * timer_rounds (default 0, max 20)
+ * Once moderation triggers, periodically run handler zero or more
+ * times using a timer rather than interrupts. This is similar to
+ * napi_defer_hard_irqs on NICs.
+ * A small value may help control load in interrupt-challenged platforms.
+ *
+ * Moderation can be enabled/disabled for individual interrupts with
+ *
+ * echo "on" > /proc/irq/NN/soft_moderation # use "off" to disable
+ *
+ * === MONITORING ===
+ *
+ * cat /proc/irq/soft_moderation shows per-CPU and global statistics.
+ *
+ */
+
+struct irq_mod_info irq_mod_info ____cacheline_aligned;
+
+/* Boot time value, copied to irq_mod_info.delay_us after init. */
+static uint mod_delay_us;
+module_param_named(delay_us, mod_delay_us, uint, 0444);
+MODULE_PARM_DESC(delay_us, "Max moderation delay us, 0 = moderation off, range 0-500.");
+
+module_param_named(timer_rounds, irq_mod_info.timer_rounds, uint, 0444);
+MODULE_PARM_DESC(timer_rounds, "How many timer polls once moderation triggers, range 0-20.");
+
+DEFINE_PER_CPU_ALIGNED(struct irq_mod_state, irq_mod_state);
+
+/* Initialize moderation state, used in desc_set_defaults() */
+void irq_moderation_init_fields(struct irq_desc_mod *mod)
+{
+ INIT_LIST_HEAD(&mod->ms_node);
+ mod->enable = false;
+}
+
+static inline int set_moderation_mode(struct irq_desc *desc, bool enable)
+{
+ struct irq_data *irqd = &desc->irq_data;
+ struct irq_chip *chip = desc->irq_data.chip;
+
+ /* Moderation is supported only in specific cases. */
+ if (enable) {
+ if (irqd_is_level_type(irqd) || !irqd_is_single_target(irqd) ||
+ chip->irq_bus_lock || chip->irq_bus_sync_unlock)
+ return -EOPNOTSUPP;
+ }
+ desc->mod.enable = enable;
+ return 0;
+}
+
+#pragma clang diagnostic error "-Wformat"
+/* Print statistics */
+static int moderation_show(struct seq_file *p, void *v)
+{
+ uint delay_us = irq_mod_info.delay_us;
+ int j;
+
+#define HEAD_FMT "%5s %8s %8s %4s %4s %8s %11s %11s %11s %11s %11s %11s %11s %9s\n"
+#define BODY_FMT "%5u %8u %8u %4u %4u %8u %11u %11u %11u %11u %11u %11u %11u %9u\n"
+
+ seq_printf(p, HEAD_FMT,
+ "# CPU", "irq/s", "my_irq/s", "cpus", "srcs", "delay_ns",
+ "irq_hi", "my_irq_hi", "hardirq_hi", "timer_set",
+ "disable_irq", "from_msi", "timer_calls", "stray_irq");
+
+ for_each_possible_cpu(j) {
+ struct irq_mod_state *ms = per_cpu_ptr(&irq_mod_state, j);
+
+ seq_printf(p, BODY_FMT,
+ j, ms->irq_rate, ms->my_irq_rate,
+ (ms->scaled_cpu_count + 128) / 256,
+ (ms->scaled_src_count + 128) / 256,
+ ms->mod_ns, ms->irq_high, ms->my_irq_high,
+ ms->hardirq_high, ms->timer_set, ms->disable_irq,
+ ms->from_posted_msi, ms->timer_calls, ms->stray_irq);
+ }
+
+ seq_printf(p, "\n"
+ "enabled %s\n"
+ "delay_us %u\n"
+ "timer_rounds %u\n",
+ str_yes_no(delay_us > 0),
+ delay_us, irq_mod_info.timer_rounds);
+
+ return 0;
+}
+
+static int moderation_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, moderation_show, pde_data(inode));
+}
+
+/* Helpers to set and clamp values from procfs or at init. */
+struct param_names {
+ const char *name;
+ uint *val;
+ uint min;
+ uint max;
+};
+
+static struct param_names param_names[] = {
+ { "delay_us", &irq_mod_info.delay_us, 0, 500 },
+ { "timer_rounds", &irq_mod_info.timer_rounds, 0, 20 },
+ /* Empty entry indicates the following are not settable from procfs. */
+ {},
+ { "update_ms", &irq_mod_info.update_ms, 1, 100 },
+};
+
+static ssize_t moderation_write(struct file *f, const char __user *buf, size_t count, loff_t *ppos)
+{
+ struct param_names *n = param_names;
+ char cmd[40];
+ uint i, l, val;
+
+ if (count == 0 || count + 1 > sizeof(cmd))
+ return -EINVAL;
+ if (copy_from_user(cmd, buf, count))
+ return -EFAULT;
+ cmd[count] = '\0';
+ for (i = 0; i < ARRAY_SIZE(param_names) && n->name; i++, n++) {
+ l = strlen(n->name);
+ if (count < l + 2 || strncmp(cmd, n->name, l) || cmd[l] != '=')
+ continue;
+ if (kstrtouint(cmd + l + 1, 0, &val))
+ return -EINVAL;
+ WRITE_ONCE(*(n->val), clamp(val, n->min, n->max));
+ /* Record last parameter change, for use in the control loop. */
+ irq_mod_info.procfs_write_ns = ktime_get_ns();
+ return count;
+ }
+ return -EINVAL;
+}
+
+static const struct proc_ops proc_ops = {
+ .proc_open = moderation_open,
+ .proc_read = seq_read,
+ .proc_lseek = seq_lseek,
+ .proc_release = single_release,
+ .proc_write = moderation_write,
+};
+
+/* Handlers for /proc/irq/NN/soft_moderation */
+static int mode_show(struct seq_file *p, void *v)
+{
+ struct irq_desc *desc = p->private;
+
+ if (!desc)
+ return -ENOENT;
+
+ seq_printf(p, "%s irq %u trigger 0x%x %s %smanaged %slazy handle_irq %pB\n",
+ desc->mod.enable ? "on" : "off", desc->irq_data.irq,
+ irqd_get_trigger_type(&desc->irq_data),
+ irqd_is_level_type(&desc->irq_data) ? "Level" : "Edge",
+ irqd_affinity_is_managed(&desc->irq_data) ? "" : "un",
+ irq_settings_disable_unlazy(desc) ? "un" : "", desc->handle_irq
+ );
+ return 0;
+}
+
+static ssize_t mode_write(struct file *f, const char __user *buf, size_t count, loff_t *ppos)
+{
+ struct irq_desc *desc = (struct irq_desc *)pde_data(file_inode(f));
+ char cmd[40];
+ bool enable;
+ int ret;
+
+ if (!desc)
+ return -ENOENT;
+ if (count == 0 || count + 1 > sizeof(cmd))
+ return -EINVAL;
+ if (copy_from_user(cmd, buf, count))
+ return -EFAULT;
+ cmd[count] = '\0';
+
+ ret = kstrtobool(cmd, &enable);
+ if (!ret)
+ ret = set_moderation_mode(desc, enable);
+ return ret ? : count;
+}
+
+static int mode_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, mode_show, pde_data(inode));
+}
+
+static const struct proc_ops mode_ops = {
+ .proc_open = mode_open,
+ .proc_read = seq_read,
+ .proc_lseek = seq_lseek,
+ .proc_release = single_release,
+ .proc_write = mode_write,
+};
+
+void irq_moderation_procfs_add(struct irq_desc *desc, umode_t umode)
+{
+ proc_create_data("soft_moderation", umode, desc->dir, &mode_ops, desc);
+}
+
+void irq_moderation_procfs_remove(struct irq_desc *desc)
+{
+ remove_proc_entry("soft_moderation", desc->dir);
+}
+
+/* Per-CPU state initialization */
+static void irq_moderation_percpu_init(void *data)
+{
+ struct irq_mod_state *ms = this_cpu_ptr(&irq_mod_state);
+
+ INIT_LIST_HEAD(&ms->descs);
+}
+
+static int cpuhp_setup_cb(uint cpu)
+{
+ irq_moderation_percpu_init(NULL);
+ return 0;
+}
+
+static void clamp_parameter(uint *dst, uint val)
+{
+ struct param_names *n = param_names;
+ uint i;
+
+ for (i = 0; i < ARRAY_SIZE(param_names); i++, n++) {
+ if (dst == n->val) {
+ *dst = clamp(val, n->min, n->max);
+ return;
+ }
+ }
+}
+
+static int __init init_irq_moderation(void)
+{
+ uint *cur;
+
+ on_each_cpu(irq_moderation_percpu_init, NULL, 1);
+ cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN, "moderation:online", cpuhp_setup_cb, NULL);
+
+ /* Clamp all initial values to the allowed range. */
+ for (cur = &irq_mod_info.target_irq_rate; cur < irq_mod_info.pad; cur++)
+ clamp_parameter(cur, *cur);
+
+ /* Finally, set delay_us to enable moderation if needed. */
+ clamp_parameter(&irq_mod_info.delay_us, mod_delay_us);
+
+ proc_create_data("irq/soft_moderation", 0644, NULL, &proc_ops, NULL);
+ return 0;
+}
+
+MODULE_LICENSE("Dual BSD/GPL");
+MODULE_VERSION("1.0");
+MODULE_AUTHOR("Luigi Rizzo <lrizzo@google.com>");
+MODULE_DESCRIPTION("Platform wide software interrupt moderation");
+module_init(init_irq_moderation);
diff --git a/kernel/irq/irq_moderation.h b/kernel/irq/irq_moderation.h
new file mode 100644
index 0000000000000..ccb8193482b51
--- /dev/null
+++ b/kernel/irq/irq_moderation.h
@@ -0,0 +1,149 @@
+/* SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause */
+
+#ifndef _LINUX_IRQ_MODERATION_H
+#define _LINUX_IRQ_MODERATION_H
+
+/*
+ * Platform wide software interrupt moderation, see
+ * Documentation/core-api/irq/irq-moderation.rst
+ */
+
+#include <linux/hrtimer.h>
+#include <linux/irq.h>
+#include <linux/irqdesc.h>
+#include <linux/kernel.h>
+
+#ifdef CONFIG_IRQ_SOFT_MODERATION
+
+/**
+ * struct irq_mod_info - global configuration parameters and state
+ * @total_intrs: running count updated every update_ms
+ * @total_cpus: as above, active CPUs in this interval
+ * @procfs_write_ns: last write to /proc/irq/soft_moderation
+ * @delay_us: fixed delay, or maximum for adaptive
+ * @target_irq_rate: target maximum interrupt rate
+ * @hardirq_percent: target maximum hardirq percentage
+ * @timer_rounds: how many timer polls once moderation fires
+ * @update_ms: how often to update delay/rate/fraction
+ * @scale_cpus: (percent) scale factor to estimate active CPUs
+ * @count_timer_calls: count timer calls for irq limits
+ * @count_msi_calls: count calls from posted_msi for irq limits
+ * @decay_factor: smoothing factor for the control loop, keep at 16
+ * @grow_factor: smoothing factor for the control loop, keep it at 8
+ */
+struct irq_mod_info {
+ /* These fields are written to by all CPUs */
+ ____cacheline_aligned
+ atomic_long_t total_intrs;
+ atomic_long_t total_cpus;
+
+ /* These are mostly read (frequently), so use a different cacheline */
+ ____cacheline_aligned
+ u64 procfs_write_ns;
+ uint delay_us;
+ uint target_irq_rate;
+ uint hardirq_percent;
+ uint timer_rounds;
+ uint update_ms;
+ uint scale_cpus;
+ uint count_timer_calls;
+ uint count_msi_calls;
+ uint decay_factor;
+ uint grow_factor;
+ uint pad[];
+};
+
+extern struct irq_mod_info irq_mod_info;
+
+/**
+ * struct irq_mod_state - per-CPU moderation state
+ *
+ * @timer: moderation timer
+ * @descs: list of moderated irq_desc on this CPU
+ *
+ * Counters on last time we updated moderation delay
+ * @last_ns: time of last update
+ * @last_irqtime: from cpustat[CPUTIME_IRQ]
+ * @last_total_irqs: from irq_mod_info
+ * @last_total_cpus: from irq_mod_info
+ *
+ * Local info to control hooks and timer callbacks
+ * @dont_count: do not count this interrupt
+ * @in_posted_msi: don't suppress handle_irq, set in posted_msi handler
+ * @kick_posted_msi: kick posted_msi from the timer callback
+ * @rounds_left: how many rounds left for timer callbacks
+ *
+ * @irq_count: irqs in the last cycle, signed as we also decrement
+ * @update_ns: fetched from irq_mod_info
+ * @delay_ns: fetched from irq_mod_info
+ * @mod_ns: current moderation delay, recomputed every update_ms
+ * @sleep_ns: accumulated time for actual delay
+ *
+ * Statistics
+ * @irq_rate: smoothed global irq rate
+ * @my_irq_rate: smoothed irq rate for this CPU
+ * @scaled_cpu_count: smoothed CPU count (scaled)
+ * @scaled_src_count: smoothed count of irq sources (scaled)
+ * @irq_high: how many times global irq above threshold
+ * @my_irq_high: how many times local irq above threshold
+ * @hardirq_high: how many times local hardirq_percent above threshold
+ * @timer_set: how many timer_set calls
+ * @timer_fire: how many timer_fire, must match timer_set in timer callback
+ * @disable_irq: how many disable_irq calls
+ * @enable_irq: how many enable_irq, must match disable_irq in timer callback
+ * @timer_calls: how many handler calls from timer interrupt
+ * @from_posted_msi: how many calls from posted_msi handler
+ * @stray_irq: how many stray interrupts
+ */
+struct irq_mod_state {
+ struct hrtimer timer;
+ struct list_head descs;
+
+ /* Counters on last time we updated moderation delay */
+ u64 last_ns;
+ u64 last_irqtime;
+ u64 last_total_irqs;
+ u64 last_total_cpus;
+
+ bool dont_count;
+ bool in_posted_msi;
+ bool kick_posted_msi;
+ u8 rounds_left;
+
+ u32 irq_count;
+ u32 update_ns;
+ u32 delay_ns;
+ u32 mod_ns;
+ u32 sleep_ns;
+
+ /* Statistics */
+ u32 irq_rate;
+ u32 my_irq_rate;
+ u32 scaled_cpu_count;
+ u32 scaled_src_count;
+ u32 irq_high;
+ u32 my_irq_high;
+ u32 hardirq_high;
+ u32 timer_set;
+ u32 timer_fire;
+ u32 disable_irq;
+ u32 enable_irq;
+ u32 timer_calls;
+ u32 from_posted_msi;
+ u32 stray_irq;
+ int pad[] ____cacheline_aligned;
+};
+
+DECLARE_PER_CPU_ALIGNED(struct irq_mod_state, irq_mod_state);
+
+void irq_moderation_procfs_add(struct irq_desc *desc, umode_t umode);
+void irq_moderation_procfs_remove(struct irq_desc *desc);
+
+#else /* CONFIG_IRQ_SOFT_MODERATION */
+
+static inline void irq_moderation_procfs_add(struct irq_desc *desc, umode_t umode) {}
+static inline void irq_moderation_procfs_remove(struct irq_desc *desc) {}
+
+#endif /* !CONFIG_IRQ_SOFT_MODERATION */
+
+#endif /* _LINUX_IRQ_MODERATION_H */
diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index db714d3014b5f..e5cdade3dbbce 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -134,6 +134,7 @@ static void desc_set_defaults(unsigned int irq, struct irq_desc *desc, int node,
desc->tot_count = 0;
desc->name = NULL;
desc->owner = owner;
+ irq_moderation_init_fields(&desc->mod);
for_each_possible_cpu(cpu)
*per_cpu_ptr(desc->kstat_irqs, cpu) = (struct irqstat) { };
desc_smp_init(desc, node, affinity);
diff --git a/kernel/irq/proc.c b/kernel/irq/proc.c
index 29c2404e743be..5dcbc36b7de1b 100644
--- a/kernel/irq/proc.c
+++ b/kernel/irq/proc.c
@@ -14,6 +14,7 @@
#include <linux/mutex.h>
#include "internals.h"
+#include "irq_moderation.h"
/*
* Access rules:
@@ -374,6 +375,7 @@ void register_irq_proc(unsigned int irq, struct irq_desc *desc)
proc_create_single_data("effective_affinity_list", 0444, desc->dir,
irq_effective_aff_list_proc_show, irqp);
# endif
+ irq_moderation_procfs_add(desc, 0644);
#endif
proc_create_single_data("spurious", 0444, desc->dir,
irq_spurious_proc_show, (void *)(long)irq);
@@ -395,6 +397,7 @@ void unregister_irq_proc(unsigned int irq, struct irq_desc *desc)
remove_proc_entry("effective_affinity", desc->dir);
remove_proc_entry("effective_affinity_list", desc->dir);
# endif
+ irq_moderation_procfs_remove(desc);
#endif
remove_proc_entry("spurious", desc->dir);
--
2.52.0.rc1.455.g30608eb744-goog
* Re: [PATCH v2 2/8] genirq: soft_moderation: add base files, procfs
2025-11-16 18:28 ` [PATCH v2 2/8] genirq: soft_moderation: add base files, procfs Luigi Rizzo
@ 2025-11-17 11:12 ` kernel test robot
2025-11-17 16:01 ` Thomas Gleixner
2025-11-17 21:56 ` kernel test robot
2 siblings, 0 replies; 32+ messages in thread
From: kernel test robot @ 2025-11-17 11:12 UTC (permalink / raw)
To: Luigi Rizzo, Thomas Gleixner, Marc Zyngier, Luigi Rizzo,
Paolo Abeni, Andrew Morton, Sean Christopherson, Jacob Pan
Cc: oe-kbuild-all, Linux Memory Management List, linux-kernel,
linux-arch, Bjorn Helgaas, Willem de Bruijn
Hi Luigi,
kernel test robot noticed the following build warnings:
[auto build test WARNING on tip/irq/core]
[also build test WARNING on tip/x86/core tip/master linus/master v6.18-rc6 next-20251114]
[cannot apply to tip/auto-latest]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Luigi-Rizzo/genirq-platform-wide-interrupt-moderation-Documentation-Kconfig-irq_desc/20251117-023148
base: tip/irq/core
patch link: https://lore.kernel.org/r/20251116182839.939139-3-lrizzo%40google.com
patch subject: [PATCH v2 2/8] genirq: soft_moderation: add base files, procfs
config: sh-allmodconfig (https://download.01.org/0day-ci/archive/20251117/202511171936.CT5XHXPZ-lkp@intel.com/config)
compiler: sh4-linux-gcc (GCC) 15.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251117/202511171936.CT5XHXPZ-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202511171936.CT5XHXPZ-lkp@intel.com/
All warnings (new ones prefixed by >>):
>> kernel/irq/irq_moderation.c:103: warning: ignoring '#pragma clang diagnostic' [-Wunknown-pragmas]
103 | #pragma clang diagnostic error "-Wformat"
In file included from include/linux/uaccess.h:10,
from include/linux/sched/task.h:13,
from include/linux/sched/signal.h:9,
from include/linux/rcuwait.h:6,
from include/linux/percpu-rwsem.h:7,
from include/linux/fs.h:34,
from include/linux/seq_file.h:11,
from kernel/irq/irq_moderation.c:8:
In function 'check_copy_size',
inlined from 'copy_from_user' at include/linux/uaccess.h:207:7,
inlined from 'moderation_write' at kernel/irq/irq_moderation.c:169:6:
include/linux/ucopysize.h:54:25: error: call to '__bad_copy_to' declared with attribute error: copy destination size is too small
54 | __bad_copy_to();
| ^~~~~~~~~~~~~~~
In function 'check_copy_size',
inlined from 'copy_from_user' at include/linux/uaccess.h:207:7,
inlined from 'mode_write' at kernel/irq/irq_moderation.c:223:6:
include/linux/ucopysize.h:54:25: error: call to '__bad_copy_to' declared with attribute error: copy destination size is too small
54 | __bad_copy_to();
| ^~~~~~~~~~~~~~~
vim +103 kernel/irq/irq_moderation.c
102
> 103 #pragma clang diagnostic error "-Wformat"
104 /* Print statistics */
105 static int moderation_show(struct seq_file *p, void *v)
106 {
107 uint delay_us = irq_mod_info.delay_us;
108 int j;
109
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH v2 2/8] genirq: soft_moderation: add base files, procfs
2025-11-16 18:28 ` [PATCH v2 2/8] genirq: soft_moderation: add base files, procfs Luigi Rizzo
2025-11-17 11:12 ` kernel test robot
@ 2025-11-17 16:01 ` Thomas Gleixner
2025-11-17 16:16 ` Luigi Rizzo
2025-11-17 21:56 ` kernel test robot
2 siblings, 1 reply; 32+ messages in thread
From: Thomas Gleixner @ 2025-11-17 16:01 UTC (permalink / raw)
To: Luigi Rizzo, Marc Zyngier, Luigi Rizzo, Paolo Abeni,
Andrew Morton, Sean Christopherson, Jacob Pan
Cc: linux-kernel, linux-arch, Bjorn Helgaas, Willem de Bruijn,
Luigi Rizzo
On Sun, Nov 16 2025 at 18:28, Luigi Rizzo wrote:
> +/* Handlers for /proc/irq/NN/soft_moderation */
> +static int mode_show(struct seq_file *p, void *v)
> +{
> + struct irq_desc *desc = p->private;
> +
> + if (!desc)
> + return -ENOENT;
> +
> + seq_printf(p, "%s irq %u trigger 0x%x %s %smanaged %slazy handle_irq %pB\n",
> + desc->mod.enable ? "on" : "off", desc->irq_data.irq,
> + irqd_get_trigger_type(&desc->irq_data),
> + irqd_is_level_type(&desc->irq_data) ? "Level" : "Edge",
> + irqd_affinity_is_managed(&desc->irq_data) ? "" : "un",
> + irq_settings_disable_unlazy(desc) ? "un" : "", desc->handle_irq
Why are you exposing randomly picked information here? The only thing
what's interesting is whether the interrupt is moderated or not. The
interrupt number is redundant information. And all the other internal
details are available in debugfs already.
> +/* Per-CPU state initialization */
> +static void irq_moderation_percpu_init(void *data)
> +{
> + struct irq_mod_state *ms = this_cpu_ptr(&irq_mod_state);
> +
> + INIT_LIST_HEAD(&ms->descs);
> +}
> +
> +static int cpuhp_setup_cb(uint cpu)
> +{
> + irq_moderation_percpu_init(NULL);
> + return 0;
> +}
> +
> +static void clamp_parameter(uint *dst, uint val)
> +{
> + struct param_names *n = param_names;
> + uint i;
> +
> + for (i = 0; i < ARRAY_SIZE(param_names); i++, n++) {
> + if (dst == n->val) {
> + *dst = clamp(val, n->min, n->max);
> + return;
> + }
> + }
> +}
> +
> +static int __init init_irq_moderation(void)
> +{
> + uint *cur;
> +
> + on_each_cpu(irq_moderation_percpu_init, NULL, 1);
That's pointless. Register the hotplug callback without 'nocalls' and
let the hotplug code handle it.
> + cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN, "moderation:online", cpuhp_setup_cb, NULL);
> +#ifdef CONFIG_IRQ_SOFT_MODERATION
> +
> +/**
> + * struct irq_mod_info - global configuration parameters and state
> + * @total_intrs: running count updated every update_ms
> + * @total_cpus: as above, active CPUs in this interval
> + * @procfs_write_ns: last write to /proc/irq/soft_moderation
> + * @delay_us: fixed delay, or maximum for adaptive
> + * @target_irq_rate: target maximum interrupt rate
> + * @hardirq_percent: target maximum hardirq percentage
> + * @timer_rounds: how many timer polls once moderation fires
> + * @update_ms: how often to update delay/rate/fraction
> + * @scale_cpus: (percent) scale factor to estimate active CPUs
> + * @count_timer_calls: count timer calls for irq limits
> + * @count_msi_calls: count calls from posted_msi for irq limits
> + * @decay_factor: smoothing factor for the control loop, keep at 16
> + * @grow_factor: smoothing factor for the control loop, keep it at 8
> + */
> +struct irq_mod_info {
> + /* These fields are written to by all CPUs */
> + ____cacheline_aligned
> + atomic_long_t total_intrs;
> + atomic_long_t total_cpus;
> +
> + /* These are mostly read (frequently), so use a different cacheline */
> + ____cacheline_aligned
> + u64 procfs_write_ns;
> + uint delay_us;
I asked you last time already to follow the TIP tree documentation, no?
> + uint target_irq_rate;
> + uint hardirq_percent;
> + uint timer_rounds;
> + uint update_ms;
> + uint scale_cpus;
> + uint count_timer_calls;
> + uint count_msi_calls;
> + uint decay_factor;
> + uint grow_factor;
> + uint pad[];
> +};
And again you decided to add these fat data structures all at once with
no usage. I told you last time that this is unreviewable and that stuff
should be introduced gradually with the usage.
Feel free to ignore my feedback, but don't be upset if I ignore your
patches then.
Thanks,
tglx
* Re: [PATCH v2 2/8] genirq: soft_moderation: add base files, procfs
2025-11-17 16:01 ` Thomas Gleixner
@ 2025-11-17 16:16 ` Luigi Rizzo
2025-11-17 19:35 ` Thomas Gleixner
0 siblings, 1 reply; 32+ messages in thread
From: Luigi Rizzo @ 2025-11-17 16:16 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Marc Zyngier, Luigi Rizzo, Paolo Abeni, Andrew Morton,
Sean Christopherson, Jacob Pan, linux-kernel, linux-arch,
Bjorn Helgaas, Willem de Bruijn
On Mon, Nov 17, 2025 at 5:01 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Sun, Nov 16 2025 at 18:28, Luigi Rizzo wrote:
> > +/* Handlers for /proc/irq/NN/soft_moderation */
> > +static int mode_show(struct seq_file *p, void *v)
> > +{
> > + struct irq_desc *desc = p->private;
> > +
> > + if (!desc)
> > + return -ENOENT;
> > +
> > + seq_printf(p, "%s irq %u trigger 0x%x %s %smanaged %slazy handle_irq %pB\n",
> > + desc->mod.enable ? "on" : "off", desc->irq_data.irq,
> > + irqd_get_trigger_type(&desc->irq_data),
> > + irqd_is_level_type(&desc->irq_data) ? "Level" : "Edge",
> > + irqd_affinity_is_managed(&desc->irq_data) ? "" : "un",
> > + irq_settings_disable_unlazy(desc) ? "un" : "", desc->handle_irq
>
> Why are you exposing randomly picked information here? The only thing
> what's interesting is whether the interrupt is moderated or not. The
> interrupt number is redundant information. And all the other internal
> details are available in debugfs already.
ah, sorry, forgot to clean up this internal debugging code.
The previous version had only the enable.
>
> >
> > +static int __init init_irq_moderation(void)
> > +{
> > + uint *cur;
> > +
> > + on_each_cpu(irq_moderation_percpu_init, NULL, 1);
>
> That's pointless. Register the hotplug callback without 'nocalls' and
> let the hotplug code handle it.
Sounds good. I have a question on event ordering.
Which event should I use to make sure the callback runs
before interrupts are enabled on the new CPU ?
> > ...
> I asked you last time already to follow the TIP tree documentation, no?
>
> > + uint target_irq_rate;
> > + uint hardirq_percent;
> > + uint timer_rounds;
> > + uint update_ms;
> > + uint scale_cpus;
> > + uint count_timer_calls;
> > + uint count_msi_calls;
> > + uint decay_factor;
> > + uint grow_factor;
> > + uint pad[];
> > +};
>
> And again you decided to add these fat data structures all in once with
> no usage. I told you last time that this is unreviewable and that stuff
> should be introduced gradually with the usage.
Ok, will do.
FWIW my goal was to get the telemetry functions in the first patch, and reduce
the clutter in subsequent patches, since each new field would create many
chunks (docstring, struct field, init, print format and value).
cheers
luigi
* Re: [PATCH v2 2/8] genirq: soft_moderation: add base files, procfs
2025-11-17 16:16 ` Luigi Rizzo
@ 2025-11-17 19:35 ` Thomas Gleixner
0 siblings, 0 replies; 32+ messages in thread
From: Thomas Gleixner @ 2025-11-17 19:35 UTC (permalink / raw)
To: Luigi Rizzo
Cc: Marc Zyngier, Luigi Rizzo, Paolo Abeni, Andrew Morton,
Sean Christopherson, Jacob Pan, linux-kernel, linux-arch,
Bjorn Helgaas, Willem de Bruijn
On Mon, Nov 17 2025 at 17:16, Luigi Rizzo wrote:
> On Mon, Nov 17, 2025 at 5:01 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>> > +static int __init init_irq_moderation(void)
>> > +{
>> > + uint *cur;
>> > +
>> > + on_each_cpu(irq_moderation_percpu_init, NULL, 1);
>>
>> That's pointless. Register the hotplug callback without 'nocalls' and
>> let the hotplug code handle it.
>
> Sounds good. I have a question on event ordering.
> Which event should I use to make sure the callback runs
> before interrupts are enabled on the new CPU ?
See include/linux/cpuhotplug.h
But see my other reply.
>> > ...
>> I asked you last time already to follow the TIP tree documentation, no?
>>
>> > + uint target_irq_rate;
>> > + uint hardirq_percent;
>> > + uint timer_rounds;
>> > + uint update_ms;
>> > + uint scale_cpus;
>> > + uint count_timer_calls;
>> > + uint count_msi_calls;
>> > + uint decay_factor;
>> > + uint grow_factor;
>> > + uint pad[];
>> > +};
>>
>> And again you decided to add these fat data structures all in once with
>> no usage. I told you last time that this is unreviewable and that stuff
>> should be introduced gradually with the usage.
>
> Ok, will do.
> FWIW my goal was to get the telemetry functions in the first patch, and reduce
> the clutter in subsequent patches, since each new field would create many
> chunks (docstring, struct field, init, print format and value).
TBH, I'm not convinced at all that you need all of this telemetry maze.
That looks pretty overengineered and that can be added on top of a
functional and reviewable base implementation.
Thanks,
tglx
* Re: [PATCH v2 2/8] genirq: soft_moderation: add base files, procfs
2025-11-16 18:28 ` [PATCH v2 2/8] genirq: soft_moderation: add base files, procfs Luigi Rizzo
2025-11-17 11:12 ` kernel test robot
2025-11-17 16:01 ` Thomas Gleixner
@ 2025-11-17 21:56 ` kernel test robot
2 siblings, 0 replies; 32+ messages in thread
From: kernel test robot @ 2025-11-17 21:56 UTC (permalink / raw)
To: Luigi Rizzo, Thomas Gleixner, Marc Zyngier, Luigi Rizzo,
Paolo Abeni, Andrew Morton, Sean Christopherson, Jacob Pan
Cc: oe-kbuild-all, Linux Memory Management List, linux-kernel,
linux-arch, Bjorn Helgaas, Willem de Bruijn
Hi Luigi,
kernel test robot noticed the following build errors:
[auto build test ERROR on tip/irq/core]
[also build test ERROR on tip/x86/core tip/master linus/master v6.18-rc6 next-20251117]
[cannot apply to tip/auto-latest]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Luigi-Rizzo/genirq-platform-wide-interrupt-moderation-Documentation-Kconfig-irq_desc/20251117-023148
base: tip/irq/core
patch link: https://lore.kernel.org/r/20251116182839.939139-3-lrizzo%40google.com
patch subject: [PATCH v2 2/8] genirq: soft_moderation: add base files, procfs
config: sh-allmodconfig (https://download.01.org/0day-ci/archive/20251118/202511180547.snSezdSu-lkp@intel.com/config)
compiler: sh4-linux-gcc (GCC) 15.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251118/202511180547.snSezdSu-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202511180547.snSezdSu-lkp@intel.com/
All errors (new ones prefixed by >>):
kernel/irq/irq_moderation.c:103: warning: ignoring '#pragma clang diagnostic' [-Wunknown-pragmas]
103 | #pragma clang diagnostic error "-Wformat"
In file included from include/linux/uaccess.h:10,
from include/linux/sched/task.h:13,
from include/linux/sched/signal.h:9,
from include/linux/rcuwait.h:6,
from include/linux/percpu-rwsem.h:7,
from include/linux/fs.h:34,
from include/linux/seq_file.h:11,
from kernel/irq/irq_moderation.c:8:
In function 'check_copy_size',
inlined from 'copy_from_user' at include/linux/uaccess.h:207:7,
inlined from 'moderation_write' at kernel/irq/irq_moderation.c:169:6:
>> include/linux/ucopysize.h:54:25: error: call to '__bad_copy_to' declared with attribute error: copy destination size is too small
54 | __bad_copy_to();
| ^~~~~~~~~~~~~~~
In function 'check_copy_size',
inlined from 'copy_from_user' at include/linux/uaccess.h:207:7,
inlined from 'mode_write' at kernel/irq/irq_moderation.c:223:6:
>> include/linux/ucopysize.h:54:25: error: call to '__bad_copy_to' declared with attribute error: copy destination size is too small
54 | __bad_copy_to();
| ^~~~~~~~~~~~~~~
vim +/__bad_copy_to +54 include/linux/ucopysize.h
808aac63e2bdf9 Kees Cook 2025-02-28 43
808aac63e2bdf9 Kees Cook 2025-02-28 44 static __always_inline __must_check bool
808aac63e2bdf9 Kees Cook 2025-02-28 45 check_copy_size(const void *addr, size_t bytes, bool is_source)
808aac63e2bdf9 Kees Cook 2025-02-28 46 {
808aac63e2bdf9 Kees Cook 2025-02-28 47 int sz = __builtin_object_size(addr, 0);
808aac63e2bdf9 Kees Cook 2025-02-28 48 if (unlikely(sz >= 0 && sz < bytes)) {
808aac63e2bdf9 Kees Cook 2025-02-28 49 if (!__builtin_constant_p(bytes))
808aac63e2bdf9 Kees Cook 2025-02-28 50 copy_overflow(sz, bytes);
808aac63e2bdf9 Kees Cook 2025-02-28 51 else if (is_source)
808aac63e2bdf9 Kees Cook 2025-02-28 52 __bad_copy_from();
808aac63e2bdf9 Kees Cook 2025-02-28 53 else
808aac63e2bdf9 Kees Cook 2025-02-28 @54 __bad_copy_to();
808aac63e2bdf9 Kees Cook 2025-02-28 55 return false;
808aac63e2bdf9 Kees Cook 2025-02-28 56 }
808aac63e2bdf9 Kees Cook 2025-02-28 57 if (WARN_ON_ONCE(bytes > INT_MAX))
808aac63e2bdf9 Kees Cook 2025-02-28 58 return false;
808aac63e2bdf9 Kees Cook 2025-02-28 59 check_object_size(addr, bytes, is_source);
808aac63e2bdf9 Kees Cook 2025-02-28 60 return true;
808aac63e2bdf9 Kees Cook 2025-02-28 61 }
808aac63e2bdf9 Kees Cook 2025-02-28 62
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* [PATCH v2 3/8] genirq: soft_moderation: implement fixed moderation
2025-11-16 18:28 [PATCH v2 0/8] platform wide software interrupt moderation Luigi Rizzo
2025-11-16 18:28 ` [PATCH v2 1/8] genirq: platform wide interrupt moderation: Documentation, Kconfig, irq_desc Luigi Rizzo
2025-11-16 18:28 ` [PATCH v2 2/8] genirq: soft_moderation: add base files, procfs Luigi Rizzo
@ 2025-11-16 18:28 ` Luigi Rizzo
2025-11-17 19:30 ` Thomas Gleixner
2025-11-16 18:28 ` [PATCH v2 4/8] genirq: soft_moderation: implement adaptive moderation Luigi Rizzo
` (4 subsequent siblings)
7 siblings, 1 reply; 32+ messages in thread
From: Luigi Rizzo @ 2025-11-16 18:28 UTC (permalink / raw)
To: Thomas Gleixner, Marc Zyngier, Luigi Rizzo, Paolo Abeni,
Andrew Morton, Sean Christopherson, Jacob Pan
Cc: linux-kernel, linux-arch, Bjorn Helgaas, Willem de Bruijn,
Luigi Rizzo
Add hooks into handle_irq_event() to implement fixed moderation.
Moderation needs to be explicitly enabled on individual IRQs.
When an IRQ with moderation enabled fires, start an hrtimer for
irq_moderation.delay_us. While the timer is active, all IRQs with
moderation enabled will be disabled after running the handler. When the
timer fires, re-enable all disabled IRQs.
Example (kernel built with CONFIG_IRQ_SOFT_MODERATION=y)
# enable fixed moderation
echo "delay_us=400" > /proc/irq/soft_moderation
# enable on network interrupts (change name as appropriate)
echo on | tee /proc/irq/*/*eth*/../soft_moderation
# show it works by looking at counters in /proc/irq/soft_moderation
cat /proc/irq/soft_moderation
# Show runtime impact on ping times changing delay_us
ping -n -f -q -c 1000 ${some_nearby_host}
echo "delay_us=100" > /proc/irq/soft_moderation
ping -n -f -q -c 1000 ${some_nearby_host}
Configure via module parameters (irq_moderation.${name}=${value}) or
procfs (echo "${name}=${value}" > /proc/irq/soft_moderation):
delay_us 0=off, range 1-500, default 0
How long an interrupt is disabled after it fires. Small values are
accumulated until they are large enough, e.g. 10us (see the sketch after
this list). As an example, delay_us=2 means that the timer is set only
every 5 interrupts.
timer_rounds 0-20, default 0
How many extra timer runs before re-enabling interrupts. This
reduces interrupts to one every (timer_rounds * delay_us). Similar
to the "napi_defer_hard_irqs" option in NAPI, but with some subtle
differences (e.g. here the number of rounds is deterministic, and
interrupts are disabled at MSI level).
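A minimal sketch of the delay-accumulation rule mentioned under delay_us
above; it mirrors what irq_moderation_needed() in this patch does, with
should_moderate() being an illustrative name and 10us the threshold
hardcoded there:

	/*
	 * Per-interrupt delays smaller than a minimum are summed up;
	 * moderation (disable the irq, arm the hrtimer) only starts once
	 * the accumulated sleep time is worth a timer shot.
	 */
	static bool should_moderate(unsigned int *sleep_ns, unsigned int mod_ns)
	{
		const unsigned int min_delay_ns = 10000;	/* 10us */

		*sleep_ns += mod_ns;
		if (*sleep_ns < min_delay_ns)
			return false;	/* too small, let this irq through */
		return true;		/* disable the irq and arm the timer */
	}

With delay_us=2 (mod_ns = 2000), the accumulated sleep time reaches the
10us threshold after 5 interrupts, matching the example above.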
Change-Id: Ic52e95f4128da5e2964afdae0ba3cf9c5c3d732e
Signed-off-by: Luigi Rizzo <lrizzo@google.com>
---
kernel/irq/handle.c | 3 ++
kernel/irq/irq_moderation.c | 86 +++++++++++++++++++++++++++++-
kernel/irq/irq_moderation.h | 102 ++++++++++++++++++++++++++++++++++++
3 files changed, 189 insertions(+), 2 deletions(-)
diff --git a/kernel/irq/handle.c b/kernel/irq/handle.c
index e103451243a0b..2f8d828ed143e 100644
--- a/kernel/irq/handle.c
+++ b/kernel/irq/handle.c
@@ -19,6 +19,7 @@
#include <trace/events/irq.h>
#include "internals.h"
+#include "irq_moderation.h"
#ifdef CONFIG_GENERIC_IRQ_MULTI_HANDLER
void (*handle_arch_irq)(struct pt_regs *) __ro_after_init;
@@ -254,9 +255,11 @@ irqreturn_t handle_irq_event(struct irq_desc *desc)
irqd_set(&desc->irq_data, IRQD_IRQ_INPROGRESS);
raw_spin_unlock(&desc->lock);
+ irq_moderation_hook(desc);
ret = handle_irq_event_percpu(desc);
raw_spin_lock(&desc->lock);
+ irq_moderation_epilogue(desc);
irqd_clear(&desc->irq_data, IRQD_IRQ_INPROGRESS);
return ret;
}
diff --git a/kernel/irq/irq_moderation.c b/kernel/irq/irq_moderation.c
index 3a907b8f65698..688c8e8c49392 100644
--- a/kernel/irq/irq_moderation.c
+++ b/kernel/irq/irq_moderation.c
@@ -56,6 +56,10 @@
* napi_defer_hard_irqs on NICs.
* A small value may help control load in interrupt-challenged platforms.
*
+ *
+ * update_ms (default 5, range 1...100)
+ * How often moderation delay is updated.
+ *
* Moderation can be enabled/disabled for individual interrupts with
*
* echo "on" > /proc/irq/NN/soft_moderation # use "off" to disable
@@ -66,7 +70,10 @@
*
*/
-struct irq_mod_info irq_mod_info ____cacheline_aligned;
+/* Recommended values for the control loop. */
+struct irq_mod_info irq_mod_info ____cacheline_aligned = {
+ .update_ms = 5,
+};
/* Boot time value, copied to irq_mod_info.delay_us after init. */
static uint mod_delay_us;
@@ -76,8 +83,66 @@ MODULE_PARM_DESC(delay_us, "Max moderation delay us, 0 = moderation off, range 0
module_param_named(timer_rounds, irq_mod_info.timer_rounds, uint, 0444);
MODULE_PARM_DESC(timer_rounds, "How many timer polls once moderation triggers, range 0-20.");
+module_param_named(update_ms, irq_mod_info.update_ms, uint, 0444);
+MODULE_PARM_DESC(update_ms, "Update interval in milliseconds, range 1-100");
+
DEFINE_PER_CPU_ALIGNED(struct irq_mod_state, irq_mod_state);
+DEFINE_STATIC_KEY_FALSE(irq_moderation_enabled_key);
+EXPORT_SYMBOL(irq_moderation_enabled_key);
+
+static inline void smooth_avg(u32 *dst, u32 val, u32 steps)
+{
+ *dst = ((64 - steps) * *dst + steps * val) / 64;
+}
+
+/* Moderation timer handler. */
+static enum hrtimer_restart timer_cb(struct hrtimer *timer)
+{
+ struct irq_mod_state *ms = this_cpu_ptr(&irq_mod_state);
+ struct irq_desc *desc, *next;
+ uint srcs = 0;
+
+ ms->timer_fire++;
+ WARN_ONCE(ms->timer_set != ms->timer_fire,
+ "CPU %d timer set %d fire %d (lost events?)\n",
+ smp_processor_id(), ms->timer_set, ms->timer_fire);
+
+ ms->rounds_left--;
+
+ if (ms->rounds_left > 0) {
+ /* Timer still alive, just call the handlers. */
+ list_for_each_entry_safe(desc, next, &ms->descs, mod.ms_node) {
+ ms->irq_count += irq_mod_info.count_timer_calls;
+ ms->timer_calls++;
+ handle_irq_event_percpu(desc);
+ }
+ ms->timer_set++;
+ hrtimer_forward_now(&ms->timer, ms->sleep_ns);
+ return HRTIMER_RESTART;
+ }
+
+ /* Last round, remove from list and enable_irq(). */
+ list_for_each_entry_safe(desc, next, &ms->descs, mod.ms_node) {
+ list_del(&desc->mod.ms_node);
+ INIT_LIST_HEAD(&desc->mod.ms_node);
+ srcs++;
+ ms->enable_irq++;
+ enable_irq(desc->irq_data.irq);
+ }
+ smooth_avg(&ms->scaled_src_count, srcs * 256, 1);
+
+ /* Prepare to accumulate next moderation delay. */
+ ms->sleep_ns = 0;
+
+ WARN_ONCE(ms->disable_irq != ms->enable_irq,
+ "CPU %d irq disable %d enable %d (%s)\n",
+ smp_processor_id(), ms->disable_irq, ms->enable_irq,
+ "bookkeeping error, some irq will be stuck");
+
+ return HRTIMER_NORESTART;
+}
+
/* Initialize moderation state, used in desc_set_defaults() */
void irq_moderation_init_fields(struct irq_desc_mod *mod)
{
@@ -153,11 +218,24 @@ struct param_names {
static struct param_names param_names[] = {
{ "delay_us", &irq_mod_info.delay_us, 0, 500 },
{ "timer_rounds", &irq_mod_info.timer_rounds, 0, 20 },
+ { "update_ms", &irq_mod_info.update_ms, 1, 100 },
/* Empty entry indicates the following are not settable from procfs. */
{},
- { "update_ms", &irq_mod_info.update_ms, 1, 100 },
+ { "count_timer_calls", &irq_mod_info.count_timer_calls, 0, 1 },
};
+static void update_enable_key(void)
+{
+ bool newval = irq_mod_info.delay_us != 0;
+
+ if (newval != static_key_enabled(&irq_moderation_enabled_key)) {
+ if (newval)
+ static_branch_enable(&irq_moderation_enabled_key);
+ else
+ static_branch_disable(&irq_moderation_enabled_key);
+ }
+}
+
static ssize_t moderation_write(struct file *f, const char __user *buf, size_t count, loff_t *ppos)
{
struct param_names *n = param_names;
@@ -176,6 +254,8 @@ static ssize_t moderation_write(struct file *f, const char __user *buf, size_t c
if (kstrtouint(cmd + l + 1, 0, &val))
return -EINVAL;
WRITE_ONCE(*(n->val), clamp(val, n->min, n->max));
+ if (n->val == &irq_mod_info.delay_us)
+ update_enable_key();
/* Record last parameter change, for use in the control loop. */
irq_mod_info.procfs_write_ns = ktime_get_ns();
return count;
@@ -258,6 +338,7 @@ static void irq_moderation_percpu_init(void *data)
{
struct irq_mod_state *ms = this_cpu_ptr(&irq_mod_state);
+ hrtimer_setup(&ms->timer, timer_cb, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED_HARD);
INIT_LIST_HEAD(&ms->descs);
}
@@ -293,6 +374,7 @@ static int __init init_irq_moderation(void)
/* Finally, set delay_us to enable moderation if needed. */
clamp_parameter(&irq_mod_info.delay_us, mod_delay_us);
+ update_enable_key();
proc_create_data("irq/soft_moderation", 0644, NULL, &proc_ops, NULL);
return 0;
diff --git a/kernel/irq/irq_moderation.h b/kernel/irq/irq_moderation.h
index ccb8193482b51..d9bc672ffd6f1 100644
--- a/kernel/irq/irq_moderation.h
+++ b/kernel/irq/irq_moderation.h
@@ -136,11 +136,113 @@ struct irq_mod_state {
DECLARE_PER_CPU_ALIGNED(struct irq_mod_state, irq_mod_state);
+extern struct static_key_false irq_moderation_enabled_key;
+
+/* Called on each interrupt for adaptive moderation delay adjustment. */
+static inline void irq_moderation_adjust_delay(struct irq_mod_state *ms)
+{
+ u64 now, delta_time;
+
+ ms->irq_count++;
+ /* ktime_get_ns() is expensive, don't do too often */
+ if (ms->irq_count & 0xf)
+ return;
+ now = ktime_get_ns();
+ delta_time = now - ms->last_ns;
+
+ /* Run approximately every update_ns, a little bit early is ok. */
+ if (delta_time < ms->update_ns - 5000)
+ return;
+
+ ms->update_ns = READ_ONCE(irq_mod_info.update_ms) * NSEC_PER_MSEC;
+ ms->delay_ns = READ_ONCE(irq_mod_info.delay_us) * NSEC_PER_USEC;
+
+ ms->mod_ns = ms->delay_ns;
+}
+
+/* Return true if timer is active or delay is large enough to require moderation */
+static inline bool irq_moderation_needed(struct irq_mod_state *ms)
+{
+ const u32 min_delay_ns = 10000;
+
+ if (!hrtimer_is_queued(&ms->timer)) {
+ /* accumulate sleep time, no moderation if too small */
+ ms->sleep_ns += ms->mod_ns;
+ if (ms->sleep_ns < min_delay_ns)
+ return false;
+ }
+ return true;
+}
+
+void disable_irq_nosync(unsigned int irq);
+
+/*
+ * Use in handle_irq_event() before calling the handler. Decide whether this
+ * desc should be moderated, and if so, disable the irq and add the desc to
+ * the list for this CPU.
+ */
+static inline void irq_moderation_hook(struct irq_desc *desc)
+{
+ struct irq_mod_state *ms = this_cpu_ptr(&irq_mod_state);
+
+ if (!static_branch_unlikely(&irq_moderation_enabled_key))
+ return;
+
+ if (!READ_ONCE(desc->mod.enable))
+ return;
+
+ irq_moderation_adjust_delay(ms);
+
+ if (!list_empty(&desc->mod.ms_node)) {
+ /*
+ * Very unlikely, stray interrupt while the desc is moderated.
+ * We cannot ignore it or we may miss events, but do count it.
+ */
+ ms->stray_irq++;
+ return;
+ }
+
+ if (!irq_moderation_needed(ms))
+ return;
+
+ /* Add to list of moderated desc on this CPU */
+ list_add(&desc->mod.ms_node, &ms->descs);
+ /*
+ * Disable the irq. This will also cause irq_can_handle() to return false
+ * (through irq_can_handle_actions()), and that will prevent a handler
+ * instance from being run again while the descriptor is being moderated.
+ *
+ * irq_moderation_epilogue() will then start the timer if needed.
+ */
+ ms->disable_irq++;
+ disable_irq_nosync(desc->irq_data.irq);
+}
+
+static inline void irq_moderation_start_timer(struct irq_mod_state *ms)
+{
+ ms->timer_set++;
+ ms->rounds_left = READ_ONCE(irq_mod_info.timer_rounds) + 1;
+ hrtimer_start_range_ns(&ms->timer, ns_to_ktime(ms->sleep_ns),
+ /*range*/2000, HRTIMER_MODE_REL_PINNED_HARD);
+}
+
+/* After the handler, if desc is moderated, make sure the timer is active. */
+static inline void irq_moderation_epilogue(const struct irq_desc *desc)
+{
+ struct irq_mod_state *ms = this_cpu_ptr(&irq_mod_state);
+
+ if (!list_empty(&desc->mod.ms_node) && !hrtimer_is_queued(&ms->timer))
+ irq_moderation_start_timer(ms);
+}
+
void irq_moderation_procfs_add(struct irq_desc *desc, umode_t umode);
void irq_moderation_procfs_remove(struct irq_desc *desc);
#else /* CONFIG_IRQ_SOFT_MODERATION */
+static inline void irq_moderation_hook(struct irq_desc *desc) {}
+static inline void irq_moderation_epilogue(const struct irq_desc *desc) {}
+
static inline void irq_moderation_procfs_add(struct irq_desc *desc, umode_t umode) {}
static inline void irq_moderation_procfs_remove(struct irq_desc *desc) {}
--
2.52.0.rc1.455.g30608eb744-goog
* Re: [PATCH v2 3/8] genirq: soft_moderation: implement fixed moderation
2025-11-16 18:28 ` [PATCH v2 3/8] genirq: soft_moderation: implement fixed moderation Luigi Rizzo
@ 2025-11-17 19:30 ` Thomas Gleixner
2025-11-17 23:16 ` Thomas Gleixner
0 siblings, 1 reply; 32+ messages in thread
From: Thomas Gleixner @ 2025-11-17 19:30 UTC (permalink / raw)
To: Luigi Rizzo, Marc Zyngier, Luigi Rizzo, Paolo Abeni,
Andrew Morton, Sean Christopherson, Jacob Pan
Cc: linux-kernel, linux-arch, Bjorn Helgaas, Willem de Bruijn,
Luigi Rizzo
On Sun, Nov 16 2025 at 18:28, Luigi Rizzo wrote:
>
> +DEFINE_STATIC_KEY_FALSE(irq_moderation_enabled_key);
> +EXPORT_SYMBOL(irq_moderation_enabled_key);
Why does this need to be exported? No code outside of the interrupt core
has to know about this and the core code is always built-in.
> +static inline void smooth_avg(u32 *dst, u32 val, u32 steps)
> +{
> + *dst = ((64 - steps) * *dst + steps * val) / 64;
How is this correct for steps > 64?
Also this is unreadable. Just make this
u32 foo(u32 cur, u32 val, u32 steps)
{
return .....;
}
and add proper documentation of what this function is about and the
expected and valid input parameters.
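For illustration, one possible shape (the name, the comment and the
clamp are an example, not a requirement):

  /*
   * Exponentially weighted moving average with weight steps/64.
   * Valid input: 0 <= steps <= 64; larger values would make the
   * (64 - steps) term wrap around, hence the clamp.
   */
  static u32 scaled_avg(u32 cur, u32 val, u32 steps)
  {
          steps = min(steps, 64U);
          return ((64 - steps) * cur + steps * val) / 64;
  }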
> +/* Moderation timer handler. */
> +static enum hrtimer_restart timer_cb(struct hrtimer *timer)
> +{
> + struct irq_mod_state *ms = this_cpu_ptr(&irq_mod_state);
> + struct irq_desc *desc, *next;
> + uint srcs = 0;
> +
> + ms->timer_fire++;
> + WARN_ONCE(ms->timer_set != ms->timer_fire,
> + "CPU %d timer set %d fire %d (lost events?)\n",
> + smp_processor_id(), ms->timer_set, ms->timer_fire);
This whole paranoid debug code is required because this code is so
convoluted that it can't be validated for correctness, right?
> + ms->rounds_left--;
> +
> + if (ms->rounds_left > 0) {
> + /* Timer still alive, just call the handlers. */
> + list_for_each_entry_safe(desc, next, &ms->descs, mod.ms_node) {
> + ms->irq_count += irq_mod_info.count_timer_calls;
> + ms->timer_calls++;
> + handle_irq_event_percpu(desc);
I told you before that running the handler w/o the INPROGRESS flag set
is inconsistent state, no?
But what's worse is that this completely ignores the real disabled
state.
interrupt()
    ....
    moderate()            -> disable depth = 1
    ...
    Driver code
        disable_irq()     -> disable depth = 2

timer()
    handle_irq_event()    -> BUG
And the more I think about it the more I'm convinced that abusing the
disabled state is the wrong thing to do. The interrupt is definitely not
disabled at all and all you create is inconsistent state.
So no. That needs more thought and a different mechanism to reflect that
moderated state.
> + }
> + ms->timer_set++;
> + hrtimer_forward_now(&ms->timer, ms->sleep_ns);
> + return HRTIMER_RESTART;
> + }
> +
> + /* Last round, remove from list and enable_irq(). */
> + list_for_each_entry_safe(desc, next, &ms->descs, mod.ms_node) {
> + list_del(&desc->mod.ms_node);
> + INIT_LIST_HEAD(&desc->mod.ms_node);
list_del_init();
> + srcs++;
> + ms->enable_irq++;
> + enable_irq(desc->irq_data.irq);
> + }
> + smooth_avg(&ms->scaled_src_count, srcs * 256, 1);
> +
> + /* Prepare to accumulate next moderation delay. */
> + ms->sleep_ns = 0;
> +
> + WARN_ONCE(ms->disable_irq != ms->enable_irq,
> + "CPU %d irq disable %d enable %d (%s)\n",
> + smp_processor_id(), ms->disable_irq, ms->enable_irq,
> + "bookkeeping error, some irq will be stuck");
Oh well....
> + return HRTIMER_NORESTART;
> +}
> +static void update_enable_key(void)
> +{
> + bool newval = irq_mod_info.delay_us != 0;
> +
> + if (newval != static_key_enabled(&irq_moderation_enabled_key)) {
Pointless exercise. static_branch_enable/disable() handle that already
> + if (newval)
> + static_branch_enable(&irq_moderation_enabled_key);
> + else
> + static_branch_disable(&irq_moderation_enabled_key);
> + }
> +}
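IOW, assuming the delay_us semantics stay as in the patch, the whole
helper collapses to:

  static void update_enable_key(void)
  {
          if (READ_ONCE(irq_mod_info.delay_us))
                  static_branch_enable(&irq_moderation_enabled_key);
          else
                  static_branch_disable(&irq_moderation_enabled_key);
  }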
> +
> static ssize_t moderation_write(struct file *f, const char __user *buf, size_t count, loff_t *ppos)
> {
> struct param_names *n = param_names;
> @@ -176,6 +254,8 @@ static ssize_t moderation_write(struct file *f, const char __user *buf, size_t c
> if (kstrtouint(cmd + l + 1, 0, &val))
> return -EINVAL;
> WRITE_ONCE(*(n->val), clamp(val, n->min, n->max));
> + if (n->val == &irq_mod_info.delay_us)
> + update_enable_key();
> /* Record last parameter change, for use in the control loop. */
> irq_mod_info.procfs_write_ns = ktime_get_ns();
> return count;
> @@ -258,6 +338,7 @@ static void irq_moderation_percpu_init(void *data)
> {
> struct irq_mod_state *ms = this_cpu_ptr(&irq_mod_state);
>
> + hrtimer_setup(&ms->timer, timer_cb, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED_HARD);
> INIT_LIST_HEAD(&ms->descs);
I'm not convinced that you need that online callback at all. None of
this has to run on the target CPU. All of that can be done from the boot
CPU in a for_each_possible_cpu() loop. It's a one time initialization, no?
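Roughly (a sketch assuming the per-CPU layout from the patch, rest of
the function elided):

  static int __init init_irq_moderation(void)
  {
          int cpu;

          for_each_possible_cpu(cpu) {
                  struct irq_mod_state *ms = per_cpu_ptr(&irq_mod_state, cpu);

                  hrtimer_setup(&ms->timer, timer_cb, CLOCK_MONOTONIC,
                                HRTIMER_MODE_REL_PINNED_HARD);
                  INIT_LIST_HEAD(&ms->descs);
          }
          ...
  }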
> }
>
> @@ -293,6 +374,7 @@ static int __init init_irq_moderation(void)
>
> /* Finally, set delay_us to enable moderation if needed. */
> clamp_parameter(&irq_mod_info.delay_us, mod_delay_us);
> + update_enable_key();
>
> proc_create_data("irq/soft_moderation", 0644, NULL, &proc_ops, NULL);
> return 0;
> diff --git a/kernel/irq/irq_moderation.h b/kernel/irq/irq_moderation.h
> index ccb8193482b51..d9bc672ffd6f1 100644
> --- a/kernel/irq/irq_moderation.h
> +++ b/kernel/irq/irq_moderation.h
> @@ -136,11 +136,113 @@ struct irq_mod_state {
>
> DECLARE_PER_CPU_ALIGNED(struct irq_mod_state, irq_mod_state);
>
> +extern struct static_key_false irq_moderation_enabled_key;
> +
> +/* Called on each interrupt for adaptive moderation delay adjustment. */
> +static inline void irq_moderation_adjust_delay(struct irq_mod_state *ms)
> +{
> + u64 now, delta_time;
> +
> + ms->irq_count++;
> + /* ktime_get_ns() is expensive, don't do too often */
> + if (ms->irq_count & 0xf)
Magic number and what's wrong with using %?
> + return;
> + now = ktime_get_ns();
> + delta_time = now - ms->last_ns;
> +
> + /* Run approximately every update_ns, a little bit early is ok. */
> + if (delta_time < ms->update_ns - 5000)
Again. These hardcoded magic numbers all over the place without any
rationale are annoying.
> + return;
> +
> + ms->update_ns = READ_ONCE(irq_mod_info.update_ms) * NSEC_PER_MSEC;
> + ms->delay_ns = READ_ONCE(irq_mod_info.delay_us) * NSEC_PER_USEC;
> +
> + ms->mod_ns = ms->delay_ns;
So you need to read time and do an incomprehensible comparison in order
to store the same values over and over? Both values are initialized
during init and cannot be changed afterwards. Oh well...
> +}
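Purely to illustrate the point, something like this, where the names and
the rationale are placeholders the author has to fill in:

  /* Read the clock only every 16 interrupts, it is expensive. */
  #define IRQ_MOD_TIME_CHECK_MASK         0xf
  /* Allow the periodic update to run a bit early (ns). */
  #define IRQ_MOD_UPDATE_SLACK_NS         5000

          if (ms->irq_count & IRQ_MOD_TIME_CHECK_MASK)
                  return;
          ...
          if (delta_time < ms->update_ns - IRQ_MOD_UPDATE_SLACK_NS)
                  return;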
> +
> +/* Return true if timer is active or delay is large enough to require moderation */
> +static inline bool irq_moderation_needed(struct irq_mod_state *ms)
> +{
> + const u32 min_delay_ns = 10000;
Yet another magic number....
> + if (!hrtimer_is_queued(&ms->timer)) {
> + /* accumulate sleep time, no moderation if too small */
Sentences still start with an uppercase letter.
> + ms->sleep_ns += ms->mod_ns;
How is this accumulating sleep time? It's adding the configured delay
value (nanoseconds) to ms->sleep_ns unconditionally on every interrupt
which is handled on a CPU. It has no relation to time at all.
> + if (ms->sleep_ns < min_delay_ns)
> + return false;
> + }
> + return true;
> +}
> +
> +void disable_irq_nosync(unsigned int irq);
> +
> +/*
> + * Use in handle_irq_event() before calling the handler. Decide whether this
> + * desc should be moderated, and if so, disable the irq and add the desc to
> + * the list for this CPU.
> + */
> +static inline void irq_moderation_hook(struct irq_desc *desc)
> +{
> + struct irq_mod_state *ms = this_cpu_ptr(&irq_mod_state);
> +
> + if (!static_branch_unlikely(&irq_moderation_enabled_key))
> + return;
> +
> + if (!READ_ONCE(desc->mod.enable))
> + return;
> +
> + irq_moderation_adjust_delay(ms);
> +
> + if (!list_empty(&desc->mod.ms_node)) {
> + /*
> + * Very unlikely, stray interrupt while the desc is moderated.
Moderating 'the desc'? The interrupt is moderated, no?
> + * We cannot ignore it or we may miss events, but do count it.
This comment still does not make sense. What can't you ignore and how
does counting and returning prevent missing an event?
> + */
> + ms->stray_irq++;
> + return;
> + }
> +
> + if (!irq_moderation_needed(ms))
> + return;
> +
> + /* Add to list of moderated desc on this CPU */
> + list_add(&desc->mod.ms_node, &ms->descs);
Right, and if this CPU is hot-unplugged afterwards then this interrupt is
dead forever. If the timer is still armed it is migrated to an online
CPU...
Also what protects the list against a teardown of an interrupt which is
enqueued in that list? Nothing at all, which means there is a UAF
built-in.
> + /*
> + * Disable the irq. This will also cause irq_can_handle() to return false
s/irq/interrupt/ Use proper words please.
> + * (through irq_can_handle_actions()), and that will prevent a handler
> + * instance from being run again while the descriptor is being moderated.
> + *
> + * irq_moderation_epilogue() will then start the timer if needed.
It's needed unconditionally after adding it to the list and disabling
it. So why can't you do it right here?
> + */
> + ms->disable_irq++;
> + disable_irq_nosync(desc->irq_data.irq);
> +}
> +
> +static inline void irq_moderation_start_timer(struct irq_mod_state *ms)
> +{
> + ms->timer_set++;
> + ms->rounds_left = READ_ONCE(irq_mod_info.timer_rounds) + 1;
> + hrtimer_start_range_ns(&ms->timer, ns_to_ktime(ms->sleep_ns),
> + /*range*/2000, HRTIMER_MODE_REL_PINNED_HARD);
This glued in comment is better than a properly named constant, right?
> +}
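I.e. (name assumed):

  /* Allowed slack for the moderation timer expiry. */
  #define IRQ_MOD_TIMER_SLACK_NS  2000

          hrtimer_start_range_ns(&ms->timer, ns_to_ktime(ms->sleep_ns),
                                 IRQ_MOD_TIMER_SLACK_NS,
                                 HRTIMER_MODE_REL_PINNED_HARD);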
> +
> +/* After the handler, if desc is moderated, make sure the timer is active. */
> +static inline void irq_moderation_epilogue(const struct irq_desc *desc)
> +{
> + struct irq_mod_state *ms = this_cpu_ptr(&irq_mod_state);
> +
This still lacks a check whether this moderation muck is enabled at
all. And no, the list_empty() check is not the right thing for
that. That's what the static key is for.
But that's moot as there is no real reason for this epilogue to exist in
the first place.
Thanks,
tglx
* Re: [PATCH v2 3/8] genirq: soft_moderation: implement fixed moderation
2025-11-17 19:30 ` Thomas Gleixner
@ 2025-11-17 23:16 ` Thomas Gleixner
2025-11-17 23:59 ` Luigi Rizzo
0 siblings, 1 reply; 32+ messages in thread
From: Thomas Gleixner @ 2025-11-17 23:16 UTC (permalink / raw)
To: Luigi Rizzo, Marc Zyngier, Luigi Rizzo, Paolo Abeni,
Andrew Morton, Sean Christopherson, Jacob Pan
Cc: linux-kernel, linux-arch, Bjorn Helgaas, Willem de Bruijn,
Luigi Rizzo
On Mon, Nov 17 2025 at 20:30, Thomas Gleixner wrote:
> On Sun, Nov 16 2025 at 18:28, Luigi Rizzo wrote:
>> + ms->rounds_left--;
>> +
>> + if (ms->rounds_left > 0) {
>> + /* Timer still alive, just call the handlers. */
>> + list_for_each_entry_safe(desc, next, &ms->descs, mod.ms_node) {
>> + ms->irq_count += irq_mod_info.count_timer_calls;
I missed this gem before. How is this supposed to calculate an interrupt
rate when count_timer_calls is disabled?
Yet another magic knob to tweak something which works by chance and not
by design.
TBH, this whole thing should be put into the 'ugly code museum' for
educational purposes and deterrence. It wants to be rewritten from
scratch with a proper design and a structured, understandable approach.
This 'polish the Google PoC hackery to death' exercise will go nowhere. It's just a
ginormous waste of time. Not that I care about the time you waste with
that, but I pretty much care about mine.
That said, start over from scratch and take the feedback into account so
you can address the substantial problems I pointed out (CPU hotplug,
concurrency, life time management, state consistency, affinity changes)
in the design and not after the fact.
First of all please find some other wording than moderation. That's just
a randomly diced word without real meaning. Pick something which
describes what this infrastructure actually does: Adaptive polling, no?
There are a couple of other fundamental questions to answer upfront:
1) Is this throttle everything on a CPU the proper approach?
To me this does not make sense. The CPU hogging network adapter or
disk drive has no business to delay low frequency interrupts,
which might be important, just because.
Making this a per interrupt line property makes it trivial to solve
a few other issues, like the integration into that posted MSI
muck.
It also reduces the amount of descriptors to be polled in the
timer interrupt.
2) Shouldn't the interrupt source be masked at the device level once
an interrupt is switched into polling mode?
Masking it at the device level (without touching disabled state)
is definitely a sensible thing to do. It keeps state consistent
and again allows trivial integration of that posted MSI stuff
without insane hacks all over the place.
3) Does a pure interrupt rate based scheme make sense?
Definitely not in the way it's implemented. Why?
Simply because once you switched to polling mode there is no real
information anymore as you fail to take the return value of the
handler into account. So unless your magic knob is 0 every polled
interrupt is accounted for whether it actually handles an
interrupt or not.
But if your magic knob is 0 then this purely relies on irqtime
accounting, which includes the timer interrupt as an accumulative
measure.
IOW, "works" by some definition of works after adding enough magic
knobs to make it "work" under certain circumstances. "Works for
Google" is not a good argument.
That's unmaintainable and unusable. No amount of magic command
line examples will fix that because the state space of your knobs
is way too big to be useful and comprehensible.
Add to that all the questions which pop up when you really sit down and
do a proper from-scratch design of this.
Thanks,
tglx
* Re: [PATCH v2 3/8] genirq: soft_moderation: implement fixed moderation
2025-11-17 23:16 ` Thomas Gleixner
@ 2025-11-17 23:59 ` Luigi Rizzo
2025-11-18 8:34 ` Thomas Gleixner
0 siblings, 1 reply; 32+ messages in thread
From: Luigi Rizzo @ 2025-11-17 23:59 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Marc Zyngier, Luigi Rizzo, Paolo Abeni, Andrew Morton,
Sean Christopherson, Jacob Pan, linux-kernel, linux-arch,
Bjorn Helgaas, Willem de Bruijn
On Tue, Nov 18, 2025 at 12:16 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Mon, Nov 17 2025 at 20:30, Thomas Gleixner wrote:
> > On Sun, Nov 16 2025 at 18:28, Luigi Rizzo wrote:
> >> + ms->rounds_left--;
> >> +
> >> + if (ms->rounds_left > 0) {
> >> + /* Timer still alive, just call the handlers. */
> >> + list_for_each_entry_safe(desc, next, &ms->descs, mod.ms_node) {
> >> + ms->irq_count += irq_mod_info.count_timer_calls;
> I missed this gem before. How is this supposed to calculate an interrupt
> rate when count_timer_calls is disabled?
FWIW the code is supposed to count the MSI interrupts,
which are the problematic ones.
count_timer_calls was a debugging knob, but I understand that
it has no place in the upstream code.
> This 'polish the Google PoC hackery to death' exercise will go nowhere. It's just a
> ginormous waste of time. Not that I care about the time you waste with
> that, but I pretty much care about mine.
>
> That said, start over from scratch and take the feedback into account so
point taken.
> First of all please find some other wording than moderation. That's just
> a randomly diced word without real meaning. Pick something which
> describes what this infrastructure actually does: Adaptive polling, no?
The core feature (with timer_rounds = 0) is really pure moderation:
disable+start a timer after an interrupt, enable when the timer fires,
no extra calls.
It is only timer_rounds>0 (which as implemented is clearly broken,
and will be removed) that combines moderation + N rounds of timed polling.
> There are a couple of other fundamental questions to answer upfront:
>
> 1) Is this throttle everything on a CPU the proper approach?
>
> To me this does not make sense. The CPU hogging network adapter or
> disk drive has no business to delay low frequency interrupts,
> which might be important, just because.
While there is some shared fate, a low frequency source (with interrupts
more than the adaptive_delay apart) on the same CPU as a high frequency
source will rarely, if ever, see any additional delay:
the first interrupt from a source is always served right away,
and there is a high chance that the timer fires and the source
is re-enabled before the next interrupt from the low frequency source.
cheers
luigi
* Re: [PATCH v2 3/8] genirq: soft_moderation: implement fixed moderation
2025-11-17 23:59 ` Luigi Rizzo
@ 2025-11-18 8:34 ` Thomas Gleixner
2025-11-18 10:09 ` Luigi Rizzo
0 siblings, 1 reply; 32+ messages in thread
From: Thomas Gleixner @ 2025-11-18 8:34 UTC (permalink / raw)
To: Luigi Rizzo
Cc: Marc Zyngier, Luigi Rizzo, Paolo Abeni, Andrew Morton,
Sean Christopherson, Jacob Pan, linux-kernel, linux-arch,
Bjorn Helgaas, Willem de Bruijn
On Tue, Nov 18 2025 at 00:59, Luigi Rizzo wrote:
> On Tue, Nov 18, 2025 at 12:16 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>> There are a couple of other fundamental questions to answer upfront:
>>
>> 1) Is this throttle everything on a CPU the proper approach?
>>
>> To me this does not make sense. The CPU hogging network adapter or
>> disk drive has no business to delay low frequency interrupts,
>> which might be important, just because.
>
> While there is some shared fate, a low frequency source (with interrupts
> more than the adaptive_delay apart) on the same CPU as a high frequency
> source will rarely, if ever, see any additional delay:
> the first interrupt from a source is always served right away,
> and there is a high chance that the timer fires and the source
> is re-enabled before the next interrupt from the low frequency source.
I understand that from a practical point of view it might not make a real
difference, but when you look at it conceptually, then the interrupt
which causes the system to slow down is the one you want to switch over
into polling mode. All others are harmless as they do not contribute to
the overall problem in a significant enough way.
As a side effect of that approach the posted MSI integration then mostly
falls into place especially when combined with immediate masking.
Immediate masking is not a problem at all because in reality the high
frequency interrupt will be masked immediately on the next event (a few
microseconds later) anyway.
Thanks,
tglx
* Re: [PATCH v2 3/8] genirq: soft_moderation: implement fixed moderation
2025-11-18 8:34 ` Thomas Gleixner
@ 2025-11-18 10:09 ` Luigi Rizzo
2025-11-18 16:31 ` Thomas Gleixner
0 siblings, 1 reply; 32+ messages in thread
From: Luigi Rizzo @ 2025-11-18 10:09 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Marc Zyngier, Luigi Rizzo, Paolo Abeni, Andrew Morton,
Sean Christopherson, Jacob Pan, linux-kernel, linux-arch,
Bjorn Helgaas, Willem de Bruijn
On Tue, Nov 18, 2025 at 9:34 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Tue, Nov 18 2025 at 00:59, Luigi Rizzo wrote:
> > On Tue, Nov 18, 2025 at 12:16 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> >> There are a couple of other fundamental questions to answer upfront:
> >>
> >> 1) Is this throttle everything on a CPU the proper approach?
> >>
> >> To me this does not make sense. The CPU hogging network adapter or
> >> disk drive has no business to delay low frequency interrupts,
> >> which might be important, just because.
> >
> > While there is some shared fate, a low frequency source (with interrupts
> > more than the adaptive_delay apart) on the same CPU as a high frequency
> > source will rarely, if ever, see any additional delay:
> > the first interrupt from a source is always served right away,
> > and there is a high chance that the timer fires and the source
> > is re-enabled before the next interrupt from the low frequency source.
>
> I understand that from a practical point of view it might not make a real
> difference, but when you look at it conceptually, then the interrupt
> which causes the system to slow down is the one you want to switch over
> into polling mode. All others are harmless as they do not contribute to
> the overall problem in a significant enough way.
(I appreciate the time you are dedicating to this thread)
Fully agree. The tradeoff is that the rate accounting state
(#interrupts in the last interval, a timestamp, mod_ns, sleep_ns)
would now have to go into the irqdesc, and the extra layer
of per-CPU aggregation is still needed to avoid hitting the shared
state too often.
I also want to reiterate that "polling mode" is not the core contribution
of this patch series. There is limited polling only when timer_rounds>0,
which is not what I envision using, and will go away because,
as you showed, it does not handle the teardown path correctly.
> As a side effect of that approach the posted MSI integration then mostly
> falls into place especially when combined with immediate masking.
> Immediate masking is not a problem at all because in reality the high
> frequency interrupt will be masked immediately on the next event (a few
> microseconds later) anyway.
This again has pros and cons. The posted MSI feature
helps only when there are N>1 high rate sources
hitting the same CPU, and in that (real) case having to
mask N sources one by one, rather than just not rearming
the posted_msi interrupt, means an N-fold increase in
the interrupt rate for a given moderation delay.
Note that even under load, actual moderation delays
are typically as low as 20-40us, which are practically
unnoticeable by low rate devices (moderation does
not affect timers or system interrupts, and one
has always the option to move the latency sensitive,
low rate source to a different CPU where it would also
benefit from the jitter induced by the heavy hitters).
cheers
luigi
* Re: [PATCH v2 3/8] genirq: soft_moderation: implement fixed moderation
2025-11-18 10:09 ` Luigi Rizzo
@ 2025-11-18 16:31 ` Thomas Gleixner
2025-11-18 18:25 ` Luigi Rizzo
0 siblings, 1 reply; 32+ messages in thread
From: Thomas Gleixner @ 2025-11-18 16:31 UTC (permalink / raw)
To: Luigi Rizzo
Cc: Marc Zyngier, Luigi Rizzo, Paolo Abeni, Andrew Morton,
Sean Christopherson, linux-kernel, linux-arch, Bjorn Helgaas,
Willem de Bruijn
On Tue, Nov 18 2025 at 11:09, Luigi Rizzo wrote:
> On Tue, Nov 18, 2025 at 9:34 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>> I understand that from a practical point of view it might not make a real
>> difference, but when you look at it conceptually, then the interrupt
>> which causes the system to slow down is the one you want to switch over
>> into polling mode. All others are harmless as they do not contribute to
>> the overall problem in a significant enough way.
>
> (I appreciate the time you are dedicating to this thread)
It's hopefully saving me time and nerves later :)
> Fully agree. The tradeoff is that the rate accounting state
> (#interrupts in the last interval, a timestamp, mod_ns, sleep_ns)
> would now have to go into the irqdesc, and the extra layer
> of per-CPU aggregation is still needed to avoid hitting the shared
> state too often.
>
> I also want to reiterate that "polling mode" is not the core contribution
> of this patch series. There is limited polling only when timer_rounds>0,
> which is not what I envision using, and will go away because,
> as you showed, it does not handle the teardown path correctly.
Ok.
>> As a side effect of that approach the posted MSI integration then mostly
>> falls into place especially when combined with immediate masking.
>> Immediate masking is not a problem at all because in reality the high
>> frequency interrupt will be masked immediately on the next event (a few
>> microseconds later) anyway.
>
> This again has pros and cons. The posted MSI feature
> helps only when there are N>1 high rate sources
> hitting the same CPU, and in that (real) case having to
> mask N sources one by one, rather than just not rearming
> the posted_msi interrupt, means an N-fold increase in
> the interrupt rate for a given moderation delay.
That's kinda true for the per interrupt accounting, but if you look at
it from a per CPU accounting perspective then you still can handle them
individually and mask them when they arrive within or trigger a delay
window. That means:
1) One of the PIR aggregated device interrupts triggers the delay
2) If there are other bits active in that same posted handler
invocation, then they will be flipped over too
3) Devices which have not been covered by the initial switch, will
either be not affected at all or can raise exactly _one_ interrupt
which flips them over.
This scheme makes me way more comfortable because:
1) The state is always consistent.
If masked then the PIR rearm is not causing any harm because the
device can't send an interrupt anymore.
On unmask there is no way to screw up vs. rearming and eventually
losing an interrupt.
I haven't analyzed your scheme in detail, but the PIR mechanism is
fickle and I have very little confidence that the way you
implemented it is correct under all circumstances.
2) It requires exactly zero lines of x86-intelism code.
3) It keeps the mechanism exactly the same for PIR and non-PIR mode
as you need the 'mask' on mode transition anyway to solve the
disable/enable_irq() state inconsistency.
Please focus on making this simple and correct in the first place.
If there is still a really compelling reason to treat PIR differently,
then this can be done as a separate step once the initial dust has
settled.
Thanks,
tglx
* Re: [PATCH v2 3/8] genirq: soft_moderation: implement fixed moderation
2025-11-18 16:31 ` Thomas Gleixner
@ 2025-11-18 18:25 ` Luigi Rizzo
2025-11-18 23:06 ` Luigi Rizzo
0 siblings, 1 reply; 32+ messages in thread
From: Luigi Rizzo @ 2025-11-18 18:25 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Marc Zyngier, Luigi Rizzo, Paolo Abeni, Andrew Morton,
Sean Christopherson, linux-kernel, linux-arch, Bjorn Helgaas,
Willem de Bruijn
On Tue, Nov 18, 2025 at 5:31 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Tue, Nov 18 2025 at 11:09, Luigi Rizzo wrote:
> > ...
> > (I appreciate the time you are dedicating to this thread)
>
> It's hopefully saving me time and nerves later :)
>
> > ...
> That's kinda true for the per interrupt accounting, but if you look at
> it from a per CPU accounting perspective then you still can handle them
> individually and mask them when they arrive within or trigger a delay
> window. That means:
OK, so to sum up: what are the correct methods to do the masking
you suggest? Should I use mask_irq()/unmask_irq()?
thanks
luigi
* Re: [PATCH v2 3/8] genirq: soft_moderation: implement fixed moderation
2025-11-18 18:25 ` Luigi Rizzo
@ 2025-11-18 23:06 ` Luigi Rizzo
2025-11-19 14:43 ` Thomas Gleixner
0 siblings, 1 reply; 32+ messages in thread
From: Luigi Rizzo @ 2025-11-18 23:06 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Marc Zyngier, Luigi Rizzo, Paolo Abeni, Andrew Morton,
Sean Christopherson, linux-kernel, linux-arch, Bjorn Helgaas,
Willem de Bruijn
On Tue, Nov 18, 2025 at 7:25 PM Luigi Rizzo <lrizzo@google.com> wrote:
>
> On Tue, Nov 18, 2025 at 5:31 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > On Tue, Nov 18 2025 at 11:09, Luigi Rizzo wrote:
> > > ...
> > > (I appreciate the time you are dedicating to this thread)
> >
> > It's hopefully saving me time and nerves later :)
> >
> > > ...
> > That's kinda true for the per interrupt accounting, but if you look at
> > it from a per CPU accounting perspective then you still can handle them
> > individually and mask them when they arrive within or trigger a delay
> > window. That means:
>
> OK, so to sum up: what are the correct methods to do the masking
> you suggest? Should I use mask_irq()/unmask_irq()?
I tried the following instead of disable_irq()/enable_irq().
This seems to work also for posted_msi with no arch-specific code, as
you suggested.
The drain_desc_list() function below should be usable directly with
hotplug remove callbacks.
I still need to figure out if there is anything special needed when
an interrupt is removed, or the system goes into suspend,
while irq_desc's are in the moderation list.
/* run this to mask and moderate an interrupt */
void irq_mod_mask(struct irq_desc *desc)
{
        scoped_irqdesc_get_and_buslock(desc->irq_data.irq,
                                       IRQ_GET_DESC_CHECK_GLOBAL) {
                struct irq_mod_state *ms = this_cpu_ptr(&irq_mod_state);

                /* Add to list of moderated desc on this CPU */
                list_add(&desc->mod.ms_node, &ms->descs);
                mask_irq(scoped_irqdesc);
                ms->mask_irq++;
                if (!hrtimer_is_queued(&ms->timer))
                        irq_moderation_start_timer(ms);
        }
}

/* run this on the timer callback */
static int drain_desc_list(struct irq_mod_state *ms)
{
        struct irq_desc *desc, *next;
        uint srcs = 0;

        /* Last round, remove from list and unmask_irq(). */
        list_for_each_entry_safe(desc, next, &ms->descs, mod.ms_node) {
                list_del_init(&desc->mod.ms_node);
                srcs++;
                ms->unmask_irq++;
                scoped_irqdesc_get_and_buslock(desc->irq_data.irq,
                                               IRQ_GET_DESC_CHECK_GLOBAL) {
                        unmask_irq(scoped_irqdesc);
                }
        }
        return srcs;
}
cheers
luigi
* Re: [PATCH v2 3/8] genirq: soft_moderation: implement fixed moderation
2025-11-18 23:06 ` Luigi Rizzo
@ 2025-11-19 14:43 ` Thomas Gleixner
2025-11-21 10:58 ` Luigi Rizzo
0 siblings, 1 reply; 32+ messages in thread
From: Thomas Gleixner @ 2025-11-19 14:43 UTC (permalink / raw)
To: Luigi Rizzo
Cc: Marc Zyngier, Luigi Rizzo, Paolo Abeni, Andrew Morton,
Sean Christopherson, linux-kernel, linux-arch, Bjorn Helgaas,
Willem de Bruijn
On Wed, Nov 19 2025 at 00:06, Luigi Rizzo wrote:
> On Tue, Nov 18, 2025 at 7:25 PM Luigi Rizzo <lrizzo@google.com> wrote:
>> > That's kinda true for the per interrupt accounting, but if you look at
>> > it from a per CPU accounting perspective then you still can handle them
>> > individually and mask them when they arrive within or trigger a delay
>> > window. That means:
>>
>> OK, so to sum up: what are the correct methods to do the masking
>> you suggest? Should I use mask_irq()/unmask_irq()?
>
> I tried the following instead of disable_irq()/enable_irq().
> This seems to work also for posted_msi with no arch-specific code, as
> you suggested.
:)
> The drain_desc_list() function below should be usable directly with
> hotplug remove callbacks.
Emphasis on *should*.
> I still need to figure out if there is anything special needed when
> an interrupt is removed, or the system goes into suspend,
> while irq_desc's are in the moderation list.
Yes, quite a few, and you forgot to mention affinity changes, interrupt
disable/enable semantics, synchronization, state consistency ...
> /* run this to mask and moderate an interrupt */
> void irq_mod_mask(struct irq_desc *desc)
> {
> scoped_irqdesc_get_and_buslock(desc->irq_data.irq,
> IRQ_GET_DESC_CHECK_GLOBAL) {
Seriously? Why do you want to look up the descriptor again when you
already have the pointer? Copying random code together does not qualify
as programming.
> struct irq_mod_state *ms = this_cpu_ptr(&irq_mod_state);
>
> /* Add to list of moderated desc on this CPU */
> list_add(&desc->mod.ms_node, &ms->descs);
> mask_irq(scoped_irqdesc);
> ms->mask_irq++;
> if (!hrtimer_is_queued(&ms->timer))
> irq_moderation_start_timer(ms);
> }
Assuming you still invoke this from handle_irq_event(), which I think is
the wrong place, then I really have to ask how that is supposed to work
correctly especially with handle_fasteoi_irq(), which is used on
ARM64. Why?
handle_fasteoi_irq()
    ....
    handle_irq_event() {
        ...
        moderation()
            mask();
    }
    cond_unmask_eoi_irq()
        unmask();
You need state to prevent that. There is a similar, but more subtle
issue in handle_edge_irq(), which is used on x86.
You got away with this so far due to the disable_irq() abuse and with
your unmask PoC you didn't notice the rare issue on x86, but with
handle_fasteoi_irq() this clearly can't ever work.
> /* run this on the timer callback */
> static int drain_desc_list(struct irq_mod_state *ms)
> {
> struct irq_desc *desc, *next;
> uint srcs = 0;
>
> /* Last round, remove from list and unmask_irq(). */
> list_for_each_entry_safe(desc, next, &ms->descs, mod.ms_node) {
> list_del_init(&desc->mod.ms_node);
> srcs++;
> ms->unmask_irq++;
> scoped_irqdesc_get_and_buslock(desc->irq_data.irq,
> IRQ_GET_DESC_CHECK_GLOBAL) {
> unmask_irq(scoped_irqdesc);
That's broken vs. state changes which occurred while the descriptor was
queued. This cannot unconditionally unmask for bloody obvious reasons.
There is also the opposite: Interrupt is moderated, masked, queued and
some management code unmasks. Is that correct while it is still in the
list and marked moderated? It probably is, but only if done right and
clearly documented.
While talking about mask/unmask: you need to ensure that the interrupt
is actually maskable to be eligible for moderation. MSI does not
guarantee that, which is not an issue on ARM64 as that can mask at the
CPU level, but it is an issue on x86 which does not provide such a
mechanism.
You need to understand the intricacies of interrupt handling and
interrupt management first instead of jumping again to "works for me and
should be usable" conclusions.
Now that you know that the mask approach "works", you can stop this
"hack it into submission" frenzy and take a step back and really study
the overall problem space to come up with a proper design and
integration for this.
What you really want is to define the semantics and requirements of this
moderation mechanism versus (no particular order):
1) The existing semantics of handle_edge_irq() and
handle_fasteoi_irq(), which need to be guaranteed.
That's one of the most crucial points because edge type
interrupts have fire and forget semantics on the device side and
there is no guarantee that the device will send another one in
time. People including me wasted tons of time to decode such
problems in the past.
2) The existing semantics of disable/enable_irq(), irq_shutdown(),
synchronize_irq(), which need to be preserved.
3) Teardown of an interrupt, which also relates to shutdown and
synchronizing.
4) Interrupts coming in while in moderated state and "masked". That
can be spurious interrupts or happen due to unmasking by
management code, e.g. enable_irq().
5) CPU hotunplug. Define the exact point where the list has to be
cleaned out and the timer canceled and it's guaranteed that
there can't be interrupts added back to the moderation list.
6) Affinity changes. Are there any implications when the descriptor
is enqueued on the old target and waiting for the timer to unmask
it? That assumes that CPU hotunplug related affinity changes are
not affected because those interrupts are guaranteed not to be
moderated.
7) Suspend/resume. That needs some deep analysis of the potential
state space.
8) Eligibility beyond edge, single target, etc. That might need
actual opt-in support from the interrupt chip hierarchy as
otherwise this is going to end up in a continuous stream of
undecodable bug reports, which others have to deal with :(
See my remark vs. unmaskable MSI above.
Once you figured all of that out, it becomes pretty clear how the
implementation has to look like and you can start building it up piece
wise by doing the necessary refactoring and preparatory changes first so
that the actual functionality falls into place at the end.
If you start the other way round by glueing the functionality somewhere
into interrupt handling first and then trying to address the resulting
fallout, you'll get nowhere. Trust me that a lot of smart people have
failed that way.
That said, I'm happy to answer _informed_ questions about semantics,
rules and requirements if they are not obvious from the code.
Good luck!
Thanks,
tglx
* Re: [PATCH v2 3/8] genirq: soft_moderation: implement fixed moderation
2025-11-19 14:43 ` Thomas Gleixner
@ 2025-11-21 10:58 ` Luigi Rizzo
2025-11-21 14:33 ` Luigi Rizzo
2025-11-22 14:08 ` Thomas Gleixner
0 siblings, 2 replies; 32+ messages in thread
From: Luigi Rizzo @ 2025-11-21 10:58 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Marc Zyngier, Luigi Rizzo, Paolo Abeni, Andrew Morton,
Sean Christopherson, linux-kernel, linux-arch, Bjorn Helgaas,
Willem de Bruijn
On Wed, Nov 19, 2025 at 3:43 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
...
>
> What you really want is to define the semantics and requirements of this
> moderation mechanism versus (no particular order):
>
> 1) The existing semantics of handle_edge_irq() and
> handle_fasteoi_irq(), which need to be guaranteed.
>
> That's one of the most crucial points because edge type
> interrupts have fire and forget semantics on the device side and
> there is no guarantee that the device will send another one in
> time. People including me wasted tons of time to decode such
> problems in the past.
>
> 2) The existing semantics of disable/enable_irq(), irq_shutdown(),
> synchronize_irq(), which need to be preserved.
>
> 3) Teardown of an interrupt, which also relates to shutdown and
> synchronizing.
>
> 4) Interrupts coming in while in moderated state and "masked". That
> can be spurious interrupts or happen due to unmasking by
> management code, e.g. enable_irq().
>
> 5) CPU hotunplug. Define the exact point where the list has to be
> cleaned out and the timer canceled and it's guaranteed that
> there can't be interrupts added back to the moderation list.
>
> 6) Affinity changes. Are there any implications when the descriptor
> is enqueued on the old target and waiting for the timer to unmask
> it? That assumes that CPU hotunplug related affinity changes are
> not affected because those interrupts are guaranteed not to be
> moderated.
>
> 7) Suspend/resume. That needs some deep analysis of the potential
> state space.
>
> 8) Eligibility beyond edge, single target, etc. That might need
> actual opt-in support from the interrupt chip hierarchy as
> otherwise this is going to end up in a continuous stream of
> undecodable bug reports, which others have to deal with :(
> See my remark vs. unmaskable MSI above.
>
> Once you figured all of that out, it becomes pretty clear how the
> implementation has to look like and you can start building it up piece
> wise by doing the necessary refactoring and preparatory changes first so
> that the actual functionality falls into place at the end.
>
> If you start the other way round by glueing the functionality somewhere
> into interrupt handling first and then trying to address the resulting
> fallout, you'll get nowhere. Trust me that a lot of smart people have
> failed that way.
>
> That said, I'm happy to answer _informed_ questions about semantics,
> rules and requirements if they are not obvious from the code.
Addressing your bullets above, here is the design;
if you have time, let me know what you think.
BASIC MECHANISM
The basic control mechanism changes interrupt state as follows:
- for moderated interrupts, call __disable_irq(desc) and put the desc
in a per-CPU list with IRQD_IRQ_INPROGRESS set.
Set the _IRQ_DISABLE_UNLAZY status flag so disable will also
mask the interrupt (it works even without that, but experiments
with greedy NVME devices show clear benefits).
NOTE: we need to patch irq_can_handle_pm() so it will return false
on irqdesc that are in a moderation list, otherwise we have
a deadlock when an interrupt (generated before the mask) comes in.
- when the moderation timer fires, remove the irqdesc from the list,
clear IRQD_IRQ_INPROGRESS and call __enable_irq(desc).
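In pseudo-code (names are placeholders; list handling, locking and the
per-CPU state are simplified):

  /* On an interrupt picked for moderation, with desc->lock held. */
  static void irq_moderate(struct irq_desc *desc, struct irq_mod_state *ms)
  {
          __disable_irq(desc);    /* masks too, via _IRQ_DISABLE_UNLAZY */
          irqd_set(&desc->irq_data, IRQD_IRQ_INPROGRESS);
          list_add(&desc->mod.ms_node, &ms->descs);
  }

  /* When the moderation timer fires, for each queued descriptor. */
  static void irq_unmoderate(struct irq_desc *desc)
  {
          list_del_init(&desc->mod.ms_node);
          irqd_clear(&desc->irq_data, IRQD_IRQ_INPROGRESS);
          __enable_irq(desc);
  }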
RATIONALE:
- why disable/enable instead of mask/unmask:
disable/enable already supports call from different layers
(infrastructure and interrupt handlers), blocks stray interrupts,
and reissues pending events on enable. If I were to use mask/unmask
directly, I would have to replicate most of the disable logic and risk
introducing subtle bugs.
Setting _IRQ_DISABLE_UNLAZY makes the mask immediate, which based on
tests (greedy NVME devices) seems beneficial.
- why IRQD_IRQ_INPROGRESS instead of using another flag:
IRQD_IRQ_INPROGRESS seems the way to implement synchronization with
infrastructure events (shutdown etc.). We can argue that when an
irqdesc is in the timer list it is not "inprogress" and could be
preempted (with huge pain, as it could be on a different CPU),
but for practical purposes most of the actions that block on INPROGRESS
should also wait for the irqdesc to come off the timer list.
- what about spurious interrupts:
there is no way to block a device from sending them one after another,
so using the existing protection (don't call the handler if disabled)
seems the easiest way to deal with them.
HANDLING OF VARIOUS EVENTS:
INTERRUPT TEARDOWN
free_irq() uses synchronize_irq() before proceeding, which waits until
IRQD_IRQ_INPROGRESS is clear. This guarantees that a moderated interrupt
will not be destroyed while it is on the timer list.
INTERRUPT MIGRATION
Migration changes the affinity mask, but it takes effect only when
trying to run the handler. That too is prevented by IRQD_IRQ_INPROGRESS
being set, so the new CPU will be used only after the interrupt exits
the timer list.
HOTPLUG
During shutdown the system moves timers and reassigns interrupt affinity
to a new CPU. The easiest way to guarantee that pending events are
handled correctly is to use a per-CPU "moderation_on" flag managed as
follows by hotplug callbacks on CPUHP_ONLINE (or nearby)
- on setup, set the flag. Only now interrupts can be moderated.
- on shutdown, with interrupts disabled, clear the flag thus preventing
more interrupts from being moderated on that CPU; process the list of moderated
interrupts (as if the timer had fired), and cancel the timer.
This avoids depending on a specific state: no moderation before fully
online, and cleanup before any teardown may interfere with moderation
timers.
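For illustration only, the callbacks could look roughly like this,
reusing drain_desc_list() from the earlier sketch (names and the exact
CPUHP state are placeholders):

  static int irq_mod_cpu_online(unsigned int cpu)
  {
          WRITE_ONCE(this_cpu_ptr(&irq_mod_state)->moderation_on, true);
          return 0;
  }

  static int irq_mod_cpu_offline(unsigned int cpu)
  {
          struct irq_mod_state *ms = this_cpu_ptr(&irq_mod_state);
          unsigned long flags;

          local_irq_save(flags);
          WRITE_ONCE(ms->moderation_on, false);   /* no new entries */
          drain_desc_list(ms);    /* as if the timer had fired */
          local_irq_restore(flags);
          hrtimer_cancel(&ms->timer);
          return 0;
  }

  cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "irq/moderation:online",
                    irq_mod_cpu_online, irq_mod_cpu_offline);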
SUSPEND
The hotplug callbacks are also invoked during deep suspend so that
makes it safe.
BOOT PROCESSOR
Hotplug callbacks are not invoked for the boot processor. However the
boot processor is the last one to go, and since there is no other place
to run the timer callbacks, they will be run where they are supposed to.
---
cheers
luigi
* Re: [PATCH v2 3/8] genirq: soft_moderation: implement fixed moderation
2025-11-21 10:58 ` Luigi Rizzo
@ 2025-11-21 14:33 ` Luigi Rizzo
2025-11-22 14:08 ` Thomas Gleixner
1 sibling, 0 replies; 32+ messages in thread
From: Luigi Rizzo @ 2025-11-21 14:33 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Marc Zyngier, Luigi Rizzo, Paolo Abeni, Andrew Morton,
Sean Christopherson, linux-kernel, linux-arch, Bjorn Helgaas,
Willem de Bruijn
On Fri, Nov 21, 2025 at 11:58 AM Luigi Rizzo <lrizzo@google.com> wrote:
>
> On Wed, Nov 19, 2025 at 3:43 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> ...
> >
> > ...
> > That said, I'm happy to answer _informed_ questions about semantics,
> > rules and requirements if they are not obvious from the code.
Some follow-up on the cost of mask_irq(), which is rarely called without
moderation but is hit hard with it:
For MSI-X devices, mask_irq() eventually calls the following function:
static inline void pci_msix_mask(struct msi_desc *desc)
{
        desc->pci.msix_ctrl |= PCI_MSIX_ENTRY_CTRL_MASKBIT;
        pci_msix_write_vector_ctrl(desc, desc->pci.msix_ctrl); /* eventually, writel() */
        /* Flush write to device */
        readl(desc->pci.mask_base);
}
The final readl() is very expensive (serialized after the write, and takes
a full round trip through a busy PCIe bus). Its real purpose is to make
sure the function returns only after the hardware has seen the mask.
Except for a few critical places (setup/shutdown, others?) it does not
seem necessary to wait, because the irq layer knows how to drop stray
interrupts.
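The experiment simply makes the flush optional; hypothetically (the
parameter and the helper name are mine, this is not what is in the tree):

  static inline void pci_msix_mask_flush(struct msi_desc *desc, bool flush)
  {
          desc->pci.msix_ctrl |= PCI_MSIX_ENTRY_CTRL_MASKBIT;
          pci_msix_write_vector_ctrl(desc, desc->pci.msix_ctrl);
          if (flush)      /* wait only where ordering really matters */
                  readl(desc->pci.mask_base);
  }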
Here is the effect of removing the readl() where it is not critical,
in an experiment with a busy NVMe workload (~1M intr/s total).
The average goes from 1800ns down to 50ns, and the long tails almost
completely disappear.
Data is collected with kstats https://lwn.net/Articles/813303/
Distribution of mask_irq() times in nanoseconds collected with kstats
over a 5s period, heavy workload.
F=/sys/kernel/debug/kstats/mask_irq
echo reset > $F; sleep 5; grep CPUS $F | awk '{t+=$6*$8;n+=$6; print}
END { print "avg ", t/n, t*1e-9}'
-------------------------------------------------------
ORIGINAL CODE
count 945 avg 680 p 0.000195 n 945
count 1863 avg 742 p 0.000580 n 2808
count 2941 avg 805 p 0.001188 n 5749
count 6058 avg 866 p 0.002441 n 11807
count 10090 avg 930 p 0.004527 n 21897
count 17273 avg 994 p 0.008099 n 39170
count 88252 avg 1099 p 0.026348 n 127422
count 227984 avg 1223 p 0.073490 n 355406
count 405964 avg 1348 p 0.157434 n 761370
count 530397 avg 1474 p 0.267109 n 1291767
count 606284 avg 1601 p 0.392475 n 1898051
count 658027 avg 1728 p 0.528541 n 2556078
count 643644 avg 1854 p 0.661632 n 3199722
count 515443 avg 1980 p 0.768215 n 3715165
count 546976 avg 2156 p 0.881317 n 4262141
count 243037 avg 2415 p 0.931572 n 4505178
count 125700 avg 2674 p 0.957564 n 4630878
count 73707 avg 2934 p 0.972805 n 4704585
count 48124 avg 3190 p 0.982756 n 4752709
count 30397 avg 3446 p 0.989041 n 4783106
count 19108 avg 3700 p 0.992993 n 4802214
count 11228 avg 3956 p 0.995314 n 4813442
count 11484 avg 4316 p 0.997689 n 4824926
count 5447 avg 4835 p 0.998815 n 4830373
count 2699 avg 5341 p 0.999373 n 4833072
count 1343 avg 5855 p 0.999651 n 4834415
count 668 avg 6369 p 0.999789 n 4835083
count 382 avg 6880 p 0.999868 n 4835465
count 207 avg 7406 p 0.999911 n 4835672
count 149 avg 7918 p 0.999942 n 4835821
count 159 avg 8671 p 0.999975 n 4835980
count 61 avg 9574 p 0.999987 n 4836041
count 22 avg 10758 p 0.999992 n 4836063
count 15 avg 11925 p 0.999995 n 4836078
count 9 avg 12750 p 0.999997 n 4836087
count 6 avg 13739 p 0.999998 n 4836093
count 2 avg 14737 p 0.999998 n 4836095
count 1 avg 15699 p 0.999999 n 4836096
count 3 avg 17951 p 0.999999 n 4836099
count 1 avg 18680 p 1.000000 n 4836100
avg 1833.11 8.86512
-------------------------------------------------------
AFTER REMOVING READL()
count 45 avg 13 p 0.000009 n 45
count 5792 avg 14 p 0.001204 n 5837
count 36628 avg 15 p 0.008761 n 42465
count 108293 avg 17 p 0.031103 n 150758
count 529586 avg 18 p 0.140365 n 680344
count 352816 avg 21 p 0.213156 n 1033160
count 1082719 avg 22 p 0.436538 n 2115879
count 480375 avg 24 p 0.535646 n 2596254
count 628931 avg 26 p 0.665404 n 3225185
count 275178 avg 28 p 0.722178 n 3500363
count 158508 avg 30 p 0.754880 n 3658871
count 100066 avg 33 p 0.775525 n 3758937
count 42338 avg 38 p 0.784260 n 3801275
count 75427 avg 41 p 0.799822 n 3876702
count 43099 avg 45 p 0.808714 n 3919801
count 30181 avg 49 p 0.814941 n 3949982
count 3757 avg 53 p 0.815716 n 3953739
count 2104 avg 57 p 0.816150 n 3955843
count 3520 avg 62 p 0.816876 n 3959363
count 28199 avg 69 p 0.822694 n 3987562
count 72163 avg 76 p 0.837583 n 4059725
count 109582 avg 83 p 0.860191 n 4169307
count 35839 avg 90 p 0.867585 n 4205146
count 6622 avg 100 p 0.868951 n 4211768
count 19393 avg 109 p 0.872952 n 4231161
count 51844 avg 116 p 0.883649 n 4283005
count 94042 avg 122 p 0.903051 n 4377047
count 28129 avg 133 p 0.908854 n 4405176
count 91010 avg 151 p 0.927631 n 4496186
count 73043 avg 164 p 0.942701 n 4569229
count 20705 avg 185 p 0.946973 n 4589934
count 59477 avg 199 p 0.959244 n 4649411
count 10842 avg 212 p 0.961481 n 4660253
count 16374 avg 233 p 0.964859 n 4676627
count 35394 avg 242 p 0.972161 n 4712021
count 36254 avg 277 p 0.979641 n 4748275
count 16982 avg 304 p 0.983145 n 4765257
count 20411 avg 327 p 0.987356 n 4785668
count 14352 avg 362 p 0.990317 n 4800020
count 10931 avg 399 p 0.992572 n 4810951
count 8384 avg 436 p 0.994302 n 4819335
count 3184 avg 467 p 0.994959 n 4822519
count 4923 avg 487 p 0.995974 n 4827442
count 7341 avg 538 p 0.997489 n 4834783
count 2957 avg 606 p 0.998099 n 4837740
count 3688 avg 666 p 0.998860 n 4841428
count 1807 avg 732 p 0.999233 n 4843235
count 813 avg 800 p 0.999400 n 4844048
count 699 avg 856 p 0.999545 n 4844747
count 293 avg 927 p 0.999605 n 4845040
count 203 avg 985 p 0.999647 n 4845243
count 390 avg 1086 p 0.999727 n 4845633
count 273 avg 1213 p 0.999784 n 4845906
count 162 avg 1336 p 0.999817 n 4846068
count 89 avg 1465 p 0.999835 n 4846157
count 50 avg 1600 p 0.999846 n 4846207
count 33 avg 1727 p 0.999853 n 4846240
count 26 avg 1850 p 0.999858 n 4846266
count 38 avg 1993 p 0.999866 n 4846304
count 89 avg 2185 p 0.999884 n 4846393
count 132 avg 2444 p 0.999911 n 4846525
count 134 avg 2680 p 0.999939 n 4846659
count 92 avg 2932 p 0.999958 n 4846751
count 65 avg 3182 p 0.999971 n 4846816
count 42 avg 3448 p 0.999980 n 4846858
count 29 avg 3713 p 0.999986 n 4846887
count 11 avg 3934 p 0.999988 n 4846898
count 18 avg 4305 p 0.999992 n 4846916
count 10 avg 4784 p 0.999994 n 4846926
count 7 avg 5264 p 0.999996 n 4846933
count 5 avg 5957 p 0.999997 n 4846938
count 1 avg 6603 p 0.999997 n 4846939
count 1 avg 7098 p 0.999997 n 4846940
count 3 avg 7433 p 0.999998 n 4846943
count 1 avg 7892 p 0.999998 n 4846944
count 2 avg 8955 p 0.999998 n 4846946
count 3 avg 9424 p 0.999999 n 4846949
count 1 avg 11194 p 0.999999 n 4846950
count 2 avg 11619 p 1.000000 n 4846952
avg 51.3519 0.2489
-------------------------------------------------------
cheers
Luigi
* Re: [PATCH v2 3/8] genirq: soft_moderation: implement fixed moderation
2025-11-21 10:58 ` Luigi Rizzo
2025-11-21 14:33 ` Luigi Rizzo
@ 2025-11-22 14:08 ` Thomas Gleixner
1 sibling, 0 replies; 32+ messages in thread
From: Thomas Gleixner @ 2025-11-22 14:08 UTC (permalink / raw)
To: Luigi Rizzo
Cc: Marc Zyngier, Luigi Rizzo, Paolo Abeni, Andrew Morton,
Sean Christopherson, linux-kernel, linux-arch, Bjorn Helgaas,
Willem de Bruijn
On Fri, Nov 21 2025 at 11:58, Luigi Rizzo wrote:
> On Wed, Nov 19, 2025 at 3:43 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> Addressing your bullets above, here is the design;
> if you have time, let me know what you think.
>
> BASIC MECHANISM
> The basic control mechanism changes interrupt state as follows:
>
> - for moderated interrupts, call __disable_irq(desc) and put the desc
> in a per-CPU list with IRQD_IRQ_INPROGRESS set.
> Set the _IRQ_DISABLE_UNLAZY status flag so disable will also
> mask the interrupt (it works even without that, but experiments
> with greedy NVME devices show clear benefits).
What sets that flag? If you want to fiddle with it in this moderation
muck behind the scenes, no.
With this new scheme where the moderation handling never invokes the
handler after the time is over, you can't actually mask upfront
unconditionally unless there is a clear indication from the underlying
interrupt chain that this works correctly with edge type interrupts.
It only works when it is guaranteed that the interrupt chip which
actually provides the masking functionality will latch an interrupt
raised in the device and then raise it down the chain when unmasked.
PCIe mandates that behaviour for MSI-X and maskable MSI, but it's not
guaranteed for other device types. And out of painful experience it's
not even true for some broken PCIe devices.
> NOTE: we need to patch irq_can_handle_pm() so it will return false
> on irqdesc that are in a moderation list, otherwise we have
> a deadlock when an interrupt (generated before the mask) comes in.
I told you that you need state :)
> - when the moderation timer fires, remove the irqdesc from the list,
> clear IRQD_IRQ_INPROGRESS and call __enable_irq(desc).
>
> RATIONALE:
> - why disable/enable instead of mask/unmask:
> disable/enable already supports call from different layers
> (infrastructure and interrupt handlers), blocks stray interrupts,
> and reissues pending events on enable. If I were to use mask/unmask
> directly, I would have to replicate most of the disable logic and risk
> introducing subtle bugs.
It's not that much to replicate, but fair enough as long as it happens
under the lock when entering the handler.
> Setting _IRQ_DISABLE_UNLAZY makes the mask immediate, which based on
> tests (greedy NVME devices) seems beneficial.
See above. What you really want is:
1) An indicator in the irq chip which tells that the above
requirement of raising a pending interrupt on unmask is
"guaranteed" :)
2) A dedicated variant of __disable_irq() which invokes
__irq_disable() with the following twist
disable_moderated_irq(desc)
force = !irqd_irq_masked(..) && irq_moderate_unlazy(desc[chip]);
if (!desc->disable++ || force)
irq_disable_moderated(desc, force)
force |= irq_settings_disable_unlazy(desc);
__irq_disable(desc, force);
That requires a new MODERATE_UNLAZY flag, which can be set by
infrastructure, e.g. the PCI/MSI core, as that knows whether the
underlying device MSI implementation provides masking. That should be
probably a feature in the interrupt chip.
That needs a command line option to prevent MODERATE_UNLAZY from being
effective.
That preserves the lazy masking behaviour for non-moderated situations,
which is not only a correctness measure (see above) but preferred even
for correctly working hardware because it avoids going out to the
hardware (e.g. writing to PCI config space twice) in many cases.
Though all of this wants to be a separate step _after_ the initial
functionality has been worked out and stabilized. It's really an
optimization on top of the base functionality and we are not going to
optimize prematurely.
> - why IRQD_IRQ_INPROGRESS instead of using another flag:
> IRQD_IRQ_INPROGRESS seems the way to implement synchronization with
> infrastructure events (shutdown etc.). We can argue that when an
> irqdesc is in the timer list it is not "inprogress" and could be
> preempted (with huge pain, as it could be on a different CPU),
> but for practical purposes most of the actions that block on INPROGRESS
> should also wait for the irqdesc to come off the timer list.
As long as those functions are not polling the bit with interrupts
disabled that's a correct assumption.
> - what about spurious interrupts:
> there is no way to block a device from sending them one after another,
> so using the existing protection (don't call the handler if disabled)
> seems the easiest way to deal with them.
Correct.
> HANDLING OF VARIOUS EVENTS:
>
> INTERRUPT TEARDOWN
> free_irq() uses synchronize_irq() before proceeding, which waits until
> IRQD_IRQ_INPROGRESS is clear. This guarantees that a moderated interrupt
> will not be destroyed while it is on the timer list.
That requires a new variant for __enable_irq() because free_irq() shuts
the interrupt down and __enable_irq() would just start it up again...
> INTERRUPT MIGRATION
> Migration changes the affinity mask, but it takes effect only when
> trying to run the handler. That too is prevented by IRQD_IRQ_INPROGRESS
> being set, so the new CPU will be used only after the interrupt exits
> the timer list.
How is affinity setting related to trying to run the handler? It steers
the interrupt to a new target CPU so that the next interrupt raised in
the device ends up at the new target.
So if that happens (and the interrupt is not masked) then the handler on
the new target CPU will poll for $DELAY until the other side fires the
timer and clears the bit. That's insane and needs to be prevented by
checking moderated state.
> HOTPLUG
> During shutdown the system moves timers and reassigns interrupt affinity
> to a new CPU. The easiest way to guarantee that pending events are
> handled correctly is to use a per-CPU "moderation_on" flag managed as
> follows by hotplug callbacks on CPUHP_ONLINE (or nearby)
That can be anywhere in the state machine between ONLINE and ACTIVE.
> - on setup, set the flag. Only now interrupts can be moderated.
> - on shutdown, with interrupts disabled, clear the flag, thus preventing
> more interrupts from being moderated on that CPU; process the list of moderated
> interrupts (as if the timer had fired), and cancel the timer.
Ok.
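A rough sketch of those callbacks (names hypothetical;
irq_moderation_flush() stands for a helper that processes the list as
if the timer had fired):

    static int irq_moderation_cpu_online(unsigned int cpu)
    {
            /* From now on interrupts may be moderated on this CPU. */
            per_cpu(irq_mod_state, cpu).moderation_on = true;
            return 0;
    }

    static int irq_moderation_cpu_offline(unsigned int cpu)
    {
            struct irq_mod_state *ms = &per_cpu(irq_mod_state, cpu);

            /* Interrupts are off here, no new irqs can join the list. */
            ms->moderation_on = false;
            irq_moderation_flush(ms);   /* re-enable all parked irqs */
            hrtimer_cancel(&ms->timer); /* field name assumed */
            return 0;
    }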
> SUSPEND
> The hotplug callbacks are also invoked during deep suspend so that
> makes it safe.
>
> BOOT PROCESSOR
> Hotplug callbacks are not invoked for the boot processor. However the
> boot processor is the last one to go, and since there is no other place
> to run the timer callbacks, they will be run where they are supposed to.
That's only true for hibernation, aka. suspend to disk, but not for
suspend to RAM/IDLE.
You want to treat all of these scenarios the same and just make sure
that the moderation muck is disabled and nothing hangs on a list,
especially not with the INPROGRESS bit set.
Thanks,
tglx
* [PATCH v2 4/8] genirq: soft_moderation: implement adaptive moderation
2025-11-16 18:28 [PATCH v2 0/8] platform wide software interrupt moderation Luigi Rizzo
` (2 preceding siblings ...)
2025-11-16 18:28 ` [PATCH v2 3/8] genirq: soft_moderation: implement fixed moderation Luigi Rizzo
@ 2025-11-16 18:28 ` Luigi Rizzo
2025-11-17 20:51 ` Thomas Gleixner
2025-11-16 18:28 ` [PATCH v2 5/8] x86/irq: soft_moderation: add support for posted_msi (intel) Luigi Rizzo
` (3 subsequent siblings)
7 siblings, 1 reply; 32+ messages in thread
From: Luigi Rizzo @ 2025-11-16 18:28 UTC (permalink / raw)
To: Thomas Gleixner, Marc Zyngier, Luigi Rizzo, Paolo Abeni,
Andrew Morton, Sean Christopherson, Jacob Pan
Cc: linux-kernel, linux-arch, Bjorn Helgaas, Willem de Bruijn,
Luigi Rizzo
Add two control parameters (target_irq_rate and hardirq_percent)
to indicate the desired maximum values for these two metrics.
Every update_ms the hook in handle_irq_event() recomputes the total and
local interrupt rate and the amount of time spent in hardirq, compares
the values with the targets, and adjusts the moderation delay up or down.
The interrupt rate is computed in a scalable way by counting interrupts
per-CPU, and aggregating the value in a global variable only every
update_ms. Only CPUs that actively process interrupts are actually
accessing the shared variable, so the system scales well even on very
large servers.
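The fast path only touches per-CPU state; simplified from the code below:

    /* per interrupt, per-CPU state only: */
    ms->irq_count++;

    /* at most every update_ms, only on CPUs that handle interrupts: */
    total = atomic_long_add_return(ms->irq_count, &irq_mod_info.total_intrs);
    irq_rate = (total - ms->last_total_irqs) * NSEC_PER_SEC / delta_time;
    ms->last_total_irqs = total;
    ms->irq_count = 1;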
EXAMPLE TESTING
Exceeding the thresholds requires suitably heavy workloads, such as
network or SSD traffic. Lower thresholds (e.g. target_irq_rate=50000,
hardirq_percent=10) can make it easier to reach those values.
# configure maximum delay (which is also the fixed moderation delay)
echo "delay_us=400" > /proc/irq/soft_moderation
# enable on network interrupts (change name as appropriate)
echo on | tee /proc/irq/*/*eth*/../soft_moderation
# ping times should reflect the 400us
ping -n -f -q -c 1000 ${some_nearby_host}
# show actual per-cpu delays and statistics
less /proc/irq/soft_moderation
# configure adaptive bounds. The control loop will adjust values
# based on actual load
echo "target_irq_rate=1000000" > /proc/irq/soft_moderation
echo "hardirq_percent=70" > /proc/irq/soft_moderation
# ping times now should be much lower
ping -n -f -q -c 1000 ${some_nearby_host}
# show actual per-cpu delays and statistics
less /proc/irq/soft_moderation
By generating high interrupt or hardirq loads, one can also test
the effectiveness of the control scheme and the sensitivity to
control parameters.
NEW PARAMETERS
target_irq_rate    0 = off, range 0-50000000, default 0
    The total maximum acceptable interrupt rate.
hardirq_percent    0 = off, range 0-100, default 0
    The maximum acceptable percentage of time spent in hardirq.
update_ms          range 1-100, default 5
    How often the control loop will readjust the delay.
Change-Id: Iccd762a61c4389eb15f06d4a35c5dab2821e5fd1
Signed-off-by: Luigi Rizzo <lrizzo@google.com>
---
kernel/irq/irq_moderation.c | 177 +++++++++++++++++++++++++++++++++++-
kernel/irq/irq_moderation.h | 10 +-
2 files changed, 182 insertions(+), 5 deletions(-)
diff --git a/kernel/irq/irq_moderation.c b/kernel/irq/irq_moderation.c
index 688c8e8c49392..72be9e88c3890 100644
--- a/kernel/irq/irq_moderation.c
+++ b/kernel/irq/irq_moderation.c
@@ -56,9 +56,18 @@
* napi_defer_hard_irqs on NICs.
* A small value may help control load in interrupt-challenged platforms.
*
+ * FIXED MODERATION mode requires target_irq_rate=0, hardirq_percent=0
+ *
+ * target_irq_rate (default 0, suggested 1000000, 0 off, range 0..50M)
+ * The total irq rate above which moderation kicks in.
+ * Not particularly critical, a value in the 500K-1M range is usually ok.
+ *
+ * hardirq_percent (default 0, suggested 70, 0 off, range 10..100)
+ * The hardirq percentage above which moderation kicks in.
+ * 50-90 is a reasonable range.
*
* update_ms (default 5, range 1...100)
- * How often moderation delay is updated.
+ * How often the load is measured and moderation delay updated.
*
* Moderation can be enabled/disabled for individual interrupts with
*
@@ -73,6 +82,9 @@
/* Recommended values for the control loop. */
struct irq_mod_info irq_mod_info ____cacheline_aligned = {
.update_ms = 5,
+ .scale_cpus = 200,
+ .decay_factor = 16,
+ .grow_factor = 8,
};
/* Boot time value, copied to irq_mod_info.delay_us after init. */
@@ -83,6 +95,12 @@ MODULE_PARM_DESC(delay_us, "Max moderation delay us, 0 = moderation off, range 0
module_param_named(timer_rounds, irq_mod_info.timer_rounds, uint, 0444);
MODULE_PARM_DESC(timer_rounds, "How many timer polls once moderation triggers, range 0-20.");
+module_param_named(hardirq_percent, irq_mod_info.hardirq_percent, uint, 0444);
+MODULE_PARM_DESC(hardirq_percent, "Target max hardirq percentage, 0 off, range 0-100.");
+
+module_param_named(target_irq_rate, irq_mod_info.target_irq_rate, uint, 0444);
+MODULE_PARM_DESC(target_irq_rate, "Target max interrupt rate, 0 off, range 0-50000000.");
+
module_param_named(update_ms, irq_mod_info.update_ms, uint, 0444);
MODULE_PARM_DESC(update_ms, "Update interval in milliseconds, range 1-100");
@@ -96,6 +114,108 @@ static inline void smooth_avg(u32 *dst, u32 val, u32 steps)
*dst = ((64 - steps) * *dst + steps * val) / 64;
}
+/* Measure and assess time spent in hardirq. */
+static inline bool hardirq_high(struct irq_mod_state *ms, u64 delta_time, u32 hardirq_percent)
+{
+ bool above_threshold = false;
+
+ if (IS_ENABLED(CONFIG_IRQ_TIME_ACCOUNTING)) {
+ u64 irqtime, cur = kcpustat_this_cpu->cpustat[CPUTIME_IRQ];
+
+ irqtime = cur - ms->last_irqtime;
+ ms->last_irqtime = cur;
+
+ above_threshold = irqtime * 100 > delta_time * hardirq_percent;
+ ms->hardirq_high += above_threshold;
+ }
+ return above_threshold;
+}
+
+/* Measure and assess total and per-CPU interrupt rates. */
+static inline bool irqrate_high(struct irq_mod_state *ms, u64 delta_time,
+ u32 target_rate, u32 steps)
+{
+ u64 irq_rate, my_irq_rate, tmp, delta_irqs, active_cpus;
+ bool my_rate_high, global_rate_high;
+
+ my_irq_rate = (ms->irq_count * NSEC_PER_SEC) / delta_time;
+ /* Accumulate global counter and compute global irq rate. */
+ tmp = atomic_long_add_return(ms->irq_count, &irq_mod_info.total_intrs);
+ ms->irq_count = 1;
+ delta_irqs = tmp - ms->last_total_irqs;
+ ms->last_total_irqs = tmp;
+ irq_rate = (delta_irqs * NSEC_PER_SEC) / delta_time;
+
+ /*
+ * Count how many CPUs handled interrupts in the last interval, needed
+ * to determine the per-CPU target (target_rate / active_cpus).
+ * Each active CPU increments the global counter approximately every
+ * update_ns. Scale the value by (update_ns / delta_time) to get the
+ * correct value. Also apply rounding and make sure active_cpus > 0.
+ */
+ tmp = atomic_long_add_return(1, &irq_mod_info.total_cpus);
+ active_cpus = tmp - ms->last_total_cpus;
+ ms->last_total_cpus = tmp;
+ active_cpus = (active_cpus * ms->update_ns + delta_time / 2) / delta_time;
+ if (active_cpus < 1)
+ active_cpus = 1;
+
+ /* Compare with global and per-CPU targets. */
+ global_rate_high = irq_rate > target_rate;
+ my_rate_high = my_irq_rate * active_cpus * irq_mod_info.scale_cpus > target_rate * 100;
+
+ /* Statistics. */
+ smooth_avg(&ms->irq_rate, irq_rate, steps);
+ smooth_avg(&ms->my_irq_rate, my_irq_rate, steps);
+ smooth_avg(&ms->scaled_cpu_count, active_cpus * 256, steps);
+ ms->my_irq_high += my_rate_high;
+ ms->irq_high += global_rate_high;
+
+ /* Moderate on this CPU only if both global and local rates are high. */
+ return global_rate_high && my_rate_high;
+}
+
+/* Adjust the moderation delay, called at most every update_ns. */
+void __irq_moderation_adjust_delay(struct irq_mod_state *ms, u64 delta_time)
+{
+ u32 hardirq_percent = READ_ONCE(irq_mod_info.hardirq_percent);
+ u32 target_rate = READ_ONCE(irq_mod_info.target_irq_rate);
+ bool below_target = true;
+ u32 steps;
+
+ if (target_rate == 0 && hardirq_percent == 0) {
+ /* Use fixed moderation delay. */
+ ms->mod_ns = ms->delay_ns;
+ ms->irq_rate = 0;
+ ms->my_irq_rate = 0;
+ ms->scaled_cpu_count = 0;
+ return;
+ }
+
+ /* Compute decay steps based on elapsed time, bound to a reasonable value. */
+ steps = delta_time > 10 * ms->update_ns ? 10 : 1 + (delta_time / ms->update_ns);
+
+ if (target_rate > 0 && irqrate_high(ms, delta_time, target_rate, steps))
+ below_target = false;
+
+ if (hardirq_percent > 0 && hardirq_high(ms, delta_time, hardirq_percent))
+ below_target = false;
+
+ /* Controller: adjust delay with exponential decay/grow. */
+ if (below_target) {
+ ms->mod_ns -= ms->mod_ns * steps / (steps + irq_mod_info.decay_factor);
+ if (ms->mod_ns < 100)
+ ms->mod_ns = 0;
+ } else {
+ /* Exponential growth does not restart if value is too small. */
+ if (ms->mod_ns < 500)
+ ms->mod_ns = 500;
+ ms->mod_ns += ms->mod_ns * steps / (steps + irq_mod_info.grow_factor);
+ if (ms->mod_ns > ms->delay_ns)
+ ms->mod_ns = ms->delay_ns;
+ }
+}
+
/* Moderation timer handler. */
static enum hrtimer_restart timer_cb(struct hrtimer *timer)
{
@@ -169,8 +289,10 @@ static inline int set_moderation_mode(struct irq_desc *desc, bool enable)
/* Print statistics */
static int moderation_show(struct seq_file *p, void *v)
{
+ ulong irq_rate = 0, irq_high = 0, my_irq_high = 0, hardirq_high = 0;
uint delay_us = irq_mod_info.delay_us;
- int j;
+ u64 now = ktime_get_ns();
+ int j, active_cpus = 0;
#define HEAD_FMT "%5s %8s %8s %4s %4s %8s %11s %11s %11s %11s %11s %11s %11s %9s\n"
#define BODY_FMT "%5u %8u %8u %4u %4u %8u %11u %11u %11u %11u %11u %11u %11u %9u\n"
@@ -182,6 +304,23 @@ static int moderation_show(struct seq_file *p, void *v)
for_each_possible_cpu(j) {
struct irq_mod_state *ms = per_cpu_ptr(&irq_mod_state, j);
+ u64 age_ms = min((now - ms->last_ns) / NSEC_PER_MSEC, (u64)999999);
+
+ if (age_ms < 10000) {
+ /* Average irq_rate over recently active CPUs. */
+ active_cpus++;
+ irq_rate += ms->irq_rate;
+ } else {
+ ms->irq_rate = 0;
+ ms->my_irq_rate = 0;
+ ms->scaled_cpu_count = 64;
+ ms->scaled_src_count = 64;
+ ms->mod_ns = 0;
+ }
+
+ irq_high += ms->irq_high;
+ my_irq_high += ms->my_irq_high;
+ hardirq_high += ms->hardirq_high;
seq_printf(p, BODY_FMT,
j, ms->irq_rate, ms->my_irq_rate,
@@ -195,9 +334,34 @@ static int moderation_show(struct seq_file *p, void *v)
seq_printf(p, "\n"
"enabled %s\n"
"delay_us %u\n"
- "timer_rounds %u\n",
+ "timer_rounds %u\n"
+ "target_irq_rate %u\n"
+ "hardirq_percent %u\n"
+ "update_ms %u\n"
+ "scale_cpus %u\n"
+ "count_timer_calls %s\n"
+ "count_msi_calls %s\n"
+ "decay_factor %u\n"
+ "grow_factor %u\n",
str_yes_no(delay_us > 0),
- delay_us, irq_mod_info.timer_rounds);
+ delay_us, irq_mod_info.timer_rounds,
+ irq_mod_info.target_irq_rate, irq_mod_info.hardirq_percent,
+ irq_mod_info.update_ms, irq_mod_info.scale_cpus,
+ str_yes_no(irq_mod_info.count_timer_calls),
+ str_yes_no(irq_mod_info.count_msi_calls),
+ irq_mod_info.decay_factor, irq_mod_info.grow_factor);
+
+ seq_printf(p,
+ "irq_rate %lu\n"
+ "irq_high %lu\n"
+ "my_irq_high %lu\n"
+ "hardirq_percent_high %lu\n"
+ "total_interrupts %lu\n"
+ "total_cpus %lu\n",
+ active_cpus ? irq_rate / active_cpus : 0,
+ irq_high, my_irq_high, hardirq_high,
+ READ_ONCE(*((ulong *)&irq_mod_info.total_intrs)),
+ READ_ONCE(*((ulong *)&irq_mod_info.total_cpus)));
return 0;
}
@@ -218,10 +382,15 @@ struct param_names {
static struct param_names param_names[] = {
{ "delay_us", &irq_mod_info.delay_us, 0, 500 },
{ "timer_rounds", &irq_mod_info.timer_rounds, 0, 20 },
+ { "target_irq_rate", &irq_mod_info.target_irq_rate, 0, 50000000 },
+ { "hardirq_percent", &irq_mod_info.hardirq_percent, 0, 100 },
{ "update_ms", &irq_mod_info.update_ms, 1, 100 },
/* Empty entry indicates the following are not settable from procfs. */
{},
+ { "scale_cpus", &irq_mod_info.scale_cpus, 50, 1000 },
{ "count_timer_calls", &irq_mod_info.count_timer_calls, 0, 1 },
+ { "decay_factor", &irq_mod_info.decay_factor, 8, 64 },
+ { "grow_factor", &irq_mod_info.grow_factor, 8, 64 },
};
static void update_enable_key(void)
diff --git a/kernel/irq/irq_moderation.h b/kernel/irq/irq_moderation.h
index d9bc672ffd6f1..3543e8e8b6e2d 100644
--- a/kernel/irq/irq_moderation.h
+++ b/kernel/irq/irq_moderation.h
@@ -138,6 +138,8 @@ DECLARE_PER_CPU_ALIGNED(struct irq_mod_state, irq_mod_state);
extern struct static_key_false irq_moderation_enabled_key;
+void __irq_moderation_adjust_delay(struct irq_mod_state *ms, u64 delta_time);
+
/* Called on each interrupt for adaptive moderation delay adjustment. */
static inline void irq_moderation_adjust_delay(struct irq_mod_state *ms)
{
@@ -157,7 +159,13 @@ static inline void irq_moderation_adjust_delay(struct irq_mod_state *ms)
ms->update_ns = READ_ONCE(irq_mod_info.update_ms) * NSEC_PER_MSEC;
ms->delay_ns = READ_ONCE(irq_mod_info.delay_us) * NSEC_PER_USEC;
- ms->mod_ns = ms->delay_ns;
+ /* If config changed, restart from the highest delay. */
+ if (ktime_compare(irq_mod_info.procfs_write_ns, ms->last_ns) > 0)
+ ms->mod_ns = ms->delay_ns;
+
+ ms->last_ns = now;
+ /* Do the expensive processing */
+ __irq_moderation_adjust_delay(ms, delta_time);
}
/* Return true if timer is active or delay is large enough to require moderation */
--
2.52.0.rc1.455.g30608eb744-goog
* Re: [PATCH v2 4/8] genirq: soft_moderation: implement adaptive moderation
2025-11-16 18:28 ` [PATCH v2 4/8] genirq: soft_moderation: implement adaptive moderation Luigi Rizzo
@ 2025-11-17 20:51 ` Thomas Gleixner
2025-11-17 21:34 ` Luigi Rizzo
0 siblings, 1 reply; 32+ messages in thread
From: Thomas Gleixner @ 2025-11-17 20:51 UTC (permalink / raw)
To: Luigi Rizzo, Marc Zyngier, Luigi Rizzo, Paolo Abeni,
Andrew Morton, Sean Christopherson, Jacob Pan
Cc: linux-kernel, linux-arch, Bjorn Helgaas, Willem de Bruijn,
Luigi Rizzo
On Sun, Nov 16 2025 at 18:28, Luigi Rizzo wrote:
> Add two control parameters (target_irq_rate and hardirq_percent)
> to indicate the desired maximum values for these two metrics.
>
> Every update_ms the hook in handle_irq_event() recomputes the total and
> local interrupt rate and the amount of time spent in hardirq, compares
> the values with the targets, and adjusts the moderation delay up or down.
>
> The interrupt rate is computed in a scalable way by counting interrupts
> per-CPU, and aggregating the value in a global variable only every
> update_ms. Only CPUs that actively process interrupts are actually
> accessing the shared variable, so the system scales well even on very
> large servers.
You still fail to explain why this is required and why a per CPU
moderation is not sufficient and what the benefits of the approach are.
I'm happy to refer you to Documentation once more. It's well explained
there.
> +/* Measure and assess time spent in hardirq. */
> +static inline bool hardirq_high(struct irq_mod_state *ms, u64 delta_time, u32 hardirq_percent)
> +{
> + bool above_threshold = false;
> +
> + if (IS_ENABLED(CONFIG_IRQ_TIME_ACCOUNTING)) {
> + u64 irqtime, cur = kcpustat_this_cpu->cpustat[CPUTIME_IRQ];
> +
> + irqtime = cur - ms->last_irqtime;
> + ms->last_irqtime = cur;
> +
> + above_threshold = irqtime * 100 > delta_time * hardirq_percent;
> + ms->hardirq_high += above_threshold;
> + }
> + return above_threshold;
> +}
> +
> +/* Measure and assess total and per-CPU interrupt rates. */
> +static inline bool irqrate_high(struct irq_mod_state *ms, u64 delta_time,
> + u32 target_rate, u32 steps)
> +{
> + u64 irq_rate, my_irq_rate, tmp, delta_irqs, active_cpus;
> + bool my_rate_high, global_rate_high;
> +
> + my_irq_rate = (ms->irq_count * NSEC_PER_SEC) / delta_time;
> + /* Accumulate global counter and compute global irq rate. */
> + tmp = atomic_long_add_return(ms->irq_count, &irq_mod_info.total_intrs);
> + ms->irq_count = 1;
> + delta_irqs = tmp - ms->last_total_irqs;
> + ms->last_total_irqs = tmp;
> + irq_rate = (delta_irqs * NSEC_PER_SEC) / delta_time;
> +
> + /*
> + * Count how many CPUs handled interrupts in the last interval, needed
> + * to determine the per-CPU target (target_rate / active_cpus).
> + * Each active CPU increments the global counter approximately every
> + * update_ns. Scale the value by (update_ns / delta_time) to get the
> + * correct value. Also apply rounding and make sure active_cpus > 0.
> + */
> + tmp = atomic_long_add_return(1, &irq_mod_info.total_cpus);
> + active_cpus = tmp - ms->last_total_cpus;
> + ms->last_total_cpus = tmp;
> + active_cpus = (active_cpus * ms->update_ns + delta_time / 2) / delta_time;
> + if (active_cpus < 1)
> + active_cpus = 1;
> +
> + /* Compare with global and per-CPU targets. */
> + global_rate_high = irq_rate > target_rate;
> + my_rate_high = my_irq_rate * active_cpus * irq_mod_info.scale_cpus > target_rate * 100;
> +
> + /* Statistics. */
> + smooth_avg(&ms->irq_rate, irq_rate, steps);
> + smooth_avg(&ms->my_irq_rate, my_irq_rate, steps);
> + smooth_avg(&ms->scaled_cpu_count, active_cpus * 256, steps);
> + ms->my_irq_high += my_rate_high;
> + ms->irq_high += global_rate_high;
> +
> + /* Moderate on this CPU only if both global and local rates are high. */
Because it's desired that CPUs can be starved by interrupts when enough
other CPUs only have a very low rate? I'm failing to understand that
logic and the comprehensive rationale in the change log does not help either.
> + return global_rate_high && my_rate_high;
> +}
> +
> +/* Adjust the moderation delay, called at most every update_ns. */
> +void __irq_moderation_adjust_delay(struct irq_mod_state *ms, u64 delta_time)
> +{
> + u32 hardirq_percent = READ_ONCE(irq_mod_info.hardirq_percent);
> + u32 target_rate = READ_ONCE(irq_mod_info.target_irq_rate);
> + bool below_target = true;
> + u32 steps;
> +
> + if (target_rate == 0 && hardirq_percent == 0) {
> + /* Use fixed moderation delay. */
> + ms->mod_ns = ms->delay_ns;
> + ms->irq_rate = 0;
> + ms->my_irq_rate = 0;
> + ms->scaled_cpu_count = 0;
> + return;
> + }
> +
> + /* Compute decay steps based on elapsed time, bound to a reasonable value. */
> + steps = delta_time > 10 * ms->update_ns ? 10 : 1 + (delta_time / ms->update_ns);
> +
> + if (target_rate > 0 && irqrate_high(ms, delta_time, target_rate, steps))
> + below_target = false;
> +
> + if (hardirq_percent > 0 && hardirq_high(ms, delta_time, hardirq_percent))
> + below_target = false;
So that rate limits a per CPU overload, but only when IRQTIME accounting
is enabled. Oh well...
> + /* Controller: adjust delay with exponential decay/grow. */
> + if (below_target) {
> + ms->mod_ns -= ms->mod_ns * steps / (steps + irq_mod_info.decay_factor);
> + if (ms->mod_ns < 100)
> + ms->mod_ns = 0;
These random numbers used to limit things are truly interesting and
easy to understand - NOT.
> + } else {
> + /* Exponential growth does not restart if value is too small. */
> + if (ms->mod_ns < 500)
> + ms->mod_ns = 500;
> + ms->mod_ns += ms->mod_ns * steps / (steps + irq_mod_info.grow_factor);
> + if (ms->mod_ns > ms->delay_ns)
> + ms->mod_ns = ms->delay_ns;
> + }
Why does this need separate grow and decay factors? Just because more
knobs are better?
> +}
> +
> /* Moderation timer handler. */
> static enum hrtimer_restart timer_cb(struct hrtimer *timer)
> {
> @@ -169,8 +289,10 @@ static inline int set_moderation_mode(struct irq_desc *desc, bool enable)
> /* Print statistics */
> static int moderation_show(struct seq_file *p, void *v)
> {
> + ulong irq_rate = 0, irq_high = 0, my_irq_high = 0, hardirq_high = 0;
> uint delay_us = irq_mod_info.delay_us;
> - int j;
> + u64 now = ktime_get_ns();
> + int j, active_cpus = 0;
>
> #define HEAD_FMT "%5s %8s %8s %4s %4s %8s %11s %11s %11s %11s %11s %11s %11s %9s\n"
> #define BODY_FMT "%5u %8u %8u %4u %4u %8u %11u %11u %11u %11u %11u %11u %11u %9u\n"
> @@ -182,6 +304,23 @@ static int moderation_show(struct seq_file *p, void *v)
>
> for_each_possible_cpu(j) {
> struct irq_mod_state *ms = per_cpu_ptr(&irq_mod_state, j);
> + u64 age_ms = min((now - ms->last_ns) / NSEC_PER_MSEC, (u64)999999);
Not new in this patch: All of those accesses to remote CPU state are
racy and need to be annotated. This clearly never saw any testing with
KCSAN enabled. I'm happy to point to Documentation once again.
What's new is that this one is obviously broken on 32-bit because read
and write of a 64-bit value are split.
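At minimum the remote reads want annotations along the lines of

    irq_high += READ_ONCE(ms->irq_high);

and the 64-bit ms->last_ns access additionally needs a tear-free scheme
(u64_stats_sync or similar) to be correct on 32-bit.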
> + if (age_ms < 10000) {
> + /* Average irq_rate over recently active CPUs. */
> + active_cpus++;
> + irq_rate += ms->irq_rate;
> + } else {
> + ms->irq_rate = 0;
> + ms->my_irq_rate = 0;
> + ms->scaled_cpu_count = 64;
> + ms->scaled_src_count = 64;
> + ms->mod_ns = 0;
> + }
> +
> + irq_high += ms->irq_high;
> + my_irq_high += ms->my_irq_high;
> + hardirq_high += ms->hardirq_high;
>
> seq_printf(p, BODY_FMT,
> j, ms->irq_rate, ms->my_irq_rate,
> @@ -195,9 +334,34 @@ static int moderation_show(struct seq_file *p, void *v)
> seq_printf(p, "\n"
> "enabled %s\n"
> "delay_us %u\n"
> - "timer_rounds %u\n",
> + "timer_rounds %u\n"
> + "target_irq_rate %u\n"
> + "hardirq_percent %u\n"
> + "update_ms %u\n"
> + "scale_cpus %u\n"
> + "count_timer_calls %s\n"
> + "count_msi_calls %s\n"
> + "decay_factor %u\n"
> + "grow_factor %u\n",
> str_yes_no(delay_us > 0),
> - delay_us, irq_mod_info.timer_rounds);
> + delay_us, irq_mod_info.timer_rounds,
> + irq_mod_info.target_irq_rate, irq_mod_info.hardirq_percent,
> + irq_mod_info.update_ms, irq_mod_info.scale_cpus,
> + str_yes_no(irq_mod_info.count_timer_calls),
> + str_yes_no(irq_mod_info.count_msi_calls),
> + irq_mod_info.decay_factor, irq_mod_info.grow_factor);
> +
> + seq_printf(p,
> + "irq_rate %lu\n"
> + "irq_high %lu\n"
> + "my_irq_high %lu\n"
> + "hardirq_percent_high %lu\n"
> + "total_interrupts %lu\n"
> + "total_cpus %lu\n",
> + active_cpus ? irq_rate / active_cpus : 0,
> + irq_high, my_irq_high, hardirq_high,
> + READ_ONCE(*((ulong *)&irq_mod_info.total_intrs)),
> + READ_ONCE(*((ulong *)&irq_mod_info.total_cpus)));
The more I look at this insane amount of telemetry the more I am
convinced that this is overly complex just for complexity's sake.
Thanks,
tglx
* Re: [PATCH v2 4/8] genirq: soft_moderation: implement adaptive moderation
2025-11-17 20:51 ` Thomas Gleixner
@ 2025-11-17 21:34 ` Luigi Rizzo
2025-11-17 23:05 ` Thomas Gleixner
2025-11-18 9:00 ` Thomas Gleixner
0 siblings, 2 replies; 32+ messages in thread
From: Luigi Rizzo @ 2025-11-17 21:34 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Marc Zyngier, Luigi Rizzo, Paolo Abeni, Andrew Morton,
Sean Christopherson, Jacob Pan, linux-kernel, linux-arch,
Bjorn Helgaas, Willem de Bruijn
On Mon, Nov 17, 2025 at 9:51 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Sun, Nov 16 2025 at 18:28, Luigi Rizzo wrote:
> > Add two control parameters (target_irq_rate and hardirq_percent)
> > to indicate the desired maximum values for these two metrics.
> >
> > Every update_ms the hook in handle_irq_event() recomputes the total and
> > local interrupt rate and the amount of time spent in hardirq, compares
> > the values with the targets, and adjusts the moderation delay up or down.
> >
> > The interrupt rate is computed in a scalable way by counting interrupts
> > per-CPU, and aggregating the value in a global variable only every
> > update_ms. Only CPUs that actively process interrupts are actually
> > accessing the shared variable, so the system scales well even on very
> > large servers.
>
> You still fail to explain why this is required and why a per CPU
> moderation is not sufficient and what the benefits of the approach are.
It was explained in the first patch of the series and in Documentation/.
(First world problem, for sure: I have examples for AMD, Intel, Arm,
all of them with 100+ CPUs per NUMA node, and 160-480 CPUs total.)
On some of the above platforms, MSI-X interrupts cause heavy serialization
of all other PCIe requests. As a result, when the total interrupt rate
exceeds 1-2M intrs/s, I/O throughput degrades by 4x or more.
To deal with this using per-CPU moderation, without shared state, each CPU
cannot allow more than some 5K intrs/s, which means fixed moderation
would have to be set at 200us, and adaptive moderation would have to jump
to such delays as soon as the local rate reaches 5K intrs/s.
In reality, it is very unlikely that all CPUs are actively handling such high
rates, so if we know better, we can adjust or remove moderation individually,
based on actual local and total interrupt rates and number of active CPUs.
The purpose of this global mechanism is to figure out whether we are
approaching a dangerous rate, and do individual tuning.
> > + /* Compare with global and per-CPU targets. */
> > + global_rate_high = irq_rate > target_rate;
> > + my_rate_high = my_irq_rate * active_cpus * irq_mod_info.scale_cpus > target_rate * 100;
> > [...]
> > + /* Moderate on this CPU only if both global and local rates are high. */
>
> Because it's desired that CPUs can be starved by interrupts when enough
> other CPUs only have a very low rate? I'm failing to understand that
> logic and the comprehensive rationale in the change log does not help either.
The comment could be worded better, s/moderate/bump delay up/
The mechanism aims to make total_rate < target, by gently kicking
individual delays up or down based on the condition

    total_rate > target && local_rate > target / active_cpus
        ? bump_up() : bump_down()

If the control is effective, the total rate will stay within bound and
nobody suffers, neither the CPUs handling <1K intr/s, nor the lonely
CPU handling 100K+ intr/s.
If the rates suddenly go up, the CPUs with higher rates will be moderated
more, hopefully converging to a new equilibrium.
Like any control system, it has limits on what it can do.
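Concretely, with the posted defaults (scale_cpus=200), the local check

    my_irq_rate * active_cpus * scale_cpus > target_rate * 100

flags a CPU as "high" once it exceeds half of its fair share: e.g.
target_irq_rate=1000000 with 8 active CPUs gives a per-CPU threshold
of 1000000 * 100 / (8 * 200) = 62500 intr/s.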
> > [...]
> > + if (target_rate > 0 && irqrate_high(ms, delta_time, target_rate, steps))
> > + below_target = false;
> > +
> > + if (hardirq_percent > 0 && hardirq_high(ms, delta_time, hardirq_percent))
> > + below_target = false;
>
> So that rate limits a per CPU overload, but only when IRQTIME accounting
> is enabled. Oh well...
I can add checks to disallow setting the per-CPU overload when IRQTIME
accounting is not present.
>
> > + /* Controller: adjust delay with exponential decay/grow. */
> > + if (below_target) {
> > + ms->mod_ns -= ms->mod_ns * steps / (steps + irq_mod_info.decay_factor);
> > + if (ms->mod_ns < 100)
> > + ms->mod_ns = 0;
>
> These random numbers used to limit things are truly interresting and
> easy to understand - NOT.
>
> > + } else {
> > + /* Exponential grow does not restart if value is too small. */
> > + if (ms->mod_ns < 500)
> > + ms->mod_ns = 500;
> > + ms->mod_ns += ms->mod_ns * steps / (steps + irq_mod_info.grow_factor);
> > + if (ms->mod_ns > ms->delay_ns)
> > + ms->mod_ns = ms->delay_ns;
> > + }
>
> Why does this need separate grow and decay factors? Just because more
> knobs are better?
Like in TCP: brake aggressively (grow_factor is smaller) to respond
quickly to overload, and accelerate prudently (decay_factor is higher)
to avoid reacting too optimistically.
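In numbers, with the posted defaults and steps=1: above target the delay
grows by 1 + 1/(1+8) ~= 1.11x per update (grow_factor=8, ~11% up); below
target it shrinks by 16/17 ~= 0.94x (decay_factor=16, ~6% down), so
braking is roughly twice as fast as releasing.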
thanks again for the attention
luigi
* Re: [PATCH v2 4/8] genirq: soft_moderation: implement adaptive moderation
2025-11-17 21:34 ` Luigi Rizzo
@ 2025-11-17 23:05 ` Thomas Gleixner
2025-11-18 9:00 ` Thomas Gleixner
1 sibling, 0 replies; 32+ messages in thread
From: Thomas Gleixner @ 2025-11-17 23:05 UTC (permalink / raw)
To: Luigi Rizzo
Cc: Marc Zyngier, Luigi Rizzo, Paolo Abeni, Andrew Morton,
Sean Christopherson, Jacob Pan, linux-kernel, linux-arch,
Bjorn Helgaas, Willem de Bruijn
On Mon, Nov 17 2025 at 22:34, Luigi Rizzo wrote:
> On Mon, Nov 17, 2025 at 9:51 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>> On Sun, Nov 16 2025 at 18:28, Luigi Rizzo wrote:
>> > Add two control parameters (target_irq_rate and hardirq_percent)
>> > to indicate the desired maximum values for these two metrics.
>> >
>> > Every update_ms the hook in handle_irq_event() recomputes the total and
>> > local interrupt rate and the amount of time spent in hardirq, compares
>> > the values with the targets, and adjusts the moderation delay up or down.
>> >
>> > The interrupt rate is computed in a scalable way by counting interrupts
>> > per-CPU, and aggregating the value in a global variable only every
>> > update_ms. Only CPUs that actively process interrupts are actually
>> > accessing the shared variable, so the system scales well even on very
>> > large servers.
>>
>> You still fail to explain why this is required and why a per CPU
>> moderation is not sufficient and what the benefits of the approach are.
>
> It was explained in the first patch of the series and in Documentation/.
Change logs of a patch have to be self-contained whether you like it or
not. I'm not chasing random bits of information across a series or
cover letter, and once a patch is applied it becomes even harder to make
sense of it when the change log contains zero information.
> (First world problem, for sure: I have examples for AMD, Intel, Arm,
> all of them with 100+ CPUs per NUMA node, and 160-480 CPUs total.)
> On some of the above platforms, MSI-X interrupts cause heavy serialization
> of all other PCIe requests. As a result, when the total interrupt rate
> exceeds 1-2M intrs/s, I/O throughput degrades by 4x or more.
>
> To deal with this using per-CPU moderation, without shared state, each CPU
> cannot allow more than some 5K intrs/s, which means fixed moderation
> would have to be set at 200us, and adaptive moderation would have to jump
> to such delays as soon as the local rate reaches 5K intrs/s.
>
> In reality, it is very unlikely that all CPUs are actively handling such high
> rates, so if we know better, we can adjust or remove moderation individually,
> based on actual local and total interrupt rates and number of active CPUs.
>
> The purpose of this global mechanism is to figure out whether we are
> approaching a dangerous rate, and do individual tuning.
Makes sense, but I'm not convinced yet that this needs to be as complex
as it is.
>> > + /* Compare with global and per-CPU targets. */
>> > + global_rate_high = irq_rate > target_rate;
>> > + my_rate_high = my_irq_rate * active_cpus * irq_mod_info.scale_cpus > target_rate * 100;
>> > [...]
>> > + /* Moderate on this CPU only if both global and local rates are high. */
>>
>> Because it's desired that CPUs can be starved by interrupts when enough
>> other CPUs only have a very low rate? I'm failing to understand that
>> logic and the comprehensive rationale in the change log does not help either.
>
> The comment could be worded better, s/moderate/bump delay up/
>
> The mechanism aims to make total_rate < target, by gently kicking
> individual delays up or down based on the condition
>
>     total_rate > target && local_rate > target / active_cpus
>         ? bump_up() : bump_down()
>
> If the control is effective, the total rate will stay within bound and
> nobody suffers, neither the CPUs handling <1K intr/s, nor the lonely
> CPU handling 100K+ intr/s.
> If the rates suddenly go up, the CPUs with higher rates will be moderated
> more, hopefully converging to a new equilibrium.
> Like any control system, it has limits on what it can do.
I understand that, but without proper information in code and change log
anyone exposed to this code 6 months down the road will bump his head on
the wall when staring at it (including you).
>> > [...]
>> > + if (target_rate > 0 && irqrate_high(ms, delta_time, target_rate, steps))
>> > + below_target = false;
>> > +
>> > + if (hardirq_percent > 0 && hardirq_high(ms, delta_time, hardirq_percent))
>> > + below_target = false;
>>
>> So that rate limits a per CPU overload, but only when IRQTIME accounting
>> is enabled. Oh well...
>
> I can add checks to disallow setting the per-CPU overload when IRQTIME
> accounting is not present.
That solves what? It disables the setting, but that does not make the
functionality any different. Also the compiler is smart enough to
eliminate all that code because the return value of hardirq_high() is
constant.
>> > + } else {
>> > + /* Exponential growth does not restart if value is too small. */
>> > + if (ms->mod_ns < 500)
>> > + ms->mod_ns = 500;
>> > + ms->mod_ns += ms->mod_ns * steps / (steps + irq_mod_info.grow_factor);
>> > + if (ms->mod_ns > ms->delay_ns)
>> > + ms->mod_ns = ms->delay_ns;
>> > + }
>>
>> Why does this need separate grow and decay factors? Just because more
>> knobs are better?
>
> Like in TCP, brake aggressively (grow factor is smaller) to respond
> quickly to overload,
> and accelerate prudently (decay factor is higher) to avoid reacting
> too optimistically.
Why do I have to ask for all this information piecewise?
Thanks,
tglx
* Re: [PATCH v2 4/8] genirq: soft_moderation: implement adaptive moderation
2025-11-17 21:34 ` Luigi Rizzo
2025-11-17 23:05 ` Thomas Gleixner
@ 2025-11-18 9:00 ` Thomas Gleixner
1 sibling, 0 replies; 32+ messages in thread
From: Thomas Gleixner @ 2025-11-18 9:00 UTC (permalink / raw)
To: Luigi Rizzo
Cc: Marc Zyngier, Luigi Rizzo, Paolo Abeni, Andrew Morton,
Sean Christopherson, Jacob Pan, linux-kernel, linux-arch,
Bjorn Helgaas, Willem de Bruijn
On Mon, Nov 17 2025 at 22:34, Luigi Rizzo wrote:
> (First world problem, for sure: I have examples for AMD, Intel, Arm,
> all of them with 100+ CPUs per NUMA node, and 160-480 CPUs total.)
> On some of the above platforms, MSI-X interrupts cause heavy serialization
> of all other PCIe requests. As a result, when the total interrupt rate
> exceeds 1-2M intrs/s, I/O throughput degrades by 4x or more.
Is it actually the sum of all interrupts in the system or is it
segmented per root port?
Thanks,
tglx
* [PATCH v2 5/8] x86/irq: soft_moderation: add support for posted_msi (intel)
2025-11-16 18:28 [PATCH v2 0/8] platform wide software interrupt moderation Luigi Rizzo
` (3 preceding siblings ...)
2025-11-16 18:28 ` [PATCH v2 4/8] genirq: soft_moderation: implement adaptive moderation Luigi Rizzo
@ 2025-11-16 18:28 ` Luigi Rizzo
2025-11-17 21:14 ` Thomas Gleixner
2025-11-17 21:36 ` kernel test robot
2025-11-16 18:28 ` [PATCH v2 6/8] genirq: soft_moderation: helpers for per-driver defaults Luigi Rizzo
` (2 subsequent siblings)
7 siblings, 2 replies; 32+ messages in thread
From: Luigi Rizzo @ 2025-11-16 18:28 UTC (permalink / raw)
To: Thomas Gleixner, Marc Zyngier, Luigi Rizzo, Paolo Abeni,
Andrew Morton, Sean Christopherson, Jacob Pan
Cc: linux-kernel, linux-arch, Bjorn Helgaas, Willem de Bruijn,
Luigi Rizzo
On recent Intel CPUs, with kernels compiled with CONFIG_X86_POSTED_MSI=y
and booted with "intremap=posted_msi", all MSI interrupts that hit a CPU
are delivered through a single POSTED_MSI notification vector, processed
by sysvec_posted_msi_notification(), instead of as separate interrupts.
This change adds soft moderation hooks to the above handler.
Soft moderation on posted_msi does not require per-source enable,
irq_moderation.delay_us > 0 suffices.
To test it, run a kernel with the above options and enable moderation by
setting delay_us > 0. The column "from_msi" in /proc/irq/soft_moderation
will show a non-zero value.
Change-Id: I07b83b428de6f6541e3903b553c1b837c68a0b7d
Signed-off-by: Luigi Rizzo <lrizzo@google.com>
---
arch/x86/kernel/Makefile | 2 +-
arch/x86/kernel/irq.c | 13 +++++++
kernel/irq/irq_moderation.c | 38 ++++++++++++++++++-
kernel/irq/irq_moderation.h | 73 ++++++++++++++++++++++++++++++++++++-
4 files changed, 123 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index bc184dd38d993..530f5b5342eaa 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -46,7 +46,7 @@ KCOV_INSTRUMENT_unwind_guess.o := n
CFLAGS_head32.o := -fno-stack-protector
CFLAGS_head64.o := -fno-stack-protector
-CFLAGS_irq.o := -I $(src)/../include/asm/trace
+CFLAGS_irq.o := -I $(src)/../include/asm/trace -I $(srctree)/kernel/irq
obj-y += head_$(BITS).o
obj-y += head$(BITS).o
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 10721a1252269..1abdd21fa5c52 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -13,6 +13,8 @@
#include <linux/export.h>
#include <linux/irq.h>
+#include <irq_moderation.h>
+
#include <asm/irq_stack.h>
#include <asm/apic.h>
#include <asm/io_apic.h>
@@ -448,6 +450,13 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_posted_msi_notification)
inc_irq_stat(posted_msi_notification_count);
irq_enter();
+ if (posted_msi_moderation_enabled()) {
+ if (posted_msi_should_rearm(handle_pending_pir(pid->pir, regs)))
+ goto rearm;
+ else
+ goto common_end;
+ }
+
/*
* Max coalescing count includes the extra round of handle_pending_pir
* after clearing the outstanding notification bit. Hence, at most
@@ -458,6 +467,7 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_posted_msi_notification)
break;
}
+rearm:
/*
* Clear outstanding notification bit to allow new IRQ notifications,
* do this last to maximize the window of interrupt coalescing.
@@ -471,6 +481,9 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_posted_msi_notification)
*/
handle_pending_pir(pid->pir, regs);
+common_end:
+ posted_msi_moderation_epilogue();
+
apic_eoi();
irq_exit();
set_irq_regs(old_regs);
diff --git a/kernel/irq/irq_moderation.c b/kernel/irq/irq_moderation.c
index 72be9e88c3890..2d01e4cd4638b 100644
--- a/kernel/irq/irq_moderation.c
+++ b/kernel/irq/irq_moderation.c
@@ -11,6 +11,13 @@
#include "internals.h"
#include "irq_moderation.h"
+#ifdef CONFIG_X86
+#include <asm/apic.h>
+#include <asm/irq_remapping.h>
+#else
+static inline bool posted_msi_supported(void) { return false; }
+#endif
+
/*
* Platform-wide software interrupt moderation.
*
@@ -32,6 +39,10 @@
* moderation on that CPU/irq. If so, calls disable_irq_nosync() and starts
* an hrtimer with appropriate delay.
*
+ * - Intel only: using "intremap=posted_msi", all the above is done in
+ * sysvec_posted_msi_notification(). In this case all host device interrupts
+ * are subject to moderation.
+ *
* - the timer callback calls enable_irq() for all disabled interrupts on that
* CPU. That in turn will generate interrupts if there are pending events.
*
@@ -230,6 +241,17 @@ static enum hrtimer_restart timer_cb(struct hrtimer *timer)
ms->rounds_left--;
+#ifdef CONFIG_X86_POSTED_MSI
+ if (ms->kick_posted_msi) {
+ if (ms->rounds_left == 0)
+ ms->kick_posted_msi = false;
+ /* Next call will be from timer, count it conditionally. */
+ ms->dont_count = !irq_mod_info.count_timer_calls;
+ ms->timer_calls++;
+ apic->send_IPI_self(POSTED_MSI_NOTIFICATION_VECTOR);
+ }
+#endif
+
if (ms->rounds_left > 0) {
/* Timer still alive, just call the handlers. */
list_for_each_entry_safe(desc, next, &ms->descs, mod.ms_node) {
@@ -332,7 +354,7 @@ static int moderation_show(struct seq_file *p, void *v)
}
seq_printf(p, "\n"
- "enabled %s\n"
+ "enabled %s%s\n"
"delay_us %u\n"
"timer_rounds %u\n"
"target_irq_rate %u\n"
@@ -344,6 +366,7 @@ static int moderation_show(struct seq_file *p, void *v)
"decay_factor %u\n"
"grow_factor %u\n",
str_yes_no(delay_us > 0),
+ posted_msi_supported() ? " (also on posted_msi)" : "",
delay_us, irq_mod_info.timer_rounds,
irq_mod_info.target_irq_rate, irq_mod_info.hardirq_percent,
irq_mod_info.update_ms, irq_mod_info.scale_cpus,
@@ -389,6 +412,7 @@ static struct param_names param_names[] = {
{},
{ "scale_cpus", &irq_mod_info.scale_cpus, 50, 1000 },
{ "count_timer_calls", &irq_mod_info.count_timer_calls, 0, 1 },
+ { "count_msi_calls", &irq_mod_info.count_msi_calls, 0, 1 },
{ "decay_factor", &irq_mod_info.decay_factor, 8, 64 },
{ "grow_factor", &irq_mod_info.grow_factor, 8, 64 },
};
@@ -476,6 +500,18 @@ static ssize_t mode_write(struct file *f, const char __user *buf, size_t count,
ret = kstrtobool(cmd, &enable);
if (!ret)
ret = set_moderation_mode(desc, enable);
+ if (ret) {
+ /* extra helpers for prodkernel */
+ if (cmd[count - 1] == '\n')
+ cmd[count - 1] = '\0';
+ ret = 0;
+ if (!strcmp(cmd, "managed"))
+ irqd_set(&desc->irq_data, IRQD_AFFINITY_MANAGED);
+ else if (!strcmp(cmd, "unmanaged"))
+ irqd_clear(&desc->irq_data, IRQD_AFFINITY_MANAGED);
+ else
+ ret = -EINVAL;
+ }
return ret ? : count;
}
diff --git a/kernel/irq/irq_moderation.h b/kernel/irq/irq_moderation.h
index 3543e8e8b6e2d..69bbbb7b2ec80 100644
--- a/kernel/irq/irq_moderation.h
+++ b/kernel/irq/irq_moderation.h
@@ -145,7 +145,8 @@ static inline void irq_moderation_adjust_delay(struct irq_mod_state *ms)
{
u64 now, delta_time;
- ms->irq_count++;
+ /* dont_count can only be set in timer calls from posted_msi */
+ ms->irq_count += !ms->dont_count;
/* ktime_get_ns() is expensive, don't do too often */
if (ms->irq_count & 0xf)
return;
@@ -196,6 +197,15 @@ static inline void irq_moderation_hook(struct irq_desc *desc)
if (!static_branch_unlikely(&irq_moderation_enabled_key))
return;
+#ifdef CONFIG_X86_POSTED_MSI
+ if (ms->in_posted_msi) {
+ /* these calls are not moderated */
+ ms->from_posted_msi++;
+ ms->irq_count += irq_mod_info.count_msi_calls;
+ return;
+ }
+#endif
+
if (!READ_ONCE(desc->mod.enable))
return;
@@ -243,6 +253,61 @@ static inline void irq_moderation_epilogue(const struct irq_desc *desc)
irq_moderation_start_timer(ms);
}
+#ifdef CONFIG_X86_POSTED_MSI
+/*
+ * Helpers for sysvec_posted_msi_notification(); use as follows
+ *
+ * if (posted_msi_moderation_enabled()) {
+ * if (posted_msi_should_rearm(handle_pending_pir(pid->pir, regs)))
+ * goto rearm;
+ * else
+ * goto common_end;
+ * }
+ * ...
+ * common_end:
+ * posted_msi_moderation_epilogue();
+ */
+static inline bool posted_msi_moderation_enabled(void)
+{
+ struct irq_mod_state *ms = this_cpu_ptr(&irq_mod_state);
+
+ if (!static_branch_unlikely(&irq_moderation_enabled_key))
+ return false;
+ irq_moderation_adjust_delay(ms);
+ /* Tell handlers to not throttle next calls. */
+ ms->in_posted_msi = true;
+ return true;
+}
+
+/* Decide whether or not to rearm posted_msi. */
+static inline bool posted_msi_should_rearm(bool work_done)
+{
+ struct irq_mod_state *ms = this_cpu_ptr(&irq_mod_state);
+
+ /* No rearm if there is a timer pending. */
+ if (ms->rounds_left > 0)
+ return false;
+ /* No work done, can rearm. */
+ if (!work_done)
+ return true;
+ if (!irq_moderation_needed(ms))
+ return true;
+ /* Start the timer, inform the handler, and do not rearm. */
+ ms->kick_posted_msi = true;
+ irq_moderation_start_timer(ms);
+ return false;
+}
+
+/* Cleanup state set in posted_msi_moderation_enabled() */
+static inline void posted_msi_moderation_epilogue(void)
+{
+ struct irq_mod_state *ms = this_cpu_ptr(&irq_mod_state);
+
+ ms->in_posted_msi = false;
+ ms->dont_count = false;
+}
+#endif
+
void irq_moderation_procfs_add(struct irq_desc *desc, umode_t umode);
void irq_moderation_procfs_remove(struct irq_desc *desc);
@@ -251,6 +316,12 @@ void irq_moderation_procfs_remove(struct irq_desc *desc);
static inline void irq_moderation_hook(struct irq_desc *desc) {}
static inline void irq_moderation_epilogue(const struct irq_desc *desc) {}
+#ifdef CONFIG_X86_POSTED_MSI
+static inline bool posted_msi_moderation_enabled(void) { return false; }
+static inline bool posted_msi_should_rearm(bool work_done) { return false; }
+static inline void posted_msi_moderation_epilogue(void) {}
+#endif
+
static inline void irq_moderation_procfs_add(struct irq_desc *desc, umode_t umode) {}
static inline void irq_moderation_procfs_remove(struct irq_desc *desc) {}
--
2.52.0.rc1.455.g30608eb744-goog
* Re: [PATCH v2 5/8] x86/irq: soft_moderation: add support for posted_msi (intel)
2025-11-16 18:28 ` [PATCH v2 5/8] x86/irq: soft_moderation: add support for posted_msi (intel) Luigi Rizzo
@ 2025-11-17 21:14 ` Thomas Gleixner
2025-11-17 21:36 ` kernel test robot
1 sibling, 0 replies; 32+ messages in thread
From: Thomas Gleixner @ 2025-11-17 21:14 UTC (permalink / raw)
To: Luigi Rizzo, Marc Zyngier, Luigi Rizzo, Paolo Abeni,
Andrew Morton, Sean Christopherson, Jacob Pan
Cc: linux-kernel, linux-arch, Bjorn Helgaas, Willem de Bruijn,
Luigi Rizzo
On Sun, Nov 16 2025 at 18:28, Luigi Rizzo wrote:
> On recent Intel CPUs, with kernels compiled with CONFIG_X86_POSTED_MSI=y
> and booted with "intremap=posted_msi", all MSI interrupts that hit a CPU
> are delivered through a single POSTED_MSI notification vector, processed
> by sysvec_posted_msi_notification(), instead of as separate interrupts.
>
> This change adds soft moderation hooks to the above handler.
'This change' is not any better than 'This patch'.
> Soft moderation on posted_msi does not require per-source enable;
> irq_moderation.delay_us > 0 suffices.
>
> To test it, run a kernel with the above options and enable moderation by
> setting delay_us > 0. The column "from_msi" in /proc/irq/soft_moderation
> will show a non-zero value.
Impressive. But it would be more impressive and useful to actualy have
an explanation why this is required and what the benefits are.
> CFLAGS_head32.o := -fno-stack-protector
> CFLAGS_head64.o := -fno-stack-protector
> -CFLAGS_irq.o := -I $(src)/../include/asm/trace
> +CFLAGS_irq.o := -I $(src)/../include/asm/trace -I $(srctree)/kernel/irq
Not going to happen. Architecture code has no business to tap into
generic core infrastructure.
> --- a/kernel/irq/irq_moderation.c
> +++ b/kernel/irq/irq_moderation.c
> @@ -11,6 +11,13 @@
> #include "internals.h"
> #include "irq_moderation.h"
>
> +#ifdef CONFIG_X86
Architecture specific muck has absolutely no place in generic code.
> +#include <asm/apic.h>
> +#include <asm/irq_remapping.h>
> +#else
> +static inline bool posted_msi_supported(void) { return false; }
> +#endif
> +
> /*
> * Platform-wide software interrupt moderation.
> *
> @@ -32,6 +39,10 @@
> * moderation on that CPU/irq. If so, calls disable_irq_nosync() and starts
> * an hrtimer with appropriate delay.
> *
> + * - Intel only: using "intremap=posted_msi", all the above is done in
> + * sysvec_posted_msi_notification(). In this case all host device interrupts
> + * are subject to moderation.
> + *
> * - the timer callback calls enable_irq() for all disabled interrupts on that
> * CPU. That in turn will generate interrupts if there are pending events.
> *
> @@ -230,6 +241,17 @@ static enum hrtimer_restart timer_cb(struct hrtimer *timer)
>
> ms->rounds_left--;
>
> +#ifdef CONFIG_X86_POSTED_MSI
> + if (ms->kick_posted_msi) {
> + if (ms->rounds_left == 0)
> + ms->kick_posted_msi = false;
> + /* Next call will be from timer, count it conditionally. */
> + ms->dont_count = !irq_mod_info.count_timer_calls;
TBH, this is one of the worst hacks I've seen in a long time. It's
admittedly creative, but that's unmaintainable and the worst precedent
to open the flood gates for random architecture hacks in generic core
code.
If you want this to work for that posted MSI muck, then this needs
proper integration of those posted vectors into the interrupt core and a
sane mechanism to do the accounting.
> @@ -476,6 +500,18 @@ static ssize_t mode_write(struct file *f, const char __user *buf, size_t count,
> ret = kstrtobool(cmd, &enable);
> if (!ret)
> ret = set_moderation_mode(desc, enable);
> + if (ret) {
> + /* extra helpers for prodkernel */
> + if (cmd[count - 1] == '\n')
> + cmd[count - 1] = '\0';
> + ret = 0;
> + if (!strcmp(cmd, "managed"))
> + irqd_set(&desc->irq_data, IRQD_AFFINITY_MANAGED);
> + else if (!strcmp(cmd, "unmanaged"))
> + irqd_clear(&desc->irq_data, IRQD_AFFINITY_MANAGED);
What the heck? Can you spare me the exposure to random google internal
garbage? My eyes are bleeding already enough.
Thanks,
tglx
* Re: [PATCH v2 5/8] x86/irq: soft_moderation: add support for posted_msi (intel)
2025-11-16 18:28 ` [PATCH v2 5/8] x86/irq: soft_moderation: add support for posted_msi (intel) Luigi Rizzo
2025-11-17 21:14 ` Thomas Gleixner
@ 2025-11-17 21:36 ` kernel test robot
1 sibling, 0 replies; 32+ messages in thread
From: kernel test robot @ 2025-11-17 21:36 UTC (permalink / raw)
To: Luigi Rizzo, Thomas Gleixner, Marc Zyngier, Luigi Rizzo,
Paolo Abeni, Andrew Morton, Sean Christopherson, Jacob Pan
Cc: llvm, oe-kbuild-all, Linux Memory Management List, linux-kernel,
linux-arch, Bjorn Helgaas, Willem de Bruijn
Hi Luigi,
kernel test robot noticed the following build errors:
[auto build test ERROR on tip/irq/core]
[also build test ERROR on tip/x86/core tip/master linus/master v6.18-rc6 next-20251117]
[cannot apply to tip/auto-latest]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Luigi-Rizzo/genirq-platform-wide-interrupt-moderation-Documentation-Kconfig-irq_desc/20251117-023148
base: tip/irq/core
patch link: https://lore.kernel.org/r/20251116182839.939139-6-lrizzo%40google.com
patch subject: [PATCH v2 5/8] x86/irq: soft_moderation: add support for posted_msi (intel)
config: x86_64-buildonly-randconfig-003-20251118 (https://download.01.org/0day-ci/archive/20251118/202511180439.x4dMf653-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251118/202511180439.x4dMf653-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202511180439.x4dMf653-lkp@intel.com/
All errors (new ones prefixed by >>):
>> kernel/irq/irq_moderation.c:369:6: error: call to undeclared function 'posted_msi_supported'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
369 | posted_msi_supported() ? " (also on posted_msi)" : "",
| ^
1 error generated.
vim +/posted_msi_supported +369 kernel/irq/irq_moderation.c
321
322 seq_printf(p, HEAD_FMT,
323 "# CPU", "irq/s", "my_irq/s", "cpus", "srcs", "delay_ns",
324 "irq_hi", "my_irq_hi", "hardirq_hi", "timer_set",
325 "disable_irq", "from_msi", "timer_calls", "stray_irq");
326
327 for_each_possible_cpu(j) {
328 struct irq_mod_state *ms = per_cpu_ptr(&irq_mod_state, j);
329 u64 age_ms = min((now - ms->last_ns) / NSEC_PER_MSEC, (u64)999999);
330
331 if (age_ms < 10000) {
332 /* Average irq_rate over recently active CPUs. */
333 active_cpus++;
334 irq_rate += ms->irq_rate;
335 } else {
336 ms->irq_rate = 0;
337 ms->my_irq_rate = 0;
338 ms->scaled_cpu_count = 64;
339 ms->scaled_src_count = 64;
340 ms->mod_ns = 0;
341 }
342
343 irq_high += ms->irq_high;
344 my_irq_high += ms->my_irq_high;
345 hardirq_high += ms->hardirq_high;
346
347 seq_printf(p, BODY_FMT,
348 j, ms->irq_rate, ms->my_irq_rate,
349 (ms->scaled_cpu_count + 128) / 256,
350 (ms->scaled_src_count + 128) / 256,
351 ms->mod_ns, ms->irq_high, ms->my_irq_high,
352 ms->hardirq_high, ms->timer_set, ms->disable_irq,
353 ms->from_posted_msi, ms->timer_calls, ms->stray_irq);
354 }
355
356 seq_printf(p, "\n"
357 "enabled %s%s\n"
358 "delay_us %u\n"
359 "timer_rounds %u\n"
360 "target_irq_rate %u\n"
361 "hardirq_percent %u\n"
362 "update_ms %u\n"
363 "scale_cpus %u\n"
364 "count_timer_calls %s\n"
365 "count_msi_calls %s\n"
366 "decay_factor %u\n"
367 "grow_factor %u\n",
368 str_yes_no(delay_us > 0),
> 369 posted_msi_supported() ? " (also on posted_msi)" : "",
370 delay_us, irq_mod_info.timer_rounds,
371 irq_mod_info.target_irq_rate, irq_mod_info.hardirq_percent,
372 irq_mod_info.update_ms, irq_mod_info.scale_cpus,
373 str_yes_no(irq_mod_info.count_timer_calls),
374 str_yes_no(irq_mod_info.count_msi_calls),
375 irq_mod_info.decay_factor, irq_mod_info.grow_factor);
376
377 seq_printf(p,
378 "irq_rate %lu\n"
379 "irq_high %lu\n"
380 "my_irq_high %lu\n"
381 "hardirq_percent_high %lu\n"
382 "total_interrupts %lu\n"
383 "total_cpus %lu\n",
384 active_cpus ? irq_rate / active_cpus : 0,
385 irq_high, my_irq_high, hardirq_high,
386 READ_ONCE(*((ulong *)&irq_mod_info.total_intrs)),
387 READ_ONCE(*((ulong *)&irq_mod_info.total_cpus)));
388
389 return 0;
390 }
391
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* [PATCH v2 6/8] genirq: soft_moderation: helpers for per-driver defaults
2025-11-16 18:28 [PATCH v2 0/8] platform wide software interrupt moderation Luigi Rizzo
` (4 preceding siblings ...)
2025-11-16 18:28 ` [PATCH v2 5/8] x86/irq: soft_moderation: add support for posted_msi (intel) Luigi Rizzo
@ 2025-11-16 18:28 ` Luigi Rizzo
2025-11-16 18:28 ` [PATCH v2 7/8] nvme-pci: add module parameter for default moderation mode Luigi Rizzo
2025-11-16 18:28 ` [PATCH v2 8/8] vfio-pci: " Luigi Rizzo
7 siblings, 0 replies; 32+ messages in thread
From: Luigi Rizzo @ 2025-11-16 18:28 UTC (permalink / raw)
To: Thomas Gleixner, Marc Zyngier, Luigi Rizzo, Paolo Abeni,
Andrew Morton, Sean Christopherson, Jacob Pan
Cc: linux-kernel, linux-arch, Bjorn Helgaas, Willem de Bruijn,
Luigi Rizzo
Introduce helpers to implement per-driver module parameters to enable
moderation at boot/probe time.
No functional change.
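A driver then opts in with just (as the nvme patch later in this series
does):

    /* file scope: adds the ${DRIVER}.soft_moderation bool module parameter */
    DEFINE_IRQ_MODERATION_MODE_PARAMETER;

    /* after request_irq()/queue_request_irq() succeeds: */
    IRQ_MODERATION_SET_DEFAULT_MODE(irq);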
Change-Id: I305aa42fa348055004cc2221ef6b055a0ab4b9d5
Signed-off-by: Luigi Rizzo <lrizzo@google.com>
---
include/linux/interrupt.h | 19 +++++++++++++++++++
kernel/irq/irq_moderation.c | 14 ++++++++++++++
2 files changed, 33 insertions(+)
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 51b6484c04934..17ce0aac181de 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -872,6 +872,25 @@ extern int early_irq_init(void);
extern int arch_probe_nr_irqs(void);
extern int arch_early_irq_init(void);
+#ifdef CONFIG_IRQ_SOFT_MODERATION
+
+/* helpers for per-driver moderation mode settings */
+#define DEFINE_IRQ_MODERATION_MODE_PARAMETER \
+ static bool soft_moderation; \
+ module_param(soft_moderation, bool, 0644); \
+ MODULE_PARM_DESC(soft_moderation, "0: off, 1: on")
+
+void irq_moderation_set_mode(int irq, bool enable);
+#define IRQ_MODERATION_SET_DEFAULT_MODE(_irq) \
+ irq_moderation_set_mode(_irq, READ_ONCE(soft_moderation))
+
+#else /* empty stubs to avoid conditional compilation */
+
+#define DEFINE_IRQ_MODERATION_MODE_PARAMETER
+#define IRQ_MODERATION_SET_DEFAULT_MODE(_irq)
+
+#endif
+
/*
* We want to know which function is an entrypoint of a hardirq or a softirq.
*/
diff --git a/kernel/irq/irq_moderation.c b/kernel/irq/irq_moderation.c
index 2d01e4cd4638b..c2542c92fbbd5 100644
--- a/kernel/irq/irq_moderation.c
+++ b/kernel/irq/irq_moderation.c
@@ -84,6 +84,10 @@ static inline bool posted_msi_supported(void) { return false; }
*
* echo "on" > /proc/irq/NN/soft_moderation # use "off" to disable
*
+ * For selected drivers, the default can also be supplied via module parameters
+ *
+ * ${DRIVER}.soft_moderation=1
+ *
* === MONITORING ===
*
* cat /proc/irq/soft_moderation shows per-CPU and global statistics.
@@ -307,6 +311,16 @@ static inline int set_moderation_mode(struct irq_desc *desc, bool enable)
return 0;
}
+/* irq_to_desc() is not exported. Wrap it for use in drivers. */
+void irq_moderation_set_mode(int irq, bool enable)
+{
+ struct irq_desc *desc = irq_to_desc(irq);
+
+ if (desc)
+ set_moderation_mode(desc, enable);
+}
+EXPORT_SYMBOL(irq_moderation_set_mode);
+
#pragma clang diagnostic error "-Wformat"
/* Print statistics */
static int moderation_show(struct seq_file *p, void *v)
--
2.52.0.rc1.455.g30608eb744-goog
* [PATCH v2 7/8] nvme-pci: add module parameter for default moderation mode
2025-11-16 18:28 [PATCH v2 0/8] platform wide software interrupt moderation Luigi Rizzo
` (5 preceding siblings ...)
2025-11-16 18:28 ` [PATCH v2 6/8] genirq: soft_moderation: helpers for per-driver defaults Luigi Rizzo
@ 2025-11-16 18:28 ` Luigi Rizzo
2025-11-16 18:28 ` [PATCH v2 8/8] vfio-pci: " Luigi Rizzo
7 siblings, 0 replies; 32+ messages in thread
From: Luigi Rizzo @ 2025-11-16 18:28 UTC (permalink / raw)
To: Thomas Gleixner, Marc Zyngier, Luigi Rizzo, Paolo Abeni,
Andrew Morton, Sean Christopherson, Jacob Pan
Cc: linux-kernel, linux-arch, Bjorn Helgaas, Willem de Bruijn,
Luigi Rizzo
The following module parameter enables moderation at boot time for nvme devices.
nvme.soft_moderation=1
It can be overridden at runtime via /proc/irq/NN/soft_moderation for
each nvme interrupt NN.
See Documentation/core-api/irq/irq-moderation.rst for configuration.
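Since the parameter is defined with mode 0644 it is also exposed
writable in sysfs; a minimal sketch of typical usage (NN stands for an
IRQ number, per the documented /proc interface):

	# set the default applied to queues created from now on
	echo 1 > /sys/module/nvme/parameters/soft_moderation
	# toggle moderation on an already-active vector
	echo "on" > /proc/irq/NN/soft_moderation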
Change-Id: Ib061737415c26b868e889d4d9953e1d25ca8fc4b
Signed-off-by: Luigi Rizzo <lrizzo@google.com>
---
drivers/nvme/host/pci.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 72fb675a696f4..b9d7bce30061f 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -72,6 +72,8 @@
static_assert(MAX_PRP_RANGE / NVME_CTRL_PAGE_SIZE <=
(1 /* prp1 */ + NVME_MAX_NR_DESCRIPTORS * PRPS_PER_PAGE));
+DEFINE_IRQ_MODERATION_MODE_PARAMETER;
+
static int use_threaded_interrupts;
module_param(use_threaded_interrupts, int, 0444);
@@ -1989,6 +1991,7 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid, bool polled)
result = queue_request_irq(nvmeq);
if (result < 0)
goto release_sq;
+ IRQ_MODERATION_SET_DEFAULT_MODE(pci_irq_vector(to_pci_dev(dev->dev), vector));
}
set_bit(NVMEQ_ENABLED, &nvmeq->flags);
--
2.52.0.rc1.455.g30608eb744-goog
* [PATCH v2 8/8] vfio-pci: add module parameter for default moderation mode
2025-11-16 18:28 [PATCH v2 0/8] platform wide software interrupt moderation Luigi Rizzo
` (6 preceding siblings ...)
2025-11-16 18:28 ` [PATCH v2 7/8] nvme-pci: add module parameter for default moderation mode Luigi Rizzo
@ 2025-11-16 18:28 ` Luigi Rizzo
7 siblings, 0 replies; 32+ messages in thread
From: Luigi Rizzo @ 2025-11-16 18:28 UTC (permalink / raw)
To: Thomas Gleixner, Marc Zyngier, Luigi Rizzo, Paolo Abeni,
Andrew Morton, Sean Christopherson, Jacob Pan
Cc: linux-kernel, linux-arch, Bjorn Helgaas, Willem de Bruijn,
Luigi Rizzo
The following module parameter enables moderation at boot time for vfio devices.
vfio_pci_core.soft_moderation=1
It can be overridden at runtime via /proc/irq/NN/soft_moderation for
each vfio interrupt NN.
See Documentation/core-api/irq/irq-moderation.rst for configuration.
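The default can equally be set persistently through modprobe
configuration; a sketch (the file path is the conventional location,
shown for illustration):

	# /etc/modprobe.d/vfio.conf
	options vfio_pci_core soft_moderation=1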
Change-Id: I69ece365fa8eb796c22e088dbe849eddb684e02a
Signed-off-by: Luigi Rizzo <lrizzo@google.com>
---
drivers/vfio/pci/vfio_pci_intrs.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 30d3e921cb0de..e54d88cfe601d 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -22,6 +22,8 @@
#include "vfio_pci_priv.h"
+DEFINE_IRQ_MODERATION_MODE_PARAMETER;
+
struct vfio_pci_irq_ctx {
struct vfio_pci_core_device *vdev;
struct eventfd_ctx *trigger;
@@ -317,6 +319,7 @@ static int vfio_intx_enable(struct vfio_pci_core_device *vdev,
vfio_irq_ctx_free(vdev, ctx, 0);
return ret;
}
+ IRQ_MODERATION_SET_DEFAULT_MODE(pdev->irq);
return 0;
}
--
2.52.0.rc1.455.g30608eb744-goog