linux-arch.vger.kernel.org archive mirror
* [PATCH v2 0/8] platform wide software interrupt moderation
@ 2025-11-16 18:28 Luigi Rizzo
  2025-11-16 18:28 ` [PATCH v2 1/8] genirq: platform wide interrupt moderation: Documentation, Kconfig, irq_desc Luigi Rizzo
                   ` (7 more replies)
  0 siblings, 8 replies; 32+ messages in thread
From: Luigi Rizzo @ 2025-11-16 18:28 UTC (permalink / raw)
  To: Thomas Gleixner, Marc Zyngier, Luigi Rizzo, Paolo Abeni,
	Andrew Morton, Sean Christopherson, Jacob Pan
  Cc: linux-kernel, linux-arch, Bjorn Helgaas, Willem de Bruijn,
	Luigi Rizzo

Platform wide software interrupt moderation addresses a limitation of
platforms from many vendors whose I/O performance drops significantly
when the total rate of MSI-X interrupts is too high (e.g. 1-3M intr/s,
depending on the platform).

Conventional interrupt moderation operates separately on each source,
hence the configuration should target the worst case. On large servers
with hundreds of interrupt sources, keeping the total rate bounded would
require delays of 100-200us; and adaptive moderation would have to reach
those delays with as little as 10K intr/s per source. These values are
unacceptable for RPC or transactional workloads.

To address this problem, this code efficiently measures the total and
per-CPU interrupt rates, so that individual moderation delays can be
adjusted based on actual global and local load. This way, the system
controls both global interrupt rates and individual CPU load, and
tunes delays so they are normally 0 or very small except during actual
local/global overload.

Configuration is easy and robust. System administrators specify the
maximum targets (moderation delay, interrupt rate, and fraction of time
spent in hardirq), and per-CPU control loops adjust the actual delays to
try to keep the metrics within those bounds.

There is no need for exact targets, because the system is adaptive.
Values like delay_us=100, target_irq_rate=1000000, hardirq_percent=70
are good almost everywhere.

The system does not rely on any special hardware feature other than
the device recording which sources have pending interrupts.

Boot defaults are set via module parameters (/sys/module/irq_moderation
and /sys/module/${DRIVER}) or at runtime via /proc/irq/soft_moderation,
which is also used to export statistics. Moderation on individual IRQs
can be turned on/off via /proc/irq/NN/soft_moderation.

PERFORMANCE BENEFITS:
Below are some experimental results under high load (before/after rates
are measured with conventional moderation or with this change):

- 100Gbps NIC, 32 queues: rx goes from 50-60Gbps to 92.8 Gbps (line rate).
- 200Gbps NIC, 10 VMs (total 160 queues): rx goes from 30 Gbps to 190 Gbps (line rate).
- 12 SSD, 96 queues: 4K random read goes from 6M to 20.5 MIOPS (device max).

In all cases, latency up to p95 is unaffected at low/moderate load,
even if compared with no moderation at all.

Changes in v2:
- many style fixes (mostly on comments) based on reviewers' comments on v1
- removed background from Documentation/core-api/irq/irq-moderation.rst
- split procfs handlers
- moved internal details to kernel/irq/irq_moderation.h
- use cpu hotplug for per-CPU setup, removed unnecessary arch-specific changes
- select suitable irqs based on !irqd_is_level_type(irqd) && irqd_is_single_target(irqd)
- use a static_key to enable/disable the feature

There are two open comments from v1 for which I would like the
maintainers' clarification:

- handle_irq_event() calls irq_moderation_hook() after releasing the lock,
  so it can call disable_irq_nosync(). It may be possible to move the
  call before releasing the lock and use __disable_irq(). I am not sure
  if there is any benefit in making the change.

- the timer callback calls handle_irq_event_percpu(desc) on moderated
  irqdesc (which have irqd_irq_disabled(irqd) == 1) without changing
  IRQD_IRQ_INPROGRESS. I am not sure if this should be protected with
  the following, and especially where it would make a difference
  (specifically because the desc is disabled during this sequence).

     raw_spin_lock(&desc->lock);
     irqd_set(&desc->irq_data, IRQD_IRQ_INPROGRESS);
     raw_spin_unlock(&desc->lock);

     handle_irq_event_percpu(desc); // <--

     raw_spin_lock(&desc->lock);
     irqd_clear(&desc->irq_data, IRQD_IRQ_INPROGRESS);
     raw_spin_unlock(&desc->lock);


Luigi Rizzo (8):
  genirq: platform wide interrupt moderation: Documentation, Kconfig,
    irq_desc
  genirq: soft_moderation: add base files, procfs
  genirq: soft_moderation: implement fixed moderation
  genirq: soft_moderation: implement adaptive moderation
  x86/irq: soft_moderation: add support for posted_msi (intel)
  genirq: soft_moderation: helpers for per-driver defaults
  nvme-pci: add module parameter for default moderation mode
  vfio-pci: add module parameter for default moderation mode

 Documentation/core-api/irq/index.rst          |   1 +
 Documentation/core-api/irq/irq-moderation.rst | 154 +++++
 arch/x86/kernel/Makefile                      |   2 +-
 arch/x86/kernel/irq.c                         |  13 +
 drivers/nvme/host/pci.c                       |   3 +
 drivers/vfio/pci/vfio_pci_intrs.c             |   3 +
 include/linux/interrupt.h                     |  19 +
 include/linux/irqdesc.h                       |  18 +
 kernel/irq/Kconfig                            |  12 +
 kernel/irq/Makefile                           |   1 +
 kernel/irq/handle.c                           |   3 +
 kernel/irq/irq_moderation.c                   | 606 ++++++++++++++++++
 kernel/irq/irq_moderation.h                   | 330 ++++++++++
 kernel/irq/irqdesc.c                          |   1 +
 kernel/irq/proc.c                             |   3 +
 15 files changed, 1168 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/core-api/irq/irq-moderation.rst
 create mode 100644 kernel/irq/irq_moderation.c
 create mode 100644 kernel/irq/irq_moderation.h

-- 
2.52.0.rc1.455.g30608eb744-goog



* [PATCH v2 1/8] genirq: platform wide interrupt moderation: Documentation, Kconfig, irq_desc
  2025-11-16 18:28 [PATCH v2 0/8] platform wide software interrupt moderation Luigi Rizzo
@ 2025-11-16 18:28 ` Luigi Rizzo
  2025-11-16 18:28 ` [PATCH v2 2/8] genirq: soft_moderation: add base files, procfs Luigi Rizzo
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 32+ messages in thread
From: Luigi Rizzo @ 2025-11-16 18:28 UTC (permalink / raw)
  To: Thomas Gleixner, Marc Zyngier, Luigi Rizzo, Paolo Abeni,
	Andrew Morton, Sean Christopherson, Jacob Pan
  Cc: linux-kernel, linux-arch, Bjorn Helgaas, Willem de Bruijn,
	Luigi Rizzo

Platform wide software interrupt moderation ("soft_moderation")
specifically addresses a limitation of platforms from many vendors
whose I/O performance drops significantly when the total rate of MSI-X
interrupts is too high (e.g. 1-3M intr/s, depending on the platform).

Conventional interrupt moderation operates separately on each source,
hence the configuration should target the worst case. On large servers
with hundreds of interrupt sources, keeping the total rate bounded would
require delays of 100-200us, and adaptive moderation would have to reach
those delays with as little as 10K intr/s per source. These values are
unacceptable for RPC or transactional workloads.

To address this problem, soft_moderation efficiently measures the total
and per-CPU interrupt rates, so that individual moderation delays
can be adjusted based on actual global and local load. This way, the
system controls both global interrupt rates and individual CPU load,
and tunes delays so they are normally 0 or very small except during
actual local or global overload.

soft_moderation does not rely on any special hardware feature except
requiring the device to record which sources have pending interrupts.

Configuration is easy and robust. System administrators specify the
maximum targets (moderation delay, interrupt rate, and fraction of time
spent in hardirq), and per-CPU control loops adjust the actual delays to
try to keep the metrics within the targets.

There is no need for exact targets, because the system is adaptive;
values like delay_us=100, target_irq_rate=1000000, hardirq_percent=70 are
good almost everywhere.

Boot defaults are set via module parameters (/sys/module/irq_moderation
and /sys/module/${DRIVER}) or at runtime via /proc/irq/soft_moderation, which
is also used to export statistics. Moderation on individual IRQs can
be turned on/off via /proc/irq/NN/soft_moderation.

No functional change: enabling the option only extends struct
irq_desc.

CONFIG_IRQ_SOFT_MODERATION=y

Signed-off-by: Luigi Rizzo <lrizzo@google.com>
---
 Documentation/core-api/irq/index.rst          |   1 +
 Documentation/core-api/irq/irq-moderation.rst | 154 ++++++++++++++++++
 include/linux/irqdesc.h                       |  18 ++
 kernel/irq/Kconfig                            |  12 ++
 4 files changed, 185 insertions(+)
 create mode 100644 Documentation/core-api/irq/irq-moderation.rst

diff --git a/Documentation/core-api/irq/index.rst b/Documentation/core-api/irq/index.rst
index 0d65d11e54200..b5a6e2ade69bb 100644
--- a/Documentation/core-api/irq/index.rst
+++ b/Documentation/core-api/irq/index.rst
@@ -9,3 +9,4 @@ IRQs
    irq-affinity
    irq-domain
    irqflags-tracing
+   irq-moderation
diff --git a/Documentation/core-api/irq/irq-moderation.rst b/Documentation/core-api/irq/irq-moderation.rst
new file mode 100644
index 0000000000000..71cad4fd5c3f4
--- /dev/null
+++ b/Documentation/core-api/irq/irq-moderation.rst
@@ -0,0 +1,154 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+===========================================
+Platform wide software interrupt moderation
+===========================================
+
+:Author: Luigi Rizzo <lrizzo@google.com>
+
+.. contents:: :depth: 2
+
+Introduction
+------------
+
+Platform-wide software interrupt moderation is a variant of moderation
+that adjusts the delay based on platform-wide metrics, instead of
+considering each source separately.  It then uses hrtimers to implement
+adaptive, per-CPU moderation in software, without requiring any specific
+hardware support other than the Pending Bit Array, a standard feature
+of MSI-X.
+
+The most common and robust implementation of moderation enforces
+some minimum **delay** between subsequent interrupts, using a timer
+in the device or in software. Most NICs support programmable hardware
+moderation, with timer granularity down to 1us or so.  NVME also
+specifies hardware moderation timers, with 100us granularity.
+
+One downside of moderation, especially with **fixed** delay, is that
+even with moderate load, the notification latency can increase by as
+much as the moderation delay. This is undesirable for transactional
+workloads. At high load the extra delay is less problematic, because
+the queueing delays that occur can be one or more orders of magnitude
+bigger.
+
+To address this problem, software can dynamically adjust the delay, making
+it proportional to the I/O rate. This is called **adaptive** moderation,
+and it is commonly implemented in network device drivers.
+
+There is one aspect that per-source moderation does not address.
+
+Several systems-on-chip (SoCs) from all major vendors (Intel, AMD, Arm)
+show a huge reduction in I/O throughput (up to 3-4x slower for high speed
+NICs or SSDs) in the presence of high MSI-X interrupt rates across the
+entire platform (1-3M intr/s total, depending on the SoC). Note that
+unaffected SoCs can sustain 20-30M intr/s from MSI-X without degradation.
+
+The above performance degradation is not caused by overload of individual
+CPUs. What matters is the total MSI-X interrupt rate, across either an
+individual PCIe root port or the entire SoC. The specific root cause
+depends on the SoC, but generally some internal block (in the PCIe root
+port, or in the IOMMU block) applies excessive serialization around
+MSI-X writes. This in turn causes huge delays in other PCIe transactions,
+leading to the observed performance drop.
+
+Platform wide adaptive moderation
+---------------------------------
+
+Platform-wide adaptive interrupt moderation addresses the problem by
+operating as follows (all parameters are configurable via module
+parameters irq_moderation.${name}=${value} or /proc/irq/soft_moderation):
+
+* On each interrupt, the handling CPU increments a per-CPU interrupt counter.
+
+* Opportunistically, every ``update_ms`` or so, each CPU scalably
+  accumulates the values across the entire system, computes the global
+  and per-CPU interrupt rate, and the number of CPUs actively processing
+  interrupts, ``active_cpus``.
+
+* Based on a configurable ``target_irq_rate``, each CPU computes its
+  fair share (``target_irq_rate / active_cpus``) and whether the global
+  and per-CPU rates are above the targets. A simple control loop then adjusts
+  up/down its per-CPU moderation delay, ``mod_ns``, between 0 (disabled)
+  and a configurable maximum ``delay_us``.
+
+* When ``mod_ns`` is above a threshold (e.g. 10us), the first interrupt
+  served by that CPU starts an hrtimer to fire after ``mod_ns`` nanoseconds.
+  All interrupt sources served by that CPU are disabled as they arrive.
+
+* When the timer fires, all disabled sources are re-enabled, allowing pending
+  interrupts to be processed again.
+
+This scheme is effective in keeping the total interrupt rate under
+control as long as the configuration parameters are sensible
+(``delay_us < #CPUs / target_irq_rate``).
+
+It also lends itself to some extensions, specifically:
+
+* **protect against hardirq overload**. It is possible for one CPU
+  handling interrupts to be overwhelmed by hardirq processing. The
+  control loop can be extended to declare an overload situation when the
+  percentage of time spent in hardirq is above a configurable threshold
+  ``hardirq_percent``. Moderation can thus kick in to keep the load within bounds.
+
+* **reduce latency using timer-based polling**. Similar to the networking
+  stack's ``napi_defer_hard_irqs``, once interrupts are disabled and an
+  hrtimer is active, we keep the timer active for a few rounds and run the
+  handler from its callback. The ``timer_rounds`` parameter controls this behavior.
+
+  Say the control loop settles on 120us delay to stay within the global MSI-X rate limit.
+  By setting ``timer_rounds=2``, each time we have a hardware interrupt, the handler
+  will be called two more times by the timer. As a consequence, in the same conditions,
+  the same global MSI-X rate will be reached with just 120/3 = 40us delay, thus improving
+latency significantly (note that those handler calls do cause extra CPU work, so we
+  may lose some of the efficiency gains coming from large delays).
+
+Configuration
+-------------
+
+Configuration of this system is done via module parameters
+``irq_moderation.${name}=${value}`` (for boot-time defaults)
+or writing ``echo "${name}=${value}" > /proc/irq/soft_moderation``
+for run-time configuration.
+
+The following module parameters are available:
+
+* ``delay_us`` (0: off, range 0-500)
+
+   The maximum moderation delay. 0 means moderation is globally disabled.
+
+* ``target_irq_rate`` (0: off, range 0-50000000)
+
+  The maximum irq rate across the entire platform. The adaptive algorithm will adjust
+  delays to stay within the target. Use 0 to disable this control.
+
+* ``hardirq_percent`` (0: off, range 0-100)
+
+  The maximum percentage of CPU time spent in hardirq. The adaptive algorithm will adjust
+  delays to stay within the target. Use 0 to disable this control.
+
+* ``timer_rounds`` (0: off, range 0-20)
+
+  Once the moderation timer is activated, how many extra timer rounds to do before
+  re-enabling interrupts.
+
+* ``update_ms`` (default 1, range 1-100)
+
+  How often the adaptive control should adjust delays. The default value (1ms) should be good
+  in most circumstances.
+
+Interrupt moderation can be enabled/disabled on individual IRQs as follows:
+
+* module parameter ``${driver}.soft_moderation=1`` (default 0) selects
+  whether to use moderation at device probe time.
+
+* ``echo 1 > /proc/irq/*/${irq_name}/../soft_moderation`` (default 0, disabled) toggles
+  moderation on/off for specific IRQs once they are attached.
+
+**Intel specific**
+
+Recent Intel CPUs support a kernel feature, enabled via boot parameter ``intremap=posted_msi``,
+that routes all interrupts targeting one CPU via a special interrupt, called **posted_msi**,
+whose handler in turn calls the individual interrupt handlers.
+
+The ``posted_msi`` kernel feature always uses moderation when it is enabled
+(``delay_us > 0``); individual IRQs do not need to be enabled explicitly.
diff --git a/include/linux/irqdesc.h b/include/linux/irqdesc.h
index fd091c35d5721..25df3c51772fa 100644
--- a/include/linux/irqdesc.h
+++ b/include/linux/irqdesc.h
@@ -29,6 +29,23 @@ struct irqstat {
 #endif
 };
 
+#ifdef CONFIG_IRQ_SOFT_MODERATION
+/**
+ * struct irq_desc_mod - interrupt moderation information
+ * @ms_node:		per-CPU list of moderated irq_desc
+ * @enable:		enable moderation on this descriptor
+ */
+struct irq_desc_mod {
+	struct list_head	ms_node;
+	bool			enable;
+};
+
+void irq_moderation_init_fields(struct irq_desc_mod *mod);
+#else
+struct irq_desc_mod {};
+static inline void irq_moderation_init_fields(struct irq_desc_mod *mod) {}
+#endif
+
 /**
  * struct irq_desc - interrupt descriptor
  * @irq_common_data:	per irq and chip data passed down to chip functions
@@ -112,6 +129,7 @@ struct irq_desc {
 #endif
 	struct mutex		request_mutex;
 	int			parent_irq;
+	struct irq_desc_mod	mod;
 	struct module		*owner;
 	const char		*name;
 #ifdef CONFIG_HARDIRQS_SW_RESEND
diff --git a/kernel/irq/Kconfig b/kernel/irq/Kconfig
index 1b4254d19a73e..c258973b63459 100644
--- a/kernel/irq/Kconfig
+++ b/kernel/irq/Kconfig
@@ -155,6 +155,18 @@ config IRQ_KUNIT_TEST
 
 endmenu
 
+# Support platform wide software interrupt moderation
+config IRQ_SOFT_MODERATION
+	bool "Enable platform wide software interrupt moderation"
+	depends on PROC_FS=y
+	help
+	  Enable platform wide software interrupt moderation.
+	  Uses a local timer to delay interrupts in configurable ways
+	  and depending on various global system load indicators
+	  and targets.
+
+	  If you don't know what to do here, say N.
+
 config GENERIC_IRQ_MULTI_HANDLER
 	bool
 	help
-- 
2.52.0.rc1.455.g30608eb744-goog



* [PATCH v2 2/8] genirq: soft_moderation: add base files, procfs
  2025-11-16 18:28 [PATCH v2 0/8] platform wide software interrupt moderation Luigi Rizzo
  2025-11-16 18:28 ` [PATCH v2 1/8] genirq: platform wide interrupt moderation: Documentation, Kconfig, irq_desc Luigi Rizzo
@ 2025-11-16 18:28 ` Luigi Rizzo
  2025-11-17 11:12   ` kernel test robot
                     ` (2 more replies)
  2025-11-16 18:28 ` [PATCH v2 3/8] genirq: soft_moderation: implement fixed moderation Luigi Rizzo
                   ` (5 subsequent siblings)
  7 siblings, 3 replies; 32+ messages in thread
From: Luigi Rizzo @ 2025-11-16 18:28 UTC (permalink / raw)
  To: Thomas Gleixner, Marc Zyngier, Luigi Rizzo, Paolo Abeni,
	Andrew Morton, Sean Christopherson, Jacob Pan
  Cc: linux-kernel, linux-arch, Bjorn Helgaas, Willem de Bruijn,
	Luigi Rizzo

Add the core files that implement procfs and module parameters
for soft_moderation. This gives access to the module parameters
/sys/module/irq_moderation/parameters and read/write the procfs entries
/proc/irq/soft_moderation and /proc/irq/NN/soft_moderation.

Examples:
  cat /proc/irq/soft_moderation
  echo "delay_us=345" > /proc/irq/soft_moderation
  echo 1 | tee /proc/irq/*/nvme*/../soft_moderation

No functional change.

Signed-off-by: Luigi Rizzo <lrizzo@google.com>
---
 kernel/irq/Makefile         |   1 +
 kernel/irq/irq_moderation.c | 305 ++++++++++++++++++++++++++++++++++++
 kernel/irq/irq_moderation.h | 149 ++++++++++++++++++
 kernel/irq/irqdesc.c        |   1 +
 kernel/irq/proc.c           |   3 +
 5 files changed, 459 insertions(+)
 create mode 100644 kernel/irq/irq_moderation.c
 create mode 100644 kernel/irq/irq_moderation.h

diff --git a/kernel/irq/Makefile b/kernel/irq/Makefile
index 6ab3a40556670..c06da43d644f2 100644
--- a/kernel/irq/Makefile
+++ b/kernel/irq/Makefile
@@ -9,6 +9,7 @@ obj-$(CONFIG_GENERIC_IRQ_CHIP) += generic-chip.o
 obj-$(CONFIG_GENERIC_IRQ_PROBE) += autoprobe.o
 obj-$(CONFIG_IRQ_DOMAIN) += irqdomain.o
 obj-$(CONFIG_IRQ_SIM) += irq_sim.o
+obj-$(CONFIG_IRQ_SOFT_MODERATION) += irq_moderation.o
 obj-$(CONFIG_PROC_FS) += proc.o
 obj-$(CONFIG_GENERIC_PENDING_IRQ) += migration.o
 obj-$(CONFIG_GENERIC_IRQ_MIGRATION) += cpuhotplug.o
diff --git a/kernel/irq/irq_moderation.c b/kernel/irq/irq_moderation.c
new file mode 100644
index 0000000000000..3a907b8f65698
--- /dev/null
+++ b/kernel/irq/irq_moderation.c
@@ -0,0 +1,305 @@
+// SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause
+
+#include <linux/cpuhotplug.h>
+#include <linux/irq.h>
+#include <linux/irqdesc.h>
+#include <linux/kernel_stat.h>
+#include <linux/module.h>
+#include <linux/seq_file.h>
+#include <linux/proc_fs.h>
+
+#include "internals.h"
+#include "irq_moderation.h"
+
+/*
+ * Platform-wide software interrupt moderation.
+ *
+ * see Documentation/core-api/irq/irq-moderation.rst
+ *
+ * === MOTIVATION AND OPERATION ===
+ *
+ * Some platforms show reduced I/O performance when the total device interrupt
+ * rate across the entire platform becomes too high. This code implements
+ * per-CPU adaptive moderation based on the total interrupt rate, as opposed
+ * to conventional moderation that operates separately on each source.
+ *
+ * It computes the total interrupt rate and number of sources, and uses the
+ * information to adaptively disable individual interrupts for small amounts
+ * of time using per-CPU hrtimers. Specifically:
+ *
+ * - a hook in handle_irq_event(), which applies only to sources configured
+ *   to use moderation, updates statistics and checks whether we need
+ *   moderation on that CPU/irq. If so, calls disable_irq_nosync() and starts
+ *   an hrtimer with appropriate delay.
+ *
+ * - the timer callback calls enable_irq() for all disabled interrupts on that
+ *   CPU. That in turn will generate interrupts if there are pending events.
+ *
+ * === CONFIGURATION ===
+ *
+ * The following can be controlled at boot time via module parameters
+ *
+ *     irq_moderation.${NAME}=${VALUE}
+ *
+ * or at runtime by writing
+ *
+ *     echo "${NAME}=${VALUE}" > /proc/irq/soft_moderation
+ *
+ *   delay_us (default 0, suggested 100, range 0-500, 0 DISABLES MODERATION)
+ *     Fixed or maximum moderation delay.  A reasonable range is 20..100, higher
+ *     values can be useful if the hardirq handler is performing a significant
+ *     amount of work.
+ *
+ *   timer_rounds (default 0, max 20)
+ *     Once moderation triggers, periodically run handler zero or more
+ *     times using a timer rather than interrupts. This is similar to
+ *     napi_defer_hard_irqs on NICs.
+ *     A small value may help control load in interrupt-challenged platforms.
+ *
+ * Moderation can be enabled/disabled for individual interrupts with
+ *
+ *    echo "on" > /proc/irq/NN/soft_moderation # use "off" to disable
+ *
+ * === MONITORING ===
+ *
+ * cat /proc/irq/soft_moderation shows per-CPU and global statistics.
+ *
+ */
+
+struct irq_mod_info irq_mod_info ____cacheline_aligned;
+
+/* Boot time value, copied to irq_mod_info.delay_us after init. */
+static uint mod_delay_us;
+module_param_named(delay_us, mod_delay_us, uint, 0444);
+MODULE_PARM_DESC(delay_us, "Max moderation delay us, 0 = moderation off, range 0-500.");
+
+module_param_named(timer_rounds, irq_mod_info.timer_rounds, uint, 0444);
+MODULE_PARM_DESC(timer_rounds, "How many timer polls once moderation triggers, range 0-20.");
+
+DEFINE_PER_CPU_ALIGNED(struct irq_mod_state, irq_mod_state);
+
+/* Initialize moderation state, used in desc_set_defaults() */
+void irq_moderation_init_fields(struct irq_desc_mod *mod)
+{
+	INIT_LIST_HEAD(&mod->ms_node);
+	mod->enable = false;
+}
+
+static inline int set_moderation_mode(struct irq_desc *desc, bool enable)
+{
+	struct irq_data *irqd = &desc->irq_data;
+	struct irq_chip *chip = desc->irq_data.chip;
+
+	/* Moderation is supported only in specific cases. */
+	if (enable) {
+		if (irqd_is_level_type(irqd) || !irqd_is_single_target(irqd) ||
+		    chip->irq_bus_lock || chip->irq_bus_sync_unlock)
+			return -EOPNOTSUPP;
+	}
+	desc->mod.enable = enable;
+	return 0;
+}
+
+/* Print statistics */
+static int moderation_show(struct seq_file *p, void *v)
+{
+	uint delay_us = irq_mod_info.delay_us;
+	int j;
+
+#define HEAD_FMT "%5s  %8s  %8s  %4s  %4s  %8s  %11s  %11s  %11s  %11s  %11s  %11s  %11s  %9s\n"
+#define BODY_FMT "%5u  %8u  %8u  %4u  %4u  %8u  %11u  %11u  %11u  %11u  %11u  %11u  %11u  %9u\n"
+
+	seq_printf(p, HEAD_FMT,
+		   "# CPU", "irq/s", "my_irq/s", "cpus", "srcs", "delay_ns",
+		   "irq_hi", "my_irq_hi", "hardirq_hi", "timer_set",
+		   "disable_irq", "from_msi", "timer_calls", "stray_irq");
+
+	for_each_possible_cpu(j) {
+		struct irq_mod_state *ms = per_cpu_ptr(&irq_mod_state, j);
+
+		seq_printf(p, BODY_FMT,
+			   j, ms->irq_rate, ms->my_irq_rate,
+			   (ms->scaled_cpu_count + 128) / 256,
+			   (ms->scaled_src_count + 128) / 256,
+			   ms->mod_ns, ms->irq_high, ms->my_irq_high,
+			   ms->hardirq_high, ms->timer_set, ms->disable_irq,
+			   ms->from_posted_msi, ms->timer_calls, ms->stray_irq);
+	}
+
+	seq_printf(p, "\n"
+		   "enabled              %s\n"
+		   "delay_us             %u\n"
+		   "timer_rounds         %u\n",
+		   str_yes_no(delay_us > 0),
+		   delay_us, irq_mod_info.timer_rounds);
+
+	return 0;
+}
+
+static int moderation_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, moderation_show, pde_data(inode));
+}
+
+/* Helpers to set and clamp values from procfs or at init. */
+struct param_names {
+	const char *name;
+	uint *val;
+	uint min;
+	uint max;
+};
+
+static struct param_names param_names[] = {
+	{ "delay_us", &irq_mod_info.delay_us, 0, 500 },
+	{ "timer_rounds", &irq_mod_info.timer_rounds, 0, 20 },
+	/* Empty entry indicates the following are not settable from procfs. */
+	{},
+	{ "update_ms", &irq_mod_info.update_ms, 1, 100 },
+};
+
+static ssize_t moderation_write(struct file *f, const char __user *buf, size_t count, loff_t *ppos)
+{
+	struct param_names *n = param_names;
+	char cmd[40];
+	uint i, l, val;
+
+	if (count == 0 || count + 1 > sizeof(cmd))
+		return -EINVAL;
+	if (copy_from_user(cmd, buf, count))
+		return -EFAULT;
+	cmd[count] = '\0';
+	for (i = 0;  i < ARRAY_SIZE(param_names) && n->name; i++, n++) {
+		l = strlen(n->name);
+		if (count < l + 2 || strncmp(cmd, n->name, l) || cmd[l] != '=')
+			continue;
+		if (kstrtouint(cmd + l + 1, 0, &val))
+			return -EINVAL;
+		WRITE_ONCE(*(n->val), clamp(val, n->min, n->max));
+		/* Record last parameter change, for use in the control loop. */
+		irq_mod_info.procfs_write_ns = ktime_get_ns();
+		return count;
+	}
+	return -EINVAL;
+}
+
+static const struct proc_ops proc_ops = {
+	.proc_open	= moderation_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
+	.proc_write	= moderation_write,
+};
+
+/* Handlers for /proc/irq/NN/soft_moderation */
+static int mode_show(struct seq_file *p, void *v)
+{
+	struct irq_desc *desc = p->private;
+
+	if (!desc)
+		return -ENOENT;
+
+	seq_printf(p, "%s irq %u trigger 0x%x %s %smanaged %slazy handle_irq %pB\n",
+		   desc->mod.enable ? "on" : "off", desc->irq_data.irq,
+		   irqd_get_trigger_type(&desc->irq_data),
+		   irqd_is_level_type(&desc->irq_data) ? "Level" : "Edge",
+		   irqd_affinity_is_managed(&desc->irq_data) ? "" : "un",
+		   irq_settings_disable_unlazy(desc) ? "un" : "", desc->handle_irq
+		   );
+	return 0;
+}
+
+static ssize_t mode_write(struct file *f, const char __user *buf, size_t count, loff_t *ppos)
+{
+	struct irq_desc *desc = (struct irq_desc *)pde_data(file_inode(f));
+	char cmd[40];
+	bool enable;
+	int ret;
+
+	if (!desc)
+		return -ENOENT;
+	if (count == 0 || count + 1 > sizeof(cmd))
+		return -EINVAL;
+	if (copy_from_user(cmd, buf, count))
+		return -EFAULT;
+	cmd[count] = '\0';
+
+	ret = kstrtobool(cmd, &enable);
+	if (!ret)
+		ret = set_moderation_mode(desc, enable);
+	return ret ? : count;
+}
+
+static int mode_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, mode_show, pde_data(inode));
+}
+
+static const struct proc_ops mode_ops = {
+	.proc_open	= mode_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
+	.proc_write	= mode_write,
+};
+
+void irq_moderation_procfs_add(struct irq_desc *desc, umode_t umode)
+{
+	proc_create_data("soft_moderation", umode, desc->dir, &mode_ops, desc);
+}
+
+void irq_moderation_procfs_remove(struct irq_desc *desc)
+{
+	remove_proc_entry("soft_moderation", desc->dir);
+}
+
+/* Per-CPU state initialization */
+static void irq_moderation_percpu_init(void *data)
+{
+	struct irq_mod_state *ms = this_cpu_ptr(&irq_mod_state);
+
+	INIT_LIST_HEAD(&ms->descs);
+}
+
+static int cpuhp_setup_cb(uint cpu)
+{
+	irq_moderation_percpu_init(NULL);
+	return 0;
+}
+
+static void clamp_parameter(uint *dst, uint val)
+{
+	struct param_names *n = param_names;
+	uint i;
+
+	for (i = 0; i < ARRAY_SIZE(param_names); i++, n++) {
+		if (dst == n->val) {
+			*dst = clamp(val, n->min, n->max);
+			return;
+		}
+	}
+}
+
+static int __init init_irq_moderation(void)
+{
+	uint *cur;
+
+	on_each_cpu(irq_moderation_percpu_init, NULL, 1);
+	cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN, "moderation:online", cpuhp_setup_cb, NULL);
+
+	/* Clamp all initial values to the allowed range. */
+	for (cur = &irq_mod_info.target_irq_rate; cur < irq_mod_info.pad; cur++)
+		clamp_parameter(cur, *cur);
+
+	/* Finally, set delay_us to enable moderation if needed. */
+	clamp_parameter(&irq_mod_info.delay_us, mod_delay_us);
+
+	proc_create_data("irq/soft_moderation", 0644, NULL, &proc_ops, NULL);
+	return 0;
+}
+
+MODULE_LICENSE("Dual BSD/GPL");
+MODULE_VERSION("1.0");
+MODULE_AUTHOR("Luigi Rizzo <lrizzo@google.com>");
+MODULE_DESCRIPTION("Platform wide software interrupt moderation");
+module_init(init_irq_moderation);
diff --git a/kernel/irq/irq_moderation.h b/kernel/irq/irq_moderation.h
new file mode 100644
index 0000000000000..ccb8193482b51
--- /dev/null
+++ b/kernel/irq/irq_moderation.h
@@ -0,0 +1,149 @@
+/* SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause */
+
+#ifndef _LINUX_IRQ_MODERATION_H
+#define _LINUX_IRQ_MODERATION_H
+
+/*
+ * Platform wide software interrupt moderation, see
+ * Documentation/core-api/irq/irq-moderation.rst
+ */
+
+#include <linux/hrtimer.h>
+#include <linux/irq.h>
+#include <linux/irqdesc.h>
+#include <linux/kernel.h>
+
+#ifdef CONFIG_IRQ_SOFT_MODERATION
+
+/**
+ * struct irq_mod_info - global configuration parameters and state
+ * @total_intrs:	running count updated every update_ms
+ * @total_cpus:		as above, active CPUs in this interval
+ * @procfs_write_ns:	last write to /proc/irq/soft_moderation
+ * @delay_us:		fixed delay, or maximum for adaptive
+ * @target_irq_rate:	target maximum interrupt rate
+ * @hardirq_percent:	target maximum hardirq percentage
+ * @timer_rounds:	how many timer polls once moderation fires
+ * @update_ms:		how often to update delay/rate/fraction
+ * @scale_cpus:		(percent) scale factor to estimate active CPUs
+ * @count_timer_calls:	count timer calls for irq limits
+ * @count_msi_calls:	count calls from posted_msi for irq limits
+ * @decay_factor:	smoothing factor for the control loop, keep at 16
+ * @grow_factor:	smoothing factor for the control loop, keep it at 8
+ */
+struct irq_mod_info {
+	/* These fields are written to by all CPUs */
+	____cacheline_aligned
+	atomic_long_t total_intrs;
+	atomic_long_t total_cpus;
+
+	/* These are mostly read (frequently), so use a different cacheline */
+	____cacheline_aligned
+	u64 procfs_write_ns;
+	uint delay_us;
+	uint target_irq_rate;
+	uint hardirq_percent;
+	uint timer_rounds;
+	uint update_ms;
+	uint scale_cpus;
+	uint count_timer_calls;
+	uint count_msi_calls;
+	uint decay_factor;
+	uint grow_factor;
+	uint pad[];
+};
+
+extern struct irq_mod_info irq_mod_info;
+
+/**
+ * struct irq_mod_state - per-CPU moderation state
+ *
+ * @timer:		moderation timer
+ * @descs:		list of moderated irq_desc on this CPU
+ *
+ * Counters on last time we updated moderation delay
+ * @last_ns:		time of last update
+ * @last_irqtime:	from cpustat[CPUTIME_IRQ]
+ * @last_total_irqs:	from irq_mod_info
+ * @last_total_cpus:	from irq_mod_info
+ *
+ * Local info to control hooks and timer callbacks
+ * @dont_count:		do not count this interrupt
+ * @in_posted_msi:	don't suppress handle_irq, set in posted_msi handler
+ * @kick_posted_msi:	kick posted_msi from the timer callback
+ * @rounds_left:	how many rounds left for timer callbacks
+ *
+ * @irq_count:		irqs in the last cycle, signed as we also decrement
+ * @update_ns:		fetched from irq_mod_info
+ * @delay_ns:		fetched from irq_mod_info
+ * @mod_ns:		current moderation delay, recomputed every update_ms
+ * @sleep_ns:		accumulated time for actual delay
+ *
+ * Statistics
+ * @irq_rate:		smoothed global irq rate
+ * @my_irq_rate:	smoothed irq rate for this CPU
+ * @scaled_cpu_count:	smoothed CPU count (scaled)
+ * @scaled_src_count:	smoothed count of irq sources (scaled)
+ * @irq_high:		how many times global irq above threshold
+ * @my_irq_high:	how many times local irq above threshold
+ * @hardirq_high:	how many times local hardirq_percent above threshold
+ * @timer_set:		how many timer_set calls
+ * @timer_fire:		how many timer_fire, must match timer_set in timer callback
+ * @disable_irq:	how many disable_irq calls
+ * @enable_irq:		how many enable_irq, must match disable_irq in timer callback
+ * @timer_calls:	how many handler calls from timer interrupt
+ * @from_posted_msi:	how many calls from posted_msi handler
+ * @stray_irq:		how many stray interrupts
+ */
+struct irq_mod_state {
+	struct hrtimer timer;
+	struct list_head descs;
+
+	/* Counters on last time we updated moderation delay */
+	u64 last_ns;
+	u64 last_irqtime;
+	u64 last_total_irqs;
+	u64 last_total_cpus;
+
+	bool dont_count;
+	bool in_posted_msi;
+	bool kick_posted_msi;
+	u8 rounds_left;
+
+	u32 irq_count;
+	u32 update_ns;
+	u32 delay_ns;
+	u32 mod_ns;
+	u32 sleep_ns;
+
+	/* Statistics */
+	u32 irq_rate;
+	u32 my_irq_rate;
+	u32 scaled_cpu_count;
+	u32 scaled_src_count;
+	u32 irq_high;
+	u32 my_irq_high;
+	u32 hardirq_high;
+	u32 timer_set;
+	u32 timer_fire;
+	u32 disable_irq;
+	u32 enable_irq;
+	u32 timer_calls;
+	u32 from_posted_msi;
+	u32 stray_irq;
+	int pad[] ____cacheline_aligned;
+};
+
+DECLARE_PER_CPU_ALIGNED(struct irq_mod_state, irq_mod_state);
+
+void irq_moderation_procfs_add(struct irq_desc *desc, umode_t umode);
+void irq_moderation_procfs_remove(struct irq_desc *desc);
+
+#else /* CONFIG_IRQ_SOFT_MODERATION */
+
+static inline void irq_moderation_procfs_add(struct irq_desc *desc, umode_t umode) {}
+static inline void irq_moderation_procfs_remove(struct irq_desc *desc) {}
+
+#endif /* !CONFIG_IRQ_SOFT_MODERATION */
+
+#endif /* _LINUX_IRQ_MODERATION_H */
diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index db714d3014b5f..e5cdade3dbbce 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -134,6 +134,7 @@ static void desc_set_defaults(unsigned int irq, struct irq_desc *desc, int node,
 	desc->tot_count = 0;
 	desc->name = NULL;
 	desc->owner = owner;
+	irq_moderation_init_fields(&desc->mod);
 	for_each_possible_cpu(cpu)
 		*per_cpu_ptr(desc->kstat_irqs, cpu) = (struct irqstat) { };
 	desc_smp_init(desc, node, affinity);
diff --git a/kernel/irq/proc.c b/kernel/irq/proc.c
index 29c2404e743be..5dcbc36b7de1b 100644
--- a/kernel/irq/proc.c
+++ b/kernel/irq/proc.c
@@ -14,6 +14,7 @@
 #include <linux/mutex.h>
 
 #include "internals.h"
+#include "irq_moderation.h"
 
 /*
  * Access rules:
@@ -374,6 +375,7 @@ void register_irq_proc(unsigned int irq, struct irq_desc *desc)
 	proc_create_single_data("effective_affinity_list", 0444, desc->dir,
 				irq_effective_aff_list_proc_show, irqp);
 # endif
+	irq_moderation_procfs_add(desc, 0644);
 #endif
 	proc_create_single_data("spurious", 0444, desc->dir,
 				irq_spurious_proc_show, (void *)(long)irq);
@@ -395,6 +397,7 @@ void unregister_irq_proc(unsigned int irq, struct irq_desc *desc)
 	remove_proc_entry("effective_affinity", desc->dir);
 	remove_proc_entry("effective_affinity_list", desc->dir);
 # endif
+	irq_moderation_procfs_remove(desc);
 #endif
 	remove_proc_entry("spurious", desc->dir);
 
-- 
2.52.0.rc1.455.g30608eb744-goog


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 3/8] genirq: soft_moderation: implement fixed moderation
  2025-11-16 18:28 [PATCH v2 0/8] platform wide software interrupt moderation Luigi Rizzo
  2025-11-16 18:28 ` [PATCH v2 1/8] genirq: platform wide interrupt moderation: Documentation, Kconfig, irq_desc Luigi Rizzo
  2025-11-16 18:28 ` [PATCH v2 2/8] genirq: soft_moderation: add base files, procfs Luigi Rizzo
@ 2025-11-16 18:28 ` Luigi Rizzo
  2025-11-17 19:30   ` Thomas Gleixner
  2025-11-16 18:28 ` [PATCH v2 4/8] genirq: soft_moderation: implement adaptive moderation Luigi Rizzo
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 32+ messages in thread
From: Luigi Rizzo @ 2025-11-16 18:28 UTC (permalink / raw)
  To: Thomas Gleixner, Marc Zyngier, Luigi Rizzo, Paolo Abeni,
	Andrew Morton, Sean Christopherson, Jacob Pan
  Cc: linux-kernel, linux-arch, Bjorn Helgaas, Willem de Bruijn,
	Luigi Rizzo

Add hooks into handle_irq_event() to implement fixed moderation.
Moderation needs to be explicitly enabled on individual IRQs.

When an IRQ with moderation enabled fires, start an hrtimer for
irq_moderation.delay_us. While the timer is active, all IRQs with
moderation enabled will be disabled after running the handler. When the
timer fires, re-enable all disabled IRQs.
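The disable/park/re-enable cycle above can be sketched as a toy model
(illustrative names only, not the kernel API; the per-CPU list is reduced
to a fixed array):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of the fixed-moderation flow: every moderated source that
 * fires is parked and disabled after its handler runs; the timer
 * callback re-enables all parked sources in one pass, so the
 * disable/enable counters must always end up balanced.
 */
#define MAX_SRCS 8

struct mod_cpu {
	bool timer_armed;
	int parked[MAX_SRCS];		/* irq numbers currently disabled */
	int nparked;
	int disables, enables;		/* bookkeeping counters */
};

static void moderated_irq_fires(struct mod_cpu *cpu, int irq)
{
	/* The handler runs first (elided), then the source is parked. */
	if (cpu->nparked < MAX_SRCS) {
		cpu->parked[cpu->nparked++] = irq;
		cpu->disables++;	/* disable_irq_nosync() in the patch */
	}
	cpu->timer_armed = true;	/* the epilogue arms the hrtimer */
}

static void moderation_timer_fires(struct mod_cpu *cpu)
{
	while (cpu->nparked > 0) {
		cpu->nparked--;
		cpu->enables++;		/* enable_irq() in the patch */
	}
	cpu->timer_armed = false;
}
```

The balanced-counter invariant is the same one the patch checks with
WARN_ONCE in its timer callback.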

Example (kernel built with CONFIG_IRQ_SOFT_MODERATION=y)

  # enable fixed moderation
  echo "delay_us=400" > /proc/irq/soft_moderation

  # enable on network interrupts (change name as appropriate)
  echo on | tee /proc/irq/*/*eth*/../soft_moderation

  # show it works by looking at counters in /proc/irq/soft_moderation
  cat /proc/irq/soft_moderation

  # Show runtime impact on ping times changing delay_us
  ping -n -f -q -c 1000 ${some_nearby_host}
  echo "delay_us=100" > /proc/irq/soft_moderation
  ping -n -f -q -c 1000 ${some_nearby_host}

Configure via module parameters (irq_moderation.${name}=${value}) or
procfs (echo "${name}=${value}" > /proc/irq/soft_moderation):

delay_us   0=off, range 1-500, default 0
    How long an interrupt is disabled after it fires. Small values are
    accumulated until they are large enough, e.g. 10us. As an example,
    delay_us=2 means that the timer is set only every 5 interrupts.

timer_rounds  0-20, default 0
    How many extra timer runs before re-enabling interrupts. This
    reduces interrupts to one every (timer_rounds * delay_us). Similar
    to the "napi_defer_hard_irqs" option in NAPI, but with some subtle
    differences (e.g. here the number of rounds is deterministic, and
    interrupts are disabled at MSI level).
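The accumulation behaviour described for delay_us can be sketched as
follows (a simplified model, not the patch code; the 10us threshold
mirrors the example above):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Each moderated interrupt adds the configured delay to an
 * accumulator; the timer is only armed once the total is worth
 * sleeping for.
 */
#define MIN_DELAY_NS 10000	/* 10us */

struct accum { unsigned int sleep_ns; };

static bool moderation_needed(struct accum *a, unsigned int mod_ns)
{
	a->sleep_ns += mod_ns;
	if (a->sleep_ns < MIN_DELAY_NS)
		return false;	/* too small, keep accumulating */
	return true;		/* arm the timer for sleep_ns */
}
```

With delay_us=2 (mod_ns=2000), the fifth call is the first to return
true, matching the "timer is set only every 5 interrupts" example above.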

Change-Id: Ic52e95f4128da5e2964afdae0ba3cf9c5c3d732e
Signed-off-by: Luigi Rizzo <lrizzo@google.com>
---
 kernel/irq/handle.c         |   3 ++
 kernel/irq/irq_moderation.c |  86 +++++++++++++++++++++++++++++-
 kernel/irq/irq_moderation.h | 102 ++++++++++++++++++++++++++++++++++++
 3 files changed, 189 insertions(+), 2 deletions(-)

diff --git a/kernel/irq/handle.c b/kernel/irq/handle.c
index e103451243a0b..2f8d828ed143e 100644
--- a/kernel/irq/handle.c
+++ b/kernel/irq/handle.c
@@ -19,6 +19,7 @@
 #include <trace/events/irq.h>
 
 #include "internals.h"
+#include "irq_moderation.h"
 
 #ifdef CONFIG_GENERIC_IRQ_MULTI_HANDLER
 void (*handle_arch_irq)(struct pt_regs *) __ro_after_init;
@@ -254,9 +255,11 @@ irqreturn_t handle_irq_event(struct irq_desc *desc)
 	irqd_set(&desc->irq_data, IRQD_IRQ_INPROGRESS);
 	raw_spin_unlock(&desc->lock);
 
+	irq_moderation_hook(desc);
 	ret = handle_irq_event_percpu(desc);
 
 	raw_spin_lock(&desc->lock);
+	irq_moderation_epilogue(desc);
 	irqd_clear(&desc->irq_data, IRQD_IRQ_INPROGRESS);
 	return ret;
 }
diff --git a/kernel/irq/irq_moderation.c b/kernel/irq/irq_moderation.c
index 3a907b8f65698..688c8e8c49392 100644
--- a/kernel/irq/irq_moderation.c
+++ b/kernel/irq/irq_moderation.c
@@ -56,6 +56,10 @@
  *     napi_defer_hard_irqs on NICs.
  *     A small value may help control load in interrupt-challenged platforms.
  *
+ *
+ *   update_ms (default 5, range 1...100)
+ *     How often moderation delay is updated.
+ *
  * Moderation can be enabled/disabled for individual interrupts with
  *
  *    echo "on" > /proc/irq/NN/soft_moderation # use "off" to disable
@@ -66,7 +70,10 @@
  *
  */
 
-struct irq_mod_info irq_mod_info ____cacheline_aligned;
+/* Recommended values for the control loop. */
+struct irq_mod_info irq_mod_info ____cacheline_aligned = {
+	.update_ms		= 5,
+};
 
 /* Boot time value, copied to irq_mod_info.delay_us after init. */
 static uint mod_delay_us;
@@ -76,8 +83,66 @@ MODULE_PARM_DESC(delay_us, "Max moderation delay us, 0 = moderation off, range 0
 module_param_named(timer_rounds, irq_mod_info.timer_rounds, uint, 0444);
 MODULE_PARM_DESC(timer_rounds, "How many timer polls once moderation triggers, range 0-20.");
 
+module_param_named(update_ms, irq_mod_info.update_ms, uint, 0444);
+MODULE_PARM_DESC(update_ms, "Update interval in milliseconds, range 1-100");
+
 DEFINE_PER_CPU_ALIGNED(struct irq_mod_state, irq_mod_state);
 
+DEFINE_STATIC_KEY_FALSE(irq_moderation_enabled_key);
+EXPORT_SYMBOL(irq_moderation_enabled_key);
+
+static inline void smooth_avg(u32 *dst, u32 val, u32 steps)
+{
+	*dst = ((64 - steps) * *dst + steps * val) / 64;
+}
+
+/* Moderation timer handler. */
+static enum hrtimer_restart timer_cb(struct hrtimer *timer)
+{
+	struct irq_mod_state *ms = this_cpu_ptr(&irq_mod_state);
+	struct irq_desc *desc, *next;
+	uint srcs = 0;
+
+	ms->timer_fire++;
+	WARN_ONCE(ms->timer_set != ms->timer_fire,
+		  "CPU %d timer set %d fire %d (lost events?)\n",
+		  smp_processor_id(), ms->timer_set, ms->timer_fire);
+
+	ms->rounds_left--;
+
+	if (ms->rounds_left > 0) {
+		/* Timer still alive, just call the handlers. */
+		list_for_each_entry_safe(desc, next, &ms->descs, mod.ms_node) {
+			ms->irq_count += irq_mod_info.count_timer_calls;
+			ms->timer_calls++;
+			handle_irq_event_percpu(desc);
+		}
+		ms->timer_set++;
+		hrtimer_forward_now(&ms->timer, ms->sleep_ns);
+		return HRTIMER_RESTART;
+	}
+
+	/* Last round, remove from list and enable_irq(). */
+	list_for_each_entry_safe(desc, next, &ms->descs, mod.ms_node) {
+		list_del(&desc->mod.ms_node);
+		INIT_LIST_HEAD(&desc->mod.ms_node);
+		srcs++;
+		ms->enable_irq++;
+		enable_irq(desc->irq_data.irq);
+	}
+	smooth_avg(&ms->scaled_src_count, srcs * 256, 1);
+
+	/* Prepare to accumulate next moderation delay. */
+	ms->sleep_ns = 0;
+
+	WARN_ONCE(ms->disable_irq != ms->enable_irq,
+		  "CPU %d irq disable %d enable %d (%s)\n",
+		  smp_processor_id(), ms->disable_irq, ms->enable_irq,
+		  "bookkeeping error, some irq will be stuck");
+
+	return HRTIMER_NORESTART;
+}
+
 /* Initialize moderation state, used in desc_set_defaults() */
 void irq_moderation_init_fields(struct irq_desc_mod *mod)
 {
@@ -153,11 +218,24 @@ struct param_names {
 static struct param_names param_names[] = {
 	{ "delay_us", &irq_mod_info.delay_us, 0, 500 },
 	{ "timer_rounds", &irq_mod_info.timer_rounds, 0, 20 },
+	{ "update_ms", &irq_mod_info.update_ms, 1, 100 },
 	/* Empty entry indicates the following are not settable from procfs. */
 	{},
-	{ "update_ms", &irq_mod_info.update_ms, 1, 100 },
+	{ "count_timer_calls", &irq_mod_info.count_timer_calls, 0, 1 },
 };
 
+static void update_enable_key(void)
+{
+	bool newval = irq_mod_info.delay_us != 0;
+
+	if (newval != static_key_enabled(&irq_moderation_enabled_key)) {
+		if (newval)
+			static_branch_enable(&irq_moderation_enabled_key);
+		else
+			static_branch_disable(&irq_moderation_enabled_key);
+	}
+}
+
 static ssize_t moderation_write(struct file *f, const char __user *buf, size_t count, loff_t *ppos)
 {
 	struct param_names *n = param_names;
@@ -176,6 +254,8 @@ static ssize_t moderation_write(struct file *f, const char __user *buf, size_t c
 		if (kstrtouint(cmd + l + 1, 0, &val))
 			return -EINVAL;
 		WRITE_ONCE(*(n->val), clamp(val, n->min, n->max));
+		if (n->val == &irq_mod_info.delay_us)
+			update_enable_key();
 		/* Record last parameter change, for use in the control loop. */
 		irq_mod_info.procfs_write_ns = ktime_get_ns();
 		return count;
@@ -258,6 +338,7 @@ static void irq_moderation_percpu_init(void *data)
 {
 	struct irq_mod_state *ms = this_cpu_ptr(&irq_mod_state);
 
+	hrtimer_setup(&ms->timer, timer_cb, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED_HARD);
 	INIT_LIST_HEAD(&ms->descs);
 }
 
@@ -293,6 +374,7 @@ static int __init init_irq_moderation(void)
 
 	/* Finally, set delay_us to enable moderation if needed. */
 	clamp_parameter(&irq_mod_info.delay_us, mod_delay_us);
+	update_enable_key();
 
 	proc_create_data("irq/soft_moderation", 0644, NULL, &proc_ops, NULL);
 	return 0;
diff --git a/kernel/irq/irq_moderation.h b/kernel/irq/irq_moderation.h
index ccb8193482b51..d9bc672ffd6f1 100644
--- a/kernel/irq/irq_moderation.h
+++ b/kernel/irq/irq_moderation.h
@@ -136,11 +136,113 @@ struct irq_mod_state {
 
 DECLARE_PER_CPU_ALIGNED(struct irq_mod_state, irq_mod_state);
 
+extern struct static_key_false irq_moderation_enabled_key;
+
+/* Called on each interrupt for adaptive moderation delay adjustment. */
+static inline void irq_moderation_adjust_delay(struct irq_mod_state *ms)
+{
+	u64 now, delta_time;
+
+	ms->irq_count++;
+	/* ktime_get_ns() is expensive, don't call it too often */
+	if (ms->irq_count & 0xf)
+		return;
+	now = ktime_get_ns();
+	delta_time = now - ms->last_ns;
+
+	/* Run approximately every update_ns, a little bit early is ok. */
+	if (delta_time < ms->update_ns - 5000)
+		return;
+
+	ms->update_ns = READ_ONCE(irq_mod_info.update_ms) * NSEC_PER_MSEC;
+	ms->delay_ns = READ_ONCE(irq_mod_info.delay_us) * NSEC_PER_USEC;
+
+	ms->mod_ns = ms->delay_ns;
+}
+
+/* Return true if timer is active or delay is large enough to require moderation */
+static inline bool irq_moderation_needed(struct irq_mod_state *ms)
+{
+	const u32 min_delay_ns = 10000;
+
+	if (!hrtimer_is_queued(&ms->timer)) {
+		/* accumulate sleep time, no moderation if too small */
+		ms->sleep_ns += ms->mod_ns;
+		if (ms->sleep_ns < min_delay_ns)
+			return false;
+	}
+	return true;
+}
+
+void disable_irq_nosync(unsigned int irq);
+
+/*
+ * Use in handle_irq_event() before calling the handler. Decide whether this
+ * desc should be moderated, and in case disable the irq and add the desc to
+ * the list for this CPU.
+ */
+static inline void irq_moderation_hook(struct irq_desc *desc)
+{
+	struct irq_mod_state *ms = this_cpu_ptr(&irq_mod_state);
+
+	if (!static_branch_unlikely(&irq_moderation_enabled_key))
+		return;
+
+	if (!READ_ONCE(desc->mod.enable))
+		return;
+
+	irq_moderation_adjust_delay(ms);
+
+	if (!list_empty(&desc->mod.ms_node)) {
+		/*
+		 * Very unlikely, stray interrupt while the desc is moderated.
+		 * We cannot ignore it or we may miss events, but do count it.
+		 */
+		ms->stray_irq++;
+		return;
+	}
+
+	if (!irq_moderation_needed(ms))
+		return;
+
+	/* Add to list of moderated desc on this CPU */
+	list_add(&desc->mod.ms_node, &ms->descs);
+	/*
+	 * Disable the irq. This also makes irq_can_handle() return false
+	 * (through irq_can_handle_actions()), which prevents a handler
+	 * instance from running again while the descriptor is being moderated.
+	 *
+	 * irq_moderation_epilogue() will then start the timer if needed.
+	 */
+	ms->disable_irq++;
+	disable_irq_nosync(desc->irq_data.irq);
+}
+
+static inline void irq_moderation_start_timer(struct irq_mod_state *ms)
+{
+	ms->timer_set++;
+	ms->rounds_left = READ_ONCE(irq_mod_info.timer_rounds) + 1;
+	hrtimer_start_range_ns(&ms->timer, ns_to_ktime(ms->sleep_ns),
+			       /*range*/2000, HRTIMER_MODE_REL_PINNED_HARD);
+}
+
+/* After the handler, if desc is moderated, make sure the timer is active. */
+static inline void irq_moderation_epilogue(const struct irq_desc *desc)
+{
+	struct irq_mod_state *ms = this_cpu_ptr(&irq_mod_state);
+
+	if (!list_empty(&desc->mod.ms_node) && !hrtimer_is_queued(&ms->timer))
+		irq_moderation_start_timer(ms);
+}
+
 void irq_moderation_procfs_add(struct irq_desc *desc, umode_t umode);
 void irq_moderation_procfs_remove(struct irq_desc *desc);
 
 #else /* CONFIG_IRQ_SOFT_MODERATION */
 
+static inline void irq_moderation_hook(struct irq_desc *desc) {}
+static inline void irq_moderation_epilogue(const struct irq_desc *desc) {}
+
 static inline void irq_moderation_procfs_add(struct irq_desc *desc, umode_t umode) {}
 static inline void irq_moderation_procfs_remove(struct irq_desc *desc) {}
 
-- 
2.52.0.rc1.455.g30608eb744-goog



* [PATCH v2 4/8] genirq: soft_moderation: implement adaptive moderation
  2025-11-16 18:28 [PATCH v2 0/8] platform wide software interrupt moderation Luigi Rizzo
                   ` (2 preceding siblings ...)
  2025-11-16 18:28 ` [PATCH v2 3/8] genirq: soft_moderation: implement fixed moderation Luigi Rizzo
@ 2025-11-16 18:28 ` Luigi Rizzo
  2025-11-17 20:51   ` Thomas Gleixner
  2025-11-16 18:28 ` [PATCH v2 5/8] x86/irq: soft_moderation: add support for posted_msi (intel) Luigi Rizzo
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 32+ messages in thread
From: Luigi Rizzo @ 2025-11-16 18:28 UTC (permalink / raw)
  To: Thomas Gleixner, Marc Zyngier, Luigi Rizzo, Paolo Abeni,
	Andrew Morton, Sean Christopherson, Jacob Pan
  Cc: linux-kernel, linux-arch, Bjorn Helgaas, Willem de Bruijn,
	Luigi Rizzo

Add two control parameters (target_irq_rate and hardirq_percent)
to indicate the desired maximum values for these two metrics.

Every update_ms the hook in handle_irq_event() recomputes the total and
local interrupt rate and the amount of time spent in hardirq, compares
the values with the targets, and adjusts the moderation delay up or down.

The interrupt rate is computed in a scalable way by counting interrupts
per-CPU, and aggregating the value in a global variable only every
update_ms. Only CPUs that actively process interrupts are actually
accessing the shared variable, so the system scales well even on very
large servers.
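The per-CPU counting scheme described above can be sketched like this
(a minimal model with illustrative names; the real code folds the count
into irq_mod_info.total_intrs with atomic_long_add_return):

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * Each CPU counts interrupts in a private counter and folds it into
 * the shared atomic only once per update interval, so the shared
 * cacheline is touched at most once per update_ms by each CPU that is
 * actually handling interrupts.
 */
static atomic_long total_intrs;

struct cpu_rate {
	long irq_count;		/* private, uncontended counter */
	long last_total;	/* global snapshot at last update */
};

/* Called once per update interval; returns the system-wide interrupt
 * count accumulated since this CPU's previous update. */
static long flush_and_delta(struct cpu_rate *cs)
{
	long total = atomic_fetch_add(&total_intrs, cs->irq_count)
		     + cs->irq_count;
	long delta = total - cs->last_total;

	cs->irq_count = 0;
	cs->last_total = total;
	return delta;
}
```

Dividing the returned delta by the elapsed time gives the global rate,
as the patch does in irqrate_high().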

EXAMPLE TESTING

Exceeding the thresholds requires suitably heavy workloads, such as
network or SSD traffic. Lower thresholds (e.g. target_irq_rate=50000,
hardirq_percent=10) can make it easier to reach those values.

  # configure maximum delay (which is also the fixed moderation delay)
  echo "delay_us=400" > /proc/irq/soft_moderation

  # enable on network interrupts (change name as appropriate)
  echo on | tee /proc/irq/*/*eth*/../soft_moderation

  # ping times should reflect the 400us
  ping -n -f -q -c 1000 ${some_nearby_host}
  # show actual per-cpu delays and statistics
  less /proc/irq/soft_moderation

  # configure adaptive bounds. The control loop will adjust values
  # based on actual load
  echo "target_irq_rate=1000000" > /proc/irq/soft_moderation
  echo "hardirq_percent=70" > /proc/irq/soft_moderation

  # ping times now should be much lower
  ping -n -f -q -c 1000 ${some_nearby_host}

  # show actual per-cpu delays and statistics
  less /proc/irq/soft_moderation

By generating high interrupt or hardirq loads, one can also test
the effectiveness of the control scheme and the sensitivity to
control parameters.

NEW PARAMETERS

target_irq_rate   0 off, 0-50000000, default 0
    The total maximum acceptable interrupt rate.

hardirq_percent   0 off,  0-100, default 0
    The maximum acceptable percentage of time spent in hardirq.

update_ms         1-100, default 5
    How often the control loop will readjust the delay.
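The up/down adjustment performed every update_ms can be sketched as
follows (simplified from the patch; the constants mirror its defaults,
decay_factor=16 and grow_factor=8, with the 100ns snap-to-zero and the
500ns growth floor):

```c
#include <assert.h>

/*
 * Below target the delay decays exponentially toward zero; above
 * target it grows exponentially from a small floor, capped at the
 * configured maximum (max_ns corresponds to delay_us).
 */
static unsigned int adjust_delay(unsigned int mod_ns, unsigned int max_ns,
				 unsigned int steps, int below_target)
{
	if (below_target) {
		mod_ns -= mod_ns * steps / (steps + 16);
		if (mod_ns < 100)
			mod_ns = 0;	/* snap tiny delays to zero */
	} else {
		if (mod_ns < 500)
			mod_ns = 500;	/* growth stalls near zero, restart */
		mod_ns += mod_ns * steps / (steps + 8);
		if (mod_ns > max_ns)
			mod_ns = max_ns;
	}
	return mod_ns;
}
```

Under sustained overload the delay walks up to the cap within roughly
60-70 updates (a few hundred ms at update_ms=5), and it decays back to
zero once the load drops below the targets.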

Change-Id: Iccd762a61c4389eb15f06d4a35c5dab2821e5fd1
Signed-off-by: Luigi Rizzo <lrizzo@google.com>
---
 kernel/irq/irq_moderation.c | 177 +++++++++++++++++++++++++++++++++++-
 kernel/irq/irq_moderation.h |  10 +-
 2 files changed, 182 insertions(+), 5 deletions(-)

diff --git a/kernel/irq/irq_moderation.c b/kernel/irq/irq_moderation.c
index 688c8e8c49392..72be9e88c3890 100644
--- a/kernel/irq/irq_moderation.c
+++ b/kernel/irq/irq_moderation.c
@@ -56,9 +56,18 @@
  *     napi_defer_hard_irqs on NICs.
  *     A small value may help control load in interrupt-challenged platforms.
  *
+ *     FIXED MODERATION mode requires target_irq_rate=0, hardirq_percent=0
+ *
+ *   target_irq_rate (default 0, suggested 1000000, 0 off, range 0..50M)
+ *     The total irq rate above which moderation kicks in.
+ *     Not particularly critical, a value in the 500K-1M range is usually ok.
+ *
+ *   hardirq_percent (default 0, suggested 70, 0 off, range 10..100)
+ *     The hardirq percentage above which moderation kicks in.
+ *     50-90 is a reasonable range.
  *
  *   update_ms (default 5, range 1...100)
- *     How often moderation delay is updated.
+ *     How often the load is measured and moderation delay updated.
  *
  * Moderation can be enabled/disabled for individual interrupts with
  *
@@ -73,6 +82,9 @@
 /* Recommended values for the control loop. */
 struct irq_mod_info irq_mod_info ____cacheline_aligned = {
 	.update_ms		= 5,
+	.scale_cpus		= 200,
+	.decay_factor		= 16,
+	.grow_factor		= 8,
 };
 
 /* Boot time value, copied to irq_mod_info.delay_us after init. */
@@ -83,6 +95,12 @@ MODULE_PARM_DESC(delay_us, "Max moderation delay us, 0 = moderation off, range 0
 module_param_named(timer_rounds, irq_mod_info.timer_rounds, uint, 0444);
 MODULE_PARM_DESC(timer_rounds, "How many timer polls once moderation triggers, range 0-20.");
 
+module_param_named(hardirq_percent, irq_mod_info.hardirq_percent, uint, 0444);
+MODULE_PARM_DESC(hardirq_percent, "Target max hardirq percentage, 0 off, range 0-100.");
+
+module_param_named(target_irq_rate, irq_mod_info.target_irq_rate, uint, 0444);
+MODULE_PARM_DESC(target_irq_rate, "Target max interrupt rate, 0 off, range 0-50000000.");
+
 module_param_named(update_ms, irq_mod_info.update_ms, uint, 0444);
 MODULE_PARM_DESC(update_ms, "Update interval in milliseconds, range 1-100");
 
@@ -96,6 +114,108 @@ static inline void smooth_avg(u32 *dst, u32 val, u32 steps)
 	*dst = ((64 - steps) * *dst + steps * val) / 64;
 }
 
+/* Measure and assess time spent in hardirq. */
+static inline bool hardirq_high(struct irq_mod_state *ms, u64 delta_time, u32 hardirq_percent)
+{
+	bool above_threshold = false;
+
+	if (IS_ENABLED(CONFIG_IRQ_TIME_ACCOUNTING)) {
+		u64 irqtime, cur = kcpustat_this_cpu->cpustat[CPUTIME_IRQ];
+
+		irqtime = cur - ms->last_irqtime;
+		ms->last_irqtime = cur;
+
+		above_threshold = irqtime * 100 > delta_time * hardirq_percent;
+		ms->hardirq_high += above_threshold;
+	}
+	return above_threshold;
+}
+
+/* Measure and assess total and per-CPU interrupt rates. */
+static inline bool irqrate_high(struct irq_mod_state *ms, u64 delta_time,
+				u32 target_rate, u32 steps)
+{
+	u64 irq_rate, my_irq_rate, tmp, delta_irqs, active_cpus;
+	bool my_rate_high, global_rate_high;
+
+	my_irq_rate = (ms->irq_count * NSEC_PER_SEC) / delta_time;
+	/* Accumulate global counter and compute global irq rate. */
+	tmp = atomic_long_add_return(ms->irq_count, &irq_mod_info.total_intrs);
+	ms->irq_count = 1;
+	delta_irqs = tmp - ms->last_total_irqs;
+	ms->last_total_irqs = tmp;
+	irq_rate = (delta_irqs * NSEC_PER_SEC) / delta_time;
+
+	/*
+	 * Count how many CPUs handled interrupts in the last interval, needed
+	 * to determine the per-CPU target (target_rate / active_cpus).
+	 * Each active CPU increments the global counter approximately every
+	 * update_ns. Scale the value by (update_ns / delta_time) to get the
+	 * correct value. Also apply rounding and make sure active_cpus > 0.
+	 */
+	tmp = atomic_long_add_return(1, &irq_mod_info.total_cpus);
+	active_cpus = tmp - ms->last_total_cpus;
+	ms->last_total_cpus = tmp;
+	active_cpus = (active_cpus * ms->update_ns + delta_time / 2) / delta_time;
+	if (active_cpus < 1)
+		active_cpus = 1;
+
+	/* Compare with global and per-CPU targets. */
+	global_rate_high = irq_rate > target_rate;
+	my_rate_high = my_irq_rate * active_cpus * irq_mod_info.scale_cpus > target_rate * 100;
+
+	/* Statistics. */
+	smooth_avg(&ms->irq_rate, irq_rate, steps);
+	smooth_avg(&ms->my_irq_rate, my_irq_rate, steps);
+	smooth_avg(&ms->scaled_cpu_count, active_cpus * 256, steps);
+	ms->my_irq_high += my_rate_high;
+	ms->irq_high += global_rate_high;
+
+	/* Moderate on this CPU only if both global and local rates are high. */
+	return global_rate_high && my_rate_high;
+}
+
+/* Adjust the moderation delay, called at most every update_ns. */
+void __irq_moderation_adjust_delay(struct irq_mod_state *ms, u64 delta_time)
+{
+	u32 hardirq_percent = READ_ONCE(irq_mod_info.hardirq_percent);
+	u32 target_rate = READ_ONCE(irq_mod_info.target_irq_rate);
+	bool below_target = true;
+	u32 steps;
+
+	if (target_rate == 0 && hardirq_percent == 0) {
+		/* Use fixed moderation delay. */
+		ms->mod_ns = ms->delay_ns;
+		ms->irq_rate = 0;
+		ms->my_irq_rate = 0;
+		ms->scaled_cpu_count = 0;
+		return;
+	}
+
+	/* Compute decay steps based on elapsed time, bound to a reasonable value. */
+	steps = delta_time > 10 * ms->update_ns ? 10 : 1 + (delta_time / ms->update_ns);
+
+	if (target_rate > 0 && irqrate_high(ms, delta_time, target_rate, steps))
+		below_target = false;
+
+	if (hardirq_percent > 0 && hardirq_high(ms, delta_time, hardirq_percent))
+		below_target = false;
+
+	/* Controller: adjust the delay with exponential decay/growth. */
+	if (below_target) {
+		ms->mod_ns -= ms->mod_ns * steps / (steps + irq_mod_info.decay_factor);
+		if (ms->mod_ns < 100)
+			ms->mod_ns = 0;
+	} else {
+		/* Exponential growth stalls near zero, so restart from a floor. */
+		if (ms->mod_ns < 500)
+			ms->mod_ns = 500;
+		ms->mod_ns += ms->mod_ns * steps / (steps + irq_mod_info.grow_factor);
+		if (ms->mod_ns > ms->delay_ns)
+			ms->mod_ns = ms->delay_ns;
+	}
+}
+
 /* Moderation timer handler. */
 static enum hrtimer_restart timer_cb(struct hrtimer *timer)
 {
@@ -169,8 +289,10 @@ static inline int set_moderation_mode(struct irq_desc *desc, bool enable)
 /* Print statistics */
 static int moderation_show(struct seq_file *p, void *v)
 {
+	ulong irq_rate = 0, irq_high = 0, my_irq_high = 0, hardirq_high = 0;
 	uint delay_us = irq_mod_info.delay_us;
-	int j;
+	u64 now = ktime_get_ns();
+	int j, active_cpus = 0;
 
 #define HEAD_FMT "%5s  %8s  %8s  %4s  %4s  %8s  %11s  %11s  %11s  %11s  %11s  %11s  %11s  %9s\n"
 #define BODY_FMT "%5u  %8u  %8u  %4u  %4u  %8u  %11u  %11u  %11u  %11u  %11u  %11u  %11u  %9u\n"
@@ -182,6 +304,23 @@ static int moderation_show(struct seq_file *p, void *v)
 
 	for_each_possible_cpu(j) {
 		struct irq_mod_state *ms = per_cpu_ptr(&irq_mod_state, j);
+		u64 age_ms = min((now - ms->last_ns) / NSEC_PER_MSEC, (u64)999999);
+
+		if (age_ms < 10000) {
+			/* Average irq_rate over recently active CPUs. */
+			active_cpus++;
+			irq_rate += ms->irq_rate;
+		} else {
+			ms->irq_rate = 0;
+			ms->my_irq_rate = 0;
+			ms->scaled_cpu_count = 64;
+			ms->scaled_src_count = 64;
+			ms->mod_ns = 0;
+		}
+
+		irq_high += ms->irq_high;
+		my_irq_high += ms->my_irq_high;
+		hardirq_high += ms->hardirq_high;
 
 		seq_printf(p, BODY_FMT,
 			   j, ms->irq_rate, ms->my_irq_rate,
@@ -195,9 +334,34 @@ static int moderation_show(struct seq_file *p, void *v)
 	seq_printf(p, "\n"
 		   "enabled              %s\n"
 		   "delay_us             %u\n"
-		   "timer_rounds         %u\n",
+		   "timer_rounds         %u\n"
+		   "target_irq_rate      %u\n"
+		   "hardirq_percent      %u\n"
+		   "update_ms            %u\n"
+		   "scale_cpus           %u\n"
+		   "count_timer_calls    %s\n"
+		   "count_msi_calls      %s\n"
+		   "decay_factor         %u\n"
+		   "grow_factor          %u\n",
 		   str_yes_no(delay_us > 0),
-		   delay_us, irq_mod_info.timer_rounds);
+		   delay_us, irq_mod_info.timer_rounds,
+		   irq_mod_info.target_irq_rate, irq_mod_info.hardirq_percent,
+		   irq_mod_info.update_ms, irq_mod_info.scale_cpus,
+		   str_yes_no(irq_mod_info.count_timer_calls),
+		   str_yes_no(irq_mod_info.count_msi_calls),
+		   irq_mod_info.decay_factor, irq_mod_info.grow_factor);
+
+	seq_printf(p,
+		   "irq_rate             %lu\n"
+		   "irq_high             %lu\n"
+		   "my_irq_high          %lu\n"
+		   "hardirq_percent_high %lu\n"
+		   "total_interrupts     %lu\n"
+		   "total_cpus           %lu\n",
+		   active_cpus ? irq_rate / active_cpus : 0,
+		   irq_high, my_irq_high, hardirq_high,
+		   READ_ONCE(*((ulong *)&irq_mod_info.total_intrs)),
+		   READ_ONCE(*((ulong *)&irq_mod_info.total_cpus)));
 
 	return 0;
 }
@@ -218,10 +382,15 @@ struct param_names {
 static struct param_names param_names[] = {
 	{ "delay_us", &irq_mod_info.delay_us, 0, 500 },
 	{ "timer_rounds", &irq_mod_info.timer_rounds, 0, 20 },
+	{ "target_irq_rate", &irq_mod_info.target_irq_rate, 0, 50000000 },
+	{ "hardirq_percent", &irq_mod_info.hardirq_percent, 0, 100 },
 	{ "update_ms", &irq_mod_info.update_ms, 1, 100 },
 	/* Empty entry indicates the following are not settable from procfs. */
 	{},
+	{ "scale_cpus", &irq_mod_info.scale_cpus, 50, 1000 },
 	{ "count_timer_calls", &irq_mod_info.count_timer_calls, 0, 1 },
+	{ "decay_factor", &irq_mod_info.decay_factor, 8, 64 },
+	{ "grow_factor", &irq_mod_info.grow_factor, 8, 64 },
 };
 
 static void update_enable_key(void)
diff --git a/kernel/irq/irq_moderation.h b/kernel/irq/irq_moderation.h
index d9bc672ffd6f1..3543e8e8b6e2d 100644
--- a/kernel/irq/irq_moderation.h
+++ b/kernel/irq/irq_moderation.h
@@ -138,6 +138,8 @@ DECLARE_PER_CPU_ALIGNED(struct irq_mod_state, irq_mod_state);
 
 extern struct static_key_false irq_moderation_enabled_key;
 
+void __irq_moderation_adjust_delay(struct irq_mod_state *ms, u64 delta_time);
+
 /* Called on each interrupt for adaptive moderation delay adjustment. */
 static inline void irq_moderation_adjust_delay(struct irq_mod_state *ms)
 {
@@ -157,7 +159,13 @@ static inline void irq_moderation_adjust_delay(struct irq_mod_state *ms)
 	ms->update_ns = READ_ONCE(irq_mod_info.update_ms) * NSEC_PER_MSEC;
 	ms->delay_ns = READ_ONCE(irq_mod_info.delay_us) * NSEC_PER_USEC;
 
-	ms->mod_ns = ms->delay_ns;
+	/* If config changed, restart from the highest delay. */
+	if (ktime_compare(irq_mod_info.procfs_write_ns, ms->last_ns) > 0)
+		ms->mod_ns = ms->delay_ns;
+
+	ms->last_ns = now;
+	/* Do the expensive processing */
+	__irq_moderation_adjust_delay(ms, delta_time);
 }
 
 /* Return true if timer is active or delay is large enough to require moderation */
-- 
2.52.0.rc1.455.g30608eb744-goog



* [PATCH v2 5/8] x86/irq: soft_moderation: add support for posted_msi (intel)
  2025-11-16 18:28 [PATCH v2 0/8] platform wide software interrupt moderation Luigi Rizzo
                   ` (3 preceding siblings ...)
  2025-11-16 18:28 ` [PATCH v2 4/8] genirq: soft_moderation: implement adaptive moderation Luigi Rizzo
@ 2025-11-16 18:28 ` Luigi Rizzo
  2025-11-17 21:14   ` Thomas Gleixner
  2025-11-17 21:36   ` kernel test robot
  2025-11-16 18:28 ` [PATCH v2 6/8] genirq: soft_moderation: helpers for per-driver defaults Luigi Rizzo
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 32+ messages in thread
From: Luigi Rizzo @ 2025-11-16 18:28 UTC (permalink / raw)
  To: Thomas Gleixner, Marc Zyngier, Luigi Rizzo, Paolo Abeni,
	Andrew Morton, Sean Christopherson, Jacob Pan
  Cc: linux-kernel, linux-arch, Bjorn Helgaas, Willem de Bruijn,
	Luigi Rizzo

On recent Intel CPUs, with kernels compiled with CONFIG_X86_POSTED_MSI=y
and booted with "intremap=posted_msi", all MSI interrupts that hit a CPU
are delivered as a single POSTED_MSI interrupt processed by
sysvec_posted_msi_notification() instead of as separate interrupts.

This change adds soft moderation hooks to the above handler.
Soft moderation on posted_msi does not require a per-source enable;
irq_moderation.delay_us > 0 suffices.

To test it, run a kernel with the above options and enable moderation by
setting delay_us > 0.  The column "from_msi" in /proc/irq/soft_moderation
will show a non-zero value.

Change-Id: I07b83b428de6f6541e3903b553c1b837c68a0b7d
Signed-off-by: Luigi Rizzo <lrizzo@google.com>
---
 arch/x86/kernel/Makefile    |  2 +-
 arch/x86/kernel/irq.c       | 13 +++++++
 kernel/irq/irq_moderation.c | 38 ++++++++++++++++++-
 kernel/irq/irq_moderation.h | 73 ++++++++++++++++++++++++++++++++++++-
 4 files changed, 123 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index bc184dd38d993..530f5b5342eaa 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -46,7 +46,7 @@ KCOV_INSTRUMENT_unwind_guess.o				:= n
 
 CFLAGS_head32.o := -fno-stack-protector
 CFLAGS_head64.o := -fno-stack-protector
-CFLAGS_irq.o := -I $(src)/../include/asm/trace
+CFLAGS_irq.o := -I $(src)/../include/asm/trace -I $(srctree)/kernel/irq
 
 obj-y			+= head_$(BITS).o
 obj-y			+= head$(BITS).o
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 10721a1252269..1abdd21fa5c52 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -13,6 +13,8 @@
 #include <linux/export.h>
 #include <linux/irq.h>
 
+#include <irq_moderation.h>
+
 #include <asm/irq_stack.h>
 #include <asm/apic.h>
 #include <asm/io_apic.h>
@@ -448,6 +450,13 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_posted_msi_notification)
 	inc_irq_stat(posted_msi_notification_count);
 	irq_enter();
 
+	if (posted_msi_moderation_enabled()) {
+		if (posted_msi_should_rearm(handle_pending_pir(pid->pir, regs)))
+			goto rearm;
+		else
+			goto common_end;
+	}
+
 	/*
 	 * Max coalescing count includes the extra round of handle_pending_pir
 	 * after clearing the outstanding notification bit. Hence, at most
@@ -458,6 +467,7 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_posted_msi_notification)
 			break;
 	}
 
+rearm:
 	/*
 	 * Clear outstanding notification bit to allow new IRQ notifications,
 	 * do this last to maximize the window of interrupt coalescing.
@@ -471,6 +481,9 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_posted_msi_notification)
 	 */
 	handle_pending_pir(pid->pir, regs);
 
+common_end:
+	posted_msi_moderation_epilogue();
+
 	apic_eoi();
 	irq_exit();
 	set_irq_regs(old_regs);
diff --git a/kernel/irq/irq_moderation.c b/kernel/irq/irq_moderation.c
index 72be9e88c3890..2d01e4cd4638b 100644
--- a/kernel/irq/irq_moderation.c
+++ b/kernel/irq/irq_moderation.c
@@ -11,6 +11,13 @@
 #include "internals.h"
 #include "irq_moderation.h"
 
+#ifdef CONFIG_X86
+#include <asm/apic.h>
+#include <asm/irq_remapping.h>
+#else
+static inline bool posted_msi_supported(void) { return false; }
+#endif
+
 /*
  * Platform-wide software interrupt moderation.
  *
@@ -32,6 +39,10 @@
  *   moderation on that CPU/irq. If so, calls disable_irq_nosync() and starts
  *   an hrtimer with appropriate delay.
  *
+ * - Intel only: using "intremap=posted_msi", all the above is done in
+ *   sysvec_posted_msi_notification(). In this case all host device interrupts
+ *   are subject to moderation.
+ *
  * - the timer callback calls enable_irq() for all disabled interrupts on that
  *   CPU. That in turn will generate interrupts if there are pending events.
  *
@@ -230,6 +241,17 @@ static enum hrtimer_restart timer_cb(struct hrtimer *timer)
 
 	ms->rounds_left--;
 
+#ifdef CONFIG_X86_POSTED_MSI
+	if (ms->kick_posted_msi) {
+		if (ms->rounds_left == 0)
+			ms->kick_posted_msi = false;
+		/* Next call will be from timer, count it conditionally. */
+		ms->dont_count = !irq_mod_info.count_timer_calls;
+		ms->timer_calls++;
+		apic->send_IPI_self(POSTED_MSI_NOTIFICATION_VECTOR);
+	}
+#endif
+
 	if (ms->rounds_left > 0) {
 		/* Timer still alive, just call the handlers. */
 		list_for_each_entry_safe(desc, next, &ms->descs, mod.ms_node) {
@@ -332,7 +354,7 @@ static int moderation_show(struct seq_file *p, void *v)
 	}
 
 	seq_printf(p, "\n"
-		   "enabled              %s\n"
+		   "enabled              %s%s\n"
 		   "delay_us             %u\n"
 		   "timer_rounds         %u\n"
 		   "target_irq_rate      %u\n"
@@ -344,6 +366,7 @@ static int moderation_show(struct seq_file *p, void *v)
 		   "decay_factor         %u\n"
 		   "grow_factor          %u\n",
 		   str_yes_no(delay_us > 0),
+		   posted_msi_supported() ? " (also on posted_msi)" : "",
 		   delay_us, irq_mod_info.timer_rounds,
 		   irq_mod_info.target_irq_rate, irq_mod_info.hardirq_percent,
 		   irq_mod_info.update_ms, irq_mod_info.scale_cpus,
@@ -389,6 +412,7 @@ static struct param_names param_names[] = {
 	{},
 	{ "scale_cpus", &irq_mod_info.scale_cpus, 50, 1000 },
 	{ "count_timer_calls", &irq_mod_info.count_timer_calls, 0, 1 },
+	{ "count_msi_calls", &irq_mod_info.count_msi_calls, 0, 1 },
 	{ "decay_factor", &irq_mod_info.decay_factor, 8, 64 },
 	{ "grow_factor", &irq_mod_info.grow_factor, 8, 64 },
 };
@@ -476,6 +500,18 @@ static ssize_t mode_write(struct file *f, const char __user *buf, size_t count,
 	ret = kstrtobool(cmd, &enable);
 	if (!ret)
 		ret = set_moderation_mode(desc, enable);
+	if (ret) {
+		/* extra helpers for prodkernel */
+		if (cmd[count - 1] == '\n')
+			cmd[count - 1] = '\0';
+		ret = 0;
+		if (!strcmp(cmd, "managed"))
+			irqd_set(&desc->irq_data, IRQD_AFFINITY_MANAGED);
+		else if (!strcmp(cmd, "unmanaged"))
+			irqd_clear(&desc->irq_data, IRQD_AFFINITY_MANAGED);
+		else
+			ret = -EINVAL;
+	}
 	return ret ? : count;
 }
 
diff --git a/kernel/irq/irq_moderation.h b/kernel/irq/irq_moderation.h
index 3543e8e8b6e2d..69bbbb7b2ec80 100644
--- a/kernel/irq/irq_moderation.h
+++ b/kernel/irq/irq_moderation.h
@@ -145,7 +145,8 @@ static inline void irq_moderation_adjust_delay(struct irq_mod_state *ms)
 {
 	u64 now, delta_time;
 
-	ms->irq_count++;
+	/* dont_count can only be set in timer calls from posted_msi */
+	ms->irq_count += !ms->dont_count;
 	/* ktime_get_ns() is expensive, don't do too often */
 	if (ms->irq_count & 0xf)
 		return;
@@ -196,6 +197,15 @@ static inline void irq_moderation_hook(struct irq_desc *desc)
 	if (!static_branch_unlikely(&irq_moderation_enabled_key))
 		return;
 
+#ifdef CONFIG_X86_POSTED_MSI
+	if (ms->in_posted_msi) {
+		/* these calls are not moderated */
+		ms->from_posted_msi++;
+		ms->irq_count += irq_mod_info.count_msi_calls;
+		return;
+	}
+#endif
+
 	if (!READ_ONCE(desc->mod.enable))
 		return;
 
@@ -243,6 +253,61 @@ static inline void irq_moderation_epilogue(const struct irq_desc *desc)
 		irq_moderation_start_timer(ms);
 }
 
+#ifdef CONFIG_X86_POSTED_MSI
+/*
+ * Helpers for sysvec_posted_msi_notification(); use as follows:
+ *
+ *	if (posted_msi_moderation_enabled()) {
+ *		if (posted_msi_should_rearm(handle_pending_pir(pid->pir, regs)))
+ *			goto rearm;
+ *		else
+ *			goto common_end;
+ *	}
+ *	...
+ *  common_end:
+ *	posted_msi_moderation_epilogue();
+ */
+static inline bool posted_msi_moderation_enabled(void)
+{
+	struct irq_mod_state *ms = this_cpu_ptr(&irq_mod_state);
+
+	if (!static_branch_unlikely(&irq_moderation_enabled_key))
+		return false;
+	irq_moderation_adjust_delay(ms);
+	/* Tell handlers to not throttle next calls. */
+	ms->in_posted_msi = true;
+	return true;
+}
+
+/* Decide whether or not to rearm posted_msi. */
+static inline bool posted_msi_should_rearm(bool work_done)
+{
+	struct irq_mod_state *ms = this_cpu_ptr(&irq_mod_state);
+
+	/* No rearm if there is a timer pending. */
+	if (ms->rounds_left > 0)
+		return false;
+	/* No work done, can rearm. */
+	if (!work_done)
+		return true;
+	if (!irq_moderation_needed(ms))
+		return true;
+	/* Start the timer, inform the handler, and do not rearm. */
+	ms->kick_posted_msi = true;
+	irq_moderation_start_timer(ms);
+	return false;
+}
+
+/* Cleanup state set in posted_msi_moderation_enabled() */
+static inline void posted_msi_moderation_epilogue(void)
+{
+	struct irq_mod_state *ms = this_cpu_ptr(&irq_mod_state);
+
+	ms->in_posted_msi = false;
+	ms->dont_count = false;
+}
+#endif
+
 void irq_moderation_procfs_add(struct irq_desc *desc, umode_t umode);
 void irq_moderation_procfs_remove(struct irq_desc *desc);
 
@@ -251,6 +316,12 @@ void irq_moderation_procfs_remove(struct irq_desc *desc);
 static inline void irq_moderation_hook(struct irq_desc *desc) {}
 static inline void irq_moderation_epilogue(const struct irq_desc *desc) {}
 
+#ifdef CONFIG_X86_POSTED_MSI
+static inline bool posted_msi_moderation_enabled(void) { return false; }
+static inline bool posted_msi_should_rearm(bool work_done) { return false; }
+static inline void posted_msi_moderation_epilogue(void) {}
+#endif
+
 static inline void irq_moderation_procfs_add(struct irq_desc *desc, umode_t umode) {}
 static inline void irq_moderation_procfs_remove(struct irq_desc *desc) {}
 
-- 
2.52.0.rc1.455.g30608eb744-goog



* [PATCH v2 6/8] genirq: soft_moderation: helpers for per-driver defaults
  2025-11-16 18:28 [PATCH v2 0/8] platform wide software interrupt moderation Luigi Rizzo
                   ` (4 preceding siblings ...)
  2025-11-16 18:28 ` [PATCH v2 5/8] x86/irq: soft_moderation: add support for posted_msi (intel) Luigi Rizzo
@ 2025-11-16 18:28 ` Luigi Rizzo
  2025-11-16 18:28 ` [PATCH v2 7/8] nvme-pci: add module parameter for default moderation mode Luigi Rizzo
  2025-11-16 18:28 ` [PATCH v2 8/8] vfio-pci: " Luigi Rizzo
  7 siblings, 0 replies; 32+ messages in thread
From: Luigi Rizzo @ 2025-11-16 18:28 UTC (permalink / raw)
  To: Thomas Gleixner, Marc Zyngier, Luigi Rizzo, Paolo Abeni,
	Andrew Morton, Sean Christopherson, Jacob Pan
  Cc: linux-kernel, linux-arch, Bjorn Helgaas, Willem de Bruijn,
	Luigi Rizzo

Introduce helpers to implement per-driver module parameters to enable
moderation at boot/probe time.

No functional change.

Change-Id: I305aa42fa348055004cc2221ef6b055a0ab4b9d5
Signed-off-by: Luigi Rizzo <lrizzo@google.com>
---
 include/linux/interrupt.h   | 19 +++++++++++++++++++
 kernel/irq/irq_moderation.c | 14 ++++++++++++++
 2 files changed, 33 insertions(+)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 51b6484c04934..17ce0aac181de 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -872,6 +872,25 @@ extern int early_irq_init(void);
 extern int arch_probe_nr_irqs(void);
 extern int arch_early_irq_init(void);
 
+#ifdef CONFIG_IRQ_SOFT_MODERATION
+
+/* helpers for per-driver moderation mode settings */
+#define DEFINE_IRQ_MODERATION_MODE_PARAMETER		\
+	static bool soft_moderation;			\
+	module_param(soft_moderation, bool, 0644);	\
+	MODULE_PARM_DESC(soft_moderation, "0: off, 1: on")
+
+void irq_moderation_set_mode(int irq, bool enable);
+#define IRQ_MODERATION_SET_DEFAULT_MODE(_irq)		\
+	irq_moderation_set_mode(_irq, READ_ONCE(soft_moderation))
+
+#else   /* empty stubs to avoid conditional compilation */
+
+#define DEFINE_IRQ_MODERATION_MODE_PARAMETER
+#define IRQ_MODERATION_SET_DEFAULT_MODE(_irq)
+
+#endif
+
 /*
  * We want to know which function is an entrypoint of a hardirq or a softirq.
  */
diff --git a/kernel/irq/irq_moderation.c b/kernel/irq/irq_moderation.c
index 2d01e4cd4638b..c2542c92fbbd5 100644
--- a/kernel/irq/irq_moderation.c
+++ b/kernel/irq/irq_moderation.c
@@ -84,6 +84,10 @@ static inline bool posted_msi_supported(void) { return false; }
  *
  *    echo "on" > /proc/irq/NN/soft_moderation # use "off" to disable
  *
+ * For selected drivers, the default can also be supplied via module parameters
+ *
+ *	${DRIVER}.soft_moderation=1
+ *
  * === MONITORING ===
  *
  * cat /proc/irq/soft_moderation shows per-CPU and global statistics.
@@ -307,6 +311,16 @@ static inline int set_moderation_mode(struct irq_desc *desc, bool enable)
 	return 0;
 }
 
+/* irq_to_desc() is not exported. Wrap it for use in drivers. */
+void irq_moderation_set_mode(int irq, bool enable)
+{
+	struct irq_desc *desc = irq_to_desc(irq);
+
+	if (desc)
+		set_moderation_mode(desc, enable);
+}
+EXPORT_SYMBOL(irq_moderation_set_mode);
+
 #pragma clang diagnostic error "-Wformat"
 /* Print statistics */
 static int moderation_show(struct seq_file *p, void *v)
-- 
2.52.0.rc1.455.g30608eb744-goog



* [PATCH v2 7/8] nvme-pci: add module parameter for default moderation mode
  2025-11-16 18:28 [PATCH v2 0/8] platform wide software interrupt moderation Luigi Rizzo
                   ` (5 preceding siblings ...)
  2025-11-16 18:28 ` [PATCH v2 6/8] genirq: soft_moderation: helpers for per-driver defaults Luigi Rizzo
@ 2025-11-16 18:28 ` Luigi Rizzo
  2025-11-16 18:28 ` [PATCH v2 8/8] vfio-pci: " Luigi Rizzo
  7 siblings, 0 replies; 32+ messages in thread
From: Luigi Rizzo @ 2025-11-16 18:28 UTC (permalink / raw)
  To: Thomas Gleixner, Marc Zyngier, Luigi Rizzo, Paolo Abeni,
	Andrew Morton, Sean Christopherson, Jacob Pan
  Cc: linux-kernel, linux-arch, Bjorn Helgaas, Willem de Bruijn,
	Luigi Rizzo

The following module parameter enables moderation at boot time for nvme devices.

  nvme.soft_moderation=1

It can be overridden at runtime via /proc/irq/*/*nvme*/soft_moderation.
See Documentation/core-api/irq/irq-moderation.rst for configuration.

Change-Id: Ib061737415c26b868e889d4d9953e1d25ca8fc4b
Signed-off-by: Luigi Rizzo <lrizzo@google.com>
---
 drivers/nvme/host/pci.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 72fb675a696f4..b9d7bce30061f 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -72,6 +72,8 @@
 static_assert(MAX_PRP_RANGE / NVME_CTRL_PAGE_SIZE <=
 	(1 /* prp1 */ + NVME_MAX_NR_DESCRIPTORS * PRPS_PER_PAGE));
 
+DEFINE_IRQ_MODERATION_MODE_PARAMETER;
+
 static int use_threaded_interrupts;
 module_param(use_threaded_interrupts, int, 0444);
 
@@ -1989,6 +1991,7 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid, bool polled)
 		result = queue_request_irq(nvmeq);
 		if (result < 0)
 			goto release_sq;
+		IRQ_MODERATION_SET_DEFAULT_MODE(pci_irq_vector(to_pci_dev(dev->dev), vector));
 	}
 
 	set_bit(NVMEQ_ENABLED, &nvmeq->flags);
-- 
2.52.0.rc1.455.g30608eb744-goog



* [PATCH v2 8/8] vfio-pci: add module parameter for default moderation mode
  2025-11-16 18:28 [PATCH v2 0/8] platform wide software interrupt moderation Luigi Rizzo
                   ` (6 preceding siblings ...)
  2025-11-16 18:28 ` [PATCH v2 7/8] nvme-pci: add module parameter for default moderation mode Luigi Rizzo
@ 2025-11-16 18:28 ` Luigi Rizzo
  7 siblings, 0 replies; 32+ messages in thread
From: Luigi Rizzo @ 2025-11-16 18:28 UTC (permalink / raw)
  To: Thomas Gleixner, Marc Zyngier, Luigi Rizzo, Paolo Abeni,
	Andrew Morton, Sean Christopherson, Jacob Pan
  Cc: linux-kernel, linux-arch, Bjorn Helgaas, Willem de Bruijn,
	Luigi Rizzo

The following module parameter enables moderation at boot time for vfio devices.

  vfio_pci_core.soft_moderation=1

It can be overridden at runtime via /proc/irq/*/*vfio*/soft_moderation.
See Documentation/core-api/irq/irq-moderation.rst for configuration.

Change-Id: I69ece365fa8eb796c22e088dbe849eddb684e02a
Signed-off-by: Luigi Rizzo <lrizzo@google.com>
---
 drivers/vfio/pci/vfio_pci_intrs.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 30d3e921cb0de..e54d88cfe601d 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -22,6 +22,8 @@
 
 #include "vfio_pci_priv.h"
 
+DEFINE_IRQ_MODERATION_MODE_PARAMETER;
+
 struct vfio_pci_irq_ctx {
 	struct vfio_pci_core_device	*vdev;
 	struct eventfd_ctx		*trigger;
@@ -317,6 +319,7 @@ static int vfio_intx_enable(struct vfio_pci_core_device *vdev,
 		vfio_irq_ctx_free(vdev, ctx, 0);
 		return ret;
 	}
+	IRQ_MODERATION_SET_DEFAULT_MODE(pdev->irq);
 
 	return 0;
 }
-- 
2.52.0.rc1.455.g30608eb744-goog



* Re: [PATCH v2 2/8] genirq: soft_moderation: add base files, procfs
  2025-11-16 18:28 ` [PATCH v2 2/8] genirq: soft_moderation: add base files, procfs Luigi Rizzo
@ 2025-11-17 11:12   ` kernel test robot
  2025-11-17 16:01   ` Thomas Gleixner
  2025-11-17 21:56   ` kernel test robot
  2 siblings, 0 replies; 32+ messages in thread
From: kernel test robot @ 2025-11-17 11:12 UTC (permalink / raw)
  To: Luigi Rizzo, Thomas Gleixner, Marc Zyngier, Luigi Rizzo,
	Paolo Abeni, Andrew Morton, Sean Christopherson, Jacob Pan
  Cc: oe-kbuild-all, Linux Memory Management List, linux-kernel,
	linux-arch, Bjorn Helgaas, Willem de Bruijn

Hi Luigi,

kernel test robot noticed the following build warnings:

[auto build test WARNING on tip/irq/core]
[also build test WARNING on tip/x86/core tip/master linus/master v6.18-rc6 next-20251114]
[cannot apply to tip/auto-latest]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Luigi-Rizzo/genirq-platform-wide-interrupt-moderation-Documentation-Kconfig-irq_desc/20251117-023148
base:   tip/irq/core
patch link:    https://lore.kernel.org/r/20251116182839.939139-3-lrizzo%40google.com
patch subject: [PATCH v2 2/8] genirq: soft_moderation: add base files, procfs
config: sh-allmodconfig (https://download.01.org/0day-ci/archive/20251117/202511171936.CT5XHXPZ-lkp@intel.com/config)
compiler: sh4-linux-gcc (GCC) 15.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251117/202511171936.CT5XHXPZ-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202511171936.CT5XHXPZ-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> kernel/irq/irq_moderation.c:103: warning: ignoring '#pragma clang diagnostic' [-Wunknown-pragmas]
     103 | #pragma clang diagnostic error "-Wformat"
   In file included from include/linux/uaccess.h:10,
                    from include/linux/sched/task.h:13,
                    from include/linux/sched/signal.h:9,
                    from include/linux/rcuwait.h:6,
                    from include/linux/percpu-rwsem.h:7,
                    from include/linux/fs.h:34,
                    from include/linux/seq_file.h:11,
                    from kernel/irq/irq_moderation.c:8:
   In function 'check_copy_size',
       inlined from 'copy_from_user' at include/linux/uaccess.h:207:7,
       inlined from 'moderation_write' at kernel/irq/irq_moderation.c:169:6:
   include/linux/ucopysize.h:54:25: error: call to '__bad_copy_to' declared with attribute error: copy destination size is too small
      54 |                         __bad_copy_to();
         |                         ^~~~~~~~~~~~~~~
   In function 'check_copy_size',
       inlined from 'copy_from_user' at include/linux/uaccess.h:207:7,
       inlined from 'mode_write' at kernel/irq/irq_moderation.c:223:6:
   include/linux/ucopysize.h:54:25: error: call to '__bad_copy_to' declared with attribute error: copy destination size is too small
      54 |                         __bad_copy_to();
         |                         ^~~~~~~~~~~~~~~


vim +103 kernel/irq/irq_moderation.c

   102	
 > 103	#pragma clang diagnostic error "-Wformat"
   104	/* Print statistics */
   105	static int moderation_show(struct seq_file *p, void *v)
   106	{
   107		uint delay_us = irq_mod_info.delay_us;
   108		int j;
   109	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH v2 2/8] genirq: soft_moderation: add base files, procfs
  2025-11-16 18:28 ` [PATCH v2 2/8] genirq: soft_moderation: add base files, procfs Luigi Rizzo
  2025-11-17 11:12   ` kernel test robot
@ 2025-11-17 16:01   ` Thomas Gleixner
  2025-11-17 16:16     ` Luigi Rizzo
  2025-11-17 21:56   ` kernel test robot
  2 siblings, 1 reply; 32+ messages in thread
From: Thomas Gleixner @ 2025-11-17 16:01 UTC (permalink / raw)
  To: Luigi Rizzo, Marc Zyngier, Luigi Rizzo, Paolo Abeni,
	Andrew Morton, Sean Christopherson, Jacob Pan
  Cc: linux-kernel, linux-arch, Bjorn Helgaas, Willem de Bruijn,
	Luigi Rizzo

On Sun, Nov 16 2025 at 18:28, Luigi Rizzo wrote:
> +/* Handlers for /proc/irq/NN/soft_moderation */
> +static int mode_show(struct seq_file *p, void *v)
> +{
> +	struct irq_desc *desc = p->private;
> +
> +	if (!desc)
> +		return -ENOENT;
> +
> +	seq_printf(p, "%s irq %u trigger 0x%x %s %smanaged %slazy handle_irq %pB\n",
> +		   desc->mod.enable ? "on" : "off", desc->irq_data.irq,
> +		   irqd_get_trigger_type(&desc->irq_data),
> +		   irqd_is_level_type(&desc->irq_data) ? "Level" : "Edge",
> +		   irqd_affinity_is_managed(&desc->irq_data) ? "" : "un",
> +		   irq_settings_disable_unlazy(desc) ? "un" : "", desc->handle_irq

Why are you exposing randomly picked information here? The only thing
that's interesting is whether the interrupt is moderated or not. The
interrupt number is redundant information. And all the other internal
details are available in debugfs already.

> +/* Per-CPU state initialization */
> +static void irq_moderation_percpu_init(void *data)
> +{
> +	struct irq_mod_state *ms = this_cpu_ptr(&irq_mod_state);
> +
> +	INIT_LIST_HEAD(&ms->descs);
> +}
> +
> +static int cpuhp_setup_cb(uint cpu)
> +{
> +	irq_moderation_percpu_init(NULL);
> +	return 0;
> +}
> +
> +static void clamp_parameter(uint *dst, uint val)
> +{
> +	struct param_names *n = param_names;
> +	uint i;
> +
> +	for (i = 0; i < ARRAY_SIZE(param_names); i++, n++) {
> +		if (dst == n->val) {
> +			*dst = clamp(val, n->min, n->max);
> +			return;
> +		}
> +	}
> +}
> +
> +static int __init init_irq_moderation(void)
> +{
> +	uint *cur;
> +
> +	on_each_cpu(irq_moderation_percpu_init, NULL, 1);

That's pointless. Register the hotplug callback without 'nocalls' and
let the hotplug code handle it.

> +	cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN, "moderation:online", cpuhp_setup_cb, NULL);

> +#ifdef CONFIG_IRQ_SOFT_MODERATION
> +
> +/**
> + * struct irq_mod_info - global configuration parameters and state
> + * @total_intrs:	running count updated every update_ms
> + * @total_cpus:		as above, active CPUs in this interval
> + * @procfs_write_ns:	last write to /proc/irq/soft_moderation
> + * @delay_us:		fixed delay, or maximum for adaptive
> + * @target_irq_rate:	target maximum interrupt rate
> + * @hardirq_percent:	target maximum hardirq percentage
> + * @timer_rounds:	how many timer polls once moderation fires
> + * @update_ms:		how often to update delay/rate/fraction
> + * @scale_cpus:		(percent) scale factor to estimate active CPUs
> + * @count_timer_calls:	count timer calls for irq limits
> + * @count_msi_calls:	count calls from posted_msi for irq limits
> + * @decay_factor:	smoothing factor for the control loop, keep at 16
> + * @grow_factor:	smoothing factor for the control loop, keep it at 8
> + */
> +struct irq_mod_info {
> +	/* These fields are written to by all CPUs */
> +	____cacheline_aligned
> +	atomic_long_t total_intrs;
> +	atomic_long_t total_cpus;
> +
> +	/* These are mostly read (frequently), so use a different cacheline */
> +	____cacheline_aligned
> +	u64 procfs_write_ns;
> +	uint delay_us;

I asked you last time already to follow the TIP tree documentation, no?

> +	uint target_irq_rate;
> +	uint hardirq_percent;
> +	uint timer_rounds;
> +	uint update_ms;
> +	uint scale_cpus;
> +	uint count_timer_calls;
> +	uint count_msi_calls;
> +	uint decay_factor;
> +	uint grow_factor;
> +	uint pad[];
> +};

And again you decided to add these fat data structures all in once with
no usage. I told you last time that this is unreviewable and that stuff
should be introduced gradually with the usage.

Feel free to ignore my feedback, but don't be upset if I ignore your
patches then.

Thanks,

        tglx


* Re: [PATCH v2 2/8] genirq: soft_moderation: add base files, procfs
  2025-11-17 16:01   ` Thomas Gleixner
@ 2025-11-17 16:16     ` Luigi Rizzo
  2025-11-17 19:35       ` Thomas Gleixner
  0 siblings, 1 reply; 32+ messages in thread
From: Luigi Rizzo @ 2025-11-17 16:16 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Marc Zyngier, Luigi Rizzo, Paolo Abeni, Andrew Morton,
	Sean Christopherson, Jacob Pan, linux-kernel, linux-arch,
	Bjorn Helgaas, Willem de Bruijn

On Mon, Nov 17, 2025 at 5:01 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Sun, Nov 16 2025 at 18:28, Luigi Rizzo wrote:
> > +/* Handlers for /proc/irq/NN/soft_moderation */
> > +static int mode_show(struct seq_file *p, void *v)
> > +{
> > +     struct irq_desc *desc = p->private;
> > +
> > +     if (!desc)
> > +             return -ENOENT;
> > +
> > +     seq_printf(p, "%s irq %u trigger 0x%x %s %smanaged %slazy handle_irq %pB\n",
> > +                desc->mod.enable ? "on" : "off", desc->irq_data.irq,
> > +                irqd_get_trigger_type(&desc->irq_data),
> > +                irqd_is_level_type(&desc->irq_data) ? "Level" : "Edge",
> > +                irqd_affinity_is_managed(&desc->irq_data) ? "" : "un",
> > +                irq_settings_disable_unlazy(desc) ? "un" : "", desc->handle_irq
>
> Why are you exposing randomly picked information here? The only thing
> what's interesting is whether the interrupt is moderated or not. The
> interrupt number is redundant information. And all the other internal
> details are available in debugfs already.

ah, sorry, forgot to clean up this internal debugging code.
The previous version had only the enable.

>
> >
> > +static int __init init_irq_moderation(void)
> > +{
> > +     uint *cur;
> > +
> > +     on_each_cpu(irq_moderation_percpu_init, NULL, 1);
>
> That's pointless. Register the hotplug callback without 'nocalls' and
> let the hotplug code handle it.

Sounds good. I have a question on event ordering.
Which event should I use to make sure the callback runs
before interrupts are enabled on the new CPU?

> > ...
> I asked you last time already to follow the TIP tree documentation, no?
>
> > +     uint target_irq_rate;
> > +     uint hardirq_percent;
> > +     uint timer_rounds;
> > +     uint update_ms;
> > +     uint scale_cpus;
> > +     uint count_timer_calls;
> > +     uint count_msi_calls;
> > +     uint decay_factor;
> > +     uint grow_factor;
> > +     uint pad[];
> > +};
>
> And again you decided to add these fat data structures all in once with
> no usage. I told you last time that this is unreviewable and that stuff
> should be introduced gradually with the usage.

Ok, will do.
FWIW my goal was to get the telemetry functions in the first patch, and reduce
the clutter in subsequent patches, since each new field would create many
chunks (docstring, struct field, init, print format and value).

cheers
luigi


* Re: [PATCH v2 3/8] genirq: soft_moderation: implement fixed moderation
  2025-11-16 18:28 ` [PATCH v2 3/8] genirq: soft_moderation: implement fixed moderation Luigi Rizzo
@ 2025-11-17 19:30   ` Thomas Gleixner
  2025-11-17 23:16     ` Thomas Gleixner
  0 siblings, 1 reply; 32+ messages in thread
From: Thomas Gleixner @ 2025-11-17 19:30 UTC (permalink / raw)
  To: Luigi Rizzo, Marc Zyngier, Luigi Rizzo, Paolo Abeni,
	Andrew Morton, Sean Christopherson, Jacob Pan
  Cc: linux-kernel, linux-arch, Bjorn Helgaas, Willem de Bruijn,
	Luigi Rizzo

On Sun, Nov 16 2025 at 18:28, Luigi Rizzo wrote:
>  
> +DEFINE_STATIC_KEY_FALSE(irq_moderation_enabled_key);
> +EXPORT_SYMBOL(irq_moderation_enabled_key);

Why needs this to be exported? No code outside of the interrupt core has
to know about this and the core code is always built-in.

> +static inline void smooth_avg(u32 *dst, u32 val, u32 steps)
> +{
> +	*dst = ((64 - steps) * *dst + steps * val) / 64;

How is this correct for steps > 64?

Also this is unreadable. Just make this

u32 foo(u32 cur, u32 val, u32 steps)
{
        return .....;
}

and add proper documentation of what this function is about and the
expected and valid input parameters.

> +/* Moderation timer handler. */
> +static enum hrtimer_restart timer_cb(struct hrtimer *timer)
> +{
> +	struct irq_mod_state *ms = this_cpu_ptr(&irq_mod_state);
> +	struct irq_desc *desc, *next;
> +	uint srcs = 0;
> +
> +	ms->timer_fire++;
> +	WARN_ONCE(ms->timer_set != ms->timer_fire,
> +		  "CPU %d timer set %d fire %d (lost events?)\n",
> +		  smp_processor_id(), ms->timer_set, ms->timer_fire);

This whole paranoid debug code is required because this code is so
convoluted that it can't be validated for correctness, right?

> +	ms->rounds_left--;
> +
> +	if (ms->rounds_left > 0) {
> +		/* Timer still alive, just call the handlers. */
> +		list_for_each_entry_safe(desc, next, &ms->descs, mod.ms_node) {
> +			ms->irq_count += irq_mod_info.count_timer_calls;
> +			ms->timer_calls++;
> +			handle_irq_event_percpu(desc);

I told you before that running the handler w/o the INPROGRESS flag set
is inconsistent state, no?

But what's worse is that this completely ignores the real disabled
state.

interrupt()
        ....
        moderate()      	-> disable depth = 1

...

Driver code
       disable_irq()		-> disable depth = 2

timer()
       handle_irq_event()       -> BUG

And the more I think about it the more I'm convinced that abusing the
disabled state is the wrong thing to do. The interrupt is definitely not
disabled at all and all you create is inconsistent state.

So no. That needs more thought and a different mechanism to reflect that
moderated state.

> +		}
> +		ms->timer_set++;
> +		hrtimer_forward_now(&ms->timer, ms->sleep_ns);
> +		return HRTIMER_RESTART;
> +	}
> +
> +	/* Last round, remove from list and enable_irq(). */
> +	list_for_each_entry_safe(desc, next, &ms->descs, mod.ms_node) {
> +		list_del(&desc->mod.ms_node);
> +		INIT_LIST_HEAD(&desc->mod.ms_node);

                list_del_init();

> +		srcs++;
> +		ms->enable_irq++;
> +		enable_irq(desc->irq_data.irq);
> +	}
> +	smooth_avg(&ms->scaled_src_count, srcs * 256, 1);
> +
> +	/* Prepare to accumulate next moderation delay. */
> +	ms->sleep_ns = 0;
> +
> +	WARN_ONCE(ms->disable_irq != ms->enable_irq,
> +		  "CPU %d irq disable %d enable %d (%s)\n",
> +		  smp_processor_id(), ms->disable_irq, ms->enable_irq,
> +		  "bookkeeping error, some irq will be stuck");

Oh well....

> +	return HRTIMER_NORESTART;
> +}

> +static void update_enable_key(void)
> +{
> +	bool newval = irq_mod_info.delay_us != 0;
> +
> +	if (newval != static_key_enabled(&irq_moderation_enabled_key)) {

Pointless exercise. static_branch_enable/disable() handle that already

> +		if (newval)
> +			static_branch_enable(&irq_moderation_enabled_key);
> +		else
> +			static_branch_disable(&irq_moderation_enabled_key);
> +	}
> +}
> +
>  static ssize_t moderation_write(struct file *f, const char __user *buf, size_t count, loff_t *ppos)
>  {
>  	struct param_names *n = param_names;
> @@ -176,6 +254,8 @@ static ssize_t moderation_write(struct file *f, const char __user *buf, size_t c
>  		if (kstrtouint(cmd + l + 1, 0, &val))
>  			return -EINVAL;
>  		WRITE_ONCE(*(n->val), clamp(val, n->min, n->max));
> +		if (n->val == &irq_mod_info.delay_us)
> +			update_enable_key();
>  		/* Record last parameter change, for use in the control loop. */
>  		irq_mod_info.procfs_write_ns = ktime_get_ns();
>  		return count;
> @@ -258,6 +338,7 @@ static void irq_moderation_percpu_init(void *data)
>  {
>  	struct irq_mod_state *ms = this_cpu_ptr(&irq_mod_state);
>  
> +	hrtimer_setup(&ms->timer, timer_cb, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED_HARD);
>  	INIT_LIST_HEAD(&ms->descs);

I'm not convinced that you need that online callback at all. None of
this must run on the target CPU. All of that can be done from the boot
CPU in a for_each_possible_cpu() loop. It's a one time initialization, no?

>  }
>  
> @@ -293,6 +374,7 @@ static int __init init_irq_moderation(void)
>  
>  	/* Finally, set delay_us to enable moderation if needed. */
>  	clamp_parameter(&irq_mod_info.delay_us, mod_delay_us);
> +	update_enable_key();
>  
>  	proc_create_data("irq/soft_moderation", 0644, NULL, &proc_ops, NULL);
>  	return 0;
> diff --git a/kernel/irq/irq_moderation.h b/kernel/irq/irq_moderation.h
> index ccb8193482b51..d9bc672ffd6f1 100644
> --- a/kernel/irq/irq_moderation.h
> +++ b/kernel/irq/irq_moderation.h
> @@ -136,11 +136,113 @@ struct irq_mod_state {
>  
>  DECLARE_PER_CPU_ALIGNED(struct irq_mod_state, irq_mod_state);
>  
> +extern struct static_key_false irq_moderation_enabled_key;
> +
> +/* Called on each interrupt for adaptive moderation delay adjustment. */
> +static inline void irq_moderation_adjust_delay(struct irq_mod_state *ms)
> +{
> +	u64 now, delta_time;
> +
> +	ms->irq_count++;
> +	/* ktime_get_ns() is expensive, don't do too often */
> +	if (ms->irq_count & 0xf)

Magic number and what's wrong with using %?

> +		return;
> +	now = ktime_get_ns();
> +	delta_time = now - ms->last_ns;
> +
> +	/* Run approximately every update_ns, a little bit early is ok. */
> +	if (delta_time < ms->update_ns - 5000)

Again. These hardcoded magic numbers all over the place without any
rationale are annoying.

> +		return;
> +
> +	ms->update_ns = READ_ONCE(irq_mod_info.update_ms) * NSEC_PER_MSEC;
> +	ms->delay_ns = READ_ONCE(irq_mod_info.delay_us) * NSEC_PER_USEC;
> +
> +	ms->mod_ns = ms->delay_ns;

So you need to read time and do an incomprehensible comparison in order
to store the same values over and over? Both values are initialized
during init and cannot be changed afterwards. Oh well...

> +}
> +
> +/* Return true if timer is active or delay is large enough to require moderation */
> +static inline bool irq_moderation_needed(struct irq_mod_state *ms)
> +{
> +	const u32 min_delay_ns = 10000;

Yet another magic number....

> +	if (!hrtimer_is_queued(&ms->timer)) {
> +		/* accumulate sleep time, no moderation if too small */

Sentences still start with an uppercase letter.

> +		ms->sleep_ns += ms->mod_ns;

How is this accumulating sleep time? It's adding the configured delay
value (nanoseconds) to ms->sleep_ns unconditionally on every interrupt
which is handled on a CPU. It has no relation to time at all.

> +		if (ms->sleep_ns < min_delay_ns)
> +			return false;
> +	}
> +	return true;
> +}
> +
> +void disable_irq_nosync(unsigned int irq);
> +
> +/*
> + * Use in handle_irq_event() before calling the handler. Decide whether this
> + * desc should be moderated, and in case disable the irq and add the desc to
> + * the list for this CPU.
> + */
> +static inline void irq_moderation_hook(struct irq_desc *desc)
> +{
> +	struct irq_mod_state *ms = this_cpu_ptr(&irq_mod_state);
> +
> +	if (!static_branch_unlikely(&irq_moderation_enabled_key))
> +		return;
> +
> +	if (!READ_ONCE(desc->mod.enable))
> +		return;
> +
> +	irq_moderation_adjust_delay(ms);
> +
> +	if (!list_empty(&desc->mod.ms_node)) {
> +		/*
> +		 * Very unlikely, stray interrupt while the desc is moderated.

Moderating 'the desc'? The interrupt is moderated, no?

> +		 * We cannot ignore it or we may miss events, but do count it.

This comment still does not make sense. What can't you ignore and how
does counting and returning prevent missing an event?

> +		 */
> +		ms->stray_irq++;
> +		return;
> +	}
> +
> +	if (!irq_moderation_needed(ms))
> +		return;
> +
> +	/* Add to list of moderated desc on this CPU */
> +	list_add(&desc->mod.ms_node, &ms->descs);

Right and if this CPU is hot-unplugged afterwards then this interrupt is
dead forever. If the timer is still armed it is migrated to an online
CPU...

Also what protects the list against a teardown of an interrupt which is
enqueued in that list? Nothing at all, which means there is a UAF
built-in.

> +	/*
> +	 * Disable the irq. This will also cause irq_can_handle() return false

s/irq/interrupt/ Use proper words please.

> +	 * (through irq_can_handle_actions()), and that will prevent a handler
> +	 * instance to be run again while the descriptor is being moderated.
> +	 *
> +	 * irq_moderation_epilogue() will then start the timer if needed.

It's needed unconditionally after adding it to the list and disabling
it. So why can't you do it right here?

> +	 */
> +	ms->disable_irq++;
> +	disable_irq_nosync(desc->irq_data.irq);
> +}
> +
> +static inline void irq_moderation_start_timer(struct irq_mod_state *ms)
> +{
> +	ms->timer_set++;
> +	ms->rounds_left = READ_ONCE(irq_mod_info.timer_rounds) + 1;
> +	hrtimer_start_range_ns(&ms->timer, ns_to_ktime(ms->sleep_ns),
> +			       /*range*/2000, HRTIMER_MODE_REL_PINNED_HARD);

This glued in comment is better than a properly named constant, right?

> +}
> +
> +/* After the handler, if desc is moderated, make sure the timer is active. */
> +static inline void irq_moderation_epilogue(const struct irq_desc *desc)
> +{
> +	struct irq_mod_state *ms = this_cpu_ptr(&irq_mod_state);
> +

This still lacks a check whether this moderation muck is enabled at
all. And no, the list_empty() check is not the right thing for
that. That's what the static key is for.

But that's moot as there is no real reason for this epilogue to exist in
the first place.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 2/8] genirq: soft_moderation: add base files, procfs
  2025-11-17 16:16     ` Luigi Rizzo
@ 2025-11-17 19:35       ` Thomas Gleixner
  0 siblings, 0 replies; 32+ messages in thread
From: Thomas Gleixner @ 2025-11-17 19:35 UTC (permalink / raw)
  To: Luigi Rizzo
  Cc: Marc Zyngier, Luigi Rizzo, Paolo Abeni, Andrew Morton,
	Sean Christopherson, Jacob Pan, linux-kernel, linux-arch,
	Bjorn Helgaas, Willem de Bruijn

On Mon, Nov 17 2025 at 17:16, Luigi Rizzo wrote:
> On Mon, Nov 17, 2025 at 5:01 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>> > +static int __init init_irq_moderation(void)
>> > +{
>> > +     uint *cur;
>> > +
>> > +     on_each_cpu(irq_moderation_percpu_init, NULL, 1);
>>
>> That's pointless. Register the hotplug callback without 'nocalls' and
>> let the hotplug code handle it.
>
> Sounds good. I have a question on event ordering.
> Which event should I use to make sure the callback runs
> before interrupts are enabled on the new CPU ?

See include/linux/cpuhotplug.h

But see my other reply.

>> > ...
>> I asked you last time already to follow the TIP tree documentation, no?
>>
>> > +     uint target_irq_rate;
>> > +     uint hardirq_percent;
>> > +     uint timer_rounds;
>> > +     uint update_ms;
>> > +     uint scale_cpus;
>> > +     uint count_timer_calls;
>> > +     uint count_msi_calls;
>> > +     uint decay_factor;
>> > +     uint grow_factor;
>> > +     uint pad[];
>> > +};
>>
>> And again you decided to add these fat data structures all in once with
>> no usage. I told you last time that this is unreviewable and that stuff
>> should be introduced gradually with the usage.
>
> Ok, will do.
> FWIW my goal was to get the telemetry functions in the first patch, and reduce
> the clutter in subsequent patches, since each new field would create many
> chunks (docstring, struct field, init, print format and value).

TBH, I'm not convinced at all that you need all of this telemetry maze.

That looks pretty overengineered and that can be added on top of a
functional and reviewable base implementation.

Thanks,

        tglx




* Re: [PATCH v2 4/8] genirq: soft_moderation: implement adaptive moderation
  2025-11-16 18:28 ` [PATCH v2 4/8] genirq: soft_moderation: implement adaptive moderation Luigi Rizzo
@ 2025-11-17 20:51   ` Thomas Gleixner
  2025-11-17 21:34     ` Luigi Rizzo
  0 siblings, 1 reply; 32+ messages in thread
From: Thomas Gleixner @ 2025-11-17 20:51 UTC (permalink / raw)
  To: Luigi Rizzo, Marc Zyngier, Luigi Rizzo, Paolo Abeni,
	Andrew Morton, Sean Christopherson, Jacob Pan
  Cc: linux-kernel, linux-arch, Bjorn Helgaas, Willem de Bruijn,
	Luigi Rizzo

On Sun, Nov 16 2025 at 18:28, Luigi Rizzo wrote:
> Add two control parameters (target_irq_rate and hardirq_percent)
> to indicate the desired maximum values for these two metrics.
>
> Every update_ms the hook in handle_irq_event() recomputes the total and
> local interrupt rate and the amount of time spent in hardirq, compares
> the values with the targets, and adjusts the moderation delay up or down.
>
> The interrupt rate is computed in a scalable way by counting interrupts
> per-CPU, and aggregating the value in a global variable only every
> update_ms. Only CPUs that actively process interrupts are actually
> accessing the shared variable, so the system scales well even on very
> large servers.

You still fail to explain why this is required and why a per CPU
moderation is not sufficient and what the benefits of the approach are.

I'm happy to refer you to Documentation once more. It's well explained
there.

> +/* Measure and assess time spent in hardirq. */
> +static inline bool hardirq_high(struct irq_mod_state *ms, u64 delta_time, u32 hardirq_percent)
> +{
> +	bool above_threshold = false;
> +
> +	if (IS_ENABLED(CONFIG_IRQ_TIME_ACCOUNTING)) {
> +		u64 irqtime, cur = kcpustat_this_cpu->cpustat[CPUTIME_IRQ];
> +
> +		irqtime = cur - ms->last_irqtime;
> +		ms->last_irqtime = cur;
> +
> +		above_threshold = irqtime * 100 > delta_time * hardirq_percent;
> +		ms->hardirq_high += above_threshold;
> +	}
> +	return above_threshold;
> +}
> +
> +/* Measure and assess total and per-CPU interrupt rates. */
> +static inline bool irqrate_high(struct irq_mod_state *ms, u64 delta_time,
> +				u32 target_rate, u32 steps)
> +{
> +	u64 irq_rate, my_irq_rate, tmp, delta_irqs, active_cpus;
> +	bool my_rate_high, global_rate_high;
> +
> +	my_irq_rate = (ms->irq_count * NSEC_PER_SEC) / delta_time;
> +	/* Accumulate global counter and compute global irq rate. */
> +	tmp = atomic_long_add_return(ms->irq_count, &irq_mod_info.total_intrs);
> +	ms->irq_count = 1;
> +	delta_irqs = tmp - ms->last_total_irqs;
> +	ms->last_total_irqs = tmp;
> +	irq_rate = (delta_irqs * NSEC_PER_SEC) / delta_time;
> +
> +	/*
> +	 * Count how many CPUs handled interrupts in the last interval, needed
> +	 * to determine the per-CPU target (target_rate / active_cpus).
> +	 * Each active CPU increments the global counter approximately every
> +	 * update_ns. Scale the value by (update_ns / delta_time) to get the
> +	 * correct value. Also apply rounding and make sure active_cpus > 0.
> +	 */
> +	tmp = atomic_long_add_return(1, &irq_mod_info.total_cpus);
> +	active_cpus = tmp - ms->last_total_cpus;
> +	ms->last_total_cpus = tmp;
> +	active_cpus = (active_cpus * ms->update_ns + delta_time / 2) / delta_time;
> +	if (active_cpus < 1)
> +		active_cpus = 1;
> +
> +	/* Compare with global and per-CPU targets. */
> +	global_rate_high = irq_rate > target_rate;
> +	my_rate_high = my_irq_rate * active_cpus * irq_mod_info.scale_cpus > target_rate * 100;
> +
> +	/* Statistics. */
> +	smooth_avg(&ms->irq_rate, irq_rate, steps);
> +	smooth_avg(&ms->my_irq_rate, my_irq_rate, steps);
> +	smooth_avg(&ms->scaled_cpu_count, active_cpus * 256, steps);
> +	ms->my_irq_high += my_rate_high;
> +	ms->irq_high += global_rate_high;
> +
> +	/* Moderate on this CPU only if both global and local rates are high. */

Because it's desired that CPUs can be starved by interrupts when enough
other CPUs only have a very low rate? I'm failing to understand that
logic and the comprehensive rationale in the change log does not help either.

> +	return global_rate_high && my_rate_high;
> +}
> +
> +/* Adjust the moderation delay, called at most every update_ns. */
> +void __irq_moderation_adjust_delay(struct irq_mod_state *ms, u64 delta_time)
> +{
> +	u32 hardirq_percent = READ_ONCE(irq_mod_info.hardirq_percent);
> +	u32 target_rate = READ_ONCE(irq_mod_info.target_irq_rate);
> +	bool below_target = true;
> +	u32 steps;
> +
> +	if (target_rate == 0 && hardirq_percent == 0) {
> +		/* Use fixed moderation delay. */
> +		ms->mod_ns = ms->delay_ns;
> +		ms->irq_rate = 0;
> +		ms->my_irq_rate = 0;
> +		ms->scaled_cpu_count = 0;
> +		return;
> +	}
> +
> +	/* Compute decay steps based on elapsed time, bound to a reasonable value. */
> +	steps = delta_time > 10 * ms->update_ns ? 10 : 1 + (delta_time / ms->update_ns);
> +
> +	if (target_rate > 0 && irqrate_high(ms, delta_time, target_rate, steps))
> +		below_target = false;
> +
> +	if (hardirq_percent > 0 && hardirq_high(ms, delta_time, hardirq_percent))
> +		below_target = false;

So that rate limits a per CPU overload, but only when IRQTIME accounting
is enabled. Oh well...

> +	/* Controller: adjust delay with exponential decay/grow. */
> +	if (below_target) {
> +		ms->mod_ns -= ms->mod_ns * steps / (steps + irq_mod_info.decay_factor);
> +		if (ms->mod_ns < 100)
> +			ms->mod_ns = 0;

These random numbers used to limit things are truly interesting and
easy to understand - NOT.

> +	} else {
> +		/* Exponential grow does not restart if value is too small. */
> +		if (ms->mod_ns < 500)
> +			ms->mod_ns = 500;
> +		ms->mod_ns += ms->mod_ns * steps / (steps + irq_mod_info.grow_factor);
> +		if (ms->mod_ns > ms->delay_ns)
> +			ms->mod_ns = ms->delay_ns;
> +	}

Why does this need separate grow and decay factors? Just because more
knobs are better?

> +}
> +
>  /* Moderation timer handler. */
>  static enum hrtimer_restart timer_cb(struct hrtimer *timer)
>  {
> @@ -169,8 +289,10 @@ static inline int set_moderation_mode(struct irq_desc *desc, bool enable)
>  /* Print statistics */
>  static int moderation_show(struct seq_file *p, void *v)
>  {
> +	ulong irq_rate = 0, irq_high = 0, my_irq_high = 0, hardirq_high = 0;
>  	uint delay_us = irq_mod_info.delay_us;
> -	int j;
> +	u64 now = ktime_get_ns();
> +	int j, active_cpus = 0;
>  
>  #define HEAD_FMT "%5s  %8s  %8s  %4s  %4s  %8s  %11s  %11s  %11s  %11s  %11s  %11s  %11s  %9s\n"
>  #define BODY_FMT "%5u  %8u  %8u  %4u  %4u  %8u  %11u  %11u  %11u  %11u  %11u  %11u  %11u  %9u\n"
> @@ -182,6 +304,23 @@ static int moderation_show(struct seq_file *p, void *v)
>  
>  	for_each_possible_cpu(j) {
>  		struct irq_mod_state *ms = per_cpu_ptr(&irq_mod_state, j);
> +		u64 age_ms = min((now - ms->last_ns) / NSEC_PER_MSEC, (u64)999999);

Not new in this patch: All of those accesses to remote CPU state are
racy and need to be annotated. This clearly never saw any testing with
KCSAN enabled. I'm happy to point to Documentation once again.

What's new is that this one is obviously broken on 32-bit because read
and write of a 64-bit value are split.

> +		if (age_ms < 10000) {
> +			/* Average irq_rate over recently active CPUs. */
> +			active_cpus++;
> +			irq_rate += ms->irq_rate;
> +		} else {
> +			ms->irq_rate = 0;
> +			ms->my_irq_rate = 0;
> +			ms->scaled_cpu_count = 64;
> +			ms->scaled_src_count = 64;
> +			ms->mod_ns = 0;
> +		}
> +
> +		irq_high += ms->irq_high;
> +		my_irq_high += ms->my_irq_high;
> +		hardirq_high += ms->hardirq_high;
>  
>  		seq_printf(p, BODY_FMT,
>  			   j, ms->irq_rate, ms->my_irq_rate,
> @@ -195,9 +334,34 @@ static int moderation_show(struct seq_file *p, void *v)
>  	seq_printf(p, "\n"
>  		   "enabled              %s\n"
>  		   "delay_us             %u\n"
> -		   "timer_rounds         %u\n",
> +		   "timer_rounds         %u\n"
> +		   "target_irq_rate      %u\n"
> +		   "hardirq_percent      %u\n"
> +		   "update_ms            %u\n"
> +		   "scale_cpus           %u\n"
> +		   "count_timer_calls    %s\n"
> +		   "count_msi_calls      %s\n"
> +		   "decay_factor         %u\n"
> +		   "grow_factor          %u\n",
>  		   str_yes_no(delay_us > 0),
> -		   delay_us, irq_mod_info.timer_rounds);
> +		   delay_us, irq_mod_info.timer_rounds,
> +		   irq_mod_info.target_irq_rate, irq_mod_info.hardirq_percent,
> +		   irq_mod_info.update_ms, irq_mod_info.scale_cpus,
> +		   str_yes_no(irq_mod_info.count_timer_calls),
> +		   str_yes_no(irq_mod_info.count_msi_calls),
> +		   irq_mod_info.decay_factor, irq_mod_info.grow_factor);
> +
> +	seq_printf(p,
> +		   "irq_rate             %lu\n"
> +		   "irq_high             %lu\n"
> +		   "my_irq_high          %lu\n"
> +		   "hardirq_percent_high %lu\n"
> +		   "total_interrupts     %lu\n"
> +		   "total_cpus           %lu\n",
> +		   active_cpus ? irq_rate / active_cpus : 0,
> +		   irq_high, my_irq_high, hardirq_high,
> +		   READ_ONCE(*((ulong *)&irq_mod_info.total_intrs)),
> +		   READ_ONCE(*((ulong *)&irq_mod_info.total_cpus)));

The more I look at this insane amount of telemetry the more I am
convinced that this is overly complex just for complexity sake.

Thanks,

        tglx


* Re: [PATCH v2 5/8] x86/irq: soft_moderation: add support for posted_msi (intel)
  2025-11-16 18:28 ` [PATCH v2 5/8] x86/irq: soft_moderation: add support for posted_msi (intel) Luigi Rizzo
@ 2025-11-17 21:14   ` Thomas Gleixner
  2025-11-17 21:36   ` kernel test robot
  1 sibling, 0 replies; 32+ messages in thread
From: Thomas Gleixner @ 2025-11-17 21:14 UTC (permalink / raw)
  To: Luigi Rizzo, Marc Zyngier, Luigi Rizzo, Paolo Abeni,
	Andrew Morton, Sean Christopherson, Jacob Pan
  Cc: linux-kernel, linux-arch, Bjorn Helgaas, Willem de Bruijn,
	Luigi Rizzo

On Sun, Nov 16 2025 at 18:28, Luigi Rizzo wrote:
> On recent Intel CPUs, kernels compiled with CONFIG_X86_POSTED_MSI=y,
> and the boot option "intremap=posted_msi", all MSI interrupts
> that hit a CPU issue a single POSTED_MSI interrupt processed by
> sysvec_posted_msi_notification() instead of having separate interrupts.
>
> This change adds soft moderation hooks to the above handler.

'This change' is not any better than 'This patch'.

> Soft moderation on posted_msi does not require per-source enable,
> irq_moderation.delay_us > 0 suffices.
>
> To test it, run a kernel with the above options and enable moderation by
> setting delay_us > 0.  The column "from_msi" in /proc/irq/soft_moderation
> will show a non-zero value.

Impressive. But it would be more impressive and useful to actually have
an explanation why this is required and what the benefits are.

>  CFLAGS_head32.o := -fno-stack-protector
>  CFLAGS_head64.o := -fno-stack-protector
> -CFLAGS_irq.o := -I $(src)/../include/asm/trace
> +CFLAGS_irq.o := -I $(src)/../include/asm/trace -I $(srctree)/kernel/irq

Not going to happen. Architecture code has no business to tap into
generic core infrastructure.

> --- a/kernel/irq/irq_moderation.c
> +++ b/kernel/irq/irq_moderation.c
> @@ -11,6 +11,13 @@
>  #include "internals.h"
>  #include "irq_moderation.h"

>  
> +#ifdef CONFIG_X86

Architecture specific muck has absolutely no place in generic code.

> +#include <asm/apic.h>
> +#include <asm/irq_remapping.h>
> +#else
> +static inline bool posted_msi_supported(void) { return false; }
> +#endif
> +
>  /*
>   * Platform-wide software interrupt moderation.
>   *
> @@ -32,6 +39,10 @@
>   *   moderation on that CPU/irq. If so, calls disable_irq_nosync() and starts
>   *   an hrtimer with appropriate delay.
>   *
> + * - Intel only: using "intremap=posted_msi", all the above is done in
> + *   sysvec_posted_msi_notification(). In this case all host device interrupts
> + *   are subject to moderation.
> + *
>   * - the timer callback calls enable_irq() for all disabled interrupts on that
>   *   CPU. That in turn will generate interrupts if there are pending events.
>   *
> @@ -230,6 +241,17 @@ static enum hrtimer_restart timer_cb(struct hrtimer *timer)
>  
>  	ms->rounds_left--;
>  
> +#ifdef CONFIG_X86_POSTED_MSI
> +	if (ms->kick_posted_msi) {
> +		if (ms->rounds_left == 0)
> +			ms->kick_posted_msi = false;
> +		/* Next call will be from timer, count it conditionally. */
> +		ms->dont_count = !irq_mod_info.count_timer_calls;

TBH, this is one of the worst hacks I've seen in a long time. It's
admittedly creative, but that's unmaintainable and the worst precedent
to open the flood gates for random architecture hacks in generic core
code.

If you want this to work for that posted MSI muck, then this needs
proper integration of those posted vectors into the interrupt core and a
sane mechanism to do the accounting.

> @@ -476,6 +500,18 @@ static ssize_t mode_write(struct file *f, const char __user *buf, size_t count,
>  	ret = kstrtobool(cmd, &enable);
>  	if (!ret)
>  		ret = set_moderation_mode(desc, enable);
> +	if (ret) {
> +		/* extra helpers for prodkernel */
> +		if (cmd[count - 1] == '\n')
> +			cmd[count - 1] = '\0';
> +		ret = 0;
> +		if (!strcmp(cmd, "managed"))
> +			irqd_set(&desc->irq_data, IRQD_AFFINITY_MANAGED);
> +		else if (!strcmp(cmd, "unmanaged"))
> +			irqd_clear(&desc->irq_data, IRQD_AFFINITY_MANAGED);

What the heck? Can you spare me the exposure to random google internal
garbage? My eyes are bleeding already enough.

Thanks,

        tglx


* Re: [PATCH v2 4/8] genirq: soft_moderation: implement adaptive moderation
  2025-11-17 20:51   ` Thomas Gleixner
@ 2025-11-17 21:34     ` Luigi Rizzo
  2025-11-17 23:05       ` Thomas Gleixner
  2025-11-18  9:00       ` Thomas Gleixner
  0 siblings, 2 replies; 32+ messages in thread
From: Luigi Rizzo @ 2025-11-17 21:34 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Marc Zyngier, Luigi Rizzo, Paolo Abeni, Andrew Morton,
	Sean Christopherson, Jacob Pan, linux-kernel, linux-arch,
	Bjorn Helgaas, Willem de Bruijn

On Mon, Nov 17, 2025 at 9:51 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Sun, Nov 16 2025 at 18:28, Luigi Rizzo wrote:
> > Add two control parameters (target_irq_rate and hardirq_percent)
> > to indicate the desired maximum values for these two metrics.
> >
> > Every update_ms the hook in handle_irq_event() recomputes the total and
> > local interrupt rate and the amount of time spent in hardirq, compares
> > the values with the targets, and adjusts the moderation delay up or down.
> >
> > The interrupt rate is computed in a scalable way by counting interrupts
> > per-CPU, and aggregating the value in a global variable only every
> > update_ms. Only CPUs that actively process interrupts are actually
> > accessing the shared variable, so the system scales well even on very
> > large servers.
>
> You still fail to explain why this is required and why a per CPU
> moderation is not sufficient and what the benefits of the approach are.

It was explained in the first patch of the series and in Documentation/.

(First world problem, for sure: I have examples for AMD, Intel, Arm,
all of them with 100+ CPUs per numa node, and 160-480 CPUs total)
On some of the above platforms, MSI-X interrupts cause heavy serialization
of all other PCIe requests. As a result, when the total interrupt rate exceeds
1-2M intrs/s, I/O throughput degrades by 4x or more.

To deal with this with per CPU moderation, without shared state, each CPU
cannot allow more than some 5Kintrs/s, which means fixed moderation
should be set at 200us, and adaptive moderation should jump to such
delays as soon as the local rate reaches 5K intrs/s.

In reality, it is very unlikely that all CPUs are actively handling such high
rates, so if we know better, we can adjust or remove moderation individually,
based on actual local and total interrupt rates and number of active CPUs.

The purpose of this global mechanism is to figure out whether we are
approaching a dangerous rate, and do individual tuning.

> > +     /* Compare with global and per-CPU targets. */
> > +     global_rate_high = irq_rate > target_rate;
> > +     my_rate_high = my_irq_rate * active_cpus * irq_mod_info.scale_cpus > target_rate * 100;
> > [...]
> > +     /* Moderate on this CPU only if both global and local rates are high. */
>
> Because it's desired that CPUs can be starved by interrupts when enough
> other CPUs only have a very low rate? I'm failing to understand that
> logic and the comprehensive rationale in the change log does not help either.

The comment could be worded better, s/moderate/bump delay up/

The mechanism aims to make total_rate < target by gently kicking
individual delays up or down based on the condition:

  total_rate > target && local_rate > target / active_cpus
      ? bump_up() : bump_down()

If the control is effective, the total rate will be within bounds and
nobody suffers, neither the CPUs handling <1K intr/s nor the lonely CPU
handling 100K+ intr/s.

If the rates suddenly go up, the CPUs with higher rates will be moderated
more, hopefully converging to a new equilibrium.
Like any control system, it has limits on what it can do.

> > [...]
> > +     if (target_rate > 0 && irqrate_high(ms, delta_time, target_rate, steps))
> > +             below_target = false;
> > +
> > +     if (hardirq_percent > 0 && hardirq_high(ms, delta_time, hardirq_percent))
> > +             below_target = false;
>
> So that rate limits a per CPU overload, but only when IRQTIME accounting
> is enabled. Oh well...

I can add checks to disallow setting the per-CPU overload when IRQTIME
accounting is not present.

>
> > +     /* Controller: adjust delay with exponential decay/grow. */
> > +     if (below_target) {
> > +             ms->mod_ns -= ms->mod_ns * steps / (steps + irq_mod_info.decay_factor);
> > +             if (ms->mod_ns < 100)
> > +                     ms->mod_ns = 0;
>
> These random numbers used to limit things are truly interresting and
> easy to understand - NOT.
>
> > +     } else {
> > +             /* Exponential grow does not restart if value is too small. */
> > +             if (ms->mod_ns < 500)
> > +                     ms->mod_ns = 500;
> > +             ms->mod_ns += ms->mod_ns * steps / (steps + irq_mod_info.grow_factor);
> > +             if (ms->mod_ns > ms->delay_ns)
> > +                     ms->mod_ns = ms->delay_ns;
> > +     }
>
> Why does this need separate grow and decay factors? Just because more
> knobs are better?

Like in TCP, brake aggressively (grow factor is smaller) to respond
quickly to overload,
and accelerate prudently (decay factor is higher) to avoid reacting
too optimistically.

thanks again for the attention
luigi


* Re: [PATCH v2 5/8] x86/irq: soft_moderation: add support for posted_msi (intel)
  2025-11-16 18:28 ` [PATCH v2 5/8] x86/irq: soft_moderation: add support for posted_msi (intel) Luigi Rizzo
  2025-11-17 21:14   ` Thomas Gleixner
@ 2025-11-17 21:36   ` kernel test robot
  1 sibling, 0 replies; 32+ messages in thread
From: kernel test robot @ 2025-11-17 21:36 UTC (permalink / raw)
  To: Luigi Rizzo, Thomas Gleixner, Marc Zyngier, Luigi Rizzo,
	Paolo Abeni, Andrew Morton, Sean Christopherson, Jacob Pan
  Cc: llvm, oe-kbuild-all, Linux Memory Management List, linux-kernel,
	linux-arch, Bjorn Helgaas, Willem de Bruijn

Hi Luigi,

kernel test robot noticed the following build errors:

[auto build test ERROR on tip/irq/core]
[also build test ERROR on tip/x86/core tip/master linus/master v6.18-rc6 next-20251117]
[cannot apply to tip/auto-latest]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Luigi-Rizzo/genirq-platform-wide-interrupt-moderation-Documentation-Kconfig-irq_desc/20251117-023148
base:   tip/irq/core
patch link:    https://lore.kernel.org/r/20251116182839.939139-6-lrizzo%40google.com
patch subject: [PATCH v2 5/8] x86/irq: soft_moderation: add support for posted_msi (intel)
config: x86_64-buildonly-randconfig-003-20251118 (https://download.01.org/0day-ci/archive/20251118/202511180439.x4dMf653-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251118/202511180439.x4dMf653-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202511180439.x4dMf653-lkp@intel.com/

All errors (new ones prefixed by >>):

>> kernel/irq/irq_moderation.c:369:6: error: call to undeclared function 'posted_msi_supported'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
     369 |                    posted_msi_supported() ? " (also on posted_msi)" : "",
         |                    ^
   1 error generated.


vim +/posted_msi_supported +369 kernel/irq/irq_moderation.c

   321	
   322		seq_printf(p, HEAD_FMT,
   323			   "# CPU", "irq/s", "my_irq/s", "cpus", "srcs", "delay_ns",
   324			   "irq_hi", "my_irq_hi", "hardirq_hi", "timer_set",
   325			   "disable_irq", "from_msi", "timer_calls", "stray_irq");
   326	
   327		for_each_possible_cpu(j) {
   328			struct irq_mod_state *ms = per_cpu_ptr(&irq_mod_state, j);
   329			u64 age_ms = min((now - ms->last_ns) / NSEC_PER_MSEC, (u64)999999);
   330	
   331			if (age_ms < 10000) {
   332				/* Average irq_rate over recently active CPUs. */
   333				active_cpus++;
   334				irq_rate += ms->irq_rate;
   335			} else {
   336				ms->irq_rate = 0;
   337				ms->my_irq_rate = 0;
   338				ms->scaled_cpu_count = 64;
   339				ms->scaled_src_count = 64;
   340				ms->mod_ns = 0;
   341			}
   342	
   343			irq_high += ms->irq_high;
   344			my_irq_high += ms->my_irq_high;
   345			hardirq_high += ms->hardirq_high;
   346	
   347			seq_printf(p, BODY_FMT,
   348				   j, ms->irq_rate, ms->my_irq_rate,
   349				   (ms->scaled_cpu_count + 128) / 256,
   350				   (ms->scaled_src_count + 128) / 256,
   351				   ms->mod_ns, ms->irq_high, ms->my_irq_high,
   352				   ms->hardirq_high, ms->timer_set, ms->disable_irq,
   353				   ms->from_posted_msi, ms->timer_calls, ms->stray_irq);
   354		}
   355	
   356		seq_printf(p, "\n"
   357			   "enabled              %s%s\n"
   358			   "delay_us             %u\n"
   359			   "timer_rounds         %u\n"
   360			   "target_irq_rate      %u\n"
   361			   "hardirq_percent      %u\n"
   362			   "update_ms            %u\n"
   363			   "scale_cpus           %u\n"
   364			   "count_timer_calls    %s\n"
   365			   "count_msi_calls      %s\n"
   366			   "decay_factor         %u\n"
   367			   "grow_factor          %u\n",
   368			   str_yes_no(delay_us > 0),
 > 369			   posted_msi_supported() ? " (also on posted_msi)" : "",
   370			   delay_us, irq_mod_info.timer_rounds,
   371			   irq_mod_info.target_irq_rate, irq_mod_info.hardirq_percent,
   372			   irq_mod_info.update_ms, irq_mod_info.scale_cpus,
   373			   str_yes_no(irq_mod_info.count_timer_calls),
   374			   str_yes_no(irq_mod_info.count_msi_calls),
   375			   irq_mod_info.decay_factor, irq_mod_info.grow_factor);
   376	
   377		seq_printf(p,
   378			   "irq_rate             %lu\n"
   379			   "irq_high             %lu\n"
   380			   "my_irq_high          %lu\n"
   381			   "hardirq_percent_high %lu\n"
   382			   "total_interrupts     %lu\n"
   383			   "total_cpus           %lu\n",
   384			   active_cpus ? irq_rate / active_cpus : 0,
   385			   irq_high, my_irq_high, hardirq_high,
   386			   READ_ONCE(*((ulong *)&irq_mod_info.total_intrs)),
   387			   READ_ONCE(*((ulong *)&irq_mod_info.total_cpus)));
   388	
   389		return 0;
   390	}
   391	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 2/8] genirq: soft_moderation: add base files, procfs
  2025-11-16 18:28 ` [PATCH v2 2/8] genirq: soft_moderation: add base files, procfs Luigi Rizzo
  2025-11-17 11:12   ` kernel test robot
  2025-11-17 16:01   ` Thomas Gleixner
@ 2025-11-17 21:56   ` kernel test robot
  2 siblings, 0 replies; 32+ messages in thread
From: kernel test robot @ 2025-11-17 21:56 UTC (permalink / raw)
  To: Luigi Rizzo, Thomas Gleixner, Marc Zyngier, Luigi Rizzo,
	Paolo Abeni, Andrew Morton, Sean Christopherson, Jacob Pan
  Cc: oe-kbuild-all, Linux Memory Management List, linux-kernel,
	linux-arch, Bjorn Helgaas, Willem de Bruijn

Hi Luigi,

kernel test robot noticed the following build errors:

[auto build test ERROR on tip/irq/core]
[also build test ERROR on tip/x86/core tip/master linus/master v6.18-rc6 next-20251117]
[cannot apply to tip/auto-latest]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Luigi-Rizzo/genirq-platform-wide-interrupt-moderation-Documentation-Kconfig-irq_desc/20251117-023148
base:   tip/irq/core
patch link:    https://lore.kernel.org/r/20251116182839.939139-3-lrizzo%40google.com
patch subject: [PATCH v2 2/8] genirq: soft_moderation: add base files, procfs
config: sh-allmodconfig (https://download.01.org/0day-ci/archive/20251118/202511180547.snSezdSu-lkp@intel.com/config)
compiler: sh4-linux-gcc (GCC) 15.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251118/202511180547.snSezdSu-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202511180547.snSezdSu-lkp@intel.com/

All errors (new ones prefixed by >>):

   kernel/irq/irq_moderation.c:103: warning: ignoring '#pragma clang diagnostic' [-Wunknown-pragmas]
     103 | #pragma clang diagnostic error "-Wformat"
   In file included from include/linux/uaccess.h:10,
                    from include/linux/sched/task.h:13,
                    from include/linux/sched/signal.h:9,
                    from include/linux/rcuwait.h:6,
                    from include/linux/percpu-rwsem.h:7,
                    from include/linux/fs.h:34,
                    from include/linux/seq_file.h:11,
                    from kernel/irq/irq_moderation.c:8:
   In function 'check_copy_size',
       inlined from 'copy_from_user' at include/linux/uaccess.h:207:7,
       inlined from 'moderation_write' at kernel/irq/irq_moderation.c:169:6:
>> include/linux/ucopysize.h:54:25: error: call to '__bad_copy_to' declared with attribute error: copy destination size is too small
      54 |                         __bad_copy_to();
         |                         ^~~~~~~~~~~~~~~
   In function 'check_copy_size',
       inlined from 'copy_from_user' at include/linux/uaccess.h:207:7,
       inlined from 'mode_write' at kernel/irq/irq_moderation.c:223:6:
>> include/linux/ucopysize.h:54:25: error: call to '__bad_copy_to' declared with attribute error: copy destination size is too small
      54 |                         __bad_copy_to();
         |                         ^~~~~~~~~~~~~~~


vim +/__bad_copy_to +54 include/linux/ucopysize.h

808aac63e2bdf9 Kees Cook 2025-02-28  43  
808aac63e2bdf9 Kees Cook 2025-02-28  44  static __always_inline __must_check bool
808aac63e2bdf9 Kees Cook 2025-02-28  45  check_copy_size(const void *addr, size_t bytes, bool is_source)
808aac63e2bdf9 Kees Cook 2025-02-28  46  {
808aac63e2bdf9 Kees Cook 2025-02-28  47  	int sz = __builtin_object_size(addr, 0);
808aac63e2bdf9 Kees Cook 2025-02-28  48  	if (unlikely(sz >= 0 && sz < bytes)) {
808aac63e2bdf9 Kees Cook 2025-02-28  49  		if (!__builtin_constant_p(bytes))
808aac63e2bdf9 Kees Cook 2025-02-28  50  			copy_overflow(sz, bytes);
808aac63e2bdf9 Kees Cook 2025-02-28  51  		else if (is_source)
808aac63e2bdf9 Kees Cook 2025-02-28  52  			__bad_copy_from();
808aac63e2bdf9 Kees Cook 2025-02-28  53  		else
808aac63e2bdf9 Kees Cook 2025-02-28 @54  			__bad_copy_to();
808aac63e2bdf9 Kees Cook 2025-02-28  55  		return false;
808aac63e2bdf9 Kees Cook 2025-02-28  56  	}
808aac63e2bdf9 Kees Cook 2025-02-28  57  	if (WARN_ON_ONCE(bytes > INT_MAX))
808aac63e2bdf9 Kees Cook 2025-02-28  58  		return false;
808aac63e2bdf9 Kees Cook 2025-02-28  59  	check_object_size(addr, bytes, is_source);
808aac63e2bdf9 Kees Cook 2025-02-28  60  	return true;
808aac63e2bdf9 Kees Cook 2025-02-28  61  }
808aac63e2bdf9 Kees Cook 2025-02-28  62  



* Re: [PATCH v2 4/8] genirq: soft_moderation: implement adaptive moderation
  2025-11-17 21:34     ` Luigi Rizzo
@ 2025-11-17 23:05       ` Thomas Gleixner
  2025-11-18  9:00       ` Thomas Gleixner
  1 sibling, 0 replies; 32+ messages in thread
From: Thomas Gleixner @ 2025-11-17 23:05 UTC (permalink / raw)
  To: Luigi Rizzo
  Cc: Marc Zyngier, Luigi Rizzo, Paolo Abeni, Andrew Morton,
	Sean Christopherson, Jacob Pan, linux-kernel, linux-arch,
	Bjorn Helgaas, Willem de Bruijn

On Mon, Nov 17 2025 at 22:34, Luigi Rizzo wrote:
> On Mon, Nov 17, 2025 at 9:51 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>> On Sun, Nov 16 2025 at 18:28, Luigi Rizzo wrote:
>> > Add two control parameters (target_irq_rate and hardirq_percent)
>> > to indicate the desired maximum values for these two metrics.
>> >
>> > Every update_ms the hook in handle_irq_event() recomputes the total and
>> > local interrupt rate and the amount of time spent in hardirq, compares
>> > the values with the targets, and adjusts the moderation delay up or down.
>> >
>> > The interrupt rate is computed in a scalable way by counting interrupts
>> > per-CPU, and aggregating the value in a global variable only every
>> > update_ms. Only CPUs that actively process interrupts are actually
>> > accessing the shared variable, so the system scales well even on very
>> > large servers.
>>
>> You still fail to explain why this is required and why a per CPU
>> moderation is not sufficient and what the benefits of the approach are.
>
> It was explained in the first patch of the series and in Documentation/.

Change logs of a patch have to be self contained whether you like it or
not. I'm not chasing random bits of information accross a series or
cover letter and once a patch is applied it becomes even harder to make
sense of it when the change log contains zero information.

> (First world problem, for sure: I have examples for AMD, Intel, Arm,
> all of them with 100+ CPUs per numa node, and 160-480 CPUs total)
> On some of the above platforms, MSI-X interrupts cause heavy serialization
> of all other PCIe requests. As a result, when the total interrupt rate exceeds
> 1-2M intrs/s, I/O throughput degrades by 4x or more.
> 
> To handle this with per-CPU moderation, without shared state, each CPU
> cannot allow more than some 5K intrs/s, which means fixed moderation
> should be set at 200us, and adaptive moderation should jump to such
> delays as soon as the local rate reaches 5K intrs/s.
>
> In reality, it is very unlikely that all CPUs are actively handling such high
> rates, so if we know better, we can adjust or remove moderation individually,
> based on actual local and total interrupt rates and number of active CPUs.
> 
> The purpose of this global mechanism is to figure out whether we are
> approaching a dangerous rate, and do individual tuning.

Makes sense, but I'm not convinced yet that this needs to be as complex
as it is.

>> > +     /* Compare with global and per-CPU targets. */
>> > +     global_rate_high = irq_rate > target_rate;
>> > +     my_rate_high = my_irq_rate * active_cpus * irq_mod_info.scale_cpus > target_rate * 100;
>> > [...]
>> > +     /* Moderate on this CPU only if both global and local rates are high. */
>>
>> Because it's desired that CPUs can be starved by interrupts when enough
>> other CPUs only have a very low rate? I'm failing to understand that
>> logic and the comprehensive rationale in the change log does not help either.
>
> The comment could be worded better, s/moderate/bump delay up/
>
> The mechanism aims to make total_rate < target, by gently kicking
> individual delays up or down based on the condition
>  total_rate > target && local_rate > target / active_cpus
>      ? bump_up() : bump_down()
>
> If the control is effective, the total rate will be within bounds and
> nobody suffers, neither the CPUs handling <1K intr/s, nor the lonely
> CPU handling 100K+ intr/s.
>
> If the rates suddenly go up, the CPUs with higher rates will be moderated
> more, hopefully converging to a new equilibrium.
> Like any control system, it has limits on what it can do.
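The up/down decision described in this exchange can be modeled as a tiny userspace predicate. This is a sketch under simplified assumptions: the real patch also folds in a scale_cpus percentage, and all names here are illustrative, not the patch's API:

```c
#include <stdbool.h>

/*
 * Sketch of the per-CPU decision: bump the moderation delay up only when
 * BOTH the global interrupt rate exceeds the target AND this CPU's local
 * rate exceeds its fair share (target / active_cpus). The comparison is
 * written multiplication-first to avoid integer-division rounding.
 */
static bool should_bump_up(unsigned long total_rate, unsigned long local_rate,
			   unsigned long target, unsigned int active_cpus)
{
	return total_rate > target && local_rate * active_cpus > target;
}
```

With target_irq_rate = 1M on 16 active CPUs, a CPU seeing 100K intr/s is bumped up only while the global rate is also above 1M intr/s.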

I understand that, but without proper information in code and change log
anyone exposed to this code 6 months down the road will bump his head on
the wall when staring at it (including you).

>> > [...]
>> > +     if (target_rate > 0 && irqrate_high(ms, delta_time, target_rate, steps))
>> > +             below_target = false;
>> > +
>> > +     if (hardirq_percent > 0 && hardirq_high(ms, delta_time, hardirq_percent))
>> > +             below_target = false;
>>
>> So that rate limits a per CPU overload, but only when IRQTIME accounting
>> is enabled. Oh well...
>
> I can add checks to disallow setting the per-CPU overload when IRQTIME
> accounting is not present.

That solves what? It disables the setting, but that does not make the
functionality any different. Also the compiler is smart enough to
eliminate all that code because the return value of hardirq_high() is
constant.

>> > +     } else {
>> > +             /* Exponential grow does not restart if value is too small. */
>> > +             if (ms->mod_ns < 500)
>> > +                     ms->mod_ns = 500;
>> > +             ms->mod_ns += ms->mod_ns * steps / (steps + irq_mod_info.grow_factor);
>> > +             if (ms->mod_ns > ms->delay_ns)
>> > +                     ms->mod_ns = ms->delay_ns;
>> > +     }
>>
>> Why does this need separate grow and decay factors? Just because more
>> knobs are better?
>
> Like in TCP: brake aggressively (grow_factor is smaller) to respond
> quickly to overload, and accelerate prudently (decay_factor is higher)
> to avoid reacting too optimistically.
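The grow branch quoted earlier in this message can be reproduced as a self-contained userspace sketch. The function name and parameters are mine; the 500 ns restart floor and the steps / (steps + grow_factor) increment follow the quoted snippet:

```c
/*
 * Sketch of the "brake aggressively" path: exponential growth of the
 * moderation delay restarts from a 500 ns floor and is clamped to the
 * configured maximum. A smaller grow_factor gives faster growth; the
 * decay path would use a larger decay_factor to back off more slowly.
 */
static unsigned long grow_delay(unsigned long mod_ns, unsigned long max_ns,
				unsigned long steps, unsigned long grow_factor)
{
	if (mod_ns < 500)	/* exponential growth cannot restart from ~0 */
		mod_ns = 500;
	mod_ns += mod_ns * steps / (steps + grow_factor);
	return mod_ns > max_ns ? max_ns : mod_ns;	/* clamp to max delay */
}
```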

Why do I have to ask for all this information piecewise?

Thanks,

        tglx


* Re: [PATCH v2 3/8] genirq: soft_moderation: implement fixed moderation
  2025-11-17 19:30   ` Thomas Gleixner
@ 2025-11-17 23:16     ` Thomas Gleixner
  2025-11-17 23:59       ` Luigi Rizzo
  0 siblings, 1 reply; 32+ messages in thread
From: Thomas Gleixner @ 2025-11-17 23:16 UTC (permalink / raw)
  To: Luigi Rizzo, Marc Zyngier, Luigi Rizzo, Paolo Abeni,
	Andrew Morton, Sean Christopherson, Jacob Pan
  Cc: linux-kernel, linux-arch, Bjorn Helgaas, Willem de Bruijn,
	Luigi Rizzo

On Mon, Nov 17 2025 at 20:30, Thomas Gleixner wrote:
> On Sun, Nov 16 2025 at 18:28, Luigi Rizzo wrote:
>> +	ms->rounds_left--;
>> +
>> +	if (ms->rounds_left > 0) {
>> +		/* Timer still alive, just call the handlers. */
>> +		list_for_each_entry_safe(desc, next, &ms->descs, mod.ms_node) {
>> +			ms->irq_count += irq_mod_info.count_timer_calls;

I missed this gem before. How is this supposed to calculate an interrupt
rate when count_timer_calls is disabled?

Yet another magic knob to tweak something which works by chance and not
by design.

TBH. This whole thing should be put into the 'ugly code museum' for
educational purposes and deterrence. It wants to be rewritten from
scratch with a proper design and a structured understandable approach.

Polishing the Google PoC hackery to death will go nowhere. It's just a
ginormous waste of time. Not that I care about the time you waste with
that, but I pretty much care about mine.

That said, start over from scratch and take the feedback into account so
you can address the substantial problems I pointed out (CPU hotplug,
concurrency, life time management, state consistency, affinity changes)
in the design and not after the fact.

First of all please find some other wording than moderation. That's just
a randomly diced word without real meaning. Pick something which
describes what this infrastructure actually does: Adaptive polling, no?

There are a couple of other fundamental questions to answer upfront:

   1) Is this throttle everything on a CPU the proper approach?

      To me this does not make sense. The CPU hogging network adapter or
      disk drive has no business to delay low frequency interrupts,
      which might be important, just because.

      Making this a per interrupt line property allows to solve a few
      other issues trivially like the integration into that posted MSI
      muck.

      It also reduces the amount of descriptors to be polled in the
      timer interrupt.

   2) Shouldn't the interrupt source be masked at the device level once
      an interrupt is switched into polling mode?

      Masking it at the device level (without touching disabled state)
      is definitely a sensible thing to do. It keeps state consistent
      and again allows trivial integration of that posted MSI stuff
      without insane hacks all over the place.

   3) Does a pure interrupt rate based scheme make sense?

      Definitely not in the way it's implemented. Why?

      Simply because once you switched to polling mode there is no real
      information anymore as you fail to take the return value of the
      handler into account. So unless your magic knob is 0 every polled
      interrupt is accounted for whether it actually handles an
      interrupt or not.

      But if your magic knob is 0 then this purely relies on irqtime
      accounting, which includes the timer interrupt as an accumulative
      measure.

      IOW, "works" by some definition of works after adding enough magic
      knobs to make it "work" under certain circumstances. "Works for
      Google" is not a good argument.

      That's unmaintainable and unusable. No amount of magic command
      line examples will fix that because the state space of your knobs
      is way too big to be useful and comprehensible.

Add all the questions which pop up when you really sit down and do a
proper from scratch design of this.

Thanks,

        tglx




* Re: [PATCH v2 3/8] genirq: soft_moderation: implement fixed moderation
  2025-11-17 23:16     ` Thomas Gleixner
@ 2025-11-17 23:59       ` Luigi Rizzo
  2025-11-18  8:34         ` Thomas Gleixner
  0 siblings, 1 reply; 32+ messages in thread
From: Luigi Rizzo @ 2025-11-17 23:59 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Marc Zyngier, Luigi Rizzo, Paolo Abeni, Andrew Morton,
	Sean Christopherson, Jacob Pan, linux-kernel, linux-arch,
	Bjorn Helgaas, Willem de Bruijn

On Tue, Nov 18, 2025 at 12:16 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Mon, Nov 17 2025 at 20:30, Thomas Gleixner wrote:
> > On Sun, Nov 16 2025 at 18:28, Luigi Rizzo wrote:
> >> +    ms->rounds_left--;
> >> +
> >> +    if (ms->rounds_left > 0) {
> >> +            /* Timer still alive, just call the handlers. */
> >> +            list_for_each_entry_safe(desc, next, &ms->descs, mod.ms_node) {
> >> +                    ms->irq_count += irq_mod_info.count_timer_calls;

> I missed this gem before. How is this supposed to calculate an interrupt
> rate when count_timer_calls is disabled?

FWIW the code is supposed to count the MSI interrupts,
which are the problematic ones.
count_timer_calls was a debugging knob, but understood that
it has no place in the upstream code.

> Polishing the Google PoC hackery to death will go nowhere. It's just a
> ginormous waste of time. Not that I care about the time you waste with
> that, but I pretty much care about mine.
>
> That said, start over from scratch and take the feedback into account so

point taken.

> First of all please find some other wording than moderation. That's just
> a randomly diced word without real meaning. Pick something which
> describes what this infrastructure actually does: Adaptive polling, no?

The core feature (with timer_rounds = 0)  is really pure moderation:
disable+start a timer after an interrupt, enable when the timer fires,
no extra calls.

It is only timer_rounds>0 (which as implemented is clearly broken,
and will be removed) that combines moderation + N rounds of timed polling.

> There are a couple of other fundamental questions to answer upfront:
>
>    1) Is this throttle everything on a CPU the proper approach?
>
>       To me this does not make sense. The CPU hogging network adapter or
>       disk drive has no business to delay low frequency interrupts,
>       which might be important, just because.

while there is some shared fate, a low frequency source (with interrupts
more than the adaptive_delay apart) on the same CPU as a high frequency
source will rarely, if ever, see any additional delay:
the first interrupt from a source is always served right away,
there is a high chance that the timer fires and the source
is re-enabled before the next interrupt from the low frequency source.

cheers
luigi


* Re: [PATCH v2 3/8] genirq: soft_moderation: implement fixed moderation
  2025-11-17 23:59       ` Luigi Rizzo
@ 2025-11-18  8:34         ` Thomas Gleixner
  2025-11-18 10:09           ` Luigi Rizzo
  0 siblings, 1 reply; 32+ messages in thread
From: Thomas Gleixner @ 2025-11-18  8:34 UTC (permalink / raw)
  To: Luigi Rizzo
  Cc: Marc Zyngier, Luigi Rizzo, Paolo Abeni, Andrew Morton,
	Sean Christopherson, Jacob Pan, linux-kernel, linux-arch,
	Bjorn Helgaas, Willem de Bruijn

On Tue, Nov 18 2025 at 00:59, Luigi Rizzo wrote:
> On Tue, Nov 18, 2025 at 12:16 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>> There are a couple of other fundamental questions to answer upfront:
>>
>>    1) Is this throttle everything on a CPU the proper approach?
>>
>>       To me this does not make sense. The CPU hogging network adapter or
>>       disk drive has no business to delay low frequency interrupts,
>>       which might be important, just because.
>
> while there is some shared fate, a low frequency source (with interrupts
> more than the adaptive_delay apart) on the same CPU as a high frequency
> source will rarely, if ever, see any additional delay:
> the first interrupt from a source is always served right away,
> there is a high chance that the timer fires and the source
> is re-enabled before the next interrupt from the low frequency source.

I understand that from a practical point of view it might not make a real
difference, but when you look at it conceptually, then the interrupt
which causes the system to slow down is the one you want to switch over
into polling mode. All others are harmless as they do not contribute to
the overall problem in a significant enough way.

As a side effect of that approach the posted MSI integration then mostly
falls into place especially when combined with immediate masking.
Immediate masking is not a problem at all because in reality the high
frequency interrupt will be masked immediately on the next event (a few
microseconds later) anyway.

Thanks,

        tglx



* Re: [PATCH v2 4/8] genirq: soft_moderation: implement adaptive moderation
  2025-11-17 21:34     ` Luigi Rizzo
  2025-11-17 23:05       ` Thomas Gleixner
@ 2025-11-18  9:00       ` Thomas Gleixner
  1 sibling, 0 replies; 32+ messages in thread
From: Thomas Gleixner @ 2025-11-18  9:00 UTC (permalink / raw)
  To: Luigi Rizzo
  Cc: Marc Zyngier, Luigi Rizzo, Paolo Abeni, Andrew Morton,
	Sean Christopherson, Jacob Pan, linux-kernel, linux-arch,
	Bjorn Helgaas, Willem de Bruijn

On Mon, Nov 17 2025 at 22:34, Luigi Rizzo wrote:
> (First world problem, for sure: I have examples for AMD, Intel, Arm,
> all of them with 100+ CPUs per numa node, and 160-480 CPUs total)
> On some of the above platforms, MSI-X interrupts cause heavy serialization
> of all other PCIe requests. As a result, when the total interrupt rate exceeds
> 1-2M intrs/s, I/O throughput degrades by 4x or more.

Is it actually the sum of all interrupts in the system or is it
segmented per root port?

Thanks,

        tglx


* Re: [PATCH v2 3/8] genirq: soft_moderation: implement fixed moderation
  2025-11-18  8:34         ` Thomas Gleixner
@ 2025-11-18 10:09           ` Luigi Rizzo
  2025-11-18 16:31             ` Thomas Gleixner
  0 siblings, 1 reply; 32+ messages in thread
From: Luigi Rizzo @ 2025-11-18 10:09 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Marc Zyngier, Luigi Rizzo, Paolo Abeni, Andrew Morton,
	Sean Christopherson, Jacob Pan, linux-kernel, linux-arch,
	Bjorn Helgaas, Willem de Bruijn

On Tue, Nov 18, 2025 at 9:34 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Tue, Nov 18 2025 at 00:59, Luigi Rizzo wrote:
> > On Tue, Nov 18, 2025 at 12:16 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> >> There are a couple of other fundamental questions to answer upfront:
> >>
> >>    1) Is this throttle everything on a CPU the proper approach?
> >>
> >>       To me this does not make sense. The CPU hogging network adapter or
> >>       disk drive has no business to delay low frequency interrupts,
> >>       which might be important, just because.
> >
> > while there is some shared fate, a low frequency source (with interrupts
> > more than the adaptive_delay apart) on the same CPU as a high frequency
> > source will rarely, if ever, see any additional delay:
> > the first interrupt from a source is always served right away,
> > there is a high chance that the timer fires and the source
> > is re-enabled before the next interrupt from the low frequency source.
>
> I understand that from a practical point of view it might not make a real
> difference, but when you look at it conceptually, then the interrupt
> which causes the system to slow down is the one you want to switch over
> into polling mode. All others are harmless as they do not contribute to
> the overall problem in a significant enough way.

(I appreciate the time you are dedicating to this thread)

Fully agree. The tradeoff is that the rate accounting state
(#interrupts in the last interval, a timestamp, mod_ns, sleep_ns)
now would have to go into the irqdesc, and the extra layer
of per-CPU aggregation is still needed to avoid hitting too often on
the shared state.

I also want to reiterate that "polling mode" is not the core contribution
of this patch series. There is limited polling only when timer_rounds>0,
which is not what I envision using, and it will go away because,
as you showed, it does not handle the teardown path correctly.

>  As a side effect of that approach the posted MSI integration then mostly
> falls into place especially when combined with immediate masking.
> Immediate masking is not a problem at all because in reality the high
> frequency interrupt will be masked immediately on the next event (a few
> microseconds later) anyway.

This again has pros and cons. The posted MSI feature
helps only when there are N>1  high rate sources
hitting the same CPU, and in that (real) case having to
mask N sources one by one, rather than just not rearming
the posted_msi interrupt, means an N-fold increase in
the interrupt rate for a given moderation delay.

Note that even under load, actual moderation delays
are typically as low as 20-40us, which are practically
unnoticeable by low rate devices (moderation does
not affect timers or system interrupts, and one
has always the option to move the latency sensitive,
low rate source to a different CPU where it would also
benefit from the jitter induced by the heavy hitters).

cheers
luigi


* Re: [PATCH v2 3/8] genirq: soft_moderation: implement fixed moderation
  2025-11-18 10:09           ` Luigi Rizzo
@ 2025-11-18 16:31             ` Thomas Gleixner
  2025-11-18 18:25               ` Luigi Rizzo
  0 siblings, 1 reply; 32+ messages in thread
From: Thomas Gleixner @ 2025-11-18 16:31 UTC (permalink / raw)
  To: Luigi Rizzo
  Cc: Marc Zyngier, Luigi Rizzo, Paolo Abeni, Andrew Morton,
	Sean Christopherson, linux-kernel, linux-arch, Bjorn Helgaas,
	Willem de Bruijn

On Tue, Nov 18 2025 at 11:09, Luigi Rizzo wrote:
> On Tue, Nov 18, 2025 at 9:34 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>> I understand that from a practical point of view it might not make a real
>> difference, but when you look at it conceptually, then the interrupt
>> which causes the system to slow down is the one you want to switch over
>> into polling mode. All others are harmless as they do not contribute to
>> the overall problem in a significant enough way.
>
> (I appreciate the time you are dedicating to this thread)

It's hopefully saving me time and nerves later :)

> Fully agree. The tradeoff is that the rate accounting state
> (#interrupts in the last interval, a timestamp, mod_ns, sleep_ns)
> now would have to go into the irqdesc, and the extra layer
> of per-CPU aggregation is still needed to avoid hitting too often on
> the shared state.
>
> I also want to reiterate that "polling mode" is not the core contribution
> of this patch series. There is limited polling only when timer_rounds>0,
> which is not what I envision using, and it will go away because,
> as you showed, it does not handle the teardown path correctly.

Ok.

>>  As a side effect of that approach the posted MSI integration then mostly
>> falls into place especially when combined with immediate masking.
>> Immediate masking is not a problem at all because in reality the high
>> frequency interrupt will be masked immediately on the next event (a few
>> microseconds later) anyway.
>
> This again has pros and cons. The posted MSI feature
> helps only when there are N>1  high rate sources
> hitting the same CPU, and in that (real) case having to
> mask N sources one by one, rather than just not rearming
> the posted_msi interrupt, means an N-fold increase in
> the interrupt rate for a given moderation delay.

That's kinda true for the per interrupt accounting, but if you look at
it from a per CPU accounting perspective then you still can handle them
individually and mask them when they arrive within or trigger a delay
window. That means:

   1) One of the PIR aggregated device interrupts triggers the delay

   2) If there are other bits active in that same posted handler
      invocation, then they will be flipped over too

   3) Devices which have not been covered by the initial switch, will
      either be not affected at all or can raise exactly _one_ interrupt
      which flips them over.

This scheme makes me way more comfortable because:

   1) The state is always consistent.

      If masked then the PIR rearm is not causing any harm because the
      device can't send an interrupt anymore.

      On unmask there is no way to screw up vs. rearming and eventually
      losing an interrupt.

      I haven't analyzed your scheme in detail, but the PIR mechanism is
      fickle and I have very little confidence that the way you
      implemented it is correct under all circumstances.

   2) It requires exactly zero lines of x86-intelism code.

   3) It keeps the mechanism exactly the same for PIR and non-PIR mode
      as you need the 'mask' on mode transition anyway to solve the
      disable/enable_irq() state inconsistency.

Please focus on making this simple and correct in the first place.

If there is still a really compelling reason to treat PIR differently,
then this can be done as a separate step once the initial dust has
settled.

Thanks,

        tglx


* Re: [PATCH v2 3/8] genirq: soft_moderation: implement fixed moderation
  2025-11-18 16:31             ` Thomas Gleixner
@ 2025-11-18 18:25               ` Luigi Rizzo
  2025-11-18 23:06                 ` Luigi Rizzo
  0 siblings, 1 reply; 32+ messages in thread
From: Luigi Rizzo @ 2025-11-18 18:25 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Marc Zyngier, Luigi Rizzo, Paolo Abeni, Andrew Morton,
	Sean Christopherson, linux-kernel, linux-arch, Bjorn Helgaas,
	Willem de Bruijn

On Tue, Nov 18, 2025 at 5:31 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Tue, Nov 18 2025 at 11:09, Luigi Rizzo wrote:
> > ...
> > (I appreciate the time you are dedicating to this thread)
>
> It's hopefully saving me time and nerves later :)
>
> > ...
> That's kinda true for the per interrupt accounting, but if you look at
> it from a per CPU accounting perspective then you still can handle them
> individually and mask them when they arrive within or trigger a delay
> window. That means:

ok, so to sum up, what are the correct methods to do the masking
you suggest, should I use mask_irq()/unmask_irq()?

thanks
luigi


* Re: [PATCH v2 3/8] genirq: soft_moderation: implement fixed moderation
  2025-11-18 18:25               ` Luigi Rizzo
@ 2025-11-18 23:06                 ` Luigi Rizzo
  2025-11-19 14:43                   ` Thomas Gleixner
  0 siblings, 1 reply; 32+ messages in thread
From: Luigi Rizzo @ 2025-11-18 23:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Marc Zyngier, Luigi Rizzo, Paolo Abeni, Andrew Morton,
	Sean Christopherson, linux-kernel, linux-arch, Bjorn Helgaas,
	Willem de Bruijn

On Tue, Nov 18, 2025 at 7:25 PM Luigi Rizzo <lrizzo@google.com> wrote:
>
> On Tue, Nov 18, 2025 at 5:31 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > On Tue, Nov 18 2025 at 11:09, Luigi Rizzo wrote:
> > > ...
> > > (I appreciate the time you are dedicating to this thread)
> >
> > It's hopefully saving me time and nerves later :)
> >
> > > ...
> > That's kinda true for the per interrupt accounting, but if you look at
> > it from a per CPU accounting perspective then you still can handle them
> > individually and mask them when they arrive within or trigger a delay
> > window. That means:
>
> ok, so to sum up, what are the correct methods to do the masking
> you suggest, should I use mask_irq()/unmask_irq()?

I tried the following instead of disable_irq()/enable_irq().
This seems to work also for posted_msi with no arch-specific code, as
you suggested.

The drain_desc_list() function below should be usable directly with
hotplug remove callbacks.

I still need to figure out if there is anything special needed when
an interrupt is removed, or the system goes into suspend,
while irq_desc's are in the moderation list.

/* run this to mask and moderate an interrupt */
void irq_mod_mask(struct irq_desc *desc)
{
        scoped_irqdesc_get_and_buslock(desc->irq_data.irq,
                                       IRQ_GET_DESC_CHECK_GLOBAL) {
                struct irq_mod_state *ms = this_cpu_ptr(&irq_mod_state);

                /* Add to list of moderated desc on this CPU */
                list_add(&desc->mod.ms_node, &ms->descs);
                mask_irq(scoped_irqdesc);
                ms->mask_irq++;
                if (!hrtimer_is_queued(&ms->timer))
                        irq_moderation_start_timer(ms);
        }
}

/* run this on the timer callback */
static int drain_desc_list(struct irq_mod_state *ms)
{
        struct irq_desc *desc, *next;
        uint srcs = 0;

        /* Last round, remove from list and unmask_irq(). */
        list_for_each_entry_safe(desc, next, &ms->descs, mod.ms_node) {
                list_del_init(&desc->mod.ms_node);
                srcs++;
                ms->unmask_irq++;
                scoped_irqdesc_get_and_buslock(desc->irq_data.irq,
                                               IRQ_GET_DESC_CHECK_GLOBAL) {
                        unmask_irq(scoped_irqdesc);
                }
        }
        return srcs;
}

cheers
luigi

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 3/8] genirq: soft_moderation: implement fixed moderation
  2025-11-18 23:06                 ` Luigi Rizzo
@ 2025-11-19 14:43                   ` Thomas Gleixner
  2025-11-21 10:58                     ` Luigi Rizzo
  0 siblings, 1 reply; 32+ messages in thread
From: Thomas Gleixner @ 2025-11-19 14:43 UTC (permalink / raw)
  To: Luigi Rizzo
  Cc: Marc Zyngier, Luigi Rizzo, Paolo Abeni, Andrew Morton,
	Sean Christopherson, linux-kernel, linux-arch, Bjorn Helgaas,
	Willem de Bruijn

On Wed, Nov 19 2025 at 00:06, Luigi Rizzo wrote:
> On Tue, Nov 18, 2025 at 7:25 PM Luigi Rizzo <lrizzo@google.com> wrote:
>> > That's kinda true for the per interrupt accounting, but if you look at
>> > it from a per CPU accounting perspective then you still can handle them
>> > individually and mask them when they arrive within or trigger a delay
>> > window. That means:
>>
>> ok, so to sum up, what are the correct methods to do the masking
>> you suggest, should i use mask_irq()/unmask_irq() ?
>
> I tried the following instead of disable_irq()/enable_irq().
> This seems to work also for posted_msi with no arch-specific code, as
> you suggested.

:)

> The drain_desc_list() function below should be usable directly with
> hotplug remove callbacks.

Emphasis on *should*.

> I still need to figure out if there is anything special needed when
> an interrupt is removed, or the system goes into suspend,
> while irq_desc's are in the moderation list.

Yes, quite a few, and you forgot to mention affinity changes, interrupt
disable/enable semantics, synchronization, state consistency ...

> /* run this to mask and moderate an interrupt */
> void irq_mod_mask(struct irq_desc *desc)
> {
>         scoped_irqdesc_get_and_buslock(desc->irq_data.irq,
>                                        IRQ_GET_DESC_CHECK_GLOBAL) {

Seriously? Why do you want to look up the descriptor again when you
already have the pointer? Copying random code together does not qualify
as programming.

>                 struct irq_mod_state *ms = this_cpu_ptr(&irq_mod_state);
>
>                 /* Add to list of moderated desc on this CPU */
>                 list_add(&desc->mod.ms_node, &ms->descs);
>                 mask_irq(scoped_irqdesc);
>                 ms->mask_irq++;
>                 if (!hrtimer_is_queued(&ms->timer))
>                         irq_moderation_start_timer(ms);
>         }

Assuming you still invoke this from handle_irq_event(), which I think is
the wrong place, then I really have to ask how that is supposed to work
correctly especially with handle_fasteoi_irq(), which is used on
ARM64. Why?

handle_fasteoi_irq()
        ....
        handle_irq_event() {
             ...
             moderation()
                mask();
        }
        
        cond_unmask_eoi_irq()
             unmask();

You need state to prevent that. There is a similar, but more subtle
issue in handle_edge_irq(), which is used on x86.

You got away with this so far due to the disable_irq() abuse and with
your unmask PoC you didn't notice the rare issue on x86, but with
handle_fasteoi_irq() this clearly can't ever work.

> /* run this on the timer callback */
> static int drain_desc_list(struct irq_mod_state *ms)
> {
>         struct irq_desc *desc, *next;
>         uint srcs = 0;
>
>         /* Last round, remove from list and unmask_irq(). */
>         list_for_each_entry_safe(desc, next, &ms->descs, mod.ms_node) {
>                 list_del_init(&desc->mod.ms_node);
>                 srcs++;
>                 ms->unmask_irq++;
>                 scoped_irqdesc_get_and_buslock(desc->irq_data.irq,
>                                                IRQ_GET_DESC_CHECK_GLOBAL) {
>                         unmask_irq(scoped_irqdesc);

That's broken vs. state changes which occurred while the descriptor was
queued. This cannot unconditionally unmask for bloody obvious reasons.

There is also the opposite: Interrupt is moderated, masked, queued and
some management code unmasks. Is that correct while it is still in the
list and marked moderated? It probably is, but only if done right and
clearly documented.

While talking about mask/unmask. You need to ensure that the interrupt
is actually maskable to be eligible for moderation. MSI does not
guarantee that, which is not an issue on ARM64 as that can mask at the
CPU level, but it is an issue on x86 which does not provide such a
mechanism.

You need to understand the intricacies of interrupt handling and
interrupt management first instead of jumping again to "works for me and
should be usable" conclusions.

Now that you know that the mask approach "works", you can stop this
"hack it into submission" frenzy and take a step back and really study
the overall problem space to come up with a proper design and
integration for this.

What you really want is to define the semantics and requirements of this
moderation mechanism versus (no particular order):

    1) The existing semantics of handle_edge_irq() and
       handle_fasteoi_irq(), which need to be guaranteed.

       That's one of the most crucial points because edge type
       interrupts have fire and forget semantics on the device side and
       there is no guarantee that the device will send another one in
       time. People including me wasted tons of time to decode such
       problems in the past.

    2) The existing semantics of disable/enable_irq(), irq_shutdown(),
       synchronize_irq(), which need to be preserved.

    3) Teardown of an interrupt, which also relates to shutdown and
       synchronizing.

    4) Interrupts coming in while in moderated state and "masked". That
       can be spurious interrupts or happen due to unmasking by
       management code, e.g. enable_irq().

    5) CPU hotunplug. Define the exact point where the list has to be
       cleaned out and the timer canceled and it's guaranteed that
       there can't be interrupts added back to the moderation list.

    6) Affinity changes. Are there any implications when the descriptor
       is enqueued on the old target and waiting for the timer to unmask
       it? That assumes that CPU hotunplug related affinity changes are
       not affected because those interrupts are guaranteed not to be
       moderated.

    7) Suspend/resume. That needs some deep analysis of the potential
       state space.

    8) Eligibility beyond edge, single target, etc. That might need
       actual opt-in support from the interrupt chip hierarchy as
       otherwise this is going to end up in a continuous stream of
       undecodable bug reports, which others have to deal with :(
       See my remark vs. unmaskable MSI above.

Once you have figured all of that out, it becomes pretty clear how the
implementation has to look, and you can start building it up piecewise
by doing the necessary refactoring and preparatory changes first, so
that the actual functionality falls into place at the end.

If you start the other way round by glueing the functionality somewhere
into interrupt handling first and then trying to address the resulting
fallout, you'll get nowhere. Trust me that a lot of smart people have
failed that way.

That said, I'm happy to answer _informed_ questions about semantics,
rules and requirements if they are not obvious from the code.

Good luck!

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 3/8] genirq: soft_moderation: implement fixed moderation
  2025-11-19 14:43                   ` Thomas Gleixner
@ 2025-11-21 10:58                     ` Luigi Rizzo
  2025-11-21 14:33                       ` Luigi Rizzo
  2025-11-22 14:08                       ` Thomas Gleixner
  0 siblings, 2 replies; 32+ messages in thread
From: Luigi Rizzo @ 2025-11-21 10:58 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Marc Zyngier, Luigi Rizzo, Paolo Abeni, Andrew Morton,
	Sean Christopherson, linux-kernel, linux-arch, Bjorn Helgaas,
	Willem de Bruijn

On Wed, Nov 19, 2025 at 3:43 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
...
>
> What you really want is to define the semantics and requirements of this
> moderation mechanism versus (no particular order):
>
>     1) The existing semantics of handle_edge_irq() and
>        handle_fasteoi_irq(), which need to be guaranteed.
>
>        That's one of the most crucial points because edge type
>        interrupts have fire and forget semantics on the device side and
>        there is no guarantee that the device will send another one in
>        time. People including me wasted tons of time to decode such
>        problems in the past.
>
>     2) The existing semantics of disable/enable_irq(), irq_shutdown(),
>        synchronize_irq(), which need to be preserved.
>
>     3) Teardown of an interrupt, which also relates to shutdown and
>        synchronizing.
>
>     4) Interrupts coming in while in moderated state and "masked". That
>        can be spurious interrupts or happen due to unmasking by
>        management code, e.g. enable_irq().
>
>     5) CPU hotunplug. Define the exact point where the list has to be
>        cleaned out and the timer canceled and it's guaranteed that
>        there can't be interrupts added back to the moderation list.
>
>     6) Affinity changes. Are there any implications when the descriptor
>        is enqueued on the old target and waiting for the timer to unmask
>        it? That assumes that CPU hotunplug related affinity changes are
>        not affected because those interrupts are guaranteed not to be
>        moderated.
>
>     7) Suspend/resume. That needs some deep analysis of the potential
>        state space.
>
>     8) Eligibility beyond edge, single target, etc. That might need
>        actual opt-in support from the interrupt chip hierarchy as
>        otherwise this is going to end up in a continuous stream of
>        undecodable bug reports, which others have to deal with :(
>        See my remark vs. unmaskable MSI above.
>
> Once you have figured all of that out, it becomes pretty clear how the
> implementation has to look, and you can start building it up piecewise
> by doing the necessary refactoring and preparatory changes first, so
> that the actual functionality falls into place at the end.
>
> If you start the other way round by glueing the functionality somewhere
> into interrupt handling first and then trying to address the resulting
> fallout, you'll get nowhere. Trust me that a lot of smart people have
> failed that way.
>
> That said, I'm happy to answer _informed_ questions about semantics,
> rules and requirements if they are not obvious from the code.

Addressing your bullets above, here is the design;
if you have time, let me know what you think.

BASIC MECHANISM
The basic control mechanism changes interrupt state as follows:

- for moderated interrupts, call __disable_irq(desc) and put the desc
  in a per-CPU list with IRQD_IRQ_INPROGRESS set.
  Set the _IRQ_DISABLE_UNLAZY status flag so disable will also
  mask the interrupt (it works even without that, but experiments
  with greedy NVME devices show clear benefits).

  NOTE: we need to patch irq_can_handle_pm() so it will return false
  on irqdesc that are in a moderation list, otherwise we have
  a deadlock when an interrupt (generated before the mask) comes in.

- when the moderation timer fires, remove the irqdesc from the list,
  clear IRQD_IRQ_INPROGRESS and call __enable_irq(desc).

RATIONALE:
- why disable/enable instead of mask/unmask:
  disable/enable already supports calls from different layers
  (infrastructure and interrupt handlers), blocks stray interrupts,
  and reissues pending events on enable. If I were to use mask/unmask
  directly, I would have to replicate most of the disable logic and risk
  introducing subtle bugs.

  Setting _IRQ_DISABLE_UNLAZY makes the mask immediate, which based on
  tests (greedy NVME devices) seems beneficial.

- why IRQD_IRQ_INPROGRESS instead of using another flag:
  IRQD_IRQ_INPROGRESS seems the way to implement synchronization with
  infrastructure events (shutdown etc.). We can argue that when an
  irqdesc is in the timer list it is not "inprogress" and could be
  preempted (with huge pain, as it could be on a different CPU),
  but for practical purposes most of the actions that block on INPROGRESS
  should also wait for the irqdesc to come off the timer list.

- what about spurious interrupts:
  there is no way to block a device from sending them one after another,
  so using the existing protection (don't call the handler if disabled)
  seems the easiest way to deal with them.

HANDLING OF VARIOUS EVENTS:

INTERRUPT TEARDOWN
  free_irq() uses synchronize_irq() before proceeding, which waits until
  IRQD_IRQ_INPROGRESS is clear. This guarantees that a moderated interrupt
  will not be destroyed while it is on the timer list.

INTERRUPT MIGRATION
  Migration changes the affinity mask, but it takes effect only when
  trying to run the handler. That too is prevented by IRQD_IRQ_INPROGRESS
  being set, so the new CPU will be used only after the interrupt exits
  the timer list.

HOTPLUG
  During shutdown the system moves timers and reassigns interrupt affinity
  to a new CPU. The easiest way to guarantee that pending events are
  handled correctly is to use a per-CPU "moderation_on" flag managed as
  follows by hotplug callbacks on CPUHP_ONLINE (or nearby)

  - on setup, set the flag. Only now interrupts can be moderated.
  - on shutdown, with interrupts disabled, clear the flag, thus preventing
    more interrupts from being moderated on that CPU; process the list of moderated
    interrupts (as if the timer had fired), and cancel the timer.

  This avoids depending on a specific state: no moderation before the CPU
  is fully online, and cleanup happens before any teardown can interfere
  with moderation timers.


SUSPEND
  The hotplug callbacks are also invoked during deep suspend, so that
  makes it safe.

BOOT PROCESSOR
  Hotplug callbacks are not invoked for the boot processor. However the
  boot processor is the last one to go, and since there is no other place
  to run the timer callbacks, they will be run where they are supposed to.

---
cheers
luigi

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 3/8] genirq: soft_moderation: implement fixed moderation
  2025-11-21 10:58                     ` Luigi Rizzo
@ 2025-11-21 14:33                       ` Luigi Rizzo
  2025-11-22 14:08                       ` Thomas Gleixner
  1 sibling, 0 replies; 32+ messages in thread
From: Luigi Rizzo @ 2025-11-21 14:33 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Marc Zyngier, Luigi Rizzo, Paolo Abeni, Andrew Morton,
	Sean Christopherson, linux-kernel, linux-arch, Bjorn Helgaas,
	Willem de Bruijn

On Fri, Nov 21, 2025 at 11:58 AM Luigi Rizzo <lrizzo@google.com> wrote:
>
> On Wed, Nov 19, 2025 at 3:43 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> ...
> >
> > ...
> > That said, I'm happy to answer _informed_ questions about semantics,
> > rules and requirements if they are not obvious from the code.

Some follow-up on the cost of mask_irq(), which is rarely called without
moderation but is hit hard with it:

For msix devices, mask_irq() eventually calls the following function

static inline void pci_msix_mask(struct msi_desc *desc)
{
        desc->pci.msix_ctrl |= PCI_MSIX_ENTRY_CTRL_MASKBIT;
        pci_msix_write_vector_ctrl(desc, desc->pci.msix_ctrl); // eventually, writel()
        /* Flush write to device */
        readl(desc->pci.mask_base);
}

The final readl() is very expensive (serialized after the write, and takes
a full round trip through a busy PCIe bus). Its real purpose is to make
sure the function returns only after the hardware has seen the mask.

Except for a few critical places (setup/shutdown, others?) it does not
seem necessary to wait, because the irq layer knows how to drop stray
interrupts.

Here is the effect of removing the readl() where it is not critical,
during an experiment with a busy NVMe workload (~1M intr/s total).
The average goes from 1800ns down to 50ns, and the long tails almost
completely disappear.

Distribution of mask_irq() times in nanoseconds, collected with kstats
(https://lwn.net/Articles/813303/) over a 5s period under heavy workload:

F=/sys/kernel/debug/kstats/mask_irq
echo reset > $F; sleep 5; grep CPUS $F | awk '{t+=$6*$8;n+=$6; print}
END { print "avg ", t/n, t*1e-9}'

-------------------------------------------------------
ORIGINAL CODE

count      945  avg      680  p 0.000195  n      945
count     1863  avg      742  p 0.000580  n     2808
count     2941  avg      805  p 0.001188  n     5749
count     6058  avg      866  p 0.002441  n    11807
count    10090  avg      930  p 0.004527  n    21897
count    17273  avg      994  p 0.008099  n    39170
count    88252  avg     1099  p 0.026348  n   127422
count   227984  avg     1223  p 0.073490  n   355406
count   405964  avg     1348  p 0.157434  n   761370
count   530397  avg     1474  p 0.267109  n  1291767
count   606284  avg     1601  p 0.392475  n  1898051
count   658027  avg     1728  p 0.528541  n  2556078
count   643644  avg     1854  p 0.661632  n  3199722
count   515443  avg     1980  p 0.768215  n  3715165
count   546976  avg     2156  p 0.881317  n  4262141
count   243037  avg     2415  p 0.931572  n  4505178
count   125700  avg     2674  p 0.957564  n  4630878
count    73707  avg     2934  p 0.972805  n  4704585
count    48124  avg     3190  p 0.982756  n  4752709
count    30397  avg     3446  p 0.989041  n  4783106
count    19108  avg     3700  p 0.992993  n  4802214
count    11228  avg     3956  p 0.995314  n  4813442
count    11484  avg     4316  p 0.997689  n  4824926
count     5447  avg     4835  p 0.998815  n  4830373
count     2699  avg     5341  p 0.999373  n  4833072
count     1343  avg     5855  p 0.999651  n  4834415
count      668  avg     6369  p 0.999789  n  4835083
count      382  avg     6880  p 0.999868  n  4835465
count      207  avg     7406  p 0.999911  n  4835672
count      149  avg     7918  p 0.999942  n  4835821
count      159  avg     8671  p 0.999975  n  4835980
count       61  avg     9574  p 0.999987  n  4836041
count       22  avg    10758  p 0.999992  n  4836063
count       15  avg    11925  p 0.999995  n  4836078
count        9  avg    12750  p 0.999997  n  4836087
count        6  avg    13739  p 0.999998  n  4836093
count        2  avg    14737  p 0.999998  n  4836095
count        1  avg    15699  p 0.999999  n  4836096
count        3  avg    17951  p 0.999999  n  4836099
count        1  avg    18680  p 1.000000  n  4836100
avg  1833.11 8.86512

-------------------------------------------------------
AFTER REMOVING READL()

count       45  avg       13  p 0.000009  n       45
count     5792  avg       14  p 0.001204  n     5837
count    36628  avg       15  p 0.008761  n    42465
count   108293  avg       17  p 0.031103  n   150758
count   529586  avg       18  p 0.140365  n   680344
count   352816  avg       21  p 0.213156  n  1033160
count  1082719  avg       22  p 0.436538  n  2115879
count   480375  avg       24  p 0.535646  n  2596254
count   628931  avg       26  p 0.665404  n  3225185
count   275178  avg       28  p 0.722178  n  3500363
count   158508  avg       30  p 0.754880  n  3658871
count   100066  avg       33  p 0.775525  n  3758937
count    42338  avg       38  p 0.784260  n  3801275
count    75427  avg       41  p 0.799822  n  3876702
count    43099  avg       45  p 0.808714  n  3919801
count    30181  avg       49  p 0.814941  n  3949982
count     3757  avg       53  p 0.815716  n  3953739
count     2104  avg       57  p 0.816150  n  3955843
count     3520  avg       62  p 0.816876  n  3959363
count    28199  avg       69  p 0.822694  n  3987562
count    72163  avg       76  p 0.837583  n  4059725
count   109582  avg       83  p 0.860191  n  4169307
count    35839  avg       90  p 0.867585  n  4205146
count     6622  avg      100  p 0.868951  n  4211768
count    19393  avg      109  p 0.872952  n  4231161
count    51844  avg      116  p 0.883649  n  4283005
count    94042  avg      122  p 0.903051  n  4377047
count    28129  avg      133  p 0.908854  n  4405176
count    91010  avg      151  p 0.927631  n  4496186
count    73043  avg      164  p 0.942701  n  4569229
count    20705  avg      185  p 0.946973  n  4589934
count    59477  avg      199  p 0.959244  n  4649411
count    10842  avg      212  p 0.961481  n  4660253
count    16374  avg      233  p 0.964859  n  4676627
count    35394  avg      242  p 0.972161  n  4712021
count    36254  avg      277  p 0.979641  n  4748275
count    16982  avg      304  p 0.983145  n  4765257
count    20411  avg      327  p 0.987356  n  4785668
count    14352  avg      362  p 0.990317  n  4800020
count    10931  avg      399  p 0.992572  n  4810951
count     8384  avg      436  p 0.994302  n  4819335
count     3184  avg      467  p 0.994959  n  4822519
count     4923  avg      487  p 0.995974  n  4827442
count     7341  avg      538  p 0.997489  n  4834783
count     2957  avg      606  p 0.998099  n  4837740
count     3688  avg      666  p 0.998860  n  4841428
count     1807  avg      732  p 0.999233  n  4843235
count      813  avg      800  p 0.999400  n  4844048
count      699  avg      856  p 0.999545  n  4844747
count      293  avg      927  p 0.999605  n  4845040
count      203  avg      985  p 0.999647  n  4845243
count      390  avg     1086  p 0.999727  n  4845633
count      273  avg     1213  p 0.999784  n  4845906
count      162  avg     1336  p 0.999817  n  4846068
count       89  avg     1465  p 0.999835  n  4846157
count       50  avg     1600  p 0.999846  n  4846207
count       33  avg     1727  p 0.999853  n  4846240
count       26  avg     1850  p 0.999858  n  4846266
count       38  avg     1993  p 0.999866  n  4846304
count       89  avg     2185  p 0.999884  n  4846393
count      132  avg     2444  p 0.999911  n  4846525
count      134  avg     2680  p 0.999939  n  4846659
count       92  avg     2932  p 0.999958  n  4846751
count       65  avg     3182  p 0.999971  n  4846816
count       42  avg     3448  p 0.999980  n  4846858
count       29  avg     3713  p 0.999986  n  4846887
count       11  avg     3934  p 0.999988  n  4846898
count       18  avg     4305  p 0.999992  n  4846916
count       10  avg     4784  p 0.999994  n  4846926
count        7  avg     5264  p 0.999996  n  4846933
count        5  avg     5957  p 0.999997  n  4846938
count        1  avg     6603  p 0.999997  n  4846939
count        1  avg     7098  p 0.999997  n  4846940
count        3  avg     7433  p 0.999998  n  4846943
count        1  avg     7892  p 0.999998  n  4846944
count        2  avg     8955  p 0.999998  n  4846946
count        3  avg     9424  p 0.999999  n  4846949
count        1  avg    11194  p 0.999999  n  4846950
count        2  avg    11619  p 1.000000  n  4846952
avg  51.3519 0.2489

-------------------------------------------------------

cheers
Luigi

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 3/8] genirq: soft_moderation: implement fixed moderation
  2025-11-21 10:58                     ` Luigi Rizzo
  2025-11-21 14:33                       ` Luigi Rizzo
@ 2025-11-22 14:08                       ` Thomas Gleixner
  1 sibling, 0 replies; 32+ messages in thread
From: Thomas Gleixner @ 2025-11-22 14:08 UTC (permalink / raw)
  To: Luigi Rizzo
  Cc: Marc Zyngier, Luigi Rizzo, Paolo Abeni, Andrew Morton,
	Sean Christopherson, linux-kernel, linux-arch, Bjorn Helgaas,
	Willem de Bruijn

On Fri, Nov 21 2025 at 11:58, Luigi Rizzo wrote:
> On Wed, Nov 19, 2025 at 3:43 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> Addressing your bullets above here is the design,
> if you have time let me know what you think,
>
> BASIC MECHANISM
> The basic control mechanism changes interrupt state as follows:
>
> - for moderated interrupts, call __disable_irq(desc) and put the desc
>   in a per-CPU list with IRQD_IRQ_INPROGRESS set.
>   Set the _IRQ_DISABLE_UNLAZY status flag so disable will also
>   mask the interrupt (it works even without that, but experiments
>   with greedy NVME devices show clear benefits).

What sets that flag? If you want to fiddle with it in this moderation
muck behind the scene, no.

With this new scheme where the moderation handling never invokes the
handler after the time is over, you can't actually mask upfront
unconditionally unless there is a clear indication from the underlying
interrupt chain that this works correctly with edge type interrupts.

It only works when it is guaranteed that the interrupt chip which
actually provides the masking functionality will latch an interrupt
raised in the device and then raise it down the chain when unmasked.

PCIe mandates that behaviour for MSI-X and maskable MSI, but it's not
guaranteed for other device types. And out of painful experience it's
not even true for some broken PCIe devices.

>   NOTE: we need to patch irq_can_handle_pm() so it will return false
>   on irqdesc that are in a moderation list, otherwise we have
>   a deadlock when an interrupt (generated before the mask) comes in.

I told you that you need state :)

> - when the moderation timer fires, remove the irqdesc from the list,
>   clear IRQD_IRQ_INPROGRESS and call __enable_irq(desc).
>
> RATIONALE:
> - why disable/enable instead of mask/unmask:
>   disable/enable already supports call from different layers
>   (infrastructure and interrupt handlers), blocks stray interrupts,
>   and reissues pending events on enable. If I were to use mask/unmask
>   directly, I would have to replicate most of the disable logic and risk
>   introducing subtle bugs.

It's not that much to replicate, but fair enough as long as it happens
under the lock when entering the handler. 

>   Setting _IRQ_DISABLE_UNLAZY makes the mask immediate, which based on
>   tests (greedy NVME devices) seems beneficial.

See above. What you really want is:

    1) An indicator in the irq chip which tells that the above
       requirement of raising a pending interrupt on unmask is
       "guaranteed" :)

    2) A dedicated variant of __disable_irq() which invokes
       __irq_disable() with the following twist

       disable_moderated_irq(desc)
          force = !irqd_irq_masked(..) && irq_moderate_unlazy(desc[chip]);
          if (!desc->disable++ || force)
          	irq_disable_moderated(desc, force)
                    force |= irq_settings_disable_unlazy(desc);
                    __irq_disable(desc, force);

That requires to have a new MODERATE_UNLAZY flag, which can be set by
infrastructure, e.g. the PCI/MSI core as that knows whether the
underlying device MSI implementation provides masking. That should be
probably a feature in the interrupt chip.

That needs a command line option to prevent MODERATE_UNLAZY from being
effective.

That preserves the lazy masking behaviour for non-moderated situations,
which is not only a correctness measure (see above) but preferred even
for correctly working hardware because it avoids going out to the
hardware (e.g. writing to PCI config space twice) in many cases.

Though all of this wants to be a separate step _after_ the initial
functionality has been worked out and stabilized. It's really an
optimization on top of the base functionality and we are not going to
optimize prematurely.

> - why IRQD_IRQ_INPROGRESS instead of using another flag:
>   IRQD_IRQ_INPROGRESS seems the way to implement synchronization with
>   infrastructure events (shutdown etc.). We can argue that when an
>   irqdesc is in the timer list it is not "inprogress" and could be
>   preempted (with huge pain, as it could be on a different CPU),
>   but for practical purposes most of the actions that block on INPROGRESS
>   should also wait for the irqdesc to come off the timer list.

As long as those functions are not polling the bit with interrupts
disabled that's a correct assumption.

> - what about spurious interrupts:
>   there is no way to block a device from sending them one after another,
>   so using the existing protection (don't call the handler if disabled)
>   seems the easiest way to deal with them.

Correct.

> HANDLING OF VARIOUS EVENTS:
>
> INTERRUPT TEARDOWN
>   free_irq() uses synchronize_irq() before proceeding, which waits until
>   IRQD_IRQ_INPROGRESS is clear. This guarantees that a moderated interrupt
>   will not be destroyed while it is on the timer list.

That requires a new variant for __enable_irq() because free_irq() shuts
the interrupt down and __enable_irq() would just start it up again...

> INTERRUPT MIGRATION
>   Migration changes the affinity mask, but it takes effect only when
>   trying to run the handler. That too is prevented by IRQD_IRQ_INPROGRESS
>   being set, so the new CPU will be used only after the interrupt exits
>   the timer list.

How is affinity setting related to trying to run the handler? It steers
the interrupt to a new target CPU so that the next interrupt raised in
the device ends up at the new target.

So if that happens (and the interrupt is not masked) then the handler on
the new target CPU will poll for $DELAY until the other side fires the
timer and clears the bit. That's insane and needs to be prevented by
checking moderated state.

> HOTPLUG
>   During shutdown the system moves timers and reassigns interrupt affinity
>   to a new CPU. The easiest way to guarantee that pending events are
>   handled correctly is to use a per-CPU "moderation_on" flag managed as
>   follows by hotplug callbacks on CPUHP_ONLINE (or nearby)

That can be anywhere in the state machine between ONLINE and ACTIVE.

>   - on setup, set the flag. Only now interrupts can be moderated.
>   - on shutdown, with interrupts disabled, clear the flag, thus preventing
>     more interrupts from being moderated on that CPU; process the list of moderated
>     interrupts (as if the timer had fired), and cancel the timer.

Ok.

> SUSPEND
>   The hotplug callbacks are also invoked during deep suspend so that
>   makes it safe.
>
> BOOT PROCESSOR
>   Hotplug callbacks are not invoked for the boot processor. However the
>   boot processor is the last one to go, and since there is no other place
>   to run the timer callbacks, they will be run where they are supposed to.

That's only true for hibernation, aka. suspend to disk, but not for
suspend to RAM/IDLE.

You want to treat all of these scenarios the same and just make sure
that the moderation muck is disabled and nothing hangs on a list
especially not with the INPROGRESS bit set.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2025-11-22 14:08 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-11-16 18:28 [PATCH v2 0/8] platform wide software interrupt moderation Luigi Rizzo
2025-11-16 18:28 ` [PATCH v2 1/8] genirq: platform wide interrupt moderation: Documentation, Kconfig, irq_desc Luigi Rizzo
2025-11-16 18:28 ` [PATCH v2 2/8] genirq: soft_moderation: add base files, procfs Luigi Rizzo
2025-11-17 11:12   ` kernel test robot
2025-11-17 16:01   ` Thomas Gleixner
2025-11-17 16:16     ` Luigi Rizzo
2025-11-17 19:35       ` Thomas Gleixner
2025-11-17 21:56   ` kernel test robot
2025-11-16 18:28 ` [PATCH v2 3/8] genirq: soft_moderation: implement fixed moderation Luigi Rizzo
2025-11-17 19:30   ` Thomas Gleixner
2025-11-17 23:16     ` Thomas Gleixner
2025-11-17 23:59       ` Luigi Rizzo
2025-11-18  8:34         ` Thomas Gleixner
2025-11-18 10:09           ` Luigi Rizzo
2025-11-18 16:31             ` Thomas Gleixner
2025-11-18 18:25               ` Luigi Rizzo
2025-11-18 23:06                 ` Luigi Rizzo
2025-11-19 14:43                   ` Thomas Gleixner
2025-11-21 10:58                     ` Luigi Rizzo
2025-11-21 14:33                       ` Luigi Rizzo
2025-11-22 14:08                       ` Thomas Gleixner
2025-11-16 18:28 ` [PATCH v2 4/8] genirq: soft_moderation: implement adaptive moderation Luigi Rizzo
2025-11-17 20:51   ` Thomas Gleixner
2025-11-17 21:34     ` Luigi Rizzo
2025-11-17 23:05       ` Thomas Gleixner
2025-11-18  9:00       ` Thomas Gleixner
2025-11-16 18:28 ` [PATCH v2 5/8] x86/irq: soft_moderation: add support for posted_msi (intel) Luigi Rizzo
2025-11-17 21:14   ` Thomas Gleixner
2025-11-17 21:36   ` kernel test robot
2025-11-16 18:28 ` [PATCH v2 6/8] genirq: soft_moderation: helpers for per-driver defaults Luigi Rizzo
2025-11-16 18:28 ` [PATCH v2 7/8] nvme-pci: add module parameter for default moderation mode Luigi Rizzo
2025-11-16 18:28 ` [PATCH v2 8/8] vfio-pci: " Luigi Rizzo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).