public inbox for linux-arch@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH v4 0/3] Global Software Interrupt Moderation (GSIM)
@ 2026-01-15 15:59 Luigi Rizzo
  2026-01-15 15:59 ` [PATCH v4 1/3] genirq: Add flags for software interrupt moderation Luigi Rizzo
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Luigi Rizzo @ 2026-01-15 15:59 UTC (permalink / raw)
  To: Thomas Gleixner, Marc Zyngier, Luigi Rizzo, Paolo Abeni,
	Andrew Morton, Sean Christopherson, Jacob Pan
  Cc: linux-kernel, linux-arch, Bjorn Helgaas, Willem de Bruijn,
	Luigi Rizzo

Global Software Interrupt Moderation (GSIM) addresses a limitation of
platforms from many vendors whose I/O performance drops significantly
when the total MSI-X interrupt rate becomes too high (e.g. 1-3M intr/s,
depending on the platform).

Conventional interrupt moderation, typically implemented in hardware
by NICs or storage devices, operates separately on each source (e.g. a
completion queue). Large servers can have hundreds of sources, and
without knowledge of global activity, keeping the total rate bounded
would require moderation delays of 100-200us, and adaptive moderation
would have to reach those delays at rates as low as 10K intr/s per
source. Such values are unacceptable for RPC or transactional workloads.

To address this problem, GSIM efficiently measures the total and
per-CPU interrupt rates, so that individual moderation delays can be
dynamically adjusted based on the actual global and local load. This
way, delays are normally zero or very small, except during actual
local/global overload.

As an additional benefit, GSIM also monitors the percentage of time
spent by each CPU in hardirq, and can use moderation to reserve some
time for other, lower priority, tasks.

Configuration is easy and robust. System administrators specify the
maximum targets (moderation delay, interrupt rate, and percentage of
time spent in hardirq) and which interrupt sources should be moderated
(per interrupt, per device, or globally). Independent per-CPU control
loops adjust the actual delays to try to keep the metrics within the
targets.

The system is adaptive, and moderation affects only latency, not
throughput, and only in high-load scenarios. Hence, targets need not
precisely match the platform limits, and one can make conservative,
robust choices. Values like delay_us=100, target_irq_rate=1000000,
hardirq_percent=70 are a very good starting point.

GSIM does not rely on any special hardware feature.

Parameters are configured at runtime via procfs

    echo ${VALUE} > /proc/irq/soft_moderation/${NAME}

and statistics are exported via /proc/irq/soft_moderation/stats.

Moderation on individual interrupts can be turned on/off at runtime with

    echo 1 > /proc/irq/NN/soft_moderation  # use 0 to disable
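As a usage illustration (this helper is not part of the series),
moderation can be enabled on every MSI/MSI-X vector of one PCI device
by walking sysfs. It assumes the per-IRQ control file added by this
series and the standard sysfs layout; SYS and PROC can be overridden,
e.g. to dry-run against a test tree:

```shell
# Hypothetical helper, not part of this series: enable moderation on
# every MSI/MSI-X vector of one PCI device by walking sysfs. Assumes
# the per-IRQ control file added by this series and the standard sysfs
# layout; SYS and PROC can be overridden to run against a test tree.
: "${SYS:=/sys}" "${PROC:=/proc}"

gsim_enable_dev() {	# $1 = PCI BDF, e.g. 0000:3b:00.0
	for irq in "$SYS"/bus/pci/devices/"$1"/msi_irqs/*; do
		[ -e "$irq" ] || continue	# device has no MSI vectors
		echo 1 > "$PROC/irq/${irq##*/}/soft_moderation"
	done
}
```

A conservative global delay can then be set once, e.g.
`echo 100 > /proc/irq/soft_moderation/delay_us`.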

PERFORMANCE BENEFITS:
Below are some experimental results under high load comparing conventional
moderation with GSIM:

- 100Gbps NIC, 32 queues: rx goes from 50 Gbps to 92.8 Gbps (line rate).
- 200Gbps NIC, 10 VMs (total 160 queues): rx goes from 30 Gbps to 190 Gbps (line rate).
- 12 SSD, 96 queues: 4K random read goes from 6M to 20.5M IOPS (device max).

In all cases, with adaptive moderation, latency up to p95 is unaffected
at low/moderate load, even when compared with no moderation at all.

Changes in v4:
- added irqdesc and irqdata flags as suggested by maintainer
- parameters are only configured via independent procfs entries.
  No module parameters anymore.
- merged control and interrupt functions back into a single header/C file pair
- applied various annotations (lockdep, data_race())
- formatting and various renaming as suggested by maintainer.
- added performance measurements with adaptive moderation.
- removed the mechanism to conditionally enable moderation at interrupt
  creation. This can be done in userspace and suitable udev extensions
  will be handled separately.

Changes in v3:
- clearly documented architecture in kernel/irq/irq_moderation.c
  including how to handle enable/disable/mask, interrupt migration,
  hotplug and suspend.
- split the implementation into 4 files, irq_moderation.[ch] and
  irq_moderation_hook.[ch], for better separation of the control plane
  and the "dataplane" (functions run on each interrupt)
- limited scope to handle_edge_irq() and handle_fasteoi_irq() which
  have been tested on actual hardware.
- tested on Intel (also with intremap=posted_msi), AMD, ARM, with NIC,
  nvme, vfio

Changes in v2:
- many style fixes (mostly on comments) based on reviewers' comments on v1
- removed background from Documentation/core-api/irq/irq-moderation.rst
- split procfs handlers
- moved internal details to kernel/irq/irq_moderation.h
- use CPU hotplug for per-CPU setup, removed unnecessary arch-specific changes
- select suitable irqs based on !irqd_is_level_type(irqd) && irqd_is_single_target(irqd)
- use a static_key to enable/disable the feature

There are two open comments from v1 for which I would like the
maintainers' clarification:

- handle_irq_event() calls irq_moderation_hook() after releasing the lock,
  so it can call disable_irq_nosync(). It may be possible to move the
  call before releasing the lock and use __disable_irq(). I am not sure
  if there is any benefit in making the change.

- the timer callback calls handle_irq_event_percpu(desc) on moderated
  irqdesc (which have irqd_irq_disabled(irqd) == 1) without changing
  IRQD_IRQ_INPROGRESS. I am not sure if this should be protected with
  the following, and especially where it would make a difference
  (specifically because the desc is disabled during this sequence).

     raw_spin_lock(&desc->lock);
     irqd_set(&desc->irq_data, IRQD_IRQ_INPROGRESS);
     raw_spin_unlock(&desc->lock);

     handle_irq_event_percpu(desc); // <--

     raw_spin_lock(&desc->lock);
     irqd_clear(&desc->irq_data, IRQD_IRQ_INPROGRESS);
     raw_spin_unlock(&desc->lock);

Luigi Rizzo (3):
  genirq: Add flags for software interrupt moderation.
  genirq: Fixed-delay Global Software Interrupt Moderation (GSIM)
  genirq: Adaptive Global Software Interrupt Moderation (GSIM).

 include/linux/irq.h         |   6 +-
 include/linux/irqdesc.h     |  12 +
 kernel/irq/Kconfig          |  11 +
 kernel/irq/Makefile         |   1 +
 kernel/irq/chip.c           |  15 +
 kernel/irq/internals.h      |  10 +
 kernel/irq/irq_moderation.c | 693 ++++++++++++++++++++++++++++++++++++
 kernel/irq/irq_moderation.h | 161 +++++++++
 kernel/irq/irqdesc.c        |   1 +
 kernel/irq/proc.c           |   2 +
 kernel/irq/settings.h       |   7 +
 11 files changed, 918 insertions(+), 1 deletion(-)
 create mode 100644 kernel/irq/irq_moderation.c
 create mode 100644 kernel/irq/irq_moderation.h

-- 
2.52.0.457.g6b5491de43-goog


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v4 1/3] genirq: Add flags for software interrupt moderation.
  2026-01-15 15:59 [PATCH v4 0/3] Global Software Interrupt Moderation (GSIM) Luigi Rizzo
@ 2026-01-15 15:59 ` Luigi Rizzo
  2026-01-15 15:59 ` [PATCH v4 2/3] genirq: Fixed-delay Global Software Interrupt Moderation (GSIM) Luigi Rizzo
  2026-01-15 15:59 ` [PATCH v4 3/3] genirq: Adaptive " Luigi Rizzo
  2 siblings, 0 replies; 11+ messages in thread
From: Luigi Rizzo @ 2026-01-15 15:59 UTC (permalink / raw)
  To: Thomas Gleixner, Marc Zyngier, Luigi Rizzo, Paolo Abeni,
	Andrew Morton, Sean Christopherson, Jacob Pan
  Cc: linux-kernel, linux-arch, Bjorn Helgaas, Willem de Bruijn,
	Luigi Rizzo

Add two flags to support software interrupt moderation:

- IRQ_SW_MODERATION is an irqdesc flag indicating that an interrupt supports
  moderation. This is a feature that can be set by the system
  administrator.
- IRQD_IRQ_MODERATED is an internal irqdata flag indicating that the
  interrupt is currently being moderated. This is a state flag.

Signed-off-by: Luigi Rizzo <lrizzo@google.com>
---
 include/linux/irq.h   | 6 +++++-
 kernel/irq/settings.h | 7 +++++++
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/include/linux/irq.h b/include/linux/irq.h
index 4a9f1d7b08c39..df653e10a83bf 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -72,6 +72,7 @@ enum irqchip_irq_state;
  * IRQ_DISABLE_UNLAZY		- Disable lazy irq disable
  * IRQ_HIDDEN			- Don't show up in /proc/interrupts
  * IRQ_NO_DEBUG			- Exclude from note_interrupt() debugging
+ * IRQ_SW_MODERATION		- Can do software interrupt moderation
  */
 enum {
 	IRQ_TYPE_NONE		= 0x00000000,
@@ -99,13 +100,14 @@ enum {
 	IRQ_DISABLE_UNLAZY	= (1 << 19),
 	IRQ_HIDDEN		= (1 << 20),
 	IRQ_NO_DEBUG		= (1 << 21),
+	IRQ_SW_MODERATION	= (1 << 22),
 };
 
 #define IRQF_MODIFY_MASK	\
 	(IRQ_TYPE_SENSE_MASK | IRQ_NOPROBE | IRQ_NOREQUEST | \
 	 IRQ_NOAUTOEN | IRQ_LEVEL | IRQ_NO_BALANCING | \
 	 IRQ_PER_CPU | IRQ_NESTED_THREAD | IRQ_NOTHREAD | IRQ_PER_CPU_DEVID | \
-	 IRQ_IS_POLLED | IRQ_DISABLE_UNLAZY | IRQ_HIDDEN)
+	 IRQ_IS_POLLED | IRQ_DISABLE_UNLAZY | IRQ_HIDDEN | IRQ_SW_MODERATION)
 
 #define IRQ_NO_BALANCING_MASK	(IRQ_PER_CPU | IRQ_NO_BALANCING)
 
@@ -219,6 +221,7 @@ struct irq_data {
  *				  irqchip have flag IRQCHIP_ENABLE_WAKEUP_ON_SUSPEND set.
  * IRQD_RESEND_WHEN_IN_PROGRESS	- Interrupt may fire when already in progress in which
  *				  case it must be resent at the next available opportunity.
+ * IRQD_IRQ_MODERATED		- Interrupt is currently moderated.
  */
 enum {
 	IRQD_TRIGGER_MASK		= 0xf,
@@ -244,6 +247,7 @@ enum {
 	IRQD_AFFINITY_ON_ACTIVATE	= BIT(28),
 	IRQD_IRQ_ENABLED_ON_SUSPEND	= BIT(29),
 	IRQD_RESEND_WHEN_IN_PROGRESS    = BIT(30),
+	IRQD_IRQ_MODERATED		= BIT(31),
 };
 
 #define __irqd_to_state(d) ACCESS_PRIVATE((d)->common, state_use_accessors)
diff --git a/kernel/irq/settings.h b/kernel/irq/settings.h
index 00b3bd127692c..bc8ade4726322 100644
--- a/kernel/irq/settings.h
+++ b/kernel/irq/settings.h
@@ -18,6 +18,7 @@ enum {
 	_IRQ_DISABLE_UNLAZY	= IRQ_DISABLE_UNLAZY,
 	_IRQ_HIDDEN		= IRQ_HIDDEN,
 	_IRQ_NO_DEBUG		= IRQ_NO_DEBUG,
+	_IRQ_SW_MODERATION	= IRQ_SW_MODERATION,
 	_IRQF_MODIFY_MASK	= IRQF_MODIFY_MASK,
 };
 
@@ -34,6 +35,7 @@ enum {
 #define IRQ_DISABLE_UNLAZY	GOT_YOU_MORON
 #define IRQ_HIDDEN		GOT_YOU_MORON
 #define IRQ_NO_DEBUG		GOT_YOU_MORON
+#define IRQ_SW_MODERATION	GOT_YOU_MORON
 #undef IRQF_MODIFY_MASK
 #define IRQF_MODIFY_MASK	GOT_YOU_MORON
 
@@ -180,3 +182,8 @@ static inline bool irq_settings_no_debug(struct irq_desc *desc)
 {
 	return desc->status_use_accessors & _IRQ_NO_DEBUG;
 }
+
+static inline bool irq_settings_moderation_allowed(struct irq_desc *desc)
+{
+	return desc->status_use_accessors & _IRQ_SW_MODERATION;
+}
-- 
2.52.0.457.g6b5491de43-goog


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v4 2/3] genirq: Fixed-delay Global Software Interrupt Moderation (GSIM)
  2026-01-15 15:59 [PATCH v4 0/3] Global Software Interrupt Moderation (GSIM) Luigi Rizzo
  2026-01-15 15:59 ` [PATCH v4 1/3] genirq: Add flags for software interrupt moderation Luigi Rizzo
@ 2026-01-15 15:59 ` Luigi Rizzo
  2026-01-15 20:52   ` kernel test robot
                     ` (2 more replies)
  2026-01-15 15:59 ` [PATCH v4 3/3] genirq: Adaptive " Luigi Rizzo
  2 siblings, 3 replies; 11+ messages in thread
From: Luigi Rizzo @ 2026-01-15 15:59 UTC (permalink / raw)
  To: Thomas Gleixner, Marc Zyngier, Luigi Rizzo, Paolo Abeni,
	Andrew Morton, Sean Christopherson, Jacob Pan
  Cc: linux-kernel, linux-arch, Bjorn Helgaas, Willem de Bruijn,
	Luigi Rizzo

Some platforms show reduced I/O performance when the total device
interrupt rate across the entire platform becomes too high.

Interrupt moderation can be used to rate-limit individual interrupts,
and as a consequence also the total rate. Not all devices implement
moderation in hardware, and those that do may offer only impractically
coarse granularity (e.g. NVMe specifies 100us increments).

GSIM implements interrupt moderation in software, allowing control of
interrupt rates even without hardware support. It also provides a
building block for more sophisticated, adaptive mechanisms.

Moderation is enabled/disabled per interrupt source with

  echo 1 > /proc/irq/NN/soft_moderation # use 0 to disable

For interrupts with moderation enabled, the delay is fixed and equal
for all. It is configured via procfs (range 0-500us, 0 means disabled):

  echo XX > /proc/irq/soft_moderation/delay_us

Per-CPU statistics on how often moderation is used are available via

  cat /proc/irq/soft_moderation/stats

GSIM is limited to non-level interrupts using handle_edge_irq() or
handle_fasteoi_irq(). It has been tested on Intel (including with
intremap=posted_msi), AMD and ARM CPUs, with NIC, NVMe and VFIO devices.

PERFORMANCE BENEFITS:
Below are some experimental results under high load (the first number is
without GSIM; the second is with delay_us=100)

- 100Gbps NIC, 32 queues: rx goes from 50 Gbps to 92.8 Gbps (line rate).
- 200Gbps NIC, 10 VMs (total 160 queues): rx goes from 30 Gbps to 190 Gbps (line rate).
- 8 SSD, 64 queues: 4K random read goes from 6M to 13.4M IOPS (device max).
- 12 SSD, 96 queues: 4K random read goes from 6M to 20.5M IOPS (device max).

Signed-off-by: Luigi Rizzo <lrizzo@google.com>
---
 include/linux/irqdesc.h     |  12 +
 kernel/irq/Kconfig          |  11 +
 kernel/irq/Makefile         |   1 +
 kernel/irq/chip.c           |  15 ++
 kernel/irq/internals.h      |  10 +
 kernel/irq/irq_moderation.c | 444 ++++++++++++++++++++++++++++++++++++
 kernel/irq/irq_moderation.h | 111 +++++++++
 kernel/irq/irqdesc.c        |   1 +
 kernel/irq/proc.c           |   2 +
 9 files changed, 607 insertions(+)
 create mode 100644 kernel/irq/irq_moderation.c
 create mode 100644 kernel/irq/irq_moderation.h

diff --git a/include/linux/irqdesc.h b/include/linux/irqdesc.h
index 17902861de76d..8b2edce25a4d8 100644
--- a/include/linux/irqdesc.h
+++ b/include/linux/irqdesc.h
@@ -17,6 +17,16 @@ struct irq_desc;
 struct irq_domain;
 struct pt_regs;
 
+/**
+ * struct irq_desc_mod - interrupt moderation information
+ * @ms_node:		per-CPU list of moderated irq_desc
+ */
+struct irq_desc_mod {
+#ifdef CONFIG_IRQ_SW_MODERATION
+	struct list_head	ms_node;
+#endif
+};
+
 /**
  * struct irqstat - interrupt statistics
  * @cnt:	real-time interrupt count
@@ -46,6 +56,7 @@ struct irqstat {
  * @threads_handled:	stats field for deferred spurious detection of threaded handlers
  * @threads_handled_last: comparator field for deferred spurious detection of threaded handlers
  * @lock:		locking for SMP
+ * @mod_state:		moderation state
  * @affinity_hint:	hint to user space for preferred irq affinity
  * @affinity_notify:	context for notification of affinity changes
  * @pending_mask:	pending rebalanced interrupts
@@ -81,6 +92,7 @@ struct irq_desc {
 	atomic_t		threads_handled;
 	int			threads_handled_last;
 	raw_spinlock_t		lock;
+	struct irq_desc_mod	mod_state;
 	struct cpumask		*percpu_enabled;
 #ifdef CONFIG_SMP
 	const struct cpumask	*affinity_hint;
diff --git a/kernel/irq/Kconfig b/kernel/irq/Kconfig
index 1b4254d19a73e..cb104d1cabd0e 100644
--- a/kernel/irq/Kconfig
+++ b/kernel/irq/Kconfig
@@ -155,6 +155,17 @@ config IRQ_KUNIT_TEST
 
 endmenu
 
+config IRQ_SW_MODERATION
+	bool "Enable Global Software Interrupt Moderation"
+	depends on PROC_FS
+	help
+	  Enable Global Software Interrupt Moderation.
+	  Uses a local timer to delay interrupts in configurable ways
+	  and depending on various global system load indicators
+	  and targets.
+
+	  If you don't know what to do here, say N.
+
 config GENERIC_IRQ_MULTI_HANDLER
 	bool
 	help
diff --git a/kernel/irq/Makefile b/kernel/irq/Makefile
index 6ab3a40556670..5bd1fb464ace6 100644
--- a/kernel/irq/Makefile
+++ b/kernel/irq/Makefile
@@ -9,6 +9,7 @@ obj-$(CONFIG_GENERIC_IRQ_CHIP) += generic-chip.o
 obj-$(CONFIG_GENERIC_IRQ_PROBE) += autoprobe.o
 obj-$(CONFIG_IRQ_DOMAIN) += irqdomain.o
 obj-$(CONFIG_IRQ_SIM) += irq_sim.o
+obj-$(CONFIG_IRQ_SW_MODERATION) += irq_moderation.o
 obj-$(CONFIG_PROC_FS) += proc.o
 obj-$(CONFIG_GENERIC_PENDING_IRQ) += migration.o
 obj-$(CONFIG_GENERIC_IRQ_MIGRATION) += cpuhotplug.o
diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
index 678f094d261a7..6ea5bb672b6ca 100644
--- a/kernel/irq/chip.c
+++ b/kernel/irq/chip.c
@@ -18,6 +18,7 @@
 #include <trace/events/irq.h>
 
 #include "internals.h"
+#include "irq_moderation.h"
 
 static irqreturn_t bad_chained_irq(int irq, void *dev_id)
 {
@@ -486,6 +487,10 @@ static bool irq_can_handle_pm(struct irq_desc *desc)
 	if (!irqd_has_set(irqd, IRQD_IRQ_INPROGRESS | IRQD_WAKEUP_ARMED))
 		return true;
 
+	/* Moderated ones (also have IRQD_IRQ_INPROGRESS) need early return. */
+	if (irqd_has_set(&desc->irq_data, IRQD_IRQ_MODERATED))
+		return false;
+
 	/*
 	 * If the interrupt is an armed wakeup source, mark it pending
 	 * and suspended, disable it and notify the pm core about the
@@ -745,6 +750,10 @@ void handle_fasteoi_irq(struct irq_desc *desc)
 	 * handling the previous one - it may need to be resent.
 	 */
 	if (!irq_can_handle_pm(desc)) {
+		if (irqd_has_set(&desc->irq_data, IRQD_IRQ_MODERATED)) {
+			desc->istate |= IRQS_PENDING;
+			mask_irq(desc);
+		}
 		if (irqd_needs_resend_when_in_progress(&desc->irq_data))
 			desc->istate |= IRQS_PENDING;
 		cond_eoi_irq(chip, &desc->irq_data);
@@ -765,6 +774,9 @@ void handle_fasteoi_irq(struct irq_desc *desc)
 
 	cond_unmask_eoi_irq(desc, chip);
 
+	if (irq_start_moderation(desc))
+		return;
+
 	/*
 	 * When the race described above happens this will resend the interrupt.
 	 */
@@ -854,6 +866,9 @@ void handle_edge_irq(struct irq_desc *desc)
 
 		handle_irq_event(desc);
 
+		if (irq_start_moderation(desc))
+			break;
+
 	} while ((desc->istate & IRQS_PENDING) && !irqd_irq_disabled(&desc->irq_data));
 }
 EXPORT_SYMBOL(handle_edge_irq);
diff --git a/kernel/irq/internals.h b/kernel/irq/internals.h
index 0164ca48da59e..20ea3e4ee5f2d 100644
--- a/kernel/irq/internals.h
+++ b/kernel/irq/internals.h
@@ -521,3 +521,13 @@ static inline void irq_debugfs_copy_devname(int irq, struct device *dev)
 {
 }
 #endif /* CONFIG_GENERIC_IRQ_DEBUGFS */
+
+#ifdef CONFIG_IRQ_SW_MODERATION
+void irq_moderation_init_fields(struct irq_desc_mod *mod);
+void irq_moderation_procfs_add(struct irq_desc *desc, umode_t umode);
+void irq_moderation_procfs_remove(struct irq_desc *desc);
+#else
+static inline void irq_moderation_init_fields(struct irq_desc_mod *mod) {}
+static inline void irq_moderation_procfs_add(struct irq_desc *desc, umode_t umode) {}
+static inline void irq_moderation_procfs_remove(struct irq_desc *desc) {}
+#endif
diff --git a/kernel/irq/irq_moderation.c b/kernel/irq/irq_moderation.c
new file mode 100644
index 0000000000000..07d1e740addcd
--- /dev/null
+++ b/kernel/irq/irq_moderation.c
@@ -0,0 +1,444 @@
+// SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause
+
+#include <linux/cpuhotplug.h>
+#include <linux/glob.h>
+#include <linux/irq.h>
+#include <linux/irqdesc.h>
+#include <linux/kallsyms.h>
+#include <linux/notifier.h>
+#include <linux/proc_fs.h>
+#include <linux/suspend.h>
+#include <linux/seq_file.h>
+
+#include "internals.h"
+#include "irq_moderation.h"
+
+/*
+ * Global Software Interrupt Moderation (GSIM)
+ *
+ * Some platforms show reduced I/O performance when the total device interrupt
+ * rate across the entire platform becomes too high. To address the problem,
+ * GSIM uses a hook after running the handler to implement software interrupt
+ * moderation with programmable delay.
+ *
+ * Configuration is done at runtime via procfs
+ *   echo ${VALUE} > /proc/irq/soft_moderation/${NAME}
+ *
+ * Supported parameters:
+ *
+ *   delay_us (default 0, suggested 100, 0 off, range 0-500)
+ *       Maximum moderation delay. A reasonable range is 20-100. Higher values
+ *       can be useful if the hardirq handler has long runtimes.
+ *
+ * Moderation can be enabled/disabled dynamically for individual interrupts with
+ *   echo 1 > /proc/irq/NN/soft_moderation # use 0 to disable
+ *
+ * Monitoring of per-cpu and global statistics is available via procfs
+ *   cat /proc/irq/soft_moderation/stats
+ *
+ * === ARCHITECTURE ===
+ *
+ * INTERRUPT HANDLING:
+ * - irq_moderation_hook() runs under desc->lock right after the interrupt handler.
+ *   If the interrupt must be moderated, it sets IRQD_IRQ_INPROGRESS, calls
+ *   __disable_irq(), adds the irq_desc to a per-CPU list of moderated interrupts,
+ *   and starts a moderation timer if not yet active.
+ * - desc->handler is modified so that when called on a moderated irq_desc it
+ *   calls mask_irq(), sets IRQS_PENDING and returns immediately.
+ * - the timer callback drains the moderation list: on each irq_desc it acquires
+ *   desc->lock, and if desc->action != NULL calls __enable_irq(), possibly calling
+ *   the handler if IRQS_PENDING is set.
+ *
+ * INTERRUPT TEARDOWN
+ * is protected by IRQD_IRQ_INPROGRESS and checking desc->action != NULL.
+ * This works because free_irq() runs in two steps:
+ * - first clear desc->action (under lock),
+ * - then call synchronize_irq(), which blocks on IRQD_IRQ_INPROGRESS
+ *   before freeing resources.
+ * When the moderation timer races with free_irq() we can have two cases:
+ * 1. timer runs before clearing desc->action. In this case __enable_irq()
+ *    is valid and the subsequent free_irq() will complete as intended
+ * 2. desc->action is cleared before the timer runs. In this case synchronize_irq()
+ *    will block until the timer expires (remember moderation delays are very short,
+ *    comparable to C-state exit times), __enable_irq() will not be run,
+ *    and free_irq() will complete successfully.
+ *
+ * INTERRUPT MIGRATION
+ * is protected by IRQD_IRQ_INPROGRESS that prevents running the handler on the
+ * new CPU while an interrupt is moderated.
+ *
+ * HOTPLUG
+ * During CPU shutdown, the kernel moves timers and reassigns interrupt affinity
+ * to a new CPU. The easiest and most robust way to guarantee that pending
+ * events are handled correctly is to use a per-CPU "moderation_allowed" flag
+ * and hotplug callbacks on CPUHP_AP_ONLINE_DYN:
+ * - on setup, set the flag. That will allow interrupts to be moderated.
+ * - on shutdown, with interrupts disabled, 1. clear the flag, thus preventing
+ *   further interrupts from being moderated on that CPU, 2. flush the list of
+ *   moderated interrupts (as if the timer had fired), and 3. cancel the timer.
+ * This avoids depending on the internals of the up/down sequence.
+ *
+ * SUSPEND
+ * Register a PM notifier to handle PM_SUSPEND_PREPARE and PM_POST_SUSPEND as
+ * hotplug shutdown and setup events. The hotplug callbacks are also invoked
+ * during suspend to/resume from disk.
+ *
+ * BOOT PROCESSOR
+ * Hotplug callbacks are not invoked for the boot processor.
+ * However, the boot processor is the last one to go down, and since there
+ * is no other place to run its timer callbacks, they keep running where
+ * they are supposed to.
+ */
+
+/* Recommended values. */
+struct irq_mod_info irq_mod_info ____cacheline_aligned = {
+	.update_ms		= 5,
+};
+
+DEFINE_PER_CPU_ALIGNED(struct irq_mod_state, irq_mod_state);
+
+DEFINE_STATIC_KEY_FALSE(irq_moderation_enabled_key);
+
+static void update_enable_key(void)
+{
+	if (irq_mod_info.delay_us != 0)
+		static_branch_enable(&irq_moderation_enabled_key);
+	else
+		static_branch_disable(&irq_moderation_enabled_key);
+}
+
+/* Actually start moderation. */
+bool irq_moderation_do_start(struct irq_desc *desc, struct irq_mod_state *m)
+{
+	lockdep_assert_irqs_disabled();
+
+	if (!hrtimer_is_queued(&m->timer)) {
+		const u32 min_delay_ns = 10000;
+		const u64 slack_ns = 2000;
+
+		/* Accumulate sleep time, no moderation if too small. */
+		m->sleep_ns += m->mod_ns;
+		if (m->sleep_ns < min_delay_ns)
+			return false;
+		/* We need moderation, start the timer. */
+		m->timer_set++;
+		hrtimer_start_range_ns(&m->timer, ns_to_ktime(m->sleep_ns),
+				       slack_ns, HRTIMER_MODE_REL_PINNED_HARD);
+	}
+
+	/*
+	 * Add to the timer list and __disable_irq() to prevent serving subsequent
+	 * interrupts.
+	 */
+	if (!list_empty(&desc->mod_state.ms_node)) {
+		/* Very unlikely, stray interrupt while moderated. */
+		m->stray_irq++;
+	} else {
+		m->enqueue++;
+		list_add(&desc->mod_state.ms_node, &m->descs);
+		__disable_irq(desc);
+	}
+	irqd_set(&desc->irq_data, IRQD_IRQ_INPROGRESS | IRQD_IRQ_MODERATED);
+	return true;
+}
+
+/* Initialize moderation state, used in desc_set_defaults() */
+void irq_moderation_init_fields(struct irq_desc_mod *mod_state)
+{
+	INIT_LIST_HEAD(&mod_state->ms_node);
+}
+
+static int set_mode(struct irq_desc *desc, bool enable)
+{
+	struct irq_data *irqd = &desc->irq_data;
+	struct irq_chip *chip = irqd->chip;
+
+	lockdep_assert_held(&desc->lock);
+
+	if (!enable) {
+		irq_settings_clr_and_set(desc, _IRQ_SW_MODERATION, 0);
+		return 0;
+	}
+
+	/* Moderation is only supported in specific cases. */
+	enable &= !irqd_is_level_type(irqd);
+	enable &= irqd_is_single_target(irqd);
+	enable &= !chip->irq_bus_lock && !chip->irq_bus_sync_unlock;
+	enable &= chip->irq_mask && chip->irq_unmask;
+	enable &= desc->handle_irq == handle_edge_irq || desc->handle_irq == handle_fasteoi_irq;
+	if (!enable)
+		return -EOPNOTSUPP;
+
+	irq_settings_clr_and_set(desc, 0, _IRQ_SW_MODERATION);
+	return 0;
+}
+
+/* Helpers to set and clamp values from procfs or at init. */
+struct swmod_param {
+	const char	*name;
+	int		(*wr)(struct swmod_param *n, const char __user *s, size_t count);
+	void		(*rd)(struct seq_file *p);
+	void		*val;
+	u32		min;
+	u32		max;
+};
+
+static int swmod_wr_u32(struct swmod_param *n, const char __user *s, size_t count)
+{
+	u32 res;
+	int ret = kstrtouint_from_user(s, count, 0, &res);
+
+	if (!ret) {
+		WRITE_ONCE(*(u32 *)(n->val), clamp(res, n->min, n->max));
+		ret = count;
+	}
+	return ret;
+}
+
+static void swmod_rd_u32(struct seq_file *p)
+{
+	struct swmod_param *n = p->private;
+
+	seq_printf(p, "%u\n", *(u32 *)(n->val));
+}
+
+static int swmod_wr_delay(struct swmod_param *n, const char __user *s, size_t count)
+{
+	int ret = swmod_wr_u32(n, s, count);
+
+	if (ret >= 0)
+		update_enable_key();
+	return ret;
+}
+
+#define HEAD_FMT "%5s  %8s  %11s  %11s  %9s\n"
+#define BODY_FMT "%5u  %8u  %11u  %11u  %9u\n"
+
+
+/* Print statistics */
+static void rd_stats(struct seq_file *p)
+{
+	uint delay_us = irq_mod_info.delay_us;
+	int cpu;
+
+	seq_printf(p, HEAD_FMT,
+		   "# CPU", "delay_ns", "timer_set", "enqueue", "stray_irq");
+
+	for_each_possible_cpu(cpu) {
+		struct irq_mod_state cur;
+
+		/* Copy statistics, will only use some 32bit values, races ok. */
+		data_race(cur = *per_cpu_ptr(&irq_mod_state, cpu));
+		seq_printf(p, BODY_FMT,
+			   cpu, cur.mod_ns, cur.timer_set, cur.enqueue, cur.stray_irq);
+	}
+
+	seq_printf(p, "\n"
+		   "enabled              %s\n"
+		   "delay_us             %u\n",
+		   str_yes_no(delay_us > 0),
+		   delay_us);
+}
+
+static int moderation_show(struct seq_file *p, void *v)
+{
+	struct swmod_param *n = p->private;
+
+	if (!n || !n->rd)
+		return -EINVAL;
+	n->rd(p);
+	return 0;
+}
+
+static int moderation_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, moderation_show, pde_data(inode));
+}
+
+static struct swmod_param param_names[] = {
+	{ "delay_us", swmod_wr_delay, swmod_rd_u32, &irq_mod_info.delay_us, 0, 500 },
+	{ "stats", NULL, rd_stats},
+};
+
+static ssize_t moderation_write(struct file *f, const char __user *buf, size_t count, loff_t *ppos)
+{
+	struct swmod_param *n = (struct swmod_param *)pde_data(file_inode(f));
+
+	return n && n->wr ? n->wr(n, buf, count) : -EINVAL;
+}
+
+static const struct proc_ops proc_ops = {
+	.proc_open	= moderation_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
+	.proc_write	= moderation_write,
+};
+
+/* Handlers for /proc/irq/NN/soft_moderation */
+static int mode_show(struct seq_file *p, void *v)
+{
+	struct irq_desc *desc = p->private;
+
+	seq_puts(p, irq_settings_moderation_allowed(desc) ? "on\n" : "off\n");
+	return 0;
+}
+
+static ssize_t mode_write(struct file *f, const char __user *buf, size_t count, loff_t *ppos)
+{
+	struct irq_desc *desc = (struct irq_desc *)pde_data(file_inode(f));
+	bool enable;
+	int ret = kstrtobool_from_user(buf, count, &enable);
+
+	if (!ret) {
+		guard(raw_spinlock_irqsave)(&desc->lock);
+		ret = set_mode(desc, enable);
+	}
+	return ret ? : count;
+}
+
+static int mode_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, mode_show, pde_data(inode));
+}
+
+static const struct proc_ops mode_ops = {
+	.proc_open	= mode_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
+	.proc_write	= mode_write,
+};
+
+void irq_moderation_procfs_add(struct irq_desc *desc, umode_t umode)
+{
+	proc_create_data("soft_moderation", umode, desc->dir, &mode_ops, desc);
+}
+
+void irq_moderation_procfs_remove(struct irq_desc *desc)
+{
+	remove_proc_entry("soft_moderation", desc->dir);
+}
+
+/* Used on timer expiration or CPU shutdown. */
+static void drain_desc_list(struct irq_mod_state *m)
+{
+	struct irq_desc *desc, *next;
+
+	/* Remove from list and enable interrupts back. */
+	list_for_each_entry_safe(desc, next, &m->descs, mod_state.ms_node) {
+		guard(raw_spinlock)(&desc->lock);
+		list_del_init(&desc->mod_state.ms_node);
+		irqd_clear(&desc->irq_data, IRQD_IRQ_INPROGRESS | IRQD_IRQ_MODERATED);
+		/* Protect against competing free_irq(). */
+		if (desc->action)
+			__enable_irq(desc);
+	}
+}
+
+static enum hrtimer_restart timer_callback(struct hrtimer *timer)
+{
+	struct irq_mod_state *m = this_cpu_ptr(&irq_mod_state);
+
+	lockdep_assert_irqs_disabled();
+
+	drain_desc_list(m);
+	/* Prepare to accumulate next moderation delay. */
+	m->sleep_ns = 0;
+	return HRTIMER_NORESTART;
+}
+
+/* Hotplug callback for setup. */
+static int cpu_setup_cb(uint cpu)
+{
+	struct irq_mod_state *m = this_cpu_ptr(&irq_mod_state);
+
+	hrtimer_setup(&m->timer, timer_callback,
+		      CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED_HARD);
+	INIT_LIST_HEAD(&m->descs);
+	m->moderation_allowed = true;
+	return 0;
+}
+
+/*
+ * Hotplug callback for shutdown.
+ * Mark the CPU as offline for moderation, and drain the list of masked
+ * interrupts. Any subsequent interrupt on this CPU will not be
+ * moderated, but they will be on the new target.
+ */
+static int cpu_remove_cb(uint cpu)
+{
+	struct irq_mod_state *m = this_cpu_ptr(&irq_mod_state);
+
+	guard(irqsave)();
+	m->moderation_allowed = false;
+	drain_desc_list(m);
+	hrtimer_cancel(&m->timer);
+	return 0;
+}
+
+static void mod_pm_prepare_cb(void *arg)
+{
+	cpu_remove_cb(smp_processor_id());
+}
+
+static void mod_pm_resume_cb(void *arg)
+{
+	cpu_setup_cb(smp_processor_id());
+}
+
+static int mod_pm_notifier_cb(struct notifier_block *nb, unsigned long event, void *unused)
+{
+	switch (event) {
+	case PM_SUSPEND_PREPARE:
+		on_each_cpu(mod_pm_prepare_cb, NULL, 1);
+		break;
+	case PM_POST_SUSPEND:
+		on_each_cpu(mod_pm_resume_cb, NULL, 1);
+		break;
+	}
+	return NOTIFY_OK;
+}
+
+struct notifier_block mod_nb = {
+	.notifier_call	= mod_pm_notifier_cb,
+	.priority	= 100,
+};
+
+static void __init clamp_parameter(u32 *dst, u32 val)
+{
+	struct swmod_param *n = param_names;
+
+	for (int i = 0; i < ARRAY_SIZE(param_names); i++, n++) {
+		if (dst == n->val) {
+			WRITE_ONCE(*dst, clamp(val, n->min, n->max));
+			return;
+		}
+	}
+}
+
+static int __init init_irq_moderation(void)
+{
+	struct proc_dir_entry *dir;
+	struct swmod_param *n;
+	int i;
+
+	/* Clamp all initial values to the allowed range. */
+	for (uint *cur = &irq_mod_info.delay_us; cur < irq_mod_info.params_end; cur++)
+		clamp_parameter(cur, *cur);
+
+	cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "irq_moderation", cpu_setup_cb, cpu_remove_cb);
+	register_pm_notifier(&mod_nb);
+
+	update_enable_key();
+
+	dir = proc_mkdir("irq/soft_moderation", NULL);
+	if (!dir)
+		return 0;
+	for (i = 0, n = param_names; i < ARRAY_SIZE(param_names); i++, n++)
+		proc_create_data(n->name, n->wr ? 0644 : 0444, dir, &proc_ops, n);
+	return 0;
+}
+
+device_initcall(init_irq_moderation);
diff --git a/kernel/irq/irq_moderation.h b/kernel/irq/irq_moderation.h
new file mode 100644
index 0000000000000..0d634f8e9225d
--- /dev/null
+++ b/kernel/irq/irq_moderation.h
@@ -0,0 +1,111 @@
+/* SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause */
+
+#ifndef _LINUX_IRQ_MODERATION_H
+#define _LINUX_IRQ_MODERATION_H
+
+#ifdef CONFIG_IRQ_SW_MODERATION
+/* Common data structures for Global Software Interrupt Moderation, GSIM */
+
+#include <linux/hrtimer.h>
+#include <linux/irq.h>
+#include <linux/irqdesc.h>
+#include <linux/kernel.h>
+
+/**
+ * struct irq_mod_info - global configuration parameters and state
+ * @delay_us:		maximum delay
+ * @update_ms:		how often to update delay (epoch duration)
+ */
+struct irq_mod_info {
+	u32		delay_us;
+	u32		update_ms;
+	u32		params_end[];
+};
+
+extern struct irq_mod_info irq_mod_info;
+
+/**
+ * struct irq_mod_state - per-CPU moderation state
+ *
+ * Used on every interrupt:
+ * @timer:		moderation timer
+ * @moderation_allowed:	per-CPU flag, toggled during hotplug/suspend events
+ * @intr_count:		interrupts in the last epoch
+ * @sleep_ns:		accumulated time for actual delay
+ * @mod_ns:		nominal moderation delay, recomputed every epoch
+ *
+ * Used less frequently, every few interrupts:
+ * @epoch_start_ns:	start of current epoch
+ * @update_ns:		update_ms from irq_mod_info, converted to ns
+ * @stray_irq:		how many stray interrupts (almost never used)
+ *
+ * Used once per epoch per interrupt source:
+ * @descs:		list of moderated irq_desc on this CPU
+ * @enqueue:		how many enqueues on the list
+ *
+ * Statistics
+ * @timer_set:		how many timer_set calls
+ */
+struct irq_mod_state {
+	struct hrtimer		timer;
+	bool			moderation_allowed;
+	u32			intr_count;
+	u32			sleep_ns;
+	u32			mod_ns;
+	atomic64_t		epoch_start_ns;
+	u32			update_ns;
+	u32			stray_irq;
+	struct list_head	descs;
+	u32			enqueue;
+	u32			timer_set;
+};
+
+DECLARE_PER_CPU_ALIGNED(struct irq_mod_state, irq_mod_state);
+
+extern struct static_key_false irq_moderation_enabled_key;
+
+bool irq_moderation_do_start(struct irq_desc *desc, struct irq_mod_state *m);
+
+static inline void check_epoch(struct irq_mod_state *m)
+{
+	const u64 now = ktime_get_ns(), epoch_ns = now - atomic64_read(&m->epoch_start_ns);
+	const u32 slack_ns = 5000;
+
+	/* Run approximately every update_ns, a little bit early is ok. */
+	if (epoch_ns < m->update_ns - slack_ns)
+		return;
+	/* Fetch updated parameters. */
+	m->update_ns = READ_ONCE(irq_mod_info.update_ms) * NSEC_PER_MSEC;
+	m->mod_ns = READ_ONCE(irq_mod_info.delay_us) * NSEC_PER_USEC;
+}
+
+/*
+ * Use after running the handler, with lock held. If this source should be
+ * moderated, disable it, add to the timer list for this CPU and return true.
+ * The caller must also exit handle_*_irq() without processing IRQS_PENDING,
+ * as that will happen when the moderation timer fires and calls __enable_irq().
+ */
+static inline bool irq_start_moderation(struct irq_desc *desc)
+{
+	struct irq_mod_state *m = this_cpu_ptr(&irq_mod_state);
+
+	if (!static_branch_unlikely(&irq_moderation_enabled_key))
+		return false;
+	if (!irq_settings_moderation_allowed(desc))
+		return false;
+	if (!m->moderation_allowed)
+		return false;
+
+	m->intr_count++;
+
+	/* Is this a new epoch? ktime_get_ns() is expensive, don't check too often. */
+	if ((m->intr_count & 0xf) == 0)
+		check_epoch(m);
+
+	return irq_moderation_do_start(desc, m);
+}
+#else /* CONFIG_IRQ_SW_MODERATION */
+static inline bool irq_start_moderation(struct irq_desc *desc) { return false; }
+#endif /* !CONFIG_IRQ_SW_MODERATION */
+
+#endif /* _LINUX_IRQ_MODERATION_H */
diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index f8e4e13dbe339..5b6a69ee82b8f 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -134,6 +134,7 @@ static void desc_set_defaults(unsigned int irq, struct irq_desc *desc, int node,
 	desc->tot_count = 0;
 	desc->name = NULL;
 	desc->owner = owner;
+	irq_moderation_init_fields(&desc->mod_state);
 	for_each_possible_cpu(cpu)
 		*per_cpu_ptr(desc->kstat_irqs, cpu) = (struct irqstat) { };
 	desc_smp_init(desc, node, affinity);
diff --git a/kernel/irq/proc.c b/kernel/irq/proc.c
index 77258eafbf632..5ce64654da097 100644
--- a/kernel/irq/proc.c
+++ b/kernel/irq/proc.c
@@ -376,6 +376,7 @@ void register_irq_proc(unsigned int irq, struct irq_desc *desc)
 	proc_create_single_data("effective_affinity_list", 0444, desc->dir,
 				irq_effective_aff_list_proc_show, irqp);
 # endif
+	irq_moderation_procfs_add(desc, 0644);
 #endif
 	proc_create_single_data("spurious", 0444, desc->dir,
 				irq_spurious_proc_show, (void *)(long)irq);
@@ -397,6 +398,7 @@ void unregister_irq_proc(unsigned int irq, struct irq_desc *desc)
 	remove_proc_entry("effective_affinity", desc->dir);
 	remove_proc_entry("effective_affinity_list", desc->dir);
 # endif
+	irq_moderation_procfs_remove(desc);
 #endif
 	remove_proc_entry("spurious", desc->dir);
 
-- 
2.52.0.457.g6b5491de43-goog


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v4 3/3] genirq: Adaptive Global Software Interrupt Moderation (GSIM).
  2026-01-15 15:59 [PATCH-v4 0/3] Global Software Interrupt Moderation (GSIM) Luigi Rizzo
  2026-01-15 15:59 ` [PATCH v4 1/3] genirq: Add flags for software interrupt moderation Luigi Rizzo
  2026-01-15 15:59 ` [PATCH v4 2/3] genirq: Fixed-delay Global Software Interrupt Moderation (GSIM) Luigi Rizzo
@ 2026-01-15 15:59 ` Luigi Rizzo
  2026-01-15 21:14   ` kernel test robot
                     ` (2 more replies)
  2 siblings, 3 replies; 11+ messages in thread
From: Luigi Rizzo @ 2026-01-15 15:59 UTC (permalink / raw)
  To: Thomas Gleixner, Marc Zyngier, Luigi Rizzo, Paolo Abeni,
	Andrew Morton, Sean Christopherson, Jacob Pan
  Cc: linux-kernel, linux-arch, Bjorn Helgaas, Willem de Bruijn,
	Luigi Rizzo

Interrupt moderation helps keep system load (as defined later) under
control, but the use of a fixed moderation delay imposes unnecessary
latency when the system load is already low.

We focus on two types of system load related to interrupts:
- total device interrupt rates. Some platforms show severe I/O
  performance degradation with more than 1-2 Mintr/s across the entire
  system. This especially affects server-class hardware with hundreds of
  interrupt sources (NIC or SSD queues from many physical or virtualized
  devices).
- percentage of time spent in hardirq. This can become problematic in
  the presence of large numbers of interrupt sources hitting individual
  CPUs (for instance, as a result of attempts to isolate application
  CPUs from interrupt load).

Make GSIM adaptive by measuring total and per-CPU interrupt rates, as
well as time spent in hardirq on each CPU. The metrics are compared to
configurable targets, and each CPU adjusts the moderation delay up
or down depending on the result, using multiplicative increase/decrease.

Configuration of moderation parameters is done via procfs

  echo ${VALUE} > /proc/irq/soft_moderation/${NAME}

Parameters are:

  delay_us (0 off, range 0-500)
      Maximum moderation delay in microseconds.

  target_intr_rate (0 off, range 0-50000000)
      Total rate above which moderation should trigger.

  hardirq_percent (0 off, range 0-100)
      Percentage of time spent in hardirq above which moderation should
      trigger.

  update_ms (range 1-100, default 5)
      How often the metrics should be computed and moderation delay
      updated.

When target_intr_rate and hardirq_percent are both 0, GSIM uses delay_us
as a fixed moderation delay. Otherwise, the delay is dynamically adjusted
up or down, independently on each CPU, based on how the total and per-CPU
metrics compare with the targets.

Provided that delay_us suffices to bring the metrics within the targets,
the control loop dynamically converges to the minimum moderation delay
needed to stay within them.

PERFORMANCE BENEFITS:

The tests below demonstrate how adaptive moderation allows improved
throughput at high load (same as fixed moderation) and low latency at
moderate load (same as no moderation), without having to hand-tune the
system based on load.

We run experiments on an x86 platform with 8 SSDs (64 queues) capable of
an aggregate of approximately 14M IOPS, running fio with a variable number
of SSD devices (1 or 8), threads per disk (1 or 200), and IODEPTH (from
1 to 128 per thread). The actual command is below:

${FIO} --name=read_IOPs_test ${DEVICES} --iodepth=${IODEPTH} --numjobs=${JOBS} \
    --bs=4K --rw=randread  --filesize=1000G --ioengine=libaio --direct=1 \
    --verify=0 --randrepeat=0 --time_based=1 --runtime=600s \
    --cpus-allowed=1-119 --cpus_allowed_policy=split --group_reporting=1

For each configuration we test three moderation settings:
- OFF:      delay_us=0
- FIXED:    delay_us=200 target_intr_rate=0 hardirq_percent=0
- ADAPTIVE: delay_us=200 target_intr_rate=1000000 hardirq_percent=70

The first set of measurements is for ONE DISK, ONE THREAD.
At low IODEPTH the throughput is latency bound, and moderation is not
necessary. A fixed moderation delay dominates the latency, hence reducing
throughput; adaptive moderation avoids the problem.
As the IODEPTH increases, the system becomes I/O bound, and even the
fixed moderation delay does no harm.

Overall: adaptive moderation is better than fixed moderation and
at least as good as moderation off.

COMPLETION LATENCY PERCENTILES in us (p50, p90, p99)

         ------ OFF --------   ------ FIXED ------   ----- ADAPTIVE ----
IODEPTH  IOPS  p50  p90  p99   IOPS  p50  p90  p99   IOPS  p50  p90  p99
         -------------------   -------------------   -------------------
1:        12K   78   88   94 .   5K  208  210  215 .  12K   78   88   96
8:        94K   83   91  110 .  38K  210  212  221 .  94K   83   91  105
32:      423K   72   85  124 . 150K  210  219  235 . 424K   72   85  124
128:     698K  180  200  243 . 513K  251  306  347 . 718K  174  194  239

A second set of measurements is with one disk and 200 threads.
The system is I/O bound but without significant interrupt overhead.
All three scenarios are basically equivalent.

         --------- OFF --------   ------- FIXED -------   ------ ADAPTIVE -----
IODEPTH  IOPS    p50  p90   p99   IOPS    p50  p90  p99   IOPS    p50  p90  p99
         ----------------------   ---------------------   ---------------------
1:       1581K   110  174   281 .  933K   208  223  363 . 1556K   114  176  277
8:       1768K   889 1516  1926 . 1768K   848 1516 2147 . 1768K   889 1516 1942
32:      1768K  3589 5735  7701 . 1768K  3589 5735 7635 . 1768K  3589 5735 7504
128:     1768K  14ms 24ms  31ms . 1768K  14ms 24ms 29ms . 1768K  14ms 24ms 30ms

A third set of measurements uses 8 disks and 200 threads
per disk, all running on socket 0.  The system would be I/O bound (and
CPU/latency bound at low IODEPTH), but this platform is unable to cope with the
high global interrupt rate and so moderation is really necessary to hit
the disk limits.

As we see below, adaptive moderation gives more than 2X higher
throughput at meaningful iodepths, and even latency is much better.
The only case where we see a small regression is with iodepth=1, because
the high interrupt rate triggers the control loop to increase the
moderation delay.

         --------- OFF --------   -------- FIXED -------   ------ ADAPTIVE ------
IODEPTH  IOPS    p50  p90   p99   IOPS     p50  p90  p99   IOPS     p50  p90  p99
         ----------------------   ----------------------   ----------------------
1:       2304K    82   94   128 .  1030K   208  277  293 .  1842K    97  149  188
8:       5240K   128  938  1680 .  7500K   208  233  343 . 10000K   151  210  281
32:      5251K   206 3621  3949 . 12300K   184 1106 5407 . 12100K   184 1139 5407
128:     5330K  4228 12ms  17ms . 13800K  1123 4883 7373 . 13800K  1074 4883 7635

Finally, here are experiments indicating how throughput is affected by
the various parameters (with 8 disks and 200 threads).

IOPS vs delay_us (target_intr_rate = 0, hardirq_percent=0)

delay_us	0	50	100	150	200	250
IODEPTH 1	2300	1860	1580	1300	1034	1066
IODEPTH 8	5254	9936	9645	8818	7500	6150
IODEPTH 32	5250	11300	13800	13800	13800	13800
IODEPTH 128	5900	13600	13900	13900	13900	13900

IOPS vs target_intr_rate (delay_us = 200, hardirq_percent=0, iodepth 128)
value		250K	500K	750K	1000K	1250K	1500K	1750K	2000K
socket0		13900	13900	13900	13800	13800	12900	11000	8808
both sockets	13900	13900	13900	13800	8600	8000	6900	6400

hardirq_percent (delay_us = 200, target_intr_rate=0, iodepth 128)
hardirq%	1	10	20	30	40	50	60	70
KIOPS		13900	13800	13300	12100	10400	8600	7400	6500

Signed-off-by: Luigi Rizzo <lrizzo@google.com>
---
 kernel/irq/irq_moderation.c | 271 ++++++++++++++++++++++++++++++++++--
 kernel/irq/irq_moderation.h |  58 +++++++-
 2 files changed, 314 insertions(+), 15 deletions(-)

diff --git a/kernel/irq/irq_moderation.c b/kernel/irq/irq_moderation.c
index 07d1e740addcd..8221cb8d9fc79 100644
--- a/kernel/irq/irq_moderation.c
+++ b/kernel/irq/irq_moderation.c
@@ -18,8 +18,9 @@
  *
  * Some platforms show reduced I/O performance when the total device interrupt
  * rate across the entire platform becomes too high. To address the problem,
- * GSIM uses a hook after running the handler to implement software interrupt
- * moderation with programmable delay.
+ * GSIM uses a hook after running the handler to measure global and per-CPU
+ * interrupt rates, compare them with configurable targets, and implements
+ * independent, per-CPU software moderation delays.
  *
  * Configuration is done at runtime via procfs
  *   echo ${VALUE} > /proc/irq/soft_moderation/${NAME}
@@ -30,6 +31,26 @@
  *       Maximum moderation delay. A reasonable range is 20-100. Higher values
  *       can be useful if the hardirq handler has long runtimes.
  *
+ *   target_intr_rate (default 0, suggested 1000000, 0 off, range 0-50000000)
+ *       The total interrupt rate above which moderation kicks in.
+ *       Not particularly critical, a value in the 500K-1M range is usually ok.
+ *
+ *   hardirq_percent (default 0, suggested 70, 0 off, range 0-100)
+ *       The hardirq percentage above which moderation kicks in.
+ *       50-90 is a reasonable range.
+ *
+ *       FIXED MODERATION mode requires target_intr_rate=0, hardirq_percent=0
+ *
+ *   update_ms (default 5, range 1-100)
+ *       How often the load is measured and moderation delay updated.
+ *
+ *   scale_cpus (default 150, range 50-1000)
+ *       A small update_ms may lead to underestimating the number of CPUs
+ *       simultaneously handling interrupts, and the opposite can happen
+ *       with very large values. This parameter can help correct the
+ *       estimate, though modifying the default is not recommended unless
+ *       there are very strong reasons.
+ *
  * Moderation can be enabled/disabled dynamically for individual interrupts with
  *   echo 1 > /proc/irq/NN/soft_moderation # use 0 to disable
  *
@@ -93,6 +114,8 @@
 /* Recommended values. */
 struct irq_mod_info irq_mod_info ____cacheline_aligned = {
 	.update_ms		= 5,
+	.increase_factor	= MIN_SCALING_FACTOR,
+	.scale_cpus		= 150,
 };
 
 DEFINE_PER_CPU_ALIGNED(struct irq_mod_state, irq_mod_state);
@@ -107,6 +130,171 @@ static void update_enable_key(void)
 		static_branch_disable(&irq_moderation_enabled_key);
 }
 
+/* Functions called in handle_*_irq(). */
+
+/*
+ * Compute smoothed average between old and cur. 'steps' is used
+ * to approximate applying the smoothing multiple times.
+ */
+static inline u32 smooth_avg(u32 old, u32 cur, u32 steps)
+{
+	const u32 smooth_factor = 64;
+
+	steps = min(steps, smooth_factor - 1);
+	return ((smooth_factor - steps) * old + steps * cur) / smooth_factor;
+}
+
+/* Measure and assess time spent in hardirq. */
+static inline bool hardirq_high(struct irq_mod_state *m, u32 hardirq_percent)
+{
+	bool above_threshold;
+	u64 irqtime, cur;
+
+	if (!IS_ENABLED(CONFIG_IRQ_TIME_ACCOUNTING))
+		return false;
+
+	cur = kcpustat_this_cpu->cpustat[CPUTIME_IRQ];
+	irqtime = cur - m->last_irqtime;
+	m->last_irqtime = cur;
+
+	above_threshold = irqtime * 100 > (u64)m->epoch_ns * hardirq_percent;
+	m->hardirq_high += above_threshold;
+	return above_threshold;
+}
+
+/* Measure and assess total and per-CPU interrupt rates. */
+static inline bool irqrate_high(struct irq_mod_state *m, u32 target_rate, u32 steps)
+{
+	u32 global_intr_rate, local_intr_rate, delta_intrs, active_cpus, tmp;
+	bool local_rate_high, global_rate_high;
+
+	local_intr_rate = ((u64)m->intr_count * NSEC_PER_SEC) / m->epoch_ns;
+
+	/* Accumulate global counter and compute global interrupt rate. */
+	tmp = atomic_add_return(m->intr_count, &irq_mod_info.total_intrs);
+	m->intr_count = 1;
+	delta_intrs = tmp - m->last_total_intrs;
+	m->last_total_intrs = tmp;
+	global_intr_rate = ((u64)delta_intrs * NSEC_PER_SEC) / m->epoch_ns;
+
+	/*
+	 * Count how many CPUs handled interrupts in the last epoch, needed
+	 * to determine the per-CPU target (target_rate / active_cpus).
+	 * Each active CPU increments the global counter approximately every
+	 * update_ns. Scale the value by (update_ns / m->epoch_ns) to get the
+	 * correct value. Also apply rounding and make sure active_cpus > 0.
+	 */
+	tmp = atomic_add_return(1, &irq_mod_info.total_cpus);
+	active_cpus = tmp - m->last_total_cpus;
+	m->last_total_cpus = tmp;
+	active_cpus = (active_cpus * m->update_ns + m->epoch_ns / 2) / m->epoch_ns;
+	if (active_cpus < 1)
+		active_cpus = 1;
+
+	/* Compare with global and per-CPU targets. */
+	global_rate_high = global_intr_rate > target_rate;
+
+	/*
+	 * Short epochs may lead to underestimating the number of active CPUs.
+	 * Apply a scaling factor to compensate. This may make the controller
+	 * a bit more aggressive but does not harm system throughput.
+	 */
+	local_rate_high = local_intr_rate * active_cpus * irq_mod_info.scale_cpus > target_rate * 100;
+
+	/* Statistics. */
+	m->global_intr_rate = smooth_avg(m->global_intr_rate, global_intr_rate, steps);
+	m->local_intr_rate = smooth_avg(m->local_intr_rate, local_intr_rate, steps);
+	m->scaled_cpu_count = smooth_avg(m->scaled_cpu_count, active_cpus * 256, steps);
+	m->local_irq_high += local_rate_high;
+	m->global_irq_high += global_rate_high;
+
+	/* Moderate on this CPU only if both global and local rates are high. */
+	return global_rate_high && local_rate_high;
+}
+
+/* Periodic adjustment, called once per epoch. */
+void irq_moderation_update_epoch(struct irq_mod_state *m)
+{
+	const u32 hardirq_percent = READ_ONCE(irq_mod_info.hardirq_percent);
+	const u32 target_rate = READ_ONCE(irq_mod_info.target_intr_rate);
+	const u32 min_delay_ns = 500;
+	bool above_target = false;
+	u32 steps;
+
+	/*
+	 * If any of the configuration parameters changes, read the main ones
+	 * (delay_ns, update_ns), and set the adaptive delay, mod_ns, to the
+	 * maximum value to help converge.
+	 * Without that, the system might be already below target_intr_rate
+	 * because of saturation on the bus (the very problem GSIM is trying
+	 * to address) and that would block the control loop.
+	 * Setting mod_ns to the highest value (if chosen properly) can reduce
+	 * the interrupt rate below target_intr_rate and let the controller
+	 * gradually reach the target.
+	 */
+	if (raw_read_seqcount(&irq_mod_info.seq.seqcount) != m->seq) {
+		do {
+			m->seq = read_seqbegin(&irq_mod_info.seq);
+			m->update_ns = READ_ONCE(irq_mod_info.update_ms) * NSEC_PER_MSEC;
+			m->mod_ns = READ_ONCE(irq_mod_info.delay_us) * NSEC_PER_USEC;
+			m->delay_ns = m->mod_ns;
+		} while (read_seqretry(&irq_mod_info.seq, m->seq));
+	}
+
+	if (target_rate == 0 && hardirq_percent == 0) {
+		/* Use fixed moderation delay. */
+		m->mod_ns = m->delay_ns;
+		m->global_intr_rate = 0;
+		m->local_intr_rate = 0;
+		m->scaled_cpu_count = 0;
+		return;
+	}
+
+	/*
+	 * To scale values X by a factor (1 +/- 1/F) every "update_ns" we do
+	 *	X := X * (1 +/- 1/F)
+	 * If the interval is N times longer, applying the formula N times gives
+	 *	X := X * ((1 +/- 1/F) ** N)
+	 * We don't want to deal with floating point or exponentials, and we cap N
+	 * to some small value < F. This leads to an approximate formula
+	 *	X := X * (1 +/- N/F)
+	 * The variable steps below is the number N of steps.
+	 */
+	steps = clamp(m->epoch_ns / m->update_ns, 1u, MIN_SCALING_FACTOR - 1u);
+
+	if (target_rate > 0 && irqrate_high(m, target_rate, steps))
+		above_target = true;
+
+	if (hardirq_percent > 0 && hardirq_high(m, hardirq_percent))
+		above_target = true;
+
+	/*
+	 * Controller: adjust delay with exponential increase or decrease.
+	 *
+	 * Note the different constants: we increase fast (smaller factor)
+	 * to aggressively slow down when the interrupt rate goes up,
+	 * but decrease slowly (larger factor) because reducing the delay can
+	 * drive up the interrupt rate and we don't want to create load spikes.
+	 */
+	if (above_target) {
+		const u32 increase_factor = READ_ONCE(irq_mod_info.increase_factor);
+
+		/* Make sure the value is large enough for the exponential to grow. */
+		if (m->mod_ns < min_delay_ns)
+			m->mod_ns = min_delay_ns;
+		m->mod_ns += m->mod_ns * steps / increase_factor;
+		if (m->mod_ns > m->delay_ns)
+			m->mod_ns = m->delay_ns;
+	} else {
+		const u32 decrease_factor = 2 * READ_ONCE(irq_mod_info.increase_factor);
+
+		m->mod_ns -= m->mod_ns * steps / decrease_factor;
+		/* Round down to 0 values that are too small to bother. */
+		if (m->mod_ns < min_delay_ns)
+			m->mod_ns = 0;
+	}
+}
+
 /* Actually start moderation. */
 bool irq_moderation_do_start(struct irq_desc *desc, struct irq_mod_state *m)
 {
@@ -142,6 +330,8 @@ bool irq_moderation_do_start(struct irq_desc *desc, struct irq_mod_state *m)
 	return true;
 }
 
+/* Control functions. */
+
 /* Initialize moderation state, used in desc_set_defaults() */
 void irq_moderation_init_fields(struct irq_desc_mod *mod_state)
 {
@@ -189,7 +379,9 @@ static int swmod_wr_u32(struct swmod_param *n, const char __user *s, size_t coun
 	int ret = kstrtouint_from_user(s, count, 0, &res);
 
 	if (!ret) {
+		write_seqlock(&irq_mod_info.seq);
 		WRITE_ONCE(*(u32 *)(n->val), clamp(res, n->min, n->max));
+		write_sequnlock(&irq_mod_info.seq);
 		ret = count;
 	}
 	return ret;
@@ -211,34 +403,82 @@ static int swmod_wr_delay(struct swmod_param *n, const char __user *s, size_t co
 	return ret;
 }
 
-#define HEAD_FMT "%5s  %8s  %11s  %11s  %9s\n"
-#define BODY_FMT "%5u  %8u  %11u  %11u  %9u\n"
+#define HEAD_FMT "%5s  %8s  %10s  %4s  %8s  %11s  %11s  %11s  %11s  %11s  %9s\n"
+#define BODY_FMT "%5u  %8u  %10u  %4u  %8u  %11u  %11u  %11u  %11u  %11u  %9u\n"
 
 #pragma clang diagnostic error "-Wformat"
 
 /* Print statistics */
 static void rd_stats(struct seq_file *p)
 {
+	ulong global_intr_rate = 0, global_irq_high = 0;
+	ulong local_irq_high = 0, hardirq_high = 0;
 	uint delay_us = irq_mod_info.delay_us;
-	int cpu;
+	u64 now = ktime_get_ns();
+	int cpu, active_cpus = 0;
 
 	seq_printf(p, HEAD_FMT,
-		   "# CPU", "delay_ns", "timer_set", "enqueue", "stray_irq");
+		   "# CPU", "irq/s", "loc_irq/s", "cpus", "delay_ns",
+		   "irq_hi", "loc_irq_hi", "hardirq_hi", "timer_set",
+		   "enqueue", "stray_irq");
 
 	for_each_possible_cpu(cpu) {
-		struct irq_mod_state cur;
+		struct irq_mod_state cur, *m = per_cpu_ptr(&irq_mod_state, cpu);
+		u64 epoch_start_ns;
+		bool recent;
+
+		/* Accumulate and print only recent samples */
+		epoch_start_ns = atomic64_read(&m->epoch_start_ns);
+		recent = (now - epoch_start_ns) < 10 * NSEC_PER_SEC;
 
 		/* Copy statistics, will only use some 32bit values, races ok. */
 		data_race(cur = *per_cpu_ptr(&irq_mod_state, cpu));
+		if (recent) {
+			active_cpus++;
+			global_intr_rate += cur.global_intr_rate;
+		}
+
+		global_irq_high += cur.global_irq_high;
+		local_irq_high += cur.local_irq_high;
+		hardirq_high += cur.hardirq_high;
+
 		seq_printf(p, BODY_FMT,
-			   cpu, cur.mod_ns, cur.timer_set, cur.enqueue, cur.stray_irq);
+			   cpu,
+			   recent * cur.global_intr_rate,
+			   recent * cur.local_intr_rate,
+			   recent * (cur.scaled_cpu_count + 128) / 256,
+			   recent * cur.mod_ns,
+			   cur.global_irq_high,
+			   cur.local_irq_high,
+			   cur.hardirq_high,
+			   cur.timer_set,
+			   cur.enqueue,
+			   cur.stray_irq);
 	}
 
 	seq_printf(p, "\n"
 		   "enabled              %s\n"
-		   "delay_us             %u\n",
+		   "delay_us             %u\n"
+		   "target_intr_rate     %u\n"
+		   "hardirq_percent      %u\n"
+		   "update_ms            %u\n"
+		   "scale_cpus           %u\n",
 		   str_yes_no(delay_us > 0),
-		   delay_us);
+		   delay_us,
+		   irq_mod_info.target_intr_rate, irq_mod_info.hardirq_percent,
+		   irq_mod_info.update_ms, irq_mod_info.scale_cpus);
+
+	seq_printf(p,
+		   "intr_rate            %lu\n"
+		   "irq_high             %lu\n"
+		   "my_irq_high          %lu\n"
+		   "hardirq_percent_high %lu\n"
+		   "total_interrupts     %u\n"
+		   "total_cpus           %u\n",
+		   active_cpus ? global_intr_rate / active_cpus : 0,
+		   global_irq_high, local_irq_high, hardirq_high,
+		   READ_ONCE(*((u32 *)&irq_mod_info.total_intrs)),
+		   READ_ONCE(*((u32 *)&irq_mod_info.total_cpus)));
 }
 
 static int moderation_show(struct seq_file *p, void *v)
@@ -258,6 +498,11 @@ static int moderation_open(struct inode *inode, struct file *file)
 
 static struct swmod_param param_names[] = {
 	{ "delay_us", swmod_wr_delay, swmod_rd_u32, &irq_mod_info.delay_us, 0, 500 },
+	{ "target_intr_rate", swmod_wr_u32, swmod_rd_u32, &irq_mod_info.target_intr_rate, 0, 50000000 },
+	{ "hardirq_percent", swmod_wr_u32, swmod_rd_u32, &irq_mod_info.hardirq_percent, 0, 100 },
+	{ "update_ms", swmod_wr_u32, swmod_rd_u32, &irq_mod_info.update_ms, 1, 100 },
+	{ "increase_factor", swmod_wr_u32, NULL, &irq_mod_info.increase_factor, MIN_SCALING_FACTOR, 128 },
+	{ "scale_cpus", swmod_wr_u32, swmod_rd_u32, &irq_mod_info.scale_cpus, 50, 1000 },
 	{ "stats", NULL, rd_stats},
 };
 
@@ -427,6 +672,7 @@ static int __init init_irq_moderation(void)
 	/* Clamp all initial values to the allowed range. */
 	for (uint *cur = &irq_mod_info.delay_us; cur < irq_mod_info.params_end; cur++)
 		clamp_parameter(cur, *cur);
+	seqlock_init(&irq_mod_info.seq);
 
 	cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "irq_moderation", cpu_setup_cb, cpu_remove_cb);
 	register_pm_notifier(&mod_nb);
@@ -436,8 +682,11 @@ static int __init init_irq_moderation(void)
 	dir = proc_mkdir("irq/soft_moderation", NULL);
 	if (!dir)
 		return 0;
-	for (i = 0, n = param_names; i < ARRAY_SIZE(param_names); i++, n++)
+	for (i = 0, n = param_names; i < ARRAY_SIZE(param_names); i++, n++) {
+		if (!n->rd)
+			continue;
 		proc_create_data(n->name, n->wr ? 0644 : 0444, dir, &proc_ops, n);
+	}
 	return 0;
 }
 
diff --git a/kernel/irq/irq_moderation.h b/kernel/irq/irq_moderation.h
index 0d634f8e9225d..3a92306d1aee9 100644
--- a/kernel/irq/irq_moderation.h
+++ b/kernel/irq/irq_moderation.h
@@ -13,12 +13,32 @@
 
 /**
  * struct irq_mod_info - global configuration parameters and state
+ * @total_intrs:	running count of total interrupts
+ * @total_cpus:		running count of total active CPUs
+ *			totals are updated every update_ms ("epoch")
+ * @seq:		protects updates to parameters
  * @delay_us:		maximum delay
- * @update_ms:		how often to update delay (epoch duration)
+ * @target_intr_rate:	target maximum interrupt rate
+ * @hardirq_percent:	target maximum hardirq percentage
+ * @update_ms:		how often to update delay/rate/fraction (epoch duration)
+ * @increase_factor:	constant for exponential increase/decrease of delay
+ * @scale_cpus:		(percent) scale factor to estimate active CPUs
  */
 struct irq_mod_info {
+	/* These fields are written to by all CPUs every epoch. */
+	____cacheline_aligned
+	atomic_t	total_intrs;
+	atomic_t	total_cpus;
+
+	/* These are mostly read (frequently), so use a different cacheline. */
+	____cacheline_aligned
+	seqlock_t	seq;
 	u32		delay_us;
+	u32		target_intr_rate;
+	u32		hardirq_percent;
 	u32		update_ms;
+	u32		increase_factor;
+	u32		scale_cpus;
 	u32		params_end[];
 };
 
@@ -43,8 +63,22 @@ extern struct irq_mod_info irq_mod_info;
 * @descs:		list of moderated irq_desc on this CPU
 * @enqueue:		how many enqueues on the list
  *
+ * Used once per epoch:
+ * @seq:		latest seq from irq_mod_info
+ * @delay_ns:		fetched from irq_mod_info
+ * @epoch_ns:		duration of last epoch
+ * @last_total_intrs:	from irq_mod_info
+ * @last_total_cpus:	from irq_mod_info
+ * @last_irqtime:	from cpustat[CPUTIME_IRQ]
+ *
  * Statistics
+ * @global_intr_rate:	smoothed global interrupt rate
+ * @local_intr_rate:	smoothed interrupt rate for this CPU
  * @timer_set:		how many timer_set calls
+ * @scaled_cpu_count:	smoothed CPU count (scaled)
+ * @global_irq_high:	how many times global irq rate was above threshold
+ * @local_irq_high:	how many times local irq rate was above threshold
+ * @hardirq_high:	how many times local hardirq_percent was above threshold
  */
 struct irq_mod_state {
 	struct hrtimer		timer;
@@ -57,14 +91,29 @@ struct irq_mod_state {
 	u32			stray_irq;
 	struct list_head	descs;
 	u32			enqueue;
+	u32			seq;
+	u32			delay_ns;
+	u32			epoch_ns;
+	u32			last_total_intrs;
+	u32			last_total_cpus;
+	u64			last_irqtime;
+	u32			global_intr_rate;
+	u32			local_intr_rate;
 	u32			timer_set;
+	u32			scaled_cpu_count;
+	u32			global_irq_high;
+	u32			local_irq_high;
+	u32			hardirq_high;
 };
 
 DECLARE_PER_CPU_ALIGNED(struct irq_mod_state, irq_mod_state);
 
+#define MIN_SCALING_FACTOR 8u
+
 extern struct static_key_false irq_moderation_enabled_key;
 
 bool irq_moderation_do_start(struct irq_desc *desc, struct irq_mod_state *m);
+void irq_moderation_update_epoch(struct irq_mod_state *m);
 
 static inline void check_epoch(struct irq_mod_state *m)
 {
@@ -74,9 +123,10 @@ static inline void check_epoch(struct irq_mod_state *m)
 	/* Run approximately every update_ns, a little bit early is ok. */
 	if (epoch_ns < m->update_ns - slack_ns)
 		return;
-	/* Fetch updated parameters. */
-	m->update_ns = READ_ONCE(irq_mod_info.update_ms) * NSEC_PER_MSEC;
-	m->mod_ns = READ_ONCE(irq_mod_info.delay_us) * NSEC_PER_USEC;
+	m->epoch_ns = min(epoch_ns, (u64)U32_MAX);
+	atomic64_set(&m->epoch_start_ns, now);
+	/* Do the expensive processing */
+	irq_moderation_update_epoch(m);
 }
 
 /*
-- 
2.52.0.457.g6b5491de43-goog


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH v4 2/3] genirq: Fixed-delay Global Software Interrupt Moderation (GSIM)
  2026-01-15 15:59 ` [PATCH v4 2/3] genirq: Fixed-delay Global Software Interrupt Moderation (GSIM) Luigi Rizzo
@ 2026-01-15 20:52   ` kernel test robot
  2026-01-15 21:39   ` Thomas Gleixner
  2026-01-15 23:53   ` kernel test robot
  2 siblings, 0 replies; 11+ messages in thread
From: kernel test robot @ 2026-01-15 20:52 UTC (permalink / raw)
  To: Luigi Rizzo, Thomas Gleixner, Marc Zyngier, Luigi Rizzo,
	Paolo Abeni, Andrew Morton, Sean Christopherson, Jacob Pan
  Cc: oe-kbuild-all, Linux Memory Management List, linux-kernel,
	linux-arch, Bjorn Helgaas, Willem de Bruijn

Hi Luigi,

kernel test robot noticed the following build warnings:

[auto build test WARNING on linus/master]
[also build test WARNING on v6.19-rc5]
[cannot apply to tip/irq/core next-20260115]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Luigi-Rizzo/genirq-Add-flags-for-software-interrupt-moderation/20260116-000808
base:   linus/master
patch link:    https://lore.kernel.org/r/20260115155942.482137-3-lrizzo%40google.com
patch subject: [PATCH v4 2/3] genirq: Fixed-delay Global Software Interrupt Moderation (GSIM)
config: m68k-allyesconfig (https://download.01.org/0day-ci/archive/20260116/202601160402.Z1oBJhge-lkp@intel.com/config)
compiler: m68k-linux-gcc (GCC) 15.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260116/202601160402.Z1oBJhge-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202601160402.Z1oBJhge-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> kernel/irq/irq_moderation.c:217: warning: ignoring '#pragma clang diagnostic' [-Wunknown-pragmas]
     217 | #pragma clang diagnostic error "-Wformat"


vim +217 kernel/irq/irq_moderation.c

   216	
 > 217	#pragma clang diagnostic error "-Wformat"
   218	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH v4 3/3] genirq: Adaptive Global Software Interrupt Moderation (GSIM).
  2026-01-15 15:59 ` [PATCH v4 3/3] genirq: Adaptive " Luigi Rizzo
@ 2026-01-15 21:14   ` kernel test robot
  2026-01-15 22:09   ` kernel test robot
  2026-01-15 22:24   ` Thomas Gleixner
  2 siblings, 0 replies; 11+ messages in thread
From: kernel test robot @ 2026-01-15 21:14 UTC (permalink / raw)
  To: Luigi Rizzo, Thomas Gleixner, Marc Zyngier, Luigi Rizzo,
	Paolo Abeni, Andrew Morton, Sean Christopherson, Jacob Pan
  Cc: oe-kbuild-all, Linux Memory Management List, linux-kernel,
	linux-arch, Bjorn Helgaas, Willem de Bruijn

Hi Luigi,

kernel test robot noticed the following build warnings:

[auto build test WARNING on linus/master]
[also build test WARNING on v6.19-rc5]
[cannot apply to tip/irq/core next-20260115]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Luigi-Rizzo/genirq-Add-flags-for-software-interrupt-moderation/20260116-000808
base:   linus/master
patch link:    https://lore.kernel.org/r/20260115155942.482137-4-lrizzo%40google.com
patch subject: [PATCH v4 3/3] genirq: Adaptive Global Software Interrupt Moderation (GSIM).
config: mips-randconfig-r051-20260116 (https://download.01.org/0day-ci/archive/20260116/202601160433.z2rdE9lE-lkp@intel.com/config)
compiler: clang version 22.0.0git (https://github.com/llvm/llvm-project 9b8addffa70cee5b2acc5454712d9cf78ce45710)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260116/202601160433.z2rdE9lE-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202601160433.z2rdE9lE-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> kernel/irq/irq_moderation.c:432:40: warning: overflow in expression; result is 1'410'065'408 with type 'long' [-Winteger-overflow]
     432 |                 recent = (now - epoch_start_ns) < 10 * NSEC_PER_SEC;
         |                                                   ~~~^~~~~~~~~~~~~~
   1 warning generated.


vim +432 kernel/irq/irq_moderation.c

   410	
   411	/* Print statistics */
   412	static void rd_stats(struct seq_file *p)
   413	{
   414		ulong global_intr_rate = 0, global_irq_high = 0;
   415		ulong local_irq_high = 0, hardirq_high = 0;
   416		uint delay_us = irq_mod_info.delay_us;
   417		u64 now = ktime_get_ns();
   418		int cpu, active_cpus = 0;
   419	
   420		seq_printf(p, HEAD_FMT,
   421			   "# CPU", "irq/s", "loc_irq/s", "cpus", "delay_ns",
   422			   "irq_hi", "loc_irq_hi", "hardirq_hi", "timer_set",
   423			   "enqueue", "stray_irq");
   424	
   425		for_each_possible_cpu(cpu) {
   426			struct irq_mod_state cur, *m = per_cpu_ptr(&irq_mod_state, cpu);
   427			u64 epoch_start_ns;
   428			bool recent;
   429	
   430			/* Accumulate and print only recent samples */
   431			epoch_start_ns = atomic64_read(&m->epoch_start_ns);
 > 432			recent = (now - epoch_start_ns) < 10 * NSEC_PER_SEC;
   433	
   434			/* Copy statistics, will only use some 32bit values, races ok. */
   435			data_race(cur = *per_cpu_ptr(&irq_mod_state, cpu));
   436			if (recent) {
   437				active_cpus++;
   438				global_intr_rate += cur.global_intr_rate;
   439			}
   440	
   441			global_irq_high += cur.global_irq_high;
   442			local_irq_high += cur.local_irq_high;
   443			hardirq_high += cur.hardirq_high;
   444	
   445			seq_printf(p, BODY_FMT,
   446				   cpu,
   447				   recent * cur.global_intr_rate,
   448				   recent * cur.local_intr_rate,
   449				   recent * (cur.scaled_cpu_count + 128) / 256,
   450				   recent * cur.mod_ns,
   451				   cur.global_irq_high,
   452				   cur.local_irq_high,
   453				   cur.hardirq_high,
   454				   cur.timer_set,
   455				   cur.enqueue,
   456				   cur.stray_irq);
   457		}
   458	
   459		seq_printf(p, "\n"
   460			   "enabled              %s\n"
   461			   "delay_us             %u\n"
   462			   "target_intr_rate     %u\n"
   463			   "hardirq_percent      %u\n"
   464			   "update_ms            %u\n"
   465			   "scale_cpus           %u\n",
   466			   str_yes_no(delay_us > 0),
   467			   delay_us,
   468			   irq_mod_info.target_intr_rate, irq_mod_info.hardirq_percent,
   469			   irq_mod_info.update_ms, irq_mod_info.scale_cpus);
   470	
   471		seq_printf(p,
   472			   "intr_rate            %lu\n"
   473			   "irq_high             %lu\n"
   474			   "my_irq_high          %lu\n"
   475			   "hardirq_percent_high %lu\n"
   476			   "total_interrupts     %u\n"
   477			   "total_cpus           %u\n",
   478			   active_cpus ? global_intr_rate / active_cpus : 0,
   479			   global_irq_high, local_irq_high, hardirq_high,
   480			   READ_ONCE(*((u32 *)&irq_mod_info.total_intrs)),
   481			   READ_ONCE(*((u32 *)&irq_mod_info.total_cpus)));
   482	}
   483	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH v4 2/3] genirq: Fixed-delay Global Software Interrupt Moderation (GSIM)
  2026-01-15 15:59 ` [PATCH v4 2/3] genirq: Fixed-delay Global Software Interrupt Moderation (GSIM) Luigi Rizzo
  2026-01-15 20:52   ` kernel test robot
@ 2026-01-15 21:39   ` Thomas Gleixner
  2026-01-15 23:53   ` kernel test robot
  2 siblings, 0 replies; 11+ messages in thread
From: Thomas Gleixner @ 2026-01-15 21:39 UTC (permalink / raw)
  To: Luigi Rizzo, Marc Zyngier, Luigi Rizzo, Paolo Abeni,
	Andrew Morton, Sean Christopherson, Jacob Pan
  Cc: linux-kernel, linux-arch, Bjorn Helgaas, Willem de Bruijn,
	Luigi Rizzo

On Thu, Jan 15 2026 at 15:59, Luigi Rizzo wrote:
> Some platforms show reduced I/O performance when the total device
> interrupt rate across the entire platform becomes too high.
>
> Interrupt moderation can be used to rate-limit individual interrupts,
> and as a consequence also the total rate. Not all devices implement
> moderation in hardware, or they may have impractically coarse granularity
> (e.g. NVME specifies 100us).
>
> GSIM implements interrupt moderation in software, allowing control of
> interrupt rates even without hardware support. It also provides a building
> block for more sophisticated, adaptive mechanisms.
>
> Moderation is enabled/disabled per interrupt source with
>
>   echo 1 > /proc/irq/NN/soft_moderation # use 0 to disable
>
> For interrupts with moderation enabled, the delay is fixed, and equal
> for all.  It is configured via procfs (range 0-500, 0 means disabled):
>
>   echo delay_us=XX > /proc/irq/soft_moderation

Interesting that you can successfully write into a directory...

> Per CPU statistics on how often moderation is used are available via
>
>   cat /proc/irq/soft_moderation/stats
>
> +/**
> + * struct irq_desc_mod - interrupt moderation information
> + * @ms_node:		per-CPU list of moderated irq_desc

How is that node a per-CPU list?

> + */
> +struct irq_desc_mod {
> +#ifdef CONFIG_IRQ_SW_MODERATION
> +	struct list_head	ms_node;
> +#endif
> +};

> +/* Actually start moderation. */
> +bool irq_moderation_do_start(struct irq_desc *desc, struct irq_mod_state *m)
> +{
> +	lockdep_assert_irqs_disabled();

Rather assert that desc::lock is held, which implies interrupts disabled.

> +	if (!hrtimer_is_queued(&m->timer)) {
> +		const u32 min_delay_ns = 10000;
> +		const u64 slack_ns = 2000;
> +
> +		/* Accumulate sleep time, no moderation if too small. */
> +		m->sleep_ns += m->mod_ns;
> +		if (m->sleep_ns < min_delay_ns)
> +			return false;
> +		/* We need moderation, start the timer. */
> +		m->timer_set++;
> +		hrtimer_start_range_ns(&m->timer, ns_to_ktime(m->sleep_ns),
> +				       slack_ns, HRTIMER_MODE_REL_PINNED_HARD);
> +	}
> +
> +	/*
> +	 * Add to the timer list and __disable_irq() to prevent serving subsequent

__disable_irq() is an interesting new verb.

> +	 * interrupts.
> +	 */
> +	if (!list_empty(&desc->mod_state.ms_node)) {
> +		/* Very unlikely, stray interrupt while moderated. */
> +		m->stray_irq++;

How does that happen? handle_..._irq() returns early when the moderated
flag is set, no? If at all then this needs a WARN_ON_ONCE() because then
the underlying logic is broken.

> +	} else {
> +		m->enqueue++;
> +		list_add(&desc->mod_state.ms_node, &m->descs);
> +		__disable_irq(desc);
> +	}
> +	irqd_set(&desc->irq_data, IRQD_IRQ_INPROGRESS | IRQD_IRQ_MODERATED);
> +	return true;
> +}
> +
> +/* Initialize moderation state, used in desc_set_defaults() */
> +void irq_moderation_init_fields(struct irq_desc_mod *mod_state)
> +{
> +	INIT_LIST_HEAD(&mod_state->ms_node);
> +}

Why does that have to be a function and not a trivial inline?

> +
> +static int set_mode(struct irq_desc *desc, bool enable)
> +{
> +	struct irq_data *irqd = &desc->irq_data;
> +	struct irq_chip *chip = irqd->chip;
> +
> +	lockdep_assert_held(&desc->lock);
> +
> +	if (!enable) {
> +		irq_settings_clr_and_set(desc, _IRQ_SW_MODERATION, 0);
> +		return 0;
> +	}
> +
> +	/* Moderation is only supported in specific cases. */
> +	enable &= !irqd_is_level_type(irqd);
> +	enable &= irqd_is_single_target(irqd);
> +	enable &= !chip->irq_bus_lock && !chip->irq_bus_sync_unlock;
> +	enable &= chip->irq_mask && chip->irq_unmask;
> +	enable &= desc->handle_irq == handle_edge_irq || desc->handle_irq == handle_fasteoi_irq;
> +	if (!enable)
> +		return -EOPNOTSUPP;
> +
> +	irq_settings_clr_and_set(desc, 0, _IRQ_SW_MODERATION);
> +	return 0;
> +}
> +
> +/* Helpers to set and clamp values from procfs or at init. */
> +struct swmod_param {
> +	const char	*name;
> +	int		(*wr)(struct swmod_param *n, const char __user *s, size_t count);
> +	void		(*rd)(struct seq_file *p);
> +	void		*val;

Why is this a void pointer if all parameters are u32? They better
are because the min/max values are too. But then you would not have to
type cast @val in the read and write functions, which is superior over
proper data types, right?

Aside from that, all of this should be unsigned int and not u32 because none
of this is hardware related.

> +	u32		min;
> +	u32		max;
> +};

I'm really not convinced that any of this is actually an improvement
over properly separated write functions.

> +
> +static int swmod_wr_u32(struct swmod_param *n, const char __user *s, size_t count)
> +{
> +	u32 res;
> +	int ret = kstrtouint_from_user(s, count, 0, &res);
> +
> +	if (!ret) {
> +		WRITE_ONCE(*(u32 *)(n->val), clamp(res, n->min, n->max));

This is just wrong. If the value is outside of the valid region, then
this needs to be rejected and not silently clamped.

> +		ret = count;
> +	}
> +	return ret;
> +}
> +
> +static void swmod_rd_u32(struct seq_file *p)
> +{
> +	struct swmod_param *n = p->private;
> +
> +	seq_printf(p, "%u\n", *(u32 *)(n->val));
> +}
> +
> +static int swmod_wr_delay(struct swmod_param *n, const char __user *s, size_t count)
> +{
> +	int ret = swmod_wr_u32(n, s, count);
> +
> +	if (ret >= 0)
> +		update_enable_key();
> +	return ret;
> +}
> +
> +#define HEAD_FMT "%5s  %8s  %11s  %11s  %9s\n"
> +#define BODY_FMT "%5u  %8u  %11u  %11u  %9u\n"
> +
> +#pragma clang diagnostic error "-Wformat"

I told you before that this is bogus:

kernel/irq/irq_moderation.c:217: warning: ignoring ‘#pragma clang diagnostic’ [-Wunknown-pragmas]
  217 | #pragma clang diagnostic error "-Wformat"

Not everyone uses clang.

> +/* Print statistics */
> +static void rd_stats(struct seq_file *p)
> +{
> +	uint delay_us = irq_mod_info.delay_us;
> +	int cpu;
> +
> +	seq_printf(p, HEAD_FMT,
> +		   "# CPU", "delay_ns", "timer_set", "enqueue", "stray_irq");
> +
> +	for_each_possible_cpu(cpu) {
> +		struct irq_mod_state cur;
> +
> +		/* Copy statistics, will only use some 32bit values, races ok. */
> +		data_race(cur = *per_cpu_ptr(&irq_mod_state, cpu));
> +		seq_printf(p, BODY_FMT,
> +			   cpu, cur.mod_ns, cur.timer_set, cur.enqueue, cur.stray_irq);
> +	}
> +
> +	seq_printf(p, "\n"
> +		   "enabled              %s\n"
> +		   "delay_us             %u\n",
> +		   str_yes_no(delay_us > 0),
> +		   delay_us);

Again: why is this printing anything when it's disabled? There is zero
value for that.

> +}
> +
> +static int moderation_show(struct seq_file *p, void *v)
> +{
> +	struct swmod_param *n = p->private;
> +
> +	if (!n || !n->rd)
> +		return -EINVAL;

How can either of these conditions be true?

> +	n->rd(p);
> +	return 0;
> +}
> +
> +static int moderation_open(struct inode *inode, struct file *file)
> +{
> +	return single_open(file, moderation_show, pde_data(inode));
> +}
> +
> +static struct swmod_param param_names[] = {
> +	{ "delay_us", swmod_wr_delay, swmod_rd_u32, &irq_mod_info.delay_us, 0, 500 },
> +	{ "stats", NULL, rd_stats},

C89 struct initializers are broken. Either use proper C99 initializers
or use a macro which hides them away.

> +};
> +
> +static ssize_t moderation_write(struct file *f, const char __user *buf, size_t count, loff_t *ppos)
> +{
> +	struct swmod_param *n = (struct swmod_param *)pde_data(file_inode(f));
> +
> +	return n && n->wr ? n->wr(n, buf, count) : -EINVAL;

How is n = NULL?

> +}
> +
> +static const struct proc_ops proc_ops = {
> +	.proc_open	= moderation_open,
> +	.proc_read	= seq_read,
> +	.proc_lseek	= seq_lseek,
> +	.proc_release	= single_release,
> +	.proc_write	= moderation_write,
> +};
> +
> +/* Handlers for /proc/irq/NN/soft_moderation */
> +static int mode_show(struct seq_file *p, void *v)
> +{
> +	struct irq_desc *desc = p->private;
> +
> +	seq_puts(p, irq_settings_moderation_allowed(desc) ? "on\n" : "off\n");
> +	return 0;
> +}
> +
> +static ssize_t mode_write(struct file *f, const char __user *buf, size_t count, loff_t *ppos)

mode_show() and mode_write() are really as undescriptive as it gets.

> +{
> +	struct irq_desc *desc = (struct irq_desc *)pde_data(file_inode(f));
> +	bool enable;
> +	int ret = kstrtobool_from_user(buf, count, &enable);
> +
> +	if (!ret) {
> +		guard(raw_spinlock_irqsave)(&desc->lock);

raw_spinlock_irq - there is nothing to save. This is thread context.

> +		ret = set_mode(desc, enable);

Why is this set_mode() function 100 lines above instead of being just
next to the usage site? Makes following the code easier, right?

> +	}
> +	return ret ? : count;
> +}


> +/* Used on timer expiration or CPU shutdown. */
> +static void drain_desc_list(struct irq_mod_state *m)
> +{
> +	struct irq_desc *desc, *next;
> +
> +	/* Remove from list and enable interrupts back. */
> +	list_for_each_entry_safe(desc, next, &m->descs, mod_state.ms_node) {
> +		guard(raw_spinlock)(&desc->lock);
> +		list_del_init(&desc->mod_state.ms_node);
> +		irqd_clear(&desc->irq_data, IRQD_IRQ_INPROGRESS | IRQD_IRQ_MODERATED);
> +		/* Protect against competing free_irq(). */

It's not competing. It's concurrent. Can you please use precise wording?

> +		if (desc->action)
> +			__enable_irq(desc);
> +	}
> +}
> +
> +static enum hrtimer_restart timer_callback(struct hrtimer *timer)
> +{
> +	struct irq_mod_state *m = this_cpu_ptr(&irq_mod_state);
> +
> +	lockdep_assert_irqs_disabled();
> +
> +	drain_desc_list(m);
> +	/* Prepare to accumulate next moderation delay. */
> +	m->sleep_ns = 0;
> +	return HRTIMER_NORESTART;
> +}
> +
> +/* Hotplug callback for setup. */
> +static int cpu_setup_cb(uint cpu)

unsigned int

> +{
> +	struct irq_mod_state *m = this_cpu_ptr(&irq_mod_state);
> +
> +	hrtimer_setup(&m->timer, timer_callback,
> +		      CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED_HARD);
> +	INIT_LIST_HEAD(&m->descs);
> +	m->moderation_allowed = true;
> +	return 0;
> +}
> +
> +/*
> + * Hotplug callback for shutdown.
> + * Mark the CPU as offline for moderation, and drain the list of masked
> + * interrupts. Any subsequent interrupt on this CPU will not be
> + * moderated, but they will be on the new target.
> + */
> +static int cpu_remove_cb(uint cpu)

Ditto

> +{
> +	struct irq_mod_state *m = this_cpu_ptr(&irq_mod_state);
> +
> +	guard(irqsave)();

Lacks a comment explaining why this needs to be irqsave. The hotplug
callback does not require that.

> +	m->moderation_allowed = false;
> +	drain_desc_list(m);
> +	hrtimer_cancel(&m->timer);
> +	return 0;
> +}
> +
> +static void(mod_pm_prepare_cb)(void *arg)

What are those brackets around the function name for aside of making the
code unreadable?

> +{
> +	cpu_remove_cb(smp_processor_id());
> +}
> +
> +static void(mod_pm_resume_cb)(void *arg)

Ditto.

> +{
> +	cpu_setup_cb(smp_processor_id());
> +}
> +

> +static void __init clamp_parameter(u32 *dst, u32 val)
> +{
> +	struct swmod_param *n = param_names;
> +
> +	for (int i = 0; i < ARRAY_SIZE(param_names); i++, n++) {
> +		if (dst == n->val) {
> +			WRITE_ONCE(*dst, clamp(val, n->min, n->max));
> +			return;
> +		}
> +	}
> +}

> +static int __init init_irq_moderation(void)
> +{
> +	struct proc_dir_entry *dir;
> +	struct swmod_param *n;
> +	int i;
> +
> +	/* Clamp all initial values to the allowed range. */
> +	for (uint *cur = &irq_mod_info.delay_us; cur < irq_mod_info.params_end; cur++)
> +		clamp_parameter(cur, *cur);

This is required because someone might have put an invalid value into
the struct initializer, right?

> +	cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "irq_moderation", cpu_setup_cb, cpu_remove_cb);

Can fail....

> +	register_pm_notifier(&mod_nb);

Ditto.

> +	update_enable_key();

This is required to make sure it's disabled as it might have enabled
itself magically?

> +	dir = proc_mkdir("irq/soft_moderation", NULL);
> +	if (!dir)
> +		return 0;
> +	for (i = 0, n = param_names; i < ARRAY_SIZE(param_names); i++, n++)
> +		proc_create_data(n->name, n->wr ? 0644 : 0444, dir, &proc_ops, n);

Can fail too.

> +	return 0;
> +}
> +

Pointless new line.

> +device_initcall(init_irq_moderation);

> +/**
> + * struct irq_mod_info - global configuration parameters and state
> + * @delay_us:		maximum delay
> + * @update_ms:		how often to update delay (epoch duration)
> + */
> +struct irq_mod_info {
> +	u32		delay_us;
> +	u32		update_ms;
> +	u32		params_end[];

What is this params_end[] for? A made up substitute for ARRAY_SIZE()?

> +};
> +
> +extern struct irq_mod_info irq_mod_info;

> +static inline void check_epoch(struct irq_mod_state *m)
> +{
> +	const u64 now = ktime_get_ns(), epoch_ns = now - atomic64_read(&m->epoch_start_ns);
> +	const u32 slack_ns = 5000;
> +
> +	/* Run approximately every update_ns, a little bit early is ok. */
> +	if (epoch_ns < m->update_ns - slack_ns)
> +		return;
> +	/* Fetch updated parameters. */
> +	m->update_ns = READ_ONCE(irq_mod_info.update_ms) * NSEC_PER_MSEC;
> +	m->mod_ns = READ_ONCE(irq_mod_info.delay_us) * NSEC_PER_USEC;
> +}

TBH, this is the most convoluted way to handle global parameters which I
have seen in a long time.

First you need the intr_count & 0xF check, then you read time and
subtract the start and have another comparison to avoid reading and
multiplying the global parameters on every invocation.

Just to make it more interesting:

     epoch_start_ns is never set, so this time based rate limit does not
     work at all.

I don't care whether you "fix" this in the next patch or not. Patches
have to be self contained, functional and comprehensible on their own.

That's just completely bonkers. These variables are updated by user
space every now and then, if at all.

So either you read directly from the global struct (you need obviously
nano seconds converted struct members for that) or you update the per
CPU instances from the write functions or you use a sequence counter as
everyone else does.

Thanks,

        tglx


* Re: [PATCH v4 3/3] genirq: Adaptive Global Software Interrupt Moderation (GSIM).
  2026-01-15 15:59 ` [PATCH v4 3/3] genirq: Adaptive " Luigi Rizzo
  2026-01-15 21:14   ` kernel test robot
@ 2026-01-15 22:09   ` kernel test robot
  2026-01-15 22:24   ` Thomas Gleixner
  2 siblings, 0 replies; 11+ messages in thread
From: kernel test robot @ 2026-01-15 22:09 UTC (permalink / raw)
  To: Luigi Rizzo, Thomas Gleixner, Marc Zyngier, Luigi Rizzo,
	Paolo Abeni, Andrew Morton, Sean Christopherson, Jacob Pan
  Cc: oe-kbuild-all, Linux Memory Management List, linux-kernel,
	linux-arch, Bjorn Helgaas, Willem de Bruijn

Hi Luigi,

kernel test robot noticed the following build warnings:

[auto build test WARNING on linus/master]
[also build test WARNING on v6.19-rc5]
[cannot apply to tip/irq/core next-20260115]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Luigi-Rizzo/genirq-Add-flags-for-software-interrupt-moderation/20260116-000808
base:   linus/master
patch link:    https://lore.kernel.org/r/20260115155942.482137-4-lrizzo%40google.com
patch subject: [PATCH v4 3/3] genirq: Adaptive Global Software Interrupt Moderation (GSIM).
config: m68k-allyesconfig (https://download.01.org/0day-ci/archive/20260116/202601160504.h8ijlURr-lkp@intel.com/config)
compiler: m68k-linux-gcc (GCC) 15.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260116/202601160504.h8ijlURr-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202601160504.h8ijlURr-lkp@intel.com/

All warnings (new ones prefixed by >>):

   kernel/irq/irq_moderation.c:409: warning: ignoring '#pragma clang diagnostic' [-Wunknown-pragmas]
     409 | #pragma clang diagnostic error "-Wformat"
   kernel/irq/irq_moderation.c: In function 'rd_stats':
>> kernel/irq/irq_moderation.c:432:54: warning: integer overflow in expression of type 'long int' results in '1410065408' [-Woverflow]
     432 |                 recent = (now - epoch_start_ns) < 10 * NSEC_PER_SEC;
         |                                                      ^


vim +432 kernel/irq/irq_moderation.c


-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH v4 3/3] genirq: Adaptive Global Software Interrupt Moderation (GSIM).
  2026-01-15 15:59 ` [PATCH v4 3/3] genirq: Adaptive " Luigi Rizzo
  2026-01-15 21:14   ` kernel test robot
  2026-01-15 22:09   ` kernel test robot
@ 2026-01-15 22:24   ` Thomas Gleixner
  2026-01-15 22:44     ` Luigi Rizzo
  2 siblings, 1 reply; 11+ messages in thread
From: Thomas Gleixner @ 2026-01-15 22:24 UTC (permalink / raw)
  To: Luigi Rizzo, Marc Zyngier, Luigi Rizzo, Paolo Abeni,
	Andrew Morton, Sean Christopherson, Jacob Pan
  Cc: linux-kernel, linux-arch, Bjorn Helgaas, Willem de Bruijn,
	Luigi Rizzo

On Thu, Jan 15 2026 at 15:59, Luigi Rizzo wrote:
>  /* Initialize moderation state, used in desc_set_defaults() */
>  void irq_moderation_init_fields(struct irq_desc_mod *mod_state)
>  {
> @@ -189,7 +379,9 @@ static int swmod_wr_u32(struct swmod_param *n, const char __user *s, size_t coun
>  	int ret = kstrtouint_from_user(s, count, 0, &res);
>  
>  	if (!ret) {
> +		write_seqlock(&irq_mod_info.seq);

That's broken.

              spin_lock(seq->lock);
              seq->seqcount++;
Interrupt
              ...
              do {
              	read_seqbegin(seq)
                    --> livelock because seqcount is odd

You clearly never ran that code with lockdep enabled because lockdep
would have yelled at you. Testing patches with lockdep is not optional
as documented...

And no, using write_seqlock_irq() won't fix it either as that will still
explode on an RT-enabled kernel as documented...

Also this sequence count magic on the read side does not have any real
value:

> +	if (raw_read_seqcount(&irq_mod_info.seq.seqcount) != m->seq) {

Q: What's the gain of this sequence count magic?

A: Absolutely nothing. Once this pulled the cache line in, then the
   sequence count magic is just cargo cult programming because:

      1) The data is a snapshot in any case as the next update might
         come right after that

      2) Once the cache line is pulled in, reading the _three_
         parameters from it (assumed they have been already converted to
         nanoseconds) is basically free and definitely less expensive
         than the seqbegin/retry loop which comes with expensive SMP
         barriers on some architectures.
         
Sequence counts are only useful when you need a consistent data set and
the write side updates the whole data set in one go.

As this is a write one by one operation, there is no consistency problem
and the concurrency is handled nicely by READ/WRITE_ONCE(). Either it
sees the old value or the new.

If you actually update the per CPU instances in the parameter write
functions, you can avoid touching the global cache line completely, no?

Can you please stop polishing your proof-of-concept code to death and
actually sit back and think about a proper design and implementation?

I don't care at all about you wasting _your_ time, but I very much care
about _my_ time wasted by this.

Thanks,

        tglx


* Re: [PATCH v4 3/3] genirq: Adaptive Global Software Interrupt Moderation (GSIM).
  2026-01-15 22:24   ` Thomas Gleixner
@ 2026-01-15 22:44     ` Luigi Rizzo
  0 siblings, 0 replies; 11+ messages in thread
From: Luigi Rizzo @ 2026-01-15 22:44 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Marc Zyngier, Luigi Rizzo, Paolo Abeni, Andrew Morton,
	Sean Christopherson, Jacob Pan, linux-kernel, linux-arch,
	Bjorn Helgaas, Willem de Bruijn

On Thu, Jan 15, 2026 at 11:24 PM Thomas Gleixner <tglx@kernel.org> wrote:
>
> On Thu, Jan 15 2026 at 15:59, Luigi Rizzo wrote:
> >  /* Initialize moderation state, used in desc_set_defaults() */
...
> Also this sequence count magic on the read side does not have any real
> value:
>
> > +     if (raw_read_seqcount(&irq_mod_info.seq.seqcount) != m->seq) {
>
> Q: What's the gain of this sequence count magic?

The goal is stated in the comment just before the read, see text below.
Surely I can use your suggestion and have the procfs write callback update
all the per-cpu copies and avoid looking for the change on each CPU.

        /*
         * If any of the configuration parameter changes, read the main ones
         * (delay_ns, update_ns), and set the adaptive delay, mod_ns, to the
         * maximum value to help converge.
         * Without that, the system might be already below target_intr_rate
         * because of saturation on the bus (the very problem GSIM is trying
         * to address) and that would block the control loop.
         * Setting mod_ns to the highest value (if chosen properly) can reduce
         * the interrupt rate below target_intr_rate and let the controller
         * gradually reach the target.
         */

cheers
luigi


* Re: [PATCH v4 2/3] genirq: Fixed-delay Global Software Interrupt Moderation (GSIM)
  2026-01-15 15:59 ` [PATCH v4 2/3] genirq: Fixed-delay Global Software Interrupt Moderation (GSIM) Luigi Rizzo
  2026-01-15 20:52   ` kernel test robot
  2026-01-15 21:39   ` Thomas Gleixner
@ 2026-01-15 23:53   ` kernel test robot
  2 siblings, 0 replies; 11+ messages in thread
From: kernel test robot @ 2026-01-15 23:53 UTC (permalink / raw)
  To: Luigi Rizzo, Thomas Gleixner, Marc Zyngier, Luigi Rizzo,
	Paolo Abeni, Andrew Morton, Sean Christopherson, Jacob Pan
  Cc: oe-kbuild-all, Linux Memory Management List, linux-kernel,
	linux-arch, Bjorn Helgaas, Willem de Bruijn

Hi Luigi,

kernel test robot noticed the following build warnings:

[auto build test WARNING on linus/master]
[also build test WARNING on v6.19-rc5]
[cannot apply to tip/irq/core next-20260115]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Luigi-Rizzo/genirq-Add-flags-for-software-interrupt-moderation/20260116-000808
base:   linus/master
patch link:    https://lore.kernel.org/r/20260115155942.482137-3-lrizzo%40google.com
patch subject: [PATCH v4 2/3] genirq: Fixed-delay Global Software Interrupt Moderation (GSIM)
config: arm-randconfig-r123-20260116 (https://download.01.org/0day-ci/archive/20260116/202601160733.QNd6lcMA-lkp@intel.com/config)
compiler: clang version 22.0.0git (https://github.com/llvm/llvm-project 9b8addffa70cee5b2acc5454712d9cf78ce45710)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260116/202601160733.QNd6lcMA-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202601160733.QNd6lcMA-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
>> kernel/irq/irq_moderation.c:404:23: sparse: sparse: symbol 'mod_nb' was not declared. Should it be static?
   kernel/irq/irq_moderation.c:298:9: sparse: sparse: context imbalance in 'mode_write' - different lock contexts for basic block
   kernel/irq/irq_moderation.c:330:9: sparse: sparse: context imbalance in 'drain_desc_list' - different lock contexts for basic block

vim +/mod_nb +404 kernel/irq/irq_moderation.c

   403	
 > 404	struct notifier_block mod_nb = {
   405		.notifier_call	= mod_pm_notifier_cb,
   406		.priority	= 100,
   407	};
   408	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


end of thread, other threads:[~2026-01-15 23:54 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-01-15 15:59 [PATCH-v4 0/3] Global Software Interrupt Moderation (GSIM) Luigi Rizzo
2026-01-15 15:59 ` [PATCH v4 1/3] genirq: Add flags for software interrupt moderation Luigi Rizzo
2026-01-15 15:59 ` [PATCH v4 2/3] genirq: Fixed-delay Global Software Interrupt Moderation (GSIM) Luigi Rizzo
2026-01-15 20:52   ` kernel test robot
2026-01-15 21:39   ` Thomas Gleixner
2026-01-15 23:53   ` kernel test robot
2026-01-15 15:59 ` [PATCH v4 3/3] genirq: Adaptive " Luigi Rizzo
2026-01-15 21:14   ` kernel test robot
2026-01-15 22:09   ` kernel test robot
2026-01-15 22:24   ` Thomas Gleixner
2026-01-15 22:44     ` Luigi Rizzo
