linux-arch.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH-v3 0/3] Global Software Interrupt Moderation (GSIM)
@ 2025-12-17 11:21 Luigi Rizzo
  2025-12-17 11:21 ` [PATCH-v3 1/3] genirq: Fixed " Luigi Rizzo
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Luigi Rizzo @ 2025-12-17 11:21 UTC (permalink / raw)
  To: Thomas Gleixner, Marc Zyngier, Luigi Rizzo, Paolo Abeni,
	Andrew Morton, Sean Christopherson, Jacob Pan
  Cc: linux-kernel, linux-arch, Bjorn Helgaas, Willem de Bruijn,
	Luigi Rizzo

Global Software Interrupt Moderation (GSIM) specifically addresses a
limitation of platforms, from many vendors, whose I/O performance drops
significantly when the total rate of MSI-X interrupts is too high (e.g
1-3M intr/s depending on the platform).

Conventional interrupt moderation, typically implemented in hardware
by NICs or storage devices, operates separately on each source (e.g. a
completion queue). Large servers can have hundreds of sources, and without
knowledge of global activity, keeping the total rate bounded would require
moderation delays of 100-200us, and adaptive moderation would have to
reach those delays with as little as 10K intr/s per source. These values
are unacceptable for RPC or transactional workloads.

To address this problem, GSIM measures efficiently the total and
per-CPU interrupt rates, so that individual moderation delays can be
dynamically adjusted based on actual global and local load. This way,
delays are normally 0 or very small except during actual
local/global overload.

As an additional benefit, GSIM also monitors the percentage of time
spent by each CPU in hardirq, and can use moderation to reserve some
time for other, lower priority, tasks.

Configuration is easy and robust. System administrators specify the
maximum targets (moderation delay; interrupt rate; and percentage of time
spent in hardirq), and which interrupt sources should be moderated (can be
done per-interrupt, per device, or globally). Independent per-CPU control
loops adjust actual delays to try and keep metrics within the targets.

The system is adaptive, and moderation does not affect throughput but
only latency and only in high load scenarios. Hence, targets don't need
to match precisely the platform limits, and one can make conservative
and robust choices. Values like delay_us=100, target_irq_rate=1000000,
hardirq_percent=70 are a very good starting point.

GSIM does not rely on any special hardware feature.

Defaults are set at boot via module parameters

    irq_moderation.${NAME}=${VALUE}

and can be changed runtime with

    echo ${NAME}=${VALUE} /proc/irq/soft_moderation

/proc/irq/soft_moderation is also used to export statistics.

Moderation on individual interrupts can be turned on/off at runtime with

    echo 1 > /proc/irq/NN/moderation  # use 0 to disable

PERFORMANCE BENEFITS:
Below are some experimental results under high load comparing conventional
moderation with GSIM:

- 100Gbps NIC, 32 queues: rx goes from 50 Gbps to 92.8 Gbps (line rate).
- 200Gbps NIC, 10 VMs (total 160 queues): rx goes from 30 Gbps to 190 Gbps (line rate).
- 12 SSD, 96 queues: 4K random read goes from 6M to 20.5M IOPS (device max).

In all cases, latency up to p95 is unaffected at low/moderate load,
even if compared with no moderation at all.

Changes in v3:
- clearly documented architecture in kernel/irq/irq_moderation.c
  including how to handle enable/disable/mask, interrupt migration,
  hotplug and suspend.
- split implementation in 4 files irq_moderation.[ch] and
  irq_moderation_hook.[ch] for better separation of control plane and
  "dataplane" (functions ran on each interrupt)
- limited scope to handle_edge_irq() and handle_fasteoi_irq() which
  have been tested on actual hardware.
- tested on Intel (also with intremap=posted_msi), AMD, ARM, with NIC,
  nvme, vfio

Changes in v2:
- many style fixes (mostly on comments) based on reviewers' comments on v1
- removed background from Documentation/core-api/irq/irq-moderation.rst
- split procfs handlers
- moved internal details to kernel/irq/irq_moderation.h
- use cpu hotplug for per-CPU setup, removed unnecessary arch-specific changes
- select suitable irqs based on !irqd_is_level_type(irqd) && irqd_is_single_target(irqd)
- use a static_key to enable/disable the feature

There are two open comments from v1 for which I would like maintainer's
clarifications

- handle_irq_event() calls irq_moderation_hook() after releasing the lock,
  so it can call disable_irq_nosync(). It may be possible to move the
  call before releasing the lock and use __disable_irq(). I am not sure
  if there is any benefit in making the change.

- the timer callback calls handle_irq_event_percpu(desc) on moderated
  irqdesc (which have irqd_irq_disabled(irqd) == 1) without changing
  IRQD_IRQ_INPROGRESS. I am not sure if this should be protected with
  the following, and especially where it would make a difference
  (specifically because that the desc is disabled during this sequence).

     raw_spin_lock(&desc->lock);
     irqd_set(&desc->irq_data, IRQD_IRQ_INPROGRESS);
     raw_spin_unlock(&desc->lock)

     handle_irq_event_percpu(desc); // <--

     raw_spin_lock(&desc->lock);
     irqd_clear(&desc->irq_data, IRQD_IRQ_INPROGRESS);
     raw_spin_unlock(&desc->lock)


Luigi Rizzo (3):
  genirq: Fixed Global Software Interrupt Moderation (GSIM)
  genirq: Adaptive Global Software Interrupt Moderation (GSIM)
  genirq: Configurable default mode for GSIM

 include/linux/irqdesc.h          |  28 ++
 kernel/irq/Kconfig               |  12 +
 kernel/irq/Makefile              |   1 +
 kernel/irq/chip.c                |  16 +-
 kernel/irq/irq_moderation.c      | 613 +++++++++++++++++++++++++++++++
 kernel/irq/irq_moderation.h      | 135 +++++++
 kernel/irq/irq_moderation_hook.c | 157 ++++++++
 kernel/irq/irq_moderation_hook.h | 102 +++++
 kernel/irq/irqdesc.c             |   1 +
 kernel/irq/manage.c              |   4 +
 kernel/irq/proc.c                |   3 +
 11 files changed, 1071 insertions(+), 1 deletion(-)
 create mode 100644 kernel/irq/irq_moderation.c
 create mode 100644 kernel/irq/irq_moderation.h
 create mode 100644 kernel/irq/irq_moderation_hook.c
 create mode 100644 kernel/irq/irq_moderation_hook.h

-- 
2.52.0.305.g3fc767764a-goog


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH-v3 1/3] genirq: Fixed Global Software Interrupt Moderation (GSIM)
  2025-12-17 11:21 [PATCH-v3 0/3] Global Software Interrupt Moderation (GSIM) Luigi Rizzo
@ 2025-12-17 11:21 ` Luigi Rizzo
  2025-12-17 19:17   ` Randy Dunlap
                     ` (2 more replies)
  2025-12-17 11:21 ` [PATCH-v3 2/3] genirq: Adaptive " Luigi Rizzo
  2025-12-17 11:21 ` [PATCH-v3 3/3] genirq: Configurable default mode for GSIM Luigi Rizzo
  2 siblings, 3 replies; 11+ messages in thread
From: Luigi Rizzo @ 2025-12-17 11:21 UTC (permalink / raw)
  To: Thomas Gleixner, Marc Zyngier, Luigi Rizzo, Paolo Abeni,
	Andrew Morton, Sean Christopherson, Jacob Pan
  Cc: linux-kernel, linux-arch, Bjorn Helgaas, Willem de Bruijn,
	Luigi Rizzo

Some platforms show reduced I/O performance when the total device
interrupt rate across the entire platform becomes too high.

Interrupt moderation can be used to rate-limit individual interrupts,
and as a consequence also the total rate. Not all devices implement
moderation in hardware, or they may have impractically coarse granularity
(e.g. NVME specifies 100us).

GSIM implements interrupt moderation in software, allowing control of
interrupt rates even without hardware support. It als provides a building
block for more sophisticated, adaptive mechanisms.

Moderation is enabled/disabled per interrupt source with

    echo 1 > /proc/irq/NN/soft_moderation # use 0 to disable

In this version the delay is fixed, and equal for all interrupts with
moderation enabled. It is set at boot via module parameter

    irq_moderation.delay_us=XX  # range 0-500, 0 means disabled

and can be updated at runtime with

    echo delay_us=XX > /proc/irq/soft_moderation

Per CPU statistics on how often moderation is used are available via

    cat /proc/irq/soft_moderation

GSIM is limited to edge interrupts using handle_edge_irq() or
handle_fasteoi_irq(). It has been tested on Intel (including with
intremap=posted_msi), AMD, ARM with NIC, NVME and VFIO.

PERFORMANCE BENEFITS:
Below are some experimental results under high load (before/after rates
are measured with conventional moderation or with this change):

- 100Gbps NIC, 32 queues: rx goes from 50 Gbps to 92.8 Gbps (line rate).
- 200Gbps NIC, 10 VMs (total 160 queues): rx goes from 30 Gbps to 190 Gbps (line rate).
- 12 SSD, 96 queues: 4K random read goes from 6M to 20.5M IOPS (device max).

In all cases, latency up to p95 is unaffected at low/moderate load,
even if compared with no moderation at all

On Intel, intremap=posted_msi requires the fix in
https://lore.kernel.org/lkml/20251125214631.044440658@linutronix.de/

Change-Id: I7c321ef5d295cad6ab4a87f2fa79ce8c85c7bf9a
Signed-off-by: Luigi Rizzo <lrizzo@google.com>
---
 include/linux/irqdesc.h          |  23 ++
 kernel/irq/Kconfig               |  12 +
 kernel/irq/Makefile              |   1 +
 kernel/irq/chip.c                |  16 +-
 kernel/irq/irq_moderation.c      | 414 +++++++++++++++++++++++++++++++
 kernel/irq/irq_moderation.h      |  93 +++++++
 kernel/irq/irq_moderation_hook.h | 104 ++++++++
 kernel/irq/irqdesc.c             |   1 +
 kernel/irq/proc.c                |   3 +
 9 files changed, 666 insertions(+), 1 deletion(-)
 create mode 100644 kernel/irq/irq_moderation.c
 create mode 100644 kernel/irq/irq_moderation.h
 create mode 100644 kernel/irq/irq_moderation_hook.h

diff --git a/include/linux/irqdesc.h b/include/linux/irqdesc.h
index 17902861de76d..d42ce57ff5406 100644
--- a/include/linux/irqdesc.h
+++ b/include/linux/irqdesc.h
@@ -17,6 +17,27 @@ struct irq_desc;
 struct irq_domain;
 struct pt_regs;
 
+#ifdef CONFIG_IRQ_SOFT_MODERATION
+
+/**
+ * struct irq_desc_mod - interrupt moderation information
+ * @ms_node:		per-CPU list of moderated irq_desc
+ * @enable:		enable moderation on this descriptor
+ */
+struct irq_desc_mod {
+	struct list_head	ms_node;
+	bool			enable;
+};
+
+void irq_moderation_init_fields(struct irq_desc_mod *mod);
+
+#else
+
+struct irq_desc_mod {};
+static inline void irq_moderation_init_fields(struct irq_desc_mod *mod) {}
+
+#endif
+
 /**
  * struct irqstat - interrupt statistics
  * @cnt:	real-time interrupt count
@@ -46,6 +67,7 @@ struct irqstat {
  * @threads_handled:	stats field for deferred spurious detection of threaded handlers
  * @threads_handled_last: comparator field for deferred spurious detection of threaded handlers
  * @lock:		locking for SMP
+ * @mod:		moderation state
  * @affinity_hint:	hint to user space for preferred irq affinity
  * @affinity_notify:	context for notification of affinity changes
  * @pending_mask:	pending rebalanced interrupts
@@ -81,6 +103,7 @@ struct irq_desc {
 	atomic_t		threads_handled;
 	int			threads_handled_last;
 	raw_spinlock_t		lock;
+	struct irq_desc_mod	mod;
 	struct cpumask		*percpu_enabled;
 #ifdef CONFIG_SMP
 	const struct cpumask	*affinity_hint;
diff --git a/kernel/irq/Kconfig b/kernel/irq/Kconfig
index 1b4254d19a73e..c258973b63459 100644
--- a/kernel/irq/Kconfig
+++ b/kernel/irq/Kconfig
@@ -155,6 +155,18 @@ config IRQ_KUNIT_TEST
 
 endmenu
 
+# Support platform wide software interrupt moderation
+config IRQ_SOFT_MODERATION
+	bool "Enable platform wide software interrupt moderation"
+	depends on PROC_FS=y
+	help
+	  Enable platform wide software interrupt moderation.
+	  Uses a local timer to delay interrupts in configurable ways
+	  and depending on various global system load indicators
+	  and targets.
+
+	  If you don't know what to do here, say N.
+
 config GENERIC_IRQ_MULTI_HANDLER
 	bool
 	help
diff --git a/kernel/irq/Makefile b/kernel/irq/Makefile
index 6ab3a40556670..c06da43d644f2 100644
--- a/kernel/irq/Makefile
+++ b/kernel/irq/Makefile
@@ -9,6 +9,7 @@ obj-$(CONFIG_GENERIC_IRQ_CHIP) += generic-chip.o
 obj-$(CONFIG_GENERIC_IRQ_PROBE) += autoprobe.o
 obj-$(CONFIG_IRQ_DOMAIN) += irqdomain.o
 obj-$(CONFIG_IRQ_SIM) += irq_sim.o
+obj-$(CONFIG_IRQ_SOFT_MODERATION) += irq_moderation.o
 obj-$(CONFIG_PROC_FS) += proc.o
 obj-$(CONFIG_GENERIC_PENDING_IRQ) += migration.o
 obj-$(CONFIG_GENERIC_IRQ_MIGRATION) += cpuhotplug.o
diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
index 678f094d261a7..56faa5260b858 100644
--- a/kernel/irq/chip.c
+++ b/kernel/irq/chip.c
@@ -18,6 +18,7 @@
 #include <trace/events/irq.h>
 
 #include "internals.h"
+#include "irq_moderation_hook.h"
 
 static irqreturn_t bad_chained_irq(int irq, void *dev_id)
 {
@@ -739,6 +740,13 @@ void handle_fasteoi_irq(struct irq_desc *desc)
 
 	guard(raw_spinlock)(&desc->lock);
 
+	if (irq_moderation_skip_moderated(desc)) {
+		mask_irq(desc);
+		desc->istate |= IRQS_PENDING;
+		cond_eoi_irq(chip, &desc->irq_data);
+		return;
+	}
+
 	/*
 	 * When an affinity change races with IRQ handling, the next interrupt
 	 * can arrive on the new CPU before the original CPU has completed
@@ -765,6 +773,9 @@ void handle_fasteoi_irq(struct irq_desc *desc)
 
 	cond_unmask_eoi_irq(desc, chip);
 
+	if (irq_moderation_hook(desc))
+		return;
+
 	/*
 	 * When the race described above happens this will resend the interrupt.
 	 */
@@ -824,7 +835,7 @@ void handle_edge_irq(struct irq_desc *desc)
 {
 	guard(raw_spinlock)(&desc->lock);
 
-	if (!irq_can_handle(desc)) {
+	if (irq_moderation_skip_moderated(desc) || !irq_can_handle(desc)) {
 		desc->istate |= IRQS_PENDING;
 		mask_ack_irq(desc);
 		return;
@@ -854,6 +865,9 @@ void handle_edge_irq(struct irq_desc *desc)
 
 		handle_irq_event(desc);
 
+		if (irq_moderation_hook(desc))
+			break;
+
 	} while ((desc->istate & IRQS_PENDING) && !irqd_irq_disabled(&desc->irq_data));
 }
 EXPORT_SYMBOL(handle_edge_irq);
diff --git a/kernel/irq/irq_moderation.c b/kernel/irq/irq_moderation.c
new file mode 100644
index 0000000000000..2a7401bc51da2
--- /dev/null
+++ b/kernel/irq/irq_moderation.c
@@ -0,0 +1,414 @@
+// SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause
+
+#include <linux/cpuhotplug.h>
+#include <linux/glob.h>
+#include <linux/irq.h>
+#include <linux/irqdesc.h>
+#include <linux/kallsyms.h>
+#include <linux/module.h>
+#include <linux/notifier.h>
+#include <linux/proc_fs.h>
+#include <linux/suspend.h>
+#include <linux/seq_file.h>
+
+#include "internals.h"
+#include "irq_moderation.h"
+
+/*
+ * Management code for Global Software Interrupt Moderation (GSIM)
+ *
+ * Some platforms show reduced I/O performance when the total device interrupt
+ * rate across the entire platform becomes too high. To address the problem,
+ * GSIM uses a hook after running the handler to implement software interrupt
+ * moderation with programmable delay.
+ *
+ * Configuration is controlled at boot time via module parameters
+ *
+ *     irq_moderation.${NAME}=${VALUE}
+ *
+ * or at runtime via /proc/irq/soft_moderation
+ *
+ *     echo "${NAME}=${VALUE}" > /proc/irq/soft_moderation
+ *
+ * Supported parameters:
+ *
+ *   delay_us (default 0, suggested 100, 0 off, range 0-500)
+ *       Maximum moderation delay. A reasonable range is 20-100. Higher values
+ *       can be useful if the hardirq handler has long runtimes.
+ *
+ * Moderation can be enabled/disabled dynamically for individual interrupts with
+ *
+ *     echo 1 > /proc/irq/NN/soft_moderation # use 0 to disable
+ *
+ * === MONITORING ===
+ *
+ * cat /proc/irq/soft_moderation shows configuration and per-CPU/global statistics.
+ *
+ * === ARCHITECTURE ===
+ *
+ * INTERRUPT HANDLING:
+ * - irq_moderation_hook() runs under desc->lock right after the interrupt handler.
+ *   If the interrupt must be moderated, it sets IRQD_IRQ_INPROGRESS, calls
+ *   __disable_irq() adds the irq_desc to a per-CPU list of moderated interrupts,
+ *   and starts a moderation timer if not yet active.
+ * - desc->handler is modified so that when called on a moderated irq_desc it
+ *   calls mask_irq(), sets IRQS_PENDING and returns immediately.
+ * - the timer callback drains the moderation list: on each irq_desc it acquires
+ *   desc->lock, and if desc->action != NULL calls __enable_irq(), possibly calling
+ *   the handler if IRQS_PENDING is set.
+ *
+ * INTERRUPT TEARDOWN
+ * is protected by IRQD_IRQ_INPROGRESS and checking desc->action != NULL.
+ * This works because free_irq() runs in two steps:
+ * - first clear desc->action (under lock),
+ * - then call synchronize_irq(), which blocks on IRQD_IRQ_INPROGRESS
+ *   before freeing resources.
+ * When the moderation timer races with free_irq() we can have two cases:
+ * 1. timer runs before clearing desc->action. In this case __enable_irq()
+ *    is valid and the subsequent free_irq() will complete as intended
+ * 2. desc->action is cleared before the timer runs. In this case synchronize_irq()
+ *    will block until the timer expires (remember moderation delays are very short,
+ *    comparable to C-state exit times), __enable_irq() will not be run,
+ *    and free_irq() will complete successfully.
+ *
+ * INTERRUPT MIGRATION
+ * is protected by IRQD_IRQ_INPROGRESS that prevents running the handler on the
+ * new CPU while an interrupt is moderated.
+ *
+ * HOTPLUG
+ * During CPU shutdown, the kernel moves timers and reassigns interrupt affinity
+ * to a new CPU. The easiest way and most robust way to guarantee that pending
+ * events are handled correctly is to use a per-CPU "moderation_on" flag and
+ * hotplug callbacks on CPUHP_AP_ONLINE_DYN:
+ * - on setup, set the flag. That will allow interrupts to be moderated.
+ * - on shutdown, with interrupts disabled, 1. clear the flag thus preventing
+ *   more interrupts to be moderated on that CPU, 2. flush the list of moderated
+ *   interrupts (as if the timer had fired), and 3. cancel the timer.
+ * This avoids depending with the internals of the up/down sequence.
+ *
+ * SUSPEND
+ * Register a PM notifier to handle PM_SUSPEND_PREPARE and PM_POST_RESTORE as
+ * hotplug shutdown and setup events. The hotplug callbacks are also invoked
+ * during suspend to/resume from disk.
+ *
+ * BOOT PROCESSOR
+ * Hotplug callbacks are not invoked for the boot processor.
+ * However the boot processor is the last one to go, and since there is
+ * no other place to run the timer callbacks, they will be run where they
+ * are supposed to.
+ */
+
+/* Recommended values. */
+struct irq_mod_info irq_mod_info ____cacheline_aligned = {
+	.update_ms		= 5,
+};
+
+module_param_named(delay_us, irq_mod_info.delay_us, uint, 0444);
+MODULE_PARM_DESC(delay_us, "Max moderation delay us, 0 = moderation off, range 0-500.");
+
+DEFINE_PER_CPU_ALIGNED(struct irq_mod_state, irq_mod_state);
+
+DEFINE_STATIC_KEY_FALSE(irq_moderation_enabled_key);
+
+/* Initialize moderation state, used in desc_set_defaults() */
+void irq_moderation_init_fields(struct irq_desc_mod *mod)
+{
+	INIT_LIST_HEAD(&mod->ms_node);
+	mod->enable = false;
+}
+
+static int set_mode(struct irq_desc *desc, bool enable)
+{
+	lockdep_assert_held(&desc->lock);
+	if (enable) {
+		struct irq_data *irqd = &desc->irq_data;
+		struct irq_chip *chip = irqd->chip;
+
+		/* Moderation is only supported in specific cases. */
+		enable &= !irqd_is_level_type(irqd);
+		enable &= irqd_is_single_target(irqd);
+		enable &= !chip->irq_bus_lock && !chip->irq_bus_sync_unlock;
+		enable &= chip->irq_mask && chip->irq_unmask;
+		enable &= desc->handle_irq == handle_edge_irq ||
+				desc->handle_irq == handle_fasteoi_irq;
+		if (!enable)
+			return -EOPNOTSUPP;
+	}
+	desc->mod.enable = enable;
+	return 0;
+}
+
+#pragma clang diagnostic error "-Wformat"
+/* Print statistics */
+static int moderation_show(struct seq_file *p, void *v)
+{
+	uint delay_us = irq_mod_info.delay_us;
+	int j;
+
+#define HEAD_FMT "%5s  %8s  %11s  %11s  %9s\n"
+#define BODY_FMT "%5u  %8u  %11u  %11u  %9u\n"
+
+	seq_printf(p, HEAD_FMT,
+		   "# CPU", "delay_ns", "timer_set", "enqueue", "stray_irq");
+
+	for_each_possible_cpu(j) {
+		struct irq_mod_state *ms = per_cpu_ptr(&irq_mod_state, j);
+
+		seq_printf(p, BODY_FMT,
+			   j, ms->mod_ns, ms->timer_set, ms->enqueue, ms->stray_irq);
+	}
+
+	seq_printf(p, "\n"
+		   "enabled              %s\n"
+		   "delay_us             %u\n",
+		   str_yes_no(delay_us > 0),
+		   delay_us);
+
+	return 0;
+}
+
+static int moderation_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, moderation_show, pde_data(inode));
+}
+
+/* Helpers to set and clamp values from procfs or at init. */
+struct param_names {
+	const char *name;
+	uint *val;
+	uint min;
+	uint max;
+};
+
+static struct param_names param_names[] = {
+	{ "delay_us", &irq_mod_info.delay_us, 0, 500 },
+	/* Entries with no name cannot be set at runtime. */
+	{ "", &irq_mod_info.update_ms, 1, 100 },
+};
+
+static void update_enable_key(void)
+{
+	struct static_key_false *key = &irq_moderation_enabled_key;
+
+	if (irq_mod_info.delay_us != 0)
+		static_branch_enable(key);
+	else
+		static_branch_disable(key);
+}
+
+static ssize_t moderation_write(struct file *f, const char __user *buf, size_t count, loff_t *ppos)
+{
+	uint i, val, copy_len, name_len;
+	struct param_names *n;
+	char cmd[40];
+
+	copy_len = min(sizeof(cmd) - 1, count);
+	if (count == 0)
+		return -EINVAL;
+	if (copy_from_user(cmd, buf, copy_len))
+		return -EFAULT;
+	cmd[copy_len] = '\0';
+	for (i = 0, n = param_names;  i < ARRAY_SIZE(param_names); i++, n++) {
+		name_len = strlen(n->name);
+		if (name_len < 1 || copy_len < name_len + 2 || strncmp(cmd, n->name, name_len) ||
+		    cmd[name_len] != '=')
+			continue;
+		if (kstrtouint(cmd + name_len + 1, 0, &val))
+			return -EINVAL;
+		WRITE_ONCE(*(n->val), clamp(val, n->min, n->max));
+		if (n->val == &irq_mod_info.delay_us)
+			update_enable_key();
+		/* Notify parameter update. */
+		irq_mod_info.version++;
+		return count;
+	}
+	return -EINVAL;
+}
+
+static const struct proc_ops proc_ops = {
+	.proc_open	= moderation_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
+	.proc_write	= moderation_write,
+};
+
+/* Handlers for /proc/irq/NN/soft_moderation */
+static int mode_show(struct seq_file *p, void *v)
+{
+	struct irq_desc *desc = p->private;
+
+	if (!desc)
+		return -ENOENT;
+
+	seq_puts(p, desc->mod.enable ? "on\n" : "off\n");
+	return 0;
+}
+
+static ssize_t mode_write(struct file *f, const char __user *buf, size_t count, loff_t *ppos)
+{
+	struct irq_desc *desc = (struct irq_desc *)pde_data(file_inode(f));
+	char cmd[40];
+	bool enable;
+	int ret;
+
+	if (!desc)
+		return -ENOENT;
+	if (count == 0 || count + 1 > sizeof(cmd))
+		return -EINVAL;
+	if (copy_from_user(cmd, buf, count))
+		return -EFAULT;
+	cmd[count] = '\0';
+
+	ret = kstrtobool(cmd, &enable);
+	if (!ret) {
+		guard(raw_spinlock_irqsave)(&desc->lock);
+		ret = set_mode(desc, enable);
+	}
+	return ret ? : count;
+}
+
+static int mode_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, mode_show, pde_data(inode));
+}
+
+static const struct proc_ops mode_ops = {
+	.proc_open	= mode_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
+	.proc_write	= mode_write,
+};
+
+void irq_moderation_procfs_add(struct irq_desc *desc, umode_t umode)
+{
+	proc_create_data("soft_moderation", umode, desc->dir, &mode_ops, desc);
+}
+
+void irq_moderation_procfs_remove(struct irq_desc *desc)
+{
+	remove_proc_entry("soft_moderation", desc->dir);
+}
+
+/* Used on timer expiration or CPU shutdown. */
+static void drain_desc_list(struct irq_mod_state *ms)
+{
+	struct irq_desc *desc, *next;
+
+	/* Remove from list and enable interrupts back. */
+	list_for_each_entry_safe(desc, next, &ms->descs, mod.ms_node) {
+		guard(raw_spinlock)(&desc->lock);
+		list_del_init(&desc->mod.ms_node);
+		irqd_clear(&desc->irq_data, IRQD_IRQ_INPROGRESS);
+		/* Protect against competing free_irq(). */
+		if (desc->action)
+			__enable_irq(desc);
+	}
+}
+
+static enum hrtimer_restart timer_callback(struct hrtimer *timer)
+{
+	struct irq_mod_state *ms = this_cpu_ptr(&irq_mod_state);
+
+	drain_desc_list(ms);
+	/* Prepare to accumulate next moderation delay. */
+	ms->sleep_ns = 0;
+	return HRTIMER_NORESTART;
+}
+
+/* Hotplug callback for setup. */
+static int cpu_setup_cb(uint cpu)
+{
+	struct irq_mod_state *ms = this_cpu_ptr(&irq_mod_state);
+
+	if (ms->moderation_on)
+		return 0;
+	hrtimer_setup(&ms->timer, timer_callback,
+		      CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED_HARD);
+	INIT_LIST_HEAD(&ms->descs);
+	smp_mb();
+	ms->moderation_on = true;
+	return 0;
+}
+
+/*
+ * Hotplug callback for shutdown.
+ * Mark the CPU as offline for moderation, and drain the list of masked
+ * interrupts. Any subsequent interrupt on this CPU will not be
+ * moderated, but they will be on the new target.
+ */
+static int cpu_remove_cb(uint cpu)
+{
+	struct irq_mod_state *ms = this_cpu_ptr(&irq_mod_state);
+	unsigned long flags;
+
+	local_irq_save(flags);
+	ms->moderation_on = false;
+	drain_desc_list(ms);
+	hrtimer_cancel(&ms->timer);
+	local_irq_restore(flags);
+	return 0;
+}
+
+static void(mod_pm_prepare_cb)(void *arg)
+{
+	cpu_remove_cb(smp_processor_id());
+}
+
+static void(mod_pm_resume_cb)(void *arg)
+{
+	cpu_setup_cb(smp_processor_id());
+}
+
+static int mod_pm_notifier_cb( struct notifier_block *nb, unsigned long event, void *unused)
+{
+	switch (event) {
+	case PM_SUSPEND_PREPARE:
+		on_each_cpu(mod_pm_prepare_cb, NULL, 1);
+		break;
+	case PM_POST_SUSPEND:
+		on_each_cpu(mod_pm_resume_cb, NULL, 1);
+		break;
+	}
+	return NOTIFY_OK;
+}
+
+struct notifier_block mod_nb = {
+	.notifier_call = mod_pm_notifier_cb,
+	.priority = 100,
+};
+
+static void clamp_parameter(uint *dst, uint val)
+{
+	struct param_names *n = param_names;
+
+	for (int i = 0; i < ARRAY_SIZE(param_names); i++, n++) {
+		if (dst == n->val) {
+			WRITE_ONCE(*dst, clamp(val, n->min, n->max));
+			return;
+		}
+	}
+}
+
+static int __init init_irq_moderation(void)
+{
+	/* Clamp all initial values to the allowed range, update version. */
+	for (uint *cur = &irq_mod_info.delay_us; cur < irq_mod_info.pad; cur++)
+		clamp_parameter(cur, *cur);
+	irq_mod_info.version++;
+
+	cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "irq_moderation", cpu_setup_cb, cpu_remove_cb);
+	register_pm_notifier(&mod_nb);
+
+	update_enable_key();
+
+	proc_create_data("irq/soft_moderation", 0644, NULL, &proc_ops, NULL);
+	return 0;
+}
+
+MODULE_LICENSE("Dual BSD/GPL");
+MODULE_VERSION("1.0");
+MODULE_AUTHOR("Luigi Rizzo <lrizzo@google.com>");
+MODULE_DESCRIPTION("Global Software Interrupt Moderation");
+module_init(init_irq_moderation);
diff --git a/kernel/irq/irq_moderation.h b/kernel/irq/irq_moderation.h
new file mode 100644
index 0000000000000..8fe1cf30bdf91
--- /dev/null
+++ b/kernel/irq/irq_moderation.h
@@ -0,0 +1,93 @@
+/* SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause */
+
+#ifndef _LINUX_IRQ_MODERATION_H
+#define _LINUX_IRQ_MODERATION_H
+
+/* Common data structures for Global Software Interrupt Moderation, GSIM */
+
+#include <linux/hrtimer.h>
+#include <linux/irq.h>
+#include <linux/irqdesc.h>
+#include <linux/kernel.h>
+
+#ifdef CONFIG_IRQ_SOFT_MODERATION
+
+/**
+ * struct irq_mod_info - global configuration parameters and state
+ * @version:		increment on writes to /proc/irq/soft_moderation
+ * @delay_us:		maximum delay
+ * @update_ms:		how often to update delay (epoch duration)
+ */
+struct irq_mod_info {
+	u32		version;
+	u32		delay_us;
+	u32		update_ms;
+	u32		pad[];
+};
+
+extern struct irq_mod_info irq_mod_info;
+
+/**
+ * struct irq_mod_state - per-CPU moderation state
+ *
+ * Used on every interrupt:
+ * @timer:		moderation timer
+ * @moderation_on:	per-CPU enable, toggled during hotplug/suspend events
+ * @intr_count:		interrupts in the last epoch
+ * @sleep_ns:		accumulated time for actual delay
+ * @mod_ns:		nominal moderation delay, recomputed every epoch
+ *
+ * Used less frequently, every few interrupts:
+ * @epoch_start_ns:	start of current epoch
+ * @update_ns:		update_ms from irq_mod_info, converted to ns
+ * @stray_irq:		how many stray interrupts (almost never used)
+ *
+ * Used once per epoch per interrupt source:
+ * @descs:		list of	moderated irq_desc on this CPU
+ * @enqueue:		how many enqueue on the list
+ *
+ * Used once per epoch:
+ * @version:		version fetched from irq_mod_info
+ *
+ * Statistics
+ * @timer_set:		how many timer_set calls
+ */
+struct irq_mod_state {
+	/* Used on every interrupt. */
+	struct hrtimer		timer;
+	bool			moderation_on;
+	u32			intr_count;
+	u32			sleep_ns;
+	u32			mod_ns;
+
+	/* Used less frequently. */
+	u64			epoch_start_ns;
+	u32			update_ns;
+	u32			stray_irq;
+
+	/* Used once per epoch per source. */
+	struct list_head	descs;
+	u32			enqueue;
+
+	/* Used once per epoch. */
+	u32			version;
+
+	/* Statistics */
+	u32			timer_set;
+};
+
+DECLARE_PER_CPU_ALIGNED(struct irq_mod_state, irq_mod_state);
+
+extern struct static_key_false irq_moderation_enabled_key;
+
+void irq_moderation_procfs_add(struct irq_desc *desc, umode_t umode);
+void irq_moderation_procfs_remove(struct irq_desc *desc);
+
+#else /* CONFIG_IRQ_SOFT_MODERATION */
+
+static inline void irq_moderation_procfs_add(struct irq_desc *desc, umode_t umode) {}
+static inline void irq_moderation_procfs_remove(struct irq_desc *desc) {}
+
+#endif /* !CONFIG_IRQ_SOFT_MODERATION */
+
+#endif /* _LINUX_IRQ_MODERATION_H */
diff --git a/kernel/irq/irq_moderation_hook.h b/kernel/irq/irq_moderation_hook.h
new file mode 100644
index 0000000000000..f9ac3005f69f4
--- /dev/null
+++ b/kernel/irq/irq_moderation_hook.h
@@ -0,0 +1,104 @@
+/* SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause */
+
+#ifndef _LINUX_IRQ_MODERATION_HOOK_H
+#define _LINUX_IRQ_MODERATION_HOOK_H
+
+/* Interrupt hooks for Global Software Interrupt Moderation, GSIM */
+
+#include <linux/hrtimer.h>
+#include <linux/irq.h>
+#include <linux/irqdesc.h>
+#include <linux/kernel.h>
+
+#include "irq_moderation.h"
+
+#ifdef CONFIG_IRQ_SOFT_MODERATION
+
+static inline void __maybe_new_epoch(struct irq_mod_state *ms)
+{
+	const u64 now = ktime_get_ns(), epoch_ns = now - ms->epoch_start_ns;
+	const u32 slack_ns = 5000;
+	u32 version;
+
+	/* Run approximately every update_ns, a little bit early is ok. */
+	if (epoch_ns < ms->update_ns - slack_ns)
+		return;
+	ms->epoch_start_ns = now;
+	/* Fetch updated parameters. */
+        while ((version = READ_ONCE(irq_mod_info.version)) != ms->version) {
+		ms->update_ns = READ_ONCE(irq_mod_info.update_ms) * NSEC_PER_MSEC;
+		ms->mod_ns = READ_ONCE(irq_mod_info.delay_us) * NSEC_PER_USEC;
+		ms->version = version;
+        }
+}
+
+static inline bool irq_moderation_needed(struct irq_desc *desc, struct irq_mod_state *ms)
+{
+	if (!hrtimer_is_queued(&ms->timer)) {
+		const u32 min_delay_ns = 10000;
+		const u64 slack_ns = 2000;
+
+		/* Accumulate sleep time, no moderation if too small. */
+		ms->sleep_ns += ms->mod_ns;
+		if (ms->sleep_ns < min_delay_ns)
+			return false;
+		/* We need moderation, start the timer. */
+		ms->timer_set++;
+		hrtimer_start_range_ns(&ms->timer, ns_to_ktime(ms->sleep_ns),
+				       slack_ns, HRTIMER_MODE_REL_PINNED_HARD);
+	}
+
+	/*
+	 * Add to the timer list and __disable_irq() to prevent serving subsequent
+	 * interrupts.
+	 */
+	if (!list_empty(&desc->mod.ms_node)) {
+		/* Very unlikely, stray interrupt while moderated. */
+		ms->stray_irq++;
+	} else {
+		ms->enqueue++;
+		list_add(&desc->mod.ms_node, &ms->descs);
+		__disable_irq(desc);
+	}
+	irqd_set(&desc->irq_data, IRQD_IRQ_INPROGRESS);
+	return true;
+}
+
+/*
+ * Use after running the handler, with lock held. If this source should be
+ * moderated, disable it, add to the timer list for this CPU and return true.
+ */
+static inline bool irq_moderation_hook(struct irq_desc *desc)
+{
+	struct irq_mod_state *ms = this_cpu_ptr(&irq_mod_state);
+
+	if (!static_branch_unlikely(&irq_moderation_enabled_key))
+		return false;
+	if (!desc->mod.enable)
+		return false;
+	if (!ms->moderation_on)
+		return false;
+
+	ms->intr_count++;
+
+	/* Is this a new epoch? ktime_get_ns() is expensive, don't check too often. */
+	if ((ms->intr_count & 0xf) == 0)
+		__maybe_new_epoch(ms);
+
+	return irq_moderation_needed(desc, ms);
+}
+
+/* On entry of desc->irq_handler() tell handler to skip moderated interrupts. */
+static inline bool irq_moderation_skip_moderated(struct irq_desc *desc)
+{
+	return !list_empty(&desc->mod.ms_node);
+}
+
+#else /* CONFIG_IRQ_SOFT_MODERATION */
+
+static inline bool irq_moderation_hook(struct irq_desc *desc) { return false; }
+static inline bool irq_moderation_skip_moderated(struct irq_desc *desc) { return false; }
+
+#endif /* !CONFIG_IRQ_SOFT_MODERATION */
+
+#endif /* _LINUX_IRQ_MODERATION_HOOK_H */
diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index f8e4e13dbe339..bfc5802033383 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -134,6 +134,7 @@ static void desc_set_defaults(unsigned int irq, struct irq_desc *desc, int node,
 	desc->tot_count = 0;
 	desc->name = NULL;
 	desc->owner = owner;
+	irq_moderation_init_fields(&desc->mod);
 	for_each_possible_cpu(cpu)
 		*per_cpu_ptr(desc->kstat_irqs, cpu) = (struct irqstat) { };
 	desc_smp_init(desc, node, affinity);
diff --git a/kernel/irq/proc.c b/kernel/irq/proc.c
index 77258eafbf632..462e63af5d4d1 100644
--- a/kernel/irq/proc.c
+++ b/kernel/irq/proc.c
@@ -14,6 +14,7 @@
 #include <linux/mutex.h>
 
 #include "internals.h"
+#include "irq_moderation.h"
 
 /*
  * Access rules:
@@ -376,6 +377,7 @@ void register_irq_proc(unsigned int irq, struct irq_desc *desc)
 	proc_create_single_data("effective_affinity_list", 0444, desc->dir,
 				irq_effective_aff_list_proc_show, irqp);
 # endif
+	irq_moderation_procfs_add(desc, 0644);
 #endif
 	proc_create_single_data("spurious", 0444, desc->dir,
 				irq_spurious_proc_show, (void *)(long)irq);
@@ -397,6 +399,7 @@ void unregister_irq_proc(unsigned int irq, struct irq_desc *desc)
 	remove_proc_entry("effective_affinity", desc->dir);
 	remove_proc_entry("effective_affinity_list", desc->dir);
 # endif
+	irq_moderation_procfs_remove(desc);
 #endif
 	remove_proc_entry("spurious", desc->dir);
 
-- 
2.52.0.305.g3fc767764a-goog


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH-v3 2/3] genirq: Adaptive Global Software Interrupt Moderation (GSIM)
  2025-12-17 11:21 [PATCH-v3 0/3] Global Software Interrupt Moderation (GSIM) Luigi Rizzo
  2025-12-17 11:21 ` [PATCH-v3 1/3] genirq: Fixed " Luigi Rizzo
@ 2025-12-17 11:21 ` Luigi Rizzo
  2025-12-17 19:18   ` Randy Dunlap
                     ` (2 more replies)
  2025-12-17 11:21 ` [PATCH-v3 3/3] genirq: Configurable default mode for GSIM Luigi Rizzo
  2 siblings, 3 replies; 11+ messages in thread
From: Luigi Rizzo @ 2025-12-17 11:21 UTC (permalink / raw)
  To: Thomas Gleixner, Marc Zyngier, Luigi Rizzo, Paolo Abeni,
	Andrew Morton, Sean Christopherson, Jacob Pan
  Cc: linux-kernel, linux-arch, Bjorn Helgaas, Willem de Bruijn,
	Luigi Rizzo

Make GSIM adaptive by measuring total and per-CPU interrupt rates, as
well as time spent in hardirq on each CPU. The metrics are compared to
configurable targets, and each CPU adjusts the moderation delay up
or down depending on the result, using multiplicative increase/decrease.

Configuration is done at boot time via module parameters

    irq_moderation.${NAME}=${VALUE}

or at runtime via /proc/irq/soft_moderation

    echo ${NAME}=${VALUE} > /proc/irq/soft_moderation

Parameters are:

  delay_us (0 off, range 0-500)
      Maximum moderation delay in microseconds.

  target_intr_rate (0 off, range 0-50000000)
      Total rate above which moderation should trigger.

  hardirq_percent (0 off,range 0-100)
      Percentage of time spent in hardirq above which moderation should
      trigger.

  update_ms (range 1-100)
      How often the metrics should be computed and moderation delay
      updated.

When target_intr_rate and hardirq_percent are both 0, GSIM uses a fixed
moderation delay equal to delay_us. Otherwise, the delay is dynamically
adjusted up or down, independently on each CPU, based on how the total
and per-CPU metrics compare with the targets.

Provided that delay_us suffices to bring the metrics within the target,
the control loop will dynamically converge to the minimum actual
moderation delay to stay within the target.

Change-Id: I19b15d98e214a90fadee1c6e5bce6c8aac7a709a
Signed-off-by: Luigi Rizzo <lrizzo@google.com>
---
 kernel/irq/Makefile              |   2 +-
 kernel/irq/irq_moderation.c      |  94 ++++++++++++++++--
 kernel/irq/irq_moderation.h      |  46 ++++++++-
 kernel/irq/irq_moderation_hook.c | 157 +++++++++++++++++++++++++++++++
 kernel/irq/irq_moderation_hook.h |  12 +--
 5 files changed, 291 insertions(+), 20 deletions(-)
 create mode 100644 kernel/irq/irq_moderation_hook.c

diff --git a/kernel/irq/Makefile b/kernel/irq/Makefile
index c06da43d644f2..d41cca6a5da76 100644
--- a/kernel/irq/Makefile
+++ b/kernel/irq/Makefile
@@ -9,7 +9,7 @@ obj-$(CONFIG_GENERIC_IRQ_CHIP) += generic-chip.o
 obj-$(CONFIG_GENERIC_IRQ_PROBE) += autoprobe.o
 obj-$(CONFIG_IRQ_DOMAIN) += irqdomain.o
 obj-$(CONFIG_IRQ_SIM) += irq_sim.o
-obj-$(CONFIG_IRQ_SOFT_MODERATION) += irq_moderation.o
+obj-$(CONFIG_IRQ_SOFT_MODERATION) += irq_moderation.o irq_moderation_hook.o
 obj-$(CONFIG_PROC_FS) += proc.o
 obj-$(CONFIG_GENERIC_PENDING_IRQ) += migration.o
 obj-$(CONFIG_GENERIC_IRQ_MIGRATION) += cpuhotplug.o
diff --git a/kernel/irq/irq_moderation.c b/kernel/irq/irq_moderation.c
index 2a7401bc51da2..be7d10404a24a 100644
--- a/kernel/irq/irq_moderation.c
+++ b/kernel/irq/irq_moderation.c
@@ -19,8 +19,9 @@
  *
  * Some platforms show reduced I/O performance when the total device interrupt
  * rate across the entire platform becomes too high. To address the problem,
- * GSIM uses a hook after running the handler to implement software interrupt
- * moderation with programmable delay.
+ * GSIM uses a hook after running the handler to measure global and per-CPU
+ * interrupt rates, compare them with configurable targets, and implements
+ * independent, per-CPU software moderation delays.
  *
  * Configuration is controlled at boot time via module parameters
  *
@@ -36,6 +37,19 @@
  *       Maximum moderation delay. A reasonable range is 20-100. Higher values
  *       can be useful if the hardirq handler has long runtimes.
  *
+ *   target_intr_rate (default 0, suggested 1000000, 0 off, range 0-50000000)
+ *       The total interrupt rate above which moderation kicks in.
+ *       Not particularly critical, a value in the 500K-1M range is usually ok.
+ *
+ *   hardirq_percent (default 0, suggested 70, 0 off, range 0-100)
+ *       The hardirq percentage above which moderation kicks in.
+ *       50-90 is a reasonable range.
+ *
+ *       FIXED MODERATION mode requires target_intr_rate=0, hardirq_percent=0
+ *
+ *   update_ms (default 5, range 1-100)
+ *       How often the load is measured and moderation delay updated.
+ *
  * Moderation can be enabled/disabled dynamically for individual interrupts with
  *
  *     echo 1 > /proc/irq/NN/soft_moderation # use 0 to disable
@@ -101,11 +115,22 @@
 /* Recommended values. */
 struct irq_mod_info irq_mod_info ____cacheline_aligned = {
 	.update_ms		= 5,
+	.increase_factor	= MIN_SCALING_FACTOR,
+	.scale_cpus		= 200,
 };
 
 module_param_named(delay_us, irq_mod_info.delay_us, uint, 0444);
 MODULE_PARM_DESC(delay_us, "Max moderation delay us, 0 = moderation off, range 0-500.");
 
+module_param_named(hardirq_percent, irq_mod_info.hardirq_percent, uint, 0444);
+MODULE_PARM_DESC(hardirq_percent, "Target max hardirq percentage, 0 off, range 0-100.");
+
+module_param_named(target_intr_rate, irq_mod_info.target_intr_rate, uint, 0444);
+MODULE_PARM_DESC(target_intr_rate, "Target max interrupt rate, 0 off, range 0-50000000.");
+
+module_param_named(update_ms, irq_mod_info.update_ms, uint, 0444);
+MODULE_PARM_DESC(update_ms, "Update interval in milliseconds, range 1-100");
+
 DEFINE_PER_CPU_ALIGNED(struct irq_mod_state, irq_mod_state);
 
 DEFINE_STATIC_KEY_FALSE(irq_moderation_enabled_key);
@@ -142,27 +167,72 @@ static int set_mode(struct irq_desc *desc, bool enable)
 /* Print statistics */
 static int moderation_show(struct seq_file *p, void *v)
 {
+	ulong intr_rate = 0, irq_high = 0, my_irq_high = 0, hardirq_high = 0;
 	uint delay_us = irq_mod_info.delay_us;
-	int j;
+	u64 now = ktime_get_ns();
+	int j, active_cpus = 0;
 
-#define HEAD_FMT "%5s  %8s  %11s  %11s  %9s\n"
-#define BODY_FMT "%5u  %8u  %11u  %11u  %9u\n"
+#define HEAD_FMT "%5s  %8s  %8s  %4s  %8s  %11s  %11s  %11s  %11s  %11s  %9s\n"
+#define BODY_FMT "%5u  %8u  %8u  %4u  %8u  %11u  %11u  %11u  %11u  %11u  %9u\n"
 
 	seq_printf(p, HEAD_FMT,
-		   "# CPU", "delay_ns", "timer_set", "enqueue", "stray_irq");
+		   "# CPU", "irq/s", "my_irq/s", "cpus", "delay_ns",
+		   "irq_hi", "my_irq_hi", "hardirq_hi", "timer_set",
+		   "enqueue", "stray_irq");
 
 	for_each_possible_cpu(j) {
 		struct irq_mod_state *ms = per_cpu_ptr(&irq_mod_state, j);
+		/* Watch out, epoch start_ns is 64 bits. */
+		u64 epoch_start_ns = atomic64_read((atomic64_t *)&ms->epoch_start_ns);
+		s64 age_ms = min((now - epoch_start_ns) / NSEC_PER_MSEC, (u64)999999);
+
+		if (age_ms < 10000) {
+			/* Average intr_rate over recently active CPUs. */
+			active_cpus++;
+			intr_rate += ms->intr_rate;
+		} else {
+			age_ms = -1;
+			ms->intr_rate = 0;
+			ms->my_intr_rate = 0;
+			ms->scaled_cpu_count = 0;
+			ms->mod_ns = 0;
+		}
+
+		irq_high += ms->irq_high;
+		my_irq_high += ms->my_irq_high;
+		hardirq_high += ms->hardirq_high;
 
 		seq_printf(p, BODY_FMT,
-			   j, ms->mod_ns, ms->timer_set, ms->enqueue, ms->stray_irq);
+			   j, ms->intr_rate, ms->my_intr_rate,
+			   (ms->scaled_cpu_count + 128) / 256,
+			   ms->mod_ns, ms->irq_high, ms->my_irq_high,
+			   ms->hardirq_high, ms->timer_set, ms->enqueue,
+			   ms->stray_irq);
 	}
 
 	seq_printf(p, "\n"
 		   "enabled              %s\n"
-		   "delay_us             %u\n",
+		   "delay_us             %u\n"
+		   "target_intr_rate     %u\n"
+		   "hardirq_percent      %u\n"
+		   "update_ms            %u\n"
+		   "scale_cpus           %u\n",
 		   str_yes_no(delay_us > 0),
-		   delay_us);
+		   delay_us,
+		   irq_mod_info.target_intr_rate, irq_mod_info.hardirq_percent,
+		   irq_mod_info.update_ms, irq_mod_info.scale_cpus);
+
+	seq_printf(p,
+		   "intr_rate            %lu\n"
+		   "irq_high             %lu\n"
+		   "my_irq_high          %lu\n"
+		   "hardirq_percent_high %lu\n"
+		   "total_interrupts     %u\n"
+		   "total_cpus           %u\n",
+		   active_cpus ? intr_rate / active_cpus : 0,
+		   irq_high, my_irq_high, hardirq_high,
+		   READ_ONCE(*((u32 *)&irq_mod_info.total_intrs)),
+		   READ_ONCE(*((u32 *)&irq_mod_info.total_cpus)));
 
 	return 0;
 }
@@ -182,8 +252,12 @@ struct param_names {
 
 static struct param_names param_names[] = {
 	{ "delay_us", &irq_mod_info.delay_us, 0, 500 },
+	{ "target_intr_rate", &irq_mod_info.target_intr_rate, 0, 50000000 },
+	{ "hardirq_percent", &irq_mod_info.hardirq_percent, 0, 100 },
+	{ "update_ms", &irq_mod_info.update_ms, 1, 100 },
 	/* Entries with no name cannot be set at runtime. */
-	{ "", &irq_mod_info.update_ms, 1, 100 },
+	{ "", &irq_mod_info.increase_factor, MIN_SCALING_FACTOR, 128 },
+	{ "", &irq_mod_info.scale_cpus, 50, 1000 },
 };
 
 static void update_enable_key(void)
diff --git a/kernel/irq/irq_moderation.h b/kernel/irq/irq_moderation.h
index 8fe1cf30bdf91..bd8edb9524613 100644
--- a/kernel/irq/irq_moderation.h
+++ b/kernel/irq/irq_moderation.h
@@ -14,14 +14,32 @@
 
 /**
  * struct irq_mod_info - global configuration parameters and state
+ * @total_intrs:	running count of total interrupts
+ * @total_cpus:		running count of total active CPUs
+ *			totals are updated every update_ms ("epoch")
  * @version:		increment on writes to /proc/irq/soft_moderation
- * @delay_us:		maximum delay
- * @update_ms:		how often to update delay (epoch duration)
+ * @delay_us:		fixed delay, or maximum for adaptive
+ * @target_intr_rate:	target maximum interrupt rate
+ * @hardirq_percent:	target maximum hardirq percentage
+ * @update_ms:		how often to update delay/rate/fraction (epoch duration)
+ * @increase_factor:	constant for exponential increase/decrease of delay
+ * @scale_cpus:		(percent) scale factor to estimate active CPUs
  */
 struct irq_mod_info {
+	/* These fields are written to by all CPUs every epoch. */
+	____cacheline_aligned
+	atomic_t	total_intrs;
+	atomic_t	total_cpus;
+
+	/* These are mostly read (frequently), so use a different cacheline. */
+	____cacheline_aligned
 	u32		version;
 	u32		delay_us;
+	u32		target_intr_rate;
+	u32		hardirq_percent;
 	u32		update_ms;
+	u32		increase_factor;
+	u32		scale_cpus;
 	u32		pad[];
 };
 
@@ -48,9 +66,20 @@ extern struct irq_mod_info irq_mod_info;
  *
  * Used once per epoch:
  * @version:		version fetched from irq_mod_info
+ * @delay_ns:		fetched from irq_mod_info
+ * @epoch_ns:		duration of last epoch
+ * @last_total_intrs:	from irq_mod_info
+ * @last_total_cpus:	from irq_mod_info
+ * @last_irqtime:	from cpustat[CPUTIME_IRQ]
  *
  * Statistics
+ * @intr_rate:		smoothed global interrupt rate
+ * @my_intr_rate:	smoothed interrupt rate for this CPU
  * @timer_set:		how many timer_set calls
+ * @scaled_cpu_count:	smoothed CPU count (scaled)
+ * @irq_high:		how many times global irq above threshold
+ * @my_irq_high:	how many times local irq above threshold
+ * @hardirq_high:	how many times local hardirq_percent above threshold
  */
 struct irq_mod_state {
 	/* Used on every interrupt. */
@@ -71,13 +100,26 @@ struct irq_mod_state {
 
 	/* Used once per epoch. */
 	u32			version;
+	u32			delay_ns;
+	u32			epoch_ns;
+	u32			last_total_intrs;
+	u32			last_total_cpus;
+	u64			last_irqtime;
 
 	/* Statistics */
+	u32			intr_rate;
+	u32			my_intr_rate;
 	u32			timer_set;
+	u32			timer_fire;
+	u32			scaled_cpu_count;
+	u32			irq_high;
+	u32			my_irq_high;
+	u32			hardirq_high;
 };
 
 DECLARE_PER_CPU_ALIGNED(struct irq_mod_state, irq_mod_state);
 
+#define MIN_SCALING_FACTOR 8u
 extern struct static_key_false irq_moderation_enabled_key;
 
 void irq_moderation_procfs_add(struct irq_desc *desc, umode_t umode);
diff --git a/kernel/irq/irq_moderation_hook.c b/kernel/irq/irq_moderation_hook.c
new file mode 100644
index 0000000000000..f04ed25a482a0
--- /dev/null
+++ b/kernel/irq/irq_moderation_hook.c
@@ -0,0 +1,157 @@
+// SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause
+
+#include <linux/irq.h>
+#include <linux/irqdesc.h>
+#include <linux/kernel_stat.h>
+
+#include "internals.h"
+#include "irq_moderation_hook.h"
+
+/* Slow path functions for interrupt moderation. */
+
+static inline u32 smooth_avg(u32 old, u32 cur, u32 steps)
+{
+	const u32 smooth_factor = 64;
+
+	steps = min(steps, smooth_factor - 1);
+	return ((smooth_factor - steps) * old + steps * cur) / smooth_factor;
+}
+
+/* Measure and assess time spent in hardirq. */
+static inline bool hardirq_high(struct irq_mod_state *ms, u32 hardirq_percent)
+{
+	bool above_threshold = false;
+
+	if (IS_ENABLED(CONFIG_IRQ_TIME_ACCOUNTING)) {
+		u64 irqtime, cur = kcpustat_this_cpu->cpustat[CPUTIME_IRQ];
+
+		irqtime = cur - ms->last_irqtime;
+		ms->last_irqtime = cur;
+
+		above_threshold = irqtime * 100 > (u64)ms->epoch_ns * hardirq_percent;
+		ms->hardirq_high += above_threshold;
+	}
+	return above_threshold;
+}
+
+/* Measure and assess total and per-CPU interrupt rates. */
+static inline bool irqrate_high(struct irq_mod_state *ms, u32 target_rate, u32 steps)
+{
+	u32 intr_rate, my_intr_rate, delta_intrs, active_cpus, tmp;
+	bool my_rate_high, global_rate_high;
+
+	my_intr_rate = ((u64)ms->intr_count * NSEC_PER_SEC) / ms->epoch_ns;
+
+	/* Accumulate global counter and compute global interrupt rate. */
+	tmp = atomic_add_return(ms->intr_count, &irq_mod_info.total_intrs);
+	ms->intr_count = 1;
+	delta_intrs = tmp - ms->last_total_intrs;
+	ms->last_total_intrs = tmp;
+	intr_rate = ((u64)delta_intrs * NSEC_PER_SEC) / ms->epoch_ns;
+
+	/*
+	 * Count how many CPUs handled interrupts in the last epoch, needed
+	 * to determine the per-CPU target (target_rate / active_cpus).
+	 * Each active CPU increments the global counter approximately every
+	 * update_ns. Scale the value by (update_ns / ms->epoch_ns) to get the
+	 * correct value. Also apply rounding and make sure active_cpus > 0.
+	 */
+	tmp = atomic_add_return(1, &irq_mod_info.total_cpus);
+	active_cpus = tmp - ms->last_total_cpus;
+	ms->last_total_cpus = tmp;
+	active_cpus = (active_cpus * ms->update_ns + ms->epoch_ns / 2) / ms->epoch_ns;
+	if (active_cpus < 1)
+		active_cpus = 1;
+
+	/* Compare with global and per-CPU targets. */
+	global_rate_high = intr_rate > target_rate;
+
+	/*
+	 * Short epochs may lead to underestimate the number of active CPUs.
+	 * Apply a scaling factor to compensate. This may make the controller
+	 * a bit more aggressive but does not harm system throughput.
+	 */
+	my_rate_high = my_intr_rate * active_cpus * irq_mod_info.scale_cpus > target_rate * 100;
+
+	/* Statistics. */
+	ms->intr_rate = smooth_avg(ms->intr_rate, intr_rate, steps);
+	ms->my_intr_rate = smooth_avg(ms->my_intr_rate, my_intr_rate, steps);
+	ms->scaled_cpu_count = smooth_avg(ms->scaled_cpu_count, active_cpus * 256, steps);
+	ms->my_irq_high += my_rate_high;
+	ms->irq_high += global_rate_high;
+
+	/* Moderate on this CPU only if both global and local rates are high. */
+	return global_rate_high && my_rate_high;
+}
+
+/* Periodic adjustment, called once per epoch. */
+void irq_moderation_update_epoch(struct irq_mod_state *ms)
+{
+	const u32 hardirq_percent = READ_ONCE(irq_mod_info.hardirq_percent);
+	const u32 target_rate = READ_ONCE(irq_mod_info.target_intr_rate);
+	const u32 min_delay_ns = 500;
+	bool above_target = false;
+	u32 version;
+	u32 steps;
+
+	/* Fetch updated parameters. */
+	while ((version = READ_ONCE(irq_mod_info.version)) != ms->version) {
+		ms->update_ns = READ_ONCE(irq_mod_info.update_ms) * NSEC_PER_MSEC;
+		ms->mod_ns = READ_ONCE(irq_mod_info.delay_us) * NSEC_PER_USEC;
+		ms->delay_ns = ms->mod_ns;
+		ms->version = version;
+	}
+
+	if (target_rate == 0 && hardirq_percent == 0) {
+		/* Use fixed moderation delay. */
+		ms->mod_ns = ms->delay_ns;
+		ms->intr_rate = 0;
+		ms->my_intr_rate = 0;
+		ms->scaled_cpu_count = 0;
+		return;
+	}
+
+	/*
+	 * To scale values X by a factor (1 +/- 1/F) every "update_ns" we do
+	 * 	X := X * (1 +/- 1/F)
+	 * If the interval is N times longer, applying the formula N times gives
+	 * 	X := X * ((1 +/- 1/F) ** N)
+	 * We don't want to deal floating point or exponentials, and we cap N
+	 * to some small value < F . This leads to an approximated formula
+	 * 	X := X * (1 +/- N/F)
+	 * The variable steps below is the number N of steps.
+	 */
+	steps = clamp(ms->epoch_ns / ms->update_ns, 1u, MIN_SCALING_FACTOR - 1u);
+
+	if (target_rate > 0 && irqrate_high(ms, target_rate, steps))
+		above_target = true;
+
+	if (hardirq_percent > 0 && hardirq_high(ms, hardirq_percent))
+		above_target = true;
+
+	/*
+	 * Controller: adjust delay with exponential increase or decrease.
+	 *
+	 * Note the different constants: we increase fast (smaller factor)
+	 * to aggressively slow down when the interrupt rate goes up,
+	 * but decrease slowly (larger factor) because reducing the delay can
+	 * drive up the interrupt rate and we don't want to create load spikes.
+	 */
+	if (above_target) {
+		const u32 increase_factor = READ_ONCE(irq_mod_info.increase_factor);
+
+		/* Make sure the value is large enough for the exponential to grow. */
+		if (ms->mod_ns < min_delay_ns)
+			ms->mod_ns = min_delay_ns;
+		ms->mod_ns += ms->mod_ns * steps / increase_factor;
+		if (ms->mod_ns > ms->delay_ns)
+			ms->mod_ns = ms->delay_ns;
+	} else {
+		const u32 decrease_factor = 2 * READ_ONCE(irq_mod_info.increase_factor);
+
+		ms->mod_ns -= ms->mod_ns * steps / decrease_factor;
+		/* Round down to 0 values that are too small to bother. */
+		if (ms->mod_ns < min_delay_ns)
+			ms->mod_ns = 0;
+	}
+}
diff --git a/kernel/irq/irq_moderation_hook.h b/kernel/irq/irq_moderation_hook.h
index f9ac3005f69f4..e96127923a04b 100644
--- a/kernel/irq/irq_moderation_hook.h
+++ b/kernel/irq/irq_moderation_hook.h
@@ -14,22 +14,20 @@
 
 #ifdef CONFIG_IRQ_SOFT_MODERATION
 
+void irq_moderation_update_epoch(struct irq_mod_state *ms);
+
 static inline void __maybe_new_epoch(struct irq_mod_state *ms)
 {
 	const u64 now = ktime_get_ns(), epoch_ns = now - ms->epoch_start_ns;
 	const u32 slack_ns = 5000;
-	u32 version;
 
 	/* Run approximately every update_ns, a little bit early is ok. */
 	if (epoch_ns < ms->update_ns - slack_ns)
 		return;
+	ms->epoch_ns = min(epoch_ns, (u64)U32_MAX);
 	ms->epoch_start_ns = now;
-	/* Fetch updated parameters. */
-        while ((version = READ_ONCE(irq_mod_info.version)) != ms->version) {
-		ms->update_ns = READ_ONCE(irq_mod_info.update_ms) * NSEC_PER_MSEC;
-		ms->mod_ns = READ_ONCE(irq_mod_info.delay_us) * NSEC_PER_USEC;
-		ms->version = version;
-        }
+	/* Do the expensive processing */
+	irq_moderation_update_epoch(ms);
 }
 
 static inline bool irq_moderation_needed(struct irq_desc *desc, struct irq_mod_state *ms)
-- 
2.52.0.305.g3fc767764a-goog


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH-v3 3/3] genirq: Configurable default mode for GSIM
  2025-12-17 11:21 [PATCH-v3 0/3] Global Software Interrupt Moderation (GSIM) Luigi Rizzo
  2025-12-17 11:21 ` [PATCH-v3 1/3] genirq: Fixed " Luigi Rizzo
  2025-12-17 11:21 ` [PATCH-v3 2/3] genirq: Adaptive " Luigi Rizzo
@ 2025-12-17 11:21 ` Luigi Rizzo
  2025-12-18 21:56   ` Thomas Gleixner
  2 siblings, 1 reply; 11+ messages in thread
From: Luigi Rizzo @ 2025-12-17 11:21 UTC (permalink / raw)
  To: Thomas Gleixner, Marc Zyngier, Luigi Rizzo, Paolo Abeni,
	Andrew Morton, Sean Christopherson, Jacob Pan
  Cc: linux-kernel, linux-arch, Bjorn Helgaas, Willem de Bruijn,
	Luigi Rizzo

GSIM (Global Software Interrupt Moderation) can be enabled only after the
interrupt is created by writing to /proc/irq/NN/soft_moderation. This is
impractical when devices that are dynamically created or reconfigured.

Add a module parameter irq_moderation.enable_list to specify whether
moderation should be enabled at interrupt creation time. This is done
with a comma-separated list of patterns (enable_list) matched against
interrupt or handler names when the interrupt is created.

This allows very flexible control without having to modify every single
driver. As an example, one can limit to specific drivers by specifying
the handler functions (using parentheses) as below

     irq_moderation.enable_list="nvme_irq(),vfio_msihandler()"

ora apply it to certain interrupt names

     irq_moderation.enable_list="eth*,vfio*"

Change-Id: Id5f7aba5b21ad478e4fb7edd0f00eeb2b83e07d2
Signed-off-by: Luigi Rizzo <lrizzo@google.com>
---
 include/linux/irqdesc.h     |   5 ++
 kernel/irq/irq_moderation.c | 125 ++++++++++++++++++++++++++++++++++++
 kernel/irq/manage.c         |   4 ++
 3 files changed, 134 insertions(+)

diff --git a/include/linux/irqdesc.h b/include/linux/irqdesc.h
index d42ce57ff5406..e0785c709d013 100644
--- a/include/linux/irqdesc.h
+++ b/include/linux/irqdesc.h
@@ -16,6 +16,7 @@ struct module;
 struct irq_desc;
 struct irq_domain;
 struct pt_regs;
+struct irqaction;
 
 #ifdef CONFIG_IRQ_SOFT_MODERATION
 
@@ -30,11 +31,15 @@ struct irq_desc_mod {
 };
 
 void irq_moderation_init_fields(struct irq_desc_mod *mod);
+bool irq_moderation_get_default(const struct irqaction *action);
+void irq_moderation_enable(struct irq_desc *desc);
 
 #else
 
 struct irq_desc_mod {};
 static inline void irq_moderation_init_fields(struct irq_desc_mod *mod) {}
+static inline bool irq_moderation_get_default(const struct irqaction *action) { return false; }
+static inline void irq_moderation_enable(struct irq_desc *desc) {}
 
 #endif
 
diff --git a/kernel/irq/irq_moderation.c b/kernel/irq/irq_moderation.c
index be7d10404a24a..216bd1f83d23e 100644
--- a/kernel/irq/irq_moderation.c
+++ b/kernel/irq/irq_moderation.c
@@ -50,6 +50,11 @@
  *   update_ms (default 5, range 1-100)
  *       How often the load is measured and moderation delay updated.
  *
+ *   enable_list (comma-separated list of handlers or interrupt names)
+ *       Enable moderation at creation for interrupts whose name or handler
+ *       functions match patterns in this list. Example:
+ *       "nvme_irq(),eth*,vfio_msihandler()"
+ *
  * Moderation can be enabled/disabled dynamically for individual interrupts with
  *
  *     echo 1 > /proc/irq/NN/soft_moderation # use 0 to disable
@@ -142,6 +147,90 @@ void irq_moderation_init_fields(struct irq_desc_mod *mod)
 	mod->enable = false;
 }
 
+/*
+ * irq_moderation.enable_list is a comma-separated list of patterns to match
+ * against the name eg "eth*" or handler e.g "nvme_irq()" of the interrupt.
+ * There is one active buffer and one used for updates, so replacement just
+ * needs to swap the pointer.
+ */
+#define MAX_CONFIG_LEN 4096
+static struct enable_list {
+	char buf[MAX_CONFIG_LEN];
+	char buf2[MAX_CONFIG_LEN];
+	char *cur;
+	uint len;
+	struct rw_semaphore rwsem;
+} patterns = { .cur = patterns.buf, .rwsem = __RWSEM_INITIALIZER(patterns.rwsem) };
+
+static int set_enable_list(const char *val, const struct kernel_param *kp)
+{
+	uint i, len = strlen(val);
+	char *dst;
+
+	if (len >= MAX_CONFIG_LEN - 1)
+		return -E2BIG;
+	if (val[len - 1] == '\n')
+		len--;
+
+	/* Serialized by procfs. */
+	dst = patterns.cur == patterns.buf ? patterns.buf2 : patterns.buf;
+	/* Copy and split the string on commas. */
+	for (i = 0; i < len; i++)
+		dst[i] = val[i] == ',' ?  '\0' : val[i];
+	dst[i] = '\0';
+	guard(rwsem_write)(&patterns.rwsem);
+	patterns.cur = dst;
+	patterns.len = len;
+	return 0;
+}
+
+static int get_enable_list(char *buf, const struct kernel_param *kp)
+{
+	int i;
+
+	/* Join the patterns with commas. */
+	guard(rwsem_read)(&patterns.rwsem);
+	for (i = 0; i < patterns.len; i++)
+		buf[i] = patterns.cur[i] == '\0' ?  ',' : patterns.cur[i];
+	buf[i] = '\0';
+	return patterns.len;
+}
+
+static const struct kernel_param_ops enable_list_ops = {
+	.set = set_enable_list,
+	.get = get_enable_list,
+};
+
+module_param_cb(enable_list, &enable_list_ops, NULL, 0444);
+
+/* Called early in __setup_irq() without desc->lock. */
+bool irq_moderation_get_default(const struct irqaction *act)
+{
+	guard(rwsem_read)(&patterns.rwsem);
+	for (int arg = 0; arg < 3; arg++) {
+		/* buf includes room for "()". */
+		char buf[KSYM_SYMBOL_LEN + 3];
+		const char *name;
+
+		if (arg == 0) {
+			name = act->name;
+			if (!name)
+				continue;
+		} else {
+			irq_handler_t fn = arg == 1 ? act->handler : act->thread_fn;
+
+			if (lookup_symbol_name((ulong)fn, buf))
+				continue;
+			memcpy(buf + strlen(buf), "()", 3);
+			name = buf;
+		}
+		for (int i = 0; i < patterns.len; i += 1 + strlen(patterns.cur + i))
+			if (glob_match(patterns.cur + i, name))
+				return true;
+	}
+	return false;
+}
+
 static int set_mode(struct irq_desc *desc, bool enable)
 {
 	lockdep_assert_held(&desc->lock);
@@ -163,6 +252,12 @@ static int set_mode(struct irq_desc *desc, bool enable)
 	return 0;
 }
 
+/* Called with desc->lock held in __setup_irq(). */
+void irq_moderation_enable(struct irq_desc *desc)
+{
+	set_mode(desc, true);
+}
+
 #pragma clang diagnostic error "-Wformat"
 /* Print statistics */
 static int moderation_show(struct seq_file *p, void *v)
@@ -171,6 +266,7 @@ static int moderation_show(struct seq_file *p, void *v)
 	uint delay_us = irq_mod_info.delay_us;
 	u64 now = ktime_get_ns();
 	int j, active_cpus = 0;
+	char *buf;
 
 #define HEAD_FMT "%5s  %8s  %8s  %4s  %8s  %11s  %11s  %11s  %11s  %11s  %9s\n"
 #define BODY_FMT "%5u  %8u  %8u  %4u  %8u  %11u  %11u  %11u  %11u  %11u  %9u\n"
@@ -210,17 +306,24 @@ static int moderation_show(struct seq_file *p, void *v)
 			   ms->stray_irq);
 	}
 
+	buf = kmalloc(MAX_CONFIG_LEN, GFP_KERNEL);
+	if (buf)
+		get_enable_list(buf, NULL);
 	seq_printf(p, "\n"
 		   "enabled              %s\n"
+		   "enable_list          '%s'\n"
 		   "delay_us             %u\n"
 		   "target_intr_rate     %u\n"
 		   "hardirq_percent      %u\n"
 		   "update_ms            %u\n"
 		   "scale_cpus           %u\n",
 		   str_yes_no(delay_us > 0),
+		   buf ? : "",
 		   delay_us,
 		   irq_mod_info.target_intr_rate, irq_mod_info.hardirq_percent,
 		   irq_mod_info.update_ms, irq_mod_info.scale_cpus);
+	if (buf)
+		kfree(buf);
 
 	seq_printf(p,
 		   "intr_rate            %lu\n"
@@ -258,6 +361,8 @@ static struct param_names param_names[] = {
 	/* Entries with no name cannot be set at runtime. */
 	{ "", &irq_mod_info.increase_factor, MIN_SCALING_FACTOR, 128 },
 	{ "", &irq_mod_info.scale_cpus, 50, 1000 },
+	/* val = NULL indicates a special entry for enable_list. */
+	{ "enable_list", NULL },
 };
 
 static void update_enable_key(void)
@@ -270,6 +375,24 @@ static void update_enable_key(void)
 		static_branch_disable(key);
 }
 
+static ssize_t set_enable_list_from_proc(const char __user *buf, int count, int ofs)
+{
+	char *cmd;
+
+	if (count >= MAX_CONFIG_LEN)
+		return -EINVAL;
+	cmd = kmalloc(count + 1, GFP_KERNEL);
+	if (!cmd)
+		return -ENOMEM;
+	if (copy_from_user(cmd, buf, count)) {
+		kfree(cmd);
+		return -EFAULT;
+	}
+	set_enable_list(cmd + ofs, NULL);
+	kfree(cmd);
+	return count;
+}
+
 static ssize_t moderation_write(struct file *f, const char __user *buf, size_t count, loff_t *ppos)
 {
 	uint i, val, copy_len, name_len;
@@ -287,6 +410,8 @@ static ssize_t moderation_write(struct file *f, const char __user *buf, size_t c
 		if (name_len < 1 || copy_len < name_len + 2 || strncmp(cmd, n->name, name_len) ||
 		    cmd[name_len] != '=')
 			continue;
+		if (n->val == NULL)
+			return set_enable_list_from_proc(buf, count, name_len + 1);
 		if (kstrtouint(cmd + name_len + 1, 0, &val))
 			return -EINVAL;
 		WRITE_ONCE(*(n->val), clamp(val, n->min, n->max));
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 8b1b4c8a4f54c..41b6e6ec2b09e 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -1449,6 +1449,7 @@ static bool valid_percpu_irqaction(struct irqaction *old, struct irqaction *new)
 static int
 __setup_irq(unsigned int irq, struct irq_desc *desc, struct irqaction *new)
 {
+	bool use_moderation = irq_moderation_get_default(new);
 	struct irqaction *old, **old_ptr;
 	unsigned long flags, thread_mask = 0;
 	int ret, nested, shared = 0;
@@ -1761,6 +1762,9 @@ __setup_irq(unsigned int irq, struct irq_desc *desc, struct irqaction *new)
 
 	irq_pm_install_action(desc, new);
 
+	if (use_moderation)
+		irq_moderation_enable(desc);
+
 	/* Reset broken irq detection when installing new handler */
 	desc->irq_count = 0;
 	desc->irqs_unhandled = 0;
-- 
2.52.0.305.g3fc767764a-goog


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH-v3 1/3] genirq: Fixed Global Software Interrupt Moderation (GSIM)
  2025-12-17 11:21 ` [PATCH-v3 1/3] genirq: Fixed " Luigi Rizzo
@ 2025-12-17 19:17   ` Randy Dunlap
  2025-12-18 15:37   ` Thomas Gleixner
  2025-12-21 11:21   ` kernel test robot
  2 siblings, 0 replies; 11+ messages in thread
From: Randy Dunlap @ 2025-12-17 19:17 UTC (permalink / raw)
  To: Luigi Rizzo, Thomas Gleixner, Marc Zyngier, Luigi Rizzo,
	Paolo Abeni, Andrew Morton, Sean Christopherson, Jacob Pan
  Cc: linux-kernel, linux-arch, Bjorn Helgaas, Willem de Bruijn



On 12/17/25 3:21 AM, Luigi Rizzo wrote:
> diff --git a/kernel/irq/Kconfig b/kernel/irq/Kconfig
> index 1b4254d19a73e..c258973b63459 100644
> --- a/kernel/irq/Kconfig
> +++ b/kernel/irq/Kconfig
> @@ -155,6 +155,18 @@ config IRQ_KUNIT_TEST
>  
>  endmenu
>  
> +# Support platform wide software interrupt moderation
> +config IRQ_SOFT_MODERATION
> +	bool "Enable platform wide software interrupt moderation"
> +	depends on PROC_FS=y
> +	help
> +	  Enable platform wide software interrupt moderation.
> +	  Uses a local timer to delay interrupts in configurable ways
> +	  and depending on various global system load indicators
> +	  and targets.
> +
> +	  If you don't know what to do here, say N.

s/platform wide/platform-wide/g
(3 places)

-- 
~Randy


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH-v3 2/3] genirq: Adaptive Global Software Interrupt Moderation (GSIM)
  2025-12-17 11:21 ` [PATCH-v3 2/3] genirq: Adaptive " Luigi Rizzo
@ 2025-12-17 19:18   ` Randy Dunlap
  2025-12-18 16:00   ` Thomas Gleixner
  2025-12-21  9:46   ` kernel test robot
  2 siblings, 0 replies; 11+ messages in thread
From: Randy Dunlap @ 2025-12-17 19:18 UTC (permalink / raw)
  To: Luigi Rizzo, Thomas Gleixner, Marc Zyngier, Luigi Rizzo,
	Paolo Abeni, Andrew Morton, Sean Christopherson, Jacob Pan
  Cc: linux-kernel, linux-arch, Bjorn Helgaas, Willem de Bruijn

Hi--

On 12/17/25 3:21 AM, Luigi Rizzo wrote:
> Make GSIM adaptive by measuring total and per-CPU interrupt rates, as
> well as time spent in hardirq on each CPU. The metrics are compared to
> configurable targets, and each CPU adjusts the moderation delay up
> or down depending on the result, using multiplicative increase/decrease.
> 
> Configuration is done at boot time via module parameters
> 
>     irq_moderation.${NAME}=${VALUE}
> 
> or at runtime via /proc/irq/soft_moderation
> 
>     echo ${NAME}=${VALUE} > /proc/irq/soft_moderation
> 
> Parameters are:
> 
>   delay_us (0 off, range 0-500)
>       Maximum moderation delay in microseconds.
> 
>   target_intr_rate (0 off, range 0-50000000)
>       Total rate above which moderation should trigger.
> 
>   hardirq_percent (0 off,range 0-100)
>       Percentage of time spent in hardirq above which moderation should
>       trigger.
> 
>   update_ms (range 1-100)
>       How often the metrics should be computed and moderation delay
>       updated.

This could use some userspace documentation, e.g., in Documentation/admin-guide/,
or in Documentation/admin-guide/kernel-parameters.txt.
or in Documentation/core-api/irq/...

> When target_intr_rate and hardirq_percent are both 0, GSIM uses a fixed
> moderation delay equal to delay_us. Otherwise, the delay is dynamically
> adjusted up or down, independently on each CPU, based on how the total
> and per-CPU metrics compare with the targets.
> 
> Provided that delay_us suffices to bring the metrics within the target,
> the control loop will dynamically converge to the minimum actual
> moderation delay to stay within the target.
> 
> Change-Id: I19b15d98e214a90fadee1c6e5bce6c8aac7a709a

checkpatch says that we don't take Change-Id: lines in a patch.
OTOH, the git tree currently has 415 of them.
I wonder if we have given up on that.

> Signed-off-by: Luigi Rizzo <lrizzo@google.com>
> ---
>  kernel/irq/Makefile              |   2 +-
>  kernel/irq/irq_moderation.c      |  94 ++++++++++++++++--
>  kernel/irq/irq_moderation.h      |  46 ++++++++-
>  kernel/irq/irq_moderation_hook.c | 157 +++++++++++++++++++++++++++++++
>  kernel/irq/irq_moderation_hook.h |  12 +--
>  5 files changed, 291 insertions(+), 20 deletions(-)
>  create mode 100644 kernel/irq/irq_moderation_hook.c


-- 
~Randy


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH-v3 1/3] genirq: Fixed Global Software Interrupt Moderation (GSIM)
  2025-12-17 11:21 ` [PATCH-v3 1/3] genirq: Fixed " Luigi Rizzo
  2025-12-17 19:17   ` Randy Dunlap
@ 2025-12-18 15:37   ` Thomas Gleixner
  2025-12-21 11:21   ` kernel test robot
  2 siblings, 0 replies; 11+ messages in thread
From: Thomas Gleixner @ 2025-12-18 15:37 UTC (permalink / raw)
  To: Luigi Rizzo, Marc Zyngier, Luigi Rizzo, Paolo Abeni,
	Andrew Morton, Sean Christopherson, Jacob Pan
  Cc: linux-kernel, linux-arch, Bjorn Helgaas, Willem de Bruijn,
	Luigi Rizzo

On Wed, Dec 17 2025 at 11:21, Luigi Rizzo wrote:
>  include/linux/irqdesc.h          |  23 ++
>  kernel/irq/Kconfig               |  12 +
>  kernel/irq/Makefile              |   1 +
>  kernel/irq/chip.c                |  16 +-
>  kernel/irq/irq_moderation.c      | 414 +++++++++++++++++++++++++++++++
>  kernel/irq/irq_moderation.h      |  93 +++++++
>  kernel/irq/irq_moderation_hook.h | 104 ++++++++
>  kernel/irq/irqdesc.c             |   1 +
>  kernel/irq/proc.c                |   3 +

This does too many things at once and is barely reviewable.

> +#ifdef CONFIG_IRQ_SOFT_MODERATION
> +
> +/**
> + * struct irq_desc_mod - interrupt moderation information
> + * @ms_node:		per-CPU list of moderated irq_desc
> + * @enable:		enable moderation on this descriptor
> + */
> +struct irq_desc_mod {
> +	struct list_head	ms_node;
> +	bool			enable;
> +};
> +
> +void irq_moderation_init_fields(struct irq_desc_mod *mod);

Why has this to be in the global header? The function is only used in
the core code.

> +
> +#else
> +
> +struct irq_desc_mod {};
> +static inline void irq_moderation_init_fields(struct irq_desc_mod *mod) {}
> +

Get rid of all these extra new lines. They provide no real value.

> +# Support platform wide software interrupt moderation
> +config IRQ_SOFT_MODERATION
> +	bool "Enable platform wide software interrupt moderation"
> +	depends on PROC_FS=y

  depends on PROC_FS

> @@ -739,6 +740,13 @@ void handle_fasteoi_irq(struct irq_desc *desc)
>  
>  	guard(raw_spinlock)(&desc->lock);
>  
> +	if (irq_moderation_skip_moderated(desc)) {
> +		mask_irq(desc);
> +		desc->istate |= IRQS_PENDING;
> +		cond_eoi_irq(chip, &desc->irq_data);
> +		return;
> +	}

This is not how it works. That has to be integrated into the existing
checks so that the non-moderated case does not have extra conditional
branches. Something like this:

static bool irq_can_handle_pm(struct irq_desc *desc)
{
        <SNIP>

	if (!irqd_has_set(irqd, IRQD_IRQ_INPROGRESS | IRQD_WAKEUP_ARMED | IRQD_IRQ_MODERATED))
		return true;

and then handle the moderated case in that function and return false if
it can't handle it.

>  	cond_unmask_eoi_irq(desc, chip);
>  
> +	if (irq_moderation_hook(desc))
> +		return;

Can you please come up with a decriptive function name? 'hook' says nothing.

> +/* Recommended values. */
> +struct irq_mod_info irq_mod_info ____cacheline_aligned = {
> +	.update_ms		= 5,
> +};
> +
> +module_param_named(delay_us, irq_mod_info.delay_us, uint, 0444);
> +MODULE_PARM_DESC(delay_us, "Max moderation delay us, 0 = moderation off, range 0-500.");

What enforces the 0-500 range?

This also lacks a description in Documentation/admin-guide/kernel-parameters.txt.

But why having command line parameters at all? Moderation needs to be
enabled for the relevant interrupts once the system came up, so changing
those parameters can be done in user space init code.

> +DEFINE_PER_CPU_ALIGNED(struct irq_mod_state, irq_mod_state);
> +
> +DEFINE_STATIC_KEY_FALSE(irq_moderation_enabled_key);
> +
> +/* Initialize moderation state, used in desc_set_defaults() */
> +void irq_moderation_init_fields(struct irq_desc_mod *mod)
> +{
> +	INIT_LIST_HEAD(&mod->ms_node);
> +	mod->enable = false;

When you integrate it into the irq_data state, then this extra boolean
is not required.

> +}
> +
> +static int set_mode(struct irq_desc *desc, bool enable)
> +{
> +	lockdep_assert_held(&desc->lock);

New line please to separate the assert and the code.

> +	if (enable) {

Please handle the !enable case first and spare the indentation below.

> +		struct irq_data *irqd = &desc->irq_data;
> +		struct irq_chip *chip = irqd->chip;
> +
> +		/* Moderation is only supported in specific cases. */
> +		enable &= !irqd_is_level_type(irqd);
> +		enable &= irqd_is_single_target(irqd);
> +		enable &= !chip->irq_bus_lock && !chip->irq_bus_sync_unlock;
> +		enable &= chip->irq_mask && chip->irq_unmask;
> +		enable &= desc->handle_irq == handle_edge_irq ||
> +				desc->handle_irq == handle_fasteoi_irq;
> +		if (!enable)
> +			return -EOPNOTSUPP;
> +	}
> +	desc->mod.enable = enable;
> +	return 0;
> +}
> +
> +#pragma clang diagnostic error "-Wformat"

New line please. This glued together stuff is horrible to read.

> +/* Print statistics */
> +static int moderation_show(struct seq_file *p, void *v)
> +{
> +	uint delay_us = irq_mod_info.delay_us;
> +	int j;
> +
> +#define HEAD_FMT "%5s  %8s  %11s  %11s  %9s\n"
> +#define BODY_FMT "%5u  %8u  %11u  %11u  %9u\n"

Please don't put defines into the function body. That's just breaking
the reading flow.

> +	seq_printf(p, HEAD_FMT,
> +		   "# CPU", "delay_ns", "timer_set", "enqueue", "stray_irq");
> +
> +	for_each_possible_cpu(j) {
> +		struct irq_mod_state *ms = per_cpu_ptr(&irq_mod_state, j);
> +
> +		seq_printf(p, BODY_FMT,
> +			   j, ms->mod_ns, ms->timer_set, ms->enqueue, ms->stray_irq);
> +	}

Why is this printing all the stats unconditionally, i.e. even when it's disabled?

> +	seq_printf(p, "\n"
> +		   "enabled              %s\n"
> +		   "delay_us             %u\n",
> +		   str_yes_no(delay_us > 0),
> +		   delay_us);
> +
> +	return 0;
> +}
> +
> +static int moderation_open(struct inode *inode, struct file *file)
> +{
> +	return single_open(file, moderation_show, pde_data(inode));
> +}
> +
> +/* Helpers to set and clamp values from procfs or at init. */
> +struct param_names {
> +	const char *name;
> +	uint *val;
> +	uint min;
> +	uint max;

Struct formatting is wrong, but see below.

> +};
> +
> +static struct param_names param_names[] = {
> +	{ "delay_us", &irq_mod_info.delay_us, 0, 500 },
> +	/* Entries with no name cannot be set at runtime. */
> +	{ "", &irq_mod_info.update_ms, 1, 100 },

Why is this a parameter if it can't be set at all? It's initialized to 5
and nowhere modifiable.

> +};
> +
> +static void update_enable_key(void)
> +{
> +	struct static_key_false *key = &irq_moderation_enabled_key;
> +
> +	if (irq_mod_info.delay_us != 0)
> +		static_branch_enable(key);
> +	else
> +		static_branch_disable(key);
> +}
> +
> +static ssize_t moderation_write(struct file *f, const char __user *buf, size_t count, loff_t *ppos)
> +{
> +	uint i, val, copy_len, name_len;
> +	struct param_names *n;
> +	char cmd[40];
> +
> +	copy_len = min(sizeof(cmd) - 1, count);
> +	if (count == 0)
> +		return -EINVAL;
> +	if (copy_from_user(cmd, buf, copy_len))
> +		return -EFAULT;
> +	cmd[copy_len] = '\0';
> +	for (i = 0, n = param_names;  i < ARRAY_SIZE(param_names); i++, n++) {
> +		name_len = strlen(n->name);
> +		if (name_len < 1 || copy_len < name_len + 2 || strncmp(cmd, n->name, name_len) ||
> +		    cmd[name_len] != '=')
> +			continue;
> +		if (kstrtouint(cmd + name_len + 1, 0, &val))
> +			return -EINVAL;
> +		WRITE_ONCE(*(n->val), clamp(val, n->min, n->max));
> +		if (n->val == &irq_mod_info.delay_us)
> +			update_enable_key();
> +		/* Notify parameter update. */
> +		irq_mod_info.version++;
> +		return count;
> +	}
> +	return -EINVAL;

This is truly horrible. What's wrong with having:

     /proc/irq/soft_moderation/stats
     /proc/irq/soft_moderation/$parameter1
     ...
     /proc/irq/soft_moderation/$parameterN

and then use kstrto*_from_user() in the write functions?

To simplify that, you can create DEFINE_PROC_SHOW_STORE_ATTRIBUTE
similar to DEFINE_PROC_SHOW_ATTRIBUTE.

Also what serializes reads vs. writes and writes vs. writes?

> +/* Handlers for /proc/irq/NN/soft_moderation */
> +static int mode_show(struct seq_file *p, void *v)
> +{
> +	struct irq_desc *desc = p->private;
> +
> +	if (!desc)
> +		return -ENOENT;
> +
> +	seq_puts(p, desc->mod.enable ? "on\n" : "off\n");
> +	return 0;
> +}
> +
> +static ssize_t mode_write(struct file *f, const char __user *buf, size_t count, loff_t *ppos)
> +{
> +	struct irq_desc *desc = (struct irq_desc *)pde_data(file_inode(f));
> +	char cmd[40];
> +	bool enable;
> +	int ret;
> +
> +	if (!desc)
> +		return -ENOENT;

How can desc be NULL when the file exists?

> +	if (count == 0 || count + 1 > sizeof(cmd))
> +		return -EINVAL;
> +	if (copy_from_user(cmd, buf, count))
> +		return -EFAULT;
> +	cmd[count] = '\0';
> +
> +	ret = kstrtobool(cmd, &enable);

kstrtobool_from_user()

> +	if (!ret) {
> +		guard(raw_spinlock_irqsave)(&desc->lock);

raw_spinlock_irq. There is nothing to save here, interrupts are enabled.

> +		ret = set_mode(desc, enable);
> +	}
> +	return ret ? : count;
> +}

> +/* Used on timer expiration or CPU shutdown. */
> +static void drain_desc_list(struct irq_mod_state *ms)
> +{
> +	struct irq_desc *desc, *next;
> +
> +	/* Remove from list and enable interrupts back. */
> +	list_for_each_entry_safe(desc, next, &ms->descs, mod.ms_node) {
> +		guard(raw_spinlock)(&desc->lock);
> +		list_del_init(&desc->mod.ms_node);
> +		irqd_clear(&desc->irq_data, IRQD_IRQ_INPROGRESS);
> +		/* Protect against competing free_irq(). */

s/competing/concurrent/

> +		if (desc->action)
> +			__enable_irq(desc);
> +	}
> +}
> +
> +static enum hrtimer_restart timer_callback(struct hrtimer *timer)
> +{
> +	struct irq_mod_state *ms = this_cpu_ptr(&irq_mod_state);
> +
> +	drain_desc_list(ms);
> +	/* Prepare to accumulate next moderation delay. */
> +	ms->sleep_ns = 0;
> +	return HRTIMER_NORESTART;
> +}
> +
> +/* Hotplug callback for setup. */
> +static int cpu_setup_cb(uint cpu)
> +{
> +	struct irq_mod_state *ms = this_cpu_ptr(&irq_mod_state);
> +
> +	if (ms->moderation_on)
> +		return 0;

How can that be enabled at the point when this is invoked?

> +	hrtimer_setup(&ms->timer, timer_callback,
> +		      CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED_HARD);
> +	INIT_LIST_HEAD(&ms->descs);
> +	smp_mb();

Undocumented barrier. Where is the counterpart and what does it protect
against? This is a per CPU state with no SMP relevance, no?

> +	ms->moderation_on = true;
> +	return 0;
> +}
> +
> +/*
> + * Hotplug callback for shutdown.
> + * Mark the CPU as offline for moderation, and drain the list of masked
> + * interrupts. Any subsequent interrupt on this CPU will not be
> + * moderated, but they will be on the new target.
> + */
> +static int cpu_remove_cb(uint cpu)
> +{
> +	struct irq_mod_state *ms = this_cpu_ptr(&irq_mod_state);
> +	unsigned long flags;
> +
> +	local_irq_save(flags);

        guard(irqsave)()

But please double check whether save is required at all.

> +	ms->moderation_on = false;
> +	drain_desc_list(ms);
> +	hrtimer_cancel(&ms->timer);
> +	local_irq_restore(flags);
> +	return 0;
> +}
> +
> +static void(mod_pm_prepare_cb)(void *arg)
> +{
> +	cpu_remove_cb(smp_processor_id());
> +}
> +
> +static void(mod_pm_resume_cb)(void *arg)
> +{
> +	cpu_setup_cb(smp_processor_id());
> +}
> +
> +static int mod_pm_notifier_cb( struct notifier_block *nb, unsigned long event, void *unused)
> +{
> +	switch (event) {
> +	case PM_SUSPEND_PREPARE:
> +		on_each_cpu(mod_pm_prepare_cb, NULL, 1);
> +		break;
> +	case PM_POST_SUSPEND:
> +		on_each_cpu(mod_pm_resume_cb, NULL, 1);
> +		break;
> +	}
> +	return NOTIFY_OK;
> +}
> +
> +struct notifier_block mod_nb = {
> +	.notifier_call = mod_pm_notifier_cb,
> +	.priority = 100,

struct formatting.

> +};
> +
> +static void clamp_parameter(uint *dst, uint val)
> +{
> +	struct param_names *n = param_names;
> +
> +	for (int i = 0; i < ARRAY_SIZE(param_names); i++, n++) {
> +		if (dst == n->val) {
> +			WRITE_ONCE(*dst, clamp(val, n->min, n->max));
> +			return;
> +		}
> +	}
> +}
> +
> +static int __init init_irq_moderation(void)
> +{
> +	/* Clamp all initial values to the allowed range, update version. */
> +	for (uint *cur = &irq_mod_info.delay_us; cur < irq_mod_info.pad; cur++)

This is just wrong. If one of the parameter sizes is changed, then this
goes south....

Get rid of the module parameter and you can spare all this hackery as
the procfs entries have limit checks.

> +		clamp_parameter(cur, *cur);
> +	irq_mod_info.version++;

What's 'version'? I assume it's a sequence number to detect
updates. Then name it that way.

> +	cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "irq_moderation", cpu_setup_cb, cpu_remove_cb);

That can fail...

> +	register_pm_notifier(&mod_nb);
> +
> +	update_enable_key();
> +
> +	proc_create_data("irq/soft_moderation", 0644, NULL, &proc_ops, NULL);
> +	return 0;
> +}
> +
> +MODULE_LICENSE("Dual BSD/GPL");
> +MODULE_VERSION("1.0");
> +MODULE_AUTHOR("Luigi Rizzo <lrizzo@google.com>");
> +MODULE_DESCRIPTION("Global Software Interrupt Moderation");
> +module_init(init_irq_moderation);

When you remove the module parameters then you can get rid of this whole
module muck and explicitely invoke the moderation init from the irq
procfs init.

> + *
> + * Used once per epoch:
> + * @version:		version fetched from irq_mod_info

This still does not explain what version is about. If you add comments
then please make them useful.

> + * Statistics
> + * @timer_set:		how many timer_set calls
> + */
> +struct irq_mod_state {
> +	/* Used on every interrupt. */
> +	struct hrtimer		timer;
> +	bool			moderation_on;
> +	u32			intr_count;
> +	u32			sleep_ns;
> +	u32			mod_ns;
> +
> +	/* Used less frequently. */
> +	u64			epoch_start_ns;
> +	u32			update_ns;
> +	u32			stray_irq;
> +
> +	/* Used once per epoch per source. */
> +	struct list_head	descs;
> +	u32			enqueue;
> +
> +	/* Used once per epoch. */

Duplicated information. That's already in the kernel doc, no?

> +	u32			version;
> +
> +	/* Statistics */
> +	u32			timer_set;
> +};
> +
> diff --git a/kernel/irq/irq_moderation_hook.h b/kernel/irq/irq_moderation_hook.h
> new file mode 100644
> index 0000000000000..f9ac3005f69f4
> --- /dev/null
> +++ b/kernel/irq/irq_moderation_hook.h
> @@ -0,0 +1,104 @@
> +/* SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause */
> +
> +#ifndef _LINUX_IRQ_MODERATION_HOOK_H
> +#define _LINUX_IRQ_MODERATION_HOOK_H

Why does that need another header file when it includes the other one anyway?

> +/* Interrupt hooks for Global Software Interrupt Moderation, GSIM */
> +
> +#include <linux/hrtimer.h>
> +#include <linux/irq.h>
> +#include <linux/irqdesc.h>
> +#include <linux/kernel.h>
> +
> +#include "irq_moderation.h"
> +
> +#ifdef CONFIG_IRQ_SOFT_MODERATION
> +
> +static inline void __maybe_new_epoch(struct irq_mod_state *ms)

__maybe_new_epoch() is really awesome. check_epoch() perhaps?

> +{
> +	const u64 now = ktime_get_ns(), epoch_ns = now - ms->epoch_start_ns;
> +	const u32 slack_ns = 5000;
> +	u32 version;
> +
> +	/* Run approximately every update_ns, a little bit early is ok. */
> +	if (epoch_ns < ms->update_ns - slack_ns)
> +		return;
> +	ms->epoch_start_ns = now;
> +	/* Fetch updated parameters. */
> +        while ((version = READ_ONCE(irq_mod_info.version)) != ms->version) {

Spaces instead of tab

> +		ms->update_ns = READ_ONCE(irq_mod_info.update_ms) * NSEC_PER_MSEC;
> +		ms->mod_ns = READ_ONCE(irq_mod_info.delay_us) * NSEC_PER_USEC;
> +		ms->version = version;
> +        }

So that version thing is an open coded sequence count, right? What's
wrong with using a real sequence count?

> +}
> +
> +static inline bool irq_moderation_needed(struct irq_desc *desc, struct irq_mod_state *ms)
> +{
> +	if (!hrtimer_is_queued(&ms->timer)) {
> +		const u32 min_delay_ns = 10000;
> +		const u64 slack_ns = 2000;
> +
> +		/* Accumulate sleep time, no moderation if too small. */
> +		ms->sleep_ns += ms->mod_ns;
> +		if (ms->sleep_ns < min_delay_ns)
> +			return false;
> +		/* We need moderation, start the timer. */
> +		ms->timer_set++;
> +		hrtimer_start_range_ns(&ms->timer, ns_to_ktime(ms->sleep_ns),
> +				       slack_ns, HRTIMER_MODE_REL_PINNED_HARD);
> +	}
> +
> +	/*
> +	 * Add to the timer list and __disable_irq() to prevent serving subsequent
> +	 * interrupts.
> +	 */
> +	if (!list_empty(&desc->mod.ms_node)) {
> +		/* Very unlikely, stray interrupt while moderated. */
> +		ms->stray_irq++;

Why is this not caught on entry to the handler? If this ends up here
with the node queued, then that's a bug IMO.

> +	} else {
> +		ms->enqueue++;
> +		list_add(&desc->mod.ms_node, &ms->descs);
> +		__disable_irq(desc);
> +	}
> +	irqd_set(&desc->irq_data, IRQD_IRQ_INPROGRESS);
> +	return true;
> +}
> +
> +/*
> + * Use after running the handler, with lock held. If this source should be
> + * moderated, disable it, add to the timer list for this CPU and return true.
> + */
> +static inline bool irq_moderation_hook(struct irq_desc *desc)
> +{
> +	struct irq_mod_state *ms = this_cpu_ptr(&irq_mod_state);
> +
> +	if (!static_branch_unlikely(&irq_moderation_enabled_key))
> +		return false;
> +	if (!desc->mod.enable)
> +		return false;
> +	if (!ms->moderation_on)
> +		return false;
> +
> +	ms->intr_count++;
> +
> +	/* Is this a new epoch? ktime_get_ns() is expensive, don't check too often. */

This comment is confusing.

> +	if ((ms->intr_count & 0xf) == 0)
> +		__maybe_new_epoch(ms);
> +
> +	return irq_moderation_needed(desc, ms);
> +}
> +
> +/* On entry of desc->irq_handler() tell handler to skip moderated interrupts. */
> +static inline bool irq_moderation_skip_moderated(struct irq_desc *desc)
> +{
> +	return !list_empty(&desc->mod.ms_node);

See above ....

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH-v3 2/3] genirq: Adaptive Global Software Interrupt Moderation (GSIM)
  2025-12-17 11:21 ` [PATCH-v3 2/3] genirq: Adaptive " Luigi Rizzo
  2025-12-17 19:18   ` Randy Dunlap
@ 2025-12-18 16:00   ` Thomas Gleixner
  2025-12-21  9:46   ` kernel test robot
  2 siblings, 0 replies; 11+ messages in thread
From: Thomas Gleixner @ 2025-12-18 16:00 UTC (permalink / raw)
  To: Luigi Rizzo, Marc Zyngier, Luigi Rizzo, Paolo Abeni,
	Andrew Morton, Sean Christopherson, Jacob Pan
  Cc: linux-kernel, linux-arch, Bjorn Helgaas, Willem de Bruijn,
	Luigi Rizzo

On Wed, Dec 17 2025 at 11:21, Luigi Rizzo wrote:
> +module_param_named(hardirq_percent, irq_mod_info.hardirq_percent, uint, 0444);
> +MODULE_PARM_DESC(hardirq_percent, "Target max hardirq percentage, 0 off, range 0-100.");
> +
> +module_param_named(target_intr_rate, irq_mod_info.target_intr_rate, uint, 0444);
> +MODULE_PARM_DESC(target_intr_rate, "Target max interrupt rate, 0 off, range 0-50000000.");
> +
> +module_param_named(update_ms, irq_mod_info.update_ms, uint, 0444);
> +MODULE_PARM_DESC(update_ms, "Update interval in milliseconds, range 1-100");

Comment about command line parameters still applies

>  DEFINE_PER_CPU_ALIGNED(struct irq_mod_state, irq_mod_state);
>  
>  DEFINE_STATIC_KEY_FALSE(irq_moderation_enabled_key);
> @@ -142,27 +167,72 @@ static int set_mode(struct irq_desc *desc, bool enable)
>  /* Print statistics */
>  static int moderation_show(struct seq_file *p, void *v)
>  {
> +	ulong intr_rate = 0, irq_high = 0, my_irq_high = 0, hardirq_high = 0;
>  	uint delay_us = irq_mod_info.delay_us;
> -	int j;
> +	u64 now = ktime_get_ns();
> +	int j, active_cpus = 0;
>  
> -#define HEAD_FMT "%5s  %8s  %11s  %11s  %9s\n"
> -#define BODY_FMT "%5u  %8u  %11u  %11u  %9u\n"
> +#define HEAD_FMT "%5s  %8s  %8s  %4s  %8s  %11s  %11s  %11s  %11s  %11s  %9s\n"
> +#define BODY_FMT "%5u  %8u  %8u  %4u  %8u  %11u  %11u  %11u  %11u  %11u  %9u\n"

Sigh.

>  	seq_printf(p, HEAD_FMT,
> -		   "# CPU", "delay_ns", "timer_set", "enqueue", "stray_irq");
> +		   "# CPU", "irq/s", "my_irq/s", "cpus", "delay_ns",
> +		   "irq_hi", "my_irq_hi", "hardirq_hi", "timer_set",
> +		   "enqueue", "stray_irq");
>  
>  	for_each_possible_cpu(j) {
>  		struct irq_mod_state *ms = per_cpu_ptr(&irq_mod_state, j);
> +		/* Watch out, epoch start_ns is 64 bits. */
> +		u64 epoch_start_ns = atomic64_read((atomic64_t *)&ms->epoch_start_ns);

No. We are not casting this to an atomic when the write side is non-atomic.

> +		s64 age_ms = min((now - epoch_start_ns) / NSEC_PER_MSEC, (u64)999999);
> +
> +		if (age_ms < 10000) {
> +			/* Average intr_rate over recently active CPUs. */
> +			active_cpus++;
> +			intr_rate += ms->intr_rate;

data_race() - applies to _all_ remote accesses.

> +		} else {
> +			age_ms = -1;
> +			ms->intr_rate = 0;
> +			ms->my_intr_rate = 0;
> +			ms->scaled_cpu_count = 0;
> +			ms->mod_ns = 0;

So this overwrites data which can be concurrently modified on the remote
CPU. How is that even remotely correct?

> @@ -48,9 +66,20 @@ extern struct irq_mod_info irq_mod_info;
>   *
>   * Used once per epoch:
>   * @version:		version fetched from irq_mod_info
> + * @delay_ns:		fetched from irq_mod_info
> + * @epoch_ns:		duration of last epoch
> + * @last_total_intrs:	from irq_mod_info
> + * @last_total_cpus:	from irq_mod_info
> + * @last_irqtime:	from cpustat[CPUTIME_IRQ]
>   *
>   * Statistics
> + * @intr_rate:		smoothed global interrupt rate
> + * @my_intr_rate:	smoothed interrupt rate for this CPU

my_intr_rate is a very descriptive name for the content. NOT.

>   * @timer_set:		how many timer_set calls
> + * @scaled_cpu_count:	smoothed CPU count (scaled)
> + * @irq_high:		how many times global irq above threshold
> + * @my_irq_high:	how many times local irq above threshold

What's this MY about? This is per CPU, right? So local_irq_high or
something like that and then name the other one global_irq_high so it's
entirely clear what this is about. Applies for the above too.

>  DECLARE_PER_CPU_ALIGNED(struct irq_mod_state, irq_mod_state);
>  
> +#define MIN_SCALING_FACTOR 8u
>  extern struct static_key_false irq_moderation_enabled_key;

Your choices of using newlines are truly interesting.

> +++ b/kernel/irq/irq_moderation_hook.c

This needs a new file because?

> +static inline u32 smooth_avg(u32 old, u32 cur, u32 steps)
> +{
> +	const u32 smooth_factor = 64;
> +
> +	steps = min(steps, smooth_factor - 1);
> +	return ((smooth_factor - steps) * old + steps * cur) / smooth_factor;

Still lacks an explanation what this does.

> +/* Measure and assess total and per-CPU interrupt rates. */
> +static inline bool irqrate_high(struct irq_mod_state *ms, u32 target_rate, u32 steps)
> +{
> +	u32 intr_rate, my_intr_rate, delta_intrs, active_cpus, tmp;
> +	bool my_rate_high, global_rate_high;
> +
> +	my_intr_rate = ((u64)ms->intr_count * NSEC_PER_SEC) / ms->epoch_ns;
> +
> +	/* Accumulate global counter and compute global interrupt rate. */
> +	tmp = atomic_add_return(ms->intr_count, &irq_mod_info.total_intrs);
> +	ms->intr_count = 1;
> +	delta_intrs = tmp - ms->last_total_intrs;
> +	ms->last_total_intrs = tmp;
> +	intr_rate = ((u64)delta_intrs * NSEC_PER_SEC) / ms->epoch_ns;
> +
> +	/*
> +	 * Count how many CPUs handled interrupts in the last epoch, needed
> +	 * to determine the per-CPU target (target_rate / active_cpus).
> +	 * Each active CPU increments the global counter approximately every
> +	 * update_ns. Scale the value by (update_ns / ms->epoch_ns) to get the
> +	 * correct value. Also apply rounding and make sure active_cpus > 0.
> +	 */
> +	tmp = atomic_add_return(1, &irq_mod_info.total_cpus);
> +	active_cpus = tmp - ms->last_total_cpus;
> +	ms->last_total_cpus = tmp;
> +	active_cpus = (active_cpus * ms->update_ns + ms->epoch_ns / 2) / ms->epoch_ns;
> +	if (active_cpus < 1)
> +		active_cpus = 1;
> +
> +	/* Compare with global and per-CPU targets. */
> +	global_rate_high = intr_rate > target_rate;
> +
> +	/*
> +	 * Short epochs may lead to underestimate the number of active CPUs.
> +	 * Apply a scaling factor to compensate. This may make the controller
> +	 * a bit more aggressive but does not harm system throughput.
> +	 */
> +	my_rate_high = my_intr_rate * active_cpus * irq_mod_info.scale_cpus > target_rate * 100;
> +
> +	/* Statistics. */
> +	ms->intr_rate = smooth_avg(ms->intr_rate, intr_rate, steps);
> +	ms->my_intr_rate = smooth_avg(ms->my_intr_rate, my_intr_rate, steps);
> +	ms->scaled_cpu_count = smooth_avg(ms->scaled_cpu_count, active_cpus * 256, steps);
> +	ms->my_irq_high += my_rate_high;
> +	ms->irq_high += global_rate_high;
> +
> +	/* Moderate on this CPU only if both global and local rates are high. */
> +	return global_rate_high && my_rate_high;
> +}
> +
> +/* Periodic adjustment, called once per epoch. */
> +void irq_moderation_update_epoch(struct irq_mod_state *ms)
> +{
> +	const u32 hardirq_percent = READ_ONCE(irq_mod_info.hardirq_percent);
> +	const u32 target_rate = READ_ONCE(irq_mod_info.target_intr_rate);
> +	const u32 min_delay_ns = 500;
> +	bool above_target = false;
> +	u32 version;
> +	u32 steps;
> +
> +	/* Fetch updated parameters. */
> +	while ((version = READ_ONCE(irq_mod_info.version)) != ms->version) {
> +		ms->update_ns = READ_ONCE(irq_mod_info.update_ms) * NSEC_PER_MSEC;
> +		ms->mod_ns = READ_ONCE(irq_mod_info.delay_us) * NSEC_PER_USEC;
> +		ms->delay_ns = ms->mod_ns;
> +		ms->version = version;
> +	}
> +
> +	if (target_rate == 0 && hardirq_percent == 0) {
> +		/* Use fixed moderation delay. */
> +		ms->mod_ns = ms->delay_ns;
> +		ms->intr_rate = 0;
> +		ms->my_intr_rate = 0;
> +		ms->scaled_cpu_count = 0;
> +		return;
> +	}

So this evaluates modifiable parameters in three steps. Why isn't that
all evaluated under the sequence/version mechanics or at least the
hardirq percentage and the target rate reevaluated when there is new data?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH-v3 3/3] genirq: Configurable default mode for GSIM
  2025-12-17 11:21 ` [PATCH-v3 3/3] genirq: Configurable default mode for GSIM Luigi Rizzo
@ 2025-12-18 21:56   ` Thomas Gleixner
  0 siblings, 0 replies; 11+ messages in thread
From: Thomas Gleixner @ 2025-12-18 21:56 UTC (permalink / raw)
  To: Luigi Rizzo, Marc Zyngier, Luigi Rizzo, Paolo Abeni,
	Andrew Morton, Sean Christopherson
  Cc: linux-kernel, linux-arch, Bjorn Helgaas, Willem de Bruijn,
	Greg Kroah-Hartman

On Wed, Dec 17 2025 at 11:21, Luigi Rizzo wrote:
> GSIM (Global Software Interrupt Moderation) can be enabled only after the
> interrupt is created by writing to /proc/irq/NN/soft_moderation. This is
> impractical when devices that are dynamically created or reconfigured.
>
> Add a module parameter irq_moderation.enable_list to specify whether
> moderation should be enabled at interrupt creation time. This is done
> with a comma-separated list of patterns (enable_list) matched against
> interrupt or handler names when the interrupt is created.
>
> This allows very flexible control without having to modify every single
> driver. As an example, one can limit to specific drivers by specifying
> the handler functions (using parentheses) as below
>
>      irq_moderation.enable_list="nvme_irq(),vfio_msihandler()"
>
> ora apply it to certain interrupt names
>
>      irq_moderation.enable_list="eth*,vfio*"

TBH, that's an admittely creative but horrible hack.

That's what uevents are for. Something like the below provides the
information you need to keep track of interrupt requests:

KERNEL[57.529152] change   /kernel/irq/256 (irq)

With that it is very practical to do all this magic in user space, no?

Thanks,

        tglx
---
 kernel/irq/internals.h |    6 ++++++
 kernel/irq/irqdesc.c   |   25 ++++++++++++++++++++-----
 kernel/irq/manage.c    |    1 +
 3 files changed, 27 insertions(+), 5 deletions(-)

--- a/kernel/irq/internals.h
+++ b/kernel/irq/internals.h
@@ -521,3 +521,9 @@ static inline void irq_debugfs_copy_devn
 {
 }
 #endif /* CONFIG_GENERIC_IRQ_DEBUGFS */
+
+#if defined(CONFIG_SPARSE_IRQ) && defined(CONFIG_SYSFS)
+void irqdesc_action_uevent(struct irq_desc *desc);
+#else
+static inline void irqdesc_action_uevent(struct irq_desc *desc) { }
+#endif
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -241,7 +241,7 @@ static int init_desc(struct irq_desc *de
 static void irq_kobj_release(struct kobject *kobj);
 
 #ifdef CONFIG_SYSFS
-static struct kobject *irq_kobj_base;
+static struct kset *irq_kobj_base;
 
 #define IRQ_ATTR_RO(_name) \
 static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
@@ -363,10 +363,12 @@ static void irq_sysfs_add(int irq, struc
 		 * crucial and failures in the late irq_sysfs_init()
 		 * cannot be rolled back.
 		 */
-		if (kobject_add(&desc->kobj, irq_kobj_base, "%d", irq))
+		if (kobject_add(&desc->kobj, &irq_kobj_base->kobj, "%d", irq)) {
 			pr_warn("Failed to add kobject for irq %d\n", irq);
-		else
+		} else {
 			desc->istate |= IRQS_SYSFS;
+			desc->kobj.kset = irq_kobj_base;
+		}
 	}
 }
 
@@ -382,6 +384,16 @@ static void irq_sysfs_del(struct irq_des
 		kobject_del(&desc->kobj);
 }
 
+void irqdesc_action_uevent(struct irq_desc *desc)
+{
+	if (!(desc->istate & IRQS_SYSFS))
+		return;
+
+	guard(mutex)(&sparse_irq_lock);
+	if (irq_kobj_base)
+		kobject_uevent(&desc->kobj, KOBJ_CHANGE);
+}
+
 static int __init irq_sysfs_init(void)
 {
 	struct irq_desc *desc;
@@ -389,13 +401,16 @@ static int __init irq_sysfs_init(void)
 
 	/* Prevent concurrent irq alloc/free */
 	guard(mutex)(&sparse_irq_lock);
-	irq_kobj_base = kobject_create_and_add("irq", kernel_kobj);
+	irq_kobj_base = kset_create_and_add("irq", NULL, kernel_kobj);
 	if (!irq_kobj_base)
 		return -ENOMEM;
 
 	/* Add the already allocated interrupts */
-	for_each_irq_desc(irq, desc)
+	for_each_irq_desc(irq, desc) {
 		irq_sysfs_add(irq, desc);
+		if (data_race(desc->action))
+			kobject_uevent(&desc->kobj, KOBJ_CHANGE);
+	}
 	return 0;
 }
 postcore_initcall(irq_sysfs_init);
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -1762,6 +1762,7 @@ static int
 	register_irq_proc(irq, desc);
 	new->dir = NULL;
 	register_handler_proc(irq, new);
+	irqdesc_action_uevent(desc);
 	return 0;
 
 mismatch:

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH-v3 2/3] genirq: Adaptive Global Software Interrupt Moderation (GSIM)
  2025-12-17 11:21 ` [PATCH-v3 2/3] genirq: Adaptive " Luigi Rizzo
  2025-12-17 19:18   ` Randy Dunlap
  2025-12-18 16:00   ` Thomas Gleixner
@ 2025-12-21  9:46   ` kernel test robot
  2 siblings, 0 replies; 11+ messages in thread
From: kernel test robot @ 2025-12-21  9:46 UTC (permalink / raw)
  To: Luigi Rizzo, Thomas Gleixner, Marc Zyngier, Luigi Rizzo,
	Paolo Abeni, Andrew Morton, Sean Christopherson, Jacob Pan
  Cc: llvm, oe-kbuild-all, Linux Memory Management List, linux-kernel,
	linux-arch, Bjorn Helgaas, Willem de Bruijn

Hi Luigi,

kernel test robot noticed the following build errors:

[auto build test ERROR on tip/irq/core]
[also build test ERROR on linus/master v6.19-rc1]
[cannot apply to next-20251219]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Luigi-Rizzo/genirq-Fixed-Global-Software-Interrupt-Moderation-GSIM/20251217-193249
base:   tip/irq/core
patch link:    https://lore.kernel.org/r/20251217112128.1401896-3-lrizzo%40google.com
patch subject: [PATCH-v3 2/3] genirq: Adaptive Global Software Interrupt Moderation (GSIM)
config: hexagon-allmodconfig (https://download.01.org/0day-ci/archive/20251221/202512211737.PVtnFVdB-lkp@intel.com/config)
compiler: clang version 17.0.6 (https://github.com/llvm/llvm-project 6009708b4367171ccdbf4b5905cb6a803753fe18)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251221/202512211737.PVtnFVdB-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202512211737.PVtnFVdB-lkp@intel.com/

All errors (new ones prefixed by >>):

>> ld.lld: error: undefined symbol: __hexagon_udivdi3
   >>> referenced by irq_moderation_hook.c
   >>>               kernel/irq/irq_moderation_hook.o:(irq_moderation_update_epoch) in archive vmlinux.a
   >>> referenced by irq_moderation_hook.c
   >>>               kernel/irq/irq_moderation_hook.o:(irq_moderation_update_epoch) in archive vmlinux.a
   >>> referenced by irq_moderation_hook.c
   >>>               kernel/irq/irq_moderation_hook.o:(irq_moderation_update_epoch) in archive vmlinux.a
   >>> referenced 1 more times
   >>> did you mean: __hexagon_udivsi3
   >>> defined in: vmlinux.a(arch/hexagon/lib/udivsi3.o)

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH-v3 1/3] genirq: Fixed Global Software Interrupt Moderation (GSIM)
  2025-12-17 11:21 ` [PATCH-v3 1/3] genirq: Fixed " Luigi Rizzo
  2025-12-17 19:17   ` Randy Dunlap
  2025-12-18 15:37   ` Thomas Gleixner
@ 2025-12-21 11:21   ` kernel test robot
  2 siblings, 0 replies; 11+ messages in thread
From: kernel test robot @ 2025-12-21 11:21 UTC (permalink / raw)
  To: Luigi Rizzo, Thomas Gleixner, Marc Zyngier, Luigi Rizzo,
	Paolo Abeni, Andrew Morton, Sean Christopherson, Jacob Pan
  Cc: oe-kbuild-all, Linux Memory Management List, linux-kernel,
	linux-arch, Bjorn Helgaas, Willem de Bruijn

Hi Luigi,

kernel test robot noticed the following build errors:

[auto build test ERROR on tip/irq/core]
[also build test ERROR on linus/master v6.19-rc1]
[cannot apply to next-20251219]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Luigi-Rizzo/genirq-Fixed-Global-Software-Interrupt-Moderation-GSIM/20251217-193249
base:   tip/irq/core
patch link:    https://lore.kernel.org/r/20251217112128.1401896-2-lrizzo%40google.com
patch subject: [PATCH-v3 1/3] genirq: Fixed Global Software Interrupt Moderation (GSIM)
config: um-allyesconfig (https://download.01.org/0day-ci/archive/20251221/202512211933.p2Yg8Zzq-lkp@intel.com/config)
compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251221/202512211933.p2Yg8Zzq-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202512211933.p2Yg8Zzq-lkp@intel.com/

All errors (new ones prefixed by >>):

   kernel/irq/irq_moderation.c:141: warning: ignoring '#pragma clang diagnostic' [-Wunknown-pragmas]
     141 | #pragma clang diagnostic error "-Wformat"
   In file included from include/linux/uaccess.h:11,
                    from include/linux/sched/task.h:13,
                    from include/linux/sched/signal.h:9,
                    from include/linux/rcuwait.h:6,
                    from include/linux/mm.h:36,
                    from include/linux/kallsyms.h:13,
                    from kernel/irq/irq_moderation.c:7:
   In function 'check_copy_size',
       inlined from 'copy_from_user' at include/linux/uaccess.h:218:7,
       inlined from 'mode_write' at kernel/irq/irq_moderation.c:259:6:
>> include/linux/ucopysize.h:54:25: error: call to '__bad_copy_to' declared with attribute error: copy destination size is too small
      54 |                         __bad_copy_to();
         |                         ^~~~~~~~~~~~~~~


vim +/__bad_copy_to +54 include/linux/ucopysize.h

808aac63e2bdf9 Kees Cook 2025-02-28  43  
808aac63e2bdf9 Kees Cook 2025-02-28  44  static __always_inline __must_check bool
808aac63e2bdf9 Kees Cook 2025-02-28  45  check_copy_size(const void *addr, size_t bytes, bool is_source)
808aac63e2bdf9 Kees Cook 2025-02-28  46  {
808aac63e2bdf9 Kees Cook 2025-02-28  47  	int sz = __builtin_object_size(addr, 0);
808aac63e2bdf9 Kees Cook 2025-02-28  48  	if (unlikely(sz >= 0 && sz < bytes)) {
808aac63e2bdf9 Kees Cook 2025-02-28  49  		if (!__builtin_constant_p(bytes))
808aac63e2bdf9 Kees Cook 2025-02-28  50  			copy_overflow(sz, bytes);
808aac63e2bdf9 Kees Cook 2025-02-28  51  		else if (is_source)
808aac63e2bdf9 Kees Cook 2025-02-28  52  			__bad_copy_from();
808aac63e2bdf9 Kees Cook 2025-02-28  53  		else
808aac63e2bdf9 Kees Cook 2025-02-28 @54  			__bad_copy_to();
808aac63e2bdf9 Kees Cook 2025-02-28  55  		return false;
808aac63e2bdf9 Kees Cook 2025-02-28  56  	}
808aac63e2bdf9 Kees Cook 2025-02-28  57  	if (WARN_ON_ONCE(bytes > INT_MAX))
808aac63e2bdf9 Kees Cook 2025-02-28  58  		return false;
808aac63e2bdf9 Kees Cook 2025-02-28  59  	check_object_size(addr, bytes, is_source);
808aac63e2bdf9 Kees Cook 2025-02-28  60  	return true;
808aac63e2bdf9 Kees Cook 2025-02-28  61  }
808aac63e2bdf9 Kees Cook 2025-02-28  62  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2025-12-21 11:22 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-12-17 11:21 [PATCH-v3 0/3] Global Software Interrupt Moderation (GSIM) Luigi Rizzo
2025-12-17 11:21 ` [PATCH-v3 1/3] genirq: Fixed " Luigi Rizzo
2025-12-17 19:17   ` Randy Dunlap
2025-12-18 15:37   ` Thomas Gleixner
2025-12-21 11:21   ` kernel test robot
2025-12-17 11:21 ` [PATCH-v3 2/3] genirq: Adaptive " Luigi Rizzo
2025-12-17 19:18   ` Randy Dunlap
2025-12-18 16:00   ` Thomas Gleixner
2025-12-21  9:46   ` kernel test robot
2025-12-17 11:21 ` [PATCH-v3 3/3] genirq: Configurable default mode for GSIM Luigi Rizzo
2025-12-18 21:56   ` Thomas Gleixner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).