public inbox for linux-kernel@vger.kernel.org
From: Thomas Gleixner <tglx@kernel.org>
To: LKML <linux-kernel@vger.kernel.org>
Cc: x86@kernel.org, Dmitry Ilvokhin <d@ilvokhin.com>,
	Neil Horman <nhorman@tuxdriver.com>,
	Radu Rendec <radu@rendec.net>
Subject: [patch v2 12/14] [RFC] genirq/proc: Provide binary statistic interface
Date: Fri, 20 Mar 2026 14:22:24 +0100
Message-ID: <20260320132102.841834115@kernel.org>
In-Reply-To: 20260320131108.344376329@kernel.org

/proc/interrupts is expensive to evaluate for monitoring because:

  - it is text based and contains a lot of information which is not
    relevant for interrupt frequency analysis. Due to the extra information
    like chip name, hardware interrupt number, interrupt action names, it
    has to take the interrupt descriptor lock to output those items into
    the seq_file buffer. That obviously interferes with high frequency
    interrupt workloads.

  - it contains both device interrupts, per CPU and architecture specific
    interrupt counters without being able to look at them separately. The
    file is seekable by some definition of seekable as the position can
    change when interrupts are requested or freed, so the data has to be
    read completely to get a coherent picture.

  - it emits records for requested interrupts even if their interrupt count
    is zero.

  - it always prints the per CPU counters even if all but one of them are
    zero.

  - converting numbers to text and then parsing the text back to numbers in
    user space is a pretty wasteful exercise.

Provide a new interface which addresses the above pain points:

  1) The interface is binary and emits variable length records per
     interrupt. Each record starts with a header containing the interrupt
     number and the number of data entries following the header. The data
     entries consist of a CPU number and count pair.

  2) Interrupts with a total count of zero are skipped and produce no
     output at all.

  3) Interrupts which have a single CPU affinity either due to a restricted
     affinity mask or due to the underlying interrupt chip restricting a
     mask to a single CPU target emit only one data entry.

     That means the stale counts on previous target CPUs are not emitted.
     Those counts are not really interesting for interrupt frequency
     analysis as they are not changing anymore and are therefore pointless
     for accounting.

  4) The interface separates device interrupts, per CPU interrupts and
     architecture specific interrupts.

     Per CPU and architecture specific interrupts can only be monitored,
     while device interrupts can also be steered by changing the affinity
     unless they are affinity managed by the kernel.

     Per CPU interrupts are only available on architectures, e.g. ARM64,
     which use the regular interrupt descriptor mechanism for per CPU
     interrupt handling.

     Architectures which have their own mechanics, e.g. x86, do not enable
     and provide the per CPU interface as those interrupts are covered by
     the architecture specific accounting.

  5) The readout is fully lockless so it does not interfere with concurrent
     interrupt handling.

  6) Seek is restricted to lseek(fd, 0, SEEK_SET) as that's the only
     operation which makes sense due to the variable record length and the
     dynamics of interrupt request/free operations which influence the
     position of the records in the output. All other lseek() invocations
     return the current file position instead of an error code. That keeps
     e.g. Python happy, as an error code would cause its file open checks
     to mark the resulting file object non-seekable.
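
For illustration only, a minimal user space sketch of how a consumer might
walk the record format described in 1). The struct definitions mirror the
UAPI header added by this patch. irq_stats_parse(), irq_stat_cb and
irq_stat_sum_cb() are hypothetical names which are not part of the patch.
The sketch assumes that the buffer is parsed in native byte order on the
machine which produced it:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Local mirrors of the layout in include/uapi/linux/irqstats.h */
struct irq_proc_stat_cpu {
	unsigned int	cpu;
	unsigned int	cnt;
};

struct irq_proc_stat_data {
	unsigned int			irqnr;
	unsigned int			entries;
	struct irq_proc_stat_cpu	pcpu[];
};

/* The parser relies on a header without padding in front of pcpu[] */
static_assert(sizeof(struct irq_proc_stat_data) == 2 * sizeof(unsigned int),
	      "unexpected padding in the record header");

/* Hypothetical callback type, invoked once per (irq, cpu, count) tuple */
typedef void (*irq_stat_cb)(unsigned int irqnr, unsigned int cpu,
			    unsigned int cnt, void *arg);

/* Hypothetical example callback: accumulate all counts into *arg */
static void irq_stat_sum_cb(unsigned int irqnr, unsigned int cpu,
			    unsigned int cnt, void *arg)
{
	(void)irqnr;
	(void)cpu;
	*(unsigned long long *)arg += cnt;
}

/*
 * Walk all complete records in @buf. Returns the number of bytes
 * consumed. A trailing partial record is left for the caller.
 */
static size_t irq_stats_parse(const void *buf, size_t len, irq_stat_cb cb,
			      void *arg)
{
	const unsigned char *p = buf;
	size_t pos = 0;

	while (len - pos >= sizeof(struct irq_proc_stat_data)) {
		struct irq_proc_stat_data hdr;
		size_t payload;

		/* memcpy() avoids alignment assumptions about @buf */
		memcpy(&hdr, p + pos, sizeof(hdr));
		payload = (size_t)hdr.entries * sizeof(struct irq_proc_stat_cpu);

		/* Stop if the data entries are not fully available */
		if (len - pos - sizeof(hdr) < payload)
			break;

		pos += sizeof(hdr);
		for (unsigned int i = 0; i < hdr.entries; i++) {
			struct irq_proc_stat_cpu e;

			memcpy(&e, p + pos, sizeof(e));
			cb(hdr.irqnr, e.cpu, e.cnt, arg);
			pos += sizeof(e);
		}
	}
	return pos;
}
```

A caller would read() from one of the stat files into a buffer, hand it to
irq_stats_parse() and move any unconsumed tail to the start of the buffer
before the next read().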

Implement support for /proc/irq/device_stats and /proc/irq/percpu_stats.

The support for architecture specific interrupt statistics is added in a
separate step.
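
As a usage sketch under the semantics described above, a monitoring tool
could keep one file descriptor open and rewind it for every snapshot.
irq_stats_snapshot() is a hypothetical helper and not part of this patch.
It only relies on lseek(fd, 0, SEEK_SET) restarting the iteration and on
read() returning 0 once all records have been delivered:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Rewind @fd and read one complete snapshot into a heap buffer which
 * grows on demand. On success return 0 and hand the buffer, which the
 * caller has to free(), and its length back. Return -1 on failure.
 */
static int irq_stats_snapshot(int fd, void **bufp, size_t *lenp)
{
	size_t size = 4096, len = 0;
	char *buf;

	assert(bufp && lenp);

	/* Rewind is the only supported seek. It restarts the iteration */
	if (lseek(fd, 0, SEEK_SET) < 0)
		return -1;

	buf = malloc(size);
	if (!buf)
		return -1;

	for (;;) {
		ssize_t n;

		/* The total size is not known upfront, so grow as needed */
		if (len == size) {
			char *tmp = realloc(buf, size * 2);

			if (!tmp)
				goto fail;
			buf = tmp;
			size *= 2;
		}

		n = read(fd, buf + len, size - len);
		if (n < 0) {
			if (errno == EINTR)
				continue;
			goto fail;
		}
		/* A zero return means that all records have been read */
		if (!n)
			break;
		len += n;
	}

	*bufp = buf;
	*lenp = len;
	return 0;
fail:
	free(buf);
	return -1;
}
```

The file descriptor would come from open("/proc/irq/device_stats",
O_RDONLY) or the percpu_stats counterpart. The helper behaves the same on
any seekable file, which makes it testable without a patched kernel.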

Reading /proc/irq/device_stats on a 256 CPU x86 machine with 83 requested
interrupts produces 13 records due to skipping zero count interrupts. It
results in 13 * 16 = 208 bytes of data as all device interrupts on x86 are
single CPU targeted. That readout takes ~8us time in the kernel, while the
full /proc/interrupts readout takes about 360us.
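
The arithmetic behind the 208 bytes, spelled out as a small sketch. The
macro and function names are made up for illustration. It assumes a four
byte unsigned int and takes the nr_cpu_ids limit from the UAPI header
comment as the per record maximum:

```c
#include <assert.h>
#include <stddef.h>

static_assert(sizeof(unsigned int) == 4, "sketch assumes 32-bit unsigned int");

/* Header (irqnr, entries) and data entry (cpu, cnt) are two unsigned ints each */
#define IRQ_STAT_HDR_SIZE	(2 * sizeof(unsigned int))
#define IRQ_STAT_ENTRY_SIZE	(2 * sizeof(unsigned int))

/* Size of one record which carries @entries CPU/count pairs */
static size_t irq_stat_record_size(size_t entries)
{
	return IRQ_STAT_HDR_SIZE + entries * IRQ_STAT_ENTRY_SIZE;
}

/* Upper bound if every interrupt had a non-zero count on every CPU */
static size_t irq_stat_max_size(size_t nr_irqs, size_t nr_cpus)
{
	return nr_irqs * irq_stat_record_size(nr_cpus);
}
```

With one data entry a record is 8 + 8 = 16 bytes, so the 13 records above
add up to 208 bytes. The theoretical upper bound for 83 interrupts on 256
CPUs would be 83 * (8 + 256 * 8) = 170648 bytes.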

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
---
 include/uapi/linux/irqstats.h |   27 +++
 kernel/irq/Kconfig            |    3 
 kernel/irq/proc.c             |  314 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 344 insertions(+)

--- /dev/null
+++ b/include/uapi/linux/irqstats.h
@@ -0,0 +1,27 @@
+/* SPDX-License-Identifier: GPL-2.0-only WITH Linux-syscall-note */
+#ifndef LINUX_UAPI_IRQSTATS_H
+#define LINUX_UAPI_IRQSTATS_H
+
+/**
+ * struct irq_proc_stat_cpu - Data record for /proc/irq/{device,percpu}_stats
+ * @cpu:	The CPU associated to @cnt
+ * @cnt:	The count associated to @cpu
+ */
+struct irq_proc_stat_cpu {
+	unsigned int	cpu;
+	unsigned int	cnt;
+};
+
+/**
+ * struct irq_proc_stat_data - Data header for /proc/irq/{device,percpu}_stats
+ * @irqnr:	The interrupt number
+ * @entries:	The number of records (max. nr_cpu_ids)
+ * @pcpu:	Runtime sized array of per CPU stat records
+ */
+struct irq_proc_stat_data {
+	unsigned int			irqnr;
+	unsigned int			entries;
+	struct irq_proc_stat_cpu	pcpu[];
+};
+
+#endif
--- a/kernel/irq/Kconfig
+++ b/kernel/irq/Kconfig
@@ -18,6 +18,9 @@ config GENERIC_IRQ_SHOW
 config GENERIC_IRQ_SHOW_LEVEL
        bool
 
+config GENERIC_IRQ_STATS_PERCPU
+       bool
+
 # Supports effective affinity mask
 config GENERIC_IRQ_EFFECTIVE_AFF_MASK
        depends on SMP
--- a/kernel/irq/proc.c
+++ b/kernel/irq/proc.c
@@ -13,6 +13,8 @@
 #include <linux/kernel_stat.h>
 #include <linux/mutex.h>
 #include <linux/string.h>
+#include <linux/uio.h>
+#include <uapi/linux/irqstats.h>
 
 #include "internals.h"
 
@@ -636,9 +638,321 @@ static const struct seq_operations irq_s
 	.show  = irq_seq_show,
 };
 
+/*
+ * /proc/irq/{device,percpu}_stats related code
+ *
+ * Both files provide variable record sized statistics, for device
+ * interrupts and per CPU interrupts respectively.
+ */
+struct irq_proc_stat {
+	unsigned int			irqnr;
+	bool				percpu;
+	bool				first;
+	size_t				from;
+	size_t				count;
+	loff_t				read_pos;
+	struct irq_desc			*desc;
+	struct irq_proc_stat_data	*data;
+};
+
+static inline bool irq_stat_valid_irq(struct irq_proc_stat *s)
+{
+	struct irq_desc *desc = s->desc;
+
+	/* Check for general validity */
+	if (!irq_settings_proc_valid(desc))
+		return false;
+
+	if (!s->percpu) {
+		/*
+		 * Device interrupts update desc::tot_count. Per CPU
+		 * interrupts are not touching that field due to the
+		 * obvious concurrency issues.  For device interrupts it's
+		 * therefore sufficient to evaluate desc::tot_count.
+		 */
+		if (!data_race(desc->tot_count))
+			return false;
+	} else {
+		/*
+		 * Per CPU interrupts are marked accordingly in the
+		 * settings.
+		 */
+		if (!irq_settings_is_per_cpu(desc) && !irq_settings_is_per_cpu_devid(desc))
+			return false;
+	}
+
+	/* Try to get a reference to prevent freeing before it's evaluated */
+	return irq_desc_get_ref(desc);
+}
+
+static inline bool irq_stat_find_irq(struct irq_proc_stat *s)
+{
+	/* Loop until a valid interrupt is found */
+	guard(rcu)();
+	for (;; s->irqnr++) {
+		s->desc = irq_find_desc_at_or_after(s->irqnr);
+		/* NULL means there is no interrupt anymore in the maple tree */
+		if (!s->desc) {
+			s->irqnr = total_nr_irqs;
+			return false;
+		}
+
+		/* Save the interrupt number for the next search */
+		s->irqnr = irq_desc_get_irq(s->desc);
+
+		if (irq_stat_valid_irq(s))
+			return true;
+	}
+}
+
+static inline void irq_stat_next_irq(struct irq_proc_stat *s)
+{
+	s->irqnr++;
+	irq_stat_find_irq(s);
+}
+
+static void irq_dev_stat_update_one(struct irq_proc_stat *s)
+{
+	struct irq_proc_stat_data *d = s->data;
+	struct irq_desc *desc = s->desc;
+	struct irq_data *irqd;
+	unsigned int cpu;
+
+	/*
+	 * Optimize for single CPU target affinities. Otherwise walk the
+	 * effective affinity mask, which falls back to the real affinity
+	 * mask if the architecture does not support effective affinity
+	 * masks. Bad luck...
+	 */
+	irqd = irq_desc_get_irq_data(desc);
+	cpu = irq_data_get_single_target(irqd);
+	if (cpu < nr_cpu_ids) {
+		struct irq_proc_stat_cpu pcpu = {
+			.cpu = cpu,
+			.cnt = data_race(per_cpu(desc->kstat_irqs->cnt, cpu)),
+		};
+
+		if (pcpu.cnt)
+			d->pcpu[d->entries++] = pcpu;
+	} else {
+		const struct cpumask *m = irq_data_get_effective_affinity_mask(irqd);
+
+		for_each_cpu(cpu, m) {
+			struct irq_proc_stat_cpu pcpu = {
+				.cpu = cpu,
+				.cnt = data_race(per_cpu(desc->kstat_irqs->cnt, cpu)),
+			};
+
+			if (pcpu.cnt)
+				d->pcpu[d->entries++] = pcpu;
+		}
+	}
+}
+
+static void irq_percpu_stat_update_one(struct irq_proc_stat *s)
+{
+	struct irq_proc_stat_data *d = s->data;
+	struct irq_desc *desc = s->desc;
+	unsigned int cpu;
+
+	for_each_online_cpu(cpu) {
+		struct irq_proc_stat_cpu pcpu = {
+			.cpu = cpu,
+			.cnt = data_race(per_cpu(desc->kstat_irqs->cnt, cpu)),
+		};
+
+		if (pcpu.cnt)
+			d->pcpu[d->entries++] = pcpu;
+	}
+}
+
+static bool irq_stat_update_one(struct irq_proc_stat *s)
+{
+	struct irq_proc_stat_data *d = s->data;
+
+	if (IS_ENABLED(CONFIG_GENERIC_IRQ_STATS_PERCPU) && s->percpu)
+		irq_percpu_stat_update_one(s);
+	else
+		irq_dev_stat_update_one(s);
+
+	/* Only output data if there is an actual count */
+	if (d->entries) {
+		d->irqnr = s->irqnr;
+		s->count = sizeof(*d) + d->entries * sizeof(*d->pcpu);
+	}
+
+	/* Drop the reference count which got acquired in irq_stat_find_irq() */
+	irq_desc_put_ref(s->desc);
+	s->desc = NULL;
+	return !!s->count;
+}
+
+static __always_inline bool irq_stat_next_data(struct irq_proc_stat *s)
+{
+	/*
+	 * On the first read or after a lseek(fd, 0, SEEK_SET) find the
+	 * first interrupt. Otherwise find the next one.
+	 */
+	if (unlikely(s->first)) {
+		s->irqnr = 0;
+		s->first = false;
+		irq_stat_find_irq(s);
+	} else {
+		irq_stat_next_irq(s);
+	}
+
+	/* Repeat until an interrupt with non-zero counts is found */
+	for (; s->desc; irq_stat_next_irq(s)) {
+		if (irq_stat_update_one(s))
+			return true;
+	}
+	return false;
+}
+
+static size_t irq_stat_copy_to_iter(struct irq_proc_stat *s, struct iov_iter *iter)
+{
+	size_t n = copy_to_iter(((char *)s->data) + s->from, s->count, iter);
+
+	s->count -= n;
+	s->from += n;
+	return n;
+}
+
+/* Force inline as otherwise next() becomes an indirect call */
+static __always_inline ssize_t __irq_stats_read(struct kiocb *iocb, struct iov_iter *iter,
+						bool (*next)(struct irq_proc_stat *))
+{
+	struct irq_proc_stat *s = iocb->ki_filp->private_data;
+	size_t copied = 0;
+
+	/* Real seek is not supported. See irq_stat_lseek() */
+	if (WARN_ON_ONCE(iocb->ki_pos != s->read_pos))
+		goto done;
+
+	if (s->count)
+		copied += irq_stat_copy_to_iter(s, iter);
+
+	for (; !s->count;) {
+		s->count = s->from = 0;
+		s->data->entries = 0;
+
+		if (!next(s))
+			goto done;
+		copied += irq_stat_copy_to_iter(s, iter);
+	}
+
+	if (!copied)
+		return -EFAULT;
+done:
+	iocb->ki_pos += copied;
+	s->read_pos += copied;
+	return copied;
+}
+
+static ssize_t irq_stats_read(struct kiocb *iocb, struct iov_iter *iter)
+{
+	return __irq_stats_read(iocb, iter, irq_stat_next_data);
+}
+
+static loff_t irq_stats_llseek(struct file *filp, loff_t offset, int whence)
+{
+	struct irq_proc_stat *s = filp->private_data;
+	loff_t ret;
+
+	/*
+	 * As this is a variable record interface and the actual use case is to
+	 * get a full snapshot of the active interrupts, there is no point in
+	 * trying to be fully seekable. Just support rewind to the beginning of
+	 * the data set. For all other operations return the current position
+	 * which makes e.g. python happy.
+	 */
+	if (whence != SEEK_SET || offset)
+		return noop_llseek(filp, offset, whence);
+
+	ret = default_llseek(filp, 0, SEEK_SET);
+	if (ret < 0)
+		return ret;
+
+	/* Reset the position, drop any leftovers and indicate to start over */
+	s->read_pos = 0;
+	s->count = 0;
+	s->first = true;
+	return 0;
+}
+
+static int __irq_stats_open(struct inode *inode, struct file *filp, bool percpu)
+{
+	struct irq_proc_stat *s = kzalloc_obj(*s);
+
+	if (!s)
+		return -ENOMEM;
+
+	s->data	= kzalloc_flex(*s->data, pcpu, num_possible_cpus());
+	if (!s->data) {
+		kfree(s);
+		return -ENOMEM;
+	}
+
+	s->first = true;
+	s->percpu = percpu;
+	filp->private_data = s;
+	return 0;
+}
+
+static int irq_stats_open(struct inode *inode, struct file *filp)
+{
+	return __irq_stats_open(inode, filp, false);
+}
+
+static int irq_stats_release(struct inode *inode, struct file *filp)
+{
+	struct irq_proc_stat *s = filp->private_data;
+
+	if (s) {
+		kfree(s->data);
+		kfree(s);
+	}
+	return 0;
+}
+
+static const struct proc_ops irq_dev_stat_ops = {
+	.proc_flags	= PROC_ENTRY_PERMANENT,
+	.proc_open	= irq_stats_open,
+	.proc_release	= irq_stats_release,
+	.proc_read_iter	= irq_stats_read,
+	.proc_lseek	= irq_stats_llseek,
+};
+
+#ifdef CONFIG_GENERIC_IRQ_STATS_PERCPU
+static int irq_pcp_stats_open(struct inode *inode, struct file *filp)
+{
+	return __irq_stats_open(inode, filp, true);
+}
+
+static const struct proc_ops irq_pcp_stat_ops = {
+	.proc_flags	= PROC_ENTRY_PERMANENT,
+	.proc_open	= irq_pcp_stats_open,
+	.proc_release	= irq_stats_release,
+	.proc_read_iter	= irq_stats_read,
+	.proc_lseek	= irq_stats_llseek,
+};
+
+static __init void irq_pcp_stats_init(void)
+{
+	proc_create("percpu_stats", 0, root_irq_dir, &irq_pcp_stat_ops);
+}
+#else  /* CONFIG_GENERIC_IRQ_STATS_PERCPU */
+static inline void irq_pcp_stats_init(void) { }
+#endif /* !CONFIG_GENERIC_IRQ_STATS_PERCPU */
+
 static int __init irq_proc_init(void)
 {
 	proc_create_seq("interrupts", 0, NULL, &irq_seq_ops);
+	if (!root_irq_dir)
+		return 0;
+
+	proc_create("device_stats", 0, root_irq_dir, &irq_dev_stat_ops);
+	irq_pcp_stats_init();
 	return 0;
 }
 fs_initcall(irq_proc_init);


