Date: Wed, 1 Apr 2026 16:42:01 +0000
From: Dmitry Ilvokhin
To: Thomas Gleixner
Cc: LKML, x86@kernel.org, Neil Horman, Radu Rendec
Subject: Re: [patch v2 12/14] [RFC] genirq/proc: Provide binary statistic interface
References: <20260320131108.344376329@kernel.org> <20260320132102.841834115@kernel.org>
In-Reply-To: <20260320132102.841834115@kernel.org>

On Fri, Mar 20, 2026 at 02:22:24PM +0100, Thomas Gleixner wrote:
> /proc/interrupts is expensive to evaluate for monitoring because:
>
> - it is text based and contains a lot of information which is not
>   relevant for interrupt frequency analysis. Due to the extra information
>   like chip name, hardware interrupt number, interrupt action names, it
>   has to take the interrupt descriptor lock to output those items into
>   the seq_file buffer. That obviously interferes with high frequency
>   interrupt workloads.
>
> - it contains both device interrupts, per CPU and architecture specific
>   interrupt counters without being able to look at them separately. The
>   file is seekable by some definition of seekable as the position can
>   change when interrupts are requested or freed, so the data has to be
>   read completely to get a coherent picture.
>
> - it emits records for requested interrupts even if their interrupt count
>   is zero.
>
> - it always prints the per CPU counters even if all but one of them are
>   zero.
>
> - converting numbers to text and then parsing the text back to numbers in
>   user space is a pretty wasteful exercise
>
> Provide a new interface which addresses the above pain points:
>
> 1) The interface is binary and emits variable length records per
>    interrupt. Each record starts with a header containing the interrupt
>    number and the number of data entries following the header. The data
>    entries consist of a CPU number and count pair.
>
> 2) Interrupts with a total count of zero are skipped and produce no
>    output at all.
>
> 3) Interrupts which have a single CPU affinity either due to a restricted
>    affinity mask or due to the underlying interrupt chip restricting a
>    mask to a single CPU target emit only one data entry.
>
>    That means they are not emitting the stale counts on previous target
>    CPUs but they are not really interesting for interrupt frequency
>    analysis as they are not changing and therefore pointless for
>    accounting.
>
> 4) The interface separates device interrupts, per CPU interrupts and
>    architecture specific interrupts.
>
>    Per CPU and architecture specific interrupts can only be monitored,
>    while device interrupts can also be steered by changing the affinity
>    unless they are affinity managed by the kernel.
>
>    Per CPU interrupts are only available on architectures, e.g. ARM64,
>    which use the regular interrupt descriptor mechanism for per CPU
>    interrupt handling.
>
>    Architectures which have their own mechanics, e.g. x86, do not enable
>    and provide the per CPU interface as those interrupts are covered by
>    the architecture specific accounting.
>
> 5) The readout is fully lockless so it does not interfere with concurrent
>    interrupt handling.
>
> 6) Seek is restricted to seek(fd, 0, SEEK_SET) as that's the only
>    operation which makes sense due to the variable record length and the
>    dynamics of interrupt request/free operations which influence the
>    position of the records in the output. For all other seek()
>    invocations return the current file position, which makes e.g. python
>    happy as an error code causes the file open checks to mark the
>    resulting file object non-seekable.
>
> Implement support for /proc/irq/device_stats and /proc/irq/percpu_stats.
>
> The support for architecture specific interrupt statistics is added in a
> separate step.
>
> Reading /proc/irq/device_stats on a 256 CPU x86 machine with 83 requested
> interrupts produces 13 records due to skipping zero count interrupts. It
> results in 13 * 16 = 208 bytes of data as all device interrupts on x86 are
> single CPU targeted. That readout takes ~8us time in the kernel, while the
> full /proc/interrupts readout takes about 360us.
>
> Signed-off-by: Thomas Gleixner
> ---
>  include/uapi/linux/irqstats.h |   27 +++
>  kernel/irq/Kconfig            |    3
>  kernel/irq/proc.c             |  314 ++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 344 insertions(+)
>
> --- /dev/null
> +++ b/include/uapi/linux/irqstats.h
> @@ -0,0 +1,27 @@
> +/* SPDX-License-Identifier: GPL-2.0-only WITH Linux-syscall-note */
> +#ifndef LINUX_UAPI_IRQSTATS_H
> +#define LINUX_UAPI_IRQSTATS_H
> +
> +/**
> + * irq_proc_stat_cpu - Data record for /proc/irq/stats
> + * @cpu:	The CPU associated to @cnt
> + * @cnt:	The count assiciated to @cpu

nit: s/assiciated/associated/

> + */
> +struct irq_proc_stat_cpu {
> +	unsigned int	cpu;
> +	unsigned int	cnt;
> +};

nit: UAPI structs should use __u32 instead of unsigned int.

> +
> +/**
> + * irq_proc_stat_data - Data header for /proc/irq/stats
> + * @irqnr:	The interrupt number
> + * @entries:	The number of records (max. nr_cpu_ids)
> + * @pcpu:	Runtime sized array of per CPU stat records
> + */
> +struct irq_proc_stat_data {
> +	unsigned int			irqnr;
> +	unsigned int			entries;
> +	struct irq_proc_stat_cpu	pcpu[];
> +};

Same here.

Also, this struct has no extensibility mechanism. If irq_proc_stat_cpu
ever needs a new field, there's no way for userspace to detect the
layout change. A __u32 entry_size set to sizeof(struct irq_proc_stat_cpu)
would let userspace stride through entries safely, even if the struct
grows later.

> +
> +#endif
> --- a/kernel/irq/Kconfig
> +++ b/kernel/irq/Kconfig
> @@ -18,6 +18,9 @@ config GENERIC_IRQ_SHOW
>  config GENERIC_IRQ_SHOW_LEVEL
>  	bool
>
> +config GENERIC_IRQ_STATS_PERCPU
> +	bool
> +

[...]

> +static bool irq_stat_update_one(struct irq_proc_stat *s)
> +{
> +	struct irq_proc_stat_data *d = s->data;
> +
> +	if (IS_ENABLED(CONFIG_GENERIC_IRQ_PERCPU_STATS) && s->percpu)
> +		irq_percpu_stat_update_one(s);

Should be GENERIC_IRQ_STATS_PERCPU to match the Kconfig symbol added
above; PERCPU and STATS are swapped.
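
The swapped symbol name is easy to miss because IS_ENABLED() on an
undefined CONFIG_ symbol evaluates to 0 instead of failing the build.
Below is a reduced userspace model of the helpers in
include/linux/kconfig.h. The macro names are shortened, the module case
is dropped and the two percpu_path_*() helpers are hypothetical, so
treat it as a sketch of the mechanism rather than the kernel code:

```c
/*
 * Reduced model of the kernel's IS_ENABLED() machinery. Macro names are
 * shortened for this sketch and the =m (module) case is left out.
 */
#define ARG_PLACEHOLDER_1			0,
#define TAKE_SECOND_ARG(ignored, val, ...)	val
#define IS_DEFINED_3(arg1_or_junk)		TAKE_SECOND_ARG(arg1_or_junk 1, 0, 0)
#define IS_DEFINED_2(val)			IS_DEFINED_3(ARG_PLACEHOLDER_##val)
#define IS_DEFINED_1(x)				IS_DEFINED_2(x)
#define IS_ENABLED(option)			IS_DEFINED_1(option)

/* What Kconfig would generate for "config GENERIC_IRQ_STATS_PERCPU" = y */
#define CONFIG_GENERIC_IRQ_STATS_PERCPU 1

/* Hypothetical helper: the condition as posted, with the swapped name */
static int percpu_path_as_posted(int percpu)
{
	/* Undefined symbol: expands to 0, so the branch is never taken */
	return IS_ENABLED(CONFIG_GENERIC_IRQ_PERCPU_STATS) && percpu;
}

/* Hypothetical helper: the same condition with the Kconfig symbol name */
static int percpu_path_fixed(int percpu)
{
	return IS_ENABLED(CONFIG_GENERIC_IRQ_STATS_PERCPU) && percpu;
}
```

With the name as posted the per CPU update would be compiled out on
every configuration, without any warning.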
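
To make the entry_size suggestion for struct irq_proc_stat_data more
concrete, here is a rough userspace sketch of a reader that strides by
the size announced in the header. The header layout with an entry_size
field, the struct names and irq_stat_sum() are all hypothetical and not
part of the posted patch:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record header with the suggested entry_size field */
struct irq_stat_hdr {
	uint32_t irqnr;
	uint32_t entries;
	uint32_t entry_size;	/* bytes per entry as emitted by the kernel */
};

/* The entry fields this reader was built against */
struct irq_stat_cpu {
	uint32_t cpu;
	uint32_t cnt;
};

/*
 * Sum the counts of all entries belonging to @irqnr in @buf.
 * Returns 0 on success, -1 if the buffer is truncated or malformed.
 */
static int irq_stat_sum(const void *buf, size_t len, uint32_t irqnr,
			uint64_t *sum)
{
	const unsigned char *p = buf, *end = p + len;

	*sum = 0;
	while (p < end) {
		struct irq_stat_hdr h;

		if ((size_t)(end - p) < sizeof(h))
			return -1;
		memcpy(&h, p, sizeof(h));
		p += sizeof(h);

		/* Entries may be larger than the known layout, never smaller */
		if (h.entry_size < sizeof(struct irq_stat_cpu))
			return -1;
		if (h.entries > (size_t)(end - p) / h.entry_size)
			return -1;

		for (uint32_t i = 0; i < h.entries; i++, p += h.entry_size) {
			struct irq_stat_cpu e;

			/* Copy only the known prefix, skip unknown trailing fields */
			memcpy(&e, p, sizeof(e));
			if (h.irqnr == irqnr)
				*sum += e.cnt;
		}
	}
	return 0;
}
```

A reader written this way keeps working if the kernel later appends
fields to the per CPU entry, which is the point of carrying the size in
the record.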