Date: Wed, 04 Mar 2026 19:55:27 +0100
Message-ID: <20260303150539.513068586@kernel.org>
From: Thomas Gleixner
To: LKML
Cc: x86@kernel.org, Dmitry Ilvokhin, Neil Horman
Subject: [patch 00/14] genirq: Improve /proc/interrupts for real and add a binary interface

I started to look into this a few days ago due to the recent patch
reminder from Dmitry:

  https://lore.kernel.org/aQj5mGZ6_BBlAm3B@shell.ilvokhin.com

Let me start with the history of /proc/interrupts. For more than two
decades I've been observing developers micro-enhancing the performance of
the /proc/interrupts readout and stopping right there after gaining $N% in
a particular part of the code instead of actually looking at the bigger
picture. I understand that people have time constraints, but it's amazing
how much low hanging fruit has been left on the table due to that. Not to
mention the more than decade-old talk about a new binary interface which
would address the underlying problem of the unfortunately unchangeable
/proc/interrupts ABI. Just for giggles, the patch above even mentions it
in the change log:

  "Although a binary /proc interface would be a better long-term solution
   due to lower formatting (kernel side) and parsing (user-space side)
   overhead the text interface will remain in use for some time, even if
   better solutions will be available. Optimizing the /proc/interrupts
   printing code is therefore still beneficial."
Coming back to the above referenced patch, which triggered me to actually
look into this. The patch achieves an impressive readout time improvement
of ~19% on my 256 CPU test machine with a trivial C test case which
essentially does:

	fd = open("/proc/interrupts");
	for (i = 0; i < 1000; i++) {
		t0 = now();
		read_all_data(fd);
		deltas[i] = now() - t0;
		lseek(fd, 0, SEEK_SET);
	}
	print_mean_and_rel_stddev(deltas);

I wrote that trivial test because the numbers provided in the patch above
are based on 'perf stat -r 1000 cat /proc/interrupts >/dev/null', which
takes all the irrelevant setup and teardown costs of 'cat' into
account. That makes it tedious to observe the actual problems via perf
[stat|top] because the setup/teardown overhead obfuscates the output. For
completeness sake the numbers observed with that very same perf command
line are provided at the end for reference. They pretty much confirm the
findings of the narrowed down micro benchmark.

Let's take a look at the resulting numbers:

  Patch                                     t mean       rel. stddev  delta base  delta prev
  Baseline v7.0-rc1                         1310.363 us  +/-1.81%
   1 x86/irq: Optimize interrupts decimal   1059.238 us  +/-1.76%     -19%        -19%

Impressive, but not so impressive anymore when looking at perf top output
and addressing all the other offenders one by one:

  Patch                                     t mean       rel. stddev  delta base  delta prev
  Baseline v7.0-rc1                         1310.363 us  +/-1.81%
   1 x86/irq: Optimize interrupts decimal   1059.238 us  +/-1.76%     -19%        -19%
   3 genirq/proc: Utilize irq_desc::tot_c    652.365 us  +/-0.81%     -50%        -38%
   4 x86/irq: Make irqstats array based      605.326 us  +/-1.63%     -54%        - 7%
   7 genirq: Calculate precision only whe    575.973 us  +/-1.12%     -56%        - 5%
  10 genirq/proc: Speed up /proc/interrup    209.907 us  +/-1.84%     -84%        -64%

Now let's look at how I got there, by simply running the microbenchmark
with infinite loops and using perf top to analyze the hotspots.

Baseline:

   20.19%  [k] mtree_load
   19.34%  [k] num_to_str
   10.12%  [k] number
    9.18%  [k] vsnprintf
    7.55%  [k] seq_put_decimal_ull_width
    6.16%  [k] format_decode
    5.65%  [k] show_interrupts
    4.88%  [k] _find_next_bit
    3.46%  [k] seq_read_iter
    3.29%  [k] seq_printf
    1.58%  [k] memcpy_orig
    1.50%  [k] __rcu_read_lock
    1.26%  [k] arch_show_interrupts
    1.05%  [k] __rcu_read_unlock
    1.04%  [k] int_seq_next
    0.94%  [k] rep_movs_alternative
    0.82%  [k] irq_to_desc
    0.56%  [k] irq_get_nr_irqs

Obviously vsnprintf() was the first item to look at.

1 x86/irq: Optimize interrupts decimal printing

   29.94%  [k] num_to_str
   25.16%  [k] mtree_load
   12.48%  [k] seq_put_decimal_ull_width
    7.05%  [k] show_interrupts
    6.53%  [k] _find_next_bit
    4.48%  [k] seq_read_iter
    1.87%  [k] __rcu_read_lock
    1.84%  [k] arch_show_interrupts
    1.60%  [k] vsnprintf
    1.32%  [k] __rcu_read_unlock
    1.18%  [k] rep_movs_alternative
    1.16%  [k] int_seq_next
    1.01%  [k] format_decode
    0.98%  [k] irq_to_desc
    0.65%  [k] irq_get_nr_irqs

So Dmitry's patch removed the vsnprintf() overhead and made num_to_str()
more prominent. That number is insanely high, so I analyzed the
/proc/interrupts output, which offered an easy way out: a large number of
interrupts have zero counts on all but one CPU. That's normal for
interrupt managed multiqueue devices. Low frequency interrupts also tend
to stay on their initial affinity and are not moved around by
balancers. With single CPU effective affinity targets (x86 and other
architectures) the majority of lines have just _one_ non-zero entry,
unless the balancer or the admin changed the affinity. So it was pretty
obvious to write a fixed string directly instead of repeating the string
conversion of zero over and over.
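The shape of that shortcut, as a minimal sketch (illustration only,
assuming the usual 10 character wide count field of /proc/interrupts; the
helper name is made up and the actual patch may differ):

	/* Illustration only, not the actual kernel code */
	static void emit_count(struct seq_file *p, unsigned long cnt)
	{
		/* Preformatted " " delimiter plus 10 wide right-aligned "0" */
		static const char zero_field[] = "          0";

		if (!cnt)	/* Common case: copy the fixed string */
			seq_write(p, zero_field, sizeof(zero_field) - 1);
		else		/* Rare case: full decimal conversion */
			seq_put_decimal_ull_width(p, " ", cnt, 10);
	}

That turns the dominant all-zero fields into plain copies of a constant
string.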
3 genirq/proc: Utilize irq_desc::tot_count to avoid evaluation

   41.04%  [kernel]  [k] mtree_load
   10.92%  [kernel]  [k] num_to_str
    5.97%  [kernel]  [k] show_interrupts
    5.18%  [kernel]  [k] _find_next_bit
    4.86%  [kernel]  [k] seq_put_decimal_ull_width
    4.65%  [kernel]  [k] seq_read_iter
    3.02%  [kernel]  [k] __rcu_read_lock
    2.84%  [kernel]  [k] arch_show_interrupts
    2.71%  [kernel]  [k] irq_proc_emit_counts
    2.69%  [kernel]  [k] memcpy_orig
    2.32%  [kernel]  [k] vsnprintf
    2.01%  [kernel]  [k] __rcu_read_unlock
    1.79%  [kernel]  [k] int_seq_next
    1.71%  [kernel]  [k] format_decode
    1.62%  [kernel]  [k] irq_to_desc
    1.47%  [kernel]  [k] rep_movs_alternative
    1.13%  [kernel]  [k] irq_get_nr_irqs

num_to_str() has lost the pole position and got replaced by mtree_load()
again. mtree_load() is used to look up the interrupt descriptors, and
something is seriously wrong given that high CPU usage, but I decided to
look at that later.

While fixing up Dmitry's patch I noticed that the way x86 manages the
architecture specific statistics is suboptimal. x86 holds the per
interrupt counters in struct members and therefore requires an #ifdeffed
series of copied code per counter to emit them. The struct member based
counters are also in the way of implementing a binary interface without
adding more architecture specific duplicated code.

The same can be achieved with an array of counters. That does not change
the actual code in the interrupt hot path which increments the counter:
the array indices are constant, so the compiler still calculates the
offset from the per CPU data pointer as before. This also fixes the out of
sync arch_show_interrupts() and arch_irq_stat_cpu() implementations as
they now use the same table.

4 x86/irq: Make irqstats array based

   43.99%  [kernel]  [k] mtree_load
    9.16%  [kernel]  [k] irq_proc_emit_counts
    6.41%  [kernel]  [k] show_interrupts
    5.86%  [kernel]  [k] _find_next_bit
    4.91%  [kernel]  [k] seq_read_iter
    3.98%  [kernel]  [k] memcpy_orig
    3.28%  [kernel]  [k] __rcu_read_lock
    2.95%  [kernel]  [k] num_to_str
    2.28%  [kernel]  [k] vsnprintf
    2.06%  [kernel]  [k] __rcu_read_unlock
    2.02%  [kernel]  [k] rep_movs_alternative
    1.99%  [kernel]  [k] int_seq_next
    1.96%  [kernel]  [k] format_decode
    1.69%  [kernel]  [k] irq_to_desc
    1.58%  [kernel]  [k] seq_put_decimal_ull_width
    1.20%  [kernel]  [k] irq_get_nr_irqs

num_to_str() got further demoted because x86 now uses
irq_proc_emit_counts() with the optimized zero output as well, and
arch_show_interrupts() is off the radar because it only contains a trivial
loop. This reduces text size by ~2k, which obviously reduces the I-cache
footprint. The loop overhead is completely irrelevant compared to the
actual cost of running the for_each_online_cpu() loop once per interrupt
and accessing memory in the worst possible pattern in each iteration,
especially when reading counters from remote nodes. There is not much
which can be done about that. I tried to copy all per CPU counters into
local data storage first, but that just trades one memory/cacheline
massacre against another for no gain.

So back to mtree_load(). That high CPU usage doesn't make any sense
because the whole point of sparse interrupts and the underlying maple tree
is to iterate quickly through the tree and skip the holes. So much for the
theory: the fs/proc/interrupts code never got updated and still iterates
over the possible interrupt number space one by one, thereby defeating the
whole purpose of the maple tree. On the test machine that's almost 1000
interrupt descriptor lookups, while only 153 interrupts are in use and
exist in the maple tree.
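To make the difference concrete, here is a sketch of the two iteration
patterns (illustrative only; show_one() is a hypothetical stand-in for the
per-descriptor output, and the sketch assumes access to the sparse_irqs
maple tree, which is private to kernel/irq/irqdesc.c):

	unsigned int nr = irq_get_nr_irqs();
	struct irq_desc *desc;
	unsigned long i;

	/*
	 * What the proc code effectively does: a full maple tree lookup
	 * (mtree_load()) for every possible interrupt number, hole or not.
	 */
	for (i = 0; i < nr; i++) {
		desc = irq_to_desc(i);
		if (desc)
			show_one(desc);
	}

	/*
	 * What a proper iteration does: visit only the populated entries.
	 * mt_find() advances @i past each entry it returns.
	 */
	for (i = 0; (desc = mt_find(&sparse_irqs, &i, nr)); )
		show_one(desc);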
The trivial fix would have been to use the proper iterator in
fs/proc/interrupts, but that would have needed some investigation vs. the
architectures which do not use the generic version of
show_interrupts(). That can be done by those people if they actually
care. Aside from that it would still touch the maple tree twice for each
interrupt: first to find the next number and then to actually load the
interrupt descriptor.

This can be done smarter by retrieving the next descriptor right away and
storing it in the *v data pointer of the seq_file ops instead of using
that pointer to store the number. But doing so required some preparatory
changes, and while thinking about them I noticed a few obvious
improvements to avoid doing the same thing over and over for no reason and
to reduce the number of conditional branches in show_interrupts().

7 genirq: Calculate precision only when required

   46.41%  [k] mtree_load
   10.42%  [k] irq_proc_emit_counts
    5.86%  [k] show_interrupts
    5.11%  [k] seq_read_iter
    5.05%  [k] _find_next_bit
    3.56%  [k] memcpy_orig
    3.23%  [k] num_to_str
    2.51%  [k] __rcu_read_unlock
    2.27%  [k] vsnprintf
    2.23%  [k] __rcu_read_lock
    2.00%  [k] rep_movs_alternative
    1.90%  [k] int_seq_next
    1.81%  [k] seq_put_decimal_ull_width
    1.58%  [k] format_decode
    1.09%  [k] number
    0.91%  [k] string
    0.76%  [k] put_dec_trunc8
    0.67%  [k] seq_printf
    0.60%  [k] irq_to_desc
    0.46%  [k] irq_get_nr_irqs

show_interrupts() became slightly less expensive. The leader position of
mtree_load() obviously stays the same.

With that addressed, adding optimized seq_file code to the interrupt core
became feasible. All it still required was a fast refcount mechanism so
that the descriptor pointer can be safely stored in the seq_file iterator
and cannot be freed between seq_next() and seq_show(). I briefly pondered
using the existing kobject in the interrupt descriptor, but that uses a
refcount_t underneath, which is way more expensive than rcuref_t and
requires function calls. As the descriptors are RCU managed, rcuref_t is
the obvious choice.

10 genirq/proc: Speed up /proc/interrupts iteration

   26.77%  [k] irq_proc_emit_counts
   15.90%  [k] _find_next_bit
   10.56%  [k] memcpy_orig
    9.82%  [k] num_to_str
    5.68%  [k] rep_movs_alternative
    5.34%  [k] vsnprintf
    4.76%  [k] seq_put_decimal_ull_width
    4.01%  [k] format_decode
    3.17%  [k] mt_find
    2.51%  [k] string
    2.24%  [k] number
    1.52%  [k] put_dec_trunc8
    1.44%  [k] seq_printf
    0.87%  [k] irq_seq_show

mt_find() is the function which retrieves the next descriptor, and the
numbers are now where one would expect them to be. I might have missed
some low hanging fruit there as well. If you find it, you are entitled to
keep it and fix it yourself. :)
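Condensed, the descriptor pinning described above looks roughly like the
sketch below (illustration only: the rcuref_t member name 'ref' and the
helper are hypothetical, and a real implementation would skip to the next
entry when the refcount grab fails instead of giving up):

	static struct irq_desc *irq_desc_find_pinned(unsigned long *index)
	{
		struct irq_desc *desc;

		rcu_read_lock();
		desc = mt_find(&sparse_irqs, index, irq_get_nr_irqs());
		/* A failed rcuref_get() means the descriptor is being freed */
		if (desc && !rcuref_get(&desc->ref))
			desc = NULL;
		rcu_read_unlock();
		return desc;
	}

seq_show() can then dereference the pinned descriptor without holding RCU
or the descriptor lock, and the reference is dropped via rcuref_put() once
the iterator moves on.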
That said, let's talk about the previously mentioned optimized binary
interface. As I'm not interested in reading "it would be better" over and
over for another ten years, I sat down and implemented a straightforward
interface which provides a variable record size binary dump of the
relevant statistics, separated into three files:

  1) device interrupts

  2) per CPU interrupts (irq descriptor based)

  3) architecture specific interrupts

The separation is done because for interrupt balancing purposes #1 is the
interesting part, as #2 and #3 cannot be influenced by modifying
affinities. They can be monitored of course, but that's a different class
of events.

This interface comes with the following semantics:

  - Each record starts with a pair of 'interrupt number' and 'number of
    entries'.

  - Each entry is a pair of 'cpu number' and 'interrupt count'.

  - Records are only emitted for interrupts which have a non-zero total
    event count, as that's what matters for observation.

  - Records are only emitted if the counter(s) for the CPU target(s) in
    the effective affinity mask are non-zero. Emitting counts from a
    previous affinity setting is irrelevant and can easily be cached by
    the monitoring application if needed.

  - The readout is completely lockless vs. the interrupt descriptor. It
    does not touch the interrupt descriptor lock, so it does not interfere
    with concurrent high frequency interrupts at all, except for the
    memory accesses to the counters, which obviously can't be avoided.

  - Contrary to the nonsensical seq_file seek handling of
    /proc/interrupts, only seeking back to the origin is
    allowed. /proc/interrupts is a non-constant record file, and the
    seq_file lseek() implementation therefore does not guarantee
    consistent lseek() behaviour at all, except for lseek(fd, 0,
    SEEK_SET).

Reading those files with the same 1000 loops test takes on average:

  ~ 8 us for the device interrupts
  ~37 us for the x86 architecture interrupts
  ---
  ~45 us total

Compared to the fully applied /proc/interrupts enhancements:

  209 us vs. 45 us =~ -79% --> ~4.5X

The comparison to the baseline v7.0-rc1 is:

  1310 us vs. 45 us =~ -96% --> ~29X

For completeness sake the perf top list of the endless read/lseek loop
accessing /proc/irq/device_stats:

   70.52%  [kernel]  [k] mt_find
    6.36%  [kernel]  [k] _copy_to_iter
    5.09%  [kernel]  [k] irq_stats_read
    3.77%  [kernel]  [k] irq_find_desc_at_or_after
    2.09%  [kernel]  [k] __rcu_read_lock
    2.04%  [kernel]  [k] __rcu_read_unlock
    1.67%  [kernel]  [k] __check_object_size
    1.07%  [kernel]  [k] __virt_addr_valid
    0.82%  [kernel]  [k] entry_SYSCALL_64
    0.81%  [kernel]  [k] vfs_read

Here mt_find() rightfully dominates because the decision to output data is
trivial: all interrupts are single CPU targets, so there is only one
counter per interrupt to access. If it is non-zero, the store and the
output just vanish in the noise compared to the actual lookup costs. Based
on the 8 us average this means:

  5.641 us  mt_find()
  0.509 us  copy_to_iter()
  0.407 us  irq_stats_read()
  0.302 us  irq_find_desc_at_or_after()

mt_find() and irq_find_desc_at_or_after() belong together, so we have:

  5.943 us  lookup()
  0.509 us  copy_to_iter()	// Copies data to user space
  0.407 us  irq_stats_read()	// Counter analysis

Broken down to a per interrupt average with 153 requested interrupts on
that machine this means:

  38.8 ns  lookup time
   3.3 ns  copy to user
   2.7 ns  analysis

And the same for the endless read/lseek loop accessing
/proc/irq/arch_stats:

   57.82%  [k] _find_next_bit
   37.20%  [k] irq_arch_stats_read
    1.06%  [k] _copy_to_iter
    1.03%  [k] rep_movs_alternative
    0.73%  [k] irq_stat_copy_to_iter
    0.27%  [k] vfs_read

The most expensive parts of the latter are the for_each_online_cpu() loops
and the suboptimal memory access patterns explained above. With a 37 us
readout time for 256 CPUs and 19 x86 architecture specific counters this
gives:

  21.393 us  for_each_online_cpu()
  13.764 us  irq_arch_stats_read()	// Counter analysis

That means on average per counter:

  1.12 us  for_each_online_cpu()
  0.72 us  irq_arch_stats_read()

In theory the for_each_online_cpu() overhead could be optimized, but that
creates a lot of inlined code for a meager 5% performance improvement over
the out of line version. Not really worth it.
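For illustration, a minimal user space reader of /proc/irq/device_stats
following the record semantics above could look like the sketch below. The
actual record layout is defined in include/uapi/linux/irqstats.h and is
not reproduced here, so the u32/u64 field types are an assumption, not the
ABI:

	#include <stdint.h>
	#include <stdio.h>

	int main(void)
	{
		FILE *f = fopen("/proc/irq/device_stats", "rb");
		uint32_t hdr[2];	/* ASSUMED: irq number, number of entries */

		if (!f)
			return 1;

		/* Variable sized records: header followed by hdr[1] entries */
		while (fread(hdr, sizeof(hdr), 1, f) == 1) {
			for (uint32_t i = 0; i < hdr[1]; i++) {
				uint32_t cpu;	/* ASSUMED: cpu number */
				uint64_t cnt;	/* ASSUMED: interrupt count */

				if (fread(&cpu, sizeof(cpu), 1, f) != 1 ||
				    fread(&cnt, sizeof(cnt), 1, f) != 1)
					return 1;
				printf("irq %u: cpu %u count %llu\n",
				       hdr[0], cpu, (unsigned long long)cnt);
			}
		}
		fclose(f);
		return 0;
	}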
Note that the /proc/interrupts numbers were taken on a mostly idle system
with an almost zero probability of hitting irq_desc::lock contention. All
test results are obviously skewed by the repetitive invocations, which
prime the caches, but cache cold tests without repetition result in
roughly the same performance ratios for all scenarios.

A quick python hack computing the total number of interrupts from the
optimized /proc/interrupts and from the new binary interfaces yields:

  /proc/interrupts optimized	/proc/irq/[device+arch]_stats
  6.957 ms			0.394 ms	-94% (~17X)

which covers both the cost of reading and the cost of computing. The read
advantage of the binary interface is ~4.5X (see above), so the compute
advantage of avoiding the text parsing and not looking at pointless
numbers amounts to ~3.7X, which is unsurprisingly in the expected
ballpark.

The last four patches, which add the binary interface, obviously need some
thought vs. the interface itself and are therefore marked RFC.

The series applies on top of v7.0-rc1 and is also available via git:

  git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git irq/core

Thanks,

	tglx

---
 It's not that I'm so smart, it's just that I stay with problems longer.
						- Albert Einstein
---

'perf stat -r 1000 cat /proc/interrupts' data series

Baseline v7.0-rc1

 Performance counter stats for 'cat /proc/interrupts' (1000 runs):

          2.55 msec task-clock                # 0.830 CPUs utilized            ( +- 0.09% )
             0      context-switches          # 0.000 /sec
             0      cpu-migrations            # 0.000 /sec
            94      page-faults               # 36.877 K/sec                   ( +- 0.05% )
     5,072,386      cycles                    # 1.990 GHz                      ( +- 0.08% )
       357,009      stalled-cycles-frontend   # 7.04% frontend cycles idle     ( +- 0.29% )
    17,197,829      instructions              # 3.39 insn per cycle
                                              # 0.02 stalled cycles per insn   ( +- 0.01% )
     3,323,704      branches                  # 1.304 G/sec                    ( +- 0.01% )
        13,773      branch-misses             # 0.41% of all branches          ( +- 0.09% )

    0.00307221 +- 0.00000369 seconds time elapsed  ( +- 0.12% )

x86/irq: Optimize interrupts decimals printing

 Performance counter stats for 'cat /proc/interrupts' (1000 runs):

          2.10 msec task-clock                # 0.809 CPUs utilized            ( +- 0.18% )
             0      context-switches          # 0.000 /sec
             0      cpu-migrations            # 0.000 /sec
            94      page-faults               # 44.720 K/sec                   ( +- 0.05% )
     4,179,092      cycles                    # 1.988 GHz                      ( +- 0.13% )
       360,135      stalled-cycles-frontend   # 8.62% frontend cycles idle     ( +- 0.35% )
    13,174,711      instructions              # 3.15 insn per cycle
                                              # 0.03 stalled cycles per insn   ( +- 0.02% )
     2,596,179      branches                  # 1.235 G/sec                    ( +- 0.02% )
        13,793      branch-misses             # 0.53% of all branches          ( +- 0.09% )

    0.00259694 +- 0.00000597 seconds time elapsed  ( +- 0.23% )

genirq/proc: Utilize irq_desc::tot_count to avoid evaluation

 Performance counter stats for 'cat /proc/interrupts' (1000 runs):

          1.51 msec task-clock                # 0.753 CPUs utilized            ( +- 0.23% )
             0      context-switches          # 0.000 /sec
             0      cpu-migrations            # 0.000 /sec
            94      page-faults               # 62.425 K/sec                   ( +- 0.05% )
     2,989,307      cycles                    # 1.985 GHz                      ( +- 0.18% )
       331,156      stalled-cycles-frontend   # 11.08% frontend cycles idle    ( +- 0.55% )
     7,564,373      instructions              # 2.53 insn per cycle
                                              # 0.04 stalled cycles per insn   ( +- 0.03% )
     1,554,530      branches                  # 1.032 G/sec                    ( +- 0.03% )
        13,531      branch-misses             # 0.87% of all branches          ( +- 0.10% )

    0.00199971 +- 0.00000644 seconds time elapsed  ( +- 0.32% )

x86/irq: Make irqstats array based

 Performance counter stats for 'cat /proc/interrupts' (1000 runs):

          1.46 msec task-clock                # 0.730 CPUs utilized            ( +- 0.17% )
             0      context-switches          # 0.000 /sec
             0      cpu-migrations            # 0.000 /sec
            95      page-faults               # 65.261 K/sec                   ( +- 0.05% )
     2,891,394      cycles                    # 1.986 GHz                      ( +- 0.16% )
       335,074      stalled-cycles-frontend   # 11.59% frontend cycles idle    ( +- 0.40% )
     6,461,160      instructions              # 2.23 insn per cycle
                                              # 0.05 stalled cycles per insn   ( +- 0.04% )
     1,370,320      branches                  # 941.353 M/sec                  ( +- 0.03% )
        13,409      branch-misses             # 0.98% of all branches          ( +- 0.10% )

    0.00199471 +- 0.00000400 seconds time elapsed  ( +- 0.20% )
genirq: Calculate precision only when required

 Performance counter stats for 'cat /proc/interrupts' (1000 runs):

          1.42 msec task-clock                # 0.732 CPUs utilized            ( +- 0.23% )
             0      context-switches          # 0.000 /sec
             0      cpu-migrations            # 0.000 /sec
            95      page-faults               # 67.004 K/sec                   ( +- 0.05% )
     2,817,515      cycles                    # 1.987 GHz                      ( +- 0.23% )
       327,877      stalled-cycles-frontend   # 11.64% frontend cycles idle    ( +- 0.73% )
     6,391,647      instructions              # 2.27 insn per cycle
                                              # 0.05 stalled cycles per insn   ( +- 0.04% )
     1,348,883      branches                  # 951.380 M/sec                  ( +- 0.03% )
        13,483      branch-misses             # 1.00% of all branches          ( +- 0.09% )

    0.00193706 +- 0.00000522 seconds time elapsed  ( +- 0.27% )

genirq/proc: Speed up /proc/interrupts iteration

 Performance counter stats for 'cat /proc/interrupts' (1000 runs):

          1.05 msec task-clock                # 0.671 CPUs utilized            ( +- 0.23% )
             0      context-switches          # 0.000 /sec
             0      cpu-migrations            # 0.000 /sec
            95      page-faults               # 90.552 K/sec                   ( +- 0.05% )
     2,077,859      cycles                    # 1.981 GHz                      ( +- 0.18% )
       313,913      stalled-cycles-frontend   # 15.11% frontend cycles idle    ( +- 0.35% )
     3,891,909      instructions              # 1.87 insn per cycle
                                              # 0.08 stalled cycles per insn   ( +- 0.06% )
       744,475      branches                  # 709.616 M/sec                  ( +- 0.06% )
        12,962      branch-misses             # 1.74% of all branches          ( +- 0.10% )

    0.00156440 +- 0.00000414 seconds time elapsed  ( +- 0.26% )

---
 arch/x86/Kconfig                    |    1 
 arch/x86/events/amd/core.c          |    2 
 arch/x86/events/amd/ibs.c           |    2 
 arch/x86/events/core.c              |    2 
 arch/x86/events/intel/core.c        |    2 
 arch/x86/events/intel/knc.c         |    2 
 arch/x86/events/intel/p4.c          |    2 
 arch/x86/events/zhaoxin/core.c      |    2 
 arch/x86/hyperv/hv_init.c           |    2 
 arch/x86/include/asm/hardirq.h      |   76 ++--
 arch/x86/include/asm/mce.h          |    3 
 arch/x86/kernel/apic/apic.c         |    4 
 arch/x86/kernel/apic/ipi.c          |    2 
 arch/x86/kernel/cpu/acrn.c          |    2 
 arch/x86/kernel/cpu/mce/amd.c       |    2 
 arch/x86/kernel/cpu/mce/core.c      |    8 
 arch/x86/kernel/cpu/mce/threshold.c |    2 
 arch/x86/kernel/cpu/mshyperv.c      |    4 
 arch/x86/kernel/irq.c               |  225 ++++----------
 arch/x86/kernel/irq_work.c          |    2 
 arch/x86/kernel/kvm.c               |    2 
 arch/x86/kernel/nmi.c               |    4 
 arch/x86/kernel/smp.c               |    6 
 arch/x86/mm/tlb.c                   |    2 
 arch/x86/xen/enlighten_hvm.c        |    2 
 arch/x86/xen/enlighten_pv.c         |    2 
 arch/x86/xen/smp.c                  |    6 
 arch/x86/xen/smp_pv.c               |    2 
 fs/proc/Makefile                    |    4 
 include/linux/interrupt.h           |    1 
 include/linux/irq.h                 |   18 +
 include/linux/irqdesc.h             |    8 
 include/uapi/linux/irqstats.h       |   27 +
 kernel/irq/Kconfig                  |    6 
 kernel/irq/chip.c                   |    2 
 kernel/irq/internals.h              |   24 +
 kernel/irq/irqdesc.c                |   67 ++--
 kernel/irq/manage.c                 |   16 -
 kernel/irq/proc.c                   |  556 +++++++++++++++++++++++++++++++++---
 kernel/irq/settings.h               |   14 
 40 files changed, 815 insertions(+), 301 deletions(-)