Date: Wed, 04 Mar 2026 19:55:27 +0100
Message-ID: <20260303150539.513068586@kernel.org>
From: Thomas Gleixner
To: LKML
Cc: x86@kernel.org, Dmitry Ilvokhin, Neil Horman
Subject: [patch 00/14] genirq: Improve /proc/interrupts for real and add a binary interface

I started to look into this a few days ago due to the recent patch
reminder from Dmitry:

  https://lore.kernel.org/aQj5mGZ6_BBlAm3B@shell.ilvokhin.com

Let me start with the history of /proc/interrupts. For more than two
decades I've been observing developers micro-enhancing the performance of
the /proc/interrupts readout and stopping right there after gaining $N% in
a particular part of the code instead of actually looking at the bigger
picture. I understand that people have time constraints, but it's amazing
how much low hanging fruit has been left on the table due to that. Not to
mention the more than decade-old talk about a new binary interface which
would address the underlying problem of the unfortunately unchangeable
/proc/interrupts ABI. Just for giggles, the patch above even mentions it
in the change log:

  "Although a binary /proc interface would be a better long-term solution
   due to lower formatting (kernel side) and parsing (user-space side)
   overhead the text interface will remain in use for some time, even if
   better solutions will be available. Optimizing the /proc/interrupts
   printing code is therefore still beneficial."
Coming back to the above referenced patch, which triggered me to actually
look into this. The patch achieves an impressive readout time improvement
of ~19% on my 256 CPU test machine with a trivial C test case which
essentially does:

	fd = open("/proc/interrupts");
	for (i = 0; i < 1000; i++) {
		t0 = now();
		read_all_data(fd);
		deltas[i] = now() - t0;
		lseek(fd, 0, SEEK_SET);
	}
	print_mean_and_rel_stddev(deltas);

I wrote that trivial test because the numbers provided in the patch above
are based on 'perf stat -r 1000 cat /proc/interrupts >/dev/null', which
takes all the irrelevant setup and teardown costs of 'cat' into
account. That makes it tedious to observe the actual problems via perf
[stat|top] because the setup/teardown overhead obfuscates the output. For
completeness sake the numbers observed with that very same perf command
line are provided at the end for reference. They pretty much confirm the
findings of the narrowed down micro benchmark.

Let's take a look at the resulting numbers:

  Patch                                     t mean       rel. stddev  delta base  delta prev
  Baseline v7.0-rc1                         1310.363 us  +/-1.81%
   1 x86/irq: Optimize interrupts decimal   1059.238 us  +/-1.76%     -19%        -19%

Impressive, but not so impressive anymore when looking at perf top output
and addressing all the other offenders one by one:

  Patch                                     t mean       rel. stddev  delta base  delta prev
  Baseline v7.0-rc1                         1310.363 us  +/-1.81%
   1 x86/irq: Optimize interrupts decimal   1059.238 us  +/-1.76%     -19%        -19%
   3 genirq/proc: Utilize irq_desc::tot_c    652.365 us  +/-0.81%     -50%        -38%
   4 x86/irq: Make irqstats array based      605.326 us  +/-1.63%     -54%        - 7%
   7 genirq: Calculate precision only whe    575.973 us  +/-1.12%     -56%        - 5%
  10 genirq/proc: Speed up /proc/interrup    209.907 us  +/-1.84%     -84%        -64%

Now let's look at how I got there, by simply running the microbenchmark
with infinite loops and using perf top to analyze the hotspots.

Baseline:

   20.19%  [k] mtree_load
   19.34%  [k] num_to_str
   10.12%  [k] number
    9.18%  [k] vsnprintf
    7.55%  [k] seq_put_decimal_ull_width
    6.16%  [k] format_decode
    5.65%  [k] show_interrupts
    4.88%  [k] _find_next_bit
    3.46%  [k] seq_read_iter
    3.29%  [k] seq_printf
    1.58%  [k] memcpy_orig
    1.50%  [k] __rcu_read_lock
    1.26%  [k] arch_show_interrupts
    1.05%  [k] __rcu_read_unlock
    1.04%  [k] int_seq_next
    0.94%  [k] rep_movs_alternative
    0.82%  [k] irq_to_desc
    0.56%  [k] irq_get_nr_irqs

Obviously vsnprintf() was the first item to look at.

1 x86/irq: Optimize interrupts decimal printing

   29.94%  [k] num_to_str
   25.16%  [k] mtree_load
   12.48%  [k] seq_put_decimal_ull_width
    7.05%  [k] show_interrupts
    6.53%  [k] _find_next_bit
    4.48%  [k] seq_read_iter
    1.87%  [k] __rcu_read_lock
    1.84%  [k] arch_show_interrupts
    1.60%  [k] vsnprintf
    1.32%  [k] __rcu_read_unlock
    1.18%  [k] rep_movs_alternative
    1.16%  [k] int_seq_next
    1.01%  [k] format_decode
    0.98%  [k] irq_to_desc
    0.65%  [k] irq_get_nr_irqs

So Dmitry's patch removed the vsnprintf() overhead and made num_to_str()
more prominent. That number is insanely high, so I analyzed the
/proc/interrupts output, which offered an easy way out: a large number of
interrupts have zero counts on all but one CPU. That's normal for
interrupt managed multiqueue devices. Low frequency interrupts also tend
to stay on their initial affinity and are not moved around by
balancers. With single CPU effective affinity targets (x86 and other
architectures) the majority of lines have just _one_ non-zero entry,
unless the balancer or the admin changed the affinity. So it was pretty
obvious to write a fixed string directly instead of repeating the string
conversion of zero over and over.
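The shape of that shortcut, as a minimal sketch (illustration only,
assuming the usual 10 character wide count field of /proc/interrupts; the
helper name is made up and the actual patch may differ):

	/* Illustration only, not the actual kernel code */
	static void emit_count(struct seq_file *p, unsigned long cnt)
	{
		/* Preformatted " " delimiter plus 10 wide right-aligned "0" */
		static const char zero_field[] = "          0";

		if (!cnt)	/* Common case: copy the fixed string */
			seq_write(p, zero_field, sizeof(zero_field) - 1);
		else		/* Rare case: full decimal conversion */
			seq_put_decimal_ull_width(p, " ", cnt, 10);
	}

That turns the dominant all-zero fields into plain copies of a constant
string.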
3 genirq/proc: Utilize irq_desc::tot_count to avoid evaluation

   41.04%  [kernel]  [k] mtree_load
   10.92%  [kernel]  [k] num_to_str
    5.97%  [kernel]  [k] show_interrupts
    5.18%  [kernel]  [k] _find_next_bit
    4.86%  [kernel]  [k] seq_put_decimal_ull_width
    4.65%  [kernel]  [k] seq_read_iter
    3.02%  [kernel]  [k] __rcu_read_lock
    2.84%  [kernel]  [k] arch_show_interrupts
    2.71%  [kernel]  [k] irq_proc_emit_counts
    2.69%  [kernel]  [k] memcpy_orig
    2.32%  [kernel]  [k] vsnprintf
    2.01%  [kernel]  [k] __rcu_read_unlock
    1.79%  [kernel]  [k] int_seq_next
    1.71%  [kernel]  [k] format_decode
    1.62%  [kernel]  [k] irq_to_desc
    1.47%  [kernel]  [k] rep_movs_alternative
    1.13%  [kernel]  [k] irq_get_nr_irqs

num_to_str() has lost the pole position and got replaced by mtree_load()
again. mtree_load() is used to look up the interrupt descriptors, and
something is seriously wrong given that high CPU usage, but I decided to
look at that later.

While fixing up Dmitry's patch I noticed that the way x86 manages the
architecture specific statistics is suboptimal. x86 holds the per
interrupt counters in struct members and therefore requires an #ifdeffed
series of copied code per counter to emit them. The struct member based
counters are also in the way of implementing a binary interface without
adding more architecture specific duplicated code.

The same can be achieved with an array of counters. That does not change
the actual code in the interrupt hot path which increments the counter:
the array indices are constant, so the compiler still calculates the
offset from the per CPU data pointer as before. This also fixes the out of
sync arch_show_interrupts() and arch_irq_stat_cpu() implementations as
they now use the same table.

4 x86/irq: Make irqstats array based

   43.99%  [kernel]  [k] mtree_load
    9.16%  [kernel]  [k] irq_proc_emit_counts
    6.41%  [kernel]  [k] show_interrupts
    5.86%  [kernel]  [k] _find_next_bit
    4.91%  [kernel]  [k] seq_read_iter
    3.98%  [kernel]  [k] memcpy_orig
    3.28%  [kernel]  [k] __rcu_read_lock
    2.95%  [kernel]  [k] num_to_str
    2.28%  [kernel]  [k] vsnprintf
    2.06%  [kernel]  [k] __rcu_read_unlock
    2.02%  [kernel]  [k] rep_movs_alternative
    1.99%  [kernel]  [k] int_seq_next
    1.96%  [kernel]  [k] format_decode
    1.69%  [kernel]  [k] irq_to_desc
    1.58%  [kernel]  [k] seq_put_decimal_ull_width
    1.20%  [kernel]  [k] irq_get_nr_irqs

num_to_str() got further demoted because x86 now uses
irq_proc_emit_counts() with the optimized zero output as well, and
arch_show_interrupts() is off the radar because it only contains a trivial
loop. This reduces text size by ~2k, which obviously reduces the I-cache
footprint. The loop overhead is completely irrelevant compared to the
actual cost of running the for_each_online_cpu() loop once per interrupt
and accessing memory in the worst possible pattern in each iteration,
especially when reading counters from remote nodes. There is not much
which can be done about that. I tried to copy all per CPU counters into
local data storage first, but that just trades one memory/cacheline
massacre against another for no gain.

So back to mtree_load(). That high CPU usage doesn't make any sense
because the whole point of sparse interrupts and the underlying maple tree
is to iterate quickly through the tree and skip the holes. So much for the
theory: the fs/proc/interrupts code never got updated and still iterates
over the possible interrupt number space one by one, thereby defeating the
whole purpose of the maple tree. On the test machine that's almost 1000
interrupt descriptor lookups, while only 153 interrupts are in use and
exist in the maple tree.
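To make the difference concrete, here is a sketch of the two iteration
patterns (illustrative only; show_one() is a hypothetical stand-in for the
per-descriptor output, and the sketch assumes access to the sparse_irqs
maple tree, which is private to kernel/irq/irqdesc.c):

	unsigned int nr = irq_get_nr_irqs();
	struct irq_desc *desc;
	unsigned long i;

	/*
	 * What the proc code effectively does: a full maple tree lookup
	 * (mtree_load()) for every possible interrupt number, hole or not.
	 */
	for (i = 0; i < nr; i++) {
		desc = irq_to_desc(i);
		if (desc)
			show_one(desc);
	}

	/*
	 * What a proper iteration does: visit only the populated entries.
	 * mt_find() advances @i past each entry it returns.
	 */
	for (i = 0; (desc = mt_find(&sparse_irqs, &i, nr)); )
		show_one(desc);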
The trivial fix would have been to use the proper iterator in
fs/proc/interrupts, but that would have needed some investigation vs. the
architectures which do not use the generic version of
show_interrupts(). That can be done by those people if they actually
care. Aside from that it would still touch the maple tree twice for each
interrupt: first to find the next number and then to actually load the
interrupt descriptor.

This can be done smarter by retrieving the next descriptor right away and
storing it in the *v data pointer of the seq_file ops instead of using
that pointer to store the number. But doing so required some preparatory
changes, and while thinking about them I noticed a few obvious
improvements to avoid doing the same thing over and over for no reason and
to reduce the number of conditional branches in show_interrupts().

7 genirq: Calculate precision only when required

   46.41%  [k] mtree_load
   10.42%  [k] irq_proc_emit_counts
    5.86%  [k] show_interrupts
    5.11%  [k] seq_read_iter
    5.05%  [k] _find_next_bit
    3.56%  [k] memcpy_orig
    3.23%  [k] num_to_str
    2.51%  [k] __rcu_read_unlock
    2.27%  [k] vsnprintf
    2.23%  [k] __rcu_read_lock
    2.00%  [k] rep_movs_alternative
    1.90%  [k] int_seq_next
    1.81%  [k] seq_put_decimal_ull_width
    1.58%  [k] format_decode
    1.09%  [k] number
    0.91%  [k] string
    0.76%  [k] put_dec_trunc8
    0.67%  [k] seq_printf
    0.60%  [k] irq_to_desc
    0.46%  [k] irq_get_nr_irqs

show_interrupts() became slightly less expensive. The leader position of
mtree_load() obviously stays the same.

With that addressed, adding optimized seq_file code to the interrupt core
became feasible. All it still required was a fast refcount mechanism so
that the descriptor pointer can be safely stored in the seq_file iterator
and cannot be freed between seq_next() and seq_show(). I briefly pondered
using the existing kobject in the interrupt descriptor, but that uses a
refcount_t underneath, which is way more expensive than rcuref_t and
requires function calls. As the descriptors are RCU managed, rcuref_t is
the obvious choice.

10 genirq/proc: Speed up /proc/interrupts iteration

   26.77%  [k] irq_proc_emit_counts
   15.90%  [k] _find_next_bit
   10.56%  [k] memcpy_orig
    9.82%  [k] num_to_str
    5.68%  [k] rep_movs_alternative
    5.34%  [k] vsnprintf
    4.76%  [k] seq_put_decimal_ull_width
    4.01%  [k] format_decode
    3.17%  [k] mt_find
    2.51%  [k] string
    2.24%  [k] number
    1.52%  [k] put_dec_trunc8
    1.44%  [k] seq_printf
    0.87%  [k] irq_seq_show

mt_find() is the function which retrieves the next descriptor, and the
numbers are now where one would expect them to be. I might have missed
some low hanging fruit there as well. If you find it, you are entitled to
keep it and fix it yourself. :)
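Condensed, the descriptor pinning described above looks roughly like the
sketch below (illustration only: the rcuref_t member name 'ref' and the
helper are hypothetical, and a real implementation would skip to the next
entry when the refcount grab fails instead of giving up):

	static struct irq_desc *irq_desc_find_pinned(unsigned long *index)
	{
		struct irq_desc *desc;

		rcu_read_lock();
		desc = mt_find(&sparse_irqs, index, irq_get_nr_irqs());
		/* A failed rcuref_get() means the descriptor is being freed */
		if (desc && !rcuref_get(&desc->ref))
			desc = NULL;
		rcu_read_unlock();
		return desc;
	}

seq_show() can then dereference the pinned descriptor without holding RCU
or the descriptor lock, and the reference is dropped via rcuref_put() once
the iterator moves on.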
That said, let's talk about the previously mentioned optimized binary
interface. As I'm not interested in reading "it would be better" over and
over for another ten years, I sat down and implemented a straightforward
interface which provides a variable record size binary dump of the
relevant statistics, separated into three files:

  1) device interrupts

  2) per CPU interrupts (irq descriptor based)

  3) architecture specific interrupts

The separation is done because for interrupt balancing purposes #1 is the
interesting part, as #2 and #3 cannot be influenced by modifying
affinities. They can be monitored of course, but that's a different class
of events.

This interface comes with the following semantics:

  - Each record starts with a pair of 'interrupt number' and 'number of
    entries'.

  - Each entry is a pair of 'cpu number' and 'interrupt count'.

  - Records are only emitted for interrupts which have a non-zero total
    event count, as that's what matters for observation.

  - Records are only emitted if the counter(s) for the CPU target(s) in
    the effective affinity mask are non-zero. Emitting counts from a
    previous affinity setting is irrelevant and can easily be cached by
    the monitoring application if needed.

  - The readout is completely lockless vs. the interrupt descriptor. It
    does not touch the interrupt descriptor lock, so it does not interfere
    with concurrent high frequency interrupts at all, except for the
    memory accesses to the counters, which obviously can't be avoided.

  - Contrary to the nonsensical seq_file seek handling of
    /proc/interrupts, only seeking back to the origin is
    allowed. /proc/interrupts is a non-constant record file, and the
    seq_file lseek() implementation therefore does not guarantee
    consistent lseek() behaviour at all, except for lseek(fd, 0,
    SEEK_SET).

Reading those files with the same 1000 loops test takes on average:

  ~ 8 us for the device interrupts
  ~37 us for the x86 architecture interrupts
  ---
  ~45 us total

Compared to the fully applied /proc/interrupts enhancements:

  209 us vs. 45 us =~ -79% --> ~4.5X

The comparison to the baseline v7.0-rc1 is:

  1310 us vs. 45 us =~ -96% --> ~29X

For completeness sake the perf top list of the endless read/lseek loop
accessing /proc/irq/device_stats:

   70.52%  [kernel]  [k] mt_find
    6.36%  [kernel]  [k] _copy_to_iter
    5.09%  [kernel]  [k] irq_stats_read
    3.77%  [kernel]  [k] irq_find_desc_at_or_after
    2.09%  [kernel]  [k] __rcu_read_lock
    2.04%  [kernel]  [k] __rcu_read_unlock
    1.67%  [kernel]  [k] __check_object_size
    1.07%  [kernel]  [k] __virt_addr_valid
    0.82%  [kernel]  [k] entry_SYSCALL_64
    0.81%  [kernel]  [k] vfs_read

Here mt_find() rightfully dominates because the decision to output data is
trivial: all interrupts are single CPU targets, so there is only one
counter per interrupt to access. If it is non-zero, the store and the
output just vanish in the noise compared to the actual lookup costs. Based
on the 8 us average this means:

  5.641 us  mt_find()
  0.509 us  copy_to_iter()
  0.407 us  irq_stats_read()
  0.302 us  irq_find_desc_at_or_after()

mt_find() and irq_find_desc_at_or_after() belong together, so we have:

  5.943 us  lookup()
  0.509 us  copy_to_iter()	// Copies data to user space
  0.407 us  irq_stats_read()	// Counter analysis

Broken down to a per interrupt average with 153 requested interrupts on
that machine this means:

  38.8 ns  lookup time
   3.3 ns  copy to user
   2.7 ns  analysis

And the same for the endless read/lseek loop accessing
/proc/irq/arch_stats:

   57.82%  [k] _find_next_bit
   37.20%  [k] irq_arch_stats_read
    1.06%  [k] _copy_to_iter
    1.03%  [k] rep_movs_alternative
    0.73%  [k] irq_stat_copy_to_iter
    0.27%  [k] vfs_read

The most expensive parts of the latter are the for_each_online_cpu() loops
and the suboptimal memory access patterns explained above. With a 37 us
readout time for 256 CPUs and 19 x86 architecture specific counters this
gives:

  21.393 us  for_each_online_cpu()
  13.764 us  irq_arch_stats_read()	// Counter analysis

That means on average per counter:

  1.12 us  for_each_online_cpu()
  0.72 us  irq_arch_stats_read()

In theory the for_each_online_cpu() overhead could be optimized, but that
creates a lot of inlined code for a meager 5% performance improvement over
the out of line version. Not really worth it.
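For illustration, a minimal user space reader of /proc/irq/device_stats
following the record semantics above could look like the sketch below. The
actual record layout is defined in include/uapi/linux/irqstats.h and is
not reproduced here, so the u32/u64 field types are an assumption, not the
ABI:

	#include <stdint.h>
	#include <stdio.h>

	int main(void)
	{
		FILE *f = fopen("/proc/irq/device_stats", "rb");
		uint32_t hdr[2];	/* ASSUMED: irq number, number of entries */

		if (!f)
			return 1;

		/* Variable sized records: header followed by hdr[1] entries */
		while (fread(hdr, sizeof(hdr), 1, f) == 1) {
			for (uint32_t i = 0; i < hdr[1]; i++) {
				uint32_t cpu;	/* ASSUMED: cpu number */
				uint64_t cnt;	/* ASSUMED: interrupt count */

				if (fread(&cpu, sizeof(cpu), 1, f) != 1 ||
				    fread(&cnt, sizeof(cnt), 1, f) != 1)
					return 1;
				printf("irq %u: cpu %u count %llu\n",
				       hdr[0], cpu, (unsigned long long)cnt);
			}
		}
		fclose(f);
		return 0;
	}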
Note that the /proc/interrupts numbers were taken on a mostly idle system
with an almost zero probability of hitting irq_desc::lock contention. All
test results are obviously skewed by the repetitive invocations, which
prime the caches, but cache cold tests without repetition result in
roughly the same performance ratios for all scenarios.

A quick python hack computing the total number of interrupts from the
optimized /proc/interrupts and from the new binary interfaces yields:

  /proc/interrupts optimized	/proc/irq/[device+arch]_stats
  6.957 ms			0.394 ms	-94% (~17X)

which covers both the cost of reading and the cost of computing. The read
advantage of the binary interface is ~4.5X (see above), so the compute
advantage of avoiding the text parsing and not looking at pointless
numbers amounts to ~3.7X, which is unsurprisingly in the expected
ballpark.

The last four patches, which add the binary interface, obviously need some
thought vs. the interface itself and are therefore marked RFC.

The series applies on top of v7.0-rc1 and is also available via git:

  git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git irq/core

Thanks,

	tglx

---
 It's not that I'm so smart, it's just that I stay with problems longer.
						- Albert Einstein
---

'perf stat -r 1000 cat /proc/interrupts' data series

Baseline v7.0-rc1

 Performance counter stats for 'cat /proc/interrupts' (1000 runs):

          2.55 msec task-clock                # 0.830 CPUs utilized            ( +- 0.09% )
             0      context-switches          # 0.000 /sec
             0      cpu-migrations            # 0.000 /sec
            94      page-faults               # 36.877 K/sec                   ( +- 0.05% )
     5,072,386      cycles                    # 1.990 GHz                      ( +- 0.08% )
       357,009      stalled-cycles-frontend   # 7.04% frontend cycles idle     ( +- 0.29% )
    17,197,829      instructions              # 3.39 insn per cycle
                                              # 0.02 stalled cycles per insn   ( +- 0.01% )
     3,323,704      branches                  # 1.304 G/sec                    ( +- 0.01% )
        13,773      branch-misses             # 0.41% of all branches          ( +- 0.09% )

    0.00307221 +- 0.00000369 seconds time elapsed  ( +- 0.12% )

x86/irq: Optimize interrupts decimals printing

 Performance counter stats for 'cat /proc/interrupts' (1000 runs):

          2.10 msec task-clock                # 0.809 CPUs utilized            ( +- 0.18% )
             0      context-switches          # 0.000 /sec
             0      cpu-migrations            # 0.000 /sec
            94      page-faults               # 44.720 K/sec                   ( +- 0.05% )
     4,179,092      cycles                    # 1.988 GHz                      ( +- 0.13% )
       360,135      stalled-cycles-frontend   # 8.62% frontend cycles idle     ( +- 0.35% )
    13,174,711      instructions              # 3.15 insn per cycle
                                              # 0.03 stalled cycles per insn   ( +- 0.02% )
     2,596,179      branches                  # 1.235 G/sec                    ( +- 0.02% )
        13,793      branch-misses             # 0.53% of all branches          ( +- 0.09% )

    0.00259694 +- 0.00000597 seconds time elapsed  ( +- 0.23% )

genirq/proc: Utilize irq_desc::tot_count to avoid evaluation

 Performance counter stats for 'cat /proc/interrupts' (1000 runs):

          1.51 msec task-clock                # 0.753 CPUs utilized            ( +- 0.23% )
             0      context-switches          # 0.000 /sec
             0      cpu-migrations            # 0.000 /sec
            94      page-faults               # 62.425 K/sec                   ( +- 0.05% )
     2,989,307      cycles                    # 1.985 GHz                      ( +- 0.18% )
       331,156      stalled-cycles-frontend   # 11.08% frontend cycles idle    ( +- 0.55% )
     7,564,373      instructions              # 2.53 insn per cycle
                                              # 0.04 stalled cycles per insn   ( +- 0.03% )
     1,554,530      branches                  # 1.032 G/sec                    ( +- 0.03% )
        13,531      branch-misses             # 0.87% of all branches          ( +- 0.10% )

    0.00199971 +- 0.00000644 seconds time elapsed  ( +- 0.32% )

x86/irq: Make irqstats array based

 Performance counter stats for 'cat /proc/interrupts' (1000 runs):

          1.46 msec task-clock                # 0.730 CPUs utilized            ( +- 0.17% )
             0      context-switches          # 0.000 /sec
             0      cpu-migrations            # 0.000 /sec
            95      page-faults               # 65.261 K/sec                   ( +- 0.05% )
     2,891,394      cycles                    # 1.986 GHz                      ( +- 0.16% )
       335,074      stalled-cycles-frontend   # 11.59% frontend cycles idle    ( +- 0.40% )
     6,461,160      instructions              # 2.23 insn per cycle
                                              # 0.05 stalled cycles per insn   ( +- 0.04% )
     1,370,320      branches                  # 941.353 M/sec                  ( +- 0.03% )
        13,409      branch-misses             # 0.98% of all branches          ( +- 0.10% )

    0.00199471 +- 0.00000400 seconds time elapsed  ( +- 0.20% )
genirq: Calculate precision only when required

 Performance counter stats for 'cat /proc/interrupts' (1000 runs):

          1.42 msec task-clock                # 0.732 CPUs utilized            ( +- 0.23% )
             0      context-switches          # 0.000 /sec
             0      cpu-migrations            # 0.000 /sec
            95      page-faults               # 67.004 K/sec                   ( +- 0.05% )
     2,817,515      cycles                    # 1.987 GHz                      ( +- 0.23% )
       327,877      stalled-cycles-frontend   # 11.64% frontend cycles idle    ( +- 0.73% )
     6,391,647      instructions              # 2.27 insn per cycle
                                              # 0.05 stalled cycles per insn   ( +- 0.04% )
     1,348,883      branches                  # 951.380 M/sec                  ( +- 0.03% )
        13,483      branch-misses             # 1.00% of all branches          ( +- 0.09% )

    0.00193706 +- 0.00000522 seconds time elapsed  ( +- 0.27% )

genirq/proc: Speed up /proc/interrupts iteration

 Performance counter stats for 'cat /proc/interrupts' (1000 runs):

          1.05 msec task-clock                # 0.671 CPUs utilized            ( +- 0.23% )
             0      context-switches          # 0.000 /sec
             0      cpu-migrations            # 0.000 /sec
            95      page-faults               # 90.552 K/sec                   ( +- 0.05% )
     2,077,859      cycles                    # 1.981 GHz                      ( +- 0.18% )
       313,913      stalled-cycles-frontend   # 15.11% frontend cycles idle    ( +- 0.35% )
     3,891,909      instructions              # 1.87 insn per cycle
                                              # 0.08 stalled cycles per insn   ( +- 0.06% )
       744,475      branches                  # 709.616 M/sec                  ( +- 0.06% )
        12,962      branch-misses             # 1.74% of all branches          ( +- 0.10% )

    0.00156440 +- 0.00000414 seconds time elapsed  ( +- 0.26% )

---
 arch/x86/Kconfig                    |    1 
 arch/x86/events/amd/core.c          |    2 
 arch/x86/events/amd/ibs.c           |    2 
 arch/x86/events/core.c              |    2 
 arch/x86/events/intel/core.c        |    2 
 arch/x86/events/intel/knc.c         |    2 
 arch/x86/events/intel/p4.c          |    2 
 arch/x86/events/zhaoxin/core.c      |    2 
 arch/x86/hyperv/hv_init.c           |    2 
 arch/x86/include/asm/hardirq.h      |   76 ++--
 arch/x86/include/asm/mce.h          |    3 
 arch/x86/kernel/apic/apic.c         |    4 
 arch/x86/kernel/apic/ipi.c          |    2 
 arch/x86/kernel/cpu/acrn.c          |    2 
 arch/x86/kernel/cpu/mce/amd.c       |    2 
 arch/x86/kernel/cpu/mce/core.c      |    8 
 arch/x86/kernel/cpu/mce/threshold.c |    2 
 arch/x86/kernel/cpu/mshyperv.c      |    4 
 arch/x86/kernel/irq.c               |  225 ++++----------
 arch/x86/kernel/irq_work.c          |    2 
 arch/x86/kernel/kvm.c               |    2 
 arch/x86/kernel/nmi.c               |    4 
 arch/x86/kernel/smp.c               |    6 
 arch/x86/mm/tlb.c                   |    2 
 arch/x86/xen/enlighten_hvm.c        |    2 
 arch/x86/xen/enlighten_pv.c         |    2 
 arch/x86/xen/smp.c                  |    6 
 arch/x86/xen/smp_pv.c               |    2 
 fs/proc/Makefile                    |    4 
 include/linux/interrupt.h           |    1 
 include/linux/irq.h                 |   18 +
 include/linux/irqdesc.h             |    8 
 include/uapi/linux/irqstats.h       |   27 +
 kernel/irq/Kconfig                  |    6 
 kernel/irq/chip.c                   |    2 
 kernel/irq/internals.h              |   24 +
 kernel/irq/irqdesc.c                |   67 ++--
 kernel/irq/manage.c                 |   16 -
 kernel/irq/proc.c                   |  556 +++++++++++++++++++++++++++++++++---
 kernel/irq/settings.h               |   14 
 40 files changed, 815 insertions(+), 301 deletions(-)