* Proper kernel irq time accounting -v4
@ 2010-10-05  0:03 Venkatesh Pallipadi
From: Venkatesh Pallipadi @ 2010-10-05  0:03 UTC
  To: Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Balbir Singh, Martin Schwidefsky
  Cc: linux-kernel, Paul Turner, Eric Dumazet, Venkatesh Pallipadi

Previous versions:
-v0: http://lkml.indiana.edu/hypermail//linux/kernel/1005.3/00411.html
     lkml subject - "Finer granularity and task/cgroup irq time accounting"

-v1: http://lkml.indiana.edu/hypermail//linux/kernel/1007.2/00987.html
     lkml subject - "Finer granularity and task/cgroup irq time accounting"

-v2: http://lkml.indiana.edu/hypermail//linux/kernel/1009.2/00488.html
     lkml subject - "Proper kernel irq time accounting"

-v3: http://lkml.indiana.edu/hypermail//linux/kernel/1009.3/02482.html
     lkml subject - "Proper kernel irq time accounting -v3"


Changes from -v3:
- Switch to using sched_clock_cpu instead of sched_clock in
  account_system_vtime. This needed tick_check_idle to be called early in
  irq_enter.
- Do not account softirq time in ksoftirqd context. We want ksoftirqd to
  continue having sched exec_runtime.
- Add a boot option to disable TSC-based irq accounting on x86.
- Remove the patch that exported irq times in /proc/interrupts and
  /proc/softirqs. I am looking at other options to export that and
  will have a replacement patch soon.


Description:
Here is some background information about interrupt time accounting in the
kernel and the related problems.

Interrupts always run in the context of the currently running task. Softirqs
mostly run in the context of the currently running task, unless ksoftirqd gets
involved. /proc/interrupts and /proc/softirqs are the interfaces that report
the number of interrupts and softirqs per CPU since boot. /proc/stat has
fields that report per-CPU and system-wide interrupt and softirq processing
time in clock_t units.
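
As an illustration of the /proc/stat interface described above, here is a
minimal userspace sketch that pulls the irq and softirq fields out of a "cpu"
line. The field order (user, nice, system, idle, iowait, irq, softirq) is the
documented /proc/stat layout; the struct and function names are made up for
this example and are not part of the patchset.

```c
#include <stdio.h>
#include <string.h>

/* First seven fields of a "cpu" line in /proc/stat, in clock_t units. */
struct cpu_times {
	unsigned long long user, nice, system, idle, iowait, irq, softirq;
};

/*
 * Parse one "cpu" or "cpuN" line (hypothetical helper).
 * Returns 0 on success, -1 if the line is not a cpu line or has
 * fewer than seven numeric fields.
 */
int parse_cpu_line(const char *line, struct cpu_times *t)
{
	char name[16];
	int n;

	n = sscanf(line, "%15s %llu %llu %llu %llu %llu %llu %llu",
		   name, &t->user, &t->nice, &t->system, &t->idle,
		   &t->iowait, &t->irq, &t->softirq);
	if (n != 8 || strncmp(name, "cpu", 3) != 0)
		return -1;
	return 0;
}

/* Read the system-wide "cpu" line, which is the first line of /proc/stat. */
int read_total_cpu_times(struct cpu_times *t)
{
	char buf[512];
	FILE *f = fopen("/proc/stat", "r");
	int ret = -1;

	if (!f)
		return -1;
	if (fgets(buf, sizeof(buf), f))
		ret = parse_cpu_line(buf, t);
	fclose(f);
	return ret;
}
```

Dividing the irq and softirq values by sysconf(_SC_CLK_TCK) converts them
from clock_t units to seconds.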


There are two problems with the way interrupts are accounted by the kernel.

(1) Coarse grained interrupt time reporting
On most archs (except s390, powerpc and ia64 with CONFIG_VIRT_CPU_ACCOUNTING),
the interrupt and softirq time reported in /proc/stat is tick-sample based.
The kernel looks at what it is doing at each CPU tick and accounts the entire
tick to that particular activity. This means the data in /proc/stat is
pretty coarse grained.
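
To make the coarseness concrete, here is a small userspace simulation, not
kernel code, that compares exact accounting with tick sampling over a
timeline of activity segments. All names are hypothetical, and the model
assumes a tick simply charges one full tick period to whatever is running at
the instant the tick lands.

```c
#include <stdint.h>

enum activity { ACT_TASK, ACT_HARDIRQ, ACT_SOFTIRQ, ACT_COUNT };

/* One contiguous stretch of CPU time spent in a single activity. */
struct segment {
	enum activity act;
	uint64_t len_ns;
};

/* Exact accounting: charge every segment its true length. */
void account_exact(const struct segment *seg, int nseg,
		   uint64_t out_ns[ACT_COUNT])
{
	for (int a = 0; a < ACT_COUNT; a++)
		out_ns[a] = 0;
	for (int i = 0; i < nseg; i++)
		out_ns[seg[i].act] += seg[i].len_ns;
}

/*
 * Tick sampling: ticks land at tick_ns, 2 * tick_ns, ... and each one
 * charges a whole tick to the segment that is running at that instant.
 */
void account_by_tick(const struct segment *seg, int nseg, uint64_t tick_ns,
		     uint64_t out_ns[ACT_COUNT])
{
	uint64_t seg_start = 0;
	uint64_t next_tick = tick_ns;

	for (int a = 0; a < ACT_COUNT; a++)
		out_ns[a] = 0;
	if (tick_ns == 0)
		return;

	for (int i = 0; i < nseg; i++) {
		uint64_t seg_end = seg_start + seg[i].len_ns;

		/* Every tick in [seg_start, seg_end) samples this segment. */
		while (next_tick < seg_end) {
			out_ns[seg[i].act] += tick_ns;
			next_tick += tick_ns;
		}
		seg_start = seg_end;
	}
}
```

With a 1 ms tick, a 100 us hardirq that happens to straddle a tick is charged
a full millisecond, while a 200 us hardirq that falls between ticks is
charged nothing.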

One related problem (at least on x86): with the recent
"Run irq handlers with interrupts disabled" change, the timer interrupt cannot
fire while another interrupt is being serviced [1]. As a result, the
sampling-based hardirq time in /proc/stat cannot capture any hardirq time at
all.

(2) Accounting irq processing time to current task/taskgroup
Whenever irq processing happens, the kernel accounts that time to the
currently running task. The exec_runtime reported in /proc/<pid>/schedstat and
<cgroup>/cpuacct.usage* includes any irq processing that happened while
the task was running. The scheduler vruntime calculations
also account any irq processing to the current task. This means exec time
accounting during heavy irq processing is essentially arbitrary: it depends
on when and on which CPU the irq processing happens and on which task
happened to be running on that CPU at that time.


The solution to (1) involves adding extra timing on irq entry/exit to
get the fine granularity info and then exporting it to the user.
The following patchset addresses this problem in a way similar to [2][3].
It keeps most of the code that does the timing generic
(CONFIG_IRQ_TIME_ACCOUNTING), based on sched_clock(), and adds support for
this on x86. This time is not exported to userspace yet; a patch for that
is coming soon.
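
The idea of timing on irq entry/exit can be sketched in userspace as follows.
This is only a model under stated assumptions, not the code in the patches:
the struct, the enum and the function names are hypothetical, the timestamp
is passed in by the caller instead of coming from sched_clock(), and the
ksoftirqd exclusion follows the -v3 to -v4 change listed above.

```c
#include <stdbool.h>
#include <stdint.h>

enum irq_ctx { CTX_TASK, CTX_HARDIRQ, CTX_SOFTIRQ };

/* Per-CPU accumulators (hypothetical layout for this sketch). */
struct irq_time_acct {
	uint64_t last_ns;	/* timestamp of the previous transition */
	uint64_t hardirq_ns;	/* total time spent in hardirq context */
	uint64_t softirq_ns;	/* total softirq time outside ksoftirqd */
};

/*
 * Call on every context transition. 'ctx' is the context that has been
 * running since the previous call. Softirq time spent in ksoftirqd is
 * left alone so that it stays visible as ordinary task runtime.
 */
void irq_time_account(struct irq_time_acct *a, uint64_t now_ns,
		      enum irq_ctx ctx, bool in_ksoftirqd)
{
	uint64_t delta = now_ns - a->last_ns;

	a->last_ns = now_ns;
	if (ctx == CTX_HARDIRQ)
		a->hardirq_ns += delta;
	else if (ctx == CTX_SOFTIRQ && !in_ksoftirqd)
		a->softirq_ns += delta;
}

/* Total irq time that should not be charged to any task. */
uint64_t irq_time_total(const struct irq_time_acct *a)
{
	return a->hardirq_ns + a->softirq_ns;
}
```

Because only deltas between transitions are accumulated, the result does not
depend on where the periodic tick happens to land.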

One partial solution proposed in [2][3] for (2) above was to capture this
interrupt time at the task/taskgroup level and let the user know how much irq
processing time the kernel charged to each task/taskgroup. But that solution
did not address the task timeslice including irq processing.
Peter Zijlstra and Martin Schwidefsky disagreed with that approach and
wanted to see a more complete solution that does not account irq processing
time to tasks at all.

The patchset below tries this more complete solution, with two
scheduler-related changes. The first takes irq processing time out of the
time the scheduler accounts to the task. The second adjusts the CPU
power to account for irq processing activity on the CPU. That in turn
results in an irq-busy CPU pulling only as many tasks as correspond to the
bandwidth it has left for non-irq processing.
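
The arithmetic behind the two scheduler changes can be sketched like this.
It is a simplified model, not the scheduler code in the patches: the function
names, the fixed averaging period and the POWER_SCALE constant of 1024 are
assumptions made for the example.

```c
#include <stdint.h>

/* Assumed nominal power of a CPU that spends no time in irq context. */
#define POWER_SCALE 1024u

/*
 * Change 1: runtime charged to the task is the wall-clock delta minus
 * the irq time that elapsed in the same window.
 */
uint64_t task_exec_delta(uint64_t wall_delta_ns, uint64_t irq_delta_ns)
{
	if (irq_delta_ns >= wall_delta_ns)
		return 0;
	return wall_delta_ns - irq_delta_ns;
}

/*
 * Change 2: scale CPU power by the fraction of the period that was not
 * spent in irq processing. The result is clamped to at least 1 (a choice
 * made in this sketch) so that a later division by power stays safe.
 */
uint32_t cpu_power_after_irq(uint64_t period_ns, uint64_t irq_ns)
{
	uint64_t power;

	if (period_ns == 0)
		return POWER_SCALE;	/* no history yet: assume full power */
	if (irq_ns >= period_ns)
		return 1;

	power = ((period_ns - irq_ns) * POWER_SCALE) / period_ns;
	return power ? (uint32_t)power : 1;
}
```

For example, a CPU that spent 25% of the last period in irq context ends up
with a power of 768 out of 1024, so the load balancer would aim to give it
roughly three quarters of the load of an otherwise identical idle-irq CPU.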

The changes here are only enabled with CONFIG_IRQ_TIME_ACCOUNTING as of now.

Thanks,
Venki

References:

[1] http://lkml.indiana.edu/hypermail//linux/kernel/1005.3/00864.html
    lkml subject - "genirq: Run irq handlers with interrupts disabled"

[2] http://lkml.indiana.edu/hypermail//linux/kernel/1005.3/00411.html
    lkml subject - "Finer granularity and task/cgroup irq time accounting"

[3] http://lkml.indiana.edu/hypermail//linux/kernel/1007.2/00987.html
    lkml subject - "Finer granularity and task/cgroup irq time accounting"

Signed-off-by: Venkatesh Pallipadi <venki@google.com>



Thread overview: 40+ messages
2010-10-05  0:03 Proper kernel irq time accounting -v4 Venkatesh Pallipadi
2010-10-05  0:03 ` [PATCH 1/8] si time accounting accounts bh_disable'd time to si -v4 Venkatesh Pallipadi
2010-10-18 19:24   ` [tip:sched/core] sched: Fix softirq time accounting tip-bot for Venkatesh Pallipadi
2010-10-05  0:03 ` [PATCH 2/8] Consolidate account_system_vtime extern declaration -v4 Venkatesh Pallipadi
2010-10-18 19:24   ` [tip:sched/core] sched: Consolidate account_system_vtime extern declaration tip-bot for Venkatesh Pallipadi
2010-10-18 19:27   ` [tip:sched/core] sched: Export account_system_vtime() tip-bot for Ingo Molnar
2010-10-05  0:03 ` [PATCH 3/8] Add a PF flag for ksoftirqd identification Venkatesh Pallipadi
2010-10-15 14:26   ` Peter Zijlstra
2010-10-15 14:46   ` Eric Dumazet
2010-10-18 19:25   ` [tip:sched/core] sched: " tip-bot for Venkatesh Pallipadi
2010-10-05  0:03 ` [PATCH 4/8] Add IRQ_TIME_ACCOUNTING, finer accounting of irq time -v4 Venkatesh Pallipadi
2010-10-15 14:28   ` Peter Zijlstra
2010-10-18 19:25   ` [tip:sched/core] sched: Add IRQ_TIME_ACCOUNTING, finer accounting of irq time tip-bot for Venkatesh Pallipadi
2010-10-05  0:03 ` [PATCH 5/8] x86: Add IRQ_TIME_ACCOUNTING in x86 -v4 Venkatesh Pallipadi
2010-10-15 14:38   ` Peter Zijlstra
2010-10-18 19:26   ` [tip:sched/core] x86: Add IRQ_TIME_ACCOUNTING tip-bot for Venkatesh Pallipadi
2010-10-05  0:03 ` [PATCH 6/8] sched: Do not account irq time to current task -v4 Venkatesh Pallipadi
2010-10-18 19:26   ` [tip:sched/core] sched: Do not account irq time to current task tip-bot for Venkatesh Pallipadi
2010-11-29  8:45     ` Yong Zhang
2010-11-29 11:59       ` Peter Zijlstra
2010-11-29 14:22         ` Yong Zhang
2010-11-29 17:06           ` Raistlin
2010-11-30  5:57             ` Yong Zhang
2010-12-01 18:55               ` Venkatesh Pallipadi
2010-12-01 19:16                 ` Peter Zijlstra
2010-10-05  0:03 ` [PATCH 7/8] sched: Remove irq time from available CPU power -v4 Venkatesh Pallipadi
2010-10-18 19:26   ` [tip:sched/core] sched: Remove irq time from available CPU power tip-bot for Venkatesh Pallipadi
2010-10-05  0:03 ` [PATCH 8/8] Call tick_check_idle before __irq_enter Venkatesh Pallipadi
2010-10-17  9:05   ` Yong Zhang
2010-10-18  9:15     ` Peter Zijlstra
2010-10-18 19:27   ` [tip:sched/core] sched: " tip-bot for Venkatesh Pallipadi
2010-10-12 19:00 ` Proper kernel irq time accounting -v4 Venkatesh Pallipadi
2010-10-14 16:12 ` Shaun Ruffell
2010-10-14 18:19   ` Venkatesh Pallipadi
2010-10-14 20:00     ` Shaun Ruffell
2010-10-15 15:11 ` Peter Zijlstra
2010-10-15 15:27   ` Peter Zijlstra
2010-10-15 17:13     ` Venkatesh Pallipadi
2010-10-15 17:20       ` Peter Zijlstra
2010-10-17  9:11       ` Yong Zhang
