* Proper kernel irq time accounting -v3
@ 2010-09-29 19:21 Venkatesh Pallipadi
2010-09-29 19:21 ` [PATCH 1/7] si time accounting accounts bh_disable'd time to si -v3 Venkatesh Pallipadi
` (7 more replies)
0 siblings, 8 replies; 29+ messages in thread
From: Venkatesh Pallipadi @ 2010-09-29 19:21 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
Balbir Singh, Martin Schwidefsky
Cc: linux-kernel, Paul Turner, Eric Dumazet
Previous versions:
-v0: http://lkml.indiana.edu/hypermail//linux/kernel/1005.3/00411.html
lkml subject - "Finer granularity and task/cgroup irq time accounting"
-v1: http://lkml.indiana.edu/hypermail//linux/kernel/1007.2/00987.html
lkml subject - "Finer granularity and task/cgroup irq time accounting"
-v2: http://lkml.indiana.edu/hypermail//linux/kernel/1009.2/00488.html
lkml subject - "Proper kernel irq time accounting"
Changes from -v2:
- Fix the bug where timers running while bh is disabled had their time
accounted as softirq.
- Change the implementation so that the scheduler does not account irq time
to the current task, using the rq->clock_task approach suggested by
Peter Zijlstra.
- General cleanup of the patches based on earlier feedback.
Description:
Here is some background on interrupt time accounting in the kernel and the
related problems.
Interrupts always run in the context of the currently running task. Softirqs
mostly run in the context of the currently running task as well, unless
ksoftirqd gets involved.
/proc/interrupts and /proc/softirqs are the interfaces that report the number
of interrupts and softirqs per CPU since boot. /proc/stat has fields that
report per-CPU and system-wide interrupt and softirq processing time in
clock_t units.
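For illustration, a minimal userspace sketch (not part of this patchset)
that reads those /proc/stat fields and converts them from clock_t units
to seconds:

/* Read system-wide hardirq/softirq time from the first line of
 * /proc/stat. Field order: user nice system idle iowait irq softirq.
 * Values are clock_t ticks; sysconf(_SC_CLK_TCK) gives ticks/second. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	unsigned long long u, n, s, id, io, irq, sirq;
	long hz = sysconf(_SC_CLK_TCK);
	FILE *f = fopen("/proc/stat", "r");

	if (!f)
		return 1;
	if (fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu",
		   &u, &n, &s, &id, &io, &irq, &sirq) == 7)
		printf("irq %.2fs, softirq %.2fs\n",
		       (double)irq / hz, (double)sirq / hz);
	fclose(f);
	return 0;
}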
There are two problems with the way interrupts are accounted by the kernel.
(1) Coarse grained interrupt time reporting
On most archs (except s390, powerpc, ia64 with CONFIG_VIRT_CPU_ACCOUNTING),
the interrupt and softirq time reported in /proc/stat is tick sample based.
The kernel looks at what it is doing at each CPU tick and accounts the entire
tick to that activity. This means the data in /proc/stat is
pretty coarse grained.
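As a rough model of the tick-based scheme (a sketch, not the actual kernel
code): each tick charges one whole tick to whatever the CPU is doing at that
instant, so activity that begins and ends between two ticks is invisible:

#include <stdio.h>

enum ctx { CTX_USER, CTX_SYSTEM, CTX_HARDIRQ, CTX_SOFTIRQ };
static unsigned long long cpustat[4];

/* One whole tick goes to the current context's bucket; nothing that
 * happens between ticks is ever seen. */
static void account_tick(enum ctx c)
{
	cpustat[c]++;
}

int main(void)
{
	/* 100 ticks land in user context even if short hardirqs ran
	 * between every pair of ticks. */
	for (int i = 0; i < 100; i++)
		account_tick(CTX_USER);
	printf("user=%llu hardirq=%llu\n",
	       cpustat[CTX_USER], cpustat[CTX_HARDIRQ]);
	return 0;
}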
A related problem (at least on x86): with the recent
"Run irq handlers with interrupts disabled" change, the timer interrupt cannot
fire while another interrupt is being serviced [1]. As a result, the sampling
based hardirq time in /proc/stat cannot capture any hardirq time at all.
(2) Accounting irq processing time to the current task/taskgroup
Whenever irq processing happens, the kernel accounts that time to the
currently running task. The exec_runtime reported in /proc/<pid>/schedstat and
<cgroup>/cpuacct.usage* includes any irq processing that happened while
the task was running, and the scheduler vruntime calculations also charge
any irq processing to the current task. This makes exec time accounting
during heavy irq processing essentially random: it depends on when and on
which CPU the irq processing happens and on what task happened to be
running on that CPU at the time.
The solution to (1) involves adding extra timing on irq entry/exit to
get the fine-granularity information and then exporting it to the user.
The following patchset addresses this problem in a way similar to [2][3].
It keeps most of the code that does the timing generic
(CONFIG_IRQ_TIME_ACCOUNTING), based off of sched_clock(), and adds support for
this on x86. The new fine-granularity time information is exported in
/proc/interrupts and /proc/softirqs as a reference implementation. Whether
it actually belongs there or somewhere else is open for discussion.
One partial solution proposed in [2][3] for (2) above was to capture this
interrupt time at the task/taskgroup level and report how much irq
processing time the kernel charged to each task/taskgroup. That solution,
however, did not stop the task timeslice itself from including irq processing.
Peter Zijlstra and Martin Schwidefsky disagreed with that approach and
wanted a more complete solution that does not account irq processing time
to tasks at all.
The patchset below attempts this more complete solution, with two scheduler
related changes. First, it takes irq processing time out of the time the
scheduler accounts to the task. Second, it adjusts the CPU
power to account for irq processing activity on the CPU. That in turn
results in an irq-busy CPU pulling only as many tasks as its remaining
non-irq processing bandwidth can handle.
The changes here are only enabled for CONFIG_IRQ_TIME_ACCOUNTING as of now.
Thanks,
Venki
References:
[1] http://lkml.indiana.edu/hypermail//linux/kernel/1005.3/00864.html
lkml subject - "genirq: Run irq handlers with interrupts disabled"
[2] http://lkml.indiana.edu/hypermail//linux/kernel/1005.3/00411.html
lkml subject - "Finer granularity and task/cgroup irq time accounting"
[3] http://lkml.indiana.edu/hypermail//linux/kernel/1007.2/00987.html
lkml subject - "Finer granularity and task/cgroup irq time accounting"
* [PATCH 1/7] si time accounting accounts bh_disable'd time to si -v3
2010-09-29 19:21 Proper kernel irq time accounting -v3 Venkatesh Pallipadi
@ 2010-09-29 19:21 ` Venkatesh Pallipadi
2010-09-30 11:04 ` Peter Zijlstra
2010-09-29 19:21 ` [PATCH 2/7] Consolidate account_system_vtime extern declaration -v3 Venkatesh Pallipadi
` (6 subsequent siblings)
7 siblings, 1 reply; 29+ messages in thread
From: Venkatesh Pallipadi @ 2010-09-29 19:21 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
Balbir Singh, Martin Schwidefsky
Cc: linux-kernel, Paul Turner, Eric Dumazet, Venkatesh Pallipadi
Peter Zijlstra found a bug in the way softirq time is accounted in
VIRT_CPU_ACCOUNTING in this thread:
http://lkml.indiana.edu/hypermail//linux/kernel/1009.2/01366.html
The problem is that softirq processing uses local_bh_disable internally. There
is then no way, later in the flow, to differentiate between a softirq actually
being processed and bh merely being disabled. So a hardirq arriving while bh
is disabled results in the time being wrongly accounted as softirq.
Looking at the code a bit more, the problem exists in !VIRT_CPU_ACCOUNTING
as well, since account_system_time() in normal tick-based accounting also uses
softirq_count, which will be set even when we are not in softirq but merely
have bh disabled.
Peter also suggested the solution of using 2 * SOFTIRQ_OFFSET as the irq count
for local_bh_{disable,enable} and just SOFTIRQ_OFFSET while processing
softirqs. The patch below does that and adds the API in_serving_softirq(),
which returns whether we are currently processing a softirq.
It also changes one of the usages of softirq_count in net/sched/cls_cgroup.c
to in_serving_softirq.
It looks like many usages of in_softirq really want in_serving_softirq. Those
changes can be made individually on a case-by-case basis.
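For illustration, a userspace model of the resulting preempt_count
arithmetic (the constants mirror the patch below; this is a sketch, not
kernel code):

#include <stdio.h>

#define SOFTIRQ_SHIFT		8
#define SOFTIRQ_OFFSET		(1UL << SOFTIRQ_SHIFT)
#define SOFTIRQ_MASK		(0xffUL << SOFTIRQ_SHIFT)
#define SOFTIRQ_DISABLE_OFFSET	(2 * SOFTIRQ_OFFSET)

static unsigned long preempt_count;

#define softirq_count()		(preempt_count & SOFTIRQ_MASK)
#define in_softirq()		(softirq_count())
#define in_serving_softirq()	(softirq_count() & SOFTIRQ_OFFSET)

int main(void)
{
	preempt_count += SOFTIRQ_DISABLE_OFFSET;  /* local_bh_disable() */
	printf("bh disabled: in_softirq=%d in_serving_softirq=%d\n",
	       !!in_softirq(), !!in_serving_softirq());

	preempt_count += SOFTIRQ_OFFSET;          /* __do_softirq() entry */
	printf("serving si:  in_softirq=%d in_serving_softirq=%d\n",
	       !!in_softirq(), !!in_serving_softirq());
	return 0;
}

With local_bh_disable contributing an even multiple of SOFTIRQ_OFFSET and
actual softirq service contributing an odd one, testing that single bit
cleanly identifies real softirq processing.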
Signed-off-by: Venkatesh Pallipadi <venki@google.com>
---
include/linux/hardirq.h | 5 ++++
include/linux/sched.h | 6 ++--
kernel/sched.c | 2 +-
kernel/softirq.c | 51 +++++++++++++++++++++++++++++++---------------
net/sched/cls_cgroup.c | 2 +-
5 files changed, 44 insertions(+), 22 deletions(-)
diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index d5b3876..e37a77c 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -64,6 +64,8 @@
#define HARDIRQ_OFFSET (1UL << HARDIRQ_SHIFT)
#define NMI_OFFSET (1UL << NMI_SHIFT)
+#define SOFTIRQ_DISABLE_OFFSET (2 * SOFTIRQ_OFFSET)
+
#ifndef PREEMPT_ACTIVE
#define PREEMPT_ACTIVE_BITS 1
#define PREEMPT_ACTIVE_SHIFT (NMI_SHIFT + NMI_BITS)
@@ -82,10 +84,13 @@
/*
* Are we doing bottom half or hardware interrupt processing?
* Are we in a softirq context? Interrupt context?
+ * in_softirq - Are we currently processing softirq or have bh disabled?
+ * in_serving_softirq - Are we currently processing softirq?
*/
#define in_irq() (hardirq_count())
#define in_softirq() (softirq_count())
#define in_interrupt() (irq_count())
+#define in_serving_softirq() (softirq_count() & SOFTIRQ_OFFSET)
/*
* Are we in NMI context?
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1e2a6db..126457e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2367,9 +2367,9 @@ extern int __cond_resched_lock(spinlock_t *lock);
extern int __cond_resched_softirq(void);
-#define cond_resched_softirq() ({ \
- __might_sleep(__FILE__, __LINE__, SOFTIRQ_OFFSET); \
- __cond_resched_softirq(); \
+#define cond_resched_softirq() ({ \
+ __might_sleep(__FILE__, __LINE__, SOFTIRQ_DISABLE_OFFSET); \
+ __cond_resched_softirq(); \
})
/*
diff --git a/kernel/sched.c b/kernel/sched.c
index dc85ceb..b6e714b 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3397,7 +3397,7 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
tmp = cputime_to_cputime64(cputime);
if (hardirq_count() - hardirq_offset)
cpustat->irq = cputime64_add(cpustat->irq, tmp);
- else if (softirq_count())
+ else if (in_serving_softirq())
cpustat->softirq = cputime64_add(cpustat->softirq, tmp);
else
cpustat->system = cputime64_add(cpustat->system, tmp);
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 07b4f1b..988dfbe 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -77,11 +77,21 @@ void wakeup_softirqd(void)
}
/*
+ * preempt_count and SOFTIRQ_OFFSET usage:
+ * - preempt_count is changed by SOFTIRQ_OFFSET on entering or leaving
+ * softirq processing.
+ * - preempt_count is changed by SOFTIRQ_DISABLE_OFFSET (= 2 * SOFTIRQ_OFFSET)
+ * on local_bh_disable or local_bh_enable.
+ * This lets us distinguish between whether we are currently processing
+ * softirq and whether we just have bh disabled.
+ */
+
+/*
* This one is for softirq.c-internal use,
* where hardirqs are disabled legitimately:
*/
#ifdef CONFIG_TRACE_IRQFLAGS
-static void __local_bh_disable(unsigned long ip)
+static void __local_bh_disable(unsigned long ip, unsigned int cnt)
{
unsigned long flags;
@@ -95,32 +105,43 @@ static void __local_bh_disable(unsigned long ip)
* We must manually increment preempt_count here and manually
* call the trace_preempt_off later.
*/
- preempt_count() += SOFTIRQ_OFFSET;
+ preempt_count() += cnt;
/*
* Were softirqs turned off above:
*/
- if (softirq_count() == SOFTIRQ_OFFSET)
+ if (softirq_count() == cnt)
trace_softirqs_off(ip);
raw_local_irq_restore(flags);
- if (preempt_count() == SOFTIRQ_OFFSET)
+ if (preempt_count() == cnt)
trace_preempt_off(CALLER_ADDR0, get_parent_ip(CALLER_ADDR1));
}
#else /* !CONFIG_TRACE_IRQFLAGS */
-static inline void __local_bh_disable(unsigned long ip)
+static inline void __local_bh_disable(unsigned long ip, unsigned int cnt)
{
- add_preempt_count(SOFTIRQ_OFFSET);
+ add_preempt_count(cnt);
barrier();
}
#endif /* CONFIG_TRACE_IRQFLAGS */
void local_bh_disable(void)
{
- __local_bh_disable((unsigned long)__builtin_return_address(0));
+ __local_bh_disable((unsigned long)__builtin_return_address(0),
+ SOFTIRQ_DISABLE_OFFSET);
}
EXPORT_SYMBOL(local_bh_disable);
+static void __local_bh_enable(unsigned int cnt)
+{
+ WARN_ON_ONCE(in_irq());
+ WARN_ON_ONCE(!irqs_disabled());
+
+ if (softirq_count() == cnt)
+ trace_softirqs_on((unsigned long)__builtin_return_address(0));
+ sub_preempt_count(cnt);
+}
+
/*
* Special-case - softirqs can safely be enabled in
* cond_resched_softirq(), or by __do_softirq(),
@@ -128,12 +149,7 @@ EXPORT_SYMBOL(local_bh_disable);
*/
void _local_bh_enable(void)
{
- WARN_ON_ONCE(in_irq());
- WARN_ON_ONCE(!irqs_disabled());
-
- if (softirq_count() == SOFTIRQ_OFFSET)
- trace_softirqs_on((unsigned long)__builtin_return_address(0));
- sub_preempt_count(SOFTIRQ_OFFSET);
+ __local_bh_enable(SOFTIRQ_DISABLE_OFFSET);
}
EXPORT_SYMBOL(_local_bh_enable);
@@ -147,13 +163,13 @@ static inline void _local_bh_enable_ip(unsigned long ip)
/*
* Are softirqs going to be turned on now:
*/
- if (softirq_count() == SOFTIRQ_OFFSET)
+ if (softirq_count() == SOFTIRQ_DISABLE_OFFSET)
trace_softirqs_on(ip);
/*
* Keep preemption disabled until we are done with
* softirq processing:
*/
- sub_preempt_count(SOFTIRQ_OFFSET - 1);
+ sub_preempt_count(SOFTIRQ_DISABLE_OFFSET - 1);
if (unlikely(!in_interrupt() && local_softirq_pending()))
do_softirq();
@@ -198,7 +214,8 @@ asmlinkage void __do_softirq(void)
pending = local_softirq_pending();
account_system_vtime(current);
- __local_bh_disable((unsigned long)__builtin_return_address(0));
+ __local_bh_disable((unsigned long)__builtin_return_address(0),
+ SOFTIRQ_OFFSET);
lockdep_softirq_enter();
cpu = smp_processor_id();
@@ -245,7 +262,7 @@ restart:
lockdep_softirq_exit();
account_system_vtime(current);
- _local_bh_enable();
+ __local_bh_enable(SOFTIRQ_OFFSET);
}
#ifndef __ARCH_HAS_DO_SOFTIRQ
diff --git a/net/sched/cls_cgroup.c b/net/sched/cls_cgroup.c
index 78ef2c5..37dff78 100644
--- a/net/sched/cls_cgroup.c
+++ b/net/sched/cls_cgroup.c
@@ -123,7 +123,7 @@ static int cls_cgroup_classify(struct sk_buff *skb, struct tcf_proto *tp,
* calls by looking at the number of nested bh disable calls because
* softirqs always disables bh.
*/
- if (softirq_count() != SOFTIRQ_OFFSET) {
+ if (in_serving_softirq()) {
/* If there is an sk_classid we'll use that. */
if (!skb->sk)
return -1;
--
1.7.1
* [PATCH 2/7] Consolidate account_system_vtime extern declaration -v3
2010-09-29 19:21 Proper kernel irq time accounting -v3 Venkatesh Pallipadi
2010-09-29 19:21 ` [PATCH 1/7] si time accounting accounts bh_disable'd time to si -v3 Venkatesh Pallipadi
@ 2010-09-29 19:21 ` Venkatesh Pallipadi
2010-09-29 19:21 ` [PATCH 3/7] Add IRQ_TIME_ACCOUNTING, finer accounting of irq time -v3 Venkatesh Pallipadi
` (5 subsequent siblings)
7 siblings, 0 replies; 29+ messages in thread
From: Venkatesh Pallipadi @ 2010-09-29 19:21 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
Balbir Singh, Martin Schwidefsky
Cc: linux-kernel, Paul Turner, Eric Dumazet, Venkatesh Pallipadi
Just a minor cleanup patch that makes things easier for the following patches.
No functional change in this patch.
Signed-off-by: Venkatesh Pallipadi <venki@google.com>
---
arch/ia64/include/asm/system.h | 4 ----
arch/powerpc/include/asm/system.h | 4 ----
arch/s390/include/asm/system.h | 1 -
include/linux/hardirq.h | 2 ++
4 files changed, 2 insertions(+), 9 deletions(-)
diff --git a/arch/ia64/include/asm/system.h b/arch/ia64/include/asm/system.h
index 9f342a5..dd028f2 100644
--- a/arch/ia64/include/asm/system.h
+++ b/arch/ia64/include/asm/system.h
@@ -272,10 +272,6 @@ void cpu_idle_wait(void);
void default_idle(void);
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING
-extern void account_system_vtime(struct task_struct *);
-#endif
-
#endif /* __KERNEL__ */
#endif /* __ASSEMBLY__ */
diff --git a/arch/powerpc/include/asm/system.h b/arch/powerpc/include/asm/system.h
index 6c294ac..9c3d160 100644
--- a/arch/powerpc/include/asm/system.h
+++ b/arch/powerpc/include/asm/system.h
@@ -542,10 +542,6 @@ extern void reloc_got2(unsigned long);
#define PTRRELOC(x) ((typeof(x)) add_reloc_offset((unsigned long)(x)))
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING
-extern void account_system_vtime(struct task_struct *);
-#endif
-
extern struct dentry *powerpc_debugfs_root;
#endif /* __KERNEL__ */
diff --git a/arch/s390/include/asm/system.h b/arch/s390/include/asm/system.h
index cef6621..38ddd8a 100644
--- a/arch/s390/include/asm/system.h
+++ b/arch/s390/include/asm/system.h
@@ -97,7 +97,6 @@ static inline void restore_access_regs(unsigned int *acrs)
extern void account_vtime(struct task_struct *, struct task_struct *);
extern void account_tick_vtime(struct task_struct *);
-extern void account_system_vtime(struct task_struct *);
#ifdef CONFIG_PFAULT
extern void pfault_irq_init(void);
diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index e37a77c..41367c5 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -141,6 +141,8 @@ struct task_struct;
static inline void account_system_vtime(struct task_struct *tsk)
{
}
+#else
+extern void account_system_vtime(struct task_struct *tsk);
#endif
#if defined(CONFIG_NO_HZ)
--
1.7.1
* [PATCH 3/7] Add IRQ_TIME_ACCOUNTING, finer accounting of irq time -v3
2010-09-29 19:21 Proper kernel irq time accounting -v3 Venkatesh Pallipadi
2010-09-29 19:21 ` [PATCH 1/7] si time accounting accounts bh_disable'd time to si -v3 Venkatesh Pallipadi
2010-09-29 19:21 ` [PATCH 2/7] Consolidate account_system_vtime extern declaration -v3 Venkatesh Pallipadi
@ 2010-09-29 19:21 ` Venkatesh Pallipadi
2010-09-30 11:06 ` Peter Zijlstra
2010-09-29 19:21 ` [PATCH 4/7] x86: Add IRQ_TIME_ACCOUNTING in x86 -v3 Venkatesh Pallipadi
` (4 subsequent siblings)
7 siblings, 1 reply; 29+ messages in thread
From: Venkatesh Pallipadi @ 2010-09-29 19:21 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
Balbir Singh, Martin Schwidefsky
Cc: linux-kernel, Paul Turner, Eric Dumazet, Venkatesh Pallipadi
s390/powerpc/ia64 have support for CONFIG_VIRT_CPU_ACCOUNTING, which does
fine-granularity accounting of user, system, hardirq and softirq times.
Adding that option on archs like x86 will be challenging, however, given the
state of TSC reliability on various platforms and the overhead it would
add to syscall entry/exit.
Instead, add a lighter variant that only does finer accounting of
hardirq and softirq times, providing precise irq times (instead of timer-tick
based samples). This accounting is added under a new config option,
CONFIG_IRQ_TIME_ACCOUNTING, so that there won't be any overhead for users not
interested in paying the perf penalty.
This accounting is based on sched_clock, and the code is generic,
so other archs may find it useful as well.
Note that the kstat_cpu irq times (and hence /proc/stat) are still based on
tick samples. The reason is that kstat also includes system time, and making
only the irq times there fine grained could result in inconsistencies, such
as the kstat times summing to more than 100%.
This patch just adds the core logic and does not enable it yet.
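To illustrate the core logic (a userspace sketch: a fake clock stands in for
sched_clock() and plain flags stand in for hardirq_count() and
in_serving_softirq()): each irq/softirq transition calls the hook, which
banks the time since the previous call into the bucket for the context it
is currently in:

#include <stdio.h>

static unsigned long long fake_clock;	/* stands in for sched_clock() */
static unsigned long long irq_start_time;
static unsigned long long cpu_hardirq_time, cpu_softirq_time;
static int in_hardirq, serving_softirq;

static void account_system_vtime_model(void)
{
	unsigned long long now = fake_clock;
	unsigned long long delta = now - irq_start_time;

	irq_start_time = now;
	if (in_hardirq)			/* hardirq_count() */
		cpu_hardirq_time += delta;
	else if (serving_softirq)	/* in_serving_softirq() */
		cpu_softirq_time += delta;
	/* else: plain task/system time, left to the tick path */
}

int main(void)
{
	fake_clock = 1000;
	account_system_vtime_model();	/* irq entry: closes task window */
	in_hardirq = 1;
	fake_clock = 1600;		/* handler ran for 600 ns */
	account_system_vtime_model();	/* irq exit: banks hardirq time */
	in_hardirq = 0;
	printf("hardirq=%llu ns, softirq=%llu ns\n",
	       cpu_hardirq_time, cpu_softirq_time);
	return 0;
}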
Signed-off-by: Venkatesh Pallipadi <venki@google.com>
---
include/linux/hardirq.h | 2 +-
include/linux/sched.h | 11 +++++++++++
kernel/sched.c | 38 ++++++++++++++++++++++++++++++++++++++
3 files changed, 50 insertions(+), 1 deletions(-)
diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index 41367c5..ff43e92 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -137,7 +137,7 @@ extern void synchronize_irq(unsigned int irq);
struct task_struct;
-#ifndef CONFIG_VIRT_CPU_ACCOUNTING
+#if !defined(CONFIG_VIRT_CPU_ACCOUNTING) && !defined(CONFIG_IRQ_TIME_ACCOUNTING)
static inline void account_system_vtime(struct task_struct *tsk)
{
}
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 126457e..8adf166 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1826,6 +1826,17 @@ extern void sched_clock_idle_sleep_event(void);
extern void sched_clock_idle_wakeup_event(u64 delta_ns);
#endif
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+/*
+ * An i/f to runtime opt-in for irq time accounting based off of sched_clock.
+ * The reason for this explicit opt-in is not to have perf penalty with
+ * slow sched_clocks.
+ */
+extern void enable_sched_clock_irqtime(void);
+#else
+static inline void enable_sched_clock_irqtime(void) {}
+#endif
+
extern unsigned long long
task_sched_runtime(struct task_struct *task);
extern unsigned long long thread_group_sched_runtime(struct task_struct *task);
diff --git a/kernel/sched.c b/kernel/sched.c
index b6e714b..bc2581e 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1917,6 +1917,44 @@ static void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
dec_nr_running(rq);
}
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+
+static DEFINE_PER_CPU(u64, cpu_hardirq_time);
+static DEFINE_PER_CPU(u64, cpu_softirq_time);
+
+static DEFINE_PER_CPU(u64, irq_start_time);
+static int sched_clock_irqtime;
+
+void enable_sched_clock_irqtime(void)
+{
+ sched_clock_irqtime = 1;
+}
+
+void account_system_vtime(struct task_struct *curr)
+{
+ unsigned long flags;
+ int cpu;
+ u64 now, delta;
+
+ if (!sched_clock_irqtime)
+ return;
+
+ local_irq_save(flags);
+
+ now = sched_clock();
+ cpu = smp_processor_id();
+ delta = now - per_cpu(irq_start_time, cpu);
+ per_cpu(irq_start_time, cpu) = now;
+ if (hardirq_count())
+ per_cpu(cpu_hardirq_time, cpu) += delta;
+ else if (in_serving_softirq())
+ per_cpu(cpu_softirq_time, cpu) += delta;
+
+ local_irq_restore(flags);
+}
+
+#endif
+
#include "sched_idletask.c"
#include "sched_fair.c"
#include "sched_rt.c"
--
1.7.1
* [PATCH 4/7] x86: Add IRQ_TIME_ACCOUNTING in x86 -v3
2010-09-29 19:21 Proper kernel irq time accounting -v3 Venkatesh Pallipadi
` (2 preceding siblings ...)
2010-09-29 19:21 ` [PATCH 3/7] Add IRQ_TIME_ACCOUNTING, finer accounting of irq time -v3 Venkatesh Pallipadi
@ 2010-09-29 19:21 ` Venkatesh Pallipadi
2010-09-29 19:21 ` [PATCH 5/7] sched: Do not account irq time to current task -v3 Venkatesh Pallipadi
` (3 subsequent siblings)
7 siblings, 0 replies; 29+ messages in thread
From: Venkatesh Pallipadi @ 2010-09-29 19:21 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
Balbir Singh, Martin Schwidefsky
Cc: linux-kernel, Paul Turner, Eric Dumazet, Venkatesh Pallipadi
This patch adds the IRQ_TIME_ACCOUNTING option on x86 and enables it at
runtime when the TSC is enabled.
This change only enables fine-grained irq time accounting; nothing uses it
yet. The following patches use it for different purposes.
Signed-off-by: Venkatesh Pallipadi <venki@google.com>
---
arch/x86/Kconfig | 11 +++++++++++
arch/x86/kernel/tsc.c | 2 ++
2 files changed, 13 insertions(+), 0 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index cea0cd9..f4c70c2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -795,6 +795,17 @@ config SCHED_MC
making when dealing with multi-core CPU chips at a cost of slightly
increased overhead in some places. If unsure say N here.
+config IRQ_TIME_ACCOUNTING
+ bool "Fine granularity task level IRQ time accounting"
+ default n
+ ---help---
+ Select this option to enable fine granularity task irq time
+ accounting. This is done by reading a timestamp on each
+ transition between softirq and hardirq state, so there can be a
+ small performance impact.
+
+ If in doubt, say N here.
+
source "kernel/Kconfig.preempt"
config X86_UP_APIC
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 26a863a..110d815 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -987,6 +987,8 @@ void __init tsc_init(void)
/* now allow native_sched_clock() to use rdtsc */
tsc_disabled = 0;
+ enable_sched_clock_irqtime();
+
lpj = ((u64)tsc_khz * 1000);
do_div(lpj, HZ);
lpj_fine = lpj;
--
1.7.1
* [PATCH 5/7] sched: Do not account irq time to current task -v3
2010-09-29 19:21 Proper kernel irq time accounting -v3 Venkatesh Pallipadi
` (3 preceding siblings ...)
2010-09-29 19:21 ` [PATCH 4/7] x86: Add IRQ_TIME_ACCOUNTING in x86 -v3 Venkatesh Pallipadi
@ 2010-09-29 19:21 ` Venkatesh Pallipadi
2010-09-29 19:21 ` [PATCH 6/7] sched: Remove irq time from available CPU power -v3 Venkatesh Pallipadi
` (2 subsequent siblings)
7 siblings, 0 replies; 29+ messages in thread
From: Venkatesh Pallipadi @ 2010-09-29 19:21 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
Balbir Singh, Martin Schwidefsky
Cc: linux-kernel, Paul Turner, Eric Dumazet, Venkatesh Pallipadi
The scheduler accounts both softirq and interrupt processing time to the
currently running task. This means that if the interrupt processing was
done on behalf of some other task in the system, the current task ends up
being penalized and gets a shorter runtime than it otherwise would.
Change scheduler task accounting to account only actual task time to the
currently running task: update_curr() now derives delta_exec from
rq->clock_task.
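As a minimal sketch of the rq->clock_task update done in update_rq_clock()
(the helper name below is illustrative): clock_task shadows rq->clock minus
the accumulated irq time and only ever moves forward, so a racy irq_time
read can never make task time go backwards:

#include <stdio.h>

/* clock_task advances only when (clock - irq_time) advances; a
 * momentarily inflated irq_time just stalls it, never rewinds it. */
static void update_clock_task(unsigned long long *clock_task,
			      unsigned long long clock,
			      unsigned long long irq_time)
{
	if (clock - irq_time > *clock_task)
		*clock_task = clock - irq_time;
}

int main(void)
{
	unsigned long long clock_task = 0;

	update_clock_task(&clock_task, 1000, 200);	/* -> 800 */
	update_clock_task(&clock_task, 1100, 450);	/* 650 < 800: held */
	printf("clock_task=%llu\n", clock_task);	/* prints 800 */
	return 0;
}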
Note that this change only handles the CONFIG_IRQ_TIME_ACCOUNTING case. We can
extend this to CONFIG_VIRT_CPU_ACCOUNTING with minimal effort, but that is
left for later.
This change will impact scheduling behavior in interrupt-heavy conditions.
Tested on a 4-way system with eth0 handled by CPU 2 and a network-heavy
task (nc) running on CPU 3 (and no RSS/RFS). With that, CPU 2
spends 75%+ of its time in irq processing and CPU 3 spends around 35% of its
time running the nc task.
Now, if I run another CPU-intensive task on CPU 2, without this change
/proc/<pid>/schedstat shows 100% of the time accounted to this task. With this
change, it rightly shows less than 25% accounted to this task, as the
remaining time is actually spent on irq processing.
Signed-off-by: Venkatesh Pallipadi <venki@google.com>
---
kernel/sched.c | 43 ++++++++++++++++++++++++++++++++++++++++---
kernel/sched_fair.c | 6 +++---
kernel/sched_rt.c | 8 ++++----
3 files changed, 47 insertions(+), 10 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index bc2581e..771bfa9 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -493,6 +493,7 @@ struct rq {
struct mm_struct *prev_mm;
u64 clock;
+ u64 clock_task;
atomic_t nr_iowait;
@@ -643,10 +644,19 @@ static inline struct task_group *task_group(struct task_struct *p)
#endif /* CONFIG_CGROUP_SCHED */
+static u64 irq_time_cpu(int cpu);
+
inline void update_rq_clock(struct rq *rq)
{
- if (!rq->skip_clock_update)
- rq->clock = sched_clock_cpu(cpu_of(rq));
+ if (!rq->skip_clock_update) {
+ int cpu = cpu_of(rq);
+ u64 irq_time;
+
+ rq->clock = sched_clock_cpu(cpu);
+ irq_time = irq_time_cpu(cpu);
+ if (rq->clock - irq_time > rq->clock_task)
+ rq->clock_task = rq->clock - irq_time;
+ }
}
/*
@@ -1919,6 +1929,18 @@ static void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+/*
+ * There are no locks covering percpu hardirq/softirq time.
+ * They are only modified in account_system_vtime, on corresponding CPU
+ * with interrupts disabled. So, writes are safe.
+ * They are read and saved off onto struct rq in update_rq_clock().
+ * This may result in other CPU reading this CPU's irq time and can
+ * race with irq/account_system_vtime on this CPU. We would either get old
+ * or new value (or semi updated value on 32 bit) with a side effect of
+ * accounting a slice of irq time to wrong task when irq is in progress
+ * while we read rq->clock. That is a worthy compromise in place of having
+ * locks on each irq in account_system_time.
+ */
static DEFINE_PER_CPU(u64, cpu_hardirq_time);
static DEFINE_PER_CPU(u64, cpu_softirq_time);
@@ -1930,6 +1952,14 @@ void enable_sched_clock_irqtime(void)
sched_clock_irqtime = 1;
}
+static u64 irq_time_cpu(int cpu)
+{
+ if (!sched_clock_irqtime)
+ return 0;
+
+ return per_cpu(cpu_softirq_time, cpu) + per_cpu(cpu_hardirq_time, cpu);
+}
+
void account_system_vtime(struct task_struct *curr)
{
unsigned long flags;
@@ -1953,6 +1983,13 @@ void account_system_vtime(struct task_struct *curr)
local_irq_restore(flags);
}
+#else
+
+static u64 irq_time_cpu(int cpu)
+{
+ return 0;
+}
+
#endif
#include "sched_idletask.c"
@@ -3286,7 +3323,7 @@ static u64 do_task_delta_exec(struct task_struct *p, struct rq *rq)
if (task_current(rq, p)) {
update_rq_clock(rq);
- ns = rq->clock - p->se.exec_start;
+ ns = rq->clock_task - p->se.exec_start;
if ((s64)ns < 0)
ns = 0;
}
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index db3f674..0baa696 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -519,7 +519,7 @@ __update_curr(struct cfs_rq *cfs_rq, struct sched_entity *curr,
static void update_curr(struct cfs_rq *cfs_rq)
{
struct sched_entity *curr = cfs_rq->curr;
- u64 now = rq_of(cfs_rq)->clock;
+ u64 now = rq_of(cfs_rq)->clock_task;
unsigned long delta_exec;
if (unlikely(!curr))
@@ -602,7 +602,7 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
/*
* We are starting a new run period:
*/
- se->exec_start = rq_of(cfs_rq)->clock;
+ se->exec_start = rq_of(cfs_rq)->clock_task;
}
/**************************************************
@@ -1798,7 +1798,7 @@ int can_migrate_task(struct task_struct *p, struct rq *rq, int this_cpu,
* 2) too many balance attempts have failed.
*/
- tsk_cache_hot = task_hot(p, rq->clock, sd);
+ tsk_cache_hot = task_hot(p, rq->clock_task, sd);
if (!tsk_cache_hot ||
sd->nr_balance_failed > sd->cache_nice_tries) {
#ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index d10c80e..bb4027a 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -609,7 +609,7 @@ static void update_curr_rt(struct rq *rq)
if (!task_has_rt_policy(curr))
return;
- delta_exec = rq->clock - curr->se.exec_start;
+ delta_exec = rq->clock_task - curr->se.exec_start;
if (unlikely((s64)delta_exec < 0))
delta_exec = 0;
@@ -618,7 +618,7 @@ static void update_curr_rt(struct rq *rq)
curr->se.sum_exec_runtime += delta_exec;
account_group_exec_runtime(curr, delta_exec);
- curr->se.exec_start = rq->clock;
+ curr->se.exec_start = rq->clock_task;
cpuacct_charge(curr, delta_exec);
sched_rt_avg_update(rq, delta_exec);
@@ -1074,7 +1074,7 @@ static struct task_struct *_pick_next_task_rt(struct rq *rq)
} while (rt_rq);
p = rt_task_of(rt_se);
- p->se.exec_start = rq->clock;
+ p->se.exec_start = rq->clock_task;
return p;
}
@@ -1709,7 +1709,7 @@ static void set_curr_task_rt(struct rq *rq)
{
struct task_struct *p = rq->curr;
- p->se.exec_start = rq->clock;
+ p->se.exec_start = rq->clock_task;
/* The running task is never eligible for pushing */
dequeue_pushable_task(rq, p);
--
1.7.1
* [PATCH 6/7] sched: Remove irq time from available CPU power -v3
2010-09-29 19:21 Proper kernel irq time accounting -v3 Venkatesh Pallipadi
` (4 preceding siblings ...)
2010-09-29 19:21 ` [PATCH 5/7] sched: Do not account irq time to current task -v3 Venkatesh Pallipadi
@ 2010-09-29 19:21 ` Venkatesh Pallipadi
2010-09-29 19:21 ` [PATCH 7/7] Export per cpu hardirq and softirq time in proc -v3 Venkatesh Pallipadi
2010-09-30 7:59 ` Proper kernel irq time accounting -v3 Andi Kleen
7 siblings, 0 replies; 29+ messages in thread
From: Venkatesh Pallipadi @ 2010-09-29 19:21 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
Balbir Singh, Martin Schwidefsky
Cc: linux-kernel, Paul Turner, Eric Dumazet, Venkatesh Pallipadi
The idea was suggested by Peter Zijlstra here:
http://marc.info/?l=linux-kernel&m=127476934517534&w=2
irq time is technically not available to the tasks running on the CPU.
This patch removes irq time from the available CPU power, piggybacking on
sched_rt_avg_update().
Tested by keeping CPU X busy with a network-intensive task generating 75%
of a single CPU's worth of irq processing (hard+soft) on a 4-way system, and
then starting seven cycle soakers on the system. Without this change, there
would be two tasks on each CPU. With this change, there is a single task on
the irq-busy CPU X and the remaining seven tasks are spread among the other
three CPUs.
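As a rough model of that power scaling (an illustrative sketch of
scale_rt_power(), including the underflow guard added below): the CPU ends
up advertising the fraction of the averaging window not consumed by rt/irq
time, in SCHED_LOAD_SCALE units:

#include <stdio.h>

#define SCHED_LOAD_SHIFT	10
#define SCHED_LOAD_SCALE	(1UL << SCHED_LOAD_SHIFT)	/* 1024 */

static unsigned long scale_rt_power_model(unsigned long long total,
					  unsigned long long rt_avg)
{
	unsigned long long available;

	if (total < rt_avg)	/* guard: power must not go negative */
		available = 0;
	else
		available = total - rt_avg;

	if (total < SCHED_LOAD_SCALE)
		total = SCHED_LOAD_SCALE;
	total >>= SCHED_LOAD_SHIFT;

	return (unsigned long)(available / total);
}

int main(void)
{
	/* 75% of the window spent in irq -> ~25% advertised power */
	printf("scaled power = %lu / %lu\n",
	       scale_rt_power_model(1000000, 750000), SCHED_LOAD_SCALE);
	return 0;
}

With 75% of the window eaten by irq time, the scaled power comes out near
256/1024, which is what lets the load balancer place just one soaker there.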
Signed-off-by: Venkatesh Pallipadi <venki@google.com>
---
kernel/sched.c | 18 ++++++++++++++++++
kernel/sched_fair.c | 8 +++++++-
kernel/sched_features.h | 5 +++++
3 files changed, 30 insertions(+), 1 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 771bfa9..bfbe064 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -521,6 +521,10 @@ struct rq {
u64 avg_idle;
#endif
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+ u64 prev_irq_time;
+#endif
+
/* calc_load related fields */
unsigned long calc_load_update;
long calc_load_active;
@@ -645,6 +649,7 @@ static inline struct task_group *task_group(struct task_struct *p)
#endif /* CONFIG_CGROUP_SCHED */
static u64 irq_time_cpu(int cpu);
+static void sched_irq_time_avg_update(struct rq *rq, u64 irq_time);
inline void update_rq_clock(struct rq *rq)
{
@@ -656,6 +661,8 @@ inline void update_rq_clock(struct rq *rq)
irq_time = irq_time_cpu(cpu);
if (rq->clock - irq_time > rq->clock_task)
rq->clock_task = rq->clock - irq_time;
+
+ sched_irq_time_avg_update(rq, irq_time);
}
}
@@ -1983,6 +1990,15 @@ void account_system_vtime(struct task_struct *curr)
local_irq_restore(flags);
}
+static void sched_irq_time_avg_update(struct rq *rq, u64 curr_irq_time)
+{
+ if (sched_clock_irqtime && sched_feat(NONIRQ_POWER)) {
+ u64 delta_irq = curr_irq_time - rq->prev_irq_time;
+ rq->prev_irq_time = curr_irq_time;
+ sched_rt_avg_update(rq, delta_irq);
+ }
+}
+
#else
static u64 irq_time_cpu(int cpu)
@@ -1990,6 +2006,8 @@ static u64 irq_time_cpu(int cpu)
return 0;
}
+static void sched_irq_time_avg_update(struct rq *rq, u64 curr_irq_time) { }
+
#endif
#include "sched_idletask.c"
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 0baa696..6d0362a 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2268,7 +2268,13 @@ unsigned long scale_rt_power(int cpu)
u64 total, available;
total = sched_avg_period() + (rq->clock - rq->age_stamp);
- available = total - rq->rt_avg;
+
+ if (unlikely(total < rq->rt_avg)) {
+ /* Ensures that power won't end up being negative */
+ available = 0;
+ } else {
+ available = total - rq->rt_avg;
+ }
if (unlikely((s64)total < SCHED_LOAD_SCALE))
total = SCHED_LOAD_SCALE;
diff --git a/kernel/sched_features.h b/kernel/sched_features.h
index 83c66e8..185f920 100644
--- a/kernel/sched_features.h
+++ b/kernel/sched_features.h
@@ -61,3 +61,8 @@ SCHED_FEAT(ASYM_EFF_LOAD, 1)
* release the lock. Decreases scheduling overhead.
*/
SCHED_FEAT(OWNER_SPIN, 1)
+
+/*
+ * Decrement CPU power based on irq activity
+ */
+SCHED_FEAT(NONIRQ_POWER, 1)
--
1.7.1
* [PATCH 7/7] Export per cpu hardirq and softirq time in proc -v3
2010-09-29 19:21 Proper kernel irq time accounting -v3 Venkatesh Pallipadi
` (5 preceding siblings ...)
2010-09-29 19:21 ` [PATCH 6/7] sched: Remove irq time from available CPU power -v3 Venkatesh Pallipadi
@ 2010-09-29 19:21 ` Venkatesh Pallipadi
2010-09-30 7:59 ` Proper kernel irq time accounting -v3 Andi Kleen
7 siblings, 0 replies; 29+ messages in thread
From: Venkatesh Pallipadi @ 2010-09-29 19:21 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
Balbir Singh, Martin Schwidefsky
Cc: linux-kernel, Paul Turner, Eric Dumazet, Venkatesh Pallipadi
I can predict this change being debated.
There is already per-CPU and system-level irq time in /proc/stat, which
on archs like x86 is based on sampled data. Earlier patches in this series
add a fine-grained irq time option for such archs. Exporting this
fine-grained irq time to userspace seems helpful.
How should it be exported though? I considered:
(1) Changing the currently exported info in /proc/stat to directly use the
new fine-grained irq times. Doing that, though, will likely break the summed
view for the user, as user/system and the other times there are still sample
based while only the irq time would be fine grained. So, the user would
almost always see the times sum to != 100% in top etc.
(2) Changing the currently exported info in /proc/stat to indirectly use the
new fine-grained irq times, by still doing the cpustat updates on ticks,
but looking at the fine-grained stats to decide whether the time should
be hardirq/softirq/user/system/idle. Doing that would mean a lot of code
churn in the kernel/sched.c:account_*_tick code, which is already somewhat
complicated.
(3) Add a new interface in /proc. This implies an additional file read,
buffer allocation, etc., which I want to avoid if possible.
(4) Don't export this info at all. I am OK with this as an alternative, but
I needed this exported somewhere for my testing at least.
(5) Piggyback on /proc/interrupts and /proc/softirqs. Assuming users
interested in this kind of info are already looking at those files, we won't
have the overhead of an additional file read. There is still some likelihood
of breaking apps that expect only interrupt counts in those files.
So, here is the patch that does (5).
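For reference, a hypothetical consumer of the new "Time:" line in
/proc/softirqs might look like this (values are in USER_HZ units, one per
possible CPU, as in the patch below):

#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char line[4096];
	long hz = sysconf(_SC_CLK_TCK);
	FILE *f = fopen("/proc/softirqs", "r");

	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f)) {
		char *p = strstr(line, "Time:");
		unsigned long t;
		int cpu = 0, n;

		if (!p)
			continue;
		/* walk the per-CPU clock_t values after "Time:" */
		for (p += 5; sscanf(p, " %lu%n", &t, &n) == 1; p += n)
			printf("cpu%d softirq %.2fs\n",
			       cpu++, (double)t / hz);
	}
	fclose(f);
	return 0;
}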
Signed-off-by: Venkatesh Pallipadi <venki@google.com>
---
Documentation/filesystems/proc.txt | 9 +++++++++
fs/proc/interrupts.c | 11 ++++++++++-
fs/proc/softirqs.c | 8 ++++++++
include/linux/sched.h | 3 +++
kernel/sched.c | 27 +++++++++++++++++++++++++++
5 files changed, 57 insertions(+), 1 deletions(-)
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index a6aca87..4456011 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -536,6 +536,11 @@ the threshold vector does not exist on x86_64 platforms. Others are
suppressed when the system is a uniprocessor. As of this writing, only
i386 and x86_64 platforms support the new IRQ vector displays.
+Another addition to /proc/interrupts is a "Time:" line at the end, which
+displays the time spent by each CPU processing interrupts, in USER_HZ units.
+This time is based on fine-grained accounting when CONFIG_VIRT_CPU_ACCOUNTING
+or CONFIG_IRQ_TIME_ACCOUNTING is active; otherwise it is tick-sample based.
+
Of some interest is the introduction of the /proc/irq directory to 2.4.
It could be used to set IRQ to CPU affinity, this means that you can "hook" an
IRQ to only one CPU, or to exclude a CPU of handling IRQs. The contents of the
@@ -824,6 +829,10 @@ Provides counts of softirq handlers serviced since boot time, for each cpu.
HRTIMER: 0 0 0 0
RCU: 1678 1769 2178 2250
+An addition to /proc/softirqs is a "Time:" line at the end, which
+displays the time spent by each CPU processing softirqs, in USER_HZ units.
+This time is based on fine-grained accounting when CONFIG_VIRT_CPU_ACCOUNTING
+or CONFIG_IRQ_TIME_ACCOUNTING is active; otherwise it is tick-sample based.
1.3 IDE devices in /proc/ide
----------------------------
diff --git a/fs/proc/interrupts.c b/fs/proc/interrupts.c
index 05029c0..66d913a 100644
--- a/fs/proc/interrupts.c
+++ b/fs/proc/interrupts.c
@@ -3,6 +3,7 @@
#include <linux/interrupt.h>
#include <linux/irqnr.h>
#include <linux/proc_fs.h>
+#include <linux/sched.h>
#include <linux/seq_file.h>
/*
@@ -23,7 +24,15 @@ static void *int_seq_next(struct seq_file *f, void *v, loff_t *pos)
static void int_seq_stop(struct seq_file *f, void *v)
{
- /* Nothing to do */
+ int j;
+
+ seq_printf(f, "\n");
+ seq_printf(f, "Time:");
+ for_each_possible_cpu(j)
+ seq_printf(f, " %10lu", (unsigned long)get_cpu_hardirq_time(j));
+ seq_printf(f, " Interrupt Processing Time\n");
+ seq_printf(f, "\n");
+
}
static const struct seq_operations int_seq_ops = {
diff --git a/fs/proc/softirqs.c b/fs/proc/softirqs.c
index 1807c24..f028329 100644
--- a/fs/proc/softirqs.c
+++ b/fs/proc/softirqs.c
@@ -1,6 +1,7 @@
#include <linux/init.h>
#include <linux/kernel_stat.h>
#include <linux/proc_fs.h>
+#include <linux/sched.h>
#include <linux/seq_file.h>
/*
@@ -21,6 +22,13 @@ static int show_softirqs(struct seq_file *p, void *v)
seq_printf(p, " %10u", kstat_softirqs_cpu(i, j));
seq_printf(p, "\n");
}
+
+ seq_printf(p, "\n");
+ seq_printf(p, " Time:");
+ for_each_possible_cpu(j)
+ seq_printf(p, " %10lu", (unsigned long)get_cpu_softirq_time(j));
+ seq_printf(p, "\n");
+
return 0;
}
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8adf166..6562daf 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1826,6 +1826,9 @@ extern void sched_clock_idle_sleep_event(void);
extern void sched_clock_idle_wakeup_event(u64 delta_ns);
#endif
+extern clock_t get_cpu_hardirq_time(int cpu);
+extern clock_t get_cpu_softirq_time(int cpu);
+
#ifdef CONFIG_IRQ_TIME_ACCOUNTING
/*
* An i/f to runtime opt-in for irq time accounting based off of sched_clock.
diff --git a/kernel/sched.c b/kernel/sched.c
index bfbe064..a171647 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -73,6 +73,7 @@
#include <linux/ftrace.h>
#include <linux/slab.h>
+#include <asm/cputime.h>
#include <asm/tlb.h>
#include <asm/irq_regs.h>
@@ -1999,6 +2000,22 @@ static void sched_irq_time_avg_update(struct rq *rq, u64 curr_irq_time)
}
}
+clock_t get_cpu_hardirq_time(int cpu)
+{
+ if (!sched_clock_irqtime)
+ return cputime64_to_clock_t(kstat_cpu(cpu).cpustat.irq);
+
+ return nsec_to_clock_t(per_cpu(cpu_hardirq_time, cpu));
+}
+
+clock_t get_cpu_softirq_time(int cpu)
+{
+ if (!sched_clock_irqtime)
+ return cputime64_to_clock_t(kstat_cpu(cpu).cpustat.softirq);
+
+ return nsec_to_clock_t(per_cpu(cpu_softirq_time, cpu));
+}
+
#else
static u64 irq_time_cpu(int cpu)
@@ -2008,6 +2025,16 @@ static u64 irq_time_cpu(int cpu)
static void sched_irq_time_avg_update(struct rq *rq, u64 curr_irq_time) { }
+clock_t get_cpu_hardirq_time(int cpu)
+{
+ return cputime64_to_clock_t(kstat_cpu(cpu).cpustat.irq);
+}
+
+clock_t get_cpu_softirq_time(int cpu)
+{
+ return cputime64_to_clock_t(kstat_cpu(cpu).cpustat.softirq);
+}
+
#endif
#include "sched_idletask.c"
--
1.7.1
* Re: Proper kernel irq time accounting -v3
2010-09-29 19:21 Proper kernel irq time accounting -v3 Venkatesh Pallipadi
` (6 preceding siblings ...)
2010-09-29 19:21 ` [PATCH 7/7] Export per cpu hardirq and softirq time in proc -v3 Venkatesh Pallipadi
@ 2010-09-30 7:59 ` Andi Kleen
2010-09-30 16:37 ` Venkatesh Pallipadi
7 siblings, 1 reply; 29+ messages in thread
From: Andi Kleen @ 2010-09-30 7:59 UTC (permalink / raw)
To: Venkatesh Pallipadi
Cc: Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
Balbir Singh, linux-kernel, Paul Turner, Eric Dumazet
Venkatesh Pallipadi <venki@google.com> writes:
>
> Solution to (1) involves adding extra timing on irq entry/exit to
> get the fine granularity info and then exporting it to user.
This means RDTSC right?
I am a bit concerned about the overhead on systems where reading the TSC
is expensive. Is there a way to turn this off at runtime for those?
-Andi
--
ak@linux.intel.com -- Speaking for myself only.
* Re: [PATCH 1/7] si time accounting accounts bh_disable'd time to si -v3
2010-09-29 19:21 ` [PATCH 1/7] si time accounting accounts bh_disable'd time to si -v3 Venkatesh Pallipadi
@ 2010-09-30 11:04 ` Peter Zijlstra
2010-09-30 16:26 ` Venkatesh Pallipadi
0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2010-09-30 11:04 UTC (permalink / raw)
To: Venkatesh Pallipadi
Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Balbir Singh,
Martin Schwidefsky, linux-kernel, Paul Turner, Eric Dumazet
On Wed, 2010-09-29 at 12:21 -0700, Venkatesh Pallipadi wrote:
> Peter Zijlstra found a bug in the way softirq time is accounted in
> VIRT_CPU_ACCOUNTING on this thread.
> http://lkml.indiana.edu/hypermail//linux/kernel/1009.2/01366.html
>
> The problem is, softirq processing uses local_bh_disable internally. There
> is no way, later in the flow, to differentiate between whether softirq is
> being processed or is it just that bh has been disabled. So, a hardirq when bh
> is disabled results in time being wrongly accounted as softirq.
>
> Looking at the code a bit more, the problem exists in !VIRT_CPU_ACCOUNTING
> as well. As account_system_time() in normal tick based accouting also uses
> softirq_count, which will be set even when not in softirq with bh disabled.
>
> Peter also suggested solution of using 2 * SOFTIRQ_OFFSET as irq count
> for local_bh_{disable,enable} and using just SOFTIRQ_OFFSET while softirq
> processing. The patch below does that and adds API in_serving_softirq() which
> returns whether we are currently processing softirq or not.
>
> Also changes one of the usages of softirq_count in net/sched/cls_cgroup.c
> to in_serving_softirq.
>
> Looks like many usages of in_softirq really want in_serving_softirq. Those
> changes can be made individually on a case by case basis.
>
> Signed-off-by: Venkatesh Pallipadi <venki@google.com>
One nit: in_serving_softirq() doesn't seem right as either:
- we're not accounting ksoftirq in it, or
- we are and VIRT_CPU_ACCOUNTING is again broken ;-)
So only the softirq from irq tails wants to have SOFTIRQ_OFFSET set, the
ksoftirqd stuff can be tested for using PF_flags or something (ksoftirq
doesn't currently have a PF_SOFTIRQ flag, but -rt does and we could
bring that over).
* Re: [PATCH 3/7] Add IRQ_TIME_ACCOUNTING, finer accounting of irq time -v3
2010-09-29 19:21 ` [PATCH 3/7] Add IRQ_TIME_ACCOUNTING, finer accounting of irq time -v3 Venkatesh Pallipadi
@ 2010-09-30 11:06 ` Peter Zijlstra
2010-09-30 16:29 ` Venkatesh Pallipadi
0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2010-09-30 11:06 UTC (permalink / raw)
To: Venkatesh Pallipadi
Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Balbir Singh,
Martin Schwidefsky, linux-kernel, Paul Turner, Eric Dumazet
On Wed, 2010-09-29 at 12:21 -0700, Venkatesh Pallipadi wrote:
> +void account_system_vtime(struct task_struct *curr)
> +{
> + unsigned long flags;
> + int cpu;
> + u64 now, delta;
> +
> + if (!sched_clock_irqtime)
> + return;
> +
> + local_irq_save(flags);
> +
> + now = sched_clock();
> + cpu = smp_processor_id();
Like said before, that really wants to read like:
cpu = smp_processor_id();
now = sched_clock_cpu(cpu);
sched_clock() is raw tsc + ns conversion and can go all over the place.
> + delta = now - per_cpu(irq_start_time, cpu);
> + per_cpu(irq_start_time, cpu) = now;
> + if (hardirq_count())
> + per_cpu(cpu_hardirq_time, cpu) += delta;
> + else if (in_serving_softirq())
> + per_cpu(cpu_softirq_time, cpu) += delta;
> +
> + local_irq_restore(flags);
> +}
* Re: [PATCH 1/7] si time accounting accounts bh_disable'd time to si -v3
2010-09-30 11:04 ` Peter Zijlstra
@ 2010-09-30 16:26 ` Venkatesh Pallipadi
2010-10-01 23:16 ` Peter Zijlstra
0 siblings, 1 reply; 29+ messages in thread
From: Venkatesh Pallipadi @ 2010-09-30 16:26 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Balbir Singh,
Martin Schwidefsky, linux-kernel, Paul Turner, Eric Dumazet
On Thu, Sep 30, 2010 at 4:04 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, 2010-09-29 at 12:21 -0700, Venkatesh Pallipadi wrote:
>> Peter Zijlstra found a bug in the way softirq time is accounted in
>> VIRT_CPU_ACCOUNTING on this thread.
>> http://lkml.indiana.edu/hypermail//linux/kernel/1009.2/01366.html
>>
>> The problem is, softirq processing uses local_bh_disable internally. There
>> is no way, later in the flow, to differentiate between whether softirq is
>> being processed or is it just that bh has been disabled. So, a hardirq when bh
>> is disabled results in time being wrongly accounted as softirq.
>>
>> Looking at the code a bit more, the problem exists in !VIRT_CPU_ACCOUNTING
>> as well. As account_system_time() in normal tick based accouting also uses
>> softirq_count, which will be set even when not in softirq with bh disabled.
>>
>> Peter also suggested solution of using 2 * SOFTIRQ_OFFSET as irq count
>> for local_bh_{disable,enable} and using just SOFTIRQ_OFFSET while softirq
>> processing. The patch below does that and adds API in_serving_softirq() which
>> returns whether we are currently processing softirq or not.
>>
>> Also changes one of the usages of softirq_count in net/sched/cls_cgroup.c
>> to in_serving_softirq.
>>
>> Looks like many usages of in_softirq really want in_serving_softirq. Those
>> changes can be made individually on a case by case basis.
>>
>> Signed-off-by: Venkatesh Pallipadi <venki@google.com>
>
> One nit: in_serving_softirq() doesn't seem right as either:
>
> - we're not accounting ksoftirq in it, or
> - we're are and VIRT_CPU_ACCOUNTING is again broken ;-)
>
> So only the softirq from irq tails wants to have SOFTIRQ_OFFSET set, the
> ksoftirqd stuff can be tested for using PF_flags or something (ksoftirq
> doesn't currently have a PF_SOFTIRQ flag, but -rt does and we could
> bring that over).
>
The problem is that ksoftirqd is also handling softirqs, and some
eventual users of in_serving_softirq (like the network code in this
patch) want to differentiate between softirq context and real
process (!ksoftirqd) context.
Also, ksoftirqd adds to the softirq counts, does softirq tracing, etc. So
it kind of made sense to add the time to the softirq stats as well.
If we don't account the time to the softirq stats, then a user looking at,
say, time per softirq using the softirq count will be misled. No?
In the other thread you mentioned that doing so will cause problems. Were
you thinking of scheduler issues, or are there other problems with charging
softirq time this way?
Thanks,
Venki
* Re: [PATCH 3/7] Add IRQ_TIME_ACCOUNTING, finer accounting of irq time -v3
2010-09-30 11:06 ` Peter Zijlstra
@ 2010-09-30 16:29 ` Venkatesh Pallipadi
2010-09-30 20:38 ` Venkatesh Pallipadi
2010-10-01 11:45 ` Peter Zijlstra
0 siblings, 2 replies; 29+ messages in thread
From: Venkatesh Pallipadi @ 2010-09-30 16:29 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Balbir Singh,
Martin Schwidefsky, linux-kernel, Paul Turner, Eric Dumazet
On Thu, Sep 30, 2010 at 4:06 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, 2010-09-29 at 12:21 -0700, Venkatesh Pallipadi wrote:
>> +void account_system_vtime(struct task_struct *curr)
>> +{
>> + unsigned long flags;
>> + int cpu;
>> + u64 now, delta;
>> +
>> + if (!sched_clock_irqtime)
>> + return;
>> +
>> + local_irq_save(flags);
>> +
>> + now = sched_clock();
>> + cpu = smp_processor_id();
>
> Like said before, that really wants to read like:
>
> cpu = smp_processor_id();
> now = sched_clock_cpu(cpu);
>
> sched_clock() is raw tsc + ns conversion and can go all over the place.
sched_clock_cpu() won't really work here, due to what looks like
idle and timer-tick dependencies. Using sched_clock_cpu(), I end up
accounting CPU idle time to hardirq, due to time captured before the
handler and after the handler.
Thanks,
Venki
>
>> + delta = now - per_cpu(irq_start_time, cpu);
>> + per_cpu(irq_start_time, cpu) = now;
>> + if (hardirq_count())
>> + per_cpu(cpu_hardirq_time, cpu) += delta;
>> + else if (in_serving_softirq())
>> + per_cpu(cpu_softirq_time, cpu) += delta;
>> +
>> + local_irq_restore(flags);
>> +}
>
* Re: Proper kernel irq time accounting -v3
2010-09-30 7:59 ` Proper kernel irq time accounting -v3 Andi Kleen
@ 2010-09-30 16:37 ` Venkatesh Pallipadi
2010-09-30 17:36 ` Andi Kleen
0 siblings, 1 reply; 29+ messages in thread
From: Venkatesh Pallipadi @ 2010-09-30 16:37 UTC (permalink / raw)
To: Andi Kleen
Cc: Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
Balbir Singh, linux-kernel, Paul Turner, Eric Dumazet
On Thu, Sep 30, 2010 at 12:59 AM, Andi Kleen <andi@firstfloor.org> wrote:
> Venkatesh Pallipadi <venki@google.com> writes:
>>
>> Solution to (1) involves adding extra timing on irq entry/exit to
>> get the fine granularity info and then exporting it to user.
>
> This means RDTSC right?
>
> I am a bit concerned about overhead on systems with high TSC overhead.
> Is there a way to turn this off at runtime for those?
>
Yes, the overhead will be RDTSC.
notsc is a boot option that will disable this. I didn't want to add a
specific runtime option to disable this, as we already have
other RDTSC users like native_sched_clock. Do you think we need a
specific option to disable this feature? I don't like adding a runtime
option because it complicates the code a bit (to handle things like
someone turning it off and then back on later, etc).
Thanks,
Venki
* Re: Proper kernel irq time accounting -v3
2010-09-30 16:37 ` Venkatesh Pallipadi
@ 2010-09-30 17:36 ` Andi Kleen
0 siblings, 0 replies; 29+ messages in thread
From: Andi Kleen @ 2010-09-30 17:36 UTC (permalink / raw)
To: Venkatesh Pallipadi
Cc: Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
Balbir Singh, linux-kernel, Paul Turner, Eric Dumazet
Venkatesh Pallipadi <venki@google.com> writes:
>
> Yes. Overhead will be RDTSC.
> notsc is a boot option that will disable this. I didn't want to have a
notsc is a very heavy hammer.
> specific run time option to disable this as we would already have
> other RDTSC users like native_sched_clock. Do you think we need a
> specific option to disable this feature? I don't like adding a runtime
> option because it complicates the code a bit (to handle things like
> someone turning it off and turning it on later, etc).
How about some benchmarks on a known slow TSC system, like a P4?
If you can't measure it even with high interrupt loads, it is probably
not relevant. If it is, it would be better to have an on/off switch.
-Andi
--
ak@linux.intel.com -- Speaking for myself only.
* Re: [PATCH 3/7] Add IRQ_TIME_ACCOUNTING, finer accounting of irq time -v3
2010-09-30 16:29 ` Venkatesh Pallipadi
@ 2010-09-30 20:38 ` Venkatesh Pallipadi
2010-10-01 11:46 ` Peter Zijlstra
2010-10-01 11:45 ` Peter Zijlstra
1 sibling, 1 reply; 29+ messages in thread
From: Venkatesh Pallipadi @ 2010-09-30 20:38 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Balbir Singh,
Martin Schwidefsky, linux-kernel, Paul Turner, Eric Dumazet
On Thu, Sep 30, 2010 at 9:29 AM, Venkatesh Pallipadi <venki@google.com> wrote:
> On Thu, Sep 30, 2010 at 4:06 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> On Wed, 2010-09-29 at 12:21 -0700, Venkatesh Pallipadi wrote:
>>> +void account_system_vtime(struct task_struct *curr)
>>> +{
>>> + unsigned long flags;
>>> + int cpu;
>>> + u64 now, delta;
>>> +
>>> + if (!sched_clock_irqtime)
>>> + return;
>>> +
>>> + local_irq_save(flags);
>>> +
>>> + now = sched_clock();
>>> + cpu = smp_processor_id();
>>
>> Like said before, that really wants to read like:
>>
>> cpu = smp_processor_id();
>> now = sched_clock_cpu(cpu);
>>
>> sched_clock() is raw tsc + ns conversion and can go all over the place.
>
> sched_clock_cpu() won't really work for here, due to what looks like
> idle and timer tick dependencies. Using sched_clock_cpu(), I end up
> accounting CPU idle time to hardirq due to time captured before the
> handler and after the handler.
>
Specifically, here is the now/delta log from my test setup using
sched_clock_cpu() as above (the data below is from one particular CPU):
<idle>-0 [001] 1697.910060: account_system_vtime: SOFTIRQ STOP 1700897011613, delta 3555
<idle>-0 [001] 1697.911047: account_system_vtime: IRQ START 1700897999028
<idle>-0 [001] 1697.911051: account_system_vtime: HARDIRQ STOP 1700898005071, delta 6043
<idle>-0 [001] 1697.911052: account_system_vtime: IRQ START 1700898006003
<idle>-0 [001] 1697.911057: account_system_vtime: SOFTIRQ STOP 1700898010685, delta 4682
<idle>-0 [001] 1697.911931: account_system_vtime: IRQ START 1700898884039
<idle>-0 [001] 1697.911935: account_system_vtime: HARDIRQ STOP 1700898890304, delta 6265
<idle>-0 [001] 1697.915040: account_system_vtime: IRQ START 1700899887146
<idle>-0 [001] 1697.915047: account_system_vtime: HARDIRQ STOP 1700902008678, delta 2121532
<idle>-0 [001] 1697.915048: account_system_vtime: IRQ START 1700902009617
<idle>-0 [001] 1697.915055: account_system_vtime: SOFTIRQ STOP 1700902015924, delta 6307
<idle>-0 [001] 1697.918958: account_system_vtime: IRQ START 1700903006980
<idle>-0 [001] 1697.918963: account_system_vtime: HARDIRQ STOP 1700905931367, delta 2924387
<idle>-0 [001] 1697.919553: account_system_vtime: IRQ START 1700906521486
<idle>-0 [001] 1697.919557: account_system_vtime: HARDIRQ STOP 1700906527006, delta 5520
<idle>-0 [001] 1697.933420: account_system_vtime: IRQ START 1700907524219
<idle>-0 [001] 1697.933428: account_system_vtime: HARDIRQ STOP 1700920424020, delta 12899801
<idle>-0 [001] 1697.936512: account_system_vtime: IRQ START 1700921418037
<idle>-0 [001] 1697.936523: account_system_vtime: HARDIRQ STOP 1700923524497, delta 2106460
<...>-13891 [001] 1697.937001: account_system_vtime: IRQ START 1700924001105
<...>-13891 [001] 1697.937013: account_system_vtime: HARDIRQ STOP 1700924015861, delta 14756
<...>-13891 [001] 1697.937015: account_system_vtime: IRQ START 1700924017841
<...>-13891 [001] 1697.937020: account_system_vtime: SOFTIRQ STOP 1700924023189, delta 5348
<...>-13891 [001] 1697.937188: account_system_vtime: IRQ START 1700924191127
<...>-13891 [001] 1697.937197: account_system_vtime: HARDIRQ STOP 1700924199862, delta 8735
<idle>-0 [001] 1697.938000: account_system_vtime: IRQ START 1700924999818
<idle>-0 [001] 1697.938013: account_system_vtime: HARDIRQ STOP 1700925017573, delta 17755
<idle>-0 [001] 1697.938015: account_system_vtime: IRQ START 1700925019459
<idle>-0 [001] 1697.938031: account_system_vtime: SOFTIRQ STOP 1700925034677, delta 15218
Thanks,
Venki
* Re: [PATCH 3/7] Add IRQ_TIME_ACCOUNTING, finer accounting of irq time -v3
2010-09-30 16:29 ` Venkatesh Pallipadi
2010-09-30 20:38 ` Venkatesh Pallipadi
@ 2010-10-01 11:45 ` Peter Zijlstra
1 sibling, 0 replies; 29+ messages in thread
From: Peter Zijlstra @ 2010-10-01 11:45 UTC (permalink / raw)
To: Venkatesh Pallipadi
Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Balbir Singh,
Martin Schwidefsky, linux-kernel, Paul Turner, Eric Dumazet
On Thu, 2010-09-30 at 09:29 -0700, Venkatesh Pallipadi wrote:
> On Thu, Sep 30, 2010 at 4:06 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Wed, 2010-09-29 at 12:21 -0700, Venkatesh Pallipadi wrote:
> >> +void account_system_vtime(struct task_struct *curr)
> >> +{
> >> +        unsigned long flags;
> >> +        int cpu;
> >> +        u64 now, delta;
> >> +
> >> +        if (!sched_clock_irqtime)
> >> +                return;
> >> +
> >> +        local_irq_save(flags);
> >> +
> >> +        now = sched_clock();
> >> +        cpu = smp_processor_id();
> >
> > Like said before, that really wants to read like:
> >
> > cpu = smp_processor_id();
> > now = sched_clock_cpu(cpu);
> >
> > sched_clock() is raw tsc + ns conversion and can go all over the place.
>
> sched_clock_cpu() won't really work here, due to what looks like
> idle and timer tick dependencies. Using sched_clock_cpu(), I end up
> accounting CPU idle time to hardirq due to time captured before the
> handler and after the handler.
huh!?
No, sched_clock_cpu() gives sched_clock() for those few systems that
actually have a usable TSC; for those that don't, we use the tick (and
idle hooks) to read GTOD and set up a TICK_NSEC window. We then use TSC
to calculate high-res deltas, and we filter out anything that goes
backwards and anything that exceeds the window.
How does that render it useless for this purpose?
Never use sched_clock(), it's crap (just like TSC is).
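A sketch of the filtering Peter describes, modeled on
kernel/sched_clock.c of this era (names and details approximate, not a
verbatim copy):

        static u64 sched_clock_local(struct sched_clock_data *scd)
        {
                u64 now, clock, old_clock, min_clock, max_clock;
                s64 delta;

        again:
                now = sched_clock();
                delta = now - scd->tick_raw;    /* raw TSC-based offset */
                if (unlikely(delta < 0))
                        delta = 0;

                old_clock = scd->clock;

                /*
                 * Clamp the TSC-derived value into a window anchored at
                 * the GTOD reading taken at the last tick: never go
                 * backwards, never run more than TICK_NSEC past the
                 * anchor.
                 */
                clock = scd->tick_gtod + delta;
                min_clock = wrap_max(scd->tick_gtod, old_clock);
                max_clock = scd->tick_gtod + TICK_NSEC;

                clock = wrap_max(clock, min_clock);
                clock = wrap_min(clock, max_clock);

                if (cmpxchg64(&scd->clock, old_clock, clock) != old_clock)
                        goto again;

                return clock;
        }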
* Re: [PATCH 3/7] Add IRQ_TIME_ACCOUNTING, finer accounting of irq time -v3
2010-09-30 20:38 ` Venkatesh Pallipadi
@ 2010-10-01 11:46 ` Peter Zijlstra
2010-10-01 16:51 ` Venkatesh Pallipadi
0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2010-10-01 11:46 UTC (permalink / raw)
To: Venkatesh Pallipadi
Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Balbir Singh,
Martin Schwidefsky, linux-kernel, Paul Turner, Eric Dumazet
On Thu, 2010-09-30 at 13:38 -0700, Venkatesh Pallipadi wrote:
> On Thu, Sep 30, 2010 at 9:29 AM, Venkatesh Pallipadi <venki@google.com> wrote:
> > On Thu, Sep 30, 2010 at 4:06 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >> On Wed, 2010-09-29 at 12:21 -0700, Venkatesh Pallipadi wrote:
> >>> +void account_system_vtime(struct task_struct *curr)
> >>> +{
> >>> +        unsigned long flags;
> >>> +        int cpu;
> >>> +        u64 now, delta;
> >>> +
> >>> +        if (!sched_clock_irqtime)
> >>> +                return;
> >>> +
> >>> +        local_irq_save(flags);
> >>> +
> >>> +        now = sched_clock();
> >>> +        cpu = smp_processor_id();
> >>
> >> Like said before, that really wants to read like:
> >>
> >> cpu = smp_processor_id();
> >> now = sched_clock_cpu(cpu);
> >>
> >> sched_clock() is raw tsc + ns conversion and can go all over the place.
> >
> > sched_clock_cpu() won't really work here, due to what looks like
> > idle and timer tick dependencies. Using sched_clock_cpu(), I end up
> > accounting CPU idle time to hardirq due to time captured before the
> > handler and after the handler.
> >
>
> Specifically, here is the now and delta log from my test setup using
> sched_clock_cpu() above (data below is from one particular CPU):
>
> <idle>-0 [001] 1697.910060: account_system_vtime:
> SOFTIRQ STOP 1700897011613, delta 3555
> <idle>-0 [001] 1697.911047: account_system_vtime: IRQ
> START 1700897999028
Dude, linewrap hell! And what are you trying to illustrate?
* Re: [PATCH 3/7] Add IRQ_TIME_ACCOUNTING, finer accounting of irq time -v3
2010-10-01 11:46 ` Peter Zijlstra
@ 2010-10-01 16:51 ` Venkatesh Pallipadi
2010-10-01 17:29 ` Venkatesh Pallipadi
0 siblings, 1 reply; 29+ messages in thread
From: Venkatesh Pallipadi @ 2010-10-01 16:51 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Balbir Singh,
Martin Schwidefsky, linux-kernel, Paul Turner, Eric Dumazet
On Fri, Oct 1, 2010 at 4:46 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, 2010-09-30 at 13:38 -0700, Venkatesh Pallipadi wrote:
>> On Thu, Sep 30, 2010 at 9:29 AM, Venkatesh Pallipadi <venki@google.com> wrote:
>> > On Thu, Sep 30, 2010 at 4:06 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> >> On Wed, 2010-09-29 at 12:21 -0700, Venkatesh Pallipadi wrote:
>> >>> +void account_system_vtime(struct task_struct *curr)
>> >>> +{
>> >>> +        unsigned long flags;
>> >>> +        int cpu;
>> >>> +        u64 now, delta;
>> >>> +
>> >>> +        if (!sched_clock_irqtime)
>> >>> +                return;
>> >>> +
>> >>> +        local_irq_save(flags);
>> >>> +
>> >>> +        now = sched_clock();
>> >>> +        cpu = smp_processor_id();
>> >>
>> >> Like said before, that really wants to read like:
>> >>
>> >> cpu = smp_processor_id();
>> >> now = sched_clock_cpu(cpu);
>> >>
>> >> sched_clock() is raw tsc + ns conversion and can go all over the place.
>> >
>> > sched_clock_cpu() won't really work here, due to what looks like
>> > idle and timer tick dependencies. Using sched_clock_cpu(), I end up
>> > accounting CPU idle time to hardirq due to time captured before the
>> > handler and after the handler.
>> >
>>
>> Specifically, here is the now and delta log from my test setup using
>> sched_clock_cpu() above (data below is from one particular CPU):
>>
>> <idle>-0 [001] 1697.910060: account_system_vtime:
>> SOFTIRQ STOP 1700897011613, delta 3555
>> <idle>-0 [001] 1697.911047: account_system_vtime: IRQ
>> START 1700897999028
>
> Dude, linewrap hell! And what are you trying to illustrate?
>
Sorry. Retrying with shorter lines.
This is the log of the start, hardirq stop, and softirq stop events,
with the deltas at each stop, taken in account_system_vtime on CPU 1
with sched_clock_cpu().
A delta of ~5000 ns for a hardirq seems reasonable. But, as you can
see, there are frequent multi-ms deltas on hardirq, specifically when
idle.
<idle>-0 [] 1697.911047: START 1700897999028
<idle>-0 [] 1697.911051: HARD STOP 1700898005071, delta 6043
<idle>-0 [] 1697.911052: START 1700898006003
<idle>-0 [] 1697.911057: SOFT STOP 1700898010685, delta 4682
<idle>-0 [] 1697.911931: START 1700898884039
<idle>-0 [] 1697.911935: HARD STOP 1700898890304, delta 6265
<idle>-0 [] 1697.915040: START 1700899887146
<idle>-0 [] 1697.915047: HARD STOP 1700902008678, delta 2121532
<idle>-0 [] 1697.915048: START 1700902009617
<idle>-0 [] 1697.915055: SOFT STOP 1700902015924, delta 6307
<idle>-0 [] 1697.918958: START 1700903006980
<idle>-0 [] 1697.918963: HARD STOP 1700905931367, delta 2924387
<idle>-0 [] 1697.919553: START 1700906521486
<idle>-0 [] 1697.919557: HARD STOP 1700906527006, delta 5520
<idle>-0 [] 1697.933420: START 1700907524219
<idle>-0 [] 1697.933428: HARD STOP 1700920424020, delta 12899801
<idle>-0 [] 1697.936512: START 1700921418037
<idle>-0 [] 1697.936523: HARD STOP 1700923524497, delta 2106460
This is probably because this test system is one of the crazy-TSC
ones. But it does have TSC as its clocksource, and there is no trace
of mark_tsc_unstable in dmesg.
Thanks,
Venki
* Re: [PATCH 3/7] Add IRQ_TIME_ACCOUNTING, finer accounting of irq time -v3
2010-10-01 16:51 ` Venkatesh Pallipadi
@ 2010-10-01 17:29 ` Venkatesh Pallipadi
2010-10-01 23:14 ` Peter Zijlstra
0 siblings, 1 reply; 29+ messages in thread
From: Venkatesh Pallipadi @ 2010-10-01 17:29 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Balbir Singh,
Martin Schwidefsky, linux-kernel, Paul Turner, Eric Dumazet
On Fri, Oct 1, 2010 at 9:51 AM, Venkatesh Pallipadi <venki@google.com> wrote:
> On Fri, Oct 1, 2010 at 4:46 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> On Thu, 2010-09-30 at 13:38 -0700, Venkatesh Pallipadi wrote:
>>> On Thu, Sep 30, 2010 at 9:29 AM, Venkatesh Pallipadi <venki@google.com> wrote:
>>> > On Thu, Sep 30, 2010 at 4:06 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>> >> On Wed, 2010-09-29 at 12:21 -0700, Venkatesh Pallipadi wrote:
>>> >>> +void account_system_vtime(struct task_struct *curr)
>>> >>> +{
>>> >>> +        unsigned long flags;
>>> >>> +        int cpu;
>>> >>> +        u64 now, delta;
>>> >>> +
>>> >>> +        if (!sched_clock_irqtime)
>>> >>> +                return;
>>> >>> +
>>> >>> +        local_irq_save(flags);
>>> >>> +
>>> >>> +        now = sched_clock();
>>> >>> +        cpu = smp_processor_id();
>>> >>
>>> >> Like said before, that really wants to read like:
>>> >>
>>> >> cpu = smp_processor_id();
>>> >> now = sched_clock_cpu(cpu);
>>> >>
>>> >> sched_clock() is raw tsc + ns conversion and can go all over the place.
>>> >
>>> > sched_clock_cpu() won't really work here, due to what looks like
>>> > idle and timer tick dependencies. Using sched_clock_cpu(), I end up
>>> > accounting CPU idle time to hardirq due to time captured before the
>>> > handler and after the handler.
>>> >
>>>
>>> Specifically, here is the now and delta log from my test setup using
>>> sched_clock_cpu() above (data below is from one particular CPU):
>>>
>>> <idle>-0 [001] 1697.910060: account_system_vtime:
>>> SOFTIRQ STOP 1700897011613, delta 3555
>>> <idle>-0 [001] 1697.911047: account_system_vtime: IRQ
>>> START 1700897999028
>>
>> Dude, linewrap hell! And what are you trying to illustrate?
>>
> Sorry. Retrying with shorter lines.
>
> This is the log of the start, hardirq stop, and softirq stop events,
> with the deltas at each stop, taken in account_system_vtime on CPU 1
> with sched_clock_cpu().
>
> A delta of ~5000 ns for a hardirq seems reasonable. But, as you can
> see, there are frequent multi-ms deltas on hardirq, specifically when
> idle.
>
> <idle>-0 [] 1697.911047: START 1700897999028
> <idle>-0 [] 1697.911051: HARD STOP 1700898005071, delta 6043
> <idle>-0 [] 1697.911052: START 1700898006003
> <idle>-0 [] 1697.911057: SOFT STOP 1700898010685, delta 4682
> <idle>-0 [] 1697.911931: START 1700898884039
> <idle>-0 [] 1697.911935: HARD STOP 1700898890304, delta 6265
> <idle>-0 [] 1697.915040: START 1700899887146
> <idle>-0 [] 1697.915047: HARD STOP 1700902008678, delta 2121532
> <idle>-0 [] 1697.915048: START 1700902009617
> <idle>-0 [] 1697.915055: SOFT STOP 1700902015924, delta 6307
> <idle>-0 [] 1697.918958: START 1700903006980
> <idle>-0 [] 1697.918963: HARD STOP 1700905931367, delta 2924387
> <idle>-0 [] 1697.919553: START 1700906521486
> <idle>-0 [] 1697.919557: HARD STOP 1700906527006, delta 5520
> <idle>-0 [] 1697.933420: START 1700907524219
> <idle>-0 [] 1697.933428: HARD STOP 1700920424020, delta 12899801
> <idle>-0 [] 1697.936512: START 1700921418037
> <idle>-0 [] 1697.936523: HARD STOP 1700923524497, delta 2106460
>
>
> This is probably because this test system is one of the crazy-TSC
> ones. But it does have TSC as its clocksource, and there is no trace
> of mark_tsc_unstable in dmesg.
>
Digging a bit deeper into the sched_clock code, I see
sched_clock_stable is only set in arch/x86/kernel/cpu/intel.c, under
the check:
        if (c->x86_power & (1 << 8)) {
So, on x86, sched_clock_stable is not set on any other kind of CPU,
and my test system happens to be one of them. So, sched_clock_cpu()
falls back to tick based even when TSC is not marked unstable and
clocksource is using TSC for timing.
First, is this limited use of sched_clock_stable by design? Shouldn't
it rather be based on the tsc_unstable value in tsc.c?
If this is by design, I will have to limit the irq accounting to this
limited set of CPUs as well, since for irq accounting I cannot use the
tick based sched_clock_cpu(): it will have jumps while handling an
interrupt when the clock gets readjusted.
Thanks,
Venki
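For reference, the gate being discussed looks roughly like this, based
on this era's sources (a sketch; the intel.c context is abbreviated):

        /* arch/x86/kernel/cpu/intel.c: only CPUs advertising an
         * invariant TSC (c->x86_power bit 8) get the flag. */
        if (c->x86_power & (1 << 8)) {
                set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
                set_cpu_cap(c, X86_FEATURE_NONSTOP_TSC);
                sched_clock_stable = 1;
        }

        /* kernel/sched_clock.c: the flag short-circuits the windowed
         * path entirely. */
        u64 sched_clock_cpu(int cpu)
        {
                if (sched_clock_stable)
                        return sched_clock();
                /* otherwise fall through to the per-CPU windowed clock */
                ...
        }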
* Re: [PATCH 3/7] Add IRQ_TIME_ACCOUNTING, finer accounting of irq time -v3
2010-10-01 17:29 ` Venkatesh Pallipadi
@ 2010-10-01 23:14 ` Peter Zijlstra
2010-10-01 23:32 ` Venkatesh Pallipadi
0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2010-10-01 23:14 UTC (permalink / raw)
To: Venkatesh Pallipadi
Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Balbir Singh,
Martin Schwidefsky, linux-kernel, Paul Turner, Eric Dumazet
On Fri, 2010-10-01 at 10:29 -0700, Venkatesh Pallipadi wrote:
> So, on x86, sched_clock_stable is not set on any other kind of CPU,
> and my test system happens to be one of them. So, sched_clock_cpu()
> falls back to tick based even when TSC is not marked unstable and
> clocksource is using TSC for timing.
It is never tick based!! It's tick augmented! Because TSC is such a
piece of crap we use external (slow) means of determining a window in
which the TSC should live and then use the TSC to generate high
resolution offsets inside that.
So even if your usage is in the hardirq context that moves that window
it should all work out.
* Re: [PATCH 1/7] si time accounting accounts bh_disable'd time to si -v3
2010-09-30 16:26 ` Venkatesh Pallipadi
@ 2010-10-01 23:16 ` Peter Zijlstra
2010-10-02 15:42 ` Venkatesh Pallipadi
0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2010-10-01 23:16 UTC (permalink / raw)
To: Venkatesh Pallipadi
Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Balbir Singh,
Martin Schwidefsky, linux-kernel, Paul Turner, Eric Dumazet
On Thu, 2010-09-30 at 09:26 -0700, Venkatesh Pallipadi wrote:
> > One nit: in_serving_softirq() doesn't seem right as either:
> >
> > - we're not accounting ksoftirq in it, or
> > - we are, and VIRT_CPU_ACCOUNTING is again broken ;-)
> >
> > So only the softirq from irq tails wants to have SOFTIRQ_OFFSET set; the
> > ksoftirqd stuff can be tested for using PF_flags or something (ksoftirq
> > doesn't currently have a PF_SOFTIRQ flag, but -rt does and we could
> > bring that over).
> >
>
> The problem is that ksoftirqd is also handling softirqs, and some
> eventual users of in_serving_softirq (like the network code in this
> patch) want to differentiate between whether it is the softirq that's
> running or some real process (!ksoftirqd) context.
Then make in_serving_softirq() be something like:
(preempt_count() & SOFTIRQ_OFFSET) || (current->flags & PF_SOFTIRQ)
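As a sketch of that definition (PF_SOFTIRQ here is assumed from -rt;
it does not exist in mainline at this point):

        /* Serving a softirq: either from an irq tail (SOFTIRQ_OFFSET
         * in preempt_count()) or from the flagged ksoftirqd thread. */
        #define in_serving_softirq()                            \
                ((preempt_count() & SOFTIRQ_OFFSET) ||          \
                 (current->flags & PF_SOFTIRQ))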
> Also, ksoftirqd adds to softirq counts, does trace softirq, etc. So,
> it kind of made sense to add the time to the softirq stats as well.
> If we don't account time to softirq stats, then a user looking at,
> say, time per softirq using the softirq count will be misled. No?
Simply add back the task accounting when you report it?
> In the other thread you mentioned doing that will cause problems. Were
> you thinking of scheduler issues or are there other problems charging
> softirq time this way?
Of course there are.. you're double accounting the time of ksoftirqd,
and worse, you're adding that back into the equation as part of the
!sched_fair time.
* Re: [PATCH 3/7] Add IRQ_TIME_ACCOUNTING, finer accounting of irq time -v3
2010-10-01 23:14 ` Peter Zijlstra
@ 2010-10-01 23:32 ` Venkatesh Pallipadi
2010-10-02 10:53 ` Peter Zijlstra
0 siblings, 1 reply; 29+ messages in thread
From: Venkatesh Pallipadi @ 2010-10-01 23:32 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Balbir Singh,
Martin Schwidefsky, linux-kernel, Paul Turner, Eric Dumazet
On Fri, Oct 1, 2010 at 4:14 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, 2010-10-01 at 10:29 -0700, Venkatesh Pallipadi wrote:
>> So, on x86, sched_clock_stable is not set on any other kind of CPU,
>> and my test system happens to be one of them. So, sched_clock_cpu()
>> falls back to tick based even when TSC is not marked unstable and
>> clocksource is using TSC for timing.
>
> It is never tick based!! It's tick augmented! Because TSC is such a
> piece of crap we use external (slow) means of determining a window in
> which the TSC should live and then use the TSC to generate high
> resolution offsets inside that.
>
> So even if your usage is in the hardirq context that moves that window
> it should all work out.
>
You mean there should not be any "jumps" noticed with
sched_clock_cpu() when we are idle and get an interrupt?
At least that's what I am seeing. Maybe there is some other bug
somewhere causing that.
Looking at one snapshot from my earlier log:
<idle>-0 [] 1697.915040: : START 1700899887146
// We were idle, got an interrupt, and recorded sched_clock_cpu() as 1700899887146
<idle>-0 [] 1697.915047: : HARD STOP 1700902008678, delta 2121532
// We finished handling the interrupt and recorded sched_clock_cpu() as 1700902008678
// So, the delta we see is > 2 ms
// This is trace_printk based on the local clock, which uses sched_clock()
// So, the trace timing shows a delta of 7 us, which is about the expected time here
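Venki does not post the instrumentation itself; presumably it is
something along these lines inside account_system_vtime(), with the
START/STOP strings matching the log (hypothetical, for illustration
only):

        now = sched_clock_cpu(cpu);     /* the clock under test */
        delta = now - per_cpu(irq_start_time, cpu);

        if (hardirq_count())
                trace_printk("HARDIRQ STOP %llu, delta %llu\n",
                             (unsigned long long)now,
                             (unsigned long long)delta);
        else if (in_serving_softirq())
                trace_printk("SOFTIRQ STOP %llu, delta %llu\n",
                             (unsigned long long)now,
                             (unsigned long long)delta);
        else
                trace_printk("IRQ START %llu\n",
                             (unsigned long long)now);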
* Re: [PATCH 3/7] Add IRQ_TIME_ACCOUNTING, finer accounting of irq time -v3
2010-10-01 23:32 ` Venkatesh Pallipadi
@ 2010-10-02 10:53 ` Peter Zijlstra
2010-10-02 15:26 ` Venkatesh Pallipadi
0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2010-10-02 10:53 UTC (permalink / raw)
To: Venkatesh Pallipadi
Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Balbir Singh,
Martin Schwidefsky, linux-kernel, Paul Turner, Eric Dumazet,
Len Brown
On Fri, 2010-10-01 at 16:32 -0700, Venkatesh Pallipadi wrote:
> On Fri, Oct 1, 2010 at 4:14 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Fri, 2010-10-01 at 10:29 -0700, Venkatesh Pallipadi wrote:
> >> So, on x86, sched_clock_stable is not set on all other kind of CPUs
> >> and my test system happens to be one of them. So, sched_clock_cpu()
> >> falls back to tick based even when TSC is not marked unstable and
> >> clocksource is using TSC for timing.
> >
> > It is never tick based!! It's tick augmented! Because TSC is such a
> > piece of crap we use external (slow) means of determining a window in
> > which the TSC should live and then use the TSC to generate high
> > resolution offsets inside that.
> >
> > So even if your usage is in the hardirq context that moves that window
> > it should all work out.
> >
>
> You mean there should not be any "jumps" noticed with
> sched_clock_cpu() when we are idle and get an interrupt?
> At least that's what I am seeing. Maybe there is some other bug
> somewhere causing that.
>
> Looking at one snapshot from my earlier log:
>
> <idle>-0 [] 1697.915040: : START 1700899887146
> // We were idle, got an interrupt, and recorded sched_clock_cpu() as 1700899887146
> <idle>-0 [] 1697.915047: : HARD STOP 1700902008678, delta 2121532
> // We finished handling the interrupt and recorded sched_clock_cpu() as 1700902008678
> // So, the delta we see is > 2 ms
> // This is trace_printk based on the local clock, which uses sched_clock()
> // So, the trace timing shows a delta of 7 us, which is about the expected time here
Egads, yes that would be a kernel/sched_clock.c buglet..
So we're in NOHZ and an IRQ/NMI happens that ends up calling
sched_clock_cpu() and friends without us leaving NOHZ.
drivers/acpi/processor_idle.c:acpi_idle_enter_simple() calls
sched_clock_idle_{sleep,wakeup}_event() around the idle loop -- are
there idle methods missing this, and/or do we handle the interrupt
before the wakeup event?
Also, in irq_enter() we call __irq_enter(), which does
account_system_vtime() before tick_check_idle(), which restarts
various timers, resets jiffies, and in fact already calls
sched_clock_idle_wakeup_event().
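The ordering in question, roughly as it stands in this era's tree
(a sketch, details abbreviated):

        /* include/linux/hardirq.h */
        #define __irq_enter()                                   \
                do {                                            \
                        account_system_vtime(current);          \
                        add_preempt_count(HARDIRQ_OFFSET);      \
                        trace_hardirq_enter();                  \
                } while (0)

        /* kernel/softirq.c: account_system_vtime(), inside
         * __irq_enter(), runs before tick_check_idle() has fixed up
         * the clock after NOHZ idle. */
        void irq_enter(void)
        {
                int cpu = smp_processor_id();

                rcu_irq_enter();
                if (idle_cpu(cpu) && !in_interrupt()) {
                        __irq_enter();
                        tick_check_idle(cpu);
                } else
                        __irq_enter();
        }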
Gah, what a mess.. we could try a code shuffle to restart timer/clock
bits before calling into account_system_vtime(), although I bet that'll
be interesting. But I see no way to fix the NMI-during-NOHZ problem; it's
not like we can actually do GTOD from NMI context :/
The thing is, I really do _NOT_ trust TSC to be sane enough to use the
way you want to; it's really proven itself to be reliably crap.
* Re: [PATCH 3/7] Add IRQ_TIME_ACCOUNTING, finer accounting of irq time -v3
2010-10-02 10:53 ` Peter Zijlstra
@ 2010-10-02 15:26 ` Venkatesh Pallipadi
2010-10-03 0:26 ` Peter Zijlstra
0 siblings, 1 reply; 29+ messages in thread
From: Venkatesh Pallipadi @ 2010-10-02 15:26 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Balbir Singh,
Martin Schwidefsky, linux-kernel, Paul Turner, Eric Dumazet,
Len Brown
On Sat, Oct 2, 2010 at 3:53 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, 2010-10-01 at 16:32 -0700, Venkatesh Pallipadi wrote:
>> On Fri, Oct 1, 2010 at 4:14 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> > On Fri, 2010-10-01 at 10:29 -0700, Venkatesh Pallipadi wrote:
>> >> So, on x86, sched_clock_stable is not set on any other kind of CPU,
>> >> and my test system happens to be one of them. So, sched_clock_cpu()
>> >> falls back to tick based even when TSC is not marked unstable and
>> >> clocksource is using TSC for timing.
>> >
>> > It is never tick based!! It's tick augmented! Because TSC is such a
>> > piece of crap we use external (slow) means of determining a window in
>> > which the TSC should live and then use the TSC to generate high
>> > resolution offsets inside that.
>> >
>> > So even if your usage is in the hardirq context that moves that window
>> > it should all work out.
>> >
>>
>> You mean there should not be any "jumps" noticed with
>> sched_clock_cpu() when we are idle and get an interrupt?
>> At least that's what I am seeing. Maybe there is some other bug
>> somewhere causing that.
>>
>> Looking at one snapshot from my earlier log:
>>
>> <idle>-0 [] 1697.915040: : START 1700899887146
>> // We were idle, got an interrupt, and recorded sched_clock_cpu() as 1700899887146
>> <idle>-0 [] 1697.915047: : HARD STOP 1700902008678, delta 2121532
>> // We finished handling the interrupt and recorded sched_clock_cpu() as 1700902008678
>> // So, the delta we see is > 2 ms
>> // This is trace_printk based on the local clock, which uses sched_clock()
>> // So, the trace timing shows a delta of 7 us, which is about the expected time here
>
> Egads, yes that would be a kernel/sched_clock.c buglet..
>
> So we're in NOHZ and an IRQ/NMI happens that ends up calling
> sched_clock_cpu() and friends without us leaving NOHZ.
>
> drivers/acpi/processor_idle.c:acpi_idle_enter_simple() calls
> sched_clock_idle_{sleep,wakeup}_event() around the idle loop -- are
> there idle methods missing this, and/or do we handle the interrupt
> before the wakeup event?
>
Yes. My test machine is using the basic mwait_idle from x86
process.c, and that code path does not have these sleep/wakeup events.
Also, the basic idle routines (both halt and mwait) go into idle after
enabling interrupts, and so on the way out they handle the interrupt
before doing anything else.
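The relevant shape of that path, sketched from
arch/x86/kernel/process.c of this era (abbreviated; tracing and cache
details dropped):

        static void mwait_idle(void)
        {
                if (!need_resched()) {
                        __monitor((void *)&current_thread_info()->flags, 0, 0);
                        smp_mb();
                        if (!need_resched())
                                __sti_mwait(0, 0);      /* sti; mwait -- the waking
                                                         * interrupt is serviced here */
                        else
                                local_irq_enable();
                } else
                        local_irq_enable();
                /* note: no sched_clock_idle_{sleep,wakeup}_event() hooks */
        }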
> Also, in irq_enter() we call __irq_enter() which does
> account_system_vtime() before tick_check_idle() which restarts various
> timers and resets jiffies and in fact already calls
> sched_clock_idle_wakeup_event().
>
Ah. The reason for this jump is account_system_vtime() being called
before the wakeup event.
> Gah what a mess.. we could try a code shuffle to restart timer/clock
> bits before calling into account_system_vtime(), although I bet that'll
> be interesting. But I see no way to fix the NMI-during-NOHZ problem; it's
> not like we can actually do GTOD from NMI context :/
>
> The thing is, I really do _NOT_ trust TSC to be sane enough to use the
> way you want to; it's really proven itself to be reliably crap.
>
I agree with the reliably crap part :). Though TSC usage in this case
is somewhat limited, as we always compare timings only from the same
CPU.
Maybe the above code shuffle + somehow skipping the time accounting in
account_system_vtime() is a good way to proceed? I will look at this
change next. Is it OK to use sched_clock() here until then? I really
want to get some wider testing feedback on this before spending more
time.
Thanks,
Venki
* Re: [PATCH 1/7] si time accounting accounts bh_disable'd time to si -v3
2010-10-01 23:16 ` Peter Zijlstra
@ 2010-10-02 15:42 ` Venkatesh Pallipadi
2010-10-03 0:34 ` Peter Zijlstra
0 siblings, 1 reply; 29+ messages in thread
From: Venkatesh Pallipadi @ 2010-10-02 15:42 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Balbir Singh,
Martin Schwidefsky, linux-kernel, Paul Turner, Eric Dumazet
On Fri, Oct 1, 2010 at 4:16 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, 2010-09-30 at 09:26 -0700, Venkatesh Pallipadi wrote:
>> > One nit: in_serving_softirq() doesn't seem right as either:
>> >
>> > - we're not accounting ksoftirq in it, or
>> > - we are, and VIRT_CPU_ACCOUNTING is again broken ;-)
>> >
>> > So only the softirq from irq tails wants to have SOFTIRQ_OFFSET set; the
>> > ksoftirqd stuff can be tested for using PF_flags or something (ksoftirq
>> > doesn't currently have a PF_SOFTIRQ flag, but -rt does and we could
>> > bring that over).
>> >
>>
>> The problem is that ksoftirqd is also handling softirqs, and some
>> eventual users of in_serving_softirq (like the network code in this
>> patch) want to differentiate between whether it is the softirq that's
>> running or some real process (!ksoftirqd) context.
>
> Then make in_serving_softirq() be something like:
> (preempt_count() & SOFTIRQ_OFFSET) || (current->flags & PF_SOFTIRQ)
>
Yes. I would also need in_softirq_at_hardirq_tail() for accounting calls :)
>> Also, ksoftirqd adds to softirq counts, does trace softirq, etc. So,
>> it kind of made sense to add the time to the softirq stats as well.
>> If we don't account time to softirq stats, then a user looking at,
>> say, time per softirq using the softirq count will be misled. No?
>
> Simply add back the task accounting when you report it?
>
>> In the other thread you mentioned doing that will cause problems. Were
>> you thinking of scheduler issues or are there other problems charging
>> softirq time this way?
>
> Of course there are.. you're double accounting the time of ksoftirqd,
> and worse, you're adding that back into the equation as part of the
> !sched_fair time.
>
No. There should not be any double accounting with this current
change. We account softirq processing both at hardirq tail and in
ksoftirqd as CPU softirq time. It will be taken out of ksoftirqd's
sched exec time as with any other thread. And it will be taken out of
the fair time available on the CPU as well. Which to me seems to be
the right thing to do, as this is more coupled with the CPU and
ksoftirqd is just giving a context for the softirqs to run.
Changing only the hardirq tail to have SOFTIRQ_OFFSET and changing
ksoftirqd to SOFTIRQ_OFFSET*2 would result in these additional
ksoftirqd softirqs staying as part of sched_fair time.
Or am I totally missing something here?
Thanks,
Venki
* Re: [PATCH 3/7] Add IRQ_TIME_ACCOUNTING, finer accounting of irq time -v3
2010-10-02 15:26 ` Venkatesh Pallipadi
@ 2010-10-03 0:26 ` Peter Zijlstra
0 siblings, 0 replies; 29+ messages in thread
From: Peter Zijlstra @ 2010-10-03 0:26 UTC (permalink / raw)
To: Venkatesh Pallipadi
Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Balbir Singh,
Martin Schwidefsky, linux-kernel, Paul Turner, Eric Dumazet,
Len Brown
On Sat, 2010-10-02 at 08:26 -0700, Venkatesh Pallipadi wrote:
> Is it OK to use sched_clock() here until that time? I
> really want to get some wider testing feedback on this before spending
> more time.
Sure.
* Re: [PATCH 1/7] si time accounting accounts bh_disable'd time to si -v3
2010-10-02 15:42 ` Venkatesh Pallipadi
@ 2010-10-03 0:34 ` Peter Zijlstra
2010-10-04 16:54 ` Venkatesh Pallipadi
0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2010-10-03 0:34 UTC (permalink / raw)
To: Venkatesh Pallipadi
Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Balbir Singh,
Martin Schwidefsky, linux-kernel, Paul Turner, Eric Dumazet
On Sat, 2010-10-02 at 08:42 -0700, Venkatesh Pallipadi wrote:
>
> > Then make in_serving_softirq() be something like:
> > (preempt_count() & SOFTIRQ_OFFSET) || (current->flags & PF_SOFTIRQ)
> >
>
> Yes. I would also need in_softirq_at_hardirq_tail() for accounting calls :)
Well, you could open-code it there or something.
> >> Also, ksoftirqd adds to softirq counts, does trace softirq, etc. So,
> >> it kind of made sense to add the time to the softirq stats as well.
> >> If we don't account time to softirq stats, then a user looking at,
> >> say, time per softirq using the softirq count will be misled. No?
> >
> > Simply add back the task accounting when you report it?
> >
> >> In the other thread you mentioned doing that will cause problems. Were
> >> you thinking of scheduler issues or are there other problems charging
> >> softirq time this way?
> >
> > Of course there are.. you're double accounting the time of ksoftirqd,
> > and worse, you're adding that back into the equation as part of the
> > !sched_fair time.
> >
>
> No. There should not be any double accounting with this current
> change. We account softirq processing both at hardirq tail and in
> ksoftirqd as CPU softirq time. It will be taken out of ksoftirqd's
> sched exec time as with any other thread. And it will be taken out of
> the fair time available on the CPU as well. Which to me seems to be
> the right thing to do, as this is more coupled with the CPU and
> ksoftirqd is just giving a context for the softirqs to run.
But ksoftirqd is a SCHED_OTHER task, so by taking it out of its runtime
the scheduler will get all confused.
> Changing only the hardirq tail to have SOFTIRQ_OFFSET and changing
> ksoftirqd to SOFTIRQ_OFFSET*2 would result in these additional
> ksoftirqd softirqs staying as part of sched_fair time.
>
> Or am I totally missing something here?
Please keep ksoftirqd scheduling normal; if people want it to be
another scheduling class, let them change that. But since it's a task,
it should be a normal task and get scheduled like everybody else, not
have some magic properties.
* Re: [PATCH 1/7] si time accounting accounts bh_disable'd time to si -v3
2010-10-03 0:34 ` Peter Zijlstra
@ 2010-10-04 16:54 ` Venkatesh Pallipadi
0 siblings, 0 replies; 29+ messages in thread
From: Venkatesh Pallipadi @ 2010-10-04 16:54 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Balbir Singh,
Martin Schwidefsky, linux-kernel, Paul Turner, Eric Dumazet
On Sat, Oct 2, 2010 at 5:34 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Sat, 2010-10-02 at 08:42 -0700, Venkatesh Pallipadi wrote:
>>
>> > Then make in_serving_softirq() be something like:
>> > (preempt_count() & SOFTIRQ_OFFSET) || (current->flags & PF_SOFTIRQ)
>> >
>>
>> Yes. I would also need in_softirq_at_hardirq_tail() for accounting calls :)
>
> Well, you could open-code it there or something.
>
>> >> Also, ksoftirqd adds to softirq counts, does trace softirq, etc. So,
>> >> it kind of made sense to add the time to the softirq stats as well.
>> >> If we don't account time to softirq stats, then a user looking at,
>> >> say, time per softirq using the softirq count will be misled. No?
>> >
>> > Simply add back the task accounting when you report it?
>> >
>> >> In the other thread you mentioned doing that will cause problems. Were
>> >> you thinking of scheduler issues or are there other problems charging
>> >> softirq time this way?
>> >
>> > Of course there are.. you're double accounting the time of ksoftirqd,
>> > and worse, you're adding that back into the equation as part of the
>> > !sched_fair time.
>> >
>>
>> No. There should not be any double accounting with this current
>> change. We account softirq processing both at hardirq tail and in
>> ksoftirqd as CPU softirq time. It will be taken out of ksoftirqd's
>> sched exec time as with any other thread. And it will be taken out of
>> the fair time available on the CPU as well. Which to me seems to be
>> the right thing to do, as this is more coupled with the CPU and
>> ksoftirqd is just giving a context for the softirqs to run.
>
> But ksoftirqd is a SCHED_OTHER task, so by taking it out of its runtime
> the scheduler will get all confused.
>
>> Changing only the hardirq tail to have SOFTIRQ_OFFSET and changing
>> ksoftirqd to SOFTIRQ_OFFSET*2 would result in these additional
>> ksoftirqd softirqs staying as part of sched_fair time.
>>
>> Or am I totally missing something here?
>
> Please keep ksoftirqd scheduling normal; if people want it to be
> another scheduling class, let them change that. But since it's a task,
> it should be a normal task and get scheduled like everybody else, not
> have some magic properties.
>
OK. In that case we can leave the existing softirq accounting
(!IRQ_TIME_ACCOUNTING) users as they are. They don't have any
ksoftirqd and scheduler issues, so this patch should be good as it is.
I will, however, change the IRQ_TIME_ACCOUNTING case to not take
scheduler time out of ksoftirqd due to softirqs. Sounds good?
Thanks,
Venki
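One way to implement what Venki proposes here is to flag the ksoftirqd
threads and skip them when charging softirq time, so their runtime
stays ordinary task runtime for the scheduler. A sketch; PF_KSOFTIRQD
is an assumed flag name, not something in the tree at this point:

        /* kernel/softirq.c: mark the per-CPU ksoftirqd threads. */
        static int ksoftirqd(void *__bind_cpu)
        {
                current->flags |= PF_KSOFTIRQD;
                ...
        }

        /* In the IRQ_TIME_ACCOUNTING path: charge softirq time only
         * when it is not ksoftirqd's; ksoftirqd keeps its time as
         * ordinary SCHED_OTHER runtime. */
        if (hardirq_count())
                per_cpu(cpu_hardirq_time, cpu) += delta;
        else if (in_serving_softirq() && !(curr->flags & PF_KSOFTIRQD))
                per_cpu(cpu_softirq_time, cpu) += delta;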
Thread overview: 29+ messages
2010-09-29 19:21 Proper kernel irq time accounting -v3 Venkatesh Pallipadi
2010-09-29 19:21 ` [PATCH 1/7] si time accounting accounts bh_disable'd time to si -v3 Venkatesh Pallipadi
2010-09-30 11:04 ` Peter Zijlstra
2010-09-30 16:26 ` Venkatesh Pallipadi
2010-10-01 23:16 ` Peter Zijlstra
2010-10-02 15:42 ` Venkatesh Pallipadi
2010-10-03 0:34 ` Peter Zijlstra
2010-10-04 16:54 ` Venkatesh Pallipadi
2010-09-29 19:21 ` [PATCH 2/7] Consolidate account_system_vtime extern declaration -v3 Venkatesh Pallipadi
2010-09-29 19:21 ` [PATCH 3/7] Add IRQ_TIME_ACCOUNTING, finer accounting of irq time -v3 Venkatesh Pallipadi
2010-09-30 11:06 ` Peter Zijlstra
2010-09-30 16:29 ` Venkatesh Pallipadi
2010-09-30 20:38 ` Venkatesh Pallipadi
2010-10-01 11:46 ` Peter Zijlstra
2010-10-01 16:51 ` Venkatesh Pallipadi
2010-10-01 17:29 ` Venkatesh Pallipadi
2010-10-01 23:14 ` Peter Zijlstra
2010-10-01 23:32 ` Venkatesh Pallipadi
2010-10-02 10:53 ` Peter Zijlstra
2010-10-02 15:26 ` Venkatesh Pallipadi
2010-10-03 0:26 ` Peter Zijlstra
2010-10-01 11:45 ` Peter Zijlstra
2010-09-29 19:21 ` [PATCH 4/7] x86: Add IRQ_TIME_ACCOUNTING in x86 -v3 Venkatesh Pallipadi
2010-09-29 19:21 ` [PATCH 5/7] sched: Do not account irq time to current task -v3 Venkatesh Pallipadi
2010-09-29 19:21 ` [PATCH 6/7] sched: Remove irq time from available CPU power -v3 Venkatesh Pallipadi
2010-09-29 19:21 ` [PATCH 7/7] Export per cpu hardirq and softirq time in proc -v3 Venkatesh Pallipadi
2010-09-30 7:59 ` Proper kernel irq time accounting -v3 Andi Kleen
2010-09-30 16:37 ` Venkatesh Pallipadi
2010-09-30 17:36 ` Andi Kleen