* Proper kernel irq time accounting -v4
@ 2010-10-05 0:03 Venkatesh Pallipadi
2010-10-05 0:03 ` [PATCH 1/8] si time accounting accounts bh_disable'd time to si -v4 Venkatesh Pallipadi
` (10 more replies)
0 siblings, 11 replies; 40+ messages in thread
From: Venkatesh Pallipadi @ 2010-10-05 0:03 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
Balbir Singh, Martin Schwidefsky
Cc: linux-kernel, Paul Turner, Eric Dumazet, Venkatesh Pallipadi
Previous versions:
-v0: http://lkml.indiana.edu/hypermail//linux/kernel/1005.3/00411.html
lkml subject - "Finer granularity and task/cgroup irq time accounting"
-v1: http://lkml.indiana.edu/hypermail//linux/kernel/1007.2/00987.html
lkml subject - "Finer granularity and task/cgroup irq time accounting"
-v2: http://lkml.indiana.edu/hypermail//linux/kernel/1009.2/00488.html
lkml subject - "Proper kernel irq time accounting"
-v3: http://lkml.indiana.edu/hypermail//linux/kernel/1009.3/02482.html
lkml subject - "Proper kernel irq time accounting -v3"
Change from -v3:
- Switch to using sched_clock_cpu instead of sched_clock in
account_system_vtime. This needed tick_check_idle to be called early in
irq_enter.
- Do not account softirq time in ksoftirqd context. We want ksoftirqd to
continue having sched exec_runtime.
- Add a boot option to disable TSC-based irq accounting on x86.
- Remove the patch that exported irq times in /proc/interrupts and
/proc/softirqs. I am looking again at other options to export that and
will have a replacement patch soon.
Description:
Here is some background information about interrupt time accounting in the kernel
and the related problems.
Interrupts always run in the context of the currently running task. Softirqs mostly
run in the context of the currently running task as well, unless ksoftirqd gets involved.
/proc/interrupts and /proc/softirqs are the interfaces that report the number
of interrupts and softirqs per CPU since boot. /proc/stat has fields that
report per CPU and system-wide interrupt and softirq processing time in
clock_t units.
There are two problems with the way interrupts are accounted by the kernel.
(1) Coarse grained interrupt time reporting
On most archs (except s390, powerpc, ia64 with CONFIG_VIRT_CPU_ACCOUNTING),
the interrupt and softirq time reported in /proc/stat is tick sample based.
The kernel looks at what it is doing at each CPU tick and accounts the entire
tick to that particular activity. This means the data in /proc/stat is
pretty coarse grained.
One related problem (at least on x86): with the recent
"Run irq handlers with interrupts disabled" change, the timer interrupt cannot
fire while another interrupt is being serviced [1]. As a result, sampling-based
hardirq accounting in /proc/stat cannot capture any hardirq time at all.
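To illustrate the coarseness, here is a minimal standalone sketch (hypothetical
code, not the kernel's account_process_tick/account_system_time) of how
tick-sample based accounting charges time: the whole tick goes to whatever
context the CPU happens to be in when the timer fires, so irq activity that
starts and finishes between two ticks is never seen.

/* Standalone illustration; the types and the driver loop are made up. */
#include <stdio.h>

enum context { USER, SYSTEM, HARDIRQ, SOFTIRQ };

struct cpu_stats { unsigned long user, system, irq, softirq; };

/* Charge one whole tick to whatever context the CPU is sampled in. */
static void tick_sample_account(struct cpu_stats *stat, enum context ctx,
				unsigned long one_tick)
{
	switch (ctx) {
	case HARDIRQ: stat->irq     += one_tick; break;
	case SOFTIRQ: stat->softirq += one_tick; break;
	case SYSTEM:  stat->system  += one_tick; break;
	default:      stat->user    += one_tick; break;
	}
}

int main(void)
{
	struct cpu_stats stat = { 0 };
	int i;

	/*
	 * Suppose every 10ms tick happens to land in SYSTEM context even
	 * though ~3ms of each tick interval was really spent in hardirqs:
	 * the irq time is invisible to the sampler.
	 */
	for (i = 0; i < 100; i++)
		tick_sample_account(&stat, SYSTEM, 10);
	printf("system=%lums irq=%lums (real irq time was ~300ms)\n",
	       stat.system, stat.irq);
	return 0;
}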
(2) Accounting irq processing time to current task/taskgroup
Whenever irq processing happens, the kernel accounts that time to the currently
running task. The exec_runtime reported in /proc/<pid>/schedstat and
<cgroup>/cpuacct.usage* includes any irq processing that happened while
the task was running. The scheduler vruntime calculations
also account any irq processing to the current task. This means exec time
accounting during heavy irq processing is essentially random, depending on
when and on which CPU the irq processing happens and which task happened to be
running on that CPU at the time.
The solution to (1) involves adding extra timing at irq entry/exit to
get fine-granularity information and then exporting it to userspace.
The following patchset addresses this problem in a way similar to [2][3].
It keeps most of the timing code generic
(CONFIG_IRQ_TIME_ACCOUNTING), based off of sched_clock(), and adds support for
this on x86. This time is not yet exported to userspace; a patch for that is
coming soon.
One partial solution proposed in [2][3] for (2) above was to capture this
interrupt time at the task/taskgroup level and let the user know how much irq
processing time the kernel charged to each task/taskgroup. But that solution
did not address task timeslices still including irq processing time.
Peter Zijlstra and Martin Schwidefsky disagreed with that approach and
wanted a more complete solution that does not account irq processing time
to tasks at all.
The patchset below tries this more complete solution, with two scheduler
related changes. First, take irq processing time out of the time the
scheduler accounts to the task. Second, adjust the CPU
power to account for irq processing activity on the CPU. That in turn
results in an irq-busy CPU pulling only as many tasks as its non-irq
processing bandwidth allows.
These changes are only enabled for CONFIG_IRQ_TIME_ACCOUNTING for now.
Thanks,
Venki
References:
[1] http://lkml.indiana.edu/hypermail//linux/kernel/1005.3/00864.html
lkml subject - "genirq: Run irq handlers with interrupts disabled"
[2] http://lkml.indiana.edu/hypermail//linux/kernel/1005.3/00411.html
lkml subject - "Finer granularity and task/cgroup irq time accounting"
[3] http://lkml.indiana.edu/hypermail//linux/kernel/1007.2/00987.html
lkml subject - "Finer granularity and task/cgroup irq time accounting"
Signed-off-by: Venkatesh Pallipadi <venki@google.com>
^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH 1/8] si time accounting accounts bh_disable'd time to si -v4
2010-10-05 0:03 Proper kernel irq time accounting -v4 Venkatesh Pallipadi
@ 2010-10-05 0:03 ` Venkatesh Pallipadi
2010-10-18 19:24 ` [tip:sched/core] sched: Fix softirq time accounting tip-bot for Venkatesh Pallipadi
2010-10-05 0:03 ` [PATCH 2/8] Consolidate account_system_vtime extern declaration -v4 Venkatesh Pallipadi
` (9 subsequent siblings)
10 siblings, 1 reply; 40+ messages in thread
From: Venkatesh Pallipadi @ 2010-10-05 0:03 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
Balbir Singh, Martin Schwidefsky
Cc: linux-kernel, Paul Turner, Eric Dumazet, Venkatesh Pallipadi
Peter Zijlstra found a bug in the way softirq time is accounted in
VIRT_CPU_ACCOUNTING, in this thread:
http://lkml.indiana.edu/hypermail//linux/kernel/1009.2/01366.html
The problem is that softirq processing uses local_bh_disable internally. There
is no way, later in the flow, to differentiate between a softirq actually being
processed and bh merely being disabled. So, a hardirq arriving while bh
is disabled results in time being wrongly accounted as softirq.
Looking at the code a bit more, the problem exists with !VIRT_CPU_ACCOUNTING
as well, as account_system_time() in normal tick-based accounting also uses
softirq_count, which will be set even when not in softirq but with bh disabled.
Peter also suggested the solution of using 2 * SOFTIRQ_OFFSET as the irq count
for local_bh_{disable,enable} and just SOFTIRQ_OFFSET while processing softirq.
The patch below does that and adds an API, in_serving_softirq(), which
returns whether we are currently processing a softirq.
It also changes one of the usages of softirq_count in net/sched/cls_cgroup.c
to in_serving_softirq.
Many usages of in_softirq look like they really want in_serving_softirq; those
changes can be made individually on a case-by-case basis.
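As a standalone sketch of the counting scheme introduced below (offset values
match include/linux/hardirq.h with the usual 8-bit softirq field; the
preempt_count handling here is simulated, not the kernel's):

/* Simulated preempt_count to show how the two offsets disambiguate
 * "bh disabled" from "actually serving softirq". */
#include <assert.h>

#define SOFTIRQ_SHIFT		8
#define SOFTIRQ_OFFSET		(1UL << SOFTIRQ_SHIFT)
#define SOFTIRQ_DISABLE_OFFSET	(2 * SOFTIRQ_OFFSET)
#define SOFTIRQ_MASK		(0xffUL << SOFTIRQ_SHIFT)

static unsigned long preempt_count;

#define softirq_count()		(preempt_count & SOFTIRQ_MASK)
#define in_softirq()		(softirq_count())
#define in_serving_softirq()	(softirq_count() & SOFTIRQ_OFFSET)

int main(void)
{
	/* local_bh_disable() alone: bh is off, but no softirq is running */
	preempt_count += SOFTIRQ_DISABLE_OFFSET;
	assert(in_softirq() && !in_serving_softirq());
	preempt_count -= SOFTIRQ_DISABLE_OFFSET;

	/* __do_softirq(): actually serving a softirq */
	preempt_count += SOFTIRQ_OFFSET;
	assert(in_softirq() && in_serving_softirq());

	/* a nested local_bh_disable() inside softirq still reads as serving */
	preempt_count += SOFTIRQ_DISABLE_OFFSET;
	assert(in_serving_softirq());
	return 0;
}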
Signed-off-by: Venkatesh Pallipadi <venki@google.com>
---
include/linux/hardirq.h | 5 ++++
include/linux/sched.h | 6 ++--
kernel/sched.c | 2 +-
kernel/softirq.c | 51 +++++++++++++++++++++++++++++++---------------
net/sched/cls_cgroup.c | 2 +-
5 files changed, 44 insertions(+), 22 deletions(-)
diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index d5b3876..e37a77c 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -64,6 +64,8 @@
#define HARDIRQ_OFFSET (1UL << HARDIRQ_SHIFT)
#define NMI_OFFSET (1UL << NMI_SHIFT)
+#define SOFTIRQ_DISABLE_OFFSET (2 * SOFTIRQ_OFFSET)
+
#ifndef PREEMPT_ACTIVE
#define PREEMPT_ACTIVE_BITS 1
#define PREEMPT_ACTIVE_SHIFT (NMI_SHIFT + NMI_BITS)
@@ -82,10 +84,13 @@
/*
* Are we doing bottom half or hardware interrupt processing?
* Are we in a softirq context? Interrupt context?
+ * in_softirq - Are we currently processing softirq or have bh disabled?
+ * in_serving_softirq - Are we currently processing softirq?
*/
#define in_irq() (hardirq_count())
#define in_softirq() (softirq_count())
#define in_interrupt() (irq_count())
+#define in_serving_softirq() (softirq_count() & SOFTIRQ_OFFSET)
/*
* Are we in NMI context?
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1e2a6db..126457e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2367,9 +2367,9 @@ extern int __cond_resched_lock(spinlock_t *lock);
extern int __cond_resched_softirq(void);
-#define cond_resched_softirq() ({ \
- __might_sleep(__FILE__, __LINE__, SOFTIRQ_OFFSET); \
- __cond_resched_softirq(); \
+#define cond_resched_softirq() ({ \
+ __might_sleep(__FILE__, __LINE__, SOFTIRQ_DISABLE_OFFSET); \
+ __cond_resched_softirq(); \
})
/*
diff --git a/kernel/sched.c b/kernel/sched.c
index dc85ceb..b6e714b 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3397,7 +3397,7 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
tmp = cputime_to_cputime64(cputime);
if (hardirq_count() - hardirq_offset)
cpustat->irq = cputime64_add(cpustat->irq, tmp);
- else if (softirq_count())
+ else if (in_serving_softirq())
cpustat->softirq = cputime64_add(cpustat->softirq, tmp);
else
cpustat->system = cputime64_add(cpustat->system, tmp);
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 07b4f1b..988dfbe 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -77,11 +77,21 @@ void wakeup_softirqd(void)
}
/*
+ * preempt_count and SOFTIRQ_OFFSET usage:
+ * - preempt_count is changed by SOFTIRQ_OFFSET on entering or leaving
+ * softirq processing.
+ * - preempt_count is changed by SOFTIRQ_DISABLE_OFFSET (= 2 * SOFTIRQ_OFFSET)
+ * on local_bh_disable or local_bh_enable.
+ * This lets us distinguish between whether we are currently processing
+ * softirq and whether we just have bh disabled.
+ */
+
+/*
* This one is for softirq.c-internal use,
* where hardirqs are disabled legitimately:
*/
#ifdef CONFIG_TRACE_IRQFLAGS
-static void __local_bh_disable(unsigned long ip)
+static void __local_bh_disable(unsigned long ip, unsigned int cnt)
{
unsigned long flags;
@@ -95,32 +105,43 @@ static void __local_bh_disable(unsigned long ip)
* We must manually increment preempt_count here and manually
* call the trace_preempt_off later.
*/
- preempt_count() += SOFTIRQ_OFFSET;
+ preempt_count() += cnt;
/*
* Were softirqs turned off above:
*/
- if (softirq_count() == SOFTIRQ_OFFSET)
+ if (softirq_count() == cnt)
trace_softirqs_off(ip);
raw_local_irq_restore(flags);
- if (preempt_count() == SOFTIRQ_OFFSET)
+ if (preempt_count() == cnt)
trace_preempt_off(CALLER_ADDR0, get_parent_ip(CALLER_ADDR1));
}
#else /* !CONFIG_TRACE_IRQFLAGS */
-static inline void __local_bh_disable(unsigned long ip)
+static inline void __local_bh_disable(unsigned long ip, unsigned int cnt)
{
- add_preempt_count(SOFTIRQ_OFFSET);
+ add_preempt_count(cnt);
barrier();
}
#endif /* CONFIG_TRACE_IRQFLAGS */
void local_bh_disable(void)
{
- __local_bh_disable((unsigned long)__builtin_return_address(0));
+ __local_bh_disable((unsigned long)__builtin_return_address(0),
+ SOFTIRQ_DISABLE_OFFSET);
}
EXPORT_SYMBOL(local_bh_disable);
+static void __local_bh_enable(unsigned int cnt)
+{
+ WARN_ON_ONCE(in_irq());
+ WARN_ON_ONCE(!irqs_disabled());
+
+ if (softirq_count() == cnt)
+ trace_softirqs_on((unsigned long)__builtin_return_address(0));
+ sub_preempt_count(cnt);
+}
+
/*
* Special-case - softirqs can safely be enabled in
* cond_resched_softirq(), or by __do_softirq(),
@@ -128,12 +149,7 @@ EXPORT_SYMBOL(local_bh_disable);
*/
void _local_bh_enable(void)
{
- WARN_ON_ONCE(in_irq());
- WARN_ON_ONCE(!irqs_disabled());
-
- if (softirq_count() == SOFTIRQ_OFFSET)
- trace_softirqs_on((unsigned long)__builtin_return_address(0));
- sub_preempt_count(SOFTIRQ_OFFSET);
+ __local_bh_enable(SOFTIRQ_DISABLE_OFFSET);
}
EXPORT_SYMBOL(_local_bh_enable);
@@ -147,13 +163,13 @@ static inline void _local_bh_enable_ip(unsigned long ip)
/*
* Are softirqs going to be turned on now:
*/
- if (softirq_count() == SOFTIRQ_OFFSET)
+ if (softirq_count() == SOFTIRQ_DISABLE_OFFSET)
trace_softirqs_on(ip);
/*
* Keep preemption disabled until we are done with
* softirq processing:
*/
- sub_preempt_count(SOFTIRQ_OFFSET - 1);
+ sub_preempt_count(SOFTIRQ_DISABLE_OFFSET - 1);
if (unlikely(!in_interrupt() && local_softirq_pending()))
do_softirq();
@@ -198,7 +214,8 @@ asmlinkage void __do_softirq(void)
pending = local_softirq_pending();
account_system_vtime(current);
- __local_bh_disable((unsigned long)__builtin_return_address(0));
+ __local_bh_disable((unsigned long)__builtin_return_address(0),
+ SOFTIRQ_OFFSET);
lockdep_softirq_enter();
cpu = smp_processor_id();
@@ -245,7 +262,7 @@ restart:
lockdep_softirq_exit();
account_system_vtime(current);
- _local_bh_enable();
+ __local_bh_enable(SOFTIRQ_OFFSET);
}
#ifndef __ARCH_HAS_DO_SOFTIRQ
diff --git a/net/sched/cls_cgroup.c b/net/sched/cls_cgroup.c
index 78ef2c5..37dff78 100644
--- a/net/sched/cls_cgroup.c
+++ b/net/sched/cls_cgroup.c
@@ -123,7 +123,7 @@ static int cls_cgroup_classify(struct sk_buff *skb, struct tcf_proto *tp,
* calls by looking at the number of nested bh disable calls because
* softirqs always disables bh.
*/
- if (softirq_count() != SOFTIRQ_OFFSET) {
+ if (in_serving_softirq()) {
/* If there is an sk_classid we'll use that. */
if (!skb->sk)
return -1;
--
1.7.1
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH 2/8] Consolidate account_system_vtime extern declaration -v4
2010-10-05 0:03 Proper kernel irq time accounting -v4 Venkatesh Pallipadi
2010-10-05 0:03 ` [PATCH 1/8] si time accounting accounts bh_disable'd time to si -v4 Venkatesh Pallipadi
@ 2010-10-05 0:03 ` Venkatesh Pallipadi
2010-10-18 19:24 ` [tip:sched/core] sched: Consolidate account_system_vtime extern declaration tip-bot for Venkatesh Pallipadi
2010-10-18 19:27 ` [tip:sched/core] sched: Export account_system_vtime() tip-bot for Ingo Molnar
2010-10-05 0:03 ` [PATCH 3/8] Add a PF flag for ksoftirqd identification Venkatesh Pallipadi
` (8 subsequent siblings)
10 siblings, 2 replies; 40+ messages in thread
From: Venkatesh Pallipadi @ 2010-10-05 0:03 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
Balbir Singh, Martin Schwidefsky
Cc: linux-kernel, Paul Turner, Eric Dumazet, Venkatesh Pallipadi
Just a minor cleanup patch that makes things easier for the following patches.
No functionality change in this patch.
Signed-off-by: Venkatesh Pallipadi <venki@google.com>
---
arch/ia64/include/asm/system.h | 4 ----
arch/powerpc/include/asm/system.h | 4 ----
arch/s390/include/asm/system.h | 1 -
include/linux/hardirq.h | 2 ++
4 files changed, 2 insertions(+), 9 deletions(-)
diff --git a/arch/ia64/include/asm/system.h b/arch/ia64/include/asm/system.h
index 9f342a5..dd028f2 100644
--- a/arch/ia64/include/asm/system.h
+++ b/arch/ia64/include/asm/system.h
@@ -272,10 +272,6 @@ void cpu_idle_wait(void);
void default_idle(void);
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING
-extern void account_system_vtime(struct task_struct *);
-#endif
-
#endif /* __KERNEL__ */
#endif /* __ASSEMBLY__ */
diff --git a/arch/powerpc/include/asm/system.h b/arch/powerpc/include/asm/system.h
index 6c294ac..9c3d160 100644
--- a/arch/powerpc/include/asm/system.h
+++ b/arch/powerpc/include/asm/system.h
@@ -542,10 +542,6 @@ extern void reloc_got2(unsigned long);
#define PTRRELOC(x) ((typeof(x)) add_reloc_offset((unsigned long)(x)))
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING
-extern void account_system_vtime(struct task_struct *);
-#endif
-
extern struct dentry *powerpc_debugfs_root;
#endif /* __KERNEL__ */
diff --git a/arch/s390/include/asm/system.h b/arch/s390/include/asm/system.h
index cef6621..38ddd8a 100644
--- a/arch/s390/include/asm/system.h
+++ b/arch/s390/include/asm/system.h
@@ -97,7 +97,6 @@ static inline void restore_access_regs(unsigned int *acrs)
extern void account_vtime(struct task_struct *, struct task_struct *);
extern void account_tick_vtime(struct task_struct *);
-extern void account_system_vtime(struct task_struct *);
#ifdef CONFIG_PFAULT
extern void pfault_irq_init(void);
diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index e37a77c..41367c5 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -141,6 +141,8 @@ struct task_struct;
static inline void account_system_vtime(struct task_struct *tsk)
{
}
+#else
+extern void account_system_vtime(struct task_struct *tsk);
#endif
#if defined(CONFIG_NO_HZ)
--
1.7.1
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH 3/8] Add a PF flag for ksoftirqd identification
2010-10-05 0:03 Proper kernel irq time accounting -v4 Venkatesh Pallipadi
2010-10-05 0:03 ` [PATCH 1/8] si time accounting accounts bh_disable'd time to si -v4 Venkatesh Pallipadi
2010-10-05 0:03 ` [PATCH 2/8] Consolidate account_system_vtime extern declaration -v4 Venkatesh Pallipadi
@ 2010-10-05 0:03 ` Venkatesh Pallipadi
2010-10-15 14:26 ` Peter Zijlstra
` (2 more replies)
2010-10-05 0:03 ` [PATCH 4/8] Add IRQ_TIME_ACCOUNTING, finer accounting of irq time -v4 Venkatesh Pallipadi
` (7 subsequent siblings)
10 siblings, 3 replies; 40+ messages in thread
From: Venkatesh Pallipadi @ 2010-10-05 0:03 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
Balbir Singh, Martin Schwidefsky
Cc: linux-kernel, Paul Turner, Eric Dumazet, Venkatesh Pallipadi
To account softirq time cleanly in the scheduler, we need to identify whether
a softirq is invoked in ksoftirqd context or at the tail of a hardirq.
Add PF_KSOFTIRQD for that purpose.
As all PF flag bits are currently taken, create space by moving one of the
infrequently used bits (PF_THREAD_BOUND) down into task_struct to live along
with some other state fields.
Signed-off-by: Venkatesh Pallipadi <venki@google.com>
---
include/linux/sched.h | 3 ++-
kernel/cpuset.c | 2 +-
kernel/kthread.c | 2 +-
kernel/sched.c | 2 +-
kernel/softirq.c | 1 +
kernel/workqueue.c | 6 +++---
6 files changed, 9 insertions(+), 7 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 126457e..43064cd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1234,6 +1234,7 @@ struct task_struct {
/* Revert to default priority/policy when forking */
unsigned sched_reset_on_fork:1;
+ unsigned sched_thread_bound:1; /* Thread bound to specific cpu */
pid_t pid;
pid_t tgid;
@@ -1708,7 +1709,7 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
#define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */
#define PF_SPREAD_PAGE 0x01000000 /* Spread page cache over cpuset */
#define PF_SPREAD_SLAB 0x02000000 /* Spread some slab caches over cpuset */
-#define PF_THREAD_BOUND 0x04000000 /* Thread bound to specific cpu */
+#define PF_KSOFTIRQD 0x04000000 /* I am ksoftirqd */
#define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */
#define PF_MEMPOLICY 0x10000000 /* Non-default NUMA mempolicy */
#define PF_MUTEX_TESTER 0x20000000 /* Thread belongs to the rt mutex tester */
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index b23c097..8a2eb02 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1394,7 +1394,7 @@ static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cont,
* set_cpus_allowed_ptr() on all attached tasks before cpus_allowed may
* be changed.
*/
- if (tsk->flags & PF_THREAD_BOUND)
+ if (tsk->sched_thread_bound)
return -EINVAL;
ret = security_task_setscheduler(tsk, 0, NULL);
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 2dc3786..6b51a4c 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -185,7 +185,7 @@ void kthread_bind(struct task_struct *p, unsigned int cpu)
p->cpus_allowed = cpumask_of_cpu(cpu);
p->rt.nr_cpus_allowed = 1;
- p->flags |= PF_THREAD_BOUND;
+ p->sched_thread_bound = 1;
}
EXPORT_SYMBOL(kthread_bind);
diff --git a/kernel/sched.c b/kernel/sched.c
index b6e714b..c13fae6 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5464,7 +5464,7 @@ again:
goto out;
}
- if (unlikely((p->flags & PF_THREAD_BOUND) && p != current &&
+ if (unlikely(p->sched_thread_bound && p != current &&
!cpumask_equal(&p->cpus_allowed, new_mask))) {
ret = -EINVAL;
goto out;
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 988dfbe..267f7b7 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -713,6 +713,7 @@ static int run_ksoftirqd(void * __bind_cpu)
{
set_current_state(TASK_INTERRUPTIBLE);
+ current->flags |= PF_KSOFTIRQD;
while (!kthread_should_stop()) {
preempt_disable();
if (!local_softirq_pending()) {
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index f77afd9..7146ee6 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1340,12 +1340,12 @@ static struct worker *create_worker(struct global_cwq *gcwq, bool bind)
/*
* A rogue worker will become a regular one if CPU comes
* online later on. Make sure every worker has
- * PF_THREAD_BOUND set.
+ * sched_thread_bound set.
*/
if (bind && !on_unbound_cpu)
kthread_bind(worker->task, gcwq->cpu);
else {
- worker->task->flags |= PF_THREAD_BOUND;
+ worker->task->sched_thread_bound = 1;
if (on_unbound_cpu)
worker->flags |= WORKER_UNBOUND;
}
@@ -2817,7 +2817,7 @@ struct workqueue_struct *__alloc_workqueue_key(const char *name,
if (IS_ERR(rescuer->task))
goto err;
- rescuer->task->flags |= PF_THREAD_BOUND;
+ rescuer->task->sched_thread_bound = 1;
wake_up_process(rescuer->task);
}
--
1.7.1
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH 4/8] Add IRQ_TIME_ACCOUNTING, finer accounting of irq time -v4
2010-10-05 0:03 Proper kernel irq time accounting -v4 Venkatesh Pallipadi
` (2 preceding siblings ...)
2010-10-05 0:03 ` [PATCH 3/8] Add a PF flag for ksoftirqd identification Venkatesh Pallipadi
@ 2010-10-05 0:03 ` Venkatesh Pallipadi
2010-10-15 14:28 ` Peter Zijlstra
2010-10-18 19:25 ` [tip:sched/core] sched: Add IRQ_TIME_ACCOUNTING, finer accounting of irq time tip-bot for Venkatesh Pallipadi
2010-10-05 0:03 ` [PATCH 5/8] x86: Add IRQ_TIME_ACCOUNTING in x86 -v4 Venkatesh Pallipadi
` (6 subsequent siblings)
10 siblings, 2 replies; 40+ messages in thread
From: Venkatesh Pallipadi @ 2010-10-05 0:03 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
Balbir Singh, Martin Schwidefsky
Cc: linux-kernel, Paul Turner, Eric Dumazet, Venkatesh Pallipadi
s390/powerpc/ia64 have support for CONFIG_VIRT_CPU_ACCOUNTING, which does
fine-granularity accounting of user, system, hardirq and softirq times.
Adding that option on archs like x86 would be challenging, however, given the
state of TSC reliability on various platforms and the overhead it would
add on syscall entry/exit.
Instead, add a lighter variant that only does finer accounting of
hardirq and softirq times, providing precise irq times (instead of timer-tick
based samples). This accounting is added under a new config option,
CONFIG_IRQ_TIME_ACCOUNTING, so that there is no overhead for users not
willing to pay the performance penalty.
This accounting is based on sched_clock, and the code is generic, so
other archs may find it useful as well.
This patch just adds the core logic and does not enable it yet.
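For context, the per-cpu deltas add up because account_system_vtime() is
already called at every hardirq boundary (from __irq_enter/__irq_exit in
hardirq.h) and around softirq processing in __do_softirq(). A rough
standalone sketch of that bookkeeping, with a made-up driver in main()
rather than the real call sequence:

#include <stdio.h>
#include <inttypes.h>
#include <stdbool.h>

static uint64_t irq_start_time, cpu_hardirq_time;

/* Mirrors the attribution in account_system_vtime(): time since the last
 * transition is charged to hardirq time only if we are still inside a
 * hardirq when the call happens. */
static void account_transition(uint64_t now, bool in_hardirq)
{
	uint64_t delta = now - irq_start_time;

	irq_start_time = now;
	if (in_hardirq)
		cpu_hardirq_time += delta;
}

int main(void)
{
	/* irq entry at t=1000ns: hardirq count not yet raised, nothing charged */
	account_transition(1000, false);
	/* irq exit at t=1300ns: still in hardirq, so 300ns is charged */
	account_transition(1300, true);
	printf("hardirq time = %" PRIu64 " ns\n", cpu_hardirq_time);
	return 0;
}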
Signed-off-by: Venkatesh Pallipadi <venki@google.com>
---
include/linux/hardirq.h | 2 +-
include/linux/sched.h | 13 ++++++++++++
kernel/sched.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 63 insertions(+), 1 deletions(-)
diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index 41367c5..ff43e92 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -137,7 +137,7 @@ extern void synchronize_irq(unsigned int irq);
struct task_struct;
-#ifndef CONFIG_VIRT_CPU_ACCOUNTING
+#if !defined(CONFIG_VIRT_CPU_ACCOUNTING) && !defined(CONFIG_IRQ_TIME_ACCOUNTING)
static inline void account_system_vtime(struct task_struct *tsk)
{
}
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 43064cd..918a33b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1827,6 +1827,19 @@ extern void sched_clock_idle_sleep_event(void);
extern void sched_clock_idle_wakeup_event(u64 delta_ns);
#endif
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+/*
+ * An i/f to runtime opt-in for irq time accounting based off of sched_clock.
+ * The reason for this explicit opt-in is not to have perf penalty with
+ * slow sched_clocks.
+ */
+extern void enable_sched_clock_irqtime(void);
+extern void disable_sched_clock_irqtime(void);
+#else
+static inline void enable_sched_clock_irqtime(void) {}
+static inline void disable_sched_clock_irqtime(void) {}
+#endif
+
extern unsigned long long
task_sched_runtime(struct task_struct *task);
extern unsigned long long thread_group_sched_runtime(struct task_struct *task);
diff --git a/kernel/sched.c b/kernel/sched.c
index c13fae6..4d8bbed 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1917,6 +1917,55 @@ static void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
dec_nr_running(rq);
}
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+
+static DEFINE_PER_CPU(u64, cpu_hardirq_time);
+static DEFINE_PER_CPU(u64, cpu_softirq_time);
+
+static DEFINE_PER_CPU(u64, irq_start_time);
+static int sched_clock_irqtime;
+
+void enable_sched_clock_irqtime(void)
+{
+ sched_clock_irqtime = 1;
+}
+
+void disable_sched_clock_irqtime(void)
+{
+ sched_clock_irqtime = 0;
+}
+
+void account_system_vtime(struct task_struct *curr)
+{
+ unsigned long flags;
+ int cpu;
+ u64 now, delta;
+
+ if (!sched_clock_irqtime)
+ return;
+
+ local_irq_save(flags);
+
+ now = sched_clock();
+ cpu = smp_processor_id();
+ delta = now - per_cpu(irq_start_time, cpu);
+ per_cpu(irq_start_time, cpu) = now;
+ /*
+ * We do not account for softirq time from ksoftirqd here.
+ * We want to continue accounting softirq time to ksoftirqd thread
+ * in that case, so as not to confuse scheduler with a special task
+ * that do not consume any time, but still wants to run.
+ */
+ if (hardirq_count())
+ per_cpu(cpu_hardirq_time, cpu) += delta;
+ else if (in_serving_softirq() && !(curr->flags & PF_KSOFTIRQD))
+ per_cpu(cpu_softirq_time, cpu) += delta;
+
+ local_irq_restore(flags);
+}
+
+#endif
+
#include "sched_idletask.c"
#include "sched_fair.c"
#include "sched_rt.c"
--
1.7.1
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH 5/8] x86: Add IRQ_TIME_ACCOUNTING in x86 -v4
2010-10-05 0:03 Proper kernel irq time accounting -v4 Venkatesh Pallipadi
` (3 preceding siblings ...)
2010-10-05 0:03 ` [PATCH 4/8] Add IRQ_TIME_ACCOUNTING, finer accounting of irq time -v4 Venkatesh Pallipadi
@ 2010-10-05 0:03 ` Venkatesh Pallipadi
2010-10-15 14:38 ` Peter Zijlstra
2010-10-18 19:26 ` [tip:sched/core] x86: Add IRQ_TIME_ACCOUNTING tip-bot for Venkatesh Pallipadi
2010-10-05 0:03 ` [PATCH 6/8] sched: Do not account irq time to current task -v4 Venkatesh Pallipadi
` (5 subsequent siblings)
10 siblings, 2 replies; 40+ messages in thread
From: Venkatesh Pallipadi @ 2010-10-05 0:03 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
Balbir Singh, Martin Schwidefsky
Cc: linux-kernel, Paul Turner, Eric Dumazet, Venkatesh Pallipadi
This patch adds the IRQ_TIME_ACCOUNTING option on x86 and enables it at runtime
when the TSC is enabled.
This change just enables fine-grained irq time accounting; the data isn't used yet.
Following patches use it for different purposes.
Signed-off-by: Venkatesh Pallipadi <venki@google.com>
---
Documentation/kernel-parameters.txt | 4 ++++
arch/x86/Kconfig | 11 +++++++++++
arch/x86/kernel/tsc.c | 8 ++++++++
3 files changed, 23 insertions(+), 0 deletions(-)
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 8dd7248..ed05a4a 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2435,6 +2435,10 @@ and is between 256 and 4096 characters. It is defined in the file
disables clocksource verification at runtime.
Used to enable high-resolution timer mode on older
hardware, and in virtualized environment.
+ [x86] noirqtime: Do not use TSC to do irq accounting.
+ Used to run time disable IRQ_TIME_ACCOUNTING on any
+ platforms where RDTSC is slow and this accounting
+ can add overhead.
turbografx.map[2|3]= [HW,JOY]
TurboGraFX parallel port interface
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index cea0cd9..f4c70c2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -795,6 +795,17 @@ config SCHED_MC
making when dealing with multi-core CPU chips at a cost of slightly
increased overhead in some places. If unsure say N here.
+config IRQ_TIME_ACCOUNTING
+ bool "Fine granularity task level IRQ time accounting"
+ default n
+ ---help---
+ Select this option to enable fine granularity task irq time
+ accounting. This is done by reading a timestamp on each
+ transitions between softirq and hardirq state, so there can be a
+ small performance impact.
+
+ If in doubt, say N here.
+
source "kernel/Kconfig.preempt"
config X86_UP_APIC
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 26a863a..a1c2cd7 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -104,10 +104,14 @@ int __init notsc_setup(char *str)
__setup("notsc", notsc_setup);
+static int no_sched_irq_time;
+
static int __init tsc_setup(char *str)
{
if (!strcmp(str, "reliable"))
tsc_clocksource_reliable = 1;
+ if (!strncmp(str, "noirqtime", 9))
+ no_sched_irq_time = 1;
return 1;
}
@@ -801,6 +805,7 @@ void mark_tsc_unstable(char *reason)
if (!tsc_unstable) {
tsc_unstable = 1;
sched_clock_stable = 0;
+ disable_sched_clock_irqtime();
printk(KERN_INFO "Marking TSC unstable due to %s\n", reason);
/* Change only the rating, when not registered */
if (clocksource_tsc.mult)
@@ -987,6 +992,9 @@ void __init tsc_init(void)
/* now allow native_sched_clock() to use rdtsc */
tsc_disabled = 0;
+ if (!no_sched_irq_time)
+ enable_sched_clock_irqtime();
+
lpj = ((u64)tsc_khz * 1000);
do_div(lpj, HZ);
lpj_fine = lpj;
--
1.7.1
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH 6/8] sched: Do not account irq time to current task -v4
2010-10-05 0:03 Proper kernel irq time accounting -v4 Venkatesh Pallipadi
` (4 preceding siblings ...)
2010-10-05 0:03 ` [PATCH 5/8] x86: Add IRQ_TIME_ACCOUNTING in x86 -v4 Venkatesh Pallipadi
@ 2010-10-05 0:03 ` Venkatesh Pallipadi
2010-10-18 19:26 ` [tip:sched/core] sched: Do not account irq time to current task tip-bot for Venkatesh Pallipadi
2010-10-05 0:03 ` [PATCH 7/8] sched: Remove irq time from available CPU power -v4 Venkatesh Pallipadi
` (4 subsequent siblings)
10 siblings, 1 reply; 40+ messages in thread
From: Venkatesh Pallipadi @ 2010-10-05 0:03 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
Balbir Singh, Martin Schwidefsky
Cc: linux-kernel, Paul Turner, Eric Dumazet, Venkatesh Pallipadi
The scheduler accounts both softirq and interrupt processing times to the
currently running task. This means that if the interrupt processing was
on behalf of some other task in the system, the current task ends up being
penalized, as it gets a shorter runtime than it otherwise would.
Change scheduler task accounting to account only actual task time to the
currently running task. update_curr() now computes delta_exec based on
rq->clock_task.
Note that this change only handles the CONFIG_IRQ_TIME_ACCOUNTING case. We can
extend this to CONFIG_VIRT_CPU_ACCOUNTING with minimal effort, but that's
for later.
This change will impact scheduling behavior in interrupt-heavy conditions.
Tested on a 4-way system with eth0 handled by CPU 2 and a network-heavy
task (nc) running on CPU 3 (and no RSS/RFS). With that, CPU 2
spends 75%+ of its time in irq processing and CPU 3 spends around 35% of its
time running the nc task.
Now, if I run another CPU-intensive task on CPU 2, without this change
/proc/<pid>/schedstat shows 100% of the time accounted to this task. With this
change, it rightly shows less than 25% accounted to this task, as the remaining
time is actually spent on irq processing.
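A standalone sketch of the rq->clock vs rq->clock_task split this patch
introduces (simplified; the real code is in the update_rq_clock() hunk below),
with made-up numbers:

#include <stdio.h>
#include <inttypes.h>

struct rq_sketch {
	uint64_t clock;		/* wall time from sched_clock_cpu() */
	uint64_t clock_task;	/* wall time minus accumulated irq time */
};

/* Mirrors the patched update_rq_clock(): clock_task only advances by the
 * portion of wall time that was not spent in hard/soft irq context. */
static void update_rq_clock_sketch(struct rq_sketch *rq,
				   uint64_t now, uint64_t total_irq_time)
{
	rq->clock = now;
	if (now - total_irq_time > rq->clock_task)
		rq->clock_task = now - total_irq_time;
}

int main(void)
{
	struct rq_sketch rq = { 0, 0 };

	/* 10ms of wall time elapse; 3ms of it was irq processing on this CPU */
	update_rq_clock_sketch(&rq, 10000000, 3000000);
	/* update_curr() now bases delta_exec on clock_task, so the current
	 * task is charged 7ms instead of the full 10ms */
	printf("clock=%" PRIu64 "ns clock_task=%" PRIu64 "ns\n",
	       rq.clock, rq.clock_task);
	return 0;
}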
Signed-off-by: Venkatesh Pallipadi <venki@google.com>
---
kernel/sched.c | 43 ++++++++++++++++++++++++++++++++++++++++---
kernel/sched_fair.c | 6 +++---
kernel/sched_rt.c | 8 ++++----
3 files changed, 47 insertions(+), 10 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 4d8bbed..12e9d70 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -493,6 +493,7 @@ struct rq {
struct mm_struct *prev_mm;
u64 clock;
+ u64 clock_task;
atomic_t nr_iowait;
@@ -643,10 +644,19 @@ static inline struct task_group *task_group(struct task_struct *p)
#endif /* CONFIG_CGROUP_SCHED */
+static u64 irq_time_cpu(int cpu);
+
inline void update_rq_clock(struct rq *rq)
{
- if (!rq->skip_clock_update)
- rq->clock = sched_clock_cpu(cpu_of(rq));
+ if (!rq->skip_clock_update) {
+ int cpu = cpu_of(rq);
+ u64 irq_time;
+
+ rq->clock = sched_clock_cpu(cpu);
+ irq_time = irq_time_cpu(cpu);
+ if (rq->clock - irq_time > rq->clock_task)
+ rq->clock_task = rq->clock - irq_time;
+ }
}
/*
@@ -1919,6 +1929,18 @@ static void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+/*
+ * There are no locks covering percpu hardirq/softirq time.
+ * They are only modified in account_system_vtime, on corresponding CPU
+ * with interrupts disabled. So, writes are safe.
+ * They are read and saved off onto struct rq in update_rq_clock().
+ * This may result in other CPU reading this CPU's irq time and can
+ * race with irq/account_system_vtime on this CPU. We would either get old
+ * or new value (or semi updated value on 32 bit) with a side effect of
+ * accounting a slice of irq time to wrong task when irq is in progress
+ * while we read rq->clock. That is a worthy compromise in place of having
+ * locks on each irq in account_system_time.
+ */
static DEFINE_PER_CPU(u64, cpu_hardirq_time);
static DEFINE_PER_CPU(u64, cpu_softirq_time);
@@ -1935,6 +1957,14 @@ void disable_sched_clock_irqtime(void)
sched_clock_irqtime = 0;
}
+static u64 irq_time_cpu(int cpu)
+{
+ if (!sched_clock_irqtime)
+ return 0;
+
+ return per_cpu(cpu_softirq_time, cpu) + per_cpu(cpu_hardirq_time, cpu);
+}
+
void account_system_vtime(struct task_struct *curr)
{
unsigned long flags;
@@ -1964,6 +1994,13 @@ void account_system_vtime(struct task_struct *curr)
local_irq_restore(flags);
}
+#else
+
+static u64 irq_time_cpu(int cpu)
+{
+ return 0;
+}
+
#endif
#include "sched_idletask.c"
@@ -3297,7 +3334,7 @@ static u64 do_task_delta_exec(struct task_struct *p, struct rq *rq)
if (task_current(rq, p)) {
update_rq_clock(rq);
- ns = rq->clock - p->se.exec_start;
+ ns = rq->clock_task - p->se.exec_start;
if ((s64)ns < 0)
ns = 0;
}
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index db3f674..0baa696 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -519,7 +519,7 @@ __update_curr(struct cfs_rq *cfs_rq, struct sched_entity *curr,
static void update_curr(struct cfs_rq *cfs_rq)
{
struct sched_entity *curr = cfs_rq->curr;
- u64 now = rq_of(cfs_rq)->clock;
+ u64 now = rq_of(cfs_rq)->clock_task;
unsigned long delta_exec;
if (unlikely(!curr))
@@ -602,7 +602,7 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
/*
* We are starting a new run period:
*/
- se->exec_start = rq_of(cfs_rq)->clock;
+ se->exec_start = rq_of(cfs_rq)->clock_task;
}
/**************************************************
@@ -1798,7 +1798,7 @@ int can_migrate_task(struct task_struct *p, struct rq *rq, int this_cpu,
* 2) too many balance attempts have failed.
*/
- tsk_cache_hot = task_hot(p, rq->clock, sd);
+ tsk_cache_hot = task_hot(p, rq->clock_task, sd);
if (!tsk_cache_hot ||
sd->nr_balance_failed > sd->cache_nice_tries) {
#ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index d10c80e..bb4027a 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -609,7 +609,7 @@ static void update_curr_rt(struct rq *rq)
if (!task_has_rt_policy(curr))
return;
- delta_exec = rq->clock - curr->se.exec_start;
+ delta_exec = rq->clock_task - curr->se.exec_start;
if (unlikely((s64)delta_exec < 0))
delta_exec = 0;
@@ -618,7 +618,7 @@ static void update_curr_rt(struct rq *rq)
curr->se.sum_exec_runtime += delta_exec;
account_group_exec_runtime(curr, delta_exec);
- curr->se.exec_start = rq->clock;
+ curr->se.exec_start = rq->clock_task;
cpuacct_charge(curr, delta_exec);
sched_rt_avg_update(rq, delta_exec);
@@ -1074,7 +1074,7 @@ static struct task_struct *_pick_next_task_rt(struct rq *rq)
} while (rt_rq);
p = rt_task_of(rt_se);
- p->se.exec_start = rq->clock;
+ p->se.exec_start = rq->clock_task;
return p;
}
@@ -1709,7 +1709,7 @@ static void set_curr_task_rt(struct rq *rq)
{
struct task_struct *p = rq->curr;
- p->se.exec_start = rq->clock;
+ p->se.exec_start = rq->clock_task;
/* The running task is never eligible for pushing */
dequeue_pushable_task(rq, p);
--
1.7.1
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH 7/8] sched: Remove irq time from available CPU power -v4
2010-10-05 0:03 Proper kernel irq time accounting -v4 Venkatesh Pallipadi
` (5 preceding siblings ...)
2010-10-05 0:03 ` [PATCH 6/8] sched: Do not account irq time to current task -v4 Venkatesh Pallipadi
@ 2010-10-05 0:03 ` Venkatesh Pallipadi
2010-10-18 19:26 ` [tip:sched/core] sched: Remove irq time from available CPU power tip-bot for Venkatesh Pallipadi
2010-10-05 0:03 ` [PATCH 8/8] Call tick_check_idle before __irq_enter Venkatesh Pallipadi
` (3 subsequent siblings)
10 siblings, 1 reply; 40+ messages in thread
From: Venkatesh Pallipadi @ 2010-10-05 0:03 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
Balbir Singh, Martin Schwidefsky
Cc: linux-kernel, Paul Turner, Eric Dumazet, Venkatesh Pallipadi
The idea was suggested by Peter Zijlstra here:
http://marc.info/?l=linux-kernel&m=127476934517534&w=2
Irq time is technically not available to the tasks running on the CPU.
This patch removes irq time from CPU power, piggybacking on
sched_rt_avg_update().
Tested this by keeping CPU X busy with a network-intensive task causing 75% of
a single CPU's worth of irq processing (hard+soft) on a 4-way system, and
starting seven cycle soakers on the system. Without this change, there are two
tasks on each CPU. With this change, there is a single task on the irq-busy
CPU X and the remaining 7 tasks are spread among the other 3 CPUs.
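A standalone sketch of the effect on scale_rt_power(), with made-up numbers
(the real computation, including the averaging window, is in
kernel/sched_fair.c; this just shows the proportion):

#include <stdio.h>
#include <inttypes.h>

#define SCHED_LOAD_SCALE 1024ULL

/* Simplified: the fraction of the averaging period left after rt+irq time,
 * expressed in SCHED_LOAD_SCALE units, is what cpu_power gets scaled by. */
static uint64_t scale_rt_power_sketch(uint64_t total, uint64_t rt_avg)
{
	uint64_t available = (total < rt_avg) ? 0 : total - rt_avg;

	if (total < SCHED_LOAD_SCALE)
		total = SCHED_LOAD_SCALE;
	return available * SCHED_LOAD_SCALE / total;
}

int main(void)
{
	/* A CPU that spent ~75% of the period in irq context advertises only
	 * ~25% of its nominal power, so the load balancer leaves it with
	 * correspondingly fewer tasks. */
	printf("cpu_power scale = %" PRIu64 "/1024\n",
	       scale_rt_power_sketch(1000000, 750000));
	return 0;
}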
Signed-off-by: Venkatesh Pallipadi <venki@google.com>
---
kernel/sched.c | 18 ++++++++++++++++++
kernel/sched_fair.c | 8 +++++++-
kernel/sched_features.h | 5 +++++
3 files changed, 30 insertions(+), 1 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 12e9d70..6487da4 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -521,6 +521,10 @@ struct rq {
u64 avg_idle;
#endif
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+ u64 prev_irq_time;
+#endif
+
/* calc_load related fields */
unsigned long calc_load_update;
long calc_load_active;
@@ -645,6 +649,7 @@ static inline struct task_group *task_group(struct task_struct *p)
#endif /* CONFIG_CGROUP_SCHED */
static u64 irq_time_cpu(int cpu);
+static void sched_irq_time_avg_update(struct rq *rq, u64 irq_time);
inline void update_rq_clock(struct rq *rq)
{
@@ -656,6 +661,8 @@ inline void update_rq_clock(struct rq *rq)
irq_time = irq_time_cpu(cpu);
if (rq->clock - irq_time > rq->clock_task)
rq->clock_task = rq->clock - irq_time;
+
+ sched_irq_time_avg_update(rq, irq_time);
}
}
@@ -1994,6 +2001,15 @@ void account_system_vtime(struct task_struct *curr)
local_irq_restore(flags);
}
+static void sched_irq_time_avg_update(struct rq *rq, u64 curr_irq_time)
+{
+ if (sched_clock_irqtime && sched_feat(NONIRQ_POWER)) {
+ u64 delta_irq = curr_irq_time - rq->prev_irq_time;
+ rq->prev_irq_time = curr_irq_time;
+ sched_rt_avg_update(rq, delta_irq);
+ }
+}
+
#else
static u64 irq_time_cpu(int cpu)
@@ -2001,6 +2017,8 @@ static u64 irq_time_cpu(int cpu)
return 0;
}
+static void sched_irq_time_avg_update(struct rq *rq, u64 curr_irq_time) { }
+
#endif
#include "sched_idletask.c"
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 0baa696..6d0362a 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2268,7 +2268,13 @@ unsigned long scale_rt_power(int cpu)
u64 total, available;
total = sched_avg_period() + (rq->clock - rq->age_stamp);
- available = total - rq->rt_avg;
+
+ if (unlikely(total < rq->rt_avg)) {
+ /* Ensures that power won't end up being negative */
+ available = 0;
+ } else {
+ available = total - rq->rt_avg;
+ }
if (unlikely((s64)total < SCHED_LOAD_SCALE))
total = SCHED_LOAD_SCALE;
diff --git a/kernel/sched_features.h b/kernel/sched_features.h
index 83c66e8..185f920 100644
--- a/kernel/sched_features.h
+++ b/kernel/sched_features.h
@@ -61,3 +61,8 @@ SCHED_FEAT(ASYM_EFF_LOAD, 1)
* release the lock. Decreases scheduling overhead.
*/
SCHED_FEAT(OWNER_SPIN, 1)
+
+/*
+ * Decrement CPU power based on irq activity
+ */
+SCHED_FEAT(NONIRQ_POWER, 1)
--
1.7.1
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH 8/8] Call tick_check_idle before __irq_enter
2010-10-05 0:03 Proper kernel irq time accounting -v4 Venkatesh Pallipadi
` (6 preceding siblings ...)
2010-10-05 0:03 ` [PATCH 7/8] sched: Remove irq time from available CPU power -v4 Venkatesh Pallipadi
@ 2010-10-05 0:03 ` Venkatesh Pallipadi
2010-10-17 9:05 ` Yong Zhang
2010-10-18 19:27 ` [tip:sched/core] sched: " tip-bot for Venkatesh Pallipadi
2010-10-12 19:00 ` Proper kernel irq time accounting -v4 Venkatesh Pallipadi
` (2 subsequent siblings)
10 siblings, 2 replies; 40+ messages in thread
From: Venkatesh Pallipadi @ 2010-10-05 0:03 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
Balbir Singh, Martin Schwidefsky
Cc: linux-kernel, Paul Turner, Eric Dumazet, Venkatesh Pallipadi
When the CPU is idle and the first interrupt arrives, irq_enter calls
tick_check_idle() to note the interruption from idle. But there is a problem if
this call is done after __irq_enter, as all routines in __irq_enter may see
stale time because tick_check_idle has not yet run.
Specifically, this affects trace calls in __irq_enter when they use the global
clock, and also the account_system_vtime change in this patch, as it wants to
use sched_clock_cpu() to do proper irq timing.
But tick_check_idle was intentionally moved after __irq_enter to
prevent unneeded ksoftirqd wakeups, by commit ee5f80a
("irq: call __irq_enter() before calling the tick_idle_check",
Impact: avoid spurious ksoftirqd wakeups).
Moving tick_check_idle() before __irq_enter and wrapping it with
local_bh_disable/enable solves both problems.
Signed-off-by: Venkatesh Pallipadi <venki@google.com>
---
kernel/sched.c | 2 +-
kernel/softirq.c | 12 +++++++++---
2 files changed, 10 insertions(+), 4 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 6487da4..442e151 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1983,8 +1983,8 @@ void account_system_vtime(struct task_struct *curr)
local_irq_save(flags);
- now = sched_clock();
cpu = smp_processor_id();
+ now = sched_clock_cpu(cpu);
delta = now - per_cpu(irq_start_time, cpu);
per_cpu(irq_start_time, cpu) = now;
/*
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 267f7b7..6bb7e19 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -296,10 +296,16 @@ void irq_enter(void)
rcu_irq_enter();
if (idle_cpu(cpu) && !in_interrupt()) {
- __irq_enter();
+ /*
+ * Prevent raise_softirq from needlessly waking up ksoftirqd
+ * here, as softirq will be serviced on return from interrupt.
+ */
+ local_bh_disable();
tick_check_idle(cpu);
- } else
- __irq_enter();
+ local_bh_enable();
+ }
+
+ __irq_enter();
}
#ifdef __ARCH_IRQ_EXIT_IRQS_DISABLED
--
1.7.1
^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: Proper kernel irq time accounting -v4
2010-10-05 0:03 Proper kernel irq time accounting -v4 Venkatesh Pallipadi
` (7 preceding siblings ...)
2010-10-05 0:03 ` [PATCH 8/8] Call tick_check_idle before __irq_enter Venkatesh Pallipadi
@ 2010-10-12 19:00 ` Venkatesh Pallipadi
2010-10-14 16:12 ` Shaun Ruffell
2010-10-15 15:11 ` Peter Zijlstra
10 siblings, 0 replies; 40+ messages in thread
From: Venkatesh Pallipadi @ 2010-10-12 19:00 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, Paul Turner, Eric Dumazet, Venkatesh Pallipadi,
Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Balbir Singh,
Martin Schwidefsky
On Mon, Oct 4, 2010 at 5:03 PM, Venkatesh Pallipadi <venki@google.com> wrote:
> Previous versions:
> -v0: http://lkml.indiana.edu/hypermail//linux/kernel/1005.3/00411.html
> lkml subject - "Finer granularity and task/cgroup irq time accounting"
>
> -v1: http://lkml.indiana.edu/hypermail//linux/kernel/1007.2/00987.html
> lkml subject - "Finer granularity and task/cgroup irq time accounting"
>
> -v2: http://lkml.indiana.edu/hypermail//linux/kernel/1009.2/00488.html
> lkml subject - "Proper kernel irq time accounting"
>
> -v3: http://lkml.indiana.edu/hypermail//linux/kernel/1009.3/02482.html
> lkml subject - "Proper kernel irq time accounting -v3"
>
>
> Change from -v3:
> - Switch to using sched_clock_cpu instead of sched_clock in
> account_system_vtime. This needed tick_check_idle to be called early in
> irq_enter.
> - Do not account softirq time in ksoftirqd context. We want ksoftirqd to
> continue having sched exec_runtime.
> - Add a boot option to disable tsc based irq accounting in x86.
> - Remove the patch that exported irq times in /proc/interrupts and
> /proc/softirqs. I am relooking at other options to export that and
> will have a replacement patch soon.
>
>
Peter,
If the patchset looks sane can you push this along to tip please.
Thanks,
Venki
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Proper kernel irq time accounting -v4
2010-10-05 0:03 Proper kernel irq time accounting -v4 Venkatesh Pallipadi
` (8 preceding siblings ...)
2010-10-12 19:00 ` Proper kernel irq time accounting -v4 Venkatesh Pallipadi
@ 2010-10-14 16:12 ` Shaun Ruffell
2010-10-14 18:19 ` Venkatesh Pallipadi
2010-10-15 15:11 ` Peter Zijlstra
10 siblings, 1 reply; 40+ messages in thread
From: Shaun Ruffell @ 2010-10-14 16:12 UTC (permalink / raw)
To: Venkatesh Pallipadi; +Cc: linux-kernel
On 10/04/2010 07:03 PM, Venkatesh Pallipadi wrote:
> Solution to (1) involves adding extra timing on irq entry/exit to
> get the fine granularity info and then exporting it to user.
> The following patchset addresses this problem in a way similar to [2][3].
> Keeps most of the code that does the timing generic
> (CONFIG_IRQ_TIME_ACCOUNTING), based off of sched_clock(). And adds support for
> this in x86. This time is not yet exported to userspace yet. Patch for that
> coming soon.
>
Would you be willing to share your thoughts on how you plan to export
this information to userspace?
I applied this set to one of my machines in hopes of seeing more
accurate hi time before I noticed that in the quoted paragraph you said
this is still forthcoming.
Thanks,
Shaun
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Proper kernel irq time accounting -v4
2010-10-14 16:12 ` Shaun Ruffell
@ 2010-10-14 18:19 ` Venkatesh Pallipadi
2010-10-14 20:00 ` Shaun Ruffell
0 siblings, 1 reply; 40+ messages in thread
From: Venkatesh Pallipadi @ 2010-10-14 18:19 UTC (permalink / raw)
To: Shaun Ruffell; +Cc: linux-kernel
On Thu, Oct 14, 2010 at 9:12 AM, Shaun Ruffell <sruffell@digium.com> wrote:
> On 10/04/2010 07:03 PM, Venkatesh Pallipadi wrote:
>> Solution to (1) involves adding extra timing on irq entry/exit to
>> get the fine granularity info and then exporting it to user.
>> The following patchset addresses this problem in a way similar to [2][3].
>> Keeps most of the code that does the timing generic
>> (CONFIG_IRQ_TIME_ACCOUNTING), based off of sched_clock(). And adds support for
>> this in x86. This time is not yet exported to userspace yet. Patch for that
>> coming soon.
>>
>
> Would you be willing to share your thoughts on how you plan to export
> this information to userspace?
>
> I applied this set to one of my machines in hopes of seeing more
> accurate hi time before I noticed that in the quoted paragraph you said
> this is still forthcoming.
>
Yes. I tried a couple of variations on how to export this to userspace in my
earlier versions of this patchset, and each of them had its own
problems. Right now I have a patch that I am testing which retrofits
/proc/stat to be based on this new fine-granularity info, so existing
tools like top and mpstat will show this information.
This is how things look from top (with a CPU-intensive loop
and a network-intensive nc running on the system):
With vanilla kernel:
Cpu0 : 0.0% us, 0.3% sy, 0.0% ni, 99.3% id, 0.0% wa, 0.0% hi, 0.3% si
Cpu1 : 100.0% us, 0.0% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu2 : 1.3% us, 27.2% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.0% hi, 71.4% si
Cpu3 : 1.6% us, 1.3% sy, 0.0% ni, 96.7% id, 0.0% wa, 0.0% hi, 0.3% si
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7555 root 20 0 1760 528 436 R 100 0.0 0:15.79 nc
7563 root 20 0 3632 268 204 R 100 0.0 0:13.13 loop
With this patchset:
Cpu0 : 0.0% us, 0.0% sy, 0.0% ni, 100.0% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu1 : 100.0% us, 0.0% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu2 : 2.0% us, 30.6% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.0% hi, 67.4% si
Cpu3 : 0.7% us, 0.7% sy, 0.3% ni, 98.3% id, 0.0% wa, 0.0% hi, 0.0% si
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
6289 root 20 0 3632 268 204 R 100 0.0 2:18.67 loop
5737 root 20 0 1760 528 436 R 33 0.0 0:26.72 nc
With this patchset + exporting it through /proc/stat that I am testing:
Cpu0 : 1.3% us, 1.0% sy, 0.3% ni, 97.0% id, 0.0% wa, 0.0% hi, 0.3% si
Cpu1 : 99.3% us, 0.0% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.7% hi, 0.0% si
Cpu2 : 1.3% us, 31.5% sy, 0.0% ni, 0.0% id, 0.0% wa, 8.3% hi, 58.9% si
Cpu3 : 1.0% us, 2.0% sy, 0.3% ni, 95.0% id, 0.0% wa, 0.7% hi, 1.0% si
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20929 root 20 0 3632 268 204 R 99 0.0 3:48.25 loop
20796 root 20 0 1760 528 436 R 33 0.0 2:38.65 nc
So, with this patchset you should see task CPU% showing only the exec
runtime of the task and not si time. With the upcoming change the per-CPU
times will get fixed as well. I am planning to send out the
"exporting through /proc/stat" patchset as an incremental change once
this patchset gets into tip/next.
Thanks,
Venki
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Proper kernel irq time accounting -v4
2010-10-14 18:19 ` Venkatesh Pallipadi
@ 2010-10-14 20:00 ` Shaun Ruffell
0 siblings, 0 replies; 40+ messages in thread
From: Shaun Ruffell @ 2010-10-14 20:00 UTC (permalink / raw)
To: Venkatesh Pallipadi; +Cc: linux-kernel
On 10/14/2010 01:19 PM, Venkatesh Pallipadi wrote:
> So, with this patchset you should see task CPU% showing only the exec
> runtime of the task and not si time. With the upcoming change the CPU
> times would get fixed as well. I am planning to send out the
> "exporting through /proc/stat patchset" as incremental change once
> this patchset gets into tip/next....
That is what I was looking for. I didn't know if you were thinking
about using /proc/stat or some other mechanism. I'll look forward to
seeing/testing that patch.
Thanks again,
Shaun
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 3/8] Add a PF flag for ksoftirqd identification
2010-10-05 0:03 ` [PATCH 3/8] Add a PF flag for ksoftirqd identification Venkatesh Pallipadi
@ 2010-10-15 14:26 ` Peter Zijlstra
2010-10-15 14:46 ` Eric Dumazet
2010-10-18 19:25 ` [tip:sched/core] sched: " tip-bot for Venkatesh Pallipadi
2 siblings, 0 replies; 40+ messages in thread
From: Peter Zijlstra @ 2010-10-15 14:26 UTC (permalink / raw)
To: Venkatesh Pallipadi
Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Balbir Singh,
Martin Schwidefsky, linux-kernel, Paul Turner, Eric Dumazet
On Mon, 2010-10-04 at 17:03 -0700, Venkatesh Pallipadi wrote:
> @@ -1234,6 +1234,7 @@ struct task_struct {
>
> /* Revert to default priority/policy when forking */
> unsigned sched_reset_on_fork:1;
> + unsigned sched_thread_bound:1; /* Thread bound to specific cpu */
>
> pid_t pid;
> pid_t tgid;
> @@ -1708,7 +1709,7 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
> #define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */
> #define PF_SPREAD_PAGE 0x01000000 /* Spread page cache over cpuset */
> #define PF_SPREAD_SLAB 0x02000000 /* Spread some slab caches over cpuset */
> -#define PF_THREAD_BOUND 0x04000000 /* Thread bound to specific cpu */
> +#define PF_KSOFTIRQD 0x04000000 /* I am ksoftirqd */
No need to do that, there's two free bits, 0x1 and 0x1000 (although this
latter is pending in Andrew's tree).
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 4/8] Add IRQ_TIME_ACCOUNTING, finer accounting of irq time -v4
2010-10-05 0:03 ` [PATCH 4/8] Add IRQ_TIME_ACCOUNTING, finer accounting of irq time -v4 Venkatesh Pallipadi
@ 2010-10-15 14:28 ` Peter Zijlstra
2010-10-18 19:25 ` [tip:sched/core] sched: Add IRQ_TIME_ACCOUNTING, finer accounting of irq time tip-bot for Venkatesh Pallipadi
1 sibling, 0 replies; 40+ messages in thread
From: Peter Zijlstra @ 2010-10-15 14:28 UTC (permalink / raw)
To: Venkatesh Pallipadi
Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Balbir Singh,
Martin Schwidefsky, linux-kernel, Paul Turner, Eric Dumazet
On Mon, 2010-10-04 at 17:03 -0700, Venkatesh Pallipadi wrote:
> + if (hardirq_count())
if (in_irq())
to match the (mangled) in_serving_softirq() ?
> + per_cpu(cpu_hardirq_time, cpu) += delta;
> + else if (in_serving_softirq() && !(curr->flags & PF_KSOFTIRQD))
> + per_cpu(cpu_softirq_time, cpu) += delta;
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 5/8] x86: Add IRQ_TIME_ACCOUNTING in x86 -v4
2010-10-05 0:03 ` [PATCH 5/8] x86: Add IRQ_TIME_ACCOUNTING in x86 -v4 Venkatesh Pallipadi
@ 2010-10-15 14:38 ` Peter Zijlstra
2010-10-18 19:26 ` [tip:sched/core] x86: Add IRQ_TIME_ACCOUNTING tip-bot for Venkatesh Pallipadi
1 sibling, 0 replies; 40+ messages in thread
From: Peter Zijlstra @ 2010-10-15 14:38 UTC (permalink / raw)
To: Venkatesh Pallipadi
Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Balbir Singh,
Martin Schwidefsky, linux-kernel, Paul Turner, Eric Dumazet
On Mon, 2010-10-04 at 17:03 -0700, Venkatesh Pallipadi wrote:
> @@ -801,6 +805,7 @@ void mark_tsc_unstable(char *reason)
> if (!tsc_unstable) {
> tsc_unstable = 1;
> sched_clock_stable = 0;
> + disable_sched_clock_irqtime();
> printk(KERN_INFO "Marking TSC unstable due to %s\n", reason);
> /* Change only the rating, when not registered */
> if (clocksource_tsc.mult)
I think we can be less strict than this and keep it enabled for
everything with constant TSC and nonstop TSC.
Since we're using it strictly per-cpu, the cross-cpu sync isn't really
important; all we need is to know that the per-cpu value doesn't go all
screwy on us.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 3/8] Add a PF flag for ksoftirqd identification
2010-10-05 0:03 ` [PATCH 3/8] Add a PF flag for ksoftirqd identification Venkatesh Pallipadi
2010-10-15 14:26 ` Peter Zijlstra
@ 2010-10-15 14:46 ` Eric Dumazet
2010-10-18 19:25 ` [tip:sched/core] sched: " tip-bot for Venkatesh Pallipadi
2 siblings, 0 replies; 40+ messages in thread
From: Eric Dumazet @ 2010-10-15 14:46 UTC (permalink / raw)
To: Venkatesh Pallipadi
Cc: Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
Balbir Singh, Martin Schwidefsky, linux-kernel, Paul Turner
On Monday, 04 October 2010 at 17:03 -0700, Venkatesh Pallipadi wrote:
> To account softirq time cleanly in scheduler, we need to identify whether
> softirq is invoked in ksoftirqd context or softirq at hardirq tail context.
> Add PF_KSOFTIRQD for that purpose.
>
> As all PF flag bits are currently taken, create space by moving one of the
> infrequently used bits (PF_THREAD_BOUND) down in task_struct to be along
> with some other state fields.
>
> Signed-off-by: Venkatesh Pallipadi <venki@google.com>
Instead of using one bit per task (and fighting to find a free bit), why not
use the existing:
DEFINE_PER_CPU(struct task_struct *, ksoftirqd);
and check whether current is ksoftirqd?
if (__get_cpu_var(ksoftirqd) == current) ...
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Proper kernel irq time accounting -v4
2010-10-05 0:03 Proper kernel irq time accounting -v4 Venkatesh Pallipadi
` (9 preceding siblings ...)
2010-10-14 16:12 ` Shaun Ruffell
@ 2010-10-15 15:11 ` Peter Zijlstra
2010-10-15 15:27 ` Peter Zijlstra
10 siblings, 1 reply; 40+ messages in thread
From: Peter Zijlstra @ 2010-10-15 15:11 UTC (permalink / raw)
To: Venkatesh Pallipadi
Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Balbir Singh,
Martin Schwidefsky, linux-kernel, Paul Turner, Eric Dumazet
OK, I'll take this, but will frob 4 to simply add PF_KSOFTIRQD.
We can do all additional changes, possibly including what Eric
suggested, at a later stage as increments.
Thanks!
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Proper kernel irq time accounting -v4
2010-10-15 15:11 ` Peter Zijlstra
@ 2010-10-15 15:27 ` Peter Zijlstra
2010-10-15 17:13 ` Venkatesh Pallipadi
0 siblings, 1 reply; 40+ messages in thread
From: Peter Zijlstra @ 2010-10-15 15:27 UTC (permalink / raw)
To: Venkatesh Pallipadi
Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Balbir Singh,
Martin Schwidefsky, linux-kernel, Paul Turner, Eric Dumazet
On Fri, 2010-10-15 at 17:11 +0200, Peter Zijlstra wrote:
> OK, I'll take this, but will frob 4 to simply add PF_KSOFTIRQD.
>
> We can do all additional changes, possibly including what Eric
> suggested, at a later stage as increments.
>
> Thanks!
*kablam!*
booting with noirqtime doesn't work either, same explosion.
My patch-stack can be gotten from:
http://programming.kicks-ass.net/sekrit/patches.tar.bz2
---
Calgary: Unable to locate Rio Grande table in EBDA - bailing!
Memory: 16441220k/17825792k available (5966k kernel code, 1060600k absent, 323972k reserved, 6084k data, 1652k init)
Hierarchical RCU implementation.
NR_IRQS:4352
Extended CMOS year: 2000
Console: colour VGA+ 80x25
console [ttyS0] enabled, bootconsole disabled
console [ttyS0] enabled, bootconsole disabled
hpet clockevent registered
------------[ cut here ]------------
WARNING: at /usr/src/linux-2.6/kernel/softirq.c:159 _local_bh_enable_ip+0x49/0xc7()
Hardware name: X8DTN
Modules linked in:
Pid: 0, comm: swapper Not tainted 2.6.36-rc7-tip-01685-g5131300-dirty #514
Call Trace:
<IRQ> [<ffffffff8106cdf8>] warn_slowpath_common+0x85/0x9d
[<ffffffff8107319f>] ? irq_enter+0x4e/0x7f
[<ffffffff8106ce2a>] warn_slowpath_null+0x1a/0x1c
[<ffffffff810730af>] _local_bh_enable_ip+0x49/0xc7
[<ffffffff8107314f>] local_bh_enable+0x12/0x14
[<ffffffff8107319f>] irq_enter+0x4e/0x7f
[<ffffffff815d07e6>] do_IRQ+0x36/0xc3
[<ffffffff815ca553>] ret_from_intr+0x0/0xf
<EOI> [<ffffffff815ca1d4>] ? _raw_spin_unlock_irqrestore+0x29/0x2d
[<ffffffff810b42e6>] __setup_irq+0x229/0x2d0
[<ffffffff810b43b0>] setup_irq+0x23/0x28
[<ffffffff81bdcfc9>] setup_default_timer_irq+0x12/0x14
[<ffffffff81bdcfe2>] hpet_time_init+0x17/0x19
[<ffffffff81bdcfb0>] x86_late_time_init+0xa/0x11
[<ffffffff81bd9cc7>] start_kernel+0x340/0x3df
[<ffffffff81bd92a8>] x86_64_start_reservations+0xb8/0xbc
[<ffffffff81bd9393>] x86_64_start_kernel+0xe7/0xee
---[ end trace a7919e7f17c0a725 ]---
------------[ cut here ]------------
WARNING: at /usr/src/linux-2.6/kernel/sched_clock.c:219 sched_clock_cpu+0x31/0xd4()
Hardware name: X8DTN
Modules linked in:
Pid: 0, comm: swapper Tainted: G W 2.6.36-rc7-tip-01685-g5131300-dirty #514
Call Trace:
<IRQ> [<ffffffff8106cdf8>] warn_slowpath_common+0x85/0x9d
[<ffffffff8106ce2a>] warn_slowpath_null+0x1a/0x1c
[<ffffffff8108ccb4>] sched_clock_cpu+0x31/0xd4
[<ffffffff810665af>] update_rq_clock+0x28/0xa1
[<ffffffff81068f7b>] scheduler_tick+0x4d/0x29d
[<ffffffff8107a145>] update_process_times+0x67/0x77
[<ffffffff810939f9>] tick_periodic+0x63/0x6f
[<ffffffff81093a2b>] tick_handle_periodic+0x26/0x73
[<ffffffff8102e75e>] timer_interrupt+0x1e/0x25
[<ffffffff810b34d7>] handle_IRQ_event+0x55/0x11d
[<ffffffff810b570a>] handle_level_irq+0x7b/0xd7
[<ffffffff8102e0b5>] handle_irq+0x88/0x93
[<ffffffff815d080c>] do_IRQ+0x5c/0xc3
[<ffffffff815ca553>] ret_from_intr+0x0/0xf
<EOI> [<ffffffff815ca1d4>] ? _raw_spin_unlock_irqrestore+0x29/0x2d
[<ffffffff810b42e6>] __setup_irq+0x229/0x2d0
[<ffffffff810b43b0>] setup_irq+0x23/0x28
[<ffffffff81bdcfc9>] setup_default_timer_irq+0x12/0x14
[<ffffffff81bdcfe2>] hpet_time_init+0x17/0x19
[<ffffffff81bdcfb0>] x86_late_time_init+0xa/0x11
[<ffffffff81bd9cc7>] start_kernel+0x340/0x3df
[<ffffffff81bd92a8>] x86_64_start_reservations+0xb8/0xbc
[<ffffffff81bd9393>] x86_64_start_kernel+0xe7/0xee
---[ end trace a7919e7f17c0a726 ]---
------------[ cut here ]------------
WARNING: at /usr/src/linux-2.6/kernel/perf_event.c:1656 perf_event_task_tick+0x3c/0x259()
Hardware name: X8DTN
Modules linked in:
Pid: 0, comm: swapper Tainted: G W 2.6.36-rc7-tip-01685-g5131300-dirty #514
Call Trace:
<IRQ> [<ffffffff8106cdf8>] warn_slowpath_common+0x85/0x9d
[<ffffffff8106ce2a>] warn_slowpath_null+0x1a/0x1c
[<ffffffff810d8617>] perf_event_task_tick+0x3c/0x259
[<ffffffff81069037>] scheduler_tick+0x109/0x29d
[<ffffffff8107a145>] update_process_times+0x67/0x77
[<ffffffff810939f9>] tick_periodic+0x63/0x6f
[<ffffffff81093a2b>] tick_handle_periodic+0x26/0x73
[<ffffffff8102e75e>] timer_interrupt+0x1e/0x25
[<ffffffff810b34d7>] handle_IRQ_event+0x55/0x11d
[<ffffffff810b570a>] handle_level_irq+0x7b/0xd7
[<ffffffff8102e0b5>] handle_irq+0x88/0x93
[<ffffffff815d080c>] do_IRQ+0x5c/0xc3
[<ffffffff815ca553>] ret_from_intr+0x0/0xf
<EOI> [<ffffffff815ca1d4>] ? _raw_spin_unlock_irqrestore+0x29/0x2d
[<ffffffff810b42e6>] __setup_irq+0x229/0x2d0
[<ffffffff810b43b0>] setup_irq+0x23/0x28
[<ffffffff81bdcfc9>] setup_default_timer_irq+0x12/0x14
[<ffffffff81bdcfe2>] hpet_time_init+0x17/0x19
[<ffffffff81bdcfb0>] x86_late_time_init+0xa/0x11
[<ffffffff81bd9cc7>] start_kernel+0x340/0x3df
[<ffffffff81bd92a8>] x86_64_start_reservations+0xb8/0xbc
[<ffffffff81bd9393>] x86_64_start_kernel+0xe7/0xee
---[ end trace a7919e7f17c0a727 ]---
BUG: scheduling while atomic: swapper/0/0x10010000
Modules linked in:
CPU 0
Modules linked in:
Pid: 0, comm: swapper Tainted: G W 2.6.36-rc7-tip-01685-g5131300-dirty #514 X8DTN/X8DTN
RIP: 0010:[<ffffffff815ca1d4>] [<ffffffff815ca1d4>] _raw_spin_unlock_irqrestore+0x29/0x2d
RSP: 0000:ffffffff81a01ea8 EFLAGS: 00000246
RAX: 0000000000000000 RBX: ffffffff81a01eb8 RCX: 0000000000000000
RDX: ffffffff81b1e650 RSI: 0000000000000246 RDI: ffffffff81a02284
RBP: ffffffff815ca54e R08: 000000000000001e R09: ffffffff81b22a40
R10: ffffffff81a01e18 R11: 0000000000000004 R12: ffffffff81a01e48
R13: 00000000ffffffff R14: 0000000000000021 R15: ffffffff81e29ace
FS: 0000000000000000(0000) GS:ffff8800bf000000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000001b13000 CR4: 00000000000006b0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 0, threadinfo ffffffff81a00000, task ffffffff81b1b020)
Stack:
ffffffff81a01eb8 ffffffff81a02200 ffffffff81a01f08 ffffffff810b42e6
<0> 0000000000000008 0000000000000246 ffffffff81a01ee8 0000000000000000
<0> ffffffff81b1e2b0 0000000000000000 ffffffffffffffff 0000000000000000
Call Trace:
[<ffffffff810b42e6>] ? __setup_irq+0x229/0x2d0
[<ffffffff810b43b0>] ? setup_irq+0x23/0x28
[<ffffffff81bdcfc9>] ? setup_default_timer_irq+0x12/0x14
[<ffffffff81bdcfe2>] ? hpet_time_init+0x17/0x19
[<ffffffff81bdcfb0>] ? x86_late_time_init+0xa/0x11
[<ffffffff81bd9cc7>] ? start_kernel+0x340/0x3df
[<ffffffff81bd92a8>] ? x86_64_start_reservations+0xb8/0xbc
[<ffffffff81bd9393>] ? x86_64_start_kernel+0xe7/0xee
Code: c9 c3 55 48 89 e5 53 48 83 ec 08 e8 c7 17 a6 ff 48 89 f3 66 ff 07 f6 c7 02 75 09 56 9d e8 6c f6 af ff eb 07 e8 ca f8 af ff 53 9d <5b> 5b c9 c3 55 48 89 e5 48 83 ec 10 e8 9b 17 a6 ff 48 89 7d f8
Call Trace:
[<ffffffff810b42e6>] ? __setup_irq+0x229/0x2d0
[<ffffffff810b43b0>] ? setup_irq+0x23/0x28
[<ffffffff81bdcfc9>] ? setup_default_timer_irq+0x12/0x14
[<ffffffff81bdcfe2>] ? hpet_time_init+0x17/0x19
[<ffffffff81bdcfb0>] ? x86_late_time_init+0xa/0x11
[<ffffffff81bd9cc7>] ? start_kernel+0x340/0x3df
[<ffffffff81bd92a8>] ? x86_64_start_reservations+0xb8/0xbc
[<ffffffff81bd9393>] ? x86_64_start_kernel+0xe7/0xee
------------[ cut here ]------------
kernel BUG at /usr/src/linux-2.6/kernel/posix-cpu-timers.c:1312!
invalid opcode: 0000 [#1] SMP
last sysfs file:
CPU 0
Modules linked in:
Pid: 0, comm: swapper Tainted: G W 2.6.36-rc7-tip-01685-g5131300-dirty #514 X8DTN/X8DTN
RIP: 0010:[<ffffffff81089455>] [<ffffffff81089455>] run_posix_cpu_timers+0x2c/0x61b
RSP: 0000:ffff8800bf003dd8 EFLAGS: 00010202
RAX: 0000000000000296 RBX: ffffffff81b1b020 RCX: 0000000000000000
RDX: ffff8800bf000001 RSI: 0000000000000000 RDI: ffffffff81b1b020
RBP: ffff8800bf003e58 R08: 0000000000000006 R09: 0000ffff00104c10
R10: ffff8800bf003ca4 R11: 0000000000000020 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: ffffffff81a02284
FS: 0000000000000000(0000) GS:ffff8800bf000000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000001b13000 CR4: 00000000000006b0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 0, threadinfo ffffffff81a00000, task ffffffff81b1b020)
Stack:
0000000000000000 ffffffff81b1b020 ffff8800bf003e18 ffff8800bf011ac0
<0> ffff8800bf003df8 ffff8800bf003df8 ffffffff81b1b020 ffffffff81a02284
<0> ffff8800bf003e58 ffffffff81069037 0000000000000000 ffffffff81b1b020
Call Trace:
<IRQ>
[<ffffffff81069037>] ? scheduler_tick+0x109/0x29d
[<ffffffff8107a14d>] update_process_times+0x6f/0x77
[<ffffffff810939f9>] tick_periodic+0x63/0x6f
[<ffffffff81093a2b>] tick_handle_periodic+0x26/0x73
[<ffffffff8102e75e>] timer_interrupt+0x1e/0x25
[<ffffffff810b34d7>] handle_IRQ_event+0x55/0x11d
[<ffffffff810b570a>] handle_level_irq+0x7b/0xd7
[<ffffffff8102e0b5>] handle_irq+0x88/0x93
[<ffffffff815d080c>] do_IRQ+0x5c/0xc3
[<ffffffff815ca553>] ret_from_intr+0x0/0xf
<EOI>
[<ffffffff815ca1d4>] ? _raw_spin_unlock_irqrestore+0x29/0x2d
[<ffffffff810b42e6>] __setup_irq+0x229/0x2d0
[<ffffffff810b43b0>] setup_irq+0x23/0x28
[<ffffffff81bdcfc9>] setup_default_timer_irq+0x12/0x14
[<ffffffff81bdcfe2>] hpet_time_init+0x17/0x19
[<ffffffff81bdcfb0>] x86_late_time_init+0xa/0x11
[<ffffffff81bd9cc7>] start_kernel+0x340/0x3df
[<ffffffff81bd92a8>] x86_64_start_reservations+0xb8/0xbc
[<ffffffff81bd9393>] x86_64_start_kernel+0xe7/0xee
Code: 48 89 e5 41 57 41 56 41 55 41 54 53 48 83 ec 58 e8 41 25 fa ff 48 8d 45 a0 48 89 fb 48 89 45 a0 48 89 45 a8 9c 58 f6 c4 02 74 04 <0f> 0b eb fe 48 8b 87 08 03 00 00 48 85 c0 0f 85 a9 05 00 00 48
RIP [<ffffffff81089455>] run_posix_cpu_timers+0x2c/0x61b
RSP <ffff8800bf003dd8>
---[ end trace a7919e7f17c0a728 ]---
Kernel panic - not syncing: Fatal exception in interrupt
Pid: 0, comm: swapper Tainted: G D W 2.6.36-rc7-tip-01685-g5131300-dirty #514
Call Trace:
<IRQ> [<ffffffff815c7e4a>] panic+0x91/0x19c
[<ffffffff8106dd6f>] ? kmsg_dump+0x13b/0x155
[<ffffffff815cb38c>] oops_end+0xaf/0xbf
[<ffffffff8102f13d>] die+0x5a/0x63
[<ffffffff815cad93>] do_trap+0x121/0x130
[<ffffffff8102d1af>] do_invalid_op+0x94/0x9d
[<ffffffff81089455>] ? run_posix_cpu_timers+0x2c/0x61b
[<ffffffff815c9e4d>] ? trace_hardirqs_off_thunk+0x3a/0x6c
[<ffffffff815ca5d9>] ? irq_return+0x0/0x2
[<ffffffff8102c895>] invalid_op+0x15/0x20
[<ffffffff81089455>] ? run_posix_cpu_timers+0x2c/0x61b
[<ffffffff81069037>] ? scheduler_tick+0x109/0x29d
[<ffffffff8107a14d>] update_process_times+0x6f/0x77
[<ffffffff810939f9>] tick_periodic+0x63/0x6f
[<ffffffff81093a2b>] tick_handle_periodic+0x26/0x73
[<ffffffff8102e75e>] timer_interrupt+0x1e/0x25
[<ffffffff810b34d7>] handle_IRQ_event+0x55/0x11d
[<ffffffff810b570a>] handle_level_irq+0x7b/0xd7
[<ffffffff8102e0b5>] handle_irq+0x88/0x93
[<ffffffff815d080c>] do_IRQ+0x5c/0xc3
[<ffffffff815ca553>] ret_from_intr+0x0/0xf
<EOI> [<ffffffff815ca1d4>] ? _raw_spin_unlock_irqrestore+0x29/0x2d
[<ffffffff810b42e6>] __setup_irq+0x229/0x2d0
[<ffffffff810b43b0>] setup_irq+0x23/0x28
[<ffffffff81bdcfc9>] setup_default_timer_irq+0x12/0x14
[<ffffffff81bdcfe2>] hpet_time_init+0x17/0x19
[<ffffffff81bdcfb0>] x86_late_time_init+0xa/0x11
[<ffffffff81bd9cc7>] start_kernel+0x340/0x3df
[<ffffffff81bd92a8>] x86_64_start_reservations+0xb8/0xbc
[<ffffffff81bd9393>] x86_64_start_kernel+0xe7/0xee
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Proper kernel irq time accounting -v4
2010-10-15 15:27 ` Peter Zijlstra
@ 2010-10-15 17:13 ` Venkatesh Pallipadi
2010-10-15 17:20 ` Peter Zijlstra
2010-10-17 9:11 ` Yong Zhang
0 siblings, 2 replies; 40+ messages in thread
From: Venkatesh Pallipadi @ 2010-10-15 17:13 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Balbir Singh,
Martin Schwidefsky, linux-kernel, Paul Turner, Eric Dumazet
On Fri, Oct 15, 2010 at 8:27 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, 2010-10-15 at 17:11 +0200, Peter Zijlstra wrote:
>> OK, I'll take this, but will frob 4 to simply add PF_KSOFTIRQD.
>>
>> We can do all additional changes, possibly including what Eric
>> suggested, at a later stage as increments.
>>
>> Thanks!
>
> *kablam!*
>
> booting with noirqtime doesn't work either, same explosion.
This crash seems to be from patch 8/8 that changed irq_enter. Please
drop it for now and I will rework that.
I will also send the cleanup changes including one that Eric suggested
as incremental patches.
Thanks,
Venki
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Proper kernel irq time accounting -v4
2010-10-15 17:13 ` Venkatesh Pallipadi
@ 2010-10-15 17:20 ` Peter Zijlstra
2010-10-17 9:11 ` Yong Zhang
1 sibling, 0 replies; 40+ messages in thread
From: Peter Zijlstra @ 2010-10-15 17:20 UTC (permalink / raw)
To: Venkatesh Pallipadi
Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Balbir Singh,
Martin Schwidefsky, linux-kernel, Paul Turner, Eric Dumazet
On Fri, 2010-10-15 at 10:13 -0700, Venkatesh Pallipadi wrote:
> On Fri, Oct 15, 2010 at 8:27 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Fri, 2010-10-15 at 17:11 +0200, Peter Zijlstra wrote:
> >> OK, I'll take this, but will frob 4 to simply add PF_KSOFTIRQD.
> >>
> >> We can do all additional changes, possibly including what Eric
> >> suggested, at a later stage as increments.
> >>
> >> Thanks!
> >
> > *kablam!*
> >
> > booting with noirqtime doesn't work either, same explosion.
>
> This crash seems to be from patch 8/8 that changed irq_enter. Please
> drop it for now and I will rework that.
> I will also send the cleanup changes including one that Eric suggested
> as incremental patches.
There's also still the PF_flags rework patch from me:
http://lkml.org/lkml/2010/9/20/170
which I still need to rework based on comments from Linus/Andrew.
The idea is to move all the "I'm a $FOO thread" PF_ flags out into
a field of their own.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 8/8] Call tick_check_idle before __irq_enter
2010-10-05 0:03 ` [PATCH 8/8] Call tick_check_idle before __irq_enter Venkatesh Pallipadi
@ 2010-10-17 9:05 ` Yong Zhang
2010-10-18 9:15 ` Peter Zijlstra
2010-10-18 19:27 ` [tip:sched/core] sched: " tip-bot for Venkatesh Pallipadi
1 sibling, 1 reply; 40+ messages in thread
From: Yong Zhang @ 2010-10-17 9:05 UTC (permalink / raw)
To: Venkatesh Pallipadi
Cc: Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
Balbir Singh, Martin Schwidefsky, linux-kernel, Paul Turner,
Eric Dumazet
On Mon, Oct 04, 2010 at 05:03:23PM -0700, Venkatesh Pallipadi wrote:
> diff --git a/kernel/softirq.c b/kernel/softirq.c
> index 267f7b7..6bb7e19 100644
> --- a/kernel/softirq.c
> +++ b/kernel/softirq.c
> @@ -296,10 +296,16 @@ void irq_enter(void)
>
> rcu_irq_enter();
> if (idle_cpu(cpu) && !in_interrupt()) {
> - __irq_enter();
> + /*
> + * Prevent raise_softirq from needlessly waking up ksoftirqd
> + * here, as softirq will be serviced on return from interrupt.
> + */
> + local_bh_disable();
> tick_check_idle(cpu);
> - } else
> - __irq_enter();
> + local_bh_enable();
_local_bh_enable()
local_bh_enable() could invoke do_softirq().
> + }
> +
> + __irq_enter();
> }
>
> #ifdef __ARCH_IRQ_EXIT_IRQS_DISABLED
> --
> 1.7.1
>
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Proper kernel irq time accounting -v4
2010-10-15 17:13 ` Venkatesh Pallipadi
2010-10-15 17:20 ` Peter Zijlstra
@ 2010-10-17 9:11 ` Yong Zhang
1 sibling, 0 replies; 40+ messages in thread
From: Yong Zhang @ 2010-10-17 9:11 UTC (permalink / raw)
To: Venkatesh Pallipadi
Cc: Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
Balbir Singh, Martin Schwidefsky, linux-kernel, Paul Turner,
Eric Dumazet
On Fri, Oct 15, 2010 at 10:13:46AM -0700, Venkatesh Pallipadi wrote:
> On Fri, Oct 15, 2010 at 8:27 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Fri, 2010-10-15 at 17:11 +0200, Peter Zijlstra wrote:
> >> OK, I'll take this, but will frob 4 to simply add PF_KSOFTIRQD.
> >>
> >> We can do all additional changes, possibly including what Eric
> >> suggested, at a later stage as increments.
> >>
> >> Thanks!
> >
> > *kablam!*
> >
> > booting with noirqtime doesn't work either, same explosion.
>
> This crash seems to be from patch 8/8 that changed irq_enter. Please
> drop it for now and I will rework that.
Just noticed this; my reply to patch 8/8 will fix it.
Thanks,
Yong
> I will also send the cleanup changes including one that Eric suggested
> as incremental patches.
>
> Thanks,
> Venki
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 8/8] Call tick_check_idle before __irq_enter
2010-10-17 9:05 ` Yong Zhang
@ 2010-10-18 9:15 ` Peter Zijlstra
0 siblings, 0 replies; 40+ messages in thread
From: Peter Zijlstra @ 2010-10-18 9:15 UTC (permalink / raw)
To: Yong Zhang
Cc: Venkatesh Pallipadi, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
Balbir Singh, Martin Schwidefsky, linux-kernel, Paul Turner,
Eric Dumazet
On Sun, 2010-10-17 at 17:05 +0800, Yong Zhang wrote:
> On Mon, Oct 04, 2010 at 05:03:23PM -0700, Venkatesh Pallipadi wrote:
> > diff --git a/kernel/softirq.c b/kernel/softirq.c
> > index 267f7b7..6bb7e19 100644
> > --- a/kernel/softirq.c
> > +++ b/kernel/softirq.c
> > @@ -296,10 +296,16 @@ void irq_enter(void)
> >
> > rcu_irq_enter();
> > if (idle_cpu(cpu) && !in_interrupt()) {
> > - __irq_enter();
> > + /*
> > + * Prevent raise_softirq from needlessly waking up ksoftirqd
> > + * here, as softirq will be serviced on return from interrupt.
> > + */
> > + local_bh_disable();
> > tick_check_idle(cpu);
> > - } else
> > - __irq_enter();
> > + local_bh_enable();
>
> _local_bh_enable()
>
> local_bh_enable() could invoke do_softirq().
This seems to indeed cure that explosion.. Thanks!
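The difference between the two helpers, simplified from the kernel/softirq.c code appearing later in this thread, is roughly:
	/* _local_bh_enable(): expects irqs disabled, only drops the count
	 * and never runs softirqs -- safe from irq_enter(). */
	sub_preempt_count(SOFTIRQ_DISABLE_OFFSET);

	/* local_bh_enable(): warns when called with irqs disabled and, if
	 * softirqs are pending once the count is dropped, recurses into
	 * do_softirq() -- hence the explosion when used inside irq_enter(). */
	sub_preempt_count(SOFTIRQ_DISABLE_OFFSET - 1);
	if (unlikely(!in_interrupt() && local_softirq_pending()))
		do_softirq();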
^ permalink raw reply [flat|nested] 40+ messages in thread
* [tip:sched/core] sched: Fix softirq time accounting
2010-10-05 0:03 ` [PATCH 1/8] si time accounting accounts bh_disable'd time to si -v4 Venkatesh Pallipadi
@ 2010-10-18 19:24 ` tip-bot for Venkatesh Pallipadi
0 siblings, 0 replies; 40+ messages in thread
From: tip-bot for Venkatesh Pallipadi @ 2010-10-18 19:24 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, a.p.zijlstra, tglx, mingo, venki
Commit-ID: 75e1056f5c57050415b64cb761a3acc35d91f013
Gitweb: http://git.kernel.org/tip/75e1056f5c57050415b64cb761a3acc35d91f013
Author: Venkatesh Pallipadi <venki@google.com>
AuthorDate: Mon, 4 Oct 2010 17:03:16 -0700
Committer: Ingo Molnar <mingo@elte.hu>
CommitDate: Mon, 18 Oct 2010 20:52:20 +0200
sched: Fix softirq time accounting
Peter Zijlstra found a bug in the way softirq time is accounted in
VIRT_CPU_ACCOUNTING on this thread:
http://lkml.indiana.edu/hypermail//linux/kernel/1009.2/01366.html
The problem is that softirq processing uses local_bh_disable internally. There
is no way, later in the flow, to differentiate between softirq being
processed and bh merely having been disabled. So a hardirq arriving while bh
is disabled results in time being wrongly accounted as softirq.
Looking at the code a bit more, the problem exists in !VIRT_CPU_ACCOUNTING
as well, since account_system_time() in normal tick based accounting also uses
softirq_count, which will be set even when we are not in softirq but have bh disabled.
Peter also suggested the solution of using 2*SOFTIRQ_OFFSET as the irq count
for local_bh_{disable,enable} and just SOFTIRQ_OFFSET while processing
softirq. The patch below does that and adds an in_serving_softirq() API, which
returns whether we are currently processing softirq or not.
Also changes one of the usages of softirq_count in net/sched/cls_cgroup.c
to in_serving_softirq.
Looks like many usages of in_softirq really want in_serving_softirq. Those
changes can be made individually on a case by case basis.
Signed-off-by: Venkatesh Pallipadi <venki@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1286237003-12406-2-git-send-email-venki@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
include/linux/hardirq.h | 5 ++++
include/linux/sched.h | 6 ++--
kernel/sched.c | 2 +-
kernel/softirq.c | 51 +++++++++++++++++++++++++++++++---------------
net/sched/cls_cgroup.c | 2 +-
5 files changed, 44 insertions(+), 22 deletions(-)
diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index d5b3876..e37a77c 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -64,6 +64,8 @@
#define HARDIRQ_OFFSET (1UL << HARDIRQ_SHIFT)
#define NMI_OFFSET (1UL << NMI_SHIFT)
+#define SOFTIRQ_DISABLE_OFFSET (2 * SOFTIRQ_OFFSET)
+
#ifndef PREEMPT_ACTIVE
#define PREEMPT_ACTIVE_BITS 1
#define PREEMPT_ACTIVE_SHIFT (NMI_SHIFT + NMI_BITS)
@@ -82,10 +84,13 @@
/*
* Are we doing bottom half or hardware interrupt processing?
* Are we in a softirq context? Interrupt context?
+ * in_softirq - Are we currently processing softirq or have bh disabled?
+ * in_serving_softirq - Are we currently processing softirq?
*/
#define in_irq() (hardirq_count())
#define in_softirq() (softirq_count())
#define in_interrupt() (irq_count())
+#define in_serving_softirq() (softirq_count() & SOFTIRQ_OFFSET)
/*
* Are we in NMI context?
diff --git a/include/linux/sched.h b/include/linux/sched.h
index cdf5669..8744e50 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2366,9 +2366,9 @@ extern int __cond_resched_lock(spinlock_t *lock);
extern int __cond_resched_softirq(void);
-#define cond_resched_softirq() ({ \
- __might_sleep(__FILE__, __LINE__, SOFTIRQ_OFFSET); \
- __cond_resched_softirq(); \
+#define cond_resched_softirq() ({ \
+ __might_sleep(__FILE__, __LINE__, SOFTIRQ_DISABLE_OFFSET); \
+ __cond_resched_softirq(); \
})
/*
diff --git a/kernel/sched.c b/kernel/sched.c
index 771b518..089be8a 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3422,7 +3422,7 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
tmp = cputime_to_cputime64(cputime);
if (hardirq_count() - hardirq_offset)
cpustat->irq = cputime64_add(cpustat->irq, tmp);
- else if (softirq_count())
+ else if (in_serving_softirq())
cpustat->softirq = cputime64_add(cpustat->softirq, tmp);
else
cpustat->system = cputime64_add(cpustat->system, tmp);
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 07b4f1b..988dfbe 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -77,11 +77,21 @@ void wakeup_softirqd(void)
}
/*
+ * preempt_count and SOFTIRQ_OFFSET usage:
+ * - preempt_count is changed by SOFTIRQ_OFFSET on entering or leaving
+ * softirq processing.
+ * - preempt_count is changed by SOFTIRQ_DISABLE_OFFSET (= 2 * SOFTIRQ_OFFSET)
+ * on local_bh_disable or local_bh_enable.
+ * This lets us distinguish between whether we are currently processing
+ * softirq and whether we just have bh disabled.
+ */
+
+/*
* This one is for softirq.c-internal use,
* where hardirqs are disabled legitimately:
*/
#ifdef CONFIG_TRACE_IRQFLAGS
-static void __local_bh_disable(unsigned long ip)
+static void __local_bh_disable(unsigned long ip, unsigned int cnt)
{
unsigned long flags;
@@ -95,32 +105,43 @@ static void __local_bh_disable(unsigned long ip)
* We must manually increment preempt_count here and manually
* call the trace_preempt_off later.
*/
- preempt_count() += SOFTIRQ_OFFSET;
+ preempt_count() += cnt;
/*
* Were softirqs turned off above:
*/
- if (softirq_count() == SOFTIRQ_OFFSET)
+ if (softirq_count() == cnt)
trace_softirqs_off(ip);
raw_local_irq_restore(flags);
- if (preempt_count() == SOFTIRQ_OFFSET)
+ if (preempt_count() == cnt)
trace_preempt_off(CALLER_ADDR0, get_parent_ip(CALLER_ADDR1));
}
#else /* !CONFIG_TRACE_IRQFLAGS */
-static inline void __local_bh_disable(unsigned long ip)
+static inline void __local_bh_disable(unsigned long ip, unsigned int cnt)
{
- add_preempt_count(SOFTIRQ_OFFSET);
+ add_preempt_count(cnt);
barrier();
}
#endif /* CONFIG_TRACE_IRQFLAGS */
void local_bh_disable(void)
{
- __local_bh_disable((unsigned long)__builtin_return_address(0));
+ __local_bh_disable((unsigned long)__builtin_return_address(0),
+ SOFTIRQ_DISABLE_OFFSET);
}
EXPORT_SYMBOL(local_bh_disable);
+static void __local_bh_enable(unsigned int cnt)
+{
+ WARN_ON_ONCE(in_irq());
+ WARN_ON_ONCE(!irqs_disabled());
+
+ if (softirq_count() == cnt)
+ trace_softirqs_on((unsigned long)__builtin_return_address(0));
+ sub_preempt_count(cnt);
+}
+
/*
* Special-case - softirqs can safely be enabled in
* cond_resched_softirq(), or by __do_softirq(),
@@ -128,12 +149,7 @@ EXPORT_SYMBOL(local_bh_disable);
*/
void _local_bh_enable(void)
{
- WARN_ON_ONCE(in_irq());
- WARN_ON_ONCE(!irqs_disabled());
-
- if (softirq_count() == SOFTIRQ_OFFSET)
- trace_softirqs_on((unsigned long)__builtin_return_address(0));
- sub_preempt_count(SOFTIRQ_OFFSET);
+ __local_bh_enable(SOFTIRQ_DISABLE_OFFSET);
}
EXPORT_SYMBOL(_local_bh_enable);
@@ -147,13 +163,13 @@ static inline void _local_bh_enable_ip(unsigned long ip)
/*
* Are softirqs going to be turned on now:
*/
- if (softirq_count() == SOFTIRQ_OFFSET)
+ if (softirq_count() == SOFTIRQ_DISABLE_OFFSET)
trace_softirqs_on(ip);
/*
* Keep preemption disabled until we are done with
* softirq processing:
*/
- sub_preempt_count(SOFTIRQ_OFFSET - 1);
+ sub_preempt_count(SOFTIRQ_DISABLE_OFFSET - 1);
if (unlikely(!in_interrupt() && local_softirq_pending()))
do_softirq();
@@ -198,7 +214,8 @@ asmlinkage void __do_softirq(void)
pending = local_softirq_pending();
account_system_vtime(current);
- __local_bh_disable((unsigned long)__builtin_return_address(0));
+ __local_bh_disable((unsigned long)__builtin_return_address(0),
+ SOFTIRQ_OFFSET);
lockdep_softirq_enter();
cpu = smp_processor_id();
@@ -245,7 +262,7 @@ restart:
lockdep_softirq_exit();
account_system_vtime(current);
- _local_bh_enable();
+ __local_bh_enable(SOFTIRQ_OFFSET);
}
#ifndef __ARCH_HAS_DO_SOFTIRQ
diff --git a/net/sched/cls_cgroup.c b/net/sched/cls_cgroup.c
index 78ef2c5..37dff78 100644
--- a/net/sched/cls_cgroup.c
+++ b/net/sched/cls_cgroup.c
@@ -123,7 +123,7 @@ static int cls_cgroup_classify(struct sk_buff *skb, struct tcf_proto *tp,
* calls by looking at the number of nested bh disable calls because
* softirqs always disables bh.
*/
- if (softirq_count() != SOFTIRQ_OFFSET) {
+ if (in_serving_softirq()) {
/* If there is an sk_classid we'll use that. */
if (!skb->sk)
return -1;
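The effect of the new offsets on the existing accounting path can be summarized as follows (illustration only):
	/*
	 * A hardirq arrives while a task merely has bh disabled
	 * (e.g. under spin_lock_bh):
	 *
	 *   before: softirq_count() != 0       -> charged to cpustat->softirq
	 *   after:  in_serving_softirq() == 0  -> charged to cpustat->system
	 *
	 * because local_bh_disable() now adds SOFTIRQ_DISABLE_OFFSET
	 * (2 * SOFTIRQ_OFFSET) while only __do_softirq() sets the
	 * SOFTIRQ_OFFSET bit that in_serving_softirq() tests.
	 */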
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [tip:sched/core] sched: Consolidate account_system_vtime extern declaration
2010-10-05 0:03 ` [PATCH 2/8] Consolidate account_system_vtime extern declaration -v4 Venkatesh Pallipadi
@ 2010-10-18 19:24 ` tip-bot for Venkatesh Pallipadi
2010-10-18 19:27 ` [tip:sched/core] sched: Export account_system_vtime() tip-bot for Ingo Molnar
1 sibling, 0 replies; 40+ messages in thread
From: tip-bot for Venkatesh Pallipadi @ 2010-10-18 19:24 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, a.p.zijlstra, tglx, mingo, venki
Commit-ID: e1e10a265d28273ab8c70be19d43dcbdeead6c5a
Gitweb: http://git.kernel.org/tip/e1e10a265d28273ab8c70be19d43dcbdeead6c5a
Author: Venkatesh Pallipadi <venki@google.com>
AuthorDate: Mon, 4 Oct 2010 17:03:17 -0700
Committer: Ingo Molnar <mingo@elte.hu>
CommitDate: Mon, 18 Oct 2010 20:52:21 +0200
sched: Consolidate account_system_vtime extern declaration
Just a minor cleanup patch that makes things easier for the following patches.
No functionality change in this patch.
Signed-off-by: Venkatesh Pallipadi <venki@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1286237003-12406-3-git-send-email-venki@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
arch/ia64/include/asm/system.h | 4 ----
arch/powerpc/include/asm/system.h | 4 ----
arch/s390/include/asm/system.h | 1 -
include/linux/hardirq.h | 2 ++
4 files changed, 2 insertions(+), 9 deletions(-)
diff --git a/arch/ia64/include/asm/system.h b/arch/ia64/include/asm/system.h
index 9f342a5..dd028f2 100644
--- a/arch/ia64/include/asm/system.h
+++ b/arch/ia64/include/asm/system.h
@@ -272,10 +272,6 @@ void cpu_idle_wait(void);
void default_idle(void);
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING
-extern void account_system_vtime(struct task_struct *);
-#endif
-
#endif /* __KERNEL__ */
#endif /* __ASSEMBLY__ */
diff --git a/arch/powerpc/include/asm/system.h b/arch/powerpc/include/asm/system.h
index 6c294ac..9c3d160 100644
--- a/arch/powerpc/include/asm/system.h
+++ b/arch/powerpc/include/asm/system.h
@@ -542,10 +542,6 @@ extern void reloc_got2(unsigned long);
#define PTRRELOC(x) ((typeof(x)) add_reloc_offset((unsigned long)(x)))
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING
-extern void account_system_vtime(struct task_struct *);
-#endif
-
extern struct dentry *powerpc_debugfs_root;
#endif /* __KERNEL__ */
diff --git a/arch/s390/include/asm/system.h b/arch/s390/include/asm/system.h
index cef6621..38ddd8a 100644
--- a/arch/s390/include/asm/system.h
+++ b/arch/s390/include/asm/system.h
@@ -97,7 +97,6 @@ static inline void restore_access_regs(unsigned int *acrs)
extern void account_vtime(struct task_struct *, struct task_struct *);
extern void account_tick_vtime(struct task_struct *);
-extern void account_system_vtime(struct task_struct *);
#ifdef CONFIG_PFAULT
extern void pfault_irq_init(void);
diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index e37a77c..41367c5 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -141,6 +141,8 @@ struct task_struct;
static inline void account_system_vtime(struct task_struct *tsk)
{
}
+#else
+extern void account_system_vtime(struct task_struct *tsk);
#endif
#if defined(CONFIG_NO_HZ)
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [tip:sched/core] sched: Add a PF flag for ksoftirqd identification
2010-10-05 0:03 ` [PATCH 3/8] Add a PF flag for ksoftirqd identification Venkatesh Pallipadi
2010-10-15 14:26 ` Peter Zijlstra
2010-10-15 14:46 ` Eric Dumazet
@ 2010-10-18 19:25 ` tip-bot for Venkatesh Pallipadi
2 siblings, 0 replies; 40+ messages in thread
From: tip-bot for Venkatesh Pallipadi @ 2010-10-18 19:25 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, a.p.zijlstra, tglx, mingo, venki
Commit-ID: 6cdd5199daf0cb7b0fcc8dca941af08492612887
Gitweb: http://git.kernel.org/tip/6cdd5199daf0cb7b0fcc8dca941af08492612887
Author: Venkatesh Pallipadi <venki@google.com>
AuthorDate: Mon, 4 Oct 2010 17:03:18 -0700
Committer: Ingo Molnar <mingo@elte.hu>
CommitDate: Mon, 18 Oct 2010 20:52:22 +0200
sched: Add a PF flag for ksoftirqd identification
To account softirq time cleanly in scheduler, we need to identify whether
softirq is invoked in ksoftirqd context or softirq at hardirq tail context.
Add PF_KSOFTIRQD for that purpose.
As all PF flag bits are currently taken, create space by moving one of the
infrequently used bits (PF_THREAD_BOUND) down in task_struct to be along
with some other state fields.
Signed-off-by: Venkatesh Pallipadi <venki@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1286237003-12406-4-git-send-email-venki@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
include/linux/sched.h | 1 +
kernel/softirq.c | 1 +
2 files changed, 2 insertions(+), 0 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8744e50..aca0ce6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1682,6 +1682,7 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
/*
* Per process flags
*/
+#define PF_KSOFTIRQD 0x00000001 /* I am ksoftirqd */
#define PF_STARTING 0x00000002 /* being created */
#define PF_EXITING 0x00000004 /* getting shut down */
#define PF_EXITPIDONE 0x00000008 /* pi exit done on shut down */
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 988dfbe..267f7b7 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -713,6 +713,7 @@ static int run_ksoftirqd(void * __bind_cpu)
{
set_current_state(TASK_INTERRUPTIBLE);
+ current->flags |= PF_KSOFTIRQD;
while (!kthread_should_stop()) {
preempt_disable();
if (!local_softirq_pending()) {
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [tip:sched/core] sched: Add IRQ_TIME_ACCOUNTING, finer accounting of irq time
2010-10-05 0:03 ` [PATCH 4/8] Add IRQ_TIME_ACCOUNTING, finer accounting of irq time -v4 Venkatesh Pallipadi
2010-10-15 14:28 ` Peter Zijlstra
@ 2010-10-18 19:25 ` tip-bot for Venkatesh Pallipadi
1 sibling, 0 replies; 40+ messages in thread
From: tip-bot for Venkatesh Pallipadi @ 2010-10-18 19:25 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, a.p.zijlstra, tglx, mingo, venki
Commit-ID: b52bfee445d315549d41eacf2fa7c156e7d153d5
Gitweb: http://git.kernel.org/tip/b52bfee445d315549d41eacf2fa7c156e7d153d5
Author: Venkatesh Pallipadi <venki@google.com>
AuthorDate: Mon, 4 Oct 2010 17:03:19 -0700
Committer: Ingo Molnar <mingo@elte.hu>
CommitDate: Mon, 18 Oct 2010 20:52:24 +0200
sched: Add IRQ_TIME_ACCOUNTING, finer accounting of irq time
s390/powerpc/ia64 have support for CONFIG_VIRT_CPU_ACCOUNTING, which does
fine-granularity accounting of user, system, hardirq and softirq times.
Adding that option on archs like x86 would be challenging, however, given the
state of TSC reliability on various platforms and also the overhead it would
add on syscall entry/exit.
Instead, add a lighter variant that only does finer accounting of
hardirq and softirq times, providing precise irq times (instead of timer tick
based samples). This accounting is added with a new config option
CONFIG_IRQ_TIME_ACCOUNTING so that there won't be any overhead for users not
interested in paying the perf penalty.
This accounting is based on sched_clock, with the code being generic.
So, other archs may find it useful as well.
This patch just adds the core logic and does not enable this logic yet.
Signed-off-by: Venkatesh Pallipadi <venki@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1286237003-12406-5-git-send-email-venki@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
include/linux/hardirq.h | 2 +-
include/linux/sched.h | 13 ++++++++++++
kernel/sched.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 63 insertions(+), 1 deletions(-)
diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index 41367c5..ff43e92 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -137,7 +137,7 @@ extern void synchronize_irq(unsigned int irq);
struct task_struct;
-#ifndef CONFIG_VIRT_CPU_ACCOUNTING
+#if !defined(CONFIG_VIRT_CPU_ACCOUNTING) && !defined(CONFIG_IRQ_TIME_ACCOUNTING)
static inline void account_system_vtime(struct task_struct *tsk)
{
}
diff --git a/include/linux/sched.h b/include/linux/sched.h
index aca0ce6..2cca9a9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1826,6 +1826,19 @@ extern void sched_clock_idle_sleep_event(void);
extern void sched_clock_idle_wakeup_event(u64 delta_ns);
#endif
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+/*
+ * An i/f to runtime opt-in for irq time accounting based off of sched_clock.
+ * The reason for this explicit opt-in is not to have perf penalty with
+ * slow sched_clocks.
+ */
+extern void enable_sched_clock_irqtime(void);
+extern void disable_sched_clock_irqtime(void);
+#else
+static inline void enable_sched_clock_irqtime(void) {}
+static inline void disable_sched_clock_irqtime(void) {}
+#endif
+
extern unsigned long long
task_sched_runtime(struct task_struct *task);
extern unsigned long long thread_group_sched_runtime(struct task_struct *task);
diff --git a/kernel/sched.c b/kernel/sched.c
index 089be8a..9b302e3 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1908,6 +1908,55 @@ static void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
dec_nr_running(rq);
}
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+
+static DEFINE_PER_CPU(u64, cpu_hardirq_time);
+static DEFINE_PER_CPU(u64, cpu_softirq_time);
+
+static DEFINE_PER_CPU(u64, irq_start_time);
+static int sched_clock_irqtime;
+
+void enable_sched_clock_irqtime(void)
+{
+ sched_clock_irqtime = 1;
+}
+
+void disable_sched_clock_irqtime(void)
+{
+ sched_clock_irqtime = 0;
+}
+
+void account_system_vtime(struct task_struct *curr)
+{
+ unsigned long flags;
+ int cpu;
+ u64 now, delta;
+
+ if (!sched_clock_irqtime)
+ return;
+
+ local_irq_save(flags);
+
+ now = sched_clock();
+ cpu = smp_processor_id();
+ delta = now - per_cpu(irq_start_time, cpu);
+ per_cpu(irq_start_time, cpu) = now;
+ /*
+ * We do not account for softirq time from ksoftirqd here.
+ * We want to continue accounting softirq time to ksoftirqd thread
+ * in that case, so as not to confuse scheduler with a special task
+ * that do not consume any time, but still wants to run.
+ */
+ if (hardirq_count())
+ per_cpu(cpu_hardirq_time, cpu) += delta;
+ else if (in_serving_softirq() && !(curr->flags & PF_KSOFTIRQD))
+ per_cpu(cpu_softirq_time, cpu) += delta;
+
+ local_irq_restore(flags);
+}
+
+#endif
+
#include "sched_idletask.c"
#include "sched_fair.c"
#include "sched_rt.c"
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [tip:sched/core] x86: Add IRQ_TIME_ACCOUNTING
2010-10-05 0:03 ` [PATCH 5/8] x86: Add IRQ_TIME_ACCOUNTING in x86 -v4 Venkatesh Pallipadi
2010-10-15 14:38 ` Peter Zijlstra
@ 2010-10-18 19:26 ` tip-bot for Venkatesh Pallipadi
1 sibling, 0 replies; 40+ messages in thread
From: tip-bot for Venkatesh Pallipadi @ 2010-10-18 19:26 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, a.p.zijlstra, tglx, mingo, venki
Commit-ID: e82b8e4ea4f3dffe6e7939f90e78da675fcc450e
Gitweb: http://git.kernel.org/tip/e82b8e4ea4f3dffe6e7939f90e78da675fcc450e
Author: Venkatesh Pallipadi <venki@google.com>
AuthorDate: Mon, 4 Oct 2010 17:03:20 -0700
Committer: Ingo Molnar <mingo@elte.hu>
CommitDate: Mon, 18 Oct 2010 20:52:25 +0200
x86: Add IRQ_TIME_ACCOUNTING
This patch adds the IRQ_TIME_ACCOUNTING option on x86 and enables it at runtime
when the TSC is enabled.
This change just enables fine-grained irq time accounting; it isn't used yet.
Following patches use it for different purposes.
Signed-off-by: Venkatesh Pallipadi <venki@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1286237003-12406-6-git-send-email-venki@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
Documentation/kernel-parameters.txt | 4 ++++
arch/x86/Kconfig | 11 +++++++++++
arch/x86/kernel/tsc.c | 8 ++++++++
3 files changed, 23 insertions(+), 0 deletions(-)
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 8dd7248..ed05a4a 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2435,6 +2435,10 @@ and is between 256 and 4096 characters. It is defined in the file
disables clocksource verification at runtime.
Used to enable high-resolution timer mode on older
hardware, and in virtualized environment.
+ [x86] noirqtime: Do not use TSC to do irq accounting.
+ Used to run time disable IRQ_TIME_ACCOUNTING on any
+ platforms where RDTSC is slow and this accounting
+ can add overhead.
turbografx.map[2|3]= [HW,JOY]
TurboGraFX parallel port interface
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index cea0cd9..f4c70c2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -795,6 +795,17 @@ config SCHED_MC
making when dealing with multi-core CPU chips at a cost of slightly
increased overhead in some places. If unsure say N here.
+config IRQ_TIME_ACCOUNTING
+ bool "Fine granularity task level IRQ time accounting"
+ default n
+ ---help---
+ Select this option to enable fine granularity task irq time
+ accounting. This is done by reading a timestamp on each
+ transitions between softirq and hardirq state, so there can be a
+ small performance impact.
+
+ If in doubt, say N here.
+
source "kernel/Kconfig.preempt"
config X86_UP_APIC
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 26a863a..a1c2cd7 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -104,10 +104,14 @@ int __init notsc_setup(char *str)
__setup("notsc", notsc_setup);
+static int no_sched_irq_time;
+
static int __init tsc_setup(char *str)
{
if (!strcmp(str, "reliable"))
tsc_clocksource_reliable = 1;
+ if (!strncmp(str, "noirqtime", 9))
+ no_sched_irq_time = 1;
return 1;
}
@@ -801,6 +805,7 @@ void mark_tsc_unstable(char *reason)
if (!tsc_unstable) {
tsc_unstable = 1;
sched_clock_stable = 0;
+ disable_sched_clock_irqtime();
printk(KERN_INFO "Marking TSC unstable due to %s\n", reason);
/* Change only the rating, when not registered */
if (clocksource_tsc.mult)
@@ -987,6 +992,9 @@ void __init tsc_init(void)
/* now allow native_sched_clock() to use rdtsc */
tsc_disabled = 0;
+ if (!no_sched_irq_time)
+ enable_sched_clock_irqtime();
+
lpj = ((u64)tsc_khz * 1000);
do_div(lpj, HZ);
lpj_fine = lpj;
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [tip:sched/core] sched: Do not account irq time to current task
2010-10-05 0:03 ` [PATCH 6/8] sched: Do not account irq time to current task -v4 Venkatesh Pallipadi
@ 2010-10-18 19:26 ` tip-bot for Venkatesh Pallipadi
2010-11-29 8:45 ` Yong Zhang
0 siblings, 1 reply; 40+ messages in thread
From: tip-bot for Venkatesh Pallipadi @ 2010-10-18 19:26 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, a.p.zijlstra, tglx, mingo, venki
Commit-ID: 305e6835e05513406fa12820e40e4a8ecb63743c
Gitweb: http://git.kernel.org/tip/305e6835e05513406fa12820e40e4a8ecb63743c
Author: Venkatesh Pallipadi <venki@google.com>
AuthorDate: Mon, 4 Oct 2010 17:03:21 -0700
Committer: Ingo Molnar <mingo@elte.hu>
CommitDate: Mon, 18 Oct 2010 20:52:26 +0200
sched: Do not account irq time to current task
Scheduler accounts both softirq and interrupt processing times to the
currently running task. This means, if the interrupt processing was
for some other task in the system, then the current task ends up being
penalized as it gets shorter runtime than otherwise.
Change sched task accounting to account only actual task time to the
currently running task. update_curr() now computes delta_exec
from rq->clock_task.
Note that this change only handles the CONFIG_IRQ_TIME_ACCOUNTING case. We can
extend this to CONFIG_VIRT_CPU_ACCOUNTING with minimal effort, but that's
for later.
This change will impact scheduling behavior in interrupt heavy conditions.
Tested on a 4-way system with eth0 handled by CPU 2 and a network heavy
task (nc) running on CPU 3 (and no RSS/RFS). With that, CPU 2 spends
75%+ of its time in irq processing and CPU 3 spends around 35% of its
time running the nc task.
Now, if I run another CPU intensive task on CPU 2, without this change
/proc/<pid>/schedstat shows 100% of time accounted to this task. With this
change, it rightly shows less than 25% accounted to this task, as the remaining
time is actually spent on irq processing.
Signed-off-by: Venkatesh Pallipadi <venki@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1286237003-12406-7-git-send-email-venki@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
kernel/sched.c | 43 ++++++++++++++++++++++++++++++++++++++++---
kernel/sched_fair.c | 6 +++---
kernel/sched_rt.c | 8 ++++----
3 files changed, 47 insertions(+), 10 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 9b302e3..9e01b71 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -491,6 +491,7 @@ struct rq {
struct mm_struct *prev_mm;
u64 clock;
+ u64 clock_task;
atomic_t nr_iowait;
@@ -641,10 +642,19 @@ static inline struct task_group *task_group(struct task_struct *p)
#endif /* CONFIG_CGROUP_SCHED */
+static u64 irq_time_cpu(int cpu);
+
inline void update_rq_clock(struct rq *rq)
{
- if (!rq->skip_clock_update)
- rq->clock = sched_clock_cpu(cpu_of(rq));
+ if (!rq->skip_clock_update) {
+ int cpu = cpu_of(rq);
+ u64 irq_time;
+
+ rq->clock = sched_clock_cpu(cpu);
+ irq_time = irq_time_cpu(cpu);
+ if (rq->clock - irq_time > rq->clock_task)
+ rq->clock_task = rq->clock - irq_time;
+ }
}
/*
@@ -1910,6 +1920,18 @@ static void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+/*
+ * There are no locks covering percpu hardirq/softirq time.
+ * They are only modified in account_system_vtime, on corresponding CPU
+ * with interrupts disabled. So, writes are safe.
+ * They are read and saved off onto struct rq in update_rq_clock().
+ * This may result in other CPU reading this CPU's irq time and can
+ * race with irq/account_system_vtime on this CPU. We would either get old
+ * or new value (or semi updated value on 32 bit) with a side effect of
+ * accounting a slice of irq time to wrong task when irq is in progress
+ * while we read rq->clock. That is a worthy compromise in place of having
+ * locks on each irq in account_system_time.
+ */
static DEFINE_PER_CPU(u64, cpu_hardirq_time);
static DEFINE_PER_CPU(u64, cpu_softirq_time);
@@ -1926,6 +1948,14 @@ void disable_sched_clock_irqtime(void)
sched_clock_irqtime = 0;
}
+static u64 irq_time_cpu(int cpu)
+{
+ if (!sched_clock_irqtime)
+ return 0;
+
+ return per_cpu(cpu_softirq_time, cpu) + per_cpu(cpu_hardirq_time, cpu);
+}
+
void account_system_vtime(struct task_struct *curr)
{
unsigned long flags;
@@ -1955,6 +1985,13 @@ void account_system_vtime(struct task_struct *curr)
local_irq_restore(flags);
}
+#else
+
+static u64 irq_time_cpu(int cpu)
+{
+ return 0;
+}
+
#endif
#include "sched_idletask.c"
@@ -3322,7 +3359,7 @@ static u64 do_task_delta_exec(struct task_struct *p, struct rq *rq)
if (task_current(rq, p)) {
update_rq_clock(rq);
- ns = rq->clock - p->se.exec_start;
+ ns = rq->clock_task - p->se.exec_start;
if ((s64)ns < 0)
ns = 0;
}
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index f1c615f..c358d40 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -519,7 +519,7 @@ __update_curr(struct cfs_rq *cfs_rq, struct sched_entity *curr,
static void update_curr(struct cfs_rq *cfs_rq)
{
struct sched_entity *curr = cfs_rq->curr;
- u64 now = rq_of(cfs_rq)->clock;
+ u64 now = rq_of(cfs_rq)->clock_task;
unsigned long delta_exec;
if (unlikely(!curr))
@@ -602,7 +602,7 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
/*
* We are starting a new run period:
*/
- se->exec_start = rq_of(cfs_rq)->clock;
+ se->exec_start = rq_of(cfs_rq)->clock_task;
}
/**************************************************
@@ -1802,7 +1802,7 @@ int can_migrate_task(struct task_struct *p, struct rq *rq, int this_cpu,
* 2) too many balance attempts have failed.
*/
- tsk_cache_hot = task_hot(p, rq->clock, sd);
+ tsk_cache_hot = task_hot(p, rq->clock_task, sd);
if (!tsk_cache_hot ||
sd->nr_balance_failed > sd->cache_nice_tries) {
#ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index ab77aa0..bea7d79 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -609,7 +609,7 @@ static void update_curr_rt(struct rq *rq)
if (!task_has_rt_policy(curr))
return;
- delta_exec = rq->clock - curr->se.exec_start;
+ delta_exec = rq->clock_task - curr->se.exec_start;
if (unlikely((s64)delta_exec < 0))
delta_exec = 0;
@@ -618,7 +618,7 @@ static void update_curr_rt(struct rq *rq)
curr->se.sum_exec_runtime += delta_exec;
account_group_exec_runtime(curr, delta_exec);
- curr->se.exec_start = rq->clock;
+ curr->se.exec_start = rq->clock_task;
cpuacct_charge(curr, delta_exec);
sched_rt_avg_update(rq, delta_exec);
@@ -1075,7 +1075,7 @@ static struct task_struct *_pick_next_task_rt(struct rq *rq)
} while (rt_rq);
p = rt_task_of(rt_se);
- p->se.exec_start = rq->clock;
+ p->se.exec_start = rq->clock_task;
return p;
}
@@ -1713,7 +1713,7 @@ static void set_curr_task_rt(struct rq *rq)
{
struct task_struct *p = rq->curr;
- p->se.exec_start = rq->clock;
+ p->se.exec_start = rq->clock_task;
/* The running task is never eligible for pushing */
dequeue_pushable_task(rq, p);
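A small worked example of the rq->clock_task arithmetic introduced above (numbers purely illustrative):
	/*
	 * Say 10ms of wall time pass on a CPU between two updates, 3ms of
	 * which were spent in hardirq + softirq handlers:
	 *
	 *   rq->clock       advances by 10ms  (sched_clock_cpu)
	 *   irq_time_cpu()  grows by     3ms  (cpu_hardirq_time + cpu_softirq_time)
	 *   rq->clock_task  advances by  7ms  (clock - irq_time, kept monotonic)
	 *
	 * update_curr() therefore charges the running task 7ms of
	 * exec_runtime instead of 10ms, so vruntime and
	 * /proc/<pid>/schedstat no longer include time consumed by
	 * interrupts.
	 */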
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [tip:sched/core] sched: Remove irq time from available CPU power
2010-10-05 0:03 ` [PATCH 7/8] sched: Remove irq time from available CPU power -v4 Venkatesh Pallipadi
@ 2010-10-18 19:26 ` tip-bot for Venkatesh Pallipadi
0 siblings, 0 replies; 40+ messages in thread
From: tip-bot for Venkatesh Pallipadi @ 2010-10-18 19:26 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, a.p.zijlstra, tglx, mingo, venki
Commit-ID: aa483808516ca5cacfa0e5849691f64fec25828e
Gitweb: http://git.kernel.org/tip/aa483808516ca5cacfa0e5849691f64fec25828e
Author: Venkatesh Pallipadi <venki@google.com>
AuthorDate: Mon, 4 Oct 2010 17:03:22 -0700
Committer: Ingo Molnar <mingo@elte.hu>
CommitDate: Mon, 18 Oct 2010 20:52:27 +0200
sched: Remove irq time from available CPU power
The idea was suggested by Peter Zijlstra here:
http://marc.info/?l=linux-kernel&m=127476934517534&w=2
irq time is technically not available to the tasks running on the CPU.
This patch removes irq time from CPU power piggybacking on
sched_rt_avg_update().
Tested this by keeping CPU X busy with a network intensive task causing 75%
of a single CPU's worth of irq processing (hard+soft) on a 4-way system, and
starting seven cycle soakers on the system. Without this change, there will be
two tasks on each CPU. With this change, there is a single task on the irq-busy
CPU X and the remaining 7 tasks are spread among the other 3 CPUs.
Signed-off-by: Venkatesh Pallipadi <venki@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1286237003-12406-8-git-send-email-venki@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
kernel/sched.c | 18 ++++++++++++++++++
kernel/sched_fair.c | 8 +++++++-
kernel/sched_features.h | 5 +++++
3 files changed, 30 insertions(+), 1 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 9e01b71..bff9ef5 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -519,6 +519,10 @@ struct rq {
u64 avg_idle;
#endif
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+ u64 prev_irq_time;
+#endif
+
/* calc_load related fields */
unsigned long calc_load_update;
long calc_load_active;
@@ -643,6 +647,7 @@ static inline struct task_group *task_group(struct task_struct *p)
#endif /* CONFIG_CGROUP_SCHED */
static u64 irq_time_cpu(int cpu);
+static void sched_irq_time_avg_update(struct rq *rq, u64 irq_time);
inline void update_rq_clock(struct rq *rq)
{
@@ -654,6 +659,8 @@ inline void update_rq_clock(struct rq *rq)
irq_time = irq_time_cpu(cpu);
if (rq->clock - irq_time > rq->clock_task)
rq->clock_task = rq->clock - irq_time;
+
+ sched_irq_time_avg_update(rq, irq_time);
}
}
@@ -1985,6 +1992,15 @@ void account_system_vtime(struct task_struct *curr)
local_irq_restore(flags);
}
+static void sched_irq_time_avg_update(struct rq *rq, u64 curr_irq_time)
+{
+ if (sched_clock_irqtime && sched_feat(NONIRQ_POWER)) {
+ u64 delta_irq = curr_irq_time - rq->prev_irq_time;
+ rq->prev_irq_time = curr_irq_time;
+ sched_rt_avg_update(rq, delta_irq);
+ }
+}
+
#else
static u64 irq_time_cpu(int cpu)
@@ -1992,6 +2008,8 @@ static u64 irq_time_cpu(int cpu)
return 0;
}
+static void sched_irq_time_avg_update(struct rq *rq, u64 curr_irq_time) { }
+
#endif
#include "sched_idletask.c"
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index c358d40..74cccfa 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2275,7 +2275,13 @@ unsigned long scale_rt_power(int cpu)
u64 total, available;
total = sched_avg_period() + (rq->clock - rq->age_stamp);
- available = total - rq->rt_avg;
+
+ if (unlikely(total < rq->rt_avg)) {
+ /* Ensures that power won't end up being negative */
+ available = 0;
+ } else {
+ available = total - rq->rt_avg;
+ }
if (unlikely((s64)total < SCHED_LOAD_SCALE))
total = SCHED_LOAD_SCALE;
diff --git a/kernel/sched_features.h b/kernel/sched_features.h
index 83c66e8..185f920 100644
--- a/kernel/sched_features.h
+++ b/kernel/sched_features.h
@@ -61,3 +61,8 @@ SCHED_FEAT(ASYM_EFF_LOAD, 1)
* release the lock. Decreases scheduling overhead.
*/
SCHED_FEAT(OWNER_SPIN, 1)
+
+/*
+ * Decrement CPU power based on irq activity
+ */
+SCHED_FEAT(NONIRQ_POWER, 1)
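To see how this plays out with the numbers from the test above (illustrative only):
	/*
	 * With ~75% of the averaging window spent in irq processing on CPU X,
	 * rt_avg grows to roughly 0.75 * total in scale_rt_power(), so:
	 *
	 *   available = total - rt_avg                         ~= 0.25 * total
	 *   cpu_power ~= SCHED_LOAD_SCALE * available / total  ~= 25%
	 *
	 * The load balancer then treats CPU X as a quarter-strength CPU and
	 * leaves it with a single cycle soaker while spreading the other
	 * seven across the remaining three CPUs.
	 */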
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [tip:sched/core] sched: Call tick_check_idle before __irq_enter
2010-10-05 0:03 ` [PATCH 8/8] Call tick_check_idle before __irq_enter Venkatesh Pallipadi
2010-10-17 9:05 ` Yong Zhang
@ 2010-10-18 19:27 ` tip-bot for Venkatesh Pallipadi
1 sibling, 0 replies; 40+ messages in thread
From: tip-bot for Venkatesh Pallipadi @ 2010-10-18 19:27 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, a.p.zijlstra, tglx, yong.zhang0, venki,
mingo
Commit-ID: d267f87fb8179c6dba03d08b91952e81bc3723c7
Gitweb: http://git.kernel.org/tip/d267f87fb8179c6dba03d08b91952e81bc3723c7
Author: Venkatesh Pallipadi <venki@google.com>
AuthorDate: Mon, 4 Oct 2010 17:03:23 -0700
Committer: Ingo Molnar <mingo@elte.hu>
CommitDate: Mon, 18 Oct 2010 20:52:29 +0200
sched: Call tick_check_idle before __irq_enter
When the CPU is idle, irq_enter on the first interrupt calls tick_check_idle()
to notify the interruption from idle. But there is a problem if this call
is done after __irq_enter, as all routines in __irq_enter may find
stale time due to the not-yet-done tick_check_idle.
Specifically, this affects trace calls in __irq_enter when they use the global
clock, and also the account_system_vtime change in this patch, as it wants to
use sched_clock_cpu() to do proper irq timing.
But tick_check_idle was moved after __irq_enter intentionally, to
prevent the problem of unneeded ksoftirqd wakeups, by commit ee5f80a:
irq: call __irq_enter() before calling the tick_idle_check
Impact: avoid spurious ksoftirqd wakeups
Moving tick_check_idle() before __irq_enter and wrapping it with
local_bh_disable/_local_bh_enable solves both problems.
Fixed-by: Yong Zhang <yong.zhang0@gmail.com>
Signed-off-by: Venkatesh Pallipadi <venki@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1286237003-12406-9-git-send-email-venki@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
kernel/sched.c | 2 +-
kernel/softirq.c | 12 +++++++++---
2 files changed, 10 insertions(+), 4 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index bff9ef5..567f5cb 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1974,8 +1974,8 @@ void account_system_vtime(struct task_struct *curr)
local_irq_save(flags);
- now = sched_clock();
cpu = smp_processor_id();
+ now = sched_clock_cpu(cpu);
delta = now - per_cpu(irq_start_time, cpu);
per_cpu(irq_start_time, cpu) = now;
/*
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 267f7b7..79ee8f1 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -296,10 +296,16 @@ void irq_enter(void)
rcu_irq_enter();
if (idle_cpu(cpu) && !in_interrupt()) {
- __irq_enter();
+ /*
+ * Prevent raise_softirq from needlessly waking up ksoftirqd
+ * here, as softirq will be serviced on return from interrupt.
+ */
+ local_bh_disable();
tick_check_idle(cpu);
- } else
- __irq_enter();
+ _local_bh_enable();
+ }
+
+ __irq_enter();
}
#ifdef __ARCH_IRQ_EXIT_IRQS_DISABLED
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [tip:sched/core] sched: Export account_system_vtime()
2010-10-05 0:03 ` [PATCH 2/8] Consolidate account_system_vtime extern declaration -v4 Venkatesh Pallipadi
2010-10-18 19:24 ` [tip:sched/core] sched: Consolidate account_system_vtime extern declaration tip-bot for Venkatesh Pallipadi
@ 2010-10-18 19:27 ` tip-bot for Ingo Molnar
1 sibling, 0 replies; 40+ messages in thread
From: tip-bot for Ingo Molnar @ 2010-10-18 19:27 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, a.p.zijlstra, tglx, venki, mingo
Commit-ID: b7dadc38797584f6203386da1947ed5edf516646
Gitweb: http://git.kernel.org/tip/b7dadc38797584f6203386da1947ed5edf516646
Author: Ingo Molnar <mingo@elte.hu>
AuthorDate: Mon, 18 Oct 2010 20:00:37 +0200
Committer: Ingo Molnar <mingo@elte.hu>
CommitDate: Mon, 18 Oct 2010 20:52:30 +0200
sched: Export account_system_vtime()
KVM uses it for example:
ERROR: "account_system_vtime" [arch/x86/kvm/kvm.ko] undefined!
Cc: Venkatesh Pallipadi <venki@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1286237003-12406-3-git-send-email-venki@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
kernel/sched.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 567f5cb..5998222 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1991,6 +1991,7 @@ void account_system_vtime(struct task_struct *curr)
local_irq_restore(flags);
}
+EXPORT_SYMBOL_GPL(account_system_vtime);
static void sched_irq_time_avg_update(struct rq *rq, u64 curr_irq_time)
{
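The export matters because account_system_vtime() is now called from modular
code (KVM, per the error message above). As a minimal illustration, a
hypothetical module sketched by the editor (not code from this thread; the
hardirq.h include location follows the declaration consolidation earlier in
this series and is an assumption):

	#include <linux/module.h>
	#include <linux/hardirq.h>	/* account_system_vtime() declaration (assumed) */
	#include <linux/sched.h>	/* current */

	static int __init vtime_demo_init(void)
	{
		/*
		 * Without EXPORT_SYMBOL_GPL(account_system_vtime), building this
		 * as a module fails at modpost with
		 * "account_system_vtime ... undefined!", just like kvm.ko above.
		 */
		account_system_vtime(current);
		return 0;
	}

	static void __exit vtime_demo_exit(void)
	{
	}

	module_init(vtime_demo_init);
	module_exit(vtime_demo_exit);
	MODULE_LICENSE("GPL");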
^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [tip:sched/core] sched: Do not account irq time to current task
2010-10-18 19:26 ` [tip:sched/core] sched: Do not account irq time to current task tip-bot for Venkatesh Pallipadi
@ 2010-11-29 8:45 ` Yong Zhang
2010-11-29 11:59 ` Peter Zijlstra
0 siblings, 1 reply; 40+ messages in thread
From: Yong Zhang @ 2010-11-29 8:45 UTC (permalink / raw)
To: mingo, hpa, linux-kernel, a.p.zijlstra, tglx, venki, mingo
Cc: linux-tip-commits
On Tue, Oct 19, 2010 at 3:26 AM, tip-bot for Venkatesh Pallipadi
<venki@google.com> wrote:
> Commit-ID: 305e6835e05513406fa12820e40e4a8ecb63743c
> Gitweb: http://git.kernel.org/tip/305e6835e05513406fa12820e40e4a8ecb63743c
> Author: Venkatesh Pallipadi <venki@google.com>
> AuthorDate: Mon, 4 Oct 2010 17:03:21 -0700
> Committer: Ingo Molnar <mingo@elte.hu>
> CommitDate: Mon, 18 Oct 2010 20:52:26 +0200
>
> sched: Do not account irq time to current task
>
> Scheduler accounts both softirq and interrupt processing times to the
> currently running task. This means, if the interrupt processing was
> for some other task in the system, then the current task ends up being
> penalized as it gets shorter runtime than otherwise.
>
> Change sched task accounting to account only actual task time from
> currently running task. Now update_curr() modifies the delta_exec to
> depend on rq->clock_task.
>
> Note that this change only handles CONFIG_IRQ_TIME_ACCOUNTING case. We can
> extend this to CONFIG_VIRT_CPU_ACCOUNTING with minimal effort. But, that's
> for later.
>
> This change will impact scheduling behavior in interrupt heavy conditions.
>
> Tested on a 4-way system with eth0 handled by CPU 2 and a network heavy
> task (nc) running on CPU 3 (and no RSS/RFS). With that I have CPU 2
> spending 75%+ of its time in irq processing. CPU 3 spending around 35%
> time running nc task.
>
> Now, if I run another CPU intensive task on CPU 2, without this change
> /proc/<pid>/schedstat shows 100% of time accounted to this task. With this
> change, it rightly shows less than 25% accounted to this task as remaining
> time is actually spent on irq processing.
>
> Signed-off-by: Venkatesh Pallipadi <venki@google.com>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> LKML-Reference: <1286237003-12406-7-git-send-email-venki@google.com>
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> ---
> kernel/sched.c | 43 ++++++++++++++++++++++++++++++++++++++++---
> kernel/sched_fair.c | 6 +++---
> kernel/sched_rt.c | 8 ++++----
> 3 files changed, 47 insertions(+), 10 deletions(-)
>
[snip]
> diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
> index ab77aa0..bea7d79 100644
> --- a/kernel/sched_rt.c
> +++ b/kernel/sched_rt.c
> @@ -609,7 +609,7 @@ static void update_curr_rt(struct rq *rq)
> if (!task_has_rt_policy(curr))
> return;
>
> - delta_exec = rq->clock - curr->se.exec_start;
> + delta_exec = rq->clock_task - curr->se.exec_start;
> if (unlikely((s64)delta_exec < 0))
> delta_exec = 0;
>
> @@ -618,7 +618,7 @@ static void update_curr_rt(struct rq *rq)
> curr->se.sum_exec_runtime += delta_exec;
> account_group_exec_runtime(curr, delta_exec);
>
> - curr->se.exec_start = rq->clock;
> + curr->se.exec_start = rq->clock_task;
> cpuacct_charge(curr, delta_exec);
>
> sched_rt_avg_update(rq, delta_exec);
It seems the above changes to update_curr_rt() introduce a false positive
in rt_bandwidth control.
For example:
rt_period=1000000;
rt_runtime=950000;
then if in that period the irq_time is non-zero (such as 50000), according to
the throttle mechanism, rt is not throttled. In the end we leave no
time for others.
It seems that this breaks the semantics of the throttle.
Maybe we can revert the change to update_curr_rt()?
Thanks,
Yong
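To make the arithmetic in the example above concrete, here is an editorial
sketch (the numbers are the ones from the example; the variable names are
illustrative, not the scheduler's internals):

	#include <stdio.h>

	int main(void)
	{
		const long rt_period  = 1000000; /* us */
		const long rt_runtime = 950000;  /* us, RT bandwidth limit */
		const long irq_time   = 50000;   /* us of irq processing per period */

		/*
		 * Old accounting: irq time was charged to the running RT task, so
		 * the throttle kicked in after roughly 900000 us of real RT work
		 * plus 50000 us of irq, leaving rt_period - rt_runtime for others.
		 */
		long old_left_for_others = rt_period - rt_runtime;

		/*
		 * New accounting (rq->clock_task): only pure task time counts, so
		 * RT tasks can run a full 950000 us; together with the irq time
		 * the whole period can be consumed.
		 */
		long new_left_for_others = rt_period - rt_runtime - irq_time;

		printf("left for non-RT tasks: old %ld us, new %ld us\n",
		       old_left_for_others, new_left_for_others);
		return 0;
	}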
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [tip:sched/core] sched: Do not account irq time to current task
2010-11-29 8:45 ` Yong Zhang
@ 2010-11-29 11:59 ` Peter Zijlstra
2010-11-29 14:22 ` Yong Zhang
0 siblings, 1 reply; 40+ messages in thread
From: Peter Zijlstra @ 2010-11-29 11:59 UTC (permalink / raw)
To: Yong Zhang
Cc: mingo, hpa, linux-kernel, tglx, venki, mingo, linux-tip-commits
On Mon, 2010-11-29 at 16:45 +0800, Yong Zhang wrote:
> On Tue, Oct 19, 2010 at 3:26 AM, tip-bot for Venkatesh Pallipadi
> <venki@google.com> wrote:
> > [full commit message and diff snipped; quoted in full earlier in the thread]
>
> It seems the above changes to update_curr_rt() introduce a false positive
> in rt_bandwidth control.
> For example:
> rt_period=1000000;
> rt_runtime=950000;
> then if in that period the irq_time is non-zero (such as 50000), according to
> the throttle mechanism, rt is not throttled. In the end we leave no
> time for others.
> It seems that this breaks the semantics of the throttle.
>
> Maybe we can revert the change to update_curr_rt()?
No, that's totally correct.
It's the correct and desired behaviour: IRQ time is not time spent
running the RT tasks, hence they should not be accounted for it.
If you still want to throttle RT tasks simply ensure their bandwidth
constraint is lower than the available time.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [tip:sched/core] sched: Do not account irq time to current task
2010-11-29 11:59 ` Peter Zijlstra
@ 2010-11-29 14:22 ` Yong Zhang
2010-11-29 17:06 ` Raistlin
0 siblings, 1 reply; 40+ messages in thread
From: Yong Zhang @ 2010-11-29 14:22 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, hpa, linux-kernel, tglx, venki, mingo, linux-tip-commits
On Mon, Nov 29, 2010 at 12:59:50PM +0100, Peter Zijlstra wrote:
> On Mon, 2010-11-29 at 16:45 +0800, Yong Zhang wrote:
> > On Tue, Oct 19, 2010 at 3:26 AM, tip-bot for Venkatesh Pallipadi
> > <venki@google.com> wrote:
> > > [full commit message and diff snipped; quoted in full earlier in the thread]
> >
> > It seems the above changes to update_curr_rt() introduce a false positive
> > in rt_bandwidth control.
> > For example:
> > rt_period=1000000;
> > rt_runtime=950000;
> > then if in that period the irq_time is non-zero (such as 50000), according to
> > the throttle mechanism, rt is not throttled. In the end we leave no
> > time for others.
> > It seems that this breaks the semantics of the throttle.
> >
> > Maybe we can revert the change to update_curr_rt()?
>
> No, that's totally correct.
>
> It's the correct and desired behaviour: IRQ time is not time spent
> running the RT tasks, hence they should not be accounted for it.
Right.
>
> If you still want to throttle RT tasks simply ensure their bandwidth
> constraint is lower than the available time.
But the available time is harder to calculate than before.
IRQs are random, and so is the irq_time.
But the unthrottling (do_sched_rt_period_timer()) runs at a fixed
period which is based on the hard clock.
Is that what we want?
Thanks,
Yong
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [tip:sched/core] sched: Do not account irq time to current task
2010-11-29 14:22 ` Yong Zhang
@ 2010-11-29 17:06 ` Raistlin
2010-11-30 5:57 ` Yong Zhang
0 siblings, 1 reply; 40+ messages in thread
From: Raistlin @ 2010-11-29 17:06 UTC (permalink / raw)
To: Yong Zhang
Cc: Peter Zijlstra, mingo, hpa, linux-kernel, tglx, venki, mingo,
linux-tip-commits
[-- Attachment #1: Type: text/plain, Size: 2266 bytes --]
On Mon, 2010-11-29 at 22:22 +0800, Yong Zhang wrote:
> > If you still want to throttle RT tasks simply ensure their bandwidth
> > constraint is lower than the available time.
>
> But the available time is harder to calculate than before.
>
Well, it shouldn't... I would say it makes it easier! :-P
> IRQs are random, and so is the irq_time.
>
Well, that's definitely true, as it is true that the time needed to deal
with IRQ is now somehow "lost" or "hidden" (meaning that it is not
accounted to anyone).
> But the unthrottling (do_sched_rt_period_timer()) runs at a fixed
> period which is based on the hard clock.
>
> Is that what we want?
>
I'm not sure I'm getting what the issue is here... Do you have an
example to discuss?
Because, referring to the one in your first e-mail, if in the last
period (of 1sec) we spent 900ms running -rt tasks and 50ms servicing
interrupts, given a limit was 950ms for sched_rt, why should we throttle
them? :-O
Well, I see that this could mean that all we'd do in that period will be
servicing interrupts and running -rt tasks (for _their_ last 50ms), but it
also means (at least to me) that your system needs some more design
effort.
Then I can also agree on the point that it might make sense to think a
bit on how to take the 50ms from interrupts somehow into account, but
just charging -rt tasks for that time seems quite arbitrary... :-O
Something that I've done recently is trying to figure out, if interrupt
handlers have their own threads (as in PREEMPT_RT), what happens if you
try constraining the bandwidth of those threads, using proper scheduling
mechanisms (such as deadline scheduling, but rt-throttling could also be
a "reasonable approximation" :-P)... But it's just preliminary research
results.
BTW, having handlers in threads could actually be the solution per-se
here, since they would then consume -rt bandwidth (if set to -rt) and
contribute to throttling... :-)
Regards,
Dario
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa (Italy)
http://retis.sssup.it/people/faggioli -- dario.faggioli@jabber.org
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]
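A rough sketch of the threaded-handler idea mentioned above (an editorial
illustration; the device name, irq number and handler bodies are
hypothetical, not code from this thread):

	#include <linux/interrupt.h>
	#include <linux/module.h>

	#define MYDEV_IRQ 42	/* hypothetical irq line */

	static irqreturn_t mydev_hardirq(int irq, void *dev_id)
	{
		/* hardirq context: acknowledge the device, defer the real work */
		return IRQ_WAKE_THREAD;
	}

	static irqreturn_t mydev_thread_fn(int irq, void *dev_id)
	{
		/*
		 * Runs in the irq/42-mydev kernel thread. If that thread is given
		 * an RT policy, the time spent here counts against RT bandwidth
		 * and is subject to rt-throttling like any other -rt task.
		 */
		return IRQ_HANDLED;
	}

	static int __init mydev_init(void)
	{
		return request_threaded_irq(MYDEV_IRQ, mydev_hardirq,
					    mydev_thread_fn, 0, "mydev", NULL);
	}

	static void __exit mydev_exit(void)
	{
		free_irq(MYDEV_IRQ, NULL);
	}

	module_init(mydev_init);
	module_exit(mydev_exit);
	MODULE_LICENSE("GPL");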
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [tip:sched/core] sched: Do not account irq time to current task
2010-11-29 17:06 ` Raistlin
@ 2010-11-30 5:57 ` Yong Zhang
2010-12-01 18:55 ` Venkatesh Pallipadi
0 siblings, 1 reply; 40+ messages in thread
From: Yong Zhang @ 2010-11-30 5:57 UTC (permalink / raw)
To: Raistlin
Cc: Peter Zijlstra, mingo, hpa, linux-kernel, tglx, venki, mingo,
linux-tip-commits
On Tue, Nov 30, 2010 at 1:06 AM, Raistlin <raistlin@linux.it> wrote:
> On Mon, 2010-11-29 at 22:22 +0800, Yong Zhang wrote:
>> > If you still want to throttle RT tasks simply ensure their bandwidth
>> > constraint is lower than the available time.
>>
>> But the available time is harder to calculate than before.
>>
> Well, it shouldn't... I would say it makes it easier! :-P
It depends on how we see it.
From the point of view that we can just focus on task time, it's easy
and well understood. :)
Now I think we need a way to export the irq_time to userspace,
so the user can determine the average time spent
on IRQs.
Maybe Venkatesh Pallipadi's patchset titled "Proper
kernel irq time reporting" is the one.
>
>> IRQs are random, and so is the irq_time.
>>
> Well, that's definitely true, as it is true that the time needed to deal
> with IRQ is now somehow "lost" or "hidden" (meaning that it is not
> accounted to anyone).
>
>> But the unthrottling (do_sched_rt_period_timer()) runs at a fixed
>> period which is based on the hard clock.
>>
>> Is that what we want?
>>
> I'm not sure I'm getting what the issue is here... Do you have an
> example to discuss?
I was just thinking about pulling the irq_time out of rt_period, but
it seems that it's not so easy.
Comparing the current unthrottle mechanism with the previous one,
there is no difference: IRQs can happen at any time, regardless of
who their time is accounted to. So it's still fair for now.
> Because, referring to the one in your first e-mail, if in the last
> period (of 1sec) we spent 900ms running -rt tasks and 50ms servicing
> interrupts, given a limit was 950ms for sched_rt, why should we throttle
> them? :-O
In the previous version, rt is throttled because irq_time is accounted
to the task.
So in the new version, I should set rt_runtime to 900ms to make it
sensible.
>
> Well, I see that this could mean that all we'd do in that period will be
> servicing interrupts and running -rt tasks (for _their_ last 50ms), but it
> also means (at least to me) that your system needs some more design
> effort.
> Then I can also agree on the point that it might make sense to think a
> bit on how to take the 50ms from interrupts somehow into account, but
> just charging -rt tasks for that time seems quite arbitrary... :-O
Sure.
>
> Something that I've done recently is trying to figure out, if interrupt
> handlers have their own threads (as in PREEMPT_RT), what happens if you
> try constraining the bandwidth of those threads, using proper scheduling
> mechanisms (such as deadline scheduling, but rt-throttling could also be
> a "reasonable approximation" :-P)... But it's just preliminary research
> results.
Interesting. Will keep an eye on that. :)
>
> BTW, having handlers in threads could actually be the solution per-se
> here, since they would then consume -rt bandwidth (if set to -rt) and
> contribute to throttling... :-)
Agreed. This will make the time accounting more accurate.
But as you said above, we must evaluate the side effect on
irq handling when constraining the bandwidth of the irq threads.
Thanks,
Yong
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [tip:sched/core] sched: Do not account irq time to current task
2010-11-30 5:57 ` Yong Zhang
@ 2010-12-01 18:55 ` Venkatesh Pallipadi
2010-12-01 19:16 ` Peter Zijlstra
0 siblings, 1 reply; 40+ messages in thread
From: Venkatesh Pallipadi @ 2010-12-01 18:55 UTC (permalink / raw)
To: Yong Zhang
Cc: Raistlin, Peter Zijlstra, mingo, hpa, linux-kernel, tglx, mingo,
linux-tip-commits
On Mon, Nov 29, 2010 at 9:57 PM, Yong Zhang <yong.zhang0@gmail.com> wrote:
> On Tue, Nov 30, 2010 at 1:06 AM, Raistlin <raistlin@linux.it> wrote:
>> On Mon, 2010-11-29 at 22:22 +0800, Yong Zhang wrote:
>>> > If you still want to throttle RT tasks simply ensure their bandwidth
>>> > constraint is lower than the available time.
>>>
>>> But the available time is harder to calculate than before.
>>>
>> Well, it shouldn't... I would say it makes it easier! :-P
>
> It depends on how we see it.
> From the point of view that we can just focus on task time, it's easy
> and well understood. :)
>
> Now I think we need a way to export the irq_time to userspace,
> so the user can determine the average time spent
> on IRQs.
> Maybe Venkatesh Pallipadi's patchset titled "Proper
> kernel irq time reporting" is the one.
>
Yes. That second patchset exports this information through /proc/stat.
Peter: Can you pull that patchset into tip?
Thanks,
Venki
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [tip:sched/core] sched: Do not account irq time to current task
2010-12-01 18:55 ` Venkatesh Pallipadi
@ 2010-12-01 19:16 ` Peter Zijlstra
0 siblings, 0 replies; 40+ messages in thread
From: Peter Zijlstra @ 2010-12-01 19:16 UTC (permalink / raw)
To: Venkatesh Pallipadi
Cc: Yong Zhang, Raistlin, mingo, hpa, linux-kernel, tglx, mingo,
linux-tip-commits
On Wed, 2010-12-01 at 10:55 -0800, Venkatesh Pallipadi wrote:
> Peter: Can you pull that patchset into tip.
I'll try and have another look at those patches early next week. Do
remind me if you haven't heard back from me by Wednesday or something.
^ permalink raw reply [flat|nested] 40+ messages in thread
end of thread, other threads:[~2010-12-01 19:16 UTC | newest]
Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-10-05 0:03 Proper kernel irq time accounting -v4 Venkatesh Pallipadi
2010-10-05 0:03 ` [PATCH 1/8] si time accounting accounts bh_disable'd time to si -v4 Venkatesh Pallipadi
2010-10-18 19:24 ` [tip:sched/core] sched: Fix softirq time accounting tip-bot for Venkatesh Pallipadi
2010-10-05 0:03 ` [PATCH 2/8] Consolidate account_system_vtime extern declaration -v4 Venkatesh Pallipadi
2010-10-18 19:24 ` [tip:sched/core] sched: Consolidate account_system_vtime extern declaration tip-bot for Venkatesh Pallipadi
2010-10-18 19:27 ` [tip:sched/core] sched: Export account_system_vtime() tip-bot for Ingo Molnar
2010-10-05 0:03 ` [PATCH 3/8] Add a PF flag for ksoftirqd identification Venkatesh Pallipadi
2010-10-15 14:26 ` Peter Zijlstra
2010-10-15 14:46 ` Eric Dumazet
2010-10-18 19:25 ` [tip:sched/core] sched: " tip-bot for Venkatesh Pallipadi
2010-10-05 0:03 ` [PATCH 4/8] Add IRQ_TIME_ACCOUNTING, finer accounting of irq time -v4 Venkatesh Pallipadi
2010-10-15 14:28 ` Peter Zijlstra
2010-10-18 19:25 ` [tip:sched/core] sched: Add IRQ_TIME_ACCOUNTING, finer accounting of irq time tip-bot for Venkatesh Pallipadi
2010-10-05 0:03 ` [PATCH 5/8] x86: Add IRQ_TIME_ACCOUNTING in x86 -v4 Venkatesh Pallipadi
2010-10-15 14:38 ` Peter Zijlstra
2010-10-18 19:26 ` [tip:sched/core] x86: Add IRQ_TIME_ACCOUNTING tip-bot for Venkatesh Pallipadi
2010-10-05 0:03 ` [PATCH 6/8] sched: Do not account irq time to current task -v4 Venkatesh Pallipadi
2010-10-18 19:26 ` [tip:sched/core] sched: Do not account irq time to current task tip-bot for Venkatesh Pallipadi
2010-11-29 8:45 ` Yong Zhang
2010-11-29 11:59 ` Peter Zijlstra
2010-11-29 14:22 ` Yong Zhang
2010-11-29 17:06 ` Raistlin
2010-11-30 5:57 ` Yong Zhang
2010-12-01 18:55 ` Venkatesh Pallipadi
2010-12-01 19:16 ` Peter Zijlstra
2010-10-05 0:03 ` [PATCH 7/8] sched: Remove irq time from available CPU power -v4 Venkatesh Pallipadi
2010-10-18 19:26 ` [tip:sched/core] sched: Remove irq time from available CPU power tip-bot for Venkatesh Pallipadi
2010-10-05 0:03 ` [PATCH 8/8] Call tick_check_idle before __irq_enter Venkatesh Pallipadi
2010-10-17 9:05 ` Yong Zhang
2010-10-18 9:15 ` Peter Zijlstra
2010-10-18 19:27 ` [tip:sched/core] sched: " tip-bot for Venkatesh Pallipadi
2010-10-12 19:00 ` Proper kernel irq time accounting -v4 Venkatesh Pallipadi
2010-10-14 16:12 ` Shaun Ruffell
2010-10-14 18:19 ` Venkatesh Pallipadi
2010-10-14 20:00 ` Shaun Ruffell
2010-10-15 15:11 ` Peter Zijlstra
2010-10-15 15:27 ` Peter Zijlstra
2010-10-15 17:13 ` Venkatesh Pallipadi
2010-10-15 17:20 ` Peter Zijlstra
2010-10-17 9:11 ` Yong Zhang
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).