All of lore.kernel.org
 help / color / mirror / Atom feed
From: tip-bot for Venkatesh Pallipadi <venki@google.com>
To: linux-tip-commits@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, hpa@zytor.com, mingo@redhat.com,
	a.p.zijlstra@chello.nl, tglx@linutronix.de, mingo@elte.hu,
	venki@google.com
Subject: [tip:sched/core] sched: Do not account irq time to current task
Date: Mon, 18 Oct 2010 19:26:29 GMT	[thread overview]
Message-ID: <tip-305e6835e05513406fa12820e40e4a8ecb63743c@git.kernel.org> (raw)
In-Reply-To: <1286237003-12406-7-git-send-email-venki@google.com>

Commit-ID:  305e6835e05513406fa12820e40e4a8ecb63743c
Gitweb:     http://git.kernel.org/tip/305e6835e05513406fa12820e40e4a8ecb63743c
Author:     Venkatesh Pallipadi <venki@google.com>
AuthorDate: Mon, 4 Oct 2010 17:03:21 -0700
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Mon, 18 Oct 2010 20:52:26 +0200

sched: Do not account irq time to current task

Scheduler accounts both softirq and interrupt processing times to the
currently running task. This means, if the interrupt processing was
for some other task in the system, then the current task ends up being
penalized as it gets shorter runtime than otherwise.

Change sched task accounting to acoount only actual task time from
currently running task. Now update_curr(), modifies the delta_exec to
depend on rq->clock_task.

Note that this change only handles CONFIG_IRQ_TIME_ACCOUNTING case. We can
extend this to CONFIG_VIRT_CPU_ACCOUNTING with minimal effort. But, thats
for later.

This change will impact scheduling behavior in interrupt heavy conditions.

Tested on a 4-way system with eth0 handled by CPU 2 and a network heavy
task (nc) running on CPU 3 (and no RSS/RFS). With that I have CPU 2
spending 75%+ of its time in irq processing. CPU 3 spending around 35%
time running nc task.

Now, if I run another CPU intensive task on CPU 2, without this change
/proc/<pid>/schedstat shows 100% of time accounted to this task. With this
change, it rightly shows less than 25% accounted to this task as remaining
time is actually spent on irq processing.

Signed-off-by: Venkatesh Pallipadi <venki@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1286237003-12406-7-git-send-email-venki@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c      |   43 ++++++++++++++++++++++++++++++++++++++++---
 kernel/sched_fair.c |    6 +++---
 kernel/sched_rt.c   |    8 ++++----
 3 files changed, 47 insertions(+), 10 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 9b302e3..9e01b71 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -491,6 +491,7 @@ struct rq {
 	struct mm_struct *prev_mm;
 
 	u64 clock;
+	u64 clock_task;
 
 	atomic_t nr_iowait;
 
@@ -641,10 +642,19 @@ static inline struct task_group *task_group(struct task_struct *p)
 
 #endif /* CONFIG_CGROUP_SCHED */
 
+static u64 irq_time_cpu(int cpu);
+
 inline void update_rq_clock(struct rq *rq)
 {
-	if (!rq->skip_clock_update)
-		rq->clock = sched_clock_cpu(cpu_of(rq));
+	if (!rq->skip_clock_update) {
+		int cpu = cpu_of(rq);
+		u64 irq_time;
+
+		rq->clock = sched_clock_cpu(cpu);
+		irq_time = irq_time_cpu(cpu);
+		if (rq->clock - irq_time > rq->clock_task)
+			rq->clock_task = rq->clock - irq_time;
+	}
 }
 
 /*
@@ -1910,6 +1920,18 @@ static void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
 
+/*
+ * There are no locks covering percpu hardirq/softirq time.
+ * They are only modified in account_system_vtime, on corresponding CPU
+ * with interrupts disabled. So, writes are safe.
+ * They are read and saved off onto struct rq in update_rq_clock().
+ * This may result in other CPU reading this CPU's irq time and can
+ * race with irq/account_system_vtime on this CPU. We would either get old
+ * or new value (or semi updated value on 32 bit) with a side effect of
+ * accounting a slice of irq time to wrong task when irq is in progress
+ * while we read rq->clock. That is a worthy compromise in place of having
+ * locks on each irq in account_system_time.
+ */
 static DEFINE_PER_CPU(u64, cpu_hardirq_time);
 static DEFINE_PER_CPU(u64, cpu_softirq_time);
 
@@ -1926,6 +1948,14 @@ void disable_sched_clock_irqtime(void)
 	sched_clock_irqtime = 0;
 }
 
+static u64 irq_time_cpu(int cpu)
+{
+	if (!sched_clock_irqtime)
+		return 0;
+
+	return per_cpu(cpu_softirq_time, cpu) + per_cpu(cpu_hardirq_time, cpu);
+}
+
 void account_system_vtime(struct task_struct *curr)
 {
 	unsigned long flags;
@@ -1955,6 +1985,13 @@ void account_system_vtime(struct task_struct *curr)
 	local_irq_restore(flags);
 }
 
+#else
+
+static u64 irq_time_cpu(int cpu)
+{
+	return 0;
+}
+
 #endif
 
 #include "sched_idletask.c"
@@ -3322,7 +3359,7 @@ static u64 do_task_delta_exec(struct task_struct *p, struct rq *rq)
 
 	if (task_current(rq, p)) {
 		update_rq_clock(rq);
-		ns = rq->clock - p->se.exec_start;
+		ns = rq->clock_task - p->se.exec_start;
 		if ((s64)ns < 0)
 			ns = 0;
 	}
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index f1c615f..c358d40 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -519,7 +519,7 @@ __update_curr(struct cfs_rq *cfs_rq, struct sched_entity *curr,
 static void update_curr(struct cfs_rq *cfs_rq)
 {
 	struct sched_entity *curr = cfs_rq->curr;
-	u64 now = rq_of(cfs_rq)->clock;
+	u64 now = rq_of(cfs_rq)->clock_task;
 	unsigned long delta_exec;
 
 	if (unlikely(!curr))
@@ -602,7 +602,7 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	/*
 	 * We are starting a new run period:
 	 */
-	se->exec_start = rq_of(cfs_rq)->clock;
+	se->exec_start = rq_of(cfs_rq)->clock_task;
 }
 
 /**************************************************
@@ -1802,7 +1802,7 @@ int can_migrate_task(struct task_struct *p, struct rq *rq, int this_cpu,
 	 * 2) too many balance attempts have failed.
 	 */
 
-	tsk_cache_hot = task_hot(p, rq->clock, sd);
+	tsk_cache_hot = task_hot(p, rq->clock_task, sd);
 	if (!tsk_cache_hot ||
 		sd->nr_balance_failed > sd->cache_nice_tries) {
 #ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index ab77aa0..bea7d79 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -609,7 +609,7 @@ static void update_curr_rt(struct rq *rq)
 	if (!task_has_rt_policy(curr))
 		return;
 
-	delta_exec = rq->clock - curr->se.exec_start;
+	delta_exec = rq->clock_task - curr->se.exec_start;
 	if (unlikely((s64)delta_exec < 0))
 		delta_exec = 0;
 
@@ -618,7 +618,7 @@ static void update_curr_rt(struct rq *rq)
 	curr->se.sum_exec_runtime += delta_exec;
 	account_group_exec_runtime(curr, delta_exec);
 
-	curr->se.exec_start = rq->clock;
+	curr->se.exec_start = rq->clock_task;
 	cpuacct_charge(curr, delta_exec);
 
 	sched_rt_avg_update(rq, delta_exec);
@@ -1075,7 +1075,7 @@ static struct task_struct *_pick_next_task_rt(struct rq *rq)
 	} while (rt_rq);
 
 	p = rt_task_of(rt_se);
-	p->se.exec_start = rq->clock;
+	p->se.exec_start = rq->clock_task;
 
 	return p;
 }
@@ -1713,7 +1713,7 @@ static void set_curr_task_rt(struct rq *rq)
 {
 	struct task_struct *p = rq->curr;
 
-	p->se.exec_start = rq->clock;
+	p->se.exec_start = rq->clock_task;
 
 	/* The running task is never eligible for pushing */
 	dequeue_pushable_task(rq, p);

  reply	other threads:[~2010-10-18 19:27 UTC|newest]

Thread overview: 40+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-10-05  0:03 Proper kernel irq time accounting -v4 Venkatesh Pallipadi
2010-10-05  0:03 ` [PATCH 1/8] si time accounting accounts bh_disable'd time to si -v4 Venkatesh Pallipadi
2010-10-18 19:24   ` [tip:sched/core] sched: Fix softirq time accounting tip-bot for Venkatesh Pallipadi
2010-10-05  0:03 ` [PATCH 2/8] Consolidate account_system_vtime extern declaration -v4 Venkatesh Pallipadi
2010-10-18 19:24   ` [tip:sched/core] sched: Consolidate account_system_vtime extern declaration tip-bot for Venkatesh Pallipadi
2010-10-18 19:27   ` [tip:sched/core] sched: Export account_system_vtime() tip-bot for Ingo Molnar
2010-10-05  0:03 ` [PATCH 3/8] Add a PF flag for ksoftirqd identification Venkatesh Pallipadi
2010-10-15 14:26   ` Peter Zijlstra
2010-10-15 14:46   ` Eric Dumazet
2010-10-18 19:25   ` [tip:sched/core] sched: " tip-bot for Venkatesh Pallipadi
2010-10-05  0:03 ` [PATCH 4/8] Add IRQ_TIME_ACCOUNTING, finer accounting of irq time -v4 Venkatesh Pallipadi
2010-10-15 14:28   ` Peter Zijlstra
2010-10-18 19:25   ` [tip:sched/core] sched: Add IRQ_TIME_ACCOUNTING, finer accounting of irq time tip-bot for Venkatesh Pallipadi
2010-10-05  0:03 ` [PATCH 5/8] x86: Add IRQ_TIME_ACCOUNTING in x86 -v4 Venkatesh Pallipadi
2010-10-15 14:38   ` Peter Zijlstra
2010-10-18 19:26   ` [tip:sched/core] x86: Add IRQ_TIME_ACCOUNTING tip-bot for Venkatesh Pallipadi
2010-10-05  0:03 ` [PATCH 6/8] sched: Do not account irq time to current task -v4 Venkatesh Pallipadi
2010-10-18 19:26   ` tip-bot for Venkatesh Pallipadi [this message]
2010-11-29  8:45     ` [tip:sched/core] sched: Do not account irq time to current task Yong Zhang
2010-11-29 11:59       ` Peter Zijlstra
2010-11-29 14:22         ` Yong Zhang
2010-11-29 17:06           ` Raistlin
2010-11-30  5:57             ` Yong Zhang
2010-12-01 18:55               ` Venkatesh Pallipadi
2010-12-01 19:16                 ` Peter Zijlstra
2010-10-05  0:03 ` [PATCH 7/8] sched: Remove irq time from available CPU power -v4 Venkatesh Pallipadi
2010-10-18 19:26   ` [tip:sched/core] sched: Remove irq time from available CPU power tip-bot for Venkatesh Pallipadi
2010-10-05  0:03 ` [PATCH 8/8] Call tick_check_idle before __irq_enter Venkatesh Pallipadi
2010-10-17  9:05   ` Yong Zhang
2010-10-18  9:15     ` Peter Zijlstra
2010-10-18 19:27   ` [tip:sched/core] sched: " tip-bot for Venkatesh Pallipadi
2010-10-12 19:00 ` Proper kernel irq time accounting -v4 Venkatesh Pallipadi
2010-10-14 16:12 ` Shaun Ruffell
2010-10-14 18:19   ` Venkatesh Pallipadi
2010-10-14 20:00     ` Shaun Ruffell
2010-10-15 15:11 ` Peter Zijlstra
2010-10-15 15:27   ` Peter Zijlstra
2010-10-15 17:13     ` Venkatesh Pallipadi
2010-10-15 17:20       ` Peter Zijlstra
2010-10-17  9:11       ` Yong Zhang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=tip-305e6835e05513406fa12820e40e4a8ecb63743c@git.kernel.org \
    --to=venki@google.com \
    --cc=a.p.zijlstra@chello.nl \
    --cc=hpa@zytor.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-tip-commits@vger.kernel.org \
    --cc=mingo@elte.hu \
    --cc=mingo@redhat.com \
    --cc=tglx@linutronix.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.