public inbox for linux-kernel@vger.kernel.org
* [PATCH] sched: fix unfairness when upgrade weight
@ 2008-06-30  6:27 Lai Jiangshan
  2008-07-02 21:49 ` Peter Zijlstra
  0 siblings, 1 reply; 7+ messages in thread
From: Lai Jiangshan @ 2008-06-30  6:27 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Linux Kernel Mailing List

When two or more processes upgrade their priority,
unfairness can happen: some of them may get all the cpu-usage,
and the others cannot be scheduled to run for a long time.

example:
# (create 2 processes and set affinity to cpu#0)
# renice 19 pid1 pid2
# renice -19 pid1 pid2

Step 3 upgrades the 2 processes' weight. These 2 processes should
share cpu#0 as soon as possible after step 3, and each of them
should get 50% cpu-usage. But sometimes one of them gets all the
cpu-usage for tens of seconds before they share cpu#0.

fair-group example:
# mkdir 1 2 (create 2 fair-groups)
# (create 2 processes and set affinity to cpu#0)
# echo pid1 > 1/tasks ; echo pid2 > 2/tasks
# echo 2 > 1/cpu.shares ; echo 2 > 2/cpu.shares
# echo $((2**18)) > 1/cpu.shares ; echo $((2**18)) > 2/cpu.shares

The reason why such unfairness happens:

When a sched_entity is running, if its weight is low, its vruntime
increases by a large value every time; if its weight
is high, its vruntime increases by a small value.

So when the two sched_entities' weight is low, they are still
fair even if the difference of their vruntime is large, but if
their weight is upgraded, this large vruntime difference
will cause unfairness.

example:
se1's vruntime         se2's vruntime
    1000M               (R) 1020M
	(assume vruntime increases by about 50M every time)
(R) 1050M                   1020M 
    1050M               (R) 1070M
(R) 1100M                   1070M
    1100M               (R) 1120M 
	(fair, even though the difference of their vruntime is large)
	(upgrade their weight; vruntime now increases by about 10K)
(R) 1100M+10K               1120M
(R) 1100M+20K               1120M
(R) 1100M+30K               1120M
(R) 1100M+40K               1120M
(R) 1100M+50K               1120M
	(se1 gets all the cpu-usage for a long time (maybe tens
of seconds))
	(unfairness: difference=20M is too large for the new weight)

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
---
diff --git a/kernel/sched.c b/kernel/sched.c
index 3aaa5c8..9c4b8cd 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4598,6 +4598,9 @@ void set_user_nice(struct task_struct *p, long nice)
 	delta = p->prio - old_prio;
 
 	if (on_rq) {
+		if (delta < 0 && p->sched_class == &fair_sched_class)
+			upgrade_weight(task_cfs_rq(p), &p->se);
+
 		enqueue_task(rq, p, 0);
 		inc_load(rq, p);
 		/*
@@ -8282,6 +8285,7 @@ static void set_se_shares(struct sched_entity *se, unsigned long shares)
 	struct cfs_rq *cfs_rq = se->cfs_rq;
 	struct rq *rq = cfs_rq->rq;
 	int on_rq;
+	unsigned long old_weight = se->load.weight;
 
 	spin_lock_irq(&rq->lock);
 
@@ -8292,8 +8296,12 @@ static void set_se_shares(struct sched_entity *se, unsigned long shares)
 	se->load.weight = shares;
 	se->load.inv_weight = 0;
 
-	if (on_rq)
+	if (on_rq) {
+		if (old_weight < shares)
+			upgrade_weight(cfs_rq, se);
+
 		enqueue_entity(cfs_rq, se, 0);
+	}
 
 	spin_unlock_irq(&rq->lock);
 }
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 08ae848..f3b2af4 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -587,6 +587,33 @@ static void check_spread(struct cfs_rq *cfs_rq, struct sched_entity *se)
 #endif
 }
 
+static void upgrade_weight(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	unsigned long delta_exec_per_tick = TICK_NSEC;
+	u64 vruntime = cfs_rq->min_vruntime;
+
+	/*
+	 * The new vruntime should be:
+	 *	pre_vruntime + calc_delta_fair(pre_delta_exec, &se->load)
+	 * but we have no fields to remember those two values, so assume
+	 * that this sched_entity has just been enqueued and that its last
+	 * delta_exec was the slice of one tick.
+	 */
+
+	if (cfs_rq->curr) {
+		vruntime = min_vruntime(vruntime,
+				cfs_rq->curr->vruntime);
+	}
+
+	if (first_fair(cfs_rq)) {
+		vruntime = min_vruntime(vruntime,
+				__pick_next_entity(cfs_rq)->vruntime);
+	}
+
+	vruntime += calc_delta_fair(delta_exec_per_tick, &se->load);
+	se->vruntime = min_vruntime(vruntime, se->vruntime);
+}
+
 static void
 place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 {



* [PATCH] sched: fix unfairness when upgrade weight
@ 2009-02-25  7:32 Miao Xie
  2009-02-25  8:20 ` Peter Zijlstra
  0 siblings, 1 reply; 7+ messages in thread
From: Miao Xie @ 2009-02-25  7:32 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra; +Cc: Linux-Kernel

When two or more processes upgrade their priority,
unfairness can happen: some of them may get all the cpu-usage,
and the others cannot be scheduled to run for a long time.

example:
 # (create 2 processes and set affinity to cpu#0)
 # renice 19 pid1 pid2
 # renice -19 pid1 pid2

Step 3 upgrades the 2 processes' weight. These 2 processes should
share cpu#0 as soon as possible after step 3, and each of them
should get 50% cpu-usage. But sometimes one of them gets all the
cpu-usage for tens of seconds before the other is scheduled to run.

fair-group example:
 # mkdir 1 2 (create 2 fair-groups)
 # (create 2 processes and set affinity to cpu#0)
 # echo pid1 > 1/tasks ; echo pid2 > 2/tasks
 # echo 2 > 1/cpu.shares ; echo 2 > 2/cpu.shares
 # echo $((2**18)) > 1/cpu.shares ; echo $((2**18)) > 2/cpu.shares

The reason why such unfairness happens:

When a sched_entity is running, if its weight is low, its vruntime
increases by a large value every time; if its weight is high,
its vruntime increases by a small value.

So when the two sched_entities' weight is low, they are still
fair even if the difference of their vruntime is large, but if
their weight is upgraded, this large vruntime difference
will cause unfairness, because the lagging process has to spend
a long time catching up the huge difference.

example:
se1's vruntime         se2's vruntime
    1000M               (R) 1020M
    (assume vruntime increases by about 50M every time)
(R) 1050M                   1020M
    1050M               (R) 1070M
(R) 1100M                   1070M
    1100M               (R) 1120M
    (fair, even though the difference of their vruntime is large)
    (upgrade their weight; vruntime now increases by about 10K)
(R) 1100M+10K               1120M
(R) 1100M+20K               1120M
(R) 1100M+30K               1120M
(R) 1100M+40K               1120M
(R) 1100M+50K               1120M
    (se1 gets all the cpu-usage for a long time (maybe tens of seconds))
    (unfairness: difference=20M is too large for the new weight)

This patch fixes the bug by tuning the vruntime of weight-upgraded
sched entities, just as is done when waking up a task. The new
vruntime will be
    cfs_rq->min_vruntime + sched_vslice();

Reported-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>

---
 kernel/sched.c      |   16 +++++++++-------
 kernel/sched_fair.c |    9 +++++++++
 2 files changed, 18 insertions(+), 7 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 410eec4..26e6d33 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5096,12 +5096,8 @@ void set_user_nice(struct task_struct *p, long nice)
 
 	if (on_rq) {
 		enqueue_task(rq, p, 0);
-		/*
-		 * If the task increased its priority or is running and
-		 * lowered its priority, then reschedule its CPU:
-		 */
-		if (delta < 0 || (delta > 0 && task_running(rq, p)))
-			resched_task(rq->curr);
+		p->sched_class->prio_changed(rq, p, old_prio,
+						task_running(rq, p));
 	}
 out_unlock:
 	task_rq_unlock(rq, &flags);
@@ -8929,16 +8925,22 @@ static void __set_se_shares(struct sched_entity *se, unsigned long shares)
 {
 	struct cfs_rq *cfs_rq = se->cfs_rq;
 	int on_rq;
+	unsigned long old_weight;
 
 	on_rq = se->on_rq;
 	if (on_rq)
 		dequeue_entity(cfs_rq, se, 0);
 
+	old_weight = se->load.weight;
 	se->load.weight = shares;
 	se->load.inv_weight = 0;
 
-	if (on_rq)
+	if (on_rq) {
+		if (se->load.weight > old_weight)
+			se->vruntime = cfs_rq->min_vruntime
+						+ sched_vslice(cfs_rq, se);
 		enqueue_entity(cfs_rq, se, 0);
+	}
 }
 
 static void set_se_shares(struct sched_entity *se, unsigned long shares)
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 0566f2a..34d4d11 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1690,6 +1690,15 @@ static void task_new_fair(struct rq *rq, struct task_struct *p)
 static void prio_changed_fair(struct rq *rq, struct task_struct *p,
 			      int oldprio, int running)
 {
+	struct cfs_rq *cfs_rq = task_cfs_rq(p);
+	struct sched_entity *se = &p->se;
+	int on_rq = se->on_rq;
+
+	if (p->prio < oldprio && on_rq) {
+		dequeue_entity(cfs_rq, se, 0);
+		se->vruntime = cfs_rq->min_vruntime + sched_vslice(cfs_rq, se);
+		enqueue_entity(cfs_rq, se, 0);
+	}
 	/*
 	 * Reschedule if we are currently running on this runqueue and
 	 * our priority decreased, or if we are not currently running on
-- 
1.6.0.3




end of thread, other threads:[~2009-02-25 11:14 UTC | newest]

Thread overview: 7+ messages
2008-06-30  6:27 [PATCH] sched: fix unfairness when upgrade weight Lai Jiangshan
2008-07-02 21:49 ` Peter Zijlstra
2008-07-03 11:30   ` jha
2008-07-04  2:39     ` Lai Jiangshan
  -- strict thread matches above, loose matches on Subject: below --
2009-02-25  7:32 Miao Xie
2009-02-25  8:20 ` Peter Zijlstra
2009-02-25 11:13   ` Ingo Molnar
