public inbox for linux-kernel@vger.kernel.org
* [PATCH] sched: fix unfairness when upgrade weight
@ 2008-06-30  6:27 Lai Jiangshan
  2008-07-02 21:49 ` Peter Zijlstra
  0 siblings, 1 reply; 7+ messages in thread
From: Lai Jiangshan @ 2008-06-30  6:27 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Linux Kernel Mailing List

When two or more processes upgrade their priority,
unfairness can happen: some of them may get all the cpu-usage,
and the others cannot be scheduled to run for a long time.

example:
# (create 2 processes and set affinity to cpu#0)
# renice 19 pid1 pid2
# renice -19 pid1 pid2

Step 3 upgrades the 2 processes' weight. These 2 processes should
share cpu#0 as soon as possible after step 3, and each of them
should get 50% cpu-usage. But sometimes one of them gets all the
cpu-usage for tens of seconds before they share cpu#0.

fair-group example:
# mkdir 1 2 (create 2 fair-groups)
# (create 2 processes and set affinity to cpu#0)
# echo pid1 > 1/tasks ; echo pid2 > 2/tasks
# echo 2 > 1/cpu.shares ; echo 2 > 2/cpu.shares
# echo $((2**18)) > 1/cpu.shares ; echo $((2**18)) > 2/cpu.shares

The reason why such unfairness happens:

When a sched_entity is running, if its weight is low, its vruntime
increases by a large value every time; if its weight
is high, its vruntime increases by a small value.

So when the two sched_entities' weight is low, they are still
fair even if the difference of their vruntime is large, but if
their weight is upgraded, this large vruntime difference
will cause unfairness.

example:
se1's vruntime         se2's vruntime
    1000M               (R) 1020M
	(assume vruntime increases by about 50M every time)
(R) 1050M                   1020M 
    1050M               (R) 1070M
(R) 1100M                   1070M
    1100M               (R) 1120M 
	(fair, even though the difference of their vruntime is large)
	(upgrade their weight; vruntime now increases by about 10K)
(R) 1100M+10K               1120M
(R) 1100M+20K               1120M
(R) 1100M+30K               1120M
(R) 1100M+40K               1120M
(R) 1100M+50K               1120M
	(se1 gets all the cpu-usage for a long time (maybe tens
of seconds))
	(unfairness: difference=20M is too large for the new weight)

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
---
diff --git a/kernel/sched.c b/kernel/sched.c
index 3aaa5c8..9c4b8cd 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4598,6 +4598,9 @@ void set_user_nice(struct task_struct *p, long nice)
 	delta = p->prio - old_prio;
 
 	if (on_rq) {
+		if (delta < 0 && p->sched_class == &fair_sched_class)
+			upgrade_weight(task_cfs_rq(p), &p->se);
+
 		enqueue_task(rq, p, 0);
 		inc_load(rq, p);
 		/*
@@ -8282,6 +8285,7 @@ static void set_se_shares(struct sched_entity *se, unsigned long shares)
 	struct cfs_rq *cfs_rq = se->cfs_rq;
 	struct rq *rq = cfs_rq->rq;
 	int on_rq;
+	unsigned long old_weight = se->load.weight;
 
 	spin_lock_irq(&rq->lock);
 
@@ -8292,8 +8296,12 @@ static void set_se_shares(struct sched_entity *se, unsigned long shares)
 	se->load.weight = shares;
 	se->load.inv_weight = 0;
 
-	if (on_rq)
+	if (on_rq) {
+		if (old_weight < shares)
+			upgrade_weight(cfs_rq, se);
+
 		enqueue_entity(cfs_rq, se, 0);
+	}
 
 	spin_unlock_irq(&rq->lock);
 }
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 08ae848..f3b2af4 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -587,6 +587,33 @@ static void check_spread(struct cfs_rq *cfs_rq, struct sched_entity *se)
 #endif
 }
 
+static void upgrade_weight(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	unsigned long delta_exec_per_tick = TICK_NSEC;
+	u64 vruntime = cfs_rq->min_vruntime;
+
+	/*
+	 * The new vruntime should be:
+	 *	pre_vruntime + calc_delta_fair(pre_delta_exec, &se->load)
+	 * but we have no fields to remember those two values, so assume
+	 * that this sched_entity has just been enqueued and that its last
+	 * delta_exec was the slice of one tick.
+	 */
+
+	if (cfs_rq->curr) {
+		vruntime = min_vruntime(vruntime,
+				cfs_rq->curr->vruntime);
+	}
+
+	if (first_fair(cfs_rq)) {
+		vruntime = min_vruntime(vruntime,
+				__pick_next_entity(cfs_rq)->vruntime);
+	}
+
+	vruntime += calc_delta_fair(delta_exec_per_tick, &se->load);
+	se->vruntime = min_vruntime(vruntime, se->vruntime);
+}
+
 static void
 place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 {



* [PATCH] sched: fix unfairness when upgrade weight
@ 2009-02-25  7:32 Miao Xie
  2009-02-25  8:20 ` Peter Zijlstra
  0 siblings, 1 reply; 7+ messages in thread
From: Miao Xie @ 2009-02-25  7:32 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra; +Cc: Linux-Kernel

When two or more processes upgrade their priority,
unfairness can happen: some of them may get all the cpu-usage,
and the others cannot be scheduled to run for a long time.

example:
 # (create 2 processes and set affinity to cpu#0)
 # renice 19 pid1 pid2
 # renice -19 pid1 pid2

Step 3 upgrades the 2 processes' weight. These 2 processes should
share cpu#0 as soon as possible after step 3, and each of them
should get 50% cpu-usage. But sometimes one of them gets all the
cpu-usage for tens of seconds before the other is scheduled to run.

fair-group example:
 # mkdir 1 2 (create 2 fair-groups)
 # (create 2 processes and set affinity to cpu#0)
 # echo pid1 > 1/tasks ; echo pid2 > 2/tasks
 # echo 2 > 1/cpu.shares ; echo 2 > 2/cpu.shares
 # echo $((2**18)) > 1/cpu.shares ; echo $((2**18)) > 2/cpu.shares

The reason why such unfairness happens:

When a sched_entity is running, if its weight is low, its vruntime
increases by a large value every time; if its weight is high,
its vruntime increases by a small value.

So when the two sched_entities' weight is low, they are still
fair even if the difference of their vruntime is large, but if
their weight is upgraded, this large vruntime difference
will cause unfairness, because the lagging process has to spend
a long time catching up the huge difference.

example:
se1's vruntime         se2's vruntime
    1000M               (R) 1020M
    (assume vruntime increases by about 50M every time)
(R) 1050M                   1020M
    1050M               (R) 1070M
(R) 1100M                   1070M
    1100M               (R) 1120M
    (fair, even though the difference of their vruntime is large)
    (upgrade their weight; vruntime now increases by about 10K)
(R) 1100M+10K               1120M
(R) 1100M+20K               1120M
(R) 1100M+30K               1120M
(R) 1100M+40K               1120M
(R) 1100M+50K               1120M
    (se1 gets all the cpu-usage for a long time (maybe tens of seconds))
    (unfairness: difference=20M is too large for the new weight)

This patch fixes the bug by tuning the vruntime of weight-upgraded
sched entities, just as is done when waking up a task. The new
vruntime will be
    cfs_rq->min_vruntime + sched_vslice();

Reported-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>

---
 kernel/sched.c      |   16 +++++++++-------
 kernel/sched_fair.c |    9 +++++++++
 2 files changed, 18 insertions(+), 7 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 410eec4..26e6d33 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5096,12 +5096,8 @@ void set_user_nice(struct task_struct *p, long nice)
 
 	if (on_rq) {
 		enqueue_task(rq, p, 0);
-		/*
-		 * If the task increased its priority or is running and
-		 * lowered its priority, then reschedule its CPU:
-		 */
-		if (delta < 0 || (delta > 0 && task_running(rq, p)))
-			resched_task(rq->curr);
+		p->sched_class->prio_changed(rq, p, old_prio,
+						task_running(rq, p));
 	}
 out_unlock:
 	task_rq_unlock(rq, &flags);
@@ -8929,16 +8925,22 @@ static void __set_se_shares(struct sched_entity *se, unsigned long shares)
 {
 	struct cfs_rq *cfs_rq = se->cfs_rq;
 	int on_rq;
+	unsigned long old_weight;
 
 	on_rq = se->on_rq;
 	if (on_rq)
 		dequeue_entity(cfs_rq, se, 0);
 
+	old_weight = se->load.weight;
 	se->load.weight = shares;
 	se->load.inv_weight = 0;
 
-	if (on_rq)
+	if (on_rq) {
+		if (se->load.weight > old_weight)
+			se->vruntime = cfs_rq->min_vruntime
+						+ sched_vslice(cfs_rq, se);
 		enqueue_entity(cfs_rq, se, 0);
+	}
 }
 
 static void set_se_shares(struct sched_entity *se, unsigned long shares)
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 0566f2a..34d4d11 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1690,6 +1690,15 @@ static void task_new_fair(struct rq *rq, struct task_struct *p)
 static void prio_changed_fair(struct rq *rq, struct task_struct *p,
 			      int oldprio, int running)
 {
+	struct cfs_rq *cfs_rq = task_cfs_rq(p);
+	struct sched_entity *se = &p->se;
+	int on_rq = se->on_rq;
+
+	if (p->prio < oldprio && on_rq) {
+		dequeue_entity(cfs_rq, se, 0);
+		se->vruntime = cfs_rq->min_vruntime + sched_vslice(cfs_rq, se);
+		enqueue_entity(cfs_rq, se, 0);
+	}
 	/*
 	 * Reschedule if we are currently running on this runqueue and
 	 * our priority decreased, or if we are not currently running on
-- 
1.6.0.3




end of thread, other threads:[~2009-02-25 11:14 UTC | newest]

Thread overview: 7+ messages
2008-06-30  6:27 [PATCH] sched: fix unfairness when upgrade weight Lai Jiangshan
2008-07-02 21:49 ` Peter Zijlstra
2008-07-03 11:30   ` jha
2008-07-04  2:39     ` Lai Jiangshan
  -- strict thread matches above, loose matches on Subject: below --
2009-02-25  7:32 Miao Xie
2009-02-25  8:20 ` Peter Zijlstra
2009-02-25 11:13   ` Ingo Molnar
