public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
To: Mike Galbraith <efault@gmx.de>
Cc: Brian Rogers <brian@xyzw.org>, Ingo Molnar <mingo@elte.hu>,
	linux-kernel@vger.kernel.org
Subject: Re: [BUG] How to get real-time priority using idle priority
Date: Thu, 15 Jan 2009 11:14:16 +0100	[thread overview]
Message-ID: <1232014456.8870.26.camel@laptop> (raw)
In-Reply-To: <1232011723.26761.36.camel@marge.simson.net>

On Thu, 2009-01-15 at 10:28 +0100, Mike Galbraith wrote:

> The real problem (excluding the SCHED_IDLE specific problems), is that
> update_min_vruntime() doesn't work quite as intended, and will slam
> min_vruntime far right if load balancing etc places a task which is far
> right of the currently running task on the runqueue.  If the currently
> running, and up to this point the min_vruntime pace setter, is a hog,
> any task waking to this runqueue after min_vruntime leaps forward has to
> wait for the hog to consume the gap.  In the case of SCHED_IDLE tasks,
> that gap can be huge, but even with a nice 19 tasks it can be quite
> large and painful.
> 
> Removing the if (vruntime == cfs_rq->min_vruntime) test, which will be
> true if the currently running task is the pace setter, cured it for me.

OK, so we have 1 running task A (which is obviously curr and the tree is
equally obviously empty).

A nicely chugs along, doing its thing, carrying min_vruntime along as it
goes.

Then some whacko speed freak SCHED_IDLE task gets inserted due to SMP
balancing, which is very likely far right, in that case

update_curr
  update_min_vruntime
    cfs_rq->rb_leftmost := true (the crazy task sitting in a tree)
      vruntime = se->vruntime

and voila, min_vruntime is waaay right of where it ought to be.

OK, so why did I write it like that to begin with...

Aah, yes.

Say we've just dequeued current

schedule
  deactivate_task(prev)
    dequeue_entity
      update_min_vruntime

Then we'll set

  vruntime = cfs_rq->min_vruntime;

we find !cfs_rq->curr, but do find someone in the tree. Then we _must_
do vruntime = se->vruntime, because

 vruntime = min_vruntime(vruntime := cfs_rq->min_vruntime, se->vruntime)

will not advance vruntime, and cause lags the other way around (which we
fixed with that initial patch: 1af5f730fc1bf7c62ec9fb2d307206e18bf40a69
(sched: more accurate min_vruntime accounting).

Which leads me to suggest the following

---
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 8e1352c..f2d2d94 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -283,7 +283,7 @@ static void update_min_vruntime(struct cfs_rq
*cfs_rq)
 						   struct sched_entity,
 						   run_node);
 
-		if (vruntime == cfs_rq->min_vruntime)
+		if (!cfs_rq->curr)
 			vruntime = se->vruntime;
 		else
 			vruntime = min_vruntime(vruntime, se->vruntime);

The below can be split into 3 patches:

 - the idle weight change (do we really need that? why?)
 - the above update_min_vruntime() fix
 - the SCHED_IDLE vs SCHED_OTHER isolation changes (ACK on those)

> If this cures your woes (and Peter acks it), I'll split and submit.
> 
> diff --git a/kernel/sched.c b/kernel/sched.c
> index deb5ac8..e9f0762 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -1320,8 +1320,8 @@ static inline void update_load_sub(struct load_weight *lw, unsigned long dec)
>   * slice expiry etc.
>   */
>  
> -#define WEIGHT_IDLEPRIO		2
> -#define WMULT_IDLEPRIO		(1 << 31)
> +#define WEIGHT_IDLEPRIO		3
> +#define WMULT_IDLEPRIO		1431655765
>  
>  /*
>   * Nice levels are multiplicative, with a gentle 10% change for every
> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index 8e1352c..761071d 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -283,10 +283,7 @@ static void update_min_vruntime(struct cfs_rq *cfs_rq)
>  						   struct sched_entity,
>  						   run_node);
>  
> -		if (vruntime == cfs_rq->min_vruntime)
> -			vruntime = se->vruntime;
> -		else
> -			vruntime = min_vruntime(vruntime, se->vruntime);
> +		vruntime = min_vruntime(vruntime, se->vruntime);
>  	}
>  
>  	cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
> @@ -677,9 +674,13 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
>  			unsigned long thresh = sysctl_sched_latency;
>  
>  			/*
> -			 * convert the sleeper threshold into virtual time
> +			 * Convert the sleeper threshold into virtual time.
> +			 * SCHED_IDLE is a special sub-class.  We care about
> +			 * fairness only relative to other SCHED_IDLE tasks,
> +			 * all of which have the same weight.
>  			 */
> -			if (sched_feat(NORMALIZED_SLEEPER))
> +			if (sched_feat(NORMALIZED_SLEEPER) &&
> +					task_of(se)->policy != SCHED_IDLE)
>  				thresh = calc_delta_fair(thresh, se);
>  
>  			vruntime -= thresh;
> @@ -1340,14 +1341,18 @@ wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
>  
>  static void set_last_buddy(struct sched_entity *se)
>  {
> -	for_each_sched_entity(se)
> -		cfs_rq_of(se)->last = se;
> +	for_each_sched_entity(se) {
> +		if (likely(task_of(se)->policy != SCHED_IDLE))
> +			cfs_rq_of(se)->last = se;
> +	}
>  }
>  
>  static void set_next_buddy(struct sched_entity *se)
>  {
> -	for_each_sched_entity(se)
> -		cfs_rq_of(se)->next = se;
> +	for_each_sched_entity(se) {
> +		if (likely(task_of(se)->policy != SCHED_IDLE))
> +			cfs_rq_of(se)->next = se;
> +	}
>  }
>  
>  /*
> @@ -1393,12 +1398,18 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int sync)
>  		return;
>  
>  	/*
> -	 * Batch tasks do not preempt (their preemption is driven by
> +	 * Batch and idle tasks do not preempt (their preemption is driven by
>  	 * the tick):
>  	 */
> -	if (unlikely(p->policy == SCHED_BATCH))
> +	if (unlikely(p->policy != SCHED_NORMAL))
>  		return;
>  
> +	/* Idle tasks are by definition preempted by everybody. */
> +	if (unlikely(curr->policy == SCHED_IDLE)) {
> +		resched_task(curr);
> +		return;
> +	}
> +
>  	if (!sched_feat(WAKEUP_PREEMPT))
>  		return;
>  
> 
> 


  reply	other threads:[~2009-01-15 10:14 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-01-11 10:58 [BUG] How to get real-time priority using idle priority Brian Rogers
2009-01-12  5:09 ` Mike Galbraith
2009-01-12 13:03   ` Mike Galbraith
2009-01-12 13:14     ` Ingo Molnar
2009-01-12 15:23       ` Mike Galbraith
2009-01-12 15:24         ` Ingo Molnar
2009-01-13  1:05       ` Brian Rogers
2009-01-13  2:58         ` Mike Galbraith
2009-01-14  5:13           ` Mike Galbraith
2009-01-14  5:31             ` Ingo Molnar
2009-01-14  6:02               ` Mike Galbraith
2009-01-14  7:35                 ` Ingo Molnar
2009-01-15  9:28         ` Mike Galbraith
2009-01-15 10:14           ` Peter Zijlstra [this message]
2009-01-15 10:30             ` Mike Galbraith
2009-01-15 11:37               ` Mike Galbraith
2009-01-15 11:41                 ` Peter Zijlstra
2009-01-15 12:54                   ` Ingo Molnar
2009-01-15 13:05                     ` Peter Zijlstra
2009-01-15 13:15                       ` Ingo Molnar
2009-01-15 13:16                     ` Peter Zijlstra
2009-01-15 12:07           ` Brian Rogers
2009-01-12 20:46     ` [patch take 2] " Mike Galbraith
2009-01-12 20:50       ` Mike Galbraith

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1232014456.8870.26.camel@laptop \
    --to=a.p.zijlstra@chello.nl \
    --cc=brian@xyzw.org \
    --cc=efault@gmx.de \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@elte.hu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox