[patch] fix SMT scheduler latency bug

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [patch] fix SMT scheduler latency bug
@ 2005-06-22 10:25 Ingo Molnar
  2005-06-22 14:40 ` Con Kolivas
  0 siblings, 1 reply; 7+ messages in thread
From: Ingo Molnar @ 2005-06-22 10:25 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Con Kolivas, linux-kernel, William Weston


William Weston reported unusually high scheduling latencies on his x86 
HT box, on the -RT kernel. I managed to reproduce it on my HT box and 
the latency tracer shows the incident in action:

                 _------=> CPU#
                / _-----=> irqs-off
               | / _----=> need-resched
               || / _---=> hardirq/softirq
               ||| / _--=> preempt-depth
               |||| /
               |||||     delay
   cmd     pid ||||| time  |   caller
      \   /    |||||   \   |   /
      du-2803  3Dnh2    0us : __trace_start_sched_wakeup (try_to_wake_up)
        ..............................................................
        ... we are running on CPU#3, PID 2778 gets woken to CPU#1: ...
        ..............................................................
      du-2803  3Dnh2    0us : __trace_start_sched_wakeup <<...>-2778> (73 1)
      du-2803  3Dnh2    0us : _raw_spin_unlock (try_to_wake_up)
        ................................................
        ... still on CPU#3, we send an IPI to CPU#1: ...
        ................................................
      du-2803  3Dnh1    0us : resched_task (try_to_wake_up)
      du-2803  3Dnh1    1us : smp_send_reschedule (try_to_wake_up)
      du-2803  3Dnh1    1us : send_IPI_mask_bitmask (smp_send_reschedule)
      du-2803  3Dnh1    2us : _raw_spin_unlock_irqrestore (try_to_wake_up)
        ...............................................
        ... 1 usec later, the IPI arrives on CPU#1: ...
        ...............................................
  <idle>-0     1Dnh.    2us : smp_reschedule_interrupt (c0100c5a 0 0)

so far so good, this is the normal wakeup/preemption mechanism. But here
comes the scheduler anomaly on CPU#1:

  <idle>-0     1Dnh.    2us : preempt_schedule_irq (need_resched)
  <idle>-0     1Dnh.    2us : preempt_schedule_irq (need_resched)
  <idle>-0     1Dnh.    3us : __schedule (preempt_schedule_irq)
  <idle>-0     1Dnh.    3us : profile_hit (__schedule)
  <idle>-0     1Dnh1    3us : sched_clock (__schedule)
  <idle>-0     1Dnh1    4us : _raw_spin_lock_irq (__schedule)
  <idle>-0     1Dnh1    4us : _raw_spin_lock_irqsave (__schedule)
  <idle>-0     1Dnh2    5us : _raw_spin_unlock (__schedule)
  <idle>-0     1Dnh1    5us : preempt_schedule (__schedule)
  <idle>-0     1Dnh1    6us : _raw_spin_lock (__schedule)
  <idle>-0     1Dnh2    6us : find_next_bit (__schedule)
  <idle>-0     1Dnh2    6us : _raw_spin_lock (__schedule)
  <idle>-0     1Dnh3    7us : find_next_bit (__schedule)
  <idle>-0     1Dnh3    7us : find_next_bit (__schedule)
  <idle>-0     1Dnh3    8us : _raw_spin_unlock (__schedule)
  <idle>-0     1Dnh2    8us : preempt_schedule (__schedule)
  <idle>-0     1Dnh2    8us : find_next_bit (__schedule)
  <idle>-0     1Dnh2    9us : trace_stop_sched_switched (__schedule)
  <idle>-0     1Dnh2    9us : _raw_spin_lock (trace_stop_sched_switched)
  <idle>-0     1Dnh3   10us : trace_stop_sched_switched <<...>-2778> (73 8c)
  <idle>-0     1Dnh3   10us : _raw_spin_unlock (trace_stop_sched_switched)
  <idle>-0     1Dnh1   10us : _raw_spin_unlock (__schedule)
  <idle>-0     1Dnh.   11us : local_irq_enable_noresched (preempt_schedule_irq)
  <idle>-0     1Dnh.   11us < (0)

we didnt pick up pid 2778! It only gets scheduled much later:

   <...>-2778  1Dnh2  412us : __switch_to (__schedule)
   <...>-2778  1Dnh2  413us : __schedule <<idle>-0> (8c 73)
   <...>-2778  1Dnh2  413us : _raw_spin_unlock (__schedule)
   <...>-2778  1Dnh1  413us : trace_stop_sched_switched (__schedule)
   <...>-2778  1Dnh1  414us : _raw_spin_lock (trace_stop_sched_switched)
   <...>-2778  1Dnh2  414us : trace_stop_sched_switched <<...>-2778> (73 1)
   <...>-2778  1Dnh2  414us : _raw_spin_unlock (trace_stop_sched_switched)
   <...>-2778  1Dnh1  415us : trace_stop_sched_switched (__schedule)

the reason for this anomaly is the following code in dependent_sleeper():

                /*
                 * If a user task with lower static priority than the
                 * running task on the SMT sibling is trying to schedule,
                 * delay it till there is proportionately less timeslice
                 * left of the sibling task to prevent a lower priority
                 * task from using an unfair proportion of the
                 * physical cpu's resources. -ck
                 */
[...]
                        if (((smt_curr->time_slice * (100 - sd->per_cpu_gain) /
                                100) > task_timeslice(p)))
                                        ret = 1;

note that in contrast to the comment above, we dont actually do the 
check based on static priority, we do the check based on timeslices. But 
timeslices go up and down, and even highprio tasks can randomly have 
very low timeslices (just before their next refill) and can thus be 
judged as 'lowprio' by the above piece of code. This condition is 
clearly buggy. The correct test is to check for static_prio _and_ to 
check for the preemption priority. Even on different static priority 
levels, a higher-prio interactive task should not be delayed due to a 
higher-static-prio CPU hog.

there is a symmetric bug in the 'kick SMT sibling' code of this function 
as well, which can be solved in a similar way.

the patch below (against the current scheduler queue in -mm) fixes both 
bugs. I have build and boot-tested this on x86 SMT, and nice +20 tasks 
still get properly throttled - so the dependent-sleeper logic is still 
in action.

btw., these bugs pessimised the SMT scheduler because the 'delay wakeup' 
property was applied too liberally, so this fix is likely a throughput 
improvement as well.

i separated out a smt_slice() function to make the code easier to read.

Signed-off-by: Ingo Molnar <mingo@elte.hu>

 kernel/sched.c |   19 +++++++++++++++----
 1 files changed, 15 insertions(+), 4 deletions(-)

Index: kernel/sched.c
===================================================================
--- kernel/sched.c.orig
+++ kernel/sched.c
@@ -2651,6 +2651,16 @@ static inline void wake_sleeping_depende
 	 */
 }
 
+/*
+ * number of 'lost' timeslices this task wont be able to fully
+ * utilize, if another task runs on a sibling. This models the
+ * slowdown effect of other tasks running on siblings:
+ */
+static inline unsigned long smt_slice(task_t *p, struct sched_domain *sd)
+{
+	return p->time_slice * (100 - sd->per_cpu_gain) / 100;
+}
+
 static inline int dependent_sleeper(int this_cpu, runqueue_t *this_rq)
 {
 	struct sched_domain *tmp, *sd = NULL;
@@ -2715,8 +2725,9 @@ static inline int dependent_sleeper(int 
 				(sd->per_cpu_gain * DEF_TIMESLICE / 100))
 					ret = 1;
 		} else
-			if (((smt_curr->time_slice * (100 - sd->per_cpu_gain) /
-				100) > task_timeslice(p)))
+			if (smt_curr->static_prio < p->static_prio &&
+				!TASK_PREEMPTS_CURR(p, smt_rq) &&
+				smt_slice(smt_curr, sd) > task_timeslice(p))
 					ret = 1;
 
 check_smt_task:
@@ -2738,8 +2749,8 @@ check_smt_task:
 				(sd->per_cpu_gain * DEF_TIMESLICE / 100))
 					resched_task(smt_curr);
 		} else {
-			if ((p->time_slice * (100 - sd->per_cpu_gain) / 100) >
-				task_timeslice(smt_curr))
+			if (TASK_PREEMPTS_CURR(p, smt_rq) &&
+				smt_slice(p, sd) > task_timeslice(smt_curr))
 					resched_task(smt_curr);
 			else
 				wakeup_busy_runqueue(smt_rq);

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [patch] fix SMT scheduler latency bug
  2005-06-22 10:25 [patch] fix SMT scheduler latency bug Ingo Molnar
@ 2005-06-22 14:40 ` Con Kolivas
  2005-06-22 16:04   ` Ingo Molnar
  0 siblings, 1 reply; 7+ messages in thread
From: Con Kolivas @ 2005-06-22 14:40 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Andrew Morton, linux-kernel, William Weston

[-- Attachment #1: Type: text/plain, Size: 3398 bytes --]

Hi

On Wed, 22 Jun 2005 20:25, Ingo Molnar wrote:
> William Weston reported unusually high scheduling latencies on his x86
> HT box, on the -RT kernel. I managed to reproduce it on my HT box and
> the latency tracer shows the incident in action:

Thanks for picking this up. I've had a long hard look at the code and your 
patch.

> the reason for this anomaly is the following code in dependent_sleeper():
>
>                 /*
>                  * If a user task with lower static priority than the
>                  * running task on the SMT sibling is trying to schedule,
>                  * delay it till there is proportionately less timeslice
>                  * left of the sibling task to prevent a lower priority
>                  * task from using an unfair proportion of the
>                  * physical cpu's resources. -ck
>                  */
> [...]
>                         if (((smt_curr->time_slice * (100 -
> sd->per_cpu_gain) / 100) > task_timeslice(p)))
>                                         ret = 1;
>
> note that in contrast to the comment above, we dont actually do the
> check based on static priority, we do the check based on timeslices. But
> timeslices go up and down, and even highprio tasks can randomly have
> very low timeslices (just before their next refill) and can thus be
> judged as 'lowprio' by the above piece of code. 

I don't see it like that. task_timeslice(p) will always return the same value 
based purely on static priority and smt_curr->time_slice cannot ever be 
larger than task_timeslice(p) unless there is a significant enough 'nice' 
difference. It is not smt_curr that is rescheduled as a result of this test, 
it is p that is not scheduled and we look at p's task_timeslice which does 
not alter. The task that is delayed in either case is dependant on its static 
priority which will determine its task_timeslice() vs the current value of 
->time_slice on the sibling which is emptied as that task runs, and it is 
expected to fluctuate.

> This condition is 
> clearly buggy. The correct test is to check for static_prio _and_ to
> check for the preemption priority. Even on different static priority
> levels, a higher-prio interactive task should not be delayed due to a
> higher-static-prio CPU hog.

> -			if (((smt_curr->time_slice * (100 - sd->per_cpu_gain) /
> -				100) > task_timeslice(p)))
> +			if (smt_curr->static_prio < p->static_prio &&
> +				!TASK_PREEMPTS_CURR(p, smt_rq) &&
> +				smt_slice(smt_curr, sd) > task_timeslice(p))

Checking for smt_curr->static_prio < p->static_prio appears redundant to me 
because the condition can only be met if there is a significant difference in 
the different timeslice case as I mentioned above.

> +			if (TASK_PREEMPTS_CURR(p, smt_rq) &&

Is this check necessary? The proportion is supposed to be distributed 
according to static priority only.

If this code is causing large latencies then I believe it can only occur with 
different nice levels running on siblings and high priority tasks starting 
new timeslices repeatedly and never getting to the last per_cpu_gain% of 
their timeslice. Ingo do you think this might be what is being seen? If this 
truly can happen then this code will have to move to a jiffy based proportion 
as the real time code is to prevent this problem. 

Cheers,
Con

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [patch] fix SMT scheduler latency bug
  2005-06-22 14:40 ` Con Kolivas
@ 2005-06-22 16:04   ` Ingo Molnar
  2005-06-22 23:03     ` Con Kolivas
  0 siblings, 1 reply; 7+ messages in thread
From: Ingo Molnar @ 2005-06-22 16:04 UTC (permalink / raw)
  To: Con Kolivas; +Cc: Andrew Morton, linux-kernel, William Weston



* Con Kolivas <kernel@kolivas.org> wrote:

> > [...]
> >                         if (((smt_curr->time_slice * (100 -
> > sd->per_cpu_gain) / 100) > task_timeslice(p)))
> >                                         ret = 1;
> >
> > note that in contrast to the comment above, we dont actually do the
> > check based on static priority, we do the check based on timeslices. But
> > timeslices go up and down, and even highprio tasks can randomly have
> > very low timeslices (just before their next refill) and can thus be
> > judged as 'lowprio' by the above piece of code. 
> 
> I don't see it like that. task_timeslice(p) will always return the 
> same value based purely on static priority and smt_curr->time_slice 
> cannot ever be larger than task_timeslice(p) unless there is a 
> significant enough 'nice' difference. [...]

task_timeslice(p) is indeed constant over time, but smt_curr->time_slice 
is not. So this condition opens up the possibility of a lower prio 
thread accumulating a larger ->time_slice and thus reversing the 
priority equation.

> [...] It is not smt_curr that is rescheduled as a result of this test, 
> it is p that is not scheduled and we look at p's task_timeslice which 
> does not alter. The task that is delayed in either case is dependant 
> on its static priority which will determine its task_timeslice() vs 
> the current value of ->time_slice on the sibling which is emptied as 
> that task runs, and it is expected to fluctuate.

see below.

> > This condition is 
> > clearly buggy. The correct test is to check for static_prio _and_ to
> > check for the preemption priority. Even on different static priority
> > levels, a higher-prio interactive task should not be delayed due to a
> > higher-static-prio CPU hog.
> 
> > -			if (((smt_curr->time_slice * (100 - sd->per_cpu_gain) /
> > -				100) > task_timeslice(p)))
> > +			if (smt_curr->static_prio < p->static_prio &&
> > +				!TASK_PREEMPTS_CURR(p, smt_rq) &&
> > +				smt_slice(smt_curr, sd) > task_timeslice(p))
> 
> Checking for smt_curr->static_prio < p->static_prio appears redundant 
> to me because the condition can only be met if there is a significant 
> difference in the different timeslice case as I mentioned above.

> > +			if (TASK_PREEMPTS_CURR(p, smt_rq) &&
> 
> Is this check necessary? The proportion is supposed to be distributed 
> according to static priority only.

yes, it's absolutely necessary. p->time_slice and smt->time_slice is not 
something we distribute according to any priority rule - it's a 
mechanism that controls distribution of timeslices, not 'immediateness 
of execution'. So we should and must not decide preemption based on it.  

p->time_slice's and smt->time_slice's natural fluctuation is a nice 
mechanism though to base the statistical approach of 'delaying 
lower-prio tasks' on - but only if the higher-level rules of preemption 
are not violated.

i.e. a task can be delayed if and only if:

  - its static priority is lower than that of the currently running task
  - its dynamic priority is lower than that of the currently running task

the first rule implements the policy of protecting higher-static-prio 
tasks, the second rule is a scheduling correctness (preemption latency 
avoidance) issue.

you are right in that there is overlap in the ->static_prio rule and the 
smt_slice/time_slice rule - but this is still necessary because for 
->time_slice there are no guarantees. (e.g. exiting threads can boost a 
thread up to twice its normal ->time_slice value.) So by adding this 
rule we not only speed up the decision in the common case (simple 
comparison instead of having to do multiplications and divisions), but 
we also separate all the various constrains of the scheduler clearly.  
E.g. we can still tweak ->time_slice distribution without having to 
worry much about the dependent-sleeper logic.

there's an updated patch below, i made one more change to the 'interrupt 
current CPU' logic: i made the priority rules an exact mirror image of 
the 'delay current wakee' condition.

> If this code is causing large latencies then I believe it can only 
> occur with different nice levels running on siblings and high priority 
> tasks starting new timeslices repeatedly and never getting to the last 
> per_cpu_gain% of their timeslice. Ingo do you think this might be what 
> is being seen? [...]

no. This very much occurs with plain normal SCHED_OTHER tasks, no 
renicing or anything. I believe this bug might also have caused 
performance regressions (too much idle time on siblings).

	Ingo

-------
William Weston reported unusually high scheduling latencies on his
x86 HT box. Managed to reproduce it on my HT box and the -RT tree's
latency tracer shows the incident in action:

                 _------=> CPU#
                / _-----=> irqs-off
               | / _----=> need-resched
               || / _---=> hardirq/softirq
               ||| / _--=> preempt-depth
               |||| /
               |||||     delay
   cmd     pid ||||| time  |   caller
      \   /    |||||   \   |   /
      du-2803  3Dnh2    0us : __trace_start_sched_wakeup (try_to_wake_up)
        ..............................................................
        ... we are running on CPU#3, PID 2778 gets woken to CPU#1: ...
        ..............................................................
      du-2803  3Dnh2    0us : __trace_start_sched_wakeup <<...>-2778> (73 1)
      du-2803  3Dnh2    0us : _raw_spin_unlock (try_to_wake_up)
        ................................................
        ... still on CPU#3, we send an IPI to CPU#1: ...
        ................................................
      du-2803  3Dnh1    0us : resched_task (try_to_wake_up)
      du-2803  3Dnh1    1us : smp_send_reschedule (try_to_wake_up)
      du-2803  3Dnh1    1us : send_IPI_mask_bitmask (smp_send_reschedule)
      du-2803  3Dnh1    2us : _raw_spin_unlock_irqrestore (try_to_wake_up)
        ...............................................
        ... 1 usec later, the IPI arrives on CPU#1: ...
        ...............................................
  <idle>-0     1Dnh.    2us : smp_reschedule_interrupt (c0100c5a 0 0)

so far so good, this is the normal wakeup/preemption mechanism. But here
comes the scheduler anomaly on CPU#1:

  <idle>-0     1Dnh.    2us : preempt_schedule_irq (need_resched)
  <idle>-0     1Dnh.    2us : preempt_schedule_irq (need_resched)
  <idle>-0     1Dnh.    3us : __schedule (preempt_schedule_irq)
  <idle>-0     1Dnh.    3us : profile_hit (__schedule)
  <idle>-0     1Dnh1    3us : sched_clock (__schedule)
  <idle>-0     1Dnh1    4us : _raw_spin_lock_irq (__schedule)
  <idle>-0     1Dnh1    4us : _raw_spin_lock_irqsave (__schedule)
  <idle>-0     1Dnh2    5us : _raw_spin_unlock (__schedule)
  <idle>-0     1Dnh1    5us : preempt_schedule (__schedule)
  <idle>-0     1Dnh1    6us : _raw_spin_lock (__schedule)
  <idle>-0     1Dnh2    6us : find_next_bit (__schedule)
  <idle>-0     1Dnh2    6us : _raw_spin_lock (__schedule)
  <idle>-0     1Dnh3    7us : find_next_bit (__schedule)
  <idle>-0     1Dnh3    7us : find_next_bit (__schedule)
  <idle>-0     1Dnh3    8us : _raw_spin_unlock (__schedule)
  <idle>-0     1Dnh2    8us : preempt_schedule (__schedule)
  <idle>-0     1Dnh2    8us : find_next_bit (__schedule)
  <idle>-0     1Dnh2    9us : trace_stop_sched_switched (__schedule)
  <idle>-0     1Dnh2    9us : _raw_spin_lock (trace_stop_sched_switched)
  <idle>-0     1Dnh3   10us : trace_stop_sched_switched <<...>-2778> (73 8c)
  <idle>-0     1Dnh3   10us : _raw_spin_unlock (trace_stop_sched_switched)
  <idle>-0     1Dnh1   10us : _raw_spin_unlock (__schedule)
  <idle>-0     1Dnh.   11us : local_irq_enable_noresched (preempt_schedule_irq)
  <idle>-0     1Dnh.   11us < (0)

we didnt pick up pid 2778! It only gets scheduled much later:

   <...>-2778  1Dnh2  412us : __switch_to (__schedule)
   <...>-2778  1Dnh2  413us : __schedule <<idle>-0> (8c 73)
   <...>-2778  1Dnh2  413us : _raw_spin_unlock (__schedule)
   <...>-2778  1Dnh1  413us : trace_stop_sched_switched (__schedule)
   <...>-2778  1Dnh1  414us : _raw_spin_lock (trace_stop_sched_switched)
   <...>-2778  1Dnh2  414us : trace_stop_sched_switched <<...>-2778> (73 1)
   <...>-2778  1Dnh2  414us : _raw_spin_unlock (trace_stop_sched_switched)
   <...>-2778  1Dnh1  415us : trace_stop_sched_switched (__schedule)

the reason for this anomaly is the following code in dependent_sleeper():

                /*
                 * If a user task with lower static priority than the
                 * running task on the SMT sibling is trying to schedule,
                 * delay it till there is proportionately less timeslice
                 * left of the sibling task to prevent a lower priority
                 * task from using an unfair proportion of the
                 * physical cpu's resources. -ck
                 */
[...]
                        if (((smt_curr->time_slice * (100 - sd->per_cpu_gain) /
                                100) > task_timeslice(p)))
                                        ret = 1;

note that in contrast to the comment above, we dont actually do the check
based on static priority, we do the check based on timeslices. But
timeslices go up and down, and even highprio tasks can randomly have very
low timeslices (just before their next refill) and can thus be judged as
'lowprio' by the above piece of code. This condition is clearly buggy.
The correct test is to check for static_prio _and_ to check for the preemption
priority. Even on different static priority levels, a higher-prio interactive
task should not be delayed due to a higher-static-prio CPU hog.

there is a symmetric bug in the 'kick SMT sibling' code of this function
as well, which can be solved in a similar way.

the patch below fixes both bugs. I have build and boot-tested this on x86
SMT, and nice +20 tasks still get properly throttled.

btw., these bugs pessimised the SMT scheduler because the 'delay wakeup'
property was applied too liberally, so this fix is likely a throughput
improvement as well.

i separated out a smt_slice() function to make the code easier to read.

Signed-off-by: Ingo Molnar <mingo@elte.hu>

 kernel/sched.c |   20 ++++++++++++++++----
 1 files changed, 16 insertions(+), 4 deletions(-)

Index: kernel/sched.c
===================================================================
--- kernel/sched.c.orig
+++ kernel/sched.c
@@ -2651,6 +2651,16 @@ static inline void wake_sleeping_depende
 	 */
 }
 
+/*
+ * number of 'lost' timeslices this task wont be able to fully
+ * utilize, if another task runs on a sibling. This models the
+ * slowdown effect of other tasks running on siblings:
+ */
+static inline unsigned long smt_slice(task_t *p, struct sched_domain *sd)
+{
+	return p->time_slice * (100 - sd->per_cpu_gain) / 100;
+}
+
 static inline int dependent_sleeper(int this_cpu, runqueue_t *this_rq)
 {
 	struct sched_domain *tmp, *sd = NULL;
@@ -2715,8 +2725,9 @@ static inline int dependent_sleeper(int 
 				(sd->per_cpu_gain * DEF_TIMESLICE / 100))
 					ret = 1;
 		} else
-			if (((smt_curr->time_slice * (100 - sd->per_cpu_gain) /
-				100) > task_timeslice(p)))
+			if (smt_curr->static_prio < p->static_prio &&
+				!TASK_PREEMPTS_CURR(p, smt_rq) &&
+				smt_slice(smt_curr, sd) > task_timeslice(p))
 					ret = 1;
 
 check_smt_task:
@@ -2738,8 +2749,9 @@ check_smt_task:
 				(sd->per_cpu_gain * DEF_TIMESLICE / 100))
 					resched_task(smt_curr);
 		} else {
-			if ((p->time_slice * (100 - sd->per_cpu_gain) / 100) >
-				task_timeslice(smt_curr))
+			if (smt_curr->static_prio >= p->static_prio &&
+				TASK_PREEMPTS_CURR(p, smt_rq) &&
+				smt_slice(p, sd) >= task_timeslice(smt_curr))
 					resched_task(smt_curr);
 			else
 				wakeup_busy_runqueue(smt_rq);

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [patch] fix SMT scheduler latency bug
  2005-06-22 16:04   ` Ingo Molnar
@ 2005-06-22 23:03     ` Con Kolivas
  2005-06-22 23:32       ` Ingo Molnar
  0 siblings, 1 reply; 7+ messages in thread
From: Con Kolivas @ 2005-06-22 23:03 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Andrew Morton, linux-kernel, William Weston

[-- Attachment #1: Type: text/plain, Size: 1802 bytes --]

On Thu, 23 Jun 2005 02:04, Ingo Molnar wrote:
> * Con Kolivas <kernel@kolivas.org> wrote:
> > > [...]
> > >                         if (((smt_curr->time_slice * (100 -
> > > sd->per_cpu_gain) / 100) > task_timeslice(p)))
> > >                                         ret = 1;
> > >
> > > note that in contrast to the comment above, we dont actually do the
> > > check based on static priority, we do the check based on timeslices.
> > > But timeslices go up and down, and even highprio tasks can randomly
> > > have very low timeslices (just before their next refill) and can thus
> > > be judged as 'lowprio' by the above piece of code.
> >
> > I don't see it like that. task_timeslice(p) will always return the
> > same value based purely on static priority and smt_curr->time_slice
> > cannot ever be larger than task_timeslice(p) unless there is a
> > significant enough 'nice' difference. [...]
>
> task_timeslice(p) is indeed constant over time, but smt_curr->time_slice
> is not. So this condition opens up the possibility of a lower prio
> thread accumulating a larger ->time_slice and thus reversing the
> priority equation.

I'm not clear on how the value of ->time_slice can ever grow to larger than 
task_timeslice(p). It starts at task_timeslice(p) and decrements till it gets 
to 0 when it refills again. 

I only recall three times that the value of p->time_slice is increased:
New timeslice where it is refilled with task_timeslice(p)
At fork() where if it has one tick left it is increased back to one tick
Reaping a child where it collects its child's time_slice up to a maximum of 
task_timeslice(p).

I must be missing something but I can't see how p->time_slice can ever be 
larger than task_timeslice(p).

Please correct me :|

Cheers,
Con

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [patch] fix SMT scheduler latency bug
  2005-06-22 23:03     ` Con Kolivas
@ 2005-06-22 23:32       ` Ingo Molnar
  2005-06-23  0:03         ` Con Kolivas
  0 siblings, 1 reply; 7+ messages in thread
From: Ingo Molnar @ 2005-06-22 23:32 UTC (permalink / raw)
  To: Con Kolivas; +Cc: Andrew Morton, linux-kernel, William Weston

* Con Kolivas <kernel@kolivas.org> wrote:

> > task_timeslice(p) is indeed constant over time, but smt_curr->time_slice
> > is not. So this condition opens up the possibility of a lower prio
> > thread accumulating a larger ->time_slice and thus reversing the
> > priority equation.
> 
> I'm not clear on how the value of ->time_slice can ever grow to larger 
> than task_timeslice(p). It starts at task_timeslice(p) and decrements 
> till it gets to 0 when it refills again.

I was thinking abut sched_exit(), there we let unused child timeslices 
'flow back' into the parent thread, if the child thread was shortlived. 
The check there does:

        if (p->first_time_slice) {
                p->parent->time_slice += p->time_slice;
                if (unlikely(p->parent->time_slice > task_timeslice(p)))
                        p->parent->time_slice = task_timeslice(p);
        }

notice that we check parent->time_slice against the child's 
task_timeslice(p), not against task_timeslice(p->parent). So if the 
child thread got reniced, it could cause a higher-than-normal amount of 
timeslices. But this should be a rare scenario, and the above code is 
more of a bug than a feature (will send a patch for it tomorrow), and it 
should not affect the workloads i was testing.

lets take a look at the second condition again:

                if ((p->time_slice * (100 - sd->per_cpu_gain) / 100) >
                        task_timeslice(smt_curr))
                                resched_task(smt_curr);

if this condition is true then we trigger a preemption at smt_curr. Now 
in the bug scenario, 'p' is a highprio task and smt_curr is a lowprio 
task. If p->time_slice (which fluctuates between task_timeslice(p) and 
0) happens to be low enough, preemption wont be triggered and we lose a 
wakeup in essence - 'p', despite being the highest-prio task around, 
wont be run until some CPU runs schedule() voluntarily. Ok?

	Ingo

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [patch] fix SMT scheduler latency bug
  2005-06-22 23:32       ` Ingo Molnar
@ 2005-06-23  0:03         ` Con Kolivas
  2005-06-23 13:24           ` Con Kolivas
  0 siblings, 1 reply; 7+ messages in thread
From: Con Kolivas @ 2005-06-23  0:03 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Andrew Morton, linux-kernel, William Weston

[-- Attachment #1: Type: text/plain, Size: 2880 bytes --]

On Thu, 23 Jun 2005 09:32, Ingo Molnar wrote:
> * Con Kolivas <kernel@kolivas.org> wrote:
> > > task_timeslice(p) is indeed constant over time, but
> > > smt_curr->time_slice is not. So this condition opens up the possibility
> > > of a lower prio thread accumulating a larger ->time_slice and thus
> > > reversing the priority equation.
> >
> > I'm not clear on how the value of ->time_slice can ever grow to larger
> > than task_timeslice(p). It starts at task_timeslice(p) and decrements
> > till it gets to 0 when it refills again.
>
> I was thinking abut sched_exit(), there we let unused child timeslices
> 'flow back' into the parent thread, if the child thread was shortlived.
> The check there does:
>
>         if (p->first_time_slice) {
>                 p->parent->time_slice += p->time_slice;
>                 if (unlikely(p->parent->time_slice > task_timeslice(p)))
>                         p->parent->time_slice = task_timeslice(p);
>         }
>
> notice that we check parent->time_slice against the child's
> task_timeslice(p), not against task_timeslice(p->parent). So if the
> child thread got reniced, it could cause a higher-than-normal amount of
> timeslices. But this should be a rare scenario, and the above code is
> more of a bug than a feature (will send a patch for it tomorrow), and it
> should not affect the workloads i was testing.

Agreed.

> lets take a look at the second condition again:
>
>                 if ((p->time_slice * (100 - sd->per_cpu_gain) / 100) >
>                         task_timeslice(smt_curr))
>                                 resched_task(smt_curr);
>
> if this condition is true then we trigger a preemption at smt_curr. Now
> in the bug scenario, 'p' is a highprio task and smt_curr is a lowprio
> task. If p->time_slice (which fluctuates between task_timeslice(p) and
> 0) happens to be low enough, preemption wont be triggered and we lose a
> wakeup in essence - 'p', despite being the highest-prio task around,
> wont be run until some CPU runs schedule() voluntarily. Ok?

In dependent_sleeper() we return 1 only to prevent p from scheduling. This 
second condition does not return 1 from dependent_sleeper() so p will still 
go ahead and schedule. This second condition only affects the scheduling on 
the smt sibling.

About the only scenario I can envision a high priority task being delayed with 
the code as it currently is in 2.6.12-mm1 is with a high priority task being 
on the expired array and a low priority task being delayed on the active 
array. This still should not create large latencies unless array swapping is 
significantly delayed. I considered adding a check for this originally but it 
seemed to be unnecessary extra complexity since an expired task by design is 
expected to be delayed more anyway.

Cheers,
Con


Cheers,
Con

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [patch] fix SMT scheduler latency bug
  2005-06-23  0:03         ` Con Kolivas
@ 2005-06-23 13:24           ` Con Kolivas
  0 siblings, 0 replies; 7+ messages in thread
From: Con Kolivas @ 2005-06-23 13:24 UTC (permalink / raw)
  To: linux-kernel; +Cc: Ingo Molnar, Andrew Morton, William Weston

[-- Attachment #1: Type: text/plain, Size: 636 bytes --]

On Thu, 23 Jun 2005 10:03, Con Kolivas wrote:
> About the only scenario I can envision a high priority task being delayed
> with the code as it currently is in 2.6.12-mm1 is with a high priority task
> being on the expired array and a low priority task being delayed on the
> active array. This still should not create large latencies unless array
> swapping is significantly delayed. I considered adding a check for this
> originally but it seemed to be unnecessary extra complexity since an
> expired task by design is expected to be delayed more anyway.

BTW if this is an issue it would only require a patch like this.

Cheers,
Con

[-- Attachment #2: sched-smt_nice_check_expired_array.patch --]
[-- Type: text/x-diff, Size: 889 bytes --]

Index: linux-2.6.12-mm1/kernel/sched.c
===================================================================
--- linux-2.6.12-mm1.orig/kernel/sched.c	2005-06-23 00:10:22.000000000 +1000
+++ linux-2.6.12-mm1/kernel/sched.c	2005-06-23 23:19:35.000000000 +1000
@@ -2784,7 +2784,8 @@ static inline int dependent_sleeper(int 
 					ret = 1;
 		} else
 			if (((smt_curr->time_slice * (100 - sd->per_cpu_gain) /
-				100) > task_timeslice(p)))
+				100) > task_timeslice(p)) &&
+				p->static_prio <= this_rq->best_expired_prio)
 					ret = 1;
 
 check_smt_task:
@@ -2807,7 +2808,8 @@ check_smt_task:
 					resched_task(smt_curr);
 		} else {
 			if ((p->time_slice * (100 - sd->per_cpu_gain) / 100) >
-				task_timeslice(smt_curr))
+				task_timeslice(smt_curr) &&
+				smt_curr->static_prio <= smt_rq->best_expired_prio)
 					resched_task(smt_curr);
 			else
 				wakeup_busy_runqueue(smt_rq);

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2005-06-23 13:30 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2005-06-22 10:25 [patch] fix SMT scheduler latency bug Ingo Molnar
2005-06-22 14:40 ` Con Kolivas
2005-06-22 16:04   ` Ingo Molnar
2005-06-22 23:03     ` Con Kolivas
2005-06-22 23:32       ` Ingo Molnar
2005-06-23  0:03         ` Con Kolivas
2005-06-23 13:24           ` Con Kolivas

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox