* [patch] fix SMT scheduler latency bug
@ 2005-06-22 10:25 Ingo Molnar
2005-06-22 14:40 ` Con Kolivas
0 siblings, 1 reply; 7+ messages in thread
From: Ingo Molnar @ 2005-06-22 10:25 UTC (permalink / raw)
To: Andrew Morton; +Cc: Con Kolivas, linux-kernel, William Weston
William Weston reported unusually high scheduling latencies on his x86
HT box, on the -RT kernel. I managed to reproduce it on my HT box and
the latency tracer shows the incident in action:
_------=> CPU#
/ _-----=> irqs-off
| / _----=> need-resched
|| / _---=> hardirq/softirq
||| / _--=> preempt-depth
|||| /
||||| delay
cmd pid ||||| time | caller
\ / ||||| \ | /
du-2803 3Dnh2 0us : __trace_start_sched_wakeup (try_to_wake_up)
..............................................................
... we are running on CPU#3, PID 2778 gets woken to CPU#1: ...
..............................................................
du-2803 3Dnh2 0us : __trace_start_sched_wakeup <<...>-2778> (73 1)
du-2803 3Dnh2 0us : _raw_spin_unlock (try_to_wake_up)
................................................
... still on CPU#3, we send an IPI to CPU#1: ...
................................................
du-2803 3Dnh1 0us : resched_task (try_to_wake_up)
du-2803 3Dnh1 1us : smp_send_reschedule (try_to_wake_up)
du-2803 3Dnh1 1us : send_IPI_mask_bitmask (smp_send_reschedule)
du-2803 3Dnh1 2us : _raw_spin_unlock_irqrestore (try_to_wake_up)
...............................................
... 1 usec later, the IPI arrives on CPU#1: ...
...............................................
<idle>-0 1Dnh. 2us : smp_reschedule_interrupt (c0100c5a 0 0)
so far so good, this is the normal wakeup/preemption mechanism. But here
comes the scheduler anomaly on CPU#1:
<idle>-0 1Dnh. 2us : preempt_schedule_irq (need_resched)
<idle>-0 1Dnh. 2us : preempt_schedule_irq (need_resched)
<idle>-0 1Dnh. 3us : __schedule (preempt_schedule_irq)
<idle>-0 1Dnh. 3us : profile_hit (__schedule)
<idle>-0 1Dnh1 3us : sched_clock (__schedule)
<idle>-0 1Dnh1 4us : _raw_spin_lock_irq (__schedule)
<idle>-0 1Dnh1 4us : _raw_spin_lock_irqsave (__schedule)
<idle>-0 1Dnh2 5us : _raw_spin_unlock (__schedule)
<idle>-0 1Dnh1 5us : preempt_schedule (__schedule)
<idle>-0 1Dnh1 6us : _raw_spin_lock (__schedule)
<idle>-0 1Dnh2 6us : find_next_bit (__schedule)
<idle>-0 1Dnh2 6us : _raw_spin_lock (__schedule)
<idle>-0 1Dnh3 7us : find_next_bit (__schedule)
<idle>-0 1Dnh3 7us : find_next_bit (__schedule)
<idle>-0 1Dnh3 8us : _raw_spin_unlock (__schedule)
<idle>-0 1Dnh2 8us : preempt_schedule (__schedule)
<idle>-0 1Dnh2 8us : find_next_bit (__schedule)
<idle>-0 1Dnh2 9us : trace_stop_sched_switched (__schedule)
<idle>-0 1Dnh2 9us : _raw_spin_lock (trace_stop_sched_switched)
<idle>-0 1Dnh3 10us : trace_stop_sched_switched <<...>-2778> (73 8c)
<idle>-0 1Dnh3 10us : _raw_spin_unlock (trace_stop_sched_switched)
<idle>-0 1Dnh1 10us : _raw_spin_unlock (__schedule)
<idle>-0 1Dnh. 11us : local_irq_enable_noresched (preempt_schedule_irq)
<idle>-0 1Dnh. 11us < (0)
we didnt pick up pid 2778! It only gets scheduled much later:
<...>-2778 1Dnh2 412us : __switch_to (__schedule)
<...>-2778 1Dnh2 413us : __schedule <<idle>-0> (8c 73)
<...>-2778 1Dnh2 413us : _raw_spin_unlock (__schedule)
<...>-2778 1Dnh1 413us : trace_stop_sched_switched (__schedule)
<...>-2778 1Dnh1 414us : _raw_spin_lock (trace_stop_sched_switched)
<...>-2778 1Dnh2 414us : trace_stop_sched_switched <<...>-2778> (73 1)
<...>-2778 1Dnh2 414us : _raw_spin_unlock (trace_stop_sched_switched)
<...>-2778 1Dnh1 415us : trace_stop_sched_switched (__schedule)
the reason for this anomaly is the following code in dependent_sleeper():
/*
* If a user task with lower static priority than the
* running task on the SMT sibling is trying to schedule,
* delay it till there is proportionately less timeslice
* left of the sibling task to prevent a lower priority
* task from using an unfair proportion of the
* physical cpu's resources. -ck
*/
[...]
if (((smt_curr->time_slice * (100 - sd->per_cpu_gain) /
100) > task_timeslice(p)))
ret = 1;
note that in contrast to the comment above, we dont actually do the
check based on static priority, we do the check based on timeslices. But
timeslices go up and down, and even highprio tasks can randomly have
very low timeslices (just before their next refill) and can thus be
judged as 'lowprio' by the above piece of code. This condition is
clearly buggy. The correct test is to check for static_prio _and_ to
check for the preemption priority. Even on different static priority
levels, a higher-prio interactive task should not be delayed due to a
higher-static-prio CPU hog.
there is a symmetric bug in the 'kick SMT sibling' code of this function
as well, which can be solved in a similar way.
the patch below (against the current scheduler queue in -mm) fixes both
bugs. I have build and boot-tested this on x86 SMT, and nice +20 tasks
still get properly throttled - so the dependent-sleeper logic is still
in action.
btw., these bugs pessimised the SMT scheduler because the 'delay wakeup'
property was applied too liberally, so this fix is likely a throughput
improvement as well.
i separated out a smt_slice() function to make the code easier to read.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
kernel/sched.c | 19 +++++++++++++++----
1 files changed, 15 insertions(+), 4 deletions(-)
Index: kernel/sched.c
===================================================================
--- kernel/sched.c.orig
+++ kernel/sched.c
@@ -2651,6 +2651,16 @@ static inline void wake_sleeping_depende
*/
}
+/*
+ * number of 'lost' timeslices this task wont be able to fully
+ * utilize, if another task runs on a sibling. This models the
+ * slowdown effect of other tasks running on siblings:
+ */
+static inline unsigned long smt_slice(task_t *p, struct sched_domain *sd)
+{
+ return p->time_slice * (100 - sd->per_cpu_gain) / 100;
+}
+
static inline int dependent_sleeper(int this_cpu, runqueue_t *this_rq)
{
struct sched_domain *tmp, *sd = NULL;
@@ -2715,8 +2725,9 @@ static inline int dependent_sleeper(int
(sd->per_cpu_gain * DEF_TIMESLICE / 100))
ret = 1;
} else
- if (((smt_curr->time_slice * (100 - sd->per_cpu_gain) /
- 100) > task_timeslice(p)))
+ if (smt_curr->static_prio < p->static_prio &&
+ !TASK_PREEMPTS_CURR(p, smt_rq) &&
+ smt_slice(smt_curr, sd) > task_timeslice(p))
ret = 1;
check_smt_task:
@@ -2738,8 +2749,8 @@ check_smt_task:
(sd->per_cpu_gain * DEF_TIMESLICE / 100))
resched_task(smt_curr);
} else {
- if ((p->time_slice * (100 - sd->per_cpu_gain) / 100) >
- task_timeslice(smt_curr))
+ if (TASK_PREEMPTS_CURR(p, smt_rq) &&
+ smt_slice(p, sd) > task_timeslice(smt_curr))
resched_task(smt_curr);
else
wakeup_busy_runqueue(smt_rq);
^ permalink raw reply [flat|nested] 7+ messages in thread* Re: [patch] fix SMT scheduler latency bug 2005-06-22 10:25 [patch] fix SMT scheduler latency bug Ingo Molnar @ 2005-06-22 14:40 ` Con Kolivas 2005-06-22 16:04 ` Ingo Molnar 0 siblings, 1 reply; 7+ messages in thread From: Con Kolivas @ 2005-06-22 14:40 UTC (permalink / raw) To: Ingo Molnar; +Cc: Andrew Morton, linux-kernel, William Weston [-- Attachment #1: Type: text/plain, Size: 3398 bytes --] Hi On Wed, 22 Jun 2005 20:25, Ingo Molnar wrote: > William Weston reported unusually high scheduling latencies on his x86 > HT box, on the -RT kernel. I managed to reproduce it on my HT box and > the latency tracer shows the incident in action: Thanks for picking this up. I've had a long hard look at the code and your patch. > the reason for this anomaly is the following code in dependent_sleeper(): > > /* > * If a user task with lower static priority than the > * running task on the SMT sibling is trying to schedule, > * delay it till there is proportionately less timeslice > * left of the sibling task to prevent a lower priority > * task from using an unfair proportion of the > * physical cpu's resources. -ck > */ > [...] > if (((smt_curr->time_slice * (100 - > sd->per_cpu_gain) / 100) > task_timeslice(p))) > ret = 1; > > note that in contrast to the comment above, we dont actually do the > check based on static priority, we do the check based on timeslices. But > timeslices go up and down, and even highprio tasks can randomly have > very low timeslices (just before their next refill) and can thus be > judged as 'lowprio' by the above piece of code. I don't see it like that. task_timeslice(p) will always return the same value based purely on static priority and smt_curr->time_slice cannot ever be larger than task_timeslice(p) unless there is a significant enough 'nice' difference. It is not smt_curr that is rescheduled as a result of this test, it is p that is not scheduled and we look at p's task_timeslice which does not alter. The task that is delayed in either case is dependant on its static priority which will determine its task_timeslice() vs the current value of ->time_slice on the sibling which is emptied as that task runs, and it is expected to fluctuate. > This condition is > clearly buggy. The correct test is to check for static_prio _and_ to > check for the preemption priority. Even on different static priority > levels, a higher-prio interactive task should not be delayed due to a > higher-static-prio CPU hog. > - if (((smt_curr->time_slice * (100 - sd->per_cpu_gain) / > - 100) > task_timeslice(p))) > + if (smt_curr->static_prio < p->static_prio && > + !TASK_PREEMPTS_CURR(p, smt_rq) && > + smt_slice(smt_curr, sd) > task_timeslice(p)) Checking for smt_curr->static_prio < p->static_prio appears redundant to me because the condition can only be met if there is a significant difference in the different timeslice case as I mentioned above. > + if (TASK_PREEMPTS_CURR(p, smt_rq) && Is this check necessary? The proportion is supposed to be distributed according to static priority only. If this code is causing large latencies then I believe it can only occur with different nice levels running on siblings and high priority tasks starting new timeslices repeatedly and never getting to the last per_cpu_gain% of their timeslice. Ingo do you think this might be what is being seen? If this truly can happen then this code will have to move to a jiffy based proportion as the real time code is to prevent this problem. Cheers, Con [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [patch] fix SMT scheduler latency bug 2005-06-22 14:40 ` Con Kolivas @ 2005-06-22 16:04 ` Ingo Molnar 2005-06-22 23:03 ` Con Kolivas 0 siblings, 1 reply; 7+ messages in thread From: Ingo Molnar @ 2005-06-22 16:04 UTC (permalink / raw) To: Con Kolivas; +Cc: Andrew Morton, linux-kernel, William Weston * Con Kolivas <kernel@kolivas.org> wrote: > > [...] > > if (((smt_curr->time_slice * (100 - > > sd->per_cpu_gain) / 100) > task_timeslice(p))) > > ret = 1; > > > > note that in contrast to the comment above, we dont actually do the > > check based on static priority, we do the check based on timeslices. But > > timeslices go up and down, and even highprio tasks can randomly have > > very low timeslices (just before their next refill) and can thus be > > judged as 'lowprio' by the above piece of code. > > I don't see it like that. task_timeslice(p) will always return the > same value based purely on static priority and smt_curr->time_slice > cannot ever be larger than task_timeslice(p) unless there is a > significant enough 'nice' difference. [...] task_timeslice(p) is indeed constant over time, but smt_curr->time_slice is not. So this condition opens up the possibility of a lower prio thread accumulating a larger ->time_slice and thus reversing the priority equation. > [...] It is not smt_curr that is rescheduled as a result of this test, > it is p that is not scheduled and we look at p's task_timeslice which > does not alter. The task that is delayed in either case is dependant > on its static priority which will determine its task_timeslice() vs > the current value of ->time_slice on the sibling which is emptied as > that task runs, and it is expected to fluctuate. see below. > > This condition is > > clearly buggy. The correct test is to check for static_prio _and_ to > > check for the preemption priority. Even on different static priority > > levels, a higher-prio interactive task should not be delayed due to a > > higher-static-prio CPU hog. > > > - if (((smt_curr->time_slice * (100 - sd->per_cpu_gain) / > > - 100) > task_timeslice(p))) > > + if (smt_curr->static_prio < p->static_prio && > > + !TASK_PREEMPTS_CURR(p, smt_rq) && > > + smt_slice(smt_curr, sd) > task_timeslice(p)) > > Checking for smt_curr->static_prio < p->static_prio appears redundant > to me because the condition can only be met if there is a significant > difference in the different timeslice case as I mentioned above. > > + if (TASK_PREEMPTS_CURR(p, smt_rq) && > > Is this check necessary? The proportion is supposed to be distributed > according to static priority only. yes, it's absolutely necessary. p->time_slice and smt->time_slice is not something we distribute according to any priority rule - it's a mechanism that controls distribution of timeslices, not 'immediateness of execution'. So we should and must not decide preemption based on it. p->time_slice's and smt->time_slice's natural fluctuation is a nice mechanism though to base the statistical approach of 'delaying lower-prio tasks' on - but only if the higher-level rules of preemption are not violated. i.e. a task can be delayed if and only if: - its static priority is lower than that of the currently running task - its dynamic priority is lower than that of the currently running task the first rule implements the policy of protecting higher-static-prio tasks, the second rule is a scheduling correctness (preemption latency avoidance) issue. you are right in that there is overlap in the ->static_prio rule and the smt_slice/time_slice rule - but this is still necessary because for ->time_slice there are no guarantees. (e.g. exiting threads can boost a thread up to twice its normal ->time_slice value.) So by adding this rule we not only speed up the decision in the common case (simple comparison instead of having to do multiplications and divisions), but we also separate all the various constrains of the scheduler clearly. E.g. we can still tweak ->time_slice distribution without having to worry much about the dependent-sleeper logic. there's an updated patch below, i made one more change to the 'interrupt current CPU' logic: i made the priority rules an exact mirror image of the 'delay current wakee' condition. > If this code is causing large latencies then I believe it can only > occur with different nice levels running on siblings and high priority > tasks starting new timeslices repeatedly and never getting to the last > per_cpu_gain% of their timeslice. Ingo do you think this might be what > is being seen? [...] no. This very much occurs with plain normal SCHED_OTHER tasks, no renicing or anything. I believe this bug might also have caused performance regressions (too much idle time on siblings). Ingo ------- William Weston reported unusually high scheduling latencies on his x86 HT box. Managed to reproduce it on my HT box and the -RT tree's latency tracer shows the incident in action: _------=> CPU# / _-----=> irqs-off | / _----=> need-resched || / _---=> hardirq/softirq ||| / _--=> preempt-depth |||| / ||||| delay cmd pid ||||| time | caller \ / ||||| \ | / du-2803 3Dnh2 0us : __trace_start_sched_wakeup (try_to_wake_up) .............................................................. ... we are running on CPU#3, PID 2778 gets woken to CPU#1: ... .............................................................. du-2803 3Dnh2 0us : __trace_start_sched_wakeup <<...>-2778> (73 1) du-2803 3Dnh2 0us : _raw_spin_unlock (try_to_wake_up) ................................................ ... still on CPU#3, we send an IPI to CPU#1: ... ................................................ du-2803 3Dnh1 0us : resched_task (try_to_wake_up) du-2803 3Dnh1 1us : smp_send_reschedule (try_to_wake_up) du-2803 3Dnh1 1us : send_IPI_mask_bitmask (smp_send_reschedule) du-2803 3Dnh1 2us : _raw_spin_unlock_irqrestore (try_to_wake_up) ............................................... ... 1 usec later, the IPI arrives on CPU#1: ... ............................................... <idle>-0 1Dnh. 2us : smp_reschedule_interrupt (c0100c5a 0 0) so far so good, this is the normal wakeup/preemption mechanism. But here comes the scheduler anomaly on CPU#1: <idle>-0 1Dnh. 2us : preempt_schedule_irq (need_resched) <idle>-0 1Dnh. 2us : preempt_schedule_irq (need_resched) <idle>-0 1Dnh. 3us : __schedule (preempt_schedule_irq) <idle>-0 1Dnh. 3us : profile_hit (__schedule) <idle>-0 1Dnh1 3us : sched_clock (__schedule) <idle>-0 1Dnh1 4us : _raw_spin_lock_irq (__schedule) <idle>-0 1Dnh1 4us : _raw_spin_lock_irqsave (__schedule) <idle>-0 1Dnh2 5us : _raw_spin_unlock (__schedule) <idle>-0 1Dnh1 5us : preempt_schedule (__schedule) <idle>-0 1Dnh1 6us : _raw_spin_lock (__schedule) <idle>-0 1Dnh2 6us : find_next_bit (__schedule) <idle>-0 1Dnh2 6us : _raw_spin_lock (__schedule) <idle>-0 1Dnh3 7us : find_next_bit (__schedule) <idle>-0 1Dnh3 7us : find_next_bit (__schedule) <idle>-0 1Dnh3 8us : _raw_spin_unlock (__schedule) <idle>-0 1Dnh2 8us : preempt_schedule (__schedule) <idle>-0 1Dnh2 8us : find_next_bit (__schedule) <idle>-0 1Dnh2 9us : trace_stop_sched_switched (__schedule) <idle>-0 1Dnh2 9us : _raw_spin_lock (trace_stop_sched_switched) <idle>-0 1Dnh3 10us : trace_stop_sched_switched <<...>-2778> (73 8c) <idle>-0 1Dnh3 10us : _raw_spin_unlock (trace_stop_sched_switched) <idle>-0 1Dnh1 10us : _raw_spin_unlock (__schedule) <idle>-0 1Dnh. 11us : local_irq_enable_noresched (preempt_schedule_irq) <idle>-0 1Dnh. 11us < (0) we didnt pick up pid 2778! It only gets scheduled much later: <...>-2778 1Dnh2 412us : __switch_to (__schedule) <...>-2778 1Dnh2 413us : __schedule <<idle>-0> (8c 73) <...>-2778 1Dnh2 413us : _raw_spin_unlock (__schedule) <...>-2778 1Dnh1 413us : trace_stop_sched_switched (__schedule) <...>-2778 1Dnh1 414us : _raw_spin_lock (trace_stop_sched_switched) <...>-2778 1Dnh2 414us : trace_stop_sched_switched <<...>-2778> (73 1) <...>-2778 1Dnh2 414us : _raw_spin_unlock (trace_stop_sched_switched) <...>-2778 1Dnh1 415us : trace_stop_sched_switched (__schedule) the reason for this anomaly is the following code in dependent_sleeper(): /* * If a user task with lower static priority than the * running task on the SMT sibling is trying to schedule, * delay it till there is proportionately less timeslice * left of the sibling task to prevent a lower priority * task from using an unfair proportion of the * physical cpu's resources. -ck */ [...] if (((smt_curr->time_slice * (100 - sd->per_cpu_gain) / 100) > task_timeslice(p))) ret = 1; note that in contrast to the comment above, we dont actually do the check based on static priority, we do the check based on timeslices. But timeslices go up and down, and even highprio tasks can randomly have very low timeslices (just before their next refill) and can thus be judged as 'lowprio' by the above piece of code. This condition is clearly buggy. The correct test is to check for static_prio _and_ to check for the preemption priority. Even on different static priority levels, a higher-prio interactive task should not be delayed due to a higher-static-prio CPU hog. there is a symmetric bug in the 'kick SMT sibling' code of this function as well, which can be solved in a similar way. the patch below fixes both bugs. I have build and boot-tested this on x86 SMT, and nice +20 tasks still get properly throttled. btw., these bugs pessimised the SMT scheduler because the 'delay wakeup' property was applied too liberally, so this fix is likely a throughput improvement as well. i separated out a smt_slice() function to make the code easier to read. Signed-off-by: Ingo Molnar <mingo@elte.hu> kernel/sched.c | 20 ++++++++++++++++---- 1 files changed, 16 insertions(+), 4 deletions(-) Index: kernel/sched.c =================================================================== --- kernel/sched.c.orig +++ kernel/sched.c @@ -2651,6 +2651,16 @@ static inline void wake_sleeping_depende */ } +/* + * number of 'lost' timeslices this task wont be able to fully + * utilize, if another task runs on a sibling. This models the + * slowdown effect of other tasks running on siblings: + */ +static inline unsigned long smt_slice(task_t *p, struct sched_domain *sd) +{ + return p->time_slice * (100 - sd->per_cpu_gain) / 100; +} + static inline int dependent_sleeper(int this_cpu, runqueue_t *this_rq) { struct sched_domain *tmp, *sd = NULL; @@ -2715,8 +2725,9 @@ static inline int dependent_sleeper(int (sd->per_cpu_gain * DEF_TIMESLICE / 100)) ret = 1; } else - if (((smt_curr->time_slice * (100 - sd->per_cpu_gain) / - 100) > task_timeslice(p))) + if (smt_curr->static_prio < p->static_prio && + !TASK_PREEMPTS_CURR(p, smt_rq) && + smt_slice(smt_curr, sd) > task_timeslice(p)) ret = 1; check_smt_task: @@ -2738,8 +2749,9 @@ check_smt_task: (sd->per_cpu_gain * DEF_TIMESLICE / 100)) resched_task(smt_curr); } else { - if ((p->time_slice * (100 - sd->per_cpu_gain) / 100) > - task_timeslice(smt_curr)) + if (smt_curr->static_prio >= p->static_prio && + TASK_PREEMPTS_CURR(p, smt_rq) && + smt_slice(p, sd) >= task_timeslice(smt_curr)) resched_task(smt_curr); else wakeup_busy_runqueue(smt_rq); ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [patch] fix SMT scheduler latency bug 2005-06-22 16:04 ` Ingo Molnar @ 2005-06-22 23:03 ` Con Kolivas 2005-06-22 23:32 ` Ingo Molnar 0 siblings, 1 reply; 7+ messages in thread From: Con Kolivas @ 2005-06-22 23:03 UTC (permalink / raw) To: Ingo Molnar; +Cc: Andrew Morton, linux-kernel, William Weston [-- Attachment #1: Type: text/plain, Size: 1802 bytes --] On Thu, 23 Jun 2005 02:04, Ingo Molnar wrote: > * Con Kolivas <kernel@kolivas.org> wrote: > > > [...] > > > if (((smt_curr->time_slice * (100 - > > > sd->per_cpu_gain) / 100) > task_timeslice(p))) > > > ret = 1; > > > > > > note that in contrast to the comment above, we dont actually do the > > > check based on static priority, we do the check based on timeslices. > > > But timeslices go up and down, and even highprio tasks can randomly > > > have very low timeslices (just before their next refill) and can thus > > > be judged as 'lowprio' by the above piece of code. > > > > I don't see it like that. task_timeslice(p) will always return the > > same value based purely on static priority and smt_curr->time_slice > > cannot ever be larger than task_timeslice(p) unless there is a > > significant enough 'nice' difference. [...] > > task_timeslice(p) is indeed constant over time, but smt_curr->time_slice > is not. So this condition opens up the possibility of a lower prio > thread accumulating a larger ->time_slice and thus reversing the > priority equation. I'm not clear on how the value of ->time_slice can ever grow to larger than task_timeslice(p). It starts at task_timeslice(p) and decrements till it gets to 0 when it refills again. I only recall three times that the value of p->time_slice is increased: New timeslice where it is refilled with task_timeslice(p) At fork() where if it has one tick left it is increased back to one tick Reaping a child where it collects its child's time_slice up to a maximum of task_timeslice(p). I must be missing something but I can't see how p->time_slice can ever be larger than task_timeslice(p). Please correct me :| Cheers, Con [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [patch] fix SMT scheduler latency bug 2005-06-22 23:03 ` Con Kolivas @ 2005-06-22 23:32 ` Ingo Molnar 2005-06-23 0:03 ` Con Kolivas 0 siblings, 1 reply; 7+ messages in thread From: Ingo Molnar @ 2005-06-22 23:32 UTC (permalink / raw) To: Con Kolivas; +Cc: Andrew Morton, linux-kernel, William Weston * Con Kolivas <kernel@kolivas.org> wrote: > > task_timeslice(p) is indeed constant over time, but smt_curr->time_slice > > is not. So this condition opens up the possibility of a lower prio > > thread accumulating a larger ->time_slice and thus reversing the > > priority equation. > > I'm not clear on how the value of ->time_slice can ever grow to larger > than task_timeslice(p). It starts at task_timeslice(p) and decrements > till it gets to 0 when it refills again. I was thinking abut sched_exit(), there we let unused child timeslices 'flow back' into the parent thread, if the child thread was shortlived. The check there does: if (p->first_time_slice) { p->parent->time_slice += p->time_slice; if (unlikely(p->parent->time_slice > task_timeslice(p))) p->parent->time_slice = task_timeslice(p); } notice that we check parent->time_slice against the child's task_timeslice(p), not against task_timeslice(p->parent). So if the child thread got reniced, it could cause a higher-than-normal amount of timeslices. But this should be a rare scenario, and the above code is more of a bug than a feature (will send a patch for it tomorrow), and it should not affect the workloads i was testing. lets take a look at the second condition again: if ((p->time_slice * (100 - sd->per_cpu_gain) / 100) > task_timeslice(smt_curr)) resched_task(smt_curr); if this condition is true then we trigger a preemption at smt_curr. Now in the bug scenario, 'p' is a highprio task and smt_curr is a lowprio task. If p->time_slice (which fluctuates between task_timeslice(p) and 0) happens to be low enough, preemption wont be triggered and we lose a wakeup in essence - 'p', despite being the highest-prio task around, wont be run until some CPU runs schedule() voluntarily. Ok? Ingo ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [patch] fix SMT scheduler latency bug 2005-06-22 23:32 ` Ingo Molnar @ 2005-06-23 0:03 ` Con Kolivas 2005-06-23 13:24 ` Con Kolivas 0 siblings, 1 reply; 7+ messages in thread From: Con Kolivas @ 2005-06-23 0:03 UTC (permalink / raw) To: Ingo Molnar; +Cc: Andrew Morton, linux-kernel, William Weston [-- Attachment #1: Type: text/plain, Size: 2880 bytes --] On Thu, 23 Jun 2005 09:32, Ingo Molnar wrote: > * Con Kolivas <kernel@kolivas.org> wrote: > > > task_timeslice(p) is indeed constant over time, but > > > smt_curr->time_slice is not. So this condition opens up the possibility > > > of a lower prio thread accumulating a larger ->time_slice and thus > > > reversing the priority equation. > > > > I'm not clear on how the value of ->time_slice can ever grow to larger > > than task_timeslice(p). It starts at task_timeslice(p) and decrements > > till it gets to 0 when it refills again. > > I was thinking abut sched_exit(), there we let unused child timeslices > 'flow back' into the parent thread, if the child thread was shortlived. > The check there does: > > if (p->first_time_slice) { > p->parent->time_slice += p->time_slice; > if (unlikely(p->parent->time_slice > task_timeslice(p))) > p->parent->time_slice = task_timeslice(p); > } > > notice that we check parent->time_slice against the child's > task_timeslice(p), not against task_timeslice(p->parent). So if the > child thread got reniced, it could cause a higher-than-normal amount of > timeslices. But this should be a rare scenario, and the above code is > more of a bug than a feature (will send a patch for it tomorrow), and it > should not affect the workloads i was testing. Agreed. > lets take a look at the second condition again: > > if ((p->time_slice * (100 - sd->per_cpu_gain) / 100) > > task_timeslice(smt_curr)) > resched_task(smt_curr); > > if this condition is true then we trigger a preemption at smt_curr. Now > in the bug scenario, 'p' is a highprio task and smt_curr is a lowprio > task. If p->time_slice (which fluctuates between task_timeslice(p) and > 0) happens to be low enough, preemption wont be triggered and we lose a > wakeup in essence - 'p', despite being the highest-prio task around, > wont be run until some CPU runs schedule() voluntarily. Ok? In dependent_sleeper() we return 1 only to prevent p from scheduling. This second condition does not return 1 from dependent_sleeper() so p will still go ahead and schedule. This second condition only affects the scheduling on the smt sibling. About the only scenario I can envision a high priority task being delayed with the code as it currently is in 2.6.12-mm1 is with a high priority task being on the expired array and a low priority task being delayed on the active array. This still should not create large latencies unless array swapping is significantly delayed. I considered adding a check for this originally but it seemed to be unnecessary extra complexity since an expired task by design is expected to be delayed more anyway. Cheers, Con Cheers, Con [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [patch] fix SMT scheduler latency bug 2005-06-23 0:03 ` Con Kolivas @ 2005-06-23 13:24 ` Con Kolivas 0 siblings, 0 replies; 7+ messages in thread From: Con Kolivas @ 2005-06-23 13:24 UTC (permalink / raw) To: linux-kernel; +Cc: Ingo Molnar, Andrew Morton, William Weston [-- Attachment #1: Type: text/plain, Size: 636 bytes --] On Thu, 23 Jun 2005 10:03, Con Kolivas wrote: > About the only scenario I can envision a high priority task being delayed > with the code as it currently is in 2.6.12-mm1 is with a high priority task > being on the expired array and a low priority task being delayed on the > active array. This still should not create large latencies unless array > swapping is significantly delayed. I considered adding a check for this > originally but it seemed to be unnecessary extra complexity since an > expired task by design is expected to be delayed more anyway. BTW if this is an issue it would only require a patch like this. Cheers, Con [-- Attachment #2: sched-smt_nice_check_expired_array.patch --] [-- Type: text/x-diff, Size: 889 bytes --] Index: linux-2.6.12-mm1/kernel/sched.c =================================================================== --- linux-2.6.12-mm1.orig/kernel/sched.c 2005-06-23 00:10:22.000000000 +1000 +++ linux-2.6.12-mm1/kernel/sched.c 2005-06-23 23:19:35.000000000 +1000 @@ -2784,7 +2784,8 @@ static inline int dependent_sleeper(int ret = 1; } else if (((smt_curr->time_slice * (100 - sd->per_cpu_gain) / - 100) > task_timeslice(p))) + 100) > task_timeslice(p)) && + p->static_prio <= this_rq->best_expired_prio) ret = 1; check_smt_task: @@ -2807,7 +2808,8 @@ check_smt_task: resched_task(smt_curr); } else { if ((p->time_slice * (100 - sd->per_cpu_gain) / 100) > - task_timeslice(smt_curr)) + task_timeslice(smt_curr) && + smt_curr->static_prio <= smt_rq->best_expired_prio) resched_task(smt_curr); else wakeup_busy_runqueue(smt_rq); ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2005-06-23 13:30 UTC | newest] Thread overview: 7+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2005-06-22 10:25 [patch] fix SMT scheduler latency bug Ingo Molnar 2005-06-22 14:40 ` Con Kolivas 2005-06-22 16:04 ` Ingo Molnar 2005-06-22 23:03 ` Con Kolivas 2005-06-22 23:32 ` Ingo Molnar 2005-06-23 0:03 ` Con Kolivas 2005-06-23 13:24 ` Con Kolivas
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox