public inbox for linux-kernel@vger.kernel.org
* bug in sched.c:task_hot()
@ 2004-10-05  2:38 Chen, Kenneth W
  2004-10-05  3:17 ` Nick Piggin
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Chen, Kenneth W @ 2004-10-05  2:38 UTC (permalink / raw)
  To: linux-kernel

The current implementation of task_hot() has a performance bug: it
can cause integer underflow.

Variable "now" (typically passed in as rq->timestamp_last_tick) and
p->timestamp are both defined as unsigned long long.  However, if the
former is smaller than the latter, integer underflow occurs, making
the result of the subtraction a huge positive number.  That result is
then compared to sd->cache_hot_time, wrongly identifying a cache-hot
task as cache cold.

This bug causes a large amount of incorrect process migration across
CPUs (a stunning 10,000 per second); we lost cache affinity very
quickly and took an almost double-digit performance regression on a
database transaction-processing workload.  A patch to fix the bug is
below, diff'ed against 2.6.9-rc3.

Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>


--- linux-2.6.9-rc3/kernel/sched.c.orig	2004-10-04 19:11:21.000000000 -0700
+++ linux-2.6.9-rc3/kernel/sched.c	2004-10-04 19:19:27.000000000 -0700
@@ -180,7 +180,8 @@ static unsigned int task_timeslice(task_
 	else
 		return SCALE_PRIO(DEF_TIMESLICE, p->static_prio);
 }
-#define task_hot(p, now, sd) ((now) - (p)->timestamp < (sd)->cache_hot_time)
+#define task_hot(p, now, sd) ((long long) ((now) - (p)->timestamp)	\
+				< (long long) (sd)->cache_hot_time)

 enum idle_type
 {



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: bug in sched.c:task_hot()
  2004-10-05  2:38 bug in sched.c:task_hot() Chen, Kenneth W
@ 2004-10-05  3:17 ` Nick Piggin
  2004-10-05 17:15   ` Chen, Kenneth W
  2004-10-05  7:33 ` Peter Williams
  2004-10-05  8:03 ` Ingo Molnar
  2 siblings, 1 reply; 11+ messages in thread
From: Nick Piggin @ 2004-10-05  3:17 UTC (permalink / raw)
  To: Chen, Kenneth W; +Cc: linux-kernel

Chen, Kenneth W wrote:

>The current implementation of task_hot() has a performance bug: it
>can cause integer underflow.
>
>Variable "now" (typically passed in as rq->timestamp_last_tick) and
>p->timestamp are both defined as unsigned long long.  However, if the
>former is smaller than the latter, integer underflow occurs, making
>the result of the subtraction a huge positive number.  That result is
>then compared to sd->cache_hot_time, wrongly identifying a cache-hot
>task as cache cold.
>
>This bug causes a large amount of incorrect process migration across
>CPUs (a stunning 10,000 per second); we lost cache affinity very
>quickly and took an almost double-digit performance regression on a
>database transaction-processing workload.  A patch to fix the bug is
>below, diff'ed against 2.6.9-rc3.
>
>Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
>

This one looks OK (the other may need a bit of rethinking).
What kernel is the regression in relation to, out of interest?

>
>--- linux-2.6.9-rc3/kernel/sched.c.orig	2004-10-04 19:11:21.000000000 -0700
>+++ linux-2.6.9-rc3/kernel/sched.c	2004-10-04 19:19:27.000000000 -0700
>@@ -180,7 +180,8 @@ static unsigned int task_timeslice(task_
> 	else
> 		return SCALE_PRIO(DEF_TIMESLICE, p->static_prio);
> }
>-#define task_hot(p, now, sd) ((now) - (p)->timestamp < (sd)->cache_hot_time)
>+#define task_hot(p, now, sd) ((long long) ((now) - (p)->timestamp)	\
>+				< (long long) (sd)->cache_hot_time)
>
> enum idle_type
> {
>
>



* Re: bug in sched.c:task_hot()
  2004-10-05  2:38 bug in sched.c:task_hot() Chen, Kenneth W
  2004-10-05  3:17 ` Nick Piggin
@ 2004-10-05  7:33 ` Peter Williams
  2004-10-05  7:44   ` Nick Piggin
  2004-10-05 17:39   ` Chen, Kenneth W
  2004-10-05  8:03 ` Ingo Molnar
  2 siblings, 2 replies; 11+ messages in thread
From: Peter Williams @ 2004-10-05  7:33 UTC (permalink / raw)
  To: Chen, Kenneth W; +Cc: linux-kernel

Chen, Kenneth W wrote:
> The current implementation of task_hot() has a performance bug: it
> can cause integer underflow.
>
> Variable "now" (typically passed in as rq->timestamp_last_tick) and
> p->timestamp are both defined as unsigned long long.  However, if the
> former is smaller than the latter, integer underflow occurs, making
> the result of the subtraction a huge positive number.  That result is
> then compared to sd->cache_hot_time, wrongly identifying a cache-hot
> task as cache cold.
>
> This bug causes a large amount of incorrect process migration across
> CPUs (a stunning 10,000 per second); we lost cache affinity very
> quickly and took an almost double-digit performance regression on a
> database transaction-processing workload.  A patch to fix the bug is
> below, diff'ed against 2.6.9-rc3.
> 
> Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
> 
> 
> --- linux-2.6.9-rc3/kernel/sched.c.orig	2004-10-04 19:11:21.000000000 -0700
> +++ linux-2.6.9-rc3/kernel/sched.c	2004-10-04 19:19:27.000000000 -0700
> @@ -180,7 +180,8 @@ static unsigned int task_timeslice(task_
>  	else
>  		return SCALE_PRIO(DEF_TIMESLICE, p->static_prio);
>  }
> -#define task_hot(p, now, sd) ((now) - (p)->timestamp < (sd)->cache_hot_time)
> +#define task_hot(p, now, sd) ((long long) ((now) - (p)->timestamp)	\
> +				< (long long) (sd)->cache_hot_time)
> 
>  enum idle_type
>  {

The interesting question is: how does now get to be less than timestamp?
This probably means that timestamp_last_tick is not a good way of
getting a value for "now".  By the way, neither is sched_clock() when
measuring small time differences, as it is not monotonic (something that
I had to allow for in my scheduling code).  I applied no such safeguards
to the timing used by the load balancing code, as I assumed that it
already worked.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: bug in sched.c:task_hot()
  2004-10-05  7:33 ` Peter Williams
@ 2004-10-05  7:44   ` Nick Piggin
  2004-10-05  8:07     ` Peter Williams
  2004-10-05 17:39   ` Chen, Kenneth W
  1 sibling, 1 reply; 11+ messages in thread
From: Nick Piggin @ 2004-10-05  7:44 UTC (permalink / raw)
  To: Peter Williams; +Cc: Chen, Kenneth W, linux-kernel

Peter Williams wrote:
> Chen, Kenneth W wrote:
> 
>> The current implementation of task_hot() has a performance bug: it
>> can cause integer underflow.
>>
>> Variable "now" (typically passed in as rq->timestamp_last_tick) and
>> p->timestamp are both defined as unsigned long long.  However, if the
>> former is smaller than the latter, integer underflow occurs, making
>> the result of the subtraction a huge positive number.  That result is
>> then compared to sd->cache_hot_time, wrongly identifying a cache-hot
>> task as cache cold.
>>
>> This bug causes a large amount of incorrect process migration across
>> CPUs (a stunning 10,000 per second); we lost cache affinity very
>> quickly and took an almost double-digit performance regression on a
>> database transaction-processing workload.  A patch to fix the bug is
>> below, diff'ed against 2.6.9-rc3.
>>
>> Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
>>
>>
>> --- linux-2.6.9-rc3/kernel/sched.c.orig    2004-10-04 
>> 19:11:21.000000000 -0700
>> +++ linux-2.6.9-rc3/kernel/sched.c    2004-10-04 19:19:27.000000000 -0700
>> @@ -180,7 +180,8 @@ static unsigned int task_timeslice(task_
>>      else
>>          return SCALE_PRIO(DEF_TIMESLICE, p->static_prio);
>>  }
>> -#define task_hot(p, now, sd) ((now) - (p)->timestamp < 
>> (sd)->cache_hot_time)
>> +#define task_hot(p, now, sd) ((long long) ((now) - (p)->timestamp)    \
>> +                < (long long) (sd)->cache_hot_time)
>>
>>  enum idle_type
>>  {
> 
> 
> The interesting question is: How does now get to be less than timestamp? 
>  This probably means that timestamp_last_tick is not a good way of 
> getting a value for "now".

It is the best we can do.

>  By the way, neither is sched_clock() when 
> measuring small time differences as it is not monotonic (something that 
> I had to allow for in my scheduling code).

I'm pretty sure it is monotonic, actually. I know some CPUs can execute
rdtsc speculatively, but I don't think it would ever be sane to execute
two rdtsc's in the wrong order.

>  I applied no such safeguards 
> to the timing used by the load balancing code as I assumed that it 
> already worked.

It should (modulo this bug).


* Re: bug in sched.c:task_hot()
  2004-10-05  2:38 bug in sched.c:task_hot() Chen, Kenneth W
  2004-10-05  3:17 ` Nick Piggin
  2004-10-05  7:33 ` Peter Williams
@ 2004-10-05  8:03 ` Ingo Molnar
  2 siblings, 0 replies; 11+ messages in thread
From: Ingo Molnar @ 2004-10-05  8:03 UTC (permalink / raw)
  To: Chen, Kenneth W; +Cc: linux-kernel, Andrew Morton, Nick Piggin


* Chen, Kenneth W <kenneth.w.chen@intel.com> wrote:

> The current implementation of task_hot() has a performance bug: it
> can cause integer underflow.

> -#define task_hot(p, now, sd) ((now) - (p)->timestamp < (sd)->cache_hot_time)
> +#define task_hot(p, now, sd) ((long long) ((now) - (p)->timestamp)	\
> +				< (long long) (sd)->cache_hot_time)

ah ... nice catch.

Signed-off-by: Ingo Molnar <mingo@elte.hu>

	Ingo


* Re: bug in sched.c:task_hot()
  2004-10-05  7:44   ` Nick Piggin
@ 2004-10-05  8:07     ` Peter Williams
  2004-10-05  8:42       ` Nick Piggin
  0 siblings, 1 reply; 11+ messages in thread
From: Peter Williams @ 2004-10-05  8:07 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Chen, Kenneth W, linux-kernel

Nick Piggin wrote:
> Peter Williams wrote:
> 
>> Chen, Kenneth W wrote:
>>
>>> The current implementation of task_hot() has a performance bug: it
>>> can cause integer underflow.
>>>
>>> Variable "now" (typically passed in as rq->timestamp_last_tick) and
>>> p->timestamp are both defined as unsigned long long.  However, if the
>>> former is smaller than the latter, integer underflow occurs, making
>>> the result of the subtraction a huge positive number.  That result is
>>> then compared to sd->cache_hot_time, wrongly identifying a cache-hot
>>> task as cache cold.
>>>
>>> This bug causes a large amount of incorrect process migration across
>>> CPUs (a stunning 10,000 per second); we lost cache affinity very
>>> quickly and took an almost double-digit performance regression on a
>>> database transaction-processing workload.  A patch to fix the bug is
>>> below, diff'ed against 2.6.9-rc3.
>>>
>>> Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
>>>
>>>
>>> --- linux-2.6.9-rc3/kernel/sched.c.orig    2004-10-04 
>>> 19:11:21.000000000 -0700
>>> +++ linux-2.6.9-rc3/kernel/sched.c    2004-10-04 19:19:27.000000000 
>>> -0700
>>> @@ -180,7 +180,8 @@ static unsigned int task_timeslice(task_
>>>      else
>>>          return SCALE_PRIO(DEF_TIMESLICE, p->static_prio);
>>>  }
>>> -#define task_hot(p, now, sd) ((now) - (p)->timestamp < 
>>> (sd)->cache_hot_time)
>>> +#define task_hot(p, now, sd) ((long long) ((now) - (p)->timestamp)    \
>>> +                < (long long) (sd)->cache_hot_time)
>>>
>>>  enum idle_type
>>>  {
>>
>>
>>
>> The interesting question is: How does now get to be less than 
>> timestamp?  This probably means that timestamp_last_tick is not a good 
>> way of getting a value for "now".
> 
> 
> It is the best we can do.

You could use sched_clock(), which will do better.  The setting of
timestamp in schedule() gives you a pretty good chance that its value
will be greater than timestamp_last_tick.

> 
>>  By the way, neither is sched_clock() when measuring small time 
>> differences as it is not monotonic (something that I had to allow for 
>> in my scheduling code).
> 
> 
> I'm pretty sure it is monotonic, actually. I know some CPUs can execute
> rdtsc speculatively, but I don't think it would ever be sane to execute
> two rdtsc's in the wrong order.

I have experienced it going backwards and I assumed that it was due to 
the timing code applying corrections.  (You've got two choices if your 
clock is running fast: one is to mark time until the real world catches 
up with you and the other is to set your clock back to the correct time 
when you notice a discrepancy.  I assumed that the second strategy had 
been followed by the time code and didn't bother checking further 
because it was an easy problem to sidestep.) Admittedly, this behaviour 
was only observed when measuring very short times such as the time spent 
on the runqueue waiting for CPU access when the system was idle BUT it 
was definitely occurring.  And it only occurred on a system where the 
lower bits of the values returned by sched_clock() were not zero i.e. a 
reasonably modern one.  It was observed on a single CPU machine as well 
and was not, therefore, a result of drift between CPUs.

> 
>>  I applied no such safeguards to the timing used by the load balancing 
>> code as I assumed that it already worked.
> 
> 
> It should (modulo this bug).

I'm reasonably confident that ZAPHOD doesn't change the values of 
timestamp or timestamp_last_tick to something that they would not have 
been without ZAPHOD and there are other scheduler changes (than 
ZAPHOD's) in -mm2 which may bear examination.

ZAPHOD also did not introduce the declaration of these values as
unsigned long long, as that is present in rc3.  BTW a quick fix for the
problem is to change their declarations to plain long long, as we won't
need the 64th bit for a few hundred years.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: bug in sched.c:task_hot()
  2004-10-05  8:07     ` Peter Williams
@ 2004-10-05  8:42       ` Nick Piggin
  2004-10-05 10:03         ` Peter Williams
  0 siblings, 1 reply; 11+ messages in thread
From: Nick Piggin @ 2004-10-05  8:42 UTC (permalink / raw)
  To: Peter Williams; +Cc: Chen, Kenneth W, linux-kernel

Peter Williams wrote:
> Nick Piggin wrote:
> 
>> Peter Williams wrote:
>>
>>>
>>> The interesting question is: How does now get to be less than 
>>> timestamp?  This probably means that timestamp_last_tick is not a 
>>> good way of getting a value for "now".
>>
>>
>>
>> It is the best we can do.
> 
> 
> You could use sched_clock(), which will do better.  The setting of
> timestamp in schedule() gives you a pretty good chance that its value
> will be greater than timestamp_last_tick.
> 

sched_clock is not guaranteed to be synchronised across CPUs though.
It may be completely different. So even if you did use sched_clock,
you'd still have to apply the timestamp_last_tick adjustment.

Considering that we don't really need sched_clock's resolution here,
it just isn't needed.

Kenneth's overflow fix is definitely required though, even if you
were to use sched_clock.

>>
>>>  By the way, neither is sched_clock() when measuring small time 
>>> differences as it is not monotonic (something that I had to allow for 
>>> in my scheduling code).
>>
>>
>>
>> I'm pretty sure it is monotonic, actually. I know some CPUs can execute
>> rdtsc speculatively, but I don't think it would ever be sane to execute
>> two rdtsc's in the wrong order.
> 

Hmm, there may be some jitter when waking a process from a remote
CPU - because in that case, we do have to apply the timestamp_last_tick
correction.

> 
> I have experienced it going backwards and I assumed that it was due to 
> the timing code applying corrections.  (You've got two choices if your 
> clock is running fast: one is to mark time until the real world catches 
> up with you and the other is to set your clock back to the correct time 
> when you notice a discrepancy.  I assumed that the second strategy had 
> been followed by the time code and didn't bother checking further 
> because it was an easy problem to sidestep.) Admittedly, this behaviour 

We don't really care what real time is doing here, just so long as the
numbers returned are roughly the same for everyone (all processes).

> was only observed when measuring very short times such as the time spent 
> on the runqueue waiting for CPU access when the system was idle BUT it 
> was definitely occurring.  And it only occurred on a system where the 
> lower bits of the values returned by sched_clock() were not zero i.e. a 
> reasonably modern one.  It was observed on a single CPU machine as well 
> and was not, therefore, a result of drift between CPUs.

I don't see how this could happen on a single CPU system. I can
believe you saw it though.


* Re: bug in sched.c:task_hot()
  2004-10-05  8:42       ` Nick Piggin
@ 2004-10-05 10:03         ` Peter Williams
  0 siblings, 0 replies; 11+ messages in thread
From: Peter Williams @ 2004-10-05 10:03 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Chen, Kenneth W, linux-kernel

Nick Piggin wrote:
> Peter Williams wrote:
> 
>> Nick Piggin wrote:
>>
>>> Peter Williams wrote:
>>>
>>>>
>>>> The interesting question is: How does now get to be less than 
>>>> timestamp?  This probably means that timestamp_last_tick is not a 
>>>> good way of getting a value for "now".
>>>
>>>
>>>
>>>
>>> It is the best we can do.
>>
>>
>>
>> You could use sched_clock(), which will do better.  The setting of
>> timestamp in schedule() gives you a pretty good chance that its value
>> will be greater than timestamp_last_tick.
>>
> 
> sched_clock is not guaranteed to be synchronised across CPUs though.
> It may be completely different. So even if you did use sched_clock,
> you'd still have to apply the timestamp_last_tick adjustment.

I assumed that was why all the "timestamp correction on changing CPU" 
code was added recently.

> 
> Considering that we really don't need sched_clock resolution here,
> it just isn't needed.
> 
> Kenneth's overflow fix is definitely required though, even if you
> were to use sched_clock.

Yes.  Or, if the accuracy is sufficient, you could just use
timestamp_last_tick to set timestamp, which would save the cost of
calling sched_clock() for that purpose.

> 
>>>
>>>>  By the way, neither is sched_clock() when measuring small time 
>>>> differences as it is not monotonic (something that I had to allow 
>>>> for in my scheduling code).
>>>
>>>
>>>
>>>
>>> I'm pretty sure it is monotonic, actually. I know some CPUs can execute
>>> rdtsc speculatively, but I don't think it would ever be sane to execute
>>> two rdtsc's in the wrong order.
>>
>>
> 
> Hmm, there may be some jitter when waking a process from a remote
> CPU - because in that case, we do have to apply the timestamp_last_tick
> correction.
> 
>>
>> I have experienced it going backwards and I assumed that it was due to 
>> the timing code applying corrections.  (You've got two choices if your 
>> clock is running fast: one is to mark time until the real world 
>> catches up with you and the other is to set your clock back to the 
>> correct time when you notice a discrepancy.  I assumed that the second 
>> strategy had been followed by the time code and didn't bother checking 
>> further because it was an easy problem to sidestep.) Admittedly, this 
>> behaviour 
> 
> 
> We don't really care what real time is doing here, just so long as the
> numbers returned are roughly the same for everyone (all processes).

I agree, which is why I didn't chase it.  As far as I'm concerned, when
sched_clock() appears to go backwards the time interval that I'm
measuring is so small that using zero is close enough.

> 
>> was only observed when measuring very short times such as the time 
>> spent on the runqueue waiting for CPU access when the system was idle 
>> BUT it was definitely occurring.  And it only occurred on a system 
>> where the lower bits of the values returned by sched_clock() were not 
>> zero i.e. a reasonably modern one.  It was observed on a single CPU 
>> machine as well and was not, therefore, a result of drift between CPUs.
> 
> 
> I don't see how this could happen on a single CPU system. I can
> believe you saw it though.

When I suspected it as the source of a problem I was experiencing, I
put some test code in to detect it.  Occurrences were hours apart and
only happened when the interval was very small.

As I said, I assumed that it was due to "corrections" but didn't bother 
chasing it as it only happened when the time intervals were very small 
and using zero when it occurred was adequate for my purposes.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* RE: bug in sched.c:task_hot()
  2004-10-05  3:17 ` Nick Piggin
@ 2004-10-05 17:15   ` Chen, Kenneth W
  0 siblings, 0 replies; 11+ messages in thread
From: Chen, Kenneth W @ 2004-10-05 17:15 UTC (permalink / raw)
  To: 'Nick Piggin'; +Cc: linux-kernel

Nick Piggin wrote on Monday, October 04, 2004 8:18 PM
> >This bug causes a large amount of incorrect process migration across
> >CPUs (a stunning 10,000 per second); we lost cache affinity very
> >quickly and took an almost double-digit performance regression on a
> >database transaction-processing workload.  A patch to fix the bug is
> >below, diff'ed against 2.6.9-rc3.
> >
>
> This one looks OK (the other may need a bit of rethinking).
> What kernel is the regression in relation to, out of interest?

The performance regression was initially seen on 2.6.8 relative to
2.6.6.  We subsequently narrowed it down to this set of bugs.

- Ken




* RE: bug in sched.c:task_hot()
  2004-10-05  7:33 ` Peter Williams
  2004-10-05  7:44   ` Nick Piggin
@ 2004-10-05 17:39   ` Chen, Kenneth W
  2004-10-05 22:09     ` Peter Williams
  1 sibling, 1 reply; 11+ messages in thread
From: Chen, Kenneth W @ 2004-10-05 17:39 UTC (permalink / raw)
  To: 'Peter Williams'; +Cc: linux-kernel

Chen, Kenneth W wrote:
> The current implementation of task_hot() has a performance bug: it
> can cause integer underflow.
>
> Variable "now" (typically passed in as rq->timestamp_last_tick) and
> p->timestamp are both defined as unsigned long long.  However, if the
> former is smaller than the latter, integer underflow occurs, making
> the result of the subtraction a huge positive number.  That result is
> then compared to sd->cache_hot_time, wrongly identifying a cache-hot
> task as cache cold.
>
> This bug causes a large amount of incorrect process migration across
> CPUs (a stunning 10,000 per second); we lost cache affinity very
> quickly and took an almost double-digit performance regression on a
> database transaction-processing workload.  A patch to fix the bug is
> below, diff'ed against 2.6.9-rc3.
>

Peter Williams wrote on Tuesday, October 05, 2004 12:33 AM
> The interesting question is: How does now get to be less than timestamp?
>   This probably means that timestamp_last_tick is not a good way of
> getting a value for "now".  By the way, neither is sched_clock() when
> measuring small time differences as it is not monotonic (something that
> I had to allow for in my scheduling code).  I applied no such safeguards
> to the timing used by the load balancing code as I assumed that it
> already worked.

The reason now gets to be smaller than timestamp is the somewhat random
thing that activate_task() does to p->timestamp.  Here is the sequence
of events:

On a timer tick, scheduler_tick() updates rq->timestamp_last_tick.
1 us later, process A wakes up and enters the run queue; activate_task()
updates p->timestamp.  At this point p->timestamp is larger than
rq->timestamp_last_tick.  Then another CPU goes idle and tries load
balancing; it wrongly finds process A not cache hot (because of the
integer underflow) and migrates it.  The application then runs on a
cache-cold CPU and suffers.

- Ken




* Re: bug in sched.c:task_hot()
  2004-10-05 17:39   ` Chen, Kenneth W
@ 2004-10-05 22:09     ` Peter Williams
  0 siblings, 0 replies; 11+ messages in thread
From: Peter Williams @ 2004-10-05 22:09 UTC (permalink / raw)
  To: Chen, Kenneth W; +Cc: linux-kernel

Chen, Kenneth W wrote:
> Chen, Kenneth W wrote:
> 
>>Current implementation of task_hot() has a performance bug in it
>>that it will cause integer underflow.
>>
>>Variable "now" (typically passed in as rq->timestamp_last_tick)
>>and p->timestamp are all defined as unsigned long long.  However,
>>If former is smaller than the latter, integer under flow occurs
>>which make the result of subtraction a huge positive number. Then
>>it is compared to sd->cache_hot_time and it will wrongly identify
>>a cache hot task as cache cold.
>>
>>This bug causes large amount of incorrect process migration across
>>cpus (at stunning 10,000 per second) and we lost cache affinity very
>>quickly and almost took double digit performance regression on a db
>>transaction processing workload.  Patch to fix the bug.  Diff'ed against
>>2.6.9-rc3.
>>
> 
> 
> Peter Williams wrote on Tuesday, October 05, 2004 12:33 AM
> 
>>The interesting question is: How does now get to be less than timestamp?
>>  This probably means that timestamp_last_tick is not a good way of
>>getting a value for "now".  By the way, neither is sched_clock() when
>>measuring small time differences as it is not monotonic (something that
>>I had to allow for in my scheduling code).  I applied no such safeguards
>>to the timing used by the load balancing code as I assumed that it
>>already worked.
> 
> 
> The reason now gets to be smaller than timestamp is the somewhat random
> thing that activate_task() does to p->timestamp.  Here is the sequence
> of events:
>
> On a timer tick, scheduler_tick() updates rq->timestamp_last_tick.
> 1 us later, process A wakes up and enters the run queue; activate_task()
> updates p->timestamp.  At this point p->timestamp is larger than
> rq->timestamp_last_tick.  Then another CPU goes idle and tries load
> balancing; it wrongly finds process A not cache hot (because of the
> integer underflow) and migrates it.  The application then runs on a
> cache-cold CPU and suffers.

Having thought about this overnight, I've come to the conclusion that
(because it is set using sched_clock()) timestamp is always bigger than
timestamp_last_tick for a short period of time (i.e. until the next
scheduler_tick()).  This doesn't affect the load balancing code that is
triggered by scheduler_tick() (for obvious reasons), but all other load
balancing code is susceptible to the problem.  The setting of timestamp
in activate() just exacerbates the problem, and the fact that Ingo's VP
work reduces the time it takes tasks to get from wake up to a point
where load balancing is triggered (or they're involved in load balancing
decisions) other than by scheduler_tick() means that it matters more
than it used to.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

