public inbox for linux-kernel@vger.kernel.org
From: Peter Williams <pwil3058@bigpond.net.au>
To: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>,
	linux-kernel@vger.kernel.org
Subject: Re: bug in sched.c:task_hot()
Date: Tue, 05 Oct 2004 18:07:59 +1000
Message-ID: <4162565F.60007@bigpond.net.au>
In-Reply-To: <416250F0.5010008@yahoo.com.au>

Nick Piggin wrote:
> Peter Williams wrote:
> 
>> Chen, Kenneth W wrote:
>>
>>> The current implementation of task_hot() has a performance bug: the
>>> subtraction in it can underflow.
>>>
>>> Variable "now" (typically passed in as rq->timestamp_last_tick)
>>> and p->timestamp are both defined as unsigned long long.  However,
>>> if the former is smaller than the latter, integer underflow occurs,
>>> which makes the result of the subtraction a huge positive number. Then
>>> it is compared to sd->cache_hot_time and it will wrongly identify
>>> a cache hot task as cache cold.
>>>
>>> This bug causes a large amount of incorrect process migration across
>>> cpus (at a stunning 10,000 per second), and we lost cache affinity very
>>> quickly and took an almost double-digit performance regression on a db
>>> transaction processing workload.  Patch to fix the bug.  Diff'ed against
>>> 2.6.9-rc3.
>>>
>>> Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
>>>
>>>
>>> --- linux-2.6.9-rc3/kernel/sched.c.orig    2004-10-04 19:11:21.000000000 -0700
>>> +++ linux-2.6.9-rc3/kernel/sched.c    2004-10-04 19:19:27.000000000 -0700
>>> @@ -180,7 +180,8 @@ static unsigned int task_timeslice(task_
>>>      else
>>>          return SCALE_PRIO(DEF_TIMESLICE, p->static_prio);
>>>  }
>>> -#define task_hot(p, now, sd) ((now) - (p)->timestamp < (sd)->cache_hot_time)
>>> +#define task_hot(p, now, sd) ((long long) ((now) - (p)->timestamp)    \
>>> +                < (long long) (sd)->cache_hot_time)
>>>
>>>  enum idle_type
>>>  {
>>
>>
>>
>> The interesting question is: How does now get to be less than 
>> timestamp?  This probably means that timestamp_last_tick is not a good 
>> way of getting a value for "now".
> 
> 
> It is the best we can do.

You could use sched_clock(), which will do better.  The setting of 
timestamp in schedule() gives you a pretty good chance that its value 
will be greater than timestamp_last_tick.

> 
>>  By the way, neither is sched_clock() when measuring small time 
>> differences as it is not monotonic (something that I had to allow for 
>> in my scheduling code).
> 
> 
> I'm pretty sure it is monotonic, actually. I know some CPUs can execute
> rdtsc speculatively, but I don't think it would ever be sane to execute
> two rdtsc's in the wrong order.

I have experienced it going backwards and I assumed that it was due to 
the timing code applying corrections.  (You've got two choices if your 
clock is running fast: one is to mark time until the real world catches 
up with you and the other is to set your clock back to the correct time 
when you notice a discrepancy.  I assumed that the second strategy had 
been followed by the time code and didn't bother checking further 
because it was an easy problem to sidestep.) Admittedly, this behaviour 
was only observed when measuring very short times such as the time spent 
on the runqueue waiting for CPU access when the system was idle BUT it 
was definitely occurring.  And it only occurred on a system where the 
lower bits of the values returned by sched_clock() were not zero, i.e. a 
reasonably modern one.  It was observed on a single-CPU machine as well 
and was not, therefore, a result of drift between CPUs.
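
The message does not say how the backwards steps were sidestepped. One
common approach, shown here only as a minimal user-space sketch, is to
treat any negative interval as zero. The helper name elapsed_ns() is
hypothetical, and its arguments stand in for two clock readings that are
assumed to be in nanoseconds.

```c
#include <stdint.h>

/*
 * Hypothetical helper: elapsed time between two clock readings.
 * If the clock appears to have stepped backwards (now < then),
 * report zero instead of letting the unsigned subtraction wrap.
 */
static uint64_t elapsed_ns(uint64_t now, uint64_t then)
{
    return now > then ? now - then : 0;
}
```

With this clamp, a short wait measured across a backwards clock step
reads as zero rather than as a value close to 2^64.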

> 
>>  I applied no such safeguards to the timing used by the load balancing 
>> code as I assumed that it already worked.
> 
> 
> It should (modulo this bug).

I'm reasonably confident that ZAPHOD doesn't change the values of 
timestamp or timestamp_last_tick from what they would have been 
without it, and there are scheduler changes other than ZAPHOD's in 
-mm2 which may bear examination.

ZAPHOD also did not introduce the declaration of these values as 
unsigned long long, as that is present in rc3.  BTW a quick fix to the 
problem is to change their declarations to just long long, as we don't 
need the 64th bit for a few hundred years.
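
To make the wraparound and the two fixes concrete, here is a minimal
user-space sketch. The function names are hypothetical; the parameters
stand in for rq->timestamp_last_tick, p->timestamp and
sd->cache_hot_time. It assumes the timestamps are in nanoseconds and
that the machine uses two's-complement integers.

```c
#include <stdint.h>

/* Arithmetic of the original macro: the unsigned subtraction wraps
 * to a huge value when now < timestamp. */
static int task_hot_unsigned(uint64_t now, uint64_t timestamp,
                             uint64_t cache_hot_time)
{
    return (now - timestamp) < cache_hot_time;
}

/* Arithmetic of the quoted patch: subtract as unsigned, then
 * compare both sides as signed values. */
static int task_hot_cast(uint64_t now, uint64_t timestamp,
                         uint64_t cache_hot_time)
{
    return (int64_t)(now - timestamp) < (int64_t)cache_hot_time;
}

/* The alternative suggested above: signed declarations throughout,
 * so a small negative difference stays negative. */
static int task_hot_signed(int64_t now, int64_t timestamp,
                           int64_t cache_hot_time)
{
    return (now - timestamp) < cache_hot_time;
}

/* Years a signed 64-bit nanosecond counter can run before it
 * overflows (2^63 - 1 nanoseconds). */
static double signed_ns_range_years(void)
{
    const double ns_per_year = 1e9 * 60.0 * 60.0 * 24.0 * 365.25;
    return (double)INT64_MAX / ns_per_year;
}
```

With now = 1000, timestamp = 1500 and cache_hot_time = 2500000, the
unsigned version reports the task as cold, while the cast and signed
versions both report it as hot.  The signed range works out to roughly
292 years of nanoseconds.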

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce
