* Re: [PATCH] - sched_clock() broken for ia64 SN platform
2003-11-20 0:56 [PATCH] - sched_clock() broken for ia64 SN platform John Hawkes
@ 2003-11-20 1:24 ` David Mosberger
2003-11-20 4:09 ` John Hawkes
` (11 subsequent siblings)
12 siblings, 0 replies; 14+ messages in thread
From: David Mosberger @ 2003-11-20 1:24 UTC (permalink / raw)
To: linux-ia64
>>>>> On Wed, 19 Nov 2003 16:56:23 -0800 (PST), John Hawkes <hawkes@babylon.engr.sgi.com> said:
John> We might instead want to implement a more general scheme,
John> along the lines of what is done by (struct time_interpolator),
John> to provide a framework to solve this for other architectures
John> that have "drifty" non-default timebases.
My sense is that with a bit of thinking, it would be possible to come
up with a solution that allows even drifty platforms to use ITC for
sched_clock()---it serves a very specific purpose in the scheduler, where
scalability is key and perfect accuracy is not (unlike for
gettimeofday). I don't think anything that goes out to read a single
(shared) platform counter will be sufficiently scalable to the number
of CPUs you guys are talking about. But yes, it would be much more
effort than just adding Yet Another Callback. The rewards would be
bigger, though, too...
--david
* Re: [PATCH] - sched_clock() broken for ia64 SN platform
2003-11-20 0:56 [PATCH] - sched_clock() broken for ia64 SN platform John Hawkes
2003-11-20 1:24 ` David Mosberger
@ 2003-11-20 4:09 ` John Hawkes
2003-11-20 6:01 ` David Mosberger
` (10 subsequent siblings)
12 siblings, 0 replies; 14+ messages in thread
From: John Hawkes @ 2003-11-20 4:09 UTC (permalink / raw)
To: linux-ia64
From: "David Mosberger" <davidm@napali.hpl.hp.com>
> >>>>> On Wed, 19 Nov 2003 16:56:23 -0800 (PST), John Hawkes <hawkes@babylon.engr.sgi.com> said:
>
> John> We might instead want to implement a more general scheme,
> John> along the lines of what is done by (struct time_interpolator),
> John> to provide a framework to solve this for other architectures
> John> that have "drifty" non-default timebases.
>
> My sense is that with a bit of thinking, it would be possible to come
> up with a solution that allows even drifty platforms to use ITC for
> sched_clock()---it serves a very specific purpose in the scheduler, where
> scalability is key and perfect accuracy is not (unlike for
> gettimeofday). I don't think anything that goes out to read a single
> (shared) platform counter will be sufficiently scalable to the number
> of CPUs you guys are talking about. But yes, it would be much more
> effort than just adding Yet Another Callback. The rewards would be
> bigger, though, too...
In 2.4 the scheduler used "jiffies" directly as a timestamp for this purpose.
Then for some reason someone decided to abstract that into sched_clock(), to
let every architecture decide how to implement it. The alpha architecture
implements sched_clock() with jiffies. The i386 uses the TSC (which might not
be synchronized for all platforms?). The ia64 uses the ITC.
I'd like to hear an argument about why sched_clock() needs sub-microsecond
accuracy, instead of just using jiffies, when one use of sched_clock() is to
compare a delta time against cache_decay_ticks, which is a
"jiffies"-granularity value, and the other use is to determine the relative
computebound-vs-interactive characteristics of the process.
John Hawkes
* Re: [PATCH] - sched_clock() broken for ia64 SN platform
2003-11-20 0:56 [PATCH] - sched_clock() broken for ia64 SN platform John Hawkes
2003-11-20 1:24 ` David Mosberger
2003-11-20 4:09 ` John Hawkes
@ 2003-11-20 6:01 ` David Mosberger
2003-11-20 15:23 ` Jack Steiner
` (9 subsequent siblings)
12 siblings, 0 replies; 14+ messages in thread
From: David Mosberger @ 2003-11-20 6:01 UTC (permalink / raw)
To: linux-ia64
>>>>> On Wed, 19 Nov 2003 20:09:15 -0800, "John Hawkes" <hawkes@sgi.com> said:
John> I'd like to hear an argument about why sched_clock() needs
John> sub-microsecond accuracy, instead of just using jiffies, when
John> one use of sched_clock() is to compare a delta time against
John> cache_decay_ticks, which is a "jiffies"-granularity value, and
John> the other use is to determine the relative
John> computebound-vs-interactive characteristics of the process.
It's probably best to discuss this on lkml. I didn't follow all the
recent scheduler developments but AFAIK, this is largely driven by
trying to fix some scheduling corners. I think the "persistent
starvation" bug that Stephane found gets fixed by it, for example.
Also, supposedly it helps interactivity.
--david
* Re: [PATCH] - sched_clock() broken for ia64 SN platform
2003-11-20 0:56 [PATCH] - sched_clock() broken for ia64 SN platform John Hawkes
` (2 preceding siblings ...)
2003-11-20 6:01 ` David Mosberger
@ 2003-11-20 15:23 ` Jack Steiner
2003-11-20 17:25 ` Grant Grundler
` (8 subsequent siblings)
12 siblings, 0 replies; 14+ messages in thread
From: Jack Steiner @ 2003-11-20 15:23 UTC (permalink / raw)
To: linux-ia64
On Wed, Nov 19, 2003 at 05:24:08PM -0800, David Mosberger wrote:
> >>>>> On Wed, 19 Nov 2003 16:56:23 -0800 (PST), John Hawkes <hawkes@babylon.engr.sgi.com> said:
>
> John> We might instead want to implement a more general scheme,
> John> along the lines of what is done by (struct time_interpolator),
> John> to provide a framework to solve this for other architectures
> John> that have "drifty" non-default timebases.
>
> My sense is that with a bit of thinking, it would be possible to come
> up with a solution that allows even drifty platforms to use ITC for
> sched_clock()---it serves a very specific purpose in the scheduler, where
> scalability is key and perfect accuracy is not (unlike for
> gettimeofday). I don't think anything that goes out to read a single
> (shared) platform counter will be sufficiently scalable to the number
> of CPUs you guys are talking about. But yes, it would be much more
This is slightly off-topic, but the shared platform counter on the SGI
platform isn't a single counter. The counter is replicated in each chipset.
It is synchronized throughout the system so that all cpus will see
the same value - i.e., no drift. Reading the counter does not require any
off-node references, so there shouldn't be any scaling issues. However,
reading the ITC is faster and preferred if inter-cpu drift is not an issue.
> effort than just adding Yet Another Callback. The rewards would be
> bigger, though, too...
>
> --david
--
Thanks
Jack Steiner (steiner@sgi.com) 651-683-5302
Principal Engineer SGI - Silicon Graphics, Inc.
* Re: [PATCH] - sched_clock() broken for ia64 SN platform
2003-11-20 0:56 [PATCH] - sched_clock() broken for ia64 SN platform John Hawkes
` (3 preceding siblings ...)
2003-11-20 15:23 ` Jack Steiner
@ 2003-11-20 17:25 ` Grant Grundler
2003-11-20 17:25 ` Rich Altmaier
` (7 subsequent siblings)
12 siblings, 0 replies; 14+ messages in thread
From: Grant Grundler @ 2003-11-20 17:25 UTC (permalink / raw)
To: linux-ia64
On Wed, Nov 19, 2003 at 08:09:15PM -0800, John Hawkes wrote:
> I'd like to hear an argument about why sched_clock() needs sub-microsecond
> accuracy, instead of just using jiffies, when one use of sched_clock() is to
> compare a delta time against cache_decay_ticks, which is a
> "jiffies"-granularity value, and the other use is to determine the relative
> computebound-vs-interactive characteristics of the process.
In general, it seems like bouncing the jiffies cacheline around is more of
a problem than the need for accuracy. This sounds similar to a problem
Jack Steiner wrote about before (updating interrupt counts):
| Updating the counter causes a cache line to be bounced between
| cpus at a rate of at least HZ*active_cpus. (The number of bus transactions
| is at least 2X higher because the line is first obtained for
| shared usage, then upgraded to modified. In addition, multiple references
| are made to the line for each interrupt. On a big system, it is unlikely that
| a cpu can hold the line for the entire time that the interrupt is being
| serviced).
hth,
grant
* Re: [PATCH] - sched_clock() broken for ia64 SN platform
2003-11-20 0:56 [PATCH] - sched_clock() broken for ia64 SN platform John Hawkes
` (4 preceding siblings ...)
2003-11-20 17:25 ` Grant Grundler
@ 2003-11-20 17:25 ` Rich Altmaier
2003-11-20 18:32 ` David Mosberger
` (6 subsequent siblings)
12 siblings, 0 replies; 14+ messages in thread
From: Rich Altmaier @ 2003-11-20 17:25 UTC (permalink / raw)
To: linux-ia64
Just to add a small note to Jack's comments, the reason for the
existence of this "globally synchronized hi-res" counter is precisely
scheduling. On IRIX we use it in the frame scheduler to achieve
simultaneous launch, within tens of microseconds, of processes on
multiple CPUs for hard realtime.
The counter is replicated in each memory controller, where the
NUMAflex interconnect provides a broadcast clock signal to drive them.
It's a fairly cool feature, and the realtime people love it...
FYI, Rich
Jack Steiner wrote:
>
>
> This is slightly off-topic, but the shared platform counter on the SGI
> platform isn't a single counter. The counter is replicated in each chipset.
> It is synchronized throughout the system so that all cpus will see
> the same value - i.e., no drift. Reading the counter does not require any
> off-node references, so there shouldn't be any scaling issues. However,
> reading the ITC is faster and preferred if inter-cpu drift is not an issue.
>
>
* Re: [PATCH] - sched_clock() broken for ia64 SN platform
2003-11-20 0:56 [PATCH] - sched_clock() broken for ia64 SN platform John Hawkes
` (5 preceding siblings ...)
2003-11-20 17:25 ` Rich Altmaier
@ 2003-11-20 18:32 ` David Mosberger
2003-11-20 19:20 ` Robin Holt
` (5 subsequent siblings)
12 siblings, 0 replies; 14+ messages in thread
From: David Mosberger @ 2003-11-20 18:32 UTC (permalink / raw)
To: linux-ia64
>>>>> On Thu, 20 Nov 2003 09:23:35 -0600, Jack Steiner <steiner@sgi.com> said:
Jack> This is slightly off-topic, but the shared platform counter on
Jack> the SGI platform isn't a single counter. The counter is
Jack> replicated in each chipset. It is synchronized throughout the
Jack> system so that all cpus will see the same value - i.e., no
Jack> drift. Reading the counter does not require any off-node
Jack> references. There shouldn't be any scaling issues.
Ah, that's good. Just for future reference, what's the approximate
latency of reading this counter?
Jack> However, reading the ITC is faster and preferred if inter-cpu
Jack> drift is not an issue.
Yes. Plus we could solve the problem once and for all, not once for
each drifty platform.
As I remember it, sched_clock() was originally invented to measure
fine-grained "how long have I run" times. This can be done with ITC
without synchronization, since the start and stop "times" will be
measured on the same CPU. However, as John points out, at the moment
sched_clock() is also used for migration decisions. My guess is that
this part is just due to someone trying to be overly clever. At least
on drifty platforms, you can just as easily make this decision based
on jiffies. All it would take is adding one word to the task_struct and
reading both sched_clock() and jiffies when updating the timestamp(s).
--david
* Re: [PATCH] - sched_clock() broken for ia64 SN platform
2003-11-20 0:56 [PATCH] - sched_clock() broken for ia64 SN platform John Hawkes
` (6 preceding siblings ...)
2003-11-20 18:32 ` David Mosberger
@ 2003-11-20 19:20 ` Robin Holt
2003-11-20 19:23 ` Robin Holt
` (4 subsequent siblings)
12 siblings, 0 replies; 14+ messages in thread
From: Robin Holt @ 2003-11-20 19:20 UTC (permalink / raw)
To: linux-ia64
On Thu, Nov 20, 2003 at 09:25:45AM -0800, Grant Grundler wrote:
> On Wed, Nov 19, 2003 at 08:09:15PM -0800, John Hawkes wrote:
> > I'd like to hear an argument about why sched_clock() needs sub-microsecond
> > accuracy, instead of just using jiffies, when one use of sched_clock() is to
> > compare a delta time against cache_decay_ticks, which is a
> > "jiffies"-granularity value, and the other use is to determine the relative
> > computebound-vs-interactive characteristics of the process.
>
> In general, it seems like bouncing the jiffies cacheline around is more of
> a problem than the need for accuracy. This sounds similar to a problem
> Jack Steiner wrote about before (updating interrupt counts):
The jiffies cacheline is almost always held shared; only cpu 0, updating it
once per tick, takes it exclusive. The interrupt-counts problem was that every
cpu would receive the tick interrupt and try to update (grab an exclusive copy
of) the same cacheline. Different problem. Jiffies is not as much of a concern.
>
> | Updating the counter causes a cache line to be bounced between
> | cpus at a rate of at least HZ*active_cpus. (The number of bus transactions
> | is at least 2X higher because the line is first obtained for
> | shared usage, then upgraded to modified. In addition, multiple references
> | are made to the line for each interrupt. On a big system, it is unlikely that
> | a cpu can hold the line for the entire time that the interrupt is being
> | serviced).
>
> hth,
> grant
* Re: [PATCH] - sched_clock() broken for ia64 SN platform
2003-11-20 0:56 [PATCH] - sched_clock() broken for ia64 SN platform John Hawkes
` (7 preceding siblings ...)
2003-11-20 19:20 ` Robin Holt
@ 2003-11-20 19:23 ` Robin Holt
2003-11-20 20:58 ` John Hawkes
` (3 subsequent siblings)
12 siblings, 0 replies; 14+ messages in thread
From: Robin Holt @ 2003-11-20 19:23 UTC (permalink / raw)
To: linux-ia64
On Thu, Nov 20, 2003 at 10:32:29AM -0800, David Mosberger wrote:
> >>>>> On Thu, 20 Nov 2003 09:23:35 -0600, Jack Steiner <steiner@sgi.com> said:
>
> Jack> This is slightly off-topic, but the shared platform counter on
> Jack> the SGI platform isn't a single counter. The counter is
> Jack> replicated in each chipset. It is synchronized throughout the
> Jack> system so that all cpus will see the same value - i.e., no
> Jack> drift. Reading the counter does not require any off-node
> Jack> references. There shouldn't be any scaling issues.
>
> Ah, that's good. Just for future reference, what's the approximate
> latency of reading this counter?
I haven't tested it recently, but it was approximately 55 ns on a 900 MHz
Madison.
* Re: [PATCH] - sched_clock() broken for ia64 SN platform
2003-11-20 0:56 [PATCH] - sched_clock() broken for ia64 SN platform John Hawkes
` (8 preceding siblings ...)
2003-11-20 19:23 ` Robin Holt
@ 2003-11-20 20:58 ` John Hawkes
2003-11-20 21:27 ` David Mosberger
` (2 subsequent siblings)
12 siblings, 0 replies; 14+ messages in thread
From: John Hawkes @ 2003-11-20 20:58 UTC (permalink / raw)
To: linux-ia64
From: "David Mosberger" <davidm@napali.hpl.hp.com>
> Jack Steiner> However, reading the ITC is faster and preferred if inter-cpu
> Jack Steiner> drift is not an issue.
>
> Yes. Plus we could solve the problem once and for all, not once for
> each drifty platform.
>
> As I remember it, sched_clock() was originally invented to measure
> fine-grained "how long have I run" times. This can be done with ITC
> without synchronization, since the start and stop "times" will be
> measured on the same CPU. However, as John points out, at the moment
> sched_clock() is also used for migration-decisions. My guess is that
> this part is just due to someone trying to be overly clever. At least
> on drifty platforms, you can just as easily make this decision based
> on jiffies. All it would take is adding one word to the task_struct and
> reading both sched_clock() and jiffies when updating the timestamp(s).
I doubt this double-count would ever be accepted by the wider Linux Community,
as it bloats mainline arch-independent code, just to fix a problem with a
handful of drifty platforms.
The i386 code is uglier than my patch, as it makes NUMA platforms use the
coarse-granularity "jiffies" as the time base. So much for any benefit to the
scheduler from a high-precision task->timestamp. At least my ia64 patch
allows a platform-specific sched_clock() that returns a high-precision value
without adding appreciable bloat.
John Hawkes
* Re: [PATCH] - sched_clock() broken for ia64 SN platform
2003-11-20 0:56 [PATCH] - sched_clock() broken for ia64 SN platform John Hawkes
` (9 preceding siblings ...)
2003-11-20 20:58 ` John Hawkes
@ 2003-11-20 21:27 ` David Mosberger
2003-11-20 21:58 ` john stultz
2003-11-20 22:14 ` John Hawkes
12 siblings, 0 replies; 14+ messages in thread
From: David Mosberger @ 2003-11-20 21:27 UTC (permalink / raw)
To: linux-ia64
>>>>> On Thu, 20 Nov 2003 12:58:20 -0800, "John Hawkes" <hawkes@sgi.com> said:
John> I doubt this double-count would ever be accepted by the wider
John> Linux Community, as it bloats mainline arch-independent code,
John> just to fix a problem with a handful of drifty platforms.
That's not an argument: the "bloat" can be trivially hidden for
non-drifty architectures with an inline routine or macro. I for one
would be perfectly happy to pay an extra word in the ia64-version of
task_struct if that would yield a generic and scalable solution to the
problem. The problem right now is that the generic kernel code is
structured in a way that prevents this.
--david
* Re: [PATCH] - sched_clock() broken for ia64 SN platform
2003-11-20 0:56 [PATCH] - sched_clock() broken for ia64 SN platform John Hawkes
` (10 preceding siblings ...)
2003-11-20 21:27 ` David Mosberger
@ 2003-11-20 21:58 ` john stultz
2003-11-20 22:14 ` John Hawkes
12 siblings, 0 replies; 14+ messages in thread
From: john stultz @ 2003-11-20 21:58 UTC (permalink / raw)
To: linux-ia64
On Thu, 2003-11-20 at 13:27, David Mosberger wrote:
> >>>>> On Thu, 20 Nov 2003 12:58:20 -0800, "John Hawkes" <hawkes@sgi.com> said:
>
> John> I doubt this double-count would ever be accepted by the wider
> John> Linux Community, as it bloats mainline arch-independent code,
> John> just to fix a problem with a handful of drifty platforms.
>
> That's not an argument: the "bloat" can be trivially hidden for
> non-drifty architectures with an inline routine or macro. I for one
> would be perfectly happy to pay an extra word in the ia64-version of
> task_struct if that would yield a generic and scalable solution to the
> problem. The problem right now is that the generic kernel code is
> structured in a way that prevents this.
I too was confused why per-cpu start and stop times were not just used
for this high-res accounting. I'm not sure I can look into it now, but
I'd be interested to hear why we'd compare timestamps across cpus
(rather than just use time deltas calculated on a single cpu).
Oh, and as for that last thought: keep it around for when 2.7 opens ;)
-john
* Re: [PATCH] - sched_clock() broken for ia64 SN platform
2003-11-20 0:56 [PATCH] - sched_clock() broken for ia64 SN platform John Hawkes
` (11 preceding siblings ...)
2003-11-20 21:58 ` john stultz
@ 2003-11-20 22:14 ` John Hawkes
12 siblings, 0 replies; 14+ messages in thread
From: John Hawkes @ 2003-11-20 22:14 UTC (permalink / raw)
To: linux-ia64
From: "john stultz" <johnstul@us.ibm.com>
> I too was confused why per-cpu start and stop times were not just used
> for this high-res accounting. I'm not sure I can look into it now, but
> I'd be interested to hear why we'd compare timestamps across cpus
> (rather than just use time deltas calculated on a single cpu).
sched_clock() and task->timestamp are used in two different ways. One is for
supposedly high-res accounting. The other is for can_migrate_task(), called
during load-balancing, to determine if the process has slept long enough to
consider it to no longer be cache-hot. It's this latter use that suffers from
a drifty timebase.
John Hawkes