* Need better is_better_time_interpolator() algorithm
@ 2005-08-25 16:44 Alex Williamson
2005-08-25 17:36 ` john stultz
0 siblings, 1 reply; 15+ messages in thread
From: Alex Williamson @ 2005-08-25 16:44 UTC (permalink / raw)
To: linux-kernel
Hi,
In playing with an HPET device, I noticed that
kernel/timer.c:is_better_time_interpolator() is completely asymmetric
in which timer it selects. The test is simply:
	return new->frequency > 2*time_interpolator->frequency ||
	    (unsigned long)new->drift < (unsigned long)time_interpolator->drift;
Given two timers:
(a) 1.5GHz, 750ppm
(b) 250MHz, 500ppm
the resulting "better" timer is completely dependent on the order
they're passed in. For example, (a),(b) = (b); (b),(a) = (a).
What are we really looking for in a "better" timer? There are at
least 4 factors that I can think of that seem important to determining a
better clock:
* resolution (frequency)
* accuracy (drift)
* access latency (may be non-uniform across the system?)
* jitter (monotonically increasing)
How can we munge these all together to come up with a single goodness
factor for comparison? There's probably a thesis covering algorithms to
handle this. Anyone know of one or have some good ideas? Thanks,
Alex
--
Alex Williamson HP Linux & Open Source Lab
^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Need better is_better_time_interpolator() algorithm
  2005-08-25 16:44 Need better is_better_time_interpolator() algorithm Alex Williamson
@ 2005-08-25 17:36 ` john stultz
  2005-08-25 18:43   ` Alex Williamson
  0 siblings, 1 reply; 15+ messages in thread
From: john stultz @ 2005-08-25 17:36 UTC (permalink / raw)
To: Alex Williamson; +Cc: linux-kernel

On Thu, 2005-08-25 at 10:44 -0600, Alex Williamson wrote:
> In playing with an HPET device, I noticed that
> kernel/timer.c:is_better_time_interpolator() is completely asymmetric
> in which timer it selects. The test is simply:
>
> return new->frequency > 2*time_interpolator->frequency ||
>     (unsigned long)new->drift < (unsigned long)time_interpolator->drift;
>
> Given two timers:
>
> (a) 1.5GHz, 750ppm
> (b) 250MHz, 500ppm
>
> the resulting "better" timer is completely dependent on the order
> they're passed in. For example, (a),(b) = (b); (b),(a) = (a).
>
> What are we really looking for in a "better" timer? There are at
> least 4 factors that I can think of that seem important to determining a
> better clock:
>
> * resolution (frequency)
> * accuracy (drift)
> * access latency (may be non-uniform across the system?)
> * jitter (monotonically increasing)
>
> How can we munge these all together to come up with a single goodness
> factor for comparison? There's probably a thesis covering algorithms to
> handle this. Anyone know of one or have some good ideas? Thanks,

With my timeofday rework code, the timesource structure (which was
influenced by the time interpolators) just uses a fixed "priority"
value. This value can be changed as needed (for example: we lower the
TSC timesource priority if the TSCs are found to be out of sync).

In order to have some scale of goodness and avoid priority inflation, I
put a comment suggesting what the different priority levels mean, ie:

  0-99:    Terrible. Only use at bootup or when there's nothing else
           available
  100-199: Functional, but not desired
  200-299: Good. A correct and usable timesource
  300-399: Desired. A reasonably fast and accurate timesource
  400-499: Perfect. The ideal timesource. A must-use where available

Additionally, I created a sysfs interface that could be used to
override the priority-selected timesource.

Realistically I don't think too many systems will have multiple
out-of-tree timesources, so assigning the correct priority value
shouldn't be too difficult.

This just seemed a bit more straightforward than sorting out some
weighting algorithm for their properties to select the best timesource.

thanks
-john
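Under a fixed-priority scheme like this, selection reduces to an order-independent maximum scan. A sketch (the structure and function names here are illustrative, not from john's actual patches):

```c
#include <stddef.h>

/* Illustrative timesource carrying a tunable priority on a 0-499 scale. */
struct timesource {
	const char *name;
	int priority;			/* higher wins; init code may lower it */
	struct timesource *next;
};

static struct timesource *timesource_list;

static void register_timesource(struct timesource *ts)
{
	ts->next = timesource_list;
	timesource_list = ts;
}

/* Unlike the pairwise test in timer.c, a max scan over fixed priorities
 * picks the same winner no matter what order sources registered in. */
static struct timesource *select_timesource(void)
{
	struct timesource *ts, *best = NULL;

	for (ts = timesource_list; ts; ts = ts->next)
		if (!best || ts->priority > best->priority)
			best = ts;
	return best;
}
```

Lowering a source's priority at init time (e.g. when TSCs are found out of sync) and rescanning is then enough to change the selection.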
* Re: Need better is_better_time_interpolator() algorithm
  2005-08-25 17:36 ` john stultz
@ 2005-08-25 18:43   ` Alex Williamson
  2005-08-25 19:02     ` john stultz
  2005-08-26 15:39     ` Christoph Lameter
  0 siblings, 2 replies; 15+ messages in thread
From: Alex Williamson @ 2005-08-25 18:43 UTC (permalink / raw)
To: john stultz; +Cc: linux-kernel

On Thu, 2005-08-25 at 10:36 -0700, john stultz wrote:
> On Thu, 2005-08-25 at 10:44 -0600, Alex Williamson wrote:
> > How can we munge these all together to come up with a single goodness
> > factor for comparison? There's probably a thesis covering algorithms to
> > handle this. Anyone know of one or have some good ideas? Thanks,
>
> With my timeofday rework code, the timesource structure (which was
> influenced by the time interpolators) just uses a fixed "priority" value.
...
> Realistically I don't think too many systems will have multiple
> out-of-tree timesources, so assigning the correct priority value
> shouldn't be too difficult.
>
> This just seemed a bit more straightforward than sorting out some
> weighting algorithm for their properties to select the best timesource.

I don't know that it's that uncommon. Simply having one non-arch
specific timer is enough to need to decide whether it's better than a
generic timer. I assume pretty much every arch has a cycle timer. For
smaller boxes, this might be the preferred timer given its latency even
if something like an HPET exists (MMIO accesses are expensive). How do
you hard code a value that can account for that?

I agree, we could easily go too far and produce some bloated algorithm,
but maybe it's simply a weighted product of a few variables. To start
with, what would this do:

  (frequency) * (1/drift) * (1/latency) * (1/(jitter_factor * cpus))

Something this simple at least starts to dynamically bring the factors
together. All else being equal (and with no weighting), this would give
the 1.5GHz/750ppm timer a higher priority than the 250MHz/500ppm timer.
Is that good? I like your idea to make this user-tunable after boot,
but I still think there has to be a way to make a smarter decision up
front. Thanks,

	Alex

--
Alex Williamson                             HP Linux & Open Source Lab
* Re: Need better is_better_time_interpolator() algorithm
  2005-08-25 18:43 ` Alex Williamson
@ 2005-08-25 19:02   ` john stultz
  2005-08-26 15:39   ` Christoph Lameter
  1 sibling, 0 replies; 15+ messages in thread
From: john stultz @ 2005-08-25 19:02 UTC (permalink / raw)
To: Alex Williamson; +Cc: linux-kernel

On Thu, 2005-08-25 at 12:43 -0600, Alex Williamson wrote:
> On Thu, 2005-08-25 at 10:36 -0700, john stultz wrote:
> > On Thu, 2005-08-25 at 10:44 -0600, Alex Williamson wrote:
> > > How can we munge these all together to come up with a single goodness
> > > factor for comparison? There's probably a thesis covering algorithms to
> > > handle this. Anyone know of one or have some good ideas? Thanks,
> >
> > With my timeofday rework code, the timesource structure (which was
> > influenced by the time interpolators) just uses a fixed "priority" value.
> ...
> > Realistically I don't think too many systems will have multiple
> > out-of-tree timesources, so assigning the correct priority value
> > shouldn't be too difficult.
> >
> > This just seemed a bit more straightforward than sorting out some
> > weighting algorithm for their properties to select the best timesource.
>
> I don't know that it's that uncommon. Simply having one non-arch
> specific timer is enough to need to decide whether it's better than a
> generic timer. I assume pretty much every arch has a cycle timer. For
> smaller boxes, this might be the preferred timer given its latency even
> if something like an HPET exists (MMIO accesses are expensive). How do
> you hard code a value that can account for that?

Well, in my patches I set the default priority of the cycle timer as
being very high, say 350, and the slower MMIO device as being a decent
250. Then if we boot up on say a NUMA system, or if we find the cycle
counters to be out of sync, the cycle counter init code drops the
timesource priority of the cycle counter to something like 50.

> I agree, we could easily go too far and produce some bloated
> algorithm, but maybe it's simply a weighted product of a few
> variables.
>
> To start with, what would this do:
>
>   (frequency) * (1/drift) * (1/latency) * (1/(jitter_factor * cpus))

It just seems that something like this could be for the most part
precomputed when you're writing the time_interpolator code into a
single priority value that the init code can tweak as needed.

> Something this simple at least starts to dynamically bring the factors
> together. All else being equal (and with no weighting), this would give
> the 1.5GHz/750ppm timer a higher priority than the 250MHz/500ppm timer.
> Is that good? I like your idea to make this user-tunable after boot,
> but I still think there has to be a way to make a smarter decision up
> front. Thanks,

Shrug. I agree it needs improvement, so don't let me stop you from
doing something like what you have above. I just think it's more
complex than necessary and might result in less predictable
interpolator selections.

thanks
-john
* Re: Need better is_better_time_interpolator() algorithm
  2005-08-25 18:43 ` Alex Williamson
  2005-08-25 19:02   ` john stultz
@ 2005-08-26 15:39   ` Christoph Lameter
  2005-08-26 16:18     ` Alex Williamson
  1 sibling, 1 reply; 15+ messages in thread
From: Christoph Lameter @ 2005-08-26 15:39 UTC (permalink / raw)
To: Alex Williamson; +Cc: john stultz, linux-kernel

On Thu, 25 Aug 2005, Alex Williamson wrote:
> I don't know that it's that uncommon. Simply having one non-arch
> specific timer is enough to need to decide whether it's better than a
> generic timer. I assume pretty much every arch has a cycle timer. For
> smaller boxes, this might be the preferred timer given its latency even
> if something like an HPET exists (MMIO accesses are expensive). How do
> you hard code a value that can account for that? I agree, we could
> easily go too far and produce some bloated algorithm, but maybe it's
> simply a weighted product of a few variables.

I think a priority is something useful for the interpolators. Some of
the decisions about which time sources to use also have criteria
different from drift/latency/jitter/cpu. E.g. timers may not survive
various power-saving configurations. Thus I would think that we need a
priority plus some flags.

Some of the criteria for choosing a time source may be:

1. If a system boots up with a single cpu then there is no question
   that the ITC/TSC should be used because of the fast access.

2. If the BIOS is capable of perfect ITC/TSC synchronization then use
   the ITC/TSC. However, this is typically restricted to SMP configs,
   and there is an issue on IA64 where the PAL can indicate a nodrift
   configuration but Linux is not capable of perfect sync on bootup.

3. If a memory mapped global clock is available then make sure to first
   use a distributed timer over a centralized time source, because
   distributed timers have fast local access even under contention.
   I.e. use the Altix RTC over HPET/Cyclone.

4. Use any memory mapped global clock source.

5. Abuse some other system component for time keeping (PIT, I/O
   devices, etc).

The criteria for drift/latency can only be established after we've gone
through the above. The low-power stuff adds additional complexity.

We may need to dynamically change time interpolators / time sources if
the power situation changes. Nothing like that exists today for time
interpolators since they are mostly used in server configurations.
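A "priority plus some flags" comparison might look like the sketch below (flag names and the comparison rule are invented for illustration; they are not from any posted patch):

```c
/* Hypothetical qualifying flags, beyond a raw priority value. */
#define TS_SURVIVES_PM	0x1	/* keeps counting across power-saving states */
#define TS_DISTRIBUTED	0x2	/* per-node copies with fast local access */

struct ts_cand {
	int priority;
	unsigned int flags;
};

/*
 * When the system needs a power-management-safe source, any source that
 * cannot survive PM loses outright; otherwise fall back to priority.
 */
static int ts_better(const struct ts_cand *a, const struct ts_cand *b,
		     int need_pm_safe)
{
	if (need_pm_safe &&
	    (a->flags & TS_SURVIVES_PM) != (b->flags & TS_SURVIVES_PM))
		return (a->flags & TS_SURVIVES_PM) != 0;
	return a->priority > b->priority;
}
```

The flags express hard constraints (criteria that disqualify a source outright), while the priority still orders the survivors.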
* Re: Need better is_better_time_interpolator() algorithm
  2005-08-26 15:39 ` Christoph Lameter
@ 2005-08-26 16:18   ` Alex Williamson
  2005-08-26 19:16     ` George Anzinger
  0 siblings, 1 reply; 15+ messages in thread
From: Alex Williamson @ 2005-08-26 16:18 UTC (permalink / raw)
To: Christoph Lameter; +Cc: john stultz, linux-kernel

On Fri, 2005-08-26 at 08:39 -0700, Christoph Lameter wrote:
> I think a priority is something useful for the interpolators. Some of
> the decisions about which time sources to use also have criteria
> different from drift/latency/jitter/cpu. E.g. timers may not survive
> various power-saving configurations. Thus I would think that we need a
> priority plus some flags.
>
> Some of the criteria for choosing a time source may be:

Hi Christoph,

I sent another followup to this thread with a patch containing a fairly
crude algorithm that I think better explains my starting point. I'm
sure the weighting and scaling factors need work, but I think many of
the criteria you describe will favor the right clock.

> 1. If a system boots up with a single cpu then there is no question
> that the ITC/TSC should be used because of the fast access.

I'm hoping the jitter/cpu and latency calculation will always end up
favoring the cycle counter in that scenario.

> 2. If the BIOS is capable of perfect ITC/TSC synchronization then use
> the ITC/TSC. However, this is typically restricted to SMP configs,
> and there is an issue on IA64 where the PAL can indicate a nodrift
> configuration but Linux is not capable of perfect sync on bootup.

If we're setting the jitter flag appropriately, which I think we are,
then the low latency of the ITC and lack of jitter penalty should do
the right thing here.

> 3. If a memory mapped global clock is available then make sure to first
> use a distributed timer over a centralized time source, because
> distributed timers have fast local access even under contention.
> I.e. use the Altix RTC over HPET/Cyclone.
>
> 4. Use any memory mapped global clock source.

Scaling the latency to the average across a NUMA system will
automatically order these correctly. They will definitely have higher
latency than the ITC, so should fall about here on the list. It's easy
to make the Altix RTC have no node association and therefore not
penalize the access latency further.

> 5. Abuse some other system component for time keeping (PIT, I/O
> devices, etc).
>
> The criteria for drift/latency can only be established after we've
> gone through the above. The low-power stuff adds additional
> complexity.
>
> We may need to dynamically change time interpolators / time sources if
> the power situation changes. Nothing like that exists today for time
> interpolators since they are mostly used in server configurations.

Yes, I don't know how to handle low-power situations other than to have
a flag on the interpolators and a hook to switch to the best low-power
timer when the need arises. We may need to use the same hook to
re-evaluate the timer when a CPU hotplug occurs. This way we could
dynamically fall back to case 1 above if most of the CPUs are removed
(and likewise move to an HPET if CPUs are added).

I was looking into the suggestion to use logarithms to try to normalize
some of the factors such that weighting is more logical. Hopefully that
won't explode the complexity. Thanks,

	Alex

--
Alex Williamson                             HP Linux & Open Source Lab
* Re: Need better is_better_time_interpolator() algorithm
  2005-08-26 16:18 ` Alex Williamson
@ 2005-08-26 19:16   ` George Anzinger
  2005-08-26 19:26     ` Alex Williamson
  0 siblings, 1 reply; 15+ messages in thread
From: George Anzinger @ 2005-08-26 19:16 UTC (permalink / raw)
To: Alex Williamson; +Cc: Christoph Lameter, john stultz, linux-kernel

Alex Williamson wrote:
> On Fri, 2005-08-26 at 08:39 -0700, Christoph Lameter wrote:
>> I think a priority is something useful for the interpolators. Some of
>> the decisions about which time sources to use also have criteria
>> different from drift/latency/jitter/cpu. E.g. timers may not survive
>> various power-saving configurations. Thus I would think that we need a
>> priority plus some flags.
>>
>> Some of the criteria for choosing a time source may be:
>
> Hi Christoph,
>
> I sent another followup to this thread with a patch containing a
> fairly crude algorithm that I think better explains my starting point.
> I'm sure the weighting and scaling factors need work, but I think many
> of the criteria you describe will favor the right clock.
>
>> 1. If a system boots up with a single cpu then there is no question
>> that the ITC/TSC should be used because of the fast access.

We need to factor in frequency shifting here, especially if it happens
without notice.

--
George Anzinger   george@mvista.com
HRT (High-res-timers):  http://sourceforge.net/projects/high-res-timers/
* Re: Need better is_better_time_interpolator() algorithm
  2005-08-26 19:16 ` George Anzinger
@ 2005-08-26 19:26   ` Alex Williamson
  2005-08-26 19:33     ` Christoph Lameter
  0 siblings, 1 reply; 15+ messages in thread
From: Alex Williamson @ 2005-08-26 19:26 UTC (permalink / raw)
To: george; +Cc: Christoph Lameter, john stultz, linux-kernel

On Fri, 2005-08-26 at 12:16 -0700, George Anzinger wrote:
> Alex Williamson wrote:
> > On Fri, 2005-08-26 at 08:39 -0700, Christoph Lameter wrote:
> >> 1. If a system boots up with a single cpu then there is no question
> >> that the ITC/TSC should be used because of the fast access.
>
> We need to factor in frequency shifting here, especially if it happens
> without notice.

Would we ever want to favor a frequency-shifting timer over anything
else in the system? If it was noticeable, perhaps we'd just need a
callback to re-evaluate the frequency and rescan for the best timer.
If it happens without notice, a flag that statically assigns it the
lowest priority will do. Or maybe if the driver factored the frequency
shifting into the drift it would make the timer undesirable without
resorting to flags. Thanks,

	Alex

--
Alex Williamson                             HP Linux & Open Source Lab
* Re: Need better is_better_time_interpolator() algorithm
  2005-08-26 19:26 ` Alex Williamson
@ 2005-08-26 19:33   ` Christoph Lameter
  2005-08-26 19:51     ` George Anzinger
  2005-08-27 11:55     ` Pavel Machek
  0 siblings, 2 replies; 15+ messages in thread
From: Christoph Lameter @ 2005-08-26 19:33 UTC (permalink / raw)
To: Alex Williamson; +Cc: george, john stultz, linux-kernel

On Fri, 26 Aug 2005, Alex Williamson wrote:
> Would we ever want to favor a frequency-shifting timer over anything
> else in the system? If it was noticeable, perhaps we'd just need a
> callback to re-evaluate the frequency and rescan for the best timer.
> If it happens without notice, a flag that statically assigns it the
> lowest priority will do. Or maybe if the driver factored the frequency
> shifting into the drift it would make the timer undesirable without
> resorting to flags. Thanks,

Timers are usually constant. AFAIK frequency shifts only occur through
power management. In that case we usually have some notifiers running
before the change. These notifiers need to switch to a different time
source if the timer frequency will be shifting or the timer will become
unavailable.
* Re: Need better is_better_time_interpolator() algorithm
  2005-08-26 19:33 ` Christoph Lameter
@ 2005-08-26 19:51   ` George Anzinger
  2005-08-27 11:55   ` Pavel Machek
  1 sibling, 0 replies; 15+ messages in thread
From: George Anzinger @ 2005-08-26 19:51 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Alex Williamson, john stultz, linux-kernel

Christoph Lameter wrote:
> On Fri, 26 Aug 2005, Alex Williamson wrote:
>> Would we ever want to favor a frequency-shifting timer over anything
>> else in the system? If it was noticeable, perhaps we'd just need a
>> callback to re-evaluate the frequency and rescan for the best timer.
>> If it happens without notice, a flag that statically assigns it the
>> lowest priority will do. Or maybe if the driver factored the frequency
>> shifting into the drift it would make the timer undesirable without
>> resorting to flags. Thanks,
>
> Timers are usually constant. AFAIK frequency shifts only occur through
> power management. In that case we usually have some notifiers running
> before the change. These notifiers need to switch to a different time
> source if the timer frequency will be shifting or the timer will become
> unavailable.

If there is a notifier, I presume we can track it. We might want to
refine things so as to not hit too big a bump when the shift occurs,
but I think it is doable. The desirability of doing it, I think,
depends on the availability of something better.

The access time of the TSC is "really" enticing. Even so, I think a
_good_ clock would not depend on long-term accuracy of something as
fast as the TSC. Vendors are even modulating these to reduce RFI, but
still, because of its speed, it makes the best interpolator for the
jiffie-to-jiffie times.

--
George Anzinger   george@mvista.com
HRT (High-res-timers):  http://sourceforge.net/projects/high-res-timers/
* Re: Need better is_better_time_interpolator() algorithm
  2005-08-26 19:33 ` Christoph Lameter
  2005-08-26 19:51   ` George Anzinger
@ 2005-08-27 11:55   ` Pavel Machek
  2005-08-29 17:00     ` Christoph Lameter
  1 sibling, 1 reply; 15+ messages in thread
From: Pavel Machek @ 2005-08-27 11:55 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Alex Williamson, george, john stultz, linux-kernel

Hi!

> > Would we ever want to favor a frequency-shifting timer over anything
> > else in the system? If it was noticeable, perhaps we'd just need a
> > callback to re-evaluate the frequency and rescan for the best timer.
> > If it happens without notice, a flag that statically assigns it the
> > lowest priority will do. Or maybe if the driver factored the frequency
> > shifting into the drift it would make the timer undesirable without
> > resorting to flags. Thanks,
>
> Timers are usually constant. AFAIK frequency shifts only occur through
> power management. In that case we usually have some notifiers running

Usually is the key word here. Older APM stuff changes frequency behind
your back, and sometimes frequency shift is time-critical.

--
64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms
* Re: Need better is_better_time_interpolator() algorithm
  2005-08-27 11:55 ` Pavel Machek
@ 2005-08-29 17:00   ` Christoph Lameter
  0 siblings, 0 replies; 15+ messages in thread
From: Christoph Lameter @ 2005-08-29 17:00 UTC (permalink / raw)
To: Pavel Machek; +Cc: Alex Williamson, george, john stultz, linux-kernel

On Sat, 27 Aug 2005, Pavel Machek wrote:
> Usually is the key word here. Older APM stuff changes frequency behind
> your back, and sometimes frequency shift is time-critical.

In that case the clocks tied to the shifting frequency are not usable.
* Re: Need better is_better_time_interpolator() algorithm
@ 2005-08-25 21:40 linux
2005-08-25 23:07 ` Alex Williamson
0 siblings, 1 reply; 15+ messages in thread
From: linux @ 2005-08-25 21:40 UTC (permalink / raw)
To: alex.williamson; +Cc: linux-kernel
> (frequency) * (1/drift) * (1/latency) * (1/(jitter_factor * cpus))
(Note that 1/cpus, being a constant for all evaluations of this
expression, has no effect on the final ranking.)
The usual way it's done is with some fiddle factors:
quality_a^a * quality_b^b * quality_c^c
Or, equivalently:
a * log(quality_a) + b * log(quality_b) + c * log(quality_c)
Then you use the a, b and c factors to weight the relative importance
of them. Your suggestion is equivalent to setting all the exponents to 1.
But you can also say that "a is twice as important as b" in a
consistent manner.
Note that computing a few bits of log_2 is not hard to do in integer
math if you're not too anxious about efficiency:
unsigned log2(unsigned x)
{
	unsigned result = 31;
	unsigned i;

	assert(x);
	/* Normalize so the top bit is set (note the parentheses: the
	 * precedence of ! binds tighter than &). */
	while (!(x & (1u<<31))) {
		x <<= 1;
		result--;
	}
	/* Think of x as a 1.31-bit fixed-point number, 1 <= x < 2 */
	for (i = 0; i < NUM_FRACTION_BITS; i++) {
		unsigned long long y = x;

		/* Square x and compare to 2. */
		y *= x;
		result <<= 1;
		if (y & (1ull<<63)) {
			result++;
			x = (unsigned)(y >> 32);
		} else {
			x = (unsigned)(y >> 31);
		}
	}
	return result;
}
Setting NUM_FRACTION_BITS to 16 or so would give enough room for
reasonable-sized weights and not have the total overflow 32 bits.
* Re: Need better is_better_time_interpolator() algorithm
  2005-08-25 21:40 linux
@ 2005-08-25 23:07 ` Alex Williamson
  2005-08-26 16:48   ` Christoph Lameter
  0 siblings, 1 reply; 15+ messages in thread
From: Alex Williamson @ 2005-08-25 23:07 UTC (permalink / raw)
To: linux; +Cc: linux-kernel

On Thu, 2005-08-25 at 17:40 -0400, linux@horizon.com wrote:
> > (frequency) * (1/drift) * (1/latency) * (1/(jitter_factor * cpus))
>
> (Note that 1/cpus, being a constant for all evaluations of this
> expression, has no effect on the final ranking.)

I was sloppy expressing how the jitter factors in, but I've got code
below that should make it clearer.

> The usual way it's done is with some fiddle factors:
>
>   quality_a^a * quality_b^b * quality_c^c
>
> Or, equivalently:
>
>   a * log(quality_a) + b * log(quality_b) + c * log(quality_c)
>
> Then you use the a, b and c factors to weight the relative importance
> of them. Your suggestion is equivalent to setting all the exponents to 1.
>
> But you can also say that "a is twice as important as b" in a
> consistent manner.

Right. It's the weighting factors themselves that I'm clueless about.
I'm hoping someone will chime in to help us prioritize and weight clock
attributes appropriately.

The code I've been poking at is below (by no means ready for
inclusion). I think this brings the components I'm aware of together,
and makes an attempt to do something meaningful with them.

On my system, I have 2 NUMA nodes, each with 4 processors. The cycle
counter runs at 1.5GHz with a 750ppm drift. The latency of this timer
comes out to ~30ns. The cycle counter is subject to jitter, so I divide
by the square of the cpus and arrive at a "goodness" factor of around
950.

I also have 2 HPETs on the system, one on each node. These run at
250MHz with an assumed drift of 500ppm (although the current hpet code
makes the drift much higher). I measure the access latency of these to
be around 200ns.
Because the HPETs have different access latency depending on the node,
I use the NUMA info to get an average access latency of ~300ns. These
timers are not subject to jitter, eliminating that factor. This results
in a "goodness" factor of ~1450.

I don't really know if this makes sense, but it seems to do what I
think it should. If I were to add another node to the system, I would
more strongly favor the HPETs; if I removed a node I would revert to
the cycle counter. Anyway, I think it might be a good starting point
for further experimentation. Patch below.

	Alex

--
Alex Williamson                             HP Linux & Open Source Lab

 arch/ia64/kernel/cyclone.c      |    3 -
 arch/ia64/kernel/time.c         |    3 -
 arch/ia64/sn/kernel/sn2/timer.c |    3 -
 arch/sparc64/kernel/time.c      |    3 -
 drivers/char/hpet.c             |   12 +++-
 include/linux/hpet.h            |    1
 include/linux/timex.h           |    2
 kernel/timer.c                  |  113 +++++++++++++++++++++++++++++++++++++++-
 8 files changed, 133 insertions(+), 7 deletions(-)

diff -r b40794c1ac45 arch/ia64/kernel/cyclone.c
--- a/arch/ia64/kernel/cyclone.c	Wed Aug 24 12:00:22 2005
+++ b/arch/ia64/kernel/cyclone.c	Thu Aug 25 16:29:22 2005
@@ -23,7 +23,8 @@
 	.shift = 16,
 	.frequency = CYCLONE_TIMER_FREQ,
 	.drift = -100,
-	.mask = (1LL << 40) - 1
+	.mask = (1LL << 40) - 1,
+	.node = -1
 };
 
 int __init init_cyclone_clock(void)
diff -r b40794c1ac45 arch/ia64/kernel/time.c
--- a/arch/ia64/kernel/time.c	Wed Aug 24 12:00:22 2005
+++ b/arch/ia64/kernel/time.c	Thu Aug 25 16:29:22 2005
@@ -48,7 +48,8 @@
 static struct time_interpolator itc_interpolator = {
 	.shift = 16,
 	.mask = 0xffffffffffffffffLL,
-	.source = TIME_SOURCE_CPU
+	.source = TIME_SOURCE_CPU,
+	.node = -1
 };
 
 static irqreturn_t
diff -r b40794c1ac45 arch/ia64/sn/kernel/sn2/timer.c
--- a/arch/ia64/sn/kernel/sn2/timer.c	Wed Aug 24 12:00:22 2005
+++ b/arch/ia64/sn/kernel/sn2/timer.c	Thu Aug 25 16:29:22 2005
@@ -25,7 +25,8 @@
 	.drift = -1,
 	.shift = 10,
 	.mask = (1LL << 55) - 1,
-	.source = TIME_SOURCE_MMIO64
+	.source = TIME_SOURCE_MMIO64,
+	.node = -1
 };
 
 void __init sn_timer_init(void)
diff -r b40794c1ac45 arch/sparc64/kernel/time.c
--- a/arch/sparc64/kernel/time.c	Wed Aug 24 12:00:22 2005
+++ b/arch/sparc64/kernel/time.c	Thu Aug 25 16:29:22 2005
@@ -1048,7 +1048,8 @@
 static struct time_interpolator sparc64_cpu_interpolator = {
 	.source = TIME_SOURCE_CPU,
 	.shift = 16,
-	.mask = 0xffffffffffffffffLL
+	.mask = 0xffffffffffffffffLL,
+	.node = -1
 };
 
 /* The quotient formula is taken from the IA64 port. */
diff -r b40794c1ac45 drivers/char/hpet.c
--- a/drivers/char/hpet.c	Wed Aug 24 12:00:22 2005
+++ b/drivers/char/hpet.c	Thu Aug 25 16:29:22 2005
@@ -82,6 +82,7 @@
 	unsigned long hp_delta;
 	unsigned int hp_ntimer;
 	unsigned int hp_which;
+	acpi_handle handle;
 	struct hpet_dev hp_dev[1];
 };
 
@@ -702,6 +703,7 @@
 {
 #ifdef CONFIG_TIME_INTERPOLATION
 	struct time_interpolator *ti;
+	int pxm;
 
 	ti = kmalloc(sizeof(*ti), GFP_KERNEL);
 	if (!ti)
@@ -714,7 +716,12 @@
 	ti->frequency = hpet_time_div(hpets->hp_period);
 	ti->drift = ti->frequency * HPET_DRIFT / 1000000;
 	ti->mask = -1;
-
+	ti->node = -1;
+	pxm = acpi_get_pxm(hpetp->handle);
+#ifdef CONFIG_NUMA
+	if (pxm >= 0)
+		ti->node = pxm_to_nid_map[pxm];
+#endif
 	hpetp->hp_interpolator = ti;
 	register_time_interpolator(ti);
 #endif
@@ -871,6 +878,7 @@
 	}
 
 	hpetp->hp_delta = hpet_calibrate(hpetp);
+	hpetp->handle = hdp->handle;
 	hpet_register_interpolator(hpetp);
 
 	return 0;
@@ -923,6 +931,8 @@
 	struct hpet_data data;
 
 	memset(&data, 0, sizeof(data));
+
+	data.handle = device->handle;
 
 	result = acpi_walk_resources(device->handle, METHOD_NAME__CRS,
diff -r b40794c1ac45 include/linux/hpet.h
--- a/include/linux/hpet.h	Wed Aug 24 12:00:22 2005
+++ b/include/linux/hpet.h	Thu Aug 25 16:29:22 2005
@@ -118,6 +118,7 @@
 	unsigned short hd_flags;
 	unsigned int hd_state;	/* timer allocated */
 	unsigned int hd_irq[HPET_MAX_TIMERS];
+	acpi_handle handle;
 };
 
 #define HPET_DATA_PLATFORM	0x0001	/* platform call to hpet_alloc */
diff -r b40794c1ac45 include/linux/timex.h
--- a/include/linux/timex.h	Wed Aug 24 12:00:22 2005
+++ b/include/linux/timex.h	Thu Aug 25 16:29:22 2005
@@ -298,6 +298,8 @@
 	long drift;		/* drift in parts-per-million (or -1) */
 	unsigned long skips;	/* skips forward */
 	unsigned long ns_skipped;	/* nanoseconds skipped */
+	unsigned long latency;	/* access latency, set by register_time_interpolator() */
+	long node;		/* NUMA node where the device lives, or -1 */
 	struct time_interpolator *next;
 };
diff -r b40794c1ac45 kernel/timer.c
--- a/kernel/timer.c	Wed Aug 24 12:00:22 2005
+++ b/kernel/timer.c	Thu Aug 25 16:29:22 2005
@@ -39,6 +39,8 @@
 #include <asm/div64.h>
 #include <asm/timex.h>
 #include <asm/io.h>
+#include <asm/smp.h>
+#include <asm/topology.h>
 
 #ifdef CONFIG_TIME_INTERPOLATION
 static void time_interpolator_update(long delta_nsec);
@@ -1408,6 +1410,26 @@
 static struct time_interpolator *time_interpolator_list;
 static DEFINE_SPINLOCK(time_interpolator_lock);
 
+static u64 time_interpolator_read(struct time_interpolator *ti)
+{
+	unsigned long (*x)(void);
+
+	switch (ti->source)
+	{
+	case TIME_SOURCE_FUNCTION:
+		x = ti->addr;
+		return x();
+
+	case TIME_SOURCE_MMIO64:
+		return readq((void __iomem *) ti->addr);
+
+	case TIME_SOURCE_MMIO32:
+		return readl((void __iomem *) ti->addr);
+
+	default:
+		return get_cycles();
+	}
+}
+
 static inline u64 time_interpolator_get_cycles(unsigned int src)
 {
 	unsigned long (*x)(void);
@@ -1517,13 +1539,99 @@
 	}
 }
 
+#ifdef CONFIG_NUMA
+static long
+time_interpolator_latency_scale(struct time_interpolator *ti)
+{
+	int cpu;
+	long scale = 0;
+
+	if (ti->node == -1)
+		return 10;
+
+	for_each_online_cpu(cpu)
+		scale += node_distance(cpu_to_node(cpu), ti->node);
+
+	/*
+	 * Penalize timers not on the timekeeper node.
+	 * FIXME: This isn't a good check for the timekeeper
+	 */
+	if (ti->node != 0)
+		scale += node_distance(cpu_to_node(0), ti->node);
+
+	return scale / num_online_cpus();
+}
+#else
+#define time_interpolator_latency_scale(x) (10)
+#endif
+
+static unsigned long
+get_time_interpolator_priority(struct time_interpolator *ti)
+{
+	unsigned long pri;
+
+	/*
+	 * -1 drift seems to indicate that someone really wants us to use
+	 * their timer.
+	 */
+	if (ti->drift == -1)
+		return ~0UL;
+
+	pri = ti->frequency/ti->drift;
+	pri /= ti->latency * time_interpolator_latency_scale(ti) / 10;
+	if (ti->jitter)
+		pri /= num_online_cpus() * num_online_cpus();
+
+	return pri;
+}
+
 static inline int
 is_better_time_interpolator(struct time_interpolator *new)
 {
 	if (!time_interpolator)
 		return 1;
-	return new->frequency > 2*time_interpolator->frequency ||
-	    (unsigned long)new->drift < (unsigned long)time_interpolator->drift;
+
+	if (new == time_interpolator)
+		return 0;
+
+	return get_time_interpolator_priority(new) >
+	       get_time_interpolator_priority(time_interpolator);
+}
+
+static unsigned long
+time_interpolator_latency(struct time_interpolator *ti)
+{
+	volatile u64 start, end;
+	unsigned long latency;
+	int i, scale;
+
+	start = time_interpolator_read(ti);
+	for (i = 0; i < 100; i++)
+		end = time_interpolator_read(ti);
+	latency = ((end - start) * (1000000000UL/i))/ti->frequency;
+
+#ifdef CONFIG_NUMA
+	/*
+	 * On big boxes, an MMIO timer source may have different access times
+	 * from different nodes.  It's impractical to measure the latency from
+	 * each node, so we trust node distance to scale the latency to
+	 * the local access time.
+	 */
+	switch (ti->source) {
+	case TIME_SOURCE_MMIO64:
+	case TIME_SOURCE_MMIO32:
+		if (ti->node == -1)
+			scale = 10;
+		else
+			scale = node_distance(cpu_to_node(
+					smp_processor_id()), ti->node);
+		break;
+	default:
+		scale = 10;
+	}
+	latency = (latency * 10)/scale;	/* SMP locality = 10 */
+#endif
+	return latency > 0 ? latency : 1UL;
 }
 
 void
@@ -1536,6 +1644,7 @@
 		BUG();
 
 	ti->nsec_per_cyc = ((u64)NSEC_PER_SEC << ti->shift) / ti->frequency;
+	ti->latency = time_interpolator_latency(ti);
 	spin_lock(&time_interpolator_lock);
 	write_seqlock_irqsave(&xtime_lock, flags);
 	if (is_better_time_interpolator(ti)) {
* Re: Need better is_better_time_interpolator() algorithm
  2005-08-25 23:07 ` Alex Williamson
@ 2005-08-26 16:48   ` Christoph Lameter
  0 siblings, 0 replies; 15+ messages in thread
From: Christoph Lameter @ 2005-08-26 16:48 UTC (permalink / raw)
To: Alex Williamson; +Cc: linux, linux-kernel

On Thu, 25 Aug 2005, Alex Williamson wrote:

>    I don't really know if this makes sense, but it seems to do what I
> think it should.  If I were to add another node to the system, I would
> more strongly favor the HPET's time; if I removed a node I would revert
> to the cycle counter.  Anyway, I think it might be a good starting point
> for further experimentation.  Patch below.

Adding a node to the time interpolator does not make much sense, since
time needs to be universally accessible in order to be comparable. The
exception is if you want to use the time interpolator's time source as an
interrupt source; in that case you may want to change the node that the
timer interrupt runs on.

Some time sources, like the Altix RTC, are not bound to any node but are
available through a node-local mmio instruction. There is no way to assign
a node number to such a source. Moreover, you could derive the node number
from the mmio address if the need really arises.

Also, latency is only a minor criterion in determining the most suitable
clock. More important are the clock's characteristics, like consistency
across multiple processors, or the distribution of the timer. One does not
want a clock that is inconsistent across processors, regardless of its
latency. A distributed timer wins over a low-latency timer on a single
node in a NUMA system.

I think it would be best to add some flags that describe:

1. The consistency scope of the time source, i.e.:

   TIME_SOURCE_SYNC_PROCESSOR	Access to the time source yields a
				processor-local result
   TIME_SOURCE_SYNC_NODE	Access to the time source yields a result
				that is the same for all processors on the
				same node
   TIME_SOURCE_SYNC_GLOBAL	Access to the time source yields a
				globally consistent result

2. The locality of the time source:

   TIME_SOURCE_IN_PROCESSOR	-> Processor is the time source
   TIME_SOURCE_DISTRIBUTED	-> Distributed time source
   TIME_SOURCE_NODE		-> Memory-mapped time source on a node

Then the ITC would be configured as

   TIME_SOURCE_SYNC_PROCESSOR | TIME_SOURCE_IN_PROCESSOR

If the ITCs are synced up per node:

   TIME_SOURCE_SYNC_NODE | TIME_SOURCE_IN_PROCESSOR

If this is an SMP system then it may even be

   TIME_SOURCE_SYNC_GLOBAL | TIME_SOURCE_IN_PROCESSOR

For the HPET this would be

   TIME_SOURCE_SYNC_GLOBAL | TIME_SOURCE_NODE

For the Altix RTC:

   TIME_SOURCE_SYNC_GLOBAL | TIME_SOURCE_DISTRIBUTED

Hope this makes sense.
End of thread, other threads: [~2005-08-29 17:01 UTC | newest]

Thread overview: 15+ messages (links below jump to the message on this page):

2005-08-25 16:44 Need better is_better_time_interpolator() algorithm Alex Williamson
2005-08-25 17:36 ` john stultz
2005-08-25 18:43 ` Alex Williamson
2005-08-25 19:02 ` john stultz
2005-08-26 15:39 ` Christoph Lameter
2005-08-26 16:18 ` Alex Williamson
2005-08-26 19:16 ` George Anzinger
2005-08-26 19:26 ` Alex Williamson
2005-08-26 19:33 ` Christoph Lameter
2005-08-26 19:51 ` George Anzinger
2005-08-27 11:55 ` Pavel Machek
2005-08-29 17:00 ` Christoph Lameter
-- strict thread matches above, loose matches on Subject: below --
2005-08-25 21:40 linux
2005-08-25 23:07 ` Alex Williamson
2005-08-26 16:48 ` Christoph Lameter