* Need better is_better_time_interpolator() algorithm
@ 2005-08-25 16:44 Alex Williamson
2005-08-25 17:36 ` john stultz
0 siblings, 1 reply; 15+ messages in thread
From: Alex Williamson @ 2005-08-25 16:44 UTC (permalink / raw)
To: linux-kernel
Hi,
In playing with an HPET device, I noticed that
kernel/timer.c:is_better_time_interpolator() is completely asymmetric
in which timer it selects. The test is simply:
	return new->frequency > 2*time_interpolator->frequency ||
	    (unsigned long)new->drift < (unsigned long)time_interpolator->drift;
Given two timers:
(a) 1.5GHz, 750ppm
(b) 250MHz, 500ppm
the resulting "better" timer is completely dependent on the order
they're passed in. For example, (a),(b) = (b); (b),(a) = (a).
What are we really looking for in a "better" timer? There are at
least 4 factors that I can think of that seem important to determining a
better clock:
* resolution (frequency)
* accuracy (drift)
* access latency (may be non-uniform across the system?)
* jitter (monotonically increasing)
How can we munge these all together to come up with a single goodness
factor for comparison? There's probably a thesis covering algorithms to
handle this. Anyone know of one or have some good ideas? Thanks,
Alex
--
Alex Williamson HP Linux & Open Source Lab
^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Need better is_better_time_interpolator() algorithm
  2005-08-25 16:44 Need better is_better_time_interpolator() algorithm Alex Williamson
@ 2005-08-25 17:36 ` john stultz
  2005-08-25 18:43   ` Alex Williamson
  0 siblings, 1 reply; 15+ messages in thread
From: john stultz @ 2005-08-25 17:36 UTC (permalink / raw)
To: Alex Williamson; +Cc: linux-kernel

On Thu, 2005-08-25 at 10:44 -0600, Alex Williamson wrote:
> In playing with an HPET device, I noticed that
> kernel/timer.c:is_better_time_interpolator() is completely asymmetric
> in which timer it selects. The test is simply:
>
> return new->frequency > 2*time_interpolator->frequency ||
>     (unsigned long)new->drift < (unsigned long)time_interpolator->drift;
>
> Given two timers:
>
> (a) 1.5GHz, 750ppm
> (b) 250MHz, 500ppm
>
> the resulting "better" timer is completely dependent on the order
> they're passed in. For example, (a),(b) = (b); (b),(a) = (a).
>
> What are we really looking for in a "better" timer? There are at
> least 4 factors that I can think of that seem important to determining a
> better clock:
>
> * resolution (frequency)
> * accuracy (drift)
> * access latency (may be non-uniform across the system?)
> * jitter (monotonically increasing)
>
> How can we munge these all together to come up with a single goodness
> factor for comparison? There's probably a thesis covering algorithms to
> handle this. Anyone know of one or have some good ideas? Thanks,

With my timeofday rework code, the timesource structure (which was
influenced by the time interpolators) just uses a fixed "priority"
value. This value can be changed as needed (for example: we lower the
TSC timesource priority if the TSCs are found to be out of sync).

In order to have some scale of goodness and avoid priority inflation, I
put a comment suggesting what the different priority levels mean, ie:

  0-99:    Terrible. Only use at bootup or when there's nothing else
           available
  100-199: Functional, but not desired
  200-299: Good. A correct and usable timesource
  300-399: Desired. A reasonably fast and accurate timesource
  400-499: Perfect. The ideal timesource. A must-use where available

Additionally, I created a sysfs interface that could be used to
override the priority-selected timesource.

Realistically I don't think too many systems will have multiple
out-of-tree timesources, so assigning the correct priority value
shouldn't be too difficult.

This just seemed a bit more straightforward than sorting out some
weighting algorithm for their properties to select the best timesource.

thanks
-john
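Under a fixed-priority scheme like this, selection reduces to an order-independent maximum scan. A sketch (the structure and function names here are illustrative, not from john's actual patches):

```c
#include <stddef.h>

/* Illustrative timesource carrying a tunable priority on a 0-499 scale. */
struct timesource {
	const char *name;
	int priority;			/* higher wins; init code may lower it */
	struct timesource *next;
};

static struct timesource *timesource_list;

static void register_timesource(struct timesource *ts)
{
	ts->next = timesource_list;
	timesource_list = ts;
}

/* Unlike the pairwise test in timer.c, a max scan over fixed priorities
 * picks the same winner no matter what order sources registered in. */
static struct timesource *select_timesource(void)
{
	struct timesource *ts, *best = NULL;

	for (ts = timesource_list; ts; ts = ts->next)
		if (!best || ts->priority > best->priority)
			best = ts;
	return best;
}
```

Lowering a source's priority at init time (e.g. when TSCs are found out of sync) and rescanning is then enough to change the selection.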
* Re: Need better is_better_time_interpolator() algorithm
  2005-08-25 17:36 ` john stultz
@ 2005-08-25 18:43   ` Alex Williamson
  2005-08-25 19:02     ` john stultz
  2005-08-26 15:39     ` Christoph Lameter
  0 siblings, 2 replies; 15+ messages in thread
From: Alex Williamson @ 2005-08-25 18:43 UTC (permalink / raw)
To: john stultz; +Cc: linux-kernel

On Thu, 2005-08-25 at 10:36 -0700, john stultz wrote:
> On Thu, 2005-08-25 at 10:44 -0600, Alex Williamson wrote:
> > How can we munge these all together to come up with a single goodness
> > factor for comparison? There's probably a thesis covering algorithms to
> > handle this. Anyone know of one or have some good ideas? Thanks,
>
> With my timeofday rework code, the timesource structure (which was
> influenced by the time interpolators) just uses a fixed "priority" value.
...
> Realistically I don't think too many systems will have multiple
> out-of-tree timesources, so assigning the correct priority value
> shouldn't be too difficult.
>
> This just seemed a bit more straightforward than sorting out some
> weighting algorithm for their properties to select the best timesource.

I don't know that it's that uncommon. Simply having one non-arch
specific timer is enough to need to decide whether it's better than a
generic timer. I assume pretty much every arch has a cycle timer. For
smaller boxes, this might be the preferred timer given its latency even
if something like an HPET exists (MMIO accesses are expensive). How do
you hard code a value that can account for that?

I agree, we could easily go too far and produce some bloated algorithm,
but maybe it's simply a weighted product of a few variables. To start
with, what would this do:

  (frequency) * (1/drift) * (1/latency) * (1/(jitter_factor * cpus))

Something this simple at least starts to dynamically bring the factors
together. All else being equal (and with no weighting), this would give
the 1.5GHz/750ppm timer a higher priority than the 250MHz/500ppm timer.
Is that good? I like your idea to make this user-tunable after boot,
but I still think there has to be a way to make a smarter decision up
front. Thanks,

	Alex

--
Alex Williamson                             HP Linux & Open Source Lab
* Re: Need better is_better_time_interpolator() algorithm
  2005-08-25 18:43 ` Alex Williamson
@ 2005-08-25 19:02   ` john stultz
  2005-08-26 15:39   ` Christoph Lameter
  1 sibling, 0 replies; 15+ messages in thread
From: john stultz @ 2005-08-25 19:02 UTC (permalink / raw)
To: Alex Williamson; +Cc: linux-kernel

On Thu, 2005-08-25 at 12:43 -0600, Alex Williamson wrote:
> On Thu, 2005-08-25 at 10:36 -0700, john stultz wrote:
> > On Thu, 2005-08-25 at 10:44 -0600, Alex Williamson wrote:
> > > How can we munge these all together to come up with a single goodness
> > > factor for comparison? There's probably a thesis covering algorithms to
> > > handle this. Anyone know of one or have some good ideas? Thanks,
> >
> > With my timeofday rework code, the timesource structure (which was
> > influenced by the time interpolators) just uses a fixed "priority" value.
> ...
> > Realistically I don't think too many systems will have multiple
> > out-of-tree timesources, so assigning the correct priority value
> > shouldn't be too difficult.
> >
> > This just seemed a bit more straightforward than sorting out some
> > weighting algorithm for their properties to select the best timesource.
>
> I don't know that it's that uncommon. Simply having one non-arch
> specific timer is enough to need to decide whether it's better than a
> generic timer. I assume pretty much every arch has a cycle timer. For
> smaller boxes, this might be the preferred timer given its latency even
> if something like an HPET exists (MMIO accesses are expensive). How do
> you hard code a value that can account for that?

Well, in my patches I set the default priority of the cycle timer as
being very high, say 350, and the slower MMIO device as being a decent
250. Then if we boot up on say a NUMA system, or if we find the cycle
counters to be out of sync, the cycle counter init code drops the
timesource priority of the cycle counter to something like 50.

> I agree, we could easily go too far and produce some bloated
> algorithm, but maybe it's simply a weighted product of a few
> variables.
>
> To start with, what would this do:
>
>   (frequency) * (1/drift) * (1/latency) * (1/(jitter_factor * cpus))

It just seems that something like this could be for the most part
precomputed when you're writing the time_interpolator code into a
single priority value that the init code can tweak as needed.

> Something this simple at least starts to dynamically bring the factors
> together. All else being equal (and with no weighting), this would give
> the 1.5GHz/750ppm timer a higher priority than the 250MHz/500ppm timer.
> Is that good? I like your idea to make this user-tunable after boot,
> but I still think there has to be a way to make a smarter decision up
> front. Thanks,

Shrug. I agree it needs improvement, so don't let me stop you from
doing something like what you have above. I just think it's more
complex than necessary and might result in less predictable
interpolator selections.

thanks
-john
* Re: Need better is_better_time_interpolator() algorithm
  2005-08-25 18:43 ` Alex Williamson
  2005-08-25 19:02   ` john stultz
@ 2005-08-26 15:39   ` Christoph Lameter
  2005-08-26 16:18     ` Alex Williamson
  1 sibling, 1 reply; 15+ messages in thread
From: Christoph Lameter @ 2005-08-26 15:39 UTC (permalink / raw)
To: Alex Williamson; +Cc: john stultz, linux-kernel

On Thu, 25 Aug 2005, Alex Williamson wrote:
> I don't know that it's that uncommon. Simply having one non-arch
> specific timer is enough to need to decide whether it's better than a
> generic timer. I assume pretty much every arch has a cycle timer. For
> smaller boxes, this might be the preferred timer given its latency even
> if something like an HPET exists (MMIO accesses are expensive). How do
> you hard code a value that can account for that? I agree, we could
> easily go too far and produce some bloated algorithm, but maybe it's
> simply a weighted product of a few variables.

I think a priority is something useful for the interpolators. Some of
the decisions about which time sources to use also have criteria
different from drift/latency/jitter/cpu. E.g. timers may not survive
various power-saving configurations. Thus I would think that we need a
priority plus some flags.

Some of the criteria for choosing a time source may be:

1. If a system boots up with a single cpu then there is no question
   that the ITC/TSC should be used because of the fast access.

2. If the BIOS is capable of perfect ITC/TSC synchronization then use
   the ITC/TSC. However, this is typically restricted to SMP configs,
   and there is an issue on IA64 where the PAL can indicate a nodrift
   configuration but Linux is not capable of perfect sync on bootup.

3. If a memory mapped global clock is available then make sure to first
   use a distributed timer over a centralized time source, because
   distributed timers have fast local access even under contention.
   I.e. use the Altix RTC over HPET/Cyclone.

4. Use any memory mapped global clock source.

5. Abuse some other system component for time keeping (PIT, I/O
   devices, etc).

The criteria for drift/latency can only be established after we've gone
through the above. The low-power stuff adds additional complexity.

We may need to dynamically change time interpolators / time sources if
the power situation changes. Nothing like that exists today for time
interpolators since they are mostly used in server configurations.
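A "priority plus some flags" comparison might look like the sketch below (flag names and the comparison rule are invented for illustration; they are not from any posted patch):

```c
/* Hypothetical qualifying flags, beyond a raw priority value. */
#define TS_SURVIVES_PM	0x1	/* keeps counting across power-saving states */
#define TS_DISTRIBUTED	0x2	/* per-node copies with fast local access */

struct ts_cand {
	int priority;
	unsigned int flags;
};

/*
 * When the system needs a power-management-safe source, any source that
 * cannot survive PM loses outright; otherwise fall back to priority.
 */
static int ts_better(const struct ts_cand *a, const struct ts_cand *b,
		     int need_pm_safe)
{
	if (need_pm_safe &&
	    (a->flags & TS_SURVIVES_PM) != (b->flags & TS_SURVIVES_PM))
		return (a->flags & TS_SURVIVES_PM) != 0;
	return a->priority > b->priority;
}
```

The flags express hard constraints (criteria that disqualify a source outright), while the priority still orders the survivors.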
* Re: Need better is_better_time_interpolator() algorithm
  2005-08-26 15:39 ` Christoph Lameter
@ 2005-08-26 16:18   ` Alex Williamson
  2005-08-26 19:16     ` George Anzinger
  0 siblings, 1 reply; 15+ messages in thread
From: Alex Williamson @ 2005-08-26 16:18 UTC (permalink / raw)
To: Christoph Lameter; +Cc: john stultz, linux-kernel

On Fri, 2005-08-26 at 08:39 -0700, Christoph Lameter wrote:
> I think a priority is something useful for the interpolators. Some of
> the decisions about which time sources to use also have criteria
> different from drift/latency/jitter/cpu. E.g. timers may not survive
> various power-saving configurations. Thus I would think that we need a
> priority plus some flags.
>
> Some of the criteria for choosing a time source may be:

Hi Christoph,

I sent another followup to this thread with a patch containing a fairly
crude algorithm that I think better explains my starting point. I'm
sure the weighting and scaling factors need work, but I think many of
the criteria you describe will favor the right clock.

> 1. If a system boots up with a single cpu then there is no question
> that the ITC/TSC should be used because of the fast access.

I'm hoping the jitter/cpu and latency calculation will always end up
favoring the cycle counter in that scenario.

> 2. If the BIOS is capable of perfect ITC/TSC synchronization then use
> the ITC/TSC. However, this is typically restricted to SMP configs,
> and there is an issue on IA64 where the PAL can indicate a nodrift
> configuration but Linux is not capable of perfect sync on bootup.

If we're setting the jitter flag appropriately, which I think we are,
then the low latency of the ITC and lack of jitter penalty should do
the right thing here.

> 3. If a memory mapped global clock is available then make sure to first
> use a distributed timer over a centralized time source, because
> distributed timers have fast local access even under contention.
> I.e. use the Altix RTC over HPET/Cyclone.
>
> 4. Use any memory mapped global clock source.

Scaling the latency to the average across a NUMA system will
automatically order these correctly. They will definitely have higher
latency than the ITC, so should fall about here on the list. It's easy
to make the Altix RTC have no node association and therefore not
penalize the access latency further.

> 5. Abuse some other system component for time keeping (PIT, I/O
> devices, etc).
>
> The criteria for drift/latency can only be established after we've
> gone through the above. The low-power stuff adds additional
> complexity.
>
> We may need to dynamically change time interpolators / time sources if
> the power situation changes. Nothing like that exists today for time
> interpolators since they are mostly used in server configurations.

Yes, I don't know how to handle low-power situations other than to have
a flag on the interpolators and a hook to switch to the best low-power
timer when the need arises. We may need to use the same hook to
re-evaluate the timer when a CPU hotplug occurs. This way we could
dynamically fall back to case 1 above if most of the CPUs are removed
(and likewise move to an HPET if CPUs are added).

I was looking into the suggestion to use logarithms to try to normalize
some of the factors such that weighting is more logical. Hopefully that
won't explode the complexity. Thanks,

	Alex

--
Alex Williamson                             HP Linux & Open Source Lab
* Re: Need better is_better_time_interpolator() algorithm
  2005-08-26 16:18 ` Alex Williamson
@ 2005-08-26 19:16   ` George Anzinger
  2005-08-26 19:26     ` Alex Williamson
  0 siblings, 1 reply; 15+ messages in thread
From: George Anzinger @ 2005-08-26 19:16 UTC (permalink / raw)
To: Alex Williamson; +Cc: Christoph Lameter, john stultz, linux-kernel

Alex Williamson wrote:
> On Fri, 2005-08-26 at 08:39 -0700, Christoph Lameter wrote:
>> I think a priority is something useful for the interpolators. Some of
>> the decisions about which time sources to use also have criteria
>> different from drift/latency/jitter/cpu. E.g. timers may not survive
>> various power-saving configurations. Thus I would think that we need a
>> priority plus some flags.
>>
>> Some of the criteria for choosing a time source may be:
>
> Hi Christoph,
>
> I sent another followup to this thread with a patch containing a
> fairly crude algorithm that I think better explains my starting point.
> I'm sure the weighting and scaling factors need work, but I think many
> of the criteria you describe will favor the right clock.
>
>> 1. If a system boots up with a single cpu then there is no question
>> that the ITC/TSC should be used because of the fast access.

We need to factor in frequency shifting here, especially if it happens
without notice.

--
George Anzinger   george@mvista.com
HRT (High-res-timers):  http://sourceforge.net/projects/high-res-timers/
* Re: Need better is_better_time_interpolator() algorithm
  2005-08-26 19:16 ` George Anzinger
@ 2005-08-26 19:26   ` Alex Williamson
  2005-08-26 19:33     ` Christoph Lameter
  0 siblings, 1 reply; 15+ messages in thread
From: Alex Williamson @ 2005-08-26 19:26 UTC (permalink / raw)
To: george; +Cc: Christoph Lameter, john stultz, linux-kernel

On Fri, 2005-08-26 at 12:16 -0700, George Anzinger wrote:
> Alex Williamson wrote:
> > On Fri, 2005-08-26 at 08:39 -0700, Christoph Lameter wrote:
> >> 1. If a system boots up with a single cpu then there is no question
> >> that the ITC/TSC should be used because of the fast access.
>
> We need to factor in frequency shifting here, especially if it happens
> without notice.

Would we ever want to favor a frequency-shifting timer over anything
else in the system? If it was noticeable, perhaps we'd just need a
callback to re-evaluate the frequency and rescan for the best timer.
If it happens without notice, a flag that statically assigns it the
lowest priority will do. Or maybe if the driver factored the frequency
shifting into the drift it would make the timer undesirable without
resorting to flags. Thanks,

	Alex

--
Alex Williamson                             HP Linux & Open Source Lab
* Re: Need better is_better_time_interpolator() algorithm
  2005-08-26 19:26 ` Alex Williamson
@ 2005-08-26 19:33   ` Christoph Lameter
  2005-08-26 19:51     ` George Anzinger
  2005-08-27 11:55     ` Pavel Machek
  0 siblings, 2 replies; 15+ messages in thread
From: Christoph Lameter @ 2005-08-26 19:33 UTC (permalink / raw)
To: Alex Williamson; +Cc: george, john stultz, linux-kernel

On Fri, 26 Aug 2005, Alex Williamson wrote:
> Would we ever want to favor a frequency-shifting timer over anything
> else in the system? If it was noticeable, perhaps we'd just need a
> callback to re-evaluate the frequency and rescan for the best timer.
> If it happens without notice, a flag that statically assigns it the
> lowest priority will do. Or maybe if the driver factored the frequency
> shifting into the drift it would make the timer undesirable without
> resorting to flags. Thanks,

Timers are usually constant. AFAIK frequency shifts only occur through
power management. In that case we usually have some notifiers running
before the change. These notifiers need to switch to a different time
source if the timer frequency will be shifting or the timer will become
unavailable.
* Re: Need better is_better_time_interpolator() algorithm
  2005-08-26 19:33 ` Christoph Lameter
@ 2005-08-26 19:51   ` George Anzinger
  2005-08-27 11:55   ` Pavel Machek
  1 sibling, 0 replies; 15+ messages in thread
From: George Anzinger @ 2005-08-26 19:51 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Alex Williamson, john stultz, linux-kernel

Christoph Lameter wrote:
> On Fri, 26 Aug 2005, Alex Williamson wrote:
>> Would we ever want to favor a frequency-shifting timer over anything
>> else in the system? If it was noticeable, perhaps we'd just need a
>> callback to re-evaluate the frequency and rescan for the best timer.
>> If it happens without notice, a flag that statically assigns it the
>> lowest priority will do. Or maybe if the driver factored the frequency
>> shifting into the drift it would make the timer undesirable without
>> resorting to flags. Thanks,
>
> Timers are usually constant. AFAIK frequency shifts only occur through
> power management. In that case we usually have some notifiers running
> before the change. These notifiers need to switch to a different time
> source if the timer frequency will be shifting or the timer will become
> unavailable.

If there is a notifier, I presume we can track it. We might want to
refine things so as to not hit too big a bump when the shift occurs,
but I think it is doable. The desirability of doing it, I think,
depends on the availability of something better.

The access time of the TSC is "really" enticing. Even so, I think a
_good_ clock would not depend on long-term accuracy of something as
fast as the TSC. Vendors are even modulating these to reduce RFI, but
still, because of its speed, it makes the best interpolator for the
jiffie-to-jiffie times.

--
George Anzinger   george@mvista.com
HRT (High-res-timers):  http://sourceforge.net/projects/high-res-timers/
* Re: Need better is_better_time_interpolator() algorithm
  2005-08-26 19:33 ` Christoph Lameter
  2005-08-26 19:51   ` George Anzinger
@ 2005-08-27 11:55   ` Pavel Machek
  2005-08-29 17:00     ` Christoph Lameter
  1 sibling, 1 reply; 15+ messages in thread
From: Pavel Machek @ 2005-08-27 11:55 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Alex Williamson, george, john stultz, linux-kernel

Hi!

> > Would we ever want to favor a frequency-shifting timer over anything
> > else in the system? If it was noticeable, perhaps we'd just need a
> > callback to re-evaluate the frequency and rescan for the best timer.
> > If it happens without notice, a flag that statically assigns it the
> > lowest priority will do. Or maybe if the driver factored the frequency
> > shifting into the drift it would make the timer undesirable without
> > resorting to flags. Thanks,
>
> Timers are usually constant. AFAIK frequency shifts only occur through
> power management. In that case we usually have some notifiers running

Usually is the key word here. Older APM stuff changes frequency behind
your back, and sometimes frequency shift is time-critical.

--
64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms
* Re: Need better is_better_time_interpolator() algorithm
  2005-08-27 11:55 ` Pavel Machek
@ 2005-08-29 17:00   ` Christoph Lameter
  0 siblings, 0 replies; 15+ messages in thread
From: Christoph Lameter @ 2005-08-29 17:00 UTC (permalink / raw)
To: Pavel Machek; +Cc: Alex Williamson, george, john stultz, linux-kernel

On Sat, 27 Aug 2005, Pavel Machek wrote:
> Usually is the key word here. Older APM stuff changes frequency behind
> your back, and sometimes frequency shift is time-critical.

In that case the clocks tied to the shifting frequency are not usable.
* Re: Need better is_better_time_interpolator() algorithm
@ 2005-08-25 21:40 linux
2005-08-25 23:07 ` Alex Williamson
0 siblings, 1 reply; 15+ messages in thread
From: linux @ 2005-08-25 21:40 UTC (permalink / raw)
To: alex.williamson; +Cc: linux-kernel
> (frequency) * (1/drift) * (1/latency) * (1/(jitter_factor * cpus))
(Note that 1/cpus, being a constant for all evaluations of this
expression, has no effect on the final ranking.)
The usual way it's done is with some fiddle factors:
quality_a^a * quality_b^b * quality_c^c
Or, equivalently:
a * log(quality_a) + b * log(quality_b) + c * log(quality_c)
Then you use the a, b and c factors to weight the relative importance
of them. Your suggestion is equivalent to setting all the exponents to 1.
But you can also say that "a is twice as important as b" in a
consistent manner.
Note that computing a few bits of log_2 is not hard to do in integer
math if you're not too anxious about efficiency:
unsigned log2(unsigned x)
{
	unsigned result = 31;
	unsigned i;

	assert(x);
	/* Normalize so the top bit is set (note the parentheses: the
	 * precedence of ! binds tighter than &). */
	while (!(x & (1u<<31))) {
		x <<= 1;
		result--;
	}
	/* Think of x as a 1.31-bit fixed-point number, 1 <= x < 2 */
	for (i = 0; i < NUM_FRACTION_BITS; i++) {
		unsigned long long y = x;

		/* Square x and compare to 2. */
		y *= x;
		result <<= 1;
		if (y & (1ull<<63)) {
			result++;
			x = (unsigned)(y >> 32);
		} else {
			x = (unsigned)(y >> 31);
		}
	}
	return result;
}
Setting NUM_FRACTION_BITS to 16 or so would give enough room for
reasonable-sized weights and not have the total overflow 32 bits.
* Re: Need better is_better_time_interpolator() algorithm
  2005-08-25 21:40 linux
@ 2005-08-25 23:07 ` Alex Williamson
  2005-08-26 16:48   ` Christoph Lameter
  0 siblings, 1 reply; 15+ messages in thread
From: Alex Williamson @ 2005-08-25 23:07 UTC (permalink / raw)
To: linux; +Cc: linux-kernel

On Thu, 2005-08-25 at 17:40 -0400, linux@horizon.com wrote:
> > (frequency) * (1/drift) * (1/latency) * (1/(jitter_factor * cpus))
>
> (Note that 1/cpus, being a constant for all evaluations of this
> expression, has no effect on the final ranking.)

I was sloppy expressing how the jitter factors in, but I've got code
below that should make it clearer.

> The usual way it's done is with some fiddle factors:
>
>   quality_a^a * quality_b^b * quality_c^c
>
> Or, equivalently:
>
>   a * log(quality_a) + b * log(quality_b) + c * log(quality_c)
>
> Then you use the a, b and c factors to weight the relative importance
> of them. Your suggestion is equivalent to setting all the exponents to 1.
>
> But you can also say that "a is twice as important as b" in a
> consistent manner.

Right. It's the weighting factors themselves that I'm clueless about.
I'm hoping someone will chime in to help us prioritize and weight clock
attributes appropriately.

The code I've been poking at is below (by no means ready for
inclusion). I think this brings the components I'm aware of together,
and makes an attempt to do something meaningful with them.

On my system, I have 2 NUMA nodes, each with 4 processors. The cycle
counter runs at 1.5GHz with a 750ppm drift. The latency of this timer
comes out to ~30ns. The cycle counter is subject to jitter, so I divide
by the square of the cpus and arrive at a "goodness" factor of around
950.

I also have 2 HPETs on the system, one on each node. These run at
250MHz with an assumed drift of 500ppm (although the current hpet code
makes the drift much higher). I measure the access latency of these to
be around 200ns.
Because the HPETs have different access latency depending on the node,
I use the NUMA info to get an average access latency of ~300ns. These
timers are not subject to jitter, eliminating that factor. This results
in a "goodness" factor of ~1450.

I don't really know if this makes sense, but it seems to do what I
think it should. If I were to add another node to the system, I would
more strongly favor the HPETs; if I removed a node I would revert to
the cycle counter. Anyway, I think it might be a good starting point
for further experimentation. Patch below.

	Alex

--
Alex Williamson                             HP Linux & Open Source Lab

 arch/ia64/kernel/cyclone.c      |    3 -
 arch/ia64/kernel/time.c         |    3 -
 arch/ia64/sn/kernel/sn2/timer.c |    3 -
 arch/sparc64/kernel/time.c      |    3 -
 drivers/char/hpet.c             |   12 +++-
 include/linux/hpet.h            |    1
 include/linux/timex.h           |    2
 kernel/timer.c                  |  113 +++++++++++++++++++++++++++++++++++++++-
 8 files changed, 133 insertions(+), 7 deletions(-)

diff -r b40794c1ac45 arch/ia64/kernel/cyclone.c
--- a/arch/ia64/kernel/cyclone.c	Wed Aug 24 12:00:22 2005
+++ b/arch/ia64/kernel/cyclone.c	Thu Aug 25 16:29:22 2005
@@ -23,7 +23,8 @@
 	.shift = 16,
 	.frequency = CYCLONE_TIMER_FREQ,
 	.drift = -100,
-	.mask = (1LL << 40) - 1
+	.mask = (1LL << 40) - 1,
+	.node = -1
 };
 
 int __init init_cyclone_clock(void)
diff -r b40794c1ac45 arch/ia64/kernel/time.c
--- a/arch/ia64/kernel/time.c	Wed Aug 24 12:00:22 2005
+++ b/arch/ia64/kernel/time.c	Thu Aug 25 16:29:22 2005
@@ -48,7 +48,8 @@
 static struct time_interpolator itc_interpolator = {
 	.shift = 16,
 	.mask = 0xffffffffffffffffLL,
-	.source = TIME_SOURCE_CPU
+	.source = TIME_SOURCE_CPU,
+	.node = -1
 };
 
 static irqreturn_t
diff -r b40794c1ac45 arch/ia64/sn/kernel/sn2/timer.c
--- a/arch/ia64/sn/kernel/sn2/timer.c	Wed Aug 24 12:00:22 2005
+++ b/arch/ia64/sn/kernel/sn2/timer.c	Thu Aug 25 16:29:22 2005
@@ -25,7 +25,8 @@
 	.drift = -1,
 	.shift = 10,
 	.mask = (1LL << 55) - 1,
-	.source = TIME_SOURCE_MMIO64
+	.source = TIME_SOURCE_MMIO64,
+	.node = -1
 };
 
 void __init sn_timer_init(void)
diff -r b40794c1ac45 arch/sparc64/kernel/time.c
--- a/arch/sparc64/kernel/time.c	Wed Aug 24 12:00:22 2005
+++ b/arch/sparc64/kernel/time.c	Thu Aug 25 16:29:22 2005
@@ -1048,7 +1048,8 @@
 static struct time_interpolator sparc64_cpu_interpolator = {
 	.source = TIME_SOURCE_CPU,
 	.shift = 16,
-	.mask = 0xffffffffffffffffLL
+	.mask = 0xffffffffffffffffLL,
+	.node = -1
 };
 
 /* The quotient formula is taken from the IA64 port. */
diff -r b40794c1ac45 drivers/char/hpet.c
--- a/drivers/char/hpet.c	Wed Aug 24 12:00:22 2005
+++ b/drivers/char/hpet.c	Thu Aug 25 16:29:22 2005
@@ -82,6 +82,7 @@
 	unsigned long hp_delta;
 	unsigned int hp_ntimer;
 	unsigned int hp_which;
+	acpi_handle handle;
 	struct hpet_dev hp_dev[1];
 };
 
@@ -702,6 +703,7 @@
 {
 #ifdef CONFIG_TIME_INTERPOLATION
 	struct time_interpolator *ti;
+	int pxm;
 
 	ti = kmalloc(sizeof(*ti), GFP_KERNEL);
 	if (!ti)
@@ -714,7 +716,12 @@
 	ti->frequency = hpet_time_div(hpets->hp_period);
 	ti->drift = ti->frequency * HPET_DRIFT / 1000000;
 	ti->mask = -1;
-
+	ti->node = -1;
+	pxm = acpi_get_pxm(hpetp->handle);
+#ifdef CONFIG_NUMA
+	if (pxm >= 0)
+		ti->node = pxm_to_nid_map[pxm];
+#endif
 	hpetp->hp_interpolator = ti;
 	register_time_interpolator(ti);
 #endif
@@ -871,6 +878,7 @@
 	}
 
 	hpetp->hp_delta = hpet_calibrate(hpetp);
+	hpetp->handle = hdp->handle;
 	hpet_register_interpolator(hpetp);
 
 	return 0;
@@ -923,6 +931,8 @@
 	struct hpet_data data;
 
 	memset(&data, 0, sizeof(data));
+
+	data.handle = device->handle;
 
 	result = acpi_walk_resources(device->handle, METHOD_NAME__CRS,
diff -r b40794c1ac45 include/linux/hpet.h
--- a/include/linux/hpet.h	Wed Aug 24 12:00:22 2005
+++ b/include/linux/hpet.h	Thu Aug 25 16:29:22 2005
@@ -118,6 +118,7 @@
 	unsigned short hd_flags;
 	unsigned int hd_state;	/* timer allocated */
 	unsigned int hd_irq[HPET_MAX_TIMERS];
+	acpi_handle handle;
 };
 
 #define HPET_DATA_PLATFORM	0x0001	/* platform call to hpet_alloc */
diff -r b40794c1ac45 include/linux/timex.h
--- a/include/linux/timex.h	Wed Aug 24 12:00:22 2005
+++ b/include/linux/timex.h	Thu Aug 25 16:29:22 2005
@@ -298,6 +298,8 @@
 	long drift;		/* drift in parts-per-million (or -1) */
 	unsigned long skips;	/* skips forward */
 	unsigned long ns_skipped;	/* nanoseconds skipped */
+	unsigned long latency;	/* access latency, set by register_time_interpolator() */
+	long node;		/* NUMA node where the device lives, or -1 */
 	struct time_interpolator *next;
 };
diff -r b40794c1ac45 kernel/timer.c
--- a/kernel/timer.c	Wed Aug 24 12:00:22 2005
+++ b/kernel/timer.c	Thu Aug 25 16:29:22 2005
@@ -39,6 +39,8 @@
 #include <asm/div64.h>
 #include <asm/timex.h>
 #include <asm/io.h>
+#include <asm/smp.h>
+#include <asm/topology.h>
 
 #ifdef CONFIG_TIME_INTERPOLATION
 static void time_interpolator_update(long delta_nsec);
@@ -1408,6 +1410,26 @@
 static struct time_interpolator *time_interpolator_list;
 static DEFINE_SPINLOCK(time_interpolator_lock);
 
+static u64 time_interpolator_read(struct time_interpolator *ti)
+{
+	unsigned long (*x)(void);
+
+	switch (ti->source)
+	{
+	case TIME_SOURCE_FUNCTION:
+		x = ti->addr;
+		return x();
+
+	case TIME_SOURCE_MMIO64:
+		return readq((void __iomem *) ti->addr);
+
+	case TIME_SOURCE_MMIO32:
+		return readl((void __iomem *) ti->addr);
+
+	default:
+		return get_cycles();
+	}
+}
+
 static inline u64 time_interpolator_get_cycles(unsigned int src)
 {
 	unsigned long (*x)(void);
@@ -1517,13 +1539,99 @@
 	}
 }
 
+#ifdef CONFIG_NUMA
+static long
+time_interpolator_latency_scale(struct time_interpolator *ti)
+{
+	int cpu;
+	long scale = 0;
+
+	if (ti->node == -1)
+		return 10;
+
+	for_each_online_cpu(cpu)
+		scale += node_distance(cpu_to_node(cpu), ti->node);
+
+	/*
+	 * Penalize timers not on the timekeeper node.
+	 * FIXME: This isn't a good check for the timekeeper
+	 */
+	if (ti->node != 0)
+		scale += node_distance(cpu_to_node(0), ti->node);
+
+	return scale / num_online_cpus();
+}
+#else
+#define time_interpolator_latency_scale(x) (10)
+#endif
+
+static unsigned long
+get_time_interpolator_priority(struct time_interpolator *ti)
+{
+	unsigned long pri;
+
+	/*
+	 * -1 drift seems to indicate that someone really wants us to use
+	 * their timer.
+	 */
+	if (ti->drift == -1)
+		return ~0UL;
+
+	pri = ti->frequency/ti->drift;
+	pri /= ti->latency * time_interpolator_latency_scale(ti) / 10;
+	if (ti->jitter)
+		pri /= num_online_cpus() * num_online_cpus();
+
+	return pri;
+}
+
 static inline int
 is_better_time_interpolator(struct time_interpolator *new)
 {
 	if (!time_interpolator)
 		return 1;
-	return new->frequency > 2*time_interpolator->frequency ||
-	    (unsigned long)new->drift < (unsigned long)time_interpolator->drift;
+
+	if (new == time_interpolator)
+		return 0;
+
+	return get_time_interpolator_priority(new) >
+	       get_time_interpolator_priority(time_interpolator);
+}
+
+static unsigned long
+time_interpolator_latency(struct time_interpolator *ti)
+{
+	volatile u64 start, end;
+	unsigned long latency;
+	int i, scale;
+
+	start = time_interpolator_read(ti);
+	for (i = 0; i < 100; i++)
+		end = time_interpolator_read(ti);
+	latency = ((end - start) * (1000000000UL/i))/ti->frequency;
+
+#ifdef CONFIG_NUMA
+	/*
+	 * On big boxes, an MMIO timer source may have different access times
+	 * from different nodes.  It's impractical to measure the latency from
+	 * each node, so we trust node distance to scale the latency to
+	 * the local access time.
+	 */
+	switch (ti->source) {
+	case TIME_SOURCE_MMIO64:
+	case TIME_SOURCE_MMIO32:
+		if (ti->node == -1)
+			scale = 10;
+		else
+			scale = node_distance(cpu_to_node(
+					smp_processor_id()), ti->node);
+		break;
+	default:
+		scale = 10;
+	}
+	latency = (latency * 10)/scale;	/* SMP locality = 10 */
+#endif
+	return latency > 0 ? latency : 1UL;
 }
 
 void
@@ -1536,6 +1644,7 @@
 		BUG();
 
 	ti->nsec_per_cyc = ((u64)NSEC_PER_SEC << ti->shift) / ti->frequency;
+	ti->latency = time_interpolator_latency(ti);
 	spin_lock(&time_interpolator_lock);
 	write_seqlock_irqsave(&xtime_lock, flags);
 	if (is_better_time_interpolator(ti)) {
* Re: Need better is_better_time_interpolator() algorithm
  2005-08-25 23:07 ` Alex Williamson
@ 2005-08-26 16:48   ` Christoph Lameter
  0 siblings, 0 replies; 15+ messages in thread
From: Christoph Lameter @ 2005-08-26 16:48 UTC (permalink / raw)
To: Alex Williamson; +Cc: linux, linux-kernel

On Thu, 25 Aug 2005, Alex Williamson wrote:

>    I don't really know if this makes sense, but it seems to do what I
> think it should.  If I were to add another node to the system, I would
> more strongly favor the HPET's time; if I removed a node I would revert
> to the cycle counter.  Anyway, I think it might be a good starting point
> for further experimentation.  Patch below.

Adding a node to the time interpolator does not make much sense, since
time needs to be universally accessible in order to be comparable. The
exception is if you want to use the time interpolator's time source as an
interrupt source; in that case you may want to change the node that the
timer interrupt runs on.

Some time sources, like the Altix RTC, are not bound to any node but are
available through a node-local mmio instruction. There is no way to assign
a node number to such a source. Moreover, you could derive the node number
from the mmio address if the need really arises.

Also, latency is only a minor criterion in determining the most suitable
clock. More important are the clock's characteristics, like consistency
across multiple processors, or the distribution of the timer. One does not
want a clock that is inconsistent across processors, regardless of its
latency. A distributed timer wins over a low-latency timer on a single
node in a NUMA system.

I think it would be best to add some flags that describe:

1. The consistency scope of the time source, i.e.:

   TIME_SOURCE_SYNC_PROCESSOR	Access to the time source yields a
				processor-local result
   TIME_SOURCE_SYNC_NODE	Access to the time source yields a result
				that is the same for all processors on the
				same node
   TIME_SOURCE_SYNC_GLOBAL	Access to the time source yields a
				globally consistent result

2. The locality of the time source:

   TIME_SOURCE_IN_PROCESSOR	-> Processor is the time source
   TIME_SOURCE_DISTRIBUTED	-> Distributed time source
   TIME_SOURCE_NODE		-> Memory-mapped time source on a node

Then the ITC would be configured as

   TIME_SOURCE_SYNC_PROCESSOR | TIME_SOURCE_IN_PROCESSOR

If the ITCs are synced up per node:

   TIME_SOURCE_SYNC_NODE | TIME_SOURCE_IN_PROCESSOR

If this is an SMP system then it may even be

   TIME_SOURCE_SYNC_GLOBAL | TIME_SOURCE_IN_PROCESSOR

For the HPET this would be

   TIME_SOURCE_SYNC_GLOBAL | TIME_SOURCE_NODE

For the Altix RTC:

   TIME_SOURCE_SYNC_GLOBAL | TIME_SOURCE_DISTRIBUTED

Hope this makes sense.
End of thread, other threads: [~2005-08-29 17:01 UTC | newest]

Thread overview: 15+ messages (links below jump to the message on this page):

2005-08-25 16:44 Need better is_better_time_interpolator() algorithm Alex Williamson
2005-08-25 17:36 ` john stultz
2005-08-25 18:43 ` Alex Williamson
2005-08-25 19:02 ` john stultz
2005-08-26 15:39 ` Christoph Lameter
2005-08-26 16:18 ` Alex Williamson
2005-08-26 19:16 ` George Anzinger
2005-08-26 19:26 ` Alex Williamson
2005-08-26 19:33 ` Christoph Lameter
2005-08-26 19:51 ` George Anzinger
2005-08-27 11:55 ` Pavel Machek
2005-08-29 17:00 ` Christoph Lameter
-- strict thread matches above, loose matches on Subject: below --
2005-08-25 21:40 linux
2005-08-25 23:07 ` Alex Williamson
2005-08-26 16:48 ` Christoph Lameter