public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* Re: Help with the high res timers
       [not found]         ` <20050504001307.GF3372@us.ibm.com>
@ 2005-05-04 17:10           ` George Anzinger
  2005-05-04 17:44             ` Chris Friesen
  2005-05-04 18:14             ` Darren Hart
  0 siblings, 2 replies; 12+ messages in thread
From: George Anzinger @ 2005-05-04 17:10 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: john stultz, Liu Qi,
	'high-res-timers-discourse@lists.sourceforge.net',
	Darren Hart, linux-kernel@vger.kernel.org

First I want to express my joy that we are discussing these issues at some depth.

I also think we need a wider audience and so have added the kernel list to the 
cc.  For that reason, I will not trim this message, thus allowing them to "catch 
up".

Nishanth Aravamudan wrote:
> On 03.05.2005 [16:15:04 -0700], George Anzinger wrote:
> 
>>john stultz wrote:
>>
>>>On Tue, 2005-05-03 at 14:36 -0700, George Anzinger wrote:
>>>
>>>
>>>>At the moment I am trying to understand and recast the work John Stultz 
>>>>is doing.  I have some very real concerns here and there is a patch by 
>>>>nacc@us.ibm.com<Nishanth Aravamudan> that, to me, looks like a royal 
>>>>pain.  I think we need to understand it better and voice our concerns in 
>>>>a way that is constructive. Nish's patch is here: 
>>>>http://www.ussg.iu.edu/hypermail/linux/kernel/0504.3/1635.html
> 
> 
> I'm not on the HRT list, so I didn't get the original e-mail...keep me
> on the CC, please.
> 
> 
>>>Hey George, 
>>>	Hopefully our work isn't causing you too much trouble. I do understand
>>>the pain of having large conflicting works that try to achieve similar
>>>goals.
>>>
>>>
>>>
>>>>To summarize my concerns, the HRT code depends on correct and exact 
>>>>registration of the timer interrupt with time of day.  (We don't worry 
>>>>about latency here.) In the latest x86 HRT patch we fine tune the system 
>>>>clock to ensure this alignment.  This allows us to set up two stage 
>>>>timers that use the run_timers code for the coarse time (to the time base 
>>>>resolution) followed by a short high res hardware timer to get to the 
>>>>higher resolution we want.  If the coarse timer (i.e. add_timer, 
>>>>run_timer and friends) do not register correctly, we will have the case 
>>>>of a coarse timer expiring after the needed and requested time.  Now, 
>>>>from a coarse timer point of view, that may be fine, but it means we can 
>>>>not depend on it for the high res first stage.
>>>
>>>
>>>
>>>The conceptual idea Nish is going after, is once we have a solid method
>>>to get a high-res timeofday (via the new monotonic_clock() function), we
>>>can use that instead of jiffies to expire soft-timers. 
>>>
>>>This requires re-defining each timer-bucket entry length from a jiffy
>>>instead to a unit of time (Nish calls this a timerinterval unit). This
>>>timer-interval unit's length is flexible and can be defined at compile time.
>>>Currently Nish is using ~1ms (1<<20 nanoseconds) as the interval length,
>>>but it could be redefined to be as desired to give higher-res soft-
>>>timers.
>>
>>The issue is not its length, but can we interrupt at its end.  The key to 
>>high res is being on time.  If we want a 100usec resolution timer, we need 
>>to have the low res timer hit its mark with something real close to this.
> 
> 
> Are you asking: Can we interrupt at the end of the timerinterval? That
> doesn't matter to me, so much, since I made the timerinterval length
> independent of the interrupt source. If you wanted to make sure they
> corresponded completely, you simply would need to change the
> nsecs_to_timerintervals() function appropriately. I was trying to stay
> somewhat efficient, especially since we're dealing with a 64-bit
> quantity, and thus used the shifting John mentioned above.

I think we are at too high a level to go into this in detail here, but part of 
the HRT patch is a header called sc_math.h (Scaled math).  The notion that you 
need to choose and hard code the multiplier is addressed there.  The only real 
choice you need to make is the number of bits you want to shift.  The more you 
shift the more exact the answer is.  For an example of choosing the shift bits 
you may want to look at jiffies.h in the current tree.
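The scaled-math trick can be sketched in a few lines. Everything here (the names, the SC_SHIFT value, the interval length) is illustrative rather than lifted from sc_math.h; the point is that only the shift is chosen by hand and the multiplier falls out of it:

```c
#include <assert.h>
#include <stdint.h>

/* Scaled-math conversion in the style George describes: pick a shift,
 * precompute a rounded multiplier once, then every conversion is a
 * multiply and a shift -- no runtime division.  Names are illustrative,
 * not taken from sc_math.h. */
#define SC_SHIFT 24     /* more bits => more exact, at the cost of headroom */

/* the target unit: one "timerinterval" of 2^20 ns (~1 ms) */
static const uint64_t interval_ns = 1ull << 20;

/* rounded multiplier, computed generically so SC_SHIFT can change freely */
static inline uint64_t sc_mult(void)
{
        return ((1ull << SC_SHIFT) + interval_ns / 2) / interval_ns;
}

/* nanoseconds -> timerintervals: one multiply, one shift */
static inline uint64_t ns_to_intervals(uint64_t ns)
{
        return (ns * sc_mult()) >> SC_SHIFT;
}
```

Because the interval here is a power of two, the multiplier is exact; for a non-power-of-two unit (e.g. a TSC frequency) the same code gives an answer whose error shrinks as SC_SHIFT grows, which is the trade-off the paragraph above describes.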
> 
> So, currently, my code does not provide for high-res timers,
> admittedly. It does enable some of the infrastructure that should make
> implementing high-res timing in mainline (turned on by tuning
> TIMERINTERVAL_BITS or setting a .config option, perhaps) more feasible,
> in my eyes. I think we have the same goal in mind -- make timing
> flexible and useful.
> 
> To get 100 usec resolution with my system, you would need to make
> TIMERINTERVAL_BITS=16. Basically, this makes a timerinterval = 1
> nanosecond << 16, or 65.54 (about 66) microseconds. (Of course, you could
> not use shifting and just divide by 10. Might be slower, but is
> certainly feasible.) Admittedly, this is not perfect, but it's close.
> Now, this has nothing to do with the hardware. But, it defines the
> lowest interval, *regardless* of hardware, which the system will
> support. If you request a 100 microsecond delay, we will make it 131
> microseconds.
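A minimal sketch of this conversion, assuming TIMERINTERVAL_BITS=16 (the function names are hypothetical, not the patch's): rounding up on insertion is what turns a 100 usec request into two 65.5 usec intervals, i.e. roughly 131 usec.

```c
#include <assert.h>
#include <stdint.h>

#define TIMERINTERVAL_BITS 16   /* 1 interval = 2^16 ns = 65.536 usec */

/* Round UP when adding a timer, so it can never expire early.
 * (Hypothetical helper name; Nish's patch calls the real conversion
 * nsecs_to_timerintervals().) */
static inline uint64_t ns_to_ti_ceil(uint64_t ns)
{
        return (ns + (1ull << TIMERINTERVAL_BITS) - 1) >> TIMERINTERVAL_BITS;
}

/* Round DOWN when deciding which buckets are already due. */
static inline uint64_t ns_to_ti_floor(uint64_t ns)
{
        return ns >> TIMERINTERVAL_BITS;
}
```

With these two rounding directions a timer always fires at or after its requested time, never before, which matters for the standards-compliance point raised later in the thread.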

Leaving aside the precision issue (addressed by scaled math above), I don't 
think this is the way to address high resolution.  For the most part, user 
and kernel applications are quite happy with 10 ms resolution and, I would 
argue, we should go back to the 100 HZ value.  IMNSHO, high res should be asked 
for on a per-request basis.  The POSIX standard provided a rather nice way 
to do this in the CLOCKS & TIMERS section of the standard.  In the kernel the 
low res part of this is in kernel/posix-timers.c.  To extend to high res all one 
does is to define new clocks.  In HRT we define two high res clocks 
(CLOCK_REALTIME_HR and CLOCK_MONOTONIC_HR).  From the user point of view, to get 
high res, he has to use one of these clocks.
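From user space the per-request model looks like the sketch below. CLOCK_REALTIME_HR exists only with the HRT patch applied, so this sketch shims it to the standard clock to stay compilable on a vanilla kernel; everything else is the stock POSIX timer API.

```c
#define _POSIX_C_SOURCE 199309L
#include <signal.h>
#include <time.h>

/* The HRT patch defines CLOCK_REALTIME_HR; on a stock kernel the shim
 * below falls back to the ordinary clock (i.e. low-res behaviour), so
 * this is a portable sketch, NOT the patched kernel's behaviour. */
#ifndef CLOCK_REALTIME_HR
#define CLOCK_REALTIME_HR CLOCK_REALTIME
#endif

/* arm (and immediately tear down) a one-shot timer of `nsec` ns;
 * returns 0 on success */
static int arm_hr_timer(long nsec)
{
        struct sigevent sev = { .sigev_notify = SIGEV_NONE };
        timer_t tid;

        if (timer_create(CLOCK_REALTIME_HR, &sev, &tid) != 0)
                return -1;

        struct itimerspec its = {
                .it_value = { .tv_sec = 0, .tv_nsec = nsec },
        };
        int ret = timer_settime(tid, 0, &its, NULL);
        timer_delete(tid);
        return ret;
}
```

(Link with -lrt on older glibc.) The design point is that the *interface* never changes: only the clockid names high res, so low-res users pay no new overhead.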

The trick is to muster the kernel resources to make the *_HR clock a reality. 
In my HRT patches I have done this with a two phase approach.  I simply add a 
second, HR member to the timer structure.  This allows the timer to carry the 
extra bits needed to express the added resolution and also, if the bits are 
zero, to indicate that it is just a low res timer.  Over the lifetime of the 
HRT patch I have worked with two different ways of handling the HR bits in the 
timer list.  In all cases, we stay with a periodic time tick and only incur 
extra overhead when the timer is about to expire.  We ask the "arch" code to 
provide an interrupt source with the needed short term resolution.  Since we 
only use this timer at the end of the timer period, it needs to cover only a short 
time (0 to 1 jiffy), but should have good resolution over this period.  Also, 
since it is a rather short time, we don't require it to have long term stability.
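The "second, HR member" idea might be sketched like this; the field names and types are invented for illustration and may not match the HRT patch's actual layout:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sketch of the two-phase timer: the existing coarse expiry field is
 * kept as-is, and one extra sub-jiffy field carries the added
 * resolution.  sub_expires == 0 doubles as the "plain low res timer"
 * marker, so the normal timer code stays almost unaware of HR. */
struct hr_timer {
        unsigned long expires;  /* coarse part, in jiffies, as today */
        uint32_t sub_expires;   /* arch_cycles past that jiffy; 0 => low res */
};

static inline bool timer_is_high_res(const struct hr_timer *t)
{
        return t->sub_expires != 0;
}
```

When such a timer's jiffy part expires through the normal list, a non-zero sub_expires is what hands it off to the short hardware timer for the final sub-jiffy stretch.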

The point is, the standard, low res, timer code needs to be only minimally aware 
of the HR bits.  On the other hand, since the HR timer will be started on 
expiration of the low res timer, we need that expiration to be aligned with the 
clock in a stable, predictable long term way.  (For what it's worth, we have yet 
to address NTP clock drifting in the HRT patch, so that is still a fly in this 
web of time.)
> 
> Now, the trick is then, to get the hardware to tick every 100
> microseconds. You have a lot more expertise in this area, and I would be
> interested to see what the actual options are. Presuming we are able to
> get a periodic interrupt at this rate, we could then do what I think HRT
> pretty much does, maintain the normal buckets for normal timers and then
> have a special HRT bucket. We could then be aware of when, in
> nanoseconds, the next HRT interrupt should be, set the hardware
> appropriately and be ok. If there are no HRT timers, then we'll never
> check that bucket, really (we could also just not configure in that
> option). I think it's feasible, regardless. There may be performance
> penalties, I agree. I'm working on figuring out how bad things get if we
> set TIMERINTERVAL_BITS=10, for instance (about every microsecond). If
> the system stays up, I'll be happy :)

I don't think we have hardware yet that can address 100usec ticks, nor, as I 
argue above, do we need it.  In point of fact, we found it necessary to address 
the user who, using the POSIX standard timer interface, asked for a repeating 
timer of, say, 1 usec (yes, we "say" we support 1 usec resolution).  Such a 
request could easily consume the machine (I call these timer storms), causing it 
never to be heard from again.  The, I think, elegant solution to the timer storm 
problem is to not restart the timer until the user picks up the prior 
expiration.  This dynamically adjusts the timer response to the amount of 
machine available at the time.
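A sketch of the no-rearm-until-pickup rule (all names hypothetical; in a real kernel this logic lives in the POSIX timer expiry and signal-delivery paths):

```c
#include <assert.h>
#include <stdbool.h>

/* A repeating timer is NOT put back on the list at expiry while a
 * previous expiration is still unconsumed; storms fold into an overrun
 * count instead, and the rearm happens when user space picks up. */
struct ptimer {
        bool pending;   /* previous expiration not yet picked up */
        long overrun;   /* expirations folded together while pending */
};

/* called when the timer fires */
static void ptimer_expire(struct ptimer *t)
{
        if (t->pending) {
                t->overrun++;   /* fold the storm into the overrun count */
                return;         /* crucially: do NOT rearm */
        }
        t->pending = true;      /* deliver; rearm happens at pickup */
}

/* called when user space collects the expiration */
static long ptimer_pickup(struct ptimer *t)
{
        long missed = t->overrun;
        t->overrun = 0;
        t->pending = false;     /* now safe to rearm for the next period */
        return missed;
}

/* fire `fires` times with no pickup in between, then collect */
static long simulate_storm(int fires)
{
        struct ptimer t = { false, 0 };
        for (int i = 0; i < fires; i++)
                ptimer_expire(&t);
        return ptimer_pickup(&t);
}
```

The timer rate thus self-throttles to whatever the machine can actually deliver, and the consumer still learns how many periods it missed, via the overrun count.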
> 
> 
>>>Now when the new code expires timers it does so against the timeofday
>>>subsystem's notion of time instead of jiffies. It simply goes through
>>>all the timer bucket entries between now and the last time we expired
>>>timers.
>>
>>This is not unlike what happens now.  I would hope that the number of 
>>buckets visited averages out to something real close to 1 per run_timers 
>>entry.
> 
> 
> I agree, this is what we might be able to achieve by keeping high-res
> timers in their own bucket. That way, we could conceivably know when the
> next periodic interrupt is and when the next HRT interrupt is in
> absolute nanoseconds (based on do_monotonic_clock()).

For what it's worth, we have such a clock in HRT.  It is expressed in jiffies and 
arch_cycles, but jiffies is what is used today as the "standard" for internal 
system time.  This gets into the area of just how to express time in areas where 
you need to quickly get and compare the time.  After a lot of thought, I decided 
to use arch_cycles rather than nanoseconds because arch_cycles are directly 
readable from the hardware.  Any other unit of time requires a conversion, and, 
as you know, usually requires 64-bits as well.  By staying with arch_cycles we 
avoid "most" of the conversion overhead.
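The arch_cycles point can be illustrated with the usual wraparound-safe compare; get_arch_cycles() here is a stand-in for a direct hardware read (e.g. the TSC on x86), not the HRT patch's actual accessor:

```c
#include <assert.h>
#include <stdint.h>

/* Comparing in arch_cycles makes an expiry test one read and one
 * subtraction, with no unit conversion.  A plain variable stands in
 * for the hardware counter so the sketch is portable. */
static uint32_t fake_cycle_counter;

static inline uint32_t get_arch_cycles(void)
{
        return fake_cycle_counter;      /* would be e.g. rdtsc on x86 */
}

/* Expired if "now" has reached or passed target; the signed view of the
 * unsigned difference handles wraparound of the short-range counter,
 * which is fine because we only ever compare nearby times (< 1 jiffy). */
static inline int cycles_expired(uint32_t target)
{
        return (int32_t)(get_arch_cycles() - target) >= 0;
}

/* test helper: set the fake counter, then ask */
static int demo_expired(uint32_t now, uint32_t target)
{
        fake_cycle_counter = now;
        return cycles_expired(target);
}
```

Any other unit would need a multiply/shift (and usually 64 bits) on every comparison; staying in the hardware's native unit avoids that on the hot path.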
> 
> 
>>>Now an interesting point you bring up above is when to schedule timer
>>>interrupts. One could just have a fine-grained timer-interval unit and
>>>crank up HZ, but clearly we don't want to schedule interrupts to go off
>>>too frequently or the overhead won't be worth it. Instead to get high-
>>>res timers, the idea is to look at the timer list and schedule
>>>interrupts appropriately. Then we only take an interrupt when there is
>>>work to do, rather then at regular periodic intervals when there may or
>>>may not be anything to expire.
>>
>>This would suggest that all timers are high res.  I don't think this makes 
>>sense:
>>1) because most users just don't care that much about resolution and 
>>high-res carries some overhead.
>>2) Not all platform hardware is able to handle high res.
> 
> 
> I agree, this is an issue which my TIMERINTERVAL_BITS=10 test should
> also reveal. But under the normal case where TIMERINTERVAL_BITS are 20,
> then there will never be an interrupt scheduled to go off quicker than
> 2^20 nanoseconds in the future (about a millisecond). Everything hinges
> around these bits :) (mostly because they define how we convert
> nanoseconds to timerintervals).
> 
> 
>>I think high res timers should only be used when the user asks for them.  
>>This keeps the overhead under control.
> 
> 
> I agree -- the user asking for the, currently, would mean they set
> TIMERINTERVAL_BITS lower than normal and also set up the hardware to
> interrupt appropriately.

No, no, they ask by using a *_HR clock.
> 
> 
>>>Now from my understanding of your framework, this sounds similar to how
>>>your high-res hardware timer interrupt is used. The idea being that we
>>>use that for all soft-timers, instead of having two levels of high-res
>>>and coarse soft-timers.
>>>
>>>The interesting problem that Darren will be looking at is how to
>>>schedule interrupts accurately. This appears to be something you've
>>>already attacked, so your experience would be helpful. My idea being
>>>that the timer_interrupt code checks when it intended to fire and when
>>>it actually did fire, and that information could then be used to tune
>>>the interrupt scheduling code.
>>
>>A small, measurable latency is ok and need not be backed out by the 
>>software. If you go this route you risk expiring a timer early and the 
>>standard says bad things about this.
> 
> 
> I'm not sure what you mean about timers going off early. We round up on
> addition and round down on expiration to make sure that we don't get
> timers early. As long as do_monotonic_clock() is accurate, we should be
> ok. (Darren's part uses the timerintervals value to figure out when the
> next interrupt should be, similar to the existing
> next_timer_interrupt())

Your rounding up or down is not really different from what is done with jiffies 
today...
> 
> 
>>What is missing in this is that the flagship arch (x86) has piss-poor 
>>capability to schedule timer interrupts.  I.e. there really is no commonly 
>>available timer to generate interrupts at some arbitrary time.  There is, 
>>however, the PIT, which, if you don't mess with it, is very low overhead 
>>but periodic.
> 
> 
> And generally, we are going to stay periodic, for now. Darren's patch is
> a further extension on mine, which requires a lot of testing.
> 
> 
>>Then there is the issue of what the time standard is.  In the x86, the PIT 
>>is it.  The pm_timer is close, but the read problem adds so much overhead 
>>as to make it hard to justify.  Possibly the HPET does a better job, but it 
>>is still an I/O access (read SLOW) and is not commonly available as yet.
>>
>>You also have to do something reasonable for the accounting subsystems. 
>>Currently this is done by sampling the system at a periodic rate.  What 
>>ever is running gets charged with the last period.  If you go non-periodic, 
>>either you have to charge a variable period when the sample is taken or you 
have to set up timers as part of context switching.  This latter is not 
>>wise as the context switch overhead then gets larger.  It is rather easy to 
>>show that the accounting overhead in such a system with a modest load is 
>>higher than in a periodic sampled system.  This system is also charged with 
doing the time slicing.  Without a periodic tick, a timer needs to be set 
>>up each context switch.
> 
> 
> We will always (at least in our current concept) have an idea of when,
> worst/best case, the next timer interrupt will be (a maximum, predefined
> interval). This would also define how long the system could go
> completely idle without expiring any timers.

I think we should leave the skipping of timer interrupts to the VST code.  The VST 
code only needs to figure out when the next timer is to expire when the cpu is 
idle, i.e. when it has time to spare to look up that timer.
> 
> 
>>In addition to all of this, there is the issue of how to organize the timer 
>>list so that you can find the next timer.  With the current organization 
>>this is something you just don't want to do.  On the other hand, I am 
>>convinced that, for periodic timers, it is about as good as it gets.
> 
> 
> I think currently, we do something similar to next_timer_interrupt(),
> making sure we update the stored value whenever we add a new timer, if
> necessary. Except now, we will determine the next time to set the hw
> timer to go off every time we expire timers.

Finding the next timer is a high overhead operation....
> 
> I know my proposal is a big change and does make things more difficult
> in the short-term for the HRT patch. But I think it is possible to work
> together and get a very reasonable system in place, which allows for
> both HRT and normal timers to work effectively. I may have made things
> worse in my response, but I think the discussion has to be had :)
> 


-- 
George Anzinger   george@mvista.com
High-res-timers:  http://sourceforge.net/projects/high-res-timers/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Help with the high res timers
  2005-05-04 17:10           ` Help with the high res timers George Anzinger
@ 2005-05-04 17:44             ` Chris Friesen
  2005-05-04 17:51               ` Nishanth Aravamudan
  2005-05-04 20:31               ` George Anzinger
  2005-05-04 18:14             ` Darren Hart
  1 sibling, 2 replies; 12+ messages in thread
From: Chris Friesen @ 2005-05-04 17:44 UTC (permalink / raw)
  To: george
  Cc: Nishanth Aravamudan, john stultz, Liu Qi,
	'high-res-timers-discourse@lists.sourceforge.net',
	Darren Hart, linux-kernel@vger.kernel.org

George Anzinger wrote:

> The, I think, elegant solution to the timer storm problem is to 
> not restart the timer until the user picks up the prior expiration.  
> This dynamically adjusts the timer response to the amount of machine 
> available at the time.

The disadvantage is that you then lose accuracy since each timer 
interval is increased by some random amount based on system scheduling. 
What about some kind of ulimit-type thing to specify the minimum 
recurring interval that can be specified?  If root so specifies, you 
could have 1usec interval timers and the system would hang.  This is 
conceptually no different than busy-looping in a SCHED_FIFO task.

Chris

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Help with the high res timers
       [not found]         ` <1115166592.13738.96.camel@cog.beaverton.ibm.com>
@ 2005-05-04 17:46           ` George Anzinger
  2005-05-04 22:13             ` john stultz
  0 siblings, 1 reply; 12+ messages in thread
From: George Anzinger @ 2005-05-04 17:46 UTC (permalink / raw)
  To: john stultz
  Cc: Liu Qi, 'high-res-timers-discourse@lists.sourceforge.net',
	Nishanth Aravamudan, Darren Hart, linux-kernel@vger.kernel.org

~

>>The issue is not its length, but can we interrupt at its end.  The key to high 
>>res is being on time.  If we want a 100usec resolution timer, we need to have 
>>the low res timer hit its mark with something real close to this.
> 
> 
> Indeed. 
> 
> As an aside, one nice thing with Nish's code is that the maximum soft-
> timer latency is only the maximum interval between ticks. This avoids
> accumulating error caused by lost ticks. 

I think you are confusing the x86 problems with lost ticks with the wider world. 
Most archs have reasonable time resources and, while they may suffer from late 
ticks, do not, therefore, lose time.  In as much as you can find a way to 
eliminate time loss in the x86 due to lost ticks of the time base, you can also 
do so with the current time keeping system.  The current code is just too 
dependent on catching every timer tick.  It should use the other resources 
available to it to cover any missing ticks.  (We do this rather nicely in the 
HRT patch, by the way.)
> 
> 
> 
>>>Now when the new code expires timers it does so against the timeofday
>>>subsystem's notion of time instead of jiffies. It simply goes through
>>>all the timer bucket entries between now and the last time we expired
>>>timers.
>>
>>This is not unlike what happens now.  I would hope that the number of buckets 
>>visited averages out to something real close to 1 per run_timers entry.
> 
> 
> Well, as long as the HZ period is close to the timer-interval unit
> length, this is true. However if the timer-interval unit is smaller,
> multiple bucket entries would be expired. The performance considerations
> here are being looked at and this may be an area where the concepts in
> HRT might help (having a HRT specific sub-bucket).

This is where we get in trouble with HR timers.  For an HR timer, we need to know 
how to get a timer to expire (i.e. appear in the callback) at a well defined 
and precise time (leaving aside latency issues).  The above description allows 
timers to be put in buckets without (as near as I can tell) making it transparent 
exactly when the bucket will be emptied, only saying that it will be after the 
latest timer in the bucket is due.
> 
> 
>>>Now an interesting point you bring up above is when to schedule timer
>>>interrupts. One could just have a fine-grained timer-interval unit and
>>>crank up HZ, but clearly we don't want to schedule interrupts to go off
>>>too frequently or the overhead won't be worth it. Instead to get high-
>>>res timers, the idea is to look at the timer list and schedule
>>>interrupts appropriately. Then we only take an interrupt when there is
>>>work to do, rather then at regular periodic intervals when there may or
>>>may not be anything to expire.
>>
>>This would suggest that all timers are high res.  I don't think this makes sense:
>>1) because most users just don't care that much about resolution and high-res 
>>carries some overhead.
>>2) Not all platform hardware is able to handle high res.
> 
> 
> Indeed, in our system all timers could be high res, but don't
> necessarily need to be by default. It ends up being a function of how
> finely-grained the timer-interval units are set to be and how
> efficiently the hardware can be scheduled.

We currently ship HRT with a resolution of 1 usec.  Let's agree that you don't 
want to even try to do this by adjusting the timer-interval.
> 
> 
>>I think high res timers should only be used when the user asks for them.  This 
>>keeps the overhead under control.
> 
> 
> Abstractly that makes sense, but I'm not sure how you mean "when the
> user asks for them". Is this a runtime consideration, or is it compile
> time?

Run time: he uses the POSIX clocks and timers interface and uses a separate high 
res clock.  Only timers on high res clocks are high res.
> 
> 
~
>>
>>A small, measurable latency is ok and need not be backed out by the software. 
>>If you go this route you risk expiring a timer early and the standard says bad 
>>things about this.
> 
> 
> Since we expire based upon time instead of ticks, we can never expire
> early.

Think of it this way.  Decompose an HR timer into coarse and fine units (you 
choose, but here let's say jiffies and nanoseconds).  Now we want the normal timer 
system to handle the jiffies part of the time and to turn the timer over to the 
HR timer code to take care of the nanosecond remainder.  If the jiffy part is 
late, depending on the nanosecond part, it could make the timer late (i.e. for 
low values of the nanosecond part).  For high values of the nanosecond part, we 
can compensate...

This decomposition makes a lot of sense, by the way, for, at least, the 
following reasons:
1) it keeps most of the HR issues out of the normal timer code,
2) it keeps high res and low res timers in the correct time order, i.e. a low res 
timer for jiffy X will expire prior to a high res timer for jiffy X + Y 
nanoseconds.
3) handling the high res timer list is made vastly easier, as it will only need 
to have a rather small number of timers in it at any given time (i.e. those that 
are to expire prior to the next coarse timer tick).
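The decomposition above can be sketched numerically. HZ=100 is used per the earlier suggestion; the HRT patch keeps the remainder in arch_cycles, but nanoseconds are used here purely for clarity:

```c
#include <assert.h>
#include <stdint.h>

/* Split an absolute expiry into a coarse part for the normal timer
 * list and a fine remainder for the short high-res hardware timer. */
#define NSEC_PER_JIFFY (1000000000ull / 100)    /* 10 ms at HZ=100 */

struct decomposed {
        uint64_t jiffies;   /* handled by the normal timer list */
        uint64_t sub_nsec;  /* handled by the short hardware timer */
};

static inline struct decomposed decompose(uint64_t expiry_ns)
{
        struct decomposed d = {
                .jiffies  = expiry_ns / NSEC_PER_JIFFY,
                .sub_nsec = expiry_ns % NSEC_PER_JIFFY,
        };
        return d;
}
```

For example, a 25 ms expiry splits into 2 jiffies plus a 5 ms remainder; the remainder is bounded by one jiffy, which is why the hardware timer only needs short-term resolution, not long-term stability.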
> 
> 
>>What is missing in this is that the flagship arch (x86) has piss-poor 
>>capability to schedule timer interrupts.  I.e. there really is no commonly 
>>available timer to generate interrupts at some arbitrary time.  There is, 
>>however, the PIT, which, if you don't mess with it, is very low overhead but 
>>periodic.
>>
>>Then there is the issue of what the time standard is.  In the x86, the PIT is 
>>it.  The pm_timer is close, but the read problem adds so much overhead as to 
>>make it hard to justify.  Possibly the HPET does a better job, but it is still 
>>an I/O access (read SLOW) and is not commonly available as yet.
> 
> 
> With my rework, the time standard issue is separate problem domain from
> HRT. If the timeofday code cannot give correct time, it's a bug in that
> subsystem. While your point is a fair critique of my timeofday work
> (which I am trying to address), the clear interface between soft-timers
> and timeofday allows for the soft-timer subsystem to not worry about
> that.

Leaving aside timers, one issue I am trying to get you to address is that on 
most X86 machines the "clock" rock is only expressed via PIT interrupts.  While 
we must express time with more resolution than this, we must also "use" that 
rock (i.e. PIT) to keep decent long term time.
> 
> 
>>You also have to do something reasonable for the accounting subsystems. 
>>Currently this is done by sampling the system at a periodic rate.  What ever is 
>>running gets charged with the last period.  If you go non-periodic, either you 
>>have to charge a variable period when the sample is taken or you have to set up 
>>timers as part of context switching.  This latter is not wise as the context 
>>switch overhead then gets larger.  It is rather easy to show that the accounting 
>>overhead in such a system with a modest load is higher than in a periodic 
>>sampled system.  This system is also charged with doing the time slicing. 
>>Without a periodic tick, a timer needs to be set up each context switch.
> 
> 
> I've not looked at the accounting subsystem yet, but I'll try to dig in
> and see what we can do here. Thanks for the heads up.

This is the main reason a tickless system is unwise.  It is overload prone.
> 
> 
>>In addition to all of this, there is the issue of how to organize the timer list 
>>so that you can find the next timer.  With the current organization this is 
>>something you just don't want to do.  On the other hand, I am convinced that, 
>>for periodic timers, it is about as good as it gets.
> 
> 
> You might have to go a bit more into detail on this last point. 

First, let's agree that we need not be in love with any given timer list structure.

The issues to be addressed are:
1) Fast insert.
2) Fast removal prior to expiry (almost all timers never expire).
3) Fast look up and removal of timers to expire NOW.

If you want to find the next timer to expire at some time in the future:
4) Fast look up of the next timer.

The current cascade timer list does a very good job of 1, 2, and 3 but is not 
set up to make 4 easy or fast.
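A sketch of why the cascade does well on 1-3 but not 4: insertion is a mask operation, while finding the next timer means scanning buckets. This is a single-wheel simplification with invented names, not the kernel's actual cascade:

```c
#include <assert.h>
#include <stdint.h>

#define WHEEL_BITS 8
#define WHEEL_SIZE (1 << WHEEL_BITS)
#define WHEEL_MASK (WHEEL_SIZE - 1)

/* Requirement 1 (and removal, 2-3): O(1) bucket index for timers
 * whose expiry falls within this wheel's range. */
static inline unsigned bucket_index(uint64_t expires)
{
        return expires & WHEEL_MASK;
}

/* Requirement 4: finding the next expiry means walking forward from
 * "now" until a non-empty bucket turns up -- the cost the text notes.
 * Returns the offset in ticks, or -1 if the wheel is empty. */
static int next_expiry_offset(const int occupied[WHEEL_SIZE], uint64_t now)
{
        for (int off = 0; off < WHEEL_SIZE; off++)
                if (occupied[(now + off) & WHEEL_MASK])
                        return off;
        return -1;
}

/* test helper: one timer at tick 300, current time tick 297 */
static int demo_next(void)
{
        int occ[WHEEL_SIZE] = {0};
        occ[bucket_index(300)] = 1;
        return next_expiry_offset(occ, 297);
}
```

In the real multi-level cascade the scan for requirement 4 is worse still, since outer wheels hold timers at coarser granularity and must be cascaded before the exact next expiry is known.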


> 
> One thing I'd like to emphasize is that while Nish and my work do change
> a large amount of code that collides with your code, we want to make
> sure that what the HRT patch achieves is still possible. As I get more
> familiar with the HRT code needs and you get more familiar with what
> we're providing I hope things will work smoothly.
> 
As do I.
-- 
George Anzinger   george@mvista.com
High-res-timers:  http://sourceforge.net/projects/high-res-timers/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Help with the high res timers
  2005-05-04 17:44             ` Chris Friesen
@ 2005-05-04 17:51               ` Nishanth Aravamudan
  2005-05-04 18:16                 ` Chris Friesen
  2005-05-04 20:31               ` George Anzinger
  1 sibling, 1 reply; 12+ messages in thread
From: Nishanth Aravamudan @ 2005-05-04 17:51 UTC (permalink / raw)
  To: Chris Friesen
  Cc: george, john stultz, Liu Qi,
	'high-res-timers-discourse@lists.sourceforge.net',
	Darren Hart, linux-kernel@vger.kernel.org

On 04.05.2005 [11:44:56 -0600], Chris Friesen wrote:
> George Anzinger wrote:
> 
> >The, I think, elegant solution to the timer storm problem is to 
> >not restart the timer until the user picks up the prior expiration.  
> >This dynamically adjusts the timer response to the amount of machine 
> >available at the time.
> 
> The disadvantage is that you then lose accuracy since each timer 
> interval is increased by some random amount based on system scheduling. 
>  What about some kind of ulimit-type thing to specify the minimum 
> recurring interval that can be specified?  If root so specifies, you 
> could have 1usec interval timers and the system would hang.  This is 
> conceptually no different than busy-looping in a SCHED_FIFO task.

If I understand your point correctly, I think this is achieved by
TIMERINTERVAL_BITS in my patch (not to claim my patch is functional, but
conceptually). No matter what you actually request, the best you can do
is 2^TIMERINTERVAL_BITS nanoseconds, and usually worse because the
tick-rate and timerinterval length do not necessarily line up.

Thanks,
Nish

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Help with the high res timers
  2005-05-04 17:10           ` Help with the high res timers George Anzinger
  2005-05-04 17:44             ` Chris Friesen
@ 2005-05-04 18:14             ` Darren Hart
  2005-05-04 21:46               ` George Anzinger
  1 sibling, 1 reply; 12+ messages in thread
From: Darren Hart @ 2005-05-04 18:14 UTC (permalink / raw)
  To: george
  Cc: Nishanth Aravamudan, john stultz, Liu Qi,
	'high-res-timers-discourse@lists.sourceforge.net',
	linux-kernel@vger.kernel.org

George Anzinger wrote:
 > First I want to express my joy that we are discussing these issues at
 > some depth.
 >
 > I also think we need a wider audience and so have added the kernel list
 > to the cc.  For that reason, I will not trim this message, thus allowing
 > them to "catch up".
 >
 > Nishanth Aravamudan wrote:
 >
 >> On 03.05.2005 [16:15:04 -0700], George Anzinger wrote:
 >>
 >>> john stultz wrote:
 >>>
 >>>> On Tue, 2005-05-03 at 14:36 -0700, George Anzinger wrote:
 >>>>
 >>>>
 >>>>> At the moment I am trying to understand and recast the work John
 >>>>> Stultz is doing.  I have some very real concerns here and there is
 >>>>> a patch by nacc@us.ibm.com<Nishanth Aravamudan> that, to me, looks
 >>>>> like a royal pain.  I think we need to understand it better and
 >>>>> voice our concerns in a way that is constructive. Nish's patch is
 >>>>> here: http://www.ussg.iu.edu/hypermail/linux/kernel/0504.3/1635.html
 >>
 >>
 >>
 >> I'm not on the HRT list, so I didn't get the original e-mail...keep me
 >> on the CC, please.
 >>
 >>
 >>>> Hey George,
 >>>> Hopefully our work isn't causing you too much trouble. I do understand
 >>>> the pain of having large conflicting works that try to achieve similar
 >>>> goals.
 >>>>
 >>>>
 >>>>
 >>>>> To summarize my concerns, the HRT code depends on correct and exact
 >>>>> registration of the timer interrupt with time of day.  (We don't
 >>>>> worry about latency here.) In the latest x86 HRT patch we fine tune
 >>>>> the system clock to ensure this alignment.  This allows us to set
 >>>>> up two stage timers that use the run_timers code for the coarse
 >>>>> time (to the time base resolution) followed by a short high res
 >>>>> hardware timer to get to the higher resolution we want.  If the
 >>>>> coarse timer (i.e. add_timer, run_timer and friends) do not
 >>>>> register correctly, we will have the case of a coarse timer
 >>>>> expiring after the needed and requested time.  Now, from a coarse
 >>>>> timer point of view, that may be fine, but it means we can not
 >>>>> depend on it for the high res first stage.
 >>>>
 >>>>
 >>>>
 >>>>
 >>>> The conceptual idea Nish is going after, is once we have a solid 
method
 >>>> to get a high-res timeofday (via the new monotonic_clock()
 >>>> function), we
 >>>> can use that instead of jiffies to expire soft-timers.
 >>>> This requires re-defining each timer-bucket entry length from a jiffy
 >>>> instead to a unit of time (Nish calls this a timerinterval unit). This
 >>>> timer-interval unit's length is flexible and can be defined at compile
 >>>> time.
 >>>> Currently Nish is using ~1ms (1<<20 nanoseconds) as the interval
 >>>> length,
 >>>> but it could be redefined to be as desired to give higher-res soft-
 >>>> timers.
 >>>
 >>>
 >>> The issue is not its length, but can we interrupt at its end.  The
 >>> key to high res is being on time.  If we want a 100usec resolution
 >>> timer, we need to have the low res timer hit its mark with something
 >>> real close to this.
 >>
 >>
 >>
 >> Are you asking: Can we interrupt at the end of the timerinterval? That
 >> doesn't matter to me so much, since I made the timerinterval length
 >> independent of the interrupt source. If you wanted to make sure they
 >> corresponded completely, you would simply need to change the
 >> nsecs_to_timerintervals() function appropriately. I was trying to stay
 >> somewhat efficient, especially since we're dealing with a 64-bit
 >> quantity, and thus used the shifting John mentioned above.
 >
 >
 > I think we are at too high a level to go into this in detail here, but
 > part of the HRT patch is a header called sc_math.h (Scaled math).  The
 > notion that you need to choose and hard code the multiplier is addressed
 > there.  The only real choice you need to make is the number of bits you
 > want to shift.  The more you shift the more exact the answer is.  For an
 > example of choosing the shift bits you may want to look at jiffies.h in
 > the current tree.
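For those catching up, the scaled-math trick George refers to can be sketched roughly like this (hypothetical names; the real sc_math.h API differs):

```c
#include <stdint.h>
#include <assert.h>

/* Illustrative scaled-integer division in the spirit of sc_math.h and
 * jiffies.h (names here are made up, not the actual kernel API).  To
 * divide by a constant D without a slow 64-bit divide, precompute
 * M = round_up(2^SHIFT / D) once, then x / D ~= (x * M) >> SHIFT.
 * The more bits you shift, the more exact the answer. */
#define SC_SHIFT 32

static uint64_t sc_mult(uint64_t divisor)
{
	/* rounded up so exact multiples of 'divisor' convert exactly */
	return (((uint64_t)1 << SC_SHIFT) + divisor - 1) / divisor;
}

static uint64_t sc_div(uint64_t x, uint64_t mult)
{
	/* one multiply and one shift instead of a divide */
	return (x * mult) >> SC_SHIFT;
}
```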
 >
 >>
 >> So, currently, my code does not provide for high-res timers,
 >> admittedly. It does enable some of the infrastructure that should make
 >> implementing high-res timing in mainline (turned on by tuning
 >> TIMERINTERVAL_BITS or setting a .config option, perhaps) more feasible,
 >> in my eyes. I think we have the same goal in mind -- make timing
 >> flexible and useful.
 >>
 >> To get 100 usec resolution with my system, you would need to make
 >> TIMERINTERVAL_BITS=16. Basically, this makes a timerinterval = 1
 >> nanosecond << 16, or 65.54 (about 65.5) microseconds. (Of course, you could
 >> not use shifting and just divide by 10. Might be slower, but is
 >> certainly feasible.) Admittedly, this is not perfect, but it's close.
 >> Now, this has nothing to do with the hardware. But, it defines the
 >> lowest interval, *regardless* of hardware, which the system will
 >> support. If you request a 100 microsecond delay, we will make it 131
 >> microseconds.
 >
 >
 > Leaving aside the precision issue (addressed by scaled math above), I
 > don't think this is the way to address high resolution.  For the most
 > part most user and kernel applications are quite happy with 10 ms
 > resolution and, I would argue, we should go back to the 100 HZ value.
 > IMNSHO high res should be asked for on a request at a time basis.  The
 > POSIX standard provided a rather nice way to do this in the CLOCKS &
 > TIMERS section of the standard.  In the kernel the low res part of this
 > is in kernel/posix-timers.c.  To extend to high res all one does is to
 > define new clocks.  In HRT we define two high res clocks
 > (CLOCK_REALTIME_HR and CLOCK_MONOTONIC_HR).  From the user point of
 > view, to get high res, he has to use one of these clocks.
 >
 > The trick is to muster the kernel resources to make the *_HR clock a
 > reality. In my HRT patches I have done this with a two phase approach.
 > I simply add a second, HR member to the timer structure.  This allows
 > the timer to carry the extra bits needed to express the added resolution
 > and also, if the bits are zero, to indicate that it is just a low res
 > timer.  Over the lifetime of the HRT patch I have worked with two
 > different ways of handling the HR bits in the timer list.  In all cases,
 > we stay with a periodic time tick and only incur extra overhead when the
 > timer is about to expire.  We ask the "arch" code to provide an
 > interrupt source with the needed short term resolution.  Since we only
 > use this timer at the end of the timer period it needs to cover only a
 > short time (0 to 1 jiffie), but should have good resolution over this
 > period.  Also, since it is a rather short time, we don't require it to
 > have long term stability.
 >
 > The point is, the standard, low res, timer code needs to be only
 > minimally aware of the HR bits.  On the other hand, since the HR timer
 > will be started on expiration of the low res timer, we need that
 > expiration to be aligned with the clock in a stable, predictable long
 > term way.  (For what it's worth, we have yet to address NTP clock
 > drifting in the HRT patch, so that is still a fly in this web of time.)
 >

You are still dependent on jiffies, which is very unreliable due to lost 
ticks, HZ vs ACTHZ, etc.  Even if you use a periodic tick, it seems the 
HRT code would still benefit from a human-time soft-timers sub-system 
that wasn't affected by such things.


 >>
 >> Now, the trick is then, to get the hardware to tick every 100
 >> microseconds. You have a lot more expertise in this area, and I would be
 >> interested to see what the actual options are. Presuming we are able to
 >> get a periodic interrupt at this rate, we could then do what I think HRT
 >> pretty much does, maintain the normal buckets for normal timers and then
 >> have a special HRT bucket. We could then be aware of when, in
 >> nanoseconds, the next HRT interrupt should be, set the hardware
 >> appropriately and be ok. If there are no HRT timers, then we'll never
 >> check that bucket, really (we could also just not configure in that
 >> option). I think it's feasible, regardless. There may be performance
 >> penalties, I agree. I'm working on figuring out how bad things get if we
 >> set TIMERINTERVAL_BITS=10, for instance (about every microsecond). If
 >> the system stays up, I'll be happy :)
 >
 >
 > I don't think we have hardware yet that can address 100usec ticks, nor,
 > as I argue above, do we need it.  In point of fact, we found it
 > necessary to address the user who, using the POSIX standard timer
 > interface asked for a repeating timer of, say 1 usec (yes we "say" we
 > support 1 usec resolution).  Such a request could easily consume the
 > machine (I call these timer storms), causing it never to be heard from
 > again.  The, I think, elegant solution to the timer storm problem is to
 > not restart the timer until the user picks up the prior expiration.
 > This dynamically adjusts the timer response to the amount of machine
 > available at the time.
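The throttle George describes could look roughly like this (a sketch with invented names, not the actual HRT code):

```c
#include <stdint.h>
#include <assert.h>

/* Sketch of the throttle described above: a periodic timer is NOT
 * rearmed from the expiry path.  It is rearmed only when user space
 * picks up the expiration, so a 1 usec period degrades gracefully
 * instead of consuming the machine. */
struct ptimer {
	uint64_t period_ns;
	uint64_t next_expire_ns;
	int pending;		/* expiry delivered, not yet consumed */
};

/* timer interrupt path: deliver the event, do not rearm */
static void ptimer_expire(struct ptimer *t)
{
	t->pending = 1;
}

/* user space picks up the expiry: only now is the timer rearmed */
static void ptimer_consume(struct ptimer *t, uint64_t now_ns)
{
	t->pending = 0;
	t->next_expire_ns = now_ns + t->period_ns;
}
```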
 >
 >>
 >>
 >>>> Now when the new code expires timers it does so against the timeofday
 >>>> subsystem's notion of time instead of jiffies. It simply goes through
 >>>> all the timer bucket entries between now and the last time we expired
 >>>> timers.
 >>>
 >>>
 >>> This is not unlike what happens now.  I would hope that the number of
 >>> buckets visited averages out to something real close to 1 per
 >>> run_timers entry.
 >>
 >>
 >>
 >> I agree, this is what we might be able to achieve by keeping high-res
 >> timers in their own bucket. That way, we could conceivably know when the
 >> next periodic interrupt is and when the next HRT interrupt is in
 >> absolute nanoseconds (based on do_monotonic_clock()).
 >
 >
 > For what it's worth, we have such a clock in HRT.  It is expressed in
 > jiffies and arch_cycles, but jiffies is what is used today as the
 > "standard" for internal system time.  This gets into the area of just
 > how to express time in areas where you need to quickly get and compare
 > the time.  After a lot of thought, I decided to use arch_cycles rather
 > than nanoseconds because arch_cycles are directly readable from the
 > hardware.  Any other unit of time requires a conversion, and, as you
 > know, usually requires 64-bits as well.  By staying with arch_cycles we
 > avoid "most" of the conversion overhead.
 >

I can see why you might want to use arch_cycles, and at least it is an 
absolute time.  Jiffies though is not.  Perhaps HRT could benefit from a 
conversion to "milliseconds + arch_cycles" ?

 >>
 >>
 >>>> Now an interesting point you bring up above is when to schedule timer
 >>>> interrupts. One could just have a fine-grained timer-interval unit and
 >>>> crank up HZ, but clearly we don't want to schedule interrupts to go
 >>>> off too frequently or the overhead won't be worth it. Instead, to get
 >>>> high-res timers, the idea is to look at the timer list and schedule
 >>>> interrupts appropriately. Then we only take an interrupt when there is
 >>>> work to do, rather than at regular periodic intervals when there may
 >>>> or may not be anything to expire.
 >>>
 >>>
 >>> This would suggest that all timers are high res.  I don't think this
 >>> makes sense:
 >>> 1) because most users just don't care that much about resolution and
 >>> high-res carries some overhead.
 >>> 2) Not all platform hardware is able to handle high res.
 >>
 >>
 >>
 >> I agree, this is an issue which my TIMERINTERVAL_BITS=10 test should
 >> also reveal. But under the normal case where TIMERINTERVAL_BITS are 20,
 >> then there will never be an interrupt scheduled to go off quicker than
 >> 2^20 nanoseconds in the future (about a millisecond). Everything hinges
 >> on these bits :) (mostly because they define how we convert
 >> nanoseconds to timerintervals).
 >>
 >>
 >>> I think high res timers should only be used when the user asks for
 >>> them.  This keeps the overhead under control.
 >>
 >>
 >>
 >> I agree -- the user asking for them, currently, would mean they set
 >> TIMERINTERVAL_BITS lower than normal and also set up the hardware to
 >> interrupt appropriately.
 >
 >
 > No, no, they ask by using a *_HR clock.
 >
 >>
 >>
 >>>> Now from my understanding of your framework, this sounds similar to
 >>>> how your high-res hardware timer interrupt is used. The idea being
 >>>> that we use that for all soft-timers, instead of having two levels of
 >>>> high-res and coarse soft-timers.
 >>>>
 >>>> The interesting problem that Darren will be looking at is how to
 >>>> schedule interrupts accurately. This appears to be something you've
 >>>> already attacked, so your experience would be helpful. My idea being
 >>>> that the timer_interrupt code checks when it intended to fire and when
 >>>> it actually did fire, and that information could then be used to tune
 >>>> the interrupt scheduling code.
 >>>
 >>>
 >>> A small, measurable latency is ok and need not be backed out by the
 >>> software. If you go this route you risk expiring a timer early and
 >>> the standard says bad things about this.
 >>
 >>
 >>
 >> I'm not sure what you mean about timers going off early. We round up on
 >> addition and round down on expiration to make sure that we don't get
 >> timers early. As long as do_monotonic_clock() is accurate, we should be
 >> ok. (Darren's part uses the timerintervals value to figure out when the
 >> next interrupt should be, similar to the existing
 >> next_timer_interrupt())
 >
 >
 > Your rounding up or down is not really different than what is done with
 > jiffies today...
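The round-up-on-add, round-down-on-expire rule being discussed amounts to something like the following (hypothetical helpers, assuming Nish's 2^TIMERINTERVAL_BITS nanosecond unit):

```c
#include <stdint.h>
#include <assert.h>

/* Hypothetical version of the conversion under discussion: round *up*
 * when a timer is added and truncate when expiring, so a timer can
 * never fire before its requested time.  One timerinterval is
 * 2^TIMERINTERVAL_BITS nanoseconds, per Nish's convention. */
#define TIMERINTERVAL_BITS 20	/* one unit = 2^20 ns, about 1 ms */

static uint64_t nsecs_to_intervals_up(uint64_t ns)
{
	return (ns + (1ULL << TIMERINTERVAL_BITS) - 1) >> TIMERINTERVAL_BITS;
}

static uint64_t nsecs_to_intervals_down(uint64_t ns)
{
	return ns >> TIMERINTERVAL_BITS;
}
```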
 >
 >>
 >>
 >>> What is missing in this is that the flagship arch (x86) has piss
 >>> poor capability to schedule timer interrupts.  I.e. there really is
 >>> no commonly available timer to generate interrupts at some arbitrary
 >>> time.  There is, however, the PIT, which, if you don't mess with it,
 >>> is very low overhead but periodic.
 >>
 >>
 >>
 >> And generally, we are going to stay periodic, for now. Darren's patch is
 >> a further extension on mine, which requires a lot of testing.
 >>
 >>
 >>> Then there is the issue of what the time standard is.  In the x86,
 >>> the PIT is it.  The pm_timer is close, but the read problem adds so
 >>> much overhead as to make it hard to justify.  Possibly the HPET does
 >>> a better job, but it is still an I/O access (read SLOW) and is not
 >>> commonly available as yet.
 >>>
 >>> You also have to do something reasonable for the accounting
 >>> subsystems. Currently this is done by sampling the system at a
 >>> periodic rate.  Whatever is running gets charged with the last
 >>> period.  If you go non-periodic, either you have to charge a variable
 >>> period when the sample is taken or you have to set up timers as part
 >>> of context switching.  This latter is not wise as the context switch
 >>> overhead then gets larger.  It is rather easy to show that the
 >>> accounting overhead in such a system with a modest load is higher
 >>> than in a periodic sampled system.  This system is also charged with
 >>> doing the time slicing.  Without a periodic tick, a timer needs to
 >>> be set up each context switch.
 >>
 >>
 >>
 >> We will always (at least in our current concept) have an idea of when,
 >> worst/best case, the next timer interrupt will be (a maximum, predefined
 >> interval). This would also define how long the system could go
 >> completely idle without expiring any timers.
 >
 >
 > I think we should leave the skipping of timer interrupts to VST code.
 > The VST code only needs to figure out when the next timer is to expire
 > when the cpu is idle, i.e. it has a time to spare to look up that timer.
 >
 >>
 >>
 >>> In addition to all of this, there is the issue of how to organize the
 >>> timer list so that you can find the next timer.  With the current
 >>> organization this is something you just don't want to do.  On the
 >>> other hand, I am convinced that, for periodic timers, it is about as
 >>> good as it gets.
 >>
 >>
 >>
 >> I think currently, we do something similar to next_timer_interrupt(),
 >> making sure we update the stored value whenever we add a new timer, if
 >> necessary. Except now, we will determine the next time to set the hw
 >> timer to go off every time we expire timers.
 >
 >
 > Finding the next timer is a high overhead operation....

Perhaps it is now, but does it have to be?  The process scheduler used 
to be full of O(N) algorithms too, but they were all converted to O(1) by 
doing several incremental calculations instead of one huge one.  Perhaps 
something similar could be done with timers by caching a couple of 
timer expirations and modifying them as timers are added, removed, and 
modified.  Just brainstorming here.
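To make the brainstorm concrete, something along these lines would keep "when is the next timer?" O(1) on the add path (names invented, sketch only):

```c
#include <stdint.h>
#include <assert.h>

/* Sketch of the incremental idea above: cache the earliest expiry and
 * maintain it on every add, so finding the next timer is O(1) in the
 * common case.  (Deleting the earliest timer would force a rescan;
 * that path is omitted here.) */
struct simple_timer {
	uint64_t expires;
	struct simple_timer *next;
};

struct timer_base {
	struct simple_timer *head;	/* unsorted list of timers */
	uint64_t next_expiry;		/* cached minimum, UINT64_MAX if empty */
};

static void cheap_add_timer(struct timer_base *b, struct simple_timer *t)
{
	t->next = b->head;
	b->head = t;
	if (t->expires < b->next_expiry)
		b->next_expiry = t->expires;	/* O(1) incremental update */
}
```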

 >
 >>
 >> I know my proposal is a big change and does make things more difficult
 >> in the short-term for the HRT patch. But I think it is possible to work
 >> together and get a very reasonable system in place, which allows for
 >> both HRT and normal timers to work effectively. I may have made things
 >> worse in my response, but I think the discussion has to be had :)
 >>
 >
 >

--Darren


* Re: Help with the high res timers
  2005-05-04 17:51               ` Nishanth Aravamudan
@ 2005-05-04 18:16                 ` Chris Friesen
  2005-05-04 20:36                   ` George Anzinger
  0 siblings, 1 reply; 12+ messages in thread
From: Chris Friesen @ 2005-05-04 18:16 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: george, john stultz, Liu Qi,
	'high-res-timers-discourse@lists.sourceforge.net',
	Darren Hart, linux-kernel@vger.kernel.org

Nishanth Aravamudan wrote:

> If I understand your point correctly, I think this is achieved by
> TIMERINTERVAL_BITS in my patch (not to claim my patch is functional, but
> conceptually). No matter what you actually request, the best you can do
> is 2^TIMERINTERVAL_BITS nanoseconds, and usually worse because the
> tick-rate and timerinterval length do not necessarily line up.

My point is simply that the timer for the next interval should start at 
the time the timer expires, not the time that userspace picks up the 
prior expiration.  Throttling the timer rate should be done at the time 
of timer request rather than timer expiry.

If I have usec-accuracy in the timer subsystem, I should be able to set 
a timer with an interval of 9.999ms and have it remain accurate over 
time (subject to scheduler jitter, of course).  N timer intervals later 
my timer should expire at (original_time + N*9.999ms + jitter).  In this 
case the error is roughly constant with time.

If the timer doesn't start counting the next interval until the user 
detects expiry, I'm going to get some non-zero addition to *each* 
interval such that my timers will not remain accurate over long periods 
of time.  In this case N timer intervals later my timer will expire at 
(original_time + N*(9.999ms + jitter)) which is a very different thing. 
  Since jitter will always be positive, the error increases with time.
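Writing the two expiry formulas out directly (illustrative arithmetic only):

```c
#include <assert.h>

/* The two expiry formulas from the argument above.  With an absolute
 * rearm the N-th expiry carries at most one jitter's worth of error;
 * with a rearm-on-pickup scheme the jitter compounds, so the error
 * grows linearly in N. */
static long long expire_absolute(long long t0, long long n,
				 long long period, long long jitter)
{
	return t0 + n * period + jitter;
}

static long long expire_on_pickup(long long t0, long long n,
				  long long period, long long jitter)
{
	return t0 + n * (period + jitter);
}
```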

Chris


* Re: Help with the high res timers
  2005-05-04 17:44             ` Chris Friesen
  2005-05-04 17:51               ` Nishanth Aravamudan
@ 2005-05-04 20:31               ` George Anzinger
  1 sibling, 0 replies; 12+ messages in thread
From: George Anzinger @ 2005-05-04 20:31 UTC (permalink / raw)
  To: Chris Friesen
  Cc: Nishanth Aravamudan, john stultz, Liu Qi,
	'high-res-timers-discourse@lists.sourceforge.net',
	Darren Hart, linux-kernel@vger.kernel.org

Chris Friesen wrote:
> George Anzinger wrote:
> 
>> The, I think, elegant solution to the timer storm problem is to not 
>> restart the timer until the user picks up the prior expiration.  This 
>> dynamically adjusts the timer response to the amount of machine 
>> available at the time.
> 
> 
> The disadvantage is that you then lose accuracy since each timer 
> interval is increased by some random amount based on system scheduling. 
>  What about some kind of ulimit-type thing to specify the minimum 
> recurring interval that can be specified?  If root so specifies, you 
> could have 1usec interval timers and the system would hang.  This is 
> conceptually no different than busy-looping in a SCHED_FIFO task.

The standard comes to the rescue here.  The standard defines timer_getoverrun() 
which returns the number of additional timeouts you _would_ have seen if you had 
been fast enough.

I tried a limit thing.  It is MUCH too fragile for the real world.
> 
-- 
George Anzinger   george@mvista.com
High-res-timers:  http://sourceforge.net/projects/high-res-timers/


* Re: Help with the high res timers
  2005-05-04 18:16                 ` Chris Friesen
@ 2005-05-04 20:36                   ` George Anzinger
  0 siblings, 0 replies; 12+ messages in thread
From: George Anzinger @ 2005-05-04 20:36 UTC (permalink / raw)
  To: Chris Friesen
  Cc: Nishanth Aravamudan, john stultz, Liu Qi,
	'high-res-timers-discourse@lists.sourceforge.net',
	Darren Hart, linux-kernel@vger.kernel.org

Chris Friesen wrote:
> Nishanth Aravamudan wrote:
> 
>> If I understand your point correctly, I think this is achieved by
>> TIMERINTERVAL_BITS in my patch (not to claim my patch is functional, but
>> conceptually). No matter what you actually request, the best you can do
>> is 2^TIMERINTERVAL_BITS nanoseconds, and usually worse because the
>> tick-rate and timerinterval length do not necessarily line up.
> 
> 
> My point is simply that the timer for the next interval should start at 
> the time the timer expires, not the time that userspace picks up the 
> prior expiration.  Throttling the timer rate should be done at the time 
> of timer request rather than timer expiry.
> 
> If I have usec-accuracy in the timer subsystem, I should be able to set 
> a timer with an interval of 9.999ms and have it remain accurate over 
> time (subject to scheduler jitter, of course).  N timer intervals later 
> my timer should expire at (original_time + N*9.999ms + jitter).  In this 
> case the error is roughly constant with time.
> 
> If the timer doesn't start counting the next interval until the user 
> detects expiry, I'm going to get some non-zero addition to *each* 
> interval such that my timers will not remain accurate over long periods 
> of time.  In this case N timer intervals later my timer will expire at 
> (original_time + N*(9.999ms + jitter)) which is a very different thing. 
>  Since jitter will always be positive, the error increases with time.

You can, and we do in the main line kernel, keep this accuracy even with missed 
timer expiries.  When a timer is restarted, the actual requested expire time is 
recalculated.  It is adjusted ONLY by the requested repeat time until it results 
in a time that has yet to come and the number of repeat times it takes is logged 
in the overrun count for user examination.
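In sketch form, the rule is (simplified, not the exact kernel/posix-timers.c code):

```c
#include <stdint.h>
#include <assert.h>

/* Sketch of the rearm rule described above: advance the absolute
 * expiry ONLY by whole requested intervals until it lands in the
 * future, counting how many expirations were skipped so the count can
 * be reported via timer_getoverrun(). */
static int rearm_timer(uint64_t *expire, uint64_t interval, uint64_t now)
{
	int overrun = -1;	/* the first advance is the normal rearm */

	do {
		*expire += interval;
		overrun++;
	} while (*expire <= now);

	return overrun;
}
```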
> 
> Chris

-- 
George Anzinger   george@mvista.com
High-res-timers:  http://sourceforge.net/projects/high-res-timers/


* Re: Help with the high res timers
  2005-05-04 18:14             ` Darren Hart
@ 2005-05-04 21:46               ` George Anzinger
  0 siblings, 0 replies; 12+ messages in thread
From: George Anzinger @ 2005-05-04 21:46 UTC (permalink / raw)
  To: Darren Hart
  Cc: Nishanth Aravamudan, john stultz, Liu Qi,
	'high-res-timers-discourse@lists.sourceforge.net',
	linux-kernel@vger.kernel.org

Darren Hart wrote:
> George Anzinger wrote:
>>  |First I want to express my joy that we are discussing these issues at
>>  some depth.
>>
~
>>  The point is, the standard, low res, timer code needs to be only
>>  minimally aware of the HR bits.  On the other hand, since the HR timer
>>  will be started on expiration of the low res timer, we need that
>>  expiration to be aligned with the clock in a stable, predictable long
>>  term way.  (For what its worth, we have yet to address NTP clock
>>  drifting in the HRT patch, so that is still a fly in this web of time.)
>>
> 
> You are still dependent on jiffies, which is very unreliable due to lost 
> ticks, HZ vs ACTHZ, etc.  Even if you use a periodic tick, it seems the 
> HRT code would still benefit from a human-time soft-timers sub-system 
> that wasn't affected by such things.

Well, first we clean up jiffies WRT lost ticks.  We just don't have them with 
the HRT patch.  If you mean being late, just as John's code gives the right time 
all the time, so does the HRT time code.  The notion is that the TSC or the 
pm_counter can carry us over a couple of lost ticks as well as the between tick 
times.  So, we aren't affected by lost ticks and ACTHZ just is not part of the 
POSIX clocks & timers world.
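Roughly (illustrative units and names, not the HRT code itself):

```c
#include <stdint.h>
#include <assert.h>

/* Illustration of why a free-running counter covers lost ticks: the
 * monotonic time is tick count * tick length plus the cycles elapsed
 * since the last *accounted* tick, so a missed tick interrupt just
 * shows up as a larger cycle count, not as lost time.  (Made-up
 * helper; the HRT code works in jiffies + arch_cycles.) */
static uint64_t mono_ns(uint64_t ticks, uint64_t tick_len_ns,
			uint64_t cycles_since_tick, uint64_t ns_per_cycle)
{
	return ticks * tick_len_ns + cycles_since_tick * ns_per_cycle;
}
```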
> 
> 
>> >
~
>>  For what its worth, we have such a clock in HRT.  It is expressed in
>>  jiffies and arch_cycles, but jiffies is what is used today as the
>>  "standard" for internal system time.  This gets into the area of just
>>  how to express time in areas where you need to quickly get and compare
>>  the time.  After a lot of thought, I decided to use arch_cycles rather
>>  than nanoseconds because arch_cycles are directly readable from the
>>  hardware.  Any other unit of time requires a conversion, and, as you
>>  know, usually requires 64-bits as well.  By staying with arch_cycles we
>>  avoid "most" of the conversion overhead.
>>
> 
> I can see why you might want to use arch_cycles, and at least it is an 
> absolute time.  Jiffies though is not.  Perhaps HRT could benefit from a 
> conversion to "milliseconds + arch_cycles" ?

Two points here, first, jiffies, once you convince yourself that the PIT is the 
gold time standard, IS the time.  It is the best time we have in the x86 world 
and in all other archs too.  That is what the arch time code is supposed to do. 
  Your milliseconds is, or should be, derived from jiffies or, in other archs, 
the same place jiffies is derived from, i.e. the time keeping "rock" on the 
motherboard.  To account for things like NTP, we always grab arch_cycles, 
xtime, jiffies, and wall_to_monotonic under a read lock at the same time.  This 
allows us to peg one expression of time against another.

All that said, for HRT to work correctly, the timer must have a high resolution 
time contained in its structure.  Currently this is jiffies and arch_cycles. 
Further, we need to know that when that jiffies value arrives, we will get 
called ASAP, and not at some random time between that value and the next time it 
changes.  For repeating timers we use these values _after_ the timer expires to 
figure the next expire time, so if they are changed by the timer code, we are in 
trouble.
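In outline, the decomposition looks like this (field names illustrative, not the HRT structure verbatim):

```c
#include <assert.h>

/* Outline of the decomposed expiry described above: the jiffie part
 * rides in the normal timer list, and a nonzero sub-jiffie part marks
 * the timer as high res.  Both values must survive expiry so a
 * repeating timer can compute its next absolute expire time from
 * them. */
struct hr_timer {
	unsigned long expires;		/* jiffies: normal timer list */
	unsigned long sub_expires;	/* arch_cycles past 'expires' */
};

static int timer_is_high_res(const struct hr_timer *t)
{
	return t->sub_expires != 0;	/* zero means plain low-res */
}
```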
> 
>> >
>> >
~
>> > I think currently, we do something similar to next_timer_interrupt(),
>> > making sure we update the stored value whenever we add a new timer, if
>> > necessary. Except now, we will determine the next time to set the hw
>> > timer to go off every time we expire timers.
>>
>>
>>  Finding the next timer is a high overhead operation....
> 
> Perhaps it is now, but does it have to be?  The process scheduler used 
> to be full of O(N) algorithms to, but they were all converted to O(1) by 
> doing several incremental calculations instead of one huge one.  Perhaps 
> something similar could be done with timers by caching a couple of 
> timer expirations and modifying them as timers are added, removed, and 
> modified.  Just brainstorming here.

Believe me I did a lot of "brainstorming" around how to keep HR timers along 
with the LR timers.  Until just recently I had a different timer list in timer.c 
which was hashed on the low timer bits.  After much thought and finally 
realizing that MOST timers do NOT expire, I came to conclude that the current 
list is just right given what it is currently used for.  The stuff that is not 
O(1) in this structure is the cascading.  The reason this is not a big issue is 
just that most timers are removed prior to the cascade.

In the list structure I used to use, I had to order timers by expire time, 
accounting for both jiffies and arch_cycles.  Now HRT uses the same old cascade 
structure and the timer code, for the most part, ignores the arch_cycles member. 
  Only on expire is it discovered and the timer sent to the hrtime code where 
the timer is entered in to a very short list of timers that are to expire some 
time between now and the next jiffie.  This is a very simple O(N/2) list, but it 
is also really short.  (This is a simplified discussion.  The code is on the HRT 
web site and in the HRT CVS on that site for the curious.)
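The short-list insert itself is the obvious one (sketch with invented names):

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the short sub-jiffie list: timers whose jiffie part has
 * expired are kept sorted by their arch_cycle remainder.  Linear
 * insertion is O(N), but N is only the handful of high res timers due
 * before the next jiffie, so in practice the list is tiny. */
struct hrt_entry {
	unsigned long arch_cycle;	/* sub-jiffie expire offset */
	struct hrt_entry *next;
};

static void hrt_insert(struct hrt_entry **head, struct hrt_entry *t)
{
	while (*head && (*head)->arch_cycle <= t->arch_cycle)
		head = &(*head)->next;
	t->next = *head;
	*head = t;
}
```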
> 
>>
>> >
~
-- 
George Anzinger   george@mvista.com
High-res-timers:  http://sourceforge.net/projects/high-res-timers/


* Re: Help with the high res timers
  2005-05-04 17:46           ` George Anzinger
@ 2005-05-04 22:13             ` john stultz
  2005-05-04 22:48               ` George Anzinger
  0 siblings, 1 reply; 12+ messages in thread
From: john stultz @ 2005-05-04 22:13 UTC (permalink / raw)
  To: George Anzinger
  Cc: Liu Qi, 'high-res-timers-discourse@lists.sourceforge.net',
	Nishanth Aravamudan, Darren Hart, linux-kernel@vger.kernel.org

On Wed, 2005-05-04 at 10:46 -0700, George Anzinger wrote:
> > Well, as long as the HZ period is close to the timer-interval unit
> > length, this is true. However if the timer-interval unit is smaller,
> > multiple bucket entries would be expired. The performance considerations
> > here are being looked at and this may be an area where the concepts in
> > HRT might help (having a HRT specific sub-bucket).
> 
> This is where we get in trouble with HR timers.  For a HR timer, we need to know 
> how to get a timer to expire (i.e. appear in the call back) at a well defined 
> and presise time (leaving aside latency issues).  The above discription allows 
> timers to be put in buckets without (as near as I can tell) making transparent 
> exactly when the bucket will be emptied, only saying that it will be after the 
> latest timer in the bucket is due.

<snip>

> 
> Think of it this way.  Decompose a HR timer into coarse and fine units (you 
> choose, but here let's say jiffies and nanoseconds).  Now we want the normal timer 
> system to handle the jiffies part of the time and to turn the timer over to the 
> HR timer code to take care of the nanosecond remainder.  If the jiffie part is 
> late, depending on the nanosecond part, it could make the timer late (i.e. for 
> low values of the nanosecond part).  For high values of the nanosecond part, we 
> can compensate...
> 
> This decomposition makes a lot of sense, by the way, for, at least, the 
> following reasons:
> 1) it keeps most of the HR issues out of the normal timer code,
> 2) it keeps high res and low res timers in the correct time order, i.e. a low res 
> timer for jiffie X will expire prior to a high res timer for jiffie X + Y 
> nanoseconds.
> 3) handling the high res timer list is made vastly easier as it will only need 
> to have a rather small number of timers in it at any given time (i.e. those that 
> are to expire prior to the next coarse timer tick).


Hmmm. Ok, I think I see what you're getting at. Let me know where I go
wrong:

1. Since HR soft-timers are a special case, their absolute nanosecond
expire values (exp_ns) are decomposed into a low-res portion and a high-
res portion. In your case it is units of jiffies (exp_jf) and
arch_cycles (exp_ac) respectively.

2. Since jiffies units map directly to a periodic tick, one can set a
regular soft-timer to expire at exp_jf. Then when it is expired, it is
added to a separate HR-timers list to expire in exp_ac arch_cycles
units.

3. With the new-timeofday rework and Nish's soft-timers code, the soft-
timers bucket entries map to actual nanosecond time values, rather than
ticks. This causes a problem with your two level (regular periodic and
hr-timer) timer-lists because since entries don't strictly map to
interrupts, you don't know how to decompose the nanosecond expiration into
low-res and high-res portions.


Here is my possible solution: Still keeping Nish's soft-timer rework
where we use nanoseconds instead of ticks or jiffies, provide an
expected interrupt-period value, which gives you the maximum interval
length between ticks. Thus you schedule a regular-low-res timer for the
nanosecond value (exp_ns - expected_interrupt_period). When that timer
expires, you move the timer to the HR timer list and schedule an
interrupt for the remaining time.
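As rough C, the split would be (names invented for illustration):

```c
#include <stdint.h>
#include <assert.h>

/* Rough shape of the proposal above: back the coarse soft-timer off by
 * one worst-case tick period, then when it fires, program the one-shot
 * high-res hardware for whatever time remains. */
static uint64_t coarse_expire(uint64_t exp_ns, uint64_t max_tick_ns)
{
	/* schedule the low-res stage early by one maximum tick period */
	return exp_ns > max_tick_ns ? exp_ns - max_tick_ns : 0;
}

static uint64_t hr_remainder(uint64_t exp_ns, uint64_t now_ns)
{
	/* called when the coarse stage fires: time left to program into
	 * the high-res hardware timer */
	return exp_ns > now_ns ? exp_ns - now_ns : 0;
}
```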

Let me know how that sounds.

thanks
-john



* Re: Help with the high res timers
  2005-05-04 22:13             ` john stultz
@ 2005-05-04 22:48               ` George Anzinger
  2005-05-04 23:27                 ` john stultz
  0 siblings, 1 reply; 12+ messages in thread
From: George Anzinger @ 2005-05-04 22:48 UTC (permalink / raw)
  To: john stultz
  Cc: Liu Qi, 'high-res-timers-discourse@lists.sourceforge.net',
	Nishanth Aravamudan, Darren Hart, linux-kernel@vger.kernel.org

john stultz wrote:
> On Wed, 2005-05-04 at 10:46 -0700, George Anzinger wrote:
> 
>>>Well, as long as the HZ period is close to the timer-interval unit
>>>length, this is true. However if the timer-interval unit is smaller,
>>>multiple bucket entries would be expired. The performance considerations
>>>here are being looked at and this may be an area where the concepts in
>>>HRT might help (having a HRT specific sub-bucket).
>>
>>This is where we get in trouble with HR timers.  For a HR timer, we need to know 
>>how to get a timer to expire (i.e. appear in the call back) at a well defined 
>>and precise time (leaving aside latency issues).  The above description allows 
>>timers to be put in buckets without (as near as I can tell) making transparent 
>>exactly when the bucket will be emptied, only saying that it will be after the 
>>latest timer in the bucket is due.
> 
> 
> <snip>
> 
>>Think of it this way.  Decompose a HR timer into coarse and fine units (you 
>>choose, but here let say jiffies and nanoseconds).  Now we want the normal timer 
>>system to handle the jiffies part of the time and to turn the timer over to the 
>>HR timer code to take care of the nanosecond remainder.  If the jiffie part is 
>>late, depending on the nanosecond part, it could make the timer late (i.e. for 
>>low values of the nanosecond part).  For high values of the nanosecond part, we 
>>can compensate...
>>
>>This decomposition makes a lot of sense, by the way, for, at least, the 
>>following reasons:
>>1) it keeps most of the HR issues out of the normal timer code,
>>2) it keeps high res and low res timers in the correct time order, i.e. a low res 
>>timer for jiffie X will expire prior to a high res timer for jiffie X + Y 
>>nanoseconds.
>>3) handling the high res timer list is made vastly easier as it will only need 
>>to have a rather small number of timers in it at any given time (i.e. those that 
>>are to expire prior to the next coarse timer tick).
> 
> 
> 
> Hmmm. Ok, I think I see what you're getting at. Let me know where I go
> wrong:
> 
> 1. Since HR soft-timers are a special case, their absolute nanosecond
> expire values (exp_ns) are decomposed into a low-res portion and a high-
> res portion. In your case it is units of jiffies (exp_jf) and
> arch_cycles (exp_ac) respectively.
> 
> 2. Since jiffies units map directly to a periodic tick, one can set a
> regular soft-timer to expire at exp_jf. Then when it is expired, it is
> added to a separate HR-timers list to expire in exp_ac arch_cycles
> units.
> 
> 3. With the new-timeofday rework and Nish's soft-timers code, the soft-
> timers bucket entries map to actual nanosecond time values, rather than
> ticks. This causes a problem with your two-level (regular periodic and
> hr-timer) timer-lists because, since entries don't strictly map to
> interrupts, you don't know how to decompose the nanosecond expiration into
> low-res and high-res portions.
> 
> 
> Here is my possible solution: Still keeping Nish's soft-timer rework
> where we use nanoseconds instead of ticks or jiffies, provide an
> expected interrupt-period value, which gives you the maximum interval
> length between ticks. Thus you schedule a regular-low-res timer for the
> nanosecond value (exp_ns - expected_interrupt_period). When that timer
> expires, you move the timer to the HR timer list and schedule an
> interrupt for the remaining time.
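
For concreteness, the scheme above might be sketched roughly as follows.  All 
names here (EXPECTED_INTERRUPT_PERIOD_NS and the two helpers) are hypothetical, 
chosen for illustration; this is not code from any posted patch:

```c
#include <assert.h>

/* Illustrative sketch: with nanosecond-based soft-timer buckets there
 * is no fixed tick to decompose against, so arm the ordinary soft-timer
 * one maximum tick interval early, then program a one-shot interrupt
 * for whatever time remains when it actually fires. */

#define EXPECTED_INTERRUPT_PERIOD_NS 1000000ULL	/* assumed max gap between ticks */

/* Absolute nanosecond value at which to arm the regular low-res timer. */
static unsigned long long lowres_arm_time(unsigned long long exp_ns)
{
	if (exp_ns <= EXPECTED_INTERRUPT_PERIOD_NS)
		return 0;	/* already within one period: expire as soon as possible */
	return exp_ns - EXPECTED_INTERRUPT_PERIOD_NS;
}

/* Delay to program on the HR list when the low-res timer fires at
 * now_ns; the soft-timer may run late, so clamp at zero. */
static unsigned long long hr_remainder(unsigned long long exp_ns,
				       unsigned long long now_ns)
{
	return now_ns >= exp_ns ? 0 : exp_ns - now_ns;
}
```

Arming one period early guarantees the low-res timer fires no later than 
exp_ns, so the HR remainder is nonnegative in the normal case.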

Oh, I have been thinking along these same lines.  I don't recall at the moment, 
but I depend on the timer retaining the absolute expire time as I set it.  This 
is used both by the secondary timer, and by the rearm code.  I need to look more 
closely at that.  But this is picking things apart at a very low level and does 
not, I think, address timer ordering to the point where I feel completely safe. 
  I.e. will such a scheme allow a LR timer for time X to expire after a HR timer 
for time X+y?
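
To make the jiffie/nanosecond decomposition and the ordering worry concrete, 
a rough sketch (NSEC_PER_JIFFIE, the struct, and the function names are all 
hypothetical, picked for illustration only):

```c
#include <assert.h>

/* Illustrative only: split an absolute nanosecond expiry into the
 * coarse jiffie part (handled by the normal timer wheel) and the
 * sub-jiffie nanosecond remainder (handled by the HR list). */

#define NSEC_PER_JIFFIE 1000000ULL	/* assumes HZ = 1000 */

struct hr_expiry {
	unsigned long long jf;	/* coarse part, in jiffies */
	unsigned long long ns;	/* fine part, ns remainder within the jiffie */
};

static struct hr_expiry decompose(unsigned long long abs_ns)
{
	struct hr_expiry e;
	e.jf = abs_ns / NSEC_PER_JIFFIE;
	e.ns = abs_ns % NSEC_PER_JIFFIE;
	return e;
}

/* The ordering property reduces to a lexicographic compare: a LR
 * timer for jiffie X (remainder 0) sorts before any HR timer for
 * jiffie X plus a nonzero remainder. */
static int expires_before(struct hr_expiry a, struct hr_expiry b)
{
	return a.jf < b.jf || (a.jf == b.jf && a.ns < b.ns);
}
```

Under this decomposition the ordering holds as long as both timers are 
dispatched from the same coarse tick; it is the late-tick case that remains 
the open question.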
> 
> Let me know how that sounds.
> 
> thanks
> -john
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

-- 
George Anzinger   george@mvista.com
High-res-timers:  http://sourceforge.net/projects/high-res-timers/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Help with the high res timers
  2005-05-04 22:48               ` George Anzinger
@ 2005-05-04 23:27                 ` john stultz
  0 siblings, 0 replies; 12+ messages in thread
From: john stultz @ 2005-05-04 23:27 UTC (permalink / raw)
  To: George Anzinger
  Cc: Liu Qi, 'high-res-timers-discourse@lists.sourceforge.net',
	Nishanth Aravamudan, Darren Hart, linux-kernel@vger.kernel.org

On Wed, 2005-05-04 at 15:48 -0700, George Anzinger wrote:
> john stultz wrote:
> > Here is my possible solution: Still keeping Nish's soft-timer rework
> > where we use nanoseconds instead of ticks or jiffies, provide an
> > expected interrupt-period value, which gives you the maximum interval
> > length between ticks. Thus you schedule a regular-low-res timer for the
> > nanosecond value (exp_ns - expected_interrupt_period). When that timer
> > expires, you move the timer to the HR timer list and schedule an
> > interrupt for the remaining time.
> 
> Oh, I have been thinking along these same lines.  I don't recall at the moment, 
> but I depend on the timer retaining the absolute expire time as I set it.  This 
> is used both by the secondary timer, and by the rearm code.  I need to look more 
> closely at that.  

Glad we're on similar wavelengths. :)

> But this is picking things apart at a very low level and does 
> not, I think, address timer ordering to the point where I feel completely safe. 
>   I.e. will such a scheme allow a LR timer for time X to expire after a HR timer 
> for time X+y?

Hmm. That sounds like an interesting problem as well, although I'm not
familiar enough with it to make a reasonable comment. Let me ponder it
for awhile and see what can be done.

thanks
-john




^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2005-05-04 23:27 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <E1DSl7F-0002v2-Ck@sc8-sf-web4.sourceforge.net>
     [not found] ` <20050503024336.GA4023@ict.ac.cn>
     [not found]   ` <4277EEF7.8010609@mvista.com>
     [not found]     ` <1115158804.13738.56.camel@cog.beaverton.ibm.com>
     [not found]       ` <427805F8.7000309@mvista.com>
     [not found]         ` <20050504001307.GF3372@us.ibm.com>
2005-05-04 17:10           ` Help with the high res timers George Anzinger
2005-05-04 17:44             ` Chris Friesen
2005-05-04 17:51               ` Nishanth Aravamudan
2005-05-04 18:16                 ` Chris Friesen
2005-05-04 20:36                   ` George Anzinger
2005-05-04 20:31               ` George Anzinger
2005-05-04 18:14             ` Darren Hart
2005-05-04 21:46               ` George Anzinger
     [not found]         ` <1115166592.13738.96.camel@cog.beaverton.ibm.com>
2005-05-04 17:46           ` George Anzinger
2005-05-04 22:13             ` john stultz
2005-05-04 22:48               ` George Anzinger
2005-05-04 23:27                 ` john stultz
