Re: 2.6.13-rt6, ktimer subsystem

All of lore.kernel.org
 help / color / mirror / Atom feed

From: George Anzinger <george@mvista.com>
To: tglx@linutronix.de
Cc: john stultz <johnstul@us.ibm.com>,
	Nish Aravamudan <nish.aravamudan@gmail.com>,
	Ingo Molnar <mingo@elte.hu>,
	linux-kernel@vger.kernel.org,
	Steven Rostedt <rostedt@goodmis.org>,
	dwalker@mvista.com,
	"'high-res-timers-discourse@lists.sourceforge.net'" 
	<high-res-timers-discourse@lists.sourceforge.net>
Subject: Re: 2.6.13-rt6, ktimer subsystem
Date: Thu, 15 Sep 2005 15:35:31 -0700	[thread overview]
Message-ID: <4329F733.2090604@mvista.com> (raw)
In-Reply-To: <1126751140.6509.474.camel@tglx.tec.linutronix.de>

Thomas Gleixner wrote:
> On Wed, 2005-09-14 at 12:38 -0700, George Anzinger wrote:
> 
> 
>>Well, having spent a bit of time looking at the code it appears that a 
>>lot of the ideas we looked at and discarded (see 
>>high-res-timers-discourse@lists.sourceforge.net) are in this.  Shame it 
>>was all done with out reference or comment to that list, anyone on it or 
>>even the lkml.
> 
> 
> Well, I'm considering to wear sackcloth and ashes. But this seems like
> the pot calling the kettle back as I don't remember a single relevant
> reference/comment from you on the UTIME/KURT mailing list after your
> UTIME->HRT fork.

Hm...All the work has been on HRT mailing list.  In the early days the 
UTIME/KURT folks were on the list, still may be for all I know.

Still, as much as sackcloth and ashes might become you or me, I would 
rather try to learn how to be more open about what we are doing so that 
what results is the best possible for the linux community.
> 
> 
>>A few of the top issues:
>>
>>time in nanoseconds 64-bits, requires a divide to do much of anything 
>>with it.  Divides are slow and should be avoided if possible.  
> 
> 
> The divides are rare and definitely not in the hot paths. I'm sure that
> they can be replaced by some intelligent scaled math algorithm if it
> turns out to be necessary. The hot path instructions are simple
> add/sub/cmp which are less/equal expensive on a 32bit machine to an
> operation on struct timespec or an jiffies/arch_cycles pair. The non
> nsec based implementation gives a burden to 64bit machines and is
> provable wrong in the aspect of summing rounding errors of interval
> timers. 

With out arguing with your "provable wrong", I will admit that the 
64-bit LOOKS good and easy.  At the same time I think we can abstract 
all the required actions on the "nstime" thing and set it up so we can 
use either a timespec or a s64 depending on the machine (or some such). 
  I have such a patch nearly ready...
> 
> 
~
> 
>>The rbtree is a high overhead tree.  I suspect performance problems 
>>here.  
> 
> 
> 1. rbtree is available out of the box

True.
> 
> 2. rbtree is proven to be efficient - at least there are a couple of
> performance relevant users relying on it e.g. mm, ext3
> 
> 3. I did insertion/removal tests with 10k entries (<2us on a 1GHz box in
> user space). This is way below the experienced and reproducible trouble
> of recascading. The penalty is completely on the thread which owns the
> timer for non signal related functions. For signal related functions
> only the removal on expiry is a penalty for the complete system (in the
> softirq)

Well, I never was a cascade fan.

In the early HRT patches the whole timer list was replaced with a hashed 
list.  It was O(N/M) on insertion where we could easily choose M (for a 
while it was even a config option).  Removal was just an unlink, same as 
the cascade list.

To be clear on my take on this, as I understand it the rblist is 
something like O(N*somelog 2).  What is left out here is the fixed 
overhead of F which is there even if N = 1.  So we have something like 
(F+O(f(N)) for a list.  For the most part we don't look at F, but as 
list complexity grows, it gets larger thus pushing out the cross over 
point to a higher "N" when comparing two lists.  I considered the rbtree 
when doing the secondary list for HRT and concluded that "N" was small 
enough that a simple O(N/2) list would do just fine as it would only 
contain timers set to expire in the next jiffie.

I don't think an O(N/2) list is right for all timers (as in the 
ktimers), but I still wonder if there isn't something better than and 
rblist.  Note, "wonder".  I don't have such a list structure in hand.
> 
> The cascading is a penalty for the complete system all the time.
> 
> Performance is a straw man argument here. You know very well that > 90%
> of the timers are inaccurate "timeout" timers related to I/O,
> networking, devices. Most of those never expire (the positive feedback
> removes the timer before expiry) and those timers have no constraint to
> be accurate, except for the fact that they have to detect an
> device/network problem at some time. In this case it is completely
> irrelevant whether the timeout occurs n msecs earlier or later.

I agree, but it not accuracy that I am arguing, but cpu cycles.  Those 
we use in the kernel are not available for the user.
> 
> 
>>If it is the right answer here, then why not use it for normal 
>>timers?  A list of timers is a rather unique thing and, I think, 
>>deserves a management structure that accounts for the fact that the 
>>elements in the tree are perishable.
> 
> 
> The first goal was to separate out the timers from the timeout API and I
> believe that this separation is necessary. 
> 
> The implementation of ktimers is not at all restricted to the timers we
> addressed for now, it can also be utilized for the timeout API without
> much effort. 
> 
> The performance improvement is enormous despite the alleged 64bit math
> overhead.
> 
> Testcase on a 600MHz CeleronM, 512MB RAM:
> 
> 16 cyclic SCHED_FIFO tasks using clock_nanosleep(ABSTIME)
> base interval = 1000 us, offset per task = 50 us 
> tracing each latency value to disk
> 
> 16 cyclic SCHED_OTHER tasks using clock_nanosleep(ABSTIME)
> base interval = 1000 us, offset per task = 500 us
> tracing each latency value to disk
> 
> while true; do hackbench 50; done
> cat /dev/hda | nc atlas 4711
> 
> This injects a (insane) load avg of >600 permantely
> 
> 2.6.13-rt4 (cascade based implementation)
> 16 cyclic SCHED_FIFO tasks using clock_nanosleep(ABSTIME)
> 
> task	loops		min	max	average	sigma	prio
> 0	      999999	5	869	19	20	80
> 1	      999999	9	883	18	22	79
> 2	      999999	9	927	19	23	78
> 3	      999999	5	908	21	28	77
> 4	      999999	0	1056	22	33	76
> 5	      999999	0	973	23	33	75
> 6	      999999	0	926	23	33	74
> 7	      999999	1	893	24	33	73
> 8	      999999	2	942	23	34	72
> 9	      999999	1	868	24	34	71
> 10	      999999	0	912	23	34	70
> 11	      999999	0	911	28	46	69
> 12	      999999	0	912	28	46	68
> 13	      999999	9	967	28	46	67
> 14	      999999	9	954	28	46	66
> 15	      999999	9	946	28	46	65
> 
> 
> 2.6.13-rt11 (ktimers based implementation)
> 16 cyclic SCHED_FIFO tasks using clock_nanosleep(ABSTIME)
> 
> task	loops		min	max	average	sigma	prio
> 0	      999999	9	76	20	4	80
> 1	      999999	8	84	20	4	79
> 2	      999999	8	98	21	4	78
> 3	      999999	9	121	20	4	77
> 4	      999999	8	124	20	4	76
> 5	      999999	9	140	20	4	75
> 6	      999999	10	103	22	5	74
> 7	      999999	9	99	21	5	73
> 8	      999999	8	95	21	5	72
> 9	      999999	9	148	21	5	71
> 10	      999999	9	141	22	6	70
> 11	      999999	9	143	22	5	69
> 12	      999999	8	129	20	4	68
> 13	      999999	9	149	21	5	67
> 14	      999999	9	135	21	5	66
> 15	      999999	8	230	22	5	65

I confess I don't understand the above numbers.  What are min and max 
and in what units?  Are you saying the large max numbers are caused by 
the cascade?
> 
> 
> 
>>It appears that the "monotonic_clock" is being used to drive ktimers. 
>>The "monotonic_clock" was NEVER meant to poke outside of the kernel.  It 
>>is a raw kernel clock that is only required to be monotonic with nothing 
>>said about accuracy.  It should NOT be confused with CLOCK_MONOTONIC 
>>which is directly tied to xtime and therefor is ntp corrected.
> 
> 
> I'm well aware of that and this is a completely different playground. 
> 
> The ktimers implementation is _independent_ of the clock sources. The
> current clock source implementation may suck, but thats not a problem of
> ktimers at all. Its a problem of arch/XXX/timeYY and kernel/time.c. I
> did not spend too much time on that as John Stultz is working on the
> timesource and timeofday stuff and I really hope that this gets
> accepted.
> 
> The relation ship of clock sources and their accuracy is also a
> worthwhile field of discussion.
> 
> 
>>These are only the concerns I have from having a rather quick look at 
>>the code.  I am sure that there are other issues...
> 
> 
> It would be hubristic to deny that :)
> 
> OTH, 
> 
> - The posix timer tests run all successful, except the broken 2timertest
> which fails on any other HRT kernel too and the sleep to long for real
> timers when the clock is set backwards, which is easily solvable
> (working on that).

Your mileage seems to differ from mine.  Here is what I get from ./do_test:
The following tests failed:
clock_nanosleeptest
abs_timer_test
4-1
clock_settimetest
clock_gettimetest2
2timer_test 



Then, on the second run, it crashed in an attempt to get the monotonic 
clock (a divide error).  System is a dual PIII, 800Mhz.  This from the 
rt11 patch.
> 
> - The performance improvements in combination with simpler code is an
> argument of itself

-- 
George Anzinger   george@mvista.com
HRT (High-res-timers):  http://sourceforge.net/projects/high-res-timers/

next prev parent reply	other threads:[~2005-09-15 22:36 UTC|newest]

Thread overview: 46+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2005-09-13 10:00 2.6.13-rt6, ktimer subsystem Ingo Molnar
2005-09-13 19:59 ` Lee Revell
2005-09-13 20:06   ` Lee Revell
2005-09-13 20:10   ` Ingo Molnar
2005-09-13 20:36     ` Lee Revell
2005-09-15  7:55       ` Ingo Molnar
2005-09-15 11:37         ` Ingo Molnar
2005-09-14 15:56 ` Darren Hart
2005-09-14 22:09   ` Darren Hart
2005-09-14 19:38 ` George Anzinger
2005-09-15  2:25   ` Thomas Gleixner
2005-09-15 22:35     ` George Anzinger [this message]
2005-09-15 22:53       ` Thomas Gleixner
2005-09-15 23:10         ` George Anzinger
2005-09-15 23:09       ` Daniel Walker
2005-09-16  0:08         ` George Anzinger
2005-09-15  9:20   ` Ingo Molnar
2005-09-15 23:04     ` George Anzinger
2005-09-15 23:20       ` Thomas Gleixner
2005-09-15  9:43 ` Roman Zippel
2005-09-26  7:02 ` 2.6.14-rc2-rt2 Ingo Molnar
2005-09-27  6:13   ` 2.6.14-rc2-rt2 Eran Mann
2005-09-27 10:33     ` 2.6.14-rc2-rt2 Ingo Molnar
2005-09-27 16:59   ` 2.6.14-rc2-rt2 Fernando Lopez-Lezcano
2005-09-27 22:15     ` 2.6.14-rc2-rt2 Thomas Gleixner
2005-09-27 23:11       ` 2.6.14-rc2-rt2 Fernando Lopez-Lezcano
2005-09-27 23:10     ` 2.6.14-rc2-rt2 Daniel Walker
2005-09-28  3:04       ` 2.6.14-rc2-rt2 Fernando Lopez-Lezcano
2005-09-28  9:48         ` 2.6.14-rc2-rt2 Ingo Molnar
2005-09-28 16:34           ` 2.6.14-rc2-rt2 Fernando Lopez-Lezcano
2005-09-29  9:07             ` 2.6.14-rc2-rt2 Eran Mann
2005-09-28  9:10   ` 2.6.14-rc2-rt2 Peter Zijlstra
2005-09-29 16:45   ` 2.6.14-rc2-rt2 Badari Pulavarty
2005-09-30 10:58     ` 2.6.14-rc2-rt2 Ingo Molnar
2005-10-02 15:18   ` 2.6.14-rc3-rt1 Ingo Molnar
2005-10-02 15:42     ` 2.6.14-rc3-rt1 Mark Knecht
2005-10-02 19:25       ` 2.6.14-rc3-rt1 Mark Knecht
2005-10-06 17:13         ` 2.6.14-rc3-rt1 Steven Rostedt
2005-10-07 11:09           ` [patch] pcmcia-shutdown-fix.patch Ingo Molnar
2005-10-07 19:17             ` Russell King
2005-10-07 19:46               ` Steven Rostedt
2005-10-10 15:13               ` Steven Rostedt
2005-10-10 15:37                 ` Steven Rostedt
2005-10-02 20:51     ` 2.6.14-rc3-rt1 Felix Oxley
2005-10-02 21:55       ` 2.6.14-rc3-rt1 Felix Oxley
2005-10-03  6:33       ` 2.6.14-rc3-rt1 Ingo Molnar

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4329F733.2090604@mvista.com \
    --to=george@mvista.com \
    --cc=dwalker@mvista.com \
    --cc=high-res-timers-discourse@lists.sourceforge.net \
    --cc=johnstul@us.ibm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@elte.hu \
    --cc=nish.aravamudan@gmail.com \
    --cc=rostedt@goodmis.org \
    --cc=tglx@linutronix.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.