Re: 2.6.13-rt6, ktimer subsystem

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

From: George Anzinger <george@mvista.com>
To: tglx@linutronix.de
Cc: john stultz <johnstul@us.ibm.com>,
	Nish Aravamudan <nish.aravamudan@gmail.com>,
	Ingo Molnar <mingo@elte.hu>,
	linux-kernel@vger.kernel.org,
	Steven Rostedt <rostedt@goodmis.org>,
	dwalker@mvista.com,
	"'high-res-timers-discourse@lists.sourceforge.net'" 
	<high-res-timers-discourse@lists.sourceforge.net>
Subject: Re: 2.6.13-rt6, ktimer subsystem
Date: Thu, 15 Sep 2005 15:35:31 -0700	[thread overview]
Message-ID: <4329F733.2090604@mvista.com> (raw)
In-Reply-To: <1126751140.6509.474.camel@tglx.tec.linutronix.de>

Thomas Gleixner wrote:
> On Wed, 2005-09-14 at 12:38 -0700, George Anzinger wrote:
> 
> 
>>Well, having spent a bit of time looking at the code it appears that a 
>>lot of the ideas we looked at and discarded (see 
>>high-res-timers-discourse@lists.sourceforge.net) are in this.  Shame it 
>>was all done with out reference or comment to that list, anyone on it or 
>>even the lkml.
> 
> 
> Well, I'm considering to wear sackcloth and ashes. But this seems like
> the pot calling the kettle back as I don't remember a single relevant
> reference/comment from you on the UTIME/KURT mailing list after your
> UTIME->HRT fork.

Hm...All the work has been on HRT mailing list.  In the early days the 
UTIME/KURT folks were on the list, still may be for all I know.

Still, as much as sackcloth and ashes might become you or me, I would 
rather try to learn how to be more open about what we are doing so that 
what results is the best possible for the linux community.
> 
> 
>>A few of the top issues:
>>
>>time in nanoseconds 64-bits, requires a divide to do much of anything 
>>with it.  Divides are slow and should be avoided if possible.  
> 
> 
> The divides are rare and definitely not in the hot paths. I'm sure that
> they can be replaced by some intelligent scaled math algorithm if it
> turns out to be necessary. The hot path instructions are simple
> add/sub/cmp which are less/equal expensive on a 32bit machine to an
> operation on struct timespec or an jiffies/arch_cycles pair. The non
> nsec based implementation gives a burden to 64bit machines and is
> provable wrong in the aspect of summing rounding errors of interval
> timers. 

With out arguing with your "provable wrong", I will admit that the 
64-bit LOOKS good and easy.  At the same time I think we can abstract 
all the required actions on the "nstime" thing and set it up so we can 
use either a timespec or a s64 depending on the machine (or some such). 
  I have such a patch nearly ready...
> 
> 
~
> 
>>The rbtree is a high overhead tree.  I suspect performance problems 
>>here.  
> 
> 
> 1. rbtree is available out of the box

True.
> 
> 2. rbtree is proven to be efficient - at least there are a couple of
> performance relevant users relying on it e.g. mm, ext3
> 
> 3. I did insertion/removal tests with 10k entries (<2us on a 1GHz box in
> user space). This is way below the experienced and reproducible trouble
> of recascading. The penalty is completely on the thread which owns the
> timer for non signal related functions. For signal related functions
> only the removal on expiry is a penalty for the complete system (in the
> softirq)

Well, I never was a cascade fan.

In the early HRT patches the whole timer list was replaced with a hashed 
list.  It was O(N/M) on insertion where we could easily choose M (for a 
while it was even a config option).  Removal was just an unlink, same as 
the cascade list.

To be clear on my take on this, as I understand it the rblist is 
something like O(N*somelog 2).  What is left out here is the fixed 
overhead of F which is there even if N = 1.  So we have something like 
(F+O(f(N)) for a list.  For the most part we don't look at F, but as 
list complexity grows, it gets larger thus pushing out the cross over 
point to a higher "N" when comparing two lists.  I considered the rbtree 
when doing the secondary list for HRT and concluded that "N" was small 
enough that a simple O(N/2) list would do just fine as it would only 
contain timers set to expire in the next jiffie.

I don't think an O(N/2) list is right for all timers (as in the 
ktimers), but I still wonder if there isn't something better than and 
rblist.  Note, "wonder".  I don't have such a list structure in hand.
> 
> The cascading is a penalty for the complete system all the time.
> 
> Performance is a straw man argument here. You know very well that > 90%
> of the timers are inaccurate "timeout" timers related to I/O,
> networking, devices. Most of those never expire (the positive feedback
> removes the timer before expiry) and those timers have no constraint to
> be accurate, except for the fact that they have to detect an
> device/network problem at some time. In this case it is completely
> irrelevant whether the timeout occurs n msecs earlier or later.

I agree, but it not accuracy that I am arguing, but cpu cycles.  Those 
we use in the kernel are not available for the user.
> 
> 
>>If it is the right answer here, then why not use it for normal 
>>timers?  A list of timers is a rather unique thing and, I think, 
>>deserves a management structure that accounts for the fact that the 
>>elements in the tree are perishable.
> 
> 
> The first goal was to separate out the timers from the timeout API and I
> believe that this separation is necessary. 
> 
> The implementation of ktimers is not at all restricted to the timers we
> addressed for now, it can also be utilized for the timeout API without
> much effort. 
> 
> The performance improvement is enormous despite the alleged 64bit math
> overhead.
> 
> Testcase on a 600MHz CeleronM, 512MB RAM:
> 
> 16 cyclic SCHED_FIFO tasks using clock_nanosleep(ABSTIME)
> base interval = 1000 us, offset per task = 50 us 
> tracing each latency value to disk
> 
> 16 cyclic SCHED_OTHER tasks using clock_nanosleep(ABSTIME)
> base interval = 1000 us, offset per task = 500 us
> tracing each latency value to disk
> 
> while true; do hackbench 50; done
> cat /dev/hda | nc atlas 4711
> 
> This injects a (insane) load avg of >600 permantely
> 
> 2.6.13-rt4 (cascade based implementation)
> 16 cyclic SCHED_FIFO tasks using clock_nanosleep(ABSTIME)
> 
> task	loops		min	max	average	sigma	prio
> 0	      999999	5	869	19	20	80
> 1	      999999	9	883	18	22	79
> 2	      999999	9	927	19	23	78
> 3	      999999	5	908	21	28	77
> 4	      999999	0	1056	22	33	76
> 5	      999999	0	973	23	33	75
> 6	      999999	0	926	23	33	74
> 7	      999999	1	893	24	33	73
> 8	      999999	2	942	23	34	72
> 9	      999999	1	868	24	34	71
> 10	      999999	0	912	23	34	70
> 11	      999999	0	911	28	46	69
> 12	      999999	0	912	28	46	68
> 13	      999999	9	967	28	46	67
> 14	      999999	9	954	28	46	66
> 15	      999999	9	946	28	46	65
> 
> 
> 2.6.13-rt11 (ktimers based implementation)
> 16 cyclic SCHED_FIFO tasks using clock_nanosleep(ABSTIME)
> 
> task	loops		min	max	average	sigma	prio
> 0	      999999	9	76	20	4	80
> 1	      999999	8	84	20	4	79
> 2	      999999	8	98	21	4	78
> 3	      999999	9	121	20	4	77
> 4	      999999	8	124	20	4	76
> 5	      999999	9	140	20	4	75
> 6	      999999	10	103	22	5	74
> 7	      999999	9	99	21	5	73
> 8	      999999	8	95	21	5	72
> 9	      999999	9	148	21	5	71
> 10	      999999	9	141	22	6	70
> 11	      999999	9	143	22	5	69
> 12	      999999	8	129	20	4	68
> 13	      999999	9	149	21	5	67
> 14	      999999	9	135	21	5	66
> 15	      999999	8	230	22	5	65

I confess I don't understand the above numbers.  What are min and max 
and in what units?  Are you saying the large max numbers are caused by 
the cascade?
> 
> 
> 
>>It appears that the "monotonic_clock" is being used to drive ktimers. 
>>The "monotonic_clock" was NEVER meant to poke outside of the kernel.  It 
>>is a raw kernel clock that is only required to be monotonic with nothing 
>>said about accuracy.  It should NOT be confused with CLOCK_MONOTONIC 
>>which is directly tied to xtime and therefor is ntp corrected.
> 
> 
> I'm well aware of that and this is a completely different playground. 
> 
> The ktimers implementation is _independent_ of the clock sources. The
> current clock source implementation may suck, but thats not a problem of
> ktimers at all. Its a problem of arch/XXX/timeYY and kernel/time.c. I
> did not spend too much time on that as John Stultz is working on the
> timesource and timeofday stuff and I really hope that this gets
> accepted.
> 
> The relation ship of clock sources and their accuracy is also a
> worthwhile field of discussion.
> 
> 
>>These are only the concerns I have from having a rather quick look at 
>>the code.  I am sure that there are other issues...
> 
> 
> It would be hubristic to deny that :)
> 
> OTH, 
> 
> - The posix timer tests run all successful, except the broken 2timertest
> which fails on any other HRT kernel too and the sleep to long for real
> timers when the clock is set backwards, which is easily solvable
> (working on that).

Your mileage seems to differ from mine.  Here is what I get from ./do_test:
The following tests failed:
clock_nanosleeptest
abs_timer_test
4-1
clock_settimetest
clock_gettimetest2
2timer_test 



Then, on the second run, it crashed in an attempt to get the monotonic 
clock (a divide error).  System is a dual PIII, 800Mhz.  This from the 
rt11 patch.
> 
> - The performance improvements in combination with simpler code is an
> argument of itself

-- 
George Anzinger   george@mvista.com
HRT (High-res-timers):  http://sourceforge.net/projects/high-res-timers/

next prev parent reply	other threads:[~2005-09-15 22:36 UTC|newest]

Thread overview: 46+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2005-09-13 10:00 2.6.13-rt6, ktimer subsystem Ingo Molnar
2005-09-13 19:59 ` Lee Revell
2005-09-13 20:06   ` Lee Revell
2005-09-13 20:10   ` Ingo Molnar
2005-09-13 20:36     ` Lee Revell
2005-09-15  7:55       ` Ingo Molnar
2005-09-15 11:37         ` Ingo Molnar
2005-09-14 15:56 ` Darren Hart
2005-09-14 22:09   ` Darren Hart
2005-09-14 19:38 ` George Anzinger
2005-09-15  2:25   ` Thomas Gleixner
2005-09-15 22:35     ` George Anzinger [this message]
2005-09-15 22:53       ` Thomas Gleixner
2005-09-15 23:10         ` George Anzinger
2005-09-15 23:09       ` Daniel Walker
2005-09-16  0:08         ` George Anzinger
2005-09-15  9:20   ` Ingo Molnar
2005-09-15 23:04     ` George Anzinger
2005-09-15 23:20       ` Thomas Gleixner
2005-09-15  9:43 ` Roman Zippel
2005-09-26  7:02 ` 2.6.14-rc2-rt2 Ingo Molnar
2005-09-27  6:13   ` 2.6.14-rc2-rt2 Eran Mann
2005-09-27 10:33     ` 2.6.14-rc2-rt2 Ingo Molnar
2005-09-27 16:59   ` 2.6.14-rc2-rt2 Fernando Lopez-Lezcano
2005-09-27 22:15     ` 2.6.14-rc2-rt2 Thomas Gleixner
2005-09-27 23:11       ` 2.6.14-rc2-rt2 Fernando Lopez-Lezcano
2005-09-27 23:10     ` 2.6.14-rc2-rt2 Daniel Walker
2005-09-28  3:04       ` 2.6.14-rc2-rt2 Fernando Lopez-Lezcano
2005-09-28  9:48         ` 2.6.14-rc2-rt2 Ingo Molnar
2005-09-28 16:34           ` 2.6.14-rc2-rt2 Fernando Lopez-Lezcano
2005-09-29  9:07             ` 2.6.14-rc2-rt2 Eran Mann
2005-09-28  9:10   ` 2.6.14-rc2-rt2 Peter Zijlstra
2005-09-29 16:45   ` 2.6.14-rc2-rt2 Badari Pulavarty
2005-09-30 10:58     ` 2.6.14-rc2-rt2 Ingo Molnar
2005-10-02 15:18   ` 2.6.14-rc3-rt1 Ingo Molnar
2005-10-02 15:42     ` 2.6.14-rc3-rt1 Mark Knecht
2005-10-02 19:25       ` 2.6.14-rc3-rt1 Mark Knecht
2005-10-06 17:13         ` 2.6.14-rc3-rt1 Steven Rostedt
2005-10-07 11:09           ` [patch] pcmcia-shutdown-fix.patch Ingo Molnar
2005-10-07 19:17             ` Russell King
2005-10-07 19:46               ` Steven Rostedt
2005-10-10 15:13               ` Steven Rostedt
2005-10-10 15:37                 ` Steven Rostedt
2005-10-02 20:51     ` 2.6.14-rc3-rt1 Felix Oxley
2005-10-02 21:55       ` 2.6.14-rc3-rt1 Felix Oxley
2005-10-03  6:33       ` 2.6.14-rc3-rt1 Ingo Molnar

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4329F733.2090604@mvista.com \
    --to=george@mvista.com \
    --cc=dwalker@mvista.com \
    --cc=high-res-timers-discourse@lists.sourceforge.net \
    --cc=johnstul@us.ibm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@elte.hu \
    --cc=nish.aravamudan@gmail.com \
    --cc=rostedt@goodmis.org \
    --cc=tglx@linutronix.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox