Re: [Adeos-main] latency results for ppc and x86

From: Jan Kiszka <jan.kiszka@domain.hid>
To: Nicholas Mc Guire <der.herr@domain.hid>
Cc: adeos-main@gna.org, Wolfgang Grandegger <wg@domain.hid>
Subject: Re: [Adeos-main] latency results for ppc and x86
Date: Wed, 21 Feb 2007 13:29:07 +0100
Message-ID: <45DC3B13.5090200@domain.hid>
In-Reply-To: <Pine.LNX.4.60.0702211116020.2526@domain.hid>

Nicholas Mc Guire wrote:
>>>> Latencies are mainly due to cache refills on the P4. Have you already
>>>> put load onto your system? If not, worst case latencies will be even
>>>> longer.
>>>
>>>
>>> one possibility we found in RTLinux/GPL to reduce latency is to free up
>>> TLBs by flushing a few of the TLB hot spots, basically these flushpoints
>>> are something like:
>>>
>>> __asm__ __volatile__("invlpg %0": :"m"
>>> (*(char*)__builtin_return_address(0)));
>>>
>>> put at places where we know we don't need those lines any more (i.e.
>>> after switching tasks or the like). By inserting only a few such
>>> flushpoints in hot code on the kernel side we found a clear reduction
>>> of the worst case jitter and interrupt response times.
> 
>> Interesting. Are these flushpoints present in latest kernel patches of
>> RTLinux/GPL? Sounds like a nice thing to play with on a rainy day. :)
> 
> 
> yup - basically if you look at the latest patches (2.4.33-rtl3.2) you
> will find them in the kernel code, or in the rtlinux core code
> (rtl_core.c and rtl_sched.c). The concept is of course not restricted
> to 2.4.X kernels. Note though that some archs (notably MIPS)
> have a problem with __builtin_return_address.

OK, thanks.

> 
> 
>>>
>>> Aside from caches, BTB exhaustion in high load situations is also a
>>> problem that has not been addressed much in the realtime variants - with
>>> the P6 families having a botched BTB prediction unit, one can use some
>>> "strange" constructions to reduce branch penalties - i.e.:
>>>
>>>   if(!condition){slow_path();}
>>>   else{fast_path();}
>>>
>>> is more predictable than
>>>
>>>   if(condition){fast_path();}
>>>   else{slow_path();}
> 
>> I think this is also what likely()/unlikely() teaches the
>> compiler on x86 (where there is no branch prediction predicate for the
>> instructions), isn't it?
> 
> 
> no not really - likely/unlikely give hints during compilation to relocate
> the unlikely part to a distant location (some label at the end of the
> file...) but that does not change the problem at runtime with respect to
> the worst case. The BTB uses a hysteresis of one miss/hit to adjust the
> guess on P6 systems with the default (if the address is not present in
> the BTB) of not taken - thus if you reorder for the "not taken" case
> being the fast path you will always have the fast path preloaded in
> the pipeline.
> 
> if(likely(condition)){
>    fast_path();
> } else {
>    slow_path();
> }
> 
> will be fast on average but the worst case is that the address is not
> in the BTB so the slow_path() tag is loaded by default.

Ah, got the idea. How arch/processor-type-dependent is this
optimisation? It would surely make no sense to optimise for arch X in
generic code.
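
To make sure we are talking about the same thing, here is a minimal
user-space sketch of the two variants. The likely()/unlikely() macros
follow the kernel's style and rely on the GCC/Clang builtin
__builtin_expect; fast_path(), slow_path(), the enum and the handle_*()
names are hypothetical stand-ins of mine, not code from any of the
patches mentioned above.

```c
#include <assert.h>

/* Hint macros in the style of the kernel's likely()/unlikely();
 * __builtin_expect is a GCC/Clang builtin. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical stand-ins for the real work; each returns a tag so a
 * caller can tell which path ran. */
enum path { FAST_PATH = 0, SLOW_PATH = 1 };

static enum path fast_path(void) { return FAST_PATH; }
static enum path slow_path(void) { return SLOW_PATH; }

/* Variant A: conventional source order plus a compile-time hint. */
enum path handle_hinted(int condition)
{
    if (likely(condition))
        return fast_path();
    else
        return slow_path();
}

/* Variant B: the ordering suggested in the quoted mail, with the
 * negated test and the slow path written first. */
enum path handle_reordered(int condition)
{
    if (!condition)
        return slow_path();
    else
        return fast_path();
}

/* Both variants must behave identically; only the generated code
 * layout is allowed to differ. */
void check_variants_agree(void)
{
    assert(handle_hinted(0) == handle_reordered(0));
    assert(handle_hinted(1) == handle_reordered(1));
}
```

Source order alone does not dictate where the compiler places the two
blocks, so whether variant B really ends up with the fast path at the
predicted location has to be checked in the disassembly (e.g. with
objdump -d) for the compiler and flags actually in use.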

> 
> There is a paper on this (a bit messy) published at RTLWS7 (Lille) 2005
> if you are interested in the details.
> 
>>>
>>> as in the first case the branch prediction is static, thus the worst
>>> case is that you are jumping over a few bytes of object code when the
>>> condition is not met. In the second case the default if the BTB does
>>> not yet know this branch is to guess not-taken and thus load the jump
>>> target of the slow path with the overhead of TLB/Cache penalties.
>>>
>>> Regarding the PPC numbers, the surprising thing for me is that the same
>>> archs are doing MUCH better with old RTAI/RTLinux versions, i.e. 2.4.4
>>> kernel on a 50MHz MPC860 shows a worst case of 57us - so I do question
>>> what is going wrong here in the 2.6.X branches of hard-realtime Linux -
> 
>> You forget that old stuff was kernel-only, lacking a lot of Linux
>> integration features. Recent I-pipe-based real-time via Xenomai normally
>> includes support for user-space RT (you can switch it off, but hardly
>> anyone does). So it's not a useful comparison given that new real-time
>> projects almost always want full-featured user space these days. For a
>> fairer comparison, one should consider a simple I-pipe domain that
>> contains the real-time "application".
> 
> 
> note that the numbers posted here WERE kernel numbers !

But with user space support enabled. There are no separate code paths
for kernel and user space threads; the basic infrastructure is shared
here for good reasons.

> I know that people want to move to user-space - but what is the advantage
> over RT-preempt then if you use the dynamic tick patch (scheduled to go
> mainline in 2.6.21 BTW) ?

So far, determinism (both w.r.t. mainline and latest -rt).
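
One rough way to put a number on that from user space is a
cyclictest-style loop: sleep until an absolute deadline and record how
late the wake-up was. The sketch below is only an illustration using
the POSIX calls clock_gettime() and clock_nanosleep(); the function
names and the choice of CLOCK_MONOTONIC are my assumptions, and it
leaves out the SCHED_FIFO priority setup, memory locking and background
load that a meaningful worst-case measurement needs.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* Advance an absolute timespec by ns nanoseconds (0 < ns < 1 s). */
static void timespec_add_ns(struct timespec *t, long ns)
{
    t->tv_nsec += ns;
    if (t->tv_nsec >= NSEC_PER_SEC) {
        t->tv_nsec -= NSEC_PER_SEC;
        t->tv_sec++;
    }
}

/* Run `loops` periodic wake-ups and return the worst observed wake-up
 * lateness in nanoseconds, or -1 if a clock call fails. */
long max_wakeup_latency_ns(int loops, long period_ns)
{
    struct timespec next, now;
    long worst = 0;

    assert(period_ns > 0 && period_ns < NSEC_PER_SEC);

    if (clock_gettime(CLOCK_MONOTONIC, &next) != 0)
        return -1;

    for (int i = 0; i < loops; i++) {
        timespec_add_ns(&next, period_ns);

        /* An absolute deadline avoids accumulating drift. */
        if (clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL) != 0)
            return -1;
        if (clock_gettime(CLOCK_MONOTONIC, &now) != 0)
            return -1;

        long late = (long)(now.tv_sec - next.tv_sec) * NSEC_PER_SEC
                  + (now.tv_nsec - next.tv_nsec);
        if (late > worst)
            worst = late;
    }
    return worst;
}
```

Calling max_wakeup_latency_ns(100000, 1000000), for instance, would
sample a 1 ms period; the result only means something if the system is
loaded while the loop runs.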

BTW, kernel space real time is no longer advisable, especially for
commercial projects that have to worry about the (likely non-GPL)
license of their application code. And then there are those countless
technical advantages that speed up the development process of user space
apps.

> 
>>> my suspicion is that there is too much work being done on fast-hot CPUs
>>> and the low-end is being neglected - which is bad as the numbers you
>>> post here for ADEOS are numbers reachable with mainstream preemptive
>>> kernel by now as well (of course not on the low end systems though).
> 
>> That's scenario-dependent. Simple setups like a plain timed task can
>> reach the dimension of I-pipe-based Xenomai, but more complex scenarios
>> suffer from the exploding complexity in mainstream Linux, even with -rt.
>> Just think of "simple" mutexes realised via futexes.
> 
> 
> do you have some code samples with numbers ? I would be very interested in
> a demo that shows this problem - I was not able to really find a smoking
> gun with RT-preempt and dynamic ticks (2.6.17.2).

I can't help with demo code, but I can name a few conceptual issues:

 o Futexes may need to allocate memory when blocking on a contended
   lock (refill_pi_state_cache)
 o Futexes depend on mmap_sem
 o Preemptible RCU read-sides can either lead to OOM or require
   intrusive read-side priority boosting (see Paul McKenney's LWN
   article)
 o Excessive lock nesting depths in critical code paths make it hard to
   predict worst-case behaviour (or to verify that measurements actually
   already triggered it)
 o Any Linux process using nanosleep & friends can schedule hrtimers at
   arbitrary dates, which requires a close look at the (worst-case)
   timer usage pattern of the _whole_ system, not only the
   SCHED_FIFO/RR part

That's what I can tell off the top of my head. But one would have to
analyse the code more thoroughly I guess.
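
For context on the first two points: the user-space side of such a
"simple" mutex is just a POSIX mutex with the priority-inheritance
protocol. As far as I can tell, glibc backs these with the PI futex
operations, so it is the contended lock path that ends up in the kernel
code mentioned above. A minimal sketch, where init_pi_mutex() is a
hypothetical helper name of mine:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <pthread.h>

/* Initialise *m as a priority-inheritance mutex.
 * Returns 0 on success or a pthread error number. */
int init_pi_mutex(pthread_mutex_t *m)
{
    pthread_mutexattr_t attr;
    int rc;

    assert(m != NULL);

    rc = pthread_mutexattr_init(&attr);
    if (rc != 0)
        return rc;

    /* Request priority inheritance for waiters on this mutex. */
    rc = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    if (rc == 0)
        rc = pthread_mutex_init(m, &attr);

    /* The attribute object is no longer needed once the mutex exists. */
    pthread_mutexattr_destroy(&attr);
    return rc;
}
```

The uncontended lock/unlock stays in user space; only when a second
thread has to block does the kernel get involved, which is exactly the
case that matters for the worst-case analysis.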

Jan

