linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: [patch] Real-Time Preemption, -RT-2.6.9-rc4-mm1-U9.3
@ 2004-10-22 18:43 Mark_H_Johnson
  2004-10-22 18:49 ` K.R. Foley
  0 siblings, 1 reply; 8+ messages in thread
From: Mark_H_Johnson @ 2004-10-22 18:43 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Lee Revell, Rui Nuno Capela, Mark_H_Johnson,
	K.R. Foley, Bill Huey, Adam Heath, Florian Schmidt,
	Thomas Gleixner, Michal Schmidt, Fernando Pablo Lopez-Lezcano

>i have released the -U9.3 Real-Time Preemption patch, ...
>
It is getting hard to keep up with the updates....

This version built OK and since I noticed it includes fixes for the
parallel port, I added that back to my configuration and built / booted
without any problems. I still see the BUG from:

Oct 22 12:27:50 dws77 kernel: 8139too Fast Ethernet driver 0.9.27
Oct 22 12:27:50 dws77 kernel: eth0: RealTek RTL8139 at 0xdc00,
00:50:bf:39:11:fc, IRQ 11
Oct 22 12:27:50 dws77 kernel: BUG: atomic counter underflow at:
Oct 22 12:27:50 dws77 kernel:  [<c02b8f88>] qdisc_destroy+0x98/0xa0 (12)

I saw the messages about fixes for the other network drivers, but
don't forget this one.

Real time stress tests ran more smoothly this time with fewer
odd symptoms but a few new symptoms showed up too. I'll send the
boot log and traces separately. The following summarizes the tests
and results.

[1] X11 stress - very clean, max CPU delay was only 20 usec.

[2] proc stress - very clean, max CPU delay was only 30 usec.

[3] network output stress - only trace much worse than U9.2. An odd
pattern in the graph showing a delay of roughly 400 usec every 5
seconds with a much smaller delay following. There were also a couple
bursts of delays at 90-100 seconds, and 250-260 seconds. Did not
see this pattern on any other test.

[4] network input stress - very clean, max CPU delay was only 80 usec.

[5] disk write stress - very clean, max CPU delay only 70 usec.

[6] disk copy stress - very clean, max CPU delay only 90 usec.

[7] disk read stress - first 25 seconds, had a pattern of roughly 100 usec
CPU delays with a few peaks at 500 usec. After that, was very clean, almost
99.7% of the CPU delays were under 100 usec.

During these tests (total 25-30 minutes) had seven latency traces
over >200 usec. Summary follows:

00 - find_symbol, a single trace line over 400 usec as follows
preemption latency trace v1.0.7 on 2.6.9-rc4-mm1-RT-U9.3
-------------------------------------------------------
 latency: 495 us, entries: 9 (9)   |   [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
    -----------------
    | task: modprobe/3643, uid:0 nice:-10 policy:0 rt_prio:0
    -----------------
 => started at: _spin_lock_irqsave+0x1f/0x80 <c0314f4f>
 => ended at:   _spin_unlock_irq+0x1b/0x40 <c031538b>
=======>
00000001 0.000ms (+0.000ms): _spin_lock_irqsave (resolve_symbol)
00000001 0.000ms (+0.447ms): __find_symbol (resolve_symbol)
00010001 0.448ms (+0.000ms): do_nmi (__find_symbol)
00010001 0.448ms (+0.000ms): do_nmi (add_preempt_count)
00010001 0.449ms (+0.042ms): do_nmi (<00200093>)
00000001 0.491ms (+0.000ms): use_module (resolve_symbol)
00000001 0.492ms (+0.001ms): already_uses (use_module)
00000001 0.493ms (+0.000ms): kmem_cache_alloc (use_module)
00000001 0.494ms (+0.000ms): _spin_unlock_irq (resolve_symbol)

01, 02, 03, 05, 06 - flush_tlb
 latency: 1815 us, entries: 108 (108)   |   [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
 latency: 8959 us, entries: 180 (180)   |   [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
 latency: 175300 us, entries: 4000 (8116)   |   [VP:1 KP:1 SP:1 HP:1
#CPUS:2]
 latency: 80679 us, entries: 1545 (1545)   |   [VP:1 KP:1 SP:1 HP:1
#CPUS:2]
 latency: 76801 us, entries: 3561 (3561)   |   [VP:1 KP:1 SP:1 HP:1
#CPUS:2]

This is that symptom I reported before where something gets "stuck"
and one or more clock ticks later, it finally gets freed up. Note that
the real time application did not see any of these delays. It may
be interesting to do another test w/ two real time tasks to see if
these are real or a sampling artifact.

04 - avc_insert, a single > 200 usec trace entry.

preemption latency trace v1.0.7 on 2.6.9-rc4-mm1-RT-U9.3
-------------------------------------------------------
 latency: 216 us, entries: 4 (4)   |   [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
    -----------------
    | task: fam/2933, uid:0 nice:0 policy:0 rt_prio:0
    -----------------
 => started at: _spin_lock_irqsave+0x1f/0x80 <c0314f4f>
 => ended at:   _spin_unlock_irqrestore+0x20/0x50 <c0315340>
=======>
00000001 0.000ms (+0.000ms): _spin_lock_irqsave (avc_has_perm_noaudit)
00000001 0.000ms (+0.214ms): avc_insert (avc_has_perm_noaudit)
00000001 0.214ms (+0.000ms): memcpy (avc_has_perm_noaudit)
00000001 0.215ms (+0.000ms): _spin_unlock_irqrestore (avc_has_perm_noaudit)

  --Mark


^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: [patch] Real-Time Preemption, -RT-2.6.9-rc4-mm1-U9.3
@ 2004-10-22 19:40 Mark_H_Johnson
  2004-10-22 19:55 ` Thomas Gleixner
  0 siblings, 1 reply; 8+ messages in thread
From: Mark_H_Johnson @ 2004-10-22 19:40 UTC (permalink / raw)
  To: tglx
  Cc: Bill Huey, Adam Heath, K.R. Foley, LKML, Ingo Molnar,
	Florian Schmidt, Fernando Pablo Lopez-Lezcano, Lee Revell,
	Rui Nuno Capela, Michal Schmidt

>No, the fix was for the missing pci shutdown in tulip.
I thought the two were related (since I get the failed to shutdown
message right after that traceback. Perhaps the 8139too needs
that same shutdown fix.

--Mark H Johnson
  <mailto:Mark_H_Johnson@raytheon.com>


^ permalink raw reply	[flat|nested] 8+ messages in thread
* [patch] Real-Time Preemption, -VP-2.6.9-rc4-mm1-U0
@ 2004-10-14  0:24 Ingo Molnar
  2004-10-14 14:31 ` [patch] Real-Time Preemption, -VP-2.6.9-rc4-mm1-U1 Ingo Molnar
  0 siblings, 1 reply; 8+ messages in thread
From: Ingo Molnar @ 2004-10-14  0:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lee Revell, Rui Nuno Capela, Mark_H_Johnson, K.R. Foley,
	Daniel Walker, Bill Huey, Andrew Morton


i'm pleased to announce a significantly improved version of the
Real-Time Preemption (PREEMPT_REALTIME) feature that i have been working
towards in the past couple of weeks.

the patch (against 2.6.9-rc4-mm1) can be downloaded from:

  http://redhat.com/~mingo/voluntary-preempt/voluntary-preempt-2.6.9-rc4-mm1-U0

this is i believe the first correct conversion of the Linux kernel to a
fully preemptible (fully mutex-based) preemption model, while still
keeping all locking properties of Linux.

I also think that this feature can and should be integrated into the
upstream kernel sometime in the future. It will need improvements and
fixes and lots of testing, but i believe the basic concept is sound and
inclusion is manageable and desirable. So comments, testing and feedback
is more than welcome!

to recap the conceptual issues that needed solving: the previous patch
already converted a fair portion of spinlocks to mutexes, but that in a
fully preemptible kernel model the following locking primitives are
especially problematic:

	- RCU locking
	- per-cpu counters and per-cpu variables
	- tlb gather/release logic
	- seqlocks
	- atomic-kmaps
	- pte mapping

note that while we tend to think about these as SMP locking primitives,
in a fully preemptible model these locking rules are necessary for
correctness, even on a single-CPU embedded box.

Unfortunately none of the existing preemption patches solve these
problems: they concentrate on UP but these locking rules must not be
ignored on UP either!

(Bill Huey's mutex patch he just released is the one notable exception i
know about, which is also a correct implementation i believe, but it
doesnt attack these locking rules [yet]. Bill's locking hierarchy is i
believe quite similar to the -T9 patch - this i believe is roughly the
best one can get when only using spinlocks as a vehicle.)

In the previous (-T9) version of the Real-Time Preemption patch, the
above locking primitives kept large portions of kernel code
non-preemptable, causing a 'spreadout' of raw spinlocks and
non-preemptible sections to various kernel subsystems.

Another, even more problematic effect was that both network drivers and
block IO drivers got 'contaminated' by raw spinlocks, triggering lots of
per-driver changes and thus limiting testability. It is basically an
opt-in model to correctness which is bad from a maintainance and
upstream acceptance point of view. The -T9 patch was a prime example of
the problems the Linux locking primitives cause in a fully preemptible
kernel model.


To solve all these fundamental problems, i improved/fixed/changed all of
these locking methods to be preemption-friendly. Most of the time it was
necessary to introduce an additional API variant because e.g.
rcu_read_lock() is anonymous (it doesnt identify the data protected), so
i introduced a variant that takes the write-lock as an argument. In the
PREEMPT_REALTIME case we can thus properly serialize on that lock.

For per-cpu variables i introduced a new API variant that creates a
spinlock-array for the per-cpu-variable, and users must make sure the
cpu field doesnt change. Migration to another CPU can happen within the
critical section, but 'statistically' the variable is still per-CPU and
update correctness is fully preserved.

TLB shootdown was the source of another nasty type of critical section:
it keeps preemption disabled during much of the pagetable zapping and
also relies on a per-CPU variable to keep TLB state. The fix for
PREEMPT_REALTIME (on x86) was to implement a simpler but preemptible TLB
flushing method. This necessiated two minor API changes to the generic
TLB shootdown code - pretty straightforward ones.

Atomic kmaps were another source of non-preemptible sections:
pte_map/pte_unmap both used nontrivial functions within and ran for a
long time, creating a lock dependency and a latency problem as well. I
wrapped them via normal kmaps, which thus become preemptible. (The main
reason for atomic kmaps were non-preemptability constraints - but those
are not present in a fully preemptible model.)

seqlocks (used within the VFS and other places) were another problem:
the are now preemptible by default, the same auto-type-detection logic
applies to them as to preemptible/raw spinlocks: switching between a
preemptible and a non-preemptible type is done by changing the
prototype, the locking APIs stay the same.

The improvements to locking allowed the gradual 'pulling out' of all
raw-spinlocks from various kernel subsystems. In the -U0 patch i have
almost completely finished this 'pullout', and as a result the following
kernel subsystems are now completely spinlock-free and 100% mutex-based:

	- networking
	- IO subsystem
	- VFS and lowlevel filesystems
	- memory management (except the SLAB allocator lock)
	- signal code
	- all of drivers/* and sound/*

note: it is important that when PREEMPT_REALTIME is disabled, the old
locking rules apply and there is no performance impact whatsoever. So
what this implements is in essence a compile-time multi-tier locking
architecture enabling 4 types of preemption models:

  - stock                      (casual voluntary kernel preemption)
  - CONFIG_PREEMPT_VOLUNTARY   (lots of cond_resched() points)
  - CONFIG_PREEMPT             (involuntary preemption plus spinlocks)
  - CONFIG_PREEMPT_REALTIME    (everything is a mutex)

these models cover all the variants people are interested in: servers
with almost no latency worries, desktops with ~1msec needs and hard-RT
applications needing both microsecond-range latencies and provable
maximum latencies.

to quantitatively see the effects of these changes to the locking
landscape, here's the output of a script that disassembles the kernel
image and counts the number of actual spin lock/unlock function calls
done versus mutex lock/unlocks:

With PREEMPT_REALTIME disabled, all locks are spinlocks and old
semaphores:

        spinlock API calls:      5359  (71.6%)
        | old mutex API calls:   2120  (28.3%)
        | new mutex API calls:      2  (0%)
        all mutex API calls:     2122  (28.3%)
        --------------------------------------
        lock API calls:          7481 (100.0%)

with the -T9 kernel, which had the 'spread out' locking model, a
considerable portion of spinlocks were replaced by mutexes, but more
than 20% of usage was still spinlocks:

        spinlock API calls:      1835  (23.1%)
        | old mutex API calls:   2142  (26.9%)
        | new mutex API calls:   3961  (49.8%)
        all mutex API calls:     6103  (76.8%)
        --------------------------------------
        lock API calls:          7938 (100.0%)

here are some fresh numbers from Bill Huey's mmLinux kernel:

        spinlock API calls:      2452  (30.3%)
        all mutex API calls:     5614  (69.6%)
        --------------------------------------
        lock API calls:          8066 (100.0%)

(his mutex implementation directly falls back to up()/down() so the new
mutexes become part of the old mutexes.)

while i believe that the locking design is fundamentally incomplete in
the MontaVista kernel and thus is not directly comparable to
PREEMPT_REALTIME nor the mmLinux kernel, here are the stats from it
using a similar .config:

        spinlock API calls:      1444  (26.1%)
        | old mutex API calls:   2095  (37.9%)
        | new mutex API calls:   1981  (35.8%)
        all mutex API calls:     4076  (73.8%)
        --------------------------------------
        lock API calls:          5520 (100.0%)

(here it is visible that apparently a significant amount of [i believe
necessary] locking is missing from this kernel.)

finally, here are the stats from the new PREEMPT_REALTIME -U0 kernel:

        spinlock API calls:       491  (6.0%)
        | old mutex API calls:   2142  (26.2%)
        | new mutex API calls:   5536  (67.7%)
        all mutex API calls:     7678  (93.9%)
        --------------------------------------
        lock API calls:          8169 (100.0%)

note that almost all of the remaining spinlocks are held for a short
amount of time and have a finite maximum duration. They involve hardware
access, scheduling and interrupt handling and timers - almost all of
that code has O(1) characteristics.

what this means is that we are approaching hard-real-time domains ... 
using what is in essence the stock Linux kernel!

note that priority inheritance is still not part of this patch, but that
effort can now be centralized to the two basic Linux semaphore types,
solving the full spectrum of priority inheritance problems!

the code is still x86-only but only for practical reasons - other
architectures will be covered in the future as well.

to create a -U0 tree from scratch the patching order is:

   http://kernel.org/pub/linux/kernel/v2.6/linux-2.6.8.tar.bz2
 + http://kernel.org/pub/linux/kernel/v2.6/testing/patch-2.6.9-rc4.bz2
 + http://kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.9-rc4/2.6.9-rc4-mm1/2.6.9-rc4-mm1.bz2
 + http://redhat.com/~mingo/voluntary-preempt/voluntary-preempt-2.6.9-rc4-mm1-U0

	Ingo

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2004-10-22 20:10 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2004-10-22 18:43 [patch] Real-Time Preemption, -RT-2.6.9-rc4-mm1-U9.3 Mark_H_Johnson
2004-10-22 18:49 ` K.R. Foley
2004-10-22 18:58   ` Thomas Gleixner
  -- strict thread matches above, loose matches on Subject: below --
2004-10-22 19:40 Mark_H_Johnson
2004-10-22 19:55 ` Thomas Gleixner
2004-10-14  0:24 [patch] Real-Time Preemption, -VP-2.6.9-rc4-mm1-U0 Ingo Molnar
2004-10-14 14:31 ` [patch] Real-Time Preemption, -VP-2.6.9-rc4-mm1-U1 Ingo Molnar
2004-10-14 23:42   ` [patch] Real-Time Preemption, -VP-2.6.9-rc4-mm1-U2 Ingo Molnar
2004-10-15 10:26     ` [patch] Real-Time Preemption, -VP-2.6.9-rc4-mm1-U3 Ingo Molnar
2004-10-16 15:33       ` [patch] Real-Time Preemption, -VP-2.6.9-rc4-mm1-U4 Ingo Molnar
2004-10-18 14:50         ` [patch] Real-Time Preemption, -RT-2.6.9-rc4-mm1-U5 Ingo Molnar
2004-10-19 12:46           ` [patch] Real-Time Preemption, -RT-2.6.9-rc4-mm1-U6 Ingo Molnar
2004-10-19 18:00             ` [patch] Real-Time Preemption, -RT-2.6.9-rc4-mm1-U7 Ingo Molnar
2004-10-20  9:45               ` [patch] Real-Time Preemption, -RT-2.6.9-rc4-mm1-U8 Ingo Molnar
2004-10-21 13:27                 ` [patch] Real-Time Preemption, -RT-2.6.9-rc4-mm1-U9 Ingo Molnar
2004-10-22 13:35                   ` [patch] Real-Time Preemption, -RT-2.6.9-rc4-mm1-U9.3 Ingo Molnar
2004-10-22 18:41                     ` Alexander Batyrshin
2004-10-22 20:08                       ` Ingo Molnar

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).