* Re: [patch] Real-Time Preemption, -RT-2.6.9-mm1-V0.5.2
@ 2004-10-29 16:20 Mark_H_Johnson
  2004-10-29 16:26 ` Ingo Molnar
  0 siblings, 1 reply; 10+ messages in thread
From: Mark_H_Johnson @ 2004-10-29 16:20 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mark_H_Johnson, Rui Nuno Capela, linux-kernel, Lee Revell,
	K.R. Foley, Bill Huey, Adam Heath, Florian Schmidt,
	Thomas Gleixner, Michal Schmidt, Fernando Pablo Lopez-Lezcano,
	Karsten Wiese

Today's run with -V0.5.11.

In short, tracing appears to be broken on SMP systems and may have
caused the crash in the first attempt. In the second attempt (no
tracing), the system ran, but something is not quite right in the
scheduler (or in marking tasks "ready to run"), since my CPU burner (a
nice'd task) gets run when the system should be running the X server and
other applications.

Will be sending system logs and serial console output separately.
  --Mark

-----

First attempt.

[1] Build / boot to X server OK. The only unusual message is the atomic
counter underflow (in the system log, not in the console output).

[2] Set IRQ priorities and ran my "get_ltrace" script, which looks for
changes in the latency trace output and dumps them. It captured one
trace, and soon after, the machine locked up. There was no console
output for several seconds, then finally a stream of errors.

The serial console messages appear to be incomplete at the beginning
(not sure why). They look something like this:

Starting system message bus: [  OK  ]^M
(kdeinit/3196/CPU#0): new 1204 us maximum-latency wakeup.
(X/2832/CPU#1): new 68008 us maximum-latency wakeup.
(get_ltrace.sh/3482/CPU#1): new 3814 us maximum-latency wakeup.
(get_ltrace.sh/3530/CPU#1): new 3963 us maximum-latency wakeup.
(artsd/2965/CPU#1): new 78620 us maximum-latency wakeup.

  (124)
 [<c012454b>] profile_task_exit+0x1b/0x50 (24)
 [<c012691f>] do_exit+0x1f/0x5c0 (40)

[note how the prologue data is missing & the output starts in
the middle of a line]

The critical section nesting was "unique" and looks like

| preempt count: 00010005 ]
| 5-level deep critical section nesting:
----------------------------------------
.. [<c03257cf>] .... _spin_lock+0x1f/0x70
.....[<c01e217a>] ..   ( <= __up_write+0x26a/0x2a0)
.. [<c03257cf>] .... _spin_lock+0x1f/0x70
.....[<c01e1f65>] ..   ( <= __up_write+0x55/0x2a0)
.. [<c0325817>] .... _spin_lock+0x67/0x70
.....[<c011b54d>] ..   ( <= task_rq_lock+0x3d/0x70)
.. [<c03257cf>] .... _spin_lock+0x1f/0x70
.....[<c0115f47>] ..   ( <= nmi_watchdog_tick+0x127/0x140)
.. [<c013d5bd>] .... print_traces+0x1d/0x60
.....[<c0105bec>] ..   ( <= show_regs+0x14c/0x174)

The script then exits with a preempt count of 3 and an atomic counter
underflow BUG message. This is immediately followed by

BUG: Unable to handle kernel NULL pointer dereference at virtual address
00000020

and the appropriate dump messages. The cycle of script failure and
kernel BUG repeats a few times and then stops.

Let's reboot and not look at the latency traces.

Second attempt.

Boots OK and started the stress test.

[1] Got through about 80% of the X11perf stress test before the system
"live locked" again. The audio continued to come out, but the display
stopped updating. I found that Alt-Sysrq-L and Alt-Sysrq-T seemed to
work, so I did both while waiting for the audio test to stop on its own.

[2] During the second test, a similar symptom occurred: the display
froze during the top test, but the test finished as well.

[3] Running the third test (network output), I got a number of error
messages on the X display (but not on the serial console) indicating a
segmentation violation for the command
  sudo -u me rcp file remote-system
I was able to manipulate the display at this point.

[4] It then ran the fourth test (network input), and the symptom of
locking up the display repeated (still no serial console messages).

[5] The disk write test had the same display lockup (audio OK).
I did Alt-Sysrq-L and -T to see what the status of the tasks was, and
noticed that the cpu_burn program (nice'd) was still the active task
even though I had higher priority jobs (e.g., the X server) that should
have been ready to run.

[6] The disk copy and read tests showed the same symptoms, though the
display unfroze sometime later in the test (and then froze again...).
Something is not quite consistent in how the system runs under these
stress tests.

The application-level charts are "interesting". All of them show very
little overhead for the CPU task, with a few glitches. The audio loop
consistently finishes "early": the yellow line showing the nominal audio
cycle time is well above the white line showing the actual duration. To
recap, latencytest runs a loop like this...

  capture T1
  CPU burn 80% of nominal audio duration
  capture T2
  write next audio segment
  capture T3

where T2-T1 is the CPU time and T3-T1 is the audio time. Some of the
delays were extremely long, but I assume that was due to the occasional
use of the Alt-Sysrq keys. Some may also be due to the scheduling
problem I described above.



^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [patch] Real-Time Preemption, -RT-2.6.9-mm1-V0.5.2
@ 2004-10-28 15:02 Mark_H_Johnson
  2004-10-28 18:38 ` Ingo Molnar
  0 siblings, 1 reply; 10+ messages in thread
From: Mark_H_Johnson @ 2004-10-28 15:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Rui Nuno Capela, linux-kernel, Lee Revell, mark_h_johnson,
	K.R. Foley, Bill Huey, Adam Heath, Florian Schmidt,
	Thomas Gleixner, Michal Schmidt, Fernando Pablo Lopez-Lezcano,
	Karsten Wiese

I built with V0.5.1 and on a relatively idle system saw a number
of traces with > 4000 entries, with a wide range of latencies.
For example:

lt001.v1k1/lt.00: latency: 98910 us, entries: 4000 (24040)   |   [VP:0 KP:1 SP:1 HP:1 #CPUS:2]
lt001.v1k1/lt.01: latency: 4021 us, entries: 4000 (11139)   |   [VP:0 KP:1 SP:1 HP:1 #CPUS:2]
lt001.v1k1/lt.02: latency: 6597 us, entries: 4000 (13361)   |   [VP:0 KP:1 SP:1 HP:1 #CPUS:2]
lt001.v1k1/lt.03: latency: 8873 us, entries: 4000 (16109)   |   [VP:0 KP:1 SP:1 HP:1 #CPUS:2]
lt001.v1k1/lt.04: latency: 9092 us, entries: 4000 (17525)   |   [VP:0 KP:1 SP:1 HP:1 #CPUS:2]
lt001.v1k1/lt.05: latency: 10079 us, entries: 4000 (22351)   |   [VP:0 KP:1 SP:1 HP:1 #CPUS:2]
lt001.v1k1/lt.06: latency: 1976 us, entries: 2148 (2148)   |   [VP:0 KP:1 SP:1 HP:1 #CPUS:2]
lt001.v1k1/lt.07: latency: 50675 us, entries: 4000 (12102)   |   [VP:0 KP:1 SP:1 HP:1 #CPUS:2]
lt001.v1k1/lt.08: latency: 2822 us, entries: 4000 (7318)   |   [VP:0 KP:1 SP:1 HP:1 #CPUS:2]
lt001.v1k1/lt.09: latency: 5817 us, entries: 4000 (9574)   |   [VP:0 KP:1 SP:1 HP:1 #CPUS:2]
lt001.v1k1/lt.10: latency: 10811 us, entries: 4000 (26175)   |   [VP:0 KP:1 SP:1 HP:1 #CPUS:2]
lt001.v1k1/lt.11: latency: 2511 us, entries: 4000 (6204)   |   [VP:0 KP:1 SP:1 HP:1 #CPUS:2]
lt001.v1k1/lt.12: latency: 9194 us, entries: 4000 (28039)   |   [VP:0 KP:1 SP:1 HP:1 #CPUS:2]
lt001.v1k1/lt.13: latency: 12068 us, entries: 4000 (37143)   |   [VP:0 KP:1 SP:1 HP:1 #CPUS:2]
lt001.v1k1/lt.14: latency: 36768 us, entries: 4000 (101483)   |   [VP:0 KP:1 SP:1 HP:1 #CPUS:2]
lt001.v1k1/lt.15: latency: 96154 us, entries: 4000 (21364)   |   [VP:0 KP:1 SP:1 HP:1 #CPUS:2]
lt001.v1k1/lt.16: latency: 2759 us, entries: 4000 (7153)   |   [VP:0 KP:1 SP:1 HP:1 #CPUS:2]
lt001.v1k1/lt.17: latency: 6653 us, entries: 4000 (15161)   |   [VP:0 KP:1 SP:1 HP:1 #CPUS:2]
lt001.v1k1/lt.18: latency: 7322 us, entries: 4000 (12684)   |   [VP:0 KP:1 SP:1 HP:1 #CPUS:2]
lt001.v1k1/lt.19: latency: 7440 us, entries: 4000 (15368)   |   [VP:0 KP:1 SP:1 HP:1 #CPUS:2]
lt001.v1k1/lt.20: latency: 13838 us, entries: 4000 (30923)   |   [VP:0 KP:1 SP:1 HP:1 #CPUS:2]

It did not take long to collect this information. These may be false
positives; here is the start of one example (note the 00000000 preempt
count in some of the lines).

preemption latency trace v1.0.7 on 2.6.9-mm1-RT-V0.5.1
-------------------------------------------------------
 latency: 1976 us, entries: 2148 (2148)   |   [VP:0 KP:1 SP:1 HP:1 #CPUS:2]
    -----------------
    | task: get_ltrace.sh/3673, uid:0 nice:0 policy:1 rt_prio:50
    -----------------
 => started at: try_to_wake_up+0x1cc/0x330 <c011c1cc>
 => ended at:   finish_task_switch+0x41/0xc0 <c011c7e1>
=======>
80000000 0.000ms (+0.000ms): _spin_unlock (try_to_wake_up)
80000000 0.000ms (+0.000ms): (105) ((140))
80000000 0.000ms (+0.000ms): (6) ((0))
80000000 0.000ms (+0.000ms): resched_task (try_to_wake_up)
80000000 0.001ms (+0.000ms): _spin_unlock_irqrestore (try_to_wake_up)
80000000 0.001ms (+0.000ms): preempt_schedule (try_to_wake_up)
00000000 0.001ms (+0.000ms): preempt_schedule (cpu_idle)
80000000 0.002ms (+0.000ms): __sched_text_start (preempt_schedule)
80000000 0.002ms (+0.000ms): sched_clock (__sched_text_start)
80000000 0.002ms (+0.000ms): _spin_lock_irq (__sched_text_start)
80000000 0.003ms (+0.000ms): _spin_lock_irqsave (__sched_text_start)
80000000 0.003ms (+0.000ms): dequeue_task (__sched_text_start)
80000000 0.004ms (+0.000ms): recalc_task_prio (__sched_text_start)
80000000 0.004ms (+0.000ms): effective_prio (recalc_task_prio)
80000000 0.004ms (+0.000ms): enqueue_task (__sched_text_start)
80000000 0.005ms (+0.000ms): __switch_to (__sched_text_start)
80000000 0.005ms (+0.000ms): (0) ((6))
80000000 0.005ms (+0.000ms): (140) ((105))
80000000 0.006ms (+0.000ms): finish_task_switch (__sched_text_start)
80000000 0.006ms (+0.000ms): trace_stop_sched_switched (finish_task_switch)
80000000 0.006ms (+0.000ms): (6) ((105))
80000000 0.006ms (+0.000ms): _spin_lock (trace_stop_sched_switched)
80000000 0.007ms (+0.000ms): _spin_unlock (finish_task_switch)
80000000 0.007ms (+0.000ms): _spin_unlock_irq (finish_task_switch)
00000000 0.008ms (+0.000ms): _do_softirq (ksoftirqd)
80000000 0.008ms (+0.000ms): ___do_softirq (_do_softirq)
00000000 0.008ms (+0.000ms): run_timer_softirq (___do_softirq)
... ends like this ...
80000000 1.856ms (+0.000ms): rwsem_owner_del (__up_write)
80000000 1.857ms (+0.000ms): _spin_unlock (__up_write)
80000000 1.857ms (+0.000ms): _spin_unlock (__up_write)
00000000 1.858ms (+0.000ms): up (real_lookup)
00000000 1.858ms (+0.000ms): _up_write (up)
00000000 1.858ms (+0.000ms): __up_write (_up_write)
80000000 1.859ms (+0.000ms): _spin_lock (__up_write)
80000000 1.859ms (+0.000ms): _spin_lock (__up_write)
80000000 1.860ms (+0.000ms): rwsem_owner_del (__up_write)
80000000 1.861ms (+0.000ms): _spin_unlock (__up_write)
80000000 1.861ms (+0.000ms): _spin_unlock (__up_write)
00000000 1.862ms (+0.001ms): follow_mount (link_path_walk)
00000000 1.863ms (+33179.004ms): dput (link_path_walk)

I haven't run the stress tests yet but will send the system log
and these traces in a separate message so you can see what I'm seeing.

  --Mark


* Re: [patch] Real-Time Preemption, -RT-2.6.9-mm1-V0.3
@ 2004-10-27 13:53 Ingo Molnar
  2004-10-27 15:26 ` [patch] Real-Time Preemption, -RT-2.6.9-mm1-V0.4 Rui Nuno Capela
  0 siblings, 1 reply; 10+ messages in thread
From: Ingo Molnar @ 2004-10-27 13:53 UTC (permalink / raw)
  To: Rui Nuno Capela
  Cc: linux-kernel, Lee Revell, mark_h_johnson, K.R. Foley, Bill Huey,
	Adam Heath, Florian Schmidt, Thomas Gleixner, Michal Schmidt,
	Fernando Pablo Lopez-Lezcano, Karsten Wiese


* Rui Nuno Capela <rncbc@rncbc.org> wrote:

> OK. Currently with RT-V0.3.2.
> 
> So it seems that the jackd -R is no more an issue here.

great.

> However (oh no!:) those jackd -R xruns are still frequent, much more
> frequent than with RT-U3, which is my stable RT kernel atm.

-V0.4.1 could help with this problem. There were a number of places
where the PREEMPT_REALTIME kernel missed reschedules, so it could easily
happen that jackd sat in the runqueue waiting to be executed: the kernel
got out of a critical section quickly, but then 'forgot' to reschedule
for many milliseconds!

	Ingo


end of thread, other threads:[~2004-10-29 20:58 UTC | newest]

Thread overview: 10+ messages
2004-10-29 16:20 [patch] Real-Time Preemption, -RT-2.6.9-mm1-V0.5.2 Mark_H_Johnson
2004-10-29 16:26 ` Ingo Molnar
2004-10-29 17:52   ` John Gilbert
2004-10-29 20:30     ` Ingo Molnar
2004-10-29 20:41     ` Lee Revell
  -- strict thread matches above, loose matches on Subject: below --
2004-10-28 15:02 Mark_H_Johnson
2004-10-28 18:38 ` Ingo Molnar
2004-10-27 13:53 [patch] Real-Time Preemption, -RT-2.6.9-mm1-V0.3 Ingo Molnar
2004-10-27 15:26 ` [patch] Real-Time Preemption, -RT-2.6.9-mm1-V0.4 Rui Nuno Capela
2004-10-27 20:51   ` Ingo Molnar
2004-10-27 21:19     ` Ingo Molnar
2004-10-27 23:31       ` Rui Nuno Capela
2004-10-28  6:36         ` Ingo Molnar
2004-10-28  8:31           ` Rui Nuno Capela
2004-10-28  8:56             ` Ingo Molnar
2004-10-28  9:17               ` Rui Nuno Capela
2004-10-28  9:32                 ` Ingo Molnar
2004-10-28 13:57                   ` [patch] Real-Time Preemption, -RT-2.6.9-mm1-V0.5.2 Ingo Molnar
2004-10-28 14:10                     ` DaMouse
2004-10-28 14:18                       ` Ingo Molnar
