public inbox for linux-kernel@vger.kernel.org
* Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R1
@ 2004-09-07 16:54 Mark_H_Johnson
  2004-09-08 18:42 ` Ingo Molnar
  2004-09-09 16:02 ` Ingo Molnar
  0 siblings, 2 replies; 36+ messages in thread
From: Mark_H_Johnson @ 2004-09-07 16:54 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Lee Revell, Free Ekanayaka, Eric St-Laurent, linux-kernel,
	K.R. Foley, Felipe Alfaro Solana, Daniel Schmitt, P.O. Gaillard,
	nando, luke, free78

Results from this morning's test with -R1 and some follow up on
related messages.

Two runs with -R1 plus the following changes
 - a change to the ensonic driver to reduce latency
 - added mcount() calls in sched.c and timer_tsc.c
 - disabled inline functions in sched.c
The last two so we can see the steps a little more clearly.

The differences between the runs were:
 - first run had DRI and hardware acceleration enabled
 - second run disabled DRI / hardware acceleration
as suggested by Lee Revell.

From Lee Revell <rlrevell@joe-job.com>:
>This is looking more and more like a video driver problem:
>...
>The easiest way to eliminate this possibility is to disable DRI and set
>'Option "NoAccel"' in your X config.  Do you get the same mysterious
>latencies with this setting?

I am not sure I see any significant differences in latencies. Most of
the large delays occur during disk reads, and the number of traces >
500 usec was over 100 in both tests (about 30 minutes each). There were
also more latency traces per minute during disk reads when DRI was
disabled. However, there were none of the "find_next_bit" traces with
significant time delays in the NoAccel run, so that may indicate a
display-related problem, but I am not sure. See below for details.

Side Comment
============

If you look at the date / time of the traces, you will notice that
most occur in the latter part of the test. This is during the
"disk copy" and "disk read" parts of the testing. To illustrate, the
first 10 traces take about 20 minutes to collect; the last 90 are
collected in 15 minutes (first test) and 9 minutes (second test).

It is also encouraging that the longest trace is < 800 usec in these
two runs. This is far better than what I saw a couple weeks ago.

Shortest trace
==============

preemption latency trace v1.0.5 on 2.6.9-rc1-VP-R1
--------------------------------------------------
 latency: 550 us, entries: 6 (6)
    -----------------
    | task: cat/6771, uid:0 nice:0 policy:0 rt_prio:0
    -----------------
 => started at: kmap_atomic+0x23/0xe0
 => ended at:   kunmap_atomic+0x7b/0xa0
=======>
00000001 0.000ms (+0.000ms): kmap_atomic (file_read_actor)
00000001 0.000ms (+0.000ms): page_address (file_read_actor)
00000001 0.000ms (+0.549ms): __copy_to_user_ll (file_read_actor)
00000001 0.550ms (+0.000ms): kunmap_atomic (file_read_actor)
00000001 0.550ms (+0.000ms): sub_preempt_count (kunmap_atomic)
00000001 0.550ms (+0.000ms): update_max_trace (check_preempt_timing)

The 1/2 msec latency in __copy_to_user_ll occurred a few more times
in both tests.

Find Next Bit
=============

It appears that -R1 has significantly fewer cases where the code in
find_next_bit is delayed. To summarize:

# grep find_next_bit lt040903/lt*/lt.* | grep '(+0.1' | wc -l
63
# grep find_next_bit lt040907/lt*/lt.* | grep '(+0.1' | wc -l
0

[counting number of trace entries w/ 100-199 usec latencies]

The runs on September 3 were with -Q9; today's (September 7) were
with -R1. I am not sure why this changed, but it is encouraging. The
maximum find_next_bit delay with DRI / display acceleration was
+0.069ms; the maximum without DRI was +0.001ms.
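The per-threshold grep | wc pipelines above can be generalized with a
short awk histogram over the delay field. A minimal sketch, assuming
the trace line format shown in this message (the delay is the third
field, e.g. "(+0.131ms):", and the called function is the fourth);
the sample file here stands in for the real lt040907/lt*/lt.* trees:

```shell
# Count, per function, the trace entries with a delay of 100 usec or
# more. Sample lines copied from traces in this message stand in for
# the real trace files.
cat > /tmp/lt.sample <<'EOF'
00000001 0.000ms (+0.000ms): kmap_atomic (file_read_actor)
00000001 0.000ms (+0.549ms): __copy_to_user_ll (file_read_actor)
00010003 0.141ms (+0.131ms): mark_offset_tsc (timer_interrupt)
00010004 0.273ms (+0.145ms): mark_offset_tsc (timer_interrupt)
EOF
out_a=$(awk '{
    d = $3                     # e.g. "(+0.131ms):"
    gsub(/[(+):ms]/, "", d)    # strip down to the bare number
    if (d + 0 >= 0.100)        # keep delays of 100 usec and longer
        count[$4]++
}
END { for (f in count) print count[f], f }' /tmp/lt.sample | sort -rn)
echo "$out_a"
```

Run over a whole lt* tree, this gives all the per-function counts in
one pass instead of one grep pipeline per threshold.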

Mark Offset TSC
===============

In a similar way, long latencies in mark_offset_tsc were significantly
reduced in -R1.
# grep _tsc lt040903/lt*/lt.* | grep '(+0.1' | wc -l
24
# grep _tsc lt040907/lt*/lt.* | grep '(+0.1' | wc -l
3

Two of the long delays were in the same trace in the following
sequence:

00010001 0.140ms (+0.000ms): timer_interrupt (generic_handle_IRQ_event)
00010001 0.140ms (+0.000ms): spin_lock (timer_interrupt)
00010002 0.140ms (+0.000ms): spin_lock (<00000000>)
00010002 0.141ms (+0.000ms): mark_offset_tsc (timer_interrupt) [0]
00010002 0.141ms (+0.000ms): mark_offset_tsc (timer_interrupt) [1]
00010002 0.141ms (+0.000ms): spin_lock (mark_offset_tsc)
00010003 0.141ms (+0.000ms): spin_lock (<00000000>)
00010003 0.141ms (+0.131ms): mark_offset_tsc (timer_interrupt) [2]
00010003 0.273ms (+0.000ms): mark_offset_tsc (timer_interrupt) [3]
00010003 0.273ms (+0.000ms): spin_lock (mark_offset_tsc)
00010004 0.273ms (+0.000ms): spin_lock (<00000000>)
00010004 0.273ms (+0.145ms): mark_offset_tsc (timer_interrupt) [4]
00010004 0.419ms (+0.004ms): mark_offset_tsc (timer_interrupt) [5]
00010004 0.423ms (+0.002ms): mark_offset_tsc (timer_interrupt) [6]
00010004 0.425ms (+0.000ms): mark_offset_tsc (timer_interrupt)
00010004 0.425ms (+0.000ms): mark_offset_tsc (timer_interrupt)
00010003 0.426ms (+0.000ms): mark_offset_tsc (timer_interrupt)
00010003 0.426ms (+0.046ms): mark_offset_tsc (timer_interrupt)
00010003 0.472ms (+0.000ms): mark_offset_tsc (timer_interrupt)
00010002 0.472ms (+0.000ms): timer_interrupt (generic_handle_IRQ_event)

For reference, the steps in the code read (w/o comments):

        mcount();                   /* [1] */
        write_seqlock(&monotonic_lock);
        mcount();                   /* [2] */
        last_offset = ((unsigned long long)last_tsc_high<<32)|last_tsc_low;
        rdtsc(last_tsc_low, last_tsc_high);

        mcount();                   /* [3] */
        spin_lock(&i8253_lock);
        mcount();                   /* [4] */
        outb_p(0x00, PIT_MODE);     /* latch the count ASAP */
        mcount();                   /* [5] */

        count = inb_p(PIT_CH0);     /* read the latched count */
        mcount();                   /* [6] */
        count |= inb(PIT_CH0) << 8;
        mcount();                   /* [7] */

I numbered the traces to match the corresponding mcount() calls. If I
labeled this right, the delays were in the rdtsc call and in the outb_p
that latches the count. Perhaps a hardware-level delay is taking place.
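To tie each large gap back to a numbered region without reading the
whole trace by hand, the bracketed markers can be pulled out
mechanically. A small sketch (the sample lines are copied from the
trace above; the 100 usec threshold is arbitrary):

```shell
# Print "marker delay" for every bracketed mcount entry whose
# following gap is 100 usec or more.
cat > /tmp/mark_offset.trace <<'EOF'
00010003 0.141ms (+0.131ms): mark_offset_tsc (timer_interrupt) [2]
00010003 0.273ms (+0.000ms): mark_offset_tsc (timer_interrupt) [3]
00010004 0.273ms (+0.145ms): mark_offset_tsc (timer_interrupt) [4]
00010004 0.419ms (+0.004ms): mark_offset_tsc (timer_interrupt) [5]
EOF
out_b=$(awk '/\[[0-9]+\]$/ {
    d = $3; gsub(/[(+):ms]/, "", d)   # "(+0.131ms):" -> "0.131"
    if (d + 0 >= 0.100) print $NF, d  # marker label and its gap
}' /tmp/mark_offset.trace)
echo "$out_b"
```

The two lines printed here point at the gaps following markers [2]
and [4], i.e. the same two regions called out above.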

Clear Page Tables
=================

This is the longest single latency with the following traces:

# grep '(+0.6' lt040907/lt*/lt.*
lt040907/lt001.v3k1/lt.28:00000001 0.001ms (+0.635ms): clear_page_tables (exit_mmap)
lt040907/lt002.v3k1/lt.75:00000001 0.001ms (+0.628ms): clear_page_tables (exit_mmap)

__modify_IO_APIC_irq
====================

Not quite sure what is happening here, but this trace

preemption latency trace v1.0.5 on 2.6.9-rc1-VP-R1
--------------------------------------------------
 latency: 612 us, entries: 111 (111)
    -----------------
    | task: kblockd/0/10, uid:0 nice:-10 policy:0 rt_prio:0
    -----------------
 => started at: __spin_lock_irqsave+0x4b/0xa0
 => ended at:   as_work_handler+0x5c/0xa0
=======>
00000001 0.000ms (+0.000ms): __spin_lock_irqsave (spin_lock_irq)
00000001 0.000ms (+0.000ms): generic_enable_irq (ide_do_request)
00000001 0.000ms (+0.000ms): __spin_lock_irqsave (generic_enable_irq)
00000001 0.000ms (+0.000ms): __spin_lock_irqsave (<00000000>)
00000002 0.001ms (+0.000ms): unmask_IO_APIC_irq (generic_enable_irq)
00000002 0.001ms (+0.000ms): __spin_lock_irqsave (unmask_IO_APIC_irq)
00000002 0.001ms (+0.000ms): __spin_lock_irqsave (<00000000>)
00000003 0.001ms (+0.000ms): __unmask_IO_APIC_irq (unmask_IO_APIC_irq)
00000003 0.002ms (+0.567ms): __modify_IO_APIC_irq (__unmask_IO_APIC_irq)
00010001 0.569ms (+0.000ms): do_nmi (smp_apic_timer_interrupt)
00010001 0.569ms (+0.002ms): do_nmi (sched_clock)
00010001 0.572ms (+0.000ms): profile_tick (nmi_watchdog_tick)
00010001 0.572ms (+0.000ms): profile_hook (profile_tick)
00010001 0.573ms (+0.000ms): read_lock (profile_hook)
00010002 0.573ms (+0.000ms): read_lock (<00000000>)
00010002 0.573ms (+0.000ms): notifier_call_chain (profile_hook)
00010001 0.574ms (+0.000ms): profile_hit (nmi_watchdog_tick)
00000001 0.575ms (+0.000ms): smp_apic_timer_interrupt (as_work_handler)

is an example where __modify_IO_APIC_irq shows up with a long latency.
There were about a dozen long latencies (> 100 usec) in the two tests
run this morning, slightly fewer (6 vs 8) in the first test with DRI
and display acceleration turned on. This appears to be an increase
with -R1; tests over three days (-Q5 through -Q9) only had 4 long
latencies at this spot. [Hmm - that might be due to the change in
the way we compute / display the latencies.]

Spin Lock
=========

We seem to have gotten stuck here in a spin lock...

 => started at: spin_lock+0x34/0x90
 => ended at:   journal_commit_transaction+0xa25/0x1c30
=======>
00000001 0.000ms (+0.559ms): spin_lock (journal_commit_transaction)

and here...

 => started at: __read_lock_irqsave+0x93/0xa0
 => ended at:   snd_pcm_lib_write1+0x3ba/0x600
=======>
00000001 0.000ms (+0.000ms): __read_lock_irqsave (read_lock_irq)
00000001 0.000ms (+0.000ms): spin_lock (snd_pcm_lib_write1)
00000002 0.000ms (+0.000ms): spin_lock (<00000000>)
00000002 0.000ms (+0.000ms): snd_pcm_update_hw_ptr (snd_pcm_lib_write1)
...
00000002 0.008ms (+0.000ms): snd_ensoniq_trigger (snd_pcm_do_stop)
00000002 0.009ms (+0.000ms): spin_lock (snd_ensoniq_trigger)
00000003 0.009ms (+0.549ms): spin_lock (<00000000>)

Overall, the counts of traces referring to spin_lock were...

# grep 'spin_lock (' lt040907/lt*/* | grep -v '+0.0' | wc -l
35
# grep 'spin_lock (' lt040903/lt*/* | grep -v '+0.0' | wc -l
11

More of these traces in -R1 than -Q9, but I am not sure if this is
significant or not.

Context Switch
==============

With the inlined functions disabled, and some extra debug code,
I saw context_switch appear > 800 times in the long latency traces.
It is usually very fast (< 10 usec) but sometimes listed with some
long latencies (> 100 usec, in 45 trace lines).
For example:

 => started at: schedule+0x5b/0x610
 => ended at:   schedule+0x385/0x610
...
00000002 0.004ms (+0.000ms): schedule (do_irqd)
00000002 0.005ms (+0.000ms): context_switch (schedule)
00000002 0.005ms (+0.000ms): dummy_cs_entry (context_switch)
00000002 0.005ms (+0.000ms): context_switch (schedule)
00000002 0.005ms (+0.000ms): dummy_cs_switch_mm (context_switch)
00000002 0.005ms (+0.111ms): context_switch (schedule)
00000002 0.116ms (+0.000ms): dummy_cs_unlikely_if (context_switch)
00000002 0.117ms (+0.000ms): context_switch (schedule)
00000002 0.117ms (+0.000ms): dummy_cs_switch_to (context_switch)
00000002 0.117ms (+0.029ms): context_switch (schedule)
00000002 0.147ms (+0.128ms): __switch_to (context_switch)
00000002 0.275ms (+0.001ms): dummy_cs_exit (context_switch)
00000002 0.277ms (+0.155ms): context_switch (schedule)
00000002 0.432ms (+0.000ms): finish_task_switch (schedule)
00000002 0.433ms (+0.000ms): smp_apic_timer_interrupt (finish_task_switch)
00010002 0.433ms (+0.137ms): profile_tick (smp_apic_timer_interrupt)
00010002 0.571ms (+0.000ms): profile_hook (profile_tick)
00010002 0.571ms (+0.000ms): read_lock (profile_hook)
00010003 0.571ms (+0.000ms): read_lock (<00000000>)
00010003 0.571ms (+0.000ms): notifier_call_chain (profile_hook)
00010002 0.572ms (+0.000ms): profile_hit (smp_apic_timer_interrupt)

The dummy_cs lines refer to calls I added to context_switch at each
step to read...

        dummy_cs_entry();
        if (unlikely(!mm)) {
                next->active_mm = oldmm;
                atomic_inc(&oldmm->mm_count);
                dummy_cs_lazy_tlb();
                enter_lazy_tlb(oldmm, next);
        } else {
                dummy_cs_switch_mm();
                switch_mm(oldmm, mm, next);
        }

        dummy_cs_unlikely_if();
        if (unlikely(!prev->mm)) {
                prev->active_mm = NULL;
                WARN_ON(rq->prev_mm);
                rq->prev_mm = oldmm;
        }

        /* Here we just switch the register state and the stack. */
        dummy_cs_switch_to();
        switch_to(prev, next, prev);

        dummy_cs_exit();
        return prev;

The switch_mm path of the if appears to be more costly time-wise
than the lazy_tlb path. To see this in the detailed traces, try
something like...

grep -C2 'dummy_cs_switch_mm' lt040907/lt*/* | less -im
or
grep -C2 'dummy_cs_lazy_tlb' lt040907/lt*/* | less -im
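To put numbers on that impression instead of eyeballing the grep -C2
output, one could attribute the delay on the line following each
dummy_cs marker to that marker (the delay reported for an entry is the
time until the next entry). A sketch under that assumption; the sample
lines are from the context_switch trace above:

```shell
# Average, per dummy_cs marker, the delay reported on the line that
# follows it. Sample lines from the trace above stand in for the
# real files.
cat > /tmp/cs.trace <<'EOF'
00000002 0.005ms (+0.000ms): dummy_cs_switch_mm (context_switch)
00000002 0.005ms (+0.111ms): context_switch (schedule)
00000002 0.116ms (+0.000ms): dummy_cs_unlikely_if (context_switch)
00000002 0.117ms (+0.000ms): context_switch (schedule)
EOF
out_c=$(awk '
    marker != "" { d = $3; gsub(/[(+):ms]/, "", d)
                   sum[marker] += d; n[marker]++; marker = "" }
    $4 ~ /^dummy_cs_(switch_mm|lazy_tlb)$/ { marker = $4 }
    END { for (m in sum)
              printf "%s %.3f avg over %d\n", m, sum[m]/n[m], n[m] }
' /tmp/cs.trace)
echo "$out_c"
```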

Closing
=======

I will send copies of the traces (not to the mailing lists) in a few
minutes for review / analysis.

I also downloaded the -R8 patch and other associated patches and will
be building that soon.
  --Mark


^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R1
@ 2004-09-13 14:44 Mark_H_Johnson
  2004-09-14 18:32 ` Ingo Molnar
  0 siblings, 1 reply; 36+ messages in thread
From: Mark_H_Johnson @ 2004-09-13 14:44 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric St-Laurent, Free Ekanayaka, free78, K.R. Foley, linux-kernel,
	Felipe Alfaro Solana, luke, nando, P.O. Gaillard, Daniel Schmitt,
	Lee Revell

>most of your remaining latencies seem to be get_swap_page() related -
>the attached (highly experimental) patch might fix that particular
>latency. (ontop of -S0).

Qualitatively, -S0 with this patch is a little worse than what I saw with
-R8.
From the last message I sent...

> -R8 on September 10 - 38 traces > 200 usec, only 3 > 500 usec
-S0 on September 13 - 78 traces > 200 usec, only 3 > 500 usec

So I have more relatively small delays in the same test period - almost
twice as many - with few that I can see being swap related. Most are
network related. There was one very long trace > 70 msec; this is the
third day I've had a single multiple-millisecond trace in a run.

Key contributors seem to be:
 - IRQ sequence         - 00 02 50 64 73 75 76 77
 - modprobe             - 01
 - rtl8139_poll         - 03 04 07 08 09 10 12 13 15 16 17 18 19 21 24 47
                        - 49 72 74
 - tcp_prequeue_process - 05
 - release_sock         - 06 14
 - tcp_copy_to_iovec    - 11
 - do_munmap            - 20 [very long trace]
 - twkill_work          - 22
 - exit_mmap            - 23 25 26 27 28 29 30 31 32 36 37 38 39 40 41 44
                        - 46 48 51 52 53 54 55 57 58 59 60 61 62 66 67 68
                        - 69 70 71
 - kswapd               - 33 34 35 42 63
 - avc_insert           - 43 45 65
 - schedule             - 56

IRQ Sequence
============

Hard to describe - no specific steps that take a long time, just a
long series of activities, apparently IRQ related, that add up to a
> 300 usec time delay. Many of the ones I noted have mouse / signal
related activity; see also the "schedule" trace near the end for a
similar condition.

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-S0
-------------------------------------------------------
 latency: 342 us, entries: 280 (280)   |   [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
    -----------------
    | task: kjournald/191, uid:0 nice:0 policy:0 rt_prio:0
    -----------------
 => started at: __spin_lock_irqsave+0x2b/0xb0
 => ended at:   _spin_unlock_irq+0x2e/0x60
=======>
[should look at the full traces for details]

modprobe
========

The shortest trace. I had run
  system-config-soundcard
to play the test sound before starting the real time test program to
make sure the sound system ran OK. Not something I would expect to do
on a real time system, but something to fix (or document as a limitation).

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-S0
-------------------------------------------------------
 latency: 460 us, entries: 9 (9)   |   [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
    -----------------
    | task: modprobe/2995, uid:0 nice:-10 policy:0 rt_prio:0
    -----------------
 => started at: __spin_lock_irqsave+0x2b/0xb0
 => ended at:   _spin_unlock_irq+0x2e/0x60
=======>
00000001 0.000ms (+0.000ms): __spin_lock_irqsave (spin_lock_irq)
00000001 0.000ms (+0.000ms): __spin_lock_irqsave (<00000000>)
00000001 0.000ms (+0.454ms): __find_symbol (resolve_symbol)
00000001 0.455ms (+0.000ms): use_module (resolve_symbol)
00000001 0.455ms (+0.003ms): already_uses (use_module)
00000001 0.459ms (+0.001ms): kmem_cache_alloc (use_module)
00000001 0.460ms (+0.000ms): _spin_unlock_irq (resolve_symbol)
00000001 0.460ms (+0.000ms): sub_preempt_count (_spin_unlock_irq)
00000001 0.461ms (+0.000ms): update_max_trace (check_preempt_timing)

rtl8139_poll
============

We've talked about this one before, but it still shows up on a regular basis.

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-S0
-------------------------------------------------------
 latency: 347 us, entries: 245 (245)   |   [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
    -----------------
    | task: ksoftirqd/0/3, uid:0 nice:-10 policy:0 rt_prio:0
    -----------------
 => started at: spin_lock+0x24/0x90
 => ended at:   rtl8139_rx+0x219/0x340
=======>
[again - no big steps, just a large number w/o reschedule opportunity]

tcp_prequeue_process
====================

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-S0
-------------------------------------------------------
 latency: 271 us, entries: 294 (294)   |   [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
    -----------------
    | task: rcp/3591, uid:2711 nice:0 policy:0 rt_prio:0
    -----------------
 => started at: tcp_prequeue_process+0x49/0xb0
 => ended at:   tcp_write_xmit+0x249/0x330
=======>
[appears similar to the others - many steps, each one very short]

release_sock
============

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-S0
-------------------------------------------------------
 latency: 209 us, entries: 275 (275)   |   [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
    -----------------
    | task: rcp/3705, uid:2711 nice:0 policy:0 rt_prio:0
    -----------------
 => started at: __spin_lock_irqsave+0x2b/0xb0
 => ended at:   cond_resched_softirq+0x65/0x90
=======>
00000001 0.000ms (+0.000ms): __spin_lock_irqsave (spin_lock_bh)
00000001 0.000ms (+0.000ms): __spin_lock_irqsave (<00000000>)
00000101 0.000ms (+0.000ms): __release_sock (release_sock)
00000101 0.000ms (+0.000ms): _spin_unlock (__release_sock)
... long sequence, not long in each step ...

tcp_copy_to_iovec
=================

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-S0
-------------------------------------------------------
 latency: 210 us, entries: 243 (243)   |   [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
    -----------------
    | task: rcp/3894, uid:2711 nice:0 policy:0 rt_prio:0
    -----------------
 => started at: tcp_copy_to_iovec+0x5b/0xb0
 => ended at:   local_bh_enable+0x17/0xa0
=======>
00000100 0.000ms (+0.002ms): tcp_copy_to_iovec (tcp_rcv_established)
00000100 0.002ms (+0.005ms): tcp_event_data_recv (tcp_rcv_established)
00000100 0.007ms (+0.000ms): tcp_send_ack (tcp_rcv_established)
... large number of steps, each one very short in duration ...

do_munmap
=========

The very long trace (> 4000 trace entries, > 70 msec). The following
lists all steps until we get "stuck" with roughly 1 msec cycles. Note:
I don't have the "how we got out" part of this trace, but see previous
messages for that kind of detail.

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-S0
-------------------------------------------------------
 latency: 71065 us, entries: 4000 (5343)   |   [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
    -----------------
    | task: latencytest/4408, uid:0 nice:0 policy:1 rt_prio:99
    -----------------
 => started at: spin_lock+0x24/0x90
 => ended at:   cond_resched_lock+0x1d/0xa0
=======>
00000001 0.000ms (+0.000ms): spin_lock (do_munmap)
00000001 0.000ms (+0.000ms): spin_lock (<00000000>)
00000001 0.000ms (+0.000ms): unmap_region (do_munmap)
00000001 0.001ms (+0.000ms): lru_add_drain (unmap_region)
00000002 0.001ms (+0.000ms): __pagevec_lru_add_active (lru_add_drain)
00000002 0.002ms (+0.000ms): spin_lock_irq (__pagevec_lru_add_active)
00000002 0.002ms (+0.000ms): __spin_lock_irqsave (spin_lock_irq)
00000003 0.002ms (+0.001ms): __spin_lock_irqsave (<00000000>)
00000003 0.003ms (+0.000ms): _spin_unlock_irq (__pagevec_lru_add_active)
00000002 0.003ms (+0.001ms): release_pages (__pagevec_lru_add_active)
00000001 0.004ms (+0.001ms): __bitmap_weight (unmap_region)
00000001 0.006ms (+0.001ms): unmap_vmas (unmap_region)
00000001 0.007ms (+0.000ms): unmap_page_range (unmap_vmas)
00000001 0.007ms (+0.000ms): zap_pmd_range (unmap_page_range)
00000001 0.008ms (+0.000ms): zap_pte_range (zap_pmd_range)
00000001 0.009ms (+0.000ms): kmap_atomic (zap_pte_range)
00000002 0.009ms (+0.001ms): page_address (zap_pte_range)
00000002 0.011ms (+0.000ms): set_page_dirty (zap_pte_range)
00000002 0.012ms (+0.000ms): page_remove_rmap (zap_pte_range)
00000002 0.013ms (+0.000ms): set_page_dirty (zap_pte_range)
00000002 0.013ms (+0.000ms): page_remove_rmap (zap_pte_range)
00000002 0.013ms (+0.000ms): set_page_dirty (zap_pte_range)
00000002 0.013ms (+0.000ms): page_remove_rmap (zap_pte_range)
00000002 0.014ms (+0.000ms): set_page_dirty (zap_pte_range)
00000002 0.014ms (+0.000ms): page_remove_rmap (zap_pte_range)
00000002 0.015ms (+0.000ms): set_page_dirty (zap_pte_range)
00000002 0.015ms (+0.000ms): page_remove_rmap (zap_pte_range)
00000002 0.015ms (+0.000ms): set_page_dirty (zap_pte_range)
00000002 0.016ms (+0.000ms): page_remove_rmap (zap_pte_range)
00000002 0.016ms (+0.000ms): set_page_dirty (zap_pte_range)
00000002 0.016ms (+0.001ms): page_remove_rmap (zap_pte_range)
00000002 0.017ms (+0.000ms): smp_apic_timer_interrupt (zap_pte_range)
00010002 0.018ms (+0.000ms): profile_tick (smp_apic_timer_interrupt)
00010002 0.018ms (+0.000ms): profile_hook (profile_tick)
00010002 0.018ms (+0.000ms): read_lock (profile_hook)
00010003 0.019ms (+0.000ms): read_lock (<00000000>)
00010003 0.019ms (+0.000ms): notifier_call_chain (profile_hook)
00010003 0.019ms (+0.000ms): _read_unlock (profile_tick)
00010002 0.020ms (+0.000ms): profile_hit (smp_apic_timer_interrupt)
00010002 0.021ms (+0.000ms): update_process_times (smp_apic_timer_interrupt)
00010002 0.021ms (+0.000ms): update_one_process (update_process_times)
00010002 0.022ms (+0.000ms): run_local_timers (update_process_times)
00010002 0.022ms (+0.000ms): raise_softirq (update_process_times)
00010002 0.023ms (+0.000ms): scheduler_tick (update_process_times)
00010002 0.023ms (+0.002ms): sched_clock (scheduler_tick)
00010002 0.026ms (+0.000ms): spin_lock (scheduler_tick)
00010003 0.027ms (+0.000ms): spin_lock (<00000000>)
00010003 0.027ms (+0.000ms): _spin_unlock (scheduler_tick)
00010002 0.027ms (+0.001ms): rebalance_tick (scheduler_tick)
00000003 0.028ms (+0.000ms): do_softirq (smp_apic_timer_interrupt)
00000003 0.029ms (+0.001ms): __do_softirq (do_softirq)
00000002 0.030ms (+0.000ms): set_page_dirty (zap_pte_range)
00000002 0.031ms (+0.000ms): page_remove_rmap (zap_pte_range)
00000002 0.031ms (+0.001ms): kunmap_atomic (zap_pte_range)
00000001 0.032ms (+0.002ms): flush_tlb_mm (unmap_vmas)
00000002 0.034ms (+0.000ms): flush_tlb_others (flush_tlb_mm)
00000002 0.035ms (+0.000ms): spin_lock (flush_tlb_others)
00000003 0.035ms (+0.000ms): spin_lock (<00000000>)
00000003 0.036ms (+0.000ms): send_IPI_mask (flush_tlb_others)
00000003 0.037ms (+0.158ms): send_IPI_mask_bitmask (flush_tlb_others)
00010003 0.195ms (+0.000ms): do_nmi (flush_tlb_others)
00010003 0.195ms (+0.002ms): do_nmi (__trace)
00010003 0.198ms (+0.000ms): notifier_call_chain (default_do_nmi)
00010003 0.199ms (+0.000ms): profile_tick (nmi_watchdog_tick)
00010003 0.199ms (+0.000ms): profile_hook (profile_tick)
00010003 0.199ms (+0.000ms): read_lock (profile_hook)
00010004 0.199ms (+0.000ms): read_lock (<00000000>)
00010004 0.199ms (+0.000ms): notifier_call_chain (profile_hook)
00010004 0.200ms (+0.000ms): _read_unlock (profile_tick)
00010003 0.200ms (+0.001ms): profile_hit (nmi_watchdog_tick)
00010003 0.202ms (+0.000ms): do_IRQ (flush_tlb_others)
00010003 0.202ms (+0.000ms): do_IRQ (<00000000>)
00010003 0.202ms (+0.000ms): spin_lock (do_IRQ)
00010004 0.203ms (+0.000ms): spin_lock (<00000000>)
00010004 0.203ms (+0.000ms): ack_edge_ioapic_irq (do_IRQ)
00010004 0.203ms (+0.000ms): redirect_hardirq (do_IRQ)
00010004 0.204ms (+0.000ms): _spin_unlock (do_IRQ)
00010003 0.204ms (+0.000ms): handle_IRQ_event (do_IRQ)
00010003 0.204ms (+0.000ms): timer_interrupt (handle_IRQ_event)
00010003 0.204ms (+0.000ms): spin_lock (timer_interrupt)
00010004 0.205ms (+0.000ms): spin_lock (<00000000>)
00010004 0.205ms (+0.000ms): mark_offset_tsc (timer_interrupt)
00010004 0.205ms (+0.000ms): mark_offset_tsc (timer_interrupt)
00010004 0.206ms (+0.000ms): spin_lock (mark_offset_tsc)
00010005 0.206ms (+0.000ms): spin_lock (<00000000>)
00010005 0.206ms (+0.000ms): mark_offset_tsc (timer_interrupt)
00010005 0.206ms (+0.000ms): mark_offset_tsc (timer_interrupt)
00010005 0.206ms (+0.000ms): spin_lock (mark_offset_tsc)
00010006 0.207ms (+0.000ms): spin_lock (<00000000>)
00010006 0.207ms (+0.003ms): mark_offset_tsc (timer_interrupt)
00010006 0.211ms (+0.004ms): mark_offset_tsc (timer_interrupt)
00010006 0.215ms (+0.002ms): mark_offset_tsc (timer_interrupt)
00010006 0.217ms (+0.000ms): mark_offset_tsc (timer_interrupt)
00010006 0.217ms (+0.000ms): mark_offset_tsc (timer_interrupt)
00010006 0.217ms (+0.000ms): _spin_unlock (mark_offset_tsc)
00010005 0.218ms (+0.000ms): mark_offset_tsc (timer_interrupt)
00010005 0.218ms (+0.000ms): mark_offset_tsc (timer_interrupt)
00010005 0.218ms (+0.000ms): mark_offset_tsc (timer_interrupt)
00010005 0.219ms (+0.000ms): _spin_unlock (mark_offset_tsc)
00010004 0.219ms (+0.000ms): timer_interrupt (handle_IRQ_event)
00010004 0.219ms (+0.000ms): spin_lock (timer_interrupt)
00010005 0.220ms (+0.003ms): spin_lock (<00000000>)
00010005 0.223ms (+0.000ms): _spin_unlock (timer_interrupt)
00010004 0.224ms (+0.000ms): do_timer (timer_interrupt)
00010004 0.224ms (+0.000ms): update_wall_time (do_timer)
00010004 0.225ms (+0.000ms): update_wall_time_one_tick (update_wall_time)
00010004 0.225ms (+0.000ms): _spin_unlock (timer_interrupt)
00010003 0.226ms (+0.000ms): spin_lock (do_IRQ)
00010004 0.226ms (+0.000ms): spin_lock (<00000000>)
00010004 0.226ms (+0.000ms): note_interrupt (do_IRQ)
00010004 0.227ms (+0.000ms): end_edge_ioapic_irq (do_IRQ)
00010004 0.227ms (+0.000ms): _spin_unlock (do_IRQ)
00000004 0.227ms (+0.000ms): do_softirq (do_IRQ)
00000004 0.227ms (+0.788ms): __do_softirq (do_softirq)
00000003 1.016ms (+0.000ms): smp_apic_timer_interrupt (flush_tlb_others)
00010003 1.016ms (+0.000ms): profile_tick (smp_apic_timer_interrupt)
00010003 1.017ms (+0.000ms): profile_hook (profile_tick)
00010003 1.017ms (+0.000ms): read_lock (profile_hook)
00010004 1.017ms (+0.000ms): read_lock (<00000000>)
00010004 1.017ms (+0.000ms): notifier_call_chain (profile_hook)
00010004 1.017ms (+0.000ms): _read_unlock (profile_tick)
00010003 1.018ms (+0.000ms): profile_hit (smp_apic_timer_interrupt)
00010003 1.018ms (+0.000ms): update_process_times (smp_apic_timer_interrupt)
00010003 1.018ms (+0.000ms): update_one_process (update_process_times)
00010003 1.018ms (+0.000ms): run_local_timers (update_process_times)
00010003 1.018ms (+0.000ms): raise_softirq (update_process_times)
00010003 1.019ms (+0.000ms): scheduler_tick (update_process_times)
00010003 1.019ms (+0.000ms): sched_clock (scheduler_tick)
00010003 1.019ms (+0.000ms): spin_lock (scheduler_tick)
00010004 1.019ms (+0.000ms): spin_lock (<00000000>)
00010004 1.019ms (+0.000ms): _spin_unlock (scheduler_tick)
00010003 1.020ms (+0.000ms): rebalance_tick (scheduler_tick)
00000004 1.020ms (+0.000ms): do_softirq (smp_apic_timer_interrupt)
00000004 1.020ms (+0.173ms): __do_softirq (do_softirq)
00010003 1.193ms (+0.000ms): do_nmi (flush_tlb_others)
00010003 1.194ms (+0.001ms): do_nmi (__trace)
00010003 1.195ms (+0.000ms): notifier_call_chain (default_do_nmi)
00010003 1.196ms (+0.000ms): profile_tick (nmi_watchdog_tick)
00010003 1.196ms (+0.000ms): profile_hook (profile_tick)
00010003 1.196ms (+0.000ms): read_lock (profile_hook)
00010004 1.196ms (+0.000ms): read_lock (<00000000>)
00010004 1.196ms (+0.000ms): notifier_call_chain (profile_hook)
00010004 1.197ms (+0.000ms): _read_unlock (profile_tick)
00010003 1.197ms (+0.000ms): profile_hit (nmi_watchdog_tick)
00010003 1.198ms (+0.000ms): do_IRQ (flush_tlb_others)
00010003 1.198ms (+0.000ms): do_IRQ (<00000000>)
00010003 1.198ms (+0.000ms): spin_lock (do_IRQ)
00010004 1.198ms (+0.000ms): spin_lock (<00000000>)
00010004 1.198ms (+0.000ms): ack_edge_ioapic_irq (do_IRQ)
00010004 1.199ms (+0.000ms): redirect_hardirq (do_IRQ)
00010004 1.199ms (+0.000ms): _spin_unlock (do_IRQ)
00010003 1.199ms (+0.000ms): handle_IRQ_event (do_IRQ)
00010003 1.199ms (+0.000ms): timer_interrupt (handle_IRQ_event)
00010003 1.199ms (+0.000ms): spin_lock (timer_interrupt)
00010004 1.200ms (+0.000ms): spin_lock (<00000000>)
00010004 1.200ms (+0.000ms): mark_offset_tsc (timer_interrupt)
00010004 1.200ms (+0.000ms): mark_offset_tsc (timer_interrupt)
00010004 1.200ms (+0.000ms): spin_lock (mark_offset_tsc)
00010005 1.200ms (+0.000ms): spin_lock (<00000000>)
00010005 1.201ms (+0.000ms): mark_offset_tsc (timer_interrupt)
00010005 1.201ms (+0.000ms): mark_offset_tsc (timer_interrupt)
00010005 1.201ms (+0.000ms): spin_lock (mark_offset_tsc)
00010006 1.201ms (+0.000ms): spin_lock (<00000000>)
00010006 1.201ms (+0.003ms): mark_offset_tsc (timer_interrupt)
00010006 1.205ms (+0.004ms): mark_offset_tsc (timer_interrupt)
00010006 1.209ms (+0.002ms): mark_offset_tsc (timer_interrupt)
00010006 1.212ms (+0.000ms): mark_offset_tsc (timer_interrupt)
00010006 1.212ms (+0.000ms): mark_offset_tsc (timer_interrupt)
00010006 1.212ms (+0.000ms): _spin_unlock (mark_offset_tsc)
00010005 1.212ms (+0.000ms): mark_offset_tsc (timer_interrupt)
00010005 1.213ms (+0.000ms): mark_offset_tsc (timer_interrupt)
00010005 1.213ms (+0.000ms): mark_offset_tsc (timer_interrupt)
00010005 1.213ms (+0.000ms): _spin_unlock (mark_offset_tsc)
00010004 1.213ms (+0.000ms): timer_interrupt (handle_IRQ_event)
00010004 1.213ms (+0.000ms): spin_lock (timer_interrupt)
00010005 1.214ms (+0.003ms): spin_lock (<00000000>)
00010005 1.217ms (+0.000ms): _spin_unlock (timer_interrupt)
00010004 1.218ms (+0.000ms): do_timer (timer_interrupt)
00010004 1.218ms (+0.000ms): update_wall_time (do_timer)
00010004 1.218ms (+0.000ms): update_wall_time_one_tick (update_wall_time)
00010004 1.218ms (+0.000ms): _spin_unlock (timer_interrupt)
00010003 1.219ms (+0.000ms): spin_lock (do_IRQ)
00010004 1.219ms (+0.000ms): spin_lock (<00000000>)
00010004 1.219ms (+0.000ms): note_interrupt (do_IRQ)
00010004 1.219ms (+0.000ms): end_edge_ioapic_irq (do_IRQ)
00010004 1.219ms (+0.000ms): _spin_unlock (do_IRQ)
00000004 1.220ms (+0.000ms): do_softirq (do_IRQ)
00000004 1.220ms (+0.795ms): __do_softirq (do_softirq)
00000003 2.016ms (+0.000ms): smp_apic_timer_interrupt (flush_tlb_others)
... the steps over the last millisecond get repeated for several cycles ...

twkill_work
===========

A very long sequence (> 1500 trace entries) with non-zero preempt
values that end in "00". Does that mean we did not hold any locks but
could not schedule for other reasons?
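If the leading hex field is the task's preempt_count in the usual 2.6
layout - low byte the preemption depth, next byte the softirq count,
hardirq count above that; an assumption worth checking against the
tracer source - then a value like 00000100 would mean softirq context
with no spinlocks held, which by itself is enough to prevent
rescheduling. A small decoder sketch under that assumption:

```shell
# Decode a trace preempt field, assuming the 2.6 preempt_count layout
# (bits 0-7 preempt depth, bits 8-15 softirq count, bits 16 and up
# hardirq count). The layout is an assumption; verify it against the
# tracer source before trusting the split.
decode() {
    v=$(printf '%d' "0x$1")
    echo "preempt=$((v & 0xff)) softirq=$(((v >> 8) & 0xff)) hardirq=$((v >> 16))"
}
decode 00000100   # softirq context, no locks held
decode 00010002   # inside a hardirq, preempt depth 2
```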

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-S0
-------------------------------------------------------
 latency: 429 us, entries: 1531 (1531)   |   [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
    -----------------
    | task: events/0/6, uid:0 nice:-10 policy:0 rt_prio:0
    -----------------
 => started at: __spin_lock_irqsave+0x2b/0xb0
 => ended at:   twkill_work+0xb5/0xe0
=======>
00000001 0.000ms (+0.000ms): __spin_lock_irqsave (spin_lock_bh)
00000001 0.000ms (+0.000ms): __spin_lock_irqsave (<00000000>)
00000101 0.000ms (+0.000ms): tcp_do_twkill_work (twkill_work)
00000101 0.000ms (+0.000ms): _spin_unlock (tcp_do_twkill_work)
00000100 0.001ms (+0.000ms): tcp_timewait_kill (tcp_do_twkill_work)
00000100 0.001ms (+0.000ms): write_lock (tcp_timewait_kill)
00000101 0.001ms (+0.000ms): write_lock (<00000000>)
00000101 0.002ms (+0.000ms): _write_unlock (tcp_timewait_kill)
00000100 0.002ms (+0.000ms): spin_lock (tcp_timewait_kill)
00000101 0.002ms (+0.000ms): spin_lock (<00000000>)
00000101 0.002ms (+0.000ms): tcp_bucket_destroy (tcp_timewait_kill)
00000101 0.003ms (+0.000ms): kmem_cache_free (tcp_bucket_destroy)
00000101 0.003ms (+0.000ms): _spin_unlock (tcp_timewait_kill)
00000100 0.003ms (+0.000ms): kmem_cache_free (tcp_do_twkill_work)
00000100 0.003ms (+0.000ms): spin_lock (tcp_do_twkill_work)
00000101 0.004ms (+0.000ms): spin_lock (<00000000>)
00000101 0.004ms (+0.000ms): _spin_unlock (tcp_do_twkill_work)
00000100 0.004ms (+0.000ms): tcp_timewait_kill (tcp_do_twkill_work)
00000100 0.004ms (+0.000ms): write_lock (tcp_timewait_kill)
...

exit_mmap
=========

This one has a relatively low delay for clear_page_tables, but
several other traces show 100 to 200 usec delays at this step. [Though
if you add the delay spent handling the nmi_watchdog, this one falls
in the 100-200 usec range as well.]

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-S0
-------------------------------------------------------
 latency: 266 us, entries: 146 (146)   |   [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
    -----------------
    | task: get_ltrace.sh/5099, uid:0 nice:-9 policy:0 rt_prio:0
    -----------------
 => started at: spin_lock+0x24/0x90
 => ended at:   _spin_unlock+0x2d/0x60
=======>
00000001 0.000ms (+0.000ms): spin_lock (cond_resched_lock)
00000001 0.000ms (+0.000ms): spin_lock (<00000000>)
00000001 0.000ms (+0.000ms): __bitmap_weight (unmap_vmas)
00000001 0.000ms (+0.001ms): vm_acct_memory (exit_mmap)
00000001 0.001ms (+0.056ms): clear_page_tables (exit_mmap)
00010001 0.058ms (+0.000ms): do_nmi (clear_page_tables)
00010001 0.058ms (+0.002ms): do_nmi (__mcount)
00010001 0.060ms (+0.000ms): notifier_call_chain (default_do_nmi)
00010001 0.061ms (+0.000ms): profile_tick (nmi_watchdog_tick)
00010001 0.061ms (+0.000ms): profile_hook (profile_tick)
00010001 0.061ms (+0.000ms): read_lock (profile_hook)
00010002 0.061ms (+0.000ms): read_lock (<00000000>)
00010002 0.061ms (+0.000ms): notifier_call_chain (profile_hook)
00010002 0.062ms (+0.000ms): _read_unlock (profile_tick)
00010001 0.062ms (+0.091ms): profile_hit (nmi_watchdog_tick)
00010001 0.153ms (+0.000ms): do_IRQ (clear_page_tables)
00010001 0.153ms (+0.000ms): do_IRQ (<0000000a>)
00010001 0.153ms (+0.000ms): spin_lock (do_IRQ)
00010002 0.154ms (+0.000ms): spin_lock (<00000000>)
00010002 0.154ms (+0.000ms): mask_and_ack_level_ioapic_irq (do_IRQ)
00010002 0.154ms (+0.000ms): mask_IO_APIC_irq (mask_and_ack_level_ioapic_irq)
00010002 0.154ms (+0.000ms): __spin_lock_irqsave (mask_IO_APIC_irq)
00010003 0.154ms (+0.000ms): __spin_lock_irqsave (<00000000>)
00010003 0.154ms (+0.000ms): __mask_IO_APIC_irq (mask_IO_APIC_irq)
00010003 0.155ms (+0.018ms): __modify_IO_APIC_irq (__mask_IO_APIC_irq)
00010003 0.173ms (+0.000ms): _spin_unlock_irqrestore (mask_IO_APIC_irq)
00010002 0.174ms (+0.000ms): redirect_hardirq (do_IRQ)
00010002 0.174ms (+0.000ms): _spin_unlock (do_IRQ)
00010001 0.174ms (+0.000ms): handle_IRQ_event (do_IRQ)
00010001 0.174ms (+0.000ms): usb_hcd_irq (handle_IRQ_event)
...

kswapd
======

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-S0
-------------------------------------------------------
 latency: 557 us, entries: 840 (840)   |   [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
    -----------------
    | task: kswapd0/40, uid:0 nice:0 policy:0 rt_prio:0
    -----------------
 => started at: spin_lock+0x24/0x90
 => ended at:   _spin_unlock+0x2d/0x60
=======>
00000001 0.000ms (+0.000ms): spin_lock (page_referenced_file)
00000001 0.000ms (+0.000ms): spin_lock (<00000000>)
00000001 0.000ms (+0.000ms): vma_prio_tree_next (page_referenced_file)
00000001 0.001ms (+0.002ms): prio_tree_first (vma_prio_tree_next)
00000001 0.004ms (+0.002ms): prio_tree_left (prio_tree_first)
00000001 0.006ms (+0.000ms): page_referenced_one (page_referenced_file)
00000001 0.007ms (+0.000ms): spin_lock (page_referenced_one)
00000002 0.007ms (+0.001ms): spin_lock (<00000000>)
00000002 0.009ms (+0.001ms): kmap_atomic (page_referenced_one)
00000003 0.011ms (+0.000ms): page_address (page_referenced_one)
00000003 0.012ms (+0.000ms): kunmap_atomic (page_referenced_one)
00000002 0.012ms (+0.000ms): _spin_unlock (page_referenced_one)
00000001 0.012ms (+0.000ms): vma_prio_tree_next (page_referenced_file)
00000001 0.013ms (+0.000ms): page_referenced_one (page_referenced_file)
00000001 0.013ms (+0.003ms): spin_lock (page_referenced_one)
00000002 0.017ms (+0.003ms): spin_lock (<00000000>)
00000002 0.020ms (+0.000ms): kmap_atomic (page_referenced_one)

avc_insert
==========

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-S0
-------------------------------------------------------
 latency: 266 us, entries: 80 (80)   |   [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
    -----------------
    | task: fam/2565, uid:0 nice:0 policy:0 rt_prio:0
    -----------------
 => started at: __spin_lock_irqsave+0x2b/0xb0
 => ended at:   _spin_unlock_irqrestore+0x32/0x70
=======>
00000001 0.000ms (+0.000ms): __spin_lock_irqsave (avc_has_perm_noaudit)
00000001 0.000ms (+0.000ms): __spin_lock_irqsave (<00000000>)
00000001 0.000ms (+0.204ms): avc_insert (avc_has_perm_noaudit)
00000001 0.204ms (+0.000ms): memcpy (avc_has_perm_noaudit)
00000001 0.204ms (+0.001ms): _spin_unlock_irqrestore (avc_has_perm_noaudit)
00010001 0.205ms (+0.000ms): do_IRQ (_spin_unlock_irqrestore)
00010001 0.206ms (+0.000ms): do_IRQ (<0000000a>)
00010001 0.206ms (+0.000ms): spin_lock (do_IRQ)
00010002 0.206ms (+0.000ms): spin_lock (<00000000>)
...

schedule
========

I have not talked about schedule in a while - this looks like
something different than before. It appears to involve some deep
nesting of preempt-disabling activity.

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-S0
-------------------------------------------------------
 latency: 237 us, entries: 426 (426)   |   [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
    -----------------
    | task: latencytest/5314, uid:0 nice:0 policy:1 rt_prio:99
    -----------------
 => started at: schedule+0x59/0x5b0
 => ended at:   schedule+0x39d/0x5b0
=======>
00000001 0.000ms (+0.000ms): schedule (ksoftirqd)
00000001 0.000ms (+0.000ms): schedule (ksoftirqd)
00000001 0.000ms (+0.000ms): schedule (ksoftirqd)
00000001 0.000ms (+0.000ms): schedule (ksoftirqd)
00000001 0.000ms (+0.000ms): schedule (ksoftirqd)
00000001 0.000ms (+0.000ms): schedule (ksoftirqd)
00000001 0.000ms (+0.000ms): sched_clock (schedule)
00000001 0.001ms (+0.000ms): schedule (ksoftirqd)
00000001 0.001ms (+0.000ms): schedule (ksoftirqd)
00000001 0.001ms (+0.000ms): schedule (ksoftirqd)
00000001 0.001ms (+0.000ms): spin_lock_irq (schedule)
00000001 0.001ms (+0.000ms): __spin_lock_irqsave (spin_lock_irq)
00000002 0.002ms (+0.000ms): __spin_lock_irqsave (<00000000>)
00000002 0.002ms (+0.000ms): schedule (ksoftirqd)
00000002 0.002ms (+0.000ms): deactivate_task (schedule)
00000002 0.002ms (+0.000ms): dequeue_task (deactivate_task)
00000002 0.002ms (+0.000ms): schedule (ksoftirqd)
00000002 0.002ms (+0.000ms): schedule (ksoftirqd)
00000002 0.003ms (+0.000ms): dependent_sleeper (schedule)
00000002 0.003ms (+0.000ms): schedule (ksoftirqd)
00000002 0.003ms (+0.000ms): schedule (ksoftirqd)
00000002 0.003ms (+0.000ms): schedule (ksoftirqd)
00000002 0.003ms (+0.000ms): schedule (ksoftirqd)
00000002 0.003ms (+0.000ms): dummy_switch_tasks (schedule)
00000002 0.003ms (+0.000ms): schedule (ksoftirqd)
00000002 0.004ms (+0.000ms): schedule (ksoftirqd)
00000002 0.004ms (+0.000ms): schedule (ksoftirqd)
00000002 0.004ms (+0.000ms): schedule (ksoftirqd)
00000002 0.004ms (+0.000ms): schedule (ksoftirqd)
00000002 0.004ms (+0.000ms): sched_info_switch (schedule)
00000002 0.004ms (+0.000ms): sched_info_depart (sched_info_switch)
00000002 0.005ms (+0.000ms): sched_info_arrive (schedule)
00000002 0.005ms (+0.000ms): sched_info_dequeued (sched_info_arrive)
00000002 0.005ms (+0.000ms): schedule (ksoftirqd)
00000002 0.005ms (+0.000ms): context_switch (schedule)
00000002 0.005ms (+0.000ms): dummy_cs_entry (context_switch)
00000002 0.005ms (+0.000ms): context_switch (schedule)
00000002 0.006ms (+0.000ms): dummy_cs_switch_mm (context_switch)
00000002 0.006ms (+0.000ms): context_switch (schedule)
00000002 0.006ms (+0.000ms): dummy_cs_unlikely_if (context_switch)
00000002 0.006ms (+0.000ms): context_switch (schedule)
00000002 0.006ms (+0.000ms): dummy_cs_switch_to (context_switch)
00000002 0.006ms (+0.000ms): context_switch (schedule)
04000002 0.007ms (+0.000ms): __switch_to (context_switch)
04000002 0.007ms (+0.000ms): dummy_cs_exit (context_switch)
04000002 0.007ms (+0.000ms): context_switch (schedule)
04000002 0.007ms (+0.000ms): finish_task_switch (schedule)
04000002 0.008ms (+0.000ms): _spin_unlock_irq (finish_task_switch)
04010002 0.008ms (+0.000ms): do_IRQ (_spin_unlock_irq)
04010002 0.008ms (+0.000ms): do_IRQ (<0000000a>)
04010002 0.008ms (+0.000ms): spin_lock (do_IRQ)
04010003 0.009ms (+0.000ms): spin_lock (<00000000>)
04010003 0.009ms (+0.000ms): mask_and_ack_level_ioapic_irq (do_IRQ)
04010003 0.009ms (+0.000ms): mask_IO_APIC_irq (mask_and_ack_level_ioapic_irq)
04010003 0.009ms (+0.000ms): __spin_lock_irqsave (mask_IO_APIC_irq)
04010004 0.009ms (+0.000ms): __spin_lock_irqsave (<00000000>)
04010004 0.009ms (+0.000ms): __mask_IO_APIC_irq (mask_IO_APIC_irq)
04010004 0.010ms (+0.020ms): __modify_IO_APIC_irq (__mask_IO_APIC_irq)
04010004 0.030ms (+0.000ms): _spin_unlock_irqrestore (mask_IO_APIC_irq)
04010003 0.030ms (+0.000ms): redirect_hardirq (do_IRQ)
04010003 0.030ms (+0.000ms): _spin_unlock (do_IRQ)
04010002 0.031ms (+0.000ms): handle_IRQ_event (do_IRQ)
04010002 0.031ms (+0.000ms): usb_hcd_irq (handle_IRQ_event)
04010002 0.031ms (+0.001ms): uhci_irq (usb_hcd_irq)
04010002 0.032ms (+0.000ms): spin_lock (uhci_irq)
04010003 0.032ms (+0.000ms): spin_lock (<00000000>)
04010003 0.032ms (+0.000ms): uhci_get_current_frame_number (uhci_irq)
04010003 0.033ms (+0.000ms): uhci_free_pending_qhs (uhci_irq)
04010003 0.033ms (+0.000ms): uhci_free_pending_tds (uhci_irq)
...
The nesting gets even deeper, with the preempt-count column going up
to 04010008 while the system appears to be delivering signals based on
mouse input. None of the individual steps takes very long, but with
over 400 steps in the sequence it shows up in the trace record.

  --Mark


^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R1
@ 2004-09-10 14:28 Mark_H_Johnson
  2004-09-10 19:29 ` Ingo Molnar
  0 siblings, 1 reply; 36+ messages in thread
From: Mark_H_Johnson @ 2004-09-10 14:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric St-Laurent, Free Ekanayaka, free78, K.R. Foley, linux-kernel,
	Felipe Alfaro Solana, luke, nando, P.O. Gaillard, Daniel Schmitt,
	Lee Revell

>does 'hdparm -X udma2 /dev/hda' work?

Yes it does, and quite well actually. For comparison:
  -R1 on September 7 - over 100 traces > 500 usec
  -R8 on September 10 - 38 traces > 200 usec, only 3 > 500 usec

This was with the full test suite (latencytest active, all different
types of operations).

Besides what I already reported today, there were a few additional
types of traces:

network poll - Traces 00 01 02 03 04 06 07 08 16 17 29 32
twkill_work  - Trace  05
VL kswapd0   - Trace  12
spin lock/nmi- Trace  15
do_IRQ       - Trace  25 26 27
the previously reported symptoms...
exit mmap    - Traces 09 18 19 20 21 22 23 24 28 30 31 33 34 35 36 37
kswapd0      - Traces 10 11 13 14

Network Poll
============
preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-R8-nd
-------------------------------------------------------
 latency: 332 us, entries: 198 (198)   |   [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
    -----------------
    | task: ksoftirqd/1/5, uid:0 nice:-10 policy:0 rt_prio:0
    -----------------
 => started at: spin_lock+0x24/0x90
 => ended at:   rtl8139_rx+0x219/0x340
=======>
00000001 0.000ms (+0.000ms): spin_lock (rtl8139_poll)
00000001 0.000ms (+0.001ms): spin_lock (<00000000>)
00000001 0.001ms (+0.003ms): rtl8139_rx (rtl8139_poll)
... we've been down this path before ...
No individual step takes > 10 usec; the latency comes from the sheer
number of entries involved.

twkill_work
===========

I have not seen this one before....

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-R8-nd
-------------------------------------------------------
 latency: 499 us, entries: 1858 (1858)   |   [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
    -----------------
    | task: events/0/6, uid:0 nice:-10 policy:0 rt_prio:0
    -----------------
 => started at: twkill_work+0xb5/0xe0
 => ended at:   twkill_work+0xb5/0xe0
=======>
00000101 0.000ms (+0.000ms): touch_preempt_timing (twkill_work)
00000101 0.000ms (+0.000ms): smp_apic_timer_interrupt (touch_preempt_timing)
00010101 0.000ms (+0.000ms): profile_tick (smp_apic_timer_interrupt)
00010101 0.000ms (+0.000ms): profile_hook (profile_tick)
00010101 0.000ms (+0.000ms): read_lock (profile_hook)
00010102 0.001ms (+0.000ms): read_lock (<00000000>)
00010102 0.001ms (+0.000ms): notifier_call_chain (profile_hook)
00010102 0.001ms (+0.000ms): _read_unlock (profile_tick)
00010101 0.002ms (+0.000ms): profile_hit (smp_apic_timer_interrupt)
00010101 0.002ms (+0.000ms): update_process_times (smp_apic_timer_interrupt)
00010101 0.002ms (+0.000ms): update_one_process (update_process_times)
00010101 0.003ms (+0.000ms): run_local_timers (update_process_times)
00010101 0.003ms (+0.000ms): raise_softirq (update_process_times)
00010101 0.003ms (+0.000ms): scheduler_tick (update_process_times)
00010101 0.004ms (+0.000ms): sched_clock (scheduler_tick)
00010101 0.005ms (+0.000ms): rcu_check_callbacks (scheduler_tick)
00010101 0.005ms (+0.001ms): idle_cpu (rcu_check_callbacks)
00010101 0.006ms (+0.000ms): spin_lock (scheduler_tick)
00010102 0.006ms (+0.000ms): spin_lock (<00000000>)
00010102 0.007ms (+0.000ms): task_timeslice (scheduler_tick)
00010102 0.007ms (+0.001ms): __bitmap_weight (scheduler_tick)
00010102 0.009ms (+0.000ms): _spin_unlock (scheduler_tick)
00010101 0.010ms (+0.001ms): rebalance_tick (scheduler_tick)
00010101 0.011ms (+0.000ms): do_IRQ (touch_preempt_timing)
00010101 0.011ms (+0.000ms): do_IRQ (<00000000>)
00010101 0.012ms (+0.000ms): spin_lock (do_IRQ)
00010102 0.012ms (+0.000ms): spin_lock (<00000000>)
00010102 0.012ms (+0.000ms): ack_edge_ioapic_irq (do_IRQ)
00010102 0.013ms (+0.000ms): generic_redirect_hardirq (do_IRQ)
00010102 0.013ms (+0.000ms): _spin_unlock (do_IRQ)
... Look at this next sequence ...
00010102 0.033ms (+0.000ms): end_edge_ioapic_irq (do_IRQ)
00010102 0.033ms (+0.000ms): _spin_unlock (do_IRQ)
00000101 0.034ms (+0.000ms): tcp_do_twkill_work (twkill_work)
00000101 0.035ms (+0.000ms): _spin_unlock (tcp_do_twkill_work)
00000100 0.035ms (+0.000ms): tcp_timewait_kill (tcp_do_twkill_work)
00000100 0.035ms (+0.000ms): write_lock (tcp_timewait_kill)
00000101 0.036ms (+0.000ms): write_lock (<00000000>)
00000101 0.037ms (+0.000ms): _write_unlock (tcp_timewait_kill)
00000100 0.037ms (+0.000ms): spin_lock (tcp_timewait_kill)
00000101 0.037ms (+0.000ms): spin_lock (<00000000>)
00000101 0.038ms (+0.000ms): tcp_bucket_destroy (tcp_timewait_kill)
00000101 0.038ms (+0.000ms): kmem_cache_free (tcp_bucket_destroy)
00000101 0.038ms (+0.000ms): cache_flusharray (kmem_cache_free)
... or this one ...
00000100 0.050ms (+0.000ms): spin_lock (tcp_do_twkill_work)
00000101 0.051ms (+0.000ms): spin_lock (<00000000>)
00000101 0.051ms (+0.000ms): _spin_unlock (tcp_do_twkill_work)
00000100 0.051ms (+0.000ms): tcp_timewait_kill (tcp_do_twkill_work)
00000100 0.051ms (+0.000ms): write_lock (tcp_timewait_kill)
00000101 0.052ms (+0.000ms): write_lock (<00000000>)
00000101 0.052ms (+0.000ms): _write_unlock (tcp_timewait_kill)
00000100 0.053ms (+0.000ms): spin_lock (tcp_timewait_kill)
00000101 0.053ms (+0.000ms): spin_lock (<00000000>)
00000101 0.053ms (+0.000ms): tcp_bucket_destroy (tcp_timewait_kill)
00000101 0.053ms (+0.000ms): kmem_cache_free (tcp_bucket_destroy)
00000101 0.054ms (+0.000ms): _spin_unlock (tcp_timewait_kill)
00000100 0.054ms (+0.000ms): kmem_cache_free (tcp_do_twkill_work)
00000100 0.054ms (+0.000ms): spin_lock (tcp_do_twkill_work)
00000101 0.054ms (+0.000ms): spin_lock (<00000000>)
00000101 0.054ms (+0.000ms): _spin_unlock (tcp_do_twkill_work)
00000100 0.055ms (+0.000ms): tcp_timewait_kill (tcp_do_twkill_work)
... and ends like this ...
00000101 0.496ms (+0.000ms): _write_unlock (tcp_timewait_kill)
00000100 0.496ms (+0.000ms): preempt_schedule (tcp_timewait_kill)
00000100 0.496ms (+0.000ms): spin_lock (tcp_timewait_kill)
00000101 0.496ms (+0.000ms): spin_lock (<00000000>)
00000101 0.497ms (+0.000ms): tcp_bucket_destroy (tcp_timewait_kill)
00000101 0.497ms (+0.000ms): kmem_cache_free (tcp_bucket_destroy)
00000101 0.497ms (+0.000ms): _spin_unlock (tcp_timewait_kill)
00000100 0.497ms (+0.000ms): preempt_schedule (tcp_timewait_kill)
00000100 0.498ms (+0.000ms): kmem_cache_free (tcp_do_twkill_work)
00000100 0.498ms (+0.000ms): spin_lock (tcp_do_twkill_work)
00000101 0.498ms (+0.000ms): spin_lock (<00000000>)
00000101 0.499ms (+0.000ms): touch_preempt_timing (twkill_work)
00000101 0.499ms (+0.000ms): update_max_trace (check_preempt_timing)

I looked briefly at the -R8 patch and saw the changed condition for
breaking the lock, but judging from this trace, it may not be enough.

VL kswapd0
==========

An extremely long latency (> 6 msec) appears to be a combination of
effects, perhaps tracing-related. It starts like this...

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-R8-nd
-------------------------------------------------------
 latency: 6776 us, entries: 548 (548)   |   [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
    -----------------
    | task: kswapd0/40, uid:0 nice:0 policy:0 rt_prio:0
    -----------------
 => started at: spin_lock+0x24/0x90
 => ended at:   _spin_unlock+0x2d/0x60
=======>
00000001 0.000ms (+0.000ms): spin_lock (page_referenced_file)
00000001 0.000ms (+0.000ms): spin_lock (<00000000>)
00000001 0.000ms (+0.000ms): vma_prio_tree_next (page_referenced_file)
00000001 0.001ms (+0.001ms): prio_tree_first (vma_prio_tree_next)
00000001 0.002ms (+0.000ms): page_referenced_one (page_referenced_file)
00000001 0.002ms (+0.000ms): spin_lock (page_referenced_one)
00000002 0.003ms (+0.001ms): spin_lock (<00000000>)
00000002 0.004ms (+0.000ms): kmap_atomic (page_referenced_one)
00000003 0.004ms (+0.000ms): page_address (page_referenced_one)
00000003 0.005ms (+0.000ms): kunmap_atomic (page_referenced_one)
00000002 0.005ms (+0.000ms): _spin_unlock (page_referenced_one)
00000001 0.005ms (+0.000ms): vma_prio_tree_next (page_referenced_file)
00000001 0.006ms (+0.000ms): page_referenced_one (page_referenced_file)
00000001 0.006ms (+0.000ms): spin_lock (page_referenced_one)
00000002 0.007ms (+0.000ms): spin_lock (<00000000>)
... looks like the typical kswapd0 cycle until this point ...
00000001 0.126ms (+0.000ms): page_referenced_one (page_referenced_file)
00000001 0.127ms (+0.000ms): spin_lock (page_referenced_one)
00000002 0.127ms (+0.000ms): spin_lock (<00000000>)
00000002 0.128ms (+0.000ms): kmap_atomic (page_referenced_one)
00000003 0.128ms (+0.000ms): page_address (page_referenced_one)
00000003 0.129ms (+0.000ms): flush_tlb_page (page_referenced_one)
00000004 0.129ms (+0.000ms): flush_tlb_others (flush_tlb_page)
00000004 0.130ms (+0.000ms): spin_lock (flush_tlb_others)
00000005 0.130ms (+0.000ms): spin_lock (<00000000>)
00000005 0.131ms (+0.000ms): send_IPI_mask (flush_tlb_others)
00000005 0.132ms (+0.790ms): send_IPI_mask_bitmask (flush_tlb_others)
00010005 0.922ms (+0.000ms): do_nmi (flush_tlb_others)
00010005 0.923ms (+0.003ms): do_nmi (touch_preempt_timing)
00010005 0.926ms (+0.000ms): notifier_call_chain (default_do_nmi)
00010005 0.926ms (+0.000ms): profile_tick (nmi_watchdog_tick)
00010005 0.926ms (+0.000ms): profile_hook (profile_tick)
00010005 0.927ms (+0.000ms): read_lock (profile_hook)
00010006 0.927ms (+0.000ms): read_lock (<00000000>)
00010006 0.927ms (+0.000ms): notifier_call_chain (profile_hook)
00010006 0.928ms (+0.000ms): _read_unlock (profile_tick)
00010005 0.928ms (+0.027ms): profile_hit (nmi_watchdog_tick)
00000005 0.955ms (+0.000ms): smp_apic_timer_interrupt (flush_tlb_others)
00010005 0.955ms (+0.000ms): profile_tick (smp_apic_timer_interrupt)
00010005 0.956ms (+0.000ms): profile_hook (profile_tick)
00010005 0.956ms (+0.000ms): read_lock (profile_hook)
00010006 0.956ms (+0.000ms): read_lock (<00000000>)
00010006 0.956ms (+0.000ms): notifier_call_chain (profile_hook)
00010006 0.956ms (+0.000ms): _read_unlock (profile_tick)
00010005 0.956ms (+0.000ms): profile_hit (smp_apic_timer_interrupt)
00010005 0.957ms (+0.000ms): update_process_times (smp_apic_timer_interrupt)
00010005 0.957ms (+0.000ms): update_one_process (update_process_times)
00010005 0.957ms (+0.000ms): run_local_timers (update_process_times)
00010005 0.957ms (+0.000ms): raise_softirq (update_process_times)
00010005 0.958ms (+0.000ms): scheduler_tick (update_process_times)
00010005 0.958ms (+0.000ms): sched_clock (scheduler_tick)
00010005 0.958ms (+0.000ms): spin_lock (scheduler_tick)
00010006 0.958ms (+0.000ms): spin_lock (<00000000>)
00010006 0.959ms (+0.000ms): task_timeslice (scheduler_tick)
00010006 0.959ms (+0.000ms): __bitmap_weight (scheduler_tick)
00010006 0.959ms (+0.000ms): _spin_unlock (scheduler_tick)
00010005 0.959ms (+0.000ms): rebalance_tick (scheduler_tick)
00000006 0.960ms (+0.000ms): do_softirq (smp_apic_timer_interrupt)
00000006 0.960ms (+0.000ms): __do_softirq (do_softirq)
00000006 0.960ms (+0.000ms): wake_up_process (do_softirq)
00000006 0.960ms (+0.000ms): try_to_wake_up (wake_up_process)
00000006 0.960ms (+0.000ms): task_rq_lock (try_to_wake_up)
00000006 0.961ms (+0.000ms): spin_lock (task_rq_lock)
00000007 0.961ms (+0.000ms): spin_lock (<00000000>)
00000007 0.961ms (+0.000ms): wake_idle (try_to_wake_up)
00000007 0.961ms (+0.000ms): activate_task (try_to_wake_up)
00000007 0.961ms (+0.000ms): sched_clock (activate_task)
00000007 0.962ms (+0.000ms): recalc_task_prio (activate_task)
00000007 0.962ms (+0.000ms): effective_prio (recalc_task_prio)
00000007 0.962ms (+0.000ms): __activate_task (try_to_wake_up)
00000007 0.962ms (+0.000ms): enqueue_task (__activate_task)
00000007 0.962ms (+0.000ms): sched_info_queued (enqueue_task)
00000007 0.963ms (+0.000ms): resched_task (try_to_wake_up)
00000007 0.963ms (+0.000ms): task_rq_unlock (try_to_wake_up)
00000007 0.963ms (+0.000ms): _spin_unlock_irqrestore (try_to_wake_up)
00000006 0.964ms (+0.958ms): preempt_schedule (try_to_wake_up)
00010005 1.922ms (+0.000ms): do_nmi (flush_tlb_others)
00010005 1.922ms (+0.003ms): do_nmi (check_preempt_timing)
00010005 1.926ms (+0.000ms): notifier_call_chain (default_do_nmi)
00010005 1.926ms (+0.000ms): profile_tick (nmi_watchdog_tick)
00010005 1.926ms (+0.000ms): profile_hook (profile_tick)
00010005 1.926ms (+0.000ms): read_lock (profile_hook)
00010006 1.927ms (+0.000ms): read_lock (<00000000>)
00010006 1.927ms (+0.000ms): notifier_call_chain (profile_hook)
00010006 1.927ms (+0.000ms): _read_unlock (profile_tick)
00010005 1.927ms (+0.000ms): preempt_schedule (profile_tick)
00010005 1.928ms (+0.027ms): profile_hit (nmi_watchdog_tick)
00000005 1.955ms (+0.000ms): smp_apic_timer_interrupt (flush_tlb_others)
00010005 1.955ms (+0.000ms): profile_tick (smp_apic_timer_interrupt)
00010005 1.955ms (+0.000ms): profile_hook (profile_tick)
00010005 1.955ms (+0.000ms): read_lock (profile_hook)
00010006 1.955ms (+0.000ms): read_lock (<00000000>)
00010006 1.956ms (+0.000ms): notifier_call_chain (profile_hook)
00010006 1.956ms (+0.000ms): _read_unlock (profile_tick)
00010005 1.956ms (+0.000ms): preempt_schedule (profile_tick)
00010005 1.956ms (+0.000ms): profile_hit (smp_apic_timer_interrupt)
00010005 1.956ms (+0.000ms): update_process_times (smp_apic_timer_interrupt)
00010005 1.956ms (+0.000ms): update_one_process (update_process_times)
00010005 1.957ms (+0.000ms): run_local_timers (update_process_times)
00010005 1.957ms (+0.000ms): raise_softirq (update_process_times)
00010005 1.957ms (+0.000ms): scheduler_tick (update_process_times)
00010005 1.957ms (+0.000ms): sched_clock (scheduler_tick)
00010005 1.958ms (+0.000ms): spin_lock (scheduler_tick)
00010006 1.958ms (+0.000ms): spin_lock (<00000000>)
00010006 1.958ms (+0.000ms): task_timeslice (scheduler_tick)
00010006 1.958ms (+0.000ms): __bitmap_weight (scheduler_tick)
00010006 1.958ms (+0.000ms): _spin_unlock (scheduler_tick)
00010005 1.959ms (+0.000ms): preempt_schedule (scheduler_tick)
00010005 1.959ms (+0.000ms): rebalance_tick (scheduler_tick)
00000006 1.959ms (+0.000ms): do_softirq (smp_apic_timer_interrupt)
00000006 1.959ms (+0.961ms): __do_softirq (do_softirq)
00010005 2.921ms (+0.000ms): do_nmi (flush_tlb_others)
00010005 2.922ms (+0.001ms): do_nmi (check_preempt_timing)
... looks like we got stuck for milliseconds but finally finished ...
00000006 4.959ms (+0.000ms): do_softirq (smp_apic_timer_interrupt)
00000006 4.959ms (+0.960ms): __do_softirq (do_softirq)
00010005 5.919ms (+0.000ms): do_nmi (flush_tlb_others)
00010005 5.920ms (+0.003ms): do_nmi (__trace)
00010005 5.924ms (+0.000ms): notifier_call_chain (default_do_nmi)
00010005 5.924ms (+0.000ms): profile_tick (nmi_watchdog_tick)
00010005 5.924ms (+0.000ms): profile_hook (profile_tick)
00010005 5.924ms (+0.000ms): read_lock (profile_hook)
00010006 5.924ms (+0.000ms): read_lock (<00000000>)
00010006 5.925ms (+0.000ms): notifier_call_chain (profile_hook)
00010006 5.925ms (+0.000ms): _read_unlock (profile_tick)
00010005 5.925ms (+0.000ms): preempt_schedule (profile_tick)
00010005 5.925ms (+0.027ms): profile_hit (nmi_watchdog_tick)
00000005 5.952ms (+0.000ms): smp_apic_timer_interrupt (flush_tlb_others)
00010005 5.953ms (+0.000ms): profile_tick (smp_apic_timer_interrupt)
00010005 5.953ms (+0.000ms): profile_hook (profile_tick)
00010005 5.953ms (+0.000ms): read_lock (profile_hook)
00010006 5.953ms (+0.000ms): read_lock (<00000000>)
00010006 5.953ms (+0.000ms): notifier_call_chain (profile_hook)
00010006 5.953ms (+0.000ms): _read_unlock (profile_tick)
00010005 5.954ms (+0.000ms): preempt_schedule (profile_tick)
00010005 5.954ms (+0.000ms): profile_hit (smp_apic_timer_interrupt)
00010005 5.954ms (+0.000ms): update_process_times (smp_apic_timer_interrupt)
00010005 5.954ms (+0.000ms): update_one_process (update_process_times)
00010005 5.954ms (+0.000ms): run_local_timers (update_process_times)
00010005 5.955ms (+0.000ms): raise_softirq (update_process_times)
00010005 5.955ms (+0.000ms): scheduler_tick (update_process_times)
00010005 5.955ms (+0.000ms): sched_clock (scheduler_tick)
00010005 5.955ms (+0.000ms): spin_lock (scheduler_tick)
00010006 5.956ms (+0.000ms): spin_lock (<00000000>)
00010006 5.956ms (+0.000ms): task_timeslice (scheduler_tick)
00010006 5.956ms (+0.000ms): __bitmap_weight (scheduler_tick)
00010006 5.956ms (+0.000ms): _spin_unlock (scheduler_tick)
00010005 5.956ms (+0.000ms): preempt_schedule (scheduler_tick)
00010005 5.957ms (+0.000ms): rebalance_tick (scheduler_tick)
00000006 5.957ms (+0.000ms): do_softirq (smp_apic_timer_interrupt)
00000006 5.957ms (+0.817ms): __do_softirq (do_softirq)
00000005 6.774ms (+0.000ms): _spin_unlock (flush_tlb_others)
00000004 6.775ms (+0.000ms): preempt_schedule (flush_tlb_others)
00000003 6.775ms (+0.000ms): preempt_schedule (flush_tlb_page)
00000003 6.775ms (+0.000ms): kunmap_atomic (page_referenced_one)
00000002 6.775ms (+0.000ms): preempt_schedule (page_referenced_one)
00000002 6.776ms (+0.000ms): _spin_unlock (page_referenced_one)
00000001 6.776ms (+0.000ms): preempt_schedule (page_referenced_one)
00000001 6.776ms (+0.000ms): _spin_unlock (page_referenced_file)
00000001 6.777ms (+0.000ms): sub_preempt_count (_spin_unlock)
00000001 6.777ms (+0.000ms): update_max_trace (check_preempt_timing)

Spin Lock/nmi
=============

This may be an artifact of the test configuration.

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-R8-nd
-------------------------------------------------------
 latency: 244 us, entries: 17 (17)   |   [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
    -----------------
    | task: fam/3212, uid:0 nice:0 policy:0 rt_prio:0
    -----------------
 => started at: __spin_lock_irqsave+0x2b/0xb0
 => ended at:   _spin_unlock_irqrestore+0x32/0x70
=======>
00000001 0.000ms (+0.000ms): __spin_lock_irqsave (avc_has_perm_noaudit)
00000001 0.000ms (+0.000ms): __spin_lock_irqsave (<00000000>)
00000001 0.000ms (+0.069ms): avc_insert (avc_has_perm_noaudit)
00010001 0.069ms (+0.000ms): do_nmi (avc_insert)
00010001 0.070ms (+0.002ms): do_nmi (__mcount)
00010001 0.072ms (+0.000ms): notifier_call_chain (default_do_nmi)
00010001 0.073ms (+0.000ms): profile_tick (nmi_watchdog_tick)
00010001 0.073ms (+0.000ms): profile_hook (profile_tick)
00010001 0.073ms (+0.000ms): read_lock (profile_hook)
00010002 0.073ms (+0.000ms): read_lock (<00000000>)
00010002 0.073ms (+0.000ms): notifier_call_chain (profile_hook)
00010002 0.074ms (+0.000ms): _read_unlock (profile_tick)
00010001 0.074ms (+0.168ms): profile_hit (nmi_watchdog_tick)
00000001 0.242ms (+0.000ms): memcpy (avc_has_perm_noaudit)
00000001 0.243ms (+0.002ms): _spin_unlock_irqrestore (avc_has_perm_noaudit)
00000001 0.245ms (+0.000ms): sub_preempt_count (_spin_unlock_irqrestore)
00000001 0.245ms (+0.000ms): update_max_trace (check_preempt_timing)

Do_IRQ
======

Perhaps an unfortunate series of nested IRQs?
It starts like this...

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-R8-nd
-------------------------------------------------------
 latency: 240 us, entries: 380 (380)   |   [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
    -----------------
    | task: cpu_burn/14061, uid:0 nice:10 policy:0 rt_prio:0
    -----------------
 => started at: do_IRQ+0x19/0x1a0
 => ended at:   do_IRQ+0x14a/0x1a0
=======>
00010000 0.000ms (+0.000ms): do_IRQ (common_interrupt)
00010000 0.000ms (+0.000ms): do_IRQ (<08048340>)
00010000 0.000ms (+0.000ms): do_IRQ (<0000000a>)
00010000 0.000ms (+0.000ms): spin_lock (do_IRQ)
00010001 0.000ms (+0.000ms): spin_lock (<00000000>)
00010001 0.001ms (+0.000ms): mask_and_ack_level_ioapic_irq (do_IRQ)
00010001 0.001ms (+0.000ms): mask_IO_APIC_irq (mask_and_ack_level_ioapic_irq)
00010001 0.001ms (+0.000ms): __spin_lock_irqsave (mask_IO_APIC_irq)
00010002 0.002ms (+0.000ms): __spin_lock_irqsave (<00000000>)
00010002 0.002ms (+0.000ms): __mask_IO_APIC_irq (mask_IO_APIC_irq)
00010002 0.002ms (+0.013ms): __modify_IO_APIC_irq (__mask_IO_APIC_irq)
00010002 0.016ms (+0.000ms): _spin_unlock_irqrestore (mask_IO_APIC_irq)
00010001 0.016ms (+0.000ms): generic_redirect_hardirq (do_IRQ)
00010001 0.016ms (+0.000ms): _spin_unlock (do_IRQ)
00010000 0.017ms (+0.000ms): generic_handle_IRQ_event (do_IRQ)
00010000 0.017ms (+0.001ms): usb_hcd_irq (generic_handle_IRQ_event)
00010000 0.019ms (+0.001ms): uhci_irq (usb_hcd_irq)
00010000 0.021ms (+0.000ms): spin_lock (uhci_irq)
00010001 0.021ms (+0.000ms): spin_lock (<00000000>)
00010001 0.021ms (+0.001ms): uhci_get_current_frame_number (uhci_irq)
00010001 0.023ms (+0.000ms): uhci_free_pending_qhs (uhci_irq)
00010001 0.023ms (+0.001ms): uhci_free_pending_tds (uhci_irq)
00010001 0.024ms (+0.001ms): uhci_remove_pending_urbps (uhci_irq)
00010001 0.025ms (+0.000ms): uhci_transfer_result (uhci_irq)
00010001 0.026ms (+0.000ms): spin_lock (uhci_transfer_result)
...
00020005 0.134ms (+0.000ms): do_IRQ (_spin_unlock_irqrestore)
00020005 0.134ms (+0.000ms): do_IRQ (<00000000>)
00020005 0.134ms (+0.000ms): spin_lock (do_IRQ)
00020006 0.134ms (+0.000ms): spin_lock (<00000000>)
00020006 0.134ms (+0.000ms): ack_edge_ioapic_irq (do_IRQ)
00020006 0.135ms (+0.000ms): generic_redirect_hardirq (do_IRQ)
00020006 0.135ms (+0.000ms): _spin_unlock (do_IRQ)
00020005 0.135ms (+0.000ms): preempt_schedule (do_IRQ)
00020005 0.135ms (+0.000ms): generic_handle_IRQ_event (do_IRQ)
00020005 0.136ms (+0.000ms): timer_interrupt (generic_handle_IRQ_event)
... exits like this ...
00010001 0.220ms (+0.000ms): generic_note_interrupt (do_IRQ)
00010001 0.220ms (+0.000ms): end_level_ioapic_irq (do_IRQ)
00010001 0.220ms (+0.000ms): unmask_IO_APIC_irq (do_IRQ)
00010001 0.221ms (+0.000ms): __spin_lock_irqsave (unmask_IO_APIC_irq)
00010002 0.221ms (+0.000ms): __spin_lock_irqsave (<00000000>)
00010002 0.221ms (+0.000ms): __unmask_IO_APIC_irq (unmask_IO_APIC_irq)
00010002 0.221ms (+0.013ms): __modify_IO_APIC_irq (__unmask_IO_APIC_irq)
00010002 0.235ms (+0.000ms): _spin_unlock_irqrestore (unmask_IO_APIC_irq)
00010001 0.235ms (+0.000ms): preempt_schedule (unmask_IO_APIC_irq)
00010001 0.235ms (+0.000ms): _spin_unlock (do_IRQ)
00010000 0.236ms (+0.000ms): preempt_schedule (do_IRQ)
00000001 0.236ms (+0.000ms): do_softirq (do_IRQ)
00000001 0.236ms (+0.000ms): __do_softirq (do_softirq)
00000001 0.236ms (+0.000ms): wake_up_process (do_softirq)
00000001 0.237ms (+0.000ms): try_to_wake_up (wake_up_process)
00000001 0.237ms (+0.000ms): task_rq_lock (try_to_wake_up)
00000001 0.237ms (+0.000ms): spin_lock (task_rq_lock)
00000002 0.237ms (+0.000ms): spin_lock (<00000000>)
00000002 0.237ms (+0.000ms): wake_idle (try_to_wake_up)
00000002 0.238ms (+0.000ms): activate_task (try_to_wake_up)
00000002 0.238ms (+0.000ms): sched_clock (activate_task)
00000002 0.238ms (+0.000ms): recalc_task_prio (activate_task)
00000002 0.238ms (+0.000ms): effective_prio (recalc_task_prio)
00000002 0.239ms (+0.000ms): __activate_task (try_to_wake_up)
00000002 0.239ms (+0.000ms): enqueue_task (__activate_task)
00000002 0.239ms (+0.000ms): sched_info_queued (enqueue_task)
00000002 0.239ms (+0.000ms): resched_task (try_to_wake_up)
00000002 0.240ms (+0.000ms): task_rq_unlock (try_to_wake_up)
00000002 0.240ms (+0.000ms): _spin_unlock_irqrestore (try_to_wake_up)
00000001 0.240ms (+0.000ms): preempt_schedule (try_to_wake_up)
00000001 0.241ms (+0.000ms): sub_preempt_count (do_IRQ)
00000001 0.242ms (+0.000ms): update_max_trace (check_preempt_timing)

I will send the full traces separately (off list).

  --Mark


* Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R1
@ 2004-09-10 12:57 Mark_H_Johnson
  2004-09-10 13:10 ` Ingo Molnar
  0 siblings, 1 reply; 36+ messages in thread
From: Mark_H_Johnson @ 2004-09-10 12:57 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Lee Revell, Free Ekanayaka, Eric St-Laurent, linux-kernel,
	K.R. Foley, Felipe Alfaro Solana, Daniel Schmitt, P.O. Gaillard,
	nando, luke, free78

>You could also try lower DMA modes; I don't know if that will help or
>not, but UDMA133 is essentially one entire PCI 32/33MHz bus worth of
>traffic, so the effects are not too surprising. IDE control ports can
>give you long stalls too, especially if IORDY is in use.

It appears I finally figured it out and have the drive running with udma2
instead of udma4. For reference:

# hdparm /dev/hda

/dev/hda:
 multcount    = 16 (on)
 IO_support   =  1 (32-bit)
 unmaskirq    =  1 (on)
 using_dma    =  1 (on)
 keepsettings =  0 (off)
 readonly     =  0 (off)
 readahead    = 64 (on)
 geometry     = 58168/16/63, sectors = 58633344, start = 0
# hdparm -i /dev/hda

/dev/hda:

 Model=WDC WD300BB-00AUA1, FwRev=18.20D18, SerialNo=WD-WMA6W1764218
 Config={ HardSect NotMFM HdSw>15uSec SpinMotCtl Fixed DTR>5Mbs FmtGapReq }
 RawCHS=16383/16/63, TrkSize=57600, SectSize=600, ECCbytes=40
 BuffType=DualPortCache, BuffSize=2048kB, MaxMultSect=16, MultSect=16
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=58633344
 IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes:  pio0 pio1 pio2 pio3 pio4
 DMA modes:  mdma0 mdma1 mdma2
 UDMA modes: udma0 udma1 *udma2 udma3 udma4 udma5
 AdvancedPM=no WriteCache=enabled
 Drive conforms to: device does not report version:

 * signifies the current active mode
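For reference, hdparm(8) encodes the -X transfer mode numerically (UDMA modes are 64 + mode, MDMA 32 + mode, PIO 8 + mode), so a mode change that leaves the PIO flag alone can be sketched as below. This is only a sketch; the /dev/hda path comes from the listing above, and the hdparm call itself is shown as a comment since it needs the real hardware.

```shell
#!/bin/sh
# Map a UDMA mode number to hdparm's numeric -X argument.
# Per hdparm(8): PIO modes are 8+mode, MDMA 32+mode, UDMA 64+mode.
udma_xfer_val() {
    echo $((64 + $1))
}

udma_xfer_val 2   # udma2 -> 66

# A change that touches only the transfer mode (note: no -p) would then be:
#   hdparm -X "$(udma_xfer_val 2)" /dev/hda
```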

The same test I ran yesterday:
 head -c 750000000 /dev/zero >tmpfile (disk writes)
 cp tmpfile tmpfile2 (disk copy)
 cat tmpfile tmpfile2 >/dev/null (disk read)
ran with the following results late last night. All are traces > 200 usec
[a much smaller limit than before]; there were no traces > 600 usec.

Exit Mmap - Traces 00 01 02 03 04 05 10 13 24 26
Kswapd0   - Traces 06 07 08 09 11 12 14 16..23 25 27
Spin lock - Trace  15
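The three-phase load above can be wrapped in a small timing script; the sketch below uses a hypothetical SIZE parameter defaulting to 1 MB (the actual runs used 750000000 bytes, about 1.5x physical memory).

```shell
#!/bin/sh
# Sketch of the write/copy/read disk load, with per-phase wall-clock timing.
# SIZE defaults to 1 MB here; the runs described above used 750000000 bytes.
SIZE=${SIZE:-1000000}

phase() {
    # run a command, then report elapsed seconds under the given label
    label=$1; shift
    start=$(date +%s)
    "$@"
    echo "$label took $(( $(date +%s) - start ))s"
}

phase "disk writes" sh -c "head -c $SIZE /dev/zero > tmpfile"
phase "disk copy"   cp tmpfile tmpfile2
phase "disk read"   sh -c "cat tmpfile tmpfile2 > /dev/null"
```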

Exit Mmap
=========

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-R8-nd
-------------------------------------------------------
 latency: 209 us, entries: 34 (34)   |   [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
    -----------------
    | task: sleep/4436, uid:0 nice:-9 policy:0 rt_prio:0
    -----------------
 => started at: cond_resched_lock+0x1d/0xa0
 => ended at:   _spin_unlock+0x2d/0x60
=======>
00000001 0.000ms (+0.000ms): touch_preempt_timing (cond_resched_lock)
00000001 0.000ms (+0.001ms): __bitmap_weight (unmap_vmas)
00000001 0.001ms (+0.002ms): vm_acct_memory (exit_mmap)
00000001 0.003ms (+0.113ms): clear_page_tables (exit_mmap)
00010001 0.117ms (+0.000ms): do_nmi (clear_page_tables)
00010001 0.117ms (+0.003ms): do_nmi (avc_has_perm)
00010001 0.121ms (+0.000ms): notifier_call_chain (default_do_nmi)
00010001 0.121ms (+0.000ms): profile_tick (nmi_watchdog_tick)
00010001 0.121ms (+0.000ms): profile_hook (profile_tick)
00010001 0.121ms (+0.000ms): read_lock (profile_hook)
00010002 0.121ms (+0.000ms): read_lock (<00000000>)
00010002 0.122ms (+0.000ms): notifier_call_chain (profile_hook)
00010002 0.122ms (+0.000ms): _read_unlock (profile_tick)
00010001 0.122ms (+0.078ms): profile_hit (nmi_watchdog_tick)
00000001 0.200ms (+0.002ms): flush_tlb_mm (exit_mmap)
00000001 0.202ms (+0.000ms): free_pages_and_swap_cache (exit_mmap)

Kswapd0
=======

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-R8-nd
-------------------------------------------------------
 latency: 539 us, entries: 719 (719)   |   [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
    -----------------
    | task: kswapd0/40, uid:0 nice:0 policy:0 rt_prio:0
    -----------------
 => started at: spin_lock+0x24/0x90
 => ended at:   _spin_unlock+0x2d/0x60
=======>
00000001 0.000ms (+0.000ms): spin_lock (page_referenced_file)
00000001 0.000ms (+0.000ms): spin_lock (<00000000>)
00000001 0.000ms (+0.000ms): vma_prio_tree_next (page_referenced_file)
00000001 0.000ms (+0.000ms): prio_tree_first (vma_prio_tree_next)
00000001 0.001ms (+0.001ms): prio_tree_left (prio_tree_first)
00000001 0.003ms (+0.000ms): page_referenced_one (page_referenced_file)
00000001 0.004ms (+0.000ms): spin_lock (page_referenced_one)
00000002 0.005ms (+0.001ms): spin_lock (<00000000>)
00000002 0.006ms (+0.000ms): kmap_atomic (page_referenced_one)
00000003 0.006ms (+0.000ms): page_address (page_referenced_one)
00000003 0.007ms (+0.000ms): kunmap_atomic (page_referenced_one)
00000002 0.007ms (+0.000ms): _spin_unlock (page_referenced_one)
00000001 0.007ms (+0.000ms): vma_prio_tree_next (page_referenced_file)
00000001 0.008ms (+0.001ms): page_referenced_one (page_referenced_file)
00000001 0.009ms (+0.000ms): spin_lock (page_referenced_one)
00000002 0.009ms (+0.002ms): spin_lock (<00000000>)
... cycle repeats a number of times ...
00000001 0.535ms (+0.000ms): preempt_schedule (page_referenced_one)
00000001 0.536ms (+0.000ms): vma_prio_tree_next (page_referenced_file)
00000001 0.536ms (+0.000ms): page_referenced_one (page_referenced_file)
00000001 0.537ms (+0.000ms): spin_lock (page_referenced_one)
00000002 0.537ms (+0.000ms): spin_lock (<00000000>)
00000002 0.538ms (+0.000ms): kmap_atomic (page_referenced_one)
00000003 0.538ms (+0.000ms): page_address (page_referenced_one)
00000003 0.538ms (+0.000ms): kunmap_atomic (page_referenced_one)
00000002 0.538ms (+0.000ms): preempt_schedule (page_referenced_one)
00000002 0.539ms (+0.000ms): _spin_unlock (page_referenced_one)
00000001 0.539ms (+0.000ms): preempt_schedule (page_referenced_one)
00000001 0.539ms (+0.000ms): _spin_unlock (page_referenced_file)
00000001 0.540ms (+0.000ms): sub_preempt_count (_spin_unlock)
00000001 0.541ms (+0.000ms): update_max_trace (check_preempt_timing)

Spin Lock
=========

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-R8-nd
-------------------------------------------------------
 latency: 235 us, entries: 18 (18)   |   [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
    -----------------
    | task: kswapd0/40, uid:0 nice:0 policy:0 rt_prio:0
    -----------------
 => started at: spin_lock+0x24/0x90
 => ended at:   _spin_unlock+0x2d/0x60
=======>
00000001 0.000ms (+0.000ms): spin_lock (get_swap_page)
00000001 0.000ms (+0.000ms): spin_lock (<00000000>)
00000001 0.000ms (+0.000ms): spin_lock (get_swap_page)
00000002 0.000ms (+0.113ms): spin_lock (<00000000>)
00010002 0.113ms (+0.000ms): do_nmi (get_swap_page)
00010002 0.114ms (+0.002ms): do_nmi (default_idle)
00010002 0.116ms (+0.000ms): notifier_call_chain (default_do_nmi)
00010002 0.116ms (+0.000ms): profile_tick (nmi_watchdog_tick)
00010002 0.117ms (+0.000ms): profile_hook (profile_tick)
00010002 0.117ms (+0.000ms): read_lock (profile_hook)
00010003 0.117ms (+0.000ms): read_lock (<00000000>)
00010003 0.117ms (+0.000ms): notifier_call_chain (profile_hook)
00010003 0.117ms (+0.000ms): _read_unlock (profile_tick)
00010002 0.118ms (+0.116ms): profile_hit (nmi_watchdog_tick)
00000002 0.235ms (+0.000ms): _spin_unlock (get_swap_page)
00000001 0.235ms (+0.000ms): _spin_unlock (get_swap_page)
00000001 0.235ms (+0.001ms): sub_preempt_count (_spin_unlock)
00000001 0.237ms (+0.000ms): update_max_trace (check_preempt_timing)

Are you SURE the spin lock counter works properly on SMP systems?
I did a quick check of yesterday's results:
  # grep -ir '<.*>' latencytest0.42-png/lt040909 | wc -l
  6978
  # grep -ir '<.*>' latencytest0.42-png/lt040909 | grep -v '<00000000>' | less -im
  ...
  latencytest0.42-png/lt040909/lt004.v3k1/lt.10:00010002 0.382ms (+0.000ms): do_IRQ (<0000000e>)
  latencytest0.42-png/lt040909/lt004.v3k1/lt.11:00010001 0.003ms (+0.000ms): do_IRQ (<0000000a>)
  latencytest0.42-png/lt040909/lt004.v3k1/lt.11:00020001 0.308ms (+0.000ms): do_IRQ (<0000000e>)
  latencytest0.42-png/lt040909/lt005.v3k1/lt.00:00010002 0.263ms (+0.005ms): do_nmi (<0804bd56>)
  latencytest0.42-png/lt040909/lt006.v3k1/lt.00:00010001 0.501ms (+0.005ms): do_nmi (<0864d2b2>)
  latencytest0.42-png/lt040909/lt006.v3k1/lt.01:00010001 0.067ms (+0.003ms): do_nmi (<0815f06c>)
  ...
There were no entries that were both non-zero and lock-related.
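That check can be reproduced on a small sample; the lines below are hypothetical stand-ins for the lt* trace files, and do_IRQ/do_nmi are filtered out as the known non-lock sources of non-zero caller addresses.

```shell
#!/bin/sh
# Re-run the caller-address check on sample trace lines: list entries whose
# caller field is a raw non-zero address, then drop the known IRQ/NMI sources.
cat > sample.txt <<'EOF'
00000002 0.005ms (+0.001ms): spin_lock (<00000000>)
00010002 0.382ms (+0.000ms): do_IRQ (<0000000e>)
00010001 0.501ms (+0.005ms): do_nmi (<0864d2b2>)
EOF
grep -i '<.*>' sample.txt | grep -v '<00000000>' | grep -vE 'do_IRQ|do_nmi' \
    || echo "no non-zero, lock-related entries"
```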

  --Mark


* Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R1
@ 2004-09-09 22:56 Mark_H_Johnson
  2004-09-10 12:42 ` Ingo Molnar
  0 siblings, 1 reply; 36+ messages in thread
From: Mark_H_Johnson @ 2004-09-09 22:56 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric St-Laurent, Free Ekanayaka, free78, K.R. Foley, linux-kernel,
	Felipe Alfaro Solana, luke, nando, P.O. Gaillard, Daniel Schmitt,
	Lee Revell

>I tried
>  hdparm -p -X udma2 /dev/hda
>(since it was udma4 previously)
>and reran the tests.
I am not quite sure this was what I wanted; it appears to turn on PIO
modes exclusively.

What is the right incantation for this?
Thanks.
--Mark H Johnson
  <mailto:Mark_H_Johnson@raytheon.com>


* Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R1
@ 2004-09-09 22:41 Mark_H_Johnson
  0 siblings, 0 replies; 36+ messages in thread
From: Mark_H_Johnson @ 2004-09-09 22:41 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Lee Revell, Free Ekanayaka, Eric St-Laurent, linux-kernel,
	K.R. Foley, Felipe Alfaro Solana, Daniel Schmitt, P.O. Gaillard,
	nando, luke, free78

>so if these assumptions are true i'd suggest to decrease max_sectors
>from 32K to 16K - that alone should halve these random latencies from
>560 usecs to 280 usecs. Unless you see stability you might want to try
>an 8K setting as well - this will likely decrease your IO rates
>noticeably though. This would reduce the random delays to 140 usecs.
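The suggested tuning maps onto sysfs roughly as sketched below (run as root; the /sys/block/hda path assumes the /dev/hda used in these tests):

```shell
# Halve the maximum request size from 32K to 16K, as suggested:
echo 16 > /sys/block/hda/queue/max_sectors_kb
# and, if latencies are still too long, try 8K (at a noticeable IO-rate cost):
echo 8 > /sys/block/hda/queue/max_sectors_kb
cat /sys/block/hda/queue/max_sectors_kb    # verify the active setting
```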

I tried both and still had some long delays. For example:

00000003 0.015ms (+0.000ms): send_IPI_mask (flush_tlb_others)
00000003 0.015ms (+0.137ms): send_IPI_mask_bitmask (flush_tlb_others)
00010003 0.153ms (+0.000ms): do_nmi (flush_tlb_others)
00010003 0.153ms (+0.001ms): do_nmi (ide_outsl)
00010003 0.155ms (+0.000ms): notifier_call_chain (default_do_nmi)
00010003 0.155ms (+0.000ms): profile_tick (nmi_watchdog_tick)
00010003 0.155ms (+0.000ms): profile_hook (profile_tick)
00010003 0.156ms (+0.000ms): read_lock (profile_hook)
00010004 0.156ms (+0.000ms): read_lock (<00000000>)
00010004 0.156ms (+0.000ms): notifier_call_chain (profile_hook)
00010004 0.156ms (+0.000ms): _read_unlock (profile_tick)
00010003 0.157ms (+0.290ms): profile_hit (nmi_watchdog_tick)
00000003 0.447ms (+0.000ms): _spin_unlock (flush_tlb_others)
00000001 0.447ms (+0.000ms): free_pages_and_swap_cache (unmap_vmas)

or this...

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-R8-nd
-------------------------------------------------------
 latency: 440 us, entries: 34 (34)   |   [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
    -----------------
    | task: kswapd0/40, uid:0 nice:0 policy:0 rt_prio:0
    -----------------
 => started at: spin_lock+0x24/0x90
 => ended at:   _spin_unlock+0x2d/0x60
=======>
00000001 0.000ms (+0.000ms): spin_lock (get_swap_page)
00000001 0.000ms (+0.000ms): spin_lock (<00000000>)
00000001 0.000ms (+0.000ms): spin_lock (get_swap_page)
00000002 0.000ms (+0.410ms): spin_lock (<00000000>)
00010002 0.411ms (+0.000ms): do_IRQ (get_swap_page)

or this...

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-R8-nd
-------------------------------------------------------
 latency: 548 us, entries: 8 (8)   |   [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
    -----------------
    | task: kswapd0/40, uid:0 nice:0 policy:0 rt_prio:0
    -----------------
 => started at: spin_lock+0x24/0x90
 => ended at:   _spin_unlock+0x2d/0x60
=======>
00000001 0.000ms (+0.000ms): spin_lock (get_swap_page)
00000001 0.000ms (+0.000ms): spin_lock (<00000000>)
00000001 0.000ms (+0.000ms): spin_lock (get_swap_page)
00000002 0.000ms (+0.547ms): spin_lock (<00000000>)
00000002 0.548ms (+0.000ms): _spin_unlock (get_swap_page)
00000001 0.548ms (+0.000ms): _spin_unlock (get_swap_page)
00000001 0.548ms (+0.000ms): sub_preempt_count (_spin_unlock)
00000001 0.549ms (+0.000ms): update_max_trace (check_preempt_timing)

or this...

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-R8-nd
-------------------------------------------------------
 latency: 422 us, entries: 345 (345)   |   [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
    -----------------
    | task: kblockd/0/12, uid:0 nice:-10 policy:0 rt_prio:0
    -----------------
 => started at: do_IRQ+0x19/0x1a0
 => ended at:   do_IRQ+0x14a/0x1a0
=======>
00010000 0.000ms (+0.000ms): do_IRQ (common_interrupt)
00010000 0.000ms (+0.000ms): do_IRQ (kthread_should_stop)
00010000 0.000ms (+0.000ms): do_IRQ (<0000000a>)
00010000 0.000ms (+0.000ms): spin_lock (do_IRQ)
00010001 0.000ms (+0.052ms): spin_lock (<00000000>)
00010001 0.053ms (+0.000ms): mask_and_ack_level_ioapic_irq (do_IRQ)
00010001 0.053ms (+0.000ms): mask_IO_APIC_irq (mask_and_ack_level_ioapic_irq)
00010001 0.053ms (+0.000ms): __spin_lock_irqsave (mask_IO_APIC_irq)
00010002 0.053ms (+0.000ms): __spin_lock_irqsave (<00000000>)
00010002 0.053ms (+0.000ms): __mask_IO_APIC_irq (mask_IO_APIC_irq)
00010002 0.054ms (+0.220ms): __modify_IO_APIC_irq (__mask_IO_APIC_irq)
00010002 0.274ms (+0.000ms): _spin_unlock_irqrestore (mask_IO_APIC_irq)
00010001 0.274ms (+0.000ms): generic_redirect_hardirq (do_IRQ)

Kswapd0
=======

I am not sure I saw this before; it may be a side effect of the smaller
transfer sizes...

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-R8-nd
-------------------------------------------------------
 latency: 500 us, entries: 608 (608)   |   [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
    -----------------
    | task: kswapd0/40, uid:0 nice:0 policy:0 rt_prio:0
    -----------------
 => started at: spin_lock+0x24/0x90
 => ended at:   _spin_unlock+0x2d/0x60
=======>
00000001 0.000ms (+0.000ms): spin_lock (page_referenced_file)
00000001 0.000ms (+0.000ms): spin_lock (<00000000>)
00000001 0.000ms (+0.000ms): vma_prio_tree_next (page_referenced_file)
00000001 0.000ms (+0.000ms): prio_tree_first (vma_prio_tree_next)
00000001 0.000ms (+0.000ms): prio_tree_left (prio_tree_first)
00000001 0.000ms (+0.000ms): page_referenced_one (page_referenced_file)
00000001 0.001ms (+0.000ms): spin_lock (page_referenced_one)
00000002 0.001ms (+0.000ms): spin_lock (<00000000>)
00000002 0.001ms (+0.000ms): kmap_atomic (page_referenced_one)
00000003 0.001ms (+0.000ms): page_address (page_referenced_one)
00000003 0.002ms (+0.000ms): kunmap_atomic (page_referenced_one)
00000002 0.002ms (+0.000ms): _spin_unlock (page_referenced_one)
00000001 0.002ms (+0.000ms): vma_prio_tree_next (page_referenced_file)
00000001 0.002ms (+0.000ms): page_referenced_one (page_referenced_file)
00000001 0.003ms (+0.000ms): spin_lock (page_referenced_one)
00000002 0.003ms (+0.000ms): spin_lock (<00000000>)
...
00000001 0.480ms (+0.000ms): vma_prio_tree_next (page_referenced_file)
00000001 0.481ms (+0.001ms): page_referenced_one (page_referenced_file)
00000001 0.483ms (+0.000ms): spin_lock (page_referenced_one)
00000002 0.484ms (+0.001ms): spin_lock (<00000000>)
00000002 0.485ms (+0.000ms): kmap_atomic (page_referenced_one)
00000003 0.486ms (+0.001ms): page_address (page_referenced_one)
00000003 0.487ms (+0.000ms): flush_tlb_page (page_referenced_one)
00000003 0.488ms (+0.000ms): kunmap_atomic (page_referenced_one)
00000002 0.489ms (+0.000ms): _spin_unlock (page_referenced_one)
00000001 0.489ms (+0.001ms): vma_prio_tree_next (page_referenced_file)
00000001 0.491ms (+0.002ms): page_referenced_one (page_referenced_file)
00000001 0.494ms (+0.000ms): spin_lock (page_referenced_one)
00000002 0.494ms (+0.002ms): spin_lock (<00000000>)
00000002 0.497ms (+0.000ms): kmap_atomic (page_referenced_one)
00000003 0.497ms (+0.000ms): page_address (page_referenced_one)
00000003 0.498ms (+0.000ms): kunmap_atomic (page_referenced_one)
00000002 0.499ms (+0.000ms): _spin_unlock (page_referenced_one)
00000001 0.499ms (+0.002ms): _spin_unlock (page_referenced_file)
00000001 0.501ms (+0.002ms): sub_preempt_count (_spin_unlock)
00000001 0.503ms (+0.000ms): update_max_trace (check_preempt_timing)

>but the real fix would be to tweak the IDE controller to not do so
>agressive DMA! Are there any BIOS settings that somehow deal with it?
>Try increasing the PCI latency value? Is the disk using UDMA - if yes,
>could you downgrade it to normal IDE DMA? Perhaps that tweaks the
>controller to be 'nice' to the CPU.

I tried
  hdparm -p -X udma2 /dev/hda
(since it was udma4 previously)
and reran the tests. As a side note, that command itself caused a long
latency, something like...

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-R8-nd
-------------------------------------------------------
 latency: 2019 us, entries: 2829 (2829)   |   [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
    -----------------
    | task: hdparm/13753, uid:0 nice:0 policy:0 rt_prio:0
    -----------------
 => started at: cond_resched+0xd/0x30
 => ended at:   cond_resched+0xd/0x30
=======>
00000001 0.000ms (+0.000ms): touch_preempt_timing (cond_resched)
00000001 0.000ms (+0.007ms): ide_find_setting_by_ioctl (generic_ide_ioctl)
00000001 0.008ms (+0.000ms): ide_write_setting (generic_ide_ioctl)
00000001 0.009ms (+0.001ms): capable (ide_write_setting)
...
00000001 0.064ms (+0.001ms): ide_inb (ide_config_drive_speed)
00000001 0.066ms (+0.000ms): ide_inb (ide_config_drive_speed)
00000001 0.067ms (+0.000ms): ide_inb (ide_config_drive_speed)
...
00000001 1.899ms (+0.000ms): ide_inb (ide_config_drive_speed)
00000001 1.899ms (+0.000ms): ide_inb (ide_config_drive_speed)
00000001 1.900ms (+0.000ms): ide_inb (ide_config_drive_speed)
00000001 1.901ms (+0.000ms): __const_udelay (ide_config_drive_speed)
00000001 1.901ms (+0.000ms): __delay (ide_config_drive_speed)
00000001 1.901ms (+0.001ms): delay_tsc (__delay)
00000001 1.902ms (+0.000ms): ide_inb (ide_config_drive_speed)
...

The results after that change were like this...

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-R8-nd
-------------------------------------------------------
 latency: 536 us, entries: 31 (31)   |   [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
    -----------------
    | task: IRQ 14/140, uid:0 nice:-10 policy:0 rt_prio:0
    -----------------
 => started at: kmap_atomic+0x23/0xe0
 => ended at:   kunmap_atomic+0x7b/0xa0
=======>
00000001 0.000ms (+0.000ms): kmap_atomic (__ide_do_rw_disk)
00000001 0.000ms (+0.000ms): page_address (__ide_do_rw_disk)
00000001 0.000ms (+0.000ms): ide_set_handler (__ide_do_rw_disk)
00000001 0.000ms (+0.000ms): __spin_lock_irqsave (ide_set_handler)
00000002 0.000ms (+0.000ms): __spin_lock_irqsave (<00000000>)
00000002 0.001ms (+0.000ms): __ide_set_handler (ide_set_handler)
00000002 0.001ms (+0.000ms): __mod_timer (ide_set_handler)
00000002 0.001ms (+0.000ms): __spin_lock_irqsave (__mod_timer)
00000003 0.001ms (+0.000ms): __spin_lock_irqsave (<00000000>)
00000003 0.002ms (+0.000ms): spin_lock (__mod_timer)
00000004 0.002ms (+0.000ms): spin_lock (<00000000>)
00000004 0.002ms (+0.000ms): internal_add_timer (__mod_timer)
00000004 0.002ms (+0.000ms): _spin_unlock (__mod_timer)
00000003 0.002ms (+0.000ms): _spin_unlock_irqrestore (__mod_timer)
00000002 0.003ms (+0.000ms): _spin_unlock_irqrestore (ide_set_handler)
00000001 0.003ms (+0.000ms): taskfile_output_data (__ide_do_rw_disk)
00000001 0.003ms (+0.000ms): ata_output_data (taskfile_output_data)
00000001 0.003ms (+0.496ms): ide_outsw (ata_output_data)
00010001 0.500ms (+0.000ms): do_nmi (ide_outsw)
00010001 0.501ms (+0.005ms): do_nmi (<0864d2b2>)
00010001 0.506ms (+0.000ms): notifier_call_chain (default_do_nmi)
00010001 0.506ms (+0.000ms): profile_tick (nmi_watchdog_tick)
00010001 0.506ms (+0.000ms): profile_hook (profile_tick)
00010001 0.506ms (+0.000ms): read_lock (profile_hook)
00010002 0.507ms (+0.000ms): read_lock (<00000000>)
00010002 0.507ms (+0.000ms): notifier_call_chain (profile_hook)
00010002 0.507ms (+0.000ms): _read_unlock (profile_tick)
00010001 0.507ms (+0.028ms): profile_hit (nmi_watchdog_tick)
00000001 0.536ms (+0.000ms): kunmap_atomic (__ide_do_rw_disk)
00000001 0.536ms (+0.000ms): sub_preempt_count (kunmap_atomic)
00000001 0.536ms (+0.000ms): sub_preempt_count (kunmap_atomic)
00000001 0.537ms (+0.000ms): update_max_trace (check_preempt_timing)

or this...

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-R8-nd
-------------------------------------------------------
 latency: 497 us, entries: 18 (18)   |   [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
    -----------------
    | task: IRQ 14/140, uid:0 nice:-10 policy:0 rt_prio:0
    -----------------
 => started at: kmap_atomic+0x23/0xe0
 => ended at:   kunmap_atomic+0x7b/0xa0
=======>
00000001 0.000ms (+0.000ms): kmap_atomic (read_intr)
00000001 0.000ms (+0.000ms): page_address (read_intr)
00000001 0.000ms (+0.000ms): taskfile_input_data (read_intr)
00000001 0.000ms (+0.000ms): ata_input_data (taskfile_input_data)
00000001 0.000ms (+0.066ms): ide_insw (ata_input_data)
00010001 0.067ms (+0.000ms): do_nmi (ide_insw)
00010001 0.067ms (+0.003ms): do_nmi (<0815f06c>)
00010001 0.070ms (+0.000ms): notifier_call_chain (default_do_nmi)
00010001 0.070ms (+0.000ms): profile_tick (nmi_watchdog_tick)
00010001 0.071ms (+0.000ms): profile_hook (profile_tick)
00010001 0.071ms (+0.000ms): read_lock (profile_hook)
00010002 0.071ms (+0.000ms): read_lock (<00000000>)
00010002 0.071ms (+0.000ms): notifier_call_chain (profile_hook)
00010002 0.071ms (+0.000ms): _read_unlock (profile_tick)
00010001 0.072ms (+0.424ms): profile_hit (nmi_watchdog_tick)
00000001 0.497ms (+0.000ms): kunmap_atomic (read_intr)
00000001 0.497ms (+0.000ms): sub_preempt_count (kunmap_atomic)
00000001 0.498ms (+0.000ms): update_max_trace (check_preempt_timing)

or this...

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-R8-nd
-------------------------------------------------------
 latency: 454 us, entries: 60 (60)   |   [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
    -----------------
    | task: kswapd0/40, uid:0 nice:0 policy:0 rt_prio:0
    -----------------
 => started at: spin_lock+0x24/0x90
 => ended at:   _spin_unlock+0x2d/0x60
=======>
00000001 0.000ms (+0.000ms): spin_lock (get_swap_page)
00000001 0.000ms (+0.000ms): spin_lock (<00000000>)
00000001 0.000ms (+0.000ms): spin_lock (get_swap_page)
00000002 0.001ms (+0.267ms): spin_lock (<00000000>)
00010002 0.268ms (+0.000ms): do_nmi (get_swap_page)
00010002 0.268ms (+0.003ms): do_nmi (ide_outsw)
00010002 0.272ms (+0.000ms): notifier_call_chain (default_do_nmi)
00010002 0.272ms (+0.000ms): profile_tick (nmi_watchdog_tick)
00010002 0.272ms (+0.000ms): profile_hook (profile_tick)
00010002 0.272ms (+0.000ms): read_lock (profile_hook)
00010003 0.273ms (+0.000ms): read_lock (<00000000>)
00010003 0.273ms (+0.000ms): notifier_call_chain (profile_hook)
00010003 0.273ms (+0.000ms): _read_unlock (profile_tick)
00010002 0.274ms (+0.000ms): profile_hit (nmi_watchdog_tick)
00010002 0.274ms (+0.000ms): do_IRQ (get_swap_page)
00010002 0.275ms (+0.000ms): do_IRQ (<00000000>)
00010002 0.275ms (+0.000ms): spin_lock (do_IRQ)
00010003 0.275ms (+0.000ms): spin_lock (<00000000>)
00010003 0.275ms (+0.000ms): ack_edge_ioapic_irq (do_IRQ)
00010003 0.275ms (+0.000ms): generic_redirect_hardirq (do_IRQ)
00010003 0.276ms (+0.000ms): _spin_unlock (do_IRQ)

or this...

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-R8-nd
-------------------------------------------------------
 latency: 457 us, entries: 8 (8)   |   [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
    -----------------
    | task: kswapd0/40, uid:0 nice:0 policy:0 rt_prio:0
    -----------------
 => started at: spin_lock+0x24/0x90
 => ended at:   _spin_unlock+0x2d/0x60
=======>
00000001 0.000ms (+0.000ms): spin_lock (get_swap_page)
00000001 0.000ms (+0.000ms): spin_lock (<00000000>)
00000001 0.000ms (+0.000ms): spin_lock (get_swap_page)
00000002 0.000ms (+0.456ms): spin_lock (<00000000>)
00000002 0.457ms (+0.000ms): _spin_unlock (get_swap_page)
00000001 0.457ms (+0.000ms): _spin_unlock (get_swap_page)
00000001 0.458ms (+0.000ms): sub_preempt_count (_spin_unlock)
00000001 0.458ms (+0.000ms): update_max_trace (check_preempt_timing)

So I don't think that particular setting helps.

  --Mark



* Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R1
@ 2004-09-09 19:23 Mark_H_Johnson
  2004-09-09 19:47 ` Ingo Molnar
  0 siblings, 1 reply; 36+ messages in thread
From: Mark_H_Johnson @ 2004-09-09 19:23 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Lee Revell, Free Ekanayaka, Eric St-Laurent, linux-kernel,
	K.R. Foley, Felipe Alfaro Solana, Daniel Schmitt, P.O. Gaillard,
	nando, luke, free78

>would it be possible to test with DMA disabled? (hdparm -d0 /dev/hda) It
>might take some extra work to shun the extra latency reports from the
>PIO IDE path (which is quite slow) but once that is done you should be
>able to see whether these long 0.5 msec delays remain even if all (most)
>DMA activity has been eliminated.

OK. With new patches in hand, I have a set of latency results w/ IDE
DMA turned off. For reference, tests were run with:

# hdparm /dev/hda

/dev/hda:
 multcount    = 16 (on)
 IO_support   =  1 (32-bit)
 unmaskirq    =  1 (on)
 using_dma    =  0 (off)
 keepsettings =  0 (off)
 readonly     =  0 (off)
 readahead    = 64 (on)
 geometry     = 58168/16/63, sectors = 58633344, start = 0

# cat /sys/block/hda/queue/max_sectors_kb
32
# cat /sys/block/hda/queue/read_ahead_kb
32
# cat /proc/sys/net/core/netdev_max_backlog
8
# dmesg -n 1

and all tests run w/ a -R8 kernel plus patches for sched, timer_tsc,
and ide-iops (to add latency trace outputs or suppress known long paths).
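The setup above can be collected into one sketch (root required; /dev/hda and the sysfs/proc paths come from the listings above, and the values are the ones shown there):

```shell
#!/bin/sh
# Pre-test tuning for the DMA-off run (values from the listings above).
hdparm -d0 /dev/hda                              # using_dma = 0 (off)
echo 32 > /sys/block/hda/queue/max_sectors_kb    # cap request size at 32K
echo 32 > /sys/block/hda/queue/read_ahead_kb     # cap readahead at 32K
echo 8  > /proc/sys/net/core/netdev_max_backlog  # shrink net softirq batches
dmesg -n 1                                       # console printk to minimum
```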

No latency traces > 600 usec. These tests were run simply with
  head -c 750000000 /dev/zero >tmpfile  (disk writes)
  cp tmpfile tmpfile2 (disk copy)
  cat tmpfile tmpfile2 >/dev/null (disk read)
while capturing the latency traces in another process. 72 total traces
captured in 15 minutes of tests.

For reference, the 750 meg file size is about 1.5 x physical memory.
No I/O to the audio card. X was running (to monitor the test) and a
network was connected (but generally idle).

PIO trace
=========

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-R8-nd
-------------------------------------------------------
 latency: 369 us, entries: 3 (3)   |   [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
    -----------------
    | task: IRQ 14/140, uid:0 nice:-10 policy:0 rt_prio:0
    -----------------
 => started at: ide_outsl+0x44/0x50
 => ended at:   ide_outsl+0x44/0x50
=======>
00000001 0.000ms (+0.370ms): touch_preempt_timing (ide_outsl)
00000001 0.370ms (+0.000ms): touch_preempt_timing (ide_outsl)
00000001 0.370ms (+0.000ms): update_max_trace (check_preempt_timing)

which appears to be the PIO path ide_outsl. I had a few similar traces
with ide_insl during the copy / read tests as well.

send_IPI_mask_bitmask
=====================

I have several traces where send_IPI_mask_bitmask (flush_tlb_others)
shows up. For example...

00000002 0.010ms (+0.001ms): kunmap_atomic (zap_pte_range)
00000001 0.011ms (+0.000ms): flush_tlb_mm (unmap_vmas)
00000002 0.012ms (+0.000ms): flush_tlb_others (flush_tlb_mm)
00000002 0.012ms (+0.000ms): spin_lock (flush_tlb_others)
00000003 0.013ms (+0.001ms): spin_lock (<00000000>)
00000003 0.014ms (+0.000ms): send_IPI_mask (flush_tlb_others)
00000003 0.014ms (+0.132ms): send_IPI_mask_bitmask (flush_tlb_others)
00010003 0.147ms (+0.000ms): do_nmi (flush_tlb_others)
00010003 0.147ms (+0.001ms): do_nmi (ide_outsl)
00010003 0.149ms (+0.000ms): notifier_call_chain (default_do_nmi)
00010003 0.149ms (+0.000ms): profile_tick (nmi_watchdog_tick)
00010003 0.150ms (+0.000ms): profile_hook (profile_tick)
00010003 0.150ms (+0.000ms): read_lock (profile_hook)
00010004 0.150ms (+0.000ms): read_lock (<00000000>)
00010004 0.150ms (+0.000ms): notifier_call_chain (profile_hook)
00010004 0.151ms (+0.000ms): _read_unlock (profile_tick)
00010003 0.151ms (+0.250ms): profile_hit (nmi_watchdog_tick)
00000003 0.401ms (+0.000ms): _spin_unlock (flush_tlb_others)
00000001 0.402ms (+0.000ms): free_pages_and_swap_cache (unmap_vmas)
00000001 0.402ms (+0.000ms): lru_add_drain (free_pages_and_swap_cache)

or this one...

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-R8-nd
-------------------------------------------------------
 latency: 445 us, entries: 22 (22)   |   [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
    -----------------
    | task: kswapd0/40, uid:0 nice:0 policy:0 rt_prio:0
    -----------------
 => started at: spin_lock+0x24/0x90
 => ended at:   _spin_unlock+0x2d/0x60
=======>
00000001 0.000ms (+0.000ms): spin_lock (page_referenced_file)
00000001 0.000ms (+0.000ms): spin_lock (<00000000>)
00000001 0.000ms (+0.000ms): vma_prio_tree_next (page_referenced_file)
00000001 0.001ms (+0.001ms): prio_tree_first (vma_prio_tree_next)
00000001 0.003ms (+0.001ms): prio_tree_left (prio_tree_first)
00000001 0.005ms (+0.000ms): page_referenced_one (page_referenced_file)
00000001 0.005ms (+0.000ms): spin_lock (page_referenced_one)
00000002 0.006ms (+0.001ms): spin_lock (<00000000>)
00000002 0.007ms (+0.000ms): kmap_atomic (page_referenced_one)
00000003 0.007ms (+0.001ms): page_address (page_referenced_one)
00000003 0.009ms (+0.000ms): flush_tlb_page (page_referenced_one)
00000004 0.010ms (+0.000ms): flush_tlb_others (flush_tlb_page)
00000004 0.010ms (+0.000ms): spin_lock (flush_tlb_others)
00000005 0.011ms (+0.001ms): spin_lock (<00000000>)
00000005 0.012ms (+0.000ms): send_IPI_mask (flush_tlb_others)
00000005 0.012ms (+0.431ms): send_IPI_mask_bitmask (flush_tlb_others)
00000005 0.444ms (+0.000ms): _spin_unlock (flush_tlb_others)
00000003 0.445ms (+0.000ms): kunmap_atomic (page_referenced_one)
00000002 0.445ms (+0.000ms): _spin_unlock (page_referenced_one)
00000001 0.445ms (+0.000ms): _spin_unlock (page_referenced_file)
00000001 0.446ms (+0.000ms): sub_preempt_count (_spin_unlock)
00000001 0.446ms (+0.000ms): update_max_trace (check_preempt_timing)

or this one...

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-R8-nd
-------------------------------------------------------
 latency: 428 us, entries: 49 (49)   |   [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
    -----------------
    | task: get_ltrace.sh/5514, uid:0 nice:-9 policy:0 rt_prio:0
    -----------------
 => started at: flush_tlb_mm+0x1d/0xd0
 => ended at:   flush_tlb_mm+0x52/0xd0
=======>
00000001 0.000ms (+0.001ms): flush_tlb_mm (copy_mm)
00000001 0.001ms (+0.000ms): flush_tlb_others (flush_tlb_mm)
00000001 0.002ms (+0.000ms): spin_lock (flush_tlb_others)
00000002 0.002ms (+0.000ms): spin_lock (<00000000>)
00000002 0.003ms (+0.000ms): send_IPI_mask (flush_tlb_others)
00000002 0.003ms (+0.415ms): send_IPI_mask_bitmask (flush_tlb_others)
00000002 0.419ms (+0.000ms): smp_apic_timer_interrupt (flush_tlb_others)
00010002 0.419ms (+0.000ms): profile_tick (smp_apic_timer_interrupt)
00010002 0.419ms (+0.000ms): profile_hook (profile_tick)
00010002 0.419ms (+0.000ms): read_lock (profile_hook)
00010003 0.420ms (+0.000ms): read_lock (<00000000>)
00010003 0.420ms (+0.000ms): notifier_call_chain (profile_hook)
00010003 0.420ms (+0.000ms): _read_unlock (profile_tick)
00010002 0.420ms (+0.000ms): profile_hit (smp_apic_timer_interrupt)
00010002 0.420ms (+0.000ms): update_process_times (smp_apic_timer_interrupt)
...

or this one...

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-R8-nd
-------------------------------------------------------
 latency: 330 us, entries: 9 (9)   |   [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
    -----------------
    | task: get_ltrace.sh/5514, uid:0 nice:-9 policy:0 rt_prio:0
    -----------------
 => started at: flush_tlb_mm+0x1d/0xd0
 => ended at:   flush_tlb_mm+0x52/0xd0
=======>
00000001 0.000ms (+0.001ms): flush_tlb_mm (copy_mm)
00000001 0.001ms (+0.000ms): flush_tlb_others (flush_tlb_mm)
00000001 0.002ms (+0.000ms): spin_lock (flush_tlb_others)
00000002 0.002ms (+0.000ms): spin_lock (<00000000>)
00000002 0.003ms (+0.000ms): send_IPI_mask (flush_tlb_others)
00000002 0.004ms (+0.326ms): send_IPI_mask_bitmask (flush_tlb_others)
00000002 0.330ms (+0.000ms): _spin_unlock (flush_tlb_others)
00000001 0.331ms (+0.000ms): sub_preempt_count (flush_tlb_mm)
00000001 0.331ms (+0.000ms): update_max_trace (check_preempt_timing)

try_to_wake_up
==============

Buried inside a pretty long trace in kswapd0, I saw the following...

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-R8-nd
-------------------------------------------------------
 latency: 426 us, entries: 286 (286)   |   [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
    -----------------
    | task: kswapd0/40, uid:0 nice:0 policy:0 rt_prio:0
    -----------------
 => started at: spin_lock+0x24/0x90
 => ended at:   _spin_unlock+0x2d/0x60
=======>
00000001 0.000ms (+0.000ms): spin_lock (page_referenced_file)
00000001 0.000ms (+0.000ms): spin_lock (<00000000>)
00000001 0.000ms (+0.000ms): vma_prio_tree_next (page_referenced_file)
00000001 0.000ms (+0.000ms): prio_tree_first (vma_prio_tree_next)
00000001 0.001ms (+0.000ms): prio_tree_left (prio_tree_first)
00000001 0.001ms (+0.000ms): page_referenced_one (page_referenced_file)
00000001 0.001ms (+0.000ms): spin_lock (page_referenced_one)
00000002 0.002ms (+0.000ms): spin_lock (<00000000>)
...
00000006 0.108ms (+0.000ms): task_rq_lock (try_to_wake_up)
00000006 0.108ms (+0.000ms): spin_lock (task_rq_lock)
00000007 0.108ms (+0.000ms): spin_lock (<00000000>)
00000007 0.109ms (+0.000ms): wake_idle (try_to_wake_up)
00000007 0.109ms (+0.000ms): activate_task (try_to_wake_up)
00000007 0.109ms (+0.000ms): sched_clock (activate_task)
00000007 0.109ms (+0.000ms): recalc_task_prio (activate_task)
00000007 0.109ms (+0.000ms): effective_prio (recalc_task_prio)
00000007 0.110ms (+0.000ms): __activate_task (try_to_wake_up)
00000007 0.110ms (+0.000ms): enqueue_task (__activate_task)
00000007 0.110ms (+0.000ms): sched_info_queued (enqueue_task)
00000007 0.110ms (+0.000ms): resched_task (try_to_wake_up)
00000007 0.111ms (+0.000ms): task_rq_unlock (try_to_wake_up)
00000007 0.111ms (+0.000ms): _spin_unlock_irqrestore (try_to_wake_up)
00000006 0.111ms (+0.298ms): preempt_schedule (try_to_wake_up)
00000005 0.409ms (+0.000ms): _spin_unlock (flush_tlb_others)
00000004 0.409ms (+0.000ms): preempt_schedule (flush_tlb_others)
00000003 0.410ms (+0.000ms): preempt_schedule (flush_tlb_page)
00000003 0.410ms (+0.000ms): kunmap_atomic (page_referenced_one)
00000002 0.410ms (+0.000ms): preempt_schedule (page_referenced_one)
00000002 0.410ms (+0.000ms): _spin_unlock (page_referenced_one)
00000001 0.410ms (+0.000ms): preempt_schedule (page_referenced_one)

There is an almost 300 usec hit there as we unwind the nested layers.

So the long waits on paths through sched and timer_tsc appear to be
eliminated when using PIO to the disk.

Is there some "for sure" way to limit the size and/or duration of
DMA transfers so I get reasonable performance from the disk (and
other DMA devices) and reasonable latency?
  --Mark


^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R1
@ 2004-09-09 18:31 Mark_H_Johnson
  0 siblings, 0 replies; 36+ messages in thread
From: Mark_H_Johnson @ 2004-09-09 18:31 UTC (permalink / raw)
  To: Alan Cox
  Cc: Mark_H_Johnson, Ingo Molnar, Lee Revell, Free Ekanayaka,
	Eric St-Laurent, Linux Kernel Mailing List, K.R. Foley,
	Felipe Alfaro Solana, Daniel Schmitt, P.O. Gaillard, nando, luke,
	free78

>If you haven't done hdparm -u1 that may be a reason you want to touch
>these.

Alas, but
# hdparm /dev/hda
/dev/hda:
 multcount    = 16 (on)
 IO_support   =  1 (32-bit)
 unmaskirq    =  1 (on)
 using_dma    =  0 (off)
 keepsettings =  0 (off)
 readonly     =  0 (off)
 readahead    = 64 (on)
 geometry     = 58168/16/63, sectors = 58633344, start = 0

so I already have IRQs unmasked. [This was during the no-DMA tests; I
usually run with DMA enabled.]

I'll be commenting on the results of the no DMA tests shortly.

  --Mark


^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R1
@ 2004-09-08 20:33 Mark_H_Johnson
  2004-09-08 22:31 ` Alan Cox
                   ` (3 more replies)
  0 siblings, 4 replies; 36+ messages in thread
From: Mark_H_Johnson @ 2004-09-08 20:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mark_H_Johnson, Lee Revell, Free Ekanayaka, Eric St-Laurent,
	linux-kernel, K.R. Foley, Felipe Alfaro Solana, Daniel Schmitt,
	P.O. Gaillard, nando, luke, free78

>.... Please disable IDE DMA and see
>what happens (after hiding the PIO IDE codepath via
>touch_preempt_timing()).

Not quite sure where to add the touch_preempt_timing() calls - somewhere
in the loop in ide_outsl and ide_insl? [so we keep resetting the
start/end times?]

I did collect some initial data - most are of the form:

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-R8
-------------------------------------------------------
 latency: 545 us, entries: 8 (8)   |   [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
    -----------------
    | task: IRQ 14/140, uid:0 nice:-10 policy:0 rt_prio:0
    -----------------
 => started at: kmap_atomic+0x23/0xe0
 => ended at:   kunmap_atomic+0x7b/0xa0
=======>
00000001 0.000ms (+0.000ms): kmap_atomic (ide_multwrite)
00000001 0.000ms (+0.001ms): page_address (ide_multwrite)
00000001 0.001ms (+0.000ms): taskfile_output_data (ide_multwrite)
00000001 0.002ms (+0.000ms): ata_output_data (taskfile_output_data)
00000001 0.003ms (+0.542ms): ide_outsl (ata_output_data)
00000001 0.545ms (+0.000ms): kunmap_atomic (ide_multwrite)
00000001 0.545ms (+0.000ms): sub_preempt_count (kunmap_atomic)
00000001 0.546ms (+0.000ms): update_max_trace (check_preempt_timing)

Which I am sure is what you expected.

I did get a couple that looked like this:

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-R8
-------------------------------------------------------
 latency: 1261 us, entries: 73 (73)   |   [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
    -----------------
    | task: sleep/4859, uid:0 nice:-9 policy:0 rt_prio:0
    -----------------
 => started at: spin_lock+0x24/0x90
 => ended at:   _spin_unlock+0x2d/0x60
=======>
00000001 0.000ms (+0.000ms): spin_lock (do_wp_page)
00000001 0.000ms (+0.000ms): spin_lock (<00000000>)
00000001 0.000ms (+0.000ms): kmap_atomic (do_wp_page)
00000002 0.000ms (+0.000ms): page_address (do_wp_page)
00000002 0.001ms (+0.000ms): page_remove_rmap (do_wp_page)
00000002 0.002ms (+0.001ms): flush_tlb_page (do_wp_page)
00000003 0.003ms (+0.000ms): flush_tlb_others (flush_tlb_page)
00000003 0.003ms (+0.000ms): spin_lock (flush_tlb_others)
00000004 0.004ms (+0.001ms): spin_lock (<00000000>)
00000004 0.005ms (+0.000ms): send_IPI_mask (flush_tlb_others)
00000004 0.005ms (+0.321ms): send_IPI_mask_bitmask (flush_tlb_others)
00010004 0.327ms (+0.000ms): do_nmi (flush_tlb_others)
00010004 0.327ms (+0.003ms): do_nmi (kallsyms_lookup)
00010004 0.331ms (+0.000ms): notifier_call_chain (default_do_nmi)
00010004 0.331ms (+0.000ms): profile_tick (nmi_watchdog_tick)
00010004 0.331ms (+0.000ms): profile_hook (profile_tick)
00010004 0.332ms (+0.000ms): read_lock (profile_hook)
00010005 0.332ms (+0.000ms): read_lock (<00000000>)
00010005 0.332ms (+0.000ms): notifier_call_chain (profile_hook)
00010005 0.332ms (+0.000ms): _read_unlock (profile_tick)
00010004 0.333ms (+0.091ms): profile_hit (nmi_watchdog_tick)
00000004 0.424ms (+0.000ms): smp_apic_timer_interrupt (flush_tlb_others)
...
00000005 0.431ms (+0.000ms): do_softirq (smp_apic_timer_interrupt)
00000005 0.431ms (+0.000ms): __do_softirq (do_softirq)
00000005 0.432ms (+0.000ms): wake_up_process (do_softirq)
00000005 0.432ms (+0.000ms): try_to_wake_up (wake_up_process)
00000005 0.433ms (+0.000ms): task_rq_lock (try_to_wake_up)
00000005 0.433ms (+0.000ms): spin_lock (task_rq_lock)
00000006 0.433ms (+0.000ms): spin_lock (<00000000>)
00000006 0.433ms (+0.000ms): wake_idle (try_to_wake_up)
00000006 0.434ms (+0.000ms): activate_task (try_to_wake_up)
00000006 0.434ms (+0.000ms): sched_clock (activate_task)
00000006 0.434ms (+0.000ms): recalc_task_prio (activate_task)
00000006 0.435ms (+0.000ms): effective_prio (recalc_task_prio)
00000006 0.435ms (+0.000ms): __activate_task (try_to_wake_up)
00000006 0.436ms (+0.000ms): enqueue_task (__activate_task)
00000006 0.436ms (+0.000ms): sched_info_queued (enqueue_task)
00000006 0.437ms (+0.000ms): resched_task (try_to_wake_up)
00000006 0.437ms (+0.000ms): task_rq_unlock (try_to_wake_up)
00000006 0.437ms (+0.000ms): _spin_unlock_irqrestore (try_to_wake_up)
00000005 0.437ms (+0.820ms): preempt_schedule (try_to_wake_up)
00000004 1.258ms (+0.000ms): _spin_unlock (flush_tlb_others)
00000003 1.258ms (+0.000ms): preempt_schedule (flush_tlb_others)
00000002 1.258ms (+0.000ms): preempt_schedule (flush_tlb_page)
...

or this

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-R8
-------------------------------------------------------
 latency: 1210 us, entries: 113 (113)   |   [VP:1 KP:1 SP:1 HP:1 #CPUS:2]
    -----------------
    | task: get_ltrace.sh/3071, uid:0 nice:-9 policy:0 rt_prio:0
    -----------------
 => started at: flush_tlb_mm+0x1d/0xd0
 => ended at:   flush_tlb_mm+0x52/0xd0
=======>
00000001 0.000ms (+0.001ms): flush_tlb_mm (copy_mm)
00000001 0.001ms (+0.000ms): flush_tlb_others (flush_tlb_mm)
00000001 0.001ms (+0.000ms): spin_lock (flush_tlb_others)
00000002 0.002ms (+0.000ms): spin_lock (<00000000>)
00000002 0.003ms (+0.000ms): send_IPI_mask (flush_tlb_others)
00000002 0.003ms (+0.378ms): send_IPI_mask_bitmask (flush_tlb_others)
00000002 0.381ms (+0.000ms): smp_apic_timer_interrupt (flush_tlb_others)
...
00010003 0.459ms (+0.000ms): preempt_schedule (timer_interrupt)
00010003 0.460ms (+0.000ms): do_timer (timer_interrupt)
00010003 0.460ms (+0.000ms): update_wall_time (do_timer)
00010003 0.460ms (+0.000ms): update_wall_time_one_tick (update_wall_time)
00010003 0.460ms (+0.000ms): _spin_unlock (timer_interrupt)
00010002 0.460ms (+0.000ms): preempt_schedule (timer_interrupt)
00010002 0.461ms (+0.000ms): spin_lock (do_IRQ)
00010003 0.461ms (+0.000ms): spin_lock (<00000000>)
00010003 0.461ms (+0.000ms): generic_note_interrupt (do_IRQ)
00010003 0.461ms (+0.000ms): end_edge_ioapic_irq (do_IRQ)
00010003 0.461ms (+0.000ms): _spin_unlock (do_IRQ)
00010002 0.462ms (+0.000ms): preempt_schedule (do_IRQ)
00000003 0.462ms (+0.000ms): do_softirq (do_IRQ)
00000003 0.462ms (+0.747ms): __do_softirq (do_softirq)
00000002 1.209ms (+0.000ms): _spin_unlock (flush_tlb_others)
00000001 1.210ms (+0.000ms): preempt_schedule (flush_tlb_others)
00000001 1.210ms (+0.000ms): sub_preempt_count (flush_tlb_mm)
00000001 1.211ms (+0.000ms): update_max_trace (check_preempt_timing)

where I had > 1 msec latencies w/o any apparent references to the IDE.

  --Mark


^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-R0
@ 2004-09-02 22:14 Ingo Molnar
  2004-09-03  0:24 ` Lee Revell
  0 siblings, 1 reply; 36+ messages in thread
From: Ingo Molnar @ 2004-09-02 22:14 UTC (permalink / raw)
  To: Lee Revell
  Cc: linux-kernel, K.R. Foley, Felipe Alfaro Solana, Daniel Schmitt,
	Mark_H_Johnson, P.O. Gaillard


* Lee Revell <rlrevell@joe-job.com> wrote:

> > i've given up on the netdev_backlog_granularity approach, and as a
> > replacement i've modified specific network drivers to return at a safe
> > point if softirq preemption is requested.
> 
> Makes sense, netdev_max_backlog never made a difference on my system
> (via-rhine driver).

via-rhine does RX processing from the hardirq handler, this codepath is
harder to break up. The NAPI ->poll functions used by e100 and 8139too
are much easier to break up because RX throttling and re-trying is a
basic property of NAPI.

	Ingo

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2004-09-15 15:31 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-09-07 16:54 [patch] voluntary-preempt-2.6.9-rc1-bk4-R1 Mark_H_Johnson
2004-09-08 18:42 ` Ingo Molnar
2004-09-10  2:31   ` K.R. Foley
2004-09-10 17:56     ` K.R. Foley
2004-09-10 19:26       ` Ingo Molnar
2004-09-10 19:28         ` K.R. Foley
2004-09-09 16:02 ` Ingo Molnar
  -- strict thread matches above, loose matches on Subject: below --
2004-09-13 14:44 Mark_H_Johnson
2004-09-14 18:32 ` Ingo Molnar
2004-09-15 15:26   ` Stephen Smalley
2004-09-10 14:28 Mark_H_Johnson
2004-09-10 19:29 ` Ingo Molnar
2004-09-10 12:57 Mark_H_Johnson
2004-09-10 13:10 ` Ingo Molnar
2004-09-09 22:56 Mark_H_Johnson
2004-09-10 12:42 ` Ingo Molnar
2004-09-09 22:41 Mark_H_Johnson
2004-09-09 19:23 Mark_H_Johnson
2004-09-09 19:47 ` Ingo Molnar
2004-09-09 19:02   ` Alan Cox
2004-09-09 18:31 Mark_H_Johnson
2004-09-08 20:33 Mark_H_Johnson
2004-09-08 22:31 ` Alan Cox
2004-09-09 16:44   ` Thomas Charbonnel
2004-09-09  6:27 ` Ingo Molnar
2004-09-09 15:12 ` Ingo Molnar
2004-09-09 15:33 ` Ingo Molnar
2004-09-02 22:14 [patch] voluntary-preempt-2.6.9-rc1-bk4-R0 Ingo Molnar
2004-09-03  0:24 ` Lee Revell
2004-09-03  3:17   ` Eric St-Laurent
2004-09-03  6:26     ` Lee Revell
2004-09-03  6:36       ` Ingo Molnar
2004-09-03  6:49         ` Lee Revell
2004-09-03  7:05           ` Ingo Molnar
2004-09-03  7:40             ` Lee Revell
2004-09-03  7:50               ` Free Ekanayaka
2004-09-03  8:05                 ` Lee Revell
2004-09-03  9:25                   ` [patch] voluntary-preempt-2.6.9-rc1-bk4-R1 Ingo Molnar
2004-09-03  9:50                     ` Luke Yelavich
2004-09-03 10:29                       ` Ingo Molnar
2004-09-03 10:43                         ` Luke Yelavich
2004-09-03 11:33                     ` Thomas Charbonnel
2004-09-03 11:49                       ` Ingo Molnar
2004-09-03 12:05                         ` Thomas Charbonnel
2004-09-03 16:14                         ` Thomas Charbonnel
2004-09-03 17:36                           ` Thomas Charbonnel

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox