* [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9
2004-09-02 6:55 ` [patch] voluntary-preempt-2.6.9-rc1-bk4-Q8 Ingo Molnar
@ 2004-09-02 11:10 ` Ingo Molnar
2004-09-02 12:14 ` Thomas Charbonnel
2004-09-02 13:16 ` Thomas Charbonnel
0 siblings, 2 replies; 18+ messages in thread
From: Ingo Molnar @ 2004-09-02 11:10 UTC (permalink / raw)
To: linux-kernel
Cc: K.R. Foley, Felipe Alfaro Solana, Daniel Schmitt, Lee Revell,
Mark_H_Johnson
i've released the -Q9 patch:
http://redhat.com/~mingo/voluntary-preempt/voluntary-preempt-2.6.9-rc1-bk4-Q9
on top of:
http://redhat.com/~mingo/voluntary-preempt/diff-bk-040828-2.6.8.1.bz2
Changes:
- fixed the cond_resched_softirq() bug noticed by Mika Penttila.
- updated the preemption-friendly network-RX code but 8139too.c still
produces delayed packets so netdev_backlog_granularity now defaults
to 2, which seems to be working fine on my testbox.
- the latency_trace output now includes the kernel and patch version,
for easier sorting of reports.
Ingo
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9
2004-09-02 11:10 ` [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9 Ingo Molnar
@ 2004-09-02 12:14 ` Thomas Charbonnel
2004-09-02 13:16 ` Thomas Charbonnel
1 sibling, 0 replies; 18+ messages in thread
From: Thomas Charbonnel @ 2004-09-02 12:14 UTC (permalink / raw)
To: Ingo Molnar
Cc: linux-kernel, K.R. Foley, Felipe Alfaro Solana, Daniel Schmitt,
Lee Revell, Mark_H_Johnson
Ingo Molnar wrote:
> i've released the -Q9 patch:
>
> http://redhat.com/~mingo/voluntary-preempt/voluntary-preempt-2.6.9-rc1-bk4-Q9
>
> on top of:
>
> http://redhat.com/~mingo/voluntary-preempt/diff-bk-040828-2.6.8.1.bz2
>
> Changes:
>
> - fixed the cond_resched_softirq() bug noticed by Mika Penttila.
>
> - updated the preemption-friendly network-RX code but 8139too.c still
> produces delayed packets so netdev_backlog_granularity now defaults
> to 2, which seems to be working fine on my testbox.
Thanks, the network seems to be working fine with the new default value.
I still get > 140 us non-preemptible sections in rtl8139_poll, though:
http://www.undata.org/~thomas/q9_rtl8139.trace
Thomas
* Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9
2004-09-02 11:10 ` [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9 Ingo Molnar
2004-09-02 12:14 ` Thomas Charbonnel
@ 2004-09-02 13:16 ` Thomas Charbonnel
2004-09-02 13:23 ` Ingo Molnar
1 sibling, 1 reply; 18+ messages in thread
From: Thomas Charbonnel @ 2004-09-02 13:16 UTC (permalink / raw)
To: Ingo Molnar
Cc: linux-kernel, K.R. Foley, Felipe Alfaro Solana, Daniel Schmitt,
Lee Revell, Mark_H_Johnson
Ingo Molnar wrote:
> i've released the -Q9 patch:
>
> http://redhat.com/~mingo/voluntary-preempt/voluntary-preempt-2.6.9-rc1-bk4-Q9
With ACPI compiled in and booting with acpi=off (which again doesn't
seem to be honoured), here's another weird one:
preemption latency trace v1.0.5 on 2.6.9-rc1-VP-Q9
--------------------------------------------------
latency: 2888 us, entries: 37 (37)
-----------------
| task: swapper/0, uid:0 nice:0 policy:0 rt_prio:0
-----------------
=> started at: do_IRQ+0x19/0x1a0
=> ended at: do_IRQ+0x14d/0x1a0
=======>
00010000 0.000ms (+0.000ms): do_IRQ (common_interrupt)
00010000 0.000ms (+0.000ms): do_IRQ (default_idle)
00010000 0.000ms (+0.000ms): do_IRQ (<00000000>)
00010001 0.000ms (+0.002ms): mask_and_ack_8259A (do_IRQ)
00010001 0.002ms (+0.000ms): generic_redirect_hardirq (do_IRQ)
00010000 0.002ms (+0.000ms): generic_handle_IRQ_event (do_IRQ)
00010000 0.003ms (+0.000ms): timer_interrupt (generic_handle_IRQ_event)
00010001 0.003ms (+2.878ms): mark_offset_tsc (timer_interrupt)
00010001 2.882ms (+0.000ms): do_timer (timer_interrupt)
00010001 2.882ms (+0.000ms): update_process_times (do_timer)
00010001 2.882ms (+0.000ms): update_one_process (update_process_times)
00010001 2.882ms (+0.000ms): run_local_timers (update_process_times)
00010001 2.882ms (+0.000ms): raise_softirq (update_process_times)
00010001 2.882ms (+0.000ms): scheduler_tick (update_process_times)
00010001 2.883ms (+0.000ms): sched_clock (scheduler_tick)
00010001 2.883ms (+0.000ms): update_wall_time (do_timer)
00010001 2.883ms (+0.000ms): update_wall_time_one_tick (update_wall_time)
00010001 2.883ms (+0.000ms): profile_tick (timer_interrupt)
00010001 2.884ms (+0.000ms): profile_hook (profile_tick)
00010002 2.884ms (+0.000ms): notifier_call_chain (profile_hook)
00010001 2.884ms (+0.000ms): profile_hit (timer_interrupt)
00010001 2.884ms (+0.000ms): generic_note_interrupt (do_IRQ)
00010001 2.884ms (+0.000ms): end_8259A_irq (do_IRQ)
00010001 2.885ms (+0.000ms): enable_8259A_irq (do_IRQ)
00000001 2.886ms (+0.000ms): do_softirq (do_IRQ)
00000001 2.886ms (+0.000ms): __do_softirq (do_softirq)
00000001 2.886ms (+0.000ms): wake_up_process (do_softirq)
00000001 2.886ms (+0.000ms): try_to_wake_up (wake_up_process)
00000001 2.886ms (+0.000ms): task_rq_lock (try_to_wake_up)
00000002 2.886ms (+0.000ms): activate_task (try_to_wake_up)
00000002 2.886ms (+0.000ms): sched_clock (activate_task)
00000002 2.887ms (+0.000ms): recalc_task_prio (activate_task)
00000002 2.887ms (+0.000ms): effective_prio (recalc_task_prio)
00000002 2.887ms (+0.000ms): enqueue_task (activate_task)
00000001 2.887ms (+0.000ms): preempt_schedule (try_to_wake_up)
00000001 2.888ms (+0.000ms): sub_preempt_count (do_IRQ)
00000001 2.888ms (+0.000ms): update_max_trace (check_preempt_timing)
Thomas
* Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9
2004-09-02 13:16 ` Thomas Charbonnel
@ 2004-09-02 13:23 ` Ingo Molnar
2004-09-02 14:38 ` Thomas Charbonnel
0 siblings, 1 reply; 18+ messages in thread
From: Ingo Molnar @ 2004-09-02 13:23 UTC (permalink / raw)
To: Thomas Charbonnel
Cc: linux-kernel, K.R. Foley, Felipe Alfaro Solana, Daniel Schmitt,
Lee Revell, Mark_H_Johnson
* Thomas Charbonnel <thomas@undata.org> wrote:
> With ACPI compiled in and booting with acpi=off (which again doesn't
> seem to be honoured), here's another weird one :
> 00010000 0.003ms (+0.000ms): timer_interrupt (generic_handle_IRQ_event)
> 00010001 0.003ms (+2.878ms): mark_offset_tsc (timer_interrupt)
> 00010001 2.882ms (+0.000ms): do_timer (timer_interrupt)
do you have the NMI watchdog enabled? That could help us debug this.
Enabling the APIC/IO_APIC and using nmi_watchdog=1 would be the ideal
solution - if that doesn't work then nmi_watchdog=2 would be fine too,
but remove this code from arch/i386/kernel/nmi.c:
if (nmi_watchdog == NMI_LOCAL_APIC)
nmi_hz = 1;
(otherwise nmi_watchdog=2 would result in one NMI per second, not enough
to shed light on the above 2.8 msec latency.)
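[Editor's note: for readers reproducing this, the boot-line usage being asked for looks roughly like the following GRUB fragment; the kernel image and root device are placeholders. nmi_watchdog=1 drives the NMI from the IO-APIC timer, nmi_watchdog=2 from the local-APIC performance counters - which is why the NMI_LOCAL_APIC clamp above matters only for =2.]

```shell
# GRUB kernel line (illustrative; image name and root device are placeholders):
#   nmi_watchdog=1  -> NMI via the IO-APIC-driven timer (preferred here)
#   nmi_watchdog=2  -> NMI via local APIC performance counters
kernel /boot/vmlinuz-2.6.9-rc1-VP-Q9 ro root=/dev/hda1 nmi_watchdog=1
```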
Ingo
* Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9
2004-09-02 13:23 ` Ingo Molnar
@ 2004-09-02 14:38 ` Thomas Charbonnel
0 siblings, 0 replies; 18+ messages in thread
From: Thomas Charbonnel @ 2004-09-02 14:38 UTC (permalink / raw)
To: Ingo Molnar
Cc: linux-kernel, K.R. Foley, Felipe Alfaro Solana, Daniel Schmitt,
Lee Revell, Mark_H_Johnson
Ingo Molnar wrote:
> do you have the NMI watchdog enabled? That could help us debug this.
> Enabling the APIC/IO_APIC and using nmi_watchdog=1 would be the ideal
> solution - if that doesn't work then nmi_watchdog=2 would be fine too,
> but remove this code from arch/i386/kernel/nmi.c:
>
> if (nmi_watchdog == NMI_LOCAL_APIC)
> nmi_hz = 1;
>
> (otherwise nmi_watchdog=2 would result in one NMI per second, not enough
> to shed light on the above 2.8 msec latency.)
>
> Ingo
nmi_watchdog=1 worked fine.
Here's a trace (there seems to be a counter overflow):
preemption latency trace v1.0.5 on 2.6.9-rc1-VP-Q9
--------------------------------------------------
latency: 539 us, entries: 38 (38)
-----------------
| task: swapper/0, uid:0 nice:0 policy:0 rt_prio:0
-----------------
=> started at: do_IRQ+0x19/0x190
=> ended at: do_IRQ+0x14a/0x190
=======>
00010000 0.000ms (+0.000ms): do_IRQ (common_interrupt)
00010000 0.000ms (+0.000ms): do_IRQ (default_idle)
00010000 0.000ms (+0.000ms): do_IRQ (<00000000>)
00010001 0.000ms (+0.002ms): mask_and_ack_8259A (do_IRQ)
00010001 0.002ms (+0.000ms): generic_redirect_hardirq (do_IRQ)
00010000 0.002ms (+0.000ms): generic_handle_IRQ_event (do_IRQ)
00010000 0.002ms (+0.000ms): timer_interrupt (generic_handle_IRQ_event)
00010001 0.003ms (+0.005ms): mark_offset_tsc (timer_interrupt)
00010001 0.008ms (+0.000ms): do_timer (timer_interrupt)
00010001 0.008ms (+0.000ms): update_process_times (do_timer)
00010001 0.008ms (+0.000ms): update_one_process (update_process_times)
00010001 0.009ms (+0.000ms): run_local_timers (update_process_times)
00010001 0.009ms (+0.000ms): raise_softirq (update_process_times)
00010001 0.009ms (+0.000ms): scheduler_tick (update_process_times)
00010001 0.009ms (+0.000ms): sched_clock (scheduler_tick)
00010001 0.009ms (+0.000ms): update_wall_time (do_timer)
00010001 0.009ms (+0.000ms): update_wall_time_one_tick (update_wall_time)
00010001 0.010ms (+0.000ms): generic_note_interrupt (do_IRQ)
00010001 0.010ms (+0.000ms): end_8259A_irq (do_IRQ)
00010001 0.010ms (+0.000ms): enable_8259A_irq (do_IRQ)
00000001 0.011ms (+0.524ms): do_softirq (do_IRQ)
00000001 0.535ms (+0.000ms): do_nmi (mcount)
00010001 0.536ms (+0.000ms): profile_tick (nmi_watchdog_tick)
00010001 0.536ms (+0.000ms): profile_hook (profile_tick)
00010002 0.536ms (+0.000ms): notifier_call_chain (profile_hook)
00010001 0.537ms (+689953.062ms): profile_hit (nmi_watchdog_tick)
00000002 689953.599ms (+1.523ms): sched_clock (activate_task)
00000001 0.537ms (+0.000ms): wake_up_process (do_softirq)
00000001 0.537ms (+0.000ms): try_to_wake_up (wake_up_process)
00000001 0.537ms (+0.000ms): task_rq_lock (try_to_wake_up)
00000002 0.537ms (+0.000ms): activate_task (try_to_wake_up)
00000002 0.538ms (+0.000ms): sched_clock (activate_task)
00000002 0.538ms (+0.000ms): recalc_task_prio (activate_task)
00000002 0.538ms (+0.000ms): effective_prio (recalc_task_prio)
00000002 0.538ms (+0.000ms): enqueue_task (activate_task)
00000001 0.538ms (+0.000ms): preempt_schedule (try_to_wake_up)
00000001 0.539ms (+0.000ms): sub_preempt_count (do_IRQ)
00000001 0.539ms (+0.000ms): update_max_trace (check_preempt_timing)
Thomas
* Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9
@ 2004-09-02 20:21 Mark_H_Johnson
2004-09-02 20:38 ` Ingo Molnar
2004-09-02 22:43 ` Ingo Molnar
0 siblings, 2 replies; 18+ messages in thread
From: Mark_H_Johnson @ 2004-09-02 20:21 UTC (permalink / raw)
To: Ingo Molnar
Cc: linux-kernel, K.R. Foley, Felipe Alfaro Solana, Daniel Schmitt,
Lee Revell
>I've released the -Q9 patch:
>...
Built a kernel with -Q9 plus the following additional patches:
... email saved into mark-offset-tsc-mcount.patch ...
... email saved into ens001.patch ...
... own patch to add mcount() calls to sched.c ...
Ran my 25 minute test and collected 71 traces > 500 usec. The following
is a summary of the results.
[1] No cascade traces - several runs with this result - we should be OK
here.
[2] rtl8139 still showing some traces due to my use of
netdev_backlog_granularity = 8. Had severe problems when it was 1.
[3] I read what you said about hardware overhead, but the data seems
to show that latencies on the way to "__switch_to" may be buried in some
inline functions. A typical trace looks something like this...
00000002 0.003ms (+0.000ms): dummy_switch_tasks (schedule)
00000002 0.003ms (+0.000ms): schedule (worker_thread)
00000002 0.003ms (+0.000ms): schedule (worker_thread)
00000002 0.004ms (+0.000ms): schedule (worker_thread)
00000002 0.004ms (+0.000ms): schedule (worker_thread)
00000002 0.004ms (+0.000ms): schedule (worker_thread)
00000002 0.004ms (+0.274ms): schedule (worker_thread)
04000002 0.279ms (+0.000ms): __switch_to (schedule)
dummy_switch_tasks refers to a function I added / called right after
the label switch_tasks (in sched.c). The mcount() traces that follow
are basically at each step leading up to the call to context_switch.
Since context_switch is static inline, I assume it is not traced -
please confirm. I am considering adding mcount() calls inside
context_switch to see if there is a step that has some long
duration.
[4] The go_idle path is also one that appears to have some long
latencies. The following is an example - the trace of dummy_go_idle
refers to the label named go_idle.
00000002 0.002ms (+0.000ms): dummy_go_idle (schedule)
00000002 0.002ms (+0.060ms): schedule (io_schedule)
00000002 0.063ms (+0.069ms): load_balance_newidle (schedule)
00000002 0.133ms (+0.074ms): find_busiest_group (load_balance_newidle)
00000002 0.207ms (+0.034ms): find_next_bit (find_busiest_group)
00000002 0.242ms (+0.039ms): find_next_bit (find_busiest_group)
00000002 0.281ms (+0.070ms): find_busiest_queue (load_balance_newidle)
00000002 0.351ms (+0.071ms): find_next_bit (find_busiest_queue)
00000002 0.422ms (+0.069ms): double_lock_balance (load_balance_newidle)
00000003 0.492ms (+0.070ms): move_tasks (load_balance_newidle)
00010003 0.563ms (+0.000ms): do_nmi (move_tasks)
00010003 0.563ms (+0.002ms): do_nmi (del_timer_sync)
00010003 0.566ms (+0.000ms): profile_tick (nmi_watchdog_tick)
00010003 0.566ms (+0.000ms): profile_hook (profile_tick)
00010003 0.567ms (+0.000ms): read_lock (profile_hook)
00010004 0.567ms (+0.000ms): notifier_call_chain (profile_hook)
00010003 0.567ms (+0.001ms): profile_hit (nmi_watchdog_tick)
00000003 0.569ms (+0.000ms): find_next_bit (move_tasks)
00000003 0.569ms (+0.000ms): dequeue_task (move_tasks)
00000003 0.570ms (+0.000ms): enqueue_task (move_tasks)
00000003 0.570ms (+0.000ms): resched_task (move_tasks)
00000003 0.571ms (+0.000ms): find_next_bit (move_tasks)
00000003 0.571ms (+0.000ms): dequeue_task (move_tasks)
00000003 0.571ms (+0.000ms): enqueue_task (move_tasks)
00000003 0.572ms (+0.000ms): resched_task (move_tasks)
00000002 0.572ms (+0.000ms): preempt_schedule (load_balance_newidle)
We accumulated 500 usec of latency across several function calls.
[5] mark_offset_tsc - some steps in that sequence are generating
some long latencies. For example:
04010003 0.011ms (+0.000ms): mark_offset_tsc (timer_interrupt)
04010003 0.011ms (+0.000ms): mark_offset_tsc (timer_interrupt)
04010003 0.011ms (+0.000ms): spin_lock (mark_offset_tsc)
04010004 0.011ms (+0.137ms): mark_offset_tsc (timer_interrupt)
04010004 0.149ms (+0.000ms): mark_offset_tsc (timer_interrupt)
04010004 0.149ms (+0.000ms): spin_lock (mark_offset_tsc)
04010005 0.149ms (+0.144ms): mark_offset_tsc (timer_interrupt)
04010005 0.294ms (+0.004ms): mark_offset_tsc (timer_interrupt)
04010005 0.298ms (+0.003ms): mark_offset_tsc (timer_interrupt)
04010005 0.301ms (+0.000ms): mark_offset_tsc (timer_interrupt)
04010005 0.301ms (+0.000ms): mark_offset_tsc (timer_interrupt)
04010004 0.301ms (+0.000ms): mark_offset_tsc (timer_interrupt)
04010004 0.301ms (+0.073ms): mark_offset_tsc (timer_interrupt)
04010004 0.375ms (+0.000ms): mark_offset_tsc (timer_interrupt)
04010003 0.375ms (+0.000ms): timer_interrupt (generic_handle_IRQ_event)
I didn't see any feedback from the table I provided previously.
Is this data helpful or should I take out the patch?
[6] spin_lock - may just be an SMP lock problem but here's a trace
I don't recall seeing previously.
00000002 0.008ms (+0.000ms): snd_ensoniq_trigger (snd_pcm_do_stop)
00000002 0.008ms (+0.344ms): spin_lock (snd_ensoniq_trigger)
00010003 0.353ms (+0.015ms): do_nmi (snd_ensoniq_trigger)
00010003 0.368ms (+0.006ms): do_nmi (update_one_process)
00010003 0.375ms (+0.000ms): profile_tick (nmi_watchdog_tick)
00010003 0.375ms (+0.000ms): profile_hook (profile_tick)
00010003 0.375ms (+0.121ms): read_lock (profile_hook)
00010004 0.496ms (+0.000ms): notifier_call_chain (profile_hook)
00010003 0.497ms (+0.068ms): profile_hit (nmi_watchdog_tick)
00000002 0.566ms (+0.000ms): snd_pcm_post_stop (snd_pcm_action_single)
[7] Not sure what to call it, I don't recall seeing this type of trace
before either.
=> started at: __spin_lock_irqsave+0x39/0x90
=> ended at: as_work_handler+0x5c/0xa0
=======>
00000001 0.000ms (+0.000ms): __spin_lock_irqsave (spin_lock_irq)
00000001 0.000ms (+0.000ms): generic_enable_irq (ide_do_request)
00000001 0.000ms (+0.000ms): __spin_lock_irqsave (generic_enable_irq)
00000002 0.000ms (+0.000ms): unmask_IO_APIC_irq (generic_enable_irq)
00000002 0.000ms (+0.000ms): __spin_lock_irqsave (unmask_IO_APIC_irq)
00000003 0.001ms (+0.000ms): __unmask_IO_APIC_irq (unmask_IO_APIC_irq)
00000003 0.001ms (+0.066ms): __modify_IO_APIC_irq (__unmask_IO_APIC_irq)
00000001 0.067ms (+0.001ms): smp_apic_timer_interrupt (as_work_handler)
00010001 0.069ms (+0.087ms): profile_tick (smp_apic_timer_interrupt)
00010001 0.157ms (+0.000ms): profile_hook (profile_tick)
00010001 0.157ms (+0.069ms): read_lock (profile_hook)
00010002 0.227ms (+0.000ms): notifier_call_chain (profile_hook)
00010001 0.227ms (+0.069ms): profile_hit (smp_apic_timer_interrupt)
00010001 0.297ms (+0.000ms): update_process_times (smp_apic_timer_interrupt)
00010001 0.297ms (+0.069ms): update_one_process (update_process_times)
00010001 0.367ms (+0.000ms): run_local_timers (update_process_times)
00010001 0.367ms (+0.069ms): raise_softirq (update_process_times)
00010001 0.437ms (+0.000ms): scheduler_tick (update_process_times)
00010001 0.437ms (+0.070ms): sched_clock (scheduler_tick)
00020001 0.507ms (+0.000ms): do_nmi (scheduler_tick)
00020001 0.508ms (+0.002ms): do_nmi (del_timer_sync)
00020001 0.511ms (+0.000ms): profile_tick (nmi_watchdog_tick)
00020001 0.511ms (+0.000ms): profile_hook (profile_tick)
00020001 0.511ms (+0.065ms): read_lock (profile_hook)
00020002 0.577ms (+0.000ms): notifier_call_chain (profile_hook)
00020001 0.577ms (+0.000ms): profile_hit (nmi_watchdog_tick)
00010001 0.578ms (+0.000ms): spin_lock (scheduler_tick)
[8] exit_mmap - there are a few traces referring to code in or
called by exit_mmap. Here's an example.
=> started at: cond_resched_lock+0x6b/0x110
=> ended at: exit_mmap+0x155/0x1f0
=======>
00000001 0.000ms (+0.000ms): touch_preempt_timing (cond_resched_lock)
00000001 0.000ms (+0.000ms): __bitmap_weight (unmap_vmas)
00000001 0.000ms (+0.000ms): vm_acct_memory (exit_mmap)
00000001 0.001ms (+0.629ms): clear_page_tables (exit_mmap)
00000001 0.631ms (+0.000ms): flush_tlb_mm (exit_mmap)
00000001 0.631ms (+0.000ms): free_pages_and_swap_cache (exit_mmap)
I will send the traces separately (not to linux-kernel) for analysis.
--Mark
* Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9
2004-09-02 20:21 Mark_H_Johnson
@ 2004-09-02 20:38 ` Ingo Molnar
2004-09-02 22:43 ` Ingo Molnar
1 sibling, 0 replies; 18+ messages in thread
From: Ingo Molnar @ 2004-09-02 20:38 UTC (permalink / raw)
To: Mark_H_Johnson
Cc: linux-kernel, K.R. Foley, Felipe Alfaro Solana, Daniel Schmitt,
Lee Revell
* Mark_H_Johnson@raytheon.com <Mark_H_Johnson@raytheon.com> wrote:
> 00000002 0.003ms (+0.000ms): dummy_switch_tasks (schedule)
> 00000002 0.003ms (+0.000ms): schedule (worker_thread)
> 00000002 0.003ms (+0.000ms): schedule (worker_thread)
> 00000002 0.004ms (+0.000ms): schedule (worker_thread)
> 00000002 0.004ms (+0.000ms): schedule (worker_thread)
> 00000002 0.004ms (+0.000ms): schedule (worker_thread)
> 00000002 0.004ms (+0.274ms): schedule (worker_thread)
> 04000002 0.279ms (+0.000ms): __switch_to (schedule)
a quick suggestion: could you add this near the top of sched.c (below
the #include lines):
#define static
this will turn off all inlining and make the scheduler-internal
functions visible. If there's any scheduler-internal overhead we should
see it. Maybe this is the CR3 flush (switch_mm()) - but 274 usecs is
still excessive ...
Ingo
* Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9
2004-09-02 20:21 Mark_H_Johnson
2004-09-02 20:38 ` Ingo Molnar
@ 2004-09-02 22:43 ` Ingo Molnar
2004-09-03 9:09 ` Ingo Molnar
1 sibling, 1 reply; 18+ messages in thread
From: Ingo Molnar @ 2004-09-02 22:43 UTC (permalink / raw)
To: Mark_H_Johnson
Cc: linux-kernel, K.R. Foley, Felipe Alfaro Solana, Daniel Schmitt,
Lee Revell
[-- Attachment #1: Type: text/plain, Size: 9081 bytes --]
* Mark_H_Johnson@raytheon.com <Mark_H_Johnson@raytheon.com> wrote:
> [1] No cascade traces - several runs with this result - we should be
> OK here.
yeah, good. I think my analysis was correct: softirqd got delayed by
higher prio tasks and this caused a backlog in timer processing => hence
the 'storm' of cascade() calls. Now we may reschedule between cascade()
calls.
> [2] rtl8139 still showing some traces due to my use of
> netdev_backlog_granularity = 8. Had severe problems when it was 1.
i've just released -R0 which should fix these issues in theory.
> [3] I read what you said about hardware overhead, but the data seems
> to show that latencies on the way to "__switch_to" may be buried in some
> inline functions. A typical trace looks something like this...
>
> 00000002 0.003ms (+0.000ms): dummy_switch_tasks (schedule)
> 00000002 0.003ms (+0.000ms): schedule (worker_thread)
> 00000002 0.003ms (+0.000ms): schedule (worker_thread)
> 00000002 0.004ms (+0.000ms): schedule (worker_thread)
> 00000002 0.004ms (+0.000ms): schedule (worker_thread)
> 00000002 0.004ms (+0.000ms): schedule (worker_thread)
> 00000002 0.004ms (+0.274ms): schedule (worker_thread)
> 04000002 0.279ms (+0.000ms): __switch_to (schedule)
>
> dummy_switch_tasks refers to a function I added / called right after
> the label switch_tasks (in sched.c). The mcount() traces that follow
> are basically at each step leading up to the call to context_switch.
> Since context_switch is static inline, I assume it is not traced -
> please confirm. [...]
correct, inline functions are not traced. (Also, 'tail-merged' function
calls are not traced either - this can happen if a function calls
another one right at the end.)
> [...] I am considering adding mcount() calls inside context_switch to
> see if there is a step that has some long duration.
yeah. If you can isolate the overhead down to a specific line of code,
that would implicate that particular piece of code quite heavily. But if
the latencies happen all over the place, then it means the code is not
at fault - it's just an innocent bystander.
> [4] The go_idle path is also one that appears to have some long
> latencies. The following is an example - the trace of dummy_go_idle
> refers to the label named go_idle.
>
> 00000002 0.002ms (+0.000ms): dummy_go_idle (schedule)
> 00000002 0.002ms (+0.060ms): schedule (io_schedule)
> 00000002 0.063ms (+0.069ms): load_balance_newidle (schedule)
> 00000002 0.133ms (+0.074ms): find_busiest_group (load_balance_newidle)
> 00000002 0.207ms (+0.034ms): find_next_bit (find_busiest_group)
> 00000002 0.242ms (+0.039ms): find_next_bit (find_busiest_group)
> 00000002 0.281ms (+0.070ms): find_busiest_queue (load_balance_newidle)
> 00000002 0.351ms (+0.071ms): find_next_bit (find_busiest_queue)
> 00000002 0.422ms (+0.069ms): double_lock_balance (load_balance_newidle)
> 00000003 0.492ms (+0.070ms): move_tasks (load_balance_newidle)
this is as if the CPU executed everything in 'slow motion'. E.g. the
cache being disabled could be one such reason - or some severe DMA or
other bus traffic.
the code being executed is completely benign - no locking, just
straightforward processing. No way can they take 60-70 usecs!
as if certain data structures were uncached or something like that.
Although i think this theory can be excluded, because the same code
doesn't execute slowly in other traces (and most likely not in the
normal, non-traced execution either).
in theory code tracing itself could perhaps generate some sort of bad
cache pattern that hits a sweet spot of the CPU. (very weak theory but
that's all that's left ...) Could you disable tracing (but keep the
preemption-timing on) and see whether you still get these 500+ usec
latencies reported? Switching /proc/sys/kernel/tracing_enabled off
should be enough.
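[Editor's note: concretely, the switch Ingo means is the patch's sysctl; as root:]

```shell
# disable function tracing but keep preemption-latency timing (needs root,
# path as described in the voluntary-preempt patches)
echo 0 > /proc/sys/kernel/tracing_enabled
```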
> [5] mark_offset_tsc - some steps in that sequence are generating
> some long latencies. For example:
>
> 04010003 0.011ms (+0.000ms): mark_offset_tsc (timer_interrupt)
> 04010003 0.011ms (+0.000ms): mark_offset_tsc (timer_interrupt)
> 04010003 0.011ms (+0.000ms): spin_lock (mark_offset_tsc)
> 04010004 0.011ms (+0.137ms): mark_offset_tsc (timer_interrupt)
> 04010004 0.149ms (+0.000ms): mark_offset_tsc (timer_interrupt)
> 04010004 0.149ms (+0.000ms): spin_lock (mark_offset_tsc)
> 04010005 0.149ms (+0.144ms): mark_offset_tsc (timer_interrupt)
> 04010005 0.294ms (+0.004ms): mark_offset_tsc (timer_interrupt)
> 04010005 0.298ms (+0.003ms): mark_offset_tsc (timer_interrupt)
> 04010005 0.301ms (+0.000ms): mark_offset_tsc (timer_interrupt)
> 04010005 0.301ms (+0.000ms): mark_offset_tsc (timer_interrupt)
> 04010004 0.301ms (+0.000ms): mark_offset_tsc (timer_interrupt)
> 04010004 0.301ms (+0.073ms): mark_offset_tsc (timer_interrupt)
> 04010004 0.375ms (+0.000ms): mark_offset_tsc (timer_interrupt)
> 04010003 0.375ms (+0.000ms): timer_interrupt (generic_handle_IRQ_event)
>
> I didn't see any feedback from the table I provided previously.
> Is this data helpful or should I take out the patch?
sent a mail separately, replying to your previous table. I think it's
useful - let's try to pin down what code there is between #03-#04.
> [6] spin_lock - may just be an SMP lock problem but here's a trace
> I don't recall seeing previously.
>
> 00000002 0.008ms (+0.000ms): snd_ensoniq_trigger (snd_pcm_do_stop)
> 00000002 0.008ms (+0.344ms): spin_lock (snd_ensoniq_trigger)
> 00010003 0.353ms (+0.015ms): do_nmi (snd_ensoniq_trigger)
> 00010003 0.368ms (+0.006ms): do_nmi (update_one_process)
> 00010003 0.375ms (+0.000ms): profile_tick (nmi_watchdog_tick)
> 00010003 0.375ms (+0.000ms): profile_hook (profile_tick)
> 00010003 0.375ms (+0.121ms): read_lock (profile_hook)
> 00010004 0.496ms (+0.000ms): notifier_call_chain (profile_hook)
> 00010003 0.497ms (+0.068ms): profile_hit (nmi_watchdog_tick)
> 00000002 0.566ms (+0.000ms): snd_pcm_post_stop (snd_pcm_action_single)
it's hard to tell whether the overhead is the 'magic' overhead or
spinning on the lock. I've attached trace-spin.patch that shows us how
many times a spinlock had to spin before it got the lock. It will also
tell the precise time it took to execute the spinning function.
if we looped zero times and the overhead is still high, then this again
implicates the 'magic' hw issue that is interfering.
> [7] Not sure what to call it, I don't recall seeing this type of trace
> before either.
>
> => started at: __spin_lock_irqsave+0x39/0x90
> => ended at: as_work_handler+0x5c/0xa0
> =======>
> 00000001 0.000ms (+0.000ms): __spin_lock_irqsave (spin_lock_irq)
> 00000001 0.000ms (+0.000ms): generic_enable_irq (ide_do_request)
> 00000001 0.000ms (+0.000ms): __spin_lock_irqsave (generic_enable_irq)
> 00000002 0.000ms (+0.000ms): unmask_IO_APIC_irq (generic_enable_irq)
> 00000002 0.000ms (+0.000ms): __spin_lock_irqsave (unmask_IO_APIC_irq)
> 00000003 0.001ms (+0.000ms): __unmask_IO_APIC_irq (unmask_IO_APIC_irq)
> 00000003 0.001ms (+0.066ms): __modify_IO_APIC_irq (__unmask_IO_APIC_irq)
> 00000001 0.067ms (+0.001ms): smp_apic_timer_interrupt (as_work_handler)
> 00010001 0.069ms (+0.087ms): profile_tick (smp_apic_timer_interrupt)
> 00010001 0.157ms (+0.000ms): profile_hook (profile_tick)
> 00010001 0.157ms (+0.069ms): read_lock (profile_hook)
> 00010002 0.227ms (+0.000ms): notifier_call_chain (profile_hook)
> 00010001 0.227ms (+0.069ms): profile_hit (smp_apic_timer_interrupt)
> 00010001 0.297ms (+0.000ms): update_process_times (smp_apic_timer_interrupt)
> 00010001 0.297ms (+0.069ms): update_one_process (update_process_times)
> 00010001 0.367ms (+0.000ms): run_local_timers (update_process_times)
> 00010001 0.367ms (+0.069ms): raise_softirq (update_process_times)
> 00010001 0.437ms (+0.000ms): scheduler_tick (update_process_times)
> 00010001 0.437ms (+0.070ms): sched_clock (scheduler_tick)
> 00020001 0.507ms (+0.000ms): do_nmi (scheduler_tick)
> 00020001 0.508ms (+0.002ms): do_nmi (del_timer_sync)
> 00020001 0.511ms (+0.000ms): profile_tick (nmi_watchdog_tick)
> 00020001 0.511ms (+0.000ms): profile_hook (profile_tick)
> 00020001 0.511ms (+0.065ms): read_lock (profile_hook)
> 00020002 0.577ms (+0.000ms): notifier_call_chain (profile_hook)
> 00020001 0.577ms (+0.000ms): profile_hit (nmi_watchdog_tick)
> 00010001 0.578ms (+0.000ms): spin_lock (scheduler_tick)
this too seems to be 'magic' overhead, hitting randomly in the middle of
a commonly executed function.
> [8] exit_mmap - there are a few traces referring to code in or
> called by exit_mmap. Here's an example.
>
> => started at: cond_resched_lock+0x6b/0x110
> => ended at: exit_mmap+0x155/0x1f0
> =======>
> 00000001 0.000ms (+0.000ms): touch_preempt_timing (cond_resched_lock)
> 00000001 0.000ms (+0.000ms): __bitmap_weight (unmap_vmas)
> 00000001 0.000ms (+0.000ms): vm_acct_memory (exit_mmap)
> 00000001 0.001ms (+0.629ms): clear_page_tables (exit_mmap)
> 00000001 0.631ms (+0.000ms): flush_tlb_mm (exit_mmap)
> 00000001 0.631ms (+0.000ms): free_pages_and_swap_cache (exit_mmap)
this should be a known latency: clear_page_tables() runs with preemption
disabled (a spinlock held).
Ingo
[-- Attachment #2: trace-spin.patch --]
[-- Type: text/plain, Size: 1264 bytes --]
--- linux/kernel/sched.c.orig2
+++ linux/kernel/sched.c
@@ -4926,9 +4926,11 @@ EXPORT_SYMBOL(__might_sleep);
#define BUILD_LOCK_OPS(op, locktype) \
void __sched op##_lock(locktype *lock) \
{ \
+ unsigned long count = 0; \
for (;;) { \
if (likely(_raw_##op##_trylock(lock))) \
break; \
+ count++; \
preempt_disable(); \
touch_preempt_timing(); \
preempt_enable(); \
@@ -4937,16 +4939,19 @@ void __sched op##_lock(locktype *lock)
cpu_relax(); \
} \
preempt_disable(); \
+ __trace((unsigned long)op##_lock, count); \
} \
\
EXPORT_SYMBOL(op##_lock); \
\
void __sched __##op##_lock_irqsave(locktype *lock, unsigned long *flags)\
{ \
+ unsigned long count = 0; \
for (;;) { \
local_irq_save(*flags); \
if (likely(_raw_##op##_trylock(lock))) \
break; \
+ count++; \
local_irq_restore(*flags); \
\
preempt_disable(); \
@@ -4957,6 +4962,7 @@ void __sched __##op##_lock_irqsave(lockt
(lock)->break_lock = 1; \
cpu_relax(); \
} \
+ __trace((unsigned long)__##op##_lock_irqsave, count); \
preempt_disable(); \
} \
\
* Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9
2004-09-02 22:43 ` Ingo Molnar
@ 2004-09-03 9:09 ` Ingo Molnar
0 siblings, 0 replies; 18+ messages in thread
From: Ingo Molnar @ 2004-09-03 9:09 UTC (permalink / raw)
To: Mark_H_Johnson
Cc: linux-kernel, K.R. Foley, Felipe Alfaro Solana, Daniel Schmitt,
Lee Revell
* Ingo Molnar <mingo@elte.hu> wrote:
> > 00000002 0.002ms (+0.000ms): dummy_go_idle (schedule)
> > 00000002 0.002ms (+0.060ms): schedule (io_schedule)
> > 00000002 0.063ms (+0.069ms): load_balance_newidle (schedule)
> > 00000002 0.133ms (+0.074ms): find_busiest_group (load_balance_newidle)
> > 00000002 0.207ms (+0.034ms): find_next_bit (find_busiest_group)
> > 00000002 0.242ms (+0.039ms): find_next_bit (find_busiest_group)
> > 00000002 0.281ms (+0.070ms): find_busiest_queue (load_balance_newidle)
> > 00000002 0.351ms (+0.071ms): find_next_bit (find_busiest_queue)
> > 00000002 0.422ms (+0.069ms): double_lock_balance (load_balance_newidle)
> > 00000003 0.492ms (+0.070ms): move_tasks (load_balance_newidle)
>
> this is as if the CPU executed everything in 'slow motion'. E.g. the
> cache being disabled could be one such reason - or some severe DMA or
> other bus traffic.
another thing: could you try maxcpus=1 again and see whether 1 CPU
produces similar 'slow motion' traces? If it's DMA or PCI bus traffic
somehow interfering then i'd expect the same phenomenon to pop up with a
single CPU too. IIRC you tested a UP kernel once before, but i believe
that was prior to fixing all the latency measurement errors. Would be
nice to re-test again, with maxcpus=1, and see whether any of these
slow-motion traces trigger. On a 1-CPU test i'd suggest lowering the
tracing threshold to half of the 2-CPU value.
Ingo
* Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9
@ 2004-09-03 13:42 Mark_H_Johnson
0 siblings, 0 replies; 18+ messages in thread
From: Mark_H_Johnson @ 2004-09-03 13:42 UTC (permalink / raw)
To: Ingo Molnar
Cc: linux-kernel, K.R. Foley, Felipe Alfaro Solana, Daniel Schmitt,
Lee Revell
>> this is as if the CPU executed everything in 'slow motion'. E.g. the
>> cache being disabled could be one such reason - or some severe DMA or
>> other bus traffic.
>
>another thing: could you try maxcpus=1 again and see whether 1 CPU
>produces similar 'slow motion' traces?
OK. Second try to send this message. I've had two complete system lockups
now with -Q9: the first during a kernel build, the second while trying
to send a message explaining that, plus the results I saw when trying
to test with maxcpus=1.
Booted w/ maxcpus=1 nmi_watchdog=1. Saw information messages confirming
both were set and acted on.
The first major problem was that the audio came out "too fast". The test
repeats a pattern and you could hear it repeating faster than it should
have. Needless to say, this meant there was no time available to run
any non-real-time tasks, so the application timing results were worthless
and the latency traces were separated by several minutes in some cases
[before I could force a SIGINT to the real time task].
I don't know if the traces I did get are helpful or not, but here is a
summary. The numbers after each title refer to the trace file numbers,
so if you count them, you get an idea of how frequent each symptom was.
I will send the full traces only if you ask for them.
Slow getting to "switch_tasks" (00, 01, 10, 12, 13, 16)
00000002 0.002ms (+0.000ms): dequeue_task (deactivate_task)
00000002 0.002ms (+0.000ms): schedule (io_schedule)
00000002 0.002ms (+0.000ms): schedule (io_schedule)
00000002 0.002ms (+0.000ms): schedule (io_schedule)
00000002 0.003ms (+0.000ms): schedule (io_schedule)
00000002 0.003ms (+0.000ms): schedule (io_schedule)
00000002 0.003ms (+0.303ms): schedule (io_schedule)
00010002 0.306ms (+0.016ms): do_nmi (schedule)
00010002 0.323ms (+0.000ms): profile_tick (nmi_watchdog_tick)
00010002 0.323ms (+0.000ms): profile_hook (profile_tick)
00010002 0.323ms (+0.100ms): read_lock (profile_hook)
00010003 0.424ms (+0.000ms): notifier_call_chain (profile_hook)
00010002 0.424ms (+0.070ms): profile_hit (nmi_watchdog_tick)
00000002 0.494ms (+0.000ms): dummy_switch_tasks (schedule)
Slow getting to "__switch_to" (02, 03, 05, 09)
00000002 0.003ms (+0.000ms): dummy_switch_tasks (schedule)
00000002 0.003ms (+0.000ms): schedule (io_schedule)
00000002 0.003ms (+0.327ms): schedule (io_schedule)
00000002 0.331ms (+0.000ms): schedule (io_schedule)
00000002 0.331ms (+0.000ms): schedule (io_schedule)
00000002 0.331ms (+0.000ms): schedule (io_schedule)
00000002 0.332ms (+0.236ms): schedule (io_schedule)
00010002 0.568ms (+0.001ms): do_nmi (mcount)
00010002 0.570ms (+0.000ms): profile_tick (nmi_watchdog_tick)
00010002 0.570ms (+0.000ms): profile_hook (profile_tick)
00010002 0.570ms (+0.000ms): read_lock (profile_hook)
00010003 0.571ms (+0.000ms): notifier_call_chain (profile_hook)
00010002 0.571ms (+0.000ms): profile_hit (nmi_watchdog_tick)
00000002 0.572ms (+0.000ms): __switch_to (schedule)
Slow to both switch_tasks and __switch_to (04, 06)
00000002 0.002ms (+0.000ms): dequeue_task (deactivate_task)
00000002 0.002ms (+0.000ms): schedule (io_schedule)
00000002 0.002ms (+0.000ms): schedule (io_schedule)
00000002 0.002ms (+0.000ms): schedule (io_schedule)
00000002 0.003ms (+0.000ms): schedule (io_schedule)
00000002 0.003ms (+0.000ms): schedule (io_schedule)
00000002 0.003ms (+0.131ms): schedule (io_schedule)
00000002 0.134ms (+0.000ms): dummy_switch_tasks (schedule)
00000002 0.134ms (+0.000ms): schedule (io_schedule)
00000002 0.134ms (+0.000ms): schedule (io_schedule)
00000002 0.134ms (+0.000ms): schedule (io_schedule)
00000002 0.135ms (+0.000ms): schedule (io_schedule)
00000002 0.135ms (+0.000ms): schedule (io_schedule)
00000002 0.135ms (+0.139ms): schedule (io_schedule)
00010002 0.275ms (+0.279ms): do_nmi (schedule)
00010002 0.554ms (+0.000ms): profile_tick (nmi_watchdog_tick)
00010002 0.554ms (+0.000ms): profile_hook (profile_tick)
00010002 0.554ms (+0.000ms): read_lock (profile_hook)
00010003 0.555ms (+0.000ms): notifier_call_chain (profile_hook)
00010002 0.555ms (+0.001ms): profile_hit (nmi_watchdog_tick)
00000002 0.557ms (+0.000ms): __switch_to (schedule)
sched_clock (08, 11, 14)
00000001 0.000ms (+0.000ms): schedule (worker_thread)
00000001 0.000ms (+0.000ms): schedule (worker_thread)
00000001 0.000ms (+0.000ms): schedule (worker_thread)
00000001 0.000ms (+0.000ms): schedule (worker_thread)
00000001 0.000ms (+0.000ms): schedule (worker_thread)
00000001 0.000ms (+0.000ms): schedule (worker_thread)
00000001 0.000ms (+0.553ms): sched_clock (schedule)
00010001 0.554ms (+0.001ms): do_nmi (sched_clock)
00010001 0.556ms (+0.000ms): profile_tick (nmi_watchdog_tick)
00010001 0.556ms (+0.000ms): profile_hook (profile_tick)
00010001 0.556ms (+0.000ms): read_lock (profile_hook)
00010002 0.556ms (+0.000ms): notifier_call_chain (profile_hook)
00010001 0.557ms (+0.000ms): profile_hit (nmi_watchdog_tick)
00010001 0.557ms (+0.000ms): do_IRQ (sched_clock)
00010001 0.558ms (+0.000ms): do_IRQ (<00000000>)
00010001 0.558ms (+0.000ms): spin_lock (do_IRQ)
__modify_IO_APIC_irq (15)
00000001 0.000ms (+0.000ms): __spin_lock_irqsave (spin_lock_irq)
00000001 0.000ms (+0.000ms): generic_enable_irq (ide_do_request)
00000001 0.000ms (+0.000ms): __spin_lock_irqsave (generic_enable_irq)
00000002 0.001ms (+0.000ms): unmask_IO_APIC_irq (generic_enable_irq)
00000002 0.001ms (+0.000ms): __spin_lock_irqsave (unmask_IO_APIC_irq)
00000003 0.002ms (+0.000ms): __unmask_IO_APIC_irq (unmask_IO_APIC_irq)
00000003 0.002ms (+0.565ms): __modify_IO_APIC_irq (__unmask_IO_APIC_irq)
00010001 0.567ms (+0.001ms): do_nmi (apic_timer_interrupt)
00010001 0.569ms (+0.000ms): profile_tick (nmi_watchdog_tick)
00010001 0.569ms (+0.000ms): profile_hook (profile_tick)
00010001 0.569ms (+0.000ms): read_lock (profile_hook)
00010002 0.570ms (+0.000ms): notifier_call_chain (profile_hook)
00010001 0.570ms (+0.000ms): profile_hit (nmi_watchdog_tick)
00000001 0.571ms (+0.000ms): smp_apic_timer_interrupt (as_work_handler)
In the meantime, I will build a kernel with -R1 and leave the latency trace
on (now running a -Q7 kernel with both CPUs active) to see if it shows
anything interesting before attempting any more tests.
--Mark
* Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9
@ 2004-09-03 15:33 Mark_H_Johnson
2004-09-04 0:04 ` Lee Revell
0 siblings, 1 reply; 18+ messages in thread
From: Mark_H_Johnson @ 2004-09-03 15:33 UTC (permalink / raw)
To: Ingo Molnar
Cc: linux-kernel, K.R. Foley, Felipe Alfaro Solana, Daniel Schmitt,
Lee Revell
>In the meantime, I will build a kernel with -R1 and leave the latency
trace
>on (now running a -Q7 kernel with both CPUs active) to see if it shows
>anything interesting before attempting any more tests.
Well - THAT was special. Another crash, but I may have a clue about both
that and the "general slow down" in sched.c.
The crash is likely due to a problem with X. I caused it this time when
I was trying to open/hide Mozilla windows (again preparing to send email
with a web-based client). The last operation that worked was the window
hide. The one that locked everything up was a click to restore a Mozilla
window onto the screen. I don't know if this is relevant, but the last
trace in /var/log/messages was a latency trace caused by X.
Sep 3 09:57:11 dws77 kernel: (X/2382): new 329 us maximum-latency critical
section.
Sep 3 09:57:11 dws77 kernel: => started at: <kmap_atomic+0x23/0xe0>
Sep 3 09:57:11 dws77 kernel: => ended at: <kunmap_atomic+0x7b/0xa0>
I am not sure this is relevant since all the data for it was written
to disk (my script picked up the latency trace as well). Let me know
if you want the trace data.
The slowdown in sched.c may be due to disk DMA activity.
When I started the kernel build, I forgot to run my script that sets
a number of /proc values. Specifically, the values were:
cat /sys/block/hda/queue/max_sectors_kb
128
cat /sys/block/hda/queue/read_ahead_kb
128
cat /proc/sys/net/core/netdev_max_backlog
300
cat '/proc/irq/10/Ensoniq AudioPCI/threaded'
1
and in my other tests, they are:
echo 32 > /sys/block/hda/queue/max_sectors_kb
echo 32 > /sys/block/hda/queue/read_ahead_kb
echo 8 > /proc/sys/net/core/netdev_max_backlog
echo 0 > '/proc/irq/10/Ensoniq AudioPCI/threaded'
Note - no audio and only light network activity during the kernel
build.
With the default settings, I had latencies in sched.c over 2 msec
(longest was 2305 us). For example:
preemption latency trace v1.0.3
-------------------------------
latency: 2305 us, entries: 137 (137)
-----------------
| task: ksoftirqd/0/3, uid:0 nice:-10 policy:0 rt_prio:0
-----------------
=> started at: schedule+0x51/0x740
=> ended at: schedule+0x337/0x740
=======>
00000001 0.000ms (+0.000ms): schedule (worker_thread)
00000001 0.000ms (+0.000ms): sched_clock (schedule)
00000001 0.000ms (+0.000ms): spin_lock_irq (schedule)
00000001 0.000ms (+0.000ms): __spin_lock_irqsave (spin_lock_irq)
00000002 0.001ms (+0.000ms): deactivate_task (schedule)
00000002 0.001ms (+0.000ms): dequeue_task (deactivate_task)
04000002 0.765ms (+0.764ms): __switch_to (schedule)
04000002 0.800ms (+0.034ms): finish_task_switch (schedule)
04010002 1.116ms (+0.316ms): do_IRQ (finish_task_switch)
04010002 1.116ms (+0.000ms): spin_lock (do_IRQ)
04010003 1.117ms (+0.000ms): ack_edge_ioapic_irq (do_IRQ)
04010003 1.118ms (+0.000ms): generic_redirect_hardirq (do_IRQ)
04010002 1.328ms (+0.210ms): generic_handle_IRQ_event (do_IRQ)
04010002 1.797ms (+0.468ms): timer_interrupt (generic_handle_IRQ_event)
04010002 1.840ms (+0.043ms): spin_lock (timer_interrupt)
04010003 1.842ms (+0.001ms): mark_offset_tsc (timer_interrupt)
04010003 1.842ms (+0.000ms): mark_offset_tsc (timer_interrupt)
04010003 1.842ms (+0.000ms): spin_lock (mark_offset_tsc)
04010004 1.890ms (+0.048ms): mark_offset_tsc (timer_interrupt)
04010004 2.236ms (+0.345ms): mark_offset_tsc (timer_interrupt)
04010004 2.236ms (+0.000ms): spin_lock (mark_offset_tsc)
04010005 2.236ms (+0.000ms): mark_offset_tsc (timer_interrupt)
04010005 2.240ms (+0.003ms): mark_offset_tsc (timer_interrupt)
04010005 2.244ms (+0.004ms): mark_offset_tsc (timer_interrupt)
04010005 2.246ms (+0.002ms): mark_offset_tsc (timer_interrupt)
04010005 2.247ms (+0.000ms): mark_offset_tsc (timer_interrupt)
04010004 2.247ms (+0.000ms): mark_offset_tsc (timer_interrupt)
04010004 2.247ms (+0.000ms): mark_offset_tsc (timer_interrupt)
04010004 2.248ms (+0.000ms): mark_offset_tsc (timer_interrupt)
04010003 2.249ms (+0.001ms): timer_interrupt (generic_handle_IRQ_event)
...
So this trace shows a number of the symptoms I have previously
reported.
When I used the lower disk settings, the latencies went way
down [at least up to the crash...] with a maximum of 618 us.
I looked through several traces and could not find a similar
sequence for comparison.
So we may both be right:
- there is hardware overhead that needs to be accounted for; this
pair of runs seems to point to disk I/O, which we previously confirmed
is DMA to/from an IDE disk.
- there may be steps in the code that cause longer latencies and that
we want to avoid if possible.
It may be a combination of effects. A question for others doing
testing (like Lee) - have you been doing any other activity in
the background when doing your tests? For example, I have found
that something as simple as
head -c $1 /dev/zero >tmpfile [where $1 is a size greater than physical memory]
or
cat tmpfile > /dev/null
can cause significantly increased latencies in the 2.6 kernels.
--Mark
* Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9
@ 2004-09-03 18:28 Mark_H_Johnson
2004-09-03 18:33 ` Ingo Molnar
0 siblings, 1 reply; 18+ messages in thread
From: Mark_H_Johnson @ 2004-09-03 18:28 UTC (permalink / raw)
To: Ingo Molnar
Cc: K.R. Foley, linux-kernel, Felipe Alfaro Solana, Daniel Schmitt,
Lee Revell
>a quick suggestion: could you add this near the top of sched.c (below
>the #include lines):
>
> #define static
>
>this will turn off all inlining and makes the scheduler-internal
>functions visible. ....
I did this and built a broken kernel. It panics very early in startup.
I also got warnings like
kernel/sched.c: In function __might_sleep
kernel/sched.c:4974: warning: prev_jiffy might be used uninitialized in
this function
which, if I read the code right, should be a local static variable.
I suppose you meant
#define inline
instead (which throws a warning about a duplicate definition; can I
ignore it?)
--Mark H Johnson
<mailto:Mark_H_Johnson@raytheon.com>
* Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9
2004-09-03 18:28 [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9 Mark_H_Johnson
@ 2004-09-03 18:33 ` Ingo Molnar
0 siblings, 0 replies; 18+ messages in thread
From: Ingo Molnar @ 2004-09-03 18:33 UTC (permalink / raw)
To: Mark_H_Johnson
Cc: K.R. Foley, linux-kernel, Felipe Alfaro Solana, Daniel Schmitt,
Lee Revell
* Mark_H_Johnson@Raytheon.com <Mark_H_Johnson@Raytheon.com> wrote:
> I suppose you meant
> #define inline
yeah ...
> instead (which throws a warning about a duplicate definition; can I
> ignore it?)
yeah, ought to be fine. (you can do an #undef inline for a cleaner
compile).
Ingo
* Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9
2004-09-03 15:33 [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9 Mark_H_Johnson
@ 2004-09-04 0:04 ` Lee Revell
2004-09-04 16:52 ` Alan Cox
2004-10-05 20:43 ` GPU driver misbehavior [Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9] Timothy Miller
0 siblings, 2 replies; 18+ messages in thread
From: Lee Revell @ 2004-09-04 0:04 UTC (permalink / raw)
To: Mark_H_Johnson
Cc: Ingo Molnar, linux-kernel, K.R. Foley, Felipe Alfaro Solana,
Daniel Schmitt
On Fri, 2004-09-03 at 11:33, Mark_H_Johnson@raytheon.com wrote:
> Well - THAT was special. Another crash but I may have a clue on both that
> and the "general slow down" in sched.c.
>
> The crash is likely due to a problem with X. I caused it this time when
> I was trying to open / hide Mozilla windows (again preparing to send email
> with a web based client). The last operation that worked was the window
> hide. The one that locked everything up was a click to restore a Mozilla
> window onto the screen. I don't know if this is relevant, but the last
> trace in /var/log/messages was a latency trace caused by X.
>
> Sep 3 09:57:11 dws77 kernel: (X/2382): new 329 us maximum-latency critical
> section.
> Sep 3 09:57:11 dws77 kernel: => started at: <kmap_atomic+0x23/0xe0>
> Sep 3 09:57:11 dws77 kernel: => ended at: <kunmap_atomic+0x7b/0xa0>
>
> I am not sure this is relevant since all the data for it was written
> to disk (my script picked up the latency trace as well). Let me know
> if you want the trace data.
>
> The slow down in sched.c may be due to disk DMA activities.
[...]
> It may be a combination of effects. A question for others doing
> testing (like Lee) - have you been doing any other activity in
> the background when doing your tests? For example, I have found
> that something as simple as
> head -c $1 /dev/zero >tmpfile [$1 is a > physical memory size]
> or
> cat tmpfile > /dev/null
> can cause significantly increased latencies in the 2.6 kernels.
>
This is looking more and more like a video driver problem:
"Misbehaving video card drivers are another source of significant delays
in scheduling user code. A number of video card manufacturers recently
began employing a hack to save a PCI bus transaction for each display
operation in order to gain a few percentage points on their WinBench
[Ziff-Davis 98] Graphics WinMark performance.
The video cards have a command FIFO that is written to via the PCI bus.
They also have a status register, read via the PCI bus, which says
whether the command FIFO is full or not. The hack is to not check
whether the command FIFO is full before attempting to write to it, thus
saving a PCI bus read.
The problem with this is that the result of attempting to write to the
FIFO when it is full is to stall the CPU waiting on the PCI bus write
until a command has been completed and space becomes available to accept
the new command. In fact, this not only causes the CPU to stall waiting
on the PCI bus, but since the PCI controller chip also controls the ISA
bus and mediates interrupts, ISA traffic and interrupt requests are
stalled as well. Even the clock interrupts stop.
These video cards will stall the machine, for instance, when the user
drags a window. For windows occupying most of a 1024x768 screen on a
333MHz Pentium II with an AccelStar II AGP video board (which is based
on the 3D Labs Permedia 2 chip set) this will stall the machine for
25-30ms at a time!"
(from http://research.microsoft.com/~mbj/papers/tr-98-29.html)
Ingo, would the above situation indeed produce these symptoms?
I had this exact problem with my Via Unichrome chipset (an open source
driver!), reported it to the maintainers, and it turned out this was
exactly what the driver was doing (it's fixed now). The above text is
from 1998; the problem in the VIA driver was discovered last week. So I
would actually expect this behavior to be the norm, unless someone has
fixed that driver.
The easiest way to eliminate this possibility is to disable DRI and set
'Option "NoAccel"' in your X config. Do you get the same mysterious
latencies with this setting?
What video hardware are you using?
Lee
* Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9
2004-09-04 0:04 ` Lee Revell
@ 2004-09-04 16:52 ` Alan Cox
2004-09-04 18:05 ` Lee Revell
2004-10-05 20:43 ` GPU driver misbehavior [Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9] Timothy Miller
1 sibling, 1 reply; 18+ messages in thread
From: Alan Cox @ 2004-09-04 16:52 UTC (permalink / raw)
To: Lee Revell
Cc: Mark_H_Johnson, Ingo Molnar, Linux Kernel Mailing List,
K.R. Foley, Felipe Alfaro Solana, Daniel Schmitt
On Sad, 2004-09-04 at 01:04, Lee Revell wrote:
> This is looking more and more like a video driver problem:
Not really. The delay is too small and X is smarter than this (except a
VIA case that only recently got squished).
> The video cards have a command FIFO that is written to via the PCI bus.
> They also have a status register, read via the PCI bus, which says
> whether the command FIFO is full or not. The hack is to not check
> whether the command FIFO is full before attempting to write to it, thus
> saving a PCI bus read.
On problem cards X defaults to polling the status FIFO. You can tell it
to be rude, but you have to actively do so. Newer PCI 2.x specs also
have a thing or two to say on the subject.
* Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9
2004-09-04 16:52 ` Alan Cox
@ 2004-09-04 18:05 ` Lee Revell
0 siblings, 0 replies; 18+ messages in thread
From: Lee Revell @ 2004-09-04 18:05 UTC (permalink / raw)
To: Alan Cox
Cc: Mark_H_Johnson, Ingo Molnar, Linux Kernel Mailing List,
K.R. Foley, Felipe Alfaro Solana, Daniel Schmitt
On Sat, 2004-09-04 at 12:52, Alan Cox wrote:
> On Sad, 2004-09-04 at 01:04, Lee Revell wrote:
> > This is looking more and more like a video driver problem:
>
> Not really. The delay is too small and X is smarter than this. (except a
> VIA case that only recently got squished).
>
True, the VIA problem did not cause long latency traces to appear on my
machine. This is a weird one.
Lee
* GPU driver misbehavior [Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9]
2004-09-04 0:04 ` Lee Revell
2004-09-04 16:52 ` Alan Cox
@ 2004-10-05 20:43 ` Timothy Miller
2004-10-05 21:12 ` Lee Revell
1 sibling, 1 reply; 18+ messages in thread
From: Timothy Miller @ 2004-10-05 20:43 UTC (permalink / raw)
To: Lee Revell
Cc: Mark_H_Johnson, Ingo Molnar, linux-kernel, K.R. Foley,
Felipe Alfaro Solana, Daniel Schmitt
Lee Revell wrote:
>
> "Misbehaving video card drivers are another source of significant delays
> in scheduling user code. A number of video card manufacturers recently
> began employing a hack to save a PCI bus transaction for each display
> operation in order to gain a few percentage points on their WinBench
> [Ziff-Davis 98] Graphics WinMark performance.
>
> The video cards have a command FIFO that is written to via the PCI bus.
> They also have a status register, read via the PCI bus, which says
> whether the command FIFO is full or not. The hack is to not check
> whether the command FIFO is full before attempting to write to it, thus
> saving a PCI bus read.
>
> The problem with this is that the result of attempting to write to the
> FIFO when it is full is to stall the CPU waiting on the PCI bus write
> until a command has been completed and space becomes available to accept
> the new command. In fact, this not only causes the CPU to stall waiting
> on the PCI bus, but since the PCI controller chip also controls the ISA
> bus and mediates interrupts, ISA traffic and interrupt requests are
> stalled as well. Even the clock interrupts stop.
>
> These video cards will stall the machine, for instance, when the user
> drags a window. For windows occupying most of a 1024x768 screen on a
> 333MHz Pentium II with an AccelStar II AGP video board (which is based
> on the 3D Labs Permedia 2 chip set) this will stall the machine for
> 25-30ms at a time!"
I would expect that I'm not the first to think of this, but I haven't
seen it mentioned, so it makes me wonder. Therefore, I offer my solution.
Whenever you read the status register, keep a copy of the "number of
free fifo entries" field. Whenever you're going to do a group of writes
to the fifo, you first must check for enough free entries. The macro
that does this checks the copy of the status register to see if there
were enough free the last time you checked. If so, deduct the number of
free slots you're about to use, and move on. If not, re-read the status
register and loop or sleep if you don't have enough free.
The copy of the status register will always be "correct" in that it will
always report a number of free entries less than or equal to the actual
number, and it will never report a number greater than what is available
(barring a hardware glitch or a bug, which would be bad for other reasons).
This is because you're assuming the fifo doesn't drain, when in fact, it
does.
This results in nearly optimal performance, because usually you end up
reading the status register mostly when the fifo is full (a time when
extra bus reads don't hurt anything). If you have a 256-entry fifo,
then you end up reading the status register once for every 256 writes,
for a performance loss of only 0.39%, and you ONLY get this performance
loss when the fifo drains faster than you can fill it.
One challenge to this is when you have more than one entity trying to
access the same resource. But in that case, you'll already have to be
using some sort of mutex mechanism anyhow.
* Re: GPU driver misbehavior [Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9]
2004-10-05 20:43 ` GPU driver misbehavior [Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9] Timothy Miller
@ 2004-10-05 21:12 ` Lee Revell
0 siblings, 0 replies; 18+ messages in thread
From: Lee Revell @ 2004-10-05 21:12 UTC (permalink / raw)
To: Timothy Miller
Cc: Mark_H_Johnson, Ingo Molnar, linux-kernel, K.R. Foley,
Felipe Alfaro Solana, Daniel Schmitt
On Tue, 2004-10-05 at 16:43, Timothy Miller wrote:
> Lee Revell wrote:
>
> >
> > "Misbehaving video card drivers are another source of significant delays
> > in scheduling user code. A number of video card manufacturers recently
> > began employing a hack to save a PCI bus transaction for each display
> > operation in order to gain a few percentage points on their WinBench
> > [Ziff-Davis 98] Graphics WinMark performance.
> >
> > The video cards have a command FIFO that is written to via the PCI bus.
> > They also have a status register, read via the PCI bus, which says
> > whether the command FIFO is full or not. The hack is to not check
> > whether the command FIFO is full before attempting to write to it, thus
> > saving a PCI bus read.
> >
> > The problem with this is that the result of attempting to write to the
> > FIFO when it is full is to stall the CPU waiting on the PCI bus write
> > until a command has been completed and space becomes available to accept
> > the new command. In fact, this not only causes the CPU to stall waiting
> > on the PCI bus, but since the PCI controller chip also controls the ISA
> > bus and mediates interrupts, ISA traffic and interrupt requests are
> > stalled as well. Even the clock interrupts stop.
> >
> > These video cards will stall the machine, for instance, when the user
> > drags a window. For windows occupying most of a 1024x768 screen on a
> > 333MHz Pentium II with an AccelStar II AGP video board (which is based
> > on the 3D Labs Permedia 2 chip set) this will stall the machine for
> > 25-30ms at a time!"
>
> I would expect that I'm not the first to think of this, but I haven't
> seen it mentioned, so it makes me wonder. Therefore, I offer my solution.
>
> Whenever you read the status register, keep a copy of the "number of
> free fifo entries" field. Whenever you're going to do a group of writes
> to the fifo, you first must check for enough free entries. The macro
> that does this checks the copy of the status register to see if there
> were enough free the last time you checked. If so, deduct the number of
> free slots you're about to use, and move on. If not, re-read the status
> register and loop or sleep if you don't have enough free.
>
> The copy of the status register will always be "correct" in that it will
> always report a number of free entries less than or equal to the actual
> number, and it will never report a number greater than what is available
> (barring a hardware glitch or a bug, which would be bad for other reasons).
> This is because you're assuming the fifo doesn't drain, when in fact, it
> does.
>
> This results in nearly optimal performance, because usually you end up
> reading the status register mostly when the fifo is full (a time when
> extra bus reads don't hurt anything). If you have a 256-entry fifo,
> then you end up reading the status register once for every 256 writes,
> for a performance loss of only 0.39%, and you ONLY get this performance
> loss when the fifo drains faster than you can fill it.
>
> One challenge to this is when you have more than one entity trying to
> access the same resource. But in that case, you'll already have to be
> using some sort of mutex mechanism anyhow.
>
>
AFAIK only one driver (VIA unichrome) has had this problem recently.
Thomas Hellstrom fixed it, so I added him to the cc: list. Thomas, you
mentioned there was a performance hit associated with the fix; would
this be an improvement over what you did?
Also I should add that I was quoting a research.microsoft.com whitepaper
above. But s/AccelStar II AGP/VIA CLE266/ and it applies exactly to my
results. Just want to give credit where it's due...
Lee
end of thread, other threads:[~2004-10-05 21:12 UTC | newest]
Thread overview: 18+ messages
-- links below jump to the message on this page --
2004-09-03 15:33 [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9 Mark_H_Johnson
2004-09-04 0:04 ` Lee Revell
2004-09-04 16:52 ` Alan Cox
2004-09-04 18:05 ` Lee Revell
2004-10-05 20:43 ` GPU driver misbehavior [Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9] Timothy Miller
2004-10-05 21:12 ` Lee Revell
-- strict thread matches above, loose matches on Subject: below --
2004-09-03 18:28 [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9 Mark_H_Johnson
2004-09-03 18:33 ` Ingo Molnar
2004-09-03 13:42 Mark_H_Johnson
2004-09-02 20:21 Mark_H_Johnson
2004-09-02 20:38 ` Ingo Molnar
2004-09-02 22:43 ` Ingo Molnar
2004-09-03 9:09 ` Ingo Molnar
2004-08-30 19:13 [patch] voluntary-preempt-2.6.9-rc1-bk4-Q5 Mark_H_Johnson
2004-09-02 6:33 ` Ingo Molnar
2004-09-02 6:55 ` [patch] voluntary-preempt-2.6.9-rc1-bk4-Q8 Ingo Molnar
2004-09-02 11:10 ` [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9 Ingo Molnar
2004-09-02 12:14 ` Thomas Charbonnel
2004-09-02 13:16 ` Thomas Charbonnel
2004-09-02 13:23 ` Ingo Molnar
2004-09-02 14:38 ` Thomas Charbonnel