public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9
@ 2004-09-03 15:33 Mark_H_Johnson
  2004-09-04  0:04 ` Lee Revell
  0 siblings, 1 reply; 6+ messages in thread
From: Mark_H_Johnson @ 2004-09-03 15:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, K.R. Foley, Felipe Alfaro Solana, Daniel Schmitt,
	Lee Revell

>In the meantime, I will build a kernel with -R1 and leave the latency trace
>on (now running a -Q7 kernel with both CPUs active) to see if it shows
>anything interesting before attempting any more tests.

Well - THAT was special. Another crash but I may have a clue on both that
and the "general slow down" in sched.c.

The crash is likely due to a problem with X. I caused it this time when
I was trying to open / hide Mozilla windows (again preparing to send email
with a web based client). The last operation that worked was the window
hide. The one that locked everything up was a click to restore a Mozilla
window onto the screen. I don't know if this is relevant, but the last
trace in /var/log/messages was a latency trace caused by X.

Sep  3 09:57:11 dws77 kernel: (X/2382): new 329 us maximum-latency critical section.
Sep  3 09:57:11 dws77 kernel:  => started at: <kmap_atomic+0x23/0xe0>
Sep  3 09:57:11 dws77 kernel:  => ended at:   <kunmap_atomic+0x7b/0xa0>

I am not sure this is relevant since all the data for it was written
to disk (my script picked up the latency trace as well). Let me know
if you want the trace data.

The slowdown in sched.c may be due to disk DMA activity.
When I started the kernel build, I forgot to run my script that sets
a number of /proc and /sys values. Specifically, the values were:
cat /sys/block/hda/queue/max_sectors_kb
128
cat /sys/block/hda/queue/read_ahead_kb
128
cat /proc/sys/net/core/netdev_max_backlog
300
cat '/proc/irq/10/Ensoniq AudioPCI/threaded'
1

and in my other tests, they are:
echo 32 > /sys/block/hda/queue/max_sectors_kb
echo 32 > /sys/block/hda/queue/read_ahead_kb
echo 8 > /proc/sys/net/core/netdev_max_backlog
echo 0 > '/proc/irq/10/Ensoniq AudioPCI/threaded'

Note - no audio and only light network activity during the kernel
build.

With the default settings, I had latencies in sched.c over 2 msec
(longest was 2305 us). For example:

preemption latency trace v1.0.3
-------------------------------
 latency: 2305 us, entries: 137 (137)
    -----------------
    | task: ksoftirqd/0/3, uid:0 nice:-10 policy:0 rt_prio:0
    -----------------
 => started at: schedule+0x51/0x740
 => ended at:   schedule+0x337/0x740
=======>
00000001 0.000ms (+0.000ms): schedule (worker_thread)
00000001 0.000ms (+0.000ms): sched_clock (schedule)
00000001 0.000ms (+0.000ms): spin_lock_irq (schedule)
00000001 0.000ms (+0.000ms): __spin_lock_irqsave (spin_lock_irq)
00000002 0.001ms (+0.000ms): deactivate_task (schedule)
00000002 0.001ms (+0.000ms): dequeue_task (deactivate_task)
04000002 0.765ms (+0.764ms): __switch_to (schedule)
04000002 0.800ms (+0.034ms): finish_task_switch (schedule)
04010002 1.116ms (+0.316ms): do_IRQ (finish_task_switch)
04010002 1.116ms (+0.000ms): spin_lock (do_IRQ)
04010003 1.117ms (+0.000ms): ack_edge_ioapic_irq (do_IRQ)
04010003 1.118ms (+0.000ms): generic_redirect_hardirq (do_IRQ)
04010002 1.328ms (+0.210ms): generic_handle_IRQ_event (do_IRQ)
04010002 1.797ms (+0.468ms): timer_interrupt (generic_handle_IRQ_event)
04010002 1.840ms (+0.043ms): spin_lock (timer_interrupt)
04010003 1.842ms (+0.001ms): mark_offset_tsc (timer_interrupt)
04010003 1.842ms (+0.000ms): mark_offset_tsc (timer_interrupt)
04010003 1.842ms (+0.000ms): spin_lock (mark_offset_tsc)
04010004 1.890ms (+0.048ms): mark_offset_tsc (timer_interrupt)
04010004 2.236ms (+0.345ms): mark_offset_tsc (timer_interrupt)
04010004 2.236ms (+0.000ms): spin_lock (mark_offset_tsc)
04010005 2.236ms (+0.000ms): mark_offset_tsc (timer_interrupt)
04010005 2.240ms (+0.003ms): mark_offset_tsc (timer_interrupt)
04010005 2.244ms (+0.004ms): mark_offset_tsc (timer_interrupt)
04010005 2.246ms (+0.002ms): mark_offset_tsc (timer_interrupt)
04010005 2.247ms (+0.000ms): mark_offset_tsc (timer_interrupt)
04010004 2.247ms (+0.000ms): mark_offset_tsc (timer_interrupt)
04010004 2.247ms (+0.000ms): mark_offset_tsc (timer_interrupt)
04010004 2.248ms (+0.000ms): mark_offset_tsc (timer_interrupt)
04010003 2.249ms (+0.001ms): timer_interrupt (generic_handle_IRQ_event)
...

So this trace shows a number of the symptoms I have previously
reported.

When I used the lower disk settings, the latencies went way
down [at least up to the crash...] with a maximum of 618 us.
I looked through several traces and could not find a similar
sequence for comparison.

So we may both be right:
 - there is hardware overhead that needs to be accounted for; this
pair of runs points to disk I/O, which we previously confirmed is
DMA to/from an IDE disk.
 - there may be steps in the code that cause longer latencies, which
we want to avoid if possible.

It may be a combination of effects. A question for others doing
testing (like Lee) - have you been doing any other activity in
the background when doing your tests? For example, I have found
that something as simple as
  head -c $1 /dev/zero >tmpfile  [where $1 is larger than physical memory]
or
  cat tmpfile > /dev/null
can cause significantly increased latencies in the 2.6 kernels.

  --Mark


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9
  2004-09-03 15:33 [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9 Mark_H_Johnson
@ 2004-09-04  0:04 ` Lee Revell
  2004-09-04 16:52   ` Alan Cox
  2004-10-05 20:43   ` GPU driver misbehavior [Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9] Timothy Miller
  0 siblings, 2 replies; 6+ messages in thread
From: Lee Revell @ 2004-09-04  0:04 UTC (permalink / raw)
  To: Mark_H_Johnson
  Cc: Ingo Molnar, linux-kernel, K.R. Foley, Felipe Alfaro Solana,
	Daniel Schmitt

On Fri, 2004-09-03 at 11:33, Mark_H_Johnson@raytheon.com wrote:
> Well - THAT was special. Another crash but I may have a clue on both that
> and the "general slow down" in sched.c.
> 
> The crash is likely due to a problem with X. I caused it this time when
> I was trying to open / hide Mozilla windows (again preparing to send email
> with a web based client). The last operation that worked was the window
> hide. The one that locked everything up was a click to restore a Mozilla
> window onto the screen. I don't know if this is relevant, but the last
> trace in /var/log/messages was a latency trace caused by X.
> 
> Sep  3 09:57:11 dws77 kernel: (X/2382): new 329 us maximum-latency critical section.
> Sep  3 09:57:11 dws77 kernel:  => started at: <kmap_atomic+0x23/0xe0>
> Sep  3 09:57:11 dws77 kernel:  => ended at:   <kunmap_atomic+0x7b/0xa0>
> 
> I am not sure this is relevant since all the data for it was written
> to disk (my script picked up the latency trace as well). Let me know
> if you want the trace data.
> 
> The slow down in sched.c may be due to disk DMA activities.

[...]

> It may be a combination of effects. A question for others doing
> testing (like Lee) - have you been doing any other activity in
> the background when doing your tests? For example, I have found
> that something as simple as
>   head -c $1 /dev/zero >tmpfile  [$1 is a > physical memory size]
> or
>   cat tmpfile > /dev/null
> can cause significantly increased latencies in the 2.6 kernels.
> 

This is looking more and more like a video driver problem:

"Misbehaving video card drivers are another source of significant delays
in scheduling user code. A number of video cards manufacturers recently
began employing a hack to save a PCI bus transaction for each display
operation in order to gain a few percentage points on their WinBench
[Ziff-Davis 98] Graphics WinMark performance.

The video cards have a command FIFO that is written to via the PCI bus.
They also have a status register, read via the PCI bus, which says
whether the command FIFO is full or not. The hack is to not check
whether the command FIFO is full before attempting to write to it, thus
saving a PCI bus read.

The problem with this is that the result of attempting to write to the
FIFO when it is full is to stall the CPU waiting on the PCI bus write
until a command has been completed and space becomes available to accept
the new command. In fact, this not only causes the CPU to stall waiting
on the PCI bus, but since the PCI controller chip also controls the ISA
bus and mediates interrupts, ISA traffic and interrupt requests are
stalled as well. Even the clock interrupts stop.

These video cards will stall the machine, for instance, when the user
drags a window. For windows occupying most of a 1024x768 screen on a
333MHz Pentium II with an AccelStar II AGP video board (which is based
on the 3D Labs Permedia 2 chip set) this will stall the machine for
25-30ms at a time!"

(from http://research.microsoft.com/~mbj/papers/tr-98-29.html)

Ingo, would the above situation indeed produce these symptoms?

I had this exact problem with my Via Unichrome chipset (an open source
driver!), reported it to the maintainers, and it turned out this was
exactly what the driver was doing (it's fixed now).  The above text is
from 1998; the problem in the via driver was discovered last week.  So I
would actually expect this behavior to be the norm, unless someone has
fixed that driver.

The easiest way to eliminate this possibility is to disable DRI and set
'Option "NoAccel"' in your X config.  Do you get the same mysterious
latencies with this setting?
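
For reference, the change amounts to something like the following in
XF86Config-4 (the Identifier and Driver values here are placeholders;
keep whatever your existing config uses):

Section "Module"
    # Load "dri"              # comment this out to disable DRI
EndSection

Section "Device"
    Identifier "Card0"        # placeholder; keep your existing value
    Driver     "nv"           # placeholder; keep your existing driver
    Option     "NoAccel"      # disable 2D acceleration in the driver
EndSection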

What video hardware are you using?

Lee







^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9
  2004-09-04  0:04 ` Lee Revell
@ 2004-09-04 16:52   ` Alan Cox
  2004-09-04 18:05     ` Lee Revell
  2004-10-05 20:43   ` GPU driver misbehavior [Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9] Timothy Miller
  1 sibling, 1 reply; 6+ messages in thread
From: Alan Cox @ 2004-09-04 16:52 UTC (permalink / raw)
  To: Lee Revell
  Cc: Mark_H_Johnson, Ingo Molnar, Linux Kernel Mailing List,
	K.R. Foley, Felipe Alfaro Solana, Daniel Schmitt

On Sad, 2004-09-04 at 01:04, Lee Revell wrote:
> This is looking more and more like a video driver problem:

Not really. The delay is too small and X is smarter than this. (except a
VIA case that only recently got squished).

> The video cards have a command FIFO that is written to via the PCI bus.
> They also have a status register, read via the PCI bus, which says
> whether the command FIFO is full or not. The hack is to not check
> whether the command FIFO is full before attempting to write to it, thus
> saving a PCI bus read.

On problem cards X defaults to polling the status FIFO. You can tell it
to be rude, but you have to actively do so. Newer PCI 2.x specs also have
a thing or two to say on the subject.


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9
  2004-09-04 16:52   ` Alan Cox
@ 2004-09-04 18:05     ` Lee Revell
  0 siblings, 0 replies; 6+ messages in thread
From: Lee Revell @ 2004-09-04 18:05 UTC (permalink / raw)
  To: Alan Cox
  Cc: Mark_H_Johnson, Ingo Molnar, Linux Kernel Mailing List,
	K.R. Foley, Felipe Alfaro Solana, Daniel Schmitt

On Sat, 2004-09-04 at 12:52, Alan Cox wrote:
> On Sad, 2004-09-04 at 01:04, Lee Revell wrote:
> > This is looking more and more like a video driver problem:
> 
> Not really. The delay is too small and X is smarter than this. (except a
> VIA case that only recently got squished).
> 

True, the VIA problem did not cause long latency traces to appear on my
machine.  This is a weird one.

Lee





^ permalink raw reply	[flat|nested] 6+ messages in thread

* GPU driver misbehavior  [Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9]
  2004-09-04  0:04 ` Lee Revell
  2004-09-04 16:52   ` Alan Cox
@ 2004-10-05 20:43   ` Timothy Miller
  2004-10-05 21:12     ` Lee Revell
  1 sibling, 1 reply; 6+ messages in thread
From: Timothy Miller @ 2004-10-05 20:43 UTC (permalink / raw)
  To: Lee Revell
  Cc: Mark_H_Johnson, Ingo Molnar, linux-kernel, K.R. Foley,
	Felipe Alfaro Solana, Daniel Schmitt



Lee Revell wrote:

> 
> "Misbehaving video card drivers are another source of significant delays
> in scheduling user code. A number of video cards manufacturers recently
> began employing a hack to save a PCI bus transaction for each display
> operation in order to gain a few percentage points on their WinBench
> [Ziff-Davis 98] Graphics WinMark performance.
> 
> The video cards have a command FIFO that is written to via the PCI bus.
> They also have a status register, read via the PCI bus, which says
> whether the command FIFO is full or not. The hack is to not check
> whether the command FIFO is full before attempting to write to it, thus
> saving a PCI bus read.
> 
> The problem with this is that the result of attempting to write to the
> FIFO when it is full is to stall the CPU waiting on the PCI bus write
> until a command has been completed and space becomes available to accept
> the new command. In fact, this not only causes the CPU to stall waiting
> on the PCI bus, but since the PCI controller chip also controls the ISA
> bus and mediates interrupts, ISA traffic and interrupt requests are
> stalled as well. Even the clock interrupts stop.
> 
> These video cards will stall the machine, for instance, when the user
> drags a window. For windows occupying most of a 1024x768 screen on a
> 333MHz Pentium II with an AccelStar II AGP video board (which is based
> on the 3D Labs Permedia 2 chip set) this will stall the machine for
> 25-30ms at a time!"

I would expect that I'm not the first to think of this, but I haven't 
seen it mentioned, so it makes me wonder.  Therefore, I offer my solution.

Whenever you read the status register, keep a copy of the "number of 
free fifo entries" field.  Whenever you're going to do a group of writes 
to the fifo, you first must check for enough free entries.  The macro 
that does this checks the copy of the status register to see if there 
were enough free the last time you checked.  If so, deduct the number of 
free slots you're about to use, and move on.  If not, re-read the status 
register and loop or sleep if you don't have enough free.

The copy of the status register will always be "correct" in that it will 
always report a number of free entries less than or equal to the actual 
number, and it will never report a number greater than what is available 
(barring a hardware glitch or a bug, either of which would be bad for 
other reasons). This is because you're assuming the fifo doesn't drain, 
when in fact, it does.

This results in nearly optimal performance, because usually you end up 
reading the status register mostly when the fifo is full (a time when 
extra bus reads don't hurt anything).  If you have a 256-entry fifo, 
then you end up reading the status register once for every 256 writes, 
for a performance loss of only 0.39%, and you ONLY get this performance 
loss when the fifo drains faster than you can fill it.

One challenge to this is when you have more than one entity trying to 
access the same resource.  But in that case, you'll already have to be 
using some sort of mutex mechanism anyhow.
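
A rough sketch of the scheme in C (all names here are made up for
illustration; a real driver would read an MMIO status register instead
of the simulated read below, and would need the locking mentioned
above):

```c
#include <stdint.h>

#define FIFO_DEPTH 256

/* Simulated status-register read.  In a real driver this would be an
 * MMIO read of the card's "free FIFO entries" field; here we pretend
 * the FIFO has fully drained, so it always reports FIFO_DEPTH free. */
static uint32_t read_status_free_entries(void)
{
    return FIFO_DEPTH;
}

/* Cached "free entries" count from the last status-register read.
 * It can only under-report free space (the FIFO drains on its own but
 * never fills on its own), so trusting it is always safe. */
static uint32_t fifo_free;

/* Reserve n FIFO slots before a group of writes.  The status register
 * is re-read only when the cached count is too small -- at worst once
 * per FIFO_DEPTH writes, the ~0.39% overhead noted above. */
static void fifo_reserve(uint32_t n)
{
    while (fifo_free < n)
        fifo_free = read_status_free_entries(); /* poll (or sleep) */
    fifo_free -= n; /* deduct the slots we are about to use */
}
```

After each call, fifo_free holds a conservative lower bound on the
card's actual free space, which is exactly why skipping the read is
safe.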


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: GPU driver misbehavior  [Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9]
  2004-10-05 20:43   ` GPU driver misbehavior [Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9] Timothy Miller
@ 2004-10-05 21:12     ` Lee Revell
  0 siblings, 0 replies; 6+ messages in thread
From: Lee Revell @ 2004-10-05 21:12 UTC (permalink / raw)
  To: Timothy Miller
  Cc: Mark_H_Johnson, Ingo Molnar, linux-kernel, K.R. Foley,
	Felipe Alfaro Solana, Daniel Schmitt

On Tue, 2004-10-05 at 16:43, Timothy Miller wrote:
> Lee Revell wrote:
> 
> > 
> > "Misbehaving video card drivers are another source of significant delays
> > in scheduling user code. A number of video cards manufacturers recently
> > began employing a hack to save a PCI bus transaction for each display
> > operation in order to gain a few percentage points on their WinBench
> > [Ziff-Davis 98] Graphics WinMark performance.
> > 
> > The video cards have a command FIFO that is written to via the PCI bus.
> > They also have a status register, read via the PCI bus, which says
> > whether the command FIFO is full or not. The hack is to not check
> > whether the command FIFO is full before attempting to write to it, thus
> > saving a PCI bus read.
> > 
> > The problem with this is that the result of attempting to write to the
> > FIFO when it is full is to stall the CPU waiting on the PCI bus write
> > until a command has been completed and space becomes available to accept
> > the new command. In fact, this not only causes the CPU to stall waiting
> > on the PCI bus, but since the PCI controller chip also controls the ISA
> > bus and mediates interrupts, ISA traffic and interrupt requests are
> > stalled as well. Even the clock interrupts stop.
> > 
> > These video cards will stall the machine, for instance, when the user
> > drags a window. For windows occupying most of a 1024x768 screen on a
> > 333MHz Pentium II with an AccelStar II AGP video board (which is based
> > on the 3D Labs Permedia 2 chip set) this will stall the machine for
> > 25-30ms at a time!"
> 
> I would expect that I'm not the first to think of this, but I haven't 
> seen it mentioned, so it makes me wonder.  Therefore, I offer my solution.
> 
> Whenever you read the status register, keep a copy of the "number of 
> free fifo entries" field.  Whenever you're going to do a group of writes 
> to the fifo, you first must check for enough free entries.  The macro 
> that does this checks the copy of the status register to see if there 
> were enough free the last time you checked.  If so, deduct the number of 
> free slots you're about to use, and move on.  If not, re-read the status 
> register and loop or sleep if you don't have enough free.
> 
> The copy of the status register will always be "correct" in that it will 
> always report a number of free entries less than or equal to the actual 
> number, and it will never report a number greater than what is available 
> (barring a hardware glitch or a bug, either of which would be bad for 
> other reasons). This is because you're assuming the fifo doesn't drain, 
> when in fact, it does.
> 
> This results in nearly optimal performance, because usually you end up 
> reading the status register mostly when the fifo is full (a time when 
> extra bus reads don't hurt anything).  If you have a 256-entry fifo, 
> then you end up reading the status register once for every 256 writes, 
> for a performance loss of only 0.39%, and you ONLY get this performance 
> loss when the fifo drains faster than you can fill it.
> 
> One challenge to this is when you have more than one entity trying to 
> access the same resource.  But in that case, you'll already have to be 
> using some sort of mutex mechanism anyhow.
> 
> 

AFAIK only one driver (VIA unichrome) has had this problem recently. 
Thomas Hellstrom fixed it, so I added him to the cc: list.  Thomas, you
mentioned there was a performance hit associated with the fix; would
this be an improvement over what you did?

Also I should add that I was quoting a research.microsoft.com whitepaper
above.  But s/AccelStar II AGP/VIA CLE266/ and it applies exactly to my
results.  Just want to give credit where it's due...

Lee


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2004-10-05 21:12 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-09-03 15:33 [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9 Mark_H_Johnson
2004-09-04  0:04 ` Lee Revell
2004-09-04 16:52   ` Alan Cox
2004-09-04 18:05     ` Lee Revell
2004-10-05 20:43   ` GPU driver misbehavior [Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9] Timothy Miller
2004-10-05 21:12     ` Lee Revell

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox