* Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9
@ 2004-09-03 15:33 Mark_H_Johnson
  2004-09-04  0:04 ` Lee Revell
  0 siblings, 1 reply; 6+ messages in thread
From: Mark_H_Johnson @ 2004-09-03 15:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, K.R. Foley, Felipe Alfaro Solana, Daniel Schmitt, Lee Revell

>In the meantime, I will build a kernel with -R1 and leave the latency trace
>on (now running a -Q7 kernel with both CPUs active) to see if it shows
>anything interesting before attempting any more tests.

Well - THAT was special. Another crash, but I may have a clue about both
that and the "general slow down" in sched.c.

The crash is likely due to a problem with X. I caused it this time when
I was trying to open / hide Mozilla windows (again preparing to send email
with a web based client). The last operation that worked was the window
hide. The one that locked everything up was a click to restore a Mozilla
window onto the screen. I don't know if this is relevant, but the last
trace in /var/log/messages was a latency trace caused by X.

Sep  3 09:57:11 dws77 kernel: (X/2382): new 329 us maximum-latency critical section.
Sep  3 09:57:11 dws77 kernel:  => started at: <kmap_atomic+0x23/0xe0>
Sep  3 09:57:11 dws77 kernel:  => ended at:   <kunmap_atomic+0x7b/0xa0>

I am not sure this is relevant since all the data for it was written
to disk (my script picked up the latency trace as well). Let me know
if you want the trace data.

The slow down in sched.c may be due to disk DMA activity. When I started
the kernel build, I forgot to run my script that sets a number of /proc
values.
Specifically, the values were:

  cat /sys/block/hda/queue/max_sectors_kb
  128
  cat /sys/block/hda/queue/read_ahead_kb
  128
  cat /proc/sys/net/core/netdev_max_backlog
  300
  cat '/proc/irq/10/Ensoniq AudioPCI/threaded'
  1

and in my other tests, they are:

  echo 32 > /sys/block/hda/queue/max_sectors_kb
  echo 32 > /sys/block/hda/queue/read_ahead_kb
  echo 8 > /proc/sys/net/core/netdev_max_backlog
  echo 0 > '/proc/irq/10/Ensoniq AudioPCI/threaded'

Note - no audio and only light network activity during the kernel build.
With the default settings, I had latencies in sched.c over 2 msec (the
longest was 2305 us). For example:

preemption latency trace v1.0.3
-------------------------------
 latency: 2305 us, entries: 137 (137)
    -----------------
    | task: ksoftirqd/0/3, uid:0 nice:-10 policy:0 rt_prio:0
    -----------------
 => started at: schedule+0x51/0x740
 => ended at:   schedule+0x337/0x740

=======>
00000001 0.000ms (+0.000ms): schedule (worker_thread)
00000001 0.000ms (+0.000ms): sched_clock (schedule)
00000001 0.000ms (+0.000ms): spin_lock_irq (schedule)
00000001 0.000ms (+0.000ms): __spin_lock_irqsave (spin_lock_irq)
00000002 0.001ms (+0.000ms): deactivate_task (schedule)
00000002 0.001ms (+0.000ms): dequeue_task (deactivate_task)
04000002 0.765ms (+0.764ms): __switch_to (schedule)
04000002 0.800ms (+0.034ms): finish_task_switch (schedule)
04010002 1.116ms (+0.316ms): do_IRQ (finish_task_switch)
04010002 1.116ms (+0.000ms): spin_lock (do_IRQ)
04010003 1.117ms (+0.000ms): ack_edge_ioapic_irq (do_IRQ)
04010003 1.118ms (+0.000ms): generic_redirect_hardirq (do_IRQ)
04010002 1.328ms (+0.210ms): generic_handle_IRQ_event (do_IRQ)
04010002 1.797ms (+0.468ms): timer_interrupt (generic_handle_IRQ_event)
04010002 1.840ms (+0.043ms): spin_lock (timer_interrupt)
04010003 1.842ms (+0.001ms): mark_offset_tsc (timer_interrupt)
04010003 1.842ms (+0.000ms): mark_offset_tsc (timer_interrupt)
04010003 1.842ms (+0.000ms): spin_lock (mark_offset_tsc)
04010004 1.890ms (+0.048ms): mark_offset_tsc (timer_interrupt)
04010004 2.236ms (+0.345ms): mark_offset_tsc (timer_interrupt)
04010004 2.236ms (+0.000ms): spin_lock (mark_offset_tsc)
04010005 2.236ms (+0.000ms): mark_offset_tsc (timer_interrupt)
04010005 2.240ms (+0.003ms): mark_offset_tsc (timer_interrupt)
04010005 2.244ms (+0.004ms): mark_offset_tsc (timer_interrupt)
04010005 2.246ms (+0.002ms): mark_offset_tsc (timer_interrupt)
04010005 2.247ms (+0.000ms): mark_offset_tsc (timer_interrupt)
04010004 2.247ms (+0.000ms): mark_offset_tsc (timer_interrupt)
04010004 2.247ms (+0.000ms): mark_offset_tsc (timer_interrupt)
04010004 2.248ms (+0.000ms): mark_offset_tsc (timer_interrupt)
04010003 2.249ms (+0.001ms): timer_interrupt (generic_handle_IRQ_event)
...

So this trace shows a number of the symptoms I have previously reported.
When I used the lower disk settings, the latencies went way down [at
least up to the crash...] with a maximum of 618 us. I looked through
several traces and could not find a similar sequence for comparison.

So we may both be right:
 - there is hardware overhead that needs to be accounted for; this pair
   of runs seems to point to disk I/O, which we previously confirmed is
   DMA to/from an IDE disk.
 - there may be steps in the code that cause longer latency that we want
   to avoid if possible.

It may be a combination of effects. A question for others doing testing
(like Lee) - have you been doing any other activity in the background
when doing your tests? For example, I have found that something as
simple as
  head -c $1 /dev/zero >tmpfile   [$1 is larger than physical memory]
or
  cat tmpfile > /dev/null
can cause significantly increased latencies in the 2.6 kernels.

--Mark

^ permalink raw reply	[flat|nested] 6+ messages in thread
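[Editor's note: the background-load commands Mark describes can be collected into a small script. This is a sketch, not Mark's actual test harness; the variable names and the default size are illustrative, and SIZE_BYTES would normally be set larger than physical RAM to force page-cache thrashing.]

```shell
#!/bin/sh
# Sketch of the cache-pressure load generator described above.
# SIZE_BYTES should exceed physical memory to reproduce the effect;
# a small default is used here purely for illustration.
SIZE_BYTES=${SIZE_BYTES:-1048576}
TMPFILE=${TMPFILE:-./tmpfile}

# Write pass: stream zeroes into a file, dirtying and evicting cached pages.
head -c "$SIZE_BYTES" /dev/zero > "$TMPFILE"

# Read pass: stream the file back, refilling the page cache.
cat "$TMPFILE" > /dev/null

# Report the number of bytes actually written.
wc -c < "$TMPFILE"
```

Running both passes in a loop while a latency test is active approximates the background activity Mark is asking about.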
* Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9
  2004-09-03 15:33 [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9 Mark_H_Johnson
@ 2004-09-04  0:04 ` Lee Revell
  2004-09-04 16:52   ` Alan Cox
  2004-10-05 20:43   ` GPU driver misbehavior [Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9] Timothy Miller
  0 siblings, 2 replies; 6+ messages in thread
From: Lee Revell @ 2004-09-04  0:04 UTC (permalink / raw)
  To: Mark_H_Johnson
  Cc: Ingo Molnar, linux-kernel, K.R. Foley, Felipe Alfaro Solana, Daniel Schmitt

On Fri, 2004-09-03 at 11:33, Mark_H_Johnson@raytheon.com wrote:
> Well - THAT was special. Another crash but I may have a clue on both that
> and the "general slow down" in sched.c.
>
> The crash is likely due to a problem with X. I caused it this time when
> I was trying to open / hide Mozilla windows (again preparing to send email
> with a web based client). The last operation that worked was the window
> hide. The one that locked everything up was a click to restore a Mozilla
> window onto the screen. I don't know if this is relevant, but the last
> trace in /var/log/messages was a latency trace caused by X.
>
> Sep  3 09:57:11 dws77 kernel: (X/2382): new 329 us maximum-latency critical
> section.
> Sep  3 09:57:11 dws77 kernel: => started at: <kmap_atomic+0x23/0xe0>
> Sep  3 09:57:11 dws77 kernel: => ended at: <kunmap_atomic+0x7b/0xa0>
>
> I am not sure this is relevant since all the data for it was written
> to disk (my script picked up the latency trace as well). Let me know
> if you want the trace data.
>
> The slow down in sched.c may be due to disk DMA activities.
[...]
> It may be a combination of effects. A question for others doing
> testing (like Lee) - have you been doing any other activity in
> the background when doing your tests? For example, I have found
> that something as simple as
>   head -c $1 /dev/zero >tmpfile   [$1 is larger than physical memory]
> or
>   cat tmpfile > /dev/null
> can cause significantly increased latencies in the 2.6 kernels.
This is looking more and more like a video driver problem:

"Misbehaving video card drivers are another source of significant delays
in scheduling user code. A number of video card manufacturers recently
began employing a hack to save a PCI bus transaction for each display
operation in order to gain a few percentage points on their WinBench
[Ziff-Davis 98] Graphics WinMark performance.

The video cards have a command FIFO that is written to via the PCI bus.
They also have a status register, read via the PCI bus, which says
whether the command FIFO is full or not. The hack is to not check
whether the command FIFO is full before attempting to write to it, thus
saving a PCI bus read.

The problem with this is that the result of attempting to write to the
FIFO when it is full is to stall the CPU waiting on the PCI bus write
until a command has been completed and space becomes available to accept
the new command. In fact, this not only causes the CPU to stall waiting
on the PCI bus, but since the PCI controller chip also controls the ISA
bus and mediates interrupts, ISA traffic and interrupt requests are
stalled as well. Even the clock interrupts stop.

These video cards will stall the machine, for instance, when the user
drags a window. For windows occupying most of a 1024x768 screen on a
333MHz Pentium II with an AccelStar II AGP video board (which is based
on the 3D Labs Permedia 2 chip set) this will stall the machine for
25-30ms at a time!"

(from http://research.microsoft.com/~mbj/papers/tr-98-29.html)

Ingo, would the above situation indeed produce these symptoms? I had
this exact problem with my VIA Unichrome chipset (an open source
driver!), reported it to the maintainers, and it turned out this was
exactly what the driver was doing (it's fixed now).

The above text is from 1998; the problem in the via driver was
discovered last week. So I would actually expect this behavior to be
the norm, unless someone has fixed that driver.
The easiest way to eliminate this possibility is to disable DRI and set
'Option "NoAccel"' in your X config. Do you get the same mysterious
latencies with this setting? What video hardware are you using?

Lee

^ permalink raw reply	[flat|nested] 6+ messages in thread
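[Editor's note: for readers unfamiliar with the X config of that era, Lee's suggested change would look roughly like the fragment below. The Identifier and Driver values are placeholders for whatever the system actually uses; only the "NoAccel" option and the commented-out dri module load are the point.]

```
Section "Device"
    Identifier "Card0"            # placeholder - use your existing identifier
    Driver     "via"              # placeholder - whatever driver is in use
    Option     "NoAccel" "true"   # disable 2D acceleration
EndSection

Section "Module"
#   Load "dri"                    # comment out to disable DRI
EndSection
```

With acceleration and DRI off, all drawing goes through the unaccelerated path, so FIFO-stall latencies from the accelerator should disappear if the driver is the culprit.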
* Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9
  2004-09-04  0:04 ` Lee Revell
@ 2004-09-04 16:52   ` Alan Cox
  2004-09-04 18:05     ` Lee Revell
  1 sibling, 1 reply; 6+ messages in thread
From: Alan Cox @ 2004-09-04 16:52 UTC (permalink / raw)
  To: Lee Revell
  Cc: Mark_H_Johnson, Ingo Molnar, Linux Kernel Mailing List, K.R. Foley, Felipe Alfaro Solana, Daniel Schmitt

On Sad, 2004-09-04 at 01:04, Lee Revell wrote:
> This is looking more and more like a video driver problem:

Not really. The delay is too small and X is smarter than this (except a
VIA case that only recently got squished).

> The video cards have a command FIFO that is written to via the PCI bus.
> They also have a status register, read via the PCI bus, which says
> whether the command FIFO is full or not. The hack is to not check
> whether the command FIFO is full before attempting to write to it, thus
> saving a PCI bus read.

On problem cards X defaults to polling the status FIFO. You can tell it
to be rude, but you have to actively do so. Newer PCI 2.x specs also
have a thing or two to say on the subject.

^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9
  2004-09-04 16:52 ` Alan Cox
@ 2004-09-04 18:05   ` Lee Revell
  0 siblings, 0 replies; 6+ messages in thread
From: Lee Revell @ 2004-09-04 18:05 UTC (permalink / raw)
  To: Alan Cox
  Cc: Mark_H_Johnson, Ingo Molnar, Linux Kernel Mailing List, K.R. Foley, Felipe Alfaro Solana, Daniel Schmitt

On Sat, 2004-09-04 at 12:52, Alan Cox wrote:
> On Sad, 2004-09-04 at 01:04, Lee Revell wrote:
> > This is looking more and more like a video driver problem:
>
> Not really. The delay is too small and X is smarter than this. (except a
> VIA case that only recently got squished).

True, the VIA problem did not cause long latency traces to appear on my
machine. This is a weird one.

Lee

^ permalink raw reply	[flat|nested] 6+ messages in thread
* GPU driver misbehavior [Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9]
  2004-09-04  0:04 ` Lee Revell
  2004-09-04 16:52   ` Alan Cox
@ 2004-10-05 20:43   ` Timothy Miller
  2004-10-05 21:12     ` Lee Revell
  1 sibling, 1 reply; 6+ messages in thread
From: Timothy Miller @ 2004-10-05 20:43 UTC (permalink / raw)
  To: Lee Revell
  Cc: Mark_H_Johnson, Ingo Molnar, linux-kernel, K.R. Foley, Felipe Alfaro Solana, Daniel Schmitt

Lee Revell wrote:
>
> "Misbehaving video card drivers are another source of significant delays
> in scheduling user code. A number of video card manufacturers recently
> began employing a hack to save a PCI bus transaction for each display
> operation in order to gain a few percentage points on their WinBench
> [Ziff-Davis 98] Graphics WinMark performance.
>
> The video cards have a command FIFO that is written to via the PCI bus.
> They also have a status register, read via the PCI bus, which says
> whether the command FIFO is full or not. The hack is to not check
> whether the command FIFO is full before attempting to write to it, thus
> saving a PCI bus read.
>
> The problem with this is that the result of attempting to write to the
> FIFO when it is full is to stall the CPU waiting on the PCI bus write
> until a command has been completed and space becomes available to accept
> the new command. In fact, this not only causes the CPU to stall waiting
> on the PCI bus, but since the PCI controller chip also controls the ISA
> bus and mediates interrupts, ISA traffic and interrupt requests are
> stalled as well. Even the clock interrupts stop.
>
> These video cards will stall the machine, for instance, when the user
> drags a window. For windows occupying most of a 1024x768 screen on a
> 333MHz Pentium II with an AccelStar II AGP video board (which is based
> on the 3D Labs Permedia 2 chip set) this will stall the machine for
> 25-30ms at a time!"

I would expect that I'm not the first to think of this, but I haven't
seen it mentioned, so it makes me wonder.
Therefore, I offer my solution. Whenever you read the status register,
keep a copy of the "number of free fifo entries" field. Whenever you're
going to do a group of writes to the fifo, you first must check for
enough free entries. The macro that does this checks the copy of the
status register to see if there were enough free the last time you
checked. If so, deduct the number of free slots you're about to use and
move on. If not, re-read the status register, and loop or sleep if you
don't have enough free.

The copy of the status register will always be "correct" in that it
will always report a number of free entries less than or equal to the
actual number, and it will never report a number greater than what is
available (barring a hardware glitch or a bug, which is bad for other
reasons). This is because you're assuming the fifo doesn't drain, when
in fact it does.

This results in nearly optimal performance, because you end up reading
the status register mostly when the fifo is full (a time when extra bus
reads don't hurt anything). If you have a 256-entry fifo, then you end
up reading the status register once for every 256 writes, for a
performance loss of only 0.39%, and you ONLY get this performance loss
when the fifo drains faster than you can fill it.

One challenge to this is when you have more than one entity trying to
access the same resource. But in that case, you'll already have to be
using some sort of mutex mechanism anyhow.

^ permalink raw reply	[flat|nested] 6+ messages in thread
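[Editor's note: the shadow-count scheme Timothy describes can be sketched in C. The hardware is simulated here with plain variables; in a real driver, read_status_register() and push_command() would be MMIO accesses, and every name below is illustrative rather than taken from any actual driver.]

```c
/* Sketch of a shadow FIFO-free-count, per the scheme described above.
 * The shadow count only ever under-estimates free space (it assumes
 * the FIFO never drains between status reads), so bursts can never
 * overflow the hardware FIFO. */

#define FIFO_DEPTH 256

static int hw_free = FIFO_DEPTH; /* what the status register would report */
static int shadow_free = 0;      /* driver's cached copy; starts pessimistic */

/* Stand-in for a PCI read of the card's status register. */
static int read_status_register(void)
{
    return hw_free;
}

/* Stand-in for a PCI write of one command into the FIFO. */
static void push_command(void)
{
    hw_free--;
}

/* Simulate the hardware executing commands and freeing slots. */
static void hw_drain(int n)
{
    hw_free += n;
    if (hw_free > FIFO_DEPTH)
        hw_free = FIFO_DEPTH;
}

/* Reserve n slots, touching the real status register only when the
 * cached count is too small.  Returns how many status reads were
 * needed, to make the scheme's bus cost visible. */
static int fifo_reserve(int n)
{
    int reads = 0;
    while (shadow_free < n) {
        shadow_free = read_status_register(); /* a driver would poll or sleep here */
        reads++;
    }
    shadow_free -= n; /* conservative: assume no draining until next read */
    return reads;
}

/* Write a burst of n commands, reserving space first. */
static int fifo_write_burst(int n)
{
    int reads = fifo_reserve(n);
    for (int i = 0; i < n; i++)
        push_command();
    return reads;
}
```

A burst that fits in the cached count costs zero status reads; only when the shadow count runs out does the driver pay for a PCI read, which matches Timothy's "once per 256 writes" estimate for a 256-entry FIFO.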
* Re: GPU driver misbehavior [Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9]
  2004-10-05 20:43 ` GPU driver misbehavior [Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9] Timothy Miller
@ 2004-10-05 21:12   ` Lee Revell
  0 siblings, 0 replies; 6+ messages in thread
From: Lee Revell @ 2004-10-05 21:12 UTC (permalink / raw)
  To: Timothy Miller
  Cc: Mark_H_Johnson, Ingo Molnar, linux-kernel, K.R. Foley, Felipe Alfaro Solana, Daniel Schmitt

On Tue, 2004-10-05 at 16:43, Timothy Miller wrote:
> Lee Revell wrote:
> >
> > "Misbehaving video card drivers are another source of significant delays
> > in scheduling user code. A number of video card manufacturers recently
> > began employing a hack to save a PCI bus transaction for each display
> > operation in order to gain a few percentage points on their WinBench
> > [Ziff-Davis 98] Graphics WinMark performance.
> >
> > The video cards have a command FIFO that is written to via the PCI bus.
> > They also have a status register, read via the PCI bus, which says
> > whether the command FIFO is full or not. The hack is to not check
> > whether the command FIFO is full before attempting to write to it, thus
> > saving a PCI bus read.
> >
> > The problem with this is that the result of attempting to write to the
> > FIFO when it is full is to stall the CPU waiting on the PCI bus write
> > until a command has been completed and space becomes available to accept
> > the new command. In fact, this not only causes the CPU to stall waiting
> > on the PCI bus, but since the PCI controller chip also controls the ISA
> > bus and mediates interrupts, ISA traffic and interrupt requests are
> > stalled as well. Even the clock interrupts stop.
> >
> > These video cards will stall the machine, for instance, when the user
> > drags a window.
> > For windows occupying most of a 1024x768 screen on a
> > 333MHz Pentium II with an AccelStar II AGP video board (which is based
> > on the 3D Labs Permedia 2 chip set) this will stall the machine for
> > 25-30ms at a time!"
>
> I would expect that I'm not the first to think of this, but I haven't
> seen it mentioned, so it makes me wonder. Therefore, I offer my solution.
>
> Whenever you read the status register, keep a copy of the "number of
> free fifo entries" field. Whenever you're going to do a group of writes
> to the fifo, you first must check for enough free entries. The macro
> that does this checks the copy of the status register to see if there
> were enough free the last time you checked. If so, deduct the number of
> free slots you're about to use, and move on. If not, re-read the status
> register and loop or sleep if you don't have enough free.
>
> The copy of the status register will always be "correct" in that it will
> always report a number of free entries less than or equal to the actual
> number, and it will never report a number greater than what is available
> (barring a hardware glitch or a bug, which is bad for other reasons).
> This is because you're assuming the fifo doesn't drain, when in fact, it
> does.
>
> This results in nearly optimal performance, because usually you end up
> reading the status register mostly when the fifo is full (a time when
> extra bus reads don't hurt anything). If you have a 256-entry fifo,
> then you end up reading the status register once for every 256 writes,
> for a performance loss of only 0.39%, and you ONLY get this performance
> loss when the fifo drains faster than you can fill it.
>
> One challenge to this is when you have more than one entity trying to
> access the same resource. But in that case, you'll already have to be
> using some sort of mutex mechanism anyhow.

AFAIK only one driver (VIA unichrome) has had this problem recently.
Thomas Hellstrom fixed it, so I added him to the cc: list.
Thomas, you mentioned there was a performance hit associated with the
fix; would this be an improvement over what you did?

Also, I should add that I was quoting a research.microsoft.com
whitepaper above. But s/AccelStar II AGP/VIA CLE266/ and it applies
exactly to my results. Just want to give credit where it's due...

Lee

^ permalink raw reply	[flat|nested] 6+ messages in thread
end of thread, other threads:[~2004-10-05 21:12 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-09-03 15:33 [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9 Mark_H_Johnson
2004-09-04  0:04 ` Lee Revell
2004-09-04 16:52   ` Alan Cox
2004-09-04 18:05     ` Lee Revell
2004-10-05 20:43   ` GPU driver misbehavior [Re: [patch] voluntary-preempt-2.6.9-rc1-bk4-Q9] Timothy Miller
2004-10-05 21:12     ` Lee Revell