* Worst case performance of up()
From: Adrian Cox @ 2006-11-24 16:21 UTC
To: linuxppc-dev

First the background: I've been investigating poor performance of a
Firewire capture application, running on a dual-7450 board with a 2.6.17
kernel. The kernel is based on a slightly earlier version of the
mpc7448hpc2 board port, using arch/powerpc, which I've not yet updated
to reflect the changes made when the board support entered the
mainstream kernel.

The application runs smoothly on a single processor. On the dual
processor machine, the application sometimes suffers a drop in
frame-rate, simultaneous with high CPU usage by the Firewire kernel
thread.

Further investigation reveals that the kernel thread spends most of the
time in one line: up(&fi->complete_sem) in __queue_complete_req() in
drivers/ieee1394/raw1394.c. It seems that whenever the userspace thread
calling raw1394_read() is scheduled on the opposite CPU to the kernel
thread, the kernel thread takes much longer to execute up() - typically
10000 times longer.

Does anybody have any ideas what could make up() take so long in this
circumstance? I'd expect cache transfers to make the operation about
100 times slower, but this looks like repeated cache ping-pong between
the two CPUs.

-- 
Adrian Cox <adrian@humboldt.co.uk>
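For readers unfamiliar with the driver, the hot path is a classic
producer/consumer handoff on a semaphore. A minimal sketch of the shape
(simplified and abbreviated, not the verbatim code from
drivers/ieee1394/raw1394.c):

	/* Kernel thread side: a request completed, hand it to the reader. */
	static void __queue_complete_req(struct pending_request *req)
	{
		struct file_info *fi = req->file_info;

		list_move_tail(&req->list, &fi->req_complete);
		up(&fi->complete_sem);	/* wake a reader sleeping in down() */
	}

	/* Userspace side, reached via raw1394_read() on the device node: */
	static ssize_t raw1394_read(struct file *file, char __user *buffer,
				    size_t count, loff_t *offset)
	{
		struct file_info *fi = file->private_data;

		/* Sleep until __queue_complete_req() posts a completion. */
		if (down_interruptible(&fi->complete_sem))
			return -ERESTARTSYS;
		/* ... copy the completed request out to userspace ... */
		return count;
	}

With producer and consumer pinned on different CPUs, both sides update
the semaphore's counter in rapid alternation, which is why cache
behaviour is the first suspect.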
* Re: Worst case performance of up()
From: Benjamin Herrenschmidt @ 2006-11-24 20:45 UTC
To: Adrian Cox
Cc: linuxppc-dev

On Fri, 2006-11-24 at 16:21 +0000, Adrian Cox wrote:
> First the background: I've been investigating poor performance of a
> Firewire capture application, running on a dual-7450 board with a
> 2.6.17 kernel. The kernel is based on a slightly earlier version of
> the mpc7448hpc2 board port, using arch/powerpc, which I've not yet
> updated to reflect the changes made when the board support entered
> the mainstream kernel.
>
> The application runs smoothly on a single processor. On the dual
> processor machine, the application sometimes suffers a drop in
> frame-rate, simultaneous with high CPU usage by the Firewire kernel
> thread.
>
> Further investigation reveals that the kernel thread spends most of
> the time in one line: up(&fi->complete_sem) in __queue_complete_req()
> in drivers/ieee1394/raw1394.c. It seems that whenever the userspace
> thread calling raw1394_read() is scheduled on the opposite CPU to the
> kernel thread, the kernel thread takes much longer to execute up() -
> typically 10000 times longer.
>
> Does anybody have any ideas what could make up() take so long in this
> circumstance? I'd expect cache transfers to make the operation about
> 100 times slower, but this looks like repeated cache ping-pong
> between the two CPUs.

Is it hung in up() (toplevel) or __up() (low level)? The former is
mostly just an atomic_add_return(), which boils down to:

static __inline__ int atomic_add_return(int a, atomic_t *v)
{
	int t;

	__asm__ __volatile__(
	LWSYNC_ON_SMP
"1:	lwarx	%0,0,%2		# atomic_add_return\n\
	add	%0,%1,%0\n"
	PPC405_ERR77(0,%2)
"	stwcx.	%0,0,%2 \n\
	bne-	1b"
	ISYNC_ON_SMP
	: "=&r" (t)
	: "r" (a), "r" (&v->counter)
	: "cc", "memory");

	return t;
}

So yes, on SMP, you get an additional sync and isync in there, though
I'm surprised that you hit a code path where that would make such a big
difference (unless you are really up'ing a zillion times per second).

Have you tried some oprofile runs to catch the exact instruction where
the cycles appear to be wasted?

Maybe there is some contention on the reservation (though it would be a
bit strange to have contention on an up()...), or somehow the semaphore
ends up sharing a cache line with something else. That would cause a
performance problem. Have you tried moving the semaphore away from
whatever other data might be manipulated at the same time? In its own
cache line, maybe?

Cheers,
Ben.
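For reference, one concrete way to try Ben's last suggestion is the
kernel's cache-line alignment annotation. A hypothetical sketch - the
annotation is real, but the field placement here is illustrative, not
what the driver actually does:

	#include <linux/cache.h>

	struct file_info {
		/* ... other, frequently written fields ... */

		/* Hypothetical: give the semaphore a cache line of its
		 * own, so that up()/down() traffic does not ping-pong a
		 * line shared with unrelated hot data. */
		struct semaphore complete_sem ____cacheline_aligned_in_smp;
	};

Note that ____cacheline_aligned_in_smp only aligns the start of the
field; fully isolating it also requires that nothing frequently written
follows it in the same line.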
* Re: Worst case performance of up()
From: Adrian Cox @ 2006-11-27 21:02 UTC
To: Benjamin Herrenschmidt
Cc: linuxppc-dev

On Sat, 2006-11-25 at 07:45 +1100, Benjamin Herrenschmidt wrote:
> On Fri, 2006-11-24 at 16:21 +0000, Adrian Cox wrote:
> > Does anybody have any ideas what could make up() take so long in
> > this circumstance? I'd expect cache transfers to make the operation
> > about 100 times slower, but this looks like repeated cache
> > ping-pong between the two CPUs.
>
> Is it hung in up() (toplevel) or __up() (low level)?

Not yet proven.

> Have you tried some oprofile runs to catch the exact instruction
> where the cycles appear to be wasted?

I've spent a day wrestling with oprofile, but I've not managed to
trigger the problem while profiling yet. It's possible that the slight
overhead of oprofile is enough to change the scheduling behaviour. It
remains an odd problem, and rather hard to reproduce. Unless I get
better data with oprofile, this one may remain in the mystery file.

> Have you tried moving the semaphore away from whatever other data
> might be manipulated at the same time? In its own cache line, maybe?

I did try that, but it didn't make any difference.

-- 
Adrian Cox <adrian@humboldt.co.uk>
* Re: Worst case performance of up()
From: Adrian Cox @ 2006-12-02 10:35 UTC
To: Benjamin Herrenschmidt
Cc: linuxppc-dev

On Mon, 2006-11-27 at 21:02 +0000, Adrian Cox wrote:
> On Sat, 2006-11-25 at 07:45 +1100, Benjamin Herrenschmidt wrote:
> > On Fri, 2006-11-24 at 16:21 +0000, Adrian Cox wrote:
> > > Does anybody have any ideas what could make up() take so long in
> > > this circumstance? I'd expect cache transfers to make the
> > > operation about 100 times slower, but this looks like repeated
> > > cache ping-pong between the two CPUs.
> >
> > Is it hung in up() (toplevel) or __up() (low level)?
>
> Not yet proven.

By using a scope, I have further data: the system is hung in this line
of resched_task() in kernel/sched.c:

	set_tsk_thread_flag(p, TIF_NEED_RESCHED);

During this time, there is a great deal of ARTRY activity on the bus.
The sequence ends when the other CPU takes a timer tick.

I'll need to track down what the other CPU is doing at this point, but
my current hypothesis is that it's somewhere in schedule().

> > Have you tried some oprofile runs to catch the exact instruction
> > where the cycles appear to be wasted?

Oprofile turned out to mask the error condition, by increasing the
interrupt rate on each CPU. In the end a combination of lockmeter and
an oscilloscope did the trick.

-- 
Adrian Cox <adrian@humboldt.co.uk>
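For context, set_tsk_thread_flag() is itself an atomic
read-modify-write on the target task's thread_info flags word - roughly
the following, paraphrased from the 2.6-era generic headers:

	/* Paraphrased from include/linux/sched.h and
	 * include/linux/thread_info.h: */
	static inline void set_ti_thread_flag(struct thread_info *ti, int flag)
	{
		set_bit(flag, &ti->flags);  /* lwarx/stwcx. loop on powerpc */
	}

	static inline void set_tsk_thread_flag(struct task_struct *tsk, int flag)
	{
		set_ti_thread_flag(task_thread_info(tsk), flag);
	}

So if the other CPU is also performing atomic operations on the same
flags word, the two stwcx. loops contend for the same reservation
granule - consistent with the ARTRY storm seen on the scope.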
* Re: Worst case performance of up()
From: Benjamin Herrenschmidt @ 2006-12-02 11:15 UTC
To: Adrian Cox
Cc: linuxppc-dev

On Sat, 2006-12-02 at 10:35 +0000, Adrian Cox wrote:
> On Mon, 2006-11-27 at 21:02 +0000, Adrian Cox wrote:
> > On Sat, 2006-11-25 at 07:45 +1100, Benjamin Herrenschmidt wrote:
> > > On Fri, 2006-11-24 at 16:21 +0000, Adrian Cox wrote:
> > > > Does anybody have any ideas what could make up() take so long
> > > > in this circumstance? I'd expect cache transfers to make the
> > > > operation about 100 times slower, but this looks like repeated
> > > > cache ping-pong between the two CPUs.
> > >
> > > Is it hung in up() (toplevel) or __up() (low level)?
> >
> > Not yet proven.
>
> By using a scope, I have further data: the system is hung in this
> line of resched_task() in kernel/sched.c:
>
>	set_tsk_thread_flag(p, TIF_NEED_RESCHED);
>
> During this time, there is a great deal of ARTRY activity on the bus.
> The sequence ends when the other CPU takes a timer tick.
>
> I'll need to track down what the other CPU is doing at this point,
> but my current hypothesis is that it's somewhere in schedule().
>
> > > Have you tried some oprofile runs to catch the exact instruction
> > > where the cycles appear to be wasted?
>
> Oprofile turned out to mask the error condition, by increasing the
> interrupt rate on each CPU. In the end a combination of lockmeter and
> an oscilloscope did the trick.

I think we are hitting a livelock due to both CPUs trying to perform an
atomic operation on the same cache line (or even the same variable). I
would expect that to work more smoothly, but it looks like we can hit a
worst-case scenario...

Can you remind me precisely what CPU this is? I know that for some CPUs
like the 970, Apple code has some weirdo workarounds around atomic ops,
involving forcing a mispredict when the stwcx. fails... but if both
CPUs are following the exact same pattern, I can't see another way out
but an interrupt, unless something in the bus protocol can prevent such
livelocks...

Ben.
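To spell out the mechanism behind Ben's hypothesis: lwarx places a
reservation on the word's cache granule, and stwcx. succeeds only if
the reservation still stands; a store from the other CPU to the same
granule (retried as ARTRY on the MPX bus while the line bounces) kills
it. An illustrative interleaving, assuming one CPU keeps storing to the
word that the other is trying to update atomically:

	CPU A                             CPU B
	lwarx  flags   (reservation)
	                                  store to flags (succeeds, kills
	                                                  A's reservation)
	stwcx. flags   (fails -> bne- 1b)
	lwarx  flags   (new reservation)
	                                  store to flags (kills it again)
	stwcx. flags   (fails -> bne- 1b)
	... and so on, until an interrupt breaks the pattern.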
* Re: Worst case performance of up()
From: Adrian Cox @ 2006-12-02 11:54 UTC
To: Benjamin Herrenschmidt
Cc: linuxppc-dev

On Sat, 2006-12-02 at 22:15 +1100, Benjamin Herrenschmidt wrote:
> I think we are hitting a livelock due to both CPUs trying to perform
> an atomic operation on the same cache line (or even the same
> variable).

I agree.

> Can you remind me precisely what CPU this is? I know that for some
> CPUs like the 970, Apple code has some weirdo workarounds around
> atomic ops, involving forcing a mispredict when the stwcx. fails...
> but if both CPUs are following the exact same pattern, I can't see
> another way out but an interrupt, unless something in the bus
> protocol can prevent such livelocks...

It's a pair of 7448s in MPX bus mode, with a Tsi109 host bridge.

-- 
Adrian Cox <adrian@humboldt.co.uk>
* Re: Worst case performance of up()
From: Benjamin Herrenschmidt @ 2006-12-02 20:52 UTC
To: Adrian Cox
Cc: linuxppc-dev

On Sat, 2006-12-02 at 11:54 +0000, Adrian Cox wrote:
> On Sat, 2006-12-02 at 22:15 +1100, Benjamin Herrenschmidt wrote:
> > I think we are hitting a livelock due to both CPUs trying to
> > perform an atomic operation on the same cache line (or even the
> > same variable).
>
> I agree.
>
> It's a pair of 7448s in MPX bus mode, with a Tsi109 host bridge.

At this point, we might make good use of Freescale's help... there
might be something we can do in the construction of the atomic ops to
avoid that...

Ben.
* Re: Worst case performance of up()
From: Adrian Cox @ 2006-12-04 13:43 UTC
To: Benjamin Herrenschmidt
Cc: linuxppc-dev

On Sun, 2006-12-03 at 07:52 +1100, Benjamin Herrenschmidt wrote:
> On Sat, 2006-12-02 at 11:54 +0000, Adrian Cox wrote:
> > On Sat, 2006-12-02 at 22:15 +1100, Benjamin Herrenschmidt wrote:
> > > I think we are hitting a livelock due to both CPUs trying to
> > > perform an atomic operation on the same cache line (or even the
> > > same variable).
> >
> > I agree.

I now have this one identified and fixed. The cure is actually simple:
on 32-bit SMP machines, rather than setting powersave_nap to 0, set
ppc_md.power_save to NULL.

CPU A is attempting to reschedule the idle thread on CPU B:

	set_tsk_thread_flag(p, TIF_NEED_RESCHED);

CPU B is in the idle loop, but does not support nap. The result is that
ppc6xx_idle() returns immediately, and CPU B runs in a tight loop:

	while (...) {
		set_thread_flag(TIF_POLLING_NRFLAG);
		...
		clear_thread_flag(TIF_POLLING_NRFLAG);
	}

If ppc_md.power_save is NULL, then the idle loop does not touch the
flag word of the idle thread, and everything works.

-- 
Adrian Cox <adrian@humboldt.co.uk>
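A minimal sketch of what the fix might look like in a board setup file.
The ppc_md.power_save hook is real; the surrounding function name and
config guard here are illustrative:

	/* Illustrative board-setup snippet: on a 32-bit SMP board where
	 * nap is unusable, leave the power_save hook NULL so the idle
	 * loop never spins on set/clear of TIF_POLLING_NRFLAG. */
	static void __init board_setup_arch(void)
	{
		/* ... other platform setup ... */
	#ifdef CONFIG_SMP
		ppc_md.power_save = NULL;  /* rather than powersave_nap = 0 */
	#endif
	}

With the hook NULL, the generic idle loop leaves TIF_POLLING_NRFLAG set
and simply polls need_resched(); remote reads of the flags word do not
kill a reservation, so resched_task() on the other CPU can complete its
atomic update promptly.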
* Re: Worst case performance of up()
From: Benjamin Herrenschmidt @ 2006-12-04 20:25 UTC
To: Adrian Cox
Cc: linuxppc-dev

On Mon, 2006-12-04 at 13:43 +0000, Adrian Cox wrote:
> I now have this one identified and fixed. The cure is actually
> simple: on 32-bit SMP machines, rather than setting powersave_nap to
> 0, set ppc_md.power_save to NULL.
>
> CPU A is attempting to reschedule the idle thread on CPU B:
>
>	set_tsk_thread_flag(p, TIF_NEED_RESCHED);
>
> CPU B is in the idle loop, but does not support nap. The result is
> that ppc6xx_idle() returns immediately, and CPU B runs in a tight
> loop:
>
>	while (...) {
>		set_thread_flag(TIF_POLLING_NRFLAG);
>		...
>		clear_thread_flag(TIF_POLLING_NRFLAG);
>	}
>
> If ppc_md.power_save is NULL, then the idle loop does not touch the
> flag word of the idle thread, and everything works.

Good to know; there might be other cases of performance issues due to
touching that flag in a loop...

Ben.
* Re: Worst case performance of up()
From: Benjamin Herrenschmidt @ 2006-12-05 22:00 UTC
To: Adrian Cox
Cc: linuxppc-dev, Paul Mackerras

> > If ppc_md.power_save is NULL, then the idle loop does not touch the
> > flag word of the idle thread, and everything works.
>
> Good to know; there might be other cases of performance issues due to
> touching that flag in a loop...

More specifically... that means we have the same performance problem on
SMP PowerMacs (dual G4s & Xserve G4s). On those machines, there's a
firmware property telling us not to use NAP mode, and we honour it by
clearing powersave_nap.

Paul, do you reckon the best solution here is to clear
ppc_md.power_save when detecting an SMP PowerMac? (Or the flush-on-lock
property?)

Ben.
* Re: Worst case performance of up()
From: Paul Mackerras @ 2006-12-08 2:33 UTC
To: Benjamin Herrenschmidt
Cc: linuxppc-dev

Benjamin Herrenschmidt writes:

> Paul, do you reckon the best solution here is to clear
> ppc_md.power_save when detecting an SMP PowerMac? (Or the
> flush-on-lock property?)

Yes, that sounds good. It would let us get rid of the runtime CPU
feature check in idle_6xx.S too.

Paul.
* Re: Worst case performance of up()
From: Benjamin Herrenschmidt @ 2006-12-08 3:10 UTC
To: Paul Mackerras
Cc: linuxppc-dev

On Fri, 2006-12-08 at 13:33 +1100, Paul Mackerras wrote:
> Benjamin Herrenschmidt writes:
>
> > Paul, do you reckon the best solution here is to clear
> > ppc_md.power_save when detecting an SMP PowerMac? (Or the
> > flush-on-lock property?)
>
> Yes, that sounds good. It would let us get rid of the runtime CPU
> feature check in idle_6xx.S too.

We would have to do something about users flipping powersave_nap,
though...

Ben.