* Worst case performance of up()
From: Adrian Cox @ 2006-11-24 16:21 UTC
To: linuxppc-dev

First the background: I've been investigating poor performance of a
Firewire capture application, running on a dual-7450 board with a 2.6.17
kernel. The kernel is based on a slightly earlier version of the
mpc7448hpc2 board port, using arch/powerpc, which I've not yet updated
to reflect the changes made when the board support entered the
mainstream kernel.

The application runs smoothly on a single processor. On the dual
processor machine, the application sometimes suffers a drop in
frame-rate, simultaneous with high CPU usage by the Firewire kernel
thread.

Further investigation reveals that the kernel thread spends most of the
time in one line: up(&fi->complete_sem) in __queue_complete_req() in
drivers/ieee1394/raw1394.c. It seems that whenever the userspace thread
calling raw1394_read() is scheduled on the opposite CPU to the kernel
thread, the kernel thread takes much longer to execute up() - typically
10000 times longer.

Does anybody have any ideas what could make up() take so long in this
circumstance? I'd expect cache transfers to make the operation about
100 times slower, but this looks like repeated cache ping-pong between
the two CPUs.

-- 
Adrian Cox <adrian@humboldt.co.uk>
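For readers unfamiliar with the driver, the hot path is a classic
producer/consumer handoff on a semaphore. A minimal sketch of the shape
(simplified and abbreviated, not the verbatim code from
drivers/ieee1394/raw1394.c):

	/* Kernel thread side: a request completed, hand it to the reader. */
	static void __queue_complete_req(struct pending_request *req)
	{
		struct file_info *fi = req->file_info;

		list_move_tail(&req->list, &fi->req_complete);
		up(&fi->complete_sem);	/* wake a reader sleeping in down() */
	}

	/* Userspace side, reached via raw1394_read() on the device node: */
	static ssize_t raw1394_read(struct file *file, char __user *buffer,
				    size_t count, loff_t *offset)
	{
		struct file_info *fi = file->private_data;

		/* Sleep until __queue_complete_req() posts a completion. */
		if (down_interruptible(&fi->complete_sem))
			return -ERESTARTSYS;
		/* ... copy the completed request out to userspace ... */
		return count;
	}

With producer and consumer pinned on different CPUs, both sides update
the semaphore's counter in rapid alternation, which is why cache
behaviour is the first suspect.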
* Re: Worst case performance of up()
From: Benjamin Herrenschmidt @ 2006-11-24 20:45 UTC
To: Adrian Cox
Cc: linuxppc-dev

On Fri, 2006-11-24 at 16:21 +0000, Adrian Cox wrote:
> First the background: I've been investigating poor performance of a
> Firewire capture application, running on a dual-7450 board with a
> 2.6.17 kernel. The kernel is based on a slightly earlier version of
> the mpc7448hpc2 board port, using arch/powerpc, which I've not yet
> updated to reflect the changes made when the board support entered
> the mainstream kernel.
>
> The application runs smoothly on a single processor. On the dual
> processor machine, the application sometimes suffers a drop in
> frame-rate, simultaneous with high CPU usage by the Firewire kernel
> thread.
>
> Further investigation reveals that the kernel thread spends most of
> the time in one line: up(&fi->complete_sem) in __queue_complete_req()
> in drivers/ieee1394/raw1394.c. It seems that whenever the userspace
> thread calling raw1394_read() is scheduled on the opposite CPU to the
> kernel thread, the kernel thread takes much longer to execute up() -
> typically 10000 times longer.
>
> Does anybody have any ideas what could make up() take so long in this
> circumstance? I'd expect cache transfers to make the operation about
> 100 times slower, but this looks like repeated cache ping-pong
> between the two CPUs.

Is it hung in up() (toplevel) or __up() (low level)? The former is
mostly just an atomic_add_return(), which boils down to:

static __inline__ int atomic_add_return(int a, atomic_t *v)
{
	int t;

	__asm__ __volatile__(
	LWSYNC_ON_SMP
"1:	lwarx	%0,0,%2		# atomic_add_return\n\
	add	%0,%1,%0\n"
	PPC405_ERR77(0,%2)
"	stwcx.	%0,0,%2 \n\
	bne-	1b"
	ISYNC_ON_SMP
	: "=&r" (t)
	: "r" (a), "r" (&v->counter)
	: "cc", "memory");

	return t;
}

So yes, on SMP, you get an additional sync and isync in there, though
I'm surprised that you hit a code path where that would make such a big
difference (unless you are really up'ing a zillion times per second).

Have you tried some oprofile runs to catch the exact instruction where
the cycles appear to be wasted?

Maybe there is some contention on the reservation (though it would be a
bit strange to have contention on an up()...), or somehow the semaphore
ends up sharing a cache line with something else. That would cause a
performance problem. Have you tried moving the semaphore away from
whatever other data might be manipulated at the same time? In its own
cache line, maybe?

Cheers,
Ben.
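For reference, one concrete way to try Ben's last suggestion is the
kernel's cache-line alignment annotation. A hypothetical sketch - the
annotation is real, but the field placement here is illustrative, not
what the driver actually does:

	#include <linux/cache.h>

	struct file_info {
		/* ... other, frequently written fields ... */

		/* Hypothetical: give the semaphore a cache line of its
		 * own, so that up()/down() traffic does not ping-pong a
		 * line shared with unrelated hot data. */
		struct semaphore complete_sem ____cacheline_aligned_in_smp;
	};

Note that ____cacheline_aligned_in_smp only aligns the start of the
field; fully isolating it also requires that nothing frequently written
follows it in the same line.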
* Re: Worst case performance of up()
From: Adrian Cox @ 2006-11-27 21:02 UTC
To: Benjamin Herrenschmidt
Cc: linuxppc-dev

On Sat, 2006-11-25 at 07:45 +1100, Benjamin Herrenschmidt wrote:
> On Fri, 2006-11-24 at 16:21 +0000, Adrian Cox wrote:
> > Does anybody have any ideas what could make up() take so long in
> > this circumstance? I'd expect cache transfers to make the operation
> > about 100 times slower, but this looks like repeated cache
> > ping-pong between the two CPUs.
>
> Is it hung in up() (toplevel) or __up() (low level)?

Not yet proven.

> Have you tried some oprofile runs to catch the exact instruction
> where the cycles appear to be wasted?

I've spent a day wrestling with oprofile, but I've not managed to
trigger the problem while profiling yet. It's possible that the slight
overhead of oprofile is enough to change the scheduling behaviour. It
remains an odd problem, and rather hard to reproduce. Unless I get
better data with oprofile, this one may remain in the mystery file.

> Have you tried moving the semaphore away from whatever other data
> might be manipulated at the same time? In its own cache line, maybe?

I did try that, but it didn't make any difference.

-- 
Adrian Cox <adrian@humboldt.co.uk>
* Re: Worst case performance of up()
From: Adrian Cox @ 2006-12-02 10:35 UTC
To: Benjamin Herrenschmidt
Cc: linuxppc-dev

On Mon, 2006-11-27 at 21:02 +0000, Adrian Cox wrote:
> On Sat, 2006-11-25 at 07:45 +1100, Benjamin Herrenschmidt wrote:
> > On Fri, 2006-11-24 at 16:21 +0000, Adrian Cox wrote:
> > > Does anybody have any ideas what could make up() take so long in
> > > this circumstance? I'd expect cache transfers to make the
> > > operation about 100 times slower, but this looks like repeated
> > > cache ping-pong between the two CPUs.
> >
> > Is it hung in up() (toplevel) or __up() (low level)?
>
> Not yet proven.

By using a scope, I have further data: the system is hung in this line
of resched_task() in kernel/sched.c:

	set_tsk_thread_flag(p, TIF_NEED_RESCHED);

During this time, there is a great deal of ARTRY activity on the bus.
The sequence ends when the other CPU takes a timer tick.

I'll need to track down what the other CPU is doing at this point, but
my current hypothesis is that it's somewhere in schedule().

> > Have you tried some oprofile runs to catch the exact instruction
> > where the cycles appear to be wasted?

Oprofile turned out to mask the error condition, by increasing the
interrupt rate on each CPU. In the end a combination of lockmeter and
an oscilloscope did the trick.

-- 
Adrian Cox <adrian@humboldt.co.uk>
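For context, set_tsk_thread_flag() is itself an atomic
read-modify-write on the target task's thread_info flags word - roughly
the following, paraphrased from the 2.6-era generic headers:

	/* Paraphrased from include/linux/sched.h and
	 * include/linux/thread_info.h: */
	static inline void set_ti_thread_flag(struct thread_info *ti, int flag)
	{
		set_bit(flag, &ti->flags);  /* lwarx/stwcx. loop on powerpc */
	}

	static inline void set_tsk_thread_flag(struct task_struct *tsk, int flag)
	{
		set_ti_thread_flag(task_thread_info(tsk), flag);
	}

So if the other CPU is also performing atomic operations on the same
flags word, the two stwcx. loops contend for the same reservation
granule - consistent with the ARTRY storm seen on the scope.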
* Re: Worst case performance of up()
From: Benjamin Herrenschmidt @ 2006-12-02 11:15 UTC
To: Adrian Cox
Cc: linuxppc-dev

On Sat, 2006-12-02 at 10:35 +0000, Adrian Cox wrote:
> On Mon, 2006-11-27 at 21:02 +0000, Adrian Cox wrote:
> > On Sat, 2006-11-25 at 07:45 +1100, Benjamin Herrenschmidt wrote:
> > > On Fri, 2006-11-24 at 16:21 +0000, Adrian Cox wrote:
> > > > Does anybody have any ideas what could make up() take so long
> > > > in this circumstance? I'd expect cache transfers to make the
> > > > operation about 100 times slower, but this looks like repeated
> > > > cache ping-pong between the two CPUs.
> > >
> > > Is it hung in up() (toplevel) or __up() (low level)?
> >
> > Not yet proven.
>
> By using a scope, I have further data: the system is hung in this
> line of resched_task() in kernel/sched.c:
>
>	set_tsk_thread_flag(p, TIF_NEED_RESCHED);
>
> During this time, there is a great deal of ARTRY activity on the bus.
> The sequence ends when the other CPU takes a timer tick.
>
> I'll need to track down what the other CPU is doing at this point,
> but my current hypothesis is that it's somewhere in schedule().
>
> > > Have you tried some oprofile runs to catch the exact instruction
> > > where the cycles appear to be wasted?
>
> Oprofile turned out to mask the error condition, by increasing the
> interrupt rate on each CPU. In the end a combination of lockmeter and
> an oscilloscope did the trick.

I think we are hitting a livelock due to both CPUs trying to perform an
atomic operation on the same cache line (or even the same variable). I
would expect that to work more smoothly, but it looks like we can hit a
worst-case scenario...

Can you remind me precisely what CPU this is? I know that for some CPUs
like the 970, Apple code has some weirdo workarounds around atomic ops,
involving forcing a mispredict when the stwcx. fails... but if both
CPUs are following the exact same pattern, I can't see another way out
but an interrupt, unless something in the bus protocol can prevent such
livelocks...

Ben.
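To spell out the mechanism behind Ben's hypothesis: lwarx places a
reservation on the word's cache granule, and stwcx. succeeds only if
the reservation still stands; a store from the other CPU to the same
granule (retried as ARTRY on the MPX bus while the line bounces) kills
it. An illustrative interleaving, assuming one CPU keeps storing to the
word that the other is trying to update atomically:

	CPU A                             CPU B
	lwarx  flags   (reservation)
	                                  store to flags (succeeds, kills
	                                                  A's reservation)
	stwcx. flags   (fails -> bne- 1b)
	lwarx  flags   (new reservation)
	                                  store to flags (kills it again)
	stwcx. flags   (fails -> bne- 1b)
	... and so on, until an interrupt breaks the pattern.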
* Re: Worst case performance of up()
From: Adrian Cox @ 2006-12-02 11:54 UTC
To: Benjamin Herrenschmidt
Cc: linuxppc-dev

On Sat, 2006-12-02 at 22:15 +1100, Benjamin Herrenschmidt wrote:
> I think we are hitting a livelock due to both CPUs trying to perform
> an atomic operation on the same cache line (or even the same
> variable).

I agree.

> Can you remind me precisely what CPU this is? I know that for some
> CPUs like the 970, Apple code has some weirdo workarounds around
> atomic ops, involving forcing a mispredict when the stwcx. fails...
> but if both CPUs are following the exact same pattern, I can't see
> another way out but an interrupt, unless something in the bus
> protocol can prevent such livelocks...

It's a pair of 7448s in MPX bus mode, with a Tsi109 host bridge.

-- 
Adrian Cox <adrian@humboldt.co.uk>
* Re: Worst case performance of up()
From: Benjamin Herrenschmidt @ 2006-12-02 20:52 UTC
To: Adrian Cox
Cc: linuxppc-dev

On Sat, 2006-12-02 at 11:54 +0000, Adrian Cox wrote:
> On Sat, 2006-12-02 at 22:15 +1100, Benjamin Herrenschmidt wrote:
> > I think we are hitting a livelock due to both CPUs trying to
> > perform an atomic operation on the same cache line (or even the
> > same variable).
>
> I agree.
>
> It's a pair of 7448s in MPX bus mode, with a Tsi109 host bridge.

At this point, we might make good use of Freescale's help... there
might be something we can do in the construction of the atomic ops to
avoid that...

Ben.
* Re: Worst case performance of up()
From: Adrian Cox @ 2006-12-04 13:43 UTC
To: Benjamin Herrenschmidt
Cc: linuxppc-dev

On Sun, 2006-12-03 at 07:52 +1100, Benjamin Herrenschmidt wrote:
> On Sat, 2006-12-02 at 11:54 +0000, Adrian Cox wrote:
> > On Sat, 2006-12-02 at 22:15 +1100, Benjamin Herrenschmidt wrote:
> > > I think we are hitting a livelock due to both CPUs trying to
> > > perform an atomic operation on the same cache line (or even the
> > > same variable).
> >
> > I agree.

I now have this one identified and fixed. The cure is actually simple:
on 32-bit SMP machines, rather than setting powersave_nap to 0, set
ppc_md.power_save to NULL.

CPU A is attempting to reschedule the idle thread on CPU B:

	set_tsk_thread_flag(p, TIF_NEED_RESCHED);

CPU B is in the idle loop, but does not support nap. The result is that
ppc6xx_idle() returns immediately, and CPU B runs in a tight loop:

	while (...) {
		set_thread_flag(TIF_POLLING_NRFLAG);
		...
		clear_thread_flag(TIF_POLLING_NRFLAG);
	}

If ppc_md.power_save is NULL, then the idle loop does not touch the
flag word of the idle thread, and everything works.

-- 
Adrian Cox <adrian@humboldt.co.uk>
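A minimal sketch of what the fix might look like in a board setup file.
The ppc_md.power_save hook is real; the surrounding function name and
config guard here are illustrative:

	/* Illustrative board-setup snippet: on a 32-bit SMP board where
	 * nap is unusable, leave the power_save hook NULL so the idle
	 * loop never spins on set/clear of TIF_POLLING_NRFLAG. */
	static void __init board_setup_arch(void)
	{
		/* ... other platform setup ... */
	#ifdef CONFIG_SMP
		ppc_md.power_save = NULL;  /* rather than powersave_nap = 0 */
	#endif
	}

With the hook NULL, the generic idle loop leaves TIF_POLLING_NRFLAG set
and simply polls need_resched(); remote reads of the flags word do not
kill a reservation, so resched_task() on the other CPU can complete its
atomic update promptly.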
* Re: Worst case performance of up()
From: Benjamin Herrenschmidt @ 2006-12-04 20:25 UTC
To: Adrian Cox
Cc: linuxppc-dev

On Mon, 2006-12-04 at 13:43 +0000, Adrian Cox wrote:
> I now have this one identified and fixed. The cure is actually
> simple: on 32-bit SMP machines, rather than setting powersave_nap to
> 0, set ppc_md.power_save to NULL.
>
> CPU A is attempting to reschedule the idle thread on CPU B:
>
>	set_tsk_thread_flag(p, TIF_NEED_RESCHED);
>
> CPU B is in the idle loop, but does not support nap. The result is
> that ppc6xx_idle() returns immediately, and CPU B runs in a tight
> loop:
>
>	while (...) {
>		set_thread_flag(TIF_POLLING_NRFLAG);
>		...
>		clear_thread_flag(TIF_POLLING_NRFLAG);
>	}
>
> If ppc_md.power_save is NULL, then the idle loop does not touch the
> flag word of the idle thread, and everything works.

Good to know; there might be other cases of performance issues due to
touching that flag in a loop...

Ben.
* Re: Worst case performance of up()
From: Benjamin Herrenschmidt @ 2006-12-05 22:00 UTC
To: Adrian Cox
Cc: linuxppc-dev, Paul Mackerras

> > If ppc_md.power_save is NULL, then the idle loop does not touch the
> > flag word of the idle thread, and everything works.
>
> Good to know; there might be other cases of performance issues due to
> touching that flag in a loop...

More specifically... that means we have the same performance problem on
SMP PowerMacs (dual G4s & Xserve G4s). On those machines, there's a
firmware property telling us not to use NAP mode, and we honour it by
clearing powersave_nap.

Paul, do you reckon the best solution here is to clear
ppc_md.power_save when detecting an SMP PowerMac? (Or the flush-on-lock
property?)

Ben.
* Re: Worst case performance of up()
From: Paul Mackerras @ 2006-12-08 2:33 UTC
To: Benjamin Herrenschmidt
Cc: linuxppc-dev

Benjamin Herrenschmidt writes:

> Paul, do you reckon the best solution here is to clear
> ppc_md.power_save when detecting an SMP PowerMac? (Or the
> flush-on-lock property?)

Yes, that sounds good. It would let us get rid of the runtime CPU
feature check in idle_6xx.S too.

Paul.
* Re: Worst case performance of up()
From: Benjamin Herrenschmidt @ 2006-12-08 3:10 UTC
To: Paul Mackerras
Cc: linuxppc-dev

On Fri, 2006-12-08 at 13:33 +1100, Paul Mackerras wrote:
> Benjamin Herrenschmidt writes:
>
> > Paul, do you reckon the best solution here is to clear
> > ppc_md.power_save when detecting an SMP PowerMac? (Or the
> > flush-on-lock property?)
>
> Yes, that sounds good. It would let us get rid of the runtime CPU
> feature check in idle_6xx.S too.

We would have to do something about users flipping powersave_nap,
though...

Ben.