wrong SP restored after DBE exception

* wrong SP restored after DBE exception
@ 2006-09-27 19:53 Dave Johnson
  2006-09-28 13:09 ` Ralf Baechle
  0 siblings, 1 reply; 6+ messages in thread
From: Dave Johnson @ 2006-09-27 19:53 UTC (permalink / raw)
  To: linux-mips

I'm running into an odd problem with the DBE exception handler.

I've got an IO device that on some error conditions causes a bus fault
on access.  The driver I have that accesses this device directs
all reads/writes through a wrapper function to handle potential bus
faults (see the read version below).

If the device causes a DBE on access, do_be() looks up the EPC in the
DBE table and successfully corrects the PC to handle the fault.

This works most of the time; however, on about 1 out of 100 faults the
SP register is saved and restored incorrectly.  When control returns to
the faulting function, SP is 304 bytes lower than it should be and, as
expected, things go downhill from there.

304 bytes is PT_SIZE (the amount of space for saved registers).

I suspect something is wrong with except_vec3_generic() or
handle_dbe() but the only thing that comes to mind is potential nested
interrupts/exceptions that would clobber K0/K1.  The fact that SP is
off by 304 bytes seems to indicate it saved twice but only restored
once.

CPU:    SiByte BCM1250 (both A8 and B2 stepping tested)
Kernel: linux 2.6.12 (yes, I know it's old), 64bit kernel
Config: Occurs with and without SMP and with and without PREEMPT

I took a quick look to see if this area has changed between 2.6.12 and
2.6.17, and the only change I see is get_saved_sp(), which should
only affect faults from userspace.  All the faults I'm getting are
from a kernel-mode driver.

I've walked through one (successful) DBE fault from this driver using a
JTAG debugger and everything looks to run exactly as expected.  I have
yet to catch a failing one with the debugger, except after the
restore has finished, and by then it's too late.

Anyone have any thoughts on this issue?

------------------------
read wrapper function is:

	.set    noreorder
/* int sb_io_trap_readb(unsigned char *value, const volatile void *addrs); */
LEAF(sb_io_trap_readb)
	/* do the read, handle error */
8:	lb	t0, (a1)
9:	add	t0, t0, zero /* consume read */
	.section __dbe_table,"a"
	PTR	8b, 1f
	PTR	9b, 1f
	.previous
	/*
	 * write out to the caller's pointer, if this fails it's a bug
	 * and we should fault as normal
	 */
	sb	t0, (a0)
	/* all good, return success */
	jr	ra
	move	v0, zero
	/* fault handler, return -EIO */
1:	jr	ra
	li	v0, -EIO
END(sb_io_trap_readb)

-- 
Dave Johnson
Starent Networks

* Re: wrong SP restored after DBE exception
  2006-09-27 19:53 wrong SP restored after DBE exception Dave Johnson
@ 2006-09-28 13:09 ` Ralf Baechle
  2006-09-28 13:56   ` Maciej W. Rozycki
  0 siblings, 1 reply; 6+ messages in thread
From: Ralf Baechle @ 2006-09-28 13:09 UTC (permalink / raw)
  To: Dave Johnson; +Cc: linux-mips

On Wed, Sep 27, 2006 at 03:53:55PM -0400, Dave Johnson wrote:

> I'm running into an odd problem with the DBE exception handler.

There is a fundamental problem with the way unmaskable exceptions other
than cache errors and NMI are handled.  This is the disassembly of the
kernel's exception entry path starting at the general exception vector:

a8000000003bae20 <except_vec3_generic>:
a8000000003bae20:       401b6800        mfc0    k1,$13
a8000000003bae24:       337b007c        andi    k1,k1,0x7c
a8000000003bae28:       001bd878        dsll    k1,k1,0x1
a8000000003bae2c:       3c1a003f        lui     k0,0x3f
a8000000003bae30:       035bd02d        daddu   k0,k0,k1
a8000000003bae34:       df5ad2d8        ld      k0,-11560(k0)
a8000000003bae38:       03400008        jr      k0
a8000000003bae3c:       00000000        nop
[...]
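
The dispatch sequence above masks the ExcCode field out of c0_cause
(bits 6..2) and shifts it left by one, giving a byte offset into a table
of 8-byte handler pointers.  A minimal C sketch of that arithmetic
follows.  The helper name is hypothetical and used only for
illustration; the mask comes from the disassembly, and ExcCode 7 is the
architectural code for a data bus error.

```c
#include <stdint.h>

/* ExcCode occupies bits 6..2 of c0_cause; same mask as "andi k1,k1,0x7c". */
#define CAUSE_EXCCODE_MASK 0x7cu
/* Architectural ExcCode for a data bus error (load/store). */
#define EXCCODE_DBE 7u

/* Hypothetical helper: byte offset into a table of 64-bit handler pointers. */
static inline uint64_t handler_table_offset(uint32_t cause)
{
	/* (ExcCode << 2) << 1 == ExcCode * 8, one 8-byte pointer per code. */
	return (uint64_t)(cause & CAUSE_EXCCODE_MASK) << 1;
}
```

For a DBE the offset works out to 7 * 8 = 56 bytes, whatever the other
bits of c0_cause hold.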

A few types of exceptions will be handled by just using $k0 and $k1; most
will save the registers right away:

[...]
a800000000020440:       401a6000        mfc0    k0,$12
a800000000020444:       001ad0c0        sll     k0,k0,0x3
a800000000020448:       0740000a        bltz    k0,0xa800000000020474
a80000000002044c:       03a0d82d        move    k1,sp
a800000000020450:       403b2000        dmfc0   k1,$4
a800000000020454:       3c1aa800        lui     k0,0xa800
a800000000020458:       001bddfa        dsrl    k1,k1,0x17
a80000000002045c:       675a0000        daddiu  k0,k0,0
a800000000020460:       001ad438        dsll    k0,k0,0x10
a800000000020464:       675a0043        daddiu  k0,k0,67
a800000000020468:       001ad438        dsll    k0,k0,0x10
a80000000002046c:       037ad82d        daddu   k1,k1,k0
a800000000020470:       df7bf008        ld      k1,-4088(k1)
a800000000020474:       03a0d02d        move    k0,sp
a800000000020478:       677dfed0        daddiu  sp,k1,-304
a80000000002047c:       ffba00e8        sd      k0,232(sp)
(c0_status.exl is cleared a mile further down)
[...]

If we take a DBE exception in this code we're in trouble, and I've seen
systems delivering DBEs highly asynchronously.  AFAIR the Broadcom SOCs
fall into that class.

So the interesting part is if we take a data bus exception between
the stack pointer adjustment and the point where EXL is cleared.  We're
taking a nested exception, so c0_epc and c0_cause.bd will not be updated.
So the bus error handler will save the $sp value it saw on entry but
will return to the EPC of the first exception; that is, only one stack
frame will be popped.  Whoops ...

  Ralf
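
The failure described above can be reproduced as plain arithmetic.  The
following is a simplified model, not kernel code: the struct and the
function names are hypothetical, and it tracks only sp, c0_epc and EXL.
It assumes, as in the disassembly, that the kernel-mode entry path
lowers sp by PT_SIZE and saves the sp it saw on entry.

```c
#include <stdint.h>

#define PT_SIZE 304 /* size of the saved-register frame, per the thread */

/* Hypothetical model of the only CPU state that matters here. */
struct cpu_model {
	uint64_t sp;  /* stack pointer */
	uint64_t epc; /* c0_epc */
	int exl;      /* c0_status.exl */
};

/* Hardware exception entry: c0_epc is written only if EXL was clear. */
static void take_exception(struct cpu_model *c, uint64_t pc)
{
	if (!c->exl)
		c->epc = pc;
	c->exl = 1;
}

/* Kernel-mode entry prologue: carve a frame, return the sp it saved. */
static uint64_t save_frame(struct cpu_model *c)
{
	uint64_t seen_sp = c->sp;

	c->sp = seen_sp - PT_SIZE;
	return seen_sp;
}

/*
 * Run one DBE.  If nested is nonzero, the DBE lands while an earlier
 * exception (taken at first_pc) has adjusted sp but not yet cleared EXL.
 */
struct cpu_model run_dbe(uint64_t sp0, uint64_t first_pc,
			 uint64_t dbe_pc, int nested)
{
	struct cpu_model c = { sp0, 0, 0 };
	uint64_t saved_sp;

	if (nested) {
		take_exception(&c, first_pc);
		(void)save_frame(&c); /* this frame is never popped */
	}

	take_exception(&c, dbe_pc);   /* EPC unchanged if EXL already set */
	saved_sp = save_frame(&c);

	/* DBE exit: restore the sp seen on entry, then ERET to c0_epc. */
	c.sp = saved_sp;
	c.exl = 0;
	return c;
}
```

With nested set, the model comes back with sp equal to sp0 - PT_SIZE and
c0_epc still holding first_pc, which matches the 304-byte offset
reported at the start of the thread.  Without nesting, sp returns to sp0.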

* Re: wrong SP restored after DBE exception
  2006-09-28 13:09 ` Ralf Baechle
@ 2006-09-28 13:56   ` Maciej W. Rozycki
  2006-09-28 14:28     ` Ralf Baechle
  0 siblings, 1 reply; 6+ messages in thread
From: Maciej W. Rozycki @ 2006-09-28 13:56 UTC (permalink / raw)
  To: Ralf Baechle; +Cc: Dave Johnson, linux-mips

On Thu, 28 Sep 2006, Ralf Baechle wrote:

> If we take a DBE exception in this code we're in trouble, and I've seen
> systems delivering DBEs highly asynchronously.  AFAIR the Broadcom SOCs
> fall into that class.
> 
> So the interesting part is if we take a data bus exception between
> the stack pointer adjustment and the point where EXL is cleared.  We're
> taking a nested exception, so c0_epc and c0_cause.bd will not be updated.
> So the bus error handler will save the $sp value it saw on entry but
> will return to the EPC of the first exception; that is, only one stack
> frame will be popped.  Whoops ...

 It looks like a design issue -- further asynchronous bus error exceptions
should be blocked till the one currently being handled has been acked.  In
fact, if they are asynchronous, then it really makes no sense to use the
exception and a general interrupt should be used instead -- the whole
point of using an exception here is the ability to stop a data corrupting
transaction, as unlike an interrupt, an exception can be precise.

  Maciej

* Re: wrong SP restored after DBE exception
  2006-09-28 13:56   ` Maciej W. Rozycki
@ 2006-09-28 14:28     ` Ralf Baechle
  2006-09-28 14:34       ` Dave Johnson
  0 siblings, 1 reply; 6+ messages in thread
From: Ralf Baechle @ 2006-09-28 14:28 UTC (permalink / raw)
  To: Maciej W. Rozycki; +Cc: Dave Johnson, linux-mips

On Thu, Sep 28, 2006 at 02:56:29PM +0100, Maciej W. Rozycki wrote:

> > If we take a DBE exception in this code we're in trouble, and I've seen
> > systems delivering DBEs highly asynchronously.  AFAIR the Broadcom SOCs
> > fall into that class.
> > 
> > So the interesting part is if we take a data bus exception between
> > the stack pointer adjustment and the point where EXL is cleared.  We're
> > taking a nested exception, so c0_epc and c0_cause.bd will not be updated.
> > So the bus error handler will save the $sp value it saw on entry but
> > will return to the EPC of the first exception; that is, only one stack
> > frame will be popped.  Whoops ...
> 
> It looks like a design issue -- further asynchronous bus error exceptions
> should be blocked till the one currently being handled has been acked.  In
> fact, if they are asynchronous, then it really makes no sense to use the
> exception and a general interrupt should be used instead -- the whole
> point of using an exception here is the ability to stop a data corrupting
> transaction, as unlike an interrupt, an exception can be precise.

I would suggest disabling interrupts around accesses that potentially
could result in DB exceptions and, just to make sure he is not getting
trapped by a non-blocking load, making some use of any value read
from the device.  Writes could be posted depending on bus type, so
having a read from the same device would force the write to complete.

  Ralf
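
A sketch of what that suggestion could look like around the read wrapper
from the first message.  In a real driver, local_irq_save() and
local_irq_restore() are the kernel's own macros and sb_io_trap_readb()
is the assembly routine posted earlier; here all three are replaced by
userspace stand-ins so the fragment compiles on its own.  The name
sb_io_safe_readb() and the bookkeeping variables are hypothetical.

```c
#include <errno.h>

/*
 * Userspace stand-ins (assumptions for illustration only).  In the
 * kernel, local_irq_save()/local_irq_restore() mask and unmask
 * interrupts on the local CPU.
 */
static int irqs_enabled = 1;
static int irqs_enabled_during_access = -1;
#define local_irq_save(flags) \
	do { (flags) = (unsigned long)irqs_enabled; irqs_enabled = 0; } while (0)
#define local_irq_restore(flags) \
	do { irqs_enabled = (int)(flags); } while (0)

/* Stub for the thread's sb_io_trap_readb(): a null address "bus faults". */
static int sb_io_trap_readb(unsigned char *value, const volatile void *addr)
{
	irqs_enabled_during_access = irqs_enabled;
	if (!addr)
		return -EIO;
	*value = *(const volatile unsigned char *)addr;
	return 0;
}

/* Hypothetical wrapper: keep interrupts off across the risky read. */
int sb_io_safe_readb(unsigned char *value, const volatile void *addr)
{
	unsigned long flags;
	int ret;

	local_irq_save(flags);
	ret = sb_io_trap_readb(value, addr);
	local_irq_restore(flags);
	return ret;
}
```

The posted assembly wrapper already consumes the loaded value (the add
after the lb), which covers the non-blocking load concern; the sketch
only adds the interrupt masking around it.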

* Re: wrong SP restored after DBE exception
  2006-09-28 14:28     ` Ralf Baechle
@ 2006-09-28 14:34       ` Dave Johnson
  2006-09-28 17:55         ` Dave Johnson
  0 siblings, 1 reply; 6+ messages in thread
From: Dave Johnson @ 2006-09-28 14:34 UTC (permalink / raw)
  To: Ralf Baechle; +Cc: Maciej W. Rozycki, linux-mips

Ralf Baechle writes:
> I would suggest disabling interrupts around accesses that potentially
> could result in DB exceptions and, just to make sure he is not getting
> trapped by a non-blocking load, making some use of any value read
> from the device.  Writes could be posted depending on bus type, so
> having a read from the same device would force the write to complete.
> 
>   Ralf

Ya, I was about to try that.  I could be getting an interrupt
between the time the read is issued and the timeout occurs on the
GBus.  Also, doing a dummy read on the GBus to a device that
shouldn't fault prior to (for reads) or after (for writes) the
potentially faulting one to force ordering seems like a good idea
too.


-- 
Dave Johnson
Starent Networks

* Re: wrong SP restored after DBE exception
  2006-09-28 14:34       ` Dave Johnson
@ 2006-09-28 17:55         ` Dave Johnson
  0 siblings, 0 replies; 6+ messages in thread
From: Dave Johnson @ 2006-09-28 17:55 UTC (permalink / raw)
  To: Ralf Baechle, Maciej W. Rozycki, linux-mips

Dave Johnson <djohnson+linux-mips@sw.starentnetworks.com>, writes:
> Ralf Baechle writes:
> > I would suggest disabling interrupts around accesses that potentially
> > could result in DB exceptions and, just to make sure he is not getting
> > trapped by a non-blocking load, making some use of any value read
> > from the device.  Writes could be posted depending on bus type, so
> > having a read from the same device would force the write to complete.

Disabling interrupts around the accesses works ok.  My test program
has caused about 400000 DBEs so far with no problem.

Thanks.

-- 
Dave Johnson
Starent Networks
