OOPS on MPC8548 board when writing RAID5 array

LinuxPPC-Dev Archive on lore.kernel.org

* OOPS on MPC8548 board when writing RAID5 array
@ 2009-11-10 11:44 hank peng
  2009-11-13  1:36 ` Dan Williams
  0 siblings, 1 reply; 3+ messages in thread
From: hank peng @ 2009-11-10 11:44 UTC (permalink / raw)
  To: linuxppc-dev, linux-raid

CPU is MPC8548, kernel version is 2.6.31.5; the CONFIG_FSL_DMA and
CONFIG_ASYNC_TX_DMA options are both enabled.
#mdadm -C /dev/md0 --assume-clean -l5 -n3 /dev/sd{a,b,c}
#dd if=/dev/zero of=/dev/md0 bs=1M count=1000
Oops: Exception in kernel mode, sig: 5 [#1]
MPC85xx CDS
Modules linked in:
NIP: c01c45d8 LR: c01c4d48 CTR: 00000000
REGS: c2dd5c80 TRAP: 0700   Not tainted  (2.6.31.5)
MSR: 00029000 <EE,ME,CE>  CR: 22004028  XER: 00000000
TASK = e820a580[3804] 'md0_raid5' THREAD: c2dd4000
GPR00: 00000001 c2dd5d30 e820a580 c2fb1088 00000001 00000000 00000002 00001000
GPR08: 00000001 c0485a20 00000000 ef8092f8 22002024 55555555 c2d67870 c0282d2c
GPR16: 00001000 e8355c00 c2eff964 00000000 00000000 00000019 01000040 c2dd5e00
GPR24: c2dd5dfc 00000001 c2dd5dc0 c099c420 00000000 c2d67838 00000002 c2dd5d58
NIP [c01c45d8] async_tx_quiesce+0x28/0x74
LR [c01c4d48] async_xor+0x208/0x350
Call Trace:
[c2dd5d30] [c02a80f8] fsl_dma_alloc_descriptor+0x24/0x70 (unreliable)
[c2dd5d40] [c01c4d48] async_xor+0x208/0x350
[c2dd5db0] [c02839ec] ops_run_postxor+0xfc/0x1c0
[c2dd5df0] [c0284700] handle_stripe5+0xb24/0x15c0
[c2dd5e70] [c02864c8] handle_stripe+0x34/0x12d4
[c2dd5f10] [c02879ac] raid5d+0x244/0x458
[c2dd5f70] [c02938d4] md_thread+0x5c/0x124
[c2dd5fc0] [c004cc9c] kthread+0x78/0x7c
[c2dd5ff0] [c000f50c] kernel_thread+0x4c/0x68
Instruction dump:
7c0803a6 4e800020 9421fff0 7c0802a6 93e1000c 7c7f1b78 90010014 80630000
2f830000 419e0034 80030004 5400fffe <0f000000> 480e19b1 2f830002 419e0030

I checked the kernel source code and found that this oops is triggered
by the following BUG_ON in crypto/async_tx/async_tx.c:
void async_tx_quiesce(struct dma_async_tx_descriptor **tx)
{
        if (*tx) {
                /* if ack is already set then we cannot be sure
                 * we are referring to the correct operation
                 */
                BUG_ON(async_tx_test_ack(*tx));
                /* OOPS occurred here */
                if (dma_wait_for_async_tx(*tx) == DMA_ERROR)
                        panic("DMA_ERROR waiting for transaction\n");
                async_tx_ack(*tx);
                *tx = NULL;
        }
}


-- 
The simplest is not all best but the best is surely the simplest!


* Re: OOPS on MPC8548 board when writing RAID5 array
  2009-11-10 11:44 OOPS on MPC8548 board when writing RAID5 array hank peng
@ 2009-11-13  1:36 ` Dan Williams
  2009-11-13  2:45   ` hank peng
  0 siblings, 1 reply; 3+ messages in thread
From: Dan Williams @ 2009-11-13  1:36 UTC (permalink / raw)
  To: hank peng; +Cc: linux-raid, linuxppc-dev, Suresh Vishnu

Hi Hank,

Thanks for testing.

On Tue, Nov 10, 2009 at 4:44 AM, hank peng <pengxihan@gmail.com> wrote:
> CPU is MPC8548, kernel version is 2.6.31.5; the CONFIG_FSL_DMA and
> CONFIG_ASYNC_TX_DMA options are both enabled.
> #mdadm -C /dev/md0 --assume-clean -l5 -n3 /dev/sd{a,b,c}
> #dd if=/dev/zero of=/dev/md0 bs=1M count=1000
> Oops: Exception in kernel mode, sig: 5 [#1]
> MPC85xx CDS
> Modules linked in:
> NIP: c01c45d8 LR: c01c4d48 CTR: 00000000
> REGS: c2dd5c80 TRAP: 0700   Not tainted  (2.6.31.5)
> MSR: 00029000 <EE,ME,CE>  CR: 22004028  XER: 00000000
> TASK = e820a580[3804] 'md0_raid5' THREAD: c2dd4000
> GPR00: 00000001 c2dd5d30 e820a580 c2fb1088 00000001 00000000 00000002 00001000
> GPR08: 00000001 c0485a20 00000000 ef8092f8 22002024 55555555 c2d67870 c0282d2c
> GPR16: 00001000 e8355c00 c2eff964 00000000 00000000 00000019 01000040 c2dd5e00
> GPR24: c2dd5dfc 00000001 c2dd5dc0 c099c420 00000000 c2d67838 00000002 c2dd5d58
> NIP [c01c45d8] async_tx_quiesce+0x28/0x74
[..]
> I checked the kernel source code, and find that this OOPS was caused
> by the following BUG_ON code:
> It is in crypto/async_tx/async_tx.c:
> void async_tx_quiesce(struct dma_async_tx_descriptor **tx)
> {
>         if (*tx) {
>                 /* if ack is already set then we cannot be sure
>                  * we are referring to the correct operation
>                  */
>                 BUG_ON(async_tx_test_ack(*tx));
>                 /* OOPS occurred here */

Yes, this looks like a manifestation of the issue I brought up in my
review of the driver [1].  The talitos_prep_dma_xor routine is always
acknowledging its descriptors, which it should not because that is the
responsibility of the client of the api.  When the raid code tries to
attach a memcpy that depends on the xor it sees that it needs to
switch from talitos to fsldma (or software if fsldma is turned
off).  Since talitos does not have the DMA_INTERRUPT capability to
trigger the channel switch we need to perform a synchronous wait for
the xor to complete before submitting the memcpy.  When the ack bit is
already set the xor descriptor might be recycled by the dma device
driver while we are waiting for it, hence the BUG_ON().

--
Dan

See the final comment:
[1]: http://marc.info/?l=linux-raid&m=125685641412112&w=2


* Re: OOPS on MPC8548 board when writing RAID5 array
  2009-11-13  1:36 ` Dan Williams
@ 2009-11-13  2:45   ` hank peng
  0 siblings, 0 replies; 3+ messages in thread
From: hank peng @ 2009-11-13  2:45 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-raid, linuxppc-dev, Suresh Vishnu

2009/11/13 Dan Williams <dan.j.williams@intel.com>:
> Hi Hank,
>
> Thanks for testing.
>
> On Tue, Nov 10, 2009 at 4:44 AM, hank peng <pengxihan@gmail.com> wrote:
>> CPU is MPC8548, kernel version is 2.6.31.5; the CONFIG_FSL_DMA and
>> CONFIG_ASYNC_TX_DMA options are both enabled.
>> #mdadm -C /dev/md0 --assume-clean -l5 -n3 /dev/sd{a,b,c}
>> #dd if=/dev/zero of=/dev/md0 bs=1M count=1000
>> Oops: Exception in kernel mode, sig: 5 [#1]
>> MPC85xx CDS
>> Modules linked in:
>> NIP: c01c45d8 LR: c01c4d48 CTR: 00000000
>> REGS: c2dd5c80 TRAP: 0700   Not tainted  (2.6.31.5)
>> MSR: 00029000 <EE,ME,CE>  CR: 22004028  XER: 00000000
>> TASK = e820a580[3804] 'md0_raid5' THREAD: c2dd4000
>> GPR00: 00000001 c2dd5d30 e820a580 c2fb1088 00000001 00000000 00000002 00001000
>> GPR08: 00000001 c0485a20 00000000 ef8092f8 22002024 55555555 c2d67870 c0282d2c
>> GPR16: 00001000 e8355c00 c2eff964 00000000 00000000 00000019 01000040 c2dd5e00
>> GPR24: c2dd5dfc 00000001 c2dd5dc0 c099c420 00000000 c2d67838 00000002 c2dd5d58
>> NIP [c01c45d8] async_tx_quiesce+0x28/0x74
> [..]
>> I checked the kernel source code, and find that this OOPS was caused
>> by the following BUG_ON code:
>> It is in crypto/async_tx/async_tx.c:
>> void async_tx_quiesce(struct dma_async_tx_descriptor **tx)
>> {
>>         if (*tx) {
>>                 /* if ack is already set then we cannot be sure
>>                  * we are referring to the correct operation
>>                  */
>>                 BUG_ON(async_tx_test_ack(*tx));
>>                 /* OOPS occurred here */
>
> Yes, this looks like a manifestation of the issue I brought up in my
> review of the driver [1].  The talitos_prep_dma_xor routine is always
> acknowledging its descriptors, which it should not because that is the
> responsibility of the client of the api.  When the raid code tries to
> attach a memcpy that depends on the xor it sees that it needs to
> switch from talitos to fsldma (or software if fsldma is turned
> off).  Since talitos does not have the DMA_INTERRUPT capability to
> trigger the channel switch we need to perform a synchronous wait for
> the xor to complete before submitting the memcpy.  When the ack bit is
> already set the xor descriptor might be recycled by the dma device
> driver while we are waiting for it, hence the BUG_ON().
>
Thanks for the reply, Dan.
I forgot to mention that when this oops happened I had not applied the
talitos XOR patch. I had only enabled the async_xx API and FSL_DMA, so
I think the XOR was done by the CPU and the memcpy was done by DMA
through the async_xx API.
Another interesting thing is that I have tried the latest stable
kernel, 2.6.31.6, and the problem does not occur there. After I
applied the talitos XOR patch to it, it was still OK. I checked the
related source code and it seems that nothing changed, which leaves me
very confused.

I have been testing the XOR patch on a recent series of kernels on the
MPC8548 board, and I hope the Freescale guys can also help.

> --
> Dan
>
> See the final comment:
> [1]: http://marc.info/?l=linux-raid&m=125685641412112&w=2
>



-- 
The simplest is not all best but the best is surely the simplest!
