RAID5 lockup with AMCC440 and async-tx

* RAID5 lockup with AMCC440 and async-tx
@ 2007-10-01  9:16 Dale Dunlea
  2007-10-01 10:13 ` Justin Piszcz
  2007-10-01 10:32 ` Wolfgang Denk
  0 siblings, 2 replies; 6+ messages in thread
From: Dale Dunlea @ 2007-10-01  9:16 UTC (permalink / raw)
  To: linux-raid

Hi,

I have a board with an AMCC440 processor, running RAID5 using the
async-tx interface. In general, it works well, but I have found a test
case that consistently causes a hard lockup of the entire system.

What makes this case odd is that I have only been able to generate it
when accessing disks that are on two separate HBAs - in my case
mpt-fusion based SAS HBAs. Once two HBAs are in use, the bug is
trivial to repeat. I simply create a RAID5 using disks from each HBA,
wait for it to resync, and then run

"dd if=/dev/zero of=/dev/md0 bs=512 count=100000".
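
The reproduction steps above can be sketched as a small script. This is only an illustration: the member disk names below are placeholders (the report does not name the devices), and the use of `mdadm --wait` to block until the resync finishes is an assumption about how one might automate the "wait for it to resync" step. The script defaults to a dry run that prints the commands, because the dd step overwrites the array.

```shell
#!/bin/sh
# Sketch of the reproduction steps. DRY_RUN=1 (the default) only prints
# the commands; set DRY_RUN=0 to actually run them (destroys data on the disks).
DRY_RUN="${DRY_RUN:-1}"
MD_DEV="${MD_DEV:-/dev/md0}"
# Hypothetical member disks; in the report they are spread over two HBAs.
DISKS="${DISKS:-/dev/sda /dev/sdb /dev/sdc}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

repro() {
    # Split the disk list into positional parameters so "$#" is the disk count.
    set -- $DISKS
    run mdadm --create "$MD_DEV" --level=5 --raid-devices="$#" "$@"
    # Block until the initial resync has finished.
    run mdadm --wait "$MD_DEV"
    # The write test from the report.
    run dd if=/dev/zero of="$MD_DEV" bs=512 count=100000
}

repro
```

Running it as `DISKS="/dev/sdX /dev/sdY /dev/sdZ" sh repro.sh` (with real device names substituted) shows the exact commands before anything is written.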

By disabling CONFIG_DMA_ENGINE in my kernel config, the hang goes
away, but then so does my performance.
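
When comparing kernels with and without the DMA engine, it can help to confirm which setting a given build actually used. A minimal sketch, assuming the build's `.config` file is at hand (the helper name `config_state` is made up for this example):

```shell
# config_state OPTION FILE
# Prints the value of OPTION (e.g. "y" or "m") from a kernel .config-style
# file, or "n" if the option is absent or marked "is not set".
config_state() {
    opt="$1"
    file="$2"
    val="$(sed -n "s/^${opt}=\(.*\)\$/\1/p" "$file" | head -n 1)"
    echo "${val:-n}"
}
```

For example, `config_state CONFIG_DMA_ENGINE .config` prints `y` for the kernel that hangs and `n` for the one built with the option disabled.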

Any pointers on how to debug this? It feels like a race condition of
some description, but any serial port printing I enable causes the
problem to go away, and I can't print silently to /var/log/messages as
the system hangs before it can flush.

Regards,
Dale

* Re: RAID5 lockup with AMCC440 and async-tx
  2007-10-01  9:16 RAID5 lockup with AMCC440 and async-tx Dale Dunlea
@ 2007-10-01 10:13 ` Justin Piszcz
  2007-10-01 10:32 ` Wolfgang Denk
  1 sibling, 0 replies; 6+ messages in thread
From: Justin Piszcz @ 2007-10-01 10:13 UTC (permalink / raw)
  To: Dale Dunlea; +Cc: linux-raid, linux-ide-arrays



On Mon, 1 Oct 2007, Dale Dunlea wrote:

> Hi,
>
> I have a board with an AMCC440 processor, running RAID5 using the
> async-tx interface. In general, it works well, but I have found a test
> case that consistently causes a hard lockup of the entire system.
>
> What makes this case odd is that I have only been able to generate it
> when accessing disks that are on two separate HBAs - in my case
> mpt-fusion based SAS HBAs. Once two HBAs are in use, the bug is
> trivial to repeat. I simply create a RAID5 using disks from each HBA,
> wait for it to resync, and then run
>
> "dd if=/dev/zero of=/dev/md0 bs=512 count=100000".
>
> By disabling CONFIG_DMA_ENGINE in my kernel config, the hang goes
> away, but then so does my performance.
>
> Any pointers on how to debug this? It feels like a race condition of
> some description, but any serial port printing I enable causes the
> problem to go away, and I can't print silently to /var/log/messages as
> the system hangs before it can flush.
>
> Regards,
> Dale

Added linux-ide-arrays to the CC list which is probably better suited 
towards this kind of question.

Justin.

* Re: RAID5 lockup with AMCC440 and async-tx
  2007-10-01  9:16 RAID5 lockup with AMCC440 and async-tx Dale Dunlea
  2007-10-01 10:13 ` Justin Piszcz
@ 2007-10-01 10:32 ` Wolfgang Denk
  2007-10-01 11:02   ` Dale Dunlea
  1 sibling, 1 reply; 6+ messages in thread
From: Wolfgang Denk @ 2007-10-01 10:32 UTC (permalink / raw)
  To: Dale Dunlea; +Cc: linux-raid

Dear Dale,

in message <8a24fb800710010216m21cd7734p4c19df1aa7dd5564@mail.gmail.com> you wrote:
> 
> I have a board with an AMCC440 processor, running RAID5 using the
> async-tx interface. In general, it works well, but I have found a test
> case that consistently causes a hard lockup of the entire system.

Please make sure to use latest code - we found a bug recently.

> What makes this case odd is that I have only been able to generate it
> when accessing disks that are on two separate HBAs - in my case
> mpt-fusion based SAS HBAs. Once two HBAs are in use, the bug is
> trivial to repeat. I simply create a RAID5 using disks from each HBA,
> wait for it to resync, and then run

We saw similar problems; in our case they showed up only with a large
number of disks in combination with big kernel page sizes (64 kB).

> Any pointers on how to debug this? It feels like a race condition of
> some description, but any serial port printing I enable causes the
> problem to go away, and I can't print silently to /var/log/messages as
> the system hangs before it can flush.

See above - please try current code.

Best regards,

Wolfgang Denk

-- 
DENX Software Engineering GmbH,     MD: Wolfgang Denk & Detlev Zundel
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd@denx.de
HR Manager to job candidate "I see you've had no  computer  training.
Although  that  qualifies  you  for upper management, it means you're
under-qualified for our entry level positions."

* Re: RAID5 lockup with AMCC440 and async-tx
  2007-10-01 10:32 ` Wolfgang Denk
@ 2007-10-01 11:02   ` Dale Dunlea
  2007-10-01 17:39     ` Wolfgang Denk
  0 siblings, 1 reply; 6+ messages in thread
From: Dale Dunlea @ 2007-10-01 11:02 UTC (permalink / raw)
  To: Wolfgang Denk; +Cc: linux-raid

On 01/10/2007, Wolfgang Denk <wd@denx.de> wrote:
> Dear Dale,
>
> in message <8a24fb800710010216m21cd7734p4c19df1aa7dd5564@mail.gmail.com> you wrote:
> >
> > I have a board with an AMCC440 processor, running RAID5 using the
> > async-tx interface. In general, it works well, but I have found a test
> > case that consistently causes a hard lockup of the entire system.
>
> Please make sure to use latest code - we found a bug recently.

Latest code from Dan or latest code from denx.de? I grabbed the latest
code from Dan, but I'm having trouble cloning denx.de:

"remote: error: object directory /home/git/linux-2.6/.git/objects does
not exist; check .git/objects/info/alternates."
>
> > What makes this case odd is that I have only been able to generate it
> > when accessing disks that are on two separate HBAs - in my case
> > mpt-fusion based SAS HBAs. Once two HBAs are in use, the bug is
> > trivial to repeat. I simply create a RAID5 using disks from each HBA,
> > wait for it to resync, and then run
>
> We saw similar problems; in our case they showed up only with a large
> number of disks in combination with big kernel page sizes (64 kB).
>
The problem occurs for me with both 4k and 64k pages.

Regards,
Dale

* Re: RAID5 lockup with AMCC440 and async-tx
  2007-10-01 11:02   ` Dale Dunlea
@ 2007-10-01 17:39     ` Wolfgang Denk
  2007-10-01 19:25       ` Dale Dunlea
  0 siblings, 1 reply; 6+ messages in thread
From: Wolfgang Denk @ 2007-10-01 17:39 UTC (permalink / raw)
  To: Dale Dunlea; +Cc: linux-raid

Dear Dale,

in message <8a24fb800710010402u5aa0187bq4f850b8cb71483c9@mail.gmail.com> you wrote:
>
> Latest code from Dan or latest code from denx.de? I grabbed the latest

From linux-2.6-denx

> code from Dan, but I'm having trouble cloning denx.de:
> 
> "remote: error: object directory /home/git/linux-2.6/.git/objects does
> not exist; check .git/objects/info/alternates."

Argh.. Stupid me.

Please try again - this one is fixed now.

> > We saw similar problems; in our case they showed up only with a large
> > number of disks in combination with big kernel page sizes (64 kB).
> >
> The problem occurs for me with both 4k and 64k pages.

Probably using more than one controller adds to the likelihood of
being hit by this race condition.

Best regards,

Wolfgang Denk

-- 
DENX Software Engineering GmbH,     MD: Wolfgang Denk & Detlev Zundel
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd@denx.de
Immortality consists largely of boredom.
	-- Zefrem Cochrane, "Metamorphosis", stardate 3219.8

* Re: RAID5 lockup with AMCC440 and async-tx
  2007-10-01 17:39     ` Wolfgang Denk
@ 2007-10-01 19:25       ` Dale Dunlea
  0 siblings, 0 replies; 6+ messages in thread
From: Dale Dunlea @ 2007-10-01 19:25 UTC (permalink / raw)
  To: Wolfgang Denk; +Cc: linux-raid

On 01/10/2007, Wolfgang Denk <wd@denx.de> wrote:
> > Latest code from Dan or latest code from denx.de? I grabbed the latest
>
> From linux-2.6-denx

I grabbed the latest from denx.de, but unfortunately, to no avail. The
dd test still hangs pretty much immediately.

Thanks nonetheless.

Regards,
Dale
