* RAID5 lockup with AMCC440 and async-tx
From: Dale Dunlea @ 2007-10-01 9:16 UTC
To: linux-raid
Hi,
I have a board with an AMCC440 processor, running RAID5 using the
async-tx interface. In general, it works well, but I have found a test
case that consistently causes a hard lockup of the entire system.
What makes this case odd is that I have only been able to generate it
when accessing disks that are on two separate HBAs - in my case
mpt-fusion based SAS HBAs. Once two HBAs are in use, the bug is
trivial to repeat. I simply create a RAID5 using disks from each HBA,
wait for it to resync, and then run
"dd if=/dev/zero of=/dev/md0 bs=512 count=100000".
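The reproduction steps above can be sketched as a small script. This is only an illustration: the member disk names and their split across the two HBAs are hypothetical placeholders (the report does not list them), and it assumes an mdadm version that supports --wait. It defaults to a dry run that only prints the commands, since running them for real destroys data on the listed disks.

```shell
#!/bin/sh
# Sketch of the reproduction steps from the report.
# DISKS is a hypothetical placeholder list; pick disks from both HBAs.
# DRY_RUN=1 (the default) prints the commands instead of running them.

MD_DEV="${MD_DEV:-/dev/md0}"
DISKS="${DISKS:-/dev/sda /dev/sdb /dev/sdc}"
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

repro() {
    # Split DISKS into positional parameters so the count matches the list.
    set -- $DISKS
    ndisks=$#

    # 1. Create the RAID5 array across disks from both HBAs.
    run mdadm --create "$MD_DEV" --level=5 --raid-devices="$ndisks" "$@"

    # 2. Block until the initial resync has finished.
    run mdadm --wait "$MD_DEV"

    # 3. Issue the small sequential writes that trigger the hang.
    run dd if=/dev/zero of="$MD_DEV" bs=512 count=100000
}

repro
```

Setting DRY_RUN=0 runs the commands for real; keeping the steps in one script makes it easier to repeat the same test against different kernels.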
By disabling CONFIG_DMA_ENGINE in my kernel config, the hang goes
away, but then so does my performance.
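When comparing kernels with and without the DMA engine, it helps to confirm which configuration a given build actually has. A minimal sketch, assuming the kernel .config file is available; the default path and the KCONFIG variable are hypothetical:

```shell
# Report the state of a kernel option in a config file.
# Usage: config_state OPTION FILE
config_state() {
    if grep -q "^$1=[ym]\$" "$2" 2>/dev/null; then
        echo "enabled"
    elif grep -q "^# $1 is not set\$" "$2" 2>/dev/null; then
        echo "disabled"
    else
        echo "unknown"
    fi
}

# Hypothetical default: the .config of the build tree in the current directory.
config_state CONFIG_DMA_ENGINE "${KCONFIG:-.config}"
```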
Any pointers on how to debug this? It feels like a race condition of
some description, but any serial port printing I enable causes the
problem to go away, and I can't print silently to /var/log/messages as
the system hangs before it can flush.
Regards,
Dale
* Re: RAID5 lockup with AMCC440 and async-tx
From: Justin Piszcz @ 2007-10-01 10:13 UTC
To: Dale Dunlea; +Cc: linux-raid, linux-ide-arrays
On Mon, 1 Oct 2007, Dale Dunlea wrote:
> Hi,
>
> I have a board with an AMCC440 processor, running RAID5 using the
> async-tx interface. In general, it works well, but I have found a test
> case that consistently causes a hard lockup of the entire system.
>
> What makes this case odd is that I have only been able to generate it
> when accessing disks that are on two separate HBAs - in my case
> mpt-fusion based SAS HBAs. Once two HBAs are in use, the bug is
> trivial to repeat. I simply create a RAID5 using disks from each HBA,
> wait for it to resync, and then run
>
> "dd if=/dev/zero of=/dev/md0 bs=512 count=100000".
>
> By disabling CONFIG_DMA_ENGINE in my kernel config, the hang goes
> away, but then so does my performance.
>
> Any pointers on how to debug this? It feels like a race condition of
> some description, but any serial port printing I enable causes the
> problem to go away, and I can't print silently to /var/log/messages as
> the system hangs before it can flush.
>
> Regards,
> Dale
I have added linux-ide-arrays to the CC list; it is probably better
suited to this kind of question.
Justin.
* Re: RAID5 lockup with AMCC440 and async-tx
From: Wolfgang Denk @ 2007-10-01 10:32 UTC
To: Dale Dunlea; +Cc: linux-raid
Dear Dale,
in message <8a24fb800710010216m21cd7734p4c19df1aa7dd5564@mail.gmail.com> you wrote:
>
> I have a board with an AMCC440 processor, running RAID5 using the
> async-tx interface. In general, it works well, but I have found a test
> case that consistently causes a hard lockup of the entire system.
Please make sure to use latest code - we found a bug recently.
> What makes this case odd is that I have only been able to generate it
> when accessing disks that are on two separate HBAs - in my case
> mpt-fusion based SAS HBAs. Once two HBAs are in use, the bug is
> trivial to repeat. I simply create a RAID5 using disks from each HBA,
> wait for it to resync, and then run
We saw similar problems, in our case they showed up only with a large
number of disks in combination with big kernel page sizes (64 kB).
> Any pointers on how to debug this? It feels like a race condition of
> some description, but any serial port printing I enable causes the
> problem to go away, and I can't print silently to /var/log/messages as
> the system hangs before it can flush.
See above - please try current code.
Best regards,
Wolfgang Denk
--
DENX Software Engineering GmbH, MD: Wolfgang Denk & Detlev Zundel
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd@denx.de
HR Manager to job candidate "I see you've had no computer training.
Although that qualifies you for upper management, it means you're
under-qualified for our entry level positions."
* Re: RAID5 lockup with AMCC440 and async-tx
From: Dale Dunlea @ 2007-10-01 11:02 UTC
To: Wolfgang Denk; +Cc: linux-raid
On 01/10/2007, Wolfgang Denk <wd@denx.de> wrote:
> Dear Dale,
>
> in message <8a24fb800710010216m21cd7734p4c19df1aa7dd5564@mail.gmail.com> you wrote:
> >
> > I have a board with an AMCC440 processor, running RAID5 using the
> > async-tx interface. In general, it works well, but I have found a test
> > case that consistently causes a hard lockup of the entire system.
>
> Please make sure to use latest code - we found a bug recently.
Latest code from Dan or latest code from denx.de? I grabbed the latest
code from Dan, but I'm having trouble cloning denx.de:
"remote: error: object directory /home/git/linux-2.6/.git/objects does
not exist; check .git/objects/info/alternates."
>
> > What makes this case odd is that I have only been able to generate it
> > when accessing disks that are on two separate HBAs - in my case
> > mpt-fusion based SAS HBAs. Once two HBAs are in use, the bug is
> > trivial to repeat. I simply create a RAID5 using disks from each HBA,
> > wait for it to resync, and then run
>
> We saw similar problems, in our case they showed up only with a large
> number of disks in combination with big kernel page sizes (64 kB).
>
The problem occurs for me with both 4k and 64k pages.
Regards,
Dale
* Re: RAID5 lockup with AMCC440 and async-tx
From: Wolfgang Denk @ 2007-10-01 17:39 UTC
To: Dale Dunlea; +Cc: linux-raid
Dear Dale,
in message <8a24fb800710010402u5aa0187bq4f850b8cb71483c9@mail.gmail.com> you wrote:
>
> Latest code from Dan or latest code from denx.de? I grabbed the latest
From linux-2.6-denx
> code from Dan, but I'm having trouble cloning denx.de:
>
> "remote: error: object directory /home/git/linux-2.6/.git/objects does
> not exist; check .git/objects/info/alternates."
Argh.. Stupid me.
Please try again - this one is fixed now.
> > We saw similar problems, in our case they showed up only with a large
> > number of disks in combination with big kernel page sizes (64 kB).
> >
> The problem occurs for me with both 4k and 64k pages.
Using more than one controller probably increases the likelihood of
being hit by this race condition.
Best regards,
Wolfgang Denk
* Re: RAID5 lockup with AMCC440 and async-tx
From: Dale Dunlea @ 2007-10-01 19:25 UTC
To: Wolfgang Denk; +Cc: linux-raid
On 01/10/2007, Wolfgang Denk <wd@denx.de> wrote:
> > Latest code from Dan or latest code from denx.de? I grabbed the latest
>
> From linux-2.6-denx
I grabbed the latest from denx.de, but unfortunately, to no avail. The
dd test still hangs pretty much immediately.
Thanks nonetheless.
Regards,
Dale