linux-raid.vger.kernel.org archive mirror
* Bizarre RAID "failure"
@ 2004-02-19 22:44 Tom Maddox
  2004-02-19 23:38 ` Måns Rullgård
                   ` (3 more replies)
  0 siblings, 4 replies; 6+ messages in thread
From: Tom Maddox @ 2004-02-19 22:44 UTC (permalink / raw)
  To: linux-raid

Hi, all,

I'm encountering a bizarre problem with software RAID 5 under Linux that
I'm hoping someone on this list can help me solve or at least
understand.

I've got a box running Red Hat 7.3 with SGI's 2.4.18 XFS 1.1 kernel. 
It's using three FastTrak TX 2000 (PDC20271) cards in non-RAID mode with
three Western Digital 200 GB drives.  I'm using those controllers
because they were handy and they support large drives.  The drives are
in an XFS-formatted RAID 5 array using md, which has never given me
problems before.  In this case, however, I'm running into some seriously
anomalous behavior.

If the system goes down unexpectedly (e.g., because of a power failure),
the RAID array comes back up dirty and begins to rebuild itself, which
is odd enough on its own.  What's worse is that, whenever this happens,
the rebuild hangs at about 2.4%.  When it reaches that point, the array
becomes totally nonresponsive--I can't even query its status with mdadm
or any other tool, although I can use "cat /proc/mdstat" to see the
status of the rebuild.  Any command that attempts to access the RAID
drive hangs.
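
For what it's worth, here's roughly how I'm poking at it (the md
device name is just an example from my setup):

    # shows resync progress; this keeps working even after the hang
    cat /proc/mdstat

    # query the array directly; commands like this one hang at ~2.4%
    mdadm --detail /dev/md0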

My assumption would normally be that there's a hardware failure
somewhere, but I've swapped out each component individually (including
cables!) and the same problem keeps happening.

Has anyone seen this behavior before, and can you recommend a solution?

Thanks,

Tom



* Re: Bizarre RAID "failure"
  2004-02-19 22:44 Bizarre RAID "failure" Tom Maddox
@ 2004-02-19 23:38 ` Måns Rullgård
  2004-02-20  0:39 ` Kanoa Withington
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 6+ messages in thread
From: Måns Rullgård @ 2004-02-19 23:38 UTC (permalink / raw)
  To: linux-raid

Tom Maddox <tmaddox@thereinc.com> writes:

> If the system goes down unexpectedly (e.g., because of a power failure),
> the RAID array comes back up dirty and begins to rebuild itself, which
> is odd enough on its own.

This is supposedly much better in 2.6 kernels.

-- 
Måns Rullgård
mru@kth.se



* Re: Bizarre RAID "failure"
  2004-02-19 22:44 Bizarre RAID "failure" Tom Maddox
  2004-02-19 23:38 ` Måns Rullgård
@ 2004-02-20  0:39 ` Kanoa Withington
  2004-02-20  0:44   ` Tom Maddox
  2004-02-20  8:33 ` Nathan Hunsperger
  2004-03-02 16:56 ` Corey McGuire
  3 siblings, 1 reply; 6+ messages in thread
From: Kanoa Withington @ 2004-02-20  0:39 UTC (permalink / raw)
  To: Tom Maddox; +Cc: linux-raid


It might not be related, but I've seen odd behaviour with that
kernel/XFS combination when xfs_repair or XFS log recovery runs while
the array is resyncing.

If you have the opportunity to try some tests, you might boot the
system without the volume mounted, to avoid the automatic XFS
checking.  When the resync is complete, try mounting the volume.
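
Something along these lines, where the device name and mount point
are just examples for illustration:

    # in /etc/fstab, mark the volume noauto so it isn't mounted at boot
    /dev/md0  /data  xfs  noauto  0 0

    # after booting, wait for the resync to finish
    watch cat /proc/mdstat

    # then mount it by hand
    mount /data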

-Kanoa

On Thu, 19 Feb 2004, Tom Maddox wrote:

> Hi, all,
>
> I'm encountering a bizarre problem with software RAID 5 under Linux that
> I'm hoping someone on this list can help me solve or at least
> understand.
>
> I've got a box running Red Hat 7.3 with SGI's 2.4.18 XFS 1.1 kernel.
> It's using three FastTrak TX 2000 (PDC20271) cards in non-RAID mode with
> three Western Digital 200 GB drives.  I'm using those controllers
> because they were handy and they support large drives.  The drives are
> in an XFS-formatted RAID 5 array using md, which has never given me
> problems before.  In this case, however, I'm running into some seriously
> anomalous behavior.
>
> If the system goes down unexpectedly (e.g., because of a power failure),
> the RAID array comes back up dirty and begins to rebuild itself, which
> is odd enough on its own.  What's worse is that, whenever this happens,
> the rebuild hangs at about 2.4%.  When it reaches that point, the array
> becomes totally nonresponsive--I can't even query its status with mdadm
> or any other tool, although I can use "cat /proc/mdstat" to see the
> status of the rebuild.  Any command that attempts to access the RAID
> drive hangs.
>
> My assumption would normally be that there's a hardware failure
> somewhere, but I've swapped out each component individually (including
> cables!) and the same problem keeps happening.
>
> Has anyone seen this behavior before, and can you recommend a solution?
>
> Thanks,
>
> Tom
>


* Re: Bizarre RAID "failure"
  2004-02-20  0:39 ` Kanoa Withington
@ 2004-02-20  0:44   ` Tom Maddox
  0 siblings, 0 replies; 6+ messages in thread
From: Tom Maddox @ 2004-02-20  0:44 UTC (permalink / raw)
  To: Kanoa Withington; +Cc: linux-raid

On Thu, 2004-02-19 at 16:39, Kanoa Withington wrote:
> It might not be related, but I've seen odd behaviour with that
> kernel/XFS combination when xfs_repair or XFS log recovery runs while
> the array is resyncing.
> 
> If you have the opportunity to try some tests, you might boot the
> system without the volume mounted, to avoid the automatic XFS
> checking.  When the resync is complete, try mounting the volume.
> 
> -Kanoa

Yep, tried that.  I got identical results, unfortunately.

Thanks,

Tom



* Re: Bizarre RAID "failure"
  2004-02-19 22:44 Bizarre RAID "failure" Tom Maddox
  2004-02-19 23:38 ` Måns Rullgård
  2004-02-20  0:39 ` Kanoa Withington
@ 2004-02-20  8:33 ` Nathan Hunsperger
  2004-03-02 16:56 ` Corey McGuire
  3 siblings, 0 replies; 6+ messages in thread
From: Nathan Hunsperger @ 2004-02-20  8:33 UTC (permalink / raw)
  To: Tom Maddox; +Cc: linux-raid

On Thu, Feb 19, 2004 at 02:44:52PM -0800, Tom Maddox wrote:
<SNIP>
> If the system goes down unexpectedly (e.g., because of a power failure),
> the RAID array comes back up dirty and begins to rebuild itself, which
> is odd enough on its own.  What's worse is that, whenever this happens,
> the rebuild hangs at about 2.4%.  When it reaches that point, the array
> becomes totally nonresponsive--I can't even query its status with mdadm
> or any other tool, although I can use "cat /proc/mdstat" to see the
> status of the rebuild.  Any command that attempts to access the RAID
> drive hangs.
<SNIP>
> Has anyone seen this behavior before, and can you recommend a solution?

Tom,

I have had problems very similar to this before.  I was running 14 fibre
channel disks on a QLA2100 HBA w/ various 2.4 kernels.  What I found
was that after a while of heavy IO, all access to the disks stopped,
and the rebuild would hang.  Additionally, any command that required
access to any filesystem data that wasn't cached (on any filesystem)
would hang.  By switching between the three or so available QLA
drivers, I could change how long the system would run between a reboot
and the stall.  I knew the hardware was fine, as it worked flawlessly
under Solaris.  In the end, I had to upgrade the HBA to a QLA2200, at
which point the problems stopped.  Because the hardware works under
other OSs, I have to believe that my problem was an incompatibility
between the QLA2100 and the drivers (even though they claimed to
support it).

I hope that at least gives you some possible insight.

- Nathan

> 
> Thanks,
> 
> Tom
> 


* Re: Bizarre RAID "failure"
  2004-02-19 22:44 Bizarre RAID "failure" Tom Maddox
                   ` (2 preceding siblings ...)
  2004-02-20  8:33 ` Nathan Hunsperger
@ 2004-03-02 16:56 ` Corey McGuire
  3 siblings, 0 replies; 6+ messages in thread
From: Corey McGuire @ 2004-03-02 16:56 UTC (permalink / raw)
  To: Tom Maddox; +Cc: linux-raid


I don't know if this will help, but I was having lots of trouble with
my Promise controllers until 2.4.23; before that, they were locking up
drives all the time.  I had one controller drop out entirely, taking
two drives with it, and had to rebuild a RAID 5 with two dead drives.

Once I upgraded, I stopped having such trouble.

One more thing I might add: I too had three Promise controllers, and
that was hard to manage as well.  I moved two drives to my onboard
controller, which not only made things a bit less flaky (I am assuming
33% less flaky) but also seemed to speed things up.  Later VIA chipsets
take the HDD controller off the PCI bus, and three PCI HDD controllers
can easily saturate that bus.
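
Rough numbers, assuming a standard 32-bit/33 MHz PCI bus:

    PCI bus:                ~133 MB/s theoretical, shared by all slots
    one ATA/133 controller: up to 133 MB/s burst per channel
    three controllers:      can ask for several times what the bus can
                            actually deliver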

On Thursday 19 February 2004 02:44 pm, Tom Maddox wrote:
> Hi, all,
> 
> I'm encountering a bizarre problem with software RAID 5 under Linux that
> I'm hoping someone on this list can help me solve or at least
> understand.
> 
> I've got a box running Red Hat 7.3 with SGI's 2.4.18 XFS 1.1 kernel. 
> It's using three FastTrak TX 2000 (PDC20271) cards in non-RAID mode with
> three Western Digital 200 GB drives.  I'm using those controllers
> because they were handy and they support large drives.  The drives are
> in an XFS-formatted RAID 5 array using md, which has never given me
> problems before.  In this case, however, I'm running into some seriously
> anomalous behavior.
> 
> If the system goes down unexpectedly (e.g., because of a power failure),
> the RAID array comes back up dirty and begins to rebuild itself, which
> is odd enough on its own.  What's worse is that, whenever this happens,
> the rebuild hangs at about 2.4%.  When it reaches that point, the array
> becomes totally nonresponsive--I can't even query its status with mdadm
> or any other tool, although I can use "cat /proc/mdstat" to see the
> status of the rebuild.  Any command that attempts to access the RAID
> drive hangs.
> 
> My assumption would normally be that there's a hardware failure
> somewhere, but I've swapped out each component individually (including
> cables!) and the same problem keeps happening.
> 
> Has anyone seen this behavior before, and can you recommend a solution?
> 
> Thanks,
> 
> Tom
> 

