* RAID1 disk failure causes hung mount
@ 2008-09-01 18:32 Philip Molter
2008-09-01 20:20 ` David Lethe
From: Philip Molter @ 2008-09-01 18:32 UTC (permalink / raw)
To: Linux RAID Mailing List
Hello,
We're running a modified version of the FC4 2.6.17 kernel (2.6.17.4). I
realize this is an old kernel. For internal reasons, we cannot update
to a newer version of the kernel at this time.
We have a 3ware 9550SXU card with 12 drives in JBOD mode. These 12
drives are mirrored in 6 RAID1 pairs, then striped together in one big
RAID0 stripe. When we have a disk error with one of the drives in a
RAID1 pair, the entire RAID0 mount locks up. We can still cd to the
mount and read from it, but if we try to write anything to the mount,
the process hangs in an unkillable state.
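For reference, the arrays were put together with mdadm roughly like this
(device names and md numbers here are illustrative, not our exact
configuration):
  # six two-disk mirrors, one per pair of JBOD drives
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
  # ...md1 through md4 are created the same way from the remaining pairs...
  mdadm --create /dev/md5 --level=1 --raid-devices=2 /dev/sdk1 /dev/sdl1
  # one big stripe across the six mirrors
  mdadm --create /dev/md6 --level=0 --raid-devices=6 \
        /dev/md0 /dev/md1 /dev/md2 /dev/md3 /dev/md4 /dev/md5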
This recently happened. Here are the log messages from the disk failure:
sd 0:0:9:0: SCSI error: return code = 0x8000004
sdj: Current: sense key: Medium Error
Additional sense: Unrecovered read error
end_request: I/O error, dev sdj, sector 10964975
raid1: sdj1: rescheduling sector 10964912
3w-9xxx: scsi0: ERROR: (0x03:0x0202): Drive ECC error:port=9.
sd 0:0:9:0: SCSI error: return code = 0x8000004
sdj: Current: sense key: Medium Error
Additional sense: Unrecovered read error
end_request: I/O error, dev sdj, sector 10964975
raid1: sdd1: redirecting sector 10964912 to another mirror
When this happened, /dev/sdj1 did not fail out of its RAID. It also did
not lock the system. Later:
sd 0:0:9:0: SCSI error: return code = 0x8000004
sdj: Current: sense key: Medium Error
Additional sense: Unrecovered read error
end_request: I/O error, dev sdj, sector 16744439
raid1: sdj1: rescheduling sector 16744376
When this happened, /dev/sdj1 again did not fail out of its RAID, but
writes to the big RAID0 stripe did lock up. I manually failed /dev/sdj1
out of the RAID, and /proc/mdstat did report it as failed at that point,
but writes still did not resume. I then tried to remove /dev/sdj1 from
the RAID, and mdadm reported that the device was busy. A hard
power-cycle was required to restore functionality to the system. We see
the same behavior every time these kinds of errors occur.
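The commands I used were essentially these (md device number
illustrative; the fail was accepted but the remove was not):
  mdadm /dev/md4 --fail /dev/sdj1     # accepted; /proc/mdstat marks sdj1 (F)
  mdadm /dev/md4 --remove /dev/sdj1   # fails here -- mdadm reports the device is busy
  cat /proc/mdstat                    # sdj1 still listed, writes still hung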
I have looked through the patches related to RAID1 and lockups. The
patches from January/March 2008 related to RAID1 deadlocks have not
seemed to help (I didn't really expect them to as 2.6.17.4 predates
bitmap code, no?).
I'd like to be able to get more debug during cases like this, but I'm
not sure what to gather or how to gather it. If anyone has any
suggestions, I'd appreciate hearing them. If any of the md devs have
any suggestions for patches to look at that specifically address this
behavior, I'd be very grateful for that advice. As it is, I've been
combing over git commits with very little success.
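About the only thing I know to capture next time it hangs is something
along these lines (assuming magic SysRq is enabled; suggestions for
anything more targeted are very welcome):
  echo 1 > /proc/sys/kernel/sysrq      # make sure SysRq is enabled
  echo t > /proc/sysrq-trigger         # dump every task's state/stack to the kernel log
  dmesg > /tmp/hung-tasks.txt          # save the traces; look for md/raid1 threads stuck in D state
  cat /proc/mdstat > /tmp/mdstat.txt   # array state at the time of the hang
  ps axo pid,stat,wchan:30,cmd | awk '$2 ~ /^D/'   # which processes are blocked, and where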
Thanks in advance for any assistance, knowledge, or suggestions anyone has.
Philip
* RE: RAID1 disk failure causes hung mount
2008-09-01 18:32 RAID1 disk failure causes hung mount Philip Molter
@ 2008-09-01 20:20 ` David Lethe
2008-09-01 20:44 ` Philip Molter
From: David Lethe @ 2008-09-01 20:20 UTC (permalink / raw)
To: Philip Molter, Linux RAID Mailing List
> -----Original Message-----
> From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
> owner@vger.kernel.org] On Behalf Of Philip Molter
> Sent: Monday, September 01, 2008 1:32 PM
> To: Linux RAID Mailing List
> Subject: RAID1 disk failure causes hung mount
>
> Hello,
>
> We're running a modified version of the FC4 2.6.17 kernel (2.6.17.4).
> I
> realize this is an old kernel. For internal reasons, we cannot update
> to a newer version of the kernel at this time.
>
> We have a 3ware 9550SXU card with 12 drives in JBOD mode. These 12
> drives are mirrored in 6 RAID1 pairs, then striped together in one big
> RAID0 stripe. When we have a disk error with one of the drives in a
> RAID1 pair, the entire RAID0 mount locks up. We can still cd to the
> mount and read from it, but if we try to write anything to the mount,
> the process hangs in an unkillable state.
>
> This recently happened. Here are the log messages from the disk
> failure:
>
> sd 0:0:9:0: SCSI error: return code = 0x8000004
> sdj: Current: sense key: Medium Error
> Additional sense: Unrecovered read error
> end_request: I/O error, dev sdj, sector 10964975
> raid1: sdj1: rescheduling sector 10964912
> 3w-9xxx: scsi0: ERROR: (0x03:0x0202): Drive ECC error:port=9.
> sd 0:0:9:0: SCSI error: return code = 0x8000004
> sdj: Current: sense key: Medium Error
> Additional sense: Unrecovered read error
> end_request: I/O error, dev sdj, sector 10964975
> raid1: sdd1: redirecting sector 10964912 to another mirror
>
> When this happened, /dev/sdj1 did not fail out of its RAID. It also
> did
> not lock the system. Later:
>
> sd 0:0:9:0: SCSI error: return code = 0x8000004
> sdj: Current: sense key: Medium Error
> Additional sense: Unrecovered read error
> end_request: I/O error, dev sdj, sector 16744439
> raid1: sdj1: rescheduling sector 16744376
>
> When this happened, /dev/sdj1 again did not fail out of its RAID, but
> writes to the big RAID0 stripe did lock up. I manually failed
> /dev/sdj1 out of the RAID, and /proc/mdstat did report it as failed at
> that point, but writes still did not resume. I then tried to remove
> /dev/sdj1 from the RAID, and mdadm reported that the device was busy.
> A hard power-cycle was required to restore functionality to the
> system. We see the same behavior every time these kinds of errors
> occur.
>
> I have looked through the patches related to RAID1 and lockups. The
> patches from January/March 2008 related to RAID1 deadlocks have not
> seemed to help (I didn't really expect them to as 2.6.17.4 predates
> bitmap code, no?).
>
> I'd like to be able to get more debug during cases like this, but I'm
> not sure what to gather or how to gather it. If anyone has any
> suggestions, I'd appreciate hearing them. If any of the md devs have
> any suggestions for patches to look at that specifically address this
> behavior, I'd be very grateful for that advice. As it is, I've been
> combing over git commits with very little success.
>
> Thanks in advance for any assistance, knowledge, or suggestions anyone
> has.
>
> Philip
Philip:
The problem really isn't with the Linux kernel; your 3ware is the issue. When the RAID1 "failed", Linux did what it was supposed to do: /dev/sdj1 is the 3ware-defined RAID1, and it generated a media error because it could not reconcile bad data on the RAID1 set. My guess is that you had a drive failure in combination with an unrecoverable read error on a physical block on the surviving disk in the pair.
I write 3ware-specific diagnostics and have code to drill down into the card, pull debug info, and most likely repair the damage, but that is well beyond the scope of a simple how-to; the short version is to boot into the 3ware BIOS and run data consistency checks on the broken RAID1. If you don't care about recovering the data on the RAID0 slice, do a consistency check/repair and you'll have only minor loss. If you want all of the data back, you'll probably have to pay somebody for their time. Since this is a hardware RAID component failure, it isn't really applicable to a software RAID forum.
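If you'd rather not take the box down to the 3ware BIOS, the tw_cli
utility can drive the same checks from inside the OS; roughly
(controller/unit numbers are examples, and the syntax shifts a bit
between tw_cli releases):
  tw_cli /c0 show              # list units and drive status on controller 0
  tw_cli /c0/u0 show all       # detailed status of the suspect unit
  tw_cli /c0/u0 start verify   # kick off a verify (consistency check) on that unit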
David
* Re: RAID1 disk failure causes hung mount
2008-09-01 20:20 ` David Lethe
@ 2008-09-01 20:44 ` Philip Molter
From: Philip Molter @ 2008-09-01 20:44 UTC (permalink / raw)
To: David Lethe; +Cc: Linux RAID Mailing List
David Lethe wrote:
>> -----Original Message-----
>> From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
>> owner@vger.kernel.org] On Behalf Of Philip Molter
>> Sent: Monday, September 01, 2008 1:32 PM
>> To: Linux RAID Mailing List
>> Subject: RAID1 disk failure causes hung mount
>>
>> Hello,
>>
>> We're running a modified version of the FC4 2.6.17 kernel (2.6.17.4).
>> I
>> realize this is an old kernel. For internal reasons, we cannot update
>> to a newer version of the kernel at this time.
>>
>> We have a 3ware 9550SXU card with 12 drives in JBOD mode. These 12
>> drives are mirrored in 6 RAID1 pairs, then striped together in one big
>> RAID0 stripe. When we have a disk error with one of the drives in a
>> RAID1 pair, the entire RAID0 mount locks up. We can still cd to the
>> mount and read from it, but if we try to write anything to the mount,
>> the process hangs in an unkillable state.
>>
>> ...
>>
> Philip:
> The problem really isn't with the Linux kernel; your 3ware is the
> issue. When the RAID1 "failed", Linux did what it was supposed to do:
> /dev/sdj1 is the 3ware-defined RAID1, and it generated a media error
> because it could not reconcile bad data on the RAID1 set. My guess is
> that you had a drive failure in combination with an unrecoverable read
> error on a physical block on the surviving disk in the pair.
>
> I write 3ware-specific diagnostics and have code to drill down into
> the card, pull debug info, and most likely repair the damage, but that
> is well beyond the scope of a simple how-to; the short version is to
> boot into the 3ware BIOS and run data consistency checks on the broken
> RAID1. If you don't care about recovering the data on the RAID0 slice,
> do a consistency check/repair and you'll have only minor loss. If you
> want all of the data back, you'll probably have to pay somebody for
> their time. Since this is a hardware RAID component failure, it isn't
> really applicable to a software RAID forum.
Hi David,
I'm sorry if I wasn't clear enough about this. I have no hardware RAID.
My 12-drive 3ware controller has all of its drives configured in JBOD
mode and my RAIDs (6 RAID1s striped into a single RAID0, NOT a 12-drive
RAID10) are all defined and assembled using Linux's md software RAID.
The OS sees 12 drives, and from those 12 drives, configures 6 software
RAID1s and one software RAID0 using mdadm/raid auto-detect.
As for the behavior, only one disk is having an error. The 3ware
reports errors from only one drive (confirmed via SMART data gathered
off that drive), and the second drive in the RAID1 pair reports no
errors, either via its SMART data or through 3ware diagnostics. The
drive itself is easy to replace; it's the hang on writes to the RAID1
that is causing problems, because it takes a reboot to get the drive
into a state where md will let me replace it. I can swap the physical
drive, but I can never remove it from the array with mdadm until I
reboot, and I can't sync any data before the reboot, which effectively
means I have to recover my filesystem and its data every time I lose a
disk.
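For clarity, this is the sequence I expect to be able to run when a
mirror member dies (md and device names illustrative); in practice it
never gets past the remove step until after a reboot:
  mdadm /dev/md4 --fail /dev/sdj1     # mark the bad disk failed
  mdadm /dev/md4 --remove /dev/sdj1   # detach it -- this is where I get "device busy"
  # ...physically swap the disk and repartition it...
  mdadm /dev/md4 --add /dev/sdj1      # re-add the new disk and let the mirror resync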
I appreciate the offer of help. If you have any other ideas of what may
be wrong, I am very eager to hear them.
Philip