* raid failure question
@ 2010-01-11 18:00 Tim Bock
2010-01-11 18:08 ` Majed B.
` (2 more replies)
0 siblings, 3 replies; 9+ messages in thread
From: Tim Bock @ 2010-01-11 18:00 UTC (permalink / raw)
To: linux-raid
Hello,
Excluding the obvious multi-disk or bus failures, can anyone describe
what type of disk failure a raid cannot detect/recover from?
I have had two disk failures over the last three months, and in spite of
having a hot spare, manual intervention was required each time to make
the raid usable again. I'm just not sure if I'm not setting something
up right, or if there is some other issue.
Thanks for any comments or suggestions.
Tim
* Re: raid failure question
2010-01-11 18:00 raid failure question Tim Bock
@ 2010-01-11 18:08 ` Majed B.
2010-01-11 20:44 ` Thomas Fjellstrom
2010-01-11 20:53 ` Robin Hill
2010-01-12 4:47 ` Leslie Rhorer
2 siblings, 1 reply; 9+ messages in thread
From: Majed B. @ 2010-01-11 18:08 UTC (permalink / raw)
To: Tim Bock; +Cc: linux-raid
Voltage spikes, disks overheating (monitorable by smartd) & sector
corruption (also monitorable by smartd) are what I can think of for
the time being.
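For the heat and sector monitoring, a minimal smartd.conf sketch
along these lines would cover both (thresholds and mail address are
illustrative, not recommendations):

    # watch every disk; -a monitors all SMART attributes (including
    # reallocated/pending sectors); -W 4,45,55 warns on a 4C temperature
    # jump, logs at 45C and alerts at 55C; -m says where alert mail goes
    DEVICESCAN -a -W 4,45,55 -m root@localhost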
On Mon, Jan 11, 2010 at 9:00 PM, Tim Bock <jtbock@daylight.com> wrote:
> Hello,
>
> Excluding the obvious multi-disk or bus failures, can anyone describe
> what type of disk failure a raid cannot detect/recover from?
>
> I have had two disk failures over the last three months, and in spite of
> having a hot spare, manual intervention was required each time to make
> the raid usable again. I'm just not sure if I'm not setting something
> up right, or if there is some other issue.
>
> Thanks for any comments or suggestions.
>
> Tim
>
--
Majed B.
* Re: raid failure question
2010-01-11 18:08 ` Majed B.
@ 2010-01-11 20:44 ` Thomas Fjellstrom
0 siblings, 0 replies; 9+ messages in thread
From: Thomas Fjellstrom @ 2010-01-11 20:44 UTC (permalink / raw)
To: Majed B.; +Cc: Tim Bock, linux-raid
On Mon January 11 2010, Majed B. wrote:
> Voltage spikes, disks heating (monitored by smartd) & sector
> corruption (monitored by smartd) are what I can think of for the time
> being.
Would that actually make it so a hot spare wouldn't get used automatically?
I thought the point of a hot spare was to take over for a failed disk /no
matter what/. It doesn't really make much sense if many of the more
common error cases cause the hot spare to be ignored.
> On Mon, Jan 11, 2010 at 9:00 PM, Tim Bock <jtbock@daylight.com> wrote:
> > Hello,
> >
> > Excluding the obvious multi-disk or bus failures, can anyone describe
> > what type of disk failure a raid cannot detect/recover from?
> >
> > I have had two disk failures over the last three months, and in spite
> > of having a hot spare, manual intervention was required each time to
> > make the raid usable again. I'm just not sure if I'm not setting
> > something up right, or if there is some other issue.
> >
> > Thanks for any comments or suggestions.
> >
> > Tim
> >
--
Thomas Fjellstrom
tfjellstrom@shaw.ca
* Re: raid failure question
2010-01-11 18:00 raid failure question Tim Bock
2010-01-11 18:08 ` Majed B.
@ 2010-01-11 20:53 ` Robin Hill
2010-01-12 12:08 ` Roger Heflin
2010-02-01 20:19 ` Bill Davidsen
2010-01-12 4:47 ` Leslie Rhorer
2 siblings, 2 replies; 9+ messages in thread
From: Robin Hill @ 2010-01-11 20:53 UTC (permalink / raw)
To: linux-raid
On Mon Jan 11, 2010 at 11:00:40AM -0700, Tim Bock wrote:
> Hello,
>
> Excluding the obvious multi-disk or bus failures, can anyone describe
> what type of disk failure a raid cannot detect/recover from?
>
> I have had two disk failures over the last three months, and in spite of
> having a hot spare, manual intervention was required each time to make
> the raid usable again. I'm just not sure if I'm not setting something
> up right, or if there is some other issue.
>
> Thanks for any comments or suggestions.
>
Any failure where the disk doesn't actually return an error (within a
reasonable time). For example, consumer grade disks often have very
long retry times - this can mean the array is unusable for a long time
until the disk eventually fails the read.
If the disk actually returns an error then, AFAIK, the RAID array should
always be able to recover from it.
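Where the drive supports SCT Error Recovery Control you can cap those
retry times so the drive reports the failure promptly instead of
stalling the array. A sketch (device name illustrative; many consumer
drives refuse the command):

    # limit read/write error recovery, in 100 ms units (70 = 7 seconds)
    smartctl -l scterc,70,70 /dev/sdX
    # confirm what the drive currently reports
    smartctl -l scterc /dev/sdX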
Cheers,
Robin
--
___
( ' } | Robin Hill <robin@robinhill.me.uk> |
/ / ) | Little Jim says .... |
// !! | "He fallen in de water !!" |
* Re: raid failure question
2010-01-11 20:53 ` Robin Hill
@ 2010-01-12 12:08 ` Roger Heflin
2010-01-12 15:07 ` Tim Bock
2010-02-01 20:19 ` Bill Davidsen
1 sibling, 1 reply; 9+ messages in thread
From: Roger Heflin @ 2010-01-12 12:08 UTC (permalink / raw)
To: linux-raid
Robin Hill wrote:
> On Mon Jan 11, 2010 at 11:00:40AM -0700, Tim Bock wrote:
>
>> Hello,
>>
>> Excluding the obvious multi-disk or bus failures, can anyone describe
>> what type of disk failure a raid cannot detect/recover from?
>>
>> I have had two disk failures over the last three months, and in spite of
>> having a hot spare, manual intervention was required each time to make
>> the raid usable again. I'm just not sure if I'm not setting something
>> up right, or if there is some other issue.
>>
>> Thanks for any comments or suggestions.
>>
> Any failure where the disk doesn't actually return an error (within a
> reasonable time). For example, consumer grade disks often have very
> long retry times - this can mean the array is unusable for a long time
> until the disk eventually fails the read.
>
> If the disk actually returns an error then, AFAIK, the RAID array should
> always be able to recover from it.
>
> Cheers,
> Robin
The OS will time the disk out at about 30 seconds if it does not
answer, and then the disk gets treated as "BAD".
On Fibre Channel this is a fairly common type of failure: something
fails in the fabric such that the disk can no longer talk to the
machine.
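That 30 seconds is the Linux SCSI layer's default per-command timeout,
and it is exposed per device in sysfs, so you can check or lower it if
you would rather give up on a stalled disk sooner (device name
illustrative):

    # current command timeout, in seconds (default is 30)
    cat /sys/block/sdX/device/timeout
    # give up on a stalled drive after 10 seconds instead
    echo 10 > /sys/block/sdX/device/timeout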
* Re: raid failure question
2010-01-12 12:08 ` Roger Heflin
@ 2010-01-12 15:07 ` Tim Bock
0 siblings, 0 replies; 9+ messages in thread
From: Tim Bock @ 2010-01-12 15:07 UTC (permalink / raw)
To: Roger Heflin; +Cc: linux-raid
First, thanks for the replies.
The problem is that the drive is not marked as failed by the array, but
"something" happens to the drive which drives the load avg to 12+ and
makes the array (and server) largely unusable. It is as if the {OS,
array} is waiting for something to time out. The first time this
happened, there was a "Medium Error" in the log at 2 am (during an rsync
backup), and I didn't even know about the problem until 7 am. So it
should have had plenty of time to "time out" if it was going to, yes?
There was a logged error the second time as well, but I didn't save it
before the logs rotated out.
Similarly, upon reboot during this problem, something was happening with
the disk which prevented the system from coming up when the array was in
the fstab. When I took it out of the fstab, the system came up and I
was able to manually fail the disk; the array then automatically
rebuilt with the hot spare, birds started singing, and life went on.
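For reference, the manual fail step was roughly as follows (device
and array names illustrative):

    # mark the misbehaving member faulty, then remove it
    mdadm /dev/md0 --fail /dev/sdc1
    mdadm /dev/md0 --remove /dev/sdc1
    # the hot spare then kicks in; watch the rebuild with:
    cat /proc/mdstat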
I've replaced the offending disk, but as this has happened twice (with
two different disks in a 4+1 array), I'm just trying to figure out what
is going on...and more importantly, how I can fix it, if possible. As
Thomas implies, the joy of a hot spare is that a disk failure is
hopefully transparent to your users...
Thanks for your time,
Tim
On Tue, 2010-01-12 at 06:08 -0600, Roger Heflin wrote:
> Robin Hill wrote:
> > On Mon Jan 11, 2010 at 11:00:40AM -0700, Tim Bock wrote:
> >
> >> Hello,
> >>
> >> Excluding the obvious multi-disk or bus failures, can anyone describe
> >> what type of disk failure a raid cannot detect/recover from?
> >>
> >> I have had two disk failures over the last three months, and in spite of
> >> having a hot spare, manual intervention was required each time to make
> >> the raid usable again. I'm just not sure if I'm not setting something
> >> up right, or if there is some other issue.
> >>
> >> Thanks for any comments or suggestions.
> >>
> > Any failure where the disk doesn't actually return an error (within a
> > reasonable time). For example, consumer grade disks often have very
> > long retry times - this can mean the array is unusable for a long time
> > until the disk eventually fails the read.
> >
> > If the disk actually returns an error then, AFAIK, the RAID array should
> > always be able to recover from it.
> >
> > Cheers,
> > Robin
>
> The OS will time the disk out at about 30 seconds if it does not
> answer, and then the disk gets treated as "BAD".
>
> On fiber channel this is a fairly common type of failure, if something
> fails in the fabric such that the disk can no longer talk to the machine.
* Re: raid failure question
2010-01-11 20:53 ` Robin Hill
2010-01-12 12:08 ` Roger Heflin
@ 2010-02-01 20:19 ` Bill Davidsen
1 sibling, 0 replies; 9+ messages in thread
From: Bill Davidsen @ 2010-02-01 20:19 UTC (permalink / raw)
To: linux-raid
Robin Hill wrote:
> On Mon Jan 11, 2010 at 11:00:40AM -0700, Tim Bock wrote:
>
>
>> Hello,
>>
>> Excluding the obvious multi-disk or bus failures, can anyone describe
>> what type of disk failure a raid cannot detect/recover from?
>>
>> I have had two disk failures over the last three months, and in spite of
>> having a hot spare, manual intervention was required each time to make
>> the raid usable again. I'm just not sure if I'm not setting something
>> up right, or if there is some other issue.
>>
>> Thanks for any comments or suggestions.
>>
>>
> Any failure where the disk doesn't actually return an error (within a
> reasonable time). For example, consumer grade disks often have very
> long retry times - this can mean the array is unusable for a long time
> until the disk eventually fails the read.
>
> If the disk actually returns an error then, AFAIK, the RAID array should
> always be able to recover from it.
>
The problem is that the admin should be able to set a timeout after
which recovery takes place even if the drive hasn't returned a bad
status. And some form of counter could be kept such that after a number
of these the drive is failed. There is no solution yet: Neil says the
timeout should be in the driver, and the driver writers say that if it
hurts md, the timeout belongs in md. Everyone points the finger at
some other code and says "there."
This is not laziness or buck-passing; Neil feels that md is not the
place, but putting it elsewhere causes other problems. Until someone
says "perfect is the enemy of good enough" and puts a timer where it
will solve the problem, this behavior will continue.
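Until then, about the only stopgap lives in userspace. A sketch of
the idea only (device, array name and timings are illustrative), not
the in-kernel timer being argued about:

    # probe a member with a bounded direct read; if it stalls past 20
    # seconds, fail it out so the hot spare can take over
    if ! timeout 20 dd if=/dev/sdc of=/dev/null bs=1M count=4 iflag=direct 2>/dev/null
    then
        mdadm /dev/md0 --fail /dev/sdc
    fi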
--
Bill Davidsen <davidsen@tmr.com>
"We can't solve today's problems by using the same thinking we
used in creating them." - Einstein
* RE: raid failure question
2010-01-11 18:00 raid failure question Tim Bock
2010-01-11 18:08 ` Majed B.
2010-01-11 20:53 ` Robin Hill
@ 2010-01-12 4:47 ` Leslie Rhorer
2 siblings, 0 replies; 9+ messages in thread
From: Leslie Rhorer @ 2010-01-12 4:47 UTC (permalink / raw)
To: 'Tim Bock', linux-raid
> -----Original Message-----
> From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
> owner@vger.kernel.org] On Behalf Of Tim Bock
> Sent: Monday, January 11, 2010 12:01 PM
> To: linux-raid@vger.kernel.org
> Subject: raid failure question
>
> Hello,
>
> Excluding the obvious multi-disk or bus failures, can anyone describe
> what type of disk failure a raid cannot detect/recover from?
>
> I have had two disk failures over the last three months, and in spite of
> having a hot spare, manual intervention was required each time to make
> the raid usable again. I'm just not sure if I'm not setting something
> up right, or if there is some other issue.
I think you are trying to ask two different questions here. The
first concerns errors which make an array unrecoverable without intervention
by an admin. The second concerns not promoting a hot standby whenever a
drive is failed by the array, even though the array itself is automatically
recoverable.
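On the second question: within a single array the kernel should
promote a hot spare on its own once a member is actually marked
faulty. What does need setting up is mdadm --monitor, which mails
alerts and can move spares between arrays that share a spare-group.
A sketch of the relevant mdadm.conf lines (UUIDs, names and address
illustrative):

    MAILADDR admin@example.com
    ARRAY /dev/md0 UUID=... spare-group=main
    ARRAY /dev/md1 UUID=... spare-group=main

with the monitor itself running, e.g. "mdadm --monitor --scan
--daemonise".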
* Re: raid failure question
@ 2010-02-01 20:29 David Lethe
0 siblings, 0 replies; 9+ messages in thread
From: David Lethe @ 2010-02-01 20:29 UTC (permalink / raw)
To: Bill Davidsen, linux-raid@vger.kernel.org
- A bad block on a surviving disk during a rebuild gives partial,
unrecoverable data loss (see the scrub sketch below).
- Bugs in firmware can be devastating, like NCQ/TCQ problems.
- Humans.
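The first of those is what regular scrubs guard against: reading
every sector while redundancy is still intact surfaces latent bad
blocks before a rebuild depends on them. A sketch (array name
illustrative):

    # start a background scrub; read errors are fixed from redundancy
    echo check > /sys/block/md0/md/sync_action
    # progress appears in /proc/mdstat; afterwards inspect:
    cat /sys/block/md0/md/mismatch_cnt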
-----Original Message-----
From: "Bill Davidsen" <davidsen@tmr.com>
Subj: Re: raid failure question
Date: Mon Feb 1, 2010 2:21 pm
Size: 1K
To: "linux-raid@vger.kernel.org" <linux-raid@vger.kernel.org>
Robin Hill wrote:
> On Mon Jan 11, 2010 at 11:00:40AM -0700, Tim Bock wrote:
>
>
>> Hello,
>>
>> Excluding the obvious multi-disk or bus failures, can anyone describe
>> what type of disk failure a raid cannot detect/recover from?
>>
>> I have had two disk failures over the last three months, and in spite of
>> having a hot spare, manual intervention was required each time to make
>> the raid usable again. I'm just not sure if I'm not setting something
>> up right, or if there is some other issue.
>>
>> Thanks for any comments or suggestions.
>>
>>
> Any failure where the disk doesn't actually return an error (within a
> reasonable time). For example, consumer grade disks often have very
> long retry times - this can mean the array is unusable for a long time
> until the disk eventually fails the read.
>
> If the disk actually returns an error then, AFAIK, the RAID array should
> always be able to recover from it.
>
The problem is that the admin should be able to set a timeout after
which recovery takes place even if the drive hasn't returned a bad
status. And some form of counter could be kept such that after a number
of these the drive is failed. There is no solution yet: Neil says the
timeout should be in the driver, and the driver writers say that if it
hurts md, the timeout belongs in md. Everyone points the finger at
some other code and says "there."
This is not laziness or buck-passing; Neil feels that md is not the
place, but putting it elsewhere causes other problems. Until someone
says "perfect is the enemy of good enough" and puts a timer where it
will solve the problem, this behavior will continue.
--
Bill Davidsen <davidsen@tmr.com>
"We can't solve today's problems by using the same thinking we
used in creating them." - Einstein