RAID timeout parameter accessibility request

public inbox for linux-kernel@vger.kernel.org

* RAID timeout parameter accessibility request
@ 2007-12-30 22:42 Jose de la Mancha
  2007-12-30 23:22 ` Jan Engelhardt
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Jose de la Mancha @ 2007-12-30 22:42 UTC
  To: linux-kernel

Hi everyone. I'm sorry but I'm not currently subscribed to this list (I've
been sent here by the listmaster), so please CC me all your
answers/comments. Thanks in advance.

SHORT QUESTION :
In a Debian-controlled RAID array, is there a parameter that handles the
timeout before a non-responding drive is dropped from the array ? Can this
timeout become user-adjustable in a future build ?

EXPLANATIONS :
As you might know, if you install and use a "desktop edition" hard drive in
a RAID array, the drive may not work correctly. This is caused by the normal
error recovery procedure that a desktop edition hard drive uses : when an
error is found on a desktop edition hard drive, the drive will enter into a
deep recovery cycle to attempt to repair the error, recover the data from
the problematic area, and then reallocate a dedicated area to replace the
problematic area. This process can take up to 120 seconds depending on the
severity of the issue.

The problem is that most RAID controllers allow a very short amount of time
(7-15 seconds) for a hard drive to recover from an error. If a hard drive
takes too long to complete this process, the drive will be dropped from the
RAID array !

Of course there are "RAID edition" hard drives with a feature called TLER
(Time Limited Error Recovery) which stops the hard drive from entering into
a deep recovery cycle. The hard drive will only spend 7 seconds to attempt
to recover. This means that the hard drive will not be dropped from a RAID
array. But these "special" hard drives are way too expensive IMHO just for a
small firmware-based feature.

There would be an easy way to allow users to use "ordinary" hard drives in a
Debian software-controlled RAID array. So here's my request : I suppose
there is a parameter that handles the default timeout before a drive is
dropped from the RAID array. I don't know if this parameter is hardcoded,
but it would be nice if it was user-adjustable. This way, we could simply
set up this parameter to 120 seconds or more (instead of 7-15) and we
wouldn't have any more problems with using "desktop edition" hard drives in
a RAID array.
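
The mismatch described above can be sketched as a simple comparison. This is
only an illustrative model, not code from md or from any RAID controller: the
function name and the strict greater-than rule are assumptions, and the
figures (120 s recovery, 7-15 s limit, 7 s with TLER) are the ones quoted in
this message.

```python
def is_dropped(recovery_seconds: float, controller_timeout_seconds: float) -> bool:
    """Illustrative model only: the drive is dropped when its internal
    error recovery takes longer than the controller is willing to wait."""
    return recovery_seconds > controller_timeout_seconds


# Desktop drive in deep recovery (up to 120 s) against a 15 s limit.
print(is_dropped(120, 15))   # True
# TLER drive that gives up after 7 s, same 15 s limit.
print(is_dropped(7, 15))     # False
# Desktop drive again, with the limit raised to 120 s as requested.
print(is_dropped(120, 120))  # False
```

Under this model, raising the limit to at least the worst-case recovery time
is what keeps the desktop drive in the array.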

What do you think ? Can it be done in a future build ?

I really hope that you'll be able to help, because I guess a lot of people
can be concerned by this issue.

Many thanks in advance & Best regards.

Jose

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: RAID timeout parameter accessibility request
       [not found] <fa.nf+P3+JC0dd6/v0bUA0T+jgXpts@ifi.uio.no>
@ 2007-12-30 23:10 ` Robert Hancock
  2007-12-31  9:54   ` Jose de la Mancha
  0 siblings, 1 reply; 9+ messages in thread
From: Robert Hancock @ 2007-12-30 23:10 UTC
  To: Jose de la Mancha; +Cc: linux-kernel

Jose de la Mancha wrote:
> Hi everyone. I'm sorry but I'm not currently subscribed to this list (I've
> been sent here by the listmaster), so please CC me all your
> answers/comments. Thanks in advance.
> 
> SHORT QUESTION :
> In a Debian-controlled RAID array, is there a parameter that handles the
> timeout before a non-responding drive is dropped from the array ? Can this
> timeout become user-adjustable in a future build ?
> 
> EXPLANATIONS :
> As you might know, if you install and use a "desktop edition" hard drive in
> a RAID array, the drive may not work correctly. This is caused by the normal
> error recovery procedure that a desktop edition hard drive uses : when an
> error is found on a desktop edition hard drive, the drive will enter into a
> deep recovery cycle to attempt to repair the error, recover the data from
> the problematic area, and then reallocate a dedicated area to replace the
> problematic area. This process can take up to 120 seconds depending on the
> severity of the issue.
> 
> The problem is that most RAID controllers allow a very short amount of time
> (7-15 seconds) for a hard drive to recover from an error. If a hard drive
> takes too long to complete this process, the drive will be dropped from the
> RAID array !

This always seemed a strange use case to me. If the drive is getting 
read errors, either it's dying and needs to be replaced, or it has a 
sporadic bad sector as a result of a power failure during write, etc. in 
which case the drive should be resynchronized. In either case the drive 
should be dropped from the array and require manual intervention. It 
doesn't seem logical to me to just read the data from another drive and 
carry on in our merry way without any warning.

> 
> Of course there are "RAID edition" hard drives with a feature called TLER
> (Time Limited Error Recovery) which stops the hard drive from entering into
> a deep recovery cycle. The hard drive will only spend 7 seconds to attempt
> to recover. This means that the hard drive will not be dropped from a RAID
> array. But these "special" hard drives are way too expensive IMHO just for a
> small firmware-based feature.
> 
> There would be an easy way to allow users to use "ordinary" hard drives in a
> Debian software-controlled RAID array. So here's my request : I suppose
> there is a parameter that handles the default timeout before a drive is
> dropped from the RAID array. I don't know if this parameter is hardcoded,
> but it would be nice if it was user-adjustable. This way, we could simply
> set up this parameter to 120 seconds or more (instead of 7-15) and we
> wouldn't have any more problems with using "desktop edition" hard drives in
> a RAID array.
> 
> What do you think ? Can it be done in a future build ?
> 
> I really hope that you'll be able to help, because I guess a lot of people
> can be concerned by this issue.
> 
> Many thanks in advance & Best regards.

I don't know the md internals very well, but I wouldn't imagine there's
a timeout in its code; the timeout would be based on the block layer and
driver timeouts for the constituent devices. For libata disks, the
timeout is normally 30 seconds. After that expires, the disk will get a 
soft or hard reset and the command is typically retried by the block 
layer. If all retries fail the upper layers will get a failure report, 
and I believe at that point the md layer decides to disable the device.
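
As a rough illustration of where such a driver-level timeout can be
inspected: on many Linux systems, SCSI-layer disks (which includes libata
disks) expose a per-device command timeout, in seconds, as the sysfs
attribute /sys/block/<dev>/device/timeout. The sketch below assumes that
attribute exists on the running kernel. The helper name and the sysfs_root
parameter are hypothetical and are only there so the function can be tried
against a scratch directory.

```python
from pathlib import Path


def read_command_timeout(device: str, sysfs_root: str = "/sys") -> int:
    """Return the command timeout (in seconds) reported for a disk.

    Assumes the SCSI-layer attribute block/<device>/device/timeout exists;
    raises FileNotFoundError if it does not.
    """
    attr = Path(sysfs_root) / "block" / device / "device" / "timeout"
    return int(attr.read_text().strip())
```

On a system using the default mentioned above, read_command_timeout("sda")
would be expected to return 30; the device name "sda" is just an example.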

-- 
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@nospamshaw.ca
Home Page: http://www.roberthancock.com/



* Re: RAID timeout parameter accessibility request
  2007-12-30 22:42 Jose de la Mancha
@ 2007-12-30 23:22 ` Jan Engelhardt
  2007-12-31  7:19 ` Thanasis
  2008-01-02 18:17 ` Bill Davidsen
  2 siblings, 0 replies; 9+ messages in thread
From: Jan Engelhardt @ 2007-12-30 23:22 UTC
  To: Jose de la Mancha; +Cc: linux-kernel


On Dec 30 2007 23:42, Jose de la Mancha wrote:
>SHORT QUESTION :
>In a Debian-controlled RAID array, is there a parameter that handles the
>timeout before a non-responding drive is dropped from the array ? Can this
>timeout become user-adjustable in a future build ?

Not sure about Debian,

but perhaps /sys/block/md0/md/safe_mode_delay does something?



* Re: RAID timeout parameter accessibility request
  2007-12-30 22:42 Jose de la Mancha
  2007-12-30 23:22 ` Jan Engelhardt
@ 2007-12-31  7:19 ` Thanasis
  2008-01-02 18:17 ` Bill Davidsen
  2 siblings, 0 replies; 9+ messages in thread
From: Thanasis @ 2007-12-31  7:19 UTC
  To: Jose de la Mancha; +Cc: linux-kernel

on 12/31/2007 12:42 AM Jose de la Mancha wrote the following:
> 
> Of course there are "RAID edition" hard drives with a feature called TLER
> (Time Limited Error Recovery) which stops the hard drive from entering into
> a deep recovery cycle. The hard drive will only spend 7 seconds to attempt
> to recover. This means that the hard drive will not be dropped from a RAID
> array. But these "special" hard drives are way too expensive IMHO just for a
> small firmware-based feature.

WD 2500YS
price same as an IDE or SATA


* Re: RAID timeout parameter accessibility request
  2007-12-30 23:10 ` RAID timeout parameter accessibility request Robert Hancock
@ 2007-12-31  9:54   ` Jose de la Mancha
  2007-12-31 10:45     ` Thanasis
  2007-12-31 12:11     ` Michael Tokarev
  0 siblings, 2 replies; 9+ messages in thread
From: Jose de la Mancha @ 2007-12-31  9:54 UTC
  To: linux-kernel

Thanks guys for your answers (please remember to keep CCing me).

Robert Hancock wrote:
> This always seemed a strange use case to me. If the drive is getting
> read errors, either it's dying and needs to be replaced, or it has a
> sporadic bad sector as a result of a power failure during write, etc. in
> which case the drive should be resynchronized. In either case the drive
> should be dropped from the array and require manual intervention. It
> doesn't seem logical to me to just read the data from another drive and
> carry on in our merry way without any warning.

--> A warning message is OK, but dropping the drive from the array is
excessive IMHO. And anyway, this should be user-configurable, so that it
becomes each user's responsibility to choose if the drive shall be dropped
or not. Currently we don't have any choice.

Jan Engelhardt wrote:
> Not sure about Debian, but perhaps /sys/block/md0/md/safe_mode_delay
> does something?

--> I'll check that out. Does anyone know how this "safe mode delay"
works?

Thanasis wrote:
> WD 2500YS
> price same as an IDE or SATA

--> All RAID edition drives are more expensive than their equivalent
"desktop edition" drives (same model on "desktop edition"). Just take a look
at newegg for instance. Besides, trying to find an affordable "RAID edition"
model is not a solution to this technical timeout issue, just a workaround
(a bad one IMHO). Thanks anyway.




* Re: RAID timeout parameter accessibility request
  2007-12-31  9:54   ` Jose de la Mancha
@ 2007-12-31 10:45     ` Thanasis
  2007-12-31 12:11     ` Michael Tokarev
  1 sibling, 0 replies; 9+ messages in thread
From: Thanasis @ 2007-12-31 10:45 UTC
  To: Jose de la Mancha; +Cc: linux-kernel

on 12/31/2007 11:54 AM Jose de la Mancha wrote the following:

> 
> --> All RAID edition drives are more expensive than their equivalent
> "desktop edition" drives (same model on "desktop edition"). Just take a look
> at newegg for instance. 
> 
http://www.newegg.com/Product/Product.aspx?Item=N82E16822136055&Tpk=WD%2b2500YS


* Re: RAID timeout parameter accessibility request
  2007-12-31  9:54   ` Jose de la Mancha
  2007-12-31 10:45     ` Thanasis
@ 2007-12-31 12:11     ` Michael Tokarev
  1 sibling, 0 replies; 9+ messages in thread
From: Michael Tokarev @ 2007-12-31 12:11 UTC
  To: Jose de la Mancha; +Cc: linux-kernel

Jose de la Mancha wrote:
[]
> Jan Engelhardt wrote:
>> Not sure about Debian, but perhaps /sys/block/md0/md/safe_mode_delay
>> does something?
> 
> --> I'll check that out. Does someone know about how this "safe mode delay"
> works ?

It's about something entirely different.  This parameter tells md after
how much inactivity time to update the superblocks to indicate the array
is "clean" - so that in case of power loss w/o shutting down the array,
it will not require reconstruction.  It has nothing to do with timeouts.
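
For anyone who wants to look at the value anyway, here is a minimal sketch
that reads the attribute Jan mentioned. It assumes the file holds a plain
decimal number of seconds; the helper name and the sysfs_root parameter are
hypothetical. As explained above, this value has no bearing on when a drive
is dropped from the array.

```python
from pathlib import Path


def read_safe_mode_delay(array: str = "md0", sysfs_root: str = "/sys") -> float:
    """Return md's safe_mode_delay for an array, in seconds.

    Assumes block/<array>/md/safe_mode_delay exists and contains a decimal
    number; raises FileNotFoundError or ValueError otherwise.
    """
    attr = Path(sysfs_root) / "block" / array / "md" / "safe_mode_delay"
    return float(attr.read_text().strip())
```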

By the way, linux raid is usually discussed at linux-raid@vger, not here.

/mjt


* Re: RAID timeout parameter accessibility request
       [not found] ` <9GqH0-pd-11@gated-at.bofh.it>
@ 2008-01-02 14:49   ` Bodo Eggert
  0 siblings, 0 replies; 9+ messages in thread
From: Bodo Eggert @ 2008-01-02 14:49 UTC
  To: Thanasis, Jose de la Mancha, linux-kernel

Thanasis <thanasis@asyr.hopto.org> wrote:
> on 12/31/2007 11:54 AM Jose de la Mancha wrote the following:

>> --> All RAID edition drives are more expensive than their equivalent
>> "desktop edition" drives (same model on "desktop edition"). Just take a look
>> at newegg for instance.
>> 
>http://www.newegg.com/Product/Product.aspx?Item=N82E16822136055&Tpk=WD%2b2500YS

Not available here, and my local store offers a 250 GB drive for 7 % less.



* Re: RAID timeout parameter accessibility request
  2007-12-30 22:42 Jose de la Mancha
  2007-12-30 23:22 ` Jan Engelhardt
  2007-12-31  7:19 ` Thanasis
@ 2008-01-02 18:17 ` Bill Davidsen
  2 siblings, 0 replies; 9+ messages in thread
From: Bill Davidsen @ 2008-01-02 18:17 UTC
  To: Jose de la Mancha; +Cc: linux-kernel

Jose de la Mancha wrote:
> Hi everyone. I'm sorry but I'm not currently subscribed to this list (I've
> been sent here by the listmaster), so please CC me all your
> answers/comments. Thanks in advance.
> 
> SHORT QUESTION :
> In a Debian-controlled RAID array, is there a parameter that handles the
> timeout before a non-responding drive is dropped from the array ? Can this
> timeout become user-adjustable in a future build ?
> 
> EXPLANATIONS :
> As you might know, if you install and use a "desktop edition" hard drive in
> a RAID array, the drive may not work correctly. This is caused by the normal
> error recovery procedure that a desktop edition hard drive uses : when an
> error is found on a desktop edition hard drive, the drive will enter into a
> deep recovery cycle to attempt to repair the error, recover the data from
> the problematic area, and then reallocate a dedicated area to replace the
> problematic area. This process can take up to 120 seconds depending on the
> severity of the issue.
> 
> The problem is that most RAID controllers allow a very short amount of time
> (7-15 seconds) for a hard drive to recover from an error. If a hard drive
> takes too long to complete this process, the drive will be dropped from the
> RAID array !
> 
> Of course there are "RAID edition" hard drives with a feature called TLER
> (Time Limited Error Recovery) which stops the hard drive from entering into
> a deep recovery cycle. The hard drive will only spend 7 seconds to attempt
> to recover. This means that the hard drive will not be dropped from a RAID
> array. But these "special" hard drives are way too expensive IMHO just for a
> small firmware-based feature.
> 
I'm not sure "way too expensive" is appropriate. I'm using Seagate 320GB 
"NS" drives instead of "AS" models, and at the time I bought them they 
were $100 vs. $95 and are now $95 vs. $85. Other sizes and vendors have 
similar price points. A 10-15% premium seems reasonable for the 
firmware, faster access, and a general assurance that the drive is 
intended for 7/24 use.
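
The premium implied by the prices quoted above can be worked out directly.
This is plain arithmetic on those four figures, taking the higher price in
each pair as the "NS" model; the function name is hypothetical.

```python
def premium_percent(raid_price: float, desktop_price: float) -> float:
    """Price premium of the RAID-edition drive over the desktop one, in percent."""
    return (raid_price - desktop_price) / desktop_price * 100


# At purchase time: $100 ("NS") vs. $95 ("AS").
print(round(premium_percent(100, 95), 1))  # 5.3
# Current prices: $95 ("NS") vs. $85 ("AS").
print(round(premium_percent(95, 85), 1))   # 11.8
```

So the premium was roughly 5% at purchase time and is roughly 12% at the
current prices.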

On a small home array the cost is minimal, and if you are running 
desktop drives in huge arrays 7/24 you probably are doing other cost 
cutting tradeoffs to reliability, it's a choice.

> There would be an easy way to allow users to use "ordinary" hard drives in a
> Debian software-controlled RAID array. So here's my request : I suppose
> there is a parameter that handles the default timeout before a drive is
> dropped from the RAID array. I don't know if this parameter is hardcoded,
> but it would be nice if it was user-adjustable. This way, we could simply
> set up this parameter to 120 seconds or more (instead of 7-15) and we
> wouldn't have any more problems with using "desktop edition" hard drives in
> a RAID array.
> 
> What do you think ? Can it be done in a future build ?
> 
Just in general I agree, I'd like to see the error reported back up and 
let the user make the decision. One of the benefits of Linux is that it 
lets you make your own decisions, even bad ones. Windows assumes it owns 
the computer, you're an idiot, and it will let you use your computer if 
you do it their way.

> I really hope that you'll be able to help, because I guess a lot of people
> can be concerned by this issue.
> 
Not so much; with rewrites of sectors where possible, this problem is
less common than it was. But I agree on the drive-kicking option: more
control is good. However, the timeout should be in the driver, not in
the raid code; that's where it belongs. The kernel copes with errors
better than having a drive go practice self-gratification for minutes at 
a time.
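
A driver-level timeout of the kind argued for here can, on many Linux
systems, be adjusted through the SCSI-layer sysfs attribute
/sys/block/<dev>/device/timeout (value in seconds). The sketch below assumes
that attribute exists and is writable, which normally requires root. The
helper name and the sysfs_root parameter are hypothetical, and whether a
longer timeout is a good idea for a given array is a separate question.

```python
from pathlib import Path


def set_command_timeout(device: str, seconds: int, sysfs_root: str = "/sys") -> None:
    """Write a new command timeout (in seconds) for a disk.

    Assumes the SCSI-layer attribute block/<device>/device/timeout exists
    and that the caller has permission to write to it.
    """
    if seconds <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    attr = Path(sysfs_root) / "block" / device / "device" / "timeout"
    attr.write_text(f"{seconds}\n")
```

For example, set_command_timeout("sda", 120) would mirror the 120-second
figure from the original request. A value written this way is not expected
to survive a reboot.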

-- 
Bill Davidsen <davidsen@tmr.com>
   "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot
