linux-raid.vger.kernel.org archive mirror
* Question about raid robustness when disk fails
@ 2010-01-08 17:39 Tim Bock
  2010-01-22 16:32 ` Goswin von Brederlow
  0 siblings, 1 reply; 15+ messages in thread
From: Tim Bock @ 2010-01-08 17:39 UTC (permalink / raw)
  To: linux-raid

Hello,

	I built a raid-1 + lvm setup on a Dell 2950 in December 2008.  The OS
disk (ubuntu server 8.04) is not part of the raid.  Raid is 4 disks + 1
hot spare (all raid disks are sata, 1TB Seagates).

	Worked like a charm for ten months, and then had some kind of disk
problem in October which drove the load average to 13.  Initially tried
a reboot, but system would not come all of the way back up.  Had to boot
single-user and comment out the RAID entry.  System came up, I manually
failed/removed the offending disk, added the RAID entry back to fstab,
rebooted, and things proceeded as I would expect.  Replaced offending
drive.

	In early December, had a hiccup on a drive in a different slot.  Load
average again near 13.  Issued reboot, which proceeded normally until
the "unmounting local filesystems" stage, and then just seemed to hang.
Eventually just pushed power button.  The subsequent boot took about
twenty minutes (journal recovery and fsck), but seemed to come up ok.

From the log:
Dec 9 02:06:10 fs1 kernel: [6185521.188847] mptbase: ioc0:
LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands
After Error}, SubCode(0x0000) 
Dec 9 02:06:10 fs1 kernel: [6185521.189287] sd 2:0:1:0: [sdb] Result:
hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK 
Dec 9 02:06:10 fs1 kernel: [6185521.189294] sd 2:0:1:0: [sdb] Sense
Key : Medium Error [current] 
Dec 9 02:06:10 fs1 kernel: [6185521.189299] Info fld=0x2e78894 
Dec 9 02:06:10 fs1 kernel: [6185521.189302] sd 2:0:1:0: [sdb] Add.
Sense: Unrecovered read error 
Dec 9 02:06:10 fs1 kernel: [6185521.189309] end_request: I/O error, dev
sdb, sector 48728212

Ok, so looks like the drive is having some problems, maybe failing.
Noted, but I have a hot spare which should take over in the event of a
failure, yes?

Things moved along fine until Dec 23.  Same drive and symptoms as
earlier that month, but this time it did not come up on its own when
rebooted.  Had to comment out the RAID while in single-user mode,
reboot, manually fail/remove drive, and then it finally started syncing
with the spare as expected.  From smartctl, the last command before the
error was READ FPDMA QUEUED (this was the same for all five of the most
recent errors reported by SMART, and all essentially at the same time).

So it appears I have another bad disk, though smartctl reports that the
drive passes the extended self-test.  My question (at long last) is
this:  In all three cases, why didn't the raid fail the drive and start
using the spare (without my help)?  I guess I'm not clear on what kind
of failures the raid will detect/survive (beyond the obvious, like
failure of a disk and its mirror or bus failure).  Is there some
configuration piece I have missed?

Thanks for any enlightenment...

Tim



* Re: Question about raid robustness when disk fails
  2010-01-08 17:39 Question about raid robustness when disk fails Tim Bock
@ 2010-01-22 16:32 ` Goswin von Brederlow
  2010-01-25 16:22   ` Tim Bock
  2010-01-27  0:19   ` Ryan Wagoner
  0 siblings, 2 replies; 15+ messages in thread
From: Goswin von Brederlow @ 2010-01-22 16:32 UTC (permalink / raw)
  To: Tim Bock; +Cc: linux-raid

Tim Bock <jtbock@daylight.com> writes:

> Hello,
>
> 	I built a raid-1 + lvm setup on a Dell 2950 in December 2008.  The OS
> disk (ubuntu server 8.04) is not part of the raid.  Raid is 4 disks + 1
> hot spare (all raid disks are sata, 1TB Seagates).
>
> 	Worked like a charm for ten months, and then had some kind of disk
> problem in October which drove the load average to 13.  Initially tried
> a reboot, but system would not come all of the way back up.  Had to boot
> single-user and comment out the RAID entry.  System came up, I manually
> failed/removed the offending disk, added the RAID entry back to fstab,
> rebooted, and things proceeded as I would expect.  Replaced offending
> drive.

If a drive goes crazy without actually dying then Linux can spend a
long time trying to get something from the drive. The controller chip
can go crazy or the driver itself can have a bug and lock up. All those
things are below the raid level, and if they halt your system then raid
cannot do anything about it.

Only when a drive goes bad and the lower layers report an error to the
raid level can raid cope with the situation, remove the drive and keep
running. Unfortunately there seems to be a loose correlation between
the cost of the controller (chip) and the likelihood of a failing disk
locking up the system, i.e. the cheap onboard SATA chips on desktop
systems do that more often than expensive server controllers. But that
is just a loose relationship.
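
A rough user-space sketch (assuming a Linux box with Python; the
/proc/mdstat parsing is simplified) of what the raid level itself gets
to see: a member only shows up as failed here once a lower layer has
actually returned an error for it, so a hung controller never appears.

#!/usr/bin/env python3
# Illustrative sketch: report what the md layer knows about its members.
# md only lists a device as faulty "(F)" after a lower layer returned an
# error; a drive that hangs the bus without erroring never shows up here.
import re

def mdstat_summary(path="/proc/mdstat"):
    with open(path) as f:
        text = f.read()
    for block in re.split(r"\n(?=md\d)", text):
        if not block.startswith("md"):
            continue
        name = block.split()[0]
        faulty = re.findall(r"(\w+)\[\d+\]\(F\)", block)
        status = re.search(r"\[([U_]+)\]", block)
        degraded = bool(status and "_" in status.group(1))
        print(name, "degraded" if degraded else "clean",
              "faulty members:", faulty or "none")

if __name__ == "__main__":
    mdstat_summary()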

MfG
        Goswin

PS: I've seen hardware raid boxes lock up too so this isn't a drawback
of software raid.


* Re: Question about raid robustness when disk fails
  2010-01-22 16:32 ` Goswin von Brederlow
@ 2010-01-25 16:22   ` Tim Bock
  2010-01-25 17:51     ` Goswin von Brederlow
  2010-01-27  0:19   ` Ryan Wagoner
  1 sibling, 1 reply; 15+ messages in thread
From: Tim Bock @ 2010-01-25 16:22 UTC (permalink / raw)
  To: Goswin von Brederlow; +Cc: linux-raid

Thank you for the response.  Through the smartctl tests, I noticed that
the "seek error rate" value for the misbehaving disk was at 42, with the
threshold at 30.  For other disks in the same array, the "seek error
rate" values were up around 75 (same threshold of 30).  As it seems the
values decrement to the threshold, I took that as a further sign that
the disk was in trouble and replaced it.  Any likely correlation between
the described problem and the "seek error rate" value?

Is there a way to post-mortem the drive/logs/other traces to gain
insight into what the lower layer problem was?  I would like to be able
to definitively pinpoint (or at least have a reasonable level of
confidence about) the cause of the problem.  The ultimate goal, of
course, is to try and prevent any recurrence.

Thanks again,
Tim

On Fri, 2010-01-22 at 17:32 +0100, Goswin von Brederlow wrote:
> Tim Bock <jtbock@daylight.com> writes:
> 
> > Hello,
> >
> > 	I built a raid-1 + lvm setup on a Dell 2950 in December 2008.  The OS
> > disk (ubuntu server 8.04) is not part of the raid.  Raid is 4 disks + 1
> > hot spare (all raid disks are sata, 1TB Seagates).
> >
> > 	Worked like a charm for ten months, and then had some kind of disk
> > problem in October which drove the load average to 13.  Initially tried
> > a reboot, but system would not come all of the way back up.  Had to boot
> > single-user and comment out the RAID entry.  System came up, I manually
> > failed/removed the offending disk, added the RAID entry back to fstab,
> > rebooted, and things proceeded as I would expect.  Replaced offending
> > drive.
> 
> If a drive goes crazy without actually dying then Linux can spend a
> long time trying to get something from the drive. The controller chip
> can go crazy or the driver itself can have a bug and lock up. All those
> things are below the raid level, and if they halt your system then raid
> cannot do anything about it.
>
> Only when a drive goes bad and the lower layers report an error to the
> raid level can raid cope with the situation, remove the drive and keep
> running. Unfortunately there seems to be a loose correlation between
> the cost of the controller (chip) and the likelihood of a failing disk
> locking up the system, i.e. the cheap onboard SATA chips on desktop
> systems do that more often than expensive server controllers. But that
> is just a loose relationship.
> 
> MfG
>         Goswin
> 
> PS: I've seen hardware raid boxes lock up too so this isn't a drawback
> of software raid.




* Re: Question about raid robustness when disk fails
  2010-01-25 16:22   ` Tim Bock
@ 2010-01-25 17:51     ` Goswin von Brederlow
  2010-01-25 18:12       ` Michał Sawicz
  0 siblings, 1 reply; 15+ messages in thread
From: Goswin von Brederlow @ 2010-01-25 17:51 UTC (permalink / raw)
  To: Tim Bock; +Cc: Goswin von Brederlow, linux-raid

Tim Bock <jtbock@daylight.com> writes:

> Thank you for the response.  Through the smartctl tests, I noticed that
> the "seek error rate" value for the misbehaving disk was at 42, with the
> threshold at 30.  For other disks in the same array, the "seek error
> rate" values were up around 75 (same threshold of 30).  As it seems the
> values decrement to the threshold, I took that as a further sign that
> the disk was in trouble and replaced it.  Any likely correlation between
> the described problem and the "seek error rate" value?

Always keep in mind that SMART values are often random, fictional or
garbage. I have disks that report an airflow temperature (outside) of
80+ and a temperature (inside) of 50+, both going down as the disk
heats up from use.

The only values I would keep a close eye on are remapped sectors and
pending sectors. Anything else gives nice graphs but is, I feel, totally
useless. And even the pending sectors are != 0 on one drive while
badblocks reports no errors on repeated passes. The drive just doesn't
seem to reduce the count when it successfully remaps a sector.
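
A quick sketch of watching just those two counters (assuming
smartmontools is installed and the script runs with enough privilege;
/dev/sdb and the attribute names are the usual ones but still an
assumption for any given drive):

#!/usr/bin/env python3
# Sketch: pull only the reallocated/pending sector counts out of smartctl -A.
import subprocess
import sys

WATCH = ("Reallocated_Sector_Ct", "Current_Pending_Sector")

def sector_counts(dev):
    out = subprocess.run(["smartctl", "-A", dev], capture_output=True,
                         text=True, check=False).stdout
    counts = {}
    for line in out.splitlines():
        fields = line.split()
        # Attribute table rows have 10 columns; RAW_VALUE is the last one.
        if len(fields) >= 10 and fields[1] in WATCH:
            counts[fields[1]] = fields[9]
    return counts

if __name__ == "__main__":
    dev = sys.argv[1] if len(sys.argv) > 1 else "/dev/sdb"
    for name, raw in sector_counts(dev).items():
        print(dev, name, raw)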

MfG
        Goswin


* Re: Question about raid robustness when disk fails
  2010-01-25 17:51     ` Goswin von Brederlow
@ 2010-01-25 18:12       ` Michał Sawicz
  2010-01-26  7:29         ` Goswin von Brederlow
  0 siblings, 1 reply; 15+ messages in thread
From: Michał Sawicz @ 2010-01-25 18:12 UTC (permalink / raw)
  To: Goswin von Brederlow; +Cc: Tim Bock, linux-raid


On Mon, 2010-01-25 at 18:51 +0100, Goswin von Brederlow wrote:
> The only values I would keep a close eye on are remapped sectors and
> pending sectors. Anything else gives nice graphs but is, I feel, totally
> useless. And even the pending sectors are != 0 on one drive while
> badblocks reports no errors on repeated passes. The drive just doesn't
> seem to reduce the count when it successfully remaps a sector.

I read today that Samsungs show that behavior. Maybe this is the case?

-- 
Cheers
Michał (Saviq) Sawicz



* Re: Question about raid robustness when disk fails
  2010-01-25 18:12       ` Michał Sawicz
@ 2010-01-26  7:29         ` Goswin von Brederlow
  0 siblings, 0 replies; 15+ messages in thread
From: Goswin von Brederlow @ 2010-01-26  7:29 UTC (permalink / raw)
  To: Michał Sawicz; +Cc: Goswin von Brederlow, Tim Bock, linux-raid

Michał Sawicz <michal@sawicz.net> writes:

> Dnia 2010-01-25, pon o godzinie 18:51 +0100, Goswin von Brederlow pisze:
>> The only values I would keep a close eye on are remapped sectors and
>> pending sectors. Anything else gives nice graphs but is, I feel, totally
>> useless. And even the pending sectors are != 0 on one drive while
>> badblocks reports no errors on repeated passes. The drive just doesn't
>> seem to reduce the count when it successfully remaps a sector.
>
> I read today that Samsungs show that behavior. Maybe this is the case?

I usually buy one or at most two new disks at a time, so my drives are
all different makes and models. It might very well be a Samsung drive
that has that behaviour.

MfG
        Goswin


* Re: Question about raid robustness when disk fails
  2010-01-22 16:32 ` Goswin von Brederlow
  2010-01-25 16:22   ` Tim Bock
@ 2010-01-27  0:19   ` Ryan Wagoner
  2010-01-27  4:22     ` Michael Evans
  2010-01-27 15:15     ` Tim Bock
  1 sibling, 2 replies; 15+ messages in thread
From: Ryan Wagoner @ 2010-01-27  0:19 UTC (permalink / raw)
  To: Goswin von Brederlow; +Cc: Tim Bock, linux-raid

On Fri, Jan 22, 2010 at 11:32 AM, Goswin von Brederlow
<goswin-v-b@web.de> wrote:
> Tim Bock <jtbock@daylight.com> writes:
>
>> Hello,
>>
>>       I built a raid-1 + lvm setup on a Dell 2950 in December 2008.  The OS
>> disk (ubuntu server 8.04) is not part of the raid.  Raid is 4 disks + 1
>> hot spare (all raid disks are sata, 1TB Seagates).
>>
>>       Worked like a charm for ten months, and then had some kind of disk
>> problem in October which drove the load average to 13.  Initially tried
>> a reboot, but system would not come all of the way back up.  Had to boot
>> single-user and comment out the RAID entry.  System came up, I manually
>> failed/removed the offending disk, added the RAID entry back to fstab,
>> rebooted, and things proceeded as I would expect.  Replaced offending
>> drive.
>
> If a drive goes crazy without actually dying then Linux can spend a
> long time trying to get something from the drive. The controller chip
> can go crazy or the driver itself can have a bug and lock up. All those
> things are below the raid level, and if they halt your system then raid
> cannot do anything about it.
>
> Only when a drive goes bad and the lower layers report an error to the
> raid level can raid cope with the situation, remove the drive and keep
> running. Unfortunately there seems to be a loose correlation between
> the cost of the controller (chip) and the likelihood of a failing disk
> locking up the system, i.e. the cheap onboard SATA chips on desktop
> systems do that more often than expensive server controllers. But that
> is just a loose relationship.
>
> MfG
>        Goswin
>
> PS: I've seen hardware raid boxes lock up too so this isn't a drawback
> of software raid.

You need to be using drives designed for RAID use with TLER
(time-limited error recovery). When the drive encounters an error,
instead of attempting to read the data for an extended period of time,
it just gives up so the RAID can take care of it.

For example, I had a SAS drive start to fail on a hardware RAID server.
Every time it hit a bad spot on the drive you could tell: the system
would pause for a brief second and only that drive's light was on. The
drive gave up and the RAID determined the correct data. It ran fine
like this until I was able to replace the drive the next day.
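
On drives that implement SCT ERC the same kind of cap can sometimes be
set from software; a hedged sketch (assuming a smartmontools build that
supports the scterc log, root privileges, and /dev/sdb purely as an
example; many desktop drives simply refuse the command):

#!/usr/bin/env python3
# Sketch: query, and optionally cap, a drive's error recovery time (SCT ERC).
# Times are in tenths of a second, so 70 means "give up after 7 seconds"
# and let the error reach the RAID layer instead of stalling the bus.
import subprocess
import sys

def show_erc(dev):
    subprocess.run(["smartctl", "-l", "scterc", dev], check=False)

def set_erc(dev, deciseconds=70):
    subprocess.run(["smartctl", "-l",
                    "scterc,{0},{0}".format(deciseconds), dev], check=False)

if __name__ == "__main__":
    dev = sys.argv[1] if len(sys.argv) > 1 else "/dev/sdb"
    show_erc(dev)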

Ryan


* Re: Question about raid robustness when disk fails
  2010-01-27  0:19   ` Ryan Wagoner
@ 2010-01-27  4:22     ` Michael Evans
  2010-01-27  9:04       ` Goswin von Brederlow
  2010-01-27 15:15     ` Tim Bock
  1 sibling, 1 reply; 15+ messages in thread
From: Michael Evans @ 2010-01-27  4:22 UTC (permalink / raw)
  To: Ryan Wagoner; +Cc: Goswin von Brederlow, Tim Bock, linux-raid

On Tue, Jan 26, 2010 at 4:19 PM, Ryan Wagoner <rswagoner@gmail.com> wrote:
> On Fri, Jan 22, 2010 at 11:32 AM, Goswin von Brederlow
> <goswin-v-b@web.de> wrote:
>> Tim Bock <jtbock@daylight.com> writes:
>>
>>> Hello,
>>>
>>>       I built a raid-1 + lvm setup on a Dell 2950 in December 2008.  The OS
>>> disk (ubuntu server 8.04) is not part of the raid.  Raid is 4 disks + 1
>>> hot spare (all raid disks are sata, 1TB Seagates).
>>>
>>>       Worked like a charm for ten months, and then had some kind of disk
>>> problem in October which drove the load average to 13.  Initially tried
>>> a reboot, but system would not come all of the way back up.  Had to boot
>>> single-user and comment out the RAID entry.  System came up, I manually
>>> failed/removed the offending disk, added the RAID entry back to fstab,
>>> rebooted, and things proceeded as I would expect.  Replaced offending
>>> drive.
>>
>> If a drive goes crazy without actually dying then Linux can spend a
>> long time trying to get something from the drive. The controller chip
>> can go crazy or the driver itself can have a bug and lock up. All those
>> things are below the raid level, and if they halt your system then raid
>> cannot do anything about it.
>>
>> Only when a drive goes bad and the lower layers report an error to the
>> raid level can raid cope with the situation, remove the drive and keep
>> running. Unfortunately there seems to be a loose correlation between
>> the cost of the controller (chip) and the likelihood of a failing disk
>> locking up the system, i.e. the cheap onboard SATA chips on desktop
>> systems do that more often than expensive server controllers. But that
>> is just a loose relationship.
>>
>> MfG
>>        Goswin
>>
>> PS: I've seen hardware raid boxes lock up too so this isn't a drawback
>> of software raid.
>
> You need to be using drives designed for RAID use with TLER
> (time-limited error recovery). When the drive encounters an error,
> instead of attempting to read the data for an extended period of time,
> it just gives up so the RAID can take care of it.
>
> For example, I had a SAS drive start to fail on a hardware RAID server.
> Every time it hit a bad spot on the drive you could tell: the system
> would pause for a brief second and only that drive's light was on. The
> drive gave up and the RAID determined the correct data. It ran fine
> like this until I was able to replace the drive the next day.
>
> Ryan

Why doesn't the kernel issue a pessimistic alternate 'read' path (on
the other drives needed to obtain the data) if the ideal method is
late?  It would be more useful for time-sensitive/worst-case buffering
to be able to customize when to 'give up' dynamically.


* Re: Question about raid robustness when disk fails
  2010-01-27  4:22     ` Michael Evans
@ 2010-01-27  9:04       ` Goswin von Brederlow
  2010-01-27  9:22         ` Asdo
  0 siblings, 1 reply; 15+ messages in thread
From: Goswin von Brederlow @ 2010-01-27  9:04 UTC (permalink / raw)
  To: Michael Evans; +Cc: Ryan Wagoner, Goswin von Brederlow, Tim Bock, linux-raid

Michael Evans <mjevans1983@gmail.com> writes:

> Why doesn't the kernel issue a pessimistic alternate 'read' path (on
> the other drives needed to obtain the data) if the ideal method is
> late?  It would be more useful for time-sensitive/worst-case buffering
> to be able to customize when to 'give up' dynamically.

That is a very good question. I look forward to seeing patches for this
from you. :) I think it isn't done because nobody has bothered to write
the code yet, but maybe I'm wrong and it would make the code too
complicated.

MfG
        Goswin


* Re: Question about raid robustness when disk fails
  2010-01-27  9:04       ` Goswin von Brederlow
@ 2010-01-27  9:22         ` Asdo
  2010-01-27 10:25           ` Goswin von Brederlow
  0 siblings, 1 reply; 15+ messages in thread
From: Asdo @ 2010-01-27  9:22 UTC (permalink / raw)
  To: Goswin von Brederlow; +Cc: Michael Evans, Ryan Wagoner, Tim Bock, linux-raid

Goswin von Brederlow wrote:
> Michael Evans <mjevans1983@gmail.com> writes:
>   
>> Why doesn't the kernel issue a pessimistic alternate 'read' path (on
>> the other drives needed to obtain the data) if the ideal method is
>> late?  It would be more useful for time-sensitive/worst-case buffering
>> to be able to customize when to 'give up' dynamically.
>>     
>
>> That is a very good question. I look forward to seeing patches for this
>> from you. :) I think it isn't done because nobody has bothered to write
>> the code yet, but maybe I'm wrong and it would make the code too
>> complicated.
>   

This is probably more complicated than allowing a timeout to be set at 
the MD layer or block-device layer, isn't it?

Which would be just as good I think.

Is it possible to cancel a SATA/SCSI command that is being executed by 
the drive?
(it's probably feasible only with NCQ disabled anyway, but it's easy to 
disable NCQ)


It's a pity we have to rely on TLER, this narrows the choice of drives a 
lot...


* Re: Question about raid robustness when disk fails
  2010-01-27  9:22         ` Asdo
@ 2010-01-27 10:25           ` Goswin von Brederlow
  2010-01-27 10:43             ` Asdo
  0 siblings, 1 reply; 15+ messages in thread
From: Goswin von Brederlow @ 2010-01-27 10:25 UTC (permalink / raw)
  To: Asdo
  Cc: Goswin von Brederlow, Michael Evans, Ryan Wagoner, Tim Bock,
	linux-raid

Asdo <asdo@shiftmail.org> writes:

> Goswin von Brederlow wrote:
>> Michael Evans <mjevans1983@gmail.com> writes:
>>
>>> Why doesn't the kernel issue a pessimistic alternate 'read' path (on
>>> the other drives needed to obtain the data) if the ideal method is
>>> late?  It would be more useful for time-sensitive/worst-case buffering
>>> to be able to customize when to 'give up' dynamically.
>>>
>>
>> That is a very good question. I look forward to seeing patches for this
>> from you. :) I think it isn't done because nobody has bothered to write
>> the code yet, but maybe I'm wrong and it would make the code too
>> complicated.
>>
>
> This is probably more complicated than allowing a timeout to be set at
> the MD layer or block-device layer, isn't it?

There are timeouts at various levels already, but the SCSI specs, for
example, allow for quite some time until you give up, as in a minute. You
would certainly want something much, much smaller here.
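
For reference, the per-command timeout the SCSI disk layer already uses
is visible (and tunable) per device; a small sketch (the device name is
only an example, and lowering the value merely makes errors surface
sooner, it adds no alternate-read logic by itself):

#!/usr/bin/env python3
# Sketch: inspect or lower the SCSI layer's per-command timeout for one disk.
# /sys/block/<dev>/device/timeout is in seconds (commonly 30 by default).
import sys

def get_timeout(dev):
    with open("/sys/block/{}/device/timeout".format(dev)) as f:
        return int(f.read().strip())

def set_timeout(dev, seconds):
    with open("/sys/block/{}/device/timeout".format(dev), "w") as f:
        f.write(str(seconds))

if __name__ == "__main__":
    dev = sys.argv[1] if len(sys.argv) > 1 else "sdb"
    print(dev, "timeout:", get_timeout(dev), "s")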

So, off the top of my head, here is what I imagine you need: you would
need to set a timeout for reading a block. Then, once the timeout is
reached, you need to read the rest of the stripe if it is not available
already. Do you read every block in a stripe or just enough to get the
data? You might not need all blocks, e.g. a 3-way raid1 or a raid6
doesn't need all blocks. But then you have another timeout situation
there.

So let's say we read all blocks for simplicity's sake. Then you might
have scheduled more reads than you need, and when enough reads were
successful you should not wait for the rest but return the data
immediately. Late arrivals from the extra reads (or the original) you
then also have to handle. Or do you cancel them? Also, the original read
might succeed before the extra reads return.

It might also be wise to notice when the additional reads are slower
than the original and, if that happens often, increase the initial
timeout slightly. But a warning for the admin would do too, so he can
adjust the timeout himself.


I don't think setting the timeout for the initial read will be
complicated, but handling the alternatives will not be trivial. If you
implement it you will probably find more problems along the way.
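
To make that concrete, here is a toy user-space model of the idea (this
is not md code; the 'mirrors' are just zero-argument callables and the
hedge_after/give_up_after names are made up for the sketch):

#!/usr/bin/env python3
# Toy model of a "hedged read": try the preferred copy first and only fan
# out to the other copies if it is late; the first successful copy wins.
import concurrent.futures as cf

def hedged_read(mirrors, hedge_after=0.5, give_up_after=30.0):
    """mirrors: list of zero-argument callables, preferred one first."""
    pool = cf.ThreadPoolExecutor(max_workers=len(mirrors))
    futures = [pool.submit(mirrors[0])]
    done, _ = cf.wait(futures, timeout=hedge_after)
    if not done:                            # primary is late: hedge
        futures += [pool.submit(m) for m in mirrors[1:]]
    try:
        for fut in cf.as_completed(futures, timeout=give_up_after):
            try:
                return fut.result()         # first successful copy wins
            except Exception:
                continue                    # that copy errored; try the rest
        raise IOError("every copy failed")
    finally:
        pool.shutdown(wait=False)           # leave stragglers to finish alone

Even this toy already has to decide what to do with the late primary and
with the extra reads that lose the race, which is exactly the
bookkeeping described above.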

> Which would be just as good I think.
>
> Is it possible to cancel a SATA/SCSI command that is being executed by
> the drive?
> (it's probably feasible only with NCQ disabled anyway, but it's easy
> to disable NCQ)

Do you want to do that? I would rather have the drive keep trying and
return an error if it can't read so the raid layer rewrites the blocks
causing it to be remapped. I do not want to wait for that but I want it
to happen.

> It's a pity we have to rely on TLER, this narrows the choice of drives
> a lot...

I don't. I just acknowledge the limitation and accept the downtime to
find and remove a broken but not properly failed disk. I use raid so I
don't lose my data when a disk fails, not primarily for availability.
So far I had one case in 10 years where a failing disk took down my
system.

MfG
        Goswin


* Re: Question about raid robustness when disk fails
  2010-01-27 10:25           ` Goswin von Brederlow
@ 2010-01-27 10:43             ` Asdo
  2010-01-27 15:34               ` Goswin von Brederlow
  0 siblings, 1 reply; 15+ messages in thread
From: Asdo @ 2010-01-27 10:43 UTC (permalink / raw)
  To: Goswin von Brederlow; +Cc: linux-raid

Goswin von Brederlow wrote:
>> Is it possible to cancel a SATA/SCSI command that is being executed by
>> the drive?
>> (it's probably feasible only with NCQ disabled anyway, but it's easy
>> to disable NCQ)
>>     
>
> Do you want to do that? I would rather have the drive keep trying and
> return an error if it can't read so the raid layer rewrites the blocks
> causing it to be remapped. I do not want to wait for that but I want it
> to happen.
>   
So you want that to happen in the background?
There is not that much benefit in having that happen in the background, imho.
Why not just have an error returned after a timeout, with the normal MD
read-error-recovery procedure kicking in? (Recomputation from parity and
rewrite of the damaged block.)

>> It's a pity we have to rely on TLER, this narrows the choice of drives
>> a lot...
>>     
>
> I don't. I just acknowledge the limitation and accept the downtime 
The time might be so long that MD or the controller can drop the entire 
drive.
It didn't happen to me but I think I read something like this on this ML...
> to
> find and remove a broken but not properly failed disk. I use raid so I
> don't lose my data when a disk fails, not primarily for availability.
> So far I had one case in 10 years where a failing disk took down my
> system.
>   



* Re: Question about raid robustness when disk fails
  2010-01-27  0:19   ` Ryan Wagoner
  2010-01-27  4:22     ` Michael Evans
@ 2010-01-27 15:15     ` Tim Bock
  1 sibling, 0 replies; 15+ messages in thread
From: Tim Bock @ 2010-01-27 15:15 UTC (permalink / raw)
  To: Ryan Wagoner; +Cc: linux-raid

On Tue, 2010-01-26 at 19:19 -0500, Ryan Wagoner wrote:
> On Fri, Jan 22, 2010 at 11:32 AM, Goswin von Brederlow
> <goswin-v-b@web.de> wrote:
> > Tim Bock <jtbock@daylight.com> writes:
> >
> >> Hello,
> >>
> >>       I built a raid-1 + lvm setup on a Dell 2950 in December 2008.  The OS
> >> disk (ubuntu server 8.04) is not part of the raid.  Raid is 4 disks + 1
> >> hot spare (all raid disks are sata, 1TB Seagates).
> >>
> >>       Worked like a charm for ten months, and then had some kind of disk
> >> problem in October which drove the load average to 13.  Initially tried
> >> a reboot, but system would not come all of the way back up.  Had to boot
> >> single-user and comment out the RAID entry.  System came up, I manually
> >> failed/removed the offending disk, added the RAID entry back to fstab,
> >> rebooted, and things proceeded as I would expect.  Replaced offending
> >> drive.
> >
> > If a drive goes crazy without actually dying then Linux can spend a
> > long time trying to get something from the drive. The controller chip
> > can go crazy or the driver itself can have a bug and lock up. All those
> > things are below the raid level, and if they halt your system then raid
> > cannot do anything about it.
> >
> > Only when a drive goes bad and the lower layers report an error to the
> > raid level can raid cope with the situation, remove the drive and keep
> > running. Unfortunately there seems to be a loose correlation between
> > the cost of the controller (chip) and the likelihood of a failing disk
> > locking up the system, i.e. the cheap onboard SATA chips on desktop
> > systems do that more often than expensive server controllers. But that
> > is just a loose relationship.
> >
> > MfG
> >        Goswin
> >
> > PS: I've seen hardware raid boxes lock up too so this isn't a drawback
> > of software raid.
> 
> You need to be using drives designed for RAID use with TLER
> (time-limited error recovery). When the drive encounters an error,
> instead of attempting to read the data for an extended period of time,
> it just gives up so the RAID can take care of it.
>
> For example, I had a SAS drive start to fail on a hardware RAID server.
> Every time it hit a bad spot on the drive you could tell: the system
> would pause for a brief second and only that drive's light was on. The
> drive gave up and the RAID determined the correct data. It ran fine
> like this until I was able to replace the drive the next day.
> 

Interesting, but from my reading of the TLER data sheet, I don't think
this would have helped me.  The failing drive tied up the system for
hours (happened at 2am during backups, and system still unresponsive
with a 12+ load avg when I arrived at 7am), and in fact did not recover
until I rebooted single-user and commented out the RAID entry.  My
understanding is that TLER prevents the raid from failing the drive too
soon (because of unresponsiveness).  But in my case, I would have been
*happy* if the drive had been failed automatically, rather than
effectively bringing down the system.  So my understanding is that TLER
wouldn't have done anything for me, though it is certainly possible I've
misunderstood something.

Thanks,
Tim


> Ryan




* Re: Question about raid robustness when disk fails
  2010-01-27 10:43             ` Asdo
@ 2010-01-27 15:34               ` Goswin von Brederlow
  2010-01-28 11:52                 ` Michael Evans
  0 siblings, 1 reply; 15+ messages in thread
From: Goswin von Brederlow @ 2010-01-27 15:34 UTC (permalink / raw)
  To: Asdo; +Cc: Goswin von Brederlow, linux-raid

Asdo <asdo@shiftmail.org> writes:

> Goswin von Brederlow wrote:
>>> Is it possible to cancel a SATA/SCSI command that is being executed by
>>> the drive?
>>> (it's probably feasible only with NCQ disabled anyway, but it's easy
>>> to disable NCQ)
>>>
>>
>> Do you want to do that? I would rather have the drive keep trying and
>> return an error if it can't read so the raid layer rewrites the blocks
>> causing it to be remapped. I do not want to wait for that but I want it
>> to happen.
>>
> So you want that to happen in the background?
> There is not that much benefit in having that happen in the background, imho.
> Why not just have an error returned after a timeout, with the normal MD
> read-error-recovery procedure kicking in? (Recomputation from parity and
> rewrite of the damaged block.)

Because the drive might just have had a seek error and need to reposition
its head. It might have been accessed on another partition and have a
read error there taking time. Or there might just be multiple reads on
the partition. The drive taking long doesn't mean THIS read is broken.

If you kick off a read-error-recovery and get another error on another
drive, then your raid will be down as well. Better not risk that.

>>> It's a pity we have to rely on TLER, this narrows the choice of drives
>>> a lot...
>>>
>>
>> I don't. I just acknowledge the limitation and accept the downtime
> The time might be so long that MD or the controller can drop the
> entire drive.
> It didn't happen to me but I think I read something like this on this ML...

Downtime as in: when I came home in the evening I had to shut down the
system hard and remove one drive at a time until it would boot again.

If it just hangs for 5 minutes until it kicks a drive but then continues
running, I still call that a success.

>> to
>> find and remove a broken but not properly failed disk. I use raid so I
>> don't lose my data when a disk fails, not primarily for availability.
>> So far I had one case in 10 years where a failing disk took down my
>> system.
>>

MfG
        Goswin


* Re: Question about raid robustness when disk fails
  2010-01-27 15:34               ` Goswin von Brederlow
@ 2010-01-28 11:52                 ` Michael Evans
  0 siblings, 0 replies; 15+ messages in thread
From: Michael Evans @ 2010-01-28 11:52 UTC (permalink / raw)
  To: Goswin von Brederlow; +Cc: Asdo, linux-raid

On Wed, Jan 27, 2010 at 7:34 AM, Goswin von Brederlow <goswin-v-b@web.de> wrote:
> Asdo <asdo@shiftmail.org> writes:
> If you kick off a read-error-recovery and get another error on another
> drive, then your raid will be down as well. Better not risk that.
>

I mostly only disagree with this point; everything else is more a
choice of tuning.  Different applications have different desires.

If a read is late, it might be a good idea to force a full stripe
recheck and alert the administrator about the latency/failure.

Current raid levels do not have a way of validating blocks
individually or even as part of a larger data set other than the
stripe; raid 5 has only one set of redundant data, but no way of
determining for sure which data-unit is bad.  Likely the 'slow' drive
should be presumed bad, and the rest of the stripe recalculated.  If
the drive returned data it should be compared to the calculation's
result.  If the data matches then it managed a clean read, but a
re-write should be issued anyway to ensure that the data really is
over-written.  The problem is the case where the data doesn't match,
but we lack another parity chunk (to validate against the computed
recovery data) due to using a single recovery stripe and no
validation.

Suddenly we have two potentially valid solutions and no way of
determining which is correct.  The question is which source you
believe is less likely to have an error: a single potentially faulty
drive that has returned a maybe error-corrected read (which has
certain odds of being correct), or a set of other drives that happened
to return data a little more quickly, but which all also have an
inherent risk (though less /per drive/) of error.  My gut reaction is
that the more nominally timed and reacting drives /likely/ are the
correct source, but that there is still an unquantified risk of
failure.

Of course, this is also a slightly moot point; in the above model we'd
have received one or the other first and already passed it to the next
layer.  A simpler approach would be to presume a soft failure on
timeout and unconditionally follow the re-computation path; only
dropping it if one of those drives also had an error or timeout.



The highly risk-averse or truly paranoid might willingly sacrifice
just a little more storage to 'make extra sure' that the data is
correct, which would greatly simplify the risks outlined above.

Going with the default chunk size of 64k, that's 128 x 512-byte sectors
per chunk.  One extra 512-byte sector of checksums covers 512 / 4 = 128
chunks for 32-bit checksums, 512 / 16 = 32 chunks for 128-bit checksums,
or 16 chunks for 256-bit ones.  That is a reduction in capacity,
respectively (and truncated, not rounded), to 99.993%, 99.975%, and
99.951% of the otherwise usable storage space.  (Using
(SectorsPerChunk*ChunksThatFit)/(SectorsPerChunk*ChunksThatFit+1) to
give the data that would fit in the same-sized 'old space', but first
making the 'old space' one sector larger to simplify the calculation.
Obviously the offsets and packing change if you choose a larger sector
size but keep the chunk size the same.)

With multi-core systems I don't see a major downside on the extra
processing workload, but I do see the disk bottle-neck as a downside.
The most natural read pattern would be to slurp up the remaining
sectors to the CRC payload on each drive.  For large streamed files
the effect is likely good.  For databases and small-file tasks that
cache poorly this would probably be bad.

Just arbitrarily making the chunk one sector larger only drops the
storage ratio to 99.224% and eliminates that problem (plus provides
the remainder of those sectors to dedicate to additional payload).
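
For anyone who wants to check those figures, a quick sketch (Python,
using the 64k chunk and 512-byte sector assumed above):

#!/usr/bin/env python3
# Sanity check of the capacity figures above: one extra 512-byte sector of
# checksums per group of 64 KiB chunks, plus the "chunk grown by one sector"
# variant.
SECTOR = 512
CHUNK = 64 * 1024
SECTORS_PER_CHUNK = CHUNK // SECTOR          # 128

for bits in (32, 128, 256):
    chunks_that_fit = SECTOR // (bits // 8)  # checksums per 512-byte sector
    data = SECTORS_PER_CHUNK * chunks_that_fit
    print("%3d-bit checksums: %.4f%% of raw capacity"
          % (bits, 100.0 * data / (data + 1)))

print("chunk grown by one sector: %.4f%%" % (100.0 * 128 / 129))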

The error detection/recovery payload could better be expressed as
taking the minimum (and most frequently occurring) form of device
block size; but tunable up from there just as chunk size is.  This way
it could work for media that doesn't suffer seek time, but does have
data-alignment issues.


From a logical view, both the existing MD/RAID and these additional
validity/recovery methods (likely what the unused space in the per-chunk
checksum storage model could be filled with) should exist in some kind
of block device reliability layer.  The type of reliability that MD
currently provides would be a simple whole-partition/device approach.
The kind of reliability that BTRFS and ZFS aim for is a more granular
approach.  This is obviously a place where code and complexity can be
reduced by using a common interface and consolidating code.  Additional
related improvements might also be exposed via that interface.

Is that something others would value as a valid contribution?  I'm
actually thinking of looking into this, but don't really want to
expend the effort if it's unlikely to be anything more than a local
patch trying to anchor me to the past (when I'm not at least being
paid to continually port it forward).



Thread overview: 15+ messages
2010-01-08 17:39 Question about raid robustness when disk fails Tim Bock
2010-01-22 16:32 ` Goswin von Brederlow
2010-01-25 16:22   ` Tim Bock
2010-01-25 17:51     ` Goswin von Brederlow
2010-01-25 18:12       ` Michał Sawicz
2010-01-26  7:29         ` Goswin von Brederlow
2010-01-27  0:19   ` Ryan Wagoner
2010-01-27  4:22     ` Michael Evans
2010-01-27  9:04       ` Goswin von Brederlow
2010-01-27  9:22         ` Asdo
2010-01-27 10:25           ` Goswin von Brederlow
2010-01-27 10:43             ` Asdo
2010-01-27 15:34               ` Goswin von Brederlow
2010-01-28 11:52                 ` Michael Evans
2010-01-27 15:15     ` Tim Bock
