Re: Question about raid robustness when disk fails

linux-raid.vger.kernel.org archive mirror

From: Michael Evans <mjevans1983@gmail.com>
To: Ryan Wagoner <rswagoner@gmail.com>
Cc: Goswin von Brederlow <goswin-v-b@web.de>,
	Tim Bock <jtbock@daylight.com>,
	linux-raid@vger.kernel.org
Subject: Re: Question about raid robustness when disk fails
Date: Tue, 26 Jan 2010 20:22:56 -0800
Message-ID: <4877c76c1001262022p369eac60s639d87cad743ff94@mail.gmail.com>
In-Reply-To: <7d86ddb91001261619kbb77697t2660e5b8cc44535d@mail.gmail.com>

On Tue, Jan 26, 2010 at 4:19 PM, Ryan Wagoner <rswagoner@gmail.com> wrote:
> On Fri, Jan 22, 2010 at 11:32 AM, Goswin von Brederlow
> <goswin-v-b@web.de> wrote:
>> Tim Bock <jtbock@daylight.com> writes:
>>
>>> Hello,
>>>
>>>       I built a RAID-1 + LVM setup on a Dell 2950 in December 2008.  The OS
>>> disk (Ubuntu Server 8.04) is not part of the RAID.  The RAID is 4 disks + 1
>>> hot spare (all RAID disks are SATA, 1TB Seagates).
>>>
>>>       Worked like a charm for ten months, and then had some kind of disk
>>> problem in October which drove the load average to 13.  Initially tried
>>> a reboot, but system would not come all of the way back up.  Had to boot
>>> single-user and comment out the RAID entry.  System came up, I manually
>>> failed/removed the offending disk, added the RAID entry back to fstab,
>>> rebooted, and things proceeded as I would expect.  Replaced offending
>>> drive.
>>
>> If a drive goes crazy without actually dying then Linux can spend a
>> long time trying to get something from the drive. The driver chip can
>> go crazy or the driver itself can have a bug and lock up. All those
>> things are below the raid level and if they halt your system then raid
>> can not do anything about it.
>>
>> Only when a drive goes bad and the lower layers report an error to the
>> raid level can raid cope with the situation, remove the drive and keep
>> running. Unfortunately there seems to be a loose correlation between
>> cost of the controller (chip) and the likelihood of a failing disk
>> locking up the system. I.e. the cheap onboard SATA chips on desktop
>> systems do that more often than expensive server controllers. But that
>> is just a loose relationship.
>>
>> MfG
>>        Goswin
>>
>> PS: I've seen hardware raid boxes lock up too so this isn't a drawback
>> of software raid.
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>
> You need to be using drives designed for RAID use with TLER (time
> limited error recovery). When the drive encounters an error, instead of
> attempting to read the data for an extended period of time, it just
> gives up so the RAID can take care of it.
>
> For example I had a SAS drive start to fail on a hardware RAID server.
> Every time it hit a bad spot on the drive you could tell the system
> would pause for a brief second as only that drive light was on. The
> drive gave up and the RAID determined the correct data. It ran fine
> like this until I was able to replace the drive the next day.
>
> Ryan
>

Why doesn't the kernel issue a pessimistic alternate 'read' (on the
other drives needed to obtain the data) if the preferred read is
late?  For time-sensitive/worst-case buffering it would be more useful
to be able to customize dynamically when to 'give up'.
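
To make the question concrete, here is a rough user-space sketch of
the behaviour I have in mind. It is not how md works and not a kernel
interface; hedged_read, the two simulated mirror functions and the
deadline value are all hypothetical names made up for illustration.
The preferred read is tried first, and only if it is late (or fails)
is the same data requested from the alternate source.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def hedged_read(primary, fallback, deadline_s):
    """Try primary(); if it is late or fails, also try fallback().

    Returns (source_name, data) for the first read that succeeds.
    Both arguments are hypothetical zero-argument callables standing
    in for "read this block from drive A" and "rebuild it from the
    other drives".
    """
    pool = ThreadPoolExecutor(max_workers=2)
    try:
        first = pool.submit(primary)
        try:
            # The caller chooses when to 'give up' on the preferred path.
            return "primary", first.result(timeout=deadline_s)
        except Exception:
            pass  # late (timeout) or failed outright: start the alternate
        second = pool.submit(fallback)
        for fut in as_completed([first, second]):
            if fut.exception() is None:
                return ("primary" if fut is first else "fallback"), fut.result()
        raise IOError("both reads failed")
    finally:
        # Do not block on the slow read; it finishes in the background.
        pool.shutdown(wait=False)


def healthy_mirror():
    """Simulated drive that answers immediately."""
    return b"block"


def struggling_mirror():
    """Simulated drive stuck in its own error recovery for a while."""
    time.sleep(0.5)
    return b"block"


if __name__ == "__main__":
    print(hedged_read(struggling_mirror, healthy_mirror, deadline_s=0.05))
```

The slow read is not cancelled here, only ignored, which is roughly
the situation with a real drive that is still busy retrying.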


Thread overview: 15+ messages
2010-01-08 17:39 Question about raid robustness when disk fails Tim Bock
2010-01-22 16:32 ` Goswin von Brederlow
2010-01-25 16:22   ` Tim Bock
2010-01-25 17:51     ` Goswin von Brederlow
2010-01-25 18:12       ` Michał Sawicz
2010-01-26  7:29         ` Goswin von Brederlow
2010-01-27  0:19   ` Ryan Wagoner
2010-01-27  4:22     ` Michael Evans [this message]
2010-01-27  9:04       ` Goswin von Brederlow
2010-01-27  9:22         ` Asdo
2010-01-27 10:25           ` Goswin von Brederlow
2010-01-27 10:43             ` Asdo
2010-01-27 15:34               ` Goswin von Brederlow
2010-01-28 11:52                 ` Michael Evans
2010-01-27 15:15     ` Tim Bock
