From: Michael Evans <mjevans1983@gmail.com>
To: Ryan Wagoner <rswagoner@gmail.com>
Cc: Goswin von Brederlow <goswin-v-b@web.de>,
Tim Bock <jtbock@daylight.com>,
linux-raid@vger.kernel.org
Subject: Re: Question about raid robustness when disk fails
Date: Tue, 26 Jan 2010 20:22:56 -0800
Message-ID: <4877c76c1001262022p369eac60s639d87cad743ff94@mail.gmail.com>
In-Reply-To: <7d86ddb91001261619kbb77697t2660e5b8cc44535d@mail.gmail.com>
On Tue, Jan 26, 2010 at 4:19 PM, Ryan Wagoner <rswagoner@gmail.com> wrote:
> On Fri, Jan 22, 2010 at 11:32 AM, Goswin von Brederlow
> <goswin-v-b@web.de> wrote:
>> Tim Bock <jtbock@daylight.com> writes:
>>
>>> Hello,
>>>
>>> I built a raid-1 + lvm setup on a Dell 2950 in December 2008. The OS
>>> disk (ubuntu server 8.04) is not part of the raid. Raid is 4 disks + 1
>>> hot spare (all raid disks are sata, 1TB Seagates).
>>>
>>> Worked like a charm for ten months, and then had some kind of disk
>>> problem in October which drove the load average to 13. Initially tried
>>> a reboot, but system would not come all of the way back up. Had to boot
>>> single-user and comment out the RAID entry. System came up, I manually
>>> failed/removed the offending disk, added the RAID entry back to fstab,
>>> rebooted, and things proceeded as I would expect. Replaced offending
>>> drive.
>>
>> If a drive goes crazy without actually dying then Linux can spend a
>> long time trying to get something from the drive. The controller chip
>> can go crazy, or the driver itself can have a bug and lock up. All
>> those things are below the raid level, and if they halt your system
>> then raid cannot do anything about it.
>>
>> Only when a drive goes bad and the lower layers report an error to the
>> raid level can raid cope with the situation, remove the drive and keep
>> running. Unfortunately there seems to be a loose correlation between
>> the cost of the controller (chip) and the likelihood of a failing disk
>> locking up the system. I.e. the cheap onboard SATA chips on desktop
>> systems do that more often than expensive server controllers. But that
>> is just a loose relationship.
>>
>> MfG
>> Goswin
>>
>> PS: I've seen hardware raid boxes lock up too so this isn't a drawback
>> of software raid.
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
>
> You need to be using drives designed for RAID use, with TLER
> (time-limited error recovery). When such a drive encounters an error,
> it gives up quickly instead of retrying the read for an extended
> period, so the RAID layer can take care of it.
>
> For example I had a SAS drive start to fail on a hardware RAID server.
> Every time it hit a bad spot on the drive you could tell the system
> would pause for a brief second as only that drive light was on. The
> drive gave up and the RAID determined the correct data. It ran fine
> like this until I was able to replace the drive the next day.
>
> Ryan
>
Why doesn't the kernel issue a pessimistic alternate 'read' (on the
other drives needed to obtain the data) if the ideal path is late?
For time-sensitive or worst-case buffering, it would be more useful
to be able to customize dynamically when to 'give up'.
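
A minimal userspace sketch of that idea, in Python, purely to illustrate
the control flow. This is not how md behaves and it uses no kernel API.
The names hedged_read, slow_disk, rebuild_from_mirror and deadline_s are
all hypothetical, and the "disks" are plain functions standing in for
the preferred drive and for the other drives that could supply the same
data.

```python
import concurrent.futures as cf
import time


def hedged_read(primary, fallback, deadline_s):
    """Try `primary`; if it is late (or fails), also try `fallback`.

    Returns (source, data) for whichever path succeeds first.
    `deadline_s` is the tunable point at which we stop waiting on the
    preferred path alone. All names here are illustrative assumptions.
    """
    pool = cf.ThreadPoolExecutor(max_workers=2)
    try:
        first = pool.submit(primary)
        done, _ = cf.wait([first], timeout=deadline_s)
        if done and first.exception() is None:
            return "primary", first.result()

        # Primary is late or has already failed: start the alternate read.
        second = pool.submit(fallback)
        pending = {second} if done else {first, second}
        while pending:
            done, pending = cf.wait(pending, return_when=cf.FIRST_COMPLETED)
            for fut in done:
                if fut.exception() is None:
                    source = "primary" if fut is first else "fallback"
                    return source, fut.result()
        raise IOError("both read paths failed")
    finally:
        # Do not block on a read that is still stuck in the background.
        pool.shutdown(wait=False)


def slow_disk():
    time.sleep(0.5)  # stands in for a drive stuck in internal error recovery
    return b"data"


def rebuild_from_mirror():
    return b"data"  # stands in for reading the same block from other drives
```

Calling hedged_read(slow_disk, rebuild_from_mirror, deadline_s=0.05)
returns the data from the fallback path without waiting for the slow
drive. Raising or lowering deadline_s is the "when to give up" knob.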
Thread overview: 15+ messages
2010-01-08 17:39 Question about raid robustness when disk fails Tim Bock
2010-01-22 16:32 ` Goswin von Brederlow
2010-01-25 16:22 ` Tim Bock
2010-01-25 17:51 ` Goswin von Brederlow
2010-01-25 18:12 ` Michał Sawicz
2010-01-26 7:29 ` Goswin von Brederlow
2010-01-27 0:19 ` Ryan Wagoner
2010-01-27 4:22 ` Michael Evans [this message]
2010-01-27 9:04 ` Goswin von Brederlow
2010-01-27 9:22 ` Asdo
2010-01-27 10:25 ` Goswin von Brederlow
2010-01-27 10:43 ` Asdo
2010-01-27 15:34 ` Goswin von Brederlow
2010-01-28 11:52 ` Michael Evans
2010-01-27 15:15 ` Tim Bock