From: David Brown <david@westcontrol.com>
To: linux-raid@vger.kernel.org
Subject: Re: high throughput storage server?
Date: Fri, 25 Feb 2011 09:20:41 +0100
Message-ID: <ik7oq8$dao$1@dough.gmane.org>
In-Reply-To: <4D66EA1B.7010801@hardwarefreak.com>

On 25/02/2011 00:30, Stan Hoeppner wrote:
> David Brown put forth on 2/24/2011 5:24 AM:
>
>> My understanding of RAID controllers (software or hardware) is that they
>> consider a drive to be either "good" or "bad". So if you get an URE,
>> the controller considers the drive "bad" and ejects it from the array.
>> It doesn't matter if it is an URE or a total disk death.
>>
>> Maybe hardware RAID controllers do something else here - you know far
>> more about them than I do.
>
> Most HBA and SAN RAID firmware I've dealt with kicks drives offline
> pretty quickly at any sign of an unrecoverable error. I've also seen
> drives kicked simply because the RAID firmware didn't like the drive
> firmware. I have a fond (sarcasm) memory of DAC960s kicking ST118202
> 18GB Cheetahs offline left and right in the late 90s. The fact I still
> recall that Seagate drive# after 10+ years should be informative
> regarding the severity of that issue. :(
>
>> The idea of the md raid "bad block list" is that there is a medium
>> ground - you can have disks that are "mostly good".
>
> Everything I've read and seen in the last few years regarding hard disk
> technology says that platter manufacturing quality and tolerance are so
> high on modern drives that media defects are rarely, if ever, seen by
> the customer, as they're mapped out at the factory. The platters don't
> suffer wear effects, but the rest of the moving parts do. From what
> I've read/seen, "media" errors observed in the wild today are actually
> caused by mechanical failures due to physical wear on various moving
> parts: VC actuator pivot bearing/race, spindle bearings, etc.
> Mechanical failures tend to show mild "media errors" in the beginning
> and get worse with time as moving parts go further out of alignment.
> Thus, as I see it, any UREs on a modern drive represent a "Don't trust
> me--Replace me NOW" flag. I could be all wrong here, but this is what
> I've read, and seen in manufacturer videos from WD and Seagate.
>
That's very useful information to know - I don't go through nearly
enough disks myself to be able to judge these things (and while I read
lots of stuff on the web, I don't see /everything/!). Thanks.

However, this still sounds to me like a drive with UREs is dying but not
dead yet. Assuming you are correct here (and I've no reason to doubt
that - unless someone else disagrees), it means that a disk with UREs
will be dying quickly rather than dying slowly. But if the non-URE data
on the disk can be used to make a rebuild faster and safer, then surely
that is worth doing?

It may be that when a disk has had an URE and therefore an entry in the
bad block list, then it should be marked read-only and only used for
data recovery and "hot replace" rebuilds. But until it completely
croaks, it is still better than no disk at all while the rebuild is in
progress.
>> Supposing you have a RAID6 array, and one disk has died completely. It
>> gets replaced by a hot spare, and rebuild begins. As the rebuild
>> progresses, disk 1 gets an URE. Traditional handling would mean disk 1
>> is ejected, and now you have a double-degraded RAID6 to rebuild. When
>> you later get an URE on disk 2, you have lost data for that stripe - and
>> the whole raid is gone.
>>
>> But with bad block lists, the URE on disk 1 leads to a bad block entry
>> on disk 1, and the rebuild continues. When you later get an URE on disk
>> 2, it's no problem - you use data from disk 1 and the other disks. URE's
>> are no longer a killer unless your set has no redundancy.
>
> They're not a killer with RAID 6 anyway, are they? You can be
> rebuilding one failed drive and suffer UREs left and right, as long as
> you don't get two of them on two drives simultaneously in the same
> stripe block read. I think that's right. Please correct me if not.
>
That's true as long as UREs do not cause that disk to be kicked out of
the array. With bad block support in md raid, a disk suffering an URE
will /not/ be kicked out. But my understanding (from what you wrote
above) was that with hardware raid controllers, an URE /would/ cause a
disk to be kicked out. Or am I mixing something up again?
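
The difference between the two policies can be sketched in a few lines of Python. This is only a toy model, not md's actual logic: the function name `lost_stripes`, the disk numbering and the stripe numbers are all made up for illustration, and "eject on URE" is simplified to "treat any disk that has had an URE as wholly missing".

```python
RAID6_TOLERANCE = 2  # RAID6 can reconstruct a stripe with up to two missing blocks


def lost_stripes(n_stripes, dead_disks, ure_map, eject_on_ure):
    """Return the list of stripes that cannot be reconstructed.

    dead_disks:   set of disk ids that have failed completely
    ure_map:      dict of disk id -> set of stripe numbers with an unreadable block
    eject_on_ure: True models "kick the drive on any URE",
                  False models a per-disk bad block list
    """
    missing_everywhere = set(dead_disks)
    if eject_on_ure:
        # Any disk with at least one URE is treated as wholly failed.
        missing_everywhere |= {d for d, bad in ure_map.items() if bad}

    lost = []
    for stripe in range(n_stripes):
        missing = len(missing_everywhere)
        if not eject_on_ure:
            # Only the stripes listed as bad are missing on a surviving disk.
            missing += sum(1 for d, bad in ure_map.items()
                           if d not in dead_disks and stripe in bad)
        if missing > RAID6_TOLERANCE:
            lost.append(stripe)
    return lost


# Disk 0 is dead; disk 1 has an URE in stripe 10, disk 2 in stripe 500.
ures = {1: {10}, 2: {500}}
print(len(lost_stripes(1000, {0}, ures, eject_on_ure=True)))
print(len(lost_stripes(1000, {0}, ures, eject_on_ure=False)))
```

With the eject policy every stripe ends up with three missing blocks, while with a bad block list a stripe is only lost if the UREs on two disks happen to fall in the same stripe.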
>> URE's are also what I worry about with RAID1 (including RAID10)
>> rebuilds. If a disk has failed, you are right in saying that the
>> chances of the second disk in the pair failing completely are tiny. But
>> the chances of getting an URE on the second disk during the rebuild are
>> not negligible - they are small, but growing with each new jump in disk
>> size.
>
> I touched on this in my other reply, somewhat tongue-in-cheek mentioning
> 3 leg and 4 leg RAID10. At current capacities and URE ratings I'm not
> worried about it with mirror pairs. If URE ratings haven't increased
> substantially by the time our avg drive capacity hits 10TB I'll start to
> worry.
>
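
To put rough numbers on the URE risk during a mirror rebuild, here is a back-of-the-envelope sketch. It assumes a rated error rate of 1 unrecoverable error per 10^14 bits read (an assumed figure of the kind quoted on drive datasheets, not a measurement) and independent errors at a constant rate, which real drives need not follow. The function name `p_ure_during_read` and the capacities are illustrative only.

```python
import math


def p_ure_during_read(bytes_read, ure_per_bit=1e-14):
    """Probability of at least one URE while reading bytes_read bytes,
    assuming independent bit errors at a constant rate (a simplification)."""
    bits = 8 * bytes_read
    # 1 - (1 - r)^bits, computed in a numerically stable way for tiny r
    return -math.expm1(bits * math.log1p(-ure_per_bit))


# Reading one whole surviving mirror of the given size (decimal terabytes).
for tb in (1, 2, 10):
    print(f"{tb:>2} TB mirror rebuild: {p_ure_during_read(tb * 1e12):.1%}")
```

Under these assumptions the probability grows with capacity, since the rated error rate stays fixed while the amount of data read per rebuild goes up.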
> Somewhat related to this, does anyone else here build their arrays from the
> smallest cap drives they can get away with, preferably single platter
> models when possible? I adopted this strategy quite some time ago,
> mostly to keep rebuild times to a minimum, keep rotational mass low to
> consume the least energy since using more drives, but also with the URE
> issue in the back of my mind. Anecdotal evidence tends to point to the
> trend of OPs going with fewer gargantuan drives instead of many smaller
> ones. Maybe that's just members of this list, whose criteria may be
> quite different from the typical enterprise data center.
>
>> With md raid's future bad block lists and hot replace features, then an
>> URE on the second disk during rebuilds is only a problem if the first
>> disk has died completely - if it only had a small problem, then the "hot
>> replace" rebuild will be able to use both disks to find the data.
>
> What happens when you have multiple drives at the same or similar bad
> block count?
>
You replace them all. Once a drive reaches a certain number of bad
blocks (and that threshold may be just 1, or it may be more), you should
replace it. There isn't any reason not to do hot replace rebuilds on
multiple drives simultaneously, if you've got the drives and drive bays
on hand - apart from at the bad blocks, the replacement is just a
straight disk to disk copy.
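
As a rough sketch of what "straight copy except at the bad blocks" means: the code below copies each block from the old disk and only falls back to reconstruction from the other disks where the bad block list says the old disk is unreadable. It assumes a single XOR parity (RAID5-style) to keep the reconstruction short; md's RAID6 uses a second, different syndrome, and the names `xor_blocks` and `hot_replace` are made up here, not md functions.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def hot_replace(old_disk, bad_blocks, peers):
    """Build the replacement disk's contents block by block.

    old_disk:   list of bytes blocks on the disk being replaced
    bad_blocks: set of block indexes recorded as unreadable on old_disk
    peers:      block lists of all the other disks (data and parity) in a
                single-parity array, so XOR of the peers gives the missing block
    """
    new_disk = []
    for i, block in enumerate(old_disk):
        if i in bad_blocks:
            # Unreadable on the old disk: reconstruct from the other disks.
            new_disk.append(xor_blocks([p[i] for p in peers]))
        else:
            # Readable: plain disk to disk copy.
            new_disk.append(block)
    return new_disk


# Two data disks plus parity; block 1 of disk 0 is in its bad block list.
d0 = [b"\x01\x02", b"\x03\x04"]
d1 = [b"\x10\x20", b"\x30\x40"]
parity = [xor_blocks([a, b]) for a, b in zip(d0, d1)]
damaged_d0 = [d0[0], b"\x00\x00"]  # contents of the bad block are unknown
print(hot_replace(damaged_d0, {1}, [d1, parity]) == d0)
```

Only the blocks in the bad block list need the other disks at all, which is why the old disk is still worth keeping online until the copy is done.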
>> I know you are more interested in hardware raid than software raid, but
>> I'm sure you'll find some interesting points in Neil's writings. If you
>> don't want to read through the thread, at least read his blog post.
>>
>> <http://neil.brown.name/blog/20110216044002>
>
> Will catch up. Thanks for the blog link.
>