From: "George Spelvin" <linux@horizon.com>
To: adilger@sun.com, linux@horizon.com
Cc: david@lang.hm, linux-doc@vger.kernel.org,
linux-ext4@vger.kernel.org, linux-kernel@vger.kernel.org,
pavel@ucw.cz
Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3:
Date: 1 Sep 2009 21:10:20 -0400
Message-ID: <20090902011020.32110.qmail@science.horizon.com>
In-Reply-To: <20090901161828.GN4197@webber.adilger.int>
>> - I seem to recall that ZFS does replicate metadata.
>
> ZFS definitely does replicate data. At the lowest level it has RAID-1,
> and RAID-Z/Z2, which are pretty close to RAID-5/6 respectively, but with
> the important difference that every write is a full-stripe-width write,
> so that it is not possible for RAID-Z/Z2 to cause corruption due to a
> partially-written RAID parity stripe.
>
> In addition, for internal metadata blocks there are 1 or 2 duplicate
> copies written to different devices, so that in case of a fatal device
> corruption (e.g. double failure of a RAID-Z device) the metadata tree
> is still intact.
Forgive me for implying by omission that ZFS did not replicate data.
What I was trying to point out is that it replicates metadata *more*,
and you can choose among the redundant backups.
> What else is interesting is that in the case of 1-4-bit errors the
> default checksum function can also be used as ECC to recover the correct
> data even if there is no replicated copy of the data.
Interesting. Do you actually see such low-bit-weight errors in
practice? I had assumed that modern disks were complicated enough
that errors would be high-bit-weight miscorrections.
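
To make the "checksum as ECC" idea concrete, here is a minimal sketch of
single-bit correction by brute force: flip each bit in turn and keep the
candidate whose checksum matches the stored one. It uses CRC-32 from
Python's zlib purely as a stand-in (an assumption for illustration; it is
not ZFS's checksum function), and correct_single_bit is a hypothetical
helper name, not anything from ZFS.

```python
import zlib

def correct_single_bit(block, expected_crc):
    """Return the block with at most one bit repaired, or None if that fails."""
    if zlib.crc32(block) == expected_crc:
        return bytes(block)            # already consistent, nothing to repair
    buf = bytearray(block)
    for i in range(len(buf) * 8):
        buf[i // 8] ^= 1 << (i % 8)    # trial flip of bit i
        if zlib.crc32(buf) == expected_crc:
            return bytes(buf)          # this flip restores a matching checksum
        buf[i // 8] ^= 1 << (i % 8)    # undo the flip and try the next bit
    return None                        # not a single-bit error

good = b"hello raid world"
stored_crc = zlib.crc32(good)

damaged = bytearray(good)
damaged[5] ^= 0x10                     # simulate a one-bit error on the medium
repaired = correct_single_bit(bytes(damaged), stored_crc)
```

This only helps if real-world errors are low-bit-weight; a high-bit-weight
miscorrection from the drive would fall straight through to the None case.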
>> One of ZFS's big performance problems is that currently it only checksums
>> the entire RAID stripe, so it always has to read every drive, and doesn't
>> get RAID's IOPS advantage.
>
> Or this is a drawback of the Linux software RAID because it doesn't detect
> the case when the parity is bad before there is a second drive failure and
> the bad parity is used to reconstruct the data block incorrectly (which
> will also go undetected because there is no checksum).
Well, all conventional RAID systems lack block checksums (or, more to
the point, rely on the drive's checksumming), and have this problem.
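
A toy illustration of that failure mode, assuming simple XOR parity over
three data blocks: a partial-stripe write leaves the parity stale, a second
drive then fails, and the rebuild returns garbage that only a separate
per-block checksum would catch. The helper names and the use of SHA-256 are
assumptions for the sketch, not how md or ZFS actually implement anything.

```python
import hashlib
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def reconstruct(surviving, parity):
    """Rebuild the one missing block from the survivors plus parity."""
    return xor_blocks(surviving + [parity])

data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)
checksums = [hashlib.sha256(d).digest() for d in data]

# With consistent parity, losing drive 1 is harmless.
clean_rebuild = reconstruct([data[0], data[2]], parity)

# Partial-stripe write: block 0 is updated, but the crash happens
# before the parity block is rewritten, so parity is now stale.
data[0] = b"ZZZZ"
checksums[0] = hashlib.sha256(data[0]).digest()

# Later, drive 1 fails and is rebuilt from the stale parity.
rebuilt = reconstruct([data[0], data[2]], parity)

silently_wrong = rebuilt != b"BBBB"                             # RAID alone cannot tell
detected = hashlib.sha256(rebuilt).digest() != checksums[1]     # a block checksum can
```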
I was pointing out that ZFS currently doesn't support partial-stripe
*reads*, thus limiting IOPS in random-read applications. But that's
an "implementation detail", not a major architectural issue.
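
As a rough back-of-the-envelope model of the IOPS point (idealized, and
with made-up numbers): if every small random read must touch the whole
stripe to verify its checksum, all drives are busy for each request and the
array delivers about one drive's worth of IOPS; if a read can be satisfied
from a single drive, independent requests spread across the spindles.

```python
def random_read_iops(n_drives, per_drive_iops, full_stripe_reads):
    """Idealized aggregate small-random-read IOPS for an array.

    Ignores parity placement, caching, and queueing; illustration only.
    """
    if full_stripe_reads:
        # Every request occupies every drive, so requests serialize.
        return per_drive_iops
    # Each request occupies one drive, so drives work in parallel.
    return n_drives * per_drive_iops

# Hypothetical array: 6 drives at 100 random-read IOPS each.
whole_stripe = random_read_iops(6, 100, full_stripe_reads=True)
partial_stripe = random_read_iops(6, 100, full_stripe_reads=False)
```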