From: Russell Coker <russell@coker.com.au>
To: Martin <m_btrfs@ml1.co.uk>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: ditto blocks on ZFS
Date: Mon, 19 May 2014 02:09:34 +1000
Message-ID: <10946613.XrCytCZfuu@xev>
In-Reply-To: <ll7lvd$ulk$1@ger.gmane.org>
On Sat, 17 May 2014 13:50:52 Martin wrote:
> On 16/05/14 04:07, Russell Coker wrote:
> > https://blogs.oracle.com/bill/entry/ditto_blocks_the_amazing_tape
> >
> > Probably most of you already know about this, but for those of you who
> > haven't the above describes ZFS "ditto blocks" which is a good feature we
> > need on BTRFS. The briefest summary is that on top of the RAID
> > redundancy there...
> [... are additional copies of metadata ...]
>
>
> Is that idea not already implemented in effect in btrfs with the way
> that the superblocks are replicated multiple times, ever more times, for
> ever more huge storage devices?
No. If the metadata for the root directory is corrupted then everything is
lost even if the superblock is OK. Corruption at any level of the directory
tree loses every level below it, so corruption of /home would be very
significant, as would corruption of /home/importantuser/major-project.
> The one exception is for SSDs whereby there is the excuse that you
> cannot know whether your data is usefully replicated across different
> erase blocks on a single device, and SSDs are not 'that big' anyhow.
I am not convinced by that argument. While you can't know that it's usefully
replicated you also can't say for sure that replication will never save you.
There will surely be some random factors involved. If dup on an SSD saves
you from 50% of corruption problems, is it worth doing? What if it's 80% or
20%?
I have BTRFS running as the root filesystem on Intel SSDs on four machines
(one of which is a file server with a pair of large disks in a BTRFS RAID-1).
On all of those systems I have dup for metadata; it doesn't take up any
space I need for something else and it might save me.
> So... Your idea of replicating metadata multiple times in proportion to
> assumed 'importance' or 'extent of impact if lost' is an interesting
> approach. However, is that appropriate and useful considering the real
> world failure mechanisms that are to be guarded against?
Firstly it's not my idea, it's the idea of the ZFS developers. Secondly I
started reading about this after doing some experiments with a failing SATA
disk. In spite of having ~14,000 read errors (which sounds like a lot but is
a small fraction of a 2TB disk) the vast majority of the data was readable,
largely due to ~2,000 errors corrected by dup metadata.
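To put "a small fraction" in numbers, here is a rough back-of-envelope
sketch. It assumes each read error hits one 4 KiB block and that "2TB" means
2 * 10^12 bytes; neither figure was measured, and the names are made up for
illustration.

```python
# Rough arithmetic only. BLOCK_BYTES is an assumption, not a measured value.
DISK_BYTES = 2 * 10**12   # "2TB" taken as decimal terabytes
BLOCK_BYTES = 4096        # assumed amount of data affected by one read error
READ_ERRORS = 14_000      # approximate error count from the failing disk


def affected_fraction(errors, block_bytes=BLOCK_BYTES, disk_bytes=DISK_BYTES):
    """Fraction of the disk's capacity touched by `errors` bad blocks."""
    return errors * block_bytes / disk_bytes


if __name__ == "__main__":
    print(f"fraction of disk affected: {affected_fraction(READ_ERRORS):.6%}")
```

Under those assumptions the errors cover about 0.003% of the disk, so the
damage is tiny by volume. It only matters a lot when it lands on metadata.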
> Do you see or measure any real advantage?
Imagine that you have a RAID-1 array where both disks get ~14,000 read errors.
This could happen due to a design defect common to drives of a particular
model or some shared environmental problem. Most errors would be corrected by
RAID-1 but there would be a risk of some data being lost due to both copies
being corrupt. Another possibility is that one disk could entirely die
(although total disk death seems rare nowadays) and the other could have
corruption. If metadata was duplicated in addition to being on both disks
then the probability of data loss would be reduced.
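A toy model of that reduction, as a sketch only: it assumes every copy of a
metadata block fails independently with the same probability, taken here
from the ~14,000 errors on a 2TB disk with 4 KiB blocks. Real failures are
likely to be correlated (same drive model, two dup copies on one device), so
treat the output as optimistic. The function and scenario names are invented
for the example.

```python
def loss_probability(p_bad, copies):
    """Chance that all `copies` independent copies of a block are unreadable."""
    return p_bad ** copies


# Assumed per-copy failure rate: ~14,000 bad blocks out of the 4 KiB blocks
# on a 2 * 10^12 byte disk. This is an illustrative figure, not a measurement.
P_BAD = 14_000 / (2 * 10**12 // 4096)

# Number of surviving copies of a metadata block in each situation.
scenarios = {
    "RAID-1, both disks degraded": 2,
    "RAID-1 + dup, both disks degraded": 4,
    "RAID-1, one disk dead": 1,
    "RAID-1 + dup, one disk dead": 2,
}

for name, copies in scenarios.items():
    print(f"{name}: {loss_probability(P_BAD, copies):.3e} per metadata block")
```

Whatever the real error rate is, each extra copy multiplies the per-block
loss probability by that rate, which is the whole point of having dup on top
of RAID-1.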
Another issue is the case where all drive slots are filled with active drives
(a very common configuration). To replace a disk you have to physically
remove the old disk before adding the new one. If the array is a RAID-1 or
RAID-5 then ANY error during reconstruction loses data. Using dup for
metadata on top of the RAID protections (i.e. the ZFS ditto idea) means that
case doesn't lose you data.
--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/