From: Russell Coker <russell@coker.com.au>
To: Martin <m_btrfs@ml1.co.uk>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: ditto blocks on ZFS
Date: Mon, 19 May 2014 02:09:34 +1000	[thread overview]
Message-ID: <10946613.XrCytCZfuu@xev> (raw)
In-Reply-To: <ll7lvd$ulk$1@ger.gmane.org>

On Sat, 17 May 2014 13:50:52 Martin wrote:
> On 16/05/14 04:07, Russell Coker wrote:
> > https://blogs.oracle.com/bill/entry/ditto_blocks_the_amazing_tape
> > 
> > Probably most of you already know about this, but for those of you who
> > haven't the above describes ZFS "ditto blocks" which is a good feature we
> > need on BTRFS.  The briefest summary is that on top of the RAID
> > redundancy there...
> [... are additional copies of metadata ...]
> 
> 
> Is that idea not already implemented in effect in btrfs with the way
> that the superblocks are replicated multiple times, ever more times, for
> ever more huge storage devices?

No.  If the metadata for the root directory is corrupted then everything is 
lost even if the superblock is OK.  At every level in the directory tree a 
corruption loses everything below that point: corruption of /home would be 
very significant, as would corruption of /home/importantuser/major-project.

> The one exception is for SSDs whereby there is the excuse that you
> cannot know whether your data is usefully replicated across different
> erase blocks on a single device, and SSDs are not 'that big' anyhow.

I am not convinced by that argument.  While you can't know that the data is 
usefully replicated, you also can't say for sure that replication will never 
save you; there are bound to be some random factors involved.  If dup on an 
SSD saves you from 50% of corruption problems, is it worth doing?  What if 
it's 80%, or 20%?

I have BTRFS running as the root filesystem on Intel SSDs on four machines 
(one of which is a file server with a pair of large disks in a BTRFS RAID-1).  
On all of those systems I use dup for metadata: it doesn't take up space I 
need for anything else, and it might save me.
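
For anyone who wants to do the same, here is a rough sketch of the usual 
commands (the device name and mount point are just examples):

  # show the current data and metadata profiles
  btrfs filesystem df /

  # create a new single-device filesystem with duplicated metadata
  mkfs.btrfs -m dup /dev/sdX

  # convert the metadata of an existing filesystem to dup
  btrfs balance start -mconvert=dup /

If I recall correctly mkfs.btrfs doesn't default to dup metadata when it 
detects an SSD, so it has to be requested explicitly.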

> So... Your idea of replicating metadata multiple times in proportion to
> assumed 'importance' or 'extent of impact if lost' is an interesting
> approach. However, is that appropriate and useful considering the real
> world failure mechanisms that are to be guarded against?

Firstly, it's not my idea; it's the idea of the ZFS developers.  Secondly, I 
started reading about this after doing some experiments with a failing SATA 
disk.  In spite of ~14,000 read errors (which sounds like a lot but is a small 
fraction of a 2TB disk) the vast majority of the data was readable, in large 
part because ~2000 of the errors were corrected thanks to dup metadata.
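
As a back-of-envelope illustration only, assuming 4KB blocks and 
(unrealistically) independent failures, the chance of both copies of a dup 
metadata block landing on bad sectors is tiny compared to the chance for a 
single copy:

  awk 'BEGIN { p = 14000 / (2e12 / 4096);
               printf "single copy: %.1e  dup: %.1e\n", p, p * p }'

The real numbers will differ because errors cluster on a failing disk, but it 
gives an idea of why dup metadata could correct ~2000 errors while very 
little was actually lost.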

> Do you see or measure any real advantage?

Imagine that you have a RAID-1 array where both disks get ~14,000 read errors.  
This could happen due to a design defect common to drives of a particular 
model, or due to some shared environmental problem.  Most errors would be 
corrected by RAID-1, but there would be a risk of some data being lost because 
both copies are corrupt.  Another possibility is that one disk could die 
entirely (although total disk death seems rare nowadays) while the other has 
some corruption.  If metadata were duplicated in addition to being mirrored 
across both disks then the probability of data loss would be reduced.
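
Continuing the same crude independence assumption as above, RAID-1 loses a 
block when both mirrored copies are bad, while dup metadata on top of RAID-1 
needs all four copies to be bad:

  awk 'BEGIN { p = 14000 / (2e12 / 4096);
               printf "raid1: %.1e  raid1+dup: %.1e\n", p^2, p^4 }'

The copies are not really independent, so the real gain is smaller than these 
numbers suggest, but the direction is clear.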

Another issue is the common case where every drive slot is filled with an 
active drive.  To replace a disk you have to physically remove the old one 
before adding the new one.  If the array is RAID-1 or RAID-5 then ANY read 
error during reconstruction loses data.  Using dup for metadata on top of the 
RAID protection (i.e. the ZFS ditto idea) means that case doesn't have to 
lose data.

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/

