From: "Austin S. Hemmelgarn" <ahferroin7@gmail.com>
To: Adam Borowski <kilobyte@angband.pl>
Cc: Zoltan <zoltan1980@gmail.com>, linux-btrfs@vger.kernel.org
Subject: Re: Is it safe to use btrfs on top of different types of devices?
Date: Wed, 18 Oct 2017 07:30:55 -0400 [thread overview]
Message-ID: <213a404f-90e6-a3f8-4867-4e9fcf24426c@gmail.com> (raw)
In-Reply-To: <20171017202135.xdop4eko6utircmz@angband.pl>
On 2017-10-17 16:21, Adam Borowski wrote:
> On Tue, Oct 17, 2017 at 03:19:09PM -0400, Austin S. Hemmelgarn wrote:
>> On 2017-10-17 13:06, Adam Borowski wrote:
>>> The thing is, reliability guarantees required vary WILDLY depending on your
>>> particular use cases. On one hand, there's "even a one-minute downtime
>>> would cost us mucho $$$s, can't have that!" -- on the other, "it died?
>>> Okay, we got backups, lemme restore it after the weekend".
>> Yes, but if you are in the second case, you arguably don't need replication,
>> and would be better served by improving the reliability of your underlying
>> storage stack than trying to work around its problems. Even in that case,
>> your overall reliability is still constrained by the least reliable
>> component (in more idiomatic terms, 'a chain is only as strong as its
>> weakest link').
>
> MD can handle this case well; there's no reason btrfs shouldn't do that too.
> A RAID is not akin to a serially connected chain, it's a parallel-connected
> chain: while pieces of the broken second chain hanging down from the first
> don't make it strictly more resilient than having just a single chain, in
> the general case it _is_ more reliable even if the other chain is weaker.
My chain analogy is supposed to be relating to the storage stack as a
whole, RAID is a single link in the chain, with whatever filesystem
above it, and whatever storage drivers and hardware below.
>
> Don't we have a patchset that deals with marking a device as failed at
> runtime floating on the mailing list? I did not look at those patches yet,
> but they are a step in this direction.
There were some disagreements on whether the device should be released
(that is, the node closed) immediately when we know it's failed, or
should be held open until remount.
>
>> Using replication with a reliable device and a questionable device is
>> essentially the same as trying to add redundancy to a machine by adding an
>> extra linkage that doesn't always work and can get in the way of the main
>> linkage it's supposed to be protecting from failure. Yes, it will work most
>> of the time, but the system is going to be less reliable than it is without
>> the 'redundancy'.
>
> That's the current state of btrfs, but the design is sound, and reaching
> more than parity with MD is a matter of implementation.
Indeed, however MD is still not perfectly reliable in this situation
(though it is substantially better than BTRFS at the moment).
>
>>> Thus, I switched the machine to NBD (albeit it sucks on 100Mbit eth). Alas,
>>> the network driver allocates memory with GFP_NOIO which causes NBD
>>> disconnects (somehow, this doesn't ever happen on swap where GFP_NOIO would
>>> be obvious but on regular filesystem where throwing out userspace memory is
>>> safe). The disconnects happen around once per week.
>> Somewhat off-topic, but you might try looking at ATAoE as an alternative;
>> it's more reliable in my experience (if you've got a reliable network) and
>> gives better performance (there's less protocol overhead than NBD, and it
>> runs on top of layer 2 instead of layer 4).
>
> I've tested it -- not on the Odroid-U2 but on Pine64 (fully working GbE).
> NBD delivers 108MB/sec in a linear transfer, ATAoE is lucky to break
> 40MB/sec, same target (Qnap-253a, spinning rust), both in default
> configuration without further tuning. NBD is over IPv6 for that extra 20
> bytes per packet overhead.
Interesting, I've seen the exact opposite in terms of performance.
>
> Also, NBD can be encrypted or arbitrarily routed.
Yes, though if you're on a local network, neither should matter :).
>
>>> It's a single-device filesystem, thus disconnects are obviously fatal. But,
>>> they never caused even a single bit of damage (as scrub goes), thus proving
>>> btrfs handles this kind of disconnects well. Unlike times past, the kernel
>>> doesn't get confused thus no reboot is needed, merely an unmount, "service
>>> nbd-client restart", mount, restart the rebuild jobs.
>> That's expected behavior though. _Single_ device BTRFS has nothing to get
>> out of sync most of the time; the only time there's any possibility of an
>> issue is when the system dies after writing the first copy of a block that's
>> in a DUP profile chunk, but even that is not very likely to cause problems
>> (you'll just lose at most the last <commit-time> worth of data).
>
> How come? In a DUP profile, the writes are: chunk 1, chunk 2, barrier,
> superblock. The two prior writes may be arbitrarily reordered -- both
> between each other or even individual sectors inside the chunks, but unless
> the disk lies about barriers, there's no way to have any corruption, thus
> running scrub is not needed.
If the device dies after writing chunk 1 but before the barrier, you end
up needing scrub. How much of a failure window is present is largely a
function of how fast the device is, but there is a failure window there.
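To make that window concrete, here's a rough sketch (a simplified model, not
the actual btrfs on-disk format or code): two data copies, a barrier, then a
superblock update. A crash after the first copy but before the barrier leaves
the two copies out of sync, and scrub finds the mismatching one by checksum:

```python
# Hypothetical model of DUP write ordering: copy 1, copy 2, barrier,
# superblock. All names here are illustrative, not btrfs internals.
import zlib

def crc(data: bytes) -> int:
    return zlib.crc32(data)

class Disk:
    def __init__(self):
        self.copy1 = b"old"
        self.copy2 = b"old"
        self.super_csum = crc(b"old")  # checksum committed by the superblock

def write_dup(disk: Disk, data: bytes, crash_after_copy1: bool = False):
    disk.copy1 = data           # first copy hits the media
    if crash_after_copy1:
        return                  # power loss inside the failure window
    disk.copy2 = data           # second copy
    # barrier: both copies are durable before the superblock is updated
    disk.super_csum = crc(data)

def scrub(disk: Disk) -> list:
    # Flag any copy that doesn't match the committed checksum.
    return [name for name in ("copy1", "copy2")
            if crc(getattr(disk, name)) != disk.super_csum]

disk = Disk()
write_dup(disk, b"new", crash_after_copy1=True)
# The superblock still points at the old generation, so the half-written
# new copy is the one that mismatches and gets repaired from the other.
print(scrub(disk))  # ['copy1']
```

No data from the committed generation is lost either way; scrub just has
work to do if the crash lands inside that window.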
>
>> The moment you add another device though, that simplicity goes out the
>> window.
>
> RAID1 doesn't seem less simple to me: if the new superblock has been
> successfully written on at least one disk, barriers imply that at least one
> copy is correct. If the other disk was out to lunch before the final
> unmount, those blocks will be degraded, but that's no different from one
> pair of DUP blocks being corrupted.
It's not guaranteed to be just those blocks though; it could be anything
up to and including the superblock. It's the handling needed for that
situation, as well as the possibility of the devices being so far out of
sync that generations present on the old device no longer have any
corresponding data on the new one, that makes replication complex in BTRFS.
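The core of that decision can be sketched roughly like this (an illustrative
toy, not btrfs code): each device's superblock carries a transaction
generation, and on a multi-device mount the stale device is the one whose
generation lags behind the newest:

```python
# Toy model: map each device to the generation of its last committed
# superblock, then pick the authoritative one and flag stragglers.
def pick_authoritative(devices: dict) -> tuple:
    """devices maps device name -> superblock generation.
    Returns (newest device, list of stale devices needing resync)."""
    newest = max(devices, key=devices.get)
    stale = [d for d, gen in devices.items() if gen < devices[newest]]
    return newest, stale

# /dev/sdb missed many commits while it was "out to lunch":
newest, stale = pick_authoritative({"/dev/sda": 1042, "/dev/sdb": 987})
print(newest, stale)  # /dev/sda ['/dev/sdb']
```

The hard part glossed over here is what to do with the stale device: a small
lag can be caught up incrementally, but a large one may mean none of its
generations correspond to live data any more, forcing a full rebuild.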
>
> With RAID5, there be dragons, but that's due to implementation deficiencies;
> if an upper layer says "hey you downstairs, the block you gave me has a
> wrong csum/generation, try to recover it", there's no reason it shouldn't
> be able to reliably recover it in all cases that don't involve a double
> (RAID5) or triple (RAID6) failure.
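The single-failure recovery being described is plain parity reconstruction;
a minimal sketch (illustrative only, not the btrfs RAID5 code path):

```python
# When a block fails its csum/generation check, regenerate it from the
# surviving stripe members plus parity. Single-failure RAID5 case only;
# RAID6 would need a second, independent syndrome.
from functools import reduce

def xor(blocks):
    # Byte-wise XOR across equal-length blocks.
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def recover(stripe_blocks, parity, bad_index):
    # Parity is the XOR of all data blocks, so XORing parity with the
    # surviving blocks regenerates the corrupt one.
    survivors = [b for i, b in enumerate(stripe_blocks) if i != bad_index]
    return xor(survivors + [parity])

data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor(data)
data[1] = b"!!!!"  # this block fails its csum/generation check
print(recover(data, parity, bad_index=1))  # b'BBBB'
```

The math always works for a single failure; the implementation deficiencies
are about *when* the csum check is consulted, not whether recovery is possible.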