* RAID1C3 across 3 devices but with only 2 online simultaneously
@ 2020-04-28 22:49 Timothée Jourde
From: Timothée Jourde @ 2020-04-28 22:49 UTC (permalink / raw)
To: linux-btrfs
Hi everyone! I would like to use Btrfs' RAID in a somewhat unorthodox
way, but if it's possible then it would be perfect for my use case.
Suppose I have a RAID1C3 filesystem spanning 3 hard drives A, B and C,
each of the same size. A and B are at different physical locations after
the initial mkfs.
When I want to make a backup, I would plug in drives C and A, and send
them a snapshot. Once in a while, I would go to drive B's location
carrying drive C with me, then I would plug in drives C and B, waiting
for Btrfs to replicate the missing data on drive B.
It would be easier than sending the snapshots across 3 different
filesystems, and most importantly, I could run a scrub with auto-repair
on C+A or C+B.
I did some quick tests with 3 USB keys, and it *seems* to work. But I
don't know how to be notified when the replication is done (which seems
to run automatically/silently in the background). I also don't know
whether it's a reliable method or not.
Any thoughts about this?
Thanks!
Timothée
* Re: RAID1C3 across 3 devices but with only 2 online simultaneously
From: Zygo Blaxell @ 2020-04-29 0:25 UTC (permalink / raw)
To: Timothée Jourde; +Cc: linux-btrfs
On Wed, Apr 29, 2020 at 12:49:39AM +0200, Timothée Jourde wrote:
> Hi everyone! I would like to use Btrfs' RAID in a somewhat unorthodox way,
> but if it's possible then it would be perfect for my use case.
>
> Suppose I have a RAID1C3 filesystem spanning 3 hard drives A, B and C, each
> of the same size. A and B are at different physical locations after the
> initial mkfs.
>
> When I want to make a backup, I would plug in drives C and A, and send them
> a snapshot. Once in a while, I would go to drive B's location carrying drive
> C with me, then I would plug in drives C and B, waiting for Btrfs to
> replicate the missing data on drive B.
If the host is attached to drive C (e.g. host C is a laptop), and it
connects to both drives A and B (e.g. A and B are USB external drives),
then this is not a backup system, because it does not isolate the
backup and original disks from failures in host C (e.g. RAM failure).
This configuration has all the (lack of) fault isolation of degraded
raid1; you'll need a separate host for some drives to protect against
host failures.
If drives A and B have their own hosts, and you are detaching drive C
and carrying only drive C between them, then read on...
> It would be easier than sending the snapshots across 3 different
> filesystems, and most importantly, I could run a scrub with auto-repair on
> C+A or C+B.
>
> I did some quick tests with 3 USB keys, and it *seems* to work. But I don't
> know how to be notified when the replication is done (which seems to run
> automatically/silently in the background).
The replication is done by scrub, so you can do a scrub status, or start
the scrub with -B so that it doesn't exit until it's done.
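For example (a minimal sketch; /mnt stands in for wherever the
filesystem is mounted):

    # foreground scrub: the command doesn't return until the
    # replication/repair pass is complete
    btrfs scrub start -B /mnt

    # or, with a background scrub, poll for progress
    btrfs scrub status /mnt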
> I also don't know whether it's a
> reliable method or not.
Doing this with scrub is not reliable with the crc32c csum method--every
~16TB of updates, you'll get a crc32c collision, so you'll have the
wrong data on disk and no csum failure to detect it with. Any of the
other csum options will solve this. Use SHA256 if (and only if) you
are worried about crypto collision attacks in your data; otherwise,
xxhash64 is fine. Any nodatasum files will most likely be trashed,
as there are no csum failures to detect that an update is required,
so scrub won't update them (or it will update them, but it will keep
the old copy and overwrite the new one).
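For reference, the csum type is chosen at mkfs time, so switching
means re-creating the filesystem. A sketch (note the option value is
spelled "xxhash", and /dev/sdX is a placeholder):

    # data + metadata checksums with xxhash64
    mkfs.btrfs --csum xxhash /dev/sdX

    # or a cryptographic hash, if collision attacks are a concern
    mkfs.btrfs --csum sha256 /dev/sdX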
Doing this with btrfs replace is more reliable (in the sense that
everything will be updated properly), but you need to be careful to avoid
having btrfs become aware of the C drive before the replace is started
(i.e. wipefs the target drive, B or C, before it becomes visible to btrfs).
You don't need raid1c3 for this; there will never be more than two drives
connected at a time. Replace will also not spam your kernel logs with
millions of errors, so if anything does go wrong, you'll be able to spot
it in the kernel log.
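A sketch of that sequence, assuming the filesystem is mounted at /mnt,
the stale drive's slot has devid 2, and the device names are
placeholders:

    # wipe the stale target first so btrfs never sees it as a member
    wipefs -a /dev/sdB

    # rebuild devid 2 onto the wiped drive; -B stays in the
    # foreground until the replace finishes
    btrfs replace start -B 2 /dev/sdB /mnt

    # if started without -B, check progress with
    btrfs replace status /mnt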
If something goes wrong during the resync between two drives, it is
likely that both drives will be unrecoverable (a damaged source drive
may be unrecoverable on its own, and the target drive's contents will
be destroyed if either drive has a problem).
If host A's RAM is bad during the copy from A to C, and this is not
noticed until after the copy from C to B starts, then all 3 filesystem
copies may be damaged or destroyed. If host B's RAM goes bad, it
only affects host B, assuming that drive C is overwritten by drive
A when C goes back to A.
Depending on how you connect A and C, there's a risk that host A will
choose C over A, wiping out updates that occurred on A during the
time that C was disconnected (e.g. if C has a higher transid than A).
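One way to check for that before plugging both drives into the same
host is to compare superblock generations (the transid) with
dump-super; the device names are placeholders:

    btrfs inspect-internal dump-super /dev/sdA | grep '^generation'
    btrfs inspect-internal dump-super /dev/sdC | grep '^generation'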
There will be no metadata redundancy on A while drive B is away, so a
single UNC sector or bit flip error (both very common occurrences!) may
destroy the filesystem on drive A. Drives B and C can be 'btrfs replace'd,
but if you want to keep the data on A, then this is a problem.
> Any thoughts about this?
I'd try harder to make snapshot-based updates work on 3 separate
filesystems (either btrfs send, or just make a snapshot on A and rsync
it to a snapshot on C if that's easier to manage). This removes the
requirement to keep umounting the filesystem at host B whenever C arrives.
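A minimal sketch of the send variant, with hypothetical subvolume
names (@prev being the last snapshot already present on the target):

    # snapshots must be read-only to be sent
    btrfs subvolume snapshot -r /data /data/@now

    # first transfer: full send
    btrfs send /data/@now | btrfs receive /backup

    # later transfers: incremental against the previous snapshot
    btrfs send -p /data/@prev /data/@now | btrfs receive /backup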
With 3 separate filesystems you have much better fault isolation, and
the whole point of backups is to have redundant copies _in isolated
failure domains_. It's also much harder to accidentally destroy the
whole thing with a btrfs uuid collision gotcha.
With separate filesystems, when RAM goes bad on host A, it can
destroy A and C at the same time, but host B (whose RAM has hopefully
not gone bad at the same time) will not be destroyed by the update
from C to B. Host B can detect the error and abort the update instead,
leaving whatever data was previously on B intact.
Also with 3 filesystems you can have dup metadata on each disk, so you
can repair single bit errors and UNC sectors instead of going all the
way back to mkfs.
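For example (the device and mount point are placeholders):

    # at mkfs time: two copies of all metadata on a single device
    mkfs.btrfs -m dup -d single /dev/sdX

    # or convert an existing filesystem's metadata in place
    btrfs balance start -mconvert=dup /mnt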
Another alternative is to swap B and C instead of overwriting one with
the other, so you always have 2 complete copies.
> Thanks!
> Timothée
>
* Re: RAID1C3 across 3 devices but with only 2 online simultaneously
From: Chris Murphy @ 2020-04-29 1:11 UTC (permalink / raw)
To: Zygo Blaxell; +Cc: Timothée Jourde, Btrfs BTRFS
On Tue, Apr 28, 2020 at 6:25 PM Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
> Doing this with scrub is not reliable with the crc32c csum method--every
> ~16TB of updates, you'll get a crc32c collision, so you'll have the
> wrong data on disk and no csum failure to detect it with. Any of the
> other csum options will solve this. Use SHA256 if (and only if) you
> are worried about crypto collision attacks in your data; otherwise,
> xxhash64 is fine.
What about blake2b? A hash benchmark on x86_64 shows it's quite a lot
faster than SHA256 (and yet still way slower than xxhash64 or crc32c).
But I have no idea whether this actually affects overall file system
read/write performance under load.
I've started to migrate to xxhash64. It'd be nice to have a convert
option: rewriting 100% of the metadata is still only a fraction of the
cost of rewriting TBs of data. But this is not a complaint. Btrfs is
still badass.
--
Chris Murphy
* Re: RAID1C3 across 3 devices but with only 2 online simultaneously
From: Timothée Jourde @ 2020-05-03 17:41 UTC (permalink / raw)
To: Zygo Blaxell; +Cc: linux-btrfs
Thank you for your very detailed answer, it was very informative about
some of the internals of Btrfs.
Maybe a dedicated backup system is more suited to my use case.
I know it's easy to suggest a feature, but it would be awesome if Btrfs
could scrub a filesystem that has no internal duplication and still
auto-repair it from another Btrfs filesystem (auto-repair being the
main reason I wanted to use RAID).
Thanks again!
Timothée