From: Duncan <1i5t5.duncan@cox.net>
To: linux-btrfs@vger.kernel.org
Subject: Re: safe/necessary to balance system chunks?
Date: Sat, 26 Apr 2014 04:01:32 +0000 (UTC) [thread overview]
Message-ID: <pan$dd651$bbf57e6c$5d375e9c$5e894ab7@cox.net> (raw)
In-Reply-To: 535AB27C.6070205@gmail.com
Austin S Hemmelgarn posted on Fri, 25 Apr 2014 15:07:40 -0400 as
excerpted:
> I actually have a similar situation with how I have my desktop system
> set up, when I go about recreating the filesystem (which I do every
> time I upgrade either the tools or the kernel),
Wow. Given that I run a git kernel and btrfs-tools, I'd be spending a
*LOT* of time on redoing my filesystems if I did that! Tho see my just-
previous reply for what I do (a fresh mkfs.btrfs every few kernel cycles,
to take advantage of new on-device-format feature options and to clean
out any possibly remaining cruft from bugs now fixed, given that btrfs
isn't fully stable yet).
Anyway, why I'm replying here:
[in the context of btrfs raid1 mode]
> I use the following approach:
>
> 1. Delete one of the devices from the filesystem
> 2. Create a new btrfs file system on the device just removed from the
> filesystem
> 3. Copy the data from the old filesystem to the new one
> 4. one at a time, delete the remaining devices from the old filesystem
> and add them to the new one, re-balancing the new filesystem after
> adding each device.
>
> This seems to work relatively well for me, and prevents the possibility
> that there is ever just one copy of the data. It does, however, require
> that the amount of data that you are storing on the filesystem is less
> than the size of one of the devices (although you can kind of work
> around this limitation by setting compress-force=zlib on the new file
> system when you mount it, then using defrag to decompress everything
> after the conversion is done), and that you have to drop to single user
> mode for the conversion (unless it's something that isn't needed all the
> time, like the home directories or /usr/src, in which case you just log
> everyone out and log in as root on the console to do it).
I believe you're laboring under an unfortunate but understandable
misconception of the nature of btrfs raid1. Since in the event of device-
loss it's a critical misconception, I decided to deal with it in a reply
separate from the other one (which I then made as a sibling post to yours
in reply to the same parent, instead of as a reply to you).
Unlike for instance mdraid raid1 mode, which is N mirror-copies of the
data across N devices (so 3 devices = 3 copies, 5 devices = 5 copies,
etc)...
**BTRFS RAID1 MODE IS CURRENTLY PAIR-MIRROR ONLY!**
No matter the number of devices in the btrfs so-called "raid1", btrfs
only pair-mirrors each chunk, so it's only two copies of the data per
filesystem. To have more than two-copy redundancy, you must use multiple
filesystems and make one a copy of the other using either conventional
backup methods or the btrfs-specific send/receive.
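To put numbers on that: because every chunk gets exactly two copies regardless of device count, usable raid1 space works out to total/2, capped by how much the other devices can mirror against the largest one. A quick sketch (my own back-of-envelope formula, not kernel code):

```shell
# Approximate usable capacity of a btrfs raid1, given device sizes
# in GiB.  Every chunk is pair-mirrored, so usable space is total/2,
# but no more than what the other devices can mirror for the largest.
raid1_usable() {
    total=0 max=0
    for s in "$@"; do
        total=$((total + s))
        if [ "$s" -gt "$max" ]; then max=$s; fi
    done
    half=$((total / 2))
    rest=$((total - max))
    if [ "$rest" -lt "$half" ]; then echo "$rest"; else echo "$half"; fi
}

raid1_usable 100 100        # 2 devices -> 100 GiB, two copies
raid1_usable 100 100 100    # 3 devices -> 150 GiB, STILL two copies
```

Contrast mdraid raid1, where three 100 GiB devices would mean three full copies and only 100 GiB usable.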
This is actually my biggest annoyance/feature-request with current btrfs,
as my own sweet-spot ideal is triplet-mirroring. N-way-mirroring has
indeed been on the roadmap for years, but the devs plan to reuse some of
the btrfs raid5/6 code to implement it, and raid5/6 mode, introduced
incomplete in 3.9, remains exactly that as of 3.14: incomplete. While I
saw patches recently to properly support raid5/6 scrub, I believe it's
still incomplete in 3.15 as well, and of course N-way-mirroring remains
roadmapped for after that. So not being a dev, I continue to wait, as
patiently as I can manage, since I'd rather a good implementation later
than a buggy one now. Tho at this point I admit to having some sympathy
for the donkey forever following the apple held on the end of a stick,
just out of reach... even if I /would/ rather wait another five years
for it and have it done /right/, than be dealing with a bad
implementation available right now.
Anyway, given that we /are/ dealing with pair-mirror-only raid1 mode
currently... as well as your pre-condition that for your method to work,
the data to store on the filesystem must fit on a single device...
If you have a 3-device-plus btrfs raid1 and you're using btrfs device
delete to remove the device you're going to create the new filesystem on,
you do still have two-way-redundancy at all times, since the btrfs
device delete will ensure the two copies are on the remaining devices,
but that's unnecessary work compared to simply leaving it a device down
in the first place, and starting with the last device of the previous
(grandparent generation) filesystem as the first of a new (child
generation) filesystem, leaving it unused between.
If OTOH you're hard-removing a device from the raid1, without a btrfs
device delete first, then at the moment you do so, you only have a single
copy of any chunk where one of the pair was on that device, and it
remains that way until you do the mkfs and finish populating the new
filesystem with the contents of the old one.
So you're either doing extra work (if you're using btrfs device delete),
or leaving yourself with a single copy of anything on the removed device,
until it is back up and running as the new filesystem! =:^(
I'd suggest not bothering with more than two (or possibly three) devices
per filesystem. With btrfs raid1 you only get pair-mirroring, so extra
devices are wasted on redundancy, and by your own pre-condition you
limit the amount of data to the capacity of one device, so you can't
take advantage of the extra storage capacity of >2 devices on a
two-way-mirroring-limited raid1 either. Save the extra devices for when
you do the transfer.
If you have only three devices, set up the btrfs raid1 with two, and leave
the third as a spare. Then for the transfer, create and populate the new
filesystem on the third, remove a device from the btrfs raid1 pair, add
it to the new btrfs and convert to raid1. At that point you can drop the
old filesystem and leave its remaining device as your first device when
you repeat the process later, making the last device of the grandparent
into the first device of the child.
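That rotation might look something like the following sketch (hypothetical device names; the run() wrapper only prints each command rather than executing it, so nothing happens for real until you remove it):

```shell
# Dry-run sketch of the three-device rotation.  Names hypothetical;
# run() only prints the commands.
PLAN=""
run() { PLAN="$PLAN + $* ;"; echo "+ $*"; }

SPARE=/dev/sdc    # the device that sat out the last generation
MOVE=/dev/sdb     # device to pull from the old pair into the new fs
OLD=/mnt/old
NEW=/mnt/new

# Create and populate the new filesystem on the spare device.
run mkfs.btrfs "$SPARE"
run mount "$SPARE" "$NEW"
run cp -a "$OLD/." "$NEW/"

# Retire the old pair; its remaining device keeps a complete,
# degraded-mountable copy of the old filesystem.
run umount "$OLD"

# Add the pulled device to the new filesystem (-f overwrites the
# stale btrfs signature on it) and convert to raid1.
run btrfs device add -f "$MOVE" "$NEW"
run btrfs balance start -dconvert=raid1 -mconvert=raid1 "$NEW"
```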
This way you'll have two copies of the data at all times, and you save
the work of the third-device add and rebalance, and later the device
delete bringing it back down to two devices.
And as a bonus, except for the time you're actually doing the mkfs and
repopulating the new filesystem, you'll have a third, somewhat outdated
copy as a backup: the spare that you're not including in the current
filesystem still holds a complete copy of the old filesystem from before
it was removed, and that old copy can still be mounted using the
degraded option (since it's the single device remaining of what was
previously a multi-device raid1).
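Mounting that leftover device of the retired raid1 would look something like this (hypothetical names; the run() wrapper only prints the command):

```shell
PLAN=""
run() { PLAN="$PLAN + $* ;"; echo "+ $*"; }

# Hypothetical: /dev/sdb is the lone surviving member of the retired
# two-device raid1.  The degraded option allows mounting without its
# missing partner; ro is safer if you only need to pull files off it.
run mount -o degraded,ro /dev/sdb /mnt/oldbackup
```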
Alternatively, do the three-device raid1 thing: btrfs device delete when
you're taking a device out, and btrfs balance after adding the third
device. This is more hassle, and a three-device raid1 doesn't give you
more redundancy, since btrfs raid1 remains pair-mirror-only. But
dropping a device from a two-device raid1 forces it read-only, as
writes can no longer be made in raid1 mode, while a three-device raid1
DOES give you the ability to continue writing in raid1 mode with a
missing device, since you still have two devices and can do raid1
pair-mirror writes.
So in view of the pair-mirror restriction, three devices won't give you
additional redundancy, but it WILL give you a continued writable raid1 if
a device drops out. Whether that's worth the hassle of the additional
steps needed to btrfs device delete to create the new filesystem and
btrfs balance on adding the third device, is up to you, but it does give
you that choice. =:^)
Similarly if you have four devices, only in that case you can actually do
two independent two-device btrfs raid1 filesystems, one working and one
backup, taking the backup down to recreate as the new primary/working
filesystem when necessary, thus avoiding the whole device-add and
rebalance thing entirely. And your backup is then a full pair-redundant
backup as well, tho of course you lose the backup for the period you're
doing the mkfs and repopulating the new version.
This is actually pretty much what I'm doing here, except that my physical
devices are more than twice the size of my data and I only have two
physical devices. But I use partitioning and create the dual-device
btrfs raid1 pair-mirror across two partitions, one on each physical
device, with the backup set being two different partitions, one each on
the same pair of physical devices.
If you have five devices, I'd recommend doing about the same thing, only
with the fifth device as a normally physically disconnected (and
possibly stored separately, perhaps even off-site) backup of the two
separate btrfs pair-mirror raid1s. Actually, you can remove a device
from one of the raid1s (presumably the backup/secondary) to create the
new btrfs raid1, while still leaving the other (presumably the
working/primary) as a complete two-device raid1 pair; the device left
behind then serves as a backup that can still be mounted using degraded,
should that be necessary.
Or simply use the fifth device for something else. =:^)
With six devices you have a multi-way choice:
1) Btrfs raid1 pairs as with four devices but with two levels of backup.
This would be the same as the 5-device scenario, but completing the pair
for the secondary backup.
2) Btrfs raid1 pairs with an additional device in both primary and backup.
2a) This gives you a bit more flexibility in terms of size, since you now
get 1.5 times the capacity of a single device, for both primary/working
and secondary/backup.
2b) You also get the device-dropped write-flexibility described under the
three-device case, but now for both primary and backup. =:^)
3) Six-device raid10. In "simple" configuration, this gives you
3-way-striping and 3X the capacity of a single device, still
pair-mirrored, but you lose the independent backups. However, if you
use partitioning to split each physical device in half and make each
set of six partitions an independent btrfs raid10, you still get half
that 3X capacity (so 1.5X a single device), still get the
three-way-striping and 2-way-mirroring for 3X the speed with
pair-mirror redundancy, *AND* get independent primary and backup sets,
each its own 6-way set of partitions across the 6 devices, allowing
simple tear-down and recreation of the backup raid10 as the new
working raid10.
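Spelling out the option-3 arithmetic (toy numbers: equal-size devices, btrfs metadata overhead ignored):

```shell
# Usable capacity of a btrfs raid10 with pair mirroring: n/2 stripe
# members, each member mirrored, so usable = n/2 * device size (GiB).
raid10_usable() {
    n=$1 size=$2
    echo $(( n / 2 * size ))
}

raid10_usable 6 100   # six whole 100 GiB devices -> 300 (3X one device)
raid10_usable 6 50    # six half-device partition sets -> 150 (1.5X), x2
```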
That would be a very nice setup; something I'd like for myself. =:^)
Actually, once N-way-mirroring hits I'm going to want to setup pretty
close to just this, except using triplet mirroring and two-way-striping
instead of the reverse. Keeping the two-way-partitioning as well, that'd
give me 2X speed and 3X redundancy, at 1X capacity, with a primary and
backup raid10 on different 6-way partition sets of the same six physical
devices.
Ideally, the selectable-way mirroring/striping code will be flexible
enough by that time to let me temporarily reduce striping (and speed/
capacity) to 1-way while keeping 3-way-mirroring, should I lose a device
or two, thus avoiding the force-to-read-only that dropping below two
devices in a raid1 or four devices in a raid10 currently triggers. Upon
replacing the bad devices, I could rebalance the 1-way-striped bits and
get full 2-way-striping once again, while the triplet mirroring would
have never been compromised.
That's my ideal. =:^)
But to do that I still need triplet-mirroring, and triplet-mirroring
isn't available yet. =:^(
But it'll sure be nice when I CAN do it! =:^)
4) Do something else with the last pair of devices. =:^)
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman