From: Duncan <1i5t5.duncan@cox.net>
To: linux-btrfs@vger.kernel.org
Subject: Re: RAID-1 refuses to balance large drive
Date: Thu, 24 Mar 2016 06:11:44 +0000 (UTC)
Message-ID: <pan$bb328$cf4ba5ae$55fe61f9$a9194f18@cox.net>
In-Reply-To: 56F3559C.3030201@gmail.com
Brad Templeton posted on Wed, 23 Mar 2016 19:49:00 -0700 as excerpted:
> On 03/23/2016 07:33 PM, Qu Wenruo wrote:
>
>>> Still, it seems to me
>>> that the lack of space even after I filled the disks should not
>>> interfere with the balance's ability to move chunks which are found on
>>> both 3 and 4 so that one remains and one goes to the 6. This action
>>> needs no spare space. Now I presume the current algorithm perhaps
>>> does not work this way?
>>
>> No, balance does not work like that.
>> Although most users consider balance to be moving data, which is
>> partly right, the fact is that balance is copy-and-delete. And it
>> needs spare space.
>>
>> Means you must have enough space for the extents you are balancing,
>> then btrfs will copy them, update reference, and then delete old data
>> (with its block group).
>>
>> So for balancing data in already filled device, btrfs needs to find
>> space for them first.
>> Which will need 2 devices with unallocated space for RAID1.
>>
>> And in your case, you only have 1 device with unallocated space, so
>> there is no space to balance into.
>
> Ah. I would class this as a bug, or at least a non-optimal design. If
> I understand, you say it tries to move both of the matching chunks to
> new homes. This makes no sense if there are 3 drives because it is
> assured that one chunk is staying on the same drive. Even with 4 or
> more drives, where this could make sense, in fact it would still be wise
> to attempt to move only one of the pair of chunks, and then move the
> other if that is also a good idea.
What balance does, at its most basic, is rewrite chunks, manipulating
them in some desired way in the process, depending on the filters used,
if any. Once a chunk has been rewritten, the old copy is deleted.
Existing chunks are never simply modified in place: unless the filters
exclude a chunk entirely, a new chunk is created, the data is rewritten
into it, and the old chunk is removed.
One of the simplest and most basic effects of this rewrite process is
that where two or more chunks of the same type (typically data or
metadata) are only partially full, balance will create a new chunk and
start writing into it, filling it completely before creating another,
which ends up compacting chunks as it rewrites them. So if there are
ten chunks averaging 50% full, it'll compact them into five chunks,
100% full. The usage filter is very helpful here, letting you tell
balance to only bother with chunks that are under, say, 10% full
(usage=10), where you get a pretty big effect for the effort, since ten
such chunks can be consolidated into one. Of course that only happens
if you /have/ ten such chunks under 10% full, but at, say, usage=50 you
still free one chunk for every two balance rewrites. That takes
longer, but still far less time than rewriting 90%-full chunks would,
and with a far more dramatic effect... as long as there are chunks to
balance and combine at that usage level, of course.
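In command form, that's just the usage filter to balance, something
like this, with the mountpoint of course being an example:

  # rewrite only data chunks that are at most 10% used
  btrfs balance start -dusage=10 /mnt
  # or include metadata chunks as well
  btrfs balance start -dusage=10 -musage=10 /mnt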
Here, we're relying on a different side effect: with a raid1 setup
there are always two copies of each chunk, one on each of exactly two
devices, and when new chunks are allocated they *SHOULD* be allocated
from the devices with the most free space, subject only to the rule
that both copies cannot be on the same device. The effect is that the
first copy is allocated from the device with the most space left, and
the second copy from the device with the most space left once the
device holding the first copy is excluded.
But, the point that Qu is making is that balance, by definition, rewrites
both raid1 copies of the chunk. It can't simply rewrite just the copy
that's on the fullest device over to the emptiest device and leave the
other copy alone. So what it will do is allocate space for a new chunk on each of
the two devices with the most space left, and will copy the chunks to
them, only releasing the existing copies when the copy is done and the
new copies are safely on their respective devices.
Which means that at least two devices MUST have space left in order to
rebalance from raid1 to raid1. If only one device has space left, no
rebalance can be done.
Now your 3 TB and 4 TB devices, one each, are full, with space left only
on the 6 TB device. When you first switched from the 2 TB device to the
6 TB device, the device delete would have rewritten from the 2 TB device
to the 6 TB device, and you probably had some space left on the other
devices at that point. However, you didn't have enough space left on
the other two devices to make much use of the 6 TB device, because each
time a chunk was allocated on the 6 TB device, a chunk had to be
allocated on one of the others as well, and they simply didn't have
enough space left by that point to keep that up for very long.
Now, you /did/ try to rebalance before you /fully/ ran out of space on
the other devices, and that's what Chris and I were thinking should have
worked, putting one copy of each rebalanced chunk on the 6 TB device.
But, lacking btrfs device usage reports (or btrfs filesystem show,
which gives a bit less information but does say how much of each device
is actually used) from /before/ the further fill-up, we can't say for
sure how much space was actually left.
Now here's the question. You said you estimated each drive had ~50 GB
free when you did the original replace and then tried to balance, but
where did that 50 GB number come from?
Here's why it matters. Btrfs allocates space in two steps. First it
allocates raw space from the unallocated pool into chunks, which can be
data or metadata (there are also system chunks, but those total only a
few MiB, in your case 32 MiB on each of two devices given the raid1,
and don't grow dramatically with usage the way data and metadata chunks
do). Second, it writes the actual file data and metadata into the
space within those chunks.
And it can easily happen that all available space is already allocated
into (partially used) chunks, so there's no actually unallocated space
left on a device from which to allocate further chunks, but there's still
sufficient space left in the partially used chunks to continue adding and
changing files for some time. Only when new chunk allocation is
necessary will a problem show up.
Now, given the various btrfs reports (btrfs fi show and btrfs fi df,
or btrfs fi usage, or for a device-centric report, btrfs dev usage,
possibly combined with each other depending on what you're trying to
figure out), it's quite possible to tell exactly what the status of
each device is: how much space remains unallocated, how much is
allocated into chunks, and how much of those allocated chunks is
actually used (tho unfortunately actual usage within the chunks is only
reported globally, not broken down per device; that information isn't
technically needed per-device anyway).
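For reference, those reports are, with the mountpoint just an example:

  btrfs filesystem show /mnt
  btrfs filesystem df /mnt
  btrfs filesystem usage /mnt
  btrfs device usage /mnt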
But if you're estimating only based on normal df, not the btrfs versions
of the commands, you don't know how much space remained actually
unallocated on each device, and for balance, that's the critical thing,
particularly with raid1, since it MUST have space to allocate new chunks
on AT LEAST TWO devices.
Which is where the IRC recommendation to add a 4th device of a few GiB
came in, the idea being that it would supply enough unallocated space,
as a second device with actually unallocated space, to get you out of
the tight spot.
There is, however, another factor in play here as well: chunk size.
Data chunks are the largest, nominally 1 GiB in size. *HOWEVER*, on
devices over some particular size they can grow, up to (as a dev stated
in one thread) 10 GiB. While I know that happens at larger filesystem
and device sizes, I don't have the foggiest what the exact conditions
and algorithm for chunk size are. But with TB-scale devices and a
TB-scale btrfs, it's very possible, even likely, that you're dealing
with chunks over the nominal 1 GiB size.
And if you're dealing with 10 GiB chunks, or possibly even larger ones
if I took that dev's comments about the chunk size limit out of context
and am wrong about it... you may well simply not have a second device
with enough unallocated space on it to hold chunks of the size in use
on that filesystem.
Certainly, the btrfs fi usage report you posted showed a few gigs of
unallocated space on each of three of the four devices (with all sorts of
space left on the 6 TB device, of course), but all three were in the
single digits of GiB, and if most of your data chunks are 10 GiB...
you simply don't have a second device with enough unallocated space
left to write that second copy.
Tho adding back that 2 TB device and doing a balance should indeed give
you enough space to put a serious dent in that imbalance.
But as Qu says, you will likely end up having to rebalance several
times in order to get it nicely balanced out, since you'll fill up that
under-2-TiB device pretty fast from the other two full devices, and
it'll start round-robinning the second copy across all three before the
other two are even a TiB down from full.
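If you go that route, the sequence is just something like the
following, with the device node and mountpoint as stand-ins for your
actual ones:

  # add the old 2 TB device back into the filesystem
  btrfs device add /dev/sdX /mnt
  # then rebalance, repeating as needed until btrfs device usage
  # shows a spread you're happy with
  btrfs balance start /mnt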
Again as Qu says, rebalancing to single and back to raid1 is another
option, one that should result in a much faster loading of the 6 TB
device. I think (but I'm not sure) that the single mode allocator
still uses the "most space" allocation algorithm, in which case, given
a total raid1 usage of 7.77 TiB, which should be 3.88 TiB (~4.25 TB) in
single mode, you should end up with a nearly free 3 TB device, just
under 1 TiB used on the 4 TB device, and just under 3 TB used on the
6 TB device, basically 3 TB free/unallocated on each of the three
devices.
(The tiny 4th device should be left entirely free in that case and should
then be trivial to device delete as there will be nothing on it to move
to other devices; it'll just be a simple change to the system chunk device
data and the superblocks on the other three devices.)
Then you can rebalance to raid1 mode again, and it should use up that
3 TB on each device relatively evenly, with the device left out of each
chunk pair alternating round-robin style as the chunks are copied.
While ~3/4 of all chunks should start out with their single-mode copy
on the 6 TB device, ~3/4 of the chunks deleted as the balance proceeds
will come off it as well, leaving it free to receive one of the two new
copies most of the time. You should end up with about 1.3 TB free per
device: about 1.6 TB of the 3 TB device allocated and 2.6 TB of the
4 TB device allocated, the two together pretty well sharing one copy of
each chunk between them, and about 4.3 TB of the 6 TB device allocated,
pretty much one copy of each chunk on its own.
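A sketch of that whole sequence, with /mnt and /dev/sdX as stand-ins
for your actual mountpoint and the tiny 4th device, and converting only
the data chunks while leaving metadata raid1 throughout (a bit safer
than taking metadata down to single as well):

  # convert data chunks to single, which should load up the 6 TB device
  btrfs balance start -dconvert=single /mnt
  # the tiny 4th device should now be empty and trivial to remove
  btrfs device delete /dev/sdX /mnt
  # convert data back to raid1, spreading the second copies around
  btrfs balance start -dconvert=raid1 /mnt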
The down side to that is that you're left with only a single copy while
in single mode, and if that copy gets corrupted, you simply lose whatever
was in that now corrupted chunk. If the data's valuable enough, you may
thus prefer to do repeated balances.
The other alternative of course is to ensure that everything that's not
trivially replaced is backed up, and start from scratch with a newly
created btrfs on the three devices, restoring to it from backup.
That's what I'd do, since the sysadmin's rule of backups, in its
simple form, says that if it's not backed up, you are, by definition of
your (in)action, declaring that data to be worth less than the
time/trouble/resources necessary to back it up. So if it's worth the
hassle, it should already be backed up, and you can simply blow away
the existing filesystem, create it anew, and restore from those
backups; if you don't have the backups, then by definition it wasn't
worth the hassle in the first place. Either way, starting over with a
fresh filesystem is all three of (1) less hassle, (2) a chance to take
advantage of newer filesystem options that weren't available when you
first created the existing filesystem, and (3) a clean start, removing
any chance of some bug lurking in the existing layout waiting to come
back and bite you after you've put all the work into those rebalances,
should you choose them over the clean start.
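For completeness, recreating the filesystem would be along these
lines, with the device nodes purely illustrative, after which you'd
mount it and restore from the backups:

  # new btrfs across the three devices, data and metadata both raid1
  # (-f overwrites the old filesystem signatures)
  mkfs.btrfs -f -d raid1 -m raid1 /dev/sdX /dev/sdY /dev/sdZ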
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman