* How to properly and efficiently balance RAID6 after more drives are added?
From: Christian Rohmann @ 2015-09-02 10:29 UTC
To: linux-btrfs
Hello btrfs-enthusiasts,
I have a rather big btrfs RAID6 with currently 12 devices. It used to be
only 8 drives 4TB each, but I successfully added 4 more drives with 1TB
each at some point. What I am trying to find out, and that's my main
reason for posting this, is how to balance the data on the drives now.
I am wondering what I should read from this "btrfs filesystem show" output:
--- cut ---
Total devices 12 FS bytes used 19.23TiB
devid 1 size 3.64TiB used 3.64TiB path /dev/sdc
devid 2 size 3.64TiB used 3.64TiB path /dev/sdd
devid 3 size 3.64TiB used 3.64TiB path /dev/sde
devid 4 size 3.64TiB used 3.64TiB path /dev/sdf
devid 5 size 3.64TiB used 3.64TiB path /dev/sdh
devid 6 size 3.64TiB used 3.64TiB path /dev/sdi
devid 7 size 3.64TiB used 3.64TiB path /dev/sdj
devid 8 size 3.64TiB used 3.64TiB path /dev/sdb
devid 9 size 931.00GiB used 535.48GiB path /dev/sdg
devid 10 size 931.00GiB used 535.48GiB path /dev/sdk
devid 11 size 931.00GiB used 535.48GiB path /dev/sdl
devid 12 size 931.00GiB used 535.48GiB path /dev/sdm
btrfs-progs v4.1.2
--- cut ---
First of all I wonder why the first 8 disks are shown as "full" (used =
size), while "df" reports 5.3TB of free space for the filesystem:
--- cut ---
Filesystem Size Used Avail Use% Mounted on
/dev/sdc 33T 20T 5.3T 79% /somemountpointsomewhere
--- cut ---
Also "btrfs filesystem df" doesn't give me any clues on the matter:
--- cut ---
btrfs filesystem df /srv/mirror/
Data, single: total=8.00MiB, used=0.00B
Data, RAID6: total=22.85TiB, used=19.19TiB
System, single: total=4.00MiB, used=0.00B
System, RAID6: total=12.00MiB, used=1.34MiB
Metadata, single: total=8.00MiB, used=0.00B
Metadata, RAID6: total=42.09GiB, used=38.42GiB
GlobalReserve, single: total=512.00MiB, used=1.58MiB
--- cut ---
What I am very certain about is that the "load" of I/O requests is not
equal yet, as iostat clearly shows:
--- cut ---
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s
avgrq-sz avgqu-sz await r_await w_await svctm %util
sdc 21.40 4.41 42.22 12.71 3626.12 940.79
166.29 3.82 69.38 42.83 157.60 5.98 32.82
sdb 22.35 4.45 41.29 12.71 3624.20 941.27
169.09 4.22 77.88 46.75 178.97 6.10 32.96
sdd 22.03 4.44 41.60 12.73 3623.76 943.22
168.13 3.79 69.45 42.53 157.48 6.05 32.85
sde 21.21 4.43 42.30 12.74 3621.39 943.36
165.88 3.82 69.28 42.99 156.62 5.98 32.90
sdf 22.19 4.42 41.42 12.75 3623.65 940.63
168.51 3.77 69.36 42.64 156.13 6.05 32.79
sdh 21.35 4.46 42.25 12.68 3623.12 940.28
166.14 3.95 71.72 43.61 165.40 6.02 33.06
sdi 21.92 4.38 41.67 12.79 3622.03 942.91
167.63 3.49 63.83 40.23 140.74 6.02 32.77
sdj 21.31 4.41 42.26 12.72 3625.32 941.50
166.12 3.99 72.25 44.50 164.44 6.00 33.01
sdg 8.90 4.97 12.53 21.16 1284.47 1630.08
173.02 0.83 24.61 27.31 23.02 1.77 5.95
sdk 9.14 4.94 12.30 21.19 1284.61 1630.02
174.07 0.79 23.41 26.59 21.57 1.76 5.91
sdl 8.88 4.95 12.58 21.19 1284.46 1630.06
172.62 0.80 23.80 25.68 22.68 1.78 6.00
sdm 9.07 4.85 12.35 21.29 1284.43 1630.01
173.26 0.79 23.57 26.57 21.83 1.77 5.94
--- cut ---
Should I run btrfs balance on the filesystem? If so, what FILTERS would
I then use in order for the data and therefore requests to be better
distributed?
With regards and thanks in advance,
Christian
* Re: How to properly and efficiently balance RAID6 after more drives are added?
From: Hugo Mills @ 2015-09-02 11:30 UTC
To: Christian Rohmann; +Cc: linux-btrfs
On Wed, Sep 02, 2015 at 12:29:06PM +0200, Christian Rohmann wrote:
> Hello btrfs-enthusiasts,
>
> I have a rather big btrfs RAID6 with currently 12 devices. It used to be
> only 8 drives 4TB each, but I successfully added 4 more drives with 1TB
> each at some point. What I am trying to find out, and that's my main
> reason for posting this, is how to balance the data on the drives now.
>
> I am wondering what I should read from this "btrfs filesystem show" output:
>
> --- cut ---
> Total devices 12 FS bytes used 19.23TiB
> devid 1 size 3.64TiB used 3.64TiB path /dev/sdc
> devid 2 size 3.64TiB used 3.64TiB path /dev/sdd
> devid 3 size 3.64TiB used 3.64TiB path /dev/sde
> devid 4 size 3.64TiB used 3.64TiB path /dev/sdf
> devid 5 size 3.64TiB used 3.64TiB path /dev/sdh
> devid 6 size 3.64TiB used 3.64TiB path /dev/sdi
> devid 7 size 3.64TiB used 3.64TiB path /dev/sdj
> devid 8 size 3.64TiB used 3.64TiB path /dev/sdb
> devid 9 size 931.00GiB used 535.48GiB path /dev/sdg
> devid 10 size 931.00GiB used 535.48GiB path /dev/sdk
> devid 11 size 931.00GiB used 535.48GiB path /dev/sdl
> devid 12 size 931.00GiB used 535.48GiB path /dev/sdm
You had some data on the first 8 drives with 6 data+2 parity, then
added four more. From that point on, you were adding block groups with
10 data+2 parity. At some point, the first 8 drives became full, and
then new block groups have been added only to the new drives, using 2
data+2 parity.
> btrfs-progs v4.1.2
> --- cut ---
>
>
> First of all I wonder why the first 8 disks are shown as "full" (used =
> size), while "df" reports 5.3TB of free space for the filesystem:
>
> --- cut ---
> Filesystem Size Used Avail Use% Mounted on
> /dev/sdc 33T 20T 5.3T 79% /somemountpointsomewhere
> --- cut ---
This is inaccurate because the calculations that correct for the
RAID usage probably aren't all that precise for parity RAID,
particularly when there are variable stripe sizes like you have in your
FS. In fact, they're not even all that good for things like RAID-1
(I've seen inaccuracies on my own RAID-1 system).
> Also "btrfs filesystem df" doesn't give me any clues on the matter:
>
> --- cut ---
> btrfs filesystem df /srv/mirror/
> Data, single: total=8.00MiB, used=0.00B
> Data, RAID6: total=22.85TiB, used=19.19TiB
> System, single: total=4.00MiB, used=0.00B
> System, RAID6: total=12.00MiB, used=1.34MiB
> Metadata, single: total=8.00MiB, used=0.00B
> Metadata, RAID6: total=42.09GiB, used=38.42GiB
> GlobalReserve, single: total=512.00MiB, used=1.58MiB
> --- cut ---
This is showing you how the "used" space from the btrfs fi show
output is divided up. It won't tell you anything about the proportion
of the data that's 6+2, the amount that's 10+2, and the amount that's
2+2 (or any other values).
> What I am very certain about is that the "load" of I/O requests is not
> equal yet, as iostat clearly shows:
>
> --- cut ---
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s
> avgrq-sz avgqu-sz await r_await w_await svctm %util
> sdc 21.40 4.41 42.22 12.71 3626.12 940.79
> 166.29 3.82 69.38 42.83 157.60 5.98 32.82
> sdb 22.35 4.45 41.29 12.71 3624.20 941.27
> 169.09 4.22 77.88 46.75 178.97 6.10 32.96
> sdd 22.03 4.44 41.60 12.73 3623.76 943.22
> 168.13 3.79 69.45 42.53 157.48 6.05 32.85
> sde 21.21 4.43 42.30 12.74 3621.39 943.36
> 165.88 3.82 69.28 42.99 156.62 5.98 32.90
> sdf 22.19 4.42 41.42 12.75 3623.65 940.63
> 168.51 3.77 69.36 42.64 156.13 6.05 32.79
> sdh 21.35 4.46 42.25 12.68 3623.12 940.28
> 166.14 3.95 71.72 43.61 165.40 6.02 33.06
> sdi 21.92 4.38 41.67 12.79 3622.03 942.91
> 167.63 3.49 63.83 40.23 140.74 6.02 32.77
> sdj 21.31 4.41 42.26 12.72 3625.32 941.50
> 166.12 3.99 72.25 44.50 164.44 6.00 33.01
> sdg 8.90 4.97 12.53 21.16 1284.47 1630.08
> 173.02 0.83 24.61 27.31 23.02 1.77 5.95
> sdk 9.14 4.94 12.30 21.19 1284.61 1630.02
> 174.07 0.79 23.41 26.59 21.57 1.76 5.91
> sdl 8.88 4.95 12.58 21.19 1284.46 1630.06
> 172.62 0.80 23.80 25.68 22.68 1.78 6.00
> sdm 9.07 4.85 12.35 21.29 1284.43 1630.01
> 173.26 0.79 23.57 26.57 21.83 1.77 5.94
>
> --- cut ---
>
>
>
> Should I run btrfs balance on the filesystem? If so, what FILTERS would
> I then use in order for the data and therefore requests to be better
> distributed?
Yes, you should run a balance. You probably need to free up some
space on the first 8 drives first, to give the allocator a chance to
use all 12 devices in a single stripe. This can also be done with a
balance. Sadly, with the striped RAID levels (0, 10, 5, 6), it's
generally harder to ensure that all of the data is striped as evenly
as is possible(*). I don't think there are any filters that you need
to use -- just balance everything. The first time probably won't do
the job fully. A second balance probably will. These are going to take
a very long time to run (in your case, I'd guess at least a week for
each balance). I would recommend starting the balance in a tmux or
screen session, and also creating a second shell in the same session
to run monitoring processes. I typically use something like:
watch -n60 sudo btrfs fi show\; echo\; btrfs fi df /mountpoint\; echo\; btrfs bal stat /mountpoint
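For example, the whole sequence might look something like this (the
mountpoint is just a placeholder, and options may differ slightly with
your btrfs-progs version):
--- cut ---
tmux new-session -s balance        # or: screen -S balance
btrfs balance start /mountpoint
# then, in a second window (Ctrl-b c in tmux):
btrfs balance status /mountpoint
--- cut ---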
Hugo.
(*) Hmmm... idea for a new filter: min/max stripe width? Then you
could balance only the block groups that aren't at full width, which
is probably what's needed here.
--
Hugo Mills | Comic Sans goes into a bar, and the barman says, "We
hugo@... carfax.org.uk | don't serve your type here."
http://carfax.org.uk/ |
PGP: E2AB1DE4 |
* Re: How to properly and efficiently balance RAID6 after more drives are added?
From: Christian Rohmann @ 2015-09-02 13:09 UTC
To: Hugo Mills, linux-btrfs
Hey Hugo,
thanks for the quick response.
On 09/02/2015 01:30 PM, Hugo Mills wrote:
> You had some data on the first 8 drives with 6 data+2 parity, then
> added four more. From that point on, you were adding block groups
> with 10 data+2 parity. At some point, the first 8 drives became
> full, and then new block groups have been added only to the new
> drives, using 2 data+2 parity.
Even though the old 8-drive RAID6 was not full yet? Read: there were
still some terabytes of free space.
>> Should I run btrfs balance on the filesystem? If so, what FILTERS
>> would I then use in order for the data and therefore requests to
>> be better distributed?
>
> Yes, you should run a balance. You probably need to free up some
> space on the first 8 drives first, to give the allocator a chance
> to use all 12 devices in a single stripe. This can also be done
> with a balance. Sadly, with the striped RAID levels (0, 10, 5, 6),
> it's generally harder to ensure that all of the data is striped as
> evenly as is possible(*). I don't think there are any filters that
> you need to use -- just balance everything. The first time
> probably won't do the job fully. A second balance probably will.
> These are going to take a very long time to run (in your case, I'd
> guess at least a week for each balance). I would recommend starting
> the balance in a tmux or screen session, and also creating a second
> shell in the same session to run monitoring processes. I typically
> use something like:
>
> watch -n60 sudo btrfs fi show\; echo\; btrfs fi df /mountpoint\;
> echo\; btrfs bal stat /mountpoint
Yeah, that's what I usually do. The thing is that one does not get any
progress indication or estimate of how long the task will take.
> (*) Hmmm... idea for a new filter: min/max stripe width? Then you
> could balance only the block groups that aren't at full width,
> which is probably what's needed here.
Consider my question and motivation a rather obvious use case: running
out of disk space (or iops) and simply adding some more drives. A
balance needs to be straightforward enough for people to understand and
perform in such cases.
Regards
Christian
* Re: How to properly and efficiently balance RAID6 after more drives are added?
From: Duncan @ 2015-09-03 2:22 UTC
To: linux-btrfs
Christian Rohmann posted on Wed, 02 Sep 2015 15:09:47 +0200 as excerpted:
> Hey Hugo,
>
> thanks for the quick response.
>
> On 09/02/2015 01:30 PM, Hugo Mills wrote:
>> You had some data on the first 8 drives with 6 data+2 parity, then
>> added four more. From that point on, you were adding block groups with
>> 10 data+2 parity. At some point, the first 8 drives became full, and
>> then new block groups have been added only to the new drives, using 2
>> data+2 parity.
>
> Even though the old 8-drive RAID6 was not full yet? Read: there were
> still some terabytes of free space.
At this point we're primarily guessing (unless you want to dive deep into
btrfs-debug or the like), because the results you posted are from after
you added the set of four new devices to the existing eight. We don't
have the btrfs fi show and df from before you added the new devices.
But what we /do/ know from what you posted (from after the add), the
previously existing devices are "100% chunk-allocated", size 3.64 TiB,
used 3.64 TiB, on each of the first eight devices.
I don't know how much of (the user docs on) the wiki you've read, and/or
understood, but for many people, it takes awhile to really understand a
few major differences between btrfs and most other filesystems.
1) Btrfs separates data and metadata into separate allocations,
allocating, tracking and reporting them separately. While some
filesystems do allocate separately, few expose the separate data and
metadata allocation detail to the user.
2) Btrfs allocates and uses space in two steps, first allocating/
reserving relatively large "chunks" from free-space into separate data
and metadata chunks, then using space from these chunk allocations as
needed, until they're full and more must be allocated. Nominal[1] chunk
size is 1 GiB for data, 256 MiB for metadata.
It's worth noting that for striped raid (with or without parity, so
raid0,5,6, with parity strips taken from what would be the raid0 strips
as appropriate), btrfs allocates a full chunk strip on each available
device, so nominal raid6 strip allocation on eight devices would be a 6
GiB data plus 2 GiB parity stripe (8x1GiB strips per stripe), while
metadata would be 1.5 GiB metadata (6x256MiB) plus half a GiB parity
(2x256MiB) (total of 8x256MiB strips per stripe).
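As a rough worked example applying those nominal sizes to the stripe
widths in play here (actual chunk sizes can vary, as noted in the
footnote below):

   8-device raid6 data allocation:  8 x 1GiB strips = 6GiB data + 2GiB parity
  12-device raid6 data allocation: 12 x 1GiB strips = 10GiB data + 2GiB parity
   4-device raid6 data allocation:  4 x 1GiB strips = 2GiB data + 2GiB parity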
Again, most filesystems don't allocate in chunks like this, at least for
data (they often will allocate metadata in chunks of some size, in
order to keep it grouped relatively close together, but that level of
detail isn't shown to the user, and because metadata is typically a small
fraction of data, it can simply be included in the used figure as soon as
allocated and still disappear in the rounding error). What they report
as free space is thus available unallocated space that should, within
rounding error, be available for data.
3) Up until a few kernel cycles ago, btrfs could and would automatically
allocate chunks as needed, but wouldn't deallocate them when they
emptied. Once they were allocated for data or metadata, that's how they
stayed allocated, unless/until the user did a balance manually, at which
point the chunk rewrite would consolidate the used space and free any
unused chunk-space back to the unallocated space pool.
The result was that given normal usage writing and deleting data, over
time, all unallocated space would typically end up allocated as data
chunks, such that at some point the filesystem would run out of metadata
space and need to allocate more metadata chunks, but couldn't, because of
all those extra partially to entirely empty data chunks that were
allocated and never freed.
Since IIRC 3.17 or so (kernel cycle from unverified memory, but that
should be close), btrfs will automatically deallocate chunks if they're
left entirely empty, so the problem has disappeared to a large extent,
tho it's still possible to eventually end up with a bunch of not-quite-
empty data chunks, that require a manual balance to consolidate and clean
up.
4) Normal df (as opposed to btrfs fi df) will list free space in existing
data chunks as free, even after all unallocated space is gone and it's
all allocated to either data or metadata chunks. At that point,
whichever one you run out of first, typically metadata, will trigger ENOSPC
errors, despite df often showing quite some free space left -- because
all the reported free-space is tied up in data chunks, and there's no
unallocated space left to allocate to new metadata chunks when the
existing ones get full.
5) What btrfs fi show reports for "used" in the individual device stats
is chunk-allocated space.
What your btrfs fi show is saying is that 100% of the capacity of those
first eight devices is chunk-allocated. It doesn't say whether that's to
data or metadata chunks, but whichever it is, the space is already
allocated to one or the other and cannot be reallocated to anything else
(neither to a different stripe width after adding the new devices, nor
to the opposite of data or metadata) until it is rewritten in order to
consolidate all the actually used space into as few chunks as possible,
thereby freeing the unused but currently chunk-allocated space back to
the unallocated pool. This chunk rewrite and consolidation is exactly
what balance is designed to do.
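Incidentally, if your btrfs-progs is new enough (3.18+, IIRC), you can
see the allocated vs. unallocated split per device directly, which is
handy for watching a balance do its work (the mountpoint is a
placeholder):
--- cut ---
btrfs filesystem usage /mountpoint
btrfs device usage /mountpoint
--- cut ---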
Again, at this point we're guessing to some extent, based on what's
reported now, after the addition and evident partial use of the four new
devices to the existing eight. Thus we don't know for sure when the
existing eight devices got fully allocated, whether it was before the
addition of the new devices or after, but full allocation is definitely
the state they're in now, according to your posted btrfs fi show.
One plausible guess is as Hugo suggested, that they were mostly but not
fully allocated before the addition of the new devices, with that data
written as an 8-strip-stripe (6+2), that after the addition of the four
new devices, the remaining unallocated space on the original eight was
then filled along with usage from the new four, in a 12-strip-stripe (10
+2), after which further writes, if any, were now down to a 4-strip-
stripe (2+2), since the original eight were now fully chunk-allocated and
the new four were the only devices with remaining unallocated space.
Another plausible guess is that the original eight devices were fully
chunk-allocated before the addition of the four new devices, and that the
free space that df was reporting was entirely in already allocated but
not fully used data chunks. In this case, you would have been perilously
close to ENOSPC errors, when the existing metadata chunks got full, since
all space was already allocated so no more metadata chunks could have
been allocated, and if you didn't actually hit those errors, it was
simply down to the lucky timing of adding the four new devices.
In either case, the fact that df was and is reporting TiBs of free space
doesn't necessarily mean that there was unallocated space left, because
df reports potential space to write data, which includes both
data-chunk-allocated-but-not-yet-used space and unallocated space.
Btrfs fi show reports, for each device, its total space and allocated
space, something entirely different, so directly comparing the output of
the two commands without knowing exactly what those numbers mean tells
you very little, as they're reporting two different things.
---
[1] Nominal chunk size: Note the "nominal" qualifier. While this is the
normal chunk allocation size, on multi-TiB devices, the first few data
chunk allocations in particular can be much larger, multiples of a GiB,
while as unallocated space dwindles, both data and metadata chunks can be
smaller in order to use up the last available unallocated space.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: How to properly and efficiently balance RAID6 after more drives are added?
From: Christian Rohmann @ 2015-09-04 8:28 UTC
To: Duncan, linux-btrfs
Hello Duncan,
thanks a million for taking the time and effort to explain all that.
I understand that all the devices must have been chunk-allocated for
btrfs to tell me all available "space" was used (read "allocated to data
chunks").
The filesystem is quite old already, used with kernels starting at 3.12
(I believe) and now 4.2, always with the most current version of
btrfs-progs Debian has available.
On 09/03/2015 04:22 AM, Duncan wrote:
> But what we /do/ know from what you posted (from after the add), the
> previously existing devices are "100% chunk-allocated", size 3.64 TiB,
> used 3.64 TiB, on each of the first eight devices.
>
> I don't know how much of (the user docs on) the wiki you've read, and/or
> understood, but for many people, it takes awhile to really understand a
> few major differences between btrfs and most other filesystems.
>
> 1) Btrfs separates data and metadata into separate allocations,
> allocating, tracking and reporting them separately. While some
> filesystems do allocate separately, few expose the separate data and
> metadata allocation detail to the user.
>
> 2) Btrfs allocates and uses space in two steps, first allocating/
> reserving relatively large "chunks" from free-space into separate data
> and metadata chunks, then using space from these chunk allocations as
> needed, until they're full and more must be allocated. Nominal[1] chunk
> size is 1 GiB for data, 256 MiB for metadata.
> 3) Up until a few kernel cycles ago, btrfs could and would automatically
> allocate chunks as needed, but wouldn't deallocate them when they
> emptied. Once they were allocated for data or metadata, that's how they
> stayed allocated, unless/until the user did a balance manually, at which
> point the chunk rewrite would consolidate the used space and free any
> unused chunk-space back to the unallocated space pool.
> The result was that given normal usage writing and deleting data, over
> time, all unallocated space would typically end up allocated as data
> chunks, such that at some point the filesystem would run out of metadata
> space and need to allocate more metadata chunks, but couldn't, because of
> all those extra partially to entirely empty data chunks that were
> allocated and never freed.
>
> Since IIRC 3.17 or so (kernel cycle from unverified memory, but that
> should be close), btrfs will automatically deallocate chunks if they're
> left entirely empty, so the problem has disappeared to a large extent,
> tho it's still possible to eventually end up with a bunch of not-quite-
> empty data chunks, that require a manual balance to consolidate and clean
> up.
I am running a full balance now, it's at 94% remaining (running for 48
hrs already ;-) ).
Is there any way I should / could "scan" for empty data chunks or almost
empty data chunks which could be freed in order to have more chunks
available for the actual balancing or new chunks that should be used
with a 10 drive RAID6? I understand that btrfs NOW does that somewhat
automagically, but my FS is quite old and used already and there is new
data coming in all the time, so I want that properly spread across all
the drives.
Regards
Christian
* Re: How to properly and efficiently balance RAID6 after more drives are added?
From: Duncan @ 2015-09-04 11:04 UTC
To: linux-btrfs
Christian Rohmann posted on Fri, 04 Sep 2015 10:28:21 +0200 as excerpted:
> Hello Duncan,
>
> thanks a million for taking the time and effort to explain all that.
> I understand that all the devices must have been chunk-allocated for
> btrfs to tell me all available "space" was used (read "allocated to data
> chunks").
>
> The filesystem is quite old already, used with kernels starting at 3.12
> (I believe) and now 4.2, always with the most current version of
> btrfs-progs Debian has available.
IIRC I wrote the previous reply without knowing the kernel you were on.
You had posted the userspace version, which was current, but not the
kernel version, so it's good to see that posted now.
Since kernel 3.12 was before automatic-empty-chunk-reclaim, empty chunks
wouldn't have been reclaimed back then, as I guessed. You're running
current 4.2 now and they would be, but only if they're totally empty, and
I'm guessing they're not, particularly if you don't run with the
autodefrag mount option turned on or do regular manual defrags. (Defrag
doesn't directly affect chunks, that's what balance is for, but failing
to defrag will mean things are more fragmented, which will cause btrfs to
spread out more in the available chunks as it'll try to put new files in
as few extents as possible, possibly in empty chunks if they haven't been
reclaimed and the space in fuller chunks is too small for the full file,
up to the chunk size, of course.)
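(If you want autodefrag from here on, it's just a mount option;
something like the following should do it, tho it only affects writes
made after it's enabled, and the mountpoint here is a placeholder:
--- cut ---
mount -o remount,autodefrag /mountpoint
--- cut ---
or add autodefrag to the options in the relevant fstab line.)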
And of course, only with 4.1 (nominally 3.19 but there were initial
problems) was raid6 mode fully code-complete and functional -- before
that, runtime worked, it calculated and wrote the parity stripes as it
should, but the code to recover from problems wasn't complete, so you
were effectively running a slow raid0 in terms of recovery ability, but
one that got "magically" updated to raid6 once the recovery code was
actually there and working.
So I'm guessing you have some 8-strip-stripe chunks at say 20% full or
some such. There's 19.19 data TiB used of 22.85 TiB allocated, a spread
of over 3 TiB. A full nominal-size data stripe allocation, given 12
devices in raid6, will be 10x1GiB data plus 2x1GiB parity, so there's
about 3.5 TiB / 10 GiB extra stripes worth of chunks, 350 stripes or so,
that should be freeable, roughly (the fact that you probably have 8-
strip, 12-strip, and 4-strip stripes, on the same filesystem, will of
course change that a bit, as will the fact that four devices are much
smaller than the other eight).
> On 09/03/2015 04:22 AM, Duncan wrote: [snipped]
>
> I am running a full balance now, it's at 94% remaining (running for 48
> hrs already ;-) ).
>
> Is there any way I should / could "scan" for empty data chunks or almost
> empty data chunks which could be freed in order to have more chunks
> available for the actual balancing or new chunks that should be used
> with a 10 drive RAID6? I understand that btrfs NOW does that somewhat
> automagically, but my FS is quite old and used already and there is new
> data coming in all the time, so I want that properly spread across all
> the drives.
There are balance filters; -dusage=20, for instance, would only rebalance
data (-d) chunks with usage under 20%. Of course there's more about
balance filters in the manpage and on the wiki.
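For example, to rebalance only the mostly-empty data and metadata chunks
in one go (mountpoint is a placeholder; adjust the percentages to taste):
--- cut ---
btrfs balance start -dusage=20 -musage=20 /mountpoint
--- cut ---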
The great thing about -dusage= (and -musage= where appropriate) is that
it can often free and deallocate large numbers of chunks at a fraction of
the time it'd take to do a full balance. Not only are you only dealing
with a fraction of the chunks, but since the ones it picks are for
example only 20% full (with usage=20) or less, they take only 20% (or
less) of the time that a full chunk would take to balance. Additionally, 20%
full or less means you reclaim chunks 4:1 or better -- five old chunks
are rewritten into a single new one, freeing four! So in a scenario with
a whole bunch of chunks at less than say 2/3 full (usage=67, rewrite
three into two), this can reclaim a whole lot of chunks in a relatively
small amount of time, certainly so compared to a full balance, since
rewriting a 100% full chunk takes the full amount of time and doesn't
reclaim anything.
But, given that the whole reason you're messing with it is to try to even
out the stripes across all devices, a full rewrite is eventually in order
anyway. However, knowing about the filters would have let you do a
-dusage=20 or possibly -dusage=50 before the full balance, leaving the
full balance more room to work in and possibly allowing a more
effective balance to the widest stripes possible.
Likely above 50 and almost certainly above 67, the returns wouldn't be
worth it, since the time taken for the filtered balance would be longer,
and an unfiltered balance was planned afterward anyway. Here, I'd have
tried something like 20 first, then 50 if I wasn't happy with the results
of 20. The thing is, either 20 would give me good results in a
reasonably short time, or there'd be so few candidates that it'd be very
fast to give me the bad results, thus allowing me to try 50. Same with
50 and 67, tho I'd definitely be unhappy if 50 didn't give me at least a
TiB or so freed to unallocated, hopefully some of which would be in the
first eight devices, ideally giving the full balance room enough to do
full 12-device-stripes, keeping enough free on the original eight devices
as it went to go 12-wide until the smaller devices were full, then 8
wide, eliminating the 4-wide stripes entirely.
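If you wanted to script that stepwise approach rather than running each
pass by hand, a rough sketch (the mountpoint and the thresholds are just
the ones discussed above):
--- cut ---
for u in 20 50 67; do
    btrfs balance start -dusage=$u /mountpoint
    btrfs fi show /mountpoint    # check how much got freed after each pass
done
--- cut ---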
Tho as Hugo suggested, having the original larger eight devices all the
way full and thus a good likelihood of all three stripe widths isn't
ideal, and it might actually take a couple balances (yes, at a week a
piece or whatever =:^() to straighten things out. A good -dusage=
filtered balance pre-pass would have likely taken under a day and with
luck would have allowed a single full balance to do the job, but it's a
bit late for that now...
Meanwhile, FWIW that long maintenance time is one of the reasons I'm a
strong partitioning advocate. Between the fact that I use SSDs and the
fact that my btrfs partitions are all under 50 GiB per partition (which
probably wouldn't be practical for you, but half to 1 TiB per device
partition might be...), full scrubs typically take under a minute here,
and full balances still in the single-digit minutes. Of course, I have
other partitions/filesystems too, and to do all of them would take a bit
longer, say an hour, but with maintenance time under 10 minutes per
filesystem, doing it is not only not a pain, but is actually trivial,
whereas doing maintenance that's going to take a week is definitely a
pain, something you're going to avoid if possible, meaning there's a fair
chance a minor problem will be allowed to get far worse before it's
addressed, than it would be if the maintenance were a matter of a few
hours, say a day at most.
But that's just me. I've fine tuned my partitioning layout over multiple
multi-year generations and have it setup so I don't have the hassle of
"oh, I'm out of space on this partition, gotta symlink to a different
one" that a lot of folks point to as the reason they prefer big storage
pools like lvm or multi-whole-physical-device btrfs. And obviously, I'm
not scaling storage to the double-digit TiB you are, either. So your
system, your layout and rules. I'm simply passing on one reason that I'm
such a strong partitioning advocate, here.
Plus I know you'd REALLY like those 10 minute full-balances right about
now! =:^)
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: How to properly and efficiently balance RAID6 after more drives are added?
From: Christian Rohmann @ 2015-11-11 14:17 UTC
To: linux-btrfs
Sorry for the late reply to this list regarding this topic
...
On 09/04/2015 01:04 PM, Duncan wrote:
> And of course, only with 4.1 (nominally 3.19 but there were initial
> problems) was raid6 mode fully code-complete and functional -- before
> that, runtime worked, it calculated and wrote the parity stripes as it
> should, but the code to recover from problems wasn't complete, so you
> were effectively running a slow raid0 in terms of recovery ability, but
> one that got "magically" updated to raid6 once the recovery code was
> actually there and working.
Like others who write to this ML, I run into crashes when trying to do a
balance of my filesystem.
I moved through the different kernel versions and btrfs-tools and am
currently running Kernel 4.3 + 4.3rc1 of the tools but still after like
an hour of balancing (and actually moving chunks) the machine crashes
horribly without giving any good stack trace or anything in the kernel
log which I could report here :(
Any ideas on how I could proceed to get some usable debug info for the
devs to look at?
> So I'm guessing you have some 8-strip-stripe chunks at say 20% full or
> some such. There's 19.19 data TiB used of 22.85 TiB allocated, a spread
> of over 3 TiB. A full nominal-size data stripe allocation, given 12
> devices in raid6, will be 10x1GiB data plus 2x1GiB parity, so there's
> about 3.5 TiB / 10 GiB extra stripes worth of chunks, 350 stripes or so,
> that should be freeable, roughly (the fact that you probably have 8-
> strip, 12-strip, and 4-strip stripes, on the same filesystem, will of
> course change that a bit, as will the fact that four devices are much
> smaller than the other eight).
The new devices have been in place for a while (> 2 months) now, and are
barely used. Why is there not more data being put onto the new disks?
Even without a balance new data should spread evenly across all devices
right? From the IOPs I can see that only the 8 disks which always have
been in the box are doing any heavy lifting and the new disks are mostly
idle.
Anything I could do to narrow down where a certain file is stored across
the devices?
Regards
Christian
* Re: How to properly and efficiently balance RAID6 after more drives are added?
From: Duncan @ 2015-11-12 4:31 UTC
To: linux-btrfs
Christian Rohmann posted on Wed, 11 Nov 2015 15:17:19 +0100 as excerpted:
> Sorry for the late reply to this list regarding this topic ...
>
> On 09/04/2015 01:04 PM, Duncan wrote:
>> And of course, only with 4.1 (nominally 3.19 but there were initial
>> problems) was raid6 mode fully code-complete and functional -- before
>> that, runtime worked, it calculated and wrote the parity stripes as it
>> should, but the code to recover from problems wasn't complete, so you
>> were effectively running a slow raid0 in terms of recovery ability, but
>> one that got "magically" updated to raid6 once the recovery code was
>> actually there and working.
>
> Like others who write to this ML, I run into crashes when trying to do a
> balance of my filesystem.
> I moved through the different kernel versions and btrfs-tools and am
> currently running Kernel 4.3 + 4.3rc1 of the tools but still after like
> an hour of balancing (and actually moving chunks) the machine crashes
> horribly without giving any good stack trace or anything in the kernel
> log which I could report here :(
>
> Any ideas on how I could proceed to get some usable debug info for the
> devs to look at?
I'm not a dev so my view into the real deep technical side is limited,
but what I can say is this...
Generally, crashes during balance indicate not so much bugs in the way
the kernel handles existing balance (tho those sometimes occur as well,
but the chances are relatively lower), but rather, a filesystem screwed
up in a way that balance hasn't been taught to deal with yet.
Of course there's two immediate points that can be made from that:
1) Newer kernels have been taught to deal with more bugs, so if you're
not on current (which you are now), consider upgrading to current at
least long enough to see if it already knows how to deal with it.
2) If a balance is crashing with a particular kernel, it's unlikely the
problem will simply go away on its own, without either a kernel upgrade
to one that knows how to deal with that problem, or in some cases, a
filesystem change that unpins whatever was bad and lets it be deleted.
Filesystem changes likely to do that sort of thing are removing your
oldest snapshots, thereby freeing anything that had changed in newer
snapshots and the working version, that was still being pinned by the old
snapshots, or in the absence of snapshot pinning, removal of whatever
often large possibly repeatedly edited file happened to be locking down
whatever balance was choking on.
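(For the snapshot route, listing and removing them is simple enough; the
paths here are placeholders:
--- cut ---
btrfs subvolume list -s /mountpoint
btrfs subvolume delete /mountpoint/snapshots/oldest-snapshot
--- cut ---
)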
Another point (based on a different factor) can be added in addition:
3) Raid56 mode is still relatively new, and it seems a number of users of
the raid56 mode feature seem to be reporting what appears to me at least
(considering my read of tracedumps is extremely limited) to be the same
sort of balance bug, often with the same couldn't-get-a-trace pattern.
This very likely indicates a remaining bug embedded deeply enough in the
raid56 code that it has taken until now to trigger enough times to even
begin to appear on the radar. Of course the fact that it so often no-
traces doesn't help finding it, but the reports are getting common enough
that at least to the informed non-dev list regular like me, there does
seem to be a pattern emerging.
This is a bit worrying, but it's /exactly/ the reason that I had
suggested that people wait for at least two entirely "clean" kernel
cycles without raid56 bugs before considering it as stable as is the rest
of btrfs, and predicted that would likely be at least five kernel cycles
(a year) after initial nominal-full-code release, putting it at 4.4 at
the earliest. Since the last big raid56 bug was fixed fairly early in
the 4.1 cycle, two clean series would be 4.2 and 4.3, which would again
point to 4.4. But we now have this late-appearing bug just coming up on
the radar, which if it does indeed end up raid56 related, both validates
my earlier caution, and at least conservatively speaking, should reset
that two-clean-kernel-cycles clock. However, given that the feature in
general has been maturing in the mean time, I'd say reset it with only
one clean kernel cycle this time, so again assuming the problem is indeed
found to be in raid56 and that it's fixed before 4.4 release, I'd want
4.5 to be raid56 uneventful, and would then consider 4.6 raid56 maturity/
stability-comparable to btrfs in general, assuming no further raid56 bugs
have appeared by its release.
As to ideas for getting a trace, the best I can do is repeat what I've
seen others suggest here. It will obviously take a bit more resources
than some have available, but apparently has the best chance of working
in such instances, if it can be done at all, that being...
Configure the test machine with a network-attached tty, and set it as
your system console, so debug traces will dump to it. The kernel will
try its best to dump traces to system-console as it considers that safe
even after it considers itself too scrambled to trust writing anything to
disk, so this sort of network system console arrangement can often get at
least /some/ of a debug trace before the kernel entirely loses coherency.
The specifics I don't know as I don't tend to have the network resources
to log to, here, and thus, have no personal experience with it at all.
I think I remember seeing a text file in the kernel docs dir that had
instructions, but you could look that up as easily as I could, so there's no
point in me double-checking on that.
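For what it's worth, a minimal netconsole setup usually looks something
like the sketch below; the addresses, ports, MAC and interface are
placeholders, and Documentation/networking/netconsole.txt in the kernel
source has the real syntax:
--- cut ---
# on the crashing machine:
modprobe netconsole netconsole=6665@192.168.1.10/eth0,6666@192.168.1.20/00:11:22:33:44:55
# on the receiving machine:
nc -u -l 6666        # or: nc -u -l -p 6666, depending on your netcat
--- cut ---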
The other side of it would be enabling the various btrfs and general
kernel debug and tracing apparatus, but you'd need a dev to give you the
details there.
>> So I'm guessing you have some 8-strip-stripe chunks at say 20% full or
>> some such. There's 19.19 data TiB used of 22.85 TiB allocated, a
>> spread of over 3 TiB. A full nominal-size data stripe allocation,
>> given 12 devices in raid6, will be 10x1GiB data plus 2x1GiB parity, so
>> there's about 3.5 TiB / 10 GiB extra stripes worth of chunks, 350
>> stripes or so,
>> that should be freeable, roughly (the fact that you probably have 8-
>> strip, 12-strip, and 4-strip stripes, on the same filesystem, will of
>> course change that a bit, as will the fact that four devices are much
>> smaller than the other eight).
>
> The new devices have been in place for a while (> 2 months) now, and are
> barely used. Why is there not more data being put onto the new disks?
> Even without a balance new data should spread evenly across all devices
> right? From the IOPs I can see that only the 8 disks which always have
> been in the box are doing any heavy lifting and the new disks are mostly
> idle.
That isn't surprising. New extent allocations will be made from existing
data chunks, where they can be (that being, where there's empty space
within them), and most of those will be across only the original 8
devices. Only if that space in existing data chunks is used up will new
chunk allocations be made. And as it appeared you had over 3 TiB of
space within the existing chunks...
Of course balance is supposed to be the tool that helps you fix this, but
with it bugging out on... something... as discussed above, that's not
really helping you either.
Personally, what I'd probably do here would be decide if the data was
worth the trouble or not, given the time it's obviously going to take
even with good backups to simply copy nearly 20 TiB of data from one
place to another. Then I'd blow away and recreate, as the only sure way
to a clean filesystem, and copy back if I did consider it worth the
trouble. Of course that's easy for /me/ to say, with my multiple
separate but rather small (nothing even 3-digit GiB, let alone TiB scale)
btrfs filesystems, all on ssd, such that a full balance/scrub/check on a
single filesystem is only minutes at the longest, and often under a
minute, to completion. But it /is/ what I'd do.
But then again, as should be clear from the above discussion, I wouldn't
have trusted non-throw-away data to btrfs raid56 until I considered it
roughly as stable as the rest of btrfs, which for me would have been at
least 4.4 and is now beginning to look like at least 4.6, in the first
place. Neither at this point would I be at all confident that were you
to use the same sort of raid56 layout, at its current maturity, that
you'd not end up with the exact same bug and thus no workable balance,
tho at least you'd have full-width-stripes as you'd have been using all
the devices from the get-go, so maybe you'd not /need/ to balance for
awhile.
> Anything I could do to narrow down where a certain file is stored across
> the devices?
The other possibility (this one both narrowing down where the problem is
and hopefully helping to eliminate it at the same time) would be,
assuming no snapshots locking down old data, to start rewriting that 20
TiB of data say a TiB or two at a time, removing the old copy, thereby
freeing the extents and tracking metadata it took, and trying the balance
again, until you find the bit causing all the trouble and rewrite it,
presumably to a form less troublesome to balance. If you have a gut
feeling as to where in your data the problem might be, start with it;
otherwise, just cover the whole nearly 20 TiB systematically.
If at some point you can now complete a balance, that demonstrates that
the problem was indeed a defect in the filesystem that a rewrite
eventually overcame. If you still can't balance after a full rewrite of
everything, that demonstrates a more fundamental bug, likely somewhere
within the guts of the raid56 code itself, such that rewriting everything
only rewrites the same problem once again.
That one might actually be practical enough to do, and has a good chance
of working, tho do note that you need to verify that your method of
rewriting the files isn't simply using reflink (which AFAIK is what a
current mv with src and dest on the same btrfs will now do), since
reflink won't actually rewrite the data, only some metadata. The easiest
way to be /sure/ a file is actually rewritten, is to do a cross-
filesystem copy/move, perhaps using tmpfs if your memory is large enough
for the file(s) in question, in which case you'd /copy/ it off btrfs to
tmpfs, then /move/ it back, into a different location. When the round
trip is completed, sync, and delete the old copy.
(Tmpfs being memory-only, thus as fast as possible but not crash-safe
should the only copy be in tmpfs at the time, this procedure ensures that
a valid copy is always on permanent storage. The first copy leaves the
old version in place, where it remains until the new version from tmpfs
is safely moved into the new location, with the sync ensuring it all
actually hits permanent storage, after which it should be safe to remove
the old copy since the new one is now safely on disk.)
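A rough sketch of that round trip for a single file, assuming /dev/shm
is a tmpfs with enough room and with placeholder paths:
--- cut ---
cp /mountpoint/path/bigfile /dev/shm/bigfile
mv /dev/shm/bigfile /mountpoint/path/bigfile.new   # cross-fs move = real rewrite
sync
rm /mountpoint/path/bigfile
mv /mountpoint/path/bigfile.new /mountpoint/path/bigfile   # same-fs rename, metadata only
--- cut ---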
As for knowing specifically where a file is stored, yes, that's possible,
using btrfs debug commands. As the saying goes, however, the details
"are left as an exercise for the reader", since I've never actually had
to do it myself. So check the various btrfs-* manpages and (carefully!)
experiment a bit. =:^) Or just check back thru the list archive as I'm
sure I've seen it posted, but without a bit more to go on than that, the
manpages method is likely faster. =:^)
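For the record, one approach I believe works (untested here; the file,
device and offset are placeholders): filefrag will show the file's
extents, whose "physical" offsets on btrfs are actually btrfs logical
addresses, and btrfs-map-logical can then map one of those to a device
and physical offset. Note that filefrag reports offsets in filesystem
blocks, so multiply by the block size (typically 4096) first:
--- cut ---
filefrag -v /mountpoint/path/somefile
btrfs-map-logical -l <byte_offset_from_filefrag> /dev/sdc
--- cut ---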
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman