linux-btrfs.vger.kernel.org archive mirror
* How to properly and efficiently balance RAID6 after more drives are added?
@ 2015-09-02 10:29 Christian Rohmann
  2015-09-02 11:30 ` Hugo Mills
  0 siblings, 1 reply; 8+ messages in thread
From: Christian Rohmann @ 2015-09-02 10:29 UTC (permalink / raw)
  To: linux-btrfs

Hello btrfs-enthusiasts,

I have a rather big btrfs RAID6 with currently 12 devices. It used to be
only 8 drives of 4TB each, but I successfully added 4 more drives of 1TB
each at some point. What I am trying to find out, and that's my main
reason for posting this, is how to balance the data on the drives now.

I am wondering what I should read from this "btrfs filesystem show" output:

--- cut ---
        Total devices 12 FS bytes used 19.23TiB
        devid    1 size 3.64TiB used 3.64TiB path /dev/sdc
        devid    2 size 3.64TiB used 3.64TiB path /dev/sdd
        devid    3 size 3.64TiB used 3.64TiB path /dev/sde
        devid    4 size 3.64TiB used 3.64TiB path /dev/sdf
        devid    5 size 3.64TiB used 3.64TiB path /dev/sdh
        devid    6 size 3.64TiB used 3.64TiB path /dev/sdi
        devid    7 size 3.64TiB used 3.64TiB path /dev/sdj
        devid    8 size 3.64TiB used 3.64TiB path /dev/sdb
        devid    9 size 931.00GiB used 535.48GiB path /dev/sdg
        devid   10 size 931.00GiB used 535.48GiB path /dev/sdk
        devid   11 size 931.00GiB used 535.48GiB path /dev/sdl
        devid   12 size 931.00GiB used 535.48GiB path /dev/sdm

btrfs-progs v4.1.2
--- cut ---


First of all, I wonder why the first 8 disks are shown as "full" (used =
size), while "df" still reports 5.3TB of free space for the filesystem:

--- cut ---
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdc         33T   20T  5.3T  79% /somemountpointsomewhere
--- cut ---

Also "btrfs filesystem df" doesn't give me any clues on the matter:

--- cut ---
btrfs filesystem df /srv/mirror/
Data, single: total=8.00MiB, used=0.00B
Data, RAID6: total=22.85TiB, used=19.19TiB
System, single: total=4.00MiB, used=0.00B
System, RAID6: total=12.00MiB, used=1.34MiB
Metadata, single: total=8.00MiB, used=0.00B
Metadata, RAID6: total=42.09GiB, used=38.42GiB
GlobalReserve, single: total=512.00MiB, used=1.58MiB
--- cut ---




What I am very certain about is that the "load" of I/O requests is not
evenly distributed yet, as iostat clearly shows:

--- cut ---
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdc              21.40     4.41   42.22   12.71  3626.12   940.79   166.29     3.82   69.38   42.83  157.60   5.98  32.82
sdb              22.35     4.45   41.29   12.71  3624.20   941.27   169.09     4.22   77.88   46.75  178.97   6.10  32.96
sdd              22.03     4.44   41.60   12.73  3623.76   943.22   168.13     3.79   69.45   42.53  157.48   6.05  32.85
sde              21.21     4.43   42.30   12.74  3621.39   943.36   165.88     3.82   69.28   42.99  156.62   5.98  32.90
sdf              22.19     4.42   41.42   12.75  3623.65   940.63   168.51     3.77   69.36   42.64  156.13   6.05  32.79
sdh              21.35     4.46   42.25   12.68  3623.12   940.28   166.14     3.95   71.72   43.61  165.40   6.02  33.06
sdi              21.92     4.38   41.67   12.79  3622.03   942.91   167.63     3.49   63.83   40.23  140.74   6.02  32.77
sdj              21.31     4.41   42.26   12.72  3625.32   941.50   166.12     3.99   72.25   44.50  164.44   6.00  33.01
sdg               8.90     4.97   12.53   21.16  1284.47  1630.08   173.02     0.83   24.61   27.31   23.02   1.77   5.95
sdk               9.14     4.94   12.30   21.19  1284.61  1630.02   174.07     0.79   23.41   26.59   21.57   1.76   5.91
sdl               8.88     4.95   12.58   21.19  1284.46  1630.06   172.62     0.80   23.80   25.68   22.68   1.78   6.00
sdm               9.07     4.85   12.35   21.29  1284.43  1630.01   173.26     0.79   23.57   26.57   21.83   1.77   5.94

--- cut ---



Should I run btrfs balance on the filesystem? If so, what FILTERS would
I then use in order for the data and therefore requests to be better
distributed?




With regards and thanks in advance,


Christian


* Re: How to properly and efficiently balance RAID6 after more drives are added?
  2015-09-02 10:29 How to properly and efficiently balance RAID6 after more drives are added? Christian Rohmann
@ 2015-09-02 11:30 ` Hugo Mills
  2015-09-02 13:09   ` Christian Rohmann
  0 siblings, 1 reply; 8+ messages in thread
From: Hugo Mills @ 2015-09-02 11:30 UTC (permalink / raw)
  To: Christian Rohmann; +Cc: linux-btrfs


On Wed, Sep 02, 2015 at 12:29:06PM +0200, Christian Rohmann wrote:
> Hello btrfs-enthusiasts,
> 
> I have a rather big btrfs RAID6 with currently 12 devices. It used to be
> only 8 drives of 4TB each, but I successfully added 4 more drives of 1TB
> each at some point. What I am trying to find out, and that's my main
> reason for posting this, is how to balance the data on the drives now.
> 
> I am wondering what I should read from this "btrfs filesystem show" output:
> 
> --- cut ---
>         Total devices 12 FS bytes used 19.23TiB
>         devid    1 size 3.64TiB used 3.64TiB path /dev/sdc
>         devid    2 size 3.64TiB used 3.64TiB path /dev/sdd
>         devid    3 size 3.64TiB used 3.64TiB path /dev/sde
>         devid    4 size 3.64TiB used 3.64TiB path /dev/sdf
>         devid    5 size 3.64TiB used 3.64TiB path /dev/sdh
>         devid    6 size 3.64TiB used 3.64TiB path /dev/sdi
>         devid    7 size 3.64TiB used 3.64TiB path /dev/sdj
>         devid    8 size 3.64TiB used 3.64TiB path /dev/sdb
>         devid    9 size 931.00GiB used 535.48GiB path /dev/sdg
>         devid   10 size 931.00GiB used 535.48GiB path /dev/sdk
>         devid   11 size 931.00GiB used 535.48GiB path /dev/sdl
>         devid   12 size 931.00GiB used 535.48GiB path /dev/sdm

   You had some data on the first 8 drives with 6 data+2 parity, then
added four more. From that point on, you were adding block groups with
10 data+2 parity. At some point, the first 8 drives became full, and
then new block groups have been added only to the new drives, using 2
data+2 parity.

> btrfs-progs v4.1.2
> --- cut ---
> 
> 
> First of all, I wonder why the first 8 disks are shown as "full" (used =
> size), while "df" still reports 5.3TB of free space for the filesystem:
> 
> --- cut ---
> Filesystem      Size  Used Avail Use% Mounted on
> /dev/sdc         33T   20T  5.3T  79% /somemountpointsomewhere
> --- cut ---

   This is inaccurate because the calculations that correct for the
RAID usage probably aren't all that precise for parity RAID,
particularly when there are variable stripe sizes like you have in your
FS. In fact, they're not even all that good for things like RAID-1
(I've seen inaccuracies on my own RAID-1 system).

> Also "btrfs filesystem df" doesn't give me any clues on the matter:
> 
> --- cut ---
> btrfs filesystem df /srv/mirror/
> Data, single: total=8.00MiB, used=0.00B
> Data, RAID6: total=22.85TiB, used=19.19TiB
> System, single: total=4.00MiB, used=0.00B
> System, RAID6: total=12.00MiB, used=1.34MiB
> Metadata, single: total=8.00MiB, used=0.00B
> Metadata, RAID6: total=42.09GiB, used=38.42GiB
> GlobalReserve, single: total=512.00MiB, used=1.58MiB
> --- cut ---

   This is showing you how the "used" space from the btrfs fi show
output is divided up. It won't tell you anything about the proportion
of the data that's 6+2, the amount that's 10+2, and the amount that's
2+2 (or any other values).

> What I am very certain about is that the "load" of I/O requests is not
> evenly distributed yet, as iostat clearly shows:
> 
> --- cut ---
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sdc              21.40     4.41   42.22   12.71  3626.12   940.79   166.29     3.82   69.38   42.83  157.60   5.98  32.82
> sdb              22.35     4.45   41.29   12.71  3624.20   941.27   169.09     4.22   77.88   46.75  178.97   6.10  32.96
> sdd              22.03     4.44   41.60   12.73  3623.76   943.22   168.13     3.79   69.45   42.53  157.48   6.05  32.85
> sde              21.21     4.43   42.30   12.74  3621.39   943.36   165.88     3.82   69.28   42.99  156.62   5.98  32.90
> sdf              22.19     4.42   41.42   12.75  3623.65   940.63   168.51     3.77   69.36   42.64  156.13   6.05  32.79
> sdh              21.35     4.46   42.25   12.68  3623.12   940.28   166.14     3.95   71.72   43.61  165.40   6.02  33.06
> sdi              21.92     4.38   41.67   12.79  3622.03   942.91   167.63     3.49   63.83   40.23  140.74   6.02  32.77
> sdj              21.31     4.41   42.26   12.72  3625.32   941.50   166.12     3.99   72.25   44.50  164.44   6.00  33.01
> sdg               8.90     4.97   12.53   21.16  1284.47  1630.08   173.02     0.83   24.61   27.31   23.02   1.77   5.95
> sdk               9.14     4.94   12.30   21.19  1284.61  1630.02   174.07     0.79   23.41   26.59   21.57   1.76   5.91
> sdl               8.88     4.95   12.58   21.19  1284.46  1630.06   172.62     0.80   23.80   25.68   22.68   1.78   6.00
> sdm               9.07     4.85   12.35   21.29  1284.43  1630.01   173.26     0.79   23.57   26.57   21.83   1.77   5.94
> 
> --- cut ---
> 
> 
> 
> Should I run btrfs balance on the filesystem? If so, what FILTERS would
> I then use in order for the data and therefore requests to be better
> distributed?

   Yes, you should run a balance. You probably need to free up some
space on the first 8 drives first, to give the allocator a chance to
use all 12 devices in a single stripe. This can also be done with a
balance. Sadly, with the striped RAID levels (0, 10, 5, 6), it's
generally harder to ensure that all of the data is striped as evenly
as is possible(*). I don't think there are any filters that you need
to use -- just balance everything. The first time probably won't do
the job fully. A second balance probably will. These are going to take
a very long time to run (in your case, I'd guess at least a week for
each balance). I would recommend starting the balance in a tmux or
screen session, and also creating a second shell in the same session
to run monitoring processes. I typically use something like:

watch -n60 sudo btrfs fi show\; echo\; btrfs fi df /mountpoint\; echo\; btrfs bal stat /mountpoint
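
   For the balance itself, a minimal sketch (the session name and the
mountpoint are just placeholders) would be something along these lines:

# run the full, unfiltered balance inside a detached tmux session
tmux new-session -d -s balance 'sudo btrfs balance start /mountpoint'
# attach to it, and open a second window there for the monitoring loop
tmux attach -t balance

   You can also check on it at any time with "btrfs balance status
/mountpoint", and a running balance can be paused and resumed with
"btrfs balance pause" / "btrfs balance resume" if you temporarily need
the I/O bandwidth back.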

   Hugo.

(*) Hmmm... idea for a new filter: min/max stripe width? Then you
could balance only the block groups that aren't at full width, which
is probably what's needed here.

-- 
Hugo Mills             | Comic Sans goes into a bar, and the barman says, "We
hugo@... carfax.org.uk | don't serve your type here."
http://carfax.org.uk/  |
PGP: E2AB1DE4          |



* Re: How to properly and efficiently balance RAID6 after more drives are added?
  2015-09-02 11:30 ` Hugo Mills
@ 2015-09-02 13:09   ` Christian Rohmann
  2015-09-03  2:22     ` Duncan
  0 siblings, 1 reply; 8+ messages in thread
From: Christian Rohmann @ 2015-09-02 13:09 UTC (permalink / raw)
  To: Hugo Mills, linux-btrfs

Hey Hugo,

thanks for the quick response.

On 09/02/2015 01:30 PM, Hugo Mills wrote:
> You had some data on the first 8 drives with 6 data+2 parity, then 
> added four more. From that point on, you were adding block groups
> with 10 data+2 parity. At some point, the first 8 drives became
> full, and then new block groups have been added only to the new
> drives, using 2 data+2 parity.

Even though the old 8-drive RAID6 was not full yet? Read: there were
still some terabytes of free space.


>> Should I run btrfs balance on the filesystem? If so, what FILTERS
>> would I then use in order for the data and therefore requests to
>> be better distributed?
> 
> Yes, you should run a balance. You probably need to free up some 
> space on the first 8 drives first, to give the allocator a chance
> to use all 12 devices in a single stripe. This can also be done
> with a balance. Sadly, with the striped RAID levels (0, 10, 5, 6),
> it's generally harder to ensure that all of the data is striped as
> evenly as is possible(*). I don't think there are any filters that
> you need to use -- just balance everything. The first time
> probably won't do the job fully. A second balance probably will.
> These are going to take a very long time to run (in your case, I'd
> guess at least a week for each balance). I would recommend starting
> the balance in a tmux or screen session, and also creating a second
> shell in the same session to run monitoring processes. I typically
> use something like:
> 
> watch -n60 sudo btrfs fi show\; echo\; btrfs fi df /mountpoint\;
> echo\; btrfs bal stat /mountpoint

Yeah, that's what I usually do. The thing is that one does not get any
progress indication or estimate of how long the task will take.


> (*) Hmmm... idea for a new filter: min/max stripe width? Then you 
> could balance only the block groups that aren't at full width,
> which is probably what's needed here.

Consider my question and motivation a rather obvious use case: running
out of disk space (or iops) and simply adding some more drives. A
balance needs to be straightforward for people to understand and
perform in such cases.



Regards

Christian


* Re: How to properly and efficiently balance RAID6 after more drives are added?
  2015-09-02 13:09   ` Christian Rohmann
@ 2015-09-03  2:22     ` Duncan
  2015-09-04  8:28       ` Christian Rohmann
  0 siblings, 1 reply; 8+ messages in thread
From: Duncan @ 2015-09-03  2:22 UTC (permalink / raw)
  To: linux-btrfs

Christian Rohmann posted on Wed, 02 Sep 2015 15:09:47 +0200 as excerpted:

> Hey Hugo,
> 
> thanks for the quick response.
> 
> On 09/02/2015 01:30 PM, Hugo Mills wrote:
>> You had some data on the first 8 drives with 6 data+2 parity, then
>> added four more. From that point on, you were adding block groups with
>> 10 data+2 parity. At some point, the first 8 drives became full, and
>> then new block groups have been added only to the new drives, using 2
>> data+2 parity.
> 
> Even though the old 8-drive RAID6 was not full yet? Read: there were
> still some terabytes of free space.

At this point we're primarily guessing (unless you want to dive deep into 
btrfs-debug or the like), because the results you posted are from after 
you added the set of four new devices to the existing eight.  We don't 
have the btrfs fi show and df from before you added the new devices.

But what we /do/ know from what you posted (from after the add) is that
the previously existing devices are "100% chunk-allocated": size 3.64 TiB,
used 3.64 TiB, on each of the first eight devices.

I don't know how much of (the user docs on) the wiki you've read, and/or 
understood, but for many people, it takes awhile to really understand a 
few major differences between btrfs and most other filesystems.

1) Btrfs separates data and metadata into separate allocations, 
allocating, tracking and reporting them separately.  While some 
filesystems do allocate separately, few expose the separate data and 
metadata allocation detail to the user.

2) Btrfs allocates and uses space in two steps, first allocating/
reserving relatively large "chunks" from free-space into separate data 
and metadata chunks, then using space from these chunk allocations as 
needed, until they're full and more must be allocated.  Nominal[1] chunk 
size is 1 GiB for data, 256 MiB for metadata.

It's worth noting that for striped raid (with or without parity, so 
raid0,5,6, with parity strips taken from what would be the raid0 strips 
as appropriate), btrfs allocates a full chunk strip on each available 
device, so nominal raid6 strip allocation on eight devices would be a 6 
GiB data plus 2 GiB parity stripe (8x1GiB strips per stripe), while 
metadata would be 1.5 GiB metadata (6x256MiB) plus half a GiB parity 
(2x256MiB), for a total of 8x256MiB strips per stripe.
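
(If your btrfs-progs is new enough -- 3.18 or so, which your v4.1.2 is --
you can watch this two-step allocation per device directly.  The
mountpoint below is a placeholder, and note that the free-space
estimates printed for raid56 were still somewhat rough in that era:

# overall allocated vs. unallocated space, with a per-device breakdown
btrfs filesystem usage /mountpoint
# per-device allocation split by profile (data/metadata/system)
btrfs device usage /mountpoint

That shows, for each device, how much is tied up in data chunks, how
much in metadata chunks, and how much is still unallocated.)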

Again, most filesystems don't allocate in chunks like this, at least for 
data (they often will allocate metadata in chunks of some size, in 
order to keep it grouped relatively close together, but that level of 
detail isn't shown to the user, and because metadata is typically a small 
fraction of data, it can simply be included in the used figure as soon as 
allocated and still disappear in the rounding error).  What they report 
as free space is thus available unallocated space that should, within 
rounding error, be available for data.

3) Up until a few kernel cycles ago, btrfs could and would automatically 
allocate chunks as needed, but wouldn't deallocate them when they 
emptied.  Once they were allocated for data or metadata, that's how they 
stayed allocated, unless/until the user did a balance manually, at which 
point the chunk rewrite would consolidate the used space and free any 
unused chunk-space back to the unallocated space pool.

The result was that given normal usage writing and deleting data, over 
time, all unallocated space would typically end up allocated as data 
chunks, such that at some point the filesystem would run out of metadata 
space and need to allocate more metadata chunks, but couldn't, because of 
all those extra partially to entirely empty data chunks that were 
allocated and never freed.

Since IIRC 3.17 or so (kernel cycle from unverified memory, but that 
should be close), btrfs will automatically deallocate chunks if they're 
left entirely empty, so the problem has disappeared to a large extent, 
tho it's still possible to eventually end up with a bunch of not-quite-
empty data chunks, that require a manual balance to consolidate and clean 
up.

4) Normal df (as opposed to btrfs fi df) will list free space in existing 
data chunks as free, even after all unallocated space is gone and it's 
all allocated to either data or metadata chunks.  At that point, whichever 
one you run out of first, typically metadata, will trigger ENOSPC 
errors, despite df often showing quite some free space left -- because 
all the reported free-space is tied up in data chunks, and there's no 
unallocated space left to allocate to new metadata chunks when the 
existing ones get full.

5) What btrfs fi show reports for "used" in the individual device stats 
is chunk-allocated space.

What your btrfs fi show is saying is that 100% of the capacity of those 
first eight devices is chunk-allocated.  Whether to data or metadata 
chunks it doesn't say, but whichever it is, that space is already 
allocated to one or the other, and cannot be reallocated to something 
else (either to a different-sized stripe after adding the new devices, 
or to the opposite of data or metadata, whichever it is allocated as) 
until it is rewritten in order to consolidate all the actually used 
space into as few chunks as possible, thereby freeing the unused but 
currently chunk-allocated space back to the unallocated pool.  This 
chunk rewrite and consolidation is exactly what balance is designed to do.

Again, at this point we're guessing to some extent, based on what's 
reported now, after the addition and evident partial use of the four new 
devices to the existing eight.  Thus we don't know for sure when the 
existing eight devices got fully allocated, whether it was before the 
addition of the new devices or after, but full allocation is definitely 
the state they're in now, according to your posted btrfs fi show.

One plausible guess is as Hugo suggested, that they were mostly but not 
fully allocated before the addition of the new devices, with that data 
written as an 8-strip-stripe (6+2), that after the addition of the four 
new devices, the remaining unallocated space on the original eight was 
then filled along with usage from the new four, in a 12-strip-stripe (10
+2), after which further writes, if any, were now down to a 4-strip-
stripe (2+2), since the original eight were now fully chunk-allocated and 
the new four were the only devices with remaining unallocated space.

Another plausible guess is that the original eight devices were fully 
chunk-allocated before the addition of the four new devices, and that the 
free space that df was reporting was entirely in already allocated but 
not fully used data chunks.  In this case, you would have been perilously 
close to ENOSPC errors, when the existing metadata chunks got full, since 
all space was already allocated so no more metadata chunks could have 
been allocated, and if you didn't actually hit those errors, it was 
simply down to the lucky timing of adding the four new devices.

In either case, the fact that df was and is reporting TiB of free space 
doesn't necessarily mean that there was unallocated space left, because 
df reports on potential space to write data, including both 
data-chunk-allocated-but-not-yet-used space and unallocated space.  Btrfs 
fi show is reporting, for each device, its total space and allocated 
space, something totally different from what df reports, so trying to 
directly compare the output of the two commands without knowing exactly 
what those numbers mean is meaningless, as they're reporting two entirely 
different things.

---
[1] Nominal chunk size:  Note the "nominal" qualifier.  While this is the 
normal chunk allocation size, on multi-TiB devices, the first few data 
chunk allocations in particular can be much larger, multiples of a GiB, 
while as unallocated space dwindles, both data and metadata chunks can be 
smaller, in order to use up the last available unallocated space.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: How to properly and efficiently balance RAID6 after more drives are added?
  2015-09-03  2:22     ` Duncan
@ 2015-09-04  8:28       ` Christian Rohmann
  2015-09-04 11:04         ` Duncan
  0 siblings, 1 reply; 8+ messages in thread
From: Christian Rohmann @ 2015-09-04  8:28 UTC (permalink / raw)
  To: Duncan, linux-btrfs

Hello Duncan,

thanks a million for taking the time and effort to explain all that.
I understand that all the devices must have been chunk-allocated for
btrfs to tell me all available "space" was used (read "allocated to data
chunks").

The filesystem is quite old already; it started out on kernel 3.12 (I
believe) and is now on 4.2, always with the most current version of
btrfs-progs Debian has available.


On 09/03/2015 04:22 AM, Duncan wrote:
> But what we /do/ know from what you posted (from after the add) is that
> the previously existing devices are "100% chunk-allocated": size 3.64 TiB,
> used 3.64 TiB, on each of the first eight devices.
> 
> I don't know how much of (the user docs on) the wiki you've read, and/or 
> understood, but for many people, it takes awhile to really understand a 
> few major differences between btrfs and most other filesystems.
> 
> 1) Btrfs separates data and metadata into separate allocations, 
> allocating, tracking and reporting them separately.  While some 
> filesystems do allocate separately, few expose the separate data and 
> metadata allocation detail to the user.
> 
> 2) Btrfs allocates and uses space in two steps, first allocating/
> reserving relatively large "chunks" from free-space into separate data 
> and metadata chunks, then using space from these chunk allocations as 
> needed, until they're full and more must be allocated.  Nominal[1] chunk 
> size is 1 GiB for data, 256 MiB for metadata.

> 3) Up until a few kernel cycles ago, btrfs could and would automatically 
> allocate chunks as needed, but wouldn't deallocate them when they 
> emptied.  Once they were allocated for data or metadata, that's how they 
> stayed allocated, unless/until the user did a balance manually, at which 
> point the chunk rewrite would consolidate the used space and free any 
> unused chunk-space back to the unallocated space pool.

> The result was that given normal usage writing and deleting data, over 
> time, all unallocated space would typically end up allocated as data 
> chunks, such that at some point the filesystem would run out of metadata 
> space and need to allocate more metadata chunks, but couldn't, because of 
> all those extra partially to entirely empty data chunks that were 
> allocated and never freed.
> 
> Since IIRC 3.17 or so (kernel cycle from unverified memory, but that 
> should be close), btrfs will automatically deallocate chunks if they're 
> left entirely empty, so the problem has disappeared to a large extent, 
> tho it's still possible to eventually end up with a bunch of not-quite-
> empty data chunks, that require a manual balance to consolidate and clean 
> up.


I am running a full balance now, it's at 94% remaining (running for 48
hrs already ;-) ).

Is there any way I should / could "scan" for empty or almost-empty data
chunks which could be freed, in order to have more chunks available for
the actual balancing, or for new chunks that would use the full 10+2
RAID6 stripe? I understand that btrfs NOW does that somewhat
automagically, but my FS is quite old and well used already, and there
is new data coming in all the time, so I want that properly spread
across all the drives.


Regards

Christian


* Re: How to properly and efficiently balance RAID6 after more drives are added?
  2015-09-04  8:28       ` Christian Rohmann
@ 2015-09-04 11:04         ` Duncan
  2015-11-11 14:17           ` Christian Rohmann
  0 siblings, 1 reply; 8+ messages in thread
From: Duncan @ 2015-09-04 11:04 UTC (permalink / raw)
  To: linux-btrfs

Christian Rohmann posted on Fri, 04 Sep 2015 10:28:21 +0200 as excerpted:

> Hello Duncan,
> 
> thanks a million for taking the time and effort to explain all that.
> I understand that all the devices must have been chunk-allocated for
> btrfs to tell me all available "space" was used (read "allocated to data
> chunks").
> 
> The filesystem is quite old already; it started out on kernel 3.12 (I
> believe) and is now on 4.2, always with the most current version of
> btrfs-progs Debian has available.

IIRC I wrote the previous reply without knowing the kernel you were on.  
You had posted the userspace version, which was current, but not the 
kernel version, so it's good to see that posted now. 

Since kernel 3.12 was before automatic-empty-chunk-reclaim, empty chunks 
wouldn't have been reclaimed back then, as I guessed.  You're running 
current 4.2 now and they would be, but only if they're totally empty, and 
I'm guessing they're not, particularly if you don't run with the 
autodefrag mount option turned on or do regular manual defrags.  (Defrag 
doesn't directly affect chunks, that's what balance is for, but failing 
to defrag will mean things are more fragmented, which will cause btrfs to 
spread out more in the available chunks as it'll try to put new files in 
as few extents as possible, possibly in empty chunks if they haven't been 
reclaimed and the space in fuller chunks is too small for the full file, 
up to the chunk size, of course.)
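
(For reference, turning that mount option on and doing a manual defrag
would look something like the following; the mountpoint is a
placeholder, and on a filesystem with snapshots bear in mind that
defrag breaks reflinks and can therefore temporarily cost extra space:

# enable autodefrag for future writes, without unmounting
mount -o remount,autodefrag /mountpoint
# recursively defragment existing files, verbosely
btrfs filesystem defragment -r -v /mountpoint

Autodefrag can of course also just be added to the mount options in
fstab.)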

And of course, only with 4.1 (nominally 3.19 but there were initial 
problems) was raid6 mode fully code-complete and functional -- before 
that, runtime worked, it calculated and wrote the parity stripes as it 
should, but the code to recover from problems wasn't complete, so you 
were effectively running a slow raid0 in terms of recovery ability, but 
one that got "magically" updated to raid6 once the recovery code was 
actually there and working.

So I'm guessing you have some 8-strip-stripe chunks at say 20% full or 
some such.  There's 19.19 data TiB used of 22.85 TiB allocated, a spread 
of over 3 TiB.  A full nominal-size data stripe allocation, given 12 
devices in raid6, will be 10x1GiB data plus 2x1GiB parity, so there's 
about 3.5 TiB / 10 GiB extra stripes worth of chunks, 350 stripes or so, 
that should be freeable, roughly (the fact that you probably have 8-
strip, 12-strip, and 4-strip stripes, on the same filesystem, will of 
course change that a bit, as will the fact that four devices are much 
smaller than the other eight).

> On 09/03/2015 04:22 AM, Duncan wrote: [snipped]
> 
> I am running a full balance now, it's at 94% remaining (running for 48
> hrs already ;-) ).
> 
> Is there any way I should / could "scan" for empty or almost-empty data
> chunks which could be freed, in order to have more chunks available for
> the actual balancing, or for new chunks that would use the full 10+2
> RAID6 stripe? I understand that btrfs NOW does that somewhat
> automagically, but my FS is quite old and well used already, and there
> is new data coming in all the time, so I want that properly spread
> across all the drives.

There are balance filters: -dusage=20, for instance, would only rebalance 
data (-d) chunks with usage under 20%.  Of course there's more about 
balance filters in the manpage and on the wiki.

The great thing about -dusage= (and -musage= where appropriate) is that 
it can often free and deallocate large numbers of chunks at a fraction of 
the time it'd take to do a full balance.  Not only are you only dealing 
with a fraction of the chunks, but since the ones it picks are for 
example only 20% full (with usage=20) or less, they take only 20% (or 
less) of the time to balance that a full chunk would.  Additionally, 20% 
full or less means you reclaim chunks 4:1 or better -- five old chunks 
are rewritten into a single new one, freeing four!  So in a scenario with 
a whole bunch of chunks at less than say 2/3 full (usage=67, rewrite 
three into two), this can reclaim a whole lot of chunks in a relatively 
small amount of time, certainly so compared to a full balance, since 
rewriting a 100% full chunk takes the full amount of time and doesn't 
reclaim anything.
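
In command form, a filtered pre-pass followed by the full balance would
look something like this (the mountpoint is a placeholder):

# rewrite only data chunks that are less than 20% used
btrfs balance start -dusage=20 /mountpoint
# optionally do the same for sparsely used metadata chunks
btrfs balance start -musage=20 /mountpoint
# then the full, unfiltered balance
btrfs balance start /mountpoint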

But, given that the whole reason you're messing with it is to try to even 
out the stripes across all devices, a full rewrite is eventually in order 
anyway.  However, knowing about the filters would have let you do a
-dusage=20 or possibly -dusage=50 before the full balance, leaving the 
full balance more room to work in and possibly allowing a more 
effective balance to the widest stripes possible.

Likely above 50 and almost certainly above 67, the returns wouldn't be 
worth it, since the time taken for the filtered balance would be longer, 
and an unfiltered balance was planned afterward anyway.  Here, I'd have 
tried something like 20 first, then 50 if I wasn't happy with the results 
of 20.  The thing is, either 20 would give me good results in a 
reasonably short time, or there'd be so few candidates that it'd be very 
fast to give me the bad results, thus allowing me to try 50.  Same with 
50 and 67, tho I'd definitely be unhappy if 50 didn't give me at least a 
TiB or so freed to unallocated, hopefully some of which would be in the 
first eight devices, ideally giving the full balance room enough to do 
full 12-device-stripes, keeping enough free on the original eight devices 
as it went to go 12-wide until the smaller devices were full, then 8 
wide, eliminating the 4-wide stripes entirely.

Tho as Hugo suggested, having the original larger eight devices all the 
way full and thus a good likelihood of all three stripe widths isn't 
ideal, and it might actually take a couple of balances (yes, at a week 
apiece or whatever =:^() to straighten things out.  A good -dusage= 
filtered balance pre-pass would have likely taken under a day and with 
luck would have allowed a single full balance to do the job, but it's a 
bit late for that now...


Meanwhile, FWIW that long maintenance time is one of the reasons I'm a 
strong partitioning advocate.  Between the fact that I use SSDs and the 
fact that my btrfs partitions are all under 50 GiB per partition (which 
probably wouldn't be practical for you, but half to 1 TiB per device 
partition might be...), full scrubs typically take under a minute here, 
and full balances still in the single-digit minutes.  Of course, I have 
other partitions/filesystems too, and to do all of them would take a bit 
longer, say an hour, but with maintenance time under 10 minutes per 
filesystem, doing it is not only not a pain, but is actually trivial, 
whereas doing maintenance that's going to take a week is definitely a 
pain, something you're going to avoid if possible, meaning there's a fair 
chance a minor problem will be allowed to get far worse before it's 
addressed than it would be if the maintenance were a matter of a few 
hours, say a day at most.

But that's just me.  I've fine-tuned my partitioning layout over multiple 
multi-year generations and have it set up so I don't have the hassle of 
"oh, I'm out of space on this partition, gotta symlink to a different 
one" that a lot of folks point to as the reason they prefer big storage 
pools like lvm or multi-whole-physical-device btrfs.  And obviously, I'm 
not scaling storage to the double-digit TiB you are, either.  So your 
system, your layout and rules.  I'm simply passing on one reason that I'm 
such a strong partitioning advocate, here. 

Plus I know you'd REALLY like those 10 minute full-balances right about 
now! =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: How to properly and efficiently balance RAID6 after more drives are added?
  2015-09-04 11:04         ` Duncan
@ 2015-11-11 14:17           ` Christian Rohmann
  2015-11-12  4:31             ` Duncan
  0 siblings, 1 reply; 8+ messages in thread
From: Christian Rohmann @ 2015-11-11 14:17 UTC (permalink / raw)
  To: linux-btrfs

Sorry for the late reply to this list regarding this topic
...

On 09/04/2015 01:04 PM, Duncan wrote:
> And of course, only with 4.1 (nominally 3.19 but there were initial 
> problems) was raid6 mode fully code-complete and functional -- before 
> that, runtime worked, it calculated and wrote the parity stripes as it 
> should, but the code to recover from problems wasn't complete, so you 
> were effectively running a slow raid0 in terms of recovery ability, but 
> one that got "magically" updated to raid6 once the recovery code was 
> actually there and working.

Like others who have written to this ML, I run into crashes when trying
to do a balance of my filesystem.
I moved through the different kernel versions and btrfs-tools and am
currently running kernel 4.3 + 4.3-rc1 of the tools, but still, after
about an hour of balancing (and actually moving chunks), the machine
crashes horribly without giving any good stack trace or anything in the
kernel log which I could report here :(

Any ideas on how I could proceed to get some usable debug info for the
devs to look at?


> So I'm guessing you have some 8-strip-stripe chunks at say 20% full or 
> some such.  There's 19.19 data TiB used of 22.85 TiB allocated, a spread 
> of over 3 TiB.  A full nominal-size data stripe allocation, given 12 
> devices in raid6, will be 10x1GiB data plus 2x1GiB parity, so there's 
> about 3.5 TiB / 10 GiB extra stripes worth of chunks, 350 stripes or so, 
> that should be freeable, roughly (the fact that you probably have 8-
> strip, 12-strip, and 4-strip stripes, on the same filesystem, will of 
> course change that a bit, as will the fact that four devices are much 
> smaller than the other eight).

The new devices have been in place for a while (> 2 months) now, and are
barely used. Why is there not more data being put onto the new disks?
Even without a balance, new data should spread evenly across all devices,
right? From the IOPS I can see that only the 8 disks which have always
been in the box are doing any heavy lifting, while the new disks are
mostly idle.

Anything I could do to narrow down where a certain file is stored across
the devices?






Regards

Christian


* Re: How to properly and efficiently balance RAID6 after more drives are added?
  2015-11-11 14:17           ` Christian Rohmann
@ 2015-11-12  4:31             ` Duncan
  0 siblings, 0 replies; 8+ messages in thread
From: Duncan @ 2015-11-12  4:31 UTC (permalink / raw)
  To: linux-btrfs

Christian Rohmann posted on Wed, 11 Nov 2015 15:17:19 +0100 as excerpted:

> Sorry for the late reply to this list regarding this topic ...
> 
> On 09/04/2015 01:04 PM, Duncan wrote:
>> And of course, only with 4.1 (nominally 3.19 but there were initial
>> problems) was raid6 mode fully code-complete and functional -- before
>> that, runtime worked, it calculated and wrote the parity stripes as it
>> should, but the code to recover from problems wasn't complete, so you
>> were effectively running a slow raid0 in terms of recovery ability, but
>> one that got "magically" updated to raid6 once the recovery code was
>> actually there and working.
> 
> Like others who have written to this ML, I run into crashes when trying
> to do a balance of my filesystem.
> I moved through the different kernel versions and btrfs-tools and am
> currently running kernel 4.3 + 4.3-rc1 of the tools, but still, after
> about an hour of balancing (and actually moving chunks), the machine
> crashes horribly without giving any good stack trace or anything in the
> kernel log which I could report here :(
> 
> Any ideas on how I could proceed to get some usable debug info for the
> devs to look at?

I'm not a dev so my view into the real deep technical side is limited, 
but what I can say is this...

Generally, crashes during balance indicate not so much bugs in the way 
the kernel handles existing balance (tho those sometimes occur as well, 
but the chances are relatively lower), but rather, a filesystem screwed 
up in a way that balance hasn't been taught to deal with yet.

Of course there are two immediate points that can be made from that:
1) Newer kernels have been taught to deal with more bugs, so if you're 
not on current (which you are now), consider upgrading to current at 
least long enough to see if it already knows how to deal with it.
2) If a balance is crashing with a particular kernel, it's unlikely the 
problem will simply go away on its own, without either a kernel upgrade 
to one that knows how to deal with that problem, or in some cases, a 
filesystem change that unpins whatever was bad and lets it be deleted.  
Filesystem changes likely to do that sort of thing are removing your 
oldest snapshots, thereby freeing anything that had changed in newer 
snapshots and the working version, that was still being pinned by the old 
snapshots, or, in the absence of snapshot pinning, removing whatever 
often-large, possibly repeatedly-edited file happened to be locking down 
whatever balance was choking on.

Another point, based on a different factor, can be added as well:
3) Raid56 mode is still relatively new, and a number of users of the 
raid56 feature seem to be reporting what appears to me at least 
(considering my read of tracedumps is extremely limited) to be the same 
sort of balance bug, often with the same couldn't-get-a-trace pattern.  
This very likely indicates a remaining bug embedded deeply enough in the 
raid56 code that it has taken until now to trigger enough times to even 
begin to appear on the radar.  Of course the fact that it so often 
no-traces doesn't help in finding it, but the reports are getting common 
enough that, at least to an informed non-dev list regular like me, there 
does seem to be a pattern emerging.

This is a bit worrying, but it's /exactly/ the reason that I had 
suggested that people wait for at least two entirely "clean" kernel 
cycles without raid56 bugs before considering it as stable as is the rest 
of btrfs, and predicted that would likely be at least five kernel cycles 
(a year) after initial nominal-full-code release, putting it at 4.4 at 
the earliest.  Since the last big raid56 bug was fixed fairly early in 
the 4.1 cycle, two clean series would be 4.2 and 4.3, which would again 
point to 4.4.  But we now have this late-appearing bug just coming up on 
the radar, which if it does indeed end up raid56 related, both validates 
my earlier caution, and at least conservatively speaking, should reset 
that two-clean-kernel-cycles clock.  However, given that the feature in 
general has been maturing in the meantime, I'd say reset it with only 
one clean kernel cycle this time, so again assuming the problem is indeed 
found to be in raid56 and that it's fixed before 4.4 release, I'd want 
4.5 to be raid56 uneventful, and would then consider 4.6 raid56 maturity/
stability-comparable to btrfs in general, assuming no further raid56 bugs 
have appeared by its release.

As to ideas for getting a trace, the best I can do is repeat what I've 
seen others suggest here.  It obviously takes a bit more resources than 
some have available, but it apparently has the best chance of working 
in such instances, that being...

Configure the test machine with a network-attached tty, and set it as 
your system console, so debug traces will dump to it.  The kernel will 
try its best to dump traces to system-console as it considers that safe 
even after it considers itself too scrambled to trust writing anything to 
disk, so this sort of network system console arrangement can often get at 
least /some/ of a debug trace before the kernel entirely loses coherency.

The specifics I don't know, as I don't tend to have the network resources 
to log to here, and thus have no personal experience with it at all.  
I seem to remember seeing a text file in the kernel docs dir that had 
instructions, but you can look that up as easily as I can, so there's no 
point in me double-checking on that.
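
(For what it's worth, the in-tree file is
Documentation/networking/netconsole.txt.  A minimal sketch, with every
address, port, interface name and MAC below being a made-up example
you'd have to substitute:

# on the crashing box: send kernel messages over UDP to a log host
modprobe netconsole \
  netconsole=6665@192.168.0.5/eth0,6666@192.168.0.9/00:11:22:33:44:55
# on the log host: capture whatever arrives (netcat syntax varies a bit
# between flavours)
nc -u -l 6666 | tee crashlog.txt

Raising the console loglevel on the crashing box, e.g. with "dmesg -n 8",
helps make sure the whole oops actually gets sent.)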

The other side of it would be enabling the various btrfs and general 
kernel debug and tracing apparatus, but you'd need a dev to give you the 
details there.

>> So I'm guessing you have some 8-strip-stripe chunks at say 20% full or
>> some such.  There's 19.19 data TiB used of 22.85 TiB allocated, a
>> spread of over 3 TiB.  A full nominal-size data stripe allocation,
>> given 12 devices in raid6, will be 10x1GiB data plus 2x1GiB parity, so
>> there's about 3.5 TiB / 10 GiB extra stripes worth of chunks, 350
>> stripes or so,
>> that should be freeable, roughly (the fact that you probably have 8-
>> strip, 12-strip, and 4-strip stripes, on the same filesystem, will of
>> course change that a bit, as will the fact that four devices are much
>> smaller than the other eight).
> 
> The new devices have been in place for while (> 2 months) now, and are
> barely used. Why is there not more data being put onto the new disks?
> Even without a balance new data should spread evenly across all devices
> right? From the IOPs I can see that only the 8 disks which always have
> been in the box are doing any heavy lifting and the new disks are mostly
> idle.

That isn't surprising.  New extent allocations will be made from existing 
data chunks, where they can be (that being, where there's empty space 
within them), and most of those will be across only the original 8 
devices.  Only if that space in existing data chunks is used up will new 
chunk allocations be made.  And as it appeared you had over 3 TiB of 
space within the existing chunks...

Of course balance is supposed to be the tool that helps you fix this, but 
with it bugging out on... something... as discussed above, that's not 
really helping you either.

Personally, what I'd probably do here would be to decide if the data was 
worth the trouble or not, given the time it's obviously going to take, 
even with good backups, to simply copy nearly 20 TiB of data from one 
place to another.  Then I'd blow away and recreate, as the only sure way 
to a clean filesystem, and copy back if I did consider it worth the 
trouble.  Of course that's easy for /me/ to say, with my multiple 
separate but rather small (nothing even 3-digit GiB, let alone TiB scale) 
btrfs filesystems, all on ssd, such that a full balance/scrub/check on a 
single filesystem is only minutes at the longest, and often under a 
minute, to completion.  But it /is/ what I'd do.

But then again, as should be clear from the above discussion, I wouldn't 
have trusted non-throw-away data to btrfs raid56 until I considered it 
roughly as stable as the rest of btrfs, which for me would have been at 
least 4.4 and is now beginning to look like at least 4.6, in the first 
place.  Neither, at this point, would I be at all confident that, were you 
to use the same sort of raid56 layout at its current maturity, you'd not 
end up with the exact same bug and thus no workable balance, tho at least 
you'd have full-width stripes, as you'd have been using all the devices 
from the get-go, so maybe you'd not /need/ to balance for a while.

> Anything I could do to narrow down where a certain file is stored across
> the devices?

The other possibility (this one both narrowing down where the problem is 
and hopefully helping to eliminate it at the same time) would be, 
assuming no snapshots locking down old data, to start rewriting that 20 
TiB of data say a TiB or two at a time, removing the old copy, thereby 
freeing the extents and tracking metadata it took, and trying the balance 
again, until you find the bit causing all the trouble and rewrite it, 
presumably to a form less troublesome to balance.  If you have a gut 
feeling as to where in your data the problem might be, start with it; 
otherwise, just cover the whole nearly 20 TiB systematically.

If at some point you can now complete a balance, that demonstrates that 
the problem was indeed a defect in the filesystem that a rewrite 
eventually overcame.  If you still can't balance after a full rewrite of 
everything, that demonstrates a more fundamental bug, likely somewhere 
within the guts of the raid56 code itself, such that rewriting everything 
only rewrites the same problem once again.

That one might actually be practical enough to do, and has a good chance 
of working, tho do note that you need to verify that your method of 
rewriting the files actually rewrites the data rather than simply 
renaming or reflinking it (a mv with src and dest on the same btrfs is 
just a rename, and a reflink copy only rewrites some metadata, not the 
data).  The easiest way to be /sure/ a file is actually rewritten is to 
do a cross-filesystem copy/move, perhaps using tmpfs if your memory is 
large enough for the file(s) in question, in which case you'd /copy/ it 
off btrfs to tmpfs, then /move/ it back, into a different location.  When 
the round trip is completed, sync, and delete the old copy.

(Tmpfs being memory-only, thus as fast as possible but not crash-safe 
should the only copy be in tmpfs at the time, this procedure ensures that 
a valid copy is always on permanent storage.  The first copy leaves the 
old version in place, where it remains until the new version from tmpfs 
is safely moved into the new location, with the sync ensuring it all 
actually hits permanent storage, after which it should be safe to remove 
the old copy since the new one is now safely on disk.)
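
A hedged sketch of that round trip, with all the paths made up purely
for illustration (on most distros /dev/shm is already tmpfs):

# copy the file off btrfs into tmpfs; the original stays in place
cp /mnt/btrfs/data/bigfile /dev/shm/bigfile
# move it back to a *different* location on the btrfs filesystem; being
# a cross-filesystem move, this really rewrites the data
mv /dev/shm/bigfile /mnt/btrfs/data/bigfile.new
# make sure the new copy has hit permanent storage, then drop the old one
sync
rm /mnt/btrfs/data/bigfile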


As for knowing specifically where a file is stored, yes, that's possible, 
using btrfs debug commands.  As the saying goes, however, the details 
"are left as an exercise for the reader", since I've never actually had 
to do it myself.  So check the various btrfs-* manpages and (carefully!) 
experiment a bit. =:^)  Or just check back thru the list archive as I'm 
sure I've seen it posted, but without a bit more to go on than that, the 
manpages method is likely faster. =:^)
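
(One possible starting point, untested here and with details that may
well need adjusting, all paths made up: "filefrag -v /path/to/file"
prints the file's extents, and on btrfs the "physical_offset" column is
actually the btrfs logical address, in filesystem blocks; multiply by
the block size, typically 4096, to get a byte address, and then
something like

btrfs-map-logical -l <logical_byte_address> /dev/sdc

should print the device(s) and physical offsets backing that logical
address.)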

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman




Thread overview: 8+ messages
2015-09-02 10:29 How to properly and efficiently balance RAID6 after more drives are added? Christian Rohmann
2015-09-02 11:30 ` Hugo Mills
2015-09-02 13:09   ` Christian Rohmann
2015-09-03  2:22     ` Duncan
2015-09-04  8:28       ` Christian Rohmann
2015-09-04 11:04         ` Duncan
2015-11-11 14:17           ` Christian Rohmann
2015-11-12  4:31             ` Duncan
