public inbox for linux-btrfs@vger.kernel.org
* Balancing raid5 after adding another disk does not move/use any data on it
@ 2019-03-13 21:58 Jakub Husák
  2019-03-14 21:31 ` Chris Murphy
  0 siblings, 1 reply; 16+ messages in thread
From: Jakub Husák @ 2019-03-13 21:58 UTC (permalink / raw)
  To: linux-btrfs

Hi,

I added another disk to my 3-disk raid5 and ran a balance command. After 
a few hours I looked at the output of `fi usage` and saw that no data is 
being used on the new disk. I got the same result even when balancing my 
raid5 data or metadata.

Next I tried to convert my raid5 metadata to raid1 (a good idea anyway) 
and the new disk started to fill immediately (even though it received 
the whole amount of metadata, with the replicas spread among the other 
drives, instead of being really "balanced". I know why this happened, I 
don't like it, but I can live with it, so let's not go off topic here :)).

Now my usage output looks like this:

# btrfs filesystem usage /mnt/data
WARNING: RAID56 detected, not implemented
Overall:
     Device size:              10.91TiB
     Device allocated:        316.12GiB
     Device unallocated:       10.61TiB
     Device missing:              0.00B
     Used:                     58.88GiB
     Free (estimated):            0.00B    (min: 8.00EiB)
     Data ratio:                   0.00
     Metadata ratio:               2.00
     Global reserve:          512.00MiB    (used: 184.94MiB)

Data,RAID5: Size:4.59TiB, Used:4.06TiB
    /dev/mapper/crypt-sdb       2.29TiB
    /dev/mapper/crypt-sdc       2.29TiB
    /dev/mapper/crypt-sde       2.29TiB

Metadata,RAID1: Size:158.00GiB, Used:29.44GiB
    /dev/mapper/crypt-sdb      53.00GiB
    /dev/mapper/crypt-sdc      53.00GiB
    /dev/mapper/crypt-sdd     158.00GiB
    /dev/mapper/crypt-sde      52.00GiB

System,RAID1: Size:64.00MiB, Used:528.00KiB
    /dev/mapper/crypt-sdc      32.00MiB
    /dev/mapper/crypt-sdd      64.00MiB
    /dev/mapper/crypt-sde      32.00MiB

Unallocated:
    /dev/mapper/crypt-sdb     392.04GiB
    /dev/mapper/crypt-sdc     392.01GiB
    /dev/mapper/crypt-sdd       2.57TiB
    /dev/mapper/crypt-sde     393.01GiB

I'm now running `fi balance -dusage=10` (and raising the usage limit). I 
can see that the unallocated space is growing as it frees the little-used 
chunks, but still no data is being stored on the new disk.

Is it some bug? Is `fi usage` not showing me something (as it states 
"WARNING: RAID56 detected, not implemented")? Or is there just so much 
free space on the first set of disks that the balance doesn't bother 
moving any data?

If so, shouldn't it really be balancing (spreading) the data among all 
the drives to use all the IOPS capacity, even when the raid5 redundancy 
constraint is currently satisfied?

# uname -a
Linux keeper 4.19.0-0.bpo.2-amd64 #1 SMP Debian 4.19.16-1~bpo9+1 (2019-02-07) x86_64 GNU/Linux
# btrfs --version
btrfs-progs v4.17
# btrfs fi show
Label: none  uuid: xxxxxxxxxxxxxxxxxxxxxxxxxx
    Total devices 4 FS bytes used 4.09TiB
    devid    2 size 2.73TiB used 2.34TiB path /dev/mapper/crypt-sdc
    devid    3 size 2.73TiB used 2.34TiB path /dev/mapper/crypt-sdb
    devid    4 size 2.73TiB used 2.34TiB path /dev/mapper/crypt-sde
    devid    5 size 2.73TiB used 158.06GiB path /dev/mapper/crypt-sdd
# btrfs fi df .
Data, RAID5: total=4.59TiB, used=4.06TiB
System, RAID1: total=64.00MiB, used=528.00KiB
Metadata, RAID1: total=158.00GiB, used=29.43GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

Thanks

Jakub


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Balancing raid5 after adding another disk does not move/use any data on it
@ 2019-03-13 22:11 Jakub Husák
  2019-03-14 14:59 ` Noah Massey
  2019-03-15 18:01 ` Zygo Blaxell
  0 siblings, 2 replies; 16+ messages in thread
From: Jakub Husák @ 2019-03-13 22:11 UTC (permalink / raw)
  To: linux-btrfs

Sorry, fighting with this technology called "email" :)


Hopefully better wrapped outputs:

On 13. 03. 19 22:58, Jakub Husák wrote:


> Hi,
>
> I added another disk to my 3-disk raid5 and ran a balance command. 
> After few hours I looked to output of `fi usage` to see that no data 
> are being used on the new disk. I got the same result even when 
> balancing my raid5 data or metadata.
>
> Next I tried to convert my raid5 metadata to raid1 (a good idea 
> anyway) and the new disk started to fill immediately (even though it 
> received the whole amount of metadata with replicas being spread among 
> the other drives, instead of being really "balanced". I know why this 
> happened, I don't like it but I can live with it, let's not go off 
> topic here :)).
>
> Now my usage output looks like this:
>
# btrfs filesystem usage   /mnt/data1
WARNING: RAID56 detected, not implemented
Overall:
     Device size:          10.91TiB
     Device allocated:         316.12GiB
     Device unallocated:          10.61TiB
     Device missing:             0.00B
     Used:              58.86GiB
     Free (estimated):             0.00B    (min: 8.00EiB)
     Data ratio:                  0.00
     Metadata ratio:              2.00
     Global reserve:         512.00MiB    (used: 0.00B)

Data,RAID5: Size:4.59TiB, Used:4.06TiB
    /dev/mapper/crypt-sdb       2.29TiB
    /dev/mapper/crypt-sdc       2.29TiB
    /dev/mapper/crypt-sde       2.29TiB

Metadata,RAID1: Size:158.00GiB, Used:29.43GiB
    /dev/mapper/crypt-sdb      53.00GiB
    /dev/mapper/crypt-sdc      53.00GiB
    /dev/mapper/crypt-sdd     158.00GiB
    /dev/mapper/crypt-sde      52.00GiB

System,RAID1: Size:64.00MiB, Used:528.00KiB
    /dev/mapper/crypt-sdc      32.00MiB
    /dev/mapper/crypt-sdd      64.00MiB
    /dev/mapper/crypt-sde      32.00MiB

Unallocated:
    /dev/mapper/crypt-sdb     393.04GiB
    /dev/mapper/crypt-sdc     393.01GiB
    /dev/mapper/crypt-sdd       2.57TiB
    /dev/mapper/crypt-sde     394.01GiB

>
> I'm now running `fi balance -dusage=10` (and rising the usage limit). 
> I can see that the unallocated space is rising as it's freeing the 
> little used chunks but still no data are being stored on the new disk.
>
> Is it some bug? Is `fi usage` not showing me something (as it states 
> "WARNING: RAID56 detected, not implemented")? Or is there just too 
> much free space on the first set of disks that the balancing is not 
> bothering moving any data?
>
> If so, shouldn't it be really balancing (spreading) the data among all 
> the drives to use all the IOPS capacity, even when the raid5 
> redundancy constraint is currently satisfied?
>
>
#  uname -a
Linux storage 4.19.0-0.bpo.2-amd64 #1 SMP Debian 4.19.16-1~bpo9+1 
(2019-02-07) x86_64 GNU/Linux
#   btrfs --version
btrfs-progs v4.17
#  btrfs fi show
Label: none  uuid: xxxxxxxxxxxxxxxxx
     Total devices 4 FS bytes used 4.09TiB
     devid    2 size 2.73TiB used 2.34TiB path /dev/mapper/crypt-sdc
     devid    3 size 2.73TiB used 2.34TiB path /dev/mapper/crypt-sdb
     devid    4 size 2.73TiB used 2.34TiB path /dev/mapper/crypt-sde
     devid    5 size 2.73TiB used 158.06GiB path /dev/mapper/crypt-sdd

#   btrfs fi df .
Data, RAID5: total=4.59TiB, used=4.06TiB
System, RAID1: total=64.00MiB, used=528.00KiB
Metadata, RAID1: total=158.00GiB, used=29.43GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

> Thanks
>
> Jakub
>


* Re: Balancing raid5 after adding another disk does not move/use any data on it
  2019-03-13 22:11 Balancing raid5 after adding another disk does not move/use any data on it Jakub Husák
@ 2019-03-14 14:59 ` Noah Massey
  2019-03-14 15:08   ` Noah Massey
  2019-03-15 18:01 ` Zygo Blaxell
  1 sibling, 1 reply; 16+ messages in thread
From: Noah Massey @ 2019-03-14 14:59 UTC (permalink / raw)
  To: Jakub Husák; +Cc: linux-btrfs

On Wed, Mar 13, 2019 at 6:13 PM Jakub Husák <jakub@husak.pro> wrote:
>
> Sorry, fighting with this technology called "email" :)
>
>
> Hopefully better wrapped outputs:
>
> On 13. 03. 19 22:58, Jakub Husák wrote:
>
>
> > Hi,
> >
> > I added another disk to my 3-disk raid5 and ran a balance command.
> > After few hours I looked to output of `fi usage` to see that no data
> > are being used on the new disk. I got the same result even when
> > balancing my raid5 data or metadata.
> >

Am I correct in rephrasing your issue as "balancing a 3-device raid5
does not rebalance it into a 4-device raid5"? Because that was a
surprising result to me, but it may be specified behavior I wasn't
aware of.

> >
> # btrfs filesystem usage   /mnt/data1
>
> Data,RAID5: Size:4.59TiB, Used:4.06TiB
>     /dev/mapper/crypt-sdb       2.29TiB
>     /dev/mapper/crypt-sdc       2.29TiB
>     /dev/mapper/crypt-sde       2.29TiB
>
> Metadata,RAID1: Size:158.00GiB, Used:29.43GiB
>     /dev/mapper/crypt-sdb      53.00GiB
>     /dev/mapper/crypt-sdc      53.00GiB
>     /dev/mapper/crypt-sdd     158.00GiB
>     /dev/mapper/crypt-sde      52.00GiB


* Re: Balancing raid5 after adding another disk does not move/use any data on it
  2019-03-14 14:59 ` Noah Massey
@ 2019-03-14 15:08   ` Noah Massey
  0 siblings, 0 replies; 16+ messages in thread
From: Noah Massey @ 2019-03-14 15:08 UTC (permalink / raw)
  To: Jakub Husák; +Cc: linux-btrfs

On Thu, Mar 14, 2019 at 10:59 AM Noah Massey <noah.massey@gmail.com> wrote:
>
> On Wed, Mar 13, 2019 at 6:13 PM Jakub Husák <jakub@husak.pro> wrote:
> >
> > Sorry, fighting with this technology called "email" :)
> >
> >
> > Hopefully better wrapped outputs:
> >
> > On 13. 03. 19 22:58, Jakub Husák wrote:
> >
> >
> > > Hi,
> > >
> > > I added another disk to my 3-disk raid5 and ran a balance command.
> > > After few hours I looked to output of `fi usage` to see that no data
> > > are being used on the new disk. I got the same result even when
> > > balancing my raid5 data or metadata.
> > >
>
> Am I correct in rephrasing your issue into "balancing 3 copy raid5
> does not rebalance into 4 copy raid5"? Because that was a surprising
> result to me, but may be a spec that I wasn't aware of.
>

Maybe it's because the new disk does not have any RAID5 block groups?
To test, I'd usually suggest adding bogus data until something gets
pushed to sdd, then trying a balance and removing the temp data. In your
case that might be hard, since all 3 old disks have a similar amount of
unallocated space.
What happens if you 'dev remove' one of the old drives and then add it back in?

In case it's not clear, I'm throwing spaghetti here. Have backups.


* Re: Balancing raid5 after adding another disk does not move/use any data on it
  2019-03-13 21:58 Jakub Husák
@ 2019-03-14 21:31 ` Chris Murphy
  0 siblings, 0 replies; 16+ messages in thread
From: Chris Murphy @ 2019-03-14 21:31 UTC (permalink / raw)
  To: Jakub Husák; +Cc: Btrfs BTRFS

On Wed, Mar 13, 2019 at 3:58 PM Jakub Husák <jakub@husak.pro> wrote:
>
> Hi,
>
> I added another disk to my 3-disk raid5 and ran a balance command.

What exact commands did you use for the two operations?

>After
> few hours I looked to output of `fi usage` to see that no data are being
> used on the new disk. I got the same result even when balancing my raid5
> data or metadata.
>
> Next I tried to convert my raid5 metadata to raid1 (a good idea anyway)
> and the new disk started to fill immediately (even though it received
> the whole amount of metadata with replicas being spread among the other
> drives, instead of being really "balanced". I know why this happened, I
> don't like it but I can live with it, let's not go off topic here :)).

They could be related problems. Unclear.

I suggest grabbing btrfs-debugfs from upstream btrfs-progs and run
`sudo btrfs-debugfs -b /mntpoint/` and let's see what the block group
distribution looks like.

https://github.com/kdave/btrfs-progs/blob/master/btrfs-debugfs

>
> I'm now running `fi balance -dusage=10` (and rising the usage limit). I
> can see that the unallocated space is rising as it's freeing the little
> used chunks but still no data are being stored on the new disk.
>
> Is it some bug?

It's possible, but not enough information. The balance code is complicated.


> If so, shouldn't it be really balancing (spreading) the data among all
> the drives to use all the IOPS capacity, even when the raid5 redundancy
> constraint is currently satisfied?

I'd expect that it should copy extents from old 3-strip block groups
to new 4-strip block groups. However, there have been some
improvements related to block group management and enospc avoidance
where existing block groups get filled first, before new block groups
are created, and I wonder if that's what's going on here, but that's
speculation. What do you get for


btrfs insp dump-t -t 5 /dev/   ## device, not mountpoint; will work if
                               ## the fs is mounted, but ideally not in use
btrfs insp dump-s -f /dev/     ## same

Also, no significant changes in raid56.c between 4.19.16 and 5.0.2.
But there have been some volume.c changes.
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/diff/fs/btrfs/volumes.c?id=v5.0.2&id2=v4.19.16

Anyway, I would stop making changes for now and make sure your backups
are up to date as a top priority. And then it's safer to poke this
with a stick and see what's going on and how to get it to cooperate.

-- 
Chris Murphy


* Re: Balancing raid5 after adding another disk does not move/use any data on it
  2019-03-13 22:11 Balancing raid5 after adding another disk does not move/use any data on it Jakub Husák
  2019-03-14 14:59 ` Noah Massey
@ 2019-03-15 18:01 ` Zygo Blaxell
  2019-03-15 18:42   ` Jakub Husák
  2019-03-15 20:31   ` Hans van Kranenburg
  1 sibling, 2 replies; 16+ messages in thread
From: Zygo Blaxell @ 2019-03-15 18:01 UTC (permalink / raw)
  To: Jakub Husák; +Cc: linux-btrfs


On Wed, Mar 13, 2019 at 11:11:02PM +0100, Jakub Husák wrote:
> Sorry, fighting with this technology called "email" :)
> 
> 
> Hopefully better wrapped outputs:
> 
> On 13. 03. 19 22:58, Jakub Husák wrote:
> 
> 
> > Hi,
> > 
> > I added another disk to my 3-disk raid5 and ran a balance command. After
> > few hours I looked to output of `fi usage` to see that no data are being
> > used on the new disk. I got the same result even when balancing my raid5
> > data or metadata.
> > 
> > Next I tried to convert my raid5 metadata to raid1 (a good idea anyway)
> > and the new disk started to fill immediately (even though it received
> > the whole amount of metadata with replicas being spread among the other
> > drives, instead of being really "balanced". I know why this happened, I
> > don't like it but I can live with it, let's not go off topic here :)).
> > 
> > Now my usage output looks like this:
> > 
> # btrfs filesystem usage   /mnt/data1
> WARNING: RAID56 detected, not implemented
> Overall:
>     Device size:          10.91TiB
>     Device allocated:         316.12GiB
>     Device unallocated:          10.61TiB
>     Device missing:             0.00B
>     Used:              58.86GiB
>     Free (estimated):             0.00B    (min: 8.00EiB)
>     Data ratio:                  0.00
>     Metadata ratio:              2.00
>     Global reserve:         512.00MiB    (used: 0.00B)
> 
> Data,RAID5: Size:4.59TiB, Used:4.06TiB
>    /dev/mapper/crypt-sdb       2.29TiB
>    /dev/mapper/crypt-sdc       2.29TiB
>    /dev/mapper/crypt-sde       2.29TiB
> 
> Metadata,RAID1: Size:158.00GiB, Used:29.43GiB
>    /dev/mapper/crypt-sdb      53.00GiB
>    /dev/mapper/crypt-sdc      53.00GiB
>    /dev/mapper/crypt-sdd     158.00GiB
>    /dev/mapper/crypt-sde      52.00GiB
> 
> System,RAID1: Size:64.00MiB, Used:528.00KiB
>    /dev/mapper/crypt-sdc      32.00MiB
>    /dev/mapper/crypt-sdd      64.00MiB
>    /dev/mapper/crypt-sde      32.00MiB
> 
> Unallocated:
>    /dev/mapper/crypt-sdb     393.04GiB
>    /dev/mapper/crypt-sdc     393.01GiB
>    /dev/mapper/crypt-sdd       2.57TiB
>    /dev/mapper/crypt-sde     394.01GiB
> 
> > 
> > I'm now running `fi balance -dusage=10` (and rising the usage limit). I
> > can see that the unallocated space is rising as it's freeing the little
> > used chunks but still no data are being stored on the new disk.

That is exactly what is happening:  you are moving tiny amounts of data
into existing big empty spaces, so no new chunk allocations (which should
use the new drive) are happening.  You have 470GB of data allocated
but not used, so you have up to 235 block groups to fill before the new
drive gets any data.
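[The arithmetic behind that "235 block groups" figure can be sketched as follows, using Zygo's round numbers and assuming 1 GiB strips, so each existing 3-device raid5 data block group holds 2 GiB of data plus 1 GiB of parity:]

```python
slack_gb = 470        # allocated-but-unused data space, per above
data_per_bg_gb = 2    # usable data per 3-strip raid5 block group
                      # (2 data strips + 1 parity strip, 1 GiB each)
print(slack_gb // data_per_bg_gb)  # -> 235 block groups to fill first
```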

Also note that you always have to do a full data balance when adding
devices to raid5 in order to make use of all the space, so you might
as well get started on that now.  It'll take a while.  'btrfs balance
start -dstripes=1..3 /mnt/data1' will work for this case.

> > Is it some bug? Is `fi usage` not showing me something (as it states
> > "WARNING: RAID56 detected, not implemented")? 

The warning just means the fields in the 'fi usage' output header,
like "Free (estimated)", have bogus values because they're not computed
correctly.

> > Or is there just too much
> > free space on the first set of disks that the balancing is not bothering
> > moving any data?

Yes.  ;)

> > If so, shouldn't it be really balancing (spreading) the data among all
> > the drives to use all the IOPS capacity, even when the raid5 redundancy
> > constraint is currently satisfied?

btrfs divides the disks into chunks first, then spreads the data across
the chunks.  The chunk allocation behavior spreads chunks across all the
disks.  When you are adding a disk to raid5, you have to redistribute all
the old data across all the disks to get balanced IOPS and space usage,
hence the full balance requirement.

If you don't do a full balance, it will eventually allocate data on
all disks, but it will run out of space on sdb, sdc, and sde first,
and then be unable to use the remaining 2TB+ on sdd.
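[The stranded-space outcome can be illustrated with a toy model — an assumption-laden sketch, not btrfs's actual allocator: each new raid5 chunk greedily takes a 1 GiB strip from every device that still has unallocated space, and a chunk needs at least two devices. Feeding in the rounded unallocated figures from the `fi usage` output above:]

```python
def allocatable_raid5_data_gib(unalloc, min_devs=2, strip=1):
    """Toy model: each new chunk takes one strip from every device
    with space left; one strip per chunk is parity. Returns
    (additional usable data GiB, leftover unallocated GiB per device)."""
    unalloc = list(unalloc)
    data = 0
    while True:
        avail = [i for i, u in enumerate(unalloc) if u >= strip]
        if len(avail) < min_devs:
            break          # too few devices left to form a raid5 chunk
        for i in avail:
            unalloc[i] -= strip
        data += (len(avail) - 1) * strip
    return data, unalloc

# Unallocated GiB per device: sdb, sdc, sdd, sde (rounded from above)
data, left = allocatable_raid5_data_gib([393, 393, 2632, 394])
print(data, left)  # -> 1180 [0, 0, 2238, 0]: ~2.2 TiB stranded on sdd
```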

> > 
> #  uname -a
> Linux storage 4.19.0-0.bpo.2-amd64 #1 SMP Debian 4.19.16-1~bpo9+1
> (2019-02-07) x86_64 GNU/Linux
> #   btrfs --version
> btrfs-progs v4.17
> #  btrfs fi show
> Label: none  uuid: xxxxxxxxxxxxxxxxx
>     Total devices 4 FS bytes used 4.09TiB
>     devid    2 size 2.73TiB used 2.34TiB path /dev/mapper/crypt-sdc
>     devid    3 size 2.73TiB used 2.34TiB path /dev/mapper/crypt-sdb
>     devid    4 size 2.73TiB used 2.34TiB path /dev/mapper/crypt-sde
>     devid    5 size 2.73TiB used 158.06GiB path /dev/mapper/crypt-sdd
> 
> #   btrfs fi df .
> Data, RAID5: total=4.59TiB, used=4.06TiB
> System, RAID1: total=64.00MiB, used=528.00KiB
> Metadata, RAID1: total=158.00GiB, used=29.43GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
> 
> > Thanks
> > 
> > Jakub
> > 



* Re: Balancing raid5 after adding another disk does not move/use any data on it
  2019-03-15 18:01 ` Zygo Blaxell
@ 2019-03-15 18:42   ` Jakub Husák
  2019-03-15 18:59     ` Zygo Blaxell
  2019-03-15 20:31   ` Hans van Kranenburg
  1 sibling, 1 reply; 16+ messages in thread
From: Jakub Husák @ 2019-03-15 18:42 UTC (permalink / raw)
  To: linux-btrfs

Thanks for the explanation! Actually, as I moved forward with the 
rebalancing, the fourth disk started to receive some data.

BTW, I was hoping a filter like '-dstripes=1..3' existed, and it does! 
Wouldn't it deserve some documentation? :)

Also thanks to Noah Massey for caring!

Cheers


On 15. 03. 19 19:01, Zygo Blaxell wrote:
> On Wed, Mar 13, 2019 at 11:11:02PM +0100, Jakub Husák wrote:
>> Sorry, fighting with this technology called "email" :)
>>
>>
>> Hopefully better wrapped outputs:
>>
>> On 13. 03. 19 22:58, Jakub Husák wrote:
>>
>>
>>> Hi,
>>>
>>> I added another disk to my 3-disk raid5 and ran a balance command. After
>>> few hours I looked to output of `fi usage` to see that no data are being
>>> used on the new disk. I got the same result even when balancing my raid5
>>> data or metadata.
>>>
>>> Next I tried to convert my raid5 metadata to raid1 (a good idea anyway)
>>> and the new disk started to fill immediately (even though it received
>>> the whole amount of metadata with replicas being spread among the other
>>> drives, instead of being really "balanced". I know why this happened, I
>>> don't like it but I can live with it, let's not go off topic here :)).
>>>
>>> Now my usage output looks like this:
>>>
>> # btrfs filesystem usage   /mnt/data1
>> WARNING: RAID56 detected, not implemented
>> Overall:
>>      Device size:          10.91TiB
>>      Device allocated:         316.12GiB
>>      Device unallocated:          10.61TiB
>>      Device missing:             0.00B
>>      Used:              58.86GiB
>>      Free (estimated):             0.00B    (min: 8.00EiB)
>>      Data ratio:                  0.00
>>      Metadata ratio:              2.00
>>      Global reserve:         512.00MiB    (used: 0.00B)
>>
>> Data,RAID5: Size:4.59TiB, Used:4.06TiB
>>     /dev/mapper/crypt-sdb       2.29TiB
>>     /dev/mapper/crypt-sdc       2.29TiB
>>     /dev/mapper/crypt-sde       2.29TiB
>>
>> Metadata,RAID1: Size:158.00GiB, Used:29.43GiB
>>     /dev/mapper/crypt-sdb      53.00GiB
>>     /dev/mapper/crypt-sdc      53.00GiB
>>     /dev/mapper/crypt-sdd     158.00GiB
>>     /dev/mapper/crypt-sde      52.00GiB
>>
>> System,RAID1: Size:64.00MiB, Used:528.00KiB
>>     /dev/mapper/crypt-sdc      32.00MiB
>>     /dev/mapper/crypt-sdd      64.00MiB
>>     /dev/mapper/crypt-sde      32.00MiB
>>
>> Unallocated:
>>     /dev/mapper/crypt-sdb     393.04GiB
>>     /dev/mapper/crypt-sdc     393.01GiB
>>     /dev/mapper/crypt-sdd       2.57TiB
>>     /dev/mapper/crypt-sde     394.01GiB
>>
>>> I'm now running `fi balance -dusage=10` (and rising the usage limit). I
>>> can see that the unallocated space is rising as it's freeing the little
>>> used chunks but still no data are being stored on the new disk.
> That is exactly what is happening:  you are moving tiny amounts of data
> into existing big empty spaces, so no new chunk allocations (which should
> use the new drive) are happening.  You have 470GB of data allocated
> but not used, so you have up to 235 block groups to fill before the new
> drive gets any data.
>
> Also note that you always have to do a full data balance when adding
> devices to raid5 in order to make use of all the space, so you might
> as well get started on that now.  It'll take a while.  'btrfs balance
> start -dstripes=1..3 /mnt/data1' will work for this case.
>
>>> Is it some bug? Is `fi usage` not showing me something (as it states
>>> "WARNING: RAID56 detected, not implemented")?
> The warning just means the fields in the 'fi usage' output header,
> like "Free (estimate)", have bogus values because they're not computed
> correctly.
>
>>> Or is there just too much
>>> free space on the first set of disks that the balancing is not bothering
>>> moving any data?
> Yes.  ;)
>
>>> If so, shouldn't it be really balancing (spreading) the data among all
>>> the drives to use all the IOPS capacity, even when the raid5 redundancy
>>> constraint is currently satisfied?
> btrfs divides the disks into chunks first, then spreads the data across
> the chunks.  The chunk allocation behavior spreads chunks across all the
> disks.  When you are adding a disk to raid5, you have to redistribute all
> the old data across all the disks to get balanced IOPS and space usage,
> hence the full balance requirement.
>
> If you don't do a full balance, it will eventually allocate data on
> all disks, but it will run out of space on sdb, sdc, and sde first,
> and then be unable to use the remaining 2TB+ on sdd.
>
>> #  uname -a
>> Linux storage 4.19.0-0.bpo.2-amd64 #1 SMP Debian 4.19.16-1~bpo9+1
>> (2019-02-07) x86_64 GNU/Linux
>> #   btrfs --version
>> btrfs-progs v4.17
>> #  btrfs fi show
>> Label: none  uuid: xxxxxxxxxxxxxxxxx
>>      Total devices 4 FS bytes used 4.09TiB
>>      devid    2 size 2.73TiB used 2.34TiB path /dev/mapper/crypt-sdc
>>      devid    3 size 2.73TiB used 2.34TiB path /dev/mapper/crypt-sdb
>>      devid    4 size 2.73TiB used 2.34TiB path /dev/mapper/crypt-sde
>>      devid    5 size 2.73TiB used 158.06GiB path /dev/mapper/crypt-sdd
>>
>> #   btrfs fi df .
>> Data, RAID5: total=4.59TiB, used=4.06TiB
>> System, RAID1: total=64.00MiB, used=528.00KiB
>> Metadata, RAID1: total=158.00GiB, used=29.43GiB
>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>
>>> Thanks
>>>
>>> Jakub
>>>


* Re: Balancing raid5 after adding another disk does not move/use any data on it
  2019-03-15 18:42   ` Jakub Husák
@ 2019-03-15 18:59     ` Zygo Blaxell
  0 siblings, 0 replies; 16+ messages in thread
From: Zygo Blaxell @ 2019-03-15 18:59 UTC (permalink / raw)
  To: Jakub Husák; +Cc: linux-btrfs

On Fri, Mar 15, 2019 at 07:42:21PM +0100, Jakub Husák wrote:
> Thanks for the explanation! Actually, as I moved forward with the
> rebalancing, the fourth disk started to receive some data.
> 
> BTW, I was hoping a filter like '-dstripes=1..3' existed, and it does!
> Wouldn't it deserve some documentation? :)

It has some, from the man page for btrfs-balance:

       stripes=<range>
           Balance only block groups which have the given number of stripes. The parameter is a range specified as start..end.
           Makes sense for block group profiles that utilize striping, ie. RAID0/10/5/6. The range minimum and maximum are
           inclusive.

There are probably some wikis that could benefit from a sentence or
two explaining when you'd use this option.  Or a table of which RAID
profiles must be balanced after a device add (always raid0, raid5,
raid6, sometimes raid1 and raid10) and which don't (never single, dup,
sometimes raid1 and raid10).
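[The rule-of-thumb table described above could be summarized like this — a sketch of Zygo's summary, not authoritative documentation:]

```python
# Does this profile need a full data balance after `btrfs device add`
# in order to spread existing data and use all the new space?
needs_balance_after_add = {
    "single": "never",
    "dup":    "never",
    "raid0":  "always",
    "raid5":  "always",
    "raid6":  "always",
    "raid1":  "sometimes",
    "raid10": "sometimes",
}
print(needs_balance_after_add["raid5"])  # -> always
```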

> Also thanks to Noah Massey for caring!
> 
> Cheers
> 
> 
> On 15. 03. 19 19:01, Zygo Blaxell wrote:
> > On Wed, Mar 13, 2019 at 11:11:02PM +0100, Jakub Husák wrote:
> > > Sorry, fighting with this technology called "email" :)
> > > 
> > > 
> > > Hopefully better wrapped outputs:
> > > 
> > > On 13. 03. 19 22:58, Jakub Husák wrote:
> > > 
> > > 
> > > > Hi,
> > > > 
> > > > I added another disk to my 3-disk raid5 and ran a balance command. After
> > > > few hours I looked to output of `fi usage` to see that no data are being
> > > > used on the new disk. I got the same result even when balancing my raid5
> > > > data or metadata.
> > > > 
> > > > Next I tried to convert my raid5 metadata to raid1 (a good idea anyway)
> > > > and the new disk started to fill immediately (even though it received
> > > > the whole amount of metadata with replicas being spread among the other
> > > > drives, instead of being really "balanced". I know why this happened, I
> > > > don't like it but I can live with it, let's not go off topic here :)).
> > > > 
> > > > Now my usage output looks like this:
> > > > 
> > > # btrfs filesystem usage   /mnt/data1
> > > WARNING: RAID56 detected, not implemented
> > > Overall:
> > >      Device size:          10.91TiB
> > >      Device allocated:         316.12GiB
> > >      Device unallocated:          10.61TiB
> > >      Device missing:             0.00B
> > >      Used:              58.86GiB
> > >      Free (estimated):             0.00B    (min: 8.00EiB)
> > >      Data ratio:                  0.00
> > >      Metadata ratio:              2.00
> > >      Global reserve:         512.00MiB    (used: 0.00B)
> > > 
> > > Data,RAID5: Size:4.59TiB, Used:4.06TiB
> > >     /dev/mapper/crypt-sdb       2.29TiB
> > >     /dev/mapper/crypt-sdc       2.29TiB
> > >     /dev/mapper/crypt-sde       2.29TiB
> > > 
> > > Metadata,RAID1: Size:158.00GiB, Used:29.43GiB
> > >     /dev/mapper/crypt-sdb      53.00GiB
> > >     /dev/mapper/crypt-sdc      53.00GiB
> > >     /dev/mapper/crypt-sdd     158.00GiB
> > >     /dev/mapper/crypt-sde      52.00GiB
> > > 
> > > System,RAID1: Size:64.00MiB, Used:528.00KiB
> > >     /dev/mapper/crypt-sdc      32.00MiB
> > >     /dev/mapper/crypt-sdd      64.00MiB
> > >     /dev/mapper/crypt-sde      32.00MiB
> > > 
> > > Unallocated:
> > >     /dev/mapper/crypt-sdb     393.04GiB
> > >     /dev/mapper/crypt-sdc     393.01GiB
> > >     /dev/mapper/crypt-sdd       2.57TiB
> > >     /dev/mapper/crypt-sde     394.01GiB
> > > 
> > > > I'm now running `fi balance -dusage=10` (and rising the usage limit). I
> > > > can see that the unallocated space is rising as it's freeing the little
> > > > used chunks but still no data are being stored on the new disk.
> > That is exactly what is happening:  you are moving tiny amounts of data
> > into existing big empty spaces, so no new chunk allocations (which should
> > use the new drive) are happening.  You have 470GB of data allocated
> > but not used, so you have up to 235 block groups to fill before the new
> > drive gets any data.
> > 
> > Also note that you always have to do a full data balance when adding
> > devices to raid5 in order to make use of all the space, so you might
> > as well get started on that now.  It'll take a while.  'btrfs balance
> > start -dstripes=1..3 /mnt/data1' will work for this case.
> > 
> > > > Is it some bug? Is `fi usage` not showing me something (as it states
> > > > "WARNING: RAID56 detected, not implemented")?
> > The warning just means the fields in the 'fi usage' output header,
> > like "Free (estimate)", have bogus values because they're not computed
> > correctly.
> > 
> > > > Or is there just too much
> > > > free space on the first set of disks that the balancing is not bothering
> > > > moving any data?
> > Yes.  ;)
> > 
> > > > If so, shouldn't it be really balancing (spreading) the data among all
> > > > the drives to use all the IOPS capacity, even when the raid5 redundancy
> > > > constraint is currently satisfied?
> > btrfs divides the disks into chunks first, then spreads the data across
> > the chunks.  The chunk allocation behavior spreads chunks across all the
> > disks.  When you are adding a disk to raid5, you have to redistribute all
> > the old data across all the disks to get balanced IOPS and space usage,
> > hence the full balance requirement.
> > 
> > If you don't do a full balance, it will eventually allocate data on
> > all disks, but it will run out of space on sdb, sdc, and sde first,
> > and then be unable to use the remaining 2TB+ on sdd.
> > 
> > > #  uname -a
> > > Linux storage 4.19.0-0.bpo.2-amd64 #1 SMP Debian 4.19.16-1~bpo9+1
> > > (2019-02-07) x86_64 GNU/Linux
> > > #   btrfs --version
> > > btrfs-progs v4.17
> > > #  btrfs fi show
> > > Label: none  uuid: xxxxxxxxxxxxxxxxx
> > >      Total devices 4 FS bytes used 4.09TiB
> > >      devid    2 size 2.73TiB used 2.34TiB path /dev/mapper/crypt-sdc
> > >      devid    3 size 2.73TiB used 2.34TiB path /dev/mapper/crypt-sdb
> > >      devid    4 size 2.73TiB used 2.34TiB path /dev/mapper/crypt-sde
> > >      devid    5 size 2.73TiB used 158.06GiB path /dev/mapper/crypt-sdd
> > > 
> > > #   btrfs fi df .
> > > Data, RAID5: total=4.59TiB, used=4.06TiB
> > > System, RAID1: total=64.00MiB, used=528.00KiB
> > > Metadata, RAID1: total=158.00GiB, used=29.43GiB
> > > GlobalReserve, single: total=512.00MiB, used=0.00B
> > > 
> > > > Thanks
> > > > 
> > > > Jakub
> > > > 


* Re: Balancing raid5 after adding another disk does not move/use any data on it
  2019-03-15 18:01 ` Zygo Blaxell
  2019-03-15 18:42   ` Jakub Husák
@ 2019-03-15 20:31   ` Hans van Kranenburg
  2019-03-16  6:07     ` Andrei Borzenkov
  1 sibling, 1 reply; 16+ messages in thread
From: Hans van Kranenburg @ 2019-03-15 20:31 UTC (permalink / raw)
  To: Zygo Blaxell, Jakub Husák; +Cc: linux-btrfs

On 3/15/19 7:01 PM, Zygo Blaxell wrote:
> On Wed, Mar 13, 2019 at 11:11:02PM +0100, Jakub Husák wrote:
>> Sorry, fighting with this technology called "email" :)
>>
>>
>> Hopefully better wrapped outputs:
>>
>> On 13. 03. 19 22:58, Jakub Husák wrote:
>>
>>
>>> Hi,
>>>
>>> I added another disk to my 3-disk raid5 and ran a balance command. After
>>> a few hours I looked at the output of `fi usage` and saw that no data was
>>> being used on the new disk. I got the same result even when balancing my
>>> raid5 data or metadata.
>>>
>>> Next I tried to convert my raid5 metadata to raid1 (a good idea anyway)
>>> and the new disk started to fill immediately (even though it received
>>> the whole amount of metadata with replicas being spread among the other
>>> drives, instead of being really "balanced". I know why this happened, I
>>> don't like it but I can live with it, let's not go off topic here :)).
>>>
>>> Now my usage output looks like this:
>>>
>> # btrfs filesystem usage   /mnt/data1
>> WARNING: RAID56 detected, not implemented
>> Overall:
>>     Device size:          10.91TiB
>>     Device allocated:         316.12GiB
>>     Device unallocated:          10.61TiB
>>     Device missing:             0.00B
>>     Used:              58.86GiB
>>     Free (estimated):             0.00B    (min: 8.00EiB)
>>     Data ratio:                  0.00
>>     Metadata ratio:              2.00
>>     Global reserve:         512.00MiB    (used: 0.00B)
>>
>> Data,RAID5: Size:4.59TiB, Used:4.06TiB
>>    /dev/mapper/crypt-sdb       2.29TiB
>>    /dev/mapper/crypt-sdc       2.29TiB
>>    /dev/mapper/crypt-sde       2.29TiB
>>
>> Metadata,RAID1: Size:158.00GiB, Used:29.43GiB
>>    /dev/mapper/crypt-sdb      53.00GiB
>>    /dev/mapper/crypt-sdc      53.00GiB
>>    /dev/mapper/crypt-sdd     158.00GiB
>>    /dev/mapper/crypt-sde      52.00GiB
>>
>> System,RAID1: Size:64.00MiB, Used:528.00KiB
>>    /dev/mapper/crypt-sdc      32.00MiB
>>    /dev/mapper/crypt-sdd      64.00MiB
>>    /dev/mapper/crypt-sde      32.00MiB
>>
>> Unallocated:
>>    /dev/mapper/crypt-sdb     393.04GiB
>>    /dev/mapper/crypt-sdc     393.01GiB
>>    /dev/mapper/crypt-sdd       2.57TiB
>>    /dev/mapper/crypt-sde     394.01GiB
>>
>>>
>>> I'm now running `fi balance -dusage=10` (and raising the usage limit). I
>>> can see that the unallocated space is rising as it's freeing the little-
>>> used chunks, but still no data is being stored on the new disk.
> 
> That is exactly what is happening:  you are moving tiny amounts of data
> into existing big empty spaces, so no new chunk allocations (which should
> use the new drive) are happening.  You have 470GB of data allocated
> but not used, so you have up to 235 block groups to fill before the new
> drive gets any data.
> 
> Also note that you always have to do a full data balance when adding
> devices to raid5 in order to make use of all the space, so you might
> as well get started on that now.  It'll take a while.  'btrfs balance
> start -dstripes=1..3 /mnt/data1' will work for this case.
> 
>>> Is it some bug? Is `fi usage` not showing me something (as it states
>>> "WARNING: RAID56 detected, not implemented")? 
> 
> The warning just means the fields in the 'fi usage' output header,
> like "Free (estimated)", have bogus values because they're not computed
> correctly.

The output of btrfs-usage-report, which ships with the python-btrfs
library (since v11), might be interesting for you here.

It shows pretty accurate numbers, and it also contains a section showing
exactly how much currently unallocatable raw disk space you have on each
disk. While moving things around with balance, you can watch the numbers
change.

>>> Or is there just too much
>>> free space on the first set of disks that the balance is not bothering
>>> to move any data?
> 
> Yes.  ;)
> 
>>> If so, shouldn't it be really balancing (spreading) the data among all
>>> the drives to use all the IOPS capacity, even when the raid5 redundancy
>>> constraint is currently satisfied?
> 
> btrfs divides the disks into chunks first, then spreads the data across
> the chunks.  The chunk allocation behavior spreads chunks across all the
> disks.  When you are adding a disk to raid5, you have to redistribute all
> the old data across all the disks to get balanced IOPS and space usage,
> hence the full balance requirement.
> 
> If you don't do a full balance, it will eventually allocate data on
> all disks, but it will run out of space on sdb, sdc, and sde first,
> and then be unable to use the remaining 2TB+ on sdd.
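That can be sanity-checked with a tiny pure-Python model of raid5 chunk
allocation (a simplification, not btrfs code: assume every new block group
greedily spans all devices that still have unallocated space, minimum two,
taking 1GiB from each). The per-device numbers below are the unallocated
figures from the `fi usage` output earlier in the thread, rounded to GiB:

```python
def simulate_raid5(unallocated_gib):
    """Greedy model of btrfs raid5 chunk allocation: each new block group
    spans every device that still has >= 1GiB unallocated (raid5 needs at
    least two devices), taking one 1GiB dev extent from each.  Returns
    (usable data GiB, GiB left stranded on devices)."""
    free = dict(unallocated_gib)
    data = 0
    while True:
        members = [dev for dev, f in free.items() if f >= 1]
        if len(members) < 2:
            break  # raid5 can't allocate a block group on fewer than 2 devices
        for dev in members:
            free[dev] -= 1
        data += len(members) - 1  # one stripe member's worth goes to parity
    return data, sum(free.values())

# Unallocated space per device, from the fi usage output (GiB, rounded).
data, stranded = simulate_raid5({'sdb': 393, 'sdc': 393,
                                 'sdd': 2632, 'sde': 394})
print(data, stranded)  # 1180 2238 -- ~2.2TiB on sdd stays unusable
```

So without a full rebalance, once sdb/sdc/sde fill up, the remaining
~2.2TiB of sdd can never be allocated, exactly as described.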

Also, if you have a lot of empty space in the current allocations, btrfs
balance tends to first pack everything together before allocating new
(4-disk-wide) block groups.

This is annoying, because it can result in moving the same data multiple
times during a balance (into empty space of another existing block group,
and then again when that one has its turn, etc).

So you want to get rid of empty space in existing block groups as soon
as possible. btrfs-balance-least-used (also an example from python-btrfs)
can do this, by processing block groups in order of emptiest first.

A copy of the script with the following change will filter out block
groups that already span 4 drives:

diff --git a/bin/btrfs-balance-least-used b/bin/btrfs-balance-least-used
index 7005347..0b243a3 100755
--- a/bin/btrfs-balance-least-used
+++ b/bin/btrfs-balance-least-used
@@ -41,6 +41,8 @@ def load_block_groups(fs, max_used_pct):
     for chunk in fs.chunks():
         if not (chunk.type & btrfs.BLOCK_GROUP_DATA):
             continue
+        if len(chunk.stripes) > 3:
+            continue
         try:
             block_group = fs.block_group(chunk.vaddr, chunk.length)
             if block_group.used_pct <= max_used_pct:


https://github.com/knorrie/python-btrfs/tree/master/bin

>> #  uname -a
>> Linux storage 4.19.0-0.bpo.2-amd64 #1 SMP Debian 4.19.16-1~bpo9+1
>> (2019-02-07) x86_64 GNU/Linux
>> #   btrfs --version
>> btrfs-progs v4.17
>> #  btrfs fi show
>> Label: none  uuid: xxxxxxxxxxxxxxxxx
>>     Total devices 4 FS bytes used 4.09TiB
>>     devid    2 size 2.73TiB used 2.34TiB path /dev/mapper/crypt-sdc
>>     devid    3 size 2.73TiB used 2.34TiB path /dev/mapper/crypt-sdb
>>     devid    4 size 2.73TiB used 2.34TiB path /dev/mapper/crypt-sde
>>     devid    5 size 2.73TiB used 158.06GiB path /dev/mapper/crypt-sdd
>>
>> #   btrfs fi df .
>> Data, RAID5: total=4.59TiB, used=4.06TiB
>> System, RAID1: total=64.00MiB, used=528.00KiB
>> Metadata, RAID1: total=158.00GiB, used=29.43GiB
>> GlobalReserve, single: total=512.00MiB, used=0.00B

</commercials break>

Hans



* Re: Balancing raid5 after adding another disk does not move/use any data on it
  2019-03-15 20:31   ` Hans van Kranenburg
@ 2019-03-16  6:07     ` Andrei Borzenkov
  2019-03-16 16:34       ` Hans van Kranenburg
  2019-03-16 23:10       ` Zygo Blaxell
  0 siblings, 2 replies; 16+ messages in thread
From: Andrei Borzenkov @ 2019-03-16  6:07 UTC (permalink / raw)
  To: Hans van Kranenburg, Zygo Blaxell, Jakub Husák; +Cc: linux-btrfs

15.03.2019 23:31, Hans van Kranenburg пишет:
...
>>
>>>> If so, shouldn't it be really balancing (spreading) the data among all
>>>> the drives to use all the IOPS capacity, even when the raid5 redundancy
>>>> constraint is currently satisfied?
>>
>> btrfs divides the disks into chunks first, then spreads the data across
>> the chunks.  The chunk allocation behavior spreads chunks across all the
>> disks.  When you are adding a disk to raid5, you have to redistribute all
>> the old data across all the disks to get balanced IOPS and space usage,
>> hence the full balance requirement.
>>
>> If you don't do a full balance, it will eventually allocate data on
>> all disks, but it will run out of space on sdb, sdc, and sde first,
>> and then be unable to use the remaining 2TB+ on sdd.
> 
> Also, if you have a lot of empty space in the current allocations, btrfs
> balance has the tendency to first start packing everything together
> before allocating new (4 disk wide) block groups.
> 
> This is annoying, because it can result in moving the same data multiple
> times during balance (into empty space of another existing block group,
> > and then when that one has its turn again etc).
> > 
> > So you want to get rid of empty space in existing block groups as soon
> > as possible. btrfs-balance-least-used can do this, (also an example from
> python-btrfs), by doing them in order of most empty one first.
> 

But if I understand the above correctly, it will still attempt to move
data into the next most-empty chunks first. Is there any way to force
allocation of new chunks? Or, better, force usage of chunks with a given
stripe width as the balance target?

This thread actually made me wonder - is there any guarantee (or even
tentative promise) about RAID stripe width from btrfs at all? Is it
possible that RAID5 degrades to mirror by itself due to unfortunate
space distribution?


* Re: Balancing raid5 after adding another disk does not move/use any data on it
  2019-03-16  6:07     ` Andrei Borzenkov
@ 2019-03-16 16:34       ` Hans van Kranenburg
  2019-03-16 19:51         ` Hans van Kranenburg
  2019-03-16 23:10       ` Zygo Blaxell
  1 sibling, 1 reply; 16+ messages in thread
From: Hans van Kranenburg @ 2019-03-16 16:34 UTC (permalink / raw)
  To: Andrei Borzenkov, Zygo Blaxell, Jakub Husák; +Cc: linux-btrfs

On 3/16/19 7:07 AM, Andrei Borzenkov wrote:
> 15.03.2019 23:31, Hans van Kranenburg пишет:
> ...
>>>
>>>>> If so, shouldn't it be really balancing (spreading) the data among all
>>>>> the drives to use all the IOPS capacity, even when the raid5 redundancy
>>>>> constraint is currently satisfied?
>>>
>>> btrfs divides the disks into chunks first, then spreads the data across
>>> the chunks.  The chunk allocation behavior spreads chunks across all the
>>> disks.  When you are adding a disk to raid5, you have to redistribute all
>>> the old data across all the disks to get balanced IOPS and space usage,
>>> hence the full balance requirement.
>>>
>>> If you don't do a full balance, it will eventually allocate data on
>>> all disks, but it will run out of space on sdb, sdc, and sde first,
>>> and then be unable to use the remaining 2TB+ on sdd.
>>
>> Also, if you have a lot of empty space in the current allocations, btrfs
>> balance has the tendency to first start packing everything together
>> before allocating new (4 disk wide) block groups.
>>
>> This is annoying, because it can result in moving the same data multiple
>> times during balance (into empty space of another existing block group,
>> and then when that one has its turn again etc).
>>> So you want to get rid of empty space in existing block groups as soon
>> as possible. btrfs-balance-least-used can do this, (also an example from
>> python-btrfs), by doing them in order of most empty one first.
>>
> 
> But if I understand the above correctly it will still attempt to move
> data in next most empty chunks first.

Balance feeds data back to the fs as new writes, so it will try filling
up existing block groups with the lowest vaddr first (when running in
nossd/ssd mode). Newly added block groups (/chunks) always get a new
vaddr which is higher than everything else, so they're chosen last: they
only start filling once all lower-numbered ones are packed with data,
while we keep removing the ones that were just emptied.

> Is there any way to force
> allocation of new chunks? Or, better, force usage of chunks with given
> stripe width as balance target?

Nope. Nor the other way around: blacklisting everything that you know
you want to get rid of. Currently that's not possible. It would require
knobs that influence the extent allocator (e.g. prefer writing into the
chunk with the highest num_stripes first).

Conversion has a similar problem. For every chunk that gets converted,
you get a new empty one with the new target profile, and it's quite
possible that you first rewrite data a few times (depending on how
compacted everything already was) into the existing old-profile chunks
before actually starting to use the new profile.

Having a lot of empty space in existing block groups is something that
mainly happens after removing a lot of data. In that case, if you care,
compacting everything together with the least amount of data movement is
why I added the least-used-first algorithm.

Since we're not using the "cluster" allocator for data any more (the
ssd-option related change in 4.14), normal operation with equal amounts
of data being removed and added all the time no longer results in
overallocation.

> This thread actually made me wonder - is there any guarantee (or even
> tentative promise) about RAID stripe width from btrfs at all? Is it
> possible that RAID5 degrades to mirror by itself due to unfortunate
> space distribution?

For RAID5, the minimum is two disks. So yes, if you add two disks and
don't forcibly rewrite all your data, it will happily start adding
two-disk RAID5 block groups once the other disks are full.

Hans


* Re: Balancing raid5 after adding another disk does not move/use any data on it
  2019-03-16 16:34       ` Hans van Kranenburg
@ 2019-03-16 19:51         ` Hans van Kranenburg
  2019-03-17 20:52           ` Jakub Husák
  0 siblings, 1 reply; 16+ messages in thread
From: Hans van Kranenburg @ 2019-03-16 19:51 UTC (permalink / raw)
  To: Andrei Borzenkov, Zygo Blaxell, Jakub Husák; +Cc: linux-btrfs


On 3/16/19 5:34 PM, Hans van Kranenburg wrote:
> On 3/16/19 7:07 AM, Andrei Borzenkov wrote:
>> [...]
>> This thread actually made me wonder - is there any guarantee (or even
>> tentative promise) about RAID stripe width from btrfs at all? Is it
>> possible that RAID5 degrades to mirror by itself due to unfortunate
>> space distribution?
> 
> For RAID5, minimum is two disks. So yes, if you add two disks and don't
> forcibly rewrite all your data, it will happily start adding two-disk
> RAID5 block groups if the other disks are full.

Attached is an example that shows a list of used physical and virtual
space, ordered by chunk type (== block group flags) and num_stripes (how
many disks -- or rather, dev extents -- are used). The btrfs-usage-report
does not add this level of detail. (Maybe it would be interesting to add,
but then I would add it to the btrfs.fs_usage code...)

For a RAID56 filesystem with a big mess of block groups of different
"horizontal size", this will be more interesting than what it shows here
as a test:

# ./chunks_stripes_report.py /
flags            num_stripes     physical      virtual
-----            -----------     --------      -------
DATA                       1    759.00GiB    759.00GiB
SYSTEM|DUP                 2     64.00MiB     32.00MiB
METADATA|DUP               2      7.00GiB      3.50GiB


Hans

[-- Attachment #2: chunks_stripes_report.py --]
[-- Type: text/x-python; name="chunks_stripes_report.py", Size: 1010 bytes --]

#!/usr/bin/python3

import btrfs
from collections import defaultdict, Counter

# Per chunk type and stripe count, sum the raw bytes occupied on disk and
# the virtual (usable) bytes the chunks provide.
physical_bytes = defaultdict(Counter)
virtual_bytes = defaultdict(Counter)

with btrfs.FileSystem('/') as fs:
    for chunk in fs.chunks():
        physical_bytes[chunk.type][chunk.num_stripes] += \
           btrfs.volumes.chunk_to_dev_extent_length(chunk) * chunk.num_stripes
        virtual_bytes[chunk.type][chunk.num_stripes] += chunk.length

report_lines = [
    ('flags', 'num_stripes', 'physical', 'virtual'),
    ('-----', '-----------', '--------', '-------'),
]
for flags, counter in physical_bytes.items():
    for num_stripes, pbytes in counter.items():
        report_lines.append((
            btrfs.utils.block_group_flags_str(flags),
            num_stripes,
            btrfs.utils.pretty_size(pbytes),
            btrfs.utils.pretty_size(virtual_bytes[flags][num_stripes]),
        ))
for report_line in report_lines:
    print("{: <16} {: >11} {: >12} {: >12}".format(*report_line))


* Re: Balancing raid5 after adding another disk does not move/use any data on it
  2019-03-16  6:07     ` Andrei Borzenkov
  2019-03-16 16:34       ` Hans van Kranenburg
@ 2019-03-16 23:10       ` Zygo Blaxell
  1 sibling, 0 replies; 16+ messages in thread
From: Zygo Blaxell @ 2019-03-16 23:10 UTC (permalink / raw)
  To: Andrei Borzenkov; +Cc: Hans van Kranenburg, Jakub Husák, linux-btrfs


On Sat, Mar 16, 2019 at 09:07:17AM +0300, Andrei Borzenkov wrote:
> 15.03.2019 23:31, Hans van Kranenburg пишет:
> ...
> >>
> >>>> If so, shouldn't it be really balancing (spreading) the data among all
> >>>> the drives to use all the IOPS capacity, even when the raid5 redundancy
> >>>> constraint is currently satisfied?
> >>
> >> btrfs divides the disks into chunks first, then spreads the data across
> >> the chunks.  The chunk allocation behavior spreads chunks across all the
> >> disks.  When you are adding a disk to raid5, you have to redistribute all
> >> the old data across all the disks to get balanced IOPS and space usage,
> >> hence the full balance requirement.
> >>
> >> If you don't do a full balance, it will eventually allocate data on
> >> all disks, but it will run out of space on sdb, sdc, and sde first,
> >> and then be unable to use the remaining 2TB+ on sdd.
> > 
> > Also, if you have a lot of empty space in the current allocations, btrfs
> > balance has the tendency to first start packing everything together
> > before allocating new (4 disk wide) block groups.
> > 
> > This is annoying, because it can result in moving the same data multiple
> > times during balance (into empty space of another existing block group,
> > and then when that one has its turn again etc).
> > > So you want to get rid of empty space in existing block groups as soon
> > as possible. btrfs-balance-least-used can do this, (also an example from
> > python-btrfs), by doing them in order of most empty one first.
> > 
> 
> But if I understand the above correctly it will still attempt to move
> data in next most empty chunks first. Is there any way to force
> allocation of new chunks? Or, better, force usage of chunks with given
> stripe width as balance target?
> 
> This thread actually made me wonder - is there any guarantee (or even
> tentative promise) about RAID stripe width from btrfs at all? Is it
> possible that RAID5 degrades to mirror by itself due to unfortunate
> space distribution?

Note that the data layout of RAID5 with 1 data disk, 1 parity disk, and
even parity is identical to RAID1 with 1 data disk and 1 mirror copy.
The two algorithms produce the same data layout with those parameters.
IIRC btrfs uses odd parity, so on btrfs the RAID5 parity stripes are
the complement of the data stripes, but they don't need to be:  with
even parity on 2 disks, the data and parity blocks are identical and
interchangeable.
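
A couple of lines of Python make the equivalence concrete (a sketch of
the parity arithmetic only, not btrfs code): even parity is the XOR of
the data blocks, so with a single data block it is a byte-for-byte copy
of that block -- i.e. a raid1 mirror; odd parity is its complement.

```python
from functools import reduce

def parity(*data_blocks, odd=False):
    # XOR corresponding bytes of all data blocks ("even" parity);
    # odd parity is the bitwise complement of the even parity.
    p = bytes(reduce(lambda a, b: a ^ b, column)
              for column in zip(*data_blocks))
    return bytes(b ^ 0xFF for b in p) if odd else p

d = bytes([0b10110010, 0b00001111])
assert parity(d) == d                 # one data block: parity == mirror copy
assert parity(d, odd=True) == bytes([0b01001101, 0b11110000])  # complement
assert parity(b'\x01\x10', b'\x03\x30') == b'\x02\x20'  # two data blocks
```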

If you have RAID5 with non-equal device sizes, as long as the two largest
disks are the same size, btrfs will adjust the stripe width to match
the disks with free space available, subject to the constraint that the
resulting block group must have enough disks to survive one disk failure.
e.g. for RAID5 with 5 disks, 2x3TB, 2x2TB, 1x1TB, you get three zones:

  -> raid5 fills smallest unallocated spaces first, all drives ->
   3TB AAAAAAAAAABBBBBBBBBBCCCCCCCCCC
   3TB AAAAAAAAAABBBBBBBBBBCCCCCCCCCC
   2TB AAAAAAAAAABBBBBBBBBB
   2TB AAAAAAAAAABBBBBBBBBB
   1TB AAAAAAAAAA

Zone "A" is 5 disks wide, zone "B" is 4 disks wide, and zone "C" is
2 disks wide (each letter represents 100x1GB chunks).  This is not
necessarily how the data is laid out on disk--the btrfs allocator will
store data on disk in some permutation of this order; however, the
total number of chunks in each zone on each disk is as shown.
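
The zone widths can be reproduced with a small model of the allocator
(again a simplification, assuming block groups always span every device
with free space left, in 1GiB steps):

```python
from collections import Counter

def raid5_zone_widths(device_sizes_gib):
    """Greedy model: each raid5 block group spans every device that still
    has unallocated space (minimum two), 1GiB per member.  Returns a
    Counter mapping stripe width -> GiB allocated per device at that width."""
    free = list(device_sizes_gib)
    widths = Counter()
    while True:
        members = [i for i, f in enumerate(free) if f >= 1]
        if len(members) < 2:
            break
        for i in members:
            free[i] -= 1
        widths[len(members)] += 1
    return widths

# 2x3TB, 2x2TB, 1x1TB, in GiB-ish units of 1000:
widths = raid5_zone_widths([3000, 3000, 2000, 2000, 1000])
print(widths)  # zone A: 1000GiB/disk 5 wide, B: 1000 4 wide, C: 1000 2 wide
```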

For -draid5 -mraid1, you can get patterns like this:

  <- raid1 fills largest unallocated spaces first, 2 drives <-
   3TB 5AAAAAAAA4BBBBBBBBB3CCCCCCCC21
   3TB 5AAAAAAAA4BBBBBBBBB3CCCCCCCC21
   2TB 6AAAAAAAADBBBBBBBBBC
   2TB 6AAAAAAAADBBBBBBBBBC
   1TB UAAAAAAAAD

where numbered zones are raid1 metadata chunks, zone "D" is raid5 3 disks
wide, and "U" is the worst-case one unusable 1GB chunk (not to scale)
in arrays with an odd number of disks.  The numbered zones occupy space
that would normally form a full-width raid5 stripe in the zone, so the
last raid5 block groups in each zone are less wide (i.e. the metadata
chunks in the "B" zone make some stripes in the "B" zone space behave
like stripes in "C" zone space).

If the allocations start from empty disks and there are no array reshaping
operations (convert profile, add/delete/resize devices) then the allocator
should allocate all the usable space as efficiently as possible.  In the
-draid5 -mraid1 case, it would be slightly more efficient to allocate all
the metadata in the "C" zone so it doesn't make any narrower stripes in
the "B" and "A" zones.  Typically this is exactly what happens, since
all the "A" and "B" space must be allocated before raid5 can reach the
"C" zone from the left, while all the "C" space must be allocated before
raid1 can reach the "B" zone from the right, and the two allocators only
interact when the filesystem is completely full.

  <- raid1 fills from the right, raid5 from the left <-
   3TB AAAAAAAAAABBBBBBBBBBCCCC654321
   3TB AAAAAAAAAABBBBBBBBBBCCCC654321
   2TB AAAAAAAAAABBBBBBBBBB
   2TB AAAAAAAAAABBBBBBBBBB
   1TB AAAAAAAAAA
  -> they meet somewhere in the middle, no space wasted ->

If all the drives are the same size, then raid5 and raid1 meet
immediately in zone "A":

  <- raid1 fills from the right, raid5 from the left <-
   3TB AAAAAAAAAAAAAAAAAAAAAAAAAAA421
   3TB AAAAAAAAAAAAAAAAAAAAAAAAAAA431
   3TB AAAAAAAAAAAAAAAAAAAAAAAAAAAU32
  -> they meet somewhere in the middle, up to 1GB wasted ->

There used to be a bug (maybe there still is?) where the allocator would
randomly place about 0.1% of chunks on a non-optimal disk (due to a race
condition?).  That can theoretically lose a few GB of space per TB by
shrinking the stripe width on a few block groups, or stealing a mirror
chunk from the largest disk in a raid1 array with multiple disk sizes.
You can get rid of those using the 'stripes' filter for balance--though
only 0.1% of the space is gained or lost this way, so it may not be
worth the IO cost.

If you are converting or reshaping an array, the nice rules above don't
hold any more.  e.g. if we replace a 1TB drive with a 3TB drive, we get
2TB unallocated ("_"):

   3TB AAAAAAAAAABBBBBBBBBBCCCC654321
   3TB AAAAAAAAAABBBBBBBBBBCCCC654321
   2TB AAAAAAAAAABBBBBBBBBB
   2TB AAAAAAAAAABBBBBBBBBB
   3TB AAAAAAAAAA____________________

Now we have no available space because there's no free chunks on two
or more drives (i.e. all the free space is on 1 drive and all the RAID
profiles we are using require 2).  Upgrade another disk, and...

   3TB AAAAAAAAAABBBBBBBBBBCCCC654321
   3TB AAAAAAAAAABBBBBBBBBBCCCC654321
   2TB AAAAAAAAAABBBBBBBBBB
   3TB AAAAAAAAAABBBBBBBBBB__________
   3TB AAAAAAAAAA____________________

Now we have 1TB of free space, in stripes 2 disks wide.  Without a balance,
it would fill up like this:

   3TB AAAAAAAAAABBBBBBBBBBCCCC654321
   3TB AAAAAAAAAABBBBBBBBBBCCCC654321
   2TB AAAAAAAAAABBBBBBBBBB
   3TB AAAAAAAAAABBBBBBBBBBCCCCCCCCCC
   3TB AAAAAAAAAACCCCCCCCCCXXXXXXXXXX
  -> raid5 fills smallest unallocated spaces first on all drives ->

Note the "C" zone here is still stripes 2 disks wide, so a lot of space is
wasted by narrow stripes.  Even the diagram makes it look like we did
something wrong--we don't have the nice orderly fill pattern.  1TB is unusable,
and the free space estimated by 'df' was egregiously wrong the whole time.

Full balance fixes that, and we get some unallocated space that is
usable:

  -> raid5 from left to right ->
   3TB AAAAAAAAAAAAAAAAAAA________531
   3TB AAAAAAAAAAAAAAAAAAA________531
   2TB AAAAAAAAAAAAAAAAAAA_
   3TB AAAAAAAAAAAAAAAAAAA________642
   3TB AAAAAAAAAAAAAAAAAAA________642
  <- raid1 from right to left <-

which can then be filled up like this:

   3TB AAAAAAAAAAAAAAAAAAAABBBBBB7531
   3TB AAAAAAAAAAAAAAAAAAAABBBBBB7531
   2TB AAAAAAAAAAAAAAAAAAAA
   3TB AAAAAAAAAAAAAAAAAAAABBBBBBC642
   3TB AAAAAAAAAAAAAAAAAAAABBBBBBC642

By the time our hypothetical filesystem was full, there was another
metadata chunk allocated, so we end up with one 1GB block group in zone
"C" with 2 disks--but at most one.



* Re: Balancing raid5 after adding another disk does not move/use any data on it
  2019-03-16 19:51         ` Hans van Kranenburg
@ 2019-03-17 20:52           ` Jakub Husák
  2019-03-17 22:53             ` Hans van Kranenburg
  0 siblings, 1 reply; 16+ messages in thread
From: Jakub Husák @ 2019-03-17 20:52 UTC (permalink / raw)
  To: linux-btrfs

This is a great tool, Hans!  This kind of overview should be part of 
btrfs-progs.

Mine currently looks like this; I have a few more days of rebalancing to 
go :)

flags            num_stripes     physical      virtual
-----            -----------     --------      -------
DATA|RAID5                 3      5.29TiB      3.53TiB
DATA|RAID5                 4    980.00GiB    735.00GiB
SYSTEM|RAID1               2    128.00MiB     64.00MiB
METADATA|RAID1             2    314.00GiB    157.00GiB

Btw, I checked the other utils in your python-btrfs and it seems that 
they are, sadly, not installed by a simple pip install, which would be 
great. Maybe it just needs a few lines in setup.py? (I'm not too familiar 
with Python packaging.)


On 16. 03. 19 20:51, Hans van Kranenburg wrote:
> On 3/16/19 5:34 PM, Hans van Kranenburg wrote:
>> On 3/16/19 7:07 AM, Andrei Borzenkov wrote:
>>> [...]
>>> This thread actually made me wonder - is there any guarantee (or even
>>> tentative promise) about RAID stripe width from btrfs at all? Is it
>>> possible that RAID5 degrades to mirror by itself due to unfortunate
>>> space distribution?
>> For RAID5, minimum is two disks. So yes, if you add two disks and don't
>> forcibly rewrite all your data, it will happily start adding two-disk
>> RAID5 block groups if the other disks are full.
> Attached an example that shows a list of used physical and virtual space
> ordered by chunk type (== block group flags) and also num_stripes (how
> many disks (or, dev extents)) are used. The btrfs-usage-report does not
> add this level of detail. (But maybe it would be interesting to add, but
> then I would add it into the btrfs.fs_usage code...)
>
> For the RAID56 with a big mess of different block groups with different
> "horizontal size" this will be more interesting than what it shows here
> as test:
>
> -# ./chunks_stripes_report.py /
> flags            num_stripes     physical      virtual
> -----            -----------     --------      -------
> DATA                       1    759.00GiB    759.00GiB
> SYSTEM|DUP                 2     64.00MiB     32.00MiB
> METADATA|DUP               2      7.00GiB      3.50GiB
>
>
> Hans


* Re: Balancing raid5 after adding another disk does not move/use any data on it
  2019-03-17 20:52           ` Jakub Husák
@ 2019-03-17 22:53             ` Hans van Kranenburg
  2019-03-18 19:54               ` Marc Joliet
  0 siblings, 1 reply; 16+ messages in thread
From: Hans van Kranenburg @ 2019-03-17 22:53 UTC (permalink / raw)
  To: Jakub Husák, linux-btrfs

Hi,

On 3/17/19 9:52 PM, Jakub Husák wrote:
> This is a great tool Hans!  This kind of overview should be a part of
> btrfs-progs.

Thing is... this seems super useful because it matches the exact thing
you are currently doing and trying to find out.

Fun thing is, there are a thousand other things in other scenarios that
would be interesting to know. Should btrfs-progs implement hardcoded
solutions for all of them? Or cover 80% of what's needed with 20% of the
effort?

The main reason why I started writing the python-btrfs library is that
it allows me to just quickly write a few lines of code to get some
information, exactly for what I want to know at that point.

In the previous example, writing the table of output already takes more
than three times as many lines of code as getting the actual info, which
is a simple 'for chunk in fs.chunks()' and then boom, you have a lot of
info to do something with.

https://python-btrfs.readthedocs.io/en/stable/btrfs.html#btrfs.ctree.Chunk

> Mine looks currently like this, I have a few more days to go with
> rebalancing :)
> 
> flags            num_stripes     physical      virtual
> -----            -----------     --------      -------
> DATA|RAID5                 3      5.29TiB      3.53TiB
> DATA|RAID5                 4    980.00GiB    735.00GiB
> SYSTEM|RAID1               2    128.00MiB     64.00MiB
> METADATA|RAID1             2    314.00GiB    157.00GiB

Ha, nice!

> Btw, I checked the other utils in your python-btrfs and it seems that
> they are, sadly, not installed with simple pip install, which would be
> great. Maybe it needs a few lines in setup.py (i'm not too familiar with
> python packaging)?

Can you share how you're using this?

Personally, I never use pip for anything, so I might not be putting in
there what users expect. My latest thought about this was that users use
pip to have some library dependency for something else, so they don't
need standalone programs and example scripts?

I mainly have debian packages installed everywhere, and otherwise I'm
doing a git clone of the project from github and mess around in there,
with added benefit that I can view history on all files.

Hans


* Re: Balancing raid5 after adding another disk does not move/use any data on it
  2019-03-17 22:53             ` Hans van Kranenburg
@ 2019-03-18 19:54               ` Marc Joliet
  0 siblings, 0 replies; 16+ messages in thread
From: Marc Joliet @ 2019-03-18 19:54 UTC (permalink / raw)
  To: linux-btrfs


Am Sonntag, 17. März 2019, 23:53:45 CET schrieb Hans van Kranenburg:
> My latest thought about this was that users use
> pip to have some library dependency for something else, so they don't
> need standalone programs and example scripts?

My current understanding is that Python land kinda wants everybody to use 
pip to install anything written in Python (except the science people, who 
gravitate more towards conda, though it can wrap pip for software not packaged 
natively).  So yeah, it's perfectly natural to install scripts with pip, 
though I forgot where exactly in setup.py you have to specify them.

Examples include SCons (which is also distributed via pip), various 
linters such as flake8, and test frameworks such as nose, which also come 
with the scripts needed to drive them.

(Also, I seem to remember that there are provisions for specifying examples 
separately from regular scripts, but I forgot the specifics.)
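
For what it's worth, setuptools has two usual mechanisms for this; a
minimal sketch (the package name, script path, and module path below are
all illustrative, not python-btrfs's actual layout):

```python
from setuptools import setup

setup(
    name='example-package',   # illustrative metadata only
    version='0.0.0',
    packages=['btrfs'],
    # Option 1: install standalone script files verbatim onto $PATH.
    scripts=['bin/btrfs-balance-least-used'],
    # Option 2: have pip generate console wrappers for Python functions
    # ('command = module:function' -- hypothetical module path).
    entry_points={
        'console_scripts': [
            'btrfs-usage-report = btrfs_tools.cli:main',
        ],
    },
)
```

With either of those in setup.py, `pip install .` would put the commands
on the user's $PATH.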

Greetings
-- 
Marc Joliet
--
"People who think they know everything really annoy those of us who know we
don't" - Bjarne Stroustrup



end of thread, other threads:[~2019-03-18 19:54 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-03-13 22:11 Balancing raid5 after adding another disk does not move/use any data on it Jakub Husák
2019-03-14 14:59 ` Noah Massey
2019-03-14 15:08   ` Noah Massey
2019-03-15 18:01 ` Zygo Blaxell
2019-03-15 18:42   ` Jakub Husák
2019-03-15 18:59     ` Zygo Blaxell
2019-03-15 20:31   ` Hans van Kranenburg
2019-03-16  6:07     ` Andrei Borzenkov
2019-03-16 16:34       ` Hans van Kranenburg
2019-03-16 19:51         ` Hans van Kranenburg
2019-03-17 20:52           ` Jakub Husák
2019-03-17 22:53             ` Hans van Kranenburg
2019-03-18 19:54               ` Marc Joliet
2019-03-16 23:10       ` Zygo Blaxell
  -- strict thread matches above, loose matches on Subject: below --
2019-03-13 21:58 Jakub Husák
2019-03-14 21:31 ` Chris Murphy
