* Healthy amount of free space?
@ 2018-07-16 20:58 Wolf
2018-07-17 7:20 ` Nikolay Borisov
2018-07-17 11:46 ` Austin S. Hemmelgarn
0 siblings, 2 replies; 19+ messages in thread
From: Wolf @ 2018-07-16 20:58 UTC (permalink / raw)
To: linux-btrfs
Greetings,
I would like to ask what a healthy amount of free space is to keep on
each device for btrfs to be happy?
This is what my disk array currently looks like:
[root@dennas ~]# btrfs fi usage /raid
Overall:
Device size: 29.11TiB
Device allocated: 21.26TiB
Device unallocated: 7.85TiB
Device missing: 0.00B
Used: 21.18TiB
Free (estimated): 3.96TiB (min: 3.96TiB)
Data ratio: 2.00
Metadata ratio: 2.00
Global reserve: 512.00MiB (used: 0.00B)
Data,RAID1: Size:10.61TiB, Used:10.58TiB
/dev/mapper/data1 1.75TiB
/dev/mapper/data2 1.75TiB
/dev/mapper/data3 856.00GiB
/dev/mapper/data4 856.00GiB
/dev/mapper/data5 1.75TiB
/dev/mapper/data6 1.75TiB
/dev/mapper/data7 6.29TiB
/dev/mapper/data8 6.29TiB
Metadata,RAID1: Size:15.00GiB, Used:13.00GiB
/dev/mapper/data1 2.00GiB
/dev/mapper/data2 3.00GiB
/dev/mapper/data3 1.00GiB
/dev/mapper/data4 1.00GiB
/dev/mapper/data5 3.00GiB
/dev/mapper/data6 1.00GiB
/dev/mapper/data7 9.00GiB
/dev/mapper/data8 10.00GiB
System,RAID1: Size:64.00MiB, Used:1.50MiB
/dev/mapper/data2 32.00MiB
/dev/mapper/data6 32.00MiB
/dev/mapper/data7 32.00MiB
/dev/mapper/data8 32.00MiB
Unallocated:
/dev/mapper/data1 1004.52GiB
/dev/mapper/data2 1004.49GiB
/dev/mapper/data3 1006.01GiB
/dev/mapper/data4 1006.01GiB
/dev/mapper/data5 1004.52GiB
/dev/mapper/data6 1004.49GiB
/dev/mapper/data7 1005.00GiB
/dev/mapper/data8 1005.00GiB
Btrfs does quite a good job of evenly using space on all devices. Now, how
low can I let that go? In other words, at how much remaining free/unallocated
space should I consider adding a new disk?
Thanks for the advice :)
W.
--
There are only two hard things in Computer Science:
cache invalidation, naming things and off-by-one errors.
^ permalink raw reply [flat|nested] 19+ messages in thread* Re: Healthy amount of free space? 2018-07-16 20:58 Healthy amount of free space? Wolf @ 2018-07-17 7:20 ` Nikolay Borisov 2018-07-17 8:02 ` Martin Steigerwald 2018-07-17 11:46 ` Austin S. Hemmelgarn 1 sibling, 1 reply; 19+ messages in thread From: Nikolay Borisov @ 2018-07-17 7:20 UTC (permalink / raw) To: Wolf, linux-btrfs On 16.07.2018 23:58, Wolf wrote: > Greetings, > I would like to ask what what is healthy amount of free space to keep on > each device for btrfs to be happy? > > This is how my disk array currently looks like > > [root@dennas ~]# btrfs fi usage /raid > Overall: > Device size: 29.11TiB > Device allocated: 21.26TiB > Device unallocated: 7.85TiB > Device missing: 0.00B > Used: 21.18TiB > Free (estimated): 3.96TiB (min: 3.96TiB) > Data ratio: 2.00 > Metadata ratio: 2.00 > Global reserve: 512.00MiB (used: 0.00B) > > Data,RAID1: Size:10.61TiB, Used:10.58TiB > /dev/mapper/data1 1.75TiB > /dev/mapper/data2 1.75TiB > /dev/mapper/data3 856.00GiB > /dev/mapper/data4 856.00GiB > /dev/mapper/data5 1.75TiB > /dev/mapper/data6 1.75TiB > /dev/mapper/data7 6.29TiB > /dev/mapper/data8 6.29TiB > > Metadata,RAID1: Size:15.00GiB, Used:13.00GiB > /dev/mapper/data1 2.00GiB > /dev/mapper/data2 3.00GiB > /dev/mapper/data3 1.00GiB > /dev/mapper/data4 1.00GiB > /dev/mapper/data5 3.00GiB > /dev/mapper/data6 1.00GiB > /dev/mapper/data7 9.00GiB > /dev/mapper/data8 10.00GiB > > System,RAID1: Size:64.00MiB, Used:1.50MiB > /dev/mapper/data2 32.00MiB > /dev/mapper/data6 32.00MiB > /dev/mapper/data7 32.00MiB > /dev/mapper/data8 32.00MiB > > Unallocated: > /dev/mapper/data1 1004.52GiB > /dev/mapper/data2 1004.49GiB > /dev/mapper/data3 1006.01GiB > /dev/mapper/data4 1006.01GiB > /dev/mapper/data5 1004.52GiB > /dev/mapper/data6 1004.49GiB > /dev/mapper/data7 1005.00GiB > /dev/mapper/data8 1005.00GiB > > Btrfs does quite good job of evenly using space on all devices. No, how > low can I let that go? In other words, with how much space > free/unallocated remaining space should I consider adding new disk? Btrfs will start running into problems when you run out of unallocated space. So the best advice will be monitor your device unallocated, once it gets really low - like 2-3 gb I will suggest you run balance which will try to free up unallocated space by rewriting data more compactly into sparsely populated block groups. If after running balance you haven't really freed any space then you should consider adding a new drive and running balance to even out the spread of data/metadata. > > Thanks for advice :) > > W. > ^ permalink raw reply [flat|nested] 19+ messages in thread
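For anyone wanting to script the monitoring Nikolay describes above, a minimal sketch might look like the following, assuming btrfs-progs is installed, the array is mounted at /raid as in Wolf's output, and that a 3 GiB per-device threshold and a -dusage=50 filter are acceptable starting points (both numbers are illustrative, not recommendations from this thread):

#!/bin/sh
# Check the smallest per-device unallocated figure; if it has dropped
# below the threshold, try to reclaim space with a usage-filtered data
# balance. Needs root for the btrfs commands.
MNT=/raid
THRESHOLD_BYTES=$((3 * 1024 * 1024 * 1024))   # ~3 GiB, per Nikolay's "really low"

min_unalloc=$(btrfs device usage -b "$MNT" \
    | awk '/Unallocated:/ { print $2 }' | sort -n | head -n 1)

if [ "$min_unalloc" -lt "$THRESHOLD_BYTES" ]; then
    # Rewrite only block groups that are at most 50% full, so mostly-full
    # chunks are left alone and the freed chunks return to the unallocated pool.
    btrfs balance start -dusage=50 "$MNT"
fi

If the balance reclaims nothing, that is the point at which Nikolay suggests adding a device instead.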
* Re: Healthy amount of free space? 2018-07-17 7:20 ` Nikolay Borisov @ 2018-07-17 8:02 ` Martin Steigerwald 2018-07-17 8:16 ` Nikolay Borisov 0 siblings, 1 reply; 19+ messages in thread From: Martin Steigerwald @ 2018-07-17 8:02 UTC (permalink / raw) To: Nikolay Borisov; +Cc: Wolf, linux-btrfs Hi Nikolay. Nikolay Borisov - 17.07.18, 09:20: > On 16.07.2018 23:58, Wolf wrote: > > Greetings, > > I would like to ask what what is healthy amount of free space to > > keep on each device for btrfs to be happy? > > > > This is how my disk array currently looks like > > > > [root@dennas ~]# btrfs fi usage /raid > > > > Overall: > > Device size: 29.11TiB > > Device allocated: 21.26TiB > > Device unallocated: 7.85TiB > > Device missing: 0.00B > > Used: 21.18TiB > > Free (estimated): 3.96TiB (min: 3.96TiB) > > Data ratio: 2.00 > > Metadata ratio: 2.00 > > Global reserve: 512.00MiB (used: 0.00B) […] > > Btrfs does quite good job of evenly using space on all devices. No, > > how low can I let that go? In other words, with how much space > > free/unallocated remaining space should I consider adding new disk? > > Btrfs will start running into problems when you run out of unallocated > space. So the best advice will be monitor your device unallocated, > once it gets really low - like 2-3 gb I will suggest you run balance > which will try to free up unallocated space by rewriting data more > compactly into sparsely populated block groups. If after running > balance you haven't really freed any space then you should consider > adding a new drive and running balance to even out the spread of > data/metadata. What are these issues exactly? I have % btrfs fi us -T /home Overall: Device size: 340.00GiB Device allocated: 340.00GiB Device unallocated: 2.00MiB Device missing: 0.00B Used: 308.37GiB Free (estimated): 14.65GiB (min: 14.65GiB) Data ratio: 2.00 Metadata ratio: 2.00 Global reserve: 512.00MiB (used: 0.00B) Data Metadata System Id Path RAID1 RAID1 RAID1 Unallocated -- ---------------------- --------- -------- -------- ----------- 1 /dev/mapper/msata-home 165.89GiB 4.08GiB 32.00MiB 1.00MiB 2 /dev/mapper/sata-home 165.89GiB 4.08GiB 32.00MiB 1.00MiB -- ---------------------- --------- -------- -------- ----------- Total 165.89GiB 4.08GiB 32.00MiB 2.00MiB Used 151.24GiB 2.95GiB 48.00KiB on a RAID-1 filesystem one, part of the time two Plasma desktops + KDEPIM and Akonadi + Baloo desktop search + you name it write to like mad. Since kernel 4.5 or 4.6 this simply works. Before that sometimes BTRFS crawled to an halt on searching for free blocks, and I had to switch off the laptop uncleanly. If that happened, a balance helped for a while. But since 4.5 or 4.6 this did not happen anymore. I found with SLES 12 SP 3 or so there is btrfsmaintenance running a balance weekly. Which created an issue on our Proxmox + Ceph on Intel NUC based opensource demo lab. This is for sure no recommended configuration for Ceph and Ceph is quite slow on these 2,5 inch harddisks and 1 GBit network link, despite albeit somewhat minimal, limited to 5 GiB m.2 SSD caching. What happened it that the VM crawled to a halt and the kernel gave task hung for more than 120 seconds messages. The VM was basically unusable during the balance. Sure that should not happen with a "proper" setup, also it also did not happen without the automatic balance. Also what would happen on a hypervisor setup with several thousands of VMs with BTRFS, when several 100 of them decide to start the balance at a similar time? 
It could probably bring the underlying I/O system to a halt, as many enterprise storage systems are designed to sustain burst I/O loads, but not maximum utilization over an extended period of time.

I am really wondering what to recommend in my Linux performance tuning and analysis courses. On my own laptop I do not do regular balances so far, following the thinking: if it is not broken, do not fix it.

My personal opinion here also is: if the filesystem degrades so much that it becomes unusable without regular maintenance from user space, the filesystem needs to be fixed. Ideally I would not have to worry about whether to regularly balance a BTRFS filesystem or not. In other words: I should not have to attend a performance analysis and tuning course in order to use a computer with a BTRFS filesystem.

Thanks,
--
Martin
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Healthy amount of free space? 2018-07-17 8:02 ` Martin Steigerwald @ 2018-07-17 8:16 ` Nikolay Borisov 2018-07-17 17:54 ` Martin Steigerwald 0 siblings, 1 reply; 19+ messages in thread From: Nikolay Borisov @ 2018-07-17 8:16 UTC (permalink / raw) To: Martin Steigerwald; +Cc: Wolf, linux-btrfs On 17.07.2018 11:02, Martin Steigerwald wrote: > Hi Nikolay. > > Nikolay Borisov - 17.07.18, 09:20: >> On 16.07.2018 23:58, Wolf wrote: >>> Greetings, >>> I would like to ask what what is healthy amount of free space to >>> keep on each device for btrfs to be happy? >>> >>> This is how my disk array currently looks like >>> >>> [root@dennas ~]# btrfs fi usage /raid >>> >>> Overall: >>> Device size: 29.11TiB >>> Device allocated: 21.26TiB >>> Device unallocated: 7.85TiB >>> Device missing: 0.00B >>> Used: 21.18TiB >>> Free (estimated): 3.96TiB (min: 3.96TiB) >>> Data ratio: 2.00 >>> Metadata ratio: 2.00 >>> Global reserve: 512.00MiB (used: 0.00B) > […] >>> Btrfs does quite good job of evenly using space on all devices. No, >>> how low can I let that go? In other words, with how much space >>> free/unallocated remaining space should I consider adding new disk? >> >> Btrfs will start running into problems when you run out of unallocated >> space. So the best advice will be monitor your device unallocated, >> once it gets really low - like 2-3 gb I will suggest you run balance >> which will try to free up unallocated space by rewriting data more >> compactly into sparsely populated block groups. If after running >> balance you haven't really freed any space then you should consider >> adding a new drive and running balance to even out the spread of >> data/metadata. > > What are these issues exactly? For example if you have plenty of data space but your metadata is full then you will be getting ENOSPC. > > I have > > % btrfs fi us -T /home > Overall: > Device size: 340.00GiB > Device allocated: 340.00GiB > Device unallocated: 2.00MiB > Device missing: 0.00B > Used: 308.37GiB > Free (estimated): 14.65GiB (min: 14.65GiB) > Data ratio: 2.00 > Metadata ratio: 2.00 > Global reserve: 512.00MiB (used: 0.00B) > > Data Metadata System > Id Path RAID1 RAID1 RAID1 Unallocated > -- ---------------------- --------- -------- -------- ----------- > 1 /dev/mapper/msata-home 165.89GiB 4.08GiB 32.00MiB 1.00MiB > 2 /dev/mapper/sata-home 165.89GiB 4.08GiB 32.00MiB 1.00MiB > -- ---------------------- --------- -------- -------- ----------- > Total 165.89GiB 4.08GiB 32.00MiB 2.00MiB > Used 151.24GiB 2.95GiB 48.00KiB You already have only 33% of your metadata full so if your workload turned out to actually be making more metadata-heavy changed i.e snapshots you could exhaust this and get ENOSPC, despite having around 14gb of free data space. Furthermore this data space is spread around multiple data chunks, depending on how populated they are a balance could be able to free up unallocated space which later could be re-purposed for metadata (again, depending on what you are doing). > > on a RAID-1 filesystem one, part of the time two Plasma desktops + > KDEPIM and Akonadi + Baloo desktop search + you name it write to like > mad. > <snip> > > Thanks, > ^ permalink raw reply [flat|nested] 19+ messages in thread
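As a concrete illustration of the failure mode Nikolay describes: it shows up in `btrfs fi df` as the Metadata line's used value approaching its total while unallocated space is near zero, and the usual first response is a data-chunk balance so the freed chunks can later be re-allocated as metadata. A rough sketch, using Martin's /home mount point and illustrative filter values:

# Trouble looks like Metadata "used" close to "total" with ~0 unallocated.
btrfs filesystem df /home

# Compact sparsely used *data* block groups so their space goes back to
# the unallocated pool, from which new metadata chunks can be allocated.
# Start with a low filter and only raise it if nothing is reclaimed.
btrfs balance start -dusage=10 /home
btrfs balance start -dusage=30 /home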
* Re: Healthy amount of free space? 2018-07-17 8:16 ` Nikolay Borisov @ 2018-07-17 17:54 ` Martin Steigerwald 2018-07-18 12:35 ` Austin S. Hemmelgarn 0 siblings, 1 reply; 19+ messages in thread From: Martin Steigerwald @ 2018-07-17 17:54 UTC (permalink / raw) To: Nikolay Borisov; +Cc: Wolf, linux-btrfs Nikolay Borisov - 17.07.18, 10:16: > On 17.07.2018 11:02, Martin Steigerwald wrote: > > Nikolay Borisov - 17.07.18, 09:20: > >> On 16.07.2018 23:58, Wolf wrote: > >>> Greetings, > >>> I would like to ask what what is healthy amount of free space to > >>> keep on each device for btrfs to be happy? > >>> > >>> This is how my disk array currently looks like > >>> > >>> [root@dennas ~]# btrfs fi usage /raid > >>> > >>> Overall: > >>> Device size: 29.11TiB > >>> Device allocated: 21.26TiB > >>> Device unallocated: 7.85TiB > >>> Device missing: 0.00B > >>> Used: 21.18TiB > >>> Free (estimated): 3.96TiB (min: 3.96TiB) > >>> Data ratio: 2.00 > >>> Metadata ratio: 2.00 > >>> Global reserve: 512.00MiB (used: 0.00B) > > > > […] > > > >>> Btrfs does quite good job of evenly using space on all devices. > >>> No, > >>> how low can I let that go? In other words, with how much space > >>> free/unallocated remaining space should I consider adding new > >>> disk? > >> > >> Btrfs will start running into problems when you run out of > >> unallocated space. So the best advice will be monitor your device > >> unallocated, once it gets really low - like 2-3 gb I will suggest > >> you run balance which will try to free up unallocated space by > >> rewriting data more compactly into sparsely populated block > >> groups. If after running balance you haven't really freed any > >> space then you should consider adding a new drive and running > >> balance to even out the spread of data/metadata. > > > > What are these issues exactly? > > For example if you have plenty of data space but your metadata is full > then you will be getting ENOSPC. Of that one I am aware. This just did not happen so far. I did not yet add it explicitly to the training slides, but I just make myself a note to do that. Anything else? > > I have > > > > % btrfs fi us -T /home > > > > Overall: > > Device size: 340.00GiB > > Device allocated: 340.00GiB > > Device unallocated: 2.00MiB > > Device missing: 0.00B > > Used: 308.37GiB > > Free (estimated): 14.65GiB (min: 14.65GiB) > > Data ratio: 2.00 > > Metadata ratio: 2.00 > > Global reserve: 512.00MiB (used: 0.00B) > > > > Data Metadata System > > > > Id Path RAID1 RAID1 RAID1 Unallocated > > -- ---------------------- --------- -------- -------- ----------- > > > > 1 /dev/mapper/msata-home 165.89GiB 4.08GiB 32.00MiB 1.00MiB > > 2 /dev/mapper/sata-home 165.89GiB 4.08GiB 32.00MiB 1.00MiB > > > > -- ---------------------- --------- -------- -------- ----------- > > > > Total 165.89GiB 4.08GiB 32.00MiB 2.00MiB > > Used 151.24GiB 2.95GiB 48.00KiB > > You already have only 33% of your metadata full so if your workload > turned out to actually be making more metadata-heavy changed i.e > snapshots you could exhaust this and get ENOSPC, despite having around > 14gb of free data space. Furthermore this data space is spread around > multiple data chunks, depending on how populated they are a balance > could be able to free up unallocated space which later could be > re-purposed for metadata (again, depending on what you are doing). The filesystem above IMO is not fit for snapshots. It would fill up rather quickly, I think even when I balance metadata. 
Actually I tried this and as I remember it took at most a day until it was full. If I read above figures currently at maximum I could gain one additional GiB by balancing metadata. That would not make a huge difference. I bet I am already running this filesystem beyond recommendation, as I bet many would argue it is to full already for regular usage… I do not see the benefit of squeezing the last free space out of it just to fit in another GiB. So I still do not get the point why it would make sense to balance it at this point in time. Especially as this 1 GiB I could regain is not even needed. And I do not see the point of balancing it weekly. I would regain about 1 GiB of metadata space every now and then, but the cost would be a lot of additional I/O to the SSD. They still take it very nicely so far, but I think, right now, there is simply no point in balancing, at least not regularly, unless… there would be an performance gain. Whenever I balanced a complete filesystem with data and metadata I however saw a cross drop in performance, like doubling the boot time for example (no scientific measurement, just my personal observation). I admit I did not do this for a long time and the balancing might have gotten better during the last few years of kernel development, but I am not yet convinced of that. So is balancing this filesystem likely to improve the performance of it? And if so, why? What it could improve, I think, is allocating new data, cause BTRFS due to the balancing might have freed some chunks, so in case lots of new data is written it does not have to search inside existing chunks which are likely to fragment their free space over time. I just like to understand this better. Right now I am quite confused at what recommendations to give about balancing. I bet SLES developers had a good reason for going with weekly balancing. Right now I just don´t get it. And as you work at SUSE, I thought I just ask about it. I am aware of some earlier threads, but I did not read everything that has been discussed so far. In case there is a good summary, feel free to point me to it. I bet a page in BTRFS wiki about performance aspects would be a nice idea. I would even create one, if I still can access the wiki. > > on a RAID-1 filesystem one, part of the time two Plasma desktops + > > KDEPIM and Akonadi + Baloo desktop search + you name it write to > > like > > mad. Thanks, -- Martin ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Healthy amount of free space? 2018-07-17 17:54 ` Martin Steigerwald @ 2018-07-18 12:35 ` Austin S. Hemmelgarn 2018-07-18 13:07 ` Chris Murphy 0 siblings, 1 reply; 19+ messages in thread From: Austin S. Hemmelgarn @ 2018-07-18 12:35 UTC (permalink / raw) To: Martin Steigerwald, Nikolay Borisov; +Cc: Wolf, linux-btrfs On 2018-07-17 13:54, Martin Steigerwald wrote: > Nikolay Borisov - 17.07.18, 10:16: >> On 17.07.2018 11:02, Martin Steigerwald wrote: >>> Nikolay Borisov - 17.07.18, 09:20: >>>> On 16.07.2018 23:58, Wolf wrote: >>>>> Greetings, >>>>> I would like to ask what what is healthy amount of free space to >>>>> keep on each device for btrfs to be happy? >>>>> >>>>> This is how my disk array currently looks like >>>>> >>>>> [root@dennas ~]# btrfs fi usage /raid >>>>> >>>>> Overall: >>>>> Device size: 29.11TiB >>>>> Device allocated: 21.26TiB >>>>> Device unallocated: 7.85TiB >>>>> Device missing: 0.00B >>>>> Used: 21.18TiB >>>>> Free (estimated): 3.96TiB (min: 3.96TiB) >>>>> Data ratio: 2.00 >>>>> Metadata ratio: 2.00 >>>>> Global reserve: 512.00MiB (used: 0.00B) >>> >>> […] >>> >>>>> Btrfs does quite good job of evenly using space on all devices. >>>>> No, >>>>> how low can I let that go? In other words, with how much space >>>>> free/unallocated remaining space should I consider adding new >>>>> disk? >>>> >>>> Btrfs will start running into problems when you run out of >>>> unallocated space. So the best advice will be monitor your device >>>> unallocated, once it gets really low - like 2-3 gb I will suggest >>>> you run balance which will try to free up unallocated space by >>>> rewriting data more compactly into sparsely populated block >>>> groups. If after running balance you haven't really freed any >>>> space then you should consider adding a new drive and running >>>> balance to even out the spread of data/metadata. >>> >>> What are these issues exactly? >> >> For example if you have plenty of data space but your metadata is full >> then you will be getting ENOSPC. > > Of that one I am aware. > > This just did not happen so far. > > I did not yet add it explicitly to the training slides, but I just make > myself a note to do that. > > Anything else? If you're doing a training presentation, it may be worth mentioning that preallocation with fallocate() does not behave the same on BTRFS as it does on other filesystems. For example, the following sequence of commands: fallocate -l X ./tmp dd if=/dev/zero of=./tmp bs=1 count=X Will always work on ext4, XFS, and most other filesystems, for any value of X between zero and just below the total amount of free space on the filesystem. On BTRFS though, it will reliably fail with ENOSPC for values of X that are greater than _half_ of the total amount of free space on the filesystem (actually, greater than just short of half). In essence, preallocating space does not prevent COW semantics for the first write unless the file is marked NOCOW. ^ permalink raw reply [flat|nested] 19+ messages in thread
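Austin's example is easy to try on a scratch filesystem without risking real data. A throwaway reproduction might look like the sketch below (run as root; the image path and the 1500M size are illustrative, and bs=1M is used instead of bs=1 purely for speed). Whether the dd step actually hits ENOSPC is exactly what the rest of the thread goes on to test, so treat this as an experiment rather than settled behavior:

# Build a small scratch btrfs on a loop-mounted image file.
truncate -s 2G /tmp/btrfs-test.img
mkfs.btrfs /tmp/btrfs-test.img
mkdir -p /tmp/btrfs-test
mount -o loop /tmp/btrfs-test.img /tmp/btrfs-test

# Preallocate roughly three quarters of the space, then overwrite it.
fallocate -l 1500M /tmp/btrfs-test/tmp
dd if=/dev/zero of=/tmp/btrfs-test/tmp bs=1M count=1500

# Clean up.
umount /tmp/btrfs-test
rm /tmp/btrfs-test.img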
* Re: Healthy amount of free space? 2018-07-18 12:35 ` Austin S. Hemmelgarn @ 2018-07-18 13:07 ` Chris Murphy 2018-07-18 13:30 ` Austin S. Hemmelgarn 0 siblings, 1 reply; 19+ messages in thread From: Chris Murphy @ 2018-07-18 13:07 UTC (permalink / raw) To: Austin S. Hemmelgarn Cc: Martin Steigerwald, Nikolay Borisov, Wolf, Btrfs BTRFS On Wed, Jul 18, 2018 at 6:35 AM, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote: > If you're doing a training presentation, it may be worth mentioning that > preallocation with fallocate() does not behave the same on BTRFS as it does > on other filesystems. For example, the following sequence of commands: > > fallocate -l X ./tmp > dd if=/dev/zero of=./tmp bs=1 count=X > > Will always work on ext4, XFS, and most other filesystems, for any value of > X between zero and just below the total amount of free space on the > filesystem. On BTRFS though, it will reliably fail with ENOSPC for values > of X that are greater than _half_ of the total amount of free space on the > filesystem (actually, greater than just short of half). In essence, > preallocating space does not prevent COW semantics for the first write > unless the file is marked NOCOW. Is this a bug, or is it suboptimal behavior, or is it intentional? And then I wonder what happens with XFS COW: fallocate -l X ./tmp cp --reflink ./tmp ./tmp2 dd if=/dev/zero of=./tmp bs=1 count=X -- Chris Murphy ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Healthy amount of free space? 2018-07-18 13:07 ` Chris Murphy @ 2018-07-18 13:30 ` Austin S. Hemmelgarn 2018-07-18 17:04 ` Chris Murphy 2018-07-20 5:01 ` Andrei Borzenkov 0 siblings, 2 replies; 19+ messages in thread From: Austin S. Hemmelgarn @ 2018-07-18 13:30 UTC (permalink / raw) To: Chris Murphy; +Cc: Martin Steigerwald, Nikolay Borisov, Wolf, Btrfs BTRFS On 2018-07-18 09:07, Chris Murphy wrote: > On Wed, Jul 18, 2018 at 6:35 AM, Austin S. Hemmelgarn > <ahferroin7@gmail.com> wrote: > >> If you're doing a training presentation, it may be worth mentioning that >> preallocation with fallocate() does not behave the same on BTRFS as it does >> on other filesystems. For example, the following sequence of commands: >> >> fallocate -l X ./tmp >> dd if=/dev/zero of=./tmp bs=1 count=X >> >> Will always work on ext4, XFS, and most other filesystems, for any value of >> X between zero and just below the total amount of free space on the >> filesystem. On BTRFS though, it will reliably fail with ENOSPC for values >> of X that are greater than _half_ of the total amount of free space on the >> filesystem (actually, greater than just short of half). In essence, >> preallocating space does not prevent COW semantics for the first write >> unless the file is marked NOCOW. > > Is this a bug, or is it suboptimal behavior, or is it intentional? It's been discussed before, though I can't find the email thread right now. Pretty much, this is _technically_ not incorrect behavior, as the documentation for fallocate doesn't say that subsequent writes can't fail due to lack of space. I personally consider it a bug though because it breaks from existing behavior in a way that is avoidable and defies user expectations. There are two issues here: 1. Regions preallocated with fallocate still do COW on the first write to any given block in that region. This can be handled by either treating the first write to each block as NOCOW, or by allocating a bit of extra space and doing a rotating approach like this for writes: - Write goes into the extra space. - Once the write is done, convert the region covered by the write into a new block of extra space. - When the final block of the preallocated region is written, deallocate the extra space. 2. Preallocation does not completely account for necessary metadata space that will be needed to store the data there. This may not be necessary if the first issue is addressed properly. > > And then I wonder what happens with XFS COW: > > fallocate -l X ./tmp > cp --reflink ./tmp ./tmp2 > dd if=/dev/zero of=./tmp bs=1 count=X I'm not sure. In this particular case, this will fail on BTRFS for any X larger than just short of one third of the total free space. I would expect it to fail for any X larger than just short of half instead. ZFS gets around this by not supporting fallocate (well, kind of, if you're using glibc and call posix_fallocate, that _will_ work, but it will take forever because it works by writing out each block of space that's being allocated, which, ironically, means that that still suffers from the same issue potentially that we have). ^ permalink raw reply [flat|nested] 19+ messages in thread
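The "unless the file is marked NOCOW" escape hatch Austin mentions can be exercised from userspace today. A hedged sketch with hypothetical paths and sizes; note that +C only takes effect on a new or empty file, and that it disables checksumming (and compression) for that file:

# Create the file empty, mark it NOCOW, then preallocate it.
touch /mnt/btrfs/prealloc.dat
chattr +C /mnt/btrfs/prealloc.dat
fallocate -l 1G /mnt/btrfs/prealloc.dat

# With NOCOW set, in-place writes should land in the reserved blocks
# instead of being copied to new ones (conv=notrunc keeps the
# preallocation instead of truncating the file first).
dd if=/dev/zero of=/mnt/btrfs/prealloc.dat bs=1M count=1024 conv=notrunc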
* Re: Healthy amount of free space? 2018-07-18 13:30 ` Austin S. Hemmelgarn @ 2018-07-18 17:04 ` Chris Murphy 2018-07-18 17:06 ` Austin S. Hemmelgarn 2018-07-20 5:01 ` Andrei Borzenkov 1 sibling, 1 reply; 19+ messages in thread From: Chris Murphy @ 2018-07-18 17:04 UTC (permalink / raw) To: Austin S. Hemmelgarn Cc: Chris Murphy, Martin Steigerwald, Nikolay Borisov, Wolf, Btrfs BTRFS On Wed, Jul 18, 2018 at 7:30 AM, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote: > > I'm not sure. In this particular case, this will fail on BTRFS for any X > larger than just short of one third of the total free space. I would expect > it to fail for any X larger than just short of half instead. I'm confused. I can't get it to fail when X is 3/4 of free space. lvcreate -V 2g -T vg/thintastic -n btrfstest mkfs.btrfs -M /dev/mapper/vg-btrfstest mount /dev/mapper/vg-btrfstest /mnt/btrfs cd /mnt/btrfs fallocate -l 1500m tmp dd if=/dev/zero of=/mnt/btrfs/tmp bs=1M count=1450 Succeeds. No enospc. This is on kernel 4.17.6. Copied from terminal: [chris@f28s btrfs]$ df -h Filesystem Size Used Avail Use% Mounted on /dev/mapper/vg-btrfstest 2.0G 17M 2.0G 1% /mnt/btrfs [chris@f28s btrfs]$ sudo fallocate -l 1500m /mnt/btrfs/tmp [chris@f28s btrfs]$ filefrag -v tmp Filesystem type is: 9123683e File size of tmp is 1572864000 (384000 blocks of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 32767: 16400.. 49167: 32768: unwritten 1: 32768.. 65535: 56576.. 89343: 32768: 49168: unwritten 2: 65536.. 98303: 109824.. 142591: 32768: 89344: unwritten 3: 98304.. 131071: 163072.. 195839: 32768: 142592: unwritten 4: 131072.. 163839: 216320.. 249087: 32768: 195840: unwritten 5: 163840.. 196607: 269568.. 302335: 32768: 249088: unwritten 6: 196608.. 229375: 322816.. 355583: 32768: 302336: unwritten 7: 229376.. 262143: 376064.. 408831: 32768: 355584: unwritten 8: 262144.. 294911: 429312.. 462079: 32768: 408832: unwritten 9: 294912.. 327679: 482560.. 515327: 32768: 462080: unwritten 10: 327680.. 344063: 89344.. 105727: 16384: 515328: unwritten 11: 344064.. 360447: 142592.. 158975: 16384: 105728: unwritten 12: 360448.. 376831: 195840.. 212223: 16384: 158976: unwritten 13: 376832.. 383999: 249088.. 256255: 7168: 212224: last,unwritten,eof tmp: 14 extents found [chris@f28s btrfs]$ df -h Filesystem Size Used Avail Use% Mounted on /dev/mapper/vg-btrfstest 2.0G 1.5G 543M 74% /mnt/btrfs [chris@f28s btrfs]$ sudo dd if=/dev/zero of=/mnt/btrfs/tmp bs=1M count=1450 1450+0 records in 1450+0 records out 1520435200 bytes (1.5 GB, 1.4 GiB) copied, 13.4757 s, 113 MB/s [chris@f28s btrfs]$ df -h Filesystem Size Used Avail Use% Mounted on /dev/mapper/vg-btrfstest 2.0G 1.5G 591M 72% /mnt/btrfs [chris@f28s btrfs]$ filefrag -v tmp Filesystem type is: 9123683e File size of tmp is 1520435200 (371200 blocks of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 16383: 302336.. 318719: 16384: 1: 16384.. 32767: 355584.. 371967: 16384: 318720: 2: 32768.. 49151: 408832.. 425215: 16384: 371968: 3: 49152.. 65535: 462080.. 478463: 16384: 425216: 4: 65536.. 73727: 515328.. 523519: 8192: 478464: 5: 73728.. 86015: 3328.. 15615: 12288: 523520: 6: 86016.. 98303: 256256.. 268543: 12288: 15616: 7: 98304.. 104959: 49168.. 55823: 6656: 268544: 8: 104960.. 109047: 105728.. 109815: 4088: 55824: 9: 109048.. 113143: 158976.. 163071: 4096: 109816: 10: 113144.. 117239: 212224.. 216319: 4096: 163072: 11: 117240.. 121335: 318720.. 322815: 4096: 216320: 12: 121336.. 125431: 371968.. 376063: 4096: 322816: 13: 125432.. 
128251: 425216.. 428035: 2820: 376064: 14: 128252.. 131071: 478464.. 481283: 2820: 428036: 15: 131072.. 132409: 1460.. 2797: 1338: 481284: 16: 132410.. 165177: 322816.. 355583: 32768: 2798: 17: 165178.. 197945: 376064.. 408831: 32768: 355584: 18: 197946.. 230713: 429312.. 462079: 32768: 408832: 19: 230714.. 263481: 482560.. 515327: 32768: 462080: 20: 263482.. 296249: 16400.. 49167: 32768: 515328: 21: 296250.. 327687: 56576.. 88013: 31438: 49168: 22: 327688.. 328711: 428036.. 429059: 1024: 88014: 23: 328712.. 361479: 109824.. 142591: 32768: 429060: 24: 361480.. 371199: 88014.. 97733: 9720: 142592: last,eof tmp: 25 extents found [chris@f28s btrfs]$ *shrug* -- Chris Murphy ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Healthy amount of free space? 2018-07-18 17:04 ` Chris Murphy @ 2018-07-18 17:06 ` Austin S. Hemmelgarn 2018-07-18 17:14 ` Chris Murphy 0 siblings, 1 reply; 19+ messages in thread From: Austin S. Hemmelgarn @ 2018-07-18 17:06 UTC (permalink / raw) To: Chris Murphy; +Cc: Martin Steigerwald, Nikolay Borisov, Wolf, Btrfs BTRFS On 2018-07-18 13:04, Chris Murphy wrote: > On Wed, Jul 18, 2018 at 7:30 AM, Austin S. Hemmelgarn > <ahferroin7@gmail.com> wrote: > >> >> I'm not sure. In this particular case, this will fail on BTRFS for any X >> larger than just short of one third of the total free space. I would expect >> it to fail for any X larger than just short of half instead. > > I'm confused. I can't get it to fail when X is 3/4 of free space. > > lvcreate -V 2g -T vg/thintastic -n btrfstest > mkfs.btrfs -M /dev/mapper/vg-btrfstest > mount /dev/mapper/vg-btrfstest /mnt/btrfs > cd /mnt/btrfs > fallocate -l 1500m tmp > dd if=/dev/zero of=/mnt/btrfs/tmp bs=1M count=1450 > > Succeeds. No enospc. This is on kernel 4.17.6. Odd, I could have sworn it would fail reliably. Unless something has changed since I last tested though, doing it with X equal to the free space on the filesystem will fail. > > > Copied from terminal: > > [chris@f28s btrfs]$ df -h > Filesystem Size Used Avail Use% Mounted on > /dev/mapper/vg-btrfstest 2.0G 17M 2.0G 1% /mnt/btrfs > [chris@f28s btrfs]$ sudo fallocate -l 1500m /mnt/btrfs/tmp > [chris@f28s btrfs]$ filefrag -v tmp > Filesystem type is: 9123683e > File size of tmp is 1572864000 (384000 blocks of 4096 bytes) > ext: logical_offset: physical_offset: length: expected: flags: > 0: 0.. 32767: 16400.. 49167: 32768: unwritten > 1: 32768.. 65535: 56576.. 89343: 32768: 49168: unwritten > 2: 65536.. 98303: 109824.. 142591: 32768: 89344: unwritten > 3: 98304.. 131071: 163072.. 195839: 32768: 142592: unwritten > 4: 131072.. 163839: 216320.. 249087: 32768: 195840: unwritten > 5: 163840.. 196607: 269568.. 302335: 32768: 249088: unwritten > 6: 196608.. 229375: 322816.. 355583: 32768: 302336: unwritten > 7: 229376.. 262143: 376064.. 408831: 32768: 355584: unwritten > 8: 262144.. 294911: 429312.. 462079: 32768: 408832: unwritten > 9: 294912.. 327679: 482560.. 515327: 32768: 462080: unwritten > 10: 327680.. 344063: 89344.. 105727: 16384: 515328: unwritten > 11: 344064.. 360447: 142592.. 158975: 16384: 105728: unwritten > 12: 360448.. 376831: 195840.. 212223: 16384: 158976: unwritten > 13: 376832.. 383999: 249088.. 256255: 7168: 212224: > last,unwritten,eof > tmp: 14 extents found > [chris@f28s btrfs]$ df -h > Filesystem Size Used Avail Use% Mounted on > /dev/mapper/vg-btrfstest 2.0G 1.5G 543M 74% /mnt/btrfs > [chris@f28s btrfs]$ sudo dd if=/dev/zero of=/mnt/btrfs/tmp bs=1M count=1450 > 1450+0 records in > 1450+0 records out > 1520435200 bytes (1.5 GB, 1.4 GiB) copied, 13.4757 s, 113 MB/s > [chris@f28s btrfs]$ df -h > Filesystem Size Used Avail Use% Mounted on > /dev/mapper/vg-btrfstest 2.0G 1.5G 591M 72% /mnt/btrfs > [chris@f28s btrfs]$ filefrag -v tmp > Filesystem type is: 9123683e > File size of tmp is 1520435200 (371200 blocks of 4096 bytes) > ext: logical_offset: physical_offset: length: expected: flags: > 0: 0.. 16383: 302336.. 318719: 16384: > 1: 16384.. 32767: 355584.. 371967: 16384: 318720: > 2: 32768.. 49151: 408832.. 425215: 16384: 371968: > 3: 49152.. 65535: 462080.. 478463: 16384: 425216: > 4: 65536.. 73727: 515328.. 523519: 8192: 478464: > 5: 73728.. 86015: 3328.. 15615: 12288: 523520: > 6: 86016.. 98303: 256256.. 268543: 12288: 15616: > 7: 98304.. 104959: 49168.. 
55823: 6656: 268544: > 8: 104960.. 109047: 105728.. 109815: 4088: 55824: > 9: 109048.. 113143: 158976.. 163071: 4096: 109816: > 10: 113144.. 117239: 212224.. 216319: 4096: 163072: > 11: 117240.. 121335: 318720.. 322815: 4096: 216320: > 12: 121336.. 125431: 371968.. 376063: 4096: 322816: > 13: 125432.. 128251: 425216.. 428035: 2820: 376064: > 14: 128252.. 131071: 478464.. 481283: 2820: 428036: > 15: 131072.. 132409: 1460.. 2797: 1338: 481284: > 16: 132410.. 165177: 322816.. 355583: 32768: 2798: > 17: 165178.. 197945: 376064.. 408831: 32768: 355584: > 18: 197946.. 230713: 429312.. 462079: 32768: 408832: > 19: 230714.. 263481: 482560.. 515327: 32768: 462080: > 20: 263482.. 296249: 16400.. 49167: 32768: 515328: > 21: 296250.. 327687: 56576.. 88013: 31438: 49168: > 22: 327688.. 328711: 428036.. 429059: 1024: 88014: > 23: 328712.. 361479: 109824.. 142591: 32768: 429060: > 24: 361480.. 371199: 88014.. 97733: 9720: 142592: last,eof > tmp: 25 extents found > [chris@f28s btrfs]$ > > > *shrug* > > ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Healthy amount of free space? 2018-07-18 17:06 ` Austin S. Hemmelgarn @ 2018-07-18 17:14 ` Chris Murphy 2018-07-18 17:40 ` Chris Murphy 0 siblings, 1 reply; 19+ messages in thread From: Chris Murphy @ 2018-07-18 17:14 UTC (permalink / raw) To: Austin S. Hemmelgarn Cc: Chris Murphy, Martin Steigerwald, Nikolay Borisov, Wolf, Btrfs BTRFS On Wed, Jul 18, 2018 at 11:06 AM, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote: > On 2018-07-18 13:04, Chris Murphy wrote: >> >> On Wed, Jul 18, 2018 at 7:30 AM, Austin S. Hemmelgarn >> <ahferroin7@gmail.com> wrote: >> >>> >>> I'm not sure. In this particular case, this will fail on BTRFS for any X >>> larger than just short of one third of the total free space. I would >>> expect >>> it to fail for any X larger than just short of half instead. >> >> >> I'm confused. I can't get it to fail when X is 3/4 of free space. >> >> lvcreate -V 2g -T vg/thintastic -n btrfstest >> mkfs.btrfs -M /dev/mapper/vg-btrfstest >> mount /dev/mapper/vg-btrfstest /mnt/btrfs >> cd /mnt/btrfs >> fallocate -l 1500m tmp >> dd if=/dev/zero of=/mnt/btrfs/tmp bs=1M count=1450 >> >> Succeeds. No enospc. This is on kernel 4.17.6. > > Odd, I could have sworn it would fail reliably. Unless something has > changed since I last tested though, doing it with X equal to the free space > on the filesystem will fail. OK well X is being defined twice here so I can't tell if I'm doing this correctly. There's fallocate X and that's 75% of free space for the empty fs at the time of fallocate. And then there's dd which is 1450m which is ~2.67x the free space at the time of dd. I don't know for sure, but based on the addresses reported before and after dd for the fallocated tmp file, it looks like Btrfs is not using the originally fallocated addresses for dd. So maybe it is COWing into new blocks, but is just as quickly deallocating the fallocated blocks as it goes, and hence doesn't end up in enospc? -- Chris Murphy ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Healthy amount of free space? 2018-07-18 17:14 ` Chris Murphy @ 2018-07-18 17:40 ` Chris Murphy 2018-07-18 18:01 ` Austin S. Hemmelgarn 0 siblings, 1 reply; 19+ messages in thread From: Chris Murphy @ 2018-07-18 17:40 UTC (permalink / raw) To: Chris Murphy Cc: Austin S. Hemmelgarn, Martin Steigerwald, Nikolay Borisov, Wolf, Btrfs BTRFS On Wed, Jul 18, 2018 at 11:14 AM, Chris Murphy <lists@colorremedies.com> wrote: > I don't know for sure, but based on the addresses reported before and > after dd for the fallocated tmp file, it looks like Btrfs is not using > the originally fallocated addresses for dd. So maybe it is COWing into > new blocks, but is just as quickly deallocating the fallocated blocks > as it goes, and hence doesn't end up in enospc? Previous thread is "Problem with file system" from August 2017. And there's these reproduce steps from Austin which have fallocate coming after the dd. truncate --size=4G ./test-fs mkfs.btrfs ./test-fs mkdir ./test mount -t auto ./test-fs ./test dd if=/dev/zero of=./test/test bs=65536 count=32768 fallocate -l 2147483650 ./test/test && echo "Success!" My test Btrfs is 2G not 4G, so I'm cutting the values of dd and fallocate in half. [chris@f28s btrfs]$ sudo dd if=/dev/zero of=tmp bs=1M count=1000 1000+0 records in 1000+0 records out 1048576000 bytes (1.0 GB, 1000 MiB) copied, 7.13391 s, 147 MB/s [chris@f28s btrfs]$ sync [chris@f28s btrfs]$ df -h Filesystem Size Used Avail Use% Mounted on /dev/mapper/vg-btrfstest 2.0G 1018M 1.1G 50% /mnt/btrfs [chris@f28s btrfs]$ sudo fallocate -l 1000m tmp Succeeds. If I do it with a 1200M file for dd and fallocate 1200M over it, this fails, but I kinda expect that because there's only 1.1G free space. But maybe that's what you're saying is the bug, it shouldn't fail? -- Chris Murphy ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Healthy amount of free space? 2018-07-18 17:40 ` Chris Murphy @ 2018-07-18 18:01 ` Austin S. Hemmelgarn 2018-07-18 21:32 ` Chris Murphy 0 siblings, 1 reply; 19+ messages in thread From: Austin S. Hemmelgarn @ 2018-07-18 18:01 UTC (permalink / raw) To: Chris Murphy; +Cc: Martin Steigerwald, Nikolay Borisov, Wolf, Btrfs BTRFS On 2018-07-18 13:40, Chris Murphy wrote: > On Wed, Jul 18, 2018 at 11:14 AM, Chris Murphy <lists@colorremedies.com> wrote: > >> I don't know for sure, but based on the addresses reported before and >> after dd for the fallocated tmp file, it looks like Btrfs is not using >> the originally fallocated addresses for dd. So maybe it is COWing into >> new blocks, but is just as quickly deallocating the fallocated blocks >> as it goes, and hence doesn't end up in enospc? > > Previous thread is "Problem with file system" from August 2017. And > there's these reproduce steps from Austin which have fallocate coming > after the dd. > > truncate --size=4G ./test-fs > mkfs.btrfs ./test-fs > mkdir ./test > mount -t auto ./test-fs ./test > dd if=/dev/zero of=./test/test bs=65536 count=32768 > fallocate -l 2147483650 ./test/test && echo "Success!" > > > My test Btrfs is 2G not 4G, so I'm cutting the values of dd and > fallocate in half. > > [chris@f28s btrfs]$ sudo dd if=/dev/zero of=tmp bs=1M count=1000 > 1000+0 records in > 1000+0 records out > 1048576000 bytes (1.0 GB, 1000 MiB) copied, 7.13391 s, 147 MB/s > [chris@f28s btrfs]$ sync > [chris@f28s btrfs]$ df -h > Filesystem Size Used Avail Use% Mounted on > /dev/mapper/vg-btrfstest 2.0G 1018M 1.1G 50% /mnt/btrfs > [chris@f28s btrfs]$ sudo fallocate -l 1000m tmp > > > Succeeds. If I do it with a 1200M file for dd and fallocate 1200M over > it, this fails, but I kinda expect that because there's only 1.1G free > space. But maybe that's what you're saying is the bug, it shouldn't > fail? Yes, you're right, I had things backwards (well, kind of, this does work on ext4 and regular XFS, so it arguably should work here). ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Healthy amount of free space? 2018-07-18 18:01 ` Austin S. Hemmelgarn @ 2018-07-18 21:32 ` Chris Murphy 2018-07-18 21:47 ` Chris Murphy 2018-07-19 11:21 ` Austin S. Hemmelgarn 0 siblings, 2 replies; 19+ messages in thread From: Chris Murphy @ 2018-07-18 21:32 UTC (permalink / raw) To: Austin S. Hemmelgarn Cc: Chris Murphy, Martin Steigerwald, Nikolay Borisov, Wolf, Btrfs BTRFS On Wed, Jul 18, 2018 at 12:01 PM, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote: > On 2018-07-18 13:40, Chris Murphy wrote: >> >> On Wed, Jul 18, 2018 at 11:14 AM, Chris Murphy <lists@colorremedies.com> >> wrote: >> >>> I don't know for sure, but based on the addresses reported before and >>> after dd for the fallocated tmp file, it looks like Btrfs is not using >>> the originally fallocated addresses for dd. So maybe it is COWing into >>> new blocks, but is just as quickly deallocating the fallocated blocks >>> as it goes, and hence doesn't end up in enospc? >> >> >> Previous thread is "Problem with file system" from August 2017. And >> there's these reproduce steps from Austin which have fallocate coming >> after the dd. >> >> truncate --size=4G ./test-fs >> mkfs.btrfs ./test-fs >> mkdir ./test >> mount -t auto ./test-fs ./test >> dd if=/dev/zero of=./test/test bs=65536 count=32768 >> fallocate -l 2147483650 ./test/test && echo "Success!" >> >> >> My test Btrfs is 2G not 4G, so I'm cutting the values of dd and >> fallocate in half. >> >> [chris@f28s btrfs]$ sudo dd if=/dev/zero of=tmp bs=1M count=1000 >> 1000+0 records in >> 1000+0 records out >> 1048576000 bytes (1.0 GB, 1000 MiB) copied, 7.13391 s, 147 MB/s >> [chris@f28s btrfs]$ sync >> [chris@f28s btrfs]$ df -h >> Filesystem Size Used Avail Use% Mounted on >> /dev/mapper/vg-btrfstest 2.0G 1018M 1.1G 50% /mnt/btrfs >> [chris@f28s btrfs]$ sudo fallocate -l 1000m tmp >> >> >> Succeeds. If I do it with a 1200M file for dd and fallocate 1200M over >> it, this fails, but I kinda expect that because there's only 1.1G free >> space. But maybe that's what you're saying is the bug, it shouldn't >> fail? > > Yes, you're right, I had things backwards (well, kind of, this does work on > ext4 and regular XFS, so it arguably should work here). I guess I'm confused what it even means to fallocate over a file with in-use blocks unless either -d or -p options are used. And from the man page, I don't grok the distinction between -d and -p either. But based on their descriptions I'd expect they both should work without enospc. -- Chris Murphy ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Healthy amount of free space?
2018-07-18 21:32 ` Chris Murphy
@ 2018-07-18 21:47 ` Chris Murphy
0 siblings, 0 replies; 19+ messages in thread
From: Chris Murphy @ 2018-07-18 21:47 UTC (permalink / raw)
To: Chris Murphy
Cc: Austin S. Hemmelgarn, Martin Steigerwald, Nikolay Borisov, Wolf, Btrfs BTRFS
Related on XFS list.
https://www.spinics.net/lists/linux-xfs/msg20722.html
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Healthy amount of free space? 2018-07-18 21:32 ` Chris Murphy 2018-07-18 21:47 ` Chris Murphy @ 2018-07-19 11:21 ` Austin S. Hemmelgarn 1 sibling, 0 replies; 19+ messages in thread From: Austin S. Hemmelgarn @ 2018-07-19 11:21 UTC (permalink / raw) To: Chris Murphy; +Cc: Martin Steigerwald, Nikolay Borisov, Wolf, Btrfs BTRFS On 2018-07-18 17:32, Chris Murphy wrote: > On Wed, Jul 18, 2018 at 12:01 PM, Austin S. Hemmelgarn > <ahferroin7@gmail.com> wrote: >> On 2018-07-18 13:40, Chris Murphy wrote: >>> >>> On Wed, Jul 18, 2018 at 11:14 AM, Chris Murphy <lists@colorremedies.com> >>> wrote: >>> >>>> I don't know for sure, but based on the addresses reported before and >>>> after dd for the fallocated tmp file, it looks like Btrfs is not using >>>> the originally fallocated addresses for dd. So maybe it is COWing into >>>> new blocks, but is just as quickly deallocating the fallocated blocks >>>> as it goes, and hence doesn't end up in enospc? >>> >>> >>> Previous thread is "Problem with file system" from August 2017. And >>> there's these reproduce steps from Austin which have fallocate coming >>> after the dd. >>> >>> truncate --size=4G ./test-fs >>> mkfs.btrfs ./test-fs >>> mkdir ./test >>> mount -t auto ./test-fs ./test >>> dd if=/dev/zero of=./test/test bs=65536 count=32768 >>> fallocate -l 2147483650 ./test/test && echo "Success!" >>> >>> >>> My test Btrfs is 2G not 4G, so I'm cutting the values of dd and >>> fallocate in half. >>> >>> [chris@f28s btrfs]$ sudo dd if=/dev/zero of=tmp bs=1M count=1000 >>> 1000+0 records in >>> 1000+0 records out >>> 1048576000 bytes (1.0 GB, 1000 MiB) copied, 7.13391 s, 147 MB/s >>> [chris@f28s btrfs]$ sync >>> [chris@f28s btrfs]$ df -h >>> Filesystem Size Used Avail Use% Mounted on >>> /dev/mapper/vg-btrfstest 2.0G 1018M 1.1G 50% /mnt/btrfs >>> [chris@f28s btrfs]$ sudo fallocate -l 1000m tmp >>> >>> >>> Succeeds. If I do it with a 1200M file for dd and fallocate 1200M over >>> it, this fails, but I kinda expect that because there's only 1.1G free >>> space. But maybe that's what you're saying is the bug, it shouldn't >>> fail? >> >> Yes, you're right, I had things backwards (well, kind of, this does work on >> ext4 and regular XFS, so it arguably should work here). > > I guess I'm confused what it even means to fallocate over a file with > in-use blocks unless either -d or -p options are used. And from the > man page, I don't grok the distinction between -d and -p either. But > based on their descriptions I'd expect they both should work without > enospc. > Without any specific options, it forces allocation of any sparse regions in the file (that is, it gets rid of holes in the file). On BTRFS, I believe the command also forcibly unshares all the extents in the file (for the system call, there's a special flag for doing this). Additionally, you can extend a file with fallocate this way by specifying a length longer than the current size of the file, which guarantees that writes into that region will succeed, unlike truncating the file to a larger size, which just creates a hole at the end of the file to bring it up to size. As far as `-d` versus `-p`: `-p` directly translates to the option for the system call that punches a hole. It requires a length and possibly an offset, and will punch a hole at that exact location of that exact size. `-d` is a special option that's only available for the command. It tells the `fallocate` command to search the file for zero-filled regions, and punch holes there. 
Neither option should ever trigger an ENOSPC, except possibly if it has to split an extent for some reason and you are completely out of metadata space. ^ permalink raw reply [flat|nested] 19+ messages in thread
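A small illustration of the two modes on a hypothetical scratch file, for readers who have not used them:

# Explicit hole punch (-p): deallocate exactly 4 MiB starting at the
# 16 MiB offset; the file length is unchanged and the hole reads as zeros.
fallocate -p -o $((16 * 1024 * 1024)) -l $((4 * 1024 * 1024)) scratch.img

# Automatic hole digging (-d): scan the file for zero-filled ranges and
# punch holes there, the command-line-only convenience described above.
fallocate -d scratch.img

# Compare apparent size with allocated size, and look at the extent map.
ls -lh scratch.img
du -h scratch.img
filefrag -v scratch.img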
* Re: Healthy amount of free space? 2018-07-18 13:30 ` Austin S. Hemmelgarn 2018-07-18 17:04 ` Chris Murphy @ 2018-07-20 5:01 ` Andrei Borzenkov 2018-07-20 11:36 ` Austin S. Hemmelgarn 1 sibling, 1 reply; 19+ messages in thread From: Andrei Borzenkov @ 2018-07-20 5:01 UTC (permalink / raw) To: Austin S. Hemmelgarn, Chris Murphy Cc: Martin Steigerwald, Nikolay Borisov, Wolf, Btrfs BTRFS 18.07.2018 16:30, Austin S. Hemmelgarn пишет: > On 2018-07-18 09:07, Chris Murphy wrote: >> On Wed, Jul 18, 2018 at 6:35 AM, Austin S. Hemmelgarn >> <ahferroin7@gmail.com> wrote: >> >>> If you're doing a training presentation, it may be worth mentioning that >>> preallocation with fallocate() does not behave the same on BTRFS as >>> it does >>> on other filesystems. For example, the following sequence of commands: >>> >>> fallocate -l X ./tmp >>> dd if=/dev/zero of=./tmp bs=1 count=X >>> >>> Will always work on ext4, XFS, and most other filesystems, for any >>> value of >>> X between zero and just below the total amount of free space on the >>> filesystem. On BTRFS though, it will reliably fail with ENOSPC for >>> values >>> of X that are greater than _half_ of the total amount of free space >>> on the >>> filesystem (actually, greater than just short of half). In essence, >>> preallocating space does not prevent COW semantics for the first write >>> unless the file is marked NOCOW. >> >> Is this a bug, or is it suboptimal behavior, or is it intentional? > It's been discussed before, though I can't find the email thread right > now. Pretty much, this is _technically_ not incorrect behavior, as the > documentation for fallocate doesn't say that subsequent writes can't > fail due to lack of space. I personally consider it a bug though > because it breaks from existing behavior in a way that is avoidable and > defies user expectations. > > There are two issues here: > > 1. Regions preallocated with fallocate still do COW on the first write > to any given block in that region. This can be handled by either > treating the first write to each block as NOCOW, or by allocating a bit How is it possible? As long as fallocate actually allocates space, this should be checksummed which means it is no more possible to overwrite it. May be fallocate on btrfs could simply reserve space. Not sure whether it complies with fallocate specification, but as long as intention is to ensure write will not fail for the lack of space it should be adequate (to the extent it can be ensured on btrfs of course). Also hole in file returns zeros by definition which also matches fallocate behavior. > of extra space and doing a rotating approach like this for writes: > - Write goes into the extra space. > - Once the write is done, convert the region covered by the write > into a new block of extra space. > - When the final block of the preallocated region is written, > deallocate the extra space. > 2. Preallocation does not completely account for necessary metadata > space that will be needed to store the data there. This may not be > necessary if the first issue is addressed properly. >> >> And then I wonder what happens with XFS COW: >> >> fallocate -l X ./tmp >> cp --reflink ./tmp ./tmp2 >> dd if=/dev/zero of=./tmp bs=1 count=X > I'm not sure. In this particular case, this will fail on BTRFS for any > X larger than just short of one third of the total free space. I would > expect it to fail for any X larger than just short of half instead. 
> ZFS gets around this by not supporting fallocate (well, kind of, if
> you're using glibc and call posix_fallocate, that _will_ work, but it
> will take forever because it works by writing out each block of space
> that's being allocated, which, ironically, means that that still suffers
> from the same issue potentially that we have).

What happens on btrfs then? fallocate specifies that new space should be
initialized to zero, so something should still write those zeros?
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Healthy amount of free space? 2018-07-20 5:01 ` Andrei Borzenkov @ 2018-07-20 11:36 ` Austin S. Hemmelgarn 0 siblings, 0 replies; 19+ messages in thread From: Austin S. Hemmelgarn @ 2018-07-20 11:36 UTC (permalink / raw) To: Andrei Borzenkov, Chris Murphy Cc: Martin Steigerwald, Nikolay Borisov, Wolf, Btrfs BTRFS On 2018-07-20 01:01, Andrei Borzenkov wrote: > 18.07.2018 16:30, Austin S. Hemmelgarn пишет: >> On 2018-07-18 09:07, Chris Murphy wrote: >>> On Wed, Jul 18, 2018 at 6:35 AM, Austin S. Hemmelgarn >>> <ahferroin7@gmail.com> wrote: >>> >>>> If you're doing a training presentation, it may be worth mentioning that >>>> preallocation with fallocate() does not behave the same on BTRFS as >>>> it does >>>> on other filesystems. For example, the following sequence of commands: >>>> >>>> fallocate -l X ./tmp >>>> dd if=/dev/zero of=./tmp bs=1 count=X >>>> >>>> Will always work on ext4, XFS, and most other filesystems, for any >>>> value of >>>> X between zero and just below the total amount of free space on the >>>> filesystem. On BTRFS though, it will reliably fail with ENOSPC for >>>> values >>>> of X that are greater than _half_ of the total amount of free space >>>> on the >>>> filesystem (actually, greater than just short of half). In essence, >>>> preallocating space does not prevent COW semantics for the first write >>>> unless the file is marked NOCOW. >>> >>> Is this a bug, or is it suboptimal behavior, or is it intentional? >> It's been discussed before, though I can't find the email thread right >> now. Pretty much, this is _technically_ not incorrect behavior, as the >> documentation for fallocate doesn't say that subsequent writes can't >> fail due to lack of space. I personally consider it a bug though >> because it breaks from existing behavior in a way that is avoidable and >> defies user expectations. >> >> There are two issues here: >> >> 1. Regions preallocated with fallocate still do COW on the first write >> to any given block in that region. This can be handled by either >> treating the first write to each block as NOCOW, or by allocating a bit > > How is it possible? As long as fallocate actually allocates space, this > should be checksummed which means it is no more possible to overwrite > it. May be fallocate on btrfs could simply reserve space. Not sure > whether it complies with fallocate specification, but as long as > intention is to ensure write will not fail for the lack of space it > should be adequate (to the extent it can be ensured on btrfs of course). > Also hole in file returns zeros by definition which also matches > fallocate behavior. Except it doesn't _have_ to be checksummed if there's no data there, and that will always be the case for a new allocation. When I say it could be NOCOW, I'm talking specifically about the first write to each newly allocated block (that is, one either beyond the previous end of the file, or one in a region that used to be a hole). This obviously won't work for places where there are already data. > >> of extra space and doing a rotating approach like this for writes: >> - Write goes into the extra space. >> - Once the write is done, convert the region covered by the write >> into a new block of extra space. >> - When the final block of the preallocated region is written, >> deallocate the extra space. >> 2. Preallocation does not completely account for necessary metadata >> space that will be needed to store the data there. This may not be >> necessary if the first issue is addressed properly. 
>>> >>> And then I wonder what happens with XFS COW: >>> >>> fallocate -l X ./tmp >>> cp --reflink ./tmp ./tmp2 >>> dd if=/dev/zero of=./tmp bs=1 count=X >> I'm not sure. In this particular case, this will fail on BTRFS for any >> X larger than just short of one third of the total free space. I would >> expect it to fail for any X larger than just short of half instead. >> >> ZFS gets around this by not supporting fallocate (well, kind of, if >> you're using glibc and call posix_fallocate, that _will_ work, but it >> will take forever because it works by writing out each block of space >> that's being allocated, which, ironically, means that that still suffers >> from the same issue potentially that we have). > > What happens on btrfs then? fallocate specifies that new space should be > initialized to zero, so something should still write those zeros? > For new regions (places that were holes previously, or were beyond the end of the file), we create an unwritten extent, which is a region that's 'allocated', but everything reads back as zero. The problem is that we don't write into the blocks allocated for the unwritten extent at all, and only deallocate them once a write to another block finishes. In essence, we're (either explicitly or implicitly) applying COW semantics to a region that should not be COW until after the first write to each block. For the case of calling fallocate on existing data, we don't really do anything (unless the flag telling fallocate to unshare the region is passed). This is actually consistent with pretty much every other filesystem in existence, but that's because pretty much every other filesystem in existence implicitly provides the same guarantee that fallocate does for regions that already have data. This case can in theory be handled by the same looping algorithm I described above without needing the base amount of space allocated, but I wouldn't consider it important enough currently to worry about (because calling fallocate on regions with existing data is not a common practice). ^ permalink raw reply [flat|nested] 19+ messages in thread
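The "unwritten extent" state described above is visible from userspace with filefrag, which is also how it shows up in Chris's listings earlier in the thread. A quick way to see the transition on a hypothetical test file:

# Freshly preallocated space: the extents carry the "unwritten" flag.
fallocate -l 64M demo.bin
filefrag -v demo.bin

# After real data is written (conv=notrunc to keep the preallocation),
# the rewritten extents no longer show the flag.
dd if=/dev/zero of=demo.bin bs=1M count=64 conv=notrunc
filefrag -v demo.bin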
* Re: Healthy amount of free space? 2018-07-16 20:58 Healthy amount of free space? Wolf 2018-07-17 7:20 ` Nikolay Borisov @ 2018-07-17 11:46 ` Austin S. Hemmelgarn 1 sibling, 0 replies; 19+ messages in thread From: Austin S. Hemmelgarn @ 2018-07-17 11:46 UTC (permalink / raw) To: Wolf, linux-btrfs On 2018-07-16 16:58, Wolf wrote: > Greetings, > I would like to ask what what is healthy amount of free space to keep on > each device for btrfs to be happy? > > This is how my disk array currently looks like > > [root@dennas ~]# btrfs fi usage /raid > Overall: > Device size: 29.11TiB > Device allocated: 21.26TiB > Device unallocated: 7.85TiB > Device missing: 0.00B > Used: 21.18TiB > Free (estimated): 3.96TiB (min: 3.96TiB) > Data ratio: 2.00 > Metadata ratio: 2.00 > Global reserve: 512.00MiB (used: 0.00B) > > Data,RAID1: Size:10.61TiB, Used:10.58TiB > /dev/mapper/data1 1.75TiB > /dev/mapper/data2 1.75TiB > /dev/mapper/data3 856.00GiB > /dev/mapper/data4 856.00GiB > /dev/mapper/data5 1.75TiB > /dev/mapper/data6 1.75TiB > /dev/mapper/data7 6.29TiB > /dev/mapper/data8 6.29TiB > > Metadata,RAID1: Size:15.00GiB, Used:13.00GiB > /dev/mapper/data1 2.00GiB > /dev/mapper/data2 3.00GiB > /dev/mapper/data3 1.00GiB > /dev/mapper/data4 1.00GiB > /dev/mapper/data5 3.00GiB > /dev/mapper/data6 1.00GiB > /dev/mapper/data7 9.00GiB > /dev/mapper/data8 10.00GiB Slightly OT, but the distribution of metadata chunks across devices looks a bit sub-optimal here. If you can tolerate the volume being somewhat slower for a while, I'd suggest balancing these (it should get you better performance long-term). > > System,RAID1: Size:64.00MiB, Used:1.50MiB > /dev/mapper/data2 32.00MiB > /dev/mapper/data6 32.00MiB > /dev/mapper/data7 32.00MiB > /dev/mapper/data8 32.00MiB > > Unallocated: > /dev/mapper/data1 1004.52GiB > /dev/mapper/data2 1004.49GiB > /dev/mapper/data3 1006.01GiB > /dev/mapper/data4 1006.01GiB > /dev/mapper/data5 1004.52GiB > /dev/mapper/data6 1004.49GiB > /dev/mapper/data7 1005.00GiB > /dev/mapper/data8 1005.00GiB > > Btrfs does quite good job of evenly using space on all devices. No, how > low can I let that go? In other words, with how much space > free/unallocated remaining space should I consider adding new disk? Disclaimer: What I'm about to say is based on personal experience. YMMV. It depends on how you use the filesystem. Realistically, there are a couple of things I consider when trying to decide on this myself: * How quickly does the total usage increase on average, and how much can it be expected to increase in one day in the worst case scenario? This isn't really BTRFS specific, but it's worth mentioning. I usually don't let an array get close enough to full that it wouldn't be able to safely handle at least one day of the worst case increase and another 2 of average increases. In BTRFS terms, the 'safely handle' part means you should be adding about 5GB for a multi-TB array like you have, or about 1GB for a sub-TB array. * What are the typical write patterns? Do files get rewritten in-place, or are they only ever rewritten with a replace-by-rename? Are writes mostly random, or mostly sequential? Are writes mostly small or mostly large? The more towards the first possibility listed in each of those question (in-place rewrites, random access, and small writes), the more free space you should keep on the volume. * Does this volume see heavy usage of fallocate() either to preallocate space (note that this _DOES NOT WORK SANELY_ on BTRFS), or to punch holes or remove ranges from files. 
If whatever software you're using does this a lot on this volume, you want even more free space. * Do old files tend to get removed in large batches? That is, possibly hundreds or thousands of files at a time. If so, and you're running a reasonably recent (4.x series) kernel or regularly balance the volume to clean up empty chunks, you don't need quite as much free space. * How quickly can you get a new device added, and is it critical that this volume always be writable? Sounds stupid, but a lot of people don't consider this. If you can trivially get a new device added immediately, you can generally let things go a bit further than you would normally, same for if the volume being read-only can be tolerated for a while without significant issues. It's worth noting that I explicitly do not care about snapshot usage. It rarely has much impact on this other than changing how the total usage increases in a day. Evaluating all of this is of course something I can't really do for you. If I had to guess, with no other information that the allocations shown, I'd say that you're probably generically fine until you get down to about 5GB more than twice the average amount by which the total usage increases in a day. That's a rather conservative guess without any spare overhead for more than a day, and assumes you aren't using fallocate much but have an otherwise evenly mixed write/delete workload. ^ permalink raw reply [flat|nested] 19+ messages in thread
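Austin's closing guess ("about 5GB more than twice the average amount by which the total usage increases in a day") can be turned into a simple check. The sketch below applies that headroom rule to the unallocated pool, which is an assumption rather than something the thread spells out, and the daily-growth figure has to come from your own measurements:

#!/bin/sh
# Flag the array once unallocated space drops below
# (2 x average daily growth + 5 GiB), per the rule of thumb above.
MNT=/raid
DAILY_GROWTH_GIB=50            # illustrative; measure your own workload

need_gib=$((2 * DAILY_GROWTH_GIB + 5))
unalloc_bytes=$(btrfs filesystem usage -b "$MNT" \
    | awk '/Device unallocated:/ { print $3 }')
unalloc_gib=$((unalloc_bytes / 1024 / 1024 / 1024))

if [ "$unalloc_gib" -lt "$need_gib" ]; then
    echo "$MNT: ${unalloc_gib} GiB unallocated < ${need_gib} GiB headroom - time to plan a new device"
fi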