* Healthy amount of free space?
@ 2018-07-16 20:58 Wolf
2018-07-17 7:20 ` Nikolay Borisov
2018-07-17 11:46 ` Austin S. Hemmelgarn
0 siblings, 2 replies; 19+ messages in thread
From: Wolf @ 2018-07-16 20:58 UTC (permalink / raw)
To: linux-btrfs
Greetings,
I would like to ask what a healthy amount of free space is to keep on
each device for btrfs to be happy?
This is what my disk array currently looks like:
[root@dennas ~]# btrfs fi usage /raid
Overall:
Device size: 29.11TiB
Device allocated: 21.26TiB
Device unallocated: 7.85TiB
Device missing: 0.00B
Used: 21.18TiB
Free (estimated): 3.96TiB (min: 3.96TiB)
Data ratio: 2.00
Metadata ratio: 2.00
Global reserve: 512.00MiB (used: 0.00B)
Data,RAID1: Size:10.61TiB, Used:10.58TiB
/dev/mapper/data1 1.75TiB
/dev/mapper/data2 1.75TiB
/dev/mapper/data3 856.00GiB
/dev/mapper/data4 856.00GiB
/dev/mapper/data5 1.75TiB
/dev/mapper/data6 1.75TiB
/dev/mapper/data7 6.29TiB
/dev/mapper/data8 6.29TiB
Metadata,RAID1: Size:15.00GiB, Used:13.00GiB
/dev/mapper/data1 2.00GiB
/dev/mapper/data2 3.00GiB
/dev/mapper/data3 1.00GiB
/dev/mapper/data4 1.00GiB
/dev/mapper/data5 3.00GiB
/dev/mapper/data6 1.00GiB
/dev/mapper/data7 9.00GiB
/dev/mapper/data8 10.00GiB
System,RAID1: Size:64.00MiB, Used:1.50MiB
/dev/mapper/data2 32.00MiB
/dev/mapper/data6 32.00MiB
/dev/mapper/data7 32.00MiB
/dev/mapper/data8 32.00MiB
Unallocated:
/dev/mapper/data1 1004.52GiB
/dev/mapper/data2 1004.49GiB
/dev/mapper/data3 1006.01GiB
/dev/mapper/data4 1006.01GiB
/dev/mapper/data5 1004.52GiB
/dev/mapper/data6 1004.49GiB
/dev/mapper/data7 1005.00GiB
/dev/mapper/data8 1005.00GiB
Btrfs does quite a good job of evenly using space on all devices. Now, how
low can I let that go? In other words, at how much remaining free/unallocated
space should I consider adding a new disk?
Thanks for the advice :)
W.
--
There are only two hard things in Computer Science:
cache invalidation, naming things and off-by-one errors.
^ permalink raw reply [flat|nested] 19+ messages in thread* Re: Healthy amount of free space? 2018-07-16 20:58 Healthy amount of free space? Wolf @ 2018-07-17 7:20 ` Nikolay Borisov 2018-07-17 8:02 ` Martin Steigerwald 2018-07-17 11:46 ` Austin S. Hemmelgarn 1 sibling, 1 reply; 19+ messages in thread From: Nikolay Borisov @ 2018-07-17 7:20 UTC (permalink / raw) To: Wolf, linux-btrfs On 16.07.2018 23:58, Wolf wrote: > Greetings, > I would like to ask what what is healthy amount of free space to keep on > each device for btrfs to be happy? > > This is how my disk array currently looks like > > [root@dennas ~]# btrfs fi usage /raid > Overall: > Device size: 29.11TiB > Device allocated: 21.26TiB > Device unallocated: 7.85TiB > Device missing: 0.00B > Used: 21.18TiB > Free (estimated): 3.96TiB (min: 3.96TiB) > Data ratio: 2.00 > Metadata ratio: 2.00 > Global reserve: 512.00MiB (used: 0.00B) > > Data,RAID1: Size:10.61TiB, Used:10.58TiB > /dev/mapper/data1 1.75TiB > /dev/mapper/data2 1.75TiB > /dev/mapper/data3 856.00GiB > /dev/mapper/data4 856.00GiB > /dev/mapper/data5 1.75TiB > /dev/mapper/data6 1.75TiB > /dev/mapper/data7 6.29TiB > /dev/mapper/data8 6.29TiB > > Metadata,RAID1: Size:15.00GiB, Used:13.00GiB > /dev/mapper/data1 2.00GiB > /dev/mapper/data2 3.00GiB > /dev/mapper/data3 1.00GiB > /dev/mapper/data4 1.00GiB > /dev/mapper/data5 3.00GiB > /dev/mapper/data6 1.00GiB > /dev/mapper/data7 9.00GiB > /dev/mapper/data8 10.00GiB > > System,RAID1: Size:64.00MiB, Used:1.50MiB > /dev/mapper/data2 32.00MiB > /dev/mapper/data6 32.00MiB > /dev/mapper/data7 32.00MiB > /dev/mapper/data8 32.00MiB > > Unallocated: > /dev/mapper/data1 1004.52GiB > /dev/mapper/data2 1004.49GiB > /dev/mapper/data3 1006.01GiB > /dev/mapper/data4 1006.01GiB > /dev/mapper/data5 1004.52GiB > /dev/mapper/data6 1004.49GiB > /dev/mapper/data7 1005.00GiB > /dev/mapper/data8 1005.00GiB > > Btrfs does quite good job of evenly using space on all devices. No, how > low can I let that go? In other words, with how much space > free/unallocated remaining space should I consider adding new disk? Btrfs will start running into problems when you run out of unallocated space. So the best advice will be monitor your device unallocated, once it gets really low - like 2-3 gb I will suggest you run balance which will try to free up unallocated space by rewriting data more compactly into sparsely populated block groups. If after running balance you haven't really freed any space then you should consider adding a new drive and running balance to even out the spread of data/metadata. > > Thanks for advice :) > > W. > ^ permalink raw reply [flat|nested] 19+ messages in thread
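For anyone wanting to script the monitoring Nikolay describes above, a minimal sketch might look like the following, assuming btrfs-progs is installed, the array is mounted at /raid as in Wolf's output, and that a 3 GiB per-device threshold and a -dusage=50 filter are acceptable starting points (both numbers are illustrative, not recommendations from this thread):

#!/bin/sh
# Check the smallest per-device unallocated figure; if it has dropped
# below the threshold, try to reclaim space with a usage-filtered data
# balance. Needs root for the btrfs commands.
MNT=/raid
THRESHOLD_BYTES=$((3 * 1024 * 1024 * 1024))   # ~3 GiB, per Nikolay's "really low"

min_unalloc=$(btrfs device usage -b "$MNT" \
    | awk '/Unallocated:/ { print $2 }' | sort -n | head -n 1)

if [ "$min_unalloc" -lt "$THRESHOLD_BYTES" ]; then
    # Rewrite only block groups that are at most 50% full, so mostly-full
    # chunks are left alone and the freed chunks return to the unallocated pool.
    btrfs balance start -dusage=50 "$MNT"
fi

If the balance reclaims nothing, that is the point at which Nikolay suggests adding a device instead.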
* Re: Healthy amount of free space? 2018-07-17 7:20 ` Nikolay Borisov @ 2018-07-17 8:02 ` Martin Steigerwald 2018-07-17 8:16 ` Nikolay Borisov 0 siblings, 1 reply; 19+ messages in thread From: Martin Steigerwald @ 2018-07-17 8:02 UTC (permalink / raw) To: Nikolay Borisov; +Cc: Wolf, linux-btrfs Hi Nikolay. Nikolay Borisov - 17.07.18, 09:20: > On 16.07.2018 23:58, Wolf wrote: > > Greetings, > > I would like to ask what what is healthy amount of free space to > > keep on each device for btrfs to be happy? > > > > This is how my disk array currently looks like > > > > [root@dennas ~]# btrfs fi usage /raid > > > > Overall: > > Device size: 29.11TiB > > Device allocated: 21.26TiB > > Device unallocated: 7.85TiB > > Device missing: 0.00B > > Used: 21.18TiB > > Free (estimated): 3.96TiB (min: 3.96TiB) > > Data ratio: 2.00 > > Metadata ratio: 2.00 > > Global reserve: 512.00MiB (used: 0.00B) […] > > Btrfs does quite good job of evenly using space on all devices. No, > > how low can I let that go? In other words, with how much space > > free/unallocated remaining space should I consider adding new disk? > > Btrfs will start running into problems when you run out of unallocated > space. So the best advice will be monitor your device unallocated, > once it gets really low - like 2-3 gb I will suggest you run balance > which will try to free up unallocated space by rewriting data more > compactly into sparsely populated block groups. If after running > balance you haven't really freed any space then you should consider > adding a new drive and running balance to even out the spread of > data/metadata. What are these issues exactly? I have % btrfs fi us -T /home Overall: Device size: 340.00GiB Device allocated: 340.00GiB Device unallocated: 2.00MiB Device missing: 0.00B Used: 308.37GiB Free (estimated): 14.65GiB (min: 14.65GiB) Data ratio: 2.00 Metadata ratio: 2.00 Global reserve: 512.00MiB (used: 0.00B) Data Metadata System Id Path RAID1 RAID1 RAID1 Unallocated -- ---------------------- --------- -------- -------- ----------- 1 /dev/mapper/msata-home 165.89GiB 4.08GiB 32.00MiB 1.00MiB 2 /dev/mapper/sata-home 165.89GiB 4.08GiB 32.00MiB 1.00MiB -- ---------------------- --------- -------- -------- ----------- Total 165.89GiB 4.08GiB 32.00MiB 2.00MiB Used 151.24GiB 2.95GiB 48.00KiB on a RAID-1 filesystem one, part of the time two Plasma desktops + KDEPIM and Akonadi + Baloo desktop search + you name it write to like mad. Since kernel 4.5 or 4.6 this simply works. Before that sometimes BTRFS crawled to an halt on searching for free blocks, and I had to switch off the laptop uncleanly. If that happened, a balance helped for a while. But since 4.5 or 4.6 this did not happen anymore. I found with SLES 12 SP 3 or so there is btrfsmaintenance running a balance weekly. Which created an issue on our Proxmox + Ceph on Intel NUC based opensource demo lab. This is for sure no recommended configuration for Ceph and Ceph is quite slow on these 2,5 inch harddisks and 1 GBit network link, despite albeit somewhat minimal, limited to 5 GiB m.2 SSD caching. What happened it that the VM crawled to a halt and the kernel gave task hung for more than 120 seconds messages. The VM was basically unusable during the balance. Sure that should not happen with a "proper" setup, also it also did not happen without the automatic balance. Also what would happen on a hypervisor setup with several thousands of VMs with BTRFS, when several 100 of them decide to start the balance at a similar time? 
It could probably bring the underlying I/O system to a halt, as many enterprise storage systems are designed to sustain burst I/O loads, but not maximum utilization over an extended period of time.

I am really wondering what to recommend in my Linux performance tuning and analysis courses. On my own laptop I do not do regular balances so far, following the thinking: if it is not broken, do not fix it.

My personal opinion here also is: if the filesystem degrades so much that it becomes unusable without regular maintenance from user space, the filesystem needs to be fixed. Ideally I would not have to worry about whether to regularly balance a BTRFS filesystem or not. In other words: I should not have to attend a performance analysis and tuning course in order to use a computer with a BTRFS filesystem.

Thanks,
--
Martin
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Healthy amount of free space? 2018-07-17 8:02 ` Martin Steigerwald @ 2018-07-17 8:16 ` Nikolay Borisov 2018-07-17 17:54 ` Martin Steigerwald 0 siblings, 1 reply; 19+ messages in thread From: Nikolay Borisov @ 2018-07-17 8:16 UTC (permalink / raw) To: Martin Steigerwald; +Cc: Wolf, linux-btrfs On 17.07.2018 11:02, Martin Steigerwald wrote: > Hi Nikolay. > > Nikolay Borisov - 17.07.18, 09:20: >> On 16.07.2018 23:58, Wolf wrote: >>> Greetings, >>> I would like to ask what what is healthy amount of free space to >>> keep on each device for btrfs to be happy? >>> >>> This is how my disk array currently looks like >>> >>> [root@dennas ~]# btrfs fi usage /raid >>> >>> Overall: >>> Device size: 29.11TiB >>> Device allocated: 21.26TiB >>> Device unallocated: 7.85TiB >>> Device missing: 0.00B >>> Used: 21.18TiB >>> Free (estimated): 3.96TiB (min: 3.96TiB) >>> Data ratio: 2.00 >>> Metadata ratio: 2.00 >>> Global reserve: 512.00MiB (used: 0.00B) > […] >>> Btrfs does quite good job of evenly using space on all devices. No, >>> how low can I let that go? In other words, with how much space >>> free/unallocated remaining space should I consider adding new disk? >> >> Btrfs will start running into problems when you run out of unallocated >> space. So the best advice will be monitor your device unallocated, >> once it gets really low - like 2-3 gb I will suggest you run balance >> which will try to free up unallocated space by rewriting data more >> compactly into sparsely populated block groups. If after running >> balance you haven't really freed any space then you should consider >> adding a new drive and running balance to even out the spread of >> data/metadata. > > What are these issues exactly? For example if you have plenty of data space but your metadata is full then you will be getting ENOSPC. > > I have > > % btrfs fi us -T /home > Overall: > Device size: 340.00GiB > Device allocated: 340.00GiB > Device unallocated: 2.00MiB > Device missing: 0.00B > Used: 308.37GiB > Free (estimated): 14.65GiB (min: 14.65GiB) > Data ratio: 2.00 > Metadata ratio: 2.00 > Global reserve: 512.00MiB (used: 0.00B) > > Data Metadata System > Id Path RAID1 RAID1 RAID1 Unallocated > -- ---------------------- --------- -------- -------- ----------- > 1 /dev/mapper/msata-home 165.89GiB 4.08GiB 32.00MiB 1.00MiB > 2 /dev/mapper/sata-home 165.89GiB 4.08GiB 32.00MiB 1.00MiB > -- ---------------------- --------- -------- -------- ----------- > Total 165.89GiB 4.08GiB 32.00MiB 2.00MiB > Used 151.24GiB 2.95GiB 48.00KiB You already have only 33% of your metadata full so if your workload turned out to actually be making more metadata-heavy changed i.e snapshots you could exhaust this and get ENOSPC, despite having around 14gb of free data space. Furthermore this data space is spread around multiple data chunks, depending on how populated they are a balance could be able to free up unallocated space which later could be re-purposed for metadata (again, depending on what you are doing). > > on a RAID-1 filesystem one, part of the time two Plasma desktops + > KDEPIM and Akonadi + Baloo desktop search + you name it write to like > mad. > <snip> > > Thanks, > ^ permalink raw reply [flat|nested] 19+ messages in thread
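As a concrete illustration of the failure mode Nikolay describes: it shows up in `btrfs fi df` as the Metadata line's used value approaching its total while unallocated space is near zero, and the usual first response is a data-chunk balance so the freed chunks can later be re-allocated as metadata. A rough sketch, using Martin's /home mount point and illustrative filter values:

# Trouble looks like Metadata "used" close to "total" with ~0 unallocated.
btrfs filesystem df /home

# Compact sparsely used *data* block groups so their space goes back to
# the unallocated pool, from which new metadata chunks can be allocated.
# Start with a low filter and only raise it if nothing is reclaimed.
btrfs balance start -dusage=10 /home
btrfs balance start -dusage=30 /home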
* Re: Healthy amount of free space? 2018-07-17 8:16 ` Nikolay Borisov @ 2018-07-17 17:54 ` Martin Steigerwald 2018-07-18 12:35 ` Austin S. Hemmelgarn 0 siblings, 1 reply; 19+ messages in thread From: Martin Steigerwald @ 2018-07-17 17:54 UTC (permalink / raw) To: Nikolay Borisov; +Cc: Wolf, linux-btrfs Nikolay Borisov - 17.07.18, 10:16: > On 17.07.2018 11:02, Martin Steigerwald wrote: > > Nikolay Borisov - 17.07.18, 09:20: > >> On 16.07.2018 23:58, Wolf wrote: > >>> Greetings, > >>> I would like to ask what what is healthy amount of free space to > >>> keep on each device for btrfs to be happy? > >>> > >>> This is how my disk array currently looks like > >>> > >>> [root@dennas ~]# btrfs fi usage /raid > >>> > >>> Overall: > >>> Device size: 29.11TiB > >>> Device allocated: 21.26TiB > >>> Device unallocated: 7.85TiB > >>> Device missing: 0.00B > >>> Used: 21.18TiB > >>> Free (estimated): 3.96TiB (min: 3.96TiB) > >>> Data ratio: 2.00 > >>> Metadata ratio: 2.00 > >>> Global reserve: 512.00MiB (used: 0.00B) > > > > […] > > > >>> Btrfs does quite good job of evenly using space on all devices. > >>> No, > >>> how low can I let that go? In other words, with how much space > >>> free/unallocated remaining space should I consider adding new > >>> disk? > >> > >> Btrfs will start running into problems when you run out of > >> unallocated space. So the best advice will be monitor your device > >> unallocated, once it gets really low - like 2-3 gb I will suggest > >> you run balance which will try to free up unallocated space by > >> rewriting data more compactly into sparsely populated block > >> groups. If after running balance you haven't really freed any > >> space then you should consider adding a new drive and running > >> balance to even out the spread of data/metadata. > > > > What are these issues exactly? > > For example if you have plenty of data space but your metadata is full > then you will be getting ENOSPC. Of that one I am aware. This just did not happen so far. I did not yet add it explicitly to the training slides, but I just make myself a note to do that. Anything else? > > I have > > > > % btrfs fi us -T /home > > > > Overall: > > Device size: 340.00GiB > > Device allocated: 340.00GiB > > Device unallocated: 2.00MiB > > Device missing: 0.00B > > Used: 308.37GiB > > Free (estimated): 14.65GiB (min: 14.65GiB) > > Data ratio: 2.00 > > Metadata ratio: 2.00 > > Global reserve: 512.00MiB (used: 0.00B) > > > > Data Metadata System > > > > Id Path RAID1 RAID1 RAID1 Unallocated > > -- ---------------------- --------- -------- -------- ----------- > > > > 1 /dev/mapper/msata-home 165.89GiB 4.08GiB 32.00MiB 1.00MiB > > 2 /dev/mapper/sata-home 165.89GiB 4.08GiB 32.00MiB 1.00MiB > > > > -- ---------------------- --------- -------- -------- ----------- > > > > Total 165.89GiB 4.08GiB 32.00MiB 2.00MiB > > Used 151.24GiB 2.95GiB 48.00KiB > > You already have only 33% of your metadata full so if your workload > turned out to actually be making more metadata-heavy changed i.e > snapshots you could exhaust this and get ENOSPC, despite having around > 14gb of free data space. Furthermore this data space is spread around > multiple data chunks, depending on how populated they are a balance > could be able to free up unallocated space which later could be > re-purposed for metadata (again, depending on what you are doing). The filesystem above IMO is not fit for snapshots. It would fill up rather quickly, I think even when I balance metadata. 
Actually I tried this and as I remember it took at most a day until it was full. If I read above figures currently at maximum I could gain one additional GiB by balancing metadata. That would not make a huge difference. I bet I am already running this filesystem beyond recommendation, as I bet many would argue it is to full already for regular usage… I do not see the benefit of squeezing the last free space out of it just to fit in another GiB. So I still do not get the point why it would make sense to balance it at this point in time. Especially as this 1 GiB I could regain is not even needed. And I do not see the point of balancing it weekly. I would regain about 1 GiB of metadata space every now and then, but the cost would be a lot of additional I/O to the SSD. They still take it very nicely so far, but I think, right now, there is simply no point in balancing, at least not regularly, unless… there would be an performance gain. Whenever I balanced a complete filesystem with data and metadata I however saw a cross drop in performance, like doubling the boot time for example (no scientific measurement, just my personal observation). I admit I did not do this for a long time and the balancing might have gotten better during the last few years of kernel development, but I am not yet convinced of that. So is balancing this filesystem likely to improve the performance of it? And if so, why? What it could improve, I think, is allocating new data, cause BTRFS due to the balancing might have freed some chunks, so in case lots of new data is written it does not have to search inside existing chunks which are likely to fragment their free space over time. I just like to understand this better. Right now I am quite confused at what recommendations to give about balancing. I bet SLES developers had a good reason for going with weekly balancing. Right now I just don´t get it. And as you work at SUSE, I thought I just ask about it. I am aware of some earlier threads, but I did not read everything that has been discussed so far. In case there is a good summary, feel free to point me to it. I bet a page in BTRFS wiki about performance aspects would be a nice idea. I would even create one, if I still can access the wiki. > > on a RAID-1 filesystem one, part of the time two Plasma desktops + > > KDEPIM and Akonadi + Baloo desktop search + you name it write to > > like > > mad. Thanks, -- Martin ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Healthy amount of free space? 2018-07-17 17:54 ` Martin Steigerwald @ 2018-07-18 12:35 ` Austin S. Hemmelgarn 2018-07-18 13:07 ` Chris Murphy 0 siblings, 1 reply; 19+ messages in thread From: Austin S. Hemmelgarn @ 2018-07-18 12:35 UTC (permalink / raw) To: Martin Steigerwald, Nikolay Borisov; +Cc: Wolf, linux-btrfs On 2018-07-17 13:54, Martin Steigerwald wrote: > Nikolay Borisov - 17.07.18, 10:16: >> On 17.07.2018 11:02, Martin Steigerwald wrote: >>> Nikolay Borisov - 17.07.18, 09:20: >>>> On 16.07.2018 23:58, Wolf wrote: >>>>> Greetings, >>>>> I would like to ask what what is healthy amount of free space to >>>>> keep on each device for btrfs to be happy? >>>>> >>>>> This is how my disk array currently looks like >>>>> >>>>> [root@dennas ~]# btrfs fi usage /raid >>>>> >>>>> Overall: >>>>> Device size: 29.11TiB >>>>> Device allocated: 21.26TiB >>>>> Device unallocated: 7.85TiB >>>>> Device missing: 0.00B >>>>> Used: 21.18TiB >>>>> Free (estimated): 3.96TiB (min: 3.96TiB) >>>>> Data ratio: 2.00 >>>>> Metadata ratio: 2.00 >>>>> Global reserve: 512.00MiB (used: 0.00B) >>> >>> […] >>> >>>>> Btrfs does quite good job of evenly using space on all devices. >>>>> No, >>>>> how low can I let that go? In other words, with how much space >>>>> free/unallocated remaining space should I consider adding new >>>>> disk? >>>> >>>> Btrfs will start running into problems when you run out of >>>> unallocated space. So the best advice will be monitor your device >>>> unallocated, once it gets really low - like 2-3 gb I will suggest >>>> you run balance which will try to free up unallocated space by >>>> rewriting data more compactly into sparsely populated block >>>> groups. If after running balance you haven't really freed any >>>> space then you should consider adding a new drive and running >>>> balance to even out the spread of data/metadata. >>> >>> What are these issues exactly? >> >> For example if you have plenty of data space but your metadata is full >> then you will be getting ENOSPC. > > Of that one I am aware. > > This just did not happen so far. > > I did not yet add it explicitly to the training slides, but I just make > myself a note to do that. > > Anything else? If you're doing a training presentation, it may be worth mentioning that preallocation with fallocate() does not behave the same on BTRFS as it does on other filesystems. For example, the following sequence of commands: fallocate -l X ./tmp dd if=/dev/zero of=./tmp bs=1 count=X Will always work on ext4, XFS, and most other filesystems, for any value of X between zero and just below the total amount of free space on the filesystem. On BTRFS though, it will reliably fail with ENOSPC for values of X that are greater than _half_ of the total amount of free space on the filesystem (actually, greater than just short of half). In essence, preallocating space does not prevent COW semantics for the first write unless the file is marked NOCOW. ^ permalink raw reply [flat|nested] 19+ messages in thread
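Austin's example is easy to try on a scratch filesystem without risking real data. A throwaway reproduction might look like the sketch below (run as root; the image path and the 1500M size are illustrative, and bs=1M is used instead of bs=1 purely for speed). Whether the dd step actually hits ENOSPC is exactly what the rest of the thread goes on to test, so treat this as an experiment rather than settled behavior:

# Build a small scratch btrfs on a loop-mounted image file.
truncate -s 2G /tmp/btrfs-test.img
mkfs.btrfs /tmp/btrfs-test.img
mkdir -p /tmp/btrfs-test
mount -o loop /tmp/btrfs-test.img /tmp/btrfs-test

# Preallocate roughly three quarters of the space, then overwrite it.
fallocate -l 1500M /tmp/btrfs-test/tmp
dd if=/dev/zero of=/tmp/btrfs-test/tmp bs=1M count=1500

# Clean up.
umount /tmp/btrfs-test
rm /tmp/btrfs-test.img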
* Re: Healthy amount of free space? 2018-07-18 12:35 ` Austin S. Hemmelgarn @ 2018-07-18 13:07 ` Chris Murphy 2018-07-18 13:30 ` Austin S. Hemmelgarn 0 siblings, 1 reply; 19+ messages in thread From: Chris Murphy @ 2018-07-18 13:07 UTC (permalink / raw) To: Austin S. Hemmelgarn Cc: Martin Steigerwald, Nikolay Borisov, Wolf, Btrfs BTRFS On Wed, Jul 18, 2018 at 6:35 AM, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote: > If you're doing a training presentation, it may be worth mentioning that > preallocation with fallocate() does not behave the same on BTRFS as it does > on other filesystems. For example, the following sequence of commands: > > fallocate -l X ./tmp > dd if=/dev/zero of=./tmp bs=1 count=X > > Will always work on ext4, XFS, and most other filesystems, for any value of > X between zero and just below the total amount of free space on the > filesystem. On BTRFS though, it will reliably fail with ENOSPC for values > of X that are greater than _half_ of the total amount of free space on the > filesystem (actually, greater than just short of half). In essence, > preallocating space does not prevent COW semantics for the first write > unless the file is marked NOCOW. Is this a bug, or is it suboptimal behavior, or is it intentional? And then I wonder what happens with XFS COW: fallocate -l X ./tmp cp --reflink ./tmp ./tmp2 dd if=/dev/zero of=./tmp bs=1 count=X -- Chris Murphy ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Healthy amount of free space? 2018-07-18 13:07 ` Chris Murphy @ 2018-07-18 13:30 ` Austin S. Hemmelgarn 2018-07-18 17:04 ` Chris Murphy 2018-07-20 5:01 ` Andrei Borzenkov 0 siblings, 2 replies; 19+ messages in thread From: Austin S. Hemmelgarn @ 2018-07-18 13:30 UTC (permalink / raw) To: Chris Murphy; +Cc: Martin Steigerwald, Nikolay Borisov, Wolf, Btrfs BTRFS On 2018-07-18 09:07, Chris Murphy wrote: > On Wed, Jul 18, 2018 at 6:35 AM, Austin S. Hemmelgarn > <ahferroin7@gmail.com> wrote: > >> If you're doing a training presentation, it may be worth mentioning that >> preallocation with fallocate() does not behave the same on BTRFS as it does >> on other filesystems. For example, the following sequence of commands: >> >> fallocate -l X ./tmp >> dd if=/dev/zero of=./tmp bs=1 count=X >> >> Will always work on ext4, XFS, and most other filesystems, for any value of >> X between zero and just below the total amount of free space on the >> filesystem. On BTRFS though, it will reliably fail with ENOSPC for values >> of X that are greater than _half_ of the total amount of free space on the >> filesystem (actually, greater than just short of half). In essence, >> preallocating space does not prevent COW semantics for the first write >> unless the file is marked NOCOW. > > Is this a bug, or is it suboptimal behavior, or is it intentional? It's been discussed before, though I can't find the email thread right now. Pretty much, this is _technically_ not incorrect behavior, as the documentation for fallocate doesn't say that subsequent writes can't fail due to lack of space. I personally consider it a bug though because it breaks from existing behavior in a way that is avoidable and defies user expectations. There are two issues here: 1. Regions preallocated with fallocate still do COW on the first write to any given block in that region. This can be handled by either treating the first write to each block as NOCOW, or by allocating a bit of extra space and doing a rotating approach like this for writes: - Write goes into the extra space. - Once the write is done, convert the region covered by the write into a new block of extra space. - When the final block of the preallocated region is written, deallocate the extra space. 2. Preallocation does not completely account for necessary metadata space that will be needed to store the data there. This may not be necessary if the first issue is addressed properly. > > And then I wonder what happens with XFS COW: > > fallocate -l X ./tmp > cp --reflink ./tmp ./tmp2 > dd if=/dev/zero of=./tmp bs=1 count=X I'm not sure. In this particular case, this will fail on BTRFS for any X larger than just short of one third of the total free space. I would expect it to fail for any X larger than just short of half instead. ZFS gets around this by not supporting fallocate (well, kind of, if you're using glibc and call posix_fallocate, that _will_ work, but it will take forever because it works by writing out each block of space that's being allocated, which, ironically, means that that still suffers from the same issue potentially that we have). ^ permalink raw reply [flat|nested] 19+ messages in thread
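The "unless the file is marked NOCOW" escape hatch Austin mentions can be exercised from userspace today. A hedged sketch with hypothetical paths and sizes; note that +C only takes effect on a new or empty file, and that it disables checksumming (and compression) for that file:

# Create the file empty, mark it NOCOW, then preallocate it.
touch /mnt/btrfs/prealloc.dat
chattr +C /mnt/btrfs/prealloc.dat
fallocate -l 1G /mnt/btrfs/prealloc.dat

# With NOCOW set, in-place writes should land in the reserved blocks
# instead of being copied to new ones (conv=notrunc keeps the
# preallocation instead of truncating the file first).
dd if=/dev/zero of=/mnt/btrfs/prealloc.dat bs=1M count=1024 conv=notrunc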
* Re: Healthy amount of free space? 2018-07-18 13:30 ` Austin S. Hemmelgarn @ 2018-07-18 17:04 ` Chris Murphy 2018-07-18 17:06 ` Austin S. Hemmelgarn 2018-07-20 5:01 ` Andrei Borzenkov 1 sibling, 1 reply; 19+ messages in thread From: Chris Murphy @ 2018-07-18 17:04 UTC (permalink / raw) To: Austin S. Hemmelgarn Cc: Chris Murphy, Martin Steigerwald, Nikolay Borisov, Wolf, Btrfs BTRFS On Wed, Jul 18, 2018 at 7:30 AM, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote: > > I'm not sure. In this particular case, this will fail on BTRFS for any X > larger than just short of one third of the total free space. I would expect > it to fail for any X larger than just short of half instead. I'm confused. I can't get it to fail when X is 3/4 of free space. lvcreate -V 2g -T vg/thintastic -n btrfstest mkfs.btrfs -M /dev/mapper/vg-btrfstest mount /dev/mapper/vg-btrfstest /mnt/btrfs cd /mnt/btrfs fallocate -l 1500m tmp dd if=/dev/zero of=/mnt/btrfs/tmp bs=1M count=1450 Succeeds. No enospc. This is on kernel 4.17.6. Copied from terminal: [chris@f28s btrfs]$ df -h Filesystem Size Used Avail Use% Mounted on /dev/mapper/vg-btrfstest 2.0G 17M 2.0G 1% /mnt/btrfs [chris@f28s btrfs]$ sudo fallocate -l 1500m /mnt/btrfs/tmp [chris@f28s btrfs]$ filefrag -v tmp Filesystem type is: 9123683e File size of tmp is 1572864000 (384000 blocks of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 32767: 16400.. 49167: 32768: unwritten 1: 32768.. 65535: 56576.. 89343: 32768: 49168: unwritten 2: 65536.. 98303: 109824.. 142591: 32768: 89344: unwritten 3: 98304.. 131071: 163072.. 195839: 32768: 142592: unwritten 4: 131072.. 163839: 216320.. 249087: 32768: 195840: unwritten 5: 163840.. 196607: 269568.. 302335: 32768: 249088: unwritten 6: 196608.. 229375: 322816.. 355583: 32768: 302336: unwritten 7: 229376.. 262143: 376064.. 408831: 32768: 355584: unwritten 8: 262144.. 294911: 429312.. 462079: 32768: 408832: unwritten 9: 294912.. 327679: 482560.. 515327: 32768: 462080: unwritten 10: 327680.. 344063: 89344.. 105727: 16384: 515328: unwritten 11: 344064.. 360447: 142592.. 158975: 16384: 105728: unwritten 12: 360448.. 376831: 195840.. 212223: 16384: 158976: unwritten 13: 376832.. 383999: 249088.. 256255: 7168: 212224: last,unwritten,eof tmp: 14 extents found [chris@f28s btrfs]$ df -h Filesystem Size Used Avail Use% Mounted on /dev/mapper/vg-btrfstest 2.0G 1.5G 543M 74% /mnt/btrfs [chris@f28s btrfs]$ sudo dd if=/dev/zero of=/mnt/btrfs/tmp bs=1M count=1450 1450+0 records in 1450+0 records out 1520435200 bytes (1.5 GB, 1.4 GiB) copied, 13.4757 s, 113 MB/s [chris@f28s btrfs]$ df -h Filesystem Size Used Avail Use% Mounted on /dev/mapper/vg-btrfstest 2.0G 1.5G 591M 72% /mnt/btrfs [chris@f28s btrfs]$ filefrag -v tmp Filesystem type is: 9123683e File size of tmp is 1520435200 (371200 blocks of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 16383: 302336.. 318719: 16384: 1: 16384.. 32767: 355584.. 371967: 16384: 318720: 2: 32768.. 49151: 408832.. 425215: 16384: 371968: 3: 49152.. 65535: 462080.. 478463: 16384: 425216: 4: 65536.. 73727: 515328.. 523519: 8192: 478464: 5: 73728.. 86015: 3328.. 15615: 12288: 523520: 6: 86016.. 98303: 256256.. 268543: 12288: 15616: 7: 98304.. 104959: 49168.. 55823: 6656: 268544: 8: 104960.. 109047: 105728.. 109815: 4088: 55824: 9: 109048.. 113143: 158976.. 163071: 4096: 109816: 10: 113144.. 117239: 212224.. 216319: 4096: 163072: 11: 117240.. 121335: 318720.. 322815: 4096: 216320: 12: 121336.. 125431: 371968.. 376063: 4096: 322816: 13: 125432.. 
128251: 425216.. 428035: 2820: 376064: 14: 128252.. 131071: 478464.. 481283: 2820: 428036: 15: 131072.. 132409: 1460.. 2797: 1338: 481284: 16: 132410.. 165177: 322816.. 355583: 32768: 2798: 17: 165178.. 197945: 376064.. 408831: 32768: 355584: 18: 197946.. 230713: 429312.. 462079: 32768: 408832: 19: 230714.. 263481: 482560.. 515327: 32768: 462080: 20: 263482.. 296249: 16400.. 49167: 32768: 515328: 21: 296250.. 327687: 56576.. 88013: 31438: 49168: 22: 327688.. 328711: 428036.. 429059: 1024: 88014: 23: 328712.. 361479: 109824.. 142591: 32768: 429060: 24: 361480.. 371199: 88014.. 97733: 9720: 142592: last,eof tmp: 25 extents found [chris@f28s btrfs]$ *shrug* -- Chris Murphy ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Healthy amount of free space? 2018-07-18 17:04 ` Chris Murphy @ 2018-07-18 17:06 ` Austin S. Hemmelgarn 2018-07-18 17:14 ` Chris Murphy 0 siblings, 1 reply; 19+ messages in thread From: Austin S. Hemmelgarn @ 2018-07-18 17:06 UTC (permalink / raw) To: Chris Murphy; +Cc: Martin Steigerwald, Nikolay Borisov, Wolf, Btrfs BTRFS On 2018-07-18 13:04, Chris Murphy wrote: > On Wed, Jul 18, 2018 at 7:30 AM, Austin S. Hemmelgarn > <ahferroin7@gmail.com> wrote: > >> >> I'm not sure. In this particular case, this will fail on BTRFS for any X >> larger than just short of one third of the total free space. I would expect >> it to fail for any X larger than just short of half instead. > > I'm confused. I can't get it to fail when X is 3/4 of free space. > > lvcreate -V 2g -T vg/thintastic -n btrfstest > mkfs.btrfs -M /dev/mapper/vg-btrfstest > mount /dev/mapper/vg-btrfstest /mnt/btrfs > cd /mnt/btrfs > fallocate -l 1500m tmp > dd if=/dev/zero of=/mnt/btrfs/tmp bs=1M count=1450 > > Succeeds. No enospc. This is on kernel 4.17.6. Odd, I could have sworn it would fail reliably. Unless something has changed since I last tested though, doing it with X equal to the free space on the filesystem will fail. > > > Copied from terminal: > > [chris@f28s btrfs]$ df -h > Filesystem Size Used Avail Use% Mounted on > /dev/mapper/vg-btrfstest 2.0G 17M 2.0G 1% /mnt/btrfs > [chris@f28s btrfs]$ sudo fallocate -l 1500m /mnt/btrfs/tmp > [chris@f28s btrfs]$ filefrag -v tmp > Filesystem type is: 9123683e > File size of tmp is 1572864000 (384000 blocks of 4096 bytes) > ext: logical_offset: physical_offset: length: expected: flags: > 0: 0.. 32767: 16400.. 49167: 32768: unwritten > 1: 32768.. 65535: 56576.. 89343: 32768: 49168: unwritten > 2: 65536.. 98303: 109824.. 142591: 32768: 89344: unwritten > 3: 98304.. 131071: 163072.. 195839: 32768: 142592: unwritten > 4: 131072.. 163839: 216320.. 249087: 32768: 195840: unwritten > 5: 163840.. 196607: 269568.. 302335: 32768: 249088: unwritten > 6: 196608.. 229375: 322816.. 355583: 32768: 302336: unwritten > 7: 229376.. 262143: 376064.. 408831: 32768: 355584: unwritten > 8: 262144.. 294911: 429312.. 462079: 32768: 408832: unwritten > 9: 294912.. 327679: 482560.. 515327: 32768: 462080: unwritten > 10: 327680.. 344063: 89344.. 105727: 16384: 515328: unwritten > 11: 344064.. 360447: 142592.. 158975: 16384: 105728: unwritten > 12: 360448.. 376831: 195840.. 212223: 16384: 158976: unwritten > 13: 376832.. 383999: 249088.. 256255: 7168: 212224: > last,unwritten,eof > tmp: 14 extents found > [chris@f28s btrfs]$ df -h > Filesystem Size Used Avail Use% Mounted on > /dev/mapper/vg-btrfstest 2.0G 1.5G 543M 74% /mnt/btrfs > [chris@f28s btrfs]$ sudo dd if=/dev/zero of=/mnt/btrfs/tmp bs=1M count=1450 > 1450+0 records in > 1450+0 records out > 1520435200 bytes (1.5 GB, 1.4 GiB) copied, 13.4757 s, 113 MB/s > [chris@f28s btrfs]$ df -h > Filesystem Size Used Avail Use% Mounted on > /dev/mapper/vg-btrfstest 2.0G 1.5G 591M 72% /mnt/btrfs > [chris@f28s btrfs]$ filefrag -v tmp > Filesystem type is: 9123683e > File size of tmp is 1520435200 (371200 blocks of 4096 bytes) > ext: logical_offset: physical_offset: length: expected: flags: > 0: 0.. 16383: 302336.. 318719: 16384: > 1: 16384.. 32767: 355584.. 371967: 16384: 318720: > 2: 32768.. 49151: 408832.. 425215: 16384: 371968: > 3: 49152.. 65535: 462080.. 478463: 16384: 425216: > 4: 65536.. 73727: 515328.. 523519: 8192: 478464: > 5: 73728.. 86015: 3328.. 15615: 12288: 523520: > 6: 86016.. 98303: 256256.. 268543: 12288: 15616: > 7: 98304.. 104959: 49168.. 
55823: 6656: 268544: > 8: 104960.. 109047: 105728.. 109815: 4088: 55824: > 9: 109048.. 113143: 158976.. 163071: 4096: 109816: > 10: 113144.. 117239: 212224.. 216319: 4096: 163072: > 11: 117240.. 121335: 318720.. 322815: 4096: 216320: > 12: 121336.. 125431: 371968.. 376063: 4096: 322816: > 13: 125432.. 128251: 425216.. 428035: 2820: 376064: > 14: 128252.. 131071: 478464.. 481283: 2820: 428036: > 15: 131072.. 132409: 1460.. 2797: 1338: 481284: > 16: 132410.. 165177: 322816.. 355583: 32768: 2798: > 17: 165178.. 197945: 376064.. 408831: 32768: 355584: > 18: 197946.. 230713: 429312.. 462079: 32768: 408832: > 19: 230714.. 263481: 482560.. 515327: 32768: 462080: > 20: 263482.. 296249: 16400.. 49167: 32768: 515328: > 21: 296250.. 327687: 56576.. 88013: 31438: 49168: > 22: 327688.. 328711: 428036.. 429059: 1024: 88014: > 23: 328712.. 361479: 109824.. 142591: 32768: 429060: > 24: 361480.. 371199: 88014.. 97733: 9720: 142592: last,eof > tmp: 25 extents found > [chris@f28s btrfs]$ > > > *shrug* > > ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Healthy amount of free space? 2018-07-18 17:06 ` Austin S. Hemmelgarn @ 2018-07-18 17:14 ` Chris Murphy 2018-07-18 17:40 ` Chris Murphy 0 siblings, 1 reply; 19+ messages in thread From: Chris Murphy @ 2018-07-18 17:14 UTC (permalink / raw) To: Austin S. Hemmelgarn Cc: Chris Murphy, Martin Steigerwald, Nikolay Borisov, Wolf, Btrfs BTRFS On Wed, Jul 18, 2018 at 11:06 AM, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote: > On 2018-07-18 13:04, Chris Murphy wrote: >> >> On Wed, Jul 18, 2018 at 7:30 AM, Austin S. Hemmelgarn >> <ahferroin7@gmail.com> wrote: >> >>> >>> I'm not sure. In this particular case, this will fail on BTRFS for any X >>> larger than just short of one third of the total free space. I would >>> expect >>> it to fail for any X larger than just short of half instead. >> >> >> I'm confused. I can't get it to fail when X is 3/4 of free space. >> >> lvcreate -V 2g -T vg/thintastic -n btrfstest >> mkfs.btrfs -M /dev/mapper/vg-btrfstest >> mount /dev/mapper/vg-btrfstest /mnt/btrfs >> cd /mnt/btrfs >> fallocate -l 1500m tmp >> dd if=/dev/zero of=/mnt/btrfs/tmp bs=1M count=1450 >> >> Succeeds. No enospc. This is on kernel 4.17.6. > > Odd, I could have sworn it would fail reliably. Unless something has > changed since I last tested though, doing it with X equal to the free space > on the filesystem will fail. OK well X is being defined twice here so I can't tell if I'm doing this correctly. There's fallocate X and that's 75% of free space for the empty fs at the time of fallocate. And then there's dd which is 1450m which is ~2.67x the free space at the time of dd. I don't know for sure, but based on the addresses reported before and after dd for the fallocated tmp file, it looks like Btrfs is not using the originally fallocated addresses for dd. So maybe it is COWing into new blocks, but is just as quickly deallocating the fallocated blocks as it goes, and hence doesn't end up in enospc? -- Chris Murphy ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Healthy amount of free space? 2018-07-18 17:14 ` Chris Murphy @ 2018-07-18 17:40 ` Chris Murphy 2018-07-18 18:01 ` Austin S. Hemmelgarn 0 siblings, 1 reply; 19+ messages in thread From: Chris Murphy @ 2018-07-18 17:40 UTC (permalink / raw) To: Chris Murphy Cc: Austin S. Hemmelgarn, Martin Steigerwald, Nikolay Borisov, Wolf, Btrfs BTRFS On Wed, Jul 18, 2018 at 11:14 AM, Chris Murphy <lists@colorremedies.com> wrote: > I don't know for sure, but based on the addresses reported before and > after dd for the fallocated tmp file, it looks like Btrfs is not using > the originally fallocated addresses for dd. So maybe it is COWing into > new blocks, but is just as quickly deallocating the fallocated blocks > as it goes, and hence doesn't end up in enospc? Previous thread is "Problem with file system" from August 2017. And there's these reproduce steps from Austin which have fallocate coming after the dd. truncate --size=4G ./test-fs mkfs.btrfs ./test-fs mkdir ./test mount -t auto ./test-fs ./test dd if=/dev/zero of=./test/test bs=65536 count=32768 fallocate -l 2147483650 ./test/test && echo "Success!" My test Btrfs is 2G not 4G, so I'm cutting the values of dd and fallocate in half. [chris@f28s btrfs]$ sudo dd if=/dev/zero of=tmp bs=1M count=1000 1000+0 records in 1000+0 records out 1048576000 bytes (1.0 GB, 1000 MiB) copied, 7.13391 s, 147 MB/s [chris@f28s btrfs]$ sync [chris@f28s btrfs]$ df -h Filesystem Size Used Avail Use% Mounted on /dev/mapper/vg-btrfstest 2.0G 1018M 1.1G 50% /mnt/btrfs [chris@f28s btrfs]$ sudo fallocate -l 1000m tmp Succeeds. If I do it with a 1200M file for dd and fallocate 1200M over it, this fails, but I kinda expect that because there's only 1.1G free space. But maybe that's what you're saying is the bug, it shouldn't fail? -- Chris Murphy ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Healthy amount of free space? 2018-07-18 17:40 ` Chris Murphy @ 2018-07-18 18:01 ` Austin S. Hemmelgarn 2018-07-18 21:32 ` Chris Murphy 0 siblings, 1 reply; 19+ messages in thread From: Austin S. Hemmelgarn @ 2018-07-18 18:01 UTC (permalink / raw) To: Chris Murphy; +Cc: Martin Steigerwald, Nikolay Borisov, Wolf, Btrfs BTRFS On 2018-07-18 13:40, Chris Murphy wrote: > On Wed, Jul 18, 2018 at 11:14 AM, Chris Murphy <lists@colorremedies.com> wrote: > >> I don't know for sure, but based on the addresses reported before and >> after dd for the fallocated tmp file, it looks like Btrfs is not using >> the originally fallocated addresses for dd. So maybe it is COWing into >> new blocks, but is just as quickly deallocating the fallocated blocks >> as it goes, and hence doesn't end up in enospc? > > Previous thread is "Problem with file system" from August 2017. And > there's these reproduce steps from Austin which have fallocate coming > after the dd. > > truncate --size=4G ./test-fs > mkfs.btrfs ./test-fs > mkdir ./test > mount -t auto ./test-fs ./test > dd if=/dev/zero of=./test/test bs=65536 count=32768 > fallocate -l 2147483650 ./test/test && echo "Success!" > > > My test Btrfs is 2G not 4G, so I'm cutting the values of dd and > fallocate in half. > > [chris@f28s btrfs]$ sudo dd if=/dev/zero of=tmp bs=1M count=1000 > 1000+0 records in > 1000+0 records out > 1048576000 bytes (1.0 GB, 1000 MiB) copied, 7.13391 s, 147 MB/s > [chris@f28s btrfs]$ sync > [chris@f28s btrfs]$ df -h > Filesystem Size Used Avail Use% Mounted on > /dev/mapper/vg-btrfstest 2.0G 1018M 1.1G 50% /mnt/btrfs > [chris@f28s btrfs]$ sudo fallocate -l 1000m tmp > > > Succeeds. If I do it with a 1200M file for dd and fallocate 1200M over > it, this fails, but I kinda expect that because there's only 1.1G free > space. But maybe that's what you're saying is the bug, it shouldn't > fail? Yes, you're right, I had things backwards (well, kind of, this does work on ext4 and regular XFS, so it arguably should work here). ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Healthy amount of free space? 2018-07-18 18:01 ` Austin S. Hemmelgarn @ 2018-07-18 21:32 ` Chris Murphy 2018-07-18 21:47 ` Chris Murphy 2018-07-19 11:21 ` Austin S. Hemmelgarn 0 siblings, 2 replies; 19+ messages in thread From: Chris Murphy @ 2018-07-18 21:32 UTC (permalink / raw) To: Austin S. Hemmelgarn Cc: Chris Murphy, Martin Steigerwald, Nikolay Borisov, Wolf, Btrfs BTRFS On Wed, Jul 18, 2018 at 12:01 PM, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote: > On 2018-07-18 13:40, Chris Murphy wrote: >> >> On Wed, Jul 18, 2018 at 11:14 AM, Chris Murphy <lists@colorremedies.com> >> wrote: >> >>> I don't know for sure, but based on the addresses reported before and >>> after dd for the fallocated tmp file, it looks like Btrfs is not using >>> the originally fallocated addresses for dd. So maybe it is COWing into >>> new blocks, but is just as quickly deallocating the fallocated blocks >>> as it goes, and hence doesn't end up in enospc? >> >> >> Previous thread is "Problem with file system" from August 2017. And >> there's these reproduce steps from Austin which have fallocate coming >> after the dd. >> >> truncate --size=4G ./test-fs >> mkfs.btrfs ./test-fs >> mkdir ./test >> mount -t auto ./test-fs ./test >> dd if=/dev/zero of=./test/test bs=65536 count=32768 >> fallocate -l 2147483650 ./test/test && echo "Success!" >> >> >> My test Btrfs is 2G not 4G, so I'm cutting the values of dd and >> fallocate in half. >> >> [chris@f28s btrfs]$ sudo dd if=/dev/zero of=tmp bs=1M count=1000 >> 1000+0 records in >> 1000+0 records out >> 1048576000 bytes (1.0 GB, 1000 MiB) copied, 7.13391 s, 147 MB/s >> [chris@f28s btrfs]$ sync >> [chris@f28s btrfs]$ df -h >> Filesystem Size Used Avail Use% Mounted on >> /dev/mapper/vg-btrfstest 2.0G 1018M 1.1G 50% /mnt/btrfs >> [chris@f28s btrfs]$ sudo fallocate -l 1000m tmp >> >> >> Succeeds. If I do it with a 1200M file for dd and fallocate 1200M over >> it, this fails, but I kinda expect that because there's only 1.1G free >> space. But maybe that's what you're saying is the bug, it shouldn't >> fail? > > Yes, you're right, I had things backwards (well, kind of, this does work on > ext4 and regular XFS, so it arguably should work here). I guess I'm confused what it even means to fallocate over a file with in-use blocks unless either -d or -p options are used. And from the man page, I don't grok the distinction between -d and -p either. But based on their descriptions I'd expect they both should work without enospc. -- Chris Murphy ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Healthy amount of free space?
2018-07-18 21:32 ` Chris Murphy
@ 2018-07-18 21:47 ` Chris Murphy
0 siblings, 0 replies; 19+ messages in thread
From: Chris Murphy @ 2018-07-18 21:47 UTC (permalink / raw)
To: Chris Murphy
Cc: Austin S. Hemmelgarn, Martin Steigerwald, Nikolay Borisov, Wolf, Btrfs BTRFS
Related on XFS list.
https://www.spinics.net/lists/linux-xfs/msg20722.html
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Healthy amount of free space? 2018-07-18 21:32 ` Chris Murphy 2018-07-18 21:47 ` Chris Murphy @ 2018-07-19 11:21 ` Austin S. Hemmelgarn 1 sibling, 0 replies; 19+ messages in thread From: Austin S. Hemmelgarn @ 2018-07-19 11:21 UTC (permalink / raw) To: Chris Murphy; +Cc: Martin Steigerwald, Nikolay Borisov, Wolf, Btrfs BTRFS On 2018-07-18 17:32, Chris Murphy wrote: > On Wed, Jul 18, 2018 at 12:01 PM, Austin S. Hemmelgarn > <ahferroin7@gmail.com> wrote: >> On 2018-07-18 13:40, Chris Murphy wrote: >>> >>> On Wed, Jul 18, 2018 at 11:14 AM, Chris Murphy <lists@colorremedies.com> >>> wrote: >>> >>>> I don't know for sure, but based on the addresses reported before and >>>> after dd for the fallocated tmp file, it looks like Btrfs is not using >>>> the originally fallocated addresses for dd. So maybe it is COWing into >>>> new blocks, but is just as quickly deallocating the fallocated blocks >>>> as it goes, and hence doesn't end up in enospc? >>> >>> >>> Previous thread is "Problem with file system" from August 2017. And >>> there's these reproduce steps from Austin which have fallocate coming >>> after the dd. >>> >>> truncate --size=4G ./test-fs >>> mkfs.btrfs ./test-fs >>> mkdir ./test >>> mount -t auto ./test-fs ./test >>> dd if=/dev/zero of=./test/test bs=65536 count=32768 >>> fallocate -l 2147483650 ./test/test && echo "Success!" >>> >>> >>> My test Btrfs is 2G not 4G, so I'm cutting the values of dd and >>> fallocate in half. >>> >>> [chris@f28s btrfs]$ sudo dd if=/dev/zero of=tmp bs=1M count=1000 >>> 1000+0 records in >>> 1000+0 records out >>> 1048576000 bytes (1.0 GB, 1000 MiB) copied, 7.13391 s, 147 MB/s >>> [chris@f28s btrfs]$ sync >>> [chris@f28s btrfs]$ df -h >>> Filesystem Size Used Avail Use% Mounted on >>> /dev/mapper/vg-btrfstest 2.0G 1018M 1.1G 50% /mnt/btrfs >>> [chris@f28s btrfs]$ sudo fallocate -l 1000m tmp >>> >>> >>> Succeeds. If I do it with a 1200M file for dd and fallocate 1200M over >>> it, this fails, but I kinda expect that because there's only 1.1G free >>> space. But maybe that's what you're saying is the bug, it shouldn't >>> fail? >> >> Yes, you're right, I had things backwards (well, kind of, this does work on >> ext4 and regular XFS, so it arguably should work here). > > I guess I'm confused what it even means to fallocate over a file with > in-use blocks unless either -d or -p options are used. And from the > man page, I don't grok the distinction between -d and -p either. But > based on their descriptions I'd expect they both should work without > enospc. > Without any specific options, it forces allocation of any sparse regions in the file (that is, it gets rid of holes in the file). On BTRFS, I believe the command also forcibly unshares all the extents in the file (for the system call, there's a special flag for doing this). Additionally, you can extend a file with fallocate this way by specifying a length longer than the current size of the file, which guarantees that writes into that region will succeed, unlike truncating the file to a larger size, which just creates a hole at the end of the file to bring it up to size. As far as `-d` versus `-p`: `-p` directly translates to the option for the system call that punches a hole. It requires a length and possibly an offset, and will punch a hole at that exact location of that exact size. `-d` is a special option that's only available for the command. It tells the `fallocate` command to search the file for zero-filled regions, and punch holes there. 
Neither option should ever trigger an ENOSPC, except possibly if it has to split an extent for some reason and you are completely out of metadata space. ^ permalink raw reply [flat|nested] 19+ messages in thread
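A small illustration of the two modes on a hypothetical scratch file, for readers who have not used them:

# Explicit hole punch (-p): deallocate exactly 4 MiB starting at the
# 16 MiB offset; the file length is unchanged and the hole reads as zeros.
fallocate -p -o $((16 * 1024 * 1024)) -l $((4 * 1024 * 1024)) scratch.img

# Automatic hole digging (-d): scan the file for zero-filled ranges and
# punch holes there, the command-line-only convenience described above.
fallocate -d scratch.img

# Compare apparent size with allocated size, and look at the extent map.
ls -lh scratch.img
du -h scratch.img
filefrag -v scratch.img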
* Re: Healthy amount of free space? 2018-07-18 13:30 ` Austin S. Hemmelgarn 2018-07-18 17:04 ` Chris Murphy @ 2018-07-20 5:01 ` Andrei Borzenkov 2018-07-20 11:36 ` Austin S. Hemmelgarn 1 sibling, 1 reply; 19+ messages in thread From: Andrei Borzenkov @ 2018-07-20 5:01 UTC (permalink / raw) To: Austin S. Hemmelgarn, Chris Murphy Cc: Martin Steigerwald, Nikolay Borisov, Wolf, Btrfs BTRFS 18.07.2018 16:30, Austin S. Hemmelgarn пишет: > On 2018-07-18 09:07, Chris Murphy wrote: >> On Wed, Jul 18, 2018 at 6:35 AM, Austin S. Hemmelgarn >> <ahferroin7@gmail.com> wrote: >> >>> If you're doing a training presentation, it may be worth mentioning that >>> preallocation with fallocate() does not behave the same on BTRFS as >>> it does >>> on other filesystems. For example, the following sequence of commands: >>> >>> fallocate -l X ./tmp >>> dd if=/dev/zero of=./tmp bs=1 count=X >>> >>> Will always work on ext4, XFS, and most other filesystems, for any >>> value of >>> X between zero and just below the total amount of free space on the >>> filesystem. On BTRFS though, it will reliably fail with ENOSPC for >>> values >>> of X that are greater than _half_ of the total amount of free space >>> on the >>> filesystem (actually, greater than just short of half). In essence, >>> preallocating space does not prevent COW semantics for the first write >>> unless the file is marked NOCOW. >> >> Is this a bug, or is it suboptimal behavior, or is it intentional? > It's been discussed before, though I can't find the email thread right > now. Pretty much, this is _technically_ not incorrect behavior, as the > documentation for fallocate doesn't say that subsequent writes can't > fail due to lack of space. I personally consider it a bug though > because it breaks from existing behavior in a way that is avoidable and > defies user expectations. > > There are two issues here: > > 1. Regions preallocated with fallocate still do COW on the first write > to any given block in that region. This can be handled by either > treating the first write to each block as NOCOW, or by allocating a bit How is it possible? As long as fallocate actually allocates space, this should be checksummed which means it is no more possible to overwrite it. May be fallocate on btrfs could simply reserve space. Not sure whether it complies with fallocate specification, but as long as intention is to ensure write will not fail for the lack of space it should be adequate (to the extent it can be ensured on btrfs of course). Also hole in file returns zeros by definition which also matches fallocate behavior. > of extra space and doing a rotating approach like this for writes: > - Write goes into the extra space. > - Once the write is done, convert the region covered by the write > into a new block of extra space. > - When the final block of the preallocated region is written, > deallocate the extra space. > 2. Preallocation does not completely account for necessary metadata > space that will be needed to store the data there. This may not be > necessary if the first issue is addressed properly. >> >> And then I wonder what happens with XFS COW: >> >> fallocate -l X ./tmp >> cp --reflink ./tmp ./tmp2 >> dd if=/dev/zero of=./tmp bs=1 count=X > I'm not sure. In this particular case, this will fail on BTRFS for any > X larger than just short of one third of the total free space. I would > expect it to fail for any X larger than just short of half instead. 
> ZFS gets around this by not supporting fallocate (well, kind of, if
> you're using glibc and call posix_fallocate, that _will_ work, but it
> will take forever because it works by writing out each block of space
> that's being allocated, which, ironically, means that that still suffers
> from the same issue potentially that we have).

What happens on btrfs then? fallocate specifies that new space should be
initialized to zero, so something should still write those zeros?
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Healthy amount of free space? 2018-07-20 5:01 ` Andrei Borzenkov @ 2018-07-20 11:36 ` Austin S. Hemmelgarn 0 siblings, 0 replies; 19+ messages in thread From: Austin S. Hemmelgarn @ 2018-07-20 11:36 UTC (permalink / raw) To: Andrei Borzenkov, Chris Murphy Cc: Martin Steigerwald, Nikolay Borisov, Wolf, Btrfs BTRFS On 2018-07-20 01:01, Andrei Borzenkov wrote: > 18.07.2018 16:30, Austin S. Hemmelgarn пишет: >> On 2018-07-18 09:07, Chris Murphy wrote: >>> On Wed, Jul 18, 2018 at 6:35 AM, Austin S. Hemmelgarn >>> <ahferroin7@gmail.com> wrote: >>> >>>> If you're doing a training presentation, it may be worth mentioning that >>>> preallocation with fallocate() does not behave the same on BTRFS as >>>> it does >>>> on other filesystems. For example, the following sequence of commands: >>>> >>>> fallocate -l X ./tmp >>>> dd if=/dev/zero of=./tmp bs=1 count=X >>>> >>>> Will always work on ext4, XFS, and most other filesystems, for any >>>> value of >>>> X between zero and just below the total amount of free space on the >>>> filesystem. On BTRFS though, it will reliably fail with ENOSPC for >>>> values >>>> of X that are greater than _half_ of the total amount of free space >>>> on the >>>> filesystem (actually, greater than just short of half). In essence, >>>> preallocating space does not prevent COW semantics for the first write >>>> unless the file is marked NOCOW. >>> >>> Is this a bug, or is it suboptimal behavior, or is it intentional? >> It's been discussed before, though I can't find the email thread right >> now. Pretty much, this is _technically_ not incorrect behavior, as the >> documentation for fallocate doesn't say that subsequent writes can't >> fail due to lack of space. I personally consider it a bug though >> because it breaks from existing behavior in a way that is avoidable and >> defies user expectations. >> >> There are two issues here: >> >> 1. Regions preallocated with fallocate still do COW on the first write >> to any given block in that region. This can be handled by either >> treating the first write to each block as NOCOW, or by allocating a bit > > How is it possible? As long as fallocate actually allocates space, this > should be checksummed which means it is no more possible to overwrite > it. May be fallocate on btrfs could simply reserve space. Not sure > whether it complies with fallocate specification, but as long as > intention is to ensure write will not fail for the lack of space it > should be adequate (to the extent it can be ensured on btrfs of course). > Also hole in file returns zeros by definition which also matches > fallocate behavior. Except it doesn't _have_ to be checksummed if there's no data there, and that will always be the case for a new allocation. When I say it could be NOCOW, I'm talking specifically about the first write to each newly allocated block (that is, one either beyond the previous end of the file, or one in a region that used to be a hole). This obviously won't work for places where there are already data. > >> of extra space and doing a rotating approach like this for writes: >> - Write goes into the extra space. >> - Once the write is done, convert the region covered by the write >> into a new block of extra space. >> - When the final block of the preallocated region is written, >> deallocate the extra space. >> 2. Preallocation does not completely account for necessary metadata >> space that will be needed to store the data there. This may not be >> necessary if the first issue is addressed properly. 
>>> >>> And then I wonder what happens with XFS COW: >>> >>> fallocate -l X ./tmp >>> cp --reflink ./tmp ./tmp2 >>> dd if=/dev/zero of=./tmp bs=1 count=X >> I'm not sure. In this particular case, this will fail on BTRFS for any >> X larger than just short of one third of the total free space. I would >> expect it to fail for any X larger than just short of half instead. >> >> ZFS gets around this by not supporting fallocate (well, kind of, if >> you're using glibc and call posix_fallocate, that _will_ work, but it >> will take forever because it works by writing out each block of space >> that's being allocated, which, ironically, means that that still suffers >> from the same issue potentially that we have). > > What happens on btrfs then? fallocate specifies that new space should be > initialized to zero, so something should still write those zeros? > For new regions (places that were holes previously, or were beyond the end of the file), we create an unwritten extent, which is a region that's 'allocated', but everything reads back as zero. The problem is that we don't write into the blocks allocated for the unwritten extent at all, and only deallocate them once a write to another block finishes. In essence, we're (either explicitly or implicitly) applying COW semantics to a region that should not be COW until after the first write to each block. For the case of calling fallocate on existing data, we don't really do anything (unless the flag telling fallocate to unshare the region is passed). This is actually consistent with pretty much every other filesystem in existence, but that's because pretty much every other filesystem in existence implicitly provides the same guarantee that fallocate does for regions that already have data. This case can in theory be handled by the same looping algorithm I described above without needing the base amount of space allocated, but I wouldn't consider it important enough currently to worry about (because calling fallocate on regions with existing data is not a common practice). ^ permalink raw reply [flat|nested] 19+ messages in thread
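The "unwritten extent" state described above is visible from userspace with filefrag, which is also how it shows up in Chris's listings earlier in the thread. A quick way to see the transition on a hypothetical test file:

# Freshly preallocated space: the extents carry the "unwritten" flag.
fallocate -l 64M demo.bin
filefrag -v demo.bin

# After real data is written (conv=notrunc to keep the preallocation),
# the rewritten extents no longer show the flag.
dd if=/dev/zero of=demo.bin bs=1M count=64 conv=notrunc
filefrag -v demo.bin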
* Re: Healthy amount of free space? 2018-07-16 20:58 Healthy amount of free space? Wolf 2018-07-17 7:20 ` Nikolay Borisov @ 2018-07-17 11:46 ` Austin S. Hemmelgarn 1 sibling, 0 replies; 19+ messages in thread From: Austin S. Hemmelgarn @ 2018-07-17 11:46 UTC (permalink / raw) To: Wolf, linux-btrfs On 2018-07-16 16:58, Wolf wrote: > Greetings, > I would like to ask what what is healthy amount of free space to keep on > each device for btrfs to be happy? > > This is how my disk array currently looks like > > [root@dennas ~]# btrfs fi usage /raid > Overall: > Device size: 29.11TiB > Device allocated: 21.26TiB > Device unallocated: 7.85TiB > Device missing: 0.00B > Used: 21.18TiB > Free (estimated): 3.96TiB (min: 3.96TiB) > Data ratio: 2.00 > Metadata ratio: 2.00 > Global reserve: 512.00MiB (used: 0.00B) > > Data,RAID1: Size:10.61TiB, Used:10.58TiB > /dev/mapper/data1 1.75TiB > /dev/mapper/data2 1.75TiB > /dev/mapper/data3 856.00GiB > /dev/mapper/data4 856.00GiB > /dev/mapper/data5 1.75TiB > /dev/mapper/data6 1.75TiB > /dev/mapper/data7 6.29TiB > /dev/mapper/data8 6.29TiB > > Metadata,RAID1: Size:15.00GiB, Used:13.00GiB > /dev/mapper/data1 2.00GiB > /dev/mapper/data2 3.00GiB > /dev/mapper/data3 1.00GiB > /dev/mapper/data4 1.00GiB > /dev/mapper/data5 3.00GiB > /dev/mapper/data6 1.00GiB > /dev/mapper/data7 9.00GiB > /dev/mapper/data8 10.00GiB Slightly OT, but the distribution of metadata chunks across devices looks a bit sub-optimal here. If you can tolerate the volume being somewhat slower for a while, I'd suggest balancing these (it should get you better performance long-term). > > System,RAID1: Size:64.00MiB, Used:1.50MiB > /dev/mapper/data2 32.00MiB > /dev/mapper/data6 32.00MiB > /dev/mapper/data7 32.00MiB > /dev/mapper/data8 32.00MiB > > Unallocated: > /dev/mapper/data1 1004.52GiB > /dev/mapper/data2 1004.49GiB > /dev/mapper/data3 1006.01GiB > /dev/mapper/data4 1006.01GiB > /dev/mapper/data5 1004.52GiB > /dev/mapper/data6 1004.49GiB > /dev/mapper/data7 1005.00GiB > /dev/mapper/data8 1005.00GiB > > Btrfs does quite good job of evenly using space on all devices. No, how > low can I let that go? In other words, with how much space > free/unallocated remaining space should I consider adding new disk? Disclaimer: What I'm about to say is based on personal experience. YMMV. It depends on how you use the filesystem. Realistically, there are a couple of things I consider when trying to decide on this myself: * How quickly does the total usage increase on average, and how much can it be expected to increase in one day in the worst case scenario? This isn't really BTRFS specific, but it's worth mentioning. I usually don't let an array get close enough to full that it wouldn't be able to safely handle at least one day of the worst case increase and another 2 of average increases. In BTRFS terms, the 'safely handle' part means you should be adding about 5GB for a multi-TB array like you have, or about 1GB for a sub-TB array. * What are the typical write patterns? Do files get rewritten in-place, or are they only ever rewritten with a replace-by-rename? Are writes mostly random, or mostly sequential? Are writes mostly small or mostly large? The more towards the first possibility listed in each of those question (in-place rewrites, random access, and small writes), the more free space you should keep on the volume. * Does this volume see heavy usage of fallocate() either to preallocate space (note that this _DOES NOT WORK SANELY_ on BTRFS), or to punch holes or remove ranges from files. 
If whatever software you're using does this a lot on this volume, you want even more free space. * Do old files tend to get removed in large batches? That is, possibly hundreds or thousands of files at a time. If so, and you're running a reasonably recent (4.x series) kernel or regularly balance the volume to clean up empty chunks, you don't need quite as much free space. * How quickly can you get a new device added, and is it critical that this volume always be writable? Sounds stupid, but a lot of people don't consider this. If you can trivially get a new device added immediately, you can generally let things go a bit further than you would normally, same for if the volume being read-only can be tolerated for a while without significant issues. It's worth noting that I explicitly do not care about snapshot usage. It rarely has much impact on this other than changing how the total usage increases in a day. Evaluating all of this is of course something I can't really do for you. If I had to guess, with no other information that the allocations shown, I'd say that you're probably generically fine until you get down to about 5GB more than twice the average amount by which the total usage increases in a day. That's a rather conservative guess without any spare overhead for more than a day, and assumes you aren't using fallocate much but have an otherwise evenly mixed write/delete workload. ^ permalink raw reply [flat|nested] 19+ messages in thread
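Austin's closing guess ("about 5GB more than twice the average amount by which the total usage increases in a day") can be turned into a simple check. The sketch below applies that headroom rule to the unallocated pool, which is an assumption rather than something the thread spells out, and the daily-growth figure has to come from your own measurements:

#!/bin/sh
# Flag the array once unallocated space drops below
# (2 x average daily growth + 5 GiB), per the rule of thumb above.
MNT=/raid
DAILY_GROWTH_GIB=50            # illustrative; measure your own workload

need_gib=$((2 * DAILY_GROWTH_GIB + 5))
unalloc_bytes=$(btrfs filesystem usage -b "$MNT" \
    | awk '/Device unallocated:/ { print $3 }')
unalloc_gib=$((unalloc_bytes / 1024 / 1024 / 1024))

if [ "$unalloc_gib" -lt "$need_gib" ]; then
    echo "$MNT: ${unalloc_gib} GiB unallocated < ${need_gib} GiB headroom - time to plan a new device"
fi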