From: Robert White <rwhite@pobox.com>
To: Martin Steigerwald <Martin@lichtvoll.de>
Cc: Hugo Mills <hugo@carfax.org.uk>, linux-btrfs@vger.kernel.org
Subject: Re: BTRFS free space handling still needs more work: Hangs again
Date: Sat, 27 Dec 2014 05:49:48 -0800 [thread overview]
Message-ID: <549EB8FC.9040101@pobox.com> (raw)
In-Reply-To: <9534911.qSQhRgc3Jg@merkaba>
On 12/27/2014 05:16 AM, Martin Steigerwald wrote:
> Am Samstag, 27. Dezember 2014, 03:52:56 schrieb Robert White:
>>> My theory from watching the Windows XP defragmentation case is this:
>>>
>>> - For writing into the file BTRFS needs to actually allocate and use free
>>> space in the current tree allocation, or, as we seem to misunderstood
>>> from the words we use, it needs to fit data in
>>>
>>> Data, RAID1: total=144.98GiB, used=140.94GiB
>>>
>>> between 144,98 GiB and 140,94 GiB given that total space of this tree, or
>>> if its not a tree, but the chunks in that the tree manages, in these
>>> chunks can *not* be extended anymore.
>>
>> If your file was actually COW (and you have _not_ been taking snapshots)
>> then there is no extenting to be had. But if you are using snapper
>> (which I believe you mentioned previously) then the snapshots cause a
>> write boundary and a layer of copying. Frequently taking snapshots of a
>> COW file is self defeating. If you are going to take snapshots then you
>> might as well turn copy on write back on and, for the love of pete, stop
>> defragging things.
>
> I don´t use any snapshots on the filesystems. None, zero, zilch, nada.
>
> And as I understand it copy on write means: It has to write the new write
> requests to somewhere else. For this it needs to allocate space. Either
> withing existing chunks or in a newly allocated one.
>
> So for COW when writing to a file it will always need to allocate new space
> (although it can forget about the old space afterwards unless there isn´t a
> snapshot holding it)
It can _only_ forget about the space if absolutely _all_ of the old
extent is overwritten. So if you write 1MiB, then you go back and
overwrite 1MiB-4Kib, then you go back and write 1MiB-8KiB, you've now
got 3MiB-12KiB to represent 1MiB of data. No snapshots involved. The
worst case is quite well understood.
[...--------------] 1MiB
[...-------------] 1MiB-4KiB
[...------------] 1MiB-8KiB
BTRFS will _NOT_ reclaim the "part" of any extent. So if this kept going
it would take 250 diminishing overwrites, each 4k less than the prior:
1MiB == 250 4k blocks.
(250*(250+1))/2 = 31375 4K blocks or 125.5MiB of storage allocated and
dedicated to representing 1MiB of accessible data.
This is a worst case, of course, but it exists and it's _horrible_.
And such a file can be "burped" by doing a copy-and-rename, resulting in
returning it to a single 1MiB extent. (I don't know if a "btrfs defrag"
would have identical results, but I think it would.)
The problem is that there isn't (yet) a COW safe way to discard partial
extents. That is, there is no universally safe way (yet implemented) to
turn that first 1MiB into two extents of 1MiB-4K and one 4K extent "in
place" so there is no way (yet) to prevent this worst case.
Doing things like excessive defragging at the BTRFS level, and
defragging inside of a VM, and using certain file types can lead to
pretty awful data wastage. YMMV.
e.g. "too much tidying up and you make a mess".
I offered a pseudocode example a few days back on how this problem might
be dealt with in future, but I've not seen any feedback on it.
>
> Anyway, I got it reproduced. And am about to write a lengthy mail about.
Have fun with that lengthy email, but the devs already know about the
data waste profile of the system. They just don't have a good solution yet.
Practical use cases involving _not_ defragging and _not_ packing files,
or disabling COW and using raw image formats for VM disk storage are,
meanwhile, also well understood.
>
> It can easily be reproduced without even using Virtualbox, just by a nice
> simple fio job.
>
Yep. As I've explained twice now.
next prev parent reply other threads:[~2014-12-27 13:49 UTC|newest]
Thread overview: 59+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-12-26 13:37 BTRFS free space handling still needs more work: Hangs again Martin Steigerwald
2014-12-26 14:20 ` Martin Steigerwald
2014-12-26 14:41 ` Martin Steigerwald
2014-12-27 3:33 ` Duncan
2014-12-26 15:59 ` Martin Steigerwald
2014-12-27 4:26 ` Duncan
2014-12-26 22:48 ` Robert White
2014-12-27 5:54 ` Duncan
2014-12-27 9:01 ` Martin Steigerwald
2014-12-27 9:30 ` Hugo Mills
2014-12-27 10:54 ` Martin Steigerwald
2014-12-27 11:52 ` Robert White
2014-12-27 13:16 ` Martin Steigerwald
2014-12-27 13:49 ` Robert White [this message]
2014-12-27 14:06 ` Martin Steigerwald
2014-12-27 14:00 ` Robert White
2014-12-27 14:14 ` Martin Steigerwald
2014-12-27 14:21 ` Martin Steigerwald
2014-12-27 15:14 ` Robert White
2014-12-27 16:01 ` Martin Steigerwald
2014-12-28 0:25 ` Robert White
2014-12-28 1:01 ` Bardur Arantsson
2014-12-28 4:03 ` Robert White
2014-12-28 12:03 ` Martin Steigerwald
2014-12-28 17:04 ` Patrik Lundquist
2014-12-29 10:14 ` Martin Steigerwald
2014-12-28 12:07 ` Martin Steigerwald
2014-12-28 14:52 ` Robert White
2014-12-28 15:42 ` Martin Steigerwald
2014-12-28 15:47 ` Martin Steigerwald
2014-12-29 0:27 ` Robert White
2014-12-29 9:14 ` Martin Steigerwald
2014-12-27 16:10 ` Martin Steigerwald
2014-12-27 14:19 ` Robert White
2014-12-27 11:11 ` Martin Steigerwald
2014-12-27 12:08 ` Robert White
2014-12-27 13:55 ` Martin Steigerwald
2014-12-27 14:54 ` Robert White
2014-12-27 16:26 ` Hugo Mills
2014-12-27 17:11 ` Martin Steigerwald
2014-12-27 17:59 ` Martin Steigerwald
2014-12-28 0:06 ` Robert White
2014-12-28 11:05 ` Martin Steigerwald
2014-12-28 13:00 ` BTRFS free space handling still needs more work: Hangs again (further tests) Martin Steigerwald
2014-12-28 13:40 ` BTRFS free space handling still needs more work: Hangs again (further tests, as close as I dare) Martin Steigerwald
2014-12-28 13:56 ` BTRFS free space handling still needs more work: Hangs again (further tests, as close as I dare, current idea) Martin Steigerwald
2014-12-28 15:00 ` Martin Steigerwald
2014-12-29 9:25 ` Martin Steigerwald
2014-12-27 18:28 ` BTRFS free space handling still needs more work: Hangs again Zygo Blaxell
2014-12-27 18:40 ` Hugo Mills
2014-12-27 19:23 ` BTRFS free space handling still needs more work: Hangs again (no complete lockups, "just" tasks stuck for some time) Martin Steigerwald
2014-12-29 2:07 ` Zygo Blaxell
2014-12-29 9:32 ` Martin Steigerwald
2015-01-06 20:03 ` Zygo Blaxell
2015-01-07 19:08 ` Martin Steigerwald
2015-01-07 21:41 ` Zygo Blaxell
2015-01-08 5:45 ` Duncan
2015-01-08 10:18 ` Martin Steigerwald
2015-01-09 8:25 ` Duncan
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=549EB8FC.9040101@pobox.com \
--to=rwhite@pobox.com \
--cc=Martin@lichtvoll.de \
--cc=hugo@carfax.org.uk \
--cc=linux-btrfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).