Re: BTRFS free space handling still needs more work: Hangs again

From: Robert White <rwhite@pobox.com>
To: Martin Steigerwald <Martin@lichtvoll.de>
Cc: Hugo Mills <hugo@carfax.org.uk>, linux-btrfs@vger.kernel.org
Subject: Re: BTRFS free space handling still needs more work: Hangs again
Date: Sat, 27 Dec 2014 05:49:48 -0800
Message-ID: <549EB8FC.9040101@pobox.com>
In-Reply-To: <9534911.qSQhRgc3Jg@merkaba>

On 12/27/2014 05:16 AM, Martin Steigerwald wrote:
> On Saturday, 27 December 2014, 03:52:56, Robert White wrote:
>>> My theory from watching the Windows XP defragmentation case is this:
>>>
>>> - For writing into the file BTRFS needs to actually allocate and use free
>>> space in the current tree allocation, or, as we seem to have misunderstood
>>> from the words we use, it needs to fit data in
>>>
>>> Data, RAID1: total=144.98GiB, used=140.94GiB
>>>
>>> between 144.98 GiB and 140.94 GiB, given that the total space of this tree
>>> (or, if it's not a tree, the chunks that the tree manages) can *not* be
>>> extended anymore.
>>
>> If your file was actually COW (and you have _not_ been taking snapshots)
>> then there is no extending to be had. But if you are using snapper
>> (which I believe you mentioned previously) then the snapshots cause a
>> write boundary and a layer of copying. Frequently taking snapshots of a
>> COW file is self defeating. If you are going to take snapshots then you
>> might as well turn copy on write back on and, for the love of pete, stop
>> defragging things.
>
> I don't use any snapshots on the filesystems. None, zero, zilch, nada.
>
> And as I understand it, copy on write means: it has to write the new write
> requests to somewhere else. For this it needs to allocate space, either
> within existing chunks or in a newly allocated one.
>
> So for COW, when writing to a file it will always need to allocate new space
> (although it can forget about the old space afterwards unless there is a
> snapshot holding it)

It can _only_ forget about the space if absolutely _all_ of the old
extent is overwritten. So if you write 1MiB, then you go back and
overwrite 1MiB-4KiB, then you go back and overwrite 1MiB-8KiB, you've now
got 3MiB-12KiB allocated to represent 1MiB of data. No snapshots involved.
The worst case is quite well understood.

[...--------------] 1MiB
[...-------------]  1MiB-4KiB
[...------------]   1MiB-8KiB

BTRFS will _NOT_ reclaim "part" of an extent. So if this kept going
it would take 256 diminishing writes, each 4KiB smaller than the prior:

1MiB == 256 4KiB blocks.
(256*(256+1))/2 = 32896 4KiB blocks, or 128.5MiB of storage allocated and
dedicated to representing 1MiB of accessible data.
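
The accounting above can be checked with a small simulation. This is only a sketch of the "an extent stays allocated while any block of it is still referenced" rule described in this mail, not of the real BTRFS allocator; the function names are made up for illustration, and it takes 1MiB as exactly 256 blocks of 4KiB.

```python
BLOCK_KIB = 4  # assumed block size in KiB


def allocated_blocks(write_lengths):
    """Blocks still allocated after a series of writes that all start at
    offset 0, where each write creates one new extent and an old extent
    is only freed once none of its blocks is referenced any more."""
    owner = {}  # logical block index -> id of the extent currently backing it
    size = {}   # extent id -> extent length in blocks
    for extent_id, nblocks in enumerate(write_lengths):
        size[extent_id] = nblocks
        for block in range(nblocks):
            owner[block] = extent_id
    live_extents = set(owner.values())
    return sum(size[e] for e in live_extents)


def worst_case_blocks(n):
    """Closed form for n diminishing writes of n, n-1, ..., 1 blocks."""
    return n * (n + 1) // 2


blocks_per_mib = 1024 // BLOCK_KIB

# The three-write case from the diagram: 1MiB, 1MiB-4KiB, 1MiB-8KiB.
three_writes = allocated_blocks([256, 255, 254])
print(three_writes * BLOCK_KIB, "KiB allocated")

# The full diminishing series down to a single block.
full_series = allocated_blocks(range(blocks_per_mib, 0, -1))
print(full_series, "blocks =", full_series * BLOCK_KIB / 1024, "MiB")
```

The three-write case comes out at 3060KiB, which is 3MiB-12KiB, and a write that covers the whole old extent frees it, so two full 1MiB writes leave only 1MiB allocated.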

This is a worst case, of course, but it exists and it's _horrible_.

And such a file can be "burped" by doing a copy-and-rename, resulting in 
returning it to a single 1MiB extent. (I don't know if a "btrfs defrag" 
would have identical results, but I think it would.)
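
A minimal sketch of such a copy-and-rename, assuming a plain read/write loop is acceptable. The helper name rewrite_file is hypothetical. It deliberately avoids reflink-style copies, since a reflink would share the old extents instead of writing new ones. It keeps the permission bits but not ownership or extended attributes, and how many extents the new file ends up with is up to the allocator.

```python
import os
import tempfile


def rewrite_file(path, chunk_size=1024 * 1024):
    """Copy path to a temporary file in the same directory, then rename
    the copy over the original, so the data is written out afresh."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as dst, open(path, "rb") as src:
            while True:
                chunk = src.read(chunk_size)
                if not chunk:
                    break
                dst.write(chunk)
            dst.flush()
            os.fsync(dst.fileno())  # make sure the copy is on disk first
        os.chmod(tmp_path, os.stat(path).st_mode & 0o7777)
        os.replace(tmp_path, path)  # atomic rename within one filesystem
    except BaseException:
        # Clean up the temporary copy if anything went wrong.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Once the rename has happened and nothing else references the old file, its old extents are no longer pinned by partial references and can be freed.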

The problem is that there isn't (yet) a COW-safe way to discard partial
extents. That is, there is no universally safe way (yet implemented) to
split that first 1MiB extent "in place" into two extents, one of 1MiB-4KiB
and one of 4KiB, so there is no way (yet) to prevent this worst case.

Doing things like excessive defragging at the BTRFS level, and 
defragging inside of a VM, and using certain file types can lead to 
pretty awful data wastage. YMMV.

i.e. "too much tidying up and you make a mess".

I offered a pseudocode example a few days back on how this problem might 
be dealt with in future, but I've not seen any feedback on it.

>
> Anyway, I got it reproduced. And am about to write a lengthy mail about it.

Have fun with that lengthy email, but the devs already know about the 
data waste profile of the system. They just don't have a good solution yet.

Practical use cases involving _not_ defragging and _not_ packing files, 
or disabling COW and using raw image formats for VM disk storage are, 
meanwhile, also well understood.

>
> It can easily be reproduced without even using Virtualbox, just by a nice
> simple fio job.
>

Yep. As I've explained twice now.
