From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from resqmta-ch2-02v.sys.comcast.net ([69.252.207.34]:43072
  "EHLO resqmta-ch2-02v.sys.comcast.net" rhost-flags-OK-OK-OK-OK)
  by vger.kernel.org with ESMTP id S1750748AbaL0Ntv (ORCPT );
  Sat, 27 Dec 2014 08:49:51 -0500
Message-ID: <549EB8FC.9040101@pobox.com>
Date: Sat, 27 Dec 2014 05:49:48 -0800
From: Robert White
MIME-Version: 1.0
To: Martin Steigerwald
CC: Hugo Mills, linux-btrfs@vger.kernel.org
Subject: Re: BTRFS free space handling still needs more work: Hangs again
References: <3738341.y7uRQFcLJH@merkaba> <3538352.CI4nobbHtu@merkaba>
  <549E9D98.7010102@pobox.com> <9534911.qSQhRgc3Jg@merkaba>
In-Reply-To: <9534911.qSQhRgc3Jg@merkaba>
Content-Type: text/plain; charset=windows-1252; format=flowed
Sender: linux-btrfs-owner@vger.kernel.org
List-ID:

On 12/27/2014 05:16 AM, Martin Steigerwald wrote:
> On Saturday, 27 December 2014, 03:52:56, Robert White wrote:
>>> My theory from watching the Windows XP defragmentation case is this:
>>>
>>> - For writing into the file BTRFS needs to actually allocate and use free
>>> space in the current tree allocation, or, as we seem to have misunderstood
>>> from the words we use, it needs to fit data in
>>>
>>> Data, RAID1: total=144.98GiB, used=140.94GiB
>>>
>>> between 144.98 GiB and 140.94 GiB given that total space of this tree, or
>>> if it's not a tree, but the chunks in that the tree manages, in these
>>> chunks can *not* be extended anymore.
>>
>> If your file was actually COW (and you have _not_ been taking snapshots)
>> then there is no extending to be had. But if you are using snapper
>> (which I believe you mentioned previously) then the snapshots cause a
>> write boundary and a layer of copying. Frequently taking snapshots of a
>> COW file is self defeating. If you are going to take snapshots then you
>> might as well turn copy on write back on and, for the love of pete, stop
>> defragging things.
>
> I don't use any snapshots on the filesystems.
> None, zero, zilch, nada.
>
> And as I understand it copy on write means: It has to write the new write
> requests to somewhere else. For this it needs to allocate space. Either
> within existing chunks or in a newly allocated one.
>
> So for COW when writing to a file it will always need to allocate new space
> (although it can forget about the old space afterwards unless there is a
> snapshot holding it)

It can _only_ forget about the space if absolutely _all_ of the old
extent is overwritten. So if you write 1MiB, then you go back and
overwrite 1MiB-4KiB, then you go back and overwrite 1MiB-8KiB, you've
now got 3MiB-12KiB allocated to represent 1MiB of data. No snapshots
involved. The worst case is quite well understood.

[...--------------] 1MiB
[...-------------] 1MiB-4KiB
[...------------] 1MiB-8KiB

BTRFS will _NOT_ reclaim "part" of an extent. So if this kept going it
would take 256 diminishing writes, each 4KiB less than the prior:

1MiB == 256 4KiB blocks.

(256*(256+1))/2 = 32896 4KiB blocks, or 128.5MiB of storage allocated and
dedicated to representing 1MiB of accessible data.

This is a worst case, of course, but it exists and it's _horrible_.

And such a file can be "burped" by doing a copy-and-rename, which
returns it to a single 1MiB extent. (I don't know if a "btrfs defrag"
would have identical results, but I think it would.)

The problem is that there isn't (yet) a COW-safe way to discard partial
extents. That is, there is no universally safe way (yet implemented) to
turn that first 1MiB extent into two extents "in place", one of
1MiB-4KiB and one of 4KiB, so there is no way (yet) to prevent this
worst case.

Doing things like excessive defragging at the BTRFS level, and
defragging inside of a VM, and using certain file types can lead to
pretty awful data wastage. YMMV.

i.e. "too much tidying up and you make a mess".

I offered a pseudocode example a few days back on how this problem might
be dealt with in future, but I've not seen any feedback on it.
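
For anyone who wants to check the arithmetic, here is a minimal Python sketch of the accounting. It is only a model under stated assumptions: 4KiB blocks, a 1MiB (exactly 256 block) starting extent, and every overwrite leaving the previous extent fully pinned. The function names are mine and are not anything in BTRFS.

```python
BLOCK = 4 * 1024        # assumed block size: 4 KiB
EXTENT = 1024 * 1024    # assumed size of the first write: 1 MiB


def allocated_after(writes, extent=EXTENT, block=BLOCK):
    """Bytes pinned after `writes` writes, each one block shorter than the last.

    Models the case where no old extent is ever fully overwritten,
    so every extent written so far stays allocated.
    """
    n = extent // block
    writes = min(writes, n)  # the n-th write is a single block; stop there
    # Sum of n, n-1, ..., n-writes+1 blocks.
    blocks = writes * n - writes * (writes - 1) // 2
    return blocks * block


def worst_case(extent=EXTENT, block=BLOCK):
    """Return (number of writes, total blocks pinned) for the full sequence."""
    n = extent // block
    return n, n * (n + 1) // 2


if __name__ == "__main__":
    print(allocated_after(3))               # bytes after the three writes drawn above
    n, blocks = worst_case()
    print(n, blocks, blocks * BLOCK / EXTENT)  # writes, blocks, MiB pinned
```

With these assumptions the three-write case comes to 3MiB-12KiB, and the full sequence to 32896 blocks, i.e. 128.5MiB pinned for 1MiB of visible data.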
>
> Anyway, I got it reproduced. And am about to write a lengthy mail about it.

Have fun with that lengthy email, but the devs already know about the
data waste profile of the system. They just don't have a good solution yet.

Practical use cases involving _not_ defragging and _not_ packing files,
or disabling COW and using raw image formats for VM disk storage are,
meanwhile, also well understood.

>
> It can easily be reproduced without even using Virtualbox, just by a nice
> simple fio job.
>

Yep. As I've explained twice now.
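
If you'd rather poke at the diminishing-overwrite pattern directly than through fio, something like the following Python sketch issues the writes. This is my own illustration and not Martin's fio job. The path, the function names and the fsync after every write (to push each write out separately) are all assumptions, and whether each write really ends up as its own extent depends on writeback and on mount options such as compression or nodatacow. I have not measured it here.

```python
import os

BLOCK = 4 * 1024        # assumed block size: 4 KiB
EXTENT = 1024 * 1024    # assumed size of the first write: 1 MiB


def _pwrite_all(fd, data, offset):
    """Call os.pwrite until all of `data` has been written at `offset`."""
    view = memoryview(data)
    while view:
        written = os.pwrite(fd, view, offset)
        view = view[written:]
        offset += written


def diminishing_overwrites(path, rounds=3, extent=EXTENT, block=BLOCK):
    """Write `extent` bytes at offset 0, then rewrite from offset 0 with one
    block less each round, fsyncing after every write.

    Returns the list of write lengths issued (hypothetical helper).
    """
    lengths = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        for i in range(rounds):
            length = extent - i * block
            if length <= 0:
                break
            # Different fill byte per round so the rounds are distinguishable.
            _pwrite_all(fd, bytes([i % 256]) * length, 0)
            os.fsync(fd)
            lengths.append(length)
    finally:
        os.close(fd)
    return lengths


if __name__ == "__main__":
    # "scratch.img" is a placeholder; point it at a file on the filesystem under test.
    print(diminishing_overwrites("scratch.img", rounds=3))
```

The file stays 1MiB long throughout; the interesting number is how much space the filesystem reports as used before and after, which this script does not try to read back.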