From: Martin Steigerwald
To: Hugo Mills
Cc: Robert White, linux-btrfs@vger.kernel.org
Subject: Re: BTRFS free space handling still needs more work: Hangs again
Date: Sat, 27 Dec 2014 18:11:21 +0100
Message-ID: <9346949.uCfVN6IAc7@merkaba>
In-Reply-To: <20141227162642.GK25267@carfax.org.uk>
References: <3738341.y7uRQFcLJH@merkaba> <549EC829.8080808@pobox.com>
 <20141227162642.GK25267@carfax.org.uk>

On Saturday, 27 December 2014, 16:26:42, Hugo Mills wrote:
> On Sat, Dec 27, 2014 at 06:54:33AM -0800, Robert White wrote:
> > On 12/27/2014 05:55 AM, Martin Steigerwald wrote:
> [snip]
> > > while fio was just *laying* out the 4 GiB file. Yes, that's 100% system
> > > CPU for 10 seconds while allocating a 4 GiB file on a filesystem like:
> > >
> > > martin@merkaba:~> LANG=C df -hT /home
> > > Filesystem             Type   Size  Used Avail Use% Mounted on
> > > /dev/mapper/msata-home btrfs  170G  156G   17G  91% /home
> > >
> > > where a 4 GiB file should easily fit, no? (And this output is with the 4
> > > GiB file. So it was even 4 GiB more free before.)
> >
> > No. /usr/bin/df is an _approximation_ in BTRFS because of the limits
> > of the fsstat() function call. The fsstat function call was defined
> > in 1990 and "can't understand" the dynamic allocation model used in
> > BTRFS as it assumes fixed geometry for filesystems. You do _not_
> > have 17G actually available. You need to rely on btrfs fi df and
> > btrfs fi show to figure out how much space you _really_ have.
> >
> > According to this block you have a RAID1 of ~160GB expanse (two 160G disks):
> >
> > > merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
> > > Sa 27. Dez 13:26:39 CET 2014
> > > Label: 'home'  uuid: [some UUID]
> > >         Total devices 2 FS bytes used 152.83GiB
> > >         devid    1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
> > >         devid    2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home
> >
> > And according to this block you have about 4.49GiB of data space:
> >
> > > Btrfs v3.17
> > > Data, RAID1: total=154.97GiB, used=149.58GiB
> > > System, RAID1: total=32.00MiB, used=48.00KiB
> > > Metadata, RAID1: total=5.00GiB, used=3.26GiB
> > > GlobalReserve, single: total=512.00MiB, used=0.00B
> >
> >   154.97
> >     5.00
> >     0.032
> > + 0.512
> >
> > Pretty much as close to 160GiB as you are going to get (those
> > numbers being rounded up in places for "human readability"). BTRFS
> > has allocated 100% of the raw storage into typed extents.
> >
> > A large data file can only fit in the 154.97 - 149.58 = 5.39
>
>    I appreciate that this is something of a minor point in the grand
> scheme of things, but I'm afraid I've lost the enthusiasm to engage
> with the broader (somewhat rambling, possibly-at-cross-purposes)
> conversation in this thread. However...
>
> > Trying to allocate that 4GiB file into that 5.39GiB of space becomes
> > an NP-complete (e.g. "very hard") problem if it is very fragmented.
>
>    This is badly mistaken, at best. The problem of where to write a
> file into a set of free extents is definitely *not* an NP-hard
> problem. It's a P problem, with an O(n log n) solution, where n is the
> number of free extents in the free space cache. The simple approach:
> fill the first hole with as many bytes as you can, then move on to the
> next hole. More complex: order the free extents by size first. Both of
> these are O(n log n) algorithms, given an efficient general-purpose
> index of free space.
>
>    The problem of placing file data isn't a bin-packing problem; it's
> not like allocating RAM (where each allocation must be contiguous).
> The items being placed may be split as much as you like, although
> minimising the amount of splitting is a goal.
>
>    I suspect that the performance problems that Martin is seeing may
> indeed be related to free space fragmentation, in that finding and
> creating all of those tiny extents for a huge file is causing
> problems. I believe that btrfs isn't alone in this, but it may well be
> showing the problem to a far greater degree than other FSes. I don't
> have figures to compare, I'm afraid.
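Just to make the hole-filling idea concrete for myself, here is a rough
user-space sketch of the "order the free extents by size, then fill them"
placement you describe. This is not btrfs code; the structures, names and
numbers are made up. It only shows why the sort over the free extents
dominates the cost:

/* Hypothetical user-space sketch, not btrfs code: place `need` bytes into
 * a list of free extents, biggest holes first. The qsort() dominates, so
 * the whole placement is O(n log n) in the number of free extents. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

struct extent { uint64_t start, len; };

static int by_len_desc(const void *a, const void *b)
{
	const struct extent *x = a, *y = b;

	if (x->len == y->len)
		return 0;
	return x->len < y->len ? 1 : -1;
}

/* Returns the number of fragments the file ends up in, or -1 if the free
 * extents do not add up to `need` bytes at all. */
static int place(struct extent *holes, size_t n, uint64_t need)
{
	int frags = 0;

	qsort(holes, n, sizeof(*holes), by_len_desc);    /* O(n log n) */
	for (size_t i = 0; i < n && need > 0; i++) {     /* O(n) */
		uint64_t take = holes[i].len < need ? holes[i].len : need;

		if (take == 0)
			continue;
		printf("fragment %d: %llu bytes at offset %llu\n", ++frags,
		       (unsigned long long)take,
		       (unsigned long long)holes[i].start);
		need -= take;
	}
	return need ? -1 : frags;
}

int main(void)
{
	/* Made-up free space map: 1 MiB, 3 MiB and 2 MiB holes. */
	struct extent holes[] = {
		{ 0, 1 << 20 }, { 5 << 20, 3 << 20 }, { 40 << 20, 2 << 20 },
	};

	if (place(holes, 3, (uint64_t)4 << 20) < 0)      /* a 4 MiB "file" */
		fprintf(stderr, "not enough free space\n");
	return 0;
}

So the placement itself is cheap; the question is rather what it costs to
carve a big file into thousands of such fragments.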
That's what I wanted to hint at. I suspect an issue with free space
fragmentation, and this is what I think I see: btrfs balance reduces the
fragmentation of free space within chunks. That is my whole case for why I
think it helps with my /home filesystem.

So while btrfs filesystem defragment may help with defragmenting individual
files, possibly at the cost of fragmenting free space, at least on an almost
full filesystem, I think there are only three options at the moment to help
with free space fragmentation:

1) reformat and restore via rsync or btrfs send from backup (i.e. file based)

2) make the BTRFS itself bigger

3) btrfs balance at least some chunks, at least those that are not more than
   70% or 80% full.

Do you know of any other ways to deal with it?

So yes, in case it really is free space fragmentation, I do think a balance
may be helpful, even if usually one should not need to use a balance.

> > I also don't know what kind of tool you are using, but it might be
> > repeatedly trying and failing to fallocate the file as a single
> > extent or something equally dumb.
>
>    Userspace doesn't, as far as I know, get to make that decision. I've
> just read the fallocate(2) man page, and it says nothing at all about
> the contiguity of the extent(s) of storage allocated by the call.

fio fallocates just once, and then writes even if the fallocate call fails.
That was nice to see at some point, as BTRFS returned out of space on the
fallocate but was still able to write the 4 GiB of random data. I bet the
latter was due to compression: while BTRFS could not guarantee that the
4 GiB would fit in all cases, i.e. even with incompressible data, it was
still able to write out the random buffer that fio repeatedly wrote.
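For illustration, this is roughly what fio seems to do here, judging only
from its behaviour, not from the fio sources; the path and the constant
buffer are made up (fio writes random data):

/* Hypothetical sketch of the observed fio behaviour, not fio source code. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];
	int fd = open("/home/martin/testfile", O_CREAT | O_WRONLY | O_TRUNC, 0644);

	if (fd < 0)
		return 1;
	memset(buf, 'x', sizeof(buf));

	/* One attempt to reserve the full 4 GiB up front; on a nearly full
	 * btrfs this can fail with ENOSPC ... */
	if (posix_fallocate(fd, 0, 4ULL << 30) != 0)
		fprintf(stderr, "fallocate failed, writing anyway\n");

	/* ... but the buffered writes may still succeed, e.g. because the
	 * data compresses well. */
	for (long long i = 0; i < (4LL << 30) / (long long)sizeof(buf); i++)
		if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
			break;
	close(fd);
	return 0;
}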
I think I will step back from this now; it's the weekend and a quiet time
after all.

I probably got a bit too engaged with this discussion. Yet I had the feeling
I was treated by Robert like someone who doesn't know a thing. I want to
approach this with a willingness to learn, and I don't want to interpret an
empirical result away before someone has even had a closer look at it.

I had this before, where an expert claimed that he wouldn't reduce the
dirty_background_ratio in an rsync-via-NFS case, and I actually needed to
prove the result to him before he – I don't even know – eventually accepted
it.

I may be off with my free space fragmentation idea, so let the kern.log and
my results speak for themselves. I don't see much point in continuing this
discussion before a BTRFS developer has had a look at it.

I put the kern.log with the sysrq-trigger "t" output onto the bug report.
The bugzilla does not seem to be available from here at the moment, nginx
reports "502 Bad Gateway", but the kern.log is attached to it. And in case
someone needs it by mail, just ping me.

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7