Message-ID: <549ECCD8.6090307@pobox.com>
Date: Sat, 27 Dec 2014 07:14:32 -0800
From: Robert White
To: Martin Steigerwald
CC: Hugo Mills, linux-btrfs@vger.kernel.org
Subject: Re: BTRFS free space handling still needs more work: Hangs again
References: <3738341.y7uRQFcLJH@merkaba> <549EBB90.5070406@pobox.com> <1779212.Cg9zjTft4U@merkaba> <34633403.WlleJmkifE@merkaba>
In-Reply-To: <34633403.WlleJmkifE@merkaba>

On 12/27/2014 06:21 AM, Martin Steigerwald wrote:
> On Saturday, 27 December 2014, 15:14:05, Martin Steigerwald wrote:
>> On Saturday, 27 December 2014, 06:00:48, Robert White wrote:
>>> On 12/27/2014 05:16 AM, Martin Steigerwald wrote:
>>>> It can easily be reproduced without even using VirtualBox, just by a
>>>> nice simple fio job.
>>>
>>> TL;DR: If you want a worst-case example of consuming a BTRFS filesystem
>>> with one single file...
>>>
>>> #!/bin/bash
>>> # not tested, so correct any syntax errors
>>> typeset -i counter
>>> for ((counter=250;counter>0;counter--)); do
>>>     dd if=/dev/urandom of=/some/file bs=4k count=$counter
>>> done
>>> exit
>>>
>>> Each pass over /some/file is 4k shorter than the previous one, but none
>>> of the extents can be deallocated. The file will be 1 MiB in size and
>>> usage will be something like 125.5 MiB (if I've done the math
>>> correctly). Larger values of counter will result in quadratically
>>> larger amounts of waste.
>>
>> Robert, I experienced these hang issues even before the defragmenting
>> case. It happened while just installing a 400 MiB tax returns
>> application into the VM (that is no joke, it is that big).
>>
>> It happens while just using the VM.
>>
>> Yes, I recommend not using BTRFS for any VM image or any larger
>> database on rotating storage, for exactly those COW semantics.
>>
>> But on SSD?
>>
>> It's busy-looping a CPU core while the flash is basically idling.
>>
>> I refuse to believe that this is by design.
>>
>> I do think there is a *bug*.
>>
>> Either acknowledge it and try to fix it, or say it's by design *without
>> even looking at it closely enough to be sure that it is not a bug* and
>> limit your own possibilities by it.
>>
>> I'd rather see it treated as a bug for now.
>>
>> Come on, 254 IOPS on a filesystem with still 17 GiB of free space while
>> randomly writing to a 4 GiB file.
>>
>> People do these kinds of things. Ditch that defragmenting Windows XP VM
>> case; I had performance issues even before, just by installing things
>> into it. Databases, VMs, emulators. And heck, even while just
>> *creating* the file with fio, as I showed.
>
> Add to these use cases things like this:
>
> martin@merkaba:~/.local/share/akonadi/db_data/akonadi> ls -lSh | head -5
> insgesamt 2,2G
> -rw-rw---- 1 martin martin 1,7G Dez 27 15:17 parttable.ibd
> -rw-rw---- 1 martin martin 488M Dez 27 15:17 pimitemtable.ibd
> -rw-rw---- 1 martin martin 23M Dez 27 15:17 pimitemflagrelation.ibd
> -rw-rw---- 1 martin martin 240K Dez 27 15:17 collectiontable.ibd
>
> Or this:
>
> martin@merkaba:~/.local/share/baloo> du -sch * | sort -rh
> 9,2G insgesamt
> 8,0G email
> 1,2G file
> 51M emailContacts
> 408K contacts
> 76K notes
> 16K calendars
>
> martin@merkaba:~/.local/share/baloo> ls -lSh email | head -5
> insgesamt 8,0G
> -rw-r--r-- 1 martin martin 4,0G Dez 27 15:16 postlist.DB
> -rw-r--r-- 1 martin martin 3,9G Dez 27 15:16 termlist.DB
> -rw-r--r-- 1 martin martin 143M Dez 27 15:16 record.DB
> -rw-r--r-- 1 martin martin 63K Dez 27 15:16 postlist.baseA

/usr/bin/du and /usr/bin/df and /bin/ls are all _useless_ for showing the
amount of file space used by a file in BTRFS.

Look at a nice paste of the previously described "worst case" allocation:

Gust rwhite # btrfs fi df /
Data, single: total=344.00GiB, used=340.41GiB
System, DUP: total=32.00MiB, used=80.00KiB
Metadata, DUP: total=8.00GiB, used=4.84GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

Gust rwhite # for ((counter=250;counter>0;counter--)); do dd if=/dev/urandom of=some_file conv=notrunc,fsync bs=4k count=$counter >/dev/null 2>&1; done

Gust rwhite # btrfs fi df /
Data, single: total=344.00GiB, used=340.48GiB
System, DUP: total=32.00MiB, used=80.00KiB
Metadata, DUP: total=8.00GiB, used=4.84GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

Gust rwhite # du some_file
1000    some_file

Gust rwhite # ls -lh some_file
-rw-rw-r--+ 1 root root 1000K Dec 27 07:00 some_file

Gust rwhite # rm some_file

Gust rwhite # btrfs fi df /
Data, single: total=344.00GiB, used=340.41GiB
System, DUP: total=32.00MiB, used=80.00KiB
Metadata, DUP: total=8.00GiB, used=4.84GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

Notice that "some_file" shows 1000 blocks in du and 1000K bytes in ls. But
notice that data used jumps from 340.41GiB to 340.48GiB when the file is
created, then drops back down to 340.41GiB when it's deleted.

Now, I have compression turned on, so the amount of growth/shrinkage changes
between runs, but it's _way_ more than 1 MiB; that's more like 70 MiB (give
or take significant rounding in the displayed figures). So I wrote this file
in a way that leads to it taking up _seventy_ _times_ its base size in
actual allocated storage.

Real files do not perform this terribly, but they can get pretty ugly in
some cases.
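If you want to repeat that measurement yourself, here is a rough, untested
sketch that just wraps the same experiment in a script. The mount point and
file name are placeholders, it assumes the "Data ... used=" line of
btrfs fi df looks like the output above, and it only means anything on an
otherwise idle filesystem:

#!/bin/bash
# Rough sketch: show how much data space a worst-case rewrite really
# costs by diffing "btrfs fi df" before and after.
MNT=/           # mount point to measure (placeholder)
FILE=some_file  # file to rewrite (placeholder)

data_used() {
    # Pull the used= figure from the Data line, e.g. "used=340.41GiB".
    btrfs fi df "$MNT" | awk -F'used=' '/^Data/ {print $2}'
}

sync
before=$(data_used)

# Same worst-case loop as above: each pass is 4k shorter and overwrites
# in place, so the old tail extents stay pinned until the file is deleted.
for ((counter=250; counter>0; counter--)); do
    dd if=/dev/urandom of="$FILE" conv=notrunc,fsync bs=4k count=$counter \
        >/dev/null 2>&1
done

sync
after=$(data_used)

echo "apparent size: $(du -h "$FILE" | cut -f1)"
echo "data used before: $before   after: $after"

du and ls will still report the file at about 1 MiB; the before/after delta
is where the real cost shows up.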
You _really_ need to learn how the system works and what its best and worst
cases look like before you start shouting "bug!" You are using the wrong
numbers (e.g. "df") for available space, and you don't know how to estimate
what your tools _should_ do for the conditions observed.

But yes, if you open a file and scribble all over it when your disk is full
to within the same order of magnitude as the size of the file you are
scribbling on, you will get into a condition where the _application_ will
aggressively retry the IO. Particularly if that application is a "test
program" or a virtual machine doing asynchronous IO. That's what those
sorts of systems do when they crash against a limit in the underlying
system.

So yeah: out of space plus aggressive writer equals spinning CPU.

Before you can assign blame you need to strace your application to see what
call it's making over and over again, and whether it's just being stupid.
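Something like this gives a quick picture of whether it is sitting in a
tight retry loop (the PID is a placeholder; the options are plain strace
flags):

# Attach to the busy writer and watch the calls as they happen; -T shows
# how long each call takes, -f follows threads and children.
strace -f -T -e trace=write,pwrite64,fsync,fdatasync -p <pid of the writer>

# Or collect per-syscall counts for a while and hit Ctrl-C; a tight retry
# loop shows up as one call utterly dominating the summary table.
strace -f -c -p <pid of the writer>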
> These will not be as bad as the fio test case, but still these files are
> written into. They are updated in place.
>
> And that's running on every Plasma desktop by default. And on GNOME
> desktops there is similar stuff.
>
> I haven't seen this spike out a kworker yet, though, so maybe the
> workload is light enough not to trigger it that easily.