linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Robert White <rwhite@pobox.com>
To: Hugo Mills <hugo@carfax.org.uk>,
	Martin Steigerwald <Martin@lichtvoll.de>,
	linux-btrfs@vger.kernel.org
Subject: Re: BTRFS free space handling still needs more work: Hangs again
Date: Sat, 27 Dec 2014 16:06:13 -0800	[thread overview]
Message-ID: <549F4975.7060009@pobox.com> (raw)
In-Reply-To: <20141227162642.GK25267@carfax.org.uk>

Semi off-topic questions...

On 12/27/2014 08:26 AM, Hugo Mills wrote:
>     This is... badly mistaken, at best. The problem of where to write a
> file into a set of free extents is definitely *not* an NP-hard
> problem. It's a P problem, with an O(n log n) solution, where n is the
> number of free extents in the free space cache. The simple approach:
> fill the first hole with as many bytes as you can, then move on to the
> next hole. More complex: order the free extents by size first. Both of
> these are O(n log n) algorithms, given an efficient general-purpose
> index of free space.

Which algorithm is actually in use?

Is any attempt made to keep subsequent allocations in the same data extent?

All of "best fit", "first fit", and "first encountered" allocation have 
terrible distribution graphs over time.

Without a knod to locality, discontiguous allocation will have 
staggeringly bad after-effects in terms of read-ahead.

>
>     The problem of placing file data isn't a bin-packing problem; it's
> not like allocating RAM (where each allocation must be contiguous).
> The items being placed may be split as much as you like, although
> minimising the amount of splitting is a goal.

How is compression and re-compression handled? If a linear extent is 
compressed to find its on-disk size in bytes, and then there isn't an 
extent large enough to fit it, it has to be cut, then recompressed, then 
searched again right?

How does the system look for the right cut? How iterative can this get? 
Does it always try cutting in half? Does it shave single bytes off the 
end? Does it add one byte at a time till it reaches the size of the 
extent its looking at?

Can you get down to a point where you are placing data in five or ten 
byte chunks somehow? (e.g. what's the smallest chunk you can place? 
clearly if I open a multi-megabyte file and update a single word or byte 
it's not going to land in metadata from my reading of the code.) One 
could easily end up with a couple million free extents of just a few 
bytes each, particularly if largest-first allocation is used.

The degenerate cases here do come straight from the various packing 
problems. You may not be executing any of those packing algorithms but 
once you ignore enough of those issues in the easy cases your free space 
will be a fine pink mist suspended in space. (both an explosion analogy 
and a reference to pink noise 8-) ).

>     I suspect that the performance problems that Martin is seeing may
> indeed be related to free space fragmentation, in that finding and
> creating all of those tiny extents for a huge file is causing
> problems. I believe that btrfs isn't alone in this, but it may well be
> showing the problem to a far greater degree than other FSes. I don't
> have figures to compare, I'm afraid.

>
>> I also don't know what kind of tool you are using, but it might be
>> repeatedly trying and failing to fallocate the file as a single
>> extent or something equally dumb.
>
>     Userspace doesn't as far as I know, get to make that decision. I've
> just read the fallocate(2) man page, and it says nothing at all about
> the contiguity of the extent(s) storage allocated by the call.

Yep, my bad. But as soon as I saw that "fio" was starting two threads, 
one doing random read/write and another doing sequential read/write, 
both on the same file, it set off my "not just creating a file" mindset. 
Given the delayed write into/through the cache normally done by casual 
file io, It seemed likely that fio would be doing something more 
aggressive (like using O_DIRECT or repeated fdatasync() which could get 
very tit-for-tat).

Compare that to a VM in which the guest operating system "knows" it has, 
and has used, its "disk space" internally, and the subsequent async 
activity of the monitor to push that activity out to real storage which 
is usually quite pathological... well you can get into some super 
pernicious behavior over write ordering and infinite retries.

So I was wrong about fallocate per-se, applications can be incredibly 
dumb. For instance a VM might think its _inconceivable_ to get an ENOSPC 
while rewriting data it's just read from a file it "knows" has no holes etc.

Given how lots of code doesn't even check the results of many function 
calls... how many times have you seen code that doesn't look at the 
return value of fwrite() or printf()? Or one that, if it does something 
like if (bytes_written < size) retry_remainder(); So sure I was 
imagining an fallocate() in a loop or something equally dumb. 8-)

>
>     Hugo.
>
> [snip]
>


  parent reply	other threads:[~2014-12-28  0:06 UTC|newest]

Thread overview: 59+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-12-26 13:37 BTRFS free space handling still needs more work: Hangs again Martin Steigerwald
2014-12-26 14:20 ` Martin Steigerwald
2014-12-26 14:41   ` Martin Steigerwald
2014-12-27  3:33     ` Duncan
2014-12-26 15:59 ` Martin Steigerwald
2014-12-27  4:26   ` Duncan
2014-12-26 22:48 ` Robert White
2014-12-27  5:54   ` Duncan
2014-12-27  9:01   ` Martin Steigerwald
2014-12-27  9:30     ` Hugo Mills
2014-12-27 10:54       ` Martin Steigerwald
2014-12-27 11:52         ` Robert White
2014-12-27 13:16           ` Martin Steigerwald
2014-12-27 13:49             ` Robert White
2014-12-27 14:06               ` Martin Steigerwald
2014-12-27 14:00             ` Robert White
2014-12-27 14:14               ` Martin Steigerwald
2014-12-27 14:21                 ` Martin Steigerwald
2014-12-27 15:14                   ` Robert White
2014-12-27 16:01                     ` Martin Steigerwald
2014-12-28  0:25                       ` Robert White
2014-12-28  1:01                         ` Bardur Arantsson
2014-12-28  4:03                           ` Robert White
2014-12-28 12:03                             ` Martin Steigerwald
2014-12-28 17:04                               ` Patrik Lundquist
2014-12-29 10:14                                 ` Martin Steigerwald
2014-12-28 12:07                             ` Martin Steigerwald
2014-12-28 14:52                               ` Robert White
2014-12-28 15:42                                 ` Martin Steigerwald
2014-12-28 15:47                                   ` Martin Steigerwald
2014-12-29  0:27                                   ` Robert White
2014-12-29  9:14                                     ` Martin Steigerwald
2014-12-27 16:10                     ` Martin Steigerwald
2014-12-27 14:19               ` Robert White
2014-12-27 11:11       ` Martin Steigerwald
2014-12-27 12:08         ` Robert White
2014-12-27 13:55       ` Martin Steigerwald
2014-12-27 14:54         ` Robert White
2014-12-27 16:26           ` Hugo Mills
2014-12-27 17:11             ` Martin Steigerwald
2014-12-27 17:59               ` Martin Steigerwald
2014-12-28  0:06             ` Robert White [this message]
2014-12-28 11:05               ` Martin Steigerwald
2014-12-28 13:00         ` BTRFS free space handling still needs more work: Hangs again (further tests) Martin Steigerwald
2014-12-28 13:40           ` BTRFS free space handling still needs more work: Hangs again (further tests, as close as I dare) Martin Steigerwald
2014-12-28 13:56             ` BTRFS free space handling still needs more work: Hangs again (further tests, as close as I dare, current idea) Martin Steigerwald
2014-12-28 15:00               ` Martin Steigerwald
2014-12-29  9:25               ` Martin Steigerwald
2014-12-27 18:28       ` BTRFS free space handling still needs more work: Hangs again Zygo Blaxell
2014-12-27 18:40         ` Hugo Mills
2014-12-27 19:23           ` BTRFS free space handling still needs more work: Hangs again (no complete lockups, "just" tasks stuck for some time) Martin Steigerwald
2014-12-29  2:07             ` Zygo Blaxell
2014-12-29  9:32               ` Martin Steigerwald
2015-01-06 20:03                 ` Zygo Blaxell
2015-01-07 19:08                   ` Martin Steigerwald
2015-01-07 21:41                     ` Zygo Blaxell
2015-01-08  5:45                     ` Duncan
2015-01-08 10:18                       ` Martin Steigerwald
2015-01-09  8:25                         ` Duncan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=549F4975.7060009@pobox.com \
    --to=rwhite@pobox.com \
    --cc=Martin@lichtvoll.de \
    --cc=hugo@carfax.org.uk \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).