All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jamie Lokier <jamie@shareable.org>
To: Edward Shishkin <edward@redhat.com>
Cc: Edward Shishkin <edward.shishkin@gmail.com>,
	Mat <jackdachef@gmail.com>, LKML <linux-kernel@vger.kernel.org>,
	linux-fsdevel@vger.kernel.org,
	Chris Mason <chris.mason@oracle.com>,
	Ric Wheeler <rwheeler@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	The development of BTRFS <linux-btrfs@vger.kernel.org>
Subject: Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)
Date: Thu, 24 Jun 2010 00:57:26 +0100	[thread overview]
Message-ID: <20100623235726.GG7058@shareable.org> (raw)
In-Reply-To: <4C1BC924.30604@redhat.com>

Edward Shishkin wrote:
> I'll try to help, but I am rather pessimistic here: working out
> algorithms is something, which doesn't like timelines..

Nonsense.  Working out algorithms is just work to an algorithm
designer, just like programming is work to a programmer.  Sure, some
things are harder than others, and there's an element of creativity.
But lots of it is just cranking the handle, even in algorithm design.

> >Note that it's not necessarily a problem to have a few nodes with low
> >utilisation.  Deliberate violation of the classic balancing heuristic
> >is often useful for performance.[*]
> 
> Ok, let's violate this, but not in the detriment of utilisation:
> Internal fragmentation is a horror for file systems, the enemy #1
> (which is completely defeated in the last century BTW).
> 
> >  The problem you've found is only a
> >real problem when there are _too many_ nodes with low utilisation.
> 
> IMHO this is a problem, as we can not estimate number of such nodes.
> Do you have any sane upper boundary for this number? I don't know such one.
> Maybe I have missed something?

Well, I don't know if btrfs maintains enough statistics or other
implicit processes to guarantee a sane upper boundary for this.  The
impression I'm getting from the thread is that it relies on heuristic
behaviour which is usually sufficient (and perhaps a bit faster than
updating statistics on the disk would be), but that it does not
provide a hard upper boundary.  I'm really not sure, though.  I
haven't read the code.

> >[*] For example when filling a tree by inserting contiguously
> >ascending keys, the classic "split into two when full" heuristic gives
> >the worst possible results (50% lost space), and deliberately
> >underfilling the most actively updated nodes, which is not permitted
> >at all by the classic algorithm, gives denser packing in the end
> >(almost zero lost space).
> 
> At the end of what?

At the end of inserting keys in ascending order.  Just think about the
classic B-tree when that is done: Node[1] fills to 100% utilisation,
then is split into two nodes at 50% (call them node[1] and node[2]).
Node[2] fills to 100%, then splits into node[2] at 50% and node[3].
Etc etc: Result is a million nodes at 50% utilisation, except the last
one.

If instead you fill node[1], then *don't* split it but permit node[2]
to have 0% to start with, let that fill to 100%, then don't split
node[2] and let node[3] start with 0%, etc. etc: Result is half a
million nodes at 100% utilisation, except the last one.

Both fit the desired bounds, but the second is more desirable in
practice, especially as it's common behaviour to fill with contiguous,
ascending keys in a filesystem (or blob-filled database) where data
extents are part of the tree :-)

To handle the cases of multiple independent ascending runs of keys
being written in parallel, you generalise to maintain more than one
below-50% block, near the "active insertion heads".

The above describes classic B-trees where updates are assumed to be
written immediately.  Things are different when there's a memory cache
and delayed allocation/writing, and packing can be reorganised in
memory before sending to storage.

> I hope you don't speak about on-line defragmentation?

No, not at all.

> Can you point me to the paper (if any)?

Sorry, no, I haven't seen this described in a paper.  Imho it's
obvious when you think about it.  But maybe it's not obvious to
everyone - perhaps this is even the heuristic modification missing
from btrfs which it would need to avoid the scenario which started
this thread?

-- Jamie

      parent reply	other threads:[~2010-06-23 23:57 UTC|newest]

Thread overview: 62+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-06-03 14:58 Unbound(?) internal fragmentation in Btrfs Edward Shishkin
2010-06-17 23:29 ` Mat
2010-06-18  8:03   ` Christian Stroetmann
2010-06-18 13:32   ` Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs) Edward Shishkin
2010-06-18 13:45     ` Daniel J Blueman
2010-06-18 13:45       ` Daniel J Blueman
2010-06-18 16:50       ` Edward Shishkin
2010-06-23 23:40         ` Jamie Lokier
2010-06-24  3:43           ` Daniel Taylor
2010-06-24  4:51             ` Mike Fedyk
2010-06-24  4:51               ` Mike Fedyk
2010-06-24  4:51               ` Mike Fedyk
2010-06-24 22:06               ` Daniel Taylor
2010-06-24 22:06                 ` Daniel Taylor
2010-06-24 22:06                 ` Daniel Taylor
2010-06-25  9:15                 ` Btrfs: broken file system design Andi Kleen
2010-06-25 18:58                 ` Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs) Ric Wheeler
2010-06-26  5:18                   ` Michael Tokarev
2010-06-26 11:55                     ` Ric Wheeler
     [not found]                     ` <57784.2001:5c0:82dc::2.1277555665.squirrel@www.tofubar.com>
2010-06-26 13:47                       ` Ric Wheeler
2010-06-24  9:50             ` David Woodhouse
2010-06-24  9:50               ` David Woodhouse
2010-06-18 18:15       ` Christian Stroetmann
2010-06-18 13:47     ` Chris Mason
2010-06-18 15:05       ` Edward Shishkin
2010-06-18 15:05       ` Edward Shishkin
2010-06-18 15:05         ` Edward Shishkin
2010-06-18 15:10         ` Chris Mason
2010-06-18 16:22           ` Edward Shishkin
2010-06-18 16:22           ` Edward Shishkin
2010-06-18 16:22             ` Edward Shishkin
2010-06-18 18:10             ` Chris Mason
2010-06-18 15:21       ` Christian Stroetmann
2010-06-18 15:22         ` Chris Mason
2010-06-18 15:56     ` Jamie Lokier
2010-06-18 19:25       ` Christian Stroetmann
2010-06-18 19:29       ` Edward Shishkin
2010-06-18 19:35         ` Chris Mason
2010-06-18 22:04           ` Balancing leaves when walking from top to down (was Btrfs:...) Edward Shishkin
2010-06-18 22:04           ` Edward Shishkin
2010-06-18 22:16             ` Ric Wheeler
2010-06-19  0:03               ` Edward Shishkin
2010-06-21 13:15             ` Chris Mason
2010-06-21 18:00               ` Chris Mason
2010-06-22 14:12                 ` Edward Shishkin
2010-06-22 14:12                 ` Edward Shishkin
2010-06-22 14:12                   ` Edward Shishkin
2010-06-22 14:20                   ` Chris Mason
2010-06-23 13:46                     ` Edward Shishkin
2010-06-23 13:46                       ` Edward Shishkin
2010-06-23 23:37                       ` Jamie Lokier
2010-06-24 13:06                         ` Chris Mason
2010-06-30 20:05                           ` Edward Shishkin
2010-06-30 20:05                             ` Edward Shishkin
2010-06-30 21:12                             ` Chris Mason
2010-06-30 20:05                           ` Edward Shishkin
2010-06-30 20:05                           ` Edward Shishkin
2010-06-23 13:46                     ` Edward Shishkin
2010-07-09  4:16                 ` Chris Samuel
2010-07-09 20:30                   ` Chris Mason
2010-06-18 22:04           ` Edward Shishkin
2010-06-23 23:57         ` Jamie Lokier [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20100623235726.GG7058@shareable.org \
    --to=jamie@shareable.org \
    --cc=akpm@linux-foundation.org \
    --cc=chris.mason@oracle.com \
    --cc=edward.shishkin@gmail.com \
    --cc=edward@redhat.com \
    --cc=jackdachef@gmail.com \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=rwheeler@redhat.com \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.