From: Christian Stroetmann <stroetmann@ontolab.com>
To: Jamie Lokier <jamie@shareable.org>
Cc: Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
linux-fsdevel@vger.kernel.org, linux-btrfs@vger.kernel.org
Subject: Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)
Date: Fri, 18 Jun 2010 21:25:38 +0200 [thread overview]
Message-ID: <4C1BC832.4080809@ontolab.com> (raw)
In-Reply-To: <20100618155653.GC10919@shareable.org>
Jamie Lokier wrote:
> Edward Shishkin wrote:
>
>> If you decide to base your file system on some algorithms then please
>> use the original ones from proper academic papers. DO NOT modify the
>> algorithms in solitude: this is very fragile thing! All such
>> modifications must be reviewed by specialists in the theory of
>> algorithms. Such review can be done in various scientific magazines of
>> proper level.
>>
>> Personally I don't see any way to improve the situation with Btrfs
>> except full redesigning the last one. If you want to base your file
>> system on the paper of Ohad Rodeh, then please, use *exactly* the
>> Bayer's B-trees that he refers to. That said, make sure that all
>> records you put to the tree has equal length and all non-root nodes of
>> your tree are at least half filled.
>>
> First, thanks Edward for identifying a specific problem with the
> current btrfs implementation.
>
> I've studied modified B-trees quite a lot and know enough to be sure
> that they are quite robust when you modify them in all sorts of ways.
>
This is the point: Which kind of modified B-tree data structure is best
suited?
> Moreover, you are incorrect to say there's an intrinsic algorithmic
> problem with variable-length records. It is not true; if Knuth said
> so, Knuth was mistaken.
>
> This is easily shown by modelling variable-length records (key ->
> string) as a range of fixed length records ([key,index] -> byte) and
> apply the standard B-tree algorithms to that, which guarantees
> algorithm properties such as space utilisation and time; then you can
> substitute a "compressed" representation of contiguous index runs,
> which amounts to nothing more than just storing the strings (split
> where the algorithm says to do so) and endpoint indexes , and because
> this compression does not expand (in any way that matters), classic
> algorithmic properties are still guaranteed.
>
> Variable-length keys are a different business. Those are trickier,
> but as far as I know, btrfs doesn't use them.
>
>
>> As to current Btrfs code: *NOT ACK*!!! I don't think we need such
>> "file systems".
>>
> Btrfs provides many useful features that other filesystems don't. We
> definitely need it, or something like it. You have identified a bug.
> It's not a corruption bug, but it's definitely a bug, and probably
> affects performance as well as space utilisation.
>
> It is not deep design bug; it is just a result of the packing and
> balancing heuristics.
>
I think this is the most important design question in relation with
filesystems that use some kind of B-trees, which means, if the wrong
modified B-tree as the fundamental data structure was chosen, then this
is a deep design bug.
> If you are still interested, please apply your knowledge of B-tree
> algorithms to understanding why btrfs fails to balance the tree
> sufficiently well, and then propose a fix.
>
This is a general problem of filesystem design, especially the packing
and balancing heurisitcs, and a special problem of the Btrfs filesystem.
You can't simply say do it in this or that way. That's why another
filesystem uses something exotic like a B*-tree in conjunction with
dancing trees as fundamental data structure, which leads back to the
deep design question/problem/decision/bug/.... And after I followed the
explanations of this exotic B-tree version by the main developer I knew
just right from the start of the development of the Btrfs filesystem
that it wasn't chosen the right modified B-tree data structure, because
it was too simple and too general. And since some days I have the
impression that there wasn't made a design decision at all with the only
exception that there has to be some kind of a B-tree algorithm/data
structure in the Btrfs filesystem.
And I also think that such a deep desgin decision can't simply be
corrected in general (subjective opinion).
> Note that it's not necessarily a problem to have a few nodes with low
> utilisation. Deliberate violation of the classic balancing heuristic
> is often useful for performance.[*] The problem you've found is only a
> real problem when there are _too many_ nodes with low utilisation.
>
The found problem is the first problem with the chosen modified B-tree
data structure. I wouldn't call it only a problem in a special case.
> [*] For example when filling a tree by inserting contiguously
> ascending keys, the classic "split into two when full" heuristic gives
> the worst possible results (50% lost space), and deliberately
> underfilling the most actively updated nodes, which is not permitted
> at all by the classic algorithm, gives denser packing in the end
> (almost zero lost space). It's also faster. The trick is to make
> sure there's just the right number of underfilled nodes...
>
Yes, but ....
Firstly, maybe you are too focused on the classic B-tree algorithm here.
Secondly, a trick here, a split there, turning off a feature and then?
Then we have complexity at then end, which brings us back to the start,
the design decision.
But if you say there are no deep problems, then I will believe you for now.
> -- Jamie
>
With all the best
Christian Stroetmann
next prev parent reply other threads:[~2010-06-18 19:25 UTC|newest]
Thread overview: 62+ messages / expand[flat|nested] mbox.gz Atom feed top
2010-06-03 14:58 Unbound(?) internal fragmentation in Btrfs Edward Shishkin
2010-06-17 23:29 ` Mat
2010-06-18 8:03 ` Christian Stroetmann
2010-06-18 13:32 ` Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs) Edward Shishkin
2010-06-18 13:45 ` Daniel J Blueman
2010-06-18 13:45 ` Daniel J Blueman
2010-06-18 16:50 ` Edward Shishkin
2010-06-23 23:40 ` Jamie Lokier
2010-06-24 3:43 ` Daniel Taylor
2010-06-24 4:51 ` Mike Fedyk
2010-06-24 4:51 ` Mike Fedyk
2010-06-24 4:51 ` Mike Fedyk
2010-06-24 22:06 ` Daniel Taylor
2010-06-24 22:06 ` Daniel Taylor
2010-06-24 22:06 ` Daniel Taylor
2010-06-25 9:15 ` Btrfs: broken file system design Andi Kleen
2010-06-25 18:58 ` Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs) Ric Wheeler
2010-06-26 5:18 ` Michael Tokarev
2010-06-26 11:55 ` Ric Wheeler
[not found] ` <57784.2001:5c0:82dc::2.1277555665.squirrel@www.tofubar.com>
2010-06-26 13:47 ` Ric Wheeler
2010-06-24 9:50 ` David Woodhouse
2010-06-24 9:50 ` David Woodhouse
2010-06-18 18:15 ` Christian Stroetmann
2010-06-18 13:47 ` Chris Mason
2010-06-18 15:05 ` Edward Shishkin
2010-06-18 15:05 ` Edward Shishkin
2010-06-18 15:05 ` Edward Shishkin
2010-06-18 15:10 ` Chris Mason
2010-06-18 16:22 ` Edward Shishkin
2010-06-18 16:22 ` Edward Shishkin
2010-06-18 16:22 ` Edward Shishkin
2010-06-18 18:10 ` Chris Mason
2010-06-18 15:21 ` Christian Stroetmann
2010-06-18 15:22 ` Chris Mason
2010-06-18 15:56 ` Jamie Lokier
2010-06-18 19:25 ` Christian Stroetmann [this message]
2010-06-18 19:29 ` Edward Shishkin
2010-06-18 19:35 ` Chris Mason
2010-06-18 22:04 ` Balancing leaves when walking from top to down (was Btrfs:...) Edward Shishkin
2010-06-18 22:04 ` Edward Shishkin
2010-06-18 22:16 ` Ric Wheeler
2010-06-19 0:03 ` Edward Shishkin
2010-06-21 13:15 ` Chris Mason
2010-06-21 18:00 ` Chris Mason
2010-06-22 14:12 ` Edward Shishkin
2010-06-22 14:12 ` Edward Shishkin
2010-06-22 14:12 ` Edward Shishkin
2010-06-22 14:20 ` Chris Mason
2010-06-23 13:46 ` Edward Shishkin
2010-06-23 13:46 ` Edward Shishkin
2010-06-23 13:46 ` Edward Shishkin
2010-06-23 23:37 ` Jamie Lokier
2010-06-24 13:06 ` Chris Mason
2010-06-30 20:05 ` Edward Shishkin
2010-06-30 20:05 ` Edward Shishkin
2010-06-30 20:05 ` Edward Shishkin
2010-06-30 21:12 ` Chris Mason
2010-06-30 20:05 ` Edward Shishkin
2010-07-09 4:16 ` Chris Samuel
2010-07-09 20:30 ` Chris Mason
2010-06-18 22:04 ` Edward Shishkin
2010-06-23 23:57 ` Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs) Jamie Lokier
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=4C1BC832.4080809@ontolab.com \
--to=stroetmann@ontolab.com \
--cc=jamie@shareable.org \
--cc=linux-btrfs@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.