From: Miao Xie <miaox@cn.fujitsu.com>
To: Arne Jansen <sensille@gmx.net>
Cc: Andrey Kuzmin <andrey.v.kuzmin@gmail.com>,
Linux Btrfs <linux-btrfs@vger.kernel.org>,
Chris Mason <chris.mason@oracle.com>
Subject: Re: [RFC] Tree fragmentation and prefetching
Date: Thu, 24 Mar 2011 09:38:50 +0800 [thread overview]
Message-ID: <4D8AA0AA.5010006@cn.fujitsu.com> (raw)
In-Reply-To: <4D8A57E9.6010107@gmx.net>
On wed, 23 Mar 2011 21:28:25 +0100, Arne Jansen wrote:
> On 23.03.2011 20:26, Andrey Kuzmin wrote:
>> On Wed, Mar 23, 2011 at 4:06 PM, Arne Jansen<sensille@gmx.net> wrote:
>>> While looking into the performance of scrub I noticed that a significant
>>> amount of time is being used for loading the extent tree and the csum
>>> tree. While this is no surprise I did some prototyping on how to improve
>>> on it.
>>> The main idea is to load the tree (or parts of it) top-down, order the
>>> needed blocks and distribute it over all disks.
>>> To keep you interested, some results first.
>>>
>>> a) by tree enumeration with reada=2
>>> reading extent tree: 242s
>>> reading csum tree: 140s
>>> reading both trees: 324s
>>>
>>> b) prefetch prototype
>>> reading extent tree: 23.5s
>>> reading csum tree: 20.4s
>>> reading both trees: 25.7s
>>
>> 10x speed-up looks indeed impressive. Just for me to be sure, did I
>> get you right in that you attribute this effect specifically to
>> enumerating tree leaves in key address vs. disk addresses when these
>> two are not aligned?
>
> Yes. Leaves and the intermediate nodes tend to be quite scattered
> around the disk with respect to their logical order.
> Reading them in logical (ascending/descending) order require lots
> of seeks.
I'm also dealing with tree fragmentation problem, I try to store the leaves
which have the same parent closely.
Regards
Miao
>
>>
>> Regards,
>> Andrey
>>
>>>
>>> The test setup consists of a 7 Seagate ES.2 1TB disks filesystem, filled
>>> 28%. It is created with the current git tree + the round robin patch and
>>> filled with
>>>
>>> fs_mark -D 512 -t 16 -n 4096 -F -S0
>>>
>>> The 'normal' read is done by enumerating the leaves by btrfs_next_leaf()
>>> with path->reada=2. Both trees are being enumerated one after the other.
>>> The prototype currently just uses raw bios, does not make use of the
>>> page cache and does not enter the read pages into the cache. This will
>>> probably add some overhead. It also does not check the crcs.
>>>
>>> While it is very promising to implement it for scrub, I think a more
>>> general interface which can be used for every enumeration would be
>>> beneficial. Use cases that come to mind are rebalance, reflink, deletion
>>> of large files, listing of large directories etc..
>>>
>>> I'd imagine an interface along the lines of
>>>
>>> int btrfs_readahead_init(struct btrfs_reada_ctx *reada);
>>> int btrfs_readahead_add(struct btrfs_root *root,
>>> struct btrfs_key *start,
>>> struct btrfs_key *end,
>>> struct btrfs_reada_ctx *reada);
>>> void btrfs_readahead_wait(struct btrfs_reada_ctx *reada);
>>>
>>> to trigger the readahead of parts of a tree. Multiple readahead
>>> requests can be given before waiting. This would enable the very
>>> beneficial folding seen above for 'reading both trees'.
>>>
>>> Also it would be possible to add a cascading readahead, where the
>>> content of leaves would trigger readaheads in other trees, maybe by
>>> giving a callback for the decisions what to read instead of the fixed
>>> start/end range.
>>>
>>> For the implementation I'd need an interface which I haven't been able
>>> to find yet. Currently I can trigger the read of several pages / tree
>>> blocks and wait for the completion of each of them. What I'd need would
>>> be an interface that gives me a callback on each completion or a waiting
>>> function that wakes up on each completion with the information which
>>> pages just completed.
>>> One way to achieve this would be to add a hook, but I gladly take any
>>> implementation hints.
>>>
>>> --
>>> Arne
>>>
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
next prev parent reply other threads:[~2011-03-24 1:38 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2011-03-23 13:06 [RFC] Tree fragmentation and prefetching Arne Jansen
2011-03-23 15:01 ` Arne Jansen
2011-03-23 19:26 ` Andrey Kuzmin
2011-03-23 20:28 ` Arne Jansen
2011-03-23 21:47 ` Andrey Kuzmin
2011-03-24 1:38 ` Miao Xie [this message]
2011-03-24 7:29 ` Arne Jansen
2011-03-24 12:45 ` Miao Xie
2011-03-23 19:32 ` Chris Mason
2011-03-25 20:14 ` Arne Jansen
2011-03-25 20:15 ` Chris Mason
2011-03-25 20:53 ` Arne Jansen
2011-03-25 21:01 ` Chris Mason
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=4D8AA0AA.5010006@cn.fujitsu.com \
--to=miaox@cn.fujitsu.com \
--cc=andrey.v.kuzmin@gmail.com \
--cc=chris.mason@oracle.com \
--cc=linux-btrfs@vger.kernel.org \
--cc=sensille@gmx.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.