From: Robbie Ko <robbieko@synology.com>
To: dsterba@suse.cz, linux-btrfs@vger.kernel.org, wqu@suse.com
Subject: Re: [PATCH v2] btrfs: speedup mount time with readahead chunk tree
Date: Thu, 9 Jul 2020 10:38:42 +0800
Message-ID: <0358f6f6-da68-94a4-f3ed-718e5caeded4@synology.com>
In-Reply-To: <20200708211142.GD28832@twin.jikos.cz>
David Sterba wrote on 2020/7/9 at 5:11 AM:
> On Tue, Jul 07, 2020 at 09:25:11PM +0200, David Sterba wrote:
>> On Tue, Jul 07, 2020 at 11:59:44AM +0800, robbieko wrote:
>>> From: Robbie Ko <robbieko@synology.com>
>>>
>>> When mounting, we always need to read the whole chunk tree.
>>> When there are too many chunk items, most of the time is
>>> spent in btrfs_read_chunk_tree, because we only read one
>>> leaf at a time.
>>>
>>> It is unreasonable to limit the readahead mechanism to a
>>> range of 64k, so we have removed that limit.
>>>
>>> In addition, we added reada_maximum_size to customize the
>>> readahead size. The default is 64k to maintain the
>>> original behavior.
>>>
>>> So we fix this by using the readahead mechanism and setting the
>>> readahead max size to ULLONG_MAX, which reads all the leaves after
>>> the key in the node when reading a level 1 node.
>> The readahead of the chunk tree is a special case as we know we will need
>> the whole tree; in all other cases the search readahead is
>> supposed to read only one leaf.
>>
>> For that reason I don't want to touch the current path readahead logic
>> at all and do the chunk tree readahead in one go instead of the
>> per-search.
>>
>> Also, I don't like to see the size of btrfs_path increase just for
>> this one custom use.
>>
>> The idea of the whole tree readahead is to do something like:
>>
>> - find first item
>> - start readahead on all leaves from its level 1 node parent
>>   (readahead_tree_block)
>> - when the level 1 parent changes during iterating items, start the
>>   readahead again
>>
>> This skips readahead of all nodes above level 1, if you find a nicer way
>> to readahead the whole tree I won't object, but for the first
>> implementation the level 1 seems ok to me.
> Patch below. I tried to create a large system chunk by fallocate on a
> sparse loop device, but got only 1 node on level 1, so the readahead
> cannot show its full effect.
>
> # btrfs fi df .
> Data, single: total=59.83TiB, used=59.83TiB
> System, single: total=36.00MiB, used=6.20MiB
> Metadata, single: total=1.01GiB, used=91.78MiB
> GlobalReserve, single: total=26.80MiB, used=0.00B
>
> There were 395 leaf nodes that got read ahead, time between the first
> and last is 0.83s and the block group tree read took about 40 seconds.
> This was in a VM with file-backed images, and the loop device was
> constructed from these devices so it's spinning rust.
>
> I don't have results for non-prefetched mount to compare at the moment.
>
I think your approach works.
But there are many similar cases that need the same kind of improvement:
1. load_free_space_tree
   We need to read all BTRFS_FREE_SPACE_BITMAP_KEY and
   BTRFS_FREE_SPACE_EXTENT_KEY items until the next
   BTRFS_FREE_SPACE_INFO_KEY.
2. populate_free_space_tree
   We need to read all BTRFS_EXTENT_ITEM_KEY and BTRFS_METADATA_ITEM_KEY
   items until the end of the block group.
3. btrfs_real_readdir
   We need to read ahead as many (inode, BTRFS_DIR_INDEX_KEY) items as
   possible.
4. btrfs_clone
   We need to read ahead as many (inode, BTRFS_EXTENT_DATA_KEY) items as
   possible.
5. btrfs_verify_dev_extents
   We need to read all the BTRFS_DEV_EXTENT_KEY items.
6. caching_kthread (inode-map.c)
   We need all the BTRFS_INODE_ITEM_KEY items of the fs_tree to build
   the inode map.
It is not practical to write special-purpose readahead code for each of
the above cases.
We need to provide a new readahead framework that lets the caller
determine the scope of the readahead it needs.
The possible readahead parameters are as follows:
1. reada_maximum_nr: the maximum number of leaves to read ahead at a time.
2. reada_max_key: early stop condition for READA_FORWARD.
3. reada_min_key: early stop condition for READA_BACK.
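
As a rough illustration of how these three parameters could bound the
readahead, here is a small user-space sketch. It is not kernel code and not
a proposed patch: struct reada_params, enum reada_dir and reada_select() are
hypothetical names, and keys are flattened to a single u64 instead of the
real (objectid, type, offset) triple. It only picks which child slots of a
level 1 node would be read ahead; in the kernel each selected slot would
then go through btrfs_node_blockptr() and readahead_tree_block().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical direction enum, standing in for READA_BACK/READA_FORWARD. */
enum reada_dir { READA_DIR_BACK, READA_DIR_FORWARD };

/* Hypothetical container for the three parameters listed above. */
struct reada_params {
	unsigned int max_nr;	/* reada_maximum_nr: leaves per readahead */
	uint64_t max_key;	/* reada_max_key: forward stop condition */
	uint64_t min_key;	/* reada_min_key: backward stop condition */
};

/*
 * first_keys[i] is the first key of child i of a level 1 node, sorted
 * ascending. Starting next to @slot, collect the child slots that should
 * be read ahead into @out and return how many were selected.
 */
size_t reada_select(const uint64_t *first_keys, size_t nr, size_t slot,
		    enum reada_dir dir, const struct reada_params *p,
		    size_t *out, size_t out_cap)
{
	size_t n = 0;
	size_t i;

	assert(first_keys && p && out);
	assert(slot < nr);

	if (dir == READA_DIR_FORWARD) {
		for (i = slot + 1; i < nr; i++) {
			if (n >= p->max_nr || n >= out_cap)
				break;
			/* Child i starts beyond max_key: nothing useful. */
			if (first_keys[i] > p->max_key)
				break;
			out[n++] = i;
		}
	} else {
		for (i = slot; i > 0; i--) {
			if (n >= p->max_nr || n >= out_cap)
				break;
			/* Child i - 1 only holds keys below first_keys[i]. */
			if (first_keys[i] <= p->min_key)
				break;
			out[n++] = i - 1;
		}
	}
	return n;
}
```

With max_nr set very large and max_key set to the last key of interest, a
caller such as the chunk tree reader would get "read everything up to here",
while the default search path could keep a small max_nr to stay close to the
current behavior.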
We also need to review all users of readahead to confirm how readahead
behaves in each of them.
For example, in scrub_enumerate_chunks readahead has very little effect,
because most of the time is spent in scrub_chunk: processing scrub_chunk
for all the DEV_EXTENT items in one leaf takes a long time.
If the dev tree has been modified in the meantime, the previously
read-ahead leaves may be useless.
> ----
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index c7a3d4d730a3..e19891243199 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -7013,6 +7013,19 @@ bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info,
>  	return ret;
>  }
>
> +void readahead_tree_node_children(struct extent_buffer *node)
> +{
> +	int i;
> +	const int nr_items = btrfs_header_nritems(node);
> +
> +	for (i = 0; i < nr_items; i++) {
> +		u64 start;
> +
> +		start = btrfs_node_blockptr(node, i);
> +		readahead_tree_block(node->fs_info, start);
> +	}
> +}
> +
>  int btrfs_read_chunk_tree(struct btrfs_fs_info *fs_info)
>  {
>  	struct btrfs_root *root = fs_info->chunk_root;
> @@ -7023,6 +7036,7 @@ int btrfs_read_chunk_tree(struct btrfs_fs_info *fs_info)
>  	int ret;
>  	int slot;
>  	u64 total_dev = 0;
> +	u64 last_ra_node = 0;
>
>  	path = btrfs_alloc_path();
>  	if (!path)
> @@ -7048,6 +7062,8 @@ int btrfs_read_chunk_tree(struct btrfs_fs_info *fs_info)
>  	if (ret < 0)
>  		goto error;
>  	while (1) {
> +		struct extent_buffer *node;
> +
>  		leaf = path->nodes[0];
>  		slot = path->slots[0];
>  		if (slot >= btrfs_header_nritems(leaf)) {
> @@ -7058,6 +7074,13 @@ int btrfs_read_chunk_tree(struct btrfs_fs_info *fs_info)
>  				goto error;
>  			break;
>  		}
> +		node = path->nodes[1];
> +		if (node) {
> +			if (last_ra_node != node->start) {
> +				readahead_tree_node_children(node);
> +				last_ra_node = node->start;
> +			}
> +		}
>  		btrfs_item_key_to_cpu(leaf, &found_key, slot);
>  		if (found_key.type == BTRFS_DEV_ITEM_KEY) {
>  			struct btrfs_dev_item *dev_item;