From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Nikolay Borisov <nborisov@suse.com>, Qu Wenruo <wqu@suse.com>,
linux-btrfs@vger.kernel.org
Subject: Re: [PATCH v2 3/3] btrfs: file-item: refactor btrfs_lookup_bio_sums() to handle out-of-order bvecs
Date: Thu, 29 Oct 2020 20:43:12 +0800 [thread overview]
Message-ID: <55275e3c-1828-7180-e902-8344a571516e@gmx.com> (raw)
In-Reply-To: <765375f6-7c4f-e9d9-f0f5-3de9bac74cce@suse.com>
On 2020/10/29 下午7:50, Nikolay Borisov wrote:
>
>
> On 29.10.20 г. 9:12 ч., Qu Wenruo wrote:
>> Refactor btrfs_lookup_bio_sums() by:
>> - Remove the @file_offset parameter
>> There are two factors making the @file_offset parameter useless:
>> * For csum lookup in csum tree, file offset makes no sense
>> We only need disk_bytenr, which is unrelated to file_offset
>> * page_offset (file offset) of each bvec is not contig
>> bio can be merged, meaning we could have pages at different
>> file offsets in the same bio.
>> Thus passing file_offset makes no sense any longer.
>> The only user of file_offset is for data reloc inode, we will use
>> a new function, search_file_offset_in_bio(), to handle it.
>>
>> - Extract the csum tree lookup into find_csum_tree_sums()
>> The new function will handle the csum search in csum tree.
>> The return value is the same as btrfs_find_ordered_sum(), returning
>> the found number of sectors who has checksum.
>>
>> - Change how we do the main loop
>> The only needed info from bio is:
>> * the on-disk bytenr
>> * the file_offset (if file_offset == (u64)-1)
>> * the length
>>
>> After extracting above info, we can do the search without bio
>> at all, which makes the main loop much simpler:
>>
>> for (cur_disk_bytenr = orig_disk_bytenr;
>> cur_disk_bytenr < orig_disk_bytenr + orig_len;
>> cur_disk_bytenr += count * sectorsize) {
>>
>> /* Lookup csum tree */
>> count = find_csum_tree_sums(fs_info, path, cur_disk_bytenr,
>> search_len, csum_dst);
>> if (!count) {
>> /* Csum hole handling */
>> }
>> }
>>
>> - Use single variable as core to calculate all other offsets
>> Instead of all differnt type of variables, we use only one core
>> variable, cur_disk_bytenr, which represents the current disk bytenr.
>>
>> All involves values can be calculated from that core variable, and
>> all those variable will only be visible in the inner loop.
>>
>> diff_sectors = div_u64(cur_disk_bytenr - orig_disk_bytenr,
>> sectorsize);
>> cur_disk_bytenr = orig_disk_bytenr +
>> diff_sectors * sectorsize;
>> csum_dst = csum + diff_sectors * csum_size;
>>
>> All above refactor makes btrfs_lookup_bio_sums() way more robust than it
>> used to, especially related to the file offset lookup.
>> Now file_offset lookup is only related to data reloc inode, other wise
>> we don't need to bother file_offset at all.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>> fs/btrfs/compression.c | 4 +-
>> fs/btrfs/ctree.h | 2 +-
>> fs/btrfs/file-item.c | 256 ++++++++++++++++++++++++++---------------
>> fs/btrfs/inode.c | 5 +-
>> 4 files changed, 171 insertions(+), 96 deletions(-)
>>
>
> <snip>
>
>
>> diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
>> index acacdd0bfe2c..85e7a3618a07 100644
>> --- a/fs/btrfs/file-item.c
>> +++ b/fs/btrfs/file-item.c
>> @@ -239,39 +239,140 @@ int btrfs_lookup_file_extent(struct btrfs_trans_handle *trans,
>> return ret;
>> }
>>
>> +/*
>> + * Helper to find csums for logical bytenr range
>> + * [disk_bytenr, disk_bytenr + len) and restore the result to @dst.
>
> nit: s/restore/store
>
>> + *
>> + * Return >0 for the number of sectors we found.
>> + * Return 0 for the range [disk_bytenr, disk_bytenr + sectorsize) has no csum
>> + * for it. Caller may want to try next sector until one range is hit.
>> + * Return <0 for fatal error.
>> + */
>> +static int find_csum_tree_sums(struct btrfs_fs_info *fs_info,
>> + struct btrfs_path *path, u64 disk_bytenr,
>> + u64 len, u8 *dst)
>
> A better name would be search_csum_tree.
The naming was designed to follow btrfs_find_ordered_sums(), but that
function just get deleted, thus the new naming has no thing to follow,
and the new name looks pretty sane to me.
>
>> +{
>> + struct btrfs_csum_item *item = NULL;
>> + struct btrfs_key key;
>> + u32 csum_size = btrfs_super_csum_size(fs_info->super_copy);
>
> nit: Why u32, btrfs_super_csum_size is defined as returning 'int',
> however this function is really returning u16 since "struct btrfs_csums"
> is defined as u16.
It was misguided by u32 from btrfs super block, where sectorsize,
nodesize are all u32.
Any recommendation on this? Just u16 for csum_size or follow
nodesize/sectorsize to use u32?
>
>> + u32 sectorsize = fs_info->sectorsize;
>> + int ret;
>> + u64 csum_start;
>> + u64 csum_len;
>> +
>> + ASSERT(IS_ALIGNED(disk_bytenr, sectorsize) &&
>> + IS_ALIGNED(len, sectorsize));
>> +
>> + /* Check if the current csum item covers disk_bytenr */
>> + if (path->nodes[0]) {
>> + item = btrfs_item_ptr(path->nodes[0], path->slots[0],
>> + struct btrfs_csum_item);
>> + btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
>> + csum_start = key.offset;
>> + csum_len = btrfs_item_size_nr(path->nodes[0], path->slots[0]) /
>
> nit: put the division in brackets, it doesn't affect correctness but it
> makes it more obvious since multiplication and division have same
> precedence then associativitiy comes into play...
>
>> + csum_size * sectorsize;
>> +
>> + if (csum_start <= disk_bytenr &&
>> + csum_start + csum_len > disk_bytenr)
>
> if (in_range(disk_bytenr, csum_start, csum_len))
>
>> + goto found;
>> + }
>> +
>> + /* Current item doesn't contain the desired range, re-search */
>> + btrfs_release_path(path);
>> + item = btrfs_lookup_csum(NULL, fs_info->csum_root, path,
>> + disk_bytenr, 0);
>> + if (IS_ERR(item)) {
>> + ret = PTR_ERR(item);
>> + goto out;
>> + }
>> +found:
>
> The found label could be moved right before ret = div_u64 since if you
> go directly to it it's guaranteed you have already done the
> btrfs_item_to_key et al operations so you are simply duplicating them now.
Right, that saves us several instructions.
>
>> + btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
>> + csum_start = key.offset;
>> + csum_len = btrfs_item_size_nr(path->nodes[0], path->slots[0]) /
>> + csum_size * sectorsize;
>
> nit: put brackets around division.
>
>> + ASSERT(csum_start <= disk_bytenr &&
>> + csum_start + csum_len > disk_bytenr);
>
> Use in_range macro
>
>> +
>> + ret = div_u64(min(csum_start + csum_len, disk_bytenr + len) -
>> + disk_bytenr, sectorsize);
>> + read_extent_buffer(path->nodes[0], dst, (unsigned long)item,
>> + ret * csum_size);
>> +out:
>> + if (ret == -ENOENT) {
>> + ret = 0;
>> + memset(dst, 0, csum_size);
>> + }
>> + return ret;
>> +}
>> +
>> +/*
>> + * A helper to locate the file_offset of @cur_disk_bytenr of a @bio.
>> + *
>> + * Bio of btrfs represents read range of
>> + * [bi_sector << 9, bi_sector << 9 + bi_size).
>> + * Knowing this, we can iterate through each bvec to locate the page belong to
>> + * @cur_disk_bytenr and get the file offset.
>> + *
>> + * @inode is used to determine the bvec page really belongs to @inode.
>> + *
>> + * Return 0 if we can't find the file offset;
>> + * Return >0 if we find the file offset and restore it to @file_offset_ret
>> + */
>> +static int search_file_offset_in_bio(struct bio *bio, struct inode *inode,
>> + u64 disk_bytenr, u64 *file_offset_ret)
>> +{
>> + struct bvec_iter iter;
>> + struct bio_vec bvec;
>> + u64 cur = bio->bi_iter.bi_sector << 9;
>> + int ret = 0;
>> +
>> + bio_for_each_segment(bvec, bio, iter) {
>> + struct page *page = bvec.bv_page;
>> +
>> + if (cur > disk_bytenr)
>> + break;
>> + if (cur + bvec.bv_len <= disk_bytenr) {
>> + cur += bvec.bv_len;
>> + continue;
>> + }
>> + ASSERT(cur <= disk_bytenr && cur + bvec.bv_len > disk_bytenr);
>
> in_range(disk_bytenr, cur, bvec.bv_len)
>
>> + if (page && page->mapping && page->mapping->host &&
>> + page->mapping->host == inode) {
> We are guaranteed to always have a page, since this bvec will have been
> aded via bio_add_page, since those will be data pages we are guaranteed
> to have mapping and mapping->host, so you ought to only check if it's
> the same inode, no ?
Page is definitely always here, so the first check is duplicated.
But I doubt about the page->mapping part.
For cases like reading compressed data, IIRC the page can be anonymous
(allocated by alloc_page(), not bound to any inode).
Thus at least we still need to check page->mapping.
>
>> + ret = 1;
>> + *file_offset_ret = page_offset(page) + bvec.bv_offset
>> + + disk_bytenr - cur;
>> + break;
>> + }
>> + }
>> + return ret;
>> +}
>
>> +
> <snip>
>
>> @@ -323,81 +427,53 @@ blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio,
>> path->skip_locking = 1;
>> }
>>
>> - disk_bytenr = (u64)bio->bi_iter.bi_sector << 9;
>> -
>> - bio_for_each_segment(bvec, bio, iter) {
>> - page_bytes_left = bvec.bv_len;
>> - if (count)
>> - goto next;
>> -
>> - if (page_offsets)
>> - offset = page_offset(bvec.bv_page) + bvec.bv_offset;
>> + for (cur_disk_bytenr = orig_disk_bytenr;
>> + cur_disk_bytenr < orig_disk_bytenr + orig_len;
>> + cur_disk_bytenr += count * sectorsize) {
>> + int search_len = orig_disk_bytenr + orig_len - cur_disk_bytenr;
>> + int diff_sectors;
>
> This variable is misnamed, it's really offset_sector or sector_offset.
I have no better alternatives yet, thus would use offset_sector for now.
>
>> + u8 *csum_dst;> +
>> + diff_sectors = div_u64(cur_disk_bytenr - orig_disk_bytenr,
>> + sectorsize);
>> + cur_disk_bytenr = orig_disk_bytenr +
>> + diff_sectors * sectorsize;
>
> Since you already increment cur_disk_bytenr with count * sector size
> where count is the number of csums, why do you have recalculate it here?
Completely bad code from previous iterations forgot to cleanup...
> Furthermore you convert to sectors in diff_sectors and then you multiply
> it by sectorsize which gives you back just cur_disk_bytenr -
> orig_disk_bytenr so the end expression is:
>
> cur_disk_bytenr = orig_disk_bytenr + cur_disk_bytenr - orig_disk_bytenr
> => cur_disk_bytenr = cur_disk_bytenr - am I missing something?
>
>
>> + csum_dst = csum + diff_sectors * csum_size;
>> +
>> + count = find_csum_tree_sums(fs_info, path, cur_disk_bytenr,
>> + search_len, csum_dst);
>
> Missing error handling if count is negative!
Yep, I thought it was only either 0 or 1, but is totally wrong.
Thanks for the review!
Qu
>
>> + if (!count) {
>> + /*
>> + * For not found case, the csum has been zeroed
>> + * in find_csum_tree_sums() already, just skip
>> + * to next sector.
>> + */
>> + count = 1;
>
> <snip>
>
next prev parent reply other threads:[~2020-10-29 12:43 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-10-29 7:12 [PATCH v2 0/3] btrfs: btrfs_lookup_bio_sums() related fixes Qu Wenruo
2020-10-29 7:12 ` [PATCH v2 1/3] btrfs: file-item: use nodesize to determine whether we need readhead for btrfs_lookup_bio_sums() Qu Wenruo
2020-10-29 18:57 ` Josef Bacik
2020-10-29 23:57 ` Qu Wenruo
2020-10-29 7:12 ` [PATCH v2 2/3] btrfs: file-item: remove the btrfs_find_ordered_sum() call in btrfs_lookup_bio_sums() Qu Wenruo
2020-10-29 7:47 ` Nikolay Borisov
2020-10-29 7:53 ` Qu Wenruo
2020-10-29 18:51 ` Josef Bacik
2020-10-29 7:12 ` [PATCH v2 3/3] btrfs: file-item: refactor btrfs_lookup_bio_sums() to handle out-of-order bvecs Qu Wenruo
2020-10-29 11:50 ` Nikolay Borisov
2020-10-29 12:43 ` Qu Wenruo [this message]
2020-10-29 13:54 ` David Sterba
2020-10-29 14:11 ` Qu Wenruo
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=55275e3c-1828-7180-e902-8344a571516e@gmx.com \
--to=quwenruo.btrfs@gmx.com \
--cc=linux-btrfs@vger.kernel.org \
--cc=nborisov@suse.com \
--cc=wqu@suse.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox