public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed
From: Qu Wenruo <wqu@suse.com>
To: linux-btrfs@vger.kernel.org
Subject: [PATCH v2 3/3] btrfs: file-item: refactor btrfs_lookup_bio_sums() to handle out-of-order bvecs
Date: Thu, 29 Oct 2020 15:12:18 +0800	[thread overview]
Message-ID: <20201029071218.49860-4-wqu@suse.com> (raw)
In-Reply-To: <20201029071218.49860-1-wqu@suse.com>

Refactor btrfs_lookup_bio_sums() by:
- Remove the @file_offset parameter
  There are two factors making the @file_offset parameter useless:
  * For csum lookup in csum tree, file offset makes no sense
    We only need disk_bytenr, which is unrelated to file_offset
  * page_offset (file offset) of each bvec is not contig
    bio can be merged, meaning we could have pages at different
    file offsets in the same bio.
  Thus passing file_offset makes no sense any longer.
  The only user of file_offset is for data reloc inode, we will use
  a new function, search_file_offset_in_bio(), to handle it.

- Extract the csum tree lookup into find_csum_tree_sums()
  The new function will handle the csum search in csum tree.
  The return value is the same as btrfs_find_ordered_sum(), returning
  the found number of sectors who has checksum.

- Change how we do the main loop
  The only needed info from bio is:
  * the on-disk bytenr
  * the file_offset (if file_offset == (u64)-1)
  * the length

  After extracting above info, we can do the search without bio
  at all, which makes the main loop much simpler:

	for (cur_disk_bytenr = orig_disk_bytenr;
	     cur_disk_bytenr < orig_disk_bytenr + orig_len;
	     cur_disk_bytenr += count * sectorsize) {

		/* Lookup csum tree */
		count = find_csum_tree_sums(fs_info, path, cur_disk_bytenr,
					    search_len, csum_dst);
		if (!count) {
			/* Csum hole handling */
		}
	}

- Use single variable as core to calculate all other offsets
  Instead of all differnt type of variables, we use only one core
  variable, cur_disk_bytenr, which represents the current disk bytenr.

  All involves values can be calculated from that core variable, and
  all those variable will only be visible in the inner loop.

	diff_sectors = div_u64(cur_disk_bytenr - orig_disk_bytenr,
			       sectorsize);
	cur_disk_bytenr = orig_disk_bytenr +
			  diff_sectors * sectorsize;
	csum_dst = csum + diff_sectors * csum_size;

All above refactor makes btrfs_lookup_bio_sums() way more robust than it
used to, especially related to the file offset lookup.
Now file_offset lookup is only related to data reloc inode, other wise
we don't need to bother file_offset at all.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/compression.c |   4 +-
 fs/btrfs/ctree.h       |   2 +-
 fs/btrfs/file-item.c   | 256 ++++++++++++++++++++++++++---------------
 fs/btrfs/inode.c       |   5 +-
 4 files changed, 171 insertions(+), 96 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 1ab56a734e70..0347cad4136a 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -759,7 +759,7 @@ blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 
 			if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)) {
 				ret = btrfs_lookup_bio_sums(inode, comp_bio,
-							    (u64)-1, sums);
+							    sums);
 				BUG_ON(ret); /* -ENOMEM */
 			}
 
@@ -787,7 +787,7 @@ blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 	BUG_ON(ret); /* -ENOMEM */
 
 	if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)) {
-		ret = btrfs_lookup_bio_sums(inode, comp_bio, (u64)-1, sums);
+		ret = btrfs_lookup_bio_sums(inode, comp_bio, sums);
 		BUG_ON(ret); /* -ENOMEM */
 	}
 
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 0c3ea3599dc7..5580158b344d 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2900,7 +2900,7 @@ struct btrfs_dio_private;
 int btrfs_del_csums(struct btrfs_trans_handle *trans,
 		    struct btrfs_root *root, u64 bytenr, u64 len);
 blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio,
-				   u64 offset, u8 *dst);
+				   u8 *dst);
 int btrfs_insert_file_extent(struct btrfs_trans_handle *trans,
 			     struct btrfs_root *root,
 			     u64 objectid, u64 pos,
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index acacdd0bfe2c..85e7a3618a07 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -239,39 +239,140 @@ int btrfs_lookup_file_extent(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
+/*
+ * Helper to find csums for logical bytenr range
+ * [disk_bytenr, disk_bytenr + len) and restore the result to @dst.
+ *
+ * Return >0 for the number of sectors we found.
+ * Return 0 for the range [disk_bytenr, disk_bytenr + sectorsize) has no csum
+ * for it. Caller may want to try next sector until one range is hit.
+ * Return <0 for fatal error.
+ */
+static int find_csum_tree_sums(struct btrfs_fs_info *fs_info,
+			       struct btrfs_path *path, u64 disk_bytenr,
+			       u64 len, u8 *dst)
+{
+	struct btrfs_csum_item *item = NULL;
+	struct btrfs_key key;
+	u32 csum_size = btrfs_super_csum_size(fs_info->super_copy);
+	u32 sectorsize = fs_info->sectorsize;
+	int ret;
+	u64 csum_start;
+	u64 csum_len;
+
+	ASSERT(IS_ALIGNED(disk_bytenr, sectorsize) &&
+	       IS_ALIGNED(len, sectorsize));
+
+	/* Check if the current csum item covers disk_bytenr */
+	if (path->nodes[0]) {
+		item = btrfs_item_ptr(path->nodes[0], path->slots[0],
+				      struct btrfs_csum_item);
+		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+		csum_start = key.offset;
+		csum_len = btrfs_item_size_nr(path->nodes[0], path->slots[0]) /
+			   csum_size * sectorsize;
+
+		if (csum_start <= disk_bytenr &&
+		    csum_start + csum_len > disk_bytenr)
+			goto found;
+	}
+
+	/* Current item doesn't contain the desired range, re-search */
+	btrfs_release_path(path);
+	item = btrfs_lookup_csum(NULL, fs_info->csum_root, path,
+				 disk_bytenr, 0);
+	if (IS_ERR(item)) {
+		ret = PTR_ERR(item);
+		goto out;
+	}
+found:
+	btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+	csum_start = key.offset;
+	csum_len = btrfs_item_size_nr(path->nodes[0], path->slots[0]) /
+		   csum_size * sectorsize;
+	ASSERT(csum_start <= disk_bytenr &&
+	       csum_start + csum_len > disk_bytenr);
+
+	ret = div_u64(min(csum_start + csum_len, disk_bytenr + len) -
+		      disk_bytenr, sectorsize);
+	read_extent_buffer(path->nodes[0], dst, (unsigned long)item,
+			ret * csum_size);
+out:
+	if (ret == -ENOENT) {
+		ret = 0;
+		memset(dst, 0, csum_size);
+	}
+	return ret;
+}
+
+/*
+ * A helper to locate the file_offset of @cur_disk_bytenr of a @bio.
+ *
+ * Bio of btrfs represents read range of
+ * [bi_sector << 9, bi_sector << 9 + bi_size).
+ * Knowing this, we can iterate through each bvec to locate the page belong to
+ * @cur_disk_bytenr and get the file offset.
+ *
+ * @inode is used to determine the bvec page really belongs to @inode.
+ *
+ * Return 0 if we can't find the file offset;
+ * Return >0 if we find the file offset and restore it to @file_offset_ret
+ */
+static int search_file_offset_in_bio(struct bio *bio, struct inode *inode,
+				     u64 disk_bytenr, u64 *file_offset_ret)
+{
+	struct bvec_iter iter;
+	struct bio_vec bvec;
+	u64 cur = bio->bi_iter.bi_sector << 9;
+	int ret = 0;
+
+	bio_for_each_segment(bvec, bio, iter) {
+		struct page *page = bvec.bv_page;
+
+		if (cur > disk_bytenr)
+			break;
+		if (cur + bvec.bv_len <= disk_bytenr) {
+			cur += bvec.bv_len;
+			continue;
+		}
+		ASSERT(cur <= disk_bytenr && cur + bvec.bv_len > disk_bytenr);
+		if (page && page->mapping && page->mapping->host &&
+		    page->mapping->host == inode) {
+			ret = 1;
+			*file_offset_ret = page_offset(page) + bvec.bv_offset
+				+ disk_bytenr - cur;
+			break;
+		}
+	}
+	return ret;
+}
+
 /**
- * btrfs_lookup_bio_sums - Look up checksums for a read bio.
+ * Lookup the csum for the read bio in csum tree.
  *
  * @inode: inode that the bio is for.
  * @bio: bio to look up.
- * @offset: Unless (u64)-1, look up checksums for this offset in the file.
- *          If (u64)-1, use the page offsets from the bio instead.
- * @dst: Buffer of size nblocks * btrfs_super_csum_size() used to return
- *       checksum (nblocks = bio->bi_iter.bi_size / fs_info->sectorsize). If
+ * @dst: Buffer of size nsectors * btrfs_super_csum_size() used to return
+ *       checksum (nsectors = bio->bi_iter.bi_size / fs_info->sectorsize). If
  *       NULL, the checksum buffer is allocated and returned in
  *       btrfs_io_bio(bio)->csum instead.
  *
  * Return: BLK_STS_RESOURCE if allocating memory fails, BLK_STS_OK otherwise.
  */
 blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio,
-				   u64 offset, u8 *dst)
+				   u8 *dst)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-	struct bio_vec bvec;
-	struct bvec_iter iter;
-	struct btrfs_csum_item *item = NULL;
 	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
 	struct btrfs_path *path;
-	const bool page_offsets = (offset == (u64)-1);
+	u32 csum_size = btrfs_super_csum_size(fs_info->super_copy);
+	u32 sectorsize = fs_info->sectorsize;
+	u64 orig_len = bio->bi_iter.bi_size;
+	u64 orig_disk_bytenr = bio->bi_iter.bi_sector << 9;
+	u64 cur_disk_bytenr;
 	u8 *csum;
-	u64 item_start_offset = 0;
-	u64 item_last_offset = 0;
-	u64 disk_bytenr;
-	u64 page_bytes_left;
-	u32 diff;
-	int nblocks;
+	int nsectors = orig_len >> inode->i_sb->s_blocksize_bits;
 	int count = 0;
-	u16 csum_size = btrfs_super_csum_size(fs_info->super_copy);
 
 	/*
 	 * This function is only called for read bio.
@@ -280,18 +381,21 @@ blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio,
 	 * - All of our csums should only be csum tree
 	 *   No ordered extents csums. As ordered extents are only for write
 	 *   path.
+	 * - No need to bother any other info from bvec
+	 *   Since we're looking up csums, the only important info is the
+	 *   disk_bytenr and the length, which can all be extracted from
+	 *   bi_iter directly.
 	 */
 	ASSERT(bio_op(bio) == REQ_OP_READ);
 	path = btrfs_alloc_path();
 	if (!path)
 		return BLK_STS_RESOURCE;
 
-	nblocks = bio->bi_iter.bi_size >> inode->i_sb->s_blocksize_bits;
 	if (!dst) {
 		struct btrfs_io_bio *btrfs_bio = btrfs_io_bio(bio);
 
-		if (nblocks * csum_size > BTRFS_BIO_INLINE_CSUM_SIZE) {
-			btrfs_bio->csum = kmalloc_array(nblocks, csum_size,
+		if (nsectors * csum_size > BTRFS_BIO_INLINE_CSUM_SIZE) {
+			btrfs_bio->csum = kmalloc_array(nsectors, csum_size,
 							GFP_NOFS);
 			if (!btrfs_bio->csum) {
 				btrfs_free_path(path);
@@ -309,7 +413,7 @@ blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio,
 	 * If needed csum size is larger than a node, kick the readahead for
 	 * csum tree would be a good idea.
 	 */
-	if (nblocks * csum_size > fs_info->nodesize)
+	if (nsectors * csum_size > fs_info->nodesize)
 		path->reada = READA_FORWARD;
 
 	/*
@@ -323,81 +427,53 @@ blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio,
 		path->skip_locking = 1;
 	}
 
-	disk_bytenr = (u64)bio->bi_iter.bi_sector << 9;
-
-	bio_for_each_segment(bvec, bio, iter) {
-		page_bytes_left = bvec.bv_len;
-		if (count)
-			goto next;
-
-		if (page_offsets)
-			offset = page_offset(bvec.bv_page) + bvec.bv_offset;
+	for (cur_disk_bytenr = orig_disk_bytenr;
+	     cur_disk_bytenr < orig_disk_bytenr + orig_len;
+	     cur_disk_bytenr += count * sectorsize) {
+		int search_len = orig_disk_bytenr + orig_len - cur_disk_bytenr;
+		int diff_sectors;
+		u8 *csum_dst;
+
+		diff_sectors = div_u64(cur_disk_bytenr - orig_disk_bytenr,
+				       sectorsize);
+		cur_disk_bytenr = orig_disk_bytenr +
+				  diff_sectors * sectorsize;
+		csum_dst = csum + diff_sectors * csum_size;
+
+		count = find_csum_tree_sums(fs_info, path, cur_disk_bytenr,
+					    search_len, csum_dst);
+		if (!count) {
+			/*
+			 * For not found case, the csum has been zeroed
+			 * in find_csum_tree_sums() already, just skip
+			 * to next sector.
+			 */
+			count = 1;
 
-		if (!item || disk_bytenr < item_start_offset ||
-		    disk_bytenr >= item_last_offset) {
-			struct btrfs_key found_key;
-			u32 item_size;
-
-			if (item)
-				btrfs_release_path(path);
-			item = btrfs_lookup_csum(NULL, fs_info->csum_root,
-						 path, disk_bytenr, 0);
-			if (IS_ERR(item)) {
-				count = 1;
-				memset(csum, 0, csum_size);
-				if (BTRFS_I(inode)->root->root_key.objectid ==
-				    BTRFS_DATA_RELOC_TREE_OBJECTID) {
-					set_extent_bits(io_tree, offset,
-						offset + fs_info->sectorsize - 1,
+			/*
+			 * For data reloc inode, we need to mark the
+			 * range NODATASUM so that balance won't report
+			 * false csum error.
+			 */
+			if (BTRFS_I(inode)->root->root_key.objectid ==
+			    BTRFS_DATA_RELOC_TREE_OBJECTID) {
+				u64 file_offset;
+				int ret;
+
+				ret = search_file_offset_in_bio(bio, inode,
+						cur_disk_bytenr, &file_offset);
+				if (ret)
+					set_extent_bits(io_tree, file_offset,
+						file_offset + sectorsize - 1,
 						EXTENT_NODATASUM);
-				} else {
-					btrfs_info_rl(fs_info,
-						   "no csum found for inode %llu start %llu",
-					       btrfs_ino(BTRFS_I(inode)), offset);
-				}
-				item = NULL;
-				btrfs_release_path(path);
-				goto found;
+			} else {
+				btrfs_warn_rl(fs_info,
+		"csum hole found for disk bytenr range [%llu, %llu)",
+				cur_disk_bytenr, cur_disk_bytenr + sectorsize);
 			}
-			btrfs_item_key_to_cpu(path->nodes[0], &found_key,
-					      path->slots[0]);
-
-			item_start_offset = found_key.offset;
-			item_size = btrfs_item_size_nr(path->nodes[0],
-						       path->slots[0]);
-			item_last_offset = item_start_offset +
-				(item_size / csum_size) *
-				fs_info->sectorsize;
-			item = btrfs_item_ptr(path->nodes[0], path->slots[0],
-					      struct btrfs_csum_item);
-		}
-		/*
-		 * this byte range must be able to fit inside
-		 * a single leaf so it will also fit inside a u32
-		 */
-		diff = disk_bytenr - item_start_offset;
-		diff = diff / fs_info->sectorsize;
-		diff = diff * csum_size;
-		count = min_t(int, nblocks, (item_last_offset - disk_bytenr) >>
-					    inode->i_sb->s_blocksize_bits);
-		read_extent_buffer(path->nodes[0], csum,
-				   ((unsigned long)item) + diff,
-				   csum_size * count);
-found:
-		csum += count * csum_size;
-		nblocks -= count;
-next:
-		while (count > 0) {
-			count--;
-			disk_bytenr += fs_info->sectorsize;
-			offset += fs_info->sectorsize;
-			page_bytes_left -= fs_info->sectorsize;
-			if (!page_bytes_left)
-				break; /* move to next bio */
 		}
 	}
 
-	WARN_ON_ONCE(count);
 	btrfs_free_path(path);
 	return BLK_STS_OK;
 }
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index c3d32f4858d5..0c9441b34709 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2233,7 +2233,7 @@ static blk_status_t btrfs_submit_bio_hook(struct inode *inode, struct bio *bio,
 							   bio_flags);
 			goto out;
 		} else if (!skip_sum) {
-			ret = btrfs_lookup_bio_sums(inode, bio, (u64)-1, NULL);
+			ret = btrfs_lookup_bio_sums(inode, bio, NULL);
 			if (ret)
 				goto out;
 		}
@@ -7839,8 +7839,7 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
 		 * Load the csums up front to reduce csum tree searches and
 		 * contention when submitting bios.
 		 */
-		status = btrfs_lookup_bio_sums(inode, dio_bio, file_offset,
-					       dip->csums);
+		status = btrfs_lookup_bio_sums(inode, dio_bio, dip->csums);
 		if (status != BLK_STS_OK)
 			goto out_err;
 	}
-- 
2.29.1


  parent reply	other threads:[~2020-10-29  7:23 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-10-29  7:12 [PATCH v2 0/3] btrfs: btrfs_lookup_bio_sums() related fixes Qu Wenruo
2020-10-29  7:12 ` [PATCH v2 1/3] btrfs: file-item: use nodesize to determine whether we need readhead for btrfs_lookup_bio_sums() Qu Wenruo
2020-10-29 18:57   ` Josef Bacik
2020-10-29 23:57     ` Qu Wenruo
2020-10-29  7:12 ` [PATCH v2 2/3] btrfs: file-item: remove the btrfs_find_ordered_sum() call in btrfs_lookup_bio_sums() Qu Wenruo
2020-10-29  7:47   ` Nikolay Borisov
2020-10-29  7:53     ` Qu Wenruo
2020-10-29 18:51   ` Josef Bacik
2020-10-29  7:12 ` Qu Wenruo [this message]
2020-10-29 11:50   ` [PATCH v2 3/3] btrfs: file-item: refactor btrfs_lookup_bio_sums() to handle out-of-order bvecs Nikolay Borisov
2020-10-29 12:43     ` Qu Wenruo
2020-10-29 13:54       ` David Sterba
2020-10-29 14:11         ` Qu Wenruo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20201029071218.49860-4-wqu@suse.com \
    --to=wqu@suse.com \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox