From: Qu Wenruo <wqu@suse.com>
To: linux-btrfs@vger.kernel.org
Cc: Ritesh Harjani <riteshh@linux.ibm.com>, Qu Wenruo <wqu@suse.com>
Subject: [PATCH v7 16/17] btrfs: fix a subpage relocation data corruption
Date: Mon, 12 Jul 2021 16:30:26 +0800 [thread overview]
Message-ID: <20210712083027.212734-17-wqu@suse.com> (raw)
In-Reply-To: <20210712083027.212734-1-wqu@suse.com>
[BUG]
When using the following script, btrfs will report data corruption after
one data balance with subpage support:
mkfs.btrfs -f -s 4k $dev
mount $dev -o nospace_cache $mnt
$fsstress -w -n 8 -s 1620948986 -d $mnt/ -v > /tmp/fsstress
sync
btrfs balance start -d $mnt
btrfs scrub start -B $mnt
Similar problem can be easily observed in btrfs/028 test case, there
will be tons of balance failure with -EIO.
[CAUSE]
Above fsstress will result the following data extents layout in extent
tree:
item 10 key (13631488 EXTENT_ITEM 98304) itemoff 15889 itemsize 82
refs 2 gen 7 flags DATA
extent data backref root FS_TREE objectid 259 offset 1339392 count 1
extent data backref root FS_TREE objectid 259 offset 647168 count 1
item 11 key (13631488 BLOCK_GROUP_ITEM 8388608) itemoff 15865 itemsize 24
block group used 102400 chunk_objectid 256 flags DATA
item 12 key (13733888 EXTENT_ITEM 4096) itemoff 15812 itemsize 53
refs 1 gen 7 flags DATA
extent data backref root FS_TREE objectid 259 offset 729088 count 1
Then when creating the data reloc inode, the data reloc inode will look
like this:
0 32K 64K 96K 100K 104K
|<------ Extent A ----->| |<- Ext B ->|
Then when we first try to relocate extent A, we setup the data reloc
inode with iszie 96K, then read both page [0, 64K) and page [64K, 128K).
For page 64K, since the isize is just 96K, we fill range [96K, 128K)
with 0 and set it uptodate.
Then when we come to extent B, we update isize to 104K, then try to read
page [64K, 128K).
Then we find the page is already uptodate, so we skip the read.
But range [96K, 128K) is filled with 0, not the real data.
Then we writeback the data reloc inode to disk, with 0 filling range
[96K, 128K), corrupting the content of extent B.
The behavior is caused by the fact that we still do full page read for
subpage case.
The bug won't really happen for regular sectorsize, as one page only
contains one sector.
[FIX]
This patch will fix the problem by invalidating range [isize, PAGE_END]
in prealloc_file_extent_cluster().
So that if above example happens, when we preallocate the file extent
for extent B, we will clear the uptodate bits for range [96K, 128K),
allowing later relocate_one_page() to re-read the needed range.
There is a special note for the invalidating part.
Since we're not calling real btrfs_invalidatepage(), but just clearing
the subpage and page uptodate bits, we can leave a page half dirty and
half out of date.
Reading such page can make btrfs to deadlock, as we normally expect a
dirty page to be full uptodate.
Thus here we flush and wait the data reloc inode before doing the hacked
invalidating.
This won't cause extra overhead, as we're going to writeback the data
later anyway.
Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/relocation.c | 59 +++++++++++++++++++++++++++++++++++++++++++
1 file changed, 59 insertions(+)
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 72ffeb34b92b..cfb33e093150 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -2782,10 +2782,69 @@ static noinline_for_stack int prealloc_file_extent_cluster(
u64 num_bytes;
int nr;
int ret = 0;
+ u64 isize = i_size_read(&inode->vfs_inode);
u64 prealloc_start = cluster->start - offset;
u64 prealloc_end = cluster->end - offset;
u64 cur_offset = prealloc_start;
+ /*
+ * For subpage case, previous isize may not be aligned to PAGE_SIZE.
+ * This means the range [isize, PAGE_END + 1) is filled with 0 by
+ * btrfs_do_readpage() call of previously relocated file cluster.
+ *
+ * If the current cluster starts in above range, btrfs_do_readpage()
+ * will skip the read, and relocate_one_page() will later writeback
+ * the padding 0 as new data, causing data corruption.
+ *
+ * Here we have to manually invalidate the range (isize, PAGE_END + 1).
+ */
+ if (!IS_ALIGNED(isize, PAGE_SIZE)) {
+ struct address_space *mapping = inode->vfs_inode.i_mapping;
+ struct btrfs_fs_info *fs_info = inode->root->fs_info;
+ const u32 sectorsize = fs_info->sectorsize;
+ struct page *page;
+
+ ASSERT(sectorsize < PAGE_SIZE);
+ ASSERT(IS_ALIGNED(isize, sectorsize));
+
+ /*
+ * Btrfs subpage can't handle page with DIRTY but without
+ * UPTODATE bit as it can lead to the following deadlock:
+ * btrfs_readpage()
+ * | Page already *locked*
+ * |- btrfs_lock_and_flush_ordered_range()
+ * |- btrfs_start_ordered_extent()
+ * |- extent_write_cache_pages()
+ * |- lock_page()
+ * We try to lock the page we already hold.
+ *
+ * Here we just writeback the whole data reloc inode, so that
+ * we will be ensured to have no dirty range in the page, and
+ * are safe to clear the uptodate bits.
+ *
+ * This shouldn't cause too much overhead, as we need to write
+ * the data back anyway.
+ */
+ ret = filemap_write_and_wait(mapping);
+ if (ret < 0)
+ return ret;
+
+ clear_extent_bits(&inode->io_tree, isize,
+ round_up(isize, PAGE_SIZE) - 1,
+ EXTENT_UPTODATE);
+ page = find_lock_page(mapping, isize >> PAGE_SHIFT);
+ /*
+ * If page is freed we don't need to do anything then, as
+ * we will re-read the whole page anyway.
+ */
+ if (page) {
+ btrfs_subpage_clear_uptodate(fs_info, page, isize,
+ round_up(isize, PAGE_SIZE) - isize);
+ unlock_page(page);
+ put_page(page);
+ }
+ }
+
BUG_ON(cluster->start != cluster->boundary[0]);
ret = btrfs_alloc_data_chunk_ondemand(inode,
prealloc_end + 1 - prealloc_start);
--
2.32.0
next prev parent reply other threads:[~2021-07-12 8:34 UTC|newest]
Thread overview: 20+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-07-12 8:30 [PATCH v7 00/17] btrfs: add data write support for subpage Qu Wenruo
2021-07-12 8:30 ` [PATCH v7 01/17] btrfs: properly reset @this_bio_flag in btrfs_do_readpage() to avoid inheriting old bio flags to next extent Qu Wenruo
2021-07-16 13:59 ` Anand Jain
2021-07-12 8:30 ` [PATCH v7 02/17] btrfs: fix NULL pointer dereference when reading two compressed extent inside the same page Qu Wenruo
2021-07-12 8:30 ` [PATCH v7 03/17] btrfs: disable compressed readahead for subpage Qu Wenruo
2021-07-12 8:30 ` [PATCH v7 04/17] btrfs: grab correct extent map for subpage compressed extent read Qu Wenruo
2021-07-12 8:30 ` [PATCH v7 05/17] btrfs: rework btrfs_decompress_buf2page() Qu Wenruo
2021-07-12 8:30 ` [PATCH v7 06/17] btrfs: rework lzo_decompress_bio() to make it subpage compatible Qu Wenruo
2021-07-12 8:30 ` [PATCH v7 07/17] btrfs: extract relocation page read and dirty part into its own function Qu Wenruo
2021-07-12 8:30 ` [PATCH v7 08/17] btrfs: make relocate_one_page() to handle subpage case Qu Wenruo
2021-07-12 8:30 ` [PATCH v7 09/17] btrfs: fix wild subpage writeback which does not have ordered extent Qu Wenruo
2021-07-12 8:30 ` [PATCH v7 10/17] btrfs: disable inline extent creation for subpage Qu Wenruo
2021-07-12 8:30 ` [PATCH v7 11/17] btrfs: allow submit_extent_page() to do bio split " Qu Wenruo
2021-07-12 8:30 ` [PATCH v7 12/17] btrfs: reject raid5/6 fs " Qu Wenruo
2021-07-13 0:39 ` Anand Jain
2021-07-12 8:30 ` [PATCH v7 13/17] btrfs: fix a crash caused by race between prepare_pages() and btrfs_releasepage() Qu Wenruo
2021-07-12 8:30 ` [PATCH v7 14/17] btrfs: fix a use-after-free bug in writeback subpage helper Qu Wenruo
2021-07-12 8:30 ` [PATCH v7 15/17] btrfs: fix a subpage false alert for relocating partial preallocated data extents Qu Wenruo
2021-07-12 8:30 ` Qu Wenruo [this message]
2021-07-12 8:30 ` [PATCH v7 17/17] btrfs: allow read-write for 4K sectorsize on 64K page size systems Qu Wenruo
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20210712083027.212734-17-wqu@suse.com \
--to=wqu@suse.com \
--cc=linux-btrfs@vger.kernel.org \
--cc=riteshh@linux.ibm.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox