[PATCH AUTOSEL 5.14 190/252] btrfs: remove racy and unnecessary inode transaction update when using no-holes

public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH AUTOSEL 5.14 190/252] btrfs: remove racy and unnecessary inode transaction update when using no-holes
       [not found] <20210909114106.141462-1-sashal@kernel.org>
@ 2021-09-09 11:40 ` Sasha Levin
  2021-09-09 11:40 ` [PATCH AUTOSEL 5.14 191/252] btrfs: reset this_bio_flag to avoid inheriting old flags Sasha Levin
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Sasha Levin @ 2021-09-09 11:40 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Filipe Manana, Josef Bacik, David Sterba, Sasha Levin,
	linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

[ Upstream commit cceaa89f02f15f232391ae4be214137b0a0285c0 ]

When using the NO_HOLES feature and expanding the size of an inode, we
update the inode's last_trans, last_sub_trans and last_log_commit fields
at maybe_insert_hole() so that a fsync does know that the inode needs to
be logged (by making sure that btrfs_inode_in_log() returns false). This
happens for expanding truncate operations, buffered writes, direct IO
writes and when cloning extents to an offset greater than the inode's
i_size.

However the way we do it is racy, because in between setting the inode's
last_sub_trans and last_log_commit fields, the log transaction ID that was
assigned to last_sub_trans might be committed before we read the root's
last_log_commit and assign that value to last_log_commit. If that happens
it would make a future call to btrfs_inode_in_log() return true. This is
a race that should be extremely unlikely to be hit in practice, and it is
the same that was described by commit bc0939fcfab0d7 ("btrfs: fix race
between marking inode needs to be logged and log syncing").

The fix would simply be to set last_log_commit to the value we assigned
to last_sub_trans minus 1, like it was done in that commit. However
updating these two fields plus the last_trans field is pointless here
because all the callers of btrfs_cont_expand() (which is the only
caller of maybe_insert_hole()) always call btrfs_set_inode_last_trans()
or btrfs_update_inode() after calling btrfs_cont_expand(). Calling either
btrfs_set_inode_last_trans() or btrfs_update_inode() guarantees that the
next fsync will log the inode, as it makes btrfs_inode_in_log() return
false.

So just remove the code that explicitly sets the inode's last_trans,
last_sub_trans and last_log_commit fields.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/btrfs/inode.c | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index bd5689fa290e..ab04f9dd465e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5088,15 +5088,13 @@ static int maybe_insert_hole(struct btrfs_root *root, struct btrfs_inode *inode,
 	int ret;

 	/*
-	 * Still need to make sure the inode looks like it's been updated so
-	 * that any holes get logged if we fsync.
+	 * If NO_HOLES is enabled, we don't need to do anything.
+	 * Later, up in the call chain, either btrfs_set_inode_last_sub_trans()
+	 * or btrfs_update_inode() will be called, which guarantee that the next
+	 * fsync will know this inode was changed and needs to be logged.
 	 */
-	if (btrfs_fs_incompat(fs_info, NO_HOLES)) {
-		inode->last_trans = fs_info->generation;
-		inode->last_sub_trans = root->log_transid;
-		inode->last_log_commit = root->last_log_commit;
+	if (btrfs_fs_incompat(fs_info, NO_HOLES))
 		return 0;
-	}

 	/*
 	 * 1 - for the one we're dropping
-- 
2.30.2

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH AUTOSEL 5.14 191/252] btrfs: reset this_bio_flag to avoid inheriting old flags
       [not found] <20210909114106.141462-1-sashal@kernel.org>
  2021-09-09 11:40 ` [PATCH AUTOSEL 5.14 190/252] btrfs: remove racy and unnecessary inode transaction update when using no-holes Sasha Levin
@ 2021-09-09 11:40 ` Sasha Levin
  2021-09-09 11:55   ` David Sterba
  2021-09-09 11:40 ` [PATCH AUTOSEL 5.14 192/252] btrfs: subpage: check if there are compressed extents inside one page Sasha Levin
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 15+ messages in thread
From: Sasha Levin @ 2021-09-09 11:40 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Qu Wenruo, Anand Jain, David Sterba, Sasha Levin, linux-btrfs

From: Qu Wenruo <wqu@suse.com>

[ Upstream commit 4c37a7938496756649096a7ec26320eb8b0d90fb ]

In btrfs_do_readpage(), we never reset @this_bio_flag after we hit a
compressed extent.

This is fine, as for PAGE_SIZE == sectorsize case, we can only have one
sector for one page, thus @this_bio_flag will only be set at most once.

But for subpage case, after hitting a compressed extent, @this_bio_flag
will always have EXTENT_BIO_COMPRESSED bit, even we're reading a regular
extent.

This will lead to various read errors, and causing new ASSERT() in
incoming subpage patches, which adds more strict check in
btrfs_submit_compressed_read().

Fix it by declaring @this_bio_flag inside the main loop and reset its
value for each iteration.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/btrfs/extent_io.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 9e81d25dea70..907cb88fb30b 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3488,7 +3488,6 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
 	size_t pg_offset = 0;
 	size_t iosize;
 	size_t blocksize = inode->i_sb->s_blocksize;
-	unsigned long this_bio_flag = 0;
 	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
 
 	ret = set_page_extent_mapped(page);
@@ -3519,6 +3518,7 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
 	}
 	begin_page_read(fs_info, page);
 	while (cur <= end) {
+		unsigned long this_bio_flag = 0;
 		bool force_bio_submit = false;
 		u64 disk_bytenr;
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH AUTOSEL 5.14 192/252] btrfs: subpage: check if there are compressed extents inside one page
       [not found] <20210909114106.141462-1-sashal@kernel.org>
  2021-09-09 11:40 ` [PATCH AUTOSEL 5.14 190/252] btrfs: remove racy and unnecessary inode transaction update when using no-holes Sasha Levin
  2021-09-09 11:40 ` [PATCH AUTOSEL 5.14 191/252] btrfs: reset this_bio_flag to avoid inheriting old flags Sasha Levin
@ 2021-09-09 11:40 ` Sasha Levin
  2021-09-09 11:55   ` David Sterba
  2021-09-09 11:40 ` [PATCH AUTOSEL 5.14 193/252] btrfs: grab correct extent map for subpage compressed extent read Sasha Levin
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 15+ messages in thread
From: Sasha Levin @ 2021-09-09 11:40 UTC (permalink / raw)
  To: linux-kernel, stable; +Cc: Qu Wenruo, David Sterba, Sasha Levin, linux-btrfs

From: Qu Wenruo <wqu@suse.com>

[ Upstream commit 3670e6451bc9c39ab3a46f1da19360219e4319f3 ]

[BUG]
When testing experimental subpage compressed write support, it hits a
NULL pointer dereference inside read path:

 Unable to handle kernel NULL pointer dereference at virtual address 0000000000000018
 pc : __pi_memcmp+0x28/0x1ec
 lr : check_data_csum+0xd0/0x274 [btrfs]
 Call trace:
  __pi_memcmp+0x28/0x1ec
  btrfs_verify_data_csum+0xf4/0x244 [btrfs]
  end_bio_extent_readpage+0x1d0/0x6b0 [btrfs]
  bio_endio+0x15c/0x1dc
  end_workqueue_fn+0x44/0x64 [btrfs]
  btrfs_work_helper+0x74/0x250 [btrfs]
  process_one_work+0x1d4/0x47c
  worker_thread+0x180/0x400
  kthread+0x11c/0x120
  ret_from_fork+0x10/0x30
 Code: 54000261 d100044c d343fd8c f8408403 (f8408424)
 ---[ end trace 9e2c59f33ea40866 ]---

[CAUSE]
When reading two compressed extents inside the same page, like the
following layout, we trigger above crash:

	0	32K	64K
	|-------|\\\\\\\|
	     |	     \- Compressed extent (A)
	     \--------- Compressed extent (B)

For compressed read, we don't need to populate its io_bio->csum, as we
rely on compressed_bio->csum to verify the compressed data, and then
copy the decompressed to inode pages.

Normally btrfs_verify_data_csum() skip such page by checking and
clearing its PageChecked flag

But since that flag is still for the full page, when endio for inode
page range [0, 32K) gets executed, it clears PageChecked flag for the
full page.

Then when endio for inode page range [32K, 64K) gets executed, since the
page no longer has PageChecked flag, it just continues checking, even
though io_bio->csum is NULL.

[FIX]
Thankfully there are only two users of PageChecked bit:

- Cow fixup
  Since subpage has its own way to trace page dirty (dirty_bitmap) and
  ordered bit (ordered_bitmap), it should never trigger cow fixup.

- Compressed read
  We can distinguish such read by just checking io_bio->csum.

So just check io_bio->csum before doing the verification to avoid such
NULL pointer dereference.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/btrfs/inode.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ab04f9dd465e..ef6e1798fc2d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3257,6 +3257,20 @@ unsigned int btrfs_verify_data_csum(struct btrfs_io_bio *io_bio, u32 bio_offset,
 		return 0;
 	}
 
+	/*
+	 * For subpage case, above PageChecked is not safe as it's not subpage
+	 * compatible.
+	 * But for now only cow fixup and compressed read utilize PageChecked
+	 * flag, while in this context we can easily use io_bio->csum to
+	 * determine if we really need to do csum verification.
+	 *
+	 * So for now, just exit if io_bio->csum is NULL, as it means it's
+	 * compressed read, and its compressed data csum has already been
+	 * verified.
+	 */
+	if (io_bio->csum == NULL)
+		return 0;
+
 	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)
 		return 0;
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH AUTOSEL 5.14 193/252] btrfs: grab correct extent map for subpage compressed extent read
       [not found] <20210909114106.141462-1-sashal@kernel.org>
                   ` (2 preceding siblings ...)
  2021-09-09 11:40 ` [PATCH AUTOSEL 5.14 192/252] btrfs: subpage: check if there are compressed extents inside one page Sasha Levin
@ 2021-09-09 11:40 ` Sasha Levin
  2021-09-09 11:55   ` David Sterba
  2021-09-09 11:40 ` [PATCH AUTOSEL 5.14 194/252] btrfs: subpage: fix race between prepare_pages() and btrfs_releasepage() Sasha Levin
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 15+ messages in thread
From: Sasha Levin @ 2021-09-09 11:40 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Qu Wenruo, Anand Jain, David Sterba, Sasha Levin, linux-btrfs

From: Qu Wenruo <wqu@suse.com>

[ Upstream commit 557023ea9f06baf2659b232b08b8e8711f7001a6 ]

[BUG]
When subpage compressed read write support is enabled, btrfs/038 always
fails with EIO.

A simplified script can easily trigger the problem:

  mkfs.btrfs -f -s 4k $dev
  mount $dev $mnt -o compress=lzo

  xfs_io -f -c "truncate 118811" $mnt/foo
  xfs_io -c "pwrite -S 0x0d -b 39987 92267 39987" $mnt/foo > /dev/null

  sync
  btrfs subvolume snapshot -r $mnt $mnt/mysnap1

  xfs_io -c "pwrite -S 0x3e -b 80000 200000 80000" $mnt/foo > /dev/null
  sync

  xfs_io -c "pwrite -S 0xdc -b 10000 250000 10000" $mnt/foo > /dev/null
  xfs_io -c "pwrite -S 0xff -b 10000 300000 10000" $mnt/foo > /dev/null

  sync
  btrfs subvolume snapshot -r $mnt $mnt/mysnap2

  cat $mnt/mysnap2/foo
  # Above cat will fail due to EIO

[CAUSE]
The problem is in btrfs_submit_compressed_read().

When it tries to grab the extent map of the read range, it uses the
following call:

	em = lookup_extent_mapping(em_tree,
  				   page_offset(bio_first_page_all(bio)),
				   fs_info->sectorsize);

The problem is in the page_offset(bio_first_page_all(bio)) part.

The offending inode has the following file extent layout

        item 10 key (257 EXTENT_DATA 131072) itemoff 15639 itemsize 53
                generation 8 type 1 (regular)
                extent data disk byte 13680640 nr 4096
                extent data offset 0 nr 4096 ram 4096
                extent compression 0 (none)
        item 11 key (257 EXTENT_DATA 135168) itemoff 15586 itemsize 53
                generation 8 type 1 (regular)
                extent data disk byte 0 nr 0
        item 12 key (257 EXTENT_DATA 196608) itemoff 15533 itemsize 53
                generation 8 type 1 (regular)
                extent data disk byte 13676544 nr 4096
                extent data offset 0 nr 53248 ram 86016
                extent compression 2 (lzo)

And the bio passed in has the following parameters:

page_offset(bio_first_page_all(bio))	= 131072
bio_first_bvec_all(bio)->bv_offset	= 65536

If we use page_offset(bio_first_page_all(bio) without adding bv_offset,
we will get an extent map for file offset 131072, not 196608.

This means we read uncompressed data from disk, and later decompression
will definitely fail.

[FIX]
Take bv_offset into consideration when trying to grab an extent map.

And add an ASSERT() to ensure we're really getting a compressed extent.

Thankfully this won't affect anything but subpage, thus we only need to
ensure this patch get merged before we enabled basic subpage support.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/btrfs/compression.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 30d82cdf128c..ad700cf287c9 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -673,6 +673,7 @@ blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 	struct page *page;
 	struct bio *comp_bio;
 	u64 cur_disk_byte = bio->bi_iter.bi_sector << 9;
+	u64 file_offset;
 	u64 em_len;
 	u64 em_start;
 	struct extent_map *em;
@@ -682,15 +683,17 @@ blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 
 	em_tree = &BTRFS_I(inode)->extent_tree;
 
+	file_offset = bio_first_bvec_all(bio)->bv_offset +
+		      page_offset(bio_first_page_all(bio));
+
 	/* we need the actual starting offset of this extent in the file */
 	read_lock(&em_tree->lock);
-	em = lookup_extent_mapping(em_tree,
-				   page_offset(bio_first_page_all(bio)),
-				   fs_info->sectorsize);
+	em = lookup_extent_mapping(em_tree, file_offset, fs_info->sectorsize);
 	read_unlock(&em_tree->lock);
 	if (!em)
 		return BLK_STS_IOERR;
 
+	ASSERT(em->compress_type != BTRFS_COMPRESS_NONE);
 	compressed_len = em->block_len;
 	cb = kmalloc(compressed_bio_size(fs_info, compressed_len), GFP_NOFS);
 	if (!cb)
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH AUTOSEL 5.14 194/252] btrfs: subpage: fix race between prepare_pages() and btrfs_releasepage()
       [not found] <20210909114106.141462-1-sashal@kernel.org>
                   ` (3 preceding siblings ...)
  2021-09-09 11:40 ` [PATCH AUTOSEL 5.14 193/252] btrfs: grab correct extent map for subpage compressed extent read Sasha Levin
@ 2021-09-09 11:40 ` Sasha Levin
  2021-09-09 11:55   ` David Sterba
  2021-09-09 11:40 ` [PATCH AUTOSEL 5.14 195/252] btrfs: subpage: fix a potential use-after-free in writeback helper Sasha Levin
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 15+ messages in thread
From: Sasha Levin @ 2021-09-09 11:40 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Qu Wenruo, Ritesh Harjani, Filipe Manana, David Sterba,
	Sasha Levin, linux-btrfs

From: Qu Wenruo <wqu@suse.com>

[ Upstream commit e0467866198f7f536806f39e5d0d91ae8018de08 ]

[BUG]
When running generic/095, there is a high chance to crash with subpage
data RW support:

 assertion failed: PagePrivate(page) && page->private
 ------------[ cut here ]------------
 kernel BUG at fs/btrfs/ctree.h:3403!
 Internal error: Oops - BUG: 0 [#1] SMP
 CPU: 1 PID: 3567 Comm: fio Tainted: 5.12.0-rc7-custom+ #17
 Hardware name: Khadas VIM3 (DT)
 Call trace:
  assertfail.constprop.0+0x28/0x2c [btrfs]
  btrfs_subpage_assert+0x80/0xa0 [btrfs]
  btrfs_subpage_set_uptodate+0x34/0xec [btrfs]
  btrfs_page_clamp_set_uptodate+0x74/0xa4 [btrfs]
  btrfs_dirty_pages+0x160/0x270 [btrfs]
  btrfs_buffered_write+0x444/0x630 [btrfs]
  btrfs_direct_write+0x1cc/0x2d0 [btrfs]
  btrfs_file_write_iter+0xc0/0x160 [btrfs]
  new_sync_write+0xe8/0x180
  vfs_write+0x1b4/0x210
  ksys_pwrite64+0x7c/0xc0
  __arm64_sys_pwrite64+0x24/0x30
  el0_svc_common.constprop.0+0x70/0x140
  do_el0_svc+0x28/0x90
  el0_svc+0x2c/0x54
  el0_sync_handler+0x1a8/0x1ac
  el0_sync+0x170/0x180
 Code: f0000160 913be042 913c4000 955444bc (d4210000)
 ---[ end trace 3fdd39f4cccedd68 ]---

[CAUSE]
Although prepare_pages() calls find_or_create_page(), which returns the
page locked, but in later prepare_uptodate_page() calls, we may call
btrfs_readpage() which will unlock the page before it returns.

This leaves a window where btrfs_releasepage() can sneak in and release
the page, clearing page->private and causing above ASSERT().

[FIX]
In prepare_uptodate_page(), we should not only check page->mapping, but
also PagePrivate() to ensure we are still holding the correct page which
has proper fs context setup.

Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/btrfs/file.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index ee34497500e1..8c57af3702fa 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1340,7 +1340,18 @@ static int prepare_uptodate_page(struct inode *inode,
 			unlock_page(page);
 			return -EIO;
 		}
-		if (page->mapping != inode->i_mapping) {
+
+		/*
+		 * Since btrfs_readpage() will unlock the page before it
+		 * returns, there is a window where btrfs_releasepage() can
+		 * be called to release the page.
+		 * Here we check both inode mapping and PagePrivate() to
+		 * make sure the page was not released.
+		 *
+		 * The private flag check is essential for subpage as we need
+		 * to store extra bitmap using page->private.
+		 */
+		if (page->mapping != inode->i_mapping || !PagePrivate(page)) {
 			unlock_page(page);
 			return -EAGAIN;
 		}
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH AUTOSEL 5.14 195/252] btrfs: subpage: fix a potential use-after-free in writeback helper
       [not found] <20210909114106.141462-1-sashal@kernel.org>
                   ` (4 preceding siblings ...)
  2021-09-09 11:40 ` [PATCH AUTOSEL 5.14 194/252] btrfs: subpage: fix race between prepare_pages() and btrfs_releasepage() Sasha Levin
@ 2021-09-09 11:40 ` Sasha Levin
  2021-09-09 11:55   ` David Sterba
  2021-09-09 11:40 ` [PATCH AUTOSEL 5.14 196/252] btrfs: subpage: fix false alert when relocating partial preallocated data extents Sasha Levin
  2021-09-09 11:40 ` [PATCH AUTOSEL 5.14 197/252] btrfs: tree-log: check btrfs_lookup_data_extent return value Sasha Levin
  7 siblings, 1 reply; 15+ messages in thread
From: Sasha Levin @ 2021-09-09 11:40 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Qu Wenruo, Ritesh Harjani, David Sterba, Sasha Levin, linux-btrfs

From: Qu Wenruo <wqu@suse.com>

[ Upstream commit 7c11d0ae439565b4560b0c0f36bf05171ed1a146 ]

[BUG]
There is a possible use-after-free bug when running generic/095.

 BUG: Unable to handle kernel data access on write at 0x6b6b6b6b6b6b725b
 Faulting instruction address: 0xc000000000283654
 c000000000283078 do_raw_spin_unlock+0x88/0x230
 c0000000012b1e14 _raw_spin_unlock_irqrestore+0x44/0x90
 c000000000a918dc btrfs_subpage_clear_writeback+0xac/0xe0
 c0000000009e0458 end_bio_extent_writepage+0x158/0x270
 c000000000b6fd14 bio_endio+0x254/0x270
 c0000000009fc0f0 btrfs_end_bio+0x1a0/0x200
 c000000000b6fd14 bio_endio+0x254/0x270
 c000000000b781fc blk_update_request+0x46c/0x670
 c000000000b8b394 blk_mq_end_request+0x34/0x1d0
 c000000000d82d1c lo_complete_rq+0x11c/0x140
 c000000000b880a4 blk_complete_reqs+0x84/0xb0
 c0000000012b2ca4 __do_softirq+0x334/0x680
 c0000000001dd878 irq_exit+0x148/0x1d0
 c000000000016f4c do_IRQ+0x20c/0x240
 c000000000009240 hardware_interrupt_common_virt+0x1b0/0x1c0

[CAUSE]
There is very small race window like the following in generic/095.

	Thread 1		|		Thread 2
--------------------------------+------------------------------------
  end_bio_extent_writepage()	| btrfs_releasepage()
  |- spin_lock_irqsave()	| |
  |- end_page_writeback()	| |
  |				| |- if (PageWriteback() ||...)
  |				| |- clear_page_extent_mapped()
  |				|    |- kfree(subpage);
  |- spin_unlock_irqrestore().

The race can also happen between writeback and btrfs_invalidatepage(),
although that would be much harder as btrfs_invalidatepage() has much
more work to do before the clear_page_extent_mapped() call.

[FIX]
Here we "wait" for the subapge spinlock to be released before we detach
subpage structure.
So this patch will introduce a new function, wait_subpage_spinlock(), to
do the "wait" by acquiring the spinlock and release it.

Since the caller has ensured the page is not dirty nor writeback, and
page is already locked, the only way to hold the subpage spinlock is
from endio function.
Thus we only need to acquire the spinlock to wait for any existing
holder.

Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/btrfs/file.c    |  8 ++++----
 fs/btrfs/inode.c   | 39 ++++++++++++++++++++++++++++++++++++++-
 fs/btrfs/subpage.c |  4 +++-
 3 files changed, 45 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 8c57af3702fa..d3f2623a2af0 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1343,10 +1343,10 @@ static int prepare_uptodate_page(struct inode *inode,
 
 		/*
 		 * Since btrfs_readpage() will unlock the page before it
-		 * returns, there is a window where btrfs_releasepage() can
-		 * be called to release the page.
-		 * Here we check both inode mapping and PagePrivate() to
-		 * make sure the page was not released.
+		 * returns, there is a window where btrfs_releasepage() can be
+		 * called to release the page.  Here we check both inode
+		 * mapping and PagePrivate() to make sure the page was not
+		 * released.
 		 *
 		 * The private flag check is essential for subpage as we need
 		 * to store extra bitmap using page->private.
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ef6e1798fc2d..34dd4af72246 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8413,11 +8413,47 @@ static void btrfs_readahead(struct readahead_control *rac)
 	extent_readahead(rac);
 }
 
+/*
+ * For releasepage() and invalidatepage() we have a race window where
+ * end_page_writeback() is called but the subpage spinlock is not yet released.
+ * If we continue to release/invalidate the page, we could cause use-after-free
+ * for subpage spinlock.  So this function is to spin and wait for subpage
+ * spinlock.
+ */
+static void wait_subpage_spinlock(struct page *page)
+{
+	struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
+	struct btrfs_subpage *subpage;
+
+	if (fs_info->sectorsize == PAGE_SIZE)
+		return;
+
+	ASSERT(PagePrivate(page) && page->private);
+	subpage = (struct btrfs_subpage *)page->private;
+
+	/*
+	 * This may look insane as we just acquire the spinlock and release it,
+	 * without doing anything.  But we just want to make sure no one is
+	 * still holding the subpage spinlock.
+	 * And since the page is not dirty nor writeback, and we have page
+	 * locked, the only possible way to hold a spinlock is from the endio
+	 * function to clear page writeback.
+	 *
+	 * Here we just acquire the spinlock so that all existing callers
+	 * should exit and we're safe to release/invalidate the page.
+	 */
+	spin_lock_irq(&subpage->lock);
+	spin_unlock_irq(&subpage->lock);
+}
+
 static int __btrfs_releasepage(struct page *page, gfp_t gfp_flags)
 {
 	int ret = try_release_extent_mapping(page, gfp_flags);
-	if (ret == 1)
+
+	if (ret == 1) {
+		wait_subpage_spinlock(page);
 		clear_page_extent_mapped(page);
+	}
 	return ret;
 }
 
@@ -8481,6 +8517,7 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 	 * do double ordered extent accounting on the same page.
 	 */
 	wait_on_page_writeback(page);
+	wait_subpage_spinlock(page);
 
 	/*
 	 * For subpage case, we have call sites like
diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
index 640bcd21bf28..9408232e1dc9 100644
--- a/fs/btrfs/subpage.c
+++ b/fs/btrfs/subpage.c
@@ -435,8 +435,10 @@ void btrfs_subpage_clear_writeback(const struct btrfs_fs_info *fs_info,
 
 	spin_lock_irqsave(&subpage->lock, flags);
 	subpage->writeback_bitmap &= ~tmp;
-	if (subpage->writeback_bitmap == 0)
+	if (subpage->writeback_bitmap == 0) {
+		ASSERT(PageWriteback(page));
 		end_page_writeback(page);
+	}
 	spin_unlock_irqrestore(&subpage->lock, flags);
 }
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH AUTOSEL 5.14 196/252] btrfs: subpage: fix false alert when relocating partial preallocated data extents
       [not found] <20210909114106.141462-1-sashal@kernel.org>
                   ` (5 preceding siblings ...)
  2021-09-09 11:40 ` [PATCH AUTOSEL 5.14 195/252] btrfs: subpage: fix a potential use-after-free in writeback helper Sasha Levin
@ 2021-09-09 11:40 ` Sasha Levin
  2021-09-09 11:55   ` David Sterba
  2021-09-09 11:40 ` [PATCH AUTOSEL 5.14 197/252] btrfs: tree-log: check btrfs_lookup_data_extent return value Sasha Levin
  7 siblings, 1 reply; 15+ messages in thread
From: Sasha Levin @ 2021-09-09 11:40 UTC (permalink / raw)
  To: linux-kernel, stable; +Cc: Qu Wenruo, David Sterba, Sasha Levin, linux-btrfs

From: Qu Wenruo <wqu@suse.com>

[ Upstream commit e3c62324e470c0a89df966603156b34fccd01708 ]

[BUG]
When relocating partial preallocated data extents (part of the
preallocated extent is written) for subpage, it can cause the following
false alert and make the relocation to fail:

  BTRFS info (device dm-3): balance: start -d
  BTRFS info (device dm-3): relocating block group 13631488 flags data
  BTRFS warning (device dm-3): csum failed root -9 ino 257 off 4096 csum 0x98757625 expected csum 0x00000000 mirror 1
  BTRFS error (device dm-3): bdev /dev/mapper/arm_nvme-test errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
  BTRFS warning (device dm-3): csum failed root -9 ino 257 off 4096 csum 0x98757625 expected csum 0x00000000 mirror 1
  BTRFS error (device dm-3): bdev /dev/mapper/arm_nvme-test errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
  BTRFS info (device dm-3): balance: ended with status: -5

The minimal script to reproduce looks like this:

  mkfs.btrfs -f -s 4k $dev
  mount $dev -o nospace_cache $mnt
  xfs_io -f -c "falloc 0 8k" $mnt/file
  xfs_io -f -c "pwrite 0 4k" $mnt/file
  btrfs balance start -d $mnt

[CAUSE]
Function btrfs_verify_data_csum() checks if the full range has
EXTENT_NODATASUM bit for data reloc inode, if *all* bytes of the range
have EXTENT_NODATASUM bit, then it skip the range.

This works pretty well for regular sectorsize, as in that case
btrfs_verify_data_csum() is called for each sector, thus no problem at
all.

But for subpage case, btrfs_verify_data_csum() is called on each bvec,
which can contain several sectors, and since it checks *all* bytes for
EXTENT_NODATASUM bit, if we have some range with csum, then we will
continue checking all the sectors.

For the preallocated sectors, it doesn't have any csum, thus obviously
the csum won't match and cause the false alert.

[FIX]
Move the EXTENT_NODATASUM check into the main loop, so that we can check
each sector for EXTENT_NODATASUM bit for subpage case.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/btrfs/inode.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 34dd4af72246..05f26275be4d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3277,19 +3277,24 @@ unsigned int btrfs_verify_data_csum(struct btrfs_io_bio *io_bio, u32 bio_offset,
 	if (!root->fs_info->csum_root)
 		return 0;
 
-	if (root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID &&
-	    test_range_bit(io_tree, start, end, EXTENT_NODATASUM, 1, NULL)) {
-		clear_extent_bits(io_tree, start, end, EXTENT_NODATASUM);
-		return 0;
-	}
-
 	ASSERT(page_offset(page) <= start &&
 	       end <= page_offset(page) + PAGE_SIZE - 1);
 	for (pg_off = offset_in_page(start);
 	     pg_off < offset_in_page(end);
 	     pg_off += sectorsize, bio_offset += sectorsize) {
+		u64 file_offset = pg_off + page_offset(page);
 		int ret;
 
+		if (root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID &&
+		    test_range_bit(io_tree, file_offset,
+				   file_offset + sectorsize - 1,
+				   EXTENT_NODATASUM, 1, NULL)) {
+			/* Skip the range without csum for data reloc inode */
+			clear_extent_bits(io_tree, file_offset,
+					  file_offset + sectorsize - 1,
+					  EXTENT_NODATASUM);
+			continue;
+		}
 		ret = check_data_csum(inode, io_bio, bio_offset, page, pg_off,
 				      page_offset(page) + pg_off);
 		if (ret < 0) {
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH AUTOSEL 5.14 197/252] btrfs: tree-log: check btrfs_lookup_data_extent return value
       [not found] <20210909114106.141462-1-sashal@kernel.org>
                   ` (6 preceding siblings ...)
  2021-09-09 11:40 ` [PATCH AUTOSEL 5.14 196/252] btrfs: subpage: fix false alert when relocating partial preallocated data extents Sasha Levin
@ 2021-09-09 11:40 ` Sasha Levin
  7 siblings, 0 replies; 15+ messages in thread
From: Sasha Levin @ 2021-09-09 11:40 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Marcos Paulo de Souza, Filipe Manana, David Sterba, Sasha Levin,
	linux-btrfs

From: Marcos Paulo de Souza <mpdesouza@suse.com>

[ Upstream commit 3736127a3aa805602b7a2ad60ec9cfce68065fbb ]

Function btrfs_lookup_data_extent calls btrfs_search_slot to verify if
the EXTENT_ITEM exists in the extent tree. btrfs_search_slot can return
values bellow zero if an error happened.

Function replay_one_extent currently checks if the search found
something (0 returned) and increments the reference, and if not, it
seems to evaluate as 'not found'.

Fix the condition by checking if the value was bellow zero and return
early.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/btrfs/tree-log.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index e6430ac9bbe8..7037e5855d2a 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -753,7 +753,9 @@ static noinline int replay_one_extent(struct btrfs_trans_handle *trans,
 			 */
 			ret = btrfs_lookup_data_extent(fs_info, ins.objectid,
 						ins.offset);
-			if (ret == 0) {
+			if (ret < 0) {
+				goto out;
+			} else if (ret == 0) {
 				btrfs_init_generic_ref(&ref,
 						BTRFS_ADD_DELAYED_REF,
 						ins.objectid, ins.offset, 0);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH AUTOSEL 5.14 191/252] btrfs: reset this_bio_flag to avoid inheriting old flags
  2021-09-09 11:40 ` [PATCH AUTOSEL 5.14 191/252] btrfs: reset this_bio_flag to avoid inheriting old flags Sasha Levin
@ 2021-09-09 11:55   ` David Sterba
  2021-09-11 14:37     ` Sasha Levin
  0 siblings, 1 reply; 15+ messages in thread
From: David Sterba @ 2021-09-09 11:55 UTC (permalink / raw)
  To: Sasha Levin
  Cc: linux-kernel, stable, Qu Wenruo, Anand Jain, David Sterba,
	linux-btrfs

On Thu, Sep 09, 2021 at 07:40:05AM -0400, Sasha Levin wrote:
> From: Qu Wenruo <wqu@suse.com>
> 
> [ Upstream commit 4c37a7938496756649096a7ec26320eb8b0d90fb ]

Please drop this patch from stable queue, thanks.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH AUTOSEL 5.14 192/252] btrfs: subpage: check if there are compressed extents inside one page
  2021-09-09 11:40 ` [PATCH AUTOSEL 5.14 192/252] btrfs: subpage: check if there are compressed extents inside one page Sasha Levin
@ 2021-09-09 11:55   ` David Sterba
  0 siblings, 0 replies; 15+ messages in thread
From: David Sterba @ 2021-09-09 11:55 UTC (permalink / raw)
  To: Sasha Levin; +Cc: linux-kernel, stable, Qu Wenruo, David Sterba, linux-btrfs

On Thu, Sep 09, 2021 at 07:40:06AM -0400, Sasha Levin wrote:
> From: Qu Wenruo <wqu@suse.com>
> 
> [ Upstream commit 3670e6451bc9c39ab3a46f1da19360219e4319f3 ]

Please drop this patch from stable queue, thanks.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH AUTOSEL 5.14 193/252] btrfs: grab correct extent map for subpage compressed extent read
  2021-09-09 11:40 ` [PATCH AUTOSEL 5.14 193/252] btrfs: grab correct extent map for subpage compressed extent read Sasha Levin
@ 2021-09-09 11:55   ` David Sterba
  0 siblings, 0 replies; 15+ messages in thread
From: David Sterba @ 2021-09-09 11:55 UTC (permalink / raw)
  To: Sasha Levin
  Cc: linux-kernel, stable, Qu Wenruo, Anand Jain, David Sterba,
	linux-btrfs

On Thu, Sep 09, 2021 at 07:40:07AM -0400, Sasha Levin wrote:
> From: Qu Wenruo <wqu@suse.com>
> 
> [ Upstream commit 557023ea9f06baf2659b232b08b8e8711f7001a6 ]

Please drop this patch from stable queue, thanks.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH AUTOSEL 5.14 194/252] btrfs: subpage: fix race between prepare_pages() and btrfs_releasepage()
  2021-09-09 11:40 ` [PATCH AUTOSEL 5.14 194/252] btrfs: subpage: fix race between prepare_pages() and btrfs_releasepage() Sasha Levin
@ 2021-09-09 11:55   ` David Sterba
  0 siblings, 0 replies; 15+ messages in thread
From: David Sterba @ 2021-09-09 11:55 UTC (permalink / raw)
  To: Sasha Levin
  Cc: linux-kernel, stable, Qu Wenruo, Ritesh Harjani, Filipe Manana,
	David Sterba, linux-btrfs

On Thu, Sep 09, 2021 at 07:40:08AM -0400, Sasha Levin wrote:
> From: Qu Wenruo <wqu@suse.com>
> 
> [ Upstream commit e0467866198f7f536806f39e5d0d91ae8018de08 ]

Please drop this patch from stable queue, thanks.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH AUTOSEL 5.14 195/252] btrfs: subpage: fix a potential use-after-free in writeback helper
  2021-09-09 11:40 ` [PATCH AUTOSEL 5.14 195/252] btrfs: subpage: fix a potential use-after-free in writeback helper Sasha Levin
@ 2021-09-09 11:55   ` David Sterba
  0 siblings, 0 replies; 15+ messages in thread
From: David Sterba @ 2021-09-09 11:55 UTC (permalink / raw)
  To: Sasha Levin
  Cc: linux-kernel, stable, Qu Wenruo, Ritesh Harjani, David Sterba,
	linux-btrfs

On Thu, Sep 09, 2021 at 07:40:09AM -0400, Sasha Levin wrote:
> From: Qu Wenruo <wqu@suse.com>
> 
> [ Upstream commit 7c11d0ae439565b4560b0c0f36bf05171ed1a146 ]

Please drop this patch from stable queue, thanks.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH AUTOSEL 5.14 196/252] btrfs: subpage: fix false alert when relocating partial preallocated data extents
  2021-09-09 11:40 ` [PATCH AUTOSEL 5.14 196/252] btrfs: subpage: fix false alert when relocating partial preallocated data extents Sasha Levin
@ 2021-09-09 11:55   ` David Sterba
  0 siblings, 0 replies; 15+ messages in thread
From: David Sterba @ 2021-09-09 11:55 UTC (permalink / raw)
  To: Sasha Levin; +Cc: linux-kernel, stable, Qu Wenruo, David Sterba, linux-btrfs

On Thu, Sep 09, 2021 at 07:40:10AM -0400, Sasha Levin wrote:
> From: Qu Wenruo <wqu@suse.com>
> 
> [ Upstream commit e3c62324e470c0a89df966603156b34fccd01708 ]

Please drop this patch from stable queue, thanks.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH AUTOSEL 5.14 191/252] btrfs: reset this_bio_flag to avoid inheriting old flags
  2021-09-09 11:55   ` David Sterba
@ 2021-09-11 14:37     ` Sasha Levin
  0 siblings, 0 replies; 15+ messages in thread
From: Sasha Levin @ 2021-09-11 14:37 UTC (permalink / raw)
  To: dsterba, linux-kernel, stable, Qu Wenruo, Anand Jain,
	David Sterba, linux-btrfs

On Thu, Sep 09, 2021 at 01:55:04PM +0200, David Sterba wrote:
>On Thu, Sep 09, 2021 at 07:40:05AM -0400, Sasha Levin wrote:
>> From: Qu Wenruo <wqu@suse.com>
>>
>> [ Upstream commit 4c37a7938496756649096a7ec26320eb8b0d90fb ]
>
>Please drop this patch from stable queue, thanks.

I've dropped all the patches you've pointed out, thanks!

-- 
Thanks,
Sasha

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2021-09-11 14:37 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20210909114106.141462-1-sashal@kernel.org>
2021-09-09 11:40 ` [PATCH AUTOSEL 5.14 190/252] btrfs: remove racy and unnecessary inode transaction update when using no-holes Sasha Levin
2021-09-09 11:40 ` [PATCH AUTOSEL 5.14 191/252] btrfs: reset this_bio_flag to avoid inheriting old flags Sasha Levin
2021-09-09 11:55   ` David Sterba
2021-09-11 14:37     ` Sasha Levin
2021-09-09 11:40 ` [PATCH AUTOSEL 5.14 192/252] btrfs: subpage: check if there are compressed extents inside one page Sasha Levin
2021-09-09 11:55   ` David Sterba
2021-09-09 11:40 ` [PATCH AUTOSEL 5.14 193/252] btrfs: grab correct extent map for subpage compressed extent read Sasha Levin
2021-09-09 11:55   ` David Sterba
2021-09-09 11:40 ` [PATCH AUTOSEL 5.14 194/252] btrfs: subpage: fix race between prepare_pages() and btrfs_releasepage() Sasha Levin
2021-09-09 11:55   ` David Sterba
2021-09-09 11:40 ` [PATCH AUTOSEL 5.14 195/252] btrfs: subpage: fix a potential use-after-free in writeback helper Sasha Levin
2021-09-09 11:55   ` David Sterba
2021-09-09 11:40 ` [PATCH AUTOSEL 5.14 196/252] btrfs: subpage: fix false alert when relocating partial preallocated data extents Sasha Levin
2021-09-09 11:55   ` David Sterba
2021-09-09 11:40 ` [PATCH AUTOSEL 5.14 197/252] btrfs: tree-log: check btrfs_lookup_data_extent return value Sasha Levin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox