[PATCH AUTOSEL 5.13 70/85] btrfs: fix error handling in __btrfs_update_delayed

public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH AUTOSEL 5.13 70/85] btrfs: fix error handling in __btrfs_update_delayed_inode
       [not found] <20210704230420.1488358-1-sashal@kernel.org>
@ 2021-07-04 23:04 ` Sasha Levin
  2021-07-04 23:04 ` [PATCH AUTOSEL 5.13 71/85] btrfs: abort transaction if we fail to update the delayed inode Sasha Levin
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 11+ messages in thread
From: Sasha Levin @ 2021-07-04 23:04 UTC (permalink / raw)
  To: linux-kernel, stable; +Cc: Josef Bacik, David Sterba, Sasha Levin, linux-btrfs

From: Josef Bacik <josef@toxicpanda.com>

[ Upstream commit bb385bedded3ccbd794559600de4a09448810f4a ]

If we get an error while looking up the inode item we'll simply bail
without cleaning up the delayed node.  This results in this style of
warning happening on commit:

  WARNING: CPU: 0 PID: 76403 at fs/btrfs/delayed-inode.c:1365 btrfs_assert_delayed_root_empty+0x5b/0x90
  CPU: 0 PID: 76403 Comm: fsstress Tainted: G        W         5.13.0-rc1+ #373
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
  RIP: 0010:btrfs_assert_delayed_root_empty+0x5b/0x90
  RSP: 0018:ffffb8bb815a7e50 EFLAGS: 00010286
  RAX: 0000000000000000 RBX: ffff95d6d07e1888 RCX: ffff95d6c0fa3000
  RDX: 0000000000000002 RSI: 000000000029e91c RDI: ffff95d6c0fc8060
  RBP: ffff95d6c0fc8060 R08: 00008d6d701a2c1d R09: 0000000000000000
  R10: ffff95d6d1760ea0 R11: 0000000000000001 R12: ffff95d6c15a4d00
  R13: ffff95d6c0fa3000 R14: 0000000000000000 R15: ffffb8bb815a7e90
  FS:  00007f490e8dbb80(0000) GS:ffff95d73bc00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f6e75555cb0 CR3: 00000001101ce001 CR4: 0000000000370ef0
  Call Trace:
   btrfs_commit_transaction+0x43c/0xb00
   ? finish_wait+0x80/0x80
   ? vfs_fsync_range+0x90/0x90
   iterate_supers+0x8c/0x100
   ksys_sync+0x50/0x90
   __do_sys_sync+0xa/0x10
   do_syscall_64+0x3d/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xae

Because the iref isn't dropped and this leaves an elevated node->count,
so any release just re-queues it onto the delayed inodes list.  Fix this
by going to the out label to handle the proper cleanup of the delayed
node.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/btrfs/delayed-inode.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 1a88f6214ebc..3091540fc22a 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -1009,12 +1009,10 @@ static int __btrfs_update_delayed_inode(struct btrfs_trans_handle *trans,
 	nofs_flag = memalloc_nofs_save();
 	ret = btrfs_lookup_inode(trans, root, path, &key, mod);
 	memalloc_nofs_restore(nofs_flag);
-	if (ret > 0) {
-		btrfs_release_path(path);
-		return -ENOENT;
-	} else if (ret < 0) {
-		return ret;
-	}
+	if (ret > 0)
+		ret = -ENOENT;
+	if (ret < 0)
+		goto out;
 
 	leaf = path->nodes[0];
 	inode_item = btrfs_item_ptr(leaf, path->slots[0],
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH AUTOSEL 5.13 71/85] btrfs: abort transaction if we fail to update the delayed inode
       [not found] <20210704230420.1488358-1-sashal@kernel.org>
  2021-07-04 23:04 ` [PATCH AUTOSEL 5.13 70/85] btrfs: fix error handling in __btrfs_update_delayed_inode Sasha Levin
@ 2021-07-04 23:04 ` Sasha Levin
  2021-07-04 23:04 ` [PATCH AUTOSEL 5.13 72/85] btrfs: always abort the transaction if we abort a trans handle Sasha Levin
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 11+ messages in thread
From: Sasha Levin @ 2021-07-04 23:04 UTC (permalink / raw)
  To: linux-kernel, stable; +Cc: Josef Bacik, David Sterba, Sasha Levin, linux-btrfs

From: Josef Bacik <josef@toxicpanda.com>

[ Upstream commit 04587ad9bef6ce9d510325b4ba9852b6129eebdb ]

If we fail to update the delayed inode we need to abort the transaction,
because we could leave an inode with the improper counts or some other
such corruption behind.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/btrfs/delayed-inode.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 3091540fc22a..3bb8b919d2c1 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -1050,6 +1050,14 @@ static int __btrfs_update_delayed_inode(struct btrfs_trans_handle *trans,
 	btrfs_delayed_inode_release_metadata(fs_info, node, (ret < 0));
 	btrfs_release_delayed_inode(node);
 
+	/*
+	 * If we fail to update the delayed inode we need to abort the
+	 * transaction, because we could leave the inode with the improper
+	 * counts behind.
+	 */
+	if (ret && ret != -ENOENT)
+		btrfs_abort_transaction(trans, ret);
+
 	return ret;
 
 search:
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH AUTOSEL 5.13 72/85] btrfs: always abort the transaction if we abort a trans handle
       [not found] <20210704230420.1488358-1-sashal@kernel.org>
  2021-07-04 23:04 ` [PATCH AUTOSEL 5.13 70/85] btrfs: fix error handling in __btrfs_update_delayed_inode Sasha Levin
  2021-07-04 23:04 ` [PATCH AUTOSEL 5.13 71/85] btrfs: abort transaction if we fail to update the delayed inode Sasha Levin
@ 2021-07-04 23:04 ` Sasha Levin
  2021-07-04 23:04 ` [PATCH AUTOSEL 5.13 73/85] btrfs: sysfs: fix format string for some discard stats Sasha Levin
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 11+ messages in thread
From: Sasha Levin @ 2021-07-04 23:04 UTC (permalink / raw)
  To: linux-kernel, stable; +Cc: Josef Bacik, David Sterba, Sasha Levin, linux-btrfs

From: Josef Bacik <josef@toxicpanda.com>

[ Upstream commit 5963ffcaf383134985a5a2d8a4baa582d3999e0a ]

While stress testing our error handling I noticed that sometimes we
would still commit the transaction even though we had aborted the
transaction.

Currently we track if a trans handle has dirtied any metadata, and if it
hasn't we mark the filesystem as having an error (so no new transactions
can be started), but we will allow the current transaction to complete
as we do not mark the transaction itself as having been aborted.

This sounds good in theory, but we were not properly tracking IO errors
in btrfs_finish_ordered_io, and thus committing the transaction with
bogus free space data.  This isn't necessarily a problem per-se with the
free space cache, as the other guards in place would have kept us from
accepting the free space cache as valid, but highlights a real world
case where we had a bug and could have corrupted the filesystem because
of it.

This "skip abort on empty trans handle" is nice in theory, but assumes
we have perfect error handling everywhere, which we clearly do not.
Also we do not allow further transactions to be started, so all this
does is save the last transaction that was happening, which doesn't
necessarily gain us anything other than the potential for real
corruption.

Remove this particular bit of code, if we decide we need to abort the
transaction then abort the current one and keep us from doing real harm
to the file system, regardless of whether this specific trans handle
dirtied anything or not.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/btrfs/ctree.c       |  5 +----
 fs/btrfs/extent-tree.c |  1 -
 fs/btrfs/super.c       | 11 -----------
 fs/btrfs/transaction.c |  8 --------
 fs/btrfs/transaction.h |  1 -
 5 files changed, 1 insertion(+), 25 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index a484fb72a01f..4bc3ca2cbd7d 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -596,7 +596,6 @@ noinline int btrfs_cow_block(struct btrfs_trans_handle *trans,
 		       trans->transid, fs_info->generation);
 
 	if (!should_cow_block(trans, root, buf)) {
-		trans->dirty = true;
 		*cow_ret = buf;
 		return 0;
 	}
@@ -1788,10 +1787,8 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 			 * then we don't want to set the path blocking,
 			 * so we test it here
 			 */
-			if (!should_cow_block(trans, root, b)) {
-				trans->dirty = true;
+			if (!should_cow_block(trans, root, b))
 				goto cow_done;
-			}
 
 			/*
 			 * must have write locks on this node and the
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 3d5c35e4cb76..d2f39a122d89 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4784,7 +4784,6 @@ btrfs_init_new_buffer(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 		set_extent_dirty(&trans->transaction->dirty_pages, buf->start,
 			 buf->start + buf->len - 1, GFP_NOFS);
 	}
-	trans->dirty = true;
 	/* this returns a buffer locked for blocking */
 	return buf;
 }
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 4a396c1147f1..bc613218c8c5 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -299,17 +299,6 @@ void __btrfs_abort_transaction(struct btrfs_trans_handle *trans,
 	struct btrfs_fs_info *fs_info = trans->fs_info;
 
 	WRITE_ONCE(trans->aborted, errno);
-	/* Nothing used. The other threads that have joined this
-	 * transaction may be able to continue. */
-	if (!trans->dirty && list_empty(&trans->new_bgs)) {
-		const char *errstr;
-
-		errstr = btrfs_decode_error(errno);
-		btrfs_warn(fs_info,
-		           "%s:%d: Aborting unused transaction(%s).",
-		           function, line, errstr);
-		return;
-	}
 	WRITE_ONCE(trans->transaction->aborted, errno);
 	/* Wake up anybody who may be waiting on this transaction */
 	wake_up(&fs_info->transaction_wait);
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index f75de9f6c0ad..e0a82aa7da89 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -2074,14 +2074,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 
 	ASSERT(refcount_read(&trans->use_count) == 1);
 
-	/*
-	 * Some places just start a transaction to commit it.  We need to make
-	 * sure that if this commit fails that the abort code actually marks the
-	 * transaction as failed, so set trans->dirty to make the abort code do
-	 * the right thing.
-	 */
-	trans->dirty = true;
-
 	/* Stop the commit early if ->aborted is set */
 	if (TRANS_ABORTED(cur_trans)) {
 		ret = cur_trans->aborted;
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 364cfbb4c5c5..c49e2266b28b 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -143,7 +143,6 @@ struct btrfs_trans_handle {
 	bool allocating_chunk;
 	bool can_flush_pending_bgs;
 	bool reloc_reserved;
-	bool dirty;
 	bool in_fsync;
 	struct btrfs_root *root;
 	struct btrfs_fs_info *fs_info;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH AUTOSEL 5.13 73/85] btrfs: sysfs: fix format string for some discard stats
       [not found] <20210704230420.1488358-1-sashal@kernel.org>
                   ` (2 preceding siblings ...)
  2021-07-04 23:04 ` [PATCH AUTOSEL 5.13 72/85] btrfs: always abort the transaction if we abort a trans handle Sasha Levin
@ 2021-07-04 23:04 ` Sasha Levin
  2021-07-04 23:04 ` [PATCH AUTOSEL 5.13 74/85] btrfs: scrub: fix subpage repair error caused by hard coded PAGE_SIZE Sasha Levin
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 11+ messages in thread
From: Sasha Levin @ 2021-07-04 23:04 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: David Sterba, Qu Wenruo, Anand Jain, Sasha Levin, linux-btrfs

From: David Sterba <dsterba@suse.com>

[ Upstream commit 8c5ec995616f1202ab92e195fd75d6f60d86f85c ]

The type of discard_bitmap_bytes and discard_extent_bytes is u64 so the
format should be %llu, though the actual values would hardly ever
overflow to negative values.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/btrfs/sysfs.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 436ac7b4b334..4f5b14cd3a19 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -429,7 +429,7 @@ static ssize_t btrfs_discard_bitmap_bytes_show(struct kobject *kobj,
 {
 	struct btrfs_fs_info *fs_info = discard_to_fs_info(kobj);
 
-	return scnprintf(buf, PAGE_SIZE, "%lld\n",
+	return scnprintf(buf, PAGE_SIZE, "%llu\n",
 			fs_info->discard_ctl.discard_bitmap_bytes);
 }
 BTRFS_ATTR(discard, discard_bitmap_bytes, btrfs_discard_bitmap_bytes_show);
@@ -451,7 +451,7 @@ static ssize_t btrfs_discard_extent_bytes_show(struct kobject *kobj,
 {
 	struct btrfs_fs_info *fs_info = discard_to_fs_info(kobj);
 
-	return scnprintf(buf, PAGE_SIZE, "%lld\n",
+	return scnprintf(buf, PAGE_SIZE, "%llu\n",
 			fs_info->discard_ctl.discard_extent_bytes);
 }
 BTRFS_ATTR(discard, discard_extent_bytes, btrfs_discard_extent_bytes_show);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH AUTOSEL 5.13 74/85] btrfs: scrub: fix subpage repair error caused by hard coded PAGE_SIZE
       [not found] <20210704230420.1488358-1-sashal@kernel.org>
                   ` (3 preceding siblings ...)
  2021-07-04 23:04 ` [PATCH AUTOSEL 5.13 73/85] btrfs: sysfs: fix format string for some discard stats Sasha Levin
@ 2021-07-04 23:04 ` Sasha Levin
  2021-07-04 23:04 ` [PATCH AUTOSEL 5.13 75/85] btrfs: make Private2 lifespan more consistent Sasha Levin
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 11+ messages in thread
From: Sasha Levin @ 2021-07-04 23:04 UTC (permalink / raw)
  To: linux-kernel, stable; +Cc: Qu Wenruo, David Sterba, Sasha Levin, linux-btrfs

From: Qu Wenruo <wqu@suse.com>

[ Upstream commit 8df507cbb5952719353c912a021b66c27641e90c ]

[BUG]
For the following file layout, scrub will not be able to repair all
these two repairable error, but in fact make one corruption even
unrepairable:

	  inode offset 0      4k     8K
Mirror 1               |XXXXXX|      |
Mirror 2               |      |XXXXXX|

[CAUSE]
The root cause is the hard coded PAGE_SIZE, which makes scrub repair to
go crazy for subpage.

For above case, when reading the first sector, we use PAGE_SIZE other
than sectorsize to read, which makes us to read the full range [0, 64K).
In fact, after 8K there may be no data at all, we can just get some
garbage.

Then when doing the repair, we also writeback a full page from mirror 2,
this means, we will also writeback the corrupted data in mirror 2 back
to mirror 1, leaving the range [4K, 8K) unrepairable.

[FIX]
This patch will modify the following PAGE_SIZE use with sectorsize:

- scrub_print_warning_inode()
  Remove the min() and replace PAGE_SIZE with sectorsize.
  The min() makes no sense, as csum is done for the full sector with
  padding.

  This fixes a bug that subpage report extra length like:
   checksum error at logical 298844160 on dev /dev/mapper/arm_nvme-test,
   physical 575668224, root 5, inode 257, offset 0, length 12288, links 1 (path: file)

  Where the error is only 1 sector.

- scrub_handle_errored_block()
  Comments with PAGE|page involved, all changed to sector.

- scrub_setup_recheck_block()
- scrub_repair_page_from_good_copy()
- scrub_add_page_to_wr_bio()
- scrub_wr_submit()
- scrub_add_page_to_rd_bio()
- scrub_block_complete()
  Replace PAGE_SIZE with sectorsize.
  This solves several problems where we read/write extra range for
  subpage case.

RAID56 code is excluded intentionally, as RAID56 has extra PAGE_SIZE
usage, and is not really safe enough.
Thus we will reject RAID56 for subpage in later commit.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/btrfs/scrub.c | 82 +++++++++++++++++++++++++-----------------------
 1 file changed, 42 insertions(+), 40 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 485cda3eb8d7..6af901539ef2 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -626,7 +626,6 @@ static noinline_for_stack struct scrub_ctx *scrub_setup_ctx(
 static int scrub_print_warning_inode(u64 inum, u64 offset, u64 root,
 				     void *warn_ctx)
 {
-	u64 isize;
 	u32 nlink;
 	int ret;
 	int i;
@@ -662,7 +661,6 @@ static int scrub_print_warning_inode(u64 inum, u64 offset, u64 root,
 	eb = swarn->path->nodes[0];
 	inode_item = btrfs_item_ptr(eb, swarn->path->slots[0],
 					struct btrfs_inode_item);
-	isize = btrfs_inode_size(eb, inode_item);
 	nlink = btrfs_inode_nlink(eb, inode_item);
 	btrfs_release_path(swarn->path);
 
@@ -691,12 +689,12 @@ static int scrub_print_warning_inode(u64 inum, u64 offset, u64 root,
 	 */
 	for (i = 0; i < ipath->fspath->elem_cnt; ++i)
 		btrfs_warn_in_rcu(fs_info,
-"%s at logical %llu on dev %s, physical %llu, root %llu, inode %llu, offset %llu, length %llu, links %u (path: %s)",
+"%s at logical %llu on dev %s, physical %llu, root %llu, inode %llu, offset %llu, length %u, links %u (path: %s)",
 				  swarn->errstr, swarn->logical,
 				  rcu_str_deref(swarn->dev->name),
 				  swarn->physical,
 				  root, inum, offset,
-				  min(isize - offset, (u64)PAGE_SIZE), nlink,
+				  fs_info->sectorsize, nlink,
 				  (char *)(unsigned long)ipath->fspath->val[i]);
 
 	btrfs_put_root(local_root);
@@ -885,25 +883,25 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 	 * read all mirrors one after the other. This includes to
 	 * re-read the extent or metadata block that failed (that was
 	 * the cause that this fixup code is called) another time,
-	 * page by page this time in order to know which pages
+	 * sector by sector this time in order to know which sectors
 	 * caused I/O errors and which ones are good (for all mirrors).
 	 * It is the goal to handle the situation when more than one
 	 * mirror contains I/O errors, but the errors do not
 	 * overlap, i.e. the data can be repaired by selecting the
-	 * pages from those mirrors without I/O error on the
-	 * particular pages. One example (with blocks >= 2 * PAGE_SIZE)
-	 * would be that mirror #1 has an I/O error on the first page,
-	 * the second page is good, and mirror #2 has an I/O error on
-	 * the second page, but the first page is good.
-	 * Then the first page of the first mirror can be repaired by
-	 * taking the first page of the second mirror, and the
-	 * second page of the second mirror can be repaired by
-	 * copying the contents of the 2nd page of the 1st mirror.
-	 * One more note: if the pages of one mirror contain I/O
+	 * sectors from those mirrors without I/O error on the
+	 * particular sectors. One example (with blocks >= 2 * sectorsize)
+	 * would be that mirror #1 has an I/O error on the first sector,
+	 * the second sector is good, and mirror #2 has an I/O error on
+	 * the second sector, but the first sector is good.
+	 * Then the first sector of the first mirror can be repaired by
+	 * taking the first sector of the second mirror, and the
+	 * second sector of the second mirror can be repaired by
+	 * copying the contents of the 2nd sector of the 1st mirror.
+	 * One more note: if the sectors of one mirror contain I/O
 	 * errors, the checksum cannot be verified. In order to get
 	 * the best data for repairing, the first attempt is to find
 	 * a mirror without I/O errors and with a validated checksum.
-	 * Only if this is not possible, the pages are picked from
+	 * Only if this is not possible, the sectors are picked from
 	 * mirrors with I/O errors without considering the checksum.
 	 * If the latter is the case, at the end, the checksum of the
 	 * repaired area is verified in order to correctly maintain
@@ -1060,26 +1058,26 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 
 	/*
 	 * In case of I/O errors in the area that is supposed to be
-	 * repaired, continue by picking good copies of those pages.
-	 * Select the good pages from mirrors to rewrite bad pages from
+	 * repaired, continue by picking good copies of those sectors.
+	 * Select the good sectors from mirrors to rewrite bad sectors from
 	 * the area to fix. Afterwards verify the checksum of the block
 	 * that is supposed to be repaired. This verification step is
 	 * only done for the purpose of statistic counting and for the
 	 * final scrub report, whether errors remain.
 	 * A perfect algorithm could make use of the checksum and try
-	 * all possible combinations of pages from the different mirrors
+	 * all possible combinations of sectors from the different mirrors
 	 * until the checksum verification succeeds. For example, when
-	 * the 2nd page of mirror #1 faces I/O errors, and the 2nd page
+	 * the 2nd sector of mirror #1 faces I/O errors, and the 2nd sector
 	 * of mirror #2 is readable but the final checksum test fails,
-	 * then the 2nd page of mirror #3 could be tried, whether now
+	 * then the 2nd sector of mirror #3 could be tried, whether now
 	 * the final checksum succeeds. But this would be a rare
 	 * exception and is therefore not implemented. At least it is
 	 * avoided that the good copy is overwritten.
 	 * A more useful improvement would be to pick the sectors
 	 * without I/O error based on sector sizes (512 bytes on legacy
-	 * disks) instead of on PAGE_SIZE. Then maybe 512 byte of one
+	 * disks) instead of on sectorsize. Then maybe 512 byte of one
 	 * mirror could be repaired by taking 512 byte of a different
-	 * mirror, even if other 512 byte sectors in the same PAGE_SIZE
+	 * mirror, even if other 512 byte sectors in the same sectorsize
 	 * area are unreadable.
 	 */
 	success = 1;
@@ -1260,7 +1258,7 @@ static int scrub_setup_recheck_block(struct scrub_block *original_sblock,
 {
 	struct scrub_ctx *sctx = original_sblock->sctx;
 	struct btrfs_fs_info *fs_info = sctx->fs_info;
-	u64 length = original_sblock->page_count * PAGE_SIZE;
+	u64 length = original_sblock->page_count * fs_info->sectorsize;
 	u64 logical = original_sblock->pagev[0]->logical;
 	u64 generation = original_sblock->pagev[0]->generation;
 	u64 flags = original_sblock->pagev[0]->flags;
@@ -1283,13 +1281,13 @@ static int scrub_setup_recheck_block(struct scrub_block *original_sblock,
 	 */
 
 	while (length > 0) {
-		sublen = min_t(u64, length, PAGE_SIZE);
+		sublen = min_t(u64, length, fs_info->sectorsize);
 		mapped_length = sublen;
 		bbio = NULL;
 
 		/*
-		 * with a length of PAGE_SIZE, each returned stripe
-		 * represents one mirror
+		 * With a length of sectorsize, each returned stripe represents
+		 * one mirror
 		 */
 		btrfs_bio_counter_inc_blocked(fs_info);
 		ret = btrfs_map_sblock(fs_info, BTRFS_MAP_GET_READ_MIRRORS,
@@ -1480,7 +1478,7 @@ static void scrub_recheck_block(struct btrfs_fs_info *fs_info,
 		bio = btrfs_io_bio_alloc(1);
 		bio_set_dev(bio, spage->dev->bdev);
 
-		bio_add_page(bio, spage->page, PAGE_SIZE, 0);
+		bio_add_page(bio, spage->page, fs_info->sectorsize, 0);
 		bio->bi_iter.bi_sector = spage->physical >> 9;
 		bio->bi_opf = REQ_OP_READ;
 
@@ -1544,6 +1542,7 @@ static int scrub_repair_page_from_good_copy(struct scrub_block *sblock_bad,
 	struct scrub_page *spage_bad = sblock_bad->pagev[page_num];
 	struct scrub_page *spage_good = sblock_good->pagev[page_num];
 	struct btrfs_fs_info *fs_info = sblock_bad->sctx->fs_info;
+	const u32 sectorsize = fs_info->sectorsize;
 
 	BUG_ON(spage_bad->page == NULL);
 	BUG_ON(spage_good->page == NULL);
@@ -1563,8 +1562,8 @@ static int scrub_repair_page_from_good_copy(struct scrub_block *sblock_bad,
 		bio->bi_iter.bi_sector = spage_bad->physical >> 9;
 		bio->bi_opf = REQ_OP_WRITE;
 
-		ret = bio_add_page(bio, spage_good->page, PAGE_SIZE, 0);
-		if (PAGE_SIZE != ret) {
+		ret = bio_add_page(bio, spage_good->page, sectorsize, 0);
+		if (ret != sectorsize) {
 			bio_put(bio);
 			return -EIO;
 		}
@@ -1642,6 +1641,7 @@ static int scrub_add_page_to_wr_bio(struct scrub_ctx *sctx,
 {
 	struct scrub_bio *sbio;
 	int ret;
+	const u32 sectorsize = sctx->fs_info->sectorsize;
 
 	mutex_lock(&sctx->wr_lock);
 again:
@@ -1681,16 +1681,16 @@ static int scrub_add_page_to_wr_bio(struct scrub_ctx *sctx,
 		bio->bi_iter.bi_sector = sbio->physical >> 9;
 		bio->bi_opf = REQ_OP_WRITE;
 		sbio->status = 0;
-	} else if (sbio->physical + sbio->page_count * PAGE_SIZE !=
+	} else if (sbio->physical + sbio->page_count * sectorsize !=
 		   spage->physical_for_dev_replace ||
-		   sbio->logical + sbio->page_count * PAGE_SIZE !=
+		   sbio->logical + sbio->page_count * sectorsize !=
 		   spage->logical) {
 		scrub_wr_submit(sctx);
 		goto again;
 	}
 
-	ret = bio_add_page(sbio->bio, spage->page, PAGE_SIZE, 0);
-	if (ret != PAGE_SIZE) {
+	ret = bio_add_page(sbio->bio, spage->page, sectorsize, 0);
+	if (ret != sectorsize) {
 		if (sbio->page_count < 1) {
 			bio_put(sbio->bio);
 			sbio->bio = NULL;
@@ -1729,7 +1729,8 @@ static void scrub_wr_submit(struct scrub_ctx *sctx)
 	btrfsic_submit_bio(sbio->bio);
 
 	if (btrfs_is_zoned(sctx->fs_info))
-		sctx->write_pointer = sbio->physical + sbio->page_count * PAGE_SIZE;
+		sctx->write_pointer = sbio->physical + sbio->page_count *
+			sctx->fs_info->sectorsize;
 }
 
 static void scrub_wr_bio_end_io(struct bio *bio)
@@ -2006,6 +2007,7 @@ static int scrub_add_page_to_rd_bio(struct scrub_ctx *sctx,
 {
 	struct scrub_block *sblock = spage->sblock;
 	struct scrub_bio *sbio;
+	const u32 sectorsize = sctx->fs_info->sectorsize;
 	int ret;
 
 again:
@@ -2044,9 +2046,9 @@ static int scrub_add_page_to_rd_bio(struct scrub_ctx *sctx,
 		bio->bi_iter.bi_sector = sbio->physical >> 9;
 		bio->bi_opf = REQ_OP_READ;
 		sbio->status = 0;
-	} else if (sbio->physical + sbio->page_count * PAGE_SIZE !=
+	} else if (sbio->physical + sbio->page_count * sectorsize !=
 		   spage->physical ||
-		   sbio->logical + sbio->page_count * PAGE_SIZE !=
+		   sbio->logical + sbio->page_count * sectorsize !=
 		   spage->logical ||
 		   sbio->dev != spage->dev) {
 		scrub_submit(sctx);
@@ -2054,8 +2056,8 @@ static int scrub_add_page_to_rd_bio(struct scrub_ctx *sctx,
 	}
 
 	sbio->pagev[sbio->page_count] = spage;
-	ret = bio_add_page(sbio->bio, spage->page, PAGE_SIZE, 0);
-	if (ret != PAGE_SIZE) {
+	ret = bio_add_page(sbio->bio, spage->page, sectorsize, 0);
+	if (ret != sectorsize) {
 		if (sbio->page_count < 1) {
 			bio_put(sbio->bio);
 			sbio->bio = NULL;
@@ -2398,7 +2400,7 @@ static void scrub_block_complete(struct scrub_block *sblock)
 	if (sblock->sparity && corrupted && !sblock->data_corrected) {
 		u64 start = sblock->pagev[0]->logical;
 		u64 end = sblock->pagev[sblock->page_count - 1]->logical +
-			  PAGE_SIZE;
+			  sblock->sctx->fs_info->sectorsize;
 
 		ASSERT(end - start <= U32_MAX);
 		scrub_parity_mark_sectors_error(sblock->sparity,
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH AUTOSEL 5.13 75/85] btrfs: make Private2 lifespan more consistent
       [not found] <20210704230420.1488358-1-sashal@kernel.org>
                   ` (4 preceding siblings ...)
  2021-07-04 23:04 ` [PATCH AUTOSEL 5.13 74/85] btrfs: scrub: fix subpage repair error caused by hard coded PAGE_SIZE Sasha Levin
@ 2021-07-04 23:04 ` Sasha Levin
  2021-07-07 11:10   ` David Sterba
  2021-07-04 23:04 ` [PATCH AUTOSEL 5.13 76/85] btrfs: fix the filemap_range_has_page() call in btrfs_punch_hole_lock_range() Sasha Levin
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 11+ messages in thread
From: Sasha Levin @ 2021-07-04 23:04 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Qu Wenruo, Josef Bacik, David Sterba, Sasha Levin, linux-btrfs

From: Qu Wenruo <wqu@suse.com>

[ Upstream commit 87b4d86baae219a9a79f6b0a1434b2a42fd40d09 ]

Currently we use page Private2 bit to indicate that we have ordered
extent for the page range.

But the lifespan of it is not consistent, during regular writeback path,
there are two locations to clear the same PagePrivate2:

    T ----- Page marked Dirty
    |
    + ----- Page marked Private2, through btrfs_run_dealloc_range()
    |
    + ----- Page cleared Private2, through btrfs_writepage_cow_fixup()
    |       in __extent_writepage_io()
    |       ^^^ Private2 cleared for the first time
    |
    + ----- Page marked Writeback, through btrfs_set_range_writeback()
    |       in __extent_writepage_io().
    |
    + ----- Page cleared Private2, through
    |       btrfs_writepage_endio_finish_ordered()
    |       ^^^ Private2 cleared for the second time.
    |
    + ----- Page cleared Writeback, through
            btrfs_writepage_endio_finish_ordered()

Currently PagePrivate2 is mostly to prevent ordered extent accounting
being executed for both endio and invalidatepage.
Thus only the one who cleared page Private2 is responsible for ordered
extent accounting.

But the fact is, in btrfs_writepage_endio_finish_ordered(), page
Private2 is cleared and ordered extent accounting is executed
unconditionally.

The race prevention only happens through btrfs_invalidatepage(), where
we wait for the page writeback first, before checking the Private2 bit.

This means, Private2 is also protected by Writeback bit, and there is no
need for btrfs_writepage_cow_fixup() to clear Priavte2.

This patch will change btrfs_writepage_cow_fixup() to just check
PagePrivate2, not to clear it.
The clearing will happen in either btrfs_invalidatepage() or
btrfs_writepage_endio_finish_ordered().

This makes the Private2 bit easier to understand, just meaning the page
has unfinished ordered extent attached to it.

And this patch is a hard requirement for the incoming refactoring for
how we finished ordered IO for endio context, as the coming patch will
check Private2 to determine if we need to do the ordered extent
accounting.  Thus this patch is definitely needed or we will hang due to
unfinished ordered extent.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/btrfs/inode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 46f392943f4d..f9942560866b 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2677,7 +2677,7 @@ int btrfs_writepage_cow_fixup(struct page *page, u64 start, u64 end)
 	struct btrfs_writepage_fixup *fixup;
 
 	/* this page is properly in the ordered list */
-	if (TestClearPagePrivate2(page))
+	if (PagePrivate2(page))
 		return 0;
 
 	/*
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH AUTOSEL 5.13 76/85] btrfs: fix the filemap_range_has_page() call in btrfs_punch_hole_lock_range()
       [not found] <20210704230420.1488358-1-sashal@kernel.org>
                   ` (5 preceding siblings ...)
  2021-07-04 23:04 ` [PATCH AUTOSEL 5.13 75/85] btrfs: make Private2 lifespan more consistent Sasha Levin
@ 2021-07-04 23:04 ` Sasha Levin
  2021-07-04 23:04 ` [PATCH AUTOSEL 5.13 77/85] btrfs: don't clear page extent mapped if we're not invalidating the full page Sasha Levin
  2021-07-04 23:04 ` [PATCH AUTOSEL 5.13 78/85] btrfs: disable build on platforms having page size 256K Sasha Levin
  8 siblings, 0 replies; 11+ messages in thread
From: Sasha Levin @ 2021-07-04 23:04 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Qu Wenruo, Ritesh Harjani, Anand Jain, David Sterba, Sasha Levin,
	linux-btrfs

From: Qu Wenruo <wqu@suse.com>

[ Upstream commit 0528476b6ac7832f31e2ed740a57ae31316b124e ]

[BUG]
With current subpage RW support, the following script can hang the fs
with 64K page size.

 # mkfs.btrfs -f -s 4k $dev
 # mount $dev -o nospace_cache $mnt
 # fsstress -w -n 50 -p 1 -s 1607749395 -d $mnt

The kernel will do an infinite loop in btrfs_punch_hole_lock_range().

[CAUSE]
In btrfs_punch_hole_lock_range() we:

- Truncate page cache range
- Lock extent io tree
- Wait any ordered extents in the range.

We exit the loop until we meet all the following conditions:

- No ordered extent in the lock range
- No page is in the lock range

The latter condition has a pitfall, it only works for sector size ==
PAGE_SIZE case.

While can't handle the following subpage case:

  0       32K     64K     96K     128K
  |       |///////||//////|       ||

lockstart=32K
lockend=96K - 1

In this case, although the range crosses 2 pages,
truncate_pagecache_range() will invalidate no page at all, but only zero
the [32K, 96K) range of the two pages.

Thus filemap_range_has_page(32K, 96K-1) will always return true, thus we
will never meet the loop exit condition.

[FIX]
Fix the problem by doing page alignment for the lock range.

Function filemap_range_has_page() has already handled lend < lstart
case, we only need to round up @lockstart, and round_down @lockend for
truncate_pagecache_range().

This modification should not change any thing for sector size ==
PAGE_SIZE case, as in that case our range is already page aligned.

Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/btrfs/file.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 55f68422061d..7e6ed97a6d08 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2483,6 +2483,17 @@ static int btrfs_punch_hole_lock_range(struct inode *inode,
 				       const u64 lockend,
 				       struct extent_state **cached_state)
 {
+	/*
+	 * For subpage case, if the range is not at page boundary, we could
+	 * have pages at the leading/tailing part of the range.
+	 * This could lead to dead loop since filemap_range_has_page()
+	 * will always return true.
+	 * So here we need to do extra page alignment for
+	 * filemap_range_has_page().
+	 */
+	const u64 page_lockstart = round_up(lockstart, PAGE_SIZE);
+	const u64 page_lockend = round_down(lockend + 1, PAGE_SIZE) - 1;
+
 	while (1) {
 		struct btrfs_ordered_extent *ordered;
 		int ret;
@@ -2503,7 +2514,7 @@ static int btrfs_punch_hole_lock_range(struct inode *inode,
 		    (ordered->file_offset + ordered->num_bytes <= lockstart ||
 		     ordered->file_offset > lockend)) &&
 		     !filemap_range_has_page(inode->i_mapping,
-					     lockstart, lockend)) {
+					     page_lockstart, page_lockend)) {
 			if (ordered)
 				btrfs_put_ordered_extent(ordered);
 			break;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH AUTOSEL 5.13 77/85] btrfs: don't clear page extent mapped if we're not invalidating the full page
       [not found] <20210704230420.1488358-1-sashal@kernel.org>
                   ` (6 preceding siblings ...)
  2021-07-04 23:04 ` [PATCH AUTOSEL 5.13 76/85] btrfs: fix the filemap_range_has_page() call in btrfs_punch_hole_lock_range() Sasha Levin
@ 2021-07-04 23:04 ` Sasha Levin
  2021-07-04 23:04 ` [PATCH AUTOSEL 5.13 78/85] btrfs: disable build on platforms having page size 256K Sasha Levin
  8 siblings, 0 replies; 11+ messages in thread
From: Sasha Levin @ 2021-07-04 23:04 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Qu Wenruo, Ritesh Harjani, Anand Jain, David Sterba, Sasha Levin,
	linux-btrfs

From: Qu Wenruo <wqu@suse.com>

[ Upstream commit bcd77455d590eaa0422a5e84ae852007cfce574a ]

[BUG]
With current btrfs subpage rw support, the following script can lead to
fs hang:

  $ mkfs.btrfs -f -s 4k $dev
  $ mount $dev -o nospace_cache $mnt
  $ fsstress -w -n 100 -p 1 -s 1608140256 -v -d $mnt

The fs will hang at btrfs_start_ordered_extent().

[CAUSE]
In above test case, btrfs_invalidate() will be called with the following
parameters:

  offset = 0 length = 53248 page dirty = 1 subpage dirty bitmap = 0x2000

Since @offset is 0, btrfs_invalidate() will try to invalidate the full
page, and finally call clear_page_extent_mapped() which will detach
subpage structure from the page.

And since the page no longer has subpage structure, the subpage dirty
bitmap will be cleared, preventing the dirty range from being written
back, thus no way to wake up the ordered extent.

[FIX]
Just follow other filesystems, only to invalidate the page if the range
covers the full page.

There are cases like truncate_setsize() which can call
btrfs_invalidatepage() with offset == 0 and length != 0 for the last
page of an inode.

Although the old code will still try to invalidate the full page, we are
still safe to just wait for ordered extent to finish.
So it shouldn't cause extra problems.

Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/btrfs/inode.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index f9942560866b..5d147f204fca 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8390,7 +8390,19 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 	 */
 	wait_on_page_writeback(page);
 
-	if (offset) {
+	/*
+	 * For subpage case, we have call sites like
+	 * btrfs_punch_hole_lock_range() which passes range not aligned to
+	 * sectorsize.
+	 * If the range doesn't cover the full page, we don't need to and
+	 * shouldn't clear page extent mapped, as page->private can still
+	 * record subpage dirty bits for other part of the range.
+	 *
+	 * For cases that can invalidate the full even the range doesn't
+	 * cover the full page, like invalidating the last page, we're
+	 * still safe to wait for ordered extent to finish.
+	 */
+	if (!(offset == 0 && length == PAGE_SIZE)) {
 		btrfs_releasepage(page, GFP_NOFS);
 		return;
 	}
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH AUTOSEL 5.13 78/85] btrfs: disable build on platforms having page size 256K
       [not found] <20210704230420.1488358-1-sashal@kernel.org>
                   ` (7 preceding siblings ...)
  2021-07-04 23:04 ` [PATCH AUTOSEL 5.13 77/85] btrfs: don't clear page extent mapped if we're not invalidating the full page Sasha Levin
@ 2021-07-04 23:04 ` Sasha Levin
  8 siblings, 0 replies; 11+ messages in thread
From: Sasha Levin @ 2021-07-04 23:04 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Christophe Leroy, kernel test robot, David Sterba, Sasha Levin,
	linux-btrfs

From: Christophe Leroy <christophe.leroy@csgroup.eu>

[ Upstream commit b05fbcc36be1f8597a1febef4892053a0b2f3f60 ]

With a config having PAGE_SIZE set to 256K, BTRFS build fails
with the following message

  include/linux/compiler_types.h:326:38: error: call to
  '__compiletime_assert_791' declared with attribute error:
  BUILD_BUG_ON failed: (BTRFS_MAX_COMPRESSED % PAGE_SIZE) != 0

BTRFS_MAX_COMPRESSED being 128K, BTRFS cannot support platforms with
256K pages at the time being.

There are two platforms that can select 256K pages:
 - hexagon
 - powerpc

Disable BTRFS when 256K page size is selected. Supporting this would
require changes to the subpage mode that's currently being developed.
Given that 256K is many times larger than page sizes commonly used and
for what the algorithms and structures have been tuned, it's out of
scope and disabling build is a reasonable option.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/btrfs/Kconfig | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/Kconfig b/fs/btrfs/Kconfig
index 68b95ad82126..520a0f6a7d9e 100644
--- a/fs/btrfs/Kconfig
+++ b/fs/btrfs/Kconfig
@@ -18,6 +18,8 @@ config BTRFS_FS
 	select RAID6_PQ
 	select XOR_BLOCKS
 	select SRCU
+	depends on !PPC_256K_PAGES	# powerpc
+	depends on !PAGE_SIZE_256KB	# hexagon
 
 	help
 	  Btrfs is a general purpose copy-on-write filesystem with extents,
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH AUTOSEL 5.13 75/85] btrfs: make Private2 lifespan more consistent
  2021-07-04 23:04 ` [PATCH AUTOSEL 5.13 75/85] btrfs: make Private2 lifespan more consistent Sasha Levin
@ 2021-07-07 11:10   ` David Sterba
  2021-07-08 11:09     ` Sasha Levin
  0 siblings, 1 reply; 11+ messages in thread
From: David Sterba @ 2021-07-07 11:10 UTC (permalink / raw)
  To: Sasha Levin
  Cc: linux-kernel, stable, Qu Wenruo, Josef Bacik, David Sterba,
	linux-btrfs

On Sun, Jul 04, 2021 at 07:04:10PM -0400, Sasha Levin wrote:
> From: Qu Wenruo <wqu@suse.com>
> 
> [ Upstream commit 87b4d86baae219a9a79f6b0a1434b2a42fd40d09 ]
> 
> Currently we use page Private2 bit to indicate that we have ordered
> extent for the page range.
> 
> But the lifespan of it is not consistent, during regular writeback path,
> there are two locations to clear the same PagePrivate2:
> 
>     T ----- Page marked Dirty
>     |
>     + ----- Page marked Private2, through btrfs_run_dealloc_range()
>     |
>     + ----- Page cleared Private2, through btrfs_writepage_cow_fixup()
>     |       in __extent_writepage_io()
>     |       ^^^ Private2 cleared for the first time
>     |
>     + ----- Page marked Writeback, through btrfs_set_range_writeback()
>     |       in __extent_writepage_io().
>     |
>     + ----- Page cleared Private2, through
>     |       btrfs_writepage_endio_finish_ordered()
>     |       ^^^ Private2 cleared for the second time.
>     |
>     + ----- Page cleared Writeback, through
>             btrfs_writepage_endio_finish_ordered()
> 
> Currently PagePrivate2 is mostly to prevent ordered extent accounting
> being executed for both endio and invalidatepage.
> Thus only the one who cleared page Private2 is responsible for ordered
> extent accounting.
> 
> But the fact is, in btrfs_writepage_endio_finish_ordered(), page
> Private2 is cleared and ordered extent accounting is executed
> unconditionally.
> 
> The race prevention only happens through btrfs_invalidatepage(), where
> we wait for the page writeback first, before checking the Private2 bit.
> 
> This means, Private2 is also protected by Writeback bit, and there is no
> need for btrfs_writepage_cow_fixup() to clear Priavte2.
> 
> This patch will change btrfs_writepage_cow_fixup() to just check
> PagePrivate2, not to clear it.
> The clearing will happen in either btrfs_invalidatepage() or
> btrfs_writepage_endio_finish_ordered().
> 
> This makes the Private2 bit easier to understand, just meaning the page
> has unfinished ordered extent attached to it.
> 
> And this patch is a hard requirement for the incoming refactoring for
> how we finished ordered IO for endio context, as the coming patch will
> check Private2 to determine if we need to do the ordered extent
> accounting.  Thus this patch is definitely needed or we will hang due to
> unfinished ordered extent.
> 
> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> Signed-off-by: David Sterba <dsterba@suse.com>
> Signed-off-by: Sasha Levin <sashal@kernel.org>

Please drop this patch from all autosel stable backports. This is not a
standalone fix and the CC: stable@ is not there intentionally.

All patches that go through my tree are evaluated for stable backports
up to 4.4 so it's unlikely the machinery you're using can find something
I've overlooked.

The patches that autosel picks and do not get a complain^Wreply for me
are below the bar I'd consider it for stable but if after another review
the patch "does no harm", I let it pass because you're obviously
handling the backports. I would not tag it myself to avoid increasing
load of Greg with patches that don't matter much.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH AUTOSEL 5.13 75/85] btrfs: make Private2 lifespan more consistent
  2021-07-07 11:10   ` David Sterba
@ 2021-07-08 11:09     ` Sasha Levin
  0 siblings, 0 replies; 11+ messages in thread
From: Sasha Levin @ 2021-07-08 11:09 UTC (permalink / raw)
  To: dsterba, linux-kernel, stable, Qu Wenruo, Josef Bacik,
	David Sterba, linux-btrfs

On Wed, Jul 07, 2021 at 01:10:05PM +0200, David Sterba wrote:
>On Sun, Jul 04, 2021 at 07:04:10PM -0400, Sasha Levin wrote:
>> From: Qu Wenruo <wqu@suse.com>
>>
>> [ Upstream commit 87b4d86baae219a9a79f6b0a1434b2a42fd40d09 ]
>>
>> Currently we use page Private2 bit to indicate that we have ordered
>> extent for the page range.
>>
>> But the lifespan of it is not consistent, during regular writeback path,
>> there are two locations to clear the same PagePrivate2:
>>
>>     T ----- Page marked Dirty
>>     |
>>     + ----- Page marked Private2, through btrfs_run_dealloc_range()
>>     |
>>     + ----- Page cleared Private2, through btrfs_writepage_cow_fixup()
>>     |       in __extent_writepage_io()
>>     |       ^^^ Private2 cleared for the first time
>>     |
>>     + ----- Page marked Writeback, through btrfs_set_range_writeback()
>>     |       in __extent_writepage_io().
>>     |
>>     + ----- Page cleared Private2, through
>>     |       btrfs_writepage_endio_finish_ordered()
>>     |       ^^^ Private2 cleared for the second time.
>>     |
>>     + ----- Page cleared Writeback, through
>>             btrfs_writepage_endio_finish_ordered()
>>
>> Currently PagePrivate2 is mostly to prevent ordered extent accounting
>> being executed for both endio and invalidatepage.
>> Thus only the one who cleared page Private2 is responsible for ordered
>> extent accounting.
>>
>> But the fact is, in btrfs_writepage_endio_finish_ordered(), page
>> Private2 is cleared and ordered extent accounting is executed
>> unconditionally.
>>
>> The race prevention only happens through btrfs_invalidatepage(), where
>> we wait for the page writeback first, before checking the Private2 bit.
>>
>> This means, Private2 is also protected by Writeback bit, and there is no
>> need for btrfs_writepage_cow_fixup() to clear Priavte2.
>>
>> This patch will change btrfs_writepage_cow_fixup() to just check
>> PagePrivate2, not to clear it.
>> The clearing will happen in either btrfs_invalidatepage() or
>> btrfs_writepage_endio_finish_ordered().
>>
>> This makes the Private2 bit easier to understand, just meaning the page
>> has unfinished ordered extent attached to it.
>>
>> And this patch is a hard requirement for the incoming refactoring for
>> how we finished ordered IO for endio context, as the coming patch will
>> check Private2 to determine if we need to do the ordered extent
>> accounting.  Thus this patch is definitely needed or we will hang due to
>> unfinished ordered extent.
>>
>> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> Signed-off-by: David Sterba <dsterba@suse.com>
>> Signed-off-by: Sasha Levin <sashal@kernel.org>
>
>Please drop this patch from all autosel stable backports. This is not a
>standalone fix and the CC: stable@ is not there intentionally.

Will do.

>All patches that go through my tree are evaluated for stable backports
>up to 4.4 so it's unlikely the machinery you're using can find something
>I've overlooked.

If you'd like, I can make it ignore fs/btrfs/. The tool is there to help
maintainers who aren't as diligent w.r.t tagging patches for stable, not
to create extra noise for something.

We can ofcourse also keep the current workflow if you think that it's
helpful in some way.

>The patches that autosel picks and do not get a complain^Wreply for me
>are below the bar I'd consider it for stable but if after another review
>the patch "does no harm", I let it pass because you're obviously
>handling the backports. I would not tag it myself to avoid increasing
>load of Greg with patches that don't matter much.

I'd say don't worry about this side of things: we're trying to get any
fixes in, without too much regard to whether the fix is for something
"big" or "small". We'd prefer to go over extra patches here in exchange
for a much better user experience :)

-- 
Thanks,
Sasha

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2021-07-08 11:09 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20210704230420.1488358-1-sashal@kernel.org>
2021-07-04 23:04 ` [PATCH AUTOSEL 5.13 70/85] btrfs: fix error handling in __btrfs_update_delayed_inode Sasha Levin
2021-07-04 23:04 ` [PATCH AUTOSEL 5.13 71/85] btrfs: abort transaction if we fail to update the delayed inode Sasha Levin
2021-07-04 23:04 ` [PATCH AUTOSEL 5.13 72/85] btrfs: always abort the transaction if we abort a trans handle Sasha Levin
2021-07-04 23:04 ` [PATCH AUTOSEL 5.13 73/85] btrfs: sysfs: fix format string for some discard stats Sasha Levin
2021-07-04 23:04 ` [PATCH AUTOSEL 5.13 74/85] btrfs: scrub: fix subpage repair error caused by hard coded PAGE_SIZE Sasha Levin
2021-07-04 23:04 ` [PATCH AUTOSEL 5.13 75/85] btrfs: make Private2 lifespan more consistent Sasha Levin
2021-07-07 11:10   ` David Sterba
2021-07-08 11:09     ` Sasha Levin
2021-07-04 23:04 ` [PATCH AUTOSEL 5.13 76/85] btrfs: fix the filemap_range_has_page() call in btrfs_punch_hole_lock_range() Sasha Levin
2021-07-04 23:04 ` [PATCH AUTOSEL 5.13 77/85] btrfs: don't clear page extent mapped if we're not invalidating the full page Sasha Levin
2021-07-04 23:04 ` [PATCH AUTOSEL 5.13 78/85] btrfs: disable build on platforms having page size 256K Sasha Levin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox