[PATCH AUTOSEL 5.4 097/330] btrfs: tree-checker: Check leaf chunk item size

* [PATCH AUTOSEL 5.4 097/330] btrfs: tree-checker: Check leaf chunk item size
       [not found] <20200918020110.2063155-1-sashal@kernel.org>
@ 2020-09-18  1:57 ` Sasha Levin
  2020-09-18  1:58 ` [PATCH AUTOSEL 5.4 187/330] btrfs: do not init a reloc root if we aren't relocating Sasha Levin
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 7+ messages in thread
From: Sasha Levin @ 2020-09-18  1:57 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Qu Wenruo, Josef Bacik, David Sterba, Sasha Levin, linux-btrfs

From: Qu Wenruo <wqu@suse.com>

[ Upstream commit f6d2a5c263afca84646cf3300dc13061bedbd99e ]

Inspired by btrfs-progs GitHub issue #208, where a chunk item in the chunk
tree has an invalid num_stripes (0).

Although that can already be caught by the current btrfs_check_chunk_valid(),
that function doesn't check the item size, because it also needs to handle
chunk items in the super block sys_chunk_array.

This patch will add two extra checks for chunk items in chunk tree:

- Basic chunk item size
  If the item is smaller than btrfs_chunk (which already contains one
  stripe), exit right now as reading num_stripes may even go beyond
  eb boundary.

- Item size check against num_stripes
  If item size doesn't match with calculated chunk size, then either the
  item size or the num_stripes is corrupted. Error out anyway.
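
As a rough illustration of the two checks above, here is a minimal
userspace sketch. The demo_* structures and functions are hypothetical,
simplified stand-ins and not the kernel's on-disk definitions; the only
property carried over is that the chunk structure already embeds one
stripe, so each additional stripe adds one stripe-sized record.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the on-disk structures. */
struct demo_stripe {
	uint64_t devid;
	uint64_t offset;
	uint8_t dev_uuid[16];
};

struct demo_chunk {
	uint64_t length;
	uint16_t num_stripes;
	struct demo_stripe stripe;	/* the first stripe is embedded */
	/* any additional stripes follow the structure inside the item */
};

/* Expected item size for a chunk with num_stripes stripes (at least 1). */
static size_t demo_chunk_item_size(unsigned int num_stripes)
{
	assert(num_stripes >= 1);
	return sizeof(struct demo_chunk) +
	       sizeof(struct demo_stripe) * (num_stripes - 1);
}

/* Returns 0 if both size checks pass, -1 if the item looks corrupted. */
static int demo_check_chunk_item_size(const struct demo_chunk *chunk,
				      size_t item_size)
{
	/* Check 1: too small to read num_stripes safely, bail out early. */
	if (item_size < sizeof(struct demo_chunk))
		return -1;

	/* num_stripes == 0 is left to the generic validity checker. */
	if (chunk->num_stripes == 0)
		return 0;

	/* Check 2: item size must match the size implied by num_stripes. */
	if (demo_chunk_item_size(chunk->num_stripes) != item_size)
		return -1;

	return 0;
}
```

Note that the first check runs before num_stripes is read at all, which
is the ordering the commit message asks for.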

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/btrfs/tree-checker.c | 40 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
index 91ea38506fbb7..84b8d6ebf98f3 100644
--- a/fs/btrfs/tree-checker.c
+++ b/fs/btrfs/tree-checker.c
@@ -674,6 +674,44 @@ int btrfs_check_chunk_valid(struct extent_buffer *leaf,
 	return 0;
 }
 
+/*
+ * Enhanced version of chunk item checker.
+ *
+ * The common btrfs_check_chunk_valid() doesn't check item size since it needs
+ * to work on super block sys_chunk_array which doesn't have full item ptr.
+ */
+static int check_leaf_chunk_item(struct extent_buffer *leaf,
+				 struct btrfs_chunk *chunk,
+				 struct btrfs_key *key, int slot)
+{
+	int num_stripes;
+
+	if (btrfs_item_size_nr(leaf, slot) < sizeof(struct btrfs_chunk)) {
+		chunk_err(leaf, chunk, key->offset,
+			"invalid chunk item size: have %u expect [%zu, %u)",
+			btrfs_item_size_nr(leaf, slot),
+			sizeof(struct btrfs_chunk),
+			BTRFS_LEAF_DATA_SIZE(leaf->fs_info));
+		return -EUCLEAN;
+	}
+
+	num_stripes = btrfs_chunk_num_stripes(leaf, chunk);
+	/* Let btrfs_check_chunk_valid() handle this error type */
+	if (num_stripes == 0)
+		goto out;
+
+	if (btrfs_chunk_item_size(num_stripes) !=
+	    btrfs_item_size_nr(leaf, slot)) {
+		chunk_err(leaf, chunk, key->offset,
+			"invalid chunk item size: have %u expect %lu",
+			btrfs_item_size_nr(leaf, slot),
+			btrfs_chunk_item_size(num_stripes));
+		return -EUCLEAN;
+	}
+out:
+	return btrfs_check_chunk_valid(leaf, chunk, key->offset);
+}
+
 __printf(3, 4)
 __cold
 static void dev_item_err(const struct extent_buffer *eb, int slot,
@@ -1265,7 +1303,7 @@ static int check_leaf_item(struct extent_buffer *leaf,
 		break;
 	case BTRFS_CHUNK_ITEM_KEY:
 		chunk = btrfs_item_ptr(leaf, slot, struct btrfs_chunk);
-		ret = btrfs_check_chunk_valid(leaf, chunk, key->offset);
+		ret = check_leaf_chunk_item(leaf, chunk, key, slot);
 		break;
 	case BTRFS_DEV_ITEM_KEY:
 		ret = check_dev_item(leaf, key, slot);
-- 
2.25.1



* [PATCH AUTOSEL 5.4 187/330] btrfs: do not init a reloc root if we aren't relocating
       [not found] <20200918020110.2063155-1-sashal@kernel.org>
  2020-09-18  1:57 ` [PATCH AUTOSEL 5.4 097/330] btrfs: tree-checker: Check leaf chunk item size Sasha Levin
@ 2020-09-18  1:58 ` Sasha Levin
  2020-09-18  1:58 ` [PATCH AUTOSEL 5.4 188/330] btrfs: free the reloc_control in a consistent way Sasha Levin
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 7+ messages in thread
From: Sasha Levin @ 2020-09-18  1:58 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Josef Bacik, Qu Wenruo, David Sterba, Sasha Levin, linux-btrfs

From: Josef Bacik <josef@toxicpanda.com>

[ Upstream commit 2abc726ab4b83db774e315c660ab8da21477092f ]

We previously were checking if the root had a dead root before accessing
root->reloc_root in order to avoid a use-after-free type bug.  However
this scenario happens after we've unset the reloc control, so we would
have been saved if we'd simply checked for fs_info->reloc_control.  At
this point during relocation we no longer need to be creating new reloc
roots, so simply move this check above the reloc_root checks to avoid
any future races and confusion.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/btrfs/relocation.c | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index af3605a0bf2e0..1313506a7ecb5 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -1468,6 +1468,10 @@ int btrfs_init_reloc_root(struct btrfs_trans_handle *trans,
 	int clear_rsv = 0;
 	int ret;
 
+	if (!rc || !rc->create_reloc_tree ||
+	    root->root_key.objectid == BTRFS_TREE_RELOC_OBJECTID)
+		return 0;
+
 	/*
 	 * The subvolume has reloc tree but the swap is finished, no need to
 	 * create/update the dead reloc tree
@@ -1481,10 +1485,6 @@ int btrfs_init_reloc_root(struct btrfs_trans_handle *trans,
 		return 0;
 	}
 
-	if (!rc || !rc->create_reloc_tree ||
-	    root->root_key.objectid == BTRFS_TREE_RELOC_OBJECTID)
-		return 0;
-
 	if (!trans->reloc_reserved) {
 		rsv = trans->block_rsv;
 		trans->block_rsv = rc->block_rsv;
@@ -2336,6 +2336,18 @@ static noinline_for_stack int merge_reloc_root(struct reloc_control *rc,
 			trans = NULL;
 			goto out;
 		}
+
+		/*
+		 * At this point we no longer have a reloc_control, so we can't
+		 * depend on btrfs_init_reloc_root to update our last_trans.
+		 *
+		 * But that's ok, we started the trans handle on our
+		 * corresponding fs_root, which means it's been added to the
+		 * dirty list.  At commit time we'll still call
+		 * btrfs_update_reloc_root() and update our root item
+		 * appropriately.
+		 */
+		reloc_root->last_trans = trans->transid;
 		trans->block_rsv = rc->block_rsv;
 
 		replaced = 0;
-- 
2.25.1



* [PATCH AUTOSEL 5.4 188/330] btrfs: free the reloc_control in a consistent way
       [not found] <20200918020110.2063155-1-sashal@kernel.org>
  2020-09-18  1:57 ` [PATCH AUTOSEL 5.4 097/330] btrfs: tree-checker: Check leaf chunk item size Sasha Levin
  2020-09-18  1:58 ` [PATCH AUTOSEL 5.4 187/330] btrfs: do not init a reloc root if we aren't relocating Sasha Levin
@ 2020-09-18  1:58 ` Sasha Levin
  2020-09-18  1:59 ` [PATCH AUTOSEL 5.4 237/330] btrfs: fix setting last_trans for reloc roots Sasha Levin
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 7+ messages in thread
From: Sasha Levin @ 2020-09-18  1:58 UTC (permalink / raw)
  To: linux-kernel, stable; +Cc: Josef Bacik, David Sterba, Sasha Levin, linux-btrfs

From: Josef Bacik <josef@toxicpanda.com>

[ Upstream commit 1a0afa0ecfc4dbc8d7583d03cafd3f68f781df0c ]

If we have an error while processing the reloc roots we could leak roots
that were added to rc->reloc_roots before we hit the error.  We could
have also not removed the reloc tree mapping from our rb_tree, so clean
up any remaining nodes in the reloc root rb_tree.
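
The cleanup described above frees every node left in a tree. A minimal
userspace sketch of why a post-order walk suits this: children are
visited before their parent, so freeing a node never invalidates the
rest of the traversal. The demo_* names are hypothetical and this plain
binary tree is only a stand-in for the kernel rb-tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a tree node; not the kernel rb_node. */
struct demo_node {
	struct demo_node *left;
	struct demo_node *right;
};

/* Allocate a node with the given children. */
static struct demo_node *demo_node_new(struct demo_node *left,
				       struct demo_node *right)
{
	struct demo_node *n = malloc(sizeof(*n));

	assert(n != NULL);
	n->left = left;
	n->right = right;
	return n;
}

/*
 * Free the whole tree in post-order: both subtrees first, then the node
 * itself.  Returns the number of nodes freed.
 */
static size_t demo_free_postorder(struct demo_node *n)
{
	size_t freed;

	if (!n)
		return 0;
	freed = demo_free_postorder(n->left);
	freed += demo_free_postorder(n->right);
	free(n);
	return freed + 1;
}
```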

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ use rbtree_postorder_for_each_entry_safe ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/btrfs/relocation.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 1313506a7ecb5..ece53d2f55ae3 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -4354,6 +4354,18 @@ static struct reloc_control *alloc_reloc_control(struct btrfs_fs_info *fs_info)
 	return rc;
 }
 
+static void free_reloc_control(struct reloc_control *rc)
+{
+	struct mapping_node *node, *tmp;
+
+	free_reloc_roots(&rc->reloc_roots);
+	rbtree_postorder_for_each_entry_safe(node, tmp,
+			&rc->reloc_root_tree.rb_root, rb_node)
+		kfree(node);
+
+	kfree(rc);
+}
+
 /*
  * Print the block group being relocated
  */
@@ -4486,7 +4498,7 @@ out:
 		btrfs_dec_block_group_ro(rc->block_group);
 	iput(rc->data_inode);
 	btrfs_put_block_group(rc->block_group);
-	kfree(rc);
+	free_reloc_control(rc);
 	return err;
 }
 
@@ -4659,7 +4671,7 @@ out_clean:
 		err = ret;
 out_unset:
 	unset_reloc_control(rc);
-	kfree(rc);
+	free_reloc_control(rc);
 out:
 	if (!list_empty(&reloc_roots))
 		free_reloc_roots(&reloc_roots);
-- 
2.25.1



* [PATCH AUTOSEL 5.4 237/330] btrfs: fix setting last_trans for reloc roots
       [not found] <20200918020110.2063155-1-sashal@kernel.org>
                   ` (2 preceding siblings ...)
  2020-09-18  1:58 ` [PATCH AUTOSEL 5.4 188/330] btrfs: free the reloc_control in a consistent way Sasha Levin
@ 2020-09-18  1:59 ` Sasha Levin
  2020-09-18  2:00 ` [PATCH AUTOSEL 5.4 286/330] btrfs: don't force read-only after error in drop snapshot Sasha Levin
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 7+ messages in thread
From: Sasha Levin @ 2020-09-18  1:59 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Josef Bacik, Filipe Manana, David Sterba, Sasha Levin,
	linux-btrfs

From: Josef Bacik <josef@toxicpanda.com>

[ Upstream commit aec7db3b13a07d515c15ada752a7287a44a79ea0 ]

I made a mistake with my previous fix: I assumed that we didn't need to
mess with the reloc roots once we were out of the part of relocation where
we are actually moving the extents.

The subtle thing that I missed is that btrfs_init_reloc_root() also
updates the last_trans for the reloc root when we do
btrfs_record_root_in_trans() for the corresponding fs_root.  I've added a
comment to make sure future me doesn't make this mistake again.

This showed up as a WARN_ON() in btrfs_copy_root() because our
last_trans didn't == the current transid.  This could happen if we
snapshotted a fs root with a reloc root after we set
rc->create_reloc_tree = 0, but before we actually merge the reloc root.

Worth mentioning that the regression produced the following warning
when running snapshot creation and balance in parallel:

  BTRFS info (device sdc): relocating block group 30408704 flags metadata|dup
  ------------[ cut here ]------------
  WARNING: CPU: 0 PID: 12823 at fs/btrfs/ctree.c:191 btrfs_copy_root+0x26f/0x430 [btrfs]
  CPU: 0 PID: 12823 Comm: btrfs Tainted: G        W 5.6.0-rc7-btrfs-next-58 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
  RIP: 0010:btrfs_copy_root+0x26f/0x430 [btrfs]
  RSP: 0018:ffffb96e044279b8 EFLAGS: 00010202
  RAX: 0000000000000009 RBX: ffff9da70bf61000 RCX: ffffb96e04427a48
  RDX: ffff9da733a770c8 RSI: ffff9da70bf61000 RDI: ffff9da694163818
  RBP: ffff9da733a770c8 R08: fffffffffffffff8 R09: 0000000000000002
  R10: ffffb96e044279a0 R11: 0000000000000000 R12: ffff9da694163818
  R13: fffffffffffffff8 R14: ffff9da6d2512000 R15: ffff9da714cdac00
  FS:  00007fdeacf328c0(0000) GS:ffff9da735e00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 000055a2a5b8a118 CR3: 00000001eed78002 CR4: 00000000003606f0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   ? create_reloc_root+0x49/0x2b0 [btrfs]
   ? kmem_cache_alloc_trace+0xe5/0x200
   create_reloc_root+0x8b/0x2b0 [btrfs]
   btrfs_reloc_post_snapshot+0x96/0x5b0 [btrfs]
   create_pending_snapshot+0x610/0x1010 [btrfs]
   create_pending_snapshots+0xa8/0xd0 [btrfs]
   btrfs_commit_transaction+0x4c7/0xc50 [btrfs]
   ? btrfs_mksubvol+0x3cd/0x560 [btrfs]
   btrfs_mksubvol+0x455/0x560 [btrfs]
   __btrfs_ioctl_snap_create+0x15f/0x190 [btrfs]
   btrfs_ioctl_snap_create_v2+0xa4/0xf0 [btrfs]
   ? mem_cgroup_commit_charge+0x6e/0x540
   btrfs_ioctl+0x12d8/0x3760 [btrfs]
   ? do_raw_spin_unlock+0x49/0xc0
   ? _raw_spin_unlock+0x29/0x40
   ? __handle_mm_fault+0x11b3/0x14b0
   ? ksys_ioctl+0x92/0xb0
   ksys_ioctl+0x92/0xb0
   ? trace_hardirqs_off_thunk+0x1a/0x1c
   __x64_sys_ioctl+0x16/0x20
   do_syscall_64+0x5c/0x280
   entry_SYSCALL_64_after_hwframe+0x49/0xbe
  RIP: 0033:0x7fdeabd3bdd7

Fixes: 2abc726ab4b8 ("btrfs: do not init a reloc root if we aren't relocating")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/btrfs/relocation.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index ece53d2f55ae3..1bc57f7b91cfa 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -1468,8 +1468,7 @@ int btrfs_init_reloc_root(struct btrfs_trans_handle *trans,
 	int clear_rsv = 0;
 	int ret;
 
-	if (!rc || !rc->create_reloc_tree ||
-	    root->root_key.objectid == BTRFS_TREE_RELOC_OBJECTID)
+	if (!rc)
 		return 0;
 
 	/*
@@ -1479,12 +1478,28 @@ int btrfs_init_reloc_root(struct btrfs_trans_handle *trans,
 	if (reloc_root_is_dead(root))
 		return 0;
 
+	/*
+	 * This is subtle but important.  We do not do
+	 * record_root_in_transaction for reloc roots, instead we record their
+	 * corresponding fs root, and then here we update the last trans for the
+	 * reloc root.  This means that we have to do this for the entire life
+	 * of the reloc root, regardless of which stage of the relocation we are
+	 * in.
+	 */
 	if (root->reloc_root) {
 		reloc_root = root->reloc_root;
 		reloc_root->last_trans = trans->transid;
 		return 0;
 	}
 
+	/*
+	 * We are merging reloc roots, we do not need new reloc trees.  Also
+	 * reloc trees never need their own reloc tree.
+	 */
+	if (!rc->create_reloc_tree ||
+	    root->root_key.objectid == BTRFS_TREE_RELOC_OBJECTID)
+		return 0;
+
 	if (!trans->reloc_reserved) {
 		rsv = trans->block_rsv;
 		trans->block_rsv = rc->block_rsv;
-- 
2.25.1



* [PATCH AUTOSEL 5.4 286/330] btrfs: don't force read-only after error in drop snapshot
       [not found] <20200918020110.2063155-1-sashal@kernel.org>
                   ` (3 preceding siblings ...)
  2020-09-18  1:59 ` [PATCH AUTOSEL 5.4 237/330] btrfs: fix setting last_trans for reloc roots Sasha Levin
@ 2020-09-18  2:00 ` Sasha Levin
  2020-09-18  2:00 ` [PATCH AUTOSEL 5.4 287/330] btrfs: fix double __endio_write_update_ordered in direct I/O Sasha Levin
  2020-09-18  2:01 ` [PATCH AUTOSEL 5.4 320/330] btrfs: qgroup: fix data leak caused by race between writeback and truncate Sasha Levin
  6 siblings, 0 replies; 7+ messages in thread
From: Sasha Levin @ 2020-09-18  2:00 UTC (permalink / raw)
  To: linux-kernel, stable; +Cc: David Sterba, Sasha Levin, linux-btrfs

From: David Sterba <dsterba@suse.com>

[ Upstream commit 7c09c03091ac562ddca2b393e5d65c1d37da79f1 ]

Deleting a subvolume on a full filesystem leads to ENOSPC followed by a
forced read-only. This is not a transaction abort and the filesystem is
otherwise ok, so the error should be just propagated to the callers.

This is caused by an unnecessary call to btrfs_handle_fs_error for all
errors except EAGAIN. This does not make sense, as the standard
transaction abort mechanism is in btrfs_drop_snapshot, so all relevant
failures are already handled.

Originally, in commit cb1b69f4508a ("Btrfs: forced readonly when
btrfs_drop_snapshot() fails"), there was no return value at all, so the
btrfs_std_error call made some sense, but now that error handling and
propagation have been implemented we don't need it anymore.

Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/btrfs/extent-tree.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 541497036cc24..60c3a03203fae 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5429,8 +5429,6 @@ out:
 	 */
 	if (!for_reloc && !root_dropped)
 		btrfs_add_dead_root(root);
-	if (err && err != -EAGAIN)
-		btrfs_handle_fs_error(fs_info, err, NULL);
 	return err;
 }

-- 
2.25.1


* [PATCH AUTOSEL 5.4 287/330] btrfs: fix double __endio_write_update_ordered in direct I/O
       [not found] <20200918020110.2063155-1-sashal@kernel.org>
                   ` (4 preceding siblings ...)
  2020-09-18  2:00 ` [PATCH AUTOSEL 5.4 286/330] btrfs: don't force read-only after error in drop snapshot Sasha Levin
@ 2020-09-18  2:00 ` Sasha Levin
  2020-09-18  2:01 ` [PATCH AUTOSEL 5.4 320/330] btrfs: qgroup: fix data leak caused by race between writeback and truncate Sasha Levin
  6 siblings, 0 replies; 7+ messages in thread
From: Sasha Levin @ 2020-09-18  2:00 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Omar Sandoval, Johannes Thumshirn, Nikolay Borisov, David Sterba,
	Sasha Levin, linux-btrfs

From: Omar Sandoval <osandov@fb.com>

[ Upstream commit c36cac28cb94e58f7e21ff43bdc6064346dab32c ]

In btrfs_submit_direct(), if we fail to allocate the btrfs_dio_private,
we complete the ordered extent range. However, we don't mark that the
range doesn't need to be cleaned up from btrfs_direct_IO() until later.
Therefore, if we fail to allocate the btrfs_dio_private, we complete the
ordered extent range twice. We could fix this by updating
unsubmitted_oe_range earlier, but it's cleaner to reorganize the code so
that creating the btrfs_dio_private and submitting the bios are
separate, and once the btrfs_dio_private is created, cleanup always
happens through the btrfs_dio_private.

The logic around unsubmitted_oe_range_end and unsubmitted_oe_range_start
is really subtle. We have the following:

  1. btrfs_direct_IO sets those two to the same value.

  2. When we call __blockdev_direct_IO, cleanup won't happen unless
     btrfs_get_blocks_direct->btrfs_get_blocks_direct_write is called to
     modify unsubmitted_oe_range_start so that start < end.

  3. We come into btrfs_submit_direct - if dip allocation fails we'd
     return with oe_range_end now modified so cleanup will happen.

  4. If we manage to allocate the dip we reset the unsubmitted range
     members to be equal so that cleanup happens from
     btrfs_endio_direct_write.

This 4-step logic is not really obvious, especially given it's scattered
across 3 functions.
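
The four steps above can be modelled roughly as below. This is only a
sketch under stated assumptions: all demo_* names are hypothetical, the
range is made non-empty by moving its end (one way to get start < end),
and completion through the dio private's end-io callback is not
modelled. It shows why completing the ordered range on allocation
failure, while leaving the range non-empty, ends in two completions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the per-call direct I/O bookkeeping. */
struct demo_dio_data {
	uint64_t unsubmitted_start;
	uint64_t unsubmitted_end;
	int completions;	/* times the ordered range was completed */
};

/* Step 1: the caller starts with an empty range (start == end). */
static void demo_begin(struct demo_dio_data *d, uint64_t offset)
{
	d->unsubmitted_start = offset;
	d->unsubmitted_end = offset;
	d->completions = 0;
}

/* Step 2: mapping blocks for a write makes the range non-empty. */
static void demo_map_write(struct demo_dio_data *d, uint64_t len)
{
	assert(len > 0);
	d->unsubmitted_end = d->unsubmitted_start + len;
}

/* Steps 3 and 4: submission, with or without a successful allocation. */
static void demo_submit(struct demo_dio_data *d, bool alloc_ok,
			bool old_behaviour)
{
	if (!alloc_ok) {
		/* Old code completed the range here but left it non-empty. */
		if (old_behaviour)
			d->completions++;
		return;
	}
	/* Private allocated: it owns cleanup now, so empty the range. */
	d->unsubmitted_start = d->unsubmitted_end;
}

/* The caller cleans up whatever range is still marked unsubmitted. */
static void demo_caller_cleanup(struct demo_dio_data *d)
{
	if (d->unsubmitted_start < d->unsubmitted_end)
		d->completions++;
}

/* Run one write and report how many times the range was completed. */
static int demo_run(bool alloc_ok, bool old_behaviour)
{
	struct demo_dio_data d;

	demo_begin(&d, 4096);
	demo_map_write(&d, 8192);
	demo_submit(&d, alloc_ok, old_behaviour);
	demo_caller_cleanup(&d);
	return d.completions;
}
```

In this model a failed allocation with the old behaviour yields two
completions, while leaving cleanup to the caller yields exactly one.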

Fixes: f28a49287817 ("Btrfs: fix leaking of ordered extents after direct IO write error")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
[ add range start/end logic explanation from Nikolay ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/btrfs/inode.c | 178 +++++++++++++++++++----------------------------
 1 file changed, 70 insertions(+), 108 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 9ac40991a6405..e9787b7b943a2 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8586,14 +8586,64 @@ err:
 	return ret;
 }
 
-static int btrfs_submit_direct_hook(struct btrfs_dio_private *dip)
+/*
+ * If this succeeds, the btrfs_dio_private is responsible for cleaning up locked
+ * or ordered extents whether or not we submit any bios.
+ */
+static struct btrfs_dio_private *btrfs_create_dio_private(struct bio *dio_bio,
+							  struct inode *inode,
+							  loff_t file_offset)
 {
-	struct inode *inode = dip->inode;
+	const bool write = (bio_op(dio_bio) == REQ_OP_WRITE);
+	struct btrfs_dio_private *dip;
+	struct bio *bio;
+
+	dip = kzalloc(sizeof(*dip), GFP_NOFS);
+	if (!dip)
+		return NULL;
+
+	bio = btrfs_bio_clone(dio_bio);
+	bio->bi_private = dip;
+	btrfs_io_bio(bio)->logical = file_offset;
+
+	dip->private = dio_bio->bi_private;
+	dip->inode = inode;
+	dip->logical_offset = file_offset;
+	dip->bytes = dio_bio->bi_iter.bi_size;
+	dip->disk_bytenr = (u64)dio_bio->bi_iter.bi_sector << 9;
+	dip->orig_bio = bio;
+	dip->dio_bio = dio_bio;
+	atomic_set(&dip->pending_bios, 1);
+
+	if (write) {
+		struct btrfs_dio_data *dio_data = current->journal_info;
+
+		/*
+		 * Setting range start and end to the same value means that
+		 * no cleanup will happen in btrfs_direct_IO
+		 */
+		dio_data->unsubmitted_oe_range_end = dip->logical_offset +
+			dip->bytes;
+		dio_data->unsubmitted_oe_range_start =
+			dio_data->unsubmitted_oe_range_end;
+
+		bio->bi_end_io = btrfs_endio_direct_write;
+	} else {
+		bio->bi_end_io = btrfs_endio_direct_read;
+		dip->subio_endio = btrfs_subio_endio_read;
+	}
+	return dip;
+}
+
+static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
+				loff_t file_offset)
+{
+	const bool write = (bio_op(dio_bio) == REQ_OP_WRITE);
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	struct btrfs_dio_private *dip;
 	struct bio *bio;
-	struct bio *orig_bio = dip->orig_bio;
-	u64 start_sector = orig_bio->bi_iter.bi_sector;
-	u64 file_offset = dip->logical_offset;
+	struct bio *orig_bio;
+	u64 start_sector;
 	int async_submit = 0;
 	u64 submit_len;
 	int clone_offset = 0;
@@ -8602,11 +8652,24 @@ static int btrfs_submit_direct_hook(struct btrfs_dio_private *dip)
 	blk_status_t status;
 	struct btrfs_io_geometry geom;
 
+	dip = btrfs_create_dio_private(dio_bio, inode, file_offset);
+	if (!dip) {
+		if (!write) {
+			unlock_extent(&BTRFS_I(inode)->io_tree, file_offset,
+				file_offset + dio_bio->bi_iter.bi_size - 1);
+		}
+		dio_bio->bi_status = BLK_STS_RESOURCE;
+		dio_end_io(dio_bio);
+		return;
+	}
+
+	orig_bio = dip->orig_bio;
+	start_sector = orig_bio->bi_iter.bi_sector;
 	submit_len = orig_bio->bi_iter.bi_size;
 	ret = btrfs_get_io_geometry(fs_info, btrfs_op(orig_bio),
 				    start_sector << 9, submit_len, &geom);
 	if (ret)
-		return -EIO;
+		goto out_err;
 
 	if (geom.len >= submit_len) {
 		bio = orig_bio;
@@ -8669,7 +8732,7 @@ static int btrfs_submit_direct_hook(struct btrfs_dio_private *dip)
 submit:
 	status = btrfs_submit_dio_bio(bio, inode, file_offset, async_submit);
 	if (!status)
-		return 0;
+		return;
 
 	if (bio != orig_bio)
 		bio_put(bio);
@@ -8683,107 +8746,6 @@ out_err:
 	 */
 	if (atomic_dec_and_test(&dip->pending_bios))
 		bio_io_error(dip->orig_bio);
-
-	/* bio_end_io() will handle error, so we needn't return it */
-	return 0;
-}
-
-static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
-				loff_t file_offset)
-{
-	struct btrfs_dio_private *dip = NULL;
-	struct bio *bio = NULL;
-	struct btrfs_io_bio *io_bio;
-	bool write = (bio_op(dio_bio) == REQ_OP_WRITE);
-	int ret = 0;
-
-	bio = btrfs_bio_clone(dio_bio);
-
-	dip = kzalloc(sizeof(*dip), GFP_NOFS);
-	if (!dip) {
-		ret = -ENOMEM;
-		goto free_ordered;
-	}
-
-	dip->private = dio_bio->bi_private;
-	dip->inode = inode;
-	dip->logical_offset = file_offset;
-	dip->bytes = dio_bio->bi_iter.bi_size;
-	dip->disk_bytenr = (u64)dio_bio->bi_iter.bi_sector << 9;
-	bio->bi_private = dip;
-	dip->orig_bio = bio;
-	dip->dio_bio = dio_bio;
-	atomic_set(&dip->pending_bios, 1);
-	io_bio = btrfs_io_bio(bio);
-	io_bio->logical = file_offset;
-
-	if (write) {
-		bio->bi_end_io = btrfs_endio_direct_write;
-	} else {
-		bio->bi_end_io = btrfs_endio_direct_read;
-		dip->subio_endio = btrfs_subio_endio_read;
-	}
-
-	/*
-	 * Reset the range for unsubmitted ordered extents (to a 0 length range)
-	 * even if we fail to submit a bio, because in such case we do the
-	 * corresponding error handling below and it must not be done a second
-	 * time by btrfs_direct_IO().
-	 */
-	if (write) {
-		struct btrfs_dio_data *dio_data = current->journal_info;
-
-		dio_data->unsubmitted_oe_range_end = dip->logical_offset +
-			dip->bytes;
-		dio_data->unsubmitted_oe_range_start =
-			dio_data->unsubmitted_oe_range_end;
-	}
-
-	ret = btrfs_submit_direct_hook(dip);
-	if (!ret)
-		return;
-
-	btrfs_io_bio_free_csum(io_bio);
-
-free_ordered:
-	/*
-	 * If we arrived here it means either we failed to submit the dip
-	 * or we either failed to clone the dio_bio or failed to allocate the
-	 * dip. If we cloned the dio_bio and allocated the dip, we can just
-	 * call bio_endio against our io_bio so that we get proper resource
-	 * cleanup if we fail to submit the dip, otherwise, we must do the
-	 * same as btrfs_endio_direct_[write|read] because we can't call these
-	 * callbacks - they require an allocated dip and a clone of dio_bio.
-	 */
-	if (bio && dip) {
-		bio_io_error(bio);
-		/*
-		 * The end io callbacks free our dip, do the final put on bio
-		 * and all the cleanup and final put for dio_bio (through
-		 * dio_end_io()).
-		 */
-		dip = NULL;
-		bio = NULL;
-	} else {
-		if (write)
-			__endio_write_update_ordered(inode,
-						file_offset,
-						dio_bio->bi_iter.bi_size,
-						false);
-		else
-			unlock_extent(&BTRFS_I(inode)->io_tree, file_offset,
-			      file_offset + dio_bio->bi_iter.bi_size - 1);
-
-		dio_bio->bi_status = BLK_STS_IOERR;
-		/*
-		 * Releases and cleans up our dio_bio, no need to bio_put()
-		 * nor bio_endio()/bio_io_error() against dio_bio.
-		 */
-		dio_end_io(dio_bio);
-	}
-	if (bio)
-		bio_put(bio);
-	kfree(dip);
 }
 
 static ssize_t check_direct_IO(struct btrfs_fs_info *fs_info,
-- 
2.25.1



* [PATCH AUTOSEL 5.4 320/330] btrfs: qgroup: fix data leak caused by race between writeback and truncate
       [not found] <20200918020110.2063155-1-sashal@kernel.org>
                   ` (5 preceding siblings ...)
  2020-09-18  2:00 ` [PATCH AUTOSEL 5.4 287/330] btrfs: fix double __endio_write_update_ordered in direct I/O Sasha Levin
@ 2020-09-18  2:01 ` Sasha Levin
  6 siblings, 0 replies; 7+ messages in thread
From: Sasha Levin @ 2020-09-18  2:01 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Qu Wenruo, Josef Bacik, David Sterba, Sasha Levin, linux-btrfs

From: Qu Wenruo <wqu@suse.com>

[ Upstream commit fa91e4aa1716004ea8096d5185ec0451e206aea0 ]

[BUG]
When running tests like generic/013 on a test device with btrfs quota
enabled, it can lead to a qgroup reserved space leak, detected at unmount
time:

  BTRFS warning (device dm-3): qgroup 0/5 has unreleased space, type 0 rsv 4096
  ------------[ cut here ]------------
  WARNING: CPU: 11 PID: 16386 at fs/btrfs/disk-io.c:4142 close_ctree+0x1dc/0x323 [btrfs]
  RIP: 0010:close_ctree+0x1dc/0x323 [btrfs]
  Call Trace:
   btrfs_put_super+0x15/0x17 [btrfs]
   generic_shutdown_super+0x72/0x110
   kill_anon_super+0x18/0x30
   btrfs_kill_super+0x17/0x30 [btrfs]
   deactivate_locked_super+0x3b/0xa0
   deactivate_super+0x40/0x50
   cleanup_mnt+0x135/0x190
   __cleanup_mnt+0x12/0x20
   task_work_run+0x64/0xb0
   __prepare_exit_to_usermode+0x1bc/0x1c0
   __syscall_return_slowpath+0x47/0x230
   do_syscall_64+0x64/0xb0
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  ---[ end trace caf08beafeca2392 ]---
  BTRFS error (device dm-3): qgroup reserved space leaked

[CAUSE]
In the offending case, the operations are:
2/6: writev f2X[269 1 0 0 0 0] [1006997,67,288] 0
2/7: truncate f2X[269 1 0 0 48 1026293] 18388 0

The following sequence of events could happen after the writev():
	CPU1 (writeback)		|		CPU2 (truncate)
-----------------------------------------------------------------
btrfs_writepages()			|
|- extent_write_cache_pages()		|
   |- Got page for 1003520		|
   |  1003520 is Dirty, no writeback	|
   |  So (!clear_page_dirty_for_io())   |
   |  gets called for it		|
   |- Now page 1003520 is Clean.	|
   |					| btrfs_setattr()
   |					| |- btrfs_setsize()
   |					|    |- truncate_setsize()
   |					|       New i_size is 18388
   |- __extent_writepage()		|
   |  |- page_offset() > i_size		|
      |- btrfs_invalidatepage()		|
	 |- Page is clean, so no qgroup |
	    callback executed

This means the qgroup reserved data space is not properly released in
btrfs_invalidatepage() because the page is Clean.

[FIX]
Instead of checking the dirty bit of a page, call
btrfs_qgroup_free_data() unconditionally in btrfs_invalidatepage().

Since qgroup rsv is bound entirely to the QGROUP_RESERVED bit of the
io_tree, not to the page status, this won't cause a double free.
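
A minimal sketch of why the unconditional call is safe, assuming a
simplified model in which one flag per page stands in for the
QGROUP_RESERVED bit. All demo_* names are hypothetical. Because the free
path keys off the flag and clears it, a second call for the same page
releases nothing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096u
#define DEMO_NR_PAGES 8u

/* Hypothetical model: one "reserved" flag per page plus a byte counter. */
struct demo_inode {
	bool reserved[DEMO_NR_PAGES];	/* stand-in for QGROUP_RESERVED */
	uint64_t qgroup_rsv;		/* bytes currently reserved */
};

/* Reserve space for a page; reserving twice does not double-count. */
static void demo_reserve(struct demo_inode *in, size_t page)
{
	assert(page < DEMO_NR_PAGES);
	if (!in->reserved[page]) {
		in->reserved[page] = true;
		in->qgroup_rsv += DEMO_PAGE_SIZE;
	}
}

/*
 * Release the reservation for a page.  The decision depends only on the
 * reserved flag, never on whether the page is dirty.  Returns the number
 * of bytes actually released.
 */
static uint64_t demo_free_data(struct demo_inode *in, size_t page)
{
	assert(page < DEMO_NR_PAGES);
	if (!in->reserved[page])
		return 0;
	in->reserved[page] = false;
	in->qgroup_rsv -= DEMO_PAGE_SIZE;
	return DEMO_PAGE_SIZE;
}
```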

Fixes: 0b34c261e235 ("btrfs: qgroup: Prevent qgroup->reserved from going subzero")
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/btrfs/inode.c | 23 ++++++++++-------------
 1 file changed, 10 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e9787b7b943a2..182e93a5b11d5 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9044,20 +9044,17 @@ again:
 	/*
 	 * Qgroup reserved space handler
 	 * Page here will be either
-	 * 1) Already written to disk
-	 *    In this case, its reserved space is released from data rsv map
-	 *    and will be freed by delayed_ref handler finally.
-	 *    So even we call qgroup_free_data(), it won't decrease reserved
-	 *    space.
-	 * 2) Not written to disk
-	 *    This means the reserved space should be freed here. However,
-	 *    if a truncate invalidates the page (by clearing PageDirty)
-	 *    and the page is accounted for while allocating extent
-	 *    in btrfs_check_data_free_space() we let delayed_ref to
-	 *    free the entire extent.
+	 * 1) Already written to disk or ordered extent already submitted
+	 *    Then its QGROUP_RESERVED bit in io_tree is already cleaned.
+	 *    Qgroup will be handled by its qgroup_record then.
+	 *    btrfs_qgroup_free_data() call will do nothing here.
+	 *
+	 * 2) Not written to disk yet
+	 *    Then btrfs_qgroup_free_data() call will clear the QGROUP_RESERVED
+	 *    bit of its io_tree, and free the qgroup reserved data space.
+	 *    Since the IO will never happen for this page.
 	 */
-	if (PageDirty(page))
-		btrfs_qgroup_free_data(inode, NULL, page_start, PAGE_SIZE);
+	btrfs_qgroup_free_data(inode, NULL, page_start, PAGE_SIZE);
 	if (!inode_evicting) {
 		clear_extent_bit(tree, page_start, page_end, EXTENT_LOCKED |
 				 EXTENT_DELALLOC | EXTENT_DELALLOC_NEW |
-- 
2.25.1


