[PATCH AUTOSEL 5.2 26/85] btrfs: tree-checker: Check if the file extent end overflows

linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH AUTOSEL 5.2 26/85] btrfs: tree-checker: Check if the file extent end overflows
       [not found] <20190726133936.11177-1-sashal@kernel.org>
@ 2019-07-26 13:38 ` Sasha Levin
  2019-07-26 13:38 ` [PATCH AUTOSEL 5.2 27/85] btrfs: fix minimum number of chunk errors for DUP Sasha Levin
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 4+ messages in thread
From: Sasha Levin @ 2019-07-26 13:38 UTC (permalink / raw)
  To: linux-kernel, stable; +Cc: Qu Wenruo, David Sterba, Sasha Levin, linux-btrfs

From: Qu Wenruo <wqu@suse.com>

[ Upstream commit 4c094c33c9ed4b8d0d814bd1d7ff78e123d15d00 ]

Under certain conditions, we could have strange file extent item in log
tree like:

  item 18 key (69599 108 397312) itemoff 15208 itemsize 53
	extent data disk bytenr 0 nr 0
	extent data offset 0 nr 18446744073709547520 ram 18446744073709547520

The num_bytes + ram_bytes overflow 64 bit type.

For num_bytes part, we can detect such overflow along with file offset
(key->offset), as file_offset + num_bytes should never go beyond u64.

For ram_bytes part, it's about the decompressed size of the extent, not
directly related to the size.
In theory it is OK to have a large value, and put extra limitation
on RAM bytes may cause unexpected false alerts.

So in tree-checker, we only check if the file offset and num bytes
overflow.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/btrfs/tree-checker.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
index 96fce4bef4e7..ccd5706199d7 100644
--- a/fs/btrfs/tree-checker.c
+++ b/fs/btrfs/tree-checker.c
@@ -132,6 +132,7 @@ static int check_extent_data_item(struct extent_buffer *leaf,
 	struct btrfs_file_extent_item *fi;
 	u32 sectorsize = fs_info->sectorsize;
 	u32 item_size = btrfs_item_size_nr(leaf, slot);
+	u64 extent_end;
 
 	if (!IS_ALIGNED(key->offset, sectorsize)) {
 		file_extent_err(leaf, slot,
@@ -207,6 +208,16 @@ static int check_extent_data_item(struct extent_buffer *leaf,
 	    CHECK_FE_ALIGNED(leaf, slot, fi, num_bytes, sectorsize))
 		return -EUCLEAN;
 
+	/* Catch extent end overflow */
+	if (check_add_overflow(btrfs_file_extent_num_bytes(leaf, fi),
+			       key->offset, &extent_end)) {
+		file_extent_err(leaf, slot,
+	"extent end overflow, have file offset %llu extent num bytes %llu",
+				key->offset,
+				btrfs_file_extent_num_bytes(leaf, fi));
+		return -EUCLEAN;
+	}
+
 	/*
 	 * Check that no two consecutive file extent items, in the same leaf,
 	 * present ranges that overlap each other.
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* [PATCH AUTOSEL 5.2 27/85] btrfs: fix minimum number of chunk errors for DUP
       [not found] <20190726133936.11177-1-sashal@kernel.org>
  2019-07-26 13:38 ` [PATCH AUTOSEL 5.2 26/85] btrfs: tree-checker: Check if the file extent end overflows Sasha Levin
@ 2019-07-26 13:38 ` Sasha Levin
  2019-07-26 13:38 ` [PATCH AUTOSEL 5.2 28/85] btrfs: Flush before reflinking any extent to prevent NOCOW write falling back to COW without data reservation Sasha Levin
  2019-07-26 13:38 ` [PATCH AUTOSEL 5.2 30/85] btrfs: qgroup: Don't hold qgroup_ioctl_lock in btrfs_qgroup_inherit() Sasha Levin
  3 siblings, 0 replies; 4+ messages in thread
From: Sasha Levin @ 2019-07-26 13:38 UTC (permalink / raw)
  To: linux-kernel, stable; +Cc: David Sterba, Qu Wenruo, Sasha Levin, linux-btrfs

From: David Sterba <dsterba@suse.com>

[ Upstream commit 0ee5f8ae082e1f675a2fb6db601c31ac9958a134 ]

The list of profiles in btrfs_chunk_max_errors lists DUP as a profile
DUP able to tolerate 1 device missing. Though this profile is special
with 2 copies, it still needs the device, unlike the others.

Looking at the history of changes, thre's no clear reason why DUP is
there, functions were refactored and blocks of code merged to one
helper.

d20983b40e828 Btrfs: fix writing data into the seed filesystem
  - factor code to a helper

de11cc12df173 Btrfs: don't pre-allocate btrfs bio
  - unrelated change, DUP still in the list with max errors 1

a236aed14ccb0 Btrfs: Deal with failed writes in mirrored configurations
  - introduced the max errors, leaves DUP and RAID1 in the same group

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/btrfs/volumes.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 1c2a6e4b39da..8508f6028c8d 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5328,8 +5328,7 @@ static inline int btrfs_chunk_max_errors(struct map_lookup *map)
 
 	if (map->type & (BTRFS_BLOCK_GROUP_RAID1 |
 			 BTRFS_BLOCK_GROUP_RAID10 |
-			 BTRFS_BLOCK_GROUP_RAID5 |
-			 BTRFS_BLOCK_GROUP_DUP)) {
+			 BTRFS_BLOCK_GROUP_RAID5)) {
 		max_errors = 1;
 	} else if (map->type & BTRFS_BLOCK_GROUP_RAID6) {
 		max_errors = 2;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* [PATCH AUTOSEL 5.2 28/85] btrfs: Flush before reflinking any extent to prevent NOCOW write falling back to COW without data reservation
       [not found] <20190726133936.11177-1-sashal@kernel.org>
  2019-07-26 13:38 ` [PATCH AUTOSEL 5.2 26/85] btrfs: tree-checker: Check if the file extent end overflows Sasha Levin
  2019-07-26 13:38 ` [PATCH AUTOSEL 5.2 27/85] btrfs: fix minimum number of chunk errors for DUP Sasha Levin
@ 2019-07-26 13:38 ` Sasha Levin
  2019-07-26 13:38 ` [PATCH AUTOSEL 5.2 30/85] btrfs: qgroup: Don't hold qgroup_ioctl_lock in btrfs_qgroup_inherit() Sasha Levin
  3 siblings, 0 replies; 4+ messages in thread
From: Sasha Levin @ 2019-07-26 13:38 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Qu Wenruo, Filipe Manana, David Sterba, Sasha Levin, linux-btrfs

From: Qu Wenruo <wqu@suse.com>

[ Upstream commit a94d1d0cb3bf1983fcdf05b59d914dbff4f1f52c ]

[BUG]
The following script can cause unexpected fsync failure:

  #!/bin/bash

  dev=/dev/test/test
  mnt=/mnt/btrfs

  mkfs.btrfs -f $dev -b 512M > /dev/null
  mount $dev $mnt -o nospace_cache

  # Prealloc one extent
  xfs_io -f -c "falloc 8k 64m" $mnt/file1
  # Fill the remaining data space
  xfs_io -f -c "pwrite 0 -b 4k 512M" $mnt/padding
  sync

  # Write into the prealloc extent
  xfs_io -c "pwrite 1m 16m" $mnt/file1

  # Reflink then fsync, fsync would fail due to ENOSPC
  xfs_io -c "reflink $mnt/file1 8k 0 4k" -c "fsync" $mnt/file1
  umount $dev

The fsync fails with ENOSPC, and the last page of the buffered write is
lost.

[CAUSE]
This is caused by:
- Btrfs' back reference only has extent level granularity
  So write into shared extent must be COWed even only part of the extent
  is shared.

So for above script we have:
- fallocate
  Create a preallocated extent where we can do NOCOW write.

- fill all the remaining data and unallocated space

- buffered write into preallocated space
  As we have not enough space available for data and the extent is not
  shared (yet) we fall into NOCOW mode.

- reflink
  Now part of the large preallocated extent is shared, later write
  into that extent must be COWed.

- fsync triggers writeback
  But now the extent is shared and therefore we must fallback into COW
  mode, which fails with ENOSPC since there's not enough space to
  allocate data extents.

[WORKAROUND]
The workaround is to ensure any buffered write in the related extents
(not just the reflink source range) get flushed before reflink/dedupe,
so that NOCOW writes succeed that happened before reflinking succeed.

The workaround is expensive, we could do it better by only flushing
NOCOW range, but that needs extra accounting for NOCOW range.
For now, fix the possible data loss first.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/btrfs/ioctl.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 2a1be0d1a698..5b4beebf138c 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3999,6 +3999,27 @@ static int btrfs_remap_file_range_prep(struct file *file_in, loff_t pos_in,
 	if (!same_inode)
 		inode_dio_wait(inode_out);
 
+	/*
+	 * Workaround to make sure NOCOW buffered write reach disk as NOCOW.
+	 *
+	 * Btrfs' back references do not have a block level granularity, they
+	 * work at the whole extent level.
+	 * NOCOW buffered write without data space reserved may not be able
+	 * to fall back to CoW due to lack of data space, thus could cause
+	 * data loss.
+	 *
+	 * Here we take a shortcut by flushing the whole inode, so that all
+	 * nocow write should reach disk as nocow before we increase the
+	 * reference of the extent. We could do better by only flushing NOCOW
+	 * data, but that needs extra accounting.
+	 *
+	 * Also we don't need to check ASYNC_EXTENT, as async extent will be
+	 * CoWed anyway, not affecting nocow part.
+	 */
+	ret = filemap_flush(inode_in->i_mapping);
+	if (ret < 0)
+		return ret;
+
 	ret = btrfs_wait_ordered_range(inode_in, ALIGN_DOWN(pos_in, bs),
 				       wb_len);
 	if (ret < 0)
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* [PATCH AUTOSEL 5.2 30/85] btrfs: qgroup: Don't hold qgroup_ioctl_lock in btrfs_qgroup_inherit()
       [not found] <20190726133936.11177-1-sashal@kernel.org>
                   ` (2 preceding siblings ...)
  2019-07-26 13:38 ` [PATCH AUTOSEL 5.2 28/85] btrfs: Flush before reflinking any extent to prevent NOCOW write falling back to COW without data reservation Sasha Levin
@ 2019-07-26 13:38 ` Sasha Levin
  3 siblings, 0 replies; 4+ messages in thread
From: Sasha Levin @ 2019-07-26 13:38 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Qu Wenruo, Nikolay Borisov, David Sterba, Sasha Levin,
	linux-btrfs

From: Qu Wenruo <wqu@suse.com>

[ Upstream commit e88439debd0a7f969b3ddba6f147152cd0732676 ]

[BUG]
Lockdep will report the following circular locking dependency:

  WARNING: possible circular locking dependency detected
  5.2.0-rc2-custom #24 Tainted: G           O
  ------------------------------------------------------
  btrfs/8631 is trying to acquire lock:
  000000002536438c (&fs_info->qgroup_ioctl_lock#2){+.+.}, at: btrfs_qgroup_inherit+0x40/0x620 [btrfs]

  but task is already holding lock:
  000000003d52cc23 (&fs_info->tree_log_mutex){+.+.}, at: create_pending_snapshot+0x8b6/0xe60 [btrfs]

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #2 (&fs_info->tree_log_mutex){+.+.}:
         __mutex_lock+0x76/0x940
         mutex_lock_nested+0x1b/0x20
         btrfs_commit_transaction+0x475/0xa00 [btrfs]
         btrfs_commit_super+0x71/0x80 [btrfs]
         close_ctree+0x2bd/0x320 [btrfs]
         btrfs_put_super+0x15/0x20 [btrfs]
         generic_shutdown_super+0x72/0x110
         kill_anon_super+0x18/0x30
         btrfs_kill_super+0x16/0xa0 [btrfs]
         deactivate_locked_super+0x3a/0x80
         deactivate_super+0x51/0x60
         cleanup_mnt+0x3f/0x80
         __cleanup_mnt+0x12/0x20
         task_work_run+0x94/0xb0
         exit_to_usermode_loop+0xd8/0xe0
         do_syscall_64+0x210/0x240
         entry_SYSCALL_64_after_hwframe+0x49/0xbe

  -> #1 (&fs_info->reloc_mutex){+.+.}:
         __mutex_lock+0x76/0x940
         mutex_lock_nested+0x1b/0x20
         btrfs_commit_transaction+0x40d/0xa00 [btrfs]
         btrfs_quota_enable+0x2da/0x730 [btrfs]
         btrfs_ioctl+0x2691/0x2b40 [btrfs]
         do_vfs_ioctl+0xa9/0x6d0
         ksys_ioctl+0x67/0x90
         __x64_sys_ioctl+0x1a/0x20
         do_syscall_64+0x65/0x240
         entry_SYSCALL_64_after_hwframe+0x49/0xbe

  -> #0 (&fs_info->qgroup_ioctl_lock#2){+.+.}:
         lock_acquire+0xa7/0x190
         __mutex_lock+0x76/0x940
         mutex_lock_nested+0x1b/0x20
         btrfs_qgroup_inherit+0x40/0x620 [btrfs]
         create_pending_snapshot+0x9d7/0xe60 [btrfs]
         create_pending_snapshots+0x94/0xb0 [btrfs]
         btrfs_commit_transaction+0x415/0xa00 [btrfs]
         btrfs_mksubvol+0x496/0x4e0 [btrfs]
         btrfs_ioctl_snap_create_transid+0x174/0x180 [btrfs]
         btrfs_ioctl_snap_create_v2+0x11c/0x180 [btrfs]
         btrfs_ioctl+0xa90/0x2b40 [btrfs]
         do_vfs_ioctl+0xa9/0x6d0
         ksys_ioctl+0x67/0x90
         __x64_sys_ioctl+0x1a/0x20
         do_syscall_64+0x65/0x240
         entry_SYSCALL_64_after_hwframe+0x49/0xbe

  other info that might help us debug this:

  Chain exists of:
    &fs_info->qgroup_ioctl_lock#2 --> &fs_info->reloc_mutex --> &fs_info->tree_log_mutex

   Possible unsafe locking scenario:

         CPU0                    CPU1
         ----                    ----
    lock(&fs_info->tree_log_mutex);
                                 lock(&fs_info->reloc_mutex);
                                 lock(&fs_info->tree_log_mutex);
    lock(&fs_info->qgroup_ioctl_lock#2);

   *** DEADLOCK ***

  6 locks held by btrfs/8631:
   #0: 00000000ed8f23f6 (sb_writers#12){.+.+}, at: mnt_want_write_file+0x28/0x60
   #1: 000000009fb1597a (&type->i_mutex_dir_key#10/1){+.+.}, at: btrfs_mksubvol+0x70/0x4e0 [btrfs]
   #2: 0000000088c5ad88 (&fs_info->subvol_sem){++++}, at: btrfs_mksubvol+0x128/0x4e0 [btrfs]
   #3: 000000009606fc3e (sb_internal#2){.+.+}, at: start_transaction+0x37a/0x520 [btrfs]
   #4: 00000000f82bbdf5 (&fs_info->reloc_mutex){+.+.}, at: btrfs_commit_transaction+0x40d/0xa00 [btrfs]
   #5: 000000003d52cc23 (&fs_info->tree_log_mutex){+.+.}, at: create_pending_snapshot+0x8b6/0xe60 [btrfs]

[CAUSE]
Due to the delayed subvolume creation, we need to call
btrfs_qgroup_inherit() inside commit transaction code, with a lot of
other mutex hold.
This hell of lock chain can lead to above problem.

[FIX]
On the other hand, we don't really need to hold qgroup_ioctl_lock if
we're in the context of create_pending_snapshot().
As in that context, we're the only one being able to modify qgroup.

All other qgroup functions which needs qgroup_ioctl_lock are either
holding a transaction handle, or will start a new transaction:
  Functions will start a new transaction():
  * btrfs_quota_enable()
  * btrfs_quota_disable()
  Functions hold a transaction handler:
  * btrfs_add_qgroup_relation()
  * btrfs_del_qgroup_relation()
  * btrfs_create_qgroup()
  * btrfs_remove_qgroup()
  * btrfs_limit_qgroup()
  * btrfs_qgroup_inherit() call inside create_subvol()

So we have a higher level protection provided by transaction, thus we
don't need to always hold qgroup_ioctl_lock in btrfs_qgroup_inherit().

Only the btrfs_qgroup_inherit() call in create_subvol() needs to hold
qgroup_ioctl_lock, while the btrfs_qgroup_inherit() call in
create_pending_snapshot() is already protected by transaction.

So the fix is to detect the context by checking
trans->transaction->state.
If we're at TRANS_STATE_COMMIT_DOING, then we're in commit transaction
context and no need to get the mutex.

Reported-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/btrfs/qgroup.c | 24 ++++++++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 3e6ffbbd8b0a..f8a3c1b0a15a 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -2614,6 +2614,7 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
 	int ret = 0;
 	int i;
 	u64 *i_qgroups;
+	bool committing = false;
 	struct btrfs_fs_info *fs_info = trans->fs_info;
 	struct btrfs_root *quota_root;
 	struct btrfs_qgroup *srcgroup;
@@ -2621,7 +2622,25 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
 	u32 level_size = 0;
 	u64 nums;
 
-	mutex_lock(&fs_info->qgroup_ioctl_lock);
+	/*
+	 * There are only two callers of this function.
+	 *
+	 * One in create_subvol() in the ioctl context, which needs to hold
+	 * the qgroup_ioctl_lock.
+	 *
+	 * The other one in create_pending_snapshot() where no other qgroup
+	 * code can modify the fs as they all need to either start a new trans
+	 * or hold a trans handler, thus we don't need to hold
+	 * qgroup_ioctl_lock.
+	 * This would avoid long and complex lock chain and make lockdep happy.
+	 */
+	spin_lock(&fs_info->trans_lock);
+	if (trans->transaction->state == TRANS_STATE_COMMIT_DOING)
+		committing = true;
+	spin_unlock(&fs_info->trans_lock);
+
+	if (!committing)
+		mutex_lock(&fs_info->qgroup_ioctl_lock);
 	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
 		goto out;
 
@@ -2785,7 +2804,8 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
 unlock:
 	spin_unlock(&fs_info->qgroup_lock);
 out:
-	mutex_unlock(&fs_info->qgroup_ioctl_lock);
+	if (!committing)
+		mutex_unlock(&fs_info->qgroup_ioctl_lock);
 	return ret;
 }
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2019-07-26 13:59 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20190726133936.11177-1-sashal@kernel.org>
2019-07-26 13:38 ` [PATCH AUTOSEL 5.2 26/85] btrfs: tree-checker: Check if the file extent end overflows Sasha Levin
2019-07-26 13:38 ` [PATCH AUTOSEL 5.2 27/85] btrfs: fix minimum number of chunk errors for DUP Sasha Levin
2019-07-26 13:38 ` [PATCH AUTOSEL 5.2 28/85] btrfs: Flush before reflinking any extent to prevent NOCOW write falling back to COW without data reservation Sasha Levin
2019-07-26 13:38 ` [PATCH AUTOSEL 5.2 30/85] btrfs: qgroup: Don't hold qgroup_ioctl_lock in btrfs_qgroup_inherit() Sasha Levin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).