[PATCH 00/11] btrfs: fixes around swap file activation and cleanups

public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH 00/11] btrfs: fixes around swap file activation and cleanups
@ 2024-12-11 14:53 fdmanana
  2024-12-11 14:53 ` [PATCH 01/11] btrfs: fix race with memory mapped writes when activating swap file fdmanana
                   ` (11 more replies)
  0 siblings, 12 replies; 29+ messages in thread
From: fdmanana @ 2024-12-11 14:53 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

There are a couple issues with swap file activation, one being races with
memory mapped writes, while the other one is a failure due to buggy
detection of extent sharedness - we can often consider an extent as shared
when it is not but it used to be. The rest are just some cleanups or
enhancements.

Filipe Manana (11):
  btrfs: fix race with memory mapped writes when activating swap file
  btrfs: fix swap file activation failure due to extents that used to be shared
  btrfs: allow swap activation to be interruptible
  btrfs: avoid monopolizing a core when activating a swap file
  btrfs: remove no longer needed strict argument from can_nocow_extent()
  btrfs: remove the snapshot check from check_committed_ref()
  btrfs: avoid redundant call to get inline ref type at check_committed_ref()
  btrfs: simplify return logic at check_committed_ref()
  btrfs: simplify arguments for btrfs_cross_ref_exist()
  btrfs: add function comment for check_committed_ref()
  btrfs: add assertions and comment about path expectations to btrfs_cross_ref_exist()

 fs/btrfs/btrfs_inode.h |   2 +-
 fs/btrfs/direct-io.c   |   3 +-
 fs/btrfs/extent-tree.c | 127 ++++++++++++++++++++++++-----------
 fs/btrfs/extent-tree.h |   3 +-
 fs/btrfs/file.c        |   2 +-
 fs/btrfs/inode.c       | 146 +++++++++++++++++++++++++++++------------
 fs/btrfs/locking.h     |   5 ++
 7 files changed, 202 insertions(+), 86 deletions(-)

-- 
2.45.2


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 01/11] btrfs: fix race with memory mapped writes when activating swap file
  2024-12-11 14:53 [PATCH 00/11] btrfs: fixes around swap file activation and cleanups fdmanana
@ 2024-12-11 14:53 ` fdmanana
  2024-12-11 14:53 ` [PATCH 02/11] btrfs: fix swap file activation failure due to extents that used to be shared fdmanana
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 29+ messages in thread
From: fdmanana @ 2024-12-11 14:53 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

When activating the swap file we flush all delalloc and wait for ordered
extent completion, so that we don't miss any delalloc and extents before
we check that the file's extent layout is usable for a swap file and
activate the swap file. We are called with the inode's VFS lock acquired,
so we won't race with buffered and direct IO writes, however we can still
race with memory mapped writes since they don't acquire the inode's VFS
lock. The race window is between flushing all delalloc and locking the
whole file's extent range, since memory mapped writes lock an extent range
with the length of a page.

Fix this by acquiring the inode's mmap lock before we flush delalloc.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/inode.c | 31 ++++++++++++++++++++++++-------
 1 file changed, 24 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index c4997200dbb2..926d82fbdbae 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9809,6 +9809,15 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 	u64 isize;
 	u64 start;
 
+	/*
+	 * Acquire the inode's mmap lock to prevent races with memory mapped
+	 * writes, as they could happen after we flush delalloc below and before
+	 * we lock the extent range further below. The inode was already locked
+	 * up in the call chain.
+	 */
+	btrfs_assert_inode_locked(BTRFS_I(inode));
+	down_write(&BTRFS_I(inode)->i_mmap_lock);
+
 	/*
 	 * If the swap file was just created, make sure delalloc is done. If the
 	 * file changes again after this, the user is doing something stupid and
@@ -9816,22 +9825,25 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 	 */
 	ret = btrfs_wait_ordered_range(BTRFS_I(inode), 0, (u64)-1);
 	if (ret)
-		return ret;
+		goto out_unlock_mmap;
 
 	/*
 	 * The inode is locked, so these flags won't change after we check them.
 	 */
 	if (BTRFS_I(inode)->flags & BTRFS_INODE_COMPRESS) {
 		btrfs_warn(fs_info, "swapfile must not be compressed");
-		return -EINVAL;
+		ret = -EINVAL;
+		goto out_unlock_mmap;
 	}
 	if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW)) {
 		btrfs_warn(fs_info, "swapfile must not be copy-on-write");
-		return -EINVAL;
+		ret = -EINVAL;
+		goto out_unlock_mmap;
 	}
 	if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)) {
 		btrfs_warn(fs_info, "swapfile must not be checksummed");
-		return -EINVAL;
+		ret = -EINVAL;
+		goto out_unlock_mmap;
 	}
 
 	/*
@@ -9846,7 +9858,8 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 	if (!btrfs_exclop_start(fs_info, BTRFS_EXCLOP_SWAP_ACTIVATE)) {
 		btrfs_warn(fs_info,
 	   "cannot activate swapfile while exclusive operation is running");
-		return -EBUSY;
+		ret = -EBUSY;
+		goto out_unlock_mmap;
 	}
 
 	/*
@@ -9860,7 +9873,8 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 		btrfs_exclop_finish(fs_info);
 		btrfs_warn(fs_info,
 	   "cannot activate swapfile because snapshot creation is in progress");
-		return -EINVAL;
+		ret = -EINVAL;
+		goto out_unlock_mmap;
 	}
 	/*
 	 * Snapshots can create extents which require COW even if NODATACOW is
@@ -9881,7 +9895,8 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 		btrfs_warn(fs_info,
 		"cannot activate swapfile because subvolume %llu is being deleted",
 			btrfs_root_id(root));
-		return -EPERM;
+		ret = -EPERM;
+		goto out_unlock_mmap;
 	}
 	atomic_inc(&root->nr_swapfiles);
 	spin_unlock(&root->root_item_lock);
@@ -10036,6 +10051,8 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 
 	btrfs_exclop_finish(fs_info);
 
+out_unlock_mmap:
+	up_write(&BTRFS_I(inode)->i_mmap_lock);
 	if (ret)
 		return ret;
 
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 02/11] btrfs: fix swap file activation failure due to extents that used to be shared
  2024-12-11 14:53 [PATCH 00/11] btrfs: fixes around swap file activation and cleanups fdmanana
  2024-12-11 14:53 ` [PATCH 01/11] btrfs: fix race with memory mapped writes when activating swap file fdmanana
@ 2024-12-11 14:53 ` fdmanana
  2024-12-11 14:53 ` [PATCH 03/11] btrfs: allow swap activation to be interruptible fdmanana
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 29+ messages in thread
From: fdmanana @ 2024-12-11 14:53 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

When activating a swap file, to determine if an extent is shared we use
can_nocow_extent(), which ends up at btrfs_cross_ref_exist(). That helper
is meant to be quick because it's used in the NOCOW write path, when
flushing delalloc and when doing a direct IO write, however it does return
some false positives, meaning it may indicate that an extent is shared
even if it's no longer the case. For the write path this is fine, we just
do a unnecessary COW operation instead of doing a more rigorous check
which would be too heavy (calling btrfs_is_data_extent_shared()).

However when activating a swap file, the false positives simply result
in a failure, which is confusing for users/applications. One particular
case where this happens is when a data extent only has 1 reference but
that reference is not inlined in the extent item located in the extent
tree - this happens when we create more than 33 references for an extent
and then delete those 33 references plus every other non-inline reference
except one. The function check_committed_ref() assumes that if the size
of an extent item doesn't match the size of struct btrfs_extent_item
plus the size of an inline reference (plus an owner reference in case
simple quotas are enabled), then the extent is shared - that is not the
case however, we can have a single reference but it's not inlined - the
reason we do this is to be fast and avoid inspecting non-inline references
which may be located in another leaf of the extent tree, slowing down
write paths.

The following test script reproduces the bug:

   $ cat test.sh
   #!/bin/bash

   DEV=/dev/sdi
   MNT=/mnt/sdi
   NUM_CLONES=50

   umount $DEV &> /dev/null

   run_test()
   {
        local sync_after_add_reflinks=$1
        local sync_after_remove_reflinks=$2

        mkfs.btrfs -f $DEV > /dev/null
        #mkfs.xfs -f $DEV > /dev/null
        mount $DEV $MNT

        touch $MNT/foo
        chmod 0600 $MNT/foo
   	# On btrfs the file must be NOCOW.
        chattr +C $MNT/foo &> /dev/null
        xfs_io -s -c "pwrite -b 1M 0 1M" $MNT/foo
        mkswap $MNT/foo

        for ((i = 1; i <= $NUM_CLONES; i++)); do
            touch $MNT/foo_clone_$i
            chmod 0600 $MNT/foo_clone_$i
            # On btrfs the file must be NOCOW.
            chattr +C $MNT/foo_clone_$i &> /dev/null
            cp --reflink=always $MNT/foo $MNT/foo_clone_$i
        done

        if [ $sync_after_add_reflinks -ne 0 ]; then
            # Flush delayed refs and commit current transaction.
            sync -f $MNT
        fi

        # Remove the original file and all clones except the last.
        rm -f $MNT/foo
        for ((i = 1; i < $NUM_CLONES; i++)); do
            rm -f $MNT/foo_clone_$i
        done

        if [ $sync_after_remove_reflinks -ne 0 ]; then
            # Flush delayed refs and commit current transaction.
            sync -f $MNT
        fi

        # Now use the last clone as a swap file. It should work since
        # its extent are not shared anymore.
        swapon $MNT/foo_clone_${NUM_CLONES}
        swapoff $MNT/foo_clone_${NUM_CLONES}

        umount $MNT
   }

   echo -e "\nTest without sync after creating and removing clones"
   run_test 0 0

   echo -e "\nTest with sync after creating clones"
   run_test 1 0

   echo -e "\nTest with sync after removing clones"
   run_test 0 1

   echo -e "\nTest with sync after creating and removing clones"
   run_test 1 1

Running the test:

   $ ./test.sh
   Test without sync after creating and removing clones
   wrote 1048576/1048576 bytes at offset 0
   1 MiB, 1 ops; 0.0017 sec (556.793 MiB/sec and 556.7929 ops/sec)
   Setting up swapspace version 1, size = 1020 KiB (1044480 bytes)
   no label, UUID=a6b9c29e-5ef4-4689-a8ac-bc199c750f02
   swapon: /mnt/sdi/foo_clone_50: swapon failed: Invalid argument
   swapoff: /mnt/sdi/foo_clone_50: swapoff failed: Invalid argument

   Test with sync after creating clones
   wrote 1048576/1048576 bytes at offset 0
   1 MiB, 1 ops; 0.0036 sec (271.739 MiB/sec and 271.7391 ops/sec)
   Setting up swapspace version 1, size = 1020 KiB (1044480 bytes)
   no label, UUID=5e9008d6-1f7a-4948-a1b4-3f30aba20a33
   swapon: /mnt/sdi/foo_clone_50: swapon failed: Invalid argument
   swapoff: /mnt/sdi/foo_clone_50: swapoff failed: Invalid argument

   Test with sync after removing clones
   wrote 1048576/1048576 bytes at offset 0
   1 MiB, 1 ops; 0.0103 sec (96.665 MiB/sec and 96.6651 ops/sec)
   Setting up swapspace version 1, size = 1020 KiB (1044480 bytes)
   no label, UUID=916c2740-fa9f-4385-9f06-29c3f89e4764

   Test with sync after creating and removing clones
   wrote 1048576/1048576 bytes at offset 0
   1 MiB, 1 ops; 0.0031 sec (314.268 MiB/sec and 314.2678 ops/sec)
   Setting up swapspace version 1, size = 1020 KiB (1044480 bytes)
   no label, UUID=06aab1dd-4d90-49c0-bd9f-3a8db4e2f912
   swapon: /mnt/sdi/foo_clone_50: swapon failed: Invalid argument
   swapoff: /mnt/sdi/foo_clone_50: swapoff failed: Invalid argument

Fix this by reworking btrfs_swap_activate() to instead of using extent
maps and checking for shared extents with can_nocow_extent(), iterate
over the inode's file extent items and use the accurate
btrfs_is_data_extent_shared().

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/inode.c | 96 ++++++++++++++++++++++++++++++++++--------------
 1 file changed, 69 insertions(+), 27 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 926d82fbdbae..7ddb8a01b60f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9799,15 +9799,16 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
 	struct extent_state *cached_state = NULL;
-	struct extent_map *em = NULL;
 	struct btrfs_chunk_map *map = NULL;
 	struct btrfs_device *device = NULL;
 	struct btrfs_swap_info bsi = {
 		.lowest_ppage = (sector_t)-1ULL,
 	};
+	struct btrfs_backref_share_check_ctx *backref_ctx = NULL;
+	struct btrfs_path *path = NULL;
 	int ret = 0;
 	u64 isize;
-	u64 start;
+	u64 prev_extent_end = 0;
 
 	/*
 	 * Acquire the inode's mmap lock to prevent races with memory mapped
@@ -9846,6 +9847,13 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 		goto out_unlock_mmap;
 	}
 
+	path = btrfs_alloc_path();
+	backref_ctx = btrfs_alloc_backref_share_check_ctx();
+	if (!path || !backref_ctx) {
+		ret = -ENOMEM;
+		goto out_unlock_mmap;
+	}
+
 	/*
 	 * Balance or device remove/replace/resize can move stuff around from
 	 * under us. The exclop protection makes sure they aren't running/won't
@@ -9904,24 +9912,39 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 	isize = ALIGN_DOWN(inode->i_size, fs_info->sectorsize);
 
 	lock_extent(io_tree, 0, isize - 1, &cached_state);
-	start = 0;
-	while (start < isize) {
-		u64 logical_block_start, physical_block_start;
+	while (prev_extent_end < isize) {
+		struct btrfs_key key;
+		struct extent_buffer *leaf;
+		struct btrfs_file_extent_item *ei;
 		struct btrfs_block_group *bg;
-		u64 len = isize - start;
+		u64 logical_block_start;
+		u64 physical_block_start;
+		u64 extent_gen;
+		u64 disk_bytenr;
+		u64 len;
 
-		em = btrfs_get_extent(BTRFS_I(inode), NULL, start, len);
-		if (IS_ERR(em)) {
-			ret = PTR_ERR(em);
+		key.objectid = btrfs_ino(BTRFS_I(inode));
+		key.type = BTRFS_EXTENT_DATA_KEY;
+		key.offset = prev_extent_end;
+
+		ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+		if (ret < 0)
 			goto out;
-		}
 
-		if (em->disk_bytenr == EXTENT_MAP_HOLE) {
+		/*
+		 * If key not found it means we have an implicit hole (NO_HOLES
+		 * is enabled).
+		 */
+		if (ret > 0) {
 			btrfs_warn(fs_info, "swapfile must not have holes");
 			ret = -EINVAL;
 			goto out;
 		}
-		if (em->disk_bytenr == EXTENT_MAP_INLINE) {
+
+		leaf = path->nodes[0];
+		ei = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_file_extent_item);
+
+		if (btrfs_file_extent_type(leaf, ei) == BTRFS_FILE_EXTENT_INLINE) {
 			/*
 			 * It's unlikely we'll ever actually find ourselves
 			 * here, as a file small enough to fit inline won't be
@@ -9933,23 +9956,45 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 			ret = -EINVAL;
 			goto out;
 		}
-		if (extent_map_is_compressed(em)) {
+
+		if (btrfs_file_extent_compression(leaf, ei) != BTRFS_COMPRESS_NONE) {
 			btrfs_warn(fs_info, "swapfile must not be compressed");
 			ret = -EINVAL;
 			goto out;
 		}
 
-		logical_block_start = extent_map_block_start(em) + (start - em->start);
-		len = min(len, em->len - (start - em->start));
-		free_extent_map(em);
-		em = NULL;
+		disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
+		if (disk_bytenr == 0) {
+			btrfs_warn(fs_info, "swapfile must not have holes");
+			ret = -EINVAL;
+			goto out;
+		}
+
+		logical_block_start = disk_bytenr + btrfs_file_extent_offset(leaf, ei);
+		extent_gen = btrfs_file_extent_generation(leaf, ei);
+		prev_extent_end = btrfs_file_extent_end(path);
+
+		if (prev_extent_end > isize)
+			len = isize - key.offset;
+		else
+			len = btrfs_file_extent_num_bytes(leaf, ei);
 
-		ret = can_nocow_extent(inode, start, &len, NULL, false, true);
+		backref_ctx->curr_leaf_bytenr = leaf->start;
+
+		/*
+		 * Don't need the path anymore, release to avoid deadlocks when
+		 * calling btrfs_is_data_extent_shared() because when joining a
+		 * transaction it can block waiting for the current one's commit
+		 * which in turn may be trying to lock the same leaf to flush
+		 * delayed items for example.
+		 */
+		btrfs_release_path(path);
+
+		ret = btrfs_is_data_extent_shared(BTRFS_I(inode), disk_bytenr,
+						  extent_gen, backref_ctx);
 		if (ret < 0) {
 			goto out;
-		} else if (ret) {
-			ret = 0;
-		} else {
+		} else if (ret > 0) {
 			btrfs_warn(fs_info,
 				   "swapfile must not be copy-on-write");
 			ret = -EINVAL;
@@ -9984,7 +10029,6 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 
 		physical_block_start = (map->stripes[0].physical +
 					(logical_block_start - map->start));
-		len = min(len, map->chunk_len - (logical_block_start - map->start));
 		btrfs_free_chunk_map(map);
 		map = NULL;
 
@@ -10025,20 +10069,16 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 				if (ret)
 					goto out;
 			}
-			bsi.start = start;
+			bsi.start = key.offset;
 			bsi.block_start = physical_block_start;
 			bsi.block_len = len;
 		}
-
-		start += len;
 	}
 
 	if (bsi.block_len)
 		ret = btrfs_add_swap_extent(sis, &bsi);
 
 out:
-	if (!IS_ERR_OR_NULL(em))
-		free_extent_map(em);
 	if (!IS_ERR_OR_NULL(map))
 		btrfs_free_chunk_map(map);
 
@@ -10053,6 +10093,8 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 
 out_unlock_mmap:
 	up_write(&BTRFS_I(inode)->i_mmap_lock);
+	btrfs_free_backref_share_ctx(backref_ctx);
+	btrfs_free_path(path);
 	if (ret)
 		return ret;
 
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 03/11] btrfs: allow swap activation to be interruptible
  2024-12-11 14:53 [PATCH 00/11] btrfs: fixes around swap file activation and cleanups fdmanana
  2024-12-11 14:53 ` [PATCH 01/11] btrfs: fix race with memory mapped writes when activating swap file fdmanana
  2024-12-11 14:53 ` [PATCH 02/11] btrfs: fix swap file activation failure due to extents that used to be shared fdmanana
@ 2024-12-11 14:53 ` fdmanana
  2024-12-11 14:53 ` [PATCH 04/11] btrfs: avoid monopolizing a core when activating a swap file fdmanana
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 29+ messages in thread
From: fdmanana @ 2024-12-11 14:53 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

During swap activation we iterate over the extents of a file, then do
several checks for each extent, some of which may take some significant
time such as checking if an extent is shared. Since a file can have
many thousands of extents, this can be a very slow operation and it's
currently not interruptible. I had a bug during development of a previous
patch that resulted in an infinite loop when iterating the extents, so
a core was busy looping and I couldn't cancel the operation, which is very
annoying and requires a reboot. So make the loop interruptible by checking
for fatal signals at the end of each iteration and stopping immediately if
there is one.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/inode.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 7ddb8a01b60f..5edc151c640d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -10073,6 +10073,11 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 			bsi.block_start = physical_block_start;
 			bsi.block_len = len;
 		}
+
+		if (fatal_signal_pending(current)) {
+			ret = -EINTR;
+			goto out;
+		}
 	}

 	if (bsi.block_len)
-- 
2.45.2

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 04/11] btrfs: avoid monopolizing a core when activating a swap file
  2024-12-11 14:53 [PATCH 00/11] btrfs: fixes around swap file activation and cleanups fdmanana
                   ` (2 preceding siblings ...)
  2024-12-11 14:53 ` [PATCH 03/11] btrfs: allow swap activation to be interruptible fdmanana
@ 2024-12-11 14:53 ` fdmanana
  2024-12-11 14:53 ` [PATCH 05/11] btrfs: remove no longer needed strict argument from can_nocow_extent() fdmanana
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 29+ messages in thread
From: fdmanana @ 2024-12-11 14:53 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

During swap activation we iterate over the extents of a file and we can
have many thounsands of them, so we can end up in a busy loop monopolizing
a core. Avoid this by doing a voluntary reschedule after processing each
extent.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/inode.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 5edc151c640d..283199d11642 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -10078,6 +10078,8 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 			ret = -EINTR;
 			goto out;
 		}
+
+		cond_resched();
 	}
 
 	if (bsi.block_len)
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 05/11] btrfs: remove no longer needed strict argument from can_nocow_extent()
  2024-12-11 14:53 [PATCH 00/11] btrfs: fixes around swap file activation and cleanups fdmanana
                   ` (3 preceding siblings ...)
  2024-12-11 14:53 ` [PATCH 04/11] btrfs: avoid monopolizing a core when activating a swap file fdmanana
@ 2024-12-11 14:53 ` fdmanana
  2024-12-11 14:53 ` [PATCH 06/11] btrfs: remove the snapshot check from check_committed_ref() fdmanana
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 29+ messages in thread
From: fdmanana @ 2024-12-11 14:53 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

All callers of can_nocow_extent() now pass a value of false for its
'strict' argument, making it redundant. So remove the argument from
can_nocow_extent() as well as can_nocow_file_extent(),
btrfs_cross_ref_exist() and check_committed_ref(), because this
argument was used just to influence the behavior of check_committed_ref().
Also remove the 'strict' field from struct can_nocow_file_extent_args,
which is now always false as well, as its value is taken from the
argument to can_nocow_extent().

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/btrfs_inode.h |  2 +-
 fs/btrfs/direct-io.c   |  3 +--
 fs/btrfs/extent-tree.c | 15 ++++++---------
 fs/btrfs/extent-tree.h |  2 +-
 fs/btrfs/file.c        |  2 +-
 fs/btrfs/inode.c       | 11 +++--------
 6 files changed, 13 insertions(+), 22 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index aa1f55cd81b7..b2fa33911c28 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -526,7 +526,7 @@ bool btrfs_data_csum_ok(struct btrfs_bio *bbio, struct btrfs_device *dev,
 			u32 bio_offset, struct bio_vec *bv);
 noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
 			      struct btrfs_file_extent *file_extent,
-			      bool nowait, bool strict);
+			      bool nowait);
 
 void btrfs_del_delalloc_inode(struct btrfs_inode *inode);
 struct inode *btrfs_lookup_dentry(struct inode *dir, struct dentry *dentry);
diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
index a7c3e221378d..8567af46e16f 100644
--- a/fs/btrfs/direct-io.c
+++ b/fs/btrfs/direct-io.c
@@ -248,8 +248,7 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
 		len = min(len, em->len - (start - em->start));
 		block_start = extent_map_block_start(em) + (start - em->start);
 
-		if (can_nocow_extent(inode, start, &len,
-				     &file_extent, false, false) == 1) {
+		if (can_nocow_extent(inode, start, &len, &file_extent, false) == 1) {
 			bg = btrfs_inc_nocow_writers(fs_info, block_start);
 			if (bg)
 				can_nocow = true;
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 2f9126528a01..46a3a4a4536b 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2296,8 +2296,7 @@ static noinline int check_delayed_ref(struct btrfs_root *root,
 
 static noinline int check_committed_ref(struct btrfs_root *root,
 					struct btrfs_path *path,
-					u64 objectid, u64 offset, u64 bytenr,
-					bool strict)
+					u64 objectid, u64 offset, u64 bytenr)
 {
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct btrfs_root *extent_root = btrfs_extent_root(fs_info, bytenr);
@@ -2361,11 +2360,10 @@ static noinline int check_committed_ref(struct btrfs_root *root,
 
 	/*
 	 * If extent created before last snapshot => it's shared unless the
-	 * snapshot has been deleted. Use the heuristic if strict is false.
+	 * snapshot has been deleted.
 	 */
-	if (!strict &&
-	    (btrfs_extent_generation(leaf, ei) <=
-	     btrfs_root_last_snapshot(&root->root_item)))
+	if (btrfs_extent_generation(leaf, ei) <=
+	    btrfs_root_last_snapshot(&root->root_item))
 		goto out;
 
 	/* If this extent has SHARED_DATA_REF then it's shared */
@@ -2387,13 +2385,12 @@ static noinline int check_committed_ref(struct btrfs_root *root,
 }
 
 int btrfs_cross_ref_exist(struct btrfs_root *root, u64 objectid, u64 offset,
-			  u64 bytenr, bool strict, struct btrfs_path *path)
+			  u64 bytenr, struct btrfs_path *path)
 {
 	int ret;
 
 	do {
-		ret = check_committed_ref(root, path, objectid,
-					  offset, bytenr, strict);
+		ret = check_committed_ref(root, path, objectid, offset, bytenr);
 		if (ret && ret != -ENOENT)
 			goto out;
 
diff --git a/fs/btrfs/extent-tree.h b/fs/btrfs/extent-tree.h
index 2ad51130c037..ee62035c4a71 100644
--- a/fs/btrfs/extent-tree.h
+++ b/fs/btrfs/extent-tree.h
@@ -117,7 +117,7 @@ int btrfs_pin_extent_for_log_replay(struct btrfs_trans_handle *trans,
 				    const struct extent_buffer *eb);
 int btrfs_exclude_logged_extents(struct extent_buffer *eb);
 int btrfs_cross_ref_exist(struct btrfs_root *root,
-			  u64 objectid, u64 offset, u64 bytenr, bool strict,
+			  u64 objectid, u64 offset, u64 bytenr,
 			  struct btrfs_path *path);
 struct extent_buffer *btrfs_alloc_tree_block(struct btrfs_trans_handle *trans,
 					     struct btrfs_root *root,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index c61f210259d8..a2fb100fd017 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1069,7 +1069,7 @@ int btrfs_check_nocow_lock(struct btrfs_inode *inode, loff_t pos,
 						   &cached_state);
 	}
 	ret = can_nocow_extent(&inode->vfs_inode, lockstart, &num_bytes,
-			       NULL, nowait, false);
+			       NULL, nowait);
 	if (ret <= 0)
 		btrfs_drew_write_unlock(&root->snapshot_lock);
 	else
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 283199d11642..0965a29cf4f7 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1837,7 +1837,6 @@ struct can_nocow_file_extent_args {
 	/* End file offset (inclusive) of the range we want to NOCOW. */
 	u64 end;
 	bool writeback_path;
-	bool strict;
 	/*
 	 * Free the path passed to can_nocow_file_extent() once it's not needed
 	 * anymore.
@@ -1892,8 +1891,7 @@ static int can_nocow_file_extent(struct btrfs_path *path,
 	 * for its subvolume was created, then this implies the extent is shared,
 	 * hence we must COW.
 	 */
-	if (!args->strict &&
-	    btrfs_file_extent_generation(leaf, fi) <=
+	if (btrfs_file_extent_generation(leaf, fi) <=
 	    btrfs_root_last_snapshot(&root->root_item))
 		goto out;
 
@@ -1924,7 +1922,7 @@ static int can_nocow_file_extent(struct btrfs_path *path,
 
 	ret = btrfs_cross_ref_exist(root, btrfs_ino(inode),
 				    key->offset - args->file_extent.offset,
-				    args->file_extent.disk_bytenr, args->strict, path);
+				    args->file_extent.disk_bytenr, path);
 	WARN_ON_ONCE(ret > 0 && is_freespace_inode);
 	if (ret != 0)
 		goto out;
@@ -7011,8 +7009,6 @@ static bool btrfs_extent_readonly(struct btrfs_fs_info *fs_info, u64 bytenr)
  * @orig_start:	(optional) Return the original file offset of the file extent
  * @orig_len:	(optional) Return the original on-disk length of the file extent
  * @ram_bytes:	(optional) Return the ram_bytes of the file extent
- * @strict:	if true, omit optimizations that might force us into unnecessary
- *		cow. e.g., don't trust generation number.
  *
  * Return:
  * >0	and update @len if we can do nocow write
@@ -7024,7 +7020,7 @@ static bool btrfs_extent_readonly(struct btrfs_fs_info *fs_info, u64 bytenr)
  */
 noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
 			      struct btrfs_file_extent *file_extent,
-			      bool nowait, bool strict)
+			      bool nowait)
 {
 	struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
 	struct can_nocow_file_extent_args nocow_args = { 0 };
@@ -7077,7 +7073,6 @@ noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
 
 	nocow_args.start = offset;
 	nocow_args.end = offset + *len - 1;
-	nocow_args.strict = strict;
 	nocow_args.free_path = true;
 
 	ret = can_nocow_file_extent(path, &key, BTRFS_I(inode), &nocow_args);
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 06/11] btrfs: remove the snapshot check from check_committed_ref()
  2024-12-11 14:53 [PATCH 00/11] btrfs: fixes around swap file activation and cleanups fdmanana
                   ` (4 preceding siblings ...)
  2024-12-11 14:53 ` [PATCH 05/11] btrfs: remove no longer needed strict argument from can_nocow_extent() fdmanana
@ 2024-12-11 14:53 ` fdmanana
  2024-12-11 14:53 ` [PATCH 07/11] btrfs: avoid redundant call to get inline ref type at check_committed_ref() fdmanana
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 29+ messages in thread
From: fdmanana @ 2024-12-11 14:53 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

At check_committed_ref() we have this check to see if the data extent was
created in a generation lower than or equals to the generation where the
last snapshot for the root was created, and if so we return immediately
with 1, since it's very likely the extent is shared, referenced by other
root.

The only call chain for check_committed_ref() is the following:

   can_nocow_file_extent()
      btrfs_cross_ref_exist()
         check_committed_ref()

And we already do that snapshot check at can_nocow_file_extent(), before
we call btrfs_cross_ref_exist(). This makes the check done at
check_committed_ref() redundant, so remove it.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent-tree.c | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 46a3a4a4536b..e81f4615ccdf 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2358,14 +2358,6 @@ static noinline int check_committed_ref(struct btrfs_root *root,
 	if (item_size != expected_size)
 		goto out;
 
-	/*
-	 * If extent created before last snapshot => it's shared unless the
-	 * snapshot has been deleted.
-	 */
-	if (btrfs_extent_generation(leaf, ei) <=
-	    btrfs_root_last_snapshot(&root->root_item))
-		goto out;
-
 	/* If this extent has SHARED_DATA_REF then it's shared */
 	type = btrfs_get_extent_inline_ref_type(leaf, iref, BTRFS_REF_TYPE_DATA);
 	if (type != BTRFS_EXTENT_DATA_REF_KEY)
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 07/11] btrfs: avoid redundant call to get inline ref type at check_committed_ref()
  2024-12-11 14:53 [PATCH 00/11] btrfs: fixes around swap file activation and cleanups fdmanana
                   ` (5 preceding siblings ...)
  2024-12-11 14:53 ` [PATCH 06/11] btrfs: remove the snapshot check from check_committed_ref() fdmanana
@ 2024-12-11 14:53 ` fdmanana
  2024-12-11 14:53 ` [PATCH 08/11] btrfs: simplify return logic " fdmanana
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 29+ messages in thread
From: fdmanana @ 2024-12-11 14:53 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

At check_committed_ref() we are calling btrfs_get_extent_inline_ref_type()
twice, once before we check if have an inline extent owner ref (for simple
qgroups) and then once again sometime after that check. This second call
is redundant when we have simple quotas disabled or we found an inline ref
that is not of the owner ref type. So avoid this second call unless we
have simple quotas enabled and found an owner ref, saving a function call
that does inline ref validation again.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent-tree.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index e81f4615ccdf..00e137c48a9b 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2352,6 +2352,7 @@ static noinline int check_committed_ref(struct btrfs_root *root,
 	if (btrfs_fs_incompat(fs_info, SIMPLE_QUOTA) && type == BTRFS_EXTENT_OWNER_REF_KEY) {
 		expected_size += btrfs_extent_inline_ref_size(BTRFS_EXTENT_OWNER_REF_KEY);
 		iref = (struct btrfs_extent_inline_ref *)(iref + 1);
+		type = btrfs_get_extent_inline_ref_type(leaf, iref, BTRFS_REF_TYPE_DATA);
 	}
 
 	/* If extent item has more than 1 inline ref then it's shared */
@@ -2359,7 +2360,6 @@ static noinline int check_committed_ref(struct btrfs_root *root,
 		goto out;
 
 	/* If this extent has SHARED_DATA_REF then it's shared */
-	type = btrfs_get_extent_inline_ref_type(leaf, iref, BTRFS_REF_TYPE_DATA);
 	if (type != BTRFS_EXTENT_DATA_REF_KEY)
 		goto out;
 
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 08/11] btrfs: simplify return logic at check_committed_ref()
  2024-12-11 14:53 [PATCH 00/11] btrfs: fixes around swap file activation and cleanups fdmanana
                   ` (6 preceding siblings ...)
  2024-12-11 14:53 ` [PATCH 07/11] btrfs: avoid redundant call to get inline ref type at check_committed_ref() fdmanana
@ 2024-12-11 14:53 ` fdmanana
  2024-12-11 14:53 ` [PATCH 09/11] btrfs: simplify arguments for btrfs_cross_ref_exist() fdmanana
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 29+ messages in thread
From: fdmanana @ 2024-12-11 14:53 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Instead of setting the value to return in a local variable 'ret' and then
jumping into a label named 'out' that does nothing but return that value,
simplify everything by getting rid of the label and directly returning a
value.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent-tree.c | 23 +++++++++--------------
 1 file changed, 9 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 00e137c48a9b..51c49b2f4991 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2316,35 +2316,32 @@ static noinline int check_committed_ref(struct btrfs_root *root,
 
 	ret = btrfs_search_slot(NULL, extent_root, &key, path, 0, 0);
 	if (ret < 0)
-		goto out;
+		return ret;
 	if (ret == 0) {
 		/*
 		 * Key with offset -1 found, there would have to exist an extent
 		 * item with such offset, but this is out of the valid range.
 		 */
-		ret = -EUCLEAN;
-		goto out;
+		return -EUCLEAN;
 	}
 
-	ret = -ENOENT;
 	if (path->slots[0] == 0)
-		goto out;
+		return -ENOENT;
 
 	path->slots[0]--;
 	leaf = path->nodes[0];
 	btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
 
 	if (key.objectid != bytenr || key.type != BTRFS_EXTENT_ITEM_KEY)
-		goto out;
+		return -ENOENT;
 
-	ret = 1;
 	item_size = btrfs_item_size(leaf, path->slots[0]);
 	ei = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_extent_item);
 	expected_size = sizeof(*ei) + btrfs_extent_inline_ref_size(BTRFS_EXTENT_DATA_REF_KEY);
 
 	/* No inline refs; we need to bail before checking for owner ref. */
 	if (item_size == sizeof(*ei))
-		goto out;
+		return 1;
 
 	/* Check for an owner ref; skip over it to the real inline refs. */
 	iref = (struct btrfs_extent_inline_ref *)(ei + 1);
@@ -2357,11 +2354,11 @@ static noinline int check_committed_ref(struct btrfs_root *root,
 
 	/* If extent item has more than 1 inline ref then it's shared */
 	if (item_size != expected_size)
-		goto out;
+		return 1;
 
 	/* If this extent has SHARED_DATA_REF then it's shared */
 	if (type != BTRFS_EXTENT_DATA_REF_KEY)
-		goto out;
+		return 1;
 
 	ref = (struct btrfs_extent_data_ref *)(&iref->offset);
 	if (btrfs_extent_refs(leaf, ei) !=
@@ -2369,11 +2366,9 @@ static noinline int check_committed_ref(struct btrfs_root *root,
 	    btrfs_extent_data_ref_root(leaf, ref) != btrfs_root_id(root) ||
 	    btrfs_extent_data_ref_objectid(leaf, ref) != objectid ||
 	    btrfs_extent_data_ref_offset(leaf, ref) != offset)
-		goto out;
+		return 1;
 
-	ret = 0;
-out:
-	return ret;
+	return 0;
 }
 
 int btrfs_cross_ref_exist(struct btrfs_root *root, u64 objectid, u64 offset,
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 09/11] btrfs: simplify arguments for btrfs_cross_ref_exist()
  2024-12-11 14:53 [PATCH 00/11] btrfs: fixes around swap file activation and cleanups fdmanana
                   ` (7 preceding siblings ...)
  2024-12-11 14:53 ` [PATCH 08/11] btrfs: simplify return logic " fdmanana
@ 2024-12-11 14:53 ` fdmanana
  2024-12-11 14:53 ` [PATCH 10/11] btrfs: add function comment for check_committed_ref() fdmanana
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 29+ messages in thread
From: fdmanana @ 2024-12-11 14:53 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Instead of passing a root and an objectid which matches an inode number,
pass the inode instead, since the root is always the root associated to
the inode and the objectid is the number of that inode.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent-tree.c | 22 ++++++++++++----------
 fs/btrfs/extent-tree.h |  3 +--
 fs/btrfs/inode.c       |  3 +--
 3 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 51c49b2f4991..af3893ad784b 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2206,10 +2206,11 @@ int btrfs_set_disk_extent_flags(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
-static noinline int check_delayed_ref(struct btrfs_root *root,
+static noinline int check_delayed_ref(struct btrfs_inode *inode,
 				      struct btrfs_path *path,
-				      u64 objectid, u64 offset, u64 bytenr)
+				      u64 offset, u64 bytenr)
 {
+	struct btrfs_root *root = inode->root;
 	struct btrfs_delayed_ref_head *head;
 	struct btrfs_delayed_ref_node *ref;
 	struct btrfs_delayed_ref_root *delayed_refs;
@@ -2283,7 +2284,7 @@ static noinline int check_delayed_ref(struct btrfs_root *root,
 		 * then we have a cross reference.
 		 */
 		if (ref->ref_root != btrfs_root_id(root) ||
-		    ref_owner != objectid || ref_offset != offset) {
+		    ref_owner != btrfs_ino(inode) || ref_offset != offset) {
 			ret = 1;
 			break;
 		}
@@ -2294,10 +2295,11 @@ static noinline int check_delayed_ref(struct btrfs_root *root,
 	return ret;
 }
 
-static noinline int check_committed_ref(struct btrfs_root *root,
+static noinline int check_committed_ref(struct btrfs_inode *inode,
 					struct btrfs_path *path,
-					u64 objectid, u64 offset, u64 bytenr)
+					u64 offset, u64 bytenr)
 {
+	struct btrfs_root *root = inode->root;
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct btrfs_root *extent_root = btrfs_extent_root(fs_info, bytenr);
 	struct extent_buffer *leaf;
@@ -2364,29 +2366,29 @@ static noinline int check_committed_ref(struct btrfs_root *root,
 	if (btrfs_extent_refs(leaf, ei) !=
 	    btrfs_extent_data_ref_count(leaf, ref) ||
 	    btrfs_extent_data_ref_root(leaf, ref) != btrfs_root_id(root) ||
-	    btrfs_extent_data_ref_objectid(leaf, ref) != objectid ||
+	    btrfs_extent_data_ref_objectid(leaf, ref) != btrfs_ino(inode) ||
 	    btrfs_extent_data_ref_offset(leaf, ref) != offset)
 		return 1;
 
 	return 0;
 }
 
-int btrfs_cross_ref_exist(struct btrfs_root *root, u64 objectid, u64 offset,
+int btrfs_cross_ref_exist(struct btrfs_inode *inode, u64 offset,
 			  u64 bytenr, struct btrfs_path *path)
 {
 	int ret;
 
 	do {
-		ret = check_committed_ref(root, path, objectid, offset, bytenr);
+		ret = check_committed_ref(inode, path, offset, bytenr);
 		if (ret && ret != -ENOENT)
 			goto out;
 
-		ret = check_delayed_ref(root, path, objectid, offset, bytenr);
+		ret = check_delayed_ref(inode, path, offset, bytenr);
 	} while (ret == -EAGAIN && !path->nowait);
 
 out:
 	btrfs_release_path(path);
-	if (btrfs_is_data_reloc_root(root))
+	if (btrfs_is_data_reloc_root(inode->root))
 		WARN_ON(ret > 0);
 	return ret;
 }
diff --git a/fs/btrfs/extent-tree.h b/fs/btrfs/extent-tree.h
index ee62035c4a71..46b8e19022df 100644
--- a/fs/btrfs/extent-tree.h
+++ b/fs/btrfs/extent-tree.h
@@ -116,8 +116,7 @@ int btrfs_pin_extent(struct btrfs_trans_handle *trans, u64 bytenr, u64 num,
 int btrfs_pin_extent_for_log_replay(struct btrfs_trans_handle *trans,
 				    const struct extent_buffer *eb);
 int btrfs_exclude_logged_extents(struct extent_buffer *eb);
-int btrfs_cross_ref_exist(struct btrfs_root *root,
-			  u64 objectid, u64 offset, u64 bytenr,
+int btrfs_cross_ref_exist(struct btrfs_inode *inode, u64 offset, u64 bytenr,
 			  struct btrfs_path *path);
 struct extent_buffer *btrfs_alloc_tree_block(struct btrfs_trans_handle *trans,
 					     struct btrfs_root *root,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 0965a29cf4f7..8a173a24ac05 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1920,8 +1920,7 @@ static int can_nocow_file_extent(struct btrfs_path *path,
 	 */
 	btrfs_release_path(path);
 
-	ret = btrfs_cross_ref_exist(root, btrfs_ino(inode),
-				    key->offset - args->file_extent.offset,
+	ret = btrfs_cross_ref_exist(inode, key->offset - args->file_extent.offset,
 				    args->file_extent.disk_bytenr, path);
 	WARN_ON_ONCE(ret > 0 && is_freespace_inode);
 	if (ret != 0)
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 10/11] btrfs: add function comment for check_committed_ref()
  2024-12-11 14:53 [PATCH 00/11] btrfs: fixes around swap file activation and cleanups fdmanana
                   ` (8 preceding siblings ...)
  2024-12-11 14:53 ` [PATCH 09/11] btrfs: simplify arguments for btrfs_cross_ref_exist() fdmanana
@ 2024-12-11 14:53 ` fdmanana
  2024-12-11 14:53 ` [PATCH 11/11] btrfs: add assertions and comment about path expectations to btrfs_cross_ref_exist() fdmanana
  2024-12-11 15:04 ` [PATCH v2 00/11] btrfs: fixes around swap file activation and cleanups fdmanana
  11 siblings, 0 replies; 29+ messages in thread
From: fdmanana @ 2024-12-11 14:53 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

There are some not immediately obvious details about the operation of
check_committed_ref(), namely that when it returns 0 it must return with
the path having a locked leaf from the extent tree that contains the
extent's extent item, so that we can later check for delayed refs when
calling check_delayed_ref() in a way that doesn't race with a task running
delayed references. For similar reasons, it must also return with a locked
leaf when the extent item is not found, and that leaf is where the extent
item should be located, because we may have delayed references that are
going to create the extent item. Also document that the function can
return false positives in order to not be too slow, and that the most
important is to not return false negatives.

So add a function comment to check_committed_ref().

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent-tree.c | 42 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index af3893ad784b..3dea8ce87bf7 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2295,6 +2295,48 @@ static noinline int check_delayed_ref(struct btrfs_inode *inode,
 	return ret;
 }
 
+/*
+ * Check if there are references for a data extent other than the one belonging
+ * to the given inode and offset.
+ *
+ * @inode:     The only inode we expect to find associated with the data extent.
+ * @path:      A path to use for searching the extent tree.
+ * @offset:    The only offset we expect to find associated with the data
+ *             extent.
+ * @bytenr:    The logical address of the data extent.
+ *
+ * When the extent does not have any other references other than the one we
+ * expect to find, we always return a value of 0 with the path having a locked
+ * leaf that contains the extent's extent item - this is necessary to ensure
+ * we don't race with a task running delayed references, and our caller must
+ * have such a path when calling check_delayed_ref() - it must lock a delayed
+ * ref head while holding the leaf locked. In case the extent item is not found
+ * in the extent tree, we return -ENOENT with the path having the leaf (locked)
+ * where the extent item should be, in order to prevent races with another task
+ * running delayed references, so that we don't miss any reference when calling
+ * check_delayed_ref().
+ *
+ * Note: this may return false positives, and this is because we want to be
+ *       quick here as we're called in write paths (when flushing delalloc and
+ *       in the direct IO write path). For example we can have an extent with
+ *       a single reference but that reference is not inlined, or we may have
+ *       many references in the extent tree but we also have delayed references
+ *       that cancel all the reference except the one for our inode and offset,
+ *       but it would be expensive to do such checks and complex due to all
+ *       locking to avoid races between the checks and flushing delayed refs,
+ *       plus non-inline reference may be located on leaves other than the one
+ *       that contains the extent item in the extent tree. The import thing here
+ *       is to not return false negatives and that the false positives are not
+ *       very common.
+ *
+ * Returns: 0 if there are no cross references and with the path having a locked
+ *          leaf from the extent tree that contains the extent's extent item.
+ *
+ *          1 if there are cross references or if there may be because @strict
+ *          is false and the root was snapshoted after the extent was created.
+ *
+ *          < 0 in case of an error.
+ */
 static noinline int check_committed_ref(struct btrfs_inode *inode,
 					struct btrfs_path *path,
 					u64 offset, u64 bytenr)
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 11/11] btrfs: add assertions and comment about path expectations to btrfs_cross_ref_exist()
  2024-12-11 14:53 [PATCH 00/11] btrfs: fixes around swap file activation and cleanups fdmanana
                   ` (9 preceding siblings ...)
  2024-12-11 14:53 ` [PATCH 10/11] btrfs: add function comment for check_committed_ref() fdmanana
@ 2024-12-11 14:53 ` fdmanana
  2024-12-11 15:04 ` [PATCH v2 00/11] btrfs: fixes around swap file activation and cleanups fdmanana
  11 siblings, 0 replies; 29+ messages in thread
From: fdmanana @ 2024-12-11 14:53 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

We should always call check_delayed_ref() with a path having a locked leaf
from the extent tree where either the extent item is located or where it
should be located in case it doesn't exist yet (when there's a pending
unflushed delayed ref to do it), as we need to lock any existing delayed
ref head while holding such leaf locked in order to avoid races with
flushing delayed references, which could make us think an extent is not
shared when it really is.

So add some assertions and a comment about such expectations to
btrfs_cross_ref_exist(), which is the only caller of check_delayed_ref().

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent-tree.c | 25 +++++++++++++++++++++++++
 fs/btrfs/locking.h     |  5 +++++
 2 files changed, 30 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 3dea8ce87bf7..7ef07ae0c895 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2425,6 +2425,31 @@ int btrfs_cross_ref_exist(struct btrfs_inode *inode, u64 offset,
 		if (ret && ret != -ENOENT)
 			goto out;
 
+		/*
+		 * The path must have a locked leaf from the extent tree where
+		 * the extent item for our extent is located, in case it exists,
+		 * or where it should be located in case it doesn't exist yet
+		 * because it's new and its delayed ref was not yet flushed.
+		 * We need to lock the delayed ref head at check_delayed_ref(),
+		 * if one exists, while holding the leaf locked in order to not
+		 * race with delayed ref flushing, missing references and
+		 * incorrectly reporting that the extent is not shared.
+		 */
+		if (IS_ENABLED(CONFIG_BTRFS_ASSERT)) {
+			struct extent_buffer *leaf = path->nodes[0];
+
+			ASSERT(leaf != NULL);
+			btrfs_assert_tree_read_locked(leaf);
+
+			if (ret != -ENOENT) {
+				struct btrfs_key key;
+
+				btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
+				ASSERT(key.objectid == bytenr);
+				ASSERT(key.type == BTRFS_EXTENT_ITEM_KEY);
+			}
+		}
+
 		ret = check_delayed_ref(inode, path, offset, bytenr);
 	} while (ret == -EAGAIN && !path->nowait);
 
diff --git a/fs/btrfs/locking.h b/fs/btrfs/locking.h
index 35036b151bf5..c69e57ff804b 100644
--- a/fs/btrfs/locking.h
+++ b/fs/btrfs/locking.h
@@ -199,8 +199,13 @@ static inline void btrfs_assert_tree_write_locked(struct extent_buffer *eb)
 {
 	lockdep_assert_held_write(&eb->lock);
 }
+static inline void btrfs_assert_tree_read_locked(struct extent_buffer *eb)
+{
+	lockdep_assert_held_read(&eb->lock);
+}
 #else
 static inline void btrfs_assert_tree_write_locked(struct extent_buffer *eb) { }
+static inline void btrfs_assert_tree_read_locked(struct extent_buffer *eb) { }
 #endif
 
 void btrfs_unlock_up_safe(struct btrfs_path *path, int level);
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v2 00/11] btrfs: fixes around swap file activation and cleanups
  2024-12-11 14:53 [PATCH 00/11] btrfs: fixes around swap file activation and cleanups fdmanana
                   ` (10 preceding siblings ...)
  2024-12-11 14:53 ` [PATCH 11/11] btrfs: add assertions and comment about path expectations to btrfs_cross_ref_exist() fdmanana
@ 2024-12-11 15:04 ` fdmanana
  2024-12-11 15:04   ` [PATCH v2 01/11] btrfs: fix race with memory mapped writes when activating swap file fdmanana
                     ` (11 more replies)
  11 siblings, 12 replies; 29+ messages in thread
From: fdmanana @ 2024-12-11 15:04 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

There are a couple issues with swap file activation, one being races with
memory mapped writes, while the other one is a failure due to buggy
detection of extent sharedness - we can often consider an extent as shared
when it is not but it used to be. The rest are just some cleanups or
enhancements.

V2: Updated patch 10/11, it mentioned a no longer existing argument that
    removed in patch 05/11, the patches were re-ordered at some point.

Filipe Manana (11):
  btrfs: fix race with memory mapped writes when activating swap file
  btrfs: fix swap file activation failure due to extents that used to be shared
  btrfs: allow swap activation to be interruptible
  btrfs: avoid monopolizing a core when activating a swap file
  btrfs: remove no longer needed strict argument from can_nocow_extent()
  btrfs: remove the snapshot check from check_committed_ref()
  btrfs: avoid redundant call to get inline ref type at check_committed_ref()
  btrfs: simplify return logic at check_committed_ref()
  btrfs: simplify arguments for btrfs_cross_ref_exist()
  btrfs: add function comment for check_committed_ref()
  btrfs: add assertions and comment about path expectations to btrfs_cross_ref_exist()

 fs/btrfs/btrfs_inode.h |   2 +-
 fs/btrfs/direct-io.c   |   3 +-
 fs/btrfs/extent-tree.c | 128 +++++++++++++++++++++++++-----------
 fs/btrfs/extent-tree.h |   3 +-
 fs/btrfs/file.c        |   2 +-
 fs/btrfs/inode.c       | 146 +++++++++++++++++++++++++++++------------
 fs/btrfs/locking.h     |   5 ++
 7 files changed, 203 insertions(+), 86 deletions(-)

-- 
2.45.2


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v2 01/11] btrfs: fix race with memory mapped writes when activating swap file
  2024-12-11 15:04 ` [PATCH v2 00/11] btrfs: fixes around swap file activation and cleanups fdmanana
@ 2024-12-11 15:04   ` fdmanana
  2024-12-11 15:04   ` [PATCH v2 02/11] btrfs: fix swap file activation failure due to extents that used to be shared fdmanana
                     ` (10 subsequent siblings)
  11 siblings, 0 replies; 29+ messages in thread
From: fdmanana @ 2024-12-11 15:04 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

When activating the swap file we flush all delalloc and wait for ordered
extent completion, so that we don't miss any delalloc and extents before
we check that the file's extent layout is usable for a swap file and
activate the swap file. We are called with the inode's VFS lock acquired,
so we won't race with buffered and direct IO writes, however we can still
race with memory mapped writes since they don't acquire the inode's VFS
lock. The race window is between flushing all delalloc and locking the
whole file's extent range, since memory mapped writes lock an extent range
with the length of a page.

Fix this by acquiring the inode's mmap lock before we flush delalloc.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/inode.c | 31 ++++++++++++++++++++++++-------
 1 file changed, 24 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index c4997200dbb2..926d82fbdbae 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9809,6 +9809,15 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 	u64 isize;
 	u64 start;
 
+	/*
+	 * Acquire the inode's mmap lock to prevent races with memory mapped
+	 * writes, as they could happen after we flush delalloc below and before
+	 * we lock the extent range further below. The inode was already locked
+	 * up in the call chain.
+	 */
+	btrfs_assert_inode_locked(BTRFS_I(inode));
+	down_write(&BTRFS_I(inode)->i_mmap_lock);
+
 	/*
 	 * If the swap file was just created, make sure delalloc is done. If the
 	 * file changes again after this, the user is doing something stupid and
@@ -9816,22 +9825,25 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 	 */
 	ret = btrfs_wait_ordered_range(BTRFS_I(inode), 0, (u64)-1);
 	if (ret)
-		return ret;
+		goto out_unlock_mmap;
 
 	/*
 	 * The inode is locked, so these flags won't change after we check them.
 	 */
 	if (BTRFS_I(inode)->flags & BTRFS_INODE_COMPRESS) {
 		btrfs_warn(fs_info, "swapfile must not be compressed");
-		return -EINVAL;
+		ret = -EINVAL;
+		goto out_unlock_mmap;
 	}
 	if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW)) {
 		btrfs_warn(fs_info, "swapfile must not be copy-on-write");
-		return -EINVAL;
+		ret = -EINVAL;
+		goto out_unlock_mmap;
 	}
 	if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)) {
 		btrfs_warn(fs_info, "swapfile must not be checksummed");
-		return -EINVAL;
+		ret = -EINVAL;
+		goto out_unlock_mmap;
 	}
 
 	/*
@@ -9846,7 +9858,8 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 	if (!btrfs_exclop_start(fs_info, BTRFS_EXCLOP_SWAP_ACTIVATE)) {
 		btrfs_warn(fs_info,
 	   "cannot activate swapfile while exclusive operation is running");
-		return -EBUSY;
+		ret = -EBUSY;
+		goto out_unlock_mmap;
 	}
 
 	/*
@@ -9860,7 +9873,8 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 		btrfs_exclop_finish(fs_info);
 		btrfs_warn(fs_info,
 	   "cannot activate swapfile because snapshot creation is in progress");
-		return -EINVAL;
+		ret = -EINVAL;
+		goto out_unlock_mmap;
 	}
 	/*
 	 * Snapshots can create extents which require COW even if NODATACOW is
@@ -9881,7 +9895,8 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 		btrfs_warn(fs_info,
 		"cannot activate swapfile because subvolume %llu is being deleted",
 			btrfs_root_id(root));
-		return -EPERM;
+		ret = -EPERM;
+		goto out_unlock_mmap;
 	}
 	atomic_inc(&root->nr_swapfiles);
 	spin_unlock(&root->root_item_lock);
@@ -10036,6 +10051,8 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 
 	btrfs_exclop_finish(fs_info);
 
+out_unlock_mmap:
+	up_write(&BTRFS_I(inode)->i_mmap_lock);
 	if (ret)
 		return ret;
 
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v2 02/11] btrfs: fix swap file activation failure due to extents that used to be shared
  2024-12-11 15:04 ` [PATCH v2 00/11] btrfs: fixes around swap file activation and cleanups fdmanana
  2024-12-11 15:04   ` [PATCH v2 01/11] btrfs: fix race with memory mapped writes when activating swap file fdmanana
@ 2024-12-11 15:04   ` fdmanana
  2024-12-13 20:52     ` Qu Wenruo
  2024-12-11 15:05   ` [PATCH v2 03/11] btrfs: allow swap activation to be interruptible fdmanana
                     ` (9 subsequent siblings)
  11 siblings, 1 reply; 29+ messages in thread
From: fdmanana @ 2024-12-11 15:04 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

When activating a swap file, to determine if an extent is shared we use
can_nocow_extent(), which ends up at btrfs_cross_ref_exist(). That helper
is meant to be quick because it's used in the NOCOW write path, when
flushing delalloc and when doing a direct IO write, however it does return
some false positives, meaning it may indicate that an extent is shared
even if it's no longer the case. For the write path this is fine, we just
do a unnecessary COW operation instead of doing a more rigorous check
which would be too heavy (calling btrfs_is_data_extent_shared()).

However when activating a swap file, the false positives simply result
in a failure, which is confusing for users/applications. One particular
case where this happens is when a data extent only has 1 reference but
that reference is not inlined in the extent item located in the extent
tree - this happens when we create more than 33 references for an extent
and then delete those 33 references plus every other non-inline reference
except one. The function check_committed_ref() assumes that if the size
of an extent item doesn't match the size of struct btrfs_extent_item
plus the size of an inline reference (plus an owner reference in case
simple quotas are enabled), then the extent is shared - that is not the
case however, we can have a single reference but it's not inlined - the
reason we do this is to be fast and avoid inspecting non-inline references
which may be located in another leaf of the extent tree, slowing down
write paths.

The following test script reproduces the bug:

   $ cat test.sh
   #!/bin/bash

   DEV=/dev/sdi
   MNT=/mnt/sdi
   NUM_CLONES=50

   umount $DEV &> /dev/null

   run_test()
   {
        local sync_after_add_reflinks=$1
        local sync_after_remove_reflinks=$2

        mkfs.btrfs -f $DEV > /dev/null
        #mkfs.xfs -f $DEV > /dev/null
        mount $DEV $MNT

        touch $MNT/foo
        chmod 0600 $MNT/foo
   	# On btrfs the file must be NOCOW.
        chattr +C $MNT/foo &> /dev/null
        xfs_io -s -c "pwrite -b 1M 0 1M" $MNT/foo
        mkswap $MNT/foo

        for ((i = 1; i <= $NUM_CLONES; i++)); do
            touch $MNT/foo_clone_$i
            chmod 0600 $MNT/foo_clone_$i
            # On btrfs the file must be NOCOW.
            chattr +C $MNT/foo_clone_$i &> /dev/null
            cp --reflink=always $MNT/foo $MNT/foo_clone_$i
        done

        if [ $sync_after_add_reflinks -ne 0 ]; then
            # Flush delayed refs and commit current transaction.
            sync -f $MNT
        fi

        # Remove the original file and all clones except the last.
        rm -f $MNT/foo
        for ((i = 1; i < $NUM_CLONES; i++)); do
            rm -f $MNT/foo_clone_$i
        done

        if [ $sync_after_remove_reflinks -ne 0 ]; then
            # Flush delayed refs and commit current transaction.
            sync -f $MNT
        fi

        # Now use the last clone as a swap file. It should work since
        # its extent are not shared anymore.
        swapon $MNT/foo_clone_${NUM_CLONES}
        swapoff $MNT/foo_clone_${NUM_CLONES}

        umount $MNT
   }

   echo -e "\nTest without sync after creating and removing clones"
   run_test 0 0

   echo -e "\nTest with sync after creating clones"
   run_test 1 0

   echo -e "\nTest with sync after removing clones"
   run_test 0 1

   echo -e "\nTest with sync after creating and removing clones"
   run_test 1 1

Running the test:

   $ ./test.sh
   Test without sync after creating and removing clones
   wrote 1048576/1048576 bytes at offset 0
   1 MiB, 1 ops; 0.0017 sec (556.793 MiB/sec and 556.7929 ops/sec)
   Setting up swapspace version 1, size = 1020 KiB (1044480 bytes)
   no label, UUID=a6b9c29e-5ef4-4689-a8ac-bc199c750f02
   swapon: /mnt/sdi/foo_clone_50: swapon failed: Invalid argument
   swapoff: /mnt/sdi/foo_clone_50: swapoff failed: Invalid argument

   Test with sync after creating clones
   wrote 1048576/1048576 bytes at offset 0
   1 MiB, 1 ops; 0.0036 sec (271.739 MiB/sec and 271.7391 ops/sec)
   Setting up swapspace version 1, size = 1020 KiB (1044480 bytes)
   no label, UUID=5e9008d6-1f7a-4948-a1b4-3f30aba20a33
   swapon: /mnt/sdi/foo_clone_50: swapon failed: Invalid argument
   swapoff: /mnt/sdi/foo_clone_50: swapoff failed: Invalid argument

   Test with sync after removing clones
   wrote 1048576/1048576 bytes at offset 0
   1 MiB, 1 ops; 0.0103 sec (96.665 MiB/sec and 96.6651 ops/sec)
   Setting up swapspace version 1, size = 1020 KiB (1044480 bytes)
   no label, UUID=916c2740-fa9f-4385-9f06-29c3f89e4764

   Test with sync after creating and removing clones
   wrote 1048576/1048576 bytes at offset 0
   1 MiB, 1 ops; 0.0031 sec (314.268 MiB/sec and 314.2678 ops/sec)
   Setting up swapspace version 1, size = 1020 KiB (1044480 bytes)
   no label, UUID=06aab1dd-4d90-49c0-bd9f-3a8db4e2f912
   swapon: /mnt/sdi/foo_clone_50: swapon failed: Invalid argument
   swapoff: /mnt/sdi/foo_clone_50: swapoff failed: Invalid argument

Fix this by reworking btrfs_swap_activate() to instead of using extent
maps and checking for shared extents with can_nocow_extent(), iterate
over the inode's file extent items and use the accurate
btrfs_is_data_extent_shared().

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/inode.c | 96 ++++++++++++++++++++++++++++++++++--------------
 1 file changed, 69 insertions(+), 27 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 926d82fbdbae..7ddb8a01b60f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9799,15 +9799,16 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
 	struct extent_state *cached_state = NULL;
-	struct extent_map *em = NULL;
 	struct btrfs_chunk_map *map = NULL;
 	struct btrfs_device *device = NULL;
 	struct btrfs_swap_info bsi = {
 		.lowest_ppage = (sector_t)-1ULL,
 	};
+	struct btrfs_backref_share_check_ctx *backref_ctx = NULL;
+	struct btrfs_path *path = NULL;
 	int ret = 0;
 	u64 isize;
-	u64 start;
+	u64 prev_extent_end = 0;
 
 	/*
 	 * Acquire the inode's mmap lock to prevent races with memory mapped
@@ -9846,6 +9847,13 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 		goto out_unlock_mmap;
 	}
 
+	path = btrfs_alloc_path();
+	backref_ctx = btrfs_alloc_backref_share_check_ctx();
+	if (!path || !backref_ctx) {
+		ret = -ENOMEM;
+		goto out_unlock_mmap;
+	}
+
 	/*
 	 * Balance or device remove/replace/resize can move stuff around from
 	 * under us. The exclop protection makes sure they aren't running/won't
@@ -9904,24 +9912,39 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 	isize = ALIGN_DOWN(inode->i_size, fs_info->sectorsize);
 
 	lock_extent(io_tree, 0, isize - 1, &cached_state);
-	start = 0;
-	while (start < isize) {
-		u64 logical_block_start, physical_block_start;
+	while (prev_extent_end < isize) {
+		struct btrfs_key key;
+		struct extent_buffer *leaf;
+		struct btrfs_file_extent_item *ei;
 		struct btrfs_block_group *bg;
-		u64 len = isize - start;
+		u64 logical_block_start;
+		u64 physical_block_start;
+		u64 extent_gen;
+		u64 disk_bytenr;
+		u64 len;
 
-		em = btrfs_get_extent(BTRFS_I(inode), NULL, start, len);
-		if (IS_ERR(em)) {
-			ret = PTR_ERR(em);
+		key.objectid = btrfs_ino(BTRFS_I(inode));
+		key.type = BTRFS_EXTENT_DATA_KEY;
+		key.offset = prev_extent_end;
+
+		ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+		if (ret < 0)
 			goto out;
-		}
 
-		if (em->disk_bytenr == EXTENT_MAP_HOLE) {
+		/*
+		 * If key not found it means we have an implicit hole (NO_HOLES
+		 * is enabled).
+		 */
+		if (ret > 0) {
 			btrfs_warn(fs_info, "swapfile must not have holes");
 			ret = -EINVAL;
 			goto out;
 		}
-		if (em->disk_bytenr == EXTENT_MAP_INLINE) {
+
+		leaf = path->nodes[0];
+		ei = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_file_extent_item);
+
+		if (btrfs_file_extent_type(leaf, ei) == BTRFS_FILE_EXTENT_INLINE) {
 			/*
 			 * It's unlikely we'll ever actually find ourselves
 			 * here, as a file small enough to fit inline won't be
@@ -9933,23 +9956,45 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 			ret = -EINVAL;
 			goto out;
 		}
-		if (extent_map_is_compressed(em)) {
+
+		if (btrfs_file_extent_compression(leaf, ei) != BTRFS_COMPRESS_NONE) {
 			btrfs_warn(fs_info, "swapfile must not be compressed");
 			ret = -EINVAL;
 			goto out;
 		}
 
-		logical_block_start = extent_map_block_start(em) + (start - em->start);
-		len = min(len, em->len - (start - em->start));
-		free_extent_map(em);
-		em = NULL;
+		disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
+		if (disk_bytenr == 0) {
+			btrfs_warn(fs_info, "swapfile must not have holes");
+			ret = -EINVAL;
+			goto out;
+		}
+
+		logical_block_start = disk_bytenr + btrfs_file_extent_offset(leaf, ei);
+		extent_gen = btrfs_file_extent_generation(leaf, ei);
+		prev_extent_end = btrfs_file_extent_end(path);
+
+		if (prev_extent_end > isize)
+			len = isize - key.offset;
+		else
+			len = btrfs_file_extent_num_bytes(leaf, ei);
 
-		ret = can_nocow_extent(inode, start, &len, NULL, false, true);
+		backref_ctx->curr_leaf_bytenr = leaf->start;
+
+		/*
+		 * Don't need the path anymore, release to avoid deadlocks when
+		 * calling btrfs_is_data_extent_shared() because when joining a
+		 * transaction it can block waiting for the current one's commit
+		 * which in turn may be trying to lock the same leaf to flush
+		 * delayed items for example.
+		 */
+		btrfs_release_path(path);
+
+		ret = btrfs_is_data_extent_shared(BTRFS_I(inode), disk_bytenr,
+						  extent_gen, backref_ctx);
 		if (ret < 0) {
 			goto out;
-		} else if (ret) {
-			ret = 0;
-		} else {
+		} else if (ret > 0) {
 			btrfs_warn(fs_info,
 				   "swapfile must not be copy-on-write");
 			ret = -EINVAL;
@@ -9984,7 +10029,6 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 
 		physical_block_start = (map->stripes[0].physical +
 					(logical_block_start - map->start));
-		len = min(len, map->chunk_len - (logical_block_start - map->start));
 		btrfs_free_chunk_map(map);
 		map = NULL;
 
@@ -10025,20 +10069,16 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 				if (ret)
 					goto out;
 			}
-			bsi.start = start;
+			bsi.start = key.offset;
 			bsi.block_start = physical_block_start;
 			bsi.block_len = len;
 		}
-
-		start += len;
 	}
 
 	if (bsi.block_len)
 		ret = btrfs_add_swap_extent(sis, &bsi);
 
 out:
-	if (!IS_ERR_OR_NULL(em))
-		free_extent_map(em);
 	if (!IS_ERR_OR_NULL(map))
 		btrfs_free_chunk_map(map);
 
@@ -10053,6 +10093,8 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 
 out_unlock_mmap:
 	up_write(&BTRFS_I(inode)->i_mmap_lock);
+	btrfs_free_backref_share_ctx(backref_ctx);
+	btrfs_free_path(path);
 	if (ret)
 		return ret;
 
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v2 03/11] btrfs: allow swap activation to be interruptible
  2024-12-11 15:04 ` [PATCH v2 00/11] btrfs: fixes around swap file activation and cleanups fdmanana
  2024-12-11 15:04   ` [PATCH v2 01/11] btrfs: fix race with memory mapped writes when activating swap file fdmanana
  2024-12-11 15:04   ` [PATCH v2 02/11] btrfs: fix swap file activation failure due to extents that used to be shared fdmanana
@ 2024-12-11 15:05   ` fdmanana
  2024-12-11 15:05   ` [PATCH v2 04/11] btrfs: avoid monopolizing a core when activating a swap file fdmanana
                     ` (8 subsequent siblings)
  11 siblings, 0 replies; 29+ messages in thread
From: fdmanana @ 2024-12-11 15:05 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

During swap activation we iterate over the extents of a file, then do
several checks for each extent, some of which may take some significant
time such as checking if an extent is shared. Since a file can have
many thousands of extents, this can be a very slow operation and it's
currently not interruptible. I had a bug during development of a previous
patch that resulted in an infinite loop when iterating the extents, so
a core was busy looping and I couldn't cancel the operation, which is very
annoying and requires a reboot. So make the loop interruptible by checking
for fatal signals at the end of each iteration and stopping immediately if
there is one.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/inode.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 7ddb8a01b60f..5edc151c640d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -10073,6 +10073,11 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 			bsi.block_start = physical_block_start;
 			bsi.block_len = len;
 		}
+
+		if (fatal_signal_pending(current)) {
+			ret = -EINTR;
+			goto out;
+		}
 	}

 	if (bsi.block_len)
-- 
2.45.2

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v2 04/11] btrfs: avoid monopolizing a core when activating a swap file
  2024-12-11 15:04 ` [PATCH v2 00/11] btrfs: fixes around swap file activation and cleanups fdmanana
                     ` (2 preceding siblings ...)
  2024-12-11 15:05   ` [PATCH v2 03/11] btrfs: allow swap activation to be interruptible fdmanana
@ 2024-12-11 15:05   ` fdmanana
  2024-12-11 15:05   ` [PATCH v2 05/11] btrfs: remove no longer needed strict argument from can_nocow_extent() fdmanana
                     ` (7 subsequent siblings)
  11 siblings, 0 replies; 29+ messages in thread
From: fdmanana @ 2024-12-11 15:05 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

During swap activation we iterate over the extents of a file and we can
have many thounsands of them, so we can end up in a busy loop monopolizing
a core. Avoid this by doing a voluntary reschedule after processing each
extent.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/inode.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 5edc151c640d..283199d11642 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -10078,6 +10078,8 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 			ret = -EINTR;
 			goto out;
 		}
+
+		cond_resched();
 	}
 
 	if (bsi.block_len)
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v2 05/11] btrfs: remove no longer needed strict argument from can_nocow_extent()
  2024-12-11 15:04 ` [PATCH v2 00/11] btrfs: fixes around swap file activation and cleanups fdmanana
                     ` (3 preceding siblings ...)
  2024-12-11 15:05   ` [PATCH v2 04/11] btrfs: avoid monopolizing a core when activating a swap file fdmanana
@ 2024-12-11 15:05   ` fdmanana
  2024-12-11 15:05   ` [PATCH v2 06/11] btrfs: remove the snapshot check from check_committed_ref() fdmanana
                     ` (6 subsequent siblings)
  11 siblings, 0 replies; 29+ messages in thread
From: fdmanana @ 2024-12-11 15:05 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

All callers of can_nocow_extent() now pass a value of false for its
'strict' argument, making it redundant. So remove the argument from
can_nocow_extent() as well as can_nocow_file_extent(),
btrfs_cross_ref_exist() and check_committed_ref(), because this
argument was used just to influence the behavior of check_committed_ref().
Also remove the 'strict' field from struct can_nocow_file_extent_args,
which is now always false as well, as its value is taken from the
argument to can_nocow_extent().

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/btrfs_inode.h |  2 +-
 fs/btrfs/direct-io.c   |  3 +--
 fs/btrfs/extent-tree.c | 15 ++++++---------
 fs/btrfs/extent-tree.h |  2 +-
 fs/btrfs/file.c        |  2 +-
 fs/btrfs/inode.c       | 11 +++--------
 6 files changed, 13 insertions(+), 22 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index aa1f55cd81b7..b2fa33911c28 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -526,7 +526,7 @@ bool btrfs_data_csum_ok(struct btrfs_bio *bbio, struct btrfs_device *dev,
 			u32 bio_offset, struct bio_vec *bv);
 noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
 			      struct btrfs_file_extent *file_extent,
-			      bool nowait, bool strict);
+			      bool nowait);
 
 void btrfs_del_delalloc_inode(struct btrfs_inode *inode);
 struct inode *btrfs_lookup_dentry(struct inode *dir, struct dentry *dentry);
diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
index a7c3e221378d..8567af46e16f 100644
--- a/fs/btrfs/direct-io.c
+++ b/fs/btrfs/direct-io.c
@@ -248,8 +248,7 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
 		len = min(len, em->len - (start - em->start));
 		block_start = extent_map_block_start(em) + (start - em->start);
 
-		if (can_nocow_extent(inode, start, &len,
-				     &file_extent, false, false) == 1) {
+		if (can_nocow_extent(inode, start, &len, &file_extent, false) == 1) {
 			bg = btrfs_inc_nocow_writers(fs_info, block_start);
 			if (bg)
 				can_nocow = true;
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 2f9126528a01..46a3a4a4536b 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2296,8 +2296,7 @@ static noinline int check_delayed_ref(struct btrfs_root *root,
 
 static noinline int check_committed_ref(struct btrfs_root *root,
 					struct btrfs_path *path,
-					u64 objectid, u64 offset, u64 bytenr,
-					bool strict)
+					u64 objectid, u64 offset, u64 bytenr)
 {
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct btrfs_root *extent_root = btrfs_extent_root(fs_info, bytenr);
@@ -2361,11 +2360,10 @@ static noinline int check_committed_ref(struct btrfs_root *root,
 
 	/*
 	 * If extent created before last snapshot => it's shared unless the
-	 * snapshot has been deleted. Use the heuristic if strict is false.
+	 * snapshot has been deleted.
 	 */
-	if (!strict &&
-	    (btrfs_extent_generation(leaf, ei) <=
-	     btrfs_root_last_snapshot(&root->root_item)))
+	if (btrfs_extent_generation(leaf, ei) <=
+	    btrfs_root_last_snapshot(&root->root_item))
 		goto out;
 
 	/* If this extent has SHARED_DATA_REF then it's shared */
@@ -2387,13 +2385,12 @@ static noinline int check_committed_ref(struct btrfs_root *root,
 }
 
 int btrfs_cross_ref_exist(struct btrfs_root *root, u64 objectid, u64 offset,
-			  u64 bytenr, bool strict, struct btrfs_path *path)
+			  u64 bytenr, struct btrfs_path *path)
 {
 	int ret;
 
 	do {
-		ret = check_committed_ref(root, path, objectid,
-					  offset, bytenr, strict);
+		ret = check_committed_ref(root, path, objectid, offset, bytenr);
 		if (ret && ret != -ENOENT)
 			goto out;
 
diff --git a/fs/btrfs/extent-tree.h b/fs/btrfs/extent-tree.h
index 2ad51130c037..ee62035c4a71 100644
--- a/fs/btrfs/extent-tree.h
+++ b/fs/btrfs/extent-tree.h
@@ -117,7 +117,7 @@ int btrfs_pin_extent_for_log_replay(struct btrfs_trans_handle *trans,
 				    const struct extent_buffer *eb);
 int btrfs_exclude_logged_extents(struct extent_buffer *eb);
 int btrfs_cross_ref_exist(struct btrfs_root *root,
-			  u64 objectid, u64 offset, u64 bytenr, bool strict,
+			  u64 objectid, u64 offset, u64 bytenr,
 			  struct btrfs_path *path);
 struct extent_buffer *btrfs_alloc_tree_block(struct btrfs_trans_handle *trans,
 					     struct btrfs_root *root,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index c61f210259d8..a2fb100fd017 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1069,7 +1069,7 @@ int btrfs_check_nocow_lock(struct btrfs_inode *inode, loff_t pos,
 						   &cached_state);
 	}
 	ret = can_nocow_extent(&inode->vfs_inode, lockstart, &num_bytes,
-			       NULL, nowait, false);
+			       NULL, nowait);
 	if (ret <= 0)
 		btrfs_drew_write_unlock(&root->snapshot_lock);
 	else
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 283199d11642..0965a29cf4f7 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1837,7 +1837,6 @@ struct can_nocow_file_extent_args {
 	/* End file offset (inclusive) of the range we want to NOCOW. */
 	u64 end;
 	bool writeback_path;
-	bool strict;
 	/*
 	 * Free the path passed to can_nocow_file_extent() once it's not needed
 	 * anymore.
@@ -1892,8 +1891,7 @@ static int can_nocow_file_extent(struct btrfs_path *path,
 	 * for its subvolume was created, then this implies the extent is shared,
 	 * hence we must COW.
 	 */
-	if (!args->strict &&
-	    btrfs_file_extent_generation(leaf, fi) <=
+	if (btrfs_file_extent_generation(leaf, fi) <=
 	    btrfs_root_last_snapshot(&root->root_item))
 		goto out;
 
@@ -1924,7 +1922,7 @@ static int can_nocow_file_extent(struct btrfs_path *path,
 
 	ret = btrfs_cross_ref_exist(root, btrfs_ino(inode),
 				    key->offset - args->file_extent.offset,
-				    args->file_extent.disk_bytenr, args->strict, path);
+				    args->file_extent.disk_bytenr, path);
 	WARN_ON_ONCE(ret > 0 && is_freespace_inode);
 	if (ret != 0)
 		goto out;
@@ -7011,8 +7009,6 @@ static bool btrfs_extent_readonly(struct btrfs_fs_info *fs_info, u64 bytenr)
  * @orig_start:	(optional) Return the original file offset of the file extent
  * @orig_len:	(optional) Return the original on-disk length of the file extent
  * @ram_bytes:	(optional) Return the ram_bytes of the file extent
- * @strict:	if true, omit optimizations that might force us into unnecessary
- *		cow. e.g., don't trust generation number.
  *
  * Return:
  * >0	and update @len if we can do nocow write
@@ -7024,7 +7020,7 @@ static bool btrfs_extent_readonly(struct btrfs_fs_info *fs_info, u64 bytenr)
  */
 noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
 			      struct btrfs_file_extent *file_extent,
-			      bool nowait, bool strict)
+			      bool nowait)
 {
 	struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
 	struct can_nocow_file_extent_args nocow_args = { 0 };
@@ -7077,7 +7073,6 @@ noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
 
 	nocow_args.start = offset;
 	nocow_args.end = offset + *len - 1;
-	nocow_args.strict = strict;
 	nocow_args.free_path = true;
 
 	ret = can_nocow_file_extent(path, &key, BTRFS_I(inode), &nocow_args);
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v2 06/11] btrfs: remove the snapshot check from check_committed_ref()
  2024-12-11 15:04 ` [PATCH v2 00/11] btrfs: fixes around swap file activation and cleanups fdmanana
                     ` (4 preceding siblings ...)
  2024-12-11 15:05   ` [PATCH v2 05/11] btrfs: remove no longer needed strict argument from can_nocow_extent() fdmanana
@ 2024-12-11 15:05   ` fdmanana
  2024-12-11 15:05   ` [PATCH v2 07/11] btrfs: avoid redundant call to get inline ref type at check_committed_ref() fdmanana
                     ` (5 subsequent siblings)
  11 siblings, 0 replies; 29+ messages in thread
From: fdmanana @ 2024-12-11 15:05 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

At check_committed_ref() we have this check to see if the data extent was
created in a generation lower than or equals to the generation where the
last snapshot for the root was created, and if so we return immediately
with 1, since it's very likely the extent is shared, referenced by other
root.

The only call chain for check_committed_ref() is the following:

   can_nocow_file_extent()
      btrfs_cross_ref_exist()
         check_committed_ref()

And we already do that snapshot check at can_nocow_file_extent(), before
we call btrfs_cross_ref_exist(). This makes the check done at
check_committed_ref() redundant, so remove it.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent-tree.c | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 46a3a4a4536b..e81f4615ccdf 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2358,14 +2358,6 @@ static noinline int check_committed_ref(struct btrfs_root *root,
 	if (item_size != expected_size)
 		goto out;
 
-	/*
-	 * If extent created before last snapshot => it's shared unless the
-	 * snapshot has been deleted.
-	 */
-	if (btrfs_extent_generation(leaf, ei) <=
-	    btrfs_root_last_snapshot(&root->root_item))
-		goto out;
-
 	/* If this extent has SHARED_DATA_REF then it's shared */
 	type = btrfs_get_extent_inline_ref_type(leaf, iref, BTRFS_REF_TYPE_DATA);
 	if (type != BTRFS_EXTENT_DATA_REF_KEY)
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v2 07/11] btrfs: avoid redundant call to get inline ref type at check_committed_ref()
  2024-12-11 15:04 ` [PATCH v2 00/11] btrfs: fixes around swap file activation and cleanups fdmanana
                     ` (5 preceding siblings ...)
  2024-12-11 15:05   ` [PATCH v2 06/11] btrfs: remove the snapshot check from check_committed_ref() fdmanana
@ 2024-12-11 15:05   ` fdmanana
  2024-12-11 15:05   ` [PATCH v2 08/11] btrfs: simplify return logic " fdmanana
                     ` (4 subsequent siblings)
  11 siblings, 0 replies; 29+ messages in thread
From: fdmanana @ 2024-12-11 15:05 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

At check_committed_ref() we are calling btrfs_get_extent_inline_ref_type()
twice, once before we check if have an inline extent owner ref (for simple
qgroups) and then once again sometime after that check. This second call
is redundant when we have simple quotas disabled or we found an inline ref
that is not of the owner ref type. So avoid this second call unless we
have simple quotas enabled and found an owner ref, saving a function call
that does inline ref validation again.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent-tree.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index e81f4615ccdf..00e137c48a9b 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2352,6 +2352,7 @@ static noinline int check_committed_ref(struct btrfs_root *root,
 	if (btrfs_fs_incompat(fs_info, SIMPLE_QUOTA) && type == BTRFS_EXTENT_OWNER_REF_KEY) {
 		expected_size += btrfs_extent_inline_ref_size(BTRFS_EXTENT_OWNER_REF_KEY);
 		iref = (struct btrfs_extent_inline_ref *)(iref + 1);
+		type = btrfs_get_extent_inline_ref_type(leaf, iref, BTRFS_REF_TYPE_DATA);
 	}
 
 	/* If extent item has more than 1 inline ref then it's shared */
@@ -2359,7 +2360,6 @@ static noinline int check_committed_ref(struct btrfs_root *root,
 		goto out;
 
 	/* If this extent has SHARED_DATA_REF then it's shared */
-	type = btrfs_get_extent_inline_ref_type(leaf, iref, BTRFS_REF_TYPE_DATA);
 	if (type != BTRFS_EXTENT_DATA_REF_KEY)
 		goto out;
 
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v2 08/11] btrfs: simplify return logic at check_committed_ref()
  2024-12-11 15:04 ` [PATCH v2 00/11] btrfs: fixes around swap file activation and cleanups fdmanana
                     ` (6 preceding siblings ...)
  2024-12-11 15:05   ` [PATCH v2 07/11] btrfs: avoid redundant call to get inline ref type at check_committed_ref() fdmanana
@ 2024-12-11 15:05   ` fdmanana
  2024-12-11 15:05   ` [PATCH v2 09/11] btrfs: simplify arguments for btrfs_cross_ref_exist() fdmanana
                     ` (3 subsequent siblings)
  11 siblings, 0 replies; 29+ messages in thread
From: fdmanana @ 2024-12-11 15:05 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Instead of setting the value to return in a local variable 'ret' and then
jumping into a label named 'out' that does nothing but return that value,
simplify everything by getting rid of the label and directly returning a
value.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent-tree.c | 23 +++++++++--------------
 1 file changed, 9 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 00e137c48a9b..51c49b2f4991 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2316,35 +2316,32 @@ static noinline int check_committed_ref(struct btrfs_root *root,
 
 	ret = btrfs_search_slot(NULL, extent_root, &key, path, 0, 0);
 	if (ret < 0)
-		goto out;
+		return ret;
 	if (ret == 0) {
 		/*
 		 * Key with offset -1 found, there would have to exist an extent
 		 * item with such offset, but this is out of the valid range.
 		 */
-		ret = -EUCLEAN;
-		goto out;
+		return -EUCLEAN;
 	}
 
-	ret = -ENOENT;
 	if (path->slots[0] == 0)
-		goto out;
+		return -ENOENT;
 
 	path->slots[0]--;
 	leaf = path->nodes[0];
 	btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
 
 	if (key.objectid != bytenr || key.type != BTRFS_EXTENT_ITEM_KEY)
-		goto out;
+		return -ENOENT;
 
-	ret = 1;
 	item_size = btrfs_item_size(leaf, path->slots[0]);
 	ei = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_extent_item);
 	expected_size = sizeof(*ei) + btrfs_extent_inline_ref_size(BTRFS_EXTENT_DATA_REF_KEY);
 
 	/* No inline refs; we need to bail before checking for owner ref. */
 	if (item_size == sizeof(*ei))
-		goto out;
+		return 1;
 
 	/* Check for an owner ref; skip over it to the real inline refs. */
 	iref = (struct btrfs_extent_inline_ref *)(ei + 1);
@@ -2357,11 +2354,11 @@ static noinline int check_committed_ref(struct btrfs_root *root,
 
 	/* If extent item has more than 1 inline ref then it's shared */
 	if (item_size != expected_size)
-		goto out;
+		return 1;
 
 	/* If this extent has SHARED_DATA_REF then it's shared */
 	if (type != BTRFS_EXTENT_DATA_REF_KEY)
-		goto out;
+		return 1;
 
 	ref = (struct btrfs_extent_data_ref *)(&iref->offset);
 	if (btrfs_extent_refs(leaf, ei) !=
@@ -2369,11 +2366,9 @@ static noinline int check_committed_ref(struct btrfs_root *root,
 	    btrfs_extent_data_ref_root(leaf, ref) != btrfs_root_id(root) ||
 	    btrfs_extent_data_ref_objectid(leaf, ref) != objectid ||
 	    btrfs_extent_data_ref_offset(leaf, ref) != offset)
-		goto out;
+		return 1;
 
-	ret = 0;
-out:
-	return ret;
+	return 0;
 }
 
 int btrfs_cross_ref_exist(struct btrfs_root *root, u64 objectid, u64 offset,
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v2 09/11] btrfs: simplify arguments for btrfs_cross_ref_exist()
  2024-12-11 15:04 ` [PATCH v2 00/11] btrfs: fixes around swap file activation and cleanups fdmanana
                     ` (7 preceding siblings ...)
  2024-12-11 15:05   ` [PATCH v2 08/11] btrfs: simplify return logic " fdmanana
@ 2024-12-11 15:05   ` fdmanana
  2024-12-11 15:05   ` [PATCH v2 10/11] btrfs: add function comment for check_committed_ref() fdmanana
                     ` (2 subsequent siblings)
  11 siblings, 0 replies; 29+ messages in thread
From: fdmanana @ 2024-12-11 15:05 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Instead of passing a root and an objectid which matches an inode number,
pass the inode instead, since the root is always the root associated to
the inode and the objectid is the number of that inode.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent-tree.c | 22 ++++++++++++----------
 fs/btrfs/extent-tree.h |  3 +--
 fs/btrfs/inode.c       |  3 +--
 3 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 51c49b2f4991..af3893ad784b 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2206,10 +2206,11 @@ int btrfs_set_disk_extent_flags(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
-static noinline int check_delayed_ref(struct btrfs_root *root,
+static noinline int check_delayed_ref(struct btrfs_inode *inode,
 				      struct btrfs_path *path,
-				      u64 objectid, u64 offset, u64 bytenr)
+				      u64 offset, u64 bytenr)
 {
+	struct btrfs_root *root = inode->root;
 	struct btrfs_delayed_ref_head *head;
 	struct btrfs_delayed_ref_node *ref;
 	struct btrfs_delayed_ref_root *delayed_refs;
@@ -2283,7 +2284,7 @@ static noinline int check_delayed_ref(struct btrfs_root *root,
 		 * then we have a cross reference.
 		 */
 		if (ref->ref_root != btrfs_root_id(root) ||
-		    ref_owner != objectid || ref_offset != offset) {
+		    ref_owner != btrfs_ino(inode) || ref_offset != offset) {
 			ret = 1;
 			break;
 		}
@@ -2294,10 +2295,11 @@ static noinline int check_delayed_ref(struct btrfs_root *root,
 	return ret;
 }
 
-static noinline int check_committed_ref(struct btrfs_root *root,
+static noinline int check_committed_ref(struct btrfs_inode *inode,
 					struct btrfs_path *path,
-					u64 objectid, u64 offset, u64 bytenr)
+					u64 offset, u64 bytenr)
 {
+	struct btrfs_root *root = inode->root;
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct btrfs_root *extent_root = btrfs_extent_root(fs_info, bytenr);
 	struct extent_buffer *leaf;
@@ -2364,29 +2366,29 @@ static noinline int check_committed_ref(struct btrfs_root *root,
 	if (btrfs_extent_refs(leaf, ei) !=
 	    btrfs_extent_data_ref_count(leaf, ref) ||
 	    btrfs_extent_data_ref_root(leaf, ref) != btrfs_root_id(root) ||
-	    btrfs_extent_data_ref_objectid(leaf, ref) != objectid ||
+	    btrfs_extent_data_ref_objectid(leaf, ref) != btrfs_ino(inode) ||
 	    btrfs_extent_data_ref_offset(leaf, ref) != offset)
 		return 1;
 
 	return 0;
 }
 
-int btrfs_cross_ref_exist(struct btrfs_root *root, u64 objectid, u64 offset,
+int btrfs_cross_ref_exist(struct btrfs_inode *inode, u64 offset,
 			  u64 bytenr, struct btrfs_path *path)
 {
 	int ret;
 
 	do {
-		ret = check_committed_ref(root, path, objectid, offset, bytenr);
+		ret = check_committed_ref(inode, path, offset, bytenr);
 		if (ret && ret != -ENOENT)
 			goto out;
 
-		ret = check_delayed_ref(root, path, objectid, offset, bytenr);
+		ret = check_delayed_ref(inode, path, offset, bytenr);
 	} while (ret == -EAGAIN && !path->nowait);
 
 out:
 	btrfs_release_path(path);
-	if (btrfs_is_data_reloc_root(root))
+	if (btrfs_is_data_reloc_root(inode->root))
 		WARN_ON(ret > 0);
 	return ret;
 }
diff --git a/fs/btrfs/extent-tree.h b/fs/btrfs/extent-tree.h
index ee62035c4a71..46b8e19022df 100644
--- a/fs/btrfs/extent-tree.h
+++ b/fs/btrfs/extent-tree.h
@@ -116,8 +116,7 @@ int btrfs_pin_extent(struct btrfs_trans_handle *trans, u64 bytenr, u64 num,
 int btrfs_pin_extent_for_log_replay(struct btrfs_trans_handle *trans,
 				    const struct extent_buffer *eb);
 int btrfs_exclude_logged_extents(struct extent_buffer *eb);
-int btrfs_cross_ref_exist(struct btrfs_root *root,
-			  u64 objectid, u64 offset, u64 bytenr,
+int btrfs_cross_ref_exist(struct btrfs_inode *inode, u64 offset, u64 bytenr,
 			  struct btrfs_path *path);
 struct extent_buffer *btrfs_alloc_tree_block(struct btrfs_trans_handle *trans,
 					     struct btrfs_root *root,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 0965a29cf4f7..8a173a24ac05 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1920,8 +1920,7 @@ static int can_nocow_file_extent(struct btrfs_path *path,
 	 */
 	btrfs_release_path(path);
 
-	ret = btrfs_cross_ref_exist(root, btrfs_ino(inode),
-				    key->offset - args->file_extent.offset,
+	ret = btrfs_cross_ref_exist(inode, key->offset - args->file_extent.offset,
 				    args->file_extent.disk_bytenr, path);
 	WARN_ON_ONCE(ret > 0 && is_freespace_inode);
 	if (ret != 0)
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v2 10/11] btrfs: add function comment for check_committed_ref()
  2024-12-11 15:04 ` [PATCH v2 00/11] btrfs: fixes around swap file activation and cleanups fdmanana
                     ` (8 preceding siblings ...)
  2024-12-11 15:05   ` [PATCH v2 09/11] btrfs: simplify arguments for btrfs_cross_ref_exist() fdmanana
@ 2024-12-11 15:05   ` fdmanana
  2024-12-11 15:05   ` [PATCH v2 11/11] btrfs: add assertions and comment about path expectations to btrfs_cross_ref_exist() fdmanana
  2024-12-13 21:07   ` [PATCH v2 00/11] btrfs: fixes around swap file activation and cleanups Qu Wenruo
  11 siblings, 0 replies; 29+ messages in thread
From: fdmanana @ 2024-12-11 15:05 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

There are some not immediately obvious details about the operation of
check_committed_ref(), namely that when it returns 0 it must return with
the path having a locked leaf from the extent tree that contains the
extent's extent item, so that we can later check for delayed refs when
calling check_delayed_ref() in a way that doesn't race with a task running
delayed references. For similar reasons, it must also return with a locked
leaf when the extent item is not found, and that leaf is where the extent
item should be located, because we may have delayed references that are
going to create the extent item. Also document that the function can
return false positives in order to not be too slow, and that the most
important is to not return false negatives.

So add a function comment to check_committed_ref().

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent-tree.c | 43 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index af3893ad784b..bd13059299e1 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2295,6 +2295,49 @@ static noinline int check_delayed_ref(struct btrfs_inode *inode,
 	return ret;
 }
 
+/*
+ * Check if there are references for a data extent other than the one belonging
+ * to the given inode and offset.
+ *
+ * @inode:     The only inode we expect to find associated with the data extent.
+ * @path:      A path to use for searching the extent tree.
+ * @offset:    The only offset we expect to find associated with the data
+ *             extent.
+ * @bytenr:    The logical address of the data extent.
+ *
+ * When the extent does not have any other references other than the one we
+ * expect to find, we always return a value of 0 with the path having a locked
+ * leaf that contains the extent's extent item - this is necessary to ensure
+ * we don't race with a task running delayed references, and our caller must
+ * have such a path when calling check_delayed_ref() - it must lock a delayed
+ * ref head while holding the leaf locked. In case the extent item is not found
+ * in the extent tree, we return -ENOENT with the path having the leaf (locked)
+ * where the extent item should be, in order to prevent races with another task
+ * running delayed references, so that we don't miss any reference when calling
+ * check_delayed_ref().
+ *
+ * Note: this may return false positives, and this is because we want to be
+ *       quick here as we're called in write paths (when flushing delalloc and
+ *       in the direct IO write path). For example we can have an extent with
+ *       a single reference but that reference is not inlined, or we may have
+ *       many references in the extent tree but we also have delayed references
+ *       that cancel all the reference except the one for our inode and offset,
+ *       but it would be expensive to do such checks and complex due to all
+ *       locking to avoid races between the checks and flushing delayed refs,
+ *       plus non-inline reference may be located on leaves other than the one
+ *       that contains the extent item in the extent tree. The import thing here
+ *       is to not return false negatives and that the false positives are not
+ *       very common.
+ *
+ * Returns: 0 if there are no cross references and with the path having a locked
+ *          leaf from the extent tree that contains the extent's extent item.
+ *
+ *          1 if there are cross references (false positives can happen).
+ *
+ *          < 0 in case of an error. In case of -ENOENT the leaf in the extent
+ *          tree where the extent item should be located at is locked and
+ *          accessible in the given path.
+ */
 static noinline int check_committed_ref(struct btrfs_inode *inode,
 					struct btrfs_path *path,
 					u64 offset, u64 bytenr)
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v2 11/11] btrfs: add assertions and comment about path expectations to btrfs_cross_ref_exist()
  2024-12-11 15:04 ` [PATCH v2 00/11] btrfs: fixes around swap file activation and cleanups fdmanana
                     ` (9 preceding siblings ...)
  2024-12-11 15:05   ` [PATCH v2 10/11] btrfs: add function comment for check_committed_ref() fdmanana
@ 2024-12-11 15:05   ` fdmanana
  2024-12-13 21:03     ` Qu Wenruo
  2024-12-13 21:07   ` [PATCH v2 00/11] btrfs: fixes around swap file activation and cleanups Qu Wenruo
  11 siblings, 1 reply; 29+ messages in thread
From: fdmanana @ 2024-12-11 15:05 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

We should always call check_delayed_ref() with a path having a locked leaf
from the extent tree where either the extent item is located or where it
should be located in case it doesn't exist yet (when there's a pending
unflushed delayed ref to do it), as we need to lock any existing delayed
ref head while holding such leaf locked in order to avoid races with
flushing delayed references, which could make us think an extent is not
shared when it really is.

So add some assertions and a comment about such expectations to
btrfs_cross_ref_exist(), which is the only caller of check_delayed_ref().

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent-tree.c | 25 +++++++++++++++++++++++++
 fs/btrfs/locking.h     |  5 +++++
 2 files changed, 30 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index bd13059299e1..0f30f53f51b9 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2426,6 +2426,31 @@ int btrfs_cross_ref_exist(struct btrfs_inode *inode, u64 offset,
 		if (ret && ret != -ENOENT)
 			goto out;
 
+		/*
+		 * The path must have a locked leaf from the extent tree where
+		 * the extent item for our extent is located, in case it exists,
+		 * or where it should be located in case it doesn't exist yet
+		 * because it's new and its delayed ref was not yet flushed.
+		 * We need to lock the delayed ref head at check_delayed_ref(),
+		 * if one exists, while holding the leaf locked in order to not
+		 * race with delayed ref flushing, missing references and
+		 * incorrectly reporting that the extent is not shared.
+		 */
+		if (IS_ENABLED(CONFIG_BTRFS_ASSERT)) {
+			struct extent_buffer *leaf = path->nodes[0];
+
+			ASSERT(leaf != NULL);
+			btrfs_assert_tree_read_locked(leaf);
+
+			if (ret != -ENOENT) {
+				struct btrfs_key key;
+
+				btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
+				ASSERT(key.objectid == bytenr);
+				ASSERT(key.type == BTRFS_EXTENT_ITEM_KEY);
+			}
+		}
+
 		ret = check_delayed_ref(inode, path, offset, bytenr);
 	} while (ret == -EAGAIN && !path->nowait);
 
diff --git a/fs/btrfs/locking.h b/fs/btrfs/locking.h
index 35036b151bf5..c69e57ff804b 100644
--- a/fs/btrfs/locking.h
+++ b/fs/btrfs/locking.h
@@ -199,8 +199,13 @@ static inline void btrfs_assert_tree_write_locked(struct extent_buffer *eb)
 {
 	lockdep_assert_held_write(&eb->lock);
 }
+static inline void btrfs_assert_tree_read_locked(struct extent_buffer *eb)
+{
+	lockdep_assert_held_read(&eb->lock);
+}
 #else
 static inline void btrfs_assert_tree_write_locked(struct extent_buffer *eb) { }
+static inline void btrfs_assert_tree_read_locked(struct extent_buffer *eb) { }
 #endif
 
 void btrfs_unlock_up_safe(struct btrfs_path *path, int level);
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 02/11] btrfs: fix swap file activation failure due to extents that used to be shared
  2024-12-11 15:04   ` [PATCH v2 02/11] btrfs: fix swap file activation failure due to extents that used to be shared fdmanana
@ 2024-12-13 20:52     ` Qu Wenruo
  2024-12-13 22:36       ` Filipe Manana
  0 siblings, 1 reply; 29+ messages in thread
From: Qu Wenruo @ 2024-12-13 20:52 UTC (permalink / raw)
  To: fdmanana, linux-btrfs



在 2024/12/12 01:34, fdmanana@kernel.org 写道:
> From: Filipe Manana <fdmanana@suse.com>
>
> When activating a swap file, to determine if an extent is shared we use
> can_nocow_extent(), which ends up at btrfs_cross_ref_exist(). That helper
> is meant to be quick because it's used in the NOCOW write path, when
> flushing delalloc and when doing a direct IO write, however it does return
> some false positives, meaning it may indicate that an extent is shared
> even if it's no longer the case. For the write path this is fine, we just
> do a unnecessary COW operation instead of doing a more rigorous check
> which would be too heavy (calling btrfs_is_data_extent_shared()).
>
> However when activating a swap file, the false positives simply result
> in a failure, which is confusing for users/applications. One particular
> case where this happens is when a data extent only has 1 reference but
> that reference is not inlined in the extent item located in the extent
> tree - this happens when we create more than 33 references for an extent
> and then delete those 33 references plus every other non-inline reference
> except one. The function check_committed_ref() assumes that if the size
> of an extent item doesn't match the size of struct btrfs_extent_item
> plus the size of an inline reference (plus an owner reference in case
> simple quotas are enabled), then the extent is shared - that is not the
> case however, we can have a single reference but it's not inlined - the
> reason we do this is to be fast and avoid inspecting non-inline references
> which may be located in another leaf of the extent tree, slowing down
> write paths.
>
> The following test script reproduces the bug:
>
>     $ cat test.sh
>     #!/bin/bash
>
>     DEV=/dev/sdi
>     MNT=/mnt/sdi
>     NUM_CLONES=50
>
>     umount $DEV &> /dev/null
>
>     run_test()
>     {
>          local sync_after_add_reflinks=$1
>          local sync_after_remove_reflinks=$2
>
>          mkfs.btrfs -f $DEV > /dev/null
>          #mkfs.xfs -f $DEV > /dev/null
>          mount $DEV $MNT
>
>          touch $MNT/foo
>          chmod 0600 $MNT/foo
>     	# On btrfs the file must be NOCOW.
>          chattr +C $MNT/foo &> /dev/null
>          xfs_io -s -c "pwrite -b 1M 0 1M" $MNT/foo
>          mkswap $MNT/foo
>
>          for ((i = 1; i <= $NUM_CLONES; i++)); do
>              touch $MNT/foo_clone_$i
>              chmod 0600 $MNT/foo_clone_$i
>              # On btrfs the file must be NOCOW.
>              chattr +C $MNT/foo_clone_$i &> /dev/null
>              cp --reflink=always $MNT/foo $MNT/foo_clone_$i
>          done
>
>          if [ $sync_after_add_reflinks -ne 0 ]; then
>              # Flush delayed refs and commit current transaction.
>              sync -f $MNT
>          fi
>
>          # Remove the original file and all clones except the last.
>          rm -f $MNT/foo
>          for ((i = 1; i < $NUM_CLONES; i++)); do
>              rm -f $MNT/foo_clone_$i
>          done
>
>          if [ $sync_after_remove_reflinks -ne 0 ]; then
>              # Flush delayed refs and commit current transaction.
>              sync -f $MNT
>          fi
>
>          # Now use the last clone as a swap file. It should work since
>          # its extent are not shared anymore.
>          swapon $MNT/foo_clone_${NUM_CLONES}
>          swapoff $MNT/foo_clone_${NUM_CLONES}
>
>          umount $MNT
>     }
>
>     echo -e "\nTest without sync after creating and removing clones"
>     run_test 0 0
>
>     echo -e "\nTest with sync after creating clones"
>     run_test 1 0
>
>     echo -e "\nTest with sync after removing clones"
>     run_test 0 1
>
>     echo -e "\nTest with sync after creating and removing clones"
>     run_test 1 1
>
> Running the test:
>
>     $ ./test.sh
>     Test without sync after creating and removing clones
>     wrote 1048576/1048576 bytes at offset 0
>     1 MiB, 1 ops; 0.0017 sec (556.793 MiB/sec and 556.7929 ops/sec)
>     Setting up swapspace version 1, size = 1020 KiB (1044480 bytes)
>     no label, UUID=a6b9c29e-5ef4-4689-a8ac-bc199c750f02
>     swapon: /mnt/sdi/foo_clone_50: swapon failed: Invalid argument
>     swapoff: /mnt/sdi/foo_clone_50: swapoff failed: Invalid argument
>
>     Test with sync after creating clones
>     wrote 1048576/1048576 bytes at offset 0
>     1 MiB, 1 ops; 0.0036 sec (271.739 MiB/sec and 271.7391 ops/sec)
>     Setting up swapspace version 1, size = 1020 KiB (1044480 bytes)
>     no label, UUID=5e9008d6-1f7a-4948-a1b4-3f30aba20a33
>     swapon: /mnt/sdi/foo_clone_50: swapon failed: Invalid argument
>     swapoff: /mnt/sdi/foo_clone_50: swapoff failed: Invalid argument
>
>     Test with sync after removing clones
>     wrote 1048576/1048576 bytes at offset 0
>     1 MiB, 1 ops; 0.0103 sec (96.665 MiB/sec and 96.6651 ops/sec)
>     Setting up swapspace version 1, size = 1020 KiB (1044480 bytes)
>     no label, UUID=916c2740-fa9f-4385-9f06-29c3f89e4764
>
>     Test with sync after creating and removing clones
>     wrote 1048576/1048576 bytes at offset 0
>     1 MiB, 1 ops; 0.0031 sec (314.268 MiB/sec and 314.2678 ops/sec)
>     Setting up swapspace version 1, size = 1020 KiB (1044480 bytes)
>     no label, UUID=06aab1dd-4d90-49c0-bd9f-3a8db4e2f912
>     swapon: /mnt/sdi/foo_clone_50: swapon failed: Invalid argument
>     swapoff: /mnt/sdi/foo_clone_50: swapoff failed: Invalid argument
>
> Fix this by reworking btrfs_swap_activate() to instead of using extent
> maps and checking for shared extents with can_nocow_extent(), iterate
> over the inode's file extent items and use the accurate
> btrfs_is_data_extent_shared().
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

The patch looks good to me.

Reviewed-by: Qu Wenruo <wqu@suse.com>


Although there is no mention about why we get rid of btrfs_get_extent().

I guess it's to avoid caching those extent maps to save space? Just like
what we did in fiemap.

Thanks,
Qu

> ---
>   fs/btrfs/inode.c | 96 ++++++++++++++++++++++++++++++++++--------------
>   1 file changed, 69 insertions(+), 27 deletions(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 926d82fbdbae..7ddb8a01b60f 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -9799,15 +9799,16 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
>   	struct btrfs_fs_info *fs_info = root->fs_info;
>   	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
>   	struct extent_state *cached_state = NULL;
> -	struct extent_map *em = NULL;
>   	struct btrfs_chunk_map *map = NULL;
>   	struct btrfs_device *device = NULL;
>   	struct btrfs_swap_info bsi = {
>   		.lowest_ppage = (sector_t)-1ULL,
>   	};
> +	struct btrfs_backref_share_check_ctx *backref_ctx = NULL;
> +	struct btrfs_path *path = NULL;
>   	int ret = 0;
>   	u64 isize;
> -	u64 start;
> +	u64 prev_extent_end = 0;
>
>   	/*
>   	 * Acquire the inode's mmap lock to prevent races with memory mapped
> @@ -9846,6 +9847,13 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
>   		goto out_unlock_mmap;
>   	}
>
> +	path = btrfs_alloc_path();
> +	backref_ctx = btrfs_alloc_backref_share_check_ctx();
> +	if (!path || !backref_ctx) {
> +		ret = -ENOMEM;
> +		goto out_unlock_mmap;
> +	}
> +
>   	/*
>   	 * Balance or device remove/replace/resize can move stuff around from
>   	 * under us. The exclop protection makes sure they aren't running/won't
> @@ -9904,24 +9912,39 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
>   	isize = ALIGN_DOWN(inode->i_size, fs_info->sectorsize);
>
>   	lock_extent(io_tree, 0, isize - 1, &cached_state);
> -	start = 0;
> -	while (start < isize) {
> -		u64 logical_block_start, physical_block_start;
> +	while (prev_extent_end < isize) {
> +		struct btrfs_key key;
> +		struct extent_buffer *leaf;
> +		struct btrfs_file_extent_item *ei;
>   		struct btrfs_block_group *bg;
> -		u64 len = isize - start;
> +		u64 logical_block_start;
> +		u64 physical_block_start;
> +		u64 extent_gen;
> +		u64 disk_bytenr;
> +		u64 len;
>
> -		em = btrfs_get_extent(BTRFS_I(inode), NULL, start, len);
> -		if (IS_ERR(em)) {
> -			ret = PTR_ERR(em);
> +		key.objectid = btrfs_ino(BTRFS_I(inode));
> +		key.type = BTRFS_EXTENT_DATA_KEY;
> +		key.offset = prev_extent_end;
> +
> +		ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
> +		if (ret < 0)
>   			goto out;
> -		}
>
> -		if (em->disk_bytenr == EXTENT_MAP_HOLE) {
> +		/*
> +		 * If key not found it means we have an implicit hole (NO_HOLES
> +		 * is enabled).
> +		 */
> +		if (ret > 0) {
>   			btrfs_warn(fs_info, "swapfile must not have holes");
>   			ret = -EINVAL;
>   			goto out;
>   		}
> -		if (em->disk_bytenr == EXTENT_MAP_INLINE) {
> +
> +		leaf = path->nodes[0];
> +		ei = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_file_extent_item);
> +
> +		if (btrfs_file_extent_type(leaf, ei) == BTRFS_FILE_EXTENT_INLINE) {
>   			/*
>   			 * It's unlikely we'll ever actually find ourselves
>   			 * here, as a file small enough to fit inline won't be
> @@ -9933,23 +9956,45 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
>   			ret = -EINVAL;
>   			goto out;
>   		}
> -		if (extent_map_is_compressed(em)) {
> +
> +		if (btrfs_file_extent_compression(leaf, ei) != BTRFS_COMPRESS_NONE) {
>   			btrfs_warn(fs_info, "swapfile must not be compressed");
>   			ret = -EINVAL;
>   			goto out;
>   		}
>
> -		logical_block_start = extent_map_block_start(em) + (start - em->start);
> -		len = min(len, em->len - (start - em->start));
> -		free_extent_map(em);
> -		em = NULL;
> +		disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
> +		if (disk_bytenr == 0) {
> +			btrfs_warn(fs_info, "swapfile must not have holes");
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +
> +		logical_block_start = disk_bytenr + btrfs_file_extent_offset(leaf, ei);
> +		extent_gen = btrfs_file_extent_generation(leaf, ei);
> +		prev_extent_end = btrfs_file_extent_end(path);
> +
> +		if (prev_extent_end > isize)
> +			len = isize - key.offset;
> +		else
> +			len = btrfs_file_extent_num_bytes(leaf, ei);
>
> -		ret = can_nocow_extent(inode, start, &len, NULL, false, true);
> +		backref_ctx->curr_leaf_bytenr = leaf->start;
> +
> +		/*
> +		 * Don't need the path anymore, release to avoid deadlocks when
> +		 * calling btrfs_is_data_extent_shared() because when joining a
> +		 * transaction it can block waiting for the current one's commit
> +		 * which in turn may be trying to lock the same leaf to flush
> +		 * delayed items for example.
> +		 */
> +		btrfs_release_path(path);
> +
> +		ret = btrfs_is_data_extent_shared(BTRFS_I(inode), disk_bytenr,
> +						  extent_gen, backref_ctx);
>   		if (ret < 0) {
>   			goto out;
> -		} else if (ret) {
> -			ret = 0;
> -		} else {
> +		} else if (ret > 0) {
>   			btrfs_warn(fs_info,
>   				   "swapfile must not be copy-on-write");
>   			ret = -EINVAL;
> @@ -9984,7 +10029,6 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
>
>   		physical_block_start = (map->stripes[0].physical +
>   					(logical_block_start - map->start));
> -		len = min(len, map->chunk_len - (logical_block_start - map->start));
>   		btrfs_free_chunk_map(map);
>   		map = NULL;
>
> @@ -10025,20 +10069,16 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
>   				if (ret)
>   					goto out;
>   			}
> -			bsi.start = start;
> +			bsi.start = key.offset;
>   			bsi.block_start = physical_block_start;
>   			bsi.block_len = len;
>   		}
> -
> -		start += len;
>   	}
>
>   	if (bsi.block_len)
>   		ret = btrfs_add_swap_extent(sis, &bsi);
>
>   out:
> -	if (!IS_ERR_OR_NULL(em))
> -		free_extent_map(em);
>   	if (!IS_ERR_OR_NULL(map))
>   		btrfs_free_chunk_map(map);
>
> @@ -10053,6 +10093,8 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
>
>   out_unlock_mmap:
>   	up_write(&BTRFS_I(inode)->i_mmap_lock);
> +	btrfs_free_backref_share_ctx(backref_ctx);
> +	btrfs_free_path(path);
>   	if (ret)
>   		return ret;
>


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 11/11] btrfs: add assertions and comment about path expectations to btrfs_cross_ref_exist()
  2024-12-11 15:05   ` [PATCH v2 11/11] btrfs: add assertions and comment about path expectations to btrfs_cross_ref_exist() fdmanana
@ 2024-12-13 21:03     ` Qu Wenruo
  2024-12-13 22:29       ` Filipe Manana
  0 siblings, 1 reply; 29+ messages in thread
From: Qu Wenruo @ 2024-12-13 21:03 UTC (permalink / raw)
  To: fdmanana, linux-btrfs



在 2024/12/12 01:35, fdmanana@kernel.org 写道:
> From: Filipe Manana <fdmanana@suse.com>
>
> We should always call check_delayed_ref() with a path having a locked leaf
> from the extent tree where either the extent item is located or where it
> should be located in case it doesn't exist yet (when there's a pending
> unflushed delayed ref to do it), as we need to lock any existing delayed
> ref head while holding such leaf locked in order to avoid races with
> flushing delayed references, which could make us think an extent is not
> shared when it really is.
>
> So add some assertions and a comment about such expectations to
> btrfs_cross_ref_exist(), which is the only caller of check_delayed_ref().
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Just a small nitpick.
> ---
>   fs/btrfs/extent-tree.c | 25 +++++++++++++++++++++++++
>   fs/btrfs/locking.h     |  5 +++++
>   2 files changed, 30 insertions(+)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index bd13059299e1..0f30f53f51b9 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2426,6 +2426,31 @@ int btrfs_cross_ref_exist(struct btrfs_inode *inode, u64 offset,
>   		if (ret && ret != -ENOENT)
>   			goto out;
>
> +		/*
> +		 * The path must have a locked leaf from the extent tree where
> +		 * the extent item for our extent is located, in case it exists,
> +		 * or where it should be located in case it doesn't exist yet
> +		 * because it's new and its delayed ref was not yet flushed.
> +		 * We need to lock the delayed ref head at check_delayed_ref(),
> +		 * if one exists, while holding the leaf locked in order to not
> +		 * race with delayed ref flushing, missing references and
> +		 * incorrectly reporting that the extent is not shared.
> +		 */
> +		if (IS_ENABLED(CONFIG_BTRFS_ASSERT)) {

I think we can just get rid of the CONFIG_BTRFS_ASSERT() check.

All the assert functions have already done the check anyway.
We only skip the btrfs_item_key_to_cpu() call which shouldn't be that
costly.

Thanks,
Qu
> +			struct extent_buffer *leaf = path->nodes[0];
> +
> +			ASSERT(leaf != NULL);
> +			btrfs_assert_tree_read_locked(leaf);
> +
> +			if (ret != -ENOENT) {
> +				struct btrfs_key key;
> +
> +				btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
> +				ASSERT(key.objectid == bytenr);
> +				ASSERT(key.type == BTRFS_EXTENT_ITEM_KEY);
> +			}
> +		}
> +
>   		ret = check_delayed_ref(inode, path, offset, bytenr);
>   	} while (ret == -EAGAIN && !path->nowait);
>
> diff --git a/fs/btrfs/locking.h b/fs/btrfs/locking.h
> index 35036b151bf5..c69e57ff804b 100644
> --- a/fs/btrfs/locking.h
> +++ b/fs/btrfs/locking.h
> @@ -199,8 +199,13 @@ static inline void btrfs_assert_tree_write_locked(struct extent_buffer *eb)
>   {
>   	lockdep_assert_held_write(&eb->lock);
>   }
> +static inline void btrfs_assert_tree_read_locked(struct extent_buffer *eb)
> +{
> +	lockdep_assert_held_read(&eb->lock);
> +}
>   #else
>   static inline void btrfs_assert_tree_write_locked(struct extent_buffer *eb) { }
> +static inline void btrfs_assert_tree_read_locked(struct extent_buffer *eb) { }
>   #endif
>
>   void btrfs_unlock_up_safe(struct btrfs_path *path, int level);


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 00/11] btrfs: fixes around swap file activation and cleanups
  2024-12-11 15:04 ` [PATCH v2 00/11] btrfs: fixes around swap file activation and cleanups fdmanana
                     ` (10 preceding siblings ...)
  2024-12-11 15:05   ` [PATCH v2 11/11] btrfs: add assertions and comment about path expectations to btrfs_cross_ref_exist() fdmanana
@ 2024-12-13 21:07   ` Qu Wenruo
  11 siblings, 0 replies; 29+ messages in thread
From: Qu Wenruo @ 2024-12-13 21:07 UTC (permalink / raw)
  To: fdmanana, linux-btrfs



在 2024/12/12 01:34, fdmanana@kernel.org 写道:
> From: Filipe Manana <fdmanana@suse.com>
>
> There are a couple issues with swap file activation, one being races with
> memory mapped writes, while the other one is a failure due to buggy
> detection of extent sharedness - we can often consider an extent as shared
> when it is not but it used to be. The rest are just some cleanups or
> enhancements.
>
> V2: Updated patch 10/11, it mentioned a no longer existing argument that
>      removed in patch 05/11, the patches were re-ordered at some point.

Reviewed-by: Qu Wenruo <wqu@suse.com>

>
> Filipe Manana (11):
>    btrfs: fix race with memory mapped writes when activating swap file
>    btrfs: fix swap file activation failure due to extents that used to be shared
>    btrfs: allow swap activation to be interruptible
>    btrfs: avoid monopolizing a core when activating a swap file
>    btrfs: remove no longer needed strict argument from can_nocow_extent()

Very happy we can get rid of one of the double bool parameters.

Everything I see something like func(x, y, true, false); it almost
drives me crazy.

Thanks,
Qu
>    btrfs: remove the snapshot check from check_committed_ref()
>    btrfs: avoid redundant call to get inline ref type at check_committed_ref()
>    btrfs: simplify return logic at check_committed_ref()
>    btrfs: simplify arguments for btrfs_cross_ref_exist()
>    btrfs: add function comment for check_committed_ref()
>    btrfs: add assertions and comment about path expectations to btrfs_cross_ref_exist()
>
>   fs/btrfs/btrfs_inode.h |   2 +-
>   fs/btrfs/direct-io.c   |   3 +-
>   fs/btrfs/extent-tree.c | 128 +++++++++++++++++++++++++-----------
>   fs/btrfs/extent-tree.h |   3 +-
>   fs/btrfs/file.c        |   2 +-
>   fs/btrfs/inode.c       | 146 +++++++++++++++++++++++++++++------------
>   fs/btrfs/locking.h     |   5 ++
>   7 files changed, 203 insertions(+), 86 deletions(-)
>


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 11/11] btrfs: add assertions and comment about path expectations to btrfs_cross_ref_exist()
  2024-12-13 21:03     ` Qu Wenruo
@ 2024-12-13 22:29       ` Filipe Manana
  0 siblings, 0 replies; 29+ messages in thread
From: Filipe Manana @ 2024-12-13 22:29 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Fri, Dec 13, 2024 at 9:04 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> 在 2024/12/12 01:35, fdmanana@kernel.org 写道:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > We should always call check_delayed_ref() with a path having a locked leaf
> > from the extent tree where either the extent item is located or where it
> > should be located in case it doesn't exist yet (when there's a pending
> > unflushed delayed ref to do it), as we need to lock any existing delayed
> > ref head while holding such leaf locked in order to avoid races with
> > flushing delayed references, which could make us think an extent is not
> > shared when it really is.
> >
> > So add some assertions and a comment about such expectations to
> > btrfs_cross_ref_exist(), which is the only caller of check_delayed_ref().
> >
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
>
> Just a small nitpick.
> > ---
> >   fs/btrfs/extent-tree.c | 25 +++++++++++++++++++++++++
> >   fs/btrfs/locking.h     |  5 +++++
> >   2 files changed, 30 insertions(+)
> >
> > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > index bd13059299e1..0f30f53f51b9 100644
> > --- a/fs/btrfs/extent-tree.c
> > +++ b/fs/btrfs/extent-tree.c
> > @@ -2426,6 +2426,31 @@ int btrfs_cross_ref_exist(struct btrfs_inode *inode, u64 offset,
> >               if (ret && ret != -ENOENT)
> >                       goto out;
> >
> > +             /*
> > +              * The path must have a locked leaf from the extent tree where
> > +              * the extent item for our extent is located, in case it exists,
> > +              * or where it should be located in case it doesn't exist yet
> > +              * because it's new and its delayed ref was not yet flushed.
> > +              * We need to lock the delayed ref head at check_delayed_ref(),
> > +              * if one exists, while holding the leaf locked in order to not
> > +              * race with delayed ref flushing, missing references and
> > +              * incorrectly reporting that the extent is not shared.
> > +              */
> > +             if (IS_ENABLED(CONFIG_BTRFS_ASSERT)) {
>
> I think we can just get rid of the CONFIG_BTRFS_ASSERT() check.
>
> All the assert functions have already done the check anyway.
> We only skip the btrfs_item_key_to_cpu() call which shouldn't be that
> costly.

It's done this way because otherwise it may trigger warnings from
compilers or static analysis tools if assertions are disabled.
As in that case we write to the key structure but never read from it.

Thanks.

>
> Thanks,
> Qu
> > +                     struct extent_buffer *leaf = path->nodes[0];
> > +
> > +                     ASSERT(leaf != NULL);
> > +                     btrfs_assert_tree_read_locked(leaf);
> > +
> > +                     if (ret != -ENOENT) {
> > +                             struct btrfs_key key;
> > +
> > +                             btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
> > +                             ASSERT(key.objectid == bytenr);
> > +                             ASSERT(key.type == BTRFS_EXTENT_ITEM_KEY);
> > +                     }
> > +             }
> > +
> >               ret = check_delayed_ref(inode, path, offset, bytenr);
> >       } while (ret == -EAGAIN && !path->nowait);
> >
> > diff --git a/fs/btrfs/locking.h b/fs/btrfs/locking.h
> > index 35036b151bf5..c69e57ff804b 100644
> > --- a/fs/btrfs/locking.h
> > +++ b/fs/btrfs/locking.h
> > @@ -199,8 +199,13 @@ static inline void btrfs_assert_tree_write_locked(struct extent_buffer *eb)
> >   {
> >       lockdep_assert_held_write(&eb->lock);
> >   }
> > +static inline void btrfs_assert_tree_read_locked(struct extent_buffer *eb)
> > +{
> > +     lockdep_assert_held_read(&eb->lock);
> > +}
> >   #else
> >   static inline void btrfs_assert_tree_write_locked(struct extent_buffer *eb) { }
> > +static inline void btrfs_assert_tree_read_locked(struct extent_buffer *eb) { }
> >   #endif
> >
> >   void btrfs_unlock_up_safe(struct btrfs_path *path, int level);
>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 02/11] btrfs: fix swap file activation failure due to extents that used to be shared
  2024-12-13 20:52     ` Qu Wenruo
@ 2024-12-13 22:36       ` Filipe Manana
  0 siblings, 0 replies; 29+ messages in thread
From: Filipe Manana @ 2024-12-13 22:36 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Fri, Dec 13, 2024 at 8:52 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> 在 2024/12/12 01:34, fdmanana@kernel.org 写道:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > When activating a swap file, to determine if an extent is shared we use
> > can_nocow_extent(), which ends up at btrfs_cross_ref_exist(). That helper
> > is meant to be quick because it's used in the NOCOW write path, when
> > flushing delalloc and when doing a direct IO write, however it does return
> > some false positives, meaning it may indicate that an extent is shared
> > even if it's no longer the case. For the write path this is fine, we just
> > do a unnecessary COW operation instead of doing a more rigorous check
> > which would be too heavy (calling btrfs_is_data_extent_shared()).
> >
> > However when activating a swap file, the false positives simply result
> > in a failure, which is confusing for users/applications. One particular
> > case where this happens is when a data extent only has 1 reference but
> > that reference is not inlined in the extent item located in the extent
> > tree - this happens when we create more than 33 references for an extent
> > and then delete those 33 references plus every other non-inline reference
> > except one. The function check_committed_ref() assumes that if the size
> > of an extent item doesn't match the size of struct btrfs_extent_item
> > plus the size of an inline reference (plus an owner reference in case
> > simple quotas are enabled), then the extent is shared - that is not the
> > case however, we can have a single reference but it's not inlined - the
> > reason we do this is to be fast and avoid inspecting non-inline references
> > which may be located in another leaf of the extent tree, slowing down
> > write paths.
> >
> > The following test script reproduces the bug:
> >
> >     $ cat test.sh
> >     #!/bin/bash
> >
> >     DEV=/dev/sdi
> >     MNT=/mnt/sdi
> >     NUM_CLONES=50
> >
> >     umount $DEV &> /dev/null
> >
> >     run_test()
> >     {
> >          local sync_after_add_reflinks=$1
> >          local sync_after_remove_reflinks=$2
> >
> >          mkfs.btrfs -f $DEV > /dev/null
> >          #mkfs.xfs -f $DEV > /dev/null
> >          mount $DEV $MNT
> >
> >          touch $MNT/foo
> >          chmod 0600 $MNT/foo
> >       # On btrfs the file must be NOCOW.
> >          chattr +C $MNT/foo &> /dev/null
> >          xfs_io -s -c "pwrite -b 1M 0 1M" $MNT/foo
> >          mkswap $MNT/foo
> >
> >          for ((i = 1; i <= $NUM_CLONES; i++)); do
> >              touch $MNT/foo_clone_$i
> >              chmod 0600 $MNT/foo_clone_$i
> >              # On btrfs the file must be NOCOW.
> >              chattr +C $MNT/foo_clone_$i &> /dev/null
> >              cp --reflink=always $MNT/foo $MNT/foo_clone_$i
> >          done
> >
> >          if [ $sync_after_add_reflinks -ne 0 ]; then
> >              # Flush delayed refs and commit current transaction.
> >              sync -f $MNT
> >          fi
> >
> >          # Remove the original file and all clones except the last.
> >          rm -f $MNT/foo
> >          for ((i = 1; i < $NUM_CLONES; i++)); do
> >              rm -f $MNT/foo_clone_$i
> >          done
> >
> >          if [ $sync_after_remove_reflinks -ne 0 ]; then
> >              # Flush delayed refs and commit current transaction.
> >              sync -f $MNT
> >          fi
> >
> >          # Now use the last clone as a swap file. It should work since
> >          # its extent are not shared anymore.
> >          swapon $MNT/foo_clone_${NUM_CLONES}
> >          swapoff $MNT/foo_clone_${NUM_CLONES}
> >
> >          umount $MNT
> >     }
> >
> >     echo -e "\nTest without sync after creating and removing clones"
> >     run_test 0 0
> >
> >     echo -e "\nTest with sync after creating clones"
> >     run_test 1 0
> >
> >     echo -e "\nTest with sync after removing clones"
> >     run_test 0 1
> >
> >     echo -e "\nTest with sync after creating and removing clones"
> >     run_test 1 1
> >
> > Running the test:
> >
> >     $ ./test.sh
> >     Test without sync after creating and removing clones
> >     wrote 1048576/1048576 bytes at offset 0
> >     1 MiB, 1 ops; 0.0017 sec (556.793 MiB/sec and 556.7929 ops/sec)
> >     Setting up swapspace version 1, size = 1020 KiB (1044480 bytes)
> >     no label, UUID=a6b9c29e-5ef4-4689-a8ac-bc199c750f02
> >     swapon: /mnt/sdi/foo_clone_50: swapon failed: Invalid argument
> >     swapoff: /mnt/sdi/foo_clone_50: swapoff failed: Invalid argument
> >
> >     Test with sync after creating clones
> >     wrote 1048576/1048576 bytes at offset 0
> >     1 MiB, 1 ops; 0.0036 sec (271.739 MiB/sec and 271.7391 ops/sec)
> >     Setting up swapspace version 1, size = 1020 KiB (1044480 bytes)
> >     no label, UUID=5e9008d6-1f7a-4948-a1b4-3f30aba20a33
> >     swapon: /mnt/sdi/foo_clone_50: swapon failed: Invalid argument
> >     swapoff: /mnt/sdi/foo_clone_50: swapoff failed: Invalid argument
> >
> >     Test with sync after removing clones
> >     wrote 1048576/1048576 bytes at offset 0
> >     1 MiB, 1 ops; 0.0103 sec (96.665 MiB/sec and 96.6651 ops/sec)
> >     Setting up swapspace version 1, size = 1020 KiB (1044480 bytes)
> >     no label, UUID=916c2740-fa9f-4385-9f06-29c3f89e4764
> >
> >     Test with sync after creating and removing clones
> >     wrote 1048576/1048576 bytes at offset 0
> >     1 MiB, 1 ops; 0.0031 sec (314.268 MiB/sec and 314.2678 ops/sec)
> >     Setting up swapspace version 1, size = 1020 KiB (1044480 bytes)
> >     no label, UUID=06aab1dd-4d90-49c0-bd9f-3a8db4e2f912
> >     swapon: /mnt/sdi/foo_clone_50: swapon failed: Invalid argument
> >     swapoff: /mnt/sdi/foo_clone_50: swapoff failed: Invalid argument
> >
> > Fix this by reworking btrfs_swap_activate() to instead of using extent
> > maps and checking for shared extents with can_nocow_extent(), iterate
> > over the inode's file extent items and use the accurate
> > btrfs_is_data_extent_shared().
> >
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
>
> The patch looks good to me.
>
> Reviewed-by: Qu Wenruo <wqu@suse.com>
>
>
> Although there is no mention about why we get rid of btrfs_get_extent().
>
> I guess it's to avoid caching those extent maps to save space? Just like
> what we did in fiemap.

It's not so much to avoid loading extent maps unnecessarily but
because of correctness.
Extent maps can be merged, and we need to find out individual extents
because some of the extents that were merged may be shared while others
are not and vice-versa.

Otherwise it would be a bug, similar to what we had with fiemap fixed
in ac3c0d36a2a2f7a8f9778faef3f2639f5bf29d44, for which there's test
case generic/702.

That's why we iterate over extent items and not extent maps.

Thanks.



>
> Thanks,
> Qu
>
> > ---
> >   fs/btrfs/inode.c | 96 ++++++++++++++++++++++++++++++++++--------------
> >   1 file changed, 69 insertions(+), 27 deletions(-)
> >
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index 926d82fbdbae..7ddb8a01b60f 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -9799,15 +9799,16 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
> >       struct btrfs_fs_info *fs_info = root->fs_info;
> >       struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> >       struct extent_state *cached_state = NULL;
> > -     struct extent_map *em = NULL;
> >       struct btrfs_chunk_map *map = NULL;
> >       struct btrfs_device *device = NULL;
> >       struct btrfs_swap_info bsi = {
> >               .lowest_ppage = (sector_t)-1ULL,
> >       };
> > +     struct btrfs_backref_share_check_ctx *backref_ctx = NULL;
> > +     struct btrfs_path *path = NULL;
> >       int ret = 0;
> >       u64 isize;
> > -     u64 start;
> > +     u64 prev_extent_end = 0;
> >
> >       /*
> >        * Acquire the inode's mmap lock to prevent races with memory mapped
> > @@ -9846,6 +9847,13 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
> >               goto out_unlock_mmap;
> >       }
> >
> > +     path = btrfs_alloc_path();
> > +     backref_ctx = btrfs_alloc_backref_share_check_ctx();
> > +     if (!path || !backref_ctx) {
> > +             ret = -ENOMEM;
> > +             goto out_unlock_mmap;
> > +     }
> > +
> >       /*
> >        * Balance or device remove/replace/resize can move stuff around from
> >        * under us. The exclop protection makes sure they aren't running/won't
> > @@ -9904,24 +9912,39 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
> >       isize = ALIGN_DOWN(inode->i_size, fs_info->sectorsize);
> >
> >       lock_extent(io_tree, 0, isize - 1, &cached_state);
> > -     start = 0;
> > -     while (start < isize) {
> > -             u64 logical_block_start, physical_block_start;
> > +     while (prev_extent_end < isize) {
> > +             struct btrfs_key key;
> > +             struct extent_buffer *leaf;
> > +             struct btrfs_file_extent_item *ei;
> >               struct btrfs_block_group *bg;
> > -             u64 len = isize - start;
> > +             u64 logical_block_start;
> > +             u64 physical_block_start;
> > +             u64 extent_gen;
> > +             u64 disk_bytenr;
> > +             u64 len;
> >
> > -             em = btrfs_get_extent(BTRFS_I(inode), NULL, start, len);
> > -             if (IS_ERR(em)) {
> > -                     ret = PTR_ERR(em);
> > +             key.objectid = btrfs_ino(BTRFS_I(inode));
> > +             key.type = BTRFS_EXTENT_DATA_KEY;
> > +             key.offset = prev_extent_end;
> > +
> > +             ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
> > +             if (ret < 0)
> >                       goto out;
> > -             }
> >
> > -             if (em->disk_bytenr == EXTENT_MAP_HOLE) {
> > +             /*
> > +              * If key not found it means we have an implicit hole (NO_HOLES
> > +              * is enabled).
> > +              */
> > +             if (ret > 0) {
> >                       btrfs_warn(fs_info, "swapfile must not have holes");
> >                       ret = -EINVAL;
> >                       goto out;
> >               }
> > -             if (em->disk_bytenr == EXTENT_MAP_INLINE) {
> > +
> > +             leaf = path->nodes[0];
> > +             ei = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_file_extent_item);
> > +
> > +             if (btrfs_file_extent_type(leaf, ei) == BTRFS_FILE_EXTENT_INLINE) {
> >                       /*
> >                        * It's unlikely we'll ever actually find ourselves
> >                        * here, as a file small enough to fit inline won't be
> > @@ -9933,23 +9956,45 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
> >                       ret = -EINVAL;
> >                       goto out;
> >               }
> > -             if (extent_map_is_compressed(em)) {
> > +
> > +             if (btrfs_file_extent_compression(leaf, ei) != BTRFS_COMPRESS_NONE) {
> >                       btrfs_warn(fs_info, "swapfile must not be compressed");
> >                       ret = -EINVAL;
> >                       goto out;
> >               }
> >
> > -             logical_block_start = extent_map_block_start(em) + (start - em->start);
> > -             len = min(len, em->len - (start - em->start));
> > -             free_extent_map(em);
> > -             em = NULL;
> > +             disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
> > +             if (disk_bytenr == 0) {
> > +                     btrfs_warn(fs_info, "swapfile must not have holes");
> > +                     ret = -EINVAL;
> > +                     goto out;
> > +             }
> > +
> > +             logical_block_start = disk_bytenr + btrfs_file_extent_offset(leaf, ei);
> > +             extent_gen = btrfs_file_extent_generation(leaf, ei);
> > +             prev_extent_end = btrfs_file_extent_end(path);
> > +
> > +             if (prev_extent_end > isize)
> > +                     len = isize - key.offset;
> > +             else
> > +                     len = btrfs_file_extent_num_bytes(leaf, ei);
> >
> > -             ret = can_nocow_extent(inode, start, &len, NULL, false, true);
> > +             backref_ctx->curr_leaf_bytenr = leaf->start;
> > +
> > +             /*
> > +              * Don't need the path anymore, release to avoid deadlocks when
> > +              * calling btrfs_is_data_extent_shared() because when joining a
> > +              * transaction it can block waiting for the current one's commit
> > +              * which in turn may be trying to lock the same leaf to flush
> > +              * delayed items for example.
> > +              */
> > +             btrfs_release_path(path);
> > +
> > +             ret = btrfs_is_data_extent_shared(BTRFS_I(inode), disk_bytenr,
> > +                                               extent_gen, backref_ctx);
> >               if (ret < 0) {
> >                       goto out;
> > -             } else if (ret) {
> > -                     ret = 0;
> > -             } else {
> > +             } else if (ret > 0) {
> >                       btrfs_warn(fs_info,
> >                                  "swapfile must not be copy-on-write");
> >                       ret = -EINVAL;
> > @@ -9984,7 +10029,6 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
> >
> >               physical_block_start = (map->stripes[0].physical +
> >                                       (logical_block_start - map->start));
> > -             len = min(len, map->chunk_len - (logical_block_start - map->start));
> >               btrfs_free_chunk_map(map);
> >               map = NULL;
> >
> > @@ -10025,20 +10069,16 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
> >                               if (ret)
> >                                       goto out;
> >                       }
> > -                     bsi.start = start;
> > +                     bsi.start = key.offset;
> >                       bsi.block_start = physical_block_start;
> >                       bsi.block_len = len;
> >               }
> > -
> > -             start += len;
> >       }
> >
> >       if (bsi.block_len)
> >               ret = btrfs_add_swap_extent(sis, &bsi);
> >
> >   out:
> > -     if (!IS_ERR_OR_NULL(em))
> > -             free_extent_map(em);
> >       if (!IS_ERR_OR_NULL(map))
> >               btrfs_free_chunk_map(map);
> >
> > @@ -10053,6 +10093,8 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
> >
> >   out_unlock_mmap:
> >       up_write(&BTRFS_I(inode)->i_mmap_lock);
> > +     btrfs_free_backref_share_ctx(backref_ctx);
> > +     btrfs_free_path(path);
> >       if (ret)
> >               return ret;
> >
>

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2024-12-13 22:37 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-12-11 14:53 [PATCH 00/11] btrfs: fixes around swap file activation and cleanups fdmanana
2024-12-11 14:53 ` [PATCH 01/11] btrfs: fix race with memory mapped writes when activating swap file fdmanana
2024-12-11 14:53 ` [PATCH 02/11] btrfs: fix swap file activation failure due to extents that used to be shared fdmanana
2024-12-11 14:53 ` [PATCH 03/11] btrfs: allow swap activation to be interruptible fdmanana
2024-12-11 14:53 ` [PATCH 04/11] btrfs: avoid monopolizing a core when activating a swap file fdmanana
2024-12-11 14:53 ` [PATCH 05/11] btrfs: remove no longer needed strict argument from can_nocow_extent() fdmanana
2024-12-11 14:53 ` [PATCH 06/11] btrfs: remove the snapshot check from check_committed_ref() fdmanana
2024-12-11 14:53 ` [PATCH 07/11] btrfs: avoid redundant call to get inline ref type at check_committed_ref() fdmanana
2024-12-11 14:53 ` [PATCH 08/11] btrfs: simplify return logic " fdmanana
2024-12-11 14:53 ` [PATCH 09/11] btrfs: simplify arguments for btrfs_cross_ref_exist() fdmanana
2024-12-11 14:53 ` [PATCH 10/11] btrfs: add function comment for check_committed_ref() fdmanana
2024-12-11 14:53 ` [PATCH 11/11] btrfs: add assertions and comment about path expectations to btrfs_cross_ref_exist() fdmanana
2024-12-11 15:04 ` [PATCH v2 00/11] btrfs: fixes around swap file activation and cleanups fdmanana
2024-12-11 15:04   ` [PATCH v2 01/11] btrfs: fix race with memory mapped writes when activating swap file fdmanana
2024-12-11 15:04   ` [PATCH v2 02/11] btrfs: fix swap file activation failure due to extents that used to be shared fdmanana
2024-12-13 20:52     ` Qu Wenruo
2024-12-13 22:36       ` Filipe Manana
2024-12-11 15:05   ` [PATCH v2 03/11] btrfs: allow swap activation to be interruptible fdmanana
2024-12-11 15:05   ` [PATCH v2 04/11] btrfs: avoid monopolizing a core when activating a swap file fdmanana
2024-12-11 15:05   ` [PATCH v2 05/11] btrfs: remove no longer needed strict argument from can_nocow_extent() fdmanana
2024-12-11 15:05   ` [PATCH v2 06/11] btrfs: remove the snapshot check from check_committed_ref() fdmanana
2024-12-11 15:05   ` [PATCH v2 07/11] btrfs: avoid redundant call to get inline ref type at check_committed_ref() fdmanana
2024-12-11 15:05   ` [PATCH v2 08/11] btrfs: simplify return logic " fdmanana
2024-12-11 15:05   ` [PATCH v2 09/11] btrfs: simplify arguments for btrfs_cross_ref_exist() fdmanana
2024-12-11 15:05   ` [PATCH v2 10/11] btrfs: add function comment for check_committed_ref() fdmanana
2024-12-11 15:05   ` [PATCH v2 11/11] btrfs: add assertions and comment about path expectations to btrfs_cross_ref_exist() fdmanana
2024-12-13 21:03     ` Qu Wenruo
2024-12-13 22:29       ` Filipe Manana
2024-12-13 21:07   ` [PATCH v2 00/11] btrfs: fixes around swap file activation and cleanups Qu Wenruo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox