public inbox for linux-btrfs@vger.kernel.org
* [RFC PATCH 00/10] Remap tree
@ 2025-05-15 16:36 Mark Harmstone
  2025-05-15 16:36 ` [RFC PATCH 01/10] btrfs: add definitions and constants for remap-tree Mark Harmstone
                   ` (9 more replies)
  0 siblings, 10 replies; 20+ messages in thread
From: Mark Harmstone @ 2025-05-15 16:36 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Mark Harmstone

Hi all,

This is an RFC for an experimental remap-tree incompat feature, which reworks
how we perform relocations. Our present system involves following backrefs to
rewrite addresses in the metadata trees, which can be slow and has knock-on
effects such as breaking nocow files. Instead, this adds a remap tree to act as
another layer of indirection: if a block group has the REMAPPED bit set, all
I/O to addresses nominally within it gets translated via a lookup in the remap
tree.

See https://github.com/btrfs/btrfs-todo/issues/54 for Josef's original design,
which I've elaborated on, and more on the rationale. To test this you will also
need a patched btrfs-progs, for `mkfs.btrfs -O remap-tree`:
https://github.com/maharmstone/btrfs-progs/tree/remap-tree. See also
https://github.com/maharmstone/btrfs/blob/remap-tree/btrfs-dump.pl for a dumper
script that's remap-aware.

The BTRFS_FEATURE_INCOMPAT_REMAP_TREE flag implies the following changes:

* There's a new remap tree, which is created empty

* The data reloc tree isn't created, as it's no longer needed (data relocation
is no longer done via snapshots)

* There's a new metadata chunk type, REMAP, which consists solely of the remap
tree. This is to avoid bootstrapping issues: REMAP chunks, like SYSTEM, are
still relocated the old way (i.e. by COWing every leaf). We can't put the remap
tree in SYSTEM: because the chunk items are put in the superblock, it would
limit our available space

* The top of the remap tree is recorded in the superblock

* Two new fields are added to struct btrfs_block_group_item: le64 remap_bytes
and le32 identity_remap_count. remap_bytes records how much data in a block
group has been relocated from another block group. identity_remap_count records
how many identity remaps exist for this block group (see later)

* If a block group has the REMAPPED flag set and identity_remap_count == 0,
its chunk will have 0 stripes and 0 dev extents: everything nominally within
it is actually located elsewhere

* All data and metadata addresses are now translated addresses, for which we
need to consult the remap tree to find the underlying addresses before doing
I/O, if they live in a remapped BG. The exception is the free-space tree, which
is entirely underlying addresses (the free-space cache v1 isn't supported here).
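
As a rough userspace sketch of what the translation layer does (the types and
names here are illustrative, not the kernel implementation):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model: non-overlapping remap ranges, sorted by nominal start. */
enum remap_type { IDENTITY_REMAP, REMAP };

struct remap_range {
	uint64_t start;		/* translated (nominal) address */
	uint64_t len;
	enum remap_type type;
	uint64_t target;	/* underlying address; unused for identity */
};

/* Translate one nominal address into the address used for actual I/O. */
static uint64_t translate_remap(const struct remap_range *tree, size_t n,
				uint64_t addr)
{
	for (size_t i = 0; i < n; i++) {
		const struct remap_range *r = &tree[i];

		if (addr >= r->start && addr < r->start + r->len)
			return r->type == IDENTITY_REMAP
				? addr
				: r->target + (addr - r->start);
	}
	return addr;	/* address not in a REMAPPED block group */
}
```

An identity remap is a no-op translation; a real remap offsets the address
into another block group.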

REMAP TREE
----------

The remap tree consists of three types of items: identity remaps, remaps, and
remap backrefs. Each represents a non-overlapping range. Remaps are orthogonal
to extents: an extent might be split into several ranges in the remap tree, or
multiple consecutive extents might be remapped together.

Identity remaps represent a range for which we don't need to do a translation,
i.e. the REMAPPED flag has been set for the BG but we haven't yet done a
relocation.

Remaps have a u64 payload which gives the underlying address to use for the
start of this range, i.e. the non-REMAPPED block group we should use for I/O.

Remap backrefs are a reverse index for the remaps, with their objectid being
the underlying address and their u64 payload being the translated address.
We need backrefs because when relocating a block group that contains existing
remaps, these need to be moved first (we don't want to have to consult an
ever-increasing chain of BGs).
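
A minimal model of why the backrefs exist (the struct and helper here are
illustrative, not the on-disk format):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative only: a remap maps translated -> underlying; its backref
 * is the same pair indexed by the underlying address.
 */
struct remap_pair {
	uint64_t translated;	/* remap item objectid */
	uint64_t underlying;	/* remap item payload */
};

/*
 * When relocating [bg_start, bg_start + bg_len), the backrefs let us find
 * every remap whose underlying bytes live there, so they can be moved
 * first instead of chaining through yet another block group.
 */
static int count_remaps_into_bg(const struct remap_pair *remaps, int n,
				uint64_t bg_start, uint64_t bg_len)
{
	int count = 0;

	for (int i = 0; i < n; i++)
		if (remaps[i].underlying >= bg_start &&
		    remaps[i].underlying < bg_start + bg_len)
			count++;
	return count;
}
```

Without the reverse index, finding these remaps would mean scanning every
remap item in the tree.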

RELOCATION
----------

For SYSTEM and REMAP block groups, relocation is as it is at the moment (mark
the BG readonly, COW every leaf).

For DATA and METADATA block groups, we do the following:

* If remap_bytes > 0, search the remap tree for the remap backrefs that are
physically located in this block group. Move the data on disk, munge the remap
and free-space trees to reflect this, and reduce remap_bytes. Loop until
remap_bytes is 0.

* Search the free-space tree for holes, convert these into identity remaps
within the remap tree, set identity_remap_count in the block group item, and
set the REMAPPED flag on the block group item and the chunk. The REMAPPED
flag prevents new allocations from being made from this block group.

* Walk through the remap tree looking for the identity remaps that we have
created. For each one, try to reserve the same amount of space in another block
group. Read the data into memory, and write an exact copy into the new
location. Increase remap_bytes in the destination block group, and reduce
identity_remap_count in the source block group if we can move the whole thing.
Add the space back to the free-space tree in the source BG, and remove the
space in the free-space tree for the destination BG.

* When identity_remap_count == 0, the block group has been fully remapped, and
now exists solely for the purposes of remapping. Set num_stripes to 0 in the
chunk item, remove its stripes, and remove the entries in the dev extent tree.

* If a block group has the REMAPPED flag set, identity_remap_count == 0, and
remap_bytes == 0, it is now empty. The block group item, chunk item, and
entries in the free-space tree can be removed.
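
The sequence above can be sketched as a toy state machine (field names mirror
the block group item, but none of this is kernel code, and `holes` stands in
for the result of the free-space-tree scan):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of the DATA/METADATA relocation sequence. */
struct bg_state {
	bool remapped;			/* REMAPPED flag set */
	uint32_t identity_remap_count;
	uint64_t remap_bytes;		/* bytes remapped *into* this BG */
	uint32_t num_stripes;
};

/* One step of relocation; holes = free-space holes found by the scan. */
static void relocate_step(struct bg_state *bg, uint32_t holes)
{
	if (bg->remap_bytes > 0) {
		/* move existing remaps out first, fixing up the trees */
		bg->remap_bytes = 0;
		return;
	}
	if (!bg->remapped) {
		/* convert free-space holes into identity remaps */
		bg->identity_remap_count = holes;
		bg->remapped = true;	/* no new allocations from now on */
		return;
	}
	if (bg->identity_remap_count > 0) {
		/* copy one range elsewhere: identity remap -> real remap */
		bg->identity_remap_count--;
		return;
	}
	/* fully remapped: drop the stripes and dev extents */
	bg->num_stripes = 0;
}
```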

KNOWN ISSUES
------------

* Still a few problems with some fstests: btrfs/156, btrfs/170,
btrfs/177, btrfs/226, btrfs/250, btrfs/252.

* There's a race when it comes to nodatacow writes. I think we ought to be
calling btrfs_inc_nocow_writers(), but that COWs rather than blocking.

* It can be reluctant to drop entries for empty remapped block groups. This
doesn't waste a substantial amount of space on disk, as there are no dev
extents, but it does pollute the block group and chunk trees. Similarly, we
ought to be removing any entries in the free-space tree for fully remapped
block groups: no new allocations can be made within them, and they no longer
correspond to a location on disk.

* At the moment we're allocating 1GB for the REMAP chunks, which is probably
overkill. One 16KB leaf of the remap tree can cover ~250GB in the best case
and ~1MB in the worst case. Possibly 32MB is the sweet spot, as for SYSTEM.
Allocating more REMAP chunks isn't a problem, as unlike SYSTEM they don't go in
the superblock.
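
Back-of-the-envelope numbers behind those per-leaf coverage figures, assuming
a 101-byte leaf header, a 25-byte item header, an 8-byte remap payload, and
one backref per remap (all approximations, not on-disk constants pulled from
the code):

```c
#include <assert.h>
#include <stdint.h>

#define LEAF_SIZE	(16 * 1024)
#define LEAF_HEADER	101	/* approx sizeof(struct btrfs_header) */
#define ITEM_HEADER	25	/* approx sizeof(struct btrfs_item) */
#define REMAP_PAYLOAD	8	/* one __le64 */

/* Each remap costs an item plus its backref item. */
static uint64_t remaps_per_leaf(void)
{
	return (LEAF_SIZE - LEAF_HEADER) /
	       (2 * (ITEM_HEADER + REMAP_PAYLOAD));
}

/* Best case: every remap spans a whole 1GB block group. */
static uint64_t best_coverage(void)
{
	return remaps_per_leaf() * (1ULL << 30);
}

/* Worst case: every remap spans a single 4KB sector. */
static uint64_t worst_coverage(void)
{
	return remaps_per_leaf() * 4096;
}
```

That works out to roughly 246 remaps per leaf, i.e. ~246GB covered in the
best case and just under 1MB in the worst.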

* There's a spurious lockdep warning when doing remapped metadata reads, as
we're locking an extent buffer within the remap tree while already holding
another extent buffer lock. We either need to disable the warning for this, or
change the code to use search_commit_root. We can't add another level to the
lockdep hierarchy as we're already maxed out at 8.

* I think the lookup code in btrfs_translate_remap() probably needs to be
refined. Possibly we can do without the call to btrfs_prev_leaf().

* At present we scan the free-space holes and create the identity remaps all in
the same transaction. For the worst-case scenario of a 1GB block group with
every other sector allocated, relocation takes ~30 seconds on my system, on an
NVMe drive with no other load. Josef has suggested that we split this into
multiple transactions, setting something like
BTRFS_BLOCK_GROUP_ADDING_IDENTITY_REMAPS at the start, and discarding any
progress on mount if we find a BG with this flag set, which makes sense to me.
The code is currently gated behind CONFIG_BTRFS_EXPERIMENTAL. My preference
would be that, like with the RAID stripe tree, we accept that there may be
minor on-disk format changes until it can be moved out of experimental - i.e.
we treat this as a (near-)future improvement rather than a blocker.

* Josef has also suggested that we don't log metadata items for the remap tree
in the extent tree, in anticipation of later omitting all COW-only metadata
items from the extent tree. My view is that treating the remap tree as a
special case would be more trouble than it's worth, both here and in progs, and
omitting all COW-only metadata items should be relegated to a later incompat
change that depends on this one.

* There's still a lot of userspace work to be done: making sure that all space
reporting etc. tools are okay, adding a comprehensive series of tests to btrfs
check, allowing toggling this incompat flag on and off in btrfstune.

Thanks

Mark

Mark Harmstone (10):
  btrfs: add definitions and constants for remap-tree
  btrfs: add REMAP chunk type
  btrfs: allow remapped chunks to have zero stripes
  btrfs: add extended version of struct block_group_item
  btrfs: allow mounting filesystems with remap-tree incompat flag
  btrfs: redirect I/O for remapped block groups
  btrfs: handle deletions from remapped block group
  btrfs: handle setting up relocation of block group with remap-tree
  btrfs: move existing remaps before relocating block group
  btrfs: replace identity maps with actual remaps when doing relocations

 fs/btrfs/Kconfig                |    2 +
 fs/btrfs/accessors.h            |   29 +
 fs/btrfs/block-group.c          |  181 ++-
 fs/btrfs/block-group.h          |   15 +-
 fs/btrfs/block-rsv.c            |    8 +
 fs/btrfs/block-rsv.h            |    1 +
 fs/btrfs/ctree.c                |   11 +-
 fs/btrfs/ctree.h                |    3 +
 fs/btrfs/disk-io.c              |   88 +-
 fs/btrfs/extent-tree.c          |   38 +-
 fs/btrfs/free-space-tree.c      |    4 +-
 fs/btrfs/free-space-tree.h      |    5 +-
 fs/btrfs/fs.h                   |    7 +-
 fs/btrfs/relocation.c           | 2065 ++++++++++++++++++++++++++++++-
 fs/btrfs/relocation.h           |    8 +-
 fs/btrfs/space-info.c           |   20 +-
 fs/btrfs/sysfs.c                |    4 +
 fs/btrfs/transaction.c          |    7 +
 fs/btrfs/tree-checker.c         |   37 +-
 fs/btrfs/volumes.c              |  106 +-
 fs/btrfs/volumes.h              |   17 +-
 include/uapi/linux/btrfs.h      |    1 +
 include/uapi/linux/btrfs_tree.h |   29 +-
 23 files changed, 2509 insertions(+), 177 deletions(-)

-- 
2.49.0


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [RFC PATCH 01/10] btrfs: add definitions and constants for remap-tree
  2025-05-15 16:36 [RFC PATCH 00/10] Remap tree Mark Harmstone
@ 2025-05-15 16:36 ` Mark Harmstone
  2025-05-21 12:43   ` Johannes Thumshirn
  2025-05-15 16:36 ` [RFC PATCH 02/10] btrfs: add REMAP chunk type Mark Harmstone
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 20+ messages in thread
From: Mark Harmstone @ 2025-05-15 16:36 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Mark Harmstone

Add an incompat flag for the new remap-tree feature, and the constants
and definitions needed to support it.

Signed-off-by: Mark Harmstone <maharmstone@fb.com>
---
 fs/btrfs/accessors.h            |  3 +++
 fs/btrfs/sysfs.c                |  2 ++
 fs/btrfs/tree-checker.c         |  6 ++++--
 fs/btrfs/volumes.c              |  1 +
 include/uapi/linux/btrfs.h      |  1 +
 include/uapi/linux/btrfs_tree.h | 12 ++++++++++++
 6 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/accessors.h b/fs/btrfs/accessors.h
index 15ea6348800b..5f5eda8d6f9e 100644
--- a/fs/btrfs/accessors.h
+++ b/fs/btrfs/accessors.h
@@ -1046,6 +1046,9 @@ BTRFS_SETGET_STACK_FUNCS(stack_verity_descriptor_encryption,
 BTRFS_SETGET_STACK_FUNCS(stack_verity_descriptor_size,
 			 struct btrfs_verity_descriptor_item, size, 64);
 
+BTRFS_SETGET_FUNCS(remap_address, struct btrfs_remap, address, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_remap_address, struct btrfs_remap, address, 64);
+
 /* Cast into the data area of the leaf. */
 #define btrfs_item_ptr(leaf, slot, type)				\
 	((type *)(btrfs_item_nr_offset(leaf, 0) + btrfs_item_offset(leaf, slot)))
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 5d93d9dd2c12..3165194f62ab 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -292,6 +292,7 @@ BTRFS_FEAT_ATTR_COMPAT_RO(free_space_tree, FREE_SPACE_TREE);
 BTRFS_FEAT_ATTR_COMPAT_RO(block_group_tree, BLOCK_GROUP_TREE);
 BTRFS_FEAT_ATTR_INCOMPAT(raid1c34, RAID1C34);
 BTRFS_FEAT_ATTR_INCOMPAT(simple_quota, SIMPLE_QUOTA);
+BTRFS_FEAT_ATTR_INCOMPAT(remap_tree, REMAP_TREE);
 #ifdef CONFIG_BLK_DEV_ZONED
 BTRFS_FEAT_ATTR_INCOMPAT(zoned, ZONED);
 #endif
@@ -326,6 +327,7 @@ static struct attribute *btrfs_supported_feature_attrs[] = {
 	BTRFS_FEAT_ATTR_PTR(raid1c34),
 	BTRFS_FEAT_ATTR_PTR(block_group_tree),
 	BTRFS_FEAT_ATTR_PTR(simple_quota),
+	BTRFS_FEAT_ATTR_PTR(remap_tree),
 #ifdef CONFIG_BLK_DEV_ZONED
 	BTRFS_FEAT_ATTR_PTR(zoned),
 #endif
diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
index 8f4703b488b7..a83fb828723a 100644
--- a/fs/btrfs/tree-checker.c
+++ b/fs/btrfs/tree-checker.c
@@ -913,11 +913,13 @@ int btrfs_check_chunk_valid(const struct btrfs_fs_info *fs_info,
 		return -EUCLEAN;
 	}
 	if (unlikely(type & ~(BTRFS_BLOCK_GROUP_TYPE_MASK |
-			      BTRFS_BLOCK_GROUP_PROFILE_MASK))) {
+			      BTRFS_BLOCK_GROUP_PROFILE_MASK |
+			      BTRFS_BLOCK_GROUP_REMAPPED))) {
 		chunk_err(fs_info, leaf, chunk, logical,
 			  "unrecognized chunk type: 0x%llx",
 			  ~(BTRFS_BLOCK_GROUP_TYPE_MASK |
-			    BTRFS_BLOCK_GROUP_PROFILE_MASK) & type);
+			    BTRFS_BLOCK_GROUP_PROFILE_MASK |
+			    BTRFS_BLOCK_GROUP_REMAPPED) & type);
 		return -EUCLEAN;
 	}
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 89835071cfea..e041964d03c8 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -234,6 +234,7 @@ void btrfs_describe_block_groups(u64 bg_flags, char *buf, u32 size_buf)
 	DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_DATA, "data");
 	DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_SYSTEM, "system");
 	DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_METADATA, "metadata");
+	DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_REMAPPED, "remapped");
 
 	DESCRIBE_FLAG(BTRFS_AVAIL_ALLOC_BIT_SINGLE, "single");
 	for (i = 0; i < BTRFS_NR_RAID_TYPES; i++)
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index dd02160015b2..d857cdc7694a 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -336,6 +336,7 @@ struct btrfs_ioctl_fs_info_args {
 #define BTRFS_FEATURE_INCOMPAT_EXTENT_TREE_V2	(1ULL << 13)
 #define BTRFS_FEATURE_INCOMPAT_RAID_STRIPE_TREE	(1ULL << 14)
 #define BTRFS_FEATURE_INCOMPAT_SIMPLE_QUOTA	(1ULL << 16)
+#define BTRFS_FEATURE_INCOMPAT_REMAP_TREE	(1ULL << 17)
 
 struct btrfs_ioctl_feature_flags {
 	__u64 compat_flags;
diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index fc29d273845d..4439d77a7252 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -76,6 +76,9 @@
 /* Tracks RAID stripes in block groups. */
 #define BTRFS_RAID_STRIPE_TREE_OBJECTID 12ULL
 
+/* Holds details of remapped addresses after relocation. */
+#define BTRFS_REMAP_TREE_OBJECTID 13ULL
+
 /* device stats in the device tree */
 #define BTRFS_DEV_STATS_OBJECTID 0ULL
 
@@ -282,6 +285,10 @@
 
 #define BTRFS_RAID_STRIPE_KEY	230
 
+#define BTRFS_IDENTITY_REMAP_KEY 	234
+#define BTRFS_REMAP_KEY		 	235
+#define BTRFS_REMAP_BACKREF_KEY	 	236
+
 /*
  * Records the overall state of the qgroups.
  * There's only one instance of this key present,
@@ -1161,6 +1168,7 @@ struct btrfs_dev_replace_item {
 #define BTRFS_BLOCK_GROUP_RAID6         (1ULL << 8)
 #define BTRFS_BLOCK_GROUP_RAID1C3       (1ULL << 9)
 #define BTRFS_BLOCK_GROUP_RAID1C4       (1ULL << 10)
+#define BTRFS_BLOCK_GROUP_REMAPPED      (1ULL << 11)
 #define BTRFS_BLOCK_GROUP_RESERVED	(BTRFS_AVAIL_ALLOC_BIT_SINGLE | \
 					 BTRFS_SPACE_INFO_GLOBAL_RSV)
 
@@ -1323,4 +1331,8 @@ struct btrfs_verity_descriptor_item {
 	__u8 encryption;
 } __attribute__ ((__packed__));
 
+struct btrfs_remap {
+	__le64 address;
+} __attribute__ ((__packed__));
+
 #endif /* _BTRFS_CTREE_H_ */
-- 
2.49.0



* [RFC PATCH 02/10] btrfs: add REMAP chunk type
  2025-05-15 16:36 [RFC PATCH 00/10] Remap tree Mark Harmstone
  2025-05-15 16:36 ` [RFC PATCH 01/10] btrfs: add definitions and constants for remap-tree Mark Harmstone
@ 2025-05-15 16:36 ` Mark Harmstone
  2025-05-15 16:36 ` [RFC PATCH 03/10] btrfs: allow remapped chunks to have zero stripes Mark Harmstone
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 20+ messages in thread
From: Mark Harmstone @ 2025-05-15 16:36 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Mark Harmstone

Add a new REMAP chunk type, which is a metadata chunk that holds the
remap tree.

This is needed for bootstrapping purposes: the remap tree can't itself
be remapped, and must be relocated the existing way, by COWing every
leaf. The remap tree can't go in the SYSTEM chunk as space there is
limited, because a copy of the chunk item gets placed in the superblock.

The changes in fs/btrfs/volumes.h are because we're adding a new block
group type bit after the profile bits, and so can no longer rely on the
const_ilog2 trick.

Signed-off-by: Mark Harmstone <maharmstone@fb.com>
---
 fs/btrfs/block-rsv.c            |  8 ++++++++
 fs/btrfs/block-rsv.h            |  1 +
 fs/btrfs/disk-io.c              |  1 +
 fs/btrfs/fs.h                   |  2 ++
 fs/btrfs/space-info.c           | 11 +++++++++++
 fs/btrfs/sysfs.c                |  2 ++
 fs/btrfs/tree-checker.c         |  5 +++--
 fs/btrfs/volumes.c              |  7 +++++++
 fs/btrfs/volumes.h              | 11 +++++++++--
 include/uapi/linux/btrfs_tree.h |  4 +++-
 10 files changed, 47 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/block-rsv.c b/fs/btrfs/block-rsv.c
index 5ad6de738aee..2678cd3bed29 100644
--- a/fs/btrfs/block-rsv.c
+++ b/fs/btrfs/block-rsv.c
@@ -421,6 +421,9 @@ void btrfs_init_root_block_rsv(struct btrfs_root *root)
 	case BTRFS_TREE_LOG_OBJECTID:
 		root->block_rsv = &fs_info->treelog_rsv;
 		break;
+	case BTRFS_REMAP_TREE_OBJECTID:
+		root->block_rsv = &fs_info->remap_block_rsv;
+		break;
 	default:
 		root->block_rsv = NULL;
 		break;
@@ -434,6 +437,9 @@ void btrfs_init_global_block_rsv(struct btrfs_fs_info *fs_info)
 	space_info = btrfs_find_space_info(fs_info, BTRFS_BLOCK_GROUP_SYSTEM);
 	fs_info->chunk_block_rsv.space_info = space_info;
 
+	space_info = btrfs_find_space_info(fs_info, BTRFS_BLOCK_GROUP_REMAP);
+	fs_info->remap_block_rsv.space_info = space_info;
+
 	space_info = btrfs_find_space_info(fs_info, BTRFS_BLOCK_GROUP_METADATA);
 	fs_info->global_block_rsv.space_info = space_info;
 	fs_info->trans_block_rsv.space_info = space_info;
@@ -460,6 +466,8 @@ void btrfs_release_global_block_rsv(struct btrfs_fs_info *fs_info)
 	WARN_ON(fs_info->trans_block_rsv.reserved > 0);
 	WARN_ON(fs_info->chunk_block_rsv.size > 0);
 	WARN_ON(fs_info->chunk_block_rsv.reserved > 0);
+	WARN_ON(fs_info->remap_block_rsv.size > 0);
+	WARN_ON(fs_info->remap_block_rsv.reserved > 0);
 	WARN_ON(fs_info->delayed_block_rsv.size > 0);
 	WARN_ON(fs_info->delayed_block_rsv.reserved > 0);
 	WARN_ON(fs_info->delayed_refs_rsv.reserved > 0);
diff --git a/fs/btrfs/block-rsv.h b/fs/btrfs/block-rsv.h
index 79ae9d05cd91..8359fb96bc3c 100644
--- a/fs/btrfs/block-rsv.h
+++ b/fs/btrfs/block-rsv.h
@@ -22,6 +22,7 @@ enum btrfs_rsv_type {
 	BTRFS_BLOCK_RSV_DELALLOC,
 	BTRFS_BLOCK_RSV_TRANS,
 	BTRFS_BLOCK_RSV_CHUNK,
+	BTRFS_BLOCK_RSV_REMAP,
 	BTRFS_BLOCK_RSV_DELOPS,
 	BTRFS_BLOCK_RSV_DELREFS,
 	BTRFS_BLOCK_RSV_TREELOG,
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 1beb9458f622..95058c9aa31b 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2831,6 +2831,7 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
 			     BTRFS_BLOCK_RSV_GLOBAL);
 	btrfs_init_block_rsv(&fs_info->trans_block_rsv, BTRFS_BLOCK_RSV_TRANS);
 	btrfs_init_block_rsv(&fs_info->chunk_block_rsv, BTRFS_BLOCK_RSV_CHUNK);
+	btrfs_init_block_rsv(&fs_info->remap_block_rsv, BTRFS_BLOCK_RSV_REMAP);
 	btrfs_init_block_rsv(&fs_info->treelog_rsv, BTRFS_BLOCK_RSV_TREELOG);
 	btrfs_init_block_rsv(&fs_info->empty_block_rsv, BTRFS_BLOCK_RSV_EMPTY);
 	btrfs_init_block_rsv(&fs_info->delayed_block_rsv,
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index 4394de12a767..2dfdbfda5901 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -468,6 +468,8 @@ struct btrfs_fs_info {
 	struct btrfs_block_rsv trans_block_rsv;
 	/* Block reservation for chunk tree */
 	struct btrfs_block_rsv chunk_block_rsv;
+	/* Block reservation for remap tree */
+	struct btrfs_block_rsv remap_block_rsv;
 	/* Block reservation for delayed operations */
 	struct btrfs_block_rsv delayed_block_rsv;
 	/* Block reservation for delayed refs */
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index d9087aa81b21..3f927a514643 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -343,6 +343,8 @@ int btrfs_init_space_info(struct btrfs_fs_info *fs_info)
 	if (mixed) {
 		flags = BTRFS_BLOCK_GROUP_METADATA | BTRFS_BLOCK_GROUP_DATA;
 		ret = create_space_info(fs_info, flags);
+		if (ret)
+			goto out;
 	} else {
 		flags = BTRFS_BLOCK_GROUP_METADATA;
 		ret = create_space_info(fs_info, flags);
@@ -351,7 +353,15 @@ int btrfs_init_space_info(struct btrfs_fs_info *fs_info)
 
 		flags = BTRFS_BLOCK_GROUP_DATA;
 		ret = create_space_info(fs_info, flags);
+		if (ret)
+			goto out;
+	}
+
+	if (features & BTRFS_FEATURE_INCOMPAT_REMAP_TREE) {
+		flags = BTRFS_BLOCK_GROUP_REMAP;
+		ret = create_space_info(fs_info, flags);
 	}
+
 out:
 	return ret;
 }
@@ -590,6 +600,7 @@ static void dump_global_block_rsv(struct btrfs_fs_info *fs_info)
 	DUMP_BLOCK_RSV(fs_info, global_block_rsv);
 	DUMP_BLOCK_RSV(fs_info, trans_block_rsv);
 	DUMP_BLOCK_RSV(fs_info, chunk_block_rsv);
+	DUMP_BLOCK_RSV(fs_info, remap_block_rsv);
 	DUMP_BLOCK_RSV(fs_info, delayed_block_rsv);
 	DUMP_BLOCK_RSV(fs_info, delayed_refs_rsv);
 }
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 3165194f62ab..b8c2d9a5ebeb 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -1962,6 +1962,8 @@ static const char *alloc_name(struct btrfs_space_info *space_info)
 	case BTRFS_BLOCK_GROUP_SYSTEM:
 		ASSERT(space_info->subgroup_id == BTRFS_SUB_GROUP_PRIMARY);
 		return "system";
+	case BTRFS_BLOCK_GROUP_REMAP:
+		return "remap";
 	default:
 		WARN_ON(1);
 		return "invalid-combination";
diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
index a83fb828723a..0505f8d76581 100644
--- a/fs/btrfs/tree-checker.c
+++ b/fs/btrfs/tree-checker.c
@@ -751,13 +751,14 @@ static int check_block_group_item(struct extent_buffer *leaf,
 	if (unlikely(type != BTRFS_BLOCK_GROUP_DATA &&
 		     type != BTRFS_BLOCK_GROUP_METADATA &&
 		     type != BTRFS_BLOCK_GROUP_SYSTEM &&
+		     type != BTRFS_BLOCK_GROUP_REMAP &&
 		     type != (BTRFS_BLOCK_GROUP_METADATA |
 			      BTRFS_BLOCK_GROUP_DATA))) {
 		block_group_err(leaf, slot,
-"invalid type, have 0x%llx (%lu bits set) expect either 0x%llx, 0x%llx, 0x%llx or 0x%llx",
+"invalid type, have 0x%llx (%lu bits set) expect either 0x%llx, 0x%llx, 0x%llx, 0x%llx or 0x%llx",
 			type, hweight64(type),
 			BTRFS_BLOCK_GROUP_DATA, BTRFS_BLOCK_GROUP_METADATA,
-			BTRFS_BLOCK_GROUP_SYSTEM,
+			BTRFS_BLOCK_GROUP_SYSTEM, BTRFS_BLOCK_GROUP_REMAP,
 			BTRFS_BLOCK_GROUP_METADATA | BTRFS_BLOCK_GROUP_DATA);
 		return -EUCLEAN;
 	}
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index e041964d03c8..0698613276d9 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -234,6 +234,7 @@ void btrfs_describe_block_groups(u64 bg_flags, char *buf, u32 size_buf)
 	DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_DATA, "data");
 	DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_SYSTEM, "system");
 	DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_METADATA, "metadata");
+	DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_REMAP, "remap");
 	DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_REMAPPED, "remapped");
 
 	DESCRIBE_FLAG(BTRFS_AVAIL_ALLOC_BIT_SINGLE, "single");
@@ -3974,6 +3975,12 @@ static bool should_balance_chunk(struct extent_buffer *leaf, struct btrfs_chunk
 	struct btrfs_balance_args *bargs = NULL;
 	u64 chunk_type = btrfs_chunk_type(leaf, chunk);
 
+	/* treat REMAP chunks as METADATA */
+	if (chunk_type & BTRFS_BLOCK_GROUP_REMAP) {
+		chunk_type &= ~BTRFS_BLOCK_GROUP_REMAP;
+		chunk_type |= BTRFS_BLOCK_GROUP_METADATA;
+	}
+
 	/* type filter */
 	if (!((chunk_type & BTRFS_BLOCK_GROUP_TYPE_MASK) &
 	      (bctl->flags & BTRFS_BALANCE_TYPE_MASK))) {
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 137cc232f58e..670d7bf18c40 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -59,8 +59,6 @@ static_assert(const_ilog2(BTRFS_STRIPE_LEN) == BTRFS_STRIPE_LEN_SHIFT);
  */
 static_assert(const_ffs(BTRFS_BLOCK_GROUP_RAID0) <
 	      const_ffs(BTRFS_BLOCK_GROUP_PROFILE_MASK & ~BTRFS_BLOCK_GROUP_RAID0));
-static_assert(const_ilog2(BTRFS_BLOCK_GROUP_RAID0) >
-	      ilog2(BTRFS_BLOCK_GROUP_TYPE_MASK));
 
 /* ilog2() can handle both constants and variables */
 #define BTRFS_BG_FLAG_TO_INDEX(profile)					\
@@ -82,6 +80,15 @@ enum btrfs_raid_types {
 	BTRFS_NR_RAID_TYPES
 };
 
+static_assert(BTRFS_RAID_RAID0 == 1);
+static_assert(BTRFS_RAID_RAID1 == 2);
+static_assert(BTRFS_RAID_DUP == 3);
+static_assert(BTRFS_RAID_RAID10 == 4);
+static_assert(BTRFS_RAID_RAID5 == 5);
+static_assert(BTRFS_RAID_RAID6 == 6);
+static_assert(BTRFS_RAID_RAID1C3 == 7);
+static_assert(BTRFS_RAID_RAID1C4 == 8);
+
 /*
  * Use sequence counter to get consistent device stat data on
  * 32-bit processors.
diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index 4439d77a7252..9a36f0206d90 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -1169,12 +1169,14 @@ struct btrfs_dev_replace_item {
 #define BTRFS_BLOCK_GROUP_RAID1C3       (1ULL << 9)
 #define BTRFS_BLOCK_GROUP_RAID1C4       (1ULL << 10)
 #define BTRFS_BLOCK_GROUP_REMAPPED      (1ULL << 11)
+#define BTRFS_BLOCK_GROUP_REMAP         (1ULL << 12)
 #define BTRFS_BLOCK_GROUP_RESERVED	(BTRFS_AVAIL_ALLOC_BIT_SINGLE | \
 					 BTRFS_SPACE_INFO_GLOBAL_RSV)
 
 #define BTRFS_BLOCK_GROUP_TYPE_MASK	(BTRFS_BLOCK_GROUP_DATA |    \
 					 BTRFS_BLOCK_GROUP_SYSTEM |  \
-					 BTRFS_BLOCK_GROUP_METADATA)
+					 BTRFS_BLOCK_GROUP_METADATA | \
+					 BTRFS_BLOCK_GROUP_REMAP)
 
 #define BTRFS_BLOCK_GROUP_PROFILE_MASK	(BTRFS_BLOCK_GROUP_RAID0 |   \
 					 BTRFS_BLOCK_GROUP_RAID1 |   \
-- 
2.49.0



* [RFC PATCH 03/10] btrfs: allow remapped chunks to have zero stripes
  2025-05-15 16:36 [RFC PATCH 00/10] Remap tree Mark Harmstone
  2025-05-15 16:36 ` [RFC PATCH 01/10] btrfs: add definitions and constants for remap-tree Mark Harmstone
  2025-05-15 16:36 ` [RFC PATCH 02/10] btrfs: add REMAP chunk type Mark Harmstone
@ 2025-05-15 16:36 ` Mark Harmstone
  2025-05-15 16:36 ` [RFC PATCH 04/10] btrfs: add extended version of struct block_group_item Mark Harmstone
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 20+ messages in thread
From: Mark Harmstone @ 2025-05-15 16:36 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Mark Harmstone

When a chunk has been fully remapped, we are going to set its
num_stripes to 0, as it will no longer represent a physical location on
disk.

Change tree-checker to allow for this, and fix a couple of
divide-by-zeroes seen elsewhere.

Signed-off-by: Mark Harmstone <maharmstone@fb.com>
---
 fs/btrfs/tree-checker.c | 16 +++++++++-------
 fs/btrfs/volumes.c      |  8 +++++++-
 2 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
index 0505f8d76581..fd83df06e3fb 100644
--- a/fs/btrfs/tree-checker.c
+++ b/fs/btrfs/tree-checker.c
@@ -829,7 +829,7 @@ int btrfs_check_chunk_valid(const struct btrfs_fs_info *fs_info,
 	u64 type;
 	u64 features;
 	u32 chunk_sector_size;
-	bool mixed = false;
+	bool mixed = false, remapped;
 	int raid_index;
 	int nparity;
 	int ncopies;
@@ -853,12 +853,14 @@ int btrfs_check_chunk_valid(const struct btrfs_fs_info *fs_info,
 	ncopies = btrfs_raid_array[raid_index].ncopies;
 	nparity = btrfs_raid_array[raid_index].nparity;
 
-	if (unlikely(!num_stripes)) {
+	remapped = type & BTRFS_BLOCK_GROUP_REMAPPED;
+
+	if (unlikely(!remapped && !num_stripes)) {
 		chunk_err(fs_info, leaf, chunk, logical,
 			  "invalid chunk num_stripes, have %u", num_stripes);
 		return -EUCLEAN;
 	}
-	if (unlikely(num_stripes < ncopies)) {
+	if (unlikely(!remapped && num_stripes < ncopies)) {
 		chunk_err(fs_info, leaf, chunk, logical,
 			  "invalid chunk num_stripes < ncopies, have %u < %d",
 			  num_stripes, ncopies);
@@ -960,7 +962,7 @@ int btrfs_check_chunk_valid(const struct btrfs_fs_info *fs_info,
 		}
 	}
 
-	if (unlikely((type & BTRFS_BLOCK_GROUP_RAID10 &&
+	if (unlikely(!remapped && ((type & BTRFS_BLOCK_GROUP_RAID10 &&
 		      sub_stripes != btrfs_raid_array[BTRFS_RAID_RAID10].sub_stripes) ||
 		     (type & BTRFS_BLOCK_GROUP_RAID1 &&
 		      num_stripes != btrfs_raid_array[BTRFS_RAID_RAID1].devs_min) ||
@@ -975,7 +977,7 @@ int btrfs_check_chunk_valid(const struct btrfs_fs_info *fs_info,
 		     (type & BTRFS_BLOCK_GROUP_DUP &&
 		      num_stripes != btrfs_raid_array[BTRFS_RAID_DUP].dev_stripes) ||
 		     ((type & BTRFS_BLOCK_GROUP_PROFILE_MASK) == 0 &&
-		      num_stripes != btrfs_raid_array[BTRFS_RAID_SINGLE].dev_stripes))) {
+		      num_stripes != btrfs_raid_array[BTRFS_RAID_SINGLE].dev_stripes)))) {
 		chunk_err(fs_info, leaf, chunk, logical,
 			"invalid num_stripes:sub_stripes %u:%u for profile %llu",
 			num_stripes, sub_stripes,
@@ -999,11 +1001,11 @@ static int check_leaf_chunk_item(struct extent_buffer *leaf,
 	struct btrfs_fs_info *fs_info = leaf->fs_info;
 	int num_stripes;
 
-	if (unlikely(btrfs_item_size(leaf, slot) < sizeof(struct btrfs_chunk))) {
+	if (unlikely(btrfs_item_size(leaf, slot) < offsetof(struct btrfs_chunk, stripe))) {
 		chunk_err(fs_info, leaf, chunk, key->offset,
 			"invalid chunk item size: have %u expect [%zu, %u)",
 			btrfs_item_size(leaf, slot),
-			sizeof(struct btrfs_chunk),
+			offsetof(struct btrfs_chunk, stripe),
 			BTRFS_LEAF_DATA_SIZE(fs_info));
 		return -EUCLEAN;
 	}
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 0698613276d9..77194bb46b40 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6133,6 +6133,12 @@ struct btrfs_discard_stripe *btrfs_map_discard(struct btrfs_fs_info *fs_info,
 		goto out_free_map;
 	}
 
+	/* avoid divide by zero on fully-remapped chunks */
+	if (map->num_stripes == 0) {
+		ret = -EOPNOTSUPP;
+		goto out_free_map;
+	}
+
 	offset = logical - map->start;
 	length = min_t(u64, map->start + map->chunk_len - logical, length);
 	*length_ret = length;
@@ -6953,7 +6959,7 @@ u64 btrfs_calc_stripe_length(const struct btrfs_chunk_map *map)
 {
 	const int data_stripes = calc_data_stripes(map->type, map->num_stripes);
 
-	return div_u64(map->chunk_len, data_stripes);
+	return data_stripes ? div_u64(map->chunk_len, data_stripes) : 0;
 }
 
 #if BITS_PER_LONG == 32
-- 
2.49.0



* [RFC PATCH 04/10] btrfs: add extended version of struct block_group_item
  2025-05-15 16:36 [RFC PATCH 00/10] Remap tree Mark Harmstone
                   ` (2 preceding siblings ...)
  2025-05-15 16:36 ` [RFC PATCH 03/10] btrfs: allow remapped chunks to have zero stripes Mark Harmstone
@ 2025-05-15 16:36 ` Mark Harmstone
  2025-05-23  9:53   ` Qu Wenruo
  2025-05-15 16:36 ` [RFC PATCH 05/10] btrfs: allow mounting filesystems with remap-tree incompat flag Mark Harmstone
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 20+ messages in thread
From: Mark Harmstone @ 2025-05-15 16:36 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Mark Harmstone

Add a struct btrfs_block_group_item_v2, which is used in the block group
tree if the remap-tree incompat flag is set.

This adds two new fields to the block group item: `remap_bytes` and
`identity_remap_count`.

`remap_bytes` records the amount of data that's physically within this
block group, but nominally in another, remapped block group. This is
necessary because this data will need to be moved first if this block
group is itself relocated. If `remap_bytes` > 0, this is an indicator to
the relocation thread that it will need to search the remap-tree for
backrefs. A block group must also have `remap_bytes` == 0 before it can
be dropped.

`identity_remap_count` records how many identity remap items are located
in the remap tree for this block group. When relocation begins for this
block group, it is set to the number of holes in the free-space tree for
this range. As identity remaps are converted into actual remaps by the
relocation process, this count is decreased. Once it reaches 0, either
through relocation or because extents have been deleted, the block group
has been fully remapped and its chunk's device extents are removed.

Signed-off-by: Mark Harmstone <maharmstone@fb.com>
---
 fs/btrfs/accessors.h            |  20 +++++++
 fs/btrfs/block-group.c          | 101 ++++++++++++++++++++++++--------
 fs/btrfs/block-group.h          |  14 ++++-
 fs/btrfs/tree-checker.c         |  10 +++-
 include/uapi/linux/btrfs_tree.h |   8 +++
 5 files changed, 126 insertions(+), 27 deletions(-)

diff --git a/fs/btrfs/accessors.h b/fs/btrfs/accessors.h
index 5f5eda8d6f9e..6e6dd664217b 100644
--- a/fs/btrfs/accessors.h
+++ b/fs/btrfs/accessors.h
@@ -264,6 +264,26 @@ BTRFS_SETGET_FUNCS(block_group_flags, struct btrfs_block_group_item, flags, 64);
 BTRFS_SETGET_STACK_FUNCS(stack_block_group_flags,
 			struct btrfs_block_group_item, flags, 64);
 
+/* struct btrfs_block_group_item_v2 */
+BTRFS_SETGET_STACK_FUNCS(stack_block_group_v2_used, struct btrfs_block_group_item_v2,
+			 used, 64);
+BTRFS_SETGET_FUNCS(block_group_v2_used, struct btrfs_block_group_item_v2, used, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_block_group_v2_chunk_objectid,
+			 struct btrfs_block_group_item_v2, chunk_objectid, 64);
+BTRFS_SETGET_FUNCS(block_group_v2_chunk_objectid,
+		   struct btrfs_block_group_item_v2, chunk_objectid, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_block_group_v2_flags,
+			 struct btrfs_block_group_item_v2, flags, 64);
+BTRFS_SETGET_FUNCS(block_group_v2_flags, struct btrfs_block_group_item_v2, flags, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_block_group_v2_remap_bytes,
+			 struct btrfs_block_group_item_v2, remap_bytes, 64);
+BTRFS_SETGET_FUNCS(block_group_v2_remap_bytes, struct btrfs_block_group_item_v2,
+		   remap_bytes, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_block_group_v2_identity_remap_count,
+			 struct btrfs_block_group_item_v2, identity_remap_count, 32);
+BTRFS_SETGET_FUNCS(block_group_v2_identity_remap_count, struct btrfs_block_group_item_v2,
+		   identity_remap_count, 32);
+
 /* struct btrfs_free_space_info */
 BTRFS_SETGET_FUNCS(free_space_extent_count, struct btrfs_free_space_info,
 		   extent_count, 32);
diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 5b0cb04b2b93..6a2aa792ccb2 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -2351,7 +2351,7 @@ static int check_chunk_block_group_mappings(struct btrfs_fs_info *fs_info)
 }
 
 static int read_one_block_group(struct btrfs_fs_info *info,
-				struct btrfs_block_group_item *bgi,
+				struct btrfs_block_group_item_v2 *bgi,
 				const struct btrfs_key *key,
 				int need_clear)
 {
@@ -2366,11 +2366,16 @@ static int read_one_block_group(struct btrfs_fs_info *info,
 		return -ENOMEM;
 
 	cache->length = key->offset;
-	cache->used = btrfs_stack_block_group_used(bgi);
+	cache->used = btrfs_stack_block_group_v2_used(bgi);
 	cache->commit_used = cache->used;
-	cache->flags = btrfs_stack_block_group_flags(bgi);
-	cache->global_root_id = btrfs_stack_block_group_chunk_objectid(bgi);
+	cache->flags = btrfs_stack_block_group_v2_flags(bgi);
+	cache->global_root_id = btrfs_stack_block_group_v2_chunk_objectid(bgi);
 	cache->space_info = btrfs_find_space_info(info, cache->flags);
+	cache->remap_bytes = btrfs_stack_block_group_v2_remap_bytes(bgi);
+	cache->commit_remap_bytes = cache->remap_bytes;
+	cache->identity_remap_count =
+		btrfs_stack_block_group_v2_identity_remap_count(bgi);
+	cache->commit_identity_remap_count = cache->identity_remap_count;
 
 	set_free_space_tree_thresholds(cache);
 
@@ -2435,7 +2440,7 @@ static int read_one_block_group(struct btrfs_fs_info *info,
 	} else if (cache->length == cache->used) {
 		cache->cached = BTRFS_CACHE_FINISHED;
 		btrfs_free_excluded_extents(cache);
-	} else if (cache->used == 0) {
+	} else if (cache->used == 0 && cache->remap_bytes == 0) {
 		cache->cached = BTRFS_CACHE_FINISHED;
 		ret = btrfs_add_new_free_space(cache, cache->start,
 					       cache->start + cache->length, NULL);
@@ -2455,7 +2460,8 @@ static int read_one_block_group(struct btrfs_fs_info *info,
 
 	set_avail_alloc_bits(info, cache->flags);
 	if (btrfs_chunk_writeable(info, cache->start)) {
-		if (cache->used == 0) {
+		if (cache->used == 0 && cache->identity_remap_count == 0 &&
+		    cache->remap_bytes == 0) {
 			ASSERT(list_empty(&cache->bg_list));
 			if (btrfs_test_opt(info, DISCARD_ASYNC))
 				btrfs_discard_queue_work(&info->discard_ctl, cache);
@@ -2559,9 +2565,10 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
 		need_clear = 1;
 
 	while (1) {
-		struct btrfs_block_group_item bgi;
+		struct btrfs_block_group_item_v2 bgi;
 		struct extent_buffer *leaf;
 		int slot;
+		size_t size;
 
 		ret = find_first_block_group(info, path, &key);
 		if (ret > 0)
@@ -2572,8 +2579,16 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
 		leaf = path->nodes[0];
 		slot = path->slots[0];
 
+		if (btrfs_fs_incompat(info, REMAP_TREE)) {
+			size = sizeof(struct btrfs_block_group_item_v2);
+		} else {
+			size = sizeof(struct btrfs_block_group_item);
+			btrfs_set_stack_block_group_v2_remap_bytes(&bgi, 0);
+			btrfs_set_stack_block_group_v2_identity_remap_count(&bgi, 0);
+		}
+
 		read_extent_buffer(leaf, &bgi, btrfs_item_ptr_offset(leaf, slot),
-				   sizeof(bgi));
+				   size);
 
 		btrfs_item_key_to_cpu(leaf, &key, slot);
 		btrfs_release_path(path);
@@ -2643,25 +2658,38 @@ static int insert_block_group_item(struct btrfs_trans_handle *trans,
 				   struct btrfs_block_group *block_group)
 {
 	struct btrfs_fs_info *fs_info = trans->fs_info;
-	struct btrfs_block_group_item bgi;
+	struct btrfs_block_group_item_v2 bgi;
 	struct btrfs_root *root = btrfs_block_group_root(fs_info);
 	struct btrfs_key key;
 	u64 old_commit_used;
+	size_t size;
 	int ret;
 
 	spin_lock(&block_group->lock);
-	btrfs_set_stack_block_group_used(&bgi, block_group->used);
-	btrfs_set_stack_block_group_chunk_objectid(&bgi,
-						   block_group->global_root_id);
-	btrfs_set_stack_block_group_flags(&bgi, block_group->flags);
+	btrfs_set_stack_block_group_v2_used(&bgi, block_group->used);
+	btrfs_set_stack_block_group_v2_chunk_objectid(&bgi,
+						      block_group->global_root_id);
+	btrfs_set_stack_block_group_v2_flags(&bgi, block_group->flags);
+	btrfs_set_stack_block_group_v2_remap_bytes(&bgi,
+						   block_group->remap_bytes);
+	btrfs_set_stack_block_group_v2_identity_remap_count(&bgi,
+					block_group->identity_remap_count);
 	old_commit_used = block_group->commit_used;
 	block_group->commit_used = block_group->used;
+	block_group->commit_remap_bytes = block_group->remap_bytes;
+	block_group->commit_identity_remap_count =
+		block_group->identity_remap_count;
 	key.objectid = block_group->start;
 	key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
 	key.offset = block_group->length;
 	spin_unlock(&block_group->lock);
 
-	ret = btrfs_insert_item(trans, root, &key, &bgi, sizeof(bgi));
+	if (btrfs_fs_incompat(fs_info, REMAP_TREE))
+		size = sizeof(struct btrfs_block_group_item_v2);
+	else
+		size = sizeof(struct btrfs_block_group_item);
+
+	ret = btrfs_insert_item(trans, root, &key, &bgi, size);
 	if (ret < 0) {
 		spin_lock(&block_group->lock);
 		block_group->commit_used = old_commit_used;
@@ -3116,10 +3144,12 @@ static int update_block_group_item(struct btrfs_trans_handle *trans,
 	struct btrfs_root *root = btrfs_block_group_root(fs_info);
 	unsigned long bi;
 	struct extent_buffer *leaf;
-	struct btrfs_block_group_item bgi;
+	struct btrfs_block_group_item_v2 bgi;
 	struct btrfs_key key;
-	u64 old_commit_used;
-	u64 used;
+	u64 old_commit_used, old_commit_remap_bytes;
+	u32 old_commit_identity_remap_count;
+	u64 used, remap_bytes;
+	u32 identity_remap_count;
 
 	/*
 	 * Block group items update can be triggered out of commit transaction
@@ -3129,13 +3159,21 @@ static int update_block_group_item(struct btrfs_trans_handle *trans,
 	 */
 	spin_lock(&cache->lock);
 	old_commit_used = cache->commit_used;
+	old_commit_remap_bytes = cache->commit_remap_bytes;
+	old_commit_identity_remap_count = cache->commit_identity_remap_count;
 	used = cache->used;
-	/* No change in used bytes, can safely skip it. */
-	if (cache->commit_used == used) {
+	remap_bytes = cache->remap_bytes;
+	identity_remap_count = cache->identity_remap_count;
+	/* No change in values, can safely skip it. */
+	if (cache->commit_used == used &&
+	    cache->commit_remap_bytes == remap_bytes &&
+	    cache->commit_identity_remap_count == identity_remap_count) {
 		spin_unlock(&cache->lock);
 		return 0;
 	}
 	cache->commit_used = used;
+	cache->commit_remap_bytes = remap_bytes;
+	cache->commit_identity_remap_count = identity_remap_count;
 	spin_unlock(&cache->lock);
 
 	key.objectid = cache->start;
@@ -3151,11 +3189,23 @@ static int update_block_group_item(struct btrfs_trans_handle *trans,
 
 	leaf = path->nodes[0];
 	bi = btrfs_item_ptr_offset(leaf, path->slots[0]);
-	btrfs_set_stack_block_group_used(&bgi, used);
-	btrfs_set_stack_block_group_chunk_objectid(&bgi,
-						   cache->global_root_id);
-	btrfs_set_stack_block_group_flags(&bgi, cache->flags);
-	write_extent_buffer(leaf, &bgi, bi, sizeof(bgi));
+	btrfs_set_stack_block_group_v2_used(&bgi, used);
+	btrfs_set_stack_block_group_v2_chunk_objectid(&bgi,
+						      cache->global_root_id);
+	btrfs_set_stack_block_group_v2_flags(&bgi, cache->flags);
+
+	if (btrfs_fs_incompat(fs_info, REMAP_TREE)) {
+		btrfs_set_stack_block_group_v2_remap_bytes(&bgi,
+							   cache->remap_bytes);
+		btrfs_set_stack_block_group_v2_identity_remap_count(&bgi,
+						cache->identity_remap_count);
+		write_extent_buffer(leaf, &bgi, bi,
+				    sizeof(struct btrfs_block_group_item_v2));
+	} else {
+		write_extent_buffer(leaf, &bgi, bi,
+				    sizeof(struct btrfs_block_group_item));
+	}
+
 fail:
 	btrfs_release_path(path);
 	/*
@@ -3170,6 +3220,9 @@ static int update_block_group_item(struct btrfs_trans_handle *trans,
 	if (ret < 0 && ret != -ENOENT) {
 		spin_lock(&cache->lock);
 		cache->commit_used = old_commit_used;
+		cache->commit_remap_bytes = old_commit_remap_bytes;
+		cache->commit_identity_remap_count =
+			old_commit_identity_remap_count;
 		spin_unlock(&cache->lock);
 	}
 	return ret;
diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index 9de356bcb411..c484118b8b8d 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -127,6 +127,8 @@ struct btrfs_block_group {
 	u64 flags;
 	u64 cache_generation;
 	u64 global_root_id;
+	u64 remap_bytes;
+	u32 identity_remap_count;
 
 	/*
 	 * The last committed used bytes of this block group, if the above @used
@@ -134,6 +136,15 @@ struct btrfs_block_group {
 	 * group item of this block group.
 	 */
 	u64 commit_used;
+	/*
+	 * The last committed remap_bytes value of this block group.
+	 */
+	u64 commit_remap_bytes;
+	/*
+	 * The last committed identity_remap_count value of this block group.
+	 */
+	u32 commit_identity_remap_count;
+
 	/*
 	 * If the free space extent count exceeds this number, convert the block
 	 * group to bitmaps.
@@ -275,7 +286,8 @@ static inline bool btrfs_is_block_group_used(const struct btrfs_block_group *bg)
 {
 	lockdep_assert_held(&bg->lock);
 
-	return (bg->used > 0 || bg->reserved > 0 || bg->pinned > 0);
+	return (bg->used > 0 || bg->reserved > 0 || bg->pinned > 0 ||
+		bg->remap_bytes > 0);
 }
 
 static inline bool btrfs_is_block_group_data_only(const struct btrfs_block_group *block_group)
diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
index fd83df06e3fb..25311576fab6 100644
--- a/fs/btrfs/tree-checker.c
+++ b/fs/btrfs/tree-checker.c
@@ -687,6 +687,7 @@ static int check_block_group_item(struct extent_buffer *leaf,
 	u64 chunk_objectid;
 	u64 flags;
 	u64 type;
+	size_t exp_size;
 
 	/*
 	 * Here we don't really care about alignment since extent allocator can
@@ -698,10 +699,15 @@ static int check_block_group_item(struct extent_buffer *leaf,
 		return -EUCLEAN;
 	}
 
-	if (unlikely(item_size != sizeof(bgi))) {
+	if (btrfs_fs_incompat(fs_info, REMAP_TREE))
+		exp_size = sizeof(struct btrfs_block_group_item_v2);
+	else
+		exp_size = sizeof(struct btrfs_block_group_item);
+
+	if (unlikely(item_size != exp_size)) {
 		block_group_err(leaf, slot,
 			"invalid item size, have %u expect %zu",
-				item_size, sizeof(bgi));
+				item_size, exp_size);
 		return -EUCLEAN;
 	}
 
diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index 9a36f0206d90..500e3a7df90b 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -1229,6 +1229,14 @@ struct btrfs_block_group_item {
 	__le64 flags;
 } __attribute__ ((__packed__));
 
+struct btrfs_block_group_item_v2 {
+	__le64 used;
+	__le64 chunk_objectid;
+	__le64 flags;
+	__le64 remap_bytes;
+	__le32 identity_remap_count;
+} __attribute__ ((__packed__));
+
 struct btrfs_free_space_info {
 	__le32 extent_count;
 	__le32 flags;
-- 
2.49.0



* [RFC PATCH 05/10] btrfs: allow mounting filesystems with remap-tree incompat flag
  2025-05-15 16:36 [RFC PATCH 00/10] Remap tree Mark Harmstone
                   ` (3 preceding siblings ...)
  2025-05-15 16:36 ` [RFC PATCH 04/10] btrfs: add extended version of struct block_group_item Mark Harmstone
@ 2025-05-15 16:36 ` Mark Harmstone
  2025-05-15 16:36 ` [RFC PATCH 06/10] btrfs: redirect I/O for remapped block groups Mark Harmstone
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 20+ messages in thread
From: Mark Harmstone @ 2025-05-15 16:36 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Mark Harmstone

If we encounter a filesystem with the remap-tree incompat flag set,
validate its compatibility with the other flags, and load the remap tree
using the values that have been added to the superblock.

The remap-tree feature depends on the free-space tree, but no-holes and
block-group-tree have also been made dependencies to reduce the testing
matrix. Similarly, I'm not aware of any reason why mixed-bg or zoned
would be incompatible with remap-tree, but these combinations are
blocked for the time being until they can be fully tested.

Signed-off-by: Mark Harmstone <maharmstone@fb.com>
---
 fs/btrfs/Kconfig                |  2 ++
 fs/btrfs/accessors.h            |  6 ++++
 fs/btrfs/disk-io.c              | 60 +++++++++++++++++++++++++++++++++
 fs/btrfs/extent-tree.c          |  2 ++
 fs/btrfs/fs.h                   |  4 ++-
 fs/btrfs/transaction.c          |  7 ++++
 include/uapi/linux/btrfs_tree.h |  5 ++-
 7 files changed, 84 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/Kconfig b/fs/btrfs/Kconfig
index c352f3ae0385..f41446102b14 100644
--- a/fs/btrfs/Kconfig
+++ b/fs/btrfs/Kconfig
@@ -114,6 +114,8 @@ config BTRFS_EXPERIMENTAL
 
 	  - extent tree v2 - complex rework of extent tracking
 
+	  - remap-tree - logical address remapping tree
+
 	  If unsure, say N.
 
 config BTRFS_FS_REF_VERIFY
diff --git a/fs/btrfs/accessors.h b/fs/btrfs/accessors.h
index 6e6dd664217b..1bb6c0439ba7 100644
--- a/fs/btrfs/accessors.h
+++ b/fs/btrfs/accessors.h
@@ -919,6 +919,12 @@ BTRFS_SETGET_STACK_FUNCS(super_uuid_tree_generation, struct btrfs_super_block,
 			 uuid_tree_generation, 64);
 BTRFS_SETGET_STACK_FUNCS(super_nr_global_roots, struct btrfs_super_block,
 			 nr_global_roots, 64);
+BTRFS_SETGET_STACK_FUNCS(super_remap_root, struct btrfs_super_block,
+			 remap_root, 64);
+BTRFS_SETGET_STACK_FUNCS(super_remap_root_generation, struct btrfs_super_block,
+			 remap_root_generation, 64);
+BTRFS_SETGET_STACK_FUNCS(super_remap_root_level, struct btrfs_super_block,
+			 remap_root_level, 8);
 
 /* struct btrfs_file_extent_item */
 BTRFS_SETGET_STACK_FUNCS(stack_file_extent_type, struct btrfs_file_extent_item,
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 95058c9aa31b..68fc6fea221d 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1181,6 +1181,8 @@ static struct btrfs_root *btrfs_get_global_root(struct btrfs_fs_info *fs_info,
 		return btrfs_grab_root(btrfs_global_root(fs_info, &key));
 	case BTRFS_RAID_STRIPE_TREE_OBJECTID:
 		return btrfs_grab_root(fs_info->stripe_root);
+	case BTRFS_REMAP_TREE_OBJECTID:
+		return btrfs_grab_root(fs_info->remap_root);
 	default:
 		return NULL;
 	}
@@ -1269,6 +1271,7 @@ void btrfs_free_fs_info(struct btrfs_fs_info *fs_info)
 	btrfs_put_root(fs_info->data_reloc_root);
 	btrfs_put_root(fs_info->block_group_root);
 	btrfs_put_root(fs_info->stripe_root);
+	btrfs_put_root(fs_info->remap_root);
 	btrfs_check_leaked_roots(fs_info);
 	btrfs_extent_buffer_leak_debug_check(fs_info);
 	kfree(fs_info->super_copy);
@@ -1823,6 +1826,7 @@ static void free_root_pointers(struct btrfs_fs_info *info, bool free_chunk_root)
 	free_root_extent_buffers(info->data_reloc_root);
 	free_root_extent_buffers(info->block_group_root);
 	free_root_extent_buffers(info->stripe_root);
+	free_root_extent_buffers(info->remap_root);
 	if (free_chunk_root)
 		free_root_extent_buffers(info->chunk_root);
 }
@@ -2258,6 +2262,17 @@ static int btrfs_read_roots(struct btrfs_fs_info *fs_info)
 	if (ret)
 		goto out;
 
+	if (btrfs_fs_incompat(fs_info, REMAP_TREE)) {
+		/* remap_root already loaded in load_important_roots() */
+		root = fs_info->remap_root;
+
+		set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
+
+		root->root_key.objectid = BTRFS_REMAP_TREE_OBJECTID;
+		root->root_key.type = BTRFS_ROOT_ITEM_KEY;
+		root->root_key.offset = 0;
+	}
+
 	/*
 	 * This tree can share blocks with some other fs tree during relocation
 	 * and we need a proper setup by btrfs_get_fs_root
@@ -2525,6 +2540,28 @@ int btrfs_validate_super(const struct btrfs_fs_info *fs_info,
 		ret = -EINVAL;
 	}
 
+	/* Ditto for remap_tree */
+	if (btrfs_fs_incompat(fs_info, REMAP_TREE) &&
+	    (!btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE_VALID) ||
+	     !btrfs_fs_incompat(fs_info, NO_HOLES) ||
+	     !btrfs_fs_compat_ro(fs_info, BLOCK_GROUP_TREE))) {
+		btrfs_err(fs_info,
+"remap-tree feature requires free-space-tree, no-holes, and block-group-tree");
+		ret = -EINVAL;
+	}
+
+	if (btrfs_fs_incompat(fs_info, REMAP_TREE) &&
+	    btrfs_fs_incompat(fs_info, MIXED_GROUPS)) {
+		btrfs_err(fs_info, "remap-tree not supported with mixed-bg");
+		ret = -EINVAL;
+	}
+
+	if (btrfs_fs_incompat(fs_info, REMAP_TREE) &&
+	    btrfs_fs_incompat(fs_info, ZONED)) {
+		btrfs_err(fs_info, "remap-tree not supported with zoned devices");
+		ret = -EINVAL;
+	}
+
 	/*
 	 * Hint to catch really bogus numbers, bitflips or so, more exact checks are
 	 * done later
@@ -2683,6 +2720,18 @@ static int load_important_roots(struct btrfs_fs_info *fs_info)
 		btrfs_warn(fs_info, "couldn't read tree root");
 		return ret;
 	}
+
+	if (btrfs_fs_incompat(fs_info, REMAP_TREE)) {
+		bytenr = btrfs_super_remap_root(sb);
+		gen = btrfs_super_remap_root_generation(sb);
+		level = btrfs_super_remap_root_level(sb);
+		ret = load_super_root(fs_info->remap_root, bytenr, gen, level);
+		if (ret) {
+			btrfs_warn(fs_info, "couldn't read remap root");
+			return ret;
+		}
+	}
+
 	return 0;
 }
 
@@ -3291,6 +3340,7 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
 	struct btrfs_fs_info *fs_info = btrfs_sb(sb);
 	struct btrfs_root *tree_root;
 	struct btrfs_root *chunk_root;
+	struct btrfs_root *remap_root;
 	int ret;
 	int level;
 
@@ -3325,6 +3375,16 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
 		goto fail_alloc;
 	}
 
+	if (btrfs_super_incompat_flags(disk_super) & BTRFS_FEATURE_INCOMPAT_REMAP_TREE) {
+		remap_root = btrfs_alloc_root(fs_info, BTRFS_REMAP_TREE_OBJECTID,
+					      GFP_KERNEL);
+		fs_info->remap_root = remap_root;
+		if (!remap_root) {
+			ret = -ENOMEM;
+			goto fail_alloc;
+		}
+	}
+
 	btrfs_info(fs_info, "first mount of filesystem %pU", disk_super->fsid);
 	/*
 	 * Verify the type first, if that or the checksum value are
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index cb6128778a83..065bd48fcebe 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2562,6 +2562,8 @@ static u64 get_alloc_profile_by_root(struct btrfs_root *root, int data)
 		flags = BTRFS_BLOCK_GROUP_DATA;
 	else if (root == fs_info->chunk_root)
 		flags = BTRFS_BLOCK_GROUP_SYSTEM;
+	else if (root == fs_info->remap_root)
+		flags = BTRFS_BLOCK_GROUP_REMAP;
 	else
 		flags = BTRFS_BLOCK_GROUP_METADATA;
 
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index 2dfdbfda5901..5072a2031631 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -286,7 +286,8 @@ enum {
 #define BTRFS_FEATURE_INCOMPAT_SUPP		\
 	(BTRFS_FEATURE_INCOMPAT_SUPP_STABLE |	\
 	 BTRFS_FEATURE_INCOMPAT_RAID_STRIPE_TREE | \
-	 BTRFS_FEATURE_INCOMPAT_EXTENT_TREE_V2)
+	 BTRFS_FEATURE_INCOMPAT_EXTENT_TREE_V2 | \
+	 BTRFS_FEATURE_INCOMPAT_REMAP_TREE)
 
 #else
 
@@ -434,6 +435,7 @@ struct btrfs_fs_info {
 	struct btrfs_root *data_reloc_root;
 	struct btrfs_root *block_group_root;
 	struct btrfs_root *stripe_root;
+	struct btrfs_root *remap_root;
 
 	/* The log root tree is a directory of all the other log roots */
 	struct btrfs_root *log_root_tree;
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index b96195d6480f..68525e833753 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1951,6 +1951,13 @@ static void update_super_roots(struct btrfs_fs_info *fs_info)
 		super->cache_generation = 0;
 	if (test_bit(BTRFS_FS_UPDATE_UUID_TREE_GEN, &fs_info->flags))
 		super->uuid_tree_generation = root_item->generation;
+
+	if (btrfs_fs_incompat(fs_info, REMAP_TREE)) {
+		root_item = &fs_info->remap_root->root_item;
+		super->remap_root = root_item->bytenr;
+		super->remap_root_generation = root_item->generation;
+		super->remap_root_level = root_item->level;
+	}
 }
 
 int btrfs_transaction_blocked(struct btrfs_fs_info *info)
diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index 500e3a7df90b..89bcb80081a6 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -721,9 +721,12 @@ struct btrfs_super_block {
 	__u8 metadata_uuid[BTRFS_FSID_SIZE];
 
 	__u64 nr_global_roots;
+	__le64 remap_root;
+	__le64 remap_root_generation;
+	__u8 remap_root_level;
 
 	/* Future expansion */
-	__le64 reserved[27];
+	__u8 reserved[199];
 	__u8 sys_chunk_array[BTRFS_SYSTEM_CHUNK_ARRAY_SIZE];
 	struct btrfs_root_backup super_roots[BTRFS_NUM_BACKUP_ROOTS];
 
-- 
2.49.0



* [RFC PATCH 06/10] btrfs: redirect I/O for remapped block groups
  2025-05-15 16:36 [RFC PATCH 00/10] Remap tree Mark Harmstone
                   ` (4 preceding siblings ...)
  2025-05-15 16:36 ` [RFC PATCH 05/10] btrfs: allow mounting filesystems with remap-tree incompat flag Mark Harmstone
@ 2025-05-15 16:36 ` Mark Harmstone
  2025-05-23 10:09   ` Qu Wenruo
  2025-05-15 16:36 ` [RFC PATCH 07/10] btrfs: handle deletions from remapped block group Mark Harmstone
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 20+ messages in thread
From: Mark Harmstone @ 2025-05-15 16:36 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Mark Harmstone

Change btrfs_map_block() so that if the block group has the REMAPPED
flag set, we call btrfs_translate_remap() to obtain a new address.

btrfs_translate_remap() searches the remap tree for a range
corresponding to the logical address passed to btrfs_map_block(). If it
is within an identity remap, this part of the block group hasn't yet
been relocated, and so we use the existing address.

If it is within an actual remap, we subtract the start of the remap
range and add the address of its destination, contained in the item's
payload.

Signed-off-by: Mark Harmstone <maharmstone@fb.com>
---
 fs/btrfs/ctree.c      | 11 ++++---
 fs/btrfs/ctree.h      |  3 ++
 fs/btrfs/relocation.c | 75 +++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/relocation.h |  2 ++
 fs/btrfs/volumes.c    | 19 +++++++++++
 5 files changed, 105 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index a2e7979372cc..7808f7bc2303 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -2331,7 +2331,8 @@ int btrfs_search_old_slot(struct btrfs_root *root, const struct btrfs_key *key,
  * This may release the path, and so you may lose any locks held at the
  * time you call it.
  */
-static int btrfs_prev_leaf(struct btrfs_root *root, struct btrfs_path *path)
+int btrfs_prev_leaf(struct btrfs_trans_handle *trans, struct btrfs_root *root,
+		    struct btrfs_path *path, int ins_len, int cow)
 {
 	struct btrfs_key key;
 	struct btrfs_key orig_key;
@@ -2355,7 +2356,7 @@ static int btrfs_prev_leaf(struct btrfs_root *root, struct btrfs_path *path)
 	}
 
 	btrfs_release_path(path);
-	ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+	ret = btrfs_search_slot(trans, root, &key, path, ins_len, cow);
 	if (ret <= 0)
 		return ret;
 
@@ -2454,7 +2455,7 @@ int btrfs_search_slot_for_read(struct btrfs_root *root,
 		}
 	} else {
 		if (p->slots[0] == 0) {
-			ret = btrfs_prev_leaf(root, p);
+			ret = btrfs_prev_leaf(NULL, root, p, 0, 0);
 			if (ret < 0)
 				return ret;
 			if (!ret) {
@@ -5003,7 +5004,7 @@ int btrfs_previous_item(struct btrfs_root *root,
 
 	while (1) {
 		if (path->slots[0] == 0) {
-			ret = btrfs_prev_leaf(root, path);
+			ret = btrfs_prev_leaf(NULL, root, path, 0, 0);
 			if (ret != 0)
 				return ret;
 		} else {
@@ -5044,7 +5045,7 @@ int btrfs_previous_extent_item(struct btrfs_root *root,
 
 	while (1) {
 		if (path->slots[0] == 0) {
-			ret = btrfs_prev_leaf(root, path);
+			ret = btrfs_prev_leaf(NULL, root, path, 0, 0);
 			if (ret != 0)
 				return ret;
 		} else {
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 075a06db43a1..90a0d38a31c9 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -721,6 +721,9 @@ static inline int btrfs_next_leaf(struct btrfs_root *root, struct btrfs_path *pa
 	return btrfs_next_old_leaf(root, path, 0);
 }
 
+int btrfs_prev_leaf(struct btrfs_trans_handle *trans, struct btrfs_root *root,
+		    struct btrfs_path *path, int ins_len, int cow);
+
 static inline int btrfs_next_item(struct btrfs_root *root, struct btrfs_path *p)
 {
 	return btrfs_next_old_item(root, p, 0);
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 02086191630d..e5571c897906 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -3897,6 +3897,81 @@ static const char *stage_to_string(enum reloc_stage stage)
 	return "unknown";
 }
 
+int btrfs_translate_remap(struct btrfs_fs_info *fs_info, u64 *logical,
+			  u64 *length)
+{
+	int ret;
+	struct btrfs_key key, found_key;
+	struct extent_buffer *leaf;
+	struct btrfs_remap *remap;
+	BTRFS_PATH_AUTO_FREE(path);
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	key.objectid = *logical;
+	key.type = BTRFS_IDENTITY_REMAP_KEY;
+	key.offset = 0;
+
+	ret = btrfs_search_slot(NULL, fs_info->remap_root, &key, path,
+				0, 0);
+	if (ret < 0)
+		return ret;
+
+	leaf = path->nodes[0];
+
+	if (path->slots[0] >= btrfs_header_nritems(leaf)) {
+		ret = btrfs_next_leaf(fs_info->remap_root, path);
+		if (ret < 0)
+			return ret;
+
+		leaf = path->nodes[0];
+	}
+
+	btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
+
+	if (found_key.objectid > *logical) {
+		if (path->slots[0] == 0) {
+			ret = btrfs_prev_leaf(NULL, fs_info->remap_root, path,
+					      0, 0);
+			if (ret) {
+				if (ret == 1)
+					ret = -ENOENT;
+				return ret;
+			}
+
+			leaf = path->nodes[0];
+		} else {
+			path->slots[0]--;
+		}
+
+		btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
+	}
+
+	if (found_key.type != BTRFS_REMAP_KEY &&
+	    found_key.type != BTRFS_IDENTITY_REMAP_KEY) {
+		return -ENOENT;
+	}
+
+	if (found_key.objectid > *logical ||
+	    found_key.objectid + found_key.offset <= *logical) {
+		return -ENOENT;
+	}
+
+	if (*logical + *length > found_key.objectid + found_key.offset)
+		*length = found_key.objectid + found_key.offset - *logical;
+
+	if (found_key.type == BTRFS_IDENTITY_REMAP_KEY)
+		return 0;
+
+	remap = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_remap);
+
+	*logical = *logical - found_key.objectid + btrfs_remap_address(leaf, remap);
+
+	return 0;
+}
+
 /*
  * function to relocate all extents in a block group.
  */
diff --git a/fs/btrfs/relocation.h b/fs/btrfs/relocation.h
index 788c86d8633a..f07dbd9a89c6 100644
--- a/fs/btrfs/relocation.h
+++ b/fs/btrfs/relocation.h
@@ -30,5 +30,7 @@ int btrfs_should_cancel_balance(const struct btrfs_fs_info *fs_info);
 struct btrfs_root *find_reloc_root(struct btrfs_fs_info *fs_info, u64 bytenr);
 bool btrfs_should_ignore_reloc_root(const struct btrfs_root *root);
 u64 btrfs_get_reloc_bg_bytenr(const struct btrfs_fs_info *fs_info);
+int btrfs_translate_remap(struct btrfs_fs_info *fs_info, u64 *logical,
+			  u64 *length);
 
 #endif
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 77194bb46b40..4777926213c0 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6620,6 +6620,25 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
 	if (IS_ERR(map))
 		return PTR_ERR(map);
 
+	if (map->type & BTRFS_BLOCK_GROUP_REMAPPED) {
+		u64 new_logical = logical;
+
+		ret = btrfs_translate_remap(fs_info, &new_logical, length);
+		if (ret)
+			return ret;
+
+		if (new_logical != logical) {
+			btrfs_free_chunk_map(map);
+
+			map = btrfs_get_chunk_map(fs_info, new_logical,
+						  *length);
+			if (IS_ERR(map))
+				return PTR_ERR(map);
+
+			logical = new_logical;
+		}
+	}
+
 	num_copies = btrfs_chunk_map_num_copies(map);
 	if (io_geom.mirror_num > num_copies)
 		return -EINVAL;
-- 
2.49.0



* [RFC PATCH 07/10] btrfs: handle deletions from remapped block group
  2025-05-15 16:36 [RFC PATCH 00/10] Remap tree Mark Harmstone
                   ` (5 preceding siblings ...)
  2025-05-15 16:36 ` [RFC PATCH 06/10] btrfs: redirect I/O for remapped block groups Mark Harmstone
@ 2025-05-15 16:36 ` Mark Harmstone
  2025-05-15 16:36 ` [RFC PATCH 08/10] btrfs: handle setting up relocation of block group with remap-tree Mark Harmstone
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 20+ messages in thread
From: Mark Harmstone @ 2025-05-15 16:36 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Mark Harmstone

Handle the case where we free an extent from a block group that has the
REMAPPED flag set. Because the remap tree is orthogonal to the extent
tree, for data this may be within any number of identity remaps or
actual remaps. If we're freeing a metadata node, this will be wholly
inside one or the other.

btrfs_remove_extent_from_remap_tree() searches the remap tree for the
remaps that cover the range in question, then calls
remove_range_from_remap_tree() for each one, to punch a hole in the
remap and adjust the free-space tree.

For an identity remap, remove_range_from_remap_tree() will adjust the
block group's `identity_remap_count` if this changes. If it reaches
zero we call last_identity_remap_gone(), which removes the chunk's
stripes and device extents - it is now fully remapped.

The changes involving the block group's ro flag are there because the
REMAPPED flag itself prevents a block group from receiving any new
allocations, so we don't need to account for this separately.

Signed-off-by: Mark Harmstone <maharmstone@fb.com>
---
 fs/btrfs/block-group.c |  80 ++++---
 fs/btrfs/block-group.h |   1 +
 fs/btrfs/disk-io.c     |   1 +
 fs/btrfs/extent-tree.c |  30 ++-
 fs/btrfs/fs.h          |   1 +
 fs/btrfs/relocation.c  | 512 +++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/relocation.h  |   3 +
 fs/btrfs/volumes.c     |  56 +++--
 fs/btrfs/volumes.h     |   6 +
 9 files changed, 635 insertions(+), 55 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 6a2aa792ccb2..ce5ad7bf8025 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1048,6 +1048,32 @@ static int remove_block_group_item(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
+void btrfs_remove_bg_from_sinfo(struct btrfs_block_group *block_group)
+{
+	int factor = btrfs_bg_type_to_factor(block_group->flags);
+
+	spin_lock(&block_group->space_info->lock);
+
+	if (btrfs_test_opt(block_group->fs_info, ENOSPC_DEBUG)) {
+		WARN_ON(block_group->space_info->total_bytes
+			< block_group->length);
+		WARN_ON(block_group->space_info->bytes_readonly
+			< block_group->length - block_group->zone_unusable);
+		WARN_ON(block_group->space_info->bytes_zone_unusable
+			< block_group->zone_unusable);
+		WARN_ON(block_group->space_info->disk_total
+			< block_group->length * factor);
+	}
+	block_group->space_info->total_bytes -= block_group->length;
+	block_group->space_info->bytes_readonly -=
+		(block_group->length - block_group->zone_unusable);
+	btrfs_space_info_update_bytes_zone_unusable(block_group->space_info,
+						    -block_group->zone_unusable);
+	block_group->space_info->disk_total -= block_group->length * factor;
+
+	spin_unlock(&block_group->space_info->lock);
+}
+
 int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 			     struct btrfs_chunk_map *map)
 {
@@ -1059,7 +1085,6 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 	struct kobject *kobj = NULL;
 	int ret;
 	int index;
-	int factor;
 	struct btrfs_caching_control *caching_ctl = NULL;
 	bool remove_map;
 	bool remove_rsv = false;
@@ -1068,7 +1093,7 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 	if (!block_group)
 		return -ENOENT;
 
-	BUG_ON(!block_group->ro);
+	BUG_ON(!block_group->ro && !(block_group->flags & BTRFS_BLOCK_GROUP_REMAPPED));
 
 	trace_btrfs_remove_block_group(block_group);
 	/*
@@ -1080,7 +1105,6 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 				  block_group->length);
 
 	index = btrfs_bg_flags_to_raid_index(block_group->flags);
-	factor = btrfs_bg_type_to_factor(block_group->flags);
 
 	/* make sure this block group isn't part of an allocation cluster */
 	cluster = &fs_info->data_alloc_cluster;
@@ -1204,26 +1228,11 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 
 	spin_lock(&block_group->space_info->lock);
 	list_del_init(&block_group->ro_list);
-
-	if (btrfs_test_opt(fs_info, ENOSPC_DEBUG)) {
-		WARN_ON(block_group->space_info->total_bytes
-			< block_group->length);
-		WARN_ON(block_group->space_info->bytes_readonly
-			< block_group->length - block_group->zone_unusable);
-		WARN_ON(block_group->space_info->bytes_zone_unusable
-			< block_group->zone_unusable);
-		WARN_ON(block_group->space_info->disk_total
-			< block_group->length * factor);
-	}
-	block_group->space_info->total_bytes -= block_group->length;
-	block_group->space_info->bytes_readonly -=
-		(block_group->length - block_group->zone_unusable);
-	btrfs_space_info_update_bytes_zone_unusable(block_group->space_info,
-						    -block_group->zone_unusable);
-	block_group->space_info->disk_total -= block_group->length * factor;
-
 	spin_unlock(&block_group->space_info->lock);
 
+	if (!(block_group->flags & BTRFS_BLOCK_GROUP_REMAPPED))
+		btrfs_remove_bg_from_sinfo(block_group);
+
 	/*
 	 * Remove the free space for the block group from the free space tree
 	 * and the block group's item from the extent tree before marking the
@@ -1508,6 +1517,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 	while (!list_empty(&fs_info->unused_bgs)) {
 		u64 used;
 		int trimming;
+		bool made_ro = false;
 
 		block_group = list_first_entry(&fs_info->unused_bgs,
 					       struct btrfs_block_group,
@@ -1544,7 +1554,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 
 		spin_lock(&space_info->lock);
 		spin_lock(&block_group->lock);
-		if (btrfs_is_block_group_used(block_group) || block_group->ro ||
+		if (btrfs_is_block_group_used(block_group) ||
+		    (block_group->ro && !(block_group->flags & BTRFS_BLOCK_GROUP_REMAPPED)) ||
 		    list_is_singular(&block_group->list)) {
 			/*
 			 * We want to bail if we made new allocations or have
@@ -1587,7 +1598,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 		 */
 		used = btrfs_space_info_used(space_info, true);
 		if (space_info->total_bytes - block_group->length < used &&
-		    block_group->zone_unusable < block_group->length) {
+		    block_group->zone_unusable < block_group->length &&
+		    !(block_group->flags & BTRFS_BLOCK_GROUP_REMAPPED)) {
 			/*
 			 * Add a reference for the list, compensate for the ref
 			 * drop under the "next" label for the
@@ -1605,8 +1617,14 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 		spin_unlock(&block_group->lock);
 		spin_unlock(&space_info->lock);
 
-		/* We don't want to force the issue, only flip if it's ok. */
-		ret = inc_block_group_ro(block_group, 0);
+		if (!(block_group->flags & BTRFS_BLOCK_GROUP_REMAPPED)) {
+			/* We don't want to force the issue, only flip if it's ok. */
+			ret = inc_block_group_ro(block_group, 0);
+			made_ro = true;
+		} else {
+			ret = 0;
+		}
+
 		up_write(&space_info->groups_sem);
 		if (ret < 0) {
 			ret = 0;
@@ -1615,7 +1633,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 
 		ret = btrfs_zone_finish(block_group);
 		if (ret < 0) {
-			btrfs_dec_block_group_ro(block_group);
+			if (made_ro)
+				btrfs_dec_block_group_ro(block_group);
 			if (ret == -EAGAIN)
 				ret = 0;
 			goto next;
@@ -1628,7 +1647,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 		trans = btrfs_start_trans_remove_block_group(fs_info,
 						     block_group->start);
 		if (IS_ERR(trans)) {
-			btrfs_dec_block_group_ro(block_group);
+			if (made_ro)
+				btrfs_dec_block_group_ro(block_group);
 			ret = PTR_ERR(trans);
 			goto next;
 		}
@@ -1638,7 +1658,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 		 * just delete them, we don't care about them anymore.
 		 */
 		if (!clean_pinned_extents(trans, block_group)) {
-			btrfs_dec_block_group_ro(block_group);
+			if (made_ro)
+				btrfs_dec_block_group_ro(block_group);
 			goto end_trans;
 		}
 
@@ -1652,7 +1673,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 		spin_lock(&fs_info->discard_ctl.lock);
 		if (!list_empty(&block_group->discard_list)) {
 			spin_unlock(&fs_info->discard_ctl.lock);
-			btrfs_dec_block_group_ro(block_group);
+			if (made_ro)
+				btrfs_dec_block_group_ro(block_group);
 			btrfs_discard_queue_work(&fs_info->discard_ctl,
 						 block_group);
 			goto end_trans;
diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index c484118b8b8d..767898929960 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -329,6 +329,7 @@ int btrfs_add_new_free_space(struct btrfs_block_group *block_group,
 struct btrfs_trans_handle *btrfs_start_trans_remove_block_group(
 				struct btrfs_fs_info *fs_info,
 				const u64 chunk_offset);
+void btrfs_remove_bg_from_sinfo(struct btrfs_block_group *block_group);
 int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 			     struct btrfs_chunk_map *map);
 void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 68fc6fea221d..55be43bc50d5 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2923,6 +2923,7 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
 	mutex_init(&fs_info->chunk_mutex);
 	mutex_init(&fs_info->transaction_kthread_mutex);
 	mutex_init(&fs_info->cleaner_mutex);
+	mutex_init(&fs_info->remap_mutex);
 	mutex_init(&fs_info->ro_block_group_mutex);
 	init_rwsem(&fs_info->commit_root_sem);
 	init_rwsem(&fs_info->cleanup_work_sem);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 065bd48fcebe..857a06553b19 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -40,6 +40,7 @@
 #include "orphan.h"
 #include "tree-checker.h"
 #include "raid-stripe-tree.h"
+#include "relocation.h"
 
 #undef SCRAMBLE_DELAYED_REFS
 
@@ -2975,6 +2976,8 @@ static int do_free_extent_accounting(struct btrfs_trans_handle *trans,
 				     u64 bytenr, struct btrfs_squota_delta *delta)
 {
 	int ret;
+	struct btrfs_block_group *bg;
+	bool bg_is_remapped = false;
 	u64 num_bytes = delta->num_bytes;
 
 	if (delta->is_data) {
@@ -3000,10 +3003,22 @@ static int do_free_extent_accounting(struct btrfs_trans_handle *trans,
 		return ret;
 	}
 
-	ret = add_to_free_space_tree(trans, bytenr, num_bytes);
-	if (ret) {
-		btrfs_abort_transaction(trans, ret);
-		return ret;
+	if (btrfs_fs_incompat(trans->fs_info, REMAP_TREE)) {
+		bg = btrfs_lookup_block_group(trans->fs_info, bytenr);
+		bg_is_remapped = bg->flags & BTRFS_BLOCK_GROUP_REMAPPED;
+		btrfs_put_block_group(bg);
+	}
+
+	/*
+	 * If remapped, FST has already been taken care of in
+	 * remove_range_from_remap_tree().
+	 */
+	if (!bg_is_remapped) {
+		ret = add_to_free_space_tree(trans, bytenr, num_bytes);
+		if (ret) {
+			btrfs_abort_transaction(trans, ret);
+			return ret;
+		}
 	}
 
 	ret = btrfs_update_block_group(trans, bytenr, num_bytes, false);
@@ -3369,6 +3384,13 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
 		}
 		btrfs_release_path(path);
 
+		ret = btrfs_remove_extent_from_remap_tree(trans, path, bytenr,
+							  num_bytes);
+		if (ret) {
+			btrfs_abort_transaction(trans, ret);
+			goto out;
+		}
+
 		ret = do_free_extent_accounting(trans, bytenr, &delta);
 	}
 	btrfs_release_path(path);
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index 5072a2031631..4577768e55c2 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -543,6 +543,7 @@ struct btrfs_fs_info {
 	struct mutex transaction_kthread_mutex;
 	struct mutex cleaner_mutex;
 	struct mutex chunk_mutex;
+	struct mutex remap_mutex;
 
 	/*
 	 * This is taken to make sure we don't set block groups ro after the
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index e5571c897906..d1bccae74703 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -37,6 +37,7 @@
 #include "super.h"
 #include "tree-checker.h"
 #include "raid-stripe-tree.h"
+#include "free-space-tree.h"
 
 /*
  * Relocation overview
@@ -3897,6 +3898,150 @@ static const char *stage_to_string(enum reloc_stage stage)
 	return "unknown";
 }
 
+static void adjust_block_group_remap_bytes(struct btrfs_trans_handle *trans,
+					   struct btrfs_block_group *bg,
+					   s64 diff)
+{
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	bool bg_already_dirty = true;
+
+	bg->remap_bytes += diff;
+
+	if (bg->used == 0 && bg->remap_bytes == 0)
+		btrfs_mark_bg_unused(bg);
+
+	spin_lock(&trans->transaction->dirty_bgs_lock);
+	if (list_empty(&bg->dirty_list)) {
+		list_add_tail(&bg->dirty_list, &trans->transaction->dirty_bgs);
+		bg_already_dirty = false;
+		btrfs_get_block_group(bg);
+	}
+	spin_unlock(&trans->transaction->dirty_bgs_lock);
+
+	/* Modified block groups are accounted for in the delayed_refs_rsv. */
+	if (!bg_already_dirty)
+		btrfs_inc_delayed_refs_rsv_bg_updates(fs_info);
+}
+
+static int remove_chunk_stripes(struct btrfs_trans_handle *trans,
+				struct btrfs_chunk_map *chunk,
+				struct btrfs_path *path)
+{
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	struct btrfs_key key;
+	struct extent_buffer *leaf;
+	struct btrfs_chunk *c;
+	int ret;
+
+	key.objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID;
+	key.type = BTRFS_CHUNK_ITEM_KEY;
+	key.offset = chunk->start;
+
+	ret = btrfs_search_slot(trans, fs_info->chunk_root, &key, path,
+				0, 1);
+	if (ret) {
+		if (ret == 1) {
+			btrfs_release_path(path);
+			ret = -ENOENT;
+		}
+		return ret;
+	}
+
+	leaf = path->nodes[0];
+
+	c = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_chunk);
+	btrfs_set_chunk_num_stripes(leaf, c, 0);
+
+	btrfs_truncate_item(trans, path, offsetof(struct btrfs_chunk, stripe),
+			    1);
+
+	btrfs_mark_buffer_dirty(trans, leaf);
+
+	btrfs_release_path(path);
+
+	chunk->num_stripes = 0;
+
+	return 0;
+}
+
+static int last_identity_remap_gone(struct btrfs_trans_handle *trans,
+				    struct btrfs_chunk_map *chunk,
+				    struct btrfs_block_group *bg,
+				    struct btrfs_path *path)
+{
+	int ret;
+
+	ret = btrfs_remove_dev_extents(trans, chunk);
+	if (ret)
+		return ret;
+
+	mutex_lock(&trans->fs_info->chunk_mutex);
+
+	for (unsigned int i = 0; i < chunk->num_stripes; i++) {
+		ret = btrfs_update_device(trans, chunk->stripes[i].dev);
+		if (ret) {
+			mutex_unlock(&trans->fs_info->chunk_mutex);
+			return ret;
+		}
+	}
+
+	mutex_unlock(&trans->fs_info->chunk_mutex);
+
+	write_lock(&trans->fs_info->mapping_tree_lock);
+	btrfs_chunk_map_device_clear_bits(chunk, CHUNK_ALLOCATED);
+	write_unlock(&trans->fs_info->mapping_tree_lock);
+
+	btrfs_remove_bg_from_sinfo(bg);
+
+	ret = remove_chunk_stripes(trans, chunk, path);
+	if (ret)
+		return ret;
+
+	return 0;
+}
+
+static int adjust_identity_remap_count(struct btrfs_trans_handle *trans,
+				       struct btrfs_path *path,
+				       struct btrfs_block_group *bg, int delta)
+{
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	struct btrfs_chunk_map *chunk;
+	bool bg_already_dirty = true;
+	int ret;
+
+	WARN_ON(delta < 0 && -delta > bg->identity_remap_count);
+
+	bg->identity_remap_count += delta;
+
+	spin_lock(&trans->transaction->dirty_bgs_lock);
+	if (list_empty(&bg->dirty_list)) {
+		list_add_tail(&bg->dirty_list, &trans->transaction->dirty_bgs);
+		bg_already_dirty = false;
+		btrfs_get_block_group(bg);
+	}
+	spin_unlock(&trans->transaction->dirty_bgs_lock);
+
+	/* Modified block groups are accounted for in the delayed_refs_rsv. */
+	if (!bg_already_dirty)
+		btrfs_inc_delayed_refs_rsv_bg_updates(fs_info);
+
+	if (bg->identity_remap_count != 0)
+		return 0;
+
+	chunk = btrfs_find_chunk_map(fs_info, bg->start, 1);
+	if (!chunk)
+		return -ENOENT;
+
+	ret = last_identity_remap_gone(trans, chunk, bg, path);
+	if (ret)
+		goto end;
+
+	ret = 0;
+end:
+	btrfs_free_chunk_map(chunk);
+	return ret;
+}
+
 int btrfs_translate_remap(struct btrfs_fs_info *fs_info, u64 *logical,
 			  u64 *length)
 {
@@ -4529,3 +4674,370 @@ u64 btrfs_get_reloc_bg_bytenr(const struct btrfs_fs_info *fs_info)
 		logical = fs_info->reloc_ctl->block_group->start;
 	return logical;
 }
+
+static int remove_range_from_remap_tree(struct btrfs_trans_handle *trans,
+					struct btrfs_path *path,
+					struct btrfs_block_group *bg,
+					u64 *bytenr, u64 *num_bytes)
+{
+	int ret;
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	struct extent_buffer *leaf = path->nodes[0];
+	struct btrfs_key key, new_key;
+	struct btrfs_remap *remap_ptr = NULL, remap;
+	struct btrfs_block_group *dest_bg = NULL;
+	u64 end, new_addr = 0, remap_start, remap_length, overlap_length;
+
+	end = *bytenr + *num_bytes;
+
+	btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
+
+	remap_start = key.objectid;
+	remap_length = key.offset;
+
+	if (key.type != BTRFS_IDENTITY_REMAP_KEY) {
+		remap_ptr = btrfs_item_ptr(leaf, path->slots[0],
+					   struct btrfs_remap);
+		new_addr = btrfs_remap_address(leaf, remap_ptr);
+
+		dest_bg = btrfs_lookup_block_group(fs_info, new_addr);
+	}
+
+	if (*bytenr == remap_start && *num_bytes >= remap_length) {
+		/* Remove entirely. */
+
+		ret = btrfs_del_item(trans, fs_info->remap_root, path);
+		if (ret)
+			goto end;
+
+		btrfs_release_path(path);
+
+		overlap_length = remap_length;
+
+		if (key.type != BTRFS_IDENTITY_REMAP_KEY) {
+			/* Remove backref. */
+
+			key.objectid = new_addr;
+			key.type = BTRFS_REMAP_BACKREF_KEY;
+			key.offset = remap_length;
+
+			ret = btrfs_search_slot(trans, fs_info->remap_root,
+						&key, path, 0, 1);
+			if (ret) {
+				if (ret == 1) {
+					btrfs_release_path(path);
+					ret = -ENOENT;
+				}
+				goto end;
+			}
+
+			ret = btrfs_del_item(trans, fs_info->remap_root, path);
+
+			btrfs_release_path(path);
+
+			if (ret)
+				goto end;
+
+			adjust_block_group_remap_bytes(trans, dest_bg,
+						       -remap_length);
+		} else {
+			ret = adjust_identity_remap_count(trans, path, bg, -1);
+			if (ret)
+				goto end;
+		}
+	} else if (*bytenr == remap_start) {
+		/* Remove beginning. */
+
+		new_key.objectid = end;
+		new_key.type = key.type;
+		new_key.offset = remap_length + remap_start - end;
+
+		btrfs_set_item_key_safe(trans, path, &new_key);
+		btrfs_mark_buffer_dirty(trans, leaf);
+
+		overlap_length = *num_bytes;
+
+		if (key.type != BTRFS_IDENTITY_REMAP_KEY) {
+			btrfs_set_remap_address(leaf, remap_ptr,
+						new_addr + end - remap_start);
+			btrfs_release_path(path);
+
+			/* Adjust backref. */
+
+			key.objectid = new_addr;
+			key.type = BTRFS_REMAP_BACKREF_KEY;
+			key.offset = remap_length;
+
+			ret = btrfs_search_slot(trans, fs_info->remap_root,
+						&key, path, 0, 1);
+			if (ret) {
+				if (ret == 1) {
+					btrfs_release_path(path);
+					ret = -ENOENT;
+				}
+				goto end;
+			}
+
+			leaf = path->nodes[0];
+
+			new_key.objectid = new_addr + end - remap_start;
+			new_key.type = BTRFS_REMAP_BACKREF_KEY;
+			new_key.offset = remap_length + remap_start - end;
+
+			btrfs_set_item_key_safe(trans, path, &new_key);
+
+			remap_ptr = btrfs_item_ptr(leaf, path->slots[0],
+						   struct btrfs_remap);
+			btrfs_set_remap_address(leaf, remap_ptr, end);
+
+			btrfs_mark_buffer_dirty(trans, path->nodes[0]);
+
+			btrfs_release_path(path);
+
+			adjust_block_group_remap_bytes(trans, dest_bg,
+						       -*num_bytes);
+		}
+	} else if (*bytenr + *num_bytes < remap_start + remap_length) {
+		/* Remove middle. */
+
+		new_key.objectid = remap_start;
+		new_key.type = key.type;
+		new_key.offset = *bytenr - remap_start;
+
+		btrfs_set_item_key_safe(trans, path, &new_key);
+		btrfs_mark_buffer_dirty(trans, leaf);
+
+		new_key.objectid = end;
+		new_key.offset = remap_start + remap_length - end;
+
+		btrfs_release_path(path);
+
+		overlap_length = *num_bytes;
+
+		if (key.type != BTRFS_IDENTITY_REMAP_KEY) {
+			/* Add second remap entry. */
+
+			ret = btrfs_insert_empty_item(trans, fs_info->remap_root,
+						path, &new_key,
+						sizeof(struct btrfs_remap));
+			if (ret)
+				goto end;
+
+			btrfs_set_stack_remap_address(&remap,
+						new_addr + end - remap_start);
+
+			write_extent_buffer(path->nodes[0], &remap,
+				btrfs_item_ptr_offset(path->nodes[0], path->slots[0]),
+				sizeof(struct btrfs_remap));
+
+			btrfs_release_path(path);
+
+			/* Shorten backref entry. */
+
+			key.objectid = new_addr;
+			key.type = BTRFS_REMAP_BACKREF_KEY;
+			key.offset = remap_length;
+
+			ret = btrfs_search_slot(trans, fs_info->remap_root,
+						&key, path, 0, 1);
+			if (ret) {
+				if (ret == 1) {
+					btrfs_release_path(path);
+					ret = -ENOENT;
+				}
+				goto end;
+			}
+
+			new_key.objectid = new_addr;
+			new_key.type = BTRFS_REMAP_BACKREF_KEY;
+			new_key.offset = *bytenr - remap_start;
+
+			btrfs_set_item_key_safe(trans, path, &new_key);
+			btrfs_mark_buffer_dirty(trans, path->nodes[0]);
+
+			btrfs_release_path(path);
+
+			/* Add second backref entry. */
+
+			new_key.objectid = new_addr + end - remap_start;
+			new_key.type = BTRFS_REMAP_BACKREF_KEY;
+			new_key.offset = remap_start + remap_length - end;
+
+			ret = btrfs_insert_empty_item(trans, fs_info->remap_root,
+						path, &new_key,
+						sizeof(struct btrfs_remap));
+			if (ret)
+				goto end;
+
+			btrfs_set_stack_remap_address(&remap, end);
+
+			write_extent_buffer(path->nodes[0], &remap,
+				btrfs_item_ptr_offset(path->nodes[0], path->slots[0]),
+				sizeof(struct btrfs_remap));
+
+			btrfs_release_path(path);
+
+			adjust_block_group_remap_bytes(trans, dest_bg,
+						       -*num_bytes);
+		} else {
+			/* Add second identity remap entry. */
+
+			ret = btrfs_insert_empty_item(trans, fs_info->remap_root,
+						      path, &new_key, 0);
+			if (ret)
+				goto end;
+
+			btrfs_release_path(path);
+
+			ret = adjust_identity_remap_count(trans, path, bg, 1);
+			if (ret)
+				goto end;
+		}
+	} else {
+		/* Remove end. */
+
+		new_key.objectid = remap_start;
+		new_key.type = key.type;
+		new_key.offset = *bytenr - remap_start;
+
+		btrfs_set_item_key_safe(trans, path, &new_key);
+		btrfs_mark_buffer_dirty(trans, leaf);
+
+		btrfs_release_path(path);
+
+		overlap_length = remap_start + remap_length - *bytenr;
+
+		if (key.type != BTRFS_IDENTITY_REMAP_KEY) {
+			/* Shorten backref entry. */
+
+			key.objectid = new_addr;
+			key.type = BTRFS_REMAP_BACKREF_KEY;
+			key.offset = remap_length;
+
+			ret = btrfs_search_slot(trans, fs_info->remap_root,
+						&key, path, 0, 1);
+			if (ret) {
+				if (ret == 1) {
+					btrfs_release_path(path);
+					ret = -ENOENT;
+				}
+				goto end;
+			}
+
+			new_key.objectid = new_addr;
+			new_key.type = BTRFS_REMAP_BACKREF_KEY;
+			new_key.offset = *bytenr - remap_start;
+
+			btrfs_set_item_key_safe(trans, path, &new_key);
+			btrfs_mark_buffer_dirty(trans, path->nodes[0]);
+
+			btrfs_release_path(path);
+
+			adjust_block_group_remap_bytes(trans, dest_bg,
+					*bytenr - remap_start - remap_length);
+		}
+	}
+
+	if (key.type != BTRFS_IDENTITY_REMAP_KEY) {
+		ret = add_to_free_space_tree(trans,
+					     *bytenr - remap_start + new_addr,
+					     overlap_length);
+	} else {
+		ret = add_to_free_space_tree(trans, *bytenr, overlap_length);
+	}
+
+	*bytenr += overlap_length;
+	*num_bytes -= overlap_length;
+
+end:
+	if (dest_bg)
+		btrfs_put_block_group(dest_bg);
+
+	return ret;
+}
+
+int btrfs_remove_extent_from_remap_tree(struct btrfs_trans_handle *trans,
+					struct btrfs_path *path,
+					u64 bytenr, u64 num_bytes)
+{
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	struct btrfs_key key, found_key;
+	struct extent_buffer *leaf;
+	struct btrfs_block_group *bg;
+	int ret;
+
+	if (!(btrfs_super_incompat_flags(fs_info->super_copy) &
+	      BTRFS_FEATURE_INCOMPAT_REMAP_TREE))
+		return 0;
+
+	bg = btrfs_lookup_block_group(fs_info, bytenr);
+	if (!bg)
+		return 0;
+
+	if (!(bg->flags & BTRFS_BLOCK_GROUP_REMAPPED)) {
+		btrfs_put_block_group(bg);
+		return 0;
+	}
+
+	mutex_lock(&fs_info->remap_mutex);
+
+	do {
+		key.objectid = bytenr;
+		key.type = (u8)-1;
+		key.offset = (u64)-1;
+
+		ret = btrfs_search_slot(trans, fs_info->remap_root, &key, path,
+					0, 1);
+		if (ret < 0)
+			goto end;
+
+		leaf = path->nodes[0];
+
+		btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
+
+		if (found_key.objectid > bytenr ||
+		    path->slots[0] == btrfs_header_nritems(leaf)) {
+			if (path->slots[0] == 0) {
+				ret = btrfs_prev_leaf(trans, fs_info->remap_root,
+						      path, 0, 1);
+				if (ret) {
+					if (ret == 1)
+						ret = -ENOENT;
+					goto end;
+				}
+
+				leaf = path->nodes[0];
+			} else {
+				path->slots[0]--;
+			}
+
+			btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
+		}
+
+		if (found_key.type != BTRFS_IDENTITY_REMAP_KEY &&
+		    found_key.type != BTRFS_REMAP_KEY) {
+			ret = -ENOENT;
+			goto end;
+		}
+
+		if (bytenr < found_key.objectid ||
+		    bytenr >= found_key.objectid + found_key.offset) {
+			ret = -ENOENT;
+			goto end;
+		}
+
+		ret = remove_range_from_remap_tree(trans, path, bg, &bytenr,
+						   &num_bytes);
+		if (ret)
+			goto end;
+	} while (num_bytes > 0);
+
+	ret = 0;
+
+end:
+	mutex_unlock(&fs_info->remap_mutex);
+
+	btrfs_put_block_group(bg);
+	btrfs_release_path(path);
+	return ret;
+}
diff --git a/fs/btrfs/relocation.h b/fs/btrfs/relocation.h
index f07dbd9a89c6..7f8e27f638bc 100644
--- a/fs/btrfs/relocation.h
+++ b/fs/btrfs/relocation.h
@@ -32,5 +32,8 @@ bool btrfs_should_ignore_reloc_root(const struct btrfs_root *root);
 u64 btrfs_get_reloc_bg_bytenr(const struct btrfs_fs_info *fs_info);
 int btrfs_translate_remap(struct btrfs_fs_info *fs_info, u64 *logical,
 			  u64 *length);
+int btrfs_remove_extent_from_remap_tree(struct btrfs_trans_handle *trans,
+					struct btrfs_path *path,
+					u64 bytenr, u64 num_bytes);
 
 #endif
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 4777926213c0..ea32ee9a63fd 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2931,8 +2931,8 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 	return ret;
 }
 
-static noinline int btrfs_update_device(struct btrfs_trans_handle *trans,
-					struct btrfs_device *device)
+int btrfs_update_device(struct btrfs_trans_handle *trans,
+			struct btrfs_device *device)
 {
 	int ret;
 	struct btrfs_path *path;
@@ -3236,25 +3236,13 @@ static int remove_chunk_item(struct btrfs_trans_handle *trans,
 	return btrfs_free_chunk(trans, chunk_offset);
 }
 
-int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset)
+int btrfs_remove_dev_extents(struct btrfs_trans_handle *trans,
+			     struct btrfs_chunk_map *map)
 {
 	struct btrfs_fs_info *fs_info = trans->fs_info;
-	struct btrfs_chunk_map *map;
+	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
 	u64 dev_extent_len = 0;
 	int i, ret = 0;
-	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
-
-	map = btrfs_get_chunk_map(fs_info, chunk_offset, 1);
-	if (IS_ERR(map)) {
-		/*
-		 * This is a logic error, but we don't want to just rely on the
-		 * user having built with ASSERT enabled, so if ASSERT doesn't
-		 * do anything we still error out.
-		 */
-		DEBUG_WARN("errr %ld reading chunk map at offset %llu",
-			   PTR_ERR(map), chunk_offset);
-		return PTR_ERR(map);
-	}
 
 	/*
 	 * First delete the device extent items from the devices btree.
@@ -3275,7 +3263,7 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset)
 		if (ret) {
 			mutex_unlock(&fs_devices->device_list_mutex);
 			btrfs_abort_transaction(trans, ret);
-			goto out;
+			return ret;
 		}
 
 		if (device->bytes_used > 0) {
@@ -3289,6 +3277,30 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset)
 	}
 	mutex_unlock(&fs_devices->device_list_mutex);
 
+	return 0;
+}
+
+int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset)
+{
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	struct btrfs_chunk_map *map;
+	int ret;
+
+	map = btrfs_get_chunk_map(fs_info, chunk_offset, 1);
+	if (IS_ERR(map)) {
+		/*
+		 * This is a logic error, but we don't want to just rely on the
+		 * user having built with ASSERT enabled, so if ASSERT doesn't
+		 * do anything we still error out.
+		 */
+		ASSERT(0);
+		return PTR_ERR(map);
+	}
+
+	ret = btrfs_remove_dev_extents(trans, map);
+	if (ret)
+		goto out;
+
 	/*
 	 * We acquire fs_info->chunk_mutex for 2 reasons:
 	 *
@@ -5433,7 +5445,7 @@ static void chunk_map_device_set_bits(struct btrfs_chunk_map *map, unsigned int
 	}
 }
 
-static void chunk_map_device_clear_bits(struct btrfs_chunk_map *map, unsigned int bits)
+void btrfs_chunk_map_device_clear_bits(struct btrfs_chunk_map *map, unsigned int bits)
 {
 	for (int i = 0; i < map->num_stripes; i++) {
 		struct btrfs_io_stripe *stripe = &map->stripes[i];
@@ -5450,7 +5462,7 @@ void btrfs_remove_chunk_map(struct btrfs_fs_info *fs_info, struct btrfs_chunk_ma
 	write_lock(&fs_info->mapping_tree_lock);
 	rb_erase_cached(&map->rb_node, &fs_info->mapping_tree);
 	RB_CLEAR_NODE(&map->rb_node);
-	chunk_map_device_clear_bits(map, CHUNK_ALLOCATED);
+	btrfs_chunk_map_device_clear_bits(map, CHUNK_ALLOCATED);
 	write_unlock(&fs_info->mapping_tree_lock);
 
 	/* Once for the tree reference. */
@@ -5486,7 +5498,7 @@ int btrfs_add_chunk_map(struct btrfs_fs_info *fs_info, struct btrfs_chunk_map *m
 		return -EEXIST;
 	}
 	chunk_map_device_set_bits(map, CHUNK_ALLOCATED);
-	chunk_map_device_clear_bits(map, CHUNK_TRIMMED);
+	btrfs_chunk_map_device_clear_bits(map, CHUNK_TRIMMED);
 	write_unlock(&fs_info->mapping_tree_lock);
 
 	return 0;
@@ -5851,7 +5863,7 @@ void btrfs_mapping_tree_free(struct btrfs_fs_info *fs_info)
 		map = rb_entry(node, struct btrfs_chunk_map, rb_node);
 		rb_erase_cached(&map->rb_node, &fs_info->mapping_tree);
 		RB_CLEAR_NODE(&map->rb_node);
-		chunk_map_device_clear_bits(map, CHUNK_ALLOCATED);
+		btrfs_chunk_map_device_clear_bits(map, CHUNK_ALLOCATED);
 		/* Once for the tree ref. */
 		btrfs_free_chunk_map(map);
 		cond_resched_rwlock_write(&fs_info->mapping_tree_lock);
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 670d7bf18c40..5bab153926e0 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -779,6 +779,8 @@ u64 btrfs_calc_stripe_length(const struct btrfs_chunk_map *map);
 int btrfs_nr_parity_stripes(u64 type);
 int btrfs_chunk_alloc_add_chunk_item(struct btrfs_trans_handle *trans,
 				     struct btrfs_block_group *bg);
+int btrfs_remove_dev_extents(struct btrfs_trans_handle *trans,
+			     struct btrfs_chunk_map *map);
 int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset);
 
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
@@ -876,6 +878,10 @@ bool btrfs_repair_one_zone(struct btrfs_fs_info *fs_info, u64 logical);
 
 bool btrfs_pinned_by_swapfile(struct btrfs_fs_info *fs_info, void *ptr);
 const u8 *btrfs_sb_fsid_ptr(const struct btrfs_super_block *sb);
+int btrfs_update_device(struct btrfs_trans_handle *trans,
+			struct btrfs_device *device);
+void btrfs_chunk_map_device_clear_bits(struct btrfs_chunk_map *map,
+				       unsigned int bits);
 
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
 struct btrfs_io_context *alloc_btrfs_io_context(struct btrfs_fs_info *fs_info,
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [RFC PATCH 08/10] btrfs: handle setting up relocation of block group with remap-tree
  2025-05-15 16:36 [RFC PATCH 00/10] Remap tree Mark Harmstone
                   ` (6 preceding siblings ...)
  2025-05-15 16:36 ` [RFC PATCH 07/10] btrfs: handle deletions from remapped block group Mark Harmstone
@ 2025-05-15 16:36 ` Mark Harmstone
  2025-05-15 16:36 ` [RFC PATCH 09/10] btrfs: move existing remaps before relocating block group Mark Harmstone
  2025-05-15 16:36 ` [RFC PATCH 10/10] btrfs: replace identity maps with actual remaps when doing relocations Mark Harmstone
  9 siblings, 0 replies; 20+ messages in thread
From: Mark Harmstone @ 2025-05-15 16:36 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Mark Harmstone

Handle the preliminary work for relocating a block group in a filesystem
with the remap-tree flag set.

If the block group is SYSTEM or REMAP, btrfs_relocate_block_group()
proceeds as it does already: bootstrapping issues mean these block
groups must still be relocated the existing way.

Otherwise we walk the free-space tree for the block group in question,
recording the holes between the free-space extents, i.e. the currently
allocated ranges. These are converted into identity remaps and placed
in the remap tree, and the block group's REMAPPED flag is set. From now
on no new allocations are possible within this block group, and any I/O
to it is funnelled through btrfs_translate_remap(). We store the number
of identity remaps in `identity_remap_count`, so that we know when
we've removed the last one and the block group is fully remapped.
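
For illustration, here is a user-space sketch of that hole-recording
step - given the sorted free-space extents of a block group, produce
one identity remap per allocated range. The types and function are
invented for this sketch and are not the kernel's:

```c
/* User-space model of converting free-space holes into identity remaps. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct extent {
	uint64_t start;
	uint64_t len;
};

/* Walk the sorted free-space extents of block group
 * [bg_start, bg_start + bg_len) and emit one identity remap for each
 * gap between them (i.e. each allocated range).  Returns the number of
 * identity remaps created - the value that would seed
 * identity_remap_count. */
static size_t holes_to_identity_remaps(uint64_t bg_start, uint64_t bg_len,
				       const struct extent *free_extents,
				       size_t nr_free,
				       struct extent *remaps,
				       size_t max_remaps)
{
	uint64_t cursor = bg_start;
	size_t nr = 0;

	for (size_t i = 0; i < nr_free; i++) {
		if (free_extents[i].start > cursor) {
			assert(nr < max_remaps);
			remaps[nr].start = cursor;
			remaps[nr].len = free_extents[i].start - cursor;
			nr++;
		}
		cursor = free_extents[i].start + free_extents[i].len;
	}
	if (cursor < bg_start + bg_len) {
		assert(nr < max_remaps);
		remaps[nr].start = cursor;
		remaps[nr].len = bg_start + bg_len - cursor;
		nr++;
	}
	return nr;
}
```

Once the REMAPPED flag is set, these identity remaps are what every
lookup initially resolves to, and relocation then replaces them with
actual remaps one by one.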

The change in btrfs_read_roots() is because data relocations no longer
rely on the data reloc tree as a hidden subvolume in which to do
snapshots.

Signed-off-by: Mark Harmstone <maharmstone@fb.com>
---
 fs/btrfs/disk-io.c         |  30 +--
 fs/btrfs/free-space-tree.c |   4 +-
 fs/btrfs/free-space-tree.h |   5 +-
 fs/btrfs/relocation.c      | 436 ++++++++++++++++++++++++++++++++++++-
 fs/btrfs/relocation.h      |   3 +-
 fs/btrfs/space-info.c      |   9 +-
 fs/btrfs/volumes.c         |  15 +-
 7 files changed, 467 insertions(+), 35 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 55be43bc50d5..73e451c32bf1 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2271,22 +2271,22 @@ static int btrfs_read_roots(struct btrfs_fs_info *fs_info)
 		root->root_key.objectid = BTRFS_REMAP_TREE_OBJECTID;
 		root->root_key.type = BTRFS_ROOT_ITEM_KEY;
 		root->root_key.offset = 0;
-	}
-
-	/*
-	 * This tree can share blocks with some other fs tree during relocation
-	 * and we need a proper setup by btrfs_get_fs_root
-	 */
-	root = btrfs_get_fs_root(tree_root->fs_info,
-				 BTRFS_DATA_RELOC_TREE_OBJECTID, true);
-	if (IS_ERR(root)) {
-		if (!btrfs_test_opt(fs_info, IGNOREBADROOTS)) {
-			ret = PTR_ERR(root);
-			goto out;
-		}
 	} else {
-		set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
-		fs_info->data_reloc_root = root;
+		/*
+		 * This tree can share blocks with some other fs tree during
+		 * relocation and we need a proper setup by btrfs_get_fs_root
+		 */
+		root = btrfs_get_fs_root(tree_root->fs_info,
+					 BTRFS_DATA_RELOC_TREE_OBJECTID, true);
+		if (IS_ERR(root)) {
+			if (!btrfs_test_opt(fs_info, IGNOREBADROOTS)) {
+				ret = PTR_ERR(root);
+				goto out;
+			}
+		} else {
+			set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
+			fs_info->data_reloc_root = root;
+		}
 	}
 
 	location.objectid = BTRFS_QUOTA_TREE_OBJECTID;
diff --git a/fs/btrfs/free-space-tree.c b/fs/btrfs/free-space-tree.c
index 0c573d46639a..85e8f3137973 100644
--- a/fs/btrfs/free-space-tree.c
+++ b/fs/btrfs/free-space-tree.c
@@ -21,8 +21,7 @@ static int __add_block_group_free_space(struct btrfs_trans_handle *trans,
 					struct btrfs_block_group *block_group,
 					struct btrfs_path *path);
 
-static struct btrfs_root *btrfs_free_space_root(
-				struct btrfs_block_group *block_group)
+struct btrfs_root *btrfs_free_space_root(struct btrfs_block_group *block_group)
 {
 	struct btrfs_key key = {
 		.objectid = BTRFS_FREE_SPACE_TREE_OBJECTID,
@@ -96,7 +95,6 @@ static int add_new_free_space_info(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
-EXPORT_FOR_TESTS
 struct btrfs_free_space_info *search_free_space_info(
 		struct btrfs_trans_handle *trans,
 		struct btrfs_block_group *block_group,
diff --git a/fs/btrfs/free-space-tree.h b/fs/btrfs/free-space-tree.h
index e6c6d6f4f221..1b804544730a 100644
--- a/fs/btrfs/free-space-tree.h
+++ b/fs/btrfs/free-space-tree.h
@@ -35,12 +35,13 @@ int add_to_free_space_tree(struct btrfs_trans_handle *trans,
 			   u64 start, u64 size);
 int remove_from_free_space_tree(struct btrfs_trans_handle *trans,
 				u64 start, u64 size);
-
-#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
 struct btrfs_free_space_info *
 search_free_space_info(struct btrfs_trans_handle *trans,
 		       struct btrfs_block_group *block_group,
 		       struct btrfs_path *path, int cow);
+struct btrfs_root *btrfs_free_space_root(struct btrfs_block_group *block_group);
+
+#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
 int __add_to_free_space_tree(struct btrfs_trans_handle *trans,
 			     struct btrfs_block_group *block_group,
 			     struct btrfs_path *path, u64 start, u64 size);
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index d1bccae74703..6c11369bc883 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -3651,7 +3651,7 @@ static noinline_for_stack int relocate_block_group(struct reloc_control *rc)
 		btrfs_btree_balance_dirty(fs_info);
 	}
 
-	if (!err) {
+	if (!err && !btrfs_fs_incompat(fs_info, REMAP_TREE)) {
 		ret = relocate_file_extent_cluster(rc);
 		if (ret < 0)
 			err = ret;
@@ -3898,6 +3898,90 @@ static const char *stage_to_string(enum reloc_stage stage)
 	return "unknown";
 }
 
+static int add_remap_tree_entries(struct btrfs_trans_handle *trans,
+				  struct btrfs_path *path,
+				  struct btrfs_key *entries,
+				  unsigned int num_entries)
+{
+	int ret;
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	struct btrfs_item_batch batch;
+	u32 *data_sizes;
+	u32 max_items;
+
+	max_items = BTRFS_LEAF_DATA_SIZE(trans->fs_info) / sizeof(struct btrfs_item);
+
+	data_sizes = kzalloc(sizeof(u32) * min_t(u32, num_entries, max_items),
+			     GFP_NOFS);
+	if (!data_sizes)
+		return -ENOMEM;
+
+	while (true) {
+		batch.keys = entries;
+		batch.data_sizes = data_sizes;
+		batch.total_data_size = 0;
+		batch.nr = min_t(u32, num_entries, max_items);
+
+		ret = btrfs_insert_empty_items(trans, fs_info->remap_root, path,
+					       &batch);
+		btrfs_release_path(path);
+
+		if (num_entries <= max_items)
+			break;
+
+		num_entries -= max_items;
+		entries += max_items;
+	}
+
+	kfree(data_sizes);
+
+	return ret;
+}
+
+struct space_run {
+	u64 start;
+	u64 end;
+};
+
+static void parse_bitmap(u64 block_size, const unsigned long *bitmap,
+			 unsigned long size, u64 address,
+			 struct space_run *space_runs,
+			 unsigned int *num_space_runs)
+{
+	unsigned long pos, end;
+	u64 run_start, run_length;
+
+	pos = find_first_bit(bitmap, size);
+
+	if (pos == size)
+		return;
+
+	while (true) {
+		end = find_next_zero_bit(bitmap, size, pos);
+
+		run_start = address + (pos * block_size);
+		run_length = (end - pos) * block_size;
+
+		if (*num_space_runs != 0 &&
+		    space_runs[*num_space_runs - 1].end == run_start) {
+			space_runs[*num_space_runs - 1].end += run_length;
+		} else {
+			space_runs[*num_space_runs].start = run_start;
+			space_runs[*num_space_runs].end = run_start + run_length;
+
+			(*num_space_runs)++;
+		}
+
+		if (end == size)
+			break;
+
+		pos = find_next_bit(bitmap, size, end + 1);
+
+		if (pos == size)
+			break;
+	}
+}
+
 static void adjust_block_group_remap_bytes(struct btrfs_trans_handle *trans,
 					   struct btrfs_block_group *bg,
 					   s64 diff)
@@ -3923,6 +4007,223 @@ static void adjust_block_group_remap_bytes(struct btrfs_trans_handle *trans,
 		btrfs_inc_delayed_refs_rsv_bg_updates(fs_info);
 }
 
+static int create_remap_tree_entries(struct btrfs_trans_handle *trans,
+				     struct btrfs_path *path,
+				     struct btrfs_block_group *bg)
+{
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	struct btrfs_free_space_info *fsi;
+	struct btrfs_key key, found_key;
+	struct extent_buffer *leaf;
+	struct btrfs_root *space_root;
+	u32 extent_count;
+	struct space_run *space_runs = NULL;
+	unsigned int num_space_runs = 0;
+	struct btrfs_key *entries = NULL;
+	unsigned int max_entries, num_entries;
+	int ret;
+
+	mutex_lock(&bg->free_space_lock);
+
+	if (test_bit(BLOCK_GROUP_FLAG_NEEDS_FREE_SPACE, &bg->runtime_flags)) {
+		mutex_unlock(&bg->free_space_lock);
+
+		ret = add_block_group_free_space(trans, bg);
+		if (ret)
+			return ret;
+
+		mutex_lock(&bg->free_space_lock);
+	}
+
+	fsi = search_free_space_info(trans, bg, path, 0);
+	if (IS_ERR(fsi)) {
+		mutex_unlock(&bg->free_space_lock);
+		return PTR_ERR(fsi);
+	}
+
+	extent_count = btrfs_free_space_extent_count(path->nodes[0], fsi);
+
+	btrfs_release_path(path);
+
+	space_runs = kmalloc(sizeof(*space_runs) * extent_count, GFP_NOFS);
+	if (!space_runs) {
+		mutex_unlock(&bg->free_space_lock);
+		return -ENOMEM;
+	}
+
+	key.objectid = bg->start;
+	key.type = 0;
+	key.offset = 0;
+
+	space_root = btrfs_free_space_root(bg);
+
+	ret = btrfs_search_slot(trans, space_root, &key, path, 0, 0);
+	if (ret < 0) {
+		mutex_unlock(&bg->free_space_lock);
+		goto out;
+	}
+
+	ret = 0;
+
+	while (true) {
+		leaf = path->nodes[0];
+
+		btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
+
+		if (found_key.objectid >= bg->start + bg->length)
+			break;
+
+		if (found_key.type == BTRFS_FREE_SPACE_EXTENT_KEY) {
+			if (num_space_runs != 0 &&
+			    space_runs[num_space_runs - 1].end == found_key.objectid) {
+				space_runs[num_space_runs - 1].end =
+					found_key.objectid + found_key.offset;
+			} else {
+				space_runs[num_space_runs].start = found_key.objectid;
+				space_runs[num_space_runs].end =
+					found_key.objectid + found_key.offset;
+
+				num_space_runs++;
+
+				BUG_ON(num_space_runs > extent_count);
+			}
+		} else if (found_key.type == BTRFS_FREE_SPACE_BITMAP_KEY) {
+			void *bitmap;
+			unsigned long offset;
+			u32 data_size;
+
+			offset = btrfs_item_ptr_offset(leaf, path->slots[0]);
+			data_size = btrfs_item_size(leaf, path->slots[0]);
+
+			if (data_size != 0) {
+				bitmap = kmalloc(data_size, GFP_NOFS);
+				if (!bitmap) {
+					mutex_unlock(&bg->free_space_lock);
+					ret = -ENOMEM;
+					goto out;
+				}
+
+				read_extent_buffer(leaf, bitmap, offset,
+						   data_size);
+
+				parse_bitmap(fs_info->sectorsize, bitmap,
+					     data_size * BITS_PER_BYTE,
+					     found_key.objectid, space_runs,
+					     &num_space_runs);
+
+				BUG_ON(num_space_runs > extent_count);
+
+				kfree(bitmap);
+			}
+		}
+
+		path->slots[0]++;
+
+		if (path->slots[0] >= btrfs_header_nritems(leaf)) {
+			ret = btrfs_next_leaf(space_root, path);
+			if (ret != 0) {
+				if (ret == 1)
+					ret = 0;
+				break;
+			}
+			leaf = path->nodes[0];
+		}
+	}
+
+	btrfs_release_path(path);
+
+	mutex_unlock(&bg->free_space_lock);
+
+	max_entries = extent_count + 2;
+	entries = kmalloc(sizeof(*entries) * max_entries, GFP_NOFS);
+	if (!entries) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	num_entries = 0;
+
+	if (num_space_runs > 0 && space_runs[0].start > bg->start) {
+		entries[num_entries].objectid = bg->start;
+		entries[num_entries].type = BTRFS_IDENTITY_REMAP_KEY;
+		entries[num_entries].offset = space_runs[0].start - bg->start;
+		num_entries++;
+	}
+
+	for (unsigned int i = 1; i < num_space_runs; i++) {
+		entries[num_entries].objectid = space_runs[i - 1].end;
+		entries[num_entries].type = BTRFS_IDENTITY_REMAP_KEY;
+		entries[num_entries].offset =
+			space_runs[i].start - space_runs[i - 1].end;
+		num_entries++;
+	}
+
+	if (num_space_runs == 0) {
+		entries[num_entries].objectid = bg->start;
+		entries[num_entries].type = BTRFS_IDENTITY_REMAP_KEY;
+		entries[num_entries].offset = bg->length;
+		num_entries++;
+	} else if (space_runs[num_space_runs - 1].end < bg->start + bg->length) {
+		entries[num_entries].objectid = space_runs[num_space_runs - 1].end;
+		entries[num_entries].type = BTRFS_IDENTITY_REMAP_KEY;
+		entries[num_entries].offset =
+			bg->start + bg->length - space_runs[num_space_runs - 1].end;
+		num_entries++;
+	}
+
+	if (num_entries == 0)
+		goto out;
+
+	bg->identity_remap_count = num_entries;
+
+	ret = add_remap_tree_entries(trans, path, entries, num_entries);
+
+out:
+	kfree(entries);
+	kfree(space_runs);
+
+	return ret;
+}
+
+static int mark_bg_remapped(struct btrfs_trans_handle *trans,
+			    struct btrfs_path *path,
+			    struct btrfs_block_group *bg)
+{
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	unsigned long bi;
+	struct extent_buffer *leaf;
+	struct btrfs_block_group_item bgi;
+	struct btrfs_key key;
+	int ret;
+
+	bg->flags |= BTRFS_BLOCK_GROUP_REMAPPED;
+
+	key.objectid = bg->start;
+	key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
+	key.offset = bg->length;
+
+	ret = btrfs_search_slot(trans, fs_info->block_group_root, &key,
+				path, 0, 1);
+	if (ret) {
+		if (ret > 0)
+			ret = -ENOENT;
+		goto out;
+	}
+
+	leaf = path->nodes[0];
+	bi = btrfs_item_ptr_offset(leaf, path->slots[0]);
+	read_extent_buffer(leaf, &bgi, bi, sizeof(bgi));
+	btrfs_set_stack_block_group_flags(&bgi, bg->flags);
+	write_extent_buffer(leaf, &bgi, bi, sizeof(bgi));
+
+	btrfs_mark_buffer_dirty(trans, leaf);
+
+	ret = 0;
+out:
+	btrfs_release_path(path);
+	return ret;
+}
+
 static int remove_chunk_stripes(struct btrfs_trans_handle *trans,
 				struct btrfs_chunk_map *chunk,
 				struct btrfs_path *path)
@@ -4042,6 +4343,55 @@ static int adjust_identity_remap_count(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
+static int mark_chunk_remapped(struct btrfs_trans_handle *trans,
+			       struct btrfs_path *path, uint64_t start)
+{
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	struct btrfs_chunk_map *chunk;
+	struct btrfs_key key;
+	u64 type;
+	int ret;
+	struct extent_buffer *leaf;
+	struct btrfs_chunk *c;
+
+	read_lock(&fs_info->mapping_tree_lock);
+
+	chunk = btrfs_find_chunk_map_nolock(fs_info, start, 1);
+	if (!chunk) {
+		read_unlock(&fs_info->mapping_tree_lock);
+		return -ENOENT;
+	}
+
+	chunk->type |= BTRFS_BLOCK_GROUP_REMAPPED;
+	type = chunk->type;
+
+	read_unlock(&fs_info->mapping_tree_lock);
+
+	key.objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID;
+	key.type = BTRFS_CHUNK_ITEM_KEY;
+	key.offset = start;
+
+	ret = btrfs_search_slot(trans, fs_info->chunk_root, &key, path,
+				0, 1);
+	if (ret == 1) {
+		ret = -ENOENT;
+		goto end;
+	} else if (ret < 0)
+		goto end;
+
+	leaf = path->nodes[0];
+
+	c = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_chunk);
+	btrfs_set_chunk_type(leaf, c, type);
+	btrfs_mark_buffer_dirty(trans, leaf);
+
+	ret = 0;
+end:
+	btrfs_free_chunk_map(chunk);
+	btrfs_release_path(path);
+	return ret;
+}
+
 int btrfs_translate_remap(struct btrfs_fs_info *fs_info, u64 *logical,
 			  u64 *length)
 {
@@ -4117,16 +4467,66 @@ int btrfs_translate_remap(struct btrfs_fs_info *fs_info, u64 *logical,
 	return 0;
 }
 
+static int start_block_group_remapping(struct btrfs_fs_info *fs_info,
+				       struct btrfs_path *path,
+				       struct btrfs_block_group *bg)
+{
+	struct btrfs_trans_handle *trans;
+	int ret, ret2;
+
+	ret = btrfs_cache_block_group(bg, true);
+	if (ret)
+		return ret;
+
+	trans = btrfs_start_transaction(fs_info->remap_root, 0);
+	if (IS_ERR(trans))
+		return PTR_ERR(trans);
+
+	/* We need to run delayed refs, to make sure FST is up to date. */
+	ret = btrfs_run_delayed_refs(trans, U64_MAX);
+	if (ret) {
+		btrfs_end_transaction(trans);
+		return ret;
+	}
+
+	mutex_lock(&fs_info->remap_mutex);
+
+	if (bg->flags & BTRFS_BLOCK_GROUP_REMAPPED) {
+		ret = 0;
+		goto end;
+	}
+
+	ret = create_remap_tree_entries(trans, path, bg);
+	if (ret)
+		goto end;
+
+	ret = mark_bg_remapped(trans, path, bg);
+	if (ret)
+		goto end;
+
+	ret = mark_chunk_remapped(trans, path, bg->start);
+
+end:
+	mutex_unlock(&fs_info->remap_mutex);
+
+	ret2 = btrfs_end_transaction(trans);
+	if (!ret)
+		ret = ret2;
+
+	return ret;
+}
+
 /*
  * function to relocate all extents in a block group.
  */
-int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start)
+int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start,
+			       bool *using_remap_tree)
 {
 	struct btrfs_block_group *bg;
 	struct btrfs_root *extent_root = btrfs_extent_root(fs_info, group_start);
 	struct reloc_control *rc;
 	struct inode *inode;
-	struct btrfs_path *path;
+	struct btrfs_path *path = NULL;
 	int ret;
 	int rw = 0;
 	int err = 0;
@@ -4193,7 +4593,7 @@ int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start)
 	}
 
 	inode = lookup_free_space_inode(rc->block_group, path);
-	btrfs_free_path(path);
+	btrfs_release_path(path);
 
 	if (!IS_ERR(inode))
 		ret = delete_block_group_cache(rc->block_group, inode, 0);
@@ -4205,11 +4605,17 @@ int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start)
 		goto out;
 	}
 
-	rc->data_inode = create_reloc_inode(rc->block_group);
-	if (IS_ERR(rc->data_inode)) {
-		err = PTR_ERR(rc->data_inode);
-		rc->data_inode = NULL;
-		goto out;
+	*using_remap_tree = btrfs_fs_incompat(fs_info, REMAP_TREE) &&
+		!(bg->flags & BTRFS_BLOCK_GROUP_SYSTEM) &&
+		!(bg->flags & BTRFS_BLOCK_GROUP_REMAP);
+
+	if (!btrfs_fs_incompat(fs_info, REMAP_TREE)) {
+		rc->data_inode = create_reloc_inode(rc->block_group);
+		if (IS_ERR(rc->data_inode)) {
+			err = PTR_ERR(rc->data_inode);
+			rc->data_inode = NULL;
+			goto out;
+		}
 	}
 
 	describe_relocation(rc->block_group);
@@ -4221,6 +4627,12 @@ int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start)
 	ret = btrfs_zone_finish(rc->block_group);
 	WARN_ON(ret && ret != -EAGAIN);
 
+	if (*using_remap_tree) {
+		err = start_block_group_remapping(fs_info, path, bg);
+
+		goto out;
+	}
+
 	while (1) {
 		enum reloc_stage finishes_stage;
 
@@ -4266,7 +4678,9 @@ int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start)
 out:
 	if (err && rw)
 		btrfs_dec_block_group_ro(rc->block_group);
-	iput(rc->data_inode);
+	if (!btrfs_fs_incompat(fs_info, REMAP_TREE))
+		iput(rc->data_inode);
+	btrfs_free_path(path);
 out_put_bg:
 	btrfs_put_block_group(bg);
 	reloc_chunk_end(fs_info);
@@ -4460,7 +4874,7 @@ int btrfs_recover_relocation(struct btrfs_fs_info *fs_info)
 
 	btrfs_free_path(path);
 
-	if (ret == 0) {
+	if (ret == 0 && !btrfs_fs_incompat(fs_info, REMAP_TREE)) {
 		/* cleanup orphan inode in data relocation tree */
 		fs_root = btrfs_grab_root(fs_info->data_reloc_root);
 		ASSERT(fs_root);
diff --git a/fs/btrfs/relocation.h b/fs/btrfs/relocation.h
index 7f8e27f638bc..bfef88b47b0e 100644
--- a/fs/btrfs/relocation.h
+++ b/fs/btrfs/relocation.h
@@ -12,7 +12,8 @@ struct btrfs_trans_handle;
 struct btrfs_ordered_extent;
 struct btrfs_pending_snapshot;
 
-int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start);
+int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start,
+			       bool *using_remap_tree);
 int btrfs_init_reloc_root(struct btrfs_trans_handle *trans, struct btrfs_root *root);
 int btrfs_update_reloc_root(struct btrfs_trans_handle *trans,
 			    struct btrfs_root *root);
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 3f927a514643..e6f6463c8a6d 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -375,8 +375,13 @@ void btrfs_add_bg_to_space_info(struct btrfs_fs_info *info,
 	factor = btrfs_bg_type_to_factor(block_group->flags);
 
 	spin_lock(&space_info->lock);
-	space_info->total_bytes += block_group->length;
-	space_info->disk_total += block_group->length * factor;
+
+	if (!(block_group->flags & BTRFS_BLOCK_GROUP_REMAPPED) ||
+	    block_group->identity_remap_count != 0) {
+		space_info->total_bytes += block_group->length;
+		space_info->disk_total += block_group->length * factor;
+	}
+
 	space_info->bytes_used += block_group->used;
 	space_info->disk_used += block_group->used * factor;
 	space_info->bytes_readonly += block_group->bytes_super;
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index ea32ee9a63fd..e59aa0b5c4f3 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3419,6 +3419,7 @@ int btrfs_relocate_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
 	struct btrfs_block_group *block_group;
 	u64 length;
 	int ret;
+	bool using_remap_tree;
 
 	if (btrfs_fs_incompat(fs_info, EXTENT_TREE_V2)) {
 		btrfs_err(fs_info,
@@ -3442,7 +3443,8 @@ int btrfs_relocate_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
 
 	/* step one, relocate all the extents inside this chunk */
 	btrfs_scrub_pause(fs_info);
-	ret = btrfs_relocate_block_group(fs_info, chunk_offset);
+	ret = btrfs_relocate_block_group(fs_info, chunk_offset,
+					 &using_remap_tree);
 	btrfs_scrub_continue(fs_info);
 	if (ret) {
 		/*
@@ -3461,6 +3463,9 @@ int btrfs_relocate_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
 	length = block_group->length;
 	btrfs_put_block_group(block_group);
 
+	if (using_remap_tree)
+		return 0;
+
 	/*
 	 * On a zoned file system, discard the whole block group, this will
 	 * trigger a REQ_OP_ZONE_RESET operation on the device zone. If
@@ -4162,6 +4167,14 @@ static int __btrfs_balance(struct btrfs_fs_info *fs_info)
 		chunk = btrfs_item_ptr(leaf, slot, struct btrfs_chunk);
 		chunk_type = btrfs_chunk_type(leaf, chunk);
 
+		/* Check if chunk has already been fully relocated. */
+		if (chunk_type & BTRFS_BLOCK_GROUP_REMAPPED &&
+		    btrfs_chunk_num_stripes(leaf, chunk) == 0) {
+			btrfs_release_path(path);
+			mutex_unlock(&fs_info->reclaim_bgs_lock);
+			goto loop;
+		}
+
 		if (!counting) {
 			spin_lock(&fs_info->balance_lock);
 			bctl->stat.considered++;
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [RFC PATCH 09/10] btrfs: move existing remaps before relocating block group
  2025-05-15 16:36 [RFC PATCH 00/10] Remap tree Mark Harmstone
                   ` (7 preceding siblings ...)
  2025-05-15 16:36 ` [RFC PATCH 08/10] btrfs: handle setting up relocation of block group with remap-tree Mark Harmstone
@ 2025-05-15 16:36 ` Mark Harmstone
       [not found]   ` <202505161726.w1lqCZxG-lkp@intel.com>
  2025-05-15 16:36 ` [RFC PATCH 10/10] btrfs: replace identity maps with actual remaps when doing relocations Mark Harmstone
  9 siblings, 1 reply; 20+ messages in thread
From: Mark Harmstone @ 2025-05-15 16:36 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Mark Harmstone

If, when relocating a block group, we find that `remap_bytes` > 0 in its
block group item, it has been the destination block group for another
block group that has been remapped.

We need to search the remap tree for any remap backrefs within this
range, and move the data to a third block group. This is because
otherwise btrfs_translate_remap() could end up following an unbounded
chain of remaps, which would only get worse over time.
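The invariant this maintains can be illustrated with a minimal
user-space sketch (hypothetical names, not the kernel API): because
existing remaps are moved out of a block group before it is itself
remapped, a remap's destination never needs translating again, so a
single table lookup always suffices:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct remap_entry {
	uint64_t src;	/* logical address being remapped */
	uint64_t len;	/* length of the remapped range */
	uint64_t dst;	/* where the data actually lives now */
};

/*
 * Single-level translation over a remap table. Addresses outside any
 * entry (or covered only by an identity remap) translate to themselves.
 */
uint64_t translate_one(const struct remap_entry *tab, size_t n,
		       uint64_t logical)
{
	for (size_t i = 0; i < n; i++) {
		if (logical >= tab[i].src && logical - tab[i].src < tab[i].len)
			return tab[i].dst + (logical - tab[i].src);
	}
	return logical;
}
```

If remaps were allowed to point into other remapped block groups,
translate_one() would have to loop until the result left every remapped
range, with no bound on the number of iterations; moving the existing
remaps to a third block group first keeps every lookup to one step.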

We only relocate one block group at a time, so `remap_bytes` will only
ever go down while we are doing this. Once we're finished we set the
REMAPPED flag on the block group, which will permanently prevent any
other data from being moved to within it.

Signed-off-by: Mark Harmstone <maharmstone@fb.com>
---
 fs/btrfs/extent-tree.c |   6 +-
 fs/btrfs/relocation.c  | 444 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 448 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 857a06553b19..223904c2a8d8 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4472,7 +4472,8 @@ static noinline int find_free_extent(struct btrfs_root *root,
 		    block_group->cached != BTRFS_CACHE_NO) {
 			down_read(&space_info->groups_sem);
 			if (list_empty(&block_group->list) ||
-			    block_group->ro) {
+			    block_group->ro ||
+			    block_group->flags & BTRFS_BLOCK_GROUP_REMAPPED) {
 				/*
 				 * someone is removing this block group,
 				 * we can't jump into the have_block_group
@@ -4506,7 +4507,8 @@ static noinline int find_free_extent(struct btrfs_root *root,
 
 		ffe_ctl->hinted = false;
 		/* If the block group is read-only, we can skip it entirely. */
-		if (unlikely(block_group->ro)) {
+		if (unlikely(block_group->ro) ||
+		    block_group->flags & BTRFS_BLOCK_GROUP_REMAPPED) {
 			if (ffe_ctl->for_treelog)
 				btrfs_clear_treelog_bg(block_group);
 			if (ffe_ctl->for_data_reloc)
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 6c11369bc883..7da95b82c798 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -4007,6 +4007,442 @@ static void adjust_block_group_remap_bytes(struct btrfs_trans_handle *trans,
 		btrfs_inc_delayed_refs_rsv_bg_updates(fs_info);
 }
 
+struct reloc_io_private {
+	struct completion done;
+	refcount_t pending_refs;
+	blk_status_t status;
+};
+
+static void reloc_endio(struct btrfs_bio *bbio)
+{
+	struct reloc_io_private *priv = bbio->private;
+
+	if (bbio->bio.bi_status)
+		WRITE_ONCE(priv->status, bbio->bio.bi_status);
+
+	if (refcount_dec_and_test(&priv->pending_refs))
+		complete(&priv->done);
+
+	bio_put(&bbio->bio);
+}
+
+static int copy_remapped_data_io(struct btrfs_fs_info *fs_info,
+				 struct reloc_io_private *priv,
+				 struct page **pages, u64 addr, u64 length,
+				 bool do_write)
+{
+	struct btrfs_bio *bbio;
+	unsigned long i = 0;
+	int op = do_write ? REQ_OP_WRITE : REQ_OP_READ;
+
+	init_completion(&priv->done);
+	refcount_set(&priv->pending_refs, 1);
+	priv->status = 0;
+
+	bbio = btrfs_bio_alloc(BIO_MAX_VECS, op, fs_info, reloc_endio,
+			       priv);
+	bbio->bio.bi_iter.bi_sector = addr >> SECTOR_SHIFT;
+
+	do {
+		size_t bytes = min_t(u64, length, PAGE_SIZE);
+
+		if (bio_add_page(&bbio->bio, pages[i], bytes, 0) < bytes) {
+			refcount_inc(&priv->pending_refs);
+			btrfs_submit_bbio(bbio, 0);
+
+			bbio = btrfs_bio_alloc(BIO_MAX_VECS, op, fs_info,
+					       reloc_endio, priv);
+			bbio->bio.bi_iter.bi_sector = addr >> SECTOR_SHIFT;
+			continue;
+		}
+
+		i++;
+		addr += bytes;
+		length -= bytes;
+	} while (length);
+
+	refcount_inc(&priv->pending_refs);
+	btrfs_submit_bbio(bbio, 0);
+
+	if (!refcount_dec_and_test(&priv->pending_refs))
+		wait_for_completion_io(&priv->done);
+
+	return blk_status_to_errno(READ_ONCE(priv->status));
+}
+
+static int copy_remapped_data(struct btrfs_fs_info *fs_info, u64 old_addr,
+			      u64 new_addr, u64 length)
+{
+	int ret;
+	struct page **pages;
+	unsigned int nr_pages;
+	struct reloc_io_private priv;
+
+	nr_pages = DIV_ROUND_UP(length, PAGE_SIZE);
+	pages = kcalloc(nr_pages, sizeof(struct page *), GFP_NOFS);
+	if (!pages)
+		return -ENOMEM;
+	ret = btrfs_alloc_page_array(nr_pages, pages, 0);
+	if (ret) {
+		ret = -ENOMEM;
+		goto end;
+	}
+
+	ret = copy_remapped_data_io(fs_info, &priv, pages, old_addr, length,
+				    false);
+	if (ret)
+		goto end;
+
+	ret = copy_remapped_data_io(fs_info, &priv, pages, new_addr, length,
+				    true);
+
+end:
+	for (unsigned int i = 0; i < nr_pages; i++) {
+		if (pages[i])
+			__free_page(pages[i]);
+	}
+	kfree(pages);
+
+	return ret;
+}
+
+static int do_copy(struct btrfs_fs_info *fs_info, u64 old_addr, u64 new_addr,
+		   u64 length)
+{
+	int ret;
+
+	/* Copy 1MB at a time, to avoid using too much memory. */
+
+	do {
+		u64 to_copy = min_t(u64, length, SZ_1M);
+
+		ret = copy_remapped_data(fs_info, old_addr, new_addr,
+					 to_copy);
+		if (ret)
+			return ret;
+
+		if (to_copy == length)
+			break;
+
+		old_addr += to_copy;
+		new_addr += to_copy;
+		length -= to_copy;
+	} while (true);
+
+	return 0;
+}
+
+static int move_existing_remap(struct btrfs_fs_info *fs_info,
+			       struct btrfs_path *path,
+			       struct btrfs_block_group *bg, u64 new_addr,
+			       u64 length, u64 old_addr)
+{
+	struct btrfs_trans_handle *trans;
+	struct extent_buffer *leaf;
+	struct btrfs_remap *remap_ptr, remap;
+	struct btrfs_key key, ins;
+	u64 dest_addr, dest_length, min_size;
+	struct btrfs_block_group *dest_bg;
+	int ret;
+	bool is_data = bg->flags & BTRFS_BLOCK_GROUP_DATA;
+	struct btrfs_space_info *sinfo = bg->space_info;
+	bool mutex_taken = false, bg_needs_free_space;
+
+	spin_lock(&sinfo->lock);
+	btrfs_space_info_update_bytes_may_use(sinfo, length);
+	spin_unlock(&sinfo->lock);
+
+	if (is_data)
+		min_size = fs_info->sectorsize;
+	else
+		min_size = fs_info->nodesize;
+
+	ret = btrfs_reserve_extent(fs_info->fs_root, length, length, min_size,
+				   0, 0, &ins, is_data, false);
+	if (ret) {
+		spin_lock(&sinfo->lock);
+		btrfs_space_info_update_bytes_may_use(sinfo, -length);
+		spin_unlock(&sinfo->lock);
+		return ret;
+	}
+
+	dest_addr = ins.objectid;
+	dest_length = ins.offset;
+
+	if (!is_data && dest_length % fs_info->nodesize) {
+		u64 new_length = dest_length - (dest_length % fs_info->nodesize);
+
+		btrfs_free_reserved_extent(fs_info, dest_addr + new_length,
+					   dest_length - new_length, 0);
+
+		dest_length = new_length;
+	}
+
+	trans = btrfs_join_transaction(fs_info->remap_root);
+	if (IS_ERR(trans)) {
+		ret = PTR_ERR(trans);
+		trans = NULL;
+		goto end;
+	}
+
+	mutex_lock(&fs_info->remap_mutex);
+	mutex_taken = true;
+
+	/* Find old remap entry. */
+
+	key.objectid = old_addr;
+	key.type = BTRFS_REMAP_KEY;
+	key.offset = length;
+
+	ret = btrfs_search_slot(trans, fs_info->remap_root, &key,
+				path, 0, 1);
+	if (ret == 1) {
+		/*
+		 * Not a problem if the remap entry wasn't found: that means
+		 * that another transaction has deallocated the data.
+		 * move_existing_remaps() loops until the BG contains no
+		 * remaps, so we can just return 0 in this case.
+		 */
+		btrfs_release_path(path);
+		ret = 0;
+		goto end;
+	} else if (ret) {
+		goto end;
+	}
+
+	ret = do_copy(fs_info, new_addr, dest_addr, dest_length);
+	if (ret)
+		goto end;
+
+	/* Change data of old remap entry. */
+
+	leaf = path->nodes[0];
+
+	remap_ptr = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_remap);
+	btrfs_set_remap_address(leaf, remap_ptr, dest_addr);
+
+	btrfs_mark_buffer_dirty(trans, leaf);
+
+	if (dest_length != length) {
+		key.offset = dest_length;
+		btrfs_set_item_key_safe(trans, path, &key);
+	}
+
+	btrfs_release_path(path);
+
+	if (dest_length != length) {
+		/* Add remap item for remainder. */
+
+		key.objectid += dest_length;
+		key.offset = length - dest_length;
+
+		ret = btrfs_insert_empty_item(trans, fs_info->remap_root,
+					      path, &key,
+				sizeof(struct btrfs_remap));
+		if (ret)
+			goto end;
+
+		leaf = path->nodes[0];
+
+		btrfs_set_stack_remap_address(&remap, new_addr + dest_length);
+
+		write_extent_buffer(leaf, &remap,
+				    btrfs_item_ptr_offset(leaf, path->slots[0]),
+				    sizeof(struct btrfs_remap));
+		btrfs_release_path(path);
+	}
+
+	/* Change or remove old backref. */
+
+	key.objectid = new_addr;
+	key.type = BTRFS_REMAP_BACKREF_KEY;
+	key.offset = length;
+
+	ret = btrfs_search_slot(trans, fs_info->remap_root, &key,
+				path, 0, 1);
+	if (ret) {
+		if (ret == 1) {
+			btrfs_release_path(path);
+			ret = -ENOENT;
+		}
+		goto end;
+	}
+
+	leaf = path->nodes[0];
+
+	if (dest_length == length) {
+		ret = btrfs_del_item(trans, fs_info->remap_root, path);
+		if (ret) {
+			btrfs_release_path(path);
+			goto end;
+		}
+	} else {
+		key.objectid += dest_length;
+		key.offset -= dest_length;
+		btrfs_set_item_key_safe(trans, path, &key);
+
+		btrfs_set_stack_remap_address(&remap, old_addr + dest_length);
+
+		write_extent_buffer(leaf, &remap,
+				    btrfs_item_ptr_offset(leaf, path->slots[0]),
+				    sizeof(struct btrfs_remap));
+	}
+
+	btrfs_release_path(path);
+
+	/* Add new backref. */
+
+	key.objectid = dest_addr;
+	key.type = BTRFS_REMAP_BACKREF_KEY;
+	key.offset = dest_length;
+
+	ret = btrfs_insert_empty_item(trans, fs_info->remap_root,
+				      path, &key, sizeof(struct btrfs_remap));
+	if (ret)
+		goto end;
+
+	leaf = path->nodes[0];
+
+	btrfs_set_stack_remap_address(&remap, old_addr);
+
+	write_extent_buffer(leaf, &remap,
+			    btrfs_item_ptr_offset(leaf, path->slots[0]),
+			    sizeof(struct btrfs_remap));
+
+	btrfs_release_path(path);
+
+	adjust_block_group_remap_bytes(trans, bg, -dest_length);
+
+	ret = add_to_free_space_tree(trans, new_addr, dest_length);
+	if (ret)
+		goto end;
+
+	dest_bg = btrfs_lookup_block_group(fs_info, dest_addr);
+
+	adjust_block_group_remap_bytes(trans, dest_bg, dest_length);
+
+	mutex_lock(&dest_bg->free_space_lock);
+	bg_needs_free_space = test_bit(BLOCK_GROUP_FLAG_NEEDS_FREE_SPACE,
+				       &dest_bg->runtime_flags);
+	mutex_unlock(&dest_bg->free_space_lock);
+	btrfs_put_block_group(dest_bg);
+
+	if (bg_needs_free_space) {
+		ret = add_block_group_free_space(trans, dest_bg);
+		if (ret)
+			goto end;
+	}
+
+	ret = remove_from_free_space_tree(trans, dest_addr, dest_length);
+	if (ret) {
+		remove_from_free_space_tree(trans, new_addr, dest_length);
+		goto end;
+	}
+
+	ret = 0;
+
+end:
+	if (mutex_taken)
+		mutex_unlock(&fs_info->remap_mutex);
+
+	btrfs_dec_block_group_reservations(fs_info, dest_addr);
+
+	if (ret) {
+		btrfs_free_reserved_extent(fs_info, dest_addr, dest_length, 0);
+
+		if (trans) {
+			btrfs_abort_transaction(trans, ret);
+			btrfs_end_transaction(trans);
+		}
+	} else {
+		dest_bg = btrfs_lookup_block_group(fs_info, dest_addr);
+		btrfs_free_reserved_bytes(dest_bg, dest_length, 0);
+		btrfs_put_block_group(dest_bg);
+
+		ret = btrfs_commit_transaction(trans);
+	}
+
+	return ret;
+}
+
+static int move_existing_remaps(struct btrfs_fs_info *fs_info,
+				struct btrfs_block_group *bg,
+				struct btrfs_path *path)
+{
+	int ret;
+	struct btrfs_key key;
+	struct extent_buffer *leaf;
+	struct btrfs_remap *remap;
+	u64 old_addr;
+
+	/* Look for backrefs in remap tree. */
+
+	while (bg->remap_bytes > 0) {
+		key.objectid = bg->start;
+		key.type = BTRFS_REMAP_BACKREF_KEY;
+		key.offset = 0;
+
+		ret = btrfs_search_slot(NULL, fs_info->remap_root, &key, path,
+					0, 0);
+		if (ret < 0)
+			return ret;
+
+		leaf = path->nodes[0];
+
+		if (path->slots[0] >= btrfs_header_nritems(leaf)) {
+			ret = btrfs_next_leaf(fs_info->remap_root, path);
+			if (ret < 0) {
+				btrfs_release_path(path);
+				return ret;
+			}
+
+			if (ret) {
+				btrfs_release_path(path);
+				break;
+			}
+
+			leaf = path->nodes[0];
+		}
+
+		btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
+
+		if (key.type != BTRFS_REMAP_BACKREF_KEY) {
+			path->slots[0]++;
+
+			if (path->slots[0] >= btrfs_header_nritems(leaf)) {
+				ret = btrfs_next_leaf(fs_info->remap_root, path);
+				if (ret < 0) {
+					btrfs_release_path(path);
+					return ret;
+				}
+
+				if (ret) {
+					btrfs_release_path(path);
+					break;
+				}
+
+				leaf = path->nodes[0];
+			}
+		}
+
+		remap = btrfs_item_ptr(leaf, path->slots[0],
+				       struct btrfs_remap);
+
+		old_addr = btrfs_remap_address(leaf, remap);
+
+		btrfs_release_path(path);
+
+		ret = move_existing_remap(fs_info, path, bg, key.objectid,
+					  key.offset, old_addr);
+		if (ret)
+			return ret;
+	}
+
+	BUG_ON(bg->remap_bytes > 0);
+
+	return 0;
+}
+
 static int create_remap_tree_entries(struct btrfs_trans_handle *trans,
 				     struct btrfs_path *path,
 				     struct btrfs_block_group *bg)
@@ -4628,6 +5064,14 @@ int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start,
 	WARN_ON(ret && ret != -EAGAIN);
 
 	if (*using_remap_tree) {
+		if (bg->remap_bytes != 0) {
+			ret = move_existing_remaps(fs_info, bg, path);
+			if (ret) {
+				err = ret;
+				goto out;
+			}
+		}
+
 		err = start_block_group_remapping(fs_info, path, bg);
 
 		goto out;
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [RFC PATCH 10/10] btrfs: replace identity maps with actual remaps when doing relocations
  2025-05-15 16:36 [RFC PATCH 00/10] Remap tree Mark Harmstone
                   ` (8 preceding siblings ...)
  2025-05-15 16:36 ` [RFC PATCH 09/10] btrfs: move existing remaps before relocating block group Mark Harmstone
@ 2025-05-15 16:36 ` Mark Harmstone
  2025-05-21  0:04   ` Boris Burkov
  9 siblings, 1 reply; 20+ messages in thread
From: Mark Harmstone @ 2025-05-15 16:36 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Mark Harmstone

Add a function do_remap_tree_reloc(), which does the actual work of
performing a relocation using the remap tree.

In a loop we call do_remap_tree_reloc_trans(), which searches for the
first identity remap in the block group. We call btrfs_reserve_extent()
to find space elsewhere for it, read the data into memory, and write it
to the new location. We then carve out the identity remap and replace
it with an actual remap, which points to the new location.

Once the last identity remap has been removed we call
last_identity_remap_gone(), which, as with deletions, removes the
chunk's stripes and device extents.

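For illustration, the carve step can be modelled in isolation. The
following standalone userspace sketch (structure and function names are
invented here, not the kernel's) shows how an identity remap entry gets
deleted, shortened, or split when a sub-range of it is remapped, and how
the identity-remap count delta falls out of that:

```c
/*
 * Standalone model of carving [old_addr, old_addr + length) out of an
 * identity remap.  Illustrative only: these are not the on-disk btrfs
 * structures, just the same interval arithmetic.
 */
#include <assert.h>
#include <stdint.h>

struct remap_entry {
	uint64_t objectid;	/* start of the range (key.objectid) */
	uint64_t length;	/* key.offset in the real tree */
	uint64_t address;	/* target address; 0 for identity remaps */
};

/*
 * Carve the given range out of *ident, emitting a real remap entry
 * pointing at new_addr and, if needed, a tail identity entry for the
 * remainder.  Returns the net change in the number of identity
 * entries (the identity_count_delta of the patch above).
 */
static int carve_identity_remap(struct remap_entry *ident,
				uint64_t old_addr, uint64_t length,
				uint64_t new_addr,
				struct remap_entry *out_remap,
				struct remap_entry *out_tail)
{
	uint64_t ident_end = ident->objectid + ident->length;
	int delta = 0;

	assert(old_addr >= ident->objectid);
	assert(old_addr + length <= ident_end);

	out_remap->objectid = old_addr;
	out_remap->length = length;
	out_remap->address = new_addr;

	if (old_addr == ident->objectid) {
		ident->length = 0;	/* whole entry consumed: delete */
		delta--;
	} else {
		/* shorten the identity entry to the leading part */
		ident->length = old_addr - ident->objectid;
	}

	if (old_addr + length != ident_end) {
		/* remainder of the identity mapping survives */
		out_tail->objectid = old_addr + length;
		out_tail->length = ident_end - (old_addr + length);
		out_tail->address = 0;
		delta++;
	} else {
		out_tail->length = 0;
	}

	return delta;
}
```

A middle carve turns one identity entry into two (delta +1), a carve of
the whole entry deletes it (delta -1), and a carve at either edge leaves
the count unchanged.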
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
---
 fs/btrfs/relocation.c | 522 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 522 insertions(+)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 7da95b82c798..bcf04d4c5af1 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -4660,6 +4660,60 @@ static int mark_bg_remapped(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
+static int find_next_identity_remap(struct btrfs_trans_handle *trans,
+				    struct btrfs_path *path, u64 bg_end,
+				    u64 last_start, u64 *start,
+				    u64 *length)
+{
+	int ret;
+	struct btrfs_key key, found_key;
+	struct btrfs_root *remap_root = trans->fs_info->remap_root;
+	struct extent_buffer *leaf;
+
+	key.objectid = last_start;
+	key.type = BTRFS_IDENTITY_REMAP_KEY;
+	key.offset = 0;
+
+	ret = btrfs_search_slot(trans, remap_root, &key, path, 0, 0);
+	if (ret < 0)
+		goto out;
+
+	leaf = path->nodes[0];
+	while (true) {
+		btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
+
+		if (found_key.objectid >= bg_end) {
+			ret = -ENOENT;
+			goto out;
+		}
+
+		if (found_key.type == BTRFS_IDENTITY_REMAP_KEY) {
+			*start = found_key.objectid;
+			*length = found_key.offset;
+			ret = 0;
+			goto out;
+		}
+
+		path->slots[0]++;
+		if (path->slots[0] >= btrfs_header_nritems(leaf)) {
+			ret = btrfs_next_leaf(remap_root, path);
+
+			if (ret != 0) {
+				if (ret == 1)
+					ret = -ENOENT;
+				goto out;
+			}
+
+			leaf = path->nodes[0];
+		}
+	}
+
+out:
+	btrfs_release_path(path);
+
+	return ret;
+}
+
 static int remove_chunk_stripes(struct btrfs_trans_handle *trans,
 				struct btrfs_chunk_map *chunk,
 				struct btrfs_path *path)
@@ -4779,6 +4833,288 @@ static int adjust_identity_remap_count(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
+static int merge_remap_entries(struct btrfs_trans_handle *trans,
+			       struct btrfs_path *path,
+			       struct btrfs_block_group *src_bg, u64 old_addr,
+			       u64 new_addr, u64 length)
+{
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	struct btrfs_remap *remap_ptr;
+	struct extent_buffer *leaf;
+	struct btrfs_key key, new_key;
+	u64 last_addr, old_length;
+	int ret;
+
+	leaf = path->nodes[0];
+	btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
+
+	remap_ptr = btrfs_item_ptr(leaf, path->slots[0],
+				   struct btrfs_remap);
+
+	last_addr = btrfs_remap_address(leaf, remap_ptr);
+	old_length = key.offset;
+
+	if (last_addr + old_length != new_addr)
+		return 0;
+
+	/* Merge entries. */
+
+	new_key.objectid = key.objectid;
+	new_key.type = BTRFS_REMAP_KEY;
+	new_key.offset = old_length + length;
+
+	btrfs_set_item_key_safe(trans, path, &new_key);
+
+	btrfs_release_path(path);
+
+	/* Merge backref too. */
+
+	key.objectid = new_addr - old_length;
+	key.type = BTRFS_REMAP_BACKREF_KEY;
+	key.offset = old_length;
+
+	ret = btrfs_search_slot(trans, fs_info->remap_root, &key, path, 0, 1);
+	if (ret < 0) {
+		return ret;
+	} else if (ret == 1) {
+		btrfs_release_path(path);
+		return -ENOENT;
+	}
+
+	new_key.objectid = new_addr - old_length;
+	new_key.type = BTRFS_REMAP_BACKREF_KEY;
+	new_key.offset = old_length + length;
+
+	btrfs_set_item_key_safe(trans, path, &new_key);
+
+	btrfs_release_path(path);
+
+	/* Fix the following identity map. */
+
+	key.objectid = old_addr;
+	key.type = BTRFS_IDENTITY_REMAP_KEY;
+	key.offset = 0;
+
+	ret = btrfs_search_slot(trans, fs_info->remap_root, &key, path, 0, 1);
+	if (ret < 0)
+		return ret;
+
+	btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+
+	if (key.objectid != old_addr || key.type != BTRFS_IDENTITY_REMAP_KEY)
+		return -ENOENT;
+
+	if (key.offset == length) {
+		ret = btrfs_del_item(trans, fs_info->remap_root, path);
+		if (ret)
+			return ret;
+
+		btrfs_release_path(path);
+
+		ret = adjust_identity_remap_count(trans, path, src_bg, -1);
+		if (ret)
+			return ret;
+
+		return 1;
+	}
+
+	new_key.objectid = old_addr + length;
+	new_key.type = BTRFS_IDENTITY_REMAP_KEY;
+	new_key.offset = key.offset - length;
+
+	btrfs_set_item_key_safe(trans, path, &new_key);
+
+	btrfs_release_path(path);
+
+	return 1;
+}
+
+static int add_new_remap_entry(struct btrfs_trans_handle *trans,
+			       struct btrfs_path *path,
+			       struct btrfs_block_group *src_bg, u64 old_addr,
+			       u64 new_addr, u64 length)
+{
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	struct btrfs_key key, new_key;
+	struct btrfs_remap remap;
+	int ret;
+	int identity_count_delta = 0;
+
+	btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+
+	/* Shorten or delete identity mapping entry. */
+
+	if (key.objectid == old_addr) {
+		ret = btrfs_del_item(trans, fs_info->remap_root, path);
+		if (ret)
+			return ret;
+
+		identity_count_delta--;
+	} else {
+		new_key.objectid = key.objectid;
+		new_key.type = BTRFS_IDENTITY_REMAP_KEY;
+		new_key.offset = old_addr - key.objectid;
+
+		btrfs_set_item_key_safe(trans, path, &new_key);
+	}
+
+	btrfs_release_path(path);
+
+	/* Create new remap entry. */
+
+	new_key.objectid = old_addr;
+	new_key.type = BTRFS_REMAP_KEY;
+	new_key.offset = length;
+
+	ret = btrfs_insert_empty_item(trans, fs_info->remap_root,
+		path, &new_key, sizeof(struct btrfs_remap));
+	if (ret)
+		return ret;
+
+	btrfs_set_stack_remap_address(&remap, new_addr);
+
+	write_extent_buffer(path->nodes[0], &remap,
+		btrfs_item_ptr_offset(path->nodes[0], path->slots[0]),
+		sizeof(struct btrfs_remap));
+
+	btrfs_release_path(path);
+
+	/* Add entry for remainder of identity mapping, if necessary. */
+
+	if (key.objectid + key.offset != old_addr + length) {
+		new_key.objectid = old_addr + length;
+		new_key.type = BTRFS_IDENTITY_REMAP_KEY;
+		new_key.offset = key.objectid + key.offset - old_addr - length;
+
+		ret = btrfs_insert_empty_item(trans, fs_info->remap_root,
+					      path, &new_key, 0);
+		if (ret)
+			return ret;
+
+		btrfs_release_path(path);
+
+		identity_count_delta++;
+	}
+
+	/* Add backref. */
+
+	new_key.objectid = new_addr;
+	new_key.type = BTRFS_REMAP_BACKREF_KEY;
+	new_key.offset = length;
+
+	ret = btrfs_insert_empty_item(trans, fs_info->remap_root, path,
+				      &new_key, sizeof(struct btrfs_remap));
+	if (ret)
+		return ret;
+
+	btrfs_set_stack_remap_address(&remap, old_addr);
+
+	write_extent_buffer(path->nodes[0], &remap,
+		btrfs_item_ptr_offset(path->nodes[0], path->slots[0]),
+		sizeof(struct btrfs_remap));
+
+	btrfs_release_path(path);
+
+	if (identity_count_delta == 0)
+		return 0;
+
+	ret = adjust_identity_remap_count(trans, path, src_bg,
+					  identity_count_delta);
+	if (ret)
+		return ret;
+
+	return 0;
+}
+
+static int add_remap_entry(struct btrfs_trans_handle *trans,
+			   struct btrfs_path *path,
+			   struct btrfs_block_group *src_bg, u64 old_addr,
+			   u64 new_addr, u64 length)
+{
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	struct btrfs_key key;
+	struct extent_buffer *leaf;
+	int ret;
+
+	key.objectid = old_addr;
+	key.type = BTRFS_IDENTITY_REMAP_KEY;
+	key.offset = 0;
+
+	ret = btrfs_search_slot(trans, fs_info->remap_root, &key, path, 0, 1);
+	if (ret < 0)
+		goto end;
+
+	btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+
+	if (key.objectid >= old_addr) {
+		if (path->slots[0] == 0) {
+			ret = btrfs_prev_leaf(trans, fs_info->remap_root, path,
+					      0, 1);
+			if (ret < 0)
+				goto end;
+		} else {
+			path->slots[0]--;
+		}
+	}
+
+	while (true) {
+		leaf = path->nodes[0];
+		if (path->slots[0] >= btrfs_header_nritems(leaf)) {
+			ret = btrfs_next_leaf(fs_info->remap_root, path);
+			if (ret < 0)
+				goto end;
+			else if (ret == 1)
+				break;
+			leaf = path->nodes[0];
+		}
+
+		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+
+		if (key.objectid >= old_addr + length) {
+			ret = -ENOENT;
+			goto end;
+		}
+
+		if (key.type != BTRFS_REMAP_KEY &&
+		    key.type != BTRFS_IDENTITY_REMAP_KEY) {
+			path->slots[0]++;
+			continue;
+		}
+
+		if (key.type == BTRFS_REMAP_KEY &&
+		    key.objectid + key.offset == old_addr) {
+			ret = merge_remap_entries(trans, path, src_bg, old_addr,
+						  new_addr, length);
+			if (ret < 0) {
+				goto end;
+			} else if (ret == 0) {
+				path->slots[0]++;
+				continue;
+			}
+			break;
+		}
+
+		if (key.objectid <= old_addr &&
+		    key.type == BTRFS_IDENTITY_REMAP_KEY &&
+		    key.objectid + key.offset > old_addr) {
+			ret = add_new_remap_entry(trans, path, src_bg,
+						  old_addr, new_addr, length);
+			if (ret)
+				goto end;
+			break;
+		}
+
+		path->slots[0]++;
+	}
+
+	ret = 0;
+
+end:
+	btrfs_release_path(path);
+
+	return ret;
+}
+
 static int mark_chunk_remapped(struct btrfs_trans_handle *trans,
 			       struct btrfs_path *path, uint64_t start)
 {
@@ -4828,6 +5164,188 @@ static int mark_chunk_remapped(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
+static int do_remap_tree_reloc_trans(struct btrfs_fs_info *fs_info,
+				     struct btrfs_block_group *src_bg,
+				     struct btrfs_path *path, u64 *last_start)
+{
+	struct btrfs_trans_handle *trans;
+	struct btrfs_root *extent_root;
+	struct btrfs_key ins;
+	struct btrfs_block_group *dest_bg = NULL;
+	struct btrfs_chunk_map *chunk;
+	u64 start, remap_length, length, new_addr, min_size;
+	int ret;
+	bool no_more = false;
+	bool is_data = src_bg->flags & BTRFS_BLOCK_GROUP_DATA;
+	bool made_reservation = false, bg_needs_free_space;
+	struct btrfs_space_info *sinfo = src_bg->space_info;
+
+	extent_root = btrfs_extent_root(fs_info, src_bg->start);
+
+	trans = btrfs_start_transaction(extent_root, 0);
+	if (IS_ERR(trans))
+		return PTR_ERR(trans);
+
+	mutex_lock(&fs_info->remap_mutex);
+
+	ret = find_next_identity_remap(trans, path, src_bg->start + src_bg->length,
+				       *last_start, &start, &remap_length);
+	if (ret == -ENOENT) {
+		no_more = true;
+		goto next;
+	} else if (ret) {
+		mutex_unlock(&fs_info->remap_mutex);
+		btrfs_end_transaction(trans);
+		return ret;
+	}
+
+	/* Try to reserve enough space for block. */
+
+	spin_lock(&sinfo->lock);
+	btrfs_space_info_update_bytes_may_use(sinfo, remap_length);
+	spin_unlock(&sinfo->lock);
+
+	if (is_data)
+		min_size = fs_info->sectorsize;
+	else
+		min_size = fs_info->nodesize;
+
+	ret = btrfs_reserve_extent(fs_info->fs_root, remap_length,
+				   remap_length, min_size,
+				   0, 0, &ins, is_data, false);
+	if (ret) {
+		spin_lock(&sinfo->lock);
+		btrfs_space_info_update_bytes_may_use(sinfo, -remap_length);
+		spin_unlock(&sinfo->lock);
+
+		mutex_unlock(&fs_info->remap_mutex);
+		btrfs_end_transaction(trans);
+		return ret;
+	}
+
+	made_reservation = true;
+
+	new_addr = ins.objectid;
+	length = ins.offset;
+
+	if (!is_data && length % fs_info->nodesize) {
+		u64 new_length = length - (length % fs_info->nodesize);
+
+		btrfs_free_reserved_extent(fs_info, new_addr + new_length,
+					   length - new_length, 0);
+
+		length = new_length;
+	}
+
+	ret = add_to_free_space_tree(trans, start, length);
+	if (ret)
+		goto fail;
+
+	dest_bg = btrfs_lookup_block_group(fs_info, new_addr);
+
+	mutex_lock(&dest_bg->free_space_lock);
+	bg_needs_free_space = test_bit(BLOCK_GROUP_FLAG_NEEDS_FREE_SPACE,
+				       &dest_bg->runtime_flags);
+	mutex_unlock(&dest_bg->free_space_lock);
+
+	if (bg_needs_free_space) {
+		ret = add_block_group_free_space(trans, dest_bg);
+		if (ret)
+			goto fail;
+	}
+
+	ret = remove_from_free_space_tree(trans, new_addr, length);
+	if (ret)
+		goto fail;
+
+	ret = do_copy(fs_info, start, new_addr, length);
+	if (ret)
+		goto fail;
+
+	ret = add_remap_entry(trans, path, src_bg, start, new_addr, length);
+	if (ret)
+		goto fail;
+
+	adjust_block_group_remap_bytes(trans, dest_bg, length);
+	btrfs_free_reserved_bytes(dest_bg, length, 0);
+
+	spin_lock(&sinfo->lock);
+	sinfo->bytes_readonly += length;
+	spin_unlock(&sinfo->lock);
+
+next:
+	if (dest_bg)
+		btrfs_put_block_group(dest_bg);
+
+	if (made_reservation)
+		btrfs_dec_block_group_reservations(fs_info, new_addr);
+
+	if (src_bg->used == 0 && src_bg->remap_bytes == 0) {
+		chunk = btrfs_find_chunk_map(fs_info, src_bg->start, 1);
+		if (!chunk) {
+			mutex_unlock(&fs_info->remap_mutex);
+			btrfs_end_transaction(trans);
+			return -ENOENT;
+		}
+
+		ret = last_identity_remap_gone(trans, chunk, src_bg, path);
+		if (ret) {
+			btrfs_free_chunk_map(chunk);
+			mutex_unlock(&fs_info->remap_mutex);
+			btrfs_end_transaction(trans);
+			return ret;
+		}
+
+		btrfs_free_chunk_map(chunk);
+	}
+
+	mutex_unlock(&fs_info->remap_mutex);
+
+	ret = btrfs_end_transaction(trans);
+	if (ret)
+		return ret;
+
+	if (no_more)
+		return 1;
+
+	*last_start = start;
+
+	return 0;
+
+fail:
+	if (dest_bg)
+		btrfs_put_block_group(dest_bg);
+
+	btrfs_free_reserved_extent(fs_info, new_addr, length, 0);
+
+	mutex_unlock(&fs_info->remap_mutex);
+	btrfs_end_transaction(trans);
+
+	return ret;
+}
+
+static int do_remap_tree_reloc(struct btrfs_fs_info *fs_info,
+			       struct btrfs_path *path,
+			       struct btrfs_block_group *bg)
+{
+	u64 last_start;
+	int ret;
+
+	last_start = bg->start;
+
+	while (true) {
+		ret = do_remap_tree_reloc_trans(fs_info, bg, path,
+						&last_start);
+		if (ret) {
+			if (ret == 1)
+				ret = 0;
+			break;
+		}
+	}
+
+	return ret;
+}
+
 int btrfs_translate_remap(struct btrfs_fs_info *fs_info, u64 *logical,
 			  u64 *length)
 {
@@ -5073,6 +5591,10 @@ int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start,
 		}
 
 		err = start_block_group_remapping(fs_info, path, bg);
+		if (err)
+			goto out;
+
+		err = do_remap_tree_reloc(fs_info, path, rc->block_group);
 
 		goto out;
 	}
-- 
2.49.0



* Re: [RFC PATCH 09/10] btrfs: move existing remaps before relocating block group
       [not found]   ` <202505161726.w1lqCZxG-lkp@intel.com>
@ 2025-05-16 11:43     ` Mark Harmstone
  0 siblings, 0 replies; 20+ messages in thread
From: Mark Harmstone @ 2025-05-16 11:43 UTC (permalink / raw)
  To: kernel test robot, linux-btrfs@vger.kernel.org

The kernel bot is saying here that I'm relying on a 64-bit modulo, which
needs a libgcc helper on 32-bit processors; that's easy enough to sort
out for the actual patch series.

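For reference, the usual way to avoid the 64-bit modulo (and hence the
__umoddi3 libgcc call) when the divisor is a power of two is mask-based
rounding, as the kernel's ALIGN_DOWN() does. A minimal userspace sketch
with an invented name, assuming nodesize is a power of two:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Round length down to a multiple of nodesize without a 64-bit
 * modulo, so no __umoddi3 call is emitted on 32-bit targets.
 * Only valid when nodesize is a power of two, which btrfs node
 * sizes always are.
 */
static uint64_t align_down_u64(uint64_t length, uint64_t nodesize)
{
	return length & ~(nodesize - 1);
}
```
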
On 16/5/25 10:42, kernel test robot wrote:
> Hi Mark,
> 
> [This is a private test report for your RFC patch.]
> kernel test robot noticed the following build errors:
> 
> [auto build test ERROR on kdave/for-next]
> [also build test ERROR on next-20250515]
> [cannot apply to linus/master v6.15-rc6]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
> 
> url:    https://github.com/intel-lab-lkp/linux/commits/Mark-Harmstone/btrfs-add-definitions-and-constants-for-remap-tree/20250516-003914
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git  for-next
> patch link:    https://lore.kernel.org/r/20250515163641.3449017-10-maharmstone%40fb.com
> patch subject: [RFC PATCH 09/10] btrfs: move existing remaps before relocating block group
> config: i386-buildonly-randconfig-002-20250516 (https://download.01.org/0day-ci/archive/20250516/202505161726.w1lqCZxG-lkp@intel.com/config)
> compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250516/202505161726.w1lqCZxG-lkp@intel.com/reproduce)
> 
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202505161726.w1lqCZxG-lkp@intel.com/
> 
> All errors (new ones prefixed by >>, old ones prefixed by <<):
> 
>>> ERROR: modpost: "__umoddi3" [fs/btrfs/btrfs.ko] undefined!
> 



* Re: [RFC PATCH 10/10] btrfs: replace identity maps with actual remaps when doing relocations
  2025-05-15 16:36 ` [RFC PATCH 10/10] btrfs: replace identity maps with actual remaps when doing relocations Mark Harmstone
@ 2025-05-21  0:04   ` Boris Burkov
  2025-05-23 14:54     ` Mark Harmstone
  0 siblings, 1 reply; 20+ messages in thread
From: Boris Burkov @ 2025-05-21  0:04 UTC (permalink / raw)
  To: Mark Harmstone; +Cc: linux-btrfs

On Thu, May 15, 2025 at 05:36:38PM +0100, Mark Harmstone wrote:
> Add a function do_remap_tree_reloc(), which does the actual work of
> doing a relocation using the remap tree.
> 
> In a loop we call do_remap_tree_reloc_trans(), which searches for the
> first identity remap for the block group. We call btrfs_reserve_extent()
> to find space elsewhere for it, and read the data into memory and write
> it to the new location. We then carve out the identity remap and replace
> it with an actual remap, which points to the new location in which to
> look.
> 
> Once the last identity remap has been removed we call
> last_identity_remap_gone(), which, as with deletions, removes the
> chunk's stripes and device extents.

I think this is a good candidate for unit testing. Just hammer a bunch
of cases adding/removing/merging remaps.

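As a rough illustration of the invariant such tests would hammer, here
is a standalone model of the merge check (pure interval arithmetic, all
names invented, nothing btrfs-specific): a merge is only legal when both
the source range and the destination range are contiguous with the
previous entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative model of a remap: src range of len bytes mapped to dst. */
struct remap {
	uint64_t src;
	uint64_t dst;
	uint64_t len;
};

/*
 * Fold next into prev if and only if both the source and the
 * destination ranges are contiguous, mirroring the check in
 * merge_remap_entries().  Returns true if the merge happened.
 */
static bool try_merge(struct remap *prev, const struct remap *next)
{
	if (prev->src + prev->len != next->src ||
	    prev->dst + prev->len != next->dst)
		return false;

	prev->len += next->len;
	return true;
}
```

A unit test could generate random carve/merge sequences against a flat
array of these and check that the union of source ranges never changes.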
> 
> Signed-off-by: Mark Harmstone <maharmstone@fb.com>
> ---
>  fs/btrfs/relocation.c | 522 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 522 insertions(+)
> 
> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> index 7da95b82c798..bcf04d4c5af1 100644
> --- a/fs/btrfs/relocation.c
> +++ b/fs/btrfs/relocation.c
> @@ -4660,6 +4660,60 @@ static int mark_bg_remapped(struct btrfs_trans_handle *trans,
>  	return ret;
>  }
>  

Thinking out loud: I wonder, if you do end up re-modeling the
transactions such that we do one transaction per loop iteration or
something, then maybe you can use btrfs_for_each_slot.

> +static int find_next_identity_remap(struct btrfs_trans_handle *trans,
> +				    struct btrfs_path *path, u64 bg_end,
> +				    u64 last_start, u64 *start,
> +				    u64 *length)
> +{
> +	int ret;
> +	struct btrfs_key key, found_key;
> +	struct btrfs_root *remap_root = trans->fs_info->remap_root;
> +	struct extent_buffer *leaf;
> +
> +	key.objectid = last_start;
> +	key.type = BTRFS_IDENTITY_REMAP_KEY;
> +	key.offset = 0;
> +
> +	ret = btrfs_search_slot(trans, remap_root, &key, path, 0, 0);
> +	if (ret < 0)
> +		goto out;
> +
> +	leaf = path->nodes[0];
> +	while (true) {
> +		btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
> +
> +		if (found_key.objectid >= bg_end) {
> +			ret = -ENOENT;
> +			goto out;
> +		}
> +
> +		if (found_key.type == BTRFS_IDENTITY_REMAP_KEY) {
> +			*start = found_key.objectid;
> +			*length = found_key.offset;
> +			ret = 0;
> +			goto out;
> +		}
> +
> +		path->slots[0]++;
> +		if (path->slots[0] >= btrfs_header_nritems(leaf)) {
> +			ret = btrfs_next_leaf(remap_root, path);
> +
> +			if (ret != 0) {
> +				if (ret == 1)
> +					ret = -ENOENT;
> +				goto out;
> +			}
> +
> +			leaf = path->nodes[0];
> +		}
> +	}
> +
> +out:
> +	btrfs_release_path(path);
> +
> +	return ret;
> +}
> +
>  static int remove_chunk_stripes(struct btrfs_trans_handle *trans,
>  				struct btrfs_chunk_map *chunk,
>  				struct btrfs_path *path)
> @@ -4779,6 +4833,288 @@ static int adjust_identity_remap_count(struct btrfs_trans_handle *trans,
>  	return ret;
>  }
>  
> +static int merge_remap_entries(struct btrfs_trans_handle *trans,
> +			       struct btrfs_path *path,
> +			       struct btrfs_block_group *src_bg, u64 old_addr,
> +			       u64 new_addr, u64 length)
> +{
> +	struct btrfs_fs_info *fs_info = trans->fs_info;
> +	struct btrfs_remap *remap_ptr;
> +	struct extent_buffer *leaf;
> +	struct btrfs_key key, new_key;
> +	u64 last_addr, old_length;
> +	int ret;
> +
> +	leaf = path->nodes[0];
> +	btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
> +
> +	remap_ptr = btrfs_item_ptr(leaf, path->slots[0],
> +				   struct btrfs_remap);
> +
> +	last_addr = btrfs_remap_address(leaf, remap_ptr);
> +	old_length = key.offset;
> +
> +	if (last_addr + old_length != new_addr)
> +		return 0;
> +
> +	/* Merge entries. */
> +
> +	new_key.objectid = key.objectid;
> +	new_key.type = BTRFS_REMAP_KEY;
> +	new_key.offset = old_length + length;
> +
> +	btrfs_set_item_key_safe(trans, path, &new_key);
> +
> +	btrfs_release_path(path);
> +
> +	/* Merge backref too. */
> +
> +	key.objectid = new_addr - old_length;
> +	key.type = BTRFS_REMAP_BACKREF_KEY;
> +	key.offset = old_length;
> +
> +	ret = btrfs_search_slot(trans, fs_info->remap_root, &key, path, 0, 1);
> +	if (ret < 0) {
> +		return ret;
> +	} else if (ret == 1) {
> +		btrfs_release_path(path);
> +		return -ENOENT;
> +	}
> +
> +	new_key.objectid = new_addr - old_length;
> +	new_key.type = BTRFS_REMAP_BACKREF_KEY;
> +	new_key.offset = old_length + length;
> +
> +	btrfs_set_item_key_safe(trans, path, &new_key);
> +
> +	btrfs_release_path(path);
> +
> +	/* Fix the following identity map. */
> +
> +	key.objectid = old_addr;
> +	key.type = BTRFS_IDENTITY_REMAP_KEY;
> +	key.offset = 0;
> +
> +	ret = btrfs_search_slot(trans, fs_info->remap_root, &key, path, 0, 1);
> +	if (ret < 0)
> +		return ret;
> +
> +	btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
> +
> +	if (key.objectid != old_addr || key.type != BTRFS_IDENTITY_REMAP_KEY)
> +		return -ENOENT;
> +
> +	if (key.offset == length) {
> +		ret = btrfs_del_item(trans, fs_info->remap_root, path);
> +		if (ret)
> +			return ret;
> +
> +		btrfs_release_path(path);
> +
> +		ret = adjust_identity_remap_count(trans, path, src_bg, -1);
> +		if (ret)
> +			return ret;
> +
> +		return 1;
> +	}
> +
> +	new_key.objectid = old_addr + length;
> +	new_key.type = BTRFS_IDENTITY_REMAP_KEY;
> +	new_key.offset = key.offset - length;
> +
> +	btrfs_set_item_key_safe(trans, path, &new_key);
> +
> +	btrfs_release_path(path);
> +
> +	return 1;
> +}
> +
> +static int add_new_remap_entry(struct btrfs_trans_handle *trans,
> +			       struct btrfs_path *path,
> +			       struct btrfs_block_group *src_bg, u64 old_addr,
> +			       u64 new_addr, u64 length)
> +{
> +	struct btrfs_fs_info *fs_info = trans->fs_info;
> +	struct btrfs_key key, new_key;
> +	struct btrfs_remap remap;
> +	int ret;
> +	int identity_count_delta = 0;
> +
> +	btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
> +
> +	/* Shorten or delete identity mapping entry. */
> +
> +	if (key.objectid == old_addr) {
> +		ret = btrfs_del_item(trans, fs_info->remap_root, path);
> +		if (ret)
> +			return ret;
> +
> +		identity_count_delta--;
> +	} else {
> +		new_key.objectid = key.objectid;
> +		new_key.type = BTRFS_IDENTITY_REMAP_KEY;
> +		new_key.offset = old_addr - key.objectid;
> +
> +		btrfs_set_item_key_safe(trans, path, &new_key);
> +	}
> +
> +	btrfs_release_path(path);
> +
> +	/* Create new remap entry. */
> +
> +	new_key.objectid = old_addr;
> +	new_key.type = BTRFS_REMAP_KEY;
> +	new_key.offset = length;
> +
> +	ret = btrfs_insert_empty_item(trans, fs_info->remap_root,
> +		path, &new_key, sizeof(struct btrfs_remap));
> +	if (ret)
> +		return ret;
> +
> +	btrfs_set_stack_remap_address(&remap, new_addr);
> +
> +	write_extent_buffer(path->nodes[0], &remap,
> +		btrfs_item_ptr_offset(path->nodes[0], path->slots[0]),
> +		sizeof(struct btrfs_remap));
> +
> +	btrfs_release_path(path);
> +
> +	/* Add entry for remainder of identity mapping, if necessary. */
> +
> +	if (key.objectid + key.offset != old_addr + length) {
> +		new_key.objectid = old_addr + length;
> +		new_key.type = BTRFS_IDENTITY_REMAP_KEY;
> +		new_key.offset = key.objectid + key.offset - old_addr - length;
> +
> +		ret = btrfs_insert_empty_item(trans, fs_info->remap_root,
> +					      path, &new_key, 0);
> +		if (ret)
> +			return ret;
> +
> +		btrfs_release_path(path);
> +
> +		identity_count_delta++;
> +	}
> +
> +	/* Add backref. */
> +
> +	new_key.objectid = new_addr;
> +	new_key.type = BTRFS_REMAP_BACKREF_KEY;
> +	new_key.offset = length;
> +
> +	ret = btrfs_insert_empty_item(trans, fs_info->remap_root, path,
> +				      &new_key, sizeof(struct btrfs_remap));
> +	if (ret)
> +		return ret;
> +
> +	btrfs_set_stack_remap_address(&remap, old_addr);
> +
> +	write_extent_buffer(path->nodes[0], &remap,
> +		btrfs_item_ptr_offset(path->nodes[0], path->slots[0]),
> +		sizeof(struct btrfs_remap));
> +
> +	btrfs_release_path(path);
> +
> +	if (identity_count_delta == 0)
> +		return 0;
> +
> +	ret = adjust_identity_remap_count(trans, path, src_bg,
> +					  identity_count_delta);
> +	if (ret)
> +		return ret;
> +
> +	return 0;
> +}
> +
> +static int add_remap_entry(struct btrfs_trans_handle *trans,
> +			   struct btrfs_path *path,
> +			   struct btrfs_block_group *src_bg, u64 old_addr,
> +			   u64 new_addr, u64 length)
> +{
> +	struct btrfs_fs_info *fs_info = trans->fs_info;
> +	struct btrfs_key key;
> +	struct extent_buffer *leaf;
> +	int ret;
> +
> +	key.objectid = old_addr;
> +	key.type = BTRFS_IDENTITY_REMAP_KEY;
> +	key.offset = 0;
> +

Can this lookup code be shared at all with the remapping logic in the
previous patch? It seems that fundamentally both are finding a remap
entry for a given logical address. Or is it impossible since this one
needs CoW?

Maybe some kind of prev_item helper that's either remap tree specific or
for this use case of going back exactly one item instead of obeying a
min_objectid?

> +	ret = btrfs_search_slot(trans, fs_info->remap_root, &key, path, 0, 1);
> +	if (ret < 0)
> +		goto end;
> +
> +	btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
> +
> +	if (key.objectid >= old_addr) {
> +		if (path->slots[0] == 0) {
> +			ret = btrfs_prev_leaf(trans, fs_info->remap_root, path,
> +					      0, 1);
> +			if (ret < 0)
> +				goto end;
> +		} else {
> +			path->slots[0]--;
> +		}
> +	}
> +
> +	while (true) {
> +		leaf = path->nodes[0];
> +		if (path->slots[0] >= btrfs_header_nritems(leaf)) {
> +			ret = btrfs_next_leaf(fs_info->remap_root, path);
> +			if (ret < 0)
> +				goto end;
> +			else if (ret == 1)
> +				break;
> +			leaf = path->nodes[0];
> +		}
> +
> +		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
> +
> +		if (key.objectid >= old_addr + length) {
> +			ret = -ENOENT;
> +			goto end;
> +		}
> +
> +		if (key.type != BTRFS_REMAP_KEY &&
> +		    key.type != BTRFS_IDENTITY_REMAP_KEY) {
> +			path->slots[0]++;
> +			continue;
> +		}
> +
> +		if (key.type == BTRFS_REMAP_KEY &&
> +		    key.objectid + key.offset == old_addr) {
> +			ret = merge_remap_entries(trans, path, src_bg, old_addr,
> +						  new_addr, length);
> +			if (ret < 0) {
> +				goto end;
> +			} else if (ret == 0) {
> +				path->slots[0]++;
> +				continue;
> +			}
> +			break;
> +		}
> +
> +		if (key.objectid <= old_addr &&
> +		    key.type == BTRFS_IDENTITY_REMAP_KEY &&
> +		    key.objectid + key.offset > old_addr) {
> +			ret = add_new_remap_entry(trans, path, src_bg,
> +						  old_addr, new_addr, length);
> +			if (ret)
> +				goto end;
> +			break;
> +		}
> +
> +		path->slots[0]++;
> +	}
> +
> +	ret = 0;
> +
> +end:
> +	btrfs_release_path(path);
> +
> +	return ret;
> +}
> +
>  static int mark_chunk_remapped(struct btrfs_trans_handle *trans,
>  			       struct btrfs_path *path, uint64_t start)
>  {
> @@ -4828,6 +5164,188 @@ static int mark_chunk_remapped(struct btrfs_trans_handle *trans,
>  	return ret;
>  }
>  
> +static int do_remap_tree_reloc_trans(struct btrfs_fs_info *fs_info,
> +				     struct btrfs_block_group *src_bg,
> +				     struct btrfs_path *path, u64 *last_start)
> +{
> +	struct btrfs_trans_handle *trans;
> +	struct btrfs_root *extent_root;
> +	struct btrfs_key ins;
> +	struct btrfs_block_group *dest_bg = NULL;
> +	struct btrfs_chunk_map *chunk;
> +	u64 start, remap_length, length, new_addr, min_size;
> +	int ret;
> +	bool no_more = false;
> +	bool is_data = src_bg->flags & BTRFS_BLOCK_GROUP_DATA;
> +	bool made_reservation = false, bg_needs_free_space;
> +	struct btrfs_space_info *sinfo = src_bg->space_info;
> +
> +	extent_root = btrfs_extent_root(fs_info, src_bg->start);
> +
> +	trans = btrfs_start_transaction(extent_root, 0);
> +	if (IS_ERR(trans))
> +		return PTR_ERR(trans);
> +
> +	mutex_lock(&fs_info->remap_mutex);
> +
> +	ret = find_next_identity_remap(trans, path, src_bg->start + src_bg->length,
> +				       *last_start, &start, &remap_length);
> +	if (ret == -ENOENT) {
> +		no_more = true;
> +		goto next;
> +	} else if (ret) {
> +		mutex_unlock(&fs_info->remap_mutex);
> +		btrfs_end_transaction(trans);
> +		return ret;
> +	}
> +
> +	/* Try to reserve enough space for block. */
> +
> +	spin_lock(&sinfo->lock);
> +	btrfs_space_info_update_bytes_may_use(sinfo, remap_length);

Why isn't this partly leaked if btrfs_reserve_extent returns a smaller extent than
remap_length?

> +	spin_unlock(&sinfo->lock);
> +
> +	if (is_data)
> +		min_size = fs_info->sectorsize;
> +	else
> +		min_size = fs_info->nodesize;
> +
> +	ret = btrfs_reserve_extent(fs_info->fs_root, remap_length,
> +				   remap_length, min_size,
> +				   0, 0, &ins, is_data, false);

^ i.e., this will reduce bytes_may_use by the amount it actually
reserved, and I don't see anywhere where we make up the difference. Then
it looks like we will remap the extent we can, find the next free range,
come back to this function, and add that remaining range to
bytes_may_use a second time.

> +	if (ret) {
> +		spin_lock(&sinfo->lock);
> +		btrfs_space_info_update_bytes_may_use(sinfo, -remap_length);
> +		spin_unlock(&sinfo->lock);
> +
> +		mutex_unlock(&fs_info->remap_mutex);
> +		btrfs_end_transaction(trans);
> +		return ret;
> +	}
> +
> +	made_reservation = true;
> +
> +	new_addr = ins.objectid;
> +	length = ins.offset;
> +
> +	if (!is_data && length % fs_info->nodesize) {
> +		u64 new_length = length - (length % fs_info->nodesize);

Why not use the IS_ALIGNED / ALIGN_DOWN macros? Nodesize is a power of
two, so I think it should be quicker. Probably doesn't matter, but it
does seem to be the predominant pattern in the code base. Also avoids
ever worrying about dividing by zero.

> +
> +		btrfs_free_reserved_extent(fs_info, new_addr + new_length,
> +					   length - new_length, 0);
> +
> +		length = new_length;
> +	}
> +
> +	ret = add_to_free_space_tree(trans, start, length);

Can you explain this? Intuitively, to me, the old remapped address is
not a logical range we can allocate from, so it should not be in the
free space tree. Is this a hack to get the bytes back into the
accounting, with new allocations blocked because the remapped block
group is read-only?

> +	if (ret)
> +		goto fail;
> +
> +	dest_bg = btrfs_lookup_block_group(fs_info, new_addr);
> +
> +	mutex_lock(&dest_bg->free_space_lock);
> +	bg_needs_free_space = test_bit(BLOCK_GROUP_FLAG_NEEDS_FREE_SPACE,
> +				       &dest_bg->runtime_flags);
> +	mutex_unlock(&dest_bg->free_space_lock);
> +
> +	if (bg_needs_free_space) {
> +		ret = add_block_group_free_space(trans, dest_bg);
> +		if (ret)
> +			goto fail;
> +	}
> +
> +	ret = remove_from_free_space_tree(trans, new_addr, length);
> +	if (ret)
> +		goto fail;

I think you have also discussed this recently with Josef, but it seems
a little sketchy. I suppose it depends on whether the remap tree ends up
getting delayed refs and going in the extent tree? I think this is
currently only called from alloc_reserved_extent.

> +
> +	ret = do_copy(fs_info, start, new_addr, length);
> +	if (ret)
> +		goto fail;
> +
> +	ret = add_remap_entry(trans, path, src_bg, start, new_addr, length);
> +	if (ret)
> +		goto fail;
> +
> +	adjust_block_group_remap_bytes(trans, dest_bg, length);
> +	btrfs_free_reserved_bytes(dest_bg, length, 0);
> +
> +	spin_lock(&sinfo->lock);
> +	sinfo->bytes_readonly += length;
> +	spin_unlock(&sinfo->lock);
> +
> +next:
> +	if (dest_bg)
> +		btrfs_put_block_group(dest_bg);
> +
> +	if (made_reservation)
> +		btrfs_dec_block_group_reservations(fs_info, new_addr);
> +
> +	if (src_bg->used == 0 && src_bg->remap_bytes == 0) {
> +		chunk = btrfs_find_chunk_map(fs_info, src_bg->start, 1);
> +		if (!chunk) {
> +			mutex_unlock(&fs_info->remap_mutex);
> +			btrfs_end_transaction(trans);
> +			return -ENOENT;
> +		}
> +
> +		ret = last_identity_remap_gone(trans, chunk, src_bg, path);
> +		if (ret) {
> +			btrfs_free_chunk_map(chunk);
> +			mutex_unlock(&fs_info->remap_mutex);
> +			btrfs_end_transaction(trans);
> +			return ret;
> +		}
> +
> +		btrfs_free_chunk_map(chunk);
> +	}
> +
> +	mutex_unlock(&fs_info->remap_mutex);
> +
> +	ret = btrfs_end_transaction(trans);
> +	if (ret)
> +		return ret;
> +
> +	if (no_more)
> +		return 1;
> +
> +	*last_start = start;
> +
> +	return 0;
> +
> +fail:
> +	if (dest_bg)
> +		btrfs_put_block_group(dest_bg);
> +
> +	btrfs_free_reserved_extent(fs_info, new_addr, length, 0);
> +
> +	mutex_unlock(&fs_info->remap_mutex);
> +	btrfs_end_transaction(trans);
> +
> +	return ret;
> +}
> +
> +static int do_remap_tree_reloc(struct btrfs_fs_info *fs_info,
> +			       struct btrfs_path *path,
> +			       struct btrfs_block_group *bg)
> +{
> +	u64 last_start;
> +	int ret;
> +
> +	last_start = bg->start;
> +
> +	while (true) {
> +		ret = do_remap_tree_reloc_trans(fs_info, bg, path,
> +						&last_start);
> +		if (ret) {
> +			if (ret == 1)
> +				ret = 0;
> +			break;
> +		}
> +	}
> +
> +	return ret;
> +}
> +
>  int btrfs_translate_remap(struct btrfs_fs_info *fs_info, u64 *logical,
>  			  u64 *length)
>  {
> @@ -5073,6 +5591,10 @@ int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start,
>  		}
>  
>  		err = start_block_group_remapping(fs_info, path, bg);
> +		if (err)
> +			goto out;
> +
> +		err = do_remap_tree_reloc(fs_info, path, rc->block_group);
>  
>  		goto out;
>  	}
> -- 
> 2.49.0
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 01/10] btrfs: add definitions and constants for remap-tree
  2025-05-15 16:36 ` [RFC PATCH 01/10] btrfs: add definitions and constants for remap-tree Mark Harmstone
@ 2025-05-21 12:43   ` Johannes Thumshirn
  2025-05-23 13:06     ` Mark Harmstone
  0 siblings, 1 reply; 20+ messages in thread
From: Johannes Thumshirn @ 2025-05-21 12:43 UTC (permalink / raw)
  To: Mark Harmstone, linux-btrfs@vger.kernel.org

On 15.05.25 18:38, Mark Harmstone wrote:
> @@ -282,6 +285,10 @@
>   
>   #define BTRFS_RAID_STRIPE_KEY	230

Just a small heads-up: I'll need 231 for BTRFS_RAID_STRIPE_PARITY_KEY,
and maybe 232 as well. There's still space, just letting you know.

>   
> +#define BTRFS_IDENTITY_REMAP_KEY 	234
> +#define BTRFS_REMAP_KEY		 	235
> +#define BTRFS_REMAP_BACKREF_KEY	 	236


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 04/10] btrfs: add extended version of struct block_group_item
  2025-05-15 16:36 ` [RFC PATCH 04/10] btrfs: add extended version of struct block_group_item Mark Harmstone
@ 2025-05-23  9:53   ` Qu Wenruo
  2025-05-23 12:00     ` Mark Harmstone
  0 siblings, 1 reply; 20+ messages in thread
From: Qu Wenruo @ 2025-05-23  9:53 UTC (permalink / raw)
  To: Mark Harmstone, linux-btrfs



On 2025/5/16 02:06, Mark Harmstone wrote:
> Add a struct btrfs_block_group_item_v2, which is used in the block group
> tree if the remap-tree incompat flag is set.
> 
> This adds two new fields to the block group item: `remap_bytes` and
> `identity_remap_count`.
> 
> `remap_bytes` records the amount of data that's physically within this
> block group, but nominally in another, remapped block group. This is
> necessary because this data will need to be moved first if this block
> group is itself relocated. If `remap_bytes` > 0, this is an indicator to
> the relocation thread that it will need to search the remap-tree for
> backrefs. A block group must also have `remap_bytes` == 0 before it can
> be dropped.
> 
> `identity_remap_count` records how many identity remap items are located
> in the remap tree for this block group. When relocation is begun for
> this block group, this is set to the number of holes in the free-space
> tree for this range. As identity remaps are converted into actual remaps
> by the relocation process, this number is decreased. Once it reaches 0,
> either because of relocation or because extents have been deleted, the
> block group has been fully remapped and its chunk's device extents are
> removed.

Can we put those two members into a separate new item, rather than a
completely new v2 block group item?

I mean, regular block groups do not need those members, and all block
groups start out as regular ones at mkfs time.

We could add a regular block group flag to indicate whether the bg has
the extra members.

Thanks,
Qu

> 
> Signed-off-by: Mark Harmstone <maharmstone@fb.com>
> ---
>   fs/btrfs/accessors.h            |  20 +++++++
>   fs/btrfs/block-group.c          | 101 ++++++++++++++++++++++++--------
>   fs/btrfs/block-group.h          |  14 ++++-
>   fs/btrfs/tree-checker.c         |  10 +++-
>   include/uapi/linux/btrfs_tree.h |   8 +++
>   5 files changed, 126 insertions(+), 27 deletions(-)
> 
> diff --git a/fs/btrfs/accessors.h b/fs/btrfs/accessors.h
> index 5f5eda8d6f9e..6e6dd664217b 100644
> --- a/fs/btrfs/accessors.h
> +++ b/fs/btrfs/accessors.h
> @@ -264,6 +264,26 @@ BTRFS_SETGET_FUNCS(block_group_flags, struct btrfs_block_group_item, flags, 64);
>   BTRFS_SETGET_STACK_FUNCS(stack_block_group_flags,
>   			struct btrfs_block_group_item, flags, 64);
>   
> +/* struct btrfs_block_group_item_v2 */
> +BTRFS_SETGET_STACK_FUNCS(stack_block_group_v2_used, struct btrfs_block_group_item_v2,
> +			 used, 64);
> +BTRFS_SETGET_FUNCS(block_group_v2_used, struct btrfs_block_group_item_v2, used, 64);
> +BTRFS_SETGET_STACK_FUNCS(stack_block_group_v2_chunk_objectid,
> +			 struct btrfs_block_group_item_v2, chunk_objectid, 64);
> +BTRFS_SETGET_FUNCS(block_group_v2_chunk_objectid,
> +		   struct btrfs_block_group_item_v2, chunk_objectid, 64);
> +BTRFS_SETGET_STACK_FUNCS(stack_block_group_v2_flags,
> +			 struct btrfs_block_group_item_v2, flags, 64);
> +BTRFS_SETGET_FUNCS(block_group_v2_flags, struct btrfs_block_group_item_v2, flags, 64);
> +BTRFS_SETGET_STACK_FUNCS(stack_block_group_v2_remap_bytes,
> +			 struct btrfs_block_group_item_v2, remap_bytes, 64);
> +BTRFS_SETGET_FUNCS(block_group_v2_remap_bytes, struct btrfs_block_group_item_v2,
> +		   remap_bytes, 64);
> +BTRFS_SETGET_STACK_FUNCS(stack_block_group_v2_identity_remap_count,
> +			 struct btrfs_block_group_item_v2, identity_remap_count, 32);
> +BTRFS_SETGET_FUNCS(block_group_v2_identity_remap_count, struct btrfs_block_group_item_v2,
> +		   identity_remap_count, 32);
> +
>   /* struct btrfs_free_space_info */
>   BTRFS_SETGET_FUNCS(free_space_extent_count, struct btrfs_free_space_info,
>   		   extent_count, 32);
> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> index 5b0cb04b2b93..6a2aa792ccb2 100644
> --- a/fs/btrfs/block-group.c
> +++ b/fs/btrfs/block-group.c
> @@ -2351,7 +2351,7 @@ static int check_chunk_block_group_mappings(struct btrfs_fs_info *fs_info)
>   }
>   
>   static int read_one_block_group(struct btrfs_fs_info *info,
> -				struct btrfs_block_group_item *bgi,
> +				struct btrfs_block_group_item_v2 *bgi,
>   				const struct btrfs_key *key,
>   				int need_clear)
>   {
> @@ -2366,11 +2366,16 @@ static int read_one_block_group(struct btrfs_fs_info *info,
>   		return -ENOMEM;
>   
>   	cache->length = key->offset;
> -	cache->used = btrfs_stack_block_group_used(bgi);
> +	cache->used = btrfs_stack_block_group_v2_used(bgi);
>   	cache->commit_used = cache->used;
> -	cache->flags = btrfs_stack_block_group_flags(bgi);
> -	cache->global_root_id = btrfs_stack_block_group_chunk_objectid(bgi);
> +	cache->flags = btrfs_stack_block_group_v2_flags(bgi);
> +	cache->global_root_id = btrfs_stack_block_group_v2_chunk_objectid(bgi);
>   	cache->space_info = btrfs_find_space_info(info, cache->flags);
> +	cache->remap_bytes = btrfs_stack_block_group_v2_remap_bytes(bgi);
> +	cache->commit_remap_bytes = cache->remap_bytes;
> +	cache->identity_remap_count =
> +		btrfs_stack_block_group_v2_identity_remap_count(bgi);
> +	cache->commit_identity_remap_count = cache->identity_remap_count;
>   
>   	set_free_space_tree_thresholds(cache);
>   
> @@ -2435,7 +2440,7 @@ static int read_one_block_group(struct btrfs_fs_info *info,
>   	} else if (cache->length == cache->used) {
>   		cache->cached = BTRFS_CACHE_FINISHED;
>   		btrfs_free_excluded_extents(cache);
> -	} else if (cache->used == 0) {
> +	} else if (cache->used == 0 && cache->remap_bytes == 0) {
>   		cache->cached = BTRFS_CACHE_FINISHED;
>   		ret = btrfs_add_new_free_space(cache, cache->start,
>   					       cache->start + cache->length, NULL);
> @@ -2455,7 +2460,8 @@ static int read_one_block_group(struct btrfs_fs_info *info,
>   
>   	set_avail_alloc_bits(info, cache->flags);
>   	if (btrfs_chunk_writeable(info, cache->start)) {
> -		if (cache->used == 0) {
> +		if (cache->used == 0 && cache->identity_remap_count == 0 &&
> +		    cache->remap_bytes == 0) {
>   			ASSERT(list_empty(&cache->bg_list));
>   			if (btrfs_test_opt(info, DISCARD_ASYNC))
>   				btrfs_discard_queue_work(&info->discard_ctl, cache);
> @@ -2559,9 +2565,10 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
>   		need_clear = 1;
>   
>   	while (1) {
> -		struct btrfs_block_group_item bgi;
> +		struct btrfs_block_group_item_v2 bgi;
>   		struct extent_buffer *leaf;
>   		int slot;
> +		size_t size;
>   
>   		ret = find_first_block_group(info, path, &key);
>   		if (ret > 0)
> @@ -2572,8 +2579,16 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
>   		leaf = path->nodes[0];
>   		slot = path->slots[0];
>   
> +		if (btrfs_fs_incompat(info, REMAP_TREE)) {
> +			size = sizeof(struct btrfs_block_group_item_v2);
> +		} else {
> +			size = sizeof(struct btrfs_block_group_item);
> +			btrfs_set_stack_block_group_v2_remap_bytes(&bgi, 0);
> +			btrfs_set_stack_block_group_v2_identity_remap_count(&bgi, 0);
> +		}
> +
>   		read_extent_buffer(leaf, &bgi, btrfs_item_ptr_offset(leaf, slot),
> -				   sizeof(bgi));
> +				   size);
>   
>   		btrfs_item_key_to_cpu(leaf, &key, slot);
>   		btrfs_release_path(path);
> @@ -2643,25 +2658,38 @@ static int insert_block_group_item(struct btrfs_trans_handle *trans,
>   				   struct btrfs_block_group *block_group)
>   {
>   	struct btrfs_fs_info *fs_info = trans->fs_info;
> -	struct btrfs_block_group_item bgi;
> +	struct btrfs_block_group_item_v2 bgi;
>   	struct btrfs_root *root = btrfs_block_group_root(fs_info);
>   	struct btrfs_key key;
>   	u64 old_commit_used;
> +	size_t size;
>   	int ret;
>   
>   	spin_lock(&block_group->lock);
> -	btrfs_set_stack_block_group_used(&bgi, block_group->used);
> -	btrfs_set_stack_block_group_chunk_objectid(&bgi,
> -						   block_group->global_root_id);
> -	btrfs_set_stack_block_group_flags(&bgi, block_group->flags);
> +	btrfs_set_stack_block_group_v2_used(&bgi, block_group->used);
> +	btrfs_set_stack_block_group_v2_chunk_objectid(&bgi,
> +						      block_group->global_root_id);
> +	btrfs_set_stack_block_group_v2_flags(&bgi, block_group->flags);
> +	btrfs_set_stack_block_group_v2_remap_bytes(&bgi,
> +						   block_group->remap_bytes);
> +	btrfs_set_stack_block_group_v2_identity_remap_count(&bgi,
> +					block_group->identity_remap_count);
>   	old_commit_used = block_group->commit_used;
>   	block_group->commit_used = block_group->used;
> +	block_group->commit_remap_bytes = block_group->remap_bytes;
> +	block_group->commit_identity_remap_count =
> +		block_group->identity_remap_count;
>   	key.objectid = block_group->start;
>   	key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
>   	key.offset = block_group->length;
>   	spin_unlock(&block_group->lock);
>   
> -	ret = btrfs_insert_item(trans, root, &key, &bgi, sizeof(bgi));
> +	if (btrfs_fs_incompat(fs_info, REMAP_TREE))
> +		size = sizeof(struct btrfs_block_group_item_v2);
> +	else
> +		size = sizeof(struct btrfs_block_group_item);
> +
> +	ret = btrfs_insert_item(trans, root, &key, &bgi, size);
>   	if (ret < 0) {
>   		spin_lock(&block_group->lock);
>   		block_group->commit_used = old_commit_used;
> @@ -3116,10 +3144,12 @@ static int update_block_group_item(struct btrfs_trans_handle *trans,
>   	struct btrfs_root *root = btrfs_block_group_root(fs_info);
>   	unsigned long bi;
>   	struct extent_buffer *leaf;
> -	struct btrfs_block_group_item bgi;
> +	struct btrfs_block_group_item_v2 bgi;
>   	struct btrfs_key key;
> -	u64 old_commit_used;
> -	u64 used;
> +	u64 old_commit_used, old_commit_remap_bytes;
> +	u32 old_commit_identity_remap_count;
> +	u64 used, remap_bytes;
> +	u32 identity_remap_count;
>   
>   	/*
>   	 * Block group items update can be triggered out of commit transaction
> @@ -3129,13 +3159,21 @@ static int update_block_group_item(struct btrfs_trans_handle *trans,
>   	 */
>   	spin_lock(&cache->lock);
>   	old_commit_used = cache->commit_used;
> +	old_commit_remap_bytes = cache->commit_remap_bytes;
> +	old_commit_identity_remap_count = cache->commit_identity_remap_count;
>   	used = cache->used;
> -	/* No change in used bytes, can safely skip it. */
> -	if (cache->commit_used == used) {
> +	remap_bytes = cache->remap_bytes;
> +	identity_remap_count = cache->identity_remap_count;
> +	/* No change in values, can safely skip it. */
> +	if (cache->commit_used == used &&
> +	    cache->commit_remap_bytes == remap_bytes &&
> +	    cache->commit_identity_remap_count == identity_remap_count) {
>   		spin_unlock(&cache->lock);
>   		return 0;
>   	}
>   	cache->commit_used = used;
> +	cache->commit_remap_bytes = remap_bytes;
> +	cache->commit_identity_remap_count = identity_remap_count;
>   	spin_unlock(&cache->lock);
>   
>   	key.objectid = cache->start;
> @@ -3151,11 +3189,23 @@ static int update_block_group_item(struct btrfs_trans_handle *trans,
>   
>   	leaf = path->nodes[0];
>   	bi = btrfs_item_ptr_offset(leaf, path->slots[0]);
> -	btrfs_set_stack_block_group_used(&bgi, used);
> -	btrfs_set_stack_block_group_chunk_objectid(&bgi,
> -						   cache->global_root_id);
> -	btrfs_set_stack_block_group_flags(&bgi, cache->flags);
> -	write_extent_buffer(leaf, &bgi, bi, sizeof(bgi));
> +	btrfs_set_stack_block_group_v2_used(&bgi, used);
> +	btrfs_set_stack_block_group_v2_chunk_objectid(&bgi,
> +						      cache->global_root_id);
> +	btrfs_set_stack_block_group_v2_flags(&bgi, cache->flags);
> +
> +	if (btrfs_fs_incompat(fs_info, REMAP_TREE)) {
> +		btrfs_set_stack_block_group_v2_remap_bytes(&bgi,
> +							   cache->remap_bytes);
> +		btrfs_set_stack_block_group_v2_identity_remap_count(&bgi,
> +						cache->identity_remap_count);
> +		write_extent_buffer(leaf, &bgi, bi,
> +				    sizeof(struct btrfs_block_group_item_v2));
> +	} else {
> +		write_extent_buffer(leaf, &bgi, bi,
> +				    sizeof(struct btrfs_block_group_item));
> +	}
> +
>   fail:
>   	btrfs_release_path(path);
>   	/*
> @@ -3170,6 +3220,9 @@ static int update_block_group_item(struct btrfs_trans_handle *trans,
>   	if (ret < 0 && ret != -ENOENT) {
>   		spin_lock(&cache->lock);
>   		cache->commit_used = old_commit_used;
> +		cache->commit_remap_bytes = old_commit_remap_bytes;
> +		cache->commit_identity_remap_count =
> +			old_commit_identity_remap_count;
>   		spin_unlock(&cache->lock);
>   	}
>   	return ret;
> diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
> index 9de356bcb411..c484118b8b8d 100644
> --- a/fs/btrfs/block-group.h
> +++ b/fs/btrfs/block-group.h
> @@ -127,6 +127,8 @@ struct btrfs_block_group {
>   	u64 flags;
>   	u64 cache_generation;
>   	u64 global_root_id;
> +	u64 remap_bytes;
> +	u32 identity_remap_count;
>   
>   	/*
>   	 * The last committed used bytes of this block group, if the above @used
> @@ -134,6 +136,15 @@ struct btrfs_block_group {
>   	 * group item of this block group.
>   	 */
>   	u64 commit_used;
> +	/*
> +	 * The last committed remap_bytes value of this block group.
> +	 */
> +	u64 commit_remap_bytes;
> +	/*
> +	 * The last committed identity_remap_count value of this block group.
> +	 */
> +	u32 commit_identity_remap_count;
> +
>   	/*
>   	 * If the free space extent count exceeds this number, convert the block
>   	 * group to bitmaps.
> @@ -275,7 +286,8 @@ static inline bool btrfs_is_block_group_used(const struct btrfs_block_group *bg)
>   {
>   	lockdep_assert_held(&bg->lock);
>   
> -	return (bg->used > 0 || bg->reserved > 0 || bg->pinned > 0);
> +	return (bg->used > 0 || bg->reserved > 0 || bg->pinned > 0 ||
> +		bg->remap_bytes > 0);
>   }
>   
>   static inline bool btrfs_is_block_group_data_only(const struct btrfs_block_group *block_group)
> diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
> index fd83df06e3fb..25311576fab6 100644
> --- a/fs/btrfs/tree-checker.c
> +++ b/fs/btrfs/tree-checker.c
> @@ -687,6 +687,7 @@ static int check_block_group_item(struct extent_buffer *leaf,
>   	u64 chunk_objectid;
>   	u64 flags;
>   	u64 type;
> +	size_t exp_size;
>   
>   	/*
>   	 * Here we don't really care about alignment since extent allocator can
> @@ -698,10 +699,15 @@ static int check_block_group_item(struct extent_buffer *leaf,
>   		return -EUCLEAN;
>   	}
>   
> -	if (unlikely(item_size != sizeof(bgi))) {
> +	if (btrfs_fs_incompat(fs_info, REMAP_TREE))
> +		exp_size = sizeof(struct btrfs_block_group_item_v2);
> +	else
> +		exp_size = sizeof(struct btrfs_block_group_item);
> +
> +	if (unlikely(item_size != exp_size)) {
>   		block_group_err(leaf, slot,
>   			"invalid item size, have %u expect %zu",
> -				item_size, sizeof(bgi));
> +				item_size, exp_size);
>   		return -EUCLEAN;
>   	}
>   
> diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
> index 9a36f0206d90..500e3a7df90b 100644
> --- a/include/uapi/linux/btrfs_tree.h
> +++ b/include/uapi/linux/btrfs_tree.h
> @@ -1229,6 +1229,14 @@ struct btrfs_block_group_item {
>   	__le64 flags;
>   } __attribute__ ((__packed__));
>   
> +struct btrfs_block_group_item_v2 {
> +	__le64 used;
> +	__le64 chunk_objectid;
> +	__le64 flags;
> +	__le64 remap_bytes;
> +	__le32 identity_remap_count;
> +} __attribute__ ((__packed__));
> +
>   struct btrfs_free_space_info {
>   	__le32 extent_count;
>   	__le32 flags;


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 06/10] btrfs: redirect I/O for remapped block groups
  2025-05-15 16:36 ` [RFC PATCH 06/10] btrfs: redirect I/O for remapped block groups Mark Harmstone
@ 2025-05-23 10:09   ` Qu Wenruo
  2025-05-23 11:53     ` Mark Harmstone
  0 siblings, 1 reply; 20+ messages in thread
From: Qu Wenruo @ 2025-05-23 10:09 UTC (permalink / raw)
  To: Mark Harmstone, linux-btrfs



On 2025/5/16 02:06, Mark Harmstone wrote:
> Change btrfs_map_block() so that if the block group has the REMAPPED
> flag set, we call btrfs_translate_remap() to obtain a new address.

I'm wondering if we can do it a little simpler:

- Delete the chunk item for a fully relocated/remapped chunk
   So that future read/write into that logical range will not find a chunk.

- If chunk map lookup failed, search remap tree instead

By this we do not need the REMAPPED flag at all.

Thanks,
Qu

> 
> btrfs_translate_remap() searches the remap tree for a range
> corresponding to the logical address passed to btrfs_map_block(). If it
> is within an identity remap, this part of the block group hasn't yet
> been relocated, and so we use the existing address.
> 
> If it is within an actual remap, we subtract the start of the remap
> range and add the address of its destination, contained in the item's
> payload.
> 
> Signed-off-by: Mark Harmstone <maharmstone@fb.com>
> ---
>   fs/btrfs/ctree.c      | 11 ++++---
>   fs/btrfs/ctree.h      |  3 ++
>   fs/btrfs/relocation.c | 75 +++++++++++++++++++++++++++++++++++++++++++
>   fs/btrfs/relocation.h |  2 ++
>   fs/btrfs/volumes.c    | 19 +++++++++++
>   5 files changed, 105 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index a2e7979372cc..7808f7bc2303 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -2331,7 +2331,8 @@ int btrfs_search_old_slot(struct btrfs_root *root, const struct btrfs_key *key,
>    * This may release the path, and so you may lose any locks held at the
>    * time you call it.
>    */
> -static int btrfs_prev_leaf(struct btrfs_root *root, struct btrfs_path *path)
> +int btrfs_prev_leaf(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> +		    struct btrfs_path *path, int ins_len, int cow)
>   {
>   	struct btrfs_key key;
>   	struct btrfs_key orig_key;
> @@ -2355,7 +2356,7 @@ static int btrfs_prev_leaf(struct btrfs_root *root, struct btrfs_path *path)
>   	}
>   
>   	btrfs_release_path(path);
> -	ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
> +	ret = btrfs_search_slot(trans, root, &key, path, ins_len, cow);
>   	if (ret <= 0)
>   		return ret;
>   
> @@ -2454,7 +2455,7 @@ int btrfs_search_slot_for_read(struct btrfs_root *root,
>   		}
>   	} else {
>   		if (p->slots[0] == 0) {
> -			ret = btrfs_prev_leaf(root, p);
> +			ret = btrfs_prev_leaf(NULL, root, p, 0, 0);
>   			if (ret < 0)
>   				return ret;
>   			if (!ret) {
> @@ -5003,7 +5004,7 @@ int btrfs_previous_item(struct btrfs_root *root,
>   
>   	while (1) {
>   		if (path->slots[0] == 0) {
> -			ret = btrfs_prev_leaf(root, path);
> +			ret = btrfs_prev_leaf(NULL, root, path, 0, 0);
>   			if (ret != 0)
>   				return ret;
>   		} else {
> @@ -5044,7 +5045,7 @@ int btrfs_previous_extent_item(struct btrfs_root *root,
>   
>   	while (1) {
>   		if (path->slots[0] == 0) {
> -			ret = btrfs_prev_leaf(root, path);
> +			ret = btrfs_prev_leaf(NULL, root, path, 0, 0);
>   			if (ret != 0)
>   				return ret;
>   		} else {
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 075a06db43a1..90a0d38a31c9 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -721,6 +721,9 @@ static inline int btrfs_next_leaf(struct btrfs_root *root, struct btrfs_path *pa
>   	return btrfs_next_old_leaf(root, path, 0);
>   }
>   
> +int btrfs_prev_leaf(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> +		    struct btrfs_path *path, int ins_len, int cow);
> +
>   static inline int btrfs_next_item(struct btrfs_root *root, struct btrfs_path *p)
>   {
>   	return btrfs_next_old_item(root, p, 0);
> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> index 02086191630d..e5571c897906 100644
> --- a/fs/btrfs/relocation.c
> +++ b/fs/btrfs/relocation.c
> @@ -3897,6 +3897,81 @@ static const char *stage_to_string(enum reloc_stage stage)
>   	return "unknown";
>   }
>   
> +int btrfs_translate_remap(struct btrfs_fs_info *fs_info, u64 *logical,
> +			  u64 *length)
> +{
> +	int ret;
> +	struct btrfs_key key, found_key;
> +	struct extent_buffer *leaf;
> +	struct btrfs_remap *remap;
> +	BTRFS_PATH_AUTO_FREE(path);
> +
> +	path = btrfs_alloc_path();
> +	if (!path)
> +		return -ENOMEM;
> +
> +	key.objectid = *logical;
> +	key.type = BTRFS_IDENTITY_REMAP_KEY;
> +	key.offset = 0;
> +
> +	ret = btrfs_search_slot(NULL, fs_info->remap_root, &key, path,
> +				0, 0);
> +	if (ret < 0)
> +		return ret;
> +
> +	leaf = path->nodes[0];
> +
> +	if (path->slots[0] >= btrfs_header_nritems(leaf)) {
> +		ret = btrfs_next_leaf(fs_info->remap_root, path);
> +		if (ret < 0)
> +			return ret;
> +
> +		leaf = path->nodes[0];
> +	}
> +
> +	btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
> +
> +	if (found_key.objectid > *logical) {
> +		if (path->slots[0] == 0) {
> +			ret = btrfs_prev_leaf(NULL, fs_info->remap_root, path,
> +					      0, 0);
> +			if (ret) {
> +				if (ret == 1)
> +					ret = -ENOENT;
> +				return ret;
> +			}
> +
> +			leaf = path->nodes[0];
> +		} else {
> +			path->slots[0]--;
> +		}
> +
> +		btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
> +	}
> +
> +	if (found_key.type != BTRFS_REMAP_KEY &&
> +	    found_key.type != BTRFS_IDENTITY_REMAP_KEY) {
> +		return -ENOENT;
> +	}
> +
> +	if (found_key.objectid > *logical ||
> +	    found_key.objectid + found_key.offset <= *logical) {
> +		return -ENOENT;
> +	}
> +
> +	if (*logical + *length > found_key.objectid + found_key.offset)
> +		*length = found_key.objectid + found_key.offset - *logical;
> +
> +	if (found_key.type == BTRFS_IDENTITY_REMAP_KEY)
> +		return 0;
> +
> +	remap = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_remap);
> +
> +	*logical = *logical - found_key.objectid + btrfs_remap_address(leaf, remap);
> +
> +	return 0;
> +}
> +
>   /*
>    * function to relocate all extents in a block group.
>    */
> diff --git a/fs/btrfs/relocation.h b/fs/btrfs/relocation.h
> index 788c86d8633a..f07dbd9a89c6 100644
> --- a/fs/btrfs/relocation.h
> +++ b/fs/btrfs/relocation.h
> @@ -30,5 +30,7 @@ int btrfs_should_cancel_balance(const struct btrfs_fs_info *fs_info);
>   struct btrfs_root *find_reloc_root(struct btrfs_fs_info *fs_info, u64 bytenr);
>   bool btrfs_should_ignore_reloc_root(const struct btrfs_root *root);
>   u64 btrfs_get_reloc_bg_bytenr(const struct btrfs_fs_info *fs_info);
> +int btrfs_translate_remap(struct btrfs_fs_info *fs_info, u64 *logical,
> +			  u64 *length);
>   
>   #endif
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 77194bb46b40..4777926213c0 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -6620,6 +6620,25 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
>   	if (IS_ERR(map))
>   		return PTR_ERR(map);
>   
> +	if (map->type & BTRFS_BLOCK_GROUP_REMAPPED) {
> +		u64 new_logical = logical;
> +
> +		ret = btrfs_translate_remap(fs_info, &new_logical, length);
> +		if (ret)
> +			return ret;
> +
> +		if (new_logical != logical) {
> +			btrfs_free_chunk_map(map);
> +
> +			map = btrfs_get_chunk_map(fs_info, new_logical,
> +						  *length);
> +			if (IS_ERR(map))
> +				return PTR_ERR(map);
> +
> +			logical = new_logical;
> +		}
> +	}
> +
>   	num_copies = btrfs_chunk_map_num_copies(map);
>   	if (io_geom.mirror_num > num_copies)
>   		return -EINVAL;


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 06/10] btrfs: redirect I/O for remapped block groups
  2025-05-23 10:09   ` Qu Wenruo
@ 2025-05-23 11:53     ` Mark Harmstone
  0 siblings, 0 replies; 20+ messages in thread
From: Mark Harmstone @ 2025-05-23 11:53 UTC (permalink / raw)
  To: Qu Wenruo, Mark Harmstone, linux-btrfs@vger.kernel.org

On 23/5/25 11:09, Qu Wenruo wrote:
> 
> 
> On 2025/5/16 02:06, Mark Harmstone wrote:
>> Change btrfs_map_block() so that if the block group has the REMAPPED
>> flag set, we call btrfs_translate_remap() to obtain a new address.
> 
> I'm wondering if we can do it a little simpler:
> 
> - Delete the chunk item for a fully relocated/remapped chunk
>    So that future read/write into that logical range will not find a chunk.
> 
> - If chunk map lookup failed, search remap tree instead
> 
> By this we do not need the REMAPPED flag at all.
> 
> Thanks,
> Qu

You would still need the REMAPPED flag, as that's also set on 
partially-remapped block groups.

The life cycle is:
* Normal block group
* Block group with REMAPPED flag set and identity remaps covering its 
data. The REMAPPED flag is an instruction to search the remap tree for 
this BG, and also means that no new allocations can be made from it
* Block group with a mixture of identity remaps and actual remaps
* Fully-remapped block group, with no chunk stripes and no identity 
remaps left

My concern with making fully-remapped block groups implicit is that it 
makes it harder to diagnose corruption. If we see an address that's 
outside of a block group but has no remap entry, is it a bit-flip error 
or a bug in the remap tree code?

Mark

> 
>>
>> btrfs_translate_remap() searches the remap tree for a range
>> corresponding to the logical address passed to btrfs_map_block(). If it
>> is within an identity remap, this part of the block group hasn't yet
>> been relocated, and so we use the existing address.
>>
>> If it is within an actual remap, we subtract the start of the remap
>> range and add the address of its destination, contained in the item's
>> payload.
>>
>> Signed-off-by: Mark Harmstone <maharmstone@fb.com>
>> ---
>>   fs/btrfs/ctree.c      | 11 ++++---
>>   fs/btrfs/ctree.h      |  3 ++
>>   fs/btrfs/relocation.c | 75 +++++++++++++++++++++++++++++++++++++++++++
>>   fs/btrfs/relocation.h |  2 ++
>>   fs/btrfs/volumes.c    | 19 +++++++++++
>>   5 files changed, 105 insertions(+), 5 deletions(-)
>>
>> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
>> index a2e7979372cc..7808f7bc2303 100644
>> --- a/fs/btrfs/ctree.c
>> +++ b/fs/btrfs/ctree.c
>> @@ -2331,7 +2331,8 @@ int btrfs_search_old_slot(struct btrfs_root 
>> *root, const struct btrfs_key *key,
>>    * This may release the path, and so you may lose any locks held at the
>>    * time you call it.
>>    */
>> -static int btrfs_prev_leaf(struct btrfs_root *root, struct btrfs_path 
>> *path)
>> +int btrfs_prev_leaf(struct btrfs_trans_handle *trans, struct 
>> btrfs_root *root,
>> +            struct btrfs_path *path, int ins_len, int cow)
>>   {
>>       struct btrfs_key key;
>>       struct btrfs_key orig_key;
>> @@ -2355,7 +2356,7 @@ static int btrfs_prev_leaf(struct btrfs_root 
>> *root, struct btrfs_path *path)
>>       }
>>       btrfs_release_path(path);
>> -    ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
>> +    ret = btrfs_search_slot(trans, root, &key, path, ins_len, cow);
>>       if (ret <= 0)
>>           return ret;
>> @@ -2454,7 +2455,7 @@ int btrfs_search_slot_for_read(struct btrfs_root 
>> *root,
>>           }
>>       } else {
>>           if (p->slots[0] == 0) {
>> -            ret = btrfs_prev_leaf(root, p);
>> +            ret = btrfs_prev_leaf(NULL, root, p, 0, 0);
>>               if (ret < 0)
>>                   return ret;
>>               if (!ret) {
>> @@ -5003,7 +5004,7 @@ int btrfs_previous_item(struct btrfs_root *root,
>>       while (1) {
>>           if (path->slots[0] == 0) {
>> -            ret = btrfs_prev_leaf(root, path);
>> +            ret = btrfs_prev_leaf(NULL, root, path, 0, 0);
>>               if (ret != 0)
>>                   return ret;
>>           } else {
>> @@ -5044,7 +5045,7 @@ int btrfs_previous_extent_item(struct btrfs_root 
>> *root,
>>       while (1) {
>>           if (path->slots[0] == 0) {
>> -            ret = btrfs_prev_leaf(root, path);
>> +            ret = btrfs_prev_leaf(NULL, root, path, 0, 0);
>>               if (ret != 0)
>>                   return ret;
>>           } else {
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index 075a06db43a1..90a0d38a31c9 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -721,6 +721,9 @@ static inline int btrfs_next_leaf(struct 
>> btrfs_root *root, struct btrfs_path *pa
>>       return btrfs_next_old_leaf(root, path, 0);
>>   }
>> +int btrfs_prev_leaf(struct btrfs_trans_handle *trans, struct 
>> btrfs_root *root,
>> +            struct btrfs_path *path, int ins_len, int cow);
>> +
>>   static inline int btrfs_next_item(struct btrfs_root *root, struct 
>> btrfs_path *p)
>>   {
>>       return btrfs_next_old_item(root, p, 0);
>> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
>> index 02086191630d..e5571c897906 100644
>> --- a/fs/btrfs/relocation.c
>> +++ b/fs/btrfs/relocation.c
>> @@ -3897,6 +3897,81 @@ static const char *stage_to_string(enum 
>> reloc_stage stage)
>>       return "unknown";
>>   }
>> +int btrfs_translate_remap(struct btrfs_fs_info *fs_info, u64 *logical,
>> +              u64 *length)
>> +{
>> +    int ret;
>> +    struct btrfs_key key, found_key;
>> +    struct extent_buffer *leaf;
>> +    struct btrfs_remap *remap;
>> +    BTRFS_PATH_AUTO_FREE(path);
>> +
>> +    path = btrfs_alloc_path();
>> +    if (!path)
>> +        return -ENOMEM;
>> +
>> +    key.objectid = *logical;
>> +    key.type = BTRFS_IDENTITY_REMAP_KEY;
>> +    key.offset = 0;
>> +
>> +    ret = btrfs_search_slot(NULL, fs_info->remap_root, &key, path,
>> +                0, 0);
>> +    if (ret < 0)
>> +        return ret;
>> +
>> +    leaf = path->nodes[0];
>> +
>> +    if (path->slots[0] >= btrfs_header_nritems(leaf)) {
>> +        ret = btrfs_next_leaf(fs_info->remap_root, path);
>> +        if (ret < 0)
>> +            return ret;
>> +
>> +        leaf = path->nodes[0];
>> +    }
>> +
>> +    btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
>> +
>> +    if (found_key.objectid > *logical) {
>> +        if (path->slots[0] == 0) {
>> +            ret = btrfs_prev_leaf(NULL, fs_info->remap_root, path,
>> +                          0, 0);
>> +            if (ret) {
>> +                if (ret == 1)
>> +                    ret = -ENOENT;
>> +                return ret;
>> +            }
>> +
>> +            leaf = path->nodes[0];
>> +        } else {
>> +            path->slots[0]--;
>> +        }
>> +
>> +        btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
>> +    }
>> +
>> +    if (found_key.type != BTRFS_REMAP_KEY &&
>> +        found_key.type != BTRFS_IDENTITY_REMAP_KEY) {
>> +        return -ENOENT;
>> +    }
>> +
>> +    if (found_key.objectid > *logical ||
>> +        found_key.objectid + found_key.offset <= *logical) {
>> +        return -ENOENT;
>> +    }
>> +
>> +    if (*logical + *length > found_key.objectid + found_key.offset)
>> +        *length = found_key.objectid + found_key.offset - *logical;
>> +
>> +    if (found_key.type == BTRFS_IDENTITY_REMAP_KEY)
>> +        return 0;
>> +
>> +    remap = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_remap);
>> +
>> +    *logical = *logical - found_key.objectid + 
>> btrfs_remap_address(leaf, remap);
>> +
>> +    return 0;
>> +}
>> +
>>   /*
>>    * function to relocate all extents in a block group.
>>    */
>> diff --git a/fs/btrfs/relocation.h b/fs/btrfs/relocation.h
>> index 788c86d8633a..f07dbd9a89c6 100644
>> --- a/fs/btrfs/relocation.h
>> +++ b/fs/btrfs/relocation.h
>> @@ -30,5 +30,7 @@ int btrfs_should_cancel_balance(const struct 
>> btrfs_fs_info *fs_info);
>>   struct btrfs_root *find_reloc_root(struct btrfs_fs_info *fs_info, 
>> u64 bytenr);
>>   bool btrfs_should_ignore_reloc_root(const struct btrfs_root *root);
>>   u64 btrfs_get_reloc_bg_bytenr(const struct btrfs_fs_info *fs_info);
>> +int btrfs_translate_remap(struct btrfs_fs_info *fs_info, u64 *logical,
>> +              u64 *length);
>>   #endif
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index 77194bb46b40..4777926213c0 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -6620,6 +6620,25 @@ int btrfs_map_block(struct btrfs_fs_info 
>> *fs_info, enum btrfs_map_op op,
>>       if (IS_ERR(map))
>>           return PTR_ERR(map);
>> +    if (map->type & BTRFS_BLOCK_GROUP_REMAPPED) {
>> +        u64 new_logical = logical;
>> +
>> +        ret = btrfs_translate_remap(fs_info, &new_logical, length);
>> +        if (ret)
>> +            return ret;
>> +
>> +        if (new_logical != logical) {
>> +            btrfs_free_chunk_map(map);
>> +
>> +            map = btrfs_get_chunk_map(fs_info, new_logical,
>> +                          *length);
>> +            if (IS_ERR(map))
>> +                return PTR_ERR(map);
>> +
>> +            logical = new_logical;
>> +        }
>> +    }
>> +
>>       num_copies = btrfs_chunk_map_num_copies(map);
>>       if (io_geom.mirror_num > num_copies)
>>           return -EINVAL;
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 04/10] btrfs: add extended version of struct block_group_item
  2025-05-23  9:53   ` Qu Wenruo
@ 2025-05-23 12:00     ` Mark Harmstone
  0 siblings, 0 replies; 20+ messages in thread
From: Mark Harmstone @ 2025-05-23 12:00 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs@vger.kernel.org

On 23/5/25 10:53, Qu Wenruo wrote:
> 
> On 2025/5/16 02:06, Mark Harmstone wrote:
>> Add a struct btrfs_block_group_item_v2, which is used in the block group
>> tree if the remap-tree incompat flag is set.
>>
>> This adds two new fields to the block group item: `remap_bytes` and
>> `identity_remap_count`.
>>
>> `remap_bytes` records the amount of data that's physically within this
>> block group, but nominally in another, remapped block group. This is
>> necessary because this data will need to be moved first if this block
>> group is itself relocated. If `remap_bytes` > 0, this is an indicator to
>> the relocation thread that it will need to search the remap-tree for
>> backrefs. A block group must also have `remap_bytes` == 0 before it can
>> be dropped.
>>
>> `identity_remap_count` records how many identity remap items are located
>> in the remap tree for this block group. When relocation is begun for
>> this block group, this is set to the number of holes in the free-space
>> tree for this range. As identity remaps are converted into actual remaps
>> by the relocation process, this number is decreased. Once it reaches 0,
>> either because of relocation or because extents have been deleted, the
>> block group has been fully remapped and its chunk's device extents are
>> removed.
> 
> Can we add those two items into a new item other than a completely new 
> v2 block group item?
> 
> I mean for regular block groups they do not need those members, and all 
> block groups starts from regular ones at mkfs time.
> 
> We can add a regular block group flag to indicate if the bg has the 
> extra members.
> 
> Thanks,
> Qu

I did consider that, but the downside is that it makes the timing of 
updates to the block group tree less predictable. It would mean that 
when relocating a block group we'd have to lock the BGT all the way up 
to the root, as we'd be changing keys and item lengths rather than 
updating everything in place.

This didn't seem worth it to save a few bytes, particularly as it's 
anticipated that in practice most block groups will have the REMAPPED 
flag set.

Mark

> 
>>
>> Signed-off-by: Mark Harmstone <maharmstone@fb.com>
>> ---
>>   fs/btrfs/accessors.h            |  20 +++++++
>>   fs/btrfs/block-group.c          | 101 ++++++++++++++++++++++++--------
>>   fs/btrfs/block-group.h          |  14 ++++-
>>   fs/btrfs/tree-checker.c         |  10 +++-
>>   include/uapi/linux/btrfs_tree.h |   8 +++
>>   5 files changed, 126 insertions(+), 27 deletions(-)
>>
>> diff --git a/fs/btrfs/accessors.h b/fs/btrfs/accessors.h
>> index 5f5eda8d6f9e..6e6dd664217b 100644
>> --- a/fs/btrfs/accessors.h
>> +++ b/fs/btrfs/accessors.h
>> @@ -264,6 +264,26 @@ BTRFS_SETGET_FUNCS(block_group_flags, struct 
>> btrfs_block_group_item, flags, 64);
>>   BTRFS_SETGET_STACK_FUNCS(stack_block_group_flags,
>>               struct btrfs_block_group_item, flags, 64);
>> +/* struct btrfs_block_group_item_v2 */
>> +BTRFS_SETGET_STACK_FUNCS(stack_block_group_v2_used, struct 
>> btrfs_block_group_item_v2,
>> +             used, 64);
>> +BTRFS_SETGET_FUNCS(block_group_v2_used, struct 
>> btrfs_block_group_item_v2, used, 64);
>> +BTRFS_SETGET_STACK_FUNCS(stack_block_group_v2_chunk_objectid,
>> +             struct btrfs_block_group_item_v2, chunk_objectid, 64);
>> +BTRFS_SETGET_FUNCS(block_group_v2_chunk_objectid,
>> +           struct btrfs_block_group_item_v2, chunk_objectid, 64);
>> +BTRFS_SETGET_STACK_FUNCS(stack_block_group_v2_flags,
>> +             struct btrfs_block_group_item_v2, flags, 64);
>> +BTRFS_SETGET_FUNCS(block_group_v2_flags, struct 
>> btrfs_block_group_item_v2, flags, 64);
>> +BTRFS_SETGET_STACK_FUNCS(stack_block_group_v2_remap_bytes,
>> +             struct btrfs_block_group_item_v2, remap_bytes, 64);
>> +BTRFS_SETGET_FUNCS(block_group_v2_remap_bytes, struct 
>> btrfs_block_group_item_v2,
>> +           remap_bytes, 64);
>> +BTRFS_SETGET_STACK_FUNCS(stack_block_group_v2_identity_remap_count,
>> +             struct btrfs_block_group_item_v2, identity_remap_count, 
>> 32);
>> +BTRFS_SETGET_FUNCS(block_group_v2_identity_remap_count, struct 
>> btrfs_block_group_item_v2,
>> +           identity_remap_count, 32);
>> +
>>   /* struct btrfs_free_space_info */
>>   BTRFS_SETGET_FUNCS(free_space_extent_count, struct 
>> btrfs_free_space_info,
>>              extent_count, 32);
>> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
>> index 5b0cb04b2b93..6a2aa792ccb2 100644
>> --- a/fs/btrfs/block-group.c
>> +++ b/fs/btrfs/block-group.c
>> @@ -2351,7 +2351,7 @@ static int 
>> check_chunk_block_group_mappings(struct btrfs_fs_info *fs_info)
>>   }
>>   static int read_one_block_group(struct btrfs_fs_info *info,
>> -                struct btrfs_block_group_item *bgi,
>> +                struct btrfs_block_group_item_v2 *bgi,
>>                   const struct btrfs_key *key,
>>                   int need_clear)
>>   {
>> @@ -2366,11 +2366,16 @@ static int read_one_block_group(struct 
>> btrfs_fs_info *info,
>>           return -ENOMEM;
>>       cache->length = key->offset;
>> -    cache->used = btrfs_stack_block_group_used(bgi);
>> +    cache->used = btrfs_stack_block_group_v2_used(bgi);
>>       cache->commit_used = cache->used;
>> -    cache->flags = btrfs_stack_block_group_flags(bgi);
>> -    cache->global_root_id = btrfs_stack_block_group_chunk_objectid(bgi);
>> +    cache->flags = btrfs_stack_block_group_v2_flags(bgi);
>> +    cache->global_root_id = 
>> btrfs_stack_block_group_v2_chunk_objectid(bgi);
>>       cache->space_info = btrfs_find_space_info(info, cache->flags);
>> +    cache->remap_bytes = btrfs_stack_block_group_v2_remap_bytes(bgi);
>> +    cache->commit_remap_bytes = cache->remap_bytes;
>> +    cache->identity_remap_count =
>> +        btrfs_stack_block_group_v2_identity_remap_count(bgi);
>> +    cache->commit_identity_remap_count = cache->identity_remap_count;
>>       set_free_space_tree_thresholds(cache);
>> @@ -2435,7 +2440,7 @@ static int read_one_block_group(struct 
>> btrfs_fs_info *info,
>>       } else if (cache->length == cache->used) {
>>           cache->cached = BTRFS_CACHE_FINISHED;
>>           btrfs_free_excluded_extents(cache);
>> -    } else if (cache->used == 0) {
>> +    } else if (cache->used == 0 && cache->remap_bytes == 0) {
>>           cache->cached = BTRFS_CACHE_FINISHED;
>>           ret = btrfs_add_new_free_space(cache, cache->start,
>>                              cache->start + cache->length, NULL);
>> @@ -2455,7 +2460,8 @@ static int read_one_block_group(struct 
>> btrfs_fs_info *info,
>>       set_avail_alloc_bits(info, cache->flags);
>>       if (btrfs_chunk_writeable(info, cache->start)) {
>> -        if (cache->used == 0) {
>> +        if (cache->used == 0 && cache->identity_remap_count == 0 &&
>> +            cache->remap_bytes == 0) {
>>               ASSERT(list_empty(&cache->bg_list));
>>               if (btrfs_test_opt(info, DISCARD_ASYNC))
>>                   btrfs_discard_queue_work(&info->discard_ctl, cache);
>> @@ -2559,9 +2565,10 @@ int btrfs_read_block_groups(struct 
>> btrfs_fs_info *info)
>>           need_clear = 1;
>>       while (1) {
>> -        struct btrfs_block_group_item bgi;
>> +        struct btrfs_block_group_item_v2 bgi;
>>           struct extent_buffer *leaf;
>>           int slot;
>> +        size_t size;
>>           ret = find_first_block_group(info, path, &key);
>>           if (ret > 0)
>> @@ -2572,8 +2579,16 @@ int btrfs_read_block_groups(struct 
>> btrfs_fs_info *info)
>>           leaf = path->nodes[0];
>>           slot = path->slots[0];
>> +        if (btrfs_fs_incompat(info, REMAP_TREE)) {
>> +            size = sizeof(struct btrfs_block_group_item_v2);
>> +        } else {
>> +            size = sizeof(struct btrfs_block_group_item);
>> +            btrfs_set_stack_block_group_v2_remap_bytes(&bgi, 0);
>> +            btrfs_set_stack_block_group_v2_identity_remap_count(&bgi, 
>> 0);
>> +        }
>> +
>>           read_extent_buffer(leaf, &bgi, btrfs_item_ptr_offset(leaf, 
>> slot),
>> -                   sizeof(bgi));
>> +                   size);
>>           btrfs_item_key_to_cpu(leaf, &key, slot);
>>           btrfs_release_path(path);
>> @@ -2643,25 +2658,38 @@ static int insert_block_group_item(struct 
>> btrfs_trans_handle *trans,
>>                      struct btrfs_block_group *block_group)
>>   {
>>       struct btrfs_fs_info *fs_info = trans->fs_info;
>> -    struct btrfs_block_group_item bgi;
>> +    struct btrfs_block_group_item_v2 bgi;
>>       struct btrfs_root *root = btrfs_block_group_root(fs_info);
>>       struct btrfs_key key;
>>       u64 old_commit_used;
>> +    size_t size;
>>       int ret;
>>       spin_lock(&block_group->lock);
>> -    btrfs_set_stack_block_group_used(&bgi, block_group->used);
>> -    btrfs_set_stack_block_group_chunk_objectid(&bgi,
>> -                           block_group->global_root_id);
>> -    btrfs_set_stack_block_group_flags(&bgi, block_group->flags);
>> +    btrfs_set_stack_block_group_v2_used(&bgi, block_group->used);
>> +    btrfs_set_stack_block_group_v2_chunk_objectid(&bgi,
>> +                              block_group->global_root_id);
>> +    btrfs_set_stack_block_group_v2_flags(&bgi, block_group->flags);
>> +    btrfs_set_stack_block_group_v2_remap_bytes(&bgi,
>> +                           block_group->remap_bytes);
>> +    btrfs_set_stack_block_group_v2_identity_remap_count(&bgi,
>> +                    block_group->identity_remap_count);
>>       old_commit_used = block_group->commit_used;
>>       block_group->commit_used = block_group->used;
>> +    block_group->commit_remap_bytes = block_group->remap_bytes;
>> +    block_group->commit_identity_remap_count =
>> +        block_group->identity_remap_count;
>>       key.objectid = block_group->start;
>>       key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
>>       key.offset = block_group->length;
>>       spin_unlock(&block_group->lock);
>> -    ret = btrfs_insert_item(trans, root, &key, &bgi, sizeof(bgi));
>> +    if (btrfs_fs_incompat(fs_info, REMAP_TREE))
>> +        size = sizeof(struct btrfs_block_group_item_v2);
>> +    else
>> +        size = sizeof(struct btrfs_block_group_item);
>> +
>> +    ret = btrfs_insert_item(trans, root, &key, &bgi, size);
>>       if (ret < 0) {
>>           spin_lock(&block_group->lock);
>>           block_group->commit_used = old_commit_used;
>> @@ -3116,10 +3144,12 @@ static int update_block_group_item(struct 
>> btrfs_trans_handle *trans,
>>       struct btrfs_root *root = btrfs_block_group_root(fs_info);
>>       unsigned long bi;
>>       struct extent_buffer *leaf;
>> -    struct btrfs_block_group_item bgi;
>> +    struct btrfs_block_group_item_v2 bgi;
>>       struct btrfs_key key;
>> -    u64 old_commit_used;
>> -    u64 used;
>> +    u64 old_commit_used, old_commit_remap_bytes;
>> +    u32 old_commit_identity_remap_count;
>> +    u64 used, remap_bytes;
>> +    u32 identity_remap_count;
>>       /*
>>        * Block group items update can be triggered out of commit 
>> transaction
>> @@ -3129,13 +3159,21 @@ static int update_block_group_item(struct 
>> btrfs_trans_handle *trans,
>>        */
>>       spin_lock(&cache->lock);
>>       old_commit_used = cache->commit_used;
>> +    old_commit_remap_bytes = cache->commit_remap_bytes;
>> +    old_commit_identity_remap_count = cache->commit_identity_remap_count;
>>       used = cache->used;
>> -    /* No change in used bytes, can safely skip it. */
>> -    if (cache->commit_used == used) {
>> +    remap_bytes = cache->remap_bytes;
>> +    identity_remap_count = cache->identity_remap_count;
>> +    /* No change in values, can safely skip it. */
>> +    if (cache->commit_used == used &&
>> +        cache->commit_remap_bytes == remap_bytes &&
>> +        cache->commit_identity_remap_count == identity_remap_count) {
>>           spin_unlock(&cache->lock);
>>           return 0;
>>       }
>>       cache->commit_used = used;
>> +    cache->commit_remap_bytes = remap_bytes;
>> +    cache->commit_identity_remap_count = identity_remap_count;
>>       spin_unlock(&cache->lock);
>>       key.objectid = cache->start;
>> @@ -3151,11 +3189,23 @@ static int update_block_group_item(struct 
>> btrfs_trans_handle *trans,
>>       leaf = path->nodes[0];
>>       bi = btrfs_item_ptr_offset(leaf, path->slots[0]);
>> -    btrfs_set_stack_block_group_used(&bgi, used);
>> -    btrfs_set_stack_block_group_chunk_objectid(&bgi,
>> -                           cache->global_root_id);
>> -    btrfs_set_stack_block_group_flags(&bgi, cache->flags);
>> -    write_extent_buffer(leaf, &bgi, bi, sizeof(bgi));
>> +    btrfs_set_stack_block_group_v2_used(&bgi, used);
>> +    btrfs_set_stack_block_group_v2_chunk_objectid(&bgi,
>> +                              cache->global_root_id);
>> +    btrfs_set_stack_block_group_v2_flags(&bgi, cache->flags);
>> +
>> +    if (btrfs_fs_incompat(fs_info, REMAP_TREE)) {
>> +        btrfs_set_stack_block_group_v2_remap_bytes(&bgi,
>> +                               cache->remap_bytes);
>> +        btrfs_set_stack_block_group_v2_identity_remap_count(&bgi,
>> +                        cache->identity_remap_count);
>> +        write_extent_buffer(leaf, &bgi, bi,
>> +                    sizeof(struct btrfs_block_group_item_v2));
>> +    } else {
>> +        write_extent_buffer(leaf, &bgi, bi,
>> +                    sizeof(struct btrfs_block_group_item));
>> +    }
>> +
>>   fail:
>>       btrfs_release_path(path);
>>       /*
>> @@ -3170,6 +3220,9 @@ static int update_block_group_item(struct 
>> btrfs_trans_handle *trans,
>>       if (ret < 0 && ret != -ENOENT) {
>>           spin_lock(&cache->lock);
>>           cache->commit_used = old_commit_used;
>> +        cache->commit_remap_bytes = old_commit_remap_bytes;
>> +        cache->commit_identity_remap_count =
>> +            old_commit_identity_remap_count;
>>           spin_unlock(&cache->lock);
>>       }
>>       return ret;
>> diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
>> index 9de356bcb411..c484118b8b8d 100644
>> --- a/fs/btrfs/block-group.h
>> +++ b/fs/btrfs/block-group.h
>> @@ -127,6 +127,8 @@ struct btrfs_block_group {
>>       u64 flags;
>>       u64 cache_generation;
>>       u64 global_root_id;
>> +    u64 remap_bytes;
>> +    u32 identity_remap_count;
>>       /*
>>        * The last committed used bytes of this block group, if the 
>> above @used
>> @@ -134,6 +136,15 @@ struct btrfs_block_group {
>>        * group item of this block group.
>>        */
>>       u64 commit_used;
>> +    /*
>> +     * The last committed remap_bytes value of this block group.
>> +     */
>> +    u64 commit_remap_bytes;
>> +    /*
>> +     * The last committed identity_remap_count value of this block group.
>> +     */
>> +    u32 commit_identity_remap_count;
>> +
>>       /*
>>        * If the free space extent count exceeds this number, convert 
>> the block
>>        * group to bitmaps.
>> @@ -275,7 +286,8 @@ static inline bool btrfs_is_block_group_used(const 
>> struct btrfs_block_group *bg)
>>   {
>>       lockdep_assert_held(&bg->lock);
>> -    return (bg->used > 0 || bg->reserved > 0 || bg->pinned > 0);
>> +    return (bg->used > 0 || bg->reserved > 0 || bg->pinned > 0 ||
>> +        bg->remap_bytes > 0);
>>   }
>>   static inline bool btrfs_is_block_group_data_only(const struct 
>> btrfs_block_group *block_group)
>> diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
>> index fd83df06e3fb..25311576fab6 100644
>> --- a/fs/btrfs/tree-checker.c
>> +++ b/fs/btrfs/tree-checker.c
>> @@ -687,6 +687,7 @@ static int check_block_group_item(struct 
>> extent_buffer *leaf,
>>       u64 chunk_objectid;
>>       u64 flags;
>>       u64 type;
>> +    size_t exp_size;
>>       /*
>>        * Here we don't really care about alignment since extent 
>> allocator can
>> @@ -698,10 +699,15 @@ static int check_block_group_item(struct 
>> extent_buffer *leaf,
>>           return -EUCLEAN;
>>       }
>> -    if (unlikely(item_size != sizeof(bgi))) {
>> +    if (btrfs_fs_incompat(fs_info, REMAP_TREE))
>> +        exp_size = sizeof(struct btrfs_block_group_item_v2);
>> +    else
>> +        exp_size = sizeof(struct btrfs_block_group_item);
>> +
>> +    if (unlikely(item_size != exp_size)) {
>>           block_group_err(leaf, slot,
>>               "invalid item size, have %u expect %zu",
>> -                item_size, sizeof(bgi));
>> +                item_size, exp_size);
>>           return -EUCLEAN;
>>       }
>> diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/ 
>> btrfs_tree.h
>> index 9a36f0206d90..500e3a7df90b 100644
>> --- a/include/uapi/linux/btrfs_tree.h
>> +++ b/include/uapi/linux/btrfs_tree.h
>> @@ -1229,6 +1229,14 @@ struct btrfs_block_group_item {
>>       __le64 flags;
>>   } __attribute__ ((__packed__));
>> +struct btrfs_block_group_item_v2 {
>> +    __le64 used;
>> +    __le64 chunk_objectid;
>> +    __le64 flags;
>> +    __le64 remap_bytes;
>> +    __le32 identity_remap_count;
>> +} __attribute__ ((__packed__));
>> +
>>   struct btrfs_free_space_info {
>>       __le32 extent_count;
>>       __le32 flags;
> 



* Re: [RFC PATCH 01/10] btrfs: add definitions and constants for remap-tree
  2025-05-21 12:43   ` Johannes Thumshirn
@ 2025-05-23 13:06     ` Mark Harmstone
  0 siblings, 0 replies; 20+ messages in thread
From: Mark Harmstone @ 2025-05-23 13:06 UTC (permalink / raw)
  To: Johannes Thumshirn, linux-btrfs@vger.kernel.org

On 21/5/25 13:43, Johannes Thumshirn wrote:
> On 15.05.25 18:38, Mark Harmstone wrote:
>> @@ -282,6 +285,10 @@
>>    
>>    #define BTRFS_RAID_STRIPE_KEY	230
> 
> Just a small heads up, I'd need 231 for BTRFS_RAID_STRIPE_PARITY_KEY
> and maybe a 232 as well, so there's still space just to let you know.
> 
>>    
>> +#define BTRFS_IDENTITY_REMAP_KEY 	234
>> +#define BTRFS_REMAP_KEY		 	235
>> +#define BTRFS_REMAP_BACKREF_KEY	 	236
> 

No worries, thanks Johannes


* Re: [RFC PATCH 10/10] btrfs: replace identity maps with actual remaps when doing relocations
  2025-05-21  0:04   ` Boris Burkov
@ 2025-05-23 14:54     ` Mark Harmstone
  0 siblings, 0 replies; 20+ messages in thread
From: Mark Harmstone @ 2025-05-23 14:54 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs@vger.kernel.org

Thanks Boris.

On 21/5/25 01:04, Boris Burkov wrote:
> On Thu, May 15, 2025 at 05:36:38PM +0100, Mark Harmstone wrote:
>> Add a function do_remap_tree_reloc(), which does the actual work of
>> doing a relocation using the remap tree.
>>
>> In a loop we call do_remap_tree_reloc_trans(), which searches for the
>> first identity remap for the block group. We call btrfs_reserve_extent()
>> to find space elsewhere for it, and read the data into memory and write
>> it to the new location. We then carve out the identity remap and replace
>> it with an actual remap, which points to the new location in which to
>> look.
>>
>> Once the last identity remap has been removed we call
>> last_identity_remap_gone(), which, as with deletions, removes the
>> chunk's stripes and device extents.
> 
> I think this is a good candidate for unit testing. Just hammer a bunch
> of cases adding/removing/merging remaps.

Yes, makes sense

>>
>> Signed-off-by: Mark Harmstone <maharmstone@fb.com>
>> ---
>>   fs/btrfs/relocation.c | 522 ++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 522 insertions(+)
>>
>> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
>> index 7da95b82c798..bcf04d4c5af1 100644
>> --- a/fs/btrfs/relocation.c
>> +++ b/fs/btrfs/relocation.c
>> @@ -4660,6 +4660,60 @@ static int mark_bg_remapped(struct btrfs_trans_handle *trans,
>>   	return ret;
>>   }
>>   
> 
> Thinking out loud: I wonder if you do end up re-modeling the
> transactions s.t. we do one transaction per loop or something, then
> maybe you can use btrfs_for_each_slot.

That is what we're doing in do_remap_tree_reloc(). I think the downside 
of btrfs_for_each_slot is that we're not necessarily removing the whole 
identity remap in one go, as btrfs_reserve_extent() might give us less 
space than we asked for.

>> +static int find_next_identity_remap(struct btrfs_trans_handle *trans,
>> +				    struct btrfs_path *path, u64 bg_end,
>> +				    u64 last_start, u64 *start,
>> +				    u64 *length)
>> +{
>> +	int ret;
>> +	struct btrfs_key key, found_key;
>> +	struct btrfs_root *remap_root = trans->fs_info->remap_root;
>> +	struct extent_buffer *leaf;
>> +
>> +	key.objectid = last_start;
>> +	key.type = BTRFS_IDENTITY_REMAP_KEY;
>> +	key.offset = 0;
>> +
>> +	ret = btrfs_search_slot(trans, remap_root, &key, path, 0, 0);
>> +	if (ret < 0)
>> +		goto out;
>> +
>> +	leaf = path->nodes[0];
>> +	while (true) {
>> +		btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
>> +
>> +		if (found_key.objectid >= bg_end) {
>> +			ret = -ENOENT;
>> +			goto out;
>> +		}
>> +
>> +		if (found_key.type == BTRFS_IDENTITY_REMAP_KEY) {
>> +			*start = found_key.objectid;
>> +			*length = found_key.offset;
>> +			ret = 0;
>> +			goto out;
>> +		}
>> +
>> +		path->slots[0]++;
>> +		if (path->slots[0] >= btrfs_header_nritems(leaf)) {
>> +			ret = btrfs_next_leaf(remap_root, path);
>> +
>> +			if (ret != 0) {
>> +				if (ret == 1)
>> +					ret = -ENOENT;
>> +				goto out;
>> +			}
>> +
>> +			leaf = path->nodes[0];
>> +		}
>> +	}
>> +
>> +out:
>> +	btrfs_release_path(path);
>> +
>> +	return ret;
>> +}
>> +
>>   static int remove_chunk_stripes(struct btrfs_trans_handle *trans,
>>   				struct btrfs_chunk_map *chunk,
>>   				struct btrfs_path *path)
>> @@ -4779,6 +4833,288 @@ static int adjust_identity_remap_count(struct btrfs_trans_handle *trans,
>>   	return ret;
>>   }
>>   
>> +static int merge_remap_entries(struct btrfs_trans_handle *trans,
>> +			       struct btrfs_path *path,
>> +			       struct btrfs_block_group *src_bg, u64 old_addr,
>> +			       u64 new_addr, u64 length)
>> +{
>> +	struct btrfs_fs_info *fs_info = trans->fs_info;
>> +	struct btrfs_remap *remap_ptr;
>> +	struct extent_buffer *leaf;
>> +	struct btrfs_key key, new_key;
>> +	u64 last_addr, old_length;
>> +	int ret;
>> +
>> +	leaf = path->nodes[0];
>> +	btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
>> +
>> +	remap_ptr = btrfs_item_ptr(leaf, path->slots[0],
>> +				   struct btrfs_remap);
>> +
>> +	last_addr = btrfs_remap_address(leaf, remap_ptr);
>> +	old_length = key.offset;
>> +
>> +	if (last_addr + old_length != new_addr)
>> +		return 0;
>> +
>> +	/* Merge entries. */
>> +
>> +	new_key.objectid = key.objectid;
>> +	new_key.type = BTRFS_REMAP_KEY;
>> +	new_key.offset = old_length + length;
>> +
>> +	btrfs_set_item_key_safe(trans, path, &new_key);
>> +
>> +	btrfs_release_path(path);
>> +
>> +	/* Merge backref too. */
>> +
>> +	key.objectid = new_addr - old_length;
>> +	key.type = BTRFS_REMAP_BACKREF_KEY;
>> +	key.offset = old_length;
>> +
>> +	ret = btrfs_search_slot(trans, fs_info->remap_root, &key, path, 0, 1);
>> +	if (ret < 0) {
>> +		return ret;
>> +	} else if (ret == 1) {
>> +		btrfs_release_path(path);
>> +		return -ENOENT;
>> +	}
>> +
>> +	new_key.objectid = new_addr - old_length;
>> +	new_key.type = BTRFS_REMAP_BACKREF_KEY;
>> +	new_key.offset = old_length + length;
>> +
>> +	btrfs_set_item_key_safe(trans, path, &new_key);
>> +
>> +	btrfs_release_path(path);
>> +
>> +	/* Fix the following identity map. */
>> +
>> +	key.objectid = old_addr;
>> +	key.type = BTRFS_IDENTITY_REMAP_KEY;
>> +	key.offset = 0;
>> +
>> +	ret = btrfs_search_slot(trans, fs_info->remap_root, &key, path, 0, 1);
>> +	if (ret < 0)
>> +		return ret;
>> +
>> +	btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
>> +
>> +	if (key.objectid != old_addr || key.type != BTRFS_IDENTITY_REMAP_KEY)
>> +		return -ENOENT;
>> +
>> +	if (key.offset == length) {
>> +		ret = btrfs_del_item(trans, fs_info->remap_root, path);
>> +		if (ret)
>> +			return ret;
>> +
>> +		btrfs_release_path(path);
>> +
>> +		ret = adjust_identity_remap_count(trans, path, src_bg, -1);
>> +		if (ret)
>> +			return ret;
>> +
>> +		return 1;
>> +	}
>> +
>> +	new_key.objectid = old_addr + length;
>> +	new_key.type = BTRFS_IDENTITY_REMAP_KEY;
>> +	new_key.offset = key.offset - length;
>> +
>> +	btrfs_set_item_key_safe(trans, path, &new_key);
>> +
>> +	btrfs_release_path(path);
>> +
>> +	return 1;
>> +}
>> +
>> +static int add_new_remap_entry(struct btrfs_trans_handle *trans,
>> +			       struct btrfs_path *path,
>> +			       struct btrfs_block_group *src_bg, u64 old_addr,
>> +			       u64 new_addr, u64 length)
>> +{
>> +	struct btrfs_fs_info *fs_info = trans->fs_info;
>> +	struct btrfs_key key, new_key;
>> +	struct btrfs_remap remap;
>> +	int ret;
>> +	int identity_count_delta = 0;
>> +
>> +	btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
>> +
>> +	/* Shorten or delete identity mapping entry. */
>> +
>> +	if (key.objectid == old_addr) {
>> +		ret = btrfs_del_item(trans, fs_info->remap_root, path);
>> +		if (ret)
>> +			return ret;
>> +
>> +		identity_count_delta--;
>> +	} else {
>> +		new_key.objectid = key.objectid;
>> +		new_key.type = BTRFS_IDENTITY_REMAP_KEY;
>> +		new_key.offset = old_addr - key.objectid;
>> +
>> +		btrfs_set_item_key_safe(trans, path, &new_key);
>> +	}
>> +
>> +	btrfs_release_path(path);
>> +
>> +	/* Create new remap entry. */
>> +
>> +	new_key.objectid = old_addr;
>> +	new_key.type = BTRFS_REMAP_KEY;
>> +	new_key.offset = length;
>> +
>> +	ret = btrfs_insert_empty_item(trans, fs_info->remap_root,
>> +		path, &new_key, sizeof(struct btrfs_remap));
>> +	if (ret)
>> +		return ret;
>> +
>> +	btrfs_set_stack_remap_address(&remap, new_addr);
>> +
>> +	write_extent_buffer(path->nodes[0], &remap,
>> +		btrfs_item_ptr_offset(path->nodes[0], path->slots[0]),
>> +		sizeof(struct btrfs_remap));
>> +
>> +	btrfs_release_path(path);
>> +
>> +	/* Add entry for remainder of identity mapping, if necessary. */
>> +
>> +	if (key.objectid + key.offset != old_addr + length) {
>> +		new_key.objectid = old_addr + length;
>> +		new_key.type = BTRFS_IDENTITY_REMAP_KEY;
>> +		new_key.offset = key.objectid + key.offset - old_addr - length;
>> +
>> +		ret = btrfs_insert_empty_item(trans, fs_info->remap_root,
>> +					      path, &new_key, 0);
>> +		if (ret)
>> +			return ret;
>> +
>> +		btrfs_release_path(path);
>> +
>> +		identity_count_delta++;
>> +	}
>> +
>> +	/* Add backref. */
>> +
>> +	new_key.objectid = new_addr;
>> +	new_key.type = BTRFS_REMAP_BACKREF_KEY;
>> +	new_key.offset = length;
>> +
>> +	ret = btrfs_insert_empty_item(trans, fs_info->remap_root, path,
>> +				      &new_key, sizeof(struct btrfs_remap));
>> +	if (ret)
>> +		return ret;
>> +
>> +	btrfs_set_stack_remap_address(&remap, old_addr);
>> +
>> +	write_extent_buffer(path->nodes[0], &remap,
>> +		btrfs_item_ptr_offset(path->nodes[0], path->slots[0]),
>> +		sizeof(struct btrfs_remap));
>> +
>> +	btrfs_release_path(path);
>> +
>> +	if (identity_count_delta == 0)
>> +		return 0;
>> +
>> +	ret = adjust_identity_remap_count(trans, path, src_bg,
>> +					  identity_count_delta);
>> +	if (ret)
>> +		return ret;
>> +
>> +	return 0;
>> +}
>> +
>> +static int add_remap_entry(struct btrfs_trans_handle *trans,
>> +			   struct btrfs_path *path,
>> +			   struct btrfs_block_group *src_bg, u64 old_addr,
>> +			   u64 new_addr, u64 length)
>> +{
>> +	struct btrfs_fs_info *fs_info = trans->fs_info;
>> +	struct btrfs_key key;
>> +	struct extent_buffer *leaf;
>> +	int ret;
>> +
>> +	key.objectid = old_addr;
>> +	key.type = BTRFS_IDENTITY_REMAP_KEY;
>> +	key.offset = 0;
>> +
> 
> Can this lookup code be shared at all with the remapping logic in the
> previous patch? It seems fundamentally both are finding a remap entry for
> a given logical address. Or is it impossible since this one needs cow?
> 
> Maybe some kind of prev_item helper that's either remap tree specific or
> for this use case of going back exactly one item instead of obeying a
> min_objectid?

Thanks - yes, there's some overlap between add_remap_entry() and 
move_existing_remap(); it'd make sense to merge them as much as possible.
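To make the boundary cases concrete while I work on that, here's a standalone model of the splitting arithmetic in add_new_remap_entry() — this is an illustrative sketch, not kernel code, and the names (struct piece, split_identity) are made up for the example:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;

/* One entry of the remap tree, in the same spirit as the patch:
 * an identity extent [start, start+len). */
struct piece {
	u64 start;
	u64 len;
};

/*
 * Model of the split done by add_new_remap_entry(): carving the remapped
 * range [old_addr, old_addr+length) out of the identity extent
 * [id->start, id->start+id->len).  Returns the number of identity pieces
 * left over (0, 1 or 2); note identity_count_delta == pieces - 1, since
 * the original entry is either deleted (-1) or shortened in place, and a
 * right-hand remainder adds one (+1).
 */
static int split_identity(const struct piece *id, u64 old_addr, u64 length,
			  struct piece out[2])
{
	int n = 0;

	/* Left-hand remainder: the entry is shortened in place. */
	if (id->start < old_addr) {
		out[n].start = id->start;
		out[n].len = old_addr - id->start;
		n++;
	}

	/* Right-hand remainder: a new identity item is inserted. */
	if (id->start + id->len > old_addr + length) {
		out[n].start = old_addr + length;
		out[n].len = id->start + id->len - (old_addr + length);
		n++;
	}

	return n;
}
```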

> 
>> +	ret = btrfs_search_slot(trans, fs_info->remap_root, &key, path, 0, 1);
>> +	if (ret < 0)
>> +		goto end;
>> +
>> +	btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
>> +
>> +	if (key.objectid >= old_addr) {
>> +		if (path->slots[0] == 0) {
>> +			ret = btrfs_prev_leaf(trans, fs_info->remap_root, path,
>> +					      0, 1);
>> +			if (ret < 0)
>> +				goto end;
>> +		} else {
>> +			path->slots[0]--;
>> +		}
>> +	}
>> +
>> +	while (true) {
>> +		leaf = path->nodes[0];
>> +		if (path->slots[0] >= btrfs_header_nritems(leaf)) {
>> +			ret = btrfs_next_leaf(fs_info->remap_root, path);
>> +			if (ret < 0)
>> +				goto end;
>> +			else if (ret == 1)
>> +				break;
>> +			leaf = path->nodes[0];
>> +		}
>> +
>> +		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
>> +
>> +		if (key.objectid >= old_addr + length) {
>> +			ret = -ENOENT;
>> +			goto end;
>> +		}
>> +
>> +		if (key.type != BTRFS_REMAP_KEY &&
>> +		    key.type != BTRFS_IDENTITY_REMAP_KEY) {
>> +			path->slots[0]++;
>> +			continue;
>> +		}
>> +
>> +		if (key.type == BTRFS_REMAP_KEY &&
>> +		    key.objectid + key.offset == old_addr) {
>> +			ret = merge_remap_entries(trans, path, src_bg, old_addr,
>> +						  new_addr, length);
>> +			if (ret < 0) {
>> +				goto end;
>> +			} else if (ret == 0) {
>> +				path->slots[0]++;
>> +				continue;
>> +			}
>> +			break;
>> +		}
>> +
>> +		if (key.objectid <= old_addr &&
>> +		    key.type == BTRFS_IDENTITY_REMAP_KEY &&
>> +		    key.objectid + key.offset > old_addr) {
>> +			ret = add_new_remap_entry(trans, path, src_bg,
>> +						  old_addr, new_addr, length);
>> +			if (ret)
>> +				goto end;
>> +			break;
>> +		}
>> +
>> +		path->slots[0]++;
>> +	}
>> +
>> +	ret = 0;
>> +
>> +end:
>> +	btrfs_release_path(path);
>> +
>> +	return ret;
>> +}
>> +
>>   static int mark_chunk_remapped(struct btrfs_trans_handle *trans,
>>   			       struct btrfs_path *path, uint64_t start)
>>   {
>> @@ -4828,6 +5164,188 @@ static int mark_chunk_remapped(struct btrfs_trans_handle *trans,
>>   	return ret;
>>   }
>>   
>> +static int do_remap_tree_reloc_trans(struct btrfs_fs_info *fs_info,
>> +				     struct btrfs_block_group *src_bg,
>> +				     struct btrfs_path *path, u64 *last_start)
>> +{
>> +	struct btrfs_trans_handle *trans;
>> +	struct btrfs_root *extent_root;
>> +	struct btrfs_key ins;
>> +	struct btrfs_block_group *dest_bg = NULL;
>> +	struct btrfs_chunk_map *chunk;
>> +	u64 start, remap_length, length, new_addr, min_size;
>> +	int ret;
>> +	bool no_more = false;
>> +	bool is_data = src_bg->flags & BTRFS_BLOCK_GROUP_DATA;
>> +	bool made_reservation = false, bg_needs_free_space;
>> +	struct btrfs_space_info *sinfo = src_bg->space_info;
>> +
>> +	extent_root = btrfs_extent_root(fs_info, src_bg->start);
>> +
>> +	trans = btrfs_start_transaction(extent_root, 0);
>> +	if (IS_ERR(trans))
>> +		return PTR_ERR(trans);
>> +
>> +	mutex_lock(&fs_info->remap_mutex);
>> +
>> +	ret = find_next_identity_remap(trans, path, src_bg->start + src_bg->length,
>> +				       *last_start, &start, &remap_length);
>> +	if (ret == -ENOENT) {
>> +		no_more = true;
>> +		goto next;
>> +	} else if (ret) {
>> +		mutex_unlock(&fs_info->remap_mutex);
>> +		btrfs_end_transaction(trans);
>> +		return ret;
>> +	}
>> +
>> +	/* Try to reserve enough space for block. */
>> +
>> +	spin_lock(&sinfo->lock);
>> +	btrfs_space_info_update_bytes_may_use(sinfo, remap_length);
> 
> Why isn't this partly leaked if btrfs_reserve_extent returns a smaller extent than
> remap_length?
> 
>> +	spin_unlock(&sinfo->lock);
>> +
>> +	if (is_data)
>> +		min_size = fs_info->sectorsize;
>> +	else
>> +		min_size = fs_info->nodesize;
>> +
>> +	ret = btrfs_reserve_extent(fs_info->fs_root, remap_length,
>> +				   remap_length, min_size,
>> +				   0, 0, &ins, is_data, false);
> 
> ^ i.e., this will reduce bytes_may_use by the amount it actually
> reserved, and I don't see anywhere where we make up the difference. Then
> it looks like we will remap the extent we can, find the next free range,
> come back to this function, and add that remaining range to
> bytes_may_use a second time.

As I said to you off-list, this tripped me up too! btrfs_reserve_extent() 
removes the whole requested amount from bytes_may_use, even if it hands 
out less than that.
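A toy model of that accounting, for anyone else who trips over it — the names (toy_space_info, toy_reserve_extent) are invented for illustration, and the real btrfs_reserve_extent() is of course far more involved:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;

/* Toy space-info counter; the real struct btrfs_space_info is richer. */
struct toy_space_info {
	u64 bytes_may_use;
};

/*
 * Illustrative stand-in for btrfs_reserve_extent(): it may satisfy the
 * request with a shorter extent (*found_len < ram_bytes), but it still
 * removes the full ram_bytes from bytes_may_use -- which is why the
 * caller doesn't need to return the difference itself.  Only the
 * failure path has to undo its own bump, as the patch does.
 */
static int toy_reserve_extent(struct toy_space_info *sinfo, u64 ram_bytes,
			      u64 available, u64 *found_len)
{
	if (available == 0)
		return -28; /* -ENOSPC */

	*found_len = available < ram_bytes ? available : ram_bytes;
	sinfo->bytes_may_use -= ram_bytes; /* full amount, not *found_len */
	return 0;
}
```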

> 
>> +	if (ret) {
>> +		spin_lock(&sinfo->lock);
>> +		btrfs_space_info_update_bytes_may_use(sinfo, -remap_length);
>> +		spin_unlock(&sinfo->lock);
>> +
>> +		mutex_unlock(&fs_info->remap_mutex);
>> +		btrfs_end_transaction(trans);
>> +		return ret;
>> +	}
>> +
>> +	made_reservation = true;
>> +
>> +	new_addr = ins.objectid;
>> +	length = ins.offset;
>> +
>> +	if (!is_data && length % fs_info->nodesize) {
>> +		u64 new_length = length - (length % fs_info->nodesize);
> 
> Why not use the IS_ALIGNED / ALIGN_DOWN macros? Nodesize is a power of
> two, so I think it should be quicker. Probably doesn't matter, but it
> does seem to be the predominant pattern in the code base. Also avoids
> ever worrying about dividing by zero.

Makes sense, thank you. I changed this to use & after kernelbot 
complained that I broke compilation on 32-bit CPUs, which is presumably 
what the macro is doing.
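For reference, the rewrite looks roughly like this — minimal userspace copies of the kernel macros (the real ones live in include/linux/align.h), valid only for power-of-two alignments like nodesize, and trim_to_nodesize() is a made-up name for the hunk above:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;
typedef uint32_t u32;

/* Minimal copies of the kernel macros; power-of-two alignment only. */
#define ALIGN_DOWN(x, a)	((x) & ~((u64)(a) - 1))
#define IS_ALIGNED(x, a)	(((x) & ((u64)(a) - 1)) == 0)

/*
 * The patch hunk rewritten: trim a reservation to a nodesize multiple
 * using masking instead of '%', which for a u64 on 32-bit would need
 * div_u64()/__umoddi3 -- hence the kernelbot compile failure.
 */
static u64 trim_to_nodesize(u64 length, u32 nodesize)
{
	if (IS_ALIGNED(length, nodesize))
		return length;
	return ALIGN_DOWN(length, nodesize);
}
```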

> 
>> +
>> +		btrfs_free_reserved_extent(fs_info, new_addr + new_length,
>> +					   length - new_length, 0);
>> +
>> +		length = new_length;
>> +	}
>> +
>> +	ret = add_to_free_space_tree(trans, start, length);
> 
> Can you explain this? Intuitively, to me, the old remapped address is
> not a logical range we can allocate from, so it should not be in the
> free space tree. Is this a hack to get the bytes back into the
> accounting and allocations are blocked by the remapped block group being
> remapped / read-only?

Yes, good point - we should be clearing out the FST once a block group is 
marked REMAPPED. I think I'll have to fix the discard code, which IIRC 
works by walking the FST. Plus btrfs-check as well, of course.

> 
>> +	if (ret)
>> +		goto fail;
>> +
>> +	dest_bg = btrfs_lookup_block_group(fs_info, new_addr);
>> +
>> +	mutex_lock(&dest_bg->free_space_lock);
>> +	bg_needs_free_space = test_bit(BLOCK_GROUP_FLAG_NEEDS_FREE_SPACE,
>> +				       &dest_bg->runtime_flags);
>> +	mutex_unlock(&dest_bg->free_space_lock);
>> +
>> +	if (bg_needs_free_space) {
>> +		ret = add_block_group_free_space(trans, dest_bg);
>> +		if (ret)
>> +			goto fail;
>> +	}
>> +
>> +	ret = remove_from_free_space_tree(trans, new_addr, length);
>> +	if (ret)
>> +		goto fail;
> 
> I think you have also discussed this recently with Josef, but it seems
> a little sketchy. I suppose it depends if the remap tree ends up getting
> delayed refs and going in the extent tree? I think this is currently
> only called from alloc_reserved_extent.

Do you mean the bit about BLOCK_GROUP_FLAG_NEEDS_FREE_SPACE? We delay 
populating the in-memory representation of the FST for a BG until we 
need it, so we have to force it here. Presumably if we accept that 
remapped BGs don't have anything in the FST we can remove this code.

>> +
>> +	ret = do_copy(fs_info, start, new_addr, length);
>> +	if (ret)
>> +		goto fail;
>> +
>> +	ret = add_remap_entry(trans, path, src_bg, start, new_addr, length);
>> +	if (ret)
>> +		goto fail;
>> +
>> +	adjust_block_group_remap_bytes(trans, dest_bg, length);
>> +	btrfs_free_reserved_bytes(dest_bg, length, 0);
>> +
>> +	spin_lock(&sinfo->lock);
>> +	sinfo->bytes_readonly += length;
>> +	spin_unlock(&sinfo->lock);
>> +
>> +next:
>> +	if (dest_bg)
>> +		btrfs_put_block_group(dest_bg);
>> +
>> +	if (made_reservation)
>> +		btrfs_dec_block_group_reservations(fs_info, new_addr);
>> +
>> +	if (src_bg->used == 0 && src_bg->remap_bytes == 0) {
>> +		chunk = btrfs_find_chunk_map(fs_info, src_bg->start, 1);
>> +		if (!chunk) {
>> +			mutex_unlock(&fs_info->remap_mutex);
>> +			btrfs_end_transaction(trans);
>> +			return -ENOENT;
>> +		}
>> +
>> +		ret = last_identity_remap_gone(trans, chunk, src_bg, path);
>> +		if (ret) {
>> +			btrfs_free_chunk_map(chunk);
>> +			mutex_unlock(&fs_info->remap_mutex);
>> +			btrfs_end_transaction(trans);
>> +			return ret;
>> +		}
>> +
>> +		btrfs_free_chunk_map(chunk);
>> +	}
>> +
>> +	mutex_unlock(&fs_info->remap_mutex);
>> +
>> +	ret = btrfs_end_transaction(trans);
>> +	if (ret)
>> +		return ret;
>> +
>> +	if (no_more)
>> +		return 1;
>> +
>> +	*last_start = start;
>> +
>> +	return 0;
>> +
>> +fail:
>> +	if (dest_bg)
>> +		btrfs_put_block_group(dest_bg);
>> +
>> +	btrfs_free_reserved_extent(fs_info, new_addr, length, 0);
>> +
>> +	mutex_unlock(&fs_info->remap_mutex);
>> +	btrfs_end_transaction(trans);
>> +
>> +	return ret;
>> +}
>> +
>> +static int do_remap_tree_reloc(struct btrfs_fs_info *fs_info,
>> +			       struct btrfs_path *path,
>> +			       struct btrfs_block_group *bg)
>> +{
>> +	u64 last_start;
>> +	int ret;
>> +
>> +	last_start = bg->start;
>> +
>> +	while (true) {
>> +		ret = do_remap_tree_reloc_trans(fs_info, bg, path,
>> +						&last_start);
>> +		if (ret) {
>> +			if (ret == 1)
>> +				ret = 0;
>> +			break;
>> +		}
>> +	}
>> +
>> +	return ret;
>> +}
>> +
>>   int btrfs_translate_remap(struct btrfs_fs_info *fs_info, u64 *logical,
>>   			  u64 *length)
>>   {
>> @@ -5073,6 +5591,10 @@ int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start,
>>   		}
>>   
>>   		err = start_block_group_remapping(fs_info, path, bg);
>> +		if (err)
>> +			goto out;
>> +
>> +		err = do_remap_tree_reloc(fs_info, path, rc->block_group);
>>   
>>   		goto out;
>>   	}
>> -- 
>> 2.49.0
>>



end of thread, newest: 2025-05-23 14:54 UTC

Thread overview: 20+ messages
2025-05-15 16:36 [RFC PATCH 00/10] Remap tree Mark Harmstone
2025-05-15 16:36 ` [RFC PATCH 01/10] btrfs: add definitions and constants for remap-tree Mark Harmstone
2025-05-21 12:43   ` Johannes Thumshirn
2025-05-23 13:06     ` Mark Harmstone
2025-05-15 16:36 ` [RFC PATCH 02/10] btrfs: add REMAP chunk type Mark Harmstone
2025-05-15 16:36 ` [RFC PATCH 03/10] btrfs: allow remapped chunks to have zero stripes Mark Harmstone
2025-05-15 16:36 ` [RFC PATCH 04/10] btrfs: add extended version of struct block_group_item Mark Harmstone
2025-05-23  9:53   ` Qu Wenruo
2025-05-23 12:00     ` Mark Harmstone
2025-05-15 16:36 ` [RFC PATCH 05/10] btrfs: allow mounting filesystems with remap-tree incompat flag Mark Harmstone
2025-05-15 16:36 ` [RFC PATCH 06/10] btrfs: redirect I/O for remapped block groups Mark Harmstone
2025-05-23 10:09   ` Qu Wenruo
2025-05-23 11:53     ` Mark Harmstone
2025-05-15 16:36 ` [RFC PATCH 07/10] btrfs: handle deletions from remapped block group Mark Harmstone
2025-05-15 16:36 ` [RFC PATCH 08/10] btrfs: handle setting up relocation of block group with remap-tree Mark Harmstone
2025-05-15 16:36 ` [RFC PATCH 09/10] btrfs: move existing remaps before relocating block group Mark Harmstone
     [not found]   ` <202505161726.w1lqCZxG-lkp@intel.com>
2025-05-16 11:43     ` Mark Harmstone
2025-05-15 16:36 ` [RFC PATCH 10/10] btrfs: replace identity maps with actual remaps when doing relocations Mark Harmstone
2025-05-21  0:04   ` Boris Burkov
2025-05-23 14:54     ` Mark Harmstone
