public inbox for linux-btrfs@vger.kernel.org
* [PATCH 00/12] btrfs: remap tree
@ 2025-06-05 16:23 Mark Harmstone
  2025-06-05 16:23 ` [PATCH 01/12] btrfs: add definitions and constants for remap-tree Mark Harmstone
                   ` (17 more replies)
  0 siblings, 18 replies; 39+ messages in thread
From: Mark Harmstone @ 2025-06-05 16:23 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Mark Harmstone

This patch series adds a disk format change, gated behind
CONFIG_BTRFS_EXPERIMENTAL, which introduces a "remap tree" that acts as a
layer of indirection when doing I/O. When doing relocation, rather than fixing
up every tree, we instead record the old and new addresses in the remap tree.
This should make things more reliable and flexible, as well as enabling some
future changes we'd like to make, such as larger data extents and reducing
write amplification by removing cow-only metadata items.

The remap tree lives in a new REMAP chunk type. This is because bootstrapping
means that it can't be remapped itself, and has to be relocated by COWing it,
as at present. It can't go in the SYSTEM chunk, as we would then be limited by
the chunk item needing to fit in the superblock.

For more on the design and rationale, please see my RFC sent last month[1], as
well as Josef Bacik's original design document[2]. The main change from Josef's
design is that I've added remap backrefs, as we need to be able to move a
chunk's existing remaps before remapping it.

You will also need my patches to btrfs-progs[3] to make
`mkfs.btrfs -O remap-tree` work, as well as allowing `btrfs check` to recognize
the new format.

Changes since the RFC:

* I've reduced the REMAP chunk size from the normal 1GB to 32MB, to match the
  SYSTEM chunk. For a filesystem with 4KB sectors and 16KB node size, the worst
  case is that one leaf covers ~1MB of data, and the best case ~250GB. For a
  chunk, that implies a worst case of ~2GB and a best case of ~500TB.
  This isn't a disk-format change, so we can always adjust it if it proves too
  big or small in practice. mkfs creates 8MB chunks, as it does for everything.

* You can't make new allocations from remapped block groups, so I've changed
  things so that there are no free-space entries for them (thanks to Boris
  Burkov for the suggestion).

* The remap tree doesn't have metadata items in the extent tree (thanks to Josef
  for the suggestion). This was to work around some corruption that delayed refs
  were causing, but it also fits in with our future plans of removing all
  metadata items for COW-only trees, reducing write amplification.
  A knock-on effect of this is that I've had to disable balancing of the remap
  chunk itself. This is because we can no longer walk the extent tree, and will
  have to walk the remap tree instead. When we remove the COW-only metadata
  items, we will also have to do this for the chunk and root trees, as
  bootstrapping means they can't be remapped.

* btrfs_translate_remap() uses search_commit_root when doing metadata lookups,
  to avoid nested locking issues. This also seems to be a lot quicker (btrfs/187
  went from ~20mins to ~90secs).

* Unused remapped block groups should now get cleaned up more aggressively

* Other miscellaneous cleanups and fixes

Known issues:

* Relocation still needs to be implemented for the remap tree itself (see above)

* Some test failures: btrfs/156, btrfs/170, btrfs/226, btrfs/250

* nodatacow extents aren't safe, as they can race with the relocation thread.
  We either need to follow the btrfs_inc_nocow_writers() approach, which COWs
  the extent, or change it so that it blocks here.

* When initially marking a block group as remapped, we are walking the free-
  space tree and creating the identity remaps all in one transaction. For the
  worst-case scenario, i.e. a 1GB block group with every other sector allocated
  (131,072 extents), this can result in transaction times of more than 10 mins.
  This needs to be changed so that the work can be spread over multiple
  transactions.

* All this is disabled for zoned devices for the time being, as I've not been
  able to test it. I'm planning to make it compatible with zoned at a later
  date.

Thanks

[1] https://lwn.net/Articles/1021452/
[2] https://github.com/btrfs/btrfs-todo/issues/54
[3] https://github.com/maharmstone/btrfs-progs/tree/remap-tree

Mark Harmstone (12):
  btrfs: add definitions and constants for remap-tree
  btrfs: add REMAP chunk type
  btrfs: allow remapped chunks to have zero stripes
  btrfs: remove remapped block groups from the free-space tree
  btrfs: don't add metadata items for the remap tree to the extent tree
  btrfs: add extended version of struct block_group_item
  btrfs: allow mounting filesystems with remap-tree incompat flag
  btrfs: redirect I/O for remapped block groups
  btrfs: handle deletions from remapped block group
  btrfs: handle setting up relocation of block group with remap-tree
  btrfs: move existing remaps before relocating block group
  btrfs: replace identity maps with actual remaps when doing relocations

 fs/btrfs/Kconfig                |    2 +
 fs/btrfs/accessors.h            |   29 +
 fs/btrfs/block-group.c          |  202 +++-
 fs/btrfs/block-group.h          |   15 +-
 fs/btrfs/block-rsv.c            |    8 +
 fs/btrfs/block-rsv.h            |    1 +
 fs/btrfs/discard.c              |   11 +-
 fs/btrfs/disk-io.c              |   91 +-
 fs/btrfs/extent-tree.c          |  152 ++-
 fs/btrfs/free-space-tree.c      |    4 +-
 fs/btrfs/free-space-tree.h      |    5 +-
 fs/btrfs/fs.h                   |    7 +-
 fs/btrfs/relocation.c           | 1897 ++++++++++++++++++++++++++++++-
 fs/btrfs/relocation.h           |    8 +-
 fs/btrfs/space-info.c           |   22 +-
 fs/btrfs/sysfs.c                |    4 +
 fs/btrfs/transaction.c          |    7 +
 fs/btrfs/tree-checker.c         |   37 +-
 fs/btrfs/volumes.c              |  115 +-
 fs/btrfs/volumes.h              |   17 +-
 include/uapi/linux/btrfs.h      |    1 +
 include/uapi/linux/btrfs_tree.h |   29 +-
 22 files changed, 2444 insertions(+), 220 deletions(-)

-- 
2.49.0


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 01/12] btrfs: add definitions and constants for remap-tree
  2025-06-05 16:23 [PATCH 00/12] btrfs: remap tree Mark Harmstone
@ 2025-06-05 16:23 ` Mark Harmstone
  2025-06-13 21:02   ` Boris Burkov
  2025-06-05 16:23 ` [PATCH 02/12] btrfs: add REMAP chunk type Mark Harmstone
                   ` (16 subsequent siblings)
  17 siblings, 1 reply; 39+ messages in thread
From: Mark Harmstone @ 2025-06-05 16:23 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Mark Harmstone

Add an incompat flag for the new remap-tree feature, and the constants
and definitions needed to support it.

Signed-off-by: Mark Harmstone <maharmstone@fb.com>
---
 fs/btrfs/accessors.h            |  3 +++
 fs/btrfs/sysfs.c                |  2 ++
 fs/btrfs/tree-checker.c         |  6 ++++--
 fs/btrfs/volumes.c              |  1 +
 include/uapi/linux/btrfs.h      |  1 +
 include/uapi/linux/btrfs_tree.h | 12 ++++++++++++
 6 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/accessors.h b/fs/btrfs/accessors.h
index 15ea6348800b..5f5eda8d6f9e 100644
--- a/fs/btrfs/accessors.h
+++ b/fs/btrfs/accessors.h
@@ -1046,6 +1046,9 @@ BTRFS_SETGET_STACK_FUNCS(stack_verity_descriptor_encryption,
 BTRFS_SETGET_STACK_FUNCS(stack_verity_descriptor_size,
 			 struct btrfs_verity_descriptor_item, size, 64);
 
+BTRFS_SETGET_FUNCS(remap_address, struct btrfs_remap, address, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_remap_address, struct btrfs_remap, address, 64);
+
 /* Cast into the data area of the leaf. */
 #define btrfs_item_ptr(leaf, slot, type)				\
 	((type *)(btrfs_item_nr_offset(leaf, 0) + btrfs_item_offset(leaf, slot)))
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index f186c8082eff..831c25c2fb25 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -292,6 +292,7 @@ BTRFS_FEAT_ATTR_COMPAT_RO(free_space_tree, FREE_SPACE_TREE);
 BTRFS_FEAT_ATTR_COMPAT_RO(block_group_tree, BLOCK_GROUP_TREE);
 BTRFS_FEAT_ATTR_INCOMPAT(raid1c34, RAID1C34);
 BTRFS_FEAT_ATTR_INCOMPAT(simple_quota, SIMPLE_QUOTA);
+BTRFS_FEAT_ATTR_INCOMPAT(remap_tree, REMAP_TREE);
 #ifdef CONFIG_BLK_DEV_ZONED
 BTRFS_FEAT_ATTR_INCOMPAT(zoned, ZONED);
 #endif
@@ -326,6 +327,7 @@ static struct attribute *btrfs_supported_feature_attrs[] = {
 	BTRFS_FEAT_ATTR_PTR(raid1c34),
 	BTRFS_FEAT_ATTR_PTR(block_group_tree),
 	BTRFS_FEAT_ATTR_PTR(simple_quota),
+	BTRFS_FEAT_ATTR_PTR(remap_tree),
 #ifdef CONFIG_BLK_DEV_ZONED
 	BTRFS_FEAT_ATTR_PTR(zoned),
 #endif
diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
index 8f4703b488b7..a83fb828723a 100644
--- a/fs/btrfs/tree-checker.c
+++ b/fs/btrfs/tree-checker.c
@@ -913,11 +913,13 @@ int btrfs_check_chunk_valid(const struct btrfs_fs_info *fs_info,
 		return -EUCLEAN;
 	}
 	if (unlikely(type & ~(BTRFS_BLOCK_GROUP_TYPE_MASK |
-			      BTRFS_BLOCK_GROUP_PROFILE_MASK))) {
+			      BTRFS_BLOCK_GROUP_PROFILE_MASK |
+			      BTRFS_BLOCK_GROUP_REMAPPED))) {
 		chunk_err(fs_info, leaf, chunk, logical,
 			  "unrecognized chunk type: 0x%llx",
 			  ~(BTRFS_BLOCK_GROUP_TYPE_MASK |
-			    BTRFS_BLOCK_GROUP_PROFILE_MASK) & type);
+			    BTRFS_BLOCK_GROUP_PROFILE_MASK |
+			    BTRFS_BLOCK_GROUP_REMAPPED) & type);
 		return -EUCLEAN;
 	}
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 1535a425e8f9..3e53bde0e605 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -234,6 +234,7 @@ void btrfs_describe_block_groups(u64 bg_flags, char *buf, u32 size_buf)
 	DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_DATA, "data");
 	DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_SYSTEM, "system");
 	DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_METADATA, "metadata");
+	DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_REMAPPED, "remapped");
 
 	DESCRIBE_FLAG(BTRFS_AVAIL_ALLOC_BIT_SINGLE, "single");
 	for (i = 0; i < BTRFS_NR_RAID_TYPES; i++)
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index dd02160015b2..d857cdc7694a 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -336,6 +336,7 @@ struct btrfs_ioctl_fs_info_args {
 #define BTRFS_FEATURE_INCOMPAT_EXTENT_TREE_V2	(1ULL << 13)
 #define BTRFS_FEATURE_INCOMPAT_RAID_STRIPE_TREE	(1ULL << 14)
 #define BTRFS_FEATURE_INCOMPAT_SIMPLE_QUOTA	(1ULL << 16)
+#define BTRFS_FEATURE_INCOMPAT_REMAP_TREE	(1ULL << 17)
 
 struct btrfs_ioctl_feature_flags {
 	__u64 compat_flags;
diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index fc29d273845d..4439d77a7252 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -76,6 +76,9 @@
 /* Tracks RAID stripes in block groups. */
 #define BTRFS_RAID_STRIPE_TREE_OBJECTID 12ULL
 
+/* Holds details of remapped addresses after relocation. */
+#define BTRFS_REMAP_TREE_OBJECTID 13ULL
+
 /* device stats in the device tree */
 #define BTRFS_DEV_STATS_OBJECTID 0ULL
 
@@ -282,6 +285,10 @@
 
 #define BTRFS_RAID_STRIPE_KEY	230
 
+#define BTRFS_IDENTITY_REMAP_KEY 	234
+#define BTRFS_REMAP_KEY		 	235
+#define BTRFS_REMAP_BACKREF_KEY	 	236
+
 /*
  * Records the overall state of the qgroups.
  * There's only one instance of this key present,
@@ -1161,6 +1168,7 @@ struct btrfs_dev_replace_item {
 #define BTRFS_BLOCK_GROUP_RAID6         (1ULL << 8)
 #define BTRFS_BLOCK_GROUP_RAID1C3       (1ULL << 9)
 #define BTRFS_BLOCK_GROUP_RAID1C4       (1ULL << 10)
+#define BTRFS_BLOCK_GROUP_REMAPPED      (1ULL << 11)
 #define BTRFS_BLOCK_GROUP_RESERVED	(BTRFS_AVAIL_ALLOC_BIT_SINGLE | \
 					 BTRFS_SPACE_INFO_GLOBAL_RSV)
 
@@ -1323,4 +1331,8 @@ struct btrfs_verity_descriptor_item {
 	__u8 encryption;
 } __attribute__ ((__packed__));
 
+struct btrfs_remap {
+	__le64 address;
+} __attribute__ ((__packed__));
+
 #endif /* _BTRFS_CTREE_H_ */
-- 
2.49.0



* [PATCH 02/12] btrfs: add REMAP chunk type
  2025-06-05 16:23 [PATCH 00/12] btrfs: remap tree Mark Harmstone
  2025-06-05 16:23 ` [PATCH 01/12] btrfs: add definitions and constants for remap-tree Mark Harmstone
@ 2025-06-05 16:23 ` Mark Harmstone
  2025-06-13 21:22   ` Boris Burkov
  2025-06-05 16:23 ` [PATCH 03/12] btrfs: allow remapped chunks to have zero stripes Mark Harmstone
                   ` (15 subsequent siblings)
  17 siblings, 1 reply; 39+ messages in thread
From: Mark Harmstone @ 2025-06-05 16:23 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Mark Harmstone

Add a new REMAP chunk type, which is a metadata chunk that holds the
remap tree.

This is needed for bootstrapping purposes: the remap tree can't itself
be remapped, and must be relocated the existing way, by COWing every
leaf. The remap tree can't go in the SYSTEM chunk as space there is
limited, because a copy of the chunk item gets placed in the superblock.

The changes in fs/btrfs/volumes.h are because we're adding a new block
group type bit after the profile bits, and so can no longer rely on the
const_ilog2 trick.

Signed-off-by: Mark Harmstone <maharmstone@fb.com>
---
 fs/btrfs/block-rsv.c            |  8 ++++++++
 fs/btrfs/block-rsv.h            |  1 +
 fs/btrfs/disk-io.c              |  1 +
 fs/btrfs/fs.h                   |  2 ++
 fs/btrfs/space-info.c           | 13 ++++++++++++-
 fs/btrfs/sysfs.c                |  2 ++
 fs/btrfs/tree-checker.c         |  5 +++--
 fs/btrfs/volumes.c              |  1 +
 fs/btrfs/volumes.h              | 11 +++++++++--
 include/uapi/linux/btrfs_tree.h |  4 +++-
 10 files changed, 42 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/block-rsv.c b/fs/btrfs/block-rsv.c
index 5ad6de738aee..2678cd3bed29 100644
--- a/fs/btrfs/block-rsv.c
+++ b/fs/btrfs/block-rsv.c
@@ -421,6 +421,9 @@ void btrfs_init_root_block_rsv(struct btrfs_root *root)
 	case BTRFS_TREE_LOG_OBJECTID:
 		root->block_rsv = &fs_info->treelog_rsv;
 		break;
+	case BTRFS_REMAP_TREE_OBJECTID:
+		root->block_rsv = &fs_info->remap_block_rsv;
+		break;
 	default:
 		root->block_rsv = NULL;
 		break;
@@ -434,6 +437,9 @@ void btrfs_init_global_block_rsv(struct btrfs_fs_info *fs_info)
 	space_info = btrfs_find_space_info(fs_info, BTRFS_BLOCK_GROUP_SYSTEM);
 	fs_info->chunk_block_rsv.space_info = space_info;
 
+	space_info = btrfs_find_space_info(fs_info, BTRFS_BLOCK_GROUP_REMAP);
+	fs_info->remap_block_rsv.space_info = space_info;
+
 	space_info = btrfs_find_space_info(fs_info, BTRFS_BLOCK_GROUP_METADATA);
 	fs_info->global_block_rsv.space_info = space_info;
 	fs_info->trans_block_rsv.space_info = space_info;
@@ -460,6 +466,8 @@ void btrfs_release_global_block_rsv(struct btrfs_fs_info *fs_info)
 	WARN_ON(fs_info->trans_block_rsv.reserved > 0);
 	WARN_ON(fs_info->chunk_block_rsv.size > 0);
 	WARN_ON(fs_info->chunk_block_rsv.reserved > 0);
+	WARN_ON(fs_info->remap_block_rsv.size > 0);
+	WARN_ON(fs_info->remap_block_rsv.reserved > 0);
 	WARN_ON(fs_info->delayed_block_rsv.size > 0);
 	WARN_ON(fs_info->delayed_block_rsv.reserved > 0);
 	WARN_ON(fs_info->delayed_refs_rsv.reserved > 0);
diff --git a/fs/btrfs/block-rsv.h b/fs/btrfs/block-rsv.h
index 79ae9d05cd91..8359fb96bc3c 100644
--- a/fs/btrfs/block-rsv.h
+++ b/fs/btrfs/block-rsv.h
@@ -22,6 +22,7 @@ enum btrfs_rsv_type {
 	BTRFS_BLOCK_RSV_DELALLOC,
 	BTRFS_BLOCK_RSV_TRANS,
 	BTRFS_BLOCK_RSV_CHUNK,
+	BTRFS_BLOCK_RSV_REMAP,
 	BTRFS_BLOCK_RSV_DELOPS,
 	BTRFS_BLOCK_RSV_DELREFS,
 	BTRFS_BLOCK_RSV_TREELOG,
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 5a565bf96bf8..60cce96a9ec4 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2830,6 +2830,7 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
 			     BTRFS_BLOCK_RSV_GLOBAL);
 	btrfs_init_block_rsv(&fs_info->trans_block_rsv, BTRFS_BLOCK_RSV_TRANS);
 	btrfs_init_block_rsv(&fs_info->chunk_block_rsv, BTRFS_BLOCK_RSV_CHUNK);
+	btrfs_init_block_rsv(&fs_info->remap_block_rsv, BTRFS_BLOCK_RSV_REMAP);
 	btrfs_init_block_rsv(&fs_info->treelog_rsv, BTRFS_BLOCK_RSV_TREELOG);
 	btrfs_init_block_rsv(&fs_info->empty_block_rsv, BTRFS_BLOCK_RSV_EMPTY);
 	btrfs_init_block_rsv(&fs_info->delayed_block_rsv,
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index 5e48ed252fd0..07ac1a96477a 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -476,6 +476,8 @@ struct btrfs_fs_info {
 	struct btrfs_block_rsv trans_block_rsv;
 	/* Block reservation for chunk tree */
 	struct btrfs_block_rsv chunk_block_rsv;
+	/* Block reservation for remap tree */
+	struct btrfs_block_rsv remap_block_rsv;
 	/* Block reservation for delayed operations */
 	struct btrfs_block_rsv delayed_block_rsv;
 	/* Block reservation for delayed refs */
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 517916004f21..6471861c4b25 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -215,7 +215,7 @@ static u64 calc_chunk_size(const struct btrfs_fs_info *fs_info, u64 flags)
 
 	if (flags & BTRFS_BLOCK_GROUP_DATA)
 		return BTRFS_MAX_DATA_CHUNK_SIZE;
-	else if (flags & BTRFS_BLOCK_GROUP_SYSTEM)
+	else if (flags & (BTRFS_BLOCK_GROUP_SYSTEM | BTRFS_BLOCK_GROUP_REMAP))
 		return SZ_32M;
 
 	/* Handle BTRFS_BLOCK_GROUP_METADATA */
@@ -343,6 +343,8 @@ int btrfs_init_space_info(struct btrfs_fs_info *fs_info)
 	if (mixed) {
 		flags = BTRFS_BLOCK_GROUP_METADATA | BTRFS_BLOCK_GROUP_DATA;
 		ret = create_space_info(fs_info, flags);
+		if (ret)
+			goto out;
 	} else {
 		flags = BTRFS_BLOCK_GROUP_METADATA;
 		ret = create_space_info(fs_info, flags);
@@ -351,7 +353,15 @@ int btrfs_init_space_info(struct btrfs_fs_info *fs_info)
 
 		flags = BTRFS_BLOCK_GROUP_DATA;
 		ret = create_space_info(fs_info, flags);
+		if (ret)
+			goto out;
+	}
+
+	if (features & BTRFS_FEATURE_INCOMPAT_REMAP_TREE) {
+		flags = BTRFS_BLOCK_GROUP_REMAP;
+		ret = create_space_info(fs_info, flags);
 	}
+
 out:
 	return ret;
 }
@@ -590,6 +600,7 @@ static void dump_global_block_rsv(struct btrfs_fs_info *fs_info)
 	DUMP_BLOCK_RSV(fs_info, global_block_rsv);
 	DUMP_BLOCK_RSV(fs_info, trans_block_rsv);
 	DUMP_BLOCK_RSV(fs_info, chunk_block_rsv);
+	DUMP_BLOCK_RSV(fs_info, remap_block_rsv);
 	DUMP_BLOCK_RSV(fs_info, delayed_block_rsv);
 	DUMP_BLOCK_RSV(fs_info, delayed_refs_rsv);
 }
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 831c25c2fb25..aa98d99833fd 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -1985,6 +1985,8 @@ static const char *alloc_name(struct btrfs_space_info *space_info)
 	case BTRFS_BLOCK_GROUP_SYSTEM:
 		ASSERT(space_info->subgroup_id == BTRFS_SUB_GROUP_PRIMARY);
 		return "system";
+	case BTRFS_BLOCK_GROUP_REMAP:
+		return "remap";
 	default:
 		WARN_ON(1);
 		return "invalid-combination";
diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
index a83fb828723a..0505f8d76581 100644
--- a/fs/btrfs/tree-checker.c
+++ b/fs/btrfs/tree-checker.c
@@ -751,13 +751,14 @@ static int check_block_group_item(struct extent_buffer *leaf,
 	if (unlikely(type != BTRFS_BLOCK_GROUP_DATA &&
 		     type != BTRFS_BLOCK_GROUP_METADATA &&
 		     type != BTRFS_BLOCK_GROUP_SYSTEM &&
+		     type != BTRFS_BLOCK_GROUP_REMAP &&
 		     type != (BTRFS_BLOCK_GROUP_METADATA |
 			      BTRFS_BLOCK_GROUP_DATA))) {
 		block_group_err(leaf, slot,
-"invalid type, have 0x%llx (%lu bits set) expect either 0x%llx, 0x%llx, 0x%llx or 0x%llx",
+"invalid type, have 0x%llx (%lu bits set) expect either 0x%llx, 0x%llx, 0x%llx, 0x%llx or 0x%llx",
 			type, hweight64(type),
 			BTRFS_BLOCK_GROUP_DATA, BTRFS_BLOCK_GROUP_METADATA,
-			BTRFS_BLOCK_GROUP_SYSTEM,
+			BTRFS_BLOCK_GROUP_SYSTEM, BTRFS_BLOCK_GROUP_REMAP,
 			BTRFS_BLOCK_GROUP_METADATA | BTRFS_BLOCK_GROUP_DATA);
 		return -EUCLEAN;
 	}
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 3e53bde0e605..e7c467b6af46 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -234,6 +234,7 @@ void btrfs_describe_block_groups(u64 bg_flags, char *buf, u32 size_buf)
 	DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_DATA, "data");
 	DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_SYSTEM, "system");
 	DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_METADATA, "metadata");
+	DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_REMAP, "remap");
 	DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_REMAPPED, "remapped");
 
 	DESCRIBE_FLAG(BTRFS_AVAIL_ALLOC_BIT_SINGLE, "single");
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 6d8b1f38e3ee..9fb8fe4312a5 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -59,8 +59,6 @@ static_assert(const_ilog2(BTRFS_STRIPE_LEN) == BTRFS_STRIPE_LEN_SHIFT);
  */
 static_assert(const_ffs(BTRFS_BLOCK_GROUP_RAID0) <
 	      const_ffs(BTRFS_BLOCK_GROUP_PROFILE_MASK & ~BTRFS_BLOCK_GROUP_RAID0));
-static_assert(const_ilog2(BTRFS_BLOCK_GROUP_RAID0) >
-	      ilog2(BTRFS_BLOCK_GROUP_TYPE_MASK));
 
 /* ilog2() can handle both constants and variables */
 #define BTRFS_BG_FLAG_TO_INDEX(profile)					\
@@ -82,6 +80,15 @@ enum btrfs_raid_types {
 	BTRFS_NR_RAID_TYPES
 };
 
+static_assert(BTRFS_RAID_RAID0 == 1);
+static_assert(BTRFS_RAID_RAID1 == 2);
+static_assert(BTRFS_RAID_DUP == 3);
+static_assert(BTRFS_RAID_RAID10 == 4);
+static_assert(BTRFS_RAID_RAID5 == 5);
+static_assert(BTRFS_RAID_RAID6 == 6);
+static_assert(BTRFS_RAID_RAID1C3 == 7);
+static_assert(BTRFS_RAID_RAID1C4 == 8);
+
 /*
  * Use sequence counter to get consistent device stat data on
  * 32-bit processors.
diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index 4439d77a7252..9a36f0206d90 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -1169,12 +1169,14 @@ struct btrfs_dev_replace_item {
 #define BTRFS_BLOCK_GROUP_RAID1C3       (1ULL << 9)
 #define BTRFS_BLOCK_GROUP_RAID1C4       (1ULL << 10)
 #define BTRFS_BLOCK_GROUP_REMAPPED      (1ULL << 11)
+#define BTRFS_BLOCK_GROUP_REMAP         (1ULL << 12)
 #define BTRFS_BLOCK_GROUP_RESERVED	(BTRFS_AVAIL_ALLOC_BIT_SINGLE | \
 					 BTRFS_SPACE_INFO_GLOBAL_RSV)
 
 #define BTRFS_BLOCK_GROUP_TYPE_MASK	(BTRFS_BLOCK_GROUP_DATA |    \
 					 BTRFS_BLOCK_GROUP_SYSTEM |  \
-					 BTRFS_BLOCK_GROUP_METADATA)
+					 BTRFS_BLOCK_GROUP_METADATA | \
+					 BTRFS_BLOCK_GROUP_REMAP)
 
 #define BTRFS_BLOCK_GROUP_PROFILE_MASK	(BTRFS_BLOCK_GROUP_RAID0 |   \
 					 BTRFS_BLOCK_GROUP_RAID1 |   \
-- 
2.49.0



* [PATCH 03/12] btrfs: allow remapped chunks to have zero stripes
  2025-06-05 16:23 [PATCH 00/12] btrfs: remap tree Mark Harmstone
  2025-06-05 16:23 ` [PATCH 01/12] btrfs: add definitions and constants for remap-tree Mark Harmstone
  2025-06-05 16:23 ` [PATCH 02/12] btrfs: add REMAP chunk type Mark Harmstone
@ 2025-06-05 16:23 ` Mark Harmstone
  2025-06-13 21:41   ` Boris Burkov
  2025-06-05 16:23 ` [PATCH 04/12] btrfs: remove remapped block groups from the free-space tree Mark Harmstone
                   ` (14 subsequent siblings)
  17 siblings, 1 reply; 39+ messages in thread
From: Mark Harmstone @ 2025-06-05 16:23 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Mark Harmstone

When a chunk has been fully remapped, we are going to set its
num_stripes to 0, as it will no longer represent a physical location on
disk.

Change tree-checker to allow for this, and fix a couple of
divide-by-zeroes seen elsewhere.

Signed-off-by: Mark Harmstone <maharmstone@fb.com>
---
 fs/btrfs/tree-checker.c | 16 +++++++++-------
 fs/btrfs/volumes.c      |  8 +++++++-
 2 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
index 0505f8d76581..fd83df06e3fb 100644
--- a/fs/btrfs/tree-checker.c
+++ b/fs/btrfs/tree-checker.c
@@ -829,7 +829,7 @@ int btrfs_check_chunk_valid(const struct btrfs_fs_info *fs_info,
 	u64 type;
 	u64 features;
 	u32 chunk_sector_size;
-	bool mixed = false;
+	bool mixed = false, remapped;
 	int raid_index;
 	int nparity;
 	int ncopies;
@@ -853,12 +853,14 @@ int btrfs_check_chunk_valid(const struct btrfs_fs_info *fs_info,
 	ncopies = btrfs_raid_array[raid_index].ncopies;
 	nparity = btrfs_raid_array[raid_index].nparity;
 
-	if (unlikely(!num_stripes)) {
+	remapped = type & BTRFS_BLOCK_GROUP_REMAPPED;
+
+	if (unlikely(!remapped && !num_stripes)) {
 		chunk_err(fs_info, leaf, chunk, logical,
 			  "invalid chunk num_stripes, have %u", num_stripes);
 		return -EUCLEAN;
 	}
-	if (unlikely(num_stripes < ncopies)) {
+	if (unlikely(!remapped && num_stripes < ncopies)) {
 		chunk_err(fs_info, leaf, chunk, logical,
 			  "invalid chunk num_stripes < ncopies, have %u < %d",
 			  num_stripes, ncopies);
@@ -960,7 +962,7 @@ int btrfs_check_chunk_valid(const struct btrfs_fs_info *fs_info,
 		}
 	}
 
-	if (unlikely((type & BTRFS_BLOCK_GROUP_RAID10 &&
+	if (unlikely(!remapped && ((type & BTRFS_BLOCK_GROUP_RAID10 &&
 		      sub_stripes != btrfs_raid_array[BTRFS_RAID_RAID10].sub_stripes) ||
 		     (type & BTRFS_BLOCK_GROUP_RAID1 &&
 		      num_stripes != btrfs_raid_array[BTRFS_RAID_RAID1].devs_min) ||
@@ -975,7 +977,7 @@ int btrfs_check_chunk_valid(const struct btrfs_fs_info *fs_info,
 		     (type & BTRFS_BLOCK_GROUP_DUP &&
 		      num_stripes != btrfs_raid_array[BTRFS_RAID_DUP].dev_stripes) ||
 		     ((type & BTRFS_BLOCK_GROUP_PROFILE_MASK) == 0 &&
-		      num_stripes != btrfs_raid_array[BTRFS_RAID_SINGLE].dev_stripes))) {
+		      num_stripes != btrfs_raid_array[BTRFS_RAID_SINGLE].dev_stripes)))) {
 		chunk_err(fs_info, leaf, chunk, logical,
 			"invalid num_stripes:sub_stripes %u:%u for profile %llu",
 			num_stripes, sub_stripes,
@@ -999,11 +1001,11 @@ static int check_leaf_chunk_item(struct extent_buffer *leaf,
 	struct btrfs_fs_info *fs_info = leaf->fs_info;
 	int num_stripes;
 
-	if (unlikely(btrfs_item_size(leaf, slot) < sizeof(struct btrfs_chunk))) {
+	if (unlikely(btrfs_item_size(leaf, slot) < offsetof(struct btrfs_chunk, stripe))) {
 		chunk_err(fs_info, leaf, chunk, key->offset,
 			"invalid chunk item size: have %u expect [%zu, %u)",
 			btrfs_item_size(leaf, slot),
-			sizeof(struct btrfs_chunk),
+			offsetof(struct btrfs_chunk, stripe),
 			BTRFS_LEAF_DATA_SIZE(fs_info));
 		return -EUCLEAN;
 	}
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index e7c467b6af46..9159d11cb143 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6133,6 +6133,12 @@ struct btrfs_discard_stripe *btrfs_map_discard(struct btrfs_fs_info *fs_info,
 		goto out_free_map;
 	}
 
+	/* avoid divide by zero on fully-remapped chunks */
+	if (map->num_stripes == 0) {
+		ret = -EOPNOTSUPP;
+		goto out_free_map;
+	}
+
 	offset = logical - map->start;
 	length = min_t(u64, map->start + map->chunk_len - logical, length);
 	*length_ret = length;
@@ -6953,7 +6959,7 @@ u64 btrfs_calc_stripe_length(const struct btrfs_chunk_map *map)
 {
 	const int data_stripes = calc_data_stripes(map->type, map->num_stripes);
 
-	return div_u64(map->chunk_len, data_stripes);
+	return data_stripes ? div_u64(map->chunk_len, data_stripes) : 0;
 }
 
 #if BITS_PER_LONG == 32
-- 
2.49.0



* [PATCH 04/12] btrfs: remove remapped block groups from the free-space tree
  2025-06-05 16:23 [PATCH 00/12] btrfs: remap tree Mark Harmstone
                   ` (2 preceding siblings ...)
  2025-06-05 16:23 ` [PATCH 03/12] btrfs: allow remapped chunks to have zero stripes Mark Harmstone
@ 2025-06-05 16:23 ` Mark Harmstone
  2025-06-06  6:41   ` kernel test robot
  2025-06-13 22:00   ` Boris Burkov
  2025-06-05 16:23 ` [PATCH 05/12] btrfs: don't add metadata items for the remap tree to the extent tree Mark Harmstone
                   ` (13 subsequent siblings)
  17 siblings, 2 replies; 39+ messages in thread
From: Mark Harmstone @ 2025-06-05 16:23 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Mark Harmstone

No new allocations can be done from block groups that have the REMAPPED flag
set, so there's no value in their having entries in the free-space tree.

Prevent a search through the free-space tree from being scheduled for such a
block group, and prevent discard from being run for a fully-remapped block
group.

Signed-off-by: Mark Harmstone <maharmstone@fb.com>
---
 fs/btrfs/block-group.c | 21 ++++++++++++++++-----
 fs/btrfs/discard.c     |  9 +++++++++
 2 files changed, 25 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 5b0cb04b2b93..9b3b5358f1ba 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -920,6 +920,13 @@ int btrfs_cache_block_group(struct btrfs_block_group *cache, bool wait)
 	if (btrfs_is_zoned(fs_info))
 		return 0;
 
+	/*
+	 * No allocations can be done from remapped block groups, so they have
+	 * no entries in the free-space tree.
+	 */
+	if (cache->flags & BTRFS_BLOCK_GROUP_REMAPPED)
+		return 0;
+
 	caching_ctl = kzalloc(sizeof(*caching_ctl), GFP_NOFS);
 	if (!caching_ctl)
 		return -ENOMEM;
@@ -1235,9 +1242,11 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 	 * another task to attempt to create another block group with the same
 	 * item key (and failing with -EEXIST and a transaction abort).
 	 */
-	ret = remove_block_group_free_space(trans, block_group);
-	if (ret)
-		goto out;
+	if (!(block_group->flags & BTRFS_BLOCK_GROUP_REMAPPED)) {
+		ret = remove_block_group_free_space(trans, block_group);
+		if (ret)
+			goto out;
+	}
 
 	ret = remove_block_group_item(trans, path, block_group);
 	if (ret < 0)
@@ -2457,10 +2466,12 @@ static int read_one_block_group(struct btrfs_fs_info *info,
 	if (btrfs_chunk_writeable(info, cache->start)) {
 		if (cache->used == 0) {
 			ASSERT(list_empty(&cache->bg_list));
-			if (btrfs_test_opt(info, DISCARD_ASYNC))
+			if (btrfs_test_opt(info, DISCARD_ASYNC) &&
+			    !(cache->flags & BTRFS_BLOCK_GROUP_REMAPPED)) {
 				btrfs_discard_queue_work(&info->discard_ctl, cache);
-			else
+			} else {
 				btrfs_mark_bg_unused(cache);
+			}
 		}
 	} else {
 		inc_block_group_ro(cache, 1);
diff --git a/fs/btrfs/discard.c b/fs/btrfs/discard.c
index 89fe85778115..1015a4d37fb2 100644
--- a/fs/btrfs/discard.c
+++ b/fs/btrfs/discard.c
@@ -698,6 +698,15 @@ void btrfs_discard_punt_unused_bgs_list(struct btrfs_fs_info *fs_info)
 	/* We enabled async discard, so punt all to the queue */
 	list_for_each_entry_safe(block_group, next, &fs_info->unused_bgs,
 				 bg_list) {
+		/* Fully remapped BGs have nothing to discard */
+		spin_lock(&block_group->lock);
+		if (block_group->flags & BTRFS_BLOCK_GROUP_REMAPPED &&
+		    !btrfs_is_block_group_used(block_group)) {
+			spin_unlock(&block_group->lock);
+			continue;
+		}
+		spin_unlock(&block_group->lock);
+
 		list_del_init(&block_group->bg_list);
 		btrfs_discard_queue_work(&fs_info->discard_ctl, block_group);
 		/*
-- 
2.49.0



* [PATCH 05/12] btrfs: don't add metadata items for the remap tree to the extent tree
  2025-06-05 16:23 [PATCH 00/12] btrfs: remap tree Mark Harmstone
                   ` (3 preceding siblings ...)
  2025-06-05 16:23 ` [PATCH 04/12] btrfs: remove remapped block groups from the free-space tree Mark Harmstone
@ 2025-06-05 16:23 ` Mark Harmstone
  2025-06-13 22:39   ` Boris Burkov
  2025-06-05 16:23 ` [PATCH 06/12] btrfs: add extended version of struct block_group_item Mark Harmstone
                   ` (12 subsequent siblings)
  17 siblings, 1 reply; 39+ messages in thread
From: Mark Harmstone @ 2025-06-05 16:23 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Mark Harmstone

There is the following potential problem with the remap tree and delayed refs:

* Remapped extent freed in a delayed ref, which removes an entry from the
  remap tree
* Remap tree now small enough to fit in a single leaf
* Corruption as we now have a level-0 block with a level-1 metadata item
  in the extent tree

One solution to this would be to rework the remap tree code so that it operates
via delayed refs. But as we're hoping to remove cow-only metadata items in the
future anyway, change things so that the remap tree doesn't have any entries in
the extent tree. This also has the benefit of reducing write amplification.

We also make the clear_cache mount option a no-op, as with extent tree v2,
since the free-space tree can no longer be recreated from the extent tree.

Finally, disable relocating the remap tree itself for the time being: rather
than walking the extent tree, this will need to be changed so that the remap
tree gets walked, and any nodes within the specified block groups get COWed.
This code will also cover the future cases when we remove the metadata items
for the SYSTEM block groups, i.e. the chunk and root trees.

Signed-off-by: Mark Harmstone <maharmstone@fb.com>
---
 fs/btrfs/disk-io.c     |   3 ++
 fs/btrfs/extent-tree.c | 114 ++++++++++++++++++++++++-----------------
 fs/btrfs/volumes.c     |   3 ++
 3 files changed, 73 insertions(+), 47 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 60cce96a9ec4..324116c3566c 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3064,6 +3064,9 @@ int btrfs_start_pre_rw_mount(struct btrfs_fs_info *fs_info)
 		if (btrfs_fs_incompat(fs_info, EXTENT_TREE_V2))
 			btrfs_warn(fs_info,
 				   "'clear_cache' option is ignored with extent tree v2");
+		else if (btrfs_fs_incompat(fs_info, REMAP_TREE))
+			btrfs_warn(fs_info,
+				   "'clear_cache' option is ignored with remap tree");
 		else
 			rebuild_free_space_tree = true;
 	} else if (btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE) &&
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 46d4963a8241..205692fc1c7e 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3106,6 +3106,24 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
 	bool skinny_metadata = btrfs_fs_incompat(info, SKINNY_METADATA);
 	u64 delayed_ref_root = href->owning_root;
 
+	is_data = owner_objectid >= BTRFS_FIRST_FREE_OBJECTID;
+
+	if (!is_data && node->ref_root == BTRFS_REMAP_TREE_OBJECTID) {
+		ret = add_to_free_space_tree(trans, bytenr, num_bytes);
+		if (ret) {
+			btrfs_abort_transaction(trans, ret);
+			return ret;
+		}
+
+		ret = btrfs_update_block_group(trans, bytenr, num_bytes, false);
+		if (ret) {
+			btrfs_abort_transaction(trans, ret);
+			return ret;
+		}
+
+		return 0;
+	}
+
 	extent_root = btrfs_extent_root(info, bytenr);
 	ASSERT(extent_root);
 
@@ -3113,8 +3131,6 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
 	if (!path)
 		return -ENOMEM;
 
-	is_data = owner_objectid >= BTRFS_FIRST_FREE_OBJECTID;
-
 	if (!is_data && refs_to_drop != 1) {
 		btrfs_crit(info,
 "invalid refs_to_drop, dropping more than 1 refs for tree block %llu refs_to_drop %u",
@@ -4893,57 +4909,61 @@ static int alloc_reserved_tree_block(struct btrfs_trans_handle *trans,
 	int level = btrfs_delayed_ref_owner(node);
 	bool skinny_metadata = btrfs_fs_incompat(fs_info, SKINNY_METADATA);
 
-	extent_key.objectid = node->bytenr;
-	if (skinny_metadata) {
-		/* The owner of a tree block is the level. */
-		extent_key.offset = level;
-		extent_key.type = BTRFS_METADATA_ITEM_KEY;
-	} else {
-		extent_key.offset = node->num_bytes;
-		extent_key.type = BTRFS_EXTENT_ITEM_KEY;
-		size += sizeof(*block_info);
-	}
+	if (node->ref_root != BTRFS_REMAP_TREE_OBJECTID) {
+		extent_key.objectid = node->bytenr;
+		if (skinny_metadata) {
+			/* The owner of a tree block is the level. */
+			extent_key.offset = level;
+			extent_key.type = BTRFS_METADATA_ITEM_KEY;
+		} else {
+			extent_key.offset = node->num_bytes;
+			extent_key.type = BTRFS_EXTENT_ITEM_KEY;
+			size += sizeof(*block_info);
+		}
 
-	path = btrfs_alloc_path();
-	if (!path)
-		return -ENOMEM;
+		path = btrfs_alloc_path();
+		if (!path)
+			return -ENOMEM;
 
-	extent_root = btrfs_extent_root(fs_info, extent_key.objectid);
-	ret = btrfs_insert_empty_item(trans, extent_root, path, &extent_key,
-				      size);
-	if (ret) {
-		btrfs_free_path(path);
-		return ret;
-	}
+		extent_root = btrfs_extent_root(fs_info, extent_key.objectid);
+		ret = btrfs_insert_empty_item(trans, extent_root, path,
+					      &extent_key, size);
+		if (ret) {
+			btrfs_free_path(path);
+			return ret;
+		}
 
-	leaf = path->nodes[0];
-	extent_item = btrfs_item_ptr(leaf, path->slots[0],
-				     struct btrfs_extent_item);
-	btrfs_set_extent_refs(leaf, extent_item, 1);
-	btrfs_set_extent_generation(leaf, extent_item, trans->transid);
-	btrfs_set_extent_flags(leaf, extent_item,
-			       flags | BTRFS_EXTENT_FLAG_TREE_BLOCK);
+		leaf = path->nodes[0];
+		extent_item = btrfs_item_ptr(leaf, path->slots[0],
+					struct btrfs_extent_item);
+		btrfs_set_extent_refs(leaf, extent_item, 1);
+		btrfs_set_extent_generation(leaf, extent_item, trans->transid);
+		btrfs_set_extent_flags(leaf, extent_item,
+				flags | BTRFS_EXTENT_FLAG_TREE_BLOCK);
 
-	if (skinny_metadata) {
-		iref = (struct btrfs_extent_inline_ref *)(extent_item + 1);
-	} else {
-		block_info = (struct btrfs_tree_block_info *)(extent_item + 1);
-		btrfs_set_tree_block_key(leaf, block_info, &extent_op->key);
-		btrfs_set_tree_block_level(leaf, block_info, level);
-		iref = (struct btrfs_extent_inline_ref *)(block_info + 1);
-	}
+		if (skinny_metadata) {
+			iref = (struct btrfs_extent_inline_ref *)(extent_item + 1);
+		} else {
+			block_info = (struct btrfs_tree_block_info *)(extent_item + 1);
+			btrfs_set_tree_block_key(leaf, block_info, &extent_op->key);
+			btrfs_set_tree_block_level(leaf, block_info, level);
+			iref = (struct btrfs_extent_inline_ref *)(block_info + 1);
+		}
 
-	if (node->type == BTRFS_SHARED_BLOCK_REF_KEY) {
-		btrfs_set_extent_inline_ref_type(leaf, iref,
-						 BTRFS_SHARED_BLOCK_REF_KEY);
-		btrfs_set_extent_inline_ref_offset(leaf, iref, node->parent);
-	} else {
-		btrfs_set_extent_inline_ref_type(leaf, iref,
-						 BTRFS_TREE_BLOCK_REF_KEY);
-		btrfs_set_extent_inline_ref_offset(leaf, iref, node->ref_root);
-	}
+		if (node->type == BTRFS_SHARED_BLOCK_REF_KEY) {
+			btrfs_set_extent_inline_ref_type(leaf, iref,
+						BTRFS_SHARED_BLOCK_REF_KEY);
+			btrfs_set_extent_inline_ref_offset(leaf, iref,
+							   node->parent);
+		} else {
+			btrfs_set_extent_inline_ref_type(leaf, iref,
+						BTRFS_TREE_BLOCK_REF_KEY);
+			btrfs_set_extent_inline_ref_offset(leaf, iref,
+							   node->ref_root);
+		}
 
-	btrfs_free_path(path);
+		btrfs_free_path(path);
+	}
 
 	return alloc_reserved_extent(trans, node->bytenr, fs_info->nodesize);
 }
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 9159d11cb143..0f4954f998cd 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3981,6 +3981,9 @@ static bool should_balance_chunk(struct extent_buffer *leaf, struct btrfs_chunk
 	struct btrfs_balance_args *bargs = NULL;
 	u64 chunk_type = btrfs_chunk_type(leaf, chunk);
 
+	if (chunk_type & BTRFS_BLOCK_GROUP_REMAP)
+		return false;
+
 	/* type filter */
 	if (!((chunk_type & BTRFS_BLOCK_GROUP_TYPE_MASK) &
 	      (bctl->flags & BTRFS_BALANCE_TYPE_MASK))) {
-- 
2.49.0



* [PATCH 06/12] btrfs: add extended version of struct block_group_item
  2025-06-05 16:23 [PATCH 00/12] btrfs: remap tree Mark Harmstone
                   ` (4 preceding siblings ...)
  2025-06-05 16:23 ` [PATCH 05/12] btrfs: don't add metadata items for the remap tree to the extent tree Mark Harmstone
@ 2025-06-05 16:23 ` Mark Harmstone
  2025-06-05 16:23 ` [PATCH 07/12] btrfs: allow mounting filesystems with remap-tree incompat flag Mark Harmstone
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 39+ messages in thread
From: Mark Harmstone @ 2025-06-05 16:23 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Mark Harmstone

Add a struct btrfs_block_group_item_v2, which is used in the block group
tree if the remap-tree incompat flag is set.

This adds two new fields to the block group item: `remap_bytes` and
`identity_remap_count`.

`remap_bytes` records the amount of data that's physically within this
block group, but nominally in another, remapped block group. This is
necessary because this data will need to be moved first if this block
group is itself relocated. If `remap_bytes` > 0, this is an indicator to
the relocation thread that it will need to search the remap-tree for
backrefs. A block group must also have `remap_bytes` == 0 before it can
be dropped.

`identity_remap_count` records how many identity remap items are located
in the remap tree for this block group. When relocation is begun for
this block group, this is set to the number of holes in the free-space
tree for this range. As identity remaps are converted into actual remaps
by the relocation process, this number is decreased. Once it reaches 0,
either because of relocation or because extents have been deleted, the
block group has been fully remapped and its chunk's device extents are
removed.

Signed-off-by: Mark Harmstone <maharmstone@fb.com>
---
 fs/btrfs/accessors.h            |  20 +++++++
 fs/btrfs/block-group.c          | 101 ++++++++++++++++++++++++--------
 fs/btrfs/block-group.h          |  14 ++++-
 fs/btrfs/discard.c              |   2 +-
 fs/btrfs/tree-checker.c         |  10 +++-
 include/uapi/linux/btrfs_tree.h |   8 +++
 6 files changed, 127 insertions(+), 28 deletions(-)

diff --git a/fs/btrfs/accessors.h b/fs/btrfs/accessors.h
index 5f5eda8d6f9e..6e6dd664217b 100644
--- a/fs/btrfs/accessors.h
+++ b/fs/btrfs/accessors.h
@@ -264,6 +264,26 @@ BTRFS_SETGET_FUNCS(block_group_flags, struct btrfs_block_group_item, flags, 64);
 BTRFS_SETGET_STACK_FUNCS(stack_block_group_flags,
 			struct btrfs_block_group_item, flags, 64);
 
+/* struct btrfs_block_group_item_v2 */
+BTRFS_SETGET_STACK_FUNCS(stack_block_group_v2_used, struct btrfs_block_group_item_v2,
+			 used, 64);
+BTRFS_SETGET_FUNCS(block_group_v2_used, struct btrfs_block_group_item_v2, used, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_block_group_v2_chunk_objectid,
+			 struct btrfs_block_group_item_v2, chunk_objectid, 64);
+BTRFS_SETGET_FUNCS(block_group_v2_chunk_objectid,
+		   struct btrfs_block_group_item_v2, chunk_objectid, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_block_group_v2_flags,
+			 struct btrfs_block_group_item_v2, flags, 64);
+BTRFS_SETGET_FUNCS(block_group_v2_flags, struct btrfs_block_group_item_v2, flags, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_block_group_v2_remap_bytes,
+			 struct btrfs_block_group_item_v2, remap_bytes, 64);
+BTRFS_SETGET_FUNCS(block_group_v2_remap_bytes, struct btrfs_block_group_item_v2,
+		   remap_bytes, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_block_group_v2_identity_remap_count,
+			 struct btrfs_block_group_item_v2, identity_remap_count, 32);
+BTRFS_SETGET_FUNCS(block_group_v2_identity_remap_count, struct btrfs_block_group_item_v2,
+		   identity_remap_count, 32);
+
 /* struct btrfs_free_space_info */
 BTRFS_SETGET_FUNCS(free_space_extent_count, struct btrfs_free_space_info,
 		   extent_count, 32);
diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 9b3b5358f1ba..4529356bb1e3 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -2360,7 +2360,7 @@ static int check_chunk_block_group_mappings(struct btrfs_fs_info *fs_info)
 }
 
 static int read_one_block_group(struct btrfs_fs_info *info,
-				struct btrfs_block_group_item *bgi,
+				struct btrfs_block_group_item_v2 *bgi,
 				const struct btrfs_key *key,
 				int need_clear)
 {
@@ -2375,11 +2375,16 @@ static int read_one_block_group(struct btrfs_fs_info *info,
 		return -ENOMEM;
 
 	cache->length = key->offset;
-	cache->used = btrfs_stack_block_group_used(bgi);
+	cache->used = btrfs_stack_block_group_v2_used(bgi);
 	cache->commit_used = cache->used;
-	cache->flags = btrfs_stack_block_group_flags(bgi);
-	cache->global_root_id = btrfs_stack_block_group_chunk_objectid(bgi);
+	cache->flags = btrfs_stack_block_group_v2_flags(bgi);
+	cache->global_root_id = btrfs_stack_block_group_v2_chunk_objectid(bgi);
 	cache->space_info = btrfs_find_space_info(info, cache->flags);
+	cache->remap_bytes = btrfs_stack_block_group_v2_remap_bytes(bgi);
+	cache->commit_remap_bytes = cache->remap_bytes;
+	cache->identity_remap_count =
+		btrfs_stack_block_group_v2_identity_remap_count(bgi);
+	cache->commit_identity_remap_count = cache->identity_remap_count;
 
 	set_free_space_tree_thresholds(cache);
 
@@ -2444,7 +2449,7 @@ static int read_one_block_group(struct btrfs_fs_info *info,
 	} else if (cache->length == cache->used) {
 		cache->cached = BTRFS_CACHE_FINISHED;
 		btrfs_free_excluded_extents(cache);
-	} else if (cache->used == 0) {
+	} else if (cache->used == 0 && cache->remap_bytes == 0) {
 		cache->cached = BTRFS_CACHE_FINISHED;
 		ret = btrfs_add_new_free_space(cache, cache->start,
 					       cache->start + cache->length, NULL);
@@ -2464,7 +2469,8 @@ static int read_one_block_group(struct btrfs_fs_info *info,
 
 	set_avail_alloc_bits(info, cache->flags);
 	if (btrfs_chunk_writeable(info, cache->start)) {
-		if (cache->used == 0) {
+		if (cache->used == 0 && cache->identity_remap_count == 0 &&
+		    cache->remap_bytes == 0) {
 			ASSERT(list_empty(&cache->bg_list));
 			if (btrfs_test_opt(info, DISCARD_ASYNC) &&
 			    !(cache->flags & BTRFS_BLOCK_GROUP_REMAPPED)) {
@@ -2570,9 +2576,10 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
 		need_clear = 1;
 
 	while (1) {
-		struct btrfs_block_group_item bgi;
+		struct btrfs_block_group_item_v2 bgi;
 		struct extent_buffer *leaf;
 		int slot;
+		size_t size;
 
 		ret = find_first_block_group(info, path, &key);
 		if (ret > 0)
@@ -2583,8 +2590,16 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
 		leaf = path->nodes[0];
 		slot = path->slots[0];
 
+		if (btrfs_fs_incompat(info, REMAP_TREE)) {
+			size = sizeof(struct btrfs_block_group_item_v2);
+		} else {
+			size = sizeof(struct btrfs_block_group_item);
+			btrfs_set_stack_block_group_v2_remap_bytes(&bgi, 0);
+			btrfs_set_stack_block_group_v2_identity_remap_count(&bgi, 0);
+		}
+
 		read_extent_buffer(leaf, &bgi, btrfs_item_ptr_offset(leaf, slot),
-				   sizeof(bgi));
+				   size);
 
 		btrfs_item_key_to_cpu(leaf, &key, slot);
 		btrfs_release_path(path);
@@ -2654,25 +2669,38 @@ static int insert_block_group_item(struct btrfs_trans_handle *trans,
 				   struct btrfs_block_group *block_group)
 {
 	struct btrfs_fs_info *fs_info = trans->fs_info;
-	struct btrfs_block_group_item bgi;
+	struct btrfs_block_group_item_v2 bgi;
 	struct btrfs_root *root = btrfs_block_group_root(fs_info);
 	struct btrfs_key key;
 	u64 old_commit_used;
+	size_t size;
 	int ret;
 
 	spin_lock(&block_group->lock);
-	btrfs_set_stack_block_group_used(&bgi, block_group->used);
-	btrfs_set_stack_block_group_chunk_objectid(&bgi,
-						   block_group->global_root_id);
-	btrfs_set_stack_block_group_flags(&bgi, block_group->flags);
+	btrfs_set_stack_block_group_v2_used(&bgi, block_group->used);
+	btrfs_set_stack_block_group_v2_chunk_objectid(&bgi,
+						      block_group->global_root_id);
+	btrfs_set_stack_block_group_v2_flags(&bgi, block_group->flags);
+	btrfs_set_stack_block_group_v2_remap_bytes(&bgi,
+						   block_group->remap_bytes);
+	btrfs_set_stack_block_group_v2_identity_remap_count(&bgi,
+					block_group->identity_remap_count);
 	old_commit_used = block_group->commit_used;
 	block_group->commit_used = block_group->used;
+	block_group->commit_remap_bytes = block_group->remap_bytes;
+	block_group->commit_identity_remap_count =
+		block_group->identity_remap_count;
 	key.objectid = block_group->start;
 	key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
 	key.offset = block_group->length;
 	spin_unlock(&block_group->lock);
 
-	ret = btrfs_insert_item(trans, root, &key, &bgi, sizeof(bgi));
+	if (btrfs_fs_incompat(fs_info, REMAP_TREE))
+		size = sizeof(struct btrfs_block_group_item_v2);
+	else
+		size = sizeof(struct btrfs_block_group_item);
+
+	ret = btrfs_insert_item(trans, root, &key, &bgi, size);
 	if (ret < 0) {
 		spin_lock(&block_group->lock);
 		block_group->commit_used = old_commit_used;
@@ -3127,10 +3155,12 @@ static int update_block_group_item(struct btrfs_trans_handle *trans,
 	struct btrfs_root *root = btrfs_block_group_root(fs_info);
 	unsigned long bi;
 	struct extent_buffer *leaf;
-	struct btrfs_block_group_item bgi;
+	struct btrfs_block_group_item_v2 bgi;
 	struct btrfs_key key;
-	u64 old_commit_used;
-	u64 used;
+	u64 old_commit_used, old_commit_remap_bytes;
+	u32 old_commit_identity_remap_count;
+	u64 used, remap_bytes;
+	u32 identity_remap_count;
 
 	/*
 	 * Block group items update can be triggered out of commit transaction
@@ -3140,13 +3170,21 @@ static int update_block_group_item(struct btrfs_trans_handle *trans,
 	 */
 	spin_lock(&cache->lock);
 	old_commit_used = cache->commit_used;
+	old_commit_remap_bytes = cache->commit_remap_bytes;
+	old_commit_identity_remap_count = cache->commit_identity_remap_count;
 	used = cache->used;
-	/* No change in used bytes, can safely skip it. */
-	if (cache->commit_used == used) {
+	remap_bytes = cache->remap_bytes;
+	identity_remap_count = cache->identity_remap_count;
+	/* No change in values, can safely skip it. */
+	if (cache->commit_used == used &&
+	    cache->commit_remap_bytes == remap_bytes &&
+	    cache->commit_identity_remap_count == identity_remap_count) {
 		spin_unlock(&cache->lock);
 		return 0;
 	}
 	cache->commit_used = used;
+	cache->commit_remap_bytes = remap_bytes;
+	cache->commit_identity_remap_count = identity_remap_count;
 	spin_unlock(&cache->lock);
 
 	key.objectid = cache->start;
@@ -3162,11 +3200,23 @@ static int update_block_group_item(struct btrfs_trans_handle *trans,
 
 	leaf = path->nodes[0];
 	bi = btrfs_item_ptr_offset(leaf, path->slots[0]);
-	btrfs_set_stack_block_group_used(&bgi, used);
-	btrfs_set_stack_block_group_chunk_objectid(&bgi,
-						   cache->global_root_id);
-	btrfs_set_stack_block_group_flags(&bgi, cache->flags);
-	write_extent_buffer(leaf, &bgi, bi, sizeof(bgi));
+	btrfs_set_stack_block_group_v2_used(&bgi, used);
+	btrfs_set_stack_block_group_v2_chunk_objectid(&bgi,
+						      cache->global_root_id);
+	btrfs_set_stack_block_group_v2_flags(&bgi, cache->flags);
+
+	if (btrfs_fs_incompat(fs_info, REMAP_TREE)) {
+		btrfs_set_stack_block_group_v2_remap_bytes(&bgi,
+							   cache->remap_bytes);
+		btrfs_set_stack_block_group_v2_identity_remap_count(&bgi,
+						cache->identity_remap_count);
+		write_extent_buffer(leaf, &bgi, bi,
+				    sizeof(struct btrfs_block_group_item_v2));
+	} else {
+		write_extent_buffer(leaf, &bgi, bi,
+				    sizeof(struct btrfs_block_group_item));
+	}
+
 fail:
 	btrfs_release_path(path);
 	/*
@@ -3181,6 +3231,9 @@ static int update_block_group_item(struct btrfs_trans_handle *trans,
 	if (ret < 0 && ret != -ENOENT) {
 		spin_lock(&cache->lock);
 		cache->commit_used = old_commit_used;
+		cache->commit_remap_bytes = old_commit_remap_bytes;
+		cache->commit_identity_remap_count =
+			old_commit_identity_remap_count;
 		spin_unlock(&cache->lock);
 	}
 	return ret;
diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index 9de356bcb411..c484118b8b8d 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -127,6 +127,8 @@ struct btrfs_block_group {
 	u64 flags;
 	u64 cache_generation;
 	u64 global_root_id;
+	u64 remap_bytes;
+	u32 identity_remap_count;
 
 	/*
 	 * The last committed used bytes of this block group, if the above @used
@@ -134,6 +136,15 @@ struct btrfs_block_group {
 	 * group item of this block group.
 	 */
 	u64 commit_used;
+	/*
+	 * The last committed remap_bytes value of this block group.
+	 */
+	u64 commit_remap_bytes;
+	/*
+	 * The last committed identity_remap_count value of this block group.
+	 */
+	u32 commit_identity_remap_count;
+
 	/*
 	 * If the free space extent count exceeds this number, convert the block
 	 * group to bitmaps.
@@ -275,7 +286,8 @@ static inline bool btrfs_is_block_group_used(const struct btrfs_block_group *bg)
 {
 	lockdep_assert_held(&bg->lock);
 
-	return (bg->used > 0 || bg->reserved > 0 || bg->pinned > 0);
+	return (bg->used > 0 || bg->reserved > 0 || bg->pinned > 0 ||
+		bg->remap_bytes > 0);
 }
 
 static inline bool btrfs_is_block_group_data_only(const struct btrfs_block_group *block_group)
diff --git a/fs/btrfs/discard.c b/fs/btrfs/discard.c
index 1015a4d37fb2..2b7b1e440bc8 100644
--- a/fs/btrfs/discard.c
+++ b/fs/btrfs/discard.c
@@ -373,7 +373,7 @@ void btrfs_discard_queue_work(struct btrfs_discard_ctl *discard_ctl,
 	if (!block_group || !btrfs_test_opt(block_group->fs_info, DISCARD_ASYNC))
 		return;
 
-	if (block_group->used == 0)
+	if (block_group->used == 0 && block_group->remap_bytes == 0)
 		add_to_discard_unused_list(discard_ctl, block_group);
 	else
 		add_to_discard_list(discard_ctl, block_group);
diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
index fd83df06e3fb..25311576fab6 100644
--- a/fs/btrfs/tree-checker.c
+++ b/fs/btrfs/tree-checker.c
@@ -687,6 +687,7 @@ static int check_block_group_item(struct extent_buffer *leaf,
 	u64 chunk_objectid;
 	u64 flags;
 	u64 type;
+	size_t exp_size;
 
 	/*
 	 * Here we don't really care about alignment since extent allocator can
@@ -698,10 +699,15 @@ static int check_block_group_item(struct extent_buffer *leaf,
 		return -EUCLEAN;
 	}
 
-	if (unlikely(item_size != sizeof(bgi))) {
+	if (btrfs_fs_incompat(fs_info, REMAP_TREE))
+		exp_size = sizeof(struct btrfs_block_group_item_v2);
+	else
+		exp_size = sizeof(struct btrfs_block_group_item);
+
+	if (unlikely(item_size != exp_size)) {
 		block_group_err(leaf, slot,
 			"invalid item size, have %u expect %zu",
-				item_size, sizeof(bgi));
+				item_size, exp_size);
 		return -EUCLEAN;
 	}
 
diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index 9a36f0206d90..500e3a7df90b 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -1229,6 +1229,14 @@ struct btrfs_block_group_item {
 	__le64 flags;
 } __attribute__ ((__packed__));
 
+struct btrfs_block_group_item_v2 {
+	__le64 used;
+	__le64 chunk_objectid;
+	__le64 flags;
+	__le64 remap_bytes;
+	__le32 identity_remap_count;
+} __attribute__ ((__packed__));
+
 struct btrfs_free_space_info {
 	__le32 extent_count;
 	__le32 flags;
-- 
2.49.0



* [PATCH 07/12] btrfs: allow mounting filesystems with remap-tree incompat flag
  2025-06-05 16:23 [PATCH 00/12] btrfs: remap tree Mark Harmstone
                   ` (5 preceding siblings ...)
  2025-06-05 16:23 ` [PATCH 06/12] btrfs: add extended version of struct block_group_item Mark Harmstone
@ 2025-06-05 16:23 ` Mark Harmstone
  2025-06-05 16:23 ` [PATCH 08/12] btrfs: redirect I/O for remapped block groups Mark Harmstone
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 39+ messages in thread
From: Mark Harmstone @ 2025-06-05 16:23 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Mark Harmstone

If we encounter a filesystem with the remap-tree incompat flag set,
validate its compatibility with the other feature flags, and load the
remap tree using the values that have been added to the superblock.

The remap-tree feature depends on the free-space tree; no-holes and
block-group-tree have additionally been made dependencies to reduce the
testing matrix. Similarly, I'm not aware of any reason why mixed-bg or
zoned would be incompatible with remap-tree, but these combinations are
blocked for the time being until they can be fully tested.

Signed-off-by: Mark Harmstone <maharmstone@fb.com>
---
 fs/btrfs/Kconfig                |  2 ++
 fs/btrfs/accessors.h            |  6 ++++
 fs/btrfs/disk-io.c              | 60 +++++++++++++++++++++++++++++++++
 fs/btrfs/extent-tree.c          |  2 ++
 fs/btrfs/fs.h                   |  4 ++-
 fs/btrfs/transaction.c          |  7 ++++
 include/uapi/linux/btrfs_tree.h |  5 ++-
 7 files changed, 84 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/Kconfig b/fs/btrfs/Kconfig
index c352f3ae0385..f41446102b14 100644
--- a/fs/btrfs/Kconfig
+++ b/fs/btrfs/Kconfig
@@ -114,6 +114,8 @@ config BTRFS_EXPERIMENTAL
 
 	  - extent tree v2 - complex rework of extent tracking
 
+	  - remap-tree - logical address remapping tree
+
 	  If unsure, say N.
 
 config BTRFS_FS_REF_VERIFY
diff --git a/fs/btrfs/accessors.h b/fs/btrfs/accessors.h
index 6e6dd664217b..1bb6c0439ba7 100644
--- a/fs/btrfs/accessors.h
+++ b/fs/btrfs/accessors.h
@@ -919,6 +919,12 @@ BTRFS_SETGET_STACK_FUNCS(super_uuid_tree_generation, struct btrfs_super_block,
 			 uuid_tree_generation, 64);
 BTRFS_SETGET_STACK_FUNCS(super_nr_global_roots, struct btrfs_super_block,
 			 nr_global_roots, 64);
+BTRFS_SETGET_STACK_FUNCS(super_remap_root, struct btrfs_super_block,
+			 remap_root, 64);
+BTRFS_SETGET_STACK_FUNCS(super_remap_root_generation, struct btrfs_super_block,
+			 remap_root_generation, 64);
+BTRFS_SETGET_STACK_FUNCS(super_remap_root_level, struct btrfs_super_block,
+			 remap_root_level, 8);
 
 /* struct btrfs_file_extent_item */
 BTRFS_SETGET_STACK_FUNCS(stack_file_extent_type, struct btrfs_file_extent_item,
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 324116c3566c..a0542b581f4e 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1176,6 +1176,8 @@ static struct btrfs_root *btrfs_get_global_root(struct btrfs_fs_info *fs_info,
 		return btrfs_grab_root(btrfs_global_root(fs_info, &key));
 	case BTRFS_RAID_STRIPE_TREE_OBJECTID:
 		return btrfs_grab_root(fs_info->stripe_root);
+	case BTRFS_REMAP_TREE_OBJECTID:
+		return btrfs_grab_root(fs_info->remap_root);
 	default:
 		return NULL;
 	}
@@ -1264,6 +1266,7 @@ void btrfs_free_fs_info(struct btrfs_fs_info *fs_info)
 	btrfs_put_root(fs_info->data_reloc_root);
 	btrfs_put_root(fs_info->block_group_root);
 	btrfs_put_root(fs_info->stripe_root);
+	btrfs_put_root(fs_info->remap_root);
 	btrfs_check_leaked_roots(fs_info);
 	btrfs_extent_buffer_leak_debug_check(fs_info);
 	kfree(fs_info->super_copy);
@@ -1818,6 +1821,7 @@ static void free_root_pointers(struct btrfs_fs_info *info, bool free_chunk_root)
 	free_root_extent_buffers(info->data_reloc_root);
 	free_root_extent_buffers(info->block_group_root);
 	free_root_extent_buffers(info->stripe_root);
+	free_root_extent_buffers(info->remap_root);
 	if (free_chunk_root)
 		free_root_extent_buffers(info->chunk_root);
 }
@@ -2255,6 +2259,17 @@ static int btrfs_read_roots(struct btrfs_fs_info *fs_info)
 	if (ret)
 		goto out;
 
+	if (btrfs_fs_incompat(fs_info, REMAP_TREE)) {
+		/* remap_root already loaded in load_important_roots() */
+		root = fs_info->remap_root;
+
+		set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
+
+		root->root_key.objectid = BTRFS_REMAP_TREE_OBJECTID;
+		root->root_key.type = BTRFS_ROOT_ITEM_KEY;
+		root->root_key.offset = 0;
+	}
+
 	/*
 	 * This tree can share blocks with some other fs tree during relocation
 	 * and we need a proper setup by btrfs_get_fs_root
@@ -2522,6 +2537,28 @@ int btrfs_validate_super(const struct btrfs_fs_info *fs_info,
 		ret = -EINVAL;
 	}
 
+	/* Ditto for remap_tree */
+	if (btrfs_fs_incompat(fs_info, REMAP_TREE) &&
+	    (!btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE_VALID) ||
+	     !btrfs_fs_incompat(fs_info, NO_HOLES) ||
+	     !btrfs_fs_compat_ro(fs_info, BLOCK_GROUP_TREE))) {
+		btrfs_err(fs_info,
+"remap-tree feature requires free-space-tree, no-holes, and block-group-tree");
+		ret = -EINVAL;
+	}
+
+	if (btrfs_fs_incompat(fs_info, REMAP_TREE) &&
+	    btrfs_fs_incompat(fs_info, MIXED_GROUPS)) {
+		btrfs_err(fs_info, "remap-tree not supported with mixed-bg");
+		ret = -EINVAL;
+	}
+
+	if (btrfs_fs_incompat(fs_info, REMAP_TREE) &&
+	    btrfs_fs_incompat(fs_info, ZONED)) {
+		btrfs_err(fs_info, "remap-tree not supported with zoned devices");
+		ret = -EINVAL;
+	}
+
 	/*
 	 * Hint to catch really bogus numbers, bitflips or so, more exact checks are
 	 * done later
@@ -2680,6 +2717,18 @@ static int load_important_roots(struct btrfs_fs_info *fs_info)
 		btrfs_warn(fs_info, "couldn't read tree root");
 		return ret;
 	}
+
+	if (btrfs_fs_incompat(fs_info, REMAP_TREE)) {
+		bytenr = btrfs_super_remap_root(sb);
+		gen = btrfs_super_remap_root_generation(sb);
+		level = btrfs_super_remap_root_level(sb);
+		ret = load_super_root(fs_info->remap_root, bytenr, gen, level);
+		if (ret) {
+			btrfs_warn(fs_info, "couldn't read remap root");
+			return ret;
+		}
+	}
+
 	return 0;
 }
 
@@ -3293,6 +3342,7 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
 	struct btrfs_fs_info *fs_info = btrfs_sb(sb);
 	struct btrfs_root *tree_root;
 	struct btrfs_root *chunk_root;
+	struct btrfs_root *remap_root;
 	int ret;
 	int level;
 
@@ -3327,6 +3377,16 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
 		goto fail_alloc;
 	}
 
+	if (btrfs_super_incompat_flags(disk_super) & BTRFS_FEATURE_INCOMPAT_REMAP_TREE) {
+		remap_root = btrfs_alloc_root(fs_info, BTRFS_REMAP_TREE_OBJECTID,
+					      GFP_KERNEL);
+		fs_info->remap_root = remap_root;
+		if (!remap_root) {
+			ret = -ENOMEM;
+			goto fail_alloc;
+		}
+	}
+
 	btrfs_info(fs_info, "first mount of filesystem %pU", disk_super->fsid);
 	/*
 	 * Verify the type first, if that or the checksum value are
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 205692fc1c7e..e8f752ef1da9 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2564,6 +2564,8 @@ static u64 get_alloc_profile_by_root(struct btrfs_root *root, int data)
 		flags = BTRFS_BLOCK_GROUP_DATA;
 	else if (root == fs_info->chunk_root)
 		flags = BTRFS_BLOCK_GROUP_SYSTEM;
+	else if (root == fs_info->remap_root)
+		flags = BTRFS_BLOCK_GROUP_REMAP;
 	else
 		flags = BTRFS_BLOCK_GROUP_METADATA;
 
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index 07ac1a96477a..8ceeb64aceb3 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -286,7 +286,8 @@ enum {
 #define BTRFS_FEATURE_INCOMPAT_SUPP		\
 	(BTRFS_FEATURE_INCOMPAT_SUPP_STABLE |	\
 	 BTRFS_FEATURE_INCOMPAT_RAID_STRIPE_TREE | \
-	 BTRFS_FEATURE_INCOMPAT_EXTENT_TREE_V2)
+	 BTRFS_FEATURE_INCOMPAT_EXTENT_TREE_V2 | \
+	 BTRFS_FEATURE_INCOMPAT_REMAP_TREE)
 
 #else
 
@@ -442,6 +443,7 @@ struct btrfs_fs_info {
 	struct btrfs_root *data_reloc_root;
 	struct btrfs_root *block_group_root;
 	struct btrfs_root *stripe_root;
+	struct btrfs_root *remap_root;
 
 	/* The log root tree is a directory of all the other log roots */
 	struct btrfs_root *log_root_tree;
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 825d135ef6c7..045468fc807d 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1951,6 +1951,13 @@ static void update_super_roots(struct btrfs_fs_info *fs_info)
 		super->cache_generation = 0;
 	if (test_bit(BTRFS_FS_UPDATE_UUID_TREE_GEN, &fs_info->flags))
 		super->uuid_tree_generation = root_item->generation;
+
+	if (btrfs_fs_incompat(fs_info, REMAP_TREE)) {
+		root_item = &fs_info->remap_root->root_item;
+		super->remap_root = root_item->bytenr;
+		super->remap_root_generation = root_item->generation;
+		super->remap_root_level = root_item->level;
+	}
 }
 
 int btrfs_transaction_blocked(struct btrfs_fs_info *info)
diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index 500e3a7df90b..89bcb80081a6 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -721,9 +721,12 @@ struct btrfs_super_block {
 	__u8 metadata_uuid[BTRFS_FSID_SIZE];
 
 	__u64 nr_global_roots;
+	__le64 remap_root;
+	__le64 remap_root_generation;
+	__u8 remap_root_level;
 
 	/* Future expansion */
-	__le64 reserved[27];
+	__u8 reserved[199];
 	__u8 sys_chunk_array[BTRFS_SYSTEM_CHUNK_ARRAY_SIZE];
 	struct btrfs_root_backup super_roots[BTRFS_NUM_BACKUP_ROOTS];
 
-- 
2.49.0



* [PATCH 08/12] btrfs: redirect I/O for remapped block groups
  2025-06-05 16:23 [PATCH 00/12] btrfs: remap tree Mark Harmstone
                   ` (6 preceding siblings ...)
  2025-06-05 16:23 ` [PATCH 07/12] btrfs: allow mounting filesystems with remap-tree incompat flag Mark Harmstone
@ 2025-06-05 16:23 ` Mark Harmstone
  2025-06-05 16:23 ` [PATCH 09/12] btrfs: handle deletions from remapped block group Mark Harmstone
                   ` (9 subsequent siblings)
  17 siblings, 0 replies; 39+ messages in thread
From: Mark Harmstone @ 2025-06-05 16:23 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Mark Harmstone

Change btrfs_map_block() so that if the block group has the REMAPPED
flag set, we call btrfs_translate_remap() to obtain a new address.

btrfs_translate_remap() searches the remap tree for a range
corresponding to the logical address passed to btrfs_map_block(). If it
is within an identity remap, this part of the block group hasn't yet
been relocated, and so we use the existing address.

If it is within an actual remap, we subtract the start of the remap
range and add the address of its destination, contained in the item's
payload.
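The translation described above is simple interval arithmetic over sorted remap items. As a rough user-space model (struct remap_entry, translate_remap() and the linear scan are illustrative stand-ins for the kernel's btrfs_search_slot() over the remap tree, not the actual API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative model of one remap item (key + payload). */
struct remap_entry {
	uint64_t start;    /* key.objectid: start of the remapped range */
	uint64_t length;   /* key.offset: length of the range */
	bool identity;     /* BTRFS_IDENTITY_REMAP_KEY vs BTRFS_REMAP_KEY */
	uint64_t dest;     /* item payload: destination (non-identity only) */
};

/*
 * Find the entry covering *logical, clamp *length to the end of that
 * entry, and for a non-identity remap translate the address by
 * subtracting the range start and adding the destination.
 * Returns -1 (-ENOENT in the kernel) if nothing covers the address.
 */
static int translate_remap(const struct remap_entry *entries, size_t n,
			   uint64_t *logical, uint64_t *length)
{
	for (size_t i = 0; i < n; i++) {
		const struct remap_entry *e = &entries[i];

		if (*logical < e->start || *logical >= e->start + e->length)
			continue;

		/* Clamp the mapped length to the remap boundary. */
		if (*logical + *length > e->start + e->length)
			*length = e->start + e->length - *logical;

		/* Identity remap: not yet relocated, address unchanged. */
		if (!e->identity)
			*logical = *logical - e->start + e->dest;
		return 0;
	}
	return -1;
}
```

The length clamping matters because a single btrfs_map_block() call may span a remap boundary; shortening *length lets the caller issue the I/O in pieces, one remap at a time.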

Signed-off-by: Mark Harmstone <maharmstone@fb.com>
---
 fs/btrfs/relocation.c | 59 +++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/relocation.h |  2 ++
 fs/btrfs/volumes.c    | 31 +++++++++++++++++++++++
 3 files changed, 92 insertions(+)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index d7ec1d72821c..e45f3598ef03 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -3905,6 +3905,65 @@ static const char *stage_to_string(enum reloc_stage stage)
 	return "unknown";
 }
 
+int btrfs_translate_remap(struct btrfs_fs_info *fs_info, u64 *logical,
+			  u64 *length, bool nolock)
+{
+	int ret;
+	struct btrfs_key key, found_key;
+	struct extent_buffer *leaf;
+	struct btrfs_remap *remap;
+	BTRFS_PATH_AUTO_FREE(path);
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	if (nolock) {
+		path->search_commit_root = 1;
+		path->skip_locking = 1;
+	}
+
+	key.objectid = *logical;
+	key.type = (u8)-1;
+	key.offset = (u64)-1;
+
+	ret = btrfs_search_slot(NULL, fs_info->remap_root, &key, path,
+				0, 0);
+	if (ret < 0)
+		return ret;
+
+	leaf = path->nodes[0];
+
+	if (path->slots[0] == 0)
+		return -ENOENT;
+
+	path->slots[0]--;
+
+	btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
+
+	if (found_key.type != BTRFS_REMAP_KEY &&
+	    found_key.type != BTRFS_IDENTITY_REMAP_KEY) {
+		return -ENOENT;
+	}
+
+	if (found_key.objectid > *logical ||
+	    found_key.objectid + found_key.offset <= *logical) {
+		return -ENOENT;
+	}
+
+	if (*logical + *length > found_key.objectid + found_key.offset)
+		*length = found_key.objectid + found_key.offset - *logical;
+
+	if (found_key.type == BTRFS_IDENTITY_REMAP_KEY)
+		return 0;
+
+	remap = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_remap);
+
+	*logical = *logical - found_key.objectid + btrfs_remap_address(leaf, remap);
+
+	return 0;
+}
+
 /*
  * function to relocate all extents in a block group.
  */
diff --git a/fs/btrfs/relocation.h b/fs/btrfs/relocation.h
index 788c86d8633a..8c9dfc55b799 100644
--- a/fs/btrfs/relocation.h
+++ b/fs/btrfs/relocation.h
@@ -30,5 +30,7 @@ int btrfs_should_cancel_balance(const struct btrfs_fs_info *fs_info);
 struct btrfs_root *find_reloc_root(struct btrfs_fs_info *fs_info, u64 bytenr);
 bool btrfs_should_ignore_reloc_root(const struct btrfs_root *root);
 u64 btrfs_get_reloc_bg_bytenr(const struct btrfs_fs_info *fs_info);
+int btrfs_translate_remap(struct btrfs_fs_info *fs_info, u64 *logical,
+			  u64 *length, bool nolock);
 
 #endif
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 0f4954f998cd..62bd6259ebd3 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6623,6 +6623,37 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
 	if (IS_ERR(map))
 		return PTR_ERR(map);
 
+	if (map->type & BTRFS_BLOCK_GROUP_REMAPPED) {
+		u64 new_logical = logical;
+		bool nolock = !(map->type & BTRFS_BLOCK_GROUP_DATA);
+
+		/*
+		 * We use search_commit_root in btrfs_translate_remap for
+		 * metadata blocks, to avoid lockdep complaining about
+		 * recursive locking.
+		 * If we get -ENOENT, it means this BG has only just had
+		 * its REMAPPED flag set and nothing has actually been
+		 * remapped yet.
+		 */
+		ret = btrfs_translate_remap(fs_info, &new_logical, length,
+					    nolock);
+		if (ret && (!nolock || ret != -ENOENT))
+			return ret;
+
+		if (ret != -ENOENT && new_logical != logical) {
+			btrfs_free_chunk_map(map);
+
+			map = btrfs_get_chunk_map(fs_info, new_logical,
+						  *length);
+			if (IS_ERR(map))
+				return PTR_ERR(map);
+
+			logical = new_logical;
+		}
+
+		ret = 0;
+	}
+
 	num_copies = btrfs_chunk_map_num_copies(map);
 	if (io_geom.mirror_num > num_copies)
 		return -EINVAL;
-- 
2.49.0



* [PATCH 09/12] btrfs: handle deletions from remapped block group
  2025-06-05 16:23 [PATCH 00/12] btrfs: remap tree Mark Harmstone
                   ` (7 preceding siblings ...)
  2025-06-05 16:23 ` [PATCH 08/12] btrfs: redirect I/O for remapped block groups Mark Harmstone
@ 2025-06-05 16:23 ` Mark Harmstone
  2025-06-13 23:42   ` Boris Burkov
  2025-06-05 16:23 ` [PATCH 10/12] btrfs: handle setting up relocation of block group with remap-tree Mark Harmstone
                   ` (8 subsequent siblings)
  17 siblings, 1 reply; 39+ messages in thread
From: Mark Harmstone @ 2025-06-05 16:23 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Mark Harmstone

Handle the case where we free an extent from a block group that has the
REMAPPED flag set. Because the remap tree is orthogonal to the extent
tree, for data this may be within any number of identity remaps or
actual remaps. If we're freeing a metadata node, this will be wholly
inside one or the other.

btrfs_remove_extent_from_remap_tree() searches the remap tree for the
remaps that cover the range in question, then calls
remove_range_from_remap_tree() for each one, to punch a hole in the
remap and adjust the free-space tree.

For an identity remap, remove_range_from_remap_tree() adjusts the
block group's `identity_remap_count` if the number of identity remaps
changes. If the count reaches zero we call last_identity_remap_gone(),
which removes the chunk's stripes and device extents, as the chunk is
now fully remapped.

The changes involving the block group's ro flag are there because the
REMAPPED flag by itself already prevents any new allocations within the
block group, so we don't need to account for this separately.
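Punching the hole splits into four cases, depending on how the freed range overlaps the remap item. A user-space sketch of just the case selection and overlap calculation (punch_hole() and enum punch_case are illustrative names mirroring the case comments in remove_range_from_remap_tree(), not kernel code):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Given a remap item covering [remap_start, remap_start + remap_len)
 * and a freed range [bytenr, bytenr + num_bytes), decide how to punch
 * the hole and return the overlap length, which the caller uses to
 * advance through successive remaps.
 */
enum punch_case {
	PUNCH_ALL,	/* "Remove entirely": delete the item */
	PUNCH_FRONT,	/* "Remove beginning": move key.objectid forward */
	PUNCH_MIDDLE,	/* "Remove middle": shrink item, insert a second */
	PUNCH_END,	/* "Remove end": shrink key.offset */
};

static uint64_t punch_hole(uint64_t remap_start, uint64_t remap_len,
			   uint64_t bytenr, uint64_t num_bytes,
			   enum punch_case *pc)
{
	uint64_t end = bytenr + num_bytes;

	if (bytenr == remap_start && num_bytes >= remap_len) {
		*pc = PUNCH_ALL;
		return remap_len;
	}
	if (bytenr == remap_start) {
		*pc = PUNCH_FRONT;
		return num_bytes;
	}
	if (end < remap_start + remap_len) {
		*pc = PUNCH_MIDDLE;
		return num_bytes;
	}
	*pc = PUNCH_END;
	return remap_start + remap_len - bytenr;
}
```

Because the returned overlap can be shorter than num_bytes (a data extent may straddle several remaps), btrfs_remove_extent_from_remap_tree() loops, advancing bytenr and shrinking num_bytes by the overlap until the whole range is handled.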

Signed-off-by: Mark Harmstone <maharmstone@fb.com>
---
 fs/btrfs/block-group.c |  80 ++++---
 fs/btrfs/block-group.h |   1 +
 fs/btrfs/disk-io.c     |   1 +
 fs/btrfs/extent-tree.c |  30 ++-
 fs/btrfs/fs.h          |   1 +
 fs/btrfs/relocation.c  | 505 +++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/relocation.h  |   3 +
 fs/btrfs/volumes.c     |  56 +++--
 fs/btrfs/volumes.h     |   6 +
 9 files changed, 628 insertions(+), 55 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 4529356bb1e3..334df145ab3f 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1055,6 +1055,32 @@ static int remove_block_group_item(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
+void btrfs_remove_bg_from_sinfo(struct btrfs_block_group *block_group)
+{
+	int factor = btrfs_bg_type_to_factor(block_group->flags);
+
+	spin_lock(&block_group->space_info->lock);
+
+	if (btrfs_test_opt(block_group->fs_info, ENOSPC_DEBUG)) {
+		WARN_ON(block_group->space_info->total_bytes
+			< block_group->length);
+		WARN_ON(block_group->space_info->bytes_readonly
+			< block_group->length - block_group->zone_unusable);
+		WARN_ON(block_group->space_info->bytes_zone_unusable
+			< block_group->zone_unusable);
+		WARN_ON(block_group->space_info->disk_total
+			< block_group->length * factor);
+	}
+	block_group->space_info->total_bytes -= block_group->length;
+	block_group->space_info->bytes_readonly -=
+		(block_group->length - block_group->zone_unusable);
+	btrfs_space_info_update_bytes_zone_unusable(block_group->space_info,
+						    -block_group->zone_unusable);
+	block_group->space_info->disk_total -= block_group->length * factor;
+
+	spin_unlock(&block_group->space_info->lock);
+}
+
 int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 			     struct btrfs_chunk_map *map)
 {
@@ -1066,7 +1092,6 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 	struct kobject *kobj = NULL;
 	int ret;
 	int index;
-	int factor;
 	struct btrfs_caching_control *caching_ctl = NULL;
 	bool remove_map;
 	bool remove_rsv = false;
@@ -1075,7 +1100,7 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 	if (!block_group)
 		return -ENOENT;
 
-	BUG_ON(!block_group->ro);
+	BUG_ON(!block_group->ro && !(block_group->flags & BTRFS_BLOCK_GROUP_REMAPPED));
 
 	trace_btrfs_remove_block_group(block_group);
 	/*
@@ -1087,7 +1112,6 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 				  block_group->length);
 
 	index = btrfs_bg_flags_to_raid_index(block_group->flags);
-	factor = btrfs_bg_type_to_factor(block_group->flags);
 
 	/* make sure this block group isn't part of an allocation cluster */
 	cluster = &fs_info->data_alloc_cluster;
@@ -1211,26 +1235,11 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 
 	spin_lock(&block_group->space_info->lock);
 	list_del_init(&block_group->ro_list);
-
-	if (btrfs_test_opt(fs_info, ENOSPC_DEBUG)) {
-		WARN_ON(block_group->space_info->total_bytes
-			< block_group->length);
-		WARN_ON(block_group->space_info->bytes_readonly
-			< block_group->length - block_group->zone_unusable);
-		WARN_ON(block_group->space_info->bytes_zone_unusable
-			< block_group->zone_unusable);
-		WARN_ON(block_group->space_info->disk_total
-			< block_group->length * factor);
-	}
-	block_group->space_info->total_bytes -= block_group->length;
-	block_group->space_info->bytes_readonly -=
-		(block_group->length - block_group->zone_unusable);
-	btrfs_space_info_update_bytes_zone_unusable(block_group->space_info,
-						    -block_group->zone_unusable);
-	block_group->space_info->disk_total -= block_group->length * factor;
-
 	spin_unlock(&block_group->space_info->lock);
 
+	if (!(block_group->flags & BTRFS_BLOCK_GROUP_REMAPPED))
+		btrfs_remove_bg_from_sinfo(block_group);
+
 	/*
 	 * Remove the free space for the block group from the free space tree
 	 * and the block group's item from the extent tree before marking the
@@ -1517,6 +1526,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 	while (!list_empty(&fs_info->unused_bgs)) {
 		u64 used;
 		int trimming;
+		bool made_ro = false;
 
 		block_group = list_first_entry(&fs_info->unused_bgs,
 					       struct btrfs_block_group,
@@ -1553,7 +1563,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 
 		spin_lock(&space_info->lock);
 		spin_lock(&block_group->lock);
-		if (btrfs_is_block_group_used(block_group) || block_group->ro ||
+		if (btrfs_is_block_group_used(block_group) ||
+		    (block_group->ro && !(block_group->flags & BTRFS_BLOCK_GROUP_REMAPPED)) ||
 		    list_is_singular(&block_group->list)) {
 			/*
 			 * We want to bail if we made new allocations or have
@@ -1596,7 +1607,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 		 */
 		used = btrfs_space_info_used(space_info, true);
 		if (space_info->total_bytes - block_group->length < used &&
-		    block_group->zone_unusable < block_group->length) {
+		    block_group->zone_unusable < block_group->length &&
+		    !(block_group->flags & BTRFS_BLOCK_GROUP_REMAPPED)) {
 			/*
 			 * Add a reference for the list, compensate for the ref
 			 * drop under the "next" label for the
@@ -1614,8 +1626,14 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 		spin_unlock(&block_group->lock);
 		spin_unlock(&space_info->lock);
 
-		/* We don't want to force the issue, only flip if it's ok. */
-		ret = inc_block_group_ro(block_group, 0);
+		if (!(block_group->flags & BTRFS_BLOCK_GROUP_REMAPPED)) {
+			/* We don't want to force the issue, only flip if it's ok. */
+			ret = inc_block_group_ro(block_group, 0);
+			made_ro = true;
+		} else {
+			ret = 0;
+		}
+
 		up_write(&space_info->groups_sem);
 		if (ret < 0) {
 			ret = 0;
@@ -1624,7 +1642,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 
 		ret = btrfs_zone_finish(block_group);
 		if (ret < 0) {
-			btrfs_dec_block_group_ro(block_group);
+			if (made_ro)
+				btrfs_dec_block_group_ro(block_group);
 			if (ret == -EAGAIN)
 				ret = 0;
 			goto next;
@@ -1637,7 +1656,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 		trans = btrfs_start_trans_remove_block_group(fs_info,
 						     block_group->start);
 		if (IS_ERR(trans)) {
-			btrfs_dec_block_group_ro(block_group);
+			if (made_ro)
+				btrfs_dec_block_group_ro(block_group);
 			ret = PTR_ERR(trans);
 			goto next;
 		}
@@ -1647,7 +1667,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 		 * just delete them, we don't care about them anymore.
 		 */
 		if (!clean_pinned_extents(trans, block_group)) {
-			btrfs_dec_block_group_ro(block_group);
+			if (made_ro)
+				btrfs_dec_block_group_ro(block_group);
 			goto end_trans;
 		}
 
@@ -1661,7 +1682,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 		spin_lock(&fs_info->discard_ctl.lock);
 		if (!list_empty(&block_group->discard_list)) {
 			spin_unlock(&fs_info->discard_ctl.lock);
-			btrfs_dec_block_group_ro(block_group);
+			if (made_ro)
+				btrfs_dec_block_group_ro(block_group);
 			btrfs_discard_queue_work(&fs_info->discard_ctl,
 						 block_group);
 			goto end_trans;
diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index c484118b8b8d..767898929960 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -329,6 +329,7 @@ int btrfs_add_new_free_space(struct btrfs_block_group *block_group,
 struct btrfs_trans_handle *btrfs_start_trans_remove_block_group(
 				struct btrfs_fs_info *fs_info,
 				const u64 chunk_offset);
+void btrfs_remove_bg_from_sinfo(struct btrfs_block_group *block_group);
 int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 			     struct btrfs_chunk_map *map);
 void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index a0542b581f4e..dac22efd2332 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2922,6 +2922,7 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
 	mutex_init(&fs_info->chunk_mutex);
 	mutex_init(&fs_info->transaction_kthread_mutex);
 	mutex_init(&fs_info->cleaner_mutex);
+	mutex_init(&fs_info->remap_mutex);
 	mutex_init(&fs_info->ro_block_group_mutex);
 	init_rwsem(&fs_info->commit_root_sem);
 	init_rwsem(&fs_info->cleanup_work_sem);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index e8f752ef1da9..995784cdca9d 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -40,6 +40,7 @@
 #include "orphan.h"
 #include "tree-checker.h"
 #include "raid-stripe-tree.h"
+#include "relocation.h"
 
 #undef SCRAMBLE_DELAYED_REFS
 
@@ -2977,6 +2978,8 @@ static int do_free_extent_accounting(struct btrfs_trans_handle *trans,
 				     u64 bytenr, struct btrfs_squota_delta *delta)
 {
 	int ret;
+	struct btrfs_block_group *bg;
+	bool bg_is_remapped = false;
 	u64 num_bytes = delta->num_bytes;
 
 	if (delta->is_data) {
@@ -3002,10 +3005,22 @@ static int do_free_extent_accounting(struct btrfs_trans_handle *trans,
 		return ret;
 	}
 
-	ret = add_to_free_space_tree(trans, bytenr, num_bytes);
-	if (ret) {
-		btrfs_abort_transaction(trans, ret);
-		return ret;
+	if (btrfs_fs_incompat(trans->fs_info, REMAP_TREE)) {
+		bg = btrfs_lookup_block_group(trans->fs_info, bytenr);
+		bg_is_remapped = bg->flags & BTRFS_BLOCK_GROUP_REMAPPED;
+		btrfs_put_block_group(bg);
+	}
+
+	/*
+	 * If remapped, FST has already been taken care of in
+	 * remove_range_from_remap_tree().
+	 */
+	if (!bg_is_remapped) {
+		ret = add_to_free_space_tree(trans, bytenr, num_bytes);
+		if (ret) {
+			btrfs_abort_transaction(trans, ret);
+			return ret;
+		}
 	}
 
 	ret = btrfs_update_block_group(trans, bytenr, num_bytes, false);
@@ -3387,6 +3402,13 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
 		}
 		btrfs_release_path(path);
 
+		ret = btrfs_remove_extent_from_remap_tree(trans, path, bytenr,
+							  num_bytes);
+		if (ret) {
+			btrfs_abort_transaction(trans, ret);
+			goto out;
+		}
+
 		ret = do_free_extent_accounting(trans, bytenr, &delta);
 	}
 	btrfs_release_path(path);
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index 8ceeb64aceb3..fd7dc61be9c7 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -551,6 +551,7 @@ struct btrfs_fs_info {
 	struct mutex transaction_kthread_mutex;
 	struct mutex cleaner_mutex;
 	struct mutex chunk_mutex;
+	struct mutex remap_mutex;
 
 	/*
 	 * This is taken to make sure we don't set block groups ro after the
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index e45f3598ef03..54c3e99c7dab 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -37,6 +37,7 @@
 #include "super.h"
 #include "tree-checker.h"
 #include "raid-stripe-tree.h"
+#include "free-space-tree.h"
 
 /*
  * Relocation overview
@@ -3905,6 +3906,150 @@ static const char *stage_to_string(enum reloc_stage stage)
 	return "unknown";
 }
 
+static void adjust_block_group_remap_bytes(struct btrfs_trans_handle *trans,
+					   struct btrfs_block_group *bg,
+					   s64 diff)
+{
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	bool bg_already_dirty = true;
+
+	bg->remap_bytes += diff;
+
+	if (bg->used == 0 && bg->remap_bytes == 0)
+		btrfs_mark_bg_unused(bg);
+
+	spin_lock(&trans->transaction->dirty_bgs_lock);
+	if (list_empty(&bg->dirty_list)) {
+		list_add_tail(&bg->dirty_list, &trans->transaction->dirty_bgs);
+		bg_already_dirty = false;
+		btrfs_get_block_group(bg);
+	}
+	spin_unlock(&trans->transaction->dirty_bgs_lock);
+
+	/* Modified block groups are accounted for in the delayed_refs_rsv. */
+	if (!bg_already_dirty)
+		btrfs_inc_delayed_refs_rsv_bg_updates(fs_info);
+}
+
+static int remove_chunk_stripes(struct btrfs_trans_handle *trans,
+				struct btrfs_chunk_map *chunk,
+				struct btrfs_path *path)
+{
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	struct btrfs_key key;
+	struct extent_buffer *leaf;
+	struct btrfs_chunk *c;
+	int ret;
+
+	key.objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID;
+	key.type = BTRFS_CHUNK_ITEM_KEY;
+	key.offset = chunk->start;
+
+	ret = btrfs_search_slot(trans, fs_info->chunk_root, &key, path,
+				0, 1);
+	if (ret) {
+		if (ret == 1) {
+			btrfs_release_path(path);
+			ret = -ENOENT;
+		}
+		return ret;
+	}
+
+	leaf = path->nodes[0];
+
+	c = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_chunk);
+	btrfs_set_chunk_num_stripes(leaf, c, 0);
+
+	btrfs_truncate_item(trans, path, offsetof(struct btrfs_chunk, stripe),
+			    1);
+
+	btrfs_mark_buffer_dirty(trans, leaf);
+
+	btrfs_release_path(path);
+
+	chunk->num_stripes = 0;
+
+	return 0;
+}
+
+static int last_identity_remap_gone(struct btrfs_trans_handle *trans,
+				    struct btrfs_chunk_map *chunk,
+				    struct btrfs_block_group *bg,
+				    struct btrfs_path *path)
+{
+	int ret;
+
+	ret = btrfs_remove_dev_extents(trans, chunk);
+	if (ret)
+		return ret;
+
+	mutex_lock(&trans->fs_info->chunk_mutex);
+
+	for (unsigned int i = 0; i < chunk->num_stripes; i++) {
+		ret = btrfs_update_device(trans, chunk->stripes[i].dev);
+		if (ret) {
+			mutex_unlock(&trans->fs_info->chunk_mutex);
+			return ret;
+		}
+	}
+
+	mutex_unlock(&trans->fs_info->chunk_mutex);
+
+	write_lock(&trans->fs_info->mapping_tree_lock);
+	btrfs_chunk_map_device_clear_bits(chunk, CHUNK_ALLOCATED);
+	write_unlock(&trans->fs_info->mapping_tree_lock);
+
+	btrfs_remove_bg_from_sinfo(bg);
+
+	ret = remove_chunk_stripes(trans, chunk, path);
+	if (ret)
+		return ret;
+
+	return 0;
+}
+
+static int adjust_identity_remap_count(struct btrfs_trans_handle *trans,
+				       struct btrfs_path *path,
+				       struct btrfs_block_group *bg, int delta)
+{
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	struct btrfs_chunk_map *chunk;
+	bool bg_already_dirty = true;
+	int ret;
+
+	WARN_ON(delta < 0 && -delta > bg->identity_remap_count);
+
+	bg->identity_remap_count += delta;
+
+	spin_lock(&trans->transaction->dirty_bgs_lock);
+	if (list_empty(&bg->dirty_list)) {
+		list_add_tail(&bg->dirty_list, &trans->transaction->dirty_bgs);
+		bg_already_dirty = false;
+		btrfs_get_block_group(bg);
+	}
+	spin_unlock(&trans->transaction->dirty_bgs_lock);
+
+	/* Modified block groups are accounted for in the delayed_refs_rsv. */
+	if (!bg_already_dirty)
+		btrfs_inc_delayed_refs_rsv_bg_updates(fs_info);
+
+	if (bg->identity_remap_count != 0)
+		return 0;
+
+	chunk = btrfs_find_chunk_map(fs_info, bg->start, 1);
+	if (!chunk)
+		return -ENOENT;
+
+	ret = last_identity_remap_gone(trans, chunk, bg, path);
+	if (ret)
+		goto end;
+
+	ret = 0;
+end:
+	btrfs_free_chunk_map(chunk);
+	return ret;
+}
+
 int btrfs_translate_remap(struct btrfs_fs_info *fs_info, u64 *logical,
 			  u64 *length, bool nolock)
 {
@@ -4521,3 +4666,363 @@ u64 btrfs_get_reloc_bg_bytenr(const struct btrfs_fs_info *fs_info)
 		logical = fs_info->reloc_ctl->block_group->start;
 	return logical;
 }
+
+static int remove_range_from_remap_tree(struct btrfs_trans_handle *trans,
+					struct btrfs_path *path,
+					struct btrfs_block_group *bg,
+					u64 bytenr, u64 num_bytes)
+{
+	int ret;
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	struct extent_buffer *leaf = path->nodes[0];
+	struct btrfs_key key, new_key;
+	struct btrfs_remap *remap_ptr = NULL, remap;
+	struct btrfs_block_group *dest_bg = NULL;
+	u64 end, new_addr = 0, remap_start, remap_length, overlap_length;
+	bool is_identity_remap;
+
+	end = bytenr + num_bytes;
+
+	btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
+
+	is_identity_remap = key.type == BTRFS_IDENTITY_REMAP_KEY;
+
+	remap_start = key.objectid;
+	remap_length = key.offset;
+
+	if (!is_identity_remap) {
+		remap_ptr = btrfs_item_ptr(leaf, path->slots[0],
+					   struct btrfs_remap);
+		new_addr = btrfs_remap_address(leaf, remap_ptr);
+
+		dest_bg = btrfs_lookup_block_group(fs_info, new_addr);
+	}
+
+	if (bytenr == remap_start && num_bytes >= remap_length) {
+		/* Remove entirely. */
+
+		ret = btrfs_del_item(trans, fs_info->remap_root, path);
+		if (ret)
+			goto end;
+
+		btrfs_release_path(path);
+
+		overlap_length = remap_length;
+
+		if (!is_identity_remap) {
+			/* Remove backref. */
+
+			key.objectid = new_addr;
+			key.type = BTRFS_REMAP_BACKREF_KEY;
+			key.offset = remap_length;
+
+			ret = btrfs_search_slot(trans, fs_info->remap_root,
+						&key, path, -1, 1);
+			if (ret) {
+				if (ret == 1) {
+					btrfs_release_path(path);
+					ret = -ENOENT;
+				}
+				goto end;
+			}
+
+			ret = btrfs_del_item(trans, fs_info->remap_root, path);
+
+			btrfs_release_path(path);
+
+			if (ret)
+				goto end;
+
+			adjust_block_group_remap_bytes(trans, dest_bg,
+						       -remap_length);
+		} else {
+			ret = adjust_identity_remap_count(trans, path, bg, -1);
+			if (ret)
+				goto end;
+		}
+	} else if (bytenr == remap_start) {
+		/* Remove beginning. */
+
+		new_key.objectid = end;
+		new_key.type = key.type;
+		new_key.offset = remap_length + remap_start - end;
+
+		btrfs_set_item_key_safe(trans, path, &new_key);
+		btrfs_mark_buffer_dirty(trans, leaf);
+
+		overlap_length = num_bytes;
+
+		if (!is_identity_remap) {
+			btrfs_set_remap_address(leaf, remap_ptr,
+						new_addr + end - remap_start);
+			btrfs_release_path(path);
+
+			/* Adjust backref. */
+
+			key.objectid = new_addr;
+			key.type = BTRFS_REMAP_BACKREF_KEY;
+			key.offset = remap_length;
+
+			ret = btrfs_search_slot(trans, fs_info->remap_root,
+						&key, path, -1, 1);
+			if (ret) {
+				if (ret == 1) {
+					btrfs_release_path(path);
+					ret = -ENOENT;
+				}
+				goto end;
+			}
+
+			leaf = path->nodes[0];
+
+			new_key.objectid = new_addr + end - remap_start;
+			new_key.type = BTRFS_REMAP_BACKREF_KEY;
+			new_key.offset = remap_length + remap_start - end;
+
+			btrfs_set_item_key_safe(trans, path, &new_key);
+
+			remap_ptr = btrfs_item_ptr(leaf, path->slots[0],
+						   struct btrfs_remap);
+			btrfs_set_remap_address(leaf, remap_ptr, end);
+
+			btrfs_mark_buffer_dirty(trans, path->nodes[0]);
+
+			btrfs_release_path(path);
+
+			adjust_block_group_remap_bytes(trans, dest_bg,
+						       -num_bytes);
+		}
+	} else if (bytenr + num_bytes < remap_start + remap_length) {
+		/* Remove middle. */
+
+		new_key.objectid = remap_start;
+		new_key.type = key.type;
+		new_key.offset = bytenr - remap_start;
+
+		btrfs_set_item_key_safe(trans, path, &new_key);
+		btrfs_mark_buffer_dirty(trans, leaf);
+
+		new_key.objectid = end;
+		new_key.offset = remap_start + remap_length - end;
+
+		btrfs_release_path(path);
+
+		overlap_length = num_bytes;
+
+		if (!is_identity_remap) {
+			/* Add second remap entry. */
+
+			ret = btrfs_insert_empty_item(trans, fs_info->remap_root,
+						path, &new_key,
+						sizeof(struct btrfs_remap));
+			if (ret)
+				goto end;
+
+			btrfs_set_stack_remap_address(&remap,
+						new_addr + end - remap_start);
+
+			write_extent_buffer(path->nodes[0], &remap,
+				btrfs_item_ptr_offset(path->nodes[0], path->slots[0]),
+				sizeof(struct btrfs_remap));
+
+			btrfs_release_path(path);
+
+			/* Shorten backref entry. */
+
+			key.objectid = new_addr;
+			key.type = BTRFS_REMAP_BACKREF_KEY;
+			key.offset = remap_length;
+
+			ret = btrfs_search_slot(trans, fs_info->remap_root,
+						&key, path, -1, 1);
+			if (ret) {
+				if (ret == 1) {
+					btrfs_release_path(path);
+					ret = -ENOENT;
+				}
+				goto end;
+			}
+
+			new_key.objectid = new_addr;
+			new_key.type = BTRFS_REMAP_BACKREF_KEY;
+			new_key.offset = bytenr - remap_start;
+
+			btrfs_set_item_key_safe(trans, path, &new_key);
+			btrfs_mark_buffer_dirty(trans, path->nodes[0]);
+
+			btrfs_release_path(path);
+
+			/* Add second backref entry. */
+
+			new_key.objectid = new_addr + end - remap_start;
+			new_key.type = BTRFS_REMAP_BACKREF_KEY;
+			new_key.offset = remap_start + remap_length - end;
+
+			ret = btrfs_insert_empty_item(trans, fs_info->remap_root,
+						path, &new_key,
+						sizeof(struct btrfs_remap));
+			if (ret)
+				goto end;
+
+			btrfs_set_stack_remap_address(&remap, end);
+
+			write_extent_buffer(path->nodes[0], &remap,
+				btrfs_item_ptr_offset(path->nodes[0], path->slots[0]),
+				sizeof(struct btrfs_remap));
+
+			btrfs_release_path(path);
+
+			adjust_block_group_remap_bytes(trans, dest_bg,
+						       -num_bytes);
+		} else {
+			/* Add second identity remap entry. */
+
+			ret = btrfs_insert_empty_item(trans, fs_info->remap_root,
+						      path, &new_key, 0);
+			if (ret)
+				goto end;
+
+			btrfs_release_path(path);
+
+			ret = adjust_identity_remap_count(trans, path, bg, 1);
+			if (ret)
+				goto end;
+		}
+	} else {
+		/* Remove end. */
+
+		new_key.objectid = remap_start;
+		new_key.type = key.type;
+		new_key.offset = bytenr - remap_start;
+
+		btrfs_set_item_key_safe(trans, path, &new_key);
+		btrfs_mark_buffer_dirty(trans, leaf);
+
+		btrfs_release_path(path);
+
+		overlap_length = remap_start + remap_length - bytenr;
+
+		if (!is_identity_remap) {
+			/* Shorten backref entry. */
+
+			key.objectid = new_addr;
+			key.type = BTRFS_REMAP_BACKREF_KEY;
+			key.offset = remap_length;
+
+			ret = btrfs_search_slot(trans, fs_info->remap_root,
+						&key, path, -1, 1);
+			if (ret) {
+				if (ret == 1) {
+					btrfs_release_path(path);
+					ret = -ENOENT;
+				}
+				goto end;
+			}
+
+			new_key.objectid = new_addr;
+			new_key.type = BTRFS_REMAP_BACKREF_KEY;
+			new_key.offset = bytenr - remap_start;
+
+			btrfs_set_item_key_safe(trans, path, &new_key);
+			btrfs_mark_buffer_dirty(trans, path->nodes[0]);
+
+			btrfs_release_path(path);
+
+			adjust_block_group_remap_bytes(trans, dest_bg,
+					bytenr - remap_start - remap_length);
+		}
+	}
+
+	if (!is_identity_remap) {
+		ret = add_to_free_space_tree(trans,
+					     bytenr - remap_start + new_addr,
+					     overlap_length);
+		if (ret)
+			goto end;
+	}
+
+	ret = overlap_length;
+
+end:
+	if (dest_bg)
+		btrfs_put_block_group(dest_bg);
+
+	return ret;
+}
+
+int btrfs_remove_extent_from_remap_tree(struct btrfs_trans_handle *trans,
+					struct btrfs_path *path,
+					u64 bytenr, u64 num_bytes)
+{
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	struct btrfs_key key, found_key;
+	struct extent_buffer *leaf;
+	struct btrfs_block_group *bg;
+	int ret;
+
+	if (!(btrfs_super_incompat_flags(fs_info->super_copy) &
+	      BTRFS_FEATURE_INCOMPAT_REMAP_TREE))
+		return 0;
+
+	bg = btrfs_lookup_block_group(fs_info, bytenr);
+	if (!bg)
+		return 0;
+
+	if (!(bg->flags & BTRFS_BLOCK_GROUP_REMAPPED)) {
+		btrfs_put_block_group(bg);
+		return 0;
+	}
+
+	mutex_lock(&fs_info->remap_mutex);
+
+	do {
+		key.objectid = bytenr;
+		key.type = (u8)-1;
+		key.offset = (u64)-1;
+
+		ret = btrfs_search_slot(trans, fs_info->remap_root, &key, path,
+					-1, 1);
+		if (ret < 0)
+			goto end;
+
+		leaf = path->nodes[0];
+
+		if (path->slots[0] == 0) {
+			ret = -ENOENT;
+			goto end;
+		}
+
+		path->slots[0]--;
+
+		btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
+
+		if (found_key.type != BTRFS_IDENTITY_REMAP_KEY &&
+		    found_key.type != BTRFS_REMAP_KEY) {
+			ret = -ENOENT;
+			goto end;
+		}
+
+		if (bytenr < found_key.objectid ||
+		    bytenr >= found_key.objectid + found_key.offset) {
+			ret = -ENOENT;
+			goto end;
+		}
+
+		ret = remove_range_from_remap_tree(trans, path, bg, bytenr,
+						   num_bytes);
+		if (ret < 0)
+			goto end;
+
+		bytenr += ret;
+		num_bytes -= ret;
+	} while (num_bytes > 0);
+
+	ret = 0;
+
+end:
+	mutex_unlock(&fs_info->remap_mutex);
+
+	btrfs_put_block_group(bg);
+	btrfs_release_path(path);
+	return ret;
+}
diff --git a/fs/btrfs/relocation.h b/fs/btrfs/relocation.h
index 8c9dfc55b799..0021f812b12c 100644
--- a/fs/btrfs/relocation.h
+++ b/fs/btrfs/relocation.h
@@ -32,5 +32,8 @@ bool btrfs_should_ignore_reloc_root(const struct btrfs_root *root);
 u64 btrfs_get_reloc_bg_bytenr(const struct btrfs_fs_info *fs_info);
 int btrfs_translate_remap(struct btrfs_fs_info *fs_info, u64 *logical,
 			  u64 *length, bool nolock);
+int btrfs_remove_extent_from_remap_tree(struct btrfs_trans_handle *trans,
+					struct btrfs_path *path,
+					u64 bytenr, u64 num_bytes);
 
 #endif
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 62bd6259ebd3..6c0a67da92f1 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2931,8 +2931,8 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 	return ret;
 }
 
-static noinline int btrfs_update_device(struct btrfs_trans_handle *trans,
-					struct btrfs_device *device)
+int btrfs_update_device(struct btrfs_trans_handle *trans,
+			struct btrfs_device *device)
 {
 	int ret;
 	struct btrfs_path *path;
@@ -3236,25 +3236,13 @@ static int remove_chunk_item(struct btrfs_trans_handle *trans,
 	return btrfs_free_chunk(trans, chunk_offset);
 }
 
-int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset)
+int btrfs_remove_dev_extents(struct btrfs_trans_handle *trans,
+			     struct btrfs_chunk_map *map)
 {
 	struct btrfs_fs_info *fs_info = trans->fs_info;
-	struct btrfs_chunk_map *map;
+	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
 	u64 dev_extent_len = 0;
 	int i, ret = 0;
-	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
-
-	map = btrfs_get_chunk_map(fs_info, chunk_offset, 1);
-	if (IS_ERR(map)) {
-		/*
-		 * This is a logic error, but we don't want to just rely on the
-		 * user having built with ASSERT enabled, so if ASSERT doesn't
-		 * do anything we still error out.
-		 */
-		DEBUG_WARN("errr %ld reading chunk map at offset %llu",
-			   PTR_ERR(map), chunk_offset);
-		return PTR_ERR(map);
-	}
 
 	/*
 	 * First delete the device extent items from the devices btree.
@@ -3275,7 +3263,7 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset)
 		if (ret) {
 			mutex_unlock(&fs_devices->device_list_mutex);
 			btrfs_abort_transaction(trans, ret);
-			goto out;
+			return ret;
 		}
 
 		if (device->bytes_used > 0) {
@@ -3295,6 +3283,30 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset)
 	}
 	mutex_unlock(&fs_devices->device_list_mutex);
 
+	return 0;
+}
+
+int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset)
+{
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	struct btrfs_chunk_map *map;
+	int ret;
+
+	map = btrfs_get_chunk_map(fs_info, chunk_offset, 1);
+	if (IS_ERR(map)) {
+		/*
+		 * This is a logic error, but we don't want to just rely on the
+		 * user having built with ASSERT enabled, so if ASSERT doesn't
+		 * do anything we still error out.
+		 */
+		ASSERT(0);
+		return PTR_ERR(map);
+	}
+
+	ret = btrfs_remove_dev_extents(trans, map);
+	if (ret)
+		goto out;
+
 	/*
 	 * We acquire fs_info->chunk_mutex for 2 reasons:
 	 *
@@ -5436,7 +5448,7 @@ static void chunk_map_device_set_bits(struct btrfs_chunk_map *map, unsigned int
 	}
 }
 
-static void chunk_map_device_clear_bits(struct btrfs_chunk_map *map, unsigned int bits)
+void btrfs_chunk_map_device_clear_bits(struct btrfs_chunk_map *map, unsigned int bits)
 {
 	for (int i = 0; i < map->num_stripes; i++) {
 		struct btrfs_io_stripe *stripe = &map->stripes[i];
@@ -5453,7 +5465,7 @@ void btrfs_remove_chunk_map(struct btrfs_fs_info *fs_info, struct btrfs_chunk_ma
 	write_lock(&fs_info->mapping_tree_lock);
 	rb_erase_cached(&map->rb_node, &fs_info->mapping_tree);
 	RB_CLEAR_NODE(&map->rb_node);
-	chunk_map_device_clear_bits(map, CHUNK_ALLOCATED);
+	btrfs_chunk_map_device_clear_bits(map, CHUNK_ALLOCATED);
 	write_unlock(&fs_info->mapping_tree_lock);
 
 	/* Once for the tree reference. */
@@ -5489,7 +5501,7 @@ int btrfs_add_chunk_map(struct btrfs_fs_info *fs_info, struct btrfs_chunk_map *m
 		return -EEXIST;
 	}
 	chunk_map_device_set_bits(map, CHUNK_ALLOCATED);
-	chunk_map_device_clear_bits(map, CHUNK_TRIMMED);
+	btrfs_chunk_map_device_clear_bits(map, CHUNK_TRIMMED);
 	write_unlock(&fs_info->mapping_tree_lock);
 
 	return 0;
@@ -5854,7 +5866,7 @@ void btrfs_mapping_tree_free(struct btrfs_fs_info *fs_info)
 		map = rb_entry(node, struct btrfs_chunk_map, rb_node);
 		rb_erase_cached(&map->rb_node, &fs_info->mapping_tree);
 		RB_CLEAR_NODE(&map->rb_node);
-		chunk_map_device_clear_bits(map, CHUNK_ALLOCATED);
+		btrfs_chunk_map_device_clear_bits(map, CHUNK_ALLOCATED);
 		/* Once for the tree ref. */
 		btrfs_free_chunk_map(map);
 		cond_resched_rwlock_write(&fs_info->mapping_tree_lock);
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 9fb8fe4312a5..0a73ea2a2a6a 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -779,6 +779,8 @@ u64 btrfs_calc_stripe_length(const struct btrfs_chunk_map *map);
 int btrfs_nr_parity_stripes(u64 type);
 int btrfs_chunk_alloc_add_chunk_item(struct btrfs_trans_handle *trans,
 				     struct btrfs_block_group *bg);
+int btrfs_remove_dev_extents(struct btrfs_trans_handle *trans,
+			     struct btrfs_chunk_map *map);
 int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset);
 
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
@@ -876,6 +878,10 @@ bool btrfs_repair_one_zone(struct btrfs_fs_info *fs_info, u64 logical);
 
 bool btrfs_pinned_by_swapfile(struct btrfs_fs_info *fs_info, void *ptr);
 const u8 *btrfs_sb_fsid_ptr(const struct btrfs_super_block *sb);
+int btrfs_update_device(struct btrfs_trans_handle *trans,
+			struct btrfs_device *device);
+void btrfs_chunk_map_device_clear_bits(struct btrfs_chunk_map *map,
+				       unsigned int bits);
 
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
 struct btrfs_io_context *alloc_btrfs_io_context(struct btrfs_fs_info *fs_info,
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 10/12] btrfs: handle setting up relocation of block group with remap-tree
  2025-06-05 16:23 [PATCH 00/12] btrfs: remap tree Mark Harmstone
                   ` (8 preceding siblings ...)
  2025-06-05 16:23 ` [PATCH 09/12] btrfs: handle deletions from remapped block group Mark Harmstone
@ 2025-06-05 16:23 ` Mark Harmstone
  2025-06-13 23:25   ` Boris Burkov
  2025-06-05 16:23 ` [PATCH 11/12] btrfs: move existing remaps before relocating block group Mark Harmstone
                   ` (7 subsequent siblings)
  17 siblings, 1 reply; 39+ messages in thread
From: Mark Harmstone @ 2025-06-05 16:23 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Mark Harmstone

Handle the preliminary work for relocating a block group in a filesystem
with the remap-tree flag set.

If the block group is SYSTEM or REMAP, btrfs_relocate_block_group()
proceeds as it does currently, as bootstrapping constraints mean that
these block groups have to be processed the existing way.

Otherwise we walk the free-space tree for the block group in question,
recording any holes. These get converted into identity remaps and placed
in the remap tree, and the block group's REMAPPED flag is set. From now
on no new allocations are possible within this block group, and any I/O
to it will be funnelled through btrfs_translate_remap(). We store the
number of identity remaps in `identity_remap_count`, so that we know
when we've removed the last one and the block group is fully remapped.

The change in btrfs_read_roots() is because data relocations no longer
rely on the data reloc tree as a hidden subvolume in which to do
snapshots.

Signed-off-by: Mark Harmstone <maharmstone@fb.com>
---
 fs/btrfs/disk-io.c         |  30 +--
 fs/btrfs/free-space-tree.c |   4 +-
 fs/btrfs/free-space-tree.h |   5 +-
 fs/btrfs/relocation.c      | 452 ++++++++++++++++++++++++++++++++++++-
 fs/btrfs/relocation.h      |   3 +-
 fs/btrfs/space-info.c      |   9 +-
 fs/btrfs/volumes.c         |  15 +-
 7 files changed, 483 insertions(+), 35 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index dac22efd2332..f2a9192293b1 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2268,22 +2268,22 @@ static int btrfs_read_roots(struct btrfs_fs_info *fs_info)
 		root->root_key.objectid = BTRFS_REMAP_TREE_OBJECTID;
 		root->root_key.type = BTRFS_ROOT_ITEM_KEY;
 		root->root_key.offset = 0;
-	}
-
-	/*
-	 * This tree can share blocks with some other fs tree during relocation
-	 * and we need a proper setup by btrfs_get_fs_root
-	 */
-	root = btrfs_get_fs_root(tree_root->fs_info,
-				 BTRFS_DATA_RELOC_TREE_OBJECTID, true);
-	if (IS_ERR(root)) {
-		if (!btrfs_test_opt(fs_info, IGNOREBADROOTS)) {
-			ret = PTR_ERR(root);
-			goto out;
-		}
 	} else {
-		set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
-		fs_info->data_reloc_root = root;
+		/*
+		 * This tree can share blocks with some other fs tree during
+		 * relocation and we need a proper setup by btrfs_get_fs_root
+		 */
+		root = btrfs_get_fs_root(tree_root->fs_info,
+					 BTRFS_DATA_RELOC_TREE_OBJECTID, true);
+		if (IS_ERR(root)) {
+			if (!btrfs_test_opt(fs_info, IGNOREBADROOTS)) {
+				ret = PTR_ERR(root);
+				goto out;
+			}
+		} else {
+			set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
+			fs_info->data_reloc_root = root;
+		}
 	}
 
 	location.objectid = BTRFS_QUOTA_TREE_OBJECTID;
diff --git a/fs/btrfs/free-space-tree.c b/fs/btrfs/free-space-tree.c
index af51cf784a5b..eb579c17a79f 100644
--- a/fs/btrfs/free-space-tree.c
+++ b/fs/btrfs/free-space-tree.c
@@ -21,8 +21,7 @@ static int __add_block_group_free_space(struct btrfs_trans_handle *trans,
 					struct btrfs_block_group *block_group,
 					struct btrfs_path *path);
 
-static struct btrfs_root *btrfs_free_space_root(
-				struct btrfs_block_group *block_group)
+struct btrfs_root *btrfs_free_space_root(struct btrfs_block_group *block_group)
 {
 	struct btrfs_key key = {
 		.objectid = BTRFS_FREE_SPACE_TREE_OBJECTID,
@@ -96,7 +95,6 @@ static int add_new_free_space_info(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
-EXPORT_FOR_TESTS
 struct btrfs_free_space_info *search_free_space_info(
 		struct btrfs_trans_handle *trans,
 		struct btrfs_block_group *block_group,
diff --git a/fs/btrfs/free-space-tree.h b/fs/btrfs/free-space-tree.h
index e6c6d6f4f221..1b804544730a 100644
--- a/fs/btrfs/free-space-tree.h
+++ b/fs/btrfs/free-space-tree.h
@@ -35,12 +35,13 @@ int add_to_free_space_tree(struct btrfs_trans_handle *trans,
 			   u64 start, u64 size);
 int remove_from_free_space_tree(struct btrfs_trans_handle *trans,
 				u64 start, u64 size);
-
-#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
 struct btrfs_free_space_info *
 search_free_space_info(struct btrfs_trans_handle *trans,
 		       struct btrfs_block_group *block_group,
 		       struct btrfs_path *path, int cow);
+struct btrfs_root *btrfs_free_space_root(struct btrfs_block_group *block_group);
+
+#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
 int __add_to_free_space_tree(struct btrfs_trans_handle *trans,
 			     struct btrfs_block_group *block_group,
 			     struct btrfs_path *path, u64 start, u64 size);
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 54c3e99c7dab..acf2fefedc96 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -3659,7 +3659,7 @@ static noinline_for_stack int relocate_block_group(struct reloc_control *rc)
 		btrfs_btree_balance_dirty(fs_info);
 	}
 
-	if (!err) {
+	if (!err && !btrfs_fs_incompat(fs_info, REMAP_TREE)) {
 		ret = relocate_file_extent_cluster(rc);
 		if (ret < 0)
 			err = ret;
@@ -3906,6 +3906,90 @@ static const char *stage_to_string(enum reloc_stage stage)
 	return "unknown";
 }
 
+static int add_remap_tree_entries(struct btrfs_trans_handle *trans,
+				  struct btrfs_path *path,
+				  struct btrfs_key *entries,
+				  unsigned int num_entries)
+{
+	int ret;
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	struct btrfs_item_batch batch;
+	u32 *data_sizes;
+	u32 max_items;
+
+	max_items = BTRFS_LEAF_DATA_SIZE(trans->fs_info) / sizeof(struct btrfs_item);
+
+	data_sizes = kzalloc(sizeof(u32) * min_t(u32, num_entries, max_items),
+			     GFP_NOFS);
+	if (!data_sizes)
+		return -ENOMEM;
+
+	while (true) {
+		batch.keys = entries;
+		batch.data_sizes = data_sizes;
+		batch.total_data_size = 0;
+		batch.nr = min_t(u32, num_entries, max_items);
+
+		ret = btrfs_insert_empty_items(trans, fs_info->remap_root, path,
+					       &batch);
+		btrfs_release_path(path);
+
+		if (num_entries <= max_items)
+			break;
+
+		num_entries -= max_items;
+		entries += max_items;
+	}
+
+	kfree(data_sizes);
+
+	return ret;
+}
+
+struct space_run {
+	u64 start;
+	u64 end;
+};
+
+static void parse_bitmap(u64 block_size, const unsigned long *bitmap,
+			 unsigned long size, u64 address,
+			 struct space_run *space_runs,
+			 unsigned int *num_space_runs)
+{
+	unsigned long pos, end;
+	u64 run_start, run_length;
+
+	pos = find_first_bit(bitmap, size);
+
+	if (pos == size)
+		return;
+
+	while (true) {
+		end = find_next_zero_bit(bitmap, size, pos);
+
+		run_start = address + (pos * block_size);
+		run_length = (end - pos) * block_size;
+
+		if (*num_space_runs != 0 &&
+		    space_runs[*num_space_runs - 1].end == run_start) {
+			space_runs[*num_space_runs - 1].end += run_length;
+		} else {
+			space_runs[*num_space_runs].start = run_start;
+			space_runs[*num_space_runs].end = run_start + run_length;
+
+			(*num_space_runs)++;
+		}
+
+		if (end == size)
+			break;
+
+		pos = find_next_bit(bitmap, size, end + 1);
+
+		if (pos == size)
+			break;
+	}
+}
+
 static void adjust_block_group_remap_bytes(struct btrfs_trans_handle *trans,
 					   struct btrfs_block_group *bg,
 					   s64 diff)
@@ -3931,6 +4015,227 @@ static void adjust_block_group_remap_bytes(struct btrfs_trans_handle *trans,
 		btrfs_inc_delayed_refs_rsv_bg_updates(fs_info);
 }
 
+static int create_remap_tree_entries(struct btrfs_trans_handle *trans,
+				     struct btrfs_path *path,
+				     struct btrfs_block_group *bg)
+{
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	struct btrfs_free_space_info *fsi;
+	struct btrfs_key key, found_key;
+	struct extent_buffer *leaf;
+	struct btrfs_root *space_root;
+	u32 extent_count;
+	struct space_run *space_runs = NULL;
+	unsigned int num_space_runs = 0;
+	struct btrfs_key *entries = NULL;
+	unsigned int max_entries, num_entries;
+	int ret;
+
+	mutex_lock(&bg->free_space_lock);
+
+	if (test_bit(BLOCK_GROUP_FLAG_NEEDS_FREE_SPACE, &bg->runtime_flags)) {
+		mutex_unlock(&bg->free_space_lock);
+
+		ret = add_block_group_free_space(trans, bg);
+		if (ret)
+			return ret;
+
+		mutex_lock(&bg->free_space_lock);
+	}
+
+	fsi = search_free_space_info(trans, bg, path, 0);
+	if (IS_ERR(fsi)) {
+		mutex_unlock(&bg->free_space_lock);
+		return PTR_ERR(fsi);
+	}
+
+	extent_count = btrfs_free_space_extent_count(path->nodes[0], fsi);
+
+	btrfs_release_path(path);
+
+	space_runs = kmalloc(sizeof(*space_runs) * extent_count, GFP_NOFS);
+	if (!space_runs) {
+		mutex_unlock(&bg->free_space_lock);
+		return -ENOMEM;
+	}
+
+	key.objectid = bg->start;
+	key.type = 0;
+	key.offset = 0;
+
+	space_root = btrfs_free_space_root(bg);
+
+	ret = btrfs_search_slot(trans, space_root, &key, path, 0, 0);
+	if (ret < 0) {
+		mutex_unlock(&bg->free_space_lock);
+		goto out;
+	}
+
+	ret = 0;
+
+	while (true) {
+		leaf = path->nodes[0];
+
+		btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
+
+		if (found_key.objectid >= bg->start + bg->length)
+			break;
+
+		if (found_key.type == BTRFS_FREE_SPACE_EXTENT_KEY) {
+			if (num_space_runs != 0 &&
+			    space_runs[num_space_runs - 1].end == found_key.objectid) {
+				space_runs[num_space_runs - 1].end =
+					found_key.objectid + found_key.offset;
+			} else {
+				space_runs[num_space_runs].start = found_key.objectid;
+				space_runs[num_space_runs].end =
+					found_key.objectid + found_key.offset;
+
+				num_space_runs++;
+
+				BUG_ON(num_space_runs > extent_count);
+			}
+		} else if (found_key.type == BTRFS_FREE_SPACE_BITMAP_KEY) {
+			void *bitmap;
+			unsigned long offset;
+			u32 data_size;
+
+			offset = btrfs_item_ptr_offset(leaf, path->slots[0]);
+			data_size = btrfs_item_size(leaf, path->slots[0]);
+
+			if (data_size != 0) {
+				bitmap = kmalloc(data_size, GFP_NOFS);
+				if (!bitmap) {
+					mutex_unlock(&bg->free_space_lock);
+					ret = -ENOMEM;
+					goto out;
+				}
+
+				read_extent_buffer(leaf, bitmap, offset,
+						   data_size);
+
+				parse_bitmap(fs_info->sectorsize, bitmap,
+					     data_size * BITS_PER_BYTE,
+					     found_key.objectid, space_runs,
+					     &num_space_runs);
+
+				BUG_ON(num_space_runs > extent_count);
+
+				kfree(bitmap);
+			}
+		}
+
+		path->slots[0]++;
+
+		if (path->slots[0] >= btrfs_header_nritems(leaf)) {
+			ret = btrfs_next_leaf(space_root, path);
+			if (ret != 0) {
+				if (ret == 1)
+					ret = 0;
+				break;
+			}
+			leaf = path->nodes[0];
+		}
+	}
+
+	btrfs_release_path(path);
+
+	mutex_unlock(&bg->free_space_lock);
+
+	max_entries = extent_count + 2;
+	entries = kmalloc(sizeof(*entries) * max_entries, GFP_NOFS);
+	if (!entries) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	num_entries = 0;
+
+	if (num_space_runs > 0 && space_runs[0].start > bg->start) {
+		entries[num_entries].objectid = bg->start;
+		entries[num_entries].type = BTRFS_IDENTITY_REMAP_KEY;
+		entries[num_entries].offset = space_runs[0].start - bg->start;
+		num_entries++;
+	}
+
+	for (unsigned int i = 1; i < num_space_runs; i++) {
+		entries[num_entries].objectid = space_runs[i - 1].end;
+		entries[num_entries].type = BTRFS_IDENTITY_REMAP_KEY;
+		entries[num_entries].offset =
+			space_runs[i].start - space_runs[i - 1].end;
+		num_entries++;
+	}
+
+	if (num_space_runs == 0) {
+		entries[num_entries].objectid = bg->start;
+		entries[num_entries].type = BTRFS_IDENTITY_REMAP_KEY;
+		entries[num_entries].offset = bg->length;
+		num_entries++;
+	} else if (space_runs[num_space_runs - 1].end < bg->start + bg->length) {
+		entries[num_entries].objectid = space_runs[num_space_runs - 1].end;
+		entries[num_entries].type = BTRFS_IDENTITY_REMAP_KEY;
+		entries[num_entries].offset =
+			bg->start + bg->length - space_runs[num_space_runs - 1].end;
+		num_entries++;
+	}
+
+	if (num_entries == 0)
+		goto out;
+
+	bg->identity_remap_count = num_entries;
+
+	ret = add_remap_tree_entries(trans, path, entries, num_entries);
+
+out:
+	kfree(entries);
+	kfree(space_runs);
+
+	return ret;
+}
+
+static int mark_bg_remapped(struct btrfs_trans_handle *trans,
+			    struct btrfs_path *path,
+			    struct btrfs_block_group *bg)
+{
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	unsigned long bi;
+	struct extent_buffer *leaf;
+	struct btrfs_block_group_item_v2 bgi;
+	struct btrfs_key key;
+	int ret;
+
+	bg->flags |= BTRFS_BLOCK_GROUP_REMAPPED;
+
+	key.objectid = bg->start;
+	key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
+	key.offset = bg->length;
+
+	ret = btrfs_search_slot(trans, fs_info->block_group_root, &key,
+				path, 0, 1);
+	if (ret) {
+		if (ret > 0)
+			ret = -ENOENT;
+		goto out;
+	}
+
+	leaf = path->nodes[0];
+	bi = btrfs_item_ptr_offset(leaf, path->slots[0]);
+	read_extent_buffer(leaf, &bgi, bi, sizeof(bgi));
+	btrfs_set_stack_block_group_v2_flags(&bgi, bg->flags);
+	btrfs_set_stack_block_group_v2_identity_remap_count(&bgi,
+						bg->identity_remap_count);
+	write_extent_buffer(leaf, &bgi, bi, sizeof(bgi));
+
+	btrfs_mark_buffer_dirty(trans, leaf);
+
+	bg->commit_identity_remap_count = bg->identity_remap_count;
+
+	ret = 0;
+out:
+	btrfs_release_path(path);
+	return ret;
+}
+
 static int remove_chunk_stripes(struct btrfs_trans_handle *trans,
 				struct btrfs_chunk_map *chunk,
 				struct btrfs_path *path)
@@ -4050,6 +4355,55 @@ static int adjust_identity_remap_count(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
+static int mark_chunk_remapped(struct btrfs_trans_handle *trans,
+			       struct btrfs_path *path, uint64_t start)
+{
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	struct btrfs_chunk_map *chunk;
+	struct btrfs_key key;
+	u64 type;
+	int ret;
+	struct extent_buffer *leaf;
+	struct btrfs_chunk *c;
+
+	read_lock(&fs_info->mapping_tree_lock);
+
+	chunk = btrfs_find_chunk_map_nolock(fs_info, start, 1);
+	if (!chunk) {
+		read_unlock(&fs_info->mapping_tree_lock);
+		return -ENOENT;
+	}
+
+	chunk->type |= BTRFS_BLOCK_GROUP_REMAPPED;
+	type = chunk->type;
+
+	read_unlock(&fs_info->mapping_tree_lock);
+
+	key.objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID;
+	key.type = BTRFS_CHUNK_ITEM_KEY;
+	key.offset = start;
+
+	ret = btrfs_search_slot(trans, fs_info->chunk_root, &key, path,
+				0, 1);
+	if (ret == 1) {
+		ret = -ENOENT;
+		goto end;
+	} else if (ret < 0)
+		goto end;
+
+	leaf = path->nodes[0];
+
+	c = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_chunk);
+	btrfs_set_chunk_type(leaf, c, type);
+	btrfs_mark_buffer_dirty(trans, leaf);
+
+	ret = 0;
+end:
+	btrfs_free_chunk_map(chunk);
+	btrfs_release_path(path);
+	return ret;
+}
+
 int btrfs_translate_remap(struct btrfs_fs_info *fs_info, u64 *logical,
 			  u64 *length, bool nolock)
 {
@@ -4109,16 +4463,78 @@ int btrfs_translate_remap(struct btrfs_fs_info *fs_info, u64 *logical,
 	return 0;
 }
 
+static int start_block_group_remapping(struct btrfs_fs_info *fs_info,
+				       struct btrfs_path *path,
+				       struct btrfs_block_group *bg)
+{
+	struct btrfs_trans_handle *trans;
+	int ret, ret2;
+
+	ret = btrfs_cache_block_group(bg, true);
+	if (ret)
+		return ret;
+
+	trans = btrfs_start_transaction(fs_info->remap_root, 0);
+	if (IS_ERR(trans))
+		return PTR_ERR(trans);
+
+	/* We need to run delayed refs, to make sure FST is up to date. */
+	ret = btrfs_run_delayed_refs(trans, U64_MAX);
+	if (ret) {
+		btrfs_end_transaction(trans);
+		return ret;
+	}
+
+	mutex_lock(&fs_info->remap_mutex);
+
+	if (bg->flags & BTRFS_BLOCK_GROUP_REMAPPED) {
+		ret = 0;
+		goto end;
+	}
+
+	ret = create_remap_tree_entries(trans, path, bg);
+	if (ret) {
+		btrfs_abort_transaction(trans, ret);
+		goto end;
+	}
+
+	ret = mark_bg_remapped(trans, path, bg);
+	if (ret) {
+		btrfs_abort_transaction(trans, ret);
+		goto end;
+	}
+
+	ret = mark_chunk_remapped(trans, path, bg->start);
+	if (ret) {
+		btrfs_abort_transaction(trans, ret);
+		goto end;
+	}
+
+	ret = remove_block_group_free_space(trans, bg);
+	if (ret)
+		btrfs_abort_transaction(trans, ret);
+
+end:
+	mutex_unlock(&fs_info->remap_mutex);
+
+	ret2 = btrfs_end_transaction(trans);
+	if (!ret)
+		ret = ret2;
+
+	return ret;
+}
+
 /*
  * function to relocate all extents in a block group.
  */
-int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start)
+int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start,
+			       bool *using_remap_tree)
 {
 	struct btrfs_block_group *bg;
 	struct btrfs_root *extent_root = btrfs_extent_root(fs_info, group_start);
 	struct reloc_control *rc;
 	struct inode *inode;
-	struct btrfs_path *path;
+	struct btrfs_path *path = NULL;
 	int ret;
 	int rw = 0;
 	int err = 0;
@@ -4185,7 +4601,7 @@ int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start)
 	}
 
 	inode = lookup_free_space_inode(rc->block_group, path);
-	btrfs_free_path(path);
+	btrfs_release_path(path);
 
 	if (!IS_ERR(inode))
 		ret = delete_block_group_cache(rc->block_group, inode, 0);
@@ -4197,11 +4613,17 @@ int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start)
 		goto out;
 	}
 
-	rc->data_inode = create_reloc_inode(rc->block_group);
-	if (IS_ERR(rc->data_inode)) {
-		err = PTR_ERR(rc->data_inode);
-		rc->data_inode = NULL;
-		goto out;
+	*using_remap_tree = btrfs_fs_incompat(fs_info, REMAP_TREE) &&
+		!(bg->flags & BTRFS_BLOCK_GROUP_SYSTEM) &&
+		!(bg->flags & BTRFS_BLOCK_GROUP_REMAP);
+
+	if (!btrfs_fs_incompat(fs_info, REMAP_TREE)) {
+		rc->data_inode = create_reloc_inode(rc->block_group);
+		if (IS_ERR(rc->data_inode)) {
+			err = PTR_ERR(rc->data_inode);
+			rc->data_inode = NULL;
+			goto out;
+		}
 	}
 
 	describe_relocation(rc->block_group);
@@ -4213,6 +4635,12 @@ int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start)
 	ret = btrfs_zone_finish(rc->block_group);
 	WARN_ON(ret && ret != -EAGAIN);
 
+	if (*using_remap_tree) {
+		err = start_block_group_remapping(fs_info, path, bg);
+
+		goto out;
+	}
+
 	while (1) {
 		enum reloc_stage finishes_stage;
 
@@ -4258,7 +4686,9 @@ int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start)
 out:
 	if (err && rw)
 		btrfs_dec_block_group_ro(rc->block_group);
-	iput(rc->data_inode);
+	if (!btrfs_fs_incompat(fs_info, REMAP_TREE))
+		iput(rc->data_inode);
+	btrfs_free_path(path);
 out_put_bg:
 	btrfs_put_block_group(bg);
 	reloc_chunk_end(fs_info);
@@ -4452,7 +4882,7 @@ int btrfs_recover_relocation(struct btrfs_fs_info *fs_info)
 
 	btrfs_free_path(path);
 
-	if (ret == 0) {
+	if (ret == 0 && !btrfs_fs_incompat(fs_info, REMAP_TREE)) {
 		/* cleanup orphan inode in data relocation tree */
 		fs_root = btrfs_grab_root(fs_info->data_reloc_root);
 		ASSERT(fs_root);
diff --git a/fs/btrfs/relocation.h b/fs/btrfs/relocation.h
index 0021f812b12c..49bd48296ddb 100644
--- a/fs/btrfs/relocation.h
+++ b/fs/btrfs/relocation.h
@@ -12,7 +12,8 @@ struct btrfs_trans_handle;
 struct btrfs_ordered_extent;
 struct btrfs_pending_snapshot;
 
-int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start);
+int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start,
+			       bool *using_remap_tree);
 int btrfs_init_reloc_root(struct btrfs_trans_handle *trans, struct btrfs_root *root);
 int btrfs_update_reloc_root(struct btrfs_trans_handle *trans,
 			    struct btrfs_root *root);
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 6471861c4b25..ab4a70c420de 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -375,8 +375,13 @@ void btrfs_add_bg_to_space_info(struct btrfs_fs_info *info,
 	factor = btrfs_bg_type_to_factor(block_group->flags);
 
 	spin_lock(&space_info->lock);
-	space_info->total_bytes += block_group->length;
-	space_info->disk_total += block_group->length * factor;
+
+	if (!(block_group->flags & BTRFS_BLOCK_GROUP_REMAPPED) ||
+	    block_group->identity_remap_count != 0) {
+		space_info->total_bytes += block_group->length;
+		space_info->disk_total += block_group->length * factor;
+	}
+
 	space_info->bytes_used += block_group->used;
 	space_info->disk_used += block_group->used * factor;
 	space_info->bytes_readonly += block_group->bytes_super;
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 6c0a67da92f1..771415139dc0 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3425,6 +3425,7 @@ int btrfs_relocate_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
 	struct btrfs_block_group *block_group;
 	u64 length;
 	int ret;
+	bool using_remap_tree;
 
 	if (btrfs_fs_incompat(fs_info, EXTENT_TREE_V2)) {
 		btrfs_err(fs_info,
@@ -3448,7 +3449,8 @@ int btrfs_relocate_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
 
 	/* step one, relocate all the extents inside this chunk */
 	btrfs_scrub_pause(fs_info);
-	ret = btrfs_relocate_block_group(fs_info, chunk_offset);
+	ret = btrfs_relocate_block_group(fs_info, chunk_offset,
+					 &using_remap_tree);
 	btrfs_scrub_continue(fs_info);
 	if (ret) {
 		/*
@@ -3467,6 +3469,9 @@ int btrfs_relocate_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
 	length = block_group->length;
 	btrfs_put_block_group(block_group);
 
+	if (using_remap_tree)
+		return 0;
+
 	/*
 	 * On a zoned file system, discard the whole block group, this will
 	 * trigger a REQ_OP_ZONE_RESET operation on the device zone. If
@@ -4165,6 +4170,14 @@ static int __btrfs_balance(struct btrfs_fs_info *fs_info)
 		chunk = btrfs_item_ptr(leaf, slot, struct btrfs_chunk);
 		chunk_type = btrfs_chunk_type(leaf, chunk);
 
+		/* Check if chunk has already been fully relocated. */
+		if (chunk_type & BTRFS_BLOCK_GROUP_REMAPPED &&
+		    btrfs_chunk_num_stripes(leaf, chunk) == 0) {
+			btrfs_release_path(path);
+			mutex_unlock(&fs_info->reclaim_bgs_lock);
+			goto loop;
+		}
+
 		if (!counting) {
 			spin_lock(&fs_info->balance_lock);
 			bctl->stat.considered++;
-- 
2.49.0



* [PATCH 11/12] btrfs: move existing remaps before relocating block group
  2025-06-05 16:23 [PATCH 00/12] btrfs: remap tree Mark Harmstone
                   ` (9 preceding siblings ...)
  2025-06-05 16:23 ` [PATCH 10/12] btrfs: handle setting up relocation of block group with remap-tree Mark Harmstone
@ 2025-06-05 16:23 ` Mark Harmstone
  2025-06-06 11:20   ` kernel test robot
  2025-06-05 16:23 ` [PATCH 12/12] btrfs: replace identity maps with actual remaps when doing relocations Mark Harmstone
                   ` (6 subsequent siblings)
  17 siblings, 1 reply; 39+ messages in thread
From: Mark Harmstone @ 2025-06-05 16:23 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Mark Harmstone

If, when relocating a block group, we find that `remap_bytes` > 0 in
its block group item, it has been the destination block group for
another that has been remapped.

We need to search the remap tree for any remap backrefs within this
range, and move the data to a third block group. Otherwise
btrfs_translate_remap() could end up following an unbounded chain of
remaps, which would only get worse over time.

We only relocate one block group at a time, so `remap_bytes` will only
ever go down while we are doing this. Once we're finished we set the
REMAPPED flag on the block group, which will permanently prevent any
other data from being moved to within it.

Signed-off-by: Mark Harmstone <maharmstone@fb.com>
---
 fs/btrfs/extent-tree.c |   6 +-
 fs/btrfs/relocation.c  | 482 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 486 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 995784cdca9d..e469261f6c8d 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4490,7 +4490,8 @@ static noinline int find_free_extent(struct btrfs_root *root,
 		    block_group->cached != BTRFS_CACHE_NO) {
 			down_read(&space_info->groups_sem);
 			if (list_empty(&block_group->list) ||
-			    block_group->ro) {
+			    block_group->ro ||
+			    block_group->flags & BTRFS_BLOCK_GROUP_REMAPPED) {
 				/*
 				 * someone is removing this block group,
 				 * we can't jump into the have_block_group
@@ -4524,7 +4525,8 @@ static noinline int find_free_extent(struct btrfs_root *root,
 
 		ffe_ctl->hinted = false;
 		/* If the block group is read-only, we can skip it entirely. */
-		if (unlikely(block_group->ro)) {
+		if (unlikely(block_group->ro) ||
+		    block_group->flags & BTRFS_BLOCK_GROUP_REMAPPED) {
 			if (ffe_ctl->for_treelog)
 				btrfs_clear_treelog_bg(block_group);
 			if (ffe_ctl->for_data_reloc)
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index acf2fefedc96..d7aad3f92224 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -4015,6 +4015,480 @@ static void adjust_block_group_remap_bytes(struct btrfs_trans_handle *trans,
 		btrfs_inc_delayed_refs_rsv_bg_updates(fs_info);
 }
 
+struct reloc_io_private {
+	struct completion done;
+	refcount_t pending_refs;
+	blk_status_t status;
+};
+
+static void reloc_endio(struct btrfs_bio *bbio)
+{
+	struct reloc_io_private *priv = bbio->private;
+
+	if (bbio->bio.bi_status)
+		WRITE_ONCE(priv->status, bbio->bio.bi_status);
+
+	if (refcount_dec_and_test(&priv->pending_refs))
+		complete(&priv->done);
+
+	bio_put(&bbio->bio);
+}
+
+static int copy_remapped_data_io(struct btrfs_fs_info *fs_info,
+				 struct reloc_io_private *priv,
+				 struct page **pages, u64 addr, u64 length,
+				 bool do_write)
+{
+	struct btrfs_bio *bbio;
+	unsigned long i = 0;
+	int op = do_write ? REQ_OP_WRITE : REQ_OP_READ;
+
+	init_completion(&priv->done);
+	refcount_set(&priv->pending_refs, 1);
+	priv->status = 0;
+
+	bbio = btrfs_bio_alloc(BIO_MAX_VECS, op, fs_info, reloc_endio,
+			       priv);
+	bbio->bio.bi_iter.bi_sector = addr >> SECTOR_SHIFT;
+
+	do {
+		size_t bytes = min_t(u64, length, PAGE_SIZE);
+
+		if (bio_add_page(&bbio->bio, pages[i], bytes, 0) < bytes) {
+			refcount_inc(&priv->pending_refs);
+			btrfs_submit_bbio(bbio, 0);
+
+			bbio = btrfs_bio_alloc(BIO_MAX_VECS, op, fs_info,
+					       reloc_endio, priv);
+			bbio->bio.bi_iter.bi_sector = addr >> SECTOR_SHIFT;
+			continue;
+		}
+
+		i++;
+		addr += bytes;
+		length -= bytes;
+	} while (length);
+
+	refcount_inc(&priv->pending_refs);
+	btrfs_submit_bbio(bbio, 0);
+
+	if (!refcount_dec_and_test(&priv->pending_refs))
+		wait_for_completion_io(&priv->done);
+
+	return blk_status_to_errno(READ_ONCE(priv->status));
+}
+
+static int copy_remapped_data(struct btrfs_fs_info *fs_info, u64 old_addr,
+			      u64 new_addr, u64 length)
+{
+	int ret;
+	struct page **pages;
+	unsigned int nr_pages;
+	struct reloc_io_private priv;
+
+	nr_pages = DIV_ROUND_UP(length, PAGE_SIZE);
+	pages = kcalloc(nr_pages, sizeof(struct page *), GFP_NOFS);
+	if (!pages)
+		return -ENOMEM;
+	ret = btrfs_alloc_page_array(nr_pages, pages, 0);
+	if (ret) {
+		ret = -ENOMEM;
+		goto end;
+	}
+
+	ret = copy_remapped_data_io(fs_info, &priv, pages, old_addr, length,
+				    false);
+	if (ret)
+		goto end;
+
+	ret = copy_remapped_data_io(fs_info, &priv, pages, new_addr, length,
+				    true);
+
+end:
+	for (unsigned int i = 0; i < nr_pages; i++) {
+		if (pages[i])
+			__free_page(pages[i]);
+	}
+	kfree(pages);
+
+	return ret;
+}
+
+static int do_copy(struct btrfs_fs_info *fs_info, u64 old_addr, u64 new_addr,
+		   u64 length)
+{
+	int ret;
+
+	/* Copy 1MB at a time, to avoid using too much memory. */
+
+	do {
+		u64 to_copy = min_t(u64, length, SZ_1M);
+
+		ret = copy_remapped_data(fs_info, old_addr, new_addr,
+					 to_copy);
+		if (ret)
+			return ret;
+
+		if (to_copy == length)
+			break;
+
+		old_addr += to_copy;
+		new_addr += to_copy;
+		length -= to_copy;
+	} while (true);
+
+	return 0;
+}
+
+static int add_remap_item(struct btrfs_trans_handle *trans,
+			  struct btrfs_path *path, u64 new_addr, u64 length,
+			  u64 old_addr)
+{
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	struct btrfs_remap remap;
+	struct btrfs_key key;
+	struct extent_buffer *leaf;
+	int ret;
+
+	key.objectid = old_addr;
+	key.type = BTRFS_REMAP_KEY;
+	key.offset = length;
+
+	ret = btrfs_insert_empty_item(trans, fs_info->remap_root, path,
+				      &key, sizeof(struct btrfs_remap));
+	if (ret)
+		return ret;
+
+	leaf = path->nodes[0];
+
+	btrfs_set_stack_remap_address(&remap, new_addr);
+
+	write_extent_buffer(leaf, &remap,
+			    btrfs_item_ptr_offset(leaf, path->slots[0]),
+			    sizeof(struct btrfs_remap));
+
+	btrfs_release_path(path);
+
+	return 0;
+}
+
+static int add_remap_backref_item(struct btrfs_trans_handle *trans,
+				  struct btrfs_path *path, u64 new_addr,
+				  u64 length, u64 old_addr)
+{
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	struct btrfs_remap remap;
+	struct btrfs_key key;
+	struct extent_buffer *leaf;
+	int ret;
+
+	key.objectid = new_addr;
+	key.type = BTRFS_REMAP_BACKREF_KEY;
+	key.offset = length;
+
+	ret = btrfs_insert_empty_item(trans, fs_info->remap_root,
+				      path, &key, sizeof(struct btrfs_remap));
+	if (ret)
+		return ret;
+
+	leaf = path->nodes[0];
+
+	btrfs_set_stack_remap_address(&remap, old_addr);
+
+	write_extent_buffer(leaf, &remap,
+			    btrfs_item_ptr_offset(leaf, path->slots[0]),
+			    sizeof(struct btrfs_remap));
+
+	btrfs_release_path(path);
+
+	return 0;
+}
+
+static int move_existing_remap(struct btrfs_fs_info *fs_info,
+			       struct btrfs_path *path,
+			       struct btrfs_block_group *bg, u64 new_addr,
+			       u64 length, u64 old_addr)
+{
+	struct btrfs_trans_handle *trans;
+	struct extent_buffer *leaf;
+	struct btrfs_remap *remap_ptr, remap;
+	struct btrfs_key key, ins;
+	u64 dest_addr, dest_length, min_size;
+	struct btrfs_block_group *dest_bg;
+	int ret;
+	bool is_data = bg->flags & BTRFS_BLOCK_GROUP_DATA;
+	struct btrfs_space_info *sinfo = bg->space_info;
+	bool mutex_taken = false, bg_needs_free_space;
+
+	spin_lock(&sinfo->lock);
+	btrfs_space_info_update_bytes_may_use(sinfo, length);
+	spin_unlock(&sinfo->lock);
+
+	if (is_data)
+		min_size = fs_info->sectorsize;
+	else
+		min_size = fs_info->nodesize;
+
+	ret = btrfs_reserve_extent(fs_info->fs_root, length, length, min_size,
+				   0, 0, &ins, is_data, false);
+	if (ret) {
+		spin_lock(&sinfo->lock);
+		btrfs_space_info_update_bytes_may_use(sinfo, -length);
+		spin_unlock(&sinfo->lock);
+		return ret;
+	}
+
+	dest_addr = ins.objectid;
+	dest_length = ins.offset;
+
+	if (!is_data && !IS_ALIGNED(dest_length, fs_info->nodesize)) {
+		u64 new_length = ALIGN_DOWN(dest_length, fs_info->nodesize);
+
+		btrfs_free_reserved_extent(fs_info, dest_addr + new_length,
+					   dest_length - new_length, 0);
+
+		dest_length = new_length;
+	}
+
+	trans = btrfs_join_transaction(fs_info->remap_root);
+	if (IS_ERR(trans)) {
+		ret = PTR_ERR(trans);
+		trans = NULL;
+		goto end;
+	}
+
+	mutex_lock(&fs_info->remap_mutex);
+	mutex_taken = true;
+
+	/* Find old remap entry. */
+
+	key.objectid = old_addr;
+	key.type = BTRFS_REMAP_KEY;
+	key.offset = length;
+
+	ret = btrfs_search_slot(trans, fs_info->remap_root, &key,
+				path, 0, 1);
+	if (ret == 1) {
+		/*
+		 * Not a problem if the remap entry wasn't found: that means
+		 * that another transaction has deallocated the data.
+		 * move_existing_remaps() loops until the BG contains no
+		 * remaps, so we can just return 0 in this case.
+		 */
+		btrfs_release_path(path);
+		ret = 0;
+		goto end;
+	} else if (ret) {
+		goto end;
+	}
+
+	ret = do_copy(fs_info, new_addr, dest_addr, dest_length);
+	if (ret)
+		goto end;
+
+	/* Change data of old remap entry. */
+
+	leaf = path->nodes[0];
+
+	remap_ptr = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_remap);
+	btrfs_set_remap_address(leaf, remap_ptr, dest_addr);
+
+	btrfs_mark_buffer_dirty(trans, leaf);
+
+	if (dest_length != length) {
+		key.offset = dest_length;
+		btrfs_set_item_key_safe(trans, path, &key);
+	}
+
+	btrfs_release_path(path);
+
+	if (dest_length != length) {
+		/* Add remap item for remainder. */
+
+		ret = add_remap_item(trans, path, new_addr + dest_length,
+				     length - dest_length,
+				     old_addr + dest_length);
+		if (ret)
+			goto end;
+	}
+
+	/* Change or remove old backref. */
+
+	key.objectid = new_addr;
+	key.type = BTRFS_REMAP_BACKREF_KEY;
+	key.offset = length;
+
+	ret = btrfs_search_slot(trans, fs_info->remap_root, &key,
+				path, -1, 1);
+	if (ret) {
+		if (ret == 1) {
+			btrfs_release_path(path);
+			ret = -ENOENT;
+		}
+		goto end;
+	}
+
+	leaf = path->nodes[0];
+
+	if (dest_length == length) {
+		ret = btrfs_del_item(trans, fs_info->remap_root, path);
+		if (ret) {
+			btrfs_release_path(path);
+			goto end;
+		}
+	} else {
+		key.objectid += dest_length;
+		key.offset -= dest_length;
+		btrfs_set_item_key_safe(trans, path, &key);
+
+		btrfs_set_stack_remap_address(&remap, old_addr + dest_length);
+
+		write_extent_buffer(leaf, &remap,
+				    btrfs_item_ptr_offset(leaf, path->slots[0]),
+				    sizeof(struct btrfs_remap));
+	}
+
+	btrfs_release_path(path);
+
+	/* Add new backref. */
+
+	ret = add_remap_backref_item(trans, path, dest_addr, dest_length,
+				     old_addr);
+	if (ret)
+		goto end;
+
+	adjust_block_group_remap_bytes(trans, bg, -dest_length);
+
+	ret = add_to_free_space_tree(trans, new_addr, dest_length);
+	if (ret)
+		goto end;
+
+	dest_bg = btrfs_lookup_block_group(fs_info, dest_addr);
+
+	adjust_block_group_remap_bytes(trans, dest_bg, dest_length);
+
+	mutex_lock(&dest_bg->free_space_lock);
+	bg_needs_free_space = test_bit(BLOCK_GROUP_FLAG_NEEDS_FREE_SPACE,
+				       &dest_bg->runtime_flags);
+	mutex_unlock(&dest_bg->free_space_lock);
+
+	if (bg_needs_free_space) {
+		ret = add_block_group_free_space(trans, dest_bg);
+		if (ret) {
+			btrfs_put_block_group(dest_bg);
+			goto end;
+		}
+	}
+
+	/* Drop the reference only once we are done dereferencing dest_bg. */
+	btrfs_put_block_group(dest_bg);
+
+	ret = remove_from_free_space_tree(trans, dest_addr, dest_length);
+	if (ret) {
+		remove_from_free_space_tree(trans, new_addr, dest_length);
+		goto end;
+	}
+
+	ret = 0;
+
+end:
+	if (mutex_taken)
+		mutex_unlock(&fs_info->remap_mutex);
+
+	btrfs_dec_block_group_reservations(fs_info, dest_addr);
+
+	if (ret) {
+		btrfs_free_reserved_extent(fs_info, dest_addr, dest_length, 0);
+
+		if (trans) {
+			btrfs_abort_transaction(trans, ret);
+			btrfs_end_transaction(trans);
+		}
+	} else {
+		dest_bg = btrfs_lookup_block_group(fs_info, dest_addr);
+		btrfs_free_reserved_bytes(dest_bg, dest_length, 0);
+		btrfs_put_block_group(dest_bg);
+
+		ret = btrfs_commit_transaction(trans);
+	}
+
+	return ret;
+}
+
+static int move_existing_remaps(struct btrfs_fs_info *fs_info,
+				struct btrfs_block_group *bg,
+				struct btrfs_path *path)
+{
+	int ret;
+	struct btrfs_key key;
+	struct extent_buffer *leaf;
+	struct btrfs_remap *remap;
+	u64 old_addr;
+
+	/* Look for backrefs in remap tree. */
+
+	while (bg->remap_bytes > 0) {
+		key.objectid = bg->start;
+		key.type = BTRFS_REMAP_BACKREF_KEY;
+		key.offset = 0;
+
+		ret = btrfs_search_slot(NULL, fs_info->remap_root, &key, path,
+					0, 0);
+		if (ret < 0)
+			return ret;
+
+		leaf = path->nodes[0];
+
+		if (path->slots[0] >= btrfs_header_nritems(leaf)) {
+			ret = btrfs_next_leaf(fs_info->remap_root, path);
+			if (ret < 0) {
+				btrfs_release_path(path);
+				return ret;
+			}
+
+			if (ret) {
+				btrfs_release_path(path);
+				break;
+			}
+
+			leaf = path->nodes[0];
+		}
+
+		btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
+
+		if (key.type != BTRFS_REMAP_BACKREF_KEY) {
+			path->slots[0]++;
+
+			if (path->slots[0] >= btrfs_header_nritems(leaf)) {
+				ret = btrfs_next_leaf(fs_info->remap_root, path);
+				if (ret < 0) {
+					btrfs_release_path(path);
+					return ret;
+				}
+
+				if (ret) {
+					btrfs_release_path(path);
+					break;
+				}
+
+				leaf = path->nodes[0];
+			}
+
+			/* Re-read the key for the slot we advanced to. */
+			btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
+		}
+
+		remap = btrfs_item_ptr(leaf, path->slots[0],
+				       struct btrfs_remap);
+
+		old_addr = btrfs_remap_address(leaf, remap);
+
+		btrfs_release_path(path);
+
+		ret = move_existing_remap(fs_info, path, bg, key.objectid,
+					  key.offset, old_addr);
+		if (ret)
+			return ret;
+	}
+
+	BUG_ON(bg->remap_bytes > 0);
+
+	return 0;
+}
+
 static int create_remap_tree_entries(struct btrfs_trans_handle *trans,
 				     struct btrfs_path *path,
 				     struct btrfs_block_group *bg)
@@ -4636,6 +5110,14 @@ int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start,
 	WARN_ON(ret && ret != -EAGAIN);
 
 	if (*using_remap_tree) {
+		if (bg->remap_bytes != 0) {
+			ret = move_existing_remaps(fs_info, bg, path);
+			if (ret) {
+				err = ret;
+				goto out;
+			}
+		}
+
 		err = start_block_group_remapping(fs_info, path, bg);
 
 		goto out;
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 12/12] btrfs: replace identity maps with actual remaps when doing relocations
  2025-06-05 16:23 [PATCH 00/12] btrfs: remap tree Mark Harmstone
                   ` (10 preceding siblings ...)
  2025-06-05 16:23 ` [PATCH 11/12] btrfs: move existing remaps before relocating block group Mark Harmstone
@ 2025-06-05 16:23 ` Mark Harmstone
  2025-06-05 16:43 ` [PATCH 00/12] btrfs: remap tree Jonah Sabean
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 39+ messages in thread
From: Mark Harmstone @ 2025-06-05 16:23 UTC
  To: linux-btrfs; +Cc: Mark Harmstone

Add a function do_remap_tree_reloc(), which does the actual work of
relocating a block group using the remap tree.

In a loop we call do_remap_tree_reloc_trans(), which searches for the
first identity remap in the block group. We call btrfs_reserve_extent()
to find space elsewhere for it, read the data into memory, and write it
to the new location. We then carve out the identity remap and replace
it with an actual remap, pointing at the new location.

Once the last identity remap has been removed we call
last_identity_remap_gone(), which, as with deletions, removes the
chunk's stripes and device extents.

Signed-off-by: Mark Harmstone <maharmstone@fb.com>
---
 fs/btrfs/relocation.c | 335 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 335 insertions(+)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index d7aad3f92224..95b28678fb65 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -4710,6 +4710,61 @@ static int mark_bg_remapped(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
+static int find_next_identity_remap(struct btrfs_trans_handle *trans,
+				    struct btrfs_path *path, u64 bg_end,
+				    u64 last_start, u64 *start,
+				    u64 *length)
+{
+	int ret;
+	struct btrfs_key key, found_key;
+	struct btrfs_root *remap_root = trans->fs_info->remap_root;
+	struct extent_buffer *leaf;
+
+	key.objectid = last_start;
+	key.type = BTRFS_IDENTITY_REMAP_KEY;
+	key.offset = 0;
+
+	ret = btrfs_search_slot(trans, remap_root, &key, path, 0, 0);
+	if (ret < 0)
+		goto out;
+
+	leaf = path->nodes[0];
+	while (true) {
+		if (path->slots[0] >= btrfs_header_nritems(leaf)) {
+			ret = btrfs_next_leaf(remap_root, path);
+
+			if (ret != 0) {
+				if (ret == 1)
+					ret = -ENOENT;
+				goto out;
+			}
+
+			leaf = path->nodes[0];
+		}
+
+		btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
+
+		if (found_key.objectid >= bg_end) {
+			ret = -ENOENT;
+			goto out;
+		}
+
+		if (found_key.type == BTRFS_IDENTITY_REMAP_KEY) {
+			*start = found_key.objectid;
+			*length = found_key.offset;
+			ret = 0;
+			goto out;
+		}
+
+		path->slots[0]++;
+	}
+
+out:
+	btrfs_release_path(path);
+
+	return ret;
+}
+
 static int remove_chunk_stripes(struct btrfs_trans_handle *trans,
 				struct btrfs_chunk_map *chunk,
 				struct btrfs_path *path)
@@ -4829,6 +4884,98 @@ static int adjust_identity_remap_count(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
+static int add_remap_entry(struct btrfs_trans_handle *trans,
+			   struct btrfs_path *path,
+			   struct btrfs_block_group *src_bg, u64 old_addr,
+			   u64 new_addr, u64 length)
+{
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	struct btrfs_key key, new_key;
+	int ret;
+	int identity_count_delta = 0;
+
+	key.objectid = old_addr;
+	key.type = (u8)-1;
+	key.offset = (u64)-1;
+
+	ret = btrfs_search_slot(trans, fs_info->remap_root, &key, path, -1, 1);
+	if (ret < 0)
+		goto end;
+
+	if (path->slots[0] == 0) {
+		ret = -ENOENT;
+		goto end;
+	}
+
+	path->slots[0]--;
+
+	btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+
+	if (key.type != BTRFS_IDENTITY_REMAP_KEY ||
+	    key.objectid > old_addr ||
+	    key.objectid + key.offset <= old_addr) {
+		ret = -ENOENT;
+		goto end;
+	}
+
+	/* Shorten or delete identity mapping entry. */
+
+	if (key.objectid == old_addr) {
+		ret = btrfs_del_item(trans, fs_info->remap_root, path);
+		if (ret)
+			goto end;
+
+		identity_count_delta--;
+	} else {
+		new_key.objectid = key.objectid;
+		new_key.type = BTRFS_IDENTITY_REMAP_KEY;
+		new_key.offset = old_addr - key.objectid;
+
+		btrfs_set_item_key_safe(trans, path, &new_key);
+	}
+
+	btrfs_release_path(path);
+
+	/* Create new remap entry. */
+
+	ret = add_remap_item(trans, path, new_addr, length, old_addr);
+	if (ret)
+		goto end;
+
+	/* Add entry for remainder of identity mapping, if necessary. */
+
+	if (key.objectid + key.offset != old_addr + length) {
+		new_key.objectid = old_addr + length;
+		new_key.type = BTRFS_IDENTITY_REMAP_KEY;
+		new_key.offset = key.objectid + key.offset - old_addr - length;
+
+		ret = btrfs_insert_empty_item(trans, fs_info->remap_root,
+					      path, &new_key, 0);
+		if (ret)
+			goto end;
+
+		btrfs_release_path(path);
+
+		identity_count_delta++;
+	}
+
+	/* Add backref. */
+
+	ret = add_remap_backref_item(trans, path, new_addr, length, old_addr);
+	if (ret)
+		goto end;
+
+	if (identity_count_delta != 0) {
+		ret = adjust_identity_remap_count(trans, path, src_bg,
+						  identity_count_delta);
+	}
+
+end:
+	btrfs_release_path(path);
+
+	return ret;
+}
+
 static int mark_chunk_remapped(struct btrfs_trans_handle *trans,
 			       struct btrfs_path *path, uint64_t start)
 {
@@ -4878,6 +5025,186 @@ static int mark_chunk_remapped(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
+static int do_remap_tree_reloc_trans(struct btrfs_fs_info *fs_info,
+				     struct btrfs_block_group *src_bg,
+				     struct btrfs_path *path, u64 *last_start)
+{
+	struct btrfs_trans_handle *trans;
+	struct btrfs_root *extent_root;
+	struct btrfs_key ins;
+	struct btrfs_block_group *dest_bg = NULL;
+	struct btrfs_chunk_map *chunk;
+	u64 start, remap_length, length, new_addr, min_size;
+	int ret;
+	bool no_more = false;
+	bool is_data = src_bg->flags & BTRFS_BLOCK_GROUP_DATA;
+	bool made_reservation = false, bg_needs_free_space;
+	struct btrfs_space_info *sinfo = src_bg->space_info;
+
+	extent_root = btrfs_extent_root(fs_info, src_bg->start);
+
+	trans = btrfs_start_transaction(extent_root, 0);
+	if (IS_ERR(trans))
+		return PTR_ERR(trans);
+
+	mutex_lock(&fs_info->remap_mutex);
+
+	ret = find_next_identity_remap(trans, path, src_bg->start + src_bg->length,
+				       *last_start, &start, &remap_length);
+	if (ret == -ENOENT) {
+		no_more = true;
+		goto next;
+	} else if (ret) {
+		mutex_unlock(&fs_info->remap_mutex);
+		btrfs_end_transaction(trans);
+		return ret;
+	}
+
+	/* Try to reserve enough space for block. */
+
+	spin_lock(&sinfo->lock);
+	btrfs_space_info_update_bytes_may_use(sinfo, remap_length);
+	spin_unlock(&sinfo->lock);
+
+	if (is_data)
+		min_size = fs_info->sectorsize;
+	else
+		min_size = fs_info->nodesize;
+
+	ret = btrfs_reserve_extent(fs_info->fs_root, remap_length,
+				   remap_length, min_size,
+				   0, 0, &ins, is_data, false);
+	if (ret) {
+		spin_lock(&sinfo->lock);
+		btrfs_space_info_update_bytes_may_use(sinfo, -remap_length);
+		spin_unlock(&sinfo->lock);
+
+		mutex_unlock(&fs_info->remap_mutex);
+		btrfs_end_transaction(trans);
+		return ret;
+	}
+
+	made_reservation = true;
+
+	new_addr = ins.objectid;
+	length = ins.offset;
+
+	if (!is_data && !IS_ALIGNED(length, fs_info->nodesize)) {
+		u64 new_length = ALIGN_DOWN(length, fs_info->nodesize);
+
+		btrfs_free_reserved_extent(fs_info, new_addr + new_length,
+					   length - new_length, 0);
+
+		length = new_length;
+	}
+
+	dest_bg = btrfs_lookup_block_group(fs_info, new_addr);
+
+	mutex_lock(&dest_bg->free_space_lock);
+	bg_needs_free_space = test_bit(BLOCK_GROUP_FLAG_NEEDS_FREE_SPACE,
+				       &dest_bg->runtime_flags);
+	mutex_unlock(&dest_bg->free_space_lock);
+
+	if (bg_needs_free_space) {
+		ret = add_block_group_free_space(trans, dest_bg);
+		if (ret)
+			goto fail;
+	}
+
+	ret = do_copy(fs_info, start, new_addr, length);
+	if (ret)
+		goto fail;
+
+	ret = remove_from_free_space_tree(trans, new_addr, length);
+	if (ret)
+		goto fail;
+
+	ret = add_remap_entry(trans, path, src_bg, start, new_addr, length);
+	if (ret) {
+		add_to_free_space_tree(trans, new_addr, length);
+		goto fail;
+	}
+
+	adjust_block_group_remap_bytes(trans, dest_bg, length);
+	btrfs_free_reserved_bytes(dest_bg, length, 0);
+
+	spin_lock(&sinfo->lock);
+	sinfo->bytes_readonly += length;
+	spin_unlock(&sinfo->lock);
+
+next:
+	if (dest_bg)
+		btrfs_put_block_group(dest_bg);
+
+	if (made_reservation)
+		btrfs_dec_block_group_reservations(fs_info, new_addr);
+
+	if (src_bg->used == 0 && src_bg->remap_bytes == 0) {
+		chunk = btrfs_find_chunk_map(fs_info, src_bg->start, 1);
+		if (!chunk) {
+			mutex_unlock(&fs_info->remap_mutex);
+			btrfs_end_transaction(trans);
+			return -ENOENT;
+		}
+
+		ret = last_identity_remap_gone(trans, chunk, src_bg, path);
+		if (ret) {
+			btrfs_free_chunk_map(chunk);
+			mutex_unlock(&fs_info->remap_mutex);
+			btrfs_end_transaction(trans);
+			return ret;
+		}
+
+		btrfs_free_chunk_map(chunk);
+	}
+
+	mutex_unlock(&fs_info->remap_mutex);
+
+	ret = btrfs_end_transaction(trans);
+	if (ret)
+		return ret;
+
+	if (no_more)
+		return 1;
+
+	*last_start = start;
+
+	return 0;
+
+fail:
+	if (dest_bg)
+		btrfs_put_block_group(dest_bg);
+
+	btrfs_free_reserved_extent(fs_info, new_addr, length, 0);
+
+	mutex_unlock(&fs_info->remap_mutex);
+	btrfs_end_transaction(trans);
+
+	return ret;
+}
+
+static int do_remap_tree_reloc(struct btrfs_fs_info *fs_info,
+			       struct btrfs_path *path,
+			       struct btrfs_block_group *bg)
+{
+	u64 last_start;
+	int ret;
+
+	last_start = bg->start;
+
+	while (true) {
+		ret = do_remap_tree_reloc_trans(fs_info, bg, path,
+						&last_start);
+		if (ret) {
+			if (ret == 1)
+				ret = 0;
+			break;
+		}
+	}
+
+	return ret;
+}
+
 int btrfs_translate_remap(struct btrfs_fs_info *fs_info, u64 *logical,
 			  u64 *length, bool nolock)
 {
@@ -5119,6 +5446,14 @@ int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start,
 		}
 
 		err = start_block_group_remapping(fs_info, path, bg);
+		if (err)
+			goto out;
+
+		err = do_remap_tree_reloc(fs_info, path, rc->block_group);
+		if (err)
+			goto out;
+
+		btrfs_delete_unused_bgs(fs_info);
 
 		goto out;
 	}
-- 
2.49.0



* Re: [PATCH 00/12] btrfs: remap tree
  2025-06-05 16:23 [PATCH 00/12] btrfs: remap tree Mark Harmstone
                   ` (11 preceding siblings ...)
  2025-06-05 16:23 ` [PATCH 12/12] btrfs: replace identity maps with actual remaps when doing relocations Mark Harmstone
@ 2025-06-05 16:43 ` Jonah Sabean
  2025-06-06 13:35   ` Mark Harmstone
  2025-06-09 18:51 ` David Sterba
                   ` (4 subsequent siblings)
  17 siblings, 1 reply; 39+ messages in thread
From: Jonah Sabean @ 2025-06-05 16:43 UTC
  To: Mark Harmstone; +Cc: linux-btrfs

On Thu, Jun 5, 2025 at 1:25 PM Mark Harmstone <maharmstone@fb.com> wrote:
>
> This patch series adds a disk format change gated behind
> CONFIG_BTRFS_EXPERIMENTAL to add a "remap tree", which acts as a layer of
> indirection when doing I/O. When doing relocation, rather than fixing up every
> tree, we instead record the old and new addresses in the remap tree. This should
> hopefully make things more reliable and flexible, as well as enabling some
> future changes we'd like to make, such as larger data extents and reducing
> write amplification by removing cow-only metadata items.
>
> The remap tree lives in a new REMAP chunk type. This is because bootstrapping
> means that it can't be remapped itself, and has to be relocated by COWing it as
> at present. It can't go in the SYSTEM chunk as we are then limited by the chunk
> item needing to fit in the superblock.
>
> For more on the design and rationale, please see my RFC sent last month[1], as
> well as Josef Bacik's original design document[2]. The main change from Josef's
> design is that I've added remap backrefs, as we need to be able to move a
> chunk's existing remaps before remapping it.
>
> You will also need my patches to btrfs-progs[3] to make
> `mkfs.btrfs -O remap-tree` work, as well as allowing `btrfs check` to recognize
> the new format.
>
> Changes since the RFC:
>
> * I've reduced the REMAP chunk size from the normal 1GB to 32MB, to match the
>   SYSTEM chunk. For a filesystem with 4KB sectors and 16KB node size, the worst
>   case is that one leaf covers ~1MB of data, and the best case ~250GB. For a
>   chunk, that implies a worst case of ~2GB and a best case of ~500TB.
>   This isn't a disk-format change, so we can always adjust it if it proves too
>   big or small in practice. mkfs creates 8MB chunks, as it does for everything.

One thing I'd like to see fixed is the fragmentation of dev_extents on
striped profiles when there is less than 1G of space left, as btrfs
will allocate these smaller chunks across a striped array (i.e. raid0,
10, 5 or 6). Otherwise, being able to support larger extents can be
made moot, because you can end up with chunks as small as 1MiB.
Depending on whether you add/remove devices and balance often, you can
end up with a lot of these small chunks across all disks, so one hacky
way I've got around this is to align partitions and force the system
chunk to 1G with this patch:
https://pastebin.com/4PWbgEXV

Ideally, I'd like this problem solved, but it seems to me this will
just add yet another small chunk into the mix, making alignment harder
in this case. Striping really is a curse on btrfs.

>
> * You can't make new allocations from remapped block groups, so I've
>   changed it so that there are no free-space entries for these (thanks
>   to Boris Burkov for the suggestion).
>
> * The remap tree doesn't have metadata items in the extent tree (thanks to Josef
>   for the suggestion). This was to work around some corruption that delayed refs
>   were causing, but it also fits in with our future plans of removing all
>   metadata items for COW-only trees, reducing write amplification.
>   A knock-on effect of this is that I've had to disable balancing of the remap
>   chunk itself. This is because we can no longer walk the extent tree, and will
>   have to walk the remap tree instead. When we remove the COW-only metadata
>   items, we will also have to do this for the chunk and root trees, as
>   bootstrapping means they can't be remapped.
>
> * btrfs_translate_remap() uses search_commit_root when doing metadata lookups,
>   to avoid nested locking issues. This also seems to be a lot quicker (btrfs/187
>   went from ~20mins to ~90secs).
>
> * Unused remapped block groups should now get cleaned up more aggressively
>
> * Other miscellaneous cleanups and fixes
>
> Known issues:
>
> * Relocation still needs to be implemented for the remap tree itself (see above)
>
> * Some test failures: btrfs/156, btrfs/170, btrfs/226, btrfs/250
>
> * nodatacow extents aren't safe, as they can race with the relocation thread.
>   We either need to follow the btrfs_inc_nocow_writers() approach, which COWs
>   the extent, or change it so that it blocks here.
>
> * When initially marking a block group as remapped, we are walking the free-
>   space tree and creating the identity remaps all in one transaction. For the
>   worst-case scenario, i.e. a 1GB block group with every other sector allocated
>   (131,072 extents), this can result in transaction times of more than 10 mins.
>   This needs to be changed to allow this to happen over multiple transactions.
>
> * All this is disabled for zoned devices for the time being, as I've not been
>   able to test it. I'm planning to make it compatible with zoned at a later
>   date.
>
> Thanks
>
> [1] https://lwn.net/Articles/1021452/
> [2] https://github.com/btrfs/btrfs-todo/issues/54
> [3] https://github.com/maharmstone/btrfs-progs/tree/remap-tree
>
> Mark Harmstone (12):
>   btrfs: add definitions and constants for remap-tree
>   btrfs: add REMAP chunk type
>   btrfs: allow remapped chunks to have zero stripes
>   btrfs: remove remapped block groups from the free-space tree
>   btrfs: don't add metadata items for the remap tree to the extent tree
>   btrfs: add extended version of struct block_group_item
>   btrfs: allow mounting filesystems with remap-tree incompat flag
>   btrfs: redirect I/O for remapped block groups
>   btrfs: handle deletions from remapped block group
>   btrfs: handle setting up relocation of block group with remap-tree
>   btrfs: move existing remaps before relocating block group
>   btrfs: replace identity maps with actual remaps when doing relocations
>
>  fs/btrfs/Kconfig                |    2 +
>  fs/btrfs/accessors.h            |   29 +
>  fs/btrfs/block-group.c          |  202 +++-
>  fs/btrfs/block-group.h          |   15 +-
>  fs/btrfs/block-rsv.c            |    8 +
>  fs/btrfs/block-rsv.h            |    1 +
>  fs/btrfs/discard.c              |   11 +-
>  fs/btrfs/disk-io.c              |   91 +-
>  fs/btrfs/extent-tree.c          |  152 ++-
>  fs/btrfs/free-space-tree.c      |    4 +-
>  fs/btrfs/free-space-tree.h      |    5 +-
>  fs/btrfs/fs.h                   |    7 +-
>  fs/btrfs/relocation.c           | 1897 ++++++++++++++++++++++++++++++-
>  fs/btrfs/relocation.h           |    8 +-
>  fs/btrfs/space-info.c           |   22 +-
>  fs/btrfs/sysfs.c                |    4 +
>  fs/btrfs/transaction.c          |    7 +
>  fs/btrfs/tree-checker.c         |   37 +-
>  fs/btrfs/volumes.c              |  115 +-
>  fs/btrfs/volumes.h              |   17 +-
>  include/uapi/linux/btrfs.h      |    1 +
>  include/uapi/linux/btrfs_tree.h |   29 +-
>  22 files changed, 2444 insertions(+), 220 deletions(-)
>
> --
> 2.49.0
>
>


* Re: [PATCH 04/12] btrfs: remove remapped block groups from the free-space tree
  2025-06-05 16:23 ` [PATCH 04/12] btrfs: remove remapped block groups from the free-space tree Mark Harmstone
@ 2025-06-06  6:41   ` kernel test robot
  2025-06-13 22:00   ` Boris Burkov
  1 sibling, 0 replies; 39+ messages in thread
From: kernel test robot @ 2025-06-06  6:41 UTC
  To: Mark Harmstone, linux-btrfs; +Cc: llvm, oe-kbuild-all, Mark Harmstone

Hi Mark,

kernel test robot noticed the following build warnings:

[auto build test WARNING on kdave/for-next]
[also build test WARNING on linus/master next-20250605]
[cannot apply to v6.15]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Mark-Harmstone/btrfs-add-definitions-and-constants-for-remap-tree/20250606-002804
base:   https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
patch link:    https://lore.kernel.org/r/20250605162345.2561026-5-maharmstone%40fb.com
patch subject: [PATCH 04/12] btrfs: remove remapped block groups from the free-space tree
config: i386-randconfig-014-20250606 (https://download.01.org/0day-ci/archive/20250606/202506061455.6kk5l6Ct-lkp@intel.com/config)
compiler: clang version 20.1.2 (https://github.com/llvm/llvm-project 58df0ef89dd64126512e4ee27b4ac3fd8ddf6247)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250606/202506061455.6kk5l6Ct-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202506061455.6kk5l6Ct-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> fs/btrfs/block-group.c:2470:23: warning: use of logical '&&' with constant operand [-Wconstant-logical-operand]
    2470 |                             !(cache->flags && BTRFS_BLOCK_GROUP_REMAPPED)) {
         |                                            ^  ~~~~~~~~~~~~~~~~~~~~~~~~~~
   fs/btrfs/block-group.c:2470:23: note: use '&' for a bitwise operation
    2470 |                             !(cache->flags && BTRFS_BLOCK_GROUP_REMAPPED)) {
         |                                            ^~
         |                                            &
   fs/btrfs/block-group.c:2470:23: note: remove constant to silence this warning
    2470 |                             !(cache->flags && BTRFS_BLOCK_GROUP_REMAPPED)) {
         |                                            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> fs/btrfs/block-group.c:2470:26: warning: converting the result of '<<' to a boolean always evaluates to true [-Wtautological-constant-compare]
    2470 |                             !(cache->flags && BTRFS_BLOCK_GROUP_REMAPPED)) {
         |                                               ^
   include/uapi/linux/btrfs_tree.h:1171:47: note: expanded from macro 'BTRFS_BLOCK_GROUP_REMAPPED'
    1171 | #define BTRFS_BLOCK_GROUP_REMAPPED      (1ULL << 11)
         |                                               ^
   2 warnings generated.


vim +2470 fs/btrfs/block-group.c

  2361	
  2362	static int read_one_block_group(struct btrfs_fs_info *info,
  2363					struct btrfs_block_group_item *bgi,
  2364					const struct btrfs_key *key,
  2365					int need_clear)
  2366	{
  2367		struct btrfs_block_group *cache;
  2368		const bool mixed = btrfs_fs_incompat(info, MIXED_GROUPS);
  2369		int ret;
  2370	
  2371		ASSERT(key->type == BTRFS_BLOCK_GROUP_ITEM_KEY);
  2372	
  2373		cache = btrfs_create_block_group_cache(info, key->objectid);
  2374		if (!cache)
  2375			return -ENOMEM;
  2376	
  2377		cache->length = key->offset;
  2378		cache->used = btrfs_stack_block_group_used(bgi);
  2379		cache->commit_used = cache->used;
  2380		cache->flags = btrfs_stack_block_group_flags(bgi);
  2381		cache->global_root_id = btrfs_stack_block_group_chunk_objectid(bgi);
  2382		cache->space_info = btrfs_find_space_info(info, cache->flags);
  2383	
  2384		set_free_space_tree_thresholds(cache);
  2385	
  2386		if (need_clear) {
  2387			/*
  2388			 * When we mount with old space cache, we need to
  2389			 * set BTRFS_DC_CLEAR and set dirty flag.
  2390			 *
  2391			 * a) Setting 'BTRFS_DC_CLEAR' makes sure that we
  2392			 *    truncate the old free space cache inode and
  2393			 *    setup a new one.
  2394			 * b) Setting 'dirty flag' makes sure that we flush
  2395			 *    the new space cache info onto disk.
  2396			 */
  2397			if (btrfs_test_opt(info, SPACE_CACHE))
  2398				cache->disk_cache_state = BTRFS_DC_CLEAR;
  2399		}
  2400		if (!mixed && ((cache->flags & BTRFS_BLOCK_GROUP_METADATA) &&
  2401		    (cache->flags & BTRFS_BLOCK_GROUP_DATA))) {
  2402				btrfs_err(info,
  2403	"bg %llu is a mixed block group but filesystem hasn't enabled mixed block groups",
  2404					  cache->start);
  2405				ret = -EINVAL;
  2406				goto error;
  2407		}
  2408	
  2409		ret = btrfs_load_block_group_zone_info(cache, false);
  2410		if (ret) {
  2411			btrfs_err(info, "zoned: failed to load zone info of bg %llu",
  2412				  cache->start);
  2413			goto error;
  2414		}
  2415	
  2416		/*
  2417		 * We need to exclude the super stripes now so that the space info has
  2418		 * super bytes accounted for, otherwise we'll think we have more space
  2419		 * than we actually do.
  2420		 */
  2421		ret = exclude_super_stripes(cache);
  2422		if (ret) {
  2423			/* We may have excluded something, so call this just in case. */
  2424			btrfs_free_excluded_extents(cache);
  2425			goto error;
  2426		}
  2427	
  2428		/*
  2429		 * For zoned filesystem, space after the allocation offset is the only
  2430		 * free space for a block group. So, we don't need any caching work.
  2431		 * btrfs_calc_zone_unusable() will set the amount of free space and
  2432		 * zone_unusable space.
  2433		 *
  2434		 * For regular filesystem, check for two cases, either we are full, and
  2435		 * therefore don't need to bother with the caching work since we won't
  2436		 * find any space, or we are empty, and we can just add all the space
  2437		 * in and be done with it.  This saves us _a_lot_ of time, particularly
  2438		 * in the full case.
  2439		 */
  2440		if (btrfs_is_zoned(info)) {
  2441			btrfs_calc_zone_unusable(cache);
  2442			/* Should not have any excluded extents. Just in case, though. */
  2443			btrfs_free_excluded_extents(cache);
  2444		} else if (cache->length == cache->used) {
  2445			cache->cached = BTRFS_CACHE_FINISHED;
  2446			btrfs_free_excluded_extents(cache);
  2447		} else if (cache->used == 0) {
  2448			cache->cached = BTRFS_CACHE_FINISHED;
  2449			ret = btrfs_add_new_free_space(cache, cache->start,
  2450						       cache->start + cache->length, NULL);
  2451			btrfs_free_excluded_extents(cache);
  2452			if (ret)
  2453				goto error;
  2454		}
  2455	
  2456		ret = btrfs_add_block_group_cache(cache);
  2457		if (ret) {
  2458			btrfs_remove_free_space_cache(cache);
  2459			goto error;
  2460		}
  2461	
  2462		trace_btrfs_add_block_group(info, cache, 0);
  2463		btrfs_add_bg_to_space_info(info, cache);
  2464	
  2465		set_avail_alloc_bits(info, cache->flags);
  2466		if (btrfs_chunk_writeable(info, cache->start)) {
  2467			if (cache->used == 0) {
  2468				ASSERT(list_empty(&cache->bg_list));
  2469				if (btrfs_test_opt(info, DISCARD_ASYNC) &&
> 2470				    !(cache->flags && BTRFS_BLOCK_GROUP_REMAPPED)) {
  2471					btrfs_discard_queue_work(&info->discard_ctl, cache);
  2472				} else {
  2473					btrfs_mark_bg_unused(cache);
  2474				}
  2475			}
  2476		} else {
  2477			inc_block_group_ro(cache, 1);
  2478		}
  2479	
  2480		return 0;
  2481	error:
  2482		btrfs_put_block_group(cache);
  2483		return ret;
  2484	}
  2485	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 11/12] btrfs: move existing remaps before relocating block group
  2025-06-05 16:23 ` [PATCH 11/12] btrfs: move existing remaps before relocating block group Mark Harmstone
@ 2025-06-06 11:20   ` kernel test robot
  0 siblings, 0 replies; 39+ messages in thread
From: kernel test robot @ 2025-06-06 11:20 UTC (permalink / raw)
  To: Mark Harmstone, linux-btrfs; +Cc: oe-kbuild-all, Mark Harmstone

Hi Mark,

kernel test robot noticed the following build warnings:

[auto build test WARNING on kdave/for-next]
[also build test WARNING on linus/master next-20250606]
[cannot apply to v6.15]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Mark-Harmstone/btrfs-add-definitions-and-constants-for-remap-tree/20250606-002804
base:   https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
patch link:    https://lore.kernel.org/r/20250605162345.2561026-12-maharmstone%40fb.com
patch subject: [PATCH 11/12] btrfs: move existing remaps before relocating block group
config: csky-randconfig-r133-20250606 (https://download.01.org/0day-ci/archive/20250606/202506061952.ALl7BezR-lkp@intel.com/config)
compiler: csky-linux-gcc (GCC) 14.3.0
reproduce: (https://download.01.org/0day-ci/archive/20250606/202506061952.ALl7BezR-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202506061952.ALl7BezR-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
>> fs/btrfs/relocation.c:4036:27: sparse: sparse: incorrect type in initializer (different base types) @@     expected int op @@     got restricted blk_opf_t @@
   fs/btrfs/relocation.c:4036:27: sparse:     expected int op
   fs/btrfs/relocation.c:4036:27: sparse:     got restricted blk_opf_t
>> fs/btrfs/relocation.c:4042:46: sparse: sparse: incorrect type in argument 2 (different base types) @@     expected restricted blk_opf_t [usertype] opf @@     got int op @@
   fs/btrfs/relocation.c:4042:46: sparse:     expected restricted blk_opf_t [usertype] opf
   fs/btrfs/relocation.c:4042:46: sparse:     got int op
   fs/btrfs/relocation.c:4053:62: sparse: sparse: incorrect type in argument 2 (different base types) @@     expected restricted blk_opf_t [usertype] opf @@     got int op @@
   fs/btrfs/relocation.c:4053:62: sparse:     expected restricted blk_opf_t [usertype] opf
   fs/btrfs/relocation.c:4053:62: sparse:     got int op

vim +4036 fs/btrfs/relocation.c

  4028	
  4029	static int copy_remapped_data_io(struct btrfs_fs_info *fs_info,
  4030					 struct reloc_io_private *priv,
  4031					 struct page **pages, u64 addr, u64 length,
  4032					 bool do_write)
  4033	{
  4034		struct btrfs_bio *bbio;
  4035		unsigned long i = 0;
> 4036		int op = do_write ? REQ_OP_WRITE : REQ_OP_READ;
  4037	
  4038		init_completion(&priv->done);
  4039		refcount_set(&priv->pending_refs, 1);
  4040		priv->status = 0;
  4041	
> 4042		bbio = btrfs_bio_alloc(BIO_MAX_VECS, op, fs_info, reloc_endio,
  4043				       priv);
  4044		bbio->bio.bi_iter.bi_sector = addr >> SECTOR_SHIFT;
  4045	
  4046		do {
  4047			size_t bytes = min_t(u64, length, PAGE_SIZE);
  4048	
  4049			if (bio_add_page(&bbio->bio, pages[i], bytes, 0) < bytes) {
  4050				refcount_inc(&priv->pending_refs);
  4051				btrfs_submit_bbio(bbio, 0);
  4052	
  4053				bbio = btrfs_bio_alloc(BIO_MAX_VECS, op, fs_info,
  4054						       reloc_endio, priv);
  4055				bbio->bio.bi_iter.bi_sector = addr >> SECTOR_SHIFT;
  4056				continue;
  4057			}
  4058	
  4059			i++;
  4060			addr += bytes;
  4061			length -= bytes;
  4062		} while (length);
  4063	
  4064		refcount_inc(&priv->pending_refs);
  4065		btrfs_submit_bbio(bbio, 0);
  4066	
  4067		if (!refcount_dec_and_test(&priv->pending_refs))
  4068			wait_for_completion_io(&priv->done);
  4069	
  4070		return blk_status_to_errno(READ_ONCE(priv->status));
  4071	}
  4072	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 00/12] btrfs: remap tree
  2025-06-05 16:43 ` [PATCH 00/12] btrfs: remap tree Jonah Sabean
@ 2025-06-06 13:35   ` Mark Harmstone
  2025-06-09 16:05     ` Anand Jain
  0 siblings, 1 reply; 39+ messages in thread
From: Mark Harmstone @ 2025-06-06 13:35 UTC (permalink / raw)
  To: Jonah Sabean; +Cc: linux-btrfs@vger.kernel.org

On 5/6/25 17:43, Jonah Sabean wrote:
> > 
> On Thu, Jun 5, 2025 at 1:25 PM Mark Harmstone <maharmstone@fb.com> wrote:
>>
>> This patch series adds a disk format change gated behind
>> CONFIG_BTRFS_EXPERIMENTAL to add a "remap tree", which acts as a layer of
>> indirection when doing I/O. When doing relocation, rather than fixing up every
>> tree, we instead record the old and new addresses in the remap tree. This should
>> hopefully make things more reliable and flexible, as well as enabling some
>> future changes we'd like to make, such as larger data extents and reducing
>> write amplification by removing cow-only metadata items.
>>
>> The remap tree lives in a new REMAP chunk type. This is because bootstrapping
>> means that it can't be remapped itself, and has to be relocated by COWing it as
>> at present. It can't go in the SYSTEM chunk as we are then limited by the chunk
>> item needing to fit in the superblock.
>>
>> For more on the design and rationale, please see my RFC sent last month[1], as
>> well as Josef Bacik's original design document[2]. The main change from Josef's
>> design is that I've added remap backrefs, as we need to be able to move a
>> chunk's existing remaps before remapping it.
>>
>> You will also need my patches to btrfs-progs[3] to make
>> `mkfs.btrfs -O remap-tree` work, as well as allowing `btrfs check` to recognize
>> the new format.
>>
>> Changes since the RFC:
>>
>> * I've reduced the REMAP chunk size from the normal 1GB to 32MB, to match the
>>    SYSTEM chunk. For a filesystem with 4KB sectors and 16KB node size, the worst
>>    case is that one leaf covers ~1MB of data, and the best case ~250GB. For a
>>    chunk, that implies a worst case of ~2GB and a best case of ~500TB.
>>    This isn't a disk-format change, so we can always adjust it if it proves too
>>    big or small in practice. mkfs creates 8MB chunks, as it does for everything.
> 
> One thing I'd like to see fixed is the fragmentation of dev_extents on
> striped profiles when you have less than 1G left of space, as btrfs
> will allocate these smaller chunks across a striped array (i.e. raid0,
> 10, 5 or 6), otherwise being able to support larger extents can be
> made moot because you can end up with chunks as small as
> 1MiB. Depending on if you add/remove devices often and balance often
> you can end up with a lot of chunks across all disks that can be made
> smaller, so one hacky way I've got around this is to align partitions
> and force the system chunk to 1G with this patch:
> https://pastebin.com/4PWbgEXV
> 
> Ideally, I'd like this problem solved, but it seems to me this will
> just add yet another small chunk in the mix that makes alignment
> harder in this case. Really makes striping a curse on btrfs.

This is a different problem to what my patches are trying to solve, but 
yes, I can understand why that would be an issue. Sometimes you'd prefer 
the FS to ENOSPC rather than fragmenting your files.

I know one of the btrfs developers has been looking into making the 
allocator more intelligent, so I'll make sure he's aware of this.

>>
>> * You can't make new allocations from remapped block groups, so I've changed
>>    it so there are no free-space entries for these (thanks to Boris Burkov for the
>>    suggestion).
>>
>> * The remap tree doesn't have metadata items in the extent tree (thanks to Josef
>>    for the suggestion). This was to work around some corruption that delayed refs
>>    were causing, but it also fits in with our future plans of removing all
>>    metadata items for COW-only trees, reducing write amplification.
>>    A knock-on effect of this is that I've had to disable balancing of the remap
>>    chunk itself. This is because we can no longer walk the extent tree, and will
>>    have to walk the remap tree instead. When we remove the COW-only metadata
>>    items, we will also have to do this for the chunk and root trees, as
>>    bootstrapping means they can't be remapped.
>>
>> * btrfs_translate_remap() uses search_commit_root when doing metadata lookups,
>>    to avoid nested locking issues. This also seems to be a lot quicker (btrfs/187
>>    went from ~20mins to ~90secs).
>>
>> * Unused remapped block groups should now get cleaned up more aggressively
>>
>> * Other miscellaneous cleanups and fixes
>>
>> Known issues:
>>
>> * Relocation still needs to be implemented for the remap tree itself (see above)
>>
>> * Some test failures: btrfs/156, btrfs/170, btrfs/226, btrfs/250
>>
>> * nodatacow extents aren't safe, as they can race with the relocation thread.
>>    We either need to follow the btrfs_inc_nocow_writers() approach, which COWs
>>    the extent, or change it so that it blocks here.
>>
>> * When initially marking a block group as remapped, we are walking the free-
>>    space tree and creating the identity remaps all in one transaction. For the
>>    worst-case scenario, i.e. a 1GB block group with every other sector allocated
>>    (131,072 extents), this can result in transaction times of more than 10 mins.
>>    This needs to be changed to allow this to happen over multiple transactions.
>>
>> * All this is disabled for zoned devices for the time being, as I've not been
>>    able to test it. I'm planning to make it compatible with zoned at a later
>>    date.
>>
>> Thanks
>>
>> [1] https://lwn.net/Articles/1021452/
>> [2] https://github.com/btrfs/btrfs-todo/issues/54
>> [3] https://github.com/maharmstone/btrfs-progs/tree/remap-tree
>>
>> Mark Harmstone (12):
>>    btrfs: add definitions and constants for remap-tree
>>    btrfs: add REMAP chunk type
>>    btrfs: allow remapped chunks to have zero stripes
>>    btrfs: remove remapped block groups from the free-space tree
>>    btrfs: don't add metadata items for the remap tree to the extent tree
>>    btrfs: add extended version of struct block_group_item
>>    btrfs: allow mounting filesystems with remap-tree incompat flag
>>    btrfs: redirect I/O for remapped block groups
>>    btrfs: handle deletions from remapped block group
>>    btrfs: handle setting up relocation of block group with remap-tree
>>    btrfs: move existing remaps before relocating block group
>>    btrfs: replace identity maps with actual remaps when doing relocations
>>
>>   fs/btrfs/Kconfig                |    2 +
>>   fs/btrfs/accessors.h            |   29 +
>>   fs/btrfs/block-group.c          |  202 +++-
>>   fs/btrfs/block-group.h          |   15 +-
>>   fs/btrfs/block-rsv.c            |    8 +
>>   fs/btrfs/block-rsv.h            |    1 +
>>   fs/btrfs/discard.c              |   11 +-
>>   fs/btrfs/disk-io.c              |   91 +-
>>   fs/btrfs/extent-tree.c          |  152 ++-
>>   fs/btrfs/free-space-tree.c      |    4 +-
>>   fs/btrfs/free-space-tree.h      |    5 +-
>>   fs/btrfs/fs.h                   |    7 +-
>>   fs/btrfs/relocation.c           | 1897 ++++++++++++++++++++++++++++++-
>>   fs/btrfs/relocation.h           |    8 +-
>>   fs/btrfs/space-info.c           |   22 +-
>>   fs/btrfs/sysfs.c                |    4 +
>>   fs/btrfs/transaction.c          |    7 +
>>   fs/btrfs/tree-checker.c         |   37 +-
>>   fs/btrfs/volumes.c              |  115 +-
>>   fs/btrfs/volumes.h              |   17 +-
>>   include/uapi/linux/btrfs.h      |    1 +
>>   include/uapi/linux/btrfs_tree.h |   29 +-
>>   22 files changed, 2444 insertions(+), 220 deletions(-)
>>
>> --
>> 2.49.0
>>
>>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 00/12] btrfs: remap tree
  2025-06-06 13:35   ` Mark Harmstone
@ 2025-06-09 16:05     ` Anand Jain
  0 siblings, 0 replies; 39+ messages in thread
From: Anand Jain @ 2025-06-09 16:05 UTC (permalink / raw)
  To: Mark Harmstone, Jonah Sabean; +Cc: linux-btrfs@vger.kernel.org

On 6/6/25 21:35, Mark Harmstone wrote:
> On 5/6/25 17:43, Jonah Sabean wrote:
>>>
>> On Thu, Jun 5, 2025 at 1:25 PM Mark Harmstone <maharmstone@fb.com> wrote:
>>>
>>> This patch series adds a disk format change gated behind
>>> CONFIG_BTRFS_EXPERIMENTAL to add a "remap tree", which acts as a layer of
>>> indirection when doing I/O. When doing relocation, rather than fixing up every
>>> tree, we instead record the old and new addresses in the remap tree. This should
>>> hopefully make things more reliable and flexible, as well as enabling some
>>> future changes we'd like to make, such as larger data extents and reducing
>>> write amplification by removing cow-only metadata items.
>>>
>>> The remap tree lives in a new REMAP chunk type. This is because bootstrapping
>>> means that it can't be remapped itself, and has to be relocated by COWing it as
>>> at present. It can't go in the SYSTEM chunk as we are then limited by the chunk
>>> item needing to fit in the superblock.
>>>
>>> For more on the design and rationale, please see my RFC sent last month[1], as
>>> well as Josef Bacik's original design document[2]. The main change from Josef's
>>> design is that I've added remap backrefs, as we need to be able to move a
>>> chunk's existing remaps before remapping it.
>>>
>>> You will also need my patches to btrfs-progs[3] to make
>>> `mkfs.btrfs -O remap-tree` work, as well as allowing `btrfs check` to recognize
>>> the new format.
>>>
>>> Changes since the RFC:
>>>
>>> * I've reduced the REMAP chunk size from the normal 1GB to 32MB, to match the
>>>     SYSTEM chunk. For a filesystem with 4KB sectors and 16KB node size, the worst
>>>     case is that one leaf covers ~1MB of data, and the best case ~250GB. For a
>>>     chunk, that implies a worst case of ~2GB and a best case of ~500TB.
>>>     This isn't a disk-format change, so we can always adjust it if it proves too
>>>     big or small in practice. mkfs creates 8MB chunks, as it does for everything.
>>
>> One thing I'd like to see fixed is the fragmentation of dev_extents on
>> striped profiles when you have less than 1G left of space, as btrfs
>> will allocate these smaller chunks across a striped array (i.e. raid0,
>> 10, 5 or 6), otherwise being able to support larger extents can be
>> made moot because you can end up with chunks as small as
>> 1MiB. Depending on if you add/remove devices often and balance often
>> you can end up with a lot of chunks across all disks that can be made
>> smaller, so one hacky way I've got around this is to align partitions
>> and force the system chunk to 1G with this patch:
>> https://pastebin.com/4PWbgEXV
>>
>> Ideally, I'd like this problem solved, but it seems to me this will
>> just add yet another small chunk in the mix that makes alignment
>> harder in this case. Really makes striping a curse on btrfs.
> 
> This is a different problem to what my patches are trying to solve, but
> yes, I can understand why that would be an issue. Sometimes you'd prefer
> the FS to ENOSPC rather than fragmenting your files.
> 
> I know one of the btrfs developers has been looking into making the
> allocator more intelligent, so I'll make sure he's aware of this.
> 


We’re adding a framework [1] to support more allocation methods, so
let’s see how that evolves.

  [1] https://asj.github.io/chunk-alloc-enhancement.html
      https://lore.kernel.org/linux-btrfs/cover.1747070147.git.anand.jain@oracle.com/


Dynamically calculating chunk sizes in striped RAID can improve free
space usage, especially when device sizes are uneven. The trade-off
is increased chunk fragmentation; that's the cost of maximizing the
space. I'm unsure about the impact as of now. One option is to
enforce fixed stripe counts and sizes, then benchmark with test
cases to assess the actual gains. Let me see if I can create a testcase.

Thanks, Anand

>>>
>>> * You can't make new allocations from remapped block groups, so I've changed
>>>     it so there's no free-space entries for these (thanks to Boris Burkov for the
>>>     suggestion).
>>>
>>> * The remap tree doesn't have metadata items in the extent tree (thanks to Josef
>>>     for the suggestion). This was to work around some corruption that delayed refs
>>>     were causing, but it also fits it with our future plans of removing all
>>>     metadata items for COW-only trees, reducing write amplification.
>>>     A knock-on effect of this is that I've had to disable balancing of the remap
>>>     chunk itself. This is because we can no longer walk the extent tree, and will
>>>     have to walk the remap tree instead. When we remove the COW-only metadata
>>>     items, we will also have to do this for the chunk and root trees, as
>>>     bootstrapping means they can't be remapped.
>>>
>>> * btrfs_translate_remap() uses search_commit_root when doing metadata lookups,
>>>     to avoid nested locking issues. This also seems to be a lot quicker (btrfs/187
>>>     went from ~20mins to ~90secs).
>>>
>>> * Unused remapped block groups should now get cleaned up more aggressively
>>>
>>> * Other miscellaneous cleanups and fixes
>>>
>>> Known issues:
>>>
>>> * Relocation still needs to be implemented for the remap tree itself (see above)
>>>
>>> * Some test failures: btrfs/156, btrfs/170, btrfs/226, btrfs/250
>>>
>>> * nodatacow extents aren't safe, as they can race with the relocation thread.
>>>     We either need to follow the btrfs_inc_nocow_writers() approach, which COWs
>>>     the extent, or change it so that it blocks here.
>>>
>>> * When initially marking a block group as remapped, we are walking the free-
>>>     space tree and creating the identity remaps all in one transaction. For the
>>>     worst-case scenario, i.e. a 1GB block group with every other sector allocated
>>>     (131,072 extents), this can result in transaction times of more than 10 mins.
>>>     This needs to be changed to allow this to happen over multiple transactions.
>>>
>>> * All this is disabled for zoned devices for the time being, as I've not been
>>>     able to test it. I'm planning to make it compatible with zoned at a later
>>>     date.
>>>
>>> Thanks
>>>
>>> [1] https://urldefense.com/v3/__https://lwn.net/Articles/1021452/__;!!Bt8RZUm9aw!5woVoadd383IuqBtW6VYdNfYTRc1ugI44XocnoPkA0gEjtp58o3ubI7wW3X5fzx58qYL4uvDpII$
>>> [2] https://github.com/btrfs/btrfs-todo/issues/54
>>> [3] https://github.com/maharmstone/btrfs-progs/tree/remap-tree
>>>
>>> Mark Harmstone (12):
>>>     btrfs: add definitions and constants for remap-tree
>>>     btrfs: add REMAP chunk type
>>>     btrfs: allow remapped chunks to have zero stripes
>>>     btrfs: remove remapped block groups from the free-space tree
>>>     btrfs: don't add metadata items for the remap tree to the extent tree
>>>     btrfs: add extended version of struct block_group_item
>>>     btrfs: allow mounting filesystems with remap-tree incompat flag
>>>     btrfs: redirect I/O for remapped block groups
>>>     btrfs: handle deletions from remapped block group
>>>     btrfs: handle setting up relocation of block group with remap-tree
>>>     btrfs: move existing remaps before relocating block group
>>>     btrfs: replace identity maps with actual remaps when doing relocations
>>>
>>>    fs/btrfs/Kconfig                |    2 +
>>>    fs/btrfs/accessors.h            |   29 +
>>>    fs/btrfs/block-group.c          |  202 +++-
>>>    fs/btrfs/block-group.h          |   15 +-
>>>    fs/btrfs/block-rsv.c            |    8 +
>>>    fs/btrfs/block-rsv.h            |    1 +
>>>    fs/btrfs/discard.c              |   11 +-
>>>    fs/btrfs/disk-io.c              |   91 +-
>>>    fs/btrfs/extent-tree.c          |  152 ++-
>>>    fs/btrfs/free-space-tree.c      |    4 +-
>>>    fs/btrfs/free-space-tree.h      |    5 +-
>>>    fs/btrfs/fs.h                   |    7 +-
>>>    fs/btrfs/relocation.c           | 1897 ++++++++++++++++++++++++++++++-
>>>    fs/btrfs/relocation.h           |    8 +-
>>>    fs/btrfs/space-info.c           |   22 +-
>>>    fs/btrfs/sysfs.c                |    4 +
>>>    fs/btrfs/transaction.c          |    7 +
>>>    fs/btrfs/tree-checker.c         |   37 +-
>>>    fs/btrfs/volumes.c              |  115 +-
>>>    fs/btrfs/volumes.h              |   17 +-
>>>    include/uapi/linux/btrfs.h      |    1 +
>>>    include/uapi/linux/btrfs_tree.h |   29 +-
>>>    22 files changed, 2444 insertions(+), 220 deletions(-)
>>>
>>> --
>>> 2.49.0
>>>
>>>
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 00/12] btrfs: remap tree
  2025-06-05 16:23 [PATCH 00/12] btrfs: remap tree Mark Harmstone
                   ` (12 preceding siblings ...)
  2025-06-05 16:43 ` [PATCH 00/12] btrfs: remap tree Jonah Sabean
@ 2025-06-09 18:51 ` David Sterba
  2025-06-10  9:19   ` Mark Harmstone
  2025-06-10 14:31 ` Mark Harmstone
                   ` (3 subsequent siblings)
  17 siblings, 1 reply; 39+ messages in thread
From: David Sterba @ 2025-06-09 18:51 UTC (permalink / raw)
  To: Mark Harmstone; +Cc: linux-btrfs

On Thu, Jun 05, 2025 at 05:23:30PM +0100, Mark Harmstone wrote:
> * The remap tree doesn't have metadata items in the extent tree (thanks to Josef
>   for the suggestion). This was to work around some corruption that delayed refs
>   were causing, but it also fits in with our future plans of removing all
>   metadata items for COW-only trees, reducing write amplification.

>   A knock-on effect of this is that I've had to disable balancing of the remap
>   chunk itself.

Not relocatable at all? How will the shrink or device deletion work,
this uses relocation to move the chunks.

>   This is because we can no longer walk the extent tree, and will
>   have to walk the remap tree instead. When we remove the COW-only metadata
>   items, we will also have to do this for the chunk and root trees, as
>   bootstrapping means they can't be remapped.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 00/12] btrfs: remap tree
  2025-06-09 18:51 ` David Sterba
@ 2025-06-10  9:19   ` Mark Harmstone
  0 siblings, 0 replies; 39+ messages in thread
From: Mark Harmstone @ 2025-06-10  9:19 UTC (permalink / raw)
  To: dsterba@suse.cz; +Cc: linux-btrfs@vger.kernel.org

On 9/6/25 19:51, David Sterba wrote:
> > 
> On Thu, Jun 05, 2025 at 05:23:30PM +0100, Mark Harmstone wrote:
>> * The remap tree doesn't have metadata items in the extent tree (thanks to Josef
>>    for the suggestion). This was to work around some corruption that delayed refs
>>    were causing, but it also fits in with our future plans of removing all
>>    metadata items for COW-only trees, reducing write amplification.
> 
>>    A knock-on effect of this is that I've had to disable balancing of the remap
>>    chunk itself.
> 
> Not relocatable at all? How will the shrink or device deletion work,
> this uses relocation to move the chunks.

I know, it won't. At the moment this is more-or-less limited to 
single-volume devices that you don't shrink.

This is something I'm going to address with a future patch before this 
leaves experimental. It won't require any more format changes, it's just 
that this bit is tricky to get right.

> 
>>    This is because we can no longer walk the extent tree, and will
>>    have to walk the remap tree instead. When we remove the COW-only metadata
>>    items, we will also have to do this for the chunk and root trees, as
>>    bootstrapping means they can't be remapped.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 00/12] btrfs: remap tree
  2025-06-05 16:23 [PATCH 00/12] btrfs: remap tree Mark Harmstone
                   ` (13 preceding siblings ...)
  2025-06-09 18:51 ` David Sterba
@ 2025-06-10 14:31 ` Mark Harmstone
  2025-06-10 23:56   ` Qu Wenruo
  2025-06-11 15:28 ` Mark Harmstone
                   ` (2 subsequent siblings)
  17 siblings, 1 reply; 39+ messages in thread
From: Mark Harmstone @ 2025-06-10 14:31 UTC (permalink / raw)
  To: linux-btrfs@vger.kernel.org

On 5/6/25 17:23, Mark Harmstone wrote:
> * Some test failures: btrfs/156, btrfs/170, btrfs/226, btrfs/250

These all turned out to be spurious.

btrfs/226 is broken for me on the btrfs/for-next branch that I based 
these on (254ae2606b258a63b5063bed03bb4cf87a688502)

btrfs/156, btrfs/170, and btrfs/250 all involve creating small 
filesystems, which are then ENOSPCing because of the extra REMAP chunk.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 00/12] btrfs: remap tree
  2025-06-10 14:31 ` Mark Harmstone
@ 2025-06-10 23:56   ` Qu Wenruo
  2025-06-11  8:06     ` Mark Harmstone
  0 siblings, 1 reply; 39+ messages in thread
From: Qu Wenruo @ 2025-06-10 23:56 UTC (permalink / raw)
  To: Mark Harmstone, linux-btrfs@vger.kernel.org



On 2025/6/11 00:01, Mark Harmstone wrote:
> On 5/6/25 17:23, Mark Harmstone wrote:
>> * Some test failures: btrfs/156, btrfs/170, btrfs/226, btrfs/250
> 
> These all turned out to be spurious.
> 
> btrfs/226 is broken for me on the btrfs/for-next branch that I based
> these on (254ae2606b258a63b5063bed03bb4cf87a688502)

You may need to update fstests, as a recent kernel change requires 
nodatasum for NOWAIT.

And fstest commit 7e92cb991b0b ("fstests: btrfs/226: use nodatasum mount 
option to prevent false alerts") updated the test case to handle the 
kernel change.

> 
> btrfs/156, btrfs/170, and btrfs/250 all involve creating small
> filesystems, which are then ENOSPCing because of the extra REMAP chunk.

I do not have a good idea how to handle those cases.

E.g. the test case btrfs/156 creates a 1G fs; although small, it 
should still be fine for most cases.

If even a single 32MiB remap chunk is causing ENOSPC, it may indicate 
more ENOSPC in the real world.

Thanks,
Qu


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 00/12] btrfs: remap tree
  2025-06-10 23:56   ` Qu Wenruo
@ 2025-06-11  8:06     ` Mark Harmstone
  0 siblings, 0 replies; 39+ messages in thread
From: Mark Harmstone @ 2025-06-11  8:06 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs@vger.kernel.org

Thanks Qu.

On 11/6/25 00:56, Qu Wenruo wrote:
> On 2025/6/11 00:01, Mark Harmstone wrote:
>> On 5/6/25 17:23, Mark Harmstone wrote:
>>> * Some test failures: btrfs/156, btrfs/170, btrfs/226, btrfs/250
>>
>> These all turned out to be spurious.
>>
>> btrfs/226 is broken for me on the btrfs/for-next branch that I based
>> these on (254ae2606b258a63b5063bed03bb4cf87a688502)
> 
> You may need to update fstests, as a recent kernel change requires 
> nodatasum for NOWAIT.
> 
> And fstest commit 7e92cb991b0b ("fstests: btrfs/226: use nodatasum mount 
> option to prevent false alerts") updated the test case to handle the 
> kernel change.

Makes sense, thank you.

>> btrfs/156, btrfs/170, and btrfs/250 all involve creating small
>> filesystems, which are then ENOSPCing because of the extra REMAP chunk.
> 
> I do not have a good idea how to handle those cases.
> 
> E.g, the test case btrfs/156 is creating a 1G fs, although small it 
> should still be fine for most cases.
> 
> If even a single 32MiB remap chunk is causing ENOSPC, it may indicate 
> more ENOSPC in the real world.

Probably 8MB actually, as IIRC that's the chunk size that mkfs uses for 
everything. It'd be 32MB the second and subsequent times round.

I'll investigate the tests properly, but I'm fairly sure they're not 
diagnosing a problem with this particular patchset.

Mark

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 00/12] btrfs: remap tree
  2025-06-05 16:23 [PATCH 00/12] btrfs: remap tree Mark Harmstone
                   ` (14 preceding siblings ...)
  2025-06-10 14:31 ` Mark Harmstone
@ 2025-06-11 15:28 ` Mark Harmstone
  2025-06-14  0:04 ` Boris Burkov
  2025-06-26 22:10 ` Mark Harmstone
  17 siblings, 0 replies; 39+ messages in thread
From: Mark Harmstone @ 2025-06-11 15:28 UTC (permalink / raw)
  To: linux-btrfs@vger.kernel.org

It looks like one of the nice side-effects of this patchset is a ~7% 
improvement in balance speed. These figures are from an NVMe drive, 
freshly rsync'd, with 69GB of 236GB used.

Without remap-tree, first time:
balanced 71 data chunks in 556s (= 7.8s each)
balanced 2 metadata chunks in 3s (= 1.5s each)

Without remap-tree, second time:
balanced 70 data chunks in 574s (= 8.2s each)
balanced 2 metadata chunks in 3s (= 1.5s each)

(same FS after being converted with btrfstune)
With remap-tree, first time:
balanced 71 data chunks in 533s (= 7.5s each)
balanced 2 metadata chunks in 5s (= 2.5s each)

With remap-tree, second time:
balanced 71 data chunks in 527s (= 7.4s each)
balanced 2 metadata chunks in 7s (= 3.5s each)

I've included the metadata balances as well, but I think they're too 
quick to be statistically significant.

I ran the balances twice because the remap-tree version has the 
potential to take longer the second time, as the move_existing_remaps() 
code will be called.

Mark

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 01/12] btrfs: add definitions and constants for remap-tree
  2025-06-05 16:23 ` [PATCH 01/12] btrfs: add definitions and constants for remap-tree Mark Harmstone
@ 2025-06-13 21:02   ` Boris Burkov
  0 siblings, 0 replies; 39+ messages in thread
From: Boris Burkov @ 2025-06-13 21:02 UTC (permalink / raw)
  To: Mark Harmstone; +Cc: linux-btrfs

On Thu, Jun 05, 2025 at 05:23:31PM +0100, Mark Harmstone wrote:
> Add an incompat flag for the new remap-tree feature, and the constants
> and definitions needed to support it.
> 
> Signed-off-by: Mark Harmstone <maharmstone@fb.com>
> ---
>  fs/btrfs/accessors.h            |  3 +++
>  fs/btrfs/sysfs.c                |  2 ++
>  fs/btrfs/tree-checker.c         |  6 ++++--
>  fs/btrfs/volumes.c              |  1 +
>  include/uapi/linux/btrfs.h      |  1 +
>  include/uapi/linux/btrfs_tree.h | 12 ++++++++++++
>  6 files changed, 23 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/accessors.h b/fs/btrfs/accessors.h
> index 15ea6348800b..5f5eda8d6f9e 100644
> --- a/fs/btrfs/accessors.h
> +++ b/fs/btrfs/accessors.h
> @@ -1046,6 +1046,9 @@ BTRFS_SETGET_STACK_FUNCS(stack_verity_descriptor_encryption,
>  BTRFS_SETGET_STACK_FUNCS(stack_verity_descriptor_size,
>  			 struct btrfs_verity_descriptor_item, size, 64);
>  
> +BTRFS_SETGET_FUNCS(remap_address, struct btrfs_remap, address, 64);
> +BTRFS_SETGET_STACK_FUNCS(stack_remap_address, struct btrfs_remap, address, 64);
> +
>  /* Cast into the data area of the leaf. */
>  #define btrfs_item_ptr(leaf, slot, type)				\
>  	((type *)(btrfs_item_nr_offset(leaf, 0) + btrfs_item_offset(leaf, slot)))
> diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
> index f186c8082eff..831c25c2fb25 100644
> --- a/fs/btrfs/sysfs.c
> +++ b/fs/btrfs/sysfs.c
> @@ -292,6 +292,7 @@ BTRFS_FEAT_ATTR_COMPAT_RO(free_space_tree, FREE_SPACE_TREE);
>  BTRFS_FEAT_ATTR_COMPAT_RO(block_group_tree, BLOCK_GROUP_TREE);
>  BTRFS_FEAT_ATTR_INCOMPAT(raid1c34, RAID1C34);
>  BTRFS_FEAT_ATTR_INCOMPAT(simple_quota, SIMPLE_QUOTA);
> +BTRFS_FEAT_ATTR_INCOMPAT(remap_tree, REMAP_TREE);
>  #ifdef CONFIG_BLK_DEV_ZONED
>  BTRFS_FEAT_ATTR_INCOMPAT(zoned, ZONED);
>  #endif
> @@ -326,6 +327,7 @@ static struct attribute *btrfs_supported_feature_attrs[] = {
>  	BTRFS_FEAT_ATTR_PTR(raid1c34),
>  	BTRFS_FEAT_ATTR_PTR(block_group_tree),
>  	BTRFS_FEAT_ATTR_PTR(simple_quota),
> +	BTRFS_FEAT_ATTR_PTR(remap_tree),
>  #ifdef CONFIG_BLK_DEV_ZONED
>  	BTRFS_FEAT_ATTR_PTR(zoned),
>  #endif
> diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
> index 8f4703b488b7..a83fb828723a 100644
> --- a/fs/btrfs/tree-checker.c
> +++ b/fs/btrfs/tree-checker.c
> @@ -913,11 +913,13 @@ int btrfs_check_chunk_valid(const struct btrfs_fs_info *fs_info,
>  		return -EUCLEAN;
>  	}
>  	if (unlikely(type & ~(BTRFS_BLOCK_GROUP_TYPE_MASK |
> -			      BTRFS_BLOCK_GROUP_PROFILE_MASK))) {
> +			      BTRFS_BLOCK_GROUP_PROFILE_MASK |
> +			      BTRFS_BLOCK_GROUP_REMAPPED))) {

I'm not sure at what point we should start defining a mask variable for
these, but probably once there are at least 3 of them. Wouldn't hate one
here with the second, either.

>  		chunk_err(fs_info, leaf, chunk, logical,
>  			  "unrecognized chunk type: 0x%llx",
>  			  ~(BTRFS_BLOCK_GROUP_TYPE_MASK |
> -			    BTRFS_BLOCK_GROUP_PROFILE_MASK) & type);
> +			    BTRFS_BLOCK_GROUP_PROFILE_MASK |
> +			    BTRFS_BLOCK_GROUP_REMAPPED) & type);
>  		return -EUCLEAN;
>  	}
>  
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 1535a425e8f9..3e53bde0e605 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -234,6 +234,7 @@ void btrfs_describe_block_groups(u64 bg_flags, char *buf, u32 size_buf)
>  	DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_DATA, "data");
>  	DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_SYSTEM, "system");
>  	DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_METADATA, "metadata");
> +	DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_REMAPPED, "remapped");
>  
>  	DESCRIBE_FLAG(BTRFS_AVAIL_ALLOC_BIT_SINGLE, "single");
>  	for (i = 0; i < BTRFS_NR_RAID_TYPES; i++)
> diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
> index dd02160015b2..d857cdc7694a 100644
> --- a/include/uapi/linux/btrfs.h
> +++ b/include/uapi/linux/btrfs.h
> @@ -336,6 +336,7 @@ struct btrfs_ioctl_fs_info_args {
>  #define BTRFS_FEATURE_INCOMPAT_EXTENT_TREE_V2	(1ULL << 13)
>  #define BTRFS_FEATURE_INCOMPAT_RAID_STRIPE_TREE	(1ULL << 14)
>  #define BTRFS_FEATURE_INCOMPAT_SIMPLE_QUOTA	(1ULL << 16)
> +#define BTRFS_FEATURE_INCOMPAT_REMAP_TREE	(1ULL << 17)
>  
>  struct btrfs_ioctl_feature_flags {
>  	__u64 compat_flags;
> diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
> index fc29d273845d..4439d77a7252 100644
> --- a/include/uapi/linux/btrfs_tree.h
> +++ b/include/uapi/linux/btrfs_tree.h
> @@ -76,6 +76,9 @@
>  /* Tracks RAID stripes in block groups. */
>  #define BTRFS_RAID_STRIPE_TREE_OBJECTID 12ULL
>  
> +/* Holds details of remapped addresses after relocation. */
> +#define BTRFS_REMAP_TREE_OBJECTID 13ULL
> +
>  /* device stats in the device tree */
>  #define BTRFS_DEV_STATS_OBJECTID 0ULL
>  
> @@ -282,6 +285,10 @@
>  
>  #define BTRFS_RAID_STRIPE_KEY	230
>  
> +#define BTRFS_IDENTITY_REMAP_KEY 	234
> +#define BTRFS_REMAP_KEY		 	235
> +#define BTRFS_REMAP_BACKREF_KEY	 	236
> +
>  /*
>   * Records the overall state of the qgroups.
>   * There's only one instance of this key present,
> @@ -1161,6 +1168,7 @@ struct btrfs_dev_replace_item {
>  #define BTRFS_BLOCK_GROUP_RAID6         (1ULL << 8)
>  #define BTRFS_BLOCK_GROUP_RAID1C3       (1ULL << 9)
>  #define BTRFS_BLOCK_GROUP_RAID1C4       (1ULL << 10)
> +#define BTRFS_BLOCK_GROUP_REMAPPED      (1ULL << 11)
>  #define BTRFS_BLOCK_GROUP_RESERVED	(BTRFS_AVAIL_ALLOC_BIT_SINGLE | \
>  					 BTRFS_SPACE_INFO_GLOBAL_RSV)
>  
> @@ -1323,4 +1331,8 @@ struct btrfs_verity_descriptor_item {
>  	__u8 encryption;
>  } __attribute__ ((__packed__));
>  
> +struct btrfs_remap {
> +	__le64 address;

Have we ever discussed any possible future extensions or more exotic
remappings? Would any metadata about the tree or type of remapping help
with the trickier relocation issues?

> +} __attribute__ ((__packed__));
> +
>  #endif /* _BTRFS_CTREE_H_ */
> -- 
> 2.49.0
> 

* Re: [PATCH 02/12] btrfs: add REMAP chunk type
  2025-06-05 16:23 ` [PATCH 02/12] btrfs: add REMAP chunk type Mark Harmstone
@ 2025-06-13 21:22   ` Boris Burkov
  0 siblings, 0 replies; 39+ messages in thread
From: Boris Burkov @ 2025-06-13 21:22 UTC (permalink / raw)
  To: Mark Harmstone; +Cc: linux-btrfs

On Thu, Jun 05, 2025 at 05:23:32PM +0100, Mark Harmstone wrote:
> Add a new REMAP chunk type, which is a metadata chunk that holds the
> remap tree.
> 
> This is needed for bootstrapping purposes: the remap tree can't itself

I think it's pretty clear why, *now*, while we're thinking about it.
However, I think a longer, more detailed explanation of bootstrapping
and the constraints on the remap tree would be quite helpful.

i.e.,
Why can't it be remapped? (where do you look up the remappings for its
own blocks?!)

What if you put it in metadata BGs and made those un-relocatable (that
would contaminate too many of them?)

Why can't you make it generic metadata, then while relocating check what
tree a block is from? If remap tree, then cow it, else do the normal
remap thing. (complexity? performance?)

Since this seems to be one of the more contentious parts of the
proposal, fleshing out ALL the reasoning is helpful, IMO.

Along those lines and moving away from the exact self relocating and
bootstrapping problem details:

What are the drawbacks to a new space info? How serious are they given
that we don't expect the remap tree to be very large (probably include
your reasoned bounding estimates too...).

Are there any other space infos we already have, or would consider
adding? What happens if we move forward on plans to parallelize the
system further: would we add multiple data space infos? That might be a
useful argument that while this is the weird "4th" space info, there
could also be a 5th and 6th and 7th etc., and that's OK. I'd also be
curious to get Josef's thoughts on this, for example.

> be remapped, and must be relocated the existing way, by COWing every
> leaf. The remap tree can't go in the SYSTEM chunk as space there is
> limited, because a copy of the chunk item gets placed in the superblock.
> 
> The changes in fs/btrfs/volumes.h are because we're adding a new block
> group type bit after the profile bits, and so can no longer rely on the
> const_ilog2 trick.
> 
> Signed-off-by: Mark Harmstone <maharmstone@fb.com>
> ---
>  fs/btrfs/block-rsv.c            |  8 ++++++++
>  fs/btrfs/block-rsv.h            |  1 +
>  fs/btrfs/disk-io.c              |  1 +
>  fs/btrfs/fs.h                   |  2 ++
>  fs/btrfs/space-info.c           | 13 ++++++++++++-
>  fs/btrfs/sysfs.c                |  2 ++
>  fs/btrfs/tree-checker.c         |  5 +++--
>  fs/btrfs/volumes.c              |  1 +
>  fs/btrfs/volumes.h              | 11 +++++++++--
>  include/uapi/linux/btrfs_tree.h |  4 +++-
>  10 files changed, 42 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/btrfs/block-rsv.c b/fs/btrfs/block-rsv.c
> index 5ad6de738aee..2678cd3bed29 100644
> --- a/fs/btrfs/block-rsv.c
> +++ b/fs/btrfs/block-rsv.c
> @@ -421,6 +421,9 @@ void btrfs_init_root_block_rsv(struct btrfs_root *root)
>  	case BTRFS_TREE_LOG_OBJECTID:
>  		root->block_rsv = &fs_info->treelog_rsv;
>  		break;
> +	case BTRFS_REMAP_TREE_OBJECTID:
> +		root->block_rsv = &fs_info->remap_block_rsv;
> +		break;
>  	default:
>  		root->block_rsv = NULL;
>  		break;
> @@ -434,6 +437,9 @@ void btrfs_init_global_block_rsv(struct btrfs_fs_info *fs_info)
>  	space_info = btrfs_find_space_info(fs_info, BTRFS_BLOCK_GROUP_SYSTEM);
>  	fs_info->chunk_block_rsv.space_info = space_info;
>  
> +	space_info = btrfs_find_space_info(fs_info, BTRFS_BLOCK_GROUP_REMAP);
> +	fs_info->remap_block_rsv.space_info = space_info;
> +
>  	space_info = btrfs_find_space_info(fs_info, BTRFS_BLOCK_GROUP_METADATA);
>  	fs_info->global_block_rsv.space_info = space_info;
>  	fs_info->trans_block_rsv.space_info = space_info;
> @@ -460,6 +466,8 @@ void btrfs_release_global_block_rsv(struct btrfs_fs_info *fs_info)
>  	WARN_ON(fs_info->trans_block_rsv.reserved > 0);
>  	WARN_ON(fs_info->chunk_block_rsv.size > 0);
>  	WARN_ON(fs_info->chunk_block_rsv.reserved > 0);
> +	WARN_ON(fs_info->remap_block_rsv.size > 0);
> +	WARN_ON(fs_info->remap_block_rsv.reserved > 0);
>  	WARN_ON(fs_info->delayed_block_rsv.size > 0);
>  	WARN_ON(fs_info->delayed_block_rsv.reserved > 0);
>  	WARN_ON(fs_info->delayed_refs_rsv.reserved > 0);
> diff --git a/fs/btrfs/block-rsv.h b/fs/btrfs/block-rsv.h
> index 79ae9d05cd91..8359fb96bc3c 100644
> --- a/fs/btrfs/block-rsv.h
> +++ b/fs/btrfs/block-rsv.h
> @@ -22,6 +22,7 @@ enum btrfs_rsv_type {
>  	BTRFS_BLOCK_RSV_DELALLOC,
>  	BTRFS_BLOCK_RSV_TRANS,
>  	BTRFS_BLOCK_RSV_CHUNK,
> +	BTRFS_BLOCK_RSV_REMAP,

Why do we need a new block_rsv_type? Please include that justification
in your commit as well.

>  	BTRFS_BLOCK_RSV_DELOPS,
>  	BTRFS_BLOCK_RSV_DELREFS,
>  	BTRFS_BLOCK_RSV_TREELOG,
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 5a565bf96bf8..60cce96a9ec4 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2830,6 +2830,7 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
>  			     BTRFS_BLOCK_RSV_GLOBAL);
>  	btrfs_init_block_rsv(&fs_info->trans_block_rsv, BTRFS_BLOCK_RSV_TRANS);
>  	btrfs_init_block_rsv(&fs_info->chunk_block_rsv, BTRFS_BLOCK_RSV_CHUNK);
> +	btrfs_init_block_rsv(&fs_info->remap_block_rsv, BTRFS_BLOCK_RSV_REMAP);
>  	btrfs_init_block_rsv(&fs_info->treelog_rsv, BTRFS_BLOCK_RSV_TREELOG);
>  	btrfs_init_block_rsv(&fs_info->empty_block_rsv, BTRFS_BLOCK_RSV_EMPTY);
>  	btrfs_init_block_rsv(&fs_info->delayed_block_rsv,
> diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
> index 5e48ed252fd0..07ac1a96477a 100644
> --- a/fs/btrfs/fs.h
> +++ b/fs/btrfs/fs.h
> @@ -476,6 +476,8 @@ struct btrfs_fs_info {
>  	struct btrfs_block_rsv trans_block_rsv;
>  	/* Block reservation for chunk tree */
>  	struct btrfs_block_rsv chunk_block_rsv;
> +	/* Block reservation for remap tree */
> +	struct btrfs_block_rsv remap_block_rsv;
>  	/* Block reservation for delayed operations */
>  	struct btrfs_block_rsv delayed_block_rsv;
>  	/* Block reservation for delayed refs */
> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> index 517916004f21..6471861c4b25 100644
> --- a/fs/btrfs/space-info.c
> +++ b/fs/btrfs/space-info.c
> @@ -215,7 +215,7 @@ static u64 calc_chunk_size(const struct btrfs_fs_info *fs_info, u64 flags)
>  
>  	if (flags & BTRFS_BLOCK_GROUP_DATA)
>  		return BTRFS_MAX_DATA_CHUNK_SIZE;
> -	else if (flags & BTRFS_BLOCK_GROUP_SYSTEM)
> +	else if (flags & (BTRFS_BLOCK_GROUP_SYSTEM | BTRFS_BLOCK_GROUP_REMAP))

What is the rationale for this sizing? Experimentation? Estimate?

>  		return SZ_32M;
>  
>  	/* Handle BTRFS_BLOCK_GROUP_METADATA */
> @@ -343,6 +343,8 @@ int btrfs_init_space_info(struct btrfs_fs_info *fs_info)
>  	if (mixed) {
>  		flags = BTRFS_BLOCK_GROUP_METADATA | BTRFS_BLOCK_GROUP_DATA;
>  		ret = create_space_info(fs_info, flags);
> +		if (ret)
> +			goto out;
>  	} else {
>  		flags = BTRFS_BLOCK_GROUP_METADATA;
>  		ret = create_space_info(fs_info, flags);
> @@ -351,7 +353,15 @@ int btrfs_init_space_info(struct btrfs_fs_info *fs_info)
>  
>  		flags = BTRFS_BLOCK_GROUP_DATA;
>  		ret = create_space_info(fs_info, flags);
> +		if (ret)
> +			goto out;
> +	}
> +
> +	if (features & BTRFS_FEATURE_INCOMPAT_REMAP_TREE) {
> +		flags = BTRFS_BLOCK_GROUP_REMAP;
> +		ret = create_space_info(fs_info, flags);
>  	}
> +
>  out:
>  	return ret;
>  }
> @@ -590,6 +600,7 @@ static void dump_global_block_rsv(struct btrfs_fs_info *fs_info)
>  	DUMP_BLOCK_RSV(fs_info, global_block_rsv);
>  	DUMP_BLOCK_RSV(fs_info, trans_block_rsv);
>  	DUMP_BLOCK_RSV(fs_info, chunk_block_rsv);
> +	DUMP_BLOCK_RSV(fs_info, remap_block_rsv);
>  	DUMP_BLOCK_RSV(fs_info, delayed_block_rsv);
>  	DUMP_BLOCK_RSV(fs_info, delayed_refs_rsv);
>  }
> diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
> index 831c25c2fb25..aa98d99833fd 100644
> --- a/fs/btrfs/sysfs.c
> +++ b/fs/btrfs/sysfs.c
> @@ -1985,6 +1985,8 @@ static const char *alloc_name(struct btrfs_space_info *space_info)
>  	case BTRFS_BLOCK_GROUP_SYSTEM:
>  		ASSERT(space_info->subgroup_id == BTRFS_SUB_GROUP_PRIMARY);
>  		return "system";
> +	case BTRFS_BLOCK_GROUP_REMAP:
> +		return "remap";
>  	default:
>  		WARN_ON(1);
>  		return "invalid-combination";
> diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
> index a83fb828723a..0505f8d76581 100644
> --- a/fs/btrfs/tree-checker.c
> +++ b/fs/btrfs/tree-checker.c
> @@ -751,13 +751,14 @@ static int check_block_group_item(struct extent_buffer *leaf,
>  	if (unlikely(type != BTRFS_BLOCK_GROUP_DATA &&
>  		     type != BTRFS_BLOCK_GROUP_METADATA &&
>  		     type != BTRFS_BLOCK_GROUP_SYSTEM &&
> +		     type != BTRFS_BLOCK_GROUP_REMAP &&

We should fail if the type is REMAP and we don't have the flag.

>  		     type != (BTRFS_BLOCK_GROUP_METADATA |
>  			      BTRFS_BLOCK_GROUP_DATA))) {
>  		block_group_err(leaf, slot,
> -"invalid type, have 0x%llx (%lu bits set) expect either 0x%llx, 0x%llx, 0x%llx or 0x%llx",
> +"invalid type, have 0x%llx (%lu bits set) expect either 0x%llx, 0x%llx, 0x%llx, 0x%llx or 0x%llx",
>  			type, hweight64(type),
>  			BTRFS_BLOCK_GROUP_DATA, BTRFS_BLOCK_GROUP_METADATA,
> -			BTRFS_BLOCK_GROUP_SYSTEM,
> +			BTRFS_BLOCK_GROUP_SYSTEM, BTRFS_BLOCK_GROUP_REMAP,
>  			BTRFS_BLOCK_GROUP_METADATA | BTRFS_BLOCK_GROUP_DATA);
>  		return -EUCLEAN;
>  	}
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 3e53bde0e605..e7c467b6af46 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -234,6 +234,7 @@ void btrfs_describe_block_groups(u64 bg_flags, char *buf, u32 size_buf)
>  	DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_DATA, "data");
>  	DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_SYSTEM, "system");
>  	DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_METADATA, "metadata");
> +	DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_REMAP, "remap");
>  	DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_REMAPPED, "remapped");

This is a little confusing on the surface. I think comments for each,
like the following, are merited:

/* block groups containing the remap tree */
and
/* block group that has been remapped */

>  
>  	DESCRIBE_FLAG(BTRFS_AVAIL_ALLOC_BIT_SINGLE, "single");
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index 6d8b1f38e3ee..9fb8fe4312a5 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -59,8 +59,6 @@ static_assert(const_ilog2(BTRFS_STRIPE_LEN) == BTRFS_STRIPE_LEN_SHIFT);
>   */
>  static_assert(const_ffs(BTRFS_BLOCK_GROUP_RAID0) <
>  	      const_ffs(BTRFS_BLOCK_GROUP_PROFILE_MASK & ~BTRFS_BLOCK_GROUP_RAID0));
> -static_assert(const_ilog2(BTRFS_BLOCK_GROUP_RAID0) >
> -	      ilog2(BTRFS_BLOCK_GROUP_TYPE_MASK));
>  
>  /* ilog2() can handle both constants and variables */
>  #define BTRFS_BG_FLAG_TO_INDEX(profile)					\
> @@ -82,6 +80,15 @@ enum btrfs_raid_types {
>  	BTRFS_NR_RAID_TYPES
>  };
>  
> +static_assert(BTRFS_RAID_RAID0 == 1);
> +static_assert(BTRFS_RAID_RAID1 == 2);
> +static_assert(BTRFS_RAID_DUP == 3);
> +static_assert(BTRFS_RAID_RAID10 == 4);
> +static_assert(BTRFS_RAID_RAID5 == 5);
> +static_assert(BTRFS_RAID_RAID6 == 6);
> +static_assert(BTRFS_RAID_RAID1C3 == 7);
> +static_assert(BTRFS_RAID_RAID1C4 == 8);
> +
>  /*
>   * Use sequence counter to get consistent device stat data on
>   * 32-bit processors.
> diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
> index 4439d77a7252..9a36f0206d90 100644
> --- a/include/uapi/linux/btrfs_tree.h
> +++ b/include/uapi/linux/btrfs_tree.h
> @@ -1169,12 +1169,14 @@ struct btrfs_dev_replace_item {
>  #define BTRFS_BLOCK_GROUP_RAID1C3       (1ULL << 9)
>  #define BTRFS_BLOCK_GROUP_RAID1C4       (1ULL << 10)
>  #define BTRFS_BLOCK_GROUP_REMAPPED      (1ULL << 11)
> +#define BTRFS_BLOCK_GROUP_REMAP         (1ULL << 12)
>  #define BTRFS_BLOCK_GROUP_RESERVED	(BTRFS_AVAIL_ALLOC_BIT_SINGLE | \
>  					 BTRFS_SPACE_INFO_GLOBAL_RSV)
>  
>  #define BTRFS_BLOCK_GROUP_TYPE_MASK	(BTRFS_BLOCK_GROUP_DATA |    \
>  					 BTRFS_BLOCK_GROUP_SYSTEM |  \
> -					 BTRFS_BLOCK_GROUP_METADATA)
> +					 BTRFS_BLOCK_GROUP_METADATA | \
> +					 BTRFS_BLOCK_GROUP_REMAP)
>  
>  #define BTRFS_BLOCK_GROUP_PROFILE_MASK	(BTRFS_BLOCK_GROUP_RAID0 |   \
>  					 BTRFS_BLOCK_GROUP_RAID1 |   \
> -- 
> 2.49.0
> 

* Re: [PATCH 03/12] btrfs: allow remapped chunks to have zero stripes
  2025-06-05 16:23 ` [PATCH 03/12] btrfs: allow remapped chunks to have zero stripes Mark Harmstone
@ 2025-06-13 21:41   ` Boris Burkov
  2025-08-08 14:12     ` Mark Harmstone
  0 siblings, 1 reply; 39+ messages in thread
From: Boris Burkov @ 2025-06-13 21:41 UTC (permalink / raw)
  To: Mark Harmstone; +Cc: linux-btrfs

On Thu, Jun 05, 2025 at 05:23:33PM +0100, Mark Harmstone wrote:
> When a chunk has been fully remapped, we are going to set its
> num_stripes to 0, as it will no longer represent a physical location on
> disk.
> 
> Change tree-checker to allow for this, and fix a couple of
> divide-by-zeroes seen elsewhere.
> 
> Signed-off-by: Mark Harmstone <maharmstone@fb.com>
> ---
>  fs/btrfs/tree-checker.c | 16 +++++++++-------
>  fs/btrfs/volumes.c      |  8 +++++++-
>  2 files changed, 16 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
> index 0505f8d76581..fd83df06e3fb 100644
> --- a/fs/btrfs/tree-checker.c
> +++ b/fs/btrfs/tree-checker.c
> @@ -829,7 +829,7 @@ int btrfs_check_chunk_valid(const struct btrfs_fs_info *fs_info,
>  	u64 type;
>  	u64 features;
>  	u32 chunk_sector_size;
> -	bool mixed = false;
> +	bool mixed = false, remapped;

nit:
I *personally* don't like such declarations. I think they are lightly if
not decisively discouraged in the Linux Coding Style document as well
(at least to my reading). In our code, I found some examples where they
are used when both variables are related and both get assigned. But I haven't
seen any like this for unrelated variables with mixed assignment.

>  	int raid_index;
>  	int nparity;
>  	int ncopies;
> @@ -853,12 +853,14 @@ int btrfs_check_chunk_valid(const struct btrfs_fs_info *fs_info,
>  	ncopies = btrfs_raid_array[raid_index].ncopies;
>  	nparity = btrfs_raid_array[raid_index].nparity;
>  
> -	if (unlikely(!num_stripes)) {
> +	remapped = type & BTRFS_BLOCK_GROUP_REMAPPED;
> +
> +	if (unlikely(!remapped && !num_stripes)) {
>  		chunk_err(fs_info, leaf, chunk, logical,
>  			  "invalid chunk num_stripes, have %u", num_stripes);
>  		return -EUCLEAN;
>  	}
> -	if (unlikely(num_stripes < ncopies)) {
> +	if (unlikely(!remapped && num_stripes < ncopies)) {

I think remapped only permits you exactly num_stripes == 0, not
num_stripes = 2 if ncopies = 3, right? Though it makes the logic less
neat, I would make the check as precise as possible on the invariants.

>  		chunk_err(fs_info, leaf, chunk, logical,
>  			  "invalid chunk num_stripes < ncopies, have %u < %d",
>  			  num_stripes, ncopies);
> @@ -960,7 +962,7 @@ int btrfs_check_chunk_valid(const struct btrfs_fs_info *fs_info,
>  		}
>  	}
>  
> -	if (unlikely((type & BTRFS_BLOCK_GROUP_RAID10 &&
> +	if (unlikely(!remapped && ((type & BTRFS_BLOCK_GROUP_RAID10 &&
>  		      sub_stripes != btrfs_raid_array[BTRFS_RAID_RAID10].sub_stripes) ||
>  		     (type & BTRFS_BLOCK_GROUP_RAID1 &&
>  		      num_stripes != btrfs_raid_array[BTRFS_RAID_RAID1].devs_min) ||
> @@ -975,7 +977,7 @@ int btrfs_check_chunk_valid(const struct btrfs_fs_info *fs_info,
>  		     (type & BTRFS_BLOCK_GROUP_DUP &&
>  		      num_stripes != btrfs_raid_array[BTRFS_RAID_DUP].dev_stripes) ||
>  		     ((type & BTRFS_BLOCK_GROUP_PROFILE_MASK) == 0 &&
> -		      num_stripes != btrfs_raid_array[BTRFS_RAID_SINGLE].dev_stripes))) {
> +		      num_stripes != btrfs_raid_array[BTRFS_RAID_SINGLE].dev_stripes)))) {
>  		chunk_err(fs_info, leaf, chunk, logical,
>  			"invalid num_stripes:sub_stripes %u:%u for profile %llu",
>  			num_stripes, sub_stripes,
> @@ -999,11 +1001,11 @@ static int check_leaf_chunk_item(struct extent_buffer *leaf,
>  	struct btrfs_fs_info *fs_info = leaf->fs_info;
>  	int num_stripes;
>  
> -	if (unlikely(btrfs_item_size(leaf, slot) < sizeof(struct btrfs_chunk))) {
> +	if (unlikely(btrfs_item_size(leaf, slot) < offsetof(struct btrfs_chunk, stripe))) {
>  		chunk_err(fs_info, leaf, chunk, key->offset,
>  			"invalid chunk item size: have %u expect [%zu, %u)",
>  			btrfs_item_size(leaf, slot),
> -			sizeof(struct btrfs_chunk),
> +			offsetof(struct btrfs_chunk, stripe),
>  			BTRFS_LEAF_DATA_SIZE(fs_info));
>  		return -EUCLEAN;

Same complaint as above for nstripes < ncopies. Is there some way to
more generically bypass stripe checking if we detect the case we care
about: (remapped && num_stripes == 0) ?

>  	}
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index e7c467b6af46..9159d11cb143 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -6133,6 +6133,12 @@ struct btrfs_discard_stripe *btrfs_map_discard(struct btrfs_fs_info *fs_info,
>  		goto out_free_map;
>  	}
>  
> +	/* avoid divide by zero on fully-remapped chunks */
> +	if (map->num_stripes == 0) {
> +		ret = -EOPNOTSUPP;
> +		goto out_free_map;
> +	}
> +

This seems kinda sketchy to me. Presumably once we have remapped a block
group, we do want to discard it. But this makes that impossible? Does
the discarding happen before we set num_stripes to 0? Even with
discard=async?

>  	offset = logical - map->start;
>  	length = min_t(u64, map->start + map->chunk_len - logical, length);
>  	*length_ret = length;
> @@ -6953,7 +6959,7 @@ u64 btrfs_calc_stripe_length(const struct btrfs_chunk_map *map)
>  {
>  	const int data_stripes = calc_data_stripes(map->type, map->num_stripes);
>  
> -	return div_u64(map->chunk_len, data_stripes);
> +	return data_stripes ? div_u64(map->chunk_len, data_stripes) : 0;

Rather than making 0 a special value meaning remapped, would it be
useful/clearer to also include a remapped flag on the btrfs_chunk_map?
This function would have to behave the same way, presumably, but it
would be less implicit, and we could make assertions on chunk maps that
they have 0 stripes iff remapped at various places where we use the
stripe length. That would raise confidence that all the uses of stripe
logic account for remapped chunks.

>  }
>  
>  #if BITS_PER_LONG == 32
> -- 
> 2.49.0
> 

* Re: [PATCH 04/12] btrfs: remove remapped block groups from the free-space tree
  2025-06-05 16:23 ` [PATCH 04/12] btrfs: remove remapped block groups from the free-space tree Mark Harmstone
  2025-06-06  6:41   ` kernel test robot
@ 2025-06-13 22:00   ` Boris Burkov
  2025-08-12 14:50     ` Mark Harmstone
  1 sibling, 1 reply; 39+ messages in thread
From: Boris Burkov @ 2025-06-13 22:00 UTC (permalink / raw)
  To: Mark Harmstone; +Cc: linux-btrfs

On Thu, Jun 05, 2025 at 05:23:34PM +0100, Mark Harmstone wrote:
> No new allocations can be done from block groups that have the REMAPPED flag
> set, so there's no value in their having entries in the free-space tree.
> 
> Prevent a search through the free-space tree being scheduled for such a
> block group, and prevent discard being run for a fully-remapped block
> group.
> 
> Signed-off-by: Mark Harmstone <maharmstone@fb.com>
> ---
>  fs/btrfs/block-group.c | 21 ++++++++++++++++-----
>  fs/btrfs/discard.c     |  9 +++++++++
>  2 files changed, 25 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> index 5b0cb04b2b93..9b3b5358f1ba 100644
> --- a/fs/btrfs/block-group.c
> +++ b/fs/btrfs/block-group.c
> @@ -920,6 +920,13 @@ int btrfs_cache_block_group(struct btrfs_block_group *cache, bool wait)
>  	if (btrfs_is_zoned(fs_info))
>  		return 0;
>  
> +	/*
> +	 * No allocations can be done from remapped block groups, so they have
> +	 * no entries in the free-space tree.
> +	 */
> +	if (cache->flags & BTRFS_BLOCK_GROUP_REMAPPED)
> +		return 0;
> +
>  	caching_ctl = kzalloc(sizeof(*caching_ctl), GFP_NOFS);
>  	if (!caching_ctl)
>  		return -ENOMEM;
> @@ -1235,9 +1242,11 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
>  	 * another task to attempt to create another block group with the same
>  	 * item key (and failing with -EEXIST and a transaction abort).
>  	 */
> -	ret = remove_block_group_free_space(trans, block_group);
> -	if (ret)
> -		goto out;

nit: it feels nicer to hide the check inside the function.

> +	if (!(block_group->flags & BTRFS_BLOCK_GROUP_REMAPPED)) {
> +		ret = remove_block_group_free_space(trans, block_group);
> +		if (ret)
> +			goto out;
> +	}
>  
>  	ret = remove_block_group_item(trans, path, block_group);
>  	if (ret < 0)
> @@ -2457,10 +2466,12 @@ static int read_one_block_group(struct btrfs_fs_info *info,
>  	if (btrfs_chunk_writeable(info, cache->start)) {
>  		if (cache->used == 0) {
>  			ASSERT(list_empty(&cache->bg_list));
> -			if (btrfs_test_opt(info, DISCARD_ASYNC))
> +			if (btrfs_test_opt(info, DISCARD_ASYNC) &&

I asked this on the previous patch, but I guess this means we will never
discard these blocks? Is that desirable? Or are we discarding them at
some other point in the life-cycle?

> +			    !(cache->flags & BTRFS_BLOCK_GROUP_REMAPPED)) {
>  				btrfs_discard_queue_work(&info->discard_ctl, cache);
> -			else
> +			} else {
>  				btrfs_mark_bg_unused(cache);
> +			}
>  		}
>  	} else {
>  		inc_block_group_ro(cache, 1);
> diff --git a/fs/btrfs/discard.c b/fs/btrfs/discard.c
> index 89fe85778115..1015a4d37fb2 100644
> --- a/fs/btrfs/discard.c
> +++ b/fs/btrfs/discard.c
> @@ -698,6 +698,15 @@ void btrfs_discard_punt_unused_bgs_list(struct btrfs_fs_info *fs_info)
>  	/* We enabled async discard, so punt all to the queue */
>  	list_for_each_entry_safe(block_group, next, &fs_info->unused_bgs,
>  				 bg_list) {
> +		/* Fully remapped BGs have nothing to discard */

Same question. If we simply *don't* discard them, I feel like this
comment is misleadingly worded.

> +		spin_lock(&block_group->lock);
> +		if (block_group->flags & BTRFS_BLOCK_GROUP_REMAPPED &&
> +		    !btrfs_is_block_group_used(block_group)) {
> +			spin_unlock(&block_group->lock);
> +			continue;
> +		}
> +		spin_unlock(&block_group->lock);
> +
>  		list_del_init(&block_group->bg_list);
>  		btrfs_discard_queue_work(&fs_info->discard_ctl, block_group);
>  		/*
> -- 
> 2.49.0
> 

* Re: [PATCH 05/12] btrfs: don't add metadata items for the remap tree to the extent tree
  2025-06-05 16:23 ` [PATCH 05/12] btrfs: don't add metadata items for the remap tree to the extent tree Mark Harmstone
@ 2025-06-13 22:39   ` Boris Burkov
  0 siblings, 0 replies; 39+ messages in thread
From: Boris Burkov @ 2025-06-13 22:39 UTC (permalink / raw)
  To: Mark Harmstone; +Cc: linux-btrfs

On Thu, Jun 05, 2025 at 05:23:35PM +0100, Mark Harmstone wrote:
> There is the following potential problem with the remap tree and delayed refs:
> 
> * Remapped extent freed in a delayed ref, which removes an entry from the
>   remap tree
> * Remap tree now small enough to fit in a single leaf
> * Corruption as we now have a level-0 block with a level-1 metadata item
>   in the extent tree
> 
> One solution to this would be to rework the remap tree code so that it operates
> via delayed refs. But as we're hoping to remove cow-only metadata items in the
> future anyway, change things so that the remap tree doesn't have any entries in
> the extent tree. This also has the benefit of reducing write amplification.
> 
> We also make it so that the clear_cache mount option is a no-op, as with the
> extent tree v2, as the free-space tree can no longer be recreated from the
> extent tree.
> 
> Finally disable relocating the remap tree itself for the time being: rather
> than walking the extent tree, this will need to be changed so that the remap
> tree gets walked, and any nodes within the specified block groups get COWed.
> This code will also cover the future cases when we remove the metadata items
> for the SYSTEM block groups, i.e. the chunk and root trees.

Why not a separate trivial patch for disabling remap tree reloc?

> 
> Signed-off-by: Mark Harmstone <maharmstone@fb.com>
> ---
>  fs/btrfs/disk-io.c     |   3 ++
>  fs/btrfs/extent-tree.c | 114 ++++++++++++++++++++++++-----------------
>  fs/btrfs/volumes.c     |   3 ++
>  3 files changed, 73 insertions(+), 47 deletions(-)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 60cce96a9ec4..324116c3566c 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -3064,6 +3064,9 @@ int btrfs_start_pre_rw_mount(struct btrfs_fs_info *fs_info)
>  		if (btrfs_fs_incompat(fs_info, EXTENT_TREE_V2))
>  			btrfs_warn(fs_info,
>  				   "'clear_cache' option is ignored with extent tree v2");
> +		else if (btrfs_fs_incompat(fs_info, REMAP_TREE))
> +			btrfs_warn(fs_info,
> +				   "'clear_cache' option is ignored with remap tree");
>  		else
>  			rebuild_free_space_tree = true;
>  	} else if (btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE) &&
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 46d4963a8241..205692fc1c7e 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -3106,6 +3106,24 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
>  	bool skinny_metadata = btrfs_fs_incompat(info, SKINNY_METADATA);
>  	u64 delayed_ref_root = href->owning_root;
>  
> +	is_data = owner_objectid >= BTRFS_FIRST_FREE_OBJECTID;
> +
> +	if (!is_data && node->ref_root == BTRFS_REMAP_TREE_OBJECTID) {

Are there cases where ref_root is REMAP_TREE but is_data is true? Or is
this redundant? If so, an assert might make more sense than including it
in the if condition.


Also, rather than special-casing / short-cutting a generic
metadata/data function that is fully concerned with *extents*, I would
come up with a name for the "non extent tree metadata" concept and make
that a case at the metadata-freeing callsite.

> +		ret = add_to_free_space_tree(trans, bytenr, num_bytes);
> +		if (ret) {
> +			btrfs_abort_transaction(trans, ret);
> +			return ret;
> +		}
> +
> +		ret = btrfs_update_block_group(trans, bytenr, num_bytes, false);
> +		if (ret) {
> +			btrfs_abort_transaction(trans, ret);
> +			return ret;
> +		}
> +
> +		return 0;
> +	}
> +
>  	extent_root = btrfs_extent_root(info, bytenr);
>  	ASSERT(extent_root);
>  
> @@ -3113,8 +3131,6 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
>  	if (!path)
>  		return -ENOMEM;
>  
> -	is_data = owner_objectid >= BTRFS_FIRST_FREE_OBJECTID;
> -
>  	if (!is_data && refs_to_drop != 1) {
>  		btrfs_crit(info,
>  "invalid refs_to_drop, dropping more than 1 refs for tree block %llu refs_to_drop %u",
> @@ -4893,57 +4909,61 @@ static int alloc_reserved_tree_block(struct btrfs_trans_handle *trans,
>  	int level = btrfs_delayed_ref_owner(node);
>  	bool skinny_metadata = btrfs_fs_incompat(fs_info, SKINNY_METADATA);
>  
> -	extent_key.objectid = node->bytenr;
> -	if (skinny_metadata) {
> -		/* The owner of a tree block is the level. */
> -		extent_key.offset = level;
> -		extent_key.type = BTRFS_METADATA_ITEM_KEY;
> -	} else {
> -		extent_key.offset = node->num_bytes;
> -		extent_key.type = BTRFS_EXTENT_ITEM_KEY;
> -		size += sizeof(*block_info);
> -	}
> +	if (node->ref_root != BTRFS_REMAP_TREE_OBJECTID) {

Similarly, I don't like jamming this whole allocation function, which
fundamentally doesn't care about the remap tree, behind a remap-tree
check.

I would at least change it to

if (unlikely(remap_tree))
        goto skip;

or something.
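
In full, the reshaped function could keep the extent-item code
un-indented, something like the sketch below (the elided middle is the
existing insertion code, unchanged):

```c
	if (unlikely(node->ref_root == BTRFS_REMAP_TREE_OBJECTID))
		goto skip;

	/* ... existing extent item insertion, un-reindented ... */

	btrfs_free_path(path);
skip:
	return alloc_reserved_extent(trans, node->bytenr, fs_info->nodesize);
```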

> +		extent_key.objectid = node->bytenr;
> +		if (skinny_metadata) {
> +			/* The owner of a tree block is the level. */
> +			extent_key.offset = level;
> +			extent_key.type = BTRFS_METADATA_ITEM_KEY;
> +		} else {
> +			extent_key.offset = node->num_bytes;
> +			extent_key.type = BTRFS_EXTENT_ITEM_KEY;
> +			size += sizeof(*block_info);
> +		}
>  
> -	path = btrfs_alloc_path();
> -	if (!path)
> -		return -ENOMEM;
> +		path = btrfs_alloc_path();
> +		if (!path)
> +			return -ENOMEM;
>  
> -	extent_root = btrfs_extent_root(fs_info, extent_key.objectid);
> -	ret = btrfs_insert_empty_item(trans, extent_root, path, &extent_key,
> -				      size);
> -	if (ret) {
> -		btrfs_free_path(path);
> -		return ret;
> -	}
> +		extent_root = btrfs_extent_root(fs_info, extent_key.objectid);
> +		ret = btrfs_insert_empty_item(trans, extent_root, path,
> +					      &extent_key, size);
> +		if (ret) {
> +			btrfs_free_path(path);
> +			return ret;
> +		}
>  
> -	leaf = path->nodes[0];
> -	extent_item = btrfs_item_ptr(leaf, path->slots[0],
> -				     struct btrfs_extent_item);
> -	btrfs_set_extent_refs(leaf, extent_item, 1);
> -	btrfs_set_extent_generation(leaf, extent_item, trans->transid);
> -	btrfs_set_extent_flags(leaf, extent_item,
> -			       flags | BTRFS_EXTENT_FLAG_TREE_BLOCK);
> +		leaf = path->nodes[0];
> +		extent_item = btrfs_item_ptr(leaf, path->slots[0],
> +					struct btrfs_extent_item);
> +		btrfs_set_extent_refs(leaf, extent_item, 1);
> +		btrfs_set_extent_generation(leaf, extent_item, trans->transid);
> +		btrfs_set_extent_flags(leaf, extent_item,
> +				flags | BTRFS_EXTENT_FLAG_TREE_BLOCK);
>  
> -	if (skinny_metadata) {
> -		iref = (struct btrfs_extent_inline_ref *)(extent_item + 1);
> -	} else {
> -		block_info = (struct btrfs_tree_block_info *)(extent_item + 1);
> -		btrfs_set_tree_block_key(leaf, block_info, &extent_op->key);
> -		btrfs_set_tree_block_level(leaf, block_info, level);
> -		iref = (struct btrfs_extent_inline_ref *)(block_info + 1);
> -	}
> +		if (skinny_metadata) {
> +			iref = (struct btrfs_extent_inline_ref *)(extent_item + 1);
> +		} else {
> +			block_info = (struct btrfs_tree_block_info *)(extent_item + 1);
> +			btrfs_set_tree_block_key(leaf, block_info, &extent_op->key);
> +			btrfs_set_tree_block_level(leaf, block_info, level);
> +			iref = (struct btrfs_extent_inline_ref *)(block_info + 1);
> +		}
>  
> -	if (node->type == BTRFS_SHARED_BLOCK_REF_KEY) {
> -		btrfs_set_extent_inline_ref_type(leaf, iref,
> -						 BTRFS_SHARED_BLOCK_REF_KEY);
> -		btrfs_set_extent_inline_ref_offset(leaf, iref, node->parent);
> -	} else {
> -		btrfs_set_extent_inline_ref_type(leaf, iref,
> -						 BTRFS_TREE_BLOCK_REF_KEY);
> -		btrfs_set_extent_inline_ref_offset(leaf, iref, node->ref_root);
> -	}
> +		if (node->type == BTRFS_SHARED_BLOCK_REF_KEY) {
> +			btrfs_set_extent_inline_ref_type(leaf, iref,
> +						BTRFS_SHARED_BLOCK_REF_KEY);
> +			btrfs_set_extent_inline_ref_offset(leaf, iref,
> +							   node->parent);
> +		} else {
> +			btrfs_set_extent_inline_ref_type(leaf, iref,
> +						BTRFS_TREE_BLOCK_REF_KEY);
> +			btrfs_set_extent_inline_ref_offset(leaf, iref,
> +							   node->ref_root);
> +		}
>  
> -	btrfs_free_path(path);
> +		btrfs_free_path(path);
> +	}
>  
>  	return alloc_reserved_extent(trans, node->bytenr, fs_info->nodesize);
>  }
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 9159d11cb143..0f4954f998cd 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -3981,6 +3981,9 @@ static bool should_balance_chunk(struct extent_buffer *leaf, struct btrfs_chunk
>  	struct btrfs_balance_args *bargs = NULL;
>  	u64 chunk_type = btrfs_chunk_type(leaf, chunk);
>  
> +	if (chunk_type & BTRFS_BLOCK_GROUP_REMAP)
> +		return false;
> +
>  	/* type filter */
>  	if (!((chunk_type & BTRFS_BLOCK_GROUP_TYPE_MASK) &
>  	      (bctl->flags & BTRFS_BALANCE_TYPE_MASK))) {
> -- 
> 2.49.0
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 10/12] btrfs: handle setting up relocation of block group with remap-tree
  2025-06-05 16:23 ` [PATCH 10/12] btrfs: handle setting up relocation of block group with remap-tree Mark Harmstone
@ 2025-06-13 23:25   ` Boris Burkov
  2025-08-12 11:20     ` Mark Harmstone
  0 siblings, 1 reply; 39+ messages in thread
From: Boris Burkov @ 2025-06-13 23:25 UTC (permalink / raw)
  To: Mark Harmstone; +Cc: linux-btrfs

On Thu, Jun 05, 2025 at 05:23:40PM +0100, Mark Harmstone wrote:
> Handle the preliminary work for relocating a block group in a filesystem
> with the remap-tree flag set.
> 
> If the block group is SYSTEM or REMAP btrfs_relocate_block_group()
> proceeds as it does already, as bootstrapping issues mean that these
> block groups have to be processed the existing way.
> 
> Otherwise we walk the free-space tree for the block group in question,
> recording any holes. These get converted into identity remaps and placed
> in the remap tree, and the block group's REMAPPED flag is set. From now
> on no new allocations are possible within this block group, and any I/O
> to it will be funnelled through btrfs_translate_remap(). We store the
> number of identity remaps in `identity_remap_count`, so that we know
> when we've removed the last one and the block group is fully remapped.
> 
> The change in btrfs_read_roots() is because data relocations no longer
> rely on the data reloc tree as a hidden subvolume in which to do
> snapshots.
> 
> Signed-off-by: Mark Harmstone <maharmstone@fb.com>
> ---
>  fs/btrfs/disk-io.c         |  30 +--
>  fs/btrfs/free-space-tree.c |   4 +-
>  fs/btrfs/free-space-tree.h |   5 +-
>  fs/btrfs/relocation.c      | 452 ++++++++++++++++++++++++++++++++++++-
>  fs/btrfs/relocation.h      |   3 +-
>  fs/btrfs/space-info.c      |   9 +-
>  fs/btrfs/volumes.c         |  15 +-
>  7 files changed, 483 insertions(+), 35 deletions(-)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index dac22efd2332..f2a9192293b1 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2268,22 +2268,22 @@ static int btrfs_read_roots(struct btrfs_fs_info *fs_info)
>  		root->root_key.objectid = BTRFS_REMAP_TREE_OBJECTID;
>  		root->root_key.type = BTRFS_ROOT_ITEM_KEY;
>  		root->root_key.offset = 0;
> -	}
> -
> -	/*
> -	 * This tree can share blocks with some other fs tree during relocation
> -	 * and we need a proper setup by btrfs_get_fs_root
> -	 */
> -	root = btrfs_get_fs_root(tree_root->fs_info,
> -				 BTRFS_DATA_RELOC_TREE_OBJECTID, true);
> -	if (IS_ERR(root)) {
> -		if (!btrfs_test_opt(fs_info, IGNOREBADROOTS)) {
> -			ret = PTR_ERR(root);
> -			goto out;
> -		}
>  	} else {
> -		set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
> -		fs_info->data_reloc_root = root;
> +		/*
> +		 * This tree can share blocks with some other fs tree during
> +		 * relocation and we need a proper setup by btrfs_get_fs_root
> +		 */
> +		root = btrfs_get_fs_root(tree_root->fs_info,
> +					 BTRFS_DATA_RELOC_TREE_OBJECTID, true);
> +		if (IS_ERR(root)) {
> +			if (!btrfs_test_opt(fs_info, IGNOREBADROOTS)) {
> +				ret = PTR_ERR(root);
> +				goto out;
> +			}
> +		} else {
> +			set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
> +			fs_info->data_reloc_root = root;
> +		}

Why not make this change to the else case along with the remap-tree if
case in the earlier patch ("allow mounting filesystems with remap-tree
incompat flag")?

>  	}
>  
>  	location.objectid = BTRFS_QUOTA_TREE_OBJECTID;
> diff --git a/fs/btrfs/free-space-tree.c b/fs/btrfs/free-space-tree.c
> index af51cf784a5b..eb579c17a79f 100644
> --- a/fs/btrfs/free-space-tree.c
> +++ b/fs/btrfs/free-space-tree.c
> @@ -21,8 +21,7 @@ static int __add_block_group_free_space(struct btrfs_trans_handle *trans,
>  					struct btrfs_block_group *block_group,
>  					struct btrfs_path *path);
>  
> -static struct btrfs_root *btrfs_free_space_root(
> -				struct btrfs_block_group *block_group)
> +struct btrfs_root *btrfs_free_space_root(struct btrfs_block_group *block_group)
>  {
>  	struct btrfs_key key = {
>  		.objectid = BTRFS_FREE_SPACE_TREE_OBJECTID,
> @@ -96,7 +95,6 @@ static int add_new_free_space_info(struct btrfs_trans_handle *trans,
>  	return ret;
>  }
>  
> -EXPORT_FOR_TESTS
>  struct btrfs_free_space_info *search_free_space_info(
>  		struct btrfs_trans_handle *trans,
>  		struct btrfs_block_group *block_group,
> diff --git a/fs/btrfs/free-space-tree.h b/fs/btrfs/free-space-tree.h
> index e6c6d6f4f221..1b804544730a 100644
> --- a/fs/btrfs/free-space-tree.h
> +++ b/fs/btrfs/free-space-tree.h
> @@ -35,12 +35,13 @@ int add_to_free_space_tree(struct btrfs_trans_handle *trans,
>  			   u64 start, u64 size);
>  int remove_from_free_space_tree(struct btrfs_trans_handle *trans,
>  				u64 start, u64 size);
> -
> -#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
>  struct btrfs_free_space_info *
>  search_free_space_info(struct btrfs_trans_handle *trans,
>  		       struct btrfs_block_group *block_group,
>  		       struct btrfs_path *path, int cow);
> +struct btrfs_root *btrfs_free_space_root(struct btrfs_block_group *block_group);
> +
> +#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
>  int __add_to_free_space_tree(struct btrfs_trans_handle *trans,
>  			     struct btrfs_block_group *block_group,
>  			     struct btrfs_path *path, u64 start, u64 size);
> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> index 54c3e99c7dab..acf2fefedc96 100644
> --- a/fs/btrfs/relocation.c
> +++ b/fs/btrfs/relocation.c
> @@ -3659,7 +3659,7 @@ static noinline_for_stack int relocate_block_group(struct reloc_control *rc)
>  		btrfs_btree_balance_dirty(fs_info);
>  	}
>  
> -	if (!err) {
> +	if (!err && !btrfs_fs_incompat(fs_info, REMAP_TREE)) {
>  		ret = relocate_file_extent_cluster(rc);
>  		if (ret < 0)
>  			err = ret;
> @@ -3906,6 +3906,90 @@ static const char *stage_to_string(enum reloc_stage stage)
>  	return "unknown";
>  }
>  
> +static int add_remap_tree_entries(struct btrfs_trans_handle *trans,
> +				  struct btrfs_path *path,
> +				  struct btrfs_key *entries,
> +				  unsigned int num_entries)
> +{
> +	int ret;
> +	struct btrfs_fs_info *fs_info = trans->fs_info;
> +	struct btrfs_item_batch batch;
> +	u32 *data_sizes;
> +	u32 max_items;
> +
> +	max_items = BTRFS_LEAF_DATA_SIZE(trans->fs_info) / sizeof(struct btrfs_item);
> +
> +	data_sizes = kzalloc(sizeof(u32) * min_t(u32, num_entries, max_items),
> +			     GFP_NOFS);
> +	if (!data_sizes)
> +		return -ENOMEM;
> +
> +	while (true) {
> +		batch.keys = entries;
> +		batch.data_sizes = data_sizes;
> +		batch.total_data_size = 0;
> +		batch.nr = min_t(u32, num_entries, max_items);
> +
> +		ret = btrfs_insert_empty_items(trans, fs_info->remap_root, path,
> +					       &batch);
> +		btrfs_release_path(path);
> +
> +		if (num_entries <= max_items)
> +			break;
> +
> +		num_entries -= max_items;
> +		entries += max_items;
> +	}
> +
> +	kfree(data_sizes);
> +
> +	return ret;
> +}
> +
> +struct space_run {
> +	u64 start;
> +	u64 end;
> +};
> +
> +static void parse_bitmap(u64 block_size, const unsigned long *bitmap,
> +			 unsigned long size, u64 address,
> +			 struct space_run *space_runs,
> +			 unsigned int *num_space_runs)
> +{
> +	unsigned long pos, end;
> +	u64 run_start, run_length;
> +
> +	pos = find_first_bit(bitmap, size);
> +
> +	if (pos == size)
> +		return;
> +
> +	while (true) {
> +		end = find_next_zero_bit(bitmap, size, pos);
> +
> +		run_start = address + (pos * block_size);
> +		run_length = (end - pos) * block_size;
> +
> +		if (*num_space_runs != 0 &&
> +		    space_runs[*num_space_runs - 1].end == run_start) {
> +			space_runs[*num_space_runs - 1].end += run_length;
> +		} else {
> +			space_runs[*num_space_runs].start = run_start;
> +			space_runs[*num_space_runs].end = run_start + run_length;
> +
> +			(*num_space_runs)++;
> +		}
> +
> +		if (end == size)
> +			break;
> +
> +		pos = find_next_bit(bitmap, size, end + 1);
> +
> +		if (pos == size)
> +			break;
> +	}
> +}
> +
>  static void adjust_block_group_remap_bytes(struct btrfs_trans_handle *trans,
>  					   struct btrfs_block_group *bg,
>  					   s64 diff)
> @@ -3931,6 +4015,227 @@ static void adjust_block_group_remap_bytes(struct btrfs_trans_handle *trans,
>  		btrfs_inc_delayed_refs_rsv_bg_updates(fs_info);
>  }
>  
> +static int create_remap_tree_entries(struct btrfs_trans_handle *trans,
> +				     struct btrfs_path *path,
> +				     struct btrfs_block_group *bg)
> +{
> +	struct btrfs_fs_info *fs_info = trans->fs_info;
> +	struct btrfs_free_space_info *fsi;
> +	struct btrfs_key key, found_key;
> +	struct extent_buffer *leaf;
> +	struct btrfs_root *space_root;
> +	u32 extent_count;
> +	struct space_run *space_runs = NULL;
> +	unsigned int num_space_runs = 0;
> +	struct btrfs_key *entries = NULL;
> +	unsigned int max_entries, num_entries;
> +	int ret;
> +
> +	mutex_lock(&bg->free_space_lock);
> +
> +	if (test_bit(BLOCK_GROUP_FLAG_NEEDS_FREE_SPACE, &bg->runtime_flags)) {
> +		mutex_unlock(&bg->free_space_lock);
> +
> +		ret = add_block_group_free_space(trans, bg);
> +		if (ret)
> +			return ret;
> +
> +		mutex_lock(&bg->free_space_lock);
> +	}
> +
> +	fsi = search_free_space_info(trans, bg, path, 0);
> +	if (IS_ERR(fsi)) {
> +		mutex_unlock(&bg->free_space_lock);
> +		return PTR_ERR(fsi);
> +	}
> +
> +	extent_count = btrfs_free_space_extent_count(path->nodes[0], fsi);
> +
> +	btrfs_release_path(path);
> +
> +	space_runs = kmalloc(sizeof(*space_runs) * extent_count, GFP_NOFS);
> +	if (!space_runs) {
> +		mutex_unlock(&bg->free_space_lock);
> +		return -ENOMEM;
> +	}
> +
> +	key.objectid = bg->start;
> +	key.type = 0;
> +	key.offset = 0;
> +
> +	space_root = btrfs_free_space_root(bg);
> +
> +	ret = btrfs_search_slot(trans, space_root, &key, path, 0, 0);
> +	if (ret < 0) {
> +		mutex_unlock(&bg->free_space_lock);
> +		goto out;
> +	}
> +
> +	ret = 0;
> +
> +	while (true) {
> +		leaf = path->nodes[0];
> +
> +		btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
> +
> +		if (found_key.objectid >= bg->start + bg->length)
> +			break;
> +
> +		if (found_key.type == BTRFS_FREE_SPACE_EXTENT_KEY) {
> +			if (num_space_runs != 0 &&
> +			    space_runs[num_space_runs - 1].end == found_key.objectid) {
> +				space_runs[num_space_runs - 1].end =
> +					found_key.objectid + found_key.offset;
> +			} else {
> +				space_runs[num_space_runs].start = found_key.objectid;
> +				space_runs[num_space_runs].end =
> +					found_key.objectid + found_key.offset;
> +
> +				num_space_runs++;
> +
> +				BUG_ON(num_space_runs > extent_count);
> +			}
> +		} else if (found_key.type == BTRFS_FREE_SPACE_BITMAP_KEY) {
> +			void *bitmap;
> +			unsigned long offset;
> +			u32 data_size;
> +
> +			offset = btrfs_item_ptr_offset(leaf, path->slots[0]);
> +			data_size = btrfs_item_size(leaf, path->slots[0]);
> +
> +			if (data_size != 0) {
> +				bitmap = kmalloc(data_size, GFP_NOFS);

The free space tree code does this with alloc_bitmap(), as far as I can
tell.

> +				if (!bitmap) {
> +					mutex_unlock(&bg->free_space_lock);
> +					ret = -ENOMEM;
> +					goto out;
> +				}
> +
> +				read_extent_buffer(leaf, bitmap, offset,
> +						   data_size);
> +
> +				parse_bitmap(fs_info->sectorsize, bitmap,
> +					     data_size * BITS_PER_BYTE,
> +					     found_key.objectid, space_runs,
> +					     &num_space_runs);
> +
> +				BUG_ON(num_space_runs > extent_count);
> +
> +				kfree(bitmap);
> +			}
> +		}
> +
> +		path->slots[0]++;
> +
> +		if (path->slots[0] >= btrfs_header_nritems(leaf)) {
> +			ret = btrfs_next_leaf(space_root, path);
> +			if (ret != 0) {
> +				if (ret == 1)
> +					ret = 0;
> +				break;
> +			}
> +			leaf = path->nodes[0];
> +		}
> +	}
> +
> +	btrfs_release_path(path);
> +
> +	mutex_unlock(&bg->free_space_lock);
> +
> +	max_entries = extent_count + 2;
> +	entries = kmalloc(sizeof(*entries) * max_entries, GFP_NOFS);
> +	if (!entries) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	num_entries = 0;
> +
> +	if (num_space_runs > 0 && space_runs[0].start > bg->start) {
> +		entries[num_entries].objectid = bg->start;
> +		entries[num_entries].type = BTRFS_IDENTITY_REMAP_KEY;
> +		entries[num_entries].offset = space_runs[0].start - bg->start;
> +		num_entries++;
> +	}
> +
> +	for (unsigned int i = 1; i < num_space_runs; i++) {
> +		entries[num_entries].objectid = space_runs[i - 1].end;
> +		entries[num_entries].type = BTRFS_IDENTITY_REMAP_KEY;
> +		entries[num_entries].offset =
> +			space_runs[i].start - space_runs[i - 1].end;
> +		num_entries++;
> +	}
> +
> +	if (num_space_runs == 0) {
> +		entries[num_entries].objectid = bg->start;
> +		entries[num_entries].type = BTRFS_IDENTITY_REMAP_KEY;
> +		entries[num_entries].offset = bg->length;
> +		num_entries++;
> +	} else if (space_runs[num_space_runs - 1].end < bg->start + bg->length) {
> +		entries[num_entries].objectid = space_runs[num_space_runs - 1].end;
> +		entries[num_entries].type = BTRFS_IDENTITY_REMAP_KEY;
> +		entries[num_entries].offset =
> +			bg->start + bg->length - space_runs[num_space_runs - 1].end;
> +		num_entries++;
> +	}
> +
> +	if (num_entries == 0)
> +		goto out;
> +
> +	bg->identity_remap_count = num_entries;
> +
> +	ret = add_remap_tree_entries(trans, path, entries, num_entries);
> +
> +out:
> +	kfree(entries);
> +	kfree(space_runs);
> +
> +	return ret;
> +}
> +
> +static int mark_bg_remapped(struct btrfs_trans_handle *trans,
> +			    struct btrfs_path *path,
> +			    struct btrfs_block_group *bg)
> +{
> +	struct btrfs_fs_info *fs_info = trans->fs_info;
> +	unsigned long bi;
> +	struct extent_buffer *leaf;
> +	struct btrfs_block_group_item_v2 bgi;
> +	struct btrfs_key key;
> +	int ret;
> +

Do the changes to the in-memory bg flags / commit_identity_remap_count
need any locking? What happens if another context sees the flag set but
doesn't yet see the identity remap count?
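
To make that window concrete, here is a generic publish-ordering sketch
with C11 atomics. This is not btrfs code (in the kernel this would be
smp_store_release()/smp_load_acquire(), or simply done under a lock);
the struct, field names, and flag value are all hypothetical. The idea
is to write the count first and publish the flag last with release
ordering, so a reader that observes the flag also observes the count:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical model of the two block group fields in question. */
struct bg_model {
	atomic_uint_fast64_t flags;        /* REMAPPED bit published last */
	atomic_uint_fast64_t remap_count;  /* must be visible before flag */
};

#define MODEL_REMAPPED_FLAG (1ULL << 0)

static void publish_remapped(struct bg_model *bg, uint64_t count)
{
	/* Write the count first... */
	atomic_store_explicit(&bg->remap_count, count, memory_order_relaxed);
	/* ...then publish the flag with release ordering. */
	atomic_fetch_or_explicit(&bg->flags, MODEL_REMAPPED_FLAG,
				 memory_order_release);
}

/* Returns the count if the flag is visible, or 0 if not yet remapped. */
static uint64_t read_remapped_count(struct bg_model *bg)
{
	uint64_t flags = atomic_load_explicit(&bg->flags,
					      memory_order_acquire);

	if (!(flags & MODEL_REMAPPED_FLAG))
		return 0;
	return atomic_load_explicit(&bg->remap_count, memory_order_relaxed);
}
```

Without the release/acquire pairing (or an equivalent lock), a reader
could see the flag set while still reading a stale count.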

> +	bg->flags |= BTRFS_BLOCK_GROUP_REMAPPED;
> +
> +	key.objectid = bg->start;
> +	key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
> +	key.offset = bg->length;
> +
> +	ret = btrfs_search_slot(trans, fs_info->block_group_root, &key,
> +				path, 0, 1);
> +	if (ret) {
> +		if (ret > 0)
> +			ret = -ENOENT;
> +		goto out;
> +	}
> +
> +	leaf = path->nodes[0];
> +	bi = btrfs_item_ptr_offset(leaf, path->slots[0]);
> +	read_extent_buffer(leaf, &bgi, bi, sizeof(bgi));

ASSERT the incompat flag?

> +	btrfs_set_stack_block_group_v2_flags(&bgi, bg->flags);
> +	btrfs_set_stack_block_group_v2_identity_remap_count(&bgi,
> +						bg->identity_remap_count);
> +	write_extent_buffer(leaf, &bgi, bi, sizeof(bgi));
> +
> +	btrfs_mark_buffer_dirty(trans, leaf);
> +
> +	bg->commit_identity_remap_count = bg->identity_remap_count;
> +
> +	ret = 0;
> +out:
> +	btrfs_release_path(path);
> +	return ret;
> +}
> +
>  static int remove_chunk_stripes(struct btrfs_trans_handle *trans,
>  				struct btrfs_chunk_map *chunk,
>  				struct btrfs_path *path)
> @@ -4050,6 +4355,55 @@ static int adjust_identity_remap_count(struct btrfs_trans_handle *trans,
>  	return ret;
>  }
>  
> +static int mark_chunk_remapped(struct btrfs_trans_handle *trans,
> +			       struct btrfs_path *path, uint64_t start)
> +{
> +	struct btrfs_fs_info *fs_info = trans->fs_info;
> +	struct btrfs_chunk_map *chunk;
> +	struct btrfs_key key;
> +	u64 type;
> +	int ret;
> +	struct extent_buffer *leaf;
> +	struct btrfs_chunk *c;
> +
> +	read_lock(&fs_info->mapping_tree_lock);
> +
> +	chunk = btrfs_find_chunk_map_nolock(fs_info, start, 1);
> +	if (!chunk) {
> +		read_unlock(&fs_info->mapping_tree_lock);
> +		return -ENOENT;
> +	}
> +
> +	chunk->type |= BTRFS_BLOCK_GROUP_REMAPPED;
> +	type = chunk->type;
> +
> +	read_unlock(&fs_info->mapping_tree_lock);
> +
> +	key.objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID;
> +	key.type = BTRFS_CHUNK_ITEM_KEY;
> +	key.offset = start;
> +
> +	ret = btrfs_search_slot(trans, fs_info->chunk_root, &key, path,
> +				0, 1);
> +	if (ret == 1) {
> +		ret = -ENOENT;
> +		goto end;
> +	} else if (ret < 0)
> +		goto end;
> +
> +	leaf = path->nodes[0];
> +
> +	c = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_chunk);
> +	btrfs_set_chunk_type(leaf, c, type);
> +	btrfs_mark_buffer_dirty(trans, leaf);
> +
> +	ret = 0;
> +end:
> +	btrfs_free_chunk_map(chunk);
> +	btrfs_release_path(path);
> +	return ret;
> +}
> +
>  int btrfs_translate_remap(struct btrfs_fs_info *fs_info, u64 *logical,
>  			  u64 *length, bool nolock)
>  {
> @@ -4109,16 +4463,78 @@ int btrfs_translate_remap(struct btrfs_fs_info *fs_info, u64 *logical,
>  	return 0;
>  }
>  
> +static int start_block_group_remapping(struct btrfs_fs_info *fs_info,
> +				       struct btrfs_path *path,
> +				       struct btrfs_block_group *bg)
> +{
> +	struct btrfs_trans_handle *trans;
> +	int ret, ret2;
> +
> +	ret = btrfs_cache_block_group(bg, true);
> +	if (ret)
> +		return ret;
> +
> +	trans = btrfs_start_transaction(fs_info->remap_root, 0);
> +	if (IS_ERR(trans))
> +		return PTR_ERR(trans);
> +
> +	/* We need to run delayed refs, to make sure FST is up to date. */
> +	ret = btrfs_run_delayed_refs(trans, U64_MAX);
> +	if (ret) {
> +		btrfs_end_transaction(trans);
> +		return ret;
> +	}
> +
> +	mutex_lock(&fs_info->remap_mutex);
> +
> +	if (bg->flags & BTRFS_BLOCK_GROUP_REMAPPED) {
> +		ret = 0;
> +		goto end;
> +	}
> +
> +	ret = create_remap_tree_entries(trans, path, bg);
> +	if (ret) {
> +		btrfs_abort_transaction(trans, ret);
> +		goto end;
> +	}
> +
> +	ret = mark_bg_remapped(trans, path, bg);
> +	if (ret) {
> +		btrfs_abort_transaction(trans, ret);
> +		goto end;
> +	}
> +
> +	ret = mark_chunk_remapped(trans, path, bg->start);
> +	if (ret) {
> +		btrfs_abort_transaction(trans, ret);
> +		goto end;
> +	}
> +
> +	ret = remove_block_group_free_space(trans, bg);
> +	if (ret)
> +		btrfs_abort_transaction(trans, ret);
> +
> +end:
> +	mutex_unlock(&fs_info->remap_mutex);
> +
> +	ret2 = btrfs_end_transaction(trans);
> +	if (!ret)
> +		ret = ret2;
> +
> +	return ret;
> +}
> +
>  /*
>   * function to relocate all extents in a block group.
>   */
> -int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start)
> +int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start,
> +			       bool *using_remap_tree)
>  {
>  	struct btrfs_block_group *bg;
>  	struct btrfs_root *extent_root = btrfs_extent_root(fs_info, group_start);
>  	struct reloc_control *rc;
>  	struct inode *inode;
> -	struct btrfs_path *path;
> +	struct btrfs_path *path = NULL;
>  	int ret;
>  	int rw = 0;
>  	int err = 0;
> @@ -4185,7 +4601,7 @@ int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start)
>  	}
>  
>  	inode = lookup_free_space_inode(rc->block_group, path);
> -	btrfs_free_path(path);
> +	btrfs_release_path(path);
>  
>  	if (!IS_ERR(inode))
>  		ret = delete_block_group_cache(rc->block_group, inode, 0);
> @@ -4197,11 +4613,17 @@ int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start)
>  		goto out;
>  	}
>  
> -	rc->data_inode = create_reloc_inode(rc->block_group);
> -	if (IS_ERR(rc->data_inode)) {
> -		err = PTR_ERR(rc->data_inode);
> -		rc->data_inode = NULL;
> -		goto out;
> +	*using_remap_tree = btrfs_fs_incompat(fs_info, REMAP_TREE) &&
> +		!(bg->flags & BTRFS_BLOCK_GROUP_SYSTEM) &&
> +		!(bg->flags & BTRFS_BLOCK_GROUP_REMAP);
> +
> +	if (!btrfs_fs_incompat(fs_info, REMAP_TREE)) {
> +		rc->data_inode = create_reloc_inode(rc->block_group);
> +		if (IS_ERR(rc->data_inode)) {
> +			err = PTR_ERR(rc->data_inode);
> +			rc->data_inode = NULL;
> +			goto out;
> +		}
>  	}
>  
>  	describe_relocation(rc->block_group);
> @@ -4213,6 +4635,12 @@ int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start)
>  	ret = btrfs_zone_finish(rc->block_group);
>  	WARN_ON(ret && ret != -EAGAIN);
>  
> +	if (*using_remap_tree) {
> +		err = start_block_group_remapping(fs_info, path, bg);
> +
> +		goto out;
> +	}
> +
>  	while (1) {
>  		enum reloc_stage finishes_stage;
>  
> @@ -4258,7 +4686,9 @@ int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start)
>  out:
>  	if (err && rw)
>  		btrfs_dec_block_group_ro(rc->block_group);
> -	iput(rc->data_inode);
> +	if (!btrfs_fs_incompat(fs_info, REMAP_TREE))
> +		iput(rc->data_inode);
> +	btrfs_free_path(path);
>  out_put_bg:
>  	btrfs_put_block_group(bg);
>  	reloc_chunk_end(fs_info);
> @@ -4452,7 +4882,7 @@ int btrfs_recover_relocation(struct btrfs_fs_info *fs_info)
>  
>  	btrfs_free_path(path);
>  
> -	if (ret == 0) {
> +	if (ret == 0 && !btrfs_fs_incompat(fs_info, REMAP_TREE)) {
>  		/* cleanup orphan inode in data relocation tree */
>  		fs_root = btrfs_grab_root(fs_info->data_reloc_root);
>  		ASSERT(fs_root);
> diff --git a/fs/btrfs/relocation.h b/fs/btrfs/relocation.h
> index 0021f812b12c..49bd48296ddb 100644
> --- a/fs/btrfs/relocation.h
> +++ b/fs/btrfs/relocation.h
> @@ -12,7 +12,8 @@ struct btrfs_trans_handle;
>  struct btrfs_ordered_extent;
>  struct btrfs_pending_snapshot;
>  
> -int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start);
> +int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start,
> +			       bool *using_remap_tree);
>  int btrfs_init_reloc_root(struct btrfs_trans_handle *trans, struct btrfs_root *root);
>  int btrfs_update_reloc_root(struct btrfs_trans_handle *trans,
>  			    struct btrfs_root *root);
> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> index 6471861c4b25..ab4a70c420de 100644
> --- a/fs/btrfs/space-info.c
> +++ b/fs/btrfs/space-info.c
> @@ -375,8 +375,13 @@ void btrfs_add_bg_to_space_info(struct btrfs_fs_info *info,
>  	factor = btrfs_bg_type_to_factor(block_group->flags);
>  
>  	spin_lock(&space_info->lock);
> -	space_info->total_bytes += block_group->length;
> -	space_info->disk_total += block_group->length * factor;
> +
> +	if (!(block_group->flags & BTRFS_BLOCK_GROUP_REMAPPED) ||
> +	    block_group->identity_remap_count != 0) {
> +		space_info->total_bytes += block_group->length;
> +		space_info->disk_total += block_group->length * factor;
> +	}
> +
>  	space_info->bytes_used += block_group->used;
>  	space_info->disk_used += block_group->used * factor;
>  	space_info->bytes_readonly += block_group->bytes_super;
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 6c0a67da92f1..771415139dc0 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -3425,6 +3425,7 @@ int btrfs_relocate_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
>  	struct btrfs_block_group *block_group;
>  	u64 length;
>  	int ret;
> +	bool using_remap_tree;
>  
>  	if (btrfs_fs_incompat(fs_info, EXTENT_TREE_V2)) {
>  		btrfs_err(fs_info,
> @@ -3448,7 +3449,8 @@ int btrfs_relocate_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
>  
>  	/* step one, relocate all the extents inside this chunk */
>  	btrfs_scrub_pause(fs_info);
> -	ret = btrfs_relocate_block_group(fs_info, chunk_offset);
> +	ret = btrfs_relocate_block_group(fs_info, chunk_offset,
> +					 &using_remap_tree);
>  	btrfs_scrub_continue(fs_info);
>  	if (ret) {
>  		/*
> @@ -3467,6 +3469,9 @@ int btrfs_relocate_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
>  	length = block_group->length;
>  	btrfs_put_block_group(block_group);
>  
> +	if (using_remap_tree)
> +		return 0;
> +
>  	/*
>  	 * On a zoned file system, discard the whole block group, this will
>  	 * trigger a REQ_OP_ZONE_RESET operation on the device zone. If
> @@ -4165,6 +4170,14 @@ static int __btrfs_balance(struct btrfs_fs_info *fs_info)
>  		chunk = btrfs_item_ptr(leaf, slot, struct btrfs_chunk);
>  		chunk_type = btrfs_chunk_type(leaf, chunk);
>  
> +		/* Check if chunk has already been fully relocated. */
> +		if (chunk_type & BTRFS_BLOCK_GROUP_REMAPPED &&
> +		    btrfs_chunk_num_stripes(leaf, chunk) == 0) {
> +			btrfs_release_path(path);
> +			mutex_unlock(&fs_info->reclaim_bgs_lock);
> +			goto loop;
> +		}
> +
>  		if (!counting) {
>  			spin_lock(&fs_info->balance_lock);
>  			bctl->stat.considered++;
> -- 
> 2.49.0
> 


* Re: [PATCH 09/12] btrfs: handle deletions from remapped block group
  2025-06-05 16:23 ` [PATCH 09/12] btrfs: handle deletions from remapped block group Mark Harmstone
@ 2025-06-13 23:42   ` Boris Burkov
  2025-08-11 16:48     ` Mark Harmstone
  2025-08-11 16:59     ` Mark Harmstone
  0 siblings, 2 replies; 39+ messages in thread
From: Boris Burkov @ 2025-06-13 23:42 UTC (permalink / raw)
  To: Mark Harmstone; +Cc: linux-btrfs

On Thu, Jun 05, 2025 at 05:23:39PM +0100, Mark Harmstone wrote:
> Handle the case where we free an extent from a block group that has the
> REMAPPED flag set. Because the remap tree is orthogonal to the extent
> tree, for data this may be within any number of identity remaps or
> actual remaps. If we're freeing a metadata node, this will be wholly
> inside one or the other.
> 
> btrfs_remove_extent_from_remap_tree() searches the remap tree for the
> remaps that cover the range in question, then calls
> remove_range_from_remap_tree() for each one, to punch a hole in the
> remap and adjust the free-space tree.
> 
> For an identity remap, remove_range_from_remap_tree() will adjust the
> block group's `identity_remap_count` if this changes. If it reaches
> zero we call last_identity_remap_gone(), which removes the chunk's
> stripes and device extents - it is now fully remapped.
> 
> The changes which involve the block group's ro flag are because the
> REMAPPED flag itself prevents a block group from having any new
> allocations within it, and so we don't need to account for this
> separately.

Those changes didn't really make much sense to me. Do you *want* to
delete an "unused" block group with REMAPPED set?
How did it get on the list?
Did you actually hit this case, and are you fixing something?

> 
> Signed-off-by: Mark Harmstone <maharmstone@fb.com>
> ---
>  fs/btrfs/block-group.c |  80 ++++---
>  fs/btrfs/block-group.h |   1 +
>  fs/btrfs/disk-io.c     |   1 +
>  fs/btrfs/extent-tree.c |  30 ++-
>  fs/btrfs/fs.h          |   1 +
>  fs/btrfs/relocation.c  | 505 +++++++++++++++++++++++++++++++++++++++++
>  fs/btrfs/relocation.h  |   3 +
>  fs/btrfs/volumes.c     |  56 +++--
>  fs/btrfs/volumes.h     |   6 +
>  9 files changed, 628 insertions(+), 55 deletions(-)
> 
> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> index 4529356bb1e3..334df145ab3f 100644
> --- a/fs/btrfs/block-group.c
> +++ b/fs/btrfs/block-group.c
> @@ -1055,6 +1055,32 @@ static int remove_block_group_item(struct btrfs_trans_handle *trans,
>  	return ret;
>  }
>  
> +void btrfs_remove_bg_from_sinfo(struct btrfs_block_group *block_group)
> +{
> +	int factor = btrfs_bg_type_to_factor(block_group->flags);
> +
> +	spin_lock(&block_group->space_info->lock);
> +
> +	if (btrfs_test_opt(block_group->fs_info, ENOSPC_DEBUG)) {
> +		WARN_ON(block_group->space_info->total_bytes
> +			< block_group->length);
> +		WARN_ON(block_group->space_info->bytes_readonly
> +			< block_group->length - block_group->zone_unusable);
> +		WARN_ON(block_group->space_info->bytes_zone_unusable
> +			< block_group->zone_unusable);
> +		WARN_ON(block_group->space_info->disk_total
> +			< block_group->length * factor);
> +	}
> +	block_group->space_info->total_bytes -= block_group->length;
> +	block_group->space_info->bytes_readonly -=
> +		(block_group->length - block_group->zone_unusable);
> +	btrfs_space_info_update_bytes_zone_unusable(block_group->space_info,
> +						    -block_group->zone_unusable);
> +	block_group->space_info->disk_total -= block_group->length * factor;
> +
> +	spin_unlock(&block_group->space_info->lock);
> +}
> +
>  int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
>  			     struct btrfs_chunk_map *map)
>  {
> @@ -1066,7 +1092,6 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
>  	struct kobject *kobj = NULL;
>  	int ret;
>  	int index;
> -	int factor;
>  	struct btrfs_caching_control *caching_ctl = NULL;
>  	bool remove_map;
>  	bool remove_rsv = false;
> @@ -1075,7 +1100,7 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
>  	if (!block_group)
>  		return -ENOENT;
>  
> -	BUG_ON(!block_group->ro);
> +	BUG_ON(!block_group->ro && !(block_group->flags & BTRFS_BLOCK_GROUP_REMAPPED));
>  
>  	trace_btrfs_remove_block_group(block_group);
>  	/*
> @@ -1087,7 +1112,6 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
>  				  block_group->length);
>  
>  	index = btrfs_bg_flags_to_raid_index(block_group->flags);
> -	factor = btrfs_bg_type_to_factor(block_group->flags);
>  
>  	/* make sure this block group isn't part of an allocation cluster */
>  	cluster = &fs_info->data_alloc_cluster;
> @@ -1211,26 +1235,11 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
>  
>  	spin_lock(&block_group->space_info->lock);
>  	list_del_init(&block_group->ro_list);
> -
> -	if (btrfs_test_opt(fs_info, ENOSPC_DEBUG)) {
> -		WARN_ON(block_group->space_info->total_bytes
> -			< block_group->length);
> -		WARN_ON(block_group->space_info->bytes_readonly
> -			< block_group->length - block_group->zone_unusable);
> -		WARN_ON(block_group->space_info->bytes_zone_unusable
> -			< block_group->zone_unusable);
> -		WARN_ON(block_group->space_info->disk_total
> -			< block_group->length * factor);
> -	}
> -	block_group->space_info->total_bytes -= block_group->length;
> -	block_group->space_info->bytes_readonly -=
> -		(block_group->length - block_group->zone_unusable);
> -	btrfs_space_info_update_bytes_zone_unusable(block_group->space_info,
> -						    -block_group->zone_unusable);
> -	block_group->space_info->disk_total -= block_group->length * factor;
> -
>  	spin_unlock(&block_group->space_info->lock);
>  
> +	if (!(block_group->flags & BTRFS_BLOCK_GROUP_REMAPPED))
> +		btrfs_remove_bg_from_sinfo(block_group);

So we double count the block group in the space info usage while the
remapped one stays around? Why delete the block group itself then? Why
not just delete it when the last identity remap is gone and we call this
function? Sorry if this is obvious, but I don't see it for some reason.

> +
>  	/*
>  	 * Remove the free space for the block group from the free space tree
>  	 * and the block group's item from the extent tree before marking the
> @@ -1517,6 +1526,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>  	while (!list_empty(&fs_info->unused_bgs)) {
>  		u64 used;
>  		int trimming;
> +		bool made_ro = false;
>  
>  		block_group = list_first_entry(&fs_info->unused_bgs,
>  					       struct btrfs_block_group,
> @@ -1553,7 +1563,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>  
>  		spin_lock(&space_info->lock);
>  		spin_lock(&block_group->lock);
> -		if (btrfs_is_block_group_used(block_group) || block_group->ro ||
> +		if (btrfs_is_block_group_used(block_group) ||
> +		    (block_group->ro && !(block_group->flags & BTRFS_BLOCK_GROUP_REMAPPED)) ||
>  		    list_is_singular(&block_group->list)) {
>  			/*
>  			 * We want to bail if we made new allocations or have
> @@ -1596,7 +1607,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>  		 */
>  		used = btrfs_space_info_used(space_info, true);
>  		if (space_info->total_bytes - block_group->length < used &&
> -		    block_group->zone_unusable < block_group->length) {
> +		    block_group->zone_unusable < block_group->length &&
> +		    !(block_group->flags & BTRFS_BLOCK_GROUP_REMAPPED)) {
>  			/*
>  			 * Add a reference for the list, compensate for the ref
>  			 * drop under the "next" label for the
> @@ -1614,8 +1626,14 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>  		spin_unlock(&block_group->lock);
>  		spin_unlock(&space_info->lock);
>  
> -		/* We don't want to force the issue, only flip if it's ok. */
> -		ret = inc_block_group_ro(block_group, 0);
> +		if (!(block_group->flags & BTRFS_BLOCK_GROUP_REMAPPED)) {
> +			/* We don't want to force the issue, only flip if it's ok. */
> +			ret = inc_block_group_ro(block_group, 0);
> +			made_ro = true;
> +		} else {
> +			ret = 0;
> +		}
> +
>  		up_write(&space_info->groups_sem);
>  		if (ret < 0) {
>  			ret = 0;
> @@ -1624,7 +1642,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>  
>  		ret = btrfs_zone_finish(block_group);
>  		if (ret < 0) {
> -			btrfs_dec_block_group_ro(block_group);
> +			if (made_ro)
> +				btrfs_dec_block_group_ro(block_group);
>  			if (ret == -EAGAIN)
>  				ret = 0;
>  			goto next;
> @@ -1637,7 +1656,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>  		trans = btrfs_start_trans_remove_block_group(fs_info,
>  						     block_group->start);
>  		if (IS_ERR(trans)) {
> -			btrfs_dec_block_group_ro(block_group);
> +			if (made_ro)
> +				btrfs_dec_block_group_ro(block_group);
>  			ret = PTR_ERR(trans);
>  			goto next;
>  		}
> @@ -1647,7 +1667,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>  		 * just delete them, we don't care about them anymore.
>  		 */
>  		if (!clean_pinned_extents(trans, block_group)) {
> -			btrfs_dec_block_group_ro(block_group);
> +			if (made_ro)
> +				btrfs_dec_block_group_ro(block_group);
>  			goto end_trans;
>  		}
>  
> @@ -1661,7 +1682,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>  		spin_lock(&fs_info->discard_ctl.lock);
>  		if (!list_empty(&block_group->discard_list)) {
>  			spin_unlock(&fs_info->discard_ctl.lock);
> -			btrfs_dec_block_group_ro(block_group);
> +			if (made_ro)
> +				btrfs_dec_block_group_ro(block_group);
>  			btrfs_discard_queue_work(&fs_info->discard_ctl,
>  						 block_group);
>  			goto end_trans;
> diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
> index c484118b8b8d..767898929960 100644
> --- a/fs/btrfs/block-group.h
> +++ b/fs/btrfs/block-group.h
> @@ -329,6 +329,7 @@ int btrfs_add_new_free_space(struct btrfs_block_group *block_group,
>  struct btrfs_trans_handle *btrfs_start_trans_remove_block_group(
>  				struct btrfs_fs_info *fs_info,
>  				const u64 chunk_offset);
> +void btrfs_remove_bg_from_sinfo(struct btrfs_block_group *block_group);
>  int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
>  			     struct btrfs_chunk_map *map);
>  void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info);
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index a0542b581f4e..dac22efd2332 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2922,6 +2922,7 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
>  	mutex_init(&fs_info->chunk_mutex);
>  	mutex_init(&fs_info->transaction_kthread_mutex);
>  	mutex_init(&fs_info->cleaner_mutex);
> +	mutex_init(&fs_info->remap_mutex);
>  	mutex_init(&fs_info->ro_block_group_mutex);
>  	init_rwsem(&fs_info->commit_root_sem);
>  	init_rwsem(&fs_info->cleanup_work_sem);
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index e8f752ef1da9..995784cdca9d 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -40,6 +40,7 @@
>  #include "orphan.h"
>  #include "tree-checker.h"
>  #include "raid-stripe-tree.h"
> +#include "relocation.h"
>  
>  #undef SCRAMBLE_DELAYED_REFS
>  
> @@ -2977,6 +2978,8 @@ static int do_free_extent_accounting(struct btrfs_trans_handle *trans,
>  				     u64 bytenr, struct btrfs_squota_delta *delta)
>  {
>  	int ret;
> +	struct btrfs_block_group *bg;
> +	bool bg_is_remapped = false;
>  	u64 num_bytes = delta->num_bytes;
>  
>  	if (delta->is_data) {
> @@ -3002,10 +3005,22 @@ static int do_free_extent_accounting(struct btrfs_trans_handle *trans,
>  		return ret;
>  	}
>  
> -	ret = add_to_free_space_tree(trans, bytenr, num_bytes);
> -	if (ret) {
> -		btrfs_abort_transaction(trans, ret);
> -		return ret;
> +	if (btrfs_fs_incompat(trans->fs_info, REMAP_TREE)) {
> +		bg = btrfs_lookup_block_group(trans->fs_info, bytenr);
> +		bg_is_remapped = bg->flags & BTRFS_BLOCK_GROUP_REMAPPED;

As mentioned in the patch that sets the flag, this feels quite
susceptible to races. I don't have a particular one in mind, it just
feels off to set and rely on this flag to decide what work to do without
some kind of explicit synchronization plan.

> +		btrfs_put_block_group(bg);
> +	}
> +
> +	/*
> +	 * If remapped, FST has already been taken care of in
> +	 * remove_range_from_remap_tree().
> +	 */
> +	if (!bg_is_remapped) {
> +		ret = add_to_free_space_tree(trans, bytenr, num_bytes);
> +		if (ret) {
> +			btrfs_abort_transaction(trans, ret);
> +			return ret;
> +		}
>  	}
>  
>  	ret = btrfs_update_block_group(trans, bytenr, num_bytes, false);
> @@ -3387,6 +3402,13 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
>  		}
>  		btrfs_release_path(path);
>  
> +		ret = btrfs_remove_extent_from_remap_tree(trans, path, bytenr,
> +							  num_bytes);
> +		if (ret) {
> +			btrfs_abort_transaction(trans, ret);
> +			goto out;
> +		}
> +
>  		ret = do_free_extent_accounting(trans, bytenr, &delta);
>  	}
>  	btrfs_release_path(path);
> diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
> index 8ceeb64aceb3..fd7dc61be9c7 100644
> --- a/fs/btrfs/fs.h
> +++ b/fs/btrfs/fs.h
> @@ -551,6 +551,7 @@ struct btrfs_fs_info {
>  	struct mutex transaction_kthread_mutex;
>  	struct mutex cleaner_mutex;
>  	struct mutex chunk_mutex;
> +	struct mutex remap_mutex;
>  
>  	/*
>  	 * This is taken to make sure we don't set block groups ro after the
> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> index e45f3598ef03..54c3e99c7dab 100644
> --- a/fs/btrfs/relocation.c
> +++ b/fs/btrfs/relocation.c
> @@ -37,6 +37,7 @@
>  #include "super.h"
>  #include "tree-checker.h"
>  #include "raid-stripe-tree.h"
> +#include "free-space-tree.h"
>  
>  /*
>   * Relocation overview
> @@ -3905,6 +3906,150 @@ static const char *stage_to_string(enum reloc_stage stage)
>  	return "unknown";
>  }
>  
> +static void adjust_block_group_remap_bytes(struct btrfs_trans_handle *trans,
> +					   struct btrfs_block_group *bg,
> +					   s64 diff)
> +{
> +	struct btrfs_fs_info *fs_info = trans->fs_info;
> +	bool bg_already_dirty = true;
> +
> +	bg->remap_bytes += diff;
> +
> +	if (bg->used == 0 && bg->remap_bytes == 0)
> +		btrfs_mark_bg_unused(bg);
> +
> +	spin_lock(&trans->transaction->dirty_bgs_lock);
> +	if (list_empty(&bg->dirty_list)) {
> +		list_add_tail(&bg->dirty_list, &trans->transaction->dirty_bgs);
> +		bg_already_dirty = false;
> +		btrfs_get_block_group(bg);
> +	}
> +	spin_unlock(&trans->transaction->dirty_bgs_lock);
> +
> +	/* Modified block groups are accounted for in the delayed_refs_rsv. */
> +	if (!bg_already_dirty)
> +		btrfs_inc_delayed_refs_rsv_bg_updates(fs_info);
> +}
> +
> +static int remove_chunk_stripes(struct btrfs_trans_handle *trans,
> +				struct btrfs_chunk_map *chunk,
> +				struct btrfs_path *path)
> +{
> +	struct btrfs_fs_info *fs_info = trans->fs_info;
> +	struct btrfs_key key;
> +	struct extent_buffer *leaf;
> +	struct btrfs_chunk *c;
> +	int ret;
> +
> +	key.objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID;
> +	key.type = BTRFS_CHUNK_ITEM_KEY;
> +	key.offset = chunk->start;
> +
> +	ret = btrfs_search_slot(trans, fs_info->chunk_root, &key, path,
> +				0, 1);
> +	if (ret) {
> +		if (ret == 1) {
> +			btrfs_release_path(path);
> +			ret = -ENOENT;
> +		}
> +		return ret;
> +	}
> +
> +	leaf = path->nodes[0];
> +
> +	c = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_chunk);
> +	btrfs_set_chunk_num_stripes(leaf, c, 0);
> +
> +	btrfs_truncate_item(trans, path, offsetof(struct btrfs_chunk, stripe),
> +			    1);
> +
> +	btrfs_mark_buffer_dirty(trans, leaf);
> +
> +	btrfs_release_path(path);
> +
> +	chunk->num_stripes = 0;
> +
> +	return 0;
> +}
> +
> +static int last_identity_remap_gone(struct btrfs_trans_handle *trans,
> +				    struct btrfs_chunk_map *chunk,
> +				    struct btrfs_block_group *bg,
> +				    struct btrfs_path *path)
> +{
> +	int ret;
> +
> +	ret = btrfs_remove_dev_extents(trans, chunk);
> +	if (ret)
> +		return ret;
> +
> +	mutex_lock(&trans->fs_info->chunk_mutex);
> +
> +	for (unsigned int i = 0; i < chunk->num_stripes; i++) {
> +		ret = btrfs_update_device(trans, chunk->stripes[i].dev);
> +		if (ret) {
> +			mutex_unlock(&trans->fs_info->chunk_mutex);
> +			return ret;
> +		}
> +	}
> +
> +	mutex_unlock(&trans->fs_info->chunk_mutex);
> +
> +	write_lock(&trans->fs_info->mapping_tree_lock);
> +	btrfs_chunk_map_device_clear_bits(chunk, CHUNK_ALLOCATED);
> +	write_unlock(&trans->fs_info->mapping_tree_lock);
> +
> +	btrfs_remove_bg_from_sinfo(bg);
> +
> +	ret = remove_chunk_stripes(trans, chunk, path);
> +	if (ret)
> +		return ret;
> +
> +	return 0;
> +}
> +
> +static int adjust_identity_remap_count(struct btrfs_trans_handle *trans,
> +				       struct btrfs_path *path,
> +				       struct btrfs_block_group *bg, int delta)
> +{
> +	struct btrfs_fs_info *fs_info = trans->fs_info;
> +	struct btrfs_chunk_map *chunk;
> +	bool bg_already_dirty = true;
> +	int ret;
> +
> +	WARN_ON(delta < 0 && -delta > bg->identity_remap_count);
> +
> +	bg->identity_remap_count += delta;
> +
> +	spin_lock(&trans->transaction->dirty_bgs_lock);
> +	if (list_empty(&bg->dirty_list)) {
> +		list_add_tail(&bg->dirty_list, &trans->transaction->dirty_bgs);
> +		bg_already_dirty = false;
> +		btrfs_get_block_group(bg);
> +	}
> +	spin_unlock(&trans->transaction->dirty_bgs_lock);
> +
> +	/* Modified block groups are accounted for in the delayed_refs_rsv. */
> +	if (!bg_already_dirty)
> +		btrfs_inc_delayed_refs_rsv_bg_updates(fs_info);
> +
> +	if (bg->identity_remap_count != 0)
> +		return 0;
> +
> +	chunk = btrfs_find_chunk_map(fs_info, bg->start, 1);
> +	if (!chunk)
> +		return -ENOENT;
> +
> +	ret = last_identity_remap_gone(trans, chunk, bg, path);
> +	if (ret)
> +		goto end;
> +
> +	ret = 0;
> +end:
> +	btrfs_free_chunk_map(chunk);
> +	return ret;
> +}
> +
>  int btrfs_translate_remap(struct btrfs_fs_info *fs_info, u64 *logical,
>  			  u64 *length, bool nolock)
>  {
> @@ -4521,3 +4666,363 @@ u64 btrfs_get_reloc_bg_bytenr(const struct btrfs_fs_info *fs_info)
>  		logical = fs_info->reloc_ctl->block_group->start;
>  	return logical;
>  }
> +
> +static int remove_range_from_remap_tree(struct btrfs_trans_handle *trans,
> +					struct btrfs_path *path,
> +					struct btrfs_block_group *bg,
> +					u64 bytenr, u64 num_bytes)
> +{
> +	int ret;
> +	struct btrfs_fs_info *fs_info = trans->fs_info;
> +	struct extent_buffer *leaf = path->nodes[0];
> +	struct btrfs_key key, new_key;
> +	struct btrfs_remap *remap_ptr = NULL, remap;
> +	struct btrfs_block_group *dest_bg = NULL;
> +	u64 end, new_addr = 0, remap_start, remap_length, overlap_length;
> +	bool is_identity_remap;
> +
> +	end = bytenr + num_bytes;
> +
> +	btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
> +
> +	is_identity_remap = key.type == BTRFS_IDENTITY_REMAP_KEY;
> +
> +	remap_start = key.objectid;
> +	remap_length = key.offset;
> +
> +	if (!is_identity_remap) {
> +		remap_ptr = btrfs_item_ptr(leaf, path->slots[0],
> +					   struct btrfs_remap);
> +		new_addr = btrfs_remap_address(leaf, remap_ptr);
> +
> +		dest_bg = btrfs_lookup_block_group(fs_info, new_addr);
> +	}
> +
> +	if (bytenr == remap_start && num_bytes >= remap_length) {
> +		/* Remove entirely. */
> +
> +		ret = btrfs_del_item(trans, fs_info->remap_root, path);
> +		if (ret)
> +			goto end;
> +
> +		btrfs_release_path(path);
> +
> +		overlap_length = remap_length;
> +
> +		if (!is_identity_remap) {
> +			/* Remove backref. */
> +
> +			key.objectid = new_addr;
> +			key.type = BTRFS_REMAP_BACKREF_KEY;
> +			key.offset = remap_length;
> +
> +			ret = btrfs_search_slot(trans, fs_info->remap_root,
> +						&key, path, -1, 1);
> +			if (ret) {
> +				if (ret == 1) {
> +					btrfs_release_path(path);
> +					ret = -ENOENT;
> +				}
> +				goto end;
> +			}
> +
> +			ret = btrfs_del_item(trans, fs_info->remap_root, path);
> +
> +			btrfs_release_path(path);
> +
> +			if (ret)
> +				goto end;
> +
> +			adjust_block_group_remap_bytes(trans, dest_bg,
> +						       -remap_length);
> +		} else {
> +			ret = adjust_identity_remap_count(trans, path, bg, -1);
> +			if (ret)
> +				goto end;
> +		}
> +	} else if (bytenr == remap_start) {
> +		/* Remove beginning. */
> +
> +		new_key.objectid = end;
> +		new_key.type = key.type;
> +		new_key.offset = remap_length + remap_start - end;
> +
> +		btrfs_set_item_key_safe(trans, path, &new_key);
> +		btrfs_mark_buffer_dirty(trans, leaf);
> +
> +		overlap_length = num_bytes;
> +
> +		if (!is_identity_remap) {
> +			btrfs_set_remap_address(leaf, remap_ptr,
> +						new_addr + end - remap_start);
> +			btrfs_release_path(path);
> +
> +			/* Adjust backref. */
> +
> +			key.objectid = new_addr;
> +			key.type = BTRFS_REMAP_BACKREF_KEY;
> +			key.offset = remap_length;
> +
> +			ret = btrfs_search_slot(trans, fs_info->remap_root,
> +						&key, path, -1, 1);
> +			if (ret) {
> +				if (ret == 1) {
> +					btrfs_release_path(path);
> +					ret = -ENOENT;
> +				}
> +				goto end;
> +			}
> +
> +			leaf = path->nodes[0];
> +
> +			new_key.objectid = new_addr + end - remap_start;
> +			new_key.type = BTRFS_REMAP_BACKREF_KEY;
> +			new_key.offset = remap_length + remap_start - end;
> +
> +			btrfs_set_item_key_safe(trans, path, &new_key);
> +
> +			remap_ptr = btrfs_item_ptr(leaf, path->slots[0],
> +						   struct btrfs_remap);
> +			btrfs_set_remap_address(leaf, remap_ptr, end);
> +
> +			btrfs_mark_buffer_dirty(trans, path->nodes[0]);
> +
> +			btrfs_release_path(path);
> +
> +			adjust_block_group_remap_bytes(trans, dest_bg,
> +						       -num_bytes);
> +		}
> +	} else if (bytenr + num_bytes < remap_start + remap_length) {
> +		/* Remove middle. */
> +
> +		new_key.objectid = remap_start;
> +		new_key.type = key.type;
> +		new_key.offset = bytenr - remap_start;
> +
> +		btrfs_set_item_key_safe(trans, path, &new_key);
> +		btrfs_mark_buffer_dirty(trans, leaf);
> +
> +		new_key.objectid = end;
> +		new_key.offset = remap_start + remap_length - end;
> +
> +		btrfs_release_path(path);
> +
> +		overlap_length = num_bytes;
> +
> +		if (!is_identity_remap) {
> +			/* Add second remap entry. */
> +
> +			ret = btrfs_insert_empty_item(trans, fs_info->remap_root,
> +						path, &new_key,
> +						sizeof(struct btrfs_remap));
> +			if (ret)
> +				goto end;
> +
> +			btrfs_set_stack_remap_address(&remap,
> +						new_addr + end - remap_start);
> +
> +			write_extent_buffer(path->nodes[0], &remap,
> +				btrfs_item_ptr_offset(path->nodes[0], path->slots[0]),
> +				sizeof(struct btrfs_remap));
> +
> +			btrfs_release_path(path);
> +
> +			/* Shorten backref entry. */
> +
> +			key.objectid = new_addr;
> +			key.type = BTRFS_REMAP_BACKREF_KEY;
> +			key.offset = remap_length;
> +
> +			ret = btrfs_search_slot(trans, fs_info->remap_root,
> +						&key, path, -1, 1);
> +			if (ret) {
> +				if (ret == 1) {
> +					btrfs_release_path(path);
> +					ret = -ENOENT;
> +				}
> +				goto end;
> +			}
> +
> +			new_key.objectid = new_addr;
> +			new_key.type = BTRFS_REMAP_BACKREF_KEY;
> +			new_key.offset = bytenr - remap_start;
> +
> +			btrfs_set_item_key_safe(trans, path, &new_key);
> +			btrfs_mark_buffer_dirty(trans, path->nodes[0]);
> +
> +			btrfs_release_path(path);
> +
> +			/* Add second backref entry. */
> +
> +			new_key.objectid = new_addr + end - remap_start;
> +			new_key.type = BTRFS_REMAP_BACKREF_KEY;
> +			new_key.offset = remap_start + remap_length - end;
> +
> +			ret = btrfs_insert_empty_item(trans, fs_info->remap_root,
> +						path, &new_key,
> +						sizeof(struct btrfs_remap));
> +			if (ret)
> +				goto end;
> +
> +			btrfs_set_stack_remap_address(&remap, end);
> +
> +			write_extent_buffer(path->nodes[0], &remap,
> +				btrfs_item_ptr_offset(path->nodes[0], path->slots[0]),
> +				sizeof(struct btrfs_remap));
> +
> +			btrfs_release_path(path);
> +
> +			adjust_block_group_remap_bytes(trans, dest_bg,
> +						       -num_bytes);
> +		} else {
> +			/* Add second identity remap entry. */
> +
> +			ret = btrfs_insert_empty_item(trans, fs_info->remap_root,
> +						      path, &new_key, 0);
> +			if (ret)
> +				goto end;
> +
> +			btrfs_release_path(path);
> +
> +			ret = adjust_identity_remap_count(trans, path, bg, 1);
> +			if (ret)
> +				goto end;
> +		}
> +	} else {
> +		/* Remove end. */
> +
> +		new_key.objectid = remap_start;
> +		new_key.type = key.type;
> +		new_key.offset = bytenr - remap_start;
> +
> +		btrfs_set_item_key_safe(trans, path, &new_key);
> +		btrfs_mark_buffer_dirty(trans, leaf);
> +
> +		btrfs_release_path(path);
> +
> +		overlap_length = remap_start + remap_length - bytenr;
> +
> +		if (!is_identity_remap) {
> +			/* Shorten backref entry. */
> +
> +			key.objectid = new_addr;
> +			key.type = BTRFS_REMAP_BACKREF_KEY;
> +			key.offset = remap_length;
> +
> +			ret = btrfs_search_slot(trans, fs_info->remap_root,
> +						&key, path, -1, 1);
> +			if (ret) {
> +				if (ret == 1) {
> +					btrfs_release_path(path);
> +					ret = -ENOENT;
> +				}
> +				goto end;
> +			}
> +
> +			new_key.objectid = new_addr;
> +			new_key.type = BTRFS_REMAP_BACKREF_KEY;
> +			new_key.offset = bytenr - remap_start;
> +
> +			btrfs_set_item_key_safe(trans, path, &new_key);
> +			btrfs_mark_buffer_dirty(trans, path->nodes[0]);
> +
> +			btrfs_release_path(path);
> +
> +			adjust_block_group_remap_bytes(trans, dest_bg,
> +					bytenr - remap_start - remap_length);
> +		}
> +	}
> +
> +	if (!is_identity_remap) {
> +		ret = add_to_free_space_tree(trans,
> +					     bytenr - remap_start + new_addr,
> +					     overlap_length);
> +		if (ret)
> +			goto end;
> +	}
> +
> +	ret = overlap_length;
> +
> +end:
> +	if (dest_bg)
> +		btrfs_put_block_group(dest_bg);
> +
> +	return ret;
> +}
> +
> +int btrfs_remove_extent_from_remap_tree(struct btrfs_trans_handle *trans,
> +					struct btrfs_path *path,
> +					u64 bytenr, u64 num_bytes)
> +{
> +	struct btrfs_fs_info *fs_info = trans->fs_info;
> +	struct btrfs_key key, found_key;
> +	struct extent_buffer *leaf;
> +	struct btrfs_block_group *bg;
> +	int ret;
> +
> +	if (!(btrfs_super_incompat_flags(fs_info->super_copy) &
> +	      BTRFS_FEATURE_INCOMPAT_REMAP_TREE))
> +		return 0;
> +
> +	bg = btrfs_lookup_block_group(fs_info, bytenr);
> +	if (!bg)
> +		return 0;
> +
> +	if (!(bg->flags & BTRFS_BLOCK_GROUP_REMAPPED)) {
> +		btrfs_put_block_group(bg);
> +		return 0;
> +	}
> +
> +	mutex_lock(&fs_info->remap_mutex);
> +
> +	do {
> +		key.objectid = bytenr;
> +		key.type = (u8)-1;
> +		key.offset = (u64)-1;
> +
> +		ret = btrfs_search_slot(trans, fs_info->remap_root, &key, path,
> +					-1, 1);
> +		if (ret < 0)
> +			goto end;
> +
> +		leaf = path->nodes[0];
> +
> +		if (path->slots[0] == 0) {
> +			ret = -ENOENT;
> +			goto end;
> +		}
> +
> +		path->slots[0]--;
> +
> +		btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
> +
> +		if (found_key.type != BTRFS_IDENTITY_REMAP_KEY &&
> +		    found_key.type != BTRFS_REMAP_KEY) {
> +			ret = -ENOENT;
> +			goto end;
> +		}
> +
> +		if (bytenr < found_key.objectid ||
> +		    bytenr >= found_key.objectid + found_key.offset) {
> +			ret = -ENOENT;
> +			goto end;
> +		}
> +
> +		ret = remove_range_from_remap_tree(trans, path, bg, bytenr,
> +						   num_bytes);
> +		if (ret < 0)
> +			goto end;
> +
> +		bytenr += ret;
> +		num_bytes -= ret;
> +	} while (num_bytes > 0);
> +
> +	ret = 0;
> +
> +end:
> +	mutex_unlock(&fs_info->remap_mutex);
> +
> +	btrfs_put_block_group(bg);
> +	btrfs_release_path(path);
> +	return ret;
> +}
> diff --git a/fs/btrfs/relocation.h b/fs/btrfs/relocation.h
> index 8c9dfc55b799..0021f812b12c 100644
> --- a/fs/btrfs/relocation.h
> +++ b/fs/btrfs/relocation.h
> @@ -32,5 +32,8 @@ bool btrfs_should_ignore_reloc_root(const struct btrfs_root *root);
>  u64 btrfs_get_reloc_bg_bytenr(const struct btrfs_fs_info *fs_info);
>  int btrfs_translate_remap(struct btrfs_fs_info *fs_info, u64 *logical,
>  			  u64 *length, bool nolock);
> +int btrfs_remove_extent_from_remap_tree(struct btrfs_trans_handle *trans,
> +					struct btrfs_path *path,
> +					u64 bytenr, u64 num_bytes);
>  
>  #endif
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 62bd6259ebd3..6c0a67da92f1 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -2931,8 +2931,8 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>  	return ret;
>  }
>  
> -static noinline int btrfs_update_device(struct btrfs_trans_handle *trans,
> -					struct btrfs_device *device)
> +int btrfs_update_device(struct btrfs_trans_handle *trans,
> +			struct btrfs_device *device)
>  {
>  	int ret;
>  	struct btrfs_path *path;
> @@ -3236,25 +3236,13 @@ static int remove_chunk_item(struct btrfs_trans_handle *trans,
>  	return btrfs_free_chunk(trans, chunk_offset);
>  }
>  
> -int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset)
> +int btrfs_remove_dev_extents(struct btrfs_trans_handle *trans,
> +			     struct btrfs_chunk_map *map)
>  {
>  	struct btrfs_fs_info *fs_info = trans->fs_info;
> -	struct btrfs_chunk_map *map;
> +	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
>  	u64 dev_extent_len = 0;
>  	int i, ret = 0;
> -	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
> -
> -	map = btrfs_get_chunk_map(fs_info, chunk_offset, 1);
> -	if (IS_ERR(map)) {
> -		/*
> -		 * This is a logic error, but we don't want to just rely on the
> -		 * user having built with ASSERT enabled, so if ASSERT doesn't
> -		 * do anything we still error out.
> -		 */
> -		DEBUG_WARN("errr %ld reading chunk map at offset %llu",
> -			   PTR_ERR(map), chunk_offset);
> -		return PTR_ERR(map);
> -	}
>  
>  	/*
>  	 * First delete the device extent items from the devices btree.
> @@ -3275,7 +3263,7 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset)
>  		if (ret) {
>  			mutex_unlock(&fs_devices->device_list_mutex);
>  			btrfs_abort_transaction(trans, ret);
> -			goto out;
> +			return ret;
>  		}
>  
>  		if (device->bytes_used > 0) {
> @@ -3295,6 +3283,30 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset)
>  	}
>  	mutex_unlock(&fs_devices->device_list_mutex);
>  
> +	return 0;
> +}
> +
> +int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset)
> +{
> +	struct btrfs_fs_info *fs_info = trans->fs_info;
> +	struct btrfs_chunk_map *map;
> +	int ret;
> +
> +	map = btrfs_get_chunk_map(fs_info, chunk_offset, 1);
> +	if (IS_ERR(map)) {
> +		/*
> +		 * This is a logic error, but we don't want to just rely on the
> +		 * user having built with ASSERT enabled, so if ASSERT doesn't
> +		 * do anything we still error out.
> +		 */
> +		ASSERT(0);
> +		return PTR_ERR(map);
> +	}
> +
> +	ret = btrfs_remove_dev_extents(trans, map);
> +	if (ret)
> +		goto out;
> +
>  	/*
>  	 * We acquire fs_info->chunk_mutex for 2 reasons:
>  	 *
> @@ -5436,7 +5448,7 @@ static void chunk_map_device_set_bits(struct btrfs_chunk_map *map, unsigned int
>  	}
>  }
>  
> -static void chunk_map_device_clear_bits(struct btrfs_chunk_map *map, unsigned int bits)
> +void btrfs_chunk_map_device_clear_bits(struct btrfs_chunk_map *map, unsigned int bits)
>  {
>  	for (int i = 0; i < map->num_stripes; i++) {
>  		struct btrfs_io_stripe *stripe = &map->stripes[i];
> @@ -5453,7 +5465,7 @@ void btrfs_remove_chunk_map(struct btrfs_fs_info *fs_info, struct btrfs_chunk_ma
>  	write_lock(&fs_info->mapping_tree_lock);
>  	rb_erase_cached(&map->rb_node, &fs_info->mapping_tree);
>  	RB_CLEAR_NODE(&map->rb_node);
> -	chunk_map_device_clear_bits(map, CHUNK_ALLOCATED);
> +	btrfs_chunk_map_device_clear_bits(map, CHUNK_ALLOCATED);
>  	write_unlock(&fs_info->mapping_tree_lock);
>  
>  	/* Once for the tree reference. */
> @@ -5489,7 +5501,7 @@ int btrfs_add_chunk_map(struct btrfs_fs_info *fs_info, struct btrfs_chunk_map *m
>  		return -EEXIST;
>  	}
>  	chunk_map_device_set_bits(map, CHUNK_ALLOCATED);
> -	chunk_map_device_clear_bits(map, CHUNK_TRIMMED);
> +	btrfs_chunk_map_device_clear_bits(map, CHUNK_TRIMMED);
>  	write_unlock(&fs_info->mapping_tree_lock);
>  
>  	return 0;
> @@ -5854,7 +5866,7 @@ void btrfs_mapping_tree_free(struct btrfs_fs_info *fs_info)
>  		map = rb_entry(node, struct btrfs_chunk_map, rb_node);
>  		rb_erase_cached(&map->rb_node, &fs_info->mapping_tree);
>  		RB_CLEAR_NODE(&map->rb_node);
> -		chunk_map_device_clear_bits(map, CHUNK_ALLOCATED);
> +		btrfs_chunk_map_device_clear_bits(map, CHUNK_ALLOCATED);
>  		/* Once for the tree ref. */
>  		btrfs_free_chunk_map(map);
>  		cond_resched_rwlock_write(&fs_info->mapping_tree_lock);
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index 9fb8fe4312a5..0a73ea2a2a6a 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -779,6 +779,8 @@ u64 btrfs_calc_stripe_length(const struct btrfs_chunk_map *map);
>  int btrfs_nr_parity_stripes(u64 type);
>  int btrfs_chunk_alloc_add_chunk_item(struct btrfs_trans_handle *trans,
>  				     struct btrfs_block_group *bg);
> +int btrfs_remove_dev_extents(struct btrfs_trans_handle *trans,
> +			     struct btrfs_chunk_map *map);
>  int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset);
>  
>  #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
> @@ -876,6 +878,10 @@ bool btrfs_repair_one_zone(struct btrfs_fs_info *fs_info, u64 logical);
>  
>  bool btrfs_pinned_by_swapfile(struct btrfs_fs_info *fs_info, void *ptr);
>  const u8 *btrfs_sb_fsid_ptr(const struct btrfs_super_block *sb);
> +int btrfs_update_device(struct btrfs_trans_handle *trans,
> +			struct btrfs_device *device);
> +void btrfs_chunk_map_device_clear_bits(struct btrfs_chunk_map *map,
> +				       unsigned int bits);
>  
>  #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
>  struct btrfs_io_context *alloc_btrfs_io_context(struct btrfs_fs_info *fs_info,
> -- 
> 2.49.0
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 00/12] btrfs: remap tree
  2025-06-05 16:23 [PATCH 00/12] btrfs: remap tree Mark Harmstone
                   ` (15 preceding siblings ...)
  2025-06-11 15:28 ` Mark Harmstone
@ 2025-06-14  0:04 ` Boris Burkov
  2025-06-26 22:10 ` Mark Harmstone
  17 siblings, 0 replies; 39+ messages in thread
From: Boris Burkov @ 2025-06-14  0:04 UTC (permalink / raw)
  To: Mark Harmstone; +Cc: linux-btrfs

On Thu, Jun 05, 2025 at 05:23:30PM +0100, Mark Harmstone wrote:
> This patch series adds a disk format change gated behind
> CONFIG_BTRFS_EXPERIMENTAL to add a "remap tree", which acts as a layer of
> indirection when doing I/O. When doing relocation, rather than fixing up every
> tree, we instead record the old and new addresses in the remap tree. This should
> hopefully make things more reliable and flexible, as well as enabling some
> future changes we'd like to make, such as larger data extents and reducing
> write amplification by removing cow-only metadata items.
> 
> The remap tree lives in a new REMAP chunk type. This is because bootstrapping
> means that it can't be remapped itself, and has to be relocated by COWing it as
> at present. It can't go in the SYSTEM chunk as we are then limited by the chunk
> item needing to fit in the superblock.
> 
> For more on the design and rationale, please see my RFC sent last month[1], as
> well as Josef Bacik's original design document[2]. The main change from Josef's
> design is that I've added remap backrefs, as we need to be able to move a
> chunk's existing remaps before remapping it.
> 
> You will also need my patches to btrfs-progs[3] to make
> `mkfs.btrfs -O remap-tree` work, as well as allowing `btrfs check` to recognize
> the new format.
> 
> Changes since the RFC:
> 
> * I've reduced the REMAP chunk size from the normal 1GB to 32MB, to match the
>   SYSTEM chunk. For a filesystem with 4KB sectors and 16KB node size, the worst
>   case is that one leaf covers ~1MB of data, and the best case ~250GB. For a
>   chunk, that implies a worst case of ~2GB and a best case of ~500TB.
>   This isn't a disk-format change, so we can always adjust it if it proves too
>   big or small in practice. mkfs creates 8MB chunks, as it does for everything.
> 
> * You can't make new allocations from remapped block groups, so I've changed
>   it so there are no free-space entries for these (thanks to Boris Burkov for
>   the suggestion).
> 
> * The remap tree doesn't have metadata items in the extent tree (thanks to Josef
>   for the suggestion). This was to work around some corruption that delayed refs
>   were causing, but it also fits in with our future plans of removing all
>   metadata items for COW-only trees, reducing write amplification.
>   A knock-on effect of this is that I've had to disable balancing of the remap
>   chunk itself. This is because we can no longer walk the extent tree, and will
>   have to walk the remap tree instead. When we remove the COW-only metadata
>   items, we will also have to do this for the chunk and root trees, as
>   bootstrapping means they can't be remapped.
> 
> * btrfs_translate_remap() uses search_commit_root when doing metadata lookups,
>   to avoid nested locking issues. This also seems to be a lot quicker (btrfs/187
>   went from ~20mins to ~90secs).
> 
> * Unused remapped block groups should now get cleaned up more aggressively
> 
> * Other miscellaneous cleanups and fixes
> 
> Known issues:
> 
> * Relocation still needs to be implemented for the remap tree itself (see above)
> 
> * Some test failures: btrfs/156, btrfs/170, btrfs/226, btrfs/250
> 
> * nodatacow extents aren't safe, as they can race with the relocation thread.
>   We either need to follow the btrfs_inc_nocow_writers() approach, which COWs
>   the extent, or change it so that it blocks here.
> 
> * When initially marking a block group as remapped, we are walking the free-
>   space tree and creating the identity remaps all in one transaction. For the
>   worst-case scenario, i.e. a 1GB block group with every other sector allocated
>   (131,072 extents), this can result in transaction times of more than 10 mins.
>   This needs to be changed to allow this to happen over multiple transactions.
> 
> * All this is disabled for zoned devices for the time being, as I've not been
>   able to test it. I'm planning to make it compatible with zoned at a later
>   date.
> 
> Thanks

This is really great, thanks Mark. I have tried to review the setup
patches some more, as during the RFC I mostly looked at the "remap
copying" stuff.

I still owe you more thorough review on the chunk map bit (looks quite
reasonable to me, though) and the details on the identity remaps going
away.

With that said, some more high level thoughts:

Do you have a design document for the new structures, new invariants,
etc..? This is changing some pretty fundamental assumptions about the
extent tree, free space tree, is introducing double logical mapping,
etc... Whether it is in the patches that introduce the new on disk
stuff, or a separate doc patch, I think it is quite critical. Perhaps
that will primarily live with the progs changes, but I still think some
thorough exposition in these patches will be good.

Reading all the patches, I found myself a bit concerned about the
proliferation of special cases for the remapped case that didn't feel
fully fleshed out or understood. I apologize if they were all carefully
considered, but just going through the code some of them felt thrown in
to not have to worry about it. I feel like that incurs non-trivial tech
debt in the long run and I would really like to avoid it. Each special
case should be thoroughly reasoned and as elegantly "hidden" as
possible, IMO. I understand that this is kind of picky and high-level,
so I also tried to call out the ones where I had a specific reason I felt
they didn't make sense or the code could be improved to make this
complaint as concrete as possible.

Performance numbers like you started sharing in your reply would be great.
In particular, in addition to what you shared, I would like to know:
1. How badly does it beat regular relocation when lots of backrefs are
involved? Jack up the snapshots/reflinks and show the scaling. I
*assume* it should blow the doors off relocation v1 in that case, but I
want to see it :)
2. How bad is the overhead on the remapped reads in the happy case in
practice?

There were concerns (I think from Qu) about this increasing
file fragmentation by breaking up the remapped allocations down to block
size. Should we separate that change out? I think slipping that in as an
incidental side-effect with this is not ideal, as we aren't sure we want
that. I personally kind of think it is neat and makes relocation more
robust, but that's just a hunch.

I'd also love to hear what others view as blockers on moving forward with
this. So far I think we have concerns about:
- new space info
- relocating the remap tree itself being needed for full balances but
  not supported
- potential read perf issues
- increased file fragmentation
- transaction latency issues for the "read all the free space tree holes
  and write out identity remaps" transaction

and maybe some others I'm forgetting. Which are blockers to landing
behind EXPERIMENTAL? Which are blockers for moving from EXPERIMENTAL to
fully legit status? This question is probably more directed at other
reviewers and the development group at large rather than you, Mark.

Thanks again for your excellent work on this patch series, and I
earnestly hope we can all start reaping the benefits of this soon!

Boris

> 
> [1] https://lwn.net/Articles/1021452/
> [2] https://github.com/btrfs/btrfs-todo/issues/54
> [3] https://github.com/maharmstone/btrfs-progs/tree/remap-tree
> 
> Mark Harmstone (12):
>   btrfs: add definitions and constants for remap-tree
>   btrfs: add REMAP chunk type
>   btrfs: allow remapped chunks to have zero stripes
>   btrfs: remove remapped block groups from the free-space tree
>   btrfs: don't add metadata items for the remap tree to the extent tree
>   btrfs: add extended version of struct block_group_item
>   btrfs: allow mounting filesystems with remap-tree incompat flag
>   btrfs: redirect I/O for remapped block groups
>   btrfs: handle deletions from remapped block group
>   btrfs: handle setting up relocation of block group with remap-tree
>   btrfs: move existing remaps before relocating block group
>   btrfs: replace identity maps with actual remaps when doing relocations
> 
>  fs/btrfs/Kconfig                |    2 +
>  fs/btrfs/accessors.h            |   29 +
>  fs/btrfs/block-group.c          |  202 +++-
>  fs/btrfs/block-group.h          |   15 +-
>  fs/btrfs/block-rsv.c            |    8 +
>  fs/btrfs/block-rsv.h            |    1 +
>  fs/btrfs/discard.c              |   11 +-
>  fs/btrfs/disk-io.c              |   91 +-
>  fs/btrfs/extent-tree.c          |  152 ++-
>  fs/btrfs/free-space-tree.c      |    4 +-
>  fs/btrfs/free-space-tree.h      |    5 +-
>  fs/btrfs/fs.h                   |    7 +-
>  fs/btrfs/relocation.c           | 1897 ++++++++++++++++++++++++++++++-
>  fs/btrfs/relocation.h           |    8 +-
>  fs/btrfs/space-info.c           |   22 +-
>  fs/btrfs/sysfs.c                |    4 +
>  fs/btrfs/transaction.c          |    7 +
>  fs/btrfs/tree-checker.c         |   37 +-
>  fs/btrfs/volumes.c              |  115 +-
>  fs/btrfs/volumes.h              |   17 +-
>  include/uapi/linux/btrfs.h      |    1 +
>  include/uapi/linux/btrfs_tree.h |   29 +-
>  22 files changed, 2444 insertions(+), 220 deletions(-)
> 
> -- 
> 2.49.0
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 00/12] btrfs: remap tree
  2025-06-05 16:23 [PATCH 00/12] btrfs: remap tree Mark Harmstone
                   ` (16 preceding siblings ...)
  2025-06-14  0:04 ` Boris Burkov
@ 2025-06-26 22:10 ` Mark Harmstone
  2025-06-27  5:59   ` Neal Gompa
  17 siblings, 1 reply; 39+ messages in thread
From: Mark Harmstone @ 2025-06-26 22:10 UTC (permalink / raw)
  To: linux-btrfs

Some performance figures for this.

There's a statistically insignificant slowdown doing I/O after a balance:

	Compiling kernel after balance (5 times)
	925s	without remap-tree
	926s	with remap-tree

Doing I/O with a balance looping in the background is slightly quicker:

	Compiling kernel while balancing
	209s	without remap-tree
	207s	with remap-tree

Doing a data balance with a 10MB file snapshotted 100,000 times is far, 
far quicker:

	Balancing with 100,000 snapshots
	29.4s	without remap-tree
	0.089s	with remap-tree

I can provide the exact scripts for any of this if anybody's interested.

Mark

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 00/12] btrfs: remap tree
  2025-06-26 22:10 ` Mark Harmstone
@ 2025-06-27  5:59   ` Neal Gompa
  0 siblings, 0 replies; 39+ messages in thread
From: Neal Gompa @ 2025-06-27  5:59 UTC (permalink / raw)
  To: Mark Harmstone; +Cc: linux-btrfs

On Fri, Jun 27, 2025 at 12:09 AM Mark Harmstone <mark@harmstone.com> wrote:
>
> Some performance figures for this.
>
> There's a statistically insignificant slowdown doing I/O after a balance:
>
>         Compiling kernel after balance (5 times)
>         925s    without remap-tree
>         926s    with remap-tree
>
> Doing I/O with a balance looping in the background is slightly quicker:
>
>         Compiling kernel while balancing
>         209s    without remap-tree
>         207s    with remap-tree
>
> Doing a data balance with a 10MB file snapshotted 100,000 times is far,
> far quicker:
>
>         Balancing with 100,000 snapshots
>         29.4s   without remap-tree
>         0.089s  with remap-tree
>
> I can provide the exact scripts for any of this if anybody's interested.
>

I would be interested in this!


-- 
真実はいつも一つ!/ Always, there's only one truth!

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 03/12] btrfs: allow remapped chunks to have zero stripes
  2025-06-13 21:41   ` Boris Burkov
@ 2025-08-08 14:12     ` Mark Harmstone
  0 siblings, 0 replies; 39+ messages in thread
From: Mark Harmstone @ 2025-08-08 14:12 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs

On 13/06/2025 10.41 pm, Boris Burkov wrote:

...snip...

>> diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
>> index 0505f8d76581..fd83df06e3fb 100644
>> --- a/fs/btrfs/tree-checker.c
>> +++ b/fs/btrfs/tree-checker.c
>> @@ -829,7 +829,7 @@ int btrfs_check_chunk_valid(const struct btrfs_fs_info *fs_info,
>>   	u64 type;
>>   	u64 features;
>>   	u32 chunk_sector_size;
>> -	bool mixed = false;
>> +	bool mixed = false, remapped;
> 
> nit:
> I *personally* don't like such declarations. I think they are lightly if
> not decisively discouraged in the Linux Coding Style document as well
> (at least to my reading). In our code, I found some examples where they
> are used where both values are related and get assigned. But I haven't
> seen any like this for unrelated variables with mixed assignment.
> 
>>   	int raid_index;
>>   	int nparity;
>>   	int ncopies;
>> @@ -853,12 +853,14 @@ int btrfs_check_chunk_valid(const struct btrfs_fs_info *fs_info,
>>   	ncopies = btrfs_raid_array[raid_index].ncopies;
>>   	nparity = btrfs_raid_array[raid_index].nparity;
>>   
>> -	if (unlikely(!num_stripes)) {
>> +	remapped = type & BTRFS_BLOCK_GROUP_REMAPPED;
>> +
>> +	if (unlikely(!remapped && !num_stripes)) {
>>   		chunk_err(fs_info, leaf, chunk, logical,
>>   			  "invalid chunk num_stripes, have %u", num_stripes);
>>   		return -EUCLEAN;
>>   	}
>> -	if (unlikely(num_stripes < ncopies)) {
>> +	if (unlikely(!remapped && num_stripes < ncopies)) {
> 
> I think remapped only permits you exactly num_stripes == 0, not
> num_stripes = 2 if ncopies = 3, right? Though it makes the logic less
> neat, I would make the check as precise as possible on the invariants.

I'm changing the check here to num_stripes != 0 && num_stripes < ncopies,
which I hope makes things a little clearer. It also means that we have an
error if something goes wrong with a partially remapped chunk.

The current logic is that we keep the RAID flags as they are when a chunk
is remapped, and ncopies is derived from the RAID flags. I'm not entirely
sure this is the right thing to do, but it hopefully might make it easier
to discover what happened if something goes wrong. Similarly with
sub_stripes etc.

>>   		chunk_err(fs_info, leaf, chunk, logical,
>>   			  "invalid chunk num_stripes < ncopies, have %u < %d",
>>   			  num_stripes, ncopies);
>> @@ -960,7 +962,7 @@ int btrfs_check_chunk_valid(const struct btrfs_fs_info *fs_info,
>>   		}
>>   	}
>>   
>> -	if (unlikely((type & BTRFS_BLOCK_GROUP_RAID10 &&
>> +	if (unlikely(!remapped && ((type & BTRFS_BLOCK_GROUP_RAID10 &&
>>   		      sub_stripes != btrfs_raid_array[BTRFS_RAID_RAID10].sub_stripes) ||
>>   		     (type & BTRFS_BLOCK_GROUP_RAID1 &&
>>   		      num_stripes != btrfs_raid_array[BTRFS_RAID_RAID1].devs_min) ||
>> @@ -975,7 +977,7 @@ int btrfs_check_chunk_valid(const struct btrfs_fs_info *fs_info,
>>   		     (type & BTRFS_BLOCK_GROUP_DUP &&
>>   		      num_stripes != btrfs_raid_array[BTRFS_RAID_DUP].dev_stripes) ||
>>   		     ((type & BTRFS_BLOCK_GROUP_PROFILE_MASK) == 0 &&
>> -		      num_stripes != btrfs_raid_array[BTRFS_RAID_SINGLE].dev_stripes))) {
>> +		      num_stripes != btrfs_raid_array[BTRFS_RAID_SINGLE].dev_stripes)))) {
>>   		chunk_err(fs_info, leaf, chunk, logical,
>>   			"invalid num_stripes:sub_stripes %u:%u for profile %llu",
>>   			num_stripes, sub_stripes,
>> @@ -999,11 +1001,11 @@ static int check_leaf_chunk_item(struct extent_buffer *leaf,
>>   	struct btrfs_fs_info *fs_info = leaf->fs_info;
>>   	int num_stripes;
>>   
>> -	if (unlikely(btrfs_item_size(leaf, slot) < sizeof(struct btrfs_chunk))) {
>> +	if (unlikely(btrfs_item_size(leaf, slot) < offsetof(struct btrfs_chunk, stripe))) {
>>   		chunk_err(fs_info, leaf, chunk, key->offset,
>>   			"invalid chunk item size: have %u expect [%zu, %u)",
>>   			btrfs_item_size(leaf, slot),
>> -			sizeof(struct btrfs_chunk),
>> +			offsetof(struct btrfs_chunk, stripe),
>>   			BTRFS_LEAF_DATA_SIZE(fs_info));
>>   		return -EUCLEAN;
> 
> Same complaint as above for nstripes < ncopies. Is there some way to
> more generically bypass stripe checking if we detect the case we care
> about: (remapped && num_stripes == 0) ?

I'm presuming this comment is for the if beginning on line 964. I'm moving this
whole bit to a helper function; it's too confusing as it is.

> 
>>   	}
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index e7c467b6af46..9159d11cb143 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -6133,6 +6133,12 @@ struct btrfs_discard_stripe *btrfs_map_discard(struct btrfs_fs_info *fs_info,
>>   		goto out_free_map;
>>   	}
>>   
>> +	/* avoid divide by zero on fully-remapped chunks */
>> +	if (map->num_stripes == 0) {
>> +		ret = -EOPNOTSUPP;
>> +		goto out_free_map;
>> +	}
>> +
> 
> This seems kinda sketchy to me. Presumably once we have remapped a block
> group, we do want to discard it. But this makes that impossible? Does
> the discarding happen before we set num_stripes to 0? Even with
> discard=async?
> 
>>   	offset = logical - map->start;
>>   	length = min_t(u64, map->start + map->chunk_len - logical, length);
>>   	*length_ret = length;
>> @@ -6953,7 +6959,7 @@ u64 btrfs_calc_stripe_length(const struct btrfs_chunk_map *map)
>>   {
>>   	const int data_stripes = calc_data_stripes(map->type, map->num_stripes);
>>   
>> -	return div_u64(map->chunk_len, data_stripes);
>> +	return data_stripes ? div_u64(map->chunk_len, data_stripes) : 0;
> 
> Rather than making 0 a special value meaning remapped, would it be
> useful/clearer to also include a remapped flag on the btrfs_chunk_map?
> This function would have to behave the same way, presumably, but it
> would be less implicit, and we could make assertions on chunk maps that
> they have 0 stripes iff remapped at various places where we use the
> stripe length. That would raise confidence that all the uses of stripe
> logic account for remapped chunks.

It's less that 0 is a special value, more that btrfs_calc_stripe_length()
gets called from read_one_chunk(), and would divide by zero otherwise. I could
have put a check in read_one_chunk() instead, but this seems less footgunny.
If a chunk is fully remapped all I/O to it will be redirected elsewhere, so its
stripe length won't actually be used.

The BTRFS_BLOCK_GROUP_REMAPPED flag already gets set in the btrfs_chunk_map,
as does the num_stripes == 0 meaning fully remapped.

> 
>>   }
>>   
>>   #if BITS_PER_LONG == 32
>> -- 
>> 2.49.0
>>
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 09/12] btrfs: handle deletions from remapped block group
  2025-06-13 23:42   ` Boris Burkov
@ 2025-08-11 16:48     ` Mark Harmstone
  2025-08-11 16:59     ` Mark Harmstone
  1 sibling, 0 replies; 39+ messages in thread
From: Mark Harmstone @ 2025-08-11 16:48 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs

On 14/06/2025 12.42 am, Boris Burkov wrote:
> On Thu, Jun 05, 2025 at 05:23:39PM +0100, Mark Harmstone wrote:
>> Handle the case where we free an extent from a block group that has the
>> REMAPPED flag set. Because the remap tree is orthogonal to the extent
>> tree, for data this may be within any number of identity remaps or
>> actual remaps. If we're freeing a metadata node, this will be wholly
>> inside one or the other.
>>
>> btrfs_remove_extent_from_remap_tree() searches the remap tree for the
>> remaps that cover the range in question, then calls
>> remove_range_from_remap_tree() for each one, to punch a hole in the
>> remap and adjust the free-space tree.
>>
>> For an identity remap, remove_range_from_remap_tree() will adjust the
>> block group's `identity_remap_count` if this changes. If it reaches
>> zero we call last_identity_remap_gone(), which removes the chunk's
>> stripes and device extents - it is now fully remapped.
>>
>> The changes which involve the block group's ro flag are because the
>> REMAPPED flag itself prevents a block group from having any new
>> allocations within it, and so we don't need to account for this
>> separately.
> 
> Those changes didn't really make much sense to me. Do you *want to*
> delete the "unused" block group with remapped set?
> How did it get in the list?
> Did you actually hit this case and are fixing something?

So the life cycle of a block group is:

1. Normal BG

2. Partially remapped BG (REMAPPED flag set, one or more identity remaps)

3. Fully remapped BG (no identity remaps, chunk stripes removed; all data
that's nominally here is actually somewhere else)

4. Empty BG (all extents that nominally live within this BG have been
removed)

You still want to remove these empty BGs; otherwise they'll stick around
in the BG tree forever.

> 
>>
>> Signed-off-by: Mark Harmstone <maharmstone@fb.com>
>> ---
>>   fs/btrfs/block-group.c |  80 ++++---
>>   fs/btrfs/block-group.h |   1 +
>>   fs/btrfs/disk-io.c     |   1 +
>>   fs/btrfs/extent-tree.c |  30 ++-
>>   fs/btrfs/fs.h          |   1 +
>>   fs/btrfs/relocation.c  | 505 +++++++++++++++++++++++++++++++++++++++++
>>   fs/btrfs/relocation.h  |   3 +
>>   fs/btrfs/volumes.c     |  56 +++--
>>   fs/btrfs/volumes.h     |   6 +
>>   9 files changed, 628 insertions(+), 55 deletions(-)
>>
>> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
>> index 4529356bb1e3..334df145ab3f 100644
>> --- a/fs/btrfs/block-group.c
>> +++ b/fs/btrfs/block-group.c
>> @@ -1055,6 +1055,32 @@ static int remove_block_group_item(struct btrfs_trans_handle *trans,
>>   	return ret;
>>   }
>>   
>> +void btrfs_remove_bg_from_sinfo(struct btrfs_block_group *block_group)
>> +{
>> +	int factor = btrfs_bg_type_to_factor(block_group->flags);
>> +
>> +	spin_lock(&block_group->space_info->lock);
>> +
>> +	if (btrfs_test_opt(block_group->fs_info, ENOSPC_DEBUG)) {
>> +		WARN_ON(block_group->space_info->total_bytes
>> +			< block_group->length);
>> +		WARN_ON(block_group->space_info->bytes_readonly
>> +			< block_group->length - block_group->zone_unusable);
>> +		WARN_ON(block_group->space_info->bytes_zone_unusable
>> +			< block_group->zone_unusable);
>> +		WARN_ON(block_group->space_info->disk_total
>> +			< block_group->length * factor);
>> +	}
>> +	block_group->space_info->total_bytes -= block_group->length;
>> +	block_group->space_info->bytes_readonly -=
>> +		(block_group->length - block_group->zone_unusable);
>> +	btrfs_space_info_update_bytes_zone_unusable(block_group->space_info,
>> +						    -block_group->zone_unusable);
>> +	block_group->space_info->disk_total -= block_group->length * factor;
>> +
>> +	spin_unlock(&block_group->space_info->lock);
>> +}
>> +
>>   int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
>>   			     struct btrfs_chunk_map *map)
>>   {
>> @@ -1066,7 +1092,6 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
>>   	struct kobject *kobj = NULL;
>>   	int ret;
>>   	int index;
>> -	int factor;
>>   	struct btrfs_caching_control *caching_ctl = NULL;
>>   	bool remove_map;
>>   	bool remove_rsv = false;
>> @@ -1075,7 +1100,7 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
>>   	if (!block_group)
>>   		return -ENOENT;
>>   
>> -	BUG_ON(!block_group->ro);
>> +	BUG_ON(!block_group->ro && !(block_group->flags & BTRFS_BLOCK_GROUP_REMAPPED));
>>   
>>   	trace_btrfs_remove_block_group(block_group);
>>   	/*
>> @@ -1087,7 +1112,6 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
>>   				  block_group->length);
>>   
>>   	index = btrfs_bg_flags_to_raid_index(block_group->flags);
>> -	factor = btrfs_bg_type_to_factor(block_group->flags);
>>   
>>   	/* make sure this block group isn't part of an allocation cluster */
>>   	cluster = &fs_info->data_alloc_cluster;
>> @@ -1211,26 +1235,11 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
>>   
>>   	spin_lock(&block_group->space_info->lock);
>>   	list_del_init(&block_group->ro_list);
>> -
>> -	if (btrfs_test_opt(fs_info, ENOSPC_DEBUG)) {
>> -		WARN_ON(block_group->space_info->total_bytes
>> -			< block_group->length);
>> -		WARN_ON(block_group->space_info->bytes_readonly
>> -			< block_group->length - block_group->zone_unusable);
>> -		WARN_ON(block_group->space_info->bytes_zone_unusable
>> -			< block_group->zone_unusable);
>> -		WARN_ON(block_group->space_info->disk_total
>> -			< block_group->length * factor);
>> -	}
>> -	block_group->space_info->total_bytes -= block_group->length;
>> -	block_group->space_info->bytes_readonly -=
>> -		(block_group->length - block_group->zone_unusable);
>> -	btrfs_space_info_update_bytes_zone_unusable(block_group->space_info,
>> -						    -block_group->zone_unusable);
>> -	block_group->space_info->disk_total -= block_group->length * factor;
>> -
>>   	spin_unlock(&block_group->space_info->lock);
>>   
>> +	if (!(block_group->flags & BTRFS_BLOCK_GROUP_REMAPPED))
>> +		btrfs_remove_bg_from_sinfo(block_group);
> 
> So we double count the block group in the space info usage while the
> remapped one stays around? Why delete the block group itself then? Why
> not just delete it when the last identity remap is gone and we call this
> function? Sorry if this is obvious, but I don't see it for some reason.

btrfs_remove_bg_from_sinfo() also gets called in last_identity_remap_gone(),
there's no double-counting. For a remapped extent, the data gets charged
to the destination BG's space info.

IIRC Qu asked a similar question, i.e. why don't we remove remapped BGs as
soon as the last identity remap has gone. The problem here is that there would
no longer be any way to distinguish between addresses in fully remapped BGs
and completely invalid addresses, which makes things like btrfs check more
difficult. See also the frequency at which bit-flip errors get posted to
the mailing list.

> 
>> +
>>   	/*
>>   	 * Remove the free space for the block group from the free space tree
>>   	 * and the block group's item from the extent tree before marking the
>> @@ -1517,6 +1526,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>>   	while (!list_empty(&fs_info->unused_bgs)) {
>>   		u64 used;
>>   		int trimming;
>> +		bool made_ro = false;
>>   
>>   		block_group = list_first_entry(&fs_info->unused_bgs,
>>   					       struct btrfs_block_group,
>> @@ -1553,7 +1563,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>>   
>>   		spin_lock(&space_info->lock);
>>   		spin_lock(&block_group->lock);
>> -		if (btrfs_is_block_group_used(block_group) || block_group->ro ||
>> +		if (btrfs_is_block_group_used(block_group) ||
>> +		    (block_group->ro && !(block_group->flags & BTRFS_BLOCK_GROUP_REMAPPED)) ||
>>   		    list_is_singular(&block_group->list)) {
>>   			/*
>>   			 * We want to bail if we made new allocations or have
>> @@ -1596,7 +1607,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>>   		 */
>>   		used = btrfs_space_info_used(space_info, true);
>>   		if (space_info->total_bytes - block_group->length < used &&
>> -		    block_group->zone_unusable < block_group->length) {
>> +		    block_group->zone_unusable < block_group->length &&
>> +		    !(block_group->flags & BTRFS_BLOCK_GROUP_REMAPPED)) {
>>   			/*
>>   			 * Add a reference for the list, compensate for the ref
>>   			 * drop under the "next" label for the
>> @@ -1614,8 +1626,14 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>>   		spin_unlock(&block_group->lock);
>>   		spin_unlock(&space_info->lock);
>>   
>> -		/* We don't want to force the issue, only flip if it's ok. */
>> -		ret = inc_block_group_ro(block_group, 0);
>> +		if (!(block_group->flags & BTRFS_BLOCK_GROUP_REMAPPED)) {
>> +			/* We don't want to force the issue, only flip if it's ok. */
>> +			ret = inc_block_group_ro(block_group, 0);
>> +			made_ro = true;
>> +		} else {
>> +			ret = 0;
>> +		}
>> +
>>   		up_write(&space_info->groups_sem);
>>   		if (ret < 0) {
>>   			ret = 0;
>> @@ -1624,7 +1642,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>>   
>>   		ret = btrfs_zone_finish(block_group);
>>   		if (ret < 0) {
>> -			btrfs_dec_block_group_ro(block_group);
>> +			if (made_ro)
>> +				btrfs_dec_block_group_ro(block_group);
>>   			if (ret == -EAGAIN)
>>   				ret = 0;
>>   			goto next;
>> @@ -1637,7 +1656,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>>   		trans = btrfs_start_trans_remove_block_group(fs_info,
>>   						     block_group->start);
>>   		if (IS_ERR(trans)) {
>> -			btrfs_dec_block_group_ro(block_group);
>> +			if (made_ro)
>> +				btrfs_dec_block_group_ro(block_group);
>>   			ret = PTR_ERR(trans);
>>   			goto next;
>>   		}
>> @@ -1647,7 +1667,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>>   		 * just delete them, we don't care about them anymore.
>>   		 */
>>   		if (!clean_pinned_extents(trans, block_group)) {
>> -			btrfs_dec_block_group_ro(block_group);
>> +			if (made_ro)
>> +				btrfs_dec_block_group_ro(block_group);
>>   			goto end_trans;
>>   		}
>>   
>> @@ -1661,7 +1682,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>>   		spin_lock(&fs_info->discard_ctl.lock);
>>   		if (!list_empty(&block_group->discard_list)) {
>>   			spin_unlock(&fs_info->discard_ctl.lock);
>> -			btrfs_dec_block_group_ro(block_group);
>> +			if (made_ro)
>> +				btrfs_dec_block_group_ro(block_group);
>>   			btrfs_discard_queue_work(&fs_info->discard_ctl,
>>   						 block_group);
>>   			goto end_trans;
>> diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
>> index c484118b8b8d..767898929960 100644
>> --- a/fs/btrfs/block-group.h
>> +++ b/fs/btrfs/block-group.h
>> @@ -329,6 +329,7 @@ int btrfs_add_new_free_space(struct btrfs_block_group *block_group,
>>   struct btrfs_trans_handle *btrfs_start_trans_remove_block_group(
>>   				struct btrfs_fs_info *fs_info,
>>   				const u64 chunk_offset);
>> +void btrfs_remove_bg_from_sinfo(struct btrfs_block_group *block_group);
>>   int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
>>   			     struct btrfs_chunk_map *map);
>>   void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info);
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index a0542b581f4e..dac22efd2332 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -2922,6 +2922,7 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
>>   	mutex_init(&fs_info->chunk_mutex);
>>   	mutex_init(&fs_info->transaction_kthread_mutex);
>>   	mutex_init(&fs_info->cleaner_mutex);
>> +	mutex_init(&fs_info->remap_mutex);
>>   	mutex_init(&fs_info->ro_block_group_mutex);
>>   	init_rwsem(&fs_info->commit_root_sem);
>>   	init_rwsem(&fs_info->cleanup_work_sem);
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index e8f752ef1da9..995784cdca9d 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -40,6 +40,7 @@
>>   #include "orphan.h"
>>   #include "tree-checker.h"
>>   #include "raid-stripe-tree.h"
>> +#include "relocation.h"
>>   
>>   #undef SCRAMBLE_DELAYED_REFS
>>   
>> @@ -2977,6 +2978,8 @@ static int do_free_extent_accounting(struct btrfs_trans_handle *trans,
>>   				     u64 bytenr, struct btrfs_squota_delta *delta)
>>   {
>>   	int ret;
>> +	struct btrfs_block_group *bg;
>> +	bool bg_is_remapped = false;
>>   	u64 num_bytes = delta->num_bytes;
>>   
>>   	if (delta->is_data) {
>> @@ -3002,10 +3005,22 @@ static int do_free_extent_accounting(struct btrfs_trans_handle *trans,
>>   		return ret;
>>   	}
>>   
>> -	ret = add_to_free_space_tree(trans, bytenr, num_bytes);
>> -	if (ret) {
>> -		btrfs_abort_transaction(trans, ret);
>> -		return ret;
>> +	if (btrfs_fs_incompat(trans->fs_info, REMAP_TREE)) {
>> +		bg = btrfs_lookup_block_group(trans->fs_info, bytenr);
>> +		bg_is_remapped = bg->flags & BTRFS_BLOCK_GROUP_REMAPPED;
> 
> As mentioned in the patch that sets the flag, this feels quite
> susceptible to races. I don't have a particular one in mind, it just
> feels off to set and rely on this flag to decide what work to do without
> some kind of explicit synchronization plan.
> 
>> +		btrfs_put_block_group(bg);
>> +	}
>> +
>> +	/*
>> +	 * If remapped, FST has already been taken care of in
>> +	 * remove_range_from_remap_tree().
>> +	 */
>> +	if (!bg_is_remapped) {
>> +		ret = add_to_free_space_tree(trans, bytenr, num_bytes);
>> +		if (ret) {
>> +			btrfs_abort_transaction(trans, ret);
>> +			return ret;
>> +		}
>>   	}
>>   
>>   	ret = btrfs_update_block_group(trans, bytenr, num_bytes, false);
>> @@ -3387,6 +3402,13 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
>>   		}
>>   		btrfs_release_path(path);
>>   
>> +		ret = btrfs_remove_extent_from_remap_tree(trans, path, bytenr,
>> +							  num_bytes);
>> +		if (ret) {
>> +			btrfs_abort_transaction(trans, ret);
>> +			goto out;
>> +		}
>> +
>>   		ret = do_free_extent_accounting(trans, bytenr, &delta);
>>   	}
>>   	btrfs_release_path(path);
>> diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
>> index 8ceeb64aceb3..fd7dc61be9c7 100644
>> --- a/fs/btrfs/fs.h
>> +++ b/fs/btrfs/fs.h
>> @@ -551,6 +551,7 @@ struct btrfs_fs_info {
>>   	struct mutex transaction_kthread_mutex;
>>   	struct mutex cleaner_mutex;
>>   	struct mutex chunk_mutex;
>> +	struct mutex remap_mutex;
>>   
>>   	/*
>>   	 * This is taken to make sure we don't set block groups ro after the
>> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
>> index e45f3598ef03..54c3e99c7dab 100644
>> --- a/fs/btrfs/relocation.c
>> +++ b/fs/btrfs/relocation.c
>> @@ -37,6 +37,7 @@
>>   #include "super.h"
>>   #include "tree-checker.h"
>>   #include "raid-stripe-tree.h"
>> +#include "free-space-tree.h"
>>   
>>   /*
>>    * Relocation overview
>> @@ -3905,6 +3906,150 @@ static const char *stage_to_string(enum reloc_stage stage)
>>   	return "unknown";
>>   }
>>   
>> +static void adjust_block_group_remap_bytes(struct btrfs_trans_handle *trans,
>> +					   struct btrfs_block_group *bg,
>> +					   s64 diff)
>> +{
>> +	struct btrfs_fs_info *fs_info = trans->fs_info;
>> +	bool bg_already_dirty = true;
>> +
>> +	bg->remap_bytes += diff;
>> +
>> +	if (bg->used == 0 && bg->remap_bytes == 0)
>> +		btrfs_mark_bg_unused(bg);
>> +
>> +	spin_lock(&trans->transaction->dirty_bgs_lock);
>> +	if (list_empty(&bg->dirty_list)) {
>> +		list_add_tail(&bg->dirty_list, &trans->transaction->dirty_bgs);
>> +		bg_already_dirty = false;
>> +		btrfs_get_block_group(bg);
>> +	}
>> +	spin_unlock(&trans->transaction->dirty_bgs_lock);
>> +
>> +	/* Modified block groups are accounted for in the delayed_refs_rsv. */
>> +	if (!bg_already_dirty)
>> +		btrfs_inc_delayed_refs_rsv_bg_updates(fs_info);
>> +}
>> +
>> +static int remove_chunk_stripes(struct btrfs_trans_handle *trans,
>> +				struct btrfs_chunk_map *chunk,
>> +				struct btrfs_path *path)
>> +{
>> +	struct btrfs_fs_info *fs_info = trans->fs_info;
>> +	struct btrfs_key key;
>> +	struct extent_buffer *leaf;
>> +	struct btrfs_chunk *c;
>> +	int ret;
>> +
>> +	key.objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID;
>> +	key.type = BTRFS_CHUNK_ITEM_KEY;
>> +	key.offset = chunk->start;
>> +
>> +	ret = btrfs_search_slot(trans, fs_info->chunk_root, &key, path,
>> +				0, 1);
>> +	if (ret) {
>> +		if (ret == 1) {
>> +			btrfs_release_path(path);
>> +			ret = -ENOENT;
>> +		}
>> +		return ret;
>> +	}
>> +
>> +	leaf = path->nodes[0];
>> +
>> +	c = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_chunk);
>> +	btrfs_set_chunk_num_stripes(leaf, c, 0);
>> +
>> +	btrfs_truncate_item(trans, path, offsetof(struct btrfs_chunk, stripe),
>> +			    1);
>> +
>> +	btrfs_mark_buffer_dirty(trans, leaf);
>> +
>> +	btrfs_release_path(path);
>> +
>> +	chunk->num_stripes = 0;
>> +
>> +	return 0;
>> +}
>> +
>> +static int last_identity_remap_gone(struct btrfs_trans_handle *trans,
>> +				    struct btrfs_chunk_map *chunk,
>> +				    struct btrfs_block_group *bg,
>> +				    struct btrfs_path *path)
>> +{
>> +	int ret;
>> +
>> +	ret = btrfs_remove_dev_extents(trans, chunk);
>> +	if (ret)
>> +		return ret;
>> +
>> +	mutex_lock(&trans->fs_info->chunk_mutex);
>> +
>> +	for (unsigned int i = 0; i < chunk->num_stripes; i++) {
>> +		ret = btrfs_update_device(trans, chunk->stripes[i].dev);
>> +		if (ret) {
>> +			mutex_unlock(&trans->fs_info->chunk_mutex);
>> +			return ret;
>> +		}
>> +	}
>> +
>> +	mutex_unlock(&trans->fs_info->chunk_mutex);
>> +
>> +	write_lock(&trans->fs_info->mapping_tree_lock);
>> +	btrfs_chunk_map_device_clear_bits(chunk, CHUNK_ALLOCATED);
>> +	write_unlock(&trans->fs_info->mapping_tree_lock);
>> +
>> +	btrfs_remove_bg_from_sinfo(bg);
>> +
>> +	ret = remove_chunk_stripes(trans, chunk, path);
>> +	if (ret)
>> +		return ret;
>> +
>> +	return 0;
>> +}
>> +
>> +static int adjust_identity_remap_count(struct btrfs_trans_handle *trans,
>> +				       struct btrfs_path *path,
>> +				       struct btrfs_block_group *bg, int delta)
>> +{
>> +	struct btrfs_fs_info *fs_info = trans->fs_info;
>> +	struct btrfs_chunk_map *chunk;
>> +	bool bg_already_dirty = true;
>> +	int ret;
>> +
>> +	WARN_ON(delta < 0 && -delta > bg->identity_remap_count);
>> +
>> +	bg->identity_remap_count += delta;
>> +
>> +	spin_lock(&trans->transaction->dirty_bgs_lock);
>> +	if (list_empty(&bg->dirty_list)) {
>> +		list_add_tail(&bg->dirty_list, &trans->transaction->dirty_bgs);
>> +		bg_already_dirty = false;
>> +		btrfs_get_block_group(bg);
>> +	}
>> +	spin_unlock(&trans->transaction->dirty_bgs_lock);
>> +
>> +	/* Modified block groups are accounted for in the delayed_refs_rsv. */
>> +	if (!bg_already_dirty)
>> +		btrfs_inc_delayed_refs_rsv_bg_updates(fs_info);
>> +
>> +	if (bg->identity_remap_count != 0)
>> +		return 0;
>> +
>> +	chunk = btrfs_find_chunk_map(fs_info, bg->start, 1);
>> +	if (!chunk)
>> +		return -ENOENT;
>> +
>> +	ret = last_identity_remap_gone(trans, chunk, bg, path);
>> +	if (ret)
>> +		goto end;
>> +
>> +	ret = 0;
>> +end:
>> +	btrfs_free_chunk_map(chunk);
>> +	return ret;
>> +}
>> +
>>   int btrfs_translate_remap(struct btrfs_fs_info *fs_info, u64 *logical,
>>   			  u64 *length, bool nolock)
>>   {
>> @@ -4521,3 +4666,363 @@ u64 btrfs_get_reloc_bg_bytenr(const struct btrfs_fs_info *fs_info)
>>   		logical = fs_info->reloc_ctl->block_group->start;
>>   	return logical;
>>   }
>> +
>> +static int remove_range_from_remap_tree(struct btrfs_trans_handle *trans,
>> +					struct btrfs_path *path,
>> +					struct btrfs_block_group *bg,
>> +					u64 bytenr, u64 num_bytes)
>> +{
>> +	int ret;
>> +	struct btrfs_fs_info *fs_info = trans->fs_info;
>> +	struct extent_buffer *leaf = path->nodes[0];
>> +	struct btrfs_key key, new_key;
>> +	struct btrfs_remap *remap_ptr = NULL, remap;
>> +	struct btrfs_block_group *dest_bg = NULL;
>> +	u64 end, new_addr = 0, remap_start, remap_length, overlap_length;
>> +	bool is_identity_remap;
>> +
>> +	end = bytenr + num_bytes;
>> +
>> +	btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
>> +
>> +	is_identity_remap = key.type == BTRFS_IDENTITY_REMAP_KEY;
>> +
>> +	remap_start = key.objectid;
>> +	remap_length = key.offset;
>> +
>> +	if (!is_identity_remap) {
>> +		remap_ptr = btrfs_item_ptr(leaf, path->slots[0],
>> +					   struct btrfs_remap);
>> +		new_addr = btrfs_remap_address(leaf, remap_ptr);
>> +
>> +		dest_bg = btrfs_lookup_block_group(fs_info, new_addr);
>> +	}
>> +
>> +	if (bytenr == remap_start && num_bytes >= remap_length) {
>> +		/* Remove entirely. */
>> +
>> +		ret = btrfs_del_item(trans, fs_info->remap_root, path);
>> +		if (ret)
>> +			goto end;
>> +
>> +		btrfs_release_path(path);
>> +
>> +		overlap_length = remap_length;
>> +
>> +		if (!is_identity_remap) {
>> +			/* Remove backref. */
>> +
>> +			key.objectid = new_addr;
>> +			key.type = BTRFS_REMAP_BACKREF_KEY;
>> +			key.offset = remap_length;
>> +
>> +			ret = btrfs_search_slot(trans, fs_info->remap_root,
>> +						&key, path, -1, 1);
>> +			if (ret) {
>> +				if (ret == 1) {
>> +					btrfs_release_path(path);
>> +					ret = -ENOENT;
>> +				}
>> +				goto end;
>> +			}
>> +
>> +			ret = btrfs_del_item(trans, fs_info->remap_root, path);
>> +
>> +			btrfs_release_path(path);
>> +
>> +			if (ret)
>> +				goto end;
>> +
>> +			adjust_block_group_remap_bytes(trans, dest_bg,
>> +						       -remap_length);
>> +		} else {
>> +			ret = adjust_identity_remap_count(trans, path, bg, -1);
>> +			if (ret)
>> +				goto end;
>> +		}
>> +	} else if (bytenr == remap_start) {
>> +		/* Remove beginning. */
>> +
>> +		new_key.objectid = end;
>> +		new_key.type = key.type;
>> +		new_key.offset = remap_length + remap_start - end;
>> +
>> +		btrfs_set_item_key_safe(trans, path, &new_key);
>> +		btrfs_mark_buffer_dirty(trans, leaf);
>> +
>> +		overlap_length = num_bytes;
>> +
>> +		if (!is_identity_remap) {
>> +			btrfs_set_remap_address(leaf, remap_ptr,
>> +						new_addr + end - remap_start);
>> +			btrfs_release_path(path);
>> +
>> +			/* Adjust backref. */
>> +
>> +			key.objectid = new_addr;
>> +			key.type = BTRFS_REMAP_BACKREF_KEY;
>> +			key.offset = remap_length;
>> +
>> +			ret = btrfs_search_slot(trans, fs_info->remap_root,
>> +						&key, path, -1, 1);
>> +			if (ret) {
>> +				if (ret == 1) {
>> +					btrfs_release_path(path);
>> +					ret = -ENOENT;
>> +				}
>> +				goto end;
>> +			}
>> +
>> +			leaf = path->nodes[0];
>> +
>> +			new_key.objectid = new_addr + end - remap_start;
>> +			new_key.type = BTRFS_REMAP_BACKREF_KEY;
>> +			new_key.offset = remap_length + remap_start - end;
>> +
>> +			btrfs_set_item_key_safe(trans, path, &new_key);
>> +
>> +			remap_ptr = btrfs_item_ptr(leaf, path->slots[0],
>> +						   struct btrfs_remap);
>> +			btrfs_set_remap_address(leaf, remap_ptr, end);
>> +
>> +			btrfs_mark_buffer_dirty(trans, path->nodes[0]);
>> +
>> +			btrfs_release_path(path);
>> +
>> +			adjust_block_group_remap_bytes(trans, dest_bg,
>> +						       -num_bytes);
>> +		}
>> +	} else if (bytenr + num_bytes < remap_start + remap_length) {
>> +		/* Remove middle. */
>> +
>> +		new_key.objectid = remap_start;
>> +		new_key.type = key.type;
>> +		new_key.offset = bytenr - remap_start;
>> +
>> +		btrfs_set_item_key_safe(trans, path, &new_key);
>> +		btrfs_mark_buffer_dirty(trans, leaf);
>> +
>> +		new_key.objectid = end;
>> +		new_key.offset = remap_start + remap_length - end;
>> +
>> +		btrfs_release_path(path);
>> +
>> +		overlap_length = num_bytes;
>> +
>> +		if (!is_identity_remap) {
>> +			/* Add second remap entry. */
>> +
>> +			ret = btrfs_insert_empty_item(trans, fs_info->remap_root,
>> +						path, &new_key,
>> +						sizeof(struct btrfs_remap));
>> +			if (ret)
>> +				goto end;
>> +
>> +			btrfs_set_stack_remap_address(&remap,
>> +						new_addr + end - remap_start);
>> +
>> +			write_extent_buffer(path->nodes[0], &remap,
>> +				btrfs_item_ptr_offset(path->nodes[0], path->slots[0]),
>> +				sizeof(struct btrfs_remap));
>> +
>> +			btrfs_release_path(path);
>> +
>> +			/* Shorten backref entry. */
>> +
>> +			key.objectid = new_addr;
>> +			key.type = BTRFS_REMAP_BACKREF_KEY;
>> +			key.offset = remap_length;
>> +
>> +			ret = btrfs_search_slot(trans, fs_info->remap_root,
>> +						&key, path, -1, 1);
>> +			if (ret) {
>> +				if (ret == 1) {
>> +					btrfs_release_path(path);
>> +					ret = -ENOENT;
>> +				}
>> +				goto end;
>> +			}
>> +
>> +			new_key.objectid = new_addr;
>> +			new_key.type = BTRFS_REMAP_BACKREF_KEY;
>> +			new_key.offset = bytenr - remap_start;
>> +
>> +			btrfs_set_item_key_safe(trans, path, &new_key);
>> +			btrfs_mark_buffer_dirty(trans, path->nodes[0]);
>> +
>> +			btrfs_release_path(path);
>> +
>> +			/* Add second backref entry. */
>> +
>> +			new_key.objectid = new_addr + end - remap_start;
>> +			new_key.type = BTRFS_REMAP_BACKREF_KEY;
>> +			new_key.offset = remap_start + remap_length - end;
>> +
>> +			ret = btrfs_insert_empty_item(trans, fs_info->remap_root,
>> +						path, &new_key,
>> +						sizeof(struct btrfs_remap));
>> +			if (ret)
>> +				goto end;
>> +
>> +			btrfs_set_stack_remap_address(&remap, end);
>> +
>> +			write_extent_buffer(path->nodes[0], &remap,
>> +				btrfs_item_ptr_offset(path->nodes[0], path->slots[0]),
>> +				sizeof(struct btrfs_remap));
>> +
>> +			btrfs_release_path(path);
>> +
>> +			adjust_block_group_remap_bytes(trans, dest_bg,
>> +						       -num_bytes);
>> +		} else {
>> +			/* Add second identity remap entry. */
>> +
>> +			ret = btrfs_insert_empty_item(trans, fs_info->remap_root,
>> +						      path, &new_key, 0);
>> +			if (ret)
>> +				goto end;
>> +
>> +			btrfs_release_path(path);
>> +
>> +			ret = adjust_identity_remap_count(trans, path, bg, 1);
>> +			if (ret)
>> +				goto end;
>> +		}
>> +	} else {
>> +		/* Remove end. */
>> +
>> +		new_key.objectid = remap_start;
>> +		new_key.type = key.type;
>> +		new_key.offset = bytenr - remap_start;
>> +
>> +		btrfs_set_item_key_safe(trans, path, &new_key);
>> +		btrfs_mark_buffer_dirty(trans, leaf);
>> +
>> +		btrfs_release_path(path);
>> +
>> +		overlap_length = remap_start + remap_length - bytenr;
>> +
>> +		if (!is_identity_remap) {
>> +			/* Shorten backref entry. */
>> +
>> +			key.objectid = new_addr;
>> +			key.type = BTRFS_REMAP_BACKREF_KEY;
>> +			key.offset = remap_length;
>> +
>> +			ret = btrfs_search_slot(trans, fs_info->remap_root,
>> +						&key, path, -1, 1);
>> +			if (ret) {
>> +				if (ret == 1) {
>> +					btrfs_release_path(path);
>> +					ret = -ENOENT;
>> +				}
>> +				goto end;
>> +			}
>> +
>> +			new_key.objectid = new_addr;
>> +			new_key.type = BTRFS_REMAP_BACKREF_KEY;
>> +			new_key.offset = bytenr - remap_start;
>> +
>> +			btrfs_set_item_key_safe(trans, path, &new_key);
>> +			btrfs_mark_buffer_dirty(trans, path->nodes[0]);
>> +
>> +			btrfs_release_path(path);
>> +
>> +			adjust_block_group_remap_bytes(trans, dest_bg,
>> +					bytenr - remap_start - remap_length);
>> +		}
>> +	}
>> +
>> +	if (!is_identity_remap) {
>> +		ret = add_to_free_space_tree(trans,
>> +					     bytenr - remap_start + new_addr,
>> +					     overlap_length);
>> +		if (ret)
>> +			goto end;
>> +	}
>> +
>> +	ret = overlap_length;
>> +
>> +end:
>> +	if (dest_bg)
>> +		btrfs_put_block_group(dest_bg);
>> +
>> +	return ret;
>> +}
>> +
>> +int btrfs_remove_extent_from_remap_tree(struct btrfs_trans_handle *trans,
>> +					struct btrfs_path *path,
>> +					u64 bytenr, u64 num_bytes)
>> +{
>> +	struct btrfs_fs_info *fs_info = trans->fs_info;
>> +	struct btrfs_key key, found_key;
>> +	struct extent_buffer *leaf;
>> +	struct btrfs_block_group *bg;
>> +	int ret;
>> +
>> +	if (!(btrfs_super_incompat_flags(fs_info->super_copy) &
>> +	      BTRFS_FEATURE_INCOMPAT_REMAP_TREE))
>> +		return 0;
>> +
>> +	bg = btrfs_lookup_block_group(fs_info, bytenr);
>> +	if (!bg)
>> +		return 0;
>> +
>> +	if (!(bg->flags & BTRFS_BLOCK_GROUP_REMAPPED)) {
>> +		btrfs_put_block_group(bg);
>> +		return 0;
>> +	}
>> +
>> +	mutex_lock(&fs_info->remap_mutex);
>> +
>> +	do {
>> +		key.objectid = bytenr;
>> +		key.type = (u8)-1;
>> +		key.offset = (u64)-1;
>> +
>> +		ret = btrfs_search_slot(trans, fs_info->remap_root, &key, path,
>> +					-1, 1);
>> +		if (ret < 0)
>> +			goto end;
>> +
>> +		leaf = path->nodes[0];
>> +
>> +		if (path->slots[0] == 0) {
>> +			ret = -ENOENT;
>> +			goto end;
>> +		}
>> +
>> +		path->slots[0]--;
>> +
>> +		btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
>> +
>> +		if (found_key.type != BTRFS_IDENTITY_REMAP_KEY &&
>> +		    found_key.type != BTRFS_REMAP_KEY) {
>> +			ret = -ENOENT;
>> +			goto end;
>> +		}
>> +
>> +		if (bytenr < found_key.objectid ||
>> +		    bytenr >= found_key.objectid + found_key.offset) {
>> +			ret = -ENOENT;
>> +			goto end;
>> +		}
>> +
>> +		ret = remove_range_from_remap_tree(trans, path, bg, bytenr,
>> +						   num_bytes);
>> +		if (ret < 0)
>> +			goto end;
>> +
>> +		bytenr += ret;
>> +		num_bytes -= ret;
>> +	} while (num_bytes > 0);
>> +
>> +	ret = 0;
>> +
>> +end:
>> +	mutex_unlock(&fs_info->remap_mutex);
>> +
>> +	btrfs_put_block_group(bg);
>> +	btrfs_release_path(path);
>> +	return ret;
>> +}
>> diff --git a/fs/btrfs/relocation.h b/fs/btrfs/relocation.h
>> index 8c9dfc55b799..0021f812b12c 100644
>> --- a/fs/btrfs/relocation.h
>> +++ b/fs/btrfs/relocation.h
>> @@ -32,5 +32,8 @@ bool btrfs_should_ignore_reloc_root(const struct btrfs_root *root);
>>   u64 btrfs_get_reloc_bg_bytenr(const struct btrfs_fs_info *fs_info);
>>   int btrfs_translate_remap(struct btrfs_fs_info *fs_info, u64 *logical,
>>   			  u64 *length, bool nolock);
>> +int btrfs_remove_extent_from_remap_tree(struct btrfs_trans_handle *trans,
>> +					struct btrfs_path *path,
>> +					u64 bytenr, u64 num_bytes);
>>   
>>   #endif
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index 62bd6259ebd3..6c0a67da92f1 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -2931,8 +2931,8 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>>   	return ret;
>>   }
>>   
>> -static noinline int btrfs_update_device(struct btrfs_trans_handle *trans,
>> -					struct btrfs_device *device)
>> +int btrfs_update_device(struct btrfs_trans_handle *trans,
>> +			struct btrfs_device *device)
>>   {
>>   	int ret;
>>   	struct btrfs_path *path;
>> @@ -3236,25 +3236,13 @@ static int remove_chunk_item(struct btrfs_trans_handle *trans,
>>   	return btrfs_free_chunk(trans, chunk_offset);
>>   }
>>   
>> -int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset)
>> +int btrfs_remove_dev_extents(struct btrfs_trans_handle *trans,
>> +			     struct btrfs_chunk_map *map)
>>   {
>>   	struct btrfs_fs_info *fs_info = trans->fs_info;
>> -	struct btrfs_chunk_map *map;
>> +	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
>>   	u64 dev_extent_len = 0;
>>   	int i, ret = 0;
>> -	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
>> -
>> -	map = btrfs_get_chunk_map(fs_info, chunk_offset, 1);
>> -	if (IS_ERR(map)) {
>> -		/*
>> -		 * This is a logic error, but we don't want to just rely on the
>> -		 * user having built with ASSERT enabled, so if ASSERT doesn't
>> -		 * do anything we still error out.
>> -		 */
>> -		DEBUG_WARN("errr %ld reading chunk map at offset %llu",
>> -			   PTR_ERR(map), chunk_offset);
>> -		return PTR_ERR(map);
>> -	}
>>   
>>   	/*
>>   	 * First delete the device extent items from the devices btree.
>> @@ -3275,7 +3263,7 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset)
>>   		if (ret) {
>>   			mutex_unlock(&fs_devices->device_list_mutex);
>>   			btrfs_abort_transaction(trans, ret);
>> -			goto out;
>> +			return ret;
>>   		}
>>   
>>   		if (device->bytes_used > 0) {
>> @@ -3295,6 +3283,30 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset)
>>   	}
>>   	mutex_unlock(&fs_devices->device_list_mutex);
>>   
>> +	return 0;
>> +}
>> +
>> +int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset)
>> +{
>> +	struct btrfs_fs_info *fs_info = trans->fs_info;
>> +	struct btrfs_chunk_map *map;
>> +	int ret;
>> +
>> +	map = btrfs_get_chunk_map(fs_info, chunk_offset, 1);
>> +	if (IS_ERR(map)) {
>> +		/*
>> +		 * This is a logic error, but we don't want to just rely on the
>> +		 * user having built with ASSERT enabled, so if ASSERT doesn't
>> +		 * do anything we still error out.
>> +		 */
>> +		ASSERT(0);
>> +		return PTR_ERR(map);
>> +	}
>> +
>> +	ret = btrfs_remove_dev_extents(trans, map);
>> +	if (ret)
>> +		goto out;
>> +
>>   	/*
>>   	 * We acquire fs_info->chunk_mutex for 2 reasons:
>>   	 *
>> @@ -5436,7 +5448,7 @@ static void chunk_map_device_set_bits(struct btrfs_chunk_map *map, unsigned int
>>   	}
>>   }
>>   
>> -static void chunk_map_device_clear_bits(struct btrfs_chunk_map *map, unsigned int bits)
>> +void btrfs_chunk_map_device_clear_bits(struct btrfs_chunk_map *map, unsigned int bits)
>>   {
>>   	for (int i = 0; i < map->num_stripes; i++) {
>>   		struct btrfs_io_stripe *stripe = &map->stripes[i];
>> @@ -5453,7 +5465,7 @@ void btrfs_remove_chunk_map(struct btrfs_fs_info *fs_info, struct btrfs_chunk_ma
>>   	write_lock(&fs_info->mapping_tree_lock);
>>   	rb_erase_cached(&map->rb_node, &fs_info->mapping_tree);
>>   	RB_CLEAR_NODE(&map->rb_node);
>> -	chunk_map_device_clear_bits(map, CHUNK_ALLOCATED);
>> +	btrfs_chunk_map_device_clear_bits(map, CHUNK_ALLOCATED);
>>   	write_unlock(&fs_info->mapping_tree_lock);
>>   
>>   	/* Once for the tree reference. */
>> @@ -5489,7 +5501,7 @@ int btrfs_add_chunk_map(struct btrfs_fs_info *fs_info, struct btrfs_chunk_map *m
>>   		return -EEXIST;
>>   	}
>>   	chunk_map_device_set_bits(map, CHUNK_ALLOCATED);
>> -	chunk_map_device_clear_bits(map, CHUNK_TRIMMED);
>> +	btrfs_chunk_map_device_clear_bits(map, CHUNK_TRIMMED);
>>   	write_unlock(&fs_info->mapping_tree_lock);
>>   
>>   	return 0;
>> @@ -5854,7 +5866,7 @@ void btrfs_mapping_tree_free(struct btrfs_fs_info *fs_info)
>>   		map = rb_entry(node, struct btrfs_chunk_map, rb_node);
>>   		rb_erase_cached(&map->rb_node, &fs_info->mapping_tree);
>>   		RB_CLEAR_NODE(&map->rb_node);
>> -		chunk_map_device_clear_bits(map, CHUNK_ALLOCATED);
>> +		btrfs_chunk_map_device_clear_bits(map, CHUNK_ALLOCATED);
>>   		/* Once for the tree ref. */
>>   		btrfs_free_chunk_map(map);
>>   		cond_resched_rwlock_write(&fs_info->mapping_tree_lock);
>> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
>> index 9fb8fe4312a5..0a73ea2a2a6a 100644
>> --- a/fs/btrfs/volumes.h
>> +++ b/fs/btrfs/volumes.h
>> @@ -779,6 +779,8 @@ u64 btrfs_calc_stripe_length(const struct btrfs_chunk_map *map);
>>   int btrfs_nr_parity_stripes(u64 type);
>>   int btrfs_chunk_alloc_add_chunk_item(struct btrfs_trans_handle *trans,
>>   				     struct btrfs_block_group *bg);
>> +int btrfs_remove_dev_extents(struct btrfs_trans_handle *trans,
>> +			     struct btrfs_chunk_map *map);
>>   int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset);
>>   
>>   #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
>> @@ -876,6 +878,10 @@ bool btrfs_repair_one_zone(struct btrfs_fs_info *fs_info, u64 logical);
>>   
>>   bool btrfs_pinned_by_swapfile(struct btrfs_fs_info *fs_info, void *ptr);
>>   const u8 *btrfs_sb_fsid_ptr(const struct btrfs_super_block *sb);
>> +int btrfs_update_device(struct btrfs_trans_handle *trans,
>> +			struct btrfs_device *device);
>> +void btrfs_chunk_map_device_clear_bits(struct btrfs_chunk_map *map,
>> +				       unsigned int bits);
>>   
>>   #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
>>   struct btrfs_io_context *alloc_btrfs_io_context(struct btrfs_fs_info *fs_info,
>> -- 
>> 2.49.0
>>
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 09/12] btrfs: handle deletions from remapped block group
  2025-06-13 23:42   ` Boris Burkov
  2025-08-11 16:48     ` Mark Harmstone
@ 2025-08-11 16:59     ` Mark Harmstone
  1 sibling, 0 replies; 39+ messages in thread
From: Mark Harmstone @ 2025-08-11 16:59 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs

On 14/06/2025 12.42 am, Boris Burkov wrote:
... snip ...
>> @@ -3002,10 +3005,22 @@ static int do_free_extent_accounting(struct btrfs_trans_handle *trans,
>>   		return ret;
>>   	}
>>   
>> -	ret = add_to_free_space_tree(trans, bytenr, num_bytes);
>> -	if (ret) {
>> -		btrfs_abort_transaction(trans, ret);
>> -		return ret;
>> +	if (btrfs_fs_incompat(trans->fs_info, REMAP_TREE)) {
>> +		bg = btrfs_lookup_block_group(trans->fs_info, bytenr);
>> +		bg_is_remapped = bg->flags & BTRFS_BLOCK_GROUP_REMAPPED;
> 
> As mentioned in the patch that sets the flag, this feels quite
> susceptible to races. I don't have a particular one in mind, it just
> feels off to set and rely on this flag to decide what work to do without
> some kind of explicit synchronization plan.

I thought this was safe, but I think you're right. btrfs_remove_extent_from_remap_tree()
ought to return a flag saying whether it touched the FST or not, rather than
leaving us to racily guess here.
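The fix suggested here — have the removal helper report whether it already did the free-space-tree accounting, instead of the caller re-checking the REMAPPED flag — can be sketched in userspace roughly as below. This is an illustrative model only: the struct, the `touched_fst` out-parameter, and the mock FST are assumptions for the sketch, not the actual kernel interface.

```c
#include <stdbool.h>
#include <stdint.h>

struct mock_fst {
	uint64_t free_bytes;	/* stand-in for free-space-tree state */
};

/*
 * Mock removal helper: for a remapped range it does the FST accounting
 * itself (as remove_range_from_remap_tree() does) and reports that via
 * the out-parameter, so the caller never has to guess.
 */
static int remove_extent_from_remap_tree(struct mock_fst *fst,
					 bool range_is_remapped,
					 uint64_t num_bytes,
					 bool *touched_fst)
{
	*touched_fst = false;
	if (range_is_remapped) {
		fst->free_bytes += num_bytes;	/* FST handled here */
		*touched_fst = true;
	}
	return 0;
}

/* Caller mirroring do_free_extent_accounting(): no racy flag re-check. */
static int free_extent_accounting(struct mock_fst *fst,
				  bool range_is_remapped, uint64_t num_bytes)
{
	bool touched_fst;
	int ret;

	ret = remove_extent_from_remap_tree(fst, range_is_remapped,
					    num_bytes, &touched_fst);
	if (ret)
		return ret;

	/* Only touch the FST if the helper did not already do so. */
	if (!touched_fst)
		fst->free_bytes += num_bytes;	/* add_to_free_space_tree() */

	return 0;
}
```

Either way the range is accounted exactly once, and the decision is made by the code that holds the remap state rather than by a separate lookup of the block group flag.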

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 10/12] btrfs: handle setting up relocation of block group with remap-tree
  2025-06-13 23:25   ` Boris Burkov
@ 2025-08-12 11:20     ` Mark Harmstone
  0 siblings, 0 replies; 39+ messages in thread
From: Mark Harmstone @ 2025-08-12 11:20 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs

On 14/06/2025 12.25 am, Boris Burkov wrote:
> On Thu, Jun 05, 2025 at 05:23:40PM +0100, Mark Harmstone wrote:
>> Handle the preliminary work for relocating a block group in a filesystem
>> with the remap-tree flag set.
>>
>> If the block group is SYSTEM or REMAP btrfs_relocate_block_group()
>> proceeds as it does already, as bootstrapping issues mean that these
>> block groups have to be processed the existing way.
>>
>> Otherwise we walk the free-space tree for the block group in question,
>> recording any holes. These get converted into identity remaps and placed
>> in the remap tree, and the block group's REMAPPED flag is set. From now
>> on no new allocations are possible within this block group, and any I/O
>> to it will be funnelled through btrfs_translate_remap(). We store the
>> number of identity remaps in `identity_remap_count`, so that we know
>> when we've removed the last one and the block group is fully remapped.
>>
>> The change in btrfs_read_roots() is because data relocations no longer
>> rely on the data reloc tree as a hidden subvolume in which to do
>> snapshots.
>>
>> Signed-off-by: Mark Harmstone <maharmstone@fb.com>
>> ---
>>   fs/btrfs/disk-io.c         |  30 +--
>>   fs/btrfs/free-space-tree.c |   4 +-
>>   fs/btrfs/free-space-tree.h |   5 +-
>>   fs/btrfs/relocation.c      | 452 ++++++++++++++++++++++++++++++++++++-
>>   fs/btrfs/relocation.h      |   3 +-
>>   fs/btrfs/space-info.c      |   9 +-
>>   fs/btrfs/volumes.c         |  15 +-
>>   7 files changed, 483 insertions(+), 35 deletions(-)
>>
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index dac22efd2332..f2a9192293b1 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -2268,22 +2268,22 @@ static int btrfs_read_roots(struct btrfs_fs_info *fs_info)
>>   		root->root_key.objectid = BTRFS_REMAP_TREE_OBJECTID;
>>   		root->root_key.type = BTRFS_ROOT_ITEM_KEY;
>>   		root->root_key.offset = 0;
>> -	}
>> -
>> -	/*
>> -	 * This tree can share blocks with some other fs tree during relocation
>> -	 * and we need a proper setup by btrfs_get_fs_root
>> -	 */
>> -	root = btrfs_get_fs_root(tree_root->fs_info,
>> -				 BTRFS_DATA_RELOC_TREE_OBJECTID, true);
>> -	if (IS_ERR(root)) {
>> -		if (!btrfs_test_opt(fs_info, IGNOREBADROOTS)) {
>> -			ret = PTR_ERR(root);
>> -			goto out;
>> -		}
>>   	} else {
>> -		set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
>> -		fs_info->data_reloc_root = root;
>> +		/*
>> +		 * This tree can share blocks with some other fs tree during
>> +		 * relocation and we need a proper setup by btrfs_get_fs_root
>> +		 */
>> +		root = btrfs_get_fs_root(tree_root->fs_info,
>> +					 BTRFS_DATA_RELOC_TREE_OBJECTID, true);
>> +		if (IS_ERR(root)) {
>> +			if (!btrfs_test_opt(fs_info, IGNOREBADROOTS)) {
>> +				ret = PTR_ERR(root);
>> +				goto out;
>> +			}
>> +		} else {
>> +			set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
>> +			fs_info->data_reloc_root = root;
>> +		}
> 
> Why not do this change to the else case for remap tree along with the if
> case in the earlier patch (allow mounting filesystems with remap-tree
> incompat flag)?
> 
>>   	}
>>   
>>   	location.objectid = BTRFS_QUOTA_TREE_OBJECTID;
>> diff --git a/fs/btrfs/free-space-tree.c b/fs/btrfs/free-space-tree.c
>> index af51cf784a5b..eb579c17a79f 100644
>> --- a/fs/btrfs/free-space-tree.c
>> +++ b/fs/btrfs/free-space-tree.c
>> @@ -21,8 +21,7 @@ static int __add_block_group_free_space(struct btrfs_trans_handle *trans,
>>   					struct btrfs_block_group *block_group,
>>   					struct btrfs_path *path);
>>   
>> -static struct btrfs_root *btrfs_free_space_root(
>> -				struct btrfs_block_group *block_group)
>> +struct btrfs_root *btrfs_free_space_root(struct btrfs_block_group *block_group)
>>   {
>>   	struct btrfs_key key = {
>>   		.objectid = BTRFS_FREE_SPACE_TREE_OBJECTID,
>> @@ -96,7 +95,6 @@ static int add_new_free_space_info(struct btrfs_trans_handle *trans,
>>   	return ret;
>>   }
>>   
>> -EXPORT_FOR_TESTS
>>   struct btrfs_free_space_info *search_free_space_info(
>>   		struct btrfs_trans_handle *trans,
>>   		struct btrfs_block_group *block_group,
>> diff --git a/fs/btrfs/free-space-tree.h b/fs/btrfs/free-space-tree.h
>> index e6c6d6f4f221..1b804544730a 100644
>> --- a/fs/btrfs/free-space-tree.h
>> +++ b/fs/btrfs/free-space-tree.h
>> @@ -35,12 +35,13 @@ int add_to_free_space_tree(struct btrfs_trans_handle *trans,
>>   			   u64 start, u64 size);
>>   int remove_from_free_space_tree(struct btrfs_trans_handle *trans,
>>   				u64 start, u64 size);
>> -
>> -#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
>>   struct btrfs_free_space_info *
>>   search_free_space_info(struct btrfs_trans_handle *trans,
>>   		       struct btrfs_block_group *block_group,
>>   		       struct btrfs_path *path, int cow);
>> +struct btrfs_root *btrfs_free_space_root(struct btrfs_block_group *block_group);
>> +
>> +#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
>>   int __add_to_free_space_tree(struct btrfs_trans_handle *trans,
>>   			     struct btrfs_block_group *block_group,
>>   			     struct btrfs_path *path, u64 start, u64 size);
>> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
>> index 54c3e99c7dab..acf2fefedc96 100644
>> --- a/fs/btrfs/relocation.c
>> +++ b/fs/btrfs/relocation.c
>> @@ -3659,7 +3659,7 @@ static noinline_for_stack int relocate_block_group(struct reloc_control *rc)
>>   		btrfs_btree_balance_dirty(fs_info);
>>   	}
>>   
>> -	if (!err) {
>> +	if (!err && !btrfs_fs_incompat(fs_info, REMAP_TREE)) {
>>   		ret = relocate_file_extent_cluster(rc);
>>   		if (ret < 0)
>>   			err = ret;
>> @@ -3906,6 +3906,90 @@ static const char *stage_to_string(enum reloc_stage stage)
>>   	return "unknown";
>>   }
>>   
>> +static int add_remap_tree_entries(struct btrfs_trans_handle *trans,
>> +				  struct btrfs_path *path,
>> +				  struct btrfs_key *entries,
>> +				  unsigned int num_entries)
>> +{
>> +	int ret;
>> +	struct btrfs_fs_info *fs_info = trans->fs_info;
>> +	struct btrfs_item_batch batch;
>> +	u32 *data_sizes;
>> +	u32 max_items;
>> +
>> +	max_items = BTRFS_LEAF_DATA_SIZE(trans->fs_info) / sizeof(struct btrfs_item);
>> +
>> +	data_sizes = kzalloc(sizeof(u32) * min_t(u32, num_entries, max_items),
>> +			     GFP_NOFS);
>> +	if (!data_sizes)
>> +		return -ENOMEM;
>> +
>> +	while (true) {
>> +		batch.keys = entries;
>> +		batch.data_sizes = data_sizes;
>> +		batch.total_data_size = 0;
>> +		batch.nr = min_t(u32, num_entries, max_items);
>> +
>> +		ret = btrfs_insert_empty_items(trans, fs_info->remap_root, path,
>> +					       &batch);
>> +		btrfs_release_path(path);
>> +
>> +		if (num_entries <= max_items)
>> +			break;
>> +
>> +		num_entries -= max_items;
>> +		entries += max_items;
>> +	}
>> +
>> +	kfree(data_sizes);
>> +
>> +	return ret;
>> +}
>> +
>> +struct space_run {
>> +	u64 start;
>> +	u64 end;
>> +};
>> +
>> +static void parse_bitmap(u64 block_size, const unsigned long *bitmap,
>> +			 unsigned long size, u64 address,
>> +			 struct space_run *space_runs,
>> +			 unsigned int *num_space_runs)
>> +{
>> +	unsigned long pos, end;
>> +	u64 run_start, run_length;
>> +
>> +	pos = find_first_bit(bitmap, size);
>> +
>> +	if (pos == size)
>> +		return;
>> +
>> +	while (true) {
>> +		end = find_next_zero_bit(bitmap, size, pos);
>> +
>> +		run_start = address + (pos * block_size);
>> +		run_length = (end - pos) * block_size;
>> +
>> +		if (*num_space_runs != 0 &&
>> +		    space_runs[*num_space_runs - 1].end == run_start) {
>> +			space_runs[*num_space_runs - 1].end += run_length;
>> +		} else {
>> +			space_runs[*num_space_runs].start = run_start;
>> +			space_runs[*num_space_runs].end = run_start + run_length;
>> +
>> +			(*num_space_runs)++;
>> +		}
>> +
>> +		if (end == size)
>> +			break;
>> +
>> +		pos = find_next_bit(bitmap, size, end + 1);
>> +
>> +		if (pos == size)
>> +			break;
>> +	}
>> +}
>> +
>>   static void adjust_block_group_remap_bytes(struct btrfs_trans_handle *trans,
>>   					   struct btrfs_block_group *bg,
>>   					   s64 diff)
>> @@ -3931,6 +4015,227 @@ static void adjust_block_group_remap_bytes(struct btrfs_trans_handle *trans,
>>   		btrfs_inc_delayed_refs_rsv_bg_updates(fs_info);
>>   }
>>   
>> +static int create_remap_tree_entries(struct btrfs_trans_handle *trans,
>> +				     struct btrfs_path *path,
>> +				     struct btrfs_block_group *bg)
>> +{
>> +	struct btrfs_fs_info *fs_info = trans->fs_info;
>> +	struct btrfs_free_space_info *fsi;
>> +	struct btrfs_key key, found_key;
>> +	struct extent_buffer *leaf;
>> +	struct btrfs_root *space_root;
>> +	u32 extent_count;
>> +	struct space_run *space_runs = NULL;
>> +	unsigned int num_space_runs = 0;
>> +	struct btrfs_key *entries = NULL;
>> +	unsigned int max_entries, num_entries;
>> +	int ret;
>> +
>> +	mutex_lock(&bg->free_space_lock);
>> +
>> +	if (test_bit(BLOCK_GROUP_FLAG_NEEDS_FREE_SPACE, &bg->runtime_flags)) {
>> +		mutex_unlock(&bg->free_space_lock);
>> +
>> +		ret = add_block_group_free_space(trans, bg);
>> +		if (ret)
>> +			return ret;
>> +
>> +		mutex_lock(&bg->free_space_lock);
>> +	}
>> +
>> +	fsi = search_free_space_info(trans, bg, path, 0);
>> +	if (IS_ERR(fsi)) {
>> +		mutex_unlock(&bg->free_space_lock);
>> +		return PTR_ERR(fsi);
>> +	}
>> +
>> +	extent_count = btrfs_free_space_extent_count(path->nodes[0], fsi);
>> +
>> +	btrfs_release_path(path);
>> +
>> +	space_runs = kmalloc(sizeof(*space_runs) * extent_count, GFP_NOFS);
>> +	if (!space_runs) {
>> +		mutex_unlock(&bg->free_space_lock);
>> +		return -ENOMEM;
>> +	}
>> +
>> +	key.objectid = bg->start;
>> +	key.type = 0;
>> +	key.offset = 0;
>> +
>> +	space_root = btrfs_free_space_root(bg);
>> +
>> +	ret = btrfs_search_slot(trans, space_root, &key, path, 0, 0);
>> +	if (ret < 0) {
>> +		mutex_unlock(&bg->free_space_lock);
>> +		goto out;
>> +	}
>> +
>> +	ret = 0;
>> +
>> +	while (true) {
>> +		leaf = path->nodes[0];
>> +
>> +		btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
>> +
>> +		if (found_key.objectid >= bg->start + bg->length)
>> +			break;
>> +
>> +		if (found_key.type == BTRFS_FREE_SPACE_EXTENT_KEY) {
>> +			if (num_space_runs != 0 &&
>> +			    space_runs[num_space_runs - 1].end == found_key.objectid) {
>> +				space_runs[num_space_runs - 1].end =
>> +					found_key.objectid + found_key.offset;
>> +			} else {
>> +				space_runs[num_space_runs].start = found_key.objectid;
>> +				space_runs[num_space_runs].end =
>> +					found_key.objectid + found_key.offset;
>> +
>> +				num_space_runs++;
>> +
>> +				BUG_ON(num_space_runs > extent_count);
>> +			}
>> +		} else if (found_key.type == BTRFS_FREE_SPACE_BITMAP_KEY) {
>> +			void *bitmap;
>> +			unsigned long offset;
>> +			u32 data_size;
>> +
>> +			offset = btrfs_item_ptr_offset(leaf, path->slots[0]);
>> +			data_size = btrfs_item_size(leaf, path->slots[0]);
>> +
>> +			if (data_size != 0) {
>> +				bitmap = kmalloc(data_size, GFP_NOFS);
> 
> free space tree does this with alloc_bitmap, as far as I can tell.

It does, but that's because it uses alloc_bitmap() to calculate the size.
Here we already know the size, so we alloc for the whole thing.

>> +				if (!bitmap) {
>> +					mutex_unlock(&bg->free_space_lock);
>> +					ret = -ENOMEM;
>> +					goto out;
>> +				}
>> +
>> +				read_extent_buffer(leaf, bitmap, offset,
>> +						   data_size);
>> +
>> +				parse_bitmap(fs_info->sectorsize, bitmap,
>> +					     data_size * BITS_PER_BYTE,
>> +					     found_key.objectid, space_runs,
>> +					     &num_space_runs);
>> +
>> +				BUG_ON(num_space_runs > extent_count);
>> +
>> +				kfree(bitmap);
>> +			}
>> +		}
>> +
>> +		path->slots[0]++;
>> +
>> +		if (path->slots[0] >= btrfs_header_nritems(leaf)) {
>> +			ret = btrfs_next_leaf(space_root, path);
>> +			if (ret != 0) {
>> +				if (ret == 1)
>> +					ret = 0;
>> +				break;
>> +			}
>> +			leaf = path->nodes[0];
>> +		}
>> +	}
>> +
>> +	btrfs_release_path(path);
>> +
>> +	mutex_unlock(&bg->free_space_lock);
>> +
>> +	max_entries = extent_count + 2;
>> +	entries = kmalloc(sizeof(*entries) * max_entries, GFP_NOFS);
>> +	if (!entries) {
>> +		ret = -ENOMEM;
>> +		goto out;
>> +	}
>> +
>> +	num_entries = 0;
>> +
>> +	if (num_space_runs > 0 && space_runs[0].start > bg->start) {
>> +		entries[num_entries].objectid = bg->start;
>> +		entries[num_entries].type = BTRFS_IDENTITY_REMAP_KEY;
>> +		entries[num_entries].offset = space_runs[0].start - bg->start;
>> +		num_entries++;
>> +	}
>> +
>> +	for (unsigned int i = 1; i < num_space_runs; i++) {
>> +		entries[num_entries].objectid = space_runs[i - 1].end;
>> +		entries[num_entries].type = BTRFS_IDENTITY_REMAP_KEY;
>> +		entries[num_entries].offset =
>> +			space_runs[i].start - space_runs[i - 1].end;
>> +		num_entries++;
>> +	}
>> +
>> +	if (num_space_runs == 0) {
>> +		entries[num_entries].objectid = bg->start;
>> +		entries[num_entries].type = BTRFS_IDENTITY_REMAP_KEY;
>> +		entries[num_entries].offset = bg->length;
>> +		num_entries++;
>> +	} else if (space_runs[num_space_runs - 1].end < bg->start + bg->length) {
>> +		entries[num_entries].objectid = space_runs[num_space_runs - 1].end;
>> +		entries[num_entries].type = BTRFS_IDENTITY_REMAP_KEY;
>> +		entries[num_entries].offset =
>> +			bg->start + bg->length - space_runs[num_space_runs - 1].end;
>> +		num_entries++;
>> +	}
>> +
>> +	if (num_entries == 0)
>> +		goto out;
>> +
>> +	bg->identity_remap_count = num_entries;
>> +
>> +	ret = add_remap_tree_entries(trans, path, entries, num_entries);
>> +
>> +out:
>> +	kfree(entries);
>> +	kfree(space_runs);
>> +
>> +	return ret;
>> +}
>> +
>> +static int mark_bg_remapped(struct btrfs_trans_handle *trans,
>> +			    struct btrfs_path *path,
>> +			    struct btrfs_block_group *bg)
>> +{
>> +	struct btrfs_fs_info *fs_info = trans->fs_info;
>> +	unsigned long bi;
>> +	struct extent_buffer *leaf;
>> +	struct btrfs_block_group_item_v2 bgi;
>> +	struct btrfs_key key;
>> +	int ret;
>> +
> 
> Do the changes to the in memory bg flags / commit_identity_remap_count
> need any locking? What happens if we see the flag set but don't yet see
> the identity remap count set from some other context?

It's okay here, we're behind fs_info->remap_mutex.

I'm not 100% sure that having a big per-filesystem mutex is the way to go, but
I'd rather initially have something safe but slow before moving on to
something more refined.

>> +	bg->flags |= BTRFS_BLOCK_GROUP_REMAPPED;
>> +
>> +	key.objectid = bg->start;
>> +	key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
>> +	key.offset = bg->length;
>> +
>> +	ret = btrfs_search_slot(trans, fs_info->block_group_root, &key,
>> +				path, 0, 1);
>> +	if (ret) {
>> +		if (ret > 0)
>> +			ret = -ENOENT;
>> +		goto out;
>> +	}
>> +
>> +	leaf = path->nodes[0];
>> +	bi = btrfs_item_ptr_offset(leaf, path->slots[0]);
>> +	read_extent_buffer(leaf, &bgi, bi, sizeof(bgi));
> 
> ASSERT the incompat flag?
> 
>> +	btrfs_set_stack_block_group_v2_flags(&bgi, bg->flags);
>> +	btrfs_set_stack_block_group_v2_identity_remap_count(&bgi,
>> +						bg->identity_remap_count);
>> +	write_extent_buffer(leaf, &bgi, bi, sizeof(bgi));
>> +
>> +	btrfs_mark_buffer_dirty(trans, leaf);
>> +
>> +	bg->commit_identity_remap_count = bg->identity_remap_count;
>> +
>> +	ret = 0;
>> +out:
>> +	btrfs_release_path(path);
>> +	return ret;
>> +}
>> +
>>   static int remove_chunk_stripes(struct btrfs_trans_handle *trans,
>>   				struct btrfs_chunk_map *chunk,
>>   				struct btrfs_path *path)
>> @@ -4050,6 +4355,55 @@ static int adjust_identity_remap_count(struct btrfs_trans_handle *trans,
>>   	return ret;
>>   }
>>   
>> +static int mark_chunk_remapped(struct btrfs_trans_handle *trans,
>> +			       struct btrfs_path *path, uint64_t start)
>> +{
>> +	struct btrfs_fs_info *fs_info = trans->fs_info;
>> +	struct btrfs_chunk_map *chunk;
>> +	struct btrfs_key key;
>> +	u64 type;
>> +	int ret;
>> +	struct extent_buffer *leaf;
>> +	struct btrfs_chunk *c;
>> +
>> +	read_lock(&fs_info->mapping_tree_lock);
>> +
>> +	chunk = btrfs_find_chunk_map_nolock(fs_info, start, 1);
>> +	if (!chunk) {
>> +		read_unlock(&fs_info->mapping_tree_lock);
>> +		return -ENOENT;
>> +	}
>> +
>> +	chunk->type |= BTRFS_BLOCK_GROUP_REMAPPED;
>> +	type = chunk->type;
>> +
>> +	read_unlock(&fs_info->mapping_tree_lock);
>> +
>> +	key.objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID;
>> +	key.type = BTRFS_CHUNK_ITEM_KEY;
>> +	key.offset = start;
>> +
>> +	ret = btrfs_search_slot(trans, fs_info->chunk_root, &key, path,
>> +				0, 1);
>> +	if (ret == 1) {
>> +		ret = -ENOENT;
>> +		goto end;
>> +	} else if (ret < 0)
>> +		goto end;
>> +
>> +	leaf = path->nodes[0];
>> +
>> +	c = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_chunk);
>> +	btrfs_set_chunk_type(leaf, c, type);
>> +	btrfs_mark_buffer_dirty(trans, leaf);
>> +
>> +	ret = 0;
>> +end:
>> +	btrfs_free_chunk_map(chunk);
>> +	btrfs_release_path(path);
>> +	return ret;
>> +}
>> +
>>   int btrfs_translate_remap(struct btrfs_fs_info *fs_info, u64 *logical,
>>   			  u64 *length, bool nolock)
>>   {
>> @@ -4109,16 +4463,78 @@ int btrfs_translate_remap(struct btrfs_fs_info *fs_info, u64 *logical,
>>   	return 0;
>>   }
>>   
>> +static int start_block_group_remapping(struct btrfs_fs_info *fs_info,
>> +				       struct btrfs_path *path,
>> +				       struct btrfs_block_group *bg)
>> +{
>> +	struct btrfs_trans_handle *trans;
>> +	int ret, ret2;
>> +
>> +	ret = btrfs_cache_block_group(bg, true);
>> +	if (ret)
>> +		return ret;
>> +
>> +	trans = btrfs_start_transaction(fs_info->remap_root, 0);
>> +	if (IS_ERR(trans))
>> +		return PTR_ERR(trans);
>> +
>> +	/* We need to run delayed refs, to make sure FST is up to date. */
>> +	ret = btrfs_run_delayed_refs(trans, U64_MAX);
>> +	if (ret) {
>> +		btrfs_end_transaction(trans);
>> +		return ret;
>> +	}
>> +
>> +	mutex_lock(&fs_info->remap_mutex);
>> +
>> +	if (bg->flags & BTRFS_BLOCK_GROUP_REMAPPED) {
>> +		ret = 0;
>> +		goto end;
>> +	}
>> +
>> +	ret = create_remap_tree_entries(trans, path, bg);
>> +	if (ret) {
>> +		btrfs_abort_transaction(trans, ret);
>> +		goto end;
>> +	}
>> +
>> +	ret = mark_bg_remapped(trans, path, bg);
>> +	if (ret) {
>> +		btrfs_abort_transaction(trans, ret);
>> +		goto end;
>> +	}
>> +
>> +	ret = mark_chunk_remapped(trans, path, bg->start);
>> +	if (ret) {
>> +		btrfs_abort_transaction(trans, ret);
>> +		goto end;
>> +	}
>> +
>> +	ret = remove_block_group_free_space(trans, bg);
>> +	if (ret)
>> +		btrfs_abort_transaction(trans, ret);
>> +
>> +end:
>> +	mutex_unlock(&fs_info->remap_mutex);
>> +
>> +	ret2 = btrfs_end_transaction(trans);
>> +	if (!ret)
>> +		ret = ret2;
>> +
>> +	return ret;
>> +}
>> +
>>   /*
>>    * function to relocate all extents in a block group.
>>    */
>> -int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start)
>> +int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start,
>> +			       bool *using_remap_tree)
>>   {
>>   	struct btrfs_block_group *bg;
>>   	struct btrfs_root *extent_root = btrfs_extent_root(fs_info, group_start);
>>   	struct reloc_control *rc;
>>   	struct inode *inode;
>> -	struct btrfs_path *path;
>> +	struct btrfs_path *path = NULL;
>>   	int ret;
>>   	int rw = 0;
>>   	int err = 0;
>> @@ -4185,7 +4601,7 @@ int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start)
>>   	}
>>   
>>   	inode = lookup_free_space_inode(rc->block_group, path);
>> -	btrfs_free_path(path);
>> +	btrfs_release_path(path);
>>   
>>   	if (!IS_ERR(inode))
>>   		ret = delete_block_group_cache(rc->block_group, inode, 0);
>> @@ -4197,11 +4613,17 @@ int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start)
>>   		goto out;
>>   	}
>>   
>> -	rc->data_inode = create_reloc_inode(rc->block_group);
>> -	if (IS_ERR(rc->data_inode)) {
>> -		err = PTR_ERR(rc->data_inode);
>> -		rc->data_inode = NULL;
>> -		goto out;
>> +	*using_remap_tree = btrfs_fs_incompat(fs_info, REMAP_TREE) &&
>> +		!(bg->flags & BTRFS_BLOCK_GROUP_SYSTEM) &&
>> +		!(bg->flags & BTRFS_BLOCK_GROUP_REMAP);
>> +
>> +	if (!btrfs_fs_incompat(fs_info, REMAP_TREE)) {
>> +		rc->data_inode = create_reloc_inode(rc->block_group);
>> +		if (IS_ERR(rc->data_inode)) {
>> +			err = PTR_ERR(rc->data_inode);
>> +			rc->data_inode = NULL;
>> +			goto out;
>> +		}
>>   	}
>>   
>>   	describe_relocation(rc->block_group);
>> @@ -4213,6 +4635,12 @@ int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start)
>>   	ret = btrfs_zone_finish(rc->block_group);
>>   	WARN_ON(ret && ret != -EAGAIN);
>>   
>> +	if (*using_remap_tree) {
>> +		err = start_block_group_remapping(fs_info, path, bg);
>> +
>> +		goto out;
>> +	}
>> +
>>   	while (1) {
>>   		enum reloc_stage finishes_stage;
>>   
>> @@ -4258,7 +4686,9 @@ int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start)
>>   out:
>>   	if (err && rw)
>>   		btrfs_dec_block_group_ro(rc->block_group);
>> -	iput(rc->data_inode);
>> +	if (!btrfs_fs_incompat(fs_info, REMAP_TREE))
>> +		iput(rc->data_inode);
>> +	btrfs_free_path(path);
>>   out_put_bg:
>>   	btrfs_put_block_group(bg);
>>   	reloc_chunk_end(fs_info);
>> @@ -4452,7 +4882,7 @@ int btrfs_recover_relocation(struct btrfs_fs_info *fs_info)
>>   
>>   	btrfs_free_path(path);
>>   
>> -	if (ret == 0) {
>> +	if (ret == 0 && !btrfs_fs_incompat(fs_info, REMAP_TREE)) {
>>   		/* cleanup orphan inode in data relocation tree */
>>   		fs_root = btrfs_grab_root(fs_info->data_reloc_root);
>>   		ASSERT(fs_root);
>> diff --git a/fs/btrfs/relocation.h b/fs/btrfs/relocation.h
>> index 0021f812b12c..49bd48296ddb 100644
>> --- a/fs/btrfs/relocation.h
>> +++ b/fs/btrfs/relocation.h
>> @@ -12,7 +12,8 @@ struct btrfs_trans_handle;
>>   struct btrfs_ordered_extent;
>>   struct btrfs_pending_snapshot;
>>   
>> -int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start);
>> +int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start,
>> +			       bool *using_remap_tree);
>>   int btrfs_init_reloc_root(struct btrfs_trans_handle *trans, struct btrfs_root *root);
>>   int btrfs_update_reloc_root(struct btrfs_trans_handle *trans,
>>   			    struct btrfs_root *root);
>> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
>> index 6471861c4b25..ab4a70c420de 100644
>> --- a/fs/btrfs/space-info.c
>> +++ b/fs/btrfs/space-info.c
>> @@ -375,8 +375,13 @@ void btrfs_add_bg_to_space_info(struct btrfs_fs_info *info,
>>   	factor = btrfs_bg_type_to_factor(block_group->flags);
>>   
>>   	spin_lock(&space_info->lock);
>> -	space_info->total_bytes += block_group->length;
>> -	space_info->disk_total += block_group->length * factor;
>> +
>> +	if (!(block_group->flags & BTRFS_BLOCK_GROUP_REMAPPED) ||
>> +	    block_group->identity_remap_count != 0) {
>> +		space_info->total_bytes += block_group->length;
>> +		space_info->disk_total += block_group->length * factor;
>> +	}
>> +
>>   	space_info->bytes_used += block_group->used;
>>   	space_info->disk_used += block_group->used * factor;
>>   	space_info->bytes_readonly += block_group->bytes_super;
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index 6c0a67da92f1..771415139dc0 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -3425,6 +3425,7 @@ int btrfs_relocate_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
>>   	struct btrfs_block_group *block_group;
>>   	u64 length;
>>   	int ret;
>> +	bool using_remap_tree;
>>   
>>   	if (btrfs_fs_incompat(fs_info, EXTENT_TREE_V2)) {
>>   		btrfs_err(fs_info,
>> @@ -3448,7 +3449,8 @@ int btrfs_relocate_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
>>   
>>   	/* step one, relocate all the extents inside this chunk */
>>   	btrfs_scrub_pause(fs_info);
>> -	ret = btrfs_relocate_block_group(fs_info, chunk_offset);
>> +	ret = btrfs_relocate_block_group(fs_info, chunk_offset,
>> +					 &using_remap_tree);
>>   	btrfs_scrub_continue(fs_info);
>>   	if (ret) {
>>   		/*
>> @@ -3467,6 +3469,9 @@ int btrfs_relocate_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
>>   	length = block_group->length;
>>   	btrfs_put_block_group(block_group);
>>   
>> +	if (using_remap_tree)
>> +		return 0;
>> +
>>   	/*
>>   	 * On a zoned file system, discard the whole block group, this will
>>   	 * trigger a REQ_OP_ZONE_RESET operation on the device zone. If
>> @@ -4165,6 +4170,14 @@ static int __btrfs_balance(struct btrfs_fs_info *fs_info)
>>   		chunk = btrfs_item_ptr(leaf, slot, struct btrfs_chunk);
>>   		chunk_type = btrfs_chunk_type(leaf, chunk);
>>   
>> +		/* Check if chunk has already been fully relocated. */
>> +		if (chunk_type & BTRFS_BLOCK_GROUP_REMAPPED &&
>> +		    btrfs_chunk_num_stripes(leaf, chunk) == 0) {
>> +			btrfs_release_path(path);
>> +			mutex_unlock(&fs_info->reclaim_bgs_lock);
>> +			goto loop;
>> +		}
>> +
>>   		if (!counting) {
>>   			spin_lock(&fs_info->balance_lock);
>>   			bctl->stat.considered++;
>> -- 
>> 2.49.0
>>
> 



* Re: [PATCH 04/12] btrfs: remove remapped block groups from the free-space tree
  2025-06-13 22:00   ` Boris Burkov
@ 2025-08-12 14:50     ` Mark Harmstone
  0 siblings, 0 replies; 39+ messages in thread
From: Mark Harmstone @ 2025-08-12 14:50 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs

On 13/06/2025 11.00 pm, Boris Burkov wrote:
> On Thu, Jun 05, 2025 at 05:23:34PM +0100, Mark Harmstone wrote:
>> No new allocations can be done from block groups that have the REMAPPED flag
>> set, so there's no value in their having entries in the free-space tree.
>>
>> Prevent a search through the free-space tree being scheduled for such a
>> block group, and prevent discard being run for a fully-remapped block
>> group.
>>
>> Signed-off-by: Mark Harmstone <maharmstone@fb.com>
>> ---
>>   fs/btrfs/block-group.c | 21 ++++++++++++++++-----
>>   fs/btrfs/discard.c     |  9 +++++++++
>>   2 files changed, 25 insertions(+), 5 deletions(-)
>>
>> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
>> index 5b0cb04b2b93..9b3b5358f1ba 100644
>> --- a/fs/btrfs/block-group.c
>> +++ b/fs/btrfs/block-group.c
>> @@ -920,6 +920,13 @@ int btrfs_cache_block_group(struct btrfs_block_group *cache, bool wait)
>>   	if (btrfs_is_zoned(fs_info))
>>   		return 0;
>>   
>> +	/*
>> +	 * No allocations can be done from remapped block groups, so they have
>> +	 * no entries in the free-space tree.
>> +	 */
>> +	if (cache->flags & BTRFS_BLOCK_GROUP_REMAPPED)
>> +		return 0;
>> +
>>   	caching_ctl = kzalloc(sizeof(*caching_ctl), GFP_NOFS);
>>   	if (!caching_ctl)
>>   		return -ENOMEM;
>> @@ -1235,9 +1242,11 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
>>   	 * another task to attempt to create another block group with the same
>>   	 * item key (and failing with -EEXIST and a transaction abort).
>>   	 */
>> -	ret = remove_block_group_free_space(trans, block_group);
>> -	if (ret)
>> -		goto out;
> 
> nit: it feels nicer to hide the check inside the function.

It's not obvious here, but this is because of start_block_group_remapping(), which is
added in a later patch. remove_block_group_free_space() gets called when a block group
is remapped, or when a non-remapped block group is removed.

>> +	if (!(block_group->flags & BTRFS_BLOCK_GROUP_REMAPPED)) {
>> +		ret = remove_block_group_free_space(trans, block_group);
>> +		if (ret)
>> +			goto out;
>> +	}
>>   
>>   	ret = remove_block_group_item(trans, path, block_group);
>>   	if (ret < 0)
>> @@ -2457,10 +2466,12 @@ static int read_one_block_group(struct btrfs_fs_info *info,
>>   	if (btrfs_chunk_writeable(info, cache->start)) {
>>   		if (cache->used == 0) {
>>   			ASSERT(list_empty(&cache->bg_list));
>> -			if (btrfs_test_opt(info, DISCARD_ASYNC))
>> +			if (btrfs_test_opt(info, DISCARD_ASYNC) &&
> 
> I asked this on the previous patch, but I guess this means we will never
> discard these blocks? Is that desirable? Or are we discarding them at
> some other point in the life-cycle?
> 
>> +			    !(cache->flags & BTRFS_BLOCK_GROUP_REMAPPED)) {
>>   				btrfs_discard_queue_work(&info->discard_ctl, cache);
>> -			else
>> +			} else {
>>   				btrfs_mark_bg_unused(cache);
>> +			}
>>   		}
>>   	} else {
>>   		inc_block_group_ro(cache, 1);
>> diff --git a/fs/btrfs/discard.c b/fs/btrfs/discard.c
>> index 89fe85778115..1015a4d37fb2 100644
>> --- a/fs/btrfs/discard.c
>> +++ b/fs/btrfs/discard.c
>> @@ -698,6 +698,15 @@ void btrfs_discard_punt_unused_bgs_list(struct btrfs_fs_info *fs_info)
>>   	/* We enabled async discard, so punt all to the queue */
>>   	list_for_each_entry_safe(block_group, next, &fs_info->unused_bgs,
>>   				 bg_list) {
>> +		/* Fully remapped BGs have nothing to discard */
> 
> Same question. If we simply *don't* discard them, I feel like this
> comment is misleadingly worded.
> 
>> +		spin_lock(&block_group->lock);
>> +		if (block_group->flags & BTRFS_BLOCK_GROUP_REMAPPED &&
>> +		    !btrfs_is_block_group_used(block_group)) {
>> +			spin_unlock(&block_group->lock);
>> +			continue;
>> +		}
>> +		spin_unlock(&block_group->lock);
>> +
>>   		list_del_init(&block_group->bg_list);
>>   		btrfs_discard_queue_work(&fs_info->discard_ctl, block_group);
>>   		/*
>> -- 
>> 2.49.0
>>
> 



end of thread, other threads:[~2025-08-12 14:50 UTC | newest]

Thread overview: 39+ messages
2025-06-05 16:23 [PATCH 00/12] btrfs: remap tree Mark Harmstone
2025-06-05 16:23 ` [PATCH 01/12] btrfs: add definitions and constants for remap-tree Mark Harmstone
2025-06-13 21:02   ` Boris Burkov
2025-06-05 16:23 ` [PATCH 02/12] btrfs: add REMAP chunk type Mark Harmstone
2025-06-13 21:22   ` Boris Burkov
2025-06-05 16:23 ` [PATCH 03/12] btrfs: allow remapped chunks to have zero stripes Mark Harmstone
2025-06-13 21:41   ` Boris Burkov
2025-08-08 14:12     ` Mark Harmstone
2025-06-05 16:23 ` [PATCH 04/12] btrfs: remove remapped block groups from the free-space tree Mark Harmstone
2025-06-06  6:41   ` kernel test robot
2025-06-13 22:00   ` Boris Burkov
2025-08-12 14:50     ` Mark Harmstone
2025-06-05 16:23 ` [PATCH 05/12] btrfs: don't add metadata items for the remap tree to the extent tree Mark Harmstone
2025-06-13 22:39   ` Boris Burkov
2025-06-05 16:23 ` [PATCH 06/12] btrfs: add extended version of struct block_group_item Mark Harmstone
2025-06-05 16:23 ` [PATCH 07/12] btrfs: allow mounting filesystems with remap-tree incompat flag Mark Harmstone
2025-06-05 16:23 ` [PATCH 08/12] btrfs: redirect I/O for remapped block groups Mark Harmstone
2025-06-05 16:23 ` [PATCH 09/12] btrfs: handle deletions from remapped block group Mark Harmstone
2025-06-13 23:42   ` Boris Burkov
2025-08-11 16:48     ` Mark Harmstone
2025-08-11 16:59     ` Mark Harmstone
2025-06-05 16:23 ` [PATCH 10/12] btrfs: handle setting up relocation of block group with remap-tree Mark Harmstone
2025-06-13 23:25   ` Boris Burkov
2025-08-12 11:20     ` Mark Harmstone
2025-06-05 16:23 ` [PATCH 11/12] btrfs: move existing remaps before relocating block group Mark Harmstone
2025-06-06 11:20   ` kernel test robot
2025-06-05 16:23 ` [PATCH 12/12] btrfs: replace identity maps with actual remaps when doing relocations Mark Harmstone
2025-06-05 16:43 ` [PATCH 00/12] btrfs: remap tree Jonah Sabean
2025-06-06 13:35   ` Mark Harmstone
2025-06-09 16:05     ` Anand Jain
2025-06-09 18:51 ` David Sterba
2025-06-10  9:19   ` Mark Harmstone
2025-06-10 14:31 ` Mark Harmstone
2025-06-10 23:56   ` Qu Wenruo
2025-06-11  8:06     ` Mark Harmstone
2025-06-11 15:28 ` Mark Harmstone
2025-06-14  0:04 ` Boris Burkov
2025-06-26 22:10 ` Mark Harmstone
2025-06-27  5:59   ` Neal Gompa
