* [PATCH v2 00/16] btrfs: remap tree
@ 2025-08-13 14:34 Mark Harmstone
2025-08-13 14:34 ` [PATCH v2 01/16] btrfs: add definitions and constants for remap-tree Mark Harmstone
` (15 more replies)
0 siblings, 16 replies; 43+ messages in thread
From: Mark Harmstone @ 2025-08-13 14:34 UTC (permalink / raw)
To: linux-btrfs; +Cc: Mark Harmstone
This patch series adds a disk format change gated behind
CONFIG_BTRFS_EXPERIMENTAL to add a "remap tree", which acts as a layer of
indirection when doing I/O. When doing relocation, rather than fixing up every
tree, we instead record the old and new addresses in the remap tree. This
should make things more reliable and flexible, as well as enabling some
future changes we'd like to make, such as larger data extents and reducing
write amplification by removing cow-only metadata items.
The remap tree lives in a new REMAP chunk type. This is because bootstrapping
means that the remap tree can't itself be remapped, and has to be relocated by
COWing it, as is done at present. It can't go in the SYSTEM chunk, as we would
then be limited by the chunk item needing to fit in the superblock.
For more on the design and rationale, please see my RFC sent earlier this year[1], as
well as Josef Bacik's original design document[2]. The main change from Josef's
design is that I've added remap backrefs, as we need to be able to move a
chunk's existing remaps before remapping it.
You will also need my patches to btrfs-progs[3] to make
`mkfs.btrfs -O remap-tree` work, as well as allowing `btrfs check` to recognize
the new format.
Changelog:
Since v1:
* Fixed the problems with discard. An extent removed from a remapped
block group now has its address translated through the remap tree, and
removing the last identity map of a block group triggers a discard of
its old dev extents.
* Added relocation of the REMAP chunks, i.e. the chunks where the remap
tree itself lives. This can't be done by the existing method as we've
removed the metadata items in the extent tree, so we just COW every
leaf.
* Fixed a couple of lockdep issues
* Addressed the points that Boris made in his review
Since the RFC:
* I've reduced the REMAP chunk size from the normal 1GB to 32MB, to match the
SYSTEM chunk. For a filesystem with 4KB sectors and 16KB node size, the worst
case is that one leaf covers ~1MB of data, and the best case ~250GB. For a
chunk, that implies a worst case of ~2GB and a best case of ~500TB.
This isn't a disk-format change, so we can always adjust it if it proves too
big or small in practice. mkfs creates 8MB chunks, as it does for everything.
* You can't make new allocations from remapped block groups, so I've changed
it so that there are no free-space entries for these (thanks to Boris Burkov for the
suggestion).
* The remap tree doesn't have metadata items in the extent tree (thanks to Josef
for the suggestion). This was to work around some corruption that delayed refs
were causing, but it also fits in with our future plans of removing all
metadata items for COW-only trees, reducing write amplification.
* btrfs_translate_remap() uses search_commit_root when doing metadata lookups,
to avoid nested locking issues. This also seems to be a lot quicker (btrfs/187
went from ~20mins to ~90secs).
* Unused remapped block groups should now get cleaned up more aggressively
* Other miscellaneous cleanups and fixes
Known issues:
* Some test failures: btrfs/156, btrfs/170, btrfs/226, btrfs/250
* nodatacow extents aren't safe, as they can race with the relocation thread.
We either need to follow the btrfs_inc_nocow_writers() approach, which COWs
the extent, or change it so that it blocks here.
* When initially marking a block group as remapped, we are walking the free-
space tree and creating the identity remaps all in one transaction. For the
worst-case scenario, i.e. a 1GB block group with every other sector allocated
(131,072 extents), this can result in transaction times of more than 10 mins.
This needs to be changed so that it can happen over multiple transactions.
* All this is disabled for zoned devices for the time being, as I've not been
able to test it. I'm planning to make it compatible with zoned at a later
date.
Thanks
[1] https://lwn.net/Articles/1021452/
[2] https://github.com/btrfs/btrfs-todo/issues/54
[3] https://github.com/maharmstone/btrfs-progs/tree/remap-tree
Mark Harmstone (16):
btrfs: add definitions and constants for remap-tree
btrfs: add REMAP chunk type
btrfs: allow remapped chunks to have zero stripes
btrfs: remove remapped block groups from the free-space tree
btrfs: don't add metadata items for the remap tree to the extent tree
btrfs: add extended version of struct block_group_item
btrfs: allow mounting filesystems with remap-tree incompat flag
btrfs: redirect I/O for remapped block groups
btrfs: release BG lock before calling btrfs_link_bg_list()
btrfs: handle deletions from remapped block group
btrfs: handle setting up relocation of block group with remap-tree
btrfs: move existing remaps before relocating block group
btrfs: replace identity maps with actual remaps when doing relocations
btrfs: add do_remap param to btrfs_discard_extent()
btrfs: add fully_remapped_bgs list
btrfs: allow balancing remap tree
fs/btrfs/Kconfig | 2 +
fs/btrfs/accessors.h | 29 +
fs/btrfs/block-group.c | 233 +++-
fs/btrfs/block-group.h | 17 +-
fs/btrfs/block-rsv.c | 8 +
fs/btrfs/block-rsv.h | 1 +
fs/btrfs/discard.c | 11 +-
fs/btrfs/disk-io.c | 91 +-
fs/btrfs/extent-tree.c | 115 +-
fs/btrfs/extent-tree.h | 2 +-
fs/btrfs/free-space-cache.c | 2 +-
fs/btrfs/free-space-tree.c | 4 +-
fs/btrfs/free-space-tree.h | 5 +-
fs/btrfs/fs.h | 7 +-
fs/btrfs/inode.c | 2 +-
fs/btrfs/locking.c | 1 +
fs/btrfs/relocation.c | 1906 ++++++++++++++++++++++++++++++-
fs/btrfs/relocation.h | 7 +-
fs/btrfs/space-info.c | 22 +-
fs/btrfs/sysfs.c | 4 +
fs/btrfs/transaction.c | 8 +
fs/btrfs/transaction.h | 1 +
fs/btrfs/tree-checker.c | 92 +-
fs/btrfs/tree-checker.h | 5 +
fs/btrfs/volumes.c | 299 ++++-
fs/btrfs/volumes.h | 19 +-
include/uapi/linux/btrfs.h | 1 +
include/uapi/linux/btrfs_tree.h | 29 +-
28 files changed, 2717 insertions(+), 206 deletions(-)
--
2.49.1
* [PATCH v2 01/16] btrfs: add definitions and constants for remap-tree
2025-08-13 14:34 [PATCH v2 00/16] btrfs: remap tree Mark Harmstone
@ 2025-08-13 14:34 ` Mark Harmstone
2025-08-15 23:51 ` Boris Burkov
2025-08-16 0:01 ` Qu Wenruo
2025-08-13 14:34 ` [PATCH v2 02/16] btrfs: add REMAP chunk type Mark Harmstone
` (14 subsequent siblings)
15 siblings, 2 replies; 43+ messages in thread
From: Mark Harmstone @ 2025-08-13 14:34 UTC (permalink / raw)
To: linux-btrfs; +Cc: Mark Harmstone
Add an incompat flag for the new remap-tree feature, and the constants
and definitions needed to support it.
Signed-off-by: Mark Harmstone <mark@harmstone.com>
---
fs/btrfs/accessors.h | 3 +++
fs/btrfs/locking.c | 1 +
fs/btrfs/sysfs.c | 2 ++
fs/btrfs/tree-checker.c | 6 ++----
fs/btrfs/tree-checker.h | 5 +++++
fs/btrfs/volumes.c | 1 +
include/uapi/linux/btrfs.h | 1 +
include/uapi/linux/btrfs_tree.h | 12 ++++++++++++
8 files changed, 27 insertions(+), 4 deletions(-)
diff --git a/fs/btrfs/accessors.h b/fs/btrfs/accessors.h
index 99b3ced12805..95a1ca8c099b 100644
--- a/fs/btrfs/accessors.h
+++ b/fs/btrfs/accessors.h
@@ -1009,6 +1009,9 @@ BTRFS_SETGET_STACK_FUNCS(stack_verity_descriptor_encryption,
BTRFS_SETGET_STACK_FUNCS(stack_verity_descriptor_size,
struct btrfs_verity_descriptor_item, size, 64);
+BTRFS_SETGET_FUNCS(remap_address, struct btrfs_remap, address, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_remap_address, struct btrfs_remap, address, 64);
+
/* Cast into the data area of the leaf. */
#define btrfs_item_ptr(leaf, slot, type) \
((type *)(btrfs_item_nr_offset(leaf, 0) + btrfs_item_offset(leaf, slot)))
diff --git a/fs/btrfs/locking.c b/fs/btrfs/locking.c
index a3e6d9616e60..26f810258486 100644
--- a/fs/btrfs/locking.c
+++ b/fs/btrfs/locking.c
@@ -73,6 +73,7 @@ static struct btrfs_lockdep_keyset {
{ .id = BTRFS_FREE_SPACE_TREE_OBJECTID, DEFINE_NAME("free-space") },
{ .id = BTRFS_BLOCK_GROUP_TREE_OBJECTID, DEFINE_NAME("block-group") },
{ .id = BTRFS_RAID_STRIPE_TREE_OBJECTID, DEFINE_NAME("raid-stripe") },
+ { .id = BTRFS_REMAP_TREE_OBJECTID, DEFINE_NAME("remap-tree") },
{ .id = 0, DEFINE_NAME("tree") },
};
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 81f52c1f55ce..857d2772db1c 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -291,6 +291,7 @@ BTRFS_FEAT_ATTR_COMPAT_RO(free_space_tree, FREE_SPACE_TREE);
BTRFS_FEAT_ATTR_COMPAT_RO(block_group_tree, BLOCK_GROUP_TREE);
BTRFS_FEAT_ATTR_INCOMPAT(raid1c34, RAID1C34);
BTRFS_FEAT_ATTR_INCOMPAT(simple_quota, SIMPLE_QUOTA);
+BTRFS_FEAT_ATTR_INCOMPAT(remap_tree, REMAP_TREE);
#ifdef CONFIG_BLK_DEV_ZONED
BTRFS_FEAT_ATTR_INCOMPAT(zoned, ZONED);
#endif
@@ -325,6 +326,7 @@ static struct attribute *btrfs_supported_feature_attrs[] = {
BTRFS_FEAT_ATTR_PTR(raid1c34),
BTRFS_FEAT_ATTR_PTR(block_group_tree),
BTRFS_FEAT_ATTR_PTR(simple_quota),
+ BTRFS_FEAT_ATTR_PTR(remap_tree),
#ifdef CONFIG_BLK_DEV_ZONED
BTRFS_FEAT_ATTR_PTR(zoned),
#endif
diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
index 0f556f4de3f9..76ec3698f197 100644
--- a/fs/btrfs/tree-checker.c
+++ b/fs/btrfs/tree-checker.c
@@ -912,12 +912,10 @@ int btrfs_check_chunk_valid(const struct btrfs_fs_info *fs_info,
length, btrfs_stripe_nr_to_offset(U32_MAX));
return -EUCLEAN;
}
- if (unlikely(type & ~(BTRFS_BLOCK_GROUP_TYPE_MASK |
- BTRFS_BLOCK_GROUP_PROFILE_MASK))) {
+ if (unlikely(type & ~BTRFS_BLOCK_GROUP_VALID)) {
chunk_err(fs_info, leaf, chunk, logical,
"unrecognized chunk type: 0x%llx",
- ~(BTRFS_BLOCK_GROUP_TYPE_MASK |
- BTRFS_BLOCK_GROUP_PROFILE_MASK) & type);
+ type & ~BTRFS_BLOCK_GROUP_VALID);
return -EUCLEAN;
}
diff --git a/fs/btrfs/tree-checker.h b/fs/btrfs/tree-checker.h
index eb201f4ec3c7..833e2fd989eb 100644
--- a/fs/btrfs/tree-checker.h
+++ b/fs/btrfs/tree-checker.h
@@ -57,6 +57,11 @@ enum btrfs_tree_block_status {
BTRFS_TREE_BLOCK_WRITTEN_NOT_SET,
};
+
+#define BTRFS_BLOCK_GROUP_VALID (BTRFS_BLOCK_GROUP_TYPE_MASK | \
+ BTRFS_BLOCK_GROUP_PROFILE_MASK | \
+ BTRFS_BLOCK_GROUP_REMAPPED)
+
/*
* Exported simply for btrfs-progs which wants to have the
* btrfs_tree_block_status return codes.
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index fa7a929a0461..e067e9cd68a5 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -231,6 +231,7 @@ void btrfs_describe_block_groups(u64 bg_flags, char *buf, u32 size_buf)
DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_DATA, "data");
DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_SYSTEM, "system");
DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_METADATA, "metadata");
+ DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_REMAPPED, "remapped");
DESCRIBE_FLAG(BTRFS_AVAIL_ALLOC_BIT_SINGLE, "single");
for (i = 0; i < BTRFS_NR_RAID_TYPES; i++)
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index 8e710bbb688e..fba303ed49e6 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -336,6 +336,7 @@ struct btrfs_ioctl_fs_info_args {
#define BTRFS_FEATURE_INCOMPAT_EXTENT_TREE_V2 (1ULL << 13)
#define BTRFS_FEATURE_INCOMPAT_RAID_STRIPE_TREE (1ULL << 14)
#define BTRFS_FEATURE_INCOMPAT_SIMPLE_QUOTA (1ULL << 16)
+#define BTRFS_FEATURE_INCOMPAT_REMAP_TREE (1ULL << 17)
struct btrfs_ioctl_feature_flags {
__u64 compat_flags;
diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index fc29d273845d..4439d77a7252 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -76,6 +76,9 @@
/* Tracks RAID stripes in block groups. */
#define BTRFS_RAID_STRIPE_TREE_OBJECTID 12ULL
+/* Holds details of remapped addresses after relocation. */
+#define BTRFS_REMAP_TREE_OBJECTID 13ULL
+
/* device stats in the device tree */
#define BTRFS_DEV_STATS_OBJECTID 0ULL
@@ -282,6 +285,10 @@
#define BTRFS_RAID_STRIPE_KEY 230
+#define BTRFS_IDENTITY_REMAP_KEY 234
+#define BTRFS_REMAP_KEY 235
+#define BTRFS_REMAP_BACKREF_KEY 236
+
/*
* Records the overall state of the qgroups.
* There's only one instance of this key present,
@@ -1161,6 +1168,7 @@ struct btrfs_dev_replace_item {
#define BTRFS_BLOCK_GROUP_RAID6 (1ULL << 8)
#define BTRFS_BLOCK_GROUP_RAID1C3 (1ULL << 9)
#define BTRFS_BLOCK_GROUP_RAID1C4 (1ULL << 10)
+#define BTRFS_BLOCK_GROUP_REMAPPED (1ULL << 11)
#define BTRFS_BLOCK_GROUP_RESERVED (BTRFS_AVAIL_ALLOC_BIT_SINGLE | \
BTRFS_SPACE_INFO_GLOBAL_RSV)
@@ -1323,4 +1331,8 @@ struct btrfs_verity_descriptor_item {
__u8 encryption;
} __attribute__ ((__packed__));
+struct btrfs_remap {
+ __le64 address;
+} __attribute__ ((__packed__));
+
#endif /* _BTRFS_CTREE_H_ */
--
2.49.1
* [PATCH v2 02/16] btrfs: add REMAP chunk type
2025-08-13 14:34 [PATCH v2 00/16] btrfs: remap tree Mark Harmstone
2025-08-13 14:34 ` [PATCH v2 01/16] btrfs: add definitions and constants for remap-tree Mark Harmstone
@ 2025-08-13 14:34 ` Mark Harmstone
2025-08-13 14:34 ` [PATCH v2 03/16] btrfs: allow remapped chunks to have zero stripes Mark Harmstone
` (13 subsequent siblings)
15 siblings, 0 replies; 43+ messages in thread
From: Mark Harmstone @ 2025-08-13 14:34 UTC (permalink / raw)
To: linux-btrfs; +Cc: Mark Harmstone
Add a new REMAP chunk type, which is a metadata chunk that holds the
remap tree.
This is needed for bootstrapping purposes: the remap tree can't itself
be remapped, and must be relocated the existing way, by COWing every
leaf. The remap tree can't go in the SYSTEM chunk as space there is
limited, because a copy of the chunk item gets placed in the superblock.
The changes in fs/btrfs/volumes.h are because we're adding a new block
group type bit after the profile bits, and so can no longer rely on the
const_ilog2 trick.
The sizing of 32MB per chunk, matching the SYSTEM chunk, is an estimate;
we can adjust it later if it proves to be too big or too small. This
works out to ~500,000 remap items, which for a 4KB block size covers
~2GB of remapped data in the worst case and ~500TB in the best case.
Signed-off-by: Mark Harmstone <mark@harmstone.com>
---
fs/btrfs/block-rsv.c | 8 ++++++++
fs/btrfs/block-rsv.h | 1 +
fs/btrfs/disk-io.c | 1 +
fs/btrfs/fs.h | 2 ++
fs/btrfs/space-info.c | 13 ++++++++++++-
fs/btrfs/sysfs.c | 2 ++
fs/btrfs/tree-checker.c | 13 +++++++++++--
fs/btrfs/volumes.c | 3 +++
fs/btrfs/volumes.h | 11 +++++++++--
include/uapi/linux/btrfs_tree.h | 4 +++-
10 files changed, 52 insertions(+), 6 deletions(-)
diff --git a/fs/btrfs/block-rsv.c b/fs/btrfs/block-rsv.c
index 5ad6de738aee..2678cd3bed29 100644
--- a/fs/btrfs/block-rsv.c
+++ b/fs/btrfs/block-rsv.c
@@ -421,6 +421,9 @@ void btrfs_init_root_block_rsv(struct btrfs_root *root)
case BTRFS_TREE_LOG_OBJECTID:
root->block_rsv = &fs_info->treelog_rsv;
break;
+ case BTRFS_REMAP_TREE_OBJECTID:
+ root->block_rsv = &fs_info->remap_block_rsv;
+ break;
default:
root->block_rsv = NULL;
break;
@@ -434,6 +437,9 @@ void btrfs_init_global_block_rsv(struct btrfs_fs_info *fs_info)
space_info = btrfs_find_space_info(fs_info, BTRFS_BLOCK_GROUP_SYSTEM);
fs_info->chunk_block_rsv.space_info = space_info;
+ space_info = btrfs_find_space_info(fs_info, BTRFS_BLOCK_GROUP_REMAP);
+ fs_info->remap_block_rsv.space_info = space_info;
+
space_info = btrfs_find_space_info(fs_info, BTRFS_BLOCK_GROUP_METADATA);
fs_info->global_block_rsv.space_info = space_info;
fs_info->trans_block_rsv.space_info = space_info;
@@ -460,6 +466,8 @@ void btrfs_release_global_block_rsv(struct btrfs_fs_info *fs_info)
WARN_ON(fs_info->trans_block_rsv.reserved > 0);
WARN_ON(fs_info->chunk_block_rsv.size > 0);
WARN_ON(fs_info->chunk_block_rsv.reserved > 0);
+ WARN_ON(fs_info->remap_block_rsv.size > 0);
+ WARN_ON(fs_info->remap_block_rsv.reserved > 0);
WARN_ON(fs_info->delayed_block_rsv.size > 0);
WARN_ON(fs_info->delayed_block_rsv.reserved > 0);
WARN_ON(fs_info->delayed_refs_rsv.reserved > 0);
diff --git a/fs/btrfs/block-rsv.h b/fs/btrfs/block-rsv.h
index 79ae9d05cd91..8359fb96bc3c 100644
--- a/fs/btrfs/block-rsv.h
+++ b/fs/btrfs/block-rsv.h
@@ -22,6 +22,7 @@ enum btrfs_rsv_type {
BTRFS_BLOCK_RSV_DELALLOC,
BTRFS_BLOCK_RSV_TRANS,
BTRFS_BLOCK_RSV_CHUNK,
+ BTRFS_BLOCK_RSV_REMAP,
BTRFS_BLOCK_RSV_DELOPS,
BTRFS_BLOCK_RSV_DELREFS,
BTRFS_BLOCK_RSV_TREELOG,
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 123c397ca8f8..7e60097b2a96 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2815,6 +2815,7 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
BTRFS_BLOCK_RSV_GLOBAL);
btrfs_init_block_rsv(&fs_info->trans_block_rsv, BTRFS_BLOCK_RSV_TRANS);
btrfs_init_block_rsv(&fs_info->chunk_block_rsv, BTRFS_BLOCK_RSV_CHUNK);
+ btrfs_init_block_rsv(&fs_info->remap_block_rsv, BTRFS_BLOCK_RSV_REMAP);
btrfs_init_block_rsv(&fs_info->treelog_rsv, BTRFS_BLOCK_RSV_TREELOG);
btrfs_init_block_rsv(&fs_info->empty_block_rsv, BTRFS_BLOCK_RSV_EMPTY);
btrfs_init_block_rsv(&fs_info->delayed_block_rsv,
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index f0b090d4ac04..9ce75843b578 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -472,6 +472,8 @@ struct btrfs_fs_info {
struct btrfs_block_rsv trans_block_rsv;
/* Block reservation for chunk tree */
struct btrfs_block_rsv chunk_block_rsv;
+ /* Block reservation for remap tree */
+ struct btrfs_block_rsv remap_block_rsv;
/* Block reservation for delayed operations */
struct btrfs_block_rsv delayed_block_rsv;
/* Block reservation for delayed refs */
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 0481c693ac2e..278034a22dbf 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -215,7 +215,7 @@ static u64 calc_chunk_size(const struct btrfs_fs_info *fs_info, u64 flags)
if (flags & BTRFS_BLOCK_GROUP_DATA)
return BTRFS_MAX_DATA_CHUNK_SIZE;
- else if (flags & BTRFS_BLOCK_GROUP_SYSTEM)
+ else if (flags & (BTRFS_BLOCK_GROUP_SYSTEM | BTRFS_BLOCK_GROUP_REMAP))
return SZ_32M;
/* Handle BTRFS_BLOCK_GROUP_METADATA */
@@ -343,6 +343,8 @@ int btrfs_init_space_info(struct btrfs_fs_info *fs_info)
if (mixed) {
flags = BTRFS_BLOCK_GROUP_METADATA | BTRFS_BLOCK_GROUP_DATA;
ret = create_space_info(fs_info, flags);
+ if (ret)
+ goto out;
} else {
flags = BTRFS_BLOCK_GROUP_METADATA;
ret = create_space_info(fs_info, flags);
@@ -351,7 +353,15 @@ int btrfs_init_space_info(struct btrfs_fs_info *fs_info)
flags = BTRFS_BLOCK_GROUP_DATA;
ret = create_space_info(fs_info, flags);
+ if (ret)
+ goto out;
+ }
+
+ if (features & BTRFS_FEATURE_INCOMPAT_REMAP_TREE) {
+ flags = BTRFS_BLOCK_GROUP_REMAP;
+ ret = create_space_info(fs_info, flags);
}
+
out:
return ret;
}
@@ -590,6 +600,7 @@ static void dump_global_block_rsv(struct btrfs_fs_info *fs_info)
DUMP_BLOCK_RSV(fs_info, global_block_rsv);
DUMP_BLOCK_RSV(fs_info, trans_block_rsv);
DUMP_BLOCK_RSV(fs_info, chunk_block_rsv);
+ DUMP_BLOCK_RSV(fs_info, remap_block_rsv);
DUMP_BLOCK_RSV(fs_info, delayed_block_rsv);
DUMP_BLOCK_RSV(fs_info, delayed_refs_rsv);
}
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 857d2772db1c..f942fde1d936 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -1973,6 +1973,8 @@ static const char *alloc_name(struct btrfs_space_info *space_info)
case BTRFS_BLOCK_GROUP_SYSTEM:
ASSERT(space_info->subgroup_id == BTRFS_SUB_GROUP_PRIMARY);
return "system";
+ case BTRFS_BLOCK_GROUP_REMAP:
+ return "remap";
default:
WARN_ON(1);
return "invalid-combination";
diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
index 76ec3698f197..ca898b1f12f1 100644
--- a/fs/btrfs/tree-checker.c
+++ b/fs/btrfs/tree-checker.c
@@ -747,17 +747,26 @@ static int check_block_group_item(struct extent_buffer *leaf,
return -EUCLEAN;
}
+ if (flags & BTRFS_BLOCK_GROUP_REMAP &&
+ !btrfs_fs_incompat(fs_info, REMAP_TREE)) {
+ block_group_err(leaf, slot,
+"invalid flags, have 0x%llx (REMAP flag set) but no remap-tree incompat flag",
+ flags);
+ return -EUCLEAN;
+ }
+
type = flags & BTRFS_BLOCK_GROUP_TYPE_MASK;
if (unlikely(type != BTRFS_BLOCK_GROUP_DATA &&
type != BTRFS_BLOCK_GROUP_METADATA &&
type != BTRFS_BLOCK_GROUP_SYSTEM &&
+ type != BTRFS_BLOCK_GROUP_REMAP &&
type != (BTRFS_BLOCK_GROUP_METADATA |
BTRFS_BLOCK_GROUP_DATA))) {
block_group_err(leaf, slot,
-"invalid type, have 0x%llx (%lu bits set) expect either 0x%llx, 0x%llx, 0x%llx or 0x%llx",
+"invalid type, have 0x%llx (%lu bits set) expect either 0x%llx, 0x%llx, 0x%llx, 0x%llx or 0x%llx",
type, hweight64(type),
BTRFS_BLOCK_GROUP_DATA, BTRFS_BLOCK_GROUP_METADATA,
- BTRFS_BLOCK_GROUP_SYSTEM,
+ BTRFS_BLOCK_GROUP_SYSTEM, BTRFS_BLOCK_GROUP_REMAP,
BTRFS_BLOCK_GROUP_METADATA | BTRFS_BLOCK_GROUP_DATA);
return -EUCLEAN;
}
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index e067e9cd68a5..f4d1527f265e 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -231,6 +231,9 @@ void btrfs_describe_block_groups(u64 bg_flags, char *buf, u32 size_buf)
DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_DATA, "data");
DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_SYSTEM, "system");
DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_METADATA, "metadata");
+ /* block groups containing the remap tree */
+ DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_REMAP, "remap");
+ /* block group that has been remapped */
DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_REMAPPED, "remapped");
DESCRIBE_FLAG(BTRFS_AVAIL_ALLOC_BIT_SINGLE, "single");
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index a56e873a3029..430be12fd5e7 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -58,8 +58,6 @@ static_assert(const_ilog2(BTRFS_STRIPE_LEN) == BTRFS_STRIPE_LEN_SHIFT);
*/
static_assert(const_ffs(BTRFS_BLOCK_GROUP_RAID0) <
const_ffs(BTRFS_BLOCK_GROUP_PROFILE_MASK & ~BTRFS_BLOCK_GROUP_RAID0));
-static_assert(const_ilog2(BTRFS_BLOCK_GROUP_RAID0) >
- ilog2(BTRFS_BLOCK_GROUP_TYPE_MASK));
/* ilog2() can handle both constants and variables */
#define BTRFS_BG_FLAG_TO_INDEX(profile) \
@@ -81,6 +79,15 @@ enum btrfs_raid_types {
BTRFS_NR_RAID_TYPES
};
+static_assert(BTRFS_RAID_RAID0 == 1);
+static_assert(BTRFS_RAID_RAID1 == 2);
+static_assert(BTRFS_RAID_DUP == 3);
+static_assert(BTRFS_RAID_RAID10 == 4);
+static_assert(BTRFS_RAID_RAID5 == 5);
+static_assert(BTRFS_RAID_RAID6 == 6);
+static_assert(BTRFS_RAID_RAID1C3 == 7);
+static_assert(BTRFS_RAID_RAID1C4 == 8);
+
/*
* Use sequence counter to get consistent device stat data on
* 32-bit processors.
diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index 4439d77a7252..9a36f0206d90 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -1169,12 +1169,14 @@ struct btrfs_dev_replace_item {
#define BTRFS_BLOCK_GROUP_RAID1C3 (1ULL << 9)
#define BTRFS_BLOCK_GROUP_RAID1C4 (1ULL << 10)
#define BTRFS_BLOCK_GROUP_REMAPPED (1ULL << 11)
+#define BTRFS_BLOCK_GROUP_REMAP (1ULL << 12)
#define BTRFS_BLOCK_GROUP_RESERVED (BTRFS_AVAIL_ALLOC_BIT_SINGLE | \
BTRFS_SPACE_INFO_GLOBAL_RSV)
#define BTRFS_BLOCK_GROUP_TYPE_MASK (BTRFS_BLOCK_GROUP_DATA | \
BTRFS_BLOCK_GROUP_SYSTEM | \
- BTRFS_BLOCK_GROUP_METADATA)
+ BTRFS_BLOCK_GROUP_METADATA | \
+ BTRFS_BLOCK_GROUP_REMAP)
#define BTRFS_BLOCK_GROUP_PROFILE_MASK (BTRFS_BLOCK_GROUP_RAID0 | \
BTRFS_BLOCK_GROUP_RAID1 | \
--
2.49.1
* [PATCH v2 03/16] btrfs: allow remapped chunks to have zero stripes
2025-08-13 14:34 [PATCH v2 00/16] btrfs: remap tree Mark Harmstone
2025-08-13 14:34 ` [PATCH v2 01/16] btrfs: add definitions and constants for remap-tree Mark Harmstone
2025-08-13 14:34 ` [PATCH v2 02/16] btrfs: add REMAP chunk type Mark Harmstone
@ 2025-08-13 14:34 ` Mark Harmstone
2025-08-16 0:03 ` Boris Burkov
2025-08-19 1:05 ` kernel test robot
2025-08-13 14:34 ` [PATCH v2 04/16] btrfs: remove remapped block groups from the free-space tree Mark Harmstone
` (12 subsequent siblings)
15 siblings, 2 replies; 43+ messages in thread
From: Mark Harmstone @ 2025-08-13 14:34 UTC (permalink / raw)
To: linux-btrfs; +Cc: Mark Harmstone
When a chunk has been fully remapped, we are going to set its
num_stripes to 0, as it will no longer represent a physical location on
disk.
Change tree-checker to allow for this, and fix a couple of
divide-by-zeroes seen elsewhere.
Signed-off-by: Mark Harmstone <mark@harmstone.com>
---
fs/btrfs/tree-checker.c | 63 ++++++++++++++++++++++++++++-------------
fs/btrfs/volumes.c | 8 +++++-
2 files changed, 50 insertions(+), 21 deletions(-)
diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
index ca898b1f12f1..20bfe333ffdd 100644
--- a/fs/btrfs/tree-checker.c
+++ b/fs/btrfs/tree-checker.c
@@ -815,6 +815,39 @@ static void chunk_err(const struct btrfs_fs_info *fs_info,
va_end(args);
}
+static bool valid_stripe_count(u64 profile, u16 num_stripes,
+ u16 sub_stripes)
+{
+ switch (profile) {
+ case BTRFS_BLOCK_GROUP_RAID10:
+ return sub_stripes ==
+ btrfs_raid_array[BTRFS_RAID_RAID10].sub_stripes;
+ case BTRFS_BLOCK_GROUP_RAID1:
+ return num_stripes ==
+ btrfs_raid_array[BTRFS_RAID_RAID1].devs_min;
+ case BTRFS_BLOCK_GROUP_RAID1C3:
+ return num_stripes ==
+ btrfs_raid_array[BTRFS_RAID_RAID1C3].devs_min;
+ case BTRFS_BLOCK_GROUP_RAID1C4:
+ return num_stripes ==
+ btrfs_raid_array[BTRFS_RAID_RAID1C4].devs_min;
+ case BTRFS_BLOCK_GROUP_RAID5:
+ return num_stripes >=
+ btrfs_raid_array[BTRFS_RAID_RAID5].devs_min;
+ case BTRFS_BLOCK_GROUP_RAID6:
+ return num_stripes >=
+ btrfs_raid_array[BTRFS_RAID_RAID6].devs_min;
+ case BTRFS_BLOCK_GROUP_DUP:
+ return num_stripes ==
+ btrfs_raid_array[BTRFS_RAID_DUP].dev_stripes;
+ case 0: /* SINGLE */
+ return num_stripes ==
+ btrfs_raid_array[BTRFS_RAID_SINGLE].dev_stripes;
+ default:
+ BUG();
+ }
+}
+
/*
* The common chunk check which could also work on super block sys chunk array.
*
@@ -838,6 +871,7 @@ int btrfs_check_chunk_valid(const struct btrfs_fs_info *fs_info,
u64 features;
u32 chunk_sector_size;
bool mixed = false;
+ bool remapped;
int raid_index;
int nparity;
int ncopies;
@@ -861,12 +895,14 @@ int btrfs_check_chunk_valid(const struct btrfs_fs_info *fs_info,
ncopies = btrfs_raid_array[raid_index].ncopies;
nparity = btrfs_raid_array[raid_index].nparity;
- if (unlikely(!num_stripes)) {
+ remapped = type & BTRFS_BLOCK_GROUP_REMAPPED;
+
+ if (unlikely(!remapped && !num_stripes)) {
chunk_err(fs_info, leaf, chunk, logical,
"invalid chunk num_stripes, have %u", num_stripes);
return -EUCLEAN;
}
- if (unlikely(num_stripes < ncopies)) {
+ if (unlikely(num_stripes != 0 && num_stripes < ncopies)) {
chunk_err(fs_info, leaf, chunk, logical,
"invalid chunk num_stripes < ncopies, have %u < %d",
num_stripes, ncopies);
@@ -964,22 +1000,9 @@ int btrfs_check_chunk_valid(const struct btrfs_fs_info *fs_info,
}
}
- if (unlikely((type & BTRFS_BLOCK_GROUP_RAID10 &&
- sub_stripes != btrfs_raid_array[BTRFS_RAID_RAID10].sub_stripes) ||
- (type & BTRFS_BLOCK_GROUP_RAID1 &&
- num_stripes != btrfs_raid_array[BTRFS_RAID_RAID1].devs_min) ||
- (type & BTRFS_BLOCK_GROUP_RAID1C3 &&
- num_stripes != btrfs_raid_array[BTRFS_RAID_RAID1C3].devs_min) ||
- (type & BTRFS_BLOCK_GROUP_RAID1C4 &&
- num_stripes != btrfs_raid_array[BTRFS_RAID_RAID1C4].devs_min) ||
- (type & BTRFS_BLOCK_GROUP_RAID5 &&
- num_stripes < btrfs_raid_array[BTRFS_RAID_RAID5].devs_min) ||
- (type & BTRFS_BLOCK_GROUP_RAID6 &&
- num_stripes < btrfs_raid_array[BTRFS_RAID_RAID6].devs_min) ||
- (type & BTRFS_BLOCK_GROUP_DUP &&
- num_stripes != btrfs_raid_array[BTRFS_RAID_DUP].dev_stripes) ||
- ((type & BTRFS_BLOCK_GROUP_PROFILE_MASK) == 0 &&
- num_stripes != btrfs_raid_array[BTRFS_RAID_SINGLE].dev_stripes))) {
+ if (!remapped &&
+ !valid_stripe_count(type & BTRFS_BLOCK_GROUP_PROFILE_MASK,
+ num_stripes, sub_stripes)) {
chunk_err(fs_info, leaf, chunk, logical,
"invalid num_stripes:sub_stripes %u:%u for profile %llu",
num_stripes, sub_stripes,
@@ -1003,11 +1026,11 @@ static int check_leaf_chunk_item(struct extent_buffer *leaf,
struct btrfs_fs_info *fs_info = leaf->fs_info;
int num_stripes;
- if (unlikely(btrfs_item_size(leaf, slot) < sizeof(struct btrfs_chunk))) {
+ if (unlikely(btrfs_item_size(leaf, slot) < offsetof(struct btrfs_chunk, stripe))) {
chunk_err(fs_info, leaf, chunk, key->offset,
"invalid chunk item size: have %u expect [%zu, %u)",
btrfs_item_size(leaf, slot),
- sizeof(struct btrfs_chunk),
+ offsetof(struct btrfs_chunk, stripe),
BTRFS_LEAF_DATA_SIZE(fs_info));
return -EUCLEAN;
}
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f4d1527f265e..c95f83305c82 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6145,6 +6145,12 @@ struct btrfs_discard_stripe *btrfs_map_discard(struct btrfs_fs_info *fs_info,
goto out_free_map;
}
+ /* avoid divide by zero on fully-remapped chunks */
+ if (map->num_stripes == 0) {
+ ret = -EOPNOTSUPP;
+ goto out_free_map;
+ }
+
offset = logical - map->start;
length = min_t(u64, map->start + map->chunk_len - logical, length);
*length_ret = length;
@@ -6965,7 +6971,7 @@ u64 btrfs_calc_stripe_length(const struct btrfs_chunk_map *map)
{
const int data_stripes = calc_data_stripes(map->type, map->num_stripes);
- return div_u64(map->chunk_len, data_stripes);
+ return data_stripes ? div_u64(map->chunk_len, data_stripes) : 0;
}
#if BITS_PER_LONG == 32
--
2.49.1
* [PATCH v2 04/16] btrfs: remove remapped block groups from the free-space tree
2025-08-13 14:34 [PATCH v2 00/16] btrfs: remap tree Mark Harmstone
` (2 preceding siblings ...)
2025-08-13 14:34 ` [PATCH v2 03/16] btrfs: allow remapped chunks to have zero stripes Mark Harmstone
@ 2025-08-13 14:34 ` Mark Harmstone
2025-08-13 14:34 ` [PATCH v2 05/16] btrfs: don't add metadata items for the remap tree to the extent tree Mark Harmstone
` (11 subsequent siblings)
15 siblings, 0 replies; 43+ messages in thread
From: Mark Harmstone @ 2025-08-13 14:34 UTC (permalink / raw)
To: linux-btrfs; +Cc: Mark Harmstone
No new allocations can be done from block groups that have the REMAPPED flag
set, so there's no value in their having entries in the free-space tree.
Prevent a search through the free-space tree being scheduled for such a
block group, and prevent discard being run for a fully-remapped block
group.
Signed-off-by: Mark Harmstone <mark@harmstone.com>
---
fs/btrfs/block-group.c | 21 ++++++++++++++++-----
fs/btrfs/discard.c | 9 +++++++++
2 files changed, 25 insertions(+), 5 deletions(-)
diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 9bf282d2453c..4d76d457da9b 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -933,6 +933,13 @@ int btrfs_cache_block_group(struct btrfs_block_group *cache, bool wait)
if (btrfs_is_zoned(fs_info))
return 0;
+ /*
+ * No allocations can be done from remapped block groups, so they have
+ * no entries in the free-space tree.
+ */
+ if (cache->flags & BTRFS_BLOCK_GROUP_REMAPPED)
+ return 0;
+
caching_ctl = kzalloc(sizeof(*caching_ctl), GFP_NOFS);
if (!caching_ctl)
return -ENOMEM;
@@ -1248,9 +1255,11 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
* another task to attempt to create another block group with the same
* item key (and failing with -EEXIST and a transaction abort).
*/
- ret = btrfs_remove_block_group_free_space(trans, block_group);
- if (ret)
- goto out;
+ if (!(block_group->flags & BTRFS_BLOCK_GROUP_REMAPPED)) {
+ ret = btrfs_remove_block_group_free_space(trans, block_group);
+ if (ret)
+ goto out;
+ }
ret = remove_block_group_item(trans, path, block_group);
if (ret < 0)
@@ -2465,10 +2474,12 @@ static int read_one_block_group(struct btrfs_fs_info *info,
if (btrfs_chunk_writeable(info, cache->start)) {
if (cache->used == 0) {
ASSERT(list_empty(&cache->bg_list));
- if (btrfs_test_opt(info, DISCARD_ASYNC))
+ if (btrfs_test_opt(info, DISCARD_ASYNC) &&
+ !(cache->flags & BTRFS_BLOCK_GROUP_REMAPPED)) {
btrfs_discard_queue_work(&info->discard_ctl, cache);
- else
+ } else {
btrfs_mark_bg_unused(cache);
+ }
}
} else {
inc_block_group_ro(cache, 1);
diff --git a/fs/btrfs/discard.c b/fs/btrfs/discard.c
index 89fe85778115..1015a4d37fb2 100644
--- a/fs/btrfs/discard.c
+++ b/fs/btrfs/discard.c
@@ -698,6 +698,15 @@ void btrfs_discard_punt_unused_bgs_list(struct btrfs_fs_info *fs_info)
/* We enabled async discard, so punt all to the queue */
list_for_each_entry_safe(block_group, next, &fs_info->unused_bgs,
bg_list) {
+ /* Fully remapped BGs have nothing to discard */
+ spin_lock(&block_group->lock);
+ if (block_group->flags & BTRFS_BLOCK_GROUP_REMAPPED &&
+ !btrfs_is_block_group_used(block_group)) {
+ spin_unlock(&block_group->lock);
+ continue;
+ }
+ spin_unlock(&block_group->lock);
+
list_del_init(&block_group->bg_list);
btrfs_discard_queue_work(&fs_info->discard_ctl, block_group);
/*
--
2.49.1
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH v2 05/16] btrfs: don't add metadata items for the remap tree to the extent tree
2025-08-13 14:34 [PATCH v2 00/16] btrfs: remap tree Mark Harmstone
` (3 preceding siblings ...)
2025-08-13 14:34 ` [PATCH v2 04/16] btrfs: remove remapped block groups from the free-space tree Mark Harmstone
@ 2025-08-13 14:34 ` Mark Harmstone
2025-08-16 0:06 ` Boris Burkov
2025-08-13 14:34 ` [PATCH v2 06/16] btrfs: add extended version of struct block_group_item Mark Harmstone
` (10 subsequent siblings)
15 siblings, 1 reply; 43+ messages in thread
From: Mark Harmstone @ 2025-08-13 14:34 UTC (permalink / raw)
To: linux-btrfs; +Cc: Mark Harmstone
There is the following potential problem with the remap tree and delayed refs:
* Remapped extent freed in a delayed ref, which removes an entry from the
remap tree
* Remap tree now small enough to fit in a single leaf
* Corruption as we now have a level-0 block with a level-1 metadata item
in the extent tree
One solution to this would be to rework the remap tree code so that it operates
via delayed refs. But as we're hoping to remove cow-only metadata items in the
future anyway, change things so that the remap tree doesn't have any entries in
the extent tree. This also has the benefit of reducing write amplification.
We also make the clear_cache mount option a no-op, as with extent tree
v2, since the free-space tree can no longer be recreated from the extent
tree.
Finally, disable relocating the remap tree itself; this is added back in
a later patch. As things stand we would get corruption, since the
traditional relocation method walks the extent tree and we are removing
the remap tree's metadata items.
Signed-off-by: Mark Harmstone <mark@harmstone.com>
---
fs/btrfs/disk-io.c | 3 +++
fs/btrfs/extent-tree.c | 31 ++++++++++++++++++++++++++++++-
fs/btrfs/volumes.c | 3 +++
3 files changed, 36 insertions(+), 1 deletion(-)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 7e60097b2a96..8e9520119d4f 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3049,6 +3049,9 @@ int btrfs_start_pre_rw_mount(struct btrfs_fs_info *fs_info)
if (btrfs_fs_incompat(fs_info, EXTENT_TREE_V2))
btrfs_warn(fs_info,
"'clear_cache' option is ignored with extent tree v2");
+ else if (btrfs_fs_incompat(fs_info, REMAP_TREE))
+ btrfs_warn(fs_info,
+ "'clear_cache' option is ignored with remap tree");
else
rebuild_free_space_tree = true;
} else if (btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE) &&
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 682d21a73a67..5e038ae1a93f 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -1552,6 +1552,28 @@ static void free_head_ref_squota_rsv(struct btrfs_fs_info *fs_info,
BTRFS_QGROUP_RSV_DATA);
}
+static int drop_remap_tree_ref(struct btrfs_trans_handle *trans,
+ const struct btrfs_delayed_ref_node *node)
+{
+ u64 bytenr = node->bytenr;
+ u64 num_bytes = node->num_bytes;
+ int ret;
+
+ ret = btrfs_add_to_free_space_tree(trans, bytenr, num_bytes);
+ if (ret) {
+ btrfs_abort_transaction(trans, ret);
+ return ret;
+ }
+
+ ret = btrfs_update_block_group(trans, bytenr, num_bytes, false);
+ if (ret) {
+ btrfs_abort_transaction(trans, ret);
+ return ret;
+ }
+
+ return 0;
+}
+
static int run_delayed_data_ref(struct btrfs_trans_handle *trans,
struct btrfs_delayed_ref_head *href,
const struct btrfs_delayed_ref_node *node,
@@ -1746,7 +1768,10 @@ static int run_delayed_tree_ref(struct btrfs_trans_handle *trans,
} else if (node->action == BTRFS_ADD_DELAYED_REF) {
ret = __btrfs_inc_extent_ref(trans, node, extent_op);
} else if (node->action == BTRFS_DROP_DELAYED_REF) {
- ret = __btrfs_free_extent(trans, href, node, extent_op);
+ if (node->ref_root == BTRFS_REMAP_TREE_OBJECTID)
+ ret = drop_remap_tree_ref(trans, node);
+ else
+ ret = __btrfs_free_extent(trans, href, node, extent_op);
} else {
BUG();
}
@@ -4896,6 +4921,9 @@ static int alloc_reserved_tree_block(struct btrfs_trans_handle *trans,
int level = btrfs_delayed_ref_owner(node);
bool skinny_metadata = btrfs_fs_incompat(fs_info, SKINNY_METADATA);
+ if (unlikely(node->ref_root == BTRFS_REMAP_TREE_OBJECTID))
+ goto skip;
+
extent_key.objectid = node->bytenr;
if (skinny_metadata) {
/* The owner of a tree block is the level. */
@@ -4948,6 +4976,7 @@ static int alloc_reserved_tree_block(struct btrfs_trans_handle *trans,
btrfs_free_path(path);
+skip:
return alloc_reserved_extent(trans, node->bytenr, fs_info->nodesize);
}
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index c95f83305c82..678e5d4cd780 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3993,6 +3993,9 @@ static bool should_balance_chunk(struct extent_buffer *leaf, struct btrfs_chunk
struct btrfs_balance_args *bargs = NULL;
u64 chunk_type = btrfs_chunk_type(leaf, chunk);
+ if (chunk_type & BTRFS_BLOCK_GROUP_REMAP)
+ return false;
+
/* type filter */
if (!((chunk_type & BTRFS_BLOCK_GROUP_TYPE_MASK) &
(bctl->flags & BTRFS_BALANCE_TYPE_MASK))) {
--
2.49.1
* [PATCH v2 06/16] btrfs: add extended version of struct block_group_item
2025-08-13 14:34 [PATCH v2 00/16] btrfs: remap tree Mark Harmstone
` (4 preceding siblings ...)
2025-08-13 14:34 ` [PATCH v2 05/16] btrfs: don't add metadata items for the remap tree to the extent tree Mark Harmstone
@ 2025-08-13 14:34 ` Mark Harmstone
2025-08-16 0:08 ` Boris Burkov
2025-08-13 14:34 ` [PATCH v2 07/16] btrfs: allow mounting filesystems with remap-tree incompat flag Mark Harmstone
` (9 subsequent siblings)
15 siblings, 1 reply; 43+ messages in thread
From: Mark Harmstone @ 2025-08-13 14:34 UTC (permalink / raw)
To: linux-btrfs; +Cc: Mark Harmstone
Add a struct btrfs_block_group_item_v2, which is used in the block group
tree if the remap-tree incompat flag is set.
This adds two new fields to the block group item: `remap_bytes` and
`identity_remap_count`.
`remap_bytes` records the amount of data that's physically within this
block group, but nominally in another, remapped block group. This is
necessary because this data will need to be moved first if this block
group is itself relocated. If `remap_bytes` > 0, this is an indicator to
the relocation thread that it will need to search the remap-tree for
backrefs. A block group must also have `remap_bytes` == 0 before it can
be dropped.
`identity_remap_count` records how many identity remap items are located
in the remap tree for this block group. When relocation is begun for
this block group, this is set to the number of holes in the free-space
tree for this range. As identity remaps are converted into actual remaps
by the relocation process, this number is decreased. Once it reaches 0,
either because of relocation or because extents have been deleted, the
block group has been fully remapped and its chunk's device extents are
removed.
Signed-off-by: Mark Harmstone <mark@harmstone.com>
---
fs/btrfs/accessors.h | 20 +++++++
fs/btrfs/block-group.c | 101 ++++++++++++++++++++++++--------
fs/btrfs/block-group.h | 14 ++++-
fs/btrfs/discard.c | 2 +-
fs/btrfs/tree-checker.c | 10 +++-
include/uapi/linux/btrfs_tree.h | 8 +++
6 files changed, 127 insertions(+), 28 deletions(-)
diff --git a/fs/btrfs/accessors.h b/fs/btrfs/accessors.h
index 95a1ca8c099b..0dd161ee6863 100644
--- a/fs/btrfs/accessors.h
+++ b/fs/btrfs/accessors.h
@@ -239,6 +239,26 @@ BTRFS_SETGET_FUNCS(block_group_flags, struct btrfs_block_group_item, flags, 64);
BTRFS_SETGET_STACK_FUNCS(stack_block_group_flags,
struct btrfs_block_group_item, flags, 64);
+/* struct btrfs_block_group_item_v2 */
+BTRFS_SETGET_STACK_FUNCS(stack_block_group_v2_used, struct btrfs_block_group_item_v2,
+ used, 64);
+BTRFS_SETGET_FUNCS(block_group_v2_used, struct btrfs_block_group_item_v2, used, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_block_group_v2_chunk_objectid,
+ struct btrfs_block_group_item_v2, chunk_objectid, 64);
+BTRFS_SETGET_FUNCS(block_group_v2_chunk_objectid,
+ struct btrfs_block_group_item_v2, chunk_objectid, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_block_group_v2_flags,
+ struct btrfs_block_group_item_v2, flags, 64);
+BTRFS_SETGET_FUNCS(block_group_v2_flags, struct btrfs_block_group_item_v2, flags, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_block_group_v2_remap_bytes,
+ struct btrfs_block_group_item_v2, remap_bytes, 64);
+BTRFS_SETGET_FUNCS(block_group_v2_remap_bytes, struct btrfs_block_group_item_v2,
+ remap_bytes, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_block_group_v2_identity_remap_count,
+ struct btrfs_block_group_item_v2, identity_remap_count, 32);
+BTRFS_SETGET_FUNCS(block_group_v2_identity_remap_count, struct btrfs_block_group_item_v2,
+ identity_remap_count, 32);
+
/* struct btrfs_free_space_info */
BTRFS_SETGET_FUNCS(free_space_extent_count, struct btrfs_free_space_info,
extent_count, 32);
diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 4d76d457da9b..bed9c58b6cbc 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -2368,7 +2368,7 @@ static int check_chunk_block_group_mappings(struct btrfs_fs_info *fs_info)
}
static int read_one_block_group(struct btrfs_fs_info *info,
- struct btrfs_block_group_item *bgi,
+ struct btrfs_block_group_item_v2 *bgi,
const struct btrfs_key *key,
int need_clear)
{
@@ -2383,11 +2383,16 @@ static int read_one_block_group(struct btrfs_fs_info *info,
return -ENOMEM;
cache->length = key->offset;
- cache->used = btrfs_stack_block_group_used(bgi);
+ cache->used = btrfs_stack_block_group_v2_used(bgi);
cache->commit_used = cache->used;
- cache->flags = btrfs_stack_block_group_flags(bgi);
- cache->global_root_id = btrfs_stack_block_group_chunk_objectid(bgi);
+ cache->flags = btrfs_stack_block_group_v2_flags(bgi);
+ cache->global_root_id = btrfs_stack_block_group_v2_chunk_objectid(bgi);
cache->space_info = btrfs_find_space_info(info, cache->flags);
+ cache->remap_bytes = btrfs_stack_block_group_v2_remap_bytes(bgi);
+ cache->commit_remap_bytes = cache->remap_bytes;
+ cache->identity_remap_count =
+ btrfs_stack_block_group_v2_identity_remap_count(bgi);
+ cache->commit_identity_remap_count = cache->identity_remap_count;
btrfs_set_free_space_tree_thresholds(cache);
@@ -2452,7 +2457,7 @@ static int read_one_block_group(struct btrfs_fs_info *info,
} else if (cache->length == cache->used) {
cache->cached = BTRFS_CACHE_FINISHED;
btrfs_free_excluded_extents(cache);
- } else if (cache->used == 0) {
+ } else if (cache->used == 0 && cache->remap_bytes == 0) {
cache->cached = BTRFS_CACHE_FINISHED;
ret = btrfs_add_new_free_space(cache, cache->start,
cache->start + cache->length, NULL);
@@ -2472,7 +2477,8 @@ static int read_one_block_group(struct btrfs_fs_info *info,
set_avail_alloc_bits(info, cache->flags);
if (btrfs_chunk_writeable(info, cache->start)) {
- if (cache->used == 0) {
+ if (cache->used == 0 && cache->identity_remap_count == 0 &&
+ cache->remap_bytes == 0) {
ASSERT(list_empty(&cache->bg_list));
if (btrfs_test_opt(info, DISCARD_ASYNC) &&
!(cache->flags & BTRFS_BLOCK_GROUP_REMAPPED)) {
@@ -2578,9 +2584,10 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
need_clear = 1;
while (1) {
- struct btrfs_block_group_item bgi;
+ struct btrfs_block_group_item_v2 bgi;
struct extent_buffer *leaf;
int slot;
+ size_t size;
ret = find_first_block_group(info, path, &key);
if (ret > 0)
@@ -2591,8 +2598,16 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
leaf = path->nodes[0];
slot = path->slots[0];
+ if (btrfs_fs_incompat(info, REMAP_TREE)) {
+ size = sizeof(struct btrfs_block_group_item_v2);
+ } else {
+ size = sizeof(struct btrfs_block_group_item);
+ btrfs_set_stack_block_group_v2_remap_bytes(&bgi, 0);
+ btrfs_set_stack_block_group_v2_identity_remap_count(&bgi, 0);
+ }
+
read_extent_buffer(leaf, &bgi, btrfs_item_ptr_offset(leaf, slot),
- sizeof(bgi));
+ size);
btrfs_item_key_to_cpu(leaf, &key, slot);
btrfs_release_path(path);
@@ -2662,25 +2677,38 @@ static int insert_block_group_item(struct btrfs_trans_handle *trans,
struct btrfs_block_group *block_group)
{
struct btrfs_fs_info *fs_info = trans->fs_info;
- struct btrfs_block_group_item bgi;
+ struct btrfs_block_group_item_v2 bgi;
struct btrfs_root *root = btrfs_block_group_root(fs_info);
struct btrfs_key key;
u64 old_commit_used;
+ size_t size;
int ret;
spin_lock(&block_group->lock);
- btrfs_set_stack_block_group_used(&bgi, block_group->used);
- btrfs_set_stack_block_group_chunk_objectid(&bgi,
- block_group->global_root_id);
- btrfs_set_stack_block_group_flags(&bgi, block_group->flags);
+ btrfs_set_stack_block_group_v2_used(&bgi, block_group->used);
+ btrfs_set_stack_block_group_v2_chunk_objectid(&bgi,
+ block_group->global_root_id);
+ btrfs_set_stack_block_group_v2_flags(&bgi, block_group->flags);
+ btrfs_set_stack_block_group_v2_remap_bytes(&bgi,
+ block_group->remap_bytes);
+ btrfs_set_stack_block_group_v2_identity_remap_count(&bgi,
+ block_group->identity_remap_count);
old_commit_used = block_group->commit_used;
block_group->commit_used = block_group->used;
+ block_group->commit_remap_bytes = block_group->remap_bytes;
+ block_group->commit_identity_remap_count =
+ block_group->identity_remap_count;
key.objectid = block_group->start;
key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
key.offset = block_group->length;
spin_unlock(&block_group->lock);
- ret = btrfs_insert_item(trans, root, &key, &bgi, sizeof(bgi));
+ if (btrfs_fs_incompat(fs_info, REMAP_TREE))
+ size = sizeof(struct btrfs_block_group_item_v2);
+ else
+ size = sizeof(struct btrfs_block_group_item);
+
+ ret = btrfs_insert_item(trans, root, &key, &bgi, size);
if (ret < 0) {
spin_lock(&block_group->lock);
block_group->commit_used = old_commit_used;
@@ -3135,10 +3163,12 @@ static int update_block_group_item(struct btrfs_trans_handle *trans,
struct btrfs_root *root = btrfs_block_group_root(fs_info);
unsigned long bi;
struct extent_buffer *leaf;
- struct btrfs_block_group_item bgi;
+ struct btrfs_block_group_item_v2 bgi;
struct btrfs_key key;
- u64 old_commit_used;
- u64 used;
+ u64 old_commit_used, old_commit_remap_bytes;
+ u32 old_commit_identity_remap_count;
+ u64 used, remap_bytes;
+ u32 identity_remap_count;
/*
* Block group items update can be triggered out of commit transaction
@@ -3148,13 +3178,21 @@ static int update_block_group_item(struct btrfs_trans_handle *trans,
*/
spin_lock(&cache->lock);
old_commit_used = cache->commit_used;
+ old_commit_remap_bytes = cache->commit_remap_bytes;
+ old_commit_identity_remap_count = cache->commit_identity_remap_count;
used = cache->used;
- /* No change in used bytes, can safely skip it. */
- if (cache->commit_used == used) {
+ remap_bytes = cache->remap_bytes;
+ identity_remap_count = cache->identity_remap_count;
+ /* No change in values, can safely skip it. */
+ if (cache->commit_used == used &&
+ cache->commit_remap_bytes == remap_bytes &&
+ cache->commit_identity_remap_count == identity_remap_count) {
spin_unlock(&cache->lock);
return 0;
}
cache->commit_used = used;
+ cache->commit_remap_bytes = remap_bytes;
+ cache->commit_identity_remap_count = identity_remap_count;
spin_unlock(&cache->lock);
key.objectid = cache->start;
@@ -3170,11 +3208,23 @@ static int update_block_group_item(struct btrfs_trans_handle *trans,
leaf = path->nodes[0];
bi = btrfs_item_ptr_offset(leaf, path->slots[0]);
- btrfs_set_stack_block_group_used(&bgi, used);
- btrfs_set_stack_block_group_chunk_objectid(&bgi,
- cache->global_root_id);
- btrfs_set_stack_block_group_flags(&bgi, cache->flags);
- write_extent_buffer(leaf, &bgi, bi, sizeof(bgi));
+ btrfs_set_stack_block_group_v2_used(&bgi, used);
+ btrfs_set_stack_block_group_v2_chunk_objectid(&bgi,
+ cache->global_root_id);
+ btrfs_set_stack_block_group_v2_flags(&bgi, cache->flags);
+
+ if (btrfs_fs_incompat(fs_info, REMAP_TREE)) {
+ btrfs_set_stack_block_group_v2_remap_bytes(&bgi,
+ cache->remap_bytes);
+ btrfs_set_stack_block_group_v2_identity_remap_count(&bgi,
+ cache->identity_remap_count);
+ write_extent_buffer(leaf, &bgi, bi,
+ sizeof(struct btrfs_block_group_item_v2));
+ } else {
+ write_extent_buffer(leaf, &bgi, bi,
+ sizeof(struct btrfs_block_group_item));
+ }
+
fail:
btrfs_release_path(path);
/*
@@ -3189,6 +3239,9 @@ static int update_block_group_item(struct btrfs_trans_handle *trans,
if (ret < 0 && ret != -ENOENT) {
spin_lock(&cache->lock);
cache->commit_used = old_commit_used;
+ cache->commit_remap_bytes = old_commit_remap_bytes;
+ cache->commit_identity_remap_count =
+ old_commit_identity_remap_count;
spin_unlock(&cache->lock);
}
return ret;
diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index a8bb8429c966..ecc89701b2ea 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -129,6 +129,8 @@ struct btrfs_block_group {
u64 flags;
u64 cache_generation;
u64 global_root_id;
+ u64 remap_bytes;
+ u32 identity_remap_count;
/*
* The last committed used bytes of this block group, if the above @used
@@ -136,6 +138,15 @@ struct btrfs_block_group {
* group item of this block group.
*/
u64 commit_used;
+ /*
+ * The last committed remap_bytes value of this block group.
+ */
+ u64 commit_remap_bytes;
+ /*
+ * The last committed identity_remap_count value of this block group.
+ */
+ u32 commit_identity_remap_count;
+
/*
* If the free space extent count exceeds this number, convert the block
* group to bitmaps.
@@ -282,7 +293,8 @@ static inline bool btrfs_is_block_group_used(const struct btrfs_block_group *bg)
{
lockdep_assert_held(&bg->lock);
- return (bg->used > 0 || bg->reserved > 0 || bg->pinned > 0);
+ return (bg->used > 0 || bg->reserved > 0 || bg->pinned > 0 ||
+ bg->remap_bytes > 0);
}
static inline bool btrfs_is_block_group_data_only(const struct btrfs_block_group *block_group)
diff --git a/fs/btrfs/discard.c b/fs/btrfs/discard.c
index 1015a4d37fb2..2b7b1e440bc8 100644
--- a/fs/btrfs/discard.c
+++ b/fs/btrfs/discard.c
@@ -373,7 +373,7 @@ void btrfs_discard_queue_work(struct btrfs_discard_ctl *discard_ctl,
if (!block_group || !btrfs_test_opt(block_group->fs_info, DISCARD_ASYNC))
return;
- if (block_group->used == 0)
+ if (block_group->used == 0 && block_group->remap_bytes == 0)
add_to_discard_unused_list(discard_ctl, block_group);
else
add_to_discard_list(discard_ctl, block_group);
diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
index 20bfe333ffdd..922f7afa024d 100644
--- a/fs/btrfs/tree-checker.c
+++ b/fs/btrfs/tree-checker.c
@@ -687,6 +687,7 @@ static int check_block_group_item(struct extent_buffer *leaf,
u64 chunk_objectid;
u64 flags;
u64 type;
+ size_t exp_size;
/*
* Here we don't really care about alignment since extent allocator can
@@ -698,10 +699,15 @@ static int check_block_group_item(struct extent_buffer *leaf,
return -EUCLEAN;
}
- if (unlikely(item_size != sizeof(bgi))) {
+ if (btrfs_fs_incompat(fs_info, REMAP_TREE))
+ exp_size = sizeof(struct btrfs_block_group_item_v2);
+ else
+ exp_size = sizeof(struct btrfs_block_group_item);
+
+ if (unlikely(item_size != exp_size)) {
block_group_err(leaf, slot,
"invalid item size, have %u expect %zu",
- item_size, sizeof(bgi));
+ item_size, exp_size);
return -EUCLEAN;
}
diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index 9a36f0206d90..500e3a7df90b 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -1229,6 +1229,14 @@ struct btrfs_block_group_item {
__le64 flags;
} __attribute__ ((__packed__));
+struct btrfs_block_group_item_v2 {
+ __le64 used;
+ __le64 chunk_objectid;
+ __le64 flags;
+ __le64 remap_bytes;
+ __le32 identity_remap_count;
+} __attribute__ ((__packed__));
+
struct btrfs_free_space_info {
__le32 extent_count;
__le32 flags;
--
2.49.1
* [PATCH v2 07/16] btrfs: allow mounting filesystems with remap-tree incompat flag
2025-08-13 14:34 [PATCH v2 00/16] btrfs: remap tree Mark Harmstone
` (5 preceding siblings ...)
2025-08-13 14:34 ` [PATCH v2 06/16] btrfs: add extended version of struct block_group_item Mark Harmstone
@ 2025-08-13 14:34 ` Mark Harmstone
2025-08-22 19:14 ` Boris Burkov
2025-08-13 14:34 ` [PATCH v2 08/16] btrfs: redirect I/O for remapped block groups Mark Harmstone
` (8 subsequent siblings)
15 siblings, 1 reply; 43+ messages in thread
From: Mark Harmstone @ 2025-08-13 14:34 UTC (permalink / raw)
To: linux-btrfs; +Cc: Mark Harmstone
If we encounter a filesystem with the remap-tree incompat flag set,
validate its compatibility with the other flags, and load the remap tree
using the values that have been added to the superblock.
The remap-tree feature depends on the free-space tree, but no-holes and
block-group-tree have also been made dependencies to reduce the testing
matrix. Similarly, I'm not aware of any reason why mixed-bg and zoned
would be incompatible with remap-tree, but they are blocked for the time
being until they can be fully tested.
Signed-off-by: Mark Harmstone <mark@harmstone.com>
---
fs/btrfs/Kconfig | 2 +
fs/btrfs/accessors.h | 6 +++
fs/btrfs/disk-io.c | 86 ++++++++++++++++++++++++++++-----
fs/btrfs/extent-tree.c | 2 +
fs/btrfs/fs.h | 4 +-
fs/btrfs/transaction.c | 7 +++
include/uapi/linux/btrfs_tree.h | 5 +-
7 files changed, 97 insertions(+), 15 deletions(-)
diff --git a/fs/btrfs/Kconfig b/fs/btrfs/Kconfig
index ea95c90c8474..598a4af4ce4b 100644
--- a/fs/btrfs/Kconfig
+++ b/fs/btrfs/Kconfig
@@ -116,6 +116,8 @@ config BTRFS_EXPERIMENTAL
- large folio support
+ - remap-tree - logical address remapping tree
+
If unsure, say N.
config BTRFS_FS_REF_VERIFY
diff --git a/fs/btrfs/accessors.h b/fs/btrfs/accessors.h
index 0dd161ee6863..392eaad75e72 100644
--- a/fs/btrfs/accessors.h
+++ b/fs/btrfs/accessors.h
@@ -882,6 +882,12 @@ BTRFS_SETGET_STACK_FUNCS(super_uuid_tree_generation, struct btrfs_super_block,
uuid_tree_generation, 64);
BTRFS_SETGET_STACK_FUNCS(super_nr_global_roots, struct btrfs_super_block,
nr_global_roots, 64);
+BTRFS_SETGET_STACK_FUNCS(super_remap_root, struct btrfs_super_block,
+ remap_root, 64);
+BTRFS_SETGET_STACK_FUNCS(super_remap_root_generation, struct btrfs_super_block,
+ remap_root_generation, 64);
+BTRFS_SETGET_STACK_FUNCS(super_remap_root_level, struct btrfs_super_block,
+ remap_root_level, 8);
/* struct btrfs_file_extent_item */
BTRFS_SETGET_STACK_FUNCS(stack_file_extent_type, struct btrfs_file_extent_item,
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 8e9520119d4f..563aea5e3b1b 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1181,6 +1181,8 @@ static struct btrfs_root *btrfs_get_global_root(struct btrfs_fs_info *fs_info,
return btrfs_grab_root(btrfs_global_root(fs_info, &key));
case BTRFS_RAID_STRIPE_TREE_OBJECTID:
return btrfs_grab_root(fs_info->stripe_root);
+ case BTRFS_REMAP_TREE_OBJECTID:
+ return btrfs_grab_root(fs_info->remap_root);
default:
return NULL;
}
@@ -1271,6 +1273,7 @@ void btrfs_free_fs_info(struct btrfs_fs_info *fs_info)
btrfs_put_root(fs_info->data_reloc_root);
btrfs_put_root(fs_info->block_group_root);
btrfs_put_root(fs_info->stripe_root);
+ btrfs_put_root(fs_info->remap_root);
btrfs_check_leaked_roots(fs_info);
btrfs_extent_buffer_leak_debug_check(fs_info);
kfree(fs_info->super_copy);
@@ -1825,6 +1828,7 @@ static void free_root_pointers(struct btrfs_fs_info *info, bool free_chunk_root)
free_root_extent_buffers(info->data_reloc_root);
free_root_extent_buffers(info->block_group_root);
free_root_extent_buffers(info->stripe_root);
+ free_root_extent_buffers(info->remap_root);
if (free_chunk_root)
free_root_extent_buffers(info->chunk_root);
}
@@ -2256,20 +2260,31 @@ static int btrfs_read_roots(struct btrfs_fs_info *fs_info)
if (ret)
goto out;
- /*
- * This tree can share blocks with some other fs tree during relocation
- * and we need a proper setup by btrfs_get_fs_root
- */
- root = btrfs_get_fs_root(tree_root->fs_info,
- BTRFS_DATA_RELOC_TREE_OBJECTID, true);
- if (IS_ERR(root)) {
- if (!btrfs_test_opt(fs_info, IGNOREBADROOTS)) {
- ret = PTR_ERR(root);
- goto out;
- }
- } else {
+ if (btrfs_fs_incompat(fs_info, REMAP_TREE)) {
+ /* remap_root already loaded in load_important_roots() */
+ root = fs_info->remap_root;
+
set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
- fs_info->data_reloc_root = root;
+
+ root->root_key.objectid = BTRFS_REMAP_TREE_OBJECTID;
+ root->root_key.type = BTRFS_ROOT_ITEM_KEY;
+ root->root_key.offset = 0;
+ } else {
+ /*
+ * This tree can share blocks with some other fs tree during
+ * relocation and we need a proper setup by btrfs_get_fs_root
+ */
+ root = btrfs_get_fs_root(tree_root->fs_info,
+ BTRFS_DATA_RELOC_TREE_OBJECTID, true);
+ if (IS_ERR(root)) {
+ if (!btrfs_test_opt(fs_info, IGNOREBADROOTS)) {
+ ret = PTR_ERR(root);
+ goto out;
+ }
+ } else {
+ set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
+ fs_info->data_reloc_root = root;
+ }
}
location.objectid = BTRFS_QUOTA_TREE_OBJECTID;
@@ -2509,6 +2524,28 @@ int btrfs_validate_super(const struct btrfs_fs_info *fs_info,
ret = -EINVAL;
}
+ /* Ditto for remap_tree */
+ if (btrfs_fs_incompat(fs_info, REMAP_TREE) &&
+ (!btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE_VALID) ||
+ !btrfs_fs_incompat(fs_info, NO_HOLES) ||
+ !btrfs_fs_compat_ro(fs_info, BLOCK_GROUP_TREE))) {
+ btrfs_err(fs_info,
+"remap-tree feature requires free-space-tree, no-holes, and block-group-tree");
+ ret = -EINVAL;
+ }
+
+ if (btrfs_fs_incompat(fs_info, REMAP_TREE) &&
+ btrfs_fs_incompat(fs_info, MIXED_GROUPS)) {
+ btrfs_err(fs_info, "remap-tree not supported with mixed-bg");
+ ret = -EINVAL;
+ }
+
+ if (btrfs_fs_incompat(fs_info, REMAP_TREE) &&
+ btrfs_fs_incompat(fs_info, ZONED)) {
+ btrfs_err(fs_info, "remap-tree not supported with zoned devices");
+ ret = -EINVAL;
+ }
+
/*
* Hint to catch really bogus numbers, bitflips or so, more exact checks are
* done later
@@ -2667,6 +2704,18 @@ static int load_important_roots(struct btrfs_fs_info *fs_info)
btrfs_warn(fs_info, "couldn't read tree root");
return ret;
}
+
+ if (btrfs_fs_incompat(fs_info, REMAP_TREE)) {
+ bytenr = btrfs_super_remap_root(sb);
+ gen = btrfs_super_remap_root_generation(sb);
+ level = btrfs_super_remap_root_level(sb);
+ ret = load_super_root(fs_info->remap_root, bytenr, gen, level);
+ if (ret) {
+ btrfs_warn(fs_info, "couldn't read remap root");
+ return ret;
+ }
+ }
+
return 0;
}
@@ -3278,6 +3327,7 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
struct btrfs_fs_info *fs_info = btrfs_sb(sb);
struct btrfs_root *tree_root;
struct btrfs_root *chunk_root;
+ struct btrfs_root *remap_root;
int ret;
int level;
@@ -3312,6 +3362,16 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
goto fail_alloc;
}
+ if (btrfs_super_incompat_flags(disk_super) & BTRFS_FEATURE_INCOMPAT_REMAP_TREE) {
+ remap_root = btrfs_alloc_root(fs_info, BTRFS_REMAP_TREE_OBJECTID,
+ GFP_KERNEL);
+ fs_info->remap_root = remap_root;
+ if (!remap_root) {
+ ret = -ENOMEM;
+ goto fail_alloc;
+ }
+ }
+
btrfs_info(fs_info, "first mount of filesystem %pU", disk_super->fsid);
/*
* Verify the type first, if that or the checksum value are
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 5e038ae1a93f..c1b96c728fe6 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2589,6 +2589,8 @@ static u64 get_alloc_profile_by_root(struct btrfs_root *root, int data)
flags = BTRFS_BLOCK_GROUP_DATA;
else if (root == fs_info->chunk_root)
flags = BTRFS_BLOCK_GROUP_SYSTEM;
+ else if (root == fs_info->remap_root)
+ flags = BTRFS_BLOCK_GROUP_REMAP;
else
flags = BTRFS_BLOCK_GROUP_METADATA;
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index 9ce75843b578..6ea96e76655e 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -288,7 +288,8 @@ enum {
#define BTRFS_FEATURE_INCOMPAT_SUPP \
(BTRFS_FEATURE_INCOMPAT_SUPP_STABLE | \
BTRFS_FEATURE_INCOMPAT_RAID_STRIPE_TREE | \
- BTRFS_FEATURE_INCOMPAT_EXTENT_TREE_V2)
+ BTRFS_FEATURE_INCOMPAT_EXTENT_TREE_V2 | \
+ BTRFS_FEATURE_INCOMPAT_REMAP_TREE)
#else
@@ -438,6 +439,7 @@ struct btrfs_fs_info {
struct btrfs_root *data_reloc_root;
struct btrfs_root *block_group_root;
struct btrfs_root *stripe_root;
+ struct btrfs_root *remap_root;
/* The log root tree is a directory of all the other log roots */
struct btrfs_root *log_root_tree;
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index c5c0d9cf1a80..64b9c427af6a 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1953,6 +1953,13 @@ static void update_super_roots(struct btrfs_fs_info *fs_info)
super->cache_generation = 0;
if (test_bit(BTRFS_FS_UPDATE_UUID_TREE_GEN, &fs_info->flags))
super->uuid_tree_generation = root_item->generation;
+
+ if (btrfs_fs_incompat(fs_info, REMAP_TREE)) {
+ root_item = &fs_info->remap_root->root_item;
+ super->remap_root = root_item->bytenr;
+ super->remap_root_generation = root_item->generation;
+ super->remap_root_level = root_item->level;
+ }
}
int btrfs_transaction_blocked(struct btrfs_fs_info *info)
diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index 500e3a7df90b..89bcb80081a6 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -721,9 +721,12 @@ struct btrfs_super_block {
__u8 metadata_uuid[BTRFS_FSID_SIZE];
__u64 nr_global_roots;
+ __le64 remap_root;
+ __le64 remap_root_generation;
+ __u8 remap_root_level;
/* Future expansion */
- __le64 reserved[27];
+ __u8 reserved[199];
__u8 sys_chunk_array[BTRFS_SYSTEM_CHUNK_ARRAY_SIZE];
struct btrfs_root_backup super_roots[BTRFS_NUM_BACKUP_ROOTS];
--
2.49.1
* [PATCH v2 08/16] btrfs: redirect I/O for remapped block groups
2025-08-13 14:34 [PATCH v2 00/16] btrfs: remap tree Mark Harmstone
` (6 preceding siblings ...)
2025-08-13 14:34 ` [PATCH v2 07/16] btrfs: allow mounting filesystems with remap-tree incompat flag Mark Harmstone
@ 2025-08-13 14:34 ` Mark Harmstone
2025-08-22 19:42 ` Boris Burkov
2025-08-13 14:34 ` [PATCH v2 09/16] btrfs: release BG lock before calling btrfs_link_bg_list() Mark Harmstone
` (7 subsequent siblings)
15 siblings, 1 reply; 43+ messages in thread
From: Mark Harmstone @ 2025-08-13 14:34 UTC (permalink / raw)
To: linux-btrfs; +Cc: Mark Harmstone
Change btrfs_map_block() so that if the block group has the REMAPPED
flag set, we call btrfs_translate_remap() to obtain a new address.
btrfs_translate_remap() searches the remap tree for a range
corresponding to the logical address passed to btrfs_map_block(). If it
is within an identity remap, this part of the block group hasn't yet
been relocated, and so we use the existing address.
If it is within an actual remap, we subtract the start of the remap
range and add the address of its destination, contained in the item's
payload.
Signed-off-by: Mark Harmstone <mark@harmstone.com>
---
fs/btrfs/relocation.c | 59 +++++++++++++++++++++++++++++++++++++++++++
fs/btrfs/relocation.h | 2 ++
fs/btrfs/volumes.c | 31 +++++++++++++++++++++++
3 files changed, 92 insertions(+)
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 7256f6748c8f..e1f1da9336e7 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -3884,6 +3884,65 @@ static const char *stage_to_string(enum reloc_stage stage)
return "unknown";
}
+int btrfs_translate_remap(struct btrfs_fs_info *fs_info, u64 *logical,
+ u64 *length, bool nolock)
+{
+ int ret;
+ struct btrfs_key key, found_key;
+ struct extent_buffer *leaf;
+ struct btrfs_remap *remap;
+ BTRFS_PATH_AUTO_FREE(path);
+
+ path = btrfs_alloc_path();
+ if (!path)
+ return -ENOMEM;
+
+ if (nolock) {
+ path->search_commit_root = 1;
+ path->skip_locking = 1;
+ }
+
+ key.objectid = *logical;
+ key.type = (u8)-1;
+ key.offset = (u64)-1;
+
+ ret = btrfs_search_slot(NULL, fs_info->remap_root, &key, path,
+ 0, 0);
+ if (ret < 0)
+ return ret;
+
+ leaf = path->nodes[0];
+
+ if (path->slots[0] == 0)
+ return -ENOENT;
+
+ path->slots[0]--;
+
+ btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
+
+ if (found_key.type != BTRFS_REMAP_KEY &&
+ found_key.type != BTRFS_IDENTITY_REMAP_KEY) {
+ return -ENOENT;
+ }
+
+ if (found_key.objectid > *logical ||
+ found_key.objectid + found_key.offset <= *logical) {
+ return -ENOENT;
+ }
+
+ if (*logical + *length > found_key.objectid + found_key.offset)
+ *length = found_key.objectid + found_key.offset - *logical;
+
+ if (found_key.type == BTRFS_IDENTITY_REMAP_KEY)
+ return 0;
+
+ remap = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_remap);
+
+ *logical = *logical - found_key.objectid + btrfs_remap_address(leaf, remap);
+
+ return 0;
+}
+
/*
* function to relocate all extents in a block group.
*/
diff --git a/fs/btrfs/relocation.h b/fs/btrfs/relocation.h
index 5c36b3f84b57..a653c42a25a3 100644
--- a/fs/btrfs/relocation.h
+++ b/fs/btrfs/relocation.h
@@ -31,5 +31,7 @@ int btrfs_should_cancel_balance(const struct btrfs_fs_info *fs_info);
struct btrfs_root *find_reloc_root(struct btrfs_fs_info *fs_info, u64 bytenr);
bool btrfs_should_ignore_reloc_root(const struct btrfs_root *root);
u64 btrfs_get_reloc_bg_bytenr(const struct btrfs_fs_info *fs_info);
+int btrfs_translate_remap(struct btrfs_fs_info *fs_info, u64 *logical,
+ u64 *length, bool nolock);
#endif
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 678e5d4cd780..a2c49cb8bfc6 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6635,6 +6635,37 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
if (IS_ERR(map))
return PTR_ERR(map);
+ if (map->type & BTRFS_BLOCK_GROUP_REMAPPED) {
+ u64 new_logical = logical;
+ bool nolock = !(map->type & BTRFS_BLOCK_GROUP_DATA);
+
+ /*
+ * We use search_commit_root in btrfs_translate_remap for
+ * metadata blocks, to avoid lockdep complaining about
+ * recursive locking.
+ * If we get -ENOENT this means this is a BG that has just had
+ * its REMAPPED flag set, and so nothing has yet been actually
+ * remapped.
+ */
+ ret = btrfs_translate_remap(fs_info, &new_logical, length,
+ nolock);
+ if (ret && (!nolock || ret != -ENOENT))
+ return ret;
+
+ if (ret != -ENOENT && new_logical != logical) {
+ btrfs_free_chunk_map(map);
+
+ map = btrfs_get_chunk_map(fs_info, new_logical,
+ *length);
+ if (IS_ERR(map))
+ return PTR_ERR(map);
+
+ logical = new_logical;
+ }
+
+ ret = 0;
+ }
+
num_copies = btrfs_chunk_map_num_copies(map);
if (io_geom.mirror_num > num_copies)
return -EINVAL;
--
2.49.1
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH v2 09/16] btrfs: release BG lock before calling btrfs_link_bg_list()
2025-08-13 14:34 [PATCH v2 00/16] btrfs: remap tree Mark Harmstone
` (7 preceding siblings ...)
2025-08-13 14:34 ` [PATCH v2 08/16] btrfs: redirect I/O for remapped block groups Mark Harmstone
@ 2025-08-13 14:34 ` Mark Harmstone
2025-08-16 0:32 ` Boris Burkov
2025-08-13 14:34 ` [PATCH v2 10/16] btrfs: handle deletions from remapped block group Mark Harmstone
` (6 subsequent siblings)
15 siblings, 1 reply; 43+ messages in thread
From: Mark Harmstone @ 2025-08-13 14:34 UTC (permalink / raw)
To: linux-btrfs; +Cc: Mark Harmstone
Release block_group->lock before calling btrfs_link_bg_list() in
btrfs_delete_unused_bgs(), as holding it there triggered lockdep warnings.
This lock isn't held anywhere else that btrfs_link_bg_list() is called: the
block group lists are manipulated under fs_info->unused_bgs_lock, not
block_group->lock.
Signed-off-by: Mark Harmstone <mark@harmstone.com>
---
fs/btrfs/block-group.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index bed9c58b6cbc..8c28f829547e 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1620,6 +1620,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
if ((space_info->total_bytes - block_group->length < used &&
block_group->zone_unusable < block_group->length) ||
has_unwritten_metadata(block_group)) {
+ spin_unlock(&block_group->lock);
+
/*
* Add a reference for the list, compensate for the ref
* drop under the "next" label for the
@@ -1628,7 +1630,6 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
btrfs_link_bg_list(block_group, &retry_list);
trace_btrfs_skip_unused_block_group(block_group);
- spin_unlock(&block_group->lock);
spin_unlock(&space_info->lock);
up_write(&space_info->groups_sem);
goto next;
--
2.49.1
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH v2 10/16] btrfs: handle deletions from remapped block group
2025-08-13 14:34 [PATCH v2 00/16] btrfs: remap tree Mark Harmstone
` (8 preceding siblings ...)
2025-08-13 14:34 ` [PATCH v2 09/16] btrfs: release BG lock before calling btrfs_link_bg_list() Mark Harmstone
@ 2025-08-13 14:34 ` Mark Harmstone
2025-08-16 0:28 ` Boris Burkov
2025-08-13 14:34 ` [PATCH v2 11/16] btrfs: handle setting up relocation of block group with remap-tree Mark Harmstone
` (5 subsequent siblings)
15 siblings, 1 reply; 43+ messages in thread
From: Mark Harmstone @ 2025-08-13 14:34 UTC (permalink / raw)
To: linux-btrfs; +Cc: Mark Harmstone
Handle the case where we free an extent from a block group that has the
REMAPPED flag set. Because the remap tree is orthogonal to the extent
tree, for data this may be within any number of identity remaps or
actual remaps. If we're freeing a metadata node, this will be wholly
inside one or the other.
btrfs_remove_extent_from_remap_tree() searches the remap tree for the
remaps that cover the range in question, then calls
remove_range_from_remap_tree() for each one, to punch a hole in the
remap and adjust the free-space tree.
For an identity remap, remove_range_from_remap_tree() also adjusts the
block group's `identity_remap_count` whenever the number of identity
remaps changes. If the count reaches zero the chunk is now fully
remapped, so we call last_identity_remap_gone(), which removes the
chunk's stripes and device extents.
The changes involving the block group's ro flag are needed because the
REMAPPED flag itself already prevents any new allocations from the
block group, so we don't need to account for that separately by
marking it read-only.
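Punching the freed range out of a remap reduces to four overlap cases: remove the item entirely, shorten it from the front, split it in the middle, or shorten it from the back. A minimal sketch of the case selection and the overlap length returned to the caller (`punch_remap()` and the enum are illustrative; the real function also rewrites keys, backrefs, and the free-space tree):

```c
#include <assert.h>
#include <stdint.h>

enum remap_punch {
	PUNCH_ALL,      /* delete the item */
	PUNCH_HEAD,     /* move the key forward */
	PUNCH_MIDDLE,   /* shorten + insert a second item */
	PUNCH_TAIL,     /* shorten the item */
};

/*
 * Classify how the freed range [bytenr, bytenr + num_bytes) overlaps
 * the remap [remap_start, remap_start + remap_length), which is known
 * to contain bytenr. Returns the number of bytes consumed from the
 * freed range, so the caller can advance and look up the next remap.
 */
static uint64_t punch_remap(uint64_t remap_start, uint64_t remap_length,
			    uint64_t bytenr, uint64_t num_bytes,
			    enum remap_punch *how)
{
	uint64_t end = bytenr + num_bytes;

	if (bytenr == remap_start && num_bytes >= remap_length) {
		*how = PUNCH_ALL;
		return remap_length;
	}
	if (bytenr == remap_start) {
		*how = PUNCH_HEAD;
		return num_bytes;
	}
	if (end < remap_start + remap_length) {
		*how = PUNCH_MIDDLE;
		return num_bytes;
	}
	*how = PUNCH_TAIL;
	return remap_start + remap_length - bytenr;
}
```

In the tail case the returned length is shorter than num_bytes, which is why btrfs_remove_extent_from_remap_tree() loops, advancing bytenr by the consumed length until the whole freed range has been processed.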
Signed-off-by: Mark Harmstone <mark@harmstone.com>
---
fs/btrfs/block-group.c | 82 ++++---
fs/btrfs/block-group.h | 1 +
fs/btrfs/disk-io.c | 1 +
fs/btrfs/extent-tree.c | 28 ++-
fs/btrfs/fs.h | 1 +
fs/btrfs/relocation.c | 510 +++++++++++++++++++++++++++++++++++++++++
fs/btrfs/relocation.h | 3 +
fs/btrfs/volumes.c | 56 +++--
fs/btrfs/volumes.h | 6 +
9 files changed, 630 insertions(+), 58 deletions(-)
diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 8c28f829547e..7a0524138235 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1068,6 +1068,32 @@ static int remove_block_group_item(struct btrfs_trans_handle *trans,
return ret;
}
+void btrfs_remove_bg_from_sinfo(struct btrfs_block_group *block_group)
+{
+ int factor = btrfs_bg_type_to_factor(block_group->flags);
+
+ spin_lock(&block_group->space_info->lock);
+
+ if (btrfs_test_opt(block_group->fs_info, ENOSPC_DEBUG)) {
+ WARN_ON(block_group->space_info->total_bytes
+ < block_group->length);
+ WARN_ON(block_group->space_info->bytes_readonly
+ < block_group->length - block_group->zone_unusable);
+ WARN_ON(block_group->space_info->bytes_zone_unusable
+ < block_group->zone_unusable);
+ WARN_ON(block_group->space_info->disk_total
+ < block_group->length * factor);
+ }
+ block_group->space_info->total_bytes -= block_group->length;
+ block_group->space_info->bytes_readonly -=
+ (block_group->length - block_group->zone_unusable);
+ btrfs_space_info_update_bytes_zone_unusable(block_group->space_info,
+ -block_group->zone_unusable);
+ block_group->space_info->disk_total -= block_group->length * factor;
+
+ spin_unlock(&block_group->space_info->lock);
+}
+
int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
struct btrfs_chunk_map *map)
{
@@ -1079,7 +1105,6 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
struct kobject *kobj = NULL;
int ret;
int index;
- int factor;
struct btrfs_caching_control *caching_ctl = NULL;
bool remove_map;
bool remove_rsv = false;
@@ -1088,7 +1113,7 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
if (!block_group)
return -ENOENT;
- BUG_ON(!block_group->ro);
+ BUG_ON(!block_group->ro && !(block_group->flags & BTRFS_BLOCK_GROUP_REMAPPED));
trace_btrfs_remove_block_group(block_group);
/*
@@ -1100,7 +1125,6 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
block_group->length);
index = btrfs_bg_flags_to_raid_index(block_group->flags);
- factor = btrfs_bg_type_to_factor(block_group->flags);
/* make sure this block group isn't part of an allocation cluster */
cluster = &fs_info->data_alloc_cluster;
@@ -1224,26 +1248,11 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
spin_lock(&block_group->space_info->lock);
list_del_init(&block_group->ro_list);
-
- if (btrfs_test_opt(fs_info, ENOSPC_DEBUG)) {
- WARN_ON(block_group->space_info->total_bytes
- < block_group->length);
- WARN_ON(block_group->space_info->bytes_readonly
- < block_group->length - block_group->zone_unusable);
- WARN_ON(block_group->space_info->bytes_zone_unusable
- < block_group->zone_unusable);
- WARN_ON(block_group->space_info->disk_total
- < block_group->length * factor);
- }
- block_group->space_info->total_bytes -= block_group->length;
- block_group->space_info->bytes_readonly -=
- (block_group->length - block_group->zone_unusable);
- btrfs_space_info_update_bytes_zone_unusable(block_group->space_info,
- -block_group->zone_unusable);
- block_group->space_info->disk_total -= block_group->length * factor;
-
spin_unlock(&block_group->space_info->lock);
+ if (!(block_group->flags & BTRFS_BLOCK_GROUP_REMAPPED))
+ btrfs_remove_bg_from_sinfo(block_group);
+
/*
* Remove the free space for the block group from the free space tree
* and the block group's item from the extent tree before marking the
@@ -1539,6 +1548,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
while (!list_empty(&fs_info->unused_bgs)) {
u64 used;
int trimming;
+ bool made_ro = false;
block_group = list_first_entry(&fs_info->unused_bgs,
struct btrfs_block_group,
@@ -1575,7 +1585,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
spin_lock(&space_info->lock);
spin_lock(&block_group->lock);
- if (btrfs_is_block_group_used(block_group) || block_group->ro ||
+ if (btrfs_is_block_group_used(block_group) ||
+ (block_group->ro && !(block_group->flags & BTRFS_BLOCK_GROUP_REMAPPED)) ||
list_is_singular(&block_group->list)) {
/*
* We want to bail if we made new allocations or have
@@ -1617,9 +1628,10 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
* needing to allocate extents from the block group.
*/
used = btrfs_space_info_used(space_info, true);
- if ((space_info->total_bytes - block_group->length < used &&
+ if (((space_info->total_bytes - block_group->length < used &&
block_group->zone_unusable < block_group->length) ||
- has_unwritten_metadata(block_group)) {
+ has_unwritten_metadata(block_group)) &&
+ !(block_group->flags & BTRFS_BLOCK_GROUP_REMAPPED)) {
spin_unlock(&block_group->lock);
/*
@@ -1638,8 +1650,14 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
spin_unlock(&block_group->lock);
spin_unlock(&space_info->lock);
- /* We don't want to force the issue, only flip if it's ok. */
- ret = inc_block_group_ro(block_group, 0);
+ if (!(block_group->flags & BTRFS_BLOCK_GROUP_REMAPPED)) {
+ /* We don't want to force the issue, only flip if it's ok. */
+ ret = inc_block_group_ro(block_group, 0);
+ made_ro = true;
+ } else {
+ ret = 0;
+ }
+
up_write(&space_info->groups_sem);
if (ret < 0) {
ret = 0;
@@ -1648,7 +1666,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
ret = btrfs_zone_finish(block_group);
if (ret < 0) {
- btrfs_dec_block_group_ro(block_group);
+ if (made_ro)
+ btrfs_dec_block_group_ro(block_group);
if (ret == -EAGAIN) {
btrfs_link_bg_list(block_group, &retry_list);
ret = 0;
@@ -1663,7 +1682,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
trans = btrfs_start_trans_remove_block_group(fs_info,
block_group->start);
if (IS_ERR(trans)) {
- btrfs_dec_block_group_ro(block_group);
+ if (made_ro)
+ btrfs_dec_block_group_ro(block_group);
ret = PTR_ERR(trans);
goto next;
}
@@ -1673,7 +1693,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
* just delete them, we don't care about them anymore.
*/
if (!clean_pinned_extents(trans, block_group)) {
- btrfs_dec_block_group_ro(block_group);
+ if (made_ro)
+ btrfs_dec_block_group_ro(block_group);
goto end_trans;
}
@@ -1687,7 +1708,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
spin_lock(&fs_info->discard_ctl.lock);
if (!list_empty(&block_group->discard_list)) {
spin_unlock(&fs_info->discard_ctl.lock);
- btrfs_dec_block_group_ro(block_group);
+ if (made_ro)
+ btrfs_dec_block_group_ro(block_group);
btrfs_discard_queue_work(&fs_info->discard_ctl,
block_group);
goto end_trans;
diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index ecc89701b2ea..0433b0127ed8 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -336,6 +336,7 @@ int btrfs_add_new_free_space(struct btrfs_block_group *block_group,
struct btrfs_trans_handle *btrfs_start_trans_remove_block_group(
struct btrfs_fs_info *fs_info,
const u64 chunk_offset);
+void btrfs_remove_bg_from_sinfo(struct btrfs_block_group *block_group);
int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
struct btrfs_chunk_map *map);
void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 563aea5e3b1b..d92d08316322 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2907,6 +2907,7 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
mutex_init(&fs_info->chunk_mutex);
mutex_init(&fs_info->transaction_kthread_mutex);
mutex_init(&fs_info->cleaner_mutex);
+ mutex_init(&fs_info->remap_mutex);
mutex_init(&fs_info->ro_block_group_mutex);
init_rwsem(&fs_info->commit_root_sem);
init_rwsem(&fs_info->cleanup_work_sem);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index c1b96c728fe6..ca3f6d6bb5ba 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -40,6 +40,7 @@
#include "orphan.h"
#include "tree-checker.h"
#include "raid-stripe-tree.h"
+#include "relocation.h"
#undef SCRAMBLE_DELAYED_REFS
@@ -2999,7 +3000,8 @@ u64 btrfs_get_extent_owner_root(struct btrfs_fs_info *fs_info,
}
static int do_free_extent_accounting(struct btrfs_trans_handle *trans,
- u64 bytenr, struct btrfs_squota_delta *delta)
+ u64 bytenr, struct btrfs_squota_delta *delta,
+ bool remapped)
{
int ret;
u64 num_bytes = delta->num_bytes;
@@ -3027,10 +3029,16 @@ static int do_free_extent_accounting(struct btrfs_trans_handle *trans,
return ret;
}
- ret = btrfs_add_to_free_space_tree(trans, bytenr, num_bytes);
- if (ret) {
- btrfs_abort_transaction(trans, ret);
- return ret;
+ /*
+ * If remapped, FST has already been taken care of in
+ * remove_range_from_remap_tree().
+ */
+ if (!remapped) {
+ ret = btrfs_add_to_free_space_tree(trans, bytenr, num_bytes);
+ if (ret) {
+ btrfs_abort_transaction(trans, ret);
+ return ret;
+ }
}
ret = btrfs_update_block_group(trans, bytenr, num_bytes, false);
@@ -3396,7 +3404,15 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
}
btrfs_release_path(path);
- ret = do_free_extent_accounting(trans, bytenr, &delta);
+ /* returns 1 on success and 0 on no-op */
+ ret = btrfs_remove_extent_from_remap_tree(trans, path, bytenr,
+ num_bytes);
+ if (ret < 0) {
+ btrfs_abort_transaction(trans, ret);
+ goto out;
+ }
+
+ ret = do_free_extent_accounting(trans, bytenr, &delta, ret);
}
btrfs_release_path(path);
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index 6ea96e76655e..dbb7de95241b 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -547,6 +547,7 @@ struct btrfs_fs_info {
struct mutex transaction_kthread_mutex;
struct mutex cleaner_mutex;
struct mutex chunk_mutex;
+ struct mutex remap_mutex;
/*
* This is taken to make sure we don't set block groups ro after the
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index e1f1da9336e7..03a1246af678 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -37,6 +37,7 @@
#include "super.h"
#include "tree-checker.h"
#include "raid-stripe-tree.h"
+#include "free-space-tree.h"
/*
* Relocation overview
@@ -3884,6 +3885,148 @@ static const char *stage_to_string(enum reloc_stage stage)
return "unknown";
}
+static void adjust_block_group_remap_bytes(struct btrfs_trans_handle *trans,
+ struct btrfs_block_group *bg,
+ s64 diff)
+{
+ struct btrfs_fs_info *fs_info = trans->fs_info;
+ bool bg_already_dirty = true;
+
+ bg->remap_bytes += diff;
+
+ if (bg->used == 0 && bg->remap_bytes == 0)
+ btrfs_mark_bg_unused(bg);
+
+ spin_lock(&trans->transaction->dirty_bgs_lock);
+ if (list_empty(&bg->dirty_list)) {
+ list_add_tail(&bg->dirty_list, &trans->transaction->dirty_bgs);
+ bg_already_dirty = false;
+ btrfs_get_block_group(bg);
+ }
+ spin_unlock(&trans->transaction->dirty_bgs_lock);
+
+ /* Modified block groups are accounted for in the delayed_refs_rsv. */
+ if (!bg_already_dirty)
+ btrfs_inc_delayed_refs_rsv_bg_updates(fs_info);
+}
+
+static int remove_chunk_stripes(struct btrfs_trans_handle *trans,
+ struct btrfs_chunk_map *chunk,
+ struct btrfs_path *path)
+{
+ struct btrfs_fs_info *fs_info = trans->fs_info;
+ struct btrfs_key key;
+ struct extent_buffer *leaf;
+ struct btrfs_chunk *c;
+ int ret;
+
+ key.objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID;
+ key.type = BTRFS_CHUNK_ITEM_KEY;
+ key.offset = chunk->start;
+
+ ret = btrfs_search_slot(trans, fs_info->chunk_root, &key, path,
+ 0, 1);
+ if (ret) {
+ if (ret == 1) {
+ btrfs_release_path(path);
+ ret = -ENOENT;
+ }
+ return ret;
+ }
+
+ leaf = path->nodes[0];
+
+ c = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_chunk);
+ btrfs_set_chunk_num_stripes(leaf, c, 0);
+
+ btrfs_truncate_item(trans, path, offsetof(struct btrfs_chunk, stripe),
+ 1);
+
+ btrfs_mark_buffer_dirty(trans, leaf);
+
+ btrfs_release_path(path);
+
+ return 0;
+}
+
+static int last_identity_remap_gone(struct btrfs_trans_handle *trans,
+ struct btrfs_chunk_map *chunk,
+ struct btrfs_block_group *bg,
+ struct btrfs_path *path)
+{
+ int ret;
+
+ ret = btrfs_remove_dev_extents(trans, chunk);
+ if (ret)
+ return ret;
+
+ mutex_lock(&trans->fs_info->chunk_mutex);
+
+ for (unsigned int i = 0; i < chunk->num_stripes; i++) {
+ ret = btrfs_update_device(trans, chunk->stripes[i].dev);
+ if (ret) {
+ mutex_unlock(&trans->fs_info->chunk_mutex);
+ return ret;
+ }
+ }
+
+ mutex_unlock(&trans->fs_info->chunk_mutex);
+
+ write_lock(&trans->fs_info->mapping_tree_lock);
+ btrfs_chunk_map_device_clear_bits(chunk, CHUNK_ALLOCATED);
+ write_unlock(&trans->fs_info->mapping_tree_lock);
+
+ btrfs_remove_bg_from_sinfo(bg);
+
+ ret = remove_chunk_stripes(trans, chunk, path);
+ if (ret)
+ return ret;
+
+ return 0;
+}
+
+static int adjust_identity_remap_count(struct btrfs_trans_handle *trans,
+ struct btrfs_path *path,
+ struct btrfs_block_group *bg, int delta)
+{
+ struct btrfs_fs_info *fs_info = trans->fs_info;
+ struct btrfs_chunk_map *chunk;
+ bool bg_already_dirty = true;
+ int ret;
+
+ WARN_ON(delta < 0 && -delta > bg->identity_remap_count);
+
+ bg->identity_remap_count += delta;
+
+ spin_lock(&trans->transaction->dirty_bgs_lock);
+ if (list_empty(&bg->dirty_list)) {
+ list_add_tail(&bg->dirty_list, &trans->transaction->dirty_bgs);
+ bg_already_dirty = false;
+ btrfs_get_block_group(bg);
+ }
+ spin_unlock(&trans->transaction->dirty_bgs_lock);
+
+ /* Modified block groups are accounted for in the delayed_refs_rsv. */
+ if (!bg_already_dirty)
+ btrfs_inc_delayed_refs_rsv_bg_updates(fs_info);
+
+ if (bg->identity_remap_count != 0)
+ return 0;
+
+ chunk = btrfs_find_chunk_map(fs_info, bg->start, 1);
+ if (!chunk)
+ return -ENOENT;
+
+ ret = last_identity_remap_gone(trans, chunk, bg, path);
+ if (ret)
+ goto end;
+
+ ret = 0;
+end:
+ btrfs_free_chunk_map(chunk);
+ return ret;
+}
+
int btrfs_translate_remap(struct btrfs_fs_info *fs_info, u64 *logical,
u64 *length, bool nolock)
{
@@ -4504,3 +4647,370 @@ u64 btrfs_get_reloc_bg_bytenr(const struct btrfs_fs_info *fs_info)
logical = fs_info->reloc_ctl->block_group->start;
return logical;
}
+
+static int remove_range_from_remap_tree(struct btrfs_trans_handle *trans,
+ struct btrfs_path *path,
+ struct btrfs_block_group *bg,
+ u64 bytenr, u64 num_bytes)
+{
+ int ret;
+ struct btrfs_fs_info *fs_info = trans->fs_info;
+ struct extent_buffer *leaf = path->nodes[0];
+ struct btrfs_key key, new_key;
+ struct btrfs_remap *remap_ptr = NULL, remap;
+ struct btrfs_block_group *dest_bg = NULL;
+ u64 end, new_addr = 0, remap_start, remap_length, overlap_length;
+ bool is_identity_remap;
+
+ end = bytenr + num_bytes;
+
+ btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
+
+ is_identity_remap = key.type == BTRFS_IDENTITY_REMAP_KEY;
+
+ remap_start = key.objectid;
+ remap_length = key.offset;
+
+ if (!is_identity_remap) {
+ remap_ptr = btrfs_item_ptr(leaf, path->slots[0],
+ struct btrfs_remap);
+ new_addr = btrfs_remap_address(leaf, remap_ptr);
+
+ dest_bg = btrfs_lookup_block_group(fs_info, new_addr);
+ }
+
+ if (bytenr == remap_start && num_bytes >= remap_length) {
+ /* Remove entirely. */
+
+ ret = btrfs_del_item(trans, fs_info->remap_root, path);
+ if (ret)
+ goto end;
+
+ btrfs_release_path(path);
+
+ overlap_length = remap_length;
+
+ if (!is_identity_remap) {
+ /* Remove backref. */
+
+ key.objectid = new_addr;
+ key.type = BTRFS_REMAP_BACKREF_KEY;
+ key.offset = remap_length;
+
+ ret = btrfs_search_slot(trans, fs_info->remap_root,
+ &key, path, -1, 1);
+ if (ret) {
+ if (ret == 1) {
+ btrfs_release_path(path);
+ ret = -ENOENT;
+ }
+ goto end;
+ }
+
+ ret = btrfs_del_item(trans, fs_info->remap_root, path);
+
+ btrfs_release_path(path);
+
+ if (ret)
+ goto end;
+
+ adjust_block_group_remap_bytes(trans, dest_bg,
+ -remap_length);
+ } else {
+ ret = adjust_identity_remap_count(trans, path, bg, -1);
+ if (ret)
+ goto end;
+ }
+ } else if (bytenr == remap_start) {
+ /* Remove beginning. */
+
+ new_key.objectid = end;
+ new_key.type = key.type;
+ new_key.offset = remap_length + remap_start - end;
+
+ btrfs_set_item_key_safe(trans, path, &new_key);
+ btrfs_mark_buffer_dirty(trans, leaf);
+
+ overlap_length = num_bytes;
+
+ if (!is_identity_remap) {
+ btrfs_set_remap_address(leaf, remap_ptr,
+ new_addr + end - remap_start);
+ btrfs_release_path(path);
+
+ /* Adjust backref. */
+
+ key.objectid = new_addr;
+ key.type = BTRFS_REMAP_BACKREF_KEY;
+ key.offset = remap_length;
+
+ ret = btrfs_search_slot(trans, fs_info->remap_root,
+ &key, path, -1, 1);
+ if (ret) {
+ if (ret == 1) {
+ btrfs_release_path(path);
+ ret = -ENOENT;
+ }
+ goto end;
+ }
+
+ leaf = path->nodes[0];
+
+ new_key.objectid = new_addr + end - remap_start;
+ new_key.type = BTRFS_REMAP_BACKREF_KEY;
+ new_key.offset = remap_length + remap_start - end;
+
+ btrfs_set_item_key_safe(trans, path, &new_key);
+
+ remap_ptr = btrfs_item_ptr(leaf, path->slots[0],
+ struct btrfs_remap);
+ btrfs_set_remap_address(leaf, remap_ptr, end);
+
+ btrfs_mark_buffer_dirty(trans, path->nodes[0]);
+
+ btrfs_release_path(path);
+
+ adjust_block_group_remap_bytes(trans, dest_bg,
+ -num_bytes);
+ }
+ } else if (bytenr + num_bytes < remap_start + remap_length) {
+ /* Remove middle. */
+
+ new_key.objectid = remap_start;
+ new_key.type = key.type;
+ new_key.offset = bytenr - remap_start;
+
+ btrfs_set_item_key_safe(trans, path, &new_key);
+ btrfs_mark_buffer_dirty(trans, leaf);
+
+ new_key.objectid = end;
+ new_key.offset = remap_start + remap_length - end;
+
+ btrfs_release_path(path);
+
+ overlap_length = num_bytes;
+
+ if (!is_identity_remap) {
+ /* Add second remap entry. */
+
+ ret = btrfs_insert_empty_item(trans, fs_info->remap_root,
+ path, &new_key,
+ sizeof(struct btrfs_remap));
+ if (ret)
+ goto end;
+
+ btrfs_set_stack_remap_address(&remap,
+ new_addr + end - remap_start);
+
+ write_extent_buffer(path->nodes[0], &remap,
+ btrfs_item_ptr_offset(path->nodes[0], path->slots[0]),
+ sizeof(struct btrfs_remap));
+
+ btrfs_release_path(path);
+
+ /* Shorten backref entry. */
+
+ key.objectid = new_addr;
+ key.type = BTRFS_REMAP_BACKREF_KEY;
+ key.offset = remap_length;
+
+ ret = btrfs_search_slot(trans, fs_info->remap_root,
+ &key, path, -1, 1);
+ if (ret) {
+ if (ret == 1) {
+ btrfs_release_path(path);
+ ret = -ENOENT;
+ }
+ goto end;
+ }
+
+ new_key.objectid = new_addr;
+ new_key.type = BTRFS_REMAP_BACKREF_KEY;
+ new_key.offset = bytenr - remap_start;
+
+ btrfs_set_item_key_safe(trans, path, &new_key);
+ btrfs_mark_buffer_dirty(trans, path->nodes[0]);
+
+ btrfs_release_path(path);
+
+ /* Add second backref entry. */
+
+ new_key.objectid = new_addr + end - remap_start;
+ new_key.type = BTRFS_REMAP_BACKREF_KEY;
+ new_key.offset = remap_start + remap_length - end;
+
+ ret = btrfs_insert_empty_item(trans, fs_info->remap_root,
+ path, &new_key,
+ sizeof(struct btrfs_remap));
+ if (ret)
+ goto end;
+
+ btrfs_set_stack_remap_address(&remap, end);
+
+ write_extent_buffer(path->nodes[0], &remap,
+ btrfs_item_ptr_offset(path->nodes[0], path->slots[0]),
+ sizeof(struct btrfs_remap));
+
+ btrfs_release_path(path);
+
+ adjust_block_group_remap_bytes(trans, dest_bg,
+ -num_bytes);
+ } else {
+ /* Add second identity remap entry. */
+
+ ret = btrfs_insert_empty_item(trans, fs_info->remap_root,
+ path, &new_key, 0);
+ if (ret)
+ goto end;
+
+ btrfs_release_path(path);
+
+ ret = adjust_identity_remap_count(trans, path, bg, 1);
+ if (ret)
+ goto end;
+ }
+ } else {
+ /* Remove end. */
+
+ new_key.objectid = remap_start;
+ new_key.type = key.type;
+ new_key.offset = bytenr - remap_start;
+
+ btrfs_set_item_key_safe(trans, path, &new_key);
+ btrfs_mark_buffer_dirty(trans, leaf);
+
+ btrfs_release_path(path);
+
+ overlap_length = remap_start + remap_length - bytenr;
+
+ if (!is_identity_remap) {
+ /* Shorten backref entry. */
+
+ key.objectid = new_addr;
+ key.type = BTRFS_REMAP_BACKREF_KEY;
+ key.offset = remap_length;
+
+ ret = btrfs_search_slot(trans, fs_info->remap_root,
+ &key, path, -1, 1);
+ if (ret) {
+ if (ret == 1) {
+ btrfs_release_path(path);
+ ret = -ENOENT;
+ }
+ goto end;
+ }
+
+ new_key.objectid = new_addr;
+ new_key.type = BTRFS_REMAP_BACKREF_KEY;
+ new_key.offset = bytenr - remap_start;
+
+ btrfs_set_item_key_safe(trans, path, &new_key);
+ btrfs_mark_buffer_dirty(trans, path->nodes[0]);
+
+ btrfs_release_path(path);
+
+ adjust_block_group_remap_bytes(trans, dest_bg,
+ bytenr - remap_start - remap_length);
+ }
+ }
+
+ if (!is_identity_remap) {
+ ret = btrfs_add_to_free_space_tree(trans,
+ bytenr - remap_start + new_addr,
+ overlap_length);
+ if (ret)
+ goto end;
+ }
+
+ ret = overlap_length;
+
+end:
+ if (dest_bg)
+ btrfs_put_block_group(dest_bg);
+
+ return ret;
+}
+
+/*
+ * Returns 1 if remove_range_from_remap_tree() has been called successfully,
+ * 0 if block group wasn't remapped, and a negative number on error.
+ */
+int btrfs_remove_extent_from_remap_tree(struct btrfs_trans_handle *trans,
+ struct btrfs_path *path,
+ u64 bytenr, u64 num_bytes)
+{
+ struct btrfs_fs_info *fs_info = trans->fs_info;
+ struct btrfs_key key, found_key;
+ struct extent_buffer *leaf;
+ struct btrfs_block_group *bg;
+ int ret, length;
+
+ if (!(btrfs_super_incompat_flags(fs_info->super_copy) &
+ BTRFS_FEATURE_INCOMPAT_REMAP_TREE))
+ return 0;
+
+ bg = btrfs_lookup_block_group(fs_info, bytenr);
+ if (!bg)
+ return 0;
+
+ mutex_lock(&fs_info->remap_mutex);
+
+ if (!(bg->flags & BTRFS_BLOCK_GROUP_REMAPPED)) {
+ mutex_unlock(&fs_info->remap_mutex);
+ btrfs_put_block_group(bg);
+ return 0;
+ }
+
+ do {
+ key.objectid = bytenr;
+ key.type = (u8)-1;
+ key.offset = (u64)-1;
+
+ ret = btrfs_search_slot(trans, fs_info->remap_root, &key, path,
+ -1, 1);
+ if (ret < 0)
+ goto end;
+
+ leaf = path->nodes[0];
+
+ if (path->slots[0] == 0) {
+ ret = -ENOENT;
+ goto end;
+ }
+
+ path->slots[0]--;
+
+ btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
+
+ if (found_key.type != BTRFS_IDENTITY_REMAP_KEY &&
+ found_key.type != BTRFS_REMAP_KEY) {
+ ret = -ENOENT;
+ goto end;
+ }
+
+ if (bytenr < found_key.objectid ||
+ bytenr >= found_key.objectid + found_key.offset) {
+ ret = -ENOENT;
+ goto end;
+ }
+
+ length = remove_range_from_remap_tree(trans, path, bg, bytenr,
+ num_bytes);
+ if (length < 0) {
+ ret = length;
+ goto end;
+ }
+
+ bytenr += length;
+ num_bytes -= length;
+ } while (num_bytes > 0);
+
+ ret = 1;
+
+end:
+ mutex_unlock(&fs_info->remap_mutex);
+
+ btrfs_put_block_group(bg);
+ btrfs_release_path(path);
+ return ret;
+}
diff --git a/fs/btrfs/relocation.h b/fs/btrfs/relocation.h
index a653c42a25a3..4b0bb34b3fc1 100644
--- a/fs/btrfs/relocation.h
+++ b/fs/btrfs/relocation.h
@@ -33,5 +33,8 @@ bool btrfs_should_ignore_reloc_root(const struct btrfs_root *root);
u64 btrfs_get_reloc_bg_bytenr(const struct btrfs_fs_info *fs_info);
int btrfs_translate_remap(struct btrfs_fs_info *fs_info, u64 *logical,
u64 *length, bool nolock);
+int btrfs_remove_extent_from_remap_tree(struct btrfs_trans_handle *trans,
+ struct btrfs_path *path,
+ u64 bytenr, u64 num_bytes);
#endif
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index a2c49cb8bfc6..fc2b3e7de32e 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2941,8 +2941,8 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
return ret;
}
-static noinline int btrfs_update_device(struct btrfs_trans_handle *trans,
- struct btrfs_device *device)
+int btrfs_update_device(struct btrfs_trans_handle *trans,
+ struct btrfs_device *device)
{
int ret;
struct btrfs_path *path;
@@ -3246,25 +3246,13 @@ static int remove_chunk_item(struct btrfs_trans_handle *trans,
return btrfs_free_chunk(trans, chunk_offset);
}
-int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset)
+int btrfs_remove_dev_extents(struct btrfs_trans_handle *trans,
+ struct btrfs_chunk_map *map)
{
struct btrfs_fs_info *fs_info = trans->fs_info;
- struct btrfs_chunk_map *map;
+ struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
u64 dev_extent_len = 0;
int i, ret = 0;
- struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
-
- map = btrfs_get_chunk_map(fs_info, chunk_offset, 1);
- if (IS_ERR(map)) {
- /*
- * This is a logic error, but we don't want to just rely on the
- * user having built with ASSERT enabled, so if ASSERT doesn't
- * do anything we still error out.
- */
- DEBUG_WARN("errr %ld reading chunk map at offset %llu",
- PTR_ERR(map), chunk_offset);
- return PTR_ERR(map);
- }
/*
* First delete the device extent items from the devices btree.
@@ -3285,7 +3273,7 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset)
if (ret) {
mutex_unlock(&fs_devices->device_list_mutex);
btrfs_abort_transaction(trans, ret);
- goto out;
+ return ret;
}
if (device->bytes_used > 0) {
@@ -3305,6 +3293,30 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset)
}
mutex_unlock(&fs_devices->device_list_mutex);
+ return 0;
+}
+
+int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset)
+{
+ struct btrfs_fs_info *fs_info = trans->fs_info;
+ struct btrfs_chunk_map *map;
+ int ret;
+
+ map = btrfs_get_chunk_map(fs_info, chunk_offset, 1);
+ if (IS_ERR(map)) {
+ /*
+ * This is a logic error, but we don't want to just rely on the
+ * user having built with ASSERT enabled, so if ASSERT doesn't
+ * do anything we still error out.
+ */
+ ASSERT(0);
+ return PTR_ERR(map);
+ }
+
+ ret = btrfs_remove_dev_extents(trans, map);
+ if (ret)
+ goto out;
+
/*
* We acquire fs_info->chunk_mutex for 2 reasons:
*
@@ -5448,7 +5460,7 @@ static void chunk_map_device_set_bits(struct btrfs_chunk_map *map, unsigned int
}
}
-static void chunk_map_device_clear_bits(struct btrfs_chunk_map *map, unsigned int bits)
+void btrfs_chunk_map_device_clear_bits(struct btrfs_chunk_map *map, unsigned int bits)
{
for (int i = 0; i < map->num_stripes; i++) {
struct btrfs_io_stripe *stripe = &map->stripes[i];
@@ -5465,7 +5477,7 @@ void btrfs_remove_chunk_map(struct btrfs_fs_info *fs_info, struct btrfs_chunk_ma
write_lock(&fs_info->mapping_tree_lock);
rb_erase_cached(&map->rb_node, &fs_info->mapping_tree);
RB_CLEAR_NODE(&map->rb_node);
- chunk_map_device_clear_bits(map, CHUNK_ALLOCATED);
+ btrfs_chunk_map_device_clear_bits(map, CHUNK_ALLOCATED);
write_unlock(&fs_info->mapping_tree_lock);
/* Once for the tree reference. */
@@ -5501,7 +5513,7 @@ int btrfs_add_chunk_map(struct btrfs_fs_info *fs_info, struct btrfs_chunk_map *m
return -EEXIST;
}
chunk_map_device_set_bits(map, CHUNK_ALLOCATED);
- chunk_map_device_clear_bits(map, CHUNK_TRIMMED);
+ btrfs_chunk_map_device_clear_bits(map, CHUNK_TRIMMED);
write_unlock(&fs_info->mapping_tree_lock);
return 0;
@@ -5866,7 +5878,7 @@ void btrfs_mapping_tree_free(struct btrfs_fs_info *fs_info)
map = rb_entry(node, struct btrfs_chunk_map, rb_node);
rb_erase_cached(&map->rb_node, &fs_info->mapping_tree);
RB_CLEAR_NODE(&map->rb_node);
- chunk_map_device_clear_bits(map, CHUNK_ALLOCATED);
+ btrfs_chunk_map_device_clear_bits(map, CHUNK_ALLOCATED);
/* Once for the tree ref. */
btrfs_free_chunk_map(map);
cond_resched_rwlock_write(&fs_info->mapping_tree_lock);
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 430be12fd5e7..64b34710b68b 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -789,6 +789,8 @@ u64 btrfs_calc_stripe_length(const struct btrfs_chunk_map *map);
int btrfs_nr_parity_stripes(u64 type);
int btrfs_chunk_alloc_add_chunk_item(struct btrfs_trans_handle *trans,
struct btrfs_block_group *bg);
+int btrfs_remove_dev_extents(struct btrfs_trans_handle *trans,
+ struct btrfs_chunk_map *map);
int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset);
#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
@@ -900,6 +902,10 @@ bool btrfs_repair_one_zone(struct btrfs_fs_info *fs_info, u64 logical);
bool btrfs_pinned_by_swapfile(struct btrfs_fs_info *fs_info, void *ptr);
const u8 *btrfs_sb_fsid_ptr(const struct btrfs_super_block *sb);
+int btrfs_update_device(struct btrfs_trans_handle *trans,
+ struct btrfs_device *device);
+void btrfs_chunk_map_device_clear_bits(struct btrfs_chunk_map *map,
+ unsigned int bits);
#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
struct btrfs_io_context *alloc_btrfs_io_context(struct btrfs_fs_info *fs_info,
--
2.49.1
* [PATCH v2 11/16] btrfs: handle setting up relocation of block group with remap-tree
2025-08-13 14:34 [PATCH v2 00/16] btrfs: remap tree Mark Harmstone
` (9 preceding siblings ...)
2025-08-13 14:34 ` [PATCH v2 10/16] btrfs: handle deletions from remapped block group Mark Harmstone
@ 2025-08-13 14:34 ` Mark Harmstone
2025-08-13 14:34 ` [PATCH v2 12/16] btrfs: move existing remaps before relocating block group Mark Harmstone
` (4 subsequent siblings)
15 siblings, 0 replies; 43+ messages in thread
From: Mark Harmstone @ 2025-08-13 14:34 UTC (permalink / raw)
To: linux-btrfs; +Cc: Mark Harmstone
Handle the preliminary work for relocating a block group in a filesystem
with the remap-tree flag set.
If the block group is SYSTEM, btrfs_relocate_block_group() proceeds as it
does already, as bootstrapping issues mean that these block groups have
to be processed the existing way. The same applies to REMAP block groups,
which are dealt with in a later patch.
Otherwise we walk the free-space tree for the block group in question,
recording any holes. These get converted into identity remaps and placed
in the remap tree, and the block group's REMAPPED flag is set. From now
on no new allocations are possible within this block group, and any I/O
to it will be funnelled through btrfs_translate_remap(). We store the
number of identity remaps in `identity_remap_count`, so that we know
when we've removed the last one and the block group is fully remapped.
The change in btrfs_read_roots() is because data relocations no longer
rely on the data reloc tree as a hidden subvolume in which to do
snapshots.
Signed-off-by: Mark Harmstone <mark@harmstone.com>
---
fs/btrfs/free-space-tree.c | 4 +-
fs/btrfs/free-space-tree.h | 5 +-
fs/btrfs/relocation.c | 453 ++++++++++++++++++++++++++++++++++++-
fs/btrfs/relocation.h | 2 +-
fs/btrfs/space-info.c | 9 +-
fs/btrfs/volumes.c | 15 +-
6 files changed, 468 insertions(+), 20 deletions(-)
diff --git a/fs/btrfs/free-space-tree.c b/fs/btrfs/free-space-tree.c
index eba7f22ae49c..96613716742b 100644
--- a/fs/btrfs/free-space-tree.c
+++ b/fs/btrfs/free-space-tree.c
@@ -21,8 +21,7 @@ static int __add_block_group_free_space(struct btrfs_trans_handle *trans,
struct btrfs_block_group *block_group,
struct btrfs_path *path);
-static struct btrfs_root *btrfs_free_space_root(
- struct btrfs_block_group *block_group)
+struct btrfs_root *btrfs_free_space_root(struct btrfs_block_group *block_group)
{
struct btrfs_key key = {
.objectid = BTRFS_FREE_SPACE_TREE_OBJECTID,
@@ -93,7 +92,6 @@ static int add_new_free_space_info(struct btrfs_trans_handle *trans,
return 0;
}
-EXPORT_FOR_TESTS
struct btrfs_free_space_info *btrfs_search_free_space_info(
struct btrfs_trans_handle *trans,
struct btrfs_block_group *block_group,
diff --git a/fs/btrfs/free-space-tree.h b/fs/btrfs/free-space-tree.h
index 3d9a5d4477fc..89d2ff7e5c18 100644
--- a/fs/btrfs/free-space-tree.h
+++ b/fs/btrfs/free-space-tree.h
@@ -35,12 +35,13 @@ int btrfs_add_to_free_space_tree(struct btrfs_trans_handle *trans,
u64 start, u64 size);
int btrfs_remove_from_free_space_tree(struct btrfs_trans_handle *trans,
u64 start, u64 size);
-
-#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
struct btrfs_free_space_info *
btrfs_search_free_space_info(struct btrfs_trans_handle *trans,
struct btrfs_block_group *block_group,
struct btrfs_path *path, int cow);
+struct btrfs_root *btrfs_free_space_root(struct btrfs_block_group *block_group);
+
+#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
int __btrfs_add_to_free_space_tree(struct btrfs_trans_handle *trans,
struct btrfs_block_group *block_group,
struct btrfs_path *path, u64 start, u64 size);
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 03a1246af678..73324bfcfd98 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -3638,7 +3638,7 @@ static noinline_for_stack int relocate_block_group(struct reloc_control *rc)
btrfs_btree_balance_dirty(fs_info);
}
- if (!err) {
+ if (!err && !btrfs_fs_incompat(fs_info, REMAP_TREE)) {
ret = relocate_file_extent_cluster(rc);
if (ret < 0)
err = ret;
@@ -3885,6 +3885,90 @@ static const char *stage_to_string(enum reloc_stage stage)
return "unknown";
}
+static int add_remap_tree_entries(struct btrfs_trans_handle *trans,
+ struct btrfs_path *path,
+ struct btrfs_key *entries,
+ unsigned int num_entries)
+{
+ int ret;
+ struct btrfs_fs_info *fs_info = trans->fs_info;
+ struct btrfs_item_batch batch;
+ u32 *data_sizes;
+ u32 max_items;
+
+ max_items = BTRFS_LEAF_DATA_SIZE(trans->fs_info) / sizeof(struct btrfs_item);
+
+ data_sizes = kzalloc(sizeof(u32) * min_t(u32, num_entries, max_items),
+ GFP_NOFS);
+ if (!data_sizes)
+ return -ENOMEM;
+
+ while (true) {
+ batch.keys = entries;
+ batch.data_sizes = data_sizes;
+ batch.total_data_size = 0;
+ batch.nr = min_t(u32, num_entries, max_items);
+
+ ret = btrfs_insert_empty_items(trans, fs_info->remap_root, path,
+ &batch);
+ btrfs_release_path(path);
+
+ if (ret || num_entries <= max_items)
+ break;
+
+ num_entries -= max_items;
+ entries += max_items;
+ }
+
+ kfree(data_sizes);
+
+ return ret;
+}
+
+struct space_run {
+ u64 start;
+ u64 end;
+};
+
+static void parse_bitmap(u64 block_size, const unsigned long *bitmap,
+ unsigned long size, u64 address,
+ struct space_run *space_runs,
+ unsigned int *num_space_runs)
+{
+ unsigned long pos, end;
+ u64 run_start, run_length;
+
+ pos = find_first_bit(bitmap, size);
+
+ if (pos == size)
+ return;
+
+ while (true) {
+ end = find_next_zero_bit(bitmap, size, pos);
+
+ run_start = address + (pos * block_size);
+ run_length = (end - pos) * block_size;
+
+ if (*num_space_runs != 0 &&
+ space_runs[*num_space_runs - 1].end == run_start) {
+ space_runs[*num_space_runs - 1].end += run_length;
+ } else {
+ space_runs[*num_space_runs].start = run_start;
+ space_runs[*num_space_runs].end = run_start + run_length;
+
+ (*num_space_runs)++;
+ }
+
+ if (end == size)
+ break;
+
+ pos = find_next_bit(bitmap, size, end + 1);
+
+ if (pos == size)
+ break;
+ }
+}
+
static void adjust_block_group_remap_bytes(struct btrfs_trans_handle *trans,
struct btrfs_block_group *bg,
s64 diff)
@@ -3910,6 +3994,229 @@ static void adjust_block_group_remap_bytes(struct btrfs_trans_handle *trans,
btrfs_inc_delayed_refs_rsv_bg_updates(fs_info);
}
+static int create_remap_tree_entries(struct btrfs_trans_handle *trans,
+ struct btrfs_path *path,
+ struct btrfs_block_group *bg)
+{
+ struct btrfs_fs_info *fs_info = trans->fs_info;
+ struct btrfs_free_space_info *fsi;
+ struct btrfs_key key, found_key;
+ struct extent_buffer *leaf;
+ struct btrfs_root *space_root;
+ u32 extent_count;
+ struct space_run *space_runs = NULL;
+ unsigned int num_space_runs = 0;
+ struct btrfs_key *entries = NULL;
+ unsigned int max_entries, num_entries;
+ int ret;
+
+ mutex_lock(&bg->free_space_lock);
+
+ if (test_bit(BLOCK_GROUP_FLAG_NEEDS_FREE_SPACE, &bg->runtime_flags)) {
+ mutex_unlock(&bg->free_space_lock);
+
+ ret = btrfs_add_block_group_free_space(trans, bg);
+ if (ret)
+ return ret;
+
+ mutex_lock(&bg->free_space_lock);
+ }
+
+ fsi = btrfs_search_free_space_info(trans, bg, path, 0);
+ if (IS_ERR(fsi)) {
+ mutex_unlock(&bg->free_space_lock);
+ return PTR_ERR(fsi);
+ }
+
+ extent_count = btrfs_free_space_extent_count(path->nodes[0], fsi);
+
+ btrfs_release_path(path);
+
+ space_runs = kmalloc(sizeof(*space_runs) * extent_count, GFP_NOFS);
+ if (!space_runs) {
+ mutex_unlock(&bg->free_space_lock);
+ return -ENOMEM;
+ }
+
+ key.objectid = bg->start;
+ key.type = 0;
+ key.offset = 0;
+
+ space_root = btrfs_free_space_root(bg);
+
+ ret = btrfs_search_slot(trans, space_root, &key, path, 0, 0);
+ if (ret < 0) {
+ mutex_unlock(&bg->free_space_lock);
+ goto out;
+ }
+
+ ret = 0;
+
+ while (true) {
+ leaf = path->nodes[0];
+
+ btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
+
+ if (found_key.objectid >= bg->start + bg->length)
+ break;
+
+ if (found_key.type == BTRFS_FREE_SPACE_EXTENT_KEY) {
+ if (num_space_runs != 0 &&
+ space_runs[num_space_runs - 1].end == found_key.objectid) {
+ space_runs[num_space_runs - 1].end =
+ found_key.objectid + found_key.offset;
+ } else {
+ BUG_ON(num_space_runs >= extent_count);
+
+ space_runs[num_space_runs].start = found_key.objectid;
+ space_runs[num_space_runs].end =
+ found_key.objectid + found_key.offset;
+
+ num_space_runs++;
+ }
+ } else if (found_key.type == BTRFS_FREE_SPACE_BITMAP_KEY) {
+ void *bitmap;
+ unsigned long offset;
+ u32 data_size;
+
+ offset = btrfs_item_ptr_offset(leaf, path->slots[0]);
+ data_size = btrfs_item_size(leaf, path->slots[0]);
+
+ if (data_size != 0) {
+ bitmap = kmalloc(data_size, GFP_NOFS);
+ if (!bitmap) {
+ mutex_unlock(&bg->free_space_lock);
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ read_extent_buffer(leaf, bitmap, offset,
+ data_size);
+
+ parse_bitmap(fs_info->sectorsize, bitmap,
+ data_size * BITS_PER_BYTE,
+ found_key.objectid, space_runs,
+ &num_space_runs);
+
+ BUG_ON(num_space_runs > extent_count);
+
+ kfree(bitmap);
+ }
+ }
+
+ path->slots[0]++;
+
+ if (path->slots[0] >= btrfs_header_nritems(leaf)) {
+ ret = btrfs_next_leaf(space_root, path);
+ if (ret != 0) {
+ if (ret == 1)
+ ret = 0;
+ break;
+ }
+ leaf = path->nodes[0];
+ }
+ }
+
+ btrfs_release_path(path);
+
+ mutex_unlock(&bg->free_space_lock);
+
+ max_entries = extent_count + 2;
+ entries = kmalloc(sizeof(*entries) * max_entries, GFP_NOFS);
+ if (!entries) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ num_entries = 0;
+
+ if (num_space_runs > 0 && space_runs[0].start > bg->start) {
+ entries[num_entries].objectid = bg->start;
+ entries[num_entries].type = BTRFS_IDENTITY_REMAP_KEY;
+ entries[num_entries].offset = space_runs[0].start - bg->start;
+ num_entries++;
+ }
+
+ for (unsigned int i = 1; i < num_space_runs; i++) {
+ entries[num_entries].objectid = space_runs[i - 1].end;
+ entries[num_entries].type = BTRFS_IDENTITY_REMAP_KEY;
+ entries[num_entries].offset =
+ space_runs[i].start - space_runs[i - 1].end;
+ num_entries++;
+ }
+
+ if (num_space_runs == 0) {
+ entries[num_entries].objectid = bg->start;
+ entries[num_entries].type = BTRFS_IDENTITY_REMAP_KEY;
+ entries[num_entries].offset = bg->length;
+ num_entries++;
+ } else if (space_runs[num_space_runs - 1].end < bg->start + bg->length) {
+ entries[num_entries].objectid = space_runs[num_space_runs - 1].end;
+ entries[num_entries].type = BTRFS_IDENTITY_REMAP_KEY;
+ entries[num_entries].offset =
+ bg->start + bg->length - space_runs[num_space_runs - 1].end;
+ num_entries++;
+ }
+
+ if (num_entries == 0)
+ goto out;
+
+ bg->identity_remap_count = num_entries;
+
+ ret = add_remap_tree_entries(trans, path, entries, num_entries);
+
+out:
+ kfree(entries);
+ kfree(space_runs);
+
+ return ret;
+}
+
+static int mark_bg_remapped(struct btrfs_trans_handle *trans,
+ struct btrfs_path *path,
+ struct btrfs_block_group *bg)
+{
+ struct btrfs_fs_info *fs_info = trans->fs_info;
+ unsigned long bi;
+ struct extent_buffer *leaf;
+ struct btrfs_block_group_item_v2 bgi;
+ struct btrfs_key key;
+ int ret;
+
+ ASSERT(btrfs_fs_incompat(fs_info, REMAP_TREE));
+
+ bg->flags |= BTRFS_BLOCK_GROUP_REMAPPED;
+
+ key.objectid = bg->start;
+ key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
+ key.offset = bg->length;
+
+ ret = btrfs_search_slot(trans, fs_info->block_group_root, &key,
+ path, 0, 1);
+ if (ret) {
+ if (ret > 0)
+ ret = -ENOENT;
+ goto out;
+ }
+
+ leaf = path->nodes[0];
+ bi = btrfs_item_ptr_offset(leaf, path->slots[0]);
+ read_extent_buffer(leaf, &bgi, bi, sizeof(bgi));
+ btrfs_set_stack_block_group_v2_flags(&bgi, bg->flags);
+ btrfs_set_stack_block_group_v2_identity_remap_count(&bgi,
+ bg->identity_remap_count);
+ write_extent_buffer(leaf, &bgi, bi, sizeof(bgi));
+
+ btrfs_mark_buffer_dirty(trans, leaf);
+
+ bg->commit_identity_remap_count = bg->identity_remap_count;
+
+ ret = 0;
+out:
+ btrfs_release_path(path);
+ return ret;
+}
+
static int remove_chunk_stripes(struct btrfs_trans_handle *trans,
struct btrfs_chunk_map *chunk,
struct btrfs_path *path)
@@ -4027,6 +4334,55 @@ static int adjust_identity_remap_count(struct btrfs_trans_handle *trans,
return ret;
}
+static int mark_chunk_remapped(struct btrfs_trans_handle *trans,
+ struct btrfs_path *path, u64 start)
+{
+ struct btrfs_fs_info *fs_info = trans->fs_info;
+ struct btrfs_chunk_map *chunk;
+ struct btrfs_key key;
+ u64 type;
+ int ret;
+ struct extent_buffer *leaf;
+ struct btrfs_chunk *c;
+
+ read_lock(&fs_info->mapping_tree_lock);
+
+ chunk = btrfs_find_chunk_map_nolock(fs_info, start, 1);
+ if (!chunk) {
+ read_unlock(&fs_info->mapping_tree_lock);
+ return -ENOENT;
+ }
+
+ chunk->type |= BTRFS_BLOCK_GROUP_REMAPPED;
+ type = chunk->type;
+
+ read_unlock(&fs_info->mapping_tree_lock);
+
+ key.objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID;
+ key.type = BTRFS_CHUNK_ITEM_KEY;
+ key.offset = start;
+
+ ret = btrfs_search_slot(trans, fs_info->chunk_root, &key, path,
+ 0, 1);
+ if (ret == 1) {
+ ret = -ENOENT;
+ goto end;
+ } else if (ret < 0)
+ goto end;
+
+ leaf = path->nodes[0];
+
+ c = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_chunk);
+ btrfs_set_chunk_type(leaf, c, type);
+ btrfs_mark_buffer_dirty(trans, leaf);
+
+ ret = 0;
+end:
+ btrfs_free_chunk_map(chunk);
+ btrfs_release_path(path);
+ return ret;
+}
+
int btrfs_translate_remap(struct btrfs_fs_info *fs_info, u64 *logical,
u64 *length, bool nolock)
{
@@ -4086,17 +4442,78 @@ int btrfs_translate_remap(struct btrfs_fs_info *fs_info, u64 *logical,
return 0;
}
+static int start_block_group_remapping(struct btrfs_fs_info *fs_info,
+ struct btrfs_path *path,
+ struct btrfs_block_group *bg)
+{
+ struct btrfs_trans_handle *trans;
+ int ret, ret2;
+
+ ret = btrfs_cache_block_group(bg, true);
+ if (ret)
+ return ret;
+
+ trans = btrfs_start_transaction(fs_info->remap_root, 0);
+ if (IS_ERR(trans))
+ return PTR_ERR(trans);
+
+ /* We need to run delayed refs, to make sure FST is up to date. */
+ ret = btrfs_run_delayed_refs(trans, U64_MAX);
+ if (ret) {
+ btrfs_end_transaction(trans);
+ return ret;
+ }
+
+ mutex_lock(&fs_info->remap_mutex);
+
+ if (bg->flags & BTRFS_BLOCK_GROUP_REMAPPED) {
+ ret = 0;
+ goto end;
+ }
+
+ ret = create_remap_tree_entries(trans, path, bg);
+ if (ret) {
+ btrfs_abort_transaction(trans, ret);
+ goto end;
+ }
+
+ ret = mark_bg_remapped(trans, path, bg);
+ if (ret) {
+ btrfs_abort_transaction(trans, ret);
+ goto end;
+ }
+
+ ret = mark_chunk_remapped(trans, path, bg->start);
+ if (ret) {
+ btrfs_abort_transaction(trans, ret);
+ goto end;
+ }
+
+ ret = btrfs_remove_block_group_free_space(trans, bg);
+ if (ret)
+ btrfs_abort_transaction(trans, ret);
+
+end:
+ mutex_unlock(&fs_info->remap_mutex);
+
+ ret2 = btrfs_end_transaction(trans);
+ if (!ret)
+ ret = ret2;
+
+ return ret;
+}
+
/*
* function to relocate all extents in a block group.
*/
int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start,
- bool verbose)
+ bool verbose, bool *using_remap_tree)
{
struct btrfs_block_group *bg;
struct btrfs_root *extent_root = btrfs_extent_root(fs_info, group_start);
struct reloc_control *rc;
struct inode *inode;
- struct btrfs_path *path;
+ struct btrfs_path *path = NULL;
int ret;
int rw = 0;
int err = 0;
@@ -4163,7 +4580,7 @@ int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start,
}
inode = lookup_free_space_inode(rc->block_group, path);
- btrfs_free_path(path);
+ btrfs_release_path(path);
if (!IS_ERR(inode))
ret = delete_block_group_cache(rc->block_group, inode, 0);
@@ -4175,11 +4592,17 @@ int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start,
goto out;
}
- rc->data_inode = create_reloc_inode(rc->block_group);
- if (IS_ERR(rc->data_inode)) {
- err = PTR_ERR(rc->data_inode);
- rc->data_inode = NULL;
- goto out;
+ *using_remap_tree = btrfs_fs_incompat(fs_info, REMAP_TREE) &&
+ !(bg->flags & BTRFS_BLOCK_GROUP_SYSTEM) &&
+ !(bg->flags & BTRFS_BLOCK_GROUP_REMAP);
+
+ if (!btrfs_fs_incompat(fs_info, REMAP_TREE)) {
+ rc->data_inode = create_reloc_inode(rc->block_group);
+ if (IS_ERR(rc->data_inode)) {
+ err = PTR_ERR(rc->data_inode);
+ rc->data_inode = NULL;
+ goto out;
+ }
}
if (verbose)
@@ -4192,6 +4615,12 @@ int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start,
ret = btrfs_zone_finish(rc->block_group);
WARN_ON(ret && ret != -EAGAIN);
+ if (*using_remap_tree) {
+ err = start_block_group_remapping(fs_info, path, bg);
+
+ goto out;
+ }
+
while (1) {
enum reloc_stage finishes_stage;
@@ -4239,7 +4668,9 @@ int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start,
out:
if (err && rw)
btrfs_dec_block_group_ro(rc->block_group);
- iput(rc->data_inode);
+ if (!btrfs_fs_incompat(fs_info, REMAP_TREE))
+ iput(rc->data_inode);
+ btrfs_free_path(path);
out_put_bg:
btrfs_put_block_group(bg);
reloc_chunk_end(fs_info);
@@ -4433,7 +4864,7 @@ int btrfs_recover_relocation(struct btrfs_fs_info *fs_info)
btrfs_free_path(path);
- if (ret == 0) {
+ if (ret == 0 && !btrfs_fs_incompat(fs_info, REMAP_TREE)) {
/* cleanup orphan inode in data relocation tree */
fs_root = btrfs_grab_root(fs_info->data_reloc_root);
ASSERT(fs_root);
diff --git a/fs/btrfs/relocation.h b/fs/btrfs/relocation.h
index 4b0bb34b3fc1..1bf2b1536aa9 100644
--- a/fs/btrfs/relocation.h
+++ b/fs/btrfs/relocation.h
@@ -13,7 +13,7 @@ struct btrfs_ordered_extent;
struct btrfs_pending_snapshot;
int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start,
- bool verbose);
+ bool verbose, bool *using_remap_tree);
int btrfs_init_reloc_root(struct btrfs_trans_handle *trans, struct btrfs_root *root);
int btrfs_update_reloc_root(struct btrfs_trans_handle *trans,
struct btrfs_root *root);
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 278034a22dbf..409e7233689f 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -375,8 +375,13 @@ void btrfs_add_bg_to_space_info(struct btrfs_fs_info *info,
factor = btrfs_bg_type_to_factor(block_group->flags);
spin_lock(&space_info->lock);
- space_info->total_bytes += block_group->length;
- space_info->disk_total += block_group->length * factor;
+
+ if (!(block_group->flags & BTRFS_BLOCK_GROUP_REMAPPED) ||
+ block_group->identity_remap_count != 0) {
+ space_info->total_bytes += block_group->length;
+ space_info->disk_total += block_group->length * factor;
+ }
+
space_info->bytes_used += block_group->used;
space_info->disk_used += block_group->used * factor;
space_info->bytes_readonly += block_group->bytes_super;
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index fc2b3e7de32e..b4fe2e928992 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3436,6 +3436,7 @@ int btrfs_relocate_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset,
struct btrfs_block_group *block_group;
u64 length;
int ret;
+ bool using_remap_tree;
if (btrfs_fs_incompat(fs_info, EXTENT_TREE_V2)) {
btrfs_err(fs_info,
@@ -3459,7 +3460,8 @@ int btrfs_relocate_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset,
/* step one, relocate all the extents inside this chunk */
btrfs_scrub_pause(fs_info);
- ret = btrfs_relocate_block_group(fs_info, chunk_offset, true);
+ ret = btrfs_relocate_block_group(fs_info, chunk_offset, true,
+ &using_remap_tree);
btrfs_scrub_continue(fs_info);
if (ret) {
/*
@@ -3478,6 +3480,9 @@ int btrfs_relocate_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset,
length = block_group->length;
btrfs_put_block_group(block_group);
+ if (using_remap_tree)
+ return 0;
+
/*
* On a zoned file system, discard the whole block group, this will
* trigger a REQ_OP_ZONE_RESET operation on the device zone. If
@@ -4177,6 +4182,14 @@ static int __btrfs_balance(struct btrfs_fs_info *fs_info)
chunk = btrfs_item_ptr(leaf, slot, struct btrfs_chunk);
chunk_type = btrfs_chunk_type(leaf, chunk);
+ /* Check if chunk has already been fully relocated. */
+ if (chunk_type & BTRFS_BLOCK_GROUP_REMAPPED &&
+ btrfs_chunk_num_stripes(leaf, chunk) == 0) {
+ btrfs_release_path(path);
+ mutex_unlock(&fs_info->reclaim_bgs_lock);
+ goto loop;
+ }
+
if (!counting) {
spin_lock(&fs_info->balance_lock);
bctl->stat.considered++;
--
2.49.1
* [PATCH v2 12/16] btrfs: move existing remaps before relocating block group
2025-08-13 14:34 [PATCH v2 00/16] btrfs: remap tree Mark Harmstone
` (10 preceding siblings ...)
2025-08-13 14:34 ` [PATCH v2 11/16] btrfs: handle setting up relocation of block group with remap-tree Mark Harmstone
@ 2025-08-13 14:34 ` Mark Harmstone
2025-08-13 14:34 ` [PATCH v2 13/16] btrfs: replace identity maps with actual remaps when doing relocations Mark Harmstone
` (3 subsequent siblings)
15 siblings, 0 replies; 43+ messages in thread
From: Mark Harmstone @ 2025-08-13 14:34 UTC (permalink / raw)
To: linux-btrfs; +Cc: Mark Harmstone
If, when relocating a block group, we find that `remap_bytes` > 0 in its
block group item, that means it has been the destination block group
for another block group that has been remapped.
We need to search the remap tree for any remap backrefs within this
range, and move the data to a third block group. Otherwise
btrfs_translate_remap() could end up following an unbounded chain of
remaps, which would only get worse over time.
We only relocate one block group at a time, so `remap_bytes` will only
ever go down while we are doing this. Once we're finished we set the
REMAPPED flag on the block group, which will permanently prevent any
other data from being moved to within it.
Signed-off-by: Mark Harmstone <mark@harmstone.com>
---
fs/btrfs/extent-tree.c | 6 +-
fs/btrfs/relocation.c | 483 +++++++++++++++++++++++++++++++++++++++++
2 files changed, 487 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index ca3f6d6bb5ba..1e5a68addf25 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4496,7 +4496,8 @@ static noinline int find_free_extent(struct btrfs_root *root,
block_group->cached != BTRFS_CACHE_NO) {
down_read(&space_info->groups_sem);
if (list_empty(&block_group->list) ||
- block_group->ro) {
+ block_group->ro ||
+ block_group->flags & BTRFS_BLOCK_GROUP_REMAPPED) {
/*
* someone is removing this block group,
* we can't jump into the have_block_group
@@ -4530,7 +4531,8 @@ static noinline int find_free_extent(struct btrfs_root *root,
ffe_ctl->hinted = false;
/* If the block group is read-only, we can skip it entirely. */
- if (unlikely(block_group->ro)) {
+ if (unlikely(block_group->ro) ||
+ block_group->flags & BTRFS_BLOCK_GROUP_REMAPPED) {
if (ffe_ctl->for_treelog)
btrfs_clear_treelog_bg(block_group);
if (ffe_ctl->for_data_reloc)
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 73324bfcfd98..c4d758d016e2 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -3994,6 +3994,481 @@ static void adjust_block_group_remap_bytes(struct btrfs_trans_handle *trans,
btrfs_inc_delayed_refs_rsv_bg_updates(fs_info);
}
+struct reloc_io_private {
+ struct completion done;
+ refcount_t pending_refs;
+ blk_status_t status;
+};
+
+static void reloc_endio(struct btrfs_bio *bbio)
+{
+ struct reloc_io_private *priv = bbio->private;
+
+ if (bbio->bio.bi_status)
+ WRITE_ONCE(priv->status, bbio->bio.bi_status);
+
+ if (refcount_dec_and_test(&priv->pending_refs))
+ complete(&priv->done);
+
+ bio_put(&bbio->bio);
+}
+
+static int copy_remapped_data_io(struct btrfs_fs_info *fs_info,
+ struct reloc_io_private *priv,
+ struct page **pages, u64 addr, u64 length,
+ bool do_write)
+{
+ struct btrfs_bio *bbio;
+ unsigned long i = 0;
+ blk_opf_t op = do_write ? REQ_OP_WRITE : REQ_OP_READ;
+
+ init_completion(&priv->done);
+ refcount_set(&priv->pending_refs, 1);
+ priv->status = 0;
+
+ bbio = btrfs_bio_alloc(BIO_MAX_VECS, op, fs_info, reloc_endio,
+ priv);
+ bbio->bio.bi_iter.bi_sector = addr >> SECTOR_SHIFT;
+
+ do {
+ size_t bytes = min_t(u64, length, PAGE_SIZE);
+
+ if (bio_add_page(&bbio->bio, pages[i], bytes, 0) < bytes) {
+ refcount_inc(&priv->pending_refs);
+ btrfs_submit_bbio(bbio, 0);
+
+ bbio = btrfs_bio_alloc(BIO_MAX_VECS, op, fs_info,
+ reloc_endio, priv);
+ bbio->bio.bi_iter.bi_sector = addr >> SECTOR_SHIFT;
+ continue;
+ }
+
+ i++;
+ addr += bytes;
+ length -= bytes;
+ } while (length);
+
+ refcount_inc(&priv->pending_refs);
+ btrfs_submit_bbio(bbio, 0);
+
+ if (!refcount_dec_and_test(&priv->pending_refs))
+ wait_for_completion_io(&priv->done);
+
+ return blk_status_to_errno(READ_ONCE(priv->status));
+}
+
+static int copy_remapped_data(struct btrfs_fs_info *fs_info, u64 old_addr,
+ u64 new_addr, u64 length)
+{
+ int ret;
+ struct page **pages;
+ unsigned int nr_pages;
+ struct reloc_io_private priv;
+
+ nr_pages = DIV_ROUND_UP(length, PAGE_SIZE);
+ pages = kcalloc(nr_pages, sizeof(struct page *), GFP_NOFS);
+ if (!pages)
+ return -ENOMEM;
+ ret = btrfs_alloc_page_array(nr_pages, pages, 0);
+ if (ret) {
+ ret = -ENOMEM;
+ goto end;
+ }
+
+ ret = copy_remapped_data_io(fs_info, &priv, pages, old_addr, length,
+ false);
+ if (ret)
+ goto end;
+
+ ret = copy_remapped_data_io(fs_info, &priv, pages, new_addr, length,
+ true);
+
+end:
+ for (unsigned int i = 0; i < nr_pages; i++) {
+ if (pages[i])
+ __free_page(pages[i]);
+ }
+ kfree(pages);
+
+ return ret;
+}
+
+static int do_copy(struct btrfs_fs_info *fs_info, u64 old_addr, u64 new_addr,
+ u64 length)
+{
+ int ret;
+
+ /* Copy 1MB at a time, to avoid using too much memory. */
+
+ do {
+ u64 to_copy = min_t(u64, length, SZ_1M);
+
+ ret = copy_remapped_data(fs_info, old_addr, new_addr,
+ to_copy);
+ if (ret)
+ return ret;
+
+ if (to_copy == length)
+ break;
+
+ old_addr += to_copy;
+ new_addr += to_copy;
+ length -= to_copy;
+ } while (true);
+
+ return 0;
+}
+
+static int add_remap_item(struct btrfs_trans_handle *trans,
+ struct btrfs_path *path, u64 new_addr, u64 length,
+ u64 old_addr)
+{
+ struct btrfs_fs_info *fs_info = trans->fs_info;
+ struct btrfs_remap remap;
+ struct btrfs_key key;
+ struct extent_buffer *leaf;
+ int ret;
+
+ key.objectid = old_addr;
+ key.type = BTRFS_REMAP_KEY;
+ key.offset = length;
+
+ ret = btrfs_insert_empty_item(trans, fs_info->remap_root, path,
+ &key, sizeof(struct btrfs_remap));
+ if (ret)
+ return ret;
+
+ leaf = path->nodes[0];
+
+ btrfs_set_stack_remap_address(&remap, new_addr);
+
+ write_extent_buffer(leaf, &remap,
+ btrfs_item_ptr_offset(leaf, path->slots[0]),
+ sizeof(struct btrfs_remap));
+
+ btrfs_release_path(path);
+
+ return 0;
+}
+
+static int add_remap_backref_item(struct btrfs_trans_handle *trans,
+ struct btrfs_path *path, u64 new_addr,
+ u64 length, u64 old_addr)
+{
+ struct btrfs_fs_info *fs_info = trans->fs_info;
+ struct btrfs_remap remap;
+ struct btrfs_key key;
+ struct extent_buffer *leaf;
+ int ret;
+
+ key.objectid = new_addr;
+ key.type = BTRFS_REMAP_BACKREF_KEY;
+ key.offset = length;
+
+ ret = btrfs_insert_empty_item(trans, fs_info->remap_root,
+ path, &key, sizeof(struct btrfs_remap));
+ if (ret)
+ return ret;
+
+ leaf = path->nodes[0];
+
+ btrfs_set_stack_remap_address(&remap, old_addr);
+
+ write_extent_buffer(leaf, &remap,
+ btrfs_item_ptr_offset(leaf, path->slots[0]),
+ sizeof(struct btrfs_remap));
+
+ btrfs_release_path(path);
+
+ return 0;
+}
+
+static int move_existing_remap(struct btrfs_fs_info *fs_info,
+ struct btrfs_path *path,
+ struct btrfs_block_group *bg, u64 new_addr,
+ u64 length, u64 old_addr)
+{
+ struct btrfs_trans_handle *trans;
+ struct extent_buffer *leaf;
+ struct btrfs_remap *remap_ptr, remap;
+ struct btrfs_key key, ins;
+ u64 dest_addr, dest_length, min_size;
+ struct btrfs_block_group *dest_bg;
+ int ret;
+ bool is_data = bg->flags & BTRFS_BLOCK_GROUP_DATA;
+ struct btrfs_space_info *sinfo = bg->space_info;
+ bool mutex_taken = false, bg_needs_free_space;
+
+ spin_lock(&sinfo->lock);
+ btrfs_space_info_update_bytes_may_use(sinfo, length);
+ spin_unlock(&sinfo->lock);
+
+ if (is_data)
+ min_size = fs_info->sectorsize;
+ else
+ min_size = fs_info->nodesize;
+
+ ret = btrfs_reserve_extent(fs_info->fs_root, length, length, min_size,
+ 0, 0, &ins, is_data, false);
+ if (ret) {
+ spin_lock(&sinfo->lock);
+ btrfs_space_info_update_bytes_may_use(sinfo, -length);
+ spin_unlock(&sinfo->lock);
+ return ret;
+ }
+
+ dest_addr = ins.objectid;
+ dest_length = ins.offset;
+
+ if (!is_data && !IS_ALIGNED(dest_length, fs_info->nodesize)) {
+ u64 new_length = ALIGN_DOWN(dest_length, fs_info->nodesize);
+
+ btrfs_free_reserved_extent(fs_info, dest_addr + new_length,
+ dest_length - new_length, 0);
+
+ dest_length = new_length;
+ }
+
+ trans = btrfs_join_transaction(fs_info->remap_root);
+ if (IS_ERR(trans)) {
+ ret = PTR_ERR(trans);
+ trans = NULL;
+ goto end;
+ }
+
+ mutex_lock(&fs_info->remap_mutex);
+ mutex_taken = true;
+
+ /* Find old remap entry. */
+
+ key.objectid = old_addr;
+ key.type = BTRFS_REMAP_KEY;
+ key.offset = length;
+
+ ret = btrfs_search_slot(trans, fs_info->remap_root, &key,
+ path, 0, 1);
+ if (ret == 1) {
+ /*
+ * Not a problem if the remap entry wasn't found: that means
+ * that another transaction has deallocated the data.
+ * move_existing_remaps() loops until the BG contains no
+ * remaps, so we can just return 0 in this case.
+ */
+ btrfs_release_path(path);
+ ret = 0;
+ goto end;
+ } else if (ret) {
+ goto end;
+ }
+
+ ret = do_copy(fs_info, new_addr, dest_addr, dest_length);
+ if (ret)
+ goto end;
+
+ /* Change data of old remap entry. */
+
+ leaf = path->nodes[0];
+
+ remap_ptr = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_remap);
+ btrfs_set_remap_address(leaf, remap_ptr, dest_addr);
+
+ btrfs_mark_buffer_dirty(trans, leaf);
+
+ if (dest_length != length) {
+ key.offset = dest_length;
+ btrfs_set_item_key_safe(trans, path, &key);
+ }
+
+ btrfs_release_path(path);
+
+ if (dest_length != length) {
+ /* Add remap item for remainder. */
+
+ ret = add_remap_item(trans, path, new_addr + dest_length,
+ length - dest_length,
+ old_addr + dest_length);
+ if (ret)
+ goto end;
+ }
+
+ /* Change or remove old backref. */
+
+ key.objectid = new_addr;
+ key.type = BTRFS_REMAP_BACKREF_KEY;
+ key.offset = length;
+
+ ret = btrfs_search_slot(trans, fs_info->remap_root, &key,
+ path, -1, 1);
+ if (ret) {
+ if (ret == 1) {
+ btrfs_release_path(path);
+ ret = -ENOENT;
+ }
+ goto end;
+ }
+
+ leaf = path->nodes[0];
+
+ if (dest_length == length) {
+ ret = btrfs_del_item(trans, fs_info->remap_root, path);
+ if (ret) {
+ btrfs_release_path(path);
+ goto end;
+ }
+ } else {
+ key.objectid += dest_length;
+ key.offset -= dest_length;
+ btrfs_set_item_key_safe(trans, path, &key);
+
+ btrfs_set_stack_remap_address(&remap, old_addr + dest_length);
+
+ write_extent_buffer(leaf, &remap,
+ btrfs_item_ptr_offset(leaf, path->slots[0]),
+ sizeof(struct btrfs_remap));
+ }
+
+ btrfs_release_path(path);
+
+ /* Add new backref. */
+
+ ret = add_remap_backref_item(trans, path, dest_addr, dest_length,
+ old_addr);
+ if (ret)
+ goto end;
+
+ adjust_block_group_remap_bytes(trans, bg, -dest_length);
+
+ ret = btrfs_add_to_free_space_tree(trans, new_addr, dest_length);
+ if (ret)
+ goto end;
+
+ dest_bg = btrfs_lookup_block_group(fs_info, dest_addr);
+
+ adjust_block_group_remap_bytes(trans, dest_bg, dest_length);
+
+ mutex_lock(&dest_bg->free_space_lock);
+ bg_needs_free_space = test_bit(BLOCK_GROUP_FLAG_NEEDS_FREE_SPACE,
+ &dest_bg->runtime_flags);
+ mutex_unlock(&dest_bg->free_space_lock);
+ btrfs_put_block_group(dest_bg);
+
+ if (bg_needs_free_space) {
+ ret = btrfs_add_block_group_free_space(trans, dest_bg);
+ if (ret)
+ goto end;
+ }
+
+ ret = btrfs_remove_from_free_space_tree(trans, dest_addr, dest_length);
+ if (ret) {
+ btrfs_remove_from_free_space_tree(trans, new_addr,
+ dest_length);
+ goto end;
+ }
+
+ ret = 0;
+
+end:
+ if (mutex_taken)
+ mutex_unlock(&fs_info->remap_mutex);
+
+ btrfs_dec_block_group_reservations(fs_info, dest_addr);
+
+ if (ret) {
+ btrfs_free_reserved_extent(fs_info, dest_addr, dest_length, 0);
+
+ if (trans) {
+ btrfs_abort_transaction(trans, ret);
+ btrfs_end_transaction(trans);
+ }
+ } else {
+ dest_bg = btrfs_lookup_block_group(fs_info, dest_addr);
+ btrfs_free_reserved_bytes(dest_bg, dest_length, 0);
+ btrfs_put_block_group(dest_bg);
+
+ ret = btrfs_commit_transaction(trans);
+ }
+
+ return ret;
+}
+
+static int move_existing_remaps(struct btrfs_fs_info *fs_info,
+ struct btrfs_block_group *bg,
+ struct btrfs_path *path)
+{
+ int ret;
+ struct btrfs_key key;
+ struct extent_buffer *leaf;
+ struct btrfs_remap *remap;
+ u64 old_addr;
+
+ /* Look for backrefs in remap tree. */
+
+ while (bg->remap_bytes > 0) {
+ key.objectid = bg->start;
+ key.type = BTRFS_REMAP_BACKREF_KEY;
+ key.offset = 0;
+
+ ret = btrfs_search_slot(NULL, fs_info->remap_root, &key, path,
+ 0, 0);
+ if (ret < 0)
+ return ret;
+
+ leaf = path->nodes[0];
+
+ if (path->slots[0] >= btrfs_header_nritems(leaf)) {
+ ret = btrfs_next_leaf(fs_info->remap_root, path);
+ if (ret < 0) {
+ btrfs_release_path(path);
+ return ret;
+ }
+
+ if (ret) {
+ btrfs_release_path(path);
+ break;
+ }
+
+ leaf = path->nodes[0];
+ }
+
+ btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
+
+ if (key.type != BTRFS_REMAP_BACKREF_KEY) {
+ path->slots[0]++;
+
+ if (path->slots[0] >= btrfs_header_nritems(leaf)) {
+ ret = btrfs_next_leaf(fs_info->remap_root, path);
+ if (ret < 0) {
+ btrfs_release_path(path);
+ return ret;
+ }
+
+ if (ret) {
+ btrfs_release_path(path);
+ break;
+ }
+
+ leaf = path->nodes[0];
+ }
+ }
+
+ remap = btrfs_item_ptr(leaf, path->slots[0],
+ struct btrfs_remap);
+
+ old_addr = btrfs_remap_address(leaf, remap);
+
+ btrfs_release_path(path);
+
+ ret = move_existing_remap(fs_info, path, bg, key.objectid,
+ key.offset, old_addr);
+ if (ret)
+ return ret;
+ }
+
+ BUG_ON(bg->remap_bytes > 0);
+
+ return 0;
+}
+
static int create_remap_tree_entries(struct btrfs_trans_handle *trans,
struct btrfs_path *path,
struct btrfs_block_group *bg)
@@ -4616,6 +5091,14 @@ int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start,
WARN_ON(ret && ret != -EAGAIN);
if (*using_remap_tree) {
+ if (bg->remap_bytes != 0) {
+ ret = move_existing_remaps(fs_info, bg, path);
+ if (ret) {
+ err = ret;
+ goto out;
+ }
+ }
+
err = start_block_group_remapping(fs_info, path, bg);
goto out;
--
2.49.1
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH v2 13/16] btrfs: replace identity maps with actual remaps when doing relocations
2025-08-13 14:34 [PATCH v2 00/16] btrfs: remap tree Mark Harmstone
` (11 preceding siblings ...)
2025-08-13 14:34 ` [PATCH v2 12/16] btrfs: move existing remaps before relocating block group Mark Harmstone
@ 2025-08-13 14:34 ` Mark Harmstone
2025-08-13 14:34 ` [PATCH v2 14/16] btrfs: add do_remap param to btrfs_discard_extent() Mark Harmstone
` (2 subsequent siblings)
15 siblings, 0 replies; 43+ messages in thread
From: Mark Harmstone @ 2025-08-13 14:34 UTC (permalink / raw)
To: linux-btrfs; +Cc: Mark Harmstone
Add a function do_remap_tree_reloc(), which does the actual work of
relocating a block group using the remap tree.
In a loop we call do_remap_tree_reloc_trans(), which searches for the
first identity remap in the block group. We call btrfs_reserve_extent()
to find space for it elsewhere, read the data into memory, and write it
to the new location. We then carve out the identity remap and replace
it with an actual remap, which points to the new location in which to
look.
Once the last identity remap has been removed we call
last_identity_remap_gone(), which, as with deletions, removes the
chunk's stripes and device extents.
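The carve-out arithmetic can be sketched as a minimal userspace model
(illustrative only, not the kernel code; struct and function names here
are made up for the sketch):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result of carving one chunk out of an identity remap. */
struct carve_result {
	uint64_t head_len;   /* shortened identity remap before the carve */
	uint64_t tail_start; /* remainder identity remap after the carve */
	uint64_t tail_len;
	int identity_delta;  /* net change in identity-remap count */
};

/*
 * Carve [old_addr, old_addr + length) out of the identity remap
 * [id_start, id_start + id_len), mirroring the key arithmetic in
 * add_remap_entry(): the head is shortened (or deleted when empty),
 * and a remainder entry is added if the carve doesn't reach the end.
 */
static struct carve_result carve_identity(uint64_t id_start, uint64_t id_len,
					  uint64_t old_addr, uint64_t length)
{
	struct carve_result r = { 0 };

	r.head_len = old_addr - id_start;
	if (r.head_len == 0)
		r.identity_delta--;	/* whole head consumed: entry deleted */

	if (id_start + id_len != old_addr + length) {
		r.tail_start = old_addr + length;
		r.tail_len = id_start + id_len - old_addr - length;
		r.identity_delta++;	/* remainder becomes a new entry */
	}

	return r;
}
```

A carve in the middle of an identity remap nets one extra identity
entry (head plus tail), while a carve at the start that consumes the
whole entry nets one fewer.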
Signed-off-by: Mark Harmstone <mark@harmstone.com>
---
fs/btrfs/relocation.c | 335 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 335 insertions(+)
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index c4d758d016e2..84ff59866e96 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -4692,6 +4692,61 @@ static int mark_bg_remapped(struct btrfs_trans_handle *trans,
return ret;
}
+static int find_next_identity_remap(struct btrfs_trans_handle *trans,
+ struct btrfs_path *path, u64 bg_end,
+ u64 last_start, u64 *start,
+ u64 *length)
+{
+ int ret;
+ struct btrfs_key key, found_key;
+ struct btrfs_root *remap_root = trans->fs_info->remap_root;
+ struct extent_buffer *leaf;
+
+ key.objectid = last_start;
+ key.type = BTRFS_IDENTITY_REMAP_KEY;
+ key.offset = 0;
+
+ ret = btrfs_search_slot(trans, remap_root, &key, path, 0, 0);
+ if (ret < 0)
+ goto out;
+
+ leaf = path->nodes[0];
+ while (true) {
+ if (path->slots[0] >= btrfs_header_nritems(leaf)) {
+ ret = btrfs_next_leaf(remap_root, path);
+
+ if (ret != 0) {
+ if (ret == 1)
+ ret = -ENOENT;
+ goto out;
+ }
+
+ leaf = path->nodes[0];
+ }
+
+ btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
+
+ if (found_key.objectid >= bg_end) {
+ ret = -ENOENT;
+ goto out;
+ }
+
+ if (found_key.type == BTRFS_IDENTITY_REMAP_KEY) {
+ *start = found_key.objectid;
+ *length = found_key.offset;
+ ret = 0;
+ goto out;
+ }
+
+ path->slots[0]++;
+ }
+
+out:
+ btrfs_release_path(path);
+
+ return ret;
+}
+
static int remove_chunk_stripes(struct btrfs_trans_handle *trans,
struct btrfs_chunk_map *chunk,
struct btrfs_path *path)
@@ -4809,6 +4864,98 @@ static int adjust_identity_remap_count(struct btrfs_trans_handle *trans,
return ret;
}
+static int add_remap_entry(struct btrfs_trans_handle *trans,
+ struct btrfs_path *path,
+ struct btrfs_block_group *src_bg, u64 old_addr,
+ u64 new_addr, u64 length)
+{
+ struct btrfs_fs_info *fs_info = trans->fs_info;
+ struct btrfs_key key, new_key;
+ int ret;
+ int identity_count_delta = 0;
+
+ key.objectid = old_addr;
+ key.type = (u8)-1;
+ key.offset = (u64)-1;
+
+ ret = btrfs_search_slot(trans, fs_info->remap_root, &key, path, -1, 1);
+ if (ret < 0)
+ goto end;
+
+ if (path->slots[0] == 0) {
+ ret = -ENOENT;
+ goto end;
+ }
+
+ path->slots[0]--;
+
+ btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+
+ if (key.type != BTRFS_IDENTITY_REMAP_KEY ||
+ key.objectid > old_addr ||
+ key.objectid + key.offset <= old_addr) {
+ ret = -ENOENT;
+ goto end;
+ }
+
+ /* Shorten or delete identity mapping entry. */
+
+ if (key.objectid == old_addr) {
+ ret = btrfs_del_item(trans, fs_info->remap_root, path);
+ if (ret)
+ goto end;
+
+ identity_count_delta--;
+ } else {
+ new_key.objectid = key.objectid;
+ new_key.type = BTRFS_IDENTITY_REMAP_KEY;
+ new_key.offset = old_addr - key.objectid;
+
+ btrfs_set_item_key_safe(trans, path, &new_key);
+ }
+
+ btrfs_release_path(path);
+
+ /* Create new remap entry. */
+
+ ret = add_remap_item(trans, path, new_addr, length, old_addr);
+ if (ret)
+ goto end;
+
+ /* Add entry for remainder of identity mapping, if necessary. */
+
+ if (key.objectid + key.offset != old_addr + length) {
+ new_key.objectid = old_addr + length;
+ new_key.type = BTRFS_IDENTITY_REMAP_KEY;
+ new_key.offset = key.objectid + key.offset - old_addr - length;
+
+ ret = btrfs_insert_empty_item(trans, fs_info->remap_root,
+ path, &new_key, 0);
+ if (ret)
+ goto end;
+
+ btrfs_release_path(path);
+
+ identity_count_delta++;
+ }
+
+ /* Add backref. */
+
+ ret = add_remap_backref_item(trans, path, new_addr, length, old_addr);
+ if (ret)
+ goto end;
+
+ if (identity_count_delta != 0) {
+ ret = adjust_identity_remap_count(trans, path, src_bg,
+ identity_count_delta);
+ }
+
+end:
+ btrfs_release_path(path);
+
+ return ret;
+}
+
static int mark_chunk_remapped(struct btrfs_trans_handle *trans,
struct btrfs_path *path, uint64_t start)
{
@@ -4858,6 +5005,186 @@ static int mark_chunk_remapped(struct btrfs_trans_handle *trans,
return ret;
}
+static int do_remap_tree_reloc_trans(struct btrfs_fs_info *fs_info,
+ struct btrfs_block_group *src_bg,
+ struct btrfs_path *path, u64 *last_start)
+{
+ struct btrfs_trans_handle *trans;
+ struct btrfs_root *extent_root;
+ struct btrfs_key ins;
+ struct btrfs_block_group *dest_bg = NULL;
+ struct btrfs_chunk_map *chunk;
+ u64 start, remap_length, length, new_addr, min_size;
+ int ret;
+ bool no_more = false;
+ bool is_data = src_bg->flags & BTRFS_BLOCK_GROUP_DATA;
+ bool made_reservation = false, bg_needs_free_space;
+ struct btrfs_space_info *sinfo = src_bg->space_info;
+
+ extent_root = btrfs_extent_root(fs_info, src_bg->start);
+
+ trans = btrfs_start_transaction(extent_root, 0);
+ if (IS_ERR(trans))
+ return PTR_ERR(trans);
+
+ mutex_lock(&fs_info->remap_mutex);
+
+ ret = find_next_identity_remap(trans, path, src_bg->start + src_bg->length,
+ *last_start, &start, &remap_length);
+ if (ret == -ENOENT) {
+ no_more = true;
+ goto next;
+ } else if (ret) {
+ mutex_unlock(&fs_info->remap_mutex);
+ btrfs_end_transaction(trans);
+ return ret;
+ }
+
+ /* Try to reserve enough space for block. */
+
+ spin_lock(&sinfo->lock);
+ btrfs_space_info_update_bytes_may_use(sinfo, remap_length);
+ spin_unlock(&sinfo->lock);
+
+ if (is_data)
+ min_size = fs_info->sectorsize;
+ else
+ min_size = fs_info->nodesize;
+
+ ret = btrfs_reserve_extent(fs_info->fs_root, remap_length,
+ remap_length, min_size,
+ 0, 0, &ins, is_data, false);
+ if (ret) {
+ spin_lock(&sinfo->lock);
+ btrfs_space_info_update_bytes_may_use(sinfo, -remap_length);
+ spin_unlock(&sinfo->lock);
+
+ mutex_unlock(&fs_info->remap_mutex);
+ btrfs_end_transaction(trans);
+ return ret;
+ }
+
+ made_reservation = true;
+
+ new_addr = ins.objectid;
+ length = ins.offset;
+
+ if (!is_data && !IS_ALIGNED(length, fs_info->nodesize)) {
+ u64 new_length = ALIGN_DOWN(length, fs_info->nodesize);
+
+ btrfs_free_reserved_extent(fs_info, new_addr + new_length,
+ length - new_length, 0);
+
+ length = new_length;
+ }
+
+ dest_bg = btrfs_lookup_block_group(fs_info, new_addr);
+
+ mutex_lock(&dest_bg->free_space_lock);
+ bg_needs_free_space = test_bit(BLOCK_GROUP_FLAG_NEEDS_FREE_SPACE,
+ &dest_bg->runtime_flags);
+ mutex_unlock(&dest_bg->free_space_lock);
+
+ if (bg_needs_free_space) {
+ ret = btrfs_add_block_group_free_space(trans, dest_bg);
+ if (ret)
+ goto fail;
+ }
+
+ ret = do_copy(fs_info, start, new_addr, length);
+ if (ret)
+ goto fail;
+
+ ret = btrfs_remove_from_free_space_tree(trans, new_addr, length);
+ if (ret)
+ goto fail;
+
+ ret = add_remap_entry(trans, path, src_bg, start, new_addr, length);
+ if (ret) {
+ btrfs_add_to_free_space_tree(trans, new_addr, length);
+ goto fail;
+ }
+
+ adjust_block_group_remap_bytes(trans, dest_bg, length);
+ btrfs_free_reserved_bytes(dest_bg, length, 0);
+
+ spin_lock(&sinfo->lock);
+ sinfo->bytes_readonly += length;
+ spin_unlock(&sinfo->lock);
+
+next:
+ if (dest_bg)
+ btrfs_put_block_group(dest_bg);
+
+ if (made_reservation)
+ btrfs_dec_block_group_reservations(fs_info, new_addr);
+
+ if (src_bg->used == 0 && src_bg->remap_bytes == 0) {
+ chunk = btrfs_find_chunk_map(fs_info, src_bg->start, 1);
+ if (!chunk) {
+ mutex_unlock(&fs_info->remap_mutex);
+ btrfs_end_transaction(trans);
+ return -ENOENT;
+ }
+
+ ret = last_identity_remap_gone(trans, chunk, src_bg, path);
+ if (ret) {
+ btrfs_free_chunk_map(chunk);
+ mutex_unlock(&fs_info->remap_mutex);
+ btrfs_end_transaction(trans);
+ return ret;
+ }
+
+ btrfs_free_chunk_map(chunk);
+ }
+
+ mutex_unlock(&fs_info->remap_mutex);
+
+ ret = btrfs_end_transaction(trans);
+ if (ret)
+ return ret;
+
+ if (no_more)
+ return 1;
+
+ *last_start = start;
+
+ return 0;
+
+fail:
+ if (dest_bg)
+ btrfs_put_block_group(dest_bg);
+
+ btrfs_free_reserved_extent(fs_info, new_addr, length, 0);
+
+ mutex_unlock(&fs_info->remap_mutex);
+ btrfs_end_transaction(trans);
+
+ return ret;
+}
+
+static int do_remap_tree_reloc(struct btrfs_fs_info *fs_info,
+ struct btrfs_path *path,
+ struct btrfs_block_group *bg)
+{
+ u64 last_start;
+ int ret;
+
+ last_start = bg->start;
+
+ while (true) {
+ ret = do_remap_tree_reloc_trans(fs_info, bg, path,
+ &last_start);
+ if (ret) {
+ if (ret == 1)
+ ret = 0;
+ break;
+ }
+ }
+
+ return ret;
+}
+
int btrfs_translate_remap(struct btrfs_fs_info *fs_info, u64 *logical,
u64 *length, bool nolock)
{
@@ -5100,6 +5427,14 @@ int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start,
}
err = start_block_group_remapping(fs_info, path, bg);
+ if (err)
+ goto out;
+
+ err = do_remap_tree_reloc(fs_info, path, rc->block_group);
+ if (err)
+ goto out;
+
+ btrfs_delete_unused_bgs(fs_info);
goto out;
}
--
2.49.1
* [PATCH v2 14/16] btrfs: add do_remap param to btrfs_discard_extent()
2025-08-13 14:34 [PATCH v2 00/16] btrfs: remap tree Mark Harmstone
` (12 preceding siblings ...)
2025-08-13 14:34 ` [PATCH v2 13/16] btrfs: replace identity maps with actual remaps when doing relocations Mark Harmstone
@ 2025-08-13 14:34 ` Mark Harmstone
2025-08-13 14:34 ` [PATCH v2 15/16] btrfs: add fully_remapped_bgs list Mark Harmstone
2025-08-13 14:34 ` [PATCH v2 16/16] btrfs: allow balancing remap tree Mark Harmstone
15 siblings, 0 replies; 43+ messages in thread
From: Mark Harmstone @ 2025-08-13 14:34 UTC (permalink / raw)
To: linux-btrfs; +Cc: Mark Harmstone
btrfs_discard_extent() can be called either when an extent is removed
or from walking the free-space tree. With a remapped block group these
two things are no longer equivalent: the extent's addresses are
remapped, while the free-space tree exclusively uses underlying
addresses.
Add a do_remap parameter to btrfs_discard_extent() and
btrfs_map_discard(), indicating whether the address needs to be run
through the remap tree first.
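The distinction can be illustrated with a minimal userspace model
(names are illustrative, not the kernel API): with do_remap set, the
logical address is first translated through the remap table; without
it, the caller's address is assumed to already be an underlying
address.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical single-entry remap table: [from, from + len) -> to. */
struct remap_entry {
	uint64_t from;
	uint64_t len;
	uint64_t to;
};

/*
 * Model of the address a discard would be issued against: translate
 * through the remap table only when do_remap is set, as an extent
 * removal would; free-space-tree walkers pass do_remap == false
 * because that tree already stores underlying addresses.
 */
static uint64_t discard_addr(const struct remap_entry *re, uint64_t logical,
			     int do_remap)
{
	if (do_remap && logical >= re->from && logical < re->from + re->len)
		return re->to + (logical - re->from);

	return logical;
}
```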
Signed-off-by: Mark Harmstone <mark@harmstone.com>
---
fs/btrfs/extent-tree.c | 11 +++++++----
fs/btrfs/extent-tree.h | 2 +-
fs/btrfs/free-space-cache.c | 2 +-
fs/btrfs/inode.c | 2 +-
fs/btrfs/volumes.c | 25 +++++++++++++++++++++++--
fs/btrfs/volumes.h | 2 +-
6 files changed, 34 insertions(+), 10 deletions(-)
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 1e5a68addf25..b02e99b41553 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -1380,7 +1380,7 @@ static int do_discard_extent(struct btrfs_discard_stripe *stripe, u64 *bytes)
}
int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,
- u64 num_bytes, u64 *actual_bytes)
+ u64 num_bytes, u64 *actual_bytes, bool do_remap)
{
int ret = 0;
u64 discarded_bytes = 0;
@@ -1398,7 +1398,8 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,
int i;
num_bytes = end - cur;
- stripes = btrfs_map_discard(fs_info, cur, &num_bytes, &num_stripes);
+ stripes = btrfs_map_discard(fs_info, cur, &num_bytes,
+ &num_stripes, do_remap);
if (IS_ERR(stripes)) {
ret = PTR_ERR(stripes);
if (ret == -EOPNOTSUPP)
@@ -2868,7 +2869,8 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans)
if (btrfs_test_opt(fs_info, DISCARD_SYNC))
ret = btrfs_discard_extent(fs_info, start,
- end + 1 - start, NULL);
+ end + 1 - start, NULL,
+ true);
next_state = btrfs_next_extent_state(unpin, cached_state);
btrfs_clear_extent_dirty(unpin, start, end, &cached_state);
@@ -2926,7 +2928,8 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans)
ret = -EROFS;
if (!TRANS_ABORTED(trans))
ret = btrfs_discard_extent(fs_info, block_group->start,
- block_group->length, NULL);
+ block_group->length, NULL,
+ true);
/*
* Not strictly necessary to lock, as the block_group should be
diff --git a/fs/btrfs/extent-tree.h b/fs/btrfs/extent-tree.h
index 82d3a82dc712..679f6f1319b7 100644
--- a/fs/btrfs/extent-tree.h
+++ b/fs/btrfs/extent-tree.h
@@ -163,7 +163,7 @@ int btrfs_drop_subtree(struct btrfs_trans_handle *trans,
struct extent_buffer *parent);
void btrfs_error_unpin_extent_range(struct btrfs_fs_info *fs_info, u64 start, u64 end);
int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,
- u64 num_bytes, u64 *actual_bytes);
+ u64 num_bytes, u64 *actual_bytes, bool do_remap);
int btrfs_trim_fs(struct btrfs_fs_info *fs_info, struct fstrim_range *range);
#endif
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 5d8d1570a5c9..b9ef7a8a7996 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -3672,7 +3672,7 @@ static int do_trimming(struct btrfs_block_group *block_group,
spin_unlock(&block_group->lock);
spin_unlock(&space_info->lock);
- ret = btrfs_discard_extent(fs_info, start, bytes, &trimmed);
+ ret = btrfs_discard_extent(fs_info, start, bytes, &trimmed, false);
if (!ret) {
*total_trimmed += trimmed;
trim_state = BTRFS_TRIM_STATE_TRIMMED;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 4001cd2a08b1..ef61007892bd 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3293,7 +3293,7 @@ int btrfs_finish_one_ordered(struct btrfs_ordered_extent *ordered_extent)
btrfs_discard_extent(fs_info,
ordered_extent->disk_bytenr,
ordered_extent->disk_num_bytes,
- NULL);
+ NULL, true);
btrfs_free_reserved_extent(fs_info,
ordered_extent->disk_bytenr,
ordered_extent->disk_num_bytes, true);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index b4fe2e928992..e13f16a7a904 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3490,7 +3490,8 @@ int btrfs_relocate_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset,
* filesystem's point of view.
*/
if (btrfs_is_zoned(fs_info)) {
- ret = btrfs_discard_extent(fs_info, chunk_offset, length, NULL);
+ ret = btrfs_discard_extent(fs_info, chunk_offset, length, NULL,
+ true);
if (ret)
btrfs_info(fs_info,
"failed to reset zone %llu after relocation",
@@ -6143,7 +6144,7 @@ void btrfs_put_bioc(struct btrfs_io_context *bioc)
*/
struct btrfs_discard_stripe *btrfs_map_discard(struct btrfs_fs_info *fs_info,
u64 logical, u64 *length_ret,
- u32 *num_stripes)
+ u32 *num_stripes, bool do_remap)
{
struct btrfs_chunk_map *map;
struct btrfs_discard_stripe *stripes;
@@ -6167,6 +6168,26 @@ struct btrfs_discard_stripe *btrfs_map_discard(struct btrfs_fs_info *fs_info,
if (IS_ERR(map))
return ERR_CAST(map);
+ if (do_remap && map->type & BTRFS_BLOCK_GROUP_REMAPPED) {
+ u64 new_logical = logical;
+
+ ret = btrfs_translate_remap(fs_info, &new_logical, &length,
+ false);
+ if (ret)
+ goto out_free_map;
+
+ if (new_logical != logical) {
+ btrfs_free_chunk_map(map);
+
+ map = btrfs_get_chunk_map(fs_info, new_logical,
+ length);
+ if (IS_ERR(map))
+ return ERR_CAST(map);
+
+ logical = new_logical;
+ }
+ }
+
/* we don't discard raid56 yet */
if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) {
ret = -EOPNOTSUPP;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 64b34710b68b..7abf3b119345 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -727,7 +727,7 @@ int btrfs_map_repair_block(struct btrfs_fs_info *fs_info,
u32 length, int mirror_num);
struct btrfs_discard_stripe *btrfs_map_discard(struct btrfs_fs_info *fs_info,
u64 logical, u64 *length_ret,
- u32 *num_stripes);
+ u32 *num_stripes, bool do_remap);
int btrfs_read_sys_array(struct btrfs_fs_info *fs_info);
int btrfs_read_chunk_tree(struct btrfs_fs_info *fs_info);
struct btrfs_block_group *btrfs_create_chunk(struct btrfs_trans_handle *trans,
--
2.49.1
* [PATCH v2 15/16] btrfs: add fully_remapped_bgs list
2025-08-13 14:34 [PATCH v2 00/16] btrfs: remap tree Mark Harmstone
` (13 preceding siblings ...)
2025-08-13 14:34 ` [PATCH v2 14/16] btrfs: add do_remap param to btrfs_discard_extent() Mark Harmstone
@ 2025-08-13 14:34 ` Mark Harmstone
2025-08-16 0:56 ` Boris Burkov
2025-08-13 14:34 ` [PATCH v2 16/16] btrfs: allow balancing remap tree Mark Harmstone
15 siblings, 1 reply; 43+ messages in thread
From: Mark Harmstone @ 2025-08-13 14:34 UTC (permalink / raw)
To: linux-btrfs; +Cc: Mark Harmstone
Add a fully_remapped_bgs list to struct btrfs_transaction, holding
block groups whose last identity remap has just been removed.
In btrfs_finish_extent_commit() we can then discard their full dev
extents, as we're also setting their num_stripes to 0. Finally, if the
BG is now empty, i.e. there are neither identity remaps nor normal
remaps left, add it to the unused_bgs list to be taken care of there.
Signed-off-by: Mark Harmstone <mark@harmstone.com>
---
fs/btrfs/block-group.c | 26 ++++++++++++++++++++++++++
fs/btrfs/block-group.h | 2 ++
fs/btrfs/extent-tree.c | 37 ++++++++++++++++++++++++++++++++++++-
fs/btrfs/relocation.c | 2 ++
fs/btrfs/transaction.c | 1 +
fs/btrfs/transaction.h | 1 +
6 files changed, 68 insertions(+), 1 deletion(-)
diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 7a0524138235..7f8707dfd62c 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1803,6 +1803,14 @@ void btrfs_mark_bg_unused(struct btrfs_block_group *bg)
struct btrfs_fs_info *fs_info = bg->fs_info;
spin_lock(&fs_info->unused_bgs_lock);
+
+ /* Leave fully remapped block groups on the fully_remapped_bgs list. */
+ if (bg->flags & BTRFS_BLOCK_GROUP_REMAPPED &&
+ bg->identity_remap_count == 0) {
+ spin_unlock(&fs_info->unused_bgs_lock);
+ return;
+ }
+
if (list_empty(&bg->bg_list)) {
btrfs_get_block_group(bg);
trace_btrfs_add_unused_block_group(bg);
@@ -4792,3 +4800,21 @@ bool btrfs_block_group_should_use_size_class(const struct btrfs_block_group *bg)
return false;
return true;
}
+
+void btrfs_mark_bg_fully_remapped(struct btrfs_block_group *bg,
+ struct btrfs_trans_handle *trans)
+{
+ struct btrfs_fs_info *fs_info = trans->fs_info;
+
+ spin_lock(&fs_info->unused_bgs_lock);
+
+ if (!list_empty(&bg->bg_list))
+ list_del(&bg->bg_list);
+ else
+ btrfs_get_block_group(bg);
+
+ list_add_tail(&bg->bg_list, &trans->transaction->fully_remapped_bgs);
+
+ spin_unlock(&fs_info->unused_bgs_lock);
+
+}
diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index 0433b0127ed8..025ea2c6f8a8 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -408,5 +408,7 @@ int btrfs_use_block_group_size_class(struct btrfs_block_group *bg,
enum btrfs_block_group_size_class size_class,
bool force_wrong_size_class);
bool btrfs_block_group_should_use_size_class(const struct btrfs_block_group *bg);
+void btrfs_mark_bg_fully_remapped(struct btrfs_block_group *bg,
+ struct btrfs_trans_handle *trans);
#endif /* BTRFS_BLOCK_GROUP_H */
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index b02e99b41553..157a032df128 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2853,7 +2853,7 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans)
{
struct btrfs_fs_info *fs_info = trans->fs_info;
struct btrfs_block_group *block_group, *tmp;
- struct list_head *deleted_bgs;
+ struct list_head *deleted_bgs, *fully_remapped_bgs;
struct extent_io_tree *unpin = &trans->transaction->pinned_extents;
struct extent_state *cached_state = NULL;
u64 start;
@@ -2951,6 +2951,41 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans)
}
}
+ fully_remapped_bgs = &trans->transaction->fully_remapped_bgs;
+ list_for_each_entry_safe(block_group, tmp, fully_remapped_bgs, bg_list) {
+ struct btrfs_chunk_map *map;
+
+ if (!TRANS_ABORTED(trans))
+ ret = btrfs_discard_extent(fs_info, block_group->start,
+ block_group->length, NULL,
+ false);
+
+ map = btrfs_get_chunk_map(fs_info, block_group->start, 1);
+ if (IS_ERR(map))
+ return PTR_ERR(map);
+
+ /*
+ * Set num_stripes to 0, so that btrfs_remove_dev_extents()
+ * won't run a second time.
+ */
+ map->num_stripes = 0;
+
+ btrfs_free_chunk_map(map);
+
+ if (block_group->used == 0 && block_group->remap_bytes == 0) {
+ spin_lock(&fs_info->unused_bgs_lock);
+ list_move_tail(&block_group->bg_list,
+ &fs_info->unused_bgs);
+ spin_unlock(&fs_info->unused_bgs_lock);
+ } else {
+ spin_lock(&fs_info->unused_bgs_lock);
+ list_del_init(&block_group->bg_list);
+ spin_unlock(&fs_info->unused_bgs_lock);
+
+ btrfs_put_block_group(block_group);
+ }
+ }
+
return unpin_error;
}
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 84ff59866e96..0745a3d1c867 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -4819,6 +4819,8 @@ static int last_identity_remap_gone(struct btrfs_trans_handle *trans,
if (ret)
return ret;
+ btrfs_mark_bg_fully_remapped(bg, trans);
+
return 0;
}
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 64b9c427af6a..7c308d33e767 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -381,6 +381,7 @@ static noinline int join_transaction(struct btrfs_fs_info *fs_info,
mutex_init(&cur_trans->cache_write_mutex);
spin_lock_init(&cur_trans->dirty_bgs_lock);
INIT_LIST_HEAD(&cur_trans->deleted_bgs);
+ INIT_LIST_HEAD(&cur_trans->fully_remapped_bgs);
spin_lock_init(&cur_trans->dropped_roots_lock);
list_add_tail(&cur_trans->list, &fs_info->trans_list);
btrfs_extent_io_tree_init(fs_info, &cur_trans->dirty_pages,
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 9f7c777af635..b362915288b5 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -109,6 +109,7 @@ struct btrfs_transaction {
spinlock_t dirty_bgs_lock;
/* Protected by spin lock fs_info->unused_bgs_lock. */
struct list_head deleted_bgs;
+ struct list_head fully_remapped_bgs;
spinlock_t dropped_roots_lock;
struct btrfs_delayed_ref_root delayed_refs;
struct btrfs_fs_info *fs_info;
--
2.49.1
* [PATCH v2 16/16] btrfs: allow balancing remap tree
2025-08-13 14:34 [PATCH v2 00/16] btrfs: remap tree Mark Harmstone
` (14 preceding siblings ...)
2025-08-13 14:34 ` [PATCH v2 15/16] btrfs: add fully_remapped_bgs list Mark Harmstone
@ 2025-08-13 14:34 ` Mark Harmstone
2025-08-16 1:02 ` Boris Burkov
15 siblings, 1 reply; 43+ messages in thread
From: Mark Harmstone @ 2025-08-13 14:34 UTC (permalink / raw)
To: linux-btrfs; +Cc: Mark Harmstone
Balancing the REMAP chunk, i.e. the chunk in which the remap tree lives,
is a special case.
We can't use the remap tree itself for this, as then we'd have no way to
bootstrap it on mount. And we can't use the pre-remap tree code for this
as it relies on walking the extent tree, and we're not creating backrefs
for REMAP chunks.
So instead, if a balance would relocate any REMAP block groups, mark
those block groups as readonly and COW every leaf of the remap tree.
There are more sophisticated ways of doing this, such as only COWing
nodes within a block group that's to be relocated, but they're fiddly,
with lots of edge cases. Plus it's not anticipated that a) the number
of REMAP chunks will be particularly large, or b) users will want to
relocate only some of these chunks - the main use case here is to
unbreak RAID conversion and device removal.
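The filter mapping at the top of should_balance_chunk() boils down to
a bit substitution; a minimal sketch with stand-in flag values (the
real BTRFS_BLOCK_GROUP_* constants differ):

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in flag values; the real BTRFS_BLOCK_GROUP_* constants differ. */
#define BG_DATA     (1ULL << 0)
#define BG_SYSTEM   (1ULL << 1)
#define BG_METADATA (1ULL << 2)
#define BG_REMAP    (1ULL << 3)

/*
 * For balance-filter purposes a REMAP chunk is treated as METADATA:
 * clear the REMAP bit and set METADATA before applying type filters.
 */
static uint64_t balance_filter_type(uint64_t chunk_type)
{
	if (chunk_type & BG_REMAP) {
		chunk_type &= ~BG_REMAP;
		chunk_type |= BG_METADATA;
	}

	return chunk_type;
}
```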
Signed-off-by: Mark Harmstone <mark@harmstone.com>
---
fs/btrfs/volumes.c | 161 +++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 157 insertions(+), 4 deletions(-)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index e13f16a7a904..dc535ed90ae0 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4011,8 +4011,11 @@ static bool should_balance_chunk(struct extent_buffer *leaf, struct btrfs_chunk
struct btrfs_balance_args *bargs = NULL;
u64 chunk_type = btrfs_chunk_type(leaf, chunk);
- if (chunk_type & BTRFS_BLOCK_GROUP_REMAP)
- return false;
+ /* treat REMAP chunks as METADATA */
+ if (chunk_type & BTRFS_BLOCK_GROUP_REMAP) {
+ chunk_type &= ~BTRFS_BLOCK_GROUP_REMAP;
+ chunk_type |= BTRFS_BLOCK_GROUP_METADATA;
+ }
/* type filter */
if (!((chunk_type & BTRFS_BLOCK_GROUP_TYPE_MASK) &
@@ -4095,6 +4098,113 @@ static bool should_balance_chunk(struct extent_buffer *leaf, struct btrfs_chunk
return true;
}
+struct remap_chunk_info {
+ struct list_head list;
+ u64 offset;
+ struct btrfs_block_group *bg;
+ bool made_ro;
+};
+
+static int cow_remap_tree(struct btrfs_trans_handle *trans,
+ struct btrfs_path *path)
+{
+ struct btrfs_fs_info *fs_info = trans->fs_info;
+ struct btrfs_key key = { 0 };
+ int ret;
+
+ ret = btrfs_search_slot(trans, fs_info->remap_root, &key, path, 0, 1);
+ if (ret < 0)
+ return ret;
+
+ while (true) {
+ ret = btrfs_next_leaf(fs_info->remap_root, path);
+ if (ret < 0) {
+ return ret;
+ } else if (ret > 0) {
+ ret = 0;
+ break;
+ }
+
+ btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+
+ btrfs_release_path(path);
+
+ ret = btrfs_search_slot(trans, fs_info->remap_root, &key, path,
+ 0, 1);
+ if (ret < 0)
+ break;
+ }
+
+ return ret;
+}
+
+static int balance_remap_chunks(struct btrfs_fs_info *fs_info,
+ struct btrfs_path *path,
+ struct list_head *chunks)
+{
+ struct remap_chunk_info *rci, *tmp;
+ struct btrfs_trans_handle *trans;
+ int ret;
+
+ list_for_each_entry_safe(rci, tmp, chunks, list) {
+ rci->bg = btrfs_lookup_block_group(fs_info, rci->offset);
+ if (!rci->bg) {
+ list_del(&rci->list);
+ kfree(rci);
+ continue;
+ }
+
+ ret = btrfs_inc_block_group_ro(rci->bg, false);
+ if (ret)
+ goto end;
+
+ rci->made_ro = true;
+ }
+
+ if (list_empty(chunks))
+ return 0;
+
+ trans = btrfs_start_transaction(fs_info->remap_root, 0);
+ if (IS_ERR(trans)) {
+ ret = PTR_ERR(trans);
+ goto end;
+ }
+
+ mutex_lock(&fs_info->remap_mutex);
+
+ ret = cow_remap_tree(trans, path);
+
+ btrfs_release_path(path);
+
+ mutex_unlock(&fs_info->remap_mutex);
+
+ btrfs_commit_transaction(trans);
+
+end:
+ while (!list_empty(chunks)) {
+ bool unused;
+
+ rci = list_first_entry(chunks, struct remap_chunk_info, list);
+
+ spin_lock(&rci->bg->lock);
+ unused = !btrfs_is_block_group_used(rci->bg);
+ spin_unlock(&rci->bg->lock);
+
+ if (unused)
+ btrfs_mark_bg_unused(rci->bg);
+
+ if (rci->made_ro)
+ btrfs_dec_block_group_ro(rci->bg);
+
+ btrfs_put_block_group(rci->bg);
+
+ list_del(&rci->list);
+ kfree(rci);
+ }
+
+ return ret;
+}
+
static int __btrfs_balance(struct btrfs_fs_info *fs_info)
{
struct btrfs_balance_control *bctl = fs_info->balance_ctl;
@@ -4117,6 +4227,9 @@ static int __btrfs_balance(struct btrfs_fs_info *fs_info)
u32 count_meta = 0;
u32 count_sys = 0;
int chunk_reserved = 0;
+ struct remap_chunk_info *rci;
+ unsigned int num_remap_chunks = 0;
+ LIST_HEAD(remap_chunks);
path = btrfs_alloc_path();
if (!path) {
@@ -4215,7 +4328,8 @@ static int __btrfs_balance(struct btrfs_fs_info *fs_info)
count_data++;
else if (chunk_type & BTRFS_BLOCK_GROUP_SYSTEM)
count_sys++;
- else if (chunk_type & BTRFS_BLOCK_GROUP_METADATA)
+ else if (chunk_type & (BTRFS_BLOCK_GROUP_METADATA |
+ BTRFS_BLOCK_GROUP_REMAP))
count_meta++;
goto loop;
@@ -4235,6 +4349,30 @@ static int __btrfs_balance(struct btrfs_fs_info *fs_info)
goto loop;
}
+ /*
+	 * Balancing REMAP chunks takes place separately - add their
+	 * details to a list so they can be processed later.
+ */
+ if (chunk_type & BTRFS_BLOCK_GROUP_REMAP) {
+ mutex_unlock(&fs_info->reclaim_bgs_lock);
+
+ rci = kmalloc(sizeof(struct remap_chunk_info),
+ GFP_NOFS);
+ if (!rci) {
+ ret = -ENOMEM;
+ goto error;
+ }
+
+ rci->offset = found_key.offset;
+ rci->bg = NULL;
+ rci->made_ro = false;
+ list_add_tail(&rci->list, &remap_chunks);
+
+ num_remap_chunks++;
+
+ goto loop;
+ }
+
if (!chunk_reserved) {
/*
* We may be relocating the only data chunk we have,
@@ -4274,11 +4412,26 @@ static int __btrfs_balance(struct btrfs_fs_info *fs_info)
key.offset = found_key.offset - 1;
}
+ btrfs_release_path(path);
+
if (counting) {
- btrfs_release_path(path);
counting = false;
goto again;
}
+
+ if (!list_empty(&remap_chunks)) {
+ ret = balance_remap_chunks(fs_info, path, &remap_chunks);
+ if (ret == -ENOSPC)
+ enospc_errors++;
+
+ if (!ret) {
+ btrfs_delete_unused_bgs(fs_info);
+
+ spin_lock(&fs_info->balance_lock);
+ bctl->stat.completed += num_remap_chunks;
+ spin_unlock(&fs_info->balance_lock);
+ }
+ }
error:
btrfs_free_path(path);
if (enospc_errors) {
--
2.49.1
^ permalink raw reply related [flat|nested] 43+ messages in thread
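The two-pass pattern in __btrfs_balance() above — defer REMAP chunks to a list during the chunk walk, then hand the whole list to balance_remap_chunks() afterwards — can be sketched standalone as follows. This is only an illustration of the control flow; all names, flag values, and the linked-list shape here are hypothetical, not the kernel API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical chunk-type flags; only BG_REMAP matters for the dispatch. */
#define BG_DATA     (1ULL << 0)
#define BG_METADATA (1ULL << 2)
#define BG_REMAP    (1ULL << 3)

struct remap_chunk_info {
	unsigned long long offset;
	struct remap_chunk_info *next;
};

/*
 * First pass over the chunks: relocate ordinary chunks immediately
 * (modelled as a counter here) and defer REMAP chunks to a list,
 * mirroring the "goto loop" in the patch.
 */
static unsigned int balance_pass(const unsigned long long *types,
				 const unsigned long long *offsets,
				 size_t n,
				 struct remap_chunk_info **deferred)
{
	unsigned int relocated = 0;

	for (size_t i = 0; i < n; i++) {
		if (types[i] & BG_REMAP) {
			struct remap_chunk_info *rci = malloc(sizeof(*rci));

			if (!rci)
				break;
			rci->offset = offsets[i];
			rci->next = *deferred;
			*deferred = rci;
			continue;	/* processed later, not in this pass */
		}
		relocated++;		/* ordinary chunk: relocate now */
	}
	return relocated;
}

/* Second pass: consume the deferred list, freeing each entry. */
static unsigned int process_deferred(struct remap_chunk_info **deferred)
{
	unsigned int processed = 0;

	while (*deferred) {
		struct remap_chunk_info *rci = *deferred;

		*deferred = rci->next;
		free(rci);
		processed++;
	}
	return processed;
}
```

The kernel version additionally marks each deferred block group read-only before COWing the remap tree, and undoes that in its cleanup loop; the sketch keeps only the defer-then-drain shape.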
* Re: [PATCH v2 01/16] btrfs: add definitions and constants for remap-tree
2025-08-13 14:34 ` [PATCH v2 01/16] btrfs: add definitions and constants for remap-tree Mark Harmstone
@ 2025-08-15 23:51 ` Boris Burkov
2025-08-18 17:21 ` Mark Harmstone
2025-08-16 0:01 ` Qu Wenruo
1 sibling, 1 reply; 43+ messages in thread
From: Boris Burkov @ 2025-08-15 23:51 UTC (permalink / raw)
To: Mark Harmstone; +Cc: linux-btrfs
On Wed, Aug 13, 2025 at 03:34:43PM +0100, Mark Harmstone wrote:
> Add an incompat flag for the new remap-tree feature, and the constants
> and definitions needed to support it.
>
> Signed-off-by: Mark Harmstone <mark@harmstone.com>
Some formatting nits, but you can add
Reviewed-by: Boris Burkov <boris@bur.io>
> ---
> fs/btrfs/accessors.h | 3 +++
> fs/btrfs/locking.c | 1 +
> fs/btrfs/sysfs.c | 2 ++
> fs/btrfs/tree-checker.c | 6 ++----
> fs/btrfs/tree-checker.h | 5 +++++
> fs/btrfs/volumes.c | 1 +
> include/uapi/linux/btrfs.h | 1 +
> include/uapi/linux/btrfs_tree.h | 12 ++++++++++++
> 8 files changed, 27 insertions(+), 4 deletions(-)
>
> diff --git a/fs/btrfs/accessors.h b/fs/btrfs/accessors.h
> index 99b3ced12805..95a1ca8c099b 100644
> --- a/fs/btrfs/accessors.h
> +++ b/fs/btrfs/accessors.h
> @@ -1009,6 +1009,9 @@ BTRFS_SETGET_STACK_FUNCS(stack_verity_descriptor_encryption,
> BTRFS_SETGET_STACK_FUNCS(stack_verity_descriptor_size,
> struct btrfs_verity_descriptor_item, size, 64);
>
> +BTRFS_SETGET_FUNCS(remap_address, struct btrfs_remap, address, 64);
> +BTRFS_SETGET_STACK_FUNCS(stack_remap_address, struct btrfs_remap, address, 64);
> +
> /* Cast into the data area of the leaf. */
> #define btrfs_item_ptr(leaf, slot, type) \
> ((type *)(btrfs_item_nr_offset(leaf, 0) + btrfs_item_offset(leaf, slot)))
> diff --git a/fs/btrfs/locking.c b/fs/btrfs/locking.c
> index a3e6d9616e60..26f810258486 100644
> --- a/fs/btrfs/locking.c
> +++ b/fs/btrfs/locking.c
> @@ -73,6 +73,7 @@ static struct btrfs_lockdep_keyset {
> { .id = BTRFS_FREE_SPACE_TREE_OBJECTID, DEFINE_NAME("free-space") },
> { .id = BTRFS_BLOCK_GROUP_TREE_OBJECTID, DEFINE_NAME("block-group") },
> { .id = BTRFS_RAID_STRIPE_TREE_OBJECTID, DEFINE_NAME("raid-stripe") },
> + { .id = BTRFS_REMAP_TREE_OBJECTID, DEFINE_NAME("remap-tree") },
> { .id = 0, DEFINE_NAME("tree") },
> };
>
> diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
> index 81f52c1f55ce..857d2772db1c 100644
> --- a/fs/btrfs/sysfs.c
> +++ b/fs/btrfs/sysfs.c
> @@ -291,6 +291,7 @@ BTRFS_FEAT_ATTR_COMPAT_RO(free_space_tree, FREE_SPACE_TREE);
> BTRFS_FEAT_ATTR_COMPAT_RO(block_group_tree, BLOCK_GROUP_TREE);
> BTRFS_FEAT_ATTR_INCOMPAT(raid1c34, RAID1C34);
> BTRFS_FEAT_ATTR_INCOMPAT(simple_quota, SIMPLE_QUOTA);
> +BTRFS_FEAT_ATTR_INCOMPAT(remap_tree, REMAP_TREE);
> #ifdef CONFIG_BLK_DEV_ZONED
> BTRFS_FEAT_ATTR_INCOMPAT(zoned, ZONED);
> #endif
> @@ -325,6 +326,7 @@ static struct attribute *btrfs_supported_feature_attrs[] = {
> BTRFS_FEAT_ATTR_PTR(raid1c34),
> BTRFS_FEAT_ATTR_PTR(block_group_tree),
> BTRFS_FEAT_ATTR_PTR(simple_quota),
> + BTRFS_FEAT_ATTR_PTR(remap_tree),
> #ifdef CONFIG_BLK_DEV_ZONED
> BTRFS_FEAT_ATTR_PTR(zoned),
> #endif
> diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
> index 0f556f4de3f9..76ec3698f197 100644
> --- a/fs/btrfs/tree-checker.c
> +++ b/fs/btrfs/tree-checker.c
> @@ -912,12 +912,10 @@ int btrfs_check_chunk_valid(const struct btrfs_fs_info *fs_info,
> length, btrfs_stripe_nr_to_offset(U32_MAX));
> return -EUCLEAN;
> }
> - if (unlikely(type & ~(BTRFS_BLOCK_GROUP_TYPE_MASK |
> - BTRFS_BLOCK_GROUP_PROFILE_MASK))) {
> + if (unlikely(type & ~BTRFS_BLOCK_GROUP_VALID)) {
> chunk_err(fs_info, leaf, chunk, logical,
> "unrecognized chunk type: 0x%llx",
> - ~(BTRFS_BLOCK_GROUP_TYPE_MASK |
> - BTRFS_BLOCK_GROUP_PROFILE_MASK) & type);
> + type & ~BTRFS_BLOCK_GROUP_VALID);
> return -EUCLEAN;
> }
>
> diff --git a/fs/btrfs/tree-checker.h b/fs/btrfs/tree-checker.h
> index eb201f4ec3c7..833e2fd989eb 100644
> --- a/fs/btrfs/tree-checker.h
> +++ b/fs/btrfs/tree-checker.h
> @@ -57,6 +57,11 @@ enum btrfs_tree_block_status {
> BTRFS_TREE_BLOCK_WRITTEN_NOT_SET,
> };
>
> +
> +#define BTRFS_BLOCK_GROUP_VALID (BTRFS_BLOCK_GROUP_TYPE_MASK | \
> + BTRFS_BLOCK_GROUP_PROFILE_MASK | \
> + BTRFS_BLOCK_GROUP_REMAPPED)
> +
I think the next two lines should be lined up after the '('
See the masks in include/uapi/linux/btrfs_tree.h
> /*
> * Exported simply for btrfs-progs which wants to have the
> * btrfs_tree_block_status return codes.
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index fa7a929a0461..e067e9cd68a5 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -231,6 +231,7 @@ void btrfs_describe_block_groups(u64 bg_flags, char *buf, u32 size_buf)
> DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_DATA, "data");
> DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_SYSTEM, "system");
> DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_METADATA, "metadata");
> + DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_REMAPPED, "remapped");
>
> DESCRIBE_FLAG(BTRFS_AVAIL_ALLOC_BIT_SINGLE, "single");
> for (i = 0; i < BTRFS_NR_RAID_TYPES; i++)
> diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
> index 8e710bbb688e..fba303ed49e6 100644
> --- a/include/uapi/linux/btrfs.h
> +++ b/include/uapi/linux/btrfs.h
> @@ -336,6 +336,7 @@ struct btrfs_ioctl_fs_info_args {
> #define BTRFS_FEATURE_INCOMPAT_EXTENT_TREE_V2 (1ULL << 13)
> #define BTRFS_FEATURE_INCOMPAT_RAID_STRIPE_TREE (1ULL << 14)
> #define BTRFS_FEATURE_INCOMPAT_SIMPLE_QUOTA (1ULL << 16)
> +#define BTRFS_FEATURE_INCOMPAT_REMAP_TREE (1ULL << 17)
>
> struct btrfs_ioctl_feature_flags {
> __u64 compat_flags;
> diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
> index fc29d273845d..4439d77a7252 100644
> --- a/include/uapi/linux/btrfs_tree.h
> +++ b/include/uapi/linux/btrfs_tree.h
> @@ -76,6 +76,9 @@
> /* Tracks RAID stripes in block groups. */
> #define BTRFS_RAID_STRIPE_TREE_OBJECTID 12ULL
>
> +/* Holds details of remapped addresses after relocation. */
> +#define BTRFS_REMAP_TREE_OBJECTID 13ULL
> +
> /* device stats in the device tree */
> #define BTRFS_DEV_STATS_OBJECTID 0ULL
>
> @@ -282,6 +285,10 @@
>
> #define BTRFS_RAID_STRIPE_KEY 230
>
> +#define BTRFS_IDENTITY_REMAP_KEY 234
> +#define BTRFS_REMAP_KEY 235
> +#define BTRFS_REMAP_BACKREF_KEY 236
more funny indenting
> +
> /*
> * Records the overall state of the qgroups.
> * There's only one instance of this key present,
> @@ -1161,6 +1168,7 @@ struct btrfs_dev_replace_item {
> #define BTRFS_BLOCK_GROUP_RAID6 (1ULL << 8)
> #define BTRFS_BLOCK_GROUP_RAID1C3 (1ULL << 9)
> #define BTRFS_BLOCK_GROUP_RAID1C4 (1ULL << 10)
> +#define BTRFS_BLOCK_GROUP_REMAPPED (1ULL << 11)
> #define BTRFS_BLOCK_GROUP_RESERVED (BTRFS_AVAIL_ALLOC_BIT_SINGLE | \
> BTRFS_SPACE_INFO_GLOBAL_RSV)
>
> @@ -1323,4 +1331,8 @@ struct btrfs_verity_descriptor_item {
> __u8 encryption;
> } __attribute__ ((__packed__));
>
> +struct btrfs_remap {
> + __le64 address;
> +} __attribute__ ((__packed__));
> +
> #endif /* _BTRFS_CTREE_H_ */
> --
> 2.49.1
>
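The review above touches on the new BTRFS_BLOCK_GROUP_VALID mask introduced in tree-checker.h. As a standalone illustration of the consolidated validity test — a chunk type is valid iff it sets no bits outside the mask — here is a minimal sketch. The masks are deliberately simplified (the real PROFILE mask covers every RAID bit); only the bit values shown are copied from the patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit values as in the patch's btrfs_tree.h additions. */
#define BTRFS_BLOCK_GROUP_DATA      (1ULL << 0)
#define BTRFS_BLOCK_GROUP_SYSTEM    (1ULL << 1)
#define BTRFS_BLOCK_GROUP_METADATA  (1ULL << 2)
#define BTRFS_BLOCK_GROUP_RAID1     (1ULL << 4)
#define BTRFS_BLOCK_GROUP_REMAPPED  (1ULL << 11)

/* Simplified masks for illustration only. */
#define BTRFS_BLOCK_GROUP_TYPE_MASK    (BTRFS_BLOCK_GROUP_DATA |	\
					BTRFS_BLOCK_GROUP_SYSTEM |	\
					BTRFS_BLOCK_GROUP_METADATA)
#define BTRFS_BLOCK_GROUP_PROFILE_MASK BTRFS_BLOCK_GROUP_RAID1
#define BTRFS_BLOCK_GROUP_VALID        (BTRFS_BLOCK_GROUP_TYPE_MASK |	 \
					BTRFS_BLOCK_GROUP_PROFILE_MASK | \
					BTRFS_BLOCK_GROUP_REMAPPED)

/* Valid iff no bits outside the known type/profile/remapped set. */
static bool chunk_type_valid(uint64_t type)
{
	return (type & ~BTRFS_BLOCK_GROUP_VALID) == 0;
}
```

Note how REMAPPED composes with any type or profile bit, which is why it sits alongside the two masks rather than inside BTRFS_BLOCK_GROUP_TYPE_MASK — the question Qu raises in the next reply.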
* Re: [PATCH v2 01/16] btrfs: add definitions and constants for remap-tree
2025-08-13 14:34 ` [PATCH v2 01/16] btrfs: add definitions and constants for remap-tree Mark Harmstone
2025-08-15 23:51 ` Boris Burkov
@ 2025-08-16 0:01 ` Qu Wenruo
2025-08-16 0:17 ` Qu Wenruo
1 sibling, 1 reply; 43+ messages in thread
From: Qu Wenruo @ 2025-08-16 0:01 UTC (permalink / raw)
To: Mark Harmstone, linux-btrfs
On 2025/8/14 00:04, Mark Harmstone wrote:
> Add an incompat flag for the new remap-tree feature, and the constants
> and definitions needed to support it.
>
> Signed-off-by: Mark Harmstone <mark@harmstone.com>
> ---
> fs/btrfs/accessors.h | 3 +++
> fs/btrfs/locking.c | 1 +
> fs/btrfs/sysfs.c | 2 ++
> fs/btrfs/tree-checker.c | 6 ++----
> fs/btrfs/tree-checker.h | 5 +++++
> fs/btrfs/volumes.c | 1 +
> include/uapi/linux/btrfs.h | 1 +
> include/uapi/linux/btrfs_tree.h | 12 ++++++++++++
> 8 files changed, 27 insertions(+), 4 deletions(-)
>
> diff --git a/fs/btrfs/accessors.h b/fs/btrfs/accessors.h
> index 99b3ced12805..95a1ca8c099b 100644
> --- a/fs/btrfs/accessors.h
> +++ b/fs/btrfs/accessors.h
> @@ -1009,6 +1009,9 @@ BTRFS_SETGET_STACK_FUNCS(stack_verity_descriptor_encryption,
> BTRFS_SETGET_STACK_FUNCS(stack_verity_descriptor_size,
> struct btrfs_verity_descriptor_item, size, 64);
>
> +BTRFS_SETGET_FUNCS(remap_address, struct btrfs_remap, address, 64);
> +BTRFS_SETGET_STACK_FUNCS(stack_remap_address, struct btrfs_remap, address, 64);
> +
> /* Cast into the data area of the leaf. */
> #define btrfs_item_ptr(leaf, slot, type) \
> ((type *)(btrfs_item_nr_offset(leaf, 0) + btrfs_item_offset(leaf, slot)))
> diff --git a/fs/btrfs/locking.c b/fs/btrfs/locking.c
> index a3e6d9616e60..26f810258486 100644
> --- a/fs/btrfs/locking.c
> +++ b/fs/btrfs/locking.c
> @@ -73,6 +73,7 @@ static struct btrfs_lockdep_keyset {
> { .id = BTRFS_FREE_SPACE_TREE_OBJECTID, DEFINE_NAME("free-space") },
> { .id = BTRFS_BLOCK_GROUP_TREE_OBJECTID, DEFINE_NAME("block-group") },
> { .id = BTRFS_RAID_STRIPE_TREE_OBJECTID, DEFINE_NAME("raid-stripe") },
> + { .id = BTRFS_REMAP_TREE_OBJECTID, DEFINE_NAME("remap-tree") },
> { .id = 0, DEFINE_NAME("tree") },
> };
>
> diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
> index 81f52c1f55ce..857d2772db1c 100644
> --- a/fs/btrfs/sysfs.c
> +++ b/fs/btrfs/sysfs.c
> @@ -291,6 +291,7 @@ BTRFS_FEAT_ATTR_COMPAT_RO(free_space_tree, FREE_SPACE_TREE);
> BTRFS_FEAT_ATTR_COMPAT_RO(block_group_tree, BLOCK_GROUP_TREE);
> BTRFS_FEAT_ATTR_INCOMPAT(raid1c34, RAID1C34);
> BTRFS_FEAT_ATTR_INCOMPAT(simple_quota, SIMPLE_QUOTA);
> +BTRFS_FEAT_ATTR_INCOMPAT(remap_tree, REMAP_TREE);
> #ifdef CONFIG_BLK_DEV_ZONED
> BTRFS_FEAT_ATTR_INCOMPAT(zoned, ZONED);
> #endif
> @@ -325,6 +326,7 @@ static struct attribute *btrfs_supported_feature_attrs[] = {
> BTRFS_FEAT_ATTR_PTR(raid1c34),
> BTRFS_FEAT_ATTR_PTR(block_group_tree),
> BTRFS_FEAT_ATTR_PTR(simple_quota),
> + BTRFS_FEAT_ATTR_PTR(remap_tree),
> #ifdef CONFIG_BLK_DEV_ZONED
> BTRFS_FEAT_ATTR_PTR(zoned),
> #endif
> diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
> index 0f556f4de3f9..76ec3698f197 100644
> --- a/fs/btrfs/tree-checker.c
> +++ b/fs/btrfs/tree-checker.c
> @@ -912,12 +912,10 @@ int btrfs_check_chunk_valid(const struct btrfs_fs_info *fs_info,
> length, btrfs_stripe_nr_to_offset(U32_MAX));
> return -EUCLEAN;
> }
> - if (unlikely(type & ~(BTRFS_BLOCK_GROUP_TYPE_MASK |
> - BTRFS_BLOCK_GROUP_PROFILE_MASK))) {
> + if (unlikely(type & ~BTRFS_BLOCK_GROUP_VALID)) {
> chunk_err(fs_info, leaf, chunk, logical,
> "unrecognized chunk type: 0x%llx",
> - ~(BTRFS_BLOCK_GROUP_TYPE_MASK |
> - BTRFS_BLOCK_GROUP_PROFILE_MASK) & type);
> + type & ~BTRFS_BLOCK_GROUP_VALID);
> return -EUCLEAN;
> }
>
> diff --git a/fs/btrfs/tree-checker.h b/fs/btrfs/tree-checker.h
> index eb201f4ec3c7..833e2fd989eb 100644
> --- a/fs/btrfs/tree-checker.h
> +++ b/fs/btrfs/tree-checker.h
> @@ -57,6 +57,11 @@ enum btrfs_tree_block_status {
> BTRFS_TREE_BLOCK_WRITTEN_NOT_SET,
> };
>
> +
> +#define BTRFS_BLOCK_GROUP_VALID (BTRFS_BLOCK_GROUP_TYPE_MASK | \
> + BTRFS_BLOCK_GROUP_PROFILE_MASK | \
> + BTRFS_BLOCK_GROUP_REMAPPED)
So far it looks like the remapped flag is a new bg type.
Can we just put it into BLOCK_GROUP_TYPE_MASK?
Otherwise looks good to me.
Thanks,
Qu
> +
> /*
> * Exported simply for btrfs-progs which wants to have the
> * btrfs_tree_block_status return codes.
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index fa7a929a0461..e067e9cd68a5 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -231,6 +231,7 @@ void btrfs_describe_block_groups(u64 bg_flags, char *buf, u32 size_buf)
> DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_DATA, "data");
> DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_SYSTEM, "system");
> DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_METADATA, "metadata");
> + DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_REMAPPED, "remapped");
>
> DESCRIBE_FLAG(BTRFS_AVAIL_ALLOC_BIT_SINGLE, "single");
> for (i = 0; i < BTRFS_NR_RAID_TYPES; i++)
> diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
> index 8e710bbb688e..fba303ed49e6 100644
> --- a/include/uapi/linux/btrfs.h
> +++ b/include/uapi/linux/btrfs.h
> @@ -336,6 +336,7 @@ struct btrfs_ioctl_fs_info_args {
> #define BTRFS_FEATURE_INCOMPAT_EXTENT_TREE_V2 (1ULL << 13)
> #define BTRFS_FEATURE_INCOMPAT_RAID_STRIPE_TREE (1ULL << 14)
> #define BTRFS_FEATURE_INCOMPAT_SIMPLE_QUOTA (1ULL << 16)
> +#define BTRFS_FEATURE_INCOMPAT_REMAP_TREE (1ULL << 17)
>
> struct btrfs_ioctl_feature_flags {
> __u64 compat_flags;
> diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
> index fc29d273845d..4439d77a7252 100644
> --- a/include/uapi/linux/btrfs_tree.h
> +++ b/include/uapi/linux/btrfs_tree.h
> @@ -76,6 +76,9 @@
> /* Tracks RAID stripes in block groups. */
> #define BTRFS_RAID_STRIPE_TREE_OBJECTID 12ULL
>
> +/* Holds details of remapped addresses after relocation. */
> +#define BTRFS_REMAP_TREE_OBJECTID 13ULL
> +
> /* device stats in the device tree */
> #define BTRFS_DEV_STATS_OBJECTID 0ULL
>
> @@ -282,6 +285,10 @@
>
> #define BTRFS_RAID_STRIPE_KEY 230
>
> +#define BTRFS_IDENTITY_REMAP_KEY 234
> +#define BTRFS_REMAP_KEY 235
> +#define BTRFS_REMAP_BACKREF_KEY 236
> +
> /*
> * Records the overall state of the qgroups.
> * There's only one instance of this key present,
> @@ -1161,6 +1168,7 @@ struct btrfs_dev_replace_item {
> #define BTRFS_BLOCK_GROUP_RAID6 (1ULL << 8)
> #define BTRFS_BLOCK_GROUP_RAID1C3 (1ULL << 9)
> #define BTRFS_BLOCK_GROUP_RAID1C4 (1ULL << 10)
> +#define BTRFS_BLOCK_GROUP_REMAPPED (1ULL << 11)
> #define BTRFS_BLOCK_GROUP_RESERVED (BTRFS_AVAIL_ALLOC_BIT_SINGLE | \
> BTRFS_SPACE_INFO_GLOBAL_RSV)
>
> @@ -1323,4 +1331,8 @@ struct btrfs_verity_descriptor_item {
> __u8 encryption;
> } __attribute__ ((__packed__));
>
> +struct btrfs_remap {
> + __le64 address;
> +} __attribute__ ((__packed__));
> +
> #endif /* _BTRFS_CTREE_H_ */
* Re: [PATCH v2 03/16] btrfs: allow remapped chunks to have zero stripes
2025-08-13 14:34 ` [PATCH v2 03/16] btrfs: allow remapped chunks to have zero stripes Mark Harmstone
@ 2025-08-16 0:03 ` Boris Burkov
2025-08-22 17:01 ` Mark Harmstone
2025-08-19 1:05 ` kernel test robot
1 sibling, 1 reply; 43+ messages in thread
From: Boris Burkov @ 2025-08-16 0:03 UTC (permalink / raw)
To: Mark Harmstone; +Cc: linux-btrfs
On Wed, Aug 13, 2025 at 03:34:45PM +0100, Mark Harmstone wrote:
> When a chunk has been fully remapped, we are going to set its
> num_stripes to 0, as it will no longer represent a physical location on
> disk.
>
> Change tree-checker to allow for this, and fix a couple of
> divide-by-zeroes seen elsewhere.
>
> Signed-off-by: Mark Harmstone <mark@harmstone.com>
> ---
> fs/btrfs/tree-checker.c | 63 ++++++++++++++++++++++++++++-------------
> fs/btrfs/volumes.c | 8 +++++-
> 2 files changed, 50 insertions(+), 21 deletions(-)
>
> diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
> index ca898b1f12f1..20bfe333ffdd 100644
> --- a/fs/btrfs/tree-checker.c
> +++ b/fs/btrfs/tree-checker.c
> @@ -815,6 +815,39 @@ static void chunk_err(const struct btrfs_fs_info *fs_info,
> va_end(args);
> }
>
> +static bool valid_stripe_count(u64 profile, u16 num_stripes,
> + u16 sub_stripes)
> +{
> + switch (profile) {
> + case BTRFS_BLOCK_GROUP_RAID10:
> + return sub_stripes ==
> + btrfs_raid_array[BTRFS_RAID_RAID10].sub_stripes;
> + case BTRFS_BLOCK_GROUP_RAID1:
> + return num_stripes ==
> + btrfs_raid_array[BTRFS_RAID_RAID1].devs_min;
> + case BTRFS_BLOCK_GROUP_RAID1C3:
> + return num_stripes ==
> + btrfs_raid_array[BTRFS_RAID_RAID1C3].devs_min;
> + case BTRFS_BLOCK_GROUP_RAID1C4:
> + return num_stripes ==
> + btrfs_raid_array[BTRFS_RAID_RAID1C4].devs_min;
> + case BTRFS_BLOCK_GROUP_RAID5:
> + return num_stripes >=
> + btrfs_raid_array[BTRFS_RAID_RAID5].devs_min;
> + case BTRFS_BLOCK_GROUP_RAID6:
> + return num_stripes >=
> + btrfs_raid_array[BTRFS_RAID_RAID6].devs_min;
> + case BTRFS_BLOCK_GROUP_DUP:
> + return num_stripes ==
> + btrfs_raid_array[BTRFS_RAID_DUP].dev_stripes;
> + case 0: /* SINGLE */
> + return num_stripes ==
> + btrfs_raid_array[BTRFS_RAID_SINGLE].dev_stripes;
> + default:
> + BUG();
> + }
> +}
> +
> /*
> * The common chunk check which could also work on super block sys chunk array.
> *
> @@ -838,6 +871,7 @@ int btrfs_check_chunk_valid(const struct btrfs_fs_info *fs_info,
> u64 features;
> u32 chunk_sector_size;
> bool mixed = false;
> + bool remapped;
> int raid_index;
> int nparity;
> int ncopies;
> @@ -861,12 +895,14 @@ int btrfs_check_chunk_valid(const struct btrfs_fs_info *fs_info,
> ncopies = btrfs_raid_array[raid_index].ncopies;
> nparity = btrfs_raid_array[raid_index].nparity;
>
> - if (unlikely(!num_stripes)) {
> + remapped = type & BTRFS_BLOCK_GROUP_REMAPPED;
> +
> + if (unlikely(!remapped && !num_stripes)) {
> chunk_err(fs_info, leaf, chunk, logical,
> "invalid chunk num_stripes, have %u", num_stripes);
> return -EUCLEAN;
> }
> - if (unlikely(num_stripes < ncopies)) {
> + if (unlikely(num_stripes != 0 && num_stripes < ncopies)) {
This reliance on the above check for the remapped <=> !num_stripes
invariant was still kinda confusing. Logically it looks good now, though.
> chunk_err(fs_info, leaf, chunk, logical,
> "invalid chunk num_stripes < ncopies, have %u < %d",
> num_stripes, ncopies);
> @@ -964,22 +1000,9 @@ int btrfs_check_chunk_valid(const struct btrfs_fs_info *fs_info,
> }
> }
>
> - if (unlikely((type & BTRFS_BLOCK_GROUP_RAID10 &&
> - sub_stripes != btrfs_raid_array[BTRFS_RAID_RAID10].sub_stripes) ||
> - (type & BTRFS_BLOCK_GROUP_RAID1 &&
> - num_stripes != btrfs_raid_array[BTRFS_RAID_RAID1].devs_min) ||
> - (type & BTRFS_BLOCK_GROUP_RAID1C3 &&
> - num_stripes != btrfs_raid_array[BTRFS_RAID_RAID1C3].devs_min) ||
> - (type & BTRFS_BLOCK_GROUP_RAID1C4 &&
> - num_stripes != btrfs_raid_array[BTRFS_RAID_RAID1C4].devs_min) ||
> - (type & BTRFS_BLOCK_GROUP_RAID5 &&
> - num_stripes < btrfs_raid_array[BTRFS_RAID_RAID5].devs_min) ||
> - (type & BTRFS_BLOCK_GROUP_RAID6 &&
> - num_stripes < btrfs_raid_array[BTRFS_RAID_RAID6].devs_min) ||
> - (type & BTRFS_BLOCK_GROUP_DUP &&
> - num_stripes != btrfs_raid_array[BTRFS_RAID_DUP].dev_stripes) ||
> - ((type & BTRFS_BLOCK_GROUP_PROFILE_MASK) == 0 &&
> - num_stripes != btrfs_raid_array[BTRFS_RAID_SINGLE].dev_stripes))) {
> + if (!remapped &&
> + !valid_stripe_count(type & BTRFS_BLOCK_GROUP_PROFILE_MASK,
> + num_stripes, sub_stripes)) {
This looks great, thanks.
> chunk_err(fs_info, leaf, chunk, logical,
> "invalid num_stripes:sub_stripes %u:%u for profile %llu",
> num_stripes, sub_stripes,
> @@ -1003,11 +1026,11 @@ static int check_leaf_chunk_item(struct extent_buffer *leaf,
> struct btrfs_fs_info *fs_info = leaf->fs_info;
> int num_stripes;
>
> - if (unlikely(btrfs_item_size(leaf, slot) < sizeof(struct btrfs_chunk))) {
> + if (unlikely(btrfs_item_size(leaf, slot) < offsetof(struct btrfs_chunk, stripe))) {
> chunk_err(fs_info, leaf, chunk, key->offset,
> "invalid chunk item size: have %u expect [%zu, %u)",
> btrfs_item_size(leaf, slot),
> - sizeof(struct btrfs_chunk),
> + offsetof(struct btrfs_chunk, stripe),
> BTRFS_LEAF_DATA_SIZE(fs_info));
> return -EUCLEAN;
> }
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index f4d1527f265e..c95f83305c82 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -6145,6 +6145,12 @@ struct btrfs_discard_stripe *btrfs_map_discard(struct btrfs_fs_info *fs_info,
> goto out_free_map;
> }
>
> + /* avoid divide by zero on fully-remapped chunks */
> + if (map->num_stripes == 0) {
> + ret = -EOPNOTSUPP;
> + goto out_free_map;
> + }
> +
> offset = logical - map->start;
> length = min_t(u64, map->start + map->chunk_len - logical, length);
> *length_ret = length;
> @@ -6965,7 +6971,7 @@ u64 btrfs_calc_stripe_length(const struct btrfs_chunk_map *map)
> {
> const int data_stripes = calc_data_stripes(map->type, map->num_stripes);
>
> - return div_u64(map->chunk_len, data_stripes);
> + return data_stripes ? div_u64(map->chunk_len, data_stripes) : 0;
My point here was more that we are now including 0 in the range of this
function, where it wasn't before, meaning that callers must handle it
properly. And 0 isn't a meaningful "stripe length", so it breaks that
correspondence; checking explicitly for "remapped-ness" rather than
"length == 0" feels more robust to me.
I won't die on this hill, just making myself as clear as I can.
> }
>
> #if BITS_PER_LONG == 32
> --
> 2.49.1
>
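The stripe-count validation discussed in patch 03 — a per-profile dispatch, with fully remapped chunks exempted because they no longer describe a physical location — can be sketched standalone as follows. The profile bits and minimum counts here are illustrative; the real values live in btrfs_raid_array:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative profile bits. */
#define BG_RAID1 (1ULL << 4)
#define BG_DUP   (1ULL << 5)
#define BG_RAID5 (1ULL << 7)

static bool valid_stripe_count(uint64_t profile, uint16_t num_stripes)
{
	switch (profile) {
	case BG_RAID1:
		return num_stripes == 2;	/* exactly two mirrors */
	case BG_DUP:
		return num_stripes == 2;	/* two copies, one device */
	case BG_RAID5:
		return num_stripes >= 2;	/* parity needs a minimum */
	case 0:					/* SINGLE */
		return num_stripes == 1;
	default:
		return false;
	}
}

/*
 * A fully remapped chunk no longer represents a physical location,
 * so num_stripes == 0 is allowed there and only there.
 */
static bool chunk_stripes_ok(bool remapped, uint64_t profile,
			     uint16_t num_stripes)
{
	if (num_stripes == 0)
		return remapped;
	return valid_stripe_count(profile, num_stripes);
}
```

This folds the two tree-checker checks (reject num_stripes == 0 unless remapped, then validate the per-profile count) into one helper; the kernel keeps them separate, which is what Boris found slightly confusing above.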
* Re: [PATCH v2 05/16] btrfs: don't add metadata items for the remap tree to the extent tree
2025-08-13 14:34 ` [PATCH v2 05/16] btrfs: don't add metadata items for the remap tree to the extent tree Mark Harmstone
@ 2025-08-16 0:06 ` Boris Burkov
0 siblings, 0 replies; 43+ messages in thread
From: Boris Burkov @ 2025-08-16 0:06 UTC (permalink / raw)
To: Mark Harmstone; +Cc: linux-btrfs
On Wed, Aug 13, 2025 at 03:34:47PM +0100, Mark Harmstone wrote:
> There is the following potential problem with the remap tree and delayed refs:
>
> * Remapped extent freed in a delayed ref, which removes an entry from the
> remap tree
> * Remap tree now small enough to fit in a single leaf
> * Corruption as we now have a level-0 block with a level-1 metadata item
> in the extent tree
>
> One solution to this would be to rework the remap tree code so that it operates
> via delayed refs. But as we're hoping to remove cow-only metadata items in the
> future anyway, change things so that the remap tree doesn't have any entries in
> the extent tree. This also has the benefit of reducing write amplification.
>
> We also make it so that the clear_cache mount option is a no-op, as with
> extent tree v2, since the free-space tree can no longer be recreated from
> the extent tree.
>
> Finally, disable relocating the remap tree itself; this is added back in
> a later patch. As things stand we would get corruption, since the
> traditional relocation method walks the extent tree and we're removing
> its metadata items.
>
Reviewed-by: Boris Burkov <boris@bur.io>
> Signed-off-by: Mark Harmstone <mark@harmstone.com>
> ---
> fs/btrfs/disk-io.c | 3 +++
> fs/btrfs/extent-tree.c | 31 ++++++++++++++++++++++++++++++-
> fs/btrfs/volumes.c | 3 +++
> 3 files changed, 36 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 7e60097b2a96..8e9520119d4f 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -3049,6 +3049,9 @@ int btrfs_start_pre_rw_mount(struct btrfs_fs_info *fs_info)
> if (btrfs_fs_incompat(fs_info, EXTENT_TREE_V2))
> btrfs_warn(fs_info,
> "'clear_cache' option is ignored with extent tree v2");
> + else if (btrfs_fs_incompat(fs_info, REMAP_TREE))
> + btrfs_warn(fs_info,
> + "'clear_cache' option is ignored with remap tree");
> else
> rebuild_free_space_tree = true;
> } else if (btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE) &&
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 682d21a73a67..5e038ae1a93f 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -1552,6 +1552,28 @@ static void free_head_ref_squota_rsv(struct btrfs_fs_info *fs_info,
> BTRFS_QGROUP_RSV_DATA);
> }
>
> +static int drop_remap_tree_ref(struct btrfs_trans_handle *trans,
> + const struct btrfs_delayed_ref_node *node)
> +{
> + u64 bytenr = node->bytenr;
> + u64 num_bytes = node->num_bytes;
> + int ret;
> +
> + ret = btrfs_add_to_free_space_tree(trans, bytenr, num_bytes);
> + if (ret) {
> + btrfs_abort_transaction(trans, ret);
> + return ret;
> + }
> +
> + ret = btrfs_update_block_group(trans, bytenr, num_bytes, false);
> + if (ret) {
> + btrfs_abort_transaction(trans, ret);
> + return ret;
> + }
> +
> + return 0;
> +}
> +
> static int run_delayed_data_ref(struct btrfs_trans_handle *trans,
> struct btrfs_delayed_ref_head *href,
> const struct btrfs_delayed_ref_node *node,
> @@ -1746,7 +1768,10 @@ static int run_delayed_tree_ref(struct btrfs_trans_handle *trans,
> } else if (node->action == BTRFS_ADD_DELAYED_REF) {
> ret = __btrfs_inc_extent_ref(trans, node, extent_op);
> } else if (node->action == BTRFS_DROP_DELAYED_REF) {
> - ret = __btrfs_free_extent(trans, href, node, extent_op);
> + if (node->ref_root == BTRFS_REMAP_TREE_OBJECTID)
> + ret = drop_remap_tree_ref(trans, node);
> + else
> + ret = __btrfs_free_extent(trans, href, node, extent_op);
> } else {
> BUG();
> }
> @@ -4896,6 +4921,9 @@ static int alloc_reserved_tree_block(struct btrfs_trans_handle *trans,
> int level = btrfs_delayed_ref_owner(node);
> bool skinny_metadata = btrfs_fs_incompat(fs_info, SKINNY_METADATA);
>
> + if (unlikely(node->ref_root == BTRFS_REMAP_TREE_OBJECTID))
> + goto skip;
> +
> extent_key.objectid = node->bytenr;
> if (skinny_metadata) {
> /* The owner of a tree block is the level. */
> @@ -4948,6 +4976,7 @@ static int alloc_reserved_tree_block(struct btrfs_trans_handle *trans,
>
> btrfs_free_path(path);
>
> +skip:
> return alloc_reserved_extent(trans, node->bytenr, fs_info->nodesize);
> }
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index c95f83305c82..678e5d4cd780 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -3993,6 +3993,9 @@ static bool should_balance_chunk(struct extent_buffer *leaf, struct btrfs_chunk
> struct btrfs_balance_args *bargs = NULL;
> u64 chunk_type = btrfs_chunk_type(leaf, chunk);
>
> + if (chunk_type & BTRFS_BLOCK_GROUP_REMAP)
> + return false;
> +
> /* type filter */
> if (!((chunk_type & BTRFS_BLOCK_GROUP_TYPE_MASK) &
> (bctl->flags & BTRFS_BALANCE_TYPE_MASK))) {
> --
> 2.49.1
>
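The free path described in patch 05 — remap-tree blocks return their space to the free-space tree, but there is no extent-tree metadata item to delete — can be modelled with a small standalone sketch. The struct and its counters are hypothetical stand-ins for the free-space tree and extent tree updates, not the kernel data structures:

```c
#include <assert.h>
#include <stdint.h>

#define BTRFS_REMAP_TREE_OBJECTID 13ULL

/* Stand-ins for free-space-tree and extent-tree state. */
struct fs_state {
	uint64_t free_space_bytes;
	unsigned int extent_items;
};

/*
 * Dropping a tree-block ref: space always goes back to the free-space
 * tree, but remap-tree blocks never had an extent-tree item created for
 * them, so there is nothing to remove there.
 */
static void drop_tree_ref(struct fs_state *fs, uint64_t ref_root,
			  uint64_t num_bytes)
{
	fs->free_space_bytes += num_bytes;
	if (ref_root != BTRFS_REMAP_TREE_OBJECTID)
		fs->extent_items--;	/* remove the metadata item */
}
```

This mirrors the dispatch in run_delayed_tree_ref(): BTRFS_DROP_DELAYED_REF for the remap root takes the short drop_remap_tree_ref() path instead of __btrfs_free_extent().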
* Re: [PATCH v2 06/16] btrfs: add extended version of struct block_group_item
2025-08-13 14:34 ` [PATCH v2 06/16] btrfs: add extended version of struct block_group_item Mark Harmstone
@ 2025-08-16 0:08 ` Boris Burkov
0 siblings, 0 replies; 43+ messages in thread
From: Boris Burkov @ 2025-08-16 0:08 UTC (permalink / raw)
To: Mark Harmstone; +Cc: linux-btrfs
On Wed, Aug 13, 2025 at 03:34:48PM +0100, Mark Harmstone wrote:
> Add a struct btrfs_block_group_item_v2, which is used in the block group
> tree if the remap-tree incompat flag is set.
>
> This adds two new fields to the block group item: `remap_bytes` and
> `identity_remap_count`.
>
> `remap_bytes` records the amount of data that's physically within this
> block group, but nominally in another, remapped block group. This is
> necessary because this data will need to be moved first if this block
> group is itself relocated. If `remap_bytes` > 0, this is an indicator to
> the relocation thread that it will need to search the remap-tree for
> backrefs. A block group must also have `remap_bytes` == 0 before it can
> be dropped.
>
> `identity_remap_count` records how many identity remap items are located
> in the remap tree for this block group. When relocation is begun for
> this block group, this is set to the number of holes in the free-space
> tree for this range. As identity remaps are converted into actual remaps
> by the relocation process, this number is decreased. Once it reaches 0,
> either because of relocation or because extents have been deleted, the
> block group has been fully remapped and its chunk's device extents are
> removed.
>
Reviewed-by: Boris Burkov <boris@bur.io>
> Signed-off-by: Mark Harmstone <mark@harmstone.com>
> ---
> fs/btrfs/accessors.h | 20 +++++++
> fs/btrfs/block-group.c | 101 ++++++++++++++++++++++++--------
> fs/btrfs/block-group.h | 14 ++++-
> fs/btrfs/discard.c | 2 +-
> fs/btrfs/tree-checker.c | 10 +++-
> include/uapi/linux/btrfs_tree.h | 8 +++
> 6 files changed, 127 insertions(+), 28 deletions(-)
>
> diff --git a/fs/btrfs/accessors.h b/fs/btrfs/accessors.h
> index 95a1ca8c099b..0dd161ee6863 100644
> --- a/fs/btrfs/accessors.h
> +++ b/fs/btrfs/accessors.h
> @@ -239,6 +239,26 @@ BTRFS_SETGET_FUNCS(block_group_flags, struct btrfs_block_group_item, flags, 64);
> BTRFS_SETGET_STACK_FUNCS(stack_block_group_flags,
> struct btrfs_block_group_item, flags, 64);
>
> +/* struct btrfs_block_group_item_v2 */
> +BTRFS_SETGET_STACK_FUNCS(stack_block_group_v2_used, struct btrfs_block_group_item_v2,
> + used, 64);
> +BTRFS_SETGET_FUNCS(block_group_v2_used, struct btrfs_block_group_item_v2, used, 64);
> +BTRFS_SETGET_STACK_FUNCS(stack_block_group_v2_chunk_objectid,
> + struct btrfs_block_group_item_v2, chunk_objectid, 64);
> +BTRFS_SETGET_FUNCS(block_group_v2_chunk_objectid,
> + struct btrfs_block_group_item_v2, chunk_objectid, 64);
> +BTRFS_SETGET_STACK_FUNCS(stack_block_group_v2_flags,
> + struct btrfs_block_group_item_v2, flags, 64);
> +BTRFS_SETGET_FUNCS(block_group_v2_flags, struct btrfs_block_group_item_v2, flags, 64);
> +BTRFS_SETGET_STACK_FUNCS(stack_block_group_v2_remap_bytes,
> + struct btrfs_block_group_item_v2, remap_bytes, 64);
> +BTRFS_SETGET_FUNCS(block_group_v2_remap_bytes, struct btrfs_block_group_item_v2,
> + remap_bytes, 64);
> +BTRFS_SETGET_STACK_FUNCS(stack_block_group_v2_identity_remap_count,
> + struct btrfs_block_group_item_v2, identity_remap_count, 32);
> +BTRFS_SETGET_FUNCS(block_group_v2_identity_remap_count, struct btrfs_block_group_item_v2,
> + identity_remap_count, 32);
> +
> /* struct btrfs_free_space_info */
> BTRFS_SETGET_FUNCS(free_space_extent_count, struct btrfs_free_space_info,
> extent_count, 32);
> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> index 4d76d457da9b..bed9c58b6cbc 100644
> --- a/fs/btrfs/block-group.c
> +++ b/fs/btrfs/block-group.c
> @@ -2368,7 +2368,7 @@ static int check_chunk_block_group_mappings(struct btrfs_fs_info *fs_info)
> }
>
> static int read_one_block_group(struct btrfs_fs_info *info,
> - struct btrfs_block_group_item *bgi,
> + struct btrfs_block_group_item_v2 *bgi,
> const struct btrfs_key *key,
> int need_clear)
> {
> @@ -2383,11 +2383,16 @@ static int read_one_block_group(struct btrfs_fs_info *info,
> return -ENOMEM;
>
> cache->length = key->offset;
> - cache->used = btrfs_stack_block_group_used(bgi);
> + cache->used = btrfs_stack_block_group_v2_used(bgi);
> cache->commit_used = cache->used;
> - cache->flags = btrfs_stack_block_group_flags(bgi);
> - cache->global_root_id = btrfs_stack_block_group_chunk_objectid(bgi);
> + cache->flags = btrfs_stack_block_group_v2_flags(bgi);
> + cache->global_root_id = btrfs_stack_block_group_v2_chunk_objectid(bgi);
> cache->space_info = btrfs_find_space_info(info, cache->flags);
> + cache->remap_bytes = btrfs_stack_block_group_v2_remap_bytes(bgi);
> + cache->commit_remap_bytes = cache->remap_bytes;
> + cache->identity_remap_count =
> + btrfs_stack_block_group_v2_identity_remap_count(bgi);
> + cache->commit_identity_remap_count = cache->identity_remap_count;
>
> btrfs_set_free_space_tree_thresholds(cache);
>
> @@ -2452,7 +2457,7 @@ static int read_one_block_group(struct btrfs_fs_info *info,
> } else if (cache->length == cache->used) {
> cache->cached = BTRFS_CACHE_FINISHED;
> btrfs_free_excluded_extents(cache);
> - } else if (cache->used == 0) {
> + } else if (cache->used == 0 && cache->remap_bytes == 0) {
> cache->cached = BTRFS_CACHE_FINISHED;
> ret = btrfs_add_new_free_space(cache, cache->start,
> cache->start + cache->length, NULL);
> @@ -2472,7 +2477,8 @@ static int read_one_block_group(struct btrfs_fs_info *info,
>
> set_avail_alloc_bits(info, cache->flags);
> if (btrfs_chunk_writeable(info, cache->start)) {
> - if (cache->used == 0) {
> + if (cache->used == 0 && cache->identity_remap_count == 0 &&
> + cache->remap_bytes == 0) {
> ASSERT(list_empty(&cache->bg_list));
> if (btrfs_test_opt(info, DISCARD_ASYNC) &&
> !(cache->flags & BTRFS_BLOCK_GROUP_REMAPPED)) {
> @@ -2578,9 +2584,10 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
> need_clear = 1;
>
> while (1) {
> - struct btrfs_block_group_item bgi;
> + struct btrfs_block_group_item_v2 bgi;
> struct extent_buffer *leaf;
> int slot;
> + size_t size;
>
> ret = find_first_block_group(info, path, &key);
> if (ret > 0)
> @@ -2591,8 +2598,16 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
> leaf = path->nodes[0];
> slot = path->slots[0];
>
> + if (btrfs_fs_incompat(info, REMAP_TREE)) {
> + size = sizeof(struct btrfs_block_group_item_v2);
> + } else {
> + size = sizeof(struct btrfs_block_group_item);
> + btrfs_set_stack_block_group_v2_remap_bytes(&bgi, 0);
> + btrfs_set_stack_block_group_v2_identity_remap_count(&bgi, 0);
> + }
> +
> read_extent_buffer(leaf, &bgi, btrfs_item_ptr_offset(leaf, slot),
> - sizeof(bgi));
> + size);
>
> btrfs_item_key_to_cpu(leaf, &key, slot);
> btrfs_release_path(path);
> @@ -2662,25 +2677,38 @@ static int insert_block_group_item(struct btrfs_trans_handle *trans,
> struct btrfs_block_group *block_group)
> {
> struct btrfs_fs_info *fs_info = trans->fs_info;
> - struct btrfs_block_group_item bgi;
> + struct btrfs_block_group_item_v2 bgi;
> struct btrfs_root *root = btrfs_block_group_root(fs_info);
> struct btrfs_key key;
> u64 old_commit_used;
> + size_t size;
> int ret;
>
> spin_lock(&block_group->lock);
> - btrfs_set_stack_block_group_used(&bgi, block_group->used);
> - btrfs_set_stack_block_group_chunk_objectid(&bgi,
> - block_group->global_root_id);
> - btrfs_set_stack_block_group_flags(&bgi, block_group->flags);
> + btrfs_set_stack_block_group_v2_used(&bgi, block_group->used);
> + btrfs_set_stack_block_group_v2_chunk_objectid(&bgi,
> + block_group->global_root_id);
> + btrfs_set_stack_block_group_v2_flags(&bgi, block_group->flags);
> + btrfs_set_stack_block_group_v2_remap_bytes(&bgi,
> + block_group->remap_bytes);
> + btrfs_set_stack_block_group_v2_identity_remap_count(&bgi,
> + block_group->identity_remap_count);
> old_commit_used = block_group->commit_used;
> block_group->commit_used = block_group->used;
> + block_group->commit_remap_bytes = block_group->remap_bytes;
> + block_group->commit_identity_remap_count =
> + block_group->identity_remap_count;
> key.objectid = block_group->start;
> key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
> key.offset = block_group->length;
> spin_unlock(&block_group->lock);
>
> - ret = btrfs_insert_item(trans, root, &key, &bgi, sizeof(bgi));
> + if (btrfs_fs_incompat(fs_info, REMAP_TREE))
> + size = sizeof(struct btrfs_block_group_item_v2);
> + else
> + size = sizeof(struct btrfs_block_group_item);
> +
> + ret = btrfs_insert_item(trans, root, &key, &bgi, size);
> if (ret < 0) {
> spin_lock(&block_group->lock);
> block_group->commit_used = old_commit_used;
> @@ -3135,10 +3163,12 @@ static int update_block_group_item(struct btrfs_trans_handle *trans,
> struct btrfs_root *root = btrfs_block_group_root(fs_info);
> unsigned long bi;
> struct extent_buffer *leaf;
> - struct btrfs_block_group_item bgi;
> + struct btrfs_block_group_item_v2 bgi;
> struct btrfs_key key;
> - u64 old_commit_used;
> - u64 used;
> + u64 old_commit_used, old_commit_remap_bytes;
> + u32 old_commit_identity_remap_count;
> + u64 used, remap_bytes;
> + u32 identity_remap_count;
>
> /*
> * Block group items update can be triggered out of commit transaction
> @@ -3148,13 +3178,21 @@ static int update_block_group_item(struct btrfs_trans_handle *trans,
> */
> spin_lock(&cache->lock);
> old_commit_used = cache->commit_used;
> + old_commit_remap_bytes = cache->commit_remap_bytes;
> + old_commit_identity_remap_count = cache->commit_identity_remap_count;
> used = cache->used;
> - /* No change in used bytes, can safely skip it. */
> - if (cache->commit_used == used) {
> + remap_bytes = cache->remap_bytes;
> + identity_remap_count = cache->identity_remap_count;
> + /* No change in values, can safely skip it. */
> + if (cache->commit_used == used &&
> + cache->commit_remap_bytes == remap_bytes &&
> + cache->commit_identity_remap_count == identity_remap_count) {
> spin_unlock(&cache->lock);
> return 0;
> }
> cache->commit_used = used;
> + cache->commit_remap_bytes = remap_bytes;
> + cache->commit_identity_remap_count = identity_remap_count;
> spin_unlock(&cache->lock);
>
> key.objectid = cache->start;
> @@ -3170,11 +3208,23 @@ static int update_block_group_item(struct btrfs_trans_handle *trans,
>
> leaf = path->nodes[0];
> bi = btrfs_item_ptr_offset(leaf, path->slots[0]);
> - btrfs_set_stack_block_group_used(&bgi, used);
> - btrfs_set_stack_block_group_chunk_objectid(&bgi,
> - cache->global_root_id);
> - btrfs_set_stack_block_group_flags(&bgi, cache->flags);
> - write_extent_buffer(leaf, &bgi, bi, sizeof(bgi));
> + btrfs_set_stack_block_group_v2_used(&bgi, used);
> + btrfs_set_stack_block_group_v2_chunk_objectid(&bgi,
> + cache->global_root_id);
> + btrfs_set_stack_block_group_v2_flags(&bgi, cache->flags);
> +
> + if (btrfs_fs_incompat(fs_info, REMAP_TREE)) {
> + btrfs_set_stack_block_group_v2_remap_bytes(&bgi,
> + cache->remap_bytes);
> + btrfs_set_stack_block_group_v2_identity_remap_count(&bgi,
> + cache->identity_remap_count);
> + write_extent_buffer(leaf, &bgi, bi,
> + sizeof(struct btrfs_block_group_item_v2));
> + } else {
> + write_extent_buffer(leaf, &bgi, bi,
> + sizeof(struct btrfs_block_group_item));
> + }
> +
> fail:
> btrfs_release_path(path);
> /*
> @@ -3189,6 +3239,9 @@ static int update_block_group_item(struct btrfs_trans_handle *trans,
> if (ret < 0 && ret != -ENOENT) {
> spin_lock(&cache->lock);
> cache->commit_used = old_commit_used;
> + cache->commit_remap_bytes = old_commit_remap_bytes;
> + cache->commit_identity_remap_count =
> + old_commit_identity_remap_count;
> spin_unlock(&cache->lock);
> }
> return ret;
> diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
> index a8bb8429c966..ecc89701b2ea 100644
> --- a/fs/btrfs/block-group.h
> +++ b/fs/btrfs/block-group.h
> @@ -129,6 +129,8 @@ struct btrfs_block_group {
> u64 flags;
> u64 cache_generation;
> u64 global_root_id;
> + u64 remap_bytes;
> + u32 identity_remap_count;
>
> /*
> * The last committed used bytes of this block group, if the above @used
> @@ -136,6 +138,15 @@ struct btrfs_block_group {
> * group item of this block group.
> */
> u64 commit_used;
> + /*
> + * The last committed remap_bytes value of this block group.
> + */
> + u64 commit_remap_bytes;
> + /*
> + * The last committed identity_remap_count value of this block group.
> + */
> + u32 commit_identity_remap_count;
> +
> /*
> * If the free space extent count exceeds this number, convert the block
> * group to bitmaps.
> @@ -282,7 +293,8 @@ static inline bool btrfs_is_block_group_used(const struct btrfs_block_group *bg)
> {
> lockdep_assert_held(&bg->lock);
>
> - return (bg->used > 0 || bg->reserved > 0 || bg->pinned > 0);
> + return (bg->used > 0 || bg->reserved > 0 || bg->pinned > 0 ||
> + bg->remap_bytes > 0);
> }
>
> static inline bool btrfs_is_block_group_data_only(const struct btrfs_block_group *block_group)
> diff --git a/fs/btrfs/discard.c b/fs/btrfs/discard.c
> index 1015a4d37fb2..2b7b1e440bc8 100644
> --- a/fs/btrfs/discard.c
> +++ b/fs/btrfs/discard.c
> @@ -373,7 +373,7 @@ void btrfs_discard_queue_work(struct btrfs_discard_ctl *discard_ctl,
> if (!block_group || !btrfs_test_opt(block_group->fs_info, DISCARD_ASYNC))
> return;
>
> - if (block_group->used == 0)
> + if (block_group->used == 0 && block_group->remap_bytes == 0)
> add_to_discard_unused_list(discard_ctl, block_group);
> else
> add_to_discard_list(discard_ctl, block_group);
> diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
> index 20bfe333ffdd..922f7afa024d 100644
> --- a/fs/btrfs/tree-checker.c
> +++ b/fs/btrfs/tree-checker.c
> @@ -687,6 +687,7 @@ static int check_block_group_item(struct extent_buffer *leaf,
> u64 chunk_objectid;
> u64 flags;
> u64 type;
> + size_t exp_size;
>
> /*
> * Here we don't really care about alignment since extent allocator can
> @@ -698,10 +699,15 @@ static int check_block_group_item(struct extent_buffer *leaf,
> return -EUCLEAN;
> }
>
> - if (unlikely(item_size != sizeof(bgi))) {
> + if (btrfs_fs_incompat(fs_info, REMAP_TREE))
> + exp_size = sizeof(struct btrfs_block_group_item_v2);
> + else
> + exp_size = sizeof(struct btrfs_block_group_item);
> +
> + if (unlikely(item_size != exp_size)) {
> block_group_err(leaf, slot,
> "invalid item size, have %u expect %zu",
> - item_size, sizeof(bgi));
> + item_size, exp_size);
> return -EUCLEAN;
> }
>
> diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
> index 9a36f0206d90..500e3a7df90b 100644
> --- a/include/uapi/linux/btrfs_tree.h
> +++ b/include/uapi/linux/btrfs_tree.h
> @@ -1229,6 +1229,14 @@ struct btrfs_block_group_item {
> __le64 flags;
> } __attribute__ ((__packed__));
>
> +struct btrfs_block_group_item_v2 {
> + __le64 used;
> + __le64 chunk_objectid;
> + __le64 flags;
> + __le64 remap_bytes;
> + __le32 identity_remap_count;
> +} __attribute__ ((__packed__));
> +
> struct btrfs_free_space_info {
> __le32 extent_count;
> __le32 flags;
> --
> 2.49.1
>
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v2 01/16] btrfs: add definitions and constants for remap-tree
2025-08-16 0:01 ` Qu Wenruo
@ 2025-08-16 0:17 ` Qu Wenruo
2025-08-18 17:23 ` Mark Harmstone
0 siblings, 1 reply; 43+ messages in thread
From: Qu Wenruo @ 2025-08-16 0:17 UTC (permalink / raw)
To: Mark Harmstone, linux-btrfs
On 2025/8/16 09:31, Qu Wenruo wrote:
>
>
> On 2025/8/14 00:04, Mark Harmstone wrote:
>> Add an incompat flag for the new remap-tree feature, and the constants
>> and definitions needed to support it.
>>
>> Signed-off-by: Mark Harmstone <mark@harmstone.com>
>> ---
>> fs/btrfs/accessors.h | 3 +++
>> fs/btrfs/locking.c | 1 +
>> fs/btrfs/sysfs.c | 2 ++
>> fs/btrfs/tree-checker.c | 6 ++----
>> fs/btrfs/tree-checker.h | 5 +++++
>> fs/btrfs/volumes.c | 1 +
>> include/uapi/linux/btrfs.h | 1 +
>> include/uapi/linux/btrfs_tree.h | 12 ++++++++++++
>> 8 files changed, 27 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/btrfs/accessors.h b/fs/btrfs/accessors.h
>> index 99b3ced12805..95a1ca8c099b 100644
>> --- a/fs/btrfs/accessors.h
>> +++ b/fs/btrfs/accessors.h
>> @@ -1009,6 +1009,9 @@
>> BTRFS_SETGET_STACK_FUNCS(stack_verity_descriptor_encryption,
>> struct btrfs_verity_descriptor_item, encryption, 8);
>> BTRFS_SETGET_STACK_FUNCS(stack_verity_descriptor_size,
>> struct btrfs_verity_descriptor_item, size, 64);
>> +BTRFS_SETGET_FUNCS(remap_address, struct btrfs_remap, address, 64);
>> +BTRFS_SETGET_STACK_FUNCS(stack_remap_address, struct btrfs_remap, address, 64);
>> +
>> /* Cast into the data area of the leaf. */
>> #define btrfs_item_ptr(leaf, slot, type) \
>> ((type *)(btrfs_item_nr_offset(leaf, 0) + btrfs_item_offset(leaf, slot)))
>> diff --git a/fs/btrfs/locking.c b/fs/btrfs/locking.c
>> index a3e6d9616e60..26f810258486 100644
>> --- a/fs/btrfs/locking.c
>> +++ b/fs/btrfs/locking.c
>> @@ -73,6 +73,7 @@ static struct btrfs_lockdep_keyset {
>> { .id = BTRFS_FREE_SPACE_TREE_OBJECTID, DEFINE_NAME("free-space") },
>> { .id = BTRFS_BLOCK_GROUP_TREE_OBJECTID, DEFINE_NAME("block-group") },
>> { .id = BTRFS_RAID_STRIPE_TREE_OBJECTID, DEFINE_NAME("raid-stripe") },
>> + { .id = BTRFS_REMAP_TREE_OBJECTID, DEFINE_NAME("remap-tree") },
>> { .id = 0, DEFINE_NAME("tree") },
>> };
>> diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
>> index 81f52c1f55ce..857d2772db1c 100644
>> --- a/fs/btrfs/sysfs.c
>> +++ b/fs/btrfs/sysfs.c
>> @@ -291,6 +291,7 @@ BTRFS_FEAT_ATTR_COMPAT_RO(free_space_tree, FREE_SPACE_TREE);
>> BTRFS_FEAT_ATTR_COMPAT_RO(block_group_tree, BLOCK_GROUP_TREE);
>> BTRFS_FEAT_ATTR_INCOMPAT(raid1c34, RAID1C34);
>> BTRFS_FEAT_ATTR_INCOMPAT(simple_quota, SIMPLE_QUOTA);
>> +BTRFS_FEAT_ATTR_INCOMPAT(remap_tree, REMAP_TREE);
>> #ifdef CONFIG_BLK_DEV_ZONED
>> BTRFS_FEAT_ATTR_INCOMPAT(zoned, ZONED);
>> #endif
>> @@ -325,6 +326,7 @@ static struct attribute *btrfs_supported_feature_attrs[] = {
>> BTRFS_FEAT_ATTR_PTR(raid1c34),
>> BTRFS_FEAT_ATTR_PTR(block_group_tree),
>> BTRFS_FEAT_ATTR_PTR(simple_quota),
>> + BTRFS_FEAT_ATTR_PTR(remap_tree),
>> #ifdef CONFIG_BLK_DEV_ZONED
>> BTRFS_FEAT_ATTR_PTR(zoned),
>> #endif
>> diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
>> index 0f556f4de3f9..76ec3698f197 100644
>> --- a/fs/btrfs/tree-checker.c
>> +++ b/fs/btrfs/tree-checker.c
>> @@ -912,12 +912,10 @@ int btrfs_check_chunk_valid(const struct btrfs_fs_info *fs_info,
>> length, btrfs_stripe_nr_to_offset(U32_MAX));
>> return -EUCLEAN;
>> }
>> - if (unlikely(type & ~(BTRFS_BLOCK_GROUP_TYPE_MASK |
>> - BTRFS_BLOCK_GROUP_PROFILE_MASK))) {
>> + if (unlikely(type & ~BTRFS_BLOCK_GROUP_VALID)) {
>> chunk_err(fs_info, leaf, chunk, logical,
>> "unrecognized chunk type: 0x%llx",
>> - ~(BTRFS_BLOCK_GROUP_TYPE_MASK |
>> - BTRFS_BLOCK_GROUP_PROFILE_MASK) & type);
>> + type & ~BTRFS_BLOCK_GROUP_VALID);
>> return -EUCLEAN;
>> }
>> diff --git a/fs/btrfs/tree-checker.h b/fs/btrfs/tree-checker.h
>> index eb201f4ec3c7..833e2fd989eb 100644
>> --- a/fs/btrfs/tree-checker.h
>> +++ b/fs/btrfs/tree-checker.h
>> @@ -57,6 +57,11 @@ enum btrfs_tree_block_status {
>> BTRFS_TREE_BLOCK_WRITTEN_NOT_SET,
>> };
>> +
>> +#define BTRFS_BLOCK_GROUP_VALID (BTRFS_BLOCK_GROUP_TYPE_MASK | \
>> + BTRFS_BLOCK_GROUP_PROFILE_MASK | \
>> + BTRFS_BLOCK_GROUP_REMAPPED)
>
> So far it looks like the remapped flag is a new bg type.
> Can we just put it into BLOCK_GROUP_TYPE_MASK?
Never mind, we cannot put the new type into TYPE_MASK: the tree-checker
would then warn about remapped block groups, since they have both the
METADATA and REMAPPED flags set.
So this definition looks good.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Thanks,
Qu
>
> Otherwise looks good to me.
>
> Thanks,
> Qu
>
>> +
>> /*
>> * Exported simply for btrfs-progs which wants to have the
>> * btrfs_tree_block_status return codes.
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index fa7a929a0461..e067e9cd68a5 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -231,6 +231,7 @@ void btrfs_describe_block_groups(u64 bg_flags, char *buf, u32 size_buf)
>> DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_DATA, "data");
>> DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_SYSTEM, "system");
>> DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_METADATA, "metadata");
>> + DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_REMAPPED, "remapped");
>> DESCRIBE_FLAG(BTRFS_AVAIL_ALLOC_BIT_SINGLE, "single");
>> for (i = 0; i < BTRFS_NR_RAID_TYPES; i++)
>> diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
>> index 8e710bbb688e..fba303ed49e6 100644
>> --- a/include/uapi/linux/btrfs.h
>> +++ b/include/uapi/linux/btrfs.h
>> @@ -336,6 +336,7 @@ struct btrfs_ioctl_fs_info_args {
>> #define BTRFS_FEATURE_INCOMPAT_EXTENT_TREE_V2 (1ULL << 13)
>> #define BTRFS_FEATURE_INCOMPAT_RAID_STRIPE_TREE (1ULL << 14)
>> #define BTRFS_FEATURE_INCOMPAT_SIMPLE_QUOTA (1ULL << 16)
>> +#define BTRFS_FEATURE_INCOMPAT_REMAP_TREE (1ULL << 17)
>> struct btrfs_ioctl_feature_flags {
>> __u64 compat_flags;
>> diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
>> index fc29d273845d..4439d77a7252 100644
>> --- a/include/uapi/linux/btrfs_tree.h
>> +++ b/include/uapi/linux/btrfs_tree.h
>> @@ -76,6 +76,9 @@
>> /* Tracks RAID stripes in block groups. */
>> #define BTRFS_RAID_STRIPE_TREE_OBJECTID 12ULL
>> +/* Holds details of remapped addresses after relocation. */
>> +#define BTRFS_REMAP_TREE_OBJECTID 13ULL
>> +
>> /* device stats in the device tree */
>> #define BTRFS_DEV_STATS_OBJECTID 0ULL
>> @@ -282,6 +285,10 @@
>> #define BTRFS_RAID_STRIPE_KEY 230
>> +#define BTRFS_IDENTITY_REMAP_KEY 234
>> +#define BTRFS_REMAP_KEY 235
>> +#define BTRFS_REMAP_BACKREF_KEY 236
>> +
>> /*
>> * Records the overall state of the qgroups.
>> * There's only one instance of this key present,
>> @@ -1161,6 +1168,7 @@ struct btrfs_dev_replace_item {
>> #define BTRFS_BLOCK_GROUP_RAID6 (1ULL << 8)
>> #define BTRFS_BLOCK_GROUP_RAID1C3 (1ULL << 9)
>> #define BTRFS_BLOCK_GROUP_RAID1C4 (1ULL << 10)
>> +#define BTRFS_BLOCK_GROUP_REMAPPED (1ULL << 11)
>> #define BTRFS_BLOCK_GROUP_RESERVED (BTRFS_AVAIL_ALLOC_BIT_SINGLE | \
>> BTRFS_SPACE_INFO_GLOBAL_RSV)
>> @@ -1323,4 +1331,8 @@ struct btrfs_verity_descriptor_item {
>> __u8 encryption;
>> } __attribute__ ((__packed__));
>> +struct btrfs_remap {
>> + __le64 address;
>> +} __attribute__ ((__packed__));
>> +
>> #endif /* _BTRFS_CTREE_H_ */
>
>
* Re: [PATCH v2 10/16] btrfs: handle deletions from remapped block group
2025-08-13 14:34 ` [PATCH v2 10/16] btrfs: handle deletions from remapped block group Mark Harmstone
@ 2025-08-16 0:28 ` Boris Burkov
2025-08-27 17:11 ` Mark Harmstone
0 siblings, 1 reply; 43+ messages in thread
From: Boris Burkov @ 2025-08-16 0:28 UTC (permalink / raw)
To: Mark Harmstone; +Cc: linux-btrfs
On Wed, Aug 13, 2025 at 03:34:52PM +0100, Mark Harmstone wrote:
> Handle the case where we free an extent from a block group that has the
> REMAPPED flag set. Because the remap tree is orthogonal to the extent
> tree, for data this may be within any number of identity remaps or
> actual remaps. If we're freeing a metadata node, this will be wholly
> inside one or the other.
>
> btrfs_remove_extent_from_remap_tree() searches the remap tree for the
> remaps that cover the range in question, then calls
> remove_range_from_remap_tree() for each one, to punch a hole in the
> remap and adjust the free-space tree.
>
> For an identity remap, remove_range_from_remap_tree() will adjust the
> block group's `identity_remap_count` if this changes. If it reaches
> zero we call last_identity_remap_gone(), which removes the chunk's
> stripes and device extents - it is now fully remapped.
>
> The changes which involve the block group's ro flag are because the
> REMAPPED flag itself prevents a block group from having any new
> allocations within it, and so we don't need to account for this
> separately.
>
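> [The hole-punching described above can be modelled with a toy sketch.
> This is not the actual remove_range_from_remap_tree() implementation,
> and punch_hole() is an illustrative name: removing a range from a remap
> can leave a head piece, a tail piece, both, or neither.]
>
> ```c
> #include <assert.h>
> #include <stdint.h>
>
> struct remap { uint64_t start, len; };
>
> /* Punch [rs, rs+rlen) out of remap r. Up to two pieces survive: the
>  * head before the hole and the tail after it. Returns how many pieces
>  * were written into out[]. */
> static int punch_hole(struct remap r, uint64_t rs, uint64_t rlen,
> 		      struct remap out[2])
> {
> 	uint64_t re = rs + rlen;
> 	int n = 0;
>
> 	assert(rs >= r.start && re <= r.start + r.len);
> 	if (rs > r.start)
> 		out[n++] = (struct remap){ r.start, rs - r.start };
> 	if (re < r.start + r.len)
> 		out[n++] = (struct remap){ re, r.start + r.len - re };
> 	return n;
> }
>
> int main(void)
> {
> 	struct remap out[2];
>
> 	/* Hole in the middle: two remaps survive. */
> 	int n = punch_hole((struct remap){ 0, 1 << 20 }, 4096, 8192, out);
> 	assert(n == 2 && out[0].len == 4096 && out[1].start == 12288);
>
> 	/* Hole covering the whole remap: it disappears entirely, which is
> 	 * when identity_remap_count can drop for identity remaps. */
> 	n = punch_hole((struct remap){ 0, 4096 }, 0, 4096, out);
> 	assert(n == 0);
> 	return 0;
> }
> ```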
> Signed-off-by: Mark Harmstone <mark@harmstone.com>
> ---
> fs/btrfs/block-group.c | 82 ++++---
> fs/btrfs/block-group.h | 1 +
> fs/btrfs/disk-io.c | 1 +
> fs/btrfs/extent-tree.c | 28 ++-
> fs/btrfs/fs.h | 1 +
> fs/btrfs/relocation.c | 510 +++++++++++++++++++++++++++++++++++++++++
> fs/btrfs/relocation.h | 3 +
> fs/btrfs/volumes.c | 56 +++--
> fs/btrfs/volumes.h | 6 +
> 9 files changed, 630 insertions(+), 58 deletions(-)
>
> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> index 8c28f829547e..7a0524138235 100644
> --- a/fs/btrfs/block-group.c
> +++ b/fs/btrfs/block-group.c
> @@ -1068,6 +1068,32 @@ static int remove_block_group_item(struct btrfs_trans_handle *trans,
> return ret;
> }
>
> +void btrfs_remove_bg_from_sinfo(struct btrfs_block_group *block_group)
> +{
> + int factor = btrfs_bg_type_to_factor(block_group->flags);
> +
> + spin_lock(&block_group->space_info->lock);
> +
> + if (btrfs_test_opt(block_group->fs_info, ENOSPC_DEBUG)) {
> + WARN_ON(block_group->space_info->total_bytes
> + < block_group->length);
> + WARN_ON(block_group->space_info->bytes_readonly
> + < block_group->length - block_group->zone_unusable);
> + WARN_ON(block_group->space_info->bytes_zone_unusable
> + < block_group->zone_unusable);
> + WARN_ON(block_group->space_info->disk_total
> + < block_group->length * factor);
> + }
> + block_group->space_info->total_bytes -= block_group->length;
> + block_group->space_info->bytes_readonly -=
> + (block_group->length - block_group->zone_unusable);
> + btrfs_space_info_update_bytes_zone_unusable(block_group->space_info,
> + -block_group->zone_unusable);
> + block_group->space_info->disk_total -= block_group->length * factor;
> +
> + spin_unlock(&block_group->space_info->lock);
> +}
> +
> int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
> struct btrfs_chunk_map *map)
> {
> @@ -1079,7 +1105,6 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
> struct kobject *kobj = NULL;
> int ret;
> int index;
> - int factor;
> struct btrfs_caching_control *caching_ctl = NULL;
> bool remove_map;
> bool remove_rsv = false;
> @@ -1088,7 +1113,7 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
> if (!block_group)
> return -ENOENT;
>
> - BUG_ON(!block_group->ro);
> + BUG_ON(!block_group->ro && !(block_group->flags & BTRFS_BLOCK_GROUP_REMAPPED));
>
> trace_btrfs_remove_block_group(block_group);
> /*
> @@ -1100,7 +1125,6 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
> block_group->length);
>
> index = btrfs_bg_flags_to_raid_index(block_group->flags);
> - factor = btrfs_bg_type_to_factor(block_group->flags);
>
> /* make sure this block group isn't part of an allocation cluster */
> cluster = &fs_info->data_alloc_cluster;
> @@ -1224,26 +1248,11 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
>
> spin_lock(&block_group->space_info->lock);
> list_del_init(&block_group->ro_list);
> -
> - if (btrfs_test_opt(fs_info, ENOSPC_DEBUG)) {
> - WARN_ON(block_group->space_info->total_bytes
> - < block_group->length);
> - WARN_ON(block_group->space_info->bytes_readonly
> - < block_group->length - block_group->zone_unusable);
> - WARN_ON(block_group->space_info->bytes_zone_unusable
> - < block_group->zone_unusable);
> - WARN_ON(block_group->space_info->disk_total
> - < block_group->length * factor);
> - }
> - block_group->space_info->total_bytes -= block_group->length;
> - block_group->space_info->bytes_readonly -=
> - (block_group->length - block_group->zone_unusable);
> - btrfs_space_info_update_bytes_zone_unusable(block_group->space_info,
> - -block_group->zone_unusable);
> - block_group->space_info->disk_total -= block_group->length * factor;
> -
> spin_unlock(&block_group->space_info->lock);
>
> + if (!(block_group->flags & BTRFS_BLOCK_GROUP_REMAPPED))
> + btrfs_remove_bg_from_sinfo(block_group);
> +
> /*
> * Remove the free space for the block group from the free space tree
> * and the block group's item from the extent tree before marking the
> @@ -1539,6 +1548,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
> while (!list_empty(&fs_info->unused_bgs)) {
> u64 used;
> int trimming;
> + bool made_ro = false;
>
> block_group = list_first_entry(&fs_info->unused_bgs,
> struct btrfs_block_group,
> @@ -1575,7 +1585,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>
> spin_lock(&space_info->lock);
> spin_lock(&block_group->lock);
> - if (btrfs_is_block_group_used(block_group) || block_group->ro ||
> + if (btrfs_is_block_group_used(block_group) ||
> + (block_group->ro && !(block_group->flags & BTRFS_BLOCK_GROUP_REMAPPED)) ||
> list_is_singular(&block_group->list)) {
> /*
> * We want to bail if we made new allocations or have
> @@ -1617,9 +1628,10 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
> * needing to allocate extents from the block group.
> */
> used = btrfs_space_info_used(space_info, true);
> - if ((space_info->total_bytes - block_group->length < used &&
> + if (((space_info->total_bytes - block_group->length < used &&
> block_group->zone_unusable < block_group->length) ||
> - has_unwritten_metadata(block_group)) {
> + has_unwritten_metadata(block_group)) &&
> + !(block_group->flags & BTRFS_BLOCK_GROUP_REMAPPED)) {
> spin_unlock(&block_group->lock);
>
> /*
> @@ -1638,8 +1650,14 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
> spin_unlock(&block_group->lock);
> spin_unlock(&space_info->lock);
>
> - /* We don't want to force the issue, only flip if it's ok. */
> - ret = inc_block_group_ro(block_group, 0);
> + if (!(block_group->flags & BTRFS_BLOCK_GROUP_REMAPPED)) {
> + /* We don't want to force the issue, only flip if it's ok. */
> + ret = inc_block_group_ro(block_group, 0);
> + made_ro = true;
> + } else {
> + ret = 0;
> + }
> +
> up_write(&space_info->groups_sem);
> if (ret < 0) {
> ret = 0;
> @@ -1648,7 +1666,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>
> ret = btrfs_zone_finish(block_group);
> if (ret < 0) {
> - btrfs_dec_block_group_ro(block_group);
> + if (made_ro)
> + btrfs_dec_block_group_ro(block_group);
> if (ret == -EAGAIN) {
> btrfs_link_bg_list(block_group, &retry_list);
> ret = 0;
> @@ -1663,7 +1682,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
> trans = btrfs_start_trans_remove_block_group(fs_info,
> block_group->start);
> if (IS_ERR(trans)) {
> - btrfs_dec_block_group_ro(block_group);
> + if (made_ro)
> + btrfs_dec_block_group_ro(block_group);
> ret = PTR_ERR(trans);
> goto next;
> }
> @@ -1673,7 +1693,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
> * just delete them, we don't care about them anymore.
> */
> if (!clean_pinned_extents(trans, block_group)) {
> - btrfs_dec_block_group_ro(block_group);
> + if (made_ro)
> + btrfs_dec_block_group_ro(block_group);
> goto end_trans;
> }
>
> @@ -1687,7 +1708,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
> spin_lock(&fs_info->discard_ctl.lock);
> if (!list_empty(&block_group->discard_list)) {
> spin_unlock(&fs_info->discard_ctl.lock);
> - btrfs_dec_block_group_ro(block_group);
> + if (made_ro)
> + btrfs_dec_block_group_ro(block_group);
> btrfs_discard_queue_work(&fs_info->discard_ctl,
> block_group);
> goto end_trans;
> diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
> index ecc89701b2ea..0433b0127ed8 100644
> --- a/fs/btrfs/block-group.h
> +++ b/fs/btrfs/block-group.h
> @@ -336,6 +336,7 @@ int btrfs_add_new_free_space(struct btrfs_block_group *block_group,
> struct btrfs_trans_handle *btrfs_start_trans_remove_block_group(
> struct btrfs_fs_info *fs_info,
> const u64 chunk_offset);
> +void btrfs_remove_bg_from_sinfo(struct btrfs_block_group *block_group);
> int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
> struct btrfs_chunk_map *map);
> void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info);
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 563aea5e3b1b..d92d08316322 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2907,6 +2907,7 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
> mutex_init(&fs_info->chunk_mutex);
> mutex_init(&fs_info->transaction_kthread_mutex);
> mutex_init(&fs_info->cleaner_mutex);
> + mutex_init(&fs_info->remap_mutex);
> mutex_init(&fs_info->ro_block_group_mutex);
> init_rwsem(&fs_info->commit_root_sem);
> init_rwsem(&fs_info->cleanup_work_sem);
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index c1b96c728fe6..ca3f6d6bb5ba 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -40,6 +40,7 @@
> #include "orphan.h"
> #include "tree-checker.h"
> #include "raid-stripe-tree.h"
> +#include "relocation.h"
>
> #undef SCRAMBLE_DELAYED_REFS
>
> @@ -2999,7 +3000,8 @@ u64 btrfs_get_extent_owner_root(struct btrfs_fs_info *fs_info,
> }
>
> static int do_free_extent_accounting(struct btrfs_trans_handle *trans,
> - u64 bytenr, struct btrfs_squota_delta *delta)
> + u64 bytenr, struct btrfs_squota_delta *delta,
> + bool remapped)
> {
> int ret;
> u64 num_bytes = delta->num_bytes;
> @@ -3027,10 +3029,16 @@ static int do_free_extent_accounting(struct btrfs_trans_handle *trans,
> return ret;
> }
>
> - ret = btrfs_add_to_free_space_tree(trans, bytenr, num_bytes);
> - if (ret) {
> - btrfs_abort_transaction(trans, ret);
> - return ret;
> + /*
> + * If remapped, FST has already been taken care of in
> + * remove_range_from_remap_tree().
> + */
Why not do btrfs_remove_extent_from_remap_tree() here in
do_free_extent_accounting() rather than the caller?
> + if (!remapped) {
> + ret = btrfs_add_to_free_space_tree(trans, bytenr, num_bytes);
> + if (ret) {
> + btrfs_abort_transaction(trans, ret);
> + return ret;
> + }
> }
>
> ret = btrfs_update_block_group(trans, bytenr, num_bytes, false);
So in the normal case this will update the block group's byte counters,
and when they hit zero put the block group on the unused list, which
queues it for deletion and ultimately removes it from the space info.
I fail to see what is special about remapped block groups from that
perspective. I would strongly prefer to see you integrate with
btrfs_delete_unused_bgs() rather than special-casing it there and
copying parts of the logic elsewhere.
Unless there is some very good reason for the special treatment that I
am not seeing.
> @@ -3396,7 +3404,15 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
> }
> btrfs_release_path(path);
>
> - ret = do_free_extent_accounting(trans, bytenr, &delta);
> + /* returns 1 on success and 0 on no-op */
> + ret = btrfs_remove_extent_from_remap_tree(trans, path, bytenr,
> + num_bytes);
> + if (ret < 0) {
> + btrfs_abort_transaction(trans, ret);
> + goto out;
> + }
> +
> + ret = do_free_extent_accounting(trans, bytenr, &delta, ret);
> }
> btrfs_release_path(path);
>
> diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
> index 6ea96e76655e..dbb7de95241b 100644
> --- a/fs/btrfs/fs.h
> +++ b/fs/btrfs/fs.h
> @@ -547,6 +547,7 @@ struct btrfs_fs_info {
> struct mutex transaction_kthread_mutex;
> struct mutex cleaner_mutex;
> struct mutex chunk_mutex;
> + struct mutex remap_mutex;
>
> /*
> * This is taken to make sure we don't set block groups ro after the
> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> index e1f1da9336e7..03a1246af678 100644
> --- a/fs/btrfs/relocation.c
> +++ b/fs/btrfs/relocation.c
> @@ -37,6 +37,7 @@
> #include "super.h"
> #include "tree-checker.h"
> #include "raid-stripe-tree.h"
> +#include "free-space-tree.h"
>
> /*
> * Relocation overview
> @@ -3884,6 +3885,148 @@ static const char *stage_to_string(enum reloc_stage stage)
> return "unknown";
> }
>
> +static void adjust_block_group_remap_bytes(struct btrfs_trans_handle *trans,
> + struct btrfs_block_group *bg,
> + s64 diff)
> +{
> + struct btrfs_fs_info *fs_info = trans->fs_info;
> + bool bg_already_dirty = true;
> +
> + bg->remap_bytes += diff;
> +
> + if (bg->used == 0 && bg->remap_bytes == 0)
> + btrfs_mark_bg_unused(bg);
> +
> + spin_lock(&trans->transaction->dirty_bgs_lock);
> + if (list_empty(&bg->dirty_list)) {
> + list_add_tail(&bg->dirty_list, &trans->transaction->dirty_bgs);
> + bg_already_dirty = false;
> + btrfs_get_block_group(bg);
> + }
> + spin_unlock(&trans->transaction->dirty_bgs_lock);
> +
> + /* Modified block groups are accounted for in the delayed_refs_rsv. */
> + if (!bg_already_dirty)
> + btrfs_inc_delayed_refs_rsv_bg_updates(fs_info);
> +}
> +
> +static int remove_chunk_stripes(struct btrfs_trans_handle *trans,
> + struct btrfs_chunk_map *chunk,
> + struct btrfs_path *path)
> +{
> + struct btrfs_fs_info *fs_info = trans->fs_info;
> + struct btrfs_key key;
> + struct extent_buffer *leaf;
> + struct btrfs_chunk *c;
> + int ret;
> +
> + key.objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID;
> + key.type = BTRFS_CHUNK_ITEM_KEY;
> + key.offset = chunk->start;
> +
> + ret = btrfs_search_slot(trans, fs_info->chunk_root, &key, path,
> + 0, 1);
> + if (ret) {
> + if (ret == 1) {
> + btrfs_release_path(path);
> + ret = -ENOENT;
> + }
> + return ret;
> + }
> +
> + leaf = path->nodes[0];
> +
> + c = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_chunk);
> + btrfs_set_chunk_num_stripes(leaf, c, 0);
> +
> + btrfs_truncate_item(trans, path, offsetof(struct btrfs_chunk, stripe),
> + 1);
> +
> + btrfs_mark_buffer_dirty(trans, leaf);
> +
> + btrfs_release_path(path);
> +
> + return 0;
> +}
> +
Same question as elsewhere, but placed here for clarity:
Why can't this be queued for normal unused bgs deletion, rather than
having a special remap bg deletion function?
> +static int last_identity_remap_gone(struct btrfs_trans_handle *trans,
> + struct btrfs_chunk_map *chunk,
> + struct btrfs_block_group *bg,
> + struct btrfs_path *path)
> +{
> + int ret;
> +
> + ret = btrfs_remove_dev_extents(trans, chunk);
> + if (ret)
> + return ret;
> +
> + mutex_lock(&trans->fs_info->chunk_mutex);
> +
> + for (unsigned int i = 0; i < chunk->num_stripes; i++) {
> + ret = btrfs_update_device(trans, chunk->stripes[i].dev);
> + if (ret) {
> + mutex_unlock(&trans->fs_info->chunk_mutex);
> + return ret;
> + }
> + }
> +
> + mutex_unlock(&trans->fs_info->chunk_mutex);
> +
> + write_lock(&trans->fs_info->mapping_tree_lock);
> + btrfs_chunk_map_device_clear_bits(chunk, CHUNK_ALLOCATED);
> + write_unlock(&trans->fs_info->mapping_tree_lock);
> +
> + btrfs_remove_bg_from_sinfo(bg);
> +
> + ret = remove_chunk_stripes(trans, chunk, path);
> + if (ret)
> + return ret;
> +
> + return 0;
> +}
> +
> +static int adjust_identity_remap_count(struct btrfs_trans_handle *trans,
> + struct btrfs_path *path,
> + struct btrfs_block_group *bg, int delta)
> +{
> + struct btrfs_fs_info *fs_info = trans->fs_info;
> + struct btrfs_chunk_map *chunk;
> + bool bg_already_dirty = true;
> + int ret;
> +
> + WARN_ON(delta < 0 && -delta > bg->identity_remap_count);
> +
> + bg->identity_remap_count += delta;
> +
> + spin_lock(&trans->transaction->dirty_bgs_lock);
> + if (list_empty(&bg->dirty_list)) {
> + list_add_tail(&bg->dirty_list, &trans->transaction->dirty_bgs);
> + bg_already_dirty = false;
> + btrfs_get_block_group(bg);
> + }
> + spin_unlock(&trans->transaction->dirty_bgs_lock);
> +
> + /* Modified block groups are accounted for in the delayed_refs_rsv. */
> + if (!bg_already_dirty)
> + btrfs_inc_delayed_refs_rsv_bg_updates(fs_info);
> +
> + if (bg->identity_remap_count != 0)
> + return 0;
> +
> + chunk = btrfs_find_chunk_map(fs_info, bg->start, 1);
> + if (!chunk)
> + return -ENOENT;
> +
> + ret = last_identity_remap_gone(trans, chunk, bg, path);
> + if (ret)
> + goto end;
> +
> + ret = 0;
> +end:
> + btrfs_free_chunk_map(chunk);
> + return ret;
> +}
> +
> int btrfs_translate_remap(struct btrfs_fs_info *fs_info, u64 *logical,
> u64 *length, bool nolock)
> {
> @@ -4504,3 +4647,370 @@ u64 btrfs_get_reloc_bg_bytenr(const struct btrfs_fs_info *fs_info)
> logical = fs_info->reloc_ctl->block_group->start;
> return logical;
> }
> +
> +static int remove_range_from_remap_tree(struct btrfs_trans_handle *trans,
> + struct btrfs_path *path,
> + struct btrfs_block_group *bg,
> + u64 bytenr, u64 num_bytes)
> +{
> + int ret;
> + struct btrfs_fs_info *fs_info = trans->fs_info;
> + struct extent_buffer *leaf = path->nodes[0];
> + struct btrfs_key key, new_key;
> + struct btrfs_remap *remap_ptr = NULL, remap;
> + struct btrfs_block_group *dest_bg = NULL;
> + u64 end, new_addr = 0, remap_start, remap_length, overlap_length;
> + bool is_identity_remap;
> +
> + end = bytenr + num_bytes;
> +
> + btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
> +
> + is_identity_remap = key.type == BTRFS_IDENTITY_REMAP_KEY;
> +
> + remap_start = key.objectid;
> + remap_length = key.offset;
> +
> + if (!is_identity_remap) {
> + remap_ptr = btrfs_item_ptr(leaf, path->slots[0],
> + struct btrfs_remap);
> + new_addr = btrfs_remap_address(leaf, remap_ptr);
> +
> + dest_bg = btrfs_lookup_block_group(fs_info, new_addr);
> + }
> +
> + if (bytenr == remap_start && num_bytes >= remap_length) {
> + /* Remove entirely. */
> +
> + ret = btrfs_del_item(trans, fs_info->remap_root, path);
> + if (ret)
> + goto end;
> +
> + btrfs_release_path(path);
> +
> + overlap_length = remap_length;
> +
> + if (!is_identity_remap) {
> + /* Remove backref. */
> +
> + key.objectid = new_addr;
> + key.type = BTRFS_REMAP_BACKREF_KEY;
> + key.offset = remap_length;
> +
> + ret = btrfs_search_slot(trans, fs_info->remap_root,
> + &key, path, -1, 1);
> + if (ret) {
> + if (ret == 1) {
> + btrfs_release_path(path);
> + ret = -ENOENT;
> + }
> + goto end;
> + }
> +
> + ret = btrfs_del_item(trans, fs_info->remap_root, path);
> +
> + btrfs_release_path(path);
> +
> + if (ret)
> + goto end;
> +
> + adjust_block_group_remap_bytes(trans, dest_bg,
> + -remap_length);
> + } else {
> + ret = adjust_identity_remap_count(trans, path, bg, -1);
> + if (ret)
> + goto end;
> + }
> + } else if (bytenr == remap_start) {
> + /* Remove beginning. */
> +
> + new_key.objectid = end;
> + new_key.type = key.type;
> + new_key.offset = remap_length + remap_start - end;
> +
> + btrfs_set_item_key_safe(trans, path, &new_key);
> + btrfs_mark_buffer_dirty(trans, leaf);
> +
> + overlap_length = num_bytes;
> +
> + if (!is_identity_remap) {
> + btrfs_set_remap_address(leaf, remap_ptr,
> + new_addr + end - remap_start);
> + btrfs_release_path(path);
> +
> + /* Adjust backref. */
> +
> + key.objectid = new_addr;
> + key.type = BTRFS_REMAP_BACKREF_KEY;
> + key.offset = remap_length;
> +
> + ret = btrfs_search_slot(trans, fs_info->remap_root,
> + &key, path, -1, 1);
> + if (ret) {
> + if (ret == 1) {
> + btrfs_release_path(path);
> + ret = -ENOENT;
> + }
> + goto end;
> + }
> +
> + leaf = path->nodes[0];
> +
> + new_key.objectid = new_addr + end - remap_start;
> + new_key.type = BTRFS_REMAP_BACKREF_KEY;
> + new_key.offset = remap_length + remap_start - end;
> +
> + btrfs_set_item_key_safe(trans, path, &new_key);
> +
> + remap_ptr = btrfs_item_ptr(leaf, path->slots[0],
> + struct btrfs_remap);
> + btrfs_set_remap_address(leaf, remap_ptr, end);
> +
> + btrfs_mark_buffer_dirty(trans, path->nodes[0]);
> +
> + btrfs_release_path(path);
> +
> + adjust_block_group_remap_bytes(trans, dest_bg,
> + -num_bytes);
> + }
> + } else if (bytenr + num_bytes < remap_start + remap_length) {
> + /* Remove middle. */
> +
> + new_key.objectid = remap_start;
> + new_key.type = key.type;
> + new_key.offset = bytenr - remap_start;
> +
> + btrfs_set_item_key_safe(trans, path, &new_key);
> + btrfs_mark_buffer_dirty(trans, leaf);
> +
> + new_key.objectid = end;
> + new_key.offset = remap_start + remap_length - end;
> +
> + btrfs_release_path(path);
> +
> + overlap_length = num_bytes;
> +
> + if (!is_identity_remap) {
> + /* Add second remap entry. */
> +
> + ret = btrfs_insert_empty_item(trans, fs_info->remap_root,
> + path, &new_key,
> + sizeof(struct btrfs_remap));
> + if (ret)
> + goto end;
> +
> + btrfs_set_stack_remap_address(&remap,
> + new_addr + end - remap_start);
> +
> + write_extent_buffer(path->nodes[0], &remap,
> + btrfs_item_ptr_offset(path->nodes[0], path->slots[0]),
> + sizeof(struct btrfs_remap));
> +
> + btrfs_release_path(path);
> +
> + /* Shorten backref entry. */
> +
> + key.objectid = new_addr;
> + key.type = BTRFS_REMAP_BACKREF_KEY;
> + key.offset = remap_length;
> +
> + ret = btrfs_search_slot(trans, fs_info->remap_root,
> + &key, path, -1, 1);
> + if (ret) {
> + if (ret == 1) {
> + btrfs_release_path(path);
> + ret = -ENOENT;
> + }
> + goto end;
> + }
> +
> + new_key.objectid = new_addr;
> + new_key.type = BTRFS_REMAP_BACKREF_KEY;
> + new_key.offset = bytenr - remap_start;
> +
> + btrfs_set_item_key_safe(trans, path, &new_key);
> + btrfs_mark_buffer_dirty(trans, path->nodes[0]);
> +
> + btrfs_release_path(path);
> +
> + /* Add second backref entry. */
> +
> + new_key.objectid = new_addr + end - remap_start;
> + new_key.type = BTRFS_REMAP_BACKREF_KEY;
> + new_key.offset = remap_start + remap_length - end;
> +
> + ret = btrfs_insert_empty_item(trans, fs_info->remap_root,
> + path, &new_key,
> + sizeof(struct btrfs_remap));
> + if (ret)
> + goto end;
> +
> + btrfs_set_stack_remap_address(&remap, end);
> +
> + write_extent_buffer(path->nodes[0], &remap,
> + btrfs_item_ptr_offset(path->nodes[0], path->slots[0]),
> + sizeof(struct btrfs_remap));
> +
> + btrfs_release_path(path);
> +
> + adjust_block_group_remap_bytes(trans, dest_bg,
> + -num_bytes);
> + } else {
> + /* Add second identity remap entry. */
> +
> + ret = btrfs_insert_empty_item(trans, fs_info->remap_root,
> + path, &new_key, 0);
> + if (ret)
> + goto end;
> +
> + btrfs_release_path(path);
> +
> + ret = adjust_identity_remap_count(trans, path, bg, 1);
> + if (ret)
> + goto end;
> + }
> + } else {
> + /* Remove end. */
> +
> + new_key.objectid = remap_start;
> + new_key.type = key.type;
> + new_key.offset = bytenr - remap_start;
> +
> + btrfs_set_item_key_safe(trans, path, &new_key);
> + btrfs_mark_buffer_dirty(trans, leaf);
> +
> + btrfs_release_path(path);
> +
> + overlap_length = remap_start + remap_length - bytenr;
> +
> + if (!is_identity_remap) {
> + /* Shorten backref entry. */
> +
> + key.objectid = new_addr;
> + key.type = BTRFS_REMAP_BACKREF_KEY;
> + key.offset = remap_length;
> +
> + ret = btrfs_search_slot(trans, fs_info->remap_root,
> + &key, path, -1, 1);
> + if (ret) {
> + if (ret == 1) {
> + btrfs_release_path(path);
> + ret = -ENOENT;
> + }
> + goto end;
> + }
> +
> + new_key.objectid = new_addr;
> + new_key.type = BTRFS_REMAP_BACKREF_KEY;
> + new_key.offset = bytenr - remap_start;
> +
> + btrfs_set_item_key_safe(trans, path, &new_key);
> + btrfs_mark_buffer_dirty(trans, path->nodes[0]);
> +
> + btrfs_release_path(path);
> +
> + adjust_block_group_remap_bytes(trans, dest_bg,
> + bytenr - remap_start - remap_length);
> + }
> + }
> +
> + if (!is_identity_remap) {
> + ret = btrfs_add_to_free_space_tree(trans,
> + bytenr - remap_start + new_addr,
> + overlap_length);
> + if (ret)
> + goto end;
> + }
> +
> + ret = overlap_length;
> +
> +end:
> + if (dest_bg)
> + btrfs_put_block_group(dest_bg);
> +
> + return ret;
> +}
> +
> +/*
> + * Returns 1 if remove_range_from_remap_tree() has been called successfully,
> + * 0 if block group wasn't remapped, and a negative number on error.
> + */
> +int btrfs_remove_extent_from_remap_tree(struct btrfs_trans_handle *trans,
> + struct btrfs_path *path,
> + u64 bytenr, u64 num_bytes)
> +{
> + struct btrfs_fs_info *fs_info = trans->fs_info;
> + struct btrfs_key key, found_key;
> + struct extent_buffer *leaf;
> + struct btrfs_block_group *bg;
> + int ret, length;
> +
> + if (!(btrfs_super_incompat_flags(fs_info->super_copy) &
> + BTRFS_FEATURE_INCOMPAT_REMAP_TREE))
> + return 0;
> +
> + bg = btrfs_lookup_block_group(fs_info, bytenr);
> + if (!bg)
> + return 0;
> +
> + mutex_lock(&fs_info->remap_mutex);
> +
> + if (!(bg->flags & BTRFS_BLOCK_GROUP_REMAPPED)) {
> + mutex_unlock(&fs_info->remap_mutex);
> + btrfs_put_block_group(bg);
> + return 0;
> + }
> +
> + do {
> + key.objectid = bytenr;
> + key.type = (u8)-1;
> + key.offset = (u64)-1;
> +
> + ret = btrfs_search_slot(trans, fs_info->remap_root, &key, path,
> + -1, 1);
> + if (ret < 0)
> + goto end;
> +
> + leaf = path->nodes[0];
> +
> + if (path->slots[0] == 0) {
> + ret = -ENOENT;
> + goto end;
> + }
> +
> + path->slots[0]--;
> +
> + btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
> +
> + if (found_key.type != BTRFS_IDENTITY_REMAP_KEY &&
> + found_key.type != BTRFS_REMAP_KEY) {
> + ret = -ENOENT;
> + goto end;
> + }
> +
> + if (bytenr < found_key.objectid ||
> + bytenr >= found_key.objectid + found_key.offset) {
> + ret = -ENOENT;
> + goto end;
> + }
> +
> + length = remove_range_from_remap_tree(trans, path, bg, bytenr,
> + num_bytes);
> + if (length < 0) {
> + ret = length;
> + goto end;
> + }
> +
> + bytenr += length;
> + num_bytes -= length;
> + } while (num_bytes > 0);
> +
> + ret = 1;
> +
> +end:
> + mutex_unlock(&fs_info->remap_mutex);
> +
> + btrfs_put_block_group(bg);
> + btrfs_release_path(path);
> + return ret;
> +}
> diff --git a/fs/btrfs/relocation.h b/fs/btrfs/relocation.h
> index a653c42a25a3..4b0bb34b3fc1 100644
> --- a/fs/btrfs/relocation.h
> +++ b/fs/btrfs/relocation.h
> @@ -33,5 +33,8 @@ bool btrfs_should_ignore_reloc_root(const struct btrfs_root *root);
> u64 btrfs_get_reloc_bg_bytenr(const struct btrfs_fs_info *fs_info);
> int btrfs_translate_remap(struct btrfs_fs_info *fs_info, u64 *logical,
> u64 *length, bool nolock);
> +int btrfs_remove_extent_from_remap_tree(struct btrfs_trans_handle *trans,
> + struct btrfs_path *path,
> + u64 bytenr, u64 num_bytes);
>
> #endif
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index a2c49cb8bfc6..fc2b3e7de32e 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -2941,8 +2941,8 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
> return ret;
> }
>
> -static noinline int btrfs_update_device(struct btrfs_trans_handle *trans,
> - struct btrfs_device *device)
> +int btrfs_update_device(struct btrfs_trans_handle *trans,
> + struct btrfs_device *device)
> {
> int ret;
> struct btrfs_path *path;
> @@ -3246,25 +3246,13 @@ static int remove_chunk_item(struct btrfs_trans_handle *trans,
> return btrfs_free_chunk(trans, chunk_offset);
> }
>
> -int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset)
> +int btrfs_remove_dev_extents(struct btrfs_trans_handle *trans,
> + struct btrfs_chunk_map *map)
> {
> struct btrfs_fs_info *fs_info = trans->fs_info;
> - struct btrfs_chunk_map *map;
> + struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
> u64 dev_extent_len = 0;
> int i, ret = 0;
> - struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
> -
> - map = btrfs_get_chunk_map(fs_info, chunk_offset, 1);
> - if (IS_ERR(map)) {
> - /*
> - * This is a logic error, but we don't want to just rely on the
> - * user having built with ASSERT enabled, so if ASSERT doesn't
> - * do anything we still error out.
> - */
> - DEBUG_WARN("errr %ld reading chunk map at offset %llu",
> - PTR_ERR(map), chunk_offset);
> - return PTR_ERR(map);
> - }
>
> /*
> * First delete the device extent items from the devices btree.
> @@ -3285,7 +3273,7 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset)
> if (ret) {
> mutex_unlock(&fs_devices->device_list_mutex);
> btrfs_abort_transaction(trans, ret);
> - goto out;
> + return ret;
> }
>
> if (device->bytes_used > 0) {
> @@ -3305,6 +3293,30 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset)
> }
> mutex_unlock(&fs_devices->device_list_mutex);
>
> + return 0;
> +}
> +
> +int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset)
> +{
> + struct btrfs_fs_info *fs_info = trans->fs_info;
> + struct btrfs_chunk_map *map;
> + int ret;
> +
> + map = btrfs_get_chunk_map(fs_info, chunk_offset, 1);
> + if (IS_ERR(map)) {
> + /*
> + * This is a logic error, but we don't want to just rely on the
> + * user having built with ASSERT enabled, so if ASSERT doesn't
> + * do anything we still error out.
> + */
> + ASSERT(0);
> + return PTR_ERR(map);
> + }
> +
> + ret = btrfs_remove_dev_extents(trans, map);
> + if (ret)
> + goto out;
> +
> /*
> * We acquire fs_info->chunk_mutex for 2 reasons:
> *
> @@ -5448,7 +5460,7 @@ static void chunk_map_device_set_bits(struct btrfs_chunk_map *map, unsigned int
> }
> }
>
> -static void chunk_map_device_clear_bits(struct btrfs_chunk_map *map, unsigned int bits)
> +void btrfs_chunk_map_device_clear_bits(struct btrfs_chunk_map *map, unsigned int bits)
> {
> for (int i = 0; i < map->num_stripes; i++) {
> struct btrfs_io_stripe *stripe = &map->stripes[i];
> @@ -5465,7 +5477,7 @@ void btrfs_remove_chunk_map(struct btrfs_fs_info *fs_info, struct btrfs_chunk_ma
> write_lock(&fs_info->mapping_tree_lock);
> rb_erase_cached(&map->rb_node, &fs_info->mapping_tree);
> RB_CLEAR_NODE(&map->rb_node);
> - chunk_map_device_clear_bits(map, CHUNK_ALLOCATED);
> + btrfs_chunk_map_device_clear_bits(map, CHUNK_ALLOCATED);
> write_unlock(&fs_info->mapping_tree_lock);
>
> /* Once for the tree reference. */
> @@ -5501,7 +5513,7 @@ int btrfs_add_chunk_map(struct btrfs_fs_info *fs_info, struct btrfs_chunk_map *m
> return -EEXIST;
> }
> chunk_map_device_set_bits(map, CHUNK_ALLOCATED);
> - chunk_map_device_clear_bits(map, CHUNK_TRIMMED);
> + btrfs_chunk_map_device_clear_bits(map, CHUNK_TRIMMED);
> write_unlock(&fs_info->mapping_tree_lock);
>
> return 0;
> @@ -5866,7 +5878,7 @@ void btrfs_mapping_tree_free(struct btrfs_fs_info *fs_info)
> map = rb_entry(node, struct btrfs_chunk_map, rb_node);
> rb_erase_cached(&map->rb_node, &fs_info->mapping_tree);
> RB_CLEAR_NODE(&map->rb_node);
> - chunk_map_device_clear_bits(map, CHUNK_ALLOCATED);
> + btrfs_chunk_map_device_clear_bits(map, CHUNK_ALLOCATED);
> /* Once for the tree ref. */
> btrfs_free_chunk_map(map);
> cond_resched_rwlock_write(&fs_info->mapping_tree_lock);
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index 430be12fd5e7..64b34710b68b 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -789,6 +789,8 @@ u64 btrfs_calc_stripe_length(const struct btrfs_chunk_map *map);
> int btrfs_nr_parity_stripes(u64 type);
> int btrfs_chunk_alloc_add_chunk_item(struct btrfs_trans_handle *trans,
> struct btrfs_block_group *bg);
> +int btrfs_remove_dev_extents(struct btrfs_trans_handle *trans,
> + struct btrfs_chunk_map *map);
> int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset);
>
> #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
> @@ -900,6 +902,10 @@ bool btrfs_repair_one_zone(struct btrfs_fs_info *fs_info, u64 logical);
>
> bool btrfs_pinned_by_swapfile(struct btrfs_fs_info *fs_info, void *ptr);
> const u8 *btrfs_sb_fsid_ptr(const struct btrfs_super_block *sb);
> +int btrfs_update_device(struct btrfs_trans_handle *trans,
> + struct btrfs_device *device);
> +void btrfs_chunk_map_device_clear_bits(struct btrfs_chunk_map *map,
> + unsigned int bits);
>
> #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
> struct btrfs_io_context *alloc_btrfs_io_context(struct btrfs_fs_info *fs_info,
> --
> 2.49.1
>
^ permalink raw reply [flat|nested] 43+ messages in thread
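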
* Re: [PATCH v2 09/16] btrfs: release BG lock before calling btrfs_link_bg_list()
2025-08-13 14:34 ` [PATCH v2 09/16] btrfs: release BG lock before calling btrfs_link_bg_list() Mark Harmstone
@ 2025-08-16 0:32 ` Boris Burkov
2025-08-27 15:35 ` Mark Harmstone
0 siblings, 1 reply; 43+ messages in thread
From: Boris Burkov @ 2025-08-16 0:32 UTC (permalink / raw)
To: Mark Harmstone; +Cc: linux-btrfs
On Wed, Aug 13, 2025 at 03:34:51PM +0100, Mark Harmstone wrote:
> Release block_group->lock before calling btrfs_link_bg_list() in
> btrfs_delete_unused_bgs(), as this was causing lockdep issues.
>
> This lock isn't held in any other place that we call btrfs_link_bg_list(), as
> the block group lists are manipulated while holding fs_info->unused_bgs_lock.
>
Please include the offending lockdep output you are fixing.
Is this a generic fix unrelated to your other changes? I think a
separate patch from the series is clearer in that case. And it would
need a Fixes: tag (probably my patch that added btrfs_link_bg_list, haha)
Thanks.
> Signed-off-by: Mark Harmstone <mark@harmstone.com>
> ---
> fs/btrfs/block-group.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> index bed9c58b6cbc..8c28f829547e 100644
> --- a/fs/btrfs/block-group.c
> +++ b/fs/btrfs/block-group.c
> @@ -1620,6 +1620,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
> if ((space_info->total_bytes - block_group->length < used &&
> block_group->zone_unusable < block_group->length) ||
> has_unwritten_metadata(block_group)) {
> + spin_unlock(&block_group->lock);
> +
> /*
> * Add a reference for the list, compensate for the ref
> * drop under the "next" label for the
> @@ -1628,7 +1630,6 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
> btrfs_link_bg_list(block_group, &retry_list);
>
> trace_btrfs_skip_unused_block_group(block_group);
> - spin_unlock(&block_group->lock);
> spin_unlock(&space_info->lock);
> up_write(&space_info->groups_sem);
> goto next;
> --
> 2.49.1
>
* Re: [PATCH v2 15/16] btrfs: add fully_remapped_bgs list
2025-08-13 14:34 ` [PATCH v2 15/16] btrfs: add fully_remapped_bgs list Mark Harmstone
@ 2025-08-16 0:56 ` Boris Burkov
2025-08-27 18:51 ` Mark Harmstone
0 siblings, 1 reply; 43+ messages in thread
From: Boris Burkov @ 2025-08-16 0:56 UTC (permalink / raw)
To: Mark Harmstone; +Cc: linux-btrfs
On Wed, Aug 13, 2025 at 03:34:57PM +0100, Mark Harmstone wrote:
> Add a fully_remapped_bgs list to struct btrfs_transaction, which holds
> block groups which have just had their last identity remap removed.
>
> In btrfs_finish_extent_commit() we can then discard their full dev
> extents, as we're also setting their num_stripes to 0. Finally if the BG
> is now empty, i.e. there are neither identity remaps nor normal remaps,
> add it to the unused_bgs list to be taken care of there.
>
> Signed-off-by: Mark Harmstone <mark@harmstone.com>
> ---
> fs/btrfs/block-group.c | 26 ++++++++++++++++++++++++++
> fs/btrfs/block-group.h | 2 ++
> fs/btrfs/extent-tree.c | 37 ++++++++++++++++++++++++++++++++++++-
> fs/btrfs/relocation.c | 2 ++
> fs/btrfs/transaction.c | 1 +
> fs/btrfs/transaction.h | 1 +
> 6 files changed, 68 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> index 7a0524138235..7f8707dfd62c 100644
> --- a/fs/btrfs/block-group.c
> +++ b/fs/btrfs/block-group.c
> @@ -1803,6 +1803,14 @@ void btrfs_mark_bg_unused(struct btrfs_block_group *bg)
> struct btrfs_fs_info *fs_info = bg->fs_info;
>
> spin_lock(&fs_info->unused_bgs_lock);
> +
> + /* Leave fully remapped block groups on the fully_remapped_bgs list. */
> + if (bg->flags & BTRFS_BLOCK_GROUP_REMAPPED &&
> + bg->identity_remap_count == 0) {
> + spin_unlock(&fs_info->unused_bgs_lock);
> + return;
> + }
> +
> if (list_empty(&bg->bg_list)) {
> btrfs_get_block_group(bg);
> trace_btrfs_add_unused_block_group(bg);
> @@ -4792,3 +4800,21 @@ bool btrfs_block_group_should_use_size_class(const struct btrfs_block_group *bg)
> return false;
> return true;
> }
> +
> +void btrfs_mark_bg_fully_remapped(struct btrfs_block_group *bg,
> + struct btrfs_trans_handle *trans)
> +{
> + struct btrfs_fs_info *fs_info = trans->fs_info;
> +
> + spin_lock(&fs_info->unused_bgs_lock);
> +
> + if (!list_empty(&bg->bg_list))
> + list_del(&bg->bg_list);
> + else
> + btrfs_get_block_group(bg);
> +
> + list_add_tail(&bg->bg_list, &trans->transaction->fully_remapped_bgs);
> +
> + spin_unlock(&fs_info->unused_bgs_lock);
> +
> +}
Why does the fully remapped bg list take over from other lists rather
than using the link function?
What protection is in place to ensure that we never mark it fully
remapped while it is on the new_bgs list (as with the unused list)?
I suspect such a block group won't ever be reclaimed even with explicit
balances, but it is important to check and be sure.
If this *is* strictly necessary, I would like to see an extension to
btrfs_link_bg_list that can handle the list_move_tail variant.
Another option is to generalize this one together with mark_unused()
and just check the NEW flag here.
> diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
> index 0433b0127ed8..025ea2c6f8a8 100644
> --- a/fs/btrfs/block-group.h
> +++ b/fs/btrfs/block-group.h
> @@ -408,5 +408,7 @@ int btrfs_use_block_group_size_class(struct btrfs_block_group *bg,
> enum btrfs_block_group_size_class size_class,
> bool force_wrong_size_class);
> bool btrfs_block_group_should_use_size_class(const struct btrfs_block_group *bg);
> +void btrfs_mark_bg_fully_remapped(struct btrfs_block_group *bg,
> + struct btrfs_trans_handle *trans);
>
> #endif /* BTRFS_BLOCK_GROUP_H */
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index b02e99b41553..157a032df128 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2853,7 +2853,7 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans)
> {
> struct btrfs_fs_info *fs_info = trans->fs_info;
> struct btrfs_block_group *block_group, *tmp;
> - struct list_head *deleted_bgs;
> + struct list_head *deleted_bgs, *fully_remapped_bgs;
> struct extent_io_tree *unpin = &trans->transaction->pinned_extents;
> struct extent_state *cached_state = NULL;
> u64 start;
> @@ -2951,6 +2951,41 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans)
> }
> }
>
1. This will block the next transaction waiting on TRANS_STATE_COMPLETED.
2. This is not compatible with the spirit and purpose of async discard,
which is our default and best discard mode.
3. This doesn't check the discard mode at all; it just defaults to
DISCARD_SYNC-style behavior, so it doesn't respect NODISCARD either.
> + fully_remapped_bgs = &trans->transaction->fully_remapped_bgs;
> + list_for_each_entry_safe(block_group, tmp, fully_remapped_bgs, bg_list) {
> + struct btrfs_chunk_map *map;
> +
> + if (!TRANS_ABORTED(trans))
> + ret = btrfs_discard_extent(fs_info, block_group->start,
> + block_group->length, NULL,
> + false);
> +
> + map = btrfs_get_chunk_map(fs_info, block_group->start, 1);
> + if (IS_ERR(map))
> + return PTR_ERR(map);
> +
> + /*
> + * Set num_stripes to 0, so that btrfs_remove_dev_extents()
> + * won't run a second time.
> + */
> + map->num_stripes = 0;
> +
> + btrfs_free_chunk_map(map);
> +
> + if (block_group->used == 0 && block_group->remap_bytes == 0) {
> + spin_lock(&fs_info->unused_bgs_lock);
> + list_move_tail(&block_group->bg_list,
> + &fs_info->unused_bgs);
> + spin_unlock(&fs_info->unused_bgs_lock);
Please use the helpers; it's important for ensuring correct ref counting
in the long run. I also think that the previous patch had some
discussion about more standardized integration with unused_bgs, so I sort
of hope this code goes away entirely.
> + } else {
> + spin_lock(&fs_info->unused_bgs_lock);
> + list_del_init(&block_group->bg_list);
> + spin_unlock(&fs_info->unused_bgs_lock);
> +
> + btrfs_put_block_group(block_group);
> + }
> + }
> +
> return unpin_error;
> }
>
> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> index 84ff59866e96..0745a3d1c867 100644
> --- a/fs/btrfs/relocation.c
> +++ b/fs/btrfs/relocation.c
> @@ -4819,6 +4819,8 @@ static int last_identity_remap_gone(struct btrfs_trans_handle *trans,
> if (ret)
> return ret;
>
> + btrfs_mark_bg_fully_remapped(bg, trans);
> +
> return 0;
> }
>
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 64b9c427af6a..7c308d33e767 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -381,6 +381,7 @@ static noinline int join_transaction(struct btrfs_fs_info *fs_info,
> mutex_init(&cur_trans->cache_write_mutex);
> spin_lock_init(&cur_trans->dirty_bgs_lock);
> INIT_LIST_HEAD(&cur_trans->deleted_bgs);
> + INIT_LIST_HEAD(&cur_trans->fully_remapped_bgs);
> spin_lock_init(&cur_trans->dropped_roots_lock);
> list_add_tail(&cur_trans->list, &fs_info->trans_list);
> btrfs_extent_io_tree_init(fs_info, &cur_trans->dirty_pages,
> diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
> index 9f7c777af635..b362915288b5 100644
> --- a/fs/btrfs/transaction.h
> +++ b/fs/btrfs/transaction.h
> @@ -109,6 +109,7 @@ struct btrfs_transaction {
> spinlock_t dirty_bgs_lock;
> /* Protected by spin lock fs_info->unused_bgs_lock. */
> struct list_head deleted_bgs;
> + struct list_head fully_remapped_bgs;
> spinlock_t dropped_roots_lock;
> struct btrfs_delayed_ref_root delayed_refs;
> struct btrfs_fs_info *fs_info;
> --
> 2.49.1
>
* Re: [PATCH v2 16/16] btrfs: allow balancing remap tree
2025-08-13 14:34 ` [PATCH v2 16/16] btrfs: allow balancing remap tree Mark Harmstone
@ 2025-08-16 1:02 ` Boris Burkov
2025-09-02 14:58 ` Mark Harmstone
2025-09-02 15:21 ` Mark Harmstone
0 siblings, 2 replies; 43+ messages in thread
From: Boris Burkov @ 2025-08-16 1:02 UTC (permalink / raw)
To: Mark Harmstone; +Cc: linux-btrfs
On Wed, Aug 13, 2025 at 03:34:58PM +0100, Mark Harmstone wrote:
> Balancing the REMAP chunk, i.e. the chunk in which the remap tree lives,
> is a special case.
>
> We can't use the remap tree itself for this, as then we'd have no way to
> bootstrap it on mount. And we can't use the pre-remap tree code for this
> as it relies on walking the extent tree, and we're not creating backrefs
> for REMAP chunks.
>
> So instead, if a balance would relocate any REMAP block groups, mark
> those block groups as readonly and COW every leaf of the remap tree.
>
> There's more sophisticated ways of doing this, such as only COWing nodes
> within a block group that's to be relocated, but they're fiddly and with
> lots of edge cases. Plus it's not anticipated that a) the number of
> REMAP chunks is going to be particularly large, or b) that users will
> want to only relocate some of these chunks - the main use case here is
> to unbreak RAID conversion and device removal.
>
> Signed-off-by: Mark Harmstone <mark@harmstone.com>
> ---
> fs/btrfs/volumes.c | 161 +++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 157 insertions(+), 4 deletions(-)
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index e13f16a7a904..dc535ed90ae0 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -4011,8 +4011,11 @@ static bool should_balance_chunk(struct extent_buffer *leaf, struct btrfs_chunk
> struct btrfs_balance_args *bargs = NULL;
> u64 chunk_type = btrfs_chunk_type(leaf, chunk);
>
> - if (chunk_type & BTRFS_BLOCK_GROUP_REMAP)
> - return false;
> + /* treat REMAP chunks as METADATA */
> + if (chunk_type & BTRFS_BLOCK_GROUP_REMAP) {
> + chunk_type &= ~BTRFS_BLOCK_GROUP_REMAP;
> + chunk_type |= BTRFS_BLOCK_GROUP_METADATA;
why not honor the REMAP chunk type where appropriate?
> + }
>
> /* type filter */
> if (!((chunk_type & BTRFS_BLOCK_GROUP_TYPE_MASK) &
> @@ -4095,6 +4098,113 @@ static bool should_balance_chunk(struct extent_buffer *leaf, struct btrfs_chunk
> return true;
> }
>
> +struct remap_chunk_info {
> + struct list_head list;
> + u64 offset;
> + struct btrfs_block_group *bg;
> + bool made_ro;
> +};
> +
> +static int cow_remap_tree(struct btrfs_trans_handle *trans,
> + struct btrfs_path *path)
> +{
> + struct btrfs_fs_info *fs_info = trans->fs_info;
> + struct btrfs_key key = { 0 };
> + int ret;
> +
> + ret = btrfs_search_slot(trans, fs_info->remap_root, &key, path, 0, 1);
> + if (ret < 0)
> + return ret;
> +
> + while (true) {
> + ret = btrfs_next_leaf(fs_info->remap_root, path);
> + if (ret < 0) {
> + return ret;
> + } else if (ret > 0) {
> + ret = 0;
> + break;
> + }
> +
> + btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
> +
> + btrfs_release_path(path);
> +
> + ret = btrfs_search_slot(trans, fs_info->remap_root, &key, path,
> + 0, 1);
> + if (ret < 0)
> + break;
> + }
> +
> + return ret;
> +}
> +
> +static int balance_remap_chunks(struct btrfs_fs_info *fs_info,
> + struct btrfs_path *path,
> + struct list_head *chunks)
> +{
> + struct remap_chunk_info *rci, *tmp;
> + struct btrfs_trans_handle *trans;
> + int ret;
> +
> + list_for_each_entry_safe(rci, tmp, chunks, list) {
> + rci->bg = btrfs_lookup_block_group(fs_info, rci->offset);
> + if (!rci->bg) {
> + list_del(&rci->list);
> + kfree(rci);
> + continue;
> + }
> +
> + ret = btrfs_inc_block_group_ro(rci->bg, false);
Just thinking out loud, what happens if we concurrently attempt a
balance that would need to use the remap tree? Is something structurally
blocking that at a higher level? Or will it fail? How will that failure
be handled? Does the answer hold for btrfs-internal background reclaim
rather than explicit balancing?
> + if (ret)
> + goto end;
> +
> + rci->made_ro = true;
> + }
> +
> + if (list_empty(chunks))
> + return 0;
> +
> + trans = btrfs_start_transaction(fs_info->remap_root, 0);
> + if (IS_ERR(trans)) {
> + ret = PTR_ERR(trans);
> + goto end;
> + }
> +
> + mutex_lock(&fs_info->remap_mutex);
> +
> + ret = cow_remap_tree(trans, path);
> +
> + btrfs_release_path(path);
> +
> + mutex_unlock(&fs_info->remap_mutex);
> +
> + btrfs_commit_transaction(trans);
> +
> +end:
> + while (!list_empty(chunks)) {
> + bool unused;
> +
> + rci = list_first_entry(chunks, struct remap_chunk_info, list);
> +
> + spin_lock(&rci->bg->lock);
> + unused = !btrfs_is_block_group_used(rci->bg);
> + spin_unlock(&rci->bg->lock);
> +
> + if (unused)
> + btrfs_mark_bg_unused(rci->bg);
> +
> + if (rci->made_ro)
> + btrfs_dec_block_group_ro(rci->bg);
> +
> + btrfs_put_block_group(rci->bg);
> +
> + list_del(&rci->list);
> + kfree(rci);
> + }
> +
> + return ret;
> +}
> +
> static int __btrfs_balance(struct btrfs_fs_info *fs_info)
> {
> struct btrfs_balance_control *bctl = fs_info->balance_ctl;
> @@ -4117,6 +4227,9 @@ static int __btrfs_balance(struct btrfs_fs_info *fs_info)
> u32 count_meta = 0;
> u32 count_sys = 0;
> int chunk_reserved = 0;
> + struct remap_chunk_info *rci;
> + unsigned int num_remap_chunks = 0;
> + LIST_HEAD(remap_chunks);
>
> path = btrfs_alloc_path();
> if (!path) {
> @@ -4215,7 +4328,8 @@ static int __btrfs_balance(struct btrfs_fs_info *fs_info)
> count_data++;
> else if (chunk_type & BTRFS_BLOCK_GROUP_SYSTEM)
> count_sys++;
> - else if (chunk_type & BTRFS_BLOCK_GROUP_METADATA)
> + else if (chunk_type & (BTRFS_BLOCK_GROUP_METADATA |
> + BTRFS_BLOCK_GROUP_REMAP))
> count_meta++;
>
> goto loop;
> @@ -4235,6 +4349,30 @@ static int __btrfs_balance(struct btrfs_fs_info *fs_info)
> goto loop;
> }
>
> + /*
> + * Balancing REMAP chunks takes place separately - add the
> + * details to a list so it can be processed later.
> + */
> + if (chunk_type & BTRFS_BLOCK_GROUP_REMAP) {
> + mutex_unlock(&fs_info->reclaim_bgs_lock);
> +
> + rci = kmalloc(sizeof(struct remap_chunk_info),
> + GFP_NOFS);
> + if (!rci) {
> + ret = -ENOMEM;
> + goto error;
> + }
> +
> + rci->offset = found_key.offset;
> + rci->bg = NULL;
> + rci->made_ro = false;
> + list_add_tail(&rci->list, &remap_chunks);
> +
> + num_remap_chunks++;
> +
> + goto loop;
> + }
> +
> if (!chunk_reserved) {
> /*
> * We may be relocating the only data chunk we have,
> @@ -4274,11 +4412,26 @@ static int __btrfs_balance(struct btrfs_fs_info *fs_info)
> key.offset = found_key.offset - 1;
> }
>
> + btrfs_release_path(path);
> +
> if (counting) {
> - btrfs_release_path(path);
> counting = false;
> goto again;
> }
> +
> + if (!list_empty(&remap_chunks)) {
> + ret = balance_remap_chunks(fs_info, path, &remap_chunks);
> + if (ret == -ENOSPC)
> + enospc_errors++;
> +
> + if (!ret) {
> + btrfs_delete_unused_bgs(fs_info);
Why is this necessary here?
> +
> + spin_lock(&fs_info->balance_lock);
> + bctl->stat.completed += num_remap_chunks;
> + spin_unlock(&fs_info->balance_lock);
> + }
> + }
> error:
> btrfs_free_path(path);
> if (enospc_errors) {
> --
> 2.49.1
>
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v2 01/16] btrfs: add definitions and constants for remap-tree
2025-08-15 23:51 ` Boris Burkov
@ 2025-08-18 17:21 ` Mark Harmstone
2025-08-18 17:33 ` Boris Burkov
0 siblings, 1 reply; 43+ messages in thread
From: Mark Harmstone @ 2025-08-18 17:21 UTC (permalink / raw)
To: Boris Burkov; +Cc: linux-btrfs
Thanks Boris.
The funny indentation is an artefact of diff: the leading plus sign throws off
the tab stops in your e-mail client. If you apply the patch with `git am`, it
looks fine.
On 16/08/2025 12.51 am, Boris Burkov wrote:
> On Wed, Aug 13, 2025 at 03:34:43PM +0100, Mark Harmstone wrote:
>> Add an incompat flag for the new remap-tree feature, and the constants
>> and definitions needed to support it.
>>
>> Signed-off-by: Mark Harmstone <mark@harmstone.com>
>
> Some formatting nits, but you can add
>
> Reviewed-by: Boris Burkov <boris@bur.io>
>
>> ---
>> fs/btrfs/accessors.h | 3 +++
>> fs/btrfs/locking.c | 1 +
>> fs/btrfs/sysfs.c | 2 ++
>> fs/btrfs/tree-checker.c | 6 ++----
>> fs/btrfs/tree-checker.h | 5 +++++
>> fs/btrfs/volumes.c | 1 +
>> include/uapi/linux/btrfs.h | 1 +
>> include/uapi/linux/btrfs_tree.h | 12 ++++++++++++
>> 8 files changed, 27 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/btrfs/accessors.h b/fs/btrfs/accessors.h
>> index 99b3ced12805..95a1ca8c099b 100644
>> --- a/fs/btrfs/accessors.h
>> +++ b/fs/btrfs/accessors.h
>> @@ -1009,6 +1009,9 @@ BTRFS_SETGET_STACK_FUNCS(stack_verity_descriptor_encryption,
>> BTRFS_SETGET_STACK_FUNCS(stack_verity_descriptor_size,
>> struct btrfs_verity_descriptor_item, size, 64);
>>
>> +BTRFS_SETGET_FUNCS(remap_address, struct btrfs_remap, address, 64);
>> +BTRFS_SETGET_STACK_FUNCS(stack_remap_address, struct btrfs_remap, address, 64);
>> +
>> /* Cast into the data area of the leaf. */
>> #define btrfs_item_ptr(leaf, slot, type) \
>> ((type *)(btrfs_item_nr_offset(leaf, 0) + btrfs_item_offset(leaf, slot)))
>> diff --git a/fs/btrfs/locking.c b/fs/btrfs/locking.c
>> index a3e6d9616e60..26f810258486 100644
>> --- a/fs/btrfs/locking.c
>> +++ b/fs/btrfs/locking.c
>> @@ -73,6 +73,7 @@ static struct btrfs_lockdep_keyset {
>> { .id = BTRFS_FREE_SPACE_TREE_OBJECTID, DEFINE_NAME("free-space") },
>> { .id = BTRFS_BLOCK_GROUP_TREE_OBJECTID, DEFINE_NAME("block-group") },
>> { .id = BTRFS_RAID_STRIPE_TREE_OBJECTID, DEFINE_NAME("raid-stripe") },
>> + { .id = BTRFS_REMAP_TREE_OBJECTID, DEFINE_NAME("remap-tree") },
>> { .id = 0, DEFINE_NAME("tree") },
>> };
>>
>> diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
>> index 81f52c1f55ce..857d2772db1c 100644
>> --- a/fs/btrfs/sysfs.c
>> +++ b/fs/btrfs/sysfs.c
>> @@ -291,6 +291,7 @@ BTRFS_FEAT_ATTR_COMPAT_RO(free_space_tree, FREE_SPACE_TREE);
>> BTRFS_FEAT_ATTR_COMPAT_RO(block_group_tree, BLOCK_GROUP_TREE);
>> BTRFS_FEAT_ATTR_INCOMPAT(raid1c34, RAID1C34);
>> BTRFS_FEAT_ATTR_INCOMPAT(simple_quota, SIMPLE_QUOTA);
>> +BTRFS_FEAT_ATTR_INCOMPAT(remap_tree, REMAP_TREE);
>> #ifdef CONFIG_BLK_DEV_ZONED
>> BTRFS_FEAT_ATTR_INCOMPAT(zoned, ZONED);
>> #endif
>> @@ -325,6 +326,7 @@ static struct attribute *btrfs_supported_feature_attrs[] = {
>> BTRFS_FEAT_ATTR_PTR(raid1c34),
>> BTRFS_FEAT_ATTR_PTR(block_group_tree),
>> BTRFS_FEAT_ATTR_PTR(simple_quota),
>> + BTRFS_FEAT_ATTR_PTR(remap_tree),
>> #ifdef CONFIG_BLK_DEV_ZONED
>> BTRFS_FEAT_ATTR_PTR(zoned),
>> #endif
>> diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
>> index 0f556f4de3f9..76ec3698f197 100644
>> --- a/fs/btrfs/tree-checker.c
>> +++ b/fs/btrfs/tree-checker.c
>> @@ -912,12 +912,10 @@ int btrfs_check_chunk_valid(const struct btrfs_fs_info *fs_info,
>> length, btrfs_stripe_nr_to_offset(U32_MAX));
>> return -EUCLEAN;
>> }
>> - if (unlikely(type & ~(BTRFS_BLOCK_GROUP_TYPE_MASK |
>> - BTRFS_BLOCK_GROUP_PROFILE_MASK))) {
>> + if (unlikely(type & ~BTRFS_BLOCK_GROUP_VALID)) {
>> chunk_err(fs_info, leaf, chunk, logical,
>> "unrecognized chunk type: 0x%llx",
>> - ~(BTRFS_BLOCK_GROUP_TYPE_MASK |
>> - BTRFS_BLOCK_GROUP_PROFILE_MASK) & type);
>> + type & ~BTRFS_BLOCK_GROUP_VALID);
>> return -EUCLEAN;
>> }
>>
>> diff --git a/fs/btrfs/tree-checker.h b/fs/btrfs/tree-checker.h
>> index eb201f4ec3c7..833e2fd989eb 100644
>> --- a/fs/btrfs/tree-checker.h
>> +++ b/fs/btrfs/tree-checker.h
>> @@ -57,6 +57,11 @@ enum btrfs_tree_block_status {
>> BTRFS_TREE_BLOCK_WRITTEN_NOT_SET,
>> };
>>
>> +
>> +#define BTRFS_BLOCK_GROUP_VALID (BTRFS_BLOCK_GROUP_TYPE_MASK | \
>> + BTRFS_BLOCK_GROUP_PROFILE_MASK | \
>> + BTRFS_BLOCK_GROUP_REMAPPED)
>> +
>
> I think the two next lines should be lined up after the '('
>
> See the masks in include/uapi/linux/btrfs_tree.h
>
>> /*
>> * Exported simply for btrfs-progs which wants to have the
>> * btrfs_tree_block_status return codes.
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index fa7a929a0461..e067e9cd68a5 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -231,6 +231,7 @@ void btrfs_describe_block_groups(u64 bg_flags, char *buf, u32 size_buf)
>> DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_DATA, "data");
>> DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_SYSTEM, "system");
>> DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_METADATA, "metadata");
>> + DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_REMAPPED, "remapped");
>>
>> DESCRIBE_FLAG(BTRFS_AVAIL_ALLOC_BIT_SINGLE, "single");
>> for (i = 0; i < BTRFS_NR_RAID_TYPES; i++)
>> diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
>> index 8e710bbb688e..fba303ed49e6 100644
>> --- a/include/uapi/linux/btrfs.h
>> +++ b/include/uapi/linux/btrfs.h
>> @@ -336,6 +336,7 @@ struct btrfs_ioctl_fs_info_args {
>> #define BTRFS_FEATURE_INCOMPAT_EXTENT_TREE_V2 (1ULL << 13)
>> #define BTRFS_FEATURE_INCOMPAT_RAID_STRIPE_TREE (1ULL << 14)
>> #define BTRFS_FEATURE_INCOMPAT_SIMPLE_QUOTA (1ULL << 16)
>> +#define BTRFS_FEATURE_INCOMPAT_REMAP_TREE (1ULL << 17)
>>
>> struct btrfs_ioctl_feature_flags {
>> __u64 compat_flags;
>> diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
>> index fc29d273845d..4439d77a7252 100644
>> --- a/include/uapi/linux/btrfs_tree.h
>> +++ b/include/uapi/linux/btrfs_tree.h
>> @@ -76,6 +76,9 @@
>> /* Tracks RAID stripes in block groups. */
>> #define BTRFS_RAID_STRIPE_TREE_OBJECTID 12ULL
>>
>> +/* Holds details of remapped addresses after relocation. */
>> +#define BTRFS_REMAP_TREE_OBJECTID 13ULL
>> +
>> /* device stats in the device tree */
>> #define BTRFS_DEV_STATS_OBJECTID 0ULL
>>
>> @@ -282,6 +285,10 @@
>>
>> #define BTRFS_RAID_STRIPE_KEY 230
>>
>> +#define BTRFS_IDENTITY_REMAP_KEY 234
>> +#define BTRFS_REMAP_KEY 235
>> +#define BTRFS_REMAP_BACKREF_KEY 236
>
> more funny indenting
>
>> +
>> /*
>> * Records the overall state of the qgroups.
>> * There's only one instance of this key present,
>> @@ -1161,6 +1168,7 @@ struct btrfs_dev_replace_item {
>> #define BTRFS_BLOCK_GROUP_RAID6 (1ULL << 8)
>> #define BTRFS_BLOCK_GROUP_RAID1C3 (1ULL << 9)
>> #define BTRFS_BLOCK_GROUP_RAID1C4 (1ULL << 10)
>> +#define BTRFS_BLOCK_GROUP_REMAPPED (1ULL << 11)
>> #define BTRFS_BLOCK_GROUP_RESERVED (BTRFS_AVAIL_ALLOC_BIT_SINGLE | \
>> BTRFS_SPACE_INFO_GLOBAL_RSV)
>>
>> @@ -1323,4 +1331,8 @@ struct btrfs_verity_descriptor_item {
>> __u8 encryption;
>> } __attribute__ ((__packed__));
>>
>> +struct btrfs_remap {
>> + __le64 address;
>> +} __attribute__ ((__packed__));
>> +
>> #endif /* _BTRFS_CTREE_H_ */
>> --
>> 2.49.1
>>
* Re: [PATCH v2 01/16] btrfs: add definitions and constants for remap-tree
2025-08-16 0:17 ` Qu Wenruo
@ 2025-08-18 17:23 ` Mark Harmstone
0 siblings, 0 replies; 43+ messages in thread
From: Mark Harmstone @ 2025-08-18 17:23 UTC (permalink / raw)
To: Qu Wenruo, linux-btrfs
Thanks Qu. Yes: BTRFS_BLOCK_GROUP_REMAP is the chunk type where the remap
tree itself lives, and BTRFS_BLOCK_GROUP_REMAPPED is the flag saying that you
should run any addresses within this block group through the remap tree.
On 16/08/2025 1.17 am, Qu Wenruo wrote:
>
>
> On 2025/8/16 09:31, Qu Wenruo wrote:
>>
>>
>> On 2025/8/14 00:04, Mark Harmstone wrote:
>>> Add an incompat flag for the new remap-tree feature, and the constants
>>> and definitions needed to support it.
>>>
>>> Signed-off-by: Mark Harmstone <mark@harmstone.com>
>>> ---
>>> fs/btrfs/accessors.h | 3 +++
>>> fs/btrfs/locking.c | 1 +
>>> fs/btrfs/sysfs.c | 2 ++
>>> fs/btrfs/tree-checker.c | 6 ++----
>>> fs/btrfs/tree-checker.h | 5 +++++
>>> fs/btrfs/volumes.c | 1 +
>>> include/uapi/linux/btrfs.h | 1 +
>>> include/uapi/linux/btrfs_tree.h | 12 ++++++++++++
>>> 8 files changed, 27 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/fs/btrfs/accessors.h b/fs/btrfs/accessors.h
>>> index 99b3ced12805..95a1ca8c099b 100644
>>> --- a/fs/btrfs/accessors.h
>>> +++ b/fs/btrfs/accessors.h
>>> @@ -1009,6 +1009,9 @@ BTRFS_SETGET_STACK_FUNCS(stack_verity_descriptor_encryption,
>>> BTRFS_SETGET_STACK_FUNCS(stack_verity_descriptor_size,
>>> struct btrfs_verity_descriptor_item, size, 64);
>>> +BTRFS_SETGET_FUNCS(remap_address, struct btrfs_remap, address, 64);
>>> +BTRFS_SETGET_STACK_FUNCS(stack_remap_address, struct btrfs_remap, address, 64);
>>> +
>>> /* Cast into the data area of the leaf. */
>>> #define btrfs_item_ptr(leaf, slot, type) \
>>> ((type *)(btrfs_item_nr_offset(leaf, 0) + btrfs_item_offset(leaf, slot)))
>>> diff --git a/fs/btrfs/locking.c b/fs/btrfs/locking.c
>>> index a3e6d9616e60..26f810258486 100644
>>> --- a/fs/btrfs/locking.c
>>> +++ b/fs/btrfs/locking.c
>>> @@ -73,6 +73,7 @@ static struct btrfs_lockdep_keyset {
>>> { .id = BTRFS_FREE_SPACE_TREE_OBJECTID, DEFINE_NAME("free-space") },
>>> { .id = BTRFS_BLOCK_GROUP_TREE_OBJECTID, DEFINE_NAME("block-group") },
>>> { .id = BTRFS_RAID_STRIPE_TREE_OBJECTID, DEFINE_NAME("raid-stripe") },
>>> + { .id = BTRFS_REMAP_TREE_OBJECTID, DEFINE_NAME("remap-tree") },
>>> { .id = 0, DEFINE_NAME("tree") },
>>> };
>>> diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
>>> index 81f52c1f55ce..857d2772db1c 100644
>>> --- a/fs/btrfs/sysfs.c
>>> +++ b/fs/btrfs/sysfs.c
>>> @@ -291,6 +291,7 @@ BTRFS_FEAT_ATTR_COMPAT_RO(free_space_tree, FREE_SPACE_TREE);
>>> BTRFS_FEAT_ATTR_COMPAT_RO(block_group_tree, BLOCK_GROUP_TREE);
>>> BTRFS_FEAT_ATTR_INCOMPAT(raid1c34, RAID1C34);
>>> BTRFS_FEAT_ATTR_INCOMPAT(simple_quota, SIMPLE_QUOTA);
>>> +BTRFS_FEAT_ATTR_INCOMPAT(remap_tree, REMAP_TREE);
>>> #ifdef CONFIG_BLK_DEV_ZONED
>>> BTRFS_FEAT_ATTR_INCOMPAT(zoned, ZONED);
>>> #endif
>>> @@ -325,6 +326,7 @@ static struct attribute *btrfs_supported_feature_attrs[] = {
>>> BTRFS_FEAT_ATTR_PTR(raid1c34),
>>> BTRFS_FEAT_ATTR_PTR(block_group_tree),
>>> BTRFS_FEAT_ATTR_PTR(simple_quota),
>>> + BTRFS_FEAT_ATTR_PTR(remap_tree),
>>> #ifdef CONFIG_BLK_DEV_ZONED
>>> BTRFS_FEAT_ATTR_PTR(zoned),
>>> #endif
>>> diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
>>> index 0f556f4de3f9..76ec3698f197 100644
>>> --- a/fs/btrfs/tree-checker.c
>>> +++ b/fs/btrfs/tree-checker.c
>>> @@ -912,12 +912,10 @@ int btrfs_check_chunk_valid(const struct btrfs_fs_info *fs_info,
>>> length, btrfs_stripe_nr_to_offset(U32_MAX));
>>> return -EUCLEAN;
>>> }
>>> - if (unlikely(type & ~(BTRFS_BLOCK_GROUP_TYPE_MASK |
>>> - BTRFS_BLOCK_GROUP_PROFILE_MASK))) {
>>> + if (unlikely(type & ~BTRFS_BLOCK_GROUP_VALID)) {
>>> chunk_err(fs_info, leaf, chunk, logical,
>>> "unrecognized chunk type: 0x%llx",
>>> - ~(BTRFS_BLOCK_GROUP_TYPE_MASK |
>>> - BTRFS_BLOCK_GROUP_PROFILE_MASK) & type);
>>> + type & ~BTRFS_BLOCK_GROUP_VALID);
>>> return -EUCLEAN;
>>> }
>>> diff --git a/fs/btrfs/tree-checker.h b/fs/btrfs/tree-checker.h
>>> index eb201f4ec3c7..833e2fd989eb 100644
>>> --- a/fs/btrfs/tree-checker.h
>>> +++ b/fs/btrfs/tree-checker.h
>>> @@ -57,6 +57,11 @@ enum btrfs_tree_block_status {
>>> BTRFS_TREE_BLOCK_WRITTEN_NOT_SET,
>>> };
>>> +
>>> +#define BTRFS_BLOCK_GROUP_VALID (BTRFS_BLOCK_GROUP_TYPE_MASK | \
>>> + BTRFS_BLOCK_GROUP_PROFILE_MASK | \
>>> + BTRFS_BLOCK_GROUP_REMAPPED)
>>
>> So far it looks like the remapped flag is a new bg type.
>> Can we just put it into BLOCK_GROUP_TYPE_MASK?
>
> Nevermind, we cannot put the new type into TYPE_MASK, as the tree-checker would then warn about remapped block groups, since they have both the METADATA and REMAPPED flags set.
>
> So this definition looks good.
>
> Reviewed-by: Qu Wenruo <wqu@suse.com>
>
> Thanks,
> Qu
>
>>
>> Otherwise looks good to me.
>>
>> Thanks,
>> Qu
>>
* Re: [PATCH v2 01/16] btrfs: add definitions and constants for remap-tree
2025-08-18 17:21 ` Mark Harmstone
@ 2025-08-18 17:33 ` Boris Burkov
0 siblings, 0 replies; 43+ messages in thread
From: Boris Burkov @ 2025-08-18 17:33 UTC (permalink / raw)
To: Mark Harmstone; +Cc: linux-btrfs
On Mon, Aug 18, 2025 at 06:21:24PM +0100, Mark Harmstone wrote:
> Thanks Boris.
>
> The funny indentation is artefact of diff: the plus sign at the beginning
> throws off the tabstop in your e-mail client. If you do `git am` it looks
> fine.
>
Oh yeah, that makes sense, thanks.
> On 16/08/2025 12.51 am, Boris Burkov wrote:
> > On Wed, Aug 13, 2025 at 03:34:43PM +0100, Mark Harmstone wrote:
> > > Add an incompat flag for the new remap-tree feature, and the constants
> > > and definitions needed to support it.
> > >
> > > Signed-off-by: Mark Harmstone <mark@harmstone.com>
> >
> > Some formatting nits, but you can add
> >
> > Reviewed-by: Boris Burkov <boris@bur.io>
> >
* Re: [PATCH v2 03/16] btrfs: allow remapped chunks to have zero stripes
2025-08-13 14:34 ` [PATCH v2 03/16] btrfs: allow remapped chunks to have zero stripes Mark Harmstone
2025-08-16 0:03 ` Boris Burkov
@ 2025-08-19 1:05 ` kernel test robot
2025-08-22 17:07 ` Mark Harmstone
1 sibling, 1 reply; 43+ messages in thread
From: kernel test robot @ 2025-08-19 1:05 UTC (permalink / raw)
To: Mark Harmstone; +Cc: oe-lkp, lkp, linux-btrfs, Mark Harmstone, oliver.sang
Hello,
kernel test robot noticed "kernel_BUG_at_fs/btrfs/tree-checker.c" on:
commit: 7ec33e314b27be8996378a9601527017b6ebba95 ("[PATCH v2 03/16] btrfs: allow remapped chunks to have zero stripes")
url: https://github.com/intel-lab-lkp/linux/commits/Mark-Harmstone/btrfs-add-definitions-and-constants-for-remap-tree/20250813-224507
base: v6.17-rc1
patch link: https://lore.kernel.org/all/20250813143509.31073-4-mark@harmstone.com/
patch subject: [PATCH v2 03/16] btrfs: allow remapped chunks to have zero stripes
in testcase: xfstests
version: xfstests-x86_64-e1e4a0ea-1_20250714
with following parameters:
disk: 6HDD
fs: btrfs
test: btrfs-group-02
config: x86_64-rhel-9.4-func
compiler: gcc-12
test machine: 8 threads 1 sockets Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz (Haswell) with 8G memory
(please refer to attached dmesg/kmsg for entire log/backtrace)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202508181031.5f89d7-lkp@intel.com
[ 65.722045][ T7549] ------------[ cut here ]------------
[ 65.727330][ T7549] kernel BUG at fs/btrfs/tree-checker.c:847!
[ 65.733149][ T7549] Oops: invalid opcode: 0000 [#1] SMP KASAN PTI
[ 65.739195][ T7549] CPU: 5 UID: 0 PID: 7549 Comm: mount Tainted: G S 6.17.0-rc1-00003-g7ec33e314b27 #1 PREEMPT(voluntary)
[ 65.751607][ T7549] Tainted: [S]=CPU_OUT_OF_SPEC
[ 65.756182][ T7549] Hardware name: Dell Inc. OptiPlex 9020/0DNKMN, BIOS A05 12/05/2013
[ 65.764022][ T7549] RIP: 0010:btrfs_check_chunk_valid (fs/btrfs/tree-checker.c:847 fs/btrfs/tree-checker.c:1004) btrfs
[ 65.770855][ T7549] Code: 24 18 4c 8b 5c 24 10 e9 81 f9 ff ff 48 89 4c 24 18 4c 89 5c 24 10 e8 78 9f 52 bf 48 8b 4c 24 18 4c 8b 5c 24 10 e9 2c fd ff ff <0f> 0b 48 c7 c7 a3 99 51 c2 48 89 4c 24 10 e8 f6 9e 52 bf 48 8b 4c
All code
========
0: 24 18 and $0x18,%al
2: 4c 8b 5c 24 10 mov 0x10(%rsp),%r11
7: e9 81 f9 ff ff jmp 0xfffffffffffff98d
c: 48 89 4c 24 18 mov %rcx,0x18(%rsp)
11: 4c 89 5c 24 10 mov %r11,0x10(%rsp)
16: e8 78 9f 52 bf call 0xffffffffbf529f93
1b: 48 8b 4c 24 18 mov 0x18(%rsp),%rcx
20: 4c 8b 5c 24 10 mov 0x10(%rsp),%r11
25: e9 2c fd ff ff jmp 0xfffffffffffffd56
2a:* 0f 0b ud2 <-- trapping instruction
2c: 48 c7 c7 a3 99 51 c2 mov $0xffffffffc25199a3,%rdi
33: 48 89 4c 24 10 mov %rcx,0x10(%rsp)
38: e8 f6 9e 52 bf call 0xffffffffbf529f33
3d: 48 rex.W
3e: 8b .byte 0x8b
3f: 4c rex.WR
Code starting with the faulting instruction
===========================================
0: 0f 0b ud2
2: 48 c7 c7 a3 99 51 c2 mov $0xffffffffc25199a3,%rdi
9: 48 89 4c 24 10 mov %rcx,0x10(%rsp)
e: e8 f6 9e 52 bf call 0xffffffffbf529f09
13: 48 rex.W
14: 8b .byte 0x8b
15: 4c rex.WR
[ 65.790121][ T7549] RSP: 0018:ffffc9002239f960 EFLAGS: 00010283
[ 65.795985][ T7549] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000001500000
[ 65.803738][ T7549] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff88814af8e0bc
[ 65.811492][ T7549] RBP: ffff88814af8e33c R08: 0000000000000001 R09: 0000000000000005
[ 65.819246][ T7549] R10: e400000000000001 R11: 0000000000000000 R12: 000000000000000a
[ 65.826998][ T7549] R13: 0000000000000008 R14: 0000000000000005 R15: 0000000000ff0000
[ 65.834749][ T7549] FS: 00007f8edf369840(0000) GS:ffff8882182db000(0000) knlGS:0000000000000000
[ 65.843446][ T7549] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 65.849838][ T7549] CR2: 0000562e34dee000 CR3: 0000000201f00005 CR4: 00000000001726f0
[ 65.857589][ T7549] Call Trace:
[ 65.860704][ T7549] <TASK>
[ 65.863474][ T7549] ? set_extent_bit (fs/btrfs/extent-io-tree.c:1099) btrfs
[ 65.869002][ T7549] btrfs_validate_super (fs/btrfs/disk-io.c:2369 fs/btrfs/disk-io.c:2558) btrfs
[ 65.874869][ T7549] ? crypto_alloc_tfmmem+0x92/0xf0
[ 65.880390][ T7549] ? __pfx_btrfs_validate_super (fs/btrfs/disk-io.c:2393) btrfs
[ 65.886682][ T7549] ? btrfs_release_disk_super (include/linux/page-flags.h:226 include/linux/page-flags.h:288 include/linux/mm.h:1424 fs/btrfs/volumes.c:1337) btrfs
[ 65.892995][ T7549] open_ctree (fs/btrfs/disk-io.c:3373) btrfs
[ 65.897997][ T7549] ? __pfx_open_ctree (fs/btrfs/disk-io.c:3282) btrfs
[ 65.903426][ T7549] ? mutex_unlock (arch/x86/include/asm/atomic64_64.h:101 include/linux/atomic/atomic-arch-fallback.h:4329 include/linux/atomic/atomic-long.h:1506 include/linux/atomic/atomic-instrumented.h:4481 kernel/locking/mutex.c:167 kernel/locking/mutex.c:533)
[ 65.907740][ T7549] ? __pfx_mutex_unlock (kernel/locking/mutex.c:531)
[ 65.912572][ T7549] btrfs_get_tree_super (fs/btrfs/super.c:978 fs/btrfs/super.c:1937) btrfs
[ 65.918346][ T7549] btrfs_get_tree_subvol (fs/btrfs/super.c:2074) btrfs
[ 65.924203][ T7549] vfs_get_tree (fs/super.c:1816)
[ 65.928432][ T7549] do_new_mount (fs/namespace.c:3806)
[ 65.932749][ T7549] ? __pfx_do_new_mount (fs/namespace.c:3760)
[ 65.937580][ T7549] ? __pfx_map_id_range_up (kernel/user_namespace.c:382)
[ 65.942669][ T7549] ? security_capable (security/security.c:1142)
[ 65.947330][ T7549] path_mount (fs/namespace.c:4120)
[ 65.951561][ T7549] ? 0xffffffff81000000
[ 65.955533][ T7549] ? __pfx_path_mount (fs/namespace.c:4047)
[ 65.960193][ T7549] ? kmem_cache_free (mm/slub.c:4680 mm/slub.c:4782)
[ 65.964939][ T7549] ? user_path_at (fs/namei.c:3131)
[ 65.969259][ T7549] __x64_sys_mount (fs/namespace.c:4134 fs/namespace.c:4344 fs/namespace.c:4321 fs/namespace.c:4321)
[ 65.973843][ T7549] ? __pfx___x64_sys_mount (fs/namespace.c:4321)
[ 65.978934][ T7549] ? do_user_addr_fault (arch/x86/include/asm/atomic.h:93 include/linux/atomic/atomic-arch-fallback.h:949 include/linux/atomic/atomic-instrumented.h:401 include/linux/refcount.h:389 include/linux/refcount.h:432 include/linux/mmap_lock.h:143 include/linux/mmap_lock.h:267 arch/x86/mm/fault.c:1338)
[ 65.983938][ T7549] do_syscall_64 (arch/x86/entry/syscall_64.c:63 arch/x86/entry/syscall_64.c:94)
[ 65.988257][ T7549] ? exc_page_fault (arch/x86/include/asm/irqflags.h:37 arch/x86/include/asm/irqflags.h:114 arch/x86/mm/fault.c:1484 arch/x86/mm/fault.c:1532)
[ 65.992749][ T7549] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
[ 65.998441][ T7549] RIP: 0033:0x7f8edf568e0a
[ 66.002672][ T7549] Code: 48 8b 0d f9 7f 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c6 7f 0c 00 f7 d8 64 89 01 48
All code
========
0: 48 8b 0d f9 7f 0c 00 mov 0xc7ff9(%rip),%rcx # 0xc8000
7: f7 d8 neg %eax
9: 64 89 01 mov %eax,%fs:(%rcx)
c: 48 83 c8 ff or $0xffffffffffffffff,%rax
10: c3 ret
11: 66 2e 0f 1f 84 00 00 cs nopw 0x0(%rax,%rax,1)
18: 00 00 00
1b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
20: 49 89 ca mov %rcx,%r10
23: b8 a5 00 00 00 mov $0xa5,%eax
28: 0f 05 syscall
2a:* 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax <-- trapping instruction
30: 73 01 jae 0x33
32: c3 ret
33: 48 8b 0d c6 7f 0c 00 mov 0xc7fc6(%rip),%rcx # 0xc8000
3a: f7 d8 neg %eax
3c: 64 89 01 mov %eax,%fs:(%rcx)
3f: 48 rex.W
Code starting with the faulting instruction
===========================================
0: 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax
6: 73 01 jae 0x9
8: c3 ret
9: 48 8b 0d c6 7f 0c 00 mov 0xc7fc6(%rip),%rcx # 0xc7fd6
10: f7 d8 neg %eax
12: 64 89 01 mov %eax,%fs:(%rcx)
15: 48 rex.W
[ 66.021939][ T7549] RSP: 002b:00007ffe66b952e8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
[ 66.030126][ T7549] RAX: ffffffffffffffda RBX: 0000557c5d7619e0 RCX: 00007f8edf568e0a
[ 66.037879][ T7549] RDX: 0000557c5d761c10 RSI: 0000557c5d761c50 RDI: 0000557c5d761c30
[ 66.045631][ T7549] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000557c5d762940
[ 66.053384][ T7549] R10: 0000000000000000 R11: 0000000000000246 R12: 0000557c5d761c30
[ 66.061140][ T7549] R13: 0000557c5d761c10 R14: 00007f8edf6d0264 R15: 0000557c5d761af8
[ 66.068895][ T7549] </TASK>
[ 66.071750][ T7549] Modules linked in: snd_hda_codec_intelhdmi snd_hda_codec_hdmi btrfs blake2b_generic xor intel_rapl_msr zstd_compress intel_rapl_common snd_hda_codec_alc269 x86_pkg_temp_thermal snd_hda_scodec_component raid6_pq snd_hda_codec_realtek_lib intel_powerclamp i915 snd_hda_codec_generic coretemp snd_hda_intel sd_mod snd_hda_codec kvm_intel intel_gtt sg snd_hda_core drm_buddy ipmi_devintf ipmi_msghandler platform_profile kvm snd_intel_dspcfg ttm snd_intel_sdw_acpi dell_wmi snd_hwdep irqbypass drm_display_helper dell_smbios ghash_clmulni_intel mei_wdt snd_pcm cec dell_wmi_descriptor drm_client_lib rapl ahci sparse_keymap rfkill snd_timer libahci drm_kms_helper intel_cstate mei_me pcspkr dcdbas intel_uncore libata mei video snd i2c_i801 lpc_ich soundcore i2c_smbus wmi binfmt_misc loop fuse drm dm_mod
[ 66.142949][ T7549] ---[ end trace 0000000000000000 ]---
[ 66.148220][ T7549] RIP: 0010:btrfs_check_chunk_valid (fs/btrfs/tree-checker.c:847 fs/btrfs/tree-checker.c:1004) btrfs
[ 66.155057][ T7549] Code: 24 18 4c 8b 5c 24 10 e9 81 f9 ff ff 48 89 4c 24 18 4c 89 5c 24 10 e8 78 9f 52 bf 48 8b 4c 24 18 4c 8b 5c 24 10 e9 2c fd ff ff <0f> 0b 48 c7 c7 a3 99 51 c2 48 89 4c 24 10 e8 f6 9e 52 bf 48 8b 4c
All code
========
0: 24 18 and $0x18,%al
2: 4c 8b 5c 24 10 mov 0x10(%rsp),%r11
7: e9 81 f9 ff ff jmp 0xfffffffffffff98d
c: 48 89 4c 24 18 mov %rcx,0x18(%rsp)
11: 4c 89 5c 24 10 mov %r11,0x10(%rsp)
16: e8 78 9f 52 bf call 0xffffffffbf529f93
1b: 48 8b 4c 24 18 mov 0x18(%rsp),%rcx
20: 4c 8b 5c 24 10 mov 0x10(%rsp),%r11
25: e9 2c fd ff ff jmp 0xfffffffffffffd56
2a:* 0f 0b ud2 <-- trapping instruction
2c: 48 c7 c7 a3 99 51 c2 mov $0xffffffffc25199a3,%rdi
33: 48 89 4c 24 10 mov %rcx,0x10(%rsp)
38: e8 f6 9e 52 bf call 0xffffffffbf529f33
3d: 48 rex.W
3e: 8b .byte 0x8b
3f: 4c rex.WR
Code starting with the faulting instruction
===========================================
0: 0f 0b ud2
2: 48 c7 c7 a3 99 51 c2 mov $0xffffffffc25199a3,%rdi
9: 48 89 4c 24 10 mov %rcx,0x10(%rsp)
e: e8 f6 9e 52 bf call 0xffffffffbf529f09
13: 48 rex.W
14: 8b .byte 0x8b
15: 4c rex.WR
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20250818/202508181031.5f89d7-lkp@intel.com
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v2 03/16] btrfs: allow remapped chunks to have zero stripes
2025-08-16 0:03 ` Boris Burkov
@ 2025-08-22 17:01 ` Mark Harmstone
0 siblings, 0 replies; 43+ messages in thread
From: Mark Harmstone @ 2025-08-22 17:01 UTC (permalink / raw)
To: Boris Burkov; +Cc: linux-btrfs
On 16/08/2025 1.03 am, Boris Burkov wrote:
> On Wed, Aug 13, 2025 at 03:34:45PM +0100, Mark Harmstone wrote:
>> When a chunk has been fully remapped, we are going to set its
>> num_stripes to 0, as it will no longer represent a physical location on
>> disk.
>>
>> Change tree-checker to allow for this, and fix a couple of
>> divide-by-zeroes seen elsewhere.
>>
>> Signed-off-by: Mark Harmstone <mark@harmstone.com>
>> ---
>> fs/btrfs/tree-checker.c | 63 ++++++++++++++++++++++++++++-------------
>> fs/btrfs/volumes.c | 8 +++++-
>> 2 files changed, 50 insertions(+), 21 deletions(-)
>>
>> diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
>> index ca898b1f12f1..20bfe333ffdd 100644
>> --- a/fs/btrfs/tree-checker.c
>> +++ b/fs/btrfs/tree-checker.c
>> @@ -815,6 +815,39 @@ static void chunk_err(const struct btrfs_fs_info *fs_info,
>> va_end(args);
>> }
>>
>> +static bool valid_stripe_count(u64 profile, u16 num_stripes,
>> + u16 sub_stripes)
>> +{
>> + switch (profile) {
>> + case BTRFS_BLOCK_GROUP_RAID10:
>> + return sub_stripes ==
>> + btrfs_raid_array[BTRFS_RAID_RAID10].sub_stripes;
>> + case BTRFS_BLOCK_GROUP_RAID1:
>> + return num_stripes ==
>> + btrfs_raid_array[BTRFS_RAID_RAID1].devs_min;
>> + case BTRFS_BLOCK_GROUP_RAID1C3:
>> + return num_stripes ==
>> + btrfs_raid_array[BTRFS_RAID_RAID1C3].devs_min;
>> + case BTRFS_BLOCK_GROUP_RAID1C4:
>> + return num_stripes ==
>> + btrfs_raid_array[BTRFS_RAID_RAID1C4].devs_min;
>> + case BTRFS_BLOCK_GROUP_RAID5:
>> + return num_stripes >=
>> + btrfs_raid_array[BTRFS_RAID_RAID5].devs_min;
>> + case BTRFS_BLOCK_GROUP_RAID6:
>> + return num_stripes >=
>> + btrfs_raid_array[BTRFS_RAID_RAID6].devs_min;
>> + case BTRFS_BLOCK_GROUP_DUP:
>> + return num_stripes ==
>> + btrfs_raid_array[BTRFS_RAID_DUP].dev_stripes;
>> + case 0: /* SINGLE */
>> + return num_stripes ==
>> + btrfs_raid_array[BTRFS_RAID_SINGLE].dev_stripes;
>> + default:
>> + BUG();
>> + }
>> +}
>> +
>> /*
>> * The common chunk check which could also work on super block sys chunk array.
>> *
>> @@ -838,6 +871,7 @@ int btrfs_check_chunk_valid(const struct btrfs_fs_info *fs_info,
>> u64 features;
>> u32 chunk_sector_size;
>> bool mixed = false;
>> + bool remapped;
>> int raid_index;
>> int nparity;
>> int ncopies;
>> @@ -861,12 +895,14 @@ int btrfs_check_chunk_valid(const struct btrfs_fs_info *fs_info,
>> ncopies = btrfs_raid_array[raid_index].ncopies;
>> nparity = btrfs_raid_array[raid_index].nparity;
>>
>> - if (unlikely(!num_stripes)) {
>> + remapped = type & BTRFS_BLOCK_GROUP_REMAPPED;
>> +
>> + if (unlikely(!remapped && !num_stripes)) {
>> chunk_err(fs_info, leaf, chunk, logical,
>> "invalid chunk num_stripes, have %u", num_stripes);
>> return -EUCLEAN;
>> }
>> - if (unlikely(num_stripes < ncopies)) {
>> + if (unlikely(num_stripes != 0 && num_stripes < ncopies)) {
>
> This relying on the above check for the remapped <=> !num_stripes aspect
> was still kinda confusing. Logically looks good now, though.
>
>> chunk_err(fs_info, leaf, chunk, logical,
>> "invalid chunk num_stripes < ncopies, have %u < %d",
>> num_stripes, ncopies);
>> @@ -964,22 +1000,9 @@ int btrfs_check_chunk_valid(const struct btrfs_fs_info *fs_info,
>> }
>> }
>>
>> - if (unlikely((type & BTRFS_BLOCK_GROUP_RAID10 &&
>> - sub_stripes != btrfs_raid_array[BTRFS_RAID_RAID10].sub_stripes) ||
>> - (type & BTRFS_BLOCK_GROUP_RAID1 &&
>> - num_stripes != btrfs_raid_array[BTRFS_RAID_RAID1].devs_min) ||
>> - (type & BTRFS_BLOCK_GROUP_RAID1C3 &&
>> - num_stripes != btrfs_raid_array[BTRFS_RAID_RAID1C3].devs_min) ||
>> - (type & BTRFS_BLOCK_GROUP_RAID1C4 &&
>> - num_stripes != btrfs_raid_array[BTRFS_RAID_RAID1C4].devs_min) ||
>> - (type & BTRFS_BLOCK_GROUP_RAID5 &&
>> - num_stripes < btrfs_raid_array[BTRFS_RAID_RAID5].devs_min) ||
>> - (type & BTRFS_BLOCK_GROUP_RAID6 &&
>> - num_stripes < btrfs_raid_array[BTRFS_RAID_RAID6].devs_min) ||
>> - (type & BTRFS_BLOCK_GROUP_DUP &&
>> - num_stripes != btrfs_raid_array[BTRFS_RAID_DUP].dev_stripes) ||
>> - ((type & BTRFS_BLOCK_GROUP_PROFILE_MASK) == 0 &&
>> - num_stripes != btrfs_raid_array[BTRFS_RAID_SINGLE].dev_stripes))) {
>> + if (!remapped &&
>> + !valid_stripe_count(type & BTRFS_BLOCK_GROUP_PROFILE_MASK,
>> + num_stripes, sub_stripes)) {
>
> This looks great, thanks.
>
>> chunk_err(fs_info, leaf, chunk, logical,
>> "invalid num_stripes:sub_stripes %u:%u for profile %llu",
>> num_stripes, sub_stripes,
>> @@ -1003,11 +1026,11 @@ static int check_leaf_chunk_item(struct extent_buffer *leaf,
>> struct btrfs_fs_info *fs_info = leaf->fs_info;
>> int num_stripes;
>>
>> - if (unlikely(btrfs_item_size(leaf, slot) < sizeof(struct btrfs_chunk))) {
>> + if (unlikely(btrfs_item_size(leaf, slot) < offsetof(struct btrfs_chunk, stripe))) {
>> chunk_err(fs_info, leaf, chunk, key->offset,
>> "invalid chunk item size: have %u expect [%zu, %u)",
>> btrfs_item_size(leaf, slot),
>> - sizeof(struct btrfs_chunk),
>> + offsetof(struct btrfs_chunk, stripe),
>> BTRFS_LEAF_DATA_SIZE(fs_info));
>> return -EUCLEAN;
>> }
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index f4d1527f265e..c95f83305c82 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -6145,6 +6145,12 @@ struct btrfs_discard_stripe *btrfs_map_discard(struct btrfs_fs_info *fs_info,
>> goto out_free_map;
>> }
>>
>> + /* avoid divide by zero on fully-remapped chunks */
>> + if (map->num_stripes == 0) {
>> + ret = -EOPNOTSUPP;
>> + goto out_free_map;
>> + }
>> +
>> offset = logical - map->start;
>> length = min_t(u64, map->start + map->chunk_len - logical, length);
>> *length_ret = length;
>> @@ -6965,7 +6971,7 @@ u64 btrfs_calc_stripe_length(const struct btrfs_chunk_map *map)
>> {
>> const int data_stripes = calc_data_stripes(map->type, map->num_stripes);
>>
>> - return div_u64(map->chunk_len, data_stripes);
>> + return data_stripes ? div_u64(map->chunk_len, data_stripes) : 0;
>
> My point here was more that we are now including 0 in the range of this
> function, where it wasn't before, meaning that callers must properly
> handle it. And it's not a meaningful "stripe length", so it breaks that
> correspondence; checking explicitly for "remapped-ness" rather than for
> "length == 0" feels more robust to me.
>
> I won't die on this hill, just making myself as clear as I can.
Thanks Boris, this makes sense - that's fair enough.
>> }
>>
>> #if BITS_PER_LONG == 32
>> --
>> 2.49.1
>>
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v2 03/16] btrfs: allow remapped chunks to have zero stripes
2025-08-19 1:05 ` kernel test robot
@ 2025-08-22 17:07 ` Mark Harmstone
0 siblings, 0 replies; 43+ messages in thread
From: Mark Harmstone @ 2025-08-22 17:07 UTC (permalink / raw)
To: kernel test robot; +Cc: oe-lkp, lkp, linux-btrfs
It looks like this is because the new valid_stripe_count() function hits its
BUG() default case for RAID0 - I'll fix this for v3.
On 19/08/2025 2.05 am, kernel test robot wrote:
>
>
> Hello,
>
> kernel test robot noticed "kernel_BUG_at_fs/btrfs/tree-checker.c" on:
>
> commit: 7ec33e314b27be8996378a9601527017b6ebba95 ("[PATCH v2 03/16] btrfs: allow remapped chunks to have zero stripes")
> url: https://github.com/intel-lab-lkp/linux/commits/Mark-Harmstone/btrfs-add-definitions-and-constants-for-remap-tree/20250813-224507
> base: v6.17-rc1
> patch link: https://lore.kernel.org/all/20250813143509.31073-4-mark@harmstone.com/
> patch subject: [PATCH v2 03/16] btrfs: allow remapped chunks to have zero stripes
>
> in testcase: xfstests
> version: xfstests-x86_64-e1e4a0ea-1_20250714
> with following parameters:
>
> disk: 6HDD
> fs: btrfs
> test: btrfs-group-02
>
>
>
> config: x86_64-rhel-9.4-func
> compiler: gcc-12
> test machine: 8 threads 1 sockets Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz (Haswell) with 8G memory
>
> (please refer to attached dmesg/kmsg for entire log/backtrace)
>
>
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <oliver.sang@intel.com>
> | Closes: https://lore.kernel.org/oe-lkp/202508181031.5f89d7-lkp@intel.com
>
>
> [ 65.722045][ T7549] ------------[ cut here ]------------
> [ 65.727330][ T7549] kernel BUG at fs/btrfs/tree-checker.c:847!
> [ 65.733149][ T7549] Oops: invalid opcode: 0000 [#1] SMP KASAN PTI
> [ 65.739195][ T7549] CPU: 5 UID: 0 PID: 7549 Comm: mount Tainted: G S 6.17.0-rc1-00003-g7ec33e314b27 #1 PREEMPT(voluntary)
> [ 65.751607][ T7549] Tainted: [S]=CPU_OUT_OF_SPEC
> [ 65.756182][ T7549] Hardware name: Dell Inc. OptiPlex 9020/0DNKMN, BIOS A05 12/05/2013
> [ 65.764022][ T7549] RIP: 0010:btrfs_check_chunk_valid (fs/btrfs/tree-checker.c:847 fs/btrfs/tree-checker.c:1004) btrfs
> [...]
>
>
> The kernel config and materials to reproduce are available at:
> https://download.01.org/0day-ci/archive/20250818/202508181031.5f89d7-lkp@intel.com
>
>
>
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v2 07/16] btrfs: allow mounting filesystems with remap-tree incompat flag
2025-08-13 14:34 ` [PATCH v2 07/16] btrfs: allow mounting filesystems with remap-tree incompat flag Mark Harmstone
@ 2025-08-22 19:14 ` Boris Burkov
0 siblings, 0 replies; 43+ messages in thread
From: Boris Burkov @ 2025-08-22 19:14 UTC (permalink / raw)
To: Mark Harmstone; +Cc: linux-btrfs
On Wed, Aug 13, 2025 at 03:34:49PM +0100, Mark Harmstone wrote:
> If we encounter a filesystem with the remap-tree incompat flag set,
> validate its compatibility with the other flags, and load the remap tree
> using the values that have been added to the superblock.
>
> The remap-tree feature depends on the free space tree, but no-holes and
> block-group-tree have been made dependencies to reduce the testing
> matrix. Similarly, I'm not aware of any reason why mixed-bg and zoned would be
> incompatible with remap-tree, but they are blocked for the time being
> until they can be fully tested.
>
> Signed-off-by: Mark Harmstone <mark@harmstone.com>
> ---
> fs/btrfs/Kconfig | 2 +
> fs/btrfs/accessors.h | 6 +++
> fs/btrfs/disk-io.c | 86 ++++++++++++++++++++++++++++-----
> fs/btrfs/extent-tree.c | 2 +
> fs/btrfs/fs.h | 4 +-
> fs/btrfs/transaction.c | 7 +++
> include/uapi/linux/btrfs_tree.h | 5 +-
> 7 files changed, 97 insertions(+), 15 deletions(-)
>
> diff --git a/fs/btrfs/Kconfig b/fs/btrfs/Kconfig
> index ea95c90c8474..598a4af4ce4b 100644
> --- a/fs/btrfs/Kconfig
> +++ b/fs/btrfs/Kconfig
> @@ -116,6 +116,8 @@ config BTRFS_EXPERIMENTAL
>
> - large folio support
>
> + - remap-tree - logical address remapping tree
> +
> If unsure, say N.
>
> config BTRFS_FS_REF_VERIFY
> diff --git a/fs/btrfs/accessors.h b/fs/btrfs/accessors.h
> index 0dd161ee6863..392eaad75e72 100644
> --- a/fs/btrfs/accessors.h
> +++ b/fs/btrfs/accessors.h
> @@ -882,6 +882,12 @@ BTRFS_SETGET_STACK_FUNCS(super_uuid_tree_generation, struct btrfs_super_block,
> uuid_tree_generation, 64);
> BTRFS_SETGET_STACK_FUNCS(super_nr_global_roots, struct btrfs_super_block,
> nr_global_roots, 64);
> +BTRFS_SETGET_STACK_FUNCS(super_remap_root, struct btrfs_super_block,
> + remap_root, 64);
> +BTRFS_SETGET_STACK_FUNCS(super_remap_root_generation, struct btrfs_super_block,
> + remap_root_generation, 64);
> +BTRFS_SETGET_STACK_FUNCS(super_remap_root_level, struct btrfs_super_block,
> + remap_root_level, 8);
>
> /* struct btrfs_file_extent_item */
> BTRFS_SETGET_STACK_FUNCS(stack_file_extent_type, struct btrfs_file_extent_item,
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 8e9520119d4f..563aea5e3b1b 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -1181,6 +1181,8 @@ static struct btrfs_root *btrfs_get_global_root(struct btrfs_fs_info *fs_info,
> return btrfs_grab_root(btrfs_global_root(fs_info, &key));
> case BTRFS_RAID_STRIPE_TREE_OBJECTID:
> return btrfs_grab_root(fs_info->stripe_root);
> + case BTRFS_REMAP_TREE_OBJECTID:
> + return btrfs_grab_root(fs_info->remap_root);
> default:
> return NULL;
> }
> @@ -1271,6 +1273,7 @@ void btrfs_free_fs_info(struct btrfs_fs_info *fs_info)
> btrfs_put_root(fs_info->data_reloc_root);
> btrfs_put_root(fs_info->block_group_root);
> btrfs_put_root(fs_info->stripe_root);
> + btrfs_put_root(fs_info->remap_root);
> btrfs_check_leaked_roots(fs_info);
> btrfs_extent_buffer_leak_debug_check(fs_info);
> kfree(fs_info->super_copy);
> @@ -1825,6 +1828,7 @@ static void free_root_pointers(struct btrfs_fs_info *info, bool free_chunk_root)
> free_root_extent_buffers(info->data_reloc_root);
> free_root_extent_buffers(info->block_group_root);
> free_root_extent_buffers(info->stripe_root);
> + free_root_extent_buffers(info->remap_root);
> if (free_chunk_root)
> free_root_extent_buffers(info->chunk_root);
> }
> @@ -2256,20 +2260,31 @@ static int btrfs_read_roots(struct btrfs_fs_info *fs_info)
> if (ret)
> goto out;
>
> - /*
> - * This tree can share blocks with some other fs tree during relocation
> - * and we need a proper setup by btrfs_get_fs_root
> - */
> - root = btrfs_get_fs_root(tree_root->fs_info,
> - BTRFS_DATA_RELOC_TREE_OBJECTID, true);
> - if (IS_ERR(root)) {
> - if (!btrfs_test_opt(fs_info, IGNOREBADROOTS)) {
> - ret = PTR_ERR(root);
> - goto out;
> - }
> - } else {
> + if (btrfs_fs_incompat(fs_info, REMAP_TREE)) {
> + /* remap_root already loaded in load_important_roots() */
> + root = fs_info->remap_root;
> +
> set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
> - fs_info->data_reloc_root = root;
> +
> + root->root_key.objectid = BTRFS_REMAP_TREE_OBJECTID;
> + root->root_key.type = BTRFS_ROOT_ITEM_KEY;
> + root->root_key.offset = 0;
> + } else {
It might be a good idea to vomit on finding a reloc tree if the
REMAP_TREE incompat bit is set. If that would happen elsewhere in tree
checker, that's great, the idea just struck me while reading this.
> + /*
> + * This tree can share blocks with some other fs tree during
> + * relocation and we need a proper setup by btrfs_get_fs_root
> + */
> + root = btrfs_get_fs_root(tree_root->fs_info,
> + BTRFS_DATA_RELOC_TREE_OBJECTID, true);
> + if (IS_ERR(root)) {
> + if (!btrfs_test_opt(fs_info, IGNOREBADROOTS)) {
> + ret = PTR_ERR(root);
> + goto out;
> + }
> + } else {
> + set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
> + fs_info->data_reloc_root = root;
> + }
> }
>
> location.objectid = BTRFS_QUOTA_TREE_OBJECTID;
> @@ -2509,6 +2524,28 @@ int btrfs_validate_super(const struct btrfs_fs_info *fs_info,
> ret = -EINVAL;
> }
>
> + /* Ditto for remap_tree */
Don't care strongly, but "ditto" is less clear and more prone to
breaking when other code gets refactored than just writing out what
you mean (perhaps with a reference to something above).
> + if (btrfs_fs_incompat(fs_info, REMAP_TREE) &&
> + (!btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE_VALID) ||
> + !btrfs_fs_incompat(fs_info, NO_HOLES) ||
> + !btrfs_fs_compat_ro(fs_info, BLOCK_GROUP_TREE))) {
> + btrfs_err(fs_info,
> +"remap-tree feature requires free-space-tree, no-holes, and block-group-tree");
> + ret = -EINVAL;
> + }
> +
> + if (btrfs_fs_incompat(fs_info, REMAP_TREE) &&
> + btrfs_fs_incompat(fs_info, MIXED_GROUPS)) {
> + btrfs_err(fs_info, "remap-tree not supported with mixed-bg");
> + ret = -EINVAL;
> + }
> +
> + if (btrfs_fs_incompat(fs_info, REMAP_TREE) &&
> + btrfs_fs_incompat(fs_info, ZONED)) {
> + btrfs_err(fs_info, "remap-tree not supported with zoned devices");
> + ret = -EINVAL;
> + }
> +
> /*
> * Hint to catch really bogus numbers, bitflips or so, more exact checks are
> * done later
> @@ -2667,6 +2704,18 @@ static int load_important_roots(struct btrfs_fs_info *fs_info)
> btrfs_warn(fs_info, "couldn't read tree root");
> return ret;
> }
> +
> + if (btrfs_fs_incompat(fs_info, REMAP_TREE)) {
> + bytenr = btrfs_super_remap_root(sb);
> + gen = btrfs_super_remap_root_generation(sb);
> + level = btrfs_super_remap_root_level(sb);
> + ret = load_super_root(fs_info->remap_root, bytenr, gen, level);
> + if (ret) {
> + btrfs_warn(fs_info, "couldn't read remap root");
> + return ret;
> + }
> + }
> +
> return 0;
> }
>
> @@ -3278,6 +3327,7 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
> struct btrfs_fs_info *fs_info = btrfs_sb(sb);
> struct btrfs_root *tree_root;
> struct btrfs_root *chunk_root;
> + struct btrfs_root *remap_root;
> int ret;
> int level;
>
> @@ -3312,6 +3362,16 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
> goto fail_alloc;
> }
>
> + if (btrfs_super_incompat_flags(disk_super) & BTRFS_FEATURE_INCOMPAT_REMAP_TREE) {
> + remap_root = btrfs_alloc_root(fs_info, BTRFS_REMAP_TREE_OBJECTID,
> + GFP_KERNEL);
> + fs_info->remap_root = remap_root;
> + if (!remap_root) {
> + ret = -ENOMEM;
> + goto fail_alloc;
> + }
> + }
> +
This feels like it should come after the csum verification stuff in the
bootstrap process. The csum stuff is from the super so it shouldn't
depend on remap, but remap is an eb and has csums, so it does "rely" on
that (obviously it will break anyway, but then why bother doing the
explicit checking at all)
As the most general statement, I think putting it as "late as possible"
is the most self-documenting option.
> btrfs_info(fs_info, "first mount of filesystem %pU", disk_super->fsid);
> /*
> * Verify the type first, if that or the checksum value are
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 5e038ae1a93f..c1b96c728fe6 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2589,6 +2589,8 @@ static u64 get_alloc_profile_by_root(struct btrfs_root *root, int data)
> flags = BTRFS_BLOCK_GROUP_DATA;
> else if (root == fs_info->chunk_root)
> flags = BTRFS_BLOCK_GROUP_SYSTEM;
> + else if (root == fs_info->remap_root)
> + flags = BTRFS_BLOCK_GROUP_REMAP;
> else
> flags = BTRFS_BLOCK_GROUP_METADATA;
>
> diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
> index 9ce75843b578..6ea96e76655e 100644
> --- a/fs/btrfs/fs.h
> +++ b/fs/btrfs/fs.h
> @@ -288,7 +288,8 @@ enum {
> #define BTRFS_FEATURE_INCOMPAT_SUPP \
> (BTRFS_FEATURE_INCOMPAT_SUPP_STABLE | \
> BTRFS_FEATURE_INCOMPAT_RAID_STRIPE_TREE | \
> - BTRFS_FEATURE_INCOMPAT_EXTENT_TREE_V2)
> + BTRFS_FEATURE_INCOMPAT_EXTENT_TREE_V2 | \
> + BTRFS_FEATURE_INCOMPAT_REMAP_TREE)
>
> #else
>
> @@ -438,6 +439,7 @@ struct btrfs_fs_info {
> struct btrfs_root *data_reloc_root;
> struct btrfs_root *block_group_root;
> struct btrfs_root *stripe_root;
> + struct btrfs_root *remap_root;
>
> /* The log root tree is a directory of all the other log roots */
> struct btrfs_root *log_root_tree;
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index c5c0d9cf1a80..64b9c427af6a 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -1953,6 +1953,13 @@ static void update_super_roots(struct btrfs_fs_info *fs_info)
> super->cache_generation = 0;
> if (test_bit(BTRFS_FS_UPDATE_UUID_TREE_GEN, &fs_info->flags))
> super->uuid_tree_generation = root_item->generation;
> +
> + if (btrfs_fs_incompat(fs_info, REMAP_TREE)) {
> + root_item = &fs_info->remap_root->root_item;
> + super->remap_root = root_item->bytenr;
> + super->remap_root_generation = root_item->generation;
> + super->remap_root_level = root_item->level;
> + }
> }
>
> int btrfs_transaction_blocked(struct btrfs_fs_info *info)
> diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
> index 500e3a7df90b..89bcb80081a6 100644
> --- a/include/uapi/linux/btrfs_tree.h
> +++ b/include/uapi/linux/btrfs_tree.h
> @@ -721,9 +721,12 @@ struct btrfs_super_block {
> __u8 metadata_uuid[BTRFS_FSID_SIZE];
>
> __u64 nr_global_roots;
> + __le64 remap_root;
> + __le64 remap_root_generation;
> + __u8 remap_root_level;
>
> /* Future expansion */
> - __le64 reserved[27];
> + __u8 reserved[199];
> __u8 sys_chunk_array[BTRFS_SYSTEM_CHUNK_ARRAY_SIZE];
> struct btrfs_root_backup super_roots[BTRFS_NUM_BACKUP_ROOTS];
>
> --
> 2.49.1
>
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v2 08/16] btrfs: redirect I/O for remapped block groups
2025-08-13 14:34 ` [PATCH v2 08/16] btrfs: redirect I/O for remapped block groups Mark Harmstone
@ 2025-08-22 19:42 ` Boris Burkov
2025-08-27 14:08 ` Mark Harmstone
0 siblings, 1 reply; 43+ messages in thread
From: Boris Burkov @ 2025-08-22 19:42 UTC (permalink / raw)
To: Mark Harmstone; +Cc: linux-btrfs
On Wed, Aug 13, 2025 at 03:34:50PM +0100, Mark Harmstone wrote:
> Change btrfs_map_block() so that if the block group has the REMAPPED
> flag set, we call btrfs_translate_remap() to obtain a new address.
>
> btrfs_translate_remap() searches the remap tree for a range
> corresponding to the logical address passed to btrfs_map_block(). If it
> is within an identity remap, this part of the block group hasn't yet
> been relocated, and so we use the existing address.
>
> If it is within an actual remap, we subtract the start of the remap
> range and add the address of its destination, contained in the item's
> payload.
>
> Signed-off-by: Mark Harmstone <mark@harmstone.com>
> ---
> fs/btrfs/relocation.c | 59 +++++++++++++++++++++++++++++++++++++++++++
> fs/btrfs/relocation.h | 2 ++
> fs/btrfs/volumes.c | 31 +++++++++++++++++++++++
> 3 files changed, 92 insertions(+)
>
> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> index 7256f6748c8f..e1f1da9336e7 100644
> --- a/fs/btrfs/relocation.c
> +++ b/fs/btrfs/relocation.c
> @@ -3884,6 +3884,65 @@ static const char *stage_to_string(enum reloc_stage stage)
> return "unknown";
> }
>
> +int btrfs_translate_remap(struct btrfs_fs_info *fs_info, u64 *logical,
> + u64 *length, bool nolock)
> +{
> + int ret;
> + struct btrfs_key key, found_key;
> + struct extent_buffer *leaf;
> + struct btrfs_remap *remap;
> + BTRFS_PATH_AUTO_FREE(path);
> +
> + path = btrfs_alloc_path();
> + if (!path)
> + return -ENOMEM;
> +
> + if (nolock) {
> + path->search_commit_root = 1;
> + path->skip_locking = 1;
> + }
We are calling this without a transaction and in a loop in
btrfs_submit_bbio:
btrfs_submit_bbio
  while (blah)
    btrfs_submit_chunk
      btrfs_map_block
        btrfs_translate_remap
So within that loop, one step can use one version of the remap tree, a
transaction can then commit, and the next chunk can be translated against
the new remap tree in the following step.
Is that acceptable? Otherwise you need to hold the commit_root_sem for
the whole loop. It seems OK because both copies ought to be around while
we're in the middle of remapping, but let's be sure. I'm also curious
about the paths that remove things from the remap tree. I would
expect that live IO using a remapping would block its removal, much as
it would block removing an extent, but that's also worth considering.
> +
> + key.objectid = *logical;
> + key.type = (u8)-1;
> + key.offset = (u64)-1;
> +
> + ret = btrfs_search_slot(NULL, fs_info->remap_root, &key, path,
> + 0, 0);
> + if (ret < 0)
> + return ret;
> +
> + leaf = path->nodes[0];
> +
> + if (path->slots[0] == 0)
> + return -ENOENT;
> +
> + path->slots[0]--;
> +
> + btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
> +
> + if (found_key.type != BTRFS_REMAP_KEY &&
> + found_key.type != BTRFS_IDENTITY_REMAP_KEY) {
> + return -ENOENT;
> + }
> +
> + if (found_key.objectid > *logical ||
> + found_key.objectid + found_key.offset <= *logical) {
> + return -ENOENT;
> + }
> +
> + if (*logical + *length > found_key.objectid + found_key.offset)
> + *length = found_key.objectid + found_key.offset - *logical;
> +
> + if (found_key.type == BTRFS_IDENTITY_REMAP_KEY)
> + return 0;
> +
> + remap = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_remap);
> +
> + *logical = *logical - found_key.objectid + btrfs_remap_address(leaf, remap);
nit: I think the readability of this would benefit from some "offset"
helper variable, but your commit message does make it clear enough.
> +
> + return 0;
> +}
> +
> /*
> * function to relocate all extents in a block group.
> */
> diff --git a/fs/btrfs/relocation.h b/fs/btrfs/relocation.h
> index 5c36b3f84b57..a653c42a25a3 100644
> --- a/fs/btrfs/relocation.h
> +++ b/fs/btrfs/relocation.h
> @@ -31,5 +31,7 @@ int btrfs_should_cancel_balance(const struct btrfs_fs_info *fs_info);
> struct btrfs_root *find_reloc_root(struct btrfs_fs_info *fs_info, u64 bytenr);
> bool btrfs_should_ignore_reloc_root(const struct btrfs_root *root);
> u64 btrfs_get_reloc_bg_bytenr(const struct btrfs_fs_info *fs_info);
> +int btrfs_translate_remap(struct btrfs_fs_info *fs_info, u64 *logical,
> + u64 *length, bool nolock);
>
> #endif
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 678e5d4cd780..a2c49cb8bfc6 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -6635,6 +6635,37 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
> if (IS_ERR(map))
> return PTR_ERR(map);
>
> + if (map->type & BTRFS_BLOCK_GROUP_REMAPPED) {
potential optimization (not blocking for this version IMO):
if you can cache the type on the extent_map (I actually think it ought
to already be done, essentially, as we know data vs. not data) then you
don't need to lookup the map at all for a remapped block, and can go
straight to looking up the remap.
> + u64 new_logical = logical;
> + bool nolock = !(map->type & BTRFS_BLOCK_GROUP_DATA);
> +
> + /*
> + * We use search_commit_root in btrfs_translate_remap for
> + * metadata blocks, to avoid lockdep complaining about
> + * recursive locking.
real risk of deadlock or "complaining"?
> + * If we get -ENOENT this means this is a BG that has just had
> + * its REMAPPED flag set, and so nothing has yet been actually
> + * remapped.
> + */
> + ret = btrfs_translate_remap(fs_info, &new_logical, length,
> + nolock);
> + if (ret && (!nolock || ret != -ENOENT))
> + return ret;
> +
> + if (ret != -ENOENT && new_logical != logical) {
> + btrfs_free_chunk_map(map);
> +
> + map = btrfs_get_chunk_map(fs_info, new_logical,
> + *length);
> + if (IS_ERR(map))
> + return PTR_ERR(map);
> +
> + logical = new_logical;
> + }
> +
> + ret = 0;
> + }
> +
> num_copies = btrfs_chunk_map_num_copies(map);
> if (io_geom.mirror_num > num_copies)
> return -EINVAL;
> --
> 2.49.1
>
* Re: [PATCH v2 08/16] btrfs: redirect I/O for remapped block groups
2025-08-22 19:42 ` Boris Burkov
@ 2025-08-27 14:08 ` Mark Harmstone
0 siblings, 0 replies; 43+ messages in thread
From: Mark Harmstone @ 2025-08-27 14:08 UTC (permalink / raw)
To: Boris Burkov; +Cc: linux-btrfs
On 22/08/2025 8.42 pm, Boris Burkov wrote:
> On Wed, Aug 13, 2025 at 03:34:50PM +0100, Mark Harmstone wrote:
>> Change btrfs_map_block() so that if the block group has the REMAPPED
>> flag set, we call btrfs_translate_remap() to obtain a new address.
>>
>> btrfs_translate_remap() searches the remap tree for a range
>> corresponding to the logical address passed to btrfs_map_block(). If it
>> is within an identity remap, this part of the block group hasn't yet
>> been relocated, and so we use the existing address.
>>
>> If it is within an actual remap, we subtract the start of the remap
>> range and add the address of its destination, contained in the item's
>> payload.
>>
>> Signed-off-by: Mark Harmstone <mark@harmstone.com>
>> ---
>> fs/btrfs/relocation.c | 59 +++++++++++++++++++++++++++++++++++++++++++
>> fs/btrfs/relocation.h | 2 ++
>> fs/btrfs/volumes.c | 31 +++++++++++++++++++++++
>> 3 files changed, 92 insertions(+)
>>
>> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
>> index 7256f6748c8f..e1f1da9336e7 100644
>> --- a/fs/btrfs/relocation.c
>> +++ b/fs/btrfs/relocation.c
>> @@ -3884,6 +3884,65 @@ static const char *stage_to_string(enum reloc_stage stage)
>> return "unknown";
>> }
>>
>> +int btrfs_translate_remap(struct btrfs_fs_info *fs_info, u64 *logical,
>> + u64 *length, bool nolock)
>> +{
>> + int ret;
>> + struct btrfs_key key, found_key;
>> + struct extent_buffer *leaf;
>> + struct btrfs_remap *remap;
>> + BTRFS_PATH_AUTO_FREE(path);
>> +
>> + path = btrfs_alloc_path();
>> + if (!path)
>> + return -ENOMEM;
>> +
>> + if (nolock) {
>> + path->search_commit_root = 1;
>> + path->skip_locking = 1;
>> + }
>
> We are calling this without a transaction and in a loop in
> btrfs_submit_bbio:
>
> btrfs_submit_bbio
> while (blah); btrfs_submit_chunk
> btrfs_map_block
> btrfs_translate_remap
>
> So that means in that loop we can have one remap tree in one step of the
> loop and then a transaction can finish and then the next chunk is
> remapped on the next remap tree in the next step.
>
> Is that acceptable? Otherwise you need to hold the commit_root_sem for
> the whole loop. It seems OK because both copies ought to be around while
> we're in the middle of remapping, but let's be sure. I'm also curious
> about the paths that are removing things from the remap tree. I would
> expect live IO that would use that remapping would block them, as it
> is like removing an extent, but also worth considering.
Yes, this should be fine, as both copies will be valid. The only problem is
there's a race with DIO, which we know about.
The current discard code delays all discards until the last identity remap
has gone, rather than discarding as we go along, so there shouldn't be an
issue with reading data that's just been discarded.
>
>> +
>> + key.objectid = *logical;
>> + key.type = (u8)-1;
>> + key.offset = (u64)-1;
>> +
>> + ret = btrfs_search_slot(NULL, fs_info->remap_root, &key, path,
>> + 0, 0);
>> + if (ret < 0)
>> + return ret;
>> +
>> + leaf = path->nodes[0];
>> +
>> + if (path->slots[0] == 0)
>> + return -ENOENT;
>> +
>> + path->slots[0]--;
>> +
>> + btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
>> +
>> + if (found_key.type != BTRFS_REMAP_KEY &&
>> + found_key.type != BTRFS_IDENTITY_REMAP_KEY) {
>> + return -ENOENT;
>> + }
>> +
>> + if (found_key.objectid > *logical ||
>> + found_key.objectid + found_key.offset <= *logical) {
>> + return -ENOENT;
>> + }
>> +
>> + if (*logical + *length > found_key.objectid + found_key.offset)
>> + *length = found_key.objectid + found_key.offset - *logical;
>> +
>> + if (found_key.type == BTRFS_IDENTITY_REMAP_KEY)
>> + return 0;
>> +
>> + remap = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_remap);
>> +
>> + *logical = *logical - found_key.objectid + btrfs_remap_address(leaf, remap);
>
> nit: I think the readability of this would benefit from some "offset"
> helper variable, but your commit message does make it clear enough.
Rearranging it to...
*logical += btrfs_remap_address(leaf, remap) - found_key.objectid;
...looks a lot less ugly.
>
>> +
>> + return 0;
>> +}
>> +
>> /*
>> * function to relocate all extents in a block group.
>> */
>> diff --git a/fs/btrfs/relocation.h b/fs/btrfs/relocation.h
>> index 5c36b3f84b57..a653c42a25a3 100644
>> --- a/fs/btrfs/relocation.h
>> +++ b/fs/btrfs/relocation.h
>> @@ -31,5 +31,7 @@ int btrfs_should_cancel_balance(const struct btrfs_fs_info *fs_info);
>> struct btrfs_root *find_reloc_root(struct btrfs_fs_info *fs_info, u64 bytenr);
>> bool btrfs_should_ignore_reloc_root(const struct btrfs_root *root);
>> u64 btrfs_get_reloc_bg_bytenr(const struct btrfs_fs_info *fs_info);
>> +int btrfs_translate_remap(struct btrfs_fs_info *fs_info, u64 *logical,
>> + u64 *length, bool nolock);
>>
>> #endif
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index 678e5d4cd780..a2c49cb8bfc6 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -6635,6 +6635,37 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
>> if (IS_ERR(map))
>> return PTR_ERR(map);
>>
>> + if (map->type & BTRFS_BLOCK_GROUP_REMAPPED) {
>
> potential optimization (not blocking for this version IMO):
> if you can cache the type on the extent_map (I actually think it ought
> to already be done, essentially, as we know data vs. not data) then you
> don't need to lookup the map at all for a remapped block, and can go
> straight to looking up the remap.
>
>> + u64 new_logical = logical;
>> + bool nolock = !(map->type & BTRFS_BLOCK_GROUP_DATA);
>> +
>> + /*
>> + * We use search_commit_root in btrfs_translate_remap for
>> + * metadata blocks, to avoid lockdep complaining about
>> + * recursive locking.
>
> real risk of deadlock or "complaining"?
Complaining. Reading any tree other than the chunk tree can result in a
read of the remap tree, but reading the remap tree won't read any other
tree.
I added a btrfs_lockdep_keyset for the remap-tree as a result of another
issue, which I think might well have fixed this too. But search_commit_root
is a desirable optimization regardless, which is why I've kept it in.
>
>> + * If we get -ENOENT this means this is a BG that has just had
>> + * its REMAPPED flag set, and so nothing has yet been actually
>> + * remapped.
>> + */
>> + ret = btrfs_translate_remap(fs_info, &new_logical, length,
>> + nolock);
>> + if (ret && (!nolock || ret != -ENOENT))
>> + return ret;
>> +
>> + if (ret != -ENOENT && new_logical != logical) {
>> + btrfs_free_chunk_map(map);
>> +
>> + map = btrfs_get_chunk_map(fs_info, new_logical,
>> + *length);
>> + if (IS_ERR(map))
>> + return PTR_ERR(map);
>> +
>> + logical = new_logical;
>> + }
>> +
>> + ret = 0;
>> + }
>> +
>> num_copies = btrfs_chunk_map_num_copies(map);
>> if (io_geom.mirror_num > num_copies)
>> return -EINVAL;
>> --
>> 2.49.1
>>
* Re: [PATCH v2 09/16] btrfs: release BG lock before calling btrfs_link_bg_list()
2025-08-16 0:32 ` Boris Burkov
@ 2025-08-27 15:35 ` Mark Harmstone
2025-08-27 15:48 ` Filipe Manana
0 siblings, 1 reply; 43+ messages in thread
From: Mark Harmstone @ 2025-08-27 15:35 UTC (permalink / raw)
To: Boris Burkov; +Cc: linux-btrfs
On 16/08/2025 1.32 am, Boris Burkov wrote:
> On Wed, Aug 13, 2025 at 03:34:51PM +0100, Mark Harmstone wrote:
>> Release block_group->lock before calling btrfs_link_bg_list() in
>> btrfs_delete_unused_bgs(), as this was causing lockdep issues.
>>
>> This lock isn't held in any other place that we call btrfs_link_bg_list(), as
>> the block group lists are manipulated while holding fs_info->unused_bgs_lock.
>>
>
> Please include the offending lockdep output you are fixing.
I didn't include it as lockdep was triggering on the second (correct) instance,
while the problem was on the first (incorrect) instance, and thought it would
confuse matters. And I stupidly didn't take a copy, and now I can't reproduce
it.
The issue is that in btrfs_discard_punt_unused_bgs_list() in "btrfs: remove
remapped block groups from the free-space tree", we're holding unused_bgs_lock
then looping through the list, grabbing each block group's lock. In
btrfs_delete_unused_bgs() we're grabbing unused_bgs_lock while unnecessarily
holding the block group lock.
> Is this a generic fix unrelated to your other changes? I think a
> separate patch from the series is clearer in that case. And it would
> need a Fixes: tag (probably my patch that added btrfs_link_bg_list, haha)
It looks like it's actually f4a9f219411f318ae60d6ff7f129082a75686c6c,
"btrfs: do not delete unused block group if it may be used soon".
>
> Thanks.
>
>> Signed-off-by: Mark Harmstone <mark@harmstone.com>
>> ---
>> fs/btrfs/block-group.c | 3 ++-
>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
>> index bed9c58b6cbc..8c28f829547e 100644
>> --- a/fs/btrfs/block-group.c
>> +++ b/fs/btrfs/block-group.c
>> @@ -1620,6 +1620,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>> if ((space_info->total_bytes - block_group->length < used &&
>> block_group->zone_unusable < block_group->length) ||
>> has_unwritten_metadata(block_group)) {
>> + spin_unlock(&block_group->lock);
>> +
>> /*
>> * Add a reference for the list, compensate for the ref
>> * drop under the "next" label for the
>> @@ -1628,7 +1630,6 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>> btrfs_link_bg_list(block_group, &retry_list);
>>
>> trace_btrfs_skip_unused_block_group(block_group);
>> - spin_unlock(&block_group->lock);
>> spin_unlock(&space_info->lock);
>> up_write(&space_info->groups_sem);
>> goto next;
>> --
>> 2.49.1
>>
* Re: [PATCH v2 09/16] btrfs: release BG lock before calling btrfs_link_bg_list()
2025-08-27 15:35 ` Mark Harmstone
@ 2025-08-27 15:48 ` Filipe Manana
2025-08-27 15:52 ` Mark Harmstone
0 siblings, 1 reply; 43+ messages in thread
From: Filipe Manana @ 2025-08-27 15:48 UTC (permalink / raw)
To: Mark Harmstone; +Cc: Boris Burkov, linux-btrfs
On Wed, Aug 27, 2025 at 4:36 PM Mark Harmstone <mark@harmstone.com> wrote:
>
> On 16/08/2025 1.32 am, Boris Burkov wrote:
> > On Wed, Aug 13, 2025 at 03:34:51PM +0100, Mark Harmstone wrote:
> >> Release block_group->lock before calling btrfs_link_bg_list() in
> >> btrfs_delete_unused_bgs(), as this was causing lockdep issues.
> >>
> >> This lock isn't held in any other place that we call btrfs_link_bg_list(), as
> >> the block group lists are manipulated while holding fs_info->unused_bgs_lock.
> >>
> >
> > Please include the offending lockdep output you are fixing.
>
> I didn't include it as lockdep was triggering on the second (correct) instance,
> while the problem was on the first (incorrect) instance, and thought it would
> confuse matters. And I stupidly didn't take a copy, and now I can't reproduce
> it.
>
> The issue is that in btrfs_discard_punt_unused_bgs_list() in "btrfs: remove
> remapped block groups from the free-space tree", we're holding unused_bgs_lock
> then looping through the list grabbing the individual block group list. In
> btrfs_delete_unused_bgs() we're grabbing unused_bgs_lock while unnecessarily
> holding the block group lock.
>
>
> > Is this a generic fix unrelated to your other changes? I think a
> > separate patch from the series is clearer in that case. And it would
> > need a Fixes: tag (probably my patch that added btrfs_link_bg_list, haha)
>
> It looks like it's actually f4a9f219411f318ae60d6ff7f129082a75686c6c,
> "btrfs: do not delete unused block group if it may be used soon".
No, it's not.
In that commit we didn't acquire fs_info->unused_bgs_lock.
The locking that makes lockdep not happy was added in 0497dfba98c0
("btrfs: codify pattern for adding block_group to bg_list"),
as it replaced the open coded list_add_tail() with the call to the new
function btrfs_link_bg_list(), and this new function locks
fs_info->unused_bgs_lock.
>
> >
> > Thanks.
> >
> >> Signed-off-by: Mark Harmstone <mark@harmstone.com>
> >> ---
> >> fs/btrfs/block-group.c | 3 ++-
> >> 1 file changed, 2 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> >> index bed9c58b6cbc..8c28f829547e 100644
> >> --- a/fs/btrfs/block-group.c
> >> +++ b/fs/btrfs/block-group.c
> >> @@ -1620,6 +1620,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
> >> if ((space_info->total_bytes - block_group->length < used &&
> >> block_group->zone_unusable < block_group->length) ||
> >> has_unwritten_metadata(block_group)) {
> >> + spin_unlock(&block_group->lock);
> >> +
> >> /*
> >> * Add a reference for the list, compensate for the ref
> >> * drop under the "next" label for the
> >> @@ -1628,7 +1630,6 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
> >> btrfs_link_bg_list(block_group, &retry_list);
> >>
> >> trace_btrfs_skip_unused_block_group(block_group);
> >> - spin_unlock(&block_group->lock);
> >> spin_unlock(&space_info->lock);
> >> up_write(&space_info->groups_sem);
> >> goto next;
> >> --
> >> 2.49.1
> >>
>
>
* Re: [PATCH v2 09/16] btrfs: release BG lock before calling btrfs_link_bg_list()
2025-08-27 15:48 ` Filipe Manana
@ 2025-08-27 15:52 ` Mark Harmstone
0 siblings, 0 replies; 43+ messages in thread
From: Mark Harmstone @ 2025-08-27 15:52 UTC (permalink / raw)
To: Filipe Manana; +Cc: Boris Burkov, linux-btrfs
On 27/08/2025 4.48 pm, Filipe Manana wrote:
> On Wed, Aug 27, 2025 at 4:36 PM Mark Harmstone <mark@harmstone.com> wrote:
>>
>> On 16/08/2025 1.32 am, Boris Burkov wrote:
>>> On Wed, Aug 13, 2025 at 03:34:51PM +0100, Mark Harmstone wrote:
>>>> Release block_group->lock before calling btrfs_link_bg_list() in
>>>> btrfs_delete_unused_bgs(), as this was causing lockdep issues.
>>>>
>>>> This lock isn't held in any other place that we call btrfs_link_bg_list(), as
>>>> the block group lists are manipulated while holding fs_info->unused_bgs_lock.
>>>>
>>>
>>> Please include the offending lockdep output you are fixing.
>>
>> I didn't include it as lockdep was triggering on the second (correct) instance,
>> while the problem was on the first (incorrect) instance, and thought it would
>> confuse matters. And I stupidly didn't take a copy, and now I can't reproduce
>> it.
>>
>> The issue is that in btrfs_discard_punt_unused_bgs_list() in "btrfs: remove
>> remapped block groups from the free-space tree", we're holding unused_bgs_lock
>> then looping through the list grabbing the individual block group list. In
>> btrfs_delete_unused_bgs() we're grabbing unused_bgs_lock while unnecessarily
>> holding the block group lock.
>>
>>
>>> Is this a generic fix unrelated to your other changes? I think a
>>> separate patch from the series is clearer in that case. And it would
>>> need a Fixes: tag (probably my patch that added btrfs_link_bg_list, haha)
>>
>> It looks like it's actually f4a9f219411f318ae60d6ff7f129082a75686c6c,
>> "btrfs: do not delete unused block group if it may be used soon".
>
> No, it's not.
> In that commit we didn't acquire fs_info->unused_bgs_lock.
>
> The locking that makes lockdep not happy was added in 0497dfba98c0
> ("btrfs: codify pattern for adding block_group to bg_list"),
> as it replaced the open coded list_add_tail() with the call to the new
> function btrfs_link_bg_list(), and this new function locks
> fs_info->unused_bgs_lock.
Thanks Filipe, you're right. I'll make sure the next version of the patch
has the right commit in the Fixes tag.
>>
>>>
>>> Thanks.
>>>
>>>> Signed-off-by: Mark Harmstone <mark@harmstone.com>
>>>> ---
>>>> fs/btrfs/block-group.c | 3 ++-
>>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
>>>> index bed9c58b6cbc..8c28f829547e 100644
>>>> --- a/fs/btrfs/block-group.c
>>>> +++ b/fs/btrfs/block-group.c
>>>> @@ -1620,6 +1620,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>>>> if ((space_info->total_bytes - block_group->length < used &&
>>>> block_group->zone_unusable < block_group->length) ||
>>>> has_unwritten_metadata(block_group)) {
>>>> + spin_unlock(&block_group->lock);
>>>> +
>>>> /*
>>>> * Add a reference for the list, compensate for the ref
>>>> * drop under the "next" label for the
>>>> @@ -1628,7 +1630,6 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>>>> btrfs_link_bg_list(block_group, &retry_list);
>>>>
>>>> trace_btrfs_skip_unused_block_group(block_group);
>>>> - spin_unlock(&block_group->lock);
>>>> spin_unlock(&space_info->lock);
>>>> up_write(&space_info->groups_sem);
>>>> goto next;
>>>> --
>>>> 2.49.1
>>>>
>>
>>
* Re: [PATCH v2 10/16] btrfs: handle deletions from remapped block group
2025-08-16 0:28 ` Boris Burkov
@ 2025-08-27 17:11 ` Mark Harmstone
0 siblings, 0 replies; 43+ messages in thread
From: Mark Harmstone @ 2025-08-27 17:11 UTC (permalink / raw)
To: Boris Burkov; +Cc: linux-btrfs
On 16/08/2025 1.28 am, Boris Burkov wrote:
> On Wed, Aug 13, 2025 at 03:34:52PM +0100, Mark Harmstone wrote:
>> Handle the case where we free an extent from a block group that has the
>> REMAPPED flag set. Because the remap tree is orthogonal to the extent
>> tree, for data this may be within any number of identity remaps or
>> actual remaps. If we're freeing a metadata node, this will be wholly
>> inside one or the other.
>>
>> btrfs_remove_extent_from_remap_tree() searches the remap tree for the
>> remaps that cover the range in question, then calls
>> remove_range_from_remap_tree() for each one, to punch a hole in the
>> remap and adjust the free-space tree.
>>
>> For an identity remap, remove_range_from_remap_tree() will adjust the
>> block group's `identity_remap_count` if this changes. If it reaches
>> zero we call last_identity_remap_gone(), which removes the chunk's
>> stripes and device extents - it is now fully remapped.
>>
>> The changes which involve the block group's ro flag are because the
>> REMAPPED flag itself prevents a block group from having any new
>> allocations within it, and so we don't need to account for this
>> separately.
>>
>> Signed-off-by: Mark Harmstone <mark@harmstone.com>
>> ---
>> fs/btrfs/block-group.c | 82 ++++---
>> fs/btrfs/block-group.h | 1 +
>> fs/btrfs/disk-io.c | 1 +
>> fs/btrfs/extent-tree.c | 28 ++-
>> fs/btrfs/fs.h | 1 +
>> fs/btrfs/relocation.c | 510 +++++++++++++++++++++++++++++++++++++++++
>> fs/btrfs/relocation.h | 3 +
>> fs/btrfs/volumes.c | 56 +++--
>> fs/btrfs/volumes.h | 6 +
>> 9 files changed, 630 insertions(+), 58 deletions(-)
>>
>> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
>> index 8c28f829547e..7a0524138235 100644
>> --- a/fs/btrfs/block-group.c
>> +++ b/fs/btrfs/block-group.c
>> @@ -1068,6 +1068,32 @@ static int remove_block_group_item(struct btrfs_trans_handle *trans,
>> return ret;
>> }
>>
>> +void btrfs_remove_bg_from_sinfo(struct btrfs_block_group *block_group)
>> +{
>> + int factor = btrfs_bg_type_to_factor(block_group->flags);
>> +
>> + spin_lock(&block_group->space_info->lock);
>> +
>> + if (btrfs_test_opt(block_group->fs_info, ENOSPC_DEBUG)) {
>> + WARN_ON(block_group->space_info->total_bytes
>> + < block_group->length);
>> + WARN_ON(block_group->space_info->bytes_readonly
>> + < block_group->length - block_group->zone_unusable);
>> + WARN_ON(block_group->space_info->bytes_zone_unusable
>> + < block_group->zone_unusable);
>> + WARN_ON(block_group->space_info->disk_total
>> + < block_group->length * factor);
>> + }
>> + block_group->space_info->total_bytes -= block_group->length;
>> + block_group->space_info->bytes_readonly -=
>> + (block_group->length - block_group->zone_unusable);
>> + btrfs_space_info_update_bytes_zone_unusable(block_group->space_info,
>> + -block_group->zone_unusable);
>> + block_group->space_info->disk_total -= block_group->length * factor;
>> +
>> + spin_unlock(&block_group->space_info->lock);
>> +}
>> +
>> int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
>> struct btrfs_chunk_map *map)
>> {
>> @@ -1079,7 +1105,6 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
>> struct kobject *kobj = NULL;
>> int ret;
>> int index;
>> - int factor;
>> struct btrfs_caching_control *caching_ctl = NULL;
>> bool remove_map;
>> bool remove_rsv = false;
>> @@ -1088,7 +1113,7 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
>> if (!block_group)
>> return -ENOENT;
>>
>> - BUG_ON(!block_group->ro);
>> + BUG_ON(!block_group->ro && !(block_group->flags & BTRFS_BLOCK_GROUP_REMAPPED));
>>
>> trace_btrfs_remove_block_group(block_group);
>> /*
>> @@ -1100,7 +1125,6 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
>> block_group->length);
>>
>> index = btrfs_bg_flags_to_raid_index(block_group->flags);
>> - factor = btrfs_bg_type_to_factor(block_group->flags);
>>
>> /* make sure this block group isn't part of an allocation cluster */
>> cluster = &fs_info->data_alloc_cluster;
>> @@ -1224,26 +1248,11 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
>>
>> spin_lock(&block_group->space_info->lock);
>> list_del_init(&block_group->ro_list);
>> -
>> - if (btrfs_test_opt(fs_info, ENOSPC_DEBUG)) {
>> - WARN_ON(block_group->space_info->total_bytes
>> - < block_group->length);
>> - WARN_ON(block_group->space_info->bytes_readonly
>> - < block_group->length - block_group->zone_unusable);
>> - WARN_ON(block_group->space_info->bytes_zone_unusable
>> - < block_group->zone_unusable);
>> - WARN_ON(block_group->space_info->disk_total
>> - < block_group->length * factor);
>> - }
>> - block_group->space_info->total_bytes -= block_group->length;
>> - block_group->space_info->bytes_readonly -=
>> - (block_group->length - block_group->zone_unusable);
>> - btrfs_space_info_update_bytes_zone_unusable(block_group->space_info,
>> - -block_group->zone_unusable);
>> - block_group->space_info->disk_total -= block_group->length * factor;
>> -
>> spin_unlock(&block_group->space_info->lock);
>>
>> + if (!(block_group->flags & BTRFS_BLOCK_GROUP_REMAPPED))
>> + btrfs_remove_bg_from_sinfo(block_group);
>> +
>> /*
>> * Remove the free space for the block group from the free space tree
>> * and the block group's item from the extent tree before marking the
>> @@ -1539,6 +1548,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>> while (!list_empty(&fs_info->unused_bgs)) {
>> u64 used;
>> int trimming;
>> + bool made_ro = false;
>>
>> block_group = list_first_entry(&fs_info->unused_bgs,
>> struct btrfs_block_group,
>> @@ -1575,7 +1585,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>>
>> spin_lock(&space_info->lock);
>> spin_lock(&block_group->lock);
>> - if (btrfs_is_block_group_used(block_group) || block_group->ro ||
>> + if (btrfs_is_block_group_used(block_group) ||
>> + (block_group->ro && !(block_group->flags & BTRFS_BLOCK_GROUP_REMAPPED)) ||
>> list_is_singular(&block_group->list)) {
>> /*
>> * We want to bail if we made new allocations or have
>> @@ -1617,9 +1628,10 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>> * needing to allocate extents from the block group.
>> */
>> used = btrfs_space_info_used(space_info, true);
>> - if ((space_info->total_bytes - block_group->length < used &&
>> + if (((space_info->total_bytes - block_group->length < used &&
>> block_group->zone_unusable < block_group->length) ||
>> - has_unwritten_metadata(block_group)) {
>> + has_unwritten_metadata(block_group)) &&
>> + !(block_group->flags & BTRFS_BLOCK_GROUP_REMAPPED)) {
>> spin_unlock(&block_group->lock);
>>
>> /*
>> @@ -1638,8 +1650,14 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>> spin_unlock(&block_group->lock);
>> spin_unlock(&space_info->lock);
>>
>> - /* We don't want to force the issue, only flip if it's ok. */
>> - ret = inc_block_group_ro(block_group, 0);
>> + if (!(block_group->flags & BTRFS_BLOCK_GROUP_REMAPPED)) {
>> + /* We don't want to force the issue, only flip if it's ok. */
>> + ret = inc_block_group_ro(block_group, 0);
>> + made_ro = true;
>> + } else {
>> + ret = 0;
>> + }
>> +
>> up_write(&space_info->groups_sem);
>> if (ret < 0) {
>> ret = 0;
>> @@ -1648,7 +1666,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>>
>> ret = btrfs_zone_finish(block_group);
>> if (ret < 0) {
>> - btrfs_dec_block_group_ro(block_group);
>> + if (made_ro)
>> + btrfs_dec_block_group_ro(block_group);
>> if (ret == -EAGAIN) {
>> btrfs_link_bg_list(block_group, &retry_list);
>> ret = 0;
>> @@ -1663,7 +1682,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>> trans = btrfs_start_trans_remove_block_group(fs_info,
>> block_group->start);
>> if (IS_ERR(trans)) {
>> - btrfs_dec_block_group_ro(block_group);
>> + if (made_ro)
>> + btrfs_dec_block_group_ro(block_group);
>> ret = PTR_ERR(trans);
>> goto next;
>> }
>> @@ -1673,7 +1693,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>> * just delete them, we don't care about them anymore.
>> */
>> if (!clean_pinned_extents(trans, block_group)) {
>> - btrfs_dec_block_group_ro(block_group);
>> + if (made_ro)
>> + btrfs_dec_block_group_ro(block_group);
>> goto end_trans;
>> }
>>
>> @@ -1687,7 +1708,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>> spin_lock(&fs_info->discard_ctl.lock);
>> if (!list_empty(&block_group->discard_list)) {
>> spin_unlock(&fs_info->discard_ctl.lock);
>> - btrfs_dec_block_group_ro(block_group);
>> + if (made_ro)
>> + btrfs_dec_block_group_ro(block_group);
>> btrfs_discard_queue_work(&fs_info->discard_ctl,
>> block_group);
>> goto end_trans;
>> diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
>> index ecc89701b2ea..0433b0127ed8 100644
>> --- a/fs/btrfs/block-group.h
>> +++ b/fs/btrfs/block-group.h
>> @@ -336,6 +336,7 @@ int btrfs_add_new_free_space(struct btrfs_block_group *block_group,
>> struct btrfs_trans_handle *btrfs_start_trans_remove_block_group(
>> struct btrfs_fs_info *fs_info,
>> const u64 chunk_offset);
>> +void btrfs_remove_bg_from_sinfo(struct btrfs_block_group *block_group);
>> int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
>> struct btrfs_chunk_map *map);
>> void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info);
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 563aea5e3b1b..d92d08316322 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -2907,6 +2907,7 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
>> mutex_init(&fs_info->chunk_mutex);
>> mutex_init(&fs_info->transaction_kthread_mutex);
>> mutex_init(&fs_info->cleaner_mutex);
>> + mutex_init(&fs_info->remap_mutex);
>> mutex_init(&fs_info->ro_block_group_mutex);
>> init_rwsem(&fs_info->commit_root_sem);
>> init_rwsem(&fs_info->cleanup_work_sem);
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index c1b96c728fe6..ca3f6d6bb5ba 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -40,6 +40,7 @@
>> #include "orphan.h"
>> #include "tree-checker.h"
>> #include "raid-stripe-tree.h"
>> +#include "relocation.h"
>>
>> #undef SCRAMBLE_DELAYED_REFS
>>
>> @@ -2999,7 +3000,8 @@ u64 btrfs_get_extent_owner_root(struct btrfs_fs_info *fs_info,
>> }
>>
>> static int do_free_extent_accounting(struct btrfs_trans_handle *trans,
>> - u64 bytenr, struct btrfs_squota_delta *delta)
>> + u64 bytenr, struct btrfs_squota_delta *delta,
>> + bool remapped)
>> {
>> int ret;
>> u64 num_bytes = delta->num_bytes;
>> @@ -3027,10 +3029,16 @@ static int do_free_extent_accounting(struct btrfs_trans_handle *trans,
>> return ret;
>> }
>>
>> - ret = btrfs_add_to_free_space_tree(trans, bytenr, num_bytes);
>> - if (ret) {
>> - btrfs_abort_transaction(trans, ret);
>> - return ret;
>> + /*
>> + * If remapped, FST has already been taken care of in
>> + * remove_range_from_remap_tree().
>> + */
>
> Why not do btrfs_remove_extent_from_remap_tree() here in
> do_free_extent_accounting() rather than the caller?
>
>> + if (!remapped) {
>> + ret = btrfs_add_to_free_space_tree(trans, bytenr, num_bytes);
>> + if (ret) {
>> + btrfs_abort_transaction(trans, ret);
>> + return ret;
>> + }
>> }
>>
>> ret = btrfs_update_block_group(trans, bytenr, num_bytes, false);
>
> So in the normal case, this will trigger updating the block group bytes
> counters, and when they hit 0 put the block group on the unused list,
> which queues it for deletion and ultimately removing from space info
> etc.
>
> I fail to see what is special about remapped block groups from that
> perspective. I would strongly prefer to see you integrate with
> btrsf_delete_unused_bgs() rather than special case skipping it there and
> copying parts of the logic elsewhere.
>
> Unless there is some very good reason for the special treatment that I
> am not seeing.
Two reasons:
1) The FST uses underlying addresses, and bytenr here is a (potentially)
remapped address. So if the BG has the remapped flag set, the range needs
to be run through the remap tree.
2) It's conditional: if it's an identity remap it's already had its FST
entries removed when we started relocation. If it's a normal remap, we
need to remove the FST entries for the translated address. And of course
an extent can be covered by multiple remap entries in multiple block groups.
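To illustrate the point above, here's a small userspace sketch (not kernel code; the struct and function names are hypothetical) of translating a freed range through a single remap entry. An identity remap hands back no underlying address, since its FST entries were already removed when relocation started; a normal remap yields the translated address whose FST entries need releasing, and the returned overlap length is why the caller may have to loop over multiple entries:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model of a single remap-tree entry. */
struct remap_entry {
	uint64_t start;     /* logical address covered by this entry */
	uint64_t length;
	uint64_t new_addr;  /* underlying address (unused for identity) */
	bool identity;
};

/*
 * Translate a freed logical range through one remap entry. Returns the
 * number of bytes this entry covers, and for a non-identity remap sets
 * *out to the underlying address whose free-space-tree entries must be
 * released. Identity remaps set *out to 0: their FST entries went away
 * when relocation of the block group began.
 */
static uint64_t translate_free(const struct remap_entry *e,
			       uint64_t bytenr, uint64_t num_bytes,
			       uint64_t *out)
{
	uint64_t end = bytenr + num_bytes;
	uint64_t e_end = e->start + e->length;
	uint64_t overlap;

	assert(bytenr >= e->start && bytenr < e_end);
	overlap = (end < e_end ? end : e_end) - bytenr;

	if (e->identity) {
		*out = 0;
		return overlap;
	}
	*out = e->new_addr + (bytenr - e->start);
	return overlap;
}
```

A range straddling several entries would call this in a loop, advancing bytenr by the returned overlap each time, mirroring the do/while in btrfs_remove_extent_from_remap_tree() above.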
>> @@ -3396,7 +3404,15 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
>> }
>> btrfs_release_path(path);
>>
>> - ret = do_free_extent_accounting(trans, bytenr, &delta);
>> + /* returns 1 on success and 0 on no-op */
>> + ret = btrfs_remove_extent_from_remap_tree(trans, path, bytenr,
>> + num_bytes);
>> + if (ret < 0) {
>> + btrfs_abort_transaction(trans, ret);
>> + goto out;
>> + }
>> +
>> + ret = do_free_extent_accounting(trans, bytenr, &delta, ret);
>> }
>> btrfs_release_path(path);
>>
>> diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
>> index 6ea96e76655e..dbb7de95241b 100644
>> --- a/fs/btrfs/fs.h
>> +++ b/fs/btrfs/fs.h
>> @@ -547,6 +547,7 @@ struct btrfs_fs_info {
>> struct mutex transaction_kthread_mutex;
>> struct mutex cleaner_mutex;
>> struct mutex chunk_mutex;
>> + struct mutex remap_mutex;
>>
>> /*
>> * This is taken to make sure we don't set block groups ro after the
>> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
>> index e1f1da9336e7..03a1246af678 100644
>> --- a/fs/btrfs/relocation.c
>> +++ b/fs/btrfs/relocation.c
>> @@ -37,6 +37,7 @@
>> #include "super.h"
>> #include "tree-checker.h"
>> #include "raid-stripe-tree.h"
>> +#include "free-space-tree.h"
>>
>> /*
>> * Relocation overview
>> @@ -3884,6 +3885,148 @@ static const char *stage_to_string(enum reloc_stage stage)
>> return "unknown";
>> }
>>
>> +static void adjust_block_group_remap_bytes(struct btrfs_trans_handle *trans,
>> + struct btrfs_block_group *bg,
>> + s64 diff)
>> +{
>> + struct btrfs_fs_info *fs_info = trans->fs_info;
>> + bool bg_already_dirty = true;
>> +
>> + bg->remap_bytes += diff;
>> +
>> + if (bg->used == 0 && bg->remap_bytes == 0)
>> + btrfs_mark_bg_unused(bg);
>> +
>> + spin_lock(&trans->transaction->dirty_bgs_lock);
>> + if (list_empty(&bg->dirty_list)) {
>> + list_add_tail(&bg->dirty_list, &trans->transaction->dirty_bgs);
>> + bg_already_dirty = false;
>> + btrfs_get_block_group(bg);
>> + }
>> + spin_unlock(&trans->transaction->dirty_bgs_lock);
>> +
>> + /* Modified block groups are accounted for in the delayed_refs_rsv. */
>> + if (!bg_already_dirty)
>> + btrfs_inc_delayed_refs_rsv_bg_updates(fs_info);
>> +}
>> +
>> +static int remove_chunk_stripes(struct btrfs_trans_handle *trans,
>> + struct btrfs_chunk_map *chunk,
>> + struct btrfs_path *path)
>> +{
>> + struct btrfs_fs_info *fs_info = trans->fs_info;
>> + struct btrfs_key key;
>> + struct extent_buffer *leaf;
>> + struct btrfs_chunk *c;
>> + int ret;
>> +
>> + key.objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID;
>> + key.type = BTRFS_CHUNK_ITEM_KEY;
>> + key.offset = chunk->start;
>> +
>> + ret = btrfs_search_slot(trans, fs_info->chunk_root, &key, path,
>> + 0, 1);
>> + if (ret) {
>> + if (ret == 1) {
>> + btrfs_release_path(path);
>> + ret = -ENOENT;
>> + }
>> + return ret;
>> + }
>> +
>> + leaf = path->nodes[0];
>> +
>> + c = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_chunk);
>> + btrfs_set_chunk_num_stripes(leaf, c, 0);
>> +
>> + btrfs_truncate_item(trans, path, offsetof(struct btrfs_chunk, stripe),
>> + 1);
>> +
>> + btrfs_mark_buffer_dirty(trans, leaf);
>> +
>> + btrfs_release_path(path);
>> +
>> + return 0;
>> +}
>> +
>
> Same question as elsewhere, but placed here for clarity:
> Why can't this be queued for normal unused bgs deletion, rather than
> having a special remap bg deletion function?
remove_chunk_stripes() is called when a block group becomes fully remapped,
i.e. its last identity remap has gone. We're not deleting a block group
here.
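As a toy model of that lifecycle (plain userspace C, names hypothetical, not the kernel API): the identity-remap count only counts down, and the stripe-removal step fires exactly once, when the last identity remap disappears:

```c
#include <assert.h>

/* Toy model of a block group's identity-remap bookkeeping. */
struct bg_model {
	int identity_remap_count;
	int stripes_removed; /* set once the last identity remap is gone */
};

/*
 * Mirrors the flow in adjust_identity_remap_count(): apply the delta,
 * and when the count reaches zero the chunk no longer describes any
 * physical stripes, so the stripe-removal step runs (exactly once).
 */
static void adjust_count(struct bg_model *bg, int delta)
{
	assert(!(delta < 0 && -delta > bg->identity_remap_count));
	bg->identity_remap_count += delta;
	if (bg->identity_remap_count == 0)
		bg->stripes_removed = 1;
}
```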
>> +static int last_identity_remap_gone(struct btrfs_trans_handle *trans,
>> + struct btrfs_chunk_map *chunk,
>> + struct btrfs_block_group *bg,
>> + struct btrfs_path *path)
>> +{
>> + int ret;
>> +
>> + ret = btrfs_remove_dev_extents(trans, chunk);
>> + if (ret)
>> + return ret;
>> +
>> + mutex_lock(&trans->fs_info->chunk_mutex);
>> +
>> + for (unsigned int i = 0; i < chunk->num_stripes; i++) {
>> + ret = btrfs_update_device(trans, chunk->stripes[i].dev);
>> + if (ret) {
>> + mutex_unlock(&trans->fs_info->chunk_mutex);
>> + return ret;
>> + }
>> + }
>> +
>> + mutex_unlock(&trans->fs_info->chunk_mutex);
>> +
>> + write_lock(&trans->fs_info->mapping_tree_lock);
>> + btrfs_chunk_map_device_clear_bits(chunk, CHUNK_ALLOCATED);
>> + write_unlock(&trans->fs_info->mapping_tree_lock);
>> +
>> + btrfs_remove_bg_from_sinfo(bg);
>> +
>> + ret = remove_chunk_stripes(trans, chunk, path);
>> + if (ret)
>> + return ret;
>> +
>> + return 0;
>> +}
>> +
>> +static int adjust_identity_remap_count(struct btrfs_trans_handle *trans,
>> + struct btrfs_path *path,
>> + struct btrfs_block_group *bg, int delta)
>> +{
>> + struct btrfs_fs_info *fs_info = trans->fs_info;
>> + struct btrfs_chunk_map *chunk;
>> + bool bg_already_dirty = true;
>> + int ret;
>> +
>> + WARN_ON(delta < 0 && -delta > bg->identity_remap_count);
>> +
>> + bg->identity_remap_count += delta;
>> +
>> + spin_lock(&trans->transaction->dirty_bgs_lock);
>> + if (list_empty(&bg->dirty_list)) {
>> + list_add_tail(&bg->dirty_list, &trans->transaction->dirty_bgs);
>> + bg_already_dirty = false;
>> + btrfs_get_block_group(bg);
>> + }
>> + spin_unlock(&trans->transaction->dirty_bgs_lock);
>> +
>> + /* Modified block groups are accounted for in the delayed_refs_rsv. */
>> + if (!bg_already_dirty)
>> + btrfs_inc_delayed_refs_rsv_bg_updates(fs_info);
>> +
>> + if (bg->identity_remap_count != 0)
>> + return 0;
>> +
>> + chunk = btrfs_find_chunk_map(fs_info, bg->start, 1);
>> + if (!chunk)
>> + return -ENOENT;
>> +
>> + ret = last_identity_remap_gone(trans, chunk, bg, path);
>> + if (ret)
>> + goto end;
>> +
>> + ret = 0;
>> +end:
>> + btrfs_free_chunk_map(chunk);
>> + return ret;
>> +}
>> +
>> int btrfs_translate_remap(struct btrfs_fs_info *fs_info, u64 *logical,
>> u64 *length, bool nolock)
>> {
>> @@ -4504,3 +4647,370 @@ u64 btrfs_get_reloc_bg_bytenr(const struct btrfs_fs_info *fs_info)
>> logical = fs_info->reloc_ctl->block_group->start;
>> return logical;
>> }
>> +
>> +static int remove_range_from_remap_tree(struct btrfs_trans_handle *trans,
>> + struct btrfs_path *path,
>> + struct btrfs_block_group *bg,
>> + u64 bytenr, u64 num_bytes)
>> +{
>> + int ret;
>> + struct btrfs_fs_info *fs_info = trans->fs_info;
>> + struct extent_buffer *leaf = path->nodes[0];
>> + struct btrfs_key key, new_key;
>> + struct btrfs_remap *remap_ptr = NULL, remap;
>> + struct btrfs_block_group *dest_bg = NULL;
>> + u64 end, new_addr = 0, remap_start, remap_length, overlap_length;
>> + bool is_identity_remap;
>> +
>> + end = bytenr + num_bytes;
>> +
>> + btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
>> +
>> + is_identity_remap = key.type == BTRFS_IDENTITY_REMAP_KEY;
>> +
>> + remap_start = key.objectid;
>> + remap_length = key.offset;
>> +
>> + if (!is_identity_remap) {
>> + remap_ptr = btrfs_item_ptr(leaf, path->slots[0],
>> + struct btrfs_remap);
>> + new_addr = btrfs_remap_address(leaf, remap_ptr);
>> +
>> + dest_bg = btrfs_lookup_block_group(fs_info, new_addr);
>> + }
>> +
>> + if (bytenr == remap_start && num_bytes >= remap_length) {
>> + /* Remove entirely. */
>> +
>> + ret = btrfs_del_item(trans, fs_info->remap_root, path);
>> + if (ret)
>> + goto end;
>> +
>> + btrfs_release_path(path);
>> +
>> + overlap_length = remap_length;
>> +
>> + if (!is_identity_remap) {
>> + /* Remove backref. */
>> +
>> + key.objectid = new_addr;
>> + key.type = BTRFS_REMAP_BACKREF_KEY;
>> + key.offset = remap_length;
>> +
>> + ret = btrfs_search_slot(trans, fs_info->remap_root,
>> + &key, path, -1, 1);
>> + if (ret) {
>> + if (ret == 1) {
>> + btrfs_release_path(path);
>> + ret = -ENOENT;
>> + }
>> + goto end;
>> + }
>> +
>> + ret = btrfs_del_item(trans, fs_info->remap_root, path);
>> +
>> + btrfs_release_path(path);
>> +
>> + if (ret)
>> + goto end;
>> +
>> + adjust_block_group_remap_bytes(trans, dest_bg,
>> + -remap_length);
>> + } else {
>> + ret = adjust_identity_remap_count(trans, path, bg, -1);
>> + if (ret)
>> + goto end;
>> + }
>> + } else if (bytenr == remap_start) {
>> + /* Remove beginning. */
>> +
>> + new_key.objectid = end;
>> + new_key.type = key.type;
>> + new_key.offset = remap_length + remap_start - end;
>> +
>> + btrfs_set_item_key_safe(trans, path, &new_key);
>> + btrfs_mark_buffer_dirty(trans, leaf);
>> +
>> + overlap_length = num_bytes;
>> +
>> + if (!is_identity_remap) {
>> + btrfs_set_remap_address(leaf, remap_ptr,
>> + new_addr + end - remap_start);
>> + btrfs_release_path(path);
>> +
>> + /* Adjust backref. */
>> +
>> + key.objectid = new_addr;
>> + key.type = BTRFS_REMAP_BACKREF_KEY;
>> + key.offset = remap_length;
>> +
>> + ret = btrfs_search_slot(trans, fs_info->remap_root,
>> + &key, path, -1, 1);
>> + if (ret) {
>> + if (ret == 1) {
>> + btrfs_release_path(path);
>> + ret = -ENOENT;
>> + }
>> + goto end;
>> + }
>> +
>> + leaf = path->nodes[0];
>> +
>> + new_key.objectid = new_addr + end - remap_start;
>> + new_key.type = BTRFS_REMAP_BACKREF_KEY;
>> + new_key.offset = remap_length + remap_start - end;
>> +
>> + btrfs_set_item_key_safe(trans, path, &new_key);
>> +
>> + remap_ptr = btrfs_item_ptr(leaf, path->slots[0],
>> + struct btrfs_remap);
>> + btrfs_set_remap_address(leaf, remap_ptr, end);
>> +
>> + btrfs_mark_buffer_dirty(trans, path->nodes[0]);
>> +
>> + btrfs_release_path(path);
>> +
>> + adjust_block_group_remap_bytes(trans, dest_bg,
>> + -num_bytes);
>> + }
>> + } else if (bytenr + num_bytes < remap_start + remap_length) {
>> + /* Remove middle. */
>> +
>> + new_key.objectid = remap_start;
>> + new_key.type = key.type;
>> + new_key.offset = bytenr - remap_start;
>> +
>> + btrfs_set_item_key_safe(trans, path, &new_key);
>> + btrfs_mark_buffer_dirty(trans, leaf);
>> +
>> + new_key.objectid = end;
>> + new_key.offset = remap_start + remap_length - end;
>> +
>> + btrfs_release_path(path);
>> +
>> + overlap_length = num_bytes;
>> +
>> + if (!is_identity_remap) {
>> + /* Add second remap entry. */
>> +
>> + ret = btrfs_insert_empty_item(trans, fs_info->remap_root,
>> + path, &new_key,
>> + sizeof(struct btrfs_remap));
>> + if (ret)
>> + goto end;
>> +
>> + btrfs_set_stack_remap_address(&remap,
>> + new_addr + end - remap_start);
>> +
>> + write_extent_buffer(path->nodes[0], &remap,
>> + btrfs_item_ptr_offset(path->nodes[0], path->slots[0]),
>> + sizeof(struct btrfs_remap));
>> +
>> + btrfs_release_path(path);
>> +
>> + /* Shorten backref entry. */
>> +
>> + key.objectid = new_addr;
>> + key.type = BTRFS_REMAP_BACKREF_KEY;
>> + key.offset = remap_length;
>> +
>> + ret = btrfs_search_slot(trans, fs_info->remap_root,
>> + &key, path, -1, 1);
>> + if (ret) {
>> + if (ret == 1) {
>> + btrfs_release_path(path);
>> + ret = -ENOENT;
>> + }
>> + goto end;
>> + }
>> +
>> + new_key.objectid = new_addr;
>> + new_key.type = BTRFS_REMAP_BACKREF_KEY;
>> + new_key.offset = bytenr - remap_start;
>> +
>> + btrfs_set_item_key_safe(trans, path, &new_key);
>> + btrfs_mark_buffer_dirty(trans, path->nodes[0]);
>> +
>> + btrfs_release_path(path);
>> +
>> + /* Add second backref entry. */
>> +
>> + new_key.objectid = new_addr + end - remap_start;
>> + new_key.type = BTRFS_REMAP_BACKREF_KEY;
>> + new_key.offset = remap_start + remap_length - end;
>> +
>> + ret = btrfs_insert_empty_item(trans, fs_info->remap_root,
>> + path, &new_key,
>> + sizeof(struct btrfs_remap));
>> + if (ret)
>> + goto end;
>> +
>> + btrfs_set_stack_remap_address(&remap, end);
>> +
>> + write_extent_buffer(path->nodes[0], &remap,
>> + btrfs_item_ptr_offset(path->nodes[0], path->slots[0]),
>> + sizeof(struct btrfs_remap));
>> +
>> + btrfs_release_path(path);
>> +
>> + adjust_block_group_remap_bytes(trans, dest_bg,
>> + -num_bytes);
>> + } else {
>> + /* Add second identity remap entry. */
>> +
>> + ret = btrfs_insert_empty_item(trans, fs_info->remap_root,
>> + path, &new_key, 0);
>> + if (ret)
>> + goto end;
>> +
>> + btrfs_release_path(path);
>> +
>> + ret = adjust_identity_remap_count(trans, path, bg, 1);
>> + if (ret)
>> + goto end;
>> + }
>> + } else {
>> + /* Remove end. */
>> +
>> + new_key.objectid = remap_start;
>> + new_key.type = key.type;
>> + new_key.offset = bytenr - remap_start;
>> +
>> + btrfs_set_item_key_safe(trans, path, &new_key);
>> + btrfs_mark_buffer_dirty(trans, leaf);
>> +
>> + btrfs_release_path(path);
>> +
>> + overlap_length = remap_start + remap_length - bytenr;
>> +
>> + if (!is_identity_remap) {
>> + /* Shorten backref entry. */
>> +
>> + key.objectid = new_addr;
>> + key.type = BTRFS_REMAP_BACKREF_KEY;
>> + key.offset = remap_length;
>> +
>> + ret = btrfs_search_slot(trans, fs_info->remap_root,
>> + &key, path, -1, 1);
>> + if (ret) {
>> + if (ret == 1) {
>> + btrfs_release_path(path);
>> + ret = -ENOENT;
>> + }
>> + goto end;
>> + }
>> +
>> + new_key.objectid = new_addr;
>> + new_key.type = BTRFS_REMAP_BACKREF_KEY;
>> + new_key.offset = bytenr - remap_start;
>> +
>> + btrfs_set_item_key_safe(trans, path, &new_key);
>> + btrfs_mark_buffer_dirty(trans, path->nodes[0]);
>> +
>> + btrfs_release_path(path);
>> +
>> + adjust_block_group_remap_bytes(trans, dest_bg,
>> + bytenr - remap_start - remap_length);
>> + }
>> + }
>> +
>> + if (!is_identity_remap) {
>> + ret = btrfs_add_to_free_space_tree(trans,
>> + bytenr - remap_start + new_addr,
>> + overlap_length);
>> + if (ret)
>> + goto end;
>> + }
>> +
>> + ret = overlap_length;
>> +
>> +end:
>> + if (dest_bg)
>> + btrfs_put_block_group(dest_bg);
>> +
>> + return ret;
>> +}
>> +
>> +/*
>> + * Returns 1 if remove_range_from_remap_tree() has been called successfully,
>> + * 0 if block group wasn't remapped, and a negative number on error.
>> + */
>> +int btrfs_remove_extent_from_remap_tree(struct btrfs_trans_handle *trans,
>> + struct btrfs_path *path,
>> + u64 bytenr, u64 num_bytes)
>> +{
>> + struct btrfs_fs_info *fs_info = trans->fs_info;
>> + struct btrfs_key key, found_key;
>> + struct extent_buffer *leaf;
>> + struct btrfs_block_group *bg;
>> + int ret, length;
>> +
>> + if (!(btrfs_super_incompat_flags(fs_info->super_copy) &
>> + BTRFS_FEATURE_INCOMPAT_REMAP_TREE))
>> + return 0;
>> +
>> + bg = btrfs_lookup_block_group(fs_info, bytenr);
>> + if (!bg)
>> + return 0;
>> +
>> + mutex_lock(&fs_info->remap_mutex);
>> +
>> + if (!(bg->flags & BTRFS_BLOCK_GROUP_REMAPPED)) {
>> + mutex_unlock(&fs_info->remap_mutex);
>> + btrfs_put_block_group(bg);
>> + return 0;
>> + }
>> +
>> + do {
>> + key.objectid = bytenr;
>> + key.type = (u8)-1;
>> + key.offset = (u64)-1;
>> +
>> + ret = btrfs_search_slot(trans, fs_info->remap_root, &key, path,
>> + -1, 1);
>> + if (ret < 0)
>> + goto end;
>> +
>> + leaf = path->nodes[0];
>> +
>> + if (path->slots[0] == 0) {
>> + ret = -ENOENT;
>> + goto end;
>> + }
>> +
>> + path->slots[0]--;
>> +
>> + btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
>> +
>> + if (found_key.type != BTRFS_IDENTITY_REMAP_KEY &&
>> + found_key.type != BTRFS_REMAP_KEY) {
>> + ret = -ENOENT;
>> + goto end;
>> + }
>> +
>> + if (bytenr < found_key.objectid ||
>> + bytenr >= found_key.objectid + found_key.offset) {
>> + ret = -ENOENT;
>> + goto end;
>> + }
>> +
>> + length = remove_range_from_remap_tree(trans, path, bg, bytenr,
>> + num_bytes);
>> + if (length < 0) {
>> + ret = length;
>> + goto end;
>> + }
>> +
>> + bytenr += length;
>> + num_bytes -= length;
>> + } while (num_bytes > 0);
>> +
>> + ret = 1;
>> +
>> +end:
>> + mutex_unlock(&fs_info->remap_mutex);
>> +
>> + btrfs_put_block_group(bg);
>> + btrfs_release_path(path);
>> + return ret;
>> +}
>> diff --git a/fs/btrfs/relocation.h b/fs/btrfs/relocation.h
>> index a653c42a25a3..4b0bb34b3fc1 100644
>> --- a/fs/btrfs/relocation.h
>> +++ b/fs/btrfs/relocation.h
>> @@ -33,5 +33,8 @@ bool btrfs_should_ignore_reloc_root(const struct btrfs_root *root);
>> u64 btrfs_get_reloc_bg_bytenr(const struct btrfs_fs_info *fs_info);
>> int btrfs_translate_remap(struct btrfs_fs_info *fs_info, u64 *logical,
>> u64 *length, bool nolock);
>> +int btrfs_remove_extent_from_remap_tree(struct btrfs_trans_handle *trans,
>> + struct btrfs_path *path,
>> + u64 bytenr, u64 num_bytes);
>>
>> #endif
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index a2c49cb8bfc6..fc2b3e7de32e 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -2941,8 +2941,8 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>> return ret;
>> }
>>
>> -static noinline int btrfs_update_device(struct btrfs_trans_handle *trans,
>> - struct btrfs_device *device)
>> +int btrfs_update_device(struct btrfs_trans_handle *trans,
>> + struct btrfs_device *device)
>> {
>> int ret;
>> struct btrfs_path *path;
>> @@ -3246,25 +3246,13 @@ static int remove_chunk_item(struct btrfs_trans_handle *trans,
>> return btrfs_free_chunk(trans, chunk_offset);
>> }
>>
>> -int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset)
>> +int btrfs_remove_dev_extents(struct btrfs_trans_handle *trans,
>> + struct btrfs_chunk_map *map)
>> {
>> struct btrfs_fs_info *fs_info = trans->fs_info;
>> - struct btrfs_chunk_map *map;
>> + struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
>> u64 dev_extent_len = 0;
>> int i, ret = 0;
>> - struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
>> -
>> - map = btrfs_get_chunk_map(fs_info, chunk_offset, 1);
>> - if (IS_ERR(map)) {
>> - /*
>> - * This is a logic error, but we don't want to just rely on the
>> - * user having built with ASSERT enabled, so if ASSERT doesn't
>> - * do anything we still error out.
>> - */
>> - DEBUG_WARN("errr %ld reading chunk map at offset %llu",
>> - PTR_ERR(map), chunk_offset);
>> - return PTR_ERR(map);
>> - }
>>
>> /*
>> * First delete the device extent items from the devices btree.
>> @@ -3285,7 +3273,7 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset)
>> if (ret) {
>> mutex_unlock(&fs_devices->device_list_mutex);
>> btrfs_abort_transaction(trans, ret);
>> - goto out;
>> + return ret;
>> }
>>
>> if (device->bytes_used > 0) {
>> @@ -3305,6 +3293,30 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset)
>> }
>> mutex_unlock(&fs_devices->device_list_mutex);
>>
>> + return 0;
>> +}
>> +
>> +int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset)
>> +{
>> + struct btrfs_fs_info *fs_info = trans->fs_info;
>> + struct btrfs_chunk_map *map;
>> + int ret;
>> +
>> + map = btrfs_get_chunk_map(fs_info, chunk_offset, 1);
>> + if (IS_ERR(map)) {
>> + /*
>> + * This is a logic error, but we don't want to just rely on the
>> + * user having built with ASSERT enabled, so if ASSERT doesn't
>> + * do anything we still error out.
>> + */
>> + ASSERT(0);
>> + return PTR_ERR(map);
>> + }
>> +
>> + ret = btrfs_remove_dev_extents(trans, map);
>> + if (ret)
>> + goto out;
>> +
>> /*
>> * We acquire fs_info->chunk_mutex for 2 reasons:
>> *
>> @@ -5448,7 +5460,7 @@ static void chunk_map_device_set_bits(struct btrfs_chunk_map *map, unsigned int
>> }
>> }
>>
>> -static void chunk_map_device_clear_bits(struct btrfs_chunk_map *map, unsigned int bits)
>> +void btrfs_chunk_map_device_clear_bits(struct btrfs_chunk_map *map, unsigned int bits)
>> {
>> for (int i = 0; i < map->num_stripes; i++) {
>> struct btrfs_io_stripe *stripe = &map->stripes[i];
>> @@ -5465,7 +5477,7 @@ void btrfs_remove_chunk_map(struct btrfs_fs_info *fs_info, struct btrfs_chunk_ma
>> write_lock(&fs_info->mapping_tree_lock);
>> rb_erase_cached(&map->rb_node, &fs_info->mapping_tree);
>> RB_CLEAR_NODE(&map->rb_node);
>> - chunk_map_device_clear_bits(map, CHUNK_ALLOCATED);
>> + btrfs_chunk_map_device_clear_bits(map, CHUNK_ALLOCATED);
>> write_unlock(&fs_info->mapping_tree_lock);
>>
>> /* Once for the tree reference. */
>> @@ -5501,7 +5513,7 @@ int btrfs_add_chunk_map(struct btrfs_fs_info *fs_info, struct btrfs_chunk_map *m
>> return -EEXIST;
>> }
>> chunk_map_device_set_bits(map, CHUNK_ALLOCATED);
>> - chunk_map_device_clear_bits(map, CHUNK_TRIMMED);
>> + btrfs_chunk_map_device_clear_bits(map, CHUNK_TRIMMED);
>> write_unlock(&fs_info->mapping_tree_lock);
>>
>> return 0;
>> @@ -5866,7 +5878,7 @@ void btrfs_mapping_tree_free(struct btrfs_fs_info *fs_info)
>> map = rb_entry(node, struct btrfs_chunk_map, rb_node);
>> rb_erase_cached(&map->rb_node, &fs_info->mapping_tree);
>> RB_CLEAR_NODE(&map->rb_node);
>> - chunk_map_device_clear_bits(map, CHUNK_ALLOCATED);
>> + btrfs_chunk_map_device_clear_bits(map, CHUNK_ALLOCATED);
>> /* Once for the tree ref. */
>> btrfs_free_chunk_map(map);
>> cond_resched_rwlock_write(&fs_info->mapping_tree_lock);
>> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
>> index 430be12fd5e7..64b34710b68b 100644
>> --- a/fs/btrfs/volumes.h
>> +++ b/fs/btrfs/volumes.h
>> @@ -789,6 +789,8 @@ u64 btrfs_calc_stripe_length(const struct btrfs_chunk_map *map);
>> int btrfs_nr_parity_stripes(u64 type);
>> int btrfs_chunk_alloc_add_chunk_item(struct btrfs_trans_handle *trans,
>> struct btrfs_block_group *bg);
>> +int btrfs_remove_dev_extents(struct btrfs_trans_handle *trans,
>> + struct btrfs_chunk_map *map);
>> int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset);
>>
>> #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
>> @@ -900,6 +902,10 @@ bool btrfs_repair_one_zone(struct btrfs_fs_info *fs_info, u64 logical);
>>
>> bool btrfs_pinned_by_swapfile(struct btrfs_fs_info *fs_info, void *ptr);
>> const u8 *btrfs_sb_fsid_ptr(const struct btrfs_super_block *sb);
>> +int btrfs_update_device(struct btrfs_trans_handle *trans,
>> + struct btrfs_device *device);
>> +void btrfs_chunk_map_device_clear_bits(struct btrfs_chunk_map *map,
>> + unsigned int bits);
>>
>> #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
>> struct btrfs_io_context *alloc_btrfs_io_context(struct btrfs_fs_info *fs_info,
>> --
>> 2.49.1
>>
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v2 15/16] btrfs: add fully_remapped_bgs list
2025-08-16 0:56 ` Boris Burkov
@ 2025-08-27 18:51 ` Mark Harmstone
0 siblings, 0 replies; 43+ messages in thread
From: Mark Harmstone @ 2025-08-27 18:51 UTC (permalink / raw)
To: Boris Burkov; +Cc: linux-btrfs
On 16/08/2025 1.56 am, Boris Burkov wrote:
> On Wed, Aug 13, 2025 at 03:34:57PM +0100, Mark Harmstone wrote:
>> Add a fully_remapped_bgs list to struct btrfs_transaction, which holds
>> block groups which have just had their last identity remap removed.
>>
>> In btrfs_finish_extent_commit() we can then discard their full dev
>> extents, as we're also setting their num_stripes to 0. Finally if the BG
>> is now empty, i.e. there's neither identity remaps nor normal remaps,
>> add it to the unused_bgs list to be taken care of there.
>>
>> Signed-off-by: Mark Harmstone <mark@harmstone.com>
>> ---
>> fs/btrfs/block-group.c | 26 ++++++++++++++++++++++++++
>> fs/btrfs/block-group.h | 2 ++
>> fs/btrfs/extent-tree.c | 37 ++++++++++++++++++++++++++++++++++++-
>> fs/btrfs/relocation.c | 2 ++
>> fs/btrfs/transaction.c | 1 +
>> fs/btrfs/transaction.h | 1 +
>> 6 files changed, 68 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
>> index 7a0524138235..7f8707dfd62c 100644
>> --- a/fs/btrfs/block-group.c
>> +++ b/fs/btrfs/block-group.c
>> @@ -1803,6 +1803,14 @@ void btrfs_mark_bg_unused(struct btrfs_block_group *bg)
>> struct btrfs_fs_info *fs_info = bg->fs_info;
>>
>> spin_lock(&fs_info->unused_bgs_lock);
>> +
>> + /* Leave fully remapped block groups on the fully_remapped_bgs list. */
>> + if (bg->flags & BTRFS_BLOCK_GROUP_REMAPPED &&
>> + bg->identity_remap_count == 0) {
>> + spin_unlock(&fs_info->unused_bgs_lock);
>> + return;
>> + }
>> +
>> if (list_empty(&bg->bg_list)) {
>> btrfs_get_block_group(bg);
>> trace_btrfs_add_unused_block_group(bg);
>> @@ -4792,3 +4800,21 @@ bool btrfs_block_group_should_use_size_class(const struct btrfs_block_group *bg)
>> return false;
>> return true;
>> }
>> +
>> +void btrfs_mark_bg_fully_remapped(struct btrfs_block_group *bg,
>> + struct btrfs_trans_handle *trans)
>> +{
>> + struct btrfs_fs_info *fs_info = trans->fs_info;
>> +
>> + spin_lock(&fs_info->unused_bgs_lock);
>> +
>> + if (!list_empty(&bg->bg_list))
>> + list_del(&bg->bg_list);
>> + else
>> + btrfs_get_block_group(bg);
>> +
>> + list_add_tail(&bg->bg_list, &trans->transaction->fully_remapped_bgs);
>> +
>> + spin_unlock(&fs_info->unused_bgs_lock);
>> +
>> +}
>
> Why does the fully remapped bg list takeover from other lists rather
> than use the link function?
Because it's possible in the same transaction for a block group to become
both fully remapped (no identity remaps left) and unused (no addresses
nominally in this address range). In this case it goes on the fully
remapped list for its chunk stripes and dev extents to be removed, then
gets moved to the unused list for its block group item to be removed.
> What protection is in place to ensure that we never mark it fully
> remapped while it is on the new_bgs list (as with the unused list)?
Relocating a block group starts a transaction, so it won't be on the
new_bgs list anymore. btrfs_relocate_chunk() provides the BG offset
to btrfs_relocate_block_group(), which in turn starts multiple
transactions, ensuring that that particular BG is no longer on
new_bgs. Or am I misunderstanding?
> I suspect such a block group won't ever be reclaimed even with explicit
> balances, but it is important to check and be sure.
I'll need to double-check that a remapped BG can't be placed on the
reclaimed_bgs list. In effect it's already been "reclaimed".
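The list transitions being discussed can be sketched as a little state machine (userspace C, purely illustrative; the states stand in for the kernel's bg_list membership):

```c
#include <stdint.h>

enum bg_list_state { BG_NO_LIST, BG_FULLY_REMAPPED, BG_UNUSED };

/* Toy view of a block group, just for list-transition purposes. */
struct bg_state {
	uint64_t used;
	uint64_t remap_bytes;
	enum bg_list_state list;
};

/* Model of btrfs_mark_bg_fully_remapped(): takes over any prior list. */
static void mark_fully_remapped(struct bg_state *bg)
{
	bg->list = BG_FULLY_REMAPPED;
}

/*
 * Model of the btrfs_finish_extent_commit() pass over
 * fully_remapped_bgs: a BG that is also empty moves on to the unused
 * list so its block group item gets deleted there; otherwise it simply
 * drops off, keeping only its remap entries.
 */
static void finish_commit(struct bg_state *bg)
{
	if (bg->list != BG_FULLY_REMAPPED)
		return;
	if (bg->used == 0 && bg->remap_bytes == 0)
		bg->list = BG_UNUSED;
	else
		bg->list = BG_NO_LIST;
}
```

This is the "fully remapped and unused in the same transaction" case: mark_fully_remapped() followed by finish_commit() lands the empty BG on the unused list in one pass.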
> If this *is* strictly necessary, I would like to see an extension to
> btrfs_link_bg_list that can handle the list_move_tail variant.
>
> Another option is to generalize this one together with mark_unused()
> and just check the NEW flag here.
>
>> diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
>> index 0433b0127ed8..025ea2c6f8a8 100644
>> --- a/fs/btrfs/block-group.h
>> +++ b/fs/btrfs/block-group.h
>> @@ -408,5 +408,7 @@ int btrfs_use_block_group_size_class(struct btrfs_block_group *bg,
>> enum btrfs_block_group_size_class size_class,
>> bool force_wrong_size_class);
>> bool btrfs_block_group_should_use_size_class(const struct btrfs_block_group *bg);
>> +void btrfs_mark_bg_fully_remapped(struct btrfs_block_group *bg,
>> + struct btrfs_trans_handle *trans);
>>
>> #endif /* BTRFS_BLOCK_GROUP_H */
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index b02e99b41553..157a032df128 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -2853,7 +2853,7 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans)
>> {
>> struct btrfs_fs_info *fs_info = trans->fs_info;
>> struct btrfs_block_group *block_group, *tmp;
>> - struct list_head *deleted_bgs;
>> + struct list_head *deleted_bgs, *fully_remapped_bgs;
>> struct extent_io_tree *unpin = &trans->transaction->pinned_extents;
>> struct extent_state *cached_state = NULL;
>> u64 start;
>> @@ -2951,6 +2951,41 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans)
>> }
>> }
>>
>
> 1. This will block the next transaction waiting on TRANS_STATE_COMPLETED
>
> 2. This is not compatible with the spirit and purpose of async discard,
> which is our default and best discard mode.
>
> 3. This doesn't check discard mode at all, it just defaults to
> DISCARD_SYNC style behavior, so it doesn't respect NODISCARD either.
I think that, at the very least, this should be shunted off to a work
queue, as I've said to you off-list.
So the current logic is that no discards get done for removals within a remapped
BG, nor by the relocation process, on the grounds that the whole thing is going
away imminently. Once the last identity remap has gone, we do a discard for the
whole range. That means only one discard per stripe, and prevents any problems
with search_commit_root.
Leaving aside the blocking of the next transaction, is this the right way to
do things, or should it be done somehow else?
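For what it's worth, the discard-batching behaviour described above can be counted out in a trivial model (userspace C, hypothetical names): per-extent discards are suppressed inside a remapped BG, and one whole-range discard is issued when the last identity remap goes:

```c
/* Toy counters comparing the two discard strategies discussed above. */
struct discard_stats {
	unsigned per_extent_discards;
	unsigned whole_bg_discards;
};

/*
 * Strategy in this series: skip the discard for every extent freed
 * inside a remapped block group; it's deferred until the block group
 * is fully remapped.
 */
static void free_extent(struct discard_stats *s, int bg_remapped)
{
	if (!bg_remapped)
		s->per_extent_discards++;
}

/* One discard covering the full dev extents, issued exactly once. */
static void last_identity_remap_gone(struct discard_stats *s)
{
	s->whole_bg_discards++;
}
```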
>> + fully_remapped_bgs = &trans->transaction->fully_remapped_bgs;
>> + list_for_each_entry_safe(block_group, tmp, fully_remapped_bgs, bg_list) {
>> + struct btrfs_chunk_map *map;
>> +
>> + if (!TRANS_ABORTED(trans))
>> + ret = btrfs_discard_extent(fs_info, block_group->start,
>> + block_group->length, NULL,
>> + false);
>> +
>> + map = btrfs_get_chunk_map(fs_info, block_group->start, 1);
>> + if (IS_ERR(map))
>> + return PTR_ERR(map);
>> +
>> + /*
>> + * Set num_stripes to 0, so that btrfs_remove_dev_extents()
>> + * won't run a second time.
>> + */
>> + map->num_stripes = 0;
>> +
>> + btrfs_free_chunk_map(map);
>> +
>> + if (block_group->used == 0 && block_group->remap_bytes == 0) {
>> + spin_lock(&fs_info->unused_bgs_lock);
>> + list_move_tail(&block_group->bg_list,
>> + &fs_info->unused_bgs);
>> + spin_unlock(&fs_info->unused_bgs_lock);
>
> Please use the helpers, it's important for ensuring correct ref counting
> in the long run. I also think that the previous patch had some
> discussion for more standardized integration with unused_bgs so I sort
> of hope this code goes away entirely.
>
>> + } else {
>> + spin_lock(&fs_info->unused_bgs_lock);
>> + list_del_init(&block_group->bg_list);
>> + spin_unlock(&fs_info->unused_bgs_lock);
>> +
>> + btrfs_put_block_group(block_group);
>> + }
>> + }
>> +
>> return unpin_error;
>> }
>>
>> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
>> index 84ff59866e96..0745a3d1c867 100644
>> --- a/fs/btrfs/relocation.c
>> +++ b/fs/btrfs/relocation.c
>> @@ -4819,6 +4819,8 @@ static int last_identity_remap_gone(struct btrfs_trans_handle *trans,
>> if (ret)
>> return ret;
>>
>> + btrfs_mark_bg_fully_remapped(bg, trans);
>> +
>> return 0;
>> }
>>
>> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
>> index 64b9c427af6a..7c308d33e767 100644
>> --- a/fs/btrfs/transaction.c
>> +++ b/fs/btrfs/transaction.c
>> @@ -381,6 +381,7 @@ static noinline int join_transaction(struct btrfs_fs_info *fs_info,
>> mutex_init(&cur_trans->cache_write_mutex);
>> spin_lock_init(&cur_trans->dirty_bgs_lock);
>> INIT_LIST_HEAD(&cur_trans->deleted_bgs);
>> + INIT_LIST_HEAD(&cur_trans->fully_remapped_bgs);
>> spin_lock_init(&cur_trans->dropped_roots_lock);
>> list_add_tail(&cur_trans->list, &fs_info->trans_list);
>> btrfs_extent_io_tree_init(fs_info, &cur_trans->dirty_pages,
>> diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
>> index 9f7c777af635..b362915288b5 100644
>> --- a/fs/btrfs/transaction.h
>> +++ b/fs/btrfs/transaction.h
>> @@ -109,6 +109,7 @@ struct btrfs_transaction {
>> spinlock_t dirty_bgs_lock;
>> /* Protected by spin lock fs_info->unused_bgs_lock. */
>> struct list_head deleted_bgs;
>> + struct list_head fully_remapped_bgs;
>> spinlock_t dropped_roots_lock;
>> struct btrfs_delayed_ref_root delayed_refs;
>> struct btrfs_fs_info *fs_info;
>> --
>> 2.49.1
>>
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v2 16/16] btrfs: allow balancing remap tree
2025-08-16 1:02 ` Boris Burkov
@ 2025-09-02 14:58 ` Mark Harmstone
2025-09-02 15:21 ` Mark Harmstone
1 sibling, 0 replies; 43+ messages in thread
From: Mark Harmstone @ 2025-09-02 14:58 UTC (permalink / raw)
To: Boris Burkov; +Cc: linux-btrfs
On 16/08/2025 2.02 am, Boris Burkov wrote:
> On Wed, Aug 13, 2025 at 03:34:58PM +0100, Mark Harmstone wrote:
>> Balancing the REMAP chunk, i.e. the chunk in which the remap tree lives,
>> is a special case.
>>
>> We can't use the remap tree itself for this, as then we'd have no way to
>> bootstrap it on mount. And we can't use the pre-remap tree code for this
>> as it relies on walking the extent tree, and we're not creating backrefs
>> for REMAP chunks.
>>
>> So instead, if a balance would relocate any REMAP block groups, mark
>> those block groups as readonly and COW every leaf of the remap tree.
>>
>> There are more sophisticated ways of doing this, such as only COWing nodes
>> within a block group that's to be relocated, but they're fiddly and with
>> lots of edge cases. Plus it's not anticipated that a) the number of
>> REMAP chunks is going to be particularly large, or b) that users will
>> want to only relocate some of these chunks - the main use case here is
>> to unbreak RAID conversion and device removal.
>>
>> Signed-off-by: Mark Harmstone <mark@harmstone.com>
>> ---
>> fs/btrfs/volumes.c | 161 +++++++++++++++++++++++++++++++++++++++++++--
>> 1 file changed, 157 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index e13f16a7a904..dc535ed90ae0 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -4011,8 +4011,11 @@ static bool should_balance_chunk(struct extent_buffer *leaf, struct btrfs_chunk
>> struct btrfs_balance_args *bargs = NULL;
>> u64 chunk_type = btrfs_chunk_type(leaf, chunk);
>>
>> - if (chunk_type & BTRFS_BLOCK_GROUP_REMAP)
>> - return false;
>> + /* treat REMAP chunks as METADATA */
>> + if (chunk_type & BTRFS_BLOCK_GROUP_REMAP) {
>> + chunk_type &= ~BTRFS_BLOCK_GROUP_REMAP;
>> + chunk_type |= BTRFS_BLOCK_GROUP_METADATA;
>
> why not honor the REMAP chunk type where appropriate?
>
>> + }
>>
>> /* type filter */
>> if (!((chunk_type & BTRFS_BLOCK_GROUP_TYPE_MASK) &
>> @@ -4095,6 +4098,113 @@ static bool should_balance_chunk(struct extent_buffer *leaf, struct btrfs_chunk
>> return true;
>> }
>>
>> +struct remap_chunk_info {
>> + struct list_head list;
>> + u64 offset;
>> + struct btrfs_block_group *bg;
>> + bool made_ro;
>> +};
>> +
>> +static int cow_remap_tree(struct btrfs_trans_handle *trans,
>> + struct btrfs_path *path)
>> +{
>> + struct btrfs_fs_info *fs_info = trans->fs_info;
>> + struct btrfs_key key = { 0 };
>> + int ret;
>> +
>> + ret = btrfs_search_slot(trans, fs_info->remap_root, &key, path, 0, 1);
>> + if (ret < 0)
>> + return ret;
>> +
>> + while (true) {
>> + ret = btrfs_next_leaf(fs_info->remap_root, path);
>> + if (ret < 0) {
>> + return ret;
>> + } else if (ret > 0) {
>> + ret = 0;
>> + break;
>> + }
>> +
>> + btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
>> +
>> + btrfs_release_path(path);
>> +
>> + ret = btrfs_search_slot(trans, fs_info->remap_root, &key, path,
>> + 0, 1);
>> + if (ret < 0)
>> + break;
>> + }
>> +
>> + return ret;
>> +}
>> +
>> +static int balance_remap_chunks(struct btrfs_fs_info *fs_info,
>> + struct btrfs_path *path,
>> + struct list_head *chunks)
>> +{
>> + struct remap_chunk_info *rci, *tmp;
>> + struct btrfs_trans_handle *trans;
>> + int ret;
>> +
>> + list_for_each_entry_safe(rci, tmp, chunks, list) {
>> + rci->bg = btrfs_lookup_block_group(fs_info, rci->offset);
>> + if (!rci->bg) {
>> + list_del(&rci->list);
>> + kfree(rci);
>> + continue;
>> + }
>> +
>> + ret = btrfs_inc_block_group_ro(rci->bg, false);
>
> Just thinking out loud, what happens if we concurrently attempt a
> balance that would need to use the remap tree? Is something structurally
> blocking that at a higher level? Or will it fail? How will that failure
> be handled? Does the answer hold for btrfs-internal background reclaim
> rather than explicit balancing?
We take fs_info->remap_mutex whenever we do anything involving the remap
tree.
As I think I've said in another message, this is safe but perhaps not the
best choice long-term. I suspect something like a per-BG rwsem is the way
to go eventually, as a later optimization.
>> + if (ret)
>> + goto end;
>> +
>> + rci->made_ro = true;
>> + }
>> +
>> + if (list_empty(chunks))
>> + return 0;
>> +
>> + trans = btrfs_start_transaction(fs_info->remap_root, 0);
>> + if (IS_ERR(trans)) {
>> + ret = PTR_ERR(trans);
>> + goto end;
>> + }
>> +
>> + mutex_lock(&fs_info->remap_mutex);
>> +
>> + ret = cow_remap_tree(trans, path);
>> +
>> + btrfs_release_path(path);
>> +
>> + mutex_unlock(&fs_info->remap_mutex);
>> +
>> + btrfs_commit_transaction(trans);
>> +
>> +end:
>> + while (!list_empty(chunks)) {
>> + bool unused;
>> +
>> + rci = list_first_entry(chunks, struct remap_chunk_info, list);
>> +
>> + spin_lock(&rci->bg->lock);
>> + unused = !btrfs_is_block_group_used(rci->bg);
>> + spin_unlock(&rci->bg->lock);
>> +
>> + if (unused)
>> + btrfs_mark_bg_unused(rci->bg);
>> +
>> + if (rci->made_ro)
>> + btrfs_dec_block_group_ro(rci->bg);
>> +
>> + btrfs_put_block_group(rci->bg);
>> +
>> + list_del(&rci->list);
>> + kfree(rci);
>> + }
>> +
>> + return ret;
>> +}
>> +
>> static int __btrfs_balance(struct btrfs_fs_info *fs_info)
>> {
>> struct btrfs_balance_control *bctl = fs_info->balance_ctl;
>> @@ -4117,6 +4227,9 @@ static int __btrfs_balance(struct btrfs_fs_info *fs_info)
>> u32 count_meta = 0;
>> u32 count_sys = 0;
>> int chunk_reserved = 0;
>> + struct remap_chunk_info *rci;
>> + unsigned int num_remap_chunks = 0;
>> + LIST_HEAD(remap_chunks);
>>
>> path = btrfs_alloc_path();
>> if (!path) {
>> @@ -4215,7 +4328,8 @@ static int __btrfs_balance(struct btrfs_fs_info *fs_info)
>> count_data++;
>> else if (chunk_type & BTRFS_BLOCK_GROUP_SYSTEM)
>> count_sys++;
>> - else if (chunk_type & BTRFS_BLOCK_GROUP_METADATA)
>> + else if (chunk_type & (BTRFS_BLOCK_GROUP_METADATA |
>> + BTRFS_BLOCK_GROUP_REMAP))
>> count_meta++;
>>
>> goto loop;
>> @@ -4235,6 +4349,30 @@ static int __btrfs_balance(struct btrfs_fs_info *fs_info)
>> goto loop;
>> }
>>
>> + /*
>> + * Balancing REMAP chunks takes place separately - add the
>> + * details to a list so it can be processed later.
>> + */
>> + if (chunk_type & BTRFS_BLOCK_GROUP_REMAP) {
>> + mutex_unlock(&fs_info->reclaim_bgs_lock);
>> +
>> + rci = kmalloc(sizeof(struct remap_chunk_info),
>> + GFP_NOFS);
>> + if (!rci) {
>> + ret = -ENOMEM;
>> + goto error;
>> + }
>> +
>> + rci->offset = found_key.offset;
>> + rci->bg = NULL;
>> + rci->made_ro = false;
>> + list_add_tail(&rci->list, &remap_chunks);
>> +
>> + num_remap_chunks++;
>> +
>> + goto loop;
>> + }
>> +
>> if (!chunk_reserved) {
>> /*
>> * We may be relocating the only data chunk we have,
>> @@ -4274,11 +4412,26 @@ static int __btrfs_balance(struct btrfs_fs_info *fs_info)
>> key.offset = found_key.offset - 1;
>> }
>>
>> + btrfs_release_path(path);
>> +
>> if (counting) {
>> - btrfs_release_path(path);
>> counting = false;
>> goto again;
>> }
>> +
>> + if (!list_empty(&remap_chunks)) {
>> + ret = balance_remap_chunks(fs_info, path, &remap_chunks);
>> + if (ret == -ENOSPC)
>> + enospc_errors++;
>> +
>> + if (!ret) {
>> + btrfs_delete_unused_bgs(fs_info);
>
> Why is this necessary here?
>
>> +
>> + spin_lock(&fs_info->balance_lock);
>> + bctl->stat.completed += num_remap_chunks;
>> + spin_unlock(&fs_info->balance_lock);
>> + }
>> + }
>> error:
>> btrfs_free_path(path);
>> if (enospc_errors) {
>> --
>> 2.49.1
>>
* Re: [PATCH v2 16/16] btrfs: allow balancing remap tree
2025-08-16 1:02 ` Boris Burkov
2025-09-02 14:58 ` Mark Harmstone
@ 2025-09-02 15:21 ` Mark Harmstone
1 sibling, 0 replies; 43+ messages in thread
From: Mark Harmstone @ 2025-09-02 15:21 UTC (permalink / raw)
To: Boris Burkov; +Cc: linux-btrfs
On 16/08/2025 2.02 am, Boris Burkov wrote:
> On Wed, Aug 13, 2025 at 03:34:58PM +0100, Mark Harmstone wrote:
>> Balancing the REMAP chunk, i.e. the chunk in which the remap tree lives,
>> is a special case.
>>
>> We can't use the remap tree itself for this, as then we'd have no way to
>> bootstrap it on mount. And we can't use the pre-remap tree code for this
>> as it relies on walking the extent tree, and we're not creating backrefs
>> for REMAP chunks.
>>
>> So instead, if a balance would relocate any REMAP block groups, mark
>> those block groups as readonly and COW every leaf of the remap tree.
>>
>> There are more sophisticated ways of doing this, such as only COWing nodes
>> within a block group that's to be relocated, but they're fiddly and with
>> lots of edge cases. Plus it's not anticipated that a) the number of
>> REMAP chunks is going to be particularly large, or b) that users will
>> want to only relocate some of these chunks - the main use case here is
>> to unbreak RAID conversion and device removal.
>>
>> Signed-off-by: Mark Harmstone <mark@harmstone.com>
>> ---
>> fs/btrfs/volumes.c | 161 +++++++++++++++++++++++++++++++++++++++++++--
>> 1 file changed, 157 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index e13f16a7a904..dc535ed90ae0 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -4011,8 +4011,11 @@ static bool should_balance_chunk(struct extent_buffer *leaf, struct btrfs_chunk
>> struct btrfs_balance_args *bargs = NULL;
>> u64 chunk_type = btrfs_chunk_type(leaf, chunk);
>>
>> - if (chunk_type & BTRFS_BLOCK_GROUP_REMAP)
>> - return false;
>> + /* treat REMAP chunks as METADATA */
>> + if (chunk_type & BTRFS_BLOCK_GROUP_REMAP) {
>> + chunk_type &= ~BTRFS_BLOCK_GROUP_REMAP;
>> + chunk_type |= BTRFS_BLOCK_GROUP_METADATA;
>
> why not honor the REMAP chunk type where appropriate?
This would imply adding a new flag to btrfs balance start, and a new
version of the ioctl, and I'm not sure it's worth it. Happy to argue
the toss though.
Doing btrfs balance start -m already implies -s, so it's not much of
a stretch for it to cover REMAP as well.
Possibly it would make more sense for REMAP to be SYSTEM for balancing
purposes rather than METADATA.
>> + }
>>
>> /* type filter */
>> if (!((chunk_type & BTRFS_BLOCK_GROUP_TYPE_MASK) &
>> @@ -4095,6 +4098,113 @@ static bool should_balance_chunk(struct extent_buffer *leaf, struct btrfs_chunk
>> return true;
>> }
>>
>> +struct remap_chunk_info {
>> + struct list_head list;
>> + u64 offset;
>> + struct btrfs_block_group *bg;
>> + bool made_ro;
>> +};
>> +
>> +static int cow_remap_tree(struct btrfs_trans_handle *trans,
>> + struct btrfs_path *path)
>> +{
>> + struct btrfs_fs_info *fs_info = trans->fs_info;
>> + struct btrfs_key key = { 0 };
>> + int ret;
>> +
>> + ret = btrfs_search_slot(trans, fs_info->remap_root, &key, path, 0, 1);
>> + if (ret < 0)
>> + return ret;
>> +
>> + while (true) {
>> + ret = btrfs_next_leaf(fs_info->remap_root, path);
>> + if (ret < 0) {
>> + return ret;
>> + } else if (ret > 0) {
>> + ret = 0;
>> + break;
>> + }
>> +
>> + btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
>> +
>> + btrfs_release_path(path);
>> +
>> + ret = btrfs_search_slot(trans, fs_info->remap_root, &key, path,
>> + 0, 1);
>> + if (ret < 0)
>> + break;
>> + }
>> +
>> + return ret;
>> +}
>> +
>> +static int balance_remap_chunks(struct btrfs_fs_info *fs_info,
>> + struct btrfs_path *path,
>> + struct list_head *chunks)
>> +{
>> + struct remap_chunk_info *rci, *tmp;
>> + struct btrfs_trans_handle *trans;
>> + int ret;
>> +
>> + list_for_each_entry_safe(rci, tmp, chunks, list) {
>> + rci->bg = btrfs_lookup_block_group(fs_info, rci->offset);
>> + if (!rci->bg) {
>> + list_del(&rci->list);
>> + kfree(rci);
>> + continue;
>> + }
>> +
>> + ret = btrfs_inc_block_group_ro(rci->bg, false);
>
> Just thinking out loud, what happens if we concurrently attempt a
> balance that would need to use the remap tree? Is something structurally
> blocking that at a higher level? Or will it fail? How will that failure
> be handled? Does the answer hold for btrfs-internal background reclaim
> rather than explicit balancing?
>
>> + if (ret)
>> + goto end;
>> +
>> + rci->made_ro = true;
>> + }
>> +
>> + if (list_empty(chunks))
>> + return 0;
>> +
>> + trans = btrfs_start_transaction(fs_info->remap_root, 0);
>> + if (IS_ERR(trans)) {
>> + ret = PTR_ERR(trans);
>> + goto end;
>> + }
>> +
>> + mutex_lock(&fs_info->remap_mutex);
>> +
>> + ret = cow_remap_tree(trans, path);
>> +
>> + btrfs_release_path(path);
>> +
>> + mutex_unlock(&fs_info->remap_mutex);
>> +
>> + btrfs_commit_transaction(trans);
>> +
>> +end:
>> + while (!list_empty(chunks)) {
>> + bool unused;
>> +
>> + rci = list_first_entry(chunks, struct remap_chunk_info, list);
>> +
>> + spin_lock(&rci->bg->lock);
>> + unused = !btrfs_is_block_group_used(rci->bg);
>> + spin_unlock(&rci->bg->lock);
>> +
>> + if (unused)
>> + btrfs_mark_bg_unused(rci->bg);
>> +
>> + if (rci->made_ro)
>> + btrfs_dec_block_group_ro(rci->bg);
>> +
>> + btrfs_put_block_group(rci->bg);
>> +
>> + list_del(&rci->list);
>> + kfree(rci);
>> + }
>> +
>> + return ret;
>> +}
>> +
>> static int __btrfs_balance(struct btrfs_fs_info *fs_info)
>> {
>> struct btrfs_balance_control *bctl = fs_info->balance_ctl;
>> @@ -4117,6 +4227,9 @@ static int __btrfs_balance(struct btrfs_fs_info *fs_info)
>> u32 count_meta = 0;
>> u32 count_sys = 0;
>> int chunk_reserved = 0;
>> + struct remap_chunk_info *rci;
>> + unsigned int num_remap_chunks = 0;
>> + LIST_HEAD(remap_chunks);
>>
>> path = btrfs_alloc_path();
>> if (!path) {
>> @@ -4215,7 +4328,8 @@ static int __btrfs_balance(struct btrfs_fs_info *fs_info)
>> count_data++;
>> else if (chunk_type & BTRFS_BLOCK_GROUP_SYSTEM)
>> count_sys++;
>> - else if (chunk_type & BTRFS_BLOCK_GROUP_METADATA)
>> + else if (chunk_type & (BTRFS_BLOCK_GROUP_METADATA |
>> + BTRFS_BLOCK_GROUP_REMAP))
>> count_meta++;
>>
>> goto loop;
>> @@ -4235,6 +4349,30 @@ static int __btrfs_balance(struct btrfs_fs_info *fs_info)
>> goto loop;
>> }
>>
>> + /*
>> + * Balancing REMAP chunks takes place separately - add the
>> + * details to a list so it can be processed later.
>> + */
>> + if (chunk_type & BTRFS_BLOCK_GROUP_REMAP) {
>> + mutex_unlock(&fs_info->reclaim_bgs_lock);
>> +
>> + rci = kmalloc(sizeof(struct remap_chunk_info),
>> + GFP_NOFS);
>> + if (!rci) {
>> + ret = -ENOMEM;
>> + goto error;
>> + }
>> +
>> + rci->offset = found_key.offset;
>> + rci->bg = NULL;
>> + rci->made_ro = false;
>> + list_add_tail(&rci->list, &remap_chunks);
>> +
>> + num_remap_chunks++;
>> +
>> + goto loop;
>> + }
>> +
>> if (!chunk_reserved) {
>> /*
>> * We may be relocating the only data chunk we have,
>> @@ -4274,11 +4412,26 @@ static int __btrfs_balance(struct btrfs_fs_info *fs_info)
>> key.offset = found_key.offset - 1;
>> }
>>
>> + btrfs_release_path(path);
>> +
>> if (counting) {
>> - btrfs_release_path(path);
>> counting = false;
>> goto again;
>> }
>> +
>> + if (!list_empty(&remap_chunks)) {
>> + ret = balance_remap_chunks(fs_info, path, &remap_chunks);
>> + if (ret == -ENOSPC)
>> + enospc_errors++;
>> +
>> + if (!ret) {
>> + btrfs_delete_unused_bgs(fs_info);
>
> Why is this necessary here?
>
>> +
>> + spin_lock(&fs_info->balance_lock);
>> + bctl->stat.completed += num_remap_chunks;
>> + spin_unlock(&fs_info->balance_lock);
>> + }
>> + }
>> error:
>> btrfs_free_path(path);
>> if (enospc_errors) {
>> --
>> 2.49.1
>>
Thread overview: 43+ messages
2025-08-13 14:34 [PATCH v2 00/16] btrfs: remap tree Mark Harmstone
2025-08-13 14:34 ` [PATCH v2 01/16] btrfs: add definitions and constants for remap-tree Mark Harmstone
2025-08-15 23:51 ` Boris Burkov
2025-08-18 17:21 ` Mark Harmstone
2025-08-18 17:33 ` Boris Burkov
2025-08-16 0:01 ` Qu Wenruo
2025-08-16 0:17 ` Qu Wenruo
2025-08-18 17:23 ` Mark Harmstone
2025-08-13 14:34 ` [PATCH v2 02/16] btrfs: add REMAP chunk type Mark Harmstone
2025-08-13 14:34 ` [PATCH v2 03/16] btrfs: allow remapped chunks to have zero stripes Mark Harmstone
2025-08-16 0:03 ` Boris Burkov
2025-08-22 17:01 ` Mark Harmstone
2025-08-19 1:05 ` kernel test robot
2025-08-22 17:07 ` Mark Harmstone
2025-08-13 14:34 ` [PATCH v2 04/16] btrfs: remove remapped block groups from the free-space tree Mark Harmstone
2025-08-13 14:34 ` [PATCH v2 05/16] btrfs: don't add metadata items for the remap tree to the extent tree Mark Harmstone
2025-08-16 0:06 ` Boris Burkov
2025-08-13 14:34 ` [PATCH v2 06/16] btrfs: add extended version of struct block_group_item Mark Harmstone
2025-08-16 0:08 ` Boris Burkov
2025-08-13 14:34 ` [PATCH v2 07/16] btrfs: allow mounting filesystems with remap-tree incompat flag Mark Harmstone
2025-08-22 19:14 ` Boris Burkov
2025-08-13 14:34 ` [PATCH v2 08/16] btrfs: redirect I/O for remapped block groups Mark Harmstone
2025-08-22 19:42 ` Boris Burkov
2025-08-27 14:08 ` Mark Harmstone
2025-08-13 14:34 ` [PATCH v2 09/16] btrfs: release BG lock before calling btrfs_link_bg_list() Mark Harmstone
2025-08-16 0:32 ` Boris Burkov
2025-08-27 15:35 ` Mark Harmstone
2025-08-27 15:48 ` Filipe Manana
2025-08-27 15:52 ` Mark Harmstone
2025-08-13 14:34 ` [PATCH v2 10/16] btrfs: handle deletions from remapped block group Mark Harmstone
2025-08-16 0:28 ` Boris Burkov
2025-08-27 17:11 ` Mark Harmstone
2025-08-13 14:34 ` [PATCH v2 11/16] btrfs: handle setting up relocation of block group with remap-tree Mark Harmstone
2025-08-13 14:34 ` [PATCH v2 12/16] btrfs: move existing remaps before relocating block group Mark Harmstone
2025-08-13 14:34 ` [PATCH v2 13/16] btrfs: replace identity maps with actual remaps when doing relocations Mark Harmstone
2025-08-13 14:34 ` [PATCH v2 14/16] btrfs: add do_remap param to btrfs_discard_extent() Mark Harmstone
2025-08-13 14:34 ` [PATCH v2 15/16] btrfs: add fully_remapped_bgs list Mark Harmstone
2025-08-16 0:56 ` Boris Burkov
2025-08-27 18:51 ` Mark Harmstone
2025-08-13 14:34 ` [PATCH v2 16/16] btrfs: allow balancing remap tree Mark Harmstone
2025-08-16 1:02 ` Boris Burkov
2025-09-02 14:58 ` Mark Harmstone
2025-09-02 15:21 ` Mark Harmstone