From: Mark Harmstone <maharmstone@meta.com>
To: Jonah Sabean <jonah@jse.io>
Cc: "linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>
Subject: Re: [PATCH 00/12] btrfs: remap tree
Date: Fri, 6 Jun 2025 13:35:51 +0000 [thread overview]
Message-ID: <e35b2329-370b-4844-a464-d0a29573874a@meta.com> (raw)
In-Reply-To: <CAFMvigdEQVj-uJqfCVqYXf8a51xY48gsYg+tvFNFrC3gPyF-gA@mail.gmail.com>
On 5/6/25 17:43, Jonah Sabean wrote:
> >
> On Thu, Jun 5, 2025 at 1:25 PM Mark Harmstone <maharmstone@fb.com> wrote:
>>
>> This patch series adds a disk format change gated behind
>> CONFIG_BTRFS_EXPERIMENTAL to add a "remap tree", which acts as a layer of
>> indirection when doing I/O. When doing relocation, rather than fixing up every
>> tree, we instead record the old and new addresses in the remap tree. This should
>> hopefully make things more reliable and flexible, as well as enabling some
>> future changes we'd like to make, such as larger data extents and reducing
>> write amplification by removing cow-only metadata items.
>>
>> The remap tree lives in a new REMAP chunk type. This is because bootstrapping
>> means that it can't be remapped itself, and has to be relocated by COWing it as
>> at present. It can't go in the SYSTEM chunk as we are then limited by the chunk
>> item needing to fit in the superblock.
>>
>> For more on the design and rationale, please see my RFC sent last month[1], as
>> well as Josef Bacik's original design document[2]. The main change from Josef's
>> design is that I've added remap backrefs, as we need to be able to move a
>> chunk's existing remaps before remapping it.
>>
>> You will also need my patches to btrfs-progs[3] to make
>> `mkfs.btrfs -O remap-tree` work, as well as allowing `btrfs check` to recognize
>> the new format.
>>
>> Changes since the RFC:
>>
>> * I've reduce the REMAP chunk size from the normal 1GB to 32MB, to match the
>> SYSTEM chunk. For a filesystem with 4KB sectors and 16KB node size, the worst
>> case is that one leaf covers ~1MB of data, and the best case ~250GB. For a
>> chunk, that implies a worst case of ~2GB and a best case of ~500TB.
>> This isn't a disk-format change, so we can always adjust it if it proves too
>> big or small in practice. mkfs creates 8MB chunks, as it does for everything.
>
> One thing I'd like to see fixed is the fragmentation of dev_extents on
> stripped profiles when you have less than 1G left of space, as btrfs
> will allocate these smaller chunks across a stripped array (ie raid0,
> 10, 5 or 6), otherwise being able to support larger extents can be
> made moot because you can end up with chunks being less as small as
> 1MiB. Depending on if you add/remove devices often and balance often
> you can end up with a lot of chunks across all disks that can be made
> smaller, so one hacky way I've got around this is to align partitions
> and force the system chunk to 1G with this patch:
> https://urldefense.com/v3/__https://pastebin.com/4PWbgEXV__;!!Bt8RZUm9aw!5woVoadd383IuqBtW6VYdNfYTRc1ugI44XocnoPkA0gEjtp58o3ubI7wW3X5fzx58qYL9cxWUDY$
>
> Ideally, I'd like this problem solved, but it seems to me this will
> just add yet another small chunk in the mix that makes alignment
> harder in this case. Really makes striping a curse on btrfs.
This is a different problem to what my patches are trying to solve, but
yes, I can understand why that would be an issue. Sometimes you'd prefer
the FS to ENOSPC rather than fragmenting your files.
I know one of the btrfs developers has been looking into making the
allocator more intelligent, so I'll make sure he's aware of this.
>>
>> * You can't make new allocations from remapped block groups, so I've changed
>> it so there's no free-space entries for these (thanks to Boris Burkov for the
>> suggestion).
>>
>> * The remap tree doesn't have metadata items in the extent tree (thanks to Josef
>> for the suggestion). This was to work around some corruption that delayed refs
>> were causing, but it also fits it with our future plans of removing all
>> metadata items for COW-only trees, reducing write amplification.
>> A knock-on effect of this is that I've had to disable balancing of the remap
>> chunk itself. This is because we can no longer walk the extent tree, and will
>> have to walk the remap tree instead. When we remove the COW-only metadata
>> items, we will also have to do this for the chunk and root trees, as
>> bootstrapping means they can't be remapped.
>>
>> * btrfs_translate_remap() uses search_commit_root when doing metadata lookups,
>> to avoid nested locking issues. This also seems to be a lot quicker (btrfs/187
>> went from ~20mins to ~90secs).
>>
>> * Unused remapped block groups should now get cleaned up more aggressively
>>
>> * Other miscellaneous cleanups and fixes
>>
>> Known issues:
>>
>> * Relocation still needs to be implemented for the remap tree itself (see above)
>>
>> * Some test failures: btrfs/156, btrfs/170, btrfs/226, btrfs/250
>>
>> * nodatacow extents aren't safe, as they can race with the relocation thread.
>> We either need to follow the btrfs_inc_nocow_writers() approach, which COWs
>> the extent, or change it so that it blocks here.
>>
>> * When initially marking a block group as remapped, we are walking the free-
>> space tree and creating the identity remaps all in one transaction. For the
>> worst-case scenario, i.e. a 1GB block group with every other sector allocated
>> (131,072 extents), this can result in transaction times of more than 10 mins.
>> This needs to be changed to allow this to happen over multiple transactions.
>>
>> * All this is disabled for zoned devices for the time being, as I've not been
>> able to test it. I'm planning to make it compatible with zoned at a later
>> date.
>>
>> Thanks
>>
>> [1] https://urldefense.com/v3/__https://lwn.net/Articles/1021452/__;!!Bt8RZUm9aw!5woVoadd383IuqBtW6VYdNfYTRc1ugI44XocnoPkA0gEjtp58o3ubI7wW3X5fzx58qYL4uvDpII$
>> [2] https://github.com/btrfs/btrfs-todo/issues/54
>> [3] https://github.com/maharmstone/btrfs-progs/tree/remap-tree
>>
>> Mark Harmstone (12):
>> btrfs: add definitions and constants for remap-tree
>> btrfs: add REMAP chunk type
>> btrfs: allow remapped chunks to have zero stripes
>> btrfs: remove remapped block groups from the free-space tree
>> btrfs: don't add metadata items for the remap tree to the extent tree
>> btrfs: add extended version of struct block_group_item
>> btrfs: allow mounting filesystems with remap-tree incompat flag
>> btrfs: redirect I/O for remapped block groups
>> btrfs: handle deletions from remapped block group
>> btrfs: handle setting up relocation of block group with remap-tree
>> btrfs: move existing remaps before relocating block group
>> btrfs: replace identity maps with actual remaps when doing relocations
>>
>> fs/btrfs/Kconfig | 2 +
>> fs/btrfs/accessors.h | 29 +
>> fs/btrfs/block-group.c | 202 +++-
>> fs/btrfs/block-group.h | 15 +-
>> fs/btrfs/block-rsv.c | 8 +
>> fs/btrfs/block-rsv.h | 1 +
>> fs/btrfs/discard.c | 11 +-
>> fs/btrfs/disk-io.c | 91 +-
>> fs/btrfs/extent-tree.c | 152 ++-
>> fs/btrfs/free-space-tree.c | 4 +-
>> fs/btrfs/free-space-tree.h | 5 +-
>> fs/btrfs/fs.h | 7 +-
>> fs/btrfs/relocation.c | 1897 ++++++++++++++++++++++++++++++-
>> fs/btrfs/relocation.h | 8 +-
>> fs/btrfs/space-info.c | 22 +-
>> fs/btrfs/sysfs.c | 4 +
>> fs/btrfs/transaction.c | 7 +
>> fs/btrfs/tree-checker.c | 37 +-
>> fs/btrfs/volumes.c | 115 +-
>> fs/btrfs/volumes.h | 17 +-
>> include/uapi/linux/btrfs.h | 1 +
>> include/uapi/linux/btrfs_tree.h | 29 +-
>> 22 files changed, 2444 insertions(+), 220 deletions(-)
>>
>> --
>> 2.49.0
>>
>>
next prev parent reply other threads:[~2025-06-06 13:35 UTC|newest]
Thread overview: 39+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-06-05 16:23 [PATCH 00/12] btrfs: remap tree Mark Harmstone
2025-06-05 16:23 ` [PATCH 01/12] btrfs: add definitions and constants for remap-tree Mark Harmstone
2025-06-13 21:02 ` Boris Burkov
2025-06-05 16:23 ` [PATCH 02/12] btrfs: add REMAP chunk type Mark Harmstone
2025-06-13 21:22 ` Boris Burkov
2025-06-05 16:23 ` [PATCH 03/12] btrfs: allow remapped chunks to have zero stripes Mark Harmstone
2025-06-13 21:41 ` Boris Burkov
2025-08-08 14:12 ` Mark Harmstone
2025-06-05 16:23 ` [PATCH 04/12] btrfs: remove remapped block groups from the free-space tree Mark Harmstone
2025-06-06 6:41 ` kernel test robot
2025-06-13 22:00 ` Boris Burkov
2025-08-12 14:50 ` Mark Harmstone
2025-06-05 16:23 ` [PATCH 05/12] btrfs: don't add metadata items for the remap tree to the extent tree Mark Harmstone
2025-06-13 22:39 ` Boris Burkov
2025-06-05 16:23 ` [PATCH 06/12] btrfs: add extended version of struct block_group_item Mark Harmstone
2025-06-05 16:23 ` [PATCH 07/12] btrfs: allow mounting filesystems with remap-tree incompat flag Mark Harmstone
2025-06-05 16:23 ` [PATCH 08/12] btrfs: redirect I/O for remapped block groups Mark Harmstone
2025-06-05 16:23 ` [PATCH 09/12] btrfs: handle deletions from remapped block group Mark Harmstone
2025-06-13 23:42 ` Boris Burkov
2025-08-11 16:48 ` Mark Harmstone
2025-08-11 16:59 ` Mark Harmstone
2025-06-05 16:23 ` [PATCH 10/12] btrfs: handle setting up relocation of block group with remap-tree Mark Harmstone
2025-06-13 23:25 ` Boris Burkov
2025-08-12 11:20 ` Mark Harmstone
2025-06-05 16:23 ` [PATCH 11/12] btrfs: move existing remaps before relocating block group Mark Harmstone
2025-06-06 11:20 ` kernel test robot
2025-06-05 16:23 ` [PATCH 12/12] btrfs: replace identity maps with actual remaps when doing relocations Mark Harmstone
2025-06-05 16:43 ` [PATCH 00/12] btrfs: remap tree Jonah Sabean
2025-06-06 13:35 ` Mark Harmstone [this message]
2025-06-09 16:05 ` Anand Jain
2025-06-09 18:51 ` David Sterba
2025-06-10 9:19 ` Mark Harmstone
2025-06-10 14:31 ` Mark Harmstone
2025-06-10 23:56 ` Qu Wenruo
2025-06-11 8:06 ` Mark Harmstone
2025-06-11 15:28 ` Mark Harmstone
2025-06-14 0:04 ` Boris Burkov
2025-06-26 22:10 ` Mark Harmstone
2025-06-27 5:59 ` Neal Gompa
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=e35b2329-370b-4844-a464-d0a29573874a@meta.com \
--to=maharmstone@meta.com \
--cc=jonah@jse.io \
--cc=linux-btrfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox