From: Anand Jain <anand.jain@oracle.com>
To: Mark Harmstone <maharmstone@meta.com>, Jonah Sabean <jonah@jse.io>
Cc: "linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>
Subject: Re: [PATCH 00/12] btrfs: remap tree
Date: Tue, 10 Jun 2025 00:05:34 +0800
Message-ID: <9a1344dc-a7c1-4d1d-aa12-7559cc2bc856@oracle.com>
In-Reply-To: <e35b2329-370b-4844-a464-d0a29573874a@meta.com>
On 6/6/25 21:35, Mark Harmstone wrote:
> On 5/6/25 17:43, Jonah Sabean wrote:
>>>
>> On Thu, Jun 5, 2025 at 1:25 PM Mark Harmstone <maharmstone@fb.com> wrote:
>>>
>>> This patch series adds a disk format change gated behind
>>> CONFIG_BTRFS_EXPERIMENTAL to add a "remap tree", which acts as a layer of
>>> indirection when doing I/O. When doing relocation, rather than fixing up every
>>> tree, we instead record the old and new addresses in the remap tree. This should
>>> hopefully make things more reliable and flexible, as well as enabling some
>>> future changes we'd like to make, such as larger data extents and reducing
>>> write amplification by removing cow-only metadata items.
>>>
>>> The remap tree lives in a new REMAP chunk type. This is because bootstrapping
>>> means that it can't be remapped itself, and has to be relocated by COWing it as
>>> at present. It can't go in the SYSTEM chunk as we are then limited by the chunk
>>> item needing to fit in the superblock.
>>>
>>> For more on the design and rationale, please see my RFC sent last month[1], as
>>> well as Josef Bacik's original design document[2]. The main change from Josef's
>>> design is that I've added remap backrefs, as we need to be able to move a
>>> chunk's existing remaps before remapping it.
>>>
>>> You will also need my patches to btrfs-progs[3] to make
>>> `mkfs.btrfs -O remap-tree` work, as well as allowing `btrfs check` to recognize
>>> the new format.
>>>
>>> Changes since the RFC:
>>>
>>> * I've reduced the REMAP chunk size from the normal 1GB to 32MB, to match the
>>> SYSTEM chunk. For a filesystem with 4KB sectors and 16KB node size, the worst
>>> case is that one leaf covers ~1MB of data, and the best case ~250GB. For a
>>> chunk, that implies a worst case of ~2GB and a best case of ~500TB.
>>> This isn't a disk-format change, so we can always adjust it if it proves too
>>> big or small in practice. mkfs creates 8MB chunks, as it does for everything.
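The per-chunk figures quoted above follow from the per-leaf ones by counting how many 16KB nodes fit in a 32MB chunk. A quick check in Python, assuming binary units and that every node in the chunk is a leaf (both assumptions are mine; the per-leaf bounds are taken from the quoted text as given, not derived here):

```python
# Sanity check of the quoted REMAP chunk coverage figures.
# Assumptions (not from the patch series): binary units, and every
# node in the chunk is a leaf (interior nodes are ignored).
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB
TIB = 1024 * GIB

CHUNK_SIZE = 32 * MIB   # REMAP chunk size from the cover letter
NODE_SIZE = 16 * KIB    # node size used in the cover letter's example

# Per-leaf coverage bounds as quoted (approximate figures).
WORST_PER_LEAF = 1 * MIB
BEST_PER_LEAF = 250 * GIB

leaves = CHUNK_SIZE // NODE_SIZE
worst = leaves * WORST_PER_LEAF
best = leaves * BEST_PER_LEAF

print(f"leaves per chunk: {leaves}")
print(f"worst case per chunk: {worst // GIB} GiB")
print(f"best case per chunk: {best // TIB} TiB")
```

With 2048 leaves per chunk, the quoted ~1MB and ~250GB per leaf scale to roughly 2GB and 500TB per chunk, matching the cover letter.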
>>
>> One thing I'd like to see fixed is the fragmentation of dev_extents on
>> striped profiles when you have less than 1G left of space, as btrfs
>> will allocate these smaller chunks across a striped array (ie raid0,
>> 10, 5 or 6), otherwise being able to support larger extents can be
>> made moot because you can end up with chunks being as small as
>> 1MiB. Depending on if you add/remove devices often and balance often
>> you can end up with a lot of chunks across all disks that can be made
>> smaller, so one hacky way I've got around this is to align partitions
>> and force the system chunk to 1G with this patch:
>> https://pastebin.com/4PWbgEXV
>>
>> Ideally, I'd like this problem solved, but it seems to me this will
>> just add yet another small chunk in the mix that makes alignment
>> harder in this case. Really makes striping a curse on btrfs.
>
> This is a different problem to what my patches are trying to solve, but
> yes, I can understand why that would be an issue. Sometimes you'd prefer
> the FS to ENOSPC rather than fragmenting your files.
>
> I know one of the btrfs developers has been looking into making the
> allocator more intelligent, so I'll make sure he's aware of this.
>
We’re adding a framework [1] to support more allocation methods, so
let’s see how that evolves.

[1]
https://asj.github.io/chunk-alloc-enhancement.html
https://lore.kernel.org/linux-btrfs/cover.1747070147.git.anand.jain@oracle.com/

Dynamically calculating chunk sizes in striped RAID can improve free
space usage, especially when device sizes are uneven. The trade-off
is increased chunk fragmentation; that’s the cost of maximizing the
usable space. I'm unsure about the impact as of now. One option is to
enforce fixed stripe counts and sizes, then benchmark with test
cases to assess the actual gains. Let me see if I can create a test case.
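
To make the trade-off concrete, here is a toy model in Python. It is only a sketch and not the btrfs allocator: the function `allocate_striped`, the 1024 MiB stripe length, the 1 MiB minimum and the device sizes are all made up for illustration. It compares a fixed per-device stripe length with one that shrinks to whatever the fullest participating device still has free.

```python
# Toy model of striped chunk allocation on uneven devices.
# NOT the btrfs allocator; all names and numbers are illustrative.

def allocate_striped(free_mib, stripe_mib=1024, min_devs=2, dynamic=False):
    """Allocate striped chunks until fewer than min_devs devices qualify.

    free_mib: per-device free space in MiB.
    Returns (chunks, leftover), where each chunk is
    (number_of_devices, per_device_stripe_length_in_MiB).
    """
    free = list(free_mib)
    chunks = []
    while True:
        # Fixed policy needs a full stripe; dynamic accepts >= 1 MiB.
        need = 1 if dynamic else stripe_mib
        devs = [i for i, f in enumerate(free) if f >= need]
        if len(devs) < min_devs:
            break
        length = stripe_mib
        if dynamic:
            # Shrink the stripe to fit the fullest participating device.
            length = min(stripe_mib, min(free[i] for i in devs))
        for i in devs:
            free[i] -= length
        chunks.append((len(devs), length))
    return chunks, free


def allocated(chunks):
    """Total raw space consumed by a list of chunks, in MiB."""
    return sum(n * length for n, length in chunks)


devices = [3000, 2500, 1500]  # hypothetical free space per device, MiB
fixed_chunks, fixed_left = allocate_striped(devices)
dyn_chunks, dyn_left = allocate_striped(devices, dynamic=True)

print("fixed  :", fixed_chunks, "allocated", allocated(fixed_chunks))
print("dynamic:", dyn_chunks, "allocated", allocated(dyn_chunks))
```

With these made-up sizes, the fixed policy allocates 5120 MiB in two uniform chunks and strands 1880 MiB, while the dynamic policy allocates 6500 MiB in three chunks with stripe lengths of 1024, 476 and 1000 MiB. That is the extra usable space on one side and the irregular dev_extents on the other.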
Thanks, Anand
>>>
>>> * You can't make new allocations from remapped block groups, so I've changed
>>> it so there's no free-space entries for these (thanks to Boris Burkov for the
>>> suggestion).
>>>
>>> * The remap tree doesn't have metadata items in the extent tree (thanks to Josef
>>> for the suggestion). This was to work around some corruption that delayed refs
>>> were causing, but it also fits in with our future plans of removing all
>>> metadata items for COW-only trees, reducing write amplification.
>>> A knock-on effect of this is that I've had to disable balancing of the remap
>>> chunk itself. This is because we can no longer walk the extent tree, and will
>>> have to walk the remap tree instead. When we remove the COW-only metadata
>>> items, we will also have to do this for the chunk and root trees, as
>>> bootstrapping means they can't be remapped.
>>>
>>> * btrfs_translate_remap() uses search_commit_root when doing metadata lookups,
>>> to avoid nested locking issues. This also seems to be a lot quicker (btrfs/187
>>> went from ~20mins to ~90secs).
>>>
>>> * Unused remapped block groups should now get cleaned up more aggressively
>>>
>>> * Other miscellaneous cleanups and fixes
>>>
>>> Known issues:
>>>
>>> * Relocation still needs to be implemented for the remap tree itself (see above)
>>>
>>> * Some test failures: btrfs/156, btrfs/170, btrfs/226, btrfs/250
>>>
>>> * nodatacow extents aren't safe, as they can race with the relocation thread.
>>> We either need to follow the btrfs_inc_nocow_writers() approach, which COWs
>>> the extent, or change it so that it blocks here.
>>>
>>> * When initially marking a block group as remapped, we are walking the free-
>>> space tree and creating the identity remaps all in one transaction. For the
>>> worst-case scenario, i.e. a 1GB block group with every other sector allocated
>>> (131,072 extents), this can result in transaction times of more than 10 mins.
>>> This needs to be changed to allow this to happen over multiple transactions.
>>>
>>> * All this is disabled for zoned devices for the time being, as I've not been
>>> able to test it. I'm planning to make it compatible with zoned at a later
>>> date.
>>>
>>> Thanks
>>>
>>> [1] https://lwn.net/Articles/1021452/
>>> [2] https://github.com/btrfs/btrfs-todo/issues/54
>>> [3] https://github.com/maharmstone/btrfs-progs/tree/remap-tree
>>>
>>> Mark Harmstone (12):
>>> btrfs: add definitions and constants for remap-tree
>>> btrfs: add REMAP chunk type
>>> btrfs: allow remapped chunks to have zero stripes
>>> btrfs: remove remapped block groups from the free-space tree
>>> btrfs: don't add metadata items for the remap tree to the extent tree
>>> btrfs: add extended version of struct block_group_item
>>> btrfs: allow mounting filesystems with remap-tree incompat flag
>>> btrfs: redirect I/O for remapped block groups
>>> btrfs: handle deletions from remapped block group
>>> btrfs: handle setting up relocation of block group with remap-tree
>>> btrfs: move existing remaps before relocating block group
>>> btrfs: replace identity maps with actual remaps when doing relocations
>>>
>>> fs/btrfs/Kconfig | 2 +
>>> fs/btrfs/accessors.h | 29 +
>>> fs/btrfs/block-group.c | 202 +++-
>>> fs/btrfs/block-group.h | 15 +-
>>> fs/btrfs/block-rsv.c | 8 +
>>> fs/btrfs/block-rsv.h | 1 +
>>> fs/btrfs/discard.c | 11 +-
>>> fs/btrfs/disk-io.c | 91 +-
>>> fs/btrfs/extent-tree.c | 152 ++-
>>> fs/btrfs/free-space-tree.c | 4 +-
>>> fs/btrfs/free-space-tree.h | 5 +-
>>> fs/btrfs/fs.h | 7 +-
>>> fs/btrfs/relocation.c | 1897 ++++++++++++++++++++++++++++++-
>>> fs/btrfs/relocation.h | 8 +-
>>> fs/btrfs/space-info.c | 22 +-
>>> fs/btrfs/sysfs.c | 4 +
>>> fs/btrfs/transaction.c | 7 +
>>> fs/btrfs/tree-checker.c | 37 +-
>>> fs/btrfs/volumes.c | 115 +-
>>> fs/btrfs/volumes.h | 17 +-
>>> include/uapi/linux/btrfs.h | 1 +
>>> include/uapi/linux/btrfs_tree.h | 29 +-
>>> 22 files changed, 2444 insertions(+), 220 deletions(-)
>>>
>>> --
>>> 2.49.0
>>>
>>>
>