public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed
From: Mark Harmstone <maharmstone@meta.com>
To: Jonah Sabean <jonah@jse.io>
Cc: "linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>
Subject: Re: [PATCH 00/12] btrfs: remap tree
Date: Fri, 6 Jun 2025 13:35:51 +0000	[thread overview]
Message-ID: <e35b2329-370b-4844-a464-d0a29573874a@meta.com> (raw)
In-Reply-To: <CAFMvigdEQVj-uJqfCVqYXf8a51xY48gsYg+tvFNFrC3gPyF-gA@mail.gmail.com>

On 5/6/25 17:43, Jonah Sabean wrote:
> > 
> On Thu, Jun 5, 2025 at 1:25 PM Mark Harmstone <maharmstone@fb.com> wrote:
>>
>> This patch series adds a disk format change gated behind
>> CONFIG_BTRFS_EXPERIMENTAL to add a "remap tree", which acts as a layer of
>> indirection when doing I/O. When doing relocation, rather than fixing up every
>> tree, we instead record the old and new addresses in the remap tree. This should
>> hopefully make things more reliable and flexible, as well as enabling some
>> future changes we'd like to make, such as larger data extents and reducing
>> write amplification by removing cow-only metadata items.
>>
>> The remap tree lives in a new REMAP chunk type. This is because bootstrapping
>> means that it can't be remapped itself, and has to be relocated by COWing it as
>> at present. It can't go in the SYSTEM chunk as we are then limited by the chunk
>> item needing to fit in the superblock.
>>
>> For more on the design and rationale, please see my RFC sent last month[1], as
>> well as Josef Bacik's original design document[2]. The main change from Josef's
>> design is that I've added remap backrefs, as we need to be able to move a
>> chunk's existing remaps before remapping it.
>>
>> You will also need my patches to btrfs-progs[3] to make
>> `mkfs.btrfs -O remap-tree` work, as well as allowing `btrfs check` to recognize
>> the new format.
>>
>> Changes since the RFC:
>>
>> * I've reduce the REMAP chunk size from the normal 1GB to 32MB, to match the
>>    SYSTEM chunk. For a filesystem with 4KB sectors and 16KB node size, the worst
>>    case is that one leaf covers ~1MB of data, and the best case ~250GB. For a
>>    chunk, that implies a worst case of ~2GB and a best case of ~500TB.
>>    This isn't a disk-format change, so we can always adjust it if it proves too
>>    big or small in practice. mkfs creates 8MB chunks, as it does for everything.
> 
> One thing I'd like to see fixed is the fragmentation of dev_extents on
> stripped profiles when you have less than 1G left of space, as btrfs
> will allocate these smaller chunks across a stripped array (ie raid0,
> 10, 5 or 6), otherwise being able to support larger extents can be
> made moot because you can end up with chunks being less as small as
> 1MiB. Depending on if you add/remove devices often and balance often
> you can end up with a lot of chunks across all disks that can be made
> smaller, so one hacky way I've got around this is to align partitions
> and force the system chunk to 1G with this patch:
> https://urldefense.com/v3/__https://pastebin.com/4PWbgEXV__;!!Bt8RZUm9aw!5woVoadd383IuqBtW6VYdNfYTRc1ugI44XocnoPkA0gEjtp58o3ubI7wW3X5fzx58qYL9cxWUDY$
> 
> Ideally, I'd like this problem solved, but it seems to me this will
> just add yet another small chunk in the mix that makes alignment
> harder in this case. Really makes striping a curse on btrfs.

This is a different problem to what my patches are trying to solve, but 
yes, I can understand why that would be an issue. Sometimes you'd prefer 
the FS to ENOSPC rather than fragmenting your files.

I know one of the btrfs developers has been looking into making the 
allocator more intelligent, so I'll make sure he's aware of this.

>>
>> * You can't make new allocations from remapped block groups, so I've changed
>>    it so there's no free-space entries for these (thanks to Boris Burkov for the
>>    suggestion).
>>
>> * The remap tree doesn't have metadata items in the extent tree (thanks to Josef
>>    for the suggestion). This was to work around some corruption that delayed refs
>>    were causing, but it also fits it with our future plans of removing all
>>    metadata items for COW-only trees, reducing write amplification.
>>    A knock-on effect of this is that I've had to disable balancing of the remap
>>    chunk itself. This is because we can no longer walk the extent tree, and will
>>    have to walk the remap tree instead. When we remove the COW-only metadata
>>    items, we will also have to do this for the chunk and root trees, as
>>    bootstrapping means they can't be remapped.
>>
>> * btrfs_translate_remap() uses search_commit_root when doing metadata lookups,
>>    to avoid nested locking issues. This also seems to be a lot quicker (btrfs/187
>>    went from ~20mins to ~90secs).
>>
>> * Unused remapped block groups should now get cleaned up more aggressively
>>
>> * Other miscellaneous cleanups and fixes
>>
>> Known issues:
>>
>> * Relocation still needs to be implemented for the remap tree itself (see above)
>>
>> * Some test failures: btrfs/156, btrfs/170, btrfs/226, btrfs/250
>>
>> * nodatacow extents aren't safe, as they can race with the relocation thread.
>>    We either need to follow the btrfs_inc_nocow_writers() approach, which COWs
>>    the extent, or change it so that it blocks here.
>>
>> * When initially marking a block group as remapped, we are walking the free-
>>    space tree and creating the identity remaps all in one transaction. For the
>>    worst-case scenario, i.e. a 1GB block group with every other sector allocated
>>    (131,072 extents), this can result in transaction times of more than 10 mins.
>>    This needs to be changed to allow this to happen over multiple transactions.
>>
>> * All this is disabled for zoned devices for the time being, as I've not been
>>    able to test it. I'm planning to make it compatible with zoned at a later
>>    date.
>>
>> Thanks
>>
>> [1] https://urldefense.com/v3/__https://lwn.net/Articles/1021452/__;!!Bt8RZUm9aw!5woVoadd383IuqBtW6VYdNfYTRc1ugI44XocnoPkA0gEjtp58o3ubI7wW3X5fzx58qYL4uvDpII$
>> [2] https://github.com/btrfs/btrfs-todo/issues/54
>> [3] https://github.com/maharmstone/btrfs-progs/tree/remap-tree
>>
>> Mark Harmstone (12):
>>    btrfs: add definitions and constants for remap-tree
>>    btrfs: add REMAP chunk type
>>    btrfs: allow remapped chunks to have zero stripes
>>    btrfs: remove remapped block groups from the free-space tree
>>    btrfs: don't add metadata items for the remap tree to the extent tree
>>    btrfs: add extended version of struct block_group_item
>>    btrfs: allow mounting filesystems with remap-tree incompat flag
>>    btrfs: redirect I/O for remapped block groups
>>    btrfs: handle deletions from remapped block group
>>    btrfs: handle setting up relocation of block group with remap-tree
>>    btrfs: move existing remaps before relocating block group
>>    btrfs: replace identity maps with actual remaps when doing relocations
>>
>>   fs/btrfs/Kconfig                |    2 +
>>   fs/btrfs/accessors.h            |   29 +
>>   fs/btrfs/block-group.c          |  202 +++-
>>   fs/btrfs/block-group.h          |   15 +-
>>   fs/btrfs/block-rsv.c            |    8 +
>>   fs/btrfs/block-rsv.h            |    1 +
>>   fs/btrfs/discard.c              |   11 +-
>>   fs/btrfs/disk-io.c              |   91 +-
>>   fs/btrfs/extent-tree.c          |  152 ++-
>>   fs/btrfs/free-space-tree.c      |    4 +-
>>   fs/btrfs/free-space-tree.h      |    5 +-
>>   fs/btrfs/fs.h                   |    7 +-
>>   fs/btrfs/relocation.c           | 1897 ++++++++++++++++++++++++++++++-
>>   fs/btrfs/relocation.h           |    8 +-
>>   fs/btrfs/space-info.c           |   22 +-
>>   fs/btrfs/sysfs.c                |    4 +
>>   fs/btrfs/transaction.c          |    7 +
>>   fs/btrfs/tree-checker.c         |   37 +-
>>   fs/btrfs/volumes.c              |  115 +-
>>   fs/btrfs/volumes.h              |   17 +-
>>   include/uapi/linux/btrfs.h      |    1 +
>>   include/uapi/linux/btrfs_tree.h |   29 +-
>>   22 files changed, 2444 insertions(+), 220 deletions(-)
>>
>> --
>> 2.49.0
>>
>>


  reply	other threads:[~2025-06-06 13:35 UTC|newest]

Thread overview: 39+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-06-05 16:23 [PATCH 00/12] btrfs: remap tree Mark Harmstone
2025-06-05 16:23 ` [PATCH 01/12] btrfs: add definitions and constants for remap-tree Mark Harmstone
2025-06-13 21:02   ` Boris Burkov
2025-06-05 16:23 ` [PATCH 02/12] btrfs: add REMAP chunk type Mark Harmstone
2025-06-13 21:22   ` Boris Burkov
2025-06-05 16:23 ` [PATCH 03/12] btrfs: allow remapped chunks to have zero stripes Mark Harmstone
2025-06-13 21:41   ` Boris Burkov
2025-08-08 14:12     ` Mark Harmstone
2025-06-05 16:23 ` [PATCH 04/12] btrfs: remove remapped block groups from the free-space tree Mark Harmstone
2025-06-06  6:41   ` kernel test robot
2025-06-13 22:00   ` Boris Burkov
2025-08-12 14:50     ` Mark Harmstone
2025-06-05 16:23 ` [PATCH 05/12] btrfs: don't add metadata items for the remap tree to the extent tree Mark Harmstone
2025-06-13 22:39   ` Boris Burkov
2025-06-05 16:23 ` [PATCH 06/12] btrfs: add extended version of struct block_group_item Mark Harmstone
2025-06-05 16:23 ` [PATCH 07/12] btrfs: allow mounting filesystems with remap-tree incompat flag Mark Harmstone
2025-06-05 16:23 ` [PATCH 08/12] btrfs: redirect I/O for remapped block groups Mark Harmstone
2025-06-05 16:23 ` [PATCH 09/12] btrfs: handle deletions from remapped block group Mark Harmstone
2025-06-13 23:42   ` Boris Burkov
2025-08-11 16:48     ` Mark Harmstone
2025-08-11 16:59     ` Mark Harmstone
2025-06-05 16:23 ` [PATCH 10/12] btrfs: handle setting up relocation of block group with remap-tree Mark Harmstone
2025-06-13 23:25   ` Boris Burkov
2025-08-12 11:20     ` Mark Harmstone
2025-06-05 16:23 ` [PATCH 11/12] btrfs: move existing remaps before relocating block group Mark Harmstone
2025-06-06 11:20   ` kernel test robot
2025-06-05 16:23 ` [PATCH 12/12] btrfs: replace identity maps with actual remaps when doing relocations Mark Harmstone
2025-06-05 16:43 ` [PATCH 00/12] btrfs: remap tree Jonah Sabean
2025-06-06 13:35   ` Mark Harmstone [this message]
2025-06-09 16:05     ` Anand Jain
2025-06-09 18:51 ` David Sterba
2025-06-10  9:19   ` Mark Harmstone
2025-06-10 14:31 ` Mark Harmstone
2025-06-10 23:56   ` Qu Wenruo
2025-06-11  8:06     ` Mark Harmstone
2025-06-11 15:28 ` Mark Harmstone
2025-06-14  0:04 ` Boris Burkov
2025-06-26 22:10 ` Mark Harmstone
2025-06-27  5:59   ` Neal Gompa

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=e35b2329-370b-4844-a464-d0a29573874a@meta.com \
    --to=maharmstone@meta.com \
    --cc=jonah@jse.io \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox