public inbox for linux-btrfs@vger.kernel.org
From: Mark Harmstone <maharmstone@fb.com>
To: <linux-btrfs@vger.kernel.org>
Cc: Mark Harmstone <maharmstone@fb.com>
Subject: [RFC PATCH 00/10] Remap tree
Date: Thu, 15 May 2025 17:36:28 +0100
Message-ID: <20250515163641.3449017-1-maharmstone@fb.com>

Hi all,

This is an RFC for an experimental remap-tree incompat feature, which reworks
how we perform relocations. Our present system involves following backrefs to
rewrite addresses in the metadata trees, which can be slow and has knock-on
effects such as breaking nocow files. Instead, this adds a remap tree to act as
another layer of indirection: if a block group has the REMAPPED bit set, all
I/O to addresses nominally within it gets translated by a lookup in the remap
tree.

See https://github.com/btrfs/btrfs-todo/issues/54 for Josef's original design,
which I've elaborated on, and more on the rationale. To test this you will also
need a patched btrfs-progs, for `mkfs.btrfs -O remap-tree`:
https://github.com/maharmstone/btrfs-progs/tree/remap-tree. See also
https://github.com/maharmstone/btrfs/blob/remap-tree/btrfs-dump.pl for a dumper
script that's remap-aware.

The BTRFS_FEATURE_INCOMPAT_REMAP_TREE flag implies the following changes:

* There's a new remap tree, which is created empty

* The data reloc tree isn't created, as it's no longer needed (data relocation
is no longer done via snapshots)

* There's a new metadata chunk type, REMAP, which consists solely of the remap
tree. This is to avoid bootstrapping issues: REMAP chunks, like SYSTEM, are
still relocated the old way (i.e. by COWing every leaf). We can't put the remap
tree in SYSTEM: because the chunk items are put in the superblock, it would
limit our available space

* The top of the remap tree is recorded in the superblock

* Two new fields are added to struct btrfs_block_group_item: le64 remap_bytes
and le32 identity_remap_count. remap_bytes records how much data in a block
group has been relocated from another block group. identity_remap_count records
how many identity remaps exist for this block group (see later)

* If a block group has the REMAPPED flag set and identity_remap_count == 0, its
chunk will have 0 stripes and 0 dev extents: everything nominally within it is
actually located elsewhere

* All data and metadata addresses are now translated addresses, for which we
need to consult the remap tree to find the underlying addresses before doing
I/O, if they live in a remapped BG. The exception is the free-space tree, which
is entirely underlying addresses (the free-space cache v1 isn't supported here).
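
As a rough illustration of the block group item change listed above (this is
not code from the series), the extended item could look like the sketch below.
The three leading fields match the existing struct btrfs_block_group_item; the
struct names and the ordering of the two new fields are assumptions, and plain
fixed-width integers stand in for the kernel's __le64/__le32 types.

```c
#include <assert.h>
#include <stdint.h>

/* Existing on-disk item: three little-endian 64-bit fields. */
struct block_group_item_v1 {
	uint64_t used;
	uint64_t chunk_objectid;
	uint64_t flags;
} __attribute__((packed));

/* Hypothetical extended item; field order of the new members is assumed. */
struct block_group_item_v2 {
	uint64_t used;
	uint64_t chunk_objectid;
	uint64_t flags;
	uint64_t remap_bytes;          /* data relocated here from another BG */
	uint32_t identity_remap_count; /* identity remaps for this BG */
} __attribute__((packed));

/* The two new fields add 12 bytes to the 24-byte item. */
_Static_assert(sizeof(struct block_group_item_v1) == 24, "v1 size");
_Static_assert(sizeof(struct block_group_item_v2) == 36, "v2 size");
```

Under these assumptions, a reader can tell the two versions apart by item size
alone, since the common prefix is unchanged.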

REMAP TREE
----------

The remap tree consists of three types of items: identity remaps, remaps, and
remap backrefs. Each represents a non-overlapping range. Remaps are orthogonal
to extents: an extent might be split into several ranges in the remap tree, or
multiple consecutive extents might be remapped together.

Identity remaps represent a range for which we don't need to do a translation,
i.e. the REMAPPED flag has been set for the BG but we haven't yet done a
relocation.

Remaps have a u64 payload which gives the underlying address to use for the
start of this range, i.e. the non-REMAPPED block group we should use for I/O.

Remap backrefs are a reverse index for the remaps, with their objectid being
the underlying address and their u64 payload being the translated address.
We need backrefs because when relocating a block group that contains existing
remaps, these need to be moved first (we don't want to have to consult an
ever-increasing chain of BGs).
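
The lookup described above can be sketched in userspace C as follows. This is
only a model: it uses a sorted in-memory array rather than a B-tree, and every
name in it (remap_entry, translate, NO_MAPPING, example_table) is hypothetical
and does not come from the patches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NO_MAPPING UINT64_MAX

enum remap_kind { REMAP_IDENTITY, REMAP_MOVED };

/* One non-overlapping range, keyed by its translated start address. */
struct remap_entry {
	uint64_t start;       /* translated start of the range */
	uint64_t length;      /* length of the range in bytes */
	enum remap_kind kind;
	uint64_t target;      /* underlying start; used for REMAP_MOVED only */
};

/* Example table, sorted by start: one identity remap, one real remap. */
const struct remap_entry example_table[] = {
	{ 0x1000, 0x1000, REMAP_IDENTITY, 0 },
	{ 0x2000, 0x2000, REMAP_MOVED, 0x100000 },
};
#define EXAMPLE_COUNT (sizeof(example_table) / sizeof(example_table[0]))

/* Translate addr to its underlying address, or NO_MAPPING if uncovered. */
uint64_t translate(const struct remap_entry *tbl, size_t n, uint64_t addr)
{
	size_t lo = 0, hi = n;

	assert(tbl != NULL || n == 0);

	/* Binary search: lo ends up one past the last entry with start <= addr. */
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (tbl[mid].start <= addr)
			lo = mid + 1;
		else
			hi = mid;
	}
	if (lo == 0)
		return NO_MAPPING;

	const struct remap_entry *e = &tbl[lo - 1];
	uint64_t off = addr - e->start;

	if (off >= e->length)
		return NO_MAPPING;

	/* Identity remaps need no translation; real remaps keep the offset. */
	return e->kind == REMAP_IDENTITY ? addr : e->target + off;
}
```

With the example table, translate(example_table, EXAMPLE_COUNT, 0x2800) gives
0x100800, while 0x1800 falls in the identity remap and comes back unchanged.
A remap backref would be the same record indexed by target instead of start.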

RELOCATION
----------

For SYSTEM and REMAP block groups, relocation is as it is at the moment (mark
the BG readonly, COW every leaf).

For DATA and METADATA block groups, we do the following:

* If remap_bytes > 0, search the remap tree for the remap backrefs that are
physically located in this block group. Move the data on disk, munge the remap
and free-space trees to reflect this, and reduce remap_bytes. Loop until
remap_bytes is 0.

* Search the free-space tree for holes, convert these into identity remaps
within the remap tree, set the identity_remap_count within the block
group item, and set the REMAPPED flag on the block group item and the chunk.
The REMAPPED flag will prevent new allocations from being made from this block
group.

* Walk through the remap tree looking for the identity remaps that we have
created. For each one, try to reserve the same amount of space in another block
group. Read the data into memory, and write an exact copy into the new
location. Increase remap_bytes in the destination block group, and reduce
identity_remap_count in the source block group if we can move the whole thing.
Add the space back to the free-space tree in the source BG, and remove the
space in the free-space tree for the destination BG.

* When identity_remap_count == 0, the block group has been fully remapped, and
now exists solely for the purposes of remapping. Set num_stripes to 0 in the
chunk item, remove its stripes, and remove the entries in the dev extent tree.

* If a block group has the REMAPPED flag set, identity_remap_count == 0, and
remap_bytes == 0, it is now empty. The block group item, chunk item, and
entries in the free-space tree can be removed.
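
The conditions in the steps above amount to a small state classification for a
DATA or METADATA block group. A minimal sketch, with hypothetical names
(bg_state, classify_bg) that are not taken from the series, and encoding only
the tests stated in the list:

```c
#include <stdbool.h>
#include <stdint.h>

enum bg_state {
	BG_NORMAL,          /* REMAPPED not set: allocations still allowed */
	BG_RELOCATING,      /* REMAPPED set, identity remaps still to be moved */
	BG_FULLY_REMAPPED,  /* no identity remaps left: chunk has zero stripes */
	BG_REMOVABLE,       /* nothing left: BG item and chunk item can go */
};

/* Classify a block group from the three values the relocation steps test. */
enum bg_state classify_bg(bool remapped, uint32_t identity_remap_count,
			  uint64_t remap_bytes)
{
	if (!remapped)
		return BG_NORMAL;
	if (identity_remap_count > 0)
		return BG_RELOCATING;
	if (remap_bytes > 0)
		return BG_FULLY_REMAPPED;
	return BG_REMOVABLE;
}
```

For instance, classify_bg(true, 0, 0) yields BG_REMOVABLE, matching the last
step in the list.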

KNOWN ISSUES
------------

* Still a few problems with some fstests: btrfs/156, btrfs/170,
btrfs/177, btrfs/226, btrfs/250, btrfs/252.

* There's a race when it comes to nodatacow writes. I think we ought to be
calling btrfs_inc_nocow_writers(), but that COWs rather than blocking.

* It can be reluctant to drop entries for empty remapped block groups. This
doesn't waste substantial amounts of space on disk as there are no dev extents,
but it is polluting the block group and chunk trees. Similarly we ought to be
removing any entries in the free-space tree for fully remapped block groups: no
new allocations can be done within them, and they no longer correspond to a
location on disk.

* At the moment we're allocating 1GB for the REMAP chunks, which is probably
overkill. One 16KB leaf of the remap tree can cover ~250GB in the best case
and ~1MB in the worst case. Possibly 32MB is the sweet spot, as for SYSTEM.
Allocating more REMAP chunks isn't a problem, as unlike SYSTEM they don't go in
the superblock.

* There's a spurious lockdep warning when doing remapped metadata reads, as
we're locking an extent buffer within the remap tree while already holding
another extent buffer lock. We either need to disable the warning for this, or
change the code to use search_commit_root. We can't add another level to the
lockdep hierarchy as we're already maxed out at 8.

* I think the lookup code in btrfs_translate_remap() probably needs to be
refined. Possibly we can do without the call to btrfs_prev_leaf().

* At present we scan the free-space holes and create the identity remaps all in
the same transaction. For the worst-case scenario of a 1GB block group with
every other sector allocated, relocation takes ~30 seconds on my system, on an
NVMe drive with no other load. Josef has suggested that we split this into
multiple transactions, setting something like
BTRFS_BLOCK_GROUP_ADDING_IDENTITY_REMAPS at the start, and discarding any
progress on mount if we find a BG with this flag set, which makes sense to me.
The code is currently gated behind CONFIG_BTRFS_EXPERIMENTAL. My preference
would be that, like with the RAID stripe tree, we accept that there may be
minor on-disk format changes until it can be moved out of experimental - i.e.
we treat this as a (near-)future improvement rather than a blocker.

* Josef has also suggested that we don't log metadata items for the remap tree
in the extent tree, in anticipation of later omitting all COW-only metadata
items from the extent tree. My view is that treating the remap tree as a
special case would be more trouble than it's worth, both here and in progs, and
omitting all COW-only metadata items should be relegated to a later incompat
change that depends on this one.

* There's still a lot of userspace work to be done: making sure that all space
reporting etc. tools are okay, adding a comprehensive series of tests to btrfs
check, allowing toggling this incompat flag on and off in btrfstune.
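
The per-leaf coverage figures quoted in the REMAP chunk sizing item above can
be roughly reproduced with the arithmetic below. This is one plausible
reconstruction, not necessarily how the estimate was made. It assumes a
101-byte leaf header and 25-byte item headers (the sizes of struct btrfs_header
and struct btrfs_item), an 8-byte payload per item, that every remap is counted
together with its backref, a 4KB sector as the smallest remapped range, and a
1GB block group as the largest.

```c
#include <assert.h>
#include <stdint.h>

#define LEAF_SIZE    16384u /* 16KB leaf, as in the text */
#define HEADER_SIZE  101u   /* assumed: sizeof(struct btrfs_header) */
#define ITEM_SIZE    25u    /* assumed: sizeof(struct btrfs_item) */
#define PAYLOAD_SIZE 8u     /* u64 payload of a remap or remap backref */

_Static_assert(LEAF_SIZE > HEADER_SIZE, "leaf must hold its header");

/* Number of (remap, backref) pairs whose items fit in one leaf's worth of space. */
unsigned remap_pairs_per_leaf(void)
{
	return (LEAF_SIZE - HEADER_SIZE) / (2 * (ITEM_SIZE + PAYLOAD_SIZE));
}

/* Bytes of address space covered when every remap spans bytes_per_remap. */
uint64_t leaf_coverage(uint64_t bytes_per_remap)
{
	return (uint64_t)remap_pairs_per_leaf() * bytes_per_remap;
}
```

Under these assumptions a leaf's worth of space holds 246 pairs, so
leaf_coverage(4096) is a little under 1MB and leaf_coverage(1ULL << 30) is
246GB, in line with the ~1MB and ~250GB figures.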

Thanks

Mark

Mark Harmstone (10):
  btrfs: add definitions and constants for remap-tree
  btrfs: add REMAP chunk type
  btrfs: allow remapped chunks to have zero stripes
  btrfs: add extended version of struct block_group_item
  btrfs: allow mounting filesystems with remap-tree incompat flag
  btrfs: redirect I/O for remapped block groups
  btrfs: handle deletions from remapped block group
  btrfs: handle setting up relocation of block group with remap-tree
  btrfs: move existing remaps before relocating block group
  btrfs: replace identity maps with actual remaps when doing relocations

 fs/btrfs/Kconfig                |    2 +
 fs/btrfs/accessors.h            |   29 +
 fs/btrfs/block-group.c          |  181 ++-
 fs/btrfs/block-group.h          |   15 +-
 fs/btrfs/block-rsv.c            |    8 +
 fs/btrfs/block-rsv.h            |    1 +
 fs/btrfs/ctree.c                |   11 +-
 fs/btrfs/ctree.h                |    3 +
 fs/btrfs/disk-io.c              |   88 +-
 fs/btrfs/extent-tree.c          |   38 +-
 fs/btrfs/free-space-tree.c      |    4 +-
 fs/btrfs/free-space-tree.h      |    5 +-
 fs/btrfs/fs.h                   |    7 +-
 fs/btrfs/relocation.c           | 2065 ++++++++++++++++++++++++++++++-
 fs/btrfs/relocation.h           |    8 +-
 fs/btrfs/space-info.c           |   20 +-
 fs/btrfs/sysfs.c                |    4 +
 fs/btrfs/transaction.c          |    7 +
 fs/btrfs/tree-checker.c         |   37 +-
 fs/btrfs/volumes.c              |  106 +-
 fs/btrfs/volumes.h              |   17 +-
 include/uapi/linux/btrfs.h      |    1 +
 include/uapi/linux/btrfs_tree.h |   29 +-
 23 files changed, 2509 insertions(+), 177 deletions(-)

-- 
2.49.0

