From: Josef Bacik <josef@toxicpanda.com>
To: Boris Burkov <boris@bur.io>
Cc: linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH 18/18] btrfs: track data relocation with simple quota
Date: Thu, 13 Jul 2023 13:37:52 -0400 [thread overview]
Message-ID: <20230713173752.GR207541@perftesting> (raw)
In-Reply-To: <d9e6a4525095ec5abb1818547d565fcf3ef58460.1688597211.git.boris@bur.io>
On Wed, Jul 05, 2023 at 04:20:55PM -0700, Boris Burkov wrote:
> Relocation data allocations are quite tricky for simple quotas. The
> basic data relocation sequence is (ignoring details that aren't relevant
> to this fix):
> - create a fake relocation data fs root
> - create a fake relocation inode in that root
> - foreach data extent:
> - preallocate a data extent on behalf of the fake inode
> - copy over the data
> - foreach extent
> - swap the refs so that the original file extent now refers to the new
> extent item
> - drop the fake root, dropping its refs on the old extents, which lets
> us delete them.
>
> Done naively, this results in storing an extent item in the extent tree
> whose owner_ref points at the relocation data root and a no-op squota
> recording, since the reloc root is not a legit fstree. So far, that's
> OK. The problem comes when you do the swap, and leave an extent item
> owned by this bogus root as the real permanent extents of the file. If
> the file then drops that ref, we free it and no-op account that against
> the fake relocation root. Essentially, this means that relocation is
> simple quota "extent laundering", since we re-own the extents into a
> fake root.
>
> Simple quotas very intentionally doesn't have a mechanism for
> transferring ownership of extents, as that is exactly the complicated
> thing we are trying to avoid with the new design. Further, it cannot be
> correctly done in this case, since at the time you create the new
> "real" refs, there is no way to know which was the original owner before
> relocation unless we track it.
>
> Therefore, it makes more sense to trick the preallocation to handle
> relocation as a special case and note the proper owner ref from the
> beginning. That way, we never write out an extent item without the
> correct owner ref that it will eventually have.
>
> This could be done by wiring a special root parameter all the way
> through the allocation code path, but to avoid that special case
> touching all the code, take advantage of the serial nature of relocation
> to store the src root on the relocation root object. Then when we finish
> the prealloc, if it happens to be this case, prepare the delayed ref
> appropriately.
>
> This is obviously a smelly bit of code, but I think it is the best
> solution to the problem, given the relocation implementation.
>
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
> fs/btrfs/ctree.h | 1 +
> fs/btrfs/extent-tree.c | 13 +++++++------
> fs/btrfs/relocation.c | 15 +++++++++++++++
> 3 files changed, 23 insertions(+), 6 deletions(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index f2d2b313bde5..577186994188 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -333,6 +333,7 @@ struct btrfs_root {
> #ifdef CONFIG_BTRFS_DEBUG
> struct list_head leak_list;
> #endif
> + u64 relocation_src_root;
> };
>
> static inline bool btrfs_root_readonly(const struct btrfs_root *root)
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 99845a54e168..10e026d5b684 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -57,7 +57,7 @@ static void __run_delayed_extent_op(struct btrfs_delayed_extent_op *extent_op,
> static int alloc_reserved_file_extent(struct btrfs_trans_handle *trans,
> u64 parent, u64 root_objectid,
> u64 flags, u64 owner, u64 offset,
> - struct btrfs_key *ins, int ref_mod);
> + struct btrfs_key *ins, int ref_mod, u64 oref_root);
> static int alloc_reserved_tree_block(struct btrfs_trans_handle *trans,
> struct btrfs_delayed_ref_node *node,
> struct btrfs_delayed_extent_op *extent_op);
> @@ -1541,7 +1541,7 @@ static int run_delayed_data_ref(struct btrfs_trans_handle *trans,
> ret = alloc_reserved_file_extent(trans, parent, ref_root,
> flags, ref->objectid,
> ref->offset, &ins,
> - node->ref_mod);
> + node->ref_mod, href->owning_root);
> if (!ret)
> ret = btrfs_record_simple_quota_delta(trans->fs_info, &delta);
> } else if (node->action == BTRFS_ADD_DELAYED_REF) {
> @@ -4683,7 +4683,7 @@ static int alloc_reserved_extent(struct btrfs_trans_handle *trans, u64 bytenr,
> static int alloc_reserved_file_extent(struct btrfs_trans_handle *trans,
> u64 parent, u64 root_objectid,
> u64 flags, u64 owner, u64 offset,
> - struct btrfs_key *ins, int ref_mod)
> + struct btrfs_key *ins, int ref_mod, u64 oref_root)
> {
> struct btrfs_fs_info *fs_info = trans->fs_info;
> struct btrfs_root *extent_root;
> @@ -4731,7 +4731,7 @@ static int alloc_reserved_file_extent(struct btrfs_trans_handle *trans,
> if (simple_quota) {
> btrfs_set_extent_inline_ref_type(leaf, iref, BTRFS_EXTENT_OWNER_REF_KEY);
> oref = (struct btrfs_extent_owner_ref *)(&iref->offset);
> - btrfs_set_extent_owner_ref_root_id(leaf, oref, root_objectid);
> + btrfs_set_extent_owner_ref_root_id(leaf, oref, oref_root);
> iref = (struct btrfs_extent_inline_ref *)(oref + 1);
> }
> btrfs_set_extent_inline_ref_type(leaf, iref, type);
> @@ -4842,7 +4842,8 @@ int btrfs_alloc_reserved_file_extent(struct btrfs_trans_handle *trans,
>
> BUG_ON(root_objectid == BTRFS_TREE_LOG_OBJECTID);
>
> - BUG_ON(root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID);
> + if (btrfs_is_data_reloc_root(root) && is_fstree(root->relocation_src_root))
> + owning_root = root->relocation_src_root;
>
> btrfs_init_generic_ref(&generic_ref, BTRFS_ADD_DELAYED_EXTENT,
> ins->objectid, ins->offset, 0, owning_root);
> @@ -4899,7 +4900,7 @@ int btrfs_alloc_logged_file_extent(struct btrfs_trans_handle *trans,
> spin_unlock(&space_info->lock);
>
> ret = alloc_reserved_file_extent(trans, 0, root_objectid, 0, owner,
> - offset, ins, 1);
> + offset, ins, 1, root_objectid);
> if (ret)
> btrfs_pin_extent(trans, ins->objectid, ins->offset, 1);
> ret = btrfs_record_simple_quota_delta(fs_info, &delta);
> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> index 119f670538f7..e12377c818c0 100644
> --- a/fs/btrfs/relocation.c
> +++ b/fs/btrfs/relocation.c
> @@ -3665,6 +3665,21 @@ static noinline_for_stack int relocate_block_group(struct reloc_control *rc)
> struct btrfs_extent_item);
> flags = btrfs_extent_flags(path->nodes[0], ei);
>
> + /*
> + * If we are relocating a simple quota owned extent item, we need
> + * to note the owner on the reloc data root so that when we
> + * allocate the replacement item, we can attribute it to the
> + * correct eventual owner (rather than the reloc data root)
> + */
> + if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_SIMPLE) {
> + struct btrfs_root *root = BTRFS_I(rc->data_inode)->root;
> + u64 owning_root_id = btrfs_get_extent_owner_root(fs_info,
> + path->nodes[0],
> + path->slots[0]);
> +
> + root->relocation_src_root = owning_root_id;
> + }
> +
This is almost correct but can mess up if we have adjacent extents that are
owned by different roots. If you look further down we move the extents via
relocate_data_extent(), which will cluster together ranges to limit the number
of preallocations you do. You're going to have to add a check in there for the
owning_root_id != root->relocation_src_root, do the prealloc, and then set
root->relocation_src_root in there and carry on. Thanks,
Josef
prev parent reply other threads:[~2023-07-13 17:38 UTC|newest]
Thread overview: 41+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-07-05 23:20 [PATCH 00/18] btrfs: simple quotas Boris Burkov
2023-07-05 23:20 ` [PATCH 01/18] btrfs: free qgroup rsv on io failure Boris Burkov
2023-07-13 14:01 ` Josef Bacik
2023-07-05 23:20 ` [PATCH 02/18] btrfs: fix start transaction qgroup rsv double free Boris Burkov
2023-07-13 14:02 ` Josef Bacik
2023-07-05 23:20 ` [PATCH 03/18] btrfs: introduce quota mode Boris Burkov
2023-07-13 14:02 ` Josef Bacik
2023-07-05 23:20 ` [PATCH 04/18] btrfs: add new quota mode for simple quotas Boris Burkov
2023-07-13 14:07 ` Josef Bacik
2023-07-05 23:20 ` [PATCH 05/18] btrfs: expose quota mode via sysfs Boris Burkov
2023-07-13 14:11 ` Josef Bacik
2023-07-05 23:20 ` [PATCH 06/18] btrfs: flush reservations during quota disable Boris Burkov
2023-07-13 14:20 ` Josef Bacik
2023-07-05 23:20 ` [PATCH 07/18] btrfs: create qgroup earlier in snapshot creation Boris Burkov
2023-07-13 14:26 ` Josef Bacik
2023-07-13 19:00 ` Boris Burkov
2023-07-13 20:37 ` Josef Bacik
2023-07-13 23:13 ` Boris Burkov
2023-07-05 23:20 ` [PATCH 08/18] btrfs: function for recording simple quota deltas Boris Burkov
2023-07-13 14:34 ` Josef Bacik
2023-07-05 23:20 ` [PATCH 09/18] btrfs: rename tree_ref and data_ref owning_root Boris Burkov
2023-07-13 16:33 ` Josef Bacik
2023-07-05 23:20 ` [PATCH 10/18] btrfs: track owning root in btrfs_ref Boris Burkov
2023-07-13 16:58 ` Josef Bacik
2023-07-13 21:21 ` Boris Burkov
2023-07-05 23:20 ` [PATCH 11/18] btrfs: track original extent owner in head_ref Boris Burkov
2023-07-13 17:09 ` Josef Bacik
2023-07-05 23:20 ` [PATCH 12/18] btrfs: new inline ref storing owning subvol of data extents Boris Burkov
2023-07-13 17:16 ` Josef Bacik
2023-07-05 23:20 ` [PATCH 13/18] btrfs: inline owner ref lookup helper Boris Burkov
2023-07-13 17:18 ` Josef Bacik
2023-07-05 23:20 ` [PATCH 14/18] btrfs: record simple quota deltas Boris Burkov
2023-07-13 17:23 ` Josef Bacik
2023-07-05 23:20 ` [PATCH 15/18] btrfs: simple quota auto hierarchy for nested subvols Boris Burkov
2023-07-13 17:28 ` Josef Bacik
2023-07-05 23:20 ` [PATCH 16/18] btrfs: check generation when recording simple quota delta Boris Burkov
2023-07-13 17:29 ` Josef Bacik
2023-07-05 23:20 ` [PATCH 17/18] btrfs: track metadata relocation cow with simple quota Boris Burkov
2023-07-13 17:31 ` Josef Bacik
2023-07-05 23:20 ` [PATCH 18/18] btrfs: track data relocation " Boris Burkov
2023-07-13 17:37 ` Josef Bacik [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20230713173752.GR207541@perftesting \
--to=josef@toxicpanda.com \
--cc=boris@bur.io \
--cc=kernel-team@fb.com \
--cc=linux-btrfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox