From: Josef Bacik <josef@toxicpanda.com>
To: Boris Burkov <boris@bur.io>
Cc: linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH 18/18] btrfs: track data relocation with simple quota
Date: Thu, 13 Jul 2023 13:37:52 -0400 [thread overview]
Message-ID: <20230713173752.GR207541@perftesting> (raw)
In-Reply-To: <d9e6a4525095ec5abb1818547d565fcf3ef58460.1688597211.git.boris@bur.io>
On Wed, Jul 05, 2023 at 04:20:55PM -0700, Boris Burkov wrote:
> Relocation data allocations are quite tricky for simple quotas. The
> basic data relocation sequence is (ignoring details that aren't relevant
> to this fix):
> - create a fake relocation data fs root
> - create a fake relocation inode in that root
> - foreach data extent:
> - preallocate a data extent on behalf of the fake inode
> - copy over the data
> - foreach extent
> - swap the refs so that the original file extent now refers to the new
> extent item
> - drop the fake root, dropping its refs on the old extents, which lets
> us delete them.
>
> Done naively, this results in storing an extent item in the extent tree
> whose owner_ref points at the relocation data root and a no-op squota
> recording, since the reloc root is not a legit fstree. So far, that's
> OK. The problem comes when you do the swap, and leave an extent item
> owned by this bogus root as the real permanent extents of the file. If
> the file then drops that ref, we free it and no-op account that against
> the fake relocation root. Essentially, this means that relocation is
> simple quota "extent laundering", since we re-own the extents into a
> fake root.
>
> Simple quotas very intentionally doesn't have a mechanism for
> transferring ownership of extents, as that is exactly the complicated
> thing we are trying to avoid with the new design. Further, it cannot be
> correctly done in this case, since at the time you create the new
> "real" refs, there is no way to know which was the original owner before
> relocation unless we track it.
>
> Therefore, it makes more sense to trick the preallocation to handle
> relocation as a special case and note the proper owner ref from the
> beginning. That way, we never write out an extent item without the
> correct owner ref that it will eventually have.
>
> This could be done by wiring a special root parameter all the way
> through the allocation code path, but to avoid that special case
> touching all the code, take advantage of the serial nature of relocation
> to store the src root on the relocation root object. Then when we finish
> the prealloc, if it happens to be this case, prepare the delayed ref
> appropriately.
>
> This is obviously a smelly bit of code, but I think it is the best
> solution to the problem, given the relocation implementation.
>
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
> fs/btrfs/ctree.h | 1 +
> fs/btrfs/extent-tree.c | 13 +++++++------
> fs/btrfs/relocation.c | 15 +++++++++++++++
> 3 files changed, 23 insertions(+), 6 deletions(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index f2d2b313bde5..577186994188 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -333,6 +333,7 @@ struct btrfs_root {
> #ifdef CONFIG_BTRFS_DEBUG
> struct list_head leak_list;
> #endif
> + u64 relocation_src_root;
> };
>
> static inline bool btrfs_root_readonly(const struct btrfs_root *root)
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 99845a54e168..10e026d5b684 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -57,7 +57,7 @@ static void __run_delayed_extent_op(struct btrfs_delayed_extent_op *extent_op,
> static int alloc_reserved_file_extent(struct btrfs_trans_handle *trans,
> u64 parent, u64 root_objectid,
> u64 flags, u64 owner, u64 offset,
> - struct btrfs_key *ins, int ref_mod);
> + struct btrfs_key *ins, int ref_mod, u64 oref_root);
> static int alloc_reserved_tree_block(struct btrfs_trans_handle *trans,
> struct btrfs_delayed_ref_node *node,
> struct btrfs_delayed_extent_op *extent_op);
> @@ -1541,7 +1541,7 @@ static int run_delayed_data_ref(struct btrfs_trans_handle *trans,
> ret = alloc_reserved_file_extent(trans, parent, ref_root,
> flags, ref->objectid,
> ref->offset, &ins,
> - node->ref_mod);
> + node->ref_mod, href->owning_root);
> if (!ret)
> ret = btrfs_record_simple_quota_delta(trans->fs_info, &delta);
> } else if (node->action == BTRFS_ADD_DELAYED_REF) {
> @@ -4683,7 +4683,7 @@ static int alloc_reserved_extent(struct btrfs_trans_handle *trans, u64 bytenr,
> static int alloc_reserved_file_extent(struct btrfs_trans_handle *trans,
> u64 parent, u64 root_objectid,
> u64 flags, u64 owner, u64 offset,
> - struct btrfs_key *ins, int ref_mod)
> + struct btrfs_key *ins, int ref_mod, u64 oref_root)
> {
> struct btrfs_fs_info *fs_info = trans->fs_info;
> struct btrfs_root *extent_root;
> @@ -4731,7 +4731,7 @@ static int alloc_reserved_file_extent(struct btrfs_trans_handle *trans,
> if (simple_quota) {
> btrfs_set_extent_inline_ref_type(leaf, iref, BTRFS_EXTENT_OWNER_REF_KEY);
> oref = (struct btrfs_extent_owner_ref *)(&iref->offset);
> - btrfs_set_extent_owner_ref_root_id(leaf, oref, root_objectid);
> + btrfs_set_extent_owner_ref_root_id(leaf, oref, oref_root);
> iref = (struct btrfs_extent_inline_ref *)(oref + 1);
> }
> btrfs_set_extent_inline_ref_type(leaf, iref, type);
> @@ -4842,7 +4842,8 @@ int btrfs_alloc_reserved_file_extent(struct btrfs_trans_handle *trans,
>
> BUG_ON(root_objectid == BTRFS_TREE_LOG_OBJECTID);
>
> - BUG_ON(root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID);
> + if (btrfs_is_data_reloc_root(root) && is_fstree(root->relocation_src_root))
> + owning_root = root->relocation_src_root;
>
> btrfs_init_generic_ref(&generic_ref, BTRFS_ADD_DELAYED_EXTENT,
> ins->objectid, ins->offset, 0, owning_root);
> @@ -4899,7 +4900,7 @@ int btrfs_alloc_logged_file_extent(struct btrfs_trans_handle *trans,
> spin_unlock(&space_info->lock);
>
> ret = alloc_reserved_file_extent(trans, 0, root_objectid, 0, owner,
> - offset, ins, 1);
> + offset, ins, 1, root_objectid);
> if (ret)
> btrfs_pin_extent(trans, ins->objectid, ins->offset, 1);
> ret = btrfs_record_simple_quota_delta(fs_info, &delta);
> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> index 119f670538f7..e12377c818c0 100644
> --- a/fs/btrfs/relocation.c
> +++ b/fs/btrfs/relocation.c
> @@ -3665,6 +3665,21 @@ static noinline_for_stack int relocate_block_group(struct reloc_control *rc)
> struct btrfs_extent_item);
> flags = btrfs_extent_flags(path->nodes[0], ei);
>
> + /*
> + * If we are relocating a simple quota owned extent item, we need
> + * to note the owner on the reloc data root so that when we
> + * allocate the replacement item, we can attribute it to the
> + * correct eventual owner (rather than the reloc data root)
> + */
> + if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_SIMPLE) {
> + struct btrfs_root *root = BTRFS_I(rc->data_inode)->root;
> + u64 owning_root_id = btrfs_get_extent_owner_root(fs_info,
> + path->nodes[0],
> + path->slots[0]);
> +
> + root->relocation_src_root = owning_root_id;
> + }
> +
This is almost correct but can mess up if we have adjacent extents that are
owned by different roots. If you look further down we move the extents via
relocate_data_extent(), which will cluster together ranges to limit the number
of preallocations you do. You're going to have to add a check in there for the
owning_root_id != root->relocation_src_root, do the prealloc, and then set
root->relocation_src_root in there and carry on. Thanks,
Josef
prev parent reply other threads:[~2023-07-13 17:38 UTC|newest]
Thread overview: 41+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-07-05 23:20 [PATCH 00/18] btrfs: simple quotas Boris Burkov
2023-07-05 23:20 ` [PATCH 01/18] btrfs: free qgroup rsv on io failure Boris Burkov
2023-07-13 14:01 ` Josef Bacik
2023-07-05 23:20 ` [PATCH 02/18] btrfs: fix start transaction qgroup rsv double free Boris Burkov
2023-07-13 14:02 ` Josef Bacik
2023-07-05 23:20 ` [PATCH 03/18] btrfs: introduce quota mode Boris Burkov
2023-07-13 14:02 ` Josef Bacik
2023-07-05 23:20 ` [PATCH 04/18] btrfs: add new quota mode for simple quotas Boris Burkov
2023-07-13 14:07 ` Josef Bacik
2023-07-05 23:20 ` [PATCH 05/18] btrfs: expose quota mode via sysfs Boris Burkov
2023-07-13 14:11 ` Josef Bacik
2023-07-05 23:20 ` [PATCH 06/18] btrfs: flush reservations during quota disable Boris Burkov
2023-07-13 14:20 ` Josef Bacik
2023-07-05 23:20 ` [PATCH 07/18] btrfs: create qgroup earlier in snapshot creation Boris Burkov
2023-07-13 14:26 ` Josef Bacik
2023-07-13 19:00 ` Boris Burkov
2023-07-13 20:37 ` Josef Bacik
2023-07-13 23:13 ` Boris Burkov
2023-07-05 23:20 ` [PATCH 08/18] btrfs: function for recording simple quota deltas Boris Burkov
2023-07-13 14:34 ` Josef Bacik
2023-07-05 23:20 ` [PATCH 09/18] btrfs: rename tree_ref and data_ref owning_root Boris Burkov
2023-07-13 16:33 ` Josef Bacik
2023-07-05 23:20 ` [PATCH 10/18] btrfs: track owning root in btrfs_ref Boris Burkov
2023-07-13 16:58 ` Josef Bacik
2023-07-13 21:21 ` Boris Burkov
2023-07-05 23:20 ` [PATCH 11/18] btrfs: track original extent owner in head_ref Boris Burkov
2023-07-13 17:09 ` Josef Bacik
2023-07-05 23:20 ` [PATCH 12/18] btrfs: new inline ref storing owning subvol of data extents Boris Burkov
2023-07-13 17:16 ` Josef Bacik
2023-07-05 23:20 ` [PATCH 13/18] btrfs: inline owner ref lookup helper Boris Burkov
2023-07-13 17:18 ` Josef Bacik
2023-07-05 23:20 ` [PATCH 14/18] btrfs: record simple quota deltas Boris Burkov
2023-07-13 17:23 ` Josef Bacik
2023-07-05 23:20 ` [PATCH 15/18] btrfs: simple quota auto hierarchy for nested subvols Boris Burkov
2023-07-13 17:28 ` Josef Bacik
2023-07-05 23:20 ` [PATCH 16/18] btrfs: check generation when recording simple quota delta Boris Burkov
2023-07-13 17:29 ` Josef Bacik
2023-07-05 23:20 ` [PATCH 17/18] btrfs: track metadata relocation cow with simple quota Boris Burkov
2023-07-13 17:31 ` Josef Bacik
2023-07-05 23:20 ` [PATCH 18/18] btrfs: track data relocation " Boris Burkov
2023-07-13 17:37 ` Josef Bacik [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20230713173752.GR207541@perftesting \
--to=josef@toxicpanda.com \
--cc=boris@bur.io \
--cc=kernel-team@fb.com \
--cc=linux-btrfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.